
Cloudflare’s proxy service has limits to prevent excessive memory consumption, with the bot management system having “a limit on the number of machine learning features that can be used at runtime.” This limit is 200, well above the actual number of features used.
“When the bad file with more than 200 features was propagated to our servers, this limit was hit—resulting in the system panicking” and outputting errors, Prince wrote.
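As a rough illustration of that failure mode, here is a minimal Rust sketch with entirely hypothetical names (not Cloudflare’s actual code): a module preallocates space for a fixed number of features and panics when a configuration file exceeds the cap instead of rejecting it.

```rust
// Hypothetical sketch, not Cloudflare's actual code: a module that
// preallocates a fixed-size table for machine learning features and
// panics when a configuration file carries more entries than the cap.

const MAX_FEATURES: usize = 200; // hard runtime limit, well above normal usage

struct FeatureConfig {
    names: Vec<String>,
}

fn load_features(config: &FeatureConfig) -> [Option<&str>; MAX_FEATURES] {
    let mut table: [Option<&str>; MAX_FEATURES] = [None; MAX_FEATURES];
    for (i, name) in config.names.iter().enumerate() {
        // If the file ships more than MAX_FEATURES entries, this index is
        // out of bounds and the thread panics rather than failing gracefully,
        // which is roughly the failure mode described above.
        table[i] = Some(name.as_str());
    }
    table
}

fn main() {
    // A "bad" file with duplicated feature rows blows past the cap.
    let bad = FeatureConfig {
        names: (0..300).map(|i| format!("feature_{i}")).collect(),
    };
    let _ = load_features(&bad); // panics: index out of bounds
}
```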
Worst Cloudflare outage since 2019
The number of 5xx error HTTP status codes served by the Cloudflare network is generally “very low” but soared after the bad file spread across the network. “The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file,” Prince wrote. “What was notable is that our system would then recover for a period. This was very unusual behavior for an internal error.”
This unusual behavior was explained by the fact “that the file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management,” Prince wrote. “Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.”
This fluctuation initially “led us to believe this could be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state,” he wrote.
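To make those dynamics concrete, the following sketch (assumed node count and rollout schedule, not Cloudflare’s systems) simulates a file being rebuilt every five minutes by a query that lands on an arbitrary node of a partially upgraded cluster: the chance of a bad file tracks the fraction of upgraded nodes, and once every node is upgraded the output is bad every time.

```rust
// Hypothetical simulation of the dynamics above (assumed node count and
// rollout schedule): every five minutes the feature file is rebuilt by a
// query that happens to run on one node of a partially upgraded cluster.
// Upgraded nodes surface extra metadata rows, so the output flips between
// good and bad until the rollout finishes, then stays bad.

#[derive(Debug)]
enum GeneratedFile {
    Good,       // within the 200-feature limit
    Bad(usize), // too many features, triggers the downstream panic
}

fn generate_file(node_is_upgraded: bool) -> GeneratedFile {
    if node_is_upgraded {
        GeneratedFile::Bad(300) // duplicate rows inflate the feature count
    } else {
        GeneratedFile::Good
    }
}

fn main() {
    let total_nodes = 8usize;
    let mut upgraded_nodes = 0usize; // gradual rollout, one node at a time
    let mut seed: u64 = 42;

    for tick in 0..16u64 {
        if tick % 2 == 0 && upgraded_nodes < total_nodes {
            upgraded_nodes += 1;
        }
        // A tiny linear congruential generator stands in for "the query
        // happens to run on some node of the cluster this cycle".
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let node = (seed >> 33) as usize % total_nodes;
        // The chance of a bad file equals the fraction of nodes already upgraded.
        let file = generate_file(node < upgraded_nodes);
        println!("t+{:3} min: node {} -> {:?}", tick * 5, node, file);
    }
}
```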
Prince said that Cloudflare “solved the issue by stopping the generation and propagation of the bad feature file and manually inserting a known good file into the feature file distribution queue,” and then “forcing a restart of our core proxy.” The team then worked on “restarting remaining services that had entered a bad state” until the 5xx error code volume returned to normal later in the day.
Prince said the outage was Cloudflare’s worst since 2019 and that the firm is taking steps to protect against similar failures in the future. Cloudflare will work on “hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input; enabling more global kill switches for features; eliminating the ability for core dumps or other error reports to overwhelm system resources; [and] reviewing failure modes for error conditions across all core proxy modules,” according to Prince.
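A minimal sketch of the first two of those hardening ideas, using hypothetical types and names rather than Cloudflare’s real interfaces: the generated feature file is validated like untrusted input, a global kill switch can disable the module outright, and a rejected file falls back to the last known good configuration instead of crashing the proxy.

```rust
// Hypothetical sketch of two hardening ideas from the post-mortem:
// validate internally generated config like user input, and keep a
// kill switch plus a last-known-good fallback so a bad file degrades
// service rather than crashing the proxy. Names are illustrative only.

const MAX_FEATURES: usize = 200;

struct FeatureFile {
    names: Vec<String>,
}

#[derive(Debug)]
enum ConfigError {
    Empty,
    TooManyFeatures(usize),
}

fn validate(file: &FeatureFile) -> Result<(), ConfigError> {
    if file.names.is_empty() {
        return Err(ConfigError::Empty);
    }
    if file.names.len() > MAX_FEATURES {
        // Reject at ingestion time instead of panicking at load time.
        return Err(ConfigError::TooManyFeatures(file.names.len()));
    }
    Ok(())
}

fn apply_config(
    candidate: FeatureFile,
    last_known_good: FeatureFile,
    bot_management_enabled: bool, // global kill switch for the feature
) -> Option<FeatureFile> {
    if !bot_management_enabled {
        return None; // feature disabled entirely via the kill switch
    }
    match validate(&candidate) {
        Ok(()) => Some(candidate),
        Err(e) => {
            eprintln!("rejecting generated config: {:?}; keeping last known good", e);
            Some(last_known_good)
        }
    }
}

fn main() {
    let good = FeatureFile { names: (0..60).map(|i| format!("feature_{i}")).collect() };
    let bad = FeatureFile { names: (0..300).map(|i| format!("feature_{i}")).collect() };
    let active = apply_config(bad, good, true);
    println!("active feature count: {:?}", active.map(|f| f.names.len()));
}
```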
While Prince can’t promise that Cloudflare will never have another outage of the same scale, he said that previous outages have “always led to us building new, more resilient systems.”
