Inside the Cloudflare Outage (Nov 18, 2025)
I read about the Cloudflare outage of Nov 18, 2025, and it is a clean case study in how small changes cascade through distributed systems. This is a short breakdown; for the full postmortem, see the official write-up on the Cloudflare blog.
What happened
On Nov 18 at about 11:20 UTC, Cloudflare started returning HTTP 500 error pages to users trying to reach sites behind their network.
This was not a cyberattack. The trigger was a change to database permissions. A small access control change caused the system that generates a feature file for Bot Management to output a file that was much larger than expected. That oversized file got distributed across the fleet, but the Bot Management module had a hard size limit. When it hit the limit, the module panicked and traffic that depended on it started failing with 5xx errors.
Why it went wrong
Here is the chain in simple terms:
- Bot Management module: Cloudflare's proxies run a Bot Management module that relies on a feature file for its ML-based bot detection.
- Feature file generation: The file is generated every few minutes by their ClickHouse cluster and then distributed to proxy machines.
- Unexpected data growth: The permissions change altered metadata visibility in ClickHouse, so a query that didn't filter by database returned duplicate rows. That inflated the feature count.
- Memory limit hit: The module had a pre-allocated limit of about 200 features. Normal files were around 60. When the file exceeded the limit, the Rust code panicked and requests started failing.
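The duplication step above can be sketched in a few lines. This is a toy model with names I made up, not Cloudflare's schema: the idea is simply that metadata listing each column under two databases doubles the row count returned by an unfiltered query.

```rust
// Toy model of the feature-count inflation. Assumed names
// ("default", "r0", feature_rows) are illustrative, not Cloudflare's.
fn feature_rows(databases: &[&str], columns: usize) -> usize {
    // One metadata row per (database, column) pair, as an unfiltered
    // query over system tables would return.
    databases.len() * columns
}

fn main() {
    // Before the change: only one database was visible -> ~60 features.
    assert_eq!(feature_rows(&["default"], 60), 60);

    // After: a second database became visible too, so the same query
    // returned every feature twice. In production the real file grew
    // past the module's 200-feature limit.
    assert_eq!(feature_rows(&["default", "r0"], 60), 120);
}
```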
The source check in their FL2 Rust code enforced the feature limit by returning an error (the snippet is in the official write-up).
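A minimal sketch of the failure mode, with my own names rather than the actual FL2 source: a hard limit enforced by returning an `Err`, which a caller then `unwrap()`s, turning a recoverable error into a thread panic.

```rust
use std::panic;

// Pre-allocated capacity, per the postmortem. Function and variable
// names below are illustrative, not Cloudflare's actual code.
const FEATURE_LIMIT: usize = 200;

// The loader refuses oversized files by returning an Err...
fn load_features(count: usize) -> Result<usize, String> {
    if count > FEATURE_LIMIT {
        return Err(format!("feature count {count} exceeds limit {FEATURE_LIMIT}"));
    }
    Ok(count)
}

fn main() {
    // A normal file (~60 features) loads fine.
    assert_eq!(load_features(60), Ok(60));

    // ...but a caller unwrapped that Result, so the oversized file
    // crashed the worker thread instead of being handled gracefully.
    let outcome = panic::catch_unwind(|| load_features(300).unwrap());
    assert!(outcome.is_err()); // the thread panicked
}
```

The design lesson is that the limit check itself was correct; the `unwrap()` at the call site is what converted a bad input into a fleet-wide crash.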
That led to this panic:

`thread fl2_worker_thread panicked: called Result::unwrap() on an Err value`

Once that core module failed, dependent services were affected too: CDN, security services, Workers KV, Access, dashboard login, and more. Early symptoms looked like a massive DDoS because error rates spiked and their independently hosted status page happened to go down at the same time, but the root cause was internal.
Timeline in short
| Time (UTC) | Event |
|---|---|
| 11:05 | Database access control change deployed |
| ~11:20 | First errors seen |
| 13:05 | Mitigation kicks in (Workers KV and Access bypass) |
| 14:30 | Main fix deployed globally (known-good feature file reinserted) |
| 17:06 | All services fully restored |
Lessons learned
- Powerful systems still fail: small changes can trigger unexpected data growth and hit hard limits
- Monitoring matters: the system bouncing between old and new files made the root cause harder to spot
- Dependency chains are risky: one module failing can drag down many others
- High uptime is expensive: moving from 99.99% to 99.999% needs serious investment
- Start small, plan growth: find the sweet spot between "works now" and "scales later"
- Admit mistakes: Cloudflare called this their worst outage since 2019.
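The monitoring lesson above is worth a sketch. In my simplified model (not Cloudflare's code), each regeneration run produced a good or bad file depending on which ClickHouse nodes the query happened to hit, so the fleet oscillated between healthy and failing instead of failing cleanly:

```rust
const LIMIT: usize = 200; // the module's pre-allocated feature limit

// A run against an already-updated node returned duplicated metadata
// and an oversized file; a not-yet-updated node returned a normal one.
fn file_size(node_updated: bool) -> usize {
    if node_updated { 240 } else { 60 }
}

fn fleet_state(node_updated: bool) -> &'static str {
    if file_size(node_updated) > LIMIT { "5xx errors" } else { "healthy" }
}

fn main() {
    // Successive generation runs hit a mix of updated and
    // not-yet-updated nodes, so symptoms came and went.
    let runs = [false, true, false, true, true];
    let states: Vec<&str> = runs.iter().map(|&u| fleet_state(u)).collect();
    assert_eq!(
        states,
        ["healthy", "5xx errors", "healthy", "5xx errors", "5xx errors"]
    );
}
```

That flapping is exactly why the root cause was harder to spot: a steady failure points at the last deploy, while an oscillating one looks like an external attack.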
Final note: most large outages come down to unexpected data growth, broken assumptions, or one module dragging others down.
Adios.