Inside the Cloudflare Outage (Nov 18, 2025)
I read about the Cloudflare outage of Nov 18, 2025, and it is a clean case study in how small changes cascade through distributed systems. This is a short breakdown; for the full postmortem, see the official write-up on the Cloudflare blog.
What happened
On Nov 18 at about 11:20 UTC, Cloudflare started returning HTTP 500 error pages to users trying to reach sites behind their network.
This was not a cyberattack. The trigger was a change to database permissions. A small access control change caused the system that generates a feature file for Bot Management to output a file that was much larger than expected. That oversized file got distributed across the fleet, but the Bot Management module had a hard size limit. When it hit the limit, the module panicked and traffic that depended on it started failing with 5xx errors.
Why it went wrong
Here is the chain in simple terms:
- Bot Management module: Cloudflare's proxies run a Bot Management module that relies on a feature file for its ML-based bot detection.
- Feature file generation: The file is generated every few minutes by their ClickHouse cluster and then distributed to proxy machines.
- Unexpected data growth: The permissions change altered metadata visibility in ClickHouse, so a query that didn't filter by database returned duplicate rows. That inflated the feature count.
- Memory limit hit: The module had a pre-allocated limit of about 200 features. Normal files were around 60. When the file exceeded the limit, the Rust code panicked and requests started failing.
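The duplication step above can be sketched in a few lines. This is a toy model with names I made up, not Cloudflare's schema: the idea is simply that metadata listing each column under two databases doubles the row count returned by an unfiltered query.

```rust
// Toy model of the feature-count inflation. Assumed names
// ("default", "r0", feature_rows) are illustrative, not Cloudflare's.
fn feature_rows(databases: &[&str], columns: usize) -> usize {
    // One metadata row per (database, column) pair, as an unfiltered
    // query over system tables would return.
    databases.len() * columns
}

fn main() {
    // Before the change: only one database was visible -> ~60 features.
    assert_eq!(feature_rows(&["default"], 60), 60);

    // After: a second database became visible too, so the same query
    // returned every feature twice. In production the real file grew
    // past the module's 200-feature limit.
    assert_eq!(feature_rows(&["default", "r0"], 60), 120);
}
```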
The source check in their FL2 Rust code enforced the feature limit by returning an error (the snippet is in the official write-up).
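A minimal sketch of the failure mode, with my own names rather than the actual FL2 source: a hard limit enforced by returning an `Err`, which a caller then `unwrap()`s, turning a recoverable error into a thread panic.

```rust
use std::panic;

// Pre-allocated capacity, per the postmortem. Function and variable
// names below are illustrative, not Cloudflare's actual code.
const FEATURE_LIMIT: usize = 200;

// The loader refuses oversized files by returning an Err...
fn load_features(count: usize) -> Result<usize, String> {
    if count > FEATURE_LIMIT {
        return Err(format!("feature count {count} exceeds limit {FEATURE_LIMIT}"));
    }
    Ok(count)
}

fn main() {
    // A normal file (~60 features) loads fine.
    assert_eq!(load_features(60), Ok(60));

    // ...but a caller unwrapped that Result, so the oversized file
    // crashed the worker thread instead of being handled gracefully.
    let outcome = panic::catch_unwind(|| load_features(300).unwrap());
    assert!(outcome.is_err()); // the thread panicked
}
```

The design lesson is that the limit check itself was correct; the `unwrap()` at the call site is what converted a bad input into a fleet-wide crash.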
That led to this panic:

`thread fl2_worker_thread panicked: called Result::unwrap() on an Err value`

Once that core module failed, dependent services were affected too: CDN, security services, Workers KV, Access, dashboard login, and more. Early symptoms looked like a massive DDoS because error rates spiked and their independently hosted status page happened to go down at the same time, but the root cause was internal.
Timeline in short
| Time (UTC) | Event |
|---|---|
| 11:05 | Database access control change deployed |
| ~11:20 | First errors seen |
| 13:05 | Mitigation kicks in (Workers KV and Access bypass) |
| 14:30 | Main fix deployed globally (known-good feature file reinserted) |
| 17:06 | All services fully restored |
Lessons learned
- Powerful systems still fail: small changes can trigger unexpected data growth and hit hard limits
- Monitoring matters: the system bouncing between old and new files made the root cause harder to spot
- Dependency chains are risky: one module failing can drag down many others
- High uptime is expensive: moving from 99.99% to 99.999% needs serious investment
- Start small, plan growth: find the sweet spot between "works now" and "scales later"
- Admit mistakes: Cloudflare called this their worst outage since 2019.
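The monitoring lesson above is worth a sketch. In my simplified model (not Cloudflare's code), each regeneration run produced a good or bad file depending on which ClickHouse nodes the query happened to hit, so the fleet oscillated between healthy and failing instead of failing cleanly:

```rust
const LIMIT: usize = 200; // the module's pre-allocated feature limit

// A run against an already-updated node returned duplicated metadata
// and an oversized file; a not-yet-updated node returned a normal one.
fn file_size(node_updated: bool) -> usize {
    if node_updated { 240 } else { 60 }
}

fn fleet_state(node_updated: bool) -> &'static str {
    if file_size(node_updated) > LIMIT { "5xx errors" } else { "healthy" }
}

fn main() {
    // Successive generation runs hit a mix of updated and
    // not-yet-updated nodes, so symptoms came and went.
    let runs = [false, true, false, true, true];
    let states: Vec<&str> = runs.iter().map(|&u| fleet_state(u)).collect();
    assert_eq!(
        states,
        ["healthy", "5xx errors", "healthy", "5xx errors", "5xx errors"]
    );
}
```

That flapping is exactly why the root cause was harder to spot: a steady failure points at the last deploy, while an oscillating one looks like an external attack.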
Final note: most large outages come down to unexpected data growth, broken assumptions, or one module dragging others down.
Adios.