Cloudflare's Code Orange: Fail Small — A Stronger, More Resilient Network

Cloudflare recently completed a major engineering initiative known internally as Code Orange: Fail Small. Over the past two and a half quarters, the team focused on making the network more resilient, secure, and reliable for every customer. Resilience work never truly ends, but this phase specifically addresses the root causes of the global outages on November 18 and December 5, 2025. The project introduced safer configuration changes, tighter failure containment, revised emergency procedures, and improved customer communication during incidents. The most important aspects are broken down below in a Q&A format.

What is Code Orange: Fail Small?

Code Orange: Fail Small is a comprehensive engineering project that aimed to make Cloudflare's infrastructure more resilient by isolating failures and reducing their blast radius. The core idea is that when something goes wrong, whether a configuration mistake or a software bug, the impact stays small and does not cascade into a global outage. The project was completed earlier this month and directly addresses the failures that led to the November 18 and December 5, 2025 incidents.

Source: blog.cloudflare.com

Why was this project necessary?

The two global outages in late 2025 demonstrated that configuration changes could propagate too quickly and affect all customers simultaneously. Cloudflare needed a way to catch issues before they spread. The project focused on safer configuration changes, reducing the impact of failures, and strengthening incident management. Additionally, to maintain long-term reliability, the team introduced measures to prevent drift and regressions, ensuring that improvements stick over time.

How are configuration changes safer now?

Previously, many internal configuration changes reached the entire network instantly. Now, Cloudflare uses a health-mediated deployment methodology for all high-risk configuration pipelines. Changes are bundled into packages and released gradually, with real-time health monitoring at each step. If the observability tools detect a degradation, the system automatically rolls back the change before it spreads to the rest of the network. This is similar to how software releases are already managed, and it closes a critical gap for configuration. For more details, see the Snapstone section below.
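The gradual release and automatic rollback described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Cloudflare's implementation: the function `staged_rollout`, the `RolloutResult` type, the location names, and the simulated health probe are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    deployed: bool        # True only if every stage passed its health check
    reached: List[str]    # locations that received the new version at any point


def staged_rollout(
    active: Dict[str, str],
    new_version: str,
    stages: List[List[str]],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Release new_version one stage at a time, reverting on the first unhealthy signal.

    Assumes every location named in `stages` already has an entry in `active`.
    """
    previous = dict(active)           # snapshot taken before any change
    reached: List[str] = []
    for stage in stages:
        for location in stage:
            active[location] = new_version
            reached.append(location)
        if not all(is_healthy(location) for location in stage):
            # Roll back every location touched so far; later stages never see the change.
            for location in reached:
                active[location] = previous[location]
            return RolloutResult(deployed=False, reached=reached)
    return RolloutResult(deployed=True, reached=reached)


# Hypothetical fleet: one canary location, then two larger regions.
fleet = {"canary-1": "v1", "region-a": "v1", "region-b": "v1"}
stages = [["canary-1"], ["region-a", "region-b"]]

# Simulated health probe: anything running the bad version "v2" reports unhealthy.
result = staged_rollout(fleet, "v2", stages, lambda loc: fleet[loc] != "v2")
print(result.deployed, result.reached, fleet)
```

In this run the bad version only ever reaches the canary location. The health check fails there, the canary is reverted to its previous version, and the two larger regions are never touched, which is the "fail small" behavior the methodology is after.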

What is Snapstone and how does it work?

Snapstone is a new internal component built specifically to bring health-mediated deployment to configuration changes. It bundles configuration updates into a package and then progressively releases them across the network while monitoring key health metrics. Before Snapstone, applying this methodology to config was possible but required significant per-team effort and was inconsistently applied. Snapstone provides a unified way to automatically roll back any configuration change—whether it's a data file like the one involved in the November outage or a control flag from the December incident—if health signals indicate a problem. Its flexibility means teams can dynamically define any unit of configuration that needs protection.
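To make the idea of a configuration "package" more concrete, here is a minimal Python sketch in which one package can carry different kinds of configuration units (for example a data file reference and a control flag) and be reverted as a whole. Snapstone's real interfaces are not described in this article, so every name here (`ConfigPackage`, `ConfigStore`, `apply_guarded`, and the unit names) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

_ABSENT = object()  # marks a unit that did not exist before the package was applied


@dataclass(frozen=True)
class ConfigPackage:
    version: str
    units: Dict[str, Any]  # unit name -> new value; a unit can be a file reference, a flag, etc.


class ConfigStore:
    def __init__(self, initial: Dict[str, Any]) -> None:
        self.values = dict(initial)

    def apply(self, package: ConfigPackage) -> Dict[str, Any]:
        """Apply the package and return a snapshot of the prior values of the units it touched."""
        snapshot = {name: self.values.get(name, _ABSENT) for name in package.units}
        self.values.update(package.units)
        return snapshot

    def restore(self, snapshot: Dict[str, Any]) -> None:
        """Put every touched unit back to its prior state, removing units that were newly added."""
        for name, old in snapshot.items():
            if old is _ABSENT:
                self.values.pop(name, None)
            else:
                self.values[name] = old


def apply_guarded(
    store: ConfigStore,
    package: ConfigPackage,
    healthy: Callable[[Dict[str, Any]], bool],
) -> bool:
    """Apply a package, then revert it as a unit if the health signal is bad."""
    snapshot = store.apply(package)
    if healthy(store.values):
        return True
    store.restore(snapshot)
    return False


# Hypothetical units: a data file reference and a control flag, changed in one package.
store = ConfigStore({"feature_file": "features-v1.bin", "strict_mode": False})
package = ConfigPackage(
    version="2025-12-example",
    units={"feature_file": "features-v2.bin", "strict_mode": True, "new_limit": 200},
)

# Simulated health signal: pretend that enabling strict_mode degrades traffic.
kept = apply_guarded(store, package, lambda values: not values["strict_mode"])
print(kept, store.values)
```

Because the snapshot is taken per unit, the same rollback path works whether the package changes a file, a flag, or a unit that did not exist before. A real system would combine this with a progressive release rather than applying the package everywhere at once.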

How does this affect customers during an outage?

During an outage, Cloudflare now communicates more clearly and frequently with customers. The project revised break-glass procedures and incident management protocols to support faster, more transparent updates. The goal remains to prevent outages entirely, but when incidents do occur, customers will receive timely information about what happened, what steps are being taken, and when to expect resolution. This improved communication builds trust and helps customers respond appropriately.

Will this work prevent future regressions?

The Code Orange: Fail Small project includes measures to prevent drift and regressions over time. By institutionalizing health-mediated deployments, automated rollbacks, and stricter configuration pipelines, Cloudflare aims to ensure that resilience improvements are not lost. However, the team emphasizes that resilience work is never finished, and it remains a top priority throughout the development lifecycle. Ongoing monitoring and continuous improvement are intended to keep the network robust against new challenges.
