Three weeks, three collapses: the brutal wake-up call
On October 20, 2025, Amazon Web Services' US-EAST-1 region collapsed. Over 140 services affected. Billions of dollars in estimated losses within hours. Thousands of companies, from startups to the Fortune 500, paralyzed by an event their architecture had never anticipated, or more precisely, had chosen to ignore to save a few budget points.
Nine days later, Azure Front Door went down. Microsoft 365, Xbox, and tens of thousands of European businesses found themselves without email, without Teams, without SharePoint. The service supposedly "always available" proved it was not.
And for those who think these incidents are reserved for American hyperscalers, a local reminder: on July 23, 2025, a cyberattack against POST Luxembourg brought down critical services for an entire country. CGDIS, Lux-Airport, LU-Alert: the emergency infrastructure itself was compromised. When your national alert system goes down, it is no longer an IT incident. It is a sovereignty failure.
Three events. Three different architectures. The same conclusion: redundancy is not a luxury. It is the price of survival.
The economic equation nobody wants to hear
I know the objection by heart. It comes up in every board meeting, delivered with the quiet certainty of someone who has never lived through a major outage: "Redundancy is an unnecessary cost. Our cloud provider guarantees 99.99% availability."
Let us dissect that number. A 99.99% SLA means 52 minutes of permitted downtime per year. In theory. In practice, when US-EAST-1 goes down for several hours, your SLA only reimburses you a derisory percentage of your monthly bill. Not your lost revenue. Not your customer defections. Not the overtime hours of your teams in firefighting mode. Not the damage to your reputation.
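The arithmetic behind those availability tiers is worth making explicit. A minimal sketch of the conversion from an SLA percentage to a yearly downtime budget:

```python
# Convert an availability SLA into the downtime it actually permits per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year permitted by a given availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):6.1f} min/year")
```

At 99.99% the budget is roughly 52 minutes a year; a single multi-hour regional outage blows through it several times over.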
The equation is actually very simple: redundancy has a predictable cost. The absence of redundancy has an unpredictable price, and that price always reveals itself at the worst possible moment, when you no longer have time to react.
In my advisory engagements, I find that a 15 to 20% infrastructure surcharge for redundancy is enough to cover major failure scenarios. Compare that number to the losses from a full day without your production systems. For most companies I work with, a single hour of downtime costs more than the annual redundancy budget. The math takes five minutes. The decision should take just as long.
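That five-minute calculation can be sketched directly. All figures below are hypothetical placeholders, not client data; the structure of the comparison is the point:

```python
# Illustrative breakeven: annual redundancy surcharge vs. cost of one outage.
# Every number here is an assumption chosen for illustration.
annual_infra_budget = 1_000_000      # EUR, assumed baseline infrastructure spend
redundancy_rate = 0.20               # upper end of the 15-20% surcharge range
downtime_cost_per_hour = 250_000     # EUR, assumed revenue impact per hour down

redundancy_cost = annual_infra_budget * redundancy_rate
breakeven_hours = redundancy_cost / downtime_cost_per_hour
print(f"Redundancy pays for itself after {breakeven_hours:.1f} h of avoided downtime")
```

With these assumptions, less than one avoided hour of downtime covers the entire annual surcharge.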
The three levels of resilience: from minimum coverage to the bunker
Not all workloads deserve the same level of protection. The first mistake I see in the field is the "all or nothing" approach: either you make nothing redundant, or you want to replicate everything everywhere. Both are absurd. Here is the framework I systematically apply.
Medium criticality: multi-region as the baseline
For important systems that can tolerate a few minutes of interruption, the multi-region strategy within a single cloud provider offers an excellent cost-to-protection ratio. The principle: your workloads run in two geographically distinct regions, with automatic failover orchestrated by a global load balancer.
The key ingredients: asynchronous data replication, aggressive health checks (10-second intervals maximum), and above all quarterly resilience tests. A recovery plan that has never been tested is not a plan: it is a document.
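The failover decision a global load balancer makes from those health checks can be sketched in a few lines. Region names and the failure threshold below are illustrative assumptions, not a specific product's behavior:

```python
# Failover decision logic for a two-region setup behind a global load balancer.
# Region names and the threshold are illustrative assumptions.
REGIONS = ("eu-west", "eu-central")
FAILURE_THRESHOLD = 3  # consecutive failed probes before failing over

def next_active(active: str, recent_checks: list[bool]) -> str:
    """Given the latest probe results for `active` (True = healthy),
    return the region the load balancer should now route to."""
    recent = recent_checks[-FAILURE_THRESHOLD:]
    if len(recent) == FAILURE_THRESHOLD and not any(recent):
        # N consecutive failures: promote the standby region
        return next(r for r in REGIONS if r != active)
    return active
```

Requiring several consecutive failures before switching is the usual guard against flapping on a single transient probe timeout.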
The cost differential compared to a single-region architecture is between 15 and 25%, primarily from replicated storage and inter-region traffic. This is the minimum floor for any business application in production.
High criticality: orchestrated multi-cloud
When downtime is measured in hundreds of thousands of euros per hour, a single provider is no longer sufficient. The Azure Front Door incident demonstrated this: the outage did not affect a single region, but a global component of the Microsoft network. Being multi-region on Azure would have changed nothing.
Orchestrated multi-cloud consists of deploying your critical services across at least two cloud providers, with an abstraction layer that enables failover in under five minutes. In practice, this means containerizing your applications (Kubernetes is indispensable here), using databases with cross-cloud replication, and maintaining an Infrastructure as Code layer capable of provisioning on any provider.
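The abstraction layer mentioned above can be reduced to a small interface. This is a sketch under the assumption that each provider is wrapped behind a common contract (a real version would drive each cloud SDK or an Infrastructure as Code tool); the class and method names are invented for illustration:

```python
from abc import ABC, abstractmethod

class CloudTarget(ABC):
    """Thin abstraction over one provider; a real version wraps each SDK
    or drives an Infrastructure as Code layer behind this interface."""
    @abstractmethod
    def healthy(self) -> bool: ...
    @abstractmethod
    def route_traffic_here(self) -> None: ...

class StubTarget(CloudTarget):
    """In-memory stand-in so the failover logic can be exercised offline."""
    def __init__(self, name: str, up: bool = True):
        self.name, self.up, self.receiving = name, up, False
    def healthy(self) -> bool:
        return self.up
    def route_traffic_here(self) -> None:
        self.receiving = True

def fail_over(targets: list[CloudTarget]) -> CloudTarget:
    """Route traffic to the first healthy provider, in priority order."""
    for target in targets:
        if target.healthy():
            target.route_traffic_here()
            return target
    raise RuntimeError("no healthy provider left: total outage")
```

The point of the interface is that the failover logic never mentions a provider by name; the five-minute objective then depends only on how fast `route_traffic_here` propagates (DNS TTLs, anycast, or a traffic manager).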
The additional cost is significant: expect 30 to 50% operational overhead. But for systems whose downtime directly impacts revenue, the ratio remains strongly favorable.
Maximum criticality: sovereign hybrid
For systems whose failure endangers human safety or the continuity of a regulated activity, you need to go further. The attack on POST Luxembourg illustrates the risk: when a central actor goes down, the entire ecosystem that depends on it collapses.
Hybrid architecture combines on-premise infrastructure with cloud, but in a truly independent manner. The key components: DNS and routing that depend on neither environment, synchronous dual-write for critical data, and the ability to operate autonomously in degraded mode on each leg.
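The synchronous dual-write contract can be sketched minimally. This is a toy illustration of the acknowledgement rule only; a production system needs two-phase commit or an outbox pattern to avoid the half-written state the sketch deliberately exposes. All class names are invented:

```python
class MemoryStore:
    """Stand-in for one leg (on-premise or cloud) of the dual-write."""
    def __init__(self, available: bool = True):
        self.available, self.rows = available, []
    def write(self, record: dict) -> bool:
        if not self.available:
            return False
        self.rows.append(record)
        return True

class DualWriteError(RuntimeError):
    """One leg did not acknowledge; the caller must retry or reconcile."""

def dual_write(record: dict, on_prem: MemoryStore, cloud: MemoryStore) -> None:
    """Commit only when BOTH legs acknowledge the write synchronously.
    Note: if only one leg succeeds, the legs have diverged; real systems
    resolve this with two-phase commit or compensating reconciliation."""
    ok_prem, ok_cloud = on_prem.write(record), cloud.write(record)
    if not (ok_prem and ok_cloud):
        raise DualWriteError(f"acks: on_prem={ok_prem}, cloud={ok_cloud}")
```

The cost of this pattern is the write latency of the slowest leg on every transaction, which is precisely why it is reserved for maximum-criticality data.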
This is the most expensive and complex architecture to operate. But for a CGDIS, a SWIFT banking system, or a national alert platform, the question is not "can we afford it?" but "can we afford not to?"
The missing link: chaos engineering
There is a syndrome I encounter in the majority of organizations: the false sense of security. The company has invested in a redundant architecture, the diagrams are beautiful, the recovery procedures span 40 pages, and nobody has ever tested whether any of it actually works.
Chaos engineering is the antidote. The principle, popularized by Netflix with its Chaos Monkey, consists of deliberately provoking failures in a controlled environment to verify that resilience mechanisms work. Cut a region. Kill a critical service. Simulate 500-millisecond network latency. And observe what actually happens, not what your diagrams say should happen.
The revelations are often painful. Automatic failovers that do not trigger because a certificate has expired. Replicated databases that are three hours behind instead of the promised five seconds. Alerts that arrive in a mailbox nobody monitors on weekends.
My advice: start small, in pre-production, with simple scenarios. Then gradually increase the severity and move closer to production. The goal is not to break your system. It is to discover how it breaks before reality does it for you.
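A first latency-injection experiment can be as small as a wrapper around a call. This is a toy in-process sketch of the idea; real chaos tooling (tc/netem, Toxiproxy, Chaos Mesh) injects faults at the network layer instead:

```python
import random
import time

def with_chaos_latency(fn, probability: float = 0.3, delay: float = 0.5):
    """Wrap a callable so a fraction of calls suffer `delay` seconds of
    extra latency. Toy in-process fault injection for pre-production use."""
    def wrapped(*args, **kwargs):
        if random.random() < probability:
            time.sleep(delay)  # the injected fault
        return fn(*args, **kwargs)
    return wrapped
```

Wrapping a dependency call this way in pre-production quickly reveals which timeouts, retries, and circuit breakers actually exist, as opposed to the ones the diagrams claim.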
Resilience is a chain: check every link
The most dangerous mistake I see in "redundant" architectures is the partial view. You make application servers redundant, replicate the database, and forget that the DNS points to a single provider. Or that the TLS certificate is managed by a single service. Or that the authentication layer depends on a third-party SSO with no fallback plan.
Resilience is a chain. It always breaks at the weakest link. Here is the checklist I use to audit actual coverage:
- DNS: Does your name resolution survive the failure of your primary DNS provider?
- Routing: Can your traffic be redirected automatically without human intervention?
- Compute: Do your workloads start in under five minutes on alternative infrastructure?
- Data: Is your RPO (Recovery Point Objective) below the maximum acceptable loss?
- Identity: Can your users authenticate if your primary IdP is unavailable?
- Observability: Do your metrics and alerts still work when the monitoring infrastructure itself goes down?
If you cannot answer "yes" to each of these questions with recent test evidence, your redundancy is an illusion.
Redundancy is not a cost: it is the price of trust
After AWS US-EAST-1, after Azure Front Door, after POST Luxembourg, the debate should be closed. The question is no longer whether a major outage will hit your infrastructure. The question is when, and whether you will be ready.
Organizations that treat redundancy as a cost line to optimize are playing Russian roulette with their business. Those that treat it as vital insurance, with regular testing, chaos engineering scenarios, and complete coverage of the chain, sleep better at night. And above all, they remain standing when others fall.
The 15 to 20% surcharge is not an expense. It is an investment in your clients' trust, the continuity of your operations, and the credibility of your organization. In a world where the next outage is always a few weeks away, it may be the best investment you can make.

