The Day an Engine Didn't Explode
In 2009, a quality engineer at a Tier 1 automotive supplier blocked the production release of a turbocharger. The component passed all functional tests. But the FMEA analysis had revealed a silent failure mode: thermal fatigue of the seal gasket beyond 150,000 cycles, roughly three years of normal use. Without this analysis, thousands of vehicles would have been recalled. Cost avoided: tens of millions of euros.
In IT, this scenario plays out every week. Except nobody blocks the release. We deploy, cross our fingers, and when the incident hits (service outage, data leak, performance degradation), we scramble a crisis team at two in the morning.
Reliability isn't hoped for. It's engineered.
That's the first lesson automotive and aerospace can offer our IT organizations. And it's a lesson that, twenty years after my generalist engineering studies, I continue to apply with unwavering conviction.
Why IT Remains Stuck in Reactive Mode
In aerospace, nothing flies without DO-178C certification. In automotive, nothing rolls without IATF 16949 and its mandatory FMEA. These industries learned, often through blood, that a defect in production costs a hundred times more than a defect caught in design.
IT, meanwhile, continues to operate on the inverse paradigm. We prize delivery speed. We celebrate the heroic hotfix. We decorate the firefighter, never the architect who prevented the fire.
The numbers are clear:
- The average cost of a major production incident ranges between 100,000 and 500,000 euros for a mid-sized company, according to Gartner.
- 60-70% of incidents are linked to recent changes, meaning insufficiently analyzed production releases.
- The correction ratio is 1:10:100: fixing a defect in design costs 1, in testing 10, in production 100.
The problem isn't technical. It's cultural. IT has never had its "aviation moment", a crash painful enough to force a systemic shift in methodology. So we keep reacting. And we call it agility.
FMEA: The Method IT Refuses to See
FMEA (Failure Mode and Effects Analysis) isn't an obscure concept reserved for mechanical engineers. It's a systematic analysis framework, born in the 1940s within the US military, refined by NASA, then massively adopted by automotive (Ford, Toyota) and aerospace (Airbus, Safran).
Its principle is disarmingly simple: before putting anything into production, you identify everything that can go wrong, you assess the severity, and you act accordingly.
In automotive, a process FMEA covers every manufacturing step. In aerospace, a system FMEA covers every critical component. In IT? We do a vague "risk assessment" in an Excel spreadsheet that nobody re-reads.
The fundamental difference is this: in mature industries, risk analysis isn't an administrative formality. It's an engineering act that conditions the right to produce.
My 5-Act Method: Transferring FMEA to IT
After practicing this approach in industrial contexts and then adapting it to multi-million-euro IT transformations, I've formalized a five-act method. It's not theoretical. It's been battle-tested in contexts where failure wasn't an option.
Act 1: Decompose, Map the System and Its Dependencies
Before hunting for failures, you need to understand what you're protecting. You map the system, its components, its interfaces, and especially its dependencies. In IT, this means going beyond the application architecture diagram. You need to integrate data flows, dependencies on third-party services, coupling points with infrastructure, and the human processes surrounding the system.
Deliverable: an up-to-date dependency map, not a PowerPoint diagram from last year.
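As a minimal sketch of what such a deliverable can start from (the service names and dependency structure here are illustrative assumptions, not from any real system), even a plain directed graph already surfaces the coupling points worth examining first:

```python
# Dependency map as a directed graph: service -> services it depends on.
# All names are hypothetical, for illustration only.
deps = {
    "checkout": ["payment-api", "inventory", "postgres"],
    "payment-api": ["postgres", "third-party-psp"],
    "inventory": ["postgres", "cache"],
    "reporting": ["postgres"],
}

def fan_in(deps):
    """Count how many services depend on each component."""
    counts = {}
    for upstream in deps.values():
        for dep in upstream:
            counts[dep] = counts.get(dep, 0) + 1
    return counts

# Components with the highest fan-in are coupling points and
# single-point-of-failure candidates to analyze first.
hotspots = sorted(fan_in(deps).items(), key=lambda kv: -kv[1])
print(hotspots[0])  # ('postgres', 4)
```

The point isn't the tooling: a dictionary in a script beats a PowerPoint diagram precisely because it can be regenerated and diffed as the system evolves.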
Act 2: List, Identify Potential Failure Modes
For each identified component, you ask: "How can this fail?" You're not looking for Hollywood disaster scenarios. You're looking for the mundane failures, the ones that happen on a Tuesday morning when a certificate expires, when a connection pool saturates, when a cron job has been silently failing for three weeks.
Key principle: the engineer's creativity matters as much as their experience. The best FMEAs are conducted by cross-functional teams (devs, ops, architects, business) because each sees different blind spots.
Act 3: Evaluate, Score Occurrence, Severity, Detectability
This is the quantitative heart of the method. For each failure mode, you evaluate three dimensions on a scale of 1 to 10:
- Occurrence (O): what's the probability this defect will happen?
- Severity (S): what's the impact if it does?
- Detectability (D): can we detect it before the user experiences it?
The product O × S × D gives the Risk Priority Number (RPN). An RPN above 100 (out of 1,000) warrants corrective action. An RPN above 200 is a red flag.
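The scoring above reduces to a few lines of code. This sketch uses the article's thresholds (100 and 200); the function names and the sample failure mode with its scores are illustrative assumptions:

```python
# RPN = Occurrence * Severity * Detectability, each scored 1-10.

def rpn(occurrence, severity, detectability):
    for v in (occurrence, severity, detectability):
        assert 1 <= v <= 10, "scores are on a 1-10 scale"
    return occurrence * severity * detectability

def triage(score):
    """Map an RPN to an action level, using the article's thresholds."""
    if score > 200:
        return "red flag"
    if score > 100:
        return "corrective action"
    return "monitor"

# Hypothetical failure mode and scores, for illustration.
mode = {"name": "certificate expiry", "O": 4, "S": 8, "D": 7}
score = rpn(mode["O"], mode["S"], mode["D"])
print(score, triage(score))  # 224 red flag
```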
The value of this approach: it objectifies criticality. No more endless committee debates where everyone defends their turf. The numbers decide.
Act 4: Prioritize, Invest the Euro Where It Creates the Most Value
FMEA doesn't say "fix everything." It says "fix what has the highest RPN first." It's a resource allocation tool, not a wish list.
In practice, you often find that 20% of components concentrate 80% of the risk. It's on that 20% that you need to invest first: architecture reinforcement, monitoring, test automation, recovery planning.
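The 80/20 cut-off can be made explicit once failure modes are scored. In this sketch, the failure modes and their RPN values are invented for illustration; only the selection logic matters:

```python
# Hypothetical failure modes with their RPN scores.
modes = [
    ("certificate expiry", 224),
    ("connection pool saturation", 180),
    ("silent cron failure", 168),
    ("cache stampede", 90),
    ("log disk full", 48),
    ("stale DNS entry", 30),
]

def pareto_front(modes, share=0.8):
    """Smallest set of modes, highest RPN first, covering `share` of total risk."""
    total = sum(score for _, score in modes)
    selected, covered = [], 0
    for name, score in sorted(modes, key=lambda m: -m[1]):
        if covered >= share * total:
            break
        selected.append(name)
        covered += score
    return selected

# Invest first in the modes returned here: architecture reinforcement,
# monitoring, test automation, recovery planning.
print(pareto_front(modes))
```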
Act 5: Lock Down, Poka-Yoke, Continuous Testing, and Periodic Reviews
The Japanese term "poka-yoke" refers to an error-proofing device. In automotive, it's a physical interlock that prevents incorrect assembly. In IT, it's an architectural guardrail: a CI/CD pipeline that refuses to deploy if security tests fail, a database schema that prohibits inconsistent states, a circuit breaker that isolates a failing service.
FMEA isn't a one-time exercise. It lives with the system. Every production incident is an opportunity to revisit the analysis, add a failure mode that wasn't anticipated, and strengthen the defenses.
What IT Concretely Gains
Organizations that adopt this rigor, and I've guided several through it, observe measurable results:
- Culture of anticipation: teams shift from firefighting to prevention. Risks are visible, quantified, and prioritized before every production release. Stress decreases. Talent retention improves.
- Operational efficiency: fewer incidents, better SLAs, stabilized delivery velocity. You don't deliver faster by cutting corners. You deliver faster by eliminating rollbacks.
- Built-in compliance: Secure-by-Design and continuous auditability integrate naturally into the FMEA framework. ISO 27001, GDPR, NIS-2: all these requirements become failure modes to score, not boxes to check.
The Methodological Bridge Nobody Is Building
The beauty of generalist engineering is that it reveals invariants. Problems change vocabulary across industries, but the underlying structures are identical. A failure mode in mechanical engineering and a bug in production are the same thing: a gap between expected behavior and actual behavior, with measurable consequences.
Automotive took thirty years to industrialize FMEA. Aerospace did it under certification pressure. IT has neither. No regulator imposing the method. No industrial culture carrying it.
But organizations that make the deliberate choice to import these practices, not as a formality, but as an engineering act, gain a structural advantage. They're not just more reliable. They're faster, because they spend less time fighting fires and more time building.
The question isn't whether IT will eventually adopt these methods. The question is whether your organization will be among the first to do so, or among those that keep paying the price of reactive mode.

