The Fire That Reached the Backups: The OVHcloud Strasbourg Data-Centre Fire, 2021

#postmortem #backup #devops #reliability

Tales from the Bare Metal — Episode 05

In the early hours of 10 March 2021, a fire began in a power room in Strasbourg. By morning an entire data centre had been destroyed and a second badly damaged. Around 3.6 million websites went offline. For a great many of those customers the sites came back within days. For some, they never came back at all, because the only copy of their data had been in the building that burned. The data loss is not the lesson of this episode. The lesson is that a backup can be complete, valid, restorable, and still worthless, if it shares a failure domain with the thing it is backing up.

The Incident

Shortly before 01:00 on 10 March 2021, fire broke out in a power room at OVHcloud's Strasbourg site, known as SBG. The site comprised several buildings. SBG2 was destroyed entirely. SBG1 was badly damaged, with several of its rooms lost. SBG3 and SBG4 did not burn, but were powered down while the site was made safe, because the power infrastructure that fed them was gone.

The scale of the dependent estate became clear within hours. According to figures cited in the official investigation, roughly 3.6 million websites, corresponding to around 464,000 domain names, were unavailable at the height of the crisis, close to 18 per cent of the active IP addresses OVH had assigned over the preceding fortnight. Game servers, government sites, e-commerce shops and countless small businesses went dark together. OVHcloud's founder communicated openly and frequently through the days that followed, and the company moved quickly to rebuild and to ship replacement capacity. But for customers whose only copy of their data lived on the SBG site, no amount of openness brought the data back. It was gone.

The Diagnosis

The fire started in the power room. The French Bureau of Investigation and Analysis on Industrial Risks (BEA-RI) published its report in June 2022. The report records high humidity readings near one of the power inverters in the hour before the fire began, and discusses the inverters as a likely origin, but it deliberately stops short of asserting a single definitive cause. That hedge is worth respecting: the precise ignition is not known with certainty, and inventing one would be dishonest.

What is understood, and what matters more for the lesson, is why a fire in one power room became the loss of a building. Three design facts compounded.

First, the cooling. SBG2 was built in 2011 using a tower design with free cooling, sometimes called auto-ventilation: rather than mechanical chillers, the building let the waste heat of the servers rise and vent at the top, drawing cooler outside air in at the bottom. As an energy strategy this is genuinely elegant and genuinely efficient. In a fire it behaves differently: a tall shaft with a strong natural updraught is, in the words that have followed the incident, rather like a chimney. The same airflow that cooled the servers fed and lifted the fire.

Second, the construction. The floors were wooden, rated to resist fire for about an hour. An hour is a long time at a desk and a short time against a fed fire in a ventilated tower.

Third, suppression. OVH had chosen not to fit any of the five buildings on the Strasbourg site with an automatic fire-extinguishing system. There was detection, there was human response, and there was the fire brigade, but there was no gas or water system that triggers on its own in the room of origin in the first minutes, which are the minutes that decide whether a fire stays in one rack or takes a building.

None of these, on its own, is the villain. Together they meant that an event in one power room had very little standing between it and the whole structure.

The Context

The hard part of this story is not OVH's building. It is the customers' assumption, because that assumption is nearly universal and it is the part that travels.

The first condition is a mental model. We say "it is in the data centre" or "it is in the cloud" and we hear "it is safe". The phrase abstracts away the physical fact: a specific building, in a specific town, with specific walls and a specific power room. Almost nobody, choosing where their backup lives, pictures the building. The abstraction that makes cloud convenient is the same abstraction that hides the failure domain.

The second condition is the shape of the tools. A hosting panel offers a backup option, often a cheap one, and the nearest and cheapest option is frequently storage in the same data centre, sometimes the same building. The interface presents "backup" as a feature you switch on, not as a question about geography. So customers switched it on, in good faith, and their primary and their backup came to sit inside one failure domain, chosen by default rather than by decision. The word "backup" did all the reassuring; the location did all the risk.

The third condition is ownership. The location of a backup is rarely anyone's explicit, written requirement. It is a setting, a default, a checkbox during provisioning, and checkboxes have no owner. Restore-testing, the subject of this series' first episode, at least tends to land on someone's plate eventually. "Is our backup in a different failure domain from our primary?" is a question that, in a great many organisations, no one has ever been assigned to answer.

And all of it was reasonable at the time it was decided. The free-cooling tower was a real efficiency innovation that saved real energy for a decade. The single-site backup was a real saving that worked perfectly every day the building did not burn. These were not careless choices. They were ordinary trade-offs whose hidden assumption, that the failure domains do not overlap, was simply never tested until a fire tested it for everyone at once.

The Principle

The rule that answers this is older than the cloud, and it is three numbers: 3-2-1. Keep three copies of your data, on at least two different kinds of media, with at least one of them off-site. The number that does the work here is the one: off-site. That does not mean a different rack or a different room; it means a different failure domain, far enough away that a fire, a flood or a power surge at the primary cannot reach it.
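The rule is simple enough to check mechanically. What follows is a minimal sketch, not a tool: the Copy record, the check_3_2_1 function and the site labels are all hypothetical names invented for illustration, and a "failure domain" is reduced to a plain string that you would have to fill in honestly yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset. All field values are illustrative labels."""
    name: str
    media: str           # e.g. "disk", "object-storage", "tape"
    failure_domain: str  # e.g. a site or region identifier you assign


def check_3_2_1(copies: list[Copy], primary: Copy) -> list[str]:
    """Return the 3-2-1 violations found; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies are on one kind of media")
    # Off-site means a different failure domain from the primary,
    # not merely a different rack or room.
    if not any(c.failure_domain != primary.failure_domain for c in copies):
        problems.append("no copy outside the primary's failure domain")
    return problems


primary = Copy("production", "disk", "site-a")
same_site = [
    primary,
    Copy("panel-backup", "object-storage", "site-a"),
    Copy("nas-backup", "disk", "site-a"),
]
separated = same_site[:2] + [Copy("offsite", "object-storage", "site-b")]

print(check_3_2_1(same_site, primary))   # three copies, two media, one building
print(check_3_2_1(separated, primary))   # same counts, one copy elsewhere
```

The first inventory passes the "three" and the "two" and fails only the "one", which is exactly the configuration that looks fine on a dashboard.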

The first episode of this series gave you the first commandment of backups: thou shalt not trust a backup thou hast not restored. This episode gives you the second, and they are not the same: thou shalt not keep that backup in the building thou art backing up. A backup you have diligently restore-tested every week is still not a backup if it burns in the same fire as the original. Restorability and separation are two independent axes, and you need both. GitLab, in episode one, had the separation and lacked the restorability. OVH's unluckier customers had neither guaranteed.

In the unixoid tradition the mechanics are unglamorous and well-proven. On FreeBSD, take a ZFS snapshot and zfs send it over SSH to a pool in another building, another region, or another provider entirely; a cron job and a receiving pool are the whole apparatus, and the stream is incremental after the first run. With restic or borg, back up to object storage in a different region, encrypted, deduplicated, with the repository somewhere the primary's misfortune cannot follow. The tooling is not the hard part and never was. The hard part is the decision to put the second copy somewhere the first copy's bad day cannot reach, and then to verify, with a restore, that it is really there.
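To make the ZFS half of that concrete, here is a hedged sketch that only builds the command strings for one replication run and does not execute anything. The function name, the dataset tank/data, the date-based snapshot names and the host backup@offsite.example.net are assumptions for illustration; the zfs snapshot, zfs send -i and zfs receive invocations are the standard ones, but check them against your own system's manual pages before wiring them into cron.

```python
import shlex


def zfs_replication_commands(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Build (not run) the shell commands for one replication to a remote pool.

    Pass prev_snap=None for the first, full send; afterwards pass the name of
    the last snapshot that the remote side already holds.
    """
    new_full = f"{dataset}@{new_snap}"
    snapshot_cmd = shlex.join(["zfs", "snapshot", new_full])

    send = ["zfs", "send"]
    if prev_snap is not None:
        # -i sends only the changes between the previous and the new snapshot.
        send += ["-i", f"{dataset}@{prev_snap}"]
    send.append(new_full)

    # The stream is piped over SSH into "zfs receive" on the remote host.
    receive = ["ssh", remote_host, "zfs", "receive", remote_dataset]
    pipeline = f"{shlex.join(send)} | {shlex.join(receive)}"
    return snapshot_cmd, pipeline


snap, pipe = zfs_replication_commands(
    "tank/data", "2021-03-09", "2021-03-10",
    "backup@offsite.example.net", "backup/data",
)
print(snap)
print(pipe)
```

Nothing in the sketch knows where the remote host physically is. That remains the decision the tooling cannot make for you.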

Where It Travels

The OVH customers were not unusually careless. The same failure domain hides in nearly every modern stack, wearing the local vocabulary.

On AWS, the trap is the word "zone". Multi-AZ feels like redundancy, and against the failure of a host, a rack or a single facility it is. But an Availability Zone is one or more data centres, all the zones of a region sit in the same geographic area, and the region is the unit of geographic separation. A database replicated across AZs survives a host fault and may not survive a regional event; the off-site copy is cross-region replication (S3 CRR, cross-region snapshots), and it is a separate, deliberate setting.
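The zone-versus-region distinction can be read straight off the names. The sketch below assumes the standard naming in which an Availability Zone is the region name plus one letter (eu-west-1a lives in eu-west-1); it deliberately rejects anything else, such as Local Zone names, rather than guess. The function names and the three labels it returns are hypothetical.

```python
import re

# Standard AZ names are the region name plus one letter, e.g. "eu-west-1a".
_AZ_PATTERN = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def region_of(az: str) -> str:
    """Return the region part of a standard AZ name, or raise ValueError."""
    match = _AZ_PATTERN.match(az)
    if match is None:
        raise ValueError(f"not a standard AZ name: {az!r}")
    return match.group(1)


def separation(primary_az: str, backup_az: str) -> str:
    """Classify how far apart a primary and its backup are placed."""
    if primary_az == backup_az:
        return "same-az"
    if region_of(primary_az) == region_of(backup_az):
        return "multi-az"      # survives a host fault, shares the region
    return "cross-region"      # the off-site copy in the 3-2-1 sense


for pair in [("eu-west-1a", "eu-west-1a"),
             ("eu-west-1a", "eu-west-1b"),
             ("eu-west-1a", "eu-central-1a")]:
    print(pair, separation(*pair))
```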

On Azure, the distinction is the storage redundancy tier: locally-redundant storage (LRS) keeps the copies in one data centre, while geo-redundant storage (GRS) places a copy in a paired region hundreds of kilometres away. The cheaper default is the one that shares the postcode.
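The same question can be asked of a list of storage accounts. This is a sketch under assumptions: the SKU names are Azure's published redundancy SKUs as I understand them, while the function and the account inventory are invented for illustration and would in practice come from your own export or API query.

```python
# Redundancy SKUs that keep a copy in a second, paired region.
GEO_REPLICATED_SKUS = {
    "Standard_GRS", "Standard_RAGRS", "Standard_GZRS", "Standard_RAGZRS",
}
# Redundancy SKUs whose copies all stay inside the primary region.
SINGLE_REGION_SKUS = {
    "Standard_LRS", "Standard_ZRS", "Premium_LRS", "Premium_ZRS",
}


def leaves_primary_region(sku: str) -> bool:
    """True if the SKU replicates data to a second region."""
    if sku in GEO_REPLICATED_SKUS:
        return True
    if sku in SINGLE_REGION_SKUS:
        return False
    # Refuse to guess about SKUs this sketch does not know.
    raise ValueError(f"unknown redundancy SKU: {sku!r}")


# Hypothetical inventory: account name -> redundancy SKU.
accounts = {
    "prodbackups": "Standard_LRS",
    "archive": "Standard_GRS",
}
shares_postcode = sorted(
    name for name, sku in accounts.items() if not leaves_primary_region(sku)
)
print(shares_postcode)
```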

On Google Cloud, multi-region buckets and cross-region backups serve the same role, and the same default-versus-decision applies.

In Kubernetes, the cluster is the failure domain people forget. Velero backups and etcd snapshots that live on the same cluster, or in object storage in the same region, are a second copy in one place. Ship them off-cluster and off-region.

On-premises, the rule is at its most physical and most ignored. The backup NAS in the same server room as the production servers is not a backup; it is a second copy awaiting the same flood, the same power surge, the same fire. The unfashionable tape, written weekly and carried to a drawer across town or a safe-deposit box, has quietly saved more organisations than any cloud panel's backup toggle.

The shape is identical everywhere: a copy that shares a failure domain with the original is redundancy in name only. It survives the failures that do not matter much and dies in the one that does.

Coda

OVHcloud rebuilt, changed its designs, and the industry spent a fortnight reading think-pieces about fire suppression and free cooling. Both are worth reading. But the durable lesson of 10 March 2021 is not about cooling towers or wooden floors, which are OVH's to fix. It is about a sentence every team can check this afternoon without a single phone call to a vendor: where, physically, is our backup, and could the thing that kills our primary kill it too?

Redundancy that shares a postcode is decoration. The fire does not read your architecture diagram. It reads the floor plan.


By Vivian Voss, System Architect and Software Developer.