
Imran Siddique

Originally published at Medium.

How We Upgraded Petabytes of Data Without Stopping the Music

Resilience isn’t just about surviving failures. It’s about surviving maintenance.

There is a misconception that “Zero Downtime” means your system never changes.

In reality, a healthy large-scale system is in a constant state of flux. At Azure DevOps, we managed clusters holding petabytes of source code and search data. We couldn’t tell millions of developers, “Stop coding for 4 hours while we patch the OS.”

We had to change the engine while flying the plane. Here is the strategy we used to perform rolling upgrades on massive stateful clusters (Elasticsearch) without dropping a single query.

1. Trust the Consensus (The Quorum Strategy)

The first rule of a stateful upgrade is: Don’t fight the system; use it.

We designed our clusters with three distinct roles: Master Nodes (coordination), Index Nodes (write-heavy), and Query Nodes (read-heavy).

  • The Masters: We didn’t fear killing a Master node because we relied on the election algorithm. If we killed the current Master, the remaining nodes would elect a new one immediately. The system is designed to lose a head and grow a new one.
  • The Data Nodes: We could upgrade multiple nodes in parallel, but we strictly obeyed the Quorum Rule. If a shard was replicated across 3 nodes, we ensured at least 2 were always up. The system would simply mark the down node as “missing” and catch it up when it returned (a rough sketch of this pre-restart check follows this list).
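
To make the quorum rule concrete, here is a minimal sketch of a pre-restart check, assuming Elasticsearch’s `_cluster/health` and `_cluster/settings` APIs. The endpoint URL, helper names, and the “pause replica allocation” step are illustrative, not our exact production tooling.

```python
# Hypothetical pre-restart check against standard Elasticsearch cluster APIs.
import requests

ES = "http://localhost:9200"  # assumed cluster endpoint


def safe_to_take_node_down() -> bool:
    """Only proceed if the cluster is green and no shards are unassigned."""
    health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
    return health["status"] == "green" and health["unassigned_shards"] == 0


def pause_replica_reallocation() -> None:
    """Keep primaries allocatable, but stop the cluster from rebuilding
    replicas onto other nodes while one node is briefly away."""
    requests.put(
        f"{ES}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.enable": "primaries"}},
        timeout=10,
    )


def resume_reallocation() -> None:
    """Re-enable full shard allocation so the returning node catches up."""
    requests.put(
        f"{ES}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.enable": "all"}},
        timeout=10,
    )


if __name__ == "__main__":
    if safe_to_take_node_down():
        pause_replica_reallocation()
        # ... patch and restart the node here ...
        resume_reallocation()
```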

2. The “Smallest to Biggest” Rollout

Most teams roll out updates by Geography (East US to West US). We found a safer way: Roll out by Tenant Size.

  • Phase 1: The Mice. We started the upgrade on the clusters hosting our smallest, lowest-traffic accounts. If there was a performance regression or a memory leak in the new version, we caught it here with minimal blast radius.
  • Phase 2: The Elephants. Only after the small clusters proved stable did we touch the massive, petabyte-scale clusters.

By the time the upgrade hit our largest customers, the new version had already been battle-tested on thousands of smaller nodes.
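
As a rough illustration of “roll out by tenant size,” the sketch below orders clusters by data volume and batches them into waves, smallest first. The `Cluster` fields, wave size, and cluster names are hypothetical stand-ins for whatever inventory metadata you already track.

```python
# Illustrative rollout planner: order clusters by size, not geography.
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    tenant_count: int
    data_size_tb: float  # rough proxy for "mice" vs. "elephants"


def rollout_waves(clusters: list[Cluster], wave_size: int = 5) -> list[list[Cluster]]:
    """Smallest clusters first, so regressions surface with minimal blast radius."""
    ordered = sorted(clusters, key=lambda c: c.data_size_tb)
    return [ordered[i:i + wave_size] for i in range(0, len(ordered), wave_size)]


waves = rollout_waves([
    Cluster("search-eu-small", 120, 0.4),
    Cluster("search-us-large", 9000, 1500.0),
    Cluster("search-asia-mid", 800, 12.0),
])
for n, wave in enumerate(waves, start=1):
    # In practice you would bake and observe between waves before proceeding.
    print(f"Wave {n}: {[c.name for c in wave]}")
```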

3. The Health Check Drain

We didn’t manually reroute traffic. We relied on the load balancer’s health signals. When a Query Node needed to restart, we simply failed its health check. Traffic drained naturally, the node was updated and came back online, and traffic flowed back.
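
Here is a minimal sketch of a drainable health endpoint, assuming a load balancer that polls `/health` and stops routing new requests to a node after repeated failures. The flag-file mechanism, port, and names are illustrative.

```python
# Sketch of a drainable health endpoint: fail the check on demand, let the
# load balancer drain traffic, then restart the node.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/var/run/querynode.drain"  # touch this file to start draining


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        if os.path.exists(DRAIN_FLAG):
            # The load balancer sees repeated failures and stops sending queries.
            self.send_response(503)
        else:
            self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```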

The Lesson: If you design your system to tolerate failure (Master election, Replication Quorums), you have also designed it to tolerate maintenance.
