[https://medium.com/@vedeneev/everything-works-until-it-doesnt-eddf69453a01]
(REPOST)
Originally written by Vladimir Vedeneev
In telecom infrastructure, few words appear as often, or as confidently, as redundancy. It shows up in architecture diagrams, RFP responses and technical presentations. Somewhere in the conversation someone inevitably says, “We have redundancy,” usually with the quiet confidence of someone who believes the problem has been solved.
But after many years building and operating backbone networks, I have learned that redundancy is often the easiest thing to claim and the easiest thing to misunderstand. Two things that appear independent on paper can still fail together. And when they do, the moment tends to be memorable.
Most network diagrams are comforting things. They show two routes across a map, often in different colours. Two fibres. Two facilities. Two providers. The visual logic suggests that if one path fails, the other keeps carrying traffic. In practice, that assumption often depends on details the diagram does not capture.
Early in my career I experienced a failure that permanently changed the way I look at redundancy. On paper the design looked excellent. Two fibre routes ran across the network. Multiple providers were involved. The architecture appeared to provide the kind of diversity every operator wants to see. Then a construction incident cut a conduit. Both routes went down.
It turned out that the supposedly independent fibres shared the same physical pathway for part of their journey. In the diagram they appeared separate. In the ground they were neighbours. The network had redundancy. What it did not have was independence, and that distinction is what actually matters in practice.
A Tale of Two Fibres
If Shakespeare had been a network engineer, he might have written something like this. Two fibres, both alike in dignity, running across the infrastructure map. Each promises resilience, continuity and peace of mind. Except they share the same trench. Or the same conduit. Or the same building entrance. At that point the story becomes less comedy and more something else.
Two fibres in the same trench are not route diversity. Two providers relying on the same physical infrastructure are not independent networks. Two data centres powered by the same substation do not represent separate risk domains.
These arrangements appear surprisingly often in real networks, largely because true physical diversity is difficult to achieve. Building separate routes across cities requires coordination with municipalities, construction planning, access to rights of way and sometimes a fair amount of persistence. It is expensive and time consuming.
Duplication, on the other hand, is easy. Adding a second circuit or contracting a second provider creates the appearance of redundancy without necessarily eliminating shared risk, but duplication is not resilience.
For many years this distinction was not always critical. Network traffic grew steadily and predictably. Outages were disruptive but usually contained. The infrastructure ecosystem was able to absorb mistakes in design without catastrophic consequences, which made those weaknesses easier to overlook. That is becoming less true.
The rise of hyperscale cloud infrastructure, AI training clusters and distributed computing has dramatically increased the importance of network reliability. Massive data flows now move between data centres, between continents and within tightly interconnected compute clusters. East-west traffic inside infrastructure environments is expanding rapidly.
When something breaks, the blast radius can be far larger than it once was. Entire platforms can disappear from the internet. Latency sensitive workloads can fail in ways that are difficult to recover from quickly. Companies that believed they had built resilient architectures sometimes discover that several critical components share the same vulnerability. As a result, customers are beginning to ask better questions, not simply “Do you have redundancy?”, but “What breaks together?”
That question cuts directly to the heart of infrastructure resilience. Redundancy answers a straightforward question: if something fails, is there another path available? Resilience asks a more demanding one: could the same event disable both paths at the same time? The difference between those two questions often determines whether a network continues operating during an incident or disappears entirely.
True resilience requires thinking in terms of failure domains. Physical pathways, power systems, facilities, providers and control systems all create potential dependencies. Any layer can introduce a hidden point of correlation.
Resilient architectures attempt to separate those layers as much as possible. Routes are placed in different trenches. Facilities draw power from different infrastructure. Providers rely on different upstream networks. Operational control systems are designed to avoid single points of coordination failure. Designing infrastructure this way is not always easy, but it dramatically reduces the chance that a single event cascades across the entire system.
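The idea of separating failure domains can be made concrete with a simple check: model each path as the set of failure domains it depends on, then look for overlap. This is only an illustrative sketch; the domain names and topology below are hypothetical, not taken from any real network.

```python
# Hedged sketch: does a "redundant" pair of paths share a failure domain?
# Every name here (trenches, substations, carriers) is a made-up example.

def shared_failure_domains(path_a, path_b):
    """Return the failure domains common to both paths.

    Each path is a list of segments; each segment is a set of the
    failure domains it depends on (trench, conduit, power substation,
    upstream provider, control system, ...).
    """
    domains_a = set().union(*path_a)
    domains_b = set().union(*path_b)
    return domains_a & domains_b

# Two routes that look diverse on the diagram...
primary = [{"trench:5th-ave", "power:substation-east", "provider:carrier-1"}]
backup = [{"trench:5th-ave", "power:substation-west", "provider:carrier-2"}]

overlap = shared_failure_domains(primary, backup)
print(overlap)  # {'trench:5th-ave'} -- one backhoe takes out both
```

An empty result is what true independence looks like; anything else is a correlated failure waiting for its trigger event.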
Over the years I have developed a simple habit when reviewing network architectures or interconnection strategies. When someone says, “We have redundancy,” I ask one question: what breaks together? It is a surprisingly effective test.
Sometimes the answer reveals that two routes share a conduit somewhere along the path. Sometimes backup systems depend on the same power infrastructure. Occasionally it exposes operational dependencies that were never fully examined.
None of these discoveries necessarily invalidate the design, but they make the risk visible and force a clearer understanding of whether that exposure is acceptable.
As digital infrastructure becomes more central to the global economy, expectations around reliability are evolving. Enterprises are less interested in marketing claims about uptime and more interested in understanding the underlying architecture that supports those promises.
Hyperscale operators already think this way. Their networks are built to minimise correlated failures because they have experienced the consequences of those failures at enormous scale.
Now that mindset is spreading across the broader infrastructure ecosystem. Customers want transparency about topology, routing and physical diversity. They want to understand interconnection strategies, not just bandwidth capacity. In other words, the conversation is shifting away from scale and toward resilience.
At its core, redundancy is relatively easy to create. Add another circuit. Add another route. Add another provider. Resilience requires something deeper. It requires understanding how infrastructure actually fails and designing systems that prevent a single incident from propagating across multiple layers.
It requires looking beyond the reassuring simplicity of diagrams and examining the physical and operational realities underneath them. In telecom infrastructure, reliability is rarely the result of duplication alone. It comes from thoughtful separation. The networks that understand that distinction are the ones most likely to remain standing when something unexpected inevitably happens, even in the kind of scenario where two fibres share one trench and a single backhoe takes both out at once.
Most networks don’t fail twice. They fail once, everywhere.