Mrinal Narang

Posted on Jun 28

Dependency Mapping and Hidden Failure Modes

#architecture #microservices #sre #systemdesign

You've got your architecture diagram.

It looks good. Services connected with clear lines. Data flows. Integration points.

Solid design.

Then production goes down.

And the outage spreads through a dependency nobody drew on that diagram.

The Reality

Most outages don't follow the architecture diagram. They follow the actual code.

You have a service that calls Service A. Service A calls Service B synchronously. Service B reads from a cache. That cache is backed by Service C. Service C has an undocumented polling relationship with Service D.

Nowhere on your diagram.

But when D fails? The entire stack goes down. In order: C gets slow, B times out, A gets backed up, your service drowns in connection timeouts.
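
To make that cascade concrete, here is a minimal sketch that treats the chain above as a graph and asks what sits upstream of a failed node. The edge list and the `blast_radius` helper are illustrative assumptions, not output from any real tool. The last edge is the polling relationship that never made it onto the diagram.

```python
from collections import defaultdict, deque

# Hypothetical edges as (caller, dependency), taken from the chain above.
EDGES = [
    ("your-service", "A"),
    ("A", "B"),
    ("B", "cache"),
    ("cache", "C"),
    ("C", "D"),  # the undocumented polling relationship
]


def blast_radius(edges, failed):
    """Return every node that depends on `failed`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = defaultdict(set)
    for caller, dependency in edges:
        dependents[dependency].add(caller)

    affected = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in sorted(dependents[node]):
            if caller not in seen:
                seen.add(caller)
                affected.append(caller)
                queue.append(caller)
    return affected


print(blast_radius(EDGES, "D"))
# ['C', 'cache', 'B', 'A', 'your-service']
```

Drop the last edge, which is what the diagram effectively does, and the same query returns an empty list. The graph only answers honestly when the hidden edge is in it.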

Customers notice before your alerts fire.

What Gets Missed

Implicit dependencies. Service A doesn't explicitly call Service B. But A reads from a table that B populates. If B stops writing, A fails silently. Nobody knew they were coupled.
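
One way to surface this kind of coupling is a freshness check on the shared table: if B has not written recently, A should say so instead of failing silently. This is a rough sketch that assumes the table exposes a last-write timestamp. The `is_stale` helper, the 30-minute threshold, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_write, now, max_age):
    """True when the producer has not written within `max_age`."""
    return now - last_write > max_age


# Illustrative values: B last wrote 45 minutes ago, and A tolerates 30.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_write = datetime(2024, 1, 1, 11, 15, tzinfo=timezone.utc)

print(is_stale(last_write, now, timedelta(minutes=30)))
# True
```

The useful part is less the check itself than the threshold: picking `max_age` forces someone to write down that A depends on B's writes.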

Transitive failures. You know you depend on the database. What you don't know is that the database client library maintains a background connection pool that hits an internal service. That service goes down. Your database works fine. Your application hangs.

Async failures hidden as success. A request succeeds and returns 200. But the background job that's supposed to process the data never fires. The dependency broke, and you didn't notice for hours.
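
A simple way to catch this is to reconcile what was accepted against what actually finished. The sketch below assumes you can record an acceptance time per request and a set of request ids whose jobs completed. The names and the 60-second deadline are hypothetical.

```python
def overdue_jobs(accepted, completed, now, deadline_seconds):
    """Return ids that got a 200 but whose background job is overdue.

    accepted:  dict of request id -> time the request was accepted (seconds)
    completed: set of request ids whose background job finished
    """
    return sorted(
        request_id
        for request_id, accepted_at in accepted.items()
        if request_id not in completed and now - accepted_at > deadline_seconds
    )


accepted = {"r1": 0, "r2": 100, "r3": 290}
completed = {"r1"}

print(overdue_jobs(accepted, completed, now=300, deadline_seconds=60))
# ['r2']
```

Here `r3` is still within its deadline, so only `r2` is flagged. Alerting on a non-empty result turns "didn't notice for hours" into roughly one deadline's worth of delay.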

Shared infrastructure you forgot about. Two services running on the same Kubernetes node. One burns CPU, the other starves. You didn't plan for them to interfere. They do.

Third-party API cascades. Your service integrates with an API that calls another API internally. When that internal API is slow, your service times out. You didn't know about the dependency. The API provider didn't document it.

How You Actually Discover Dependencies

You don't discover them during planning sessions. You discover them during incidents.

2 AM. Everything is burning. You start tracing requests. You find a call you didn't know existed. You look at the code. "Oh. Yeah. Service X calls Service Y as a fire-and-forget."

You knew about Service X. You knew about Service Y. You didn't know they were connected.

By the time you're discovering this, customers have been down for 40 minutes.

The Tools Help But Don't Solve It

Network traffic analysis shows connections. Distributed tracing reveals call chains. APM tools map service interactions.

These help. But they only show you what's currently happening. If a dependency is dormant, it's invisible. If a failure path is rare, you won't see it until it happens.

A service that calls another service only during payment processing won't show up in your dependency map until a payment actually runs. And you won't learn how that call fails until someone tries to pay during an outage.

What Actually Works

Run incidents. Deliberately. Gamedays and chaos engineering aren't about proving resilience. They're about discovering unknown dependencies before they become production incidents. Shut down a service you think is non-critical and watch what breaks.
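
As a toy version of a gameday, the sketch below marks one dependency as down and runs a set of probes to see which ones break. Everything here is invented for illustration: the `Registry` class, the `recs` dependency, and the `checkout` function with its unhandled upsell call. A real exercise would inject faults at the network or platform level rather than in-process.

```python
from contextlib import contextmanager


class DependencyDown(Exception):
    """Raised when a call hits a dependency marked as down."""


class Registry:
    """Routes dependency calls so an outage can be simulated."""

    def __init__(self):
        self._down = set()

    @contextmanager
    def outage(self, name):
        self._down.add(name)
        try:
            yield
        finally:
            self._down.discard(name)

    def call(self, name, fn, *args):
        if name in self._down:
            raise DependencyDown(name)
        return fn(*args)


registry = Registry()


def recommendations(user):
    return registry.call("recs", lambda u: ["item-1"], user)


def checkout(user):
    # Hidden coupling: checkout fetches an upsell and never handles failure.
    return {"user": user, "upsell": recommendations(user)}


def run_gameday(target, probes):
    """Take `target` down and return the names of the probes that broke."""
    broken = []
    with registry.outage(target):
        for name, probe in probes.items():
            try:
                probe()
            except DependencyDown:
                broken.append(name)
    return broken


print(run_gameday("recs", {"checkout": lambda: checkout("u1")}))
# ['checkout']
```

Recommendations looked non-critical, yet checkout breaks when it is gone. That is the kind of finding the exercise exists to produce.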

Trace the data, not the diagram. Follow what happens to a customer request. Where does it go? What systems read the results? What systems depend on side effects? Write it down. That's your actual architecture.
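
If you already collect traces, you can derive the edges that actually occur and diff them against the diagram. The span shape below (`span_id`, `parent_id`, `service`) is a simplified assumption and not the exact schema of any tracing system. The `DOCUMENTED` set stands in for whatever your diagram claims.

```python
def edges_from_spans(spans):
    """Derive (caller, callee) service edges from parent/child spans."""
    by_id = {span["span_id"]: span for span in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only cross-service hops count as dependencies.
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges


# Hypothetical trace: X handles a request, then fires a call to Y.
SPANS = [
    {"span_id": "1", "parent_id": None, "service": "X"},
    {"span_id": "2", "parent_id": "1", "service": "X"},
    {"span_id": "3", "parent_id": "2", "service": "Y"},
]

# What the diagram says X depends on.
DOCUMENTED = {("X", "Z")}

undocumented = edges_from_spans(SPANS) - DOCUMENTED
print(undocumented)
# {('X', 'Y')}
```

This only covers call paths that were actually traced, so it will not reveal dormant dependencies or couplings that go through shared data.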

Check what you're not monitoring. If you're not alerting on a dependency, you probably don't know about it. Set a timer. Pick a random service. Ask: what would break if this disappeared right now? If you don't know, you've found a hidden dependency.

Document after incidents. The postmortem is the best time to update your architecture diagram. You now know something that wasn't documented before. Write it down so the next person doesn't learn it during an outage.

Assume cache failures. Every cache hit is a hidden dependency. Every background job is a failure mode. Every async operation is a silent failure waiting to happen. Don't assume these are optional.
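
In code, assuming cache failures means writing the read path as if the cache can throw. This is a minimal sketch in which `cache_get` and `load_from_source` are hypothetical callables you would supply. The broad `except` is a deliberate simplification for the example.

```python
def get_with_fallback(key, cache_get, load_from_source):
    """Read through a cache, treating a cache outage like a miss."""
    try:
        value = cache_get(key)
    except Exception:
        # Assumption: any cache error is handled the same way as a miss.
        value = None
    if value is None:
        value = load_from_source(key)
    return value


def broken_cache(key):
    raise ConnectionError("cache unreachable")


print(get_with_fallback("user:1", broken_cache, lambda k: {"id": k}))
# {'id': 'user:1'}
```

The fallback keeps the request alive, but it moves the load onto whatever backs the cache. That is the dependency you now have to plan capacity for.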

The Honest Answer

You can't map every dependency. Some are emergent properties of how systems interact. Some only become relevant during specific failure scenarios.

But you can discover them faster.

Run incidents before production does. Trace requests end-to-end. Alert on the things you're not expecting to fail. When something breaks, update your diagram.

Most outages spread through things you didn't know existed. The goal isn't to prevent that.

It's to find out what you don't know before the customers do.

