Sahil Singh

Posted on Feb 8 • Edited on Mar 5 • Originally published at getglueapp.com

Call Graphs That Prevent Production Incidents

#devtools #programming #architecture #ai

The most expensive bugs aren't the ones caught in code review. They're the ones that pass code review because the reviewer didn't know about a dependency three layers deep.

The Pattern

Engineer changes Function A
Code reviewer checks Function A — looks good
Function A is called by Service B (not in the PR diff)
Service B passes the result to Handler C (also not in the diff)
Handler C has a null check that assumes Function A's old return type
Deploy to production
Handler C crashes at 2 AM
On-call engineer spends 4 hours tracing the issue back to the change in Function A

This isn't a code quality problem. The change to Function A was correct. The code review was thorough. The bug exists in the gap between what the reviewer can see (the diff) and what the system actually does (the full call graph).
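The pattern above can be reproduced in a few lines. This is a minimal sketch, not code from a real incident: function_a, service_b, handler_c, and the sample data are hypothetical stand-ins for Function A, Service B, and Handler C, and the assumed change is that Function A now returns None where it used to return an empty string.

```python
from typing import Optional

# Hypothetical stand-ins for Function A, Service B, and Handler C.

def function_a(user_id: int) -> Optional[str]:
    """Changed in the PR: now returns None for unknown users (it used to return "")."""
    names = {1: "ada"}
    return names.get(user_id)

def service_b(user_id: int) -> Optional[str]:
    """Not in the PR diff: passes Function A's result through unchanged."""
    return function_a(user_id)

def handler_c(user_id: int) -> str:
    """Not in the PR diff: its emptiness check was written for the old return type."""
    name = service_b(user_id)
    if name == "":           # covers the old "" sentinel, but not None
        return "UNKNOWN"
    return name.upper()      # raises AttributeError when name is None

def handler_c_fixed(user_id: int) -> str:
    """The same handler with a check that covers both "" and None."""
    name = service_b(user_id)
    if not name:
        return "UNKNOWN"
    return name.upper()
```

Every function here looks reasonable when read alone. The failure only appears when all three are read together, which is exactly what a diff-only review does not do.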

What Call Graphs Show

A call graph maps every function-to-function relationship in your codebase:

authMiddleware.validateToken()
  → sessionService.getSession()
    → redisClient.get()
    → sessionStore.validate()
      → cryptoUtils.verifySignature()
  → userService.getPermissions()
    → permissionsCache.get()
    → database.query()

This isn't just "what imports what." It approximates the runtime execution path: which functions can actually be called when this code runs.
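One simple way to hold such a graph in memory is an adjacency map from each function to the functions it calls. The sketch below encodes the example above that way and renders it back as an indented tree. The CALL_GRAPH dictionary and the render helper are illustrative assumptions, not the output format of any particular tool.

```python
# Adjacency map: caller -> list of callees, in call order.
CALL_GRAPH = {
    "authMiddleware.validateToken": [
        "sessionService.getSession",
        "userService.getPermissions",
    ],
    "sessionService.getSession": ["redisClient.get", "sessionStore.validate"],
    "sessionStore.validate": ["cryptoUtils.verifySignature"],
    "userService.getPermissions": ["permissionsCache.get", "database.query"],
}

def render(graph, root, depth=0, path=()):
    """Return the call tree under `root` as a list of indented lines."""
    prefix = "  " * depth + ("→ " if depth else "")
    lines = [prefix + root + "()"]
    if root in path:
        # Recursive call chain: print the node once more, but do not expand it.
        return lines
    for callee in graph.get(root, []):
        lines.extend(render(graph, callee, depth + 1, path + (root,)))
    return lines

if __name__ == "__main__":
    print("\n".join(render(CALL_GRAPH, "authMiddleware.validateToken")))
```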

Blast Radius Analysis

Before any code change, you should know:

Direct callers: What functions call the function you're changing?
Transitive callers: What calls those callers? Follow the chain up.
Downstream effects: What does your function call, and what do those functions affect?
Shared state: Does your function read or write state that other functions depend on?

The union of these is the "blast radius" — everything that could potentially be affected by your change.
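The first three questions reduce to graph reachability, so they can be sketched with a breadth-first search over the call graph and its inverse. The code below is an illustration under assumptions: it reuses the example graph from earlier, the function names blast_radius, invert, and reachable are made up for this sketch, and shared state is not modeled because it needs data-flow information that a plain call graph does not carry.

```python
from collections import deque

# Example graph: caller -> callees.
CALLS = {
    "authMiddleware.validateToken": [
        "sessionService.getSession",
        "userService.getPermissions",
    ],
    "sessionService.getSession": ["redisClient.get", "sessionStore.validate"],
    "sessionStore.validate": ["cryptoUtils.verifySignature"],
    "userService.getPermissions": ["permissionsCache.get", "database.query"],
}

def invert(graph):
    """Build the reverse graph: callee -> set of direct callers."""
    callers = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    return callers

def reachable(edges, start):
    """All nodes reachable from `start` by following `edges`, excluding `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def blast_radius(graph, target):
    """Direct callers, transitive callers, and downstream callees of `target`."""
    callers = invert(graph)
    return {
        "direct_callers": set(callers.get(target, ())),
        "transitive_callers": reachable(callers, target),
        "downstream": reachable(graph, target),
    }
```

For example, blast_radius(CALLS, "sessionStore.validate") reports sessionService.getSession as the only direct caller, adds authMiddleware.validateToken as a transitive caller, and lists cryptoUtils.verifySignature downstream.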

How This Prevents Incidents

Before: Hope-Based Deployment

Make a change
Run the tests that exist (which may not cover transitive paths)
Get code review from someone who knows this file (but maybe not downstream consumers)
Deploy and hope

After: Informed Deployment

Make a change
Call graph shows: "This function is called by 7 services, 3 of which have null checks on the return value"
Check those 3 null checks — discover one assumes non-null
Fix the null check in the same PR
Deploy with confidence

The difference: 10 minutes of call graph analysis vs. 4 hours of production incident response.
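A check like this could run automatically on every pull request. The sketch below assumes two inputs that a real tool would have to produce: a reverse call graph (CALLERS) and a map from each function to the file that defines it (DEFINED_IN). Both are hypothetical sample data, as is the function name unreviewed_callers. It lists the transitive callers of the changed functions that live in files the reviewer never saw.

```python
# Hypothetical reverse call graph: function -> functions that call it.
CALLERS = {
    "functionA": ["serviceB.process"],
    "serviceB.process": ["handlerC.handle"],
}

# Hypothetical map from each function to the file that defines it.
DEFINED_IN = {
    "functionA": "lib/a.py",
    "serviceB.process": "services/b.py",
    "handlerC.handle": "handlers/c.py",
}

def unreviewed_callers(changed, files_in_diff):
    """Transitive callers of `changed` functions defined outside the PR diff."""
    seen = set()
    stack = list(changed)
    while stack:
        fn = stack.pop()
        for caller in CALLERS.get(fn, []):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    # Callers with no known file are reported too, since nobody reviewed them.
    return sorted(c for c in seen if DEFINED_IN.get(c) not in files_in_diff)

if __name__ == "__main__":
    for fn in unreviewed_callers({"functionA"}, {"lib/a.py"}):
        print("not in diff:", fn)
```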

Real-World Examples

The Session Timeout Bug:
An engineer changed the session TTL from 30 minutes to 60 minutes. Seemed harmless. But the WebSocket service had a hardcoded 30-minute reconnection timer that assumed sessions would expire at 30 minutes. After 30 minutes, WebSocket connections would reconnect but find a valid session, creating duplicate connections. Memory usage climbed until the service OOMed.

A call graph would have shown: sessionConfig.TTL → read by sessionService.createSession() AND websocketService.reconnectTimer(). The dependency is visible.

The API Response Change:
An engineer added a field to an API response. Non-breaking change, right? Except a downstream consumer parsed the response and had a strict size check for caching. The larger response exceeded the cache size limit and started bypassing the cache. API latency increased 300%.

A call graph would have shown: apiController.getUser() → consumed by mobileApp.userProfile() AND cacheProxy.intercept(). The cache dependency is visible.

Building Effective Call Graphs

Not all call graphs are equal. Useful ones need:

Cross-language support: Your API (TypeScript) calls your database (SQL) and your cache (Redis). The graph should span all layers.
Dynamic resolution: Method calls through interfaces and abstract classes need to resolve to concrete implementations.
Depth control: Full transitive call graphs can be overwhelming. Show 2-3 levels by default, expandable on demand.
Change-aware filtering: When reviewing a PR, show only the subgraph affected by the changed functions.
Historical overlay: "The last time someone changed this function, which downstream consumers broke?" This combines call graphs with regression history.

This is what Glue builds during codebase indexing. Every function call, every dependency, every transitive path — mapped and queryable. So when you're about to change a function, you know exactly what you're affecting before you deploy.
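Two of the requirements listed above, dynamic resolution and depth control, can be sketched together. This is an illustration under assumptions and does not describe how Glue or any other tool implements them: the CALLS edges, the IMPLEMENTATIONS table mapping an interface method to concrete implementations, and names such as redisSessionStore and memorySessionStore are hypothetical.

```python
# Hypothetical edges; "SessionStore.validate" stands for an interface method.
CALLS = {
    "authMiddleware.validateToken": ["sessionService.getSession"],
    "sessionService.getSession": ["SessionStore.validate"],
    "redisSessionStore.validate": ["redisClient.get"],
    "memorySessionStore.validate": [],
}

# Hypothetical table: interface method -> concrete implementations.
IMPLEMENTATIONS = {
    "SessionStore.validate": [
        "redisSessionStore.validate",
        "memorySessionStore.validate",
    ],
}

def resolve(callee):
    """Replace an interface method with its implementations, if any are known."""
    return IMPLEMENTATIONS.get(callee, [callee])

def callees_by_level(root, max_depth=2):
    """Callees of `root`, grouped by distance, expanded at most `max_depth` levels."""
    levels = []
    frontier = [root]
    seen = {root}
    for _ in range(max_depth):
        next_frontier = []
        for fn in frontier:
            for callee in CALLS.get(fn, []):
                for concrete in resolve(callee):
                    if concrete not in seen:
                        seen.add(concrete)
                        next_frontier.append(concrete)
        if not next_frontier:
            break
        levels.append(next_frontier)
        frontier = next_frontier
    return levels
```

With the default depth of 2, the traversal from authMiddleware.validateToken stops at the two concrete validate implementations. Passing max_depth=3 expands one more level and reaches redisClient.get.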

Originally published on glue.tools. Glue is the pre-code intelligence platform — paste a ticket, get a battle plan.
