surajrkhonde

Posted on Jul 1

The Debugging Dialogues: Uncle & Nephew Go Deep

#debugging #monitoring #performance #tools

A follow-up to the Complete Debugging Handbook — this time we go past the framework and into the actual tools, metrics, and thinking that separate a decent debugger from a great one.

Part 0: Why We're Talking Again

👦 Nephew: Uncle, I read the handbook. 7 steps, 11 layers, I get the framework. But when I actually sat down to debug something last week, I froze. I didn't know where to start looking. CPU? Memory? Logs? I just stared at the dashboard.

👨‍🦳 Uncle: That's the real gap. The framework tells you what order to think in. It doesn't tell you what to look at when you open your laptop at 2 AM and prod is on fire. That's what we're doing today. Not theory. Tools. Metrics. What number means what. What to do when you see it.

👦 Nephew: So there's a "correct" way to debug?

👨‍🦳 Uncle: No. And I want you to let go of that idea right now. There is no single correct path. Two good engineers can debug the same bug in a different order and both arrive at the truth. What is true is this: the ones who look chaotic usually aren't. They've internalized a checklist so deeply it looks like instinct. We're going to build that checklist in your head today, so eventually you stop "following steps" and just know.

Part 1: Before You Even Touch the Bug — Setup

👦 Nephew: Wait, before debugging even starts?

👨‍🦳 Uncle: Yes. This is the part beginners skip and it costs them the most time. A firefighter doesn't decide where the hydrant is while the building is burning. You need instrumentation already in place before the bug happens, or you're debugging blind.

The prerequisites for real debugging:

Dashboards you check when nothing is wrong. If the first time you look at your CPU graph is during an incident, you don't know what "normal" looks like. You can't spot an anomaly if you've never seen the baseline.
Correlation IDs on every request. One ID that travels from the frontend, through the API, through every microservice, into every log line, into every queue job. Without this, "tracing the journey" from the handbook is just a nice sentence — you have no way to actually follow one user's request across five services.
Structured logs, not print statements. We'll go deep on this — it's the single most underrated skill.
Alert thresholds set before the pain starts, not reactively added after an outage.
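
To make the correlation ID idea concrete, here is a minimal sketch of how one service might reuse an incoming ID or mint a new one, then pass it along on every outgoing call. The header name `x-correlation-id` and the helper names are assumptions for illustration; your stack may already standardize on something else (for example a tracing header).

```typescript
// Hypothetical header name; pick whatever your services already agree on.
const CORRELATION_HEADER = "x-correlation-id";

type HeaderMap = Record<string, string>;

// Non-cryptographic ID, good enough for an illustration only.
function newCorrelationId(): string {
  const rand = Math.random().toString(36).slice(2, 10);
  return `${Date.now().toString(36)}-${rand}`;
}

// Reuse the caller's ID if present; otherwise this service is the entry point and mints one.
function getOrCreateCorrelationId(incoming: HeaderMap): string {
  const existing = incoming[CORRELATION_HEADER];
  return existing && existing.length > 0 ? existing : newCorrelationId();
}

// Build the headers for any outgoing call (HTTP request, queue message metadata, and so on).
function outgoingHeaders(correlationId: string, extra: HeaderMap = {}): HeaderMap {
  return { ...extra, [CORRELATION_HEADER]: correlationId };
}

// Usage: the same ID flows from the inbound request to the downstream call.
const inboundHeaders: HeaderMap = { "x-correlation-id": "req-123" };
const requestCorrelationId = getOrCreateCorrelationId(inboundHeaders);
const downstreamHeaders = outgoingHeaders(requestCorrelationId, {
  "content-type": "application/json",
});
console.log(downstreamHeaders);
```

The important property is that no hop ever drops or regenerates the ID once it exists, so a single search term finds the whole journey.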

👦 Nephew: So debugging actually starts... before the bug exists?

👨‍🦳 Uncle: Correct. Everything from here on assumes you have at least some of this in place. If you don't — that itself is the first fix you should push for.

Part 2: CPU — The Most Misunderstood Metric

👦 Nephew: Okay, someone says "CPU is high, API is slow." Where do I even begin?

👨‍🦳 Uncle: First — don't panic at the number. High CPU is a symptom, not a diagnosis. Your job is to find out why the CPU is busy.

What high CPU usually means:

Inefficient loops or algorithms. Someone wrote an O(n²) loop where O(n) would do. Very common with array searches inside array loops.
JSON parsing/serializing huge payloads repeatedly on the hot path.
Regex catastrophic backtracking. A single badly written regex can pin a CPU core at 100% on a specific input — and it'll look completely random from the outside because it only triggers on certain strings.
Synchronous crypto or compression operations blocking the main thread in a single-threaded runtime like Node.js.
Garbage collection pressure. Ironically, a memory problem often shows up first as a CPU spike, because the garbage collector works harder and harder trying to free memory.
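
The "array search inside an array loop" case is easy to picture with a small sketch. The `Order` shape and function names below are made up for illustration; the point is only the difference between scanning a list per item and building a `Set` once.

```typescript
interface Order {
  id: number;
  userId: number;
}

// O(n * m): for every order, scan the entire blocked list.
function blockedOrdersSlow(orders: Order[], blockedUserIds: number[]): Order[] {
  return orders.filter((o) => blockedUserIds.includes(o.userId));
}

// O(n + m): build a Set once, then each membership check is O(1) on average.
function blockedOrdersFast(orders: Order[], blockedUserIds: number[]): Order[] {
  const blocked = new Set(blockedUserIds);
  return orders.filter((o) => blocked.has(o.userId));
}

// Usage: both return the same orders; only the amount of CPU work differs.
const sampleOrders: Order[] = [
  { id: 1, userId: 10 },
  { id: 2, userId: 11 },
  { id: 3, userId: 12 },
];
console.log(blockedOrdersFast(sampleOrders, [11, 12]).map((o) => o.id));
```

With a few dozen items nobody notices the difference. With tens of thousands on a hot path, the slow version is the kind of thing that shows up as one very wide frame in a CPU profile.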

How you actually find the guilty function, not just the guilty symptom:

Take a CPU profile over a fixed window (say 30 seconds) during the spike, not a random moment. Profiling when the system is calm tells you nothing.
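Look at a flame graph. The widest frames, not the tallest stacks, are where the time goes. A wide frame with nothing stacked on top of it is a function burning CPU itself, and that is usually your bottleneck. Width means time spent; height just means depth of function calls.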
Compare against a flame graph from a "healthy" period if you have one saved. Diffing two flame graphs is often faster than analyzing one from scratch.

The remedy depends on the cause, not on "add more servers":

Loop problem → fix the algorithm, don't just scale horizontally and hide it.
Regex problem → rewrite the pattern, add input length limits.
Sync operation blocking the thread → move it to a worker thread or background job.
GC pressure → that's actually a memory problem wearing a CPU costume. Which brings us to—

👦 Nephew: Memory next?

👨‍🦳 Uncle: Memory next.

Part 3: Memory — The Slow Killer

👨‍🦳 Uncle: CPU spikes are loud and immediate. Memory problems are quiet and patient. They creep up over hours or days, and by the time you notice, the server is already about to crash.

The three memory patterns you must be able to tell apart:

A memory leak. Memory usage climbs steadily and never comes back down, even after garbage collection runs, even during low traffic. This is objects being held in memory that should have been released — forgotten event listeners, growing caches with no eviction, closures holding references they shouldn't.
A normal sawtooth pattern. Memory rises, the garbage collector runs, memory drops, rises again. This is healthy. Don't panic over a sawtooth; panic if the bottom of each tooth keeps rising over time. That rising floor is the real signal of a leak.
A sudden spike. One massive object was allocated — someone loaded an entire table into memory instead of paginating it, or a file upload wasn't streamed and got buffered whole.

How to actually hunt a leak:

Take a heap snapshot when memory is "normal," take another one an hour later under similar load, and diff them. Whatever object type grew disproportionately is your suspect.
Look specifically at things that are easy to forget to clean up: cached objects with no TTL, subscriptions/listeners never unregistered, timers never cleared, closures capturing large objects unnecessarily.
Watch for the classic mistake: an in-memory cache implemented as a plain object or map with no size limit and no expiry. It works fine in dev, and then in production it grows forever because nobody thought about eviction.

Remedies:

Bound every cache. Every single one. Size limit and TTL, no exceptions.
Explicitly clean up listeners, timers, and subscriptions when the thing that created them is done.
Stream large payloads instead of buffering them fully in memory.
If GC pauses themselves are the problem (you'll see this as periodic latency spikes lining up exactly with GC runs), you may need to tune GC settings or reduce allocation rate rather than just adding memory.
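
As a rough sketch of what "size limit and TTL" can look like, here is a small in-process cache that enforces both. The class name and parameters are assumptions for illustration, and in a real service you might prefer an existing LRU library over hand-rolling this.

```typescript
class BoundedCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;

  constructor(maxEntries: number, ttlMs: number) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it instead of keeping it forever
      return undefined;
    }
    // Map keeps insertion order, so re-inserting marks this key as recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used key (the first one in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: at most 1000 entries, each valid for 60 seconds.
const profileCache = new BoundedCache<string>(1000, 60000);
profileCache.set("user:42", "cached profile");
console.log(profileCache.get("user:42"));
```

The `now` parameter exists only so the expiry logic can be tested without waiting. The design point is that memory use has a hard ceiling no matter how long the process runs.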

👦 Nephew: How do I know if a slow request is a memory problem versus a CPU problem versus something else entirely?

👨‍🦳 Uncle: Good question — that's exactly why we look at them together, never in isolation. High CPU + stable memory = computation problem. Rising memory + periodic latency spikes = GC problem. Normal CPU + normal memory but still slow = look elsewhere. Which is usually...

Part 4: The Event Loop — The Silent Bottleneck (Node.js / Single-threaded Runtimes)

👦 Nephew: I keep hearing "event loop lag" in standups. Nobody explains what it actually means.

👨‍🦳 Uncle: Think of the event loop as a single waiter serving every table in a restaurant. If that waiter gets stuck doing something slow at one table — say, manually calculating a complicated bill by hand instead of using a calculator — every other table waits, even though their food is already ready. That's event loop lag. Your CPU might look fine on average, memory might look fine, but requests are still slow because they're stuck in a queue waiting for their turn.

What causes event loop lag:

Synchronous, CPU-heavy code running on the main thread — heavy JSON parsing, sorting huge arrays, synchronous encryption.
Long-running synchronous loops that never yield control back.
Poorly written third-party libraries doing blocking operations under the hood without you realizing it.

How you detect it specifically (it hides from normal CPU/memory graphs):

There's a dedicated metric for this — event loop delay/lag — and it should be on every dashboard for a Node-based service. If p95 event loop lag starts climbing even while CPU usage looks moderate, this is your smoking gun.
Requests that are simple and fast in isolation but become slow only under concurrent load are a strong hint — it means something else is hogging the loop while your simple request waits its turn.
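
If you have never seen how this metric is produced, a timer-drift sampler is the simplest way to picture it: schedule a timer, then measure how late it actually fires. The sketch below is an illustration only, with made-up function names. Node.js also ships a built-in histogram for this (`monitorEventLoopDelay` in `perf_hooks`), which is usually the better choice in a real service.

```typescript
// Nearest-rank percentile over collected samples (p between 0 and 100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? 0;
}

// Fire a timer every intervalMs and report how late each tick was.
// If the loop is blocked by synchronous work, ticks arrive late and the lag grows.
function sampleLag(intervalMs: number, onSample: (lagMs: number) => void): () => void {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(Math.max(0, now - expected));
    expected = now + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer);
}

// Usage: sample for a short window, then report p95.
const lagSamples: number[] = [];
const stopSampling = sampleLag(20, (lagMs) => {
  lagSamples.push(lagMs);
});
setTimeout(() => {
  stopSampling();
  console.log("p95 event loop lag (ms):", percentile(lagSamples, 95));
}, 200);
```

On an idle process the number stays close to zero. Run a long synchronous loop while this is sampling and the p95 jumps, even though average CPU may still look unremarkable.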

Remedies:

Move CPU-heavy work off the main thread — worker threads or a separate worker service.
Break large synchronous loops into chunks that yield back to the event loop.
Audit third-party libraries for synchronous file or crypto operations hiding inside "async-looking" functions.
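
Breaking a loop into chunks can be sketched in a few lines. The helper names here are hypothetical, and `setTimeout(..., 0)` is used as a portable way to hand control back to the loop; it is one option among several.

```typescript
// Split an array into fixed-size slices.
function* chunks<T>(items: T[], size: number): Generator<T[]> {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Process one slice synchronously, then yield to the event loop so that
// queued requests and timers get a turn before the next slice starts.
async function processInChunks<T, R>(
  items: T[],
  size: number,
  work: (item: T) => R,
): Promise<R[]> {
  const results: R[] = [];
  for (const slice of chunks(items, size)) {
    for (const item of slice) results.push(work(item));
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return results;
}

// Usage: double 10000 numbers, 1000 at a time.
const numbersToProcess = Array.from({ length: 10000 }, (_, i) => i);
processInChunks(numbersToProcess, 1000, (n) => n * 2).then((out) => {
  console.log("processed", out.length, "items");
});
```

Chunking does not make the total work cheaper. It only caps how long any single turn of the loop is blocked, which is what other requests care about. If the work is genuinely heavy, moving it to a worker is still the better fix.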

Part 5: Database Metrics — Beyond "the query is slow"

👨‍🦳 Uncle: The handbook covered slow queries and missing indexes — good foundation. Now let's go one level deeper, because "the query is slow" is rarely the whole story in production.

Connection pool exhaustion — this is the one that fools people the most.

Symptom: queries themselves are fast when you test them directly, but requests time out.
What's actually happening: every connection in the pool is checked out and busy, so new requests queue up waiting for a free connection — and that wait time doesn't show up as "query time" in your logs, so it's invisible unless you're specifically tracking pool utilization.
Cause: connections not released properly, a slow query holding a connection longer than it should, or pool size simply too small for your concurrency.
Remedy: track pool utilization as its own metric, release connections promptly, and consider whether you actually need a bigger pool or whether you need to fix whatever is holding connections too long.

Replication lag — dangerous because it causes bugs that look like data corruption.

Symptom: user updates something, refreshes, and the update is "gone" — because their read hit a replica that hasn't caught up with the primary yet.
Remedy: read-your-writes consistency for critical paths (route the read to primary right after a write), or surface lag as a real metric with alerting so you catch it before users report "my data disappeared."

Lock contention / deadlocks — the handbook mentioned this briefly, worth expanding.

Symptom: intermittent, hard-to-reproduce errors, often only under load, often blamed on "flaky" code.
Diagnosis: look at active lock/transaction views during the incident window — you're looking for transactions waiting on each other, or one long transaction holding a lock far longer than expected.
Remedy: consistent ordering of table access across all transactions, shorter transactions, avoiding unnecessary locking (e.g. SELECT ... FOR UPDATE when a plain read would do).

Query count vs query time — track both, they tell different stories.

Query time up, count flat → something changed about the data or the query plan (missing index after a schema change, table grew past a size where the old index strategy stopped working, a stats/analyze job hasn't run).
Count exploding while individual query time is fine → almost always N+1, a caching layer that stopped working, or a retry storm.
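
The "count exploding while each query is fast" pattern is easiest to see with a toy example. `FakeDb` below is an in-memory stand-in that only counts round trips; it is not a real driver, and all names are invented for illustration.

```typescript
interface Post {
  id: number;
  authorId: number;
}

interface Author {
  id: number;
  displayName: string;
}

// In-memory stand-in for a database that counts how many queries it receives.
class FakeDb {
  queryCount = 0;
  private authors: Author[] = [
    { id: 1, displayName: "Asha" },
    { id: 2, displayName: "Ravi" },
  ];

  authorById(id: number): Author | undefined {
    this.queryCount++;
    return this.authors.find((a) => a.id === id);
  }

  authorsByIds(ids: number[]): Author[] {
    this.queryCount++;
    const wanted = new Set(ids);
    return this.authors.filter((a) => wanted.has(a.id));
  }
}

// N+1 shape: one author query per post. Each query is fast, but the count grows with the page size.
function authorNamesNPlusOne(db: FakeDb, posts: Post[]): string[] {
  return posts.map((p) => db.authorById(p.authorId)?.displayName ?? "unknown");
}

// Batched shape: one query for all distinct authors, then an in-memory lookup.
function authorNamesBatched(db: FakeDb, posts: Post[]): string[] {
  const ids = Array.from(new Set(posts.map((p) => p.authorId)));
  const byId = new Map<number, string>();
  for (const author of db.authorsByIds(ids)) byId.set(author.id, author.displayName);
  return posts.map((p) => byId.get(p.authorId) ?? "unknown");
}

// Usage: same result, very different query counts.
const samplePosts: Post[] = [
  { id: 1, authorId: 1 },
  { id: 2, authorId: 2 },
  { id: 3, authorId: 1 },
  { id: 4, authorId: 2 },
];
const dbA = new FakeDb();
const dbB = new FakeDb();
const namesA = authorNamesNPlusOne(dbA, samplePosts);
const namesB = authorNamesBatched(dbB, samplePosts);
console.log("N+1 queries:", dbA.queryCount, "batched queries:", dbB.queryCount);
```

This is why tracking query count per request is worth the effort: a per-query latency graph would show nothing wrong with the first version.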

Part 6: Redis / Cache Metrics — Where "It Works on My Machine" Goes to Die

👦 Nephew: The handbook's cache section was mostly about stale data. What else is there?

👨‍🦳 Uncle: Stale data is the beginner bug. Here's what bites you once you're past that.

Hit rate — your single most important cache health number.

As a rough rule of thumb, a healthy read-heavy cache hits well above 80%, though the right target depends on your workload, so judge it against your own baseline. If it drops, either your cache keys are wrong (too specific, so nothing ever matches twice), your TTL is too short, or traffic patterns shifted and you're now caching things nobody asks for twice.
A sudden hit-rate crash right after a deploy usually means someone changed a cache key format without a migration plan, silently invalidating the entire cache.
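
One practical detail when computing hit rate: Redis reports `keyspace_hits` and `keyspace_misses` in `INFO stats` as cumulative counters, so a lifetime ratio can look healthy long after the cache has stopped working. A small sketch of the difference, assuming you already collect two snapshots of those counters (the type and function names are illustrative):

```typescript
interface CacheCounters {
  keyspaceHits: number;
  keyspaceMisses: number;
}

// Ratio over the whole lifetime of the counters.
function lifetimeHitRate(c: CacheCounters): number {
  const total = c.keyspaceHits + c.keyspaceMisses;
  return total === 0 ? 0 : c.keyspaceHits / total;
}

// Ratio over the window between two snapshots. This is the number that
// actually moves when a deploy silently invalidates the cache.
function windowHitRate(before: CacheCounters, after: CacheCounters): number {
  const hits = after.keyspaceHits - before.keyspaceHits;
  const misses = after.keyspaceMisses - before.keyspaceMisses;
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Usage: counters sampled just before a deploy and a few minutes after.
const beforeDeploy: CacheCounters = { keyspaceHits: 9000, keyspaceMisses: 1000 };
const afterDeploy: CacheCounters = { keyspaceHits: 9100, keyspaceMisses: 1900 };
console.log("lifetime:", lifetimeHitRate(afterDeploy).toFixed(2));
console.log("last window:", windowHitRate(beforeDeploy, afterDeploy).toFixed(2));
```

In this made-up example the lifetime figure still sits above 80% while the recent window has collapsed to around 10%, which is exactly the post-deploy crash described above.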

Eviction rate — this one is sneaky because everything still "works," just slower.

If Redis is full and evicting keys under memory pressure, your app doesn't error out — it just quietly starts hitting the database more, and everything gets a bit slower everywhere, with no single obvious failure point.
Remedy: right-size Redis memory for your working set, use appropriate eviction policies, and don't cache things that are rarely re-read.

Hot keys — one specific key gets hammered so hard it becomes a bottleneck by itself.

Symptom: overall cache metrics look fine, but specific requests (often related to one popular resource — a trending post, a popular product) are slow, and one Redis node/shard looks disproportionately loaded.
Remedy: add a short local in-process cache layer in front of Redis for extremely hot keys, or shard/replicate that specific key's data.

Cache stampede / thundering herd — the one that causes outages, not just slowness.

Symptom: a popular cache key expires, and suddenly hundreds of concurrent requests all miss the cache at the same instant and all hammer the database simultaneously trying to rebuild it — sometimes enough to take the database down.
Remedy: only allow one request to actually rebuild the cache while others wait for that result (a "single-flight" pattern), or stagger TTLs so keys don't all expire at once.
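
Both remedies fit in a short sketch. `SingleFlight` below deduplicates concurrent rebuilds inside one process, and `jitteredTtl` spreads expiry times. The names and the 10% spread are assumptions for illustration. Across many app instances you would still see one rebuild per instance unless you add a shared lock.

```typescript
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  // If a rebuild for this key is already running, join it instead of starting another.
  run(key: string, loader: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing;
    const started = loader().finally(() => {
      this.inFlight.delete(key); // allow a fresh rebuild next time
    });
    this.inFlight.set(key, started);
    return started;
  }
}

// Spread expiry times by +/- spread (as a fraction) so keys written together
// do not all expire in the same instant.
function jitteredTtl(
  baseSeconds: number,
  spread: number = 0.1,
  random: () => number = Math.random,
): number {
  const factor = 1 + (random() * 2 - 1) * spread;
  return Math.round(baseSeconds * factor);
}

// Usage: two concurrent misses for the same key trigger one rebuild.
let rebuildCalls = 0;
const flights = new SingleFlight<string>();
const rebuildFromDb = (): Promise<string> => {
  rebuildCalls++;
  return new Promise((resolve) => setTimeout(() => resolve("fresh value"), 10));
};
const firstCaller = flights.run("post:42", rebuildFromDb);
const secondCaller = flights.run("post:42", rebuildFromDb);
Promise.all([firstCaller, secondCaller]).then((values) => {
  console.log(values, "rebuilds:", rebuildCalls);
});
```

The second caller gets the very same promise as the first, so the database sees one query regardless of how many requests missed at once.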

Part 7: Third-Party / External API Slowness — The Bottleneck You Don't Control

👦 Nephew: What about when the slow thing is someone else's API — payment gateway, SMS provider — and I can't fix their code?

👨‍🦳 Uncle: This is a different kind of debugging — you're not finding a root cause to fix, you're finding a way to protect yourself from something you can't fix.

First, confirm it's actually them:

Log the exact time spent waiting on that specific external call, separately from your own processing time. Don't guess — measure the boundary precisely.
Check if the slowness correlates with their known status pages or incident history, and whether it's affecting all calls to them or just specific endpoints/payload types.

Common causes on the third-party side:

Their service degrading during their own peak hours.
Rate limiting on their end causing you to queue or retry.
Network-level issues between you and them (different region, no direct peering).

What you actually control and should build regardless:

Timeouts on every external call, tuned to a sane value — an external call with no timeout can hang a request (and hold a connection, and hold memory) indefinitely.
Circuit breakers — after a certain failure/timeout rate, stop calling the failing service for a cooldown period instead of letting every single request wait and fail one by one.
Retries with backoff, and a retry limit — but be careful, blind retries can turn a minor slowdown on their end into a self-inflicted traffic storm on yours.
Fallback behavior — can the user still do something useful if the payment gateway is slow (e.g., queue the payment and confirm asynchronously) rather than the whole page hanging?
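
A circuit breaker is less mysterious once you see it as a three-state machine. The sketch below is deliberately simplified: it trips on consecutive failures rather than a failure rate, and lets any request through while half-open. The thresholds and names are assumptions for illustration, and production libraries handle more edge cases.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Ask before every external call: should we even try right now?
  allowRequest(now: number = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown over: allow a trial call
      return true;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // Count timeouts as failures too.
  recordFailure(now: number = Date.now()): void {
    this.consecutiveFailures++;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open"; // stop calling the dependency for a cooldown period
      this.openedAt = now;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}

// Usage: trip after 3 failures in a row, retry after 30 seconds.
const paymentBreaker = new CircuitBreaker(3, 30000);
if (paymentBreaker.allowRequest()) {
  // call the external API here, then recordSuccess() or recordFailure()
  paymentBreaker.recordSuccess();
}
console.log(paymentBreaker.currentState());
```

When `allowRequest` returns false, the caller goes straight to the fallback path instead of waiting on a timeout, which is what keeps your own threads, connections, and memory free during someone else's outage.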

👨‍🦳 Uncle: The mindset shift here is important — for internal bugs you're hunting a root cause. For external dependencies, you're designing resilience, because you may never get a root cause at all.

Part 8: Logs — The Part Everyone Underrates

👦 Nephew: We keep coming back to logs in every section. Why are they so important?

👨‍🦳 Uncle: Because metrics tell you that something is wrong and roughly where. Logs tell you exactly what happened, to which user, at which line, with which data. Metrics narrow the search. Logs solve the case. Without good logs, you can identify the guilty layer and still be stuck for hours trying to find the actual line of failure.

What separates a useless log from a useful one:

A useless log is a sentence like "processing order" with no other information — it tells you that something happened, not which something, for whom, or how long it took.
A useful log carries structured fields: the correlation/request ID, the user ID, the specific entity being acted on, the duration, the outcome. Every one of those fields should be independently searchable — you should be able to ask "show me everything that happened for this one request ID across every service" and get a complete answer.

The habit of a great debugger, log-wise:

Log at every boundary — right before and right after calling the database, right before and after calling an external API, right when a background job starts and finishes. Boundaries are where things actually go wrong; the middle of your own business logic rarely is the mystery.
Log the decision, not just the action. Not just "sent email" but "sent email because user_type=premium and feature_flag=X was true" — because six months later when the logic seems wrong, you need to know why the system decided what it decided, not just what it did.
Never log secrets or full sensitive payloads, but do log enough of the shape of the data (IDs, sizes, types) to reconstruct what happened without needing to log the sensitive content itself.
Distinguish log levels honestly. If everything is logged at "error" level, nothing is actually alarming anymore, and real errors get lost in noise. If everything is "info," searching becomes a haystack. Reserve "error" for things that need a human, "warn" for things worth noticing, "info" for the normal shape of a healthy request, "debug" for detail you only want during active investigation.
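
Here is a minimal sketch of what a structured log line with these habits might look like. The field names, the list of sensitive keys, and the `formatLog` helper are all assumptions for illustration; an established logging library would normally do this job.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
type LogValue = string | number | boolean | null;

// Hypothetical deny-list; real systems usually need a more careful policy.
const SENSITIVE_KEYS = new Set<string>(["password", "token", "cardNumber"]);

// Produce one JSON object per line so every field is independently searchable.
function formatLog(
  level: LogLevel,
  message: string,
  correlationId: string,
  fields: Record<string, LogValue>,
  now: Date = new Date(),
): string {
  const safe: Record<string, LogValue> = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? "[redacted]" : value;
  }
  return JSON.stringify({ ts: now.toISOString(), level, msg: message, correlationId, ...safe });
}

// Usage: log the decision and the duration, not just the action.
console.log(
  formatLog("info", "sent email", "req-123", {
    userId: 42,
    reason: "user_type=premium and feature_flag=X",
    durationMs: 18,
    token: "secret-value",
  }),
);
```

Compare that with a bare "sent email" string: the structured version answers who, why, how long, and which request, and it never writes the token to disk.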

How logs actually get used in a real investigation:

Start wide: search for the correlation ID or user ID across all services for the time window of the complaint.
Look at the request from entry to exit. At which log line does the story stop making sense — where does the log say "should return X" but the user saw "Y"? That gap is where your bug lives.
Cross-reference logs with metrics: does this failing request line up with a period of high CPU, high memory, connection pool exhaustion, or a cache eviction spike you saw earlier? Logs give you the specific instance; metrics give you the systemic condition that caused it.

👨‍🦳 Uncle: If I had to leave you with one rule about logs — log so well that a stranger, with no context about your system, could read the logs for one request and understand exactly what happened and why. If your own teammate can't do that at 3 AM without pinging you, your logging isn't good enough yet.

Part 9: How a Great Debugger Actually Thinks

👦 Nephew: Okay Uncle, tie it together for me. When you personally sit down for a hard bug, what's actually going through your head?

👨‍🦳 Uncle: In order, roughly:

"What changed?" Almost every production bug traces back to something that changed — a deploy, a config change, a traffic pattern shift, a data migration, a third-party update. I check the deploy timeline and recent changes before I touch a single tool.
"Is this systemic or specific?" One user or everyone? One endpoint or the whole system? This single question, from the handbook, is still the fastest way to cut the investigation space in half or more.
"What do my metrics say right now versus normal?" I pull up CPU, memory, event loop lag, DB pool usage, cache hit rate, and error rate side by side, for the exact incident window, compared against a healthy baseline. I'm looking for which one moved first — the first mover is usually closer to the root cause than the ones that moved after, because those are often just downstream effects.
"Follow the correlation ID." Once I have a suspect area, I pick one specific failing request and trace it through logs end to end, boundary by boundary, until the story stops matching what the user saw.
"What's the actual root cause, and is it really the root, or just the nearest visible layer?" This is where I ask "why" again on my own answer — "cache wasn't invalidated" is not fully root cause until I know why it wasn't invalidated — was it a missed code path, a race condition, a deploy order issue? Stop one level too early and the bug comes back in three weeks wearing a different symptom.
"Fix, then ask what else this fix touches." The handbook's "validate" step — I genuinely think about what else shares this code path before I call it done.

👦 Nephew: And if none of that gives you the answer?

👨‍🦳 Uncle: Then I add more instrumentation — a log line, a timing metric — exactly at the boundary I'm still unsure about, deploy it, and wait for the bug to happen again. Sometimes debugging isn't about finding the answer in your current data. It's about admitting your current data isn't good enough yet, and fixing that first.

Part 10: There Is No Perfect Way — But There Is a Reliable One

👨‍🦳 Uncle: Last thing, and I want you to really hear this. There is no checklist that solves every bug in the same order every time. Sometimes you'll start at logs because the error message is obvious. Sometimes you'll start at metrics because nothing obvious jumps out. Sometimes you'll start at "what changed" because you just deployed ten minutes ago. That flexibility isn't a lack of discipline — it is the discipline, once you've internalized what each tool tells you.

What never changes is the underlying posture:

Measure before you guess.
Narrow scope before you dive deep.
Trust the data over your assumptions, especially when the data surprises you.
Find the root cause, not the nearest symptom.
Every time you debug, leave the system easier to debug next time.

👦 Nephew: That last one — "make it easier to debug next time" — that's new.

👨‍🦳 Uncle: It's the most important one. Every bug you solve without adding a metric, a log line, or an alert for it is a bug you're volunteering to solve blind again someday. The best debuggers aren't the ones who are fastest under pressure. They're the ones who make sure they're never under that much pressure again.

End of dialogue. Go debug something.

DEV Community