Last Tuesday I sat staring at a stack trace for forty minutes. Production was down, my pager was screaming, and I couldn't think. Not because the bug was hard — it was a null pointer in a service I'd written six months earlier. I just didn't know how to start anymore.
I used to know. I used to read a stack trace top to bottom, form three hypotheses, and start eliminating them. Somewhere over the past year that muscle quietly atrophied. I'd been outsourcing the first 80% of every debugging session to autocomplete and chat assistants, and when the assistant couldn't help (because it didn't know our codebase), I froze.
If any of that sounds familiar, this post is for you. Let's walk through the actual technical problem, why it happens, and a debugging workflow you can rebuild from scratch.
The problem: hypothesis-free debugging
The symptom looks like this. You see an error. Instead of asking what could cause this, you paste the error somewhere and wait for a suggestion. When the suggestion doesn't work, you paste more context. Eventually you're throwing the entire file at a model and praying.
This isn't a moral failing — it's a workflow problem. You've replaced hypothesis-driven debugging with pattern-matching debugging. Pattern matching works great when the pattern is in the training data. It collapses the moment the bug is specific to your system: a race condition between two of your services, a config drift on one node, a subtle off-by-one in code you wrote yesterday.
The fix isn't to stop using assistants. The fix is to rebuild the underlying loop so the assistant becomes a tool inside your process instead of a substitute for it.
Root cause: you skipped the observation step
Classical debugging is four steps, and they're not optional:
- Observe — gather facts about what's actually happening
- Hypothesize — form a specific, testable theory
- Test — run an experiment that distinguishes hypotheses
- Repeat — narrow down until you find the cause
When you skip step 1 and ask an assistant to jump straight to the fix, you're effectively rolling dice. The model doesn't know which of your hypotheses is true because you didn't form any. It guesses based on what bugs usually look like.
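Before the worked example, here's the loop as a record you can actually fill in. This is only a sketch, with field names I made up for this post, not any framework:

```python
# One pass through the loop, filled in top to bottom.
# Illustrative structure only; the field names are invented for this post.
iteration = {
    "observations": [],   # step 1: facts only, no guesses yet
    "hypothesis": None,   # step 2: one specific, testable theory
    "experiment": None,   # step 3: the cheapest test that could falsify it
    "result": None,       # what happened; carry it into the next pass (step 4)
}
```

If you can't fill in a field, you're not ready for the next one. That's the whole discipline.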
Let's rebuild this loop with a concrete example.
Step-by-step: debugging a flaky test the old-fashioned way
Suppose you have a test that passes locally and fails in CI maybe one run in five. Here's the systematic approach.
Step 1: Capture facts before forming theories
```python
# Bad: vague description that invites guessing
# "test_user_signup is flaky in CI"

# Good: a fact sheet you can reason about
facts = {
    "test_name": "test_user_signup",
    "fails_in": ["CI on linux"],
    "passes_in": ["local macOS", "CI when run alone"],
    "failure_rate": "~20% of CI runs",
    "failure_mode": "AssertionError: expected 'active', got 'pending'",
    "recent_changes": ["upgraded postgres driver", "added email worker"],
}
```
Notice what's there: where it fails, where it doesn't, and the exact error. Notice what's not: any guess about why. This is hard to do at 3am with a pager going off, which is exactly why writing it down forces the discipline.
Step 2: Form falsifiable hypotheses
Look at the asymmetry: fails in CI, passes locally, passes in CI when alone. That's three signals. Brainstorm:
- H1: Test order dependency — another test mutates shared state
- H2: Timing — CI is slower, an async operation hasn't completed by the assertion
- H3: Environment — the new postgres driver behaves differently on Linux
- H4: Resource contention — parallel workers in CI are stepping on each other
Each of these predicts a different outcome under a different experiment. That's the key word: falsifiable. If your hypothesis is "something with the database," you can't test it.
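One way to keep yourself honest: pair each hypothesis with an experiment and the outcome it predicts, in the same spirit as the fact sheet. The wording of the entries here is mine:

```python
# If you can't fill in "predicts", the hypothesis isn't falsifiable.
hypotheses = {
    "H1 test order":  {"experiment": "disable random test ordering",
                       "predicts": "flake disappears in declaration order"},
    "H2 timing":      {"experiment": "log elapsed time before the assertion",
                       "predicts": "failures show 'pending' plus a slow fetch"},
    "H3 environment": {"experiment": "pin the old postgres driver in CI",
                       "predicts": "flake disappears on the old driver"},
    "H4 contention":  {"experiment": "run the CI suite serially",
                       "predicts": "flake disappears without parallel workers"},
}
```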
Step 3: Design experiments that distinguish hypotheses
This is where most of us get lazy. Instead of running the test 50 times and hoping, design experiments that knock out hypotheses cheaply.
```bash
# Test H1: pin the random order, then disable randomization entirely
pytest --randomly-seed=1 tests/   # baseline: reproducible random order (pytest-randomly)
pytest -p no:randomly tests/      # declaration order, plugin disabled

# Test H2: add timing instrumentation around the assertion
# (see code below)

# Test H4: force serial execution of the whole suite in CI
# (the test file alone already passes, so run everything serially)
pytest -p no:xdist tests/
```
For H2, add targeted logging. Don't sprinkle prints everywhere — pick the spot where the hypothesis lives.
```python
import logging
import time

log = logging.getLogger(__name__)


def test_user_signup(client):
    response = client.post("/signup", json={"email": "a@b.com"})
    user_id = response.json()["id"]

    # The assertion that fails — capture state before checking
    start = time.monotonic()
    user = fetch_user(user_id)
    log.info("fetched user after %.3fs, status=%s", time.monotonic() - start, user.status)

    # If status is 'pending', is it because the worker hasn't run yet?
    log.info("pending jobs in queue: %d", queue_depth())

    assert user.status == "active"
```
Now when CI fails, you have data instead of vibes. The log tells you whether status was pending because the worker never ran, or ran late, or ran on a different user.
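To separate "the worker never ran" from "the worker ran late," a small polling helper turns the timing hypothesis into a number. A sketch, assuming the same fetch_user helper as above; wait_for_status is a name I made up:

```python
import time


def wait_for_status(user_id, want, timeout=10.0, interval=0.1):
    """Poll until the user's status reaches `want`; return elapsed
    seconds, or None if it never happens within `timeout`."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if fetch_user(user_id).status == want:  # same helper as the test above
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

An elapsed time that creeps up near your assertion point says the worker is slow (H2). A None says the job was never processed at all, which is a different bug entirely.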
Step 4: Bisect when the hypothesis space is too large
When you can't enumerate causes, narrow the search space mechanically.
```bash
# Find the commit that introduced the flake
git bisect start
git bisect bad HEAD
git bisect good v1.4.0   # last known stable

# Repeat the test 20 times per commit; mark good/bad
# git will binary-search the history for you
```
Bisection is the debugging skill I lost first and missed most. It's mechanical, it's boring, and it works when no clever theory does. An assistant can't bisect your private git history for you.
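If the flake reproduces often enough, you don't even have to mark commits by hand: git bisect run drives the search with any script that exits 0 for good and non-zero for bad. A sketch, reusing the test path from earlier (the script name is mine):

```python
#!/usr/bin/env python3
# flake_check.py: exit 0 if the test survives 20 runs, 1 if it ever fails.
# Drive the bisect with: git bisect run python3 flake_check.py
import subprocess
import sys

for _ in range(20):
    result = subprocess.run(["pytest", "tests/test_user_signup.py", "-x", "-q"])
    if result.returncode != 0:
        sys.exit(1)  # nonzero (1-127, except 125) means "bad" to git bisect
sys.exit(0)          # all 20 runs passed: this commit is "good"
```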
Prevention: keep the loop active
A few habits that helped me get the muscle back:
- Write the fact sheet before you ask for help. Even a three-line one. The act of articulating the problem often solves it. Yes, this is rubber duck debugging — it works because it forces step 1.
- State your hypothesis when you ask. "I think it's a race condition between the worker and the request handler" gets a useful answer. "my test is flaky" gets generic advice.
- Read the stack trace top to bottom, every time. Don't skim. The frame you skip is the frame the bug lives in.
- Time-box assistant use. Give yourself 10 minutes of unaided debugging first. If you're stuck, then bring in help — but with a specific hypothesis to test.
- Keep a bug journal. A one-line note per fix: what the symptom was, what the cause was, how you found it (see the sketch after this list). Re-reading it weekly is the best diagnostic training I know of.
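Mine is literally a text file. The format below is made up, and the entry is hypothetical (it's how this post's flake might read once solved):

```python
# A journal entry can be anything you can grep six months later.
journal_entry = (
    "20xx-xx-xx | symptom: test_user_signup flaky in CI (~20%) "
    "| cause: email worker raced the assertion "
    "| found: timing logs, then serial run"
)
```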
None of this is anti-AI. I still use assistants every day — for generating boilerplate, exploring unfamiliar APIs, drafting tests. But I stopped letting them drive the debugging session. The fact sheet, the hypothesis, the experiment — those are mine. The assistant is a fast reference book, not a co-pilot.
The debugging skill is still in there. It's just rusty. Run the loop a few times by hand and you'll feel it come back faster than you'd think.