
Jamie Cole
I Used Claude Code for 30 Days. Here's What Actually Broke in Production

After a month of using Claude Code on production work, here's the unvarnished truth.

What Nobody Tells You About Claude Code in Production

Claude Code is genuinely impressive for certain tasks. It's also deeply frustrating for others. Here's what I learned after 30 days of real work.

Where It Actually Works

1. Greenfield Code Generation

Give Claude Code a clear spec and it produces solid, idiomatic code. Not perfect, but 80% of the way there. The remaining 20% is faster to hand-fix than to write from scratch.

2. Debugging Well-Scoped Problems

"Here's the error, here's the context" → Claude Code is excellent. It reads the stack trace, finds the relevant code, and often identifies the root cause in under a minute.

3. Explaining Unknown Codebases

Feed it a messy legacy module and ask for a summary. The 200K context window means it can ingest entire files' worth of tangled logic and come out with something coherent.
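Before pasting a legacy module in, it's worth a sanity check that it will actually fit. A rough sketch; the 4-characters-per-token ratio is a common rule of thumb rather than an exact count, and `fits_in_context` is my own helper name:

```python
from pathlib import Path

# Rough check that a file fits in a given context window.
# ~4 characters per token is a rule of thumb, not a real tokenizer.
def fits_in_context(path: str, context_tokens: int = 200_000) -> bool:
    text = Path(path).read_text()
    approx_tokens = len(text) // 4
    return approx_tokens < context_tokens
```

If a module blows past the window, summarize it in pieces rather than letting the tool silently truncate.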


Where It Breaks

1. Multi-File Refactoring

The promise: "refactor this entire module." The reality: it changes one file, loses context of what changed in others, and you spend an hour untangling inconsistencies.

The fix: one file at a time. Painful but reliable.
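The one-file-at-a-time workflow can be scripted. A sketch, assuming the `claude` CLI is on your PATH and using its `-p` (print) flag for one-shot non-interactive runs; the `run` parameter is injectable purely so the loop can be tested without invoking the real CLI:

```python
import subprocess
from typing import Callable, Iterable, List, Optional

# Refactor files one per invocation, so each file gets a fresh session
# and no stale context bleeds from one refactor into the next.
def refactor_one_at_a_time(
    paths: Iterable[str],
    instruction: str,
    run: Optional[Callable[[List[str]], object]] = None,
) -> int:
    run = run or (lambda cmd: subprocess.run(cmd, check=True))
    count = 0
    for path in paths:
        # Fresh headless session per file via the CLI's -p flag.
        run(["claude", "-p", f"{instruction}\n\nTarget file: {path}"])
        count += 1
    return count
```

Slower than "refactor this entire module," but you never spend an hour untangling half-applied changes.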

2. Tests That Pass But Test the Wrong Thing

Claude Code writes tests that pass. Sometimes those tests assert the wrong behavior. Always review test logic, not just whether they pass.
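Here's what that failure mode looks like concretely. `apply_discount` is an invented example, not code from this project; the test below passes, but it locks in the bug:

```python
# Hypothetical buggy function: returns the discount AMOUNT,
# not the discounted price.
def apply_discount(price: float, percent: float) -> float:
    return price * percent / 100

def test_apply_discount():
    # Generated by asserting whatever the code currently returns.
    # It passes, yet 200.0 at 10% off should be 180.0, not 20.0.
    assert apply_discount(200.0, 10) == 20.0

test_apply_discount()  # green, and completely wrong
```

A green test suite tells you the code matches the tests. It doesn't tell you the tests match the spec.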

3. The Context Bleeding Problem

By hour 3 of a long session, Claude Code starts making decisions based on what it already wrote, not what's actually best. Fresh session = fresh perspective.


The Pattern That Works

```python
import subprocess

# One task, one session. Don't chain.
def claude_task(spec: str) -> str:
    # The CLI's -p flag runs a single non-interactive query in a
    # fresh session, so nothing carries over to the next task.
    result = subprocess.run(
        ["claude", "-p", spec],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# NOT: start one session, add tasks, add tasks...
# Every 5 new tasks = 1 regression from context drift
```

The Numbers

After 30 days:

  • Tasks completed: 47
  • Tasks abandoned: 12 (25%)
  • Time saved: ~8 hours (estimated)
  • Bugs introduced: 3 (caught in review)
  • Net verdict: Positive. But requires discipline.

If you're building serious production systems, you need monitoring beyond just "does it compile." I built a drift detector specifically for this: it catches when LLMs silently change behavior between runs.

Writing about real production LLM work, not demos.
