kol

Posted on Jun 25

I Thought It Was a "Quick Fix" — That 15-Minute Change Cost Me 3 Days of Debugging

#programming #debugging #production #career

We've all said it. We've all believed it.

"It's just one line. Should take 15 minutes, max."

Here's the story of the time I was wrong — and the checklist I now use before touching anything in production.

The Setup

It was Thursday afternoon. A support ticket came in: users on the free tier weren't seeing their usage stats. The paid tier worked fine. The dashboard looked broken for a chunk of our user base.

I found the offending code in about 5 minutes:

// Before
const usage = await prisma.usage.findMany({
  where: { userId, tier: 'free' }
});

The issue was obvious — we'd recently migrated free users to a new tier label. The query was looking for tier: 'free' but the new value was 'basic'.

One-line fix. I typed it, ran the tests, and pushed.

// After (or so I thought)
const usage = await prisma.usage.findMany({
  where: { userId, tier: { in: ['free', 'basic'] } }
});

Tests passed. Deployed. Done. 15 minutes, start to finish.

I was so proud of myself.

The Crack Appears

By Friday morning, the dashboard was worse. Not just missing stats — now free tier users were seeing negative usage numbers.

Negative. Usage.

"How?" was my first question. "Why?" was my second. "How did nobody catch this?" was my third, directed at myself.

The Investigation

I spent Friday chasing ghosts:

Is it a Prisma bug? — No. Raw SQL queries showed the same numbers.
Is it a data corruption issue? — No. The raw data was clean.
Is it a timezone problem? — I wasted 2 hours on this. It wasn't.
Is the aggregation logic wrong? — Getting warmer...

Here's what I eventually found. The change I made — adding in: ['free', 'basic'] — didn't just change which records got fetched. It changed how many records got fetched.

The old code queried one table. But basic tier users had their usage data split across two tables (a migration artifact we never cleaned up). My "one-line fix" was now double-counting usage by pulling from both.

The negative numbers? A subtraction later in the pipeline assumed single-table data. With the input doubled, its result dropped below zero.
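Here's a minimal sketch of that failure mode. Everything in it is hypothetical: the `UsageRow` shape, the `remainingQuota` function, and the numbers stand in for the real pipeline, which isn't shown in this post. The point is only that a subtraction written for single-counted input goes negative once the same events arrive twice.

```typescript
// Hypothetical row shape and pipeline step; the real schema is not shown in this post.
type UsageRow = { userId: string; tier: string; units: number };

// Downstream step written back when each usage event arrived exactly once.
function remainingQuota(rows: UsageRow[], quota: number): number {
  const used = rows.reduce((sum, row) => sum + row.units, 0);
  return quota - used; // no clamp: "used" was assumed never to exceed "quota"
}

const countedOnce: UsageRow[] = [
  { userId: "u1", tier: "basic", units: 40 },
  { userId: "u1", tier: "basic", units: 30 },
];

// The same two events again, this time arriving through the legacy source.
const countedTwice: UsageRow[] = countedOnce.concat([
  { userId: "u1", tier: "free", units: 40 },
  { userId: "u1", tier: "free", units: 30 },
]);

const quotaLeftOnce = remainingQuota(countedOnce, 100);   // 100 - 70
const quotaLeftTwice = remainingQuota(countedTwice, 100); // 100 - 140

console.log(quotaLeftOnce, quotaLeftTwice); // 30 -40
```

Nothing in `remainingQuota` is wrong on its own. It only breaks when the input stops matching the assumption it was written under.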

The Root Cause

My fix treated a symptom (wrong tier label) as the whole problem. The real issues were:

We had a half-finished data migration (two tables for one concept)
The code had implicit assumptions about data shape
My tests only covered the happy path with seeded data
Nobody (including me) understood the full data flow

The one-line change was syntactically correct. It was semantically wrong.

What I Learned

1. "Quick fixes" are the most dangerous kind

When something feels trivial, that's exactly when you need to slow down. The pressure to "just fix it fast" is behind a lot of production incidents.

2. Tests don't catch what they don't know about

My unit tests passed because they used clean, seeded data. They had no idea about the migration artifact in production. I now add this to my pre-deploy checklist:

[ ] Does this change touch migrated data?
[ ] Are there duplicate/legacy data sources?
[ ] What assumptions does downstream code make?
[ ] Have I tested with production-like data?
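The "duplicate/legacy data sources" question can be turned into a quick check against a sample of real rows. This is a rough sketch under assumptions: it presumes each row carries a stable `eventId` and a `source` label naming the table it came from, neither of which is confirmed by the schema in this post, and `findCrossSourceDuplicates` is a made-up helper name.

```typescript
// Hypothetical shape: "source" names the table a row was read from.
type SourcedRow = { eventId: string; source: string; units: number };

// Returns the ids of events that show up in more than one source.
function findCrossSourceDuplicates(rows: SourcedRow[]): string[] {
  const sourcesByEvent: Record<string, string[]> = {};
  for (const row of rows) {
    const seen = sourcesByEvent[row.eventId] || [];
    if (seen.indexOf(row.source) === -1) seen.push(row.source);
    sourcesByEvent[row.eventId] = seen;
  }
  return Object.keys(sourcesByEvent)
    .filter((id) => sourcesByEvent[id].length > 1)
    .sort();
}

const sampleRows: SourcedRow[] = [
  { eventId: "e1", source: "usage", units: 40 },
  { eventId: "e2", source: "usage", units: 30 },
  { eventId: "e1", source: "usage_v2", units: 40 }, // legacy copy of e1
];

const duplicateIds = findCrossSourceDuplicates(sampleRows);
console.log(duplicateIds); // [ 'e1' ]
```

An empty result on a production-like sample doesn't prove the data is clean, but a non-empty one is a strong hint that a "one-line" query change will count something twice.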

3. Read the code around the change, not just the change

I should have asked: "What happens to this data after it's fetched?" Instead, I looked at the one line, fixed it, and moved on.

Now I trace the data flow at least two steps upstream and downstream before making any change.

4. Write down what you changed and why

When you're deep in debugging at 2 AM three days later, you need context. A commit message like "fix tier label" is useless. "Update tier query to include 'basic' users — note: basic tier has legacy data in usage_v2 table" would have saved me hours.

The Real Fix

The actual solution wasn't a one-line change. It was:

Consolidate the data — Write a migration script to merge the two tables
Update the query — Point it at the consolidated source
Add a validation test — With production-like data distribution
Document the migration — So the next person knows why things look the way they do

It took a full day. But it was the right fix, not the fast one.
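The consolidation step can be sketched in memory like this. It is an illustration, not the migration script itself: `UsageEvent`, `consolidateUsage`, and the `eventId` key are assumptions, and a real merge would run against the database rather than arrays. The idea is one row per event, with the current table winning when both tables hold a copy.

```typescript
// Hypothetical shape; "eventId" is an assumed stable key shared by both tables.
type UsageEvent = { eventId: string; userId: string; units: number };

// Merge two sources into one list with at most one row per eventId.
// When both contain the same event, the row from `preferred` is kept.
function consolidateUsage(preferred: UsageEvent[], legacy: UsageEvent[]): UsageEvent[] {
  const byId: Record<string, UsageEvent> = {};
  for (const row of legacy) byId[row.eventId] = row;
  for (const row of preferred) byId[row.eventId] = row; // overwrites legacy copies
  return Object.keys(byId).sort().map((id) => byId[id]);
}

function totalUnits(rows: UsageEvent[]): number {
  return rows.reduce((sum, row) => sum + row.units, 0);
}

const currentTable: UsageEvent[] = [
  { eventId: "e1", userId: "u1", units: 40 },
  { eventId: "e2", userId: "u1", units: 30 },
];
const legacyTable: UsageEvent[] = [
  { eventId: "e2", userId: "u1", units: 30 }, // duplicate of a current row
  { eventId: "e3", userId: "u1", units: 5 },  // exists only in the legacy table
];

const consolidatedRows = consolidateUsage(currentTable, legacyTable);
console.log(consolidatedRows.length, totalUnits(consolidatedRows)); // 3 75
```

Concatenating the two tables would have reported 105 units here. Keying on the event keeps the legacy-only row and drops the duplicate, which is the property the validation test should pin down.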

My New "Quick Fix" Checklist

Before I touch anything in production now, I run through this:

[ ] What's upstream? (What feeds this code?)
[ ] What's downstream? (What consumes this output?)
[ ] What assumptions am I making? (Write them down, verify each)
[ ] Are there edge cases in production data that tests don't cover?
[ ] Would someone else understand this change from the commit message?

It adds 5 minutes to every fix. And it's saved me from at least three more multi-day debugging sessions since then.
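One way to make "write them down, verify each" stick is to put an assumption in the code as a guard that fails loudly. The sketch below is hypothetical: `UsageSummary` and `assertUsageAssumptions` are invented names, and treating "used above quota" as an error is itself an assumption that would be wrong for any plan that allows overage.

```typescript
// Hypothetical summary shape produced by the aggregation step.
type UsageSummary = { userId: string; used: number; quota: number };

// Fails loudly if a written-down assumption does not hold,
// instead of letting an impossible number reach the dashboard.
function assertUsageAssumptions(summary: UsageSummary): UsageSummary {
  if (!isFinite(summary.used) || summary.used < 0) {
    throw new Error("used must be a non-negative number for " + summary.userId);
  }
  if (summary.used > summary.quota) {
    throw new Error("used exceeds quota for " + summary.userId + "; possible double counting");
  }
  return summary;
}

// A summary that satisfies both assumptions passes through unchanged.
const healthySummary = assertUsageAssumptions({ userId: "u1", used: 70, quota: 100 });

// A doubled total trips the guard before any subtraction runs.
let guardMessage = "";
try {
  assertUsageAssumptions({ userId: "u1", used: 140, quota: 100 });
} catch (err) {
  guardMessage = err instanceof Error ? err.message : String(err);
}
console.log(guardMessage);
```

A guard like this turns a confusing dashboard number into an error that names the broken assumption, which is a much better starting point for debugging.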

The cheapest bugs are the ones you prevent. The most expensive ones start with "it's just a quick fix."

What's your worst "quick fix" story? I'd love to hear I'm not alone in this. 🙃
