Naveen Chandra Adhikari

How I Cut API Latency by 95% by Fixing One Hidden Loop ❤️‍🔥

It started with a Slack message from our support team.

“Hey, the organization dashboard feels really slow for some customers. Like… 3–4 seconds slow.”

I sighed🥴. Another performance ticket.

I hit the endpoint locally:
280ms. Fine.

Checked staging:
~320ms. Also fine.

“Probably their network,” I thought, closing the tab — a classic developer reflex.

But the messages didn’t stop.

Every complaint had the same pattern:
“It only happens for larger organizations.”
The ones with lots of workspaces. Lots of files. Real data.

By the end of the week, I couldn’t brush it off anymore.

There were no errors. No crashes. No alarming CPU spikes. The service was technically healthy. And yet, production latency had quietly crept from a respectable 200ms to an uncomfortable 3–4 seconds during peak hours.

The code responsible for this endpoint had been touched recently. It was clean. Idiomatic Go. The kind of code you skim during a review and immediately trust. The previous developer was solid — no obvious mistakes, no red flags.

Still, something felt off.

So I opened SigNoz, filtered by that endpoint, and clicked into a trace from one of the slowest requests.

What I found wasn’t a bug.
It wasn’t bad infrastructure.
It wasn’t a missing index.

It was a pattern.

And it was silently strangling our database.

The Silent Killer

The problem I was looking at wasn’t a bug. There was no stack trace. No failing test. No error log to chase.

It was a pattern — one of the most dangerous ones in backend systems.

The N+1 query problem.😫

If you’ve never heard of it, here’s the short version: Your code runs one query to get a list of items, and then loops through that list to run N additional queries for each item.

In Go, the code that causes this looks absolutely correct.

// Fetch all workspaces for an organization
workspaces, err := repo.GetWorkspaces(orgID)
if err != nil {
    return err
}
// For each workspace, fetch its storage stats
for _, ws := range workspaces {
    storage, err := repo.GetWorkspaceStorage(ws.ID) // 🔴 This is the problem
    if err != nil {
        return err
    }
    // ... do something with storage
}

It reads beautifully.
It’s clean.
It’s idiomatic Go.
And it’s absolutely destroying your database.

Here’s what’s actually happening under the hood:
Let’s say an organization has 100 workspaces.

Query #1: SELECT * FROM workspaces WHERE org_id = 123 (gets all 100 workspaces)

Query #2: SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 1

Query #3: SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 2

Query #4: SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 3

...

Query #101: SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 100
That’s the N+1 problem 👀
One query to get the list, plus N queries to get the related data for each item.

Why does this feel so innocent?
Because it works perfectly fine in development.

With 3 workspaces, you’d never notice.
With 200 workspaces, you’ve got a 2–3 second problem.

Each individual query is fast — around 8–10ms.

But you’re paying that cost:

200 times
plus network round trips, connection pool contention, and database CPU overhead

The math is brutal:

200 queries × 10ms = 2000ms of pure database time🥳
Add network latency and runtime overhead
→ suddenly you’re staring at 3+ seconds for an endpoint that should be sub-100ms

The Investigation

I’ll admit something uncomfortable:

I didn’t find this problem in a code review.
I didn’t catch it in testing.
I didn’t even suspect it at first.

I found it because I finally stopped guessing and followed a single request all the way through.

Dead End #1: Logs
My first instinct was the obvious one: logs.😼

Maybe there was a slow query.
Maybe a timeout.
Maybe a retry loop silently hammering the database.

I filtered logs for the slow endpoint and scrolled through hours of requests.

Nothing. Every query completed in 8–12ms.
No timeouts.
No retries.
No errors.

The logs weren’t lying — they were just telling me the wrong story.

Each query was fast in isolation.
The problem was the volume.

Dead End #2: Basic Metrics
Next, I checked our standard metrics dashboard.

CPU: normal
Memory: normal
Error rate: 0%
Request rate: steady
Database connections: within limits
Everything looked… healthy.

The only anomaly was API latency.
Not spiking.
Not flapping.
Just consistently bad — like a fever that wouldn’t break.

At this point, I was stuck 😔

The Breakthrough: Tracing the Request
Frustrated, I opened our APM and did the thing I should have done first.

I filtered by the problematic endpoint.
Sorted by slowest requests.
Clicked into a single trace.

The moment the waterfall view loaded, the illusion of a “healthy” service vanished.

Normally, a good trace looks boring:

a short HTTP handler
one or two database calls
done
This trace looked like a skyscraper collapsing.

The Moment It Clicked
At the very top, I saw the total request time: ~3.2 seconds.

Below it:

a tiny sliver of Go application logic
and then… a massive wall of database calls
I expanded the database section.

The same query shape repeated again and again.

[DB] SELECT * FROM workspaces WHERE org_id = ? (15ms)
[DB] SELECT used_bytes FROM workspace_storage_view... (8ms)
[DB] SELECT used_bytes FROM workspace_storage_view... (7ms)
[DB] SELECT used_bytes FROM workspace_storage_view... (9ms)
[DB] SELECT used_bytes FROM workspace_storage_view... (8ms)
...
(repeats 200+ times)
I started counting.

By query #50, I stopped.

I scrolled to the bottom.

217 database queries.
For one API request.

That was my smoking gun.🚨

What the Trace Made Impossible to Ignore
The trace revealed three things that logs and metrics never could.

1. Query Count Explosion
This request executed 217 queries.
Most of our other endpoints averaged 3–5.

A 40× outlier.

2. Latency Waterfall
The breakdown was brutal:

Application code: ~45ms
Database queries: ~2,800ms
Network overhead: ~200ms
Over 90% of the request time was spent just talking to the database.

3. The Pattern
SigNoz groups similar queries together.
It showed me:

SELECT … FROM workspaces — 1 execution
SELECT … FROM workspace_storage_view — 216 executions
One query to get the list.
216 queries to get the details.

Classic N+1.

Why This Was Invisible Before
Without tracing, I would have never found this.

The N+1 problem doesn’t show up in slow query logs because no single query is slow.
It doesn’t show up in error logs because nothing is failing.
It doesn’t show up in CPU metrics because the database can handle it (until it can’t).

You need traces.
You need to see the full request lifecycle.
You need to count the queries.

The Realization😶‍🌫️

I sat there for a full minute, just staring at the trace.

217 queries.
One endpoint.
Zero errors.

I thought about the developer who wrote this code.

They weren’t careless.
They weren’t inexperienced.
They just never saw what I was seeing.

They tested with 5 workspaces.
I was staring at an organization with 216.

That’s the brutal truth about N+1 problems:

They’re invisible… until they’re not. And tracing was the flashlight that finally illuminated the dark corner where this one had been hiding.

The Real Mole
Once I saw the trace, I didn’t need to hunt for long.

I searched for the endpoint handler, followed it into the service layer, and landed in the repository code.

There it was.

workspaces, _ := repo.GetWorkspacesByOrg(ctx, orgID)
for _, ws := range workspaces {
    used, _ := repo.GetWorkspaceStorageUsed(ctx, ws.ID)
    total += used
}

That loop was the entire problem.

One query to fetch the list.
One query per workspace to fetch storage.
Repeated hundreds of times.

The code wasn’t wrong.
It wasn’t sloppy.
It wasn’t even “bad.”

It just assumed the number of workspaces would stay small.

In production, that assumption quietly collapsed.

Once I saw this loop with the trace numbers in mind, everything clicked. There was no mystery anymore, just a very expensive pattern hiding behind clean code.

The fix wasn’t complicated.

It was just time to stop asking the database the same question over and over again.

The Fix
Once I saw the trace, I knew exactly what needed to change.

The solution wasn’t complicated.
It just required a different way of thinking.

Instead of asking the database the same question 216 times, I needed to ask it once.

The Strategy🤗

The fix came down to two deliberate changes:

1. Stop querying inside a loop: fetch all workspace storage data in a single database call.
2. Actually use the database view we already had: workspace_storage_view, which already defined the per-workspace storage aggregation. We just weren’t using it efficiently.

That was it.

The code change took about 20 minutes.

The impact was immediate.

The After Code (Conceptually)
Instead of this:

1. Fetch workspaces
2. Loop
3. Query storage per workspace (N times)

The flow became:

1. Fetch all workspace storage usage in one query
2. Aggregate in memory

The loop didn’t disappear — but now it was looping over in-memory data, not triggering database calls.

That distinction matters more than it sounds.
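
Concretely, the new shape looked something like this. It’s a sketch rather than our exact production code: the Repo receiver, the method name, the int64 IDs, and the r.db handle are assumptions, but the idea is one round trip for the whole org, then a loop over memory.

// One query: per-workspace usage for the entire org, read from the view.
// (Method name, ID types, and the r.db *sql.DB field are illustrative assumptions.)
func (r *Repo) GetOrgWorkspaceStorage(ctx context.Context, orgID int64) (map[int64]int64, error) {
    const q = `
        SELECT w.id, COALESCE(v.used_bytes, 0)
        FROM workspaces w
        LEFT JOIN workspace_storage_view v ON v.workspace_id = w.id
        WHERE w.org_id = $1`

    rows, err := r.db.QueryContext(ctx, q, orgID)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    usage := make(map[int64]int64)
    for rows.Next() {
        var wsID, used int64
        if err := rows.Scan(&wsID, &used); err != nil {
            return nil, err
        }
        usage[wsID] = used
    }
    return usage, rows.Err()
}

// The caller still loops, but over a map in memory, not the database.
usage, err := repo.GetOrgWorkspaceStorage(ctx, orgID)
if err != nil {
    return err
}

var total int64
for _, used := range usage {
    total += used
}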

What Changed
Same logic.
Same output.
Completely different performance profile.

Why the Database View Mattered🥳
The real win wasn’t in the Go code.

It was in the database.

The workspace_storage_view already did the expensive work:

Scanning millions of rows in the files table
Calculating SUM(size) per workspace
Exposing the result as a simple used_bytes per workspace that any query can reuse

CREATE OR REPLACE VIEW workspace_storage_view AS
SELECT
    workspace_id,
    COALESCE(SUM(size), 0) AS used_bytes
FROM files
WHERE type != 'folder'
GROUP BY workspace_id;

Without this view, the “fix” would have just moved the problem, forcing us to rebuild the same heavy aggregation by hand in every query that needed storage numbers.

With the view:

The aggregation is defined once
Reads are fast
Logic is consistent across services

It became our single source of truth for storage usage — shared between Go services and legacy PHP code. No duplication. No discrepancies.
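
And when an endpoint only needs the org-wide total, the aggregation can be pushed entirely into SQL against the same view. A minimal sketch, assuming the schema above:

// Single scalar read: total bytes used across one organization.
// (Query shape is illustrative; it assumes the view and tables shown above.)
var totalBytes int64
err := db.QueryRowContext(ctx, `
    SELECT COALESCE(SUM(v.used_bytes), 0)
    FROM workspaces w
    JOIN workspace_storage_view v ON v.workspace_id = w.id
    WHERE w.org_id = $1`, orgID).Scan(&totalBytes)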

The Moment I Deployed It
I pushed the fix at 3:47 PM on a Thursday☕️

Then I opened the monitoring dashboard and watched.

The latency line for the endpoint didn’t slowly improve.

It fell off a cliff.

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Queries per request | 217 | 1 | 217× fewer |
| Avg response time | ~2.8s | ~80ms | ~35× faster |
| P95 latency | ~4.2s | ~120ms | ~35× better |
| DB CPU usage | ~65% | ~12% | 82% reduction |

I refreshed the endpoint for our largest customer. 78ms.

Refreshed again. 82ms.

I sat there for five minutes just watching the graphs.

No spikes.
No regression.
Just… fast.

The Best Metric of All
The next morning, our support lead messaged me:

“Hey, whatever you did yesterday, the dashboard complaints stopped completely. Users are happy again.”

I didn’t tell him it was one structural query change.

I just said:

“Fixed a thing.” 😁 (I did tell him afterwards, by the way.)

Lessons Learned

I’ve fixed performance bugs before.
But this one stuck with me.

Not because it was complex (it wasn’t).
But because it exposed a blind spot I didn’t know I had.

Fixing the N+1 problem felt great. Watching latency drop from 3 seconds to ~80ms was an instant dopamine hit.

Here’s what I learned:

I used to think clean code was enough. Readable loops, proper error handling, separation of concerns — that was the goal.
This bug taught me otherwise.

The real lesson wasn’t about JOINs or database views.
It was about mindset.

Databases don’t think in loops. They think in sets.
Every time I write a for loop that hits the database, I’m forcing it to work against its nature — and paying for it in latency.

I also learned that silent problems stay silent until you give yourself the tools to hear them.
Logs and metrics told me something was off.
SigNoz traces told me exactly what — 217 queries where one would do.

Now I test with real volumes.
I watch query counts like a hawk.
And I never, ever query inside a loop without asking:

“Can I do this once?”

Because the fix was one query change.
But the lesson is permanent.

One more thing I want to add:
I've seen teams blame ORMs for N+1 problems. But the tool isn't the culprit — it's abstraction without understanding.

ORMs are convenience wrappers, not magic. They eliminate boilerplate, but you still need to know:

What SQL your code actually generates

How the database executes it

When lazy loading silently triggers extra queries

The previous developer didn't misuse GORM. They just hadn't tested at production scale. That gap in testing, monitoring, and team awareness is on us, not the tool.

Great abstractions make the right thing easy and the wrong thing hard. But no tool replaces understanding what happens under the hood.

Best Habits to Avoid N+1
Here’s what I do now and what I wish I’d done before:

1. Don’t query the database inside a loop. Ever.
This is the cardinal rule.
If you see a for loop with a database call inside, stop. Refactor.

There are very few legitimate exceptions.
If you think you have one, measure it first.

2. Think in sets, not rows.
Before writing a query, ask:
“What’s the smallest number of queries that can get me everything I need?”

Databases excel at set operations — JOINs, IN clauses, bulk reads.
They're terrible at being asked the same question repeatedly.

Your job is to translate “I need data for each of these items” into “I need all this data at once.”
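
One concrete way to make that translation, if you’re using sqlx, is an IN clause built with sqlx.In. A minimal sketch (table and column names follow the earlier example; ids is whatever slice you’d otherwise loop over, db is a *sqlx.DB):

// One IN-clause query instead of len(ids) separate lookups.
query, args, err := sqlx.In(
    `SELECT workspace_id, used_bytes
     FROM workspace_storage_view
     WHERE workspace_id IN (?)`, ids)
if err != nil {
    return err
}
query = db.Rebind(query) // rewrite ? placeholders for your driver ($1, $2, ... on Postgres)

var rows []struct {
    WorkspaceID int64 `db:"workspace_id"`
    UsedBytes   int64 `db:"used_bytes"`
}
if err := db.Select(&rows, query, args...); err != nil {
    return err
}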

3. Enable query logging in development.
Yes, the logs are noisy.
Yes, it’s annoying to scroll past hundreds of queries.

But it’s the only way to catch N+1 patterns early.

In Go with GORM:

db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{
    Logger: logger.Default.LogMode(logger.Info), // Shows all queries
})

With sqlx or database/sql, wrap your queries with logging middleware.
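
Neither gives you a built-in logger the way GORM does, so a thin wrapper around the handle is enough. A minimal sketch, assuming the usual database/sql imports (LoggingDB is a hypothetical type, not part of either library; your repositories would take it through an interface):

// LoggingDB wraps *sql.DB and logs every query it runs, with timing.
// Do the same for ExecContext and QueryRowContext if you use them.
type LoggingDB struct {
    *sql.DB
}

func (l *LoggingDB) QueryContext(ctx context.Context, query string, args ...any) (*sql.Rows, error) {
    start := time.Now()
    rows, err := l.DB.QueryContext(ctx, query, args...)
    log.Printf("query=%q args=%v took=%s err=%v", query, args, time.Since(start), err)
    return rows, err
}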

Make it part of your local dev setup.
Turn it off when you need to. But keep it on by default.

4. Set query count budgets for endpoints.
Not all endpoints are created equal:

Detail page (single resource): ≤ 3 queries
List page (collection): ≤ 5–7 queries
Dashboard (aggregated data): ≤ 10 queries
If you exceed these, ask why.
Is it N+1? Over-fetching? Missing JOIN?

Track these budgets in code comments or your team’s style guide.

5. Write tests that assert query counts.
This is the game-changer.

Use a query counter or mock your database layer to verify you’re not making too many calls:

func TestGetOrgStorageUsage_QueryCount(t *testing.T) {
    // Setup: a test org (orgID) seeded with 200 workspaces, and a
    // query-counting wrapper around the database handle.
    ctx := context.Background()
    queryLog := &QueryLogger{}
    service := NewWorkspaceService(queryLog)

    service.GetOrgStorageUsage(ctx, orgID)

    if queryLog.Count() > 2 {
        t.Errorf("Expected ≤ 2 queries, got %d", queryLog.Count())
    }
}

If the test fails, you’ve caught an N+1 before it reaches production.
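
QueryLogger isn’t from a library. A minimal version is just a counter around the database handle, something like this (a hypothetical helper built on database/sql; your service has to accept its handle through an interface so tests can swap it in):

// QueryLogger counts queries so tests can assert an upper bound.
type QueryLogger struct {
    *sql.DB
    count int64
}

func (q *QueryLogger) QueryContext(ctx context.Context, query string, args ...any) (*sql.Rows, error) {
    atomic.AddInt64(&q.count, 1) // count first, then delegate to the real DB
    return q.DB.QueryContext(ctx, query, args...)
}

func (q *QueryLogger) Count() int {
    return int(atomic.LoadInt64(&q.count))
}

In a real test you’d construct it around the seeded test database (for example &QueryLogger{DB: testDB}, where testDB is your fixture) so the service still runs against a real engine.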

6. Profile with production-scale data.
Testing with 5 records is like testing a car in your driveway.
It might work. You won’t know how it handles the highway.

Before deploying:

Seed your staging database with realistic volumes
Run the endpoint with your largest customer’s data profile
Measure latency, query count, and database load
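
The seeding itself doesn’t have to be fancy. A rough sketch, where mustInsertWorkspace and mustInsertFile are hypothetical helpers standing in for whatever fixtures you already have:

// Seed staging with production-like volume: 250 workspaces, a few thousand files each.
// (Helper functions are hypothetical; use your own fixtures or migrations.)
for i := 0; i < 250; i++ {
    wsID := mustInsertWorkspace(db, orgID, fmt.Sprintf("ws-%d", i))
    for j := 0; j < 3000; j++ {
        mustInsertFile(db, wsID, rand.Int63n(5<<20)) // random file size up to ~5 MB
    }
}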
If you can’t test at scale, at least add a warning comment:

// ⚠️ N+1 risk: loops over workspaces
// Test with 200+ workspaces before deploying
7. Use APM tools to catch what you miss.
SigNoz, DataDog, New Relic — whatever you use, configure it to alert on:

Queries per request > threshold
Database time > 50% of total request time
Linear correlation between data size and latency
Set up a dashboard that shows query counts by endpoint.
Review it weekly. Look for outliers.

Silent killers stay silent until you give yourself the tools to hear them.

8. Learn your ORM’s data fetching patterns.
If you use GORM:
Preload() = separate query for related data (can cause N+1 if nested)
Joins() = single query with JOINs (usually better for lists)
Select() = fetch only the columns you need
If you use sqlx or raw queries:

Prefer IN clauses over loops: SELECT * FROM table WHERE id IN (?, ?, ?)
Use JOINs when you need related data from multiple tables
Consider database views for pre-aggregated data (like we did with workspace_storage_view)
Don’t treat your database layer as a black box.
Understand what SQL your code generates.
Read the query plans. Use EXPLAIN.
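
For example, with GORM the difference between the two loading strategies is a single call. A hedged sketch with illustrative model and association names (Workspace, Storage, and results aren’t from this codebase):

// Preload: two queries total. One for workspaces, then one batched
// "... WHERE workspace_id IN (...)" for the association. Extra queries, but not N of them.
db.Preload("Storage").Where("org_id = ?", orgID).Find(&workspaces)

// Joins + Select: a single query that fetches only the columns we need.
db.Model(&Workspace{}).
    Select("workspaces.id, v.used_bytes").
    Joins("LEFT JOIN workspace_storage_view v ON v.workspace_id = workspaces.id").
    Where("workspaces.org_id = ?", orgID).
    Scan(&results)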

9. Make “query count” part of code reviews.
Add it to your checklist:

[ ] Does this loop make database calls?
[ ] Could this be a single query with a JOIN or IN clause?
[ ] What happens when this list grows to 200+ items?
[ ] Are we fetching more data than we need?
A second pair of eyes catches what you miss.
A team habit prevents what individuals forget.

10. When in doubt, measure.
Don’t guess. Don’t assume. Don’t trust “it worked in dev.”

Measure:

Query count per request
Total database time
Latency at different data volumes
If the numbers look wrong, the code is wrong — even if it “works.”

These habits won’t eliminate every performance problem.
But they’ll catch 90% of N+1 issues before they reach production.

And the 10% that slip through?
That’s what SigNoz is for.

If you’ve handled N+1 differently, or think there’s a better approach, I’d love to hear it.

Happy reading. Go check your query logs — I’ll wait.❤️
