What Happened
A user reported seeing no data after an absurdly long loading time in one of the procurement menus. I was maintaining an e-procurement system for a state-owned enterprise subsidiary in the mining sector, and this wasn't just a UI glitch: it completely blocked the user's business process.
In the network tab, the problematic request just sat in a pending state until it hit the server's timeout and came back as a 504, returning no data to the user at all.
The culprit turned out to be a correlated subquery behaving exactly like an N+1 problem: common, easy to miss, and quietly fatal in production. In this post, I'll walk through how I identified the root cause and fixed it.
What is an N+1 Query Problem?
Imagine you have to fetch 500 rows from the database. Then, for each of those rows, you run another query to compute something like an aggregation. That's one query for the initial fetch plus 500 queries for the computations: 1 + N. And because the data keeps growing in production, the problem only gets worse over time.
```php
// Classic ORM N+1 example (Laravel Eloquent)

// 1 query
$orders = Order::all();

// N queries: one per order
foreach ($orders as $order) {
    echo $order->supplier->name; // hits the database on every single iteration
}
```
This commonly happens with ORM loops, like the Eloquent example above in Laravel. But the N+1 problem isn't limited to ORMs: something very similar can happen inside a single database query itself, which is exactly what happened in my case.
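To make the 1 + N arithmetic concrete, here is a minimal, self-contained sketch using Python's sqlite3 and a hypothetical orders/suppliers schema (the table and column names are illustrative, not from the original system). It counts executed statements via a trace callback, comparing the N+1 loop against a single JOIN:

```python
import sqlite3

# Hypothetical schema mirroring the ORM example: orders belong to suppliers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, supplier_id INTEGER);
    INSERT INTO suppliers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders (supplier_id) VALUES (1), (2), (1), (2), (1);
""")

query_count = 0
def count_statements(_stmt):
    # Called once per SQL statement the connection executes.
    global query_count
    query_count += 1
conn.set_trace_callback(count_statements)

# N+1 pattern: 1 query for the orders, then 1 query per order for its supplier.
orders = conn.execute("SELECT id, supplier_id FROM orders").fetchall()
for _order_id, supplier_id in orders:
    conn.execute("SELECT name FROM suppliers WHERE id = ?", (supplier_id,)).fetchone()
n_plus_one = query_count  # 1 + 5 = 6 statements

query_count = 0
# Single-query alternative: one JOIN fetches everything at once.
rows = conn.execute("""
    SELECT orders.id, suppliers.name
    FROM orders JOIN suppliers ON suppliers.id = orders.supplier_id
""").fetchall()
single = query_count  # 1 statement

print(n_plus_one, single)  # 6 vs 1
```

With 5 orders the loop already costs 6 statements to the JOIN's 1; with 500 orders it would be 501 to 1.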
What Actually Happened
This is a snippet of a long scope that was implemented in Laravel.
```php
->select(
    ...
    CASE WHEN COUNT(po_items.*) > 1 THEN
        (SELECT SUM(total_value)
         FROM po_items
         WHERE po_items.po_id = po.id)
    END AS sum_val
    ...
)
```
Do you see the problem? Yup. It's computing an aggregation, SUM(), in a correlated subquery inside the SELECT list. So for every row produced by the FROM clause, let's assume 1,000 rows, the database runs that SUM() query another 1,000 times. This works on small datasets, but against a production database this query won't survive.
The Fix
```sql
LEFT JOIN (
    SELECT po_id,
           SUM(total_value) AS sum_val
    FROM po_items
    WHERE status = 'y'
    GROUP BY po_id
) AS po_item_aggs ON po.id = po_item_aggs.po_id

-- accessing it directly
SELECT po.id, po_item_aggs.sum_val
```
This is what I did. Instead of running the aggregation as a correlated subquery inside the SELECT list, I moved it into a derived table and joined against it. The aggregation now runs once over po_items, grouped by po_id, rather than once per row of the outer query. So instead of 1,000+ subquery executions, the computation happens in a single pass.
Results
| Metric | Before | After |
|---|---|---|
| Response time | ~2 minutes / timeout | ~15 seconds |
| HTTP status | 504 Gateway Timeout | 200 OK |
| User experience | Blank table, blocked workflow | Normal |
The request that was hanging for over two minutes — or not returning at all — now resolves in around 15 seconds. That's not just a performance improvement, it's the difference between a system that works and one that doesn't.