Michael Interface

Posted on Feb 24 • Originally published at backendops.hashnode.dev

How Exposed Postgres view Took 8 Minutes to Retrieve data

#postgres #database #webdev

To keep production traffic isolated from heavy reporting queries, we maintain a separate analytics database. Our main service database handles user traffic, while a dedicated analytics database aggregates and enriches organization data.

Twice a day, a background job synchronizes data from analytics into the service database. The analytics team exposes a PostgreSQL view as a clean interface — no direct table access, just a stable SELECT contract.

It worked perfectly.

Until one day the sync started taking 8 minutes.

And the query had a LIMIT 5000.

The Setup

Here’s roughly how the data flowed:

On the analytics side, we had something like this:

analytics.organizations
analytics.organization_domains
analytics.partner_organizations
analytics.custom_projectsThe analytics team exposed a view:

CREATE VIEW analytics.v_sync_organizations AS
    SELECT ... FROM analytics.organizations
    UNION ALL
    SELECT ... FROM analytics.partner_organizations
    UNION ALL
    SELECT ... FROM analytics.custom_projects;

One branch of that view joined organizations with organization_domains and aggregated domains per organization.

The sync job executed something very simple:

SELECT
    org.*,
FROM analytics.v_sync_organizations org
WHERE
    o.website IS NOT NULL
    AND o.address IS NOT NULL
    AND o.display_name IS NOT NULL
ORDER BY o.organization_id
LIMIT 5000;

Nothing scary. Or so we thought.

The Symptom

The job to retrieve the data suddenly started taking over 8 minutes.

From EXPLAIN (ANALYZE, BUFFERS):

Total execution time: 518,803 ms (~8.6 minutes)
Rows returned: 5,000
Rows processed before aggregation: 3,341,534
Temp data written: ~6.6 GB
Temp read time: 407,980 ms (~6.8 minutes)

The database CPU was fine. Memory usage looked normal.

Disk I/O, however, was spiking hard.

The Hidden Problem

One branch of the view looked roughly like this:

SELECT
    org.organization_id,
    MAX(LOWER(d.domain)) AS website,
    org.display_name,
    org.address,
    org.location,
    org.created_at
FROM analytics.organizations org
LEFT JOIN analytics.organization_domains d
    ON org.organization_id = d.organization_id
    AND d.is_primary
GROUP BY
    org.organization_id,
    org.display_name,
    org.address,
    org.location,
    org.created_at;

This is where everything fell apart.

At the time of the incident:

organizations: ~3.1 million rows
organization_domains: ~933,000 rows
Resulting join size: ~3.3 million rows
Row width during sort: ~526 bytes

To perform the GROUP BY, PostgreSQL had to sort those 3.3 million wide rows.

The sort required roughly:

3,341,534 rows × 526 bytes ≈ 1.7 GB

Since that didn’t fit in memory, PostgreSQL switched to an external merge sort.

From the execution plan:

Sort Method: external merge
Disk: 1762616kB

That single sort spilled ~1.7GB to disk. In total, the query wrote ~6.6GB of temporary data and spent over 6 minutes just reading temp files back.

The LIMIT 5000 didn’t help — because PostgreSQL had to complete the aggregation and sorting before it could return the first row.

The Key Realization

We didn’t actually need to group all organization columns.

The real requirement was simple:

For each organization, attach one primary domain.

Instead of grouping everything, we rewrote the query to select a single domain per organization using a lateral subquery:

SELECT
    org.organization_id,
    d.website,
    org.display_name,
    org.address,
    org.location,
    org.created_at
FROM analytics.organizations org
LEFT JOIN LATERAL (
    SELECT LOWER(domain) AS website
    FROM analytics.organization_domains d
    WHERE d.organization_id = org.organization_id
      AND d.is_primary
    ORDER BY domain
    LIMIT 1
) d ON TRUE
WHERE org.location IS NOT NULL;

With an index like:

CREATE INDEX idx_domains_primary_org
ON analytics.organization_domains (organization_id, domain)
WHERE is_primary;

The massive sort disappeared.

No 1.7GB spill. No 6GB temp files. No external merge.

The sync time dropped from 8+ minutes to under 20 seconds.

Closing thoughts

2 main points for me from this incident:

Large GROUP BY operations can silently trigger multi‑gigabyte disk sorts.
Often, the fastest optimization is not tuning memory — but rewriting the query to avoid global aggregation entirely.

And it only surfaced when the data grew large enough to expose it.

I’d love to hear if you had similar problems and how you handled them under real traffic.