Like many B2B startups, our landing page started as a simple marketing surface.
A headline, a short explanation of the product, and a section showing real usage metrics to build trust with potential customers. Nothing fancy. At least, that’s what we thought.
Then traffic happened.
The Business Context
Our product is a B2B workflow automation platform where customers connect third‑party systems (CRMs, payment providers, internal APIs), define trigger‑based rules, and we execute those workflows asynchronously on their behalf — every execution producing rows in our production database.
From a business perspective, one of the strongest trust signals we had was usage. Potential customers wanted to see that the platform was already being used at scale.
That’s why our landing page showed live-ish product metrics:
- Number of active companies running workflows
- Total workflows executed in the last 30 days
- Daily execution volume (displayed as a small chart)
- Overall success vs failure rate across all workflows
None of this data was customer-specific. It was fully aggregated and anonymized, but it was real production data.
The Hidden Cost
Behind the scenes, those “simple” numbers were powered by a single API endpoint executing a fairly heavy analytical query against our production database.
To make this concrete, our core tables looked roughly like this:
- accounts – one row per customer company
- workflows – workflow definitions per account
- workflow_runs – one row per execution attempt
- workflow_steps – steps inside each workflow
- subscriptions – plan and billing state
- users – used mainly to exclude internal and test data
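If it helps to picture the joins, here is a minimal sketch of the relevant columns. The definitions are illustrative, not our exact schema:

CREATE TABLE accounts (
    id   bigint PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE workflows (
    id         bigint PRIMARY KEY,
    account_id bigint NOT NULL REFERENCES accounts (id)
);

CREATE TABLE workflow_runs (
    id          bigint PRIMARY KEY,
    workflow_id bigint NOT NULL REFERENCES workflows (id),
    status      text NOT NULL,                     -- 'success', 'failed', ...
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE subscriptions (
    account_id bigint NOT NULL REFERENCES accounts (id),
    plan       text NOT NULL                       -- 'free', 'pro', ...
);

CREATE TABLE users (
    id         bigint PRIMARY KEY,
    account_id bigint NOT NULL REFERENCES accounts (id),
    email      text NOT NULL
);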
The landing page needed to answer questions like:
- How many companies ran at least one workflow in the last 30 days?
- How many workflow executions happened during that time?
- What does daily execution volume look like?
- What percentage of runs succeeded vs failed?
The Initial Query
All of this was computed on the fly with a single query that looked roughly like this:
SELECT
    date_trunc('day', wr.created_at) AS day,
    COUNT(DISTINCT a.id) AS active_accounts,
    COUNT(wr.id) AS total_runs,
    COUNT(*) FILTER (WHERE wr.status = 'success') AS successful_runs,
    COUNT(*) FILTER (WHERE wr.status = 'failed') AS failed_runs
FROM workflow_runs wr
JOIN workflows w ON w.id = wr.workflow_id
JOIN accounts a ON a.id = w.account_id
JOIN subscriptions s ON s.account_id = a.id
LEFT JOIN users u ON u.account_id = a.id
WHERE wr.created_at >= now() - interval '30 days'  -- rolling 30-day window
  AND s.plan != 'free'                             -- paying customers only
  AND u.email NOT LIKE '%@ourcompany.com'          -- exclude internal users
GROUP BY 1
ORDER BY 1;
This query:
- Scanned millions of rows in workflow_runs
- Joined five production tables
- Performed multiple aggregations and filters
- Returned just a few dozen rows for the UI
On average, it took ~2 seconds. Under load, it could spike to 4–5 seconds.
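If you want to see exactly where the time goes in a query like this, EXPLAIN with the ANALYZE and BUFFERS options reports actual row counts and I/O. A simplified version of the aggregation, just to show the technique:

EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day, count(*) AS total_runs
FROM workflow_runs
WHERE created_at >= now() - interval '30 days'
GROUP BY 1;

Plans like this match what we saw: the cost was in scanning and aggregating workflow_runs, not in returning results.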
Early Attempts (and Why They Didn’t Work)
We tried several incremental fixes before changing the architecture.
Indexes and query tuning helped a bit, but the fundamental cost remained: we were still aggregating a large time window of execution data on every request.
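For concreteness, this is the kind of index we mean, a composite covering the time filter and the status column used by the filtered aggregates (the name is hypothetical):

CREATE INDEX CONCURRENTLY workflow_runs_created_at_status_idx
    ON workflow_runs (created_at, status);

It makes the range scan cheaper, but every request still had to aggregate the full 30-day window.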
Application-level caching reduced the number of queries, but introduced new problems:
- When the cache expired, many requests recomputed the same expensive query at once
- During cold starts, the landing page always hit production directly
- Marketing traffic spikes often arrived exactly when the cache was empty, amplifying the load
Reducing the amount of returned data didn’t help much either. The expensive part wasn’t the result set — it was the computation.
The core issue was simple:
We were running analytical workloads on a transactional database, triggered by anonymous traffic.
Introducing a Materialized View
At that point, we stopped optimizing the query and started questioning the approach.
These metrics didn’t need to be real-time. Being 5 minutes stale had zero business impact. What mattered was predictable performance and protecting production.
We replaced the live query with a materialized view that precomputed all landing page metrics:
CREATE MATERIALIZED VIEW landing_page_metrics AS
SELECT
    date_trunc('day', wr.created_at) AS day,
    COUNT(DISTINCT a.id) AS active_accounts,
    COUNT(wr.id) AS total_runs,
    COUNT(*) FILTER (WHERE wr.status = 'success') AS successful_runs,
    COUNT(*) FILTER (WHERE wr.status = 'failed') AS failed_runs
FROM workflow_runs wr
JOIN workflows w ON w.id = wr.workflow_id
JOIN accounts a ON a.id = w.account_id
JOIN subscriptions s ON s.account_id = a.id
LEFT JOIN users u ON u.account_id = a.id
-- now() is evaluated at each REFRESH, so the 30-day window slides forward
WHERE wr.created_at >= now() - interval '30 days'
  AND s.plan != 'free'
  AND u.email NOT LIKE '%@ourcompany.com'
GROUP BY 1;
We indexed the materialized view and refreshed it every 5 minutes using REFRESH MATERIALIZED VIEW CONCURRENTLY from a background worker.
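One detail worth knowing if you reproduce this: Postgres only accepts REFRESH MATERIALIZED VIEW CONCURRENTLY when the view has at least one unique index, and the day column is a natural fit since the view has one row per day (the index name is illustrative):

CREATE UNIQUE INDEX landing_page_metrics_day_idx
    ON landing_page_metrics (day);

-- run by the background worker every 5 minutes
REFRESH MATERIALIZED VIEW CONCURRENTLY landing_page_metrics;

The CONCURRENTLY variant matters here: reads continue against the old contents while the refresh runs.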
From the application’s perspective, the landing page endpoint became a simple, fast read — a single indexed SELECT returning ~30 rows in 10–20 ms, even under peak traffic. No joins, no heavy computation, no surprises.
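Concretely, the endpoint now runs something like this, with the columns as defined in the view:

SELECT day, active_accounts, total_runs, successful_runs, failed_runs
FROM landing_page_metrics
ORDER BY day;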
Closing Thought
Materialized views turned a high‑traffic analytical query into a predictable, constant‑time read — regardless of how many users hit the page.
In our case, the fix wasn’t another cache or index. It was changing when the computation happened.
If you’ve solved similar problems — or taken a different approach — I’d love to hear how you handled analytical queries under real traffic.