Vito Secona

Posted on Jun 4

How I Cut 769 Database Queries With One Window Function

#database #performance #postgres #sql

From Grafana to Real Performance Gains

I noticed in the Grafana dashboard that GET /api/v1/dokter/public/jadwal-dokter was on average taking 1 second even when not under load. To get measurable, reproducible numbers, I built a benchmark pipeline and profiled the endpoint systematically.

What started as a simple baseline measurement turned into a two-phase optimization effort that eliminated 25% of database round-trips and reduced median response time by 18%.

The System

The application is a FastAPI + SQLAlchemy + PostgreSQL hospital management backend. The endpoint under study:

GET /api/v1/dokter/public/jadwal-dokter

This public endpoint lists active doctors with their specializations and practice schedules, supporting:

Pagination (page, page_size)
Name filtering (name: ILIKE partial match)
Specialty filtering (specialty: ANY on M:N relationship)
Schedule filtering (hari, time_from, time_to: EXISTS subquery)

Schema Overview

Table	Purpose	Est. Rows
`dokter`	Doctor profiles	~15
`spesialisasi`	Specialization names	~5
`jadwal_praktik`	Practice schedules	~30
`dokter_spesialisasi`	M:N junction table	~15

Methodology

Tools Used

Tool	Version	Role
Python	3.13	Runtime
SQLAlchemy	2.0	ORM with `echo=True` for SQL capture
FastAPI	0.134	Web framework
PostgreSQL	18	Database
Uvicorn	0.40	ASGI server
Locust	2.44	Load generation
Alembic	1.18	Schema migrations
matplotlib	3.10	Chart generation
Prometheus FastAPI Instrumentator	0.10	Request duration metrics

The Two Locustfiles

I use two separate Locustfiles because the requirements for SQL capture and load testing are fundamentally different:

	`locustfile_sql_capture.py`	`locustfile.py`
Purpose	Dense SQL statement capture	Realistic load testing
Wait time	`0` (fire as fast as possible)	`between(1, 3)` seconds
Task weights	Equal (`@task(1)` each)	Realistic traffic distribution

locustfile_sql_capture.py hits every endpoint variant with equal weight and zero delay, flooding the log with SQL statements in rapid succession. This guarantees clean, non-interleaved output for pattern analysis:

# locustfile_sql_capture.py: equal weights, zero wait
class SQLCaptureUser(HttpUser):
    def wait_time(self):
        return 0

    @task(1)
    def browse_no_filter(self):
        self.client.get("/api/v1/dokter/public/jadwal-dokter?page=1&page_size=10")

    @task(1)
    def browse_name_filter(self):
        self.client.get("/api/v1/dokter/public/jadwal-dokter?name=dr&page=1&page_size=10")

    @task(1)
    def browse_specialty_filter(self): ...
    @task(1)
    def browse_time_filter(self): ...
    @task(1)
    def browse_combined_filter(self): ...

locustfile.py uses weighted tasks to mimic real traffic patterns. Most users browse without filters (weight 30), while combined queries are rare (weight 5). A 1–3 second wait prevents a single user from saturating the server:

# locustfile.py: realistic weights, 1-3s wait
class ProfileUser(HttpUser):
    wait_time = between(1, 3)

    @task(30)
    def browse_no_filter(self):
        self.client.get("/api/v1/dokter/public/jadwal-dokter?page=1&page_size=10")

    @task(20)
    def browse_name_filter(self): ...
    @task(20)
    def browse_specialty_filter(self): ...
    @task(20)
    def browse_time_filter(self): ...

    @task(5)
    def browse_combined_filter(self): ...

    @task(5)
    def list_spesialisasi(self): ...

Benchmark Pipeline

The benchmark script (profiling/benchmark.py) automates an 8-step pipeline:

[1/8] Start uvicorn (SQLAlchemy echo=True)
[2/8] Wait for /healthcheck
[3/8] Phase 1 - SQL Capture: 1 user, 10s (locustfile_sql_capture.py)
[4/8] Snapshot the clean Phase-1 log
[5/8] Analyse SQL patterns (grep -> normalize -> group -> detect)
[6/8] Phase 2 - Load Test: 1 / 10 / 50 users, 30s each (locustfile.py)
[7/8] Scrape /metrics from Prometheus instrumentator
[8/8] Stop uvicorn

Data Collection & Analysis Pipeline

SQLAlchemy's echo=True writes every executed query to stdout in this format:

2026-05-25 19:37:47,210 INFO sqlalchemy.engine.Engine SELECT ...
2026-05-25 19:37:47,210 INFO sqlalchemy.engine.Engine [generated in 0.00020s] {}

The analysis pipeline processes these logs through four stages:

Grep: Regex extracts lines containing SQL keywords (SELECT, INSERT, UPDATE, DELETE, WITH)
Normalize: Literal values and bind parameters are replaced with ?, whitespace is collapsed, and IN (...) lists of any length are collapsed to IN (?+)
Group: Identical normalized patterns are merged with a frequency count
Detect: Duplicate patterns are identified; patterns firing ≥10 times with SELECT are flagged as N+1 suspects

When I first ran this pipeline, I got 2,959 "unique" patterns out of 3,243 queries — which is impossible for any real ORM workload. The problem was that SQLAlchemy's echo output interleaves with uvicorn access logs, parameter dicts, and cache metadata. The parser was concatenating all of this into the SQL statement. After fixing the stop-list handling, metadata stripping, timing regex, and IN-list collapsing, the same log file collapsed from 2,959 patterns to 18 real patterns. The real bottlenecks became visible.

Latency data comes from two sources:

SQLAlchemy timing: the [generated in Xs] marker logged after each statement (not [cached since Xs ago], which is cache age, not execution time).
Locust percentiles: p50/p90/p95/p99/max response times per endpoint during Phase 2 load tests.

Table hotspot analysis counts every FROM, JOIN, INTO, and UPDATE table reference across all extracted statements, split into read vs write hits.

Limitations

This profiling was performed on a test dataset (~15 doctors, ~30 schedule slots). The entire dataset fits in PostgreSQL's shared buffers, so all SQL execution times are sub-millisecond. The indexes installed in Phase B are forward-looking, they prevent full-table scans at production scale but their impact is not measurable on this tiny dataset. For tail-latency comparisons (p95/p99), 30-second Locust runs produce noticeable variance between runs; longer runs would yield more stable numbers.

Phase A: Eliminating the COUNT Round-Trip

The Bottleneck

Looking at the top queries from the baseline data:

1. [ 769x   23.7%]  SELECT DOKTER_1.ID AS DOKTER_1_ID, SPESIALISASI.ID...
2. [ 769x   23.7%]  SELECT JADWAL_PRAKTIK.DOKTER_ID AS JADWAL_PRAKTIK_DOKTER_ID...
3. [ 171x    5.3%]  SELECT COUNT(*) AS COUNT_1 FROM (SELECT DOKTER.ID...
4. [ 171x    5.3%]  SELECT DOKTER.ID AS DOKTER_ID... LIMIT ... OFFSET ...
5. [ 163x    5.0%]  SELECT SPESIALISASI.ID AS SPESIALISASI_ID...
6. [ 162x    5.0%]  SELECT COUNT(*) AS COUNT_1 FROM (SELECT DOKTER.ID...
7. [ 162x    5.0%]  SELECT DOKTER.ID AS DOKTER_ID... LIMIT ... OFFSET ...

COUNT and SELECT appear in paired duplicates. For every filter variant, the application ran the same WHERE clause twice, once for COUNT(*) to get the total, and once with LIMIT/OFFSET to get the data page.

~769 COUNT queries = 23.7% of all SQL traffic.

Tracing to the Code

# BEFORE (app/services/dokter_service.py)
def list_public_dokter_paginated(...):
    query = db.query(Dokter).filter(Dokter.is_active.is_(True))
    # ... filters built ...

    total = query.count()         # Query 1: COUNT wrapper

    dokters = (
        query.options(
            selectinload(Dokter.daftar_spesialisasi),   # Query 3
            selectinload(Dokter.daftar_jadwal_praktik), # Query 4
        )
        .order_by(...)
        .offset(...)
        .limit(page_size)
        .all()                    # Query 2: SELECT + LIMIT
    )
    return dokters, total

The Fix: Window Function

# AFTER
from sqlalchemy import func

def list_public_dokter_paginated(...):
    # ... same filter building ...

    count_over = func.count().over()

    rows = (
        query.add_columns(count_over)   # count() OVER() in SELECT
        .options(
            selectinload(Dokter.daftar_spesialisasi),
            selectinload(Dokter.daftar_jadwal_praktik),
        )
        .order_by(...)
        .offset(...)
        .limit(page_size)
        .all()
    )

    if not rows:
        return [], 0

    total = rows[0][1]                  # Same in every row
    dokters = [row[0] for row in rows]
    return dokters, total

Architecture Flow Diagram

SQL Impact

Before SQL (4 queries):

SELECT count(*) AS count_1 
FROM (SELECT dokter.id, dokter.nama, ... 
      FROM dokter 
      WHERE dokter.is_active IS true) AS anon_1;

SELECT dokter.* FROM dokter 
WHERE dokter.is_active IS true 
ORDER BY dokter.nama, dokter.id 
LIMIT 10 OFFSET 0;

SELECT spesialisasi.* FROM spesialisasi WHERE dokter_id IN (...);
SELECT jadwal_praktik.* FROM jadwal_praktik WHERE dokter_id IN (...);

After SQL (3 queries):

SELECT dokter.*, count(*) OVER () AS anon_1 
FROM dokter 
WHERE dokter.is_active IS true 
ORDER BY dokter.nama, dokter.id 
LIMIT 10 OFFSET 0;

SELECT spesialisasi.* FROM spesialisasi WHERE dokter_id IN (...);
SELECT jadwal_praktik.* FROM jadwal_praktik WHERE dokter_id IN (...);

Understanding N+1 and selectinload

If you look at the top queries, you'll notice the two selectinload queries fire 769 times each, the same magnitude as the COUNT query. My benchmark heuristic flagged these as N+1 suspects. They aren't.

The N+1 problem happens when an ORM lazy-loads relationships one row at a time. For example, without any eager loading, the endpoint would look like this:

# BAD: What the code would look like WITHOUT selectinload
def list_public_dokter_paginated(...):
    query = db.query(Dokter).filter(Dokter.is_active.is_(True))
    # ... filters ...

    dokters = (
        query
        .order_by(...)
        .offset(...)
        .limit(page_size)
        .all()
    )
    # During JSON serialization, FastAPI calls dokters[0].daftar_spesialisasi
    # which fires:   SELECT ... FROM spesialisasi WHERE dokter_id = <id_0>
    # Then the serializer calls dokters[1].daftar_spesialisasi
    # which fires:   SELECT ... FROM spesialisasi WHERE dokter_id = <id_1>
    # ... repeated for each doctor × each relationship
    #
    # With page_size=10 and 2 relationships:
    #   1 (main) + 10 (spesialisasi) + 10 (jadwal_praktik) = 21 queries
    return dokters

selectinload is the fix for N+1. Instead of 10 individual WHERE dokter_id = ? queries, it fires one batched WHERE dokter_id IN (?, ?, ?, ...) query:

# GOOD: selectinload batches everything
query.options(
    selectinload(Dokter.daftar_spesialisasi),
    selectinload(Dokter.daftar_jadwal_praktik),
)
# Fires just 2 batch queries:
#   SELECT ... FROM spesialisasi WHERE dokter_id IN (?, ?, ..., ?)
#   SELECT ... FROM jadwal_praktik WHERE dokter_id IN (?, ?, ..., ?)

The benchmark's N+1 detector uses a blunt heuristic: "any SELECT pattern firing ≥10 times." It cannot distinguish between:

N+1 (bad): SELECT * FROM x WHERE parent_id = ? repeated 10× for 10 parents
selectinload (good): SELECT * FROM x WHERE parent_id IN (?, ?, ...) 1× for 10 parents

These are false positives. No code change needed here, the "suspects" are actually the right solution.

Phase B: Database Indexes for the Road Ahead

What the Table Hotspots Revealed

The benchmark showed heavy read traffic on key tables, but the database had minimal indexes:

Table hotspots:
  dokter              2307 reads
  spesialisasi        1534 reads
  jadwal_praktik      1353 reads

Existing vs Missing Indexes

Columns Filtered On	Indexed?	Index
`dokter.is_active`	❌	-
`dokter.nama` (ILIKE)	❌	-
`jadwal_praktik.dokter_id`	❌	-
`jadwal_praktik.is_available`	❌	-
`jadwal_praktik.hari_dalam_minggu`	❌	-
`jadwal_praktik.jam_mulai` / `jam_selesai`	❌	-
`spesialisasi.nama`	✅	`ix_spesialisasi_nama`

The Migration

# alembic/versions/0e20a2fed073_add_profiling_indexes.py
def upgrade() -> None:
    # Every public query starts with WHERE is_active IS TRUE
    op.create_index("idx_dokter_is_active", "dokter", ["is_active"])

    # nama ILIKE filter
    op.create_index("idx_dokter_nama", "dokter", ["nama"])

    # Composite index covering the EXISTS subquery for jadwal_praktik filters
    op.create_index(
        "idx_jadwal_praktik_filter",
        "jadwal_praktik",
        ["dokter_id", "is_available", "hari_dalam_minggu", "jam_mulai", "jam_selesai"],
    )

Why the Indexes Didn't Speed Things Up (Yet)

On a tiny test dataset (~15 doctors, ~30 schedules), PostgreSQL's query planner correctly chooses sequential scans. When a table fits in a single 8 KB page, reading the whole thing sequentially is faster than traversing a B-tree index and then fetching individual heap pages. Adding indexes in this regime:

Does not change the query plan: the planner still picks seq scans because the cost estimate is lower
Adds planning overhead: the planner spends marginally more time evaluating index-scan paths before discarding them
Adds write overhead: every INSERT or UPDATE must maintain the B-tree structure and write to the WAL

The p95 latency regression in the index results (180 ms vs 98 ms for the optimized-only run) is likely statistical noise from short 30-second Locust runs — tail latencies vary significantly between runs. But even if there were a real regression on this dataset, it reflects planner overhead on a trivially small table, not a meaningful cost at production scale.

PostgreSQL can only exploit these indexes once the tables grow large enough that a seq scan would require reading multiple pages. At that point, the planner will switch to index scans, and the investment pays off.

Table Hit Heatmap

The heatmap shows read traffic dropping across all tables after the COUNT queries were eliminated.

Findings: Three-Way Benchmark Comparison

SQL Query Metrics

Metric	Baseline	Optimized	Optimized+Indexes
Total queries	3,243	2,443	2,649
Unique patterns	18	12	13
Duplicate patterns	13	8	8

Total queries vary slightly between optimized runs (~200) due to Locust's random task weighting. The structural change is what matters: zero COUNT queries.

The Disappearing COUNT Queries

The waterfall chart shows how 769 COUNT wrapper queries simply vanished from the query profile after the window function fix. The remaining SELECT and selectinload query counts are consistent across runs.

Latency Comparison (10 Users)

Percentile	Baseline	Optimized	Optimized+Indexes	Δ vs Baseline
Median	17 ms	14 ms	14 ms	−18%
p90	25 ms	18 ms	19 ms	−24%
p95	150 ms	98 ms	180 ms	−35% / noise
p99	230 ms	170 ms	230 ms	−26% / noise

Median dropped from 17 ms to 14 ms consistently. This is the reliable, reproducible win from eliminating one database round-trip per request.

Tail latency (p95/p99) is more volatile across short 30-second runs. The optimized run showed genuine improvement at p95, while the indexes run had higher variance, as discussed in Phase B, indexes don't change query plans on this tiny dataset, and the variance is measurement noise rather than a meaningful regression.

Conclusion & Trade-offs

What We Achieved

Improvement	Before	After
SQL queries per paginated request	4	3 (−25%)
Median response time	17 ms	14 ms (−18%)
Missing critical indexes	3	0
COUNT round-trips per request	1	0 (−100%)

What I Actually Used

Both optimizations are deployed. The window function fix (func.count().over() replacing query.count()) is the immediate, measurable win. It eliminates a full database round-trip per request with zero behavioral change. The three indexes are deployed as forward-looking insurance: they don't improve latency on today's 15-row dataset, but they prevent expensive sequential scans when the system grows.

Trade-offs: Which Optimization Should You Use?

Optimization	When to Deploy	Cost	Benefit
Window function (`COUNT() OVER()`)	Immediately	Zero: same API contract, same test expectations	−25% queries, −18% median latency
Database indexes	Before hitting production scale (1,000+ rows)	Marginal write overhead on INSERT/UPDATE, B-tree maintenance, WAL amplification	Prevents full-table scans at scale
Smarter N+1 detection	Future benchmark improvement	Engineering time to distinguish `IN (?+)` from plain `?` in normalization	Eliminates false positives from the benchmark report

The window function is a no-brainer. It has zero downside and produces a clear, measurable improvement. Indexes are a judgment call: on a trivially small dataset they add cost without benefit, but they're essential before the first real user loads 1000+ records. Install them early, verify with EXPLAIN ANALYZE at production scale, and consider dropping any that the planner never uses.

Key Lessons

query.count() is a silent bottleneck. ORMs make it invisible, but it's a full re-execution of your filters. Window functions are the escape hatch.
Batched eager loading is not N+1. selectinload with IN (...) is the solution, and your benchmark needs smarter heuristics to know the difference.
Indexes are insurance, not instant gratification. They won't speed up a 10-row table. They'll save you when data grows.

DEV Community