From Grafana to Real Performance Gains
I noticed in the Grafana dashboard that GET /api/v1/dokter/public/jadwal-dokter was on average taking 1 second even when not under load. To get measurable, reproducible numbers, I built a benchmark pipeline and profiled the endpoint systematically.
What started as a simple baseline measurement turned into a two-phase optimization effort that eliminated 25% of database round-trips and reduced median response time by 18%.
The System
The application is a FastAPI + SQLAlchemy + PostgreSQL hospital management backend. The endpoint under study:
GET /api/v1/dokter/public/jadwal-dokter
This public endpoint lists active doctors with their specializations and practice schedules, supporting:
-
Pagination (
page,page_size) -
Name filtering (
name:ILIKEpartial match) -
Specialty filtering (
specialty:ANYon M:N relationship) -
Schedule filtering (
hari,time_from,time_to:EXISTSsubquery)
Schema Overview
| Table | Purpose | Est. Rows |
|---|---|---|
dokter |
Doctor profiles | ~15 |
spesialisasi |
Specialization names | ~5 |
jadwal_praktik |
Practice schedules | ~30 |
dokter_spesialisasi |
M:N junction table | ~15 |
Methodology
Tools Used
| Tool | Version | Role |
|---|---|---|
| Python | 3.13 | Runtime |
| SQLAlchemy | 2.0 | ORM with echo=True for SQL capture |
| FastAPI | 0.134 | Web framework |
| PostgreSQL | 18 | Database |
| Uvicorn | 0.40 | ASGI server |
| Locust | 2.44 | Load generation |
| Alembic | 1.18 | Schema migrations |
| matplotlib | 3.10 | Chart generation |
| Prometheus FastAPI Instrumentator | 0.10 | Request duration metrics |
The Two Locustfiles
I use two separate Locustfiles because the requirements for SQL capture and load testing are fundamentally different:
locustfile_sql_capture.py |
locustfile.py |
|
|---|---|---|
| Purpose | Dense SQL statement capture | Realistic load testing |
| Wait time |
0 (fire as fast as possible) |
between(1, 3) seconds |
| Task weights | Equal (@task(1) each) |
Realistic traffic distribution |
locustfile_sql_capture.py hits every endpoint variant with equal weight and zero delay, flooding the log with SQL statements in rapid succession. This guarantees clean, non-interleaved output for pattern analysis:
# locustfile_sql_capture.py: equal weights, zero wait
class SQLCaptureUser(HttpUser):
def wait_time(self):
return 0
@task(1)
def browse_no_filter(self):
self.client.get("/api/v1/dokter/public/jadwal-dokter?page=1&page_size=10")
@task(1)
def browse_name_filter(self):
self.client.get("/api/v1/dokter/public/jadwal-dokter?name=dr&page=1&page_size=10")
@task(1)
def browse_specialty_filter(self): ...
@task(1)
def browse_time_filter(self): ...
@task(1)
def browse_combined_filter(self): ...
locustfile.py uses weighted tasks to mimic real traffic patterns. Most users browse without filters (weight 30), while combined queries are rare (weight 5). A 1–3 second wait prevents a single user from saturating the server:
# locustfile.py: realistic weights, 1-3s wait
class ProfileUser(HttpUser):
wait_time = between(1, 3)
@task(30)
def browse_no_filter(self):
self.client.get("/api/v1/dokter/public/jadwal-dokter?page=1&page_size=10")
@task(20)
def browse_name_filter(self): ...
@task(20)
def browse_specialty_filter(self): ...
@task(20)
def browse_time_filter(self): ...
@task(5)
def browse_combined_filter(self): ...
@task(5)
def list_spesialisasi(self): ...
Benchmark Pipeline
The benchmark script (profiling/benchmark.py) automates an 8-step pipeline:
[1/8] Start uvicorn (SQLAlchemy echo=True)
[2/8] Wait for /healthcheck
[3/8] Phase 1 - SQL Capture: 1 user, 10s (locustfile_sql_capture.py)
[4/8] Snapshot the clean Phase-1 log
[5/8] Analyse SQL patterns (grep -> normalize -> group -> detect)
[6/8] Phase 2 - Load Test: 1 / 10 / 50 users, 30s each (locustfile.py)
[7/8] Scrape /metrics from Prometheus instrumentator
[8/8] Stop uvicorn
Data Collection & Analysis Pipeline
SQLAlchemy's echo=True writes every executed query to stdout in this format:
2026-05-25 19:37:47,210 INFO sqlalchemy.engine.Engine SELECT ...
2026-05-25 19:37:47,210 INFO sqlalchemy.engine.Engine [generated in 0.00020s] {}
The analysis pipeline processes these logs through four stages:
-
Grep: Regex extracts lines containing SQL keywords (
SELECT,INSERT,UPDATE,DELETE,WITH) -
Normalize: Literal values and bind parameters are replaced with
?, whitespace is collapsed, andIN (...)lists of any length are collapsed toIN (?+) - Group: Identical normalized patterns are merged with a frequency count
-
Detect: Duplicate patterns are identified; patterns firing ≥10 times with
SELECTare flagged as N+1 suspects
When I first ran this pipeline, I got 2,959 "unique" patterns out of 3,243 queries — which is impossible for any real ORM workload. The problem was that SQLAlchemy's echo output interleaves with uvicorn access logs, parameter dicts, and cache metadata. The parser was concatenating all of this into the SQL statement. After fixing the stop-list handling, metadata stripping, timing regex, and IN-list collapsing, the same log file collapsed from 2,959 patterns to 18 real patterns. The real bottlenecks became visible.
Latency data comes from two sources:
-
SQLAlchemy timing: the
[generated in Xs]marker logged after each statement (not[cached since Xs ago], which is cache age, not execution time). - Locust percentiles: p50/p90/p95/p99/max response times per endpoint during Phase 2 load tests.
Table hotspot analysis counts every FROM, JOIN, INTO, and UPDATE table reference across all extracted statements, split into read vs write hits.
Limitations
This profiling was performed on a test dataset (~15 doctors, ~30 schedule slots). The entire dataset fits in PostgreSQL's shared buffers, so all SQL execution times are sub-millisecond. The indexes installed in Phase B are forward-looking, they prevent full-table scans at production scale but their impact is not measurable on this tiny dataset. For tail-latency comparisons (p95/p99), 30-second Locust runs produce noticeable variance between runs; longer runs would yield more stable numbers.
Phase A: Eliminating the COUNT Round-Trip
The Bottleneck
Looking at the top queries from the baseline data:
1. [ 769x 23.7%] SELECT DOKTER_1.ID AS DOKTER_1_ID, SPESIALISASI.ID...
2. [ 769x 23.7%] SELECT JADWAL_PRAKTIK.DOKTER_ID AS JADWAL_PRAKTIK_DOKTER_ID...
3. [ 171x 5.3%] SELECT COUNT(*) AS COUNT_1 FROM (SELECT DOKTER.ID...
4. [ 171x 5.3%] SELECT DOKTER.ID AS DOKTER_ID... LIMIT ... OFFSET ...
5. [ 163x 5.0%] SELECT SPESIALISASI.ID AS SPESIALISASI_ID...
6. [ 162x 5.0%] SELECT COUNT(*) AS COUNT_1 FROM (SELECT DOKTER.ID...
7. [ 162x 5.0%] SELECT DOKTER.ID AS DOKTER_ID... LIMIT ... OFFSET ...
COUNT and SELECT appear in paired duplicates. For every filter variant, the application ran the same WHERE clause twice, once for COUNT(*) to get the total, and once with LIMIT/OFFSET to get the data page.
~769 COUNT queries = 23.7% of all SQL traffic.
Tracing to the Code
# BEFORE (app/services/dokter_service.py)
def list_public_dokter_paginated(...):
query = db.query(Dokter).filter(Dokter.is_active.is_(True))
# ... filters built ...
total = query.count() # Query 1: COUNT wrapper
dokters = (
query.options(
selectinload(Dokter.daftar_spesialisasi), # Query 3
selectinload(Dokter.daftar_jadwal_praktik), # Query 4
)
.order_by(...)
.offset(...)
.limit(page_size)
.all() # Query 2: SELECT + LIMIT
)
return dokters, total
The Fix: Window Function
# AFTER
from sqlalchemy import func
def list_public_dokter_paginated(...):
# ... same filter building ...
count_over = func.count().over()
rows = (
query.add_columns(count_over) # count() OVER() in SELECT
.options(
selectinload(Dokter.daftar_spesialisasi),
selectinload(Dokter.daftar_jadwal_praktik),
)
.order_by(...)
.offset(...)
.limit(page_size)
.all()
)
if not rows:
return [], 0
total = rows[0][1] # Same in every row
dokters = [row[0] for row in rows]
return dokters, total
Architecture Flow Diagram
SQL Impact
Before SQL (4 queries):
SELECT count(*) AS count_1
FROM (SELECT dokter.id, dokter.nama, ...
FROM dokter
WHERE dokter.is_active IS true) AS anon_1;
SELECT dokter.* FROM dokter
WHERE dokter.is_active IS true
ORDER BY dokter.nama, dokter.id
LIMIT 10 OFFSET 0;
SELECT spesialisasi.* FROM spesialisasi WHERE dokter_id IN (...);
SELECT jadwal_praktik.* FROM jadwal_praktik WHERE dokter_id IN (...);
After SQL (3 queries):
SELECT dokter.*, count(*) OVER () AS anon_1
FROM dokter
WHERE dokter.is_active IS true
ORDER BY dokter.nama, dokter.id
LIMIT 10 OFFSET 0;
SELECT spesialisasi.* FROM spesialisasi WHERE dokter_id IN (...);
SELECT jadwal_praktik.* FROM jadwal_praktik WHERE dokter_id IN (...);
Understanding N+1 and selectinload
If you look at the top queries, you'll notice the two selectinload queries fire 769 times each, the same magnitude as the COUNT query. My benchmark heuristic flagged these as N+1 suspects. They aren't.
The N+1 problem happens when an ORM lazy-loads relationships one row at a time. For example, without any eager loading, the endpoint would look like this:
# BAD: What the code would look like WITHOUT selectinload
def list_public_dokter_paginated(...):
query = db.query(Dokter).filter(Dokter.is_active.is_(True))
# ... filters ...
dokters = (
query
.order_by(...)
.offset(...)
.limit(page_size)
.all()
)
# During JSON serialization, FastAPI calls dokters[0].daftar_spesialisasi
# which fires: SELECT ... FROM spesialisasi WHERE dokter_id = <id_0>
# Then the serializer calls dokters[1].daftar_spesialisasi
# which fires: SELECT ... FROM spesialisasi WHERE dokter_id = <id_1>
# ... repeated for each doctor × each relationship
#
# With page_size=10 and 2 relationships:
# 1 (main) + 10 (spesialisasi) + 10 (jadwal_praktik) = 21 queries
return dokters
selectinload is the fix for N+1. Instead of 10 individual WHERE dokter_id = ? queries, it fires one batched WHERE dokter_id IN (?, ?, ?, ...) query:
# GOOD: selectinload batches everything
query.options(
selectinload(Dokter.daftar_spesialisasi),
selectinload(Dokter.daftar_jadwal_praktik),
)
# Fires just 2 batch queries:
# SELECT ... FROM spesialisasi WHERE dokter_id IN (?, ?, ..., ?)
# SELECT ... FROM jadwal_praktik WHERE dokter_id IN (?, ?, ..., ?)
The benchmark's N+1 detector uses a blunt heuristic: "any SELECT pattern firing ≥10 times." It cannot distinguish between:
-
N+1 (bad):
SELECT * FROM x WHERE parent_id = ?repeated 10× for 10 parents -
selectinload (good):
SELECT * FROM x WHERE parent_id IN (?, ?, ...)1× for 10 parents
These are false positives. No code change needed here, the "suspects" are actually the right solution.
Phase B: Database Indexes for the Road Ahead
What the Table Hotspots Revealed
The benchmark showed heavy read traffic on key tables, but the database had minimal indexes:
Table hotspots:
dokter 2307 reads
spesialisasi 1534 reads
jadwal_praktik 1353 reads
Existing vs Missing Indexes
| Columns Filtered On | Indexed? | Index |
|---|---|---|
dokter.is_active |
❌ | - |
dokter.nama (ILIKE) |
❌ | - |
jadwal_praktik.dokter_id |
❌ | - |
jadwal_praktik.is_available |
❌ | - |
jadwal_praktik.hari_dalam_minggu |
❌ | - |
jadwal_praktik.jam_mulai / jam_selesai
|
❌ | - |
spesialisasi.nama |
✅ | ix_spesialisasi_nama |
The Migration
# alembic/versions/0e20a2fed073_add_profiling_indexes.py
def upgrade() -> None:
# Every public query starts with WHERE is_active IS TRUE
op.create_index("idx_dokter_is_active", "dokter", ["is_active"])
# nama ILIKE filter
op.create_index("idx_dokter_nama", "dokter", ["nama"])
# Composite index covering the EXISTS subquery for jadwal_praktik filters
op.create_index(
"idx_jadwal_praktik_filter",
"jadwal_praktik",
["dokter_id", "is_available", "hari_dalam_minggu", "jam_mulai", "jam_selesai"],
)
Why the Indexes Didn't Speed Things Up (Yet)
On a tiny test dataset (~15 doctors, ~30 schedules), PostgreSQL's query planner correctly chooses sequential scans. When a table fits in a single 8 KB page, reading the whole thing sequentially is faster than traversing a B-tree index and then fetching individual heap pages. Adding indexes in this regime:
- Does not change the query plan: the planner still picks seq scans because the cost estimate is lower
- Adds planning overhead: the planner spends marginally more time evaluating index-scan paths before discarding them
- Adds write overhead: every INSERT or UPDATE must maintain the B-tree structure and write to the WAL
The p95 latency regression in the index results (180 ms vs 98 ms for the optimized-only run) is likely statistical noise from short 30-second Locust runs — tail latencies vary significantly between runs. But even if there were a real regression on this dataset, it reflects planner overhead on a trivially small table, not a meaningful cost at production scale.
PostgreSQL can only exploit these indexes once the tables grow large enough that a seq scan would require reading multiple pages. At that point, the planner will switch to index scans, and the investment pays off.
Table Hit Heatmap
The heatmap shows read traffic dropping across all tables after the COUNT queries were eliminated.
Findings: Three-Way Benchmark Comparison
SQL Query Metrics
| Metric | Baseline | Optimized | Optimized+Indexes |
|---|---|---|---|
| Total queries | 3,243 | 2,443 | 2,649 |
| Unique patterns | 18 | 12 | 13 |
| Duplicate patterns | 13 | 8 | 8 |
Total queries vary slightly between optimized runs (~200) due to Locust's random task weighting. The structural change is what matters: zero COUNT queries.
The Disappearing COUNT Queries
The waterfall chart shows how 769 COUNT wrapper queries simply vanished from the query profile after the window function fix. The remaining SELECT and selectinload query counts are consistent across runs.
Latency Comparison (10 Users)
| Percentile | Baseline | Optimized | Optimized+Indexes | Δ vs Baseline |
|---|---|---|---|---|
| Median | 17 ms | 14 ms | 14 ms | −18% |
| p90 | 25 ms | 18 ms | 19 ms | −24% |
| p95 | 150 ms | 98 ms | 180 ms | −35% / noise |
| p99 | 230 ms | 170 ms | 230 ms | −26% / noise |
Median dropped from 17 ms to 14 ms consistently. This is the reliable, reproducible win from eliminating one database round-trip per request.
Tail latency (p95/p99) is more volatile across short 30-second runs. The optimized run showed genuine improvement at p95, while the indexes run had higher variance, as discussed in Phase B, indexes don't change query plans on this tiny dataset, and the variance is measurement noise rather than a meaningful regression.
Conclusion & Trade-offs
What We Achieved
| Improvement | Before | After |
|---|---|---|
| SQL queries per paginated request | 4 | 3 (−25%) |
| Median response time | 17 ms | 14 ms (−18%) |
| Missing critical indexes | 3 | 0 |
| COUNT round-trips per request | 1 | 0 (−100%) |
What I Actually Used
Both optimizations are deployed. The window function fix (func.count().over() replacing query.count()) is the immediate, measurable win. It eliminates a full database round-trip per request with zero behavioral change. The three indexes are deployed as forward-looking insurance: they don't improve latency on today's 15-row dataset, but they prevent expensive sequential scans when the system grows.
Trade-offs: Which Optimization Should You Use?
| Optimization | When to Deploy | Cost | Benefit |
|---|---|---|---|
Window function (COUNT() OVER()) |
Immediately | Zero: same API contract, same test expectations | −25% queries, −18% median latency |
| Database indexes | Before hitting production scale (1,000+ rows) | Marginal write overhead on INSERT/UPDATE, B-tree maintenance, WAL amplification | Prevents full-table scans at scale |
| Smarter N+1 detection | Future benchmark improvement | Engineering time to distinguish IN (?+) from plain ? in normalization |
Eliminates false positives from the benchmark report |
The window function is a no-brainer. It has zero downside and produces a clear, measurable improvement. Indexes are a judgment call: on a trivially small dataset they add cost without benefit, but they're essential before the first real user loads 1000+ records. Install them early, verify with EXPLAIN ANALYZE at production scale, and consider dropping any that the planner never uses.
Key Lessons
-
query.count()is a silent bottleneck. ORMs make it invisible, but it's a full re-execution of your filters. Window functions are the escape hatch. -
Batched eager loading is not N+1.
selectinloadwithIN (...)is the solution, and your benchmark needs smarter heuristics to know the difference. - Indexes are insurance, not instant gratification. They won't speed up a 10-row table. They'll save you when data grows.





Top comments (0)