Most FastAPI performance issues aren't caused by the framework - they're caused by architecture, blocking I/O, and database query patterns.
I refactored a FastAPI backend that was stuck at ~180 requests/sec with p95 latency over 4 seconds. After a series of changes, it handled ~1300 requests/sec at under 200ms p95 - on the same hardware.
No vertical scaling. No extra cloud spend. Just removing bottlenecks.
## The Starting Point
The system had grown fast. Speed was prioritized over structure - until it wasn’t.
By the time performance became a problem, the backend had 14+ microservices.
In practice:
- Auth logic duplicated across 6 services
- Each service maintained its own DB connection pool
- A single request triggered 4–5 internal API hops
- Middleware inconsistently applied
The latency wasn’t coming from slow code. It was coming from the architecture.
## Fix 1: Kill the Service Fragmentation
14+ repos → 4 domain-focused services:
| Before | After |
|---|---|
| auth, token, session | identity-service |
| report, export, pdf | jobs-service |
| user, profile, prefs | user-service |
| scattered | core-api |
Before:

```
Client → core-api → auth → user → report → export
```

After:

```
Client → core-api → identity / user / jobs
```
Result: internal hops per request dropped from ~4 to ~1, a ~35% latency reduction.
## Fix 2: The Stack Wasn't Actually Async
```python
@app.get("/users/{user_id}")
async def get_user(user_id: int):
    result = db.execute(...)  # synchronous driver call: blocks the event loop
    return result
```
Async endpoint ≠ async execution.
Fix:

- `asyncpg` instead of `psycopg2`
- `httpx` instead of `requests`

```python
async with httpx.AsyncClient() as client:
    result = await client.get(...)
```
Result: ~3x worker concurrency
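Why a single blocking call hurts so much is easy to demonstrate with a self-contained sketch, no FastAPI required: ten handlers that call `time.sleep` run one after another because they hold the event loop, while ten that `await asyncio.sleep` overlap. (The handler names here are illustrative, not from the original codebase.)

```python
import asyncio
import time

async def blocking_handler():
    time.sleep(0.05)  # sync call: holds the event loop, nothing else runs

async def async_handler():
    await asyncio.sleep(0.05)  # yields to the loop while waiting

async def timed(handler, n=10):
    # run n copies of the handler "concurrently" and measure wall time
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start

blocking = asyncio.run(timed(blocking_handler))   # ~0.5s: fully serialized
overlapped = asyncio.run(timed(async_handler))    # ~0.05s: truly concurrent
print(f"blocking={blocking:.2f}s overlapped={overlapped:.2f}s")
```

The same serialization happens inside a worker when a sync DB driver sits in an `async def` endpoint.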
## Fix 3: Remove Heavy Work from Requests
Problem:
- Emails
- PDFs
- Webhooks
All inside request lifecycle.
Fix:

```python
send_email.delay(order_id)
generate_invoice.delay(order_id)
```
Rule:
If the user doesn't need it before the 200 OK → move it out.
Result:
800ms → 80ms endpoints
## Fix 4: Fix the Database
### N+1 Queries
```python
# Before: one query per user
for user_id in user_ids:
    await db.fetchrow(...)

# After: one query for the whole batch
await db.fetch("SELECT ... WHERE id = ANY($1)", user_ids)
```
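The difference is easy to see with a self-contained stand-in (stdlib `sqlite3` here; the production code used asyncpg against Postgres, where `= ANY($1)` plays the role of `IN (...)`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 6)])

user_ids = [1, 3, 5]

# N+1 pattern: one round trip per id
n_plus_one = [conn.execute("SELECT id, name FROM users WHERE id = ?",
                           (uid,)).fetchone()
              for uid in user_ids]

# Batched: one round trip for the whole list
marks = ",".join("?" * len(user_ids))
batched = conn.execute(
    f"SELECT id, name FROM users WHERE id IN ({marks}) ORDER BY id",
    user_ids,
).fetchall()
```

Same rows either way; the batched version just pays the network round trip once instead of once per id.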
### Missing Index
```sql
CREATE INDEX idx_events_user_created
ON events (user_id, created_at DESC);
```
### Overfetching
Selected only the required columns instead of `SELECT *`.
Result:
- Query time ↓ 60–70%
- DB handled ~4x load
## Fix 5: Cache What Doesn't Change
```python
cached = await redis.get(key)
if cached:
    return cached  # cache hit

value = ...  # cache miss: fetch from the DB
await redis.setex(key, 300, value)  # cache for 5 minutes
return value
```
Result:
~90% reduction in DB hits
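The cache-aside pattern above can be sketched end to end with an in-memory dict standing in for Redis (all names here are illustrative; `db_hits` just counts how often the "database" is touched):

```python
import asyncio
import json
import time

db_hits = 0
cache: dict = {}  # in-memory stand-in for Redis

async def fetch_profile_from_db(user_id: int) -> dict:
    global db_hits
    db_hits += 1  # each call simulates one DB query
    return {"id": user_id, "name": f"user{user_id}"}

async def get_profile(user_id: int, ttl: float = 300.0) -> dict:
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry[0] > time.monotonic():
        return json.loads(entry[1])               # cache hit
    value = await fetch_profile_from_db(user_id)  # cache miss
    cache[key] = (time.monotonic() + ttl, json.dumps(value))
    return value

async def demo():
    first = await get_profile(7)   # miss: hits the DB
    second = await get_profile(7)  # hit: served from cache
    return first, second

first, second = asyncio.run(demo())
```

Two requests, one DB query; that ratio is where the ~90% reduction comes from on read-heavy, slow-changing data.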
## Fix 6: Runtime Tuning (Last)
- `uvloop` as the event loop
- `httptools` for HTTP parsing
- worker-count tuning
Impact: ~10–15%
Architecture fixes gave ~85% of gains.
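For reference, a typical invocation (the module path `app.main:app` is illustrative; `uvicorn[standard]` pulls in both `uvloop` and `httptools`):

```shell
pip install "uvicorn[standard]"
uvicorn app.main:app --workers 4 --loop uvloop --http httptools
```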
## Final Numbers
(4 vCPU / 8GB, k6 load test)
| Metric | Before | After |
|---|---|---|
| RPS | ~180 | ~1300 |
| p95 latency | ~4200ms | ~180ms |
| DB queries per request | 14 | 2 |
| Services | 14+ | 4 |
Production traffic:
~900–1400 req/sec depending on load
## What Breaks Next
At ~1500 RPS:
- DB connection pool saturation
- Celery backlog
- Redis CPU spikes
Next steps:
- read replicas
- queue sharding
- rate limiting
## What Actually Matters
Order matters:

1. Architecture
2. Async correctness
3. Background work
4. Database
5. Caching
6. Runtime tuning
Most scaling problems aren’t framework problems.
They’re architecture and DB problems.
## Before You Go
If this helped, share it with one engineer hitting the same bottleneck.
🔗 LinkedIn: https://www.linkedin.com/in/winsongr/
🐦 X: https://x.com/winsongr
💻 GitHub: https://github.com/winsongr