Most FastAPI performance problems aren't caused by FastAPI itself. They're caused by architectural issues - N+1 database queries, missing indexes, poor caching strategies. I covered these bigger problems in my previous post about Python's speed, and fixing those will give you 10-100x performance improvements.
But let's say you've already optimized your architecture. Your database queries are efficient, you're caching properly, and you're using async operations correctly. There are still FastAPI-specific optimizations that can give you meaningful performance gains - usually 20-50% improvements that add up.
Here's the thing: these optimizations won't save a badly designed system, but they can make a well-designed system significantly faster. Think of them as the final polish on an already efficient architecture.
TL;DR:
- Install uvloop and httptools for faster event loops and HTTP parsing
- Use ORJSONResponse for 20-50% faster JSON serialization
- Choose async vs sync functions based on workload - I/O operations should be async, CPU-intensive work can be sync
- Optimize Pydantic models with performance-focused configuration
- Cache expensive dependencies to avoid repeated computations
- Stream large responses to reduce memory usage by 80-90%
- Increase thread pool size only if you must use sync operations
- Use background tasks so users don't wait for non-critical operations
- Run multiple workers in production to utilize all CPU cores
- Enable GZip compression for large responses
- Use FastAPI CLI production mode instead of development settings
- Implement Pure ASGI middleware instead of BaseHTTPMiddleware
- Avoid duplicate validation between request parsing and response models
Not Installing uvloop and httptools
An event loop is the heart of async programming - it's a continuous process that monitors and dispatches events and tasks. Think of it like a traffic controller at a busy intersection, deciding which car goes next and managing the flow of traffic. In Python's case, the event loop manages all your async operations: when to start database queries, when to process HTTP requests, when to handle responses from external APIs.
Python's default event loop (asyncio) is written in Python itself and does extensive safety checking. Every time it schedules a task or handles I/O, it's running interpreted Python code with all the associated overhead.
The uvloop package replaces Python's event loop with one written in Cython (compiled to C) and based on libuv - the same battle-tested C library that powers Node.js. Instead of interpreted Python managing your async operations, you get optimized C code handling the scheduling and I/O management.
The performance difference is huge under high concurrency. With 10 concurrent requests, you might not notice much difference, but with 1000+ concurrent requests (quite common in production), uvloop can handle 2-4x more throughput because the overhead of managing all those concurrent operations is reduced.
Beyond this, every HTTP request needs parsing - extracting headers, decoding the body, handling encoding. Uvicorn's default HTTP parser (h11) prioritizes correctness and handles edge cases gracefully, but it's pure Python code. httptools uses a C-based HTTP parser that's 40% faster but less forgiving of malformed requests - perfect for production where you control the client requests.
Applying these two changes is as simple as:
uv add uvloop httptools
The beauty is that Uvicorn automatically detects and uses them if installed - no code changes needed.
Note: uvloop doesn't work on Windows.
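If you'd rather be explicit than rely on auto-detection, Uvicorn also lets you pin both implementations when you start it programmatically - a small sketch, assuming the usual main:app layout:
import uvicorn

if __name__ == "__main__":
    # Explicitly select the uvloop event loop and httptools parser
    uvicorn.run("main:app", host="0.0.0.0", port=8000, loop="uvloop", http="httptools")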
Using Python's Default JSON Encoder
Every time your FastAPI endpoint returns data, Python needs to convert your objects (dictionaries, lists, custom classes) into a JSON string that can be sent over HTTP. This process involves walking through your entire data structure, inspecting each value to determine its type, and converting it to the appropriate JSON representation.
Python's built-in json module is thorough but slow. For every single value, it asks: "Is this a string? A number? A boolean? A nested object?" It handles every edge case Python can throw at it, includes extensive type checking, and manages encoding issues. All this safety comes at a performance cost.
ORJSON is written in Rust with aggressive performance optimizations. It uses SIMD (Single Instruction, Multiple Data) instructions to process multiple values simultaneously, has specialized fast paths for common data types, and does minimal error checking.
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse
app = FastAPI(default_response_class=ORJSONResponse)
@app.get("/data")
async def get_data():
    return {"message": "hello", "items": [1, 2, 3]}
The improvement is most dramatic for large responses with deeply nested data structures, APIs returning arrays with hundreds or thousands of items, responses with many numbers or dates. For small responses (under 1KB), the difference is minimal but there's no downside to using ORJSON.
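If you'd rather opt in per endpoint instead of changing the app-wide default, the path operation decorator accepts a response_class argument (the route below is just an example):
from fastapi.responses import ORJSONResponse

@app.get("/heavy-report", response_class=ORJSONResponse)
async def get_heavy_report():
    # Only this route uses ORJSON; other routes keep the default JSONResponse
    return {"rows": [{"id": i, "value": i * 1.5} for i in range(1000)]}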
Mixing Async and Sync Functions Incorrectly
This is critical and commonly misunderstood. The choice between async def and def fundamentally changes how FastAPI handles your endpoint, and getting it wrong can destroy performance.
How FastAPI handles each approach:
- async def functions run in the main event loop alongside other requests
- def functions are offloaded to a separate thread pool (limited to 40 threads by default)
When to use async def:
- I/O-bound operations: database calls, HTTP requests, file operations
- Operations that call other async functions
- Light computational work: JSON parsing, data validation, simple transformations
When to use regular def:
- CPU-intensive operations: image processing, heavy calculations, data analysis
- Calling libraries that aren't async-compatible and do significant work
- Operations that would benefit from running on a separate CPU core
import httpx
# GOOD - async for I/O operations
@app.get("/users/{user_id}")
async def get_user(user_id: int):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.example.com/users/{user_id}")
    return response.json()
# GOOD - sync for CPU-intensive work
@app.post("/process-data")
def process_data(data: bytes):
    # CPU-intensive processing
    result = heavy_computation(data)
    return {"result": result}
# BAD - blocking the event loop with CPU work
@app.post("/process-data-bad")
async def process_data_bad(data: bytes):
    # This blocks the entire event loop!
    result = heavy_computation(data)
    return {"result": result}
Using async database drivers like asyncpg (PostgreSQL) or aiomysql (MySQL) with async endpoints can provide 3-5x better throughput under concurrent load. This is because your entire request pipeline becomes truly asynchronous - no blocking operations.
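A minimal sketch of what that looks like with asyncpg, assuming a PostgreSQL database reachable through a DSN of your own - the table and column names here are purely illustrative:
import asyncpg
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create one connection pool for the whole application lifetime
    app.state.pool = await asyncpg.create_pool("postgresql://user:password@localhost/mydb")
    yield
    await app.state.pool.close()

app = FastAPI(lifespan=lifespan)

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # The await points let the event loop serve other requests while waiting on the database
    async with app.state.pool.acquire() as conn:
        row = await conn.fetchrow("SELECT id, total FROM orders WHERE id = $1", order_id)
    return dict(row) if row else {"detail": "not found"}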
With 40 default threads, if you have 41 users hitting sync endpoints simultaneously, one user waits for a thread to become available. With 1000 concurrent users hitting sync endpoints, 960 are queued waiting for threads - creating massive delays.
Overusing Pydantic Models Throughout Your Application
Pydantic is excellent for data validation at API boundaries, but using it everywhere in your application creates significant performance overhead that many developers don't realize.
The problem is subtle: Pydantic models look like regular Python classes, so it's tempting to use them as your primary data structures throughout your application. This creates what's known as "serialization/deserialization debt" - you're paying validation and conversion costs even when you don't need validation:
- Pydantic object creation is 6.5x slower than Python dataclasses
- Memory usage is 2.5x higher due to validation metadata storage
- JSON operations are 1.5x slower across serialization and deserialization
This overhead compounds quickly. If you're creating thousands of objects during request processing, using Pydantic models internally can add significant latency.
Use Pydantic only at service boundaries:
from dataclasses import dataclass
from datetime import datetime

from pydantic import BaseModel

# Good - Pydantic for API validation
class UserRequest(BaseModel):
    name: str
    email: str
    age: int

# Good - Dataclass for internal processing
@dataclass
class UserInternal:
    name: str
    email: str
    age: int
    created_at: datetime
    processed: bool = False

@app.post("/users")
async def create_user(user_request: UserRequest):
    # Pydantic validates the incoming request
    # Convert to internal dataclass for processing
    user = UserInternal(
        name=user_request.name,
        email=user_request.email,
        age=user_request.age,
        created_at=datetime.utcnow()
    )
    # Do internal processing with the lightweight dataclass;
    # assume process_user persists the user and returns its new id
    user_id = process_user(user)
    return {"id": user_id}

# Bad - Using Pydantic everywhere
class UserPydantic(BaseModel):
    name: str
    email: str
    age: int
    created_at: datetime
    processed: bool = False

def process_user(user: UserPydantic):
    # Every object creation/access pays validation overhead
    updated_user = UserPydantic(
        **user.model_dump(exclude={"processed"}),
        processed=True
    )
    return updated_user
The rule of thumb: validate once at the boundary, then use lightweight structures internally. This gives you the safety of validation where it matters without paying the performance cost throughout your application.
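If you want to check the overhead on your own models, a rough timeit comparison along these lines will show the gap on your hardware - the numbers vary by machine and model complexity:
import timeit
from dataclasses import dataclass
from pydantic import BaseModel

class UserModel(BaseModel):
    name: str
    email: str
    age: int

@dataclass
class UserData:
    name: str
    email: str
    age: int

payload = {"name": "Ada", "email": "ada@example.com", "age": 36}

# Time 100k object creations for each approach
pydantic_time = timeit.timeit(lambda: UserModel(**payload), number=100_000)
dataclass_time = timeit.timeit(lambda: UserData(**payload), number=100_000)
print(f"Pydantic: {pydantic_time:.3f}s, dataclass: {dataclass_time:.3f}s")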
Not Optimizing Pydantic Models for Performance
Pydantic models are validated on every request, so optimizing them provides immediate performance benefits for high-traffic endpoints.
from typing import Optional

from fastapi import Depends
from pydantic import BaseModel, Field

class OptimizedModel(BaseModel):
    model_config = {
        "validate_assignment": False,
        "str_strip_whitespace": True,
        "validate_default": False
    }

    name: str = Field(min_length=1, max_length=100)
    email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$')
    price: float = Field(gt=0)

# For query parameters, use models instead of individual parameters
class SearchParams(BaseModel):
    q: str
    limit: int = Field(10, ge=1, le=100)
    offset: int = Field(0, ge=0)
    sort: Optional[str] = None

@app.get("/search")
async def search(params: SearchParams = Depends()):
    return await search_database(params)
These config options reduce validation overhead for frequently-created models. Using models for query parameters is faster than individual Query() parameters because Pydantic can optimize the validation pipeline for the entire model at once.
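Relatedly, if some data has already been validated elsewhere (for example it comes straight from your own database layer), Pydantic v2's model_construct() builds an instance without running validation at all. A short sketch, to be used only for input you fully trust:
# Skips validation entirely - only safe for data you already trust
trusted_row = {"name": "Widget", "email": "sales@example.com", "price": 9.99}
product = OptimizedModel.model_construct(**trusted_row)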
Not Caching Expensive Dependencies
FastAPI automatically caches dependency results within a single request, but you can also create dependencies that persist across requests.
If multiple parts of your endpoint need the same dependency with identical parameters, FastAPI creates it once and reuses it within that request. For expensive resources that don't change often, you can create singleton dependencies that live for your entire application lifetime.
from functools import lru_cache
from fastapi import Depends

@lru_cache()
def get_settings():
    return {"api_key": "secret", "timeout": 30}

@app.get("/data")
async def get_data(settings = Depends(get_settings)):
    return {"config": settings}
A singleton ensures only one instance of a resource exists during your application's lifetime. In FastAPI, @lru_cache() without parameters creates singleton-like behavior - the function runs once, and all subsequent calls return the cached result.
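For resources that need explicit setup and teardown (connection pools, HTTP clients), another option is to create them once in the lifespan handler and hand them out through a tiny dependency. A sketch, assuming a shared httpx.AsyncClient is the expensive resource:
from contextlib import asynccontextmanager
import httpx
from fastapi import Depends, FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Created once at startup, closed once at shutdown
    app.state.http_client = httpx.AsyncClient(timeout=30)
    yield
    await app.state.http_client.aclose()

app = FastAPI(lifespan=lifespan)

def get_http_client(request: Request) -> httpx.AsyncClient:
    # Returns the shared client instead of building a new one per request
    return request.app.state.http_client

@app.get("/external-status")
async def external_status(client: httpx.AsyncClient = Depends(get_http_client)):
    response = await client.get("https://api.example.com/status")
    return response.json()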
Not Streaming Large Responses
For large datasets, loading everything into memory and returning it at once can exhaust server memory and create terrible user experience.
from fastapi.responses import StreamingResponse
import json

@app.get("/large-data")
async def stream_data():
    def generate():
        yield '{"items": ['
        for i in range(10000):
            if i > 0:
                yield ','
            yield json.dumps({"id": i, "name": f"item_{i}"})
        yield ']}'
    return StreamingResponse(generate(), media_type="application/json")
Instead of loading 10,000 records into memory (potentially 100MB+), streaming processes them one by one, keeping memory usage under 1MB. This is a 99% reduction in memory usage.
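StreamingResponse also accepts async generators, which is the natural fit when the rows come from an async source. A sketch, where fetch_rows_in_batches is a hypothetical helper that yields batches of dicts from your database:
import json
from fastapi.responses import StreamingResponse

async def row_stream():
    yield '{"items": ['
    first = True
    async for batch in fetch_rows_in_batches(batch_size=500):  # hypothetical async DB helper
        for row in batch:
            if not first:
                yield ','
            first = False
            yield json.dumps(row)
    yield ']}'

@app.get("/large-data-from-db")
async def stream_from_db():
    return StreamingResponse(row_stream(), media_type="application/json")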
Using Small Thread Pool for Sync Operations
When you use a regular def function in FastAPI, it doesn't run in the main event loop. Instead, it runs in a thread pool - a collection of worker threads that handle synchronous operations. By default, FastAPI provides only 40 threads.
If 41 users hit a sync endpoint simultaneously, one waits for a thread. With 1000 concurrent users, 960 are stuck waiting. This creates massive response time degradation.
If you must use sync operations (legacy database drivers, CPU-bound tasks, third-party libraries without async support), you need more threads.
import anyio
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Raise the shared thread pool limit from the default 40 to 100
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = 100
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/process-image")
def process_image(image_data: bytes):
    return expensive_image_processing(image_data)
Convert to async when possible. An async endpoint that offloads blocking calls with asyncio.to_thread() keeps the event loop responsive and isn't bound by Starlette's shared 40-thread limiter, since it uses asyncio's own default executor instead.
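A sketch of that pattern, reusing the placeholder expensive_image_processing from above - the endpoint stays async and only the blocking call is pushed to a thread:
import asyncio

@app.post("/process-image-async")
async def process_image_async(image_data: bytes):
    # Offload the blocking call so the event loop keeps serving other requests
    result = await asyncio.to_thread(expensive_image_processing, image_data)
    return {"result": result}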
Making Users Wait for Background Work
Background tasks let you queue work that runs after the HTTP response is sent. Users get their response immediately while non-essential operations happen in the background.
Instead of a 3-second response (1s user creation + 2s email sending), users get a 1-second response while the email sends behind the scenes.
from fastapi import BackgroundTasks

def send_email(email: str):
    # Email sending logic
    print(f"Sending email to {email}")

@app.post("/users")
async def create_user(email: str, background_tasks: BackgroundTasks):
    # Save user
    user_id = 123
    # Send email in background
    background_tasks.add_task(send_email, email)
    return {"user_id": user_id}
Good use cases include email notifications and lightweight file processing. For heavy operations (video processing, large imports), use Celery or ARQ instead.
Running on a Single Worker in Production
A worker is an independent process running your FastAPI application. By default, you get one worker using one CPU core, even on servers with 8+ cores.
Your FastAPI process runs on one CPU core. On a 4-core server, you're using 25% of available processing power. Under load, that single core becomes a bottleneck. Each worker is an independent FastAPI process. With 4 workers on 4 cores, you can handle 4x more concurrent requests.
# Single worker (default) - uses one CPU core
uvicorn main:app --host 0.0.0.0 --port 8000
# Multiple workers - uses multiple CPU cores
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000
# Better: Use Gunicorn to manage Uvicorn workers
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000
# Modern approach: FastAPI CLI
fastapi run main.py --workers 4
Gunicorn adds process management, automatic worker restarts on crashes, graceful shutdowns, better resource monitoring.
To choose a worker count, you can follow this process (a starting-point sketch follows the list):
- Start with one worker per CPU core
- Monitor under realistic load
- I/O-heavy apps may benefit from more workers than cores
- CPU-heavy apps can suffer from too many workers (context switching overhead)
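As a starting point for the first step, deriving the initial count from the machine is enough - the value below is only a first guess to refine under real load:
import os

# One worker per CPU core as an initial guess; tune after load testing
workers = os.cpu_count() or 1
print(f"fastapi run main.py --workers {workers}")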
Not Compressing Large Responses
GZip finds repeated patterns in your JSON and replaces them with shorter references. JSON is highly compressible because it contains repeated keys, similar values, and predictable structure.
A 500KB API response might compress to 80KB (84% smaller), turning a 2-second download on slow connections into 0.3 seconds.
from fastapi.middleware.gzip import GZipMiddleware
app = FastAPI()
app.add_middleware(GZipMiddleware, minimum_size=1000)
@app.get("/large-data")
async def get_large_data():
    return {"data": [{"id": i, "name": f"Item {i}"} for i in range(100)]}
Compressing tiny responses actually makes them larger due to compression overhead. The 1000-byte threshold ensures only responses that benefit get compressed. GZip helps most with large JSON arrays or objects, text-heavy API responses, repeated data structures, it does not help with already compressed data (images, videos), very small responses or binary data.
Using Development Settings in Production
Use fastapi run, which automatically optimizes for production environments.
# Development mode - verbose logging, auto-reload, localhost only
fastapi dev main.py
# Production mode - optimized logging, all interfaces, no reload
fastapi run main.py
# Production with multiple workers
fastapi run main.py --workers 4
fastapi run optimizes:
- Logging: Reduces verbose development logs
- Host binding: Listens on 0.0.0.0 (all interfaces) vs 127.0.0.1 (localhost)
- Auto-reload: Disabled (expensive and unnecessary in production)
- Security headers: Better defaults
Using Slow BaseHTTPMiddleware
BaseHTTPMiddleware has performance overhead due to how it wraps requests and responses. For high-traffic applications, Pure ASGI middleware provides 40% better performance.
import time
from fastapi import Request
from starlette.types import ASGIApp, Scope, Receive, Send

# Slower: BaseHTTPMiddleware approach
@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response

# Faster: Pure ASGI middleware
class ProcessTimeMiddleware:
    def __init__(self, app: ASGIApp):
        self.app = app

    async def __call__(self, scope: Scope, receive: Receive, send: Send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start_time = time.time()

        async def send_with_time(message):
            if message["type"] == "http.response.start":
                process_time = time.time() - start_time
                headers = list(message.get("headers", []))
                headers.append((b"x-process-time", str(process_time).encode()))
                message["headers"] = headers
            await send(message)

        await self.app(scope, receive, send_with_time)

app.add_middleware(ProcessTimeMiddleware)
Pure ASGI middleware is more complex but avoids BaseHTTPMiddleware's request/response wrapping overhead.
Validating Data Multiple Times
If you're using type hints or response_model, there's no need to manually create Pydantic models in your endpoint - FastAPI handles this automatically, and doing both causes double validation.
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str

# BAD - double validation
@app.get("/users/{user_id}")
async def get_user_bad(user_id: int) -> User:
    user_data = {"id": user_id, "name": "John"}
    user = User(**user_data)  # First validation
    return user  # Second validation by FastAPI

# GOOD - single validation
@app.get("/users/{user_id}")
async def get_user_good(user_id: int) -> User:
    user_data = {"id": user_id, "name": "John"}
    return user_data  # Only one validation by FastAPI
If you have a type hint or response_model, return raw data (dicts, database objects) and let FastAPI handle model creation. Double validation can add 20-50% overhead to response processing.
Monitor Your Optimizations
Don't guess - measure the impact of your optimizations:
import time
import logging

from fastapi import Request

logger = logging.getLogger(__name__)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    logger.info(f"{request.method} {request.url.path} - {process_time:.4f}s")
    return response
Focus on optimizing endpoints that handle the most traffic or take the longest to process.
The Reality Check
These optimizations provide meaningful performance improvements - typically 20-50% for well-architected applications. But they won't save you from fundamental design problems.
If your API is slow because of N+1 database queries, missing indexes, or poor caching, fix those first. They'll give you 10-100x improvements that dwarf any FastAPI-specific optimizations.
Architecture problems will kill your performance. FastAPI optimizations will enhance your performance.
Get the big stuff right first, then polish with these techniques.
Want a production-ready FastAPI setup with all these optimizations built-in? Check out FastroAI - it includes everything from this post plus authentication, database integration, and deployment configuration.
References:
- 101 FastAPI Tips by The FastAPI Expert
- Performance Tips by The FastAPI Expert
- Pydantic Is All You Need for Poor Performance Spaghetti Code by Han Lee
Originally published at the FastroAI Blog.