We had a 5% failure rate on long-running tasks. Every error looked the same:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
SSL connection has been closed unexpectedly
This post explains why it happens and the exact configuration that fixed it.
The Setup
- Database: AlloyDB (GCP's PostgreSQL-compatible database)
- ORM: SQLAlchemy 2.0 with asyncio
- Deployment: Cloud Run (auto-scaling, 0-100 instances)
- Workload: Long-running AI evaluation tasks (5-30 minutes per task)
Why Connections Die
AlloyDB (and Cloud SQL) terminates idle SSL connections after a timeout. The exact timeout depends on your configuration, but it's typically 10-15 minutes.
SQLAlchemy's connection pool holds connections open for reuse. Here's the problem:
1. Task A grabs Connection 1 from pool → starts 20-min evaluation
2. Connection 2, 3, 4 sit idle in the pool
3. After 10 min: AlloyDB kills Connections 2, 3, 4 (idle too long)
4. Task B grabs Connection 2 from pool → DEAD → SSL error
The pool doesn't know the connection is dead until you try to use it.
The Fix
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    # 1. Recycle connections before AlloyDB kills them
    pool_recycle=180,     # 3 minutes (AlloyDB timeout is ~10 min)
    # 2. Test connections before handing them to your code
    pool_pre_ping=True,   # SELECT 1 before checkout
    # 3. Don't hoard connections
    pool_size=5,          # Base pool size
    max_overflow=10,      # Allow bursts to 15 total
    # 4. Don't wait forever for a connection
    pool_timeout=30,      # Fail fast if the pool is exhausted
)
Why each setting matters:
pool_recycle=180 -- Forces SQLAlchemy to close and recreate connections every 3 minutes, well before AlloyDB's ~10 minute idle timeout. Yes, this means more connection overhead. No, it doesn't matter compared to your query time.
pool_pre_ping=True -- Before giving a connection to your code, SQLAlchemy runs SELECT 1. If it fails, the connection is discarded and a fresh one is created. This catches connections that died between recycles.
pool_size=5, max_overflow=10 -- Total max connections = 15. With Cloud Run auto-scaling to 100 instances, that's 1,500 potential connections. AlloyDB's limit depends on your tier. We use db-custom-4-16384 which handles this fine, but a db-f1-micro would not.
pool_timeout=30 -- If all 15 connections are in use, wait 30 seconds then raise an error. Better than hanging indefinitely.
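If you want to see these settings working rather than take them on faith, SQLAlchemy's pool events let you log every connection's lifecycle. A sketch, using an in-memory SQLite engine as a stand-in for the real DATABASE_URL (SQLite's default pool doesn't take pool_size/max_overflow, so those are omitted here, but the events fire the same way):

```python
# Sketch: observe pool lifecycle events to verify recycle/pre-ping behavior.
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite://", pool_recycle=180, pool_pre_ping=True)

events = []

@event.listens_for(engine, "connect")
def on_connect(dbapi_conn, connection_record):
    # Fires when the pool opens a brand-new DBAPI connection --
    # including the replacements created after a recycle or a failed ping.
    events.append("connect")

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    # Fires each time your code is handed a pooled connection,
    # after pool_pre_ping has already validated it.
    events.append("checkout")

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

print(events)  # ['connect', 'checkout']
```

In production, counting "connect" events tells you how much churn pool_recycle=180 actually causes; in our experience it's noise compared to query time.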
Cloud Run Gotcha: Concurrency × Instances × Pool Size
This is the formula that will save you:
Max DB connections = containerConcurrency × maxInstances × pool_size
That's the worst case, where every concurrent request can end up holding its own connection. With a single shared engine per instance, the hard cap is actually maxInstances × (pool_size + max_overflow), but budgeting for the worst case is what keeps you out of trouble.
Our initial config:
- containerConcurrency: 80
- maxInstances: 100
- pool_size: 10
- Max connections: 80,000 🔥
After fixing:
- containerConcurrency: 10
- maxInstances: 100
- pool_size: 5
- Max connections: 5,000 ✅
We also capped max_overflow to prevent connection storms during traffic spikes.
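The shared-engine cap is easy to sanity-check before you deploy. A tiny sketch (the function name is ours, not an API; it assumes one shared SQLAlchemy engine per instance, which is where the 1,500 figure earlier comes from):

```python
# Sketch: budget worst-case DB connections for a Cloud Run service,
# assuming one shared SQLAlchemy engine per instance. Each instance can
# then hold at most pool_size + max_overflow connections.
def max_db_connections(max_instances: int, pool_size: int, max_overflow: int) -> int:
    return max_instances * (pool_size + max_overflow)

# The fixed config: 100 instances, pool of 5, overflow of 10.
print(max_db_connections(100, 5, 10))  # 1500
```

Run this against your AlloyDB tier's connection limit before raising maxInstances, not after.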
For Async SQLAlchemy
If you're using AsyncSession:
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    DATABASE_URL.replace("postgresql://", "postgresql+asyncpg://"),
    pool_recycle=180,
    pool_pre_ping=True,
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
Same settings, different driver. asyncpg handles connection validation differently than psycopg2, but pool_pre_ping works with both.
Making Tasks Survive Connection Drops
Even with the pool fix, you should handle connection drops in long-running tasks:
from sqlalchemy.exc import OperationalError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    retry=retry_if_exception_type(OperationalError),
)
async def save_evaluation_result(session, result):
    """Retry on transient connection errors."""
    try:
        session.add(result)
        await session.commit()
    except OperationalError:
        # Reset the failed transaction so the retry starts clean
        await session.rollback()
        raise
But crucially, make this idempotent (see my other post on idempotent Cloud Tasks handlers). If the first attempt committed successfully but the response got lost, the retry shouldn't create a duplicate.
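At the database layer, one way to get that idempotency is PostgreSQL's INSERT ... ON CONFLICT, which AlloyDB and Cloud SQL both support. A sketch with a hypothetical evaluation_results table keyed by the Cloud Tasks task id (table and column names are illustrative):

```python
# Sketch: make the write idempotent with ON CONFLICT DO NOTHING.
# The `evaluation_results` table and its columns are hypothetical.
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
evaluation_results = Table(
    "evaluation_results",
    metadata,
    Column("task_id", String, primary_key=True),  # Cloud Tasks task id
    Column("score", Integer),
)

stmt = (
    insert(evaluation_results)
    .values(task_id="task-123", score=87)
    .on_conflict_do_nothing(index_elements=["task_id"])
)
sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)  # INSERT ... ON CONFLICT (task_id) DO NOTHING
```

If the first attempt committed and the retry fires anyway, the second INSERT is a no-op instead of a duplicate row.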
Monitoring
Add these metrics to catch connection issues early:
import structlog

logger = structlog.get_logger()

@app.middleware("http")
async def log_db_pool_stats(request, call_next):
    pool = engine.pool
    logger.info(
        "db_pool_stats",
        pool_size=pool.size(),
        checked_out=pool.checkedout(),
        overflow=pool.overflow(),
        checked_in=pool.checkedin(),
    )
    return await call_next(request)
When checked_out is consistently at pool_size + max_overflow, you're about to hit connection exhaustion.
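To turn those log lines into an alert, a simple saturation ratio against the pool's hard cap works well (the function and thresholds here are our own convention, not a SQLAlchemy API):

```python
# Sketch: flag connection-exhaustion risk from the pool stats logged above.
def pool_saturation(checked_out: int, pool_size: int, max_overflow: int) -> float:
    """Fraction of the pool's hard cap (pool_size + max_overflow) in use."""
    return checked_out / (pool_size + max_overflow)

# With pool_size=5 and max_overflow=10, the cap is 15 connections.
print(pool_saturation(3, 5, 10))   # 0.2 -- healthy
print(pool_saturation(15, 5, 10))  # 1.0 -- at the cap; new checkouts queue
```

Alerting at sustained saturation above ~0.8 gives you time to react before requests start hitting pool_timeout.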
TL;DR Config
# Copy-paste this for AlloyDB / Cloud SQL + Cloud Run
engine = create_engine(
    DATABASE_URL,
    pool_recycle=180,
    pool_pre_ping=True,
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
And set containerConcurrency in Cloud Run to something reasonable (10-20, not 80).
This took us from 5% random failures to 0.01%, with a few lines of configuration.
Read the full AlloyDB debugging story on my blog. Part of my "Production GCP Patterns" series — find me at humzakt.github.io.