We had a 5% failure rate on long-running tasks. Every error looked the same:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
SSL connection has been closed unexpectedly
This post explains why it happens and the exact configuration that fixed it.
The Setup
- Database: AlloyDB (GCP's PostgreSQL-compatible database)
- ORM: SQLAlchemy 2.0 with asyncio
- Deployment: Cloud Run (auto-scaling, 0-100 instances)
- Workload: Long-running AI evaluation tasks (5-30 minutes per task)
Why Connections Die
AlloyDB (and Cloud SQL) terminates idle SSL connections after a timeout. The exact timeout depends on your configuration, but it's typically 10-15 minutes.
SQLAlchemy's connection pool holds connections open for reuse. Here's the problem:
1. Task A grabs Connection 1 from pool → starts 20-min evaluation
2. Connection 2, 3, 4 sit idle in the pool
3. After 10 min: AlloyDB kills Connections 2, 3, 4 (idle too long)
4. Task B grabs Connection 2 from pool → DEAD → SSL error
The pool doesn't know the connection is dead until you try to use it.
The Fix
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    # 1. Recycle connections before AlloyDB kills them
    pool_recycle=180,     # 3 minutes (AlloyDB timeout is ~10 min)
    # 2. Test connections before handing them to your code
    pool_pre_ping=True,   # SELECT 1 before checkout
    # 3. Don't hoard connections
    pool_size=5,          # Base pool size
    max_overflow=10,      # Allow bursts to 15 total
    # 4. Don't wait forever for a connection
    pool_timeout=30,      # Fail fast if the pool is exhausted
)
Why each setting matters:
pool_recycle=180 -- Forces SQLAlchemy to close and recreate connections every 3 minutes, well before AlloyDB's ~10 minute idle timeout. Yes, this means more connection overhead. No, it doesn't matter compared to your query time.
pool_pre_ping=True -- Before giving a connection to your code, SQLAlchemy runs SELECT 1. If it fails, the connection is discarded and a fresh one is created. This catches connections that died between recycles.
pool_size=5, max_overflow=10 -- Total max connections = 15. With Cloud Run auto-scaling to 100 instances, that's 1,500 potential connections. AlloyDB's limit depends on your tier. We use db-custom-4-16384 which handles this fine, but a db-f1-micro would not.
pool_timeout=30 -- If all 15 connections are in use, wait 30 seconds then raise an error. Better than hanging indefinitely.
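If you want to see these settings working rather than take them on faith, SQLAlchemy's pool events let you log every connection's lifecycle. A sketch, using an in-memory SQLite engine as a stand-in for the real DATABASE_URL (SQLite's default pool doesn't take pool_size/max_overflow, so those are omitted here, but the events fire the same way):

```python
# Sketch: observe pool lifecycle events to verify recycle/pre-ping behavior.
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite://", pool_recycle=180, pool_pre_ping=True)

events = []

@event.listens_for(engine, "connect")
def on_connect(dbapi_conn, connection_record):
    # Fires when the pool opens a brand-new DBAPI connection --
    # including the replacements created after a recycle or a failed ping.
    events.append("connect")

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    # Fires each time your code is handed a pooled connection,
    # after pool_pre_ping has already validated it.
    events.append("checkout")

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

print(events)  # ['connect', 'checkout']
```

In production, counting "connect" events tells you how much churn pool_recycle=180 actually causes; in our experience it's noise compared to query time.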
Cloud Run Gotcha: Concurrency × Instances × Pool Size
This is the formula that will save you:
Max DB connections = containerConcurrency × maxInstances × pool_size
That's the worst case, where every concurrent request can end up holding its own connection. With a single shared engine per instance, the hard cap is actually maxInstances × (pool_size + max_overflow), but budgeting for the worst case is what keeps you out of trouble.
Our initial config:
- containerConcurrency: 80
- maxInstances: 100
- pool_size: 10
- Max connections: 80,000 🔥
After fixing:
- containerConcurrency: 10
- maxInstances: 100
- pool_size: 5
- Max connections: 5,000 ✅
We also capped max_overflow to prevent connection storms during traffic spikes.
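The shared-engine cap is easy to sanity-check before you deploy. A tiny sketch (the function name is ours, not an API; it assumes one shared SQLAlchemy engine per instance, which is where the 1,500 figure earlier comes from):

```python
# Sketch: budget worst-case DB connections for a Cloud Run service,
# assuming one shared SQLAlchemy engine per instance. Each instance can
# then hold at most pool_size + max_overflow connections.
def max_db_connections(max_instances: int, pool_size: int, max_overflow: int) -> int:
    return max_instances * (pool_size + max_overflow)

# The fixed config: 100 instances, pool of 5, overflow of 10.
print(max_db_connections(100, 5, 10))  # 1500
```

Run this against your AlloyDB tier's connection limit before raising maxInstances, not after.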
For Async SQLAlchemy
If you're using AsyncSession:
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    DATABASE_URL.replace("postgresql://", "postgresql+asyncpg://"),
    pool_recycle=180,
    pool_pre_ping=True,
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
Same settings, different driver. asyncpg handles connection validation differently than psycopg2, but pool_pre_ping works with both.
Making Tasks Survive Connection Drops
Even with the pool fix, you should handle connection drops in long-running tasks:
from sqlalchemy.exc import OperationalError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    retry=retry_if_exception_type(OperationalError),
)
async def save_evaluation_result(session, result):
    """Retry on transient connection errors."""
    try:
        session.add(result)
        await session.commit()
    except OperationalError:
        # Reset the failed transaction so the retry starts clean
        await session.rollback()
        raise
But crucially, make this idempotent (see my other post on idempotent Cloud Tasks handlers). If the first attempt committed successfully but the response got lost, the retry shouldn't create a duplicate.
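At the database layer, one way to get that idempotency is PostgreSQL's INSERT ... ON CONFLICT, which AlloyDB and Cloud SQL both support. A sketch with a hypothetical evaluation_results table keyed by the Cloud Tasks task id (table and column names are illustrative):

```python
# Sketch: make the write idempotent with ON CONFLICT DO NOTHING.
# The `evaluation_results` table and its columns are hypothetical.
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
evaluation_results = Table(
    "evaluation_results",
    metadata,
    Column("task_id", String, primary_key=True),  # Cloud Tasks task id
    Column("score", Integer),
)

stmt = (
    insert(evaluation_results)
    .values(task_id="task-123", score=87)
    .on_conflict_do_nothing(index_elements=["task_id"])
)
sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)  # INSERT ... ON CONFLICT (task_id) DO NOTHING
```

If the first attempt committed and the retry fires anyway, the second INSERT is a no-op instead of a duplicate row.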
Monitoring
Add these metrics to catch connection issues early:
import structlog

logger = structlog.get_logger()

@app.middleware("http")
async def log_db_pool_stats(request, call_next):
    pool = engine.pool
    logger.info(
        "db_pool_stats",
        pool_size=pool.size(),
        checked_out=pool.checkedout(),
        overflow=pool.overflow(),
        checked_in=pool.checkedin(),
    )
    return await call_next(request)
When checked_out is consistently at pool_size + max_overflow, you're about to hit connection exhaustion.
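To turn those log lines into an alert, a simple saturation ratio against the pool's hard cap works well (the function and thresholds here are our own convention, not a SQLAlchemy API):

```python
# Sketch: flag connection-exhaustion risk from the pool stats logged above.
def pool_saturation(checked_out: int, pool_size: int, max_overflow: int) -> float:
    """Fraction of the pool's hard cap (pool_size + max_overflow) in use."""
    return checked_out / (pool_size + max_overflow)

# With pool_size=5 and max_overflow=10, the cap is 15 connections.
print(pool_saturation(3, 5, 10))   # 0.2 -- healthy
print(pool_saturation(15, 5, 10))  # 1.0 -- at the cap; new checkouts queue
```

Alerting at sustained saturation above ~0.8 gives you time to react before requests start hitting pool_timeout.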
TL;DR Config
# Copy-paste this for AlloyDB / Cloud SQL + Cloud Run
engine = create_engine(
    DATABASE_URL,
    pool_recycle=180,
    pool_pre_ping=True,
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
And set containerConcurrency in Cloud Run to something reasonable (10-20, not 80).
This took us from 5% random failures to 0.01%, with a few lines of configuration.
Read the full AlloyDB debugging story on my blog. Part of my "Production GCP Patterns" series — find me at humzakt.github.io.