The Code Is Fine, but Requests Queue Until They Time Out: Puma, Pools, CDN

#rails #performance #puma #postgres

Rails Performance: Lessons from Production — #8

The previous six posts optimized the code — queries, caching, background work, the app layer. But sometimes every request is fast on its own, and yet under concurrency the whole batch slows down or times out. The problem isn't the code, it's how the machine is configured: how many Puma workers/threads, whether the connection pool is big enough, whether static assets go through a CDN. This post covers that last layer.

💥 A single request is 50ms, but 100 at once time out

Load testing turned up something odd: one request alone takes 50ms, but when 100 arrive at once the later ones queue, and some time out outright. The code didn't get slower; the app simply can't handle that many requests at the same time.

That's the infrastructure layer — how many requests your app can serve in parallel is decided by Puma's config, not by your code.

🧵 Puma: workers (processes) × threads

Puma is Rails' default web server, and its concurrency comes from two dimensions:

workers (processes): controlled by WEB_CONCURRENCY. Each worker is a separate Ruby process and uses its own slice of memory.
threads: each worker runs N threads, handling N requests at once.

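# config/puma.rb
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))   # same variable database.yml reads
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))           # number of processes (default 2)
threads max_threads, max_threads                           # threads per worker (default 5)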

Requests handled in parallel ≈ workers × threads. With the config above, that's 2 × 5 = 10. The 100 simultaneous requests from the opening far exceed that, so the rest queue.
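To put rough numbers on that queue, here is a back-of-the-envelope sketch. It is an approximation, not a model of Puma's real scheduler: it assumes all requests arrive together, each takes exactly 50ms, and every thread is a fully parallel slot (which the GVL makes optimistic). The helper names are made up for this post.

```ruby
# Rough queueing arithmetic under the assumptions stated above.
def parallel_slots(workers:, threads:)
  workers * threads
end

def worst_case_latency_ms(requests:, slots:, service_ms:)
  waves = (requests.to_f / slots).ceil   # batches needed to drain the burst
  waves * service_ms                     # the last batch finishes after this long
end

slots = parallel_slots(workers: 2, threads: 5)
worst = worst_case_latency_ms(requests: 100, slots: slots, service_ms: 50)
puts "slots=#{slots} worst_case=#{worst}ms"
# prints: slots=10 worst_case=500ms
```

So a 50ms endpoint can look like a 500ms endpoint to the unlucky requests at the back of the line, and anything with a tighter timeout than that fails.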

How to tune:

Worker count: each worker uses its own memory, so the real ceiling on workers is usually memory, not CPU cores. For CPU-bound work, start from the core count; an IO-bound Rails app often runs more workers than cores. Load test to find the right number.
Thread count: within one process, MRI executes Ruby code on only one thread at a time (the GVL, formerly "GIL"), so for pure CPU work more threads help little. But the moment a thread enters blocking IO (DB, external API) it releases the GVL, letting another thread run in that gap. So for a Rails app that spends lots of time waiting on IO, more threads help a lot: the more IO waiting, the bigger the win.
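You can see the GVL effect in plain Ruby, no Rails needed. This is a small sketch, with `sleep` standing in for a DB or HTTP call (an assumption: like blocking IO, `sleep` releases the GVL) and a sum of squares standing in for CPU work. Exact timings depend on your machine, so no numbers are promised here.

```ruby
# Times a block with a monotonic clock and returns seconds elapsed.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Stand-in for blocking IO: releases the GVL while waiting.
def fake_io
  sleep 0.1
end

# Pure Ruby computation: holds the GVL while it runs.
def fake_cpu
  200_000.times.sum { |i| i * i }
end

def run_in_threads(count, &work)
  elapsed { Array.new(count) { Thread.new(&work) }.each(&:join) }
end

def run_serially(count, &work)
  elapsed { count.times(&work) }
end

puts format("IO  serial:   %.2fs", run_serially(4) { fake_io })
puts format("IO  threaded: %.2fs", run_in_threads(4) { fake_io })
puts format("CPU serial:   %.2fs", run_serially(4) { fake_cpu })
puts format("CPU threaded: %.2fs", run_in_threads(4) { fake_cpu })
```

On MRI the threaded IO run should finish in roughly the time of one call, while the threaded CPU run should take about as long as the serial one.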

There's no universal number — load test while watching memory and CPU. The point is to know that "parallel ceiling = workers × threads," and to add capacity here when it's not enough.

🔌 Connection pool: align threads with DB connections

This is the most common hidden trap. Every Puma thread that queries the DB needs a DB connection, taken from the connection pool. If the pool is smaller than the thread count, threads can't get a connection, stall waiting, and eventually error:

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool

Rule: each worker's pool ≥ that worker's thread count.

# config/database.yml
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 5) %>   # align with threads
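To check the rule against a running app, Active Record exposes pool counters through `ActiveRecord::Base.connection_pool.stat`. The helper below is a hypothetical sketch (the name `pool_warnings` is mine) that takes a hash of that shape, so it runs without Rails; the sample values are invented for illustration.

```ruby
# stat is a Hash shaped like ActiveRecord::Base.connection_pool.stat,
# e.g. { size: 5, connections: 5, busy: 2, dead: 0, idle: 3, waiting: 0, ... }
def pool_warnings(stat, puma_threads:)
  warnings = []
  if stat[:size] < puma_threads
    warnings << "pool size #{stat[:size]} < #{puma_threads} Puma threads"
  end
  if stat[:waiting].to_i > 0
    warnings << "#{stat[:waiting]} thread(s) waiting for a connection"
  end
  warnings
end

# Invented sample: a pool of 3 serving 5 threads, with 2 threads already stuck waiting.
sample = { size: 3, connections: 3, busy: 3, dead: 0, idle: 0, waiting: 2, checkout_timeout: 5 }
puts pool_warnings(sample, puma_threads: 5)
```

In a Rails console you would pass the live value instead: `pool_warnings(ActiveRecord::Base.connection_pool.stat, puma_threads: 5)`.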

And do the total math. The pool is per-process: all threads in a worker share that worker's single pool, so the real total of connections hitting the DB is workers × pool, not workers × threads, counted across every app server you run. The two are equal when you set pool = threads, but pool is often set a bit larger to leave room for Active Storage, load_async, and other non-request paths. That total must not exceed the DB's max_connections. PostgreSQL defaults to 100, but the usable number is lower after reserved connections and other services, and managed DBs often scale it by instance size.
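The total math fits in a few lines. This is a minimal sketch: `ConnectionBudget` is a made-up name, `workers` is assumed to mean all Puma workers across all app servers, and the `reserved` figure (3 for PostgreSQL's default superuser_reserved_connections plus 10 for other services) is an example, not a recommendation.

```ruby
# Hypothetical helper: compares workers × pool with the connections the DB can spare.
ConnectionBudget = Struct.new(:workers, :pool, :max_connections, :reserved, keyword_init: true) do
  # Connections the app may open in the worst case.
  def demand
    workers * pool
  end

  # Connections actually available to the app.
  def usable
    max_connections - reserved
  end

  def fits?
    demand <= usable
  end
end

small = ConnectionBudget.new(workers: 2, pool: 5, max_connections: 100, reserved: 13)
big   = ConnectionBudget.new(workers: 16, pool: 8, max_connections: 100, reserved: 13)

puts "small: #{small.demand}/#{small.usable} fits=#{small.fits?}"
puts "big:   #{big.demand}/#{big.usable} fits=#{big.fits?}"
# prints:
# small: 10/87 fits=true
# big:   128/87 fits=false
```

Run the numbers before raising WEB_CONCURRENCY or adding servers; the second case is the one that surprises people after a scale-up.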

When connections genuinely run short, put PgBouncer in front so many app connections share a few DB connections. But don't treat it as a free switch: the big savings come from its transaction mode, and transaction mode conflicts with the prepared statements Rails enables by default, so you'll need either prepared_statements: false in database.yml or a PgBouncer recent enough to support prepared statements in transaction mode (1.21 or later, with max_prepared_statements set).

In one line: Puma threads → pool size → DB max_connections — align this chain top to bottom; any link too small and it stalls.

🌐 CDN: don't make your Rails server ship static assets

Every image, every JS/CSS file served by your Rails server means spending a precious Puma thread to ship a file — wasteful. Those static, unchanging assets should go to a CDN:

Static assets (JS/CSS/images): served from the CDN, fetched from a nearby edge node — fast, and off your server.
Active Storage uploads: configure them to go through the CDN too, instead of streaming from your app / object storage every time.

A CDN offloads "shipping files" entirely from your Rails server, leaving Puma threads for "requests that actually run Ruby."
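On the Rails side, pointing assets at a CDN is mostly configuration. A minimal sketch, assuming a CDN is already set up to pull from your app as its origin; `CDN_HOST` is a hypothetical environment variable name, and the cache lifetime is just a common choice for fingerprinted assets:

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Asset helpers (javascript_include_tag, image_tag, ...) emit URLs on this host.
  # CDN_HOST is a hypothetical variable, e.g. "https://cdn.example.com".
  config.asset_host = ENV["CDN_HOST"] if ENV["CDN_HOST"]

  # Fingerprinted assets never change, so let the CDN and browsers cache them for a year.
  config.public_file_server.headers = { "Cache-Control" => "public, max-age=31536000" }

  # Serve Active Storage files through the app's proxy route instead of redirecting
  # to object storage, so a CDN in front of the app can cache them.
  config.active_storage.resolve_model_to_route = :rails_storage_proxy
end
```

Whether Active Storage files actually get cached depends on those URLs being requested through the CDN host, so verify with response headers after deploying.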

🚀 Other production switches worth flipping on

A few low-cost, high-value settings:

bootsnap: caches compiled Ruby/YAML, speeding up boot (faster deploys and restarts).
Asset precompilation (assets:precompile): compile JS/CSS ahead of time, instead of compiling on a user's request.
Response compression (gzip): compress responses before sending, saving bandwidth.

These are usually handled by Rails defaults or your deploy platform, but knowing they exist and confirming they're on is table stakes.
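If you need to switch two of these on by hand, this is roughly what they look like. Treat it as a sketch: new Rails apps already generate the bootsnap line, and many setups compress at the proxy or CDN instead, in which case the middleware is unnecessary.

```ruby
# config/boot.rb
require "bootsnap/setup"   # speeds up boot by caching expensive operations

# config/environments/production.rb, inside Rails.application.configure do ... end
config.middleware.use Rack::Deflater   # compress responses in the app itself
```

Running `bin/rails middleware` lists the middleware stack, which is a quick way to confirm whether Rack::Deflater is present.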

🏁 Wrap-up

When each request is fast on its own but concurrency collapses, the problem is the infrastructure layer:

setting	governs	key point
Puma workers × threads	how many requests at once	the parallel ceiling — tune by load testing
connection pool	whether threads can get a DB connection	pool ≥ threads; total (`workers × pool`) ≤ DB max_connections
CDN	who ships static assets	offload to the CDN, keep Puma for Ruby requests

One principle:

The first six posts were about "make each request faster"; this one is about "let the machine serve more requests at once." Different problems — no matter how fast the code, the wrong concurrency config still causes a traffic jam.

And this layer has a chain: Puma threads → connection pool → DB connection ceiling — align it top to bottom. Before bumping Puma, ask "can my DB handle this many connections?" That's why this post comes last: it's the bill you add up when you fit all the earlier optimizations into a real machine.