PostgreSQL is the database we reach for first at Xenotix Labs. We've shipped 30+ platforms on it across D2C dairy (Veda Milk), service marketplaces (Cremaster, Housecare), insurance survey workflows (ClaimsMitra), legal-tech (Legal Owl), and more. None of those projects ran into a wall where Postgres couldn't keep up. But several of them ran into walls where we couldn't keep up with Postgres — not knowing how to use the indexes, the connection pool, the query planner, or the vacuum settings. Here are the lessons that landed.
Lesson 1: connection pooling is non-optional
Stock Postgres defaults max_connections to 100; RDS derives it from instance memory, so it grows with instance size, but it is still a hard ceiling. A Node.js app server with 4 worker processes, each holding 4 connections, uses 16 per server; deploy 8 servers and you've eaten 128 connections before a single traffic spike.
Fix: PgBouncer in transaction-pooling mode in front of every cluster. Each app server now holds a small pool of cheap PgBouncer connections, and PgBouncer multiplexes them onto a much smaller pool of real Postgres connections. We run with 200–1000 client connections and 20–40 actual Postgres connections.
Note: transaction pooling means session state doesn't survive across transactions. Plain SET (as opposed to SET LOCAL inside a transaction), session-level advisory locks, and named prepared statements all get tricky, and LISTEN/NOTIFY effectively doesn't work. Plan around this from day one.
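Before putting PgBouncer in front, it helps to see who is actually holding connections. A quick check against pg_stat_activity (a standard stats view, nothing project-specific):

SELECT usename, application_name, state, count(*) AS conns
FROM pg_stat_activity
GROUP BY usename, application_name, state
ORDER BY conns DESC;

If most rows show state = 'idle', you're paying for connections that do nothing most of the time, which is exactly what transaction pooling multiplexes away.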
Lesson 2: indexes that don't get used are a tax
Every index speeds up some reads and slows down every write. We've found unused indexes that cost us roughly 15% of write throughput while contributing nothing to read latency. Find them and drop them.
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
Run it monthly. Drop anything that's been zero scans for 90+ days and isn't unique-constraint enforcement.
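When we do drop one, we prefer not to block writes while doing it. A minimal sketch, with a hypothetical index name:

-- DROP INDEX CONCURRENTLY avoids the exclusive lock a plain DROP INDEX takes
-- on the parent table (note it can't run inside a transaction block).
DROP INDEX CONCURRENTLY IF EXISTS orders_legacy_status_idx;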
Lesson 3: partial indexes for the hot path
Our Veda Milk subscription engine queries "all subscriptions where the next delivery is tomorrow." Indexing every subscription on next_delivery_date works but is wasteful — 95% of subscriptions don't have tomorrow as their next delivery.
A partial index, only on rows that match the hot predicate, gets us a 10x smaller index:
CREATE INDEX subs_active_next_delivery_idx
ON subscriptions (next_delivery_date)
WHERE status = 'active';
Note that CURRENT_DATE can't go into the index predicate itself (index predicates only allow immutable expressions), so the date filter lives in the query. The planner picks the partial index automatically whenever the nightly job's query includes the matching status = 'active' predicate.
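As a rough illustration (column names are simplified), the nightly job's query looks something like this; because the WHERE clause includes status = 'active', the planner can satisfy it from the partial index:

SELECT id, customer_id, next_delivery_date
FROM subscriptions
WHERE status = 'active'
  AND next_delivery_date = CURRENT_DATE + 1;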
Lesson 4: vacuum tuning is where you find dragons
Default autovacuum settings are fine for normal tables. They're terrible for high-churn tables like wallet ledgers and order tables in subscription commerce.
For those, we tune per-table:
ALTER TABLE wallet_ledger
SET (autovacuum_vacuum_scale_factor = 0.05,
autovacuum_analyze_scale_factor = 0.02);
The default scale factor is 0.2, meaning autovacuum waits until roughly 20% of a table's rows are dead. On a million-row table that's 200k dead tuples piling up before a vacuum even starts. Bloat builds. Indexes degrade. Reads slow.
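To see which tables are falling behind, a quick look at dead-tuple counts via pg_stat_user_tables (standard stats view):

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;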
Lesson 5: read replicas for analytics, never for primary reads
We used to route some "non-critical" reads to a read replica. It bit us when replica lag spiked during a heavy write burst and customers saw stale balances.
Now the rule: read replicas are for offline-style analytics queries (BI dashboards, reports, ML feature pipelines). Customer-facing reads always come from the primary, with caching in Redis or the application layer if latency matters.
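If you do route anything to a replica, watch the lag. A minimal check, run on the replica itself:

-- NULL means no WAL has been replayed yet; otherwise this is roughly how far behind the replica is.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;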
Lesson 6: JSONB is not a free lunch
We've seen teams treat JSONB like a schemaless escape hatch: "We don't know the shape yet, let's just store JSON." Eighteen months later, every query does four JSONB extractions and drags along two GIN indexes at 5 GB each.
Use JSONB for genuinely sparse, nested, or polymorphic data: audit logs, event payloads, third-party API responses. For business entities with predictable schemas, use real columns. Future you will thank present you.
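A sketch of where JSONB does earn its keep, with hypothetical table and column names: the predictable fields are real columns, the variable third-party payload stays JSONB, and a jsonb_path_ops GIN index covers containment lookups.

CREATE TABLE webhook_events (
    id          bigserial PRIMARY KEY,
    provider    text        NOT NULL,               -- predictable: real column
    received_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb       NOT NULL                -- third-party response, shape varies
);

CREATE INDEX webhook_events_payload_idx
ON webhook_events USING gin (payload jsonb_path_ops);

-- Containment queries like this can use the GIN index:
SELECT id FROM webhook_events WHERE payload @> '{"event": "payment.captured"}';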
Lesson 7: backups are easy, restores are hard
RDS automated backups are great until you need to restore one. We test restores quarterly:
- Spin up a new cluster from the latest snapshot
- Connect a non-prod copy of the app
- Run the smoke-test suite
- Time how long the whole process takes
The first time we did this, restore took 4 hours and the smoke tests revealed two missing migrations in the snapshot. Now restore takes 45 min and we know the process works.
Lesson 8: migrations need a rollout playbook, not just a tool
Long-running migrations on tables with millions of rows can lock the table for minutes. We use a playbook:
- Schema-only changes (CREATE INDEX CONCURRENTLY, ADD COLUMN with default null): safe at any time
- Data backfills: run in batches via a one-off worker, monitor lag, never block the primary connection pool
- Type changes: use a multi-step pattern spread over multiple deploys (add new column, dual-write from the app, backfill, switch reads, drop the old column); sketched below
Never hit a migration tool's "apply" button on a table with more than 1M rows without first checking the SQL it will actually run.
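Here's that multi-step type change compressed into SQL, with hypothetical table and column names (say, moving an order amount from rupees to integer paise); each step ships in its own deploy, with the app dual-writing in between:

-- Deploy 1: add the new column alongside the old one (nullable, no default, so it's instant)
ALTER TABLE orders ADD COLUMN amount_paise bigint;

-- Deploy 2: app dual-writes both columns; backfill old rows in small batches from a worker
UPDATE orders
SET amount_paise = round(amount_rupees * 100)::bigint
WHERE id IN (
    SELECT id FROM orders
    WHERE amount_paise IS NULL
    LIMIT 10000
);
-- repeat until zero rows are updated, pausing between batches to watch load and replica lag

-- Deploy 3: switch reads to amount_paise in the app, then drop the old column
ALTER TABLE orders DROP COLUMN amount_rupees;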
Lesson 9: pg_stat_statements is your friend
pg_stat_statements is enabled on every cluster we operate. The queries that float to the top of total execution time after a week of production traffic are exactly the queries that need indexing, batching, or rewriting. Read it weekly and you'll outpace anyone debugging in production.
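The query we usually start from; note the column is total_exec_time on PostgreSQL 13+ and total_time on older versions (this sketch assumes 13+):

SELECT round(total_exec_time::numeric) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;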
Stack we ship with PostgreSQL
- Backend: Node.js (with PgBouncer in front)
- Migrations: node-pg-migrate or knex with peer-reviewed migration scripts
- Connection pool: PgBouncer (transaction mode), node-postgres pool per service
- Backups: RDS automated + monthly logical dumps to S3
- Monitoring: CloudWatch + pg_stat_statements dashboards
- Replicas: RDS read replica for analytics workloads
Building a product where Postgres has to scale?
We've shipped 30+ Postgres-backed products without hitting a wall. Most teams do hit one, usually because they don't know what to look for. If you're building one, Xenotix Labs has the playbook for setup, scaling, and surviving the next round of growth. Reach out at https://xenotixlabs.com.