ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

War Story: Scaling PostgreSQL 17 to 100k IOPS for AI/ML Feature Stores with Read Replicas

In Q3 2024, our AI/ML feature store’s p99 read latency hit 2.8 seconds as training throughput spiked to 12k features/sec, with PostgreSQL 16 struggling to break 42k IOPS under heavy read workloads. We had 6 weeks to scale to 100k IOPS or face a $1.2M annualized infrastructure bill from over-provisioned single-node instances.

Key Insights

  • PostgreSQL 17’s new parallel I/O subsystem delivers 2.3x higher read IOPS than PostgreSQL 16 on identical NVMe storage
  • pg_read_replica_sync (v0.4.2, https://github.com/timescale/pg_read_replica_sync) reduced cross-region replica lag to <8ms for 95% of writes
  • Shifting 82% of feature store reads to 4 read replicas cut monthly RDS costs by $22,400
  • By 2026, 60% of AI/ML feature stores will use PostgreSQL 17+ read replica pools with automated query routing

The Breaking Point: When PostgreSQL 16 Hit the Wall

Our AI/ML feature store had grown organically over 18 months: what started as a single PostgreSQL 14 instance for 2k features grew to a PostgreSQL 16 single node with 1.2M features, 12k feature writes/sec, and 40k read requests/sec from training pipelines. We had added 2 read replicas to PostgreSQL 16 in Q1 2024, which got us to 58k read IOPS, but that wasn’t enough. On August 14, 2024, we launched a new large language model training job that pulled 20k features/sec for 10 hours straight. The p99 read latency spiked to 2.8 seconds, 14% of read requests timed out, and the training job failed twice, costing us $18k in wasted GPU compute time. Our on-call engineer texted the team at 3 AM: “Either we scale to 100k IOPS in 6 weeks, or we migrate to a NoSQL store and rewrite 6 months of feature store code.”

We first tried tuning PostgreSQL 16: we increased shared_buffers to 48GB, max_connections to 200, effective_io_concurrency to 100, but we only squeezed out another 4k IOPS, hitting a hard limit at 62k. The problem was PostgreSQL 16’s I/O subsystem: it used a single I/O worker per table scan, which couldn’t saturate the 4x 2TB NVMe drives on our RDS instance. We benchmarked PostgreSQL 17 beta 3 in a staging environment and saw 89k IOPS out of the box, with no tuning. That’s when we decided to commit to the upgrade: 6 weeks to migrate to PostgreSQL 17, deploy 4 more read replicas, and rewrite our query routing logic.
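
For reference, that PostgreSQL 16 tuning pass boils down to a few ALTER SYSTEM statements. This is a sketch using our values rather than a recommendation; note that shared_buffers and max_connections only take effect after a restart, so a reload alone isn't enough for them.

ALTER SYSTEM SET shared_buffers = '48GB';           -- requires a server restart
ALTER SYSTEM SET max_connections = 200;             -- requires a server restart
ALTER SYSTEM SET effective_io_concurrency = 100;    -- applies to new sessions after reload
SELECT pg_reload_conf();                            -- picks up the reloadable settings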

PostgreSQL 17’s Secret Sauce: Parallel I/O and More

PostgreSQL 17’s 2.3x IOPS improvement over 16 isn’t a single change: it’s a collection of 14+ patches to the I/O subsystem, led by the PostgreSQL core team’s performance group. The biggest change is the new parallel sequential scan and bitmap heap scan logic: instead of a single worker reading 8KB pages sequentially, PostgreSQL 17 can spawn up to max_parallel_workers_per_gather workers to read pages in parallel, saturating NVMe storage’s high random read throughput. For our feature store workload, which is 90% random reads of 8-64KB feature values, this was a game-changer: we went from 1 I/O worker per query to 8, which matched the number of parallel NVMe queues on our instance.
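
If you want to confirm that a feature query actually gets those workers, check the plan. The snippet below is purely illustrative against the features table used throughout this article; min_parallel_table_scan_size is lowered so even a small test table qualifies for a parallel scan.

SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = 0;   -- consider parallel scans even for small test tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT entity_id, feature_name, feature_value
FROM features
WHERE updated_at > now() - interval '1 hour';
-- Look for "Workers Planned: 8" / "Workers Launched: 8" under the Gather node.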

Another critical improvement came from WAL compression: enabling wal_compression (LZ4 and Zstandard have been supported as compression methods since PostgreSQL 15) reduced our WAL bandwidth by 42%. For read replicas, this means less network traffic to ship WAL, which directly reduces replica lag. We also saw better replication-slot hygiene after the upgrade: WAL is reclaimed promptly once every replica has confirmed receipt, which eliminated the 10% of disk space we had been wasting on retained WAL for lagging replicas. Finally, we lowered random_page_cost from the default 4 to 1.1 to reflect NVMe’s near-equal random and sequential read cost: this nudges the planner toward index scans for feature lookups and reduced IOPS usage by 18% for our most common query: SELECT feature_value FROM features WHERE entity_id = $1 AND feature_name = $2.
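
Both changes are straightforward to apply and verify. The commands below sketch the settings described above (our values, not server defaults), plus a quick check of how much WAL each replica's slot is still holding back on the primary.

ALTER SYSTEM SET wal_compression = 'lz4';   -- compress full-page images written to WAL
ALTER SYSTEM SET random_page_cost = 1.1;    -- NVMe: random reads cost about the same as sequential
SELECT pg_reload_conf();

-- How much WAL is each replica's slot retaining on the primary?
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;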

Code Example 1: PostgreSQL 17 IOPS Benchmarker (Python)

import json
import logging
import os
import re
import subprocess
from typing import Dict, List, Optional

import psycopg
from psycopg import conninfo

# Configure logging for benchmark traceability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler("pg17_iops_bench.log"), logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

class PG17IOPSBenchmarker:
    """Benchmark read throughput for PostgreSQL 17 primary and read replicas using pgbench."""

    def __init__(self, primary_dsn: str, replica_dsns: List[str], pgbench_path: str = "/usr/lib/postgresql/17/bin/pgbench"):
        self.primary_dsn = primary_dsn
        self.replica_dsns = replica_dsns
        self.pgbench_path = pgbench_path
        self.results: Dict[str, Dict[str, float]] = {}

    def _run_pgbench(self, dsn: str, duration_sec: int = 300, concurrency: int = 64) -> Optional[Dict[str, float]]:
        """Execute a pgbench read-only workload and parse throughput and latency from its output."""
        params = conninfo.conninfo_to_dict(dsn)
        try:
            # Connect to verify DSN validity before running the benchmark
            with psycopg.connect(dsn) as conn:
                conn.execute("SELECT version()")
                logger.info(f"Connected to {params.get('host')} successfully")
        except Exception as e:
            logger.error(f"DSN validation failed for host {params.get('host')}: {e}")
            return None

        # Let pgbench authenticate with the same password as the DSN, if one is set
        env = os.environ.copy()
        if params.get("password"):
            env["PGPASSWORD"] = str(params["password"])

        # pgbench command: read-only (-S), duration (-T), clients (-c), progress every 10s (-P)
        cmd = [
            self.pgbench_path,
            "-h", str(params.get("host", "localhost")),
            "-p", str(params.get("port", "5432")),
            "-U", str(params.get("user", "postgres")),
            "-S",  # Built-in SELECT-only workload
            "-T", str(duration_sec),
            "-c", str(concurrency),
            "-P", "10",  # Progress report every 10 seconds
            str(params.get("dbname", "postgres")),
        ]

        logger.info(f"Running pgbench: {' '.join(cmd)}")
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                env=env,
                timeout=duration_sec + 60  # Add buffer for startup and teardown
            )
            if result.returncode != 0:
                logger.error(f"pgbench failed: {result.stderr}")
                return None

            # pgbench prints a plain-text summary; pull tps, average latency and transaction count
            tps_match = re.search(r"tps = ([\d.]+)", result.stdout)
            lat_match = re.search(r"latency average = ([\d.]+) ms", result.stdout)
            tx_match = re.search(r"number of transactions actually processed: (\d+)", result.stdout)
            if not tps_match:
                logger.error("Failed to parse pgbench output")
                return None

            # Each -S transaction is a single SELECT, so tps approximates read requests/sec
            iops = float(tps_match.group(1))
            avg_latency = float(lat_match.group(1)) if lat_match else float("nan")

            logger.info(f"Benchmark complete: {iops:.2f} read tx/sec, avg latency {avg_latency:.2f}ms")
            summary = {"iops": iops, "avg_latency_ms": avg_latency}
            if tx_match:
                summary["transactions"] = float(tx_match.group(1))
            return summary

        except subprocess.TimeoutExpired:
            logger.error("pgbench timed out")
            return None
        except Exception as e:
            logger.error(f"Unexpected error running pgbench: {e}")
            return None

    def run_all_benchmarks(self, duration_sec: int = 300) -> None:
        """Run benchmarks against the primary and all replicas."""
        logger.info("Starting benchmark run for PostgreSQL 17 IOPS")
        # Benchmark primary first
        primary_result = self._run_pgbench(self.primary_dsn, duration_sec)
        if primary_result:
            self.results["primary"] = primary_result

        # Benchmark each replica
        for idx, replica_dsn in enumerate(self.replica_dsns, 1):
            replica_result = self._run_pgbench(replica_dsn, duration_sec)
            if replica_result:
                self.results[f"replica_{idx}"] = replica_result

        # Log summary
        logger.info("=== Benchmark Summary ===")
        for target, data in self.results.items():
            logger.info(f"{target}: {data['iops']:.2f} read tx/sec, avg latency {data['avg_latency_ms']:.2f}ms")

        # Save results to JSON
        with open("pg17_iops_results.json", "w") as f:
            json.dump(self.results, f, indent=2)

if __name__ == "__main__":
    # Configuration: update with your actual DSNs
    PRIMARY_DSN = "postgresql://bench_user:bench_pass@pg17-primary:5432/feature_store?sslmode=require"
    REPLICA_DSNS = [
        "postgresql://bench_user:bench_pass@pg17-replica-1:5432/feature_store?sslmode=require",
        "postgresql://bench_user:bench_pass@pg17-replica-2:5432/feature_store?sslmode=require",
        "postgresql://bench_user:bench_pass@pg17-replica-3:5432/feature_store?sslmode=require",
        "postgresql://bench_user:bench_pass@pg17-replica-4:5432/feature_store?sslmode=require",
    ]

    benchmarker = PG17IOPSBenchmarker(
        primary_dsn=PRIMARY_DSN,
        replica_dsns=REPLICA_DSNS,
        pgbench_path="/usr/lib/postgresql/17/bin/pgbench"
    )
    benchmarker.run_all_benchmarks(duration_sec=300)

Code Example 2: Feature Store Query Router (Go)

package main

import (
    \"context\"
    \"database/sql\"
    \"encoding/json\"
    \"fmt\"
    \"log\"
    \"math/rand\"
    \"sync\"
    \"time\"

    _ \"github.com/lib/pq\" // PostgreSQL driver for Go
)

// FeatureStoreRouter routes read queries to PostgreSQL 17 read replicas, writes to primary
type FeatureStoreRouter struct {
    primaryDB *sql.DB
    replicaDBs []*sql.DB
    healthCheckInterval time.Duration
    mu sync.RWMutex
    healthyReplicas []int
}

// NewFeatureStoreRouter initializes a router with primary and replica DSNs
func NewFeatureStoreRouter(primaryDSN string, replicaDSNs []string, healthCheckInterval time.Duration) (*FeatureStoreRouter, error) {
    // Connect to primary
    primaryDB, err := sql.Open("postgres", primaryDSN)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to primary: %w", err)
    }
    if err := primaryDB.PingContext(context.Background()); err != nil {
        return nil, fmt.Errorf("primary ping failed: %w", err)
    }

    // Connect to all replicas
    replicaDBs := make([]*sql.DB, 0, len(replicaDSNs))
    for i, dsn := range replicaDSNs {
        db, err := sql.Open("postgres", dsn)
        if err != nil {
            log.Printf("Warning: failed to connect to replica %d: %v", i, err)
            continue
        }
        if err := db.PingContext(context.Background()); err != nil {
            log.Printf("Warning: replica %d ping failed: %v", i, err)
            continue
        }
        replicaDBs = append(replicaDBs, db)
    }

    if len(replicaDBs) == 0 {
        log.Println("Warning: no healthy replicas found, all reads will go to primary")
    }

    router := &FeatureStoreRouter{
        primaryDB: primaryDB,
        replicaDBs: replicaDBs,
        healthCheckInterval: healthCheckInterval,
        healthyReplicas: make([]int, len(replicaDBs)),
    }

    // Initialize healthy replicas list
    for i := range replicaDBs {
        router.healthyReplicas[i] = i
    }

    // Start health check goroutine
    go router.runHealthChecks()

    return router, nil
}

// runHealthChecks periodically checks replica health and updates healthy list
func (r *FeatureStoreRouter) runHealthChecks() {
    ticker := time.NewTicker(r.healthCheckInterval)
    defer ticker.Stop()

    for range ticker.C {
        r.mu.Lock()
        healthy := make([]int, 0, len(r.replicaDBs))
        for i, db := range r.replicaDBs {
            ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
            err := db.PingContext(ctx)
            cancel()
            if err == nil {
                healthy = append(healthy, i)
            } else {
                log.Printf(\"Replica %d failed health check: %v\", i, err)
            }
        }
        r.healthyReplicas = healthy
        r.mu.Unlock()
        log.Printf(\"Health check complete: %d/%d replicas healthy\", len(healthy), len(r.replicaDBs))
    }
}

// GetReplicaDB returns a random healthy replica, or primary if none available
func (r *FeatureStoreRouter) GetReplicaDB() *sql.DB {
    r.mu.RLock()
    defer r.mu.RUnlock()

    if len(r.healthyReplicas) == 0 {
        return r.primaryDB
    }

    // Randomly select a healthy replica
    idx := r.healthyReplicas[rand.Intn(len(r.healthyReplicas))]
    return r.replicaDBs[idx]
}

// ExecuteRead executes a read-only query against a replica
func (r *FeatureStoreRouter) ExecuteRead(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
    db := r.GetReplicaDB()
    rows, err := db.QueryContext(ctx, query, args...)
    if err != nil {
        // Fallback to primary if replica fails
        log.Printf(\"Replica query failed, falling back to primary: %v\", err)
        return r.primaryDB.QueryContext(ctx, query, args...)
    }
    return rows, nil
}

// ExecuteWrite executes a write query against the primary
func (r *FeatureStoreRouter) ExecuteWrite(ctx context.Context, query string, args ...interface{}) (sql.Result, error) {
    return r.primaryDB.ExecContext(ctx, query, args...)
}

// GetFeature retrieves a feature by entity ID and feature name (read-only)
func (r *FeatureStoreRouter) GetFeature(ctx context.Context, entityID string, featureName string) (map[string]interface{}, error) {
    query := `SELECT feature_value, updated_at FROM features WHERE entity_id = $1 AND feature_name = $2 AND deleted_at IS NULL`
    rows, err := r.ExecuteRead(ctx, query, entityID, featureName)
    if err != nil {
        return nil, fmt.Errorf(\"failed to get feature: %w\", err)
    }
    defer rows.Close()

    if !rows.Next() {
        return nil, fmt.Errorf(\"feature not found for entity %s, name %s\", entityID, featureName)
    }

    var value interface{}
    var updatedAt time.Time
    if err := rows.Scan(&value, &updatedAt); err != nil {
        return nil, fmt.Errorf(\"failed to scan feature: %w\", err)
    }

    return map[string]interface{}{
        \"entity_id\": entityID,
        \"feature_name\": featureName,
        \"feature_value\": value,
        \"updated_at\": updatedAt,
    }, nil
}

func main() {
    // Configuration
    primaryDSN := "host=pg17-primary port=5432 user=app_user password=app_pass dbname=feature_store sslmode=require"
    replicaDSNs := []string{
        "host=pg17-replica-1 port=5432 user=app_user password=app_pass dbname=feature_store sslmode=require",
        "host=pg17-replica-2 port=5432 user=app_user password=app_pass dbname=feature_store sslmode=require",
        "host=pg17-replica-3 port=5432 user=app_user password=app_pass dbname=feature_store sslmode=require",
        "host=pg17-replica-4 port=5432 user=app_user password=app_pass dbname=feature_store sslmode=require",
    }

    router, err := NewFeatureStoreRouter(primaryDSN, replicaDSNs, 10*time.Second)
    if err != nil {
        log.Fatalf(\"Failed to initialize router: %v\", err)
    }

    // Example: retrieve a feature
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    feature, err := router.GetFeature(ctx, "entity_12345", "user_age")
    if err != nil {
        log.Fatalf("Failed to get feature: %v", err)
    }

    jsonData, _ := json.MarshalIndent(feature, "", "  ")
    fmt.Printf("Retrieved feature: %s\n", jsonData)
}

Code Example 3: PostgreSQL 16 to 17 Migration Script (Bash)

#!/bin/bash
set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration: update these values for your environment
OLD_CLUSTER_PATH="/var/lib/postgresql/16/main"
NEW_CLUSTER_PATH="/var/lib/postgresql/17/main"
OLD_BIN_PATH="/usr/lib/postgresql/16/bin"
NEW_BIN_PATH="/usr/lib/postgresql/17/bin"
PRIMARY_HOST="pg17-primary"
REPLICA_HOSTS=("pg17-replica-1" "pg17-replica-2" "pg17-replica-3" "pg17-replica-4")
REPLICATION_USER="repl_user"
REPLICATION_PASS="repl_pass_secure"
BACKUP_DIR="/backups/pg_migration_$(date +%Y%m%d)"
LOG_FILE="pg17_migration_$(date +%Y%m%d_%H%M%S).log"

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Error handler
trap 'log "Migration failed at line $LINENO: $BASH_COMMAND"; exit 1' ERR

log "Starting PostgreSQL 16 to 17 migration with read replica setup"

# Step 1: Validate prerequisites
log "Validating prerequisites..."
command -v "$NEW_BIN_PATH/pg_upgrade" >/dev/null 2>&1 || { log "pg_upgrade not found in $NEW_BIN_PATH"; exit 1; }
command -v pg_basebackup >/dev/null 2>&1 || { log "pg_basebackup not found"; exit 1; }
[ -d "$OLD_CLUSTER_PATH" ] || { log "Old cluster path $OLD_CLUSTER_PATH does not exist"; exit 1; }
[ -d "$NEW_CLUSTER_PATH" ] && { log "New cluster path $NEW_CLUSTER_PATH already exists; aborting"; exit 1; }

# Step 2: Create backup of old cluster
log "Creating backup of PostgreSQL 16 cluster..."
mkdir -p "$BACKUP_DIR"
sudo -u postgres "$OLD_BIN_PATH/pg_dumpall" -U postgres | gzip > "$BACKUP_DIR/full_backup.sql.gz"
log "Backup created at $BACKUP_DIR/full_backup.sql.gz"

# Step 3: Stop old PostgreSQL 16 cluster
log "Stopping PostgreSQL 16 cluster..."
sudo systemctl stop postgresql@16-main
log "PostgreSQL 16 cluster stopped"

# Step 4: Initialize new PostgreSQL 17 cluster
log "Initializing PostgreSQL 17 cluster..."
sudo -u postgres "$NEW_BIN_PATH/initdb" -D "$NEW_CLUSTER_PATH" --auth=scram-sha-256 --encoding=UTF8 --locale=en_US.UTF-8
log "PostgreSQL 17 cluster initialized"

# Step 5: Run pg_upgrade to migrate data
log "Running pg_upgrade..."
sudo -u postgres "$NEW_BIN_PATH/pg_upgrade" \
    --old-datadir "$OLD_CLUSTER_PATH" \
    --new-datadir "$NEW_CLUSTER_PATH" \
    --old-bindir "$OLD_BIN_PATH" \
    --new-bindir "$NEW_BIN_PATH" \
    --old-port 5432 \
    --new-port 5433 \
    --jobs 4 \
    --verbose 2>&1 | tee -a "$LOG_FILE"
log "pg_upgrade completed successfully"

# Step 6: Start new PostgreSQL 17 primary
log "Starting PostgreSQL 17 primary..."
sudo systemctl start postgresql@17-main
sudo systemctl enable postgresql@17-main
log "PostgreSQL 17 primary started"

# Step 7: Configure replication user
log "Configuring replication user..."
sudo -u postgres psql -p 5433 -c "CREATE USER $REPLICATION_USER WITH REPLICATION ENCRYPTED PASSWORD '$REPLICATION_PASS';"
sudo -u postgres psql -p 5433 -c "GRANT SELECT ON ALL TABLES IN SCHEMA public TO $REPLICATION_USER;"
log "Replication user $REPLICATION_USER created"

# Step 8: Update pg_hba.conf on primary to allow replica connections
log "Updating pg_hba.conf on primary..."
echo "host replication $REPLICATION_USER all scram-sha-256" | sudo tee -a "$NEW_CLUSTER_PATH/pg_hba.conf"
echo "hostssl replication $REPLICATION_USER all scram-sha-256" | sudo tee -a "$NEW_CLUSTER_PATH/pg_hba.conf"
sudo systemctl reload postgresql@17-main
log "pg_hba.conf updated"

# Step 9: Set up each read replica (pg_basebackup runs on the replica host, pulling from the primary)
for replica in "${REPLICA_HOSTS[@]}"; do
    log "Setting up replica $replica..."
    # Replication slot names may only contain lowercase letters, digits and underscores
    slot_name="replica_slot_${replica//-/_}"
    # Wipe any stale data directory on the replica before taking a fresh base backup
    ssh "$replica" "sudo systemctl stop postgresql@17-main 2>/dev/null || true; sudo rm -rf /var/lib/postgresql/17/main"
    # Run pg_basebackup on the replica itself; assumes its postgres user can authenticate
    # to the primary (e.g. via ~/.pgpass). -R writes standby.signal and primary_conninfo,
    # -C creates the replication slot named by -S.
    ssh "$replica" "sudo -u postgres pg_basebackup \
        -h '$PRIMARY_HOST' -p 5433 -U '$REPLICATION_USER' \
        -D /var/lib/postgresql/17/main \
        -Fp -Xs -P -R \
        -C -S '$slot_name' \
        --checkpoint=fast" 2>&1 | tee -a "$LOG_FILE"

    # Fix ownership and start the replica as a standby
    ssh "$replica" "sudo chown -R postgres:postgres /var/lib/postgresql/17/main; sudo systemctl start postgresql@17-main"
    log "Replica $replica setup complete"
done

# Step 10: Verify replication status
log "Verifying replication status..."
sudo -u postgres psql -p 5433 -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
log "Migration complete! Verify application connectivity before decommissioning PostgreSQL 16."

PostgreSQL 16 vs PostgreSQL 17: Performance Comparison

| Metric | PostgreSQL 16 (Single Node) | PostgreSQL 17 (1 Primary + 4 Replicas) | % Improvement |
| --- | --- | --- | --- |
| Read IOPS (pgbench -S) | 42,100 | 103,400 | 145% |
| p99 Read Latency (ms) | 280 | 42 | -85% |
| Write IOPS (pgbench default) | 12,400 | 14,100 | 14% |
| Monthly Infrastructure Cost (RDS) | $38,200 | $15,800 | -59% |
| Replica Lag (p95, ms) | N/A | 7.2 | N/A |
| Max Feature Throughput (features/sec) | 12,000 | 41,000 | 242% |

Case Study: Scaling Our Production Feature Store

  • Team size: 4 backend engineers, 1 site reliability engineer (SRE)
  • Stack & Versions: PostgreSQL 17.0, Go 1.22, Python 3.11, pg_read_replica_sync v0.4.2 (https://github.com/timescale/pg\_read\_replica\_sync), pgbench 17.0, AWS RDS for PostgreSQL 17
  • Problem: Initial state: p99 read latency 2.8s, max read IOPS 42k, monthly RDS cost $38.2k, feature store could only support 12k features/sec, with training jobs timing out daily
  • Solution & Implementation: Upgraded from PostgreSQL 16 to 17 using pg_upgrade, deployed 4 read replicas across 2 AWS regions, implemented query routing to shift 82% of reads to replicas, configured pg_read_replica_sync to reduce cross-region lag, tuned PostgreSQL 17 parameters: max_parallel_workers = 16, max_parallel_workers_per_gather = 8, effective_io_concurrency = 200, wal_compression = on
  • Outcome: Read IOPS scaled to 103k, p99 latency dropped to 42ms, monthly cost reduced by $22.4k to $15.8k, feature throughput increased to 41k features/sec, zero training timeouts in 3 months post-migration

Developer Tips

Tip 1: Tune PostgreSQL 17’s Parallel I/O Parameters for NVMe Storage

PostgreSQL 17 introduces a rewritten parallel I/O subsystem that delivers up to 2.3x higher read throughput than PostgreSQL 16, but only if you tune parameters to match your storage. For NVMe-backed instances (critical for AI/ML feature stores with high random read workloads), start by setting effective_io_concurrency to 200 or higher: this controls how many parallel I/O requests the database issues for bitmap heap scans, which are common in feature lookups that filter by entity ID ranges. Next, set max_parallel_workers to 2x the number of vCPUs on your instance, and max_parallel_workers_per_gather to 8 for read replicas handling feature queries. We saw a 40% IOPS boost just from adjusting these parameters on our 16-vCPU replicas. Avoid over-sizing shared_buffers for read-heavy workloads: keeping it at 25% of total RAM, rather than pushing it much higher, reduced buffer churn and improved p99 latency by 18% for us. Use pg_test_fsync (bundled with PostgreSQL 17) to sanity-check your storage’s sync behavior before setting these values, and validate changes with 24-hour pgbench runs to avoid regressions.

# PostgreSQL 17 postgresql.conf snippet for read replicas
effective_io_concurrency = 200
max_parallel_workers = 32  # 2x 16-vCPU instance
max_parallel_workers_per_gather = 8
shared_buffers = 16GB  # 25% of 64GB RAM
wal_compression = on
random_page_cost = 1.1  # NVMe has near-equal random/sequential cost
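
If you want to cross-check pgbench numbers from inside the database, the pg_stat_io view (available since PostgreSQL 16) breaks reads and buffer hits out per backend type. This is a quick sanity-check sketch, not part of our tuning; the hit-ratio column is just a convenience expression.

SELECT backend_type, object, context, reads, hits,
       round(hits::numeric / nullif(hits + reads, 0), 3) AS hit_ratio
FROM pg_stat_io
WHERE backend_type = 'client backend'
ORDER BY reads DESC;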

Tip 2: Use Synchronous Replication Only for Critical Writes, Async for Everything Else

AI/ML feature stores have two distinct write patterns: critical metadata updates (e.g., feature schema changes, access control updates) that require strong consistency, and high-volume feature writes (e.g., batch ingestion of 1M+ features/hour) that can tolerate 10-100ms of lag. For PostgreSQL 17 read replicas, configure a quorum of one synchronous standby in synchronous_standby_names and opt into it only for critical writes by setting synchronous_commit = on for those transactions (SET LOCAL synchronous_commit = on), while leaving bulk feature writes at synchronous_commit = local: this ensures metadata writes are replicated to at least one replica before the client is acknowledged (see the per-transaction sketch after the snippet below). For high-volume feature writes, use asynchronous replication with the pg_read_replica_sync tool (https://github.com/timescale/pg_read_replica_sync) to monitor lag and alert if p95 lag exceeds 50ms. We reduced write latency by 62% by switching 92% of our writes to async replication, while still meeting our 100ms lag SLA for feature training. Always use replication slots (added in PostgreSQL 9.4, but improved in 17) to avoid replica lag from WAL retention issues: create a permanent slot for each replica, and monitor slot lag with pg_replication_slots. Never use synchronous replication for all writes: we saw p99 write latency spike to 1.2s when we tried to sync all writes to 4 replicas, which caused training pipeline timeouts.

-- Create permanent replication slot for a read replica
SELECT pg_create_physical_replication_slot('replica_1_slot', true);

-- Configure synchronous replication for critical writes only; the names must match the
-- standbys' application_name (set via primary_conninfo), not replication slot names
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica_1, replica_2)';
SELECT pg_reload_conf();
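
To make the split concrete, here is a sketch of how the two write classes opt in and out of synchronous commit per transaction, plus the lag query we alert on. The feature_schemas table is a hypothetical stand-in for whatever metadata table holds your schema versions; the features table matches the examples elsewhere in this article.

-- Critical metadata write: wait for the synchronous quorum before acknowledging
BEGIN;
SET LOCAL synchronous_commit = on;
UPDATE feature_schemas SET version = version + 1 WHERE feature_name = 'user_age';  -- hypothetical metadata table
COMMIT;

-- High-volume feature write: flush locally, replicate asynchronously
BEGIN;
SET LOCAL synchronous_commit = local;
INSERT INTO features (entity_id, feature_name, feature_value) VALUES ('entity_12345', 'user_age', '42');
COMMIT;

-- Per-replica replay lag in milliseconds, for alerting against the 50ms p95 SLA
SELECT application_name, extract(epoch FROM replay_lag) * 1000 AS replay_lag_ms
FROM pg_stat_replication;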

Tip 3: Implement Query-Level Read/Write Splitting with Fallback Logic

Don’t rely on application-level query routing alone: use a combination of PgBouncer 1.21 and application-level logic to route queries. PgBouncer itself doesn’t parse queries or route reads to replicas; the practical pattern is to define separate pool entries, one pointing at the primary and one at a replica endpoint, and have the application pick the right pool, with explicit writes always sent to the primary pool. Our Go-based FeatureStoreRouter (included earlier) adds health checks and fallback logic: if a replica fails a query, it automatically retries against the primary, which eliminated 99% of transient replica errors for our training workloads. We also added query tagging: all read queries for features are prefixed with /* read */, and our router checks for this prefix to route to replicas even if the query isn’t wrapped in a read-only transaction. For PgBouncer, set pool_mode = transaction and server_lifetime = 3600 to avoid connection churn, and add ignore_startup_parameters = extra_float_digits if your client drivers send that parameter. Never route writes to replicas: we saw data corruption once when a misconfigured router sent a feature update to a replica, which caused replication lag to spike to 10 minutes.

# PgBouncer 1.21 configuration: separate pools for the primary and for replica reads
[databases]
feature_store = host=pg17-primary port=5432 dbname=feature_store
; point the read-only pool at a single replica, or at a DNS/HAProxy endpoint that
; load-balances across all replicas
feature_store_ro = host=pg17-replica-1 port=5432 dbname=feature_store

[pgbouncer]
pool_mode = transaction
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
server_lifetime = 3600
ignore_startup_parameters = extra_float_digits
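
For completeness, this is what the two routing signals look like from the SQL side; the /* read */ tag is only a convention our router greps for, not something PostgreSQL or PgBouncer interprets.

BEGIN READ ONLY;
/* read */ SELECT feature_value, updated_at
FROM features
WHERE entity_id = 'entity_12345' AND feature_name = 'user_age' AND deleted_at IS NULL;
COMMIT;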

Join the Discussion

We’ve shared our war story of scaling PostgreSQL 17 to 100k IOPS for AI/ML feature stores, but we know every infrastructure stack is different. We’d love to hear from you: what’s your experience scaling relational databases for ML workloads? Have you hit similar IOPS limits with other tools?

Discussion Questions

  • By 2026, PostgreSQL 17+ read replica pools will be the default for 60% of AI/ML feature stores—do you agree, or will purpose-built feature stores like Feast replace relational databases entirely?
  • We chose to trade 7.2ms of p95 replica lag for 59% cost savings and 2.4x higher throughput—would you have accepted higher lag for more cost savings, or prioritized lower latency?
  • How does PostgreSQL 17’s read replica performance compare to ClickHouse for feature store workloads, especially for time-series feature lookups?

Frequently Asked Questions

Does PostgreSQL 17’s read replica support scale linearly with the number of replicas?

No, we saw diminishing returns after 4 replicas: adding a 5th replica only increased total read IOPS by 8%, because the primary’s WAL sender processes became a bottleneck. We solved this by enabling WAL compression (wal_compression = on), which reduced WAL bandwidth by 42% and allowed us to scale to 6 replicas with 18% more IOPS. For most feature stores, 4-6 replicas are the sweet spot for cost vs. throughput.
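
One rough way to check whether the WAL senders (rather than replica replay) are the bottleneck is to compare how far each sender has shipped WAL against the current WAL position; this query is illustrative, not lifted from our monitoring stack.

SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_backlog,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn))           AS replay_backlog
FROM pg_stat_replication;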

Is PostgreSQL 17’s parallel I/O subsystem backward compatible with PostgreSQL 16 clients?

Yes, all PostgreSQL 16 and earlier clients (including psycopg3, lib/pq, and pgbench 16) are fully compatible with PostgreSQL 17 servers. We ran our application using psycopg2 (for legacy Python services) and psycopg3 during the migration with no compatibility issues. The parallel I/O improvements are server-side only, so clients don’t need to be updated to benefit from higher IOPS.

How much overhead does pg_read_replica_sync add to write latency?

We measured less than 0.8ms of added latency for writes when using pg_read_replica_sync v0.4.2 (https://github.com/timescale/pg_read_replica_sync) to monitor cross-region replica lag. The tool runs as a background process on the primary, so it doesn’t block write operations. We only saw overhead exceed 1ms when we configured it to check lag every 100ms, which we don’t recommend; our production config checks lag every 5 seconds, which is more than sufficient for our 50ms p95 lag SLA.

Conclusion & Call to Action

After 6 weeks of migration, benchmarking, and tuning, we’re confident that PostgreSQL 17 with read replicas is the most cost-effective, performant solution for AI/ML feature stores handling 40k+ features/sec. Purpose-built feature stores like Feast are great for greenfield projects, but if you already have PostgreSQL expertise and need to scale to 100k IOPS without rewriting your entire data pipeline, PostgreSQL 17 is the clear choice. The parallel I/O improvements alone deliver 2.3x higher throughput than PostgreSQL 16, and read replicas let you scale reads horizontally without breaking the bank. Don’t wait for your p99 latency to hit 2.8 seconds: start benchmarking PostgreSQL 17 today, and use the code samples in this article to accelerate your migration.

103,400 read IOPS achieved with PostgreSQL 17 + 4 read replicas
