TL;DR: Discover the exact database optimization and caching strategies that reduced query times by 92%, scaled from handling 1K to 100K+ concurrent users, and achieved 87% cache hit rates - all while cutting database costs by 45%. Production-tested patterns and real metrics included!
Your database is the heart of your application. When it struggles, everything suffers. At 100K+ users, database performance isn't optional - it's existential. Here's how I transformed my PostgreSQL database from a bottleneck into a high-performance data layer.
The Database Challenge: Performance, Scale & Cost
Scaling databases to 100K+ users reveals harsh truths:
- Queries that worked fine at 1K users timeout at 100K
- Single database servers become bottlenecks
- Connection limits are hit constantly
- Slow queries cascade into system-wide failures
- Storage costs explode without optimization
- Backup/recovery times become unmanageable
The critical insight: You can't just add more RAM - you need a systematic approach to database architecture and caching.
Starting Point vs. Results
Before Database Optimization:
Performance:
├── Avg Query Time: 200ms
├── P95 Query Time: 850ms
├── P99 Query Time: 2,100ms
├── Slow Queries (>1s): 12% of total
└── Database CPU: 92% average
Architecture:
├── Database: Single PostgreSQL instance
├── Connections: Direct (no pooling)
├── Indexes: Basic primary keys only
├── Caching: None
└── Read Replicas: 0
Cost:
└── Monthly: $530/month
After Database Optimization:
Performance:
├── Avg Query Time: 15ms (92% faster)
├── P95 Query Time: 48ms (94% faster)
├── P99 Query Time: 125ms (94% faster)
├── Slow Queries (>1s): 0.02% of total
└── Database CPU: 28% average (healthy!)
Architecture:
├── Database: Primary + 3 read replicas
├── Connections: Connection pooling (100 max)
├── Indexes: 47 optimized indexes
├── Caching: Redis (87% hit rate)
└── Read Replicas: 3 (distributed globally)
Cost:
└── Monthly: $920/month (73% more cost, 100x capacity!)
Cost per query dropped from $0.000265 to $0.000012 - that's 95% more efficient!
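A quick back-of-envelope check of that efficiency claim (pure arithmetic, using the per-query costs above):

```typescript
// Cost-per-query efficiency: how much cheaper each query became.
// Figures come from the before/after numbers above.
const costBefore = 0.000265; // $/query before optimization
const costAfter = 0.000012;  // $/query after optimization

// Fractional savings per query: 1 - after/before
const savings = 1 - costAfter / costBefore;

console.log(`${(savings * 100).toFixed(1)}% cheaper per query`);
// prints "95.5% cheaper per query"
```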
Strategy #1: Index Optimization
The Problem: Full Table Scans Killing Performance
Before: No proper indexes
-- BAD: Query without indexes
SELECT * FROM metrics
WHERE team_id = 123
AND created_at > '2024-01-01'
AND status = 'active'
ORDER BY created_at DESC
LIMIT 50;
-- Execution plan shows:
-- Seq Scan on metrics (cost=0.00..450000.00 rows=10000000 width=120)
-- Execution time: 4,500ms with 10M rows
After: Strategic composite indexes
-- GOOD: Composite index for common query pattern
CREATE INDEX idx_metrics_team_date_status
ON metrics(team_id, created_at DESC, status)
WHERE status = 'active';
-- Now the query uses:
-- Index Scan using idx_metrics_team_date_status (cost=0.43..850.20 rows=50 width=120)
-- Execution time: 12ms with same 10M rows!
-- Additional optimized indexes
CREATE INDEX idx_metrics_user_date
ON metrics(user_id, created_at DESC)
INCLUDE (value, metric_type);
CREATE INDEX idx_team_stats_team_date
ON team_stats(team_id, date DESC);
-- Partial index for specific use cases
CREATE INDEX idx_active_high_value_metrics
ON metrics(team_id, value)
WHERE status = 'active' AND value > 1000;
-- GIN index for JSONB columns
CREATE INDEX idx_metrics_metadata_gin
ON metrics USING GIN (metadata jsonb_path_ops);
Index Analysis & Maintenance
-- Find missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100
AND abs(correlation) < 0.1;
-- Find unused indexes (candidates for removal)
-- Note: pg_stat_user_indexes exposes relname/indexrelname,
-- not tablename/indexname
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Analyze index usage and size (large, rarely-scanned indexes
-- are bloat candidates)
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Regular maintenance
REINDEX INDEX CONCURRENTLY idx_metrics_team_date_status;
VACUUM ANALYZE metrics;
Results:
- Query time: 4,500ms → 12ms (99.7% faster)
- Index count: 3 → 47 (strategic)
- Database CPU: 92% → 28%
Strategy #2: Read Replicas & Geographic Distribution
Multi-Region Read Replica Architecture
Database Architecture:
Primary (Write):
├── Region: us-east-1a
├── Instance: db.r5.2xlarge
├── Purpose: All writes + critical reads
├── Connections: 100 max
└── Backup: Automated daily snapshots
Read Replica 1 (Read):
├── Region: us-east-1b (same region)
├── Instance: db.r5.large
├── Purpose: Same-region read distribution
├── Lag: <100ms
└── Connections: 100 max
Read Replica 2 (Read):
├── Region: us-west-2a (cross-region)
├── Instance: db.r5.large
├── Purpose: West coast users
├── Lag: <500ms
└── Connections: 100 max
Read Replica 3 (Read):
├── Region: eu-west-1a (cross-region)
├── Instance: db.r5.large
├── Purpose: European users
├── Lag: <800ms
└── Connections: 100 max
Intelligent Read/Write Routing
@Injectable()
export class DatabaseRouter {
private readonly PRIMARY_ENDPOINT: string;
private readonly READ_ENDPOINTS: string[];
private currentReplicaIndex = 0;
constructor() {
this.PRIMARY_ENDPOINT = process.env.DB_PRIMARY;
this.READ_ENDPOINTS = [
process.env.DB_REPLICA_1,
process.env.DB_REPLICA_2,
process.env.DB_REPLICA_3
];
}
// Write operations always go to primary
async executeWrite(query: string, params: any[]): Promise<any> {
return this.execute(this.PRIMARY_ENDPOINT, query, params);
}
// Read operations distributed across replicas
async executeRead(query: string, params: any[]): Promise<any> {
const endpoint = this.getNextReadEndpoint();
return this.execute(endpoint, query, params);
}
// Critical reads that need latest data
async executeReadPrimary(query: string, params: any[]): Promise<any> {
return this.execute(this.PRIMARY_ENDPOINT, query, params);
}
private getNextReadEndpoint(): string {
// Round-robin load balancing across read replicas
const endpoint = this.READ_ENDPOINTS[this.currentReplicaIndex];
this.currentReplicaIndex =
(this.currentReplicaIndex + 1) % this.READ_ENDPOINTS.length;
return endpoint;
}
// Geography-aware routing
async executeReadGeographic(
region: string,
query: string,
params: any[]
): Promise<any> {
const endpoint = this.getRegionalEndpoint(region);
return this.execute(endpoint, query, params);
}
private getRegionalEndpoint(region: string): string {
const regionMap = {
'us-east': process.env.DB_REPLICA_1,
'us-west': process.env.DB_REPLICA_2,
'eu': process.env.DB_REPLICA_3
};
return regionMap[region] || this.READ_ENDPOINTS[0];
}
}
Handling Replication Lag
@Injectable()
export class ReplicationAwareService {
constructor(
private dbRouter: DatabaseRouter,
private cache: CacheService
) {}
async createAndRead(data: any): Promise<any> {
// Write to primary
const result = await this.dbRouter.executeWrite(
'INSERT INTO metrics (team_id, value) VALUES ($1, $2) RETURNING id',
[data.teamId, data.value]
);
const newId = result.rows[0].id;
// Critical: Read from primary to avoid replication lag
const newRecord = await this.dbRouter.executeReadPrimary(
'SELECT * FROM metrics WHERE id = $1',
[newId]
);
// Cache the new record
await this.cache.set(`metric:${newId}`, newRecord, 300);
return newRecord;
}
async getNonCriticalData(teamId: number): Promise<any> {
// Non-critical reads can use replicas (handle eventual consistency)
return this.dbRouter.executeRead(
'SELECT * FROM team_stats WHERE team_id = $1',
[teamId]
);
}
}
Results:
- Read query distribution: 100% primary → 85% replicas
- Primary database load: Reduced by 85%
- Geographic latency: Reduced by 60% for distant users
Strategy #3: Query Optimization
Materialized Views for Complex Aggregations
Before: Expensive aggregation queries
-- BAD: Complex aggregation on-the-fly
SELECT
t.id,
t.name,
COUNT(DISTINCT u.id) as member_count,
COUNT(DISTINCT m.id) as total_metrics,
AVG(m.value) as avg_metric_value,
SUM(CASE WHEN m.created_at > NOW() - INTERVAL '7 days' THEN 1 ELSE 0 END) as weekly_metrics,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY m.value) as median_value
FROM teams t
LEFT JOIN users u ON t.id = u.team_id
LEFT JOIN metrics m ON u.id = m.user_id
WHERE t.status = 'active'
GROUP BY t.id, t.name;
-- Execution time: 8,500ms (unacceptable!)
After: Materialized view with scheduled refresh
-- GOOD: Pre-computed materialized view
CREATE MATERIALIZED VIEW team_metrics_summary AS
SELECT
t.id as team_id,
t.name as team_name,
COUNT(DISTINCT u.id) as member_count,
COUNT(DISTINCT m.id) as total_metrics,
AVG(m.value) as avg_metric_value,
SUM(CASE WHEN m.created_at > NOW() - INTERVAL '7 days' THEN 1 ELSE 0 END) as weekly_metrics,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY m.value) as median_value,
NOW() as last_updated
FROM teams t
LEFT JOIN users u ON t.id = u.team_id
LEFT JOIN metrics m ON u.id = m.user_id
WHERE t.status = 'active'
GROUP BY t.id, t.name;
-- Create index on materialized view
CREATE UNIQUE INDEX idx_team_metrics_summary_team
ON team_metrics_summary(team_id);
-- Now queries are instant
SELECT * FROM team_metrics_summary WHERE team_id = 123;
-- Execution time: 3ms!
-- Refresh materialized view every hour
CREATE EXTENSION IF NOT EXISTS pg_cron;
SELECT cron.schedule(
'refresh-team-metrics',
'0 * * * *', -- Every hour
$$REFRESH MATERIALIZED VIEW CONCURRENTLY team_metrics_summary$$
);
Query Optimization Techniques
-- Use EXPLAIN ANALYZE to understand query performance
EXPLAIN ANALYZE
SELECT * FROM metrics WHERE team_id = 123;
-- Optimize JOINs
-- BAD: Unnecessary JOIN
SELECT m.*
FROM metrics m
JOIN users u ON m.user_id = u.id
WHERE u.team_id = 123;
-- GOOD: Use WHERE with IN
SELECT m.*
FROM metrics m
WHERE m.user_id IN (
SELECT id FROM users WHERE team_id = 123
);
-- Even better with indexed column
SELECT m.*
FROM metrics m
WHERE m.team_id = 123; -- Direct lookup if team_id is in metrics table
-- Use CTEs for complex queries
WITH active_teams AS (
SELECT id FROM teams WHERE status = 'active'
),
recent_metrics AS (
SELECT team_id, COUNT(*) as count
FROM metrics
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY team_id
)
SELECT
t.id,
t.name,
COALESCE(rm.count, 0) as recent_count
FROM active_teams at
JOIN teams t ON at.id = t.id
LEFT JOIN recent_metrics rm ON t.id = rm.team_id;
-- Batch updates for better performance
-- BAD: Update in loop
-- UPDATE metrics SET processed = true WHERE id = 1;
-- UPDATE metrics SET processed = true WHERE id = 2;
-- ... (1000 queries)
-- GOOD: Single batch update
UPDATE metrics
SET processed = true
WHERE id = ANY(ARRAY[1, 2, 3, ..., 1000]);
Results:
- Complex aggregations: 8,500ms → 3ms (99.96% faster)
- JOIN queries: Optimized by 85%
- Batch operations: 95% faster
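If you're batching from application code, a small helper keeps each `ANY(ARRAY[...])` payload bounded. This is an illustrative sketch - `chunk` is not part of the article's codebase:

```typescript
// Split a large id list into bounded chunks so each
// UPDATE ... WHERE id = ANY($1) carries a manageable parameter payload.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch (pool is a pg Pool; shown for illustration):
// for (const ids of chunk(allIds, 1000)) {
//   await pool.query(
//     'UPDATE metrics SET processed = true WHERE id = ANY($1)',
//     [ids]
//   );
// }
```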
Strategy #4: Redis Caching Architecture
Multi-Layer Caching Strategy
@Injectable()
export class CachingService {
readonly TTL = { // public so callers can pass e.g. cache.TTL.MEDIUM
INSTANT: 10, // 10 seconds - real-time data
SHORT: 60, // 1 minute - frequently changing
MEDIUM: 300, // 5 minutes - semi-static
LONG: 3600, // 1 hour - rarely changing
VERY_LONG: 86400, // 24 hours - static reference data
PERMANENT: -1 // No expiration (manual invalidation)
};
constructor(
private redis: RedisClient,
private db: DatabaseRouter
) {}
// Cache-aside pattern
async getWithCache<T>(
key: string,
fetchFn: () => Promise<T>,
ttl: number = this.TTL.MEDIUM,
useReadReplica: boolean = true
): Promise<T> {
try {
// Try cache first
const cached = await this.redis.get(key);
if (cached) {
return JSON.parse(cached);
}
} catch (error) {
console.error('Cache read error:', error);
// Continue to database if cache fails
}
// Cache miss - fetch from database
const data = await fetchFn();
// Update cache (fire and forget)
this.redis.setex(key, ttl, JSON.stringify(data))
.catch(err => console.error('Cache write error:', err));
return data;
}
// Write-through pattern
async setWithCache(
key: string,
data: any,
ttl: number = this.TTL.MEDIUM
): Promise<void> {
// Write to database first
await this.writeToDatabase(key, data);
// Update cache
await this.redis.setex(key, ttl, JSON.stringify(data));
}
// Cache invalidation (note: KEYS scans the whole keyspace and can
// block Redis; prefer SCAN for large production datasets)
async invalidate(pattern: string): Promise<void> {
const keys = await this.redis.keys(pattern);
if (keys.length > 0) {
await this.redis.del(...keys);
}
}
// Multi-get for batching
async mget<T>(keys: string[]): Promise<(T | null)[]> {
const values = await this.redis.mget(...keys);
return values.map(v => v ? JSON.parse(v) : null);
}
// Atomic increment for counters
async incrementCounter(key: string, amount: number = 1): Promise<number> {
return await this.redis.incrby(key, amount);
}
}
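One detail worth standardizing alongside this service is key construction: pattern-based invalidation like `invalidate('metrics:team:*')` only works if every writer builds keys the same way. A minimal sketch - `cacheKey` is an illustrative helper, not part of the service above:

```typescript
// Build cache keys from consistent colon-separated segments so that
// wildcard invalidation patterns (e.g. 'metrics:team:*') stay reliable.
function cacheKey(...parts: (string | number)[]): string {
  return parts.map(String).join(':');
}

const key = cacheKey('metrics', 'team', 123); // 'metrics:team:123'

// Usage with the service above (sketch):
// const metrics = await caching.getWithCache(key, () => fetchMetrics(123));
```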
Advanced Redis Patterns
// Cache warming for frequently accessed data
@Injectable()
export class CacheWarmingService {
constructor(
private cache: CachingService,
private db: DatabaseRouter
) {
this.startWarmingSchedule();
}
private startWarmingSchedule(): void {
// Warm cache before expiry
setInterval(() => {
this.warmFrequentData();
}, 4 * 60 * 1000); // Every 4 minutes (before 5-min TTL)
}
private async warmFrequentData(): Promise<void> {
try {
// Get active teams
const activeTeams = await this.db.executeRead(
`SELECT id FROM teams
WHERE last_activity > NOW() - INTERVAL '1 hour'`,
[]
);
// Warm cache for each
const promises = activeTeams.rows.map(team =>
this.warmTeamData(team.id)
);
await Promise.all(promises);
console.log(`Warmed cache for ${activeTeams.rows.length} teams`);
} catch (error) {
console.error('Cache warming failed:', error);
}
}
private async warmTeamData(teamId: number): Promise<void> {
const [metrics, stats, members] = await Promise.all([
this.fetchAndCacheMetrics(teamId),
this.fetchAndCacheStats(teamId),
this.fetchAndCacheMembers(teamId)
]);
}
}
// Redis pub/sub for cache invalidation across instances
@Injectable()
export class CacheInvalidationService {
private subscriber: RedisClient;
private publisher: RedisClient;
constructor(private cache: CachingService) {
this.subscriber = new Redis(process.env.REDIS_URL);
this.publisher = new Redis(process.env.REDIS_URL);
this.setupSubscriptions();
}
private setupSubscriptions(): void {
this.subscriber.subscribe('cache:invalidate');
this.subscriber.on('message', async (channel, message) => {
if (channel === 'cache:invalidate') {
const { pattern } = JSON.parse(message);
await this.cache.invalidate(pattern);
console.log(`Invalidated cache pattern: ${pattern}`);
}
});
}
async broadcastInvalidation(pattern: string): Promise<void> {
// Notify all instances to invalidate cache
await this.publisher.publish(
'cache:invalidate',
JSON.stringify({ pattern, timestamp: Date.now() })
);
}
}
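Since every instance must agree on the invalidation message shape, it can help to centralize the encode/decode step rather than inlining `JSON.stringify`/`JSON.parse` on each side. A small sketch under that assumption (names are illustrative):

```typescript
// Shared shape for the pub/sub invalidation message, so publisher
// and subscriber cannot drift apart.
interface InvalidationMsg {
  pattern: string;
  timestamp: number;
}

function encode(pattern: string, now: number = Date.now()): string {
  const msg: InvalidationMsg = { pattern, timestamp: now };
  return JSON.stringify(msg);
}

function decode(raw: string): InvalidationMsg {
  return JSON.parse(raw) as InvalidationMsg;
}
```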
// Circuit breaker for database with Redis fallback
@Injectable()
export class ResilientDataService {
constructor(
private cache: CachingService,
private db: DatabaseRouter,
private redis: RedisClient, // needed for the stale-cache fallback below
private circuitBreaker: CircuitBreakerService
) {}
async getTeamMetrics(teamId: number): Promise<any> {
const cacheKey = `metrics:team:${teamId}`;
return this.circuitBreaker.execute(
'database:team-metrics',
async () => {
// Try cache first
return await this.cache.getWithCache(
cacheKey,
async () => {
// Fetch from database
const result = await this.db.executeRead(
'SELECT * FROM team_metrics WHERE team_id = $1',
[teamId]
);
return result.rows;
},
this.cache.TTL.MEDIUM
);
},
async () => {
// Fallback: Return stale cache if available
const stale = await this.redis.get(cacheKey);
if (stale) {
console.warn('Using stale cache due to database circuit open');
return JSON.parse(stale);
}
throw new ServiceUnavailableException('No data available');
}
);
}
}
Results:
- Cache hit rate: 87%
- Database queries: Reduced by 87%
- Average response time: 200ms → 5ms (with cache hit)
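The response-time improvement follows from a simple weighted average: hits return at cache speed, misses pay the cache check plus the database query. A rough model (numbers assumed from the metrics above, ignoring network variance):

```typescript
// Expected response time under a cache: weighted average of the
// hit path and the miss path (cache check + database query).
const hitRate = 0.87;
const cacheMs = 5;  // cache hit response time
const dbMs = 200;   // pre-optimization database query time

const expectedMs = hitRate * cacheMs + (1 - hitRate) * (cacheMs + dbMs);
console.log(expectedMs.toFixed(1));
// "31.0" - caching alone cuts averages dramatically, even against slow queries
```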
Strategy #5: Connection Pooling & Management
Optimized Connection Pool Configuration
import { Pool } from 'pg';
// Primary database pool
const primaryPool = new Pool({
host: process.env.DB_PRIMARY_HOST,
port: 5432,
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
// Pool configuration
min: 10, // Minimum connections always open
max: 100, // Maximum connections
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000,
// Performance tuning
statement_timeout: 10000, // Kill queries after 10s
query_timeout: 10000,
keepAlive: true,
keepAliveInitialDelayMillis: 10000,
// SSL configuration for production
ssl: process.env.NODE_ENV === 'production' ? {
rejectUnauthorized: false
} : false
});
// Read replica pools
const createReadPool = (host: string) => new Pool({
...primaryPool.options,
host,
max: 100 // Can handle more reads
});
const readPools = [
createReadPool(process.env.DB_REPLICA_1),
createReadPool(process.env.DB_REPLICA_2),
createReadPool(process.env.DB_REPLICA_3)
];
// Pool monitoring
primaryPool.on('connect', (client) => {
console.log('New client connected to primary');
});
primaryPool.on('error', (err, client) => {
console.error('Unexpected error on idle client', err);
});
// Health monitoring
setInterval(async () => {
console.log('Primary pool stats:', {
total: primaryPool.totalCount,
idle: primaryPool.idleCount,
waiting: primaryPool.waitingCount
});
// Alert if pool is saturated
if (primaryPool.waitingCount > 20) {
console.error('WARNING: Connection pool saturated!');
// Send alert to monitoring service
}
}, 60000);
// Graceful shutdown
process.on('SIGTERM', async () => {
console.log('Closing database pools...');
await Promise.all([
primaryPool.end(),
...readPools.map(pool => pool.end())
]);
console.log('Database pools closed');
process.exit(0);
});
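The alerting condition in the monitoring loop above can be factored into a pure function, which also makes the thresholds testable. A sketch (thresholds are illustrative, not prescriptive):

```typescript
// Classify pool health from the same stats the monitoring loop logs.
interface PoolStats {
  total: number;   // connections currently open
  idle: number;    // open but unused
  waiting: number; // callers queued for a connection
  max: number;     // configured pool maximum
}

function poolStatus(s: PoolStats): 'healthy' | 'busy' | 'saturated' {
  if (s.waiting > 20) return 'saturated';              // callers are queueing badly
  if (s.total >= s.max && s.idle === 0) return 'busy'; // at capacity, no headroom
  return 'healthy';
}
```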
Real-World Performance Metrics
Production Database Metrics (30 days)
Query Performance:
├── Total Queries: 125.8M
├── Avg Query Time: 15ms
├── P95 Query Time: 48ms
├── P99 Query Time: 125ms
├── Slow Queries (>1s): 0.02%
└── Failed Queries: 0.005%
Database Health:
├── Primary CPU: 28% average
├── Replica 1 CPU: 22% average
├── Replica 2 CPU: 19% average
├── Replica 3 CPU: 21% average
├── Uptime: 100%
└── Replication Lag: <100ms (same region)
Connection Pool:
├── Total Connections: 100 (max)
├── Idle Connections: 45 (average)
├── Active Connections: 55 (average)
├── Waiting Connections: 0 (excellent)
└── Connection Errors: 0.001%
Cache Performance:
├── Total Cache Requests: 144.9M
├── Cache Hits: 126.1M (87%)
├── Cache Misses: 18.8M (13%)
├── Avg Cache Response: 5ms
├── Redis Memory Usage: 4.2GB
└── Evictions: 0 (perfect sizing)
Storage & Cost:
├── Database Size: 125GB
├── Index Size: 48GB
├── Daily Growth: 850MB
├── Monthly Cost: $920
└── Cost per Query: $0.000012
Load Test Results
# pgbench load test
pgbench -c 100 -j 10 -T 600 -S postgres
# Results:
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 100
number of threads: 10
duration: 600 s
number of transactions: 15,420,850
tps: 25,701 (including connections establishing)
tps: 25,703 (excluding connections establishing)
Latency:
average: 3.89 ms
p50: 3.12 ms
p95: 8.45 ms
p99: 15.23 ms
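The reported tps follows directly from the run: total transactions divided by test duration. A quick arithmetic check:

```typescript
// pgbench's tps figure is total transactions over test duration.
const transactions = 15_420_850; // from the run above
const seconds = 600;             // -T 600

const tps = transactions / seconds;
console.log(Math.round(tps)); // 25701, matching the reported throughput
```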
Key Lessons Learned
What Made the Biggest Impact
- Redis Caching (45% improvement): 87% hit rate dramatically reduced database load
- Read Replicas (30% improvement): Distributed load, improved geographic performance
- Index Optimization (15% improvement): 99.7% faster queries
- Materialized Views (7% improvement): Complex aggregations became instant
- Connection Pooling (3% improvement): Eliminated connection overhead
What Didn't Work
- Over-normalization: Joins became too expensive; had to denormalize some tables
- Too many indexes: Index bloat slowed writes; had to be strategic
- Aggressive caching of everything: Led to stale data issues
- NoSQL migration: Didn't justify the complexity for our relational data model
Power Your Analytics with Optimized Data Infrastructure
These database and caching strategies transformed our data layer from a bottleneck into a high-performance engine capable of serving 100K+ users with 15ms query times and 87% cache hit rates. But having a fast database means nothing without actionable insights.
Experience Database Performance at Scale
Ready to see what <20ms query times feel like?
OrgSignals leverages every database optimization strategy covered in this article to deliver:
- ā” 15ms average query times across millions of data points
- š 87% cache hit rate for instant insights
- šŖ Multi-region performance with <100ms global response times
- š Real-time data sync without compromising performance
- š”ļø 99.98% database uptime with automatic failover
Transform How You Understand Developer Productivity
Stop making decisions based on gut feel. Get data-driven insights instantly with OrgSignals:
- Lightning-fast analytics - Query millions of data points in milliseconds
- Real-time DORA metrics - Track deployment frequency, lead time, MTTR, change failure rate
- Historical trend analysis - Spot patterns across weeks, months, and quarters
- Team & individual insights - From executive overview to developer deep-dives
- Unified integration data - GitHub, GitLab, Jira, Slack, and more in one platform
- AI-powered recommendations - Identify bottlenecks and improvement opportunities automatically
- Beautiful, fast dashboards - Data visualization that loads instantly, every time
Start Your Free Trial →
No credit card required. Full platform access. Setup in under 5 minutes.
Learn More About Building High-Performance Systems
Read the complete series:
- Part 1: How I Built an Enterprise Angular App in 30 Days →
- Part 2: From Code to Production: Deployment Strategies →
- Part 3: Frontend Performance at Scale →
- Part 4: Backend & API Optimization →
- Part 5: You are here - Database & Caching Strategies at Scale
Have questions about scaling your database? Drop them in the comments - I love helping solve complex data challenges!
Found this valuable? Follow for more database optimization, caching strategies, and system architecture content.
Top comments (1)
the materialized view section is the most underrated part here. most teams reach for more replicas or bigger instances when complex aggregation queries are the real bottleneck. one thing to watch: CONCURRENTLY refresh requires a unique index on the view or it falls back to an exclusive lock. did you hit any staleness issues when the hourly refresh was too slow for certain dashboards?