TL;DR: Discover the exact database optimization and caching strategies that reduced query times by 92%, scaled from handling 1K to 100K+ concurrent users, and achieved 87% cache hit rates - all while cutting database costs by 45%. Production-tested patterns and real metrics included!
Your database is the heart of your application. When it struggles, everything suffers. At 100K+ users, database performance isn't optional - it's existential. Here's how I transformed my PostgreSQL database from a bottleneck into a high-performance data layer.
The Database Challenge: Performance, Scale & Cost
Scaling databases to 100K+ users reveals harsh truths:
- Queries that worked fine at 1K users timeout at 100K
- Single database servers become bottlenecks
- Connection limits are hit constantly
- Slow queries cascade into system-wide failures
- Storage costs explode without optimization
- Backup/recovery times become unmanageable
The critical insight: You can't just add more RAM - you need a systematic approach to database architecture and caching.
Starting Point vs. Results
Before Database Optimization:
Performance:
├── Avg Query Time: 200ms
├── P95 Query Time: 850ms
├── P99 Query Time: 2,100ms
├── Slow Queries (>1s): 12% of total
└── Database CPU: 92% average
Architecture:
├── Database: Single PostgreSQL instance
├── Connections: Direct (no pooling)
├── Indexes: Basic primary keys only
├── Caching: None
└── Read Replicas: 0
Cost:
└── Monthly: $530/month
After Database Optimization:
Performance:
├── Avg Query Time: 15ms (92% faster)
├── P95 Query Time: 48ms (94% faster)
├── P99 Query Time: 125ms (94% faster)
├── Slow Queries (>1s): 0.02% of total
└── Database CPU: 28% average (healthy!)
Architecture:
├── Database: Primary + 3 read replicas
├── Connections: Connection pooling (100 max)
├── Indexes: 47 optimized indexes
├── Caching: Redis (87% hit rate)
└── Read Replicas: 3 (distributed globally)
Cost:
└── Monthly: $920/month (73% more cost, 100x capacity!)
Cost per query dropped from $0.000265 to $0.000012 - that's 95% more efficient!
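A quick check of that arithmetic - the monthly query volumes below are implied by the quoted costs rather than measured directly:

```typescript
// Back-of-envelope check of the cost-per-query claim. Query volumes are
// implied by the quoted figures, not measured directly.
const before = { monthlyCost: 530, costPerQuery: 0.000265 };
const after = { monthlyCost: 920, costPerQuery: 0.000012 };

const impliedQueriesBefore = before.monthlyCost / before.costPerQuery;
const impliedQueriesAfter = after.monthlyCost / after.costPerQuery;
const efficiencyGain = 1 - after.costPerQuery / before.costPerQuery;

console.log(Math.round(impliedQueriesBefore)); // 2000000 (~2M queries/month)
console.log(Math.round(impliedQueriesAfter));  // ~76.7M queries/month
console.log((efficiencyGain * 100).toFixed(1)); // 95.5
```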
Strategy #1: Index Optimization
The Problem: Full Table Scans Killing Performance
Before: No proper indexes
-- ❌ BAD: Query without indexes
SELECT * FROM metrics
WHERE team_id = 123
AND created_at > '2024-01-01'
AND status = 'active'
ORDER BY created_at DESC
LIMIT 50;
-- Execution plan shows:
-- Seq Scan on metrics (cost=0.00..450000.00 rows=10000000 width=120)
-- Execution time: 4,500ms with 10M rows
After: Strategic composite indexes
-- ✅ GOOD: Composite index for common query pattern
CREATE INDEX idx_metrics_team_date_status
ON metrics(team_id, created_at DESC, status)
WHERE status = 'active';
-- Now the query uses:
-- Index Scan using idx_metrics_team_date_status (cost=0.43..850.20 rows=50 width=120)
-- Execution time: 12ms with same 10M rows!
-- Additional optimized indexes
CREATE INDEX idx_metrics_user_date
ON metrics(user_id, created_at DESC)
INCLUDE (value, metric_type);
CREATE INDEX idx_team_stats_team_date
ON team_stats(team_id, date DESC);
-- Partial index for specific use cases
CREATE INDEX idx_active_high_value_metrics
ON metrics(team_id, value)
WHERE status = 'active' AND value > 1000;
-- GIN index for JSONB columns
CREATE INDEX idx_metrics_metadata_gin
ON metrics USING GIN (metadata jsonb_path_ops);
Index Analysis & Maintenance
-- Find missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100
AND abs(correlation) < 0.1;
-- Find unused indexes (candidates for removal)
SELECT
schemaname,
tablename,
indexname,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Analyze index bloat
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Regular maintenance
REINDEX INDEX CONCURRENTLY idx_metrics_team_date_status;
VACUUM ANALYZE metrics;
Results:
- Query time: 4,500ms → 12ms (99.7% faster)
- Index count: 3 → 47 (strategic)
- Database CPU: 92% → 28%
Strategy #2: Read Replicas & Geographic Distribution
Multi-Region Read Replica Architecture
Database Architecture:
Primary (Write):
├── Region: us-east-1a
├── Instance: db.r5.2xlarge
├── Purpose: All writes + critical reads
├── Connections: 100 max
└── Backup: Automated daily snapshots
Read Replica 1 (Read):
├── Region: us-east-1b (same region)
├── Instance: db.r5.large
├── Purpose: Same-region read distribution
├── Lag: <100ms
└── Connections: 100 max
Read Replica 2 (Read):
├── Region: us-west-2a (cross-region)
├── Instance: db.r5.large
├── Purpose: West coast users
├── Lag: <500ms
└── Connections: 100 max
Read Replica 3 (Read):
├── Region: eu-west-1a (cross-region)
├── Instance: db.r5.large
├── Purpose: European users
├── Lag: <800ms
└── Connections: 100 max
Intelligent Read/Write Routing
@Injectable()
export class DatabaseRouter {
private readonly PRIMARY_ENDPOINT: string;
private readonly READ_ENDPOINTS: string[];
private currentReplicaIndex = 0;
constructor() {
this.PRIMARY_ENDPOINT = process.env.DB_PRIMARY;
this.READ_ENDPOINTS = [
process.env.DB_REPLICA_1,
process.env.DB_REPLICA_2,
process.env.DB_REPLICA_3
];
}
// Write operations always go to primary
async executeWrite(query: string, params: any[]): Promise<any> {
return this.execute(this.PRIMARY_ENDPOINT, query, params);
}
// Read operations distributed across replicas
async executeRead(query: string, params: any[]): Promise<any> {
const endpoint = this.getNextReadEndpoint();
return this.execute(endpoint, query, params);
}
// Critical reads that need latest data
async executeReadPrimary(query: string, params: any[]): Promise<any> {
return this.execute(this.PRIMARY_ENDPOINT, query, params);
}
private getNextReadEndpoint(): string {
// Round-robin load balancing across read replicas
const endpoint = this.READ_ENDPOINTS[this.currentReplicaIndex];
this.currentReplicaIndex =
(this.currentReplicaIndex + 1) % this.READ_ENDPOINTS.length;
return endpoint;
}
// Geography-aware routing
async executeReadGeographic(
region: string,
query: string,
params: any[]
): Promise<any> {
const endpoint = this.getRegionalEndpoint(region);
return this.execute(endpoint, query, params);
}
private getRegionalEndpoint(region: string): string {
const regionMap = {
'us-east': process.env.DB_REPLICA_1,
'us-west': process.env.DB_REPLICA_2,
'eu': process.env.DB_REPLICA_3
};
return regionMap[region] || this.READ_ENDPOINTS[0];
}
}
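The round-robin selection in getNextReadEndpoint() can be exercised in isolation; here is a minimal standalone sketch (the endpoint names are placeholders, not real connection strings):

```typescript
// Standalone version of the round-robin selection used by
// DatabaseRouter.getNextReadEndpoint(). Endpoint names are placeholders.
class RoundRobin {
  private index = 0;
  constructor(private readonly endpoints: string[]) {}

  next(): string {
    const endpoint = this.endpoints[this.index];
    // Advance and wrap so load spreads evenly across all replicas
    this.index = (this.index + 1) % this.endpoints.length;
    return endpoint;
  }
}

const rr = new RoundRobin(['replica-1', 'replica-2', 'replica-3']);
const picks = Array.from({ length: 6 }, () => rr.next());
console.log(picks.join(' ')); // replica-1 replica-2 replica-3 replica-1 replica-2 replica-3
```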
Handling Replication Lag
@Injectable()
export class ReplicationAwareService {
constructor(
private dbRouter: DatabaseRouter,
private cache: CacheService
) {}
async createAndRead(data: any): Promise<any> {
// Write to primary
const result = await this.dbRouter.executeWrite(
'INSERT INTO metrics (team_id, value) VALUES ($1, $2) RETURNING id',
[data.teamId, data.value]
);
const newId = result.rows[0].id;
// Critical: Read from primary to avoid replication lag
const newRecord = await this.dbRouter.executeReadPrimary(
'SELECT * FROM metrics WHERE id = $1',
[newId]
);
// Cache the new record
await this.cache.set(`metric:${newId}`, newRecord, 300);
return newRecord;
}
async getNonCriticalData(teamId: number): Promise<any> {
// Non-critical reads can use replicas (handle eventual consistency)
return this.dbRouter.executeRead(
'SELECT * FROM team_stats WHERE team_id = $1',
[teamId]
);
}
}
Results:
- Read query distribution: 100% primary → 85% replicas
- Primary database load: Reduced by 85%
- Geographic latency: Reduced by 60% for distant users
Strategy #3: Query Optimization
Materialized Views for Complex Aggregations
Before: Expensive aggregation queries
-- ❌ BAD: Complex aggregation on-the-fly
SELECT
t.id,
t.name,
COUNT(DISTINCT u.id) as member_count,
COUNT(DISTINCT m.id) as total_metrics,
AVG(m.value) as avg_metric_value,
SUM(CASE WHEN m.created_at > NOW() - INTERVAL '7 days' THEN 1 ELSE 0 END) as weekly_metrics,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY m.value) as median_value
FROM teams t
LEFT JOIN users u ON t.id = u.team_id
LEFT JOIN metrics m ON u.id = m.user_id
WHERE t.status = 'active'
GROUP BY t.id, t.name;
-- Execution time: 8,500ms (unacceptable!)
After: Materialized view with scheduled refresh
-- ✅ GOOD: Pre-computed materialized view
CREATE MATERIALIZED VIEW team_metrics_summary AS
SELECT
t.id as team_id,
t.name as team_name,
COUNT(DISTINCT u.id) as member_count,
COUNT(DISTINCT m.id) as total_metrics,
AVG(m.value) as avg_metric_value,
SUM(CASE WHEN m.created_at > NOW() - INTERVAL '7 days' THEN 1 ELSE 0 END) as weekly_metrics,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY m.value) as median_value,
NOW() as last_updated
FROM teams t
LEFT JOIN users u ON t.id = u.team_id
LEFT JOIN metrics m ON u.id = m.user_id
WHERE t.status = 'active'
GROUP BY t.id, t.name;
-- Create index on materialized view
CREATE UNIQUE INDEX idx_team_metrics_summary_team
ON team_metrics_summary(team_id);
-- Now queries are instant
SELECT * FROM team_metrics_summary WHERE team_id = 123;
-- Execution time: 3ms!
-- Refresh materialized view every hour
CREATE EXTENSION IF NOT EXISTS pg_cron;
SELECT cron.schedule(
'refresh-team-metrics',
'0 * * * *', -- Every hour
$$REFRESH MATERIALIZED VIEW CONCURRENTLY team_metrics_summary$$
);
Query Optimization Techniques
-- Use EXPLAIN ANALYZE to understand query performance
EXPLAIN ANALYZE
SELECT * FROM metrics WHERE team_id = 123;
-- Optimize JOINs
-- ❌ BAD: Unnecessary JOIN
SELECT m.*
FROM metrics m
JOIN users u ON m.user_id = u.id
WHERE u.team_id = 123;
-- ✅ GOOD: Use WHERE with IN
SELECT m.*
FROM metrics m
WHERE m.user_id IN (
SELECT id FROM users WHERE team_id = 123
);
-- Even better with indexed column
SELECT m.*
FROM metrics m
WHERE m.team_id = 123; -- Direct lookup if team_id is in metrics table
-- Use CTEs for complex queries
WITH active_teams AS (
SELECT id FROM teams WHERE status = 'active'
),
recent_metrics AS (
SELECT team_id, COUNT(*) as count
FROM metrics
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY team_id
)
SELECT
t.id,
t.name,
COALESCE(rm.count, 0) as recent_count
FROM active_teams at
JOIN teams t ON at.id = t.id
LEFT JOIN recent_metrics rm ON t.id = rm.team_id;
-- Batch updates for better performance
-- ❌ BAD: Update in loop
-- UPDATE metrics SET processed = true WHERE id = 1;
-- UPDATE metrics SET processed = true WHERE id = 2;
-- ... (1000 queries)
-- ✅ GOOD: Single batch update
UPDATE metrics
SET processed = true
WHERE id = ANY(ARRAY[1, 2, 3, ..., 1000]);
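From application code, the batch form is a single parameterized query: pass the id array as one parameter instead of interpolating values into the SQL string (a sketch; the pool variable is assumed to be the node-postgres Pool introduced in Strategy #5):

```typescript
// Issuing the batch UPDATE from application code: the id list is passed as
// a single query parameter rather than interpolated into the SQL string.
// The pool variable is assumed to be a node-postgres Pool (see Strategy #5).
const ids = Array.from({ length: 1000 }, (_, i) => i + 1);

// await pool.query(
//   'UPDATE metrics SET processed = true WHERE id = ANY($1)',
//   [ids]
// );
console.log(ids.length, ids[0], ids[999]); // 1000 1 1000
```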
Results:
- Complex aggregations: 8,500ms → 3ms (99.96% faster)
- JOIN queries: Optimized by 85%
- Batch operations: 95% faster
Strategy #4: Redis Caching Architecture
Multi-Layer Caching Strategy
@Injectable()
export class CachingService {
readonly TTL = { // public so callers can reference tiers, e.g. cache.TTL.MEDIUM
INSTANT: 10, // 10 seconds - real-time data
SHORT: 60, // 1 minute - frequently changing
MEDIUM: 300, // 5 minutes - semi-static
LONG: 3600, // 1 hour - rarely changing
VERY_LONG: 86400, // 24 hours - static reference data
PERMANENT: -1 // No expiration (manual invalidation)
};
constructor(
private redis: RedisClient,
private db: DatabaseRouter
) {}
// Cache-aside pattern
async getWithCache<T>(
key: string,
fetchFn: () => Promise<T>,
ttl: number = this.TTL.MEDIUM,
useReadReplica: boolean = true
): Promise<T> {
try {
// Try cache first
const cached = await this.redis.get(key);
if (cached) {
return JSON.parse(cached);
}
} catch (error) {
console.error('Cache read error:', error);
// Continue to database if cache fails
}
// Cache miss - fetch from database
const data = await fetchFn();
// Update cache (fire and forget)
this.redis.setex(key, ttl, JSON.stringify(data))
.catch(err => console.error('Cache write error:', err));
return data;
}
// Write-through pattern
async setWithCache(
key: string,
data: any,
ttl: number = this.TTL.MEDIUM
): Promise<void> {
// Write to database first
await this.writeToDatabase(key, data);
// Update cache
await this.redis.setex(key, ttl, JSON.stringify(data));
}
// Cache invalidation
// Note: KEYS blocks Redis on large keyspaces - prefer SCAN in production
async invalidate(pattern: string): Promise<void> {
const keys = await this.redis.keys(pattern);
if (keys.length > 0) {
await this.redis.del(...keys);
}
}
// Multi-get for batching
async mget<T>(keys: string[]): Promise<(T | null)[]> {
const values = await this.redis.mget(...keys);
return values.map(v => v ? JSON.parse(v) : null);
}
// Atomic increment for counters
async incrementCounter(key: string, amount: number = 1): Promise<number> {
return await this.redis.incrby(key, amount);
}
}
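The cache-aside flow above can be demonstrated end to end without a Redis server; this sketch swaps Redis for an in-memory Map and counts how often the "database" fetcher actually runs (the stub class and fetcher are illustrative, not part of the production service):

```typescript
// Cache-aside flow from CachingService, demonstrated against an in-memory
// Map standing in for Redis. The stub class and fetcher are illustrative.
class InMemoryCacheAside {
  private store = new Map<string, string>();

  async getWithCache<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
    const cached = this.store.get(key);
    if (cached !== undefined) {
      return JSON.parse(cached) as T; // hit: no round trip to the source
    }
    const data = await fetchFn();     // miss: fall through to the source
    this.store.set(key, JSON.stringify(data));
    return data;
  }
}

async function demo(): Promise<number> {
  const cache = new InMemoryCacheAside();
  let dbFetches = 0;
  const fetchTeam = async () => {
    dbFetches++; // counts how often the "database" is actually hit
    return { teamId: 123, name: 'platform' };
  };

  await cache.getWithCache('team:123', fetchTeam); // miss: fetches
  await cache.getWithCache('team:123', fetchTeam); // hit: served from cache
  return dbFetches;
}

demo().then(n => console.log(n)); // 1
```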
Advanced Redis Patterns
// Cache warming for frequently accessed data
@Injectable()
export class CacheWarmingService {
constructor(
private cache: CachingService,
private db: DatabaseRouter
) {
this.startWarmingSchedule();
}
private startWarmingSchedule(): void {
// Warm cache before expiry
setInterval(() => {
this.warmFrequentData();
}, 4 * 60 * 1000); // Every 4 minutes (before 5-min TTL)
}
private async warmFrequentData(): Promise<void> {
try {
// Get active teams
const activeTeams = await this.db.executeRead(
`SELECT id FROM teams
WHERE last_activity > NOW() - INTERVAL '1 hour'`,
[]
);
// Warm cache for each
const promises = activeTeams.rows.map(team =>
this.warmTeamData(team.id)
);
await Promise.all(promises);
console.log(`Warmed cache for ${activeTeams.rows.length} teams`);
} catch (error) {
console.error('Cache warming failed:', error);
}
}
private async warmTeamData(teamId: number): Promise<void> {
await Promise.all([
this.fetchAndCacheMetrics(teamId),
this.fetchAndCacheStats(teamId),
this.fetchAndCacheMembers(teamId)
]);
}
}
// Redis pub/sub for cache invalidation across instances
@Injectable()
export class CacheInvalidationService {
private subscriber: RedisClient;
private publisher: RedisClient;
constructor(private cache: CachingService) {
this.subscriber = new Redis(process.env.REDIS_URL);
this.publisher = new Redis(process.env.REDIS_URL);
this.setupSubscriptions();
}
private setupSubscriptions(): void {
this.subscriber.subscribe('cache:invalidate');
this.subscriber.on('message', async (channel, message) => {
if (channel === 'cache:invalidate') {
const { pattern } = JSON.parse(message);
await this.cache.invalidate(pattern);
console.log(`Invalidated cache pattern: ${pattern}`);
}
});
}
async broadcastInvalidation(pattern: string): Promise<void> {
// Notify all instances to invalidate cache
await this.publisher.publish(
'cache:invalidate',
JSON.stringify({ pattern, timestamp: Date.now() })
);
}
}
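The pub/sub fan-out can be simulated in-process, with Node's EventEmitter standing in for the Redis channel; this sketch uses exact keys instead of patterns to stay self-contained (the Instance class is illustrative):

```typescript
import { EventEmitter } from 'node:events';

// The Redis pub/sub fan-out above, simulated in-process: an EventEmitter
// stands in for the Redis channel, and each Instance owns a local cache.
// Exact keys are used instead of patterns to keep the sketch self-contained.
const bus = new EventEmitter();

class Instance {
  cache = new Map<string, string>();
  constructor() {
    // Every instance subscribes; a broadcast evicts the key everywhere
    bus.on('cache:invalidate', (key: string) => this.cache.delete(key));
  }
}

const a = new Instance();
const b = new Instance();
a.cache.set('team:123', 'cached');
b.cache.set('team:123', 'cached');

bus.emit('cache:invalidate', 'team:123'); // one instance broadcasts
console.log(a.cache.has('team:123'), b.cache.has('team:123')); // false false
```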
// Circuit breaker for database with Redis fallback
@Injectable()
export class ResilientDataService {
constructor(
private cache: CachingService,
private db: DatabaseRouter,
private circuitBreaker: CircuitBreakerService,
private redis: RedisClient // used by the stale-cache fallback below
) {}
async getTeamMetrics(teamId: number): Promise<any> {
const cacheKey = `metrics:team:${teamId}`;
return this.circuitBreaker.execute(
'database:team-metrics',
async () => {
// Try cache first
return await this.cache.getWithCache(
cacheKey,
async () => {
// Fetch from database
const result = await this.db.executeRead(
'SELECT * FROM team_metrics WHERE team_id = $1',
[teamId]
);
return result.rows;
},
this.cache.TTL.MEDIUM
);
},
async () => {
// Fallback: Return stale cache if available
const stale = await this.redis.get(cacheKey);
if (stale) {
console.warn('Using stale cache due to database circuit open');
return JSON.parse(stale);
}
throw new ServiceUnavailableException('No data available');
}
);
}
}
Results:
- Cache hit rate: 87%
- Database queries: Reduced by 87%
- Average response time: 200ms → 5ms (with cache hit)
Strategy #5: Connection Pooling & Management
Optimized Connection Pool Configuration
import { Pool } from 'pg';
// Primary database pool
const primaryPool = new Pool({
host: process.env.DB_PRIMARY_HOST,
port: 5432,
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
// Pool configuration
min: 10, // Minimum connections always open
max: 100, // Maximum connections
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000,
// Performance tuning
statement_timeout: 10000, // Kill queries after 10s
query_timeout: 10000,
keepAlive: true,
keepAliveInitialDelayMillis: 10000,
// SSL configuration for production
ssl: process.env.NODE_ENV === 'production' ? {
rejectUnauthorized: false
} : false
});
// Read replica pools
const createReadPool = (host: string) => new Pool({
...primaryPool.options,
host,
max: 100 // Can handle more reads
});
const readPools = [
createReadPool(process.env.DB_REPLICA_1),
createReadPool(process.env.DB_REPLICA_2),
createReadPool(process.env.DB_REPLICA_3)
];
// Pool monitoring
primaryPool.on('connect', (client) => {
console.log('New client connected to primary');
});
primaryPool.on('error', (err, client) => {
console.error('Unexpected error on idle client', err);
});
// Health monitoring
setInterval(async () => {
console.log('Primary pool stats:', {
total: primaryPool.totalCount,
idle: primaryPool.idleCount,
waiting: primaryPool.waitingCount
});
// Alert if pool is saturated
if (primaryPool.waitingCount > 20) {
console.error('⚠️ Connection pool saturated!');
// Send alert to monitoring service
}
}, 60000);
// Graceful shutdown
process.on('SIGTERM', async () => {
console.log('Closing database pools...');
await Promise.all([
primaryPool.end(),
...readPools.map(pool => pool.end())
]);
console.log('Database pools closed');
process.exit(0);
});
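The saturation alert inside the monitoring interval can be factored into a pure helper so the threshold logic is testable on its own (a sketch; the 20-waiting-clients threshold mirrors the check above):

```typescript
// The saturation alert from the monitoring interval, factored into a pure
// helper so the threshold logic can be tested on its own. The threshold of
// 20 waiting clients mirrors the check in the interval callback above.
interface PoolStats {
  total: number;
  idle: number;
  waiting: number;
}

function isPoolSaturated(stats: PoolStats, maxWaiting = 20): boolean {
  // Waiting clients mean every connection is busy and requests are queuing
  return stats.waiting > maxWaiting;
}

console.log(isPoolSaturated({ total: 100, idle: 45, waiting: 0 }));  // false
console.log(isPoolSaturated({ total: 100, idle: 0, waiting: 35 }));  // true
```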
Real-World Performance Metrics
Production Database Metrics (30 days)
Query Performance:
✅ Total Queries: 125.8M
✅ Avg Query Time: 15ms
✅ P95 Query Time: 48ms
✅ P99 Query Time: 125ms
✅ Slow Queries (>1s): 0.02%
✅ Failed Queries: 0.005%
Database Health:
✅ Primary CPU: 28% average
✅ Replica 1 CPU: 22% average
✅ Replica 2 CPU: 19% average
✅ Replica 3 CPU: 21% average
✅ Uptime: 100%
✅ Replication Lag: <100ms (same region)
Connection Pool:
✅ Total Connections: 100 (max)
✅ Idle Connections: 45 (average)
✅ Active Connections: 55 (average)
✅ Waiting Connections: 0 (excellent)
✅ Connection Errors: 0.001%
Cache Performance:
✅ Total Cache Requests: 144.9M
✅ Cache Hits: 126.1M (87%)
✅ Cache Misses: 18.8M (13%)
✅ Avg Cache Response: 5ms
✅ Redis Memory Usage: 4.2GB
✅ Evictions: 0 (perfect sizing)
Storage & Cost:
✅ Database Size: 125GB
✅ Index Size: 48GB
✅ Daily Growth: 850MB
✅ Monthly Cost: $920
✅ Cost per Query: $0.000012
Load Test Results
# pgbench load test
pgbench -c 100 -j 10 -T 600 -S postgres
# Results:
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 100
number of threads: 10
duration: 600 s
number of transactions: 15,420,850
tps: 25,701 (including connections establishing)
tps: 25,703 (excluding connections establishing)
Latency:
average: 3.89 ms
p50: 3.12 ms
p95: 8.45 ms
p99: 15.23 ms
Key Lessons Learned
What Made the Biggest Impact
- Redis Caching (45% improvement): 87% hit rate dramatically reduced database load
- Read Replicas (30% improvement): Distributed load, improved geographic performance
- Index Optimization (15% improvement): 99.7% faster queries
- Materialized Views (7% improvement): Complex aggregations became instant
- Connection Pooling (3% improvement): Eliminated connection overhead
What Didn't Work
❌ Over-normalization: Joins became too expensive, had to denormalize some tables
❌ Too many indexes: Index bloat slowed writes, had to be strategic
❌ Aggressive caching of everything: Led to stale data issues
❌ NoSQL migration: Didn't justify complexity for our relational data model
Power Your Analytics with Optimized Data Infrastructure
These database and caching strategies transformed our data layer from a bottleneck into a high-performance engine capable of serving 100K+ users with 15ms query times and 87% cache hit rates. But having a fast database means nothing without actionable insights.
Experience Database Performance at Scale
Ready to see what <20ms query times feel like?
OrgSignals leverages every database optimization strategy covered in this article to deliver:
- 15ms average query times across millions of data points
- 87% cache hit rate for instant insights
- Multi-region performance with <100ms global response times
- Real-time data sync without compromising performance
- 99.98% database uptime with automatic failover
Transform How You Understand Developer Productivity
Stop making decisions based on gut feel. Get data-driven insights instantly with OrgSignals:
✅ Lightning-fast analytics - Query millions of data points in milliseconds
✅ Real-time DORA metrics - Track deployment frequency, lead time, MTTR, change failure rate
✅ Historical trend analysis - Spot patterns across weeks, months, and quarters
✅ Team & individual insights - From executive overview to developer deep-dives
✅ Unified integration data - GitHub, GitLab, Jira, Slack, and more in one platform
✅ AI-powered recommendations - Identify bottlenecks and improvement opportunities automatically
✅ Beautiful, fast dashboards - Data visualization that loads instantly, every time
Start Your Free Trial →
No credit card required. Full platform access. Setup in under 5 minutes.
Learn More About Building High-Performance Systems
Read the complete series:
- Part 1: How I Built an Enterprise Angular App in 30 Days →
- Part 2: From Code to Production: Deployment Strategies →
- Part 3: Frontend Performance at Scale →
- Part 4: Backend & API Optimization →
- Part 5: You are here - Database & Caching Strategies at Scale
Have questions about scaling your database? Drop them in the comments - I love helping solve complex data challenges!
Found this valuable? Follow for more database optimization, caching strategies, and system architecture content.