丁久

Posted on May 21 • Originally published at dingjiu1989-hue.github.io

Database Capacity Planning: Sizing, Growth Forecasting, and Scaling

#database #backend #data

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Capacity planning ensures your database has enough resources to handle current and future workloads without over-provisioning. It is a continuous process that combines monitoring, forecasting, and proactive scaling.

Key Capacity Metrics

Storage

Storage is the most predictable resource to plan. Track total database size, the largest tables and their indexes, and daily row growth:

-- Database sizes
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Table sizes (top 10)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- Growth by day
SELECT date(created_at) AS day,
       count(*) AS rows_added,
       count(*) * 200 AS estimated_bytes  -- rough estimate
FROM orders
WHERE created_at > now() - interval '30 days'
GROUP BY day
ORDER BY day;
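
The 200-byte multiplier in the growth query is only a placeholder. As a minimal sketch, assuming you already know a table's heap size in bytes and its row count, a per-row figure can be derived and applied to the daily counts. The function names and the sample numbers below are hypothetical and not part of the original article.

```python
def avg_row_bytes(table_bytes: int, row_count: int) -> float:
    """Average on-disk bytes per row (includes page and tuple overhead)."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return table_bytes / row_count


def daily_bytes(rows_per_day: list[int], bytes_per_row: float) -> list[int]:
    """Convert daily row counts into estimated bytes added per day."""
    return [round(n * bytes_per_row) for n in rows_per_day]


# Hypothetical numbers: a 2 GiB table holding 8 million rows.
per_row = avg_row_bytes(2 * 1024**3, 8_000_000)
estimates = daily_bytes([10_000, 12_500, 11_000], per_row)
print(f"{per_row:.1f} bytes/row, daily estimates: {estimates}")
```

Deriving the multiplier from measured sizes keeps the estimate tied to the real table instead of a guess, though it still ignores index growth.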

Compute (CPU)

CPU usage correlates with query complexity and concurrency:

-- Queries with the highest total execution time (wall-clock time, a proxy for CPU load)
-- Requires the pg_stat_statements extension. total_exec_time and mean_exec_time
-- are the PostgreSQL 13+ column names (total_time and mean_time in older versions).
SELECT queryid, query,
       total_exec_time,
       calls,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Monitor: CPU utilization %, replication CPU usage, autovacuum CPU usage.
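
Raw totals are easier to act on when expressed as a share of all execution time. The following is a rough sketch, assuming the rows from pg_stat_statements have already been fetched as (queryid, total_exec_time in ms, calls) tuples. The function name and the sample data are hypothetical.

```python
def top_by_time_share(stats, limit=5):
    """Rank queries by their share of total execution time.

    stats: iterable of (queryid, total_exec_time_ms, calls) tuples.
    Returns a list of (queryid, share_of_total, mean_ms_per_call).
    """
    stats = list(stats)
    grand_total = sum(total for _, total, _ in stats)
    if grand_total == 0:
        return []
    ranked = sorted(stats, key=lambda s: s[1], reverse=True)[:limit]
    return [
        (qid, total / grand_total, total / calls if calls else 0.0)
        for qid, total, calls in ranked
    ]


# Hypothetical sample data, not real measurements.
sample = [(101, 6000.0, 300), (102, 3000.0, 10), (103, 1000.0, 1000)]
for qid, share, mean_ms in top_by_time_share(sample):
    print(f"query {qid}: {share:.0%} of total time, {mean_ms:.1f} ms/call")
```

A query with a large share but a small per-call time is usually a concurrency or call-volume problem, while a large per-call time points at query complexity.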

Memory

-- Memory settings (raw value plus a human-readable size)
SELECT name, setting, unit,
       pg_size_pretty(pg_size_bytes(current_setting(name))) AS pretty_size
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'work_mem', 'maintenance_work_mem');

-- Buffer cache hit ratio (should be >99%)
SELECT 'shared_buffers' AS area,
       sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0) AS hit_ratio
FROM pg_stat_database;
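
The hit-ratio arithmetic can also be checked on the application side. This is a small sketch, assuming blks_hit and blks_read have already been read from pg_stat_database. The 0.99 target mirrors the comment in the query above, and the function names are hypothetical.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Fraction of block requests served from shared buffers.

    Returns None when there has been no block activity yet.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


def needs_attention(ratio: Optional[float], target: float = 0.99) -> bool:
    """True when a measured ratio falls below the target."""
    return ratio is not None and ratio < target


# Hypothetical counters, not real measurements.
ratio = cache_hit_ratio(985_000, 15_000)
print(f"hit ratio: {ratio:.3f}, below target: {needs_attention(ratio)}")
```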

Connections

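-- Connection utilization as a percentage of max_connections
SELECT max_conn.setting::int AS max_connections,
       used_conn.count AS used_connections,
       round(100.0 * used_conn.count / max_conn.setting::int, 1) AS utilization_pct
FROM (SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn,
     (SELECT count(*) AS count FROM pg_stat_activity
      WHERE backend_type = 'client backend') used_conn;  -- PostgreSQL 10+: skip background processes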
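
Once the utilization figure is available, a simple threshold check can drive an alert. This is a minimal sketch. The 80% warning level is an assumption chosen for illustration, not a value from the original article, and the function names are hypothetical.

```python
def connection_utilization_pct(used: int, max_connections: int) -> float:
    """Used connections as a percentage of max_connections."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * used / max_connections


def connection_status(used: int, max_connections: int, warn_pct: float = 80.0) -> str:
    """Return 'warn' at or above warn_pct, otherwise 'ok'."""
    pct = connection_utilization_pct(used, max_connections)
    return "warn" if pct >= warn_pct else "ok"


# Hypothetical readings.
print(connection_utilization_pct(45, 100), connection_status(45, 100))
print(connection_utilization_pct(85, 100), connection_status(85, 100))
```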

Growth Forecasting

Simple Linear Model

import psycopg2
from datetime import datetime, timedelta
import numpy as np

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Get daily row counts for last 90 days
cur.execute("""
    SELECT date(created_at) AS day, count(*) AS rows
    FROM orders
    WHERE created_at > now() - interval '90 days'
    GROUP BY day
    ORDER BY day
""")
data = cur.fetchall()

days = np.array([(row[0] - data[0][0]).days for row in data])
rows = np.array([row[1] for row in data])

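# Linear regression on daily inserts: rows_per_day ~ slope * day + intercept
slope, intercept = np.polyfit(days, rows, 1)

# Fitted insert rate for the most recent day (rows per day)
daily_growth = slope * days[-1] + intercept

# Forecast: 90 days out, summing the fitted daily rate over the horizon
forecast_days = 90
future_days = days[-1] + np.arange(1, forecast_days + 1)
forecast_new_rows = np.polyval([slope, intercept], future_days).clip(min=0).sum()
rows_last_90_days = rows.sum()

print(f"Daily growth: {daily_growth:.0f} rows/day (trend: {slope:+.1f} rows/day per day)")
print(f"Rows added in the last 90 days: {rows_last_90_days}")
print(f"Forecast rows added in the next 90 days: {forecast_new_rows:.0f}")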
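
To try the forecasting logic without a database, the same idea can be run on synthetic data. The sketch below uses a closed-form least-squares fit in plain Python rather than np.polyfit so that it has no dependencies. The function names and the synthetic history are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_new_rows(daily_rows, horizon_days):
    """Sum the fitted daily insert rate over the next horizon_days days."""
    slope, intercept = fit_line(daily_rows)
    start = len(daily_rows)
    return sum(
        max(0.0, slope * x + intercept)  # a negative rate is treated as zero
        for x in range(start, start + horizon_days)
    )


# Synthetic history: 1000 rows/day, growing by 10 rows/day each day.
history = [1000 + 10 * d for d in range(90)]
slope, intercept = fit_line(history)
print(f"trend: {slope:.1f} rows/day per day")
print(f"forecast rows in next 90 days: {forecast_new_rows(history, 90):.0f}")
```

Summing the fitted rate over the horizon matters when inserts are accelerating: multiplying a single daily figure by 90 would understate the total.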

Projecting Storage

avg_row_size_bytes = 250  # From pgstattuple
bytes_per_day = daily_growth * avg_row_size_bytes
gb_per_month = bytes_per_day * 30 / (1024**3)

current_gb = 10  # Current database size
months_to_full = (50 - current_gb) / gb_per_month  # Assuming 50 GB limit

print(f"Growth: {gb_per_month:.1f} GB/month")
print(f"Time to 50 GB: {months_to_full:.0f} months")
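
The projection above divides by the monthly growth, which fails when growth is zero or negative. A hedged sketch of the same arithmetic with those cases handled follows. The function names and the 50,000 rows/day input are hypothetical, and the estimate covers table rows only, not indexes or bloat.

```python
import math


def gb_per_month(rows_per_day: float, avg_row_bytes: float, days_per_month: int = 30) -> float:
    """Estimated storage growth in GB (GiB, 1024**3 bytes) per month."""
    return rows_per_day * avg_row_bytes * days_per_month / 1024**3


def months_until_limit(current_gb: float, limit_gb: float, growth_gb_per_month: float) -> float:
    """Months until limit_gb is reached at a constant growth rate."""
    if current_gb >= limit_gb:
        return 0.0  # already at or over the limit
    if growth_gb_per_month <= 0:
        return math.inf  # not growing, so the limit is never reached
    return (limit_gb - current_gb) / growth_gb_per_month


# Hypothetical input: 50,000 rows/day at 250 bytes per row.
growth = gb_per_month(50_000, 250)
months = months_until_limit(10, 50, growth)
print(f"Growth: {growth:.2f} GB/month, time to 50 GB: {months:.0f} months")
```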
