Capacity Planning Toolkit
Stop guessing when to scale. This toolkit gives you battle-tested load forecasting models, resource utilization dashboards, and scaling decision frameworks used by SRE teams at scale. Whether you're planning for traffic spikes, rightsizing Kubernetes clusters, or building your first capacity review process, these templates turn capacity planning from art into engineering.
Key Features
- Linear regression forecaster — Python script that fits historical metric data and projects resource needs 7/30/90 days out
- Prometheus recording rules — Pre-built rules for CPU, memory, disk, and network utilization aggregation
- Scaling decision matrix — YAML-based framework that maps utilization thresholds to scaling actions (vertical, horizontal, architectural)
- Capacity review templates — Markdown templates for weekly and quarterly capacity reviews with stakeholder sections
- Grafana dashboard JSON — Import-ready panels for utilization trends, headroom tracking, and forecast overlays
- Saturation alerting rules — Prometheus alerts that fire at 70%, 85%, and 95% resource saturation with appropriate severities
- Cost-per-request calculator — Script to estimate infrastructure cost per API request across your service fleet
Quick Start
# Extract and explore
unzip capacity-planning-toolkit.zip
cd capacity-planning-toolkit/
# Run the forecaster against sample data
python3 src/capacity_planning_toolkit/core.py \
--input examples/sample_metrics.csv \
--forecast-days 30
# Copy Prometheus rules to your config
cp examples/prometheus_rules.yml /etc/prometheus/rules.d/capacity.yml
promtool check rules /etc/prometheus/rules.d/capacity.yml
Architecture / How It Works
┌─────────────────────────────────────────────────┐
│ Capacity Planning │
├──────────────┬────────────────┬─────────────────┤
│ MEASURE │ FORECAST │ DECIDE │
│ Prometheus │ Linear model + │ Decision matrix │
│ rules + │ seasonal │ maps thresholds │
│ Grafana │ decomposition │ to actions │
└──────────────┴────────────────┴─────────────────┘
- Measure — Recording rules aggregate raw metrics into utilization ratios.
- Forecast — Least-squares regression with optional seasonal adjustment projects future usage.
- Decide — The scaling decision matrix maps utilization levels to recommended actions.
Usage Examples
Prometheus Recording Rules
# Recording rules for capacity tracking
groups:
- name: capacity_planning
interval: 5m
rules:
- record: instance:cpu_utilization:ratio
expr: |
1 - avg by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
)
- record: instance:memory_utilization:ratio
expr: |
1 - (
node_memory_MemAvailable_bytes
/ node_memory_MemTotal_bytes
)
- record: instance:disk_utilization:ratio
expr: |
1 - (
node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"}
)
Load Forecasting Script
from capacity_planning_toolkit.core import LoadForecaster
forecaster = LoadForecaster()
# Load historical CPU utilization (timestamp, value pairs)
forecaster.load_csv("metrics/cpu_utilization_90d.csv")
# Generate 30-day forecast with confidence intervals
forecast = forecaster.predict(days=30, confidence=0.95)
print(f"Current utilization: {forecast.current:.1%}")
print(f"Projected (30d): {forecast.projected:.1%}")
print(f"Days to 85% threshold: {forecast.days_to_threshold(0.85)}")
print(f"Recommended action: {forecast.recommendation}")
Scaling Decision Matrix
# config.example.yaml — scaling decision framework
scaling_matrix:
cpu:
thresholds:
- level: nominal
range: [0.0, 0.70]
action: none
review: quarterly
- level: elevated
range: [0.70, 0.85]
action: plan_scale_out
review: weekly
alert: warning
- level: critical
range: [0.85, 0.95]
action: execute_scale_out
review: daily
alert: critical
- level: emergency
range: [0.95, 1.0]
action: immediate_intervention
review: continuous
alert: page
Configuration
# config.example.yaml — main configuration
forecaster:
model: linear # linear | polynomial | seasonal
lookback_days: 90 # Historical data window
forecast_days: 30 # Projection horizon
confidence_level: 0.95 # Confidence interval width
prometheus:
url: https://prometheus.example.com
query_timeout: 30s
alerting:
saturation_warning: 0.70 # Warning threshold
saturation_critical: 0.85 # Critical threshold
saturation_page: 0.95 # Page threshold
evaluation_interval: 5m
reporting:
schedule: weekly # weekly | biweekly | monthly
recipients:
- sre-team@example.com
- infra-leads@example.com
format: markdown # markdown | html | pdf
Best Practices
- Collect at least 90 days of data before running forecasts — shorter windows miss weekly and monthly patterns
- Set thresholds per service, not globally — a database at 70% CPU differs from a stateless web server
- Run capacity reviews on a fixed cadence — weekly for fast-growing services, monthly for stable
- Include cost data alongside utilization — scaling down sometimes saves more than scaling up prevents
- Version your decision matrices in Git — trace thresholds active during incidents
Troubleshooting
Forecast shows negative utilization
Input data likely has gaps. Run forecaster.validate_data() and use --interpolate to fill missing windows.
Prometheus recording rules not appearing
Verify the rules file path matches your rule_files glob in prometheus.yml. Run promtool check rules <file> to validate.
Grafana dashboard shows "No Data"
Recording rules need time to populate. Wait at least 2x the evaluation_interval. Check datasource UID matches.
MIT License — see LICENSE file.
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete [Capacity Planning Toolkit] with all files, templates, and documentation for $39.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.
Top comments (0)