Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Capacity Planning Toolkit

Stop guessing when to scale. This toolkit gives you battle-tested load forecasting models, resource utilization dashboards, and scaling decision frameworks used by SRE teams at scale. Whether you're planning for traffic spikes, rightsizing Kubernetes clusters, or building your first capacity review process, these templates turn capacity planning from art into engineering.

Key Features

  • Linear regression forecaster — Python script that fits historical metric data and projects resource needs 7/30/90 days out
  • Prometheus recording rules — Pre-built rules for CPU, memory, disk, and network utilization aggregation
  • Scaling decision matrix — YAML-based framework that maps utilization thresholds to scaling actions (vertical, horizontal, architectural)
  • Capacity review templates — Markdown templates for weekly and quarterly capacity reviews with stakeholder sections
  • Grafana dashboard JSON — Import-ready panels for utilization trends, headroom tracking, and forecast overlays
  • Saturation alerting rules — Prometheus alerts that fire at 70%, 85%, and 95% resource saturation with appropriate severities
  • Cost-per-request calculator — Script to estimate infrastructure cost per API request across your service fleet
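
To make the cost-per-request idea concrete, here is a minimal sketch of the arithmetic involved. It is not the toolkit's script: the function names, the service names, and the dollar and request figures are all hypothetical, and it assumes you already know each service's monthly infrastructure cost and request volume.

```python
from typing import Dict, Tuple


def cost_per_request(monthly_cost: float, requests_per_month: int) -> float:
    """Infrastructure cost divided by request volume for one service."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    return monthly_cost / requests_per_month


def fleet_cost_per_request(
    services: Dict[str, Tuple[float, int]]
) -> Dict[str, float]:
    """Map each service name to its cost per request.

    `services` maps a name to (monthly_cost, requests_per_month).
    """
    return {
        name: cost_per_request(cost, requests)
        for name, (cost, requests) in services.items()
    }


# Hypothetical fleet: values are made up for illustration only.
fleet = {
    "checkout-api": (1200.0, 60_000_000),
    "search-api": (4500.0, 900_000_000),
}
for name, cost in fleet_cost_per_request(fleet).items():
    print(f"{name}: ${cost:.6f} per request")
```

Comparing services on this number rather than on raw spend makes it easier to spot a low-traffic service that is expensive for what it serves.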

Quick Start

# Extract and explore
unzip capacity-planning-toolkit.zip
cd capacity-planning-toolkit/

# Run the forecaster against sample data
python3 src/capacity_planning_toolkit/core.py \
  --input examples/sample_metrics.csv \
  --forecast-days 30

# Copy Prometheus rules to your config
cp examples/prometheus_rules.yml /etc/prometheus/rules.d/capacity.yml
promtool check rules /etc/prometheus/rules.d/capacity.yml

Architecture / How It Works

┌─────────────────────────────────────────────────┐
│                Capacity Planning                │
├──────────────┬────────────────┬─────────────────┤
│   MEASURE    │    FORECAST    │     DECIDE      │
│  Prometheus  │ Linear model + │ Decision matrix │
│  rules +     │ seasonal       │ maps thresholds │
│  Grafana     │ decomposition  │ to actions      │
└──────────────┴────────────────┴─────────────────┘
  1. Measure — Recording rules aggregate raw metrics into utilization ratios.
  2. Forecast — Least-squares regression with optional seasonal adjustment projects future usage.
  3. Decide — The scaling decision matrix maps utilization levels to recommended actions.
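
The Forecast step can be sketched in a few lines of plain Python. This is an illustration of ordinary least-squares on daily samples, not the toolkit's LoadForecaster: the helper names (fit_line, project, days_to_threshold) are hypothetical, the data is synthetic, and seasonal adjustment and confidence intervals are left out.

```python
from typing import Optional, Sequence, Tuple


def fit_line(days: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of value = slope * day + intercept."""
    n = len(days)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (day, value) pairs of equal length")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share the same day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(slope: float, intercept: float, day: float) -> float:
    """Value of the fitted line on the given day."""
    return slope * day + intercept


def days_to_threshold(
    slope: float, intercept: float, last_day: float, threshold: float
) -> Optional[float]:
    """Days after last_day until the fitted line reaches threshold.

    Returns 0.0 if the line is already at or above the threshold, and
    None if the trend is flat or falling and never reaches it.
    """
    if project(slope, intercept, last_day) >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_day


# Synthetic history: utilization grows one percentage point per day.
history = [0.50 + 0.01 * d for d in range(10)]
slope, intercept = fit_line(list(range(10)), history)
print(f"Projected (30d): {project(slope, intercept, 9 + 30):.1%}")
print(f"Days to 85%: {days_to_threshold(slope, intercept, 9, 0.85):.0f}")
```

A straight line is the simplest model that answers "when do we hit the threshold?", which is why it makes a reasonable default before reaching for polynomial or seasonal variants.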

Usage Examples

Prometheus Recording Rules

# Recording rules for capacity tracking
groups:
  - name: capacity_planning
    interval: 5m
    rules:
      - record: instance:cpu_utilization:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
      - record: instance:memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )
      - record: instance:disk_utilization:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}
          )
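
If the CPU rule looks opaque, the sketch below reproduces its arithmetic in Python: per-core idle rate over a window, averaged across cores, then subtracted from 1. The function name and the counter values are hypothetical, and the sketch skips what Prometheus's rate() does for you, such as handling counter resets and extrapolating to the window edges.

```python
from typing import Dict


def cpu_utilization(
    idle_start: Dict[str, float],
    idle_end: Dict[str, float],
    window_seconds: float,
) -> float:
    """Approximate 1 - avg(rate(idle seconds)) for one instance.

    idle_start / idle_end map a core label to its cumulative idle-seconds
    counter at the start and end of the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not idle_start or idle_start.keys() != idle_end.keys():
        raise ValueError("need the same non-empty set of cores in both samples")
    # Fraction of the window each core spent idle.
    idle_fractions = [
        (idle_end[core] - idle_start[core]) / window_seconds
        for core in idle_start
    ]
    return 1.0 - sum(idle_fractions) / len(idle_fractions)


# Made-up counters for a two-core host over a 5-minute (300 s) window.
start = {"0": 10_000.0, "1": 12_000.0}
end = {"0": 10_240.0, "1": 12_180.0}
print(f"CPU utilization: {cpu_utilization(start, end, 300):.1%}")
```

Averaging across cores is what collapses the per-core series into the single per-instance ratio that the forecaster and alerts consume.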

Load Forecasting Script

from capacity_planning_toolkit.core import LoadForecaster

forecaster = LoadForecaster()

# Load historical CPU utilization (timestamp, value pairs)
forecaster.load_csv("metrics/cpu_utilization_90d.csv")

# Generate 30-day forecast with confidence intervals
forecast = forecaster.predict(days=30, confidence=0.95)

print(f"Current utilization: {forecast.current:.1%}")
print(f"Projected (30d):     {forecast.projected:.1%}")
print(f"Days to 85% threshold: {forecast.days_to_threshold(0.85)}")
print(f"Recommended action:  {forecast.recommendation}")

Scaling Decision Matrix

# config.example.yaml — scaling decision framework
scaling_matrix:
  cpu:
    thresholds:
      - level: nominal
        range: [0.0, 0.70]
        action: none
        review: quarterly
      - level: elevated
        range: [0.70, 0.85]
        action: plan_scale_out
        review: weekly
        alert: warning
      - level: critical
        range: [0.85, 0.95]
        action: execute_scale_out
        review: daily
        alert: critical
      - level: emergency
        range: [0.95, 1.0]
        action: immediate_intervention
        review: continuous
        alert: page
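
Here is a rough sketch of how a matrix like this could be evaluated in code. The Level type and classify function are hypothetical names, not part of the toolkit. Because adjacent ranges share a boundary value (0.70, 0.85, 0.95), the sketch assumes each range includes its lower bound and excludes its upper bound, with 1.0 falling into the last level.

```python
from typing import List, NamedTuple


class Level(NamedTuple):
    name: str
    low: float
    high: float
    action: str


# Mirrors the cpu thresholds from the YAML above.
CPU_LEVELS: List[Level] = [
    Level("nominal", 0.0, 0.70, "none"),
    Level("elevated", 0.70, 0.85, "plan_scale_out"),
    Level("critical", 0.85, 0.95, "execute_scale_out"),
    Level("emergency", 0.95, 1.0, "immediate_intervention"),
]


def classify(utilization: float, levels: List[Level] = CPU_LEVELS) -> Level:
    """Return the level whose [low, high) range contains utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a ratio between 0 and 1")
    for level in levels:
        if level.low <= utilization < level.high:
            return level
    # Only reached when utilization equals the top of the last range.
    return levels[-1]


for sample in (0.42, 0.70, 0.91, 1.0):
    level = classify(sample)
    print(f"{sample:.0%} -> {level.name}: {level.action}")
```

Treating the boundary as belonging to the higher level errs on the side of acting earlier, which is usually the safer choice for capacity decisions.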

Configuration

# config.example.yaml — main configuration
forecaster:
  model: linear           # linear | polynomial | seasonal
  lookback_days: 90        # Historical data window
  forecast_days: 30        # Projection horizon
  confidence_level: 0.95   # Confidence level for forecast intervals (0.95 = 95%)

prometheus:
  url: https://prometheus.example.com
  query_timeout: 30s

alerting:
  saturation_warning: 0.70   # Warning threshold
  saturation_critical: 0.85  # Critical threshold
  saturation_page: 0.95      # Page threshold
  evaluation_interval: 5m

reporting:
  schedule: weekly           # weekly | biweekly | monthly
  recipients:
    - sre-team@example.com
    - infra-leads@example.com
  format: markdown           # markdown | html | pdf

Best Practices

  • Collect at least 90 days of data before running forecasts: shorter windows contain too few weekly and monthly cycles to tell seasonality from trend
  • Set thresholds per service, not globally: 70% CPU on a database means something different from 70% on a stateless web server, which is usually easier to scale out
  • Run capacity reviews on a fixed cadence: weekly for fast-growing services, monthly for stable ones
  • Include cost data alongside utilization: sometimes the savings from scaling down outweigh the risk that extra capacity would cover
  • Version your decision matrices in Git so you can trace which thresholds were active during an incident

Troubleshooting

Forecast shows negative utilization
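Input data likely has gaps, or a declining trend has been extrapolated past zero, which a linear fit allows because it is unbounded. Run forecaster.validate_data() and use --interpolate to fill missing windows.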
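
For a sense of what gap filling involves, here is a minimal sketch, assuming evenly spaced samples with missing points recorded as None. It does not show how the toolkit's --interpolate flag is implemented. Both helper names are hypothetical, and clamp_ratio is an extra guard for keeping projected ratios inside the valid 0 to 1 range.

```python
from typing import List, Optional


def interpolate_gaps(values: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbors.

    Gaps at the start or end copy the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known samples to interpolate from")
    filled = list(values)
    # Leading and trailing gaps: repeat the nearest known sample.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between the surrounding known samples.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def clamp_ratio(value: float) -> float:
    """Keep a utilization ratio inside [0, 1]."""
    return max(0.0, min(1.0, value))


series = [0.2, None, None, 0.5, None]
print(interpolate_gaps(series))
print(clamp_ratio(-0.05))
```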

Prometheus recording rules not appearing
Verify the rules file path matches your rule_files glob in prometheus.yml. Run promtool check rules <file> to validate, and reload Prometheus after adding or changing rule files.

Grafana dashboard shows "No Data"
Recording rules need time to populate. Wait at least twice the evaluation_interval, then check that the dashboard's datasource UID matches the Prometheus datasource in your Grafana instance.

MIT License — see LICENSE file.


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Capacity Planning Toolkit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →

