DatanestDigital

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

Capacity Planning Toolkit

#sre #devops #monitoring #kubernetes

Stop guessing when to scale. This toolkit gives you battle-tested load forecasting models, resource utilization dashboards, and scaling decision frameworks used by SRE teams at scale. Whether you're planning for traffic spikes, rightsizing Kubernetes clusters, or building your first capacity review process, these templates turn capacity planning from art into engineering.

Key Features

Linear regression forecaster — Python script that fits historical metric data and projects resource needs 7/30/90 days out
Prometheus recording rules — Pre-built rules for CPU, memory, disk, and network utilization aggregation
Scaling decision matrix — YAML-based framework that maps utilization thresholds to scaling actions (vertical, horizontal, architectural)
Capacity review templates — Markdown templates for weekly and quarterly capacity reviews with stakeholder sections
Grafana dashboard JSON — Import-ready panels for utilization trends, headroom tracking, and forecast overlays
Saturation alerting rules — Prometheus alerts that fire at 70%, 85%, and 95% resource saturation with appropriate severities
Cost-per-request calculator — Script to estimate infrastructure cost per API request across your service fleet
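
As a rough illustration of the cost-per-request idea in the last feature, the arithmetic is simply infrastructure spend divided by request volume over the same period. The sketch below is not the toolkit's script; the function name `cost_per_request`, the service names, and all figures are hypothetical.

```python
def cost_per_request(monthly_cost_usd: float, monthly_requests: int) -> float:
    """Return infrastructure cost per request in USD for one service."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return monthly_cost_usd / monthly_requests


# Hypothetical fleet: (monthly cost in USD, monthly request count) per service.
fleet = {
    "checkout-api": (4200.0, 120_000_000),
    "search-api": (9800.0, 700_000_000),
}

for name, (cost, requests) in fleet.items():
    # Report per million requests so the numbers are easier to compare.
    per_million = cost_per_request(cost, requests) * 1_000_000
    print(f"{name}: ${per_million:.2f} per million requests")
```

Comparing services per million requests rather than by total spend makes it easier to spot which ones are expensive relative to the traffic they serve.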

Quick Start

# Extract and explore
unzip capacity-planning-toolkit.zip
cd capacity-planning-toolkit/

# Run the forecaster against sample data
python3 src/capacity_planning_toolkit/core.py \
  --input examples/sample_metrics.csv \
  --forecast-days 30

# Copy Prometheus rules to your config
cp examples/prometheus_rules.yml /etc/prometheus/rules.d/capacity.yml
promtool check rules /etc/prometheus/rules.d/capacity.yml

Architecture / How It Works

┌─────────────────────────────────────────────────┐
│                Capacity Planning                │
├──────────────┬────────────────┬─────────────────┤
│   MEASURE    │    FORECAST    │     DECIDE      │
│  Prometheus  │ Linear model + │ Decision matrix │
│  rules +     │ seasonal       │ maps thresholds │
│  Grafana     │ decomposition  │ to actions      │
└──────────────┴────────────────┴─────────────────┘

Measure — Recording rules aggregate raw metrics into utilization ratios.
Forecast — Least-squares regression with optional seasonal adjustment projects future usage.
Decide — The scaling decision matrix maps utilization levels to recommended actions.
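
To make the Forecast step concrete, here is a minimal least-squares sketch using only the standard library. It is an illustration of the technique, not the toolkit's `core.py`; the names `fit_line` and `project` and the sample data are assumptions, and seasonal adjustment is left out.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(slope: float, intercept: float, day: float) -> float:
    """Evaluate the fitted line at `day`, clamped to a valid ratio in [0, 1]."""
    return min(1.0, max(0.0, slope * day + intercept))


# Hypothetical daily CPU utilization ratios for one week (day index, value).
days = [0, 1, 2, 3, 4, 5, 6]
cpu = [0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56]

slope, intercept = fit_line(days, cpu)
print(f"growth per day: {slope:.4f}")
print(f"projected utilization 30 days after the last sample: "
      f"{project(slope, intercept, days[-1] + 30):.1%}")
```

Clamping the projection keeps a straight-line extrapolation from reporting ratios below 0 or above 1, which a plain linear model will otherwise do if the horizon is long enough.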

Usage Examples

Prometheus Recording Rules

# Recording rules for capacity tracking
groups:
  - name: capacity_planning
    interval: 5m
    rules:
      - record: instance:cpu_utilization:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
      - record: instance:memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )
      - record: instance:disk_utilization:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}
          )
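
The arithmetic behind the CPU and memory rules above can be checked by hand. The following sketch mirrors the two expressions in plain Python; the function names and the sample host are hypothetical, and it does not query Prometheus.

```python
from typing import Sequence


def cpu_utilization(idle_rates: Sequence[float]) -> float:
    """1 - average idle fraction across cores, as in the CPU recording rule."""
    if not idle_rates:
        raise ValueError("need at least one per-core idle rate")
    return 1.0 - sum(idle_rates) / len(idle_rates)


def memory_utilization(available_bytes: float, total_bytes: float) -> float:
    """1 - available / total, as in the memory recording rule."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - available_bytes / total_bytes


# Hypothetical 4-core host. Each value is the fraction of each second one core
# spent idle over the window, which is what rate(...{mode="idle"}[5m]) yields
# per core before the average is taken.
idle = [0.90, 0.70, 0.80, 0.60]

print(f"CPU utilization:    {cpu_utilization(idle):.1%}")
print(f"Memory utilization: {memory_utilization(4 * 2**30, 16 * 2**30):.1%}")
```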

Load Forecasting Script

from capacity_planning_toolkit.core import LoadForecaster

forecaster = LoadForecaster()

# Load historical CPU utilization (timestamp, value pairs)
forecaster.load_csv("metrics/cpu_utilization_90d.csv")

# Generate 30-day forecast with confidence intervals
forecast = forecaster.predict(days=30, confidence=0.95)

print(f"Current utilization: {forecast.current:.1%}")
print(f"Projected (30d):     {forecast.projected:.1%}")
print(f"Days to 85% threshold: {forecast.days_to_threshold(0.85)}")
print(f"Recommended action:  {forecast.recommendation}")
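
For intuition about what a "days to threshold" figure means under a linear model, the crossing point can be computed directly from the fitted slope and intercept. This is a sketch under that assumption, not the implementation behind `forecast.days_to_threshold`; the standalone function and its parameters are hypothetical.

```python
import math
from typing import Optional


def days_to_threshold(
    slope: float, intercept: float, today: float, threshold: float
) -> Optional[int]:
    """Whole days until the line slope * day + intercept reaches `threshold`.

    Returns 0 if the threshold is already met and None if the trend is flat
    or falling, in which case the line never reaches the threshold.
    """
    current = slope * today + intercept
    if current >= threshold:
        return 0
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    # The small epsilon avoids rounding up a whole day on float noise.
    return math.ceil(crossing_day - today - 1e-9)


# Hypothetical fit: 50% at day 0, growing one percentage point per day.
print(days_to_threshold(slope=0.01, intercept=0.50, today=10, threshold=0.85))
print(days_to_threshold(slope=-0.01, intercept=0.50, today=10, threshold=0.85))
```

Returning `None` for a flat or falling trend keeps the caller from treating "never" as a large number of days.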

Scaling Decision Matrix

# config.example.yaml — scaling decision framework
scaling_matrix:
  cpu:
    thresholds:
      - level: nominal
        range: [0.0, 0.70]
        action: none
        review: quarterly
      - level: elevated
        range: [0.70, 0.85]
        action: plan_scale_out
        review: weekly
        alert: warning
      - level: critical
        range: [0.85, 0.95]
        action: execute_scale_out
        review: daily
        alert: critical
      - level: emergency
        range: [0.95, 1.0]
        action: immediate_intervention
        review: continuous
        alert: page
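
One way to evaluate a matrix like the one above is a simple range lookup. The sketch below hard-codes the CPU thresholds rather than parsing the YAML, and it assumes each range is half-open (`low <= u < high`) so that a boundary value such as 0.70 falls into exactly one level; the source does not say how the toolkit resolves boundaries. The name `classify` is hypothetical.

```python
from typing import List, Tuple

# (level, low, high, action), copied from the cpu thresholds in the matrix.
CPU_THRESHOLDS: List[Tuple[str, float, float, str]] = [
    ("nominal", 0.0, 0.70, "none"),
    ("elevated", 0.70, 0.85, "plan_scale_out"),
    ("critical", 0.85, 0.95, "execute_scale_out"),
    ("emergency", 0.95, 1.0, "immediate_intervention"),
]


def classify(
    utilization: float,
    thresholds: List[Tuple[str, float, float, str]] = CPU_THRESHOLDS,
) -> Tuple[str, str]:
    """Return (level, action) for a utilization ratio in [0, 1]."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a ratio between 0 and 1")
    for level, low, high, action in thresholds:
        if low <= utilization < high:
            return level, action
    # Only utilization == 1.0 reaches here; treat the top range as closed.
    level, _, _, action = thresholds[-1]
    return level, action


for sample in (0.42, 0.70, 0.91, 1.0):
    level, action = classify(sample)
    print(f"{sample:.0%}: {level} -> {action}")
```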

Configuration

# config.example.yaml — main configuration
forecaster:
  model: linear           # linear | polynomial | seasonal
  lookback_days: 90        # Historical data window
  forecast_days: 30        # Projection horizon
  confidence_level: 0.95   # Confidence level for the forecast interval

prometheus:
  url: https://prometheus.example.com
  query_timeout: 30s

alerting:
  saturation_warning: 0.70   # Warning threshold
  saturation_critical: 0.85  # Critical threshold
  saturation_page: 0.95      # Page threshold
  evaluation_interval: 5m

reporting:
  schedule: weekly           # weekly | biweekly | monthly
  recipients:
    - sre-team@example.com
    - infra-leads@example.com
  format: markdown           # markdown | html | pdf
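
Because the three saturation thresholds only make sense in ascending order, a small sanity check before deploying a changed config can catch typos. This is a hypothetical helper, not part of the toolkit; it takes the `alerting` section as an already-parsed dictionary so it needs no YAML library.

```python
from typing import Any, Dict, List

THRESHOLD_KEYS = ["saturation_warning", "saturation_critical", "saturation_page"]


def validate_alerting(alerting: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the section is OK."""
    problems: List[str] = []
    values: List[float] = []
    for key in THRESHOLD_KEYS:
        value = alerting.get(key)
        if not isinstance(value, (int, float)) or not 0.0 < value <= 1.0:
            problems.append(f"{key} must be a number in (0, 1], got {value!r}")
        else:
            values.append(float(value))
    # Only compare ordering when all three thresholds are individually valid.
    if len(values) == len(THRESHOLD_KEYS) and not values[0] < values[1] < values[2]:
        problems.append("thresholds must satisfy warning < critical < page")
    return problems


# Values copied from the alerting section above.
alerting = {
    "saturation_warning": 0.70,
    "saturation_critical": 0.85,
    "saturation_page": 0.95,
    "evaluation_interval": "5m",
}
print(validate_alerting(alerting) or "alerting section looks consistent")
```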

Best Practices

Collect at least 90 days of data before running forecasts; shorter windows cover too few weekly and monthly cycles to separate trend from seasonality.
Set thresholds per service, not globally; a database at 70% CPU carries different risk than a stateless web server at 70%.
Run capacity reviews on a fixed cadence: weekly for fast-growing services, monthly for stable ones.
Include cost data alongside utilization; the savings from scaling down sometimes exceed the losses that scaling up would prevent.
Version your decision matrices in Git so you can trace which thresholds were active during an incident.

Troubleshooting

Forecast shows negative utilization
Input data likely has gaps. Run forecaster.validate_data() and use --interpolate to fill missing windows.
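
To show what filling missing windows can look like, here is a minimal linear-interpolation sketch. It is an assumption about the general technique, not the behaviour of the toolkit's `--interpolate` flag; the function name and sample series are hypothetical.

```python
from typing import List, Optional


def interpolate_gaps(values: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at the start or end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known samples to interpolate from")
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled.append(values[nxt])
        elif nxt is None:
            filled.append(values[prev])
        else:
            # Weight by position between the two surrounding known samples.
            weight = (i - prev) / (nxt - prev)
            filled.append(values[prev] + weight * (values[nxt] - values[prev]))
    return filled


# Hypothetical utilization series with two missing samples.
series = [0.40, None, None, 0.70, 0.72]
print([round(v, 2) for v in interpolate_gaps(series)])
```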

Prometheus recording rules not appearing
Verify the rules file path matches your rule_files glob in prometheus.yml. Run promtool check rules <file> to validate.

Grafana dashboard shows "No Data"
Recording rules need time to populate. Wait at least 2x the evaluation_interval, then check that the datasource UID in the dashboard JSON matches the Prometheus datasource in your Grafana instance.

MIT License — see LICENSE file.

This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Capacity Planning Toolkit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →
