- Translating Load Tests into Concrete Instance Counts
- Designing Autoscaling Policies That Match Real Traffic Patterns
- Right-sizing Instances to Trim Cost Without Sacrificing Performance
- Operational Monitoring, Forecasting and Continuous Re-Evaluation
- Practical Capacity Planning Checklist
- Sources
Capacity planning is the engineering step that converts a load test into the fleet you run, the autoscaling you trust, and the cloud bill you accept. Get the conversion wrong and you either overspend for unused capacity or miss SLOs when traffic spikes.
The symptoms you live with are predictable: load tests that look fine but mispredict production, autoscalers that chase the wrong metric, p95 latency that balloons under real traffic, and a cloud bill that drifts upward month after month. That friction shows up as post-release incidents, expensive reserved commitments made against bad assumptions, and repeated firefights when marketing or external events drive unexpected peaks.
Translating Load Tests into Concrete Instance Counts
The core of mapping test results to capacity is a simple resource-by-resource capacity model: measure, normalize to a per-instance rate, scale to target traffic, then add operating headroom. Follow the math faithfully and the rest—the autoscaler, the budget—becomes engineering instead of guesswork.
Practical step-by-step conversion (CPU-based example)
- Capture the canonical test snapshot:
  - `R_test` = total throughput in the steady phase (requests/sec).
  - `N_test` = number of identical instances running during that steady phase.
  - `CPU_test` = observed average per-instance CPU utilization as a percent (e.g., 50 for 50%).
- Decide your operational target utilization `U_target` (a fraction). Many SRE teams provision components with about 50% CPU headroom at peak as a safety margin for unexpected bursts. Use this as a guideline, not a law.
- Specify `R_prod_peak` = expected production peak throughput (requests/sec).
- Compute required instances:

N_needed = ceil( N_test * (R_prod_peak / R_test) * ( (CPU_test / 100) / U_target ) )
Worked example

- `R_test` = 2,000 RPS, `N_test` = 10 instances, `CPU_test` = 50
- `R_prod_peak` = 5,000 RPS, `U_target` = 0.7 (70%)
- N_needed = ceil(10 * (5000 / 2000) * (0.5 / 0.7)) = ceil(17.857) = 18

Why this works: you derive per-instance capacity from observed RPS per instance at the measured CPU level, rescale that capacity to your desired CPU headroom, then divide the target traffic by the adjusted per-instance capacity.
Code you can drop into a runbook
```python
import math

def instances_needed(r_test, n_test, cpu_test_percent, r_prod_peak, u_target):
    """
    r_test: observed throughput during test (requests/sec)
    n_test: instances used in test
    cpu_test_percent: observed per-instance CPU (e.g., 50 for 50%)
    r_prod_peak: expected peak throughput to plan for
    u_target: acceptable per-instance CPU fraction (e.g., 0.7)
    """
    cpu_frac = cpu_test_percent / 100.0
    scale = r_prod_peak / r_test
    return math.ceil(n_test * scale * (cpu_frac / u_target))

# example
print(instances_needed(2000, 10, 50, 5000, 0.7))  # -> 18
```
Important checklist for multi-resource decisions

- Compute `N_needed` for CPU, memory, network throughput, disk IOPS, and DB connection limits, then take the maximum value — that resource is your effective limiter. Scaling CPU when the system is memory-bound won't help.
- If your service is concurrency-limited (thread pools, event loops), measure requests in flight per instance and scale for concurrent capacity instead of RPS.
- For queue-driven/async workloads, scale consumers on queue length or messages processed/sec, not CPU. Use a steady-state test to derive per-consumer throughput and apply the same per-resource math.
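The multi-resource rule above can be sketched in a few lines: run the same formula once per resource and take the maximum. The resource names and utilization numbers below are illustrative, not measurements from the worked example's system.

```python
import math

def instances_for_resource(n_test, util_test_percent, r_test, r_prod_peak, u_target):
    """Instances required so one resource stays below its target utilization."""
    util_frac = util_test_percent / 100.0
    return math.ceil(n_test * (r_prod_peak / r_test) * (util_frac / u_target))

# Illustrative per-resource readings from the same steady-state test:
# (observed % utilization at R_test, target utilization fraction)
observed = {
    "cpu":    (50, 0.70),
    "memory": (65, 0.80),
    "nic":    (30, 0.60),
}

r_test, n_test, r_prod_peak = 2000, 10, 5000

per_resource = {
    name: instances_for_resource(n_test, util, r_test, r_prod_peak, target)
    for name, (util, target) in observed.items()
}
limiter = max(per_resource, key=per_resource.get)
print(per_resource, "limiter:", limiter)
```

In this sketch memory, not CPU, sets the fleet size — exactly the case where scaling on CPU alone would under-provision.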
Measure what matters during tests
- Throughput: `R_test` (RPS), plus per-endpoint RPS.
- Latency percentiles: p50, p95, p99 (use histograms). k6 and other modern tools make this straightforward to codify as thresholds.
- Error rates and saturation signals (HTTP 5xx, GC pauses, thread exhaustion).
- Resource counters: CPU%, memory used, NIC throughput, EBS IOPS, DB TPS, connection pool usage.
- Application-specific metrics: queue depth, open file descriptors, external API rate limits.
Designing Autoscaling Policies That Match Real Traffic Patterns
Autoscaling is a control system; pick the right control variable and tune the thermostat. Use target-tracking for steady proportional loads, step-based for bursty events you want to damp, and scheduled/predictive for known patterns. AWS, GCP and Azure provide built-in primitives that work well when you pick the correct metric.
Which scaling model to choose
- Target tracking (thermostat model): keep a chosen metric near a setpoint (e.g., average CPU 50%, ALB request count per target = 1000/min). This is simple and safe for proportional workloads.
- Step scaling: use when you need controlled jumps and explicit cooldowns (e.g., scale +3 when CPU > 80% for 3 minutes).
- Scheduled scaling / Predictive scaling: use for recurring, predictable peaks (daily traffic cycles, known campaigns). Predictive scaling can pre-provision capacity in advance using historical patterns; use forecast-only mode to validate before enabling scale actions.
- Custom metric scaling: if CPU/NIC don't correlate with user-facing load, publish a custom metric (requests/sec, queue depth, in-flight operations) and scale on that instead. Target-tracking policies support custom metrics when they represent utilization proportional to capacity.
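Target tracking's core behavior is proportional: desired capacity moves with the ratio of the current metric to the setpoint. A minimal sketch of that arithmetic — illustrative values, not a provider API:

```python
import math

def target_tracking_desired(current_capacity, current_metric, target_value):
    """Proportional rule behind target tracking: keep the metric near its setpoint."""
    return max(1, math.ceil(current_capacity * current_metric / target_value))

# 8 instances each seeing 1,500 requests/target/min against a 1,000 setpoint:
print(target_tracking_desired(8, 1500, 1000))  # -> 12
```

This is why the metric must be proportional to load per instance: if it doesn't fall as capacity is added, the controller keeps scaling out.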
Practical adjustments and safety buffers
- Maintain a minimum capacity: never scale to zero for critical frontends unless your system is architected for complete shutdown. Include a `min` instance count based on failure scenarios.
- Use warm pools or pre-initialized instances for services with long boot or cold-start times; this shortens effective scale-out latency while saving cost versus permanently idle instances.
- Choose a safe target utilization — many teams aim for 60–75% CPU on web tiers to balance cost and headroom; SRE guidance supports provisioning to ~50% headroom for critical services where bursts or cascading failures are costly. Use your failure mode analysis to set the right band.
- Timeouts and cooldowns matter: aggressive scale-out plus aggressive scale-in causes thrash. Configure cooldown windows and test scale-in paths.
Sample target-tracking policy (conceptual, placeholders)
```yaml
# Conceptual: target tracking on ALB request count per target
scaling_policy:
  Type: TargetTrackingScaling
  Metric: ALBRequestCountPerTarget
  TargetValue: 1000        # requests per target per minute (tune from tests)
  ScaleOutCooldown: 60
  ScaleInCooldown: 300
  MinCapacity: 4
  MaxCapacity: 200
```
Use provider docs for exact commands and features; the idea is to keep the metric you control at a steady, efficient level while ensuring headroom for bursts.
Right-sizing Instances to Trim Cost Without Sacrificing Performance
Right-sizing is not a one-off: it’s measurement, experiment, commit. Start with accurate telemetry, run controlled A/B instance-type tests, and only then buy savings commitments.
Process to right-size
- Inventory: tag and list every instance (production and non-prod) with owner and purpose. Use Cloud provider tools (Compute Optimizer / Recommender / Azure Advisor) to get starting recommendations.
- Baseline: collect 2–4 weeks of detailed metrics (CPU, memory, NIC, IOPS) at 1-minute resolution where possible; ensure you capture business peaks (payroll close, marketing). Compute Optimizer benefits from several weeks of metric history.
- Experiment: pick candidate instance families (e.g., general-purpose m-family vs compute-optimized c-family, or Graviton vs x86), run the workload in a staging environment under load, and compare p95 latency, GC behaviour, and throughput. Price-performance is won by running tests, not by reading spec sheets.
- Commit after validation: buy Reserved Instances / Savings Plans / Committed Use only after you've stabilized the instance profile; right-size first, then commit.
Cost techniques that pair well with right-sizing
- Use spot / preemptible instances for fault-tolerant, non-critical, or background workloads to shave significant cost. Test preemption behavior in staging.
- Employ mixed-instance policies and instance type flexibility for Auto Scaling groups to improve availability and price-performance.
- Use smaller instances for bin-packing stateful services to avoid licensing and networking overhead — but weigh the management cost of many small instances vs a few larger ones.
Quick decision matrix (summary)
| Constraint | Tune for | How to test |
|---|---|---|
| CPU-bound | Compute-optimized family (C) | CPU-bound synthetic workloads, p95 CPU saturation |
| Memory-bound | Memory-optimized (R) | Heap profiles, OOM checks under load |
| IO-bound | Storage-optimized (I) | Disk throughput tests, iops saturation |
| Latency-sensitive | Higher single-core perf | Single-threaded latency benchmarks |
AWS and other providers include right-sizing guidance in their well-architected frameworks; treat those recommendations as starting points, not final decisions.
Operational Monitoring, Forecasting and Continuous Re-Evaluation
Capacity planning is a feedback loop: monitor, forecast, validate, commit, and repeat.
Key metrics and SLO alignment
- Always track the user-facing SLI (e.g., p95 latency, error rate) alongside infrastructure metrics (CPU, memory, RPS, DB TPS, queue depth). SLOs must drive scaling decisions when possible: if your SLO is tail latency, scale on a correlated application metric rather than CPU alone.
- Instrument service internals (per-endpoint latency histograms, active requests, queue lengths) using a consistent metrics model (Prometheus-style instrumentation is recommended).
Monitoring & observability best practices
- Use histograms for latency distributions and record p50/p95/p99 percentiles rather than relying on averages. Prometheus's instrumentation guidance gives concrete rules for histogram vs summary usage and label cardinality.
- Export and retain high-resolution data for at least the period you need to model seasonality; push aggregated records to long-term storage (Thanos/Cortex/VictoriaMetrics) if needed.
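To make the percentiles-vs-averages point concrete, here is a minimal sketch (plain Python, nearest-rank percentiles over synthetic latencies) showing how a slow tail hides behind a healthy-looking mean:

```python
def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100] over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest-rank, 1-indexed
    return ordered[rank - 1]

# 90 fast requests and 10 slow ones: the mean looks fine, the tail does not.
latencies_ms = [20] * 90 + [900] * 10
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.0f}ms p50={percentile(latencies_ms, 50)}ms "
      f"p95={percentile(latencies_ms, 95)}ms p99={percentile(latencies_ms, 99)}ms")
```

Here the mean is ~108 ms while p95 and p99 sit at 900 ms — scaling or alerting on the average would miss the SLO breach entirely.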
Forecasting demand (practical method)
- Build a baseline forecast from historical peaks (e.g., weekly high), then apply an event multiplier for planned campaigns and a growth factor (monthly or quarterly). R_target = peak_lookback_max * (1 + event_multiplier) * (1 + expected_growth)
- Validate the forecast with predictive autoscalers (run in forecast-only mode to compare predictions to actuals) before acting on them. AWS and other vendors provide predictive scaling features that analyze historical metrics and suggest pre-warms; use them with caution and validation.
- Re-evaluate after every major release, product launch, or marketing event.
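The baseline forecast above translates directly into code; the multipliers here are illustrative placeholders you would derive from your own metric history and campaign plans:

```python
def forecast_peak_rps(peak_lookback_max, event_multiplier, expected_growth):
    """R_target = historical peak, uplifted for planned events and growth."""
    return peak_lookback_max * (1 + event_multiplier) * (1 + expected_growth)

# Illustrative: 4,000 RPS historical weekly peak, +30% campaign, +10% growth
print(round(forecast_peak_rps(4000, 0.30, 0.10)))  # -> 5720
```

Feed the result back through the per-resource sizing formula to get the instance counts the forecast implies.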
Re-evaluation cadence
- Weekly to monthly: dashboard review of utilization, top spenders, and anomalies.
- Pre-release: run smoke & load tests, update forecasts, and validate scaling policies.
- Quarterly: fleet-wide rightsizing pass and review of reserved/commitment posture (don’t buy commitments until right-sized). Flexera and industry reports show that cost control remains a top cloud challenge; regular FinOps review is critical.
Practical Capacity Planning Checklist
This is the runbook you execute when turning a load-test into deployable capacity.
Pre-test (prepare)
- [ ] Define SLOs and set clear p95/p99 latency targets.
- [ ] Ensure test environment mirrors production (same network, DB, caches, feature flags).
- [ ] Instrument everything: RPS, latency histograms, in-flight requests, CPU, memory, IOPS, network, DB metrics. Use Prometheus/OpenTelemetry conventions.
During test (collect)
- [ ] Run steady-state and spike tests (ramp, steady, spike, soak).
- [ ] Capture
R_test,N_test,CPU_test, memory, and external dependency metrics. - [ ] Tag and export test metrics to a persistent store for analysis.
Post-test (analyze & size)
- [ ] Compute
N_neededper resource using the CPU formula and equivalents for memory/IO; pick the max. - [ ] Select
U_targetbased on SRE risk tolerance (50%–70% common starting band). - [ ] Add buffer: choose a buffer strategy — percentage headroom (e.g., 20–50%) or absolute min-instances (e.g., keep 3 spares). Document rationale.
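A small sketch of this post-test sizing step — take the limiting resource's count, then apply whichever buffer strategy you documented (the counts, headroom percentage, and spare count below are illustrative):

```python
import math

def buffered_capacity(per_resource_counts, headroom_pct=0.0, min_spares=0):
    """Take the limiting resource's instance count, then apply the chosen buffer."""
    base = max(per_resource_counts.values())
    with_headroom = math.ceil(base * (1 + headroom_pct))
    return max(with_headroom, base + min_spares)

counts = {"cpu": 18, "memory": 21, "nic": 13}
print(buffered_capacity(counts, headroom_pct=0.25))  # percentage headroom strategy
print(buffered_capacity(counts, min_spares=3))       # absolute-spares strategy
```

Whichever strategy you pick, record the inputs alongside the result so the next rightsizing pass can reproduce the decision.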
Autoscaler & deployment
- [ ] Prefer target-tracking on a correlated metric (ALB request count per target, requests/sec, or custom app metric) rather than raw CPU when possible. Validate correlation.
- [ ] Configure warm pools or pre-warmed capacity for slow-start components.
- [ ] Set sensible cooldowns and scale-in safeguards to avoid thrash.
Cost controls
- [ ] Run an instance-type A/B to validate price-performance.
- [ ] Plan reserved/commitments only after right-sizing and observing steady usage for a representative period.
- [ ] Use Spot/Preemptible for non-critical workloads and build graceful preemption handlers.
Automation & governance
- [ ] Codify sizing rules and scaling policies in IaC (Terraform/CloudFormation).
- [ ] Add capacity tests to CI (smoke + a periodic larger test).
- [ ] Put owner and runbook links into each dashboard and alert to route responsibility clearly.
Quick decision matrix: which metric to scale on
| Metric | Use when | Example scaling action |
|---|---|---|
| CPU% | CPU is proven to correlate with work done | Target tracking to 60% |
| ALBRequestCountPerTarget | Stateless web servers behind ALB | Target track on requests/target/minute |
| Queue length | Worker/consumer backlog controls latency | Scale consumers to keep backlog < X |
| DB connections | DB limits are the bottleneck | Scale app pool horizontally or add read replicas |
Sources
- Google SRE — Improve and Optimize Data Processing Pipelines / Capacity planning. Practical SRE guidance on demand forecasting, provisioning decisions, and provisioning components with CPU headroom for peak handling; used to justify headroom and capacity modeling approaches.
- Amazon Application Auto Scaling — Target tracking scaling policies overview. Documentation describing target tracking, metric choices (including ALBRequestCountPerTarget), and operational behaviour of autoscaling policies.
- k6 — Thresholds (performance testing best practices). Guidance on using p95/p99 percentiles, thresholds and test validation; used for describing what to capture from load tests.
- AWS Well-Architected Framework — Configure and right-size compute resources. Right-sizing and compute selection guidance from the Performance Efficiency pillar; used to frame instance family selection and the right-sizing workflow.
- AWS Prescriptive Guidance — Right size Windows workloads & Compute Optimizer recommendations. Practical instructions for enabling Compute Optimizer and using its recommendations in a rightsizing program.
- Amazon EC2 Auto Scaling — Create a warm pool for an Auto Scaling group. Documentation on warm pools, which reduce scale-out latency by keeping pre-initialized instances ready.
- Amazon EC2 Auto Scaling — How predictive scaling works. Details on predictive scaling, forecast-only validation, and using forecasts to schedule capacity.
- Google Cloud — Create and use preemptible VMs. Official guidance on using preemptible/spot instances for significant cost savings, with caveats about preemption.
- Flexera — State of the Cloud Report (2025). Industry data showing cloud cost management is a top challenge, motivating disciplined capacity planning and FinOps practices.
- Prometheus — Instrumentation best practices. Authoritative guidance on metrics design, label cardinality, histograms, and instrumentation patterns for reliable capacity planning telemetry.