In the summer of 2016, Pokémon GO launched to a user base roughly fifty times larger than its capacity planning had anticipated. The engineering team had done load testing. They had throughput thresholds. They had autoscaling configured. Within hours of launch, the service was degraded globally — not because the infrastructure could not scale, but because it scaled too slowly against an arrival rate that exceeded every modelled scenario, and because the metric that was driving scaling decisions (CPU utilisation) lagged behind the actual saturation signal by several minutes. By the time CPU registered critical, the request queue had already grown to the point where p99 latency had crossed into the range where users were abandoning sessions faster than new sessions were being created.
The engineering post-mortem identified the same root cause that appears in the post-mortems of most capacity-related incidents: the organisation's operational metrics were measuring how hard the infrastructure was working, not how much work the service could safely accept. CPU percentage is a resource utilisation metric. Memory percentage is a resource utilisation metric. IOPS is a resource utilisation metric. None of them is a service throughput metric. None of them tells you, with precision, at what arrival rate your SLO begins to degrade.
Safe Operating Throughput is that metric. It is not a new concept in queueing theory or systems engineering — the idea of a safe operating ceiling predates modern distributed systems. What is new is its treatment as a first-class SRE metric: formally derived from load test data and SLO targets, continuously monitored for drift, and operationally enforced as a constraint in autoscaling configuration, capacity planning decisions, and deployment pipeline gates.
Why Existing Capacity Metrics Are Insufficient
The canonical capacity management approach in most organisations works like this: observe CPU or memory utilisation, set an autoscaling threshold (typically 70–80%), and configure the Kubernetes Horizontal Pod Autoscaler (HPA) to scale up when that threshold is breached. This approach has three structural problems.
Problem 1 — Resource metrics are lagging indicators. Under JVM workloads, a stop-the-world garbage collection pause can cause request queue depth to spike and p99 latency to breach SLO bounds while averaged CPU utilisation stays below the scaling threshold: application threads are paused, and the short burst of GC work is smoothed out by the metric's sampling window. The HPA threshold is not breached. The scaling event does not fire. Users experience degraded service that the autoscaler cannot see.
Problem 2 — Resource metrics do not encode SLO position. A service running at 75% CPU utilisation may be well within its SLO targets or may be breaching them, depending on its request mix, its dependency latency profile, and its thread pool configuration. The CPU number alone carries no information about which situation applies. SOT, derived from load tests run against the actual SLO targets, encodes exactly that information: it is the throughput at which the service is known to be within its SLO bounds, with an explicit safety margin.
Problem 3 — Resource metrics produce the wrong HPA input. Scaling on CPU means the autoscaler is responding to how much work is currently being done, not to how much more work is arriving. By the time CPU crosses the scaling threshold, the system is already under load. The cold-start latency of new replicas — JVM warm-up, connection pool establishment, Istio sidecar certificate negotiation — means that scaling events triggered by resource metrics consistently lag behind the demand curve they are responding to.
The core definition: Safe Operating Throughput is the maximum sustained request arrival rate at which a service can maintain all of its SLO targets — availability, latency, and error rate — under realistic production conditions, including representative request mix, dependency latency profiles, and infrastructure overhead. It is expressed in requests per second per replica, enabling direct use as an HPA target metric.
Formal Derivation: Little's Law and the SLO-Anchored Ceiling
The theoretical foundation for SOT derivation is Little's Law, one of the most robust results in queueing theory:
────────────────────────────────────────────────────────────────────────────
LITTLE'S LAW
L = λ × W
Where:
L = average number of requests concurrently in the system
λ = average arrival rate (requests per second)
W = average time a request spends in the system (seconds)
(service time + queue wait time)
────────────────────────────────────────────────────────────────────────────
IMPLICATION FOR SOT DERIVATION:
For a service with maximum concurrency ceiling C
(thread pool size, connection pool limit, or async worker count):
Maximum theoretical throughput = C / W
At this ceiling, all concurrency slots are occupied on average.
Beyond it, requests begin queuing — and W starts increasing,
which reduces throughput further. This is the saturation knee.
SOT = Safety Factor × (C / W_baseline)
Where:
W_baseline = average response time at low load (measured)
C = effective concurrency limit (measured or configured)
Safety Factor = 0.75–0.85 (accounts for GC pauses, burst variance,
Istio mTLS overhead, OTel agent overhead)
────────────────────────────────────────────────────────────────────────────
WORKED EXAMPLE:
Service: payments-api (JVM, Spring Boot, Tomcat thread pool)
Thread pool size (C): 200 threads
Baseline response time (W): 45ms = 0.045s (measured at 10% load)
Theoretical max throughput: 200 / 0.045 = 4,444 RPS
Load test results:
At 3,000 RPS: p95 latency = 112ms ✓ within SLO (< 300ms)
At 3,500 RPS: p95 latency = 198ms ✓ within SLO
At 4,000 RPS: p95 latency = 347ms ✗ SLO breach begins
At 4,200 RPS: error rate = 0.15% ✗ error budget burning at 3×
SLO breach threshold (empirical): ~3,800 RPS per service instance
SOT = 0.80 × 3,800 = 3,040 RPS per replica (80% safety margin)
HPA target: 3,040 RPS per replica → scale up before SLO risk materialises
────────────────────────────────────────────────────────────────────────────
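The 80% safety margin (a safety factor of 0.80, leaving 20% headroom below the breach threshold) is not arbitrary. Note that in the worked example it is applied to the empirically measured breach threshold (~3,800 RPS), not to the theoretical Little's Law ceiling of 4,444 RPS; the theoretical figure is an upper bound that the load test then tightens. The headroom covers three concurrent sources of throughput variance: request mix variation (some requests are more expensive than others), GC pause-induced latency spikes (which temporarily reduce effective throughput), and the cold-start latency window during which new replicas are being initialised but not yet serving traffic. An organisation with highly consistent request mix and minimal GC pressure may use 85%; one with high variance or bursty traffic profiles should use 75% or lower.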
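The worked example above reduces to two small calculations. The following is a minimal sketch in Python; the function names `theoretical_ceiling_rps` and `safe_operating_throughput` are hypothetical and chosen only for illustration, and the inputs are the payments-api figures from the worked example.

```python
def theoretical_ceiling_rps(concurrency: int, baseline_latency_s: float) -> float:
    """Little's Law rearranged: maximum throughput = C / W."""
    if concurrency <= 0 or baseline_latency_s <= 0:
        raise ValueError("concurrency and baseline latency must be positive")
    return concurrency / baseline_latency_s


def safe_operating_throughput(breach_threshold_rps: float,
                              safety_factor: float = 0.80) -> float:
    """Apply the safety factor to the empirically measured SLO breach threshold."""
    if not 0.0 < safety_factor <= 1.0:
        raise ValueError("safety factor must be in (0, 1]")
    return safety_factor * breach_threshold_rps


if __name__ == "__main__":
    # Figures from the payments-api worked example.
    ceiling = theoretical_ceiling_rps(concurrency=200, baseline_latency_s=0.045)
    sot = safe_operating_throughput(breach_threshold_rps=3800, safety_factor=0.80)
    print(f"theoretical ceiling: {ceiling:,.0f} RPS")
    print(f"SOT per replica:     {sot:,.0f} RPS")
```

Keeping the two functions separate makes the distinction explicit: the first gives an upper bound from configuration alone, while the second needs a load-test result as its input.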
Load Test Design for SOT Derivation
SOT is only as valid as the load test that derives it. A load test that uses synthetic requests with uniform size, uniform think time, and no downstream dependency simulation will produce a SOT that overestimates safe production throughput — sometimes dramatically. The load test protocol for SOT derivation has five mandatory design requirements.
────────────────────────────────────────────────────────────────────────────
SOT LOAD TEST DESIGN REQUIREMENTS
────────────────────────────────────────────────────────────────────────────
REQUIREMENT 1: REPRESENTATIVE REQUEST MIX
Traffic must reflect production request distribution.
Source: Splunk query against production access logs, last 30 days.
Typical mix (payments-api example):
45% GET /payment-status (lightweight, cache-friendly)
30% POST /payment-initiate (heavyweight, synchronous DB write)
15% GET /payment-history (medium, paginated DB read)
10% POST /payment-refund (heavyweight, multi-step saga)
A load test using only GET /health is not a SOT derivation;
it is a health check stress test.
REQUIREMENT 2: RAMP PROTOCOL (STEP LOAD, NOT SPIKE)
Use stepped ramp increments of 10–15% throughput increase,
holding each step for ≥ 5 minutes before advancing.
Rationale: JVM JIT compilation and connection pool warm-up
require sustained load before steady-state performance stabilises.
A spike load test measures cold-start behaviour, not sustained SOT.
REQUIREMENT 3: SLO METRICS AS PASS/FAIL GATES
The load test terminates at the step where SLO targets are first breached.
Gate 1: p95 latency must remain < [SLO latency threshold]
Gate 2: error rate must remain < [1 - SLO availability target]
Gate 3: error budget burn rate must remain < 3× (ticket tier)
SOT threshold = the highest throughput step where all three gates pass.
REQUIREMENT 4: DEPENDENCY SIMULATION
Downstream service latency must be simulated at realistic P50/P95 values,
not at ideally-low stub values. A payments-api that calls a card-network
gateway at P50=80ms in production should call a stub at P50=80ms in the
load test. Understating dependency latency understates W in Little's Law
and overstates the SOT ceiling.
REQUIREMENT 5: INFRASTRUCTURE PARITY
The test environment must match production:
→ Same JVM flags (heap size, GC algorithm, ActiveProcessorCount)
→ Same CPU and memory limits (Kubernetes resource requests/limits)
→ Istio sidecar ENABLED in STRICT mTLS mode (not bypassed)
→ OTel agent ENABLED (not disabled for "performance testing")
→ Same replica count as production minimum (not a single instance)
Each of these deviations produces a SOT that does not apply to production.
────────────────────────────────────────────────────────────────────────────
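Requirement 3 can be sketched as a small evaluation routine over per-step load test results. This is an illustrative sketch only: `StepResult`, `passes_gates`, and `highest_passing_step` are hypothetical names, and in the usage below the error rates for the passing steps and the p95 value at 4,200 RPS are assumed, since the worked example does not state them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StepResult:
    rps: float             # offered load held during this step
    p95_latency_ms: float  # p95 latency measured over the hold period
    error_rate: float      # failed / total requests, e.g. 0.0015 for 0.15%


def passes_gates(step: StepResult, latency_slo_ms: float,
                 availability_slo: float, max_burn_rate: float = 3.0) -> bool:
    error_budget = 1.0 - availability_slo       # allowed failure fraction
    burn_rate = step.error_rate / error_budget  # 1.0 = budget consumed exactly on pace
    return (step.p95_latency_ms < latency_slo_ms   # Gate 1: latency
            and step.error_rate < error_budget     # Gate 2: error rate
            and burn_rate < max_burn_rate)         # Gate 3: burn rate


def highest_passing_step(steps: Iterable[StepResult], latency_slo_ms: float,
                         availability_slo: float) -> Optional[float]:
    """Return the RPS of the last step before the first gate failure, if any."""
    last_pass = None
    for step in sorted(steps, key=lambda s: s.rps):
        if not passes_gates(step, latency_slo_ms, availability_slo):
            break  # the test terminates at the first breached step
        last_pass = step.rps
    return last_pass


if __name__ == "__main__":
    steps = [
        StepResult(3000, 112, 0.0001),  # error rate assumed
        StepResult(3500, 198, 0.0002),  # error rate assumed
        StepResult(4000, 347, 0.0004),  # latency gate fails here
        StepResult(4200, 410, 0.0015),  # p95 assumed; 0.15% errors from the example
    ]
    print(highest_passing_step(steps, latency_slo_ms=300, availability_slo=0.9995))
```

Stopping at the first failure, instead of taking the maximum over all passing steps, guards against a noisy later step passing by chance after the service has already entered saturation.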
<?xml version="1.0" encoding="UTF-8"?>
<!-- JMeter Test Plan: SOT Derivation Protocol -->
<!-- Stepped ramp load test with SLO-anchored pass/fail gates -->
<jmeterTestPlan version="1.2">
<hashTree>
<TestPlan testname="SOT Derivation — payments-api">
<hashTree>
<!-- Stepped Throughput Controller: 500 → 1000 → 1500 → ... RPS -->
<ThreadGroup testname="Stepped Load Ramp">
<!-- Each step: target threads × ramp duration × hold duration -->
<!-- Step 1: 500 RPS for 5 minutes (warm-up) -->
<!-- Step 2: 1000 RPS for 5 minutes -->
<!-- Step 3: 1500 RPS — continue until SLO gate fails -->
<stringProp name="ThreadGroup.num_threads">300</stringProp>
<stringProp name="ThreadGroup.ramp_time">30</stringProp>
<hashTree>
<!-- Weighted request mix matching production distribution -->
<ThroughputController testname="GET /payment-status (45%)">
<boolProp name="ThroughputController.perThread">false</boolProp>
<floatProp name="ThroughputController.percentThroughput">45</floatProp>
</ThroughputController>
<ThroughputController testname="POST /payment-initiate (30%)">
<floatProp name="ThroughputController.percentThroughput">30</floatProp>
</ThroughputController>
<ThroughputController testname="GET /payment-history (15%)">
<floatProp name="ThroughputController.percentThroughput">15</floatProp>
</ThroughputController>
<ThroughputController testname="POST /payment-refund (10%)">
<floatProp name="ThroughputController.percentThroughput">10</floatProp>
</ThroughputController>
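<!-- SLO Gate input: per-sample results are written to CSV; the p95 < 300ms check is evaluated per step from this file (a ResultCollector records results, it does not fail the test by itself) -->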
<ResultCollector testname="SLO Gate — Latency">
<stringProp name="filename">sot-results.csv</stringProp>
</ResultCollector>
</hashTree>
</ThreadGroup>
<!-- Backend Listener: stream results to Splunk HEC in real time -->
<BackendListener testname="Splunk Real-Time Metrics">
<stringProp name="classname">org.apache.jmeter.visualizers.backend.influxdb.InfluxdbBackendListenerClient</stringProp>
<!-- Configure to forward to Splunk via InfluxDB line protocol proxy -->
</BackendListener>
</hashTree>
</TestPlan>
</hashTree>
</jmeterTestPlan>
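The step schedule described in Requirement 2 (10–15% increments, each held for at least five minutes) can be generated programmatically and then handed to whatever drives the load generator. The sketch below is a hypothetical helper, not part of JMeter; `stepped_ramp` and its parameters are illustrative names, and the start and maximum RPS values in the usage are assumptions.

```python
from typing import List, Tuple


def stepped_ramp(start_rps: float, max_rps: float,
                 increment: float = 0.10, hold_s: int = 300) -> List[Tuple[int, int]]:
    """Return (start_offset_seconds, target_rps) pairs for a stepped ramp.

    Each step raises the target by `increment` (0.10 = 10%) over the
    previous step and is held for `hold_s` seconds before advancing.
    """
    if start_rps <= 0 or increment <= 0 or hold_s <= 0:
        raise ValueError("start_rps, increment and hold_s must be positive")
    schedule = []
    rps, offset = float(start_rps), 0
    while rps <= max_rps:
        schedule.append((offset, round(rps)))
        offset += hold_s
        rps *= 1.0 + increment
    return schedule


if __name__ == "__main__":
    # Assumed range: start well below the expected knee, stop near the
    # theoretical Little's Law ceiling.
    for offset, rps in stepped_ramp(3000, 4500, increment=0.10, hold_s=300):
        print(f"t+{offset:>5}s  hold {rps} RPS")
```

Multiplicative steps keep the relative resolution constant, so the breach threshold is located to within one increment regardless of the absolute throughput range.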
JVM-Specific Considerations
JVM services require two non-obvious adjustments to the SOT derivation protocol. Both are sources of systematic error when overlooked.
OTel Agent Memory Overhead
The OpenTelemetry Java agent adds 100–200 MB of heap pressure under production-representative load. This overhead comes from span buffer allocation, metric exemplar storage, and the agent's own internal telemetry. A load test run without the OTel agent will measure a SOT that is optimistic by the amount of throughput reduction that heap pressure introduces — typically 5–15% at production trace sampling rates.
The OTel agent must be enabled during SOT load tests at the same sampling rate as production. Disabling it "to get clean performance numbers" produces numbers that do not apply to the system that will actually run in production.
CPU Limit and ActiveProcessorCount Alignment
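The JVM sizes its internal thread pools (GC threads, ForkJoinPool workers), and libraries such as Netty size their event loops, from the number of available processors detected at startup. JVMs without container support (releases before JDK 10 and 8u191, or any JVM started with -XX:-UseContainerSupport) read the host's processor count, not the container's CPU limit. Container-aware JVMs derive the count from the cgroup CPU limit, but the result can still differ from what was intended when the limit is fractional or unset, so pinning the value explicitly keeps the load test and production environments identical.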
────────────────────────────────────────────────────────────────────────────
CPU LIMIT vs ACTIVEPROCESSORCOUNT MISALIGNMENT
Scenario:
Node CPU count: 32 cores
Container CPU limit: 2 cores
JVM detected CPUs: 32 (host count; container support absent or disabled)
Consequence (JVM and library defaults):
ForkJoinPool common pool: 31 workers (would be 1 with 2 CPUs)
GC threads (ParallelGCThreads): 23 (would be 2)
Netty event loops: 64 (would be 4)
Result:
JVM creates dozens of worker threads competing for 2 CPU cores.
CPU throttling inflates W (response time) non-linearly.
SOT derived without this setting overestimates safe throughput
by 20–40% in observed enterprise JVM deployments.
Fix: Add to JVM flags in Kubernetes Deployment manifest:
-XX:ActiveProcessorCount=2 (match container CPU limit integer)
────────────────────────────────────────────────────────────────────────────
# Kubernetes Deployment: JVM flags aligned to container CPU limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: "2"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "3Gi"  # Limit > request: headroom for GC spikes
          env:
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -XX:ActiveProcessorCount=2
                -XX:+UseG1GC
                -XX:MaxGCPauseMillis=200
                -Xms1g
                -Xmx2g
                -XX:+ExitOnOutOfMemoryError
                -javaagent:/otel/opentelemetry-javaagent.jar
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://splunk-otel-collector.monitoring.svc:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"  # 10% sampling: match this rate in load test
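To see why the detected processor count matters, it helps to compute the default pool sizes it produces. The sketch below encodes the default sizing heuristics as commonly documented for HotSpot (ParallelGCThreads), the ForkJoinPool common pool, and Netty event loop groups; these formulas are assumptions for illustration, the function names are hypothetical, and exact figures vary with JVM version, GC choice, and library version, so verify against the runtime actually deployed.

```python
from typing import Dict


def parallel_gc_threads(cpus: int) -> int:
    """Assumed HotSpot default: one thread per CPU up to 8, then 5/8 per extra CPU."""
    return cpus if cpus <= 8 else 8 + (cpus - 8) * 5 // 8


def fork_join_common_parallelism(cpus: int) -> int:
    """Assumed ForkJoinPool common pool default: available processors minus one."""
    return max(1, cpus - 1)


def netty_default_event_loops(cpus: int) -> int:
    """Assumed Netty default: twice the available processors."""
    return max(1, cpus * 2)


def default_pool_sizes(cpus: int) -> Dict[str, int]:
    if cpus < 1:
        raise ValueError("cpus must be >= 1")
    return {
        "gc_threads": parallel_gc_threads(cpus),
        "fork_join_workers": fork_join_common_parallelism(cpus),
        "netty_event_loops": netty_default_event_loops(cpus),
    }


if __name__ == "__main__":
    for label, cpus in (("host count", 32), ("container limit", 2)):
        print(f"{label:>15} ({cpus:>2} CPUs): {default_pool_sizes(cpus)}")
```

Comparing the two rows shows how many runnable threads end up contending for the container's CPU quota when the JVM sizes itself from the host.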
Istio STRICT mTLS Overhead on SOT
In environments running Istio in STRICT mTLS mode, connection establishment carries an overhead that is material to SOT under specific traffic patterns. The mTLS handshake adds approximately 1–3ms per new connection. Under HTTP/2 with connection reuse (the default for gRPC and modern REST clients), this overhead is amortised across many requests and is negligible.
Under bursty traffic where the connection pool is frequently recycled — common at service startup, after circuit breaker trips, and during rolling deployments — mTLS handshake overhead can materially inflate W in Little's Law during the connection establishment phase, temporarily reducing effective throughput below the steady-state SOT.
────────────────────────────────────────────────────────────────────────────
ISTIO mTLS OVERHEAD: IMPACT ON SOT DERIVATION
Scenario: payments-api post-rolling-deployment burst
Connection pool size per replica: 100 connections
mTLS handshake time per connection: 2ms
Time to establish full connection pool: 200ms
Incoming RPS during this window: 2,000 RPS
Effective capacity during pool establishment:
Available connections: 0 → 100 (linear ramp over 200ms)
Average available connections: 50
Effective throughput ceiling (Little's Law, W=45ms):
50 / 0.045 = 1,111 RPS
Throughput deficit: 2,000 - 1,111 = 889 RPS queued
Queue growth: 889 RPS × 0.2s = 178 requests backlogged in 200ms
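Once the pool is full, capacity is 100 / 0.045 ≈ 2,222 RPS, only ~222 RPS above
the 2,000 RPS arrival rate, so the 178-request backlog takes ~0.8 seconds to drain.
Requests queued in that window carry extra wait time on top of service time,
which pushes tail latency toward the SLO boundary.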
Mitigation: SOT for post-deployment burst scenarios must include
a connection pool warm-up adjustment factor. Configure Istio
connection pool settings to reduce churn during rolling deployments:
────────────────────────────────────────────────────────────────────────────
# Istio DestinationRule: Connection Pool Tuning for SOT Protection
# Prevents connection pool churn from creating transient SOT violations
# during rolling deployments and circuit breaker recovery
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-api-connection-pool
  namespace: production
spec:
  host: payments-api.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10ms
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 0  # 0 = unlimited; enable connection reuse
        maxRetries: 3
        idleTimeout: 90s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
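The backlog arithmetic in the post-rolling-deployment scenario can be reproduced with a few lines of Python, which makes it easy to rerun with a different pool size or handshake time. This is a sketch under the scenario's own simplifications (sequential handshakes, a linear ramp in available connections); `warmup_backlog` and `WarmupEstimate` are hypothetical names.

```python
from typing import NamedTuple


class WarmupEstimate(NamedTuple):
    warmup_s: float          # time to establish the full connection pool
    effective_rps: float     # average throughput ceiling during warm-up
    backlog_requests: float  # requests queued by the end of warm-up


def warmup_backlog(pool_size: int, handshake_s: float,
                   service_time_s: float, arrival_rps: float) -> WarmupEstimate:
    warmup_s = pool_size * handshake_s                # handshakes assumed sequential
    avg_connections = pool_size / 2                   # linear ramp from 0 to pool_size
    effective_rps = avg_connections / service_time_s  # Little's Law: C / W
    deficit_rps = max(0.0, arrival_rps - effective_rps)
    return WarmupEstimate(warmup_s, effective_rps, deficit_rps * warmup_s)


if __name__ == "__main__":
    est = warmup_backlog(pool_size=100, handshake_s=0.002,
                         service_time_s=0.045, arrival_rps=2000)
    print(f"warm-up window:    {est.warmup_s * 1000:.0f} ms")
    print(f"effective ceiling: {est.effective_rps:,.0f} RPS")
    print(f"backlog:           {est.backlog_requests:.0f} requests")
```

Real sidecars establish connections concurrently, so the sequential-handshake assumption is pessimistic; the value of the model is in showing which inputs dominate the backlog.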
SOT as the Input to HPA Configuration
The derivation of SOT is half the work. The operationalisation of SOT as a live autoscaling constraint is where it becomes a first-class metric. The HPA target value is derived directly from SOT, not from CPU thresholds.
# HPA configured from SOT derivation output
# SOT = 3,040 RPS per replica (derived above)
# HPA target = SOT value directly
# When average RPS per replica exceeds 3,040, scale out
# Note: a Pods metric requires a custom metrics adapter (for example
# Prometheus Adapter) that exposes http_requests_per_second per pod
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-sot-hpa
  namespace: production
  annotations:
    sre.internal/sot-value: "3040"
    sre.internal/sot-derived-from: "load-test-2025-Q1"
    sre.internal/sot-slo-target: "99.95%-availability-300ms-p95"
    sre.internal/sot-safety-margin: "0.80"
    sre.internal/sot-next-review: "2025-Q2"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "3040"  # SOT value: scale before SLO risk materialises
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
The annotations on the HPA resource are operational documentation: they record where the SOT value came from, which SLO it was derived against, what safety margin was applied, and when it should next be re-derived. Without this documentation, SOT values become magic numbers in configuration files: present but inexplicable, and never updated because no one remembers what they represent.
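To reason about how the HPA reacts to a given load, the documented Kubernetes scaling rule, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), can be evaluated offline. The sketch below is a simplified model: it applies the default 10% tolerance and the min/max bounds from the manifest above, but deliberately ignores the `behavior` policies and stabilisation windows. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg_rps: float,
                     target_rps: float = 3040.0, min_replicas: int = 3,
                     max_replicas: int = 60, tolerance: float = 0.10) -> int:
    """Simplified HPA rule for an AverageValue Pods metric."""
    ratio = current_avg_rps / target_rps
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 12 replicas, each currently receiving 4,000 RPS on average.
    print(desired_replicas(current_replicas=12, current_avg_rps=4000))
    # Same fleet at 2,187 RPS per replica: candidate for scale-down.
    print(desired_replicas(current_replicas=12, current_avg_rps=2187))
```

Because the target is the SOT and not the breach threshold, the replica count this rule requests already includes the safety margin.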
SOT Drift: How Safe Throughput Changes Over Time
SOT is not a static value. It drifts as the service evolves, and undetected SOT drift is the mechanism by which a well-tuned autoscaling configuration becomes dangerously mis-calibrated over time.
────────────────────────────────────────────────────────────────────────────
SOT DRIFT SOURCES
Code changes:
New feature adds a synchronous downstream call → W increases → SOT decreases
Database query optimisation → W decreases → SOT increases (budget grows)
ORM N+1 query introduced → W increases non-linearly under load → SOT drops
Dependency changes:
Downstream service degrades from P50=80ms to P50=150ms → W increases
New rate limit on external API → effective concurrency ceiling C decreases
Infrastructure changes:
CPU limit reduced in cost-optimisation exercise → ActiveProcessorCount effect
Memory limit reduced → more frequent GC → GC pause inflation of W
Istio sidecar version upgrade → connection handling changes
Traffic mix changes:
New client sends 3× more POST /payment-refund (expensive endpoint)
→ Effective W increases even with no code changes
→ SOT derived from old traffic mix no longer applies
────────────────────────────────────────────────────────────────────────────
SOT DRIFT DETECTION: Prometheus Recording Rule
Continuously compare observed service throughput at SLO-boundary latency
against the SOT value stored in the HPA annotation.
Divergence > 15% = SOT re-derivation required.
────────────────────────────────────────────────────────────────────────────
# Prometheus Recording Rules: SOT Drift Detection
# Monitors the gap between observed throughput-at-SLO-boundary
# and the configured SOT value in the HPA
groups:
  - name: sot.drift_detection
    interval: 60s
    rules:
      # Current RPS per replica: the live throughput signal
      - record: sot:current_rps_per_replica:rate2m
        expr: |
          sum(
            rate(istio_requests_total{
              destination_service_name="payments-api",
              reporter="destination"
            }[2m])
          )
          /
          count(
            kube_pod_info{
              namespace="production",
              pod=~"payments-api-.*"
            }
          )
      # p95 latency trend at current throughput
      - record: sot:p95_latency_at_current_rps:seconds
        expr: |
          histogram_quantile(0.95,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_service_name="payments-api",
              reporter="destination"
            }[5m])) by (le)
          ) / 1000
      # SOT utilisation: actual RPS vs configured SOT ceiling
      # Values approaching 1.0 indicate the HPA is scaling near the SOT boundary
      # Values > 1.0 during load indicate SOT may have drifted downward
      - record: sot:utilisation_ratio:rate2m
        expr: |
          sot:current_rps_per_replica:rate2m
          /
          3040  # Configured SOT value; update when HPA annotation changes
      # SOT Drift Alert: p95 latency above 250 ms (approaching the 300 ms SLO
      # threshold) at throughput levels previously considered safe
      - alert: SOT_DriftDetected
        expr: |
          sot:p95_latency_at_current_rps:seconds > 0.25
          and
          sot:current_rps_per_replica:rate2m < 2800  # Below current SOT config
        for: 10m
        labels:
          severity: ticket
          domain: capacity_planning
        annotations:
          summary: >
            payments-api p95 latency at {{ $value | humanizeDuration }}
            while RPS/replica is {{ with query "sot:current_rps_per_replica:rate2m" }}
            {{ . | first | value | humanize }}{{ end }}, below configured SOT of 3,040.
            SOT may have drifted downward. Re-derivation required.
          runbook: "https://wiki.internal/sre/runbooks/sot-drift"
          load_test_trigger: "https://wiki.internal/sre/load-tests/sot-rederivation"
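The two drift checks described in this section (the 15% divergence rule and the alert condition) are simple enough to prototype outside Prometheus, for example when replaying historical load test data. The following is an illustrative sketch with hypothetical function names; it mirrors the thresholds used above and does not model the alert's 10-minute `for:` clause.

```python
def sot_divergence(configured_sot_rps: float, observed_boundary_rps: float) -> float:
    """Relative gap between configured SOT and observed throughput at the SLO boundary."""
    if configured_sot_rps <= 0:
        raise ValueError("configured SOT must be positive")
    return abs(observed_boundary_rps - configured_sot_rps) / configured_sot_rps


def needs_rederivation(configured_sot_rps: float, observed_boundary_rps: float,
                       threshold: float = 0.15) -> bool:
    """Divergence above the threshold means the SOT should be re-derived."""
    return sot_divergence(configured_sot_rps, observed_boundary_rps) > threshold


def drift_alert_fires(p95_latency_s: float, rps_per_replica: float,
                      latency_threshold_s: float = 0.25,
                      rps_threshold: float = 2800.0) -> bool:
    """Instantaneous version of the alert: high latency at supposedly safe throughput."""
    return p95_latency_s > latency_threshold_s and rps_per_replica < rps_threshold


if __name__ == "__main__":
    print(needs_rederivation(3040, 2500))  # observed boundary well below configured SOT
    print(drift_alert_fires(p95_latency_s=0.27, rps_per_replica=2400))
```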
SOT as a Capacity Debt Signal
The relationship between SOT and capacity debt mirrors the relationship between SLO targets and error budget. When a service consistently operates at a high fraction of its SOT ceiling, the organisation is accumulating capacity debt: the gap between current safe throughput and the throughput that will be demanded when the next traffic growth event occurs. In the bands below, sustained utilisation above 70% of SOT warrants a capacity watch, and above 85% it counts as capacity debt.
────────────────────────────────────────────────────────────────────────────
CAPACITY DEBT FRAMEWORK (SOT-Anchored)
SOT utilisation bands:
< 50% of SOT → Capacity surplus. Service can absorb 2× current traffic.
Autoscaling min replica count may be reducible.
Action: consider scaling floor reduction in off-peak windows.
50–70% of SOT → Healthy operating band. Sufficient headroom for burst
traffic without SLO risk. No capacity action required.
70–85% of SOT → Capacity watch. At P95 traffic spike (2× average), SOT
ceiling will be reached. Autoscaling must fire fast enough
to prevent SLO breach during spike.
Action: review scaleUp stabilizationWindowSeconds.
Validate cold-start latency within SLO tolerance.
> 85% of SOT → Capacity debt. Service is operating too close to its
safe ceiling for burst traffic absorption.
Action: increase minimum replica count to provide
headroom, AND schedule SOT re-derivation to
validate current value reflects current codebase.
> 100% of SOT → Active SLO risk. Throughput has exceeded the empirically
derived safe ceiling. Error budget consumption likely.
Action: immediate capacity intervention + incident review.
────────────────────────────────────────────────────────────────────────────
# Splunk Dashboard: SOT Capacity Debt Tracking
# CronJob forwards SOT utilisation to Splunk for trend analysis
# and quarterly capacity planning review
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sot-capacity-forwarder
  namespace: sre-platform
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sot-forwarder
              image: sre-platform/metrics-forwarder:v1.2.0
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring.svc:9090"
                - name: SPLUNK_HEC_URL
                  valueFrom:
                    secretKeyRef:
                      name: splunk-hec-creds
                      key: url
              # Emits to Splunk sourcetype="sre:capacity":
              # {
              #   "service": "payments-api",
              #   "sot_configured_rps": 3040,
              #   "current_rps_per_replica": 2187,
              #   "sot_utilisation_pct": 71.9,
              #   "capacity_band": "CAPACITY_WATCH",
              #   "replica_count": 12,
              #   "p95_latency_ms": 143,
              #   "slo_headroom_ms": 157,
              #   "sot_last_derived": "2025-Q1",
              #   "drift_detected": false
              # }
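The band assignment in the example payload follows directly from the capacity debt framework. The sketch below shows one way the classification might be implemented; only `CAPACITY_WATCH` appears in the original payload, so the other band labels are hypothetical, as is the choice to treat exact boundary values (50%, 70%) as belonging to the higher band.

```python
from typing import Tuple


def capacity_band(current_rps_per_replica: float, sot_rps: float) -> Tuple[float, str]:
    """Return (utilisation percent rounded to 1 decimal, band label)."""
    if sot_rps <= 0:
        raise ValueError("SOT must be positive")
    pct = 100.0 * current_rps_per_replica / sot_rps
    if pct > 100.0:
        band = "ACTIVE_SLO_RISK"   # hypothetical label
    elif pct > 85.0:
        band = "CAPACITY_DEBT"     # hypothetical label
    elif pct >= 70.0:
        band = "CAPACITY_WATCH"    # label used in the example payload
    elif pct >= 50.0:
        band = "HEALTHY"           # hypothetical label
    else:
        band = "CAPACITY_SURPLUS"  # hypothetical label
    return round(pct, 1), band


if __name__ == "__main__":
    # Values from the example payload: 2,187 RPS per replica against SOT 3,040.
    print(capacity_band(2187, 3040))
```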
Automated SOT Gate in the Deployment Pipeline
SOT re-derivation should be triggered automatically when changes that are likely to affect service throughput characteristics are deployed. A deployment that adds a synchronous downstream call, changes the thread pool configuration, or modifies the OTel sampling rate should trigger a SOT re-derivation run in the performance environment before the new SOT value is propagated to the HPA configuration in production.
# Argo CD PostSync Hook: SOT Re-Derivation Trigger
# Fires after deployments that carry the sre.internal/affects-sot annotation
# Triggers a JMeter load test run in the performance environment
# Updates HPA SOT annotation if new SOT differs by > 10% from current value
apiVersion: batch/v1
kind: Job
metadata:
  name: sot-rederivation-trigger
  namespace: sre-platform
  annotations:
    argocd.argoproj.io/hook: PostSync
    # A single key with a comma-separated list; repeating the key would
    # leave only the last value in effect
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: sot-automation-sa
      containers:
        - name: sot-gate
          image: sre-platform/sot-automation:v1.1.0
          env:
            - name: SERVICE_NAME
              value: "payments-api"
            - name: JMETER_CONTROLLER_URL
              value: "http://jmeter-controller.perf.svc:8080"
            - name: PERFORMANCE_ENV_NAMESPACE
              value: "performance"
            - name: SOT_CHANGE_THRESHOLD
              value: "0.10"  # Update HPA if new SOT differs > 10% from current
            - name: HPA_UPDATE_ON_CHANGE
              value: "true"  # Auto-update HPA annotation when SOT changes
            - name: SPLUNK_HEC_URL
              valueFrom:
                secretKeyRef:
                  name: splunk-hec-creds
                  key: url
            - name: ALERT_ON_REGRESSION
              value: "true"  # Page if new SOT is lower than current (regression)
# Execution sequence:
# 1. Check if deployed Application has sre.internal/affects-sot: "true"
#    (gate: the job exits without running a test if the annotation is absent)
# 2. If yes: trigger JMeter SOT derivation test in performance environment
# 3. Wait for test completion (timeout: 45 minutes)
# 4. Parse results: extract SOT at SLO boundary
# 5. Apply safety margin: new_SOT = 0.80 × threshold_rps
# 6. Compare with current HPA SOT annotation
# 7. If delta > 10%: update HPA annotation + emit Splunk event
# 8. If new SOT < current SOT (regression): page SRE team
# 9. If new SOT > current SOT (improvement): update silently + ticket
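Steps 5 to 9 of the execution sequence amount to a small decision function. The sketch below is one possible reading, not the behaviour of the `sot-automation` image, whose internals are not shown here: it assumes that paging and ticketing only happen when the change exceeds the 10% threshold, and the function and action names are hypothetical.

```python
from typing import Tuple


def sot_gate_decision(current_sot: float, threshold_rps: float,
                      safety_factor: float = 0.80,
                      change_threshold: float = 0.10) -> Tuple[float, str]:
    """Return (new_sot, action) for a freshly measured SLO breach threshold."""
    if current_sot <= 0:
        raise ValueError("current SOT must be positive")
    new_sot = safety_factor * threshold_rps            # step 5: apply safety margin
    delta = abs(new_sot - current_sot) / current_sot   # step 6: compare with current
    if delta <= change_threshold:
        return new_sot, "no_change"                    # within 10%: leave HPA as is
    if new_sot < current_sot:
        return new_sot, "update_and_page"              # steps 7 + 8: regression
    return new_sot, "update_and_ticket"                # steps 7 + 9: improvement


if __name__ == "__main__":
    # Current SOT 3,040; a new test finds the breach threshold at 3,300 RPS.
    new_sot, action = sot_gate_decision(current_sot=3040, threshold_rps=3300)
    print(round(new_sot), action)
```

Treating a regression more loudly than an improvement reflects the asymmetry of the risk: an HPA target that is too high fails silently until the next traffic peak.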
Common Antipatterns
The CPU-Threshold Disguise antipattern → Configuring HPA on CPU percentage while calling it "SOT-based autoscaling" because the CPU threshold was derived from a load test. CPU threshold and SOT are not equivalent. CPU measures resource utilisation at a point in time; SOT measures the service's relationship with its SLO boundary. Under GC-heavy or IO-bound workloads they can diverge substantially, and the divergence is typically in the direction of overconfidence.
The Single-Endpoint SOT antipattern → Deriving SOT from a load test that exercises only the healthiest, fastest, most cache-friendly endpoint. The SOT of a service is determined by its most expensive sustained request mix, not its fastest. A SOT derived from GET requests that ignores POST requests will overestimate safe throughput for the traffic mix that actually matters.
The Dependency-Free SOT antipattern → Running the SOT derivation load test with stubbed downstream dependencies at unrealistically low latency. The W in Little's Law is the time a request spends in the entire system, including time waiting for downstream responses. A dependency stub at 5ms when production latency is 80ms understates the dependency component of W by 16×; the closer the service's own processing time is to zero, the closer the resulting SOT comes to being 16× too optimistic.
The Set-and-Forget SOT antipattern → Deriving SOT once, configuring the HPA, and never revisiting it. SOT drifts with every significant code change, dependency change, and traffic mix evolution. An HPA configured to a SOT value derived eighteen months ago may be operating with a ceiling that no longer reflects the service's actual throughput characteristics. The sre.internal/sot-next-review annotation should be enforced by a scheduled Kyverno audit policy that generates a ticket when the review date passes.
The Missing Safety Margin antipattern → Setting HPA target to the empirical SLO breach threshold rather than to 80% of that threshold. At 100% of the breach threshold, the system is one traffic spike away from SLO violation, with no headroom for the autoscaler's cold-start latency. The safety margin is not conservatism; it is the engineering compensation for the inescapable lag between demand arrival and capacity availability.
Maturity Progression
────────────────────────────────────────────────────────────────────────────
STAGE        SOT MATURITY STATE                 NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     CPU/memory-based HPA. No SOT       Capacity incidents
             concept. Load tests run            after the fact.
             periodically with no SLO           No leading capacity
             anchoring.                         signal exists.

Defined      SOT derived for critical           HPA targets updated
             services. Little's Law applied.    to SOT values. Load
             Safety margin documented.          test protocol standardised.

Measured     SOT drift detection active.        SOT utilisation tracked
             Capacity debt bands tracked        in Splunk. JVM flags
             in Splunk. SOT annotated           aligned. OTel agent
             on HPA resources.                  included in tests.

Optimised    SOT re-derivation automated        SOT gate fires
             on deploys carrying SOT-affect     automatically. Capacity
             annotation. Quarterly SOT          debt trend visible
             review cadence enforced            to leadership. Istio
             by Kyverno.                        overhead modelled.

Generative   SOT incorporated into              Capacity planning
             architectural review process.      decisions made from
             SOT regression blocks              SOT data, not from
             deployments automatically.         intuition or CPU%.
             SOT data feeds demand              New services cannot
             forecasting model.                 launch without SOT
                                                derivation complete.
────────────────────────────────────────────────────────────────────────────
Five Action Items for This Week
Run a Little's Law ceiling calculation for your most critical service before running any load test. Take your thread pool or concurrency limit C and your baseline response time W from existing Splunk APM data. Calculate C / W. This gives the theoretical maximum throughput ceiling. If your current HPA target is anywhere near this number, your safety margin is insufficient and you have a latent capacity risk.
Audit your most recent load test against the five SOT design requirements. Was the request mix representative of production traffic distribution? Were downstream dependencies simulated at production-representative latency? Was the Istio sidecar enabled in STRICT mTLS mode? Was the OTel agent running? For each requirement not met, estimate the direction and magnitude of the SOT overestimate it produced.
Add SOT-relevant JVM flags to every production JVM deployment and verify alignment. Check that -XX:ActiveProcessorCount is set to match the container CPU limit integer on every JVM service. Run kubectl exec against a production pod and verify java -XshowSettings:all reports the correct processor count. Misalignment between CPU limit and JVM-detected processors is the single most common source of capacity headroom overestimation in containerised JVM deployments.
Deploy the SOT drift detection recording rule and alert against your current load test data. Use the p95 latency at current RPS as the drift signal. If p95 latency is already elevated at throughput levels that should be well below the SOT ceiling, SOT has drifted downward since the last derivation: the HPA target is optimistic and the service is operating with less safety margin than the configuration implies.
Add sre.internal/sot-value, sre.internal/sot-derived-from, and sre.internal/sot-next-review annotations to every HPA resource. Even if the values are estimates rather than empirically derived, the act of annotating creates the documentation anchor for the conversation about re-derivation. A Kyverno policy that generates a ticket when sot-next-review is in the past enforces the review cadence without requiring anyone to remember to check.
"CPU percentage tells you how hard your infrastructure is working. Safe Operating Throughput tells you how close your service is to the edge of what it has promised its users. These are not the same number. In the gap between them lives every capacity incident that was predicted by the wrong metric, triggered by the right load, and owned by the team that was measuring resource utilisation when they should have been measuring reliability margin."