Nijo George Payyappilly

Posted on Jun 8

Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization

#sre #kubernetes #devops #reliability

In the summer of 2016, Pokémon GO launched to a user base roughly fifty times larger than its capacity planning had anticipated. The engineering team had done load testing. They had throughput thresholds. They had autoscaling configured. Within hours of launch, the service was degraded globally — not because the infrastructure could not scale, but because it scaled too slowly against an arrival rate that exceeded every modelled scenario, and because the metric that was driving scaling decisions (CPU utilisation) lagged behind the actual saturation signal by several minutes. By the time CPU registered critical, the request queue had already grown to the point where p99 latency had crossed into the range where users were abandoning sessions faster than new sessions were being created.

The engineering post-mortem identified the same root cause that appears in the post-mortems of most capacity-related incidents: the organisation's operational metrics were measuring how hard the infrastructure was working, not how much work the service could safely accept. CPU percentage is a resource utilisation metric. Memory percentage is a resource utilisation metric. IOPS is a resource utilisation metric. None of them is a service throughput metric. None of them tells you, with precision, at what arrival rate your SLO begins to degrade.

Safe Operating Throughput is that metric. It is not a new concept in queueing theory or systems engineering — the idea of a safe operating ceiling predates modern distributed systems. What is new is its treatment as a first-class SRE metric: formally derived from load test data and SLO targets, continuously monitored for drift, and operationally enforced as a constraint in autoscaling configuration, capacity planning decisions, and deployment pipeline gates.

Why Existing Capacity Metrics Are Insufficient

The canonical capacity management approach in most organisations works like this: observe CPU or memory utilisation, set an autoscaling threshold (typically 70–80%), and configure the Kubernetes Horizontal Pod Autoscaler (HPA) to scale up when that threshold is breached. This approach has three structural problems.

Problem 1 — Resource metrics are lagging indicators. Under JVM workloads, a garbage collection pause can cause request queue depth to spike and p99 latency to breach SLO bounds while CPU utilisation is briefly low — because the GC is pausing application threads, not consuming CPU. The HPA threshold is not breached. The scaling event does not fire. Users experience degraded service that the autoscaler cannot see.

Problem 2 — Resource metrics do not encode SLO position. A service running at 75% CPU utilisation may be well within its SLO targets or may be breaching them, depending on its request mix, its dependency latency profile, and its thread pool configuration. The CPU number alone carries no information about which situation applies. SOT, derived from load tests run against the actual SLO targets, encodes exactly that information: it is the throughput at which the service is known to be within its SLO bounds, with an explicit safety margin.

Problem 3 — Resource metrics produce the wrong HPA input. Scaling on CPU means the autoscaler is responding to how much work is currently being done, not to how much more work is arriving. By the time CPU crosses the scaling threshold, the system is already under load. The cold-start latency of new replicas — JVM warm-up, connection pool establishment, Istio sidecar certificate negotiation — means that scaling events triggered by resource metrics consistently lag behind the demand curve they are responding to.

The core definition: Safe Operating Throughput is the maximum sustained request arrival rate at which a service can maintain all of its SLO targets — availability, latency, and error rate — under realistic production conditions, including representative request mix, dependency latency profiles, and infrastructure overhead. It is expressed in requests per second per replica, enabling direct use as an HPA target metric.

Formal Derivation: Little's Law and the SLO-Anchored Ceiling

The theoretical foundation for SOT derivation is Little's Law, one of the most robust results in queueing theory:

────────────────────────────────────────────────────────────────────────────
LITTLE'S LAW

  L = λ × W

  Where:
    L  = average number of requests concurrently in the system
    λ  = average arrival rate (requests per second)
    W  = average time a request spends in the system (seconds)
         (service time + queue wait time)

────────────────────────────────────────────────────────────────────────────
IMPLICATION FOR SOT DERIVATION:

  For a service with maximum concurrency ceiling C
  (thread pool size, connection pool limit, or async worker count):

    Maximum theoretical throughput = C / W

  At this ceiling, all concurrency slots are occupied on average.
  Beyond it, requests begin queuing — and W starts increasing,
  which reduces throughput further. This is the saturation knee.

  SOT = Safety Factor × (C / W_baseline)

  Where:
    W_baseline  = average response time at low load (measured)
    C           = effective concurrency limit (measured or configured)
    Safety Factor = 0.75–0.85 (accounts for GC pauses, burst variance,
                  Istio mTLS overhead, OTel agent overhead)

────────────────────────────────────────────────────────────────────────────
WORKED EXAMPLE:

  Service: payments-api (JVM, Spring Boot, Tomcat thread pool)
  Thread pool size (C):      200 threads
  Baseline response time (W): 45ms = 0.045s (measured at 10% load)
  Theoretical max throughput: 200 / 0.045 = 4,444 RPS

  Load test results:
    At 3,000 RPS: p95 latency = 112ms  ✓ within SLO (< 300ms)
    At 3,500 RPS: p95 latency = 198ms  ✓ within SLO
    At 4,000 RPS: p95 latency = 347ms  ✗ SLO breach begins
    At 4,200 RPS: error rate  = 0.15%  ✗ error budget burning at 3×

  SLO breach threshold (empirical): ~3,800 RPS per service instance
  SOT = 0.80 × 3,800 = 3,040 RPS per replica  (safety factor 0.80, 20% headroom)

  HPA target: 3,040 RPS per replica → scale up before SLO risk materialises
────────────────────────────────────────────────────────────────────────────

The 0.80 safety factor is not arbitrary. The 20% of headroom it leaves below the breach threshold absorbs three concurrent sources of throughput variance: request mix variation (some requests are more expensive than others), GC pause-induced latency spikes (which temporarily reduce effective throughput), and the cold-start latency window during which new replicas are being initialised but not yet serving traffic. An organisation with highly consistent request mix and minimal GC pressure may use 0.85; one with high variance or bursty traffic profiles should use 0.75 or lower.
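The arithmetic in the worked example can be reproduced in a few lines. This is a minimal sketch, not part of any tooling described here: the function names are illustrative, and the inputs are the payments-api figures from the worked example (C = 200, W = 45ms, empirical breach threshold of 3,800 RPS, safety factor 0.80).

```python
def littles_law_ceiling(concurrency: int, baseline_seconds: float) -> float:
    """Theoretical maximum throughput C / W, in requests per second."""
    if concurrency <= 0 or baseline_seconds <= 0:
        raise ValueError("concurrency and baseline_seconds must be positive")
    return concurrency / baseline_seconds


def safe_operating_throughput(breach_rps: float, safety_factor: float = 0.80) -> float:
    """SOT = safety factor x empirical SLO breach threshold."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return safety_factor * breach_rps


# Figures taken from the payments-api worked example.
ceiling = littles_law_ceiling(concurrency=200, baseline_seconds=0.045)
sot = safe_operating_throughput(breach_rps=3800, safety_factor=0.80)

print(f"theoretical ceiling: {ceiling:.0f} RPS")
print(f"SOT: {sot:.0f} RPS per replica")
```

The gap between the two numbers is the point: the theoretical ceiling comes from configuration alone, while SOT is anchored to the throughput at which the SLO was observed to break.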

Load Test Design for SOT Derivation

SOT is only as valid as the load test that derives it. A load test that uses synthetic requests with uniform size, uniform think time, and no downstream dependency simulation will produce a SOT that overestimates safe production throughput — sometimes dramatically. The load test protocol for SOT derivation has five mandatory design requirements.

────────────────────────────────────────────────────────────────────────────
SOT LOAD TEST DESIGN REQUIREMENTS
────────────────────────────────────────────────────────────────────────────

REQUIREMENT 1: REPRESENTATIVE REQUEST MIX
  Traffic must reflect production request distribution.
  Source: Splunk query against production access logs, last 30 days.
  Typical mix (payments-api example):
    45% GET /payment-status   (lightweight, cache-friendly)
    30% POST /payment-initiate (heavyweight, synchronous DB write)
    15% GET /payment-history  (medium, paginated DB read)
    10% POST /payment-refund  (heavyweight, multi-step saga)
  A load test using only GET /health is not a SOT derivation;
  it is a health check stress test.

REQUIREMENT 2: RAMP PROTOCOL (STEP LOAD, NOT SPIKE)
  Use stepped ramp increments of 10–15% throughput increase,
  holding each step for ≥ 5 minutes before advancing.
  Rationale: JVM JIT compilation and connection pool warm-up
  require sustained load before steady-state performance stabilises.
  A spike load test measures cold-start behaviour, not sustained SOT.

REQUIREMENT 3: SLO METRICS AS PASS/FAIL GATES
  The load test terminates at the step where SLO targets are first breached.
  Gate 1: p95 latency must remain < [SLO latency threshold]
  Gate 2: error rate must remain < [1 - SLO availability target]
  Gate 3: error budget burn rate must remain < 3× (ticket tier)
  SOT threshold = the highest throughput step where all three gates pass.

REQUIREMENT 4: DEPENDENCY SIMULATION
  Downstream service latency must be simulated at realistic P50/P95 values,
  not at ideally-low stub values. A payments-api that calls a card-network
  gateway at P50=80ms in production should call a stub at P50=80ms in the
  load test. Understating dependency latency understates W in Little's Law
  and overstates the SOT ceiling.

REQUIREMENT 5: INFRASTRUCTURE PARITY
  The test environment must match production:
    → Same JVM flags (heap size, GC algorithm, ActiveProcessorCount)
    → Same CPU and memory limits (Kubernetes resource requests/limits)
    → Istio sidecar ENABLED in STRICT mTLS mode (not bypassed)
    → OTel agent ENABLED (not disabled for "performance testing")
    → Same replica count as production minimum (not a single instance)
  A deviation on any of these points produces a SOT that does not apply to production.
────────────────────────────────────────────────────────────────────────────

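<?xml version="1.0" encoding="UTF-8"?>
<!-- JMeter Test Plan: SOT Derivation Protocol -->
<!-- Stepped ramp load test with SLO-anchored pass/fail gates -->
<!-- Schematic outline: a .jmx saved by JMeter places each element's children in a sibling hashTree and stores percentThroughput in a FloatProperty element -->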
<jmeterTestPlan version="1.2">
  <hashTree>
    <TestPlan testname="SOT Derivation — payments-api">
      <hashTree>

        <!-- Stepped Throughput Controller: 500 → 1000 → 1500 → ... RPS -->
        <ThreadGroup testname="Stepped Load Ramp">
          <!-- Each step: target threads × ramp duration × hold duration -->
          <!-- Step 1: 500 RPS for 5 minutes (warm-up) -->
          <!-- Step 2: 1000 RPS for 5 minutes -->
          <!-- Step 3: 1500 RPS — continue until SLO gate fails -->
          <stringProp name="ThreadGroup.num_threads">300</stringProp>
          <stringProp name="ThreadGroup.ramp_time">30</stringProp>

          <hashTree>
            <!-- Weighted request mix matching production distribution -->
            <ThroughputController testname="GET /payment-status (45%)">
              <boolProp name="ThroughputController.perThread">false</boolProp>
              <floatProp name="ThroughputController.percentThroughput">45</floatProp>
            </ThroughputController>

            <ThroughputController testname="POST /payment-initiate (30%)">
              <floatProp name="ThroughputController.percentThroughput">30</floatProp>
            </ThroughputController>

            <ThroughputController testname="GET /payment-history (15%)">
              <floatProp name="ThroughputController.percentThroughput">15</floatProp>
            </ThroughputController>

            <ThroughputController testname="POST /payment-refund (10%)">
              <floatProp name="ThroughputController.percentThroughput">10</floatProp>
            </ThroughputController>

            <!-- SLO gate input: records every sample to CSV; the p95 < 300ms check is evaluated per step from this file -->
            <ResultCollector testname="SLO Gate — Latency">
              <stringProp name="filename">sot-results.csv</stringProp>
            </ResultCollector>
          </hashTree>
        </ThreadGroup>

        <!-- Backend Listener: stream results to Splunk HEC in real time -->
        <BackendListener testname="Splunk Real-Time Metrics">
          <stringProp name="classname">
            org.apache.jmeter.visualizers.backend.influxdb.InfluxdbBackendListenerClient
          </stringProp>
          <!-- Configure to forward to Splunk via InfluxDB line protocol proxy -->
        </BackendListener>

      </hashTree>
    </TestPlan>
  </hashTree>
</jmeterTestPlan>
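The three gates in Requirement 3 can be evaluated offline once each step's results have been aggregated. The sketch below assumes a hypothetical `StepResult` record per load step. The p95 values come from the worked example; the error rates and burn rates are made-up placeholders, and the 0.05% error ceiling assumes a 99.95% availability SLO.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepResult:
    rps: int            # offered load held during this step
    p95_ms: float       # measured p95 latency for the step
    error_rate: float   # fraction of failed requests, e.g. 0.0015 = 0.15%
    burn_rate: float    # error budget burn-rate multiple


def passes_gates(step: StepResult,
                 p95_limit_ms: float = 300.0,
                 max_error_rate: float = 0.0005,
                 max_burn_rate: float = 3.0) -> bool:
    """Gate 1 (latency), Gate 2 (error rate) and Gate 3 (burn rate)."""
    return (step.p95_ms < p95_limit_ms
            and step.error_rate < max_error_rate
            and step.burn_rate < max_burn_rate)


def sot_threshold(steps: Sequence[StepResult]) -> Optional[int]:
    """Highest throughput step reached before the first gate failure."""
    last_pass = None
    for step in sorted(steps, key=lambda s: s.rps):
        if not passes_gates(step):
            break  # the test terminates at the first breached step
        last_pass = step.rps
    return last_pass


# p95 values from the worked example; error and burn rates are hypothetical.
steps = [
    StepResult(rps=3000, p95_ms=112.0, error_rate=0.0001, burn_rate=0.5),
    StepResult(rps=3500, p95_ms=198.0, error_rate=0.0002, burn_rate=1.0),
    StepResult(rps=4000, p95_ms=347.0, error_rate=0.0004, burn_rate=2.0),
]
print(sot_threshold(steps))
```

With 500 RPS steps the function can only report the last passing step. Narrowing towards a figure like the ~3,800 RPS quoted earlier would need finer steps between 3,500 and 4,000 RPS.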

JVM-Specific Considerations

JVM services require two non-obvious adjustments to the SOT derivation protocol. Both are sources of systematic error when overlooked.

OTel Agent Memory Overhead

The OpenTelemetry Java agent adds 100–200 MB of heap pressure under production-representative load. This overhead comes from span buffer allocation, metric exemplar storage, and the agent's own internal telemetry. A load test run without the OTel agent will measure a SOT that is optimistic by the amount of throughput reduction that heap pressure introduces — typically 5–15% at production trace sampling rates.

The OTel agent must be enabled during SOT load tests at the same sampling rate as production. Disabling it "to get clean performance numbers" produces numbers that do not apply to the system that will actually run in production.

CPU Limit and ActiveProcessorCount Alignment

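The JVM determines the size of its internal thread pools (GC threads, ForkJoinPool workers, Netty event loop threads) based on the number of available processors it detects at startup. Container-aware JVMs (JDK 10 and later, and 8u191 and later, where container support is on by default) derive that number from the container's CPU limit. Older JVMs, and JVMs whose container detection is disabled or does not recognise the host's cgroup version, read the host's processor count instead. Setting the value explicitly removes the dependency on detection.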

────────────────────────────────────────────────────────────────────────────
CPU LIMIT vs ACTIVEPROCESSORCOUNT MISALIGNMENT

  Scenario:
    Node CPU count:        32 cores
    Container CPU limit:   2 cores
    JVM detected CPUs:     32  (reads host, not container)

  Consequence (HotSpot and Netty defaults for 32 detected CPUs):
    ForkJoinPool common pool:  31 workers  (1 with 2 CPUs)
    Parallel GC threads:       23          (2 with 2 CPUs)
    Netty event loops:         64          (4 with 2 CPUs)

  Result:
    JVM sizes its thread pools for 32 cores but is limited to 2 cores of CPU time.
    CPU throttling inflates W (response time) non-linearly.
    SOT derived without this setting overestimates safe throughput
    by 20–40% in observed enterprise JVM deployments.

  Fix: Add to JVM flags in Kubernetes Deployment manifest:
    -XX:ActiveProcessorCount=2   (match container CPU limit integer)

────────────────────────────────────────────────────────────────────────────

# Kubernetes Deployment — JVM flags aligned to container CPU limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: "2"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "3Gi"    # Limit > request: headroom for GC spikes
          env:
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -XX:ActiveProcessorCount=2
                -XX:+UseG1GC
                -XX:MaxGCPauseMillis=200
                -Xms1g
                -Xmx2g
                -XX:+ExitOnOutOfMemoryError
                -javaagent:/otel/opentelemetry-javaagent.jar
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://splunk-otel-collector.monitoring.svc:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"    # 10% sampling: match this rate in load test
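To see how far pool sizes move with the detected processor count, the default sizing heuristics can be written out directly. This is a sketch under stated assumptions: it uses the commonly documented defaults (HotSpot's `ParallelGCThreads` heuristic, the ForkJoinPool common pool's processors-minus-one parallelism, and Netty's two event loops per processor). Actual values depend on JVM version, collector, and any explicit framework configuration.

```python
def parallel_gc_threads(cpus: int) -> int:
    """HotSpot default: every CPU up to 8, then 5/8 of each additional CPU."""
    return cpus if cpus <= 8 else 8 + (cpus - 8) * 5 // 8


def fork_join_common_parallelism(cpus: int) -> int:
    """ForkJoinPool common pool default: availableProcessors - 1, minimum 1."""
    return max(1, cpus - 1)


def netty_default_event_loops(cpus: int) -> int:
    """Netty default event loop count: 2 x availableProcessors."""
    return max(1, cpus * 2)


# Compare what the JVM sizes for when it sees the host (32) vs the limit (2).
for detected in (32, 2):
    print(
        f"detected CPUs={detected}: "
        f"gc={parallel_gc_threads(detected)}, "
        f"fjp={fork_join_common_parallelism(detected)}, "
        f"netty={netty_default_event_loops(detected)}"
    )
```

Running the comparison for the CPU limit you actually deploy with is a quick sanity check that `-XX:ActiveProcessorCount` is set to the value the load test assumed.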

Istio STRICT mTLS Overhead on SOT

In environments running Istio in STRICT mTLS mode, connection establishment carries an overhead that is material to SOT under specific traffic patterns. The mTLS handshake adds approximately 1–3ms per new connection. Under HTTP/2 with connection reuse (the default for gRPC and modern REST clients), this overhead is amortised across many requests and is negligible.

Under bursty traffic where the connection pool is frequently recycled — common at service startup, after circuit breaker trips, and during rolling deployments — mTLS handshake overhead can materially inflate W in Little's Law during the connection establishment phase, temporarily reducing effective throughput below the steady-state SOT.

────────────────────────────────────────────────────────────────────────────
ISTIO mTLS OVERHEAD: IMPACT ON SOT DERIVATION

  Scenario: payments-api post-rolling-deployment burst
  Connection pool size per replica: 100 connections
  mTLS handshake time per connection: 2ms
  Time to establish full connection pool: 200ms
  Incoming RPS during this window: 2,000 RPS

  Effective capacity during pool establishment:
    Available connections: 0 → 100 (linear ramp over 200ms)
    Average available connections: 50
    Effective throughput ceiling (Little's Law, W=45ms):
      50 / 0.045 = 1,111 RPS
    Throughput deficit: 2,000 - 1,111 = 889 RPS queued
    Queue growth: 889 RPS × 0.2s = 178 requests backlogged in 200ms

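  Once the pool is full, the ceiling is 100 / 0.045 = 2,222 RPS, only
  222 RPS above the arrival rate. Draining 178 queued requests takes
  ~0.8 seconds (178 / 222), four times the warm-up window, and every
  request served in that period carries added queueing delay.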

  Mitigation: SOT for post-deployment burst scenarios must include
  a connection pool warm-up adjustment factor. Configure Istio
  connection pool settings to reduce churn during rolling deployments:

────────────────────────────────────────────────────────────────────────────

# Istio DestinationRule — Connection Pool Tuning for SOT Protection
# Prevents connection pool churn from creating transient SOT violations
# during rolling deployments and circuit breaker recovery

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-api-connection-pool
  namespace: production
spec:
  host: payments-api.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 10ms
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 0    # 0 = unlimited; enable connection reuse
        maxRetries: 3
        idleTimeout: 90s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
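The backlog arithmetic from the post-deployment scenario above generalises to other pool sizes and handshake times. The sketch below follows the same simplifying assumptions as that scenario: handshakes happen one after another, and available connections ramp linearly from zero to the full pool. The function name and return shape are illustrative.

```python
def warmup_backlog(pool_size: int,
                   handshake_s: float,
                   service_time_s: float,
                   arrival_rps: float) -> dict:
    """Requests queued while a connection pool is being established."""
    warmup_s = pool_size * handshake_s          # serial handshakes
    avg_connections = pool_size / 2             # linear ramp from 0 to pool_size
    ceiling_rps = avg_connections / service_time_s   # Little's Law: C / W
    deficit_rps = max(0.0, arrival_rps - ceiling_rps)
    return {
        "warmup_s": warmup_s,
        "ceiling_rps": ceiling_rps,
        "deficit_rps": deficit_rps,
        "backlog_requests": deficit_rps * warmup_s,
    }


# Scenario figures: 100 connections, 2ms handshake, W = 45ms, 2,000 RPS arriving.
result = warmup_backlog(pool_size=100, handshake_s=0.002,
                        service_time_s=0.045, arrival_rps=2000)
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

Halving the handshake time or pre-warming half the pool both halve the backlog in this model, which is a useful way to reason about which mitigation is worth the effort.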

SOT as the Input to HPA Configuration

The derivation of SOT is half the work. The operationalisation of SOT as a live autoscaling constraint is where it becomes a first-class metric. The HPA target value is derived directly from SOT, not from CPU thresholds.

# HPA configured from SOT derivation output
# SOT = 3,040 RPS per replica (derived above)
# HPA target = SOT value directly
# When average RPS per replica exceeds 3,040, scale out

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-sot-hpa
  namespace: production
  annotations:
    sre.internal/sot-value: "3040"
    sre.internal/sot-derived-from: "load-test-2025-Q1"
    sre.internal/sot-slo-target: "99.95%-availability-300ms-p95"
    sre.internal/sot-safety-margin: "0.80"
    sre.internal/sot-next-review: "2025-Q2"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "3040"    # SOT value: scale before SLO risk materialises
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

The annotations on the HPA resource are operational documentation: they record where the SOT value came from, which SLO it was derived against, what safety margin was applied, and when it should next be re-derived. Without this documentation, SOT values become magic numbers in configuration files: present but inexplicable, and never updated because no one remembers what they represent.
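It helps to see what the HPA does with that target. The sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target), with the default 10% tolerance and the min/max bounds from the manifest above. It deliberately ignores the `behavior` policies, stabilisation windows, and pod readiness handling, so treat it as an approximation of a single evaluation rather than a model of the controller.

```python
import math


def desired_replicas(current_replicas: int,
                     current_rps_per_replica: float,
                     target_rps_per_replica: float = 3040.0,
                     min_replicas: int = 3,
                     max_replicas: int = 60,
                     tolerance: float = 0.10) -> int:
    """One HPA evaluation for an AverageValue pods metric."""
    ratio = current_rps_per_replica / target_rps_per_replica
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, 4000.0))  # per-replica load above SOT: scale out
print(desired_replicas(10, 3100.0))  # within 10% of SOT: hold
print(desired_replicas(10, 1000.0))  # well below SOT: scale in
```

One consequence worth noting: because of the tolerance band, per-replica throughput can sit up to roughly 10% above the SOT target before a scale-out is triggered, which is another reason the safety factor exists.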

SOT Drift: How Safe Throughput Changes Over Time

SOT is not a static value. It drifts as the service evolves, and undetected SOT drift is the mechanism by which a well-tuned autoscaling configuration becomes dangerously mis-calibrated over time.

────────────────────────────────────────────────────────────────────────────
SOT DRIFT SOURCES

  Code changes:
    New feature adds a synchronous downstream call → W increases → SOT decreases
    Database query optimisation → W decreases → SOT increases (budget grows)
    ORM N+1 query introduced → W increases non-linearly under load → SOT drops

  Dependency changes:
    Downstream service degrades from P50=80ms to P50=150ms → W increases
    New rate limit on external API → effective concurrency ceiling C decreases

  Infrastructure changes:
    CPU limit reduced in cost-optimisation exercise → ActiveProcessorCount effect
    Memory limit reduced → more frequent GC → GC pause inflation of W
    Istio sidecar version upgrade → connection handling changes

  Traffic mix changes:
    New client sends 3× more POST /payment-refund (expensive endpoint)
    → Effective W increases even with no code changes
    → SOT derived from old traffic mix no longer applies

────────────────────────────────────────────────────────────────────────────
SOT DRIFT DETECTION: Prometheus Recording Rule

  Continuously compare observed service throughput at SLO-boundary latency
  against the SOT value stored in the HPA annotation.
  Divergence > 15% = SOT re-derivation required.
────────────────────────────────────────────────────────────────────────────

# Prometheus Recording Rules — SOT Drift Detection
# Monitors the gap between observed throughput-at-SLO-boundary
# and the configured SOT value in the HPA

groups:
  - name: sot.drift_detection
    interval: 60s
    rules:

      # Current RPS per replica — the live throughput signal
      - record: sot:current_rps_per_replica:rate2m
        expr: |
          sum(
            rate(istio_requests_total{
              destination_service_name="payments-api",
              reporter="destination"
            }[2m])
          )
          /
          count(
            kube_pod_info{
              namespace="production",
              pod=~"payments-api-.*"
            }
          )

      # p95 latency trend at current throughput
      - record: sot:p95_latency_at_current_rps:seconds
        expr: |
          histogram_quantile(0.95,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_service_name="payments-api",
              reporter="destination"
            }[5m])) by (le)
          ) / 1000

      # SOT utilisation: actual RPS vs configured SOT ceiling
      # Values approaching 1.0 indicate the HPA is scaling near the SOT boundary
      # Values > 1.0 during load indicate SOT may have drifted downward
      - record: sot:utilisation_ratio:rate2m
        expr: |
          sot:current_rps_per_replica:rate2m
          /
          3040    # Configured SOT value — update when HPA annotation changes

      # SOT Drift Alert: p95 latency within 50ms of the 300ms SLO threshold
      # (0.25s early-warning level) at throughput previously considered safe
      - alert: SOT_DriftDetected
        expr: |
          sot:p95_latency_at_current_rps:seconds > 0.25
          AND
          sot:current_rps_per_replica:rate2m < 2800    # Below current SOT config
        for: 10m
        labels:
          severity: ticket
          domain: capacity_planning
        annotations:
          summary: >
            payments-api p95 latency at {{ $value | humanizeDuration }}
            while RPS/replica is {{ with query "sot:current_rps_per_replica:rate2m" }}
            {{ . | first | value | humanize }}{{ end }} — below configured SOT of 3,040.
            SOT may have drifted downward. Re-derivation required.
          runbook: "https://wiki.internal/sre/runbooks/sot-drift"
          load_test_trigger: "https://wiki.internal/sre/load-tests/sot-rederivation"

SOT as a Capacity Debt Signal

The relationship between SOT and capacity debt mirrors the relationship between SLO targets and error budget. When a service consistently operates at a high fraction of its SOT ceiling — above 70% of SOT on average — the organisation is accumulating capacity debt: the gap between current safe throughput and the throughput that will be demanded when the next traffic growth event occurs.

────────────────────────────────────────────────────────────────────────────
CAPACITY DEBT FRAMEWORK (SOT-Anchored)

  SOT utilisation bands:

  < 50% of SOT   → Capacity surplus. Service can absorb 2× current traffic.
                   Autoscaling min replica count may be reducible.
                   Action: consider scaling floor reduction in off-peak windows.

  50–70% of SOT  → Healthy operating band. Sufficient headroom for burst
                   traffic without SLO risk. No capacity action required.

  70–85% of SOT  → Capacity watch. At P95 traffic spike (2× average), SOT
                   ceiling will be reached. Autoscaling must fire fast enough
                   to prevent SLO breach during spike.
                   Action: review scaleUp stabilizationWindowSeconds.
                           Validate cold-start latency within SLO tolerance.

  85–100% of SOT → Capacity debt. Service is operating too close to its
                   safe ceiling for burst traffic absorption.
                   Action: increase minimum replica count to provide
                           headroom, AND schedule SOT re-derivation to
                           validate current value reflects current codebase.

  > 100% of SOT  → Active SLO risk. Throughput has exceeded the empirically
                   derived safe ceiling. Error budget consumption likely.
                   Action: immediate capacity intervention + incident review.
────────────────────────────────────────────────────────────────────────────

# Splunk Dashboard: SOT Capacity Debt Tracking
# CronJob forwards SOT utilisation to Splunk for trend analysis
# and quarterly capacity planning review

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sot-capacity-forwarder
  namespace: sre-platform
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sot-forwarder
              image: sre-platform/metrics-forwarder:v1.2.0
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring.svc:9090"
                - name: SPLUNK_HEC_URL
                  valueFrom:
                    secretKeyRef:
                      name: splunk-hec-creds
                      key: url
              # Emits to Splunk sourcetype="sre:capacity":
              # {
              #   "service": "payments-api",
              #   "sot_configured_rps": 3040,
              #   "current_rps_per_replica": 2187,
              #   "sot_utilisation_pct": 71.9,
              #   "capacity_band": "CAPACITY_WATCH",
              #   "replica_count": 12,
              #   "p95_latency_ms": 143,
              #   "slo_headroom_ms": 157,
              #   "sot_last_derived": "2025-Q1",
              #   "drift_detected": false
              # }
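The `capacity_band` field in the emitted event can be derived from the utilisation bands in the framework above. A minimal sketch follows. Only the `CAPACITY_WATCH` label appears in the example event; the other label names are hypothetical, and the handling of exact boundary values (70% and 85% fall into the watch band here) is an assumption.

```python
from typing import Tuple


def capacity_band(current_rps_per_replica: float, sot_rps: float) -> Tuple[float, str]:
    """Return (SOT utilisation in percent, capacity band label)."""
    if sot_rps <= 0:
        raise ValueError("sot_rps must be positive")
    pct = 100.0 * current_rps_per_replica / sot_rps
    if pct > 100:
        band = "ACTIVE_SLO_RISK"     # hypothetical label
    elif pct > 85:
        band = "CAPACITY_DEBT"       # hypothetical label
    elif pct >= 70:
        band = "CAPACITY_WATCH"      # label used in the example event
    elif pct >= 50:
        band = "HEALTHY"             # hypothetical label
    else:
        band = "CAPACITY_SURPLUS"    # hypothetical label
    return round(pct, 1), band


# Values from the example event: 2,187 RPS per replica against a SOT of 3,040.
print(capacity_band(2187, 3040))
```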

Automated SOT Gate in the Deployment Pipeline

SOT re-derivation should be triggered automatically when changes that are likely to affect service throughput characteristics are deployed. A deployment that adds a synchronous downstream call, changes the thread pool configuration, or modifies the OTel sampling rate should trigger a SOT re-derivation run in the performance environment before the new SOT value is propagated to the HPA configuration in production.

# Argo CD PostSync Hook — SOT Re-Derivation Trigger
# Fires after deployments that carry the sre.internal/affects-sot annotation
# Triggers a JMeter load test run in the performance environment
# Updates HPA SOT annotation if new SOT differs by > 10% from current value

apiVersion: batch/v1
kind: Job
metadata:
  name: sot-rederivation-trigger
  namespace: sre-platform
  annotations:
    argocd.argoproj.io/hook: PostSync
    # BeforeHookCreation: the previous hook Job is deleted before each new run.
    # The SOT-affect gate itself runs inside the container (execution step 1).
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: sot-automation-sa
      containers:
        - name: sot-gate
          image: sre-platform/sot-automation:v1.1.0
          env:
            - name: SERVICE_NAME
              value: "payments-api"
            - name: JMETER_CONTROLLER_URL
              value: "http://jmeter-controller.perf.svc:8080"
            - name: PERFORMANCE_ENV_NAMESPACE
              value: "performance"
            - name: SOT_CHANGE_THRESHOLD
              value: "0.10"        # Update HPA only if new SOT differs > 10% from current
            - name: HPA_UPDATE_ON_CHANGE
              value: "true"        # Auto-update HPA annotation when SOT changes
            - name: SPLUNK_HEC_URL
              valueFrom:
                secretKeyRef:
                  name: splunk-hec-creds
                  key: url
            - name: ALERT_ON_REGRESSION
              value: "true"        # Page if new SOT is lower than current (regression)
          # Execution sequence:
          # 1. Check if deployed Application has sre.internal/affects-sot: "true"
          # 2. If yes: trigger JMeter SOT derivation test in performance environment
          # 3. Wait for test completion (timeout: 45 minutes)
          # 4. Parse results: extract SOT at SLO boundary
          # 5. Apply safety margin: new_SOT = 0.80 × threshold_rps
          # 6. Compare with current HPA SOT annotation
          # 7. If delta > 10%: update HPA annotation + emit Splunk event
          # 8. If new SOT < current SOT (regression): page SRE team
          # 9. If new SOT > current SOT (improvement): update silently + ticket
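Steps 5 to 9 of the execution sequence reduce to a small decision function. The internals of the `sot-automation` image are not shown here, so the sketch below is only one plausible reading: the function name and return shape are hypothetical, and it assumes that paging and ticketing apply only when the change exceeds the 10% threshold.

```python
def sot_gate_decision(threshold_rps: float,
                      current_sot: float,
                      safety_factor: float = 0.80,
                      change_threshold: float = 0.10) -> dict:
    """Decide what to do with a freshly measured SLO-boundary throughput."""
    new_sot = safety_factor * threshold_rps                  # step 5
    delta = abs(new_sot - current_sot) / current_sot         # step 6
    update = delta > change_threshold                        # step 7
    return {
        "new_sot": new_sot,
        "delta": delta,
        "update_hpa": update,
        "page": update and new_sot < current_sot,            # step 8: regression
        "ticket": update and new_sot > current_sot,          # step 9: improvement
    }


# Hypothetical re-derivation: the SLO boundary moved from 3,800 to 3,300 RPS.
decision = sot_gate_decision(threshold_rps=3300, current_sot=3040)
print(decision)
```

Gating the page on the change threshold keeps run-to-run load test noise from waking anyone up; a stricter policy could page on any regression at all.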

Common Antipatterns

The CPU-Threshold Disguise antipattern → Configuring HPA on CPU percentage while calling it "SOT-based autoscaling" because the CPU threshold was derived from a load test. CPU threshold and SOT are not equivalent. CPU measures resource utilisation at a point in time; SOT measures the service's relationship with its SLO boundary. Under GC-heavy or IO-bound workloads they can diverge substantially, and the divergence is always in the direction of overconfidence.
The Single-Endpoint SOT antipattern → Deriving SOT from a load test that exercises only the healthiest, fastest, most cache-friendly endpoint. The SOT of a service is determined by its most expensive sustained request mix, not its fastest. A SOT derived from GET requests that ignores POST requests will overestimate safe throughput for the traffic mix that actually matters.
The Dependency-Free SOT antipattern → Running the SOT derivation load test with stubbed downstream dependencies at unrealistically low latency. The W in Little's Law is the time a request spends in the entire system, including time waiting for downstream responses. A dependency stub at 5ms when production latency is 80ms understates that component of W by 16×. If the service itself adds 20ms of processing, for example, W falls from 100ms to 25ms and the derived ceiling comes out 4× too high.
The Set-and-Forget SOT antipattern → Deriving SOT once, configuring the HPA, and never revisiting it. SOT drifts with every significant code change, dependency change, and traffic mix evolution. An HPA configured to a SOT value derived eighteen months ago may be operating with a ceiling that no longer reflects the service's actual throughput characteristics. The sre.internal/sot-next-review annotation should be enforced by a scheduled Kyverno audit policy that generates a ticket when the review date passes.
The Missing Safety Margin antipattern → Setting HPA target to the empirical SLO breach threshold rather than to 80% of that threshold. At 100% of the breach threshold, the system is one traffic spike away from SLO violation, with no headroom for the autoscaler's cold-start latency. The safety margin is not conservatism; it is the engineering compensation for the inescapable lag between demand arrival and capacity availability.

Maturity Progression

────────────────────────────────────────────────────────────────────────────
STAGE        SOT MATURITY STATE                  NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     CPU/memory-based HPA. No SOT        Capacity incidents
             concept. Load tests run             after the fact.
             periodically with no SLO            No leading capacity
             anchoring.                          signal exists.

Defined      SOT derived for critical            HPA targets updated
             services. Little's Law applied.     to SOT values. Load
             Safety margin documented.           test protocol standardised.

Measured     SOT drift detection active.         SOT utilisation tracked
             Capacity debt bands tracked         in Splunk. JVM flags
             in Splunk. SOT annotated            aligned. OTel agent
             on HPA resources.                   included in tests.

Optimised    SOT re-derivation automated         SOT gate fires
             on deploys carrying SOT-affect      automatically. Capacity
             annotation. Quarterly SOT           debt trend visible
             review cadence enforced             to leadership. Istio
             by Kyverno.                         overhead modelled.

Generative   SOT incorporated into               Capacity planning
             architectural review process.       decisions made from
             SOT regression blocks               SOT data, not from
             deployments automatically.          intuition or CPU%.
             SOT data feeds demand               New services cannot
             forecasting model.                  launch without SOT
                                                derivation complete.
────────────────────────────────────────────────────────────────────────────

Five Action Items for This Week

Run a Little's Law ceiling calculation for your most critical service before running any load test. Take your thread pool or concurrency limit C and your baseline response time W, expressed in seconds, from existing Splunk APM data. Calculate C / W. This gives the theoretical maximum throughput ceiling in requests per second for a single instance at that concurrency. If your current HPA target is anywhere near this number, your safety margin is insufficient and you have a latent capacity risk.
Audit your most recent load test against the five SOT design requirements. Was the request mix representative of production traffic distribution? Were downstream dependencies simulated at production-representative latency? Was the Istio sidecar enabled in STRICT mTLS mode? Was the OTel agent running? For each requirement not met, estimate the direction and magnitude of the SOT overestimate it produced.
Add SOT-relevant JVM flags to every production JVM deployment and verify alignment. Check that -XX:ActiveProcessorCount is set to match the container CPU limit integer on every JVM service. Run kubectl exec against a production pod and verify java -XshowSettings:all reports the correct processor count. Misalignment between CPU limit and JVM-detected processors is the single most common source of capacity headroom overestimation in containerised JVM deployments.
Deploy the SOT drift detection recording rule and alert against your current load test data. Use the p95 latency at current RPS as the drift signal. If p95 latency is already elevated at throughput levels that should be well below the SOT ceiling, SOT has drifted downward since the last derivation — the HPA target is optimistic and the service is operating with less safety margin than the configuration implies.
Add sre.internal/sot-value, sre.internal/sot-derived-from, and sre.internal/sot-next-review annotations to every HPA resource. Even if the values are estimates rather than empirically derived, the act of annotating creates the documentation anchor for the conversation about re-derivation. A Kyverno audit policy that reports a violation when sot-next-review is in the past, wired to your ticketing system, enforces the review cadence without requiring anyone to remember to check.
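The first and last action items lend themselves to a quick script. The sketch below computes the Little's Law ceiling C / W, reports how much of that ceiling an HPA target leaves unused, and checks whether a review annotation is overdue. It assumes W is supplied in seconds and that sre.internal/sot-next-review holds an ISO 8601 date (YYYY-MM-DD); the annotation format, the function names, and the example numbers are all assumptions for illustration.

```python
from datetime import date

REVIEW_ANNOTATION = "sre.internal/sot-next-review"


def little_ceiling_rps(concurrency: int, w_seconds: float) -> float:
    """Theoretical maximum throughput from Little's Law: C / W."""
    if concurrency <= 0 or w_seconds <= 0:
        raise ValueError("concurrency and w_seconds must be positive")
    return concurrency / w_seconds


def unused_fraction(hpa_target_rps: float, ceiling_rps: float) -> float:
    """Share of the theoretical ceiling that the HPA target leaves as headroom."""
    return 1.0 - hpa_target_rps / ceiling_rps


def review_overdue(annotations: dict, today: date) -> bool:
    """True when the review date has passed.

    A missing annotation is treated as overdue (an assumption of this sketch).
    """
    raw = annotations.get(REVIEW_ANNOTATION)
    if raw is None:
        return True
    return date.fromisoformat(raw) < today


if __name__ == "__main__":
    # Hypothetical service: 200 worker threads, 50 ms baseline response time.
    ceiling = little_ceiling_rps(concurrency=200, w_seconds=0.050)
    headroom = unused_fraction(hpa_target_rps=3800, ceiling_rps=ceiling)
    print(f"ceiling={ceiling:.0f} rps, headroom={headroom:.1%}")

    hpa_annotations = {REVIEW_ANNOTATION: "2024-01-01"}
    print("review overdue:", review_overdue(hpa_annotations, date.today()))
```

A target that leaves only a few percent of the theoretical ceiling unused is the "anywhere near this number" case from the first action item. The ceiling itself is not the SOT: it is the bound that any empirically derived SOT must sit below.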

"CPU percentage tells you how hard your infrastructure is working. Safe Operating Throughput tells you how close your service is to the edge of what it has promised its users. These are not the same number. In the gap between them lives every capacity incident that was predicted by the wrong metric, triggered by the right load, and owned by the team that was measuring resource utilisation when they should have been measuring reliability margin."
