How we redesigned our reliability engineering stack on Google Kubernetes Engine — and the SLO framework that changed how our team thinks about uptime.
## Context
Eighteen months ago, our SRE team was managing reliability the old-fashioned way: dashboards nobody looked at, on-call rotations driven by alert fatigue, and postmortems that produced action items nobody followed up on.
Then we 10x'd our traffic during a product launch. Everything broke. Not catastrophically, but in that slow, grinding way that's actually worse. Cascading latency. Partial outages. Customer-facing errors that took us 40 minutes to even notice.
We rebuilt everything. This is what we built, and more importantly, why.
## The Stack
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | GKE Autopilot | Container management, autoscaling |
| Service Mesh | Istio on GKE | Traffic management, mTLS, observability |
| Metrics | Cloud Monitoring + Prometheus | SLI/SLO tracking |
| Tracing | Cloud Trace + OpenTelemetry | Distributed request tracing |
| Logging | Cloud Logging + Log-based metrics | Structured logs, alerting |
| Incident Management | PagerDuty + custom runbooks | On-call, escalation |
## Part 1: Defining SLOs That Actually Mean Something
The most important thing we did wasn't technical; it was agreeing on what "reliable" means.
We adopted the Google SRE Book's framework: define SLIs (what you measure), set SLOs (your targets), and track your error budget (how much failure you can afford).
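For intuition, an availability SLO translates directly into a downtime allowance. A quick back-of-the-envelope calculation (illustrative, not from our codebase):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total downtime allowed per window at a given SLO."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days allows ~43.2 minutes of downtime per month.
print(round(error_budget_minutes(0.999), 1))   # 43.2
# One more nine shrinks that to ~4.3 minutes.
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

That 43 minutes is the budget the recording rules below track against.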
Here are our core SLI/SLO definitions in code, using the Prometheus recording rules format we deploy to GKE:
```yaml
# slo-rules.yaml
groups:
  - name: slo_rules
    interval: 30s
    rules:
      # SLI: Availability (% of requests that succeed)
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # SLI: Latency (% of requests under 300ms)
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
      # Availability over the full SLO window, for budget accounting
      - record: sli:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))
      # Error budget remaining (SLO = 99.9% availability over 30 days)
      - record: slo:error_budget:remaining
        expr: |
          1 - (
            (1 - sli:availability:ratio_rate30d) / (1 - 0.999)
          )
```
And the corresponding alerting policy:
```yaml
# alert-rules.yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 2% of the monthly budget consumed within 1 hour
      - alert: HighErrorBudgetBurn
        expr: |
          (slo:error_budget:remaining offset 1h) - slo:error_budget:remaining > 0.02
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Burning error budget at critical rate"
          description: "{{ $value | humanizePercentage }} of the monthly error budget consumed in the last hour."
          runbook: "https://wiki.internal/runbooks/high-error-budget-burn"
      # Slow burn: 5% of the monthly budget consumed within 6 hours
      - alert: ElevatedErrorBudgetBurn
        expr: |
          (slo:error_budget:remaining offset 6h) - slo:error_budget:remaining > 0.05
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Elevated error budget consumption"
          runbook: "https://wiki.internal/runbooks/elevated-burn"
```
💡 Key insight: Before SLOs, we'd page on-call for any spike in errors, even if it lasted 30 seconds. Burn-rate alerting means we only wake people up when budget consumption threatens our monthly target. On-call load dropped 70%.
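The arithmetic behind those thresholds: a burn rate of 1.0 exhausts the budget exactly at the end of the 30-day window, and "2% in 1 hour" corresponds to the fast-burn rate of ~14.4 from the SRE Workbook. A quick sketch (function name is ours, purely illustrative):

```python
def burn_rate(budget_fraction: float, hours: float, window_days: int = 30) -> float:
    """Burn rate implied by consuming `budget_fraction` of the error
    budget in `hours`. A rate of 1.0 spends the budget exactly over
    the full window; higher means faster exhaustion."""
    return budget_fraction * (window_days * 24) / hours

# Fast-burn alert: 2% of budget in 1 hour -> burn rate ~14.4
print(round(burn_rate(0.02, 1), 1))  # 14.4
# Slow-burn alert: 5% in 6 hours -> burn rate ~6.0
print(round(burn_rate(0.05, 6), 1))  # 6.0
```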
## Part 2: GKE Autopilot for Hands-Off Infrastructure
We moved from self-managed GKE Standard to GKE Autopilot, and it was the right call for our scale.
Autopilot handles node provisioning, sizing, and security hardening automatically. You define what you need; GKE figures out how to run it.
Here's our typical service deployment manifest:
```yaml
# service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service
    version: v2.3.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v2.3.1
      annotations:
        # Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api-service
          image: gcr.io/my-project/api-service:v2.3.1
          ports:
            - containerPort: 8080
          # Resource requests drive Autopilot node sizing
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          # Readiness/liveness probes are non-negotiable
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown before scaling down
```
💡 Lesson: Set `scaleDown.stabilizationWindowSeconds` to at least 300 seconds. We had premature scale-downs during traffic waves that caused latency spikes as pods restarted. The 5-minute cooldown eliminated this.
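Under the hood, the HPA's core formula is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, and the scale-down stabilization window takes the highest recommendation seen over the window before removing pods. A simplified sketch of that logic (illustrative only, not the controller's actual code):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 50) -> int:
    """Core HPA formula: ceil(current * currentUtil / targetUtil),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

def stabilized_scale_down(recommendations: list[int]) -> int:
    """During scale-down, the HPA acts on the highest recommendation
    in the stabilization window, so a brief traffic dip doesn't kill
    pods that will be needed seconds later."""
    return max(recommendations)

# 10 pods at 90% CPU against a 60% target -> scale up to 15
print(desired_replicas(10, 90, 60))            # 15
# Recommendations over the 5-min window: the dip to 4 is ignored
print(stabilized_scale_down([15, 12, 4, 14]))  # 15
```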
## Part 3: Istio for Traffic Management and Reliability Patterns
Istio gives us reliability primitives, such as circuit breaking, retries, and timeouts, at the infrastructure layer, so services don't have to implement them individually.
```yaml
# istio-destinationrule.yaml — circuit breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-service
  namespace: production
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Circuit breaker
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
---
# istio-virtualservice.yaml — retries + timeouts
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-service
  namespace: production
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            port:
              number: 8080
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "gateway-error,connect-failure,retriable-4xx"
```
💡 Lesson: Be careful with retry configuration. Retrying on `5xx` without thought can amplify load on an already-struggling downstream service. We scope retries to idempotent operations only and never retry `POST` endpoints that modify state.
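The amplification risk compounds with call depth: if every layer in a call chain retries independently, a single user request can fan out exponentially at the deepest service. A quick illustration of why we keep retries shallow:

```python
def worst_case_attempts(attempts_per_hop: int, depth: int) -> int:
    """Worst-case request amplification when every layer in an
    N-deep call chain retries independently on failure."""
    return attempts_per_hop ** depth

# 3 attempts per hop across a 4-service call chain: one user request
# can become 81 attempts hitting the deepest (already failing) service.
print(worst_case_attempts(3, 4))  # 81
```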
## Part 4: Structured Logging for Faster Incident Response
When you're in an incident, log quality is everything. We standardized on structured JSON logs emitted via our shared logging library:
```python
import json
import time
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")
span_id_var: ContextVar[str] = ContextVar("span_id", default="")

class StructuredLogger:
    def __init__(self, service_name: str):
        self.service = service_name

    def _log(self, level: str, message: str, **kwargs):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "severity": level,
            "message": message,
            "service": self.service,
            "trace_id": trace_id_var.get(),
            "span_id": span_id_var.get(),
            **kwargs,
        }
        print(json.dumps(entry))

    def info(self, message: str, **kwargs):
        self._log("INFO", message, **kwargs)

    def error(self, message: str, **kwargs):
        self._log("ERROR", message, **kwargs)

    def warn(self, message: str, **kwargs):
        self._log("WARNING", message, **kwargs)

# Usage
log = StructuredLogger("payment-service")
log.info(
    "Payment processed",
    user_id="u_12345",
    amount_cents=4999,
    currency="USD",
    duration_ms=142,
)
log.error(
    "Payment gateway timeout",
    user_id="u_12345",
    gateway="stripe",
    timeout_ms=3000,
    retry_attempt=2,
)
```
Cloud Logging automatically parses these JSON entries. We then create log-based metrics and wire them into our SLO dashboards:
```shell
# Create a log-based metric for payment errors.
# Note: Cloud Logging promotes the JSON "severity" key to the
# LogEntry severity field, so filter on severity, not jsonPayload.
gcloud logging metrics create payment_errors \
  --description="Count of payment processing errors" \
  --log-filter='
    resource.type="k8s_container"
    jsonPayload.service="payment-service"
    severity="ERROR"
  '
```
## Results
After 6 months running this stack at 10x traffic:
| Metric | Before | After |
|---|---|---|
| Mean time to detect (MTTD) | 38 minutes | 4 minutes |
| Mean time to resolve (MTTR) | 2.1 hours | 22 minutes |
| On-call pages per week | 47 | 9 |
| Monthly availability | 99.2% | 99.94% |
| Error budget consumed (avg) | N/A | 18% |
## What I'd Do Differently
- Adopt OpenTelemetry from day one. We had a mix of Cloud Trace and a legacy Zipkin setup. Migrating mid-scale was painful.
- Define SLOs before writing a line of infra code. The technical decisions flow naturally once you know what you're optimizing for.
- Invest in runbook quality early. Our fastest incident resolutions happen when the runbook is good enough that a junior engineer can follow it at 3am without escalating.
What's your team's error budget policy? Do you freeze deploys when budget hits a threshold? I'd love to compare notes in the comments.
Tags: #sre #kubernetes #gke #googlecloud #devops #platform #reliability