How we redesigned our reliability engineering stack on Google Kubernetes Engine — and the SLO framework that changed how our team thinks about uptime.
## Context
Eighteen months ago, our SRE team was managing reliability the old-fashioned way: dashboards nobody looked at, on-call rotations driven by alert fatigue, and postmortems that produced action items nobody followed up on.
Then we 10x'd our traffic during a product launch. Everything broke. Not catastrophically, but in that slow, grinding way that's actually worse. Cascading latency. Partial outages. Customer-facing errors that took us 40 minutes to even notice.
We rebuilt everything. This is what we built, and more importantly, why.
## The Stack
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | GKE Autopilot | Container management, autoscaling |
| Service Mesh | Istio on GKE | Traffic management, mTLS, observability |
| Metrics | Cloud Monitoring + Prometheus | SLI/SLO tracking |
| Tracing | Cloud Trace + OpenTelemetry | Distributed request tracing |
| Logging | Cloud Logging + Log-based metrics | Structured logs, alerting |
| Incident Management | PagerDuty + custom runbooks | On-call, escalation |
## Part 1: Defining SLOs That Actually Mean Something
The most important thing we did wasn't technical; it was agreeing on what "reliable" means.
We adopted the Google SRE Book's framework: define SLIs (what you measure), set SLOs (your targets), and track your error budget (how much failure you can afford).
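For intuition, an availability SLO translates directly into a downtime allowance. A quick back-of-the-envelope calculation (illustrative, not from our codebase):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total downtime allowed per window at a given SLO."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days allows ~43.2 minutes of downtime per month.
print(round(error_budget_minutes(0.999), 1))   # 43.2
# One more nine shrinks that to ~4.3 minutes.
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

That 43 minutes is the budget the recording rules below track against.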
Here are our core SLI/SLO definitions in code, using the Prometheus recording rules format we deploy to GKE:
```yaml
# slo-rules.yaml
groups:
  - name: slo_rules
    interval: 30s
    rules:
      # SLI: Availability (% of requests that succeed)
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # SLI: Latency (% of requests under 300ms)
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
      # Availability over the full SLO window, for budget accounting
      - record: sli:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))
      # Error budget remaining (SLO = 99.9% availability over 30 days)
      - record: slo:error_budget:remaining
        expr: |
          1 - (
            (1 - sli:availability:ratio_rate30d) / (1 - 0.999)
          )
```
And the corresponding alerting policy:
```yaml
# alert-rules.yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 2% of the monthly budget consumed within 1 hour
      - alert: HighErrorBudgetBurn
        expr: |
          (slo:error_budget:remaining offset 1h) - slo:error_budget:remaining > 0.02
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Burning error budget at critical rate"
          description: "{{ $value | humanizePercentage }} of the monthly error budget consumed in the last hour."
          runbook: "https://wiki.internal/runbooks/high-error-budget-burn"
      # Slow burn: 5% of the monthly budget consumed within 6 hours
      - alert: ElevatedErrorBudgetBurn
        expr: |
          (slo:error_budget:remaining offset 6h) - slo:error_budget:remaining > 0.05
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Elevated error budget consumption"
          runbook: "https://wiki.internal/runbooks/elevated-burn"
```
💡 Key insight: Before SLOs, we'd page on-call for any spike in errors, even if it lasted 30 seconds. Burn-rate alerting means we only wake people up when budget consumption threatens our monthly target. On-call load dropped 70%.
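The arithmetic behind those thresholds: a burn rate of 1.0 exhausts the budget exactly at the end of the 30-day window, and "2% in 1 hour" corresponds to the fast-burn rate of ~14.4 from the SRE Workbook. A quick sketch (function name is ours, purely illustrative):

```python
def burn_rate(budget_fraction: float, hours: float, window_days: int = 30) -> float:
    """Burn rate implied by consuming `budget_fraction` of the error
    budget in `hours`. A rate of 1.0 spends the budget exactly over
    the full window; higher means faster exhaustion."""
    return budget_fraction * (window_days * 24) / hours

# Fast-burn alert: 2% of budget in 1 hour -> burn rate ~14.4
print(round(burn_rate(0.02, 1), 1))  # 14.4
# Slow-burn alert: 5% in 6 hours -> burn rate ~6.0
print(round(burn_rate(0.05, 6), 1))  # 6.0
```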
## Part 2: GKE Autopilot for Hands-Off Infrastructure
We moved from self-managed GKE Standard to GKE Autopilot, and it was the right call for our scale.
Autopilot handles node provisioning, sizing, and security hardening automatically. You define what you need; GKE figures out how to run it.
Here's our typical service deployment manifest:
```yaml
# service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service
    version: v2.3.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v2.3.1
      annotations:
        # Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api-service
          image: gcr.io/my-project/api-service:v2.3.1
          ports:
            - containerPort: 8080
          # Resource requests drive Autopilot node sizing
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          # Readiness/liveness probes are non-negotiable
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown before scaling down
```
💡 Lesson: Set `scaleDown.stabilizationWindowSeconds` to at least 300 seconds. We had premature scale-downs during traffic waves that caused latency spikes as pods restarted. The 5-minute cooldown eliminated this.
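Under the hood, the HPA's core formula is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, and the scale-down stabilization window takes the highest recommendation seen over the window before removing pods. A simplified sketch of that logic (illustrative only, not the controller's actual code):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 50) -> int:
    """Core HPA formula: ceil(current * currentUtil / targetUtil),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

def stabilized_scale_down(recommendations: list[int]) -> int:
    """During scale-down, the HPA acts on the highest recommendation
    in the stabilization window, so a brief traffic dip doesn't kill
    pods that will be needed seconds later."""
    return max(recommendations)

# 10 pods at 90% CPU against a 60% target -> scale up to 15
print(desired_replicas(10, 90, 60))            # 15
# Recommendations over the 5-min window: the dip to 4 is ignored
print(stabilized_scale_down([15, 12, 4, 14]))  # 15
```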
## Part 3: Istio for Traffic Management and Reliability Patterns
Istio gives us reliability primitives, such as circuit breaking, retries, and timeouts, at the infrastructure layer, so services don't have to implement them individually.
```yaml
# istio-destinationrule.yaml — circuit breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-service
  namespace: production
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Circuit breaker
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
---
# istio-virtualservice.yaml — retries + timeouts
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-service
  namespace: production
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            port:
              number: 8080
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "gateway-error,connect-failure,retriable-4xx"
```
💡 Lesson: Be careful with retry configuration. Retrying on `5xx` without thought can amplify load on an already-struggling downstream service. We scope retries to idempotent operations only and never retry `POST` endpoints that modify state.
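The amplification risk compounds with call depth: if every layer in a call chain retries independently, a single user request can fan out exponentially at the deepest service. A quick illustration of why we keep retries shallow:

```python
def worst_case_attempts(attempts_per_hop: int, depth: int) -> int:
    """Worst-case request amplification when every layer in an
    N-deep call chain retries independently on failure."""
    return attempts_per_hop ** depth

# 3 attempts per hop across a 4-service call chain: one user request
# can become 81 attempts hitting the deepest (already failing) service.
print(worst_case_attempts(3, 4))  # 81
```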
## Part 4: Structured Logging for Faster Incident Response
When you're in an incident, log quality is everything. We standardized on structured JSON logs emitted via our shared logging library:
```python
import json
import time
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")
span_id_var: ContextVar[str] = ContextVar("span_id", default="")

class StructuredLogger:
    def __init__(self, service_name: str):
        self.service = service_name

    def _log(self, level: str, message: str, **kwargs):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "severity": level,
            "message": message,
            "service": self.service,
            "trace_id": trace_id_var.get(),
            "span_id": span_id_var.get(),
            **kwargs,
        }
        print(json.dumps(entry))

    def info(self, message: str, **kwargs):
        self._log("INFO", message, **kwargs)

    def error(self, message: str, **kwargs):
        self._log("ERROR", message, **kwargs)

    def warn(self, message: str, **kwargs):
        self._log("WARNING", message, **kwargs)

# Usage
log = StructuredLogger("payment-service")
log.info(
    "Payment processed",
    user_id="u_12345",
    amount_cents=4999,
    currency="USD",
    duration_ms=142,
)
log.error(
    "Payment gateway timeout",
    user_id="u_12345",
    gateway="stripe",
    timeout_ms=3000,
    retry_attempt=2,
)
```
Cloud Logging automatically parses these JSON entries. We then create log-based metrics and wire them into our SLO dashboards:
```shell
# Create a log-based metric for payment errors.
# Note: Cloud Logging promotes the JSON "severity" key to the
# LogEntry severity field, so filter on severity, not jsonPayload.
gcloud logging metrics create payment_errors \
  --description="Count of payment processing errors" \
  --log-filter='
    resource.type="k8s_container"
    jsonPayload.service="payment-service"
    severity="ERROR"
  '
```
## Results
After 6 months running this stack at 10x traffic:
| Metric | Before | After |
|---|---|---|
| Mean time to detect (MTTD) | 38 minutes | 4 minutes |
| Mean time to resolve (MTTR) | 2.1 hours | 22 minutes |
| On-call pages per week | 47 | 9 |
| Monthly availability | 99.2% | 99.94% |
| Error budget consumed (avg) | N/A | 18% |
## What I'd Do Differently
- Adopt OpenTelemetry from day one. We had a mix of Cloud Trace and a legacy Zipkin setup. Migrating mid-scale was painful.
- Define SLOs before writing a line of infra code. The technical decisions flow naturally once you know what you're optimizing for.
- Invest in runbook quality early. Our fastest incident resolutions happen when the runbook is good enough that a junior engineer can follow it at 3am without escalating.
What's your team's error budget policy? Do you freeze deploys when budget hits a threshold? I'd love to compare notes in the comments.
Tags: #sre #kubernetes #gke #googlecloud #devops #platform #reliability