Caches look healthy until they don't: cold-start storms, configuration changes that inflate TTLs, or subtle regressions in serialization can double misses overnight and send your tail latency (p99) and cloud bill through the roof. You need observable SLIs that map to user pain, instrumentation that ties those SLIs to traces and logs, dashboards that show why the SLO is trending toward breach, and playbooks that let you buy time (or budget) without guessing blindly.
Contents
- Key cache metrics and SLOs you cannot ignore
- Instrumenting caches: traces, metrics, and logs with OpenTelemetry
- Dashboards and alerts that surface real problems early
- Sizing and cost: capacity planning and cache cost-per-request math
- Practical runbook: implement an SLO-driven cache observability stack
Key cache metrics and SLOs you cannot ignore
Start with a tight set of SLIs (small, measurable, user-oriented). For caches the three anchors are p99 latency, cache hit ratio, and availability / error yield. Choose an SLO window, a target, and an error-budget policy that reflects how critical the cached workload is to customer experience. The SRE canon on SLIs/SLOs and error budgets explains why percentiles and windows matter for operational decision-making.
Core metrics to emit (names are examples — standardize across teams):
- `cache_requests_total{result="hit|miss",cache="NAME"}` — counter for all cache requests, split by `result`. Use `rate()` in PromQL to compute RPS.
- `cache_request_duration_seconds_bucket` — histogram buckets for cache GET/SET latency. Use `histogram_quantile(0.99, ...)` to compute p99 from buckets.
- `cache_memory_bytes` — gauge for used memory on the node/shard.
- `cache_items` — gauge for item cardinality if affordable (or track sampled key counts).
- `cache_evictions_total` — counter for eviction events (signals memory pressure or churn).
- `cache_errors_total` — counter for timeouts, connection errors, or rejects.
- `cache_connections` and `cache_cpu_seconds_total` — saturation signals for capacity planning.
How to compute the two SLIs you’ll act on every day:
- Cache hit ratio (SLI):
  `hit_rate = sum(rate(cache_requests_total{result="hit"}[5m])) / sum(rate(cache_requests_total[5m]))`
  This gives you an honest view of origin load reduction. A low hit ratio means higher DB load and higher cost.
- p99 latency (SLI):
  `p99 = histogram_quantile(0.99, sum(rate(cache_request_duration_seconds_bucket[5m])) by (le))`
  Histograms are the right primitive for aggregated percentiles across instances. Pick buckets that sit around your target SLO (see bucket recommendations below).
Example SLOs (templates you can adapt):
- SLO A (latency): 99% of `GET` requests served from cache complete in < 20 ms, measured on a rolling 30‑day window.
- SLO B (effectiveness): rolling 30‑day cache hit ratio ≥ 95% for the `session-cache` workload. Adjust window and target to reflect business risk and usage patterns.
Quick table: metric → SLO candidate → example alert trigger
| Metric | SLO candidate | Example SLO target | Example alert |
|---|---|---|---|
| p99 cache latency | User tail latency | p99 < 20 ms (30d) | p99 > 20 ms for 5m → page |
| Cache hit ratio | Origin-offload effectiveness | hit_ratio ≥ 95% (30d) | hit_ratio < 90% for 10m → page |
| `cache_evictions_total` | Stability | Evictions per 1M reqs < X | Eviction-rate spike and memory > 80% → page |
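To make the alert triggers in the table concrete, the error-budget burn rate can be computed from an observed SLI and the SLO target. A minimal sketch in plain Python (the function name is illustrative; the 95% target mirrors the hit-ratio example above):

```python
def burn_rate(sli_good_fraction: float, slo_target: float) -> float:
    """Error-budget burn rate: 1.0 means the budget is consumed exactly
    at the rate the SLO window allows; >1.0 means it runs out early."""
    error_budget = 1.0 - slo_target          # allowed bad fraction
    observed_bad = 1.0 - sli_good_fraction   # actual bad fraction
    return observed_bad / error_budget

# Hit-ratio SLO of 95%: a 90% observed hit ratio burns budget at ~2x,
# i.e. the 30-day budget would be gone in roughly 15 days.
print(burn_rate(0.90, 0.95))
```

The same arithmetic underlies the multi-window burn-rate alerts discussed later: pick a burn-rate threshold per window, and page only when the budget is being spent fast enough to matter.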
Important: SLOs are policy. Pick windows and targets that drive rational trade-offs between availability, cost, and velocity — let the error budget guide remediation and releases.
Instrumenting caches: traces, metrics, and logs with OpenTelemetry
Instrument every cache call with three signals: a short span, precise metrics, and trace-correlated logs. Use OpenTelemetry for consistent naming and to enable cross-signal correlation. Instrumentation should be low-overhead, low-cardinality by default, and selective about keys and user identifiers.
Traces
- Create a short `CLIENT` span around each cache operation with attributes following the OTel semantic conventions: `db.system="redis"`, `db.operation.name` (e.g., `GET`/`MGET`/`HMGET`), `net.peer.name`, `redis.key.summary` (low-cardinality key prefix), and `db.response.status_code` when available. This follows the OTel Redis conventions and lets you filter traces by operation type.
- Record a boolean span attribute `cache.hit` (true on hits, false on misses) so you can filter for the traces that correspond to misses (the high-value ones). Linking traces to misses is critical for root-cause analysis.
Metrics
- Emit the counters and histograms listed above via OpenTelemetry metrics or a Prometheus client. Prefer histograms for latency so you can compute p99 at query-time. Use the OpenTelemetry Prometheus exporter or OTLP → Collector → Prometheus pipeline as fits your topology.
- Keep label cardinality low: `cache`, `result`, `region`, `shard` — avoid `cache_key` as a label. For hot-key analysis emit sampled telemetry (see exemplars below).
Logs
- Structured logs should include `trace_id` and `span_id` when emitted inside a span. That enables jump-to-trace from error logs and exemplars. Use OpenTelemetry log bridges or ensure your logging appender automatically includes trace context. Sanitize PII.
Exemplars — link metrics to traces
- Enable exemplars so that outlier histogram buckets carry a `trace_id`/`span_id` back to the trace that created the measurement. Exemplars let you click a p99 spike and land on the exact trace that produced the outlier. Configure exemplar sampling as trace-based (the default) and keep the reservoir small.
Practical instrumentation examples
- OpenTelemetry (Python) — counters / histogram + Prometheus scraping endpoint:
```python
# Python (schematic) — check the OpenTelemetry docs for current SDK APIs
import time

from opentelemetry import metrics, trace
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({"service.name": "user-cache"})

reader = PrometheusMetricReader()  # exposes /metrics for Prometheus to scrape
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
trace.set_tracer_provider(TracerProvider(resource=resource))

meter = metrics.get_meter("cache.instrumentation")
tracer = trace.get_tracer("cache.instrumentation")

cache_requests = meter.create_counter("cache_requests_total", description="Total cache requests")
cache_latency = meter.create_histogram("cache_request_duration_seconds", description="Cache request latency (s)")

# In your cache call path (redis_client and key come from your application):
with tracer.start_as_current_span("cache.get", attributes={"db.system": "redis", "db.operation.name": "GET"}):
    start = time.monotonic()
    val = redis_client.get(key)
    dur = time.monotonic() - start
    result = "hit" if val is not None else "miss"
    cache_requests.add(1, {"result": result})
    cache_latency.record(dur, {"result": result})
```
Caveat: language SDK APIs evolve; consult the OpenTelemetry docs for your language and exporter configuration.
Bucket guidance for cache histograms
- Cache latencies are typically sub-10 ms for local in-memory caches; choose buckets around your expected SLO, e.g. `buckets = [0.0005, 0.001, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0]` (seconds) — that maps to 0.5 ms, 1 ms, 2.5 ms, 5 ms, 10 ms, and so on. Tune upward if you have higher-latency remote caches.
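To sanity-check a bucket layout against sample latencies before deploying it, you can mirror Prometheus's cumulative `le` semantics in a few lines of plain Python (a sketch; the bucket values are the ones suggested above):

```python
import bisect

BUCKETS = [0.0005, 0.001, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0]  # seconds

def bucket_counts(observations):
    """Cumulative counts per 'le' bound, Prometheus-histogram style:
    counts[i] = number of observations <= BUCKETS[i]."""
    counts = [0] * len(BUCKETS)
    for obs in observations:
        # index of the first bucket whose upper bound is >= obs
        i = bisect.bisect_left(BUCKETS, obs)
        for j in range(i, len(BUCKETS)):
            counts[j] += 1
    return counts

# Three fast hits and one slow outlier:
print(bucket_counts([0.0004, 0.0009, 0.003, 0.08]))
# [1, 2, 2, 3, 3, 3, 3, 4, 4, 4]
```

If most observations land in one or two buckets, or the outlier only shows up in the `+Inf` tail, the layout will give you a poor `histogram_quantile()` estimate near your SLO threshold and should be adjusted.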
Cardinality and sampling rules
- Keep labels low-cardinality. For diagnosing hot keys, emit a sampled `cache_key` histogram or a separate `hot_key_probe` metric at a low rate (e.g., 1/1000 requests) instead of making `cache_key` a label on primary metrics. Use exemplars to capture the trace for the sampled event.
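A 1/1000 probe like the one described above can be sketched as a small helper; the `hot_key_probe` Counter stands in for a real low-rate metric, and the key-prefix scheme (`"namespace:id"`) is an assumption about your key layout:

```python
import itertools
from collections import Counter

_sample_tick = itertools.count()
SAMPLE_EVERY = 1000  # 1 out of every 1000 requests, as suggested above

hot_key_probe = Counter()  # stand-in for a real low-rate metric

def record_key_sample(key: str) -> None:
    """Record a low-cardinality key prefix for a deterministic 1/SAMPLE_EVERY
    slice of requests, instead of labeling primary metrics with raw keys."""
    if next(_sample_tick) % SAMPLE_EVERY == 0:
        prefix = key.split(":", 1)[0]  # "session:abc123" -> "session"
        hot_key_probe[prefix] += 1

# 5000 requests against session keys -> 5 sampled observations:
for i in range(5000):
    record_key_sample(f"session:{i}")
print(hot_key_probe)  # Counter({'session': 5})
```

Keeping the sampled dimension to the prefix, not the full key, keeps cardinality bounded while still revealing which namespace a hot-key storm lives in.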
Dashboards and alerts that surface real problems early
Dashboards are not trophies — they are triage surfaces. Design dashboards for signal + root-cause work: a top-line SLO panel, a burn-rate gauge, and a set of diagnostic panels (evictions, memory, top namespaces, hot-key sparkline, errors, and downstream DB load). Follow RED/USE methods for panels: Rate, Errors, Duration and Utilization/Saturation.
Suggested dashboard layout (top to bottom)
- Headline SLOs: p99 latency sparkline, cache hit ratio, error budget remaining (30d).
- Burn-rate widgets: multi-window burn-rate (1h/6h/3d) and an indicator to map burn → severity.
- Resource & health: memory usage, evictions per sec, CPU, connection count.
- Diagnostic drilldowns: top 10 busiest key prefixes, miss rate by prefix, origin request rate (to show fallout).
- Traces & exemplars: p99 chart with exemplars that link to traces for quick root-cause.
Prometheus examples: recording rules and alerts
- Recording rule (hit ratio):
```yaml
# recording_rules.yml
groups:
  - name: cache.rules
    rules:
      - record: job:cache_hit_ratio:ratio
        expr: |
          sum(rate(cache_requests_total{result="hit"}[5m]))
          /
          sum(rate(cache_requests_total[5m]))
```
- Alert rule (p99 breach):
```yaml
# alerts.yml
groups:
  - name: cache.alerts
    rules:
      - alert: CacheHighP99Latency
        expr: histogram_quantile(0.99, sum(rate(cache_request_duration_seconds_bucket[5m])) by (le)) > 0.02
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Cache p99 latency > 20ms"
          runbook: "https://runbooks.example.com/cache_high_p99"
```
Use `for:` to avoid paging on short blips; use multi-window burn-rate alerts (fast and slow), as SRE guidance recommends, to detect both sharp and gradual budget consumption.
Alerting strategy (practical)
- Alert on symptoms (user-visible pain) — p99 spikes and hit-ratio drops — not internal counters alone. Page on critical burns (e.g., 14.4x burn for 1h on a 30‑day SLO); create Slack/ops tickets for lower-severity burns. Use multiple windows to avoid blind spots.
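The fast/slow pattern above can be expressed as Prometheus rules. A sketch, assuming the `job:cache_hit_ratio:ratio` recording rule and the 95% hit-ratio target used earlier; the 14.4x/1x thresholds follow the common SRE Workbook pattern and should be tuned to your SLO:

```yaml
# burn_rate_alerts.yml (sketch — adapt windows and thresholds)
groups:
  - name: cache.slo.burn
    rules:
      - alert: CacheHitRatioFastBurn
        # 14.4x burn on a 95% SLO (error budget 0.05), confirmed on
        # both a long (1h) and a short (5m) window to avoid blips
        expr: |
          (1 - avg_over_time(job:cache_hit_ratio:ratio[1h])) > (14.4 * 0.05)
          and
          (1 - avg_over_time(job:cache_hit_ratio:ratio[5m])) > (14.4 * 0.05)
        labels:
          severity: page
      - alert: CacheHitRatioSlowBurn
        # 1x burn over 3d: budget on track to be fully spent by window end
        expr: (1 - avg_over_time(job:cache_hit_ratio:ratio[3d])) > (1 * 0.05)
        labels:
          severity: ticket
```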
Incident playbook (triage steps)
- First 2 minutes (what you must observe)
- Look at SLO dashboard: p99, hit ratio, error budget. Note which SLO is burning fastest.
- Inspect resource panels: memory, evictions, CPU — is cluster under memory pressure?
- Check exemplars on p99 chart → click-through to trace (identifies hot key / slow downstream).
- 2–10 minutes (actions)
- For heavy eviction/churn: increase cache capacity (scale out or add nodes), or temporarily increase TTLs for safe content.
- For hot-key storms: identify the top `key_prefix` with `topk()` in PromQL and apply rate-limiting or a local near-cache for that prefix.
- For config or deployment regressions: roll back the change that affected serialization or TTL mapping.
- Recovery window
- Rebalance shards, add headroom (reserve 20–30% memory), and follow capacity plan below.
Include redis-cli quick checks (for Redis-like caches):
```shell
# Quick Redis checks
redis-cli INFO stats        # keyspace_hits, keyspace_misses, evicted_keys
redis-cli INFO memory       # used_memory, maxmemory, mem_fragmentation_ratio
redis-cli INFO commandstats # top command counts
```
Use these to validate whether misses are true cache misses (keys absent or expired) vs. errors/timeouts.
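The `INFO stats` output above can be turned into a quick hit-ratio check. A sketch; `keyspace_hits` and `keyspace_misses` are real Redis INFO fields, while the parsing helper and sample values are illustrative:

```python
def redis_hit_ratio(info_stats: str) -> float:
    """Compute hit ratio from the 'field:value' lines of `redis-cli INFO stats`."""
    fields = dict(
        line.split(":", 1)
        for line in info_stats.strip().splitlines()
        if ":" in line and not line.startswith("#")
    )
    hits = int(fields["keyspace_hits"])
    misses = int(fields["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0

sample = """# Stats
keyspace_hits:980000
keyspace_misses:20000
evicted_keys:120"""
print(redis_hit_ratio(sample))  # 0.98
```

Comparing this server-side ratio against your client-side `cache_requests_total` ratio is a cheap cross-check: if they diverge, clients are likely counting timeouts or connection errors as misses.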
Sizing and cost: capacity planning and cache cost-per-request math
Plan capacity against two dimensions: working set (how many items you need to keep cached to meet your hit-rate SLO) and throughput (requests/sec influencing CPU/network sizing).
Capacity formulas (back-of-envelope)
- Required in-RAM bytes = target_items_to_cache × average_item_size_bytes × (1 + overhead). Overhead accounts for allocator fragmentation and per-key metadata (commonly 10–40% depending on engine and data shape).
- Node count = ceil(required_RAM_total / usable_RAM_per_node). Reserve headroom (20–30%) to avoid excessive evictions.
Example sizing worked example
- You need to keep 10M items, average 1 KB payload, overhead 30%:
- bytes = 10,000,000 × 1,024 × 1.3 ≈ 13,312,000,000 bytes ≈ 12.4 GiB ⇒ pick nodes to provide 16 GiB usable RAM across the cluster.
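The worked example above can be packaged as a small helper; the function name is illustrative, and the 30% overhead figure comes from the formula in the previous section:

```python
import math

def cache_sizing(items: int, avg_item_bytes: int, overhead: float,
                 usable_ram_per_node_gib: float) -> tuple[float, int]:
    """Back-of-envelope sizing: (required GiB including overhead, node count)."""
    required_bytes = items * avg_item_bytes * (1 + overhead)
    required_gib = required_bytes / 2**30
    nodes = math.ceil(required_gib / usable_ram_per_node_gib)
    return required_gib, nodes

# 10M items x 1 KiB x 1.3 overhead, on nodes with 8 GiB usable RAM each:
gib, nodes = cache_sizing(10_000_000, 1024, 0.30, 8.0)
print(f"{gib:.1f} GiB across {nodes} nodes")  # 12.4 GiB across 2 nodes
```

Remember that `usable_ram_per_node_gib` should already exclude the 20–30% headroom reserve; sizing to raw instance RAM is how eviction storms start.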
Monitoring guidance
- Keep sustained CPU < ~70% per core and memory utilization in a comfortable band (20–80%) to reduce evictions and fragmentation; Redis monitoring guidance mirrors these operational bands.
Cost-per-request optimization (model)
- Step 1: compute cost per hour of cache cluster (cloud charges, reserved vs on-demand) — example pricing models and serverless options are published in vendor pricing pages.
- Step 2: compute requests/hour (from monitoring).
- Step 3: cache cost-per-request = cluster_cost_per_hour / requests_per_hour. Compare that with the marginal cost of a direct DB request (RPC CPU, disk I/O, egress). If cache reduces backend cost and improves latency, the difference justifies the cache. Example math is available in vendor pricing docs showing how serverless cache charges combine storage and CPU units.
Concrete example (pattern, not a vendor recommendation)
- If cache cluster costs $2.90/hr (serverless example) and serves 3.6M requests/hr (1k RPS), cache cost per request ≈ $0.00000081. The same hour, a DB request might cost more when you add CPU/IO and scaling. Use these numbers to quantify ROI before increasing RAM or adding nodes. Refer to cloud provider pricing pages for accurate numbers for your region and instance types.
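The three-step model reduces to one division; a sketch reproducing the $2.90/hr example (the function name is illustrative):

```python
def cache_cost_per_request(cluster_cost_per_hour: float, rps: float) -> float:
    """Cost attributed to each cache request, from hourly cluster cost and RPS."""
    requests_per_hour = rps * 3600
    return cluster_cost_per_hour / requests_per_hour

# $2.90/hr at 1,000 RPS (3.6M requests/hr), as in the example above:
cost = cache_cost_per_request(2.90, 1000)
print(f"${cost:.8f} per request")  # $0.00000081 per request
```

Run the same division for your backend's marginal cost per request; the gap between the two numbers, multiplied by hit volume, is the cache's monthly ROI.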
Cost levers to watch (operational)
- Improve hit ratio (biggest lever). Small increases in hit ratio produce outsized savings on DB load and egress.
- Right-size node classes and consider serverless cache (if traffic is spiky) to avoid paying for idle capacity.
- Use near-cache (client local) for extreme hot keys to reduce network hops and lower p99.
Practical runbook: implement an SLO-driven cache observability stack
This checklist is a minimal, deployable plan you can apply in the next sprint.
Phase 0 — measurement plan (define before you change infra)
- Choose SLIs and windows: pick p99 and hit_ratio with a 30‑day evaluation window and a 5‑minute detection window for alerts. Document SLI definitions precisely (aggregation interval, included requests, measurement point).
- Define SLO targets and error budget policy (who gets paged at what burn rate).
Phase 1 — instrumentation (required signals)
- Implement counters and histograms in your cache client (or in a thin proxy layer) using OpenTelemetry metrics. Emit: `cache_requests_total`, `cache_request_duration_seconds_bucket`, `cache_errors_total`, `cache_evictions_total`, `cache_memory_bytes`.
- Add short `cache.get` spans with `db.system="redis"` and `db.operation.name`. Add the boolean attribute `cache.hit`. Ensure logs include `trace_id`.
- Enable exemplars (trace-based) in your metric pipeline so p99 points can link to traces.
Phase 2 — pipeline and backend
- Route metrics to Prometheus (scrape OTel Prometheus exporter or use OTLP → Collector → Prometheus remote-write). Configure retention: high-res metrics (15–30 days), downsampled long-term store for 1y.
- Route traces to a tracing backend (Tempo/Jaeger/Cloud Trace) and logs to a structured log backend with OTLP ingestion.
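The OTLP → Collector → backends route in Phase 2 can be sketched as a minimal Collector config. The `otlp` receiver and `prometheusremotewrite` exporter are standard Collector components, but the endpoints are placeholders and the exact fields should be verified against the Collector docs for your version:

```yaml
# otel-collector.yaml (sketch)
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"
  otlp/traces:
    endpoint: "tempo.example.com:4317"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      exporters: [otlp/traces]
```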
Phase 3 — dashboards & alerts
- Build a small SLO dashboard: p99, hit ratio, error budget, burn-rate windows, memory/evictions. Use RED/USE for panel design.
- Implement recording rules for SLI calculation and a set of alert rules:
- Fast-burn page (e.g., 14.4x burn for 1h) → page.
- Slow-burn warn (e.g., 1x burn over 3d) → ticket.
- Resource page: sustained memory > 85% or evictions spike → page.
Phase 4 — runbooks and drills
- Add concise runbooks for each alert: what to query, commands to run (`redis-cli INFO`), how to scale, and safe mitigations (increase TTLs, add nodes, enable near-cache, rate-limit writes). Keep playbooks ≤ 10 steps for the first 10 minutes. (See the playbook excerpt above.)
Phase 5 — review cadence
- Weekly: review SLO burn and cost reports. Monthly: capacity reforecast and pre-warm plan for seasonal load. Use SLOs to prioritize work (error budget remaining should map to feature release cadence).
Callout: Instrumentation without correlation is noise. Exemplars + trace-linked logs convert p99 spike graphs into actionable traces — that single capability reduces MTTI dramatically.
Sources:
- Service Level Objectives (Google SRE Book) — core definitions for SLIs, SLOs, error budgets, and the percentile rationale behind p99 and SLO windows.
- Implementing SLOs (Google SRE Workbook) — practical recipes for setting SLOs, burn-rate alerting, and error-budget-based alert workflows.
- OpenTelemetry: Metrics concepts and instrumentation — guidance on metric types, instrument design, and SDK behavior for counters, histograms, and gauges.
- Prometheus: Histograms and summaries (practices) — rationale for histograms vs. summaries, `histogram_quantile()` usage, and bucket guidance for computing p99.
- Grafana: Dashboard best practices — RED/USE methods and dashboard design patterns for operational triage.
- Monitoring Performance with Redis Insight (Redis) — metrics and operational ranges (latency, hit rate, memory utilization, eviction signals) referenced for cache health thresholds.
- OpenTelemetry: Semantic conventions for Redis — recommended attributes and span conventions for instrumenting Redis cache operations.
- OpenTelemetry: Prometheus exporter and integration guidance — patterns for exporting OpenTelemetry metrics for Prometheus scraping or remote-write.
- OpenTelemetry: Metrics data model (Exemplars) — how exemplars enable metric-to-trace correlation for p99 investigation.
- Amazon ElastiCache Pricing (AWS) — pricing model examples (serverless vs. node-based) used to illustrate cost-per-request calculations and trade-offs.
- Prometheus: Alerting rules documentation — syntax and guidance for alerting rules and using `for:` to avoid flapping.