Building an Unreliable by Design System: A Practical Guide to Observability-Driven Resilience
In modern software, the temptation is to chase perfect reliability. The reality is that systems will fail, networks will glitch, and teams will miss edge cases. The goal is not to eliminate failure entirely but to design for failure in a way that preserves meaningful service levels, reduces blast radius, and speeds recovery. This article shares a senior engineer’s hands-on experience building an “unreliable by design” system: a microservice platform that deliberately embraces observable failure modes, uses deterministic chaos injection for validation, and delivers measurable resilience improvements. You’ll find concrete architecture decisions, code snippets, metrics you can track, and practical lessons learned you can apply to your own projects.
Overview and motivation
- Problem statement: A data-processing pipeline with multiple asynchronous stages suffered intermittent bottlenecks and cascading retries that amplified latency spikes during traffic surges.
- Core idea: Build a platform that makes failure modes explicit, captures rich observability data at every boundary, and uses controlled chaos to validate resilience hypotheses before production.
- Outcome goals:
  - Faster detection of service degradation with actionable signals.
  - Containment of failures to minimize blast radius.
  - Quantifiable resilience improvements evidenced by concrete metrics (RTO, RPO, error budgets, saturation points).
System design: principles and architecture
- Observability-first design
  - Every service boundary surfaces correlation IDs, traces, and structured metrics.
  - Centralized logging with consistent field schemas and secure, high-cardinality indexing for debugging.
  - SRE-style service level indicators (SLIs) and error budgets for each critical path.
- Chaos-enabled validation
  - Introduce controlled fault injections at well-scoped, low-risk points to validate dashboards, alarms, and recovery strategies.
  - Use deterministic seeds so that injected failures are reproducible in CI and staging.
- Resilience patterns
  - Backpressure-aware queues between stages to prevent upstream floods.
  - Timeouts, retries, and circuit breakers tuned per boundary.
  - Idempotent and replay-safe processing to tolerate partial failures.
  - Safe fallback paths with degraded but functional service levels.
- Data model and contracts
  - Strongly typed interfaces between services with explicit schemas.
  - Bounded context boundaries to minimize cross-service coupling.
Project scaffold: the concrete system
- Tech stack
  - Language: Go for high-concurrency microservices, with TypeScript/Node.js adapters for front-end-related orchestration.
  - Messaging: NATS (JetStream) or Apache Pulsar for high-throughput, at-least-once delivery, with consumer-side deduplication to approximate exactly-once ingestion where possible.
  - Observability: OpenTelemetry for traces, Prometheus for metrics, Grafana for dashboards, and structured JSON logs.
  - Chaos tooling: a small in-house chaos harness with gradually ramped exposure in staging; every experiment declares its seed and target blast radius.
- Core services
  - Ingestor: receives data from external sources, assigns correlation IDs, enqueues into the processing pipeline.
  - Processor: multi-stage worker pool transforming data, with backpressure signaling and retry policies.
  - Sink: writes results to durable storage and publishes downstream events.
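Because delivery is at-least-once, the Sink may see the same message more than once. The following is a rough sketch of one way to make writes idempotent, keyed on an ID such as the correlation ID. `Result`, `IdempotentSink`, and the in-memory map are illustrative assumptions; a real sink would rely on durable storage with a unique-key constraint.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a hypothetical processed record; ID is the idempotency key
// (for example, the correlation ID assigned by the Ingestor).
type Result struct {
	ID    string
	Value int
}

// IdempotentSink applies each ID at most once, so a redelivered message is
// acknowledged without being written twice. The map stands in for durable
// storage with a unique-key constraint.
type IdempotentSink struct {
	mu    sync.Mutex
	seen  map[string]struct{}
	total int
}

func NewIdempotentSink() *IdempotentSink {
	return &IdempotentSink{seen: make(map[string]struct{})}
}

// Write reports whether the result was applied (true) or skipped as a
// duplicate (false).
func (s *IdempotentSink) Write(r Result) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[r.ID]; dup {
		return false
	}
	s.seen[r.ID] = struct{}{}
	s.total += r.Value
	return true
}

// Total returns the sum of all applied values.
func (s *IdempotentSink) Total() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.total
}

func main() {
	sink := NewIdempotentSink()
	// "a" arrives twice, as can happen under at-least-once delivery.
	for _, r := range []Result{{"a", 10}, {"b", 5}, {"a", 10}} {
		fmt.Println(r.ID, sink.Write(r))
	}
	fmt.Println("total:", sink.Total()) // prints "total: 15"
}
```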
Code examples: essential snippets you can reuse
1) Correlation and tracing setup (Go)
Objectives:
- Propagate trace context and correlation IDs across services.
- Emit a root span per end-to-end request with meaningful attributes.
Snippet (Go, using OpenTelemetry):
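```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// initTracer installs a global tracer provider and returns a shutdown func.
// It exports OTLP over HTTP, which Jaeger and the OpenTelemetry Collector
// both accept (the dedicated Jaeger exporter is deprecated). The endpoint is
// a local default for demonstration; use your preferred exporter in prod.
func initTracer(ctx context.Context) (func(), error) {
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	// W3C trace-context headers carry the trace across service boundaries.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return func() { _ = tp.Shutdown(context.Background()) }, nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Continue the caller's trace if trace headers are present.
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	ctx, span := otel.Tracer("ingestor").Start(ctx, "handle_ingest", trace.WithAttributes(
		attribute.String("http.method", r.Method),
		attribute.String("http.url", r.URL.String()),
	))
	defer span.End()
	// Pass ctx to downstream calls so their spans join this trace.
	// ... business logic here ...
	fmt.Fprintln(w, "ok")
}

func main() {
	shutdown, err := initTracer(context.Background())
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer shutdown()
	http.HandleFunc("/ingest", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```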
2) Backpressure-aware queue (Go sketch)
Objectives:
- Limit concurrency when downstream is slow.
- Apply backpressure by controlling enqueue rate and capacity.
Snippet:
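```go
package pipeline

import (
	"context"
	"errors"

	"golang.org/x/time/rate"
)

// ErrQueueFull signals that the downstream stage is saturated.
var ErrQueueFull = errors.New("queue full")

// DataItem is a placeholder payload type.
type DataItem struct{ ID string }

type Queue struct {
	ch      chan DataItem // bounded buffer; its capacity caps queued items
	limiter *rate.Limiter // caps the enqueue rate
	// additional metrics and state
}

// Enqueue waits for a rate-limiter token, then attempts a non-blocking send.
func (q *Queue) Enqueue(ctx context.Context, item DataItem) error {
	if err := q.limiter.Wait(ctx); err != nil {
		return err // ctx cancelled, or its deadline would be exceeded
	}
	select {
	case q.ch <- item:
		return nil
	default:
		// downstream is saturated; drop or route to fallback
		return ErrQueueFull
	}
}
```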
3) Chaos injection (Go sketch)
Objectives:
- Validate alerting, retries, and recovery under realistic fault conditions.
- Use a deterministic seed for reproducibility.
Snippet:
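```go
package chaos

import (
	"errors"
	"math/rand"
	"time"
)

var (
	ErrInjected = errors.New("chaos: injected error")
	ErrDropped  = errors.New("chaos: message dropped")
)

type ChaosSpec struct {
	LatencyMs int
	ErrorRate float64
	DropRate  float64
	Seed      int64
}

// NewRand builds the seeded RNG. Create it once per run and reuse it:
// reseeding on every call would repeat the first draw forever.
func NewRand(spec ChaosSpec) *rand.Rand { return rand.New(rand.NewSource(spec.Seed)) }

// runChaos makes one draw per call, so ErrorRate and DropRate are independent
// probabilities, and adds latency on the normal path. r is not goroutine-safe.
func runChaos(r *rand.Rand, spec ChaosSpec) error {
	switch p := r.Float64(); {
	case p < spec.ErrorRate:
		return ErrInjected // simulate error path
	case p < spec.ErrorRate+spec.DropRate:
		return ErrDropped // simulate drop
	}
	time.Sleep(time.Duration(spec.LatencyMs) * time.Millisecond)
	return nil // normal path with optional latency
}
```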
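The value of a deterministic seed is easiest to see by replaying a run. This small sketch (the function name `faultSequence` and the outcome labels are made up for illustration) draws a fault sequence twice from the same seed and checks that both runs agree, which is the property that makes a chaos failure reproducible in CI.

```go
package main

import (
	"fmt"
	"math/rand"
)

// faultSequence draws n outcomes from an RNG seeded once for the whole run.
// The outcome labels and the single-draw scheme are illustrative assumptions.
func faultSequence(seed int64, n int, errorRate, dropRate float64) []string {
	rng := rand.New(rand.NewSource(seed))
	out := make([]string, n)
	for i := range out {
		switch p := rng.Float64(); {
		case p < errorRate:
			out[i] = "error"
		case p < errorRate+dropRate:
			out[i] = "drop"
		default:
			out[i] = "ok"
		}
	}
	return out
}

func main() {
	a := faultSequence(42, 8, 0.2, 0.1)
	b := faultSequence(42, 8, 0.2, 0.1)
	fmt.Println(a)
	fmt.Println(fmt.Sprint(a) == fmt.Sprint(b)) // same seed, same sequence: true
}
```

Recording the seed alongside each chaos run (for example as a label on `chaos_event_total`) lets anyone replay the exact sequence that triggered an alert.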
4) SLI-focused dashboards (Prometheus-style metrics)
Essential metrics:
- request_latency_seconds_bucket
- request_error_total
- inflight_requests
- queue_depth
- service_availability
- chaos_event_total

Example PromQL (p95 latency for the processor over a 5-minute window):

```promql
histogram_quantile(0.95, sum by (le) (rate(request_latency_seconds_bucket{job="processor"}[5m])))
```
Measurable impact: metrics to track and how to interpret
- Availability and latency targets
  - Latency SLO (normal operation): 99.9% of end-to-end requests complete within 2 seconds. The SLI is the measured fraction of requests that finish within that bound.
  - Error-rate SLO (critical endpoints): at most 0.1% of requests fail, with p95 latency at or below 1.5 seconds during traffic spikes.
- Error budgets
  - For a 30-day window, allocate a 0.9% error budget (i.e., a 99.1% availability target), which is roughly 6.5 hours of full unavailability. When the budget is exhausted, hold further changes to elevated scrutiny until it recovers.
- Observability coverage metrics
  - Trace coverage: percent of end-to-end flows with complete traces.
  - Log completeness: percent of requests with correlation IDs in logs.
  - Metrics coverage: all critical service boundaries export latency, throughput, and error counts.
- Measurable outcomes from the implementation
  - Reduced mean time to detect (MTTD) incidents by 40% after unified traces and dashboards.
  - 30% drop in end-to-end latency spikes during traffic surges due to backpressure and circuit breakers.
  - Fewer cascading retries because of idempotent workers and controlled timeouts.
  - Faster post-incident recovery thanks to reproducible chaos tests and deterministic seeds.
Lessons learned for the community
- Start with a minimal, explicit failure model
  - Define what constitutes a failure for each service boundary (timeout, error response, saturation) and design tests around those cases.
- Invest in end-to-end tracing early
  - Traces across services are the most cost-effective investment for understanding complex failures.
- Use controlled experiments rather than pure chaos for validation
  - Deterministic seeds and limited blast radius give you confidence without risking real users.
- Make resilience decisions visible
  - Document why you chose timeouts, retries, and circuit breakers; these decisions should be reproducible and auditable.
- Build for deployability of safe fallbacks
  - Ensure that degraded mode still delivers value and clear signals to operators.
Lessons learned: pitfalls to avoid
- Over-optimistic retry strategies can hide root causes; prefer exponential backoff with jitter and clear upper bounds.
- Hidden dependencies (e.g., external APIs with opaque failures) can undermine resilience; instrument and monitor them explicitly.
- Chaos testing should be ramped up gradually; start small in staging, then expand to canary environments with strict guardrails.
Step-by-step guide to reproduce in your project
1) Establish observability foundation
- Implement OpenTelemetry across services; ensure propagation of trace IDs and correlation IDs.
- Centralize logs with unified schemas; include trace and span IDs for correlation.
2) Introduce a controlled chaos harness
- Create a small, repeatable chaos module with seeds and tunable parameters.
- Run chaos in staging and gradually expand to canary with explicit rollback criteria.
3) Implement backpressure and resilience patterns
- Introduce bounded queues, timeouts, circuit breakers, and idempotent processing.
- Add degraded-mode paths with clear service-level signals.
4) Build dashboards and alerting
- Create dashboards showing latency percentiles, error rates, queue depth, and chaos events.
- Define alert rules for SLO breaches and abnormal surge patterns.
5) Measure, learn, and iterate
- Track SLO attainment, recovery times, and incident rates before and after the changes.
- Use post-incident reviews to refine bounds and thresholds.
Concrete checklist for teams
- Observability
  - [ ] End-to-end tracing across all critical paths
  - [ ] Structured, high-cardinality logs with correlation IDs
  - [ ] Prometheus metrics for latency, errors, throughput, and queue depth
  - [ ] Dashboards that visualize SLI/SLO status and chaos results
- Reliability
  - [ ] Timeouts and circuit breakers tuned per boundary
  - [ ] Backpressure mechanisms to prevent upstream floods
  - [ ] Idempotent processing and replay-safe storage
- Chaos engineering
  - [ ] Deterministic chaos seeds for reproducibility
  - [ ] Safe throttling of blast radius in staging and canary
  - [ ] Post-chaos analysis and iteration plan
- Process and governance
  - [ ] Clear ownership per service boundary
  - [ ] Incident playbooks with escalation paths
  - [ ] Documentation of resilience decisions and trade-offs
Call to action
If you’re an engineer who cares about building systems that gracefully weather failure, I’d love to hear your experiences and challenges. Let’s compare notes on:
- How you define and measure resilience in your domain
- The best practices you’ve found for chaos testing without risking users
- Tools and techniques that helped you move from hope for reliability to evidence of reliability
Would you like to connect to discuss your own resilience experiments, share dashboards and chaos scripts, or collaborate on a concrete pilot in a project similar to what you’re currently building? Reach out via your preferred professional channel, and tell me about your boundary questions, the failure modes you’re most concerned about, and the metrics that matter most to your team.
Rizwan Saleem | https://rizwansaleem.co