You inherit a 12-year-old service with no observability. You can't rewrite it. You need to make it debuggable. Here's how I've done it three times.
The 3 layers of instrumentation
Layer 1: Black-box. Everything you can observe without touching the code.
- Process-level metrics (CPU, memory, file descriptors)
- Network-level metrics (connection counts, latency from proxy logs)
- Request traces at the edge (nginx/ALB/CDN logs)
Start here. It requires no code changes, and most of this data is already being produced by the OS and your proxy; you only have to collect it.
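To make Layer 1 concrete, here is a minimal sketch of turning edge logs into per-endpoint latency and error numbers. It assumes an nginx-style access log where the request time in seconds has been appended as the last field (for example via `$request_time` in a custom `log_format`); that is not the default format, so adjust the regex to whatever your proxy actually writes. The function names are mine, not from any library.

```python
import math
import re
from collections import defaultdict

# Assumed line shape: combined log format with the request time (seconds)
# appended as the final field. Change this regex to match your own logs.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .* '
    r'(?P<request_time>\d+\.\d+)$'
)


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no values")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(lines):
    """Group requests by path; report count, 5xx count, and p50/p99 latency."""
    latencies = defaultdict(list)
    errors = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        path = match["path"].split("?", 1)[0]  # drop the query string
        latencies[path].append(float(match["request_time"]))
        if match["status"].startswith("5"):
            errors[path] += 1

    summary = {}
    for path, values in latencies.items():
        values.sort()
        summary[path] = {
            "count": len(values),
            "errors_5xx": errors[path],
            "p50": percentile(values, 50),
            "p99": percentile(values, 99),
        }
    return summary
```

`summarize` accepts any iterable of lines, so you can pass it an open file handle for your access log and feed the result into whatever dashboard you already have.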
Layer 2: Runtime hooks. Inject observability into the running process without changing source code.
- eBPF for system-level tracing
- APM auto-instrumentation for JVM, .NET, Node, Python
- Sidecar patterns (Envoy, Linkerd) for network-level tracing
This is where the biggest win is. For the runtimes listed above, modern APM agents usually attach at process startup through a launch flag or wrapper command rather than through source edits. You get request traces, method-level timing, and error visibility without touching a line of application code.
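If you're wondering how an agent can time methods without source changes, the core trick is wrapping functions at load time. The sketch below is a toy illustration of that idea in Python, not any vendor's agent: `instrument`, `legacy_billing`, and `charge` are hypothetical names, and the "legacy module" is built in-line only so the example runs on its own.

```python
import functools
import time
import types


def instrument(module, func_name, record):
    """Swap module.func_name for a timing wrapper; the module's source is untouched."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return original(*args, **kwargs)
        except Exception:
            outcome = "error"
            raise  # never swallow the legacy code's exceptions
        finally:
            record({
                "function": f"{module.__name__}.{func_name}",
                "outcome": outcome,
                "duration_s": time.perf_counter() - start,
            })

    setattr(module, func_name, wrapper)
    return original  # keep a handle so the patch can be undone


# Stand-in for a legacy module we cannot edit (hypothetical).
legacy_billing = types.ModuleType("legacy_billing")


def _charge(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return f"charged {amount}"


legacy_billing.charge = _charge

# "Agent" side: patch once at startup and collect spans in memory.
spans = []
instrument(legacy_billing, "charge", spans.append)
```

After this, every call to `legacy_billing.charge(...)` appends a span to `spans`. One caveat: code that ran `from legacy_billing import charge` before the patch keeps the unwrapped function, which is one reason real agents hook in at process startup.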
Layer 3: Surgical code changes. Only when black-box and runtime hooks aren't enough.
Add structured logging around the 5-10 most important operations. Aim for surgical, not comprehensive: just enough to answer 'did operation X succeed for request Y?'
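Here is roughly what that surgical logging can look like: a small context manager that wraps one operation and emits a single JSON line with the operation name, the request ID, and the outcome. The helper name `logged_operation`, the logger name, and the field names are my own choices for illustration, and it assumes a request ID is already available at the call site.

```python
import json
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("legacy.ops")  # hypothetical logger name


@contextmanager
def logged_operation(operation, request_id, **fields):
    """Emit one JSON log line saying whether `operation` succeeded for `request_id`."""
    start = time.perf_counter()
    event = {"operation": operation, "request_id": request_id, **fields}
    try:
        yield event  # callers may attach extra fields while the operation runs
    except Exception as exc:
        event.update(outcome="error", error_type=type(exc).__name__)
        raise  # logging must not change the legacy behaviour
    else:
        event["outcome"] = "ok"
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(event, default=str))
```

At a call site this becomes `with logged_operation("charge_card", request_id, order_id=order_id):` wrapped around the existing lines, so the change to the legacy code is one added line and an indent.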
What I don't do
I don't try to rewrite legacy code to add modern observability. It's a trap. The rewrite takes 6 months, introduces new bugs, and by the time you're done the original problem is still there.
The order of operations
- Layer 1 (1 week): black-box metrics and dashboards
- Layer 2 (2 weeks): APM agent or sidecar
- Identify the worst 3-5 debug gaps that remain
- Layer 3 (1 week): surgical logging for those gaps
Total: 4 weeks, minimal code changes, dramatically better observability.
The emotional win
The day you can finally answer 'what happened in that outage?' for a legacy service, you unlock the team's ability to actually improve it. Observability is the prerequisite for every other reliability investment.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com