I Built a Self-Healing Observability Stack — Here Are the Bugs That Nearly Broke Me
My blue/green deployment rolled back successfully.
I had no idea why.
The CloudWatch alarm fired. CodeDeploy reverted. The Slack alert said "5xx spike." But which service? Which endpoint? Which specific request triggered the cascade? All I had was a timestamp and an alarm name. The system worked exactly as designed — and I couldn't explain what it had just protected me from.
That's when I started this project.
What I Was Actually Missing
I'd built a solid GitOps pipeline at this point: Jenkins security gates, ECS Fargate, blue/green deployments with automatic rollback. The deployment mechanics were production-grade. The observability layer was... three CloudWatch log groups and a feeling.
The stack I built to close that gap:
- OpenTelemetry auto-instrumentation on the NestJS backend — every HTTP request generates a trace with spans across every service hop
- Jaeger as the trace backend (receiving via OTLP HTTP on port 4318)
- Pino structured logging with trace_id and span_id injected into every log line — so a CloudWatch log entry links directly to a Jaeger trace
- Prometheus scraping custom NestJS metrics (request rate, latency histograms, error counters)
- Grafana dashboards
- Alertmanager → Slack for alert routing
- Lambda auto-remediation — a function that detects high error rates via CloudWatch alarm and autonomously stops unhealthy ECS tasks
The goal: when something breaks, I can go from Slack alert → log line → trace → root cause in one click. And if the error rate spikes, the system handles it before I even see the alert.
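For the Prometheus piece, a minimal scrape config sketch — the job name, metrics path, and port here are my assumptions (the backend listens on 3001 in this setup; your metrics endpoint may differ):

```yaml
scrape_configs:
  - job_name: 'nestjs-backend'
    metrics_path: /metrics           # assumed path exposed by the NestJS metrics module
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3001']  # same-task containers share localhost under awsvpc
```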
Bug #1: OpenTelemetry Was Running But Not Working
This was the first thing I got completely wrong.
I installed @opentelemetry/auto-instrumentations-node, wired up the OTLP exporter, pointed it at Jaeger, and ran the app. Zero traces in Jaeger. No error. No warning. Just nothing.
I spent a long time confirming things that weren't the problem: Jaeger was reachable, the exporter config was correct, the SDK was initialising without throwing. Everything looked fine. Nothing was traced.
The problem was import order.
Node.js auto-instrumentation works by monkey-patching built-in modules (http, https, net) at process startup. The patches need to be applied before any other module loads. If NestJS (or Express, or anything) bootstraps first, those modules are already in memory — the patches never apply. The app runs normally but generates no spans.
The fix is one constraint:
// main.ts — THIS ORDER IS MANDATORY
import './tracing'; // Must be FIRST — patches Node.js internals
import { NestFactory } from '@nestjs/core';
async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3001);
}
bootstrap();
And the tracing.ts initialisation itself:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // localhost because ECS awsvpc
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start(); // Synchronous — must complete before bootstrap() runs
Note the URL: localhost:4318, not jaeger:4318. ECS Fargate's awsvpc network mode puts all containers in the same task into a shared network namespace. Same-task containers talk on localhost. Docker Compose service names don't resolve here.
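As a sketch, the relevant slice of the task definition looks something like this — the container names, image references, and ports are placeholders, but the key point is that both containers live in one task, so they share a network namespace:

```json
{
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "backend",
      "image": "<account>.dkr.ecr.<region>.amazonaws.com/backend:<git-sha>",
      "portMappings": [{ "containerPort": 3001 }]
    },
    {
      "name": "jaeger",
      "image": "jaegertracing/all-in-one:latest",
      "portMappings": [{ "containerPort": 4318 }]
    }
  ]
}
```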
After fixing the import order, traces started flowing immediately.
Correlating Logs to Traces
Having traces is useful. Having traces you can find from a log line is the actual goal.
The Pino logger needed a mixin function that reads the active OpenTelemetry span and injects its IDs into every log entry:
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
  formatters: {
    level(label) { return { level: label }; },
  },
});
Now every log line looks like this:
{
  "level": "error",
  "msg": "Database connection refused",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "timestamp": "2024-01-15T14:23:01.234Z"
}
When a Slack alert fires, I can grep CloudWatch Logs for the trace_id, then paste it directly into Jaeger's search. One click. The full trace — every service, every database query, every millisecond — is right there.
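If you use CloudWatch Logs Insights rather than a plain grep, the lookup is one query — a sketch, with field names matching the Pino output above:

```
fields @timestamp, level, msg, span_id
| filter trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
```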
Bug #2: Lambda Auto-Remediation Broke the Deployment Controller
This one was more subtle.
The Lambda function's job: CloudWatch alarm fires (high 5xx rate) → Lambda detects it → Lambda restarts the unhealthy ECS service.
My first implementation used UpdateService with forceNewDeployment: true. That's the standard approach for restarting an ECS service. It should have worked.
ecs.update_service(
    cluster=cluster_arn,
    service=service_name,
    forceNewDeployment=True  # Rejected when CodeDeploy controls the service
)
It threw:
InvalidParameterException: Unable to update the service because
a deployment is already in progress
The reason: when an ECS service is created with deployment_controller { type = "CODE_DEPLOY" }, AWS hands deployment control entirely to CodeDeploy. Calling UpdateService with forceNewDeployment on a CodeDeploy-controlled service is rejected — the two systems conflict.
The correct approach is ecs:StopTask — stop the specific unhealthy task directly:
import boto3

ecs = boto3.client('ecs')

def remediate_unhealthy_task(cluster_arn, service_arn):
    # List running tasks for the service
    tasks = ecs.list_tasks(
        cluster=cluster_arn,
        serviceName=service_arn,
        desiredStatus='RUNNING'
    )['taskArns']

    if not tasks:
        return

    # Stop the first running task; ECS launches a replacement
    ecs.stop_task(
        cluster=cluster_arn,
        task=tasks[0],
        reason='Auto-remediation: high error rate detected via CloudWatch alarm'
    )
When a task stops, ECS detects the task count is below desired and launches a replacement. The service recovers. CodeDeploy is never touched. No deployment state corruption.
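One piece the snippets above gloss over: the handler needs to know which alarm fired before it acts. If the alarm reaches Lambda through SNS (one common wiring — a direct EventBridge rule has a different event shape), the payload parses like this. A sketch, not the repo's exact code; field names follow the standard CloudWatch alarm notification:

```python
import json

def parse_alarm_event(event):
    """Parse a CloudWatch alarm notification delivered via SNS.

    Assumes the SNS wiring; the Message body is the JSON-encoded
    alarm notification CloudWatch publishes on state change.
    """
    message = json.loads(event['Records'][0]['Sns']['Message'])
    return {
        'alarm_name': message['AlarmName'],
        'new_state': message['NewStateValue'],
        'reason': message['NewStateReason'],
    }
```

A handler can then gate on `new_state == 'ALARM'` before touching any ECS task, so OK-to-ALARM and ALARM-to-OK transitions don't both trigger remediation.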
Bug #3: The Idempotency Problem
Lambda triggered three times for the same alarm window. Three concurrent invocations. Three tasks stopped simultaneously. The service dropped to zero running tasks and couldn't recover fast enough to pass health checks.
The fix: check your own logs before acting.
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs')

def check_recent_remediation(log_group, log_stream_prefix, window_minutes=10):
    """Return True if auto-remediation ran successfully in the last N minutes."""
    # Aware UTC datetime — naive utcnow().timestamp() silently assumes local time
    cutoff = int((datetime.now(timezone.utc) - timedelta(minutes=window_minutes)).timestamp() * 1000)

    streams = logs.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=log_stream_prefix,
        orderBy='LastEventTime',
        descending=True,
        limit=5
    )['logStreams']

    for stream in streams:
        events = logs.get_log_events(
            logGroupName=log_group,
            logStreamName=stream['logStreamName'],
            startTime=cutoff
        )['events']
        for event in events:
            if 'Auto-remediation successful' in event['message']:
                return True
    return False


def lambda_handler(event, context):
    if check_recent_remediation(LOG_GROUP, LOG_STREAM_PREFIX):
        print('Recent remediation found — skipping to avoid thrash')
        return {'status': 'skipped', 'reason': 'idempotency_guard'}

    remediate_unhealthy_task(CLUSTER_ARN, SERVICE_ARN)
    print('Auto-remediation successful')
    return {'status': 'remediated'}
One alarm. One Lambda invocation that acts. All subsequent invocations within 10 minutes exit early. The service gets one clean restart instead of a cascade.
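The window check itself is worth isolating from the CloudWatch Logs plumbing: given the millisecond timestamp of the last successful run, decide whether to skip. A pure-function sketch (the names are mine, not the repo's) that is trivially unit-testable without AWS:

```python
from datetime import datetime, timedelta, timezone

def within_remediation_window(last_success_ms, now=None, window_minutes=10):
    """True if a successful remediation happened within the last N minutes.

    last_success_ms: epoch milliseconds of the most recent
    'Auto-remediation successful' log event, or None if none was found.
    """
    if last_success_ms is None:
        return False
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=window_minutes)
    last = datetime.fromtimestamp(last_success_ms / 1000, tz=timezone.utc)
    return last >= cutoff
```

Injecting `now` keeps the clock out of the function, so tests can pin it to a fixed instant instead of sleeping.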
The Full Pipeline: 11 Stages
The Jenkins pipeline that drives all of this:
- Secret Scan (Gitleaks)
- Type Check + Lint (TypeScript + ESLint)
- Dependency Audit (npm audit)
- Code Quality (SonarCloud)
- Build Images (Docker, tagged with git SHA)
- Image Scan (Trivy CVE detection)
- SBOM Generation (Syft — CycloneDX + SPDX)
- IaC Scan (Checkov on Terraform)
- ECR Push
- Task Definition Registration
- Blue/Green Deployment (CodeDeploy, 10% traffic per minute)
Security gates first. Deployment last. The same principle from the GitOps project applies here.
What It Looks Like When It Works
A request comes in to the NestJS backend:
- OpenTelemetry generates a trace ID and creates a root span
- Each downstream call (database query, external HTTP) gets a child span
- Pino injects the trace ID into every log line during that request's lifecycle
- Prometheus records the request duration in a histogram
- If the response is 5xx: Alertmanager routes to Slack with the alarm context
- In Slack: I see the alert, click the CloudWatch link, grep for the trace ID, open Jaeger, see the full call graph in under 60 seconds
And if the error rate crosses the threshold:
- CloudWatch alarm fires
- Lambda checks for recent remediation (idempotency guard)
- Lambda stops the unhealthy task
- ECS replaces it with a fresh task
- Error rate drops
- Alarm clears
No manual intervention. No 3am pages.
Key Takeaways
OTel import order is a hard constraint, not a preference. The SDK must patch Node.js internals before any framework loads. One wrong line breaks the entire tracing setup with no error message to guide you.
ecs:StopTask is the correct remediation call when using CODE_DEPLOY. forceNewDeployment conflicts with the CodeDeploy controller. Stop the task — ECS handles the replacement.
Idempotency in Lambda isn't optional when CloudWatch alarms are your trigger. Alarms fire multiple times. Your remediation function needs to know when it already ran.
Trace ID correlation turns three separate signals into one investigation. Logs, traces, and metrics are each useful in isolation. Together, with the trace ID as the link, they tell the complete story of a request.
Resources
- Full repository — github.com/celetrialprince166/Advanced_monitoring
- OpenTelemetry Node.js Auto-Instrumentation
- Jaeger OTLP Ingestion
- AWS CodeDeploy ECS Blue/Green
What's the most useful observability signal you've added to a production system? Drop it below — I'm building a list of what actually helps vs. what just adds noise. 👇