Prince Ayiku
I Built a Self-Healing Observability Stack on AWS ECS — Here Are the Bugs That Nearly Broke Me


My blue/green deployment rolled back successfully.

I had no idea why.

The CloudWatch alarm fired. CodeDeploy reverted. The Slack alert said "5xx spike." But which service? Which endpoint? Which specific request triggered the cascade? All I had was a timestamp and an alarm name. The system worked exactly as designed — and I couldn't explain what it had just protected me from.

That's when I started this project.


What I Was Actually Missing

I'd built a solid GitOps pipeline by this point: Jenkins security gates, ECS Fargate, blue/green deployments with automatic rollback. The deployment mechanics were production-grade. The observability layer was... three CloudWatch log groups and a feeling.

The stack I built to close that gap:

  • OpenTelemetry auto-instrumentation on the NestJS backend — every HTTP request generates a trace with spans across every service hop
  • Jaeger as the trace backend (receiving via OTLP HTTP on port 4318)
  • Pino structured logging with trace_id and span_id injected into every log line — so a CloudWatch log entry links directly to a Jaeger trace
  • Prometheus scraping custom NestJS metrics (request rate, latency histograms, error counters)
  • Grafana dashboards
  • Alertmanager → Slack for alert routing
  • Lambda auto-remediation — a function that detects high error rates via CloudWatch alarm and autonomously stops unhealthy ECS tasks

[Image: Advanced Observability Stack Architecture]

The goal: when something breaks, I can go from Slack alert → log line → trace → root cause in one click. And if the error rate spikes, the system handles it before I even see the alert.


Bug #1: OpenTelemetry Was Running But Not Working

This was the first thing I got completely wrong.

I installed @opentelemetry/auto-instrumentations-node, wired up the OTLP exporter, pointed it at Jaeger, and ran the app. Zero traces in Jaeger. No error. No warning. Just nothing.

I spent a long time confirming things that weren't the problem: Jaeger was reachable, the exporter config was correct, the SDK was initialising without throwing. Everything looked fine. Nothing was traced.

The problem was import order.

Node.js auto-instrumentation works by monkey-patching built-in modules (http, https, net) at process startup. The patches need to be applied before any other module loads. If NestJS (or Express, or anything) bootstraps first, those modules are already in memory — the patches never apply. The app runs normally but generates no spans.

The fix is one constraint:

// main.ts — THIS ORDER IS MANDATORY

import './tracing';          // Must be FIRST — patches Node.js internals
import { NestFactory } from '@nestjs/core';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3001);
}
bootstrap();

And the tracing.ts initialisation itself:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',  // localhost because ECS awsvpc
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();  // Synchronous — must complete before bootstrap() runs

Note the URL: localhost:4318, not jaeger:4318. ECS Fargate's awsvpc network mode puts all containers in the same task into a shared network namespace. Same-task containers talk on localhost. Docker Compose service names don't resolve here.

After fixing the import order, traces started flowing immediately.


Correlating Logs to Traces

Having traces is useful. Having traces you can find from a log line is the actual goal.

The Pino logger needed a mixin function that reads the active OpenTelemetry span and injects its IDs into every log entry:

import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
  formatters: {
    level(label) { return { level: label }; },
  },
});

Now every log line looks like this:

{
  "level": "error",
  "msg": "Database connection refused",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "timestamp": "2024-01-15T14:23:01.234Z"
}

When a Slack alert fires, I can grep CloudWatch Logs for the trace_id, then paste it directly into Jaeger's search. One click. The full trace — every service, every database query, every millisecond — is right there.

[Image: Grafana Dashboard]


Bug #2: Lambda Auto-Remediation Broke the Deployment Controller

This one was more subtle.

The Lambda function's job: CloudWatch alarm fires (high 5xx rate) → Lambda detects it → Lambda restarts the unhealthy ECS service.

My first implementation used UpdateService with forceNewDeployment: true. That's the standard approach for restarting an ECS service. It should have worked.

ecs.update_service(
    cluster=cluster_arn,
    service=service_name,
    forceNewDeployment=True  # Throws when CodeDeploy owns the deployment
)

It threw:

InvalidParameterException: Unable to update the service because
a deployment is already in progress

The reason: when an ECS service uses deployment_controller { type = "CODE_DEPLOY" }, AWS hands deployment control entirely to CodeDeploy. UpdateService --forceNewDeployment is incompatible with an active CodeDeploy-controlled service. The two systems conflict.

The correct approach is ecs:StopTask — stop the specific unhealthy task directly:

import boto3

ecs = boto3.client('ecs')

def remediate_unhealthy_task(cluster_arn, service_arn):
    # List running tasks for the service (serviceName accepts a name or ARN)
    tasks = ecs.list_tasks(
        cluster=cluster_arn,
        serviceName=service_arn,
        desiredStatus='RUNNING'
    )['taskArns']

    if not tasks:
        return

    # Stop the first running task
    ecs.stop_task(
        cluster=cluster_arn,
        task=tasks[0],
        reason='Auto-remediation: high error rate detected via CloudWatch alarm'
    )

When a task stops, ECS detects the task count is below desired and launches a replacement. The service recovers. CodeDeploy is never touched. No deployment state corruption.


Bug #3: The Idempotency Problem

Lambda triggered three times for the same alarm window. Three concurrent invocations. Three tasks stopped simultaneously. The service dropped to zero running tasks and couldn't recover fast enough to pass health checks.

The fix: check your own logs before acting.

import os
from datetime import datetime, timedelta

import boto3

logs = boto3.client('logs')

# Set via the Lambda function's environment variables
LOG_GROUP = os.environ['LOG_GROUP']
LOG_STREAM_PREFIX = os.environ['LOG_STREAM_PREFIX']
CLUSTER_ARN = os.environ['CLUSTER_ARN']
SERVICE_ARN = os.environ['SERVICE_ARN']

def check_recent_remediation(log_group, log_stream_prefix, window_minutes=10):
    """Return True if auto-remediation ran successfully in the last N minutes."""
    cutoff = int((datetime.utcnow() - timedelta(minutes=window_minutes)).timestamp() * 1000)

    streams = logs.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=log_stream_prefix,
        orderBy='LastEventTime',
        descending=True,
        limit=5
    )['logStreams']

    for stream in streams:
        events = logs.get_log_events(
            logGroupName=log_group,
            logStreamName=stream['logStreamName'],
            startTime=cutoff
        )['events']

        for event in events:
            if 'Auto-remediation successful' in event['message']:
                return True

    return False

def lambda_handler(event, context):
    if check_recent_remediation(LOG_GROUP, LOG_STREAM_PREFIX):
        print('Recent remediation found — skipping to avoid thrash')
        return {'status': 'skipped', 'reason': 'idempotency_guard'}

    remediate_unhealthy_task(CLUSTER_ARN, SERVICE_ARN)
    print('Auto-remediation successful')

One alarm. One Lambda invocation that acts. All subsequent invocations within 10 minutes exit early. The service gets one clean restart instead of a cascade.

[Image: Jenkins Pipeline Flow]


The Full Pipeline: 11 Stages

The Jenkins pipeline that drives all of this:

  1. Secret Scan (Gitleaks)
  2. Type Check + Lint (TypeScript + ESLint)
  3. Dependency Audit (npm audit)
  4. Code Quality (SonarCloud)
  5. Build Images (Docker, tagged with git SHA)
  6. Image Scan (Trivy CVE detection)
  7. SBOM Generation (Syft — CycloneDX + SPDX)
  8. IaC Scan (Checkov on Terraform)
  9. ECR Push
  10. Task Definition Registration
  11. Blue/Green Deployment (CodeDeploy, 10% traffic per minute)

Security gates first. Deployment last. The same principle from the GitOps project applies here.


What It Looks Like When It Works

A request comes in to the NestJS backend:

  1. OpenTelemetry generates a trace ID and creates a root span
  2. Each downstream call (database query, external HTTP) gets a child span
  3. Pino injects the trace ID into every log line during that request's lifecycle
  4. Prometheus records the request duration in a histogram
  5. If the response is 5xx: Alertmanager routes to Slack with the alarm context
  6. In Slack: I see the alert, click the CloudWatch link, grep for the trace ID, open Jaeger, see the full call graph in under 60 seconds
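Step 4 is worth unpacking: a Prometheus histogram is nothing more than cumulative bucket counters plus a running sum and count. The real backend records this with prom-client in NestJS; this toy Python class is purely illustrative of what one `observe()` call writes:

```python
class MiniHistogram:
    """Toy histogram mirroring Prometheus's cumulative-bucket model."""

    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5)):
        self.buckets = sorted(buckets)        # upper bounds ("le" labels)
        self.counts = [0] * len(self.buckets)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        # Cumulative: increment every bucket whose upper bound covers the value
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1
```

Those cumulative buckets are exactly what Grafana's `histogram_quantile()` consumes to plot p95/p99 latency.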

And if the error rate crosses the threshold:

  1. CloudWatch alarm fires
  2. Lambda checks for recent remediation (idempotency guard)
  3. Lambda stops the unhealthy task
  4. ECS replaces it with a fresh task
  5. Error rate drops
  6. Alarm clears

No manual intervention. No 3am pages.

[Image: Slack Alerts]


Key Takeaways

OTel import order is a hard constraint, not a preference. The SDK must patch Node.js internals before any framework loads. One wrong line breaks the entire tracing setup with no error message to guide you.

ecs:StopTask is the correct remediation call when using CODE_DEPLOY. forceNewDeployment conflicts with the CodeDeploy controller. Stop the task — ECS handles the replacement.

Idempotency in Lambda isn't optional when CloudWatch alarms are your trigger. Alarms fire multiple times. Your remediation function needs to know when it already ran.

Trace ID correlation turns three separate signals into one investigation. Logs, traces, and metrics are each useful in isolation. Together, with the trace ID as the link, they tell the complete story of a request.



What's the most useful observability signal you've added to a production system? Drop it below — I'm building a list of what actually helps vs. what just adds noise. 👇
