Prince Ayiku
I Built a Self-Healing Observability Stack on AWS ECS — Here Are the Bugs That Nearly Broke Me


My blue/green deployment rolled back successfully.

I had no idea why.

The CloudWatch alarm fired. CodeDeploy reverted. The Slack alert said "5xx spike." But which service? Which endpoint? Which specific request triggered the cascade? All I had was a timestamp and an alarm name. The system worked exactly as designed — and I couldn't explain what it had just protected me from.

That's when I started this project.


What I Was Actually Missing

I'd built a solid GitOps pipeline by this point: Jenkins security gates, ECS Fargate, blue/green deployments with automatic rollback. The deployment mechanics were production-grade. The observability layer was... three CloudWatch log groups and a feeling.

The stack I built to close that gap:

  • OpenTelemetry auto-instrumentation on the NestJS backend — every HTTP request generates a trace with spans across every service hop
  • Jaeger as the trace backend (receiving via OTLP HTTP on port 4318)
  • Pino structured logging with trace_id and span_id injected into every log line — so a CloudWatch log entry links directly to a Jaeger trace
  • Prometheus scraping custom NestJS metrics (request rate, latency histograms, error counters)
  • Grafana dashboards
  • Alertmanager → Slack for alert routing
  • Lambda auto-remediation — a function that detects high error rates via CloudWatch alarm and autonomously stops unhealthy ECS tasks

[Image: Advanced Observability Stack Architecture]

The goal: when something breaks, I can go from Slack alert → log line → trace → root cause in one click. And if the error rate spikes, the system handles it before I even see the alert.


Bug #1: OpenTelemetry Was Running But Not Working

This was the first thing I got completely wrong.

I installed @opentelemetry/auto-instrumentations-node, wired up the OTLP exporter, pointed it at Jaeger, and ran the app. Zero traces in Jaeger. No error. No warning. Just nothing.

I spent a long time confirming things that weren't the problem: Jaeger was reachable, the exporter config was correct, the SDK was initialising without throwing. Everything looked fine. Nothing was traced.

The problem was import order.

Node.js auto-instrumentation works by monkey-patching built-in modules (http, https, net) at process startup. The patches need to be applied before any other module loads. If NestJS (or Express, or anything) bootstraps first, those modules are already in memory — the patches never apply. The app runs normally but generates no spans.

The fix is one constraint:

// main.ts — THIS ORDER IS MANDATORY

import './tracing';          // Must be FIRST — patches Node.js internals
import { NestFactory } from '@nestjs/core';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3001);
}
bootstrap();

And the tracing.ts initialisation itself:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',  // localhost because ECS awsvpc
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();  // Synchronous — must complete before bootstrap() runs

Note the URL: localhost:4318, not jaeger:4318. ECS Fargate's awsvpc network mode puts all containers in the same task into a shared network namespace. Same-task containers talk on localhost. Docker Compose service names don't resolve here.

After fixing the import order, traces started flowing immediately.


Correlating Logs to Traces

Having traces is useful. Having traces you can find from a log line is the actual goal.

The Pino logger needed a mixin function that reads the active OpenTelemetry span and injects its IDs into every log entry:

import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
  formatters: {
    level(label) { return { level: label }; },
  },
});

Now every log line looks like this:

{
  "level": "error",
  "msg": "Database connection refused",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "timestamp": "2024-01-15T14:23:01.234Z"
}

When a Slack alert fires, I can grep CloudWatch Logs for the trace_id, then paste it directly into Jaeger's search. One click. The full trace — every service, every database query, every millisecond — is right there.

[Image: Grafana Dashboard]


Bug #2: Lambda Auto-Remediation Broke the Deployment Controller

This one was more subtle.

The Lambda function's job: CloudWatch alarm fires (high 5xx rate) → Lambda detects it → Lambda restarts the unhealthy ECS service.

My first implementation used UpdateService with forceNewDeployment: true. That's the standard approach for restarting an ECS service. It should have worked.

ecs.update_service(
    cluster=cluster_arn,
    service=service_name,
    forceNewDeployment=True  # Throws when CodeDeploy owns the deployment
)

It threw:

InvalidParameterException: Unable to update the service because
a deployment is already in progress

The reason: when an ECS service uses deployment_controller { type = "CODE_DEPLOY" }, AWS hands deployment control entirely to CodeDeploy. UpdateService --forceNewDeployment is incompatible with an active CodeDeploy-controlled service. The two systems conflict.

The correct approach is ecs:StopTask — stop the specific unhealthy task directly:

import boto3

ecs = boto3.client('ecs')

def remediate_unhealthy_task(cluster_arn, service_arn):
    # List running tasks for the service (serviceName accepts a name or ARN)
    tasks = ecs.list_tasks(
        cluster=cluster_arn,
        serviceName=service_arn,
        desiredStatus='RUNNING'
    )['taskArns']

    if not tasks:
        return

    # Stop the first running task
    ecs.stop_task(
        cluster=cluster_arn,
        task=tasks[0],
        reason='Auto-remediation: high error rate detected via CloudWatch alarm'
    )

When a task stops, ECS detects the task count is below desired and launches a replacement. The service recovers. CodeDeploy is never touched. No deployment state corruption.


Bug #3: The Idempotency Problem

Lambda triggered three times for the same alarm window. Three concurrent invocations. Three tasks stopped simultaneously. The service dropped to zero running tasks and couldn't recover fast enough to pass health checks.

The fix: check your own logs before acting.

import os
from datetime import datetime, timedelta

import boto3

logs = boto3.client('logs')

# Set via the Lambda function's environment variables
LOG_GROUP = os.environ['LOG_GROUP']
LOG_STREAM_PREFIX = os.environ['LOG_STREAM_PREFIX']
CLUSTER_ARN = os.environ['CLUSTER_ARN']
SERVICE_ARN = os.environ['SERVICE_ARN']

def check_recent_remediation(log_group, log_stream_prefix, window_minutes=10):
    """Return True if auto-remediation ran successfully in the last N minutes."""
    cutoff = int((datetime.utcnow() - timedelta(minutes=window_minutes)).timestamp() * 1000)

    streams = logs.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=log_stream_prefix,
        orderBy='LastEventTime',
        descending=True,
        limit=5
    )['logStreams']

    for stream in streams:
        events = logs.get_log_events(
            logGroupName=log_group,
            logStreamName=stream['logStreamName'],
            startTime=cutoff
        )['events']

        for event in events:
            if 'Auto-remediation successful' in event['message']:
                return True

    return False

def lambda_handler(event, context):
    if check_recent_remediation(LOG_GROUP, LOG_STREAM_PREFIX):
        print('Recent remediation found — skipping to avoid thrash')
        return {'status': 'skipped', 'reason': 'idempotency_guard'}

    remediate_unhealthy_task(CLUSTER_ARN, SERVICE_ARN)
    print('Auto-remediation successful')

One alarm. One Lambda invocation that acts. All subsequent invocations within 10 minutes exit early. The service gets one clean restart instead of a cascade.

[Image: Jenkins Pipeline Flow]


The Full Pipeline: 11 Stages

The Jenkins pipeline that drives all of this:

  1. Secret Scan (Gitleaks)
  2. Type Check + Lint (TypeScript + ESLint)
  3. Dependency Audit (npm audit)
  4. Code Quality (SonarCloud)
  5. Build Images (Docker, tagged with git SHA)
  6. Image Scan (Trivy CVE detection)
  7. SBOM Generation (Syft — CycloneDX + SPDX)
  8. IaC Scan (Checkov on Terraform)
  9. ECR Push
  10. Task Definition Registration
  11. Blue/Green Deployment (CodeDeploy, 10% traffic per minute)

Security gates first. Deployment last. The same principle from the GitOps project applies here.


What It Looks Like When It Works

A request comes in to the NestJS backend:

  1. OpenTelemetry generates a trace ID and creates a root span
  2. Each downstream call (database query, external HTTP) gets a child span
  3. Pino injects the trace ID into every log line during that request's lifecycle
  4. Prometheus records the request duration in a histogram
  5. If the response is 5xx: Alertmanager routes to Slack with the alarm context
  6. In Slack: I see the alert, click the CloudWatch link, grep for the trace ID, open Jaeger, see the full call graph in under 60 seconds
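Step 4 is worth unpacking: a Prometheus histogram is nothing more than cumulative bucket counters plus a running sum and count. The real backend records this with prom-client in NestJS; this toy Python class is purely illustrative of what one `observe()` call writes:

```python
class MiniHistogram:
    """Toy histogram mirroring Prometheus's cumulative-bucket model."""

    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5)):
        self.buckets = sorted(buckets)        # upper bounds ("le" labels)
        self.counts = [0] * len(self.buckets)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        # Cumulative: increment every bucket whose upper bound covers the value
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1
```

Those cumulative buckets are exactly what Grafana's `histogram_quantile()` consumes to plot p95/p99 latency.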

And if the error rate crosses the threshold:

  1. CloudWatch alarm fires
  2. Lambda checks for recent remediation (idempotency guard)
  3. Lambda stops the unhealthy task
  4. ECS replaces it with a fresh task
  5. Error rate drops
  6. Alarm clears

No manual intervention. No 3am pages.

[Image: Slack Alerts]


Key Takeaways

OTel import order is a hard constraint, not a preference. The SDK must patch Node.js internals before any framework loads. One wrong line breaks the entire tracing setup with no error message to guide you.

ecs:StopTask is the correct remediation call when using CODE_DEPLOY. forceNewDeployment conflicts with the CodeDeploy controller. Stop the task — ECS handles the replacement.

Idempotency in Lambda isn't optional when CloudWatch alarms are your trigger. Alarms fire multiple times. Your remediation function needs to know when it already ran.

Trace ID correlation turns three separate signals into one investigation. Logs, traces, and metrics are each useful in isolation. Together, with the trace ID as the link, they tell the complete story of a request.



What's the most useful observability signal you've added to a production system? Drop it below — I'm building a list of what actually helps vs. what just adds noise. 👇
