
Real-Time ALB Log Analysis for Proactive Integration Recovery via Datadog Monitors, Workflows and AWS Lambda

In the realm of the unknown, sometimes we need to reinvent ourselves and make something we call, here in Brazil, "gambiarra," a.k.a. the art of improvisation.

The Problem:

  • The Drama: Google Calendar events were giving us major side-eye, and we had zero visibility into which users were feeling the pain.
  • The Cause: Errors were mostly tied to channels that were ghosting us or events that couldn't reach the API.
  • The Worst Part: Our ELB/ALB didn't have access logs enabled, so we were totally in the dark about it

The light of reasoning

Why not add access logs? Since we already use Datadog, I decided to add the ALB/ELB access logs there.

This project was built using Pulumi, but no worries: let's add it. If you haven't worked with Pulumi before, I recommend checking out their documentation. It's pretty interesting, as you can manage IaC for many providers using the same library, much like Terraform, but in TypeScript hehe

Adding ALB/ELB Access Logs

📖 What are Access Logs?

The Access Logs are like the security guard's notebook or a paper trail of everything that happens at that intersection.

Every time a car (a request) passes through the Load Balancer, the guard writes down an entry in the notebook. This entry includes all the important "who, what, and when" details.

Each log entry is a line of information that tells a story. It answers questions like:

  • When did the car arrive? (The exact time)
  • Who was driving the car? (The user's IP address)
  • Where were they trying to go? (The web page URL they requested)
  • Did the car get where it was going? (The status code—like a 200 OK means "Yes, all good!" or a 404 means "No, that page doesn't exist.")
  • How long did the whole trip take? (How fast the server responded)
  • Which server did the Load Balancer send them to? (The specific server ID)
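To make that notebook concrete, here's a minimal sketch (just an illustration, not the Datadog pipeline's parser) of pulling those "who, what, and when" fields out of a single ALB log line; the field positions follow AWS's access log entry format:

```typescript
// Illustrative shape for the fields we care about from one ALB access log entry.
interface AlbLogEntry {
    type: string;          // http | https | h2 | ws | wss
    time: string;          // when the "car" arrived
    clientIp: string;      // who was driving
    targetIp: string;      // which server the LB sent them to
    elbStatusCode: number; // did the car get where it was going
    request: string;       // where they were trying to go
}

function parseAlbLogLine(line: string): AlbLogEntry {
    // Tokenize: quoted fields (like the request line) may contain spaces.
    const tokens = line.match(/"[^"]*"|[^\s]+/g) ?? [];
    const unquote = (s: string) => s.replace(/^"|"$/g, "");
    return {
        type: tokens[0],
        time: tokens[1],
        clientIp: tokens[3].split(":")[0],  // field 4 is client:port
        targetIp: tokens[4].split(":")[0],  // field 5 is target:port
        elbStatusCode: Number(tokens[8]),   // field 9 is elb_status_code
        request: unquote(tokens[12]),       // field 13 is the quoted request line
    };
}

// Shortened sample entry (made-up IPs and ARN) in the documented field order.
const sample = `https 2024-05-01T23:40:00.426Z app/main-alb/50dc6c495c0c9188 ` +
    `203.0.113.10:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 ` +
    `"GET https://webhooks.example.com:443/api/v1/health-check HTTP/1.1" ` +
    `"curl/8.0" - - arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg/abc "Root=1-abc"`;

console.log(parseAlbLogLine(sample).elbStatusCode); // 200
```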

Pulumi Code example for a fresh ALB/ELB

Here, we are going to create the ALB/ELB, the target group, and add certificates and listener rules.
Our ALB was already created, so this section just shows you how you would do it from scratch using Pulumi.

First we need to create the ALB/ELB:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx"; // the ApplicationLoadBalancer below comes from awsx

// vpc, securityGroup, lbLogs (the access-logs S3 bucket) and environment are defined elsewhere in the stack

const mainAlb = new awsx.lb.ApplicationLoadBalancer("main-alb", {
    name: "main-alb",
    subnetIds: vpc.publicSubnetIds,
    securityGroups: [securityGroup.id],
    internal: false,
    accessLogs: {
        bucket: lbLogs.id,
        prefix: "access-logs-lb",
        enabled: true,
    }, // the magic happens here xD
    tags: {
        stage: environment,
        managed: "true",
    },
})

Creating the target group:

const webhooksTargetGroup = new aws.lb.TargetGroup("webhooks-tg", {
    port: 80,
    protocol: "HTTP",
    targetType: "ip",
    vpcId: vpc.vpcId,
    healthCheck: {
        enabled: true,
        healthyThreshold: 2,
        unhealthyThreshold: 2,
        interval: 10,
        path: "/api/v1/health-check",
        port: "traffic-port",
    },
});

HTTPS Listener:

// the https listener
const httpsListener = new aws.lb.Listener("httpsListener", {
    loadBalancerArn: mainAlb.loadBalancer.arn,
    port: 443,
    protocol: "HTTPS",
    defaultActions: [{
        type: "fixed-response",
        fixedResponse: {
            contentType: "text/plain",
            statusCode: "404",
            messageBody: "Not Found",
        },
    }],
    sslPolicy: "ELBSecurityPolicy-2016-08",
    certificateArn: myCertificateArn,
});

//additional certificates
new aws.lb.ListenerCertificate("webhooks-attachment", {
    listenerArn: httpsListener.arn,
    certificateArn: webhooksCertificateArn,
});

Listener Rules:

new aws.lb.ListenerRule("rule-webhooks", {
    listenerArn: httpsListener.arn,
    priority: 3,
    actions: [{ type: "forward", targetGroupArn: webhooksTargetGroup.arn }],
    conditions: [{ hostHeader: { values: [`webhooks.${route53ZoneName}`] } }],
});

Pulumi Code Example for an already created ALB/ELB

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// 1. Create S3 bucket for ALB access logs
const albAccessLogsBucket = new aws.s3.Bucket("alb-access-logs", {
    bucket: `alb-access-logs.${environment}.us-east-1.careops`,
    acl: "private",
});

// Bucket policy to allow ELB to write logs
new aws.s3.BucketPolicy("albLogsBucketPolicy", {
    bucket: albAccessLogsBucket.id,
    policy: albAccessLogsBucket.arn.apply((bucketArn) => JSON.stringify({
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Principal: { AWS: "arn:aws:iam::127311923021:root" }, // ELB service account for us-east-1
                Action: "s3:PutObject",
                Resource: `${bucketArn}/*`, // log files are objects under the bucket
            },
            {
                Effect: "Allow",
                Principal: { AWS: "arn:aws:iam::127311923021:root" },
                Action: "s3:GetBucketAcl",
                Resource: bucketArn, // GetBucketAcl applies to the bucket itself
            },
        ],
    })),
});

// 2. Enable access logs on ALB
new aws.lb.LoadBalancerAttributes("mainAlbAccessLogs", {
    loadBalancerArn: mainAlb.loadBalancer.arn,
    attributes: [
        { key: "access_logs.s3.enabled", value: "true" },
        { key: "access_logs.s3.bucket", value: albAccessLogsBucket.bucket },
        { key: "access_logs.s3.prefix", value: "main-alb/" },
    ],
});

... so now you should have a bucket that stores your ALB/ELB traffic logs 🎉

ELB publishes a log file for each load balancer node every 5 minutes. Log delivery is eventually consistent, and the load balancer can deliver multiple log files for the same period, which usually happens when the site has high traffic.

The file names of the access logs use the following format:
bucket[/prefix]/AWSLogs/aws-account-id/elasticloadbalancing/region/yyyy/mm/dd/aws-account-id_elasticloadbalancing_region_app.load-balancer-id_end-time_ip-address_random-string.log.gz

An example with fewer variables:
s3://amzn-s3-demo-logging-bucket/logging-prefix/AWSLogs/123456789012/elasticloadbalancing/us-east-2/2022/05/01/123456789012_elasticloadbalancing_us-east-2_app.my-loadbalancer.1234567890abcdef_20220215T2340Z_172.160.001.192_20sg8hgm.log.gz
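Just to illustrate that key layout (the helper name and return shape are mine, not part of any SDK), you can pull the account, region and date back out of an object key like this:

```typescript
// Hypothetical helper: recover account, region and date from an ALB access log object key.
function parseAlbLogKey(key: string): { accountId: string; region: string; date: string } {
    const m = key.match(
        /AWSLogs\/(\d+)\/elasticloadbalancing\/([a-z0-9-]+)\/(\d{4})\/(\d{2})\/(\d{2})\//
    );
    if (!m) throw new Error(`Unexpected key format: ${key}`);
    const [, accountId, region, year, month, day] = m;
    return { accountId, region, date: `${year}-${month}-${day}` };
}

// The example key from above, minus the s3://bucket/ part.
const key = "logging-prefix/AWSLogs/123456789012/elasticloadbalancing/us-east-2/2022/05/01/" +
    "123456789012_elasticloadbalancing_us-east-2_app.my-loadbalancer.1234567890abcdef" +
    "_20220215T2340Z_172.160.001.192_20sg8hgm.log.gz";

console.log(parseAlbLogKey(key)); // { accountId: '123456789012', region: 'us-east-2', date: '2022-05-01' }
```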

Forwarding the logs to Datadog

Since it's not magic, you need to explicitly ship your logs to Datadog, basically by creating a Lambda that forwards everything:

// 3. Deploy Datadog Forwarder Lambda (using AWS SAM/CloudFormation or manual deployment)
// For Pulumi, you can use aws.cloudformation.Stack or deploy the Lambda directly
const datadogForwarder = new aws.lambda.Function("datadogForwarder", {
    runtime: aws.lambda.Runtime.Python39,
    handler: "lambda_function.lambda_handler",
    role: datadogForwarderRole.arn,
    // Use the official Datadog Forwarder layer or package
    code: new pulumi.asset.AssetArchive({
        ".": new pulumi.asset.FileArchive("./datadog-forwarder"),
    }),
    environment: {
        variables: {
            DD_API_KEY_SECRET_ARN: datadogApiKeySecret.arn,
            DD_SITE: "datadoghq.com",
            DD_TAGS: `env:${environment}`,
        },
    },
});

// 4. Configure S3 event notification to trigger Lambda when new logs arrive
new aws.s3.BucketNotification("albLogsNotification", {
    bucket: albAccessLogsBucket.id,
    lambdaFunctions: [{
        lambdaFunctionArn: datadogForwarder.arn,
        events: ["s3:ObjectCreated:*"],
        filterPrefix: "main-alb/AWSLogs/", // ALB logs path pattern
    }],
});

// Grant S3 permission to invoke Lambda
new aws.lambda.Permission("allowS3InvokeLambda", {
    action: "lambda:InvokeFunction",
    function: datadogForwarder.name,
    principal: "s3.amazonaws.com",
    sourceArn: albAccessLogsBucket.arn,
});

After this setup, you will be able to see something like this:

(Screenshot: ELB access logs showing up in Datadog)

Now our ALB/ELB was spamming error logs, but we still didn't have the tea on who was affected. We were fancy, but low-key clueless, even with our access logs 🫠


The hacky solution

So the idea was to address this lack of observability by adding ALB/ELB Access Logs, but that wasn’t enough to get a full picture of who was experiencing the issue. Hence, the solution was a “hack” on the webhook Google uses to send us events: adding user_id and user_email to the query string when we generate the webhook URL for Google. So yeah, now we have more observability.
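A minimal sketch of that hack, with hypothetical names (buildWebhookUrl and the path are illustrative, not our actual code): when generating the webhook address we register with Google, we tack the identifying fields onto the query string so they land in the ALB access logs.

```typescript
// Sketch: embed user identity in the webhook URL that Google will call back,
// so every request (and every error) in the ALB logs carries user_id/user_email.
function buildWebhookUrl(baseUrl: string, userId: string, userEmail: string): string {
    const url = new URL("/webhooks/google-calendar", baseUrl); // illustrative path
    url.searchParams.set("user_id", userId);
    url.searchParams.set("user_email", userEmail); // URL-encoded automatically
    return url.toString();
}

console.log(buildWebhookUrl("https://webhooks.example.com", "42", "ana@example.com"));
// https://webhooks.example.com/webhooks/google-calendar?user_id=42&user_email=ana%40example.com
```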

Let's monitor it

Basically, now we are grouping errors by user_email, and whenever one occurs, we trigger a Datadog Workflow
Monitoring using logs, but could also be metrics or anomaly detection

The threshold we are using & the workflow ✨

The Druid workflow, which heals itself

Datadog Workflow Automation (often just called "Workflows") is a low-code, drag-and-drop orchestration engine that lets you automate complex, multi-step actions across your entire technology stack, often in response to an alert or an event detected by Datadog monitoring.

Imagine it as a helpful digital buddy that smoothly links your monitoring, communication, and remediation tools, making it easier to stay on top of everything 😅

So, the workflow target could be any one of a massive list of options.

For our case, we are going to put an event on EventBridge to be processed by a Lambda. Serverless, serverless baby (you should read it like the Ice Ice Baby song xD)

// 0. Imports used by this snippet
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as path from "path";
import * as esbuild from "esbuild"; // npm install --save-dev esbuild

// 1. Create IAM Role for Lambda
const lambdaRole = new aws.iam.Role("datadogMonitorProcessorRole", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({
        Service: "lambda.amazonaws.com",
    }),
});

// Attach basic execution role for CloudWatch Logs
new aws.iam.RolePolicyAttachment("lambdaBasicExecution", {
    role: lambdaRole.name,
    policyArn: aws.iam.ManagedPolicies.AWSLambdaBasicExecutionRole,
});

// Optional: Add custom policy if Lambda needs to access other AWS services
new aws.iam.RolePolicy("lambdaCustomPolicy", {
    role: lambdaRole.id,
    policy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Action: [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                Resource: "arn:aws:logs:*:*:*",
            },
        ],
    }),
});

// 2. Build Lambda code from external source
// Lambda code is stored in: pulumi/lambdas/src/datadogMonitorProcessor.ts
const buildDir = path.join(__dirname, "../lambdas/dist");
const lambdaSourcePath = path.join(__dirname, "../lambdas/src/datadogMonitorProcessor.ts");

// Build the Lambda code using esbuild (or your preferred bundler)
// Make sure esbuild is installed: npm install --save-dev esbuild
esbuild.buildSync({
    entryPoints: [lambdaSourcePath],
    bundle: true,
    platform: "node",
    target: "node18",
    outfile: path.join(buildDir, "datadogMonitorProcessor.js"),
    external: ["aws-sdk"], // AWS SDK is available in Lambda runtime
});

// Create Lambda function
const lambdaFunction = new aws.lambda.Function("datadogMonitorProcessor", {
    name: `datadog-monitor-processor-${environment}`,
    runtime: aws.lambda.Runtime.NodeJS18dX,
    handler: "datadogMonitorProcessor.handler", // Matches the exported function name
    role: lambdaRole.arn,
    code: new pulumi.asset.AssetArchive({
        ".": new pulumi.asset.FileArchive(buildDir),
    }),
    timeout: 30,
    memorySize: 256,
    environment: {
        variables: {
            ENVIRONMENT: environment,
        },
    },
    tags: {
        Environment: environment,
        ManagedBy: "pulumi",
        Purpose: "datadog-monitor-processing",
    },
});

// 3. Create EventBridge Rule to capture Datadog monitor events
const eventRule = new aws.cloudwatch.EventRule("datadogMonitorRule", {
    name: `datadog-monitor-rule-${environment}`,
    description: "Capture Datadog monitor alerts via EventBridge",

    // Event pattern to match Datadog events
    // Datadog sends events with source "datadog" and detail-type varies
    eventPattern: JSON.stringify({
        source: ["datadog"],
        "detail-type": [
            "Datadog Monitor Alert"
        ],
        detail: {
             monitor_name: ["Google Calendar Sync"]
        },
    }),

    tags: {
        Environment: environment,
        ManagedBy: "pulumi",
    },
});

// 4. Add Lambda as target for the EventBridge rule
const eventTarget = new aws.cloudwatch.EventTarget("datadogMonitorTarget", {
    rule: eventRule.name,
    arn: lambdaFunction.arn,

    // Optional: Transform the event before sending to Lambda
    // inputTransformer: {
    //     inputPathsMap: {
    //         alertType: "$.detail.alert_type",
    //         monitorName: "$.detail.monitor_name",
    //     },
    //     inputTemplate: JSON.stringify({
    //         alertType: "<alertType>",
    //         monitorName: "<monitorName>",
    //         timestamp: "<aws.events.event.ingestion-time>",
    //     }),
    // },
});

// 5. Grant EventBridge permission to invoke Lambda
new aws.lambda.Permission("allowEventBridgeInvokeLambda", {
    action: "lambda:InvokeFunction",
    function: lambdaFunction.name,
    principal: "events.amazonaws.com",
    sourceArn: eventRule.arn,
});

// 6. Create CloudWatch Log Group for Lambda
const lambdaLogGroup = new aws.cloudwatch.LogGroup("datadogMonitorProcessorLogs", {
    // lambdaFunction.name is a Pulumi Output, so use pulumi.interpolate rather than a plain template literal
    name: pulumi.interpolate`/aws/lambda/${lambdaFunction.name}`,
    retentionInDays: 14,
    tags: {
        Environment: environment,
    },
});
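The post doesn't show the handler body, but datadogMonitorProcessor.ts could look roughly like this. Everything here is a sketch: the event detail fields are based on what Datadog Workflows can send to EventBridge, and resyncCalendarChannel is a hypothetical placeholder for the actual self-healing step (e.g. stopping the dead channel and registering a fresh one via calendar.events.watch).

```typescript
// Minimal shape of the EventBridge event our rule above matches.
interface MonitorEvent {
    "detail-type": string;
    detail: {
        monitor_name: string;
        alert_type: string;   // e.g. "error" | "warning" | "recovery"
        tags?: string[];      // includes the group-by, e.g. "user_email:ana@example.com"
    };
}

// Pull the user_email the monitor grouped by out of the alert tags.
export function extractUserEmail(tags: string[] = []): string | undefined {
    return tags.find((t) => t.startsWith("user_email:"))?.slice("user_email:".length);
}

export const handler = async (event: MonitorEvent): Promise<void> => {
    const { monitor_name, alert_type, tags } = event.detail;
    if (alert_type === "recovery") return; // nothing to heal

    const userEmail = extractUserEmail(tags);
    if (!userEmail) {
        console.warn(`No user_email tag on alert from ${monitor_name}`);
        return;
    }

    console.log(`Re-syncing Google Calendar channel for ${userEmail}`);
    await resyncCalendarChannel(userEmail); // hypothetical self-healing step
};

// Placeholder for the real recovery logic (re-registering the watch channel).
async function resyncCalendarChannel(userEmail: string): Promise<void> {
    // call your API / Google Calendar here
}
```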

The moral of the story? In a fluid, uncertain digital existence, sometimes the only way to establish truth (and a working calendar haha) is through radical improvisation and definitive action 🍻

Pulumi was fun to use, but in my opinion, CDK is still my preferred option, haha
