DEV Community: ajithmanmu

Async Tracing on AWS Lambda: Carrying Context Across SQS with OpenTelemetry

ajithmanmu — Fri, 26 Jun 2026 22:29:43 +0000

Tracing synchronous services is mostly a solved problem. Call a downstream service, the SDK injects headers, the trace connects. One chain, full context.

Async is different. Take a Stripe webhook pipeline: payment.failed comes in, your producer Lambda validates it and puts it on SQS, your consumer processes it — updates the subscription, notifies the user. Standard setup. Now a customer says their payment wasn't handled correctly. You open the trace in CloudWatch. It ends at SQS. Whatever happened on the consumer side — success, failure, partial write — is invisible. You know the event was received. You have no idea what happened next.

SQS doesn't carry trace context. Whatever triggered that message is disconnected from whatever processed it. This is what OpenTelemetry's propagation API is built for. What follows is a practical implementation on AWS Lambda using AWS Distro for OpenTelemetry (ADOT) and CloudWatch Application Signals.

The Problem

In synchronous HTTP-based architectures, tracing just works. When your API calls a downstream service, the SDK automatically injects trace context into the request headers. The downstream service picks it up, and the full call chain appears as one trace.

SQS breaks this. When you call sqs.send_message(), the message goes into a queue. The consumer Lambda runs later — independently, with its own trace ID. There's nothing connecting the two.

Here's what your Application Map looks like without instrumentation:

The consumer Lambda is processing events, writing to DynamoDB — but it's completely invisible to your observability stack. The service map ends at SQS. If a payment event fails processing, you have no idea what happened on the other side of the queue.

If you're new to OpenTelemetry concepts — spans, traces, context propagation — the LFS148 course from the Linux Foundation is worth going through first. It's the foundation this project was built on, and the concepts translate directly to what follows.

The Architecture

For this demo, I built a simplified webhook processing pipeline using AWS CDK:

API Gateway receives the webhook (e.g., a Stripe payment.failed event)
Producer Lambda validates and pushes it to SQS, injecting trace context into message attributes
SQS Queue buffers events with a 5-second batching window
Consumer Lambda pulls from SQS, extracts trace context per message, processes and writes to DynamoDB
ADOT (AWS Distro for OpenTelemetry) layer on both Lambdas handles auto-instrumentation and exports to Application Signals

Stack: Python 3.12, AWS CDK, ADOT Lambda layer (AWSOpenTelemetryDistroPython), CloudWatch Application Signals, X-Ray

The Fix

The fix requires taking matters into your own hands: the producer must explicitly inject the active trace context into the SQS payload, and the consumer has to manually extract it before opening a new span.

Producer: inject trace context

from opentelemetry import propagate

carrier = {}
propagate.inject(carrier)

sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps(payload),
    MessageAttributes={
        'X-Amzn-Trace-Id': {
            'DataType': 'String',
            'StringValue': carrier.get('X-Amzn-Trace-Id', ''),
        },
    },
)

propagate.inject(carrier) writes the current trace context into the carrier dict. With OTEL_PROPAGATORS=xray, the key is X-Amzn-Trace-Id — X-Ray's native format. We put it in the SQS message attributes so it travels with the message.

Consumer: extract and create child span

from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)

def process_message(message):
    carrier = {}
    attrs = message.get('messageAttributes', {})
    if 'X-Amzn-Trace-Id' in attrs:
        carrier['X-Amzn-Trace-Id'] = attrs['X-Amzn-Trace-Id']['stringValue']

    ctx = propagate.extract(carrier)

    with tracer.start_as_current_span('process-webhook', context=ctx) as span:
        try:
            span.set_attribute('webhook.event_id', event_id)
            span.set_attribute('webhook.type', event_type)
            # ... process and write to DynamoDB
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

propagate.extract(carrier) reconstructs the trace context from the message attribute. Passing it as context=ctx continues the upstream trace — this span uses the producer span as its parent when the context is valid.

The span.record_exception() + span.set_status(ERROR) calls are easy to miss but important: without them, a failed message closes its span with status UNSET and your error rate metrics stay at zero. The dashboards look fine while messages are silently failing.

The Batching Paradox

SQS delivers up to 10 messages per Lambda invocation. Each message carries a different X-Amzn-Trace-Id from a different upstream producer.

The wrong approach: extract trace context once at the handler level and apply it to all messages. Every consumer span would get linked to the same (wrong) producer trace.

The right approach: extract per message, inside the loop.

def handler(event, context):
    print(f"[batch] size={len(event['Records'])}")
    batch_item_failures = []
    for message in event['Records']:
        try:
            process_message(message)  # each call extracts its own context
        except Exception as e:
            batch_item_failures.append({'itemIdentifier': message['messageId']})
    return {'batchItemFailures': batch_item_failures}

In production, this gets messy quickly. Lambda will spin up multiple concurrent consumers, each processing a different batch. A burst of 8 events might land as two invocations — batches of 5 and 3 — each correctly fanning spans out to their respective producer traces, because the extraction happens per message.

Graceful Degradation

Production is messy. You're going to get messages from legacy producers, manual CLI retries, or uninstrumented upstream teams — all without X-Amzn-Trace-Id. If your consumer assumes the attribute is always there, it'll blow up.

has_trace_context = 'X-Amzn-Trace-Id' in attrs

if has_trace_context:
    carrier['X-Amzn-Trace-Id'] = attrs['X-Amzn-Trace-Id']['stringValue']
else:
    print(f"[trace] no upstream context for message {message['messageId']}")

ctx = propagate.extract(carrier)

with tracer.start_as_current_span('process-webhook', context=ctx) as span:
    span.set_attribute('trace.has_upstream_context', has_trace_context)

When context is missing, propagate.extract({}) returns an empty context and the span becomes a fresh root span. Processing continues normally. The trace.has_upstream_context=false attribute lets you query Application Signals to see how much of your traffic is still coming from uninstrumented sources.

One more edge case worth knowing: if the upstream trace was not sampled, the producer still injects a context, but it's a no-op — the consumer span becomes a root with no upstream link. In practice, this looks identical to graceful degradation. Sampling decisions happen at the producer; the consumer has no way to tell the difference.

The Result

After instrumentation, a single trace shows the full chain from the original HTTP request all the way to the DynamoDB write:

The "This trace is part of a linked set of traces" badge confirms the async boundary was crossed. The process-webhook → DynamoDB path in the same trace as the producer proves the context propagation worked.

Finding a Specific Event

When something goes wrong, you need to find what happened to a specific event fast. Since we set webhook.event_id as a span attribute, CloudWatch Transaction Search can find it directly.

Go to CloudWatch → Application Signals → Transaction Search, switch to Logs Insights QL, and run:

filter @message like "evt_your_event_id"

One result. It shows the span with all context: which service processed it, duration, whether trace context was present, and the traceId to navigate directly to the full waterfall.

CDK Configuration

A few things that aren't obvious from the AWS docs:

The ADOT layer requires a separate IAM policy. Tracing.ACTIVE in CDK only grants X-Ray permissions. Application Signals needs its own managed policy:

const appSignalsPolicy = iam.ManagedPolicy.fromAwsManagedPolicyName(
  'CloudWatchLambdaApplicationSignalsExecutionRolePolicy',
);
producer.role?.addManagedPolicy(appSignalsPolicy);
consumer.role?.addManagedPolicy(appSignalsPolicy);

Environment variables that matter:

environment: {
  AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-instrument',
  OTEL_SERVICE_NAME: 'webhook-producer',
  OTEL_AWS_APPLICATION_SIGNALS_ENABLED: 'true',
  OTEL_PROPAGATORS: 'xray',
  OTEL_METRICS_EXPORTER: 'none',
}

OTEL_PROPAGATORS=xray switches the context format from W3C traceparent to X-Ray's native X-Amzn-Trace-Id format — consistent with what the Lambda runtime and API Gateway already use.

Don't set OTEL_TRACES_EXPORTER=xray. This looks right but causes a RuntimeError: Requested component 'xray' not found at cold start. The ADOT Lambda layer doesn't expose an xray exporter entry point the way you'd expect — Application Signals handles the export path internally. Setting this env var tries to load a component that doesn't exist in this layer.

Key Takeaways

Async boundaries don't preserve trace context in any reliable way. Most AWS async services (SQS, SNS, Kinesis, EventBridge) don't propagate it end-to-end. Assume you need to handle it yourself.
Extract context per message, not per batch. If you extract once at the handler level, every message in the batch links to the same (wrong) producer trace. Individual event traceability breaks down silently.
Messages without trace context will arrive. Legacy producers, manual retries, uninstrumented teams — design for it from the start. Also: if the upstream trace wasn't sampled, the consumer span becomes a root regardless. Sampling decisions happen upstream.
Mark failing spans as errors explicitly. span.record_exception() + span.set_status(ERROR) are what make Application Signals register a processing failure as an error. Without them, error rates look fine while messages are failing.
Silent OTEL misconfiguration is worse than a crash. If OTEL fails to initialize, your Lambda runs normally but exports nothing. The customer sees success — and your observability stack shows a gap you can't explain.

Resources

GitHub repo — full CDK stack, Lambda code, and test scripts
LFS148: Getting Started with OpenTelemetry — the Linux Foundation course this project was built on; good starting point if you're new to OTel concepts
AWS ADOT Lambda layer docs
CloudWatch Application Signals
OpenTelemetry Python propagators
OTel Span Links — the spec-compliant alternative to parent-child for async trace stitching

When Stripe's Built-In Dunning Isn't Enough

ajithmanmu — Wed, 13 May 2026 14:04:02 +0000

When Stripe's Built-In Dunning Isn't Enough

Every subscription business eventually hits the same silent killer: involuntary churn. A customer wants to keep paying you, but their card fails. RevenueCat's State of Subscriptions 2026 report highlights payment failures as one of the leading causes of involuntary churn across subscription apps — and that's not just mobile. For Stripe-based web subscriptions, the same problem exists and the default tooling leaves a lot of room for improvement.

Stripe handles the basics — it retries on a schedule and sends a generic email. For an early-stage startup, that's enough.

But once revenue starts to matter, Stripe's defaults start to feel blunt.

A $500/month VIP customer gets the same retry schedule as someone on a free trial. A stolen card — which will never succeed — gets retried the same way as a card that temporarily had insufficient funds. And there's no clean way to know what's happening with any specific customer's failed payment without digging through raw logs.

By sticking to the defaults, you're leaving money on the table.

I built a custom payment recovery system on AWS to fix this. Instead of a one-size-fits-all retry loop, it gives every failed payment its own isolated workflow — one that understands who the customer is, why the payment failed, and exactly how hard to fight before giving up.

The problem with Stripe's default dunning

Stripe's built-in dunning has four real blind spots:

1. Every customer gets the same treatment

A long-time VIP and a day-one trial user both hit the same retry schedule. That's not how you'd handle it manually. High-value customers deserve more patience. Trial users with unproven intent to pay deserve less.

2. No understanding of why the payment failed

Most engineers look at a failed payment and just see "Failed." But the bank usually tells you exactly why — and that reason changes everything.

Expired card → worth retrying
Temporarily insufficient funds → retry in a few days
Stolen card → retrying is pointless, and repeatedly hitting a stolen card can hurt your reputation with card networks

Stripe retries regardless.

3. Retrying hard declines damages your standing

Stolen cards, lost cards, do-not-honor codes — these will never succeed. Retrying them wastes money on Stripe fees and signals to card networks that you're not doing basic fraud hygiene. This can affect your authorization rates over time.

4. No per-customer audit trail

There's no single place that tells you: "This customer failed on Day 0, got retried on Day 3, received an email here, and will be cancelled on Day 14." You piece it together from logs. That makes debugging and support harder than it should be.

The architecture

The entire infrastructure is provisioned with Terraform — API Gateway, Lambda functions, Step Functions, DynamoDB tables, IAM roles, and Secrets Manager. The system sits between Stripe and your business logic:

Stripe → API Gateway → webhook-handler Lambda → Step Functions
                               ↕
                           DynamoDB
                   (idempotency, customer tier, outcomes)

Stack:

Terraform — all infrastructure as code, fully reproducible
API Gateway — receives invoice.payment_failed webhooks from Stripe
webhook-handler Lambda — validates the Stripe signature, reads the decline reason, starts a Step Functions execution
enricher Lambda — looks up the customer's tier in DynamoDB, defaults to standard if no record exists
payment-checker Lambda — polls Stripe to see if the invoice was paid between retries
canceller Lambda — cancels the subscription and writes the outcome to DynamoDB
DynamoDB — idempotency table (prevents double-processing), customer tier table, dunning-state table (final outcomes)

Every failed payment gets its own Step Functions execution. One customer, one workflow, fully isolated.

Hard vs. soft declines

The first thing the webhook handler does is read charge.outcome.reason from the Stripe charge object. This is one of the most under-used data points in the Stripe API. Most people just see "payment failed" — but the bank tells you exactly why, and that signal is what drives the routing logic.

const charges = await stripe.charges.list({ customer: invoice.customer, limit: 1 });
const charge = charges.data[0];
failureCode = (charge.outcome?.reason ?? charge.failure_code) || 'unknown';

Hard decline codes — stolen_card, lost_card, do_not_honor, pickup_card — go straight to cancellation. No retries, no waiting.

Everything else is a soft decline and enters the retry flow.

The four recovery paths

Once the decline type and customer tier are known, the Step Functions state machine routes to one of four paths:

Scenario	Decline type	Retry schedule	Total window
VIP	soft	Day 1, Day 3, Day 7	7 days
Trial	soft	Day 1, Day 3	3 days
Standard	soft	Day 3, Day 7, Day 14	14 days
Hard decline	hard	None	Immediate cancel

Why the trial window is short: if the first real payment fails after a trial ends, the intent to pay is unproven. You don't want to provide weeks of free access to someone who signed up with a card they never planned to use.

Why VIP gets more time: these customers have demonstrated value. A payment failure is more likely a temporary issue — card replacement, bank fraud hold — than an intent to leave.

The key insight: Step Functions as a per-user scheduled task

The most common engineering approach to retry logic is a database table and a cron job: every hour, scan for rows where next_retry < NOW(). It works, but it's painful to debug. If one customer's retry logic breaks, you're sifting through thousands of rows to find it.

Step Functions changes the mental model entirely.

An execution can sit dormant at a Wait state for days. In production, the wait durations are real:

86,400 seconds = 1 day
259,200 seconds = 3 days
604,800 seconds = 7 days

While an execution is waiting, it costs nothing. When the timer expires, it picks up exactly where it left off.

This means every failed payment is its own isolated process — like a tiny scheduled program running just for that customer. If a PM asks "what's happening with Customer X right now?", you open the Step Functions console, find their execution by name, and see exactly which state they're in and when the next action fires. You can stop that one execution, skip a step, or inspect the full history — without touching anyone else.

No cron jobs. No polling. No database flags tracking retry state.

Seeing it in action

To demo all four paths, I wrote a script that creates real Stripe customers with real subscriptions and triggers genuine payment failures:

node scripts/trigger-failure.js all

The script creates a fresh Stripe customer for each scenario, seeds DynamoDB with the correct tier, attaches a subscription to the Premium or Starter product, and confirms the invoice's PaymentIntent with a test card that declines. For hard decline, it uses pm_card_chargeDeclinedStolenCard which produces outcome.reason: stolen_card on the charge — exactly what the webhook handler reads.

Here's how the customer looks in Stripe — a real subscription with a real failed invoice, not a mocked event:

Each scenario kicks off its own Step Functions execution, named by customer and decline type so they're readable at a glance:

The routing difference is visible in the execution graphs. VIP takes the full retry path:

Hard decline goes straight to cancel — no wait states, done in under a second:

Every outcome is recorded in DynamoDB:

This pattern extends beyond dunning

The most underrated part of this architecture is that dunning is just the first workflow you plug in.

The webhook handler doesn't care what event it receives. The same infrastructure — API Gateway, Lambda, Step Functions — can orchestrate any Stripe event into any workflow:

customer.subscription.deleted — customer cancels. Trigger a win-back flow: wait 3 days, send a discount offer, wait a week, send a final email, archive the record.
customer.subscription.trial_will_end — Stripe fires this exactly 3 days before a trial ends. Start a conversion sequence: "here's what you'll lose", followed by a discount offer, followed by a day-before reminder.
invoice.payment_succeeded after a failure — payment recovered. Send a confirmation, re-activate paused features, notify the account team for high-value customers.
charge.dispute.created — chargeback filed. Immediately alert the team, auto-gather evidence, flag the customer record.

Each of these is a separate state machine sharing the same webhook infrastructure. You're not building a dunning system — you're building a Stripe event orchestration layer. Dunning is just the first workflow you wire up.

What you'd add in production

The current system handles the retry logic. A production deployment would layer on:

Tiered messaging at each retry point

Not one generic email after 14 days, but a different message at each step:

Day 1: "There was an issue with your payment — here's how to fix it"
Day 7: "Your access will end in 7 days"
Final step for VIPs: offer a one-month discount before cancelling

Because each Wait state is followed by a Lambda invocation, adding email is just wiring up SES or your email provider at each step.

Pause instead of cancel

Cancellation is destructive — it might delete data or revoke API access. In many cases it's better to pause the account: the customer can still log in and see their data, but can't take new actions. This keeps the door open for them to return without the friction of a full account reset.

Alerts for hard decline spikes

If 50 customers hit stolen_card errors within ten minutes, that's not a billing issue — it's likely a card-testing attack or a bug in your integration. A CloudWatch alarm feeding into Slack or PagerDuty lets a human intervene before the system mass-cancels legitimate customers.

Discount offer state before cancelling VIPs

Add one more state before CancelSubscription for VIP customers: offer a one-month discount. This is low effort and high impact — the state machine makes it a two-line addition.

What building this changed

The biggest shift wasn't technical — it was how I think about background jobs.

Before this, my mental model was: "How do I schedule retries?" Which leads to cron jobs, retry tables, and scattered logic.

After building this, the question became: "What's the lifecycle of this event?" Then model that lifecycle directly as a state machine.

Once you see a failed payment as its own workflow rather than a row in a retry queue, the whole design becomes cleaner. The complexity lives in the workflow definition, not in scattered application code.

Is it worth building?

Stripe's defaults are designed for the average business. But your customers aren't average — a VIP paying $500/month deserves a different recovery experience than a trial user on day one.

This system pays for itself the first time a VIP customer's card is successfully retried on Day 7 — a retry that Stripe's default schedule might have already given up on.

If you're running subscriptions at any meaningful scale, go check your Stripe dashboard. Look at how many hard declines you're retrying, and how many VIPs you're losing on the same schedule as trial users. The data is usually enough to make the case for building something better.

The full project — Terraform, all four Lambdas, the Step Functions definition, and the demo trigger script — is on GitHub: https://github.com/ajithmanmu/dunning-system

I'm an AWS Community Builder focused on serverless and subscription infrastructure. If you found this useful, follow for more posts on building production systems on AWS.

AWS DevOps Agent: From Setup to First Real Investigation (And the Gotchas in Between)

ajithmanmu — Mon, 04 May 2026 13:29:32 +0000

When an alert fires at 2am, the first 15 minutes aren't spent fixing anything. They're spent gathering context — opening CloudWatch, checking New Relic, scanning recent deploys, pulling up the runbook. By the time you actually understand what's happening, you've burned the fastest part of your response window.

AWS DevOps Agent is designed to do that first pass for you. It's an agentic AI system that investigates incidents autonomously — querying your telemetry, correlating deploy events, and posting findings to Slack before you've finished reading the page notification. I set it up as a PoC for our team recently and wanted to share what the setup actually looks like, including the things that tripped me up.

What the Agent Actually Does

The DevOps Agent sits between your alerting layer and your engineers. When a P1/P2 alert fires, it automatically:

Reads CloudWatch logs and metrics
Queries New Relic for transaction data, error rates, span-level telemetry
Correlates recent deploy events with the incident timeline
Posts an investigation summary to Slack with observations ranked by relevance

Engineers arrive with context rather than a blank slate. That's the core value proposition.

Setup Overview

The setup has a few distinct pieces: creating the Agent Space, wiring up New Relic, connecting GitHub, connecting Slack, and uploading a custom Skill. Here's how each went.

Creating the Agent Space

The first thing I hit: us-west-1 is not a supported region. The DevOps Agent is GA but only available in a subset of regions — us-east-1, us-west-2, ap-southeast-2, ap-northeast-1, eu-central-1, and eu-west-1 as of this writing.

The good news is you don't have to move your infrastructure. Creating the Agent Space in us-west-2 still gives it access to resources in us-west-1 (and other regions) through cross-region discovery. During initial setup it mapped over 4,000 entity relationships automatically — services, ECS clusters, ALBs, Lambda functions — without any manual configuration.

One thing worth knowing: there are two interfaces. The AWS console is for admin and setup only. The actual investigation experience — running manual investigations, viewing topology, reading findings — happens in a separate web app, accessed via an IAM auth link from the console. I spent longer than I'd like to admit looking for things in the wrong place.

New Relic Integration: Telemetry Source vs Capability Webhook

New Relic integration is two separate things and it's easy to confuse them:

Telemetry Source — the agent can read your New Relic data during an investigation (transaction metrics, spans, etc.)
Capability Webhook — New Relic can trigger investigations automatically when an alert fires

You need both. The Telemetry Source requires a Full Platform User API key — a regular API key won't work and you'll get a 403 with no helpful error message. If you don't have admin access to New Relic, flag this upfront so you're not blocked waiting on someone else.

Once connected, I created a New Relic workflow pointing at the Agent Space webhook with a custom payload template. This is where I spent most of my debugging time.

Fixing the Payload Template

New Relic's alert payloads don't match what the DevOps Agent expects out of the box. Two fixes were required:

Action field casing. New Relic sends uppercase — ACTIVATED, CLOSED. The DevOps Agent expects lowercase — created, closed. Fix with a conditional in the template:

{{#eq state "ACTIVATED"}}created{{/eq}}
{{#eq state "CLOSED"}}closed{{/eq}}

Service field type. New Relic sends the entity name as an array ["My Service"]. The agent expects a string. Fix:

{{ json entitiesData.names.[0] }}

Neither of these is documented anywhere obvious. I found them by running a test investigation and reading the agent's error observations.

Scoping with Alert Policies

Rather than exposing every alert to the agent, I scoped it to specific New Relic alert policies covering the services my team owns, filtered to P1/P2 priority only. This keeps investigation volume (and cost) predictable. Each team can have their own New Relic workflow pointing to the same webhook destination — clean separation, shared infrastructure.

GitHub Integration

The agent supports GitHub integration for correlating deploy events with incident timelines — useful for the most common first question in any incident: "was this caused by a recent deploy?" Setup is a two-step GitHub App install (account-level, then repo-level) with read-only access.

Once connected, the agent can pull recent commit and deploy activity and factor it into its investigation. If your team uses New Relic deployment markers already, there's some overlap — but the GitHub integration adds code-level context that telemetry alone doesn't give you.

Slack Integration

Straightforward — register Slack as a capability provider at the account level, then enable it per Agent Space. I'd recommend creating a dedicated channel for investigations rather than posting to a general engineering channel. The agent posts detailed findings with multiple observations; it's noisy in context, but useful in its own space.

Custom Skill: What to Put In and What to Leave Out

This is the part most writeups skip over, but it's where you get the most leverage.

The agent already knows a lot from topology mapping — your service names, ECS cluster layout, log group names, ALB configuration. Don't repeat any of that in the Skill. What the agent can't discover on its own is everything outside your AWS account. That's what the Skill is for.

Here's the structure I used:

1. Investigation priority order

Tell the agent explicitly what to check first. Left to its own judgment, it will start with internal metrics. But in practice, a large percentage of incidents are caused by third-party outages — and finding that early saves everyone time.

1. Check external dependency status pages first
2. Check for recent deployments
3. Analyze internal metrics (CloudWatch, New Relic error rates, traces)
4. Correlate across sources and build a timeline

2. External dependency status pages

Organize these by category so the agent knows which ones are relevant to which type of incident:

Infrastructure: your CDN, DNS, cloud provider health dashboard
Payments: payment processor, app store IAP status pages
Monitoring: your APM tool, analytics platforms
Content/third-party: any external APIs or services your product depends on

The agent will check the ones relevant to the alert context and include their status in its findings. This alone has caught third-party outages that would have taken significantly longer to identify manually.

3. Runbook pointers

The agent can't access Confluence or internal wikis directly. But you can give it the URLs and it will surface the right one in its Slack output so the on-call engineer can jump straight to it. Map them by service or incident type.

4. Communication guidelines

Tell the agent how to structure its Slack findings:

Lead with the root cause or most likely hypothesis
Include a timeline of key events with timestamps
Reference specific metrics and thresholds
If an external dependency is down, state it clearly
End with recommended next steps for the on-call engineer

This makes the agent's output immediately actionable rather than a wall of observations to interpret.

End-to-End Test

To validate the setup before real alerts, I created a test alert policy with a condition that would fire reliably (p95 response time > 10ms on a service where baseline is ~60ms).

The first triggered investigation ran for about 15 minutes and produced 23 observations. The agent correctly identified it as a false positive (threshold was too low). What it also surfaced — unprompted — was a real issue: ECS task churn on a different service, with recurring task stops every few hours and a 2-hour replacement delay. That had nothing to do with the test alert. It just noticed it while reviewing CloudWatch events.

A second test a few hours later triggered faster — the agent recognized the same condition from the previous investigation and completed the analysis more quickly.

Incident deduplication also worked as expected: when two alerts for the same condition fired 80 seconds apart, the agent linked the second to the first investigation rather than starting a duplicate. This matters for high-severity incidents where multiple conditions often trigger simultaneously.

Cost

Pricing is $0.0083/second (~$0.50/minute), billed only during active investigations. At roughly 6–7 incidents per month averaging 10–12 minutes each, that comes to around $40–50/month. There's a 2-month free trial (20 hours of investigation time) — useful for validating the PoC without any spend pressure.

One gap: at the time of setup, the DevOps Agent service didn't appear as a filterable dimension in AWS Budgets or Cost Anomaly Detection. I have the CLI commands ready for when charges start showing up in billing. If you're cost-conscious (and you should be), wire up AWS Budgets with a monthly threshold as soon as the service appears in your billing dimensions.

What I'd Tell Someone Setting This Up

Check your region first. Don't create the Agent Space in an unsupported region and wonder why nothing works.
Get the New Relic Full Platform User API key sorted before you start. It's a blocker and it requires admin access.
Test the payload template early. Run a manual investigation from the web app, read the error observations, and fix the action casing and service field before connecting real alert policies.
Scope narrowly to start. P1/P2 only, your team's services only. Expand once you've seen a few real investigations.
Build the Skill around what the agent can't discover. Status pages, runbook links, investigation priority order, communication guidelines — not infrastructure details it already knows from topology mapping.

Is It Worth It?

For a team that handles production incidents regularly, yes. The setup is half a day of work. The ongoing maintenance is minimal. And the value isn't the AI itself — it's that the first 15 minutes of an incident are done by the time you open your laptop.

The agent isn't perfect. It sometimes surfaces obvious things you'd have found in 30 seconds. But on a bad night with a pager, "obvious with context already in Slack" is more useful than "obvious after you've dug through four different tools."

AWS Community Builder — Serverless category. Questions or feedback welcome in the comments.

I Built a Usage-Based Billing Engine From Scratch — Here's How It Works

ajithmanmu — Mon, 23 Feb 2026 03:49:32 +0000

I spent the last few weeks building MeterFlow — a usage-based billing engine that handles event ingestion, deduplication, aggregation, fraud detection, tiered pricing, and Stripe invoice generation.

This post walks through the technical decisions behind each component, including how the architecture maps to a production AWS deployment.

Why Build This?

I work on subscription infrastructure at my day job — Stripe integrations, webhook handlers, entitlement APIs. But I wanted to understand how billing platforms like Lago, Metronome, and Stripe Billing work internally. Not just calling the API, but building the metering and pricing layer myself.

MeterFlow covers the full lifecycle:

Events → Dedup → Store → Aggregate → Price → Invoice → Stripe

Stack: TypeScript, Fastify, Redis, ClickHouse, MinIO (S3-compatible), Docker Compose.

Architecture

┌──────────────┐
│   Fastify     │
│   API         │
│               │
│ POST /events  │──┬──────────────────────┐
│ GET /usage    │  │                      │
└───────────────┘  │                      │
                   ▼                      ▼
          ┌──────────────┐      ┌──────────────┐
          │ Redis         │      │     MinIO     │
          │               │      │   (S3 backup) │
          │ • Dedup (NX)  │      │               │
          │ • Rate Limit  │      │ Raw events    │
          │ • Auth keys   │      │ (append-only) │
          │ • Fraud bases │      └───────────────┘
          └───────┬───────┘
                  │
                  ▼
          ┌──────────────┐      ┌──────────────┐
          │  ClickHouse   │      │    Stripe     │
          │               │      │               │
          │ • Events      │      │ Draft invoice │
          │ • Aggregation │      │ Line items    │
          │ • Analytics   │      │ Finalize+send │
          └───────────────┘      └───────────────┘

Events come in through the API, get deduplicated via Redis, stored in ClickHouse for analytics, and backed up to S3. When billing runs, the system aggregates usage, applies pricing rules, and builds Stripe invoice payloads.

Every component was chosen with a clear production equivalent in mind — MinIO maps to S3, Redis to ElastiCache, the Fastify server to API Gateway + Lambda. More on that in the production architecture section below.

1. Event Ingestion & Deduplication

Billing systems can't double-count. If a client retries a request, we need to reject the duplicate without rejecting new events.

The approach: use Redis SET NX (set-if-not-exists) with a 30-day TTL. The transaction ID is the key.

const key = `dedup:${transaction_id}`;
const result = await redis.set(key, '1', 'EX', 2592000, 'NX');
// 'OK' → new event (accepted)
// null → duplicate (rejected)

This is atomic — two identical events hitting Redis simultaneously, only one wins. No race condition, no distributed lock needed.

The 30-day TTL matches the validation window. Events older than 30 days are rejected by business logic anyway, so dedup keys auto-expire.

For batch ingestion (up to 1,000 events/request), I pipeline the Redis calls so the entire batch is one round-trip. The API validates each event's schema, checks for required fields (customer_id, event_type, transaction_id, timestamp), and rejects events with timestamps outside the 30-day window before they even reach Redis.

Accepted events are then written to both ClickHouse (for querying) and MinIO/S3 (as an append-only backup). The S3 backup is organized by date (events/YYYY-MM-DD/batch_timestamp.json), giving you a full audit trail that's independent of the analytics store.

2. Rate Limiting with Sorted Sets

Standard token bucket has a boundary problem — 200 requests at 0:59 and 200 at 1:01 both pass their respective windows, but that's 400 in 2 seconds.

I used Redis sorted sets for a true sliding window:

const key = `ratelimit:${customer_id}`;
const now = Date.now();
const windowStart = now - 60000; // 60-second window

await redis.pipeline()
  .zremrangebyscore(key, 0, windowStart)  // Remove old entries
  .zadd(key, now, `${now}:${uuid}`)       // Add current request
  .zcard(key)                              // Count in window
  .expire(key, 120)                        // Safety TTL
  .exec();

Each request is a member with its timestamp as the score. To check the limit, remove anything older than 60 seconds, count what's left. The response includes X-RateLimit-Remaining so clients know where they stand.

In production, this pipeline approach could allow slight over-counting under high concurrency. A Lua script wrapping the same sorted set logic (ZREMRANGEBYSCORE → ZADD → ZCARD → EXPIRE) would execute atomically on the Redis server. You'd also want two layers: API Gateway throttling for coarse IP-based protection, and the application layer for fine-grained per-customer limits tied to billing tiers.

3. Billable Metrics Catalog

Rather than hardcoding what's billable, I use a catalog that maps raw events to billable quantities:

Metric	Event Type	Aggregation	Property
api_calls	api_request	COUNT	—
bandwidth	api_request	SUM	bytes
storage_peak	storage	MAX	gb_stored
compute_time	compute	SUM	cpu_ms

The usage query engine reads this catalog and builds the appropriate ClickHouse query dynamically. Adding a new metric means adding one config entry — no query changes.

// COUNT → SELECT count() FROM events WHERE ...
// SUM   → SELECT sum(JSONExtractFloat(properties, 'bytes')) WHERE ...
// MAX   → SELECT max(JSONExtractFloat(properties, 'gb_stored')) WHERE ...

ClickHouse is a good fit here because it's columnar — SUM(bytes) FROM events only reads the bytes column, not the entire row. But it's append-optimized, so you don't want to update or delete individual rows. For billing, that's fine — events are immutable.

4. Tiered Pricing Calculation

MeterFlow supports flat and tiered pricing. Tiered is the interesting one — the system walks through tiers progressively:

API Calls Pricing:
  Tier 1: 0–1,000     → $0.00/call (free tier)
  Tier 2: 1,001–10,000 → $0.001/call
  Tier 3: 10,001+      → $0.0005/call

For 15,000 API calls:

Tier 1: 1,000 × $0.00   = $0.00
Tier 2: 9,000 × $0.001  = $9.00
Tier 3: 5,000 × $0.0005 = $2.50
Total: $11.50

The pricing engine consumes quantity through each tier until it's exhausted. All amounts are converted to cents before hitting Stripe — billing systems should never do floating-point math on final amounts.

5. Fraud Detection

This is where MeterFlow goes beyond basic metering. It uses a two-layer approach to catch both volume anomalies and pattern anomalies.

Layer 1: Z-Score Volume Detection

Compare current usage against a 30-day baseline:

Z = (current_value - mean) / stddev

If |Z| >= 3 (three standard deviations), flag it. This catches obvious spikes — someone hammering your API 10x more than normal.

But it misses a critical attack vector: same volume, different pattern.

Layer 2: Cosine Similarity Pattern Detection

A stolen API key might generate the same number of calls per day, but at completely different hours. Z-score wouldn't catch this because the volume is normal.

The approach:

Build baselines — process 30 days of history into per-weekday, 24-dimensional hourly vectors (Mondays have different patterns than weekends)
Normalize — divide by the sum so we're comparing shape, not volume
Compare with cosine similarity — 1.0 means identical pattern, below 0.9 triggers a flag

Normal Tuesday:    [0.01, 0.01, ..., 0.15, 0.15, ..., 0.02]
                   (quiet at night, peaks 9am-5pm)

Stolen key usage:  [0.15, 0.15, ..., 0.01, 0.01, ..., 0.15]
                   (peaks at night — attacker in different timezone)

Cosine similarity: ~0.28 → FRAUD DETECTED

The volume is identical. The Z-score is normal. But the pattern is inverted. Cosine similarity catches it immediately.

Baselines are stored in Redis with 90-day TTL. The detection runs per-customer, per-metric, with separate weekday profiles. The system includes a dashboard that visualizes normal usage patterns vs. detected anomalies:

When fraud is detected, the dashboard highlights the anomaly with the cosine similarity score. In this example, a customer's pattern dropped to 30.2% similarity against their baseline — a clear sign of compromised credentials:

6. Stripe Integration

The billing endpoint builds complete Stripe API payloads following the full invoice lifecycle: create draft invoice, add line items per metric (with tier breakdowns in metadata), finalize, and send.

Each operation uses an idempotency key derived from the invoice ID and billing period:

const idempotencyKey =
  `meterflow_${invoiceId}_${customerId}_${periodStart}_${periodEnd}`;

This ensures that retries, duplicate triggers, or manual re-runs don't double-charge customers. Stripe rejects duplicate requests with the same key within 48 hours.

For the demo, this runs in dry-run mode — payloads are built but not sent to Stripe. Swapping to live is a one-line change from payload builders to actual SDK calls.

7. Production Architecture on AWS

Every local component in MeterFlow was designed with a clear AWS production mapping. Here's how the demo stack translates:

Demo (Local)	Production (AWS)	Why
Fastify server	API Gateway + Lambda	Auto-scaling, managed TLS, WAF
Redis (single)	ElastiCache (Redis Cluster)	HA, automatic failover
ClickHouse (Docker)	ClickHouse Cloud	Managed, scalable, VPC peering
MinIO	S3	Lifecycle policies, cross-region replication
In-process sync	Kinesis	Async buffer, back-pressure, replay
cron / manual	EventBridge	Managed scheduling, reliable triggers
Console logs	CloudWatch + SNS	Alerting, dashboards, PagerDuty

Async Ingestion with Kinesis

The biggest architectural shift for production is decoupling ingestion from processing. In the demo, the pipeline is synchronous: validate → dedup → store → backup, all in one request. In production, you'd buffer through Kinesis:

Client → API Gateway → Lambda (validate + dedup) → Kinesis → Lambda (store to ClickHouse)
                                                         ↘ S3 (backup)

Kinesis gives you ordered delivery within a shard (partition by customer_id), replayability if a downstream consumer fails, and natural back-pressure through shard limits. Clients receive 202 Accepted immediately instead of waiting for the full pipeline.

Failed batches route to an SQS dead letter queue for investigation and replay. The dedup layer (Redis SET NX) works the same way regardless of whether an event arrives via HTTP or Kinesis — duplicates are caught either way.

Scheduled Jobs with EventBridge

Billing, anomaly detection, and fraud baseline rebuilds all become scheduled jobs:

EventBridge (1st of month)  → Lambda → Aggregate usage → Stripe API (invoicing)
EventBridge (hourly/daily)  → Lambda → Z-score + cosine similarity → SNS (alerts)
EventBridge (weekly)        → Lambda → Rebuild fraud baselines → Redis

The detection logic itself (checkAnomaly(), checkFraud()) would be reused as-is from the demo — it already takes parameters for baseline window and threshold. The change is just in how it's triggered and where alerts go (SNS → Slack/PagerDuty instead of console logs).

State and Alerting

DynamoDB handles billing state (invoice status, anomaly records with TTL for auto-cleanup). SNS topics route to email, Slack, or PagerDuty based on alert severity. CloudWatch dashboards provide real-time visibility into ingestion rates, error rates, and billing job status.

What I Learned

Deduplication is deceptively simple. SET NX solves it cleanly, but the hard part is deciding what the dedup window should be and how to handle events that arrive after the window closes.

Billing math needs to happen in cents. Floating-point rounding will bite you. Convert to integers as early as possible.

Pattern-based fraud detection is more useful than volume-based. Sophisticated attackers will stay under volume thresholds. They can't easily replicate a customer's hourly usage pattern.

Design for the production target early. Using MinIO instead of a local filesystem, Redis instead of in-memory maps, and S3-compatible APIs from the start meant every component has a clear AWS upgrade path. The business logic doesn't change — only the infrastructure layer.

Try It

The entire system runs locally with Docker Compose:

git clone https://github.com/ajithmanmu/meterflow
cd meterflow
docker compose up -d
pnpm install && pnpm dev

There's a validation script that runs 60 end-to-end checks across all components, and demo scripts that simulate 30 days of normal usage followed by fraud injection so you can see the detection in action.

GitHub: github.com/ajithmanmu/meterflow

Inspired by Lago's open-source billing platform.

I'm a backend engineer building subscription and payment infrastructure. If you're working on billing systems or usage-based pricing, I'd love to hear about your approach.

Building a Webhook Replay System with AWS Kinesis

ajithmanmu — Thu, 15 Jan 2026 20:09:50 +0000

The Problem

Payment webhooks from Stripe, Apple, and Google are revenue-critical, but they're tricky to handle correctly. Events arrive out of order, can be duplicated, and if your processing logic has a bug, you can corrupt subscription state with no way to recover.

I built a webhook broker that treats Kinesis Data Streams as an immutable event log. When things go wrong, you can replay events and rebuild subscription state from scratch.

Architecture Overview

Here's how the system works:

Payment Providers → API Gateway → Lambda (Ingestion)
                                       ↓
                                  Kinesis Stream (7-day retention)
                                       ↓
                                  Lambda (Processor)
                                       ↓
                                  DynamoDB (State + Idempotency)

AWS Services Used

API Gateway (HTTP API)

Public endpoint for webhooks: /webhooks/{provider}
Rate limiting: 100 req/sec, burst 200
Routes: /webhooks/stripe, /webhooks/apple, /webhooks/google

Lambda 1: Ingestion

Verifies webhook signatures (HMAC-SHA256 for Stripe)
Extracts partition key: provider:subscriptionId
Writes raw event to Kinesis

Kinesis Data Streams

7-day retention (extendable to 365 days)
Source of truth for all webhook events
Partition key ensures per-subscription ordering

Lambda 2: Processor

Reads from Kinesis stream
Sorts events by timestamp (handles out-of-order delivery)
Uses DynamoDB conditional writes for idempotency
Updates subscription state

DynamoDB Tables

ProcessedEvents: Idempotency check (TTL: 90 days)
SubscriptionState: Current subscription data

SQS Dead Letter Queue

Captures poison messages after 3 retries
14-day retention for manual investigation

Key Design Decisions

1. Kinesis as Event Log

Kinesis isn't just a queue—it's a durable log. Every webhook is preserved for 7 days (configurable up to 365). This gives you time-travel capability: replay events from any point in the retention window.

For longer-term needs (regulatory audits, multi-year forensics), you can archive to S3 and implement cold replay from there.

2. Partition Keys for Ordering

Events are sharded by provider:subscriptionId (e.g., stripe:sub_premium_user_001). This gives:

Ordering guarantees per subscription: Events for the same subscription are processed in order
Granular replay control: Replay just one customer's events, or all events from a specific provider

3. Idempotency with DynamoDB

The processor uses conditional writes to the ProcessedEvents table:

// Only write if eventId doesn't exist
await dynamodb.putItem({
  TableName: 'ProcessedEvents',
  Item: { eventId, provider, subscriptionId, timestamp },
  ConditionExpression: 'attribute_not_exists(eventId)'
});

This prevents duplicate processing, even when replaying events that were already handled.

Demo: Recovery from Data Loss

Scenario: A bug deletes subscription data from DynamoDB.

Step 1: Initial State

Subscription has 4 processed events:

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: eventCount: 4

Step 2: Simulate Data Loss

Delete subscription state and processed events (simulating a bug):

python scripts/delete_subscription_data.py \
  --provider stripe \
  --subscription sub_premium_user_001 \
  --execute

# Deletes: SubscriptionState + 4 ProcessedEvents

Step 3: Confirm Deletion

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: (empty)

Step 4: Replay from Kinesis

python scripts/replay.py \
  --subscription sub_premium_user_001 \
  --from-beginning \
  --execute

# Replays 9 events from stream
# Processes 4 unique events
# Skips 5 duplicates (idempotency)

Step 5: Verify Recovery

aws dynamodb get-item \
  --table-name webhook-broker-dev-subscription-state \
  --key '{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'

# Returns: eventCount: 4 (restored!)

Result: Subscription rebuilt with exact same state. Idempotency prevented duplicate processing.

Real-World Use Cases

1. Bug Recovery
Deploy a bug that corrupts state → Fix the code → Replay events → State rebuilt correctly

2. Schema Evolution
Add a new field to your subscription model → Replay events to backfill the data

3. Revenue Debugging
Finance reports a discrepancy → Replay specific time range → Trace what happened

4. Provider Outage Recovery
Stripe had an outage yesterday → Replay all Stripe events from that window → Ensure nothing was missed

Tech Stack

Infrastructure: Terraform
Lambda Functions: TypeScript (Node.js 18)
Replay CLI: Python 3.9+
AWS Services: API Gateway, Kinesis, Lambda, DynamoDB, SQS

Cost Considerations

With 7-day retention and moderate volume (10,000 events/day):

Kinesis Data Streams: ~$15/month (1 shard)
Lambda: ~$5/month (first 1M requests free)
DynamoDB: ~$5/month (on-demand pricing)
API Gateway: ~$3.50/month (first 1M requests free)

Total: ~$30/month for production-grade event replay capability.

Source Code

Full implementation with Terraform, TypeScript Lambdas, and Python replay tool:

https://github.com/ajithmanmu/webhook-broker

The key insight: Treat your event stream as a source of truth, not just a transport layer. When you have an immutable log, recovery becomes a replay operation instead of a panic.

What the AWS us-east-1 Outage Taught Me About Building Resilient Systems

ajithmanmu — Sun, 14 Dec 2025 19:45:22 +0000

AWS us-east-1 will go down again. When it does, will your system survive?

This past weekend, I built a system designed to survive it.

After 8 years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I've learned that the question isn't whether your cloud provider will fail. It's whether your architecture degrades gracefully when it does.

I spent 4 hours implementing three reliability patterns sourced directly from the AWS Builders' Library, Google SRE practices, and Stripe's engineering blog. Here's what I learned.

The Problem: Payment Systems Can't Afford to Fail

When AWS has an incident, your Lambda functions timeout. Your DynamoDB calls fail. Your SQS queues back up.

For most applications, users see an error page and retry later. But payment systems are different:

A failed charge might actually have succeeded
A retry might double-charge the customer
A thundering herd of retries can cascade the failure

You need patterns that handle partial failures without losing money or trust.

Pattern 1: Exponential Backoff with Full Jitter

The AWS Builders' Library article on Timeouts, retries, and backoff with jitter changed how I think about retry logic.

The insight: Without jitter, all clients retry at the exact same intervals. If 1,000 requests fail at t=0, they all retry at t=1s, then t=2s, then t=4s—creating synchronized waves that hammer your recovering service.

// Full jitter formula from AWS Builders' Library
const calculateDelay = (attempt: number): number => {
  const exponentialDelay = Math.min(
    MAX_DELAY,
    INITIAL_DELAY * Math.pow(2, attempt)
  );
  // Full jitter: random value between 0 and exponential delay
  return Math.random() * exponentialDelay;
};

The result: Success rates improved from ~70% to 99%+ in my load tests. The jitter spreads retry load evenly across time instead of creating synchronized spikes.

AWS Application

This pattern is critical when calling AWS services during degraded states:

Lambda retrying DynamoDB during throttling
ECS tasks calling external APIs through NAT Gateway
Step Functions with retry policies on service integrations

Pattern 2: Bounded Queues with Worker Pools

Here's something I discovered through testing that surprised me:

A bounded queue alone doesn't limit concurrent processing.

I set up a queue with capacity 100, sent 200 requests, and expected ~100 rejections. Instead: zero rejections. Why? Node.js was processing requests faster than they accumulated. The queue checked capacity but didn't control throughput.

// What you actually need: queue + worker pool
class BoundedQueue {
  private queue: Request[] = [];
  private readonly capacity = 100;

  enqueue(request: Request): boolean {
    if (this.queue.length >= this.capacity) {
      return false; // HTTP 429 - fail fast
    }
    this.queue.push(request);
    return true;
  }
}

class WorkerPool {
  private activeWorkers = 0;
  private readonly maxWorkers = 10; // THIS controls throughput

  async process(queue: BoundedQueue) {
    while (this.activeWorkers < this.maxWorkers) {
      // Actually limits concurrent execution
    }
  }
}

AWS Application

This maps directly to AWS service patterns:

SQS + Lambda concurrency limits: The queue (SQS) buffers; reserved concurrency limits throughput
API Gateway + throttling: Request queuing with rate limits
Kinesis + Lambda: Batch size and parallelization factor control processing rate

The key insight: SQS without Lambda concurrency limits is like a bounded queue without a worker pool—it buffers but doesn't protect downstream systems.

Pattern 3: Idempotency with Strategic Caching

Stripe's idempotency documentation shaped this implementation. The pattern: cache successful responses for 24 hours, never cache errors.

class IdempotencyStore {
  private cache = new Map<string, CachedResponse>();
  private inFlight = new Set<string>();

  async process(idempotencyKey: string, operation: () => Promise<Response>) {
    // Check cache first
    const cached = this.cache.get(idempotencyKey);
    if (cached) return cached.response;

    // Detect concurrent duplicates
    if (this.inFlight.has(idempotencyKey)) {
      throw new ConflictError('Request already in progress');
    }

    this.inFlight.add(idempotencyKey);
    try {
      const response = await operation();
      // Only cache successes
      if (response.success) {
        this.cache.set(idempotencyKey, { response, ttl: 24 * 60 * 60 });
      }
      return response;
    } finally {
      this.inFlight.delete(idempotencyKey);
    }
  }
}

AWS Application

DynamoDB for idempotency keys: Conditional writes with TTL for automatic cleanup
Lambda Powertools: Built-in idempotency utility using DynamoDB
Step Functions: Native idempotency with execution names

// DynamoDB idempotency pattern
await dynamodb.put({
  TableName: 'IdempotencyStore',
  Item: {
    idempotencyKey: key,
    response: result,
    ttl: Math.floor(Date.now() / 1000) + 86400 // 24 hours
  },
  ConditionExpression: 'attribute_not_exists(idempotencyKey)'
});

The Architecture: Putting It Together

Here's how these patterns compose into a resilient payment processing system on AWS:

┌─────────────────────────────────────────────────────────────┐
│                     API Gateway                              │
│                   (Rate Limiting)                            │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                      SQS Queue                               │
│              (Bounded Queue - Buffer)                        │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│              Lambda (Reserved Concurrency = 10)              │
│                   (Worker Pool)                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  1. Check DynamoDB idempotency store                    ││
│  │  2. Process payment with retry + jitter                 ││
│  │  3. Store result in DynamoDB                            ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                 DynamoDB Tables                              │
│    - IdempotencyStore (with TTL)                            │
│    - ProcessingResults                                       │
└─────────────────────────────────────────────────────────────┘

Key Takeaways for AWS Builders

Read the AWS Builders' Library. It's written by engineers who've operated services at massive scale. The jitter article alone is worth your time.
Test your assumptions. I assumed bounded queues limited throughput. They don't. Load testing revealed the gap.
Accept the tradeoff. These patterns increase latency. A request that would fail in 100ms might now take 5 seconds across retries. But 99%+ success beats 70% success every time.
Use AWS primitives. SQS, Lambda concurrency, DynamoDB TTL, and Step Functions give you these patterns without building from scratch.

What's Next

The resilient-relay repo has the full implementation. I'm planning to add:

Dead-letter queue handling for failed payments
CloudWatch metrics for RED (Rate, Errors, Duration) observability
Multi-region failover patterns

When us-east-1 goes down again—and it will—your system should degrade gracefully, not catastrophically.

The AWS Builders' Library exists because Amazon learned these lessons operating AWS itself. The patterns are proven. The question is whether we apply them.

What reliability patterns have you implemented in your AWS architectures? I'd love to hear what's worked (or failed spectacularly) in production.

GitHub: resilient-relay | LinkedIn

How I Built an Autonomous AI Customer Retention Agent with AWS Bedrock AgentCore

ajithmanmu — Wed, 15 Oct 2025 12:23:21 +0000

Built for the AWS AI Agent Global Hackathon

Introduction

After building a serverless data analytics pipeline for customer churn, I had clean, query-ready customer data sitting in Amazon Athena. The next logical step was to make that data actionable — not just for analysts, but for customers themselves.

That's where the Customer Retention Agent comes in. This is a fully autonomous AI agent built on AWS Bedrock AgentCore that identifies at-risk customers and proactively offers them personalized retention deals through natural conversation. I built this as part of the AWS AI Agent Global Hackathon, and it's a natural continuation of my previous project.

Before diving into the build, I spent time going through the Amazon Bedrock AgentCore Samples repository. The tutorials there were incredibly helpful for getting up to speed with AgentCore concepts — from Runtime and Gateway to Memory and Identity. If you're new to AgentCore, I highly recommend starting there.

The goal was simple: What if customers could talk to an AI agent that knows their churn risk and can instantly generate personalized discount codes? No forms, no waiting for customer service — just a conversation that might save their subscription.

Architecture

Here's the high-level design:

Core Components:

Amazon Bedrock AgentCore (Runtime, Gateway, Memory) — The brain of the system. Runtime hosts the agent, Gateway connects to external tools, and Memory persists conversation context.
Claude 3.7 Sonnet — Powers autonomous reasoning and multi-step decision-making.
Next.js Frontend — Chat interface deployed on Vercel with streaming responses.
AWS Lambda (3 functions) — Churn Data Query, Retention Offer, Web Search exposed via MCP protocol.
Amazon Athena — Queries the Telco customer churn dataset (from my previous project).
Amazon Cognito — Dual authentication: web client for users, M2M client for agent-to-Gateway communication.
Bedrock Knowledge Base — RAG implementation with company policies and troubleshooting guides.
Amazon S3 — Stores customer data and knowledge base documents.

You can find the full implementation here: https://github.com/ajithmanmu/customer-retention-agent

Demo Video

https://www.youtube.com/watch?v=nt2-iE_qBIw
URL: https://customer-retention-agent.vercel.app/
Demo showing the agent in action - analyzing churn risk and generating discount codes

Walkthrough

1. The User Journey

When a customer logs into the chat interface:

Authentication: Frontend authenticates via Cognito, receives JWT token
JWT Mapping: Token contains Cognito user ID (UUID) which gets mapped to actual customer ID in the dataset (e.g., "3916-NRPAP")
Conversation Starts: User sends a message, AgentCore Runtime receives request with JWT
Memory Retrieval: Before responding, agent pulls customer context from Memory
Agent Reasoning: Claude 3.7 Sonnet decides which tools to call (if any)
Tool Execution: Agent calls Lambda functions via Gateway for data/actions
Response Generation: Claude synthesizes response with retrieved data
Memory Saving: Interaction gets saved to Memory for future conversations

2. Dual Authentication Architecture

This was one of the trickier parts. The system needs two separate authentication flows:

Web Client (User → Runtime):

User logs in with username/password
Cognito returns JWT token
Frontend includes JWT in every request to AgentCore Runtime
Token contains sub field with user ID

M2M Client (Agent → Gateway):

Agent needs to call Lambda functions via Gateway
Uses OAuth 2.0 client credentials flow
Confidential client with client secret stored in SSM
Access token validates at Gateway before allowing tool calls

Working with Cognito was more complicated than I expected — configuring two different clients, getting the OAuth flows right, and debugging token scopes took several iterations. But it was a valuable learning experience in production authentication patterns.

3. The Agent's Brain: AgentCore Runtime + Memory

The agent runs on AgentCore Runtime, which is a fully managed, serverless platform for hosting AI agents. No servers to manage, auto-scaling built-in.

Memory Integration is what makes this agent truly conversational:

class CustomerRetentionMemoryHooks:
    def __init__(self, memory_id, customer_id, session_id, region):
        self.memory_client = boto3.client('bedrock-agent-runtime')
        self.memory_id = memory_id
        self.actor_id = customer_id  # Maps to customer in dataset

Three memory strategies work together:

USER_PREFERENCE: Stores explicit preferences ("I prefer email contact")
SEMANTIC: Vector-based semantic memory for conversation context
SUMMARIZATION: Condensed conversation summaries

This means if a customer says "My customer ID is 3916-NRPAP" in one session, the agent remembers it in future conversations.

4. Tools Layer: Lambda Functions via Gateway

I created three Lambda functions, each with a specific purpose:

Churn Data Query Lambda:

# Queries Athena with SQL
query = f"""
SELECT customerid, churn_risk_score, tenure, contract, monthlycharges 
FROM telco_augmented_vw 
WHERE customerid = '{customer_id}'
"""

This function:

Hits Amazon Athena (the data from my previous pipeline project!)
Returns customer profile, churn risk score, usage patterns
Uses cancel_intent field as our "synthetic churn model" — no separate ML training needed

Retention Offer Lambda:

Generates personalized discount codes based on risk level
High risk (>70%): 20-30% off for 3 months (code: SAVE25)
Medium risk (40-70%): 15-25% off for 2 months
Low risk (<40%): Service upgrades and add-ons

Web Search Lambda:

DuckDuckGo API for real-time information
Helps agent answer general retention strategy questions

Internal Tool: Product Catalog

In addition to the three external Lambda functions, the agent also has an internal tool that runs directly within the AgentCore Runtime - no external API calls needed. The get_product_catalog() tool provides real-time information about available telecom plans, pricing, add-on services, and retention offers. This tool is perfect for answering customer questions like "What plans do you offer?" or "Tell me about your premium features" without making external API calls. Having this as an internal tool means faster response times and reduced latency for common queries.

@tool
def get_product_catalog() -> str:
    """Get information about available telecom plans and services."""
    # Returns plan details, pricing, features, and retention offers
    return formatted_catalog_info

This demonstrates a key architectural pattern: use internal tools for static/reference data that doesn't require external systems, and use external tools (via Gateway) for dynamic data queries or actions that need database access.

All three functions are exposed via AgentCore Gateway using the MCP (Model Context Protocol). The Gateway handles authentication, request routing, and response formatting.

5. The Autonomous Reasoning Flow

Here's what happens when a customer asks: "Can you give me a discount code?"

Agent Receives Request: Claude reads the prompt and system instructions
Decision Making: Agent decides it needs customer churn data first
Tool Call #1: Calls churn_data_query via Gateway → Lambda → Athena
Risk Analysis: Receives churn risk score (e.g., 85% — HIGH risk)
Decision Making: Agent decides to generate retention offer
Tool Call #2: Calls retention_offer with customer data
Offer Generation: Lambda generates SAVE25 discount code (25% off)
Response: Agent synthesizes natural response with discount code

The agent makes all these decisions autonomously — I didn't hardcode the workflow. The system prompt guides the agent, but Claude decides when and how to use tools.

6. RAG with Bedrock Knowledge Base

The Knowledge Base stores:

Company policies
Troubleshooting guides
FAQ documents

RAG Flow:

User Query → Agent → Knowledge Base → Retrieved Context → Enhanced Response

Using Amazon Titan Embeddings, documents get vectorized for semantic search. When a customer asks about policies, the agent retrieves relevant sections and includes them in the response.

7. Data Connection: From Previous Project

The customer data comes from my previous serverless pipeline project. That pipeline:

Ingested the Kaggle Telco dataset
Converted CSV to Parquet with Glue ETL
Partitioned data in S3
Made it queryable via Athena

This agent project is the natural next step — taking that clean, query-ready data and making it accessible through conversational AI.

Key Technical Decisions

Why AgentCore Over DIY?

I could have built this with raw Lambda functions and LangChain, but AgentCore provided:

Built-in Memory: No need to build my own vector database
Gateway with MCP: Standardized protocol for tool integration
Managed Runtime: No ECS clusters or container management
Observability: CloudWatch integration out of the box

Why Dual Cognito Architecture?

Security: Separates user authentication from agent-to-service authentication
Scalability: M2M tokens can be cached and reused
Best Practice: Follows OAuth 2.0 patterns for service-to-service communication

Why Synthetic Churn Model?

The dataset includes a cancel_intent field which acts as our "pretend ML model." For a hackathon demo, this works perfectly without needing to train and deploy a separate ML model. In production, you'd integrate with SageMaker for real churn predictions.

Security

Even for a hackathon project, I applied production security practices:

IAM Roles: Least-privilege access for Lambda, Runtime, and Gateway
JWT Authentication: Secure token-based auth with Cognito
SSM Parameter Store: All secrets and config stored securely
S3 Encryption: SSE-S3 for data at rest
Private Lambda (TODO): Current Lambdas are public; production would use VPC

Challenges & Learnings

1. Cognito Complexity

Setting up dual authentication was harder than expected. Key lessons:

USER_PASSWORD_AUTH flow must be explicitly enabled
M2M clients need proper scopes configured
Discovery URLs must be exact (.well-known/openid-configuration)
Token decoding requires proper base64 padding

Working with Cognito was more complicated than I anticipated, but it forced me to deeply understand OAuth 2.0 flows and JWT token structure.

2. Cold Start Problem

The first request to the agent often timed out. Classic serverless cold start:

AgentCore Runtime takes time to spin up
Solution: Better error handling and retry logic
Future: Consider provisioned concurrency for production

3. Multi-Step Tool Calling

Getting Claude to call churn_data_query first, then pass that data to retention_offer required explicit prompt engineering:

SYSTEM_PROMPT = """
IMPORTANT: When customers ask for discount codes, you MUST:
1. First call the churn_data_query tool to get customer data
2. Then call the retention_offer tool with the complete churn_data
"""

Learning: LLMs need very explicit instructions for sequential workflows.

4. SSM Parameter Store Permissions

The auto-created Runtime execution role didn't include SSM permissions. Quick fix:

{
    "Effect": "Allow",
    "Action": ["ssm:GetParameter"],
    "Resource": "arn:aws:ssm:*:*:parameter/customer-retention-agent/*"
}

Learning: Always verify IAM permissions when integrating AWS services.

5. Local Development Setup

Testing locally before deploying was crucial:

Used agentcore invoke --local to simulate Runtime
Created automated test suite (test_invoke_local.py)
Tested with real AWS services (Lambda, Athena, Memory)

Learning: Local-first development saves time and AWS costs.

6. On-Demand Throughput Not Supported

Discovered that not all Bedrock models support on-demand throughput. Had to adjust model selection.

Learning: Read the AWS documentation carefully for service limitations.

7. Boto3 Sessions

Lambda functions need proper boto3 session management:

athena_client = boto3.client('athena', region_name='us-east-1')

Learning: Always specify region explicitly in Lambda functions.

What I Learned

Technical:

AgentCore primitives (Runtime, Gateway, Memory) work incredibly well together
MCP protocol standardizes tool integration
Memory strategies: USER_PREFERENCE for explicit data, SEMANTIC for context
JWT token structure and OAuth 2.0 flows
RAG implementation with Bedrock Knowledge Base
Serverless cold starts are real — plan accordingly

Architectural:

Dual authentication is complex but necessary for production systems
Tool design matters: focused, single-responsibility functions compose well
Explicit prompt engineering is crucial for multi-step workflows
Local testing infrastructure saves time and money

Data:

Synthetic data (like cancel_intent) works great for demos
Previous data pipeline projects can be extended with AI layers
Parquet + Athena = fast, cost-effective queries

Next Steps

If I continue this project:

Security Enhancements:
- Make Lambdas private
- Set up VPC and subnets
- Add Web Application Firewall (WAF)
Responsible AI:
- Content moderation with Bedrock Guardrails
- Human oversight for high-value offers
- Policy checks before generating discounts
Production Features:
- Real-time alerts when high-risk customers detected
- A/B testing for retention strategies
- Analytics dashboard for offer effectiveness
- Sentiment analysis for conversation tone
Integration:
- Connect to Confluence for live policy updates (Bedrock KB supports this!)
- Integrate with CRM (Salesforce/HubSpot)
- Multi-channel support (SMS, email, phone)

Conclusion

Building the Customer Retention Agent taught me that autonomous AI agents are production-ready today. With AWS Bedrock AgentCore, I went from idea to working demo faster than expected.

The hardest parts weren't the AI — they were the authentication, cold starts, and getting all the AWS services to work together. But that's the reality of building production systems.

This project is a natural continuation of my data pipeline work. The pipeline gave me clean data in Athena; the agent makes that data actionable through conversation. Together, they demonstrate how serverless + AI can solve real business problems.

Key takeaway: Modern cloud platforms make it possible to build sophisticated AI agents without managing infrastructure. The future of customer service is autonomous, personalized, and conversational.

Thanks to Devpost for hosting the AI Agent Global Hackathon and creating AgentCore. Building with these tools has been an incredible learning experience! 🚀

Resources

GitHub Repository: https://github.com/ajithmanmu/customer-retention-agent
Demo Video: https://www.youtube.com/watch?v=nt2-iE_qBIw
AWS AI Agent Hackathon: https://devpost.com/software/customer-retention-agent?ref_content=user-portfolio&ref_feature=in_progress
Previous Project: Serverless Data Pipeline
AWS Bedrock AgentCore Docs: https://aws.amazon.com/bedrock/agentcore/
AgentCore Samples & Tutorials: https://github.com/awslabs/amazon-bedrock-agentcore-samples (Highly recommended for learning AgentCore!)

How I Built a Serverless Data Analytics Pipeline for Customer Churn with S3, Glue, Athena, and QuickSight

ajithmanmu — Sat, 20 Sep 2025 16:53:32 +0000

I wanted to explore how AWS services can be combined into a simple data pipeline that not only processes customer churn data, but also highlights the kind of insights companies rely on to drive retention and revenue growth.

For this project, I used the Telco Customer Churn dataset from Kaggle. The goal was to take raw CSV data, process it into a query-optimized format, and power dashboards that surface churn KPIs.

Architecture

Here’s the high-level design:

Amazon S3 — Stores raw Kaggle CSV input and processed Parquet output.
AWS Glue — Crawler to catalog schemas + ETL job to convert CSV into Parquet and partition the data.
Amazon Athena — Runs SQL queries and views over the processed data.
Amazon QuickSight — Dashboards to visualize churn KPIs like churn %, revenue loss, and segmentation.
Amazon EventBridge (optional) — Triggers Glue ETL jobs on a schedule.
Terraform — Infrastructure as Code for reproducible setup.

You can find the full implementation here: GitHub Repo.

Walkthrough

Data Ingestion The raw Telco churn dataset was uploaded into an S3 bucket. To keep data organized, I added key prefixes such as ingest_date=YYYY-MM-DD/. This structure makes it easier for Glue Crawlers to detect and register new data.

Schema Discovery & ETL Glue Crawlers scanned the raw bucket and registered the schema in the Glue Data Catalog. A Glue ETL job then converted the CSV files into Parquet and wrote the results to a processed S3 bucket with partitions. This format makes queries faster and more cost-efficient.

Partitioning Strategy

Partitioning turned out to be a critical design choice:
- Avoid high-cardinality keys that generate too many small files.
- Place date partitions last so queries can easily filter recent data.
- Hive (the engine behind Athena) processes partitions from left to right, so ordering matters.
Athena Queries

With data processed and partitioned, Athena queries became much more efficient. I created views for:
- Overall churn percentage
- Churn by contract type (month-to-month vs annual)
- Revenue lost from churners
- Tenure vs churn patterns

Visualization QuickSight connected directly to Athena, enabling dashboards with filters and visuals for churn % by demographics, add-on services, and contract types. This provided clear insights into which customers were most at risk.

Security

Even though this was a demo project, I applied security best practices:

IAM roles scoped with least privilege
S3 encryption (SSE-S3) for data at rest
Dedicated Glue and Athena execution roles
Restricted access to QuickSight dashboards

Conclusion

This pipeline shows how AWS services can be combined to build a self-service analytics solution with no servers to manage. Starting from raw CSVs, I was able to generate Parquet data, run queries in Athena, and visualize churn insights in QuickSight.

The next step for me is extending the pipeline with Amazon Bedrock. By creating a Knowledge Base and Bedrock Agent, I’ll enable natural-language questions like “What’s the churn rate for two-year contracts vs month-to-month?” and have the agent execute the Athena queries under the hood.

Learnings

Some of the key lessons from this build:

Adding ingest_date prefixes in S3 simplified partitioning and Glue Crawling.
Partitioning design is critical: avoid high-cardinality keys, put date last, and understand Hive’s left-to-right partition evaluation.
Encountered a HIVE_BAD_DATA error — a good reminder that Hive is running under the hood of Athena (flashback to Big Data classes!).
Parquet format greatly improved query speed and reduced cost.
Used Amazon Q Developer with the Diagram MCP server to auto-generate the architecture diagram — which made documentation far easier.

How I Built a Secure Serverless Orders Pipeline with Lambda, SNS, and SQS

ajithmanmu — Fri, 12 Sep 2025 03:58:31 +0000

After finishing my 3-tier web app project on AWS, I wanted my next portfolio project to be something different — more serverless, event-driven, and decoupled. I also wanted to test out the SQS fan-out architecture, where a single event can trigger multiple downstream actions. And, just as important, I wanted to build it all with a strong security-first mindset.

So I built a Serverless Orders Pipeline. Here’s how it works and what I learned along the way.

Architecture Overview

At a high level, the system works like this:

A public ALB accepts incoming requests (POST /orders) and routes them to a LambdaPublisher.
The LambdaPublisher validates the request and publishes it to an SNS topic.
That SNS topic fans out to multiple SQS queues: billing and archive.
Consumer Lambdas read from these queues and do their thing:
- Billing → write to DynamoDB
- Archive → store a JSON copy in S3

Everything runs inside a VPC, with public subnets for the ALB and private subnets for the Lambdas. Importantly, the Lambdas don’t have internet access — they only talk to AWS services through VPC endpoints.

Diagram

This captures the fan-out pattern: one request → SNS → multiple queues → independent consumers.

Security First

Authentication at the Publisher Lambda

The first entry point into the system is the Publisher Lambda, so I added a basic authentication layer:

Incoming requests must include a X-Client-Id and X-Signature header.
The Lambda checks these against a secret (stored as an environment variable for now, but could be moved to Secrets Manager later).
If the check fails → immediate 401 Unauthorized.

This ensures only trusted clients can even publish into the pipeline.

IAM Roles

Each Lambda got its own execution role, with the bare minimum permissions. For example:

Publisher: just sns:Publish.
Billing: sqs:ReceiveMessage/DeleteMessage + dynamodb:PutItem.
Archive: sqs:ReceiveMessage/DeleteMessage + s3:PutObject.

No shared “super-role” — each function is tightly scoped.

Resource Policies

Each SQS queue is locked down so it only accepts messages from the SNS topic.
Optionally, you can go a step further and tie resources to a specific VPC endpoint using conditions like aws:SourceVpce.

This prevents direct access from outside the system.

VPC and Subnets

This one was a good learning moment for me:

Lambdas don’t need inbound rules.
The ALB doesn’t hit the Lambda over the network — it calls it through the AWS control plane.

So, the Lambda’s security group matters only for outbound traffic (e.g., when writing to DynamoDB, publishing to SNS, or sending logs).

VPC Endpoints

Because my Lambdas don’t have internet access, I needed endpoints for them to reach AWS services:

Gateway endpoints: S3, DynamoDB
Interface endpoints: SNS, SQS, CloudWatch Logs

This way, traffic stays private inside AWS. No NAT gateways, no public internet.

Monitoring

Every Lambda writes to CloudWatch Logs, and I set up some metrics/alerts:

Lambda errors
Queue depth (important if consumers fall behind)
DLQ depth
ALB 5XXs

It’s not fancy, but it gives enough visibility to know if something’s going wrong.

What I Learned

Lambda SGs are different → no inbound rules needed; outbound is what matters.
Terraform zip packaging → I had to get comfortable with packaging functions cleanly in Terraform.
Security-first thinking → IAM roles, queue policies, endpoint restrictions, and even simple client auth at the Publisher Lambda baked in from the start.
Decoupling really works → each consumer Lambda is independent. If one fails, the others keep working fine.
Event-driven scaling is nice → SQS + Lambda handles bursts way better than a traditional setup.

Terraform Implementation

I also implemented the whole thing in Terraform, splitting the code into multiple folders for clarity:

infra/
  ├─ network/        # VPC, subnets, route tables, SGs, endpoints
  ├─ data/           # DynamoDB table + S3 archive bucket
  ├─ messaging/      # SNS topic, SQS queues, DLQs, policies
  ├─ iam/            # Lambda execution roles + inline policies
  ├─ compute/        # Lambda functions (publisher + consumers) + event source mappings
  ├─ frontend/       # ALB, target group, listener rules
  ├─ observability/  # CloudWatch alarms for SQS/Lambda/ALB

And here’s the order in which I applied them:

network → get the VPC and endpoints in place
data → DynamoDB and S3 storage
messaging → SNS, SQS, and their policies
iam → Lambda execution roles
compute → Lambdas + event source mappings
frontend → ALB and listener rules
observability → monitoring and alarms

This folder-based approach made it easier to build one layer at a time and keep things manageable.

You can check out the full Terraform code and project details on my GitHub: serverless-orders-pipeline.

Wrapping Up

This project felt like the natural next step after the 3-tier web app. Instead of servers and RDS, I worked with Lambdas, queues, and private networking. Adding authentication at the Publisher Lambda gave me an extra layer of control, and Terraform helped keep the setup reproducible and organized.

Hands-On with AWS: Building and Securing a 3-Tier Web App

ajithmanmu — Fri, 29 Aug 2025 23:01:00 +0000

Building a Secure 3-Tier Application on AWS

I recently worked on a portfolio project where I built a 3-tier application on AWS. My goal wasn’t only to get the app running, but also to design it with security and best practices in mind, and then migrate everything into Terraform so it’s reproducible.

👉 Full source code and Terraform setup: three-tier-architecture-aws

Project Overview

The setup follows the classic 3-tier architecture:

Frontend: A React app served by Nginx on EC2, behind a public ALB.
Backend: A FastAPI app running with Uvicorn on EC2, behind an internal ALB.
Database: Amazon RDS PostgreSQL in private subnets.

Only the frontend ALB is public — everything else runs in private subnets. Configuration values like the backend ALB DNS and database connection string are securely injected at runtime using AWS SSM Parameter Store and Secrets Manager.

Security Focus

From the start, I set up the application with least-privilege principles:

No public IPs on app or DB servers — only the ALB is exposed.
Security Groups allow traffic only along the intended path (ALB → Frontend → Backend → RDS).
IAM roles are locked down so instances can only read what they need.
AMIs are kept generic; user data injects environment-specific config at boot.

This way, the environment is both secure and flexible.

Building AMIs with Setup Scripts

A key part of this project was baking AMIs. Instead of installing everything during auto-scaling launches, I ran the setup scripts on temporary builder EC2 instances in public subnets. Once the app was installed and tested, I created an AMI from that instance.

For the frontend, I launched a temporary EC2, ran the React + Nginx setup script, and created a frontend AMI.
For the backend, I did the same: launched a builder EC2, installed FastAPI + dependencies, configured systemd, and created a backend AMI.

These AMIs were then used in Launch Templates + Auto Scaling Groups, with user data scripts wiring environment-specific details at boot.

Frontend Setup

sudo dnf update -y
sudo dnf install -y nginx git
sudo systemctl enable nginx

curl -fsSL https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 20

git clone https://github.com/ajithmanmu/three-tier-architecture-aws.git
cd app
npm ci && npm run build

sudo rm -rf /usr/share/nginx/html/*
sudo cp -R out/* /usr/share/nginx/html/

Nginx config snippet (/etc/nginx/nginx.conf):

location /api/ {
    proxy_pass http://__BACKEND_INTERNAL_ALB__;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

Frontend user data script fetches the backend ALB DNS from SSM and rewrites the config at boot.

Backend Setup

sudo dnf update -y
sudo dnf install -y python3.11 python3.11-pip git

sudo mkdir -p /opt/app && sudo chown ec2-user:ec2-user /opt/app
cd /opt/app
python3.11 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

git clone https://github.com/ajithmanmu/three-tier-architecture-aws.git src
cd src/backend
pip install -r requirements.txt
cp -R . /opt/app/

The backend is wired to a systemd service running Uvicorn. At boot, a user data script pulls the DB connection string from SSM and writes it into /etc/app.env before starting the app.

Challenges Along the Way

This wasn’t all smooth sailing. A few things I had to troubleshoot:

Networking: With 12 subnets and multiple route tables, I initially struggled to get NAT and IGW routing right. Debugging outbound access from private subnets was a key learning moment.
Frontend 404s: The frontend served fine, but API calls failed until I realized Nginx needed the backend ALB DNS injected dynamically.
Secrets Management: At first I hardcoded DB creds. Moving them into Secrets Manager and pulling them at runtime made the setup much cleaner and safer.
Terraform Migration: Rebuilding everything as code was tedious, but it forced me to understand the resource dependencies and gave me a reproducible setup.

What’s Next

Some natural next steps to build on this project:

Add ACM + HTTPS for the frontend ALB.
Configure CloudWatch logs and alarms for monitoring and alerting.
Use S3 + CloudFront for hosting assets (like images), while continuing to serve the frontend itself from EC2.

👉 Full repo: three-tier-architecture-aws

Solving Flood Fill - LeetCode problem

ajithmanmu — Fri, 15 Aug 2025 15:13:53 +0000

Originally published on my Hashnode blog — cross-posted here for the Dev.to community.

In this problem, we delve into the Flood Fill algorithm, which plays a crucial role in tracing bounded areas with the same color. This algorithm finds applications in various real-world scenarios, such as the bucket-filling tool in painting software and the Minesweeper game.

https://leetcode.com/problems/flood-fill/description/

Problem Description

Given a 2D matrix, along with the indices of a source cell mat[x][y] and a target color C, the task is to color the region connected to the source cell with color C. The key idea here is to view the matrix as an undirected graph and find an efficient way to traverse it. Importantly, the movement is restricted to adjacent cells in four directions (up, down, left, and right).

Breadth-First Search (BFS) Approach

One way to solve this problem is to employ a Breadth-First Search (BFS) using a queue. Here's the step-by-step process:

Start by inserting the source cell into the queue and change its color to C.
While the queue is not empty:

* Pop the next element from the queue.

* Change the color of the current cell to `C`.

* Calculate the coordinates of the neighboring cells in all four directions.

* If any neighboring cell has the same color, insert it into the queue.

Time Complexity (TC): O(N*M) where N and M are the dimensions of the matrix.

Depth-First Search (DFS) Approach

Alternatively, you can implement the Depth-First Search (DFS) approach, which uses recursion:

Begin by changing the color of the source cell to C.
Calculate the coordinates of the neighboring cells in all four directions.
If any neighboring cell has the same color, recursively call the function on that cell until the base case is satisfied.

Time Complexity (TC): O(N*M) Space Complexity (SC): *O(N**M)

 let dx = [-1, 0, 1, 0];
 let dy = [0, 1, 0, -1];

const isValid = (x, y, image) => {
    let rows = image.length;
    let cols = image[0].length;
    if (x >= rows || y >= cols || x < 0 || y < 0) return false;
    return true;
};


const colorCell = (image,x,y,color, current_color) => {
   if(!isValid(x,y,image)) return;
   if(image[x][y] !== current_color) return;
   if(image[x][y] === color) return;
   image[x][y] = color;
   for(let i=0; i<4; i+=1) {
       let u = dx[i] + x;
       let v = dy[i] + y;
       colorCell(image, u, v, color, current_color);
   }
}

var floodFill = function(image, sr, sc, color) {
    colorCell(image, sr, sc, color, image[sr][sc]);
    return image;
};

Both approaches are effective, and your choice may depend on the specific requirements and constraints of the problem you are solving.

Solving Balanced Binary Tree - Leetcode problem

ajithmanmu — Fri, 15 Aug 2025 15:09:25 +0000

Originally published on my Hashnode blog — cross-posted here for the Dev.to community.

In the cinematic adaptation of this challenge, we find ourselves on an intriguing quest to determine the balance of a mystical binary tree. Our mission is to unveil the tree's equilibrium, where the difference between the heights of the Left Subtree (LST) and Right Subtree (RST) is no more than 1.

https://leetcode.com/problems/balanced-binary-tree/

Math.abs(LST_Height - RST_Height) <= 1.   --> Tree is balanced

Our journey begins with a postorder traversal, an exploration strategy suited for our enigmatic task. During our odyssey, we meticulously calculate the height of each node, a vital piece of information. The height of a node, we discern, is the grander of the heights of its LST and RST.

Height of a Node = Math.max(LST_Height, RST_Height) + 1

As we delve deeper into the forest of nodes, we scrutinize the height difference, ensuring it remains within the confines of balance.

The narrative unfolds with an assumption that the tree, like any good story, is inherently balanced. Yet, our vigilant traversal harbors the power to unveil any imbalances lurking in the shadows. Should the height conditions falter, a revelation of imbalance shatters our illusion, and we exit our quest immediately, having uncovered the truth of this captivating binary tree.

/**
 * Definition for a binary tree node.
 * function TreeNode(val, left, right) {
 *     this.val = (val===undefined ? 0 : val)
 *     this.left = (left===undefined ? null : left)
 *     this.right = (right===undefined ? null : right)
 * }
 */
/**
 * @param {TreeNode} root
 * @return {boolean}
 */

/*
    TC: O(N)
    SC: O(Height of the tree)
*/

const postorder = (node) => {
    if(node === null) return 0;
    let leftheight = postorder(node.left);
    let rightheight = postorder(node.right);
    /* 
        The idea is that we assume that the tree is balanced by default. 
        If the height condition fails during the traversal we say it is unbalanced and exit         immidietly
    Height of LST - Height of RST <= 1 -->. height balanced
    */
    let diff = Math.abs(leftheight - rightheight);
    if(diff > 1) {
        return 'not balanced';
    }
    if(leftheight === 'not balanced' || rightheight === 'not balanced') return 'not balanced'
    return Math.max(leftheight, rightheight) + 1;
}

var isBalanced = function(root) {
    let res = postorder(root);
    // console.log({res})
    if(res === 'not balanced') return false;
    return true;
};