DEV Community

augusthottie

I Built a Serverless Event-Driven Pipeline on AWS

Five projects deep into containers and Kubernetes, I needed to add range to my portfolio. Every DevOps role I've looked at mentions serverless somewhere: Lambda for glue code, API Gateway for webhooks, DynamoDB for low-latency lookups, SQS for decoupling services. So this week I built a serverless event-driven pipeline from scratch.

The use case is a URL shortener with click analytics, but the architecture pattern is the same one you'd use for payment processing, IoT ingestion, audit logging, or any event-driven system. The interesting part isn't the URL shortener; it's the async decoupling, the atomic operations, the failure handling, and the IAM patterns.

What I Built

A four-Lambda pipeline:

  1. Shortener (POST /shorten): generates a short code, writes to DynamoDB with a conditional write to handle collisions
  2. Redirect (GET /{code}): reads the URL, fires a click event to SQS without waiting, and returns a 302 in under 100ms
  3. Analytics (SQS-triggered): processes click events in batches of 10, writes details to a clicks table, atomically increments the counter on the urls table
  4. Stats (GET /stats/{code}): queries both tables, aggregates top user agents and referers, returns JSON

All Python 3.12, all behind API Gateway HTTP API v2, all defined as Terraform with reusable modules. 38 resources total. Deploys in under two minutes.

The Architecture

```
POST /shorten      →  Lambda (shortener)  →  DynamoDB (urls table)

GET /{code}        →  Lambda (redirect)   →  DynamoDB → SQS → 302
                                                         ↓
                                              Lambda (analytics)
                                                         ↓
                                              DynamoDB (clicks + counter)

GET /stats/{code}  →  Lambda (stats)      →  DynamoDB (both tables)
```

The key insight is the SQS queue between the redirect and analytics Lambdas. The redirect Lambda doesn't wait for analytics processing; it sends a message to SQS and immediately returns the 302. Whether analytics takes 10ms or 10 seconds, users get redirected instantly. Click event processing happens entirely in the background.
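The redirect flow can be sketched like this. This is a minimal illustration, not the project's actual handler: `handle_redirect`, `get_url`, and `send_click` are hypothetical names, with the DynamoDB GetItem and SQS SendMessage calls factored out as injected functions so the fire-and-forget logic is visible (and testable) on its own.

```python
import json

def redirect_response(url: str) -> dict:
    """HTTP API v2 response shape for an immediate 302."""
    return {"statusCode": 302, "headers": {"Location": url}}

def handle_redirect(event, get_url, send_click):
    """Return the 302 right away; click delivery is best-effort.

    get_url(code) wraps the DynamoDB lookup, send_click(payload) wraps
    the SQS SendMessage call.
    """
    code = event["pathParameters"]["code"]
    url = get_url(code)
    if url is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    try:
        # Fire-and-forget: a failed analytics event must never block the user.
        send_click({"code": code,
                    "ua": event.get("headers", {}).get("user-agent", "")})
    except Exception:
        pass  # analytics is non-critical; the redirect still goes out
    return redirect_response(url)
```

The point of the `try/except` is that the queue being down degrades analytics, not redirects.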

The Patterns That Make This Production-Realistic

Conditional writes for collision handling. The shortener generates a 7-character random code, but two simultaneous requests could generate the same one. DynamoDB's ConditionExpression: "attribute_not_exists(code)" makes the write fail atomically if the code already exists. Combined with retry logic, this is collision-safe at any concurrency level.
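A sketch of the generate-and-retry loop, assuming a helper `put_if_absent` that wraps the conditional PutItem and raises on a collision (the helper and exception names are mine, not the project's):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62^7 possible codes

def gen_code(n: int = 7) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(n))

class Collision(Exception):
    """Raised when attribute_not_exists(code) fails the conditional write."""

def shorten(url: str, put_if_absent, attempts: int = 5) -> str:
    for _ in range(attempts):
        code = gen_code()
        try:
            # In the real Lambda this wraps:
            # table.put_item(
            #     Item={"code": code, "url": url},
            #     ConditionExpression="attribute_not_exists(code)",
            # )
            put_if_absent(code, url)
            return code
        except Collision:
            continue  # vanishingly rare at 62^7; just draw a new code
    raise RuntimeError("could not allocate a unique code")
```

Because the uniqueness check happens inside DynamoDB's write, two concurrent requests can't both claim the same code; the loser just retries with a fresh one.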

Atomic counters with UpdateExpression. When the analytics Lambda increments the click count, it uses UpdateExpression: "ADD clicks :inc". This is atomic at the database level, no read-modify-write race conditions. If 100 clicks come in simultaneously, all 100 get counted correctly.
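The increment itself is one call against the boto3 Table resource; this sketch (function name is mine) shows the shape:

```python
def increment_clicks(table, code: str, inc: int = 1) -> None:
    """Server-side atomic increment; no read-modify-write round trip."""
    table.update_item(
        Key={"code": code},
        UpdateExpression="ADD clicks :inc",
        ExpressionAttributeValues={":inc": inc},
    )
```

`ADD` also creates the `clicks` attribute on first use, so there's no separate initialization path.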

Partial batch failure for SQS. The default SQS → Lambda trigger fails the entire batch if any message errors. That's wasteful and creates retry storms. By setting function_response_types = ["ReportBatchItemFailures"] and returning {"batchItemFailures": [{"itemIdentifier": messageId}]} from the Lambda, only the failed messages go back to the queue. The successful 9 out of 10 stay processed.
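The Lambda side of that contract looks roughly like this (sketch; `analytics_handler` and the injected `process` function are hypothetical names):

```python
import json

def analytics_handler(event, context=None, process=None):
    """SQS batch handler that reports failures per message."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the other
            # records in the batch stay consumed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list means the whole batch succeeded; returning nothing at all (or a malformed shape) makes Lambda treat the batch as fully successful, so get the key names exactly right.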

Dead Letter Queue with 3-retry redrive. Failed messages get retried 3 times before being moved to a dead letter queue. The DLQ holds them for 14 days so you can investigate without blocking the main queue. This is the difference between "the pipeline is broken" and "we have visibility into 12 failed messages from yesterday."
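The wiring between the two queues is just attributes on the main queue. A sketch of building them (the helper name and ARN are illustrative; in my project this lives in Terraform rather than boto3):

```python
import json

def redrive_attributes(dlq_arn: str, max_receives: int = 3) -> dict:
    """Main-queue attributes: move a message to the DLQ after it has
    been received (and failed) max_receives times."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receives),
        }),
        "VisibilityTimeout": "60",  # must be >= the Lambda timeout
    }
```

The 14-day retention is set on the DLQ itself (`MessageRetentionPeriod` of 1209600 seconds), not in the redrive policy.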

Least-privilege IAM per Lambda. Each function has its own role with the minimum permissions it needs. The shortener can only PutItem on the urls table. The redirect can only GetItem on urls and SendMessage to SQS. The analytics Lambda can read from SQS and write to both tables. The stats Lambda can only read from both tables. If any function gets compromised, the blast radius is contained.
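For a concrete sense of scale, here's what the redirect Lambda's policy boils down to (ARNs are placeholders, not my real resources):

```python
# The redirect function's entire custom policy: one read, one send.
REDIRECT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "dynamodb:GetItem",
            "Resource": "arn:aws:dynamodb:*:*:table/urls",
        },
        {
            "Effect": "Allow",
            "Action": "sqs:SendMessage",
            "Resource": "arn:aws:sqs:*:*:clicks-queue",
        },
    ],
}
```

No `PutItem`, no `Scan`, no wildcard actions: a compromised redirect function can read URLs and enqueue clicks, nothing else.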

The Terraform Modules

I built four reusable modules:

lambda/ takes a source directory, handler name, IAM policy JSON, and environment variables. It packages the code into a zip, creates an IAM role, attaches the basic execution policy plus the custom one, creates the function, and sets up a CloudWatch log group with 14-day retention. Adding a fifth Lambda would take about 15 lines of root-level code.

dynamodb/ creates the urls and clicks tables. The clicks table has a Global Secondary Index on (code, timestamp) so the stats Lambda can query recent clicks efficiently without scanning the whole table.
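The query the stats Lambda runs against that GSI looks roughly like this, sketched as parameters for the low-level DynamoDB client (the index name and helper are assumptions; `timestamp` must be aliased because it's a DynamoDB reserved word):

```python
def recent_clicks_query(code: str, since_iso: str,
                        index: str = "code-timestamp-index",
                        limit: int = 100) -> dict:
    """Query kwargs: newest clicks for one code, no table scan."""
    return {
        "TableName": "clicks",
        "IndexName": index,
        "KeyConditionExpression": "#c = :c AND #ts >= :since",
        "ExpressionAttributeNames": {"#c": "code", "#ts": "timestamp"},
        "ExpressionAttributeValues": {
            ":c": {"S": code},
            ":since": {"S": since_iso},
        },
        "ScanIndexForward": False,  # newest first
        "Limit": limit,
    }

# resp = client.query(**recent_clicks_query("abc1234", "2024-01-01T00:00:00Z"))
```

Because the GSI's partition key is the short code, the query touches only that code's clicks regardless of how large the table grows.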

sqs/ creates the main queue and the dead letter queue, with the redrive policy linking them. Visibility timeout is 60 seconds (must be at least the Lambda timeout), long polling is enabled for 20 seconds.

api_gateway/ creates an HTTP API v2 (cheaper than REST API), three integrations, three routes (POST /shorten, GET /{code}, GET /stats/{code}), the auto-deploy stage with access logging, and the Lambda permissions allowing API Gateway to invoke each function.

The root main.tf composes them and passes the outputs between modules. There's also a null_resource with local-exec that copies the shared utilities folder into each Lambda's source directory before packaging — Lambda doesn't have a native way to share source code between functions without using Layers.

The Live Demo

I built an interactive HTML page that calls the live API directly so you don't have to take my word for it. There's a "Shorten URL" button, a "Generate 10 Clicks" button, and a "Fetch Stats" button. There's a real-time event log at the bottom showing every request as it happens.


The whole demo is a single HTML file with no build step. CSS uses a Fraunces serif display font paired with JetBrains Mono, an orange accent on a dark grid background, and a noise overlay for texture. It looks like a developer tool, not a generic landing page.

The Debugging That Taught Me the Most

HTTP API v2 payload format is different. I had a shared log_event() helper that read event.get("httpMethod") from API Gateway. That worked fine in REST API v1, but in HTTP API v2 the method is at event["requestContext"]["http"]["method"]. The result was that every log entry showed "event_type": "sqs" even for HTTP requests, because httpMethod was missing and the default was "sqs". Subtle bug, easy to miss until you're trying to debug something else.
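The fix was a small helper that handles both payload shapes (the function name is mine; the event paths are the documented v1.0 and v2.0 formats):

```python
def http_method(event: dict):
    """HTTP method for both REST (payload v1.0) and HTTP API (payload
    v2.0) events, or None for non-HTTP events such as SQS."""
    if "httpMethod" in event:          # REST API / payload v1.0
        return event["httpMethod"]
    return (event.get("requestContext", {})  # HTTP API payload v2.0
                 .get("http", {})
                 .get("method"))
```

Defaulting to `None` instead of `"sqs"` also means a missing method now shows up as missing in the logs rather than masquerading as a queue event.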

Lambda cold starts cause race conditions in test scripts. My test script does POST /shorten immediately followed by GET /{code}. If the redirect Lambda is cold, the first GET happens before the Lambda has finished initializing, and somehow this causes API Gateway to return a 404 without invoking the Lambda. I confirmed this by checking CloudWatch logs: no log entries for the failed requests. Adding a 1-2 second delay between the shorten and the first redirect fixed it. In production this isn't an issue because Lambdas stay warm under continuous load.

Partial batch failure requires opt-in. I assumed SQS → Lambda would handle failures at the message level by default. It doesn't. You have to explicitly set function_response_types = ["ReportBatchItemFailures"] on the event source mapping AND return the failed message IDs in the right format from your Lambda. Without both, one bad message fails the entire batch and you get retry storms.

Shared code in Lambda is harder than it should be. Python doesn't have a clean way to share modules between Lambda functions without using Layers (which add deployment complexity). I ended up using a Terraform null_resource with local-exec to copy src/shared/ into each function's source directory before packaging. Hacky but effective.

The Cost Comparison

Running this pipeline with 1 million requests per month costs approximately $3.60. That includes API Gateway, all four Lambdas, DynamoDB on-demand, SQS, and CloudWatch Logs.

For context, my EKS cluster from Projects 3, 4, and 5 costs about $213 per month, and that's whether it's serving zero requests or a million. Serverless is genuinely cheaper for event-driven workloads, especially during development when traffic is sporadic.

Why This Matters for My Portfolio

Before this project, my portfolio was container-heavy: five projects on EKS, ECS, Helm, and ArgoCD. Strong on Kubernetes, weak on serverless. This adds a completely different dimension: event-driven architecture, async processing, NoSQL design, reusable IaC modules, and security patterns specific to AWS managed services.

In an interview, if someone asks "tell me about a serverless project," I now have a 90-second answer that hits async decoupling, atomic operations, partial batch failures, dead letter queues, and least-privilege IAM. And I have a live demo URL they can click.

Links


Six projects deep into my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on LinkedIn, I'd love to hear what you're building.
