<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ajithmanmu</title>
    <description>The latest articles on DEV Community by ajithmanmu (@ajithmanmu).</description>
    <link>https://dev.to/ajithmanmu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F671593%2F80913a79-a513-4557-9166-593acc8bc3a1.png</url>
      <title>DEV Community: ajithmanmu</title>
      <link>https://dev.to/ajithmanmu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajithmanmu"/>
    <language>en</language>
    <item>
      <title>I Built a Usage-Based Billing Engine From Scratch — Here's How It Works</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Mon, 23 Feb 2026 03:49:32 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/i-built-a-usage-based-billing-engine-from-scratch-heres-how-it-works-2l58</link>
      <guid>https://dev.to/ajithmanmu/i-built-a-usage-based-billing-engine-from-scratch-heres-how-it-works-2l58</guid>
      <description>&lt;p&gt;I spent the last few weeks building &lt;a href="https://github.com/ajithmanmu/meterflow" rel="noopener noreferrer"&gt;MeterFlow&lt;/a&gt; — a usage-based billing engine that handles event ingestion, deduplication, aggregation, fraud detection, tiered pricing, and Stripe invoice generation.&lt;/p&gt;

&lt;p&gt;This post walks through the technical decisions behind each component, including how the architecture maps to a production AWS deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build This?
&lt;/h2&gt;

&lt;p&gt;I work on subscription infrastructure at my day job — Stripe integrations, webhook handlers, entitlement APIs. But I wanted to understand how billing platforms like Lago, Metronome, and Stripe Billing work &lt;em&gt;internally&lt;/em&gt;. Not just calling the API, but building the metering and pricing layer myself.&lt;/p&gt;

&lt;p&gt;MeterFlow covers the full lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events → Dedup → Store → Aggregate → Price → Invoice → Stripe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; TypeScript, Fastify, Redis, ClickHouse, MinIO (S3-compatible), Docker Compose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐
│   Fastify     │
│   API         │
│               │
│ POST /events  │──┬──────────────────────┐
│ GET /usage    │  │                      │
└───────────────┘  │                      │
                   ▼                      ▼
          ┌──────────────┐      ┌──────────────┐
          │ Redis         │      │     MinIO     │
          │               │      │   (S3 backup) │
          │ • Dedup (NX)  │      │               │
          │ • Rate Limit  │      │ Raw events    │
          │ • Auth keys   │      │ (append-only) │
          │ • Fraud bases │      └───────────────┘
          └───────┬───────┘
                  │
                  ▼
          ┌──────────────┐      ┌──────────────┐
          │  ClickHouse   │      │    Stripe     │
          │               │      │               │
          │ • Events      │      │ Draft invoice │
          │ • Aggregation │      │ Line items    │
          │ • Analytics   │      │ Finalize+send │
          └───────────────┘      └───────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Events come in through the API, get deduplicated via Redis, stored in ClickHouse for analytics, and backed up to S3. When billing runs, the system aggregates usage, applies pricing rules, and builds Stripe invoice payloads.&lt;/p&gt;

&lt;p&gt;Every component was chosen with a clear production equivalent in mind — MinIO maps to S3, Redis to ElastiCache, the Fastify server to API Gateway + Lambda. More on that in the production architecture section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Event Ingestion &amp;amp; Deduplication
&lt;/h2&gt;

&lt;p&gt;Billing systems can't double-count. If a client retries a request, we need to reject the duplicate without rejecting new events.&lt;/p&gt;

&lt;p&gt;The approach: use Redis &lt;code&gt;SET NX&lt;/code&gt; (set-if-not-exists) with a 30-day TTL. The transaction ID is the key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`dedup:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EX&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2592000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;NX&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// 'OK' → new event (accepted)&lt;/span&gt;
&lt;span class="c1"&gt;// null → duplicate (rejected)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is atomic — if two identical events hit Redis simultaneously, only one wins. No race condition, no distributed lock needed.&lt;/p&gt;

&lt;p&gt;The 30-day TTL matches the validation window. Events older than 30 days are rejected by business logic anyway, so dedup keys auto-expire.&lt;/p&gt;

&lt;p&gt;For batch ingestion (up to 1,000 events/request), I pipeline the Redis calls so the entire batch is one round-trip. The API validates each event's schema, checks for required fields (&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;transaction_id&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;), and rejects events with timestamps outside the 30-day window before they even reach Redis.&lt;/p&gt;
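&lt;p&gt;A sketch of that pipelined check, assuming an ioredis-style client (the &lt;code&gt;RedisLike&lt;/code&gt; shape and &lt;code&gt;dedupBatch&lt;/code&gt; name are illustrative, not MeterFlow's actual API):&lt;/p&gt;

```typescript
// Sketch: batch dedup with one Redis round-trip. Each transaction ID
// becomes one SET NX command queued on a pipeline; exec() flushes them all.
// RedisLike is the minimal surface of an ioredis client that this relies on.
type RedisLike = {
  pipeline(): {
    set(key: string, value: string, ex: "EX", ttl: number, nx: "NX"): unknown;
    exec(): Promise<Array<[Error | null, unknown]>>;
  };
};

const DEDUP_TTL_SECONDS = 30 * 24 * 60 * 60; // matches the 30-day window

async function dedupBatch(
  redis: RedisLike,
  transactionIds: string[]
): Promise<{ accepted: string[]; duplicates: string[] }> {
  const pipe = redis.pipeline();
  for (const id of transactionIds) {
    pipe.set(`dedup:${id}`, "1", "EX", DEDUP_TTL_SECONDS, "NX");
  }
  // Replies come back in command order, so index i maps to transactionIds[i].
  const results = await pipe.exec();
  const accepted: string[] = [];
  const duplicates: string[] = [];
  results.forEach(([, reply], i) => {
    // 'OK' = key newly set (new event); null = key existed (duplicate).
    (reply === "OK" ? accepted : duplicates).push(transactionIds[i]);
  });
  return { accepted, duplicates };
}
```

&lt;p&gt;One &lt;code&gt;exec()&lt;/code&gt; flushes every &lt;code&gt;SET NX&lt;/code&gt; in a single round-trip, and replies preserve command order, so results map back to transaction IDs by index.&lt;/p&gt;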

&lt;p&gt;Accepted events are then written to both ClickHouse (for querying) and MinIO/S3 (as an append-only backup). The S3 backup is organized by date (&lt;code&gt;events/YYYY-MM-DD/batch_timestamp.json&lt;/code&gt;), giving you a full audit trail that's independent of the analytics store.&lt;/p&gt;
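&lt;p&gt;The backup key itself is a small pure function (a sketch; the exact filename scheme may differ from MeterFlow's):&lt;/p&gt;

```typescript
// Sketch of the date-partitioned backup key for the MinIO/S3 writes.
// Partitioning by ingestion date keeps each day's raw events listable
// under its own prefix, which is what makes targeted audits cheap.
function backupKey(batchReceivedAt: Date): string {
  const day = batchReceivedAt.toISOString().slice(0, 10); // YYYY-MM-DD
  return `events/${day}/${batchReceivedAt.getTime()}.json`;
}
```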

&lt;h2&gt;
  
  
  2. Rate Limiting with Sorted Sets
&lt;/h2&gt;

&lt;p&gt;A standard fixed-window counter has a boundary problem — with a limit of 200 requests per minute, 200 requests at 0:59 and 200 more at 1:01 each pass their own window, but that's 400 requests in two seconds.&lt;/p&gt;

&lt;p&gt;I used Redis sorted sets for a true sliding window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`ratelimit:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;windowStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 60-second window&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zremrangebyscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;windowStart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Remove old entries&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;// Add current request&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zcard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="c1"&gt;// Count in window&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                        &lt;span class="c1"&gt;// Safety TTL&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each request is a member with its timestamp as the score. To check the limit, remove anything older than 60 seconds, count what's left. The response includes &lt;code&gt;X-RateLimit-Remaining&lt;/code&gt; so clients know where they stand.&lt;/p&gt;

&lt;p&gt;In production, this pipeline approach could allow slight over-counting under high concurrency. A Lua script wrapping the same sorted set logic (&lt;code&gt;ZREMRANGEBYSCORE → ZADD → ZCARD → EXPIRE&lt;/code&gt;) would execute atomically on the Redis server. You'd also want two layers: API Gateway throttling for coarse IP-based protection, and the application layer for fine-grained per-customer limits tied to billing tiers.&lt;/p&gt;
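&lt;p&gt;Stripped of Redis, the decision the Lua script would make atomically reduces to a few lines. Here is an in-memory model of the same sorted-set steps (illustrative only, with a &lt;code&gt;Map&lt;/code&gt; standing in for the sorted set):&lt;/p&gt;

```typescript
// In-memory model of the sliding-window check: the same steps the Lua
// script would run atomically on the Redis server. `entries` plays the
// role of the sorted set: member -> score (timestamp in ms).
function slidingWindowAllow(
  entries: Map<string, number>,
  nowMs: number,
  windowMs: number,
  limit: number,
  requestId: string
): boolean {
  // ZREMRANGEBYSCORE: drop everything that has aged out of the window
  for (const [member, score] of entries) {
    if (score <= nowMs - windowMs) entries.delete(member);
  }
  // ZADD: record this request with its timestamp as the score
  entries.set(`${nowMs}:${requestId}`, nowMs);
  // ZCARD: count what's left in the window
  return entries.size <= limit;
}
```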

&lt;h2&gt;
  
  
  3. Billable Metrics Catalog
&lt;/h2&gt;

&lt;p&gt;Rather than hardcoding what's billable, I use a catalog that maps raw events to billable quantities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Event Type&lt;/th&gt;
&lt;th&gt;Aggregation&lt;/th&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;api_calls&lt;/td&gt;
&lt;td&gt;api_request&lt;/td&gt;
&lt;td&gt;COUNT&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bandwidth&lt;/td&gt;
&lt;td&gt;api_request&lt;/td&gt;
&lt;td&gt;SUM&lt;/td&gt;
&lt;td&gt;bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;storage_peak&lt;/td&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;MAX&lt;/td&gt;
&lt;td&gt;gb_stored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compute_time&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;SUM&lt;/td&gt;
&lt;td&gt;cpu_ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The usage query engine reads this catalog and builds the appropriate ClickHouse query dynamically. Adding a new metric means adding one config entry — no query changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// COUNT → SELECT count() FROM events WHERE ...&lt;/span&gt;
&lt;span class="c1"&gt;// SUM   → SELECT sum(JSONExtractFloat(properties, 'bytes')) WHERE ...&lt;/span&gt;
&lt;span class="c1"&gt;// MAX   → SELECT max(JSONExtractFloat(properties, 'gb_stored')) WHERE ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ClickHouse is a good fit here because it's columnar — &lt;code&gt;SUM(bytes) FROM events&lt;/code&gt; only reads the bytes column, not the entire row. But it's append-optimized, so you don't want to update or delete individual rows. For billing, that's fine — events are immutable.&lt;/p&gt;
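&lt;p&gt;A sketch of how the catalog can drive query generation (the config shape and use of ClickHouse &lt;code&gt;{name:Type}&lt;/code&gt; parameter placeholders are assumptions, not MeterFlow's exact code):&lt;/p&gt;

```typescript
// Sketch of the catalog-driven query builder. The metric config is the
// only thing that changes when a new billable metric is added.
type Aggregation = "COUNT" | "SUM" | "MAX";

interface MetricConfig {
  eventType: string;
  aggregation: Aggregation;
  property?: string; // JSON property to aggregate (unused for COUNT)
}

function buildUsageQuery(metric: MetricConfig): string {
  const agg =
    metric.aggregation === "COUNT"
      ? "count()"
      : `${metric.aggregation.toLowerCase()}(JSONExtractFloat(properties, '${metric.property}'))`;
  // customer_id and billing period are bound as ClickHouse query parameters
  return (
    `SELECT ${agg} AS value FROM events ` +
    `WHERE event_type = '${metric.eventType}' ` +
    `AND customer_id = {customer_id:String} ` +
    `AND timestamp BETWEEN {start:DateTime} AND {end:DateTime}`
  );
}
```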

&lt;h2&gt;
  
  
  4. Tiered Pricing Calculation
&lt;/h2&gt;

&lt;p&gt;MeterFlow supports flat and tiered pricing. Tiered is the interesting one — the system walks through tiers progressively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API Calls Pricing:
  Tier 1: 0–1,000      → $0.00/call (free tier)
  Tier 2: 1,001–10,000 → $0.001/call
  Tier 3: 10,001+      → $0.0005/call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For 15,000 API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tier 1: 1,000 × $0.00   = $0.00
Tier 2: 9,000 × $0.001  = $9.00
Tier 3: 5,000 × $0.0005 = $2.50
Total: $11.50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pricing engine consumes quantity through each tier until it's exhausted. All amounts are converted to cents before hitting Stripe — billing systems should never do floating-point math on final amounts.&lt;/p&gt;
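&lt;p&gt;A minimal version of that tier walk, assuming prices are stored as integer micro-dollars per unit so that $0.001/call becomes 1,000 µ$ (the &lt;code&gt;Tier&lt;/code&gt; shape is illustrative):&lt;/p&gt;

```typescript
// Sketch of the progressive tier walk, done entirely in integers.
// Prices are integer micro-dollars per unit ($0.001 = 1_000 µ$), so no
// floating point touches the running total.
interface Tier {
  upTo: number; // inclusive upper bound on cumulative quantity
  microDollarsPerUnit: number;
}

function priceTieredCents(quantity: number, tiers: Tier[]): number {
  let remaining = quantity;
  let consumed = 0;
  let totalMicro = 0;
  for (const tier of tiers) {
    if (remaining === 0) break;
    const inTier = Math.min(remaining, tier.upTo - consumed);
    totalMicro += inTier * tier.microDollarsPerUnit;
    consumed += inTier;
    remaining -= inTier;
  }
  return totalMicro / 10_000; // µ$ -> cents (round per your billing policy)
}

const apiCallTiers: Tier[] = [
  { upTo: 1_000, microDollarsPerUnit: 0 },      // free tier
  { upTo: 10_000, microDollarsPerUnit: 1_000 }, // $0.001/call
  { upTo: Infinity, microDollarsPerUnit: 500 }, // $0.0005/call
];
```

&lt;p&gt;Running the worked example above, &lt;code&gt;priceTieredCents(15_000, apiCallTiers)&lt;/code&gt; yields 1,150 cents, i.e. $11.50.&lt;/p&gt;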

&lt;h2&gt;
  
  
  5. Fraud Detection
&lt;/h2&gt;

&lt;p&gt;This is where MeterFlow goes beyond basic metering. It uses a two-layer approach to catch both volume anomalies and pattern anomalies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Z-Score Volume Detection
&lt;/h3&gt;

&lt;p&gt;Compare current usage against a 30-day baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Z = (current_value - mean) / stddev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;|Z| &amp;gt;= 3&lt;/code&gt; (three standard deviations), flag it. This catches obvious spikes — someone hammering your API 10x more than normal.&lt;/p&gt;
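&lt;p&gt;The volume check is a few lines of arithmetic (a sketch; function names are illustrative):&lt;/p&gt;

```typescript
// Sketch of the volume check: flag when current usage sits 3+ standard
// deviations away from the 30-day baseline.
function zScore(current: number, baseline: number[]): number {
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const variance =
    baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length;
  const stddev = Math.sqrt(variance);
  if (stddev === 0) return 0; // flat baseline: no meaningful z-score
  return (current - mean) / stddev;
}

function isVolumeAnomaly(current: number, baseline: number[]): boolean {
  return Math.abs(zScore(current, baseline)) >= 3;
}
```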

&lt;p&gt;But it misses a critical attack vector: &lt;strong&gt;same volume, different pattern.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Cosine Similarity Pattern Detection
&lt;/h3&gt;

&lt;p&gt;A stolen API key might generate the same number of calls per day, but at completely different hours. Z-score wouldn't catch this because the volume is normal.&lt;/p&gt;

&lt;p&gt;The approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build baselines&lt;/strong&gt; — process 30 days of history into per-weekday, 24-dimensional hourly vectors (Mondays have different patterns than weekends)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normalize&lt;/strong&gt; — divide by the sum so we're comparing &lt;em&gt;shape&lt;/em&gt;, not volume&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare with cosine similarity&lt;/strong&gt; — 1.0 means identical pattern, below 0.9 triggers a flag&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal Tuesday:    [0.01, 0.01, ..., 0.15, 0.15, ..., 0.02]
                   (quiet at night, peaks 9am-5pm)

Stolen key usage:  [0.15, 0.15, ..., 0.01, 0.01, ..., 0.15]
                   (peaks at night — attacker in different timezone)

Cosine similarity: ~0.28 → FRAUD DETECTED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The volume is identical. The Z-score is normal. But the pattern is inverted. Cosine similarity catches it immediately.&lt;/p&gt;
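&lt;p&gt;Both steps, normalization and comparison, fit in a few lines (a sketch of the math, not MeterFlow's exact code):&lt;/p&gt;

```typescript
// Sketch of the pattern check: normalize each 24-hour vector to unit sum
// (comparing shape, not volume), then take cosine similarity.
function normalize(hours: number[]): number[] {
  const total = hours.reduce((a, b) => a + b, 0);
  return total === 0 ? hours : hours.map((h) => h / total);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```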

&lt;p&gt;Baselines are stored in Redis with 90-day TTL. The detection runs per-customer, per-metric, with separate weekday profiles. The system includes a dashboard that visualizes normal usage patterns vs. detected anomalies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmye8e76r1uqj7agthjer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmye8e76r1uqj7agthjer.png" alt="MeterFlow Dashboard — Normal Usage Patterns" width="800" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When fraud is detected, the dashboard highlights the anomaly with the cosine similarity score. In this example, a customer's pattern dropped to 30.2% similarity against their baseline — a clear sign of compromised credentials:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn82lvu8e88vr15via7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn82lvu8e88vr15via7o.png" alt="MeterFlow Dashboard — Fraud Detected (30.2% similarity)" width="800" height="731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Stripe Integration
&lt;/h2&gt;

&lt;p&gt;The billing endpoint builds complete Stripe API payloads following the full invoice lifecycle: create draft invoice, add line items per metric (with tier breakdowns in metadata), finalize, and send.&lt;/p&gt;

&lt;p&gt;Each operation uses an idempotency key derived from the invoice ID and billing period:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;idempotencyKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="s2"&gt;`meterflow_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;invoiceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;periodStart&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;periodEnd&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that retries, duplicate triggers, or manual re-runs don't double-charge customers. When Stripe receives a repeated request with the same key, it replays the original response instead of executing the operation again; keys remain usable for at least 24 hours.&lt;/p&gt;

&lt;p&gt;For the demo, this runs in dry-run mode — payloads are built but not sent to Stripe. Swapping to live is a one-line change from payload builders to actual SDK calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Production Architecture on AWS
&lt;/h2&gt;

&lt;p&gt;Every local component in MeterFlow was designed with a clear AWS production mapping. Here's how the demo stack translates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Demo (Local)&lt;/th&gt;
&lt;th&gt;Production (AWS)&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fastify server&lt;/td&gt;
&lt;td&gt;API Gateway + Lambda&lt;/td&gt;
&lt;td&gt;Auto-scaling, managed TLS, WAF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis (single)&lt;/td&gt;
&lt;td&gt;ElastiCache (Redis Cluster)&lt;/td&gt;
&lt;td&gt;HA, automatic failover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse (Docker)&lt;/td&gt;
&lt;td&gt;ClickHouse Cloud&lt;/td&gt;
&lt;td&gt;Managed, scalable, VPC peering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MinIO&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Lifecycle policies, cross-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In-process sync&lt;/td&gt;
&lt;td&gt;Kinesis&lt;/td&gt;
&lt;td&gt;Async buffer, back-pressure, replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cron / manual&lt;/td&gt;
&lt;td&gt;EventBridge&lt;/td&gt;
&lt;td&gt;Managed scheduling, reliable triggers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Console logs&lt;/td&gt;
&lt;td&gt;CloudWatch + SNS&lt;/td&gt;
&lt;td&gt;Alerting, dashboards, PagerDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Async Ingestion with Kinesis
&lt;/h3&gt;

&lt;p&gt;The biggest architectural shift for production is decoupling ingestion from processing. In the demo, the pipeline is synchronous: validate → dedup → store → backup, all in one request. In production, you'd buffer through Kinesis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → API Gateway → Lambda (validate + dedup) → Kinesis → Lambda (store to ClickHouse)
                                                         ↘ S3 (backup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kinesis gives you ordered delivery within a shard (partition by &lt;code&gt;customer_id&lt;/code&gt;), replayability if a downstream consumer fails, and natural back-pressure through shard limits. Clients receive &lt;code&gt;202 Accepted&lt;/code&gt; immediately instead of waiting for the full pipeline.&lt;/p&gt;

&lt;p&gt;Failed batches route to an SQS dead letter queue for investigation and replay. The dedup layer (Redis &lt;code&gt;SET NX&lt;/code&gt;) works the same way regardless of whether an event arrives via HTTP or Kinesis — duplicates are caught either way.&lt;/p&gt;
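&lt;p&gt;The hand-off itself is mostly about choosing the partition key. A sketch of the record shape for the AWS SDK v3 &lt;code&gt;PutRecordsCommand&lt;/code&gt; (the stream name and helper are assumptions):&lt;/p&gt;

```typescript
// Sketch of the Kinesis hand-off. Partitioning by customer_id keeps each
// customer's events ordered within a shard; the record shape matches the
// AWS SDK v3 PutRecordsCommand input.
interface UsageEvent {
  customer_id: string;
  event_type: string;
  transaction_id: string;
  timestamp: string;
}

function toKinesisRecords(events: UsageEvent[]) {
  return events.map((event) => ({
    PartitionKey: event.customer_id, // per-customer ordering
    Data: Buffer.from(JSON.stringify(event)),
  }));
}

// In the producer Lambda (after validate + dedup):
//   await kinesis.send(new PutRecordsCommand({
//     StreamName: "meterflow-events",
//     Records: toKinesisRecords(accepted),
//   }));
```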

&lt;h3&gt;
  
  
  Scheduled Jobs with EventBridge
&lt;/h3&gt;

&lt;p&gt;Billing, anomaly detection, and fraud baseline rebuilds all become scheduled jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventBridge (1st of month)  → Lambda → Aggregate usage → Stripe API (invoicing)
EventBridge (hourly/daily)  → Lambda → Z-score + cosine similarity → SNS (alerts)
EventBridge (weekly)        → Lambda → Rebuild fraud baselines → Redis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detection logic itself (&lt;code&gt;checkAnomaly()&lt;/code&gt;, &lt;code&gt;checkFraud()&lt;/code&gt;) would be reused as-is from the demo — it already takes parameters for baseline window and threshold. The change is just in how it's triggered and where alerts go (SNS → Slack/PagerDuty instead of console logs).&lt;/p&gt;

&lt;h3&gt;
  
  
  State and Alerting
&lt;/h3&gt;

&lt;p&gt;DynamoDB handles billing state (invoice status, anomaly records with TTL for auto-cleanup). SNS topics route to email, Slack, or PagerDuty based on alert severity. CloudWatch dashboards provide real-time visibility into ingestion rates, error rates, and billing job status.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Deduplication is deceptively simple.&lt;/strong&gt; &lt;code&gt;SET NX&lt;/code&gt; solves it cleanly, but the hard part is deciding what the dedup window should be and how to handle events that arrive after the window closes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Billing math needs to happen in cents.&lt;/strong&gt; Floating-point rounding will bite you. Convert to integers as early as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern-based fraud detection is more useful than volume-based.&lt;/strong&gt; Sophisticated attackers will stay under volume thresholds. They can't easily replicate a customer's hourly usage pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design for the production target early.&lt;/strong&gt; Using MinIO instead of a local filesystem, Redis instead of in-memory maps, and S3-compatible APIs from the start meant every component has a clear AWS upgrade path. The business logic doesn't change — only the infrastructure layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The entire system runs locally with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ajithmanmu/meterflow
&lt;span class="nb"&gt;cd &lt;/span&gt;meterflow
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
pnpm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pnpm dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a validation script that runs 60 end-to-end checks across all components, and demo scripts that simulate 30 days of normal usage followed by fraud injection so you can see the detection in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ajithmanmu/meterflow" rel="noopener noreferrer"&gt;github.com/ajithmanmu/meterflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inspired by &lt;a href="https://github.com/getlago/lago" rel="noopener noreferrer"&gt;Lago&lt;/a&gt;'s open-source billing platform.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a backend engineer building subscription and payment infrastructure. If you're working on billing systems or usage-based pricing, I'd love to hear about your approach.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>billing</category>
      <category>aws</category>
      <category>redis</category>
    </item>
    <item>
      <title>Building a Webhook Replay System with AWS Kinesis</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Thu, 15 Jan 2026 20:09:50 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/building-a-webhook-replay-system-with-aws-kinesis-2682</link>
      <guid>https://dev.to/ajithmanmu/building-a-webhook-replay-system-with-aws-kinesis-2682</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Payment webhooks from Stripe, Apple, and Google are revenue-critical, but they're tricky to handle correctly. Events arrive out of order, can be duplicated, and if your processing logic has a bug, you can corrupt subscription state with no way to recover.&lt;/p&gt;

&lt;p&gt;I built a webhook broker that treats Kinesis Data Streams as an immutable event log. When things go wrong, you can replay events and rebuild subscription state from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here's how the system works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment Providers → API Gateway → Lambda (Ingestion)
                                       ↓
                                  Kinesis Stream (7-day retention)
                                       ↓
                                  Lambda (Processor)
                                       ↓
                                  DynamoDB (State + Idempotency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Services Used
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Gateway (HTTP API)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public endpoint for webhooks: &lt;code&gt;/webhooks/{provider}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Rate limiting: 100 req/sec, burst 200&lt;/li&gt;
&lt;li&gt;Routes: &lt;code&gt;/webhooks/stripe&lt;/code&gt;, &lt;code&gt;/webhooks/apple&lt;/code&gt;, &lt;code&gt;/webhooks/google&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lambda 1: Ingestion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verifies webhook signatures (HMAC-SHA256 for Stripe)&lt;/li&gt;
&lt;li&gt;Extracts partition key: &lt;code&gt;provider:subscriptionId&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Writes raw event to Kinesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kinesis Data Streams&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7-day retention (extendable to 365 days)&lt;/li&gt;
&lt;li&gt;Source of truth for all webhook events&lt;/li&gt;
&lt;li&gt;Partition key ensures per-subscription ordering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lambda 2: Processor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads from Kinesis stream&lt;/li&gt;
&lt;li&gt;Sorts events by timestamp (handles out-of-order delivery)&lt;/li&gt;
&lt;li&gt;Uses DynamoDB conditional writes for idempotency&lt;/li&gt;
&lt;li&gt;Updates subscription state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB Tables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ProcessedEvents&lt;/code&gt;: Idempotency check (TTL: 90 days)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SubscriptionState&lt;/code&gt;: Current subscription data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SQS Dead Letter Queue&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures poison messages after 3 retries&lt;/li&gt;
&lt;li&gt;14-day retention for manual investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kinesis as Event Log
&lt;/h3&gt;

&lt;p&gt;Kinesis isn't just a queue—it's a durable log. Every webhook is preserved for 7 days (configurable up to 365). This gives you time-travel capability: replay events from any point in the retention window.&lt;/p&gt;
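&lt;p&gt;Replay starts by asking Kinesis for a shard iterator positioned at a timestamp inside the retention window. A sketch of the &lt;code&gt;GetShardIterator&lt;/code&gt; input (stream and shard names are placeholders):&lt;/p&gt;

```typescript
// Sketch: build the AWS SDK v3 GetShardIterator input for a replay that
// starts at a given point in time. AT_TIMESTAMP positions the iterator at
// the first record at or after `from` within the retention window.
function replayIteratorInput(streamName: string, shardId: string, from: Date) {
  return {
    StreamName: streamName,
    ShardId: shardId,
    ShardIteratorType: "AT_TIMESTAMP" as const,
    Timestamp: from,
  };
}

// const { ShardIterator } = await kinesis.send(new GetShardIteratorCommand(
//   replayIteratorInput("webhook-events", "shardId-000000000000", new Date("2026-01-10"))));
```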

&lt;p&gt;For longer-term needs (regulatory audits, multi-year forensics), you can archive to S3 and implement cold replay from there.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Partition Keys for Ordering
&lt;/h3&gt;

&lt;p&gt;Events are sharded by &lt;code&gt;provider:subscriptionId&lt;/code&gt; (e.g., &lt;code&gt;stripe:sub_premium_user_001&lt;/code&gt;). This gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ordering guarantees per subscription&lt;/strong&gt;: Events for the same subscription are processed in order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular replay control&lt;/strong&gt;: Replay just one customer's events, or all events from a specific provider&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Idempotency with DynamoDB
&lt;/h3&gt;

&lt;p&gt;The processor uses conditional writes to the &lt;code&gt;ProcessedEvents&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Only write if eventId doesn't exist&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putItem&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ProcessedEvents&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;eventId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subscriptionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;attribute_not_exists(eventId)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents duplicate processing, even when replaying events that were already handled.&lt;/p&gt;
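&lt;p&gt;The behavior this buys is easiest to see as set membership: process an event only if its ID has never been recorded. Here's a minimal in-memory model of what the conditional write enforces durably:&lt;br&gt;
&lt;/p&gt;

```typescript
// In-memory model of the ProcessedEvents conditional write: process an
// event only if its eventId was never recorded, otherwise skip it.
// This mirrors what attribute_not_exists(eventId) enforces in DynamoDB.
class DedupProcessor {
  private seen = new Set();
  processed = 0;
  skipped = 0;

  handle(eventId: string): boolean {
    if (this.seen.has(eventId)) {
      this.skipped += 1;
      return false; // duplicate, e.g. seen again during a replay
    }
    this.seen.add(eventId);
    this.processed += 1;
    return true;
  }
}
```

&lt;p&gt;Run any mix of fresh and replayed events through it and &lt;code&gt;processed&lt;/code&gt; always equals the number of unique IDs, which is why replays are safe to run repeatedly.&lt;/p&gt;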

&lt;h2&gt;
  
  
  Demo: Recovery from Data Loss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: A bug deletes subscription data from DynamoDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Initial State
&lt;/h3&gt;

&lt;p&gt;Subscription has 4 processed events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws dynamodb get-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; webhook-broker-dev-subscription-state &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s1"&gt;'{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'&lt;/span&gt;

&lt;span class="c"&gt;# Returns: eventCount: 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Simulate Data Loss
&lt;/h3&gt;

&lt;p&gt;Delete subscription state and processed events (simulating a bug):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/delete_subscription_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; stripe &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription&lt;/span&gt; sub_premium_user_001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execute&lt;/span&gt;

&lt;span class="c"&gt;# Deletes: SubscriptionState + 4 ProcessedEvents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Confirm Deletion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws dynamodb get-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; webhook-broker-dev-subscription-state &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s1"&gt;'{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'&lt;/span&gt;

&lt;span class="c"&gt;# Returns: (empty)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Replay from Kinesis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/replay.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription&lt;/span&gt; sub_premium_user_001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-beginning&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--execute&lt;/span&gt;

&lt;span class="c"&gt;# Replays 9 events from stream&lt;/span&gt;
&lt;span class="c"&gt;# Processes 4 unique events&lt;/span&gt;
&lt;span class="c"&gt;# Skips 5 duplicates (idempotency)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Verify Recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws dynamodb get-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; webhook-broker-dev-subscription-state &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s1"&gt;'{"provider": {"S": "stripe"}, "subscriptionId": {"S": "sub_premium_user_001"}}'&lt;/span&gt;

&lt;span class="c"&gt;# Returns: eventCount: 4 (restored!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: The subscription was rebuilt with the exact same state, and idempotency prevented any duplicate processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Bug Recovery&lt;/strong&gt;&lt;br&gt;
Deploy a bug that corrupts state → Fix the code → Replay events → State rebuilt correctly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Schema Evolution&lt;/strong&gt;&lt;br&gt;
Add a new field to your subscription model → Replay events to backfill the data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Revenue Debugging&lt;/strong&gt;&lt;br&gt;
Finance reports a discrepancy → Replay specific time range → Trace what happened&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Provider Outage Recovery&lt;/strong&gt;&lt;br&gt;
Stripe had an outage yesterday → Replay all Stripe events from that window → Ensure nothing was missed&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Terraform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Functions&lt;/strong&gt;: TypeScript (Node.js 18)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay CLI&lt;/strong&gt;: Python 3.9+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Services&lt;/strong&gt;: API Gateway, Kinesis, Lambda, DynamoDB, SQS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Considerations
&lt;/h2&gt;

&lt;p&gt;With 7-day retention and moderate volume (10,000 events/day):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kinesis Data Streams: ~$15/month (1 shard)&lt;/li&gt;
&lt;li&gt;Lambda: ~$5/month (first 1M requests free)&lt;/li&gt;
&lt;li&gt;DynamoDB: ~$5/month (on-demand pricing)&lt;/li&gt;
&lt;li&gt;API Gateway: ~$3.50/month (first 1M requests free)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total&lt;/strong&gt;: ~$30/month for production-grade event replay capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;Full implementation with Terraform, TypeScript Lambdas, and Python replay tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ajithmanmu/webhook-broker" rel="noopener noreferrer"&gt;https://github.com/ajithmanmu/webhook-broker&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;The key insight: Treat your event stream as a source of truth, not just a transport layer. When you have an immutable log, recovery becomes a replay operation instead of a panic.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>webdev</category>
      <category>stripe</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>What the AWS us-east-1 Outage Taught Me About Building Resilient Systems</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Sun, 14 Dec 2025 19:45:22 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/what-the-aws-us-east-1-outage-taught-me-about-building-resilient-systems-4k59</link>
      <guid>https://dev.to/ajithmanmu/what-the-aws-us-east-1-outage-taught-me-about-building-resilient-systems-4k59</guid>
      <description>&lt;p&gt;AWS us-east-1 will go down again. When it does, will your system survive?&lt;/p&gt;

&lt;p&gt;This past weekend, I built a system designed to survive it.&lt;/p&gt;

&lt;p&gt;After 8 years building subscription infrastructure at Surfline—processing payments through Stripe, Apple, and Google Play—I've learned that &lt;strong&gt;the question isn't whether your cloud provider will fail. It's whether your architecture degrades gracefully when it does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent 4 hours implementing three reliability patterns sourced directly from the &lt;a href="https://aws.amazon.com/builders-library/" rel="noopener noreferrer"&gt;AWS Builders' Library&lt;/a&gt;, Google SRE practices, and Stripe's engineering blog. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Payment Systems Can't Afford to Fail
&lt;/h2&gt;

&lt;p&gt;When AWS has an incident, your Lambda functions time out. Your DynamoDB calls fail. Your SQS queues back up.&lt;/p&gt;

&lt;p&gt;For most applications, users see an error page and retry later. But payment systems are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A failed charge might actually have succeeded&lt;/li&gt;
&lt;li&gt;A retry might double-charge the customer&lt;/li&gt;
&lt;li&gt;A thundering herd of retries can cascade the failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You need patterns that handle partial failures without losing money or trust.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: Exponential Backoff with Full Jitter
&lt;/h2&gt;

&lt;p&gt;The AWS Builders' Library article on &lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/" rel="noopener noreferrer"&gt;Timeouts, retries, and backoff with jitter&lt;/a&gt; changed how I think about retry logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; Without jitter, all clients retry at the exact same intervals. If 1,000 requests fail at t=0, they all retry at t=1s, then t=2s, then t=4s—creating synchronized waves that hammer your recovering service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Full jitter formula from AWS Builders' Library&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;calculateDelay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exponentialDelay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;MAX_DELAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;INITIAL_DELAY&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Full jitter: random value between 0 and exponential delay&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;exponentialDelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; Success rates improved from ~70% to 99%+ in my load tests. The jitter spreads retry load evenly across time instead of creating synchronized spikes.&lt;/p&gt;
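&lt;p&gt;You can see the effect in a quick simulation: without jitter, every client in a failed wave computes the identical delay and collides again; with full jitter, the wave spreads over the whole window. The constants below are demo values, not the ones from my load tests:&lt;br&gt;
&lt;/p&gt;

```typescript
// Simulate retry delays for a wave of clients at a given attempt number.
// Without jitter every client lands on the same instant; with full jitter
// the delays spread uniformly over [0, cap]. Constants are demo values.
const INITIAL_DELAY = 100; // ms
const MAX_DELAY = 5000;    // ms

function capFor(attempt: number): number {
  return Math.min(MAX_DELAY, INITIAL_DELAY * Math.pow(2, attempt));
}

function fullJitterDelay(attempt: number): number {
  return Math.random() * capFor(attempt);
}

// How many clients collide in the single busiest time bucket?
function worstBucket(delays: number[], bucketMs: number): number {
  const buckets = new Map();
  for (const d of delays) {
    const b = Math.floor(d / bucketMs);
    buckets.set(b, (buckets.get(b) || 0) + 1);
  }
  return Math.max(...buckets.values());
}
```

&lt;p&gt;For 1,000 clients at attempt 3, the no-jitter wave puts all 1,000 retries in the same bucket; full jitter scatters them across the 800 ms window.&lt;/p&gt;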

&lt;h3&gt;
  
  
  AWS Application
&lt;/h3&gt;

&lt;p&gt;This pattern is critical when calling AWS services during degraded states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda retrying DynamoDB&lt;/strong&gt; during throttling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECS tasks calling external APIs&lt;/strong&gt; through NAT Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions&lt;/strong&gt; with retry policies on service integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pattern 2: Bounded Queues with Worker Pools
&lt;/h2&gt;

&lt;p&gt;Here's something I discovered through testing that surprised me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A bounded queue alone doesn't limit concurrent processing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I set up a queue with capacity 100, sent 200 requests, and expected ~100 rejections. Instead: zero rejections. Why? Node.js was processing requests faster than they accumulated. The queue checked capacity but didn't control throughput.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What you actually need: queue + worker pool&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BoundedQueue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;capacity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// HTTP 429 - fail fast&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkerPool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;activeWorkers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;maxWorkers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// THIS controls throughput&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BoundedQueue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeWorkers&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxWorkers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Actually limits concurrent execution&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Application
&lt;/h3&gt;

&lt;p&gt;This maps directly to AWS service patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQS + Lambda concurrency limits&lt;/strong&gt;: The queue (SQS) buffers; reserved concurrency limits throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway + throttling&lt;/strong&gt;: Request queuing with rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kinesis + Lambda&lt;/strong&gt;: Batch size and parallelization factor control processing rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: &lt;strong&gt;SQS without Lambda concurrency limits is like a bounded queue without a worker pool&lt;/strong&gt;—it buffers but doesn't protect downstream systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: Idempotency with Strategic Caching
&lt;/h2&gt;

&lt;p&gt;Stripe's &lt;a href="https://stripe.com/docs/api/idempotent_requests" rel="noopener noreferrer"&gt;idempotency documentation&lt;/a&gt; shaped this implementation. The pattern: cache successful responses for 24 hours, never cache errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IdempotencyStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CachedResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;inFlight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check cache first&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Detect concurrent duplicates&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inFlight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConflictError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Request already in progress&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inFlight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="c1"&gt;// Only cache successes&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inFlight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Application
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB for idempotency keys&lt;/strong&gt;: Conditional writes with TTL for automatic cleanup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Powertools&lt;/strong&gt;: Built-in &lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/utilities/idempotency/" rel="noopener noreferrer"&gt;idempotency utility&lt;/a&gt; using DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions&lt;/strong&gt;: Native idempotency with execution names
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// DynamoDB idempotency pattern&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;IdempotencyStore&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt; &lt;span class="c1"&gt;// 24 hours&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;attribute_not_exists(idempotencyKey)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Architecture: Putting It Together
&lt;/h2&gt;

&lt;p&gt;Here's how these patterns compose into a resilient payment processing system on AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                     API Gateway                              │
│                   (Rate Limiting)                            │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                      SQS Queue                               │
│              (Bounded Queue - Buffer)                        │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│              Lambda (Reserved Concurrency = 10)              │
│                   (Worker Pool)                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  1. Check DynamoDB idempotency store                    ││
│  │  2. Process payment with retry + jitter                 ││
│  │  3. Store result in DynamoDB                            ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                 DynamoDB Tables                              │
│    - IdempotencyStore (with TTL)                            │
│    - ProcessingResults                                       │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways for AWS Builders
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the AWS Builders' Library.&lt;/strong&gt; It's written by engineers who've operated services at massive scale. The &lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/" rel="noopener noreferrer"&gt;jitter article&lt;/a&gt; alone is worth your time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test your assumptions.&lt;/strong&gt; I assumed bounded queues limited throughput. They don't. Load testing revealed the gap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accept the tradeoff.&lt;/strong&gt; These patterns increase latency. A request that would fail in 100ms might now take 5 seconds across retries. But 99%+ success beats 70% success every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use AWS primitives.&lt;/strong&gt; SQS, Lambda concurrency, DynamoDB TTL, and Step Functions give you these patterns without building from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/ajithmanmu/resilient-relay" rel="noopener noreferrer"&gt;resilient-relay repo&lt;/a&gt; has the full implementation. I'm planning to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-letter queue handling for failed payments&lt;/li&gt;
&lt;li&gt;CloudWatch metrics for RED (Rate, Errors, Duration) observability&lt;/li&gt;
&lt;li&gt;Multi-region failover patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;When us-east-1 goes down again—and it will—your system should degrade gracefully, not catastrophically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AWS Builders' Library exists because Amazon learned these lessons operating AWS itself. The patterns are proven. The question is whether we apply them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What reliability patterns have you implemented in your AWS architectures? I'd love to hear what's worked (or failed spectacularly) in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/ajithmanmu/resilient-relay" rel="noopener noreferrer"&gt;GitHub: resilient-relay&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/ajith-manmadhan-94a36713/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Built an Autonomous AI Customer Retention Agent with AWS Bedrock AgentCore</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Wed, 15 Oct 2025 12:23:21 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/how-i-built-an-autonomous-ai-customer-retention-agent-with-aws-bedrock-agentcore-3ckp</link>
      <guid>https://dev.to/ajithmanmu/how-i-built-an-autonomous-ai-customer-retention-agent-with-aws-bedrock-agentcore-3ckp</guid>
      <description>&lt;p&gt;&lt;em&gt;Built for the AWS AI Agent Global Hackathon&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After building a &lt;a href="https://dev.to/ajithmanmu/how-i-built-a-serverless-data-analytics-pipeline-for-customer-churn-with-s3-glue-athena-and-bfk"&gt;serverless data analytics pipeline for customer churn&lt;/a&gt;, I had clean, query-ready customer data sitting in Amazon Athena. The next logical step was to make that data &lt;em&gt;actionable&lt;/em&gt; — not just for analysts, but for customers themselves.&lt;/p&gt;

&lt;p&gt;That's where the &lt;strong&gt;Customer Retention Agent&lt;/strong&gt; comes in. This is a fully autonomous AI agent built on AWS Bedrock AgentCore that identifies at-risk customers and proactively offers them personalized retention deals through natural conversation. I built this as part of the &lt;strong&gt;AWS AI Agent Global Hackathon&lt;/strong&gt;, and it's a natural continuation of my previous project.&lt;/p&gt;

&lt;p&gt;Before diving into the build, I spent time going through the &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Samples&lt;/a&gt; repository. The tutorials there were incredibly helpful for getting up to speed with AgentCore concepts — from Runtime and Gateway to Memory and Identity. If you're new to AgentCore, I highly recommend starting there.&lt;/p&gt;

&lt;p&gt;The goal was simple: &lt;strong&gt;What if customers could talk to an AI agent that knows their churn risk and can instantly generate personalized discount codes?&lt;/strong&gt; No forms, no waiting for customer service — just a conversation that might save their subscription.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the high-level design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frclaicx7jtryp2ebvj8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frclaicx7jtryp2ebvj8n.png" alt="Architecture" width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock AgentCore (Runtime, Gateway, Memory)&lt;/strong&gt; — The brain of the system. Runtime hosts the agent, Gateway connects to external tools, and Memory persists conversation context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude 3.7 Sonnet&lt;/strong&gt; — Powers autonomous reasoning and multi-step decision-making.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next.js Frontend&lt;/strong&gt; — Chat interface deployed on Vercel with streaming responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda (3 functions)&lt;/strong&gt; — Churn Data Query, Retention Offer, and Web Search, exposed as tools via the MCP protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Athena&lt;/strong&gt; — Queries the Telco customer churn dataset (from my previous project).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Cognito&lt;/strong&gt; — Dual authentication: web client for users, M2M client for agent-to-Gateway communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock Knowledge Base&lt;/strong&gt; — RAG implementation with company policies and troubleshooting guides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3&lt;/strong&gt; — Stores customer data and knowledge base documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the full implementation here: &lt;a href="https://github.com/ajithmanmu/customer-retention-agent" rel="noopener noreferrer"&gt;https://github.com/ajithmanmu/customer-retention-agent&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo Video
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=nt2-iE_qBIw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=nt2-iE_qBIw&lt;/a&gt;&lt;br&gt;
URL: &lt;a href="https://customer-retention-agent.vercel.app/" rel="noopener noreferrer"&gt;https://customer-retention-agent.vercel.app/&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Demo showing the agent in action - analyzing churn risk and generating discount codes&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Walkthrough
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;The User Journey&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When a customer logs into the chat interface:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Frontend authenticates via Cognito, receives JWT token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JWT Mapping&lt;/strong&gt;: Token contains Cognito user ID (UUID) which gets mapped to actual customer ID in the dataset (e.g., &lt;code&gt;"3916-NRPAP"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation Starts&lt;/strong&gt;: User sends a message, AgentCore Runtime receives request with JWT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Retrieval&lt;/strong&gt;: Before responding, agent pulls customer context from Memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Reasoning&lt;/strong&gt;: Claude 3.7 Sonnet decides which tools to call (if any)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Execution&lt;/strong&gt;: Agent calls Lambda functions via Gateway for data/actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Generation&lt;/strong&gt;: Claude synthesizes response with retrieved data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Saving&lt;/strong&gt;: Interaction gets saved to Memory for future conversations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l1rd9sertdhr6dfrdkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l1rd9sertdhr6dfrdkr.png" alt="chat ui" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Dual Authentication Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This was one of the trickier parts. The system needs two separate authentication flows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web Client (User → Runtime):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User logs in with username/password&lt;/li&gt;
&lt;li&gt;Cognito returns JWT token&lt;/li&gt;
&lt;li&gt;Frontend includes JWT in every request to AgentCore Runtime&lt;/li&gt;
&lt;li&gt;Token contains &lt;code&gt;sub&lt;/code&gt; field with user ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;M2M Client (Agent → Gateway):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent needs to call Lambda functions via Gateway&lt;/li&gt;
&lt;li&gt;Uses OAuth 2.0 client credentials flow&lt;/li&gt;
&lt;li&gt;Confidential client with client secret stored in SSM&lt;/li&gt;
&lt;li&gt;Access token validates at Gateway before allowing tool calls&lt;/li&gt;
&lt;/ul&gt;
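&lt;p&gt;As a minimal sketch of that client-credentials exchange: the Cognito domain, client ID, secret, scope, and the helper name below are placeholders (the real project resolves the secret from SSM), but the endpoint shape follows Cognito's standard &lt;code&gt;/oauth2/token&lt;/code&gt; flow.&lt;/p&gt;

```python
# Sketch of the M2M token exchange against Cognito's OAuth2 token endpoint.
# Domain, client ID, secret, and scope are placeholders; the real project
# reads the client secret from SSM Parameter Store.
import base64
from urllib.parse import urlencode

def build_m2m_token_request(domain, client_id, client_secret, scope):
    """Return (url, headers, body) for a client_credentials token request."""
    url = f"https://{domain}/oauth2/token"
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return url, headers, body

url, headers, body = build_m2m_token_request(
    "my-pool.auth.us-east-1.amazoncognito.com",  # placeholder domain
    "example-client-id", "example-secret", "gateway/invoke")
```

The returned access token would then be cached and attached as a Bearer header on every Gateway call.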

&lt;p&gt;Working with Cognito was &lt;strong&gt;more complicated than I expected&lt;/strong&gt; — configuring two different clients, getting the OAuth flows right, and debugging token scopes took several iterations. But it was a valuable learning experience in production authentication patterns.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. &lt;strong&gt;The Agent's Brain: AgentCore Runtime + Memory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The agent runs on &lt;strong&gt;AgentCore Runtime&lt;/strong&gt;, a fully managed, serverless platform for hosting AI agents, with auto-scaling built in and no servers to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Integration&lt;/strong&gt; is what makes this agent truly conversational:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomerRetentionMemoryHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-agent-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="c1"&gt;# Maps to customer in dataset
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three memory strategies work together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;USER_PREFERENCE&lt;/strong&gt;: Stores explicit preferences ("I prefer email contact")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEMANTIC&lt;/strong&gt;: Vector-based semantic memory for conversation context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SUMMARIZATION&lt;/strong&gt;: Condensed conversation summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means if a customer says "My customer ID is 3916-NRPAP" in one session, the agent remembers it in future conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Tools Layer: Lambda Functions via Gateway&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I created three Lambda functions, each with a specific purpose:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Churn Data Query Lambda:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Queries Athena with SQL
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT customerid, churn_risk_score, tenure, contract, monthlycharges 
FROM telco_augmented_vw 
WHERE customerid = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hits Amazon Athena (the data from my previous pipeline project!)&lt;/li&gt;
&lt;li&gt;Returns customer profile, churn risk score, usage patterns&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;cancel_intent&lt;/code&gt; field as our "synthetic churn model" — no separate ML training needed&lt;/li&gt;
&lt;/ul&gt;
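&lt;p&gt;One caveat with the snippet above: the f-string interpolates &lt;code&gt;customer_id&lt;/code&gt; straight into SQL, which is fine for a demo but injectable if the ID ever comes from user input. Athena also supports parameterized queries; here is a hedged sketch of the &lt;code&gt;start_query_execution&lt;/code&gt; arguments (the database name, output bucket, and builder function are assumptions, not the project's actual names):&lt;/p&gt;

```python
# Sketch: a parameterized Athena query instead of f-string interpolation.
# Database name and S3 output location are placeholders.
def build_churn_query_request(customer_id):
    query = (
        "SELECT customerid, churn_risk_score, tenure, contract, monthlycharges "
        "FROM telco_augmented_vw WHERE customerid = ?"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": "telco_churn_db"},          # placeholder
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},  # placeholder
        # String parameters are passed single-quoted per the Athena API.
        "ExecutionParameters": [f"'{customer_id}'"],
    }

# In the Lambda you would then call:
#   boto3.client("athena").start_query_execution(**build_churn_query_request("3916-NRPAP"))
req = build_churn_query_request("3916-NRPAP")
```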

&lt;p&gt;&lt;strong&gt;Retention Offer Lambda:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates personalized discount codes based on risk level&lt;/li&gt;
&lt;li&gt;High risk (&amp;gt;70%): 20-30% off for 3 months (code: &lt;code&gt;SAVE25&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Medium risk (40-70%): 15-25% off for 2 months&lt;/li&gt;
&lt;li&gt;Low risk (&amp;lt;40%): Service upgrades and add-ons&lt;/li&gt;
&lt;/ul&gt;
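&lt;p&gt;The tier mapping above can be sketched as plain Python. The thresholds and the &lt;code&gt;SAVE25&lt;/code&gt; code come from the post; the other codes, the function name, the boundary handling, and the return shape are illustrative.&lt;/p&gt;

```python
# Illustrative version of the tiering logic described above.
# Only the thresholds and SAVE25 come from the post; the rest is a sketch.
def pick_retention_offer(risk_score):
    """Map a churn risk score (0.0 to 1.0) to an offer tier."""
    if risk_score >= 0.7:   # high risk
        return {"tier": "high", "discount_pct": 25, "months": 3, "code": "SAVE25"}
    if risk_score >= 0.4:   # medium risk
        return {"tier": "medium", "discount_pct": 20, "months": 2, "code": "SAVE20"}
    # low risk: offer upgrades/add-ons rather than a discount
    return {"tier": "low", "discount_pct": 0, "months": 0, "code": "UPGRADE"}
```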

&lt;p&gt;&lt;strong&gt;Web Search Lambda:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DuckDuckGo API for real-time information&lt;/li&gt;
&lt;li&gt;Helps agent answer general retention strategy questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Internal Tool: Product Catalog
&lt;/h3&gt;

&lt;p&gt;In addition to the three external Lambda functions, the agent has an &lt;strong&gt;internal tool&lt;/strong&gt; that runs directly inside the AgentCore Runtime, with no external API call involved. The &lt;code&gt;get_product_catalog()&lt;/code&gt; tool returns information about available telecom plans, pricing, add-on services, and retention offers, so the agent can answer questions like "What plans do you offer?" or "Tell me about your premium features" immediately. Keeping this tool in-process means lower latency for these common queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_product_catalog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get information about available telecom plans and services.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Returns plan details, pricing, features, and retention offers
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;formatted_catalog_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This demonstrates a key architectural pattern: &lt;strong&gt;use internal tools for static/reference data that doesn't require external systems&lt;/strong&gt;, and use external tools (via Gateway) for dynamic data queries or actions that need database access.&lt;/p&gt;

&lt;p&gt;The three Lambda functions are exposed through &lt;strong&gt;AgentCore Gateway&lt;/strong&gt; using &lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt;. The Gateway handles authentication, request routing, and response formatting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoba3266zlbkbr5drpnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoba3266zlbkbr5drpnp.png" alt="Gateway Architecture" width="800" height="741"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;The Autonomous Reasoning Flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's what happens when a customer asks: &lt;em&gt;"Can you give me a discount code?"&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent Receives Request&lt;/strong&gt;: Claude reads the prompt and system instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Making&lt;/strong&gt;: Agent decides it needs customer churn data first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Call #1&lt;/strong&gt;: Calls &lt;code&gt;churn_data_query&lt;/code&gt; via Gateway → Lambda → Athena&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Analysis&lt;/strong&gt;: Receives churn risk score (e.g., 85% — HIGH risk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Making&lt;/strong&gt;: Agent decides to generate retention offer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Call #2&lt;/strong&gt;: Calls &lt;code&gt;retention_offer&lt;/code&gt; with customer data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offer Generation&lt;/strong&gt;: Lambda generates &lt;code&gt;SAVE25&lt;/code&gt; discount code (25% off)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt;: Agent synthesizes natural response with discount code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent makes all these decisions autonomously — I didn't hardcode the workflow. The system prompt guides the agent, but Claude decides when and how to use tools.&lt;/p&gt;
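&lt;p&gt;Stripped of the LLM, the two tool calls compose like this minimal illustration. The stub functions stand in for the Gateway-backed Lambdas, and in the real system Claude decides to chain them itself rather than following hardcoded control flow.&lt;/p&gt;

```python
# Minimal illustration of the two-step flow: fetch churn data, then feed it
# to the offer tool. Stubs stand in for the Gateway-backed Lambdas; in the
# real system the LLM chooses these calls, they are not hardcoded.
def churn_data_query(customer_id):
    return {"customer_id": customer_id, "churn_risk_score": 0.85}  # stub

def retention_offer(churn_data):
    code = "SAVE25" if churn_data["churn_risk_score"] >= 0.7 else "SAVE15"
    return {"code": code}

def handle_discount_request(customer_id):
    data = churn_data_query(customer_id)   # Tool Call #1
    offer = retention_offer(data)          # Tool Call #2
    return f"You qualify for code {offer['code']}."

reply = handle_discount_request("3916-NRPAP")
```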

&lt;h3&gt;
  
  
  6. &lt;strong&gt;RAG with Bedrock Knowledge Base&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Knowledge Base stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company policies&lt;/li&gt;
&lt;li&gt;Troubleshooting guides&lt;/li&gt;
&lt;li&gt;FAQ documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RAG Flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Agent → Knowledge Base → Retrieved Context → Enhanced Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;strong&gt;Amazon Titan Embeddings&lt;/strong&gt;, documents get vectorized for semantic search. When a customer asks about policies, the agent retrieves relevant sections and includes them in the response.&lt;/p&gt;
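&lt;p&gt;As a sketch, a Knowledge Base lookup boils down to one &lt;code&gt;retrieve&lt;/code&gt; call against the &lt;code&gt;bedrock-agent-runtime&lt;/code&gt; client. The knowledge base ID and helper name below are placeholders; the request shape follows the Retrieve API.&lt;/p&gt;

```python
# Sketch of a Knowledge Base lookup. The KB ID is a placeholder; the real
# call would be boto3.client("bedrock-agent-runtime").retrieve(**req).
def build_kb_retrieve_request(query, kb_id="EXAMPLEKBID"):
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": 3}
        },
    }

req = build_kb_retrieve_request("What is the refund policy?")
```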

&lt;h3&gt;
  
  
  7. &lt;strong&gt;Data Connection: From Previous Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The customer data comes from my &lt;a href="https://dev.to/ajithmanmu/how-i-built-a-serverless-data-analytics-pipeline-for-customer-churn-with-s3-glue-athena-and-bfk"&gt;previous serverless pipeline project&lt;/a&gt;. That pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingested the Kaggle Telco dataset&lt;/li&gt;
&lt;li&gt;Converted CSV to Parquet with Glue ETL&lt;/li&gt;
&lt;li&gt;Partitioned data in S3&lt;/li&gt;
&lt;li&gt;Made it queryable via Athena&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This agent project is the &lt;strong&gt;natural next step&lt;/strong&gt; — taking that clean, query-ready data and making it accessible through conversational AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Technical Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why AgentCore Over DIY?
&lt;/h3&gt;

&lt;p&gt;I could have built this with raw Lambda functions and LangChain, but AgentCore provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built-in Memory&lt;/strong&gt;: No need to build my own vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway with MCP&lt;/strong&gt;: Standardized protocol for tool integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Runtime&lt;/strong&gt;: No ECS clusters or container management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: CloudWatch integration out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Dual Cognito Architecture?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Separates user authentication from agent-to-service authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: M2M tokens can be cached and reused&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practice&lt;/strong&gt;: Follows OAuth 2.0 patterns for service-to-service communication&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Synthetic Churn Model?
&lt;/h3&gt;

&lt;p&gt;The dataset includes a &lt;code&gt;cancel_intent&lt;/code&gt; field which acts as our "pretend ML model." For a hackathon demo, this works perfectly without needing to train and deploy a separate ML model. In production, you'd integrate with SageMaker for real churn predictions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;Even for a hackathon project, I applied production security practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IAM Roles&lt;/strong&gt;: Least-privilege access for Lambda, Runtime, and Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JWT Authentication&lt;/strong&gt;: Secure token-based auth with Cognito&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSM Parameter Store&lt;/strong&gt;: All secrets and config stored securely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Encryption&lt;/strong&gt;: SSE-S3 for data at rest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private Lambda&lt;/strong&gt; (TODO): The current Lambdas run outside a VPC; production would place them in private subnets&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges &amp;amp; Learnings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Cognito Complexity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Setting up dual authentication was harder than expected. Key lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;USER_PASSWORD_AUTH flow must be explicitly enabled&lt;/li&gt;
&lt;li&gt;M2M clients need proper scopes configured&lt;/li&gt;
&lt;li&gt;Discovery URLs must be exact (&lt;code&gt;.well-known/openid-configuration&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Token decoding requires proper base64 padding&lt;/li&gt;
&lt;/ul&gt;
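&lt;p&gt;The padding issue in particular is easy to hit: JWT segments are base64url-encoded with the trailing &lt;code&gt;=&lt;/code&gt; padding stripped, so a naive decode raises an error. A small self-contained sketch (the toy token below is fabricated for illustration):&lt;/p&gt;

```python
# JWT payloads are base64url-encoded with padding stripped, so pad the
# segment back to a multiple of 4 before decoding.
import base64
import json

def decode_jwt_payload(token):
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Toy header.payload.signature token with payload {"sub": "abc-123"}
toy_payload = base64.urlsafe_b64encode(b'{"sub": "abc-123"}').decode().rstrip("=")
claims = decode_jwt_payload(f"eyJhbGciOiJSUzI1NiJ9.{toy_payload}.sig")
```

The `sub` claim recovered here is what gets mapped to the customer ID in the dataset.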

&lt;p&gt;Working with Cognito was more complicated than I anticipated, but it forced me to deeply understand OAuth 2.0 flows and JWT token structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Cold Start Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first request to the agent often timed out. Classic serverless cold start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AgentCore Runtime takes time to spin up&lt;/li&gt;
&lt;li&gt;Solution: Better error handling and retry logic&lt;/li&gt;
&lt;li&gt;Future: Consider provisioned concurrency for production&lt;/li&gt;
&lt;/ul&gt;
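&lt;p&gt;The retry logic can be as simple as a backoff wrapper around the invoke call. This is an illustrative sketch: the attempt count, delay schedule, and exception type are assumptions, not the project's exact code.&lt;/p&gt;

```python
# Illustrative retry wrapper for cold-start timeouts: retry the call with
# exponential backoff before giving up.
import time

def invoke_with_retry(invoke, attempts=3, base_delay=1.0):
    for i in range(attempts):
        try:
            return invoke()
        except TimeoutError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # 1s, 2s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:   # first call simulates a cold-start timeout
        raise TimeoutError
    return "ok"

result = invoke_with_retry(flaky, base_delay=0.01)
```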

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Multi-Step Tool Calling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Getting Claude to call &lt;code&gt;churn_data_query&lt;/code&gt; first, then pass that data to &lt;code&gt;retention_offer&lt;/code&gt; required explicit prompt engineering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
IMPORTANT: When customers ask for discount codes, you MUST:
1. First call the churn_data_query tool to get customer data
2. Then call the retention_offer tool with the complete churn_data
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: LLMs need very explicit instructions for sequential workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;SSM Parameter Store Permissions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The auto-created Runtime execution role didn't include SSM permissions. Quick fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ssm:GetParameter"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ssm:*:*:parameter/customer-retention-agent/*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: Always verify IAM permissions when integrating AWS services.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Local Development Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Testing locally before deploying was crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used &lt;code&gt;agentcore invoke --local&lt;/code&gt; to simulate Runtime&lt;/li&gt;
&lt;li&gt;Created automated test suite (&lt;code&gt;test_invoke_local.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Tested with real AWS services (Lambda, Athena, Memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: Local-first development saves time and AWS costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. &lt;strong&gt;On-Demand Throughput Not Supported&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I discovered that not all Bedrock models support on-demand throughput; some can only be invoked through an inference profile, so I had to adjust my model selection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: Read the AWS documentation carefully for service limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. &lt;strong&gt;Boto3 Sessions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Lambda functions need proper boto3 session management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;athena_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;athena&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: Always specify region explicitly in Lambda functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Technical:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AgentCore primitives (Runtime, Gateway, Memory) work incredibly well together&lt;/li&gt;
&lt;li&gt;MCP protocol standardizes tool integration&lt;/li&gt;
&lt;li&gt;Memory strategies: USER_PREFERENCE for explicit data, SEMANTIC for context&lt;/li&gt;
&lt;li&gt;JWT token structure and OAuth 2.0 flows&lt;/li&gt;
&lt;li&gt;RAG implementation with Bedrock Knowledge Base&lt;/li&gt;
&lt;li&gt;Serverless cold starts are real — plan accordingly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architectural:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dual authentication is complex but necessary for production systems&lt;/li&gt;
&lt;li&gt;Tool design matters: focused, single-responsibility functions compose well&lt;/li&gt;
&lt;li&gt;Explicit prompt engineering is crucial for multi-step workflows&lt;/li&gt;
&lt;li&gt;Local testing infrastructure saves time and money&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic data (like &lt;code&gt;cancel_intent&lt;/code&gt;) works great for demos&lt;/li&gt;
&lt;li&gt;Previous data pipeline projects can be extended with AI layers&lt;/li&gt;
&lt;li&gt;Parquet + Athena = fast, cost-effective queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;If I continue this project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Security Enhancements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make Lambdas private&lt;/li&gt;
&lt;li&gt;Set up VPC and subnets&lt;/li&gt;
&lt;li&gt;Add Web Application Firewall (WAF)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Responsible AI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content moderation with Bedrock Guardrails&lt;/li&gt;
&lt;li&gt;Human oversight for high-value offers&lt;/li&gt;
&lt;li&gt;Policy checks before generating discounts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Production Features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time alerts when high-risk customers detected&lt;/li&gt;
&lt;li&gt;A/B testing for retention strategies&lt;/li&gt;
&lt;li&gt;Analytics dashboard for offer effectiveness&lt;/li&gt;
&lt;li&gt;Sentiment analysis for conversation tone&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to Confluence for live policy updates (Bedrock KB supports this!)&lt;/li&gt;
&lt;li&gt;Integrate with CRM (Salesforce/HubSpot)&lt;/li&gt;
&lt;li&gt;Multi-channel support (SMS, email, phone)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building the Customer Retention Agent taught me that autonomous AI agents are production-ready today. With AWS Bedrock AgentCore, I went from idea to working demo faster than expected.&lt;/p&gt;

&lt;p&gt;The hardest parts weren't the AI — they were the authentication, cold starts, and getting all the AWS services to work together. But that's the reality of building production systems.&lt;/p&gt;

&lt;p&gt;This project is a natural continuation of my data pipeline work. The pipeline gave me clean data in Athena; the agent makes that data actionable through conversation. Together, they demonstrate how serverless + AI can solve real business problems.&lt;/p&gt;

&lt;p&gt;Key takeaway: &lt;strong&gt;Modern cloud platforms make it possible to build sophisticated AI agents without managing infrastructure.&lt;/strong&gt; The future of customer service is autonomous, personalized, and conversational.&lt;/p&gt;

&lt;p&gt;Thanks to AWS and Devpost for hosting the AI Agent Global Hackathon, and to the AWS team for building AgentCore. Building with these tools has been an incredible learning experience! 🚀&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/ajithmanmu/customer-retention-agent" rel="noopener noreferrer"&gt;https://github.com/ajithmanmu/customer-retention-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demo Video&lt;/strong&gt;: &lt;a href="https://www.youtube.com/watch?v=nt2-iE_qBIw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=nt2-iE_qBIw&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS AI Agent Hackathon&lt;/strong&gt;: &lt;a href="https://devpost.com/software/customer-retention-agent?ref_content=user-portfolio&amp;amp;ref_feature=in_progress" rel="noopener noreferrer"&gt;https://devpost.com/software/customer-retention-agent?ref_content=user-portfolio&amp;amp;ref_feature=in_progress&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Previous Project&lt;/strong&gt;: &lt;a href="https://dev.to/ajithmanmu/how-i-built-a-serverless-data-analytics-pipeline-for-customer-churn-with-s3-glue-athena-and-bfk"&gt;Serverless Data Pipeline&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock AgentCore Docs&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;https://aws.amazon.com/bedrock/agentcore/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Samples &amp;amp; Tutorials&lt;/strong&gt;: &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples" rel="noopener noreferrer"&gt;https://github.com/awslabs/amazon-bedrock-agentcore-samples&lt;/a&gt; (Highly recommended for learning AgentCore!)&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>ai</category>
      <category>agents</category>
      <category>portfolio</category>
    </item>
    <item>
      <title>How I Built a Serverless Data Analytics Pipeline for Customer Churn with S3, Glue, Athena, and QuickSight</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Sat, 20 Sep 2025 16:53:32 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/how-i-built-a-serverless-data-analytics-pipeline-for-customer-churn-with-s3-glue-athena-and-bfk</link>
      <guid>https://dev.to/ajithmanmu/how-i-built-a-serverless-data-analytics-pipeline-for-customer-churn-with-s3-glue-athena-and-bfk</guid>
      <description>&lt;p&gt;I wanted to explore how AWS services can be combined into a simple data pipeline that not only processes customer churn data, but also highlights the kind of insights companies rely on to drive retention and revenue growth.&lt;/p&gt;

&lt;p&gt;For this project, I used the &lt;strong&gt;Telco Customer Churn dataset&lt;/strong&gt; from Kaggle. The goal was to take raw CSV data, process it into a query-optimized format, and power dashboards that surface churn KPIs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Here’s the high-level design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96r8id8cjjidci1tei34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96r8id8cjjidci1tei34.png" alt="architecture" width="800" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3&lt;/strong&gt; — Stores raw Kaggle CSV input and processed Parquet output.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue&lt;/strong&gt; — Crawler to catalog schemas + &lt;strong&gt;ETL job to convert CSV into Parquet and partition the data&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Athena&lt;/strong&gt; — Runs SQL queries and views over the processed data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon QuickSight&lt;/strong&gt; — Dashboards to visualize churn KPIs like churn %, revenue loss, and segmentation.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EventBridge (optional)&lt;/strong&gt; — Triggers Glue ETL jobs on a schedule.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt; — Infrastructure as Code for reproducible setup.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the full implementation here: &lt;a href="https://github.com/ajithmanmu/aws-telco-churn-analytics" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Walkthrough
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion&lt;/strong&gt;
The raw Telco churn dataset was uploaded into an S3 bucket. To keep data organized, I added &lt;strong&gt;key prefixes&lt;/strong&gt; such as &lt;code&gt;ingest_date=YYYY-MM-DD/&lt;/code&gt;. This structure makes it easier for Glue Crawlers to detect and register new data.&lt;/li&gt;
&lt;/ol&gt;
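&lt;p&gt;As a concrete sketch, the upload step can build that prefix before writing to S3. This is a minimal illustration; the bucket layout and the boto3 call are assumptions, not taken from the repo:&lt;/p&gt;

```python
from datetime import date


def ingest_key(filename: str, ingest_date: date) -> str:
    """Build an S3 key with a Hive-style ingest_date= prefix so the Glue
    Crawler can register each day's drop as a new partition."""
    return f"raw/ingest_date={ingest_date.isoformat()}/{filename}"


# Hedged: the actual upload would use boto3, e.g.
#   boto3.client("s3").upload_file(local_path, BUCKET, ingest_key(...))
key = ingest_key("telco_churn.csv", date(2025, 1, 15))
print(key)  # raw/ingest_date=2025-01-15/telco_churn.csv
```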

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71qlbkfje7h9yfh7znth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71qlbkfje7h9yfh7znth.png" alt="s3" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Schema Discovery &amp;amp; ETL&lt;/strong&gt;
Glue Crawlers scanned the raw bucket and registered the schema in the Glue Data Catalog. A &lt;strong&gt;Glue ETL job then converted the CSV files into Parquet&lt;/strong&gt; and wrote the results to a processed S3 bucket with partitions. This format makes queries faster and more cost-efficient.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filqtpdce2t9pgp3ha3rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filqtpdce2t9pgp3ha3rx.png" alt="tables" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj16cf6ci0des6hu9gbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj16cf6ci0des6hu9gbk.png" alt="gluejob" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partitioning Strategy&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Partitioning turned out to be a critical design choice:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid high-cardinality keys that generate too many small files.
&lt;/li&gt;
&lt;li&gt;Place &lt;strong&gt;date partitions last&lt;/strong&gt; so queries can easily filter recent data.
&lt;/li&gt;
&lt;li&gt;Athena uses Hive-style partitioning, and partition keys are evaluated &lt;strong&gt;from left to right&lt;/strong&gt;, so ordering matters.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Athena Queries&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
With data processed and partitioned, Athena queries became much more efficient. I created views for:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall churn percentage
&lt;/li&gt;
&lt;li&gt;Churn by contract type (month-to-month vs annual)
&lt;/li&gt;
&lt;li&gt;Revenue lost from churners
&lt;/li&gt;
&lt;li&gt;Tenure vs churn patterns
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
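&lt;p&gt;The logic behind those views can be sketched in plain Python. The field names (&lt;code&gt;Churn&lt;/code&gt;, &lt;code&gt;MonthlyCharges&lt;/code&gt;) follow the Kaggle dataset; the real queries run as SQL views in Athena:&lt;/p&gt;

```python
def churn_kpis(rows):
    """Compute two of the KPIs the Athena views expose: overall churn %
    and monthly revenue lost to churners. `rows` are dicts keyed like the
    Telco CSV columns (Churn = "Yes"/"No", MonthlyCharges = float)."""
    total = len(rows)
    churned = [r for r in rows if r["Churn"] == "Yes"]
    churn_pct = 100.0 * len(churned) / total if total else 0.0
    revenue_lost = sum(r["MonthlyCharges"] for r in churned)
    return {"churn_pct": churn_pct, "revenue_lost": revenue_lost}


sample = [
    {"Churn": "Yes", "MonthlyCharges": 70.0},
    {"Churn": "No", "MonthlyCharges": 50.0},
    {"Churn": "Yes", "MonthlyCharges": 30.0},
    {"Churn": "No", "MonthlyCharges": 20.0},
]
print(churn_kpis(sample))  # {'churn_pct': 50.0, 'revenue_lost': 100.0}
```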

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vija0fyn9mersva4zo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vija0fyn9mersva4zo.png" alt="athena" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbcgd0y41926317w4ry8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbcgd0y41926317w4ry8.png" alt="workflow" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;
QuickSight connected directly to Athena, enabling dashboards with filters and visuals for churn % by demographics, add-on services, and contract types. This provided clear insights into which customers were most at risk.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;Even though this was a demo project, I applied security best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM roles scoped with least privilege
&lt;/li&gt;
&lt;li&gt;S3 encryption (SSE-S3) for data at rest
&lt;/li&gt;
&lt;li&gt;Dedicated Glue and Athena execution roles
&lt;/li&gt;
&lt;li&gt;Restricted access to QuickSight dashboards
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This pipeline shows how AWS services can be combined to build a &lt;strong&gt;self-service analytics solution&lt;/strong&gt; with no servers to manage. Starting from raw CSVs, I was able to generate Parquet data, run queries in Athena, and visualize churn insights in QuickSight.&lt;/p&gt;

&lt;p&gt;The next step for me is extending the pipeline with &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;. By creating a Knowledge Base and Bedrock Agent, I’ll enable natural-language questions like &lt;em&gt;“What’s the churn rate for two-year contracts vs month-to-month?”&lt;/em&gt; and have the agent execute the Athena queries under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learnings
&lt;/h2&gt;

&lt;p&gt;Some of the key lessons from this build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding &lt;strong&gt;ingest_date prefixes&lt;/strong&gt; in S3 simplified partitioning and Glue Crawling.
&lt;/li&gt;
&lt;li&gt;Partitioning design is critical: avoid high-cardinality keys, put date last, and understand Hive’s left-to-right partition evaluation.
&lt;/li&gt;
&lt;li&gt;Encountered a &lt;code&gt;HIVE_BAD_DATA&lt;/code&gt; error, a good reminder that Athena relies on Hive-compatible table definitions and SerDes under the hood (flashback to Big Data classes!).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet format&lt;/strong&gt; greatly improved query speed and reduced cost.
&lt;/li&gt;
&lt;li&gt;Used &lt;strong&gt;Amazon Q Developer with the Diagram MCP server&lt;/strong&gt; to auto-generate the architecture diagram — which made documentation far easier.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>How I Built a Secure Serverless Orders Pipeline with Lambda, SNS, and SQS</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 12 Sep 2025 03:58:31 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/how-i-built-a-secure-serverless-orders-pipeline-with-lambda-sns-and-sqs-36ej</link>
      <guid>https://dev.to/ajithmanmu/how-i-built-a-secure-serverless-orders-pipeline-with-lambda-sns-and-sqs-36ej</guid>
      <description>&lt;p&gt;After finishing my 3-tier web app project on AWS, I wanted my next portfolio project to be something different — more &lt;strong&gt;serverless, event-driven, and decoupled&lt;/strong&gt;. I also wanted to test out the &lt;strong&gt;SQS fan-out architecture&lt;/strong&gt;, where a single event can trigger multiple downstream actions. And, just as important, I wanted to build it all with a strong &lt;strong&gt;security-first mindset&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built a &lt;strong&gt;Serverless Orders Pipeline&lt;/strong&gt;. Here’s how it works and what I learned along the way.&lt;/p&gt;




&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;At a high level, the system works like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;public ALB&lt;/strong&gt; accepts incoming requests (&lt;code&gt;POST /orders&lt;/code&gt;) and routes them to a &lt;strong&gt;LambdaPublisher&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;LambdaPublisher&lt;/strong&gt; validates the request and publishes it to an &lt;strong&gt;SNS topic&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;That SNS topic fans out to multiple &lt;strong&gt;SQS queues&lt;/strong&gt;: billing and archive.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consumer Lambdas&lt;/strong&gt; read from these queues and do their thing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Billing → write to DynamoDB&lt;/li&gt;
&lt;li&gt;Archive → store a JSON copy in S3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Everything runs inside a VPC, with &lt;strong&gt;public subnets for the ALB&lt;/strong&gt; and &lt;strong&gt;private subnets for the Lambdas&lt;/strong&gt;. Importantly, the Lambdas don’t have internet access — they only talk to AWS services through &lt;strong&gt;VPC endpoints&lt;/strong&gt;.&lt;/p&gt;
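&lt;p&gt;The publisher's core step (validate the order, then fan it out via SNS) can be sketched as follows. The field names and topic ARN are illustrative assumptions, not the repo's actual schema:&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = ("order_id", "customer_id", "amount")


def build_publish_params(order: dict, topic_arn: str) -> dict:
    """Validate an incoming order and shape the SNS Publish call that
    fans the event out to the billing and archive queues."""
    missing = [f for f in REQUIRED_FIELDS if f not in order]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(order),
        "MessageAttributes": {
            "event_type": {"DataType": "String", "StringValue": "order.created"}
        },
    }


# Hedged usage: boto3.client("sns").publish(**build_publish_params(order, TOPIC_ARN))
params = build_publish_params(
    {"order_id": "o-1", "customer_id": "c-1", "amount": 42.5},
    "arn:aws:sns:us-east-1:123456789012:orders",
)
```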




&lt;h3&gt;
  
  
  Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0op7quibyfexgtw38xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0op7quibyfexgtw38xx.png" alt="architecture" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This captures the fan-out pattern: one request → SNS → multiple queues → independent consumers.&lt;/p&gt;
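&lt;p&gt;One practical detail of this pattern: because SNS wraps the payload before it lands in SQS, each consumer Lambda has to unwrap two layers of JSON. A sketch, assuming the standard SNS-to-SQS envelope:&lt;/p&gt;

```python
import json


def extract_order(sqs_record: dict) -> dict:
    """SQS delivers the SNS envelope as the record body; the original
    order JSON sits in the envelope's Message field."""
    envelope = json.loads(sqs_record["body"])
    return json.loads(envelope["Message"])


# A consumer handler would loop over event["Records"] and, for example,
# put each extracted order into DynamoDB (billing) or S3 (archive).
record = {"body": json.dumps({"Type": "Notification",
                              "Message": json.dumps({"order_id": "o-1"})})}
print(extract_order(record))  # {'order_id': 'o-1'}
```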




&lt;h3&gt;
  
  
  Security First
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Authentication at the Publisher Lambda
&lt;/h4&gt;

&lt;p&gt;The first entry point into the system is the &lt;strong&gt;Publisher Lambda&lt;/strong&gt;, so I added a basic authentication layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incoming requests must include &lt;code&gt;X-Client-Id&lt;/code&gt; and &lt;code&gt;X-Signature&lt;/code&gt; headers.&lt;/li&gt;
&lt;li&gt;The Lambda checks these against a secret (stored as an environment variable for now, but could be moved to &lt;strong&gt;Secrets Manager&lt;/strong&gt; later).&lt;/li&gt;
&lt;li&gt;If the check fails → immediate &lt;code&gt;401 Unauthorized&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures only trusted clients can even publish into the pipeline.&lt;/p&gt;
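&lt;p&gt;One common way to implement such a check is an HMAC over the request body. This is a hypothetical sketch, not the repo's exact scheme:&lt;/p&gt;

```python
import hashlib
import hmac


def is_authorized(headers: dict, body: bytes, clients: dict) -> bool:
    """Return True only when X-Client-Id is a known client and
    X-Signature is a valid HMAC-SHA256 of the body under that
    client's shared secret."""
    secret = clients.get(headers.get("X-Client-Id"))
    if secret is None:
        return False
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the signature via timing differences
    return hmac.compare_digest(expected, headers.get("X-Signature", ""))


clients = {"shop-frontend": "s3cret"}
body = b'{"order_id": "o-1"}'
sig = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
assert is_authorized({"X-Client-Id": "shop-frontend", "X-Signature": sig}, body, clients)
assert not is_authorized({"X-Client-Id": "unknown", "X-Signature": sig}, body, clients)
```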




&lt;h4&gt;
  
  
  IAM Roles
&lt;/h4&gt;

&lt;p&gt;Each Lambda got its own execution role, with the bare minimum permissions. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Publisher: just &lt;code&gt;sns:Publish&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Billing: &lt;code&gt;sqs:ReceiveMessage/DeleteMessage&lt;/code&gt; + &lt;code&gt;dynamodb:PutItem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Archive: &lt;code&gt;sqs:ReceiveMessage/DeleteMessage&lt;/code&gt; + &lt;code&gt;s3:PutObject&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No shared “super-role” — each function is tightly scoped.&lt;/p&gt;
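&lt;p&gt;The billing role's inline policy, for instance, can be this small (ARNs are placeholders):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:billing-queue"
    },
    {
      "Effect": "Allow",
      "Action": "dynamodb:PutItem",
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
    }
  ]
}
```

&lt;p&gt;Note that Lambda's SQS event source mapping also needs &lt;code&gt;sqs:GetQueueAttributes&lt;/code&gt;, which is easy to miss when scoping roles down.&lt;/p&gt;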




&lt;h4&gt;
  
  
  Resource Policies
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;SQS queue&lt;/strong&gt; is locked down so it only accepts messages from the SNS topic.&lt;/li&gt;
&lt;li&gt;Optionally, you can go a step further and tie resources to a &lt;strong&gt;specific VPC endpoint&lt;/strong&gt; using conditions like &lt;code&gt;aws:SourceVpce&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents direct access from outside the system.&lt;/p&gt;
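&lt;p&gt;A queue policy enforcing the SNS-only rule looks roughly like this (ARNs are placeholders):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sns.amazonaws.com"},
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:123456789012:billing-queue",
      "Condition": {
        "ArnEquals": {"aws:SourceArn": "arn:aws:sns:us-east-1:123456789012:orders"}
      }
    }
  ]
}
```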




&lt;h4&gt;
  
  
  VPC and Subnets
&lt;/h4&gt;

&lt;p&gt;This one was a good learning moment for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambdas don’t need inbound rules.&lt;/strong&gt;&lt;br&gt;
The ALB doesn’t hit the Lambda over the network — it calls it through the AWS control plane.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, the Lambda’s security group matters only for &lt;strong&gt;outbound traffic&lt;/strong&gt; (e.g., when writing to DynamoDB, publishing to SNS, or sending logs).&lt;/p&gt;




&lt;h4&gt;
  
  
  VPC Endpoints
&lt;/h4&gt;

&lt;p&gt;Because my Lambdas don’t have internet access, I needed endpoints for them to reach AWS services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gateway endpoints&lt;/strong&gt;: S3, DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface endpoints&lt;/strong&gt;: SNS, SQS, CloudWatch Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, traffic stays private inside AWS. No NAT gateways, no public internet.&lt;/p&gt;




&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;Every Lambda writes to &lt;strong&gt;CloudWatch Logs&lt;/strong&gt;, and I set up some metrics/alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda errors&lt;/li&gt;
&lt;li&gt;Queue depth (important if consumers fall behind)&lt;/li&gt;
&lt;li&gt;DLQ depth&lt;/li&gt;
&lt;li&gt;ALB 5XXs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not fancy, but it gives enough visibility to know if something’s going wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda SGs are different&lt;/strong&gt; → no inbound rules needed; outbound is what matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform zip packaging&lt;/strong&gt; → I had to get comfortable with packaging functions cleanly in Terraform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-first thinking&lt;/strong&gt; → IAM roles, queue policies, endpoint restrictions, and even simple client auth at the Publisher Lambda baked in from the start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling really works&lt;/strong&gt; → each consumer Lambda is independent. If one fails, the others keep working fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven scaling is nice&lt;/strong&gt; → SQS + Lambda handles bursts way better than a traditional setup.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Terraform Implementation
&lt;/h3&gt;

&lt;p&gt;I also implemented the whole thing in &lt;strong&gt;Terraform&lt;/strong&gt;, splitting the code into multiple folders for clarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;infra/
  ├─ network/        # VPC, subnets, route tables, SGs, endpoints
  ├─ data/           # DynamoDB table + S3 archive bucket
  ├─ messaging/      # SNS topic, SQS queues, DLQs, policies
  ├─ iam/            # Lambda execution roles + inline policies
  ├─ compute/        # Lambda functions (publisher + consumers) + event source mappings
  ├─ frontend/       # ALB, target group, listener rules
  └─ observability/  # CloudWatch alarms for SQS/Lambda/ALB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s the order in which I applied them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;network&lt;/strong&gt; → get the VPC and endpoints in place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;data&lt;/strong&gt; → DynamoDB and S3 storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;messaging&lt;/strong&gt; → SNS, SQS, and their policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iam&lt;/strong&gt; → Lambda execution roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;compute&lt;/strong&gt; → Lambdas + event source mappings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;frontend&lt;/strong&gt; → ALB and listener rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;observability&lt;/strong&gt; → monitoring and alarms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This folder-based approach made it easier to build one layer at a time and keep things manageable.&lt;/p&gt;




&lt;p&gt;You can check out the full Terraform code and project details on my GitHub: &lt;a href="https://github.com/ajithmanmu/serverless-orders-pipeline" rel="noopener noreferrer"&gt;serverless-orders-pipeline&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Wrapping Up
&lt;/h3&gt;

&lt;p&gt;This project felt like the natural next step after the 3-tier web app. Instead of servers and RDS, I worked with &lt;strong&gt;Lambdas, queues, and private networking&lt;/strong&gt;. Adding authentication at the Publisher Lambda gave me an extra layer of control, and Terraform helped keep the setup reproducible and organized.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>buildinpublic</category>
      <category>cloud</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Hands-On with AWS: Building and Securing a 3-Tier Web App</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 29 Aug 2025 23:01:00 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/hands-on-with-aws-building-and-securing-a-3-tier-web-app-1fjb</link>
      <guid>https://dev.to/ajithmanmu/hands-on-with-aws-building-and-securing-a-3-tier-web-app-1fjb</guid>
      <description>&lt;h2&gt;
  
  
  Building a Secure 3-Tier Application on AWS
&lt;/h2&gt;

&lt;p&gt;I recently worked on a portfolio project where I built a &lt;strong&gt;3-tier application on AWS&lt;/strong&gt;. My goal wasn’t only to get the app running, but also to design it with &lt;strong&gt;security and best practices in mind&lt;/strong&gt;, and then migrate everything into &lt;strong&gt;Terraform&lt;/strong&gt; so it’s reproducible.&lt;/p&gt;

&lt;p&gt;👉 Full source code and Terraform setup: &lt;a href="https://github.com/ajithmanmu/three-tier-architecture-aws" rel="noopener noreferrer"&gt;three-tier-architecture-aws&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The setup follows the classic &lt;strong&gt;3-tier architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: A React app served by Nginx on EC2, behind a public ALB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: A FastAPI app running with Uvicorn on EC2, behind an internal ALB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: Amazon RDS PostgreSQL in private subnets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the frontend ALB is public — everything else runs in private subnets. Configuration values like the backend ALB DNS and database connection string are securely injected at runtime using &lt;strong&gt;AWS SSM Parameter Store&lt;/strong&gt; and &lt;strong&gt;Secrets Manager&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security Focus
&lt;/h2&gt;

&lt;p&gt;From the start, I set up the application with &lt;strong&gt;least-privilege principles&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No public IPs on app or DB servers — only the ALB is exposed.&lt;/li&gt;
&lt;li&gt;Security Groups allow traffic only along the intended path (ALB → Frontend → Backend → RDS).&lt;/li&gt;
&lt;li&gt;IAM roles are locked down so instances can only read what they need.&lt;/li&gt;
&lt;li&gt;AMIs are kept generic; user data injects environment-specific config at boot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the environment is both secure and flexible.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnw350a9gongnid5lzt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnw350a9gongnid5lzt2.png" alt=" " width="800" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvdchd9xap5f92z4obyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvdchd9xap5f92z4obyz.png" alt=" " width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Building AMIs with Setup Scripts
&lt;/h2&gt;

&lt;p&gt;A key part of this project was &lt;strong&gt;baking AMIs&lt;/strong&gt;. Instead of installing everything during auto-scaling launches, I ran the setup scripts on &lt;strong&gt;temporary builder EC2 instances in public subnets&lt;/strong&gt;. Once the app was installed and tested, I created an AMI from that instance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;frontend&lt;/strong&gt;, I launched a temporary EC2, ran the React + Nginx setup script, and created a frontend AMI.&lt;/li&gt;
&lt;li&gt;For the &lt;strong&gt;backend&lt;/strong&gt;, I did the same: launched a builder EC2, installed FastAPI + dependencies, configured systemd, and created a backend AMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These AMIs were then used in &lt;strong&gt;Launch Templates + Auto Scaling Groups&lt;/strong&gt;, with user data scripts wiring environment-specific details at boot.&lt;/p&gt;




&lt;h3&gt;
  
  
  Frontend Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf update &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx git
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nginx

curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
&lt;span class="nb"&gt;.&lt;/span&gt; ~/.nvm/nvm.sh
nvm &lt;span class="nb"&gt;install &lt;/span&gt;20

git clone https://github.com/ajithmanmu/three-tier-architecture-aws.git
&lt;span class="nb"&gt;cd &lt;/span&gt;three-tier-architecture-aws/app
npm ci &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build

&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /usr/share/nginx/html/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; out/&lt;span class="k"&gt;*&lt;/span&gt; /usr/share/nginx/html/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nginx config snippet (&lt;code&gt;/etc/nginx/nginx.conf&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://__BACKEND_INTERNAL_ALB__&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frontend user data script fetches the backend ALB DNS from SSM and rewrites the config at boot.&lt;/p&gt;
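&lt;p&gt;The substitution step itself is simple. A sketch in Python for clarity; the real user data script would fetch the value with the AWS CLI or boto3 and could just as well use &lt;code&gt;sed&lt;/code&gt;:&lt;/p&gt;

```python
def render_nginx_config(template: str, backend_dns: str) -> str:
    """Replace the __BACKEND_INTERNAL_ALB__ placeholder baked into the
    AMI's nginx.conf with the internal ALB DNS name read from SSM."""
    return template.replace("__BACKEND_INTERNAL_ALB__", backend_dns)


template = "proxy_pass http://__BACKEND_INTERNAL_ALB__;"
print(render_nginx_config(template, "internal-backend-123.us-east-1.elb.amazonaws.com"))
# proxy_pass http://internal-backend-123.us-east-1.elb.amazonaws.com;
```

&lt;p&gt;Keeping the AMI generic and injecting this value at boot is what lets the same image serve any environment.&lt;/p&gt;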




&lt;h3&gt;
  
  
  Backend Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf update &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3.11 python3.11-pip git

&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo chown &lt;/span&gt;ec2-user:ec2-user /opt/app
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/app
python3.11 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

git clone https://github.com/ajithmanmu/three-tier-architecture-aws.git src
&lt;span class="nb"&gt;cd &lt;/span&gt;src/backend
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; /opt/app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend is wired to a systemd service running Uvicorn. At boot, a user data script pulls the DB connection string from SSM and writes it into &lt;code&gt;/etc/app.env&lt;/code&gt; before starting the app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Along the Way
&lt;/h2&gt;

&lt;p&gt;This wasn’t all smooth sailing. A few things I had to troubleshoot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Networking&lt;/strong&gt;: With 12 subnets and multiple route tables, I initially struggled to get NAT and IGW routing right. Debugging outbound access from private subnets was a key learning moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend 404s&lt;/strong&gt;: The frontend served fine, but API calls failed until I realized Nginx needed the backend ALB DNS injected dynamically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Management&lt;/strong&gt;: At first I hardcoded DB creds. Moving them into Secrets Manager and pulling them at runtime made the setup much cleaner and safer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform Migration&lt;/strong&gt;: Rebuilding everything as code was tedious, but it forced me to understand the resource dependencies and gave me a reproducible setup.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Some natural next steps to build on this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;ACM + HTTPS&lt;/strong&gt; for the frontend ALB.&lt;/li&gt;
&lt;li&gt;Configure &lt;strong&gt;CloudWatch logs and alarms&lt;/strong&gt; for monitoring and alerting.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;S3 + CloudFront&lt;/strong&gt; for hosting assets (like images), while continuing to serve the frontend itself from EC2.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;👉 Full repo: &lt;a href="https://github.com/ajithmanmu/three-tier-architecture-aws" rel="noopener noreferrer"&gt;three-tier-architecture-aws&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>buildinpublic</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Solving Flood Fill - LeetCode problem</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 15 Aug 2025 15:13:53 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/solving-flood-fill-leetcode-problem-hdj</link>
      <guid>https://dev.to/ajithmanmu/solving-flood-fill-leetcode-problem-hdj</guid>
      <description>&lt;p&gt;Originally published on my &lt;a href="https://ajithmanmu.hashnode.dev/solving-flood-fill-leetcode-problem" rel="noopener noreferrer"&gt;Hashnode blog&lt;/a&gt; — cross-posted here for the Dev.to community.&lt;/p&gt;

&lt;p&gt;In this problem, we delve into the Flood Fill algorithm, which plays a crucial role in tracing bounded areas with the same color. This algorithm finds applications in various real-world scenarios, such as the bucket-filling tool in painting software and the Minesweeper game.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://leetcode.com/problems/flood-fill/description/" rel="noopener noreferrer"&gt;https://leetcode.com/problems/flood-fill/description/&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Problem Description
&lt;/h5&gt;

&lt;p&gt;Given a 2D matrix, along with the indices of a source cell &lt;code&gt;mat[x][y]&lt;/code&gt; and a target color &lt;code&gt;C&lt;/code&gt;, the task is to color the region connected to the source cell with color &lt;code&gt;C&lt;/code&gt;. The key idea here is to view the matrix as an undirected graph and find an efficient way to traverse it. Importantly, the movement is restricted to adjacent cells in four directions (up, down, left, and right).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyjncc6eob1ixal8ntjd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyjncc6eob1ixal8ntjd.jpeg" alt="flood fill" width="613" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Breadth-First Search (BFS) Approach
&lt;/h5&gt;

&lt;p&gt;One way to solve this problem is to employ a Breadth-First Search (BFS) using a queue. Here's the step-by-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start by inserting the source cell into the queue and change its color to &lt;code&gt;C&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While the queue is not empty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pop the next element from the queue.&lt;/li&gt;
&lt;li&gt;Change the color of the current cell to &lt;code&gt;C&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Calculate the coordinates of the neighboring cells in all four directions.&lt;/li&gt;
&lt;li&gt;If any neighboring cell has the same color, insert it into the queue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
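&lt;p&gt;Those steps translate directly to code. A compact version in Python for brevity (the post's own DFS implementation further down is JavaScript):&lt;/p&gt;

```python
from collections import deque


def flood_fill(image, sr, sc, color):
    """BFS flood fill: recolor the 4-connected region containing (sr, sc)."""
    rows, cols = len(image), len(image[0])
    start = image[sr][sc]
    if start == color:          # already the target color; avoids an infinite loop
        return image
    queue = deque([(sr, sc)])
    image[sr][sc] = color
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < rows and 0 <= ny < cols and image[nx][ny] == start:
                image[nx][ny] = color
                queue.append((nx, ny))
    return image


grid = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 1]]
print(flood_fill(grid, 1, 1, 2))  # [[2, 2, 2], [2, 2, 0], [2, 0, 1]]
```

&lt;p&gt;Note the early return when the source already has color &lt;code&gt;C&lt;/code&gt;; without it, every visited cell keeps matching the "same color" test forever.&lt;/p&gt;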

&lt;p&gt;&lt;strong&gt;Time Complexity (TC):&lt;/strong&gt; O(N*M) where N and M are the dimensions of the matrix.&lt;/p&gt;

&lt;h5&gt;
  
  
  Depth-First Search (DFS) Approach
&lt;/h5&gt;

&lt;p&gt;Alternatively, you can implement the Depth-First Search (DFS) approach, which uses recursion:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Begin by changing the color of the source cell to &lt;code&gt;C&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the coordinates of the neighboring cells in all four directions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If any neighboring cell has the same color, recursively call the function on that cell until the base case is satisfied.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Time Complexity (TC):&lt;/strong&gt; O(N*M) &lt;strong&gt;Space Complexity (SC):&lt;/strong&gt; O(N*M)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
 &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isValid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;


&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;colorCell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;current_color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;current_color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="nf"&gt;colorCell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;current_color&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;floodFill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;colorCell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both approaches are effective, and your choice may depend on the specific requirements and constraints of the problem you are solving.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>algorithms</category>
      <category>leetcode</category>
      <category>learning</category>
    </item>
    <item>
      <title>Solving Balanced Binary Tree - Leetcode problem</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 15 Aug 2025 15:09:25 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/solving-balanced-binary-tree-leetcode-problem-d0b</link>
      <guid>https://dev.to/ajithmanmu/solving-balanced-binary-tree-leetcode-problem-d0b</guid>
      <description>&lt;p&gt;Originally published on my &lt;a href="https://ajithmanmu.hashnode.dev/solving-balanced-binary-tree-leetcode-problem" rel="noopener noreferrer"&gt;Hashnode blog&lt;/a&gt; — cross-posted here for the Dev.to community.&lt;/p&gt;

&lt;p&gt;In the cinematic adaptation of this challenge, we find ourselves on an intriguing quest to determine the balance of a mystical binary tree. Our mission is to unveil the tree's equilibrium, where the difference between the heights of the Left Subtree (LST) and Right Subtree (RST) is no more than 1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://leetcode.com/problems/balanced-binary-tree/" rel="noopener noreferrer"&gt;https://leetcode.com/problems/balanced-binary-tree/&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LST_Height&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;RST_Height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;   &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Tree&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;balanced&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our journey begins with a postorder traversal, an exploration strategy suited for our enigmatic task. During our odyssey, we meticulously calculate the height of each node, a vital piece of information. The height of a node, we discern, is the grander of the heights of its LST and RST.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Height&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LST_Height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;RST_Height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we delve deeper into the forest of nodes, we scrutinize the height difference, ensuring it remains within the confines of balance.&lt;/p&gt;

&lt;p&gt;The narrative unfolds with an assumption that the tree, like any good story, is inherently balanced. Yet, our vigilant traversal harbors the power to unveil any imbalances lurking in the shadows. Should the height conditions falter, a revelation of imbalance shatters our illusion, and we exit our quest immediately, having uncovered the truth of this captivating binary tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Definition for a binary tree node.
 * function TreeNode(val, left, right) {
 *     this.val = (val===undefined ? 0 : val)
 *     this.left = (left===undefined ? null : left)
 *     this.right = (right===undefined ? null : right)
 * }
 */&lt;/span&gt;
&lt;span class="cm"&gt;/**
 * @param {TreeNode} root
 * @return {boolean}
 */&lt;/span&gt;

&lt;span class="cm"&gt;/*
    TC: O(N)
    SC: O(Height of the tree)
*/&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postorder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;leftheight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;postorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;rightheight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;postorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;right&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="cm"&gt;/* 
        The idea is that we assume the tree is balanced by default.
        If the height condition fails during the traversal, we mark it as unbalanced and exit immediately.
    |Height of LST - Height of RST| &amp;lt;= 1 --&amp;gt; height balanced
    */&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;leftheight&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;rightheight&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not balanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;leftheight&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not balanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;rightheight&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not balanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not balanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;leftheight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rightheight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;isBalanced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;postorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// console.log({res})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;not balanced&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datastructures</category>
      <category>algorithms</category>
      <category>learning</category>
      <category>leetcode</category>
    </item>
    <item>
      <title>Solving the JavaScript Execution Order Challenge</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 15 Aug 2025 15:05:46 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/solving-the-javascript-execution-order-challenge-4adp</link>
      <guid>https://dev.to/ajithmanmu/solving-the-javascript-execution-order-challenge-4adp</guid>
      <description>&lt;p&gt;Originally published on my &lt;a href="https://ajithmanmu.hashnode.dev/solving-the-javascript-execution-order-challenge" rel="noopener noreferrer"&gt;Hashnode blog&lt;/a&gt; — cross-posted here for the Dev.to community.&lt;/p&gt;

&lt;p&gt;In JavaScript, managing the order of function execution can be tricky, especially when dealing with asynchronous operations like &lt;code&gt;setTimeout&lt;/code&gt;. Recently, I encountered an interesting problem that required executing a series of functions in a specific order. Let's explore the problem and a simple solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem
&lt;/h4&gt;

&lt;p&gt;Suppose we have two functions, &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt;, defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we run &lt;code&gt;f1()&lt;/code&gt; followed by &lt;code&gt;f2()&lt;/code&gt;, the output is not what we expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="cm"&gt;/*
Output:
2
1
*/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, we want to execute &lt;code&gt;f1&lt;/code&gt; first and then &lt;code&gt;f2&lt;/code&gt;. So, the expected output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
    After 1 second, print 1
    Then after 0.5 seconds, print 2
*/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Initial Solution
&lt;/h2&gt;

&lt;p&gt;To control the order of execution, one approach is to pass &lt;code&gt;f2&lt;/code&gt; as a callback to &lt;code&gt;f1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, calling &lt;code&gt;f1(f2)&lt;/code&gt; ensures that &lt;code&gt;f2&lt;/code&gt; runs after &lt;code&gt;f1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cm"&gt;/*
Output:
1
2
*/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding a Third Function
&lt;/h2&gt;

&lt;p&gt;Now, let's introduce a third function, &lt;code&gt;f3&lt;/code&gt;, that needs to be executed after &lt;code&gt;f2&lt;/code&gt;. To maintain clean and logical code, we continue using callbacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;done&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2Callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;f3&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1Callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f2Callback&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f1Callback&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach works, but it can become unwieldy when dealing with a large number of functions. What if we want a more generic solution that can handle any number of functions in a specific order?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Magic Function
&lt;/h3&gt;

&lt;p&gt;Let's create a magic function, &lt;code&gt;magic&lt;/code&gt;, that takes an arbitrary number of functions and executes them in the order they are received:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;magic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Start off with assigning callback f as the last function&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iterate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;iterate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;magic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;f3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;magic&lt;/code&gt; function, we can specify the order of execution as &lt;code&gt;f1&lt;/code&gt; -&amp;gt; &lt;code&gt;f2&lt;/code&gt; -&amp;gt; &lt;code&gt;f3&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
Output:
1
2
done
*/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows us to execute functions in a defined order, making our code more modular and easier to maintain.&lt;/p&gt;

&lt;h4&gt;
  
  
  Other Options and Considerations
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using Promises:&lt;/strong&gt; Another option is to have each function return a Promise. Chaining those Promises controls the order of execution and handles asynchronous operations more elegantly than nested callbacks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Design Perspective:&lt;/strong&gt; If you think of the &lt;code&gt;f*&lt;/code&gt; functions as a shared library, accepting callbacks is a logical design choice: it lets users run custom logic after each library function completes, which makes the library more flexible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
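&lt;p&gt;For comparison, here is a minimal Promise-based sketch of the same ordering. The names &lt;code&gt;p1&lt;/code&gt;, &lt;code&gt;p2&lt;/code&gt;, and &lt;code&gt;p3&lt;/code&gt; are hypothetical async counterparts of &lt;code&gt;f1&lt;/code&gt;, &lt;code&gt;f2&lt;/code&gt;, and &lt;code&gt;f3&lt;/code&gt;, not part of the original example:&lt;/p&gt;

```javascript
// A minimal Promise-based sketch of the same ordering.
// p1, p2, p3 are hypothetical async counterparts of f1, f2, f3.
const logged = [];
const log = (msg) => {
  logged.push(msg);
  console.log(msg);
};

const p1 = () => Promise.resolve().then(() => log(1));
const p2 = () => Promise.resolve().then(() => log(2));
const p3 = () => Promise.resolve().then(() => log('done'));

// Chaining fixes the order: p1, then p2, then p3.
// Prints 1, 2, then "done".
const done = p1().then(p2).then(p3);
```

&lt;p&gt;The chain expresses the same f1 -&amp;gt; f2 -&amp;gt; f3 ordering declaratively, without building wrapper callbacks in a loop.&lt;/p&gt;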

&lt;p&gt;&lt;a href="https://jsfiddle.net/ajithmanmu/bof3taqe/27/" rel="noopener noreferrer"&gt;Check out the code on JSFiddle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>Putting AWS Skills to Work: Building an AB Testing Tracker</title>
      <dc:creator>ajithmanmu</dc:creator>
      <pubDate>Fri, 15 Aug 2025 14:49:53 +0000</pubDate>
      <link>https://dev.to/ajithmanmu/putting-aws-skills-to-work-building-an-ab-testing-tracker-34c</link>
      <guid>https://dev.to/ajithmanmu/putting-aws-skills-to-work-building-an-ab-testing-tracker-34c</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Over the last few weeks, I’ve been working on projects that combine my AWS learning with my background in Growth Engineering.&lt;/p&gt;

&lt;p&gt;I wanted something more relevant to my domain than the classic &lt;a href="https://cloudresumechallenge.dev/" rel="noopener noreferrer"&gt;“Cloud Resume” challenge&lt;/a&gt;. So I built an &lt;strong&gt;AB Testing Tracker&lt;/strong&gt; — a simple, serverless app to track experiments, impressions, clicks, and calculate click-through rates (CTR).&lt;/p&gt;

&lt;p&gt;Demo link - &lt;a href="https://ab-testing-tracker-frontend.vercel.app/" rel="noopener noreferrer"&gt;https://ab-testing-tracker-frontend.vercel.app/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture &amp;amp; What the App Does&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the high-level architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js app hosted on Vercel that fetches the experiment manifest from S3 via CloudFront.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;: AWS Lambda (Node.js) with API Gateway to receive experiment events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;: DynamoDB for storing impressions and clicks and computing CTR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN&lt;/strong&gt;: CloudFront with OAI (Origin Access Identity) for secure access to S3.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyahw7tk0lsaepfq0hutm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyahw7tk0lsaepfq0hutm.png" alt="Architecture" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tracking &lt;strong&gt;impressions&lt;/strong&gt; and &lt;strong&gt;clicks&lt;/strong&gt; for AB tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregating stats by variant and calculating CTR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returning results via a simple API.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
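&lt;p&gt;As a rough sketch, the per-variant aggregation and CTR calculation could look like this. The event shape and function name here are assumptions for illustration, not the project's actual code:&lt;/p&gt;

```javascript
// Sketch: aggregate impressions/clicks per variant and compute CTR.
// The event shape ({ variant, type }) is an assumption for illustration.
function aggregateCtr(events) {
  const stats = {};
  for (const e of events) {
    stats[e.variant] = stats[e.variant] || { impressions: 0, clicks: 0 };
    if (e.type === 'impression') stats[e.variant].impressions += 1;
    if (e.type === 'click') stats[e.variant].clicks += 1;
  }
  for (const v of Object.values(stats)) {
    // Guard against divide-by-zero when a variant has no impressions yet.
    v.ctr = v.impressions ? v.clicks / v.impressions : 0;
  }
  return stats;
}
```

&lt;p&gt;In the actual app this kind of aggregation would run against the events stored in DynamoDB and be returned through the API.&lt;/p&gt;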




&lt;h2&gt;
  
  
  &lt;strong&gt;Cost Optimization&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;S3 Lifecycle Policies&lt;/strong&gt; to delete unused objects after a set time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DynamoDB TTL&lt;/strong&gt; for expiring old events automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully serverless → no idle server cost; Lambda is pay-per-use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Security&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;S3 bucket is private, accessible only via CloudFront using OAI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DynamoDB and Lambda could be moved into a VPC with &lt;strong&gt;VPC Endpoints&lt;/strong&gt; for tighter control.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Learnings &amp;amp; Notes&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Used &lt;strong&gt;Vercel&lt;/strong&gt; for quick and painless frontend deployment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fixed a &lt;strong&gt;CORS&lt;/strong&gt; issue by switching CloudFront’s response header policy from &lt;em&gt;SimpleCORS&lt;/em&gt; to &lt;em&gt;CORS With Preflight&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leveraged &lt;strong&gt;Cursor&lt;/strong&gt; and &lt;strong&gt;ChatGPT&lt;/strong&gt; for coding assistance and documentation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Full Project &amp;amp; Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ajithmanmu/ab-testing-tracker" rel="noopener noreferrer"&gt;https://github.com/ajithmanmu/ab-testing-tracker&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>aws</category>
      <category>buildinpublic</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
