<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karthik K Pradeep</title>
    <description>The latest articles on DEV Community by Karthik K Pradeep (@foldedodin).</description>
    <link>https://dev.to/foldedodin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811793%2F86d48240-7033-4492-9587-c5cff2ee70d8.jpg</url>
      <title>DEV Community: Karthik K Pradeep</title>
      <link>https://dev.to/foldedodin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/foldedodin"/>
    <language>en</language>
    <item>
      <title>I Stopped Deploying Manually - Here’s My CI/CD Pipeline with GitHub Actions</title>
      <dc:creator>Karthik K Pradeep</dc:creator>
      <pubDate>Mon, 23 Mar 2026 21:39:25 +0000</pubDate>
      <link>https://dev.to/foldedodin/i-stopped-deploying-manually-heres-my-cicd-pipeline-with-github-actions-2h6k</link>
      <guid>https://dev.to/foldedodin/i-stopped-deploying-manually-heres-my-cicd-pipeline-with-github-actions-2h6k</guid>
      <description>&lt;p&gt;How I moved AquaChain from manual deployments to a GitHub Actions pipeline that catches bad changes before they hit production.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AquaChain is a production-focused IoT water quality monitoring platform: sensors send readings into AWS, Lambda services process and analyze them, and the frontend gives admins, technicians, and consumers role-based views into device health, alerts, and operations.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Started
&lt;/h2&gt;

&lt;p&gt;For a long time, my deployment process looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build locally&lt;/li&gt;
&lt;li&gt;SSH into the server&lt;/li&gt;
&lt;li&gt;Restart the app&lt;/li&gt;
&lt;li&gt;Hope nothing breaks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step was the real problem.&lt;/p&gt;

&lt;p&gt;The moment I finally stopped trusting manual deploys was a frontend release that looked fine on my machine but went out with the wrong production environment settings. AquaChain loaded, but API requests started failing immediately. I was SSH'ing into the box, checking logs, rolling back, and trying to remember which local change caused it. The rollback only took a few tense minutes, but a few minutes of visible production errors is still a production incident. Nothing about that failure was dramatic. It was worse: it was avoidable.&lt;/p&gt;

&lt;p&gt;That was the pattern I wanted to kill. Too much depended on memory, local state, and luck.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Wanted From the Pipeline
&lt;/h2&gt;

&lt;p&gt;I wanted a pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs linting, type checks, and tests on every pull request&lt;/li&gt;
&lt;li&gt;Blocks merges when quality checks fail&lt;/li&gt;
&lt;li&gt;Deploys the frontend only when frontend files change&lt;/li&gt;
&lt;li&gt;Deploys Lambda functions only when backend files change&lt;/li&gt;
&lt;li&gt;Uses short-lived AWS credentials instead of long-lived secrets&lt;/li&gt;
&lt;li&gt;Completes in under 5 minutes so it does not become something people route around&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One practical note before the YAML: GitHub Actions minutes are not free forever. On GitHub-hosted runners, every PR build costs time and money once you move past the included allowance. Path filters, caching, and splitting work by concern are not just performance improvements. They control your bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Repository Reality
&lt;/h2&gt;

&lt;p&gt;This post uses AquaChain names throughout because the examples are derived from the current AquaChain workflow, not a sanitized demo.&lt;/p&gt;

&lt;p&gt;The repo itself looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aquachain/
├── frontend/          # Next.js app
├── lambda/            # AWS Lambda functions (Python)
├── infrastructure/    # AWS CDK and infra code
├── config/            # Shared CI config such as requirements-dev.txt
└── .github/
    └── workflows/
        └── ci-cd-pipeline.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These examples are derived from &lt;code&gt;.github/workflows/ci-cd-pipeline.yml&lt;/code&gt;; I am splitting them into three workflows here for clarity, but in practice they currently live in one file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;config/&lt;/code&gt; directory holds shared CI tooling such as &lt;code&gt;requirements-dev.txt&lt;/code&gt; and &lt;code&gt;pytest.ini&lt;/code&gt;, so linting and test setup lives in one place instead of being duplicated across every Lambda service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow 1: PR Checks
&lt;/h2&gt;

&lt;p&gt;This runs on pull requests and on pushes to &lt;code&gt;main&lt;/code&gt;. If this fails, nothing should merge.&lt;/p&gt;

&lt;p&gt;If code can still drift into &lt;code&gt;main&lt;/code&gt; after this job fails, the pipeline is just theater.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/pr-checks.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PR Checks&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;develop&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;frontend-checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Frontend - Lint, Type Check, Test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Node.js&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;18"&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npm"&lt;/span&gt;
          &lt;span class="na"&gt;cache-dependency-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend/package-lock.json&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lint&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run lint&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Type check&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx tsc --noEmit&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test -- --watchAll=false --coverage --passWithNoTests&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;

  &lt;span class="na"&gt;backend-checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Backend - Lint, Type Check, Test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Python&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.11"&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pip"&lt;/span&gt;
          &lt;span class="na"&gt;cache-dependency-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config/requirements-dev.txt&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install backend tooling&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -r ./config/requirements-dev.txt&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lint&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flake8 ./lambda --max-line-length=120 --exclude=./lambda/layers&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Type check&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy ./lambda --config-file ./lambda/mypy.ini --ignore-missing-imports&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytest ./lambda -v --tb=short -c ./config/pytest.ini&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few details matter here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;npm ci&lt;/code&gt; instead of &lt;code&gt;npm install&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;ci&lt;/code&gt; installs exactly what is in &lt;code&gt;package-lock.json&lt;/code&gt; and fails if the lockfile is out of sync. That is what you want in CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The backend paths are explicit.&lt;/strong&gt; In AquaChain, shared Python tooling lives under &lt;code&gt;config/&lt;/code&gt;, so I reference &lt;code&gt;./config/requirements-dev.txt&lt;/code&gt; directly instead of assuming readers know what the runner's default directory is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--watchAll=false&lt;/code&gt; on Jest is non-negotiable.&lt;/strong&gt; Without it, the job hangs waiting for interactive input.&lt;/p&gt;

&lt;p&gt;If your Lambda estate gets large, this is the point where I would switch from one backend job to a matrix strategy per function. AquaChain's full workflow already does that for better parallelism and clearer failures.&lt;/p&gt;
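
&lt;p&gt;A rough sketch of that matrix shape, reusing the same commands as the job above (the function list here is illustrative, not the full AquaChain set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  backend-checks:
    name: Backend - ${{ matrix.function }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false   # let every function report its own result
      matrix:
        function: [data_processing, notification_service]

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"
          cache-dependency-path: config/requirements-dev.txt

      - name: Install backend tooling
        run: pip install -r ./config/requirements-dev.txt

      - name: Lint, type check, and test one function
        run: |
          flake8 ./lambda/${{ matrix.function }} --max-line-length=120
          mypy ./lambda/${{ matrix.function }} --ignore-missing-imports
          pytest ./lambda/${{ matrix.function }} -v --tb=short -c ./config/pytest.ini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;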

&lt;h2&gt;
  
  
  Caching Is Worth It
&lt;/h2&gt;

&lt;p&gt;The cache lines in the workflow above are not decorative. On a codebase this size, a cold &lt;code&gt;npm ci&lt;/code&gt; can take roughly 60-90 seconds, while a warm cache often brings that same step down to 10-20 seconds. The exact numbers vary by runner, but the severalfold speedup is real.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip&lt;/code&gt; caching buys the same kind of win for Python tooling. If you skip caching, the pipeline still works. It just feels slower every single run, and that is how teams end up resenting CI.&lt;/p&gt;
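
&lt;p&gt;The built-in &lt;code&gt;cache&lt;/code&gt; option on &lt;code&gt;setup-node&lt;/code&gt; and &lt;code&gt;setup-python&lt;/code&gt; only covers dependency installs. For anything else, &lt;code&gt;actions/cache&lt;/code&gt; does the same job; here is a sketch for Next.js's build cache, with a key that is one reasonable choice rather than the only one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      - name: Cache Next.js build output
        uses: actions/cache@v4
        with:
          path: frontend/.next/cache
          # Reuse the cache while the lockfile is unchanged, fall back to any older one
          key: ${{ runner.os }}-nextjs-${{ hashFiles('frontend/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-nextjs-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;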

&lt;h2&gt;
  
  
  Workflow 2: Deploy Frontend to Vercel
&lt;/h2&gt;

&lt;p&gt;I use the Vercel CLI instead of relying entirely on the built-in GitHub integration. It gives me tighter control over when builds happen and which environment variables are in play.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy-frontend.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Frontend&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frontend/**"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Vercel&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Node.js&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;18"&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npm"&lt;/span&gt;
          &lt;span class="na"&gt;cache-dependency-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend/package-lock.json&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Vercel CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g vercel@latest&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pull Vercel environment&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vercel pull --yes --environment=production --token=${{ secrets.VERCEL_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vercel build --prod --token=${{ secrets.VERCEL_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;REACT_APP_API_ENDPOINT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROD_API_ENDPOINT }}&lt;/span&gt;
          &lt;span class="na"&gt;REACT_APP_USER_POOL_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROD_USER_POOL_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;REACT_APP_USER_POOL_CLIENT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROD_USER_POOL_CLIENT_ID }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vercel deploy --prebuilt --prod --token=${{ secrets.VERCEL_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;paths&lt;/code&gt; filter is doing two jobs for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It stops backend-only commits from wasting frontend build minutes.&lt;/li&gt;
&lt;li&gt;It keeps the workflow history clean because only relevant deploys appear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convenience is great right up until you are explaining an unexpected deploy during an incident. I would rather be explicit.&lt;/p&gt;

&lt;p&gt;For PR previews, I use the same pattern on &lt;code&gt;pull_request&lt;/code&gt; without &lt;code&gt;--prod&lt;/code&gt;. Each PR gets its own preview deployment URL.&lt;/p&gt;
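
&lt;p&gt;The preview variant only needs a different trigger and a non-production pull and deploy; a trimmed sketch of the parts that change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;on:
  pull_request:
    branches: [main]
    paths:
      - "frontend/**"

# ...same checkout, Node.js, and Vercel CLI setup as the production job...

      - name: Pull Vercel environment
        run: vercel pull --yes --environment=preview --token=${{ secrets.VERCEL_TOKEN }}
        working-directory: frontend

      - name: Build
        run: vercel build --token=${{ secrets.VERCEL_TOKEN }}
        working-directory: frontend

      - name: Deploy preview
        run: vercel deploy --prebuilt --token=${{ secrets.VERCEL_TOKEN }}
        working-directory: frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;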

&lt;h2&gt;
  
  
  Workflow 3: Deploy Lambda to AWS
&lt;/h2&gt;

&lt;p&gt;This is the part I would explicitly not implement with &lt;code&gt;github.event.commits[0].modified&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That field only tells you what changed in the first commit listed in the push payload. On a multi-commit push or a squash merge, it can miss files that absolutely changed. It looks clever, but it is fragile.&lt;/p&gt;

&lt;p&gt;If a deployment decision depends on a webhook payload shortcut instead of the actual file paths I care about, I do not trust it.&lt;/p&gt;
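
&lt;p&gt;For the record, the tempting shortcut looks something like this (illustrative only; do not ship it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      # Fragile: this condition only sees files from the first commit in the
      # push payload, so a multi-commit push or squash merge can slip past it.
      - name: Deploy data processing Lambda
        if: contains(join(github.event.commits[0].modified), 'lambda/data_processing/')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;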

&lt;p&gt;The more reliable pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a top-level &lt;code&gt;paths&lt;/code&gt; filter so the workflow runs only for backend changes.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;dorny/paths-filter&lt;/code&gt; inside the job when you need per-folder deploy decisions.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy-backend.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Backend&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda/**"&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Lambda Functions&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;

    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect changed Lambda folders&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;changes&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dorny/paths-filter@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;data_processing:&lt;/span&gt;
              &lt;span class="s"&gt;- "lambda/data_processing/**"&lt;/span&gt;
            &lt;span class="s"&gt;notification_service:&lt;/span&gt;
              &lt;span class="s"&gt;- "lambda/notification_service/**"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials (OIDC)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-actions-deploy&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ap-south-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Python&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.11"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy data processing Lambda&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.changes.outputs.data_processing == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda/data_processing&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;zip -r function.zip . -x "tests/*" "*.pyc"&lt;/span&gt;
          &lt;span class="s"&gt;aws lambda update-function-code \&lt;/span&gt;
            &lt;span class="s"&gt;--function-name AquaChain-Function-data_processing-production \&lt;/span&gt;
            &lt;span class="s"&gt;--zip-file fileb://function.zip \&lt;/span&gt;
            &lt;span class="s"&gt;--region ap-south-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy notification API Lambda&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.changes.outputs.notification_service == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda/notification_service&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;zip -r function.zip . -x "tests/*" "*.pyc"&lt;/span&gt;
          &lt;span class="s"&gt;aws lambda update-function-code \&lt;/span&gt;
            &lt;span class="s"&gt;--function-name AquaChain-Function-notification_service-production \&lt;/span&gt;
            &lt;span class="s"&gt;--zip-file fileb://function.zip \&lt;/span&gt;
            &lt;span class="s"&gt;--region ap-south-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you only have one Lambda function, the workflow-level &lt;code&gt;paths&lt;/code&gt; filter is enough. If you have many, add a change-detection step like the one above. AquaChain's current production workflow uses environment-specific names such as &lt;code&gt;AquaChain-Function-data_processing-production&lt;/code&gt;, so the example mirrors that pattern instead of pointing at a &lt;code&gt;-dev&lt;/code&gt; function.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OIDC Auth Setup
&lt;/h2&gt;

&lt;p&gt;The security mistake I still see too often in GitHub Actions is storing &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; as repository secrets.&lt;/p&gt;

&lt;p&gt;Those are long-lived credentials. If they leak, the problem does not end when the workflow ends.&lt;/p&gt;

&lt;p&gt;OIDC is the better pattern. GitHub Actions assumes an IAM role directly and receives short-lived credentials for that run only.&lt;/p&gt;

&lt;p&gt;Here is the CDK shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;

&lt;span class="n"&gt;github_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenIdConnectProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GitHubOIDC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://token.actions.githubusercontent.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;client_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;deploy_role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GitHubActionsDeployRole&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github-actions-deploy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;assumed_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WebIdentityPrincipal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;github_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_id_connect_provider_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StringEquals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token.actions.githubusercontent.com:aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StringLike&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token.actions.githubusercontent.com:sub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo:your-org/your-repo:ref:refs/heads/main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;deploy_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_to_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda:UpdateFunctionCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda:UpdateFunctionConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:lambda:ap-south-1:123456789012:function:AquaChain-Function-*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important detail is the &lt;code&gt;sub&lt;/code&gt; restriction. Without that, you are trusting GitHub broadly. With it, you are trusting one repository and one branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets Management
&lt;/h2&gt;

&lt;p&gt;GitHub gives you three useful scopes for secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repository secrets for repo-wide values&lt;/li&gt;
&lt;li&gt;Environment secrets for protected environments such as &lt;code&gt;production&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Organization secrets for shared values across repos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production deploys, I prefer environment secrets plus protection rules. That forces a deliberate approval step before the workflow can read sensitive values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line is small, but it changes the blast radius of a bad push.&lt;/p&gt;

&lt;h2&gt;
  
  
  Branch Protection Rules
&lt;/h2&gt;

&lt;p&gt;The pipeline only matters if people cannot quietly route around it.&lt;/p&gt;

&lt;p&gt;GitHub starts running these workflows as soon as the files are pushed to the repo under &lt;code&gt;.github/workflows/&lt;/code&gt;. What it does not do by itself is block merges. For that, go to &lt;strong&gt;Settings -&amp;gt; Branches&lt;/strong&gt; in the repository and lock down &lt;code&gt;main&lt;/code&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Required status checks&lt;/li&gt;
&lt;li&gt;Up-to-date branch enforcement&lt;/li&gt;
&lt;li&gt;At least one approving review&lt;/li&gt;
&lt;li&gt;No direct pushes to &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That turns the release path into one narrow lane instead of a social convention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer pushes branch
        ↓
PR opened -&amp;gt; PR checks run
  ├── Frontend: lint + typecheck + test + build
  └── Backend: flake8 + mypy + pytest
        ↓
All checks pass -&amp;gt; PR review
        ↓
PR merged to main
        ↓
  ├── frontend/** changed -&amp;gt; deploy-frontend.yml -&amp;gt; Vercel production
  └── lambda/** changed   -&amp;gt; deploy-backend.yml  -&amp;gt; AWS Lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CI/CD Pipeline Visualization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3tjh8ezrf8anf69wku1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3tjh8ezrf8anf69wku1.png" alt="GitHub Actions Pipeline" width="800" height="332"&gt;&lt;/a&gt;&lt;br&gt;
Real GitHub Actions workflow for AquaChain — showing code quality checks, testing, build, and deployment stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps I'm Planning
&lt;/h2&gt;

&lt;p&gt;This setup is already useful, but it is not the endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a smoke test after deployment.&lt;/strong&gt; A quick health-check call after deploy would catch broken releases before users do.&lt;/p&gt;
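
&lt;p&gt;Something as small as this final step in the deploy job would cover it (the &lt;code&gt;/health&lt;/code&gt; path is a placeholder for whatever the API actually exposes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      - name: Smoke test
        run: |
          # Fail the workflow if the freshly deployed API does not answer with a 2xx
          curl --fail --silent --show-error --max-time 10 \
            "${{ secrets.PROD_API_ENDPOINT }}/health"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;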

&lt;p&gt;&lt;strong&gt;Extract reusable workflow pieces.&lt;/strong&gt; Once the pipeline settles down, shared setup can move into reusable workflows or composite actions.&lt;/p&gt;
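
&lt;p&gt;The mechanism for that is &lt;code&gt;workflow_call&lt;/code&gt;: the shared steps live in one file and the per-service workflows invoke it. A minimal sketch, with hypothetical file and input names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/workflows/reusable-python-checks.yml
on:
  workflow_call:
    inputs:
      target-directory:
        required: true
        type: string

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r ./config/requirements-dev.txt
      - run: flake8 ${{ inputs.target-directory }} --max-line-length=120

# Caller, e.g. a job inside pr-checks.yml:
#   backend-checks:
#     uses: ./.github/workflows/reusable-python-checks.yml
#     with:
#       target-directory: ./lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;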

&lt;p&gt;&lt;strong&gt;Keep GitHub Actions versions current automatically.&lt;/strong&gt; Dependabot already makes sense for application dependencies. It should also keep workflow actions fresh.&lt;/p&gt;
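
&lt;p&gt;Dependabot picks workflow actions up from a &lt;code&gt;github-actions&lt;/code&gt; entry in &lt;code&gt;.github/dependabot.yml&lt;/code&gt;, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"        # where the workflow files are discovered from
    schedule:
      interval: "weekly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;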

&lt;h2&gt;
  
  
  The Payoff
&lt;/h2&gt;

&lt;p&gt;Before this pipeline, a configuration mistake turned into a production debugging session over SSH. After it, that same class of mistake is far more likely to show up as a visible CI failure, a scoped deploy failure, or at minimum a traceable release with logs and approvals attached.&lt;/p&gt;

&lt;p&gt;That is the benchmark I care about now. The failed deploy from the intro should not be a live fire exercise on a server. It should be a boring red workflow run that never gets mistaken for a normal release.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>cicd</category>
      <category>automation</category>
      <category>github</category>
    </item>
    <item>
      <title>Serverless ML Inference with AWS Lambda + Docker</title>
      <dc:creator>Karthik K Pradeep</dc:creator>
      <pubDate>Sun, 22 Mar 2026 11:10:49 +0000</pubDate>
      <link>https://dev.to/foldedodin/serverless-ml-inference-with-aws-lambda-docker-25nk</link>
      <guid>https://dev.to/foldedodin/serverless-ml-inference-with-aws-lambda-docker-25nk</guid>
      <description>&lt;p&gt;Running ML models in production sounds simple until you realize you're paying for servers 24/7 even when nobody is using them. That was my situation. &lt;br&gt;
I had a model running on EC2, serving predictions through Flask. It worked. It also quietly burned money every hour of the day. So I rebuilt the entire inference pipeline using AWS Lambda and reduced costs to almost zero during idle time. &lt;br&gt;
This post walks through exactly how I did it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with "Always-On" ML Inference
&lt;/h2&gt;

&lt;p&gt;When I first deployed a machine learning model, I followed the standard approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flask API&lt;/li&gt;
&lt;li&gt;EC2 instance&lt;/li&gt;
&lt;li&gt;Load model at startup&lt;/li&gt;
&lt;li&gt;Serve predictions over HTTP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;But it also meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paying for compute 24/7&lt;/li&gt;
&lt;li&gt;Even at 3 AM, when traffic is zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For systems like AquaChain, inference is event-driven:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bursts of requests from devices&lt;/li&gt;
&lt;li&gt;Long idle periods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running a server continuously for this pattern is wasteful.&lt;/p&gt;

&lt;h2&gt;
  Enter: Serverless ML Inference
&lt;/h2&gt;

&lt;p&gt;With AWS Lambda:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You pay only when your model runs&lt;/li&gt;
&lt;li&gt;No idle infrastructure&lt;/li&gt;
&lt;li&gt;Fully event-driven execution&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;scikit-learn 1.4.0&lt;/li&gt;
&lt;li&gt;XGBoost 2.0.3&lt;/li&gt;
&lt;li&gt;numpy 1.26.3 + pandas 2.1.4&lt;/li&gt;
&lt;li&gt;Python 3.11&lt;/li&gt;
&lt;li&gt;AWS Lambda (container image)&lt;/li&gt;
&lt;li&gt;Amazon ECR (container registry)&lt;/li&gt;
&lt;li&gt;S3 (model artifact storage)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml_inference/
├── handler.py          # Lambda entry point
├── model_loader.py     # S3 model caching logic
├── feature_extractor.py
├── Dockerfile
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;The key is using AWS's official Lambda base image. It includes the Lambda runtime interface client, so your container behaves exactly like a standard Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; public.ecr.aws/lambda/python:3.11&lt;/span&gt;

&lt;span class="c"&gt;# Copy requirements first for layer caching&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Copy function code&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; handler.py model_loader.py feature_extractor.py ./&lt;/span&gt;

&lt;span class="c"&gt;# Lambda handler entrypoint&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["handler.lambda_handler"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;requirements.txt&lt;/code&gt; for the ML stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;scikit-learn&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.4.0&lt;/span&gt;
&lt;span class="py"&gt;xgboost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.0.3&lt;/span&gt;
&lt;span class="py"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.26.3&lt;/span&gt;
&lt;span class="py"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.1.4&lt;/span&gt;
&lt;span class="py"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.34.34&lt;/span&gt;
&lt;span class="py"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.3.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;One important detail: put &lt;code&gt;COPY requirements.txt&lt;/code&gt; and &lt;code&gt;RUN pip install&lt;/code&gt; before copying your application code. Docker caches each layer — if your code changes but your dependencies don't, the pip install layer is reused and your build takes seconds instead of minutes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Handler
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;model_loader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;feature_extractor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_features&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate inputs before touching the model
&lt;/span&gt;        &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;turbidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing fields: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VALIDATION_ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract features (includes trend calculations)
&lt;/span&gt;        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Get model — cached in /tmp after first load
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Run inference
&lt;/span&gt;        &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_wqi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wqi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wqi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inference failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INFERENCE_ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_wqi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Excellent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Good&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fair&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Poor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Very Poor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model Caching: The Most Important Optimization
&lt;/h2&gt;

&lt;p&gt;Lambda's &lt;code&gt;/tmp&lt;/code&gt; directory persists across warm invocations of the same container instance. Loading a model from S3 on every request would add 200–500ms of latency and unnecessary S3 GET costs. Cache it in &lt;code&gt;/tmp&lt;/code&gt; on first load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;MODEL_S3_BUCKET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MODEL_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;MODEL_S3_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;LOCAL_MODEL_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;_model_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Module-level cache — survives across warm invocations
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_model_cache&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_model_cache&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using in-memory model cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_model_cache&lt;/span&gt;

    &lt;span class="c1"&gt;# Check /tmp first (warm container, model already downloaded)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOCAL_MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading model from /tmp cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_model_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOCAL_MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_model_cache&lt;/span&gt;

    &lt;span class="c1"&gt;# Cold start — download from S3
&lt;/span&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloading model from s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_S3_BUCKET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_S3_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_S3_BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MODEL_S3_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOCAL_MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_model_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOCAL_MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model loaded and cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_model_cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two levels of caching here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;_model_cache&lt;/code&gt; — in-memory, fastest possible, survives as long as the container is warm&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/tmp/model.joblib&lt;/code&gt; — survives container reuse even if the Python process restarts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a cold start you pay the S3 download once. Every subsequent warm invocation skips it entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building and Pushing to ECR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate Docker with ECR&lt;/span&gt;
aws ecr get-login-password &lt;span class="nt"&gt;--region&lt;/span&gt; ap-south-1 | &lt;span class="se"&gt;\&lt;/span&gt;
  docker login &lt;span class="nt"&gt;--username&lt;/span&gt; AWS &lt;span class="nt"&gt;--password-stdin&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  758346259059.dkr.ecr.ap-south-1.amazonaws.com

&lt;span class="c"&gt;# Build the image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; aquachain-ml-inference &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Tag for ECR&lt;/span&gt;
docker tag aquachain-ml-inference:latest &lt;span class="se"&gt;\&lt;/span&gt;
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest

&lt;span class="c"&gt;# Push&lt;/span&gt;
docker push &lt;span class="se"&gt;\&lt;/span&gt;
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then deploy the Lambda pointing at the ECR image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda update-function-code &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; aquachain-function-ml-inference-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-uri&lt;/span&gt; 758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-south-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CDK Definition
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;aws_lambda&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;lambda_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_ecr&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Duration&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reference existing ECR repo
&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_repository_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLInferenceRepo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aquachain-ml-inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ml_inference_fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DockerImageFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLInferenceFunction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aquachain-function-ml-inference-dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lambda_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DockerImageCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_ecr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tag_or_digest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;memory_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# ML models benefit from more memory
&lt;/span&gt;    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODEL_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aquachain-models-dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wqi/model_v2.joblib&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Grant S3 read access for model download
&lt;/span&gt;&lt;span class="n"&gt;model_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grant_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ml_inference_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory sizing matters here. I started at 512MB and saw ~180ms inference times. Bumping to 1024MB dropped it to ~85ms — Lambda allocates CPU proportionally to memory, so more memory = faster CPU = faster inference. Run a few tests at different memory sizes; the cost difference is often negligible compared to the latency improvement.&lt;/p&gt;
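&lt;p&gt;If you want to reproduce that comparison instead of guessing, a rough sketch like the one below works. The function name and sample payload are placeholders taken from this post, and the timings include client-side network overhead, so treat the numbers as relative rather than absolute:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough memory-size benchmark (sketch). Lambda bills per GB-second, so
# weigh the measured latency against the configured memory before committing.
import json
import time
import boto3

FUNCTION_NAME = 'aquachain-function-ml-inference-dev'
SAMPLE_EVENT = {
    'deviceId': 'ESP32-ABC123',
    'readings': {'pH': 7.2, 'turbidity': 3.5, 'tds': 450, 'temperature': 22.5},
}

client = boto3.client('lambda', region_name='ap-south-1')

for memory_mb in (512, 1024, 2048):
    # Apply the new memory size and wait for the update to finish
    client.update_function_configuration(FunctionName=FUNCTION_NAME, MemorySize=memory_mb)
    client.get_waiter('function_updated').wait(FunctionName=FUNCTION_NAME)

    # Time a handful of invocations (the first one may include a cold start)
    timings = []
    for _ in range(5):
        start = time.perf_counter()
        client.invoke(FunctionName=FUNCTION_NAME, Payload=json.dumps(SAMPLE_EVENT))
        timings.append((time.perf_counter() - start) * 1000)

    print(f"{memory_mb}MB: best {min(timings):.0f}ms over {len(timings)} calls")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;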

&lt;h2&gt;
  
  
  Handling Cold Starts
&lt;/h2&gt;

&lt;p&gt;Cold starts for container-based Lambdas are longer than zip-based ones — typically 2–5 seconds for a 500MB image. For AquaChain this is acceptable because inference is triggered asynchronously (the data processing Lambda doesn't wait for the result). But if you need synchronous inference with strict latency SLAs, two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Provisioned Concurrency&lt;/strong&gt; — keeps N container instances warm at all times. Eliminates cold starts, but you pay for idle time. Only worth it if your p99 latency requirement is under 500ms and you have consistent traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scheduled warm-up ping&lt;/strong&gt; — an EventBridge rule that invokes the function every 5 minutes with a dummy payload. Cheap, effective for low-traffic functions, but not a guarantee.&lt;/p&gt;
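&lt;p&gt;A minimal CDK sketch of option 2, reusing the &lt;code&gt;ml_inference_fn&lt;/code&gt; construct defined above (the dummy payload shape is an assumption; have the handler short-circuit on it so warm-up pings stay cheap):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# EventBridge rule that pings the inference function every 5 minutes (sketch)
from aws_cdk import aws_events as events, aws_events_targets as targets, Duration

warmup_rule = events.Rule(
    self, "MLInferenceWarmupRule",
    schedule=events.Schedule.rate(Duration.minutes(5))
)
warmup_rule.add_target(targets.LambdaFunction(
    ml_inference_fn,
    event=events.RuleTargetInput.from_object({"warmup": True})  # dummy payload (assumption)
))

# In the handler, return early on the dummy payload:
#     if event.get('warmup'):
#         return {'statusCode': 200}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;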

&lt;p&gt;For most ML inference use cases, async invocation + accepting occasional cold starts is the right trade-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating the Model
&lt;/h2&gt;

&lt;p&gt;One of the best things about this setup: updating the model doesn't require a code deployment. You just upload a new &lt;code&gt;model.joblib&lt;/code&gt; to S3 with the same key. The next cold start picks it up automatically.&lt;/p&gt;

&lt;p&gt;For versioned rollouts, keep versioned keys in S3 (or enable S3 object versioning) and point the Lambda env var at the model you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upload new model version&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;model_v3.joblib s3://aquachain-models-dev/wqi/model_v2.joblib

&lt;span class="c"&gt;# If you need to roll back, just update the env var to point at the previous version&lt;/span&gt;
aws lambda update-function-configuration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; aquachain-function-ml-inference-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--environment&lt;/span&gt; &lt;span class="s2"&gt;"Variables={MODEL_KEY=wqi/model_v1.joblib,MODEL_BUCKET=aquachain-models-dev}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-south-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Running in production on AquaChain:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start (image download + model load)&lt;/td&gt;
&lt;td&gt;~2.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm inference (in-memory cache)&lt;/td&gt;
&lt;td&gt;~85ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm inference (first call, /tmp cache)&lt;/td&gt;
&lt;td&gt;~120ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory used&lt;/td&gt;
&lt;td&gt;~310MB of 1024MB allocated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per 1M inferences&lt;/td&gt;
&lt;td&gt;~$0.21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare that to a &lt;code&gt;t3.small&lt;/code&gt; EC2 instance running 24/7: ~$15/month regardless of traffic. At our current inference volume, Lambda costs under $1/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use Lambda
&lt;/h2&gt;

&lt;p&gt;Serverless ML is not a silver bullet.&lt;/p&gt;

&lt;p&gt;Avoid Lambda if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need ultra-low latency (&amp;lt;50ms)&lt;/li&gt;
&lt;li&gt;You have constant high traffic&lt;/li&gt;
&lt;li&gt;Your model is extremely large (&amp;gt;5GB and slow to load)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, a dedicated endpoint (SageMaker / ECS / EC2) is a better fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Use multi-stage Docker builds.&lt;/strong&gt; The current image includes build tools that aren't needed at runtime. A multi-stage build copies only the installed packages into the final image, reducing image size by 30–40% and speeding up cold starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pin the base image digest, not just the tag.&lt;/strong&gt; &lt;code&gt;python:3.11&lt;/code&gt; tags can change. Use the SHA256 digest for reproducible builds in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Add model validation on load.&lt;/strong&gt; Before caching the model, run a quick sanity check — predict on a known input and assert the output is in the expected range. Catches corrupted model files before they serve bad predictions.&lt;/p&gt;
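&lt;p&gt;A sketch of what that check could look like inside &lt;code&gt;get_model()&lt;/code&gt;. The reference reading and expected range are assumptions; pick a sample from your validation set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Load-time sanity check (sketch): reference reading and range are assumptions
REFERENCE_FEATURES = [[7.2, 3.5, 450, 22.5]]  # pH, turbidity, TDS, temperature

def validate_model(model):
    prediction = float(model.predict(REFERENCE_FEATURES)[0])
    if not 0 &amp;lt;= prediction &amp;lt;= 100:
        raise ValueError(f"Model sanity check failed: WQI {prediction} out of range")
    return model

# In get_model(), wrap both load paths:
#     _model_cache = validate_model(joblib.load(LOCAL_MODEL_PATH))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;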

&lt;p&gt;Serverless ML inference isn’t for every system. But for event-driven workloads like AquaChain, it hits a rare sweet spot: low cost, zero idle infrastructure, and production-grade performance.&lt;/p&gt;

&lt;p&gt;If your model doesn’t need to run 24/7, your infrastructure shouldn’t either.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Built a Serverless IoT Pipeline on AWS</title>
      <dc:creator>Karthik K Pradeep</dc:creator>
      <pubDate>Sat, 21 Mar 2026 07:10:20 +0000</pubDate>
      <link>https://dev.to/foldedodin/how-i-built-a-serverless-iot-pipeline-on-aws-49c</link>
      <guid>https://dev.to/foldedodin/how-i-built-a-serverless-iot-pipeline-on-aws-49c</guid>
      <description>&lt;p&gt;Water quality testing normally takes between 24 and 48 hours.&lt;br&gt;
This makes real-time monitoring almost impossible, especially where water needs to be verified as safe immediately. I wanted that turnaround down to seconds.&lt;br&gt;
That is why I created a real-time water quality monitoring system using ESP32 sensors, AWS IoT Core, Lambda, and DynamoDB — a production-ready pipeline, not a demo.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most IoT tutorials go this far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending a temperature reading&lt;/li&gt;
&lt;li&gt;Displaying it on a dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don’t cover what happens when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle 100K+ messages per hour reliably&lt;/li&gt;
&lt;li&gt;Run ML inference on every incoming reading&lt;/li&gt;
&lt;li&gt;Trigger alerts within &amp;lt;5 seconds&lt;/li&gt;
&lt;li&gt;Keep costs under $2 per device/month&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1oanwfazk3cohifv07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1oanwfazk3cohifv07.png" alt="Water Quality Testing with ESP32 MicroController"&gt;&lt;/a&gt;&lt;br&gt;
The system is a fully serverless, event-driven architecture: each component is triggered by incoming data rather than running continuously.&lt;br&gt;
There is no infrastructure to manage: no EC2 instances, no container orchestration layer, no manual scaling configuration.&lt;br&gt;
Each service (IoT Core, Lambda, DynamoDB, and API Gateway) scales independently with its workload, so the pipeline absorbs variable ingestion rates with low operational overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Device Layer: ESP32 + MQTT
&lt;/h2&gt;

&lt;p&gt;Each ESP32 device collects sensor data every 60 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pH (0–14)&lt;/li&gt;
&lt;li&gt;Turbidity (0–1000 NTU)&lt;/li&gt;
&lt;li&gt;TDS — Total Dissolved Solids (0–2000 ppm)&lt;/li&gt;
&lt;li&gt;Temperature (-10°C to 50°C)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deviceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ESP32-ABC123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15T10:30:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"readings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pH"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"turbidity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;22.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"firmwareVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"batteryLevel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signalStrength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-45&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why MQTT over HTTP? A fraction of the overhead. MQTT keeps a persistent TCP connection, so each message is just the payload: no HTTP headers, no TLS handshake per message. At 100K messages/hour that difference matters.&lt;/p&gt;
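&lt;p&gt;The firmware itself runs on the ESP32, but a Python stand-in (AWS IoT Device SDK v2, &lt;code&gt;pip install awsiotsdk&lt;/code&gt;) is handy for publishing test readings to the same topic from a laptop. The endpoint and certificate paths below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Publish one test reading over MQTT with X.509 auth (sketch; not the firmware)
import json
from awscrt import mqtt
from awsiot import mqtt_connection_builder

connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxx-ats.iot.ap-south-1.amazonaws.com",  # your IoT endpoint (placeholder)
    cert_filepath="device.pem.crt",
    pri_key_filepath="private.pem.key",
    ca_filepath="AmazonRootCA1.pem",
    client_id="ESP32-ABC123",
)
connection.connect().result()

payload = {
    "deviceId": "ESP32-ABC123",
    "timestamp": "2024-01-15T10:30:00Z",
    "readings": {"pH": 7.2, "turbidity": 3.5, "tds": 450, "temperature": 22.5},
}
connection.publish(
    topic="aquachain/devices/ESP32-ABC123/data",
    payload=json.dumps(payload),
    qos=mqtt.QoS.AT_LEAST_ONCE,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;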

&lt;h2&gt;
  
  
  AWS IoT Core: Secure Ingestion Layer
&lt;/h2&gt;

&lt;p&gt;IoT Core acts as the managed MQTT broker. Devices connect with X.509 certificates — no username/password, no API keys. Each device gets its own cert, so you can revoke a single compromised device without touching anything else.&lt;br&gt;
The IoT Rule that routes messages to Lambda is dead simple:&lt;br&gt;
&lt;code&gt;SELECT * FROM 'aquachain/devices/+/data'&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The '+' wildcard matches any device ID. IoT Core evaluates the rule for every matching message and invokes the Lambda function directly: no polling, no queues, a purely event-driven flow.&lt;/p&gt;
&lt;h2&gt;
  
  
  Lambda: Validation and Storage
&lt;/h2&gt;

&lt;p&gt;The data processing Lambda does three things: validate, store, and trigger inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Validate sensor ranges
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid readings from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# Drop the message, don't store garbage
&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Store in DynamoDB
&lt;/span&gt;    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Trigger ML inference asynchronously
&lt;/span&gt;    &lt;span class="n"&gt;lambda_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;FunctionName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aquachain-function-ml-inference-dev&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;InvocationType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# async — don't wait
&lt;/span&gt;        &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Validation is non-negotiable.&lt;/strong&gt; Sensors drift, connectors corrode, and firmware bugs happen. If you store garbage readings, your ML model trains on garbage. I reject anything outside physical bounds — pH can't be 15, temperature can't be -50°C.&lt;/p&gt;
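&lt;p&gt;The post doesn't show &lt;code&gt;validate_readings&lt;/code&gt;; a minimal sketch along those lines, with bounds mirroring the sensor ranges listed earlier, could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Range check for incoming readings (sketch); bounds mirror the sensor specs above
SENSOR_BOUNDS = {
    'pH': (0, 14),
    'turbidity': (0, 1000),    # NTU
    'tds': (0, 2000),          # ppm
    'temperature': (-10, 50),  # °C
}

def validate_readings(readings):
    for field, (low, high) in SENSOR_BOUNDS.items():
        value = readings.get(field)
        if value is None or not low &amp;lt;= value &amp;lt;= high:
            return False
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;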

&lt;p&gt;&lt;strong&gt;2. TTL is free data lifecycle management&lt;/strong&gt;. DynamoDB's TTL feature automatically deletes items after a timestamp you set. Raw readings expire after 90 days with zero Lambda invocations and zero cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Async ML invocation&lt;/strong&gt;. I invoke the inference Lambda with InvocationType='Event' so the data processing function returns immediately. The ML inference runs in parallel. This is what keeps the pipeline fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Reality
&lt;/h2&gt;

&lt;p&gt;Here's what the actual CloudWatch data shows for the data processing Lambda:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Warm, no DB write&lt;/td&gt;
&lt;td&gt;~2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm, with DynamoDB PutItem&lt;/td&gt;
&lt;td&gt;~150ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start&lt;/td&gt;
&lt;td&gt;~617ms init + ~2ms execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The warm execution with a DynamoDB write averages around 150ms, and cold starts add ~617ms on top. The product target was &amp;lt;5 seconds from sensor reading to dashboard, so we're comfortably inside that. If you need consistently sub-100ms Lambda execution, though, the DynamoDB write is the bottleneck: make the write asynchronous via SQS, so the Lambda validates and returns in ~2ms while a separate consumer handles persistence.&lt;/p&gt;
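&lt;p&gt;A rough sketch of that SQS variant (the queue URL is a placeholder, and &lt;code&gt;validate_readings&lt;/code&gt; is the same check as above): the ingest Lambda validates and enqueues, and a separate consumer Lambda does the DynamoDB write.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Ingest Lambda in the SQS variant (sketch): validate, enqueue, return fast
import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456789012/aquachain-readings'  # placeholder

def lambda_handler(event, context):
    if not validate_readings(event['readings']):
        return  # drop invalid readings, same as before

    # Hand the write off to a queue; a consumer Lambda does the put_item
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;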

&lt;h2&gt;
  
  
  ML Inference: XGBoost on Lambda
&lt;/h2&gt;

&lt;p&gt;The ML model is an XGBoost classifier that outputs a Water Quality Index (WQI) from 0-100. It is fully contained in Lambda, no SageMaker endpoint to keep warm, no idle costs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Load model from S3 (cached in /tmp after first load)
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;trigger_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wqi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;classify_wqi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wqi&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model caches in &lt;code&gt;/tmp&lt;/code&gt; after the first load; subsequent warm calls skip the S3 download, keeping inference under 100ms. On model performance: it hits 99.74% accuracy on the validation set, which sounds impressive until you notice that the data is highly structured and the classes are well-separated. XGBoost is still the right choice here: it handles missing sensor values, trains quickly, and is interpretable enough that you can explain why it flagged a particular reading.&lt;/p&gt;
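&lt;p&gt;&lt;code&gt;extract_features&lt;/code&gt; isn't shown in the post; as an assumption, a minimal version just orders the raw readings the way the model was trained:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal feature extraction (assumption): fixed order matching training
FEATURE_ORDER = ['pH', 'turbidity', 'tds', 'temperature']

def extract_features(readings):
    return [float(readings[name]) for name in FEATURE_ORDER]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;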

&lt;h2&gt;
  
  
  DynamoDB: Schema Design for Time-Series
&lt;/h2&gt;

&lt;p&gt;The readings table uses a composite key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition key: deviceId&lt;/li&gt;
&lt;li&gt;Sort key: timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All readings for a device are co-located on the same partition, so you can query a time range with a single DynamoDB Query call: no scans, no GSIs needed for the primary access pattern.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deviceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; 
                           &lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# newest first
&lt;/span&gt;    &lt;span class="n"&gt;Limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I got wrong early on: I was storing floats directly. DynamoDB doesn't support Python floats; it uses Decimal. The fix is a recursive converter before any put_item call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;floats_to_decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;floats_to_decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;floats_to_decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
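&lt;p&gt;Usage is a one-liner: wrap every item before writing it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Convert any floats in the item before the write
table.put_item(Item=floats_to_decimal({
    'deviceId': device_id,
    'timestamp': event['timestamp'],
    **readings,
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;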



&lt;h2&gt;
  
  
  Infrastructure as Code: AWS CDK
&lt;/h2&gt;

&lt;p&gt;Everything is defined in Python CDK. No clicking around the console, no manual resource creation. The IoT rule, Lambda functions, DynamoDB tables, IAM policies are all code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# IoT Rule → Lambda
&lt;/span&gt;&lt;span class="n"&gt;iot_rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnTopicRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DataIngestionRule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rule_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aquachain_data_ingestion_dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;topic_rule_payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnTopicRule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TopicRulePayloadProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aquachain/devices/+/data&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnTopicRule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ActionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lambda_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnTopicRule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LambdaActionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;function_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_processing_fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_arn&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Grant IoT Core permission to invoke Lambda
&lt;/span&gt;&lt;span class="n"&gt;data_processing_fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_permission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IoTInvoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ServicePrincipal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iot.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;source_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iot_rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attr_arn&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM policy for the data processing Lambda follows least privilege. It can only write to the specific DynamoDB table and invoke the specific ML inference function. Nothing else.&lt;/p&gt;
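&lt;p&gt;In CDK terms that amounts to two grants (sketch; &lt;code&gt;readings_table&lt;/code&gt; is the table construct, which isn't shown in the snippets above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Least-privilege grants (sketch): write to one table, invoke one function
readings_table.grant_write_data(data_processing_fn)  # writes to this table only
ml_inference_fn.grant_invoke(data_processing_fn)      # lambda:InvokeFunction on this one function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;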

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. SQS Buffer between IoT Core &amp;amp; Lambda.&lt;/strong&gt; Rule -&amp;gt; Lambda is fine, but throttling on Lambda will cause IoT Core to drop messages. An SQS queue between the two provides a buffer &amp;amp; retry functionality automatically.&lt;/p&gt;
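&lt;p&gt;A CDK sketch of that change (the IAM role that lets IoT Core send to the queue is assumed and not shown):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# SQS buffer between the IoT rule and the processing Lambda (sketch)
from aws_cdk import aws_sqs as sqs, aws_lambda_event_sources as sources, Duration

ingest_queue = sqs.Queue(self, "IngestQueue", visibility_timeout=Duration.seconds(60))

# The IoT rule action delivers matching messages to SQS instead of invoking Lambda;
# swap this into the rule's actions list. iot_to_sqs_role (sqs:SendMessage) is assumed.
sqs_action = iot.CfnTopicRule.ActionProperty(
    sqs=iot.CfnTopicRule.SqsActionProperty(
        queue_url=ingest_queue.queue_url,
        role_arn=iot_to_sqs_role.role_arn,
    )
)

# The processing Lambda now polls the queue in batches, with retries built in
data_processing_fn.add_event_source(sources.SqsEventSource(ingest_queue, batch_size=10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;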

&lt;p&gt;&lt;strong&gt;2. Using DynamoDB on-demand since day one.&lt;/strong&gt; I started with provisioned capacity and spent a lot of time tuning read &amp;amp; write units. On-demand costs a little more per request, but it removes capacity planning entirely, which is worth it given how spiky IoT traffic can be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Using structured logging since day one.&lt;/strong&gt; I added JSON structured logging later, and retrofitting it across 30+ Lambda functions was a nightmare. A small logging utility that stamps the device ID, request ID, and timestamp onto every log line from the start would have made querying with CloudWatch Logs Insights a breeze.&lt;/p&gt;
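&lt;p&gt;The kind of tiny helper I mean (sketch): every log line becomes one JSON object that Logs Insights can filter on, with per-request context attached via &lt;code&gt;extra=&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal JSON log formatter (sketch) for Lambda's preconfigured root logger
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        # Context passed via logger.info(..., extra={'deviceId': ..., 'requestId': ...})
        for key in ('deviceId', 'requestId'):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger()
for handler in logger.handlers:
    handler.setFormatter(JsonFormatter())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;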

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After running in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;712 Lambda invocations over the past week, zero errors&lt;/li&gt;
&lt;li&gt;Average warm execution: ~83ms&lt;/li&gt;
&lt;li&gt;Cold start init: ~615ms (rare once the function stays warm)&lt;/li&gt;
&lt;li&gt;DynamoDB write latency: ~148ms average&lt;/li&gt;
&lt;li&gt;Cost: well under $2/device/month at current scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the serverless model really delivers on its promise: no infrastructure to babysit, automatic scaling, and costs that track usage instead of a flat monthly charge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The hardest part of this build was not the AWS services themselves – the documentation is good and the SDKs are solid. The hardest part was the sensor layer: calibration drift, WiFi reconnection logic, and the rest.&lt;br&gt;
The cloud pipeline is the easy part, as long as you get the DynamoDB key schema right upfront, validate at the edge before anything reaches the database, and keep the Lambda functions thin: validate, store, delegate.&lt;br&gt;
The entire stack – ESP32 firmware, Lambda functions, CDK infrastructure, React dashboard – is the AquaChain project. Happy to dive deeper into any of these parts.&lt;br&gt;
&lt;em&gt;Built with: ESP32, AWS IoT Core, Lambda (Python 3.11), DynamoDB, XGBoost, AWS CDK, React 19&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>iot</category>
      <category>serverless</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
