DEV Community: Roberto Belotti

How I Built a Drone-Based Crack Detection Pipeline on AWS

Roberto Belotti — Tue, 19 May 2026 15:30:00 +0000

I needed to build a pipeline that takes drone footage of infrastructure (bridges, facades, roads), detects surface defects like cracks and corrosion, and delivers actionable reports to engineers who don't care about ML.

Sounds straightforward. It wasn't.

This article walks through the architecture decisions, the Python code that ties it all together, and the lessons I learned about what happens when computer vision meets real-world AWS constraints.

The problem (and why it's not a model problem)

Let me be clear upfront: this project is not about training a state-of-the-art detection model. A YOLOv8 checkpoint fine-tuned on a public defect dataset is a solid starting point, and the stock COCO weights (which have no crack or corrosion classes) are enough to wire up and test the plumbing. The hard part is everything else.

When a drone lands after a 15-minute inspection flight, you have:

Hundreds of high-resolution images (4K, 8-12 MB each)
GPS metadata embedded in EXIF
No guarantee of consistent lighting, angle, or overlap
An engineer waiting for a report, not a folder of annotated JPEGs

The real engineering challenge is the pipeline: ingest, process, store, report. And the architecture decision that shapes everything else: where does inference run?

Edge vs cloud vs hybrid (the decision that changes everything)

I explored three options before writing a single line of code.

Option A: Full edge inference. Run YOLO on a Jetson Nano strapped to the drone. Process frames in real-time, store results on an SD card, download after landing. Pros: zero connectivity dependency, immediate triage. Cons: 5W power budget, thermal throttling at altitude, model size limited to what fits on 4GB RAM. And you only see results when the drone is back on the ground.

Option B: Full cloud inference. Upload raw frames to S3, trigger a Lambda (or Fargate task) to run detection, store results in DynamoDB. Pros: unlimited compute, easy to swap models, centralized results. Cons: you need connectivity during or after flight, and processing 500 images at 8MB each means moving ~4GB to the cloud before anything happens.

Option C: Hybrid (the one I want to end up with). Lightweight triage on-device flags "interesting" frames during flight. After landing, the full-resolution flagged images get uploaded to S3 and processed by a beefier model in the cloud. Best of both: fast triage, accurate detection, no wasted bandwidth on clear sky shots.

For this project I went with a simplified version of Option B (cloud-only), because the primary use case is batch processing of post-flight image dumps. The edge component is a future iteration.
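
The bandwidth trade-off between Option B and Option C is easy to sanity-check with arithmetic. The sketch below estimates upload volume and time for one flight. The 20 Mbps uplink and the 175 flagged frames (35% of 500) are illustrative assumptions, not measurements from this project.

```python
def upload_estimate(
    num_images: int, mb_per_image: float, uplink_mbps: float
) -> tuple[float, float]:
    """Return (total gigabytes, upload time in minutes)."""
    total_mb = num_images * mb_per_image
    seconds = (total_mb * 8) / uplink_mbps  # megabytes -> megabits
    return total_mb / 1000, seconds / 60


# Option B: upload every frame (500 images at 8 MB each).
full_gb, full_minutes = upload_estimate(500, 8.0, uplink_mbps=20.0)

# Option C: upload only frames flagged by on-device triage
# (assumed here to be 175 of 500).
flagged_gb, flagged_minutes = upload_estimate(175, 8.0, uplink_mbps=20.0)
```

With these numbers the full upload is 4 GB and takes roughly 27 minutes, while the flagged subset is 1.4 GB and takes roughly 9.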

Architecture overview

Three S3 prefixes (raw/, annotated/, reports/) and one processing function. No database, no orchestrator, no Step Functions. Deliberately simple.

Project structure

drone-defect-detector/
├── src/
│   ├── __init__.py
│   ├── detector.py          # YOLOv8 inference wrapper
│   ├── pipeline.py          # Orchestrates ingest → detect → report
│   ├── report.py            # Generates annotated images + summary
│   ├── s3.py                # S3 upload/download helpers
│   └── models.py            # Pydantic models for detections
├── lambda/
│   └── handler.py           # Lambda entry point
├── tests/
│   ├── test_detector.py
│   ├── test_pipeline.py
│   └── conftest.py
├── Dockerfile
├── pyproject.toml
└── README.md

The detection wrapper

The first thing I built was a thin wrapper around Ultralytics YOLOv8. The goal: isolate the ML dependency behind a clean interface so the rest of the pipeline doesn't care what model runs underneath.

# src/detector.py
from dataclasses import dataclass
from pathlib import Path

import cv2
import numpy as np
from ultralytics import YOLO


@dataclass(frozen=True)
class Detection:
    """A single detected defect."""
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2
    area_px: int


class DefectDetector:
    """Wraps YOLOv8 for structural defect detection."""

    # Defect classes we care about. These labels only exist in a
    # fine-tuned checkpoint; the stock COCO weights have none of them,
    # so yolov8n.pt is just a placeholder for wiring up the pipeline.
    DEFECT_CLASSES = {"crack", "corrosion", "spalling", "delamination"}

    def __init__(
        self,
        model_path: str = "yolov8n.pt",
        confidence_threshold: float = 0.4,
        device: str = "cpu",
    ) -> None:
        self._model = YOLO(model_path)
        self._conf_threshold = confidence_threshold
        self._device = device

    def detect(self, image_path: Path) -> list[Detection]:
        """Run inference on a single image. Returns detected defects."""
        results = self._model.predict(
            source=str(image_path),
            conf=self._conf_threshold,
            device=self._device,
            verbose=False,
        )

        detections: list[Detection] = []
        for result in results:
            for box in result.boxes:
                label = result.names[int(box.cls)]
                if label not in self.DEFECT_CLASSES:
                    continue

                x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
                detections.append(
                    Detection(
                        label=label,
                        confidence=float(box.conf),
                        bbox=(x1, y1, x2, y2),
                        area_px=(x2 - x1) * (y2 - y1),
                    )
                )

        return detections

    def annotate(
        self, image_path: Path, detections: list[Detection]
    ) -> np.ndarray:
        """Draw bounding boxes on the image. Returns annotated frame."""
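        img = cv2.imread(str(image_path))
        if img is None:
            # cv2.imread returns None instead of raising on a bad path/file
            raise ValueError(f"Could not read image: {image_path}")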

        colors = {
            "crack": (0, 0, 255),       # red
            "corrosion": (0, 165, 255),  # orange
            "spalling": (0, 255, 255),   # yellow
            "delamination": (255, 0, 0), # blue
        }

        for det in detections:
            color = colors.get(det.label, (0, 255, 0))
            x1, y1, x2, y2 = det.bbox
            cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)

            text = f"{det.label} {det.confidence:.0%}"
            (tw, th), _ = cv2.getTextSize(
                text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1
            )
            cv2.rectangle(
                img, (x1, y1 - th - 8), (x1 + tw + 4, y1), color, -1
            )
            cv2.putText(
                img, text, (x1 + 2, y1 - 4),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1,
            )

        return img

A few notes on this:

Why yolov8n.pt (nano)? Because this runs inside a Lambda or a lightweight Fargate container. The nano variant is 6MB and runs inference in ~50ms on CPU. For a batch pipeline where you're processing hundreds of images post-flight, that's fast enough. If you need better accuracy on fine-grained defect types, swap in a fine-tuned checkpoint (the interface doesn't change).

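Why filter by DEFECT_CLASSES? The COCO-pretrained model detects 80 everyday object classes (person, car, dog and so on), and none of them is a structural defect. With the stock yolov8n.pt this filter therefore drops every detection. It only starts returning results once you load a model fine-tuned on a crack/corrosion dataset (like the RDD2022 road damage dataset) whose class names match the set. The architecture is identical either way.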

Why frozen=True on the dataclass? Detections are immutable facts. Once you've detected a crack at coordinates (x1, y1, x2, y2) with confidence 0.87, that shouldn't change downstream.
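
To make the immutability point concrete, here is a minimal standalone sketch. It redefines the same Detection dataclass so it can run on its own; the values are made up for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2
    area_px: int


det = Detection("crack", 0.87, (10, 20, 110, 60), 100 * 40)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    det.confidence = 0.99
    mutated = True
except FrozenInstanceError:
    mutated = False

# If a downstream step needs a different value, it has to build a new
# object explicitly; the original stays untouched.
rescored = replace(det, confidence=0.9)
```

A side benefit: frozen dataclasses with eq=True (the default) are hashable, so detections can go into sets or serve as dict keys.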

The pipeline: from S3 event to report

The pipeline orchestrates the full flow: download images from S3, run detection, generate annotated outputs, upload results.

# src/pipeline.py
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

import cv2

from .detector import DefectDetector, Detection
from .report import ReportGenerator
from .s3 import S3Client

logger = logging.getLogger(__name__)


class InspectionPipeline:
    """End-to-end: S3 download → detection → annotation → upload."""

    def __init__(
        self,
        bucket: str,
        model_path: str = "yolov8n.pt",
        confidence_threshold: float = 0.4,
    ) -> None:
        self._bucket = bucket
        self._detector = DefectDetector(
            model_path=model_path,
            confidence_threshold=confidence_threshold,
        )
        self._s3 = S3Client(bucket)
        self._report = ReportGenerator()

    def process_inspection(
        self, inspection_id: str, raw_prefix: str, work_dir: Path
    ) -> dict:
        """Process all images in an S3 prefix. Returns summary."""
        images_dir = work_dir / "images"
        output_dir = work_dir / "output"
        images_dir.mkdir(parents=True, exist_ok=True)
        output_dir.mkdir(parents=True, exist_ok=True)

        # 1. Download raw images
        image_keys = self._s3.list_images(raw_prefix)
        logger.info(
            "Inspection %s: found %d images", inspection_id, len(image_keys)
        )

        local_paths = []
        for key in image_keys:
            local_path = images_dir / Path(key).name
            self._s3.download(key, local_path)
            local_paths.append(local_path)

        # 2. Run detection on each image
        all_results: dict[str, list[Detection]] = {}
        total_defects = 0

        for path in local_paths:
            detections = self._detector.detect(path)
            all_results[path.name] = detections
            total_defects += len(detections)

            if detections:
                # Generate annotated image
                annotated = self._detector.annotate(path, detections)
                annotated_path = output_dir / f"annotated_{path.name}"
                cv2.imwrite(str(annotated_path), annotated)

                # Upload annotated image
                self._s3.upload(
                    annotated_path,
                    f"annotated/{inspection_id}/{annotated_path.name}",
                )

            logger.info(
                "  %s: %d defects found", path.name, len(detections)
            )

        # 3. Generate summary report
        summary = self._build_summary(inspection_id, all_results)
        report_path = output_dir / f"report_{inspection_id}.json"
        report_path.write_text(json.dumps(summary, indent=2))

        self._s3.upload(
            report_path,
            f"reports/{inspection_id}/report.json",
        )

        # 4. Generate visual report (matplotlib)
        chart_path = self._report.generate_charts(
            summary, output_dir / f"charts_{inspection_id}.png"
        )
        self._s3.upload(
            chart_path,
            f"reports/{inspection_id}/charts.png",
        )

        logger.info(
            "Inspection %s complete: %d images, %d defects",
            inspection_id,
            len(local_paths),
            total_defects,
        )

        return summary

    def _build_summary(
        self,
        inspection_id: str,
        results: dict[str, list[Detection]],
    ) -> dict:
        """Build a structured summary from detection results."""
        defect_counts: dict[str, int] = {}
        high_severity: list[dict] = []

        for filename, detections in results.items():
            for det in detections:
                defect_counts[det.label] = (
                    defect_counts.get(det.label, 0) + 1
                )
                if det.confidence >= 0.75:
                    high_severity.append(
                        {
                            "file": filename,
                            "label": det.label,
                            "confidence": round(det.confidence, 3),
                            "bbox": det.bbox,
                            "area_px": det.area_px,
                        }
                    )

        return {
            "inspection_id": inspection_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "total_images": len(results),
            "images_with_defects": sum(
                1 for dets in results.values() if dets
            ),
            "total_defects": sum(len(d) for d in results.values()),
            "defect_counts": defect_counts,
            "high_severity_detections": high_severity,
        }

The report generator

Engineers don't want JSON. They want a chart that says "this bridge has 14 cracks, mostly on the north face, and 3 of them are high-confidence."

# src/report.py
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (Lambda has no display)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker


class ReportGenerator:
    """Generates visual reports from inspection summaries."""

    def generate_charts(self, summary: dict, output_path: Path) -> Path:
        """Create a summary chart with defect distribution."""
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        fig.suptitle(
            f"Inspection Report — {summary['inspection_id']}",
            fontsize=14,
            fontweight="bold",
        )

        # Chart 1: Defect counts by type
        defect_counts = summary.get("defect_counts", {})
        if defect_counts:
            labels = list(defect_counts.keys())
            values = list(defect_counts.values())
            colors = self._get_colors(labels)

            bars = axes[0].barh(labels, values, color=colors)
            axes[0].set_xlabel("Count")
            axes[0].set_title("Defects by Type")
            axes[0].xaxis.set_major_locator(
                ticker.MaxNLocator(integer=True)
            )

            for bar, val in zip(bars, values):
                axes[0].text(
                    bar.get_width() + 0.2, bar.get_y() + bar.get_height() / 2,
                    str(val), va="center", fontsize=10,
                )
        else:
            axes[0].text(
                0.5, 0.5, "No defects detected",
                ha="center", va="center", transform=axes[0].transAxes,
            )

        # Chart 2: Coverage overview
        total = summary["total_images"]
        with_defects = summary["images_with_defects"]
        clean = total - with_defects

        axes[1].pie(
            [with_defects, clean],
            labels=["With defects", "Clean"],
            autopct="%1.0f%%",
            colors=["#e74c3c", "#2ecc71"],
            startangle=90,
        )
        axes[1].set_title(
            f"Image Coverage ({total} images)"
        )

        plt.tight_layout()
        plt.savefig(output_path, dpi=150, bbox_inches="tight")
        plt.close()

        return output_path

    @staticmethod
    def _get_colors(labels: list[str]) -> list[str]:
        color_map = {
            "crack": "#e74c3c",
            "corrosion": "#e67e22",
            "spalling": "#f1c40f",
            "delamination": "#3498db",
        }
        return [color_map.get(label, "#95a5a6") for label in labels]

S3 helpers (the boring part that matters)

# src/s3.py
import logging
from pathlib import Path

import boto3
from botocore.config import Config

logger = logging.getLogger(__name__)

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".bmp"}


class S3Client:
    """Thin wrapper around boto3 S3 operations."""

    def __init__(self, bucket: str) -> None:
        self._bucket = bucket
        self._client = boto3.client(
            "s3",
            config=Config(
                retries={"max_attempts": 3, "mode": "adaptive"}
            ),
        )

    def list_images(self, prefix: str) -> list[str]:
        """List image keys under a prefix."""
        paginator = self._client.get_paginator("list_objects_v2")
        keys: list[str] = []

        for page in paginator.paginate(
            Bucket=self._bucket, Prefix=prefix
        ):
            for obj in page.get("Contents", []):
                if Path(obj["Key"]).suffix.lower() in IMAGE_EXTENSIONS:
                    keys.append(obj["Key"])

        return sorted(keys)

    def download(self, key: str, local_path: Path) -> None:
        """Download a single object to a local file."""
        logger.debug("Downloading s3://%s/%s", self._bucket, key)
        self._client.download_file(self._bucket, key, str(local_path))

    def upload(self, local_path: Path, key: str) -> None:
        """Upload a local file to S3."""
        content_type = self._guess_content_type(local_path)
        logger.debug("Uploading to s3://%s/%s", self._bucket, key)
        self._client.upload_file(
            str(local_path),
            self._bucket,
            key,
            ExtraArgs={"ContentType": content_type},
        )

    @staticmethod
    def _guess_content_type(path: Path) -> str:
        mapping = {
            ".json": "application/json",
            ".png": "image/png",
            ".jpg": "image/jpeg",
            ".jpeg": "image/jpeg",
        }
        return mapping.get(path.suffix.lower(), "application/octet-stream")

Lambda handler

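The glue that connects S3 events to the pipeline. When images land in the raw/ prefix, this fires. Keep in mind that S3 sends one event per uploaded object and the dedup set in the handler only covers records within a single invocation, so a bulk upload can process the same inspection more than once.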

# lambda/handler.py
import json
import logging
import os
import tempfile
from pathlib import Path
from urllib.parse import unquote_plus

from src.pipeline import InspectionPipeline

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event: dict, context) -> dict:
    """Lambda entry point. Triggered by S3 PutObject events."""
    bucket = os.environ["BUCKET_NAME"]
    model_path = os.environ.get("MODEL_PATH", "yolov8n.pt")
    confidence = float(os.environ.get("CONFIDENCE_THRESHOLD", "0.4"))

    pipeline = InspectionPipeline(
        bucket=bucket,
        model_path=model_path,
        confidence_threshold=confidence,
    )

    # Extract inspection ID from the S3 key
    # Expected format: raw/{inspection_id}/image_001.jpg
    records = event.get("Records", [])
    processed_inspections = set()

    for record in records:
        key = unquote_plus(record["s3"]["object"]["key"])
        parts = key.split("/")

        if len(parts) < 3 or parts[0] != "raw":
            logger.warning("Unexpected key format: %s", key)
            continue

        inspection_id = parts[1]
        if inspection_id in processed_inspections:
            continue

        with tempfile.TemporaryDirectory() as tmp:
            summary = pipeline.process_inspection(
                inspection_id=inspection_id,
                raw_prefix=f"raw/{inspection_id}/",
                work_dir=Path(tmp),
            )

        processed_inspections.add(inspection_id)
        logger.info(
            "Processed inspection %s: %s",
            inspection_id,
            json.dumps(summary, default=str),
        )

    return {
        "statusCode": 200,
        "body": json.dumps(
            {"processed": list(processed_inspections)}
        ),
    }
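
Because S3 notifies per object, one way to run each inspection exactly once is to start processing only when a marker object arrives after the last image. The sketch below shows just the event-parsing side of that idea. The marker name `_COMPLETE` and the function `inspections_to_process` are hypothetical and not part of the repository; the key layout is the raw/{inspection_id}/... convention used above.

```python
from urllib.parse import unquote_plus

MARKER_NAME = "_COMPLETE"  # hypothetical "upload finished" marker object


def inspections_to_process(event: dict) -> list[str]:
    """Return inspection IDs whose completion marker appears in the event."""
    ready: list[str] = []
    for record in event.get("Records", []):
        # Keys in S3 events are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        parts = key.split("/")
        if len(parts) != 3 or parts[0] != "raw":
            continue
        inspection_id, filename = parts[1], parts[2]
        if filename == MARKER_NAME and inspection_id not in ready:
            ready.append(inspection_id)
    return ready


sample_event = {
    "Records": [
        {"s3": {"object": {"key": "raw/bridge+42/image_001.jpg"}}},
        {"s3": {"object": {"key": "raw/bridge+42/_COMPLETE"}}},
        {"s3": {"object": {"key": "reports/bridge+42/report.json"}}},
    ]
}
ready_ids = inspections_to_process(sample_event)
```

Image uploads are ignored and only the marker starts a run, so the uploader decides when an inspection is complete.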

Why Lambda might be wrong here (and what I'd use instead)

This is the part where I second-guess my own architecture.

Lambda with a container image works for small inspections (50-100 images). But it has hard limits:

15-minute timeout. Processing 500 high-res images with YOLOv8 nano on CPU takes ~25 seconds of pure inference, but add S3 downloads, annotation, uploads, and report generation, and you're looking at 3-5 minutes. Doable, but tight for larger inspections.
10 GB ephemeral storage at most (the default is 512 MB, so you have to raise it explicitly). 500 images at 8 MB each = 4 GB just for the raw files. Add annotated outputs and you're flirting with the limit.
No GPU. Lambda doesn't offer GPUs. YOLOv8 nano on CPU is fine, but if you want to run a larger model (yolov8m, yolov8l) for better accuracy, you need GPU-backed compute such as ECS on EC2 GPU instances, AWS Batch, or a SageMaker endpoint. Fargate lifts the timeout and storage limits but, as far as I know, doesn't offer GPUs either.

For production at scale, I'd swap the detection Lambda for a container task: Fargate if CPU inference is enough, ECS on GPU-backed EC2 instances if it isn't. The trigger stays the same (S3 event → SQS → container task), and you get no 15-minute cap, more storage, and (on EC2) GPU access.

The pipeline code doesn't change at all. That's the whole point of keeping the infrastructure concerns (Lambda handler, S3 events) separate from the business logic (detector, pipeline, report).
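
The limits above can be turned into a quick back-of-the-envelope check before choosing the compute target. This is a rough sketch: the 50 ms inference time and 8 MB image size come from the figures in this article, while the per-image I/O overhead (0.45 s) and the share of images that get an annotated copy (30%) are assumptions picked to land inside the 3-5 minute range mentioned above.

```python
LAMBDA_TIMEOUT_S = 15 * 60        # hard Lambda timeout
LAMBDA_MAX_EPHEMERAL_MB = 10_240  # maximum configurable /tmp size


def lambda_budget(
    num_images: int,
    mb_per_image: float = 8.0,
    inference_s: float = 0.05,        # ~50 ms per image on CPU
    io_s_per_image: float = 0.45,     # assumed download/annotate/upload cost
    annotated_fraction: float = 0.3,  # assumed share of images with defects
) -> dict:
    """Estimate runtime and disk use, and whether both fit in Lambda."""
    runtime_s = num_images * (inference_s + io_s_per_image)
    storage_mb = num_images * mb_per_image * (1 + annotated_fraction)
    return {
        "runtime_s": runtime_s,
        "storage_mb": storage_mb,
        "fits": runtime_s < LAMBDA_TIMEOUT_S
        and storage_mb < LAMBDA_MAX_EPHEMERAL_MB,
    }


small = lambda_budget(500)   # about 250 s and 5.2 GB: fits
large = lambda_budget(2000)  # about 1000 s and 20.8 GB: does not fit
```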

Dockerfile

FROM public.ecr.aws/lambda/python:3.12

# System deps for OpenCV
RUN dnf install -y mesa-libGL && dnf clean all

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ${LAMBDA_TASK_ROOT}/src/
COPY lambda/ ${LAMBDA_TASK_ROOT}/

CMD ["handler.handler"]

# requirements.txt
ultralytics>=8.2.0,<9.0.0
opencv-python-headless>=4.9.0,<5.0.0
boto3>=1.34.0,<2.0.0
matplotlib>=3.8.0,<4.0.0
pydantic>=2.6.0,<3.0.0

Note: opencv-python-headless, not opencv-python. The headless variant doesn't pull in Qt/GTK dependencies, which saves ~200 MB in the container image and avoids display-related errors in Lambda.

Testing with moto (no AWS account required)

# tests/conftest.py
import pytest
import boto3
from moto import mock_aws


@pytest.fixture
def aws_credentials(monkeypatch):
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    monkeypatch.setenv("AWS_SECURITY_TOKEN", "testing")
    monkeypatch.setenv("AWS_DEFAULT_REGION", "eu-central-1")


@pytest.fixture
def s3_bucket(aws_credentials):
    with mock_aws():
        client = boto3.client("s3", region_name="eu-central-1")
        client.create_bucket(
            Bucket="test-inspection",
            CreateBucketConfiguration={
                "LocationConstraint": "eu-central-1"
            },
        )
        yield "test-inspection"

# tests/test_detector.py
from pathlib import Path
import numpy as np
import cv2
import pytest

from src.detector import DefectDetector


@pytest.fixture
def sample_image(tmp_path: Path) -> Path:
    """Create a synthetic test image with a dark line (simulated crack)."""
    img = np.ones((640, 640, 3), dtype=np.uint8) * 200  # gray background
    cv2.line(img, (100, 100), (400, 300), (30, 30, 30), 3)  # dark line
    path = tmp_path / "test_crack.jpg"
    cv2.imwrite(str(path), img)
    return path


def test_detector_returns_list(sample_image: Path):
    detector = DefectDetector(confidence_threshold=0.1)
    results = detector.detect(sample_image)
    assert isinstance(results, list)


def test_detector_annotate_preserves_dimensions(sample_image: Path):
    detector = DefectDetector(confidence_threshold=0.1)
    detections = detector.detect(sample_image)
    annotated = detector.annotate(sample_image, detections)
    original = cv2.imread(str(sample_image))
    assert annotated.shape == original.shape

Lessons learned

1. The model is the easiest part. I spent maybe 10% of my time on inference code and 90% on the pipeline around it: S3 key conventions, error handling, report formatting, container packaging. If you're building a CV pipeline and you think the hard part is the model, you haven't started the hard part yet.

2. Separate infrastructure from logic. The DefectDetector class doesn't know about S3, Lambda, or AWS. The InspectionPipeline doesn't know about Lambda events. The handler is just glue. This means I can run the exact same pipeline locally (python -m src.pipeline) for testing, or swap the Lambda trigger for Fargate without touching any business logic.

3. opencv-python-headless saves headaches. I lost an hour debugging an import error in Lambda because the full OpenCV package tried to load libGL.so. The headless variant just works. Always use it in server/serverless environments.

4. S3 key conventions are your schema. raw/{inspection_id}/, annotated/{inspection_id}/, reports/{inspection_id}/. Simple, predictable, greppable. No database needed to track which inspection produced which outputs.

5. Confidence thresholds are a product decision, not a technical one. Setting the threshold at 0.4 catches more potential defects but generates more false positives. Setting it at 0.8 is more precise but misses borderline cracks. The right value depends on whether your users prefer "flag everything, I'll triage manually" or "only show me what you're sure about." I made it configurable via environment variable and let the ops team decide.

What's next

Edge triage module: a lightweight ONNX model running on-device that flags frames worth uploading, reducing bandwidth by 60-70%
GPS overlay: extract EXIF GPS data and map defect locations on a geo-referenced grid
Severity scoring: use defect area (in pixels, relative to image resolution) as a proxy for physical size, and flag anything above a threshold
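
As a preview of the severity-scoring idea, here is a minimal sketch of area-relative scoring. The 1% cut-off is a placeholder I picked for the example; a real threshold would need calibration against flight altitude and camera parameters.

```python
def relative_area(
    bbox: tuple[int, int, int, int], image_width: int, image_height: int
) -> float:
    """Fraction of the image covered by a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    box_area = max(0, x2 - x1) * max(0, y2 - y1)
    return box_area / (image_width * image_height)


def is_severe(
    bbox: tuple[int, int, int, int],
    image_width: int,
    image_height: int,
    threshold: float = 0.01,  # placeholder: 1% of the frame
) -> bool:
    return relative_area(bbox, image_width, image_height) >= threshold


# On a 4K frame (3840 x 2160), a 400 x 300 px box covers about 1.4%.
big = is_severe((100, 100, 500, 400), 3840, 2160)
small = is_severe((0, 0, 100, 50), 3840, 2160)
```

Normalizing by image resolution keeps scores comparable across cameras, but it still says nothing about physical size unless the distance to the surface is known.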

The code is on GitHub: github.com/biscolab/drone-defect-detector

I write about cloud architecture, AI in production, and the engineering decisions nobody puts in the README. Follow me on LinkedIn for the short version.

How I Built a Serverless Scanner to Find (and Kill) Zombie AWS Resources

Roberto Belotti — Wed, 13 May 2026 20:48:07 +0000

Every AWS account has zombies.

Not the fun kind. The kind that silently drain your budget while nobody's looking. An EBS volume that was attached to an instance you terminated six months ago. A NAT Gateway routing traffic for a VPC that no longer has any workloads. A Transfer Family SFTP server that was set up for a migration, used once, and forgotten.

I've audited enough accounts to know this isn't an edge case. It's the default. Infrastructure outlives the context that created it. Projects get cancelled, teams move on, POCs never get torn down. But the meter keeps running.

AWS Cost Explorer will tell you what you're spending. It won't tell you why (or whether anyone still needs it). So I built a tool that answers that question.

aws-zombie-hunter is an open-source, container-based Lambda that scans an AWS account for orphaned resources across seven categories, estimates the monthly waste, and writes a structured JSON report to S3.

This article walks through the architecture, the scanner design pattern, the testing strategy, and the things I learned along the way.

Why a Lambda (and not a CLI)

The first version of this in my head was a CLI script. Run it locally, pipe output to a file, done. But that falls apart quickly for anything beyond a hobby project.

A CLI means someone has to remember to run it. It needs credentials on a developer's machine. It doesn't scale to multiple accounts. And when you want to track zombie trends over time (are we getting better at cleaning up, or worse?), you need persistent storage and a schedule.

A Lambda solves all of that. EventBridge triggers it on a schedule (daily, weekly, whatever makes sense). Results go to S3 with a date-based key, so you get historical data for free. IAM handles permissions (read-only, no credentials on laptops). And it costs roughly $0.10/month to run (which is ironic, given that it typically finds hundreds of dollars in waste).

I went with a container-based Lambda (Python 3.12 on the official AWS base image) instead of a zip deployment. The reason is practical: moto, boto3, and the rest of the test/dev tooling push the package well past Lambda's 250 MB (unzipped) limit for zip deployments. A container image gives you up to 10 GB and a clean Dockerfile to reproduce the build. It also shows Docker competence in a portfolio project, which doesn't hurt.

The scanner architecture

The core design problem was: how do you scan seven different AWS resource types without the codebase turning into a 500-line if/elif chain?

The answer is a scanner registry with a common interface. Every scanner is a subclass of BaseScanner, which defines the contract:

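from abc import ABC, abstractmethod

import boto3

# ResourceType and ZombieResource are the project's own model types
# (their definitions are not shown in this article).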

class BaseScanner(ABC):
    VERSION: str = "1.0.0"

    def __init__(self, session: boto3.Session, regions: list[str]):
        self.session = session
        self.regions = regions

    @property
    @abstractmethod
    def resource_type(self) -> ResourceType:
        ...

    @abstractmethod
    def scan(self) -> list[ZombieResource]:
        ...

Each scanner knows how to detect zombies for one resource type. The EIPScanner looks for Elastic IPs not associated with a running instance. The EBSScanner finds unattached volumes and snapshots not linked to any AMI. The TransferFamilyScanner checks for SFTP servers with zero configured users (or zero file transfers in the last 30 days via CloudWatch metrics).

A registry module auto-discovers all BaseScanner subclasses and runs them in parallel using ThreadPoolExecutor. This keeps the handler clean:

def lambda_handler(event, context):
    config = load_config()
    session = boto3.Session()

    scanners = ScannerRegistry.discover()
    results = ScannerRegistry.run_all(
        scanners, session, config.regions
    )

    report = ScanResult(
        zombies=results.zombies,
        errors=results.errors,
        regions_scanned=config.regions,
        # ...
    )

    save_to_s3(report, config.bucket, config.prefix)

    if config.sns_topic:
        notify(report.summary(), config.sns_topic)

    return report.summary()

Adding a new scanner means creating a new file in src/scanner/, subclassing BaseScanner, and implementing scan(). The registry picks it up automatically. No handler changes, no configuration updates.
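
The registry itself isn't shown here, so the following is a minimal sketch of how subclass discovery plus a thread pool could work, not the project's actual implementation. The scanners are stand-ins that return strings instead of ZombieResource objects, and `discover` / `run_all` are free functions rather than ScannerRegistry methods.

```python
import inspect
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class BaseScanner(ABC):
    def __init__(self, regions: list[str]) -> None:
        self.regions = regions

    @abstractmethod
    def scan(self) -> list[str]:
        ...


class EIPScanner(BaseScanner):
    def scan(self) -> list[str]:
        return [f"eip-in-{region}" for region in self.regions]


class EBSScanner(BaseScanner):
    def scan(self) -> list[str]:
        return [f"vol-in-{region}" for region in self.regions]


def discover() -> list[type[BaseScanner]]:
    """Walk the subclass tree and return every concrete scanner class."""
    found: list[type[BaseScanner]] = []
    stack = list(BaseScanner.__subclasses__())
    while stack:
        cls = stack.pop()
        stack.extend(cls.__subclasses__())
        if not inspect.isabstract(cls):
            found.append(cls)
    return sorted(found, key=lambda c: c.__name__)


def run_all(
    scanner_classes: list[type[BaseScanner]], regions: list[str]
) -> list[str]:
    """Run every scanner in its own thread and flatten the results."""
    scanners = [cls(regions) for cls in scanner_classes]
    with ThreadPoolExecutor(max_workers=len(scanners) or 1) as pool:
        batches = pool.map(lambda scanner: scanner.scan(), scanners)
        return [zombie for batch in batches for zombie in batch]


zombies = run_all(discover(), ["us-east-1"])
```

One caveat with this approach: `__subclasses__()` only sees classes whose modules have already been imported, so a real registry also needs to import every module in the scanner package first (for example with pkgutil).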

What it scans (and how it decides what's "dead")

Seven resource types, each with specific detection criteria:

EC2 Instances — state is stopped. A stopped instance still incurs EBS storage costs for its root and attached volumes, and it's almost always a forgotten resource.

EBS Volumes — not attached to any instance. The volume exists, it has data (maybe), but nothing is using it. Also catches old snapshots not linked to any active AMI.

Elastic IPs — allocated but not associated with a running instance. Since the February 2024 pricing change AWS bills every public IPv4 address at $0.005 per hour (about $3.65/month each), whether or not it is attached, so even a handful of idle ones adds up.

Transfer Family Servers — SFTP/FTPS servers with zero configured users, or (via CloudWatch) zero FilesIn/FilesOut in the last 30 days. These are expensive ($219/month base cost for a public endpoint) and easy to forget after a one-time migration.

RDS Instances — state is stopped. AWS automatically restarts stopped RDS instances after 7 days (a detail many people miss), so a "stopped" RDS instance is either very recent or has been cycling through stop/auto-restart for months.

Load Balancers — ALBs and NLBs with zero healthy targets. The load balancer exists, it's routing nothing, and it's costing you ~$16/month in hourly charges alone, before any capacity-unit (LCU) charges.

NAT Gateways — present in a subnet but no active routes point to them in the route table. At $32/month plus data processing charges, an orphaned NAT Gateway is one of the more expensive zombies.
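
To show what one of these rules looks like in code, here is a simplified sketch of the Elastic IP check. It works on a dict shaped like the EC2 `describe_addresses` response and treats any address without an `AssociationId` as idle. That is narrower than the rule described above, which also considers the state of the attached instance, and the function name is my own.

```python
def find_idle_eips(describe_addresses_response: dict) -> list[str]:
    """Return allocation IDs of Elastic IPs that are not associated."""
    idle: list[str] = []
    for address in describe_addresses_response.get("Addresses", []):
        # An associated address carries an AssociationId; an idle one doesn't.
        if "AssociationId" not in address:
            idle.append(address.get("AllocationId", address["PublicIp"]))
    return idle


# Hand-written sample in the shape of a describe_addresses response.
sample_response = {
    "Addresses": [
        {
            "PublicIp": "203.0.113.10",
            "AllocationId": "eipalloc-aaa111",
            "AssociationId": "eipassoc-bbb222",
            "InstanceId": "i-0123456789abcdef0",
        },
        {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-ccc333"},
    ]
}
idle_eips = find_idle_eips(sample_response)
```

Keeping the rule as a pure function over the response dict makes it testable without any AWS mocking.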

Cost estimation: good enough, not perfect

Every zombie gets an estimated_monthly_cost_usd field. This isn't meant to match your bill to the cent. It's meant to make you go "wait, we're wasting how much?"

The estimation uses a static prices.json file with base prices per resource type and regional multipliers. A stopped t3.medium in us-east-1 gets a different EBS cost estimate than one in ap-southeast-1. It's approximate, but consistently approximate (and that's what matters for prioritization).

I considered pulling real-time pricing from the AWS Price List API, but it's slow, complex, and overkill for a scanner that runs once a day. The static file approach means the Lambda has no external dependencies at scan time (no API calls that could fail or add latency) and the prices are easy to update with a PR.
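
A minimal sketch of the static-price lookup follows. The JSON layout, key names and multiplier values are illustrative assumptions; the real prices.json may be structured differently.

```python
import json

# Illustrative stand-in for prices.json; layout and values are assumptions.
PRICES_JSON = """
{
  "base_monthly_usd": {
    "elastic_ip": 3.65,
    "nat_gateway": 32.85,
    "ebs_gp3_per_gb": 0.08
  },
  "regional_multipliers": {
    "us-east-1": 1.0,
    "ap-southeast-1": 1.2
  }
}
"""
PRICES = json.loads(PRICES_JSON)


def estimate_monthly_cost(
    resource_type: str, region: str, quantity: float = 1.0
) -> float:
    """Base price x regional multiplier x quantity, rounded to cents."""
    base = PRICES["base_monthly_usd"][resource_type]
    # Regions missing from the file fall back to the base price.
    multiplier = PRICES["regional_multipliers"].get(region, 1.0)
    return round(base * multiplier * quantity, 2)


# A 100 GB orphaned gp3 volume in two regions.
cost_us = estimate_monthly_cost("ebs_gp3_per_gb", "us-east-1", quantity=100)
cost_sg = estimate_monthly_cost("ebs_gp3_per_gb", "ap-southeast-1", quantity=100)
```

The same volume comes out at 8.00 USD in us-east-1 and 9.60 USD in ap-southeast-1 with these placeholder numbers, which is enough resolution to rank zombies by waste.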

The S3 output format

Reports land in S3 with a date-based key pattern: {prefix}{YYYY-MM-DD}.json. Running the scan twice on the same day overwrites the previous result (last scan wins, by design).
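
The key pattern is simple enough to sketch in a few lines. The function name and the example prefix are hypothetical.

```python
from __future__ import annotations

from datetime import date, datetime, timezone


def report_key(prefix: str, day: date | None = None) -> str:
    """Build the S3 key {prefix}{YYYY-MM-DD}.json, defaulting to today (UTC)."""
    day = day or datetime.now(timezone.utc).date()
    return f"{prefix}{day.isoformat()}.json"


key = report_key("zombie-reports/", date(2026, 5, 12))
# key == "zombie-reports/2026-05-12.json"
```

Because the date is the only variable part, two scans on the same day produce the same key, which gives the overwrite behaviour described above. ISO dates also sort lexicographically, so listing the prefix returns reports in chronological order.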

The JSON structure is designed for downstream consumption. A future report-generator Lambda (not built yet) can read these files to produce trend charts. The format includes a top-level summary (total zombies, total waste, breakdown by severity and type) plus the full list of zombie resources with all the metadata you'd need to act on them:

{
  "scan_id": "a1b2c3d4-...",
  "account_id": "123456789012",
  "scan_timestamp": "2026-05-12T06:00:12Z",
  "regions_scanned": ["us-east-1", "eu-west-1", "eu-central-1"],
  "total_monthly_waste_usd": 1247.60,
  "total_zombies": 12,
  "summary": {
    "by_severity": { "low": 3, "medium": 5, "high": 3, "critical": 1 },
    "by_type": {
      "ec2_instance": 2,
      "ebs_volume": 4,
      "elastic_ip": 2,
      "transfer_server": 1
    }
  },
  "zombies": [
    {
      "resource_type": "transfer_server",
      "resource_id": "s-0abc1234def567890",
      "region": "eu-west-1",
      "reason": "Transfer Family server with 0 configured users",
      "estimated_monthly_cost_usd": 219.00,
      "severity": "high",
      "age_days": 427,
      "recommended_action": "Review and terminate if unused"
    }
  ],
  "errors": []
}

The errors array is important. If one scanner fails (say, you don't have transfer:List* permissions), the Lambda doesn't crash. It logs the error, adds it to the report, and continues with the remaining scanners. Partial results are better than no results.
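A minimal sketch of that isolation, assuming each scanner is a callable that returns a list of findings. The function names and the shape of an error entry are hypothetical, not taken from the repo:

```python
import logging
from typing import Callable

logger = logging.getLogger("zombie_hunter")  # logger name is an assumption


def run_scanners(scanners: dict[str, Callable[[], list[dict]]]) -> dict:
    """Run every scanner; a failure is recorded in `errors` instead of aborting the scan."""
    zombies: list[dict] = []
    errors: list[dict] = []
    for name, scan in scanners.items():
        try:
            zombies.extend(scan())
        except Exception as exc:  # broad on purpose: one bad scanner must not stop the rest
            logger.warning("scanner %s failed: %s", name, exc)
            errors.append({"scanner": name, "error": f"{type(exc).__name__}: {exc}"})
    return {"total_zombies": len(zombies), "zombies": zombies, "errors": errors}


def failing_transfer_scan() -> list[dict]:
    """Simulates a scanner that lacks the IAM permission it needs."""
    raise PermissionError("not authorized to perform transfer:ListServers")


report = run_scanners({
    "ebs_volume": lambda: [{"resource_type": "ebs_volume", "resource_id": "vol-0123"}],
    "transfer_server": failing_transfer_scan,
})
print(report["total_zombies"], report["errors"])
```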

Testing with moto (62 tests, 90% coverage)

This project has a test suite I'm actually proud of. Every scanner has its own test file, every detection rule has a positive and negative test case, and the full handler flow is integration-tested end to end.

The secret weapon is moto, which mocks AWS services in-process. No LocalStack, no Docker containers for tests, no real AWS calls. You decorate a test with @mock_aws, create fake resources, run the scanner, and assert what it found:

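import boto3
from moto import mock_aws

# EBSScanner is the project's own scanner class; import it from its module in the repo.


@mock_aws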
def test_finds_unattached_ebs_volumes():
    ec2 = boto3.client("ec2", region_name="us-east-1")
    volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=100,
        VolumeType="gp3"
    )

    scanner = EBSScanner(
        session=boto3.Session(), regions=["us-east-1"]
    )
    zombies = scanner.scan()

    assert len(zombies) == 1
    assert zombies[0].resource_id == volume["VolumeId"]
    assert zombies[0].estimated_monthly_cost_usd > 0


@mock_aws
def test_ignores_attached_volumes():
    # Create instance, volume gets attached automatically
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-test",
        InstanceType="t3.medium",
        MinCount=1, MaxCount=1
    )

    scanner = EBSScanner(
        session=boto3.Session(), regions=["us-east-1"]
    )
    zombies = scanner.scan()

    # Root volume is attached, so no zombies
    assert len(zombies) == 0

This pattern scales. Each scanner gets a test_finds_* and test_ignores_* pair at minimum, plus edge cases (multi-region, tagged resources, empty regions). The handler integration test mocks S3 and SNS too, verifying the full flow from event to stored report.

Quality gates: mypy --strict with zero errors, ruff for linting and formatting, pytest --cov for the 90% coverage floor. All of this runs in CI before any merge.

Deployment: SAM with a two-phase pattern

The infrastructure is defined in a SAM template.yaml that provisions:

The Lambda function (container image from ECR)
An S3 bucket for reports
An EventBridge rule for scheduled scanning
IAM roles with read-only permissions for the scanned services (EC2, EBS, ELB, RDS, Transfer, CloudWatch), plus S3 write access limited to the report bucket
An optional SNS topic for notifications

Deployment is sam build && sam deploy --guided. The ECR repository for the container image needs to exist before the first deploy, so there's a bootstrapping step documented in the README.

One thing I'd flag: the IAM policy is deliberately read-only for all scanned services. The tool finds zombies. It doesn't kill them. That's a conscious decision (you don't want an automated tool terminating resources without human review, even if they look dead).

What I learned (and what I'd change)

ThreadPoolExecutor was the right call. Scanners are I/O-bound (API calls to AWS), so threads spend their waiting time in parallel instead of one after another. The full scan across seven resource types and three regions takes about 45 seconds. Without parallelism, it was closer to three minutes, so roughly a 4x speedup.
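To see why threads help with I/O-bound work, here is a small self-contained sketch. The sleep stands in for waiting on AWS API responses, and the scanner names and delay are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scan(name: str, delay: float = 0.1) -> str:
    """Stand-in for a scanner that spends its time waiting on API responses."""
    time.sleep(delay)  # sleeping releases the GIL, much like blocking network I/O
    return name


names = ["ec2", "ebs", "eip", "elb", "rds", "transfer", "nat"]

start = time.perf_counter()
sequential = [fake_scan(n) for n in names]
sequential_s = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    parallel = list(pool.map(fake_scan, names))  # map preserves input order
parallel_s = time.perf_counter() - start

print(f"sequential {sequential_s:.2f}s, threaded {parallel_s:.2f}s")
```

Real scans rarely hit the ideal speedup, since scanners differ in duration and AWS throttles API calls, but the waiting still overlaps.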

Static pricing beats dynamic pricing for this use case. I initially tried to pull real-time prices from the AWS Price List Bulk API. The API is slow, the response format is bizarre (nested JSON with string-encoded numbers), and the latency added 10+ seconds per scan. A prices.json file that gets updated once a quarter is simpler and good enough.

The scanner registry pattern pays for itself. When I wanted to add NAT Gateway scanning after the initial six scanners, it took exactly one new file and one test file. Zero changes to the handler, zero changes to the registry. That's the kind of extensibility that makes a project maintainable.
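One way to get that kind of self-registration in Python is __init_subclass__. This is a simplified sketch with hypothetical class names and no constructor arguments, not the repo's actual registry:

```python
class BaseScanner:
    """Subclasses register themselves; the handler only ever iterates the registry."""

    registry: dict[str, type] = {}
    resource_type: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        BaseScanner.registry[cls.resource_type] = cls

    def scan(self) -> list[dict]:
        raise NotImplementedError


class EBSVolumeScanner(BaseScanner):
    resource_type = "ebs_volume"

    def scan(self) -> list[dict]:
        return []  # real detection logic would go here


# Adding a scanner later is one new class; the handler below stays untouched.
class NATGatewayScanner(BaseScanner):
    resource_type = "nat_gateway"

    def scan(self) -> list[dict]:
        return []


def run_all() -> dict[str, list[dict]]:
    """What a handler might do: instantiate and run whatever is registered."""
    return {name: cls().scan() for name, cls in BaseScanner.registry.items()}


print(sorted(BaseScanner.registry))
```

With one file per scanner, something still has to import those modules so the classes get defined (for example, a package-level import loop). That discovery step is left out here.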

Error isolation is non-negotiable. Early versions crashed the entire Lambda when one scanner hit a permissions error. Now each scanner runs in its own try/except block, and failures get logged and recorded without affecting the others. In multi-account setups where IAM policies vary, this is essential.

What's next: a report-generator Lambda that reads the JSON files from S3 and produces charts (cost trends over time, breakdown by resource type, worst offenders). Same bucket, separate function, clean separation of concerns. That's a v2 project.

Try it

The repo is at github.com/biscolab/aws-zombie-hunter. Deploy it, run it once, and I'd be surprised if it doesn't find at least a couple of zombies you didn't know about.

The Lambda costs pennies to run. The zombies it finds cost a lot more.

I also wrote a shorter take on this problem on LinkedIn — the "why" rather than the "how". If you're curious about the project or have ideas for additional scanners, drop a comment or open an issue on the repo.

How I Locked Down a Static Site with Lambda@Edge and Cognito (No Backend Required)

Roberto Belotti — Tue, 12 May 2026 11:30:00 +0000

Your internal docs are wide open.

That Docusaurus site you deployed to S3? The one with your API specs, runbooks, onboarding guides? Anyone with the URL can read it. S3 + CloudFront gives you HTTPS, caching, and global distribution out of the box. What it doesn't give you is a login page.

Most teams solve this by moving docs to a platform (Notion, Confluence, whatever) and giving up control. Or they shove everything behind a VPN and call it a day. Both options work. Both have trade-offs that get annoying fast.

I wanted a third option: keep the static site exactly as it is (Docusaurus in my case, but anything works), keep it on S3 + CloudFront (cheap, fast, zero maintenance), and add a real authentication layer in front of it without touching the site's code or build pipeline.

The result is docusaurus-cognito-auth — a fully serverless auth layer built with Lambda@Edge and AWS Cognito. This article is a walkthrough of the architecture, the decisions behind it, and the things that bit me along the way.

The stack at a glance

Four AWS services, each doing one thing:

S3 stores the static site files. Private bucket, no public access, no website hosting enabled. Just objects in a bucket. The bucket is provisioned empty by the stack — you deploy your static site separately with aws s3 sync.

CloudFront sits in front of S3 and handles everything HTTP: TLS termination, caching, compression, global edge distribution. It accesses S3 through an Origin Access Control (OAC), which means the bucket stays fully private. No public ACLs, no bucket policies leaking read access. CloudFront is the only thing that can read from S3.

Lambda@Edge is where the auth logic lives. Two functions, both running at the CloudFront edge as viewer-request triggers. One checks the JWT cookie on every request. The other handles the OAuth callback after login. More on both in a moment.

Cognito is the identity provider. User Pool with Hosted UI — it handles signup, login, password reset, email verification. The Lambda functions talk to Cognito's token endpoint and validate JWTs against its JWKS.

The key constraint: the auth layer and the static site are completely decoupled. You can swap the site (Docusaurus, Next.js export, plain HTML) without redeploying auth. You can update auth without rebuilding the site. Two independent concerns, two independent deploy cycles.

The request flow (when it actually matters)

There are really only two scenarios. Understanding both is the key to understanding the whole system.

Scenario 1: you have a valid cookie

Browser → CloudFront → auth-check Lambda → "cookie is valid" → S3 → page served

The auth-check function extracts the auth_token cookie, verifies the JWT signature against Cognito's JWKS (RS256), checks expiry, issuer, and audience. If everything passes, it returns the original CloudFront request object unchanged. CloudFront continues to S3, gets the page, serves it. The user never notices anything happened. This check takes about 1 ms at the edge once the JWKS keys are cached.

Scenario 2: no cookie (or expired)

Browser → CloudFront → auth-check Lambda → 302 to Cognito login
User logs in on Cognito Hosted UI
Cognito → 302 to /callback?code=AUTH_CODE&state=/original-page
CloudFront → auth-callback Lambda → exchanges code for tokens → sets cookie → 302 to /original-page
Browser → (back to scenario 1)

This is the OAuth Authorization Code flow. The state parameter carries the originally requested URL, so after login the user lands exactly where they intended. The cookie is HttpOnly; Secure; SameSite=Lax — not accessible from JavaScript, transmitted only over HTTPS.

Once the cookie is set, every subsequent request is scenario 1. No more redirects until the token expires.
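The two scenarios boil down to one branch. The project's functions are JavaScript; the Python sketch below only illustrates the decision logic against CloudFront's event structure. The domain, client ID, and scope are placeholders, and is_valid is a stand-in for the real JWT verification (signature, expiry, issuer, audience), which is not implemented here:

```python
from http.cookies import SimpleCookie
from typing import Callable, Optional
from urllib.parse import urlencode

# Placeholder values (assumptions for this sketch).
COGNITO_DOMAIN = "example.auth.eu-central-1.amazoncognito.com"
CLIENT_ID = "example-client-id"
REDIRECT_URI = "https://d1234abcd.cloudfront.net/callback"


def get_cookie(request: dict, name: str) -> Optional[str]:
    """Read one cookie from a CloudFront request (headers are lists of key/value dicts)."""
    for header in request.get("headers", {}).get("cookie", []):
        jar = SimpleCookie()
        jar.load(header["value"])
        if name in jar:
            return jar[name].value
    return None


def login_redirect(original_uri: str) -> dict:
    """302 to the Cognito Hosted UI, carrying the requested path in `state`."""
    query = urlencode({
        "client_id": CLIENT_ID,
        "response_type": "code",
        "scope": "openid",
        "redirect_uri": REDIRECT_URI,
        "state": original_uri,
    })
    location = f"https://{COGNITO_DOMAIN}/login?{query}"
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {"location": [{"key": "Location", "value": location}]},
    }


def handle(event: dict, is_valid: Callable[[str], bool]) -> dict:
    """Scenario 1 returns the request unchanged; scenario 2 returns a redirect."""
    request = event["Records"][0]["cf"]["request"]
    token = get_cookie(request, "auth_token")
    if token and is_valid(token):
        return request
    return login_redirect(request["uri"])
```

Returning the request object tells CloudFront to continue to the origin; returning a dict with a status short-circuits and sends that response straight back to the browser.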

Lambda@Edge: the part that makes it work (and the part that hurts)

Lambda@Edge is powerful but comes with constraints that'll surprise you if you've only used regular Lambda.

No environment variables

This is the big one. Lambda@Edge runs at CloudFront edge locations worldwide, and AWS decided that environment variables aren't supported. Period. So you can't do the normal thing (put your Cognito pool ID, client ID, and domain in env vars and read them at runtime).

My solution: a config.mjs file that gets its values baked in at build time. The deploy script reads the .env file (which itself is auto-generated from CloudFormation outputs) and writes the actual values into the config before packaging the Lambda.

It works. It's not elegant. But it's the only pattern that makes sense for Lambda@Edge.
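A rough sketch of the bake-in step, written in Python for illustration. The variable names, the .env format, and the shape of the generated module are assumptions, and the project's own deploy script may do this differently:

```python
import json


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"')
    return env


def render_config(env: dict[str, str], keys: list[str]) -> str:
    """Emit an ES module with the values inlined; json.dumps handles the quoting."""
    missing = [k for k in keys if k not in env]
    if missing:
        raise KeyError(f"missing in .env: {missing}")
    body = ",\n".join(f"  {k}: {json.dumps(env[k])}" for k in keys)
    return f"export const config = {{\n{body},\n}};\n"


env_text = "# generated\nCOGNITO_USER_POOL_ID=eu-central-1_EXAMPLE\nCOGNITO_CLIENT_ID=example-client-id\n"
config_source = render_config(parse_env(env_text), ["COGNITO_USER_POOL_ID", "COGNITO_CLIENT_ID"])
print(config_source)
```

Failing the build on a missing key is deliberate: a Lambda@Edge function deployed with an empty pool ID would only fail later, at the edge, where it is much harder to debug.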

The 1 MB package limit

Viewer-request functions have a 1 MB deployment package limit. This is why I chose the jose library for JWT validation instead of something heavier. jose is pure JavaScript (no native dependencies, no compiled bindings), handles JWKS fetching and caching automatically via createRemoteJWKSet, and keeps the total bundle size well under the limit.

If I'd gone with Python (my first instinct), PyJWT plus the cryptography library for RS256 verification would have blown past 1 MB easily. JavaScript was the pragmatic choice here.
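If you want a guard against creeping past the limit, a build-time size check is easy to add. This is a sketch, not part of the project as far as I can tell: the 1,000,000-byte threshold is a conservative reading of "1 MB", and the file names are made up:

```python
import io
import os
import zipfile

VIEWER_TRIGGER_LIMIT_BYTES = 1_000_000  # conservative reading of the 1 MB limit


def zipped_size(files: dict[str, bytes]) -> int:
    """Size in bytes of a deflate-compressed zip holding the given files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return len(buffer.getvalue())


def check_bundle(files: dict[str, bytes]) -> int:
    """Return the zipped size, or raise if the bundle would exceed the limit."""
    size = zipped_size(files)
    if size > VIEWER_TRIGGER_LIMIT_BYTES:
        raise ValueError(f"bundle is {size} bytes, over the {VIEWER_TRIGGER_LIMIT_BYTES} byte limit")
    return size


small = check_bundle({"index.mjs": b"export const handler = async (event) => event;\n"})
print(f"small bundle: {small} bytes")

try:
    # Random bytes do not compress, so this simulates a bundle that is too large.
    check_bundle({"index.mjs": os.urandom(2_000_000)})
except ValueError as exc:
    print(exc)
```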

5-second timeout

Viewer-request functions must respond within 5 seconds. The JWKS fetch (cold start only) and the token exchange in the callback both need to complete within this window. In practice it's never been an issue — Cognito's endpoints respond in under 200 ms — but it's something to be aware of if you add custom logic.

Must deploy to us-east-1

Lambda@Edge functions must be created in us-east-1. AWS replicates them to edge locations globally, but the source must live in N. Virginia. The SAM template handles this, but if your default region is eu-central-1 (like mine), you need to be explicit in samconfig.toml.

Two Lambdas, not one

I split the auth into two separate functions, wired to two separate CloudFront cache behaviors:

auth-check is attached to the DefaultCacheBehavior — it fires on every request to every path. Its job is simple: check the cookie, validate the JWT, pass through or redirect. It never talks to Cognito's token endpoint. It only reads the JWKS (and caches it in memory across warm invocations).

auth-callback is attached to a specific CacheBehavior for the /callback path. It only fires when Cognito redirects back after login. Its job is to exchange the authorization code for tokens (one POST to Cognito), set the cookie, and redirect the user.

Why not one function that handles both? Separation of concerns. The auth-check function runs on every single request — it needs to be fast and lightweight. The callback function runs once per login session — it can afford the overhead of an HTTP call to Cognito. Mixing both flows into one handler would mean every request pays the cost of parsing callback logic it doesn't need.

CloudFront evaluates the entries in CacheBehaviors in the order they are listed and uses the first path pattern that matches. The DefaultCacheBehavior applies only when none of them match. A request to /callback matches the explicit path pattern and goes to auth-callback. Everything else falls through to auth-check. Clean routing, no conditionals in code.
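For the callback side, the two steps (code exchange, then cookie plus redirect) can be sketched as follows. Again this is Python for illustration only. It assumes a public app client without a secret, a one-hour cookie lifetime, and an extra same-site check on state that the project may or may not perform:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(domain: str, client_id: str, code: str, redirect_uri: str) -> Request:
    """POST to Cognito's /oauth2/token endpoint to swap the authorization code for tokens."""
    body = urlencode({
        "grant_type": "authorization_code",
        "client_id": client_id,
        "code": code,
        "redirect_uri": redirect_uri,
    }).encode()
    return Request(
        f"https://{domain}/oauth2/token",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def callback_response(token: str, state: str, max_age: int = 3600) -> dict:
    """Set the auth cookie and send the user back to the page they asked for."""
    # Only accept same-site paths from `state`, so the redirect cannot leave the site.
    target = state if state.startswith("/") and not state.startswith("//") else "/"
    cookie = f"auth_token={token}; Path=/; Max-Age={max_age}; HttpOnly; Secure; SameSite=Lax"
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": target}],
            "set-cookie": [{"key": "Set-Cookie", "value": cookie}],
        },
    }
```

Sending the request (for example with urllib.request.urlopen and a timeout well under the 5-second budget) and parsing the JSON response are omitted to keep the sketch free of network calls.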

The SAM template: IaC for real

The entire infrastructure (S3, CloudFront, Cognito, Lambda@Edge, IAM, OAC) is defined in a single template.yaml. One sam deploy and you have a working auth layer. Here are the things worth highlighting:

OAC over OAI. Origin Access Control is the current AWS recommendation. Origin Access Identity (OAI) still works but is considered legacy. OAC uses SigV4 signing and supports more S3 features.

Two-phase deploy. The chicken-and-egg problem: the Cognito callback URL needs the CloudFront domain, but the CloudFront domain doesn't exist until the first deploy. The deploy script solves this by running an initial deploy with a placeholder callback URL, reading the CloudFront domain from CloudFormation outputs, updating .env, and running a second deploy with the real URL. Subsequent deploys are single-pass.

Caching policies. The default behavior uses AWS's managed CachingOptimized policy (cache everything). The /callback behavior uses CachingDisabled (never cache the auth callback) plus an origin request policy that forwards query strings (the code and state parameters).

SPA error handling. Custom error responses map 403 and 404 to /index.html with a 200 status code. This lets client-side routing work after authentication (Docusaurus, React Router, etc.). Without this, a direct link to /docs/some-page would fail at S3 because there's no /docs/some-page object, only an index.html that handles routing client-side. Both codes are mapped because a private bucket answers a missing key with 403 rather than 404 unless the caller also has s3:ListBucket.

Logout

The auth-check Lambda also handles /logout — no static file needed. When it sees a request to /logout, it clears the auth_token cookie (setting Max-Age=0) and redirects to Cognito's logout endpoint, which invalidates the server-side session. Cognito then redirects back to the site root, and the next request triggers a fresh login.

Adding a logout button to any static site is just an anchor tag: <a href="/logout">Logout</a>.
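The logout response itself is small. A Python sketch of what the function returns, with placeholder values; the logout_uri has to be registered as an allowed sign-out URL on the Cognito app client:

```python
from urllib.parse import urlencode


def logout_response(domain: str, client_id: str, logout_uri: str) -> dict:
    """Expire the auth cookie and hand off to Cognito's /logout endpoint."""
    query = urlencode({"client_id": client_id, "logout_uri": logout_uri})
    # Max-Age=0 tells the browser to drop the cookie immediately.
    expired = "auth_token=; Path=/; Max-Age=0; HttpOnly; Secure; SameSite=Lax"
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": f"https://{domain}/logout?{query}"}],
            "set-cookie": [{"key": "Set-Cookie", "value": expired}],
        },
    }


response = logout_response(
    "example.auth.eu-central-1.amazoncognito.com",
    "example-client-id",
    "https://d1234abcd.cloudfront.net/",
)
print(response["headers"]["location"][0]["value"])
```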

Cost

This is one of the nice parts. For a typical internal docs site (let's say a few hundred users, a few thousand page views per day), the cost is close to zero. Three of the four services have free tiers, and Lambda@Edge is cheap at this volume:

CloudFront: 1 TB transfer + 10 million requests/month free
Lambda@Edge: no free tier (the regular Lambda free tier doesn't cover it); $0.60 per million requests plus a small duration charge
Cognito: 50,000 monthly active users free
S3: 5 GB storage + 20,000 GET requests/month free

Even past the free tier, we're talking single-digit dollars per month. The most expensive component at scale is Cognito ($0.0055 per MAU after 50K), but if you have 50,000 people reading your internal docs, you have bigger problems to solve.

What I'd do differently

CloudFront Functions instead of Lambda@Edge for auth-check. CloudFront Functions run at the edge with sub-millisecond latency, support up to 10 million requests per second, and cost about one-sixth of Lambda@Edge. The limitation is that they can't make external network calls — which means no JWKS fetching. But if you pre-bake the JWKS public keys into the function code at build time (they rotate infrequently), you could do the entire JWT validation in a CloudFront Function. I might explore this for v2.

Custom domain from day one. The current setup uses the default CloudFront domain (d1234abcd.cloudfront.net). Adding a custom domain (with ACM certificate) is straightforward but isn't included in the template to keep the initial setup simple. For production use, you'd want this.

Try it

The full project is on GitHub: biscolab/docusaurus-cognito-auth

Clone, configure samconfig.toml, run npm run deploy, upload your site. The README covers everything including GitHub Actions CI/CD with OIDC (no long-lived access keys).

If you're running internal docs on S3 without auth today, this gets you to enterprise-grade access control in about 15 minutes. And you keep full control of your infrastructure.

What's your setup for protecting internal docs? VPN, platform, or something custom? Always curious to hear how other teams handle this.