<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rupam Golui</title>
    <description>The latest articles on DEV Community by Rupam Golui (@agasta).</description>
    <link>https://dev.to/agasta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2861497%2F2a297716-0591-4407-9250-7ea374c4c585.jpg</url>
      <title>DEV Community: Rupam Golui</title>
      <link>https://dev.to/agasta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agasta"/>
    <language>en</language>
    <item>
      <title>5 Steps to Deploying an AWS Lambda with EventBridge using Terraform</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Mon, 02 Feb 2026 08:59:01 +0000</pubDate>
      <link>https://dev.to/agasta/5-12-steps-to-deploying-an-aws-lambda-with-eventbridge-using-terraform-5f1c</link>
      <guid>https://dev.to/agasta/5-12-steps-to-deploying-an-aws-lambda-with-eventbridge-using-terraform-5f1c</guid>
      <description>&lt;p&gt;I used to think serverless meant "upload code, magic happens." Then today I tried to deploy a Python Lambda that needed custom dependencies, a schedule trigger, and enough memory to actually stay alive, and quietly questioning my career choices.&lt;/p&gt;

&lt;p&gt;Never mind. Here's how I finally got my &lt;code&gt;atlas-worker&lt;/code&gt; Lambda running: containerized, scheduled, and only slightly over-provisioned. (You can jump to the last section for the complete code.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Accept That ZIP Files Are for Cowards (Use Containers)
&lt;/h2&gt;

&lt;p&gt;Look, you &lt;em&gt;could&lt;/em&gt; package your Lambda as a ZIP. You could also commute to work on a unicycle. Both are technically possible, but why suffer?&lt;/p&gt;

&lt;p&gt;I needed &lt;code&gt;psycopg2&lt;/code&gt;, some geospatial libraries, and enough room to breathe; ZIP packages also have a hard size limit (250 MB unzipped, versus 10 GB for container images). So I went with a container image. The first hurdle? Terraform needs to log into ECR &lt;em&gt;before&lt;/em&gt; it can push the image, but it needs to know the registry URL to log in... which requires knowing your account ID.&lt;/p&gt;

&lt;p&gt;That's where the &lt;code&gt;aws_caller_identity&lt;/code&gt; and &lt;code&gt;aws_ecr_authorization_token&lt;/code&gt; data sources come in. It feels like Terraform inception, but it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt; &lt;span class="s2"&gt;"current"&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_authorization_token"&lt;/span&gt; &lt;span class="s2"&gt;"token"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;registry_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"docker"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;registry_auth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;address&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.aws_region}.amazonaws.com"&lt;/span&gt;
    &lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ecr_authorization_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_name&lt;/span&gt;
    &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ecr_authorization_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;password&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes your AWS CLI is already set up with valid credentials.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Let Terraform Build the Image (Yes, Really)
&lt;/h2&gt;

&lt;p&gt;I used to build the Docker image separately, tag it manually, push it, then update Terraform. That's three steps too many. The &lt;code&gt;terraform-aws-modules/lambda&lt;/code&gt; module has a docker-build submodule that handles this in one go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"docker_image"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/lambda/aws//modules/docker-build"&lt;/span&gt;

  &lt;span class="nx"&gt;create_ecr_repo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;ecr_repo&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker"&lt;/span&gt;
  &lt;span class="nx"&gt;use_image_tag&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;image_tag&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image_tag&lt;/span&gt; &lt;span class="c1"&gt;# Keep a track on it&lt;/span&gt;
  &lt;span class="nx"&gt;source_path&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../apps/workers"&lt;/span&gt; &lt;span class="c1"&gt;# your docker file location&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The image is only rebuilt when you change &lt;code&gt;image_tag&lt;/code&gt;, not when you change the code, so keep track of it.&lt;br&gt;
Set &lt;code&gt;create_ecr_repo = true&lt;/code&gt; and Terraform provisions the registry, builds the image, and pushes it. It's terrifyingly convenient. I kept waiting for the catch.&lt;/p&gt;
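&lt;p&gt;One workaround I use (a sketch, not the module's documented canonical approach, so verify it against your layout): derive the tag from a hash of the source files, so any code change forces a rebuild:&lt;/p&gt;

```hcl
# Sketch: tie the image tag to the source contents so a code change
# automatically triggers a rebuild. Path and module config mirror the
# snippet above; the hashing locals are my own addition.
locals {
  worker_src_hash = sha1(join("", [
    for f in fileset("../apps/workers", "**") : filesha1("../apps/workers/${f}")
  ]))
}

module "docker_image" {
  source = "terraform-aws-modules/lambda/aws//modules/docker-build"

  create_ecr_repo = true
  ecr_repo        = "atlas-worker"
  use_image_tag   = true
  image_tag       = local.worker_src_hash # changes whenever the code changes
  source_path     = "../apps/workers"
}
```

The trade-off: tags become opaque hashes instead of human-readable versions, so keep an explicit `var.image_tag` if you want meaningful release tags.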

&lt;p&gt;(There is a catch. We'll get to it in Step 4½.)&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Configure the Lambda
&lt;/h2&gt;

&lt;p&gt;AWS defaults are... optimistic. A 3-second timeout and 128 MB of RAM work for "Hello World." My worker connects to Postgres, queries an API, and uploads to S3. So I maxed out the timeout to 15 minutes and gave it 1 GB of memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"lambda_function"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/lambda/aws"&lt;/span&gt;

  &lt;span class="nx"&gt;function_name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker"&lt;/span&gt;
  &lt;span class="nx"&gt;create_package&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;image_uri&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image_uri&lt;/span&gt;
  &lt;span class="nx"&gt;package_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Image"&lt;/span&gt;

  &lt;span class="nx"&gt;timeout&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_timeout&lt;/span&gt;  &lt;span class="c1"&gt;# 900 seconds = "Please just finish"&lt;/span&gt;
  &lt;span class="nx"&gt;memory_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_memory&lt;/span&gt;   &lt;span class="c1"&gt;# 1024 MB = "I believe in you"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, I'm paying for a Lambo to do grocery runs. But it &lt;em&gt;works&lt;/em&gt;.&lt;/p&gt;
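&lt;p&gt;For reference, the two variables could be declared like this (names match the snippet above; the descriptions and defaults are my assumptions):&lt;/p&gt;

```hcl
# Hypothetical variables.tf entries backing the lambda module config.
variable "lambda_timeout" {
  description = "Function timeout in seconds (900 is the Lambda maximum)"
  type        = number
  default     = 900
}

variable "lambda_memory" {
  description = "Memory in MB; Lambda allocates CPU proportionally to memory"
  type        = number
  default     = 1024
}
```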




&lt;h2&gt;
  
  
  Step 4: EventBridge (Or "CloudWatch Events" for Us Old Timers)
&lt;/h2&gt;

&lt;p&gt;I wanted this thing to run weekly. My brain said "CloudWatch cron," but AWS renamed CloudWatch Events to EventBridge back in 2019 and my muscle memory hasn't caught up.&lt;/p&gt;
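&lt;p&gt;A note on the schedule itself: AWS cron expressions have &lt;em&gt;six&lt;/em&gt; fields, not the five of Unix cron, which tripped me up. Here's how I'd declare the variable (a sketch; the default matches my weekly schedule):&lt;/p&gt;

```hcl
# AWS cron format: cron(minutes hours day-of-month month day-of-week year)
# Note: day-of-month and day-of-week can't both be specific; one must be "?".
variable "schedule_expression" {
  description = "EventBridge schedule: every Sunday at 12:00 UTC"
  type        = string
  default     = "cron(0 12 ? * SUN *)"
}
```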

&lt;p&gt;The EventBridge module connects your Lambda to a schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eventbridge"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/eventbridge/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"4.2.2"&lt;/span&gt;

  &lt;span class="nx"&gt;create_bus&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Don't need a custom event bus&lt;/span&gt;

  &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weekly_sync&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;description&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Trigger ARGO float weekly sync"&lt;/span&gt;
      &lt;span class="nx"&gt;schedule_expression&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schedule_expression&lt;/span&gt;  &lt;span class="c1"&gt;# cron(0 12 ? * SUN *)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;targets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weekly_sync&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker-lambda"&lt;/span&gt;
        &lt;span class="nx"&gt;arn&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_function_arn&lt;/span&gt;
        &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"update"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;input&lt;/code&gt; field? It sends a custom JSON payload to your Lambda. My function checks &lt;code&gt;event["operation"]&lt;/code&gt; to know whether it's a scheduled run or a manual invocation.&lt;/p&gt;
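&lt;p&gt;On the Lambda side, the handler receives that payload as its event. Mine looks roughly like this (a simplified sketch, not the actual worker code):&lt;/p&gt;

```python
import json


def lambda_handler(event, context):
    # EventBridge delivers the rule's `input` JSON as the event dict,
    # so a scheduled run arrives as {"operation": "update"}.
    operation = event.get("operation", "manual")

    if operation == "update":
        # Scheduled weekly sync path.
        result = {"status": "ok", "mode": "scheduled"}
    else:
        # Anything else is treated as a manual invocation.
        result = {"status": "ok", "mode": "manual"}

    return {"statusCode": 200, "body": json.dumps(result)}
```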




&lt;h2&gt;
  
  
  Step 4½: The Permission That Haunts My Dreams ☠️
&lt;/h2&gt;

&lt;p&gt;Here's the "½ step"—the thing that isn't in the main tutorial but will break everything if you skip it.&lt;/p&gt;

&lt;p&gt;EventBridge can &lt;em&gt;say&lt;/em&gt; it will trigger your Lambda, but unless you explicitly give it permission, AWS will silently drop those invocations. No error logs. No failed invocations metric. Just... nothing happening at 2 AM when your cron fires.&lt;/p&gt;

&lt;p&gt;You need the &lt;code&gt;allowed_triggers&lt;/code&gt; block in your Lambda module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;allowed_triggers&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;EventBridgeRule&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;principal&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"events.amazonaws.com"&lt;/span&gt;
    &lt;span class="nx"&gt;source_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventbridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventbridge_rule_arns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"weekly_sync"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent a morning thinking my cron expression was wrong. Nope. The Lambda just... rejected the trigger. Politely. Without telling me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the ½ step because it's invisible until it isn't.&lt;/strong&gt;&lt;/p&gt;
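&lt;p&gt;Under the hood, that &lt;code&gt;allowed_triggers&lt;/code&gt; block creates a resource-based policy on the function, roughly equivalent to this raw resource (a sketch; the module manages it for you):&lt;/p&gt;

```hcl
# Roughly what allowed_triggers generates: a resource-based policy statement
# letting EventBridge (and only this specific rule) invoke the function.
resource "aws_lambda_permission" "eventbridge" {
  statement_id  = "EventBridgeRule"
  action        = "lambda:InvokeFunction"
  function_name = module.lambda_function.lambda_function_name
  principal     = "events.amazonaws.com"
  source_arn    = module.eventbridge.eventbridge_rule_arns["weekly_sync"]
}
```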




&lt;h2&gt;
  
  
  Step 5: Environment Variables (AKA "How Many Secrets Can I Fit In Here")
&lt;/h2&gt;

&lt;p&gt;Last, the config: database URLs, S3 credentials, the whole messy reality of "my app needs to talk to things":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;environment_variables&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;PG_WRITE_URL&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pg_write_url&lt;/span&gt;
  &lt;span class="nx"&gt;S3_ACCESS_KEY&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_access_key&lt;/span&gt;
  &lt;span class="nx"&gt;S3_SECRET_KEY&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_secret_key&lt;/span&gt;
  &lt;span class="nx"&gt;S3_ENDPOINT&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;S3_BUCKET_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_bucket_name&lt;/span&gt;
  &lt;span class="nx"&gt;ARGO_DAC&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;argo_dac&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Are these in AWS Secrets Manager? No. Should they be? Probably. But this is a dev blog, not a security audit. We'll pretend I used Terraform Cloud workspaces with encrypted variables and move on.&lt;/p&gt;
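&lt;p&gt;If you do want Secrets Manager, the shape looks roughly like this (a sketch; the secret name and JSON keys are hypothetical):&lt;/p&gt;

```hcl
# Hypothetical sketch: read one JSON secret instead of passing raw variables.
data "aws_secretsmanager_secret_version" "worker" {
  secret_id = "atlas-worker/prod" # hypothetical secret name
}

locals {
  worker_secrets = jsondecode(data.aws_secretsmanager_secret_version.worker.secret_string)
}

# Then, inside the lambda module:
# environment_variables = {
#   PG_WRITE_URL = local.worker_secrets["pg_write_url"]
# }
```

Caveat: the values still end up in Terraform state and the function's plaintext environment variables; for stronger isolation, fetch secrets at runtime with the AWS SDK instead.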




&lt;p&gt;I assume you've already set up your &lt;code&gt;main.tf&lt;/code&gt;, so here's the full &lt;code&gt;worker.tf&lt;/code&gt;. Just set the env vars, and thank me later :)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt; &lt;span class="s2"&gt;"current"&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_authorization_token"&lt;/span&gt; &lt;span class="s2"&gt;"token"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;registry_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"docker"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;registry_auth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;address&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.aws_region}.amazonaws.com"&lt;/span&gt;
    &lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ecr_authorization_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_name&lt;/span&gt;
    &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ecr_authorization_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;password&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"lambda_function"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/lambda/aws"&lt;/span&gt;

  &lt;span class="nx"&gt;function_name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker"&lt;/span&gt;
  &lt;span class="nx"&gt;create_package&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="nx"&gt;image_uri&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image_uri&lt;/span&gt;
  &lt;span class="nx"&gt;package_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Image"&lt;/span&gt;

  &lt;span class="nx"&gt;timeout&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_timeout&lt;/span&gt; &lt;span class="c1"&gt;# 15 minutes&lt;/span&gt;
  &lt;span class="nx"&gt;memory_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_memory&lt;/span&gt;  &lt;span class="c1"&gt;# 1GB&lt;/span&gt;

  &lt;span class="nx"&gt;environment_variables&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;PG_WRITE_URL&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pg_write_url&lt;/span&gt;
    &lt;span class="nx"&gt;S3_ACCESS_KEY&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_access_key&lt;/span&gt;
    &lt;span class="nx"&gt;S3_SECRET_KEY&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_secret_key&lt;/span&gt;
    &lt;span class="nx"&gt;S3_ENDPOINT&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_endpoint&lt;/span&gt;
    &lt;span class="nx"&gt;S3_BUCKET_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_bucket_name&lt;/span&gt;
    &lt;span class="nx"&gt;ARGO_DAC&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;argo_dac&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Allow EventBridge to invoke this Lambda&lt;/span&gt;
  &lt;span class="nx"&gt;create_current_version_allowed_triggers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;allowed_triggers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;EventBridgeRule&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;principal&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"events.amazonaws.com"&lt;/span&gt;
      &lt;span class="nx"&gt;source_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventbridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eventbridge_rule_arns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"weekly_sync"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"docker_image"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/lambda/aws//modules/docker-build"&lt;/span&gt;

  &lt;span class="nx"&gt;create_ecr_repo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;ecr_repo&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker"&lt;/span&gt;

  &lt;span class="nx"&gt;use_image_tag&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;image_tag&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image_tag&lt;/span&gt;

  &lt;span class="nx"&gt;source_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../apps/workers"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eventbridge"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/eventbridge/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"4.2.2"&lt;/span&gt;

  &lt;span class="nx"&gt;create_bus&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weekly_sync&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;description&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Trigger ARGO float weekly sync"&lt;/span&gt;
      &lt;span class="nx"&gt;schedule_expression&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schedule_expression&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Connect Lambda as target&lt;/span&gt;
  &lt;span class="nx"&gt;targets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weekly_sync&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker-lambda"&lt;/span&gt;
        &lt;span class="nx"&gt;arn&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lambda_function_arn&lt;/span&gt;
        &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"update"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"atlas-worker-scheduler"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out my Terraform repo structure &lt;a href="https://github.com/Itz-Agasta/Atlas/tree/main/terraform" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>terraform</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Running Goose in Containers (Without Losing Your Mind)</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Sat, 04 Oct 2025 06:29:56 +0000</pubDate>
      <link>https://dev.to/agasta/running-goose-in-containers-without-losing-your-mind-3m8</link>
      <guid>https://dev.to/agasta/running-goose-in-containers-without-losing-your-mind-3m8</guid>
      <description>&lt;p&gt;I'm a huge fan of containers. They're not just a cool buzzword for résumés; they actually save your sanity. And today, I'll show you how to run goose inside Docker. Specifically how to integrate it into CI/CD pipelines, debug containerized workflows, and manage secure deployments at scale. Once you containerize goose, you'll never go back to raw installs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Containerize Goose? The Real Benefits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://block.github.io/goose/" rel="noopener noreferrer"&gt;Goose&lt;/a&gt; is an AI agent that can automate engineering tasks, build projects from scratch, debug code, and orchestrate complex workflows through the &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP). But here's why containers unlock its true potential:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For CI/CD Integration:&lt;/strong&gt; Run automated code reviews, documentation generation, and testing across your entire pipeline without environment drift. Imagine having goose automatically review every PR, generate release notes, or validate your infrastructure as code, all running consistently in containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Team Collaboration:&lt;/strong&gt; Every developer gets the same goose setup, eliminating "works on my machine" problems when sharing AI-powered workflows or debugging sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Production Deployments:&lt;/strong&gt; Scale goose instances horizontally, manage API keys securely, and deploy across multiple environments with confidence.&lt;/p&gt;

&lt;p&gt;In this guide, we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quick deployment with pre-built images&lt;/li&gt;
&lt;li&gt;CI/CD pipeline integration with GitHub Actions and GitLab&lt;/li&gt;
&lt;li&gt;Debugging containerized goose workflows&lt;/li&gt;
&lt;li&gt;Production-ready security and scaling patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benefits are immediate and compound over time, especially when you're managing API keys for multiple LLM providers and need consistent behavior across environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start: Your First Containerized Workflow
&lt;/h2&gt;

&lt;p&gt;Let's jump straight into a practical example: using goose to analyze and improve a codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the image and analyze your project&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-w&lt;/span&gt; /workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openai &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-4o &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   ghcr.io/block/goose:v0.9.3 run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"Review this code for security issues and suggest improvements"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The ~340MB image contains everything goose needs to analyze your code, suggest improvements, or even refactor entire functions. This same pattern works for documentation generation, test creation, or architectural reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Your Own Images
&lt;/h3&gt;

&lt;p&gt;Sometimes you need customizations. Maybe you want the bleeding edge from source, or you need additional tools. Building from source is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and build&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/block/goose.git
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;goose
&lt;span class="nv"&gt;$ &lt;/span&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; goose:local &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The build uses a multi-stage Dockerfile: it compiles with Rust's heavy toolchain, then copies just the binary into a minimal Debian runtime. The Dockerfile even includes Link-Time Optimization (LTO) and binary stripping to keep things lean.&lt;/p&gt;

&lt;p&gt;Pro tip: For development builds with debug symbols, add &lt;code&gt;--build-arg CARGO_PROFILE_RELEASE_STRIP=false&lt;/code&gt;.&lt;/p&gt;
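&lt;p&gt;To make the pattern concrete, here is a simplified sketch of such a multi-stage build. This is not the project's actual Dockerfile; the base images, paths, and stage names are illustrative:&lt;/p&gt;

```dockerfile
# Build stage: full Rust toolchain (tag is illustrative)
FROM rust:bookworm AS builder
WORKDIR /src
COPY . .
# In the real Dockerfile, LTO and stripping are switched on via Cargo profile settings
RUN cargo build --release

# Runtime stage: minimal Debian image with just the compiled binary
FROM debian:bookworm-slim
COPY --from=builder /src/target/release/goose /usr/local/bin/goose
USER 1000
ENTRYPOINT ["goose"]
```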

&lt;h2&gt;
  
  
  Running Goose Effectively
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic CLI Usage
&lt;/h3&gt;

&lt;p&gt;Mount your workspace and let goose work its magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-w&lt;/span&gt; /workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openai &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-4o &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   goose:local run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"Analyze this codebase"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Interactive Sessions
&lt;/h3&gt;

&lt;p&gt;For longer sessions, use the interactive mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-w&lt;/span&gt; /workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;anthropic &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GOOSE_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;claude-3-5-sonnet-20241022 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   goose:local session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose for Complex Setups
&lt;/h3&gt;

&lt;p&gt;When you're dealing with multiple services or persistent config, use Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;goose&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/block/goose:latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GOOSE_PROVIDER=${GOOSE_PROVIDER:-openai}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GOOSE_MODEL=${GOOSE_MODEL:-gpt-4o}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./workspace:/workspace&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;goose-config:/home/goose/.config/goose&lt;/span&gt;
    &lt;span class="na"&gt;working_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/workspace&lt;/span&gt;
    &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;goose-config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run with: &lt;code&gt;$ docker-compose run --rm goose session&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Deep Dive
&lt;/h2&gt;

&lt;p&gt;Goose supports all the usual environment variables: &lt;code&gt;GOOSE_PROVIDER&lt;/code&gt;, &lt;code&gt;GOOSE_MODEL&lt;/code&gt;, and provider-specific keys. The container runs as a non-root user (UID 1000) by default, which is great for security.&lt;/p&gt;

&lt;p&gt;For persistent config, mount the config directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.config/goose:/home/goose/.config/goose &lt;span class="se"&gt;\&lt;/span&gt;
    goose:local configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Need extra tools? The image is based on Debian Bookworm Slim, so you can install what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;FROM ghcr.io/block/goose:latest

USER root

RUN apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vim tmux &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

USER goose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI/CD Integration: Where Containers Shine
&lt;/h2&gt;

&lt;p&gt;This is where containerization really pays off. In CI/CD pipelines, you want consistent, isolated environments. Here's how to integrate goose:&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;analyze&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/block/goose:latest&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;GOOSE_PROVIDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
        &lt;span class="na"&gt;GOOSE_MODEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
        &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run goose analysis&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;goose run -t "Review this codebase for security issues"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitLab CI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;analyze&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/block/goose:latest&lt;/span&gt;
  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GOOSE_PROVIDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;GOOSE_MODEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;goose run -t "Generate documentation for this project"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've used this setup on multiple projects, and it's a game-changer for automated code reviews and documentation generation.&lt;/p&gt;
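&lt;p&gt;If several pipelines share the same invocation, a small wrapper keeps the jobs readable. The sketch below is an assumption-heavy convenience, not part of goose: &lt;code&gt;goose_review&lt;/code&gt;, &lt;code&gt;GOOSE_IMAGE&lt;/code&gt;, and the &lt;code&gt;DRY_RUN&lt;/code&gt; switch are all hypothetical names:&lt;/p&gt;

```shell
# Hypothetical CI helper: assemble the docker invocation once, reuse it in every job.
goose_review() {
  prompt="$1"
  image="${GOOSE_IMAGE:-ghcr.io/block/goose:v0.9.3}"
  # -e VAR with no value passes the variable through from the CI environment
  cmd="docker run --rm -v $(pwd):/workspace -w /workspace -e GOOSE_PROVIDER -e GOOSE_MODEL -e OPENAI_API_KEY $image run -t \"$prompt\""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the assembled command instead of running it, to sanity-check the pipeline
    echo "$cmd"
  else
    eval "$cmd"
  fi
}
```

&lt;p&gt;A job then calls &lt;code&gt;goose_review "Review this codebase for security issues"&lt;/code&gt;; setting &lt;code&gt;DRY_RUN=1&lt;/code&gt; prints the command so you can inspect it locally first.&lt;/p&gt;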

&lt;h2&gt;
  
  
  Troubleshooting: Real-World Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File Permission Problems with Workspace Mounts
&lt;/h3&gt;

&lt;p&gt;When goose can't write to your mounted workspace, it's usually a user ID mismatch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Problem: "Permission denied" when goose tries to create files&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace ghcr.io/block/goose:v0.9.3

&lt;span class="c"&gt;# Solution: Match container user with host user&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   ghcr.io/block/goose:v0.9.3 run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"Create a README.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Managing Multiple API Keys
&lt;/h3&gt;

&lt;p&gt;When working with multiple LLM providers (OpenAI, Anthropic, Google), passing individual &lt;code&gt;-e&lt;/code&gt; flags becomes unwieldy and error-prone. Instead, use a single &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a comprehensive .env file&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
GOOSE_PROVIDER=openai
GOOSE_MODEL=gpt-4o
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
GOOGLE_API_KEY=your-google-key
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Use the entire file at once&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace goose:v0.9.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is cleaner, more maintainable, and essential for CI/CD environments where you might switch between providers based on cost, availability, or model capabilities. It also keeps sensitive keys out of your shell history; just make sure &lt;code&gt;.env&lt;/code&gt; is in your &lt;code&gt;.gitignore&lt;/code&gt; so the keys never land in version control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting to Local Development Services
&lt;/h3&gt;

&lt;p&gt;When goose needs to interact with local databases, development servers, or APIs running on your host machine, the container's isolated network becomes a barrier.&lt;/p&gt;

&lt;p&gt;Use host networking to bridge this gap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Allow goose to access your local services&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   ghcr.io/block/goose:v0.9.3 run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"Test the API endpoints in our local development server"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security note:&lt;/strong&gt; Only use host networking in development environments. For production, use proper service discovery and networking configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Resource Limits
&lt;/h3&gt;

&lt;p&gt;Set memory and CPU limits for production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   goose:local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents individual containers from consuming excessive resources, keeping performance and costs predictable. From there you can scale horizontally (more container instances) or vertically (more resources per container) as your workload demands.&lt;/p&gt;
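&lt;p&gt;The same limits carry over to Docker Compose. A sketch, assuming the non-Swarm Compose form (&lt;code&gt;mem_limit&lt;/code&gt;/&lt;code&gt;cpus&lt;/code&gt;); the values are illustrative:&lt;/p&gt;

```yaml
services:
  goose:
    image: ghcr.io/block/goose:v0.9.3
    mem_limit: 2g   # mirrors --memory="2g"
    cpus: 2         # mirrors --cpus="2"
```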

&lt;h3&gt;
  
  
  Debugging Container Issues
&lt;/h3&gt;

&lt;p&gt;When goose behaves unexpectedly in containers, you need to investigate the environment. Here are common debugging scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect the container environment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Drop into a shell to examine the container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--entrypoint&lt;/span&gt; bash &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="se"&gt;\&lt;/span&gt;
   ghcr.io/block/goose:v0.9.3

&lt;span class="c"&gt;# Inside the container, check:&lt;/span&gt;
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;GOOSE      &lt;span class="c"&gt;# Environment variables&lt;/span&gt;
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;goose &lt;span class="nt"&gt;--version&lt;/span&gt;       &lt;span class="c"&gt;# Binary version&lt;/span&gt;
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /workspace     &lt;span class="c"&gt;# File permissions&lt;/span&gt;
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.config/goose/config.yaml  &lt;span class="c"&gt;# Configuration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Debug API connectivity issues:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test network connectivity and API access&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--entrypoint&lt;/span&gt; bash goose:v1.8.0
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-I&lt;/span&gt; https://api.openai.com/v1/models
&lt;span class="c"&gt;# Verify API keys&lt;/span&gt;
goose@container:~&lt;span class="nv"&gt;$ &lt;/span&gt;goose configure &lt;span class="nt"&gt;--check&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Examine goose logs in verbose mode:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/workspace &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="se"&gt;\&lt;/span&gt;
   ghcr.io/block/goose:v0.9.3 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--verbose&lt;/span&gt; run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"Your failing command"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This debugging approach helps identify issues with file permissions, network connectivity, API authentication, or configuration problems that might not be obvious from normal goose output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Platform Builds
&lt;/h3&gt;

&lt;p&gt;For deployment across architectures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker buildx build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64,linux/arm64 &lt;span class="nt"&gt;-t&lt;/span&gt; goose:multi &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;p&gt;For production deployments, replace the &lt;code&gt;latest&lt;/code&gt; tags in our examples with pinned release versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Instead of this (unpredictable):&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker pull ghcr.io/block/goose:latest

&lt;span class="c"&gt;# Use this (reproducible):&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker pull ghcr.io/block/goose:v1.8.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pinning to specific versions prevents surprises during deployments and makes rollbacks predictable when issues arise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;Containerizing goose has been one of those "why didn't I do this sooner?" moments in my career. It eliminates environment drift, simplifies deployments, and makes your AI-powered workflows truly portable. So, start with the prebuilt image, then customize as you grow. Your future self will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>docker</category>
      <category>vibecoding</category>
      <category>goose</category>
    </item>
    <item>
      <title>Why You Should Use uv Inside Jupyter Notebooks</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Mon, 18 Aug 2025 13:50:53 +0000</pubDate>
      <link>https://dev.to/agasta/-2llk</link>
      <guid>https://dev.to/agasta/-2llk</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/agasta" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2861497%2F2a297716-0591-4407-9250-7ea374c4c585.jpg" alt="agasta"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/agasta/why-you-should-use-uv-inside-jupyter-notebooks-h03" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Why You Should Use uv Inside Jupyter Notebooks&lt;/h2&gt;
      &lt;h3&gt;Rupam Golui ・ Aug 18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#jupyter&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#productivity&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>python</category>
      <category>jupyter</category>
      <category>datascience</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why You Should Use uv Inside Jupyter Notebooks</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Mon, 18 Aug 2025 13:50:20 +0000</pubDate>
      <link>https://dev.to/agasta/why-you-should-use-uv-inside-jupyter-notebooks-h03</link>
      <guid>https://dev.to/agasta/why-you-should-use-uv-inside-jupyter-notebooks-h03</guid>
      <description>&lt;p&gt;If you’ve ever worked with &lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;, you know the pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One notebook runs on Python 3.10, another wants 3.11.&lt;/li&gt;
&lt;li&gt;Some depend on &lt;code&gt;torch==2.1.0&lt;/code&gt;, others break unless it’s &lt;code&gt;2.0.1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;And don’t even get me started on dependency conflicts when you’re switching between projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional tools like &lt;code&gt;pip&lt;/code&gt; + &lt;code&gt;venv&lt;/code&gt; work fine, but they can feel &lt;strong&gt;slow and clunky&lt;/strong&gt;. Enter &lt;strong&gt;&lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;&lt;code&gt;uv&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, a blazing-fast Python package manager + environment manager.&lt;/p&gt;

&lt;p&gt;Think of &lt;code&gt;uv&lt;/code&gt; as &lt;strong&gt;the next-gen replacement for pip/venv/poetry&lt;/strong&gt;. It makes project setup &lt;strong&gt;faster&lt;/strong&gt;, &lt;strong&gt;cleaner&lt;/strong&gt;, and way more consistent. And yes, you can use it seamlessly with Jupyter Notebooks.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up 👇&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;p&gt;We’ll register a &lt;strong&gt;new Jupyter kernel&lt;/strong&gt; that uses a &lt;code&gt;uv&lt;/code&gt;-managed environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Install project dependencies
&lt;/h3&gt;

&lt;p&gt;Inside your project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 Requires &lt;code&gt;uv&lt;/code&gt; installed globally. If you don’t have it yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or follow the official &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;install guide&lt;/a&gt;. &lt;/p&gt;




&lt;h3&gt;
  
  
  2. Register a Jupyter kernel for your project
&lt;/h3&gt;

&lt;p&gt;For example, let’s say we’re working on &lt;strong&gt;Virtus&lt;/strong&gt; (&lt;a href="https://github.com/Itz-Agasta/Lopt" rel="noopener noreferrer"&gt;my deepfake detection model&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; vitrus &lt;span class="nt"&gt;--display-name&lt;/span&gt; &lt;span class="s2"&gt;"Python (vitrus)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new Jupyter kernel tied directly to your &lt;code&gt;uv&lt;/code&gt; environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Add the new kernel to Jupyter
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;PyCharm&lt;/strong&gt;: just open the &lt;code&gt;.ipynb&lt;/code&gt; file and select &lt;strong&gt;Python (vitrus)&lt;/strong&gt; from the top-right kernel selector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez2dm0bpngzgmsy0xrhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez2dm0bpngzgmsy0xrhd.png" alt="Screenshot of PyCharm Jupyter Notebook kernel selector with " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Jupyter Lab / Notebook&lt;/strong&gt;: switch kernels from the dropdown menu to use your shiny new environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf88paoiph14nmte44th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf88paoiph14nmte44th.png" alt="Screenshot of JupyterLab interface showing the kernel selection dropdown, with " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Remove old kernels
&lt;/h3&gt;

&lt;p&gt;After your work is done, clean up the leftover kernels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter kernelspec list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Available kernels:
  python3           /home/you/.local/share/jupyter/kernels/python3
  lopt              /home/you/.local/share/jupyter/kernels/lopt
  vitrus            /home/you/.local/share/jupyter/kernels/vitrus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter kernelspec uninstall vitrus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why bother with &lt;code&gt;uv&lt;/code&gt;?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: installs packages ridiculously fast (thanks to Rust).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: each project gets a clean, dedicated env.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt;: &lt;code&gt;uv.lock&lt;/code&gt; means no “works on my machine” nonsense.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter-friendly&lt;/strong&gt;: easy kernel registration, no hacky workarounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 My recommendation: set up your Jupyter projects with &lt;code&gt;uv&lt;/code&gt;, and you’ll avoid 90% of Python environment headaches. Do it once per project and you’re golden.&lt;/p&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>datascience</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Properly Clean Up Docker (and Save Your Sanity)</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Sat, 16 Aug 2025 08:14:14 +0000</pubDate>
      <link>https://dev.to/agasta/how-to-properly-clean-up-docker-and-save-your-sanity-4662</link>
      <guid>https://dev.to/agasta/how-to-properly-clean-up-docker-and-save-your-sanity-4662</guid>
      <description>&lt;p&gt;If you’ve been messing around with Docker for a while, chances are your system has turned into a junkyard of old containers, images, and volumes you don’t even remember creating. Docker is amazing, but it’s also a &lt;strong&gt;hoarder by default&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And let’s be honest. if you’re using &lt;strong&gt;Docker Desktop&lt;/strong&gt;, especially on Mac or Windows, that bloated piece of software is probably eating your RAM for breakfast. My honest advice: &lt;strong&gt;uninstall Docker Desktop&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On &lt;strong&gt;Mac/Linux&lt;/strong&gt; and you still want a GUI? → Check out &lt;a href="https://orbstack.dev" rel="noopener noreferrer"&gt;OrbStack&lt;/a&gt;. Way lighter, way faster.&lt;/li&gt;
&lt;li&gt;Or, better yet: &lt;strong&gt;learn the CLI like a seasoned dev&lt;/strong&gt;. Once you get comfortable, it feels way cleaner and faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s talk about how to &lt;strong&gt;completely clean up Docker&lt;/strong&gt;.&lt;br&gt;
👉 I recommend doing this &lt;strong&gt;weekly&lt;/strong&gt; if you’re a heavy Docker user, or at least &lt;strong&gt;twice a month&lt;/strong&gt; to keep things fresh.&lt;/p&gt;


&lt;h2&gt;
  
  
  🐳 Docker Cleanup (Purge Everything)
&lt;/h2&gt;

&lt;p&gt;This will clear out &lt;strong&gt;unused containers, images, networks, volumes&lt;/strong&gt;, and if you want, completely nuke Docker’s system files.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Soft Cleanup (recommended)
&lt;/h3&gt;

&lt;p&gt;The easiest, safest way to clear unused junk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker system prune &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;--volumes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-a&lt;/code&gt;: removes &lt;strong&gt;all unused images&lt;/strong&gt; (not just dangling ones).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--volumes&lt;/code&gt;: removes &lt;strong&gt;unused volumes&lt;/strong&gt; too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of this as Docker’s version of spring cleaning.&lt;br&gt;
&lt;em&gt;(Run this once a week, and Docker won’t ever balloon out of control.)&lt;/em&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Stop and Remove All Containers
&lt;/h3&gt;

&lt;p&gt;Sometimes you just want a &lt;strong&gt;fresh start&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop &lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;-aq&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
docker &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;-aq&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Remove All Images, Volumes, and Networks
&lt;/h3&gt;

&lt;p&gt;Go full “factory reset” mode on your Docker resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker rmi &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;-aq&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
docker volume &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker volume &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
docker network &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker network &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  4. (Optional) Completely Reset Docker
&lt;/h3&gt;

&lt;p&gt;⚠️ Warning (Linux-only commands): this wipes Docker’s entire state, including all images, containers, volumes, and cached layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop docker
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/docker /var/lib/containerd
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeps things lean and mean.&lt;/p&gt;




&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ditch Docker Desktop → Use OrbStack (Mac) or CLI (Linux).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;docker system prune -a --volumes&lt;/code&gt; &lt;strong&gt;once a week&lt;/strong&gt; (or at least twice a month).&lt;/li&gt;
&lt;li&gt;Reset Docker fully only if things are really broken.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your future self (and your SSD) will thank you.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>linux</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Fine-Tuned a Vision Transformer to Spot Deepfakes</title>
      <dc:creator>Rupam Golui</dc:creator>
      <pubDate>Mon, 12 May 2025 14:12:38 +0000</pubDate>
      <link>https://dev.to/agasta/building-virtus-how-i-fine-tuned-a-vision-transformer-to-spot-deepfakes-3fo9</link>
      <guid>https://dev.to/agasta/building-virtus-how-i-fine-tuned-a-vision-transformer-to-spot-deepfakes-3fo9</guid>
      <description>&lt;p&gt;This project started out as a hackathon idea — we wanted to create a practical tool that could detect deepfakes in images with high confidence. The goal? Build a complete multi-model deepfake classification framework that doesn’t just sound cool, but works.&lt;/p&gt;

&lt;p&gt;We named the image model &lt;strong&gt;Virtus&lt;/strong&gt; &lt;em&gt;(because hey, if you're fighting fakes, might as well sound noble)&lt;/em&gt;. I handled the image classification side &amp;amp; DevOps, while others tackled video detection and the frontend/backend.&lt;/p&gt;

&lt;p&gt;This post dives into how I built and trained Virtus: the thinking behind the model choice, dataset, training strategies, evaluation, and pushing it to Hugging Face. I’ll also sprinkle in some tips and lessons I picked up along the way — stuff I wish I knew before starting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Want to skip the reading and jump straight into the code? Here's the full training notebook on&lt;/em&gt; &lt;a href="https://github.com/Itz-Agasta/Lopt/blob/main/models/image/virtus.ipynb" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Choosing a Base Model: Why Vision Transformers?
&lt;/h2&gt;

&lt;p&gt;Initially, I considered the usual CNN suspects — &lt;a href="https://en.wikipedia.org/wiki/Residual_neural_network" rel="noopener noreferrer"&gt;ResNet&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/EfficientNet" rel="noopener noreferrer"&gt;EfficientNet&lt;/a&gt;, all the classics. But deepfakes are tricky. The difference between a real and fake face can be insanely subtle — we're talking fine textures, light inconsistencies, stuff that might get blurred out or overlooked by CNNs.&lt;/p&gt;

&lt;p&gt;So I started digging into &lt;a href="https://en.wikipedia.org/wiki/Vision_transformer" rel="noopener noreferrer"&gt;Vision Transformers (ViTs)&lt;/a&gt; — and let’s just say, I went down the rabbit hole.&lt;/p&gt;

&lt;p&gt;Turns out, ViTs aren't just trendy — they're &lt;strong&gt;built different&lt;/strong&gt;. While CNNs work with pixel grids and sliding filters, ViTs treat images like sequences, kind of like sentences. They split an image into patches (aka "visual tokens") and feed them into a transformer — the same architecture that powers modern NLP models like BERT and GPT.&lt;/p&gt;
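&lt;p&gt;To make the “patches as tokens” idea concrete, here’s a quick back-of-the-envelope sketch (mine, not from the model docs): a 224×224 input cut into 16×16 patches becomes 196 visual tokens.&lt;/p&gt;

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches a ViT slices an image into."""
    per_side = image_size // patch_size
    return per_side * per_side

# 224x224 input, 16x16 patches -> 14 * 14 = 196 patch tokens.
# (Distilled DeiT also prepends a class token and a distillation token.)
print(num_patches(224, 16))
```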

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhqpavigvzz1je4nj5bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhqpavigvzz1je4nj5bo.png" alt="CNN vs. ViT: FLOPs and throughput comparison of CNN and Vision Transformer Models" width="800" height="313"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;CNN vs. ViT: FLOPs and throughput comparison of CNN and Vision Transformer Models  – &lt;a href="https://arxiv.org/pdf/2012.12556.pdf" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ViTs actually have &lt;strong&gt;weaker inductive bias&lt;/strong&gt; compared to CNNs — which sounds bad, but it means they don’t assume as much about the structure of images. With enough data (or strong augmentations), they learn better generalizations. And here’s the kicker: they can &lt;strong&gt;outperform CNNs with 4x fewer computational resources&lt;/strong&gt;, which was a huge win for my Kaggle GPU budget.&lt;/p&gt;

&lt;p&gt;If you want a more in-depth comparison, check out &lt;a href="https://viso.ai/deep-learning/vision-transformer-vit/" rel="noopener noreferrer"&gt;this awesome article&lt;/a&gt; — seriously, worth a read.&lt;/p&gt;

&lt;p&gt;Eventually, I ended up choosing &lt;code&gt;facebook/deit-base-distilled-patch16-224&lt;/code&gt;, a ViT that’s been distilled from a CNN teacher model. It’s lightweight (only 87M parameters), fast to train, and surprisingly accurate — even outperforming the standard ViT-Base on ImageNet-1k. Plus, it doesn’t need massive compute or crazy pretraining to get good results, which made it perfect for our hackathon timeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2cxrzxdbqyjezushvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2cxrzxdbqyjezushvw.png" alt="Vision Transformer ViT Architecture" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Vision Transformer ViT Architecture - &lt;a href="https://github.com/google-research/vision_transformer" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’ve got more GPU headroom or a larger dataset, there are beefier models out there like &lt;code&gt;google/vit-large-patch16-224-in21k&lt;/code&gt;, &lt;code&gt;vit-base-patch32&lt;/code&gt;, or even the &lt;code&gt;384px version of DeiT&lt;/code&gt; — but for this project, I wanted something fast, efficient, and reliable. DeiT hit that sweet spot.&lt;/p&gt;


&lt;h2&gt;
  
  
  Data Preparation: Deepfake Data Is Messy
&lt;/h2&gt;

&lt;p&gt;I started with a &lt;a href="https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images" rel="noopener noreferrer"&gt;Kaggle dataset&lt;/a&gt; that had around 190,000 labeled images of real and fake faces — a solid foundation to begin with. On top of that, I manually added a bunch of extra samples I’d collected from other sources to make things a bit more diverse. Everything was organized into &lt;strong&gt;Real/&lt;/strong&gt; and &lt;strong&gt;Fake/&lt;/strong&gt; folders, so loading them with &lt;code&gt;Path.glob&lt;/code&gt; was smooth sailing.&lt;/p&gt;
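&lt;p&gt;The loading step looks roughly like this (a minimal sketch with a hypothetical &lt;code&gt;dataset/&lt;/code&gt; root — the actual notebook builds a pandas DataFrame from these rows):&lt;/p&gt;

```python
from pathlib import Path

def load_image_rows(root: str) -> list[dict]:
    """Collect {"image": path, "label": name} rows from Real/ and Fake/ subfolders."""
    rows = []
    for label in ("Real", "Fake"):
        for path in sorted((Path(root) / label).glob("*")):
            rows.append({"image": str(path), "label": label})
    return rows

# rows = load_image_rows("dataset/")  # expects dataset/Real/ and dataset/Fake/
```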

&lt;p&gt;After loading the dataset, I quickly noticed the class distribution was skewed — one of the classes (either real or fake) had noticeably more images than the other. &lt;strong&gt;&lt;em&gt;That’s not great for training, since the model might just learn to always predict the majority class.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To fix that, I used &lt;a href="https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html" rel="noopener noreferrer"&gt;&lt;code&gt;RandomOverSampler&lt;/code&gt;&lt;/a&gt; to duplicate samples from the underrepresented class. It’s a quick and dirty way to balance things — works surprisingly well for binary classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomOverSampler&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;

&lt;span class="c1"&gt;# Separate out the labels before resampling
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Oversample to balance the classes
&lt;/span&gt;&lt;span class="n"&gt;ros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomOverSampler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ros&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Stick the labels back
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt;
&lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Clean up some memory — just to be safe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now with the classes balanced, I converted the DataFrame into a Hugging Face Dataset object. This makes everything later (transforms, batching, etc.) super smooth. Before that, I mapped string labels (&lt;code&gt;"Real"&lt;/code&gt; / &lt;code&gt;"Fake"&lt;/code&gt;) to numeric IDs (&lt;code&gt;0&lt;/code&gt; / &lt;code&gt;1&lt;/code&gt;). Hugging Face has a ClassLabel feature built exactly for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClassLabel&lt;/span&gt;

&lt;span class="c1"&gt;# Define the class order explicitly
&lt;/span&gt;&lt;span class="n"&gt;labels_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Real&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fake&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;class_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClassLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Label encoding function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;map_label2id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;str2int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;

&lt;span class="c1"&gt;# Apply the mapping to the dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map_label2id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Ensures label column behaves like an integer class
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, I split the dataset into &lt;code&gt;60% for training&lt;/code&gt; and &lt;code&gt;40% for testing&lt;/code&gt;. I also made sure the label distribution was preserved across both splits using &lt;code&gt;stratify_by_column&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Train-test split with stratified labels
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify_by_column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How I Trained Virtus — Step-by-Step
&lt;/h2&gt;

&lt;p&gt;Alright, now comes the fun part — training the model. We’re going to fine-tune &lt;code&gt;facebook/deit-base-distilled-patch16-224&lt;/code&gt; on our deepfake dataset using Hugging Face's &lt;code&gt;Trainer&lt;/code&gt; API.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick side note:&lt;/strong&gt; I trained everything inside a Kaggle Notebook using a single &lt;code&gt;NVIDIA P100 GPU&lt;/code&gt;. After some trial runs, I found it performed noticeably better than the &lt;code&gt;T4s&lt;/code&gt; (even the dual T4 setup Kaggle sometimes gives you). Turns out, the P100 has higher memory bandwidth and better raw compute — which really helps when you're fine-tuning ViTs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Kaggle isn’t your thing, no worries. I also tried &lt;a href="//lightning.ai"&gt;Lightning AI Studio&lt;/a&gt; and &lt;a href="https://studiolab.sagemaker.aws/" rel="noopener noreferrer"&gt;AWS Studio Lab&lt;/a&gt;, and both were solid freemium options. Way more stable than Google Colab, honestly. With Colab’s free tier, I kept hitting runtime errors, couldn’t even get a GPU some days, and the lack of persistent storage was a dealbreaker. Hardware quality also felt kinda... meh.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Bonus tip:&lt;/strong&gt; If you’re training locally, try managing your Python environment with &lt;code&gt;uv&lt;/code&gt;. It’s ridiculously fast. Plus, you won’t fall into &lt;code&gt;dependency hell™&lt;/code&gt;, which is honestly half the battle when setting up ML projects.  Follow &lt;a href="https://github.com/Itz-Agasta/Lopt/tree/main/models#setup-instructions" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt; if you wanna give it a spin — &lt;em&gt;&lt;strong&gt;highly recommend&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 1: Preprocessing and Augmentation
&lt;/h3&gt;

&lt;p&gt;Before throwing data at the model, we need to make sure the input images are normalized &lt;em&gt;exactly&lt;/em&gt; the way the pre-trained ViT expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ViTImageProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torchvision.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomRotation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomAdjustSharpness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Normalize&lt;/span&gt;

&lt;span class="n"&gt;model_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook/deit-base-distilled-patch16-224&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ViTImageProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;image_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_std&lt;/span&gt;
&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I defined two sets of transforms — &lt;code&gt;one for training&lt;/code&gt; (with augmentations) and &lt;code&gt;one for validation&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_train_transforms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="nc"&gt;RandomRotation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;RandomAdjustSharpness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nc"&gt;Normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;_val_transforms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="nc"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nc"&gt;Normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why these augmentations?&lt;/strong&gt; &lt;em&gt;Deepfakes can vary a lot depending on the source. A little rotation and sharpness tweaks help the model generalize to those variations. No augmentation for validation though — we want to keep that clean.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2: Applying Transforms
&lt;/h3&gt;

&lt;p&gt;I used Hugging Face’s &lt;code&gt;set_transform&lt;/code&gt; method to apply the preprocessing on-the-fly. This keeps RAM usage low and plays nicely with their Dataset objects. One gotcha: the transform’s output replaces the example dict entirely, so the &lt;code&gt;label&lt;/code&gt; column has to be passed through explicitly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pixel_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;_train_transforms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pixel_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;_val_transforms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Custom Collate Function
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Trainer&lt;/code&gt; needs batches of images and labels. Here's a simple collate function to stack tensors correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collate_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pixel_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pixel_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pixel_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
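&lt;p&gt;A quick sanity check helps here. This is the same &lt;code&gt;collate_fn&lt;/code&gt; repeated with dummy tensors (not real images) so it runs standalone, just to confirm the batch shapes come out the way the &lt;code&gt;Trainer&lt;/code&gt; expects:&lt;/p&gt;

```python
import torch

def collate_fn(examples):
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    labels = torch.tensor([e["label"] for e in examples])
    return {"pixel_values": pixel_values, "labels": labels}

# Two fake 3x224x224 "images" stand in for real ViT inputs
batch = collate_fn([
    {"pixel_values": torch.zeros(3, 224, 224), "label": 0},
    {"pixel_values": torch.zeros(3, 224, 224), "label": 1},
])
print(batch["pixel_values"].shape)  # torch.Size([2, 3, 224, 224])
print(batch["labels"])              # tensor([0, 1])
```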



&lt;h3&gt;
  
  
  Step 4: Loading the Model
&lt;/h3&gt;

&lt;p&gt;Now we bring in the ViT model with the correct number of labels and label mappings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ViTForImageClassification&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ViTForImageClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label2id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Real&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fake&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id2label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Real&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fake&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: TrainingArguments
&lt;/h3&gt;

&lt;p&gt;These settings worked well for me: a small learning rate, just a couple of epochs, per-epoch checkpointing, and a cap on saved checkpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;logging_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Tiny learning rate to avoid overshooting on a sensitive task like classification
&lt;/span&gt;    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_eval_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# 2 epochs were enough to converge for my dataset; more can overfit
&lt;/span&gt;    &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Helps regularize and reduce overfitting
&lt;/span&gt;    &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Linearly ramps up LR at the start for more stable training
&lt;/span&gt;    &lt;span class="n"&gt;load_best_model_at_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_total_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;report_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Train Time!
&lt;/h3&gt;

&lt;p&gt;Let’s go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collate_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Required even if we don’t use text
&lt;/span&gt;    &lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_ids&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It trained in around 2 hours on Kaggle’s GPU runtime and reached &lt;strong&gt;~99.2% accuracy&lt;/strong&gt;. Not bad for just two epochs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evaluation: Did Virtus Actually Learn Anything?
&lt;/h2&gt;

&lt;p&gt;After training wrapped up, I wanted to be sure the model wasn’t just memorizing the training data. Hugging Face's &lt;code&gt;Trainer&lt;/code&gt; makes it dead simple to evaluate performance on the test set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run evaluation on the test set
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gave me some solid metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'eval_loss': 0.0248, 'eval_accuracy': 0.9919, ...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah — over &lt;strong&gt;99% accuracy&lt;/strong&gt;. I double-checked this wasn’t a fluke by inspecting predictions manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on test data
&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# See predicted vs actual for the first 5 samples
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_ids&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;id2label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Actual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;id2label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it matched up nicely.&lt;/p&gt;

&lt;p&gt;To dig deeper, I calculated the macro F1 score and plotted the confusion matrix to visualize how well the model was doing on both classes. And here's the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqyi6sur8kvcy69rxxhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqyi6sur8kvcy69rxxhr.png" alt="Confusion matrix for virtus" width="682" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The matrix was basically diagonal — which means the model was nailing both classes. &lt;/p&gt;
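&lt;p&gt;The notebook code for those two metrics isn't shown above, so here's a minimal sketch of how you'd compute them with scikit-learn. The real inputs are the &lt;code&gt;preds&lt;/code&gt; and &lt;code&gt;labels&lt;/code&gt; arrays from &lt;code&gt;trainer.predict&lt;/code&gt; earlier; the dummy arrays below just keep the snippet runnable:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# preds / labels are the arrays produced by trainer.predict(test_data);
# this helper just wraps the two metrics I looked at
def summarize(preds, labels):
    macro_f1 = f1_score(labels, preds, average="macro")
    cm = confusion_matrix(labels, preds)  # rows = actual class, cols = predicted
    return macro_f1, cm

# Dummy arrays so the snippet runs standalone; swap in the real preds/labels
macro_f1, cm = summarize(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))
print(f"Macro F1: {macro_f1:.4f}")  # Macro F1: 0.7333
print(cm)
```

&lt;p&gt;For the plot itself, &lt;code&gt;sklearn.metrics.ConfusionMatrixDisplay&lt;/code&gt; can render the matrix as a heatmap in one call.&lt;/p&gt;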




&lt;h2&gt;
  
  
  Publishing to Hugging Face: Share It With the World
&lt;/h2&gt;

&lt;p&gt;Once I was happy with Virtus, I wanted to make it public. Hugging Face Hub is the easiest way to share models — and you can even push directly from a Kaggle notebook using &lt;a href="https://www.kaggle.com/discussions/getting-started/450378" rel="noopener noreferrer"&gt;secrets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, install the CLI tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; huggingface_hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then authenticate using a token (I stored mine using Kaggle secrets, but you can use &lt;code&gt;huggingface-cli login&lt;/code&gt; locally):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_repo&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kaggle_secrets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UserSecretsClient&lt;/span&gt;

&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UserSecretsClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create your repo (this can be done via the website too, but I like automation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;create_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agasta/virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;private&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, push both the model and its image processor (so others don’t have to guess your preprocessing steps):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoFeatureExtractor&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoFeatureExtractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_to_hub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agasta/virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;extractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_to_hub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agasta/virtus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boom. Your model’s live.&lt;/p&gt;

&lt;p&gt;👉 Check it out here: &lt;a href="https://huggingface.co/agasta/virtus" rel="noopener noreferrer"&gt;https://huggingface.co/agasta/virtus&lt;/a&gt;&lt;/p&gt;
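&lt;p&gt;Once it's public, anyone can try it in a couple of lines via the &lt;code&gt;pipeline&lt;/code&gt; API. A quick sketch (&lt;code&gt;your_image.jpg&lt;/code&gt; is a placeholder; point it at any image file or URL):&lt;/p&gt;

```python
from transformers import pipeline

def classify(path, repo="agasta/virtus"):
    # Downloads the published checkpoint (and its preprocessing config) from the Hub
    clf = pipeline("image-classification", model=repo)
    return clf(path)  # a list of {'label': ..., 'score': ...} dicts

# classify("your_image.jpg")  # "your_image.jpg" is a placeholder path
```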

&lt;p&gt;But we’re not done yet.&lt;/p&gt;

&lt;p&gt;In the next blog, I’ll show you how to wrap this model in a FastAPI-powered backend, make it production-ready, and deploy it like a real-world service — something you can actually integrate into an app or use in a real-time system.&lt;/p&gt;

&lt;p&gt;If you're into this kind of stuff — AI, web3, backend dev, DevOps, hackathon builds — follow me on X &lt;a href="https://x.com/idkAgasta" rel="noopener noreferrer"&gt;@idkAgasta&lt;/a&gt;. I post cool projects, quick tips, and sometimes just chaos.&lt;/p&gt;

&lt;p&gt;Until next time: keep building, keep shipping, nerds.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>hackathon</category>
    </item>
  </channel>
</rss>
