Haripriya Veluchamy

Posted on Jun 7

How I Built a Self-Resizing EC2 for My ML Data

#devops #cloud #aws #ai

The Pain Point

I'm on an ML learning journey. That means a lot of data. A lot of processing. And a lot of AWS free credits I really can't afford to waste.

Here's what my typical day looked like:

I'd spin up a t3.large to run a data pipeline. Load some datasets, process them, store them. The pipeline would run for a couple of hours; sometimes I didn't even know exactly how long it would take. Then I'd go to sleep.

Next morning I'd check CloudWatch and realise the VM had been sitting idle since 3AM. Running. Doing nothing. Burning credits.

That's the problem. You need a big machine for data loading. You don't need it after.

The First Thought: Let AWS ML Decide

My first instinct was to use AWS Compute Optimizer. It's a managed ML service that analyses your EC2 usage patterns and recommends the right instance type. Smart, right?

I enabled it. Waited. And waited.

Turns out Compute Optimizer needs at least 30 consecutive hours of usage data before it generates recommendations. For a VM I spin up occasionally for pipeline runs, that's not practical.

So I moved on.

The Second Thought: Bedrock Agent

Next idea: use Amazon Bedrock as an agent to reason about when and how to resize. Let an LLM decide.

But the more I thought about it, the more it felt like overkill. The decision isn't complex:

"Did the pipeline finish? Yes → resize down. No → don't."

That's not a reasoning problem. That's an automation problem. Using Bedrock here would be AI washing: adding complexity without adding value.

So I kept it simple.

The Solution: Event-Driven VM Resize

What I built: a lightweight agent that listens for your pipeline to complete, then automatically resizes the VM down.

Stack:

emit_event.py: runs on your VM, fires when the pipeline exits
Amazon EventBridge: receives the event
AWS Lambda: handles the resize logic
Amazon SNS: sends an email alert (success or failure)

Pipeline finishes
      ↓
emit_event.py → EventBridge
      ↓
Lambda triggered
      ↓
SUCCESS → resize down + email
FAILURE → keep size (debug) + email

No ML. No LLM. Just the right tool for the job.
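To make the first hop concrete, here is a minimal sketch of what the emitter side could look like. It is not the repo's actual emit_event.py: the bus name, source string, detail-type, and detail fields are all names I made up for illustration. The only real API used is EventBridge's PutEvents (via boto3's put_events).

```python
import json

# Hypothetical names for illustration; the real values live in the repo's template.
EVENT_BUS_NAME = "vm-resize-bus"
EVENT_SOURCE = "vm.resize.agent"


def build_event_entry(pipeline, status, steps_done, steps_total, instance_id):
    """Build one entry for EventBridge's PutEvents API."""
    detail = {
        "pipeline": pipeline,
        "status": status,  # "SUCCESS" or "FAILURE"
        "steps_done": steps_done,
        "steps_total": steps_total,
        "instance_id": instance_id,
    }
    return {
        "Source": EVENT_SOURCE,
        "DetailType": "PipelineFinished",
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
        "EventBusName": EVENT_BUS_NAME,
    }


def emit(entry, region="us-east-1"):
    """Send the entry to EventBridge (needs boto3 and AWS credentials on the VM)."""
    import boto3  # imported here so build_event_entry works without boto3 installed

    client = boto3.client("events", region_name=region)
    response = client.put_events(Entries=[entry])
    # put_events does not raise on a rejected entry, so check the count explicitly
    if response["FailedEntryCount"] > 0:
        raise RuntimeError(f"EventBridge rejected the event: {response['Entries']}")
    return response
```

Keeping the payload builder separate from the network call means you can check the event shape locally before touching AWS.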

Architecture

Your VM                        AWS Cloud
──────────────────             ─────────────────────
run_pipeline.sh                EventBridge (custom bus)
  step1.py               →          ↓
  step2.py                     Lambda
  step3.py                       ├── stop EC2
  emit_event.py ──────────→      ├── resize instance type
                                 ├── start EC2
                                 └── SNS email alert

All AWS resources are deployed with a single CloudFormation command. No manual console clicking.
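The Lambda side of that diagram boils down to three EC2 calls in a fixed order. Below is a rough sketch, assuming the event detail carries a status and an instance_id field (both hypothetical field names) and that a boto3 EC2 client is passed in. The function names are mine, not the repo's, and the SNS step is left out to keep it short.

```python
def resize_instance(ec2, instance_id, target_type):
    """Stop the instance, change its type, and start it again.

    `ec2` is a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # The instance type can only be changed while the instance is stopped
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": target_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])


def handle_event(event, ec2, target_type="t3.medium"):
    """Resize only when the pipeline reported SUCCESS. Returns True if resized."""
    detail = event["detail"]  # EventBridge hands the detail to Lambda as a dict
    if detail["status"] != "SUCCESS":
        return False  # keep the big instance so it can be debugged
    resize_instance(ec2, detail["instance_id"], target_type)
    return True
```

In a real Lambda you would wrap this in a handler(event, context) that creates the client once at import time. The waiter can take a while, so the function timeout has to be longer than the time the instance needs to stop.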

Setup: 3 Steps

Step 1: Deploy AWS infrastructure

git clone https://github.com/Harivelu0/vm-resize-agent
cd vm-resize-agent

aws cloudformation deploy \
  --template-file infra/template.yaml \
  --stack-name vm-resize-agent \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
      AlertEmail=your@email.com \
      TargetInstanceType=t3.medium \
  --region us-east-1

Check your email → confirm the AWS subscription link.

Step 2: Copy agent to your VM

scp -i your-key.pem -r agent/ pipeline/ \
  ec2-user@your-vm-ip:~/vm-resize-agent/

# on VM
pip install boto3
aws configure

Step 3: Add your pipeline steps

Edit pipeline/steps.conf:

python3 /home/user/myproject/fetch_data.py
python3 /home/user/myproject/transform.py
python3 /home/user/myproject/load_db.py

Run:

bash pipeline/run_pipeline.sh

That's it. When the pipeline finishes → VM resizes → email arrives.

The Key Design Decision: steps.conf

One thing I was particular about: this tool should work for anyone's pipeline, not just mine.

So run_pipeline.sh never changes. Users only edit steps.conf: one command per line, one step per line.

# steps.conf
python3 /home/user/fetch_gdelt.py
python3 /home/user/build_forecasts.py
python3 /home/user/calibrate.py

The wrapper reads each line, runs it in order, tracks success/failure, and emits the event at the end. Generic by design.
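The real wrapper is a bash script, but the logic is easy to sketch in Python. This version is an illustration only, and it makes two assumptions: lines starting with # are comments, and the run stops at the first failing step. The function names and result fields are hypothetical.

```python
import shlex
import subprocess


def parse_steps(conf_text):
    """Return the commands in steps.conf, skipping blank lines and # comments."""
    steps = []
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            steps.append(line)
    return steps


def run_command(command):
    """Run one command and return its exit code."""
    return subprocess.run(shlex.split(command)).returncode


def run_steps(steps, runner=run_command):
    """Run steps in order and stop at the first non-zero exit code."""
    completed = 0
    for index, command in enumerate(steps, start=1):
        if runner(command) != 0:
            return {"status": "FAILURE", "steps_done": completed,
                    "steps_total": len(steps), "failed_at": f"step_{index}"}
        completed += 1
    return {"status": "SUCCESS", "steps_done": completed,
            "steps_total": len(steps), "failed_at": None}
```

The runner argument is only there so the loop can be exercised with a fake command runner. The returned dictionary is the kind of summary the emitter would send on.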

The Email Alert

On success:

Pipeline: data-loader
Status:   SUCCESS
Steps:    3/3
Duration: 45m 12s
Instance: i-0abc123
Resized:  Yes -> t3.medium

On failure:

Pipeline:  data-loader
Status:    FAILURE
Steps:     2/3
Failed at: step_3
Instance:  i-0abc123
Resized:   No (kept original size for debugging)

The failure case is important: the VM intentionally stays large so you can SSH in and debug without losing state.
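For reference, both email bodies can be produced by one small formatter. This is a sketch, not the repo's code: the detail field names (pipeline, steps_done, duration_seconds, failed_at and so on) are assumptions, and the result would be passed as the Message argument of SNS publish.

```python
def format_duration(seconds):
    """Format a number of seconds as e.g. '45m 12s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs:02d}s"


def format_alert(detail, target_type):
    """Build the plain-text email body from the event detail."""
    succeeded = detail["status"] == "SUCCESS"
    rows = [
        ("Pipeline", detail["pipeline"]),
        ("Status", detail["status"]),
        ("Steps", f"{detail['steps_done']}/{detail['steps_total']}"),
    ]
    if succeeded:
        rows.append(("Duration", format_duration(detail["duration_seconds"])))
    else:
        rows.append(("Failed at", detail["failed_at"]))
    rows.append(("Instance", detail["instance_id"]))
    if succeeded:
        rows.append(("Resized", f"Yes -> {target_type}"))
    else:
        rows.append(("Resized", "No (kept original size for debugging)"))
    # Pad every "Label:" to the widest one so the values line up in a column
    width = max(len(label) for label, _ in rows) + 1
    return "\n".join((label + ":").ljust(width) + " " + str(value)
                     for label, value in rows)
```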

Difficulties I Faced

1. IP changes after resize

When Lambda stops and starts the EC2 instance, the instance gets a new public IP unless it has an Elastic IP attached. Learned this the hard way when SSH stopped working. Fix: assign an Elastic IP before running demos.

aws ec2 allocate-address --region us-east-1
aws ec2 associate-address \
  --instance-id i-xxx \
  --allocation-id eipalloc-xxx

2. Docker goes down after resize

An EC2 resize means a stop and start, so the VM reboots and any running Docker containers stop. Add this to /etc/rc.local on your VM so services restart automatically:

sudo service docker start
cd /home/ec2-user/myproject && docker-compose up -d

Live Demo

For the demo I used a real weather dataset loaded into Postgres running in Docker on the EC2.

EC2: t3.large (before)
       ↓
Pipeline runs: download → parse → load into Postgres
       ↓
emit_event.py fires automatically when done
       ↓
Lambda: stops EC2 → resizes → starts EC2
       ↓
EC2: t3.medium (after)
       ↓
Email arrives in Gmail

Cost Reality

Resource       Monthly Cost
EventBridge    ~$0 (free tier)
Lambda         ~$0 (free tier)
SNS email      ~$0 (first 1,000 free)
Total stack    ~$0

The savings depend on your instance. A t3.large that sits idle for 20 hrs/day wastes roughly $25/month compared with spending those hours on a t3.medium. For larger instances like c5.4xlarge, you're looking at $200+ saved per month.
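Here is the arithmetic behind the t3.large number, as a quick sketch. The hourly prices are my assumption of us-east-1 Linux on-demand rates at the time of writing, so check the current AWS pricing page before relying on them.

```python
# Assumed us-east-1 Linux on-demand prices in USD per hour; verify before use.
HOURLY_PRICE = {"t3.large": 0.0832, "t3.medium": 0.0416}


def monthly_savings(big, small, idle_hours_per_day, days=30):
    """USD saved per month by running `small` instead of `big` during idle hours."""
    return (HOURLY_PRICE[big] - HOURLY_PRICE[small]) * idle_hours_per_day * days


print(round(monthly_savings("t3.large", "t3.medium", 20), 2))  # 24.96
```

Add your own instance types to the price table to estimate what a resize would save for your setup.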

When to Use This

This is for you if:

You run data pipelines on EC2 manually or on a schedule
Your pipeline takes unpredictable time to complete
You don't need the heavy instance after loading is done
You're not ready for EMR or Glue yet (pre-pipeline stage)

This is not for you if:

You're already using EMR Serverless or Glue (they auto-terminate)
Your pipeline runs less than 30 minutes (manual resize is fine)
You need horizontal scaling (use Auto Scaling Groups instead)

Why a VM and Not Glue or EMR?

Honest answer: I didn't know enough about my data yet to make that decision.

Glue and EMR are great services. But they come with questions you need to answer upfront:

What's your data format?
What transformations do you need?
What's the volume?
Do you need Spark?

When you're learning, you don't have those answers yet. You just need to load data, see what you're working with, and prove the pipeline works.

A VM lets you do that with zero infrastructure decisions. Just Python scripts. When the pipeline is proven and you understand your data, then you migrate to the right managed service.

This is that phase. Before you know which service you need.

This is actually how real teams work too. Nobody starts with EMR on day one of a new data project. You explore first. You prove it works. Then you scale.

Phase 1: Explore         → VM + Python scripts
Phase 2: Prove it works  → VM + vm-resize-agent
Phase 3: Scale           → Glue / EMR / Spark

Repo

Everything is open source. Clone, edit steps.conf, deploy.

GitHub: https://github.com/Harivelu0/vm-resize-agent

What's Next

A few things I want to add:

Auto resize back up before next scheduled run (cron-based)
Slack alerts alongside email
Support for Azure VMs (same pattern, different SDK)

If you have ideas or run into issues, open a GitHub issue. Happy to help.

Building in public as part of my ML learning journey. Follow along for more practical AWS patterns.

DEV Community