Dhananjay Lakkawar
I Thought Fine-Tuning Needed an ML Team. I Was Wrong.

A few months ago, I almost killed a feature.

Not because it didn’t work,
but because improving it felt… impossible.

We had an AI system in production.
Users were interacting with it daily.

And they were doing something incredibly valuable:

👎 Clicking “thumbs down”

At first, we treated it like a metric.

Then it hit me:

That is the dataset.


🧠 The Moment Everything Clicked

Every time a user said:

  • “this is wrong”
  • “this isn’t helpful”
  • “this makes no sense”

They were giving us:

real-world training data

Not synthetic.
Not curated.
Not delayed.

Raw. Messy. Honest.

And we were… ignoring it.

Because like most teams, we thought:

“Fine-tuning is expensive. We’ll deal with it later.”


⚠️ The Lie Most Founders Believe

Fine-tuning has a reputation problem.

You hear it and think:

  • GPU clusters
  • ML engineers
  • weeks of experimentation

That’s true for large-scale research.

But for a product?

It’s overkill.


🔁 The Shift: From Pipelines to Loops

Instead of building a “training pipeline,”
we built a feedback loop.

Small difference. Massive impact.


⚙️ What We Actually Built

Nothing fancy.

Just:

  • SQS → store feedback
  • Lambda → decide when to train
  • Batch + Spot GPU → run training
  • S3 → store model versions

That’s it.

No always-on infrastructure.
No ML team.
No pipeline monster.
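To make the "decide when to train" step concrete, here's a minimal sketch of the policy our Lambda-style trigger could implement. The names and thresholds (`MIN_EXAMPLES`, `MAX_DAYS_BETWEEN`) are illustrative, not production values:

```python
MIN_EXAMPLES = 200        # don't retrain on a handful of corrections
MAX_DAYS_BETWEEN = 14     # but don't let the model go stale either

def should_train(pending_examples: int, days_since_last_run: int) -> bool:
    """Trigger a training job when enough feedback has accumulated,
    or when there's some feedback and it's been too long."""
    if pending_examples >= MIN_EXAMPLES:
        return True
    return pending_examples > 0 and days_since_last_run >= MAX_DAYS_BETWEEN

# Inside the handler, this gates the (hypothetical) Batch submission:
# if should_train(queue_depth, days_idle):
#     batch.submit_job(...)  # Spot GPU job, checkpoints to S3
```

The point of the two thresholds: batching feedback keeps GPU spend near zero during quiet weeks, while the staleness cap guarantees the loop never fully stalls.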


💡 The Part Nobody Tells You

This only works if you fix one thing:

❌ “thumbs down” is not enough

A negative signal tells you:

something is wrong

But not:

what is right

So we added one tiny UX change:

👉 “What should it have said instead?”

That single input:

  • improved training quality dramatically
  • reduced noise
  • made the model actually improve
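Here's a rough sketch of what that correction field buys you. Assuming a hypothetical event schema (field names are illustrative, not our actual payload), a thumbs-down without a correction gets dropped as noise, while one with a correction becomes a clean training pair:

```python
from typing import Optional

# One feedback event, as it might land in the queue.
feedback = {
    "prompt": "Summarize this invoice",
    "model_output": "The invoice totals $500.",            # what the model said
    "rating": "thumbs_down",
    "correction": "The invoice totals $540, due March 1.", # the new UX field
}

def to_training_pair(event: dict) -> Optional[dict]:
    """Keep only events that carry a usable correction; a bare
    negative signal tells us something is wrong but not what is right."""
    if event["rating"] == "thumbs_down" and event.get("correction"):
        return {"input": event["prompt"], "target": event["correction"]}
    return None
```

Everything `to_training_pair` returns `None` for is still worth logging, but it never reaches the training set.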

⚠️ Where We Almost Broke Everything

This is where most blog posts lie to you.

1. We shipped a worse model

The first time we automated training:

  • accuracy dropped
  • responses got inconsistent

Why?

Because we skipped evaluation.

Now:

  • every model is tested before deployment
  • bad versions never go live
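A minimal version of that deployment gate might look like the sketch below. The scores and margin are assumptions; the essential idea is that a candidate only goes live if it beats the current model on a held-out eval set:

```python
def passes_gate(candidate_score: float, live_score: float,
                margin: float = 0.01) -> bool:
    """Promote the candidate only if it beats the live model by a
    small margin; ties and regressions keep the old version serving."""
    return candidate_score >= live_score + margin
```

The margin matters: without it, eval noise alone can flip-flop you between two roughly equal models on every run.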

2. Spot instances killed our jobs

We loved the cost savings…
until training jobs randomly died.

Turns out:

Spot instances can terminate anytime

Fix:

  • checkpoint training to S3
  • retry automatically
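A sketch of that fix, with local JSON files standing in for S3 keys (in production, each checkpoint would be uploaded to something like an `s3://bucket/checkpoints/` prefix). On restart, the job resumes from the newest checkpoint, so an interruption only loses work since the last save:

```python
import json
import os
from typing import Optional

def save_checkpoint(dirpath: str, step: int, state: dict) -> None:
    """Persist training state; in production this write goes to S3."""
    with open(os.path.join(dirpath, f"step_{step}.json"), "w") as f:
        json.dump({"step": step, "state": state}, f)

def latest_checkpoint(dirpath: str) -> Optional[dict]:
    """Find the newest checkpoint, if any survive a prior run."""
    ckpts = [f for f in os.listdir(dirpath) if f.startswith("step_")]
    if not ckpts:
        return None
    newest = max(ckpts, key=lambda f: int(f.split("_")[1].split(".")[0]))
    with open(os.path.join(dirpath, newest)) as f:
        return json.load(f)

def train(dirpath: str, total_steps: int, every: int = 10) -> int:
    """Run (or resume) training, checkpointing every `every` steps.
    Returns the step this run actually started from."""
    resume = latest_checkpoint(dirpath)
    start = resume["step"] + 1 if resume else 0
    for step in range(start, total_steps):
        # ... one real optimizer step would go here ...
        if step % every == 0:
            save_checkpoint(dirpath, step, {"loss": 0.0})
    return start
```

Pair this with automatic retries on the job definition and a Spot termination stops being a failure; it's just a pause.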

3. Costs weren’t zero (but close)

We expected “almost free.”

Reality:

  • small but real costs from SQS, logs, storage
  • occasional spikes from training

Nothing scary — but not $0 either.


💰 What This Actually Costs

Here’s what we see at early-stage scale:

| Component | What you pay for | Monthly cost |
| --- | --- | --- |
| SQS | requests (1M free tier) | $1–3 |
| Lambda | executions + duration | $1–10 |
| S3 | storage + requests | $1–5 |
| Batch | orchestration | $0 |
| GPU (Spot) | training time | $5–30 |
| Logs + misc | CloudWatch etc. | $1–10 |

Total:

👉 ~$10 to $60/month

The reason it’s cheap is simple:

Nothing runs unless users give feedback


🧠 The Real Insight

This isn’t about infrastructure.

It’s about mindset.

Most teams think:

“We’ll improve the model later”

The better approach:

Let users improve it continuously


🏆 What Changed After We Shipped This

  • The model improved every week
  • Edge cases started disappearing
  • Users noticed

But more importantly:

We stopped guessing what users wanted


⚠️ What I Would Do Differently

If I had to rebuild this:

1. Start collecting feedback on day 1

Not after launch

2. Force correction input early

Not optional

3. Add evaluation before automation

Not after breaking production


🧾 Final Thought

You don’t need:

  • a research team
  • expensive infrastructure
  • complex pipelines

You need:

  • a feedback loop
  • a trigger
  • and a way to not make things worse

🔥 One Line That Changed How I Think About AI Systems

Your model doesn’t get better when you train it.
It gets better when users correct it.


Curious how others are doing this:

👉 Are you collecting feedback but not using it?
👉 Or already closing the loop?

Let’s talk 👇
