DEV Community: Utkarsh Bahuguna

MIRR: An RL Environment Where Gemma 4 Gets Graded on How It Thinks, Not Just What It Answers

Utkarsh Bahuguna — Tue, 19 May 2026 07:30:55 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

🏆 An earlier version of this project finished in the top 50 out of 8,000+ teams at the Meta × PyTorch Hackathon (see LinkedIn post). This submission rebuilds it on Gemma 4, with reasoning-quality scoring designed around Gemma 4's native thinking modes.

What I Built

MIRR is a stateful, RL-compatible environment where a Gemma 4 agent debugs failures in a simulated microservice system the way an on-call SRE would. The agent pulls logs, queries metrics, walks the service graph, and commits to a root-cause hypothesis under uncertainty.

Here's the problem I wanted to solve. Most "agent benchmarks" today reward only the outcome. Did the agent fix it? Yes / no. That's a terrible signal for incident response, where a good engineer can be right for bad reasons (lucky pattern match), and a great engineer can be wrong for excellent reasons (a sensible hypothesis ruled out by data the team didn't have). If we want LLMs that on-call engineers actually trust at 3 AM, we have to score how they think, not just whether they happened to land on the answer.

MIRR introduces a novel diagnose() action that does exactly that. Every time the agent commits to a root-cause hypothesis, the environment scores:

Reasoning quality: causal chain validity, evidence cited, alternatives ruled out
Outcome correctness: did the hypothesis actually match the injected fault

…separately, with independent reward signals. That makes the env legible to RL fine-tuning with TRL. You can train Gemma 4 to be a better diagnostician, not just a luckier guesser.
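To make the split concrete, here is a minimal sketch of what two independent reward signals could look like. It is not MIRR's actual scoring code: the `Diagnosis` and `GroundTruth` shapes, the field names, and the equal weighting of the three reasoning components are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical payload of a diagnose() action."""
    root_cause: str          # e.g. "payments:memory_leak"
    causal_chain: list[str]  # ordered services the agent blames
    evidence_ids: set[str]   # IDs of the logs/metrics the agent cited
    ruled_out: set[str]      # alternative hypotheses the agent dismissed


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical description of the injected fault."""
    root_cause: str
    causal_chain: list[str]
    relevant_evidence: set[str]
    plausible_alternatives: set[str]


def _fraction(hit: int, total: int) -> float:
    # Nothing to find counts as full marks for that component.
    return hit / total if total else 1.0


def reasoning_reward(d: Diagnosis, truth: GroundTruth) -> float:
    """Average of chain validity, evidence cited, and alternatives ruled out."""
    true_edges = set(zip(truth.causal_chain, truth.causal_chain[1:]))
    agent_edges = set(zip(d.causal_chain, d.causal_chain[1:]))
    chain = _fraction(len(true_edges & agent_edges), len(true_edges))
    evidence = _fraction(
        len(d.evidence_ids & truth.relevant_evidence), len(truth.relevant_evidence)
    )
    alternatives = _fraction(
        len(d.ruled_out & truth.plausible_alternatives),
        len(truth.plausible_alternatives),
    )
    return (chain + evidence + alternatives) / 3


def outcome_reward(d: Diagnosis, truth: GroundTruth) -> float:
    """1.0 only when the hypothesis matches the injected fault."""
    return 1.0 if d.root_cause == truth.root_cause else 0.0


def score(d: Diagnosis, truth: GroundTruth) -> dict[str, float]:
    # The two signals are returned side by side and never collapsed here.
    return {"reasoning": reasoning_reward(d, truth), "outcome": outcome_reward(d, truth)}
```

Under this sketch, a well-reasoned miss scores high on `reasoning` and zero on `outcome`, while a lucky guess with no cited evidence does the opposite, which is the distinction the paragraph above is after.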

The sim ships with:

A microservice topology (auth → gateway → orders → payments → DB)
A fault library: cascading timeouts, deadlocks, memory leaks, poison messages, cert expiry, the usual hall of fame
Synthetic logs, metrics, and traces generated per episode
An OpenEnv-compatible step() / reset() interface, so the same env trains agents and serves the live demo
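For readers who have not worked with a step() / reset() style environment, the following toy sketch shows the general shape. It is a plain Python class, not the OpenEnv API and not MIRR's implementation; `ToyIncidentEnv`, the action dictionaries, and the log format are hypothetical, and the reward is outcome-only to keep it short.

```python
import random

# Call chain as written in the post; each service lists what it calls.
TOPOLOGY = {
    "auth": ["gateway"],
    "gateway": ["orders"],
    "orders": ["payments"],
    "payments": ["db"],
    "db": [],
}
FAULTS = ["cascading_timeout", "deadlock", "memory_leak", "poison_message", "cert_expiry"]


class ToyIncidentEnv:
    """Hypothetical stand-in for a stateful incident environment."""

    def __init__(self, seed=None, max_steps=10):
        self._rng = random.Random(seed)
        self.max_steps = max_steps

    def reset(self) -> dict:
        # Inject one fault into one service and hand back the first observation.
        self._service = self._rng.choice(list(TOPOLOGY))
        self._fault = self._rng.choice(FAULTS)
        self._steps = 0
        return {"alert": "elevated error rate", "services": list(TOPOLOGY)}

    def step(self, action: dict):
        """Return (observation, reward, done) for one agent action."""
        self._steps += 1
        out_of_budget = self._steps >= self.max_steps
        kind = action.get("type")
        if kind == "get_logs":
            svc = action["service"]
            if svc == self._service:
                line = f"{svc}: {self._fault} suspected"
            else:
                line = f"{svc}: ok"
            return {"logs": [line]}, 0.0, out_of_budget
        if kind == "diagnose":
            correct = action.get("root_cause") == f"{self._service}:{self._fault}"
            return {"result": "episode closed"}, (1.0 if correct else 0.0), True
        return {"error": f"unknown action {kind!r}"}, 0.0, out_of_budget
```

Because all state lives behind `reset()` and `step()`, the same object can be driven by a training loop or by a demo UI, which is the property the last bullet describes.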

Demo

🎥 Live Gradio demo: [your-space-link-here]

Try the "memory leak in payments" episode if you want to see the thinking mode really earn its keep.

Code

📦 Repo: github.com/u7k4rs6/MIRR
🤗 Rollouts dataset: huggingface.co/datasets/u7k4rs6/incident-response-rollouts

The repo includes the OpenEnv environment, fault generators, Gemma 4 fine-tuning scripts (TRL + Unsloth), eval harness, and the Gradio demo. The HF dataset contains agent rollouts from MIRR episodes (state, action, reasoning trace, dual reward), ready to drop straight into a TRL training loop.
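As a rough illustration of a rollout record with the four parts listed above, here is a small sketch that round-trips records through JSON Lines and optionally blends the two rewards into one scalar. The field names, the JSONL layout, and the `w_reasoning` weight are assumptions; the published dataset's real schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RolloutStep:
    """Hypothetical record: state, action, reasoning trace, dual reward."""
    state: dict
    action: dict
    reasoning_trace: str
    reward_reasoning: float
    reward_outcome: float


def to_jsonl(steps: list[RolloutStep]) -> str:
    # One JSON object per line, keys sorted so diffs stay stable.
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)


def from_jsonl(text: str) -> list[RolloutStep]:
    return [RolloutStep(**json.loads(line)) for line in text.splitlines() if line.strip()]


def scalar_reward(step: RolloutStep, w_reasoning: float = 0.5) -> float:
    """Blend the two signals only at training time; the weight is a free choice."""
    return w_reasoning * step.reward_reasoning + (1.0 - w_reasoning) * step.reward_outcome
```

Keeping the two rewards as separate fields in storage means the blend can be changed later without regenerating rollouts.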

How I Used Gemma 4

I used a two-model strategy: Gemma 4 E4B for fast, on-device iteration and RL fine-tuning, and Gemma 4 31B Dense for the heavy reasoning that does the actual diagnosing in the live demo.

Gemma 4 31B Dense: the diagnostician

The 31B is doing real chain-of-thought work: walking a service graph, correlating timestamps across logs, ruling out hypotheses. Two Gemma 4 properties make it exactly the right model:

Configurable thinking modes. This is the whole game for MIRR. The diagnose() action needs a visible, structured reasoning trace, because the environment scores reasoning quality independently of outcome. Gemma 4's native thinking mode gives me 4K+ tokens of clean chain-of-thought to grade against the ground-truth causal chain. I'd rather have one model that thinks transparently than two models stapled together with a "show your work" prompt that the model is free to ignore.
256K context. A real incident has logs from five services, three dashboards, and a runbook. The 31B eats all of that in one shot, with no RAG plumbing and no summarization step quietly dropping the critical line. For incident response specifically, context fidelity is everything.

The 31B is served via HF Inference for the demo, which keeps the Space cheap and snappy.

Gemma 4 E4B: the training proxy

RL fine-tuning the 31B on a hackathon budget is a non-starter. But because Gemma 4 ships the same architecture and tokenizer across the entire family, I could fine-tune the E4B on the MIRR rollouts dataset using TRL + Unsloth on a single Colab T4, and then transfer the learnings (reward shaping, prompting structure, the diagnose() action schema) to the 31B at inference time. Same family, same instincts.

Per-Layer Embeddings (PLE) make E4B genuinely punchy too. Even the small model produces watchable demos on the on-device path, which matters for the eventual story of "your laptop runs a copy of your team's SRE agent locally."

Why Gemma 4 over other open models

Open weights + Apache 2.0. I can actually fine-tune and ship. A closed API would have killed the RL story before it started.
Family symmetry. Same tokenizer and chat template across E2B → 31B means a training signal designed on E4B transfers up cleanly. No other open family gives you this kind of clean ladder.
Thinking modes as first-class API, not a prompting hack. For an environment that grades reasoning, that's the difference between scoring real signal and scoring formatting.
Multimodal headroom. v2 of MIRR includes service topology images (Grafana panels, dependency graphs), and Gemma 4's vision input means one model handles it end-to-end.

What's next

If I had another week, I'd swap the 31B for the 26B MoE to cut inference cost on the demo (3.8B active params at 31B-class quality is hard to beat), and use E2B's native audio input to let on-call engineers literally talk to the agent during an incident. "What's burning?" → live answer, while the pager is still vibrating.

Gemma 4 is the first open model family where you can prototype on a phone-class checkpoint and ship on a workstation-class checkpoint without rewriting your stack. That's the unlock MIRR needed.

Vibe Coding Is Just Blind Coding

Utkarsh Bahuguna — Tue, 19 May 2026 07:11:57 +0000

Why Most Developers Can't Actually Build Anything

There's a trend going around called "vibe coding." You open an AI editor, describe what you want in plain English, accept whatever the model spits out, and keep iterating until something seems to work. If it runs, you ship it. If it breaks, you prompt again.

This isn't coding. It's blind coding, and it's creating a generation of developers who can prompt but can't build.

The Illusion of Competence

AI has made it trivially easy to generate code that looks correct. A React component here. A Docker Compose file there. A Python script that "handles" your data pipeline. The problem isn't that the code is always wrong; it's that the person prompting has no mental model for what's actually happening underneath.

Here's what I mean:

They don't know how data structures work. Ask them why their AI-generated list traversal is O(n²) instead of O(n), and they'll stare at you. The code "works" on 100 rows but times out on 100,000.
They don't know how Docker works. They copy-paste a Dockerfile from ChatGPT, build an image, and celebrate when docker run doesn't immediately crash. But they can't explain layers, caching, multi-stage builds, or why their image is 2GB for a simple Node app.
They don't understand integration. Their frontend "talks to" their backend, but they don't know how HTTP works, what CORS actually means, or why their WebSocket drops connections under load. Everything is a black box connected to another black box by vibes.
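The data-structure point is easy to show. The sketch below finds duplicate rows two ways; the function names and the task itself are made up for illustration. The first version looks harmless but does a linear scan of a list on every iteration, so the whole loop is O(n²). The second swaps the lists for sets and does the same job in O(n) on average.

```python
def find_duplicates_quadratic(rows):
    """`row in seen` scans a list, so the loop as a whole is O(n^2)."""
    seen = []
    dupes = []
    for row in rows:
        if row in seen and row not in dupes:
            dupes.append(row)
        seen.append(row)
    return dupes


def find_duplicates_linear(rows):
    """Set membership is O(1) on average, so the loop is O(n)."""
    seen = set()
    reported = set()
    dupes = []  # keeps first-duplicate order, like the version above
    for row in rows:
        if row in seen and row not in reported:
            reported.add(row)
            dupes.append(row)
        seen.add(row)
    return dupes
```

Both return the same answer on 100 rows. Only one of them is still usable on 100,000.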

The result? Fragile systems held together by hope and hallucinated confidence.

What Vibe Coding Actually Looks Like

The Junior Engineer Prompt

"Build me a full-stack app with React and Node.js that lets users upload files and stores them. Make it secure and fast."

This is what gets fed into Cursor, v0, or ChatGPT. The output is usually:

A frontend with axios.post('/upload')
A backend with multer dumping files to disk
No auth, no validation, no rate limiting
A Dockerfile that copies everything and runs npm start as root
"It works on my machine" until it doesn't

The junior engineer doesn't know what's missing because they never learned to ask the right questions. They got a working demo, and that was enough.
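To make one of those gaps concrete: "no validation" usually means the server trusts the filename or the Content-Type header the client sent. A minimal sketch of a server-side check that looks at the file's leading bytes instead is shown below. The allowed types, the `sniff_mime` and `validate_upload` names, and the short signature table are assumptions for illustration; a production service would lean on a maintained detection library and still treat the result as one check among several.

```python
from typing import Optional

# Well-known leading byte signatures for a few formats.
SIGNATURES = {
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/gif": (b"GIF87a", b"GIF89a"),
    "application/pdf": (b"%PDF-",),
}


def sniff_mime(head: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of the upload, or None."""
    for mime, prefixes in SIGNATURES.items():
        if any(head.startswith(prefix) for prefix in prefixes):
            return mime
    return None


def validate_upload(filename: str, head: bytes, allowed=frozenset(SIGNATURES)) -> str:
    """Reject anything whose content is not an allowed type, whatever it is named."""
    mime = sniff_mime(head)
    if mime is None or mime not in allowed:
        raise ValueError(f"{filename!r}: content is not an allowed file type")
    return mime
```

Note that the filename is only used in the error message. A Windows executable renamed to `invoice.pdf` fails this check because its first bytes are not `%PDF-`.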

What Deliberate Engineering Looks Like

The Senior Engineer Prompt

"I need a file upload service. Constraints:

Files up to 100MB, images and PDFs only
Must validate MIME type server-side, not just extension
Scan with ClamAV before persisting
Store in S3 with presigned URLs, never stream through our servers
Rate limit: 10 uploads/hour per user, tracked in Redis
Return signed CDN URL for immediate display
Docker: multi-stage build, distroless final image, non-root user, health checks
Frontend: resumable uploads with progress, cancel support

Start with the threat model and API contract. Then the storage layer. Then the upload handler. Then the UI."

Notice the difference. The senior engineer isn't asking for code; they're defining boundaries, constraints, and failure modes first. They know:

That client-side validation is cosmetic
That streaming large files through app servers is a bottleneck
That Docker images should be minimal and hardened
That UX requires handling network interruption

The AI still writes the code. But the senior engineer directs it, because they understand the system.
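Part of directing the AI is knowing what a constraint like "10 uploads/hour per user" actually implies. Here is a minimal sliding-window sketch of that rule. It keeps state in process memory purely for illustration, whereas the prompt above asks for Redis so that limits survive restarts and are shared across instances. The class name and the injectable `clock` parameter are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for a per-user rate limit (not the Redis version)."""

    def __init__(self, limit=10, window_seconds=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted uploads

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Forget uploads that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Someone who understands this version can review an AI-generated Redis implementation and spot the usual mistakes: a fixed window that lets 20 uploads through across a boundary, or a counter that never expires.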

The Real Problem

AI isn't making engineers obsolete. It's making it harder to distinguish between engineers who think and operators who prompt.

The danger of vibe coding isn't that you use AI. Everyone should use AI. The danger is using AI instead of understanding:

Vibe Coding	Deliberate Engineering
"Make me a login system"	"Design auth with refresh tokens, CSRF protection, and OWASP Top 10 coverage"
"Dockerize this"	"Optimize layer caching, use non-root, handle graceful shutdown"
"Fix this bug"	"Trace the execution flow, identify the race condition, write a regression test"
"Make it faster"	"Profile the hot path, reduce N+1 queries, add connection pooling"

Vibe coding produces demos. Deliberate engineering produces systems.
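Take the "reduce N+1 queries" row as an example. The sketch below uses an in-memory SQLite database with a made-up `users` / `orders` schema. The first function issues one query for the users and then one more per user, which is the N+1 pattern. The second gets the same result from a single joined query.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
        """
    )
    return conn


def totals_n_plus_one(conn: sqlite3.Connection) -> dict:
    """One query for the users, then one extra query per user."""
    totals = {}
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    for user_id, name in users:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_single_query(conn: sqlite3.Connection) -> dict:
    """Same result from a single round trip using a join and GROUP BY."""
    rows = conn.execute(
        """
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users AS u
        LEFT JOIN orders AS o ON o.user_id = u.id
        GROUP BY u.id, u.name
        ORDER BY u.id
        """
    ).fetchall()
    return dict(rows)
```

With two users the difference is invisible. With 50,000 users and a network hop per query, it is the difference between milliseconds and minutes, and only profiling the hot path tells you which one you shipped.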

What to Do Instead

If you're early in your career, here's how to avoid the trap:

1. Learn one layer deeper
Don't stop at "it works." Understand why it works. Read the source code of the libraries you use. Trace a network request from browser to server to database and back.

2. Build without AI sometimes
Write a CRUD app from scratch with no Copilot. Configure nginx manually. Set up a database replica. The friction teaches you what the abstractions hide.

3. Ask "what could go wrong?"
For every feature you build, list three ways it could fail. Then design for those failures. This is the difference between a toy and production code.

4. Study systems, not syntax
Data structures, networking, concurrency, distributed systems: these are timeless. Frameworks change. Fundamentals don't.

5. Use AI as a multiplier, not a crutch
The best engineers I know use AI constantly. But they use it to accelerate decisions they've already reasoned through, not to replace reasoning itself.

The Bottom Line

Vibe coding is seductive because it delivers instant gratification. You get a working prototype in minutes. But software engineering isn't about prototypes; it's about building things that survive contact with reality: scale, security, edge cases, and time.

AI is the most powerful tool we've ever had. But tools don't replace judgment. They amplify it.

If your judgment is blind, so is your code.