<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Utkarsh Bahuguna</title>
    <description>The latest articles on DEV Community by Utkarsh Bahuguna (@utkarshbahuguna).</description>
    <link>https://dev.to/utkarshbahuguna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843380%2F5e8feb8f-5400-4dde-b58d-103ac5d9afa1.png</url>
      <title>DEV Community: Utkarsh Bahuguna</title>
      <link>https://dev.to/utkarshbahuguna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/utkarshbahuguna"/>
    <language>en</language>
    <item>
      <title>MIRR: An RL Environment Where Gemma 4 Gets Graded on How It Thinks, Not Just What It Answers</title>
      <dc:creator>Utkarsh Bahuguna</dc:creator>
      <pubDate>Tue, 19 May 2026 07:30:55 +0000</pubDate>
      <link>https://dev.to/utkarshbahuguna/mirr-an-rl-environment-where-gemma-4-gets-graded-on-how-it-thinks-not-just-what-it-answers-m5d</link>
      <guid>https://dev.to/utkarshbahuguna/mirr-an-rl-environment-where-gemma-4-gets-graded-on-how-it-thinks-not-just-what-it-answers-m5d</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🏆 An earlier version of this project finished in the &lt;strong&gt;top 50 out of 8,000+ teams&lt;/strong&gt; at the Meta × PyTorch Hackathon (&lt;a href="https://www.linkedin.com/posts/utkarshbahuguna666_finalist-at-the-meta-pytorch-openenv-hackathon-share-7450622611962658816-1suQ?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAADIjIc4B8KPGBF3iXnYnUhrXFbUZaMTEDWE" rel="noopener noreferrer"&gt;see LinkedIn post&lt;/a&gt;). This submission rebuilds it on &lt;strong&gt;Gemma 4&lt;/strong&gt;, with reasoning-quality scoring designed around Gemma 4's native thinking modes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MIRR&lt;/strong&gt; is a stateful, RL-compatible environment where a Gemma 4 agent debugs failures in a simulated microservice system the way an on-call SRE would. It pulls logs, queries metrics, walks the service graph, and commits to a root-cause hypothesis under uncertainty.&lt;/p&gt;

&lt;p&gt;Here's the problem I wanted to solve. Most "agent benchmarks" today reward only the &lt;strong&gt;outcome&lt;/strong&gt;. Did the agent fix it? Yes / no. That's a terrible signal for incident response, where a good engineer can be right for bad reasons (lucky pattern match), and a great engineer can be wrong for excellent reasons (a sensible hypothesis ruled out by data the team didn't have). If we want LLMs that on-call engineers actually trust at 3 AM, we have to score &lt;strong&gt;how they think&lt;/strong&gt;, not just whether they happened to land on the answer.&lt;/p&gt;

&lt;p&gt;MIRR introduces a novel &lt;strong&gt;&lt;code&gt;diagnose()&lt;/code&gt;&lt;/strong&gt; action that does exactly that. Every time the agent commits to a root-cause hypothesis, the environment scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning quality&lt;/strong&gt;: causal chain validity, evidence cited, alternatives ruled out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome correctness&lt;/strong&gt;: did the hypothesis actually match the injected fault&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…&lt;strong&gt;separately&lt;/strong&gt;, with independent reward signals. That makes the env legible to RL fine-tuning with TRL. You can train Gemma 4 to be a better &lt;em&gt;diagnostician&lt;/em&gt;, not just a luckier guesser.&lt;/p&gt;
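
&lt;p&gt;To make the split concrete, here is a minimal sketch of a dual-reward scorer. It is not MIRR's actual implementation: the &lt;code&gt;Diagnosis&lt;/code&gt; fields, the &lt;code&gt;score_diagnosis&lt;/code&gt; helper, and the equal weighting of the three reasoning checks are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    """Hypothetical payload for a diagnose() action."""
    root_cause: str                                   # e.g. "payments:memory_leak"
    causal_chain: list = field(default_factory=list)  # services, origin first
    evidence: list = field(default_factory=list)      # ids of cited log/metric lines
    ruled_out: list = field(default_factory=list)     # alternatives the agent rejected


def score_diagnosis(diag, true_cause, true_chain, relevant_evidence, decoys):
    """Return (reasoning_reward, outcome_reward) as two independent signals."""
    # Causal chain validity: fraction of the true chain matched position by position.
    matched = sum(1 for got, want in zip(diag.causal_chain, true_chain) if got == want)
    chain_score = matched / len(true_chain) if true_chain else 0.0

    # Evidence cited: fraction of cited items that are actually relevant.
    cited = set(diag.evidence)
    evidence_score = len(cited.intersection(relevant_evidence)) / len(cited) if cited else 0.0

    # Alternatives ruled out: fraction of decoy hypotheses explicitly rejected.
    rejected = set(diag.ruled_out)
    decoy_score = len(rejected.intersection(decoys)) / len(decoys) if decoys else 0.0

    # Assumed equal weighting; a real scorer would likely tune this.
    reasoning_reward = (chain_score + evidence_score + decoy_score) / 3
    outcome_reward = 1.0 if diag.root_cause == true_cause else 0.0
    return reasoning_reward, outcome_reward


# A wrong answer reached through mostly sound reasoning still earns reasoning reward.
diag = Diagnosis(
    root_cause="orders:deadlock",
    causal_chain=["payments", "orders", "gateway"],
    evidence=["log-17", "metric-3"],
    ruled_out=["cert_expiry"],
)
reasoning, outcome = score_diagnosis(
    diag,
    true_cause="payments:memory_leak",
    true_chain=["payments", "orders", "gateway"],
    relevant_evidence={"log-17", "metric-3", "trace-9"},
    decoys={"cert_expiry", "poison_message"},
)
print(reasoning, outcome)
```

&lt;p&gt;Keeping the two numbers separate, instead of summing them, is what lets a training loop weight reasoning and outcome differently or inspect them independently.&lt;/p&gt;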

&lt;p&gt;The sim ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A microservice topology (auth → gateway → orders → payments → DB)&lt;/li&gt;
&lt;li&gt;A fault library: cascading timeouts, deadlocks, memory leaks, poison messages, cert expiry, the usual hall of fame&lt;/li&gt;
&lt;li&gt;Synthetic logs, metrics, and traces generated per episode&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;OpenEnv-compatible&lt;/strong&gt; &lt;code&gt;step()&lt;/code&gt; / &lt;code&gt;reset()&lt;/code&gt; interface, so the same env trains agents &lt;em&gt;and&lt;/em&gt; serves the live demo&lt;/li&gt;
&lt;/ul&gt;
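
&lt;p&gt;As a rough picture of what a &lt;code&gt;reset()&lt;/code&gt; / &lt;code&gt;step()&lt;/code&gt; loop over this topology could look like, here is a toy sketch. It does not use the OpenEnv library or MIRR's real classes: &lt;code&gt;ToyIncidentEnv&lt;/code&gt;, the action dictionaries, and the log format are hypothetical, and the reward shown is outcome-only for brevity.&lt;/p&gt;

```python
import random

# Call order from the post: auth, gateway, orders, payments, DB.
TOPOLOGY = ["auth", "gateway", "orders", "payments", "db"]
FAULTS = ["cascading_timeout", "deadlock", "memory_leak", "poison_message", "cert_expiry"]


class ToyIncidentEnv:
    """Hypothetical stand-in for a reset()/step() incident environment."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.fault = None
        self.done = True

    def reset(self):
        # Inject one fault into one service and return the initial alert.
        self.fault = (self.rng.choice(TOPOLOGY), self.rng.choice(FAULTS))
        self.done = False
        return {"alert": "error rate elevated at gateway"}

    def step(self, action):
        """Apply one action dict and return (observation, reward, done)."""
        if self.done:
            raise RuntimeError("call reset() before step()")
        service, fault_type = self.fault
        if action["type"] == "pull_logs":
            # Only the faulty service emits a telltale log line.
            noisy = action["service"] == service
            line = f"{fault_type} suspected" if noisy else "nominal"
            return {"logs": [f"{action['service']}: {line}"]}, 0.0, False
        if action["type"] == "diagnose":
            # Outcome-only reward here; reasoning would be scored separately.
            self.done = True
            correct = action["hypothesis"] == f"{service}:{fault_type}"
            return {}, 1.0 if correct else 0.0, True
        raise ValueError(f"unknown action type: {action['type']}")


# A scripted agent: pull logs service by service, then commit to a hypothesis.
env = ToyIncidentEnv(seed=42)
obs = env.reset()
for svc in TOPOLOGY:
    obs, reward, done = env.step({"type": "pull_logs", "service": svc})
    if "suspected" in obs["logs"][0]:
        fault_type = obs["logs"][0].split(": ")[1].split(" ")[0]
        obs, reward, done = env.step({"type": "diagnose", "hypothesis": f"{svc}:{fault_type}"})
        break
print(reward, done)
```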

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;🎥 &lt;strong&gt;Live Gradio demo:&lt;/strong&gt; [your-space-link-here]&lt;/p&gt;

&lt;p&gt;Try the &lt;em&gt;"memory leak in payments"&lt;/em&gt; episode if you want to see the thinking mode really earn its keep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;📦 &lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/u7k4rs6/MIRR" rel="noopener noreferrer"&gt;github.com/u7k4rs6/MIRR&lt;/a&gt;&lt;br&gt;
🤗 &lt;strong&gt;Rollouts dataset:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/u7k4rs6/incident-response-rollouts" rel="noopener noreferrer"&gt;huggingface.co/datasets/u7k4rs6/incident-response-rollouts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repo includes the OpenEnv environment, fault generators, Gemma 4 fine-tuning scripts (TRL + Unsloth), eval harness, and the Gradio demo. The HF dataset contains agent rollouts from MIRR episodes (state, action, reasoning trace, dual reward), ready to drop straight into a TRL training loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I used a &lt;strong&gt;two-model strategy&lt;/strong&gt;: &lt;strong&gt;Gemma 4 E4B&lt;/strong&gt; for fast, on-device iteration and RL fine-tuning, and &lt;strong&gt;Gemma 4 31B Dense&lt;/strong&gt; for the heavy reasoning that does the actual diagnosing in the live demo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4 31B Dense: the diagnostician
&lt;/h3&gt;

&lt;p&gt;The 31B is doing real chain-of-thought work: walking a service graph, correlating timestamps across logs, ruling out hypotheses. Two Gemma 4 properties make it exactly the right model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configurable thinking modes.&lt;/strong&gt; This is the whole game for MIRR. The &lt;code&gt;diagnose()&lt;/code&gt; action &lt;em&gt;needs&lt;/em&gt; a visible, structured reasoning trace, because the environment scores reasoning quality independently of outcome. Gemma 4's native thinking mode gives me 4K+ tokens of clean chain-of-thought to grade against the ground-truth causal chain. I'd rather have one model that thinks transparently than two models stapled together with a "show your work" prompt that the model is free to ignore.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;256K context.&lt;/strong&gt; A real incident has logs from five services, three dashboards, and a runbook. The 31B eats all of that in one shot, with no RAG plumbing and no summarization step quietly dropping the critical line. For incident response specifically, &lt;em&gt;context fidelity&lt;/em&gt; is everything.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The 31B is served via HF Inference for the demo, which keeps the Space cheap and snappy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4 E4B: the training proxy
&lt;/h3&gt;

&lt;p&gt;RL fine-tuning the 31B on a hackathon budget is a non-starter. But because Gemma 4 ships the &lt;strong&gt;same architecture and tokenizer across the entire family&lt;/strong&gt;, I could fine-tune the &lt;strong&gt;E4B&lt;/strong&gt; on the MIRR rollouts dataset using &lt;strong&gt;TRL + Unsloth on a single Colab T4&lt;/strong&gt;, and then transfer the learnings (reward shaping, prompting structure, the &lt;code&gt;diagnose()&lt;/code&gt; action schema) to the 31B at inference time. Same family, same instincts.&lt;/p&gt;

&lt;p&gt;Per-Layer Embeddings (PLE) make E4B genuinely punchy too. Even the small model produces watchable demos on the on-device path, which matters for the eventual story of "your laptop runs a copy of your team's SRE agent locally."&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Gemma 4 over other open models
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open weights + Apache 2.0.&lt;/strong&gt; I can actually fine-tune and ship. A closed API would have killed the RL story before it started.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Family symmetry.&lt;/strong&gt; Same tokenizer and chat template across E2B → 31B means a training signal designed on E4B transfers up cleanly. No other open family gives you this kind of clean ladder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking modes as first-class API&lt;/strong&gt;, not a prompting hack. For an environment that grades reasoning, that's the difference between scoring real signal and scoring formatting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal headroom.&lt;/strong&gt; v2 of MIRR includes service topology &lt;em&gt;images&lt;/em&gt; (Grafana panels, dependency graphs), and Gemma 4's vision input means one model handles it end-to-end.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;If I had another week, I'd swap the 31B for the &lt;strong&gt;26B MoE&lt;/strong&gt; to cut inference cost on the demo (3.8B active params at 31B-class quality is hard to beat), and use &lt;strong&gt;E2B's native audio input&lt;/strong&gt; to let on-call engineers literally talk to the agent during an incident. &lt;em&gt;"What's burning?"&lt;/em&gt; → live answer, while the pager is still vibrating.&lt;/p&gt;




&lt;p&gt;Gemma 4 is the first open model family where you can prototype on a phone-class checkpoint and ship on a workstation-class checkpoint without rewriting your stack. That's the unlock MIRR needed.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Vibe Coding Is Just Blind Coding</title>
      <dc:creator>Utkarsh Bahuguna</dc:creator>
      <pubDate>Tue, 19 May 2026 07:11:57 +0000</pubDate>
      <link>https://dev.to/utkarshbahuguna/vibe-coding-is-just-blind-coding-28cd</link>
      <guid>https://dev.to/utkarshbahuguna/vibe-coding-is-just-blind-coding-28cd</guid>
      <description>&lt;p&gt;Why Most Developers Can't Actually Build Anything&lt;/p&gt;

&lt;p&gt;There's a trend going around called "vibe coding." You open an AI editor, describe what you want in plain English, accept whatever the model spits out, and keep iterating until something &lt;em&gt;seems&lt;/em&gt; to work. If it runs, you ship it. If it breaks, you prompt again.&lt;/p&gt;

&lt;p&gt;This isn't coding. It's &lt;strong&gt;blind coding&lt;/strong&gt;, and it's creating a generation of developers who can prompt but can't build.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Illusion of Competence
&lt;/h2&gt;

&lt;p&gt;AI has made it trivially easy to generate code that &lt;em&gt;looks&lt;/em&gt; correct. A React component here. A Docker Compose file there. A Python script that "handles" your data pipeline. The problem isn't that the code is always wrong; it's that the person prompting has &lt;strong&gt;no mental model&lt;/strong&gt; for what's actually happening underneath.&lt;/p&gt;

&lt;p&gt;Here's what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They don't know how data structures work.&lt;/strong&gt; Ask them why their AI-generated list traversal is O(n²) instead of O(n), and they'll stare at you. The code "works" on 100 rows but times out on 100,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They don't know how Docker works.&lt;/strong&gt; They copy-paste a &lt;code&gt;Dockerfile&lt;/code&gt; from ChatGPT, build an image, and celebrate when &lt;code&gt;docker run&lt;/code&gt; doesn't immediately crash. But they can't explain layers, caching, multi-stage builds, or why their image is 2GB for a simple Node app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They don't understand integration.&lt;/strong&gt; Their frontend "talks to" their backend, but they don't know how HTTP works, what CORS actually means, or why their WebSocket drops connections under load. Everything is a black box connected to another black box by vibes.&lt;/li&gt;
&lt;/ul&gt;
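
&lt;p&gt;The data-structure point is easy to see in a few lines. The sketch below uses duplicate detection as a stand-in for the "list traversal" above; the function names are made up for illustration. Both versions return the same answer, but the first does on the order of n² comparisons in the worst case and the second does one pass.&lt;/p&gt;

```python
def has_duplicates_quadratic(rows):
    """O(n^2): compares every row against every later row."""
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i] == rows[j]:
                return True
    return False


def has_duplicates_linear(rows):
    """O(n) on average: one pass, using a set for fast membership checks."""
    seen = set()
    for row in rows:
        if row in seen:
            return True
        seen.add(row)
    return False


rows = list(range(2_000)) + [0]
print(has_duplicates_quadratic(rows), has_duplicates_linear(rows))
```

&lt;p&gt;On 100 rows the difference is invisible. On 100,000 unique rows the quadratic version performs roughly five billion comparisons, which is the timeout described above.&lt;/p&gt;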

&lt;p&gt;The result? Fragile systems held together by hope and hallucinated confidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Vibe Coding Actually Looks Like
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Junior Engineer Prompt
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Build me a full-stack app with React and Node.js that lets users upload files and stores them. Make it secure and fast."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what gets fed into Cursor, v0, or ChatGPT. The output is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A frontend with &lt;code&gt;axios.post('/upload')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A backend with &lt;code&gt;multer&lt;/code&gt; dumping files to disk&lt;/li&gt;
&lt;li&gt;No auth, no validation, no rate limiting&lt;/li&gt;
&lt;li&gt;A Dockerfile that copies everything and runs &lt;code&gt;npm start&lt;/code&gt; as root&lt;/li&gt;
&lt;li&gt;"It works on my machine" until it doesn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The junior engineer doesn't know what's missing because they never learned to ask the right questions. They got a working demo, and that was enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Deliberate Engineering Looks Like
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Senior Engineer Prompt
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"I need a file upload service. Constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files up to 100MB, images and PDFs only&lt;/li&gt;
&lt;li&gt;Must validate MIME type server-side, not just extension&lt;/li&gt;
&lt;li&gt;Scan with ClamAV before persisting&lt;/li&gt;
&lt;li&gt;Store in S3 with presigned URLs, never stream through our servers&lt;/li&gt;
&lt;li&gt;Rate limit: 10 uploads/hour per user, tracked in Redis&lt;/li&gt;
&lt;li&gt;Return signed CDN URL for immediate display&lt;/li&gt;
&lt;li&gt;Docker: multi-stage build, distroless final image, non-root user, health checks&lt;/li&gt;
&lt;li&gt;Frontend: resumable uploads with progress, cancel support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start with the threat model and API contract. Then the storage layer. Then the upload handler. Then the UI."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice the difference. The senior engineer isn't asking for &lt;em&gt;code&lt;/em&gt;; they're defining &lt;strong&gt;boundaries, constraints, and failure modes&lt;/strong&gt; first. They know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That client-side validation is cosmetic&lt;/li&gt;
&lt;li&gt;That streaming large files through app servers is a bottleneck&lt;/li&gt;
&lt;li&gt;That Docker images should be minimal and hardened&lt;/li&gt;
&lt;li&gt;That UX requires handling network interruption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI still writes the code. But the senior engineer &lt;strong&gt;directs&lt;/strong&gt; it, because they understand the system.&lt;/p&gt;
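
&lt;p&gt;Take one constraint from that prompt: validating the MIME type server-side, not just the extension. A minimal sketch of the idea is to check the file's leading "magic" bytes against an allow-list. The helper names below are hypothetical, and a production service would more likely lean on a maintained detector such as libmagic than on a hand-rolled table.&lt;/p&gt;

```python
# Leading "magic" bytes for the allowed formats (images and PDFs).
SIGNATURES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/pdf": b"%PDF-",
}
MAX_BYTES = 100 * 1024 * 1024  # the 100MB limit from the prompt


def sniff_mime(head):
    """Return the detected MIME type from the first bytes, or None."""
    for mime, magic in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None


def validate_upload(filename, head, size):
    """Accept a file only if its content matches the allow-list.

    The filename is deliberately ignored: the client controls it.
    """
    if size > MAX_BYTES:
        return False, "file too large"
    mime = sniff_mime(head)
    if mime is None:
        return False, "unsupported content type"
    return True, mime


# An executable renamed to .png is rejected because the bytes decide, not the name.
print(validate_upload("invoice.png", b"MZ\x90\x00", 4096))
print(validate_upload("report.bin", b"%PDF-1.7\n", 4096))
```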




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;AI isn't making engineers obsolete. It's making it harder to distinguish between &lt;strong&gt;engineers who think&lt;/strong&gt; and &lt;strong&gt;operators who prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The danger of vibe coding isn't that you use AI. Everyone should use AI. The danger is using AI &lt;em&gt;instead of&lt;/em&gt; understanding:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vibe Coding&lt;/th&gt;
&lt;th&gt;Deliberate Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Make me a login system"&lt;/td&gt;
&lt;td&gt;"Design auth with refresh tokens, CSRF protection, and OWASP Top 10 coverage"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Dockerize this"&lt;/td&gt;
&lt;td&gt;"Optimize layer caching, use non-root, handle graceful shutdown"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Fix this bug"&lt;/td&gt;
&lt;td&gt;"Trace the execution flow, identify the race condition, write a regression test"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Make it faster"&lt;/td&gt;
&lt;td&gt;"Profile the hot path, reduce N+1 queries, add connection pooling"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Vibe coding produces &lt;em&gt;demos&lt;/em&gt;. Deliberate engineering produces &lt;em&gt;systems&lt;/em&gt;.&lt;/p&gt;
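
&lt;p&gt;"Reduce N+1 queries" from the last row is a good example of what the right-hand column assumes you already understand. Here is a small sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;uploads&lt;/code&gt; tables and both helper functions are invented for illustration.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE uploads (id INTEGER PRIMARY KEY, user_id INTEGER, filename TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO uploads VALUES (1, 1, 'a.png'), (2, 1, 'b.pdf'), (3, 2, 'c.png');
""")


def uploads_n_plus_one(conn):
    """One query for users, then one more per user: N+1 round trips."""
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    queries = 1
    result = {}
    for user_id, name in users:
        rows = conn.execute(
            "SELECT filename FROM uploads WHERE user_id = ? ORDER BY id", (user_id,)
        ).fetchall()
        queries += 1
        result[name] = [r[0] for r in rows]
    return result, queries


def uploads_single_join(conn):
    """One JOIN returns the same data in a single round trip."""
    rows = conn.execute(
        "SELECT u.name, up.filename FROM users u "
        "JOIN uploads up ON up.user_id = u.id ORDER BY u.id, up.id"
    ).fetchall()
    result = {}
    for name, filename in rows:
        result.setdefault(name, []).append(filename)
    return result, 1


slow, slow_queries = uploads_n_plus_one(conn)
fast, fast_queries = uploads_single_join(conn)
print(slow == fast, slow_queries, fast_queries)
```

&lt;p&gt;With two users the gap is three queries versus one. With ten thousand users it is ten thousand and one versus one, and each of those is a network round trip in a real deployment.&lt;/p&gt;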




&lt;h2&gt;
  
  
  What to Do Instead
&lt;/h2&gt;

&lt;p&gt;If you're early in your career, here's how to avoid the trap:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Learn one layer deeper&lt;/strong&gt;&lt;br&gt;
Don't stop at "it works." Understand &lt;em&gt;why&lt;/em&gt; it works. Read the source code of the libraries you use. Trace a network request from browser to server to database and back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Build without AI sometimes&lt;/strong&gt;&lt;br&gt;
Write a CRUD app from scratch with no Copilot. Configure nginx manually. Set up a database replica. The friction teaches you what the abstractions hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Ask "what could go wrong?"&lt;/strong&gt;&lt;br&gt;
For every feature you build, list three ways it could fail. Then design for those failures. This is the difference between a toy and production code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Study systems, not syntax&lt;/strong&gt;&lt;br&gt;
Data structures, networking, concurrency, distributed systems: these are timeless. Frameworks change. Fundamentals don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Use AI as a multiplier, not a crutch&lt;/strong&gt;&lt;br&gt;
The best engineers I know use AI constantly. But they use it to &lt;em&gt;accelerate&lt;/em&gt; decisions they've already reasoned through, not to &lt;em&gt;replace&lt;/em&gt; reasoning itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Vibe coding is seductive because it delivers instant gratification. You get a working prototype in minutes. But software engineering isn't about prototypes; it's about building things that survive contact with reality: scale, security, edge cases, and time.&lt;/p&gt;

&lt;p&gt;AI is the most powerful tool we've ever had. But tools don't replace judgment. They amplify it.&lt;/p&gt;

&lt;p&gt;If your judgment is blind, so is your code.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
