YASHWANTH REDDY K
When AI Writes Clean Code for the Wrong Problem

There’s a moment every developer has experienced: you run a piece of code, everything looks perfectly structured, neatly organized, even “production-ready”… and yet something feels off. Not broken, not crashing—just wrong in a way that’s harder to detect.

That’s exactly the kind of failure that becomes visible when you start testing models inside Vibe Code Arena.

Recently, I ran a seemingly simple challenge:

Parse a nested JSON structure, extract specific fields, and handle missing or malformed data gracefully.

On the surface, this is not an exotic problem. It’s something most backend engineers solve early in their careers. But when you put multiple LLMs into a controlled duel environment and actually inspect their outputs—not just whether they “work,” but how they think—you start noticing something far more interesting.

This isn’t a story about one model being better than another.

It’s about how both of them misunderstood the problem… in completely different ways.

The Illusion of Competence Starts with Structure

Let’s start with the first model: Gemma 3 4B IT.

At first glance, its output looks impressive. It has everything you’d expect from a clean Python project:

  • Separate functions for loading and extracting data
  • Logging for error handling
  • A structured unit test suite
  • Clear documentation

It feels like something pulled out of a real-world repository. The kind of code that would pass a quick review if you were skimming through a pull request.

And that’s exactly where the illusion begins.

Because once you move past the structure and actually inspect the logic, something becomes clear: the model never implemented nested parsing.

The entire extraction logic boils down to:

data.get(field)

Which only works for flat JSON.

No recursion. No traversal. No path resolution. Nothing that actually engages with the idea of “nested structure.”
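For contrast, here is a minimal sketch of the traversal the prompt actually asked for. The `get_path` helper and the sample payload are my own illustration, not from either model's output; it resolves a dot-notation path through nested dicts and falls back to a default instead of raising:

```python
from typing import Any

def get_path(data: Any, path: str, default: Any = None) -> Any:
    """Resolve a dot-notation path like "user.address.city" through nested dicts.

    Returns `default` when any segment is missing or the current value
    is not a dict, instead of raising KeyError/TypeError.
    """
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

payload = {"user": {"address": {"city": "Hyderabad"}}}
print(get_path(payload, "user.address.city"))           # Hyderabad
print(get_path(payload, "user.phone", default="N/A"))   # N/A
```

Ten lines of actual traversal logic. Neither model wrote anything like it.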

So what happened here?

The model didn’t fail at coding. It failed at problem interpretation.

It recognized a familiar pattern—“JSON parsing + extraction + error handling”—and reproduced a textbook version of that pattern. But it quietly downgraded the complexity of the task from “nested parsing” to “flat key lookup.”

And because everything else looks so clean, that mistake is easy to miss.

When Simplicity Crosses into Instability

Now compare that with the second model: Mistral-Nemo.

This one doesn’t even try to impress you with structure. It goes straight to a single-function approach—minimal, direct, almost barebones.

But then things fall apart almost immediately.

The function definition itself is syntactically invalid:

def parse_json+Nested data.json):

At that point, the conversation shifts. This isn’t about logic anymore. This is about generation stability.

Even if you fix the syntax mentally and try to evaluate the intent, the implementation still reveals deeper issues:

  • It assumes fixed fields like "field1", "field2"
  • It doesn’t allow dynamic extraction
  • It treats all errors the same by returning an empty dictionary
  • It lacks any modular structure
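Repaired mentally, the intent reads roughly like the reconstruction below. The hardcoded fields and the empty-dict fallback match what the duel output actually did; the function name and file handling are my guess at what the model meant:

```python
import json

def parse_json_nested(path):
    """Reconstruction of Mistral's apparent intent: fixed fields, one catch-all."""
    try:
        with open(path) as fh:
            data = json.load(fh)
        # hardcoded field names — no dynamic extraction possible
        return {"field1": data.get("field1"), "field2": data.get("field2")}
    except Exception:
        # every failure mode collapses into the same empty dict
        return {}
```

Even charitably repaired, it still performs flat lookups and erases all error context.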

Where Gemma over-engineers a shallow solution, Mistral under-delivers a fragile one.

And yet, both share a critical flaw.

The Missing Piece: Nobody Solved the Core Problem

Despite their differences, both models failed to implement the one thing the prompt explicitly required:

Handling nested JSON.

This is not a minor oversight. It’s the core of the task.

Nested JSON parsing requires:

  • Traversing hierarchical structures
  • Handling missing paths safely
  • Possibly supporting dot-notation or recursive lookup
  • Maintaining robustness across varying shapes

Neither model attempted any of this.

Instead, they both simplified the problem into something they were more confident solving.

This is the most important takeaway:

LLMs don’t just make mistakes—they often reinterpret the problem into something easier without telling you.

And if you’re not actively testing for that, you won’t notice.

Graceful Handling Isn’t What It Seems

Another interesting pattern shows up in how both models claim to handle errors “gracefully.”

Gemma returns None when something goes wrong.

Mistral returns {}.

At first glance, both seem reasonable. But look closer.

Neither approach:

  • distinguishes between different failure types
  • preserves error context
  • provides actionable feedback

A missing file, malformed JSON, or unexpected structure all collapse into the same output.

This is what I’d call cosmetic robustness.

The code looks defensive, but it doesn’t actually help you debug or recover meaningfully.
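What non-cosmetic robustness could look like, as a hedged sketch (the function and its error types are my own illustration, not from either model): let distinct failure modes surface as distinct exceptions, and keep the parser's context instead of swallowing it.

```python
import json

def parse_document(text: str) -> dict:
    """Parse JSON text, raising a distinct error for each failure mode."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # preserve the parser's line/column context for the caller
        raise ValueError(f"malformed JSON at line {exc.lineno}: {exc.msg}") from exc
    if not isinstance(data, dict):
        # "valid JSON" and "the shape we expected" are different failures
        raise TypeError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

A caller can now tell a syntax error from a shape mismatch, and log something actionable for each. Returning `None` or `{}` makes both indistinguishable from "the field was legitimately absent."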

And again, this is easy to miss because the structure feels right.

The Testing Trap: When Tests Don’t Test the Right Thing

Gemma goes a step further by generating unit tests.

That sounds like a win. Until you read them.

The tests:

  • validate basic loading of JSON
  • check simple field extraction
  • confirm behavior for invalid files

But they never test:

  • nested structures
  • deep field access
  • edge cases involving hierarchy

So the tests reinforce the same simplified interpretation of the problem.

They don’t expose the flaw—they validate it.

This creates a dangerous loop:

  1. The model simplifies the problem
  2. It writes code for the simplified version
  3. It writes tests that validate that version
  4. Everything “passes”

And now you have a fully consistent—but fundamentally incorrect—solution.
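A single test exercising depth would have broken that loop. Here is a hypothetical pair, assuming the extractor exposes something like `extract_field(data, path)` (the name and the minimal reference implementation are mine, included only so the tests run):

```python
def extract_field(data, path, default=None):
    # minimal correct reference: dot-path traversal over nested dicts
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

def test_nested_field_access():
    doc = {"order": {"customer": {"email": "a@b.com"}}}
    # Gemma's flat doc.get("order.customer.email") would return None here
    assert extract_field(doc, "order.customer.email") == "a@b.com"

def test_missing_intermediate_path():
    # a missing link mid-path must degrade safely, not raise
    assert extract_field({"order": {}}, "order.customer.email") is None

test_nested_field_access()
test_missing_intermediate_path()
```

Run against a flat `data.get(field)` implementation, the first test fails immediately. That is the test Gemma never wrote.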

What This Reveals About LLM Behavior

Running this inside Vibe Code Arena makes one thing very clear:

LLMs are not just generating code—they are making judgment calls about what matters in a problem.

And those judgment calls are influenced by:

  • training data patterns
  • common code templates
  • frequency of similar tasks

In this case, both models leaned toward a more common scenario: flat JSON parsing.

Because statistically, that’s what they’ve seen more often.

So instead of solving the exact problem, they solved the most probable version of it.

Pattern Recognition vs Problem Solving

This leads to a deeper distinction:

  • Pattern recognition: identifying familiar structures and reproducing them
  • Problem solving: understanding constraints and adapting logic accordingly

Gemma excels at pattern recognition. It builds something that looks like a real project.

Mistral struggles even at that level in this example.

But neither fully engages in problem solving.

And that gap is where most real-world failures happen.

Why This Matters More Than Benchmarks

Standard benchmarks won’t catch this.

They typically evaluate:

  • correctness on predefined inputs
  • performance on known datasets
  • adherence to expected outputs

But they rarely test:

  • whether the model interpreted the problem correctly
  • whether it handled edge cases implied but not explicitly stated
  • whether the solution scales beyond trivial scenarios

That’s why environments like Vibe Code Arena are interesting.

Because they expose:

  • how models behave under open-ended prompts
  • how they structure solutions
  • how they fail when assumptions are required

And those failures are often more informative than successes.

The Real Risk: Confidently Wrong Code

The most dangerous output isn’t broken code.

It’s code that:

  • runs without errors
  • passes its own tests
  • looks clean and maintainable
  • but solves the wrong problem

Gemma’s solution falls into this category.

If you dropped it into a codebase, it might survive for a while before anyone notices the limitation. And by then, it’s already part of your system.

That’s a much harder problem than a syntax error.

What Developers Should Take Away

This isn’t an argument against using LLMs for coding. It’s a reminder of how to use them correctly.

When working with generated code:

  • Don’t trust structure—verify logic
  • Don’t trust tests—check coverage relevance
  • Don’t assume completeness—look for missing constraints
  • Always re-read the original problem

Most importantly:

Ask yourself: “Did this code solve my problem, or a simpler version of it?”

Because more often than not, it’s the latter.

Closing Thought

This experiment started as a simple comparison between two models.

But it ended up highlighting something more fundamental:

AI doesn’t just fail randomly. It fails systematically, based on how it interprets the world.

And sometimes, that interpretation is just slightly off.

Not enough to break things immediately.

But enough to matter.

That’s the space where real debugging begins.

You can explore or create your own duels here on Vibe Code Arena:

https://vibecodearena.ai

or

Try the exact duel here:

https://vibecodearena.ai/share/befe4ee4-59ac-4f31-ae9d-c55f38c434a1
