DEV Community

Sukriti Singh

I Spent a Week on VibeCode Arena. Here Is Everything I Did Not Expect.

Seven days. Blind voting. And a realization I didn't want to admit.


I will be upfront about something.

I almost did not write this.

Not because the week was boring. The opposite, actually. I almost did not write it because some of what I found was uncomfortable to admit, and the comfortable thing would have been to just post a highlight reel and move on.

But highlight reels are useless. So here is the actual account.


How it started

I had been hearing about VibeCode Arena for a bit. A platform by HackerEarth where you watch AI models compete on the same prompt, vote blind before the reveal, and open your results as community challenges.

Sounded interesting. I kept putting it off the way you put off anything that might make you feel like you do not know what you are doing.

Eventually I just opened it.

The first thing you see is the Duels feed. People have submitted prompts, two models have gone head to head, and the community is voting on which output is better. Some of the stuff people are building on here is genuinely creative — retro terminals, interactive tools, little games. I spent longer than I meant to just scrolling through before I actually tried anything myself.


Day One — The blind vote

I ran my first Duel.

The way it works is simple. You write a prompt, two AI models generate simultaneously, you watch both outputs come in side by side, and you vote for the better one before the model names are revealed. The reveal only happens after you commit.

I typed something I genuinely wanted to see — not a test prompt, an actual idea I had been sitting on. Both outputs came in. I read through them. I voted.

Then the model names appeared.

I had voted for the model I never use. The one I had quietly written off months ago, based on nothing more specific than a vague impression I had picked up from other people's opinions.

I did not feel vindicated. I felt slightly embarrassed. Because the implication was obvious — I had been choosing my tools based on reputation and habit, not based on what they actually produced.

I ran three more duels that day. My assumptions were wrong twice.


Day Two — The stick figure brawl

Okay, this one was just fun.

I wanted to try creating my own challenge and I did not want to overthink it. So I typed "design a game with stick figures fighting" and watched what came back.

Two outputs. One was clean and structured but felt more like a diagram than a game. The other had actual chaos to it — Player 1 attack button, Player 2 attack button, health bars, the whole thing. It felt like something you would actually waste two minutes on.

I picked the second one. Codestral-2508.

Again, not a model I had seriously considered before. Again, the blind format forced the honest evaluation.

I opened it as a community challenge — which means anyone can now jump in, take that base, prompt AI to improve it, and submit their version. The leaderboard fills up with different people's takes on the same starting point.

I posted it and waited to see what would happen.

What I felt in that moment was something I did not expect. A kind of genuine curiosity. Not "I built a thing" but "I started something and I have no idea where it goes." That is a different feeling and, honestly, a more interesting one.


Day Three — Looking at other people's challenges

I spent most of day three not building anything. Just looking.

The challenge feed has a range of things people have posted. Some are serious — UI components, accessibility-focused tools, form builders. Some are completely unhinged. The range is wide, and the leaderboard on each one tells you something about what the community actually values versus what looks impressive at first glance.

What I noticed was how much you learn from watching other people's iterations of the same starting point.

Someone takes a base output and restructures it entirely. Someone else adds a feature you had not thought of. A third person goes in a direction that seems wrong and then you look at their score and have to reconsider your assumptions.

I kept thinking — this is the thing most people are missing when they use AI alone. You get your own blind spots plus the model's blind spots. Open it up and suddenly you have twenty different sets of eyes on the same problem.


Day Four — The guitar fretboard

This one I am still thinking about.

I ran a duel with the prompt "design a guitar with all the key notes represented on strings." Eleven words. I play a bit of guitar, so I was curious what the models would do with something I could actually evaluate properly.

Both outputs produced what they called an Interactive Guitar Fretboard. All twelve notes across the top, six strings, click a note and see its positions highlighted in green.

At first glance, it looked great. Clean. Functional. More than I expected from such a short prompt.

Then I actually used it.

The note positions are not accurate. When you click a note, the highlighted positions across the strings do not match where that note actually sits in standard tuning. For a tool that is supposed to teach you the fretboard, that is the most important thing to get right, and it got it wrong.

There is no sound at all. You click, and nothing plays. Silence. For a guitar tool that is not a minor gap, it is half the point missing.

No fret numbers, so you cannot tell which fret you are looking at. No fret markers at the positions every guitarist uses to navigate the neck. The neck itself looks slightly too short.
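To make the "correct" part concrete: in standard tuning (E A D G B E), the note at any fret is just the open-string note shifted up one semitone per fret, wrapping around the twelve-note chromatic scale. Here is a minimal Python sketch of that calculation — my own illustration of what a fix would need to get right, not code from either model's output:

```python
# Chromatic scale and standard-tuning open strings (low E to high E).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
OPEN_STRINGS = ["E", "A", "D", "G", "B", "E"]

def note_at(string_index, fret):
    """Note name at a given string (0 = low E) and fret number."""
    open_note = NOTES.index(OPEN_STRINGS[string_index])
    return NOTES[(open_note + fret) % 12]

def positions_of(note, max_fret=12):
    """All (string, fret) positions of a note up to max_fret."""
    return [(s, f) for s in range(6) for f in range(max_fret + 1)
            if note_at(s, f) == note]
```

Twenty lines, and it is the entire source of truth the highlighted positions should have been checked against — for example, the low E string's fifth fret must come back as A.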

I voted for Codestral-2508 again. Better visual layout, marginally better structure. But both outputs had the same fundamental problems.

I opened it as a challenge with a clear brief — fix the note positions, add fret numbers, add fret markers, and add sound. Here is what is wrong, here is what good looks like, go.

That felt more honest than pretending the output was finished.


Day Five — Something changed about how I prompt

I did not read a guide about this. It just happened.

After four days of carefully evaluating other models' outputs, I noticed I was prompting differently. Less about how to structure things, more about what the thing needed to do and why it needed to do it.

The outputs got better. Not dramatically, but noticeably.

I think what happened is that spending time on the evaluation side — really sitting with outputs and asking what is working and what is not — had quietly changed what I was putting into the prompts. I had internalised something about the gap between what prompts produce and what they should produce, and it was coming out the other side.

That is not something I could have gotten from reading about prompting. It came from doing the evaluation work repeatedly.


Day Six — The uncomfortable one

I ran a duel on a prompt close to something I had actually built myself a few months ago.

I was quietly confident going in. I knew this territory. I had already solved this problem.

One of the outputs came back and handled a specific part of the interaction better than I had.

Not overall better. Not in every dimension. But in one specific way that mattered, it had made a decision I had not made, and the decision was right.

I sat with that for longer than I probably needed to.

The thing I landed on was this. It was not that AI is better. It was that I had been too close to my own previous solution to see its weaknesses clearly. I had built a thing, shipped it, moved on, and calcified around it. Fresh evaluation — even from a model — caught what I had stopped being able to see.

That is uncomfortable. It is also useful.


Day Seven — What I actually took from all of this

A week is not long. I am not going to pretend it transformed everything.

But a few things shifted.

I think about model choice differently now. Not which model is best in general — that question does not have a useful answer. But which model for what kind of task, under what conditions, evaluated by what criteria. That framing is more honest and more practical.

I evaluate AI output more carefully. I have a clearer sense of what I am actually looking at — what is genuinely solid, what just looks good, what will cause problems later. That clarity came from doing the blind evaluations repeatedly, not from reading about them.

I am more comfortable with incompleteness. Both of my challenges — the stick figure brawl and the guitar fretboard — were unfinished in clear ways. Opening them as challenges rather than polishing them to death felt right. The platform treats incompleteness as an invitation rather than a failure. That reframe is quietly useful.


The thing nobody says about vibe coding

Most of the conversation around vibe coding is about speed. Build faster, ship faster, stop writing every line by hand.

That is true and it is also incomplete.

What vibe coding actually demands from you, if you do it properly, is better judgment. You are no longer the one writing every line so you have to be the one who can evaluate every line. You have to know what good looks like. You have to catch what is wrong before it ships. You have to make the calls the AI cannot make because it does not have your context.

VibeCode Arena is the most direct way I have found to build that judgment. Not through tutorials. Through doing it, on real prompts, with real outputs, scored honestly, in a community that is working on the same thing.

One week in I am more calibrated than I was. I am also more aware of how much calibration I still need.

That feels like the right place to be.

Try it here: https://vibecodearena.ai/?page=1&pageSize=10

And if you want to jump into something live, the guitar fretboard challenge is open, the note positions need fixing, the sound needs adding, and the leaderboard is waiting.

Challenge link: https://vibecodearena.ai/share/afc85b04-2031-4b33-8a70-caf14197ac6d


The best thing a week on VibeCode Arena did was make me honest about what I did not know. That is more valuable than any output it produced.
