❄️ What I’m Doing
Every day I’ll solve the Advent of Code puzzle in:
- JavaScript
- Rust
- Python
Then I’ll ask a lineup of current AI coding models to produce their own solutions in the same three languages:
- GPT-5.1 Codex
- Gemini 3 Pro
- Composer-1
- Opus-4.5
- Sonnet-4.5
So for each puzzle, there will be:
- My human solutions (in three languages)
- Five AI solutions (also in three languages each)
It’s essentially a coding “showdown,” but not a competition. The real aim is to explore how different kinds of reasoning appear in code, how approaches vary from model to model, and how my own thinking compares.
Methodology
Not a lab-grade study, just a consistent, lightweight workflow so the comparison stays fair.
For each challenge, after solving the problem manually, I will drop the challenge text into a txt file in the prompts folder, prepended with a brief prompt:
You are a developer taking on the Advent of Code Challenge 2025.
Create a solution for this problem.
This puzzle has two parts, solve both in the same solution. The program output should just be the two answers on separate lines.
I'll aim to keep this prompt consistent across the days, varying only the output-format line as needed. I then drop the input file into the inputs folder. In Cursor, I select the model under test, @-mention the input file, the prompt file, and the target directory, then hit run. No MCPs, no MAX mode, nothing else that could introduce confounding variables or bloat the context.
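To make the setup concrete, here's roughly how the working folders fit together. The names below are purely illustrative (my best guess at a layout), not the actual repo structure:

```
aoc-2025/
├── prompts/            # day-NN.txt: puzzle text with the standard prompt prepended
├── inputs/             # day-NN.txt: the puzzle inputs
├── human/              # my solutions, one folder per day, in JS / Rust / Python
├── gpt-5.1-codex/      # one folder per model, mirroring the same per-day structure
├── gemini-3-pro/
├── composer-1/
├── opus-4.5/
├── sonnet-4.5/
└── run_solutions.py    # verification script (see below)
```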
I then run the run_solutions.py Python script to verify the outputs and review each model's "thinking". I'll improve the script and the reporting as I progress, but it works as a starting point.
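For the curious, here's a minimal sketch of what a verification runner along those lines could look like. It is not the real run_solutions.py: the answers folder, the stdin convention, and the per-day directory layout are all assumptions made for the example.

```python
#!/usr/bin/env python3
"""Rough sketch of a verification runner (illustrative, not the real script).

Assumed layout:
  <solver>/dayNN/solution.py / solution.js  (Rust assumed prebuilt as <solver>/dayNN/solution)
  inputs/dayNN.txt    puzzle input
  answers/dayNN.txt   two lines: the part 1 and part 2 answers
Each solution is assumed to read its input on stdin and print the two answers
on separate lines.
"""
from pathlib import Path
import subprocess
import sys

# How to invoke a solution based on its file extension ("" = a prebuilt binary).
RUNNERS = {".py": ["python3"], ".js": ["node"], "": []}

def run(solution: Path, day: str) -> list[str]:
    """Run one solution against the day's input and return its output lines."""
    with open(f"inputs/{day}.txt") as puzzle_input:
        result = subprocess.run(
            RUNNERS[solution.suffix] + [str(solution)],
            stdin=puzzle_input,
            capture_output=True,
            text=True,
            timeout=60,
        )
    return result.stdout.strip().splitlines()

def main() -> None:
    day = sys.argv[1] if len(sys.argv) > 1 else "day01"
    expected = Path(f"answers/{day}.txt").read_text().strip().splitlines()
    for solution in sorted(Path(".").glob(f"*/{day}/solution*")):
        if solution.suffix not in RUNNERS:
            continue  # skip anything we don't know how to invoke (e.g. .rs sources)
        actual = run(solution, day)
        verdict = "OK" if actual == expected else f"MISMATCH: got {actual}"
        print(f"{solution.parent.parent.name:>16}  {solution.name:<12}  {verdict}")

if __name__ == "__main__":
    main()
```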
Once I've verified the output, I add the model's directory to .cursorignore so its solutions don't end up in the context of later runs.
Note: each model solves the puzzle in all three languages in a single session and can reference its own prior solutions, e.g. it can start in Python and then translate that to JS and Rust. It would be interesting to see how the models stack up when restricted to one language at a time, and I may run that as a second iteration of this experiment. However, the prompt doesn't instruct the model to start in any particular language, and I can also reference my own solutions across languages, so I thought this setup would be interesting in itself.
🎁 Why This Experiment?
To understand how AI actually solves problems
Advent of Code puzzles are a perfect testbed: small enough to be self-contained (and not burn too many tokens/£!), but clever enough to require genuine reasoning and creativity. Watching how different models break them down is already proving fascinating, and analysing the end results may yield some interesting insights.
To improve my own fluency
Solving each puzzle three times in three different languages forces me to think more deeply about patterns, algorithms, and idioms. It's a great way to keep my skills sharp and explore languages beyond the ones I use day to day.
To observe differences in style and structure
- Where a model chooses brute force, I choose planning and analysis.
- Where I focus on solving the problem and don't worry about failure modes (since the code only ever runs against known inputs in a known environment), AI may be much more defensive and write more flexible code.
- Where AI might reach for third-party packages, I may avoid them and stick to language features and the standard library.

These contrasts say a lot about how "AI thinking" manifests in code.
To build a dataset of human + AI approaches
By the end of AoC, even with only 12 challenges this year, that's 12 puzzles × 3 languages × 6 solvers (me plus five models): over 200 solutions that all answer the same questions from different angles. That's an interesting resource in itself, and plenty to analyse!
🤖 What I’ll Be Sharing
As the month goes on, I’ll post insights on things like:
- patterns the AIs gravitate toward
- performance of the output code (hey, perf stats can be fun)
- common mistakes or blind spots
- whether models one-shot solutions or need some handholding
- places where models outperform my first instincts or develop more novel solutions
- language-specific quirks (Rust borrow checker vs AI… pray for it)
- what it feels like to “pair” with multiple coding models
- other random thoughts and musings on AI, Cursor and model nuances
The goal isn’t to crown a winner. It’s to understand the landscape of coding in 2025: where AIs shine, where they show their limitations, and how human and AI approaches complement each other. Will the models one-shot all the solutions, or will I end up re-prompting and "pair-programming"?
🌟 Follow Along
If you enjoy Advent of Code, programming language experiments, or the evolving relationship between developers and AI tooling, stick around. I’ll be posting reflections and curiosities throughout the month, followed by a round-up at the end of the challenges.
Here’s to a December full of puzzles, head-scratching, learning, and some very weird debugging moments.
Happy coding, and an even happier Advent. 🎅🔥