ARC AGI 3 Preview Competition — My Journey

Dhana Abhiraj — Sat, 16 Aug 2025 08:01:02 +0000

For the last 2 weeks, I participated in the ARC AGI 3 preview competition.

I tried out different techniques to solve the problem. The challenge is to build an agent that can win a game it knows nothing about in advance.

Most of the things that I tried didn’t work well.

My solution uses a Text LLM, an Image LLM, and a Video LLM, but it still doesn’t perform well enough to win the full game.

Throughout development I used Gemini, and I ran into rate limit errors.

Week 1

After a few manual experiments and runs with a random agent, I created the below initial flow:

1. Generate a random trial gameplay, then reset the game. In the analysis, use only the frames where an action had a visible effect.
2. Pass the gameplay video to the LLM and generate 10 hypotheses from it (Retrieval). [Analysis]
3. Create hints from the gameplay that will help the goal achiever reach the goal. [Analysis]
4. Select 1 multi-stage goal from the hypotheses. [Analysis]
5. Pass the goal to the goal achiever loop.
6. Use the Image LLM with the last game frame to predict the next action.
7. Once the action is taken, compare the last 2 frames to check whether the goal is achieved. If it is, repeat steps 1 to 6.
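
The first step of the flow above (random play, keeping only the frames where something changed) can be sketched roughly like this. It is a minimal sketch, not the code from the repository: frames are assumed to be 2D lists of color indices, and `ToyEnv`, `random_trial`, and the action names are hypothetical, used only for illustration.

```python
import random
from typing import Callable, List, Tuple

Frame = List[List[int]]  # assumption: a frame is a 2D grid of color indices


class ToyEnv:
    """Hypothetical 1x5 corridor: LEFT/RIGHT move a marker, walls block it."""

    def __init__(self) -> None:
        self.pos = 0
        self.width = 5

    def frame(self) -> Frame:
        return [[1 if x == self.pos else 0 for x in range(self.width)]]

    def step(self, action: str) -> Frame:
        if action == "RIGHT":
            self.pos = min(self.pos + 1, self.width - 1)
        elif action == "LEFT":
            self.pos = max(self.pos - 1, 0)
        return self.frame()  # any other action leaves the frame unchanged


def random_trial(
    step: Callable[[str], Frame],
    first_frame: Frame,
    actions: List[str],
    n_steps: int,
    rng: random.Random,
) -> List[Tuple[str, Frame, Frame]]:
    """Play random actions and keep only transitions that changed the frame."""
    effective = []
    prev = first_frame
    for _ in range(n_steps):
        action = rng.choice(actions)
        cur = step(action)
        if cur != prev:  # the action had a visible effect
            effective.append((action, prev, cur))
        prev = cur
    return effective


env = ToyEnv()
kept = random_trial(env.step, env.frame(), ["LEFT", "RIGHT", "NOOP"], 50, random.Random(0))
print(f"{len(kept)} of 50 random actions changed the frame")
```

Dropping the no-effect transitions keeps the later analysis prompts shorter, which matters when every frame costs tokens.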

👉 I tried to solve it using a top-to-bottom approach.

I focused on steps 6 and 7 with fixed goals and hints. With those held fixed, the goal achiever performed as expected.

But until halfway to the deadline, I optimized just 1 iteration of the loop, manually evaluating the responses on only 1 game (the LOCKSMITH game).

Then I spent some time re-architecting the flow so it could run iteratively and improve from one iteration to the next.
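
Steps 6 and 7 together form a small loop, which can be sketched as below. This is only an illustration under assumptions: `predict_action` stands in for the Image LLM call, `goal_reached` stands in for the two-frame check, and `max_actions` is a hypothetical safety budget. None of these names come from the project code.

```python
from typing import Callable, List, Optional

Frame = List[List[int]]  # assumption: a frame is a 2D grid of color indices


def achieve_goal(
    goal: str,
    hints: List[str],
    first_frame: Frame,
    predict_action: Callable[[Frame, str, List[str]], str],
    take_action: Callable[[str], Frame],
    goal_reached: Callable[[Frame, Frame, str], bool],
    max_actions: int = 20,
) -> Optional[int]:
    """Return the number of actions used, or None if the budget ran out."""
    frame = first_frame
    for used in range(1, max_actions + 1):
        # Step 6: pick the next action from the last frame, the goal, and the hints.
        action = predict_action(frame, goal, hints)
        new_frame = take_action(action)
        # Step 7: compare the last 2 frames to decide whether the goal is done.
        if goal_reached(frame, new_frame, goal):
            return used
        frame = new_frame
    return None
```

Passing the model calls in as plain functions makes it possible to test the loop with fixed goals and hints and with fake predictors, without spending any API quota.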

Week 2 — First Half

I re-architected the flow with:

Random gameplay runs only at the start; each later iteration reuses the gameplay from the previous goal attempt.
Goal generation now has a maximum action limit, to avoid wrong goals.
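
One way to read the action limit is as a budget that each generated goal must fit into. The sketch below assumes a hypothetical `Goal` record with an action estimate and a limit of 15 actions; the post does not state the actual limit or how the estimate was obtained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Goal:
    description: str
    estimated_actions: int  # assumption: the generator also estimates the cost


def filter_goals(goals: List[Goal], max_actions: int = 15) -> List[Goal]:
    """Drop goals whose estimate exceeds the budget; shortest goals first."""
    feasible = [g for g in goals if 0 < g.estimated_actions <= max_actions]
    return sorted(feasible, key=lambda g: g.estimated_actions)


candidates = [
    Goal("placeholder goal A", 40),
    Goal("placeholder goal B", 6),
    Goal("placeholder goal C", 12),
]
print([g.description for g in filter_goals(candidates)])
# ['placeholder goal B', 'placeholder goal C']
```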

Week 2 — Second Half

Later, I realized I needed evaluation to speed up development, so I used an LLM as a judge to evaluate whether steps 1–4 generated the expected goal and hints.
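
An evaluation harness for this kind of judging could look roughly like the sketch below. Everything here is an assumption made for illustration: the case format, the prompt wording, and the `generate` and `judge` callables, which in practice would wrap the analysis steps and an LLM call.

```python
from typing import Callable, Dict, List

# Assumption: the judge receives (expected, generated) and answers yes or no.
Judge = Callable[[str, str], bool]


def build_judge_prompt(expected: str, generated: str) -> str:
    """Hypothetical prompt an LLM judge could be asked to answer."""
    return (
        "You are grading a game-playing agent.\n"
        f"Expected goal: {expected}\n"
        f"Generated goal: {generated}\n"
        "Answer YES if they describe the same objective, otherwise NO."
    )


def evaluate(
    cases: List[Dict[str, str]],
    generate: Callable[[str], str],
    judge: Judge,
) -> float:
    """Return the fraction of cases the judge accepts."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        output = generate(case["game"])
        if judge(case["expected_goal"], output):
            passed += 1
    return passed / len(cases)
```

A single pass rate per game is crude, but it is enough to tell whether a prompt change helped or hurt without reading every response by hand.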

I ran evaluations for all 3 public games and found a few issues:

Random play is not good at exploring click-based games.
The goal achievement checker step is not working well. (Most LLMs couldn’t identify the difference between two image frames well.)
Goal generation is weak.
Hypothesis generation is weak.

How I tried to improve:

Introduced probability-based random play to handle both clickable and non-clickable games.
Added color change descriptions in the grid for both the game analysis step and the goal checker step.
Made the goal generation shorter, with fewer moves.
Improved frame analysis (not all frames were being analyzed earlier).
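
Two of the fixes above can be sketched in a few lines: weighted random play that sometimes clicks a random cell, and a text description of which grid cells changed color between two frames. This is a sketch under assumptions, not the project's implementation: the action names, the weights, and the output format are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Frame = List[List[int]]  # assumption: a frame is a 2D grid of color indices


def sample_action(
    weights: Dict[str, float],
    grid_size: int,
    rng: random.Random,
) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Pick an action by weight; a CLICK also gets random (x, y) coordinates."""
    names = list(weights)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    if name == "CLICK":
        return name, (rng.randrange(grid_size), rng.randrange(grid_size))
    return name, None


def describe_color_changes(prev: Frame, cur: Frame, limit: int = 10) -> List[str]:
    """List changed cells as text, so the LLM does not have to diff two images."""
    changes = []
    for y, (row_prev, row_cur) in enumerate(zip(prev, cur)):
        for x, (a, b) in enumerate(zip(row_prev, row_cur)):
            if a != b:
                changes.append(f"({x},{y}): color {a} -> {b}")
    if len(changes) > limit:
        extra = len(changes) - limit
        changes = changes[:limit] + [f"and {extra} more cells changed"]
    return changes


print(describe_color_changes([[0, 0], [0, 0]], [[0, 3], [0, 0]]))
# ['(1,0): color 0 -> 3']
```

Raising the weight of `CLICK` for click-based games and lowering it for movement games is one way to let a single random explorer cover both kinds.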

Last 3 Days

New Issues Detected:

Some goals end in just 1–2 steps, so very few frames exist for analysis.
Sometimes LLM calls fail with empty responses or errors, even after retries.

Solutions:

Passed only the frames of the current level play.
If something fails, generate a random action.
Converted actions and gameplay into a human-readable event chain.
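
The random-action fallback can be sketched as a thin wrapper around the model call. This is a minimal sketch with hypothetical names: `ask_llm` stands in for whatever function sends the prompt and returns the raw answer, and the retry count is an assumption.

```python
import random
from typing import Callable, List, Optional


def choose_action(
    ask_llm: Callable[[], Optional[str]],
    valid_actions: List[str],
    rng: random.Random,
    retries: int = 2,
) -> str:
    """Ask the model up to retries + 1 times; fall back to a random action."""
    for _ in range(retries + 1):
        try:
            answer = ask_llm()
        except Exception:
            continue  # e.g. a rate limit or transport error: try again
        if answer and answer.strip() in valid_actions:
            return answer.strip()
        # Empty or unrecognized answer: fall through and retry.
    return rng.choice(valid_actions)
```

The agent always gets a legal action back, so one bad response costs a move rather than crashing the whole run.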

Remaining Issues:

The full flow still needs optimization, even after many iterations.
Prompts and inputs passed to the LLM need improvements.
Model choice matters a lot: most steps used gemini-2.5-flash, but for some steps gemini-2.5-pro worked noticeably better.

Finally, the competition deadline arrived.

Final Flow

|                                                                                 |
|  Random Play --------> Game Analysis --------> Goal Achiever                    | ---> Level Cleared
|   (explore)             (perceive & set goal)      (navigate)                   |        |
|       ^                                                |                        |        |
|       |                                                |                        |        |
|       --------------------------------------------------                        |        |
-----------------------------------------------------------------------------------        |
     ^                                                                                     |
     |                                                                                     |
     ---------------------------- Hints (memory) -------------------------------------------
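
Read as code, the diagram is an inner loop per level plus a hints list that survives across levels. The sketch below is one possible reading, with hypothetical callables standing in for random play (`explore`), game analysis (`analyze`), and the goal achiever (`pursue`). Following the Week 2 change, random play runs once per level and later rounds reuse the previous goal attempt.

```python
from typing import Callable, List, Tuple

Events = List[str]  # assumption: gameplay is summarized as a list of text events


def play_game(
    n_levels: int,
    explore: Callable[[], Events],
    analyze: Callable[[Events, List[str]], Tuple[str, List[str]]],
    pursue: Callable[[str, List[str]], Tuple[Events, bool]],
    max_rounds: int = 5,
) -> Tuple[int, List[str]]:
    """Return (levels cleared, hints collected)."""
    hints: List[str] = []  # memory carried across levels
    cleared_levels = 0
    for _ in range(n_levels):
        events = explore()  # random play at the start of the level
        cleared = False
        for _ in range(max_rounds):
            goal, new_hints = analyze(events, hints)  # perceive and set goal
            hints.extend(h for h in new_hints if h not in hints)
            events, cleared = pursue(goal, hints)  # navigate toward the goal
            if cleared:
                break
        if not cleared:
            break  # out of rounds on this level: stop the run
        cleared_levels += 1
    return cleared_levels, hints
```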

What I Could Have Done Better

Started evaluation and observability earlier.
Chosen a dedicated model for inference.
Developed the event chain in a better way.
Considered the game title impact earlier (it has a huge positive/negative effect on all steps).
Turned parts of the workflow into autonomous tools instead of keeping it a strict, fixed sequence.
Improved the memory mechanism across levels.

What Went Well

Learned a lot about:
- Reasoning
- Multimodal models
- Building agents
- Workflow design

The full project code is available on GitHub: https://github.com/dhanaabhirajk/ARC-AGI-3-Agents

I’d love to hear your thoughts or ideas — always open for discussions and collaborations. Reach me on LinkedIn.

DEV Community: Dhana Abhiraj