For the last two weeks, I participated in the ARC-AGI-3 preview competition, trying out different techniques along the way. The challenge: build an agent that can win an unknown game.
Most of the things I tried didn't work well.
My solution uses a text LLM, an image LLM, and a video LLM, but it still doesn't perform well enough to win a full game.
Throughout development I used Gemini, and I regularly ran into rate limit errors.
Week 1
After a few manual experiments and runs with a random agent, I created the initial flow below:
1. Generate a random trial gameplay, then reset the game. Only the frames where an action had an effect are used in analysis.
2. Pass the gameplay video to the LLM and generate 10 hypotheses from it (Retrieval). [Analysis]
3. Create hints from the gameplay to help the goal achiever reach the goal. [Analysis]
4. Select one multi-stage goal from the hypotheses. [Analysis]
5. Pass the goal to the goal achiever loop.
6. Use the image LLM with the last game frame to predict the next action.
7. Once the action is taken, use the last 2 frames to check whether the goal is achieved. If it is, steps 1–6 repeat.
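A minimal sketch of that loop in Python. The helper names (`random_play`, `analyze_gameplay`, `generate_hints`, `select_goal`, `predict_action`, `goal_achieved`) and the `game` object are hypothetical wrappers around the actual LLM calls and game API; this shows the control flow, not the exact implementation:

```python
def run_agent(game, max_iterations=20):
    """One pass of the week-1 flow: explore, analyze, then pursue a goal."""
    for _ in range(max_iterations):
        # Steps 1-4: random exploration, then analysis of the recording.
        frames = random_play(game)  # keep only frames where an action had an effect
        game.reset()
        hypotheses = analyze_gameplay(frames, n_hypotheses=10)  # video LLM
        hints = generate_hints(frames)
        goal = select_goal(hypotheses)  # one multi-stage goal

        # Steps 5-7: goal achiever loop.
        while not game.level_cleared():
            action = predict_action(game.last_frame(), goal, hints)  # image LLM
            game.step(action)
            if goal_achieved(game.last_frames(2), goal):
                break  # goal met: loop back to step 1
```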
👉 I tried to solve it using a top-down approach.
I first focused on steps 6 and 7 with fixed goals and hints, and with those in place it performed as expected.
But until halfway to the deadline, I was optimizing just one iteration of the loop, manually evaluating responses on a single game (LOCKSMITH).
Then I spent some time re-architecting the flow to run iteratively and improve over time.
Week 2 — First Half
I re-architected the flow with:
- Random gameplay only at the start; every later iteration reuses the previous goal play and builds on it.
- Goal generation constrained by a max-action limit to avoid wrong goals.
Week 2 — Second Half
Later, I realized I needed evaluation to speed up development, so I used an LLM as a judge to evaluate steps 1–4, checking that they generate the expected goal and hints.
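A rough sketch of what that LLM-as-judge evaluation can look like. Here `call_llm` is a stand-in for the Gemini SDK call, and `EVAL_CASES` / `generate_goal` are hypothetical: a list of recorded gameplays with known-good goals, and the pipeline step under test:

```python
import json

JUDGE_PROMPT = """You are grading an agent's game analysis.
Expected goal: {expected}
Generated goal: {generated}
Score 0-10 for how well the generated goal matches the expected one.
Reply as JSON: {{"score": <int>, "reason": "<short text>"}}"""

def judge(expected, generated):
    # call_llm is a stand-in for the actual Gemini API call
    reply = call_llm(JUDGE_PROMPT.format(expected=expected, generated=generated))
    return json.loads(reply)["score"]

scores = [judge(case.expected_goal, generate_goal(case.frames)) for case in EVAL_CASES]
print(f"mean goal score: {sum(scores) / len(scores):.1f}")
```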
I ran evaluations for all 3 public games and found a few issues:
- Random play is not good at exploring click-based games.
- The goal achievement checker step is not working well. (Most LLMs couldn’t identify the difference between two image frames well.)
- Goal generation is weak.
- Hypothesis generation is weak.
How I tried to improve:
- Introduced probability-based random play to handle both clickable and non-clickable games (see the sketch after this list).
- Added color change descriptions in the grid for both the game analysis step and the goal checker step.
- Made the goal generation shorter, with fewer moves.
- Improved frame analysis (not all frames were being analyzed earlier).
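The probability-based random play can be pictured like this. The action names, grid size, and weights are illustrative guesses on my part, not the actual values used:

```python
import random

# Weighted action sampling: clicks get extra probability mass so click-based
# games still get explored, while movement actions cover the rest.
ACTION_WEIGHTS = {
    "up": 0.15, "down": 0.15, "left": 0.15, "right": 0.15,
    "click": 0.40,
}

def sample_action(grid_width=64, grid_height=64):
    action = random.choices(list(ACTION_WEIGHTS), weights=list(ACTION_WEIGHTS.values()))[0]
    if action == "click":
        # a click also needs a random target cell
        return ("click", random.randrange(grid_width), random.randrange(grid_height))
    return (action,)
```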
Last 3 Days
New Issues Detected:
- Some goals end in just 1–2 steps, so very few frames exist for analysis.
- Sometimes LLM calls fail with empty responses or errors, even after retries.
Solutions:
- Passed only the frames of the current level play.
- If a call fails, fall back to a random action.
- Converted actions and gameplay into a human-readable event chain (sketched below).
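The event chain and the earlier color-change descriptions both boil down to diffing consecutive grid frames into text. A minimal sketch, assuming frames are 2D lists of color indices and an illustrative `COLOR_NAMES` palette (not the real one):

```python
COLOR_NAMES = {0: "black", 1: "blue", 2: "red", 3: "green", 4: "yellow"}  # illustrative

def describe_changes(prev, curr):
    """Turn a frame diff into human-readable change events."""
    events = []
    for y, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for x, (a, b) in enumerate(zip(prev_row, curr_row)):
            if a != b:
                events.append(f"cell ({x},{y}): {COLOR_NAMES.get(a, a)} -> {COLOR_NAMES.get(b, b)}")
    return events

def event_chain(actions, frames):
    """Interleave each action with the frame changes it caused."""
    lines = []
    for action, prev, curr in zip(actions, frames, frames[1:]):
        changes = describe_changes(prev, curr)
        lines.append(f"{action}: " + ("; ".join(changes) if changes else "no effect"))
    return "\n".join(lines)
```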
Remaining Issues:
- The full flow still needs optimization, even after many iterations.
- Prompts and inputs passed to the LLM need improvement.
- Model choice matters a lot: most steps used `gemini-2.5-flash`, but in some cases `gemini-2.5-pro` worked better where flash fell short.
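One way to wire this up, combining per-step model choice with the retry-then-random-action fallback mentioned above. The model IDs are real Gemini models, but the step-to-model mapping and the `call_llm` wrapper are hypothetical:

```python
STEP_MODELS = {
    "analysis": "gemini-2.5-pro",      # assumed: heavier reasoning step
    "goal_check": "gemini-2.5-flash",  # assumed: cheap, frequent step
    "action": "gemini-2.5-flash",
}

def call_with_fallback(step, prompt, retries=2):
    # Try the preferred model first, then the other one (dedup preserves order).
    for model in dict.fromkeys([STEP_MODELS[step], "gemini-2.5-pro"]):
        for _ in range(retries):
            try:
                reply = call_llm(model, prompt)  # stand-in for the Gemini SDK call
                if reply:  # guard against empty responses
                    return reply
            except Exception:
                continue
    return None  # caller then falls back to a random action
```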
Finally, the competition deadline approached.
Final Flow
```
+---------------------------------------------------------------+
|                                                               |
|  Random Play ------> Game Analysis ------> Goal Achiever -----+---> Level Cleared
|  (explore)       (perceive & set goal)     (navigate)         |          |
|       ^                                        |              |          |
|       +----------------------------------------+              |          |
|                                                               |          |
+---------------------------------------------------------------+          |
        ^                                                                  |
        |                                                                  |
        +--------------------------- Hints (memory) <----------------------+
```
What I Could Have Done Better
- Started evaluation and observability earlier.
- Chose a dedicated model for inference.
- Developed the event chain in a better way.
- Considered the game title impact earlier (it has a huge positive/negative effect on all steps).
- Turned parts of the workflow into autonomous tools instead of keeping the flow rigid.
- Improved the memory mechanism across levels.
What Went Well
- Learned a lot about:
- Reasoning
- Multimodal models
- Building agents
- Workflow design
The full project code is available on GitHub: https://github.com/dhanaabhirajk/ARC-AGI-3-Agents
I'd love to hear your thoughts or ideas; I'm always open to discussions and collaborations. Reach me on LinkedIn.