
gentic news

Posted on • Originally published at gentic.news

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates


The Agentick benchmark pits RL, LLM, VLM, hybrid, and human agents against one another on 37 tasks. GPT-5 mini leads with a 0.309 oracle-normalized score, but no paradigm dominates.

Key facts

  • 37 procedurally generated tasks across six capability categories
  • 27 agent configurations evaluated over 90,000+ episodes
  • GPT-5 mini leads at 0.309 oracle-normalized score
  • Reasoning harness improves LLM performance 3-10x
  • ASCII observations outperform natural language across all agents

Researchers from Google DeepMind and Université de Montréal released Agentick, a unified benchmark for sequential decision-making agents. The benchmark provides 37 procedurally generated tasks across six capability categories — including planning, multi-agent coordination, and memory — with four difficulty levels and five observation modalities [per the arXiv preprint].

Key Findings from 90,000+ Episodes

An evaluation spanning 27 agent configurations and over 90,000 episodes reveals stark performance gaps. GPT-5 mini leads overall at 0.309 oracle-normalized score (ONS), while PPO trained for 2 million steps achieves 0.287 ONS. However, PPO dominates planning and multi-agent tasks, where LLM-based agents lag significantly.

The reasoning harness — a chain-of-thought wrapper — multiplies LLM performance by 3-10x, suggesting that prompting strategy matters more than model scale for these tasks. Surprisingly, ASCII observations consistently outperform natural language observations across all agent types, challenging the assumption that richer representations always help.
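The article doesn't reproduce the harness's prompt or code. As a rough sketch of what a chain-of-thought wrapper around an LLM policy can look like (the prompt wording, function names, and parsing logic below are illustrative assumptions, not Agentick's implementation):

```python
def reasoning_harness_act(observation: str, legal_actions: list[str], call_llm) -> str:
    """Wrap an LLM policy in an explicit think-then-act prompt.

    `call_llm` is any callable that maps a prompt string to a completion
    string, e.g. a thin wrapper around your LLM client of choice.
    Everything here is an illustrative sketch, not Agentick's actual harness.
    """
    prompt = (
        "You are controlling an agent in a sequential decision-making task.\n"
        f"Observation:\n{observation}\n\n"
        f"Legal actions: {', '.join(legal_actions)}\n\n"
        "Think step by step about the state and your goal, then finish with\n"
        "one final line of the form 'ACTION: <one legal action>'."
    )
    response = call_llm(prompt)
    # Parse the final ACTION line; fall back to the first legal action if parsing fails.
    for line in reversed(response.splitlines()):
        if line.strip().upper().startswith("ACTION:"):
            candidate = line.split(":", 1)[1].strip()
            if candidate in legal_actions:
                return candidate
    return legal_actions[0]
```

The key point is that the wrapper changes only the prompting and output parsing, not the underlying model, which is why the 3-10x gain suggests prompting strategy matters more than scale here.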

No Silver Bullet for Agent Architectures

Agentick's design explicitly addresses the fragmentation in agent evaluation. Existing benchmarks often favor one paradigm — RL on Gym environments or LLMs on static QA — making cross-paradigm comparison impossible. Agentick provides a single Gymnasium-compatible interface, oracle reference policies for all tasks, pre-built SFT datasets, and a live leaderboard [according to the paper].
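For readers unfamiliar with the interface, a Gymnasium-compatible environment is driven with the standard reset/step loop. The environment ID below is a hypothetical placeholder; check the Agentick release for the real registered names:

```python
import gymnasium as gym

# "Agentick/KeyDoorPuzzle-medium-v0" is a hypothetical ID used for illustration.
env = gym.make("Agentick/KeyDoorPuzzle-medium-v0")
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # plug in your agent's policy here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
env.close()
print("episode return:", episode_return)
```

The same loop serves PPO training code, LLM-based agents, and human play-throughs alike, which is what makes cross-paradigm comparison possible.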

The benchmark's capability-decomposed structure reveals that different architectures excel in different sub-skills. Hybrid agents combining RL policies with LLM reasoning show promise but still trail specialists in their respective domains.
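The article doesn't describe how Agentick's hybrid agents are wired. One plausible pattern, sketched here purely as an assumption, is to let an LLM propose occasional subgoals while a fast RL policy picks the low-level actions:

```python
class HybridAgent:
    """Hypothetical hybrid design: LLM for slow subgoal planning, RL policy for fast control.

    Not Agentick's actual hybrid agents; the interfaces are illustrative.
    """

    def __init__(self, llm_propose_subgoal, rl_policy, replan_every: int = 10):
        self.llm_propose_subgoal = llm_propose_subgoal  # obs -> subgoal string
        self.rl_policy = rl_policy                      # (obs, subgoal) -> action
        self.replan_every = replan_every
        self.steps = 0
        self.subgoal = None

    def act(self, obs):
        # Call the expensive LLM only every `replan_every` steps.
        if self.subgoal is None or self.steps % self.replan_every == 0:
            self.subgoal = self.llm_propose_subgoal(obs)
        self.steps += 1
        # The RL policy handles every low-level action, conditioned on the subgoal.
        return self.rl_policy(obs, self.subgoal)
```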

Implications for RL Post-Training

Agentick ships with pre-built SFT datasets, positioning it as a training ground for RL post-training of foundation models in sequential environments. This directly addresses a gap identified in recent work: foundation models lack robust sequential decision-making capabilities that RL-from-scratch agents possess.
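One way such SFT data can be produced, sketched here under the assumption that oracle policies are exposed as observation-to-action callables (the JSON field names are not Agentick's actual schema), is to roll out the oracle and log observation/action pairs:

```python
import json

def collect_sft_examples(env, oracle_policy, episodes: int = 100, path: str = "sft.jsonl"):
    """Roll out an oracle policy and log (observation, action) pairs as JSON lines."""
    with open(path, "w") as f:
        for ep in range(episodes):
            obs, info = env.reset(seed=ep)
            done = False
            while not done:
                action = oracle_policy(obs)  # oracle maps observation -> action
                f.write(json.dumps({"observation": str(obs), "action": str(action)}) + "\n")
                obs, reward, terminated, truncated, info = env.step(action)
                done = terminated or truncated
```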

The paper notes that even the best-performing agent — GPT-5 mini at 0.309 ONS — leaves substantial room for improvement. An oracle-normalized score of 1.0 represents perfect performance, meaning current agents achieve less than one-third of optimal behavior.
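The normalization itself isn't spelled out in the article. A common convention, assumed below rather than quoted from the paper, rescales each agent's return between a random-policy baseline and the oracle's return; the paper may instead use the simpler ratio of agent return to oracle return, but either way 1.0 means matching the oracle:

```latex
\mathrm{ONS} = \frac{R_{\text{agent}} - R_{\text{random}}}{R_{\text{oracle}} - R_{\text{random}}}
```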

Key Takeaways

  • Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks.
  • GPT-5 mini leads at 0.309 ONS, but no paradigm dominates.
  • ASCII observations beat natural language observations across all agent types.

What to watch

Watch for the Agentick leaderboard updates as more labs submit results. Key metric: whether any agent crosses 0.5 ONS within six months, and whether hybrid RL-LLM architectures narrow the gap with PPO on planning tasks.

Figure 1: Two observation modalities for KeyDoorPuzzle at medium difficulty. Left: isometric pixel rendering.
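
The image itself isn't reproduced in this text version. Purely as an illustration of the text-side modalities discussed above (not the benchmark's actual renderings), an ASCII-grid observation and a natural-language observation of the same hypothetical state might look like this:

```python
# Purely illustrative observation strings, not Agentick's actual renderings.
ascii_obs = (
    "#########\n"
    "#A..K..D#\n"
    "#########"
)  # A = agent, K = key, D = locked door

natural_language_obs = (
    "You are in a narrow room. A key lies three cells to your east, "
    "and a locked door sits against the east wall beyond it."
)
```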


Originally published on gentic.news
