<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: alethios000</title>
    <description>The latest articles on DEV Community by alethios000 (@alethios000).</description>
    <link>https://dev.to/alethios000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863971%2Ffa8724f6-9d35-4ceb-b844-3480d12a2c62.jpeg</url>
      <title>DEV Community: alethios000</title>
      <link>https://dev.to/alethios000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alethios000"/>
    <language>en</language>
    <item>
      <title>I built an open-source benchmark that scores AI agents, not models</title>
      <dc:creator>alethios000</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:15:48 +0000</pubDate>
      <link>https://dev.to/alethios000/i-built-an-open-source-benchmark-that-scores-ai-agents-not-models-36aa</link>
      <guid>https://dev.to/alethios000/i-built-an-open-source-benchmark-that-scores-ai-agents-not-models-36aa</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wxpxqqbttb36cb4xura.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wxpxqqbttb36cb4xura.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
Two agents built on the same underlying model, say GPT-4o, can have wildly different reliability. Yet mainstream benchmarks evaluate only the model.&lt;/p&gt;

&lt;p&gt;So I built Legit — an open-source platform that scores the agent as a whole.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;pip install getlegit
legit init --agent "MyBot" --endpoint "http://localhost:8000/run"
legit run v1 --local
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The benchmark covers 36 tasks across 6 categories (Research, Extract, Analyze, Code, Write, Operate) and scores each run in two layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer 1: deterministic checks that run locally and are free&lt;/li&gt;
&lt;li&gt;Layer 2: three AI judges (Claude, GPT-4o, Gemini), aggregated by taking the median score&lt;/li&gt;
&lt;/ul&gt;
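
&lt;p&gt;As a rough sketch of the Layer 2 aggregation (not the repo's actual code; the function name and the 0&#8211;100 scale are assumptions), taking the median of three judge scores makes the result robust to a single outlier judge:&lt;/p&gt;

```python
from statistics import median

def aggregate_judge_scores(scores):
    """Median of the three judge scores (0-100); one outlier judge
    cannot drag the result, unlike a mean."""
    assert len(scores) == 3, "expects exactly three judge scores"
    return median(scores)

print(aggregate_judge_scores([72, 88, 91]))   # -> 88
print(aggregate_judge_scores([50, 52, 100]))  # -> 52 (outlier ignored)
```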

&lt;p&gt;Agents get an Elo rating and tier (Platinum/Gold/Silver/Bronze).&lt;/p&gt;
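
&lt;p&gt;For anyone unfamiliar with Elo, here is a minimal textbook-style pairwise update in Python; the K-factor of 32 and the function names are illustrative assumptions, not Legit's actual parameters:&lt;/p&gt;

```python
K = 32  # assumed K-factor; controls how fast ratings move

def expected(r_a, r_b):
    """Expected score for agent A against agent B under Elo."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a):
    """score_a is 1 for an A win, 0.5 for a draw, 0 for a loss.
    Returns the new (rating_a, rating_b); the update is zero-sum."""
    e = expected(r_a, r_b)
    return r_a + K * (score_a - e), r_b + K * ((1 - score_a) - (1 - e))

print(elo_update(1500, 1500, 1))  # -> (1516.0, 1484.0)
```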

&lt;p&gt;Free and open source under the Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/getlegitdev/legit" rel="noopener noreferrer"&gt;https://github.com/getlegitdev/legit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback on the scoring methodology!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
