
Two agents built on the same GPT-4o can have wildly different reliability, yet every benchmark evaluates only the underlying model.
So I built Legit, an open-source platform that scores the agent as a whole.
How it works
pip install getlegit
legit init --agent "MyBot" --endpoint "http://localhost:8000/run"
legit run v1 --local
36 tasks across 6 categories (Research, Extract, Analyze, Code, Write, Operate). Two scoring layers:
- Layer 1: deterministic checks; runs locally and is free
- Layer 2: three AI judges (Claude, GPT-4o, Gemini); the final score is the median of the three
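Taking the median of three judges is what makes Layer 2 robust to a single outlier judge. A minimal sketch of that aggregation step, where the judge names and the 0-100 score scale are my assumptions rather than Legit's actual API:

```python
from statistics import median

def layer2_score(judge_scores: dict[str, float]) -> float:
    """Aggregate per-judge scores by taking the median.

    With three judges, one wildly high or low score cannot move
    the result past the middle judge's opinion.
    """
    return median(judge_scores.values())

# Hypothetical scores on a 0-100 scale (illustrative, not real output)
scores = {"claude": 82.0, "gpt-4o": 90.0, "gemini": 70.0}
print(layer2_score(scores))  # median of 70, 82, 90 -> 82.0
```

If one judge misfires (say Gemini returns 5.0 instead of 70.0), the median stays at 82.0, whereas a mean would drop sharply.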
Agents get an Elo rating and tier (Platinum/Gold/Silver/Bronze).
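For readers unfamiliar with Elo, here is a hedged sketch of how a rating could be updated from pairwise agent comparisons and then bucketed into tiers. The K-factor and the tier cutoffs below are illustrative assumptions, not Legit's actual parameters:

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> float:
    """Return agent A's new rating after one matchup.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    The expected score follows the standard logistic Elo formula.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)

def tier(rating: float) -> str:
    """Map a rating to a tier; cutoffs here are made up for illustration."""
    if rating >= 1800:
        return "Platinum"
    if rating >= 1600:
        return "Gold"
    if rating >= 1400:
        return "Silver"
    return "Bronze"

# Two evenly rated agents; A wins, so A gains half the K-factor.
new_rating = elo_update(1500.0, 1500.0, 1.0)
print(round(new_rating), tier(new_rating))  # 1516 Silver
```

Because the expected score is 0.5 in an even matchup, a win moves the rating by exactly k/2; upsets against higher-rated agents move it more.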
Free, Apache 2.0.
GitHub: https://github.com/getlegitdev/legit
Would love feedback on the scoring methodology!