
alethios000
I built an open-source benchmark that scores AI agents, not models


Two agents built on the same underlying model (say, GPT-4o) can have wildly different reliability, yet every benchmark evaluates only the model.

So I built Legit — an open-source platform that scores the agent as a whole.

How it works

pip install getlegit
legit init --agent "MyBot" --endpoint "http://localhost:8000/run"
legit run v1 --local
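For `legit run` to work, your agent has to expose an HTTP endpoint it can call. The post doesn't spell out the request/response contract, so here's a minimal sketch of what a `/run` endpoint could look like, assuming Legit POSTs a JSON task and reads a JSON reply (the schema and `run_agent` logic below are illustrative assumptions, not Legit's documented API):

```python
# Hypothetical agent endpoint sketch. The task/response JSON shape is an
# assumption; check the Legit docs for the real contract.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def run_agent(task):
    # Stand-in for your agent's actual logic: just echoes the prompt.
    return {"output": f"Handled task: {task.get('prompt', '')}"}

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/run":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        task = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(run_agent(task)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), AgentHandler).serve_forever()
```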

36 tasks across 6 categories (Research, Extract, Analyze, Code, Write, Operate). Two scoring layers:

  • Layer 1: deterministic checks, runs locally, free
  • Layer 2: 3 AI judges (Claude, GPT-4o, Gemini), median score
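The two layers can be sketched roughly like this. The deterministic check and the median-of-judges aggregation below are my reading of the description above, not Legit's actual code; function names and score ranges are assumptions:

```python
import statistics

def layer1_check(output, expected_substring):
    # Layer 1 sketch: a deterministic, locally runnable check
    # (e.g. "does the output contain the required value?").
    return expected_substring in output

def layer2_score(judge_scores):
    # Layer 2 sketch: take the median of the three AI-judge scores,
    # so a single outlier judge can't swing the result.
    return statistics.median(judge_scores)

passed = layer1_check("The answer is 42.", "42")            # True
final = layer2_score([0.9, 0.7, 0.8])                        # 0.8
```

The median (rather than the mean) is the interesting choice here: with three judges, one overly harsh or generous judge is simply discarded.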

Agents get an Elo rating and tier (Platinum/Gold/Silver/Bronze).
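For context, a standard Elo update looks like the sketch below. How Legit maps task results to match outcomes, and where its tier cutoffs sit, isn't stated in the post, so the `k` factor, outcome mapping, and thresholds here are illustrative assumptions:

```python
def elo_update(rating, opponent_rating, score, k=32):
    # Standard Elo: score is 1 for a win, 0.5 for a draw, 0 for a loss.
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)

def tier(rating):
    # Hypothetical tier thresholds, not Legit's actual cutoffs.
    if rating >= 1600:
        return "Platinum"
    if rating >= 1400:
        return "Gold"
    if rating >= 1200:
        return "Silver"
    return "Bronze"

new_rating = elo_update(1500, 1500, 1)  # 1516.0
```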

Free, Apache 2.0.

GitHub: https://github.com/getlegitdev/legit

Would love feedback on the scoring methodology!
