
Two agents built on the same GPT-4o can have wildly different reliability, yet every benchmark evaluates only the underlying model.
So I built Legit, an open-source platform that scores the agent as a whole.
How it works
pip install getlegit
legit init --agent "MyBot" --endpoint "http://localhost:8000/run"
legit run v1 --local
36 tasks across 6 categories (Research, Extract, Analyze, Code, Write, Operate). Two scoring layers:
- Layer 1: deterministic checks; runs locally and is free
- Layer 2: three AI judges (Claude, GPT-4o, Gemini); the final score is the median of the three
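Taking the median of three judges is what makes Layer 2 robust to a single outlier judge. A minimal sketch of that aggregation step, where the judge names and the 0-100 score scale are my assumptions rather than Legit's actual API:

```python
from statistics import median

def layer2_score(judge_scores: dict[str, float]) -> float:
    """Aggregate per-judge scores by taking the median.

    With three judges, one wildly high or low score cannot move
    the result past the middle judge's opinion.
    """
    return median(judge_scores.values())

# Hypothetical scores on a 0-100 scale (illustrative, not real output)
scores = {"claude": 82.0, "gpt-4o": 90.0, "gemini": 70.0}
print(layer2_score(scores))  # median of 70, 82, 90 -> 82.0
```

If one judge misfires (say Gemini returns 5.0 instead of 70.0), the median stays at 82.0, whereas a mean would drop sharply.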
Agents get an Elo rating and tier (Platinum/Gold/Silver/Bronze).
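For readers unfamiliar with Elo, here is a hedged sketch of how a rating could be updated from pairwise agent comparisons and then bucketed into tiers. The K-factor and the tier cutoffs below are illustrative assumptions, not Legit's actual parameters:

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> float:
    """Return agent A's new rating after one matchup.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    The expected score follows the standard logistic Elo formula.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)

def tier(rating: float) -> str:
    """Map a rating to a tier; cutoffs here are made up for illustration."""
    if rating >= 1800:
        return "Platinum"
    if rating >= 1600:
        return "Gold"
    if rating >= 1400:
        return "Silver"
    return "Bronze"

# Two evenly rated agents; A wins, so A gains half the K-factor.
new_rating = elo_update(1500.0, 1500.0, 1.0)
print(round(new_rating), tier(new_rating))  # 1516 Silver
```

Because the expected score is 0.5 in an even matchup, a win moves the rating by exactly k/2; upsets against higher-rated agents move it more.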
Free, Apache 2.0.
GitHub: https://github.com/getlegitdev/legit
Would love feedback on the scoring methodology!