Unpopular opinion: Using GPT-4 as a judge to evaluate other models is grading your own homework.
At Future AGI, we built an open-source eval library because evaluations need multiple signals, edge-case stress testing, and production monitoring.
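To make "multiple signals" concrete, here is a rough sketch in plain Python. It is not the library's API: every name in it (`exact_match`, `token_overlap`, `non_empty`, `evaluate`, `passes`, the thresholds) is made up for illustration. The idea is just that several independent checks score the same response, and no single score decides the outcome.

```python
from typing import Callable, Dict

# Hypothetical signal type: (prompt, response, reference) -> score in [0, 1].
Signal = Callable[[str, str, str], float]


def exact_match(prompt: str, response: str, reference: str) -> float:
    """1.0 if the response equals the reference, ignoring case and outer whitespace."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


def token_overlap(prompt: str, response: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the response."""
    ref_tokens = set(reference.lower().split())
    out_tokens = set(response.lower().split())
    return len(ref_tokens & out_tokens) / len(ref_tokens) if ref_tokens else 0.0


def non_empty(prompt: str, response: str, reference: str) -> float:
    """Edge-case guard: blank or whitespace-only responses score 0.0."""
    return 1.0 if response.strip() else 0.0


# Illustrative registry and thresholds (assumed values, not recommendations).
SIGNALS: Dict[str, Signal] = {
    "exact_match": exact_match,
    "token_overlap": token_overlap,
    "non_empty": non_empty,
}
THRESHOLDS: Dict[str, float] = {"token_overlap": 0.5, "non_empty": 1.0}


def evaluate(prompt: str, response: str, reference: str,
             signals: Dict[str, Signal]) -> Dict[str, float]:
    """Run every signal and keep the scores separate instead of collapsing them."""
    return {name: fn(prompt, response, reference) for name, fn in signals.items()}


def passes(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """A response passes only if every thresholded signal clears its bar."""
    return all(scores[name] >= bar for name, bar in thresholds.items())


if __name__ == "__main__":
    for response in ["Paris", "The capital is Paris", "   "]:
        scores = evaluate("Capital of France?", response, "Paris", SIGNALS)
        print(repr(response), scores, passes(scores, THRESHOLDS))
```

An LLM judge could be added as one more entry in the registry, which keeps it as one vote among several and not the whole verdict.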
Vibes are not evals. Stars appreciated ⭐
GitHub: https://github.com/future-agi/ai-evaluation