Evaluation - DEV Community

Saurav Bhattacharya

Jul 24

Stop Leaving Findings in the Judge: The Ratchet That Turns Opinions Into Gates

#ai #agents #evaluation #observability

4 min read

Saurav Bhattacharya

Jul 22

The Cold-Start Problem for Agent Evals: What to Gate on Day One With Zero Labeled Data

#ai #agents #evaluation #typescript

4 min read

Jul 21

Your eval dashboard has 30 metrics. When one "moves," that is usually arithmetic, not a regression.

#statistics #machinelearning #datascience #evaluation

6 min read

Pneumetron

Jul 15

PoPE: Placebo-Controlled Evaluation Challenges Error-Conditioned Self-Repair in Small Code LLMs

#llms #codegeneration #selfrepair #evaluation

3 min read

Tatsuya Shimomoto

Jul 14

LLM-as-Judge Shouldn't Aggregate Scores: Binary Checks as Evidence, One Holistic Verdict

#llm #promptengineering #evaluation #claudecode

12 min read

Paul Twist

Jul 13

The Evaluation Debt You Don't Know You Have: Why Agent Evals Fail in Production

#agents #ai #evaluation #infrastructure

7 min read

Muhammed Rasin O M

Jul 10

Half the answer keys in text-to-SQL benchmarks are wrong. So I generated the database from the answer key.

#evaluation #dataagents #benchmarks #syntheticdata

7 min read

Jul 22

An LLM judge is a biased instrument, not a measurement

#llm #evaluation #statistics #ai

6 min read

Saurav Bhattacharya

Jul 19

Stop Judging Every Run: Eval Sampling Is a Budget Decision, Not a Coverage One

#ai #agents #evaluation #observability

5 min read

Jul 5

Evaluating LLM Apps in Java

#java #ai #llm #evaluation

10 min read

Jul 5

Evaluating LLM Apps in Python

#python #ai #llm #evaluation

9 min read

Saurav Bhattacharya

Jul 2

Short-Circuit Your Agent Evals: Tier Order Is a Latency Budget, Not a Preference

#ai #agents #evaluation #typescript

5 min read

Breach Protocol

Jul 1

Your AI judge might be reliable — and still be wrong

#evaluation #llmjudges #rlhf #methodology

3 min read

Breach Protocol

Jul 1

Reliable, and still wrong

#evaluation #llmasjudge #benchmarks

3 min read

Saurav Bhattacharya

Jun 29

Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge

#ai #agents #evaluation #typescript

4 min read
