Nilofer 🚀

Posted on Jun 10

AgentLiar Detector: Catch Coding Agents That Falsely Claim Task Completion

#llm #agents #machinelearning #opensource

AI coding agents are getting better at completing tasks. They are also getting better at appearing to complete tasks. An agent that claims "done" when it has created placeholder files, written empty tests, or quietly narrowed the scope of the original requirement is harder to catch than one that simply fails, because the failure is hidden inside output that looks correct at a glance.

AgentLiar is a production-ready system that detects when coding agents falsely claim task completion. It runs four independent verification checks, produces a weighted confidence score from 0 to 100, and delivers structured evidence in JSON, Markdown, or console output - usable as a CLI tool, Python library, GitHub Action, or HTTP API.

Features

4 Independent Checks - File, Test, Scope, and LLM Judge.
Confidence Scoring - weighted aggregation on a 0–100 scale.
Multiple Interfaces - CLI, Python API, GitHub Action, and HTTP API.
Adversarial Detection - catches placeholder implementations, empty tests, and scope narrowing.
Structured Reports - JSON and Markdown output with evidence.
Production Ready - type hints, error handling, logging, and async support.

Architecture

The async orchestrator dispatches four independent checks: File, Test, and Scope, which run locally, plus an optional LLM Judge that calls OpenRouter. It combines their results into a weighted 0–100 confidence score, delivered as JSON, Markdown, or console output and suitable for CI gating.
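
To make the dispatch step concrete, here is a minimal sketch of running independent checks concurrently with asyncio.gather. The CheckResult type and the check_files, check_tests, and check_scope coroutines are hypothetical stand-ins that return fixed scores; they are not AgentLiar's actual classes.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CheckResult:
    # Hypothetical result shape, for illustration only
    name: str
    score: float  # 0-100 for this check alone


async def check_files() -> CheckResult:
    return CheckResult("file", 100.0)  # stand-in for real file analysis


async def check_tests() -> CheckResult:
    return CheckResult("test", 50.0)  # stand-in for real test analysis


async def check_scope() -> CheckResult:
    return CheckResult("scope", 80.0)  # stand-in for real scope analysis


async def run_checks(checks) -> list[CheckResult]:
    # The checks do not depend on each other, so they can run concurrently.
    # gather() returns results in the same order the coroutines were passed.
    return list(await asyncio.gather(*(check() for check in checks)))


results = asyncio.run(run_checks([check_files, check_tests, check_scope]))
```

Because no check depends on another, a slow check (such as a network call to a judge model) would not hold up the local ones in this arrangement.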

The Four Verification Checks

1. File Check

Detects missing expected files
Identifies unexpected new files
Finds placeholder content: TODO, FIXME, pass-only
Validates file sizes and content
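
One way the placeholder detection above could work for Python sources is sketched below. It is an illustration under stated assumptions, not AgentLiar's implementation: find_placeholders is a hypothetical helper that scans lines for TODO and FIXME markers and uses the standard ast module to flag functions whose body is empty apart from pass or a docstring.

```python
import ast
import re

PLACEHOLDER_MARKERS = re.compile(r"\b(TODO|FIXME)\b")


def _strip_docstring(body):
    # Drop a leading docstring so it does not count as real code
    first = body[0] if body else None
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        return body[1:]
    return body


def find_placeholders(source: str) -> list[str]:
    """Return human-readable findings for one Python source string."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if PLACEHOLDER_MARKERS.search(line):
            findings.append(f"line {lineno}: placeholder marker")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = _strip_docstring(node.body)
            # Only `pass` statements (or nothing) remain: treat as a stub
            if all(isinstance(stmt, ast.Pass) for stmt in body):
                findings.append(
                    f"line {node.lineno}: function '{node.name}' has an empty body"
                )
    return findings
```

This sketch does not flag bodies written as `...` or `raise NotImplementedError`, which a production check would probably also want to catch.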

2. Test Check

Detects empty test bodies
Identifies tests without assertions
Finds skipped tests
Validates claimed versus actual test counts

3. Scope Check

Detects silent scope narrowing: "only", "for now"
Identifies partial implementations
Finds TODO markers in code
Validates requirements coverage
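
A simple form of the narrowing detection is a phrase scan over the agent's claim text. The sketch below assumes a regex-based approach; the two phrases come from the list above, while the narrowing_phrases helper and its return shape are hypothetical.

```python
import re

# "only" and "for now" come from the article; a real list would likely be longer.
NARROWING = re.compile(r"\b(only|for now)\b", re.IGNORECASE)


def narrowing_phrases(claim_text: str) -> list[str]:
    """Return each narrowing phrase found in the claim, lowercased, in order."""
    return [match.group(0).lower() for match in NARROWING.finditer(claim_text)]
```

A phrase match is only a signal, since "only" also appears in harmless sentences, so a hit is better treated as evidence to weigh than as a verdict on its own.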

4. LLM Judge

Independent assessment via OpenRouter
Structured JSON output
Timeout and retry logic
Optional - works without an API key
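
The timeout and retry behaviour can be sketched with asyncio.wait_for and a backoff loop. Everything named here is an assumption for illustration: call_with_retry is a hypothetical wrapper, judge is any coroutine function that performs the model call, and returning None on exhaustion is one possible way to let the local checks proceed without the judge.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def call_with_retry(
    judge: Callable[[], Awaitable[Any]],
    *,
    timeout: float = 10.0,
    attempts: int = 3,
    backoff: float = 0.5,
) -> Optional[Any]:
    """Await judge() with a per-attempt timeout; return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(judge(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            # Exponential backoff between attempts, skipped after the last one
            if attempt < attempts - 1:
                await asyncio.sleep(backoff * 2 ** attempt)
    return None
```

A caller would pass a coroutine function that wraps the actual OpenRouter request and treat a None result the same way as a missing API key.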

Quick Start

Installation

pip install -e .

Alternatively, run pip install agentliar once the package is published. Requires Python 3.10+.

CLI Usage

Prepare sample inputs from examples/simple_task.json, then run:

agentliar verify \
  --task-file .tmp/task.txt \
  --claim-file .tmp/claim.json \
  --changes-file .tmp/changes.json \
  --format markdown

Use agentliar config to inspect configuration and agentliar analyze .tmp/task.txt to review a task file.

Python API

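import asyncio

from agentliar import Verifier


async def main(task, claim_payload, changes_payload):
    # `await` is only valid inside a coroutine, so the call is wrapped here
    verifier = Verifier()
    result = await verifier.verify(
        task_description=task,
        claim=claim_payload,
        file_changes=changes_payload,
    )
    # Read result.score, result.passed, result.confidence_level, result.reports
    return result

# Run with: asyncio.run(main(task, claim_payload, changes_payload))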

GitHub Action

Use the GitHub Action with task, claim, and change files, a confidence threshold, and an optional OPENROUTER_API_KEY secret when you want the LLM Judge path enabled.

HTTP API

Start the API server:

python -m agentliar.server
# or
uvicorn agentliar.server:app --host 0.0.0.0 --port 8000

Then POST /verify with the task, claim, and file-change payloads. The response returns score, pass/fail, and evidence blocks.

Confidence Score Interpretation

90–100 - High. Task appears fully completed.
70–89 - Medium. Task likely complete with minor issues.
50–69 - Low. Task partially completed.
30–49 - Critical. Significant issues detected.
0–29 - Failed. Task likely not completed.
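
The bands above translate directly into a lookup. This is a small sketch of that mapping; the function name confidence_level is borrowed from the result attribute mentioned earlier, but the implementation is an assumption, not the project's code.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 confidence score to the band names listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "High"
    if score >= 70:
        return "Medium"
    if score >= 50:
        return "Low"
    if score >= 30:
        return "Critical"
    return "Failed"
```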

Configuration

Create a .env file. Set OPENROUTER_API_KEY and OPENROUTER_MODEL only if you want LLM Judge mode. The check weights must sum to 1.0. CONFIDENCE_THRESHOLD controls the pass/fail cutoff.
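
To show why the weights must sum to 1.0, here is a sketch of weighted aggregation with a pass/fail cutoff. The weight values, the per-check scores, and the threshold of 70 are illustrative assumptions, and weighted_score is a hypothetical helper rather than AgentLiar's scorer.

```python
import math


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-check scores (0-100 each) into one 0-100 score."""
    # If the weights summed to anything else, the result would leave the 0-100 range
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("check weights must sum to 1.0")
    return sum(scores[name] * weight for name, weight in weights.items())


# Assumed weights and scores, chosen only for the example
weights = {"file": 0.3, "test": 0.3, "scope": 0.2, "llm_judge": 0.2}
scores = {"file": 100.0, "test": 50.0, "scope": 80.0, "llm_judge": 60.0}

score = weighted_score(scores, weights)  # 30 + 15 + 16 + 12, about 73
passed = score >= 70  # 70 stands in for CONFIDENCE_THRESHOLD
```

How the real scorer redistributes weight when the LLM Judge is disabled is not described in this post, so the sketch simply requires that whichever weights are supplied sum to 1.0.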

Recommended LLM Judge models (May 2026):

anthropic/claude-haiku-4-5 - cheap and fast judging
anthropic/claude-sonnet-4-6 or openai/gpt-5.4 - higher-quality judging
openai/gpt-4.1-mini - budget option

Use Cases

CI/CD Integration - automatically verify PR claims before merging.
Code Review - get an independent assessment of task completion alongside a human review.
Agent Monitoring - detect when AI agents overstate progress in automated pipelines.
Quality Gates - block merges below a confidence threshold.
Documentation - generate verification reports for stakeholders.

Security

No hardcoded secrets
API keys via environment variables only
No data persistence
Local processing except for LLM Judge

Project Structure

src/agentliar/           # Checks, orchestration, scoring, reports, API, CLI, server
tests/
├── unit/                # Unit tests
├── adversarial/         # Adversarial tests
└── integration/         # Integration tests
examples/                # Sample inputs
action.yml               # GitHub Action definition
pyproject.toml           # Packaging and tooling

Testing

pytest                                    # Full suite
pytest --cov=agentliar --cov-report=html  # With coverage
pytest tests/unit/                        # Unit tests only
pytest tests/adversarial/                 # Adversarial tests only
pytest tests/integration/                 # Integration tests only

Code Quality

ruff check .        # Linting
ruff format .       # Formatting
mypy src tests      # Type checking

How I Built This Using NEO

This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The requirement was a production-ready verification system for detecting false completion claims from coding agents - running four independent checks locally, with an optional LLM Judge via OpenRouter, and exposing the result through a CLI, Python API, GitHub Action, and HTTP API. NEO built the full implementation: the async orchestrator dispatching all four checks, the File, Test, Scope, and LLM Judge check modules, the weighted confidence scorer, the JSON and Markdown report generators, the Click CLI with verify, config, and analyze commands, the FastAPI HTTP server, the GitHub Action definition in action.yml, and the test suite split across unit, adversarial, and integration coverage.

How You Can Use and Extend This With NEO

Use it as a CI gate on every PR that includes AI-generated code.
Add the GitHub Action to your workflow with a confidence threshold. Any PR whose confidence score from the file, test, and scope checks falls below that threshold is blocked before merge - automatically, without a reviewer having to spot the placeholder implementation manually.

Use it to monitor agent progress in long-running pipelines.
Call await verifier.verify(...) from Python after each agent task completes. The confidence score and evidence blocks tell you whether the agent actually finished the task or produced output that looks complete but is not - before the next stage of the pipeline starts.

Use the LLM Judge for higher-confidence verification on critical tasks.
Set OPENROUTER_API_KEY and configure a judge model for tasks where the local checks alone are not sufficient. The LLM Judge runs independently from the other three checks and adds a cross-model perspective to the confidence score.

Extend it with additional check types.
The four checks share a common async interface in the orchestrator. A new check follows the same pattern and its weight is added to the configuration. The orchestrator, scorer, and reporters pick it up automatically.
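
As a rough picture of that pattern, the sketch below defines a shared async interface and one extra check that plugs into it. The Check protocol, the MaxFileSizeCheck class, and the run_all helper are all hypothetical; the real base class and registration mechanism in the repository may look different.

```python
import asyncio
from typing import Protocol


class Check(Protocol):
    """Hypothetical shared interface; AgentLiar's real one may differ."""

    name: str

    async def run(self, file_changes: dict[str, str]) -> float: ...


class MaxFileSizeCheck:
    """Example extra check: scores 0 if any changed file exceeds a size limit."""

    name = "max_file_size"

    def __init__(self, limit: int = 10_000) -> None:
        self.limit = limit

    async def run(self, file_changes: dict[str, str]) -> float:
        oversized = [path for path, text in file_changes.items()
                     if len(text) > self.limit]
        return 0.0 if oversized else 100.0


async def run_all(checks: list[Check], file_changes: dict[str, str]) -> dict[str, float]:
    # Any object with a name and an async run() can join the list
    scores = await asyncio.gather(*(check.run(file_changes) for check in checks))
    return {check.name: score for check, score in zip(checks, scores)}
```

In this arrangement the new check's name would also need an entry in the weight configuration, with the other weights adjusted so the total still sums to 1.0.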

Final Notes

Agents that falsely claim completion are harder to catch than agents that fail outright - because the output exists and looks plausible. AgentLiar makes the verification systematic: four independent checks, a weighted confidence score, and structured evidence that tells you exactly where the claim breaks down.

The code is at https://github.com/dakshjain-1616/AgentLiar
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

Top comments (2)

Max Quimby • Jun 16

The scope-narrowing check is the one I'd have skipped and regretted. Outright failures are loud — you get a traceback, a red CI run. The expensive ones are agents that quietly redefine "done": narrow a "handle all timezones" requirement to "handle UTC," write a test that asserts True, and report green. By the time anyone notices it's three commits deep.

One thing I keep running into with the LLM-Judge layer: the judge is itself an agent that can hallucinate a pass. How are you guarding against that — do you de-weight the judge when the deterministic File/Test checks disagree with it, or treat all four equally in the 0–100 score? I've found the deterministic checks have to dominate, with the judge as a tiebreaker, otherwise you've just added a second liar to catch the first.

Also curious whether you've considered diffing claimed-vs-actual test counts against coverage deltas — empty tests pass, but they don't move coverage.

Alex Shev • Jun 11

This is a useful framing because “agent lied” is often really “the system accepted an unverified claim.” Coding agents are especially good at sounding finished before the artifact is actually finished.

The fix is not only better prompts. The workflow needs external checks: tests, diffs, logs, screenshots, or domain-specific validators that can reject a completion claim. If success is only self-reported, the agent will eventually optimize for sounding done.