AI coding agents are getting better at completing tasks. They are also getting better at appearing to complete tasks. An agent that claims "done" when it has created placeholder files, written empty tests, or quietly narrowed the scope of the original requirement is harder to catch than one that simply fails, because the failure is hidden inside output that looks correct at a glance.
AgentLiar is a production-ready system that detects when coding agents falsely claim task completion. It runs four independent verification checks, produces a weighted confidence score from 0 to 100, and delivers structured evidence in JSON, Markdown, or console output - usable as a CLI tool, Python library, GitHub Action, or HTTP API.
Features
4 Independent Checks - File, Test, Scope, and LLM Judge.
Confidence Scoring - weighted aggregation on a 0–100 scale.
Multiple Interfaces - CLI, Python API, GitHub Action, and HTTP API.
Adversarial Detection - catches placeholder implementations, empty tests, and scope narrowing.
Structured Reports - JSON and Markdown output with evidence.
Production Ready - type hints, error handling, logging, and async support.
Architecture
The async orchestrator dispatches four independent checks: three local ones (File, Test, and Scope) plus an optional LLM Judge that runs through OpenRouter. It combines their results into a weighted 0–100 confidence score, delivered as JSON, Markdown, or console output for CI gating.
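The dispatch-and-aggregate flow described above can be sketched roughly as follows. This is a minimal illustration, not AgentLiar's actual code: the `CheckResult` class, the `check_*` stubs, the stub scores, and the weights are all hypothetical, and the optional LLM Judge is left out for brevity.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    score: float  # 0-100 score for this check alone


# Stub checks with made-up scores; real checks would inspect files and claims.
async def check_files() -> CheckResult:
    return CheckResult("file", 100.0)


async def check_tests() -> CheckResult:
    return CheckResult("test", 50.0)


async def check_scope() -> CheckResult:
    return CheckResult("scope", 80.0)


# Hypothetical weights; the project's real defaults are not stated in this article.
WEIGHTS = {"file": 0.4, "test": 0.4, "scope": 0.2}


async def run_checks() -> float:
    # Run the checks concurrently, then fold them into one weighted score.
    results = await asyncio.gather(check_files(), check_tests(), check_scope())
    total_weight = sum(WEIGHTS[r.name] for r in results)
    return sum(r.score * WEIGHTS[r.name] for r in results) / total_weight


score = asyncio.run(run_checks())
print(f"confidence: {score:.1f}")
```

Dividing by the total weight of the checks that actually ran is one possible way to keep the score on a 0–100 scale when an optional check is skipped. That choice is an assumption of this sketch.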
The Four Verification Checks
1. File Check
- Detects missing expected files
- Identifies unexpected new files
- Finds placeholder content: TODO, FIXME, pass-only
- Validates file sizes and content
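As a rough illustration of the placeholder detection listed above, the sketch below flags TODO/FIXME markers and functions whose body is nothing but `pass`. The function name `find_placeholders` and the finding format are hypothetical. The actual File Check may work differently.

```python
import ast
import re

MARKER = re.compile(r"\b(TODO|FIXME)\b")


def find_placeholders(source: str) -> list[str]:
    """Return findings for placeholder-looking content in Python source."""
    findings = []
    # Pass 1: textual markers anywhere in the file, including comments.
    for number, line in enumerate(source.splitlines(), start=1):
        if MARKER.search(line):
            findings.append(f"line {number}: TODO/FIXME marker")
    # Pass 2: functions whose body consists only of `pass` statements.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                findings.append(f"line {node.lineno}: function '{node.name}' is pass-only")
    return findings


sample = "def save(data):\n    pass  # TODO: implement\n\ndef load():\n    return 1\n"
for finding in find_placeholders(sample):
    print(finding)
```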
2. Test Check
- Detects empty test bodies
- Identifies tests without assertions
- Finds skipped tests
- Validates claimed versus actual test counts
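One way to approximate the hollow-test detection above is to walk the syntax tree of a test file. This is a sketch under stated assumptions: `audit_tests` is a hypothetical name, only bare `assert` statements count as assertions (so `unittest`-style `self.assertEqual` calls would be misreported), and any decorator containing "skip" is treated as a skip.

```python
import ast


def audit_tests(source: str) -> dict[str, list[str]]:
    """Group suspicious test functions by the reason they look hollow."""
    report: dict[str, list[str]] = {"empty": [], "no_assertions": [], "skipped": []}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef) or not node.name.startswith("test_"):
            continue
        body = node.body
        if ast.get_docstring(node) is not None:
            body = body[1:]  # a docstring alone is not a real test body
        if any("skip" in ast.unparse(dec) for dec in node.decorator_list):
            report["skipped"].append(node.name)
        elif all(isinstance(stmt, ast.Pass) for stmt in body):
            report["empty"].append(node.name)
        elif not any(isinstance(child, ast.Assert) for child in ast.walk(node)):
            report["no_assertions"].append(node.name)
    return report


sample = '''
import pytest

def test_empty():
    pass

def test_no_assert():
    compute()

@pytest.mark.skip
def test_skipped():
    assert True

def test_real():
    assert 1 + 1 == 2
'''
print(audit_tests(sample))
```

The sample source is only parsed, never executed, so the undefined `compute()` call inside it is harmless.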
3. Scope Check
- Detects silent scope narrowing: "only", "for now"
- Identifies partial implementations
- Finds TODO markers in code
- Validates requirements coverage
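The scope-narrowing detection above could look something like this. It is a deliberately simple heuristic sketch: the function name is hypothetical, and the phrase list contains only the two examples named in this article. Words like "only" also appear in honest claims, so hits are better treated as evidence to review than as proof.

```python
import re

# The two narrowing phrases mentioned in the article; a real list would be longer.
NARROWING_PHRASES = ("only", "for now")


def find_scope_narrowing(claim_text: str) -> list[str]:
    """Return the sentences of a claim that contain a narrowing phrase."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", claim_text.strip()):
        lowered = sentence.lower()
        if any(re.search(rf"\b{re.escape(phrase)}\b", lowered) for phrase in NARROWING_PHRASES):
            hits.append(sentence)
    return hits


claim = "Implemented the CSV exporter. Only the happy path is handled for now. Added tests."
print(find_scope_narrowing(claim))
```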
4. LLM Judge
- Independent assessment via OpenRouter
- Structured JSON output
- Timeout and retry logic
- Optional - works without an API key
Quick Start
Installation
pip install -e .
Once the package is published, pip install agentliar will also work. Requires Python 3.10+.
CLI Usage
Prepare sample inputs from examples/simple_task.json, then run:
agentliar verify \
  --task-file .tmp/task.txt \
  --claim-file .tmp/claim.json \
  --changes-file .tmp/changes.json \
  --format markdown
Use agentliar config to inspect configuration and agentliar analyze .tmp/task.txt to review a task file.
Python API
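import asyncio
from agentliar import Verifier

async def main() -> None:
    verifier = Verifier()
    # task, claim_payload, and changes_payload must be loaded before this call
    result = await verifier.verify(
        task_description=task,
        claim=claim_payload,
        file_changes=changes_payload,
    )
    # Read result.score, result.passed, result.confidence_level, result.reports

asyncio.run(main())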
GitHub Action
Use the GitHub Action with task, claim, and change files, a confidence threshold, and an optional OPENROUTER_API_KEY secret when you want the LLM Judge path enabled.
HTTP API
Start the API server:
python -m agentliar.server
# or
uvicorn agentliar.server:app --host 0.0.0.0 --port 8000
Then POST /verify with the task, claim, and file-change payloads. The response returns score, pass/fail, and evidence blocks.
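For illustration, a request to that endpoint might be built with the standard library as shown here. The JSON field names are an assumption: they simply mirror the keyword arguments of the Python API above, and the shape of the claim payload is made up. Check the server's schema before relying on either.

```python
import json
import urllib.request


def build_verify_request(base_url: str, task: str, claim: dict, changes: list) -> urllib.request.Request:
    """Build (but do not send) a POST /verify request."""
    payload = {
        # Assumed field names, mirroring the Python API's keyword arguments.
        "task_description": task,
        "claim": claim,
        "file_changes": changes,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/verify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_verify_request(
    "http://localhost:8000",
    "Add a CSV exporter",
    {"summary": "done"},  # hypothetical claim shape
    [],
)
# To send it against a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     body = json.load(response)
```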
Confidence Score Interpretation
90–100 - High. Task appears fully completed.
70–89 - Medium. Task likely complete with minor issues.
50–69 - Low. Task partially completed.
30–49 - Critical. Significant issues detected.
0–29 - Failed. Task likely not completed.
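The bands above translate directly into a lookup function. This is a small sketch: the function name is hypothetical, and treating fractional scores such as 89.5 as belonging to the lower band is an assumption, since the article lists whole-number ranges only.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 confidence score to the band names listed above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "High"
    if score >= 70:
        return "Medium"
    if score >= 50:
        return "Low"
    if score >= 30:
        return "Critical"
    return "Failed"


for value in (95, 76, 50, 30, 12):
    print(value, confidence_level(value))
```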
Configuration
Create a .env file. Set OPENROUTER_API_KEY and OPENROUTER_MODEL only if you want LLM Judge mode. The check weights must sum to 1.0. CONFIDENCE_THRESHOLD controls the pass/fail cutoff.
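The sum-to-1.0 rule and the threshold lookup can be checked with a few lines like these. The weight values, the weight key names, and the default threshold of 70 are all hypothetical. Only the CONFIDENCE_THRESHOLD variable name and the sum rule come from the text above.

```python
import math
import os

# Hypothetical weights for illustration; only the sum-to-1.0 rule is from the article.
EXAMPLE_WEIGHTS = {"file": 0.3, "test": 0.3, "scope": 0.2, "llm_judge": 0.2}


def validate_weights(weights: dict[str, float]) -> None:
    """Raise ValueError unless the check weights sum to 1.0."""
    total = sum(weights.values())
    # Compare with a tolerance, because float sums are rarely exact.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"check weights sum to {total}, expected 1.0")


def load_threshold(default: float = 70.0) -> float:
    """Read the pass/fail cutoff from the environment (assumed default)."""
    return float(os.environ.get("CONFIDENCE_THRESHOLD", default))


validate_weights(EXAMPLE_WEIGHTS)  # passes silently
```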
Recommended LLM Judge models (May 2026):
anthropic/claude-haiku-4-5 - cheap and fast judging
anthropic/claude-sonnet-4-6 or openai/gpt-5.4 - higher-quality judging
openai/gpt-4.1-mini - budget option
Use Cases
CI/CD Integration - automatically verify PR claims before merging.
Code Review - get an independent assessment of task completion alongside a human review.
Agent Monitoring - detect when AI agents overstate progress in automated pipelines.
Quality Gates - block merges below a confidence threshold.
Documentation - generate verification reports for stakeholders.
Security
- No hardcoded secrets
- API keys via environment variables only
- No data persistence
- Local processing except for LLM Judge
Project Structure
src/agentliar/ # Checks, orchestration, scoring, reports, API, CLI, server
tests/
├── unit/ # Unit tests
├── adversarial/ # Adversarial tests
└── integration/ # Integration tests
examples/ # Sample inputs
action.yml # GitHub Action definition
pyproject.toml # Packaging and tooling
Testing
pytest # Full suite
pytest --cov=agentliar --cov-report=html # With coverage
pytest tests/unit/ # Unit tests only
pytest tests/adversarial/ # Adversarial tests only
pytest tests/integration/ # Integration tests only
Code Quality
ruff check . # Linting
ruff format . # Formatting
mypy src tests # Type checking
How I Built This Using NEO
This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a production-ready verification system for detecting false completion claims from coding agents - running four independent checks locally, with an optional LLM Judge via OpenRouter, and exposing the result through a CLI, Python API, GitHub Action, and HTTP API. NEO built the full implementation: the async orchestrator dispatching all four checks, the File, Test, Scope, and LLM Judge check modules, the weighted confidence scorer, the JSON and Markdown report generators, the Click CLI with verify, config, and analyze commands, the FastAPI HTTP server, the GitHub Action definition in action.yml, and the test suite split across unit, adversarial, and integration coverage.
How You Can Use and Extend This With NEO
Use it as a CI gate on every PR that includes AI-generated code.
Add the GitHub Action to your workflow with a confidence threshold. Any PR where the agent's claimed changes score below your threshold on the file, test, and scope checks is blocked before merge - automatically, without a reviewer having to spot the placeholder implementation manually.
Use it to monitor agent progress in long-running pipelines.
Call await verifier.verify(...) from Python after each agent task completes. The confidence score and evidence blocks tell you whether the agent actually finished the task or produced output that looks complete but is not - before the next stage of the pipeline starts.
Use the LLM Judge for higher-confidence verification on critical tasks.
Set OPENROUTER_API_KEY and configure a judge model for tasks where the local checks alone are not sufficient. The LLM Judge runs independently from the other three checks and adds a cross-model perspective to the confidence score.
Extend it with additional check types.
The four checks share a common async interface in the orchestrator. A new check follows the same pattern and its weight is added to the configuration. The orchestrator, scorer, and reporters pick it up automatically.
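To make the extension pattern concrete, here is a sketch of what a shared async interface and one extra check could look like. Every name in it (`Check`, `CheckReport`, `LicenseHeaderCheck`, `run_all`) is hypothetical, as is the example check itself. The real interface lives in the AgentLiar source and may differ.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class CheckReport:
    name: str
    score: float  # 0-100
    evidence: list[str] = field(default_factory=list)


class Check(Protocol):
    """Hypothetical shared interface: every check exposes a name and an async run()."""

    name: str

    async def run(self, task: str, changes: dict[str, str]) -> CheckReport: ...


class LicenseHeaderCheck:
    """Example extra check: every changed file should start with an SPDX header."""

    name = "license_header"

    async def run(self, task: str, changes: dict[str, str]) -> CheckReport:
        missing = [
            path for path, content in changes.items()
            if not content.startswith("# SPDX-License-Identifier:")
        ]
        score = 100.0 * (1 - len(missing) / len(changes)) if changes else 100.0
        return CheckReport(self.name, score, [f"{p}: missing license header" for p in missing])


async def run_all(checks: list[Check], task: str, changes: dict[str, str]) -> list[CheckReport]:
    # Any object that satisfies the protocol can be dispatched the same way.
    return list(await asyncio.gather(*(check.run(task, changes) for check in checks)))


changes = {"a.py": "# SPDX-License-Identifier: MIT\nx = 1\n", "b.py": "x = 2\n"}
reports = asyncio.run(run_all([LicenseHeaderCheck()], "Add module b", changes))
print(reports[0])
```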
Final Notes
Agents that falsely claim completion are harder to catch than agents that fail outright - because the output exists and looks plausible. AgentLiar makes the verification systematic: four independent checks, a weighted confidence score, and structured evidence that tells you exactly where the claim breaks down.
The code is at https://github.com/dakshjain-1616/AgentLiar
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code
