This is a submission for the Hermes Agent Challenge
## The Problem with AI Code Review
Every AI coding tool on the market can summarize a diff. "This PR adds 5 files and modifies authentication." Great — but that's not a review. That's a description.
A real code review requires:
- Reading the actual files to understand context, not just the diff
- Tracing dependencies to see who consumes changed code
- Running tests to verify nothing broke
- Searching for patterns that might be affected
- Thinking in phases, not blasting out generic comments
In other words: a real review requires agency.
## What I Built
Hermes PR Investigator is a custom skill for Hermes Agent that turns the agent into an autonomous PR reviewer. It doesn't summarize diffs — it investigates them through a 5-phase agentic workflow.
## The 5-Phase Investigation

```
Discovery & Planning
        ↓
Deep File Analysis
        ↓
Validation (tests, lint, type-check)
        ↓
Cross-Reference (patterns, docs, security)
        ↓
Structured Report Generation
```
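The phase sequence above can be sketched as a simple driver loop. This is an illustrative sketch, not the actual skill implementation — the `Phase` enum and `run_investigation` function are hypothetical names:

```python
from enum import Enum

class Phase(Enum):
    DISCOVERY = "Discovery & Planning"
    ANALYSIS = "Deep File Analysis"
    VALIDATION = "Validation"
    CROSS_REFERENCE = "Cross-Reference"
    REPORT = "Structured Report Generation"

def run_investigation(pr_url: str) -> list:
    """Walk the five phases in order, carrying findings forward."""
    findings = []
    for phase in Phase:
        # Each phase sees everything earlier phases produced, so the
        # agent can adapt (e.g. skip deep tracing for a low-risk diff).
        findings.append(f"{phase.value}: investigated {pr_url}")
    return findings
```

The point of the explicit ordering is that later phases consume earlier output rather than running independently.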
## How It's Different
| Static Diff Summarizer | Hermes PR Investigator |
|---|---|
| Reads patch once | Reads full files and traces dependencies |
| Generic comments | Risk-scored, severity-rated findings |
| No validation | Runs tests, lint, type checks |
| Surface-level | Cross-references codebase for affected patterns |
| Static output | Agent adapts investigation based on findings |
## Demo: Real-World Test on Hermes Agent Itself
I didn't just build this — I tested it on a real merged PR from the NousResearch/hermes-agent repository:
**PR:** #26957 — fix(acp): replay session history before responding to `session/load`
### What Hermes Did (Autonomously)

```bash
hermes chat --toolsets "skills,terminal,file,web" \
  -q "Investigate PR https://github.com/NousResearch/hermes-agent/pull/26957"
```
- **Phase 1 — Discovery:** Fetched PR metadata via `gh pr view`, pulled the diff via `gh pr diff`, and checked CI status (`gh pr checks`)
- **Phase 2 — Analysis:** Read `acp_adapter/server.py` and `tests/acp/test_server.py`. The PR removes `_schedule_history_replay` and switches from a deferred `loop.call_soon` to an awaited inline replay.
- **Phase 3 — Validation:** Checked failing test logs via `gh run view --log-failed`. All 6 failures were pre-existing on main (registry manifest mismatch, `PermissionError` in the CI runner, xAI dotenv issue) — not introduced by this PR.
- **Phase 4 — Cross-Reference:** Searched the codebase for orphan references to `_schedule_history_replay`. Zero found — a clean removal.
- **Phase 5 — Report:** Generated a structured review with a verdict.
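Phase 4's orphan-reference check boils down to a codebase scan. A minimal sketch of the idea — not the actual `trace_deps.py`, whose logic may differ — could look like:

```python
from pathlib import Path

def find_references(root: str, symbol: str, exts: tuple = (".py",)) -> list:
    """Return (file, line_number, line) for every remaining use of `symbol`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), start=1
            ):
                if symbol in line:
                    hits.append((str(path), lineno, line.strip()))
    return hits

# An empty result after a removal PR means no orphan callers survive.
```

Running this against the repo with `symbol="_schedule_history_replay"` returning nothing is exactly the "clean removal" signal the investigator reported.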
### Findings from the Real PR
| Severity | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Warnings | 0 |
**Suggestion:** The `try/except` blocks in `load_session` and `resume_session` are near-identical (they differ only in the log message string). Consider extracting a `_replay_session_history_guarded(self, state, operation: str)` helper to keep things DRY.
**Verdict:** "This is a clean, well-researched fix. The bug was subtle — `loop.call_soon` makes the server look correct in isolated testing but breaks any client that inspects notification counts synchronously after `await loadSession()`. The fix aligns Hermes with every other ACP server and the spec's natural reading."
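The suggested extraction might look something like this — a hypothetical sketch built from the finding, not the actual server code (class and method bodies here are illustrative stand-ins):

```python
import asyncio
import logging

logger = logging.getLogger("acp")

class SessionServer:
    async def _replay_session_history_guarded(self, state: dict, operation: str):
        """Shared guard for load/resume; only the logged operation name differs."""
        try:
            await self._replay_history(state)
        except Exception:
            logger.exception("history replay failed during %s", operation)

    async def load_session(self, state: dict):
        await self._replay_session_history_guarded(state, "session/load")

    async def resume_session(self, state: dict):
        await self._replay_session_history_guarded(state, "session/resume")

    async def _replay_history(self, state: dict):
        # Stand-in for the real awaited inline replay the PR introduces.
        for note in state.get("history", []):
            state.setdefault("sent", []).append(note)
```

Because the replay is awaited inline, a client that checks notification counts immediately after the call sees them already delivered.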
## Demo: Local Auth Branch
I also tested on a synthetic PR adding JWT auth to a Flask app:
```bash
hermes chat --toolsets skills -q \
  "Investigate the local branch feature/add-auth"
```
What it found:
- **High:** Hardcoded `JWT_SECRET` fallback (`"default-secret"`) in the auth middleware
- **High:** `require_auth` decorator defined but never applied to any route
- **Medium:** 5 files changed, 0 test files modified
- **Medium:** Register endpoint lacks input validation or duplicate-user checks
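The hardcoded-secret finding points at a pattern like the following — an illustrative reconstruction, not the demo app's actual code — along with the fail-fast alternative the finding implies:

```python
import os

# Flagged pattern: a hardcoded fallback means a forgotten env var
# silently ships a guessable signing key to production.
JWT_SECRET = os.environ.get("JWT_SECRET", "default-secret")

# Safer pattern: refuse to start when the secret is absent.
def load_jwt_secret() -> str:
    secret = os.environ.get("JWT_SECRET")
    if not secret:
        raise RuntimeError("JWT_SECRET must be set")
    return secret
```

Failing fast turns a silent security hole into a loud deployment error.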
See the full demo report in the repo: `demo/real-world-report-pr-26957.md`
## Code

**Repository:** [github.com/Aditya2073/hermes-pr-investigator](https://github.com/Aditya2073/hermes-pr-investigator)
### Project Structure

```
hermes-pr-investigator/
├── skills/devops/pr-investigator/
│   ├── SKILL.md               # Agent instructions
│   └── scripts/
│       ├── fetch_pr.sh        # GitHub API fetcher
│       ├── analyze_diff.py    # Risk analyzer
│       ├── trace_deps.py      # Dependency tracer
│       ├── run_validation.py  # Test runner
│       └── generate_report.py # Report generator
├── demo/                      # Demo repo + sample data
├── install.sh                 # One-line installer
└── .github/workflows/         # GitHub Action for auto-review
```
## My Tech Stack

- **Hermes Agent:** The orchestrator — handles planning, tool use, and multi-step reasoning
- **Python 3 + stdlib:** Helper scripts for analysis (no external deps)
- **Bash:** GitHub API integration
- **GitHub Actions:** Auto-runs on every PR
## How I Used Hermes Agent

### Agentic Planning
The core of this project is the `SKILL.md` file — it's not just documentation; it's agent instructions. Hermes reads it and decides:
- Which files to read first (based on risk score)
- When to run validation (after understanding the changes)
- How deep to trace dependencies (only for core files)
Hermes uses its built-in todo tool to track the 5 phases, so if validation fails in Phase 3, it can adapt the investigation plan.
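The "read highest-risk files first" step can be sketched with a toy scoring heuristic. This is illustrative only — the real `analyze_diff.py` logic may differ, and the weights here are made up:

```python
def risk_score(path: str, additions: int, deletions: int) -> int:
    """Toy heuristic: bigger diffs and core modules rank higher, tests lower."""
    score = additions + deletions
    if path.startswith("tests/"):
        return score - 20    # test-only churn is lower risk
    if any(hint in path for hint in ("auth", "server", "core")):
        score += 50          # core/security-adjacent files get priority
    return score

changed = [
    ("acp_adapter/server.py", 40, 25),
    ("tests/acp/test_server.py", 30, 5),
    ("README.md", 3, 1),
]
# Read order: highest risk first.
read_order = sorted(changed, key=lambda f: -risk_score(*f))
```

Sorting by descending score gives the agent a principled order for its limited reading budget instead of walking the diff top to bottom.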
### Heavy Tool Use
The skill orchestrates six Hermes tools:
- `terminal`: Runs analysis scripts, git commands, and test suites
- `read_file`: Reads modified files and their dependencies
- `web_search`: Looks up security advisories for dependencies
- `execute_code`: Runs Python validation scripts in a sandbox
- `todo`: Tracks investigation phases
- `skill_manage`: Learns from reviews and improves its own approach
### Progressive Disclosure
The skill uses Hermes' progressive disclosure pattern:
- **Level 0:** Skill name and description in the system prompt (~3k tokens)
- **Level 1:** Full SKILL.md loads only when the user invokes `/pr-investigator`
- **Level 2:** Individual reference files load on demand
This keeps token usage efficient — the agent doesn't carry PR review instructions into unrelated conversations.
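The three-level pattern can be sketched as a registry that keeps only summaries resident and loads full instructions on invocation. These are hypothetical names for illustration, not the Hermes internals:

```python
class SkillRegistry:
    def __init__(self):
        self._summaries = {}  # Level 0: always in the system prompt
        self._loaders = {}    # Level 1+: loaded only on invocation

    def register(self, name, summary, loader):
        self._summaries[name] = summary
        self._loaders[name] = loader

    def system_prompt_stub(self) -> str:
        # Only names + one-line descriptions reach every conversation.
        return "\n".join(f"/{n}: {s}" for n, s in self._summaries.items())

    def invoke(self, name: str) -> str:
        # The full SKILL.md body enters context only here.
        return self._loaders[name]()

registry = SkillRegistry()
registry.register(
    "pr-investigator",
    "Autonomous 5-phase PR review",
    lambda: "FULL SKILL.md INSTRUCTIONS",
)
```

The deferred `loader` callable is what keeps unrelated conversations from ever paying the full instruction cost.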
### Memory & Learning
Because Hermes has persistent memory, the investigator learns over time:
- It remembers which projects use which test frameworks
- It learns the team's coding conventions from previous reviews
- It improves its risk scoring based on which findings actually mattered
## Why This Approach Wins
Most "AI code review" submissions will be static analyzers or diff summarizers. I proved this is different by running it on a real PR and watching it:
- **Execute:** It ran `gh pr checks` and `gh run view --log-failed`, and searched the actual codebase — not just reading the patch
- **Trace:** It found zero orphan references to `_schedule_history_replay`, confirming a clean removal
- **Adapt:** When CI showed failures, it checked whether they were pre-existing on main before flagging them
- **Report:** Structured severity ratings (Critical/High/Medium/Low) with specific line references
- **Reason:** It understood the subtle bug — `loop.call_soon` looking correct in isolation but breaking synchronous client inspection
The real PR test produced a 500-word technical review with a suggestion the human reviewers missed (DRY refactoring of near-identical try/except blocks).
## Try It Yourself

```bash
# Install Hermes Agent
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Install the skill
git clone https://github.com/Aditya2073/hermes-pr-investigator.git
cd hermes-pr-investigator
bash install.sh

# Set your GitHub token
echo 'GITHUB_TOKEN=ghp_xxx' >> ~/.hermes/.env

# Investigate a PR
hermes chat --toolsets skills -q "/pr-investigator https://github.com/owner/repo/pull/123"
```
Or set up the GitHub Action to automatically review every PR:
```yaml
name: Hermes PR Investigator
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  investigate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Hermes
        run: curl -fsSL ... | bash
      - name: Install Skill
        run: bash install.sh
      - name: Run Investigation
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: hermes chat --toolsets skills -q "/pr-investigator ${{ github.event.pull_request.html_url }}"
```
## What's Next

- **Focus modes:** `--focus security`, `--focus performance`, `--focus tests`
- **Custom rules:** Team-specific conventions via `.hermes/pr-rules.md`
- **Batch reviews:** Run across all open PRs nightly via Hermes cron
- **IDE integration:** ACP adapter for in-editor review requests
Thanks for reading! If you found this interesting, give it a ❤️ and let me know what you'd want an agentic PR reviewer to catch.