Lewis

Posted on • Originally published at Medium

Coding agents broke code review. Two Claude Code skills help me fight back.

Last week, I had an idea for a new app.

Forty-eight hours later, Opus 4.7 and I submitted its v1 to the App Store.

Of course, I didn’t write a single line of code myself—a practice increasingly adopted by the world’s most respected engineers:

Tweet from Andrej Karpathy saying he is “mostly programming in English now”

Tweet from Boris Cherny saying “pretty much 100% of our code is written by Claude Code + Opus 4.5”

There's a tradeoff to shipping at the speed of English like this, though: more bugs make it into production.

Earlier this year, Concordia researchers tracked 200,000 code units across 201 projects and found AI code gets bug-fixed at a higher rate than human code.

Amazon recently pulled engineers into a “deep dive” on a string of AWS outages, citing “novel GenAI usage for which best practices and safeguards are not yet fully established.”

Then a few weeks ago we saw Axios, Mercor, Railway, the Argentinian government, and more suffer security breaches, all within the span of a few days—how many were caused by rubber-stamped AI code?

Coding agents shift the bottleneck from writing code to reviewing it, an increasingly overwhelming task given the speed at which code can be generated.

So how are engineers responding to the PR firehose? A growing number are simply dropping the ball on review. A new NAIST study of 1,664 agentic pull requests found that 75% pass through with zero revisions requested. Three out of four AI-generated PRs ship without a single human change (despite containing more bugs!).

Like those cited above, I too have been guilty of sacrificing thoroughness for speed. And like them, I too have paid a price.

I found that while I could get new app updates into users' hands faster, I then lost time fielding their reports of bugs and features that didn't work as intended.

I was losing much of the time I’d saved and burning precious feedback cycles.

Then two Claude Code skills leveled the playing field.


How to catch real bugs in AI-generated code in six keystrokes

These two skills work in tandem with Macroscope, the AI code review tool.

They are designed to give Claude seamless integration with a dedicated code reviewer, so you get frontier-quality bug detection without leaving the Claude Code interface, and without breaking the bank on Claude Code's built-in review.

Macroscope was the first major code review tool to ditch seat-based pricing in favour of usage-based pricing—$0.05 per KB reviewed, with most reviews coming in under a dollar. Claude Code's reviews meanwhile clock in an order of magnitude higher at $15–25 apiece, for what I find to be worse performance.

I’ll start by showing you what the workflow looks like, then I’ll show you how to set it up.


Every time you submit a new PR to GitHub, Macroscope automatically reviews it. It almost always comes back with a few real bugs.

a screenshot of Macroscope leaving a comment about a bug in GitHub

a second screenshot of Macroscope leaving a comment about a bug in GitHub

Once I get an alert that Macroscope has finished its review, I run the first skill /review-pr-comments in Claude Code.

a screenshot of the claude code desktop app, code tab, typing “/review-pr-comments”

Claude reads Macroscope's comments in GitHub, reviews the actual code each one points at, and gives me a verdict on which it believes are valid and which are invalid, with the evidence laid out.

a screenshot of Claude Code saying that 5 findings are valid, and 3 are invalid, after running the first skill

From there, you can quiz Claude for further details. Or if you disagree with a verdict, just say so, and the next skill respects your override.

When I’m happy, I run the second skill /resolve-pr-comments.

Claude rejects the invalid GitHub comments with a brief explanation and resolves those threads.

a screenshot of Claude Code using the GitHub CLI to reply to and resolve the invalid comments

Then it fixes the valid issues in your codebase one at a time.

a screenshot of Claude Code fixing each of the valid issues in the codebase

Each fix gets a `Fixed.` reply in GitHub and a resolved thread.

a screenshot of Claude’s comment in GitHub explaining a fix, in response to one of the bug comments, and Macroscope thanking Claude for the fix.
Claude + Macroscope hitting it off ❤

When everything’s done, Claude re-reads all the changes together to make sure the fixes don’t conflict, runs the tests one final time, and reports back.

a screenshot of Claude Code confirming all of the changes have been successful

I commit, push, and Macroscope reviews the new commit. I repeat the loop until the checks pass.

a screenshot of confirmation in GitHub that there are no outstanding issues


The outcome

✅ Fewer bugs make it to production.

✅ Less time wasted with users.

✅ Less time asking Claude to track down the cause of issues days after you wrote the code.

And all in six keystrokes.


Why two code review skills are better than one

The first review skill doesn’t touch your PR.

That’s because an agent that judges and acts in the same step is more likely to begin implementing without due consideration.

The split forces a moment of judgment between analysis and action.


Setting up Claude Code with Macroscope

You’ll need three things before you start: Claude Code, a GitHub account, and the GitHub CLI.

(If you don’t have the CLI, just ask Claude to install it.)
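
If you'd rather set it up yourself, here's a minimal sketch for macOS/Homebrew (other platforms: see cli.github.com):

```bash
# Install and authenticate the GitHub CLI (macOS/Homebrew shown).
brew install gh
gh auth login     # follow the prompts to connect your GitHub account
gh auth status    # confirm authentication worked
```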

Step 1: Install Macroscope on your repo

Head to macroscope.com and click Sign up.

a screenshot of Macroscope’s landing page hero

Log in with GitHub.

a screenshot of the login with GitHub button on Macroscope’s site

Authorize the GitHub OAuth flow.

a screenshot of the “Sign in with GitHub” screen

You’ll land on a screen telling you Macroscope isn’t installed on any of your orgs yet. Click Install Macroscope.

a screenshot of the “Install Macroscope” button

Pick the repo (or repos) you want Macroscope to review. I went with “Only select repositories” and chose the one I’m shipping.

Then just go through the Stripe checkout to claim your free trial and you’re done.

Macroscope sets up your workspace, drops $100 of free credit into your account, and you land on the dashboard.

From this point on, every PR you push to that repo gets reviewed automatically.

Step 2: Push a PR and wait for the review

Code up a storm in Claude Code as normal and submit a new PR when you're ready.
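
If you open PRs from the terminal, the flow is the usual one (the branch name and commit message below are placeholders):

```bash
# Push the agent's work on a branch and open a PR with the GitHub CLI.
git checkout -b my-feature                   # placeholder branch name
git add -A && git commit -m "Add my feature"
git push -u origin my-feature
gh pr create --fill                          # pre-fills the PR title/body from your commits
```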

Macroscope will review it and either pass it or leave comments. When the review is done, you’ll get a notification.

That’s your cue to bring in the skills.

Step 3: Install the two skills in Claude Code

Copy the first skill below, paste it into Claude Code, and ask Claude to install it as a skill.

Do the same for the second.

Then start a new Claude Code session so the new skills load.
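
If you'd rather install them by hand, each skill is just a Markdown file in a skills folder. A minimal sketch, assuming the default personal-skills location (your Claude Code version may use a different path):

```bash
# Assumed layout: one folder per skill, each containing a SKILL.md with the content below.
mkdir -p ~/.claude/skills/review-pr-comments ~/.claude/skills/resolve-pr-comments
# Paste Skill 1 into ~/.claude/skills/review-pr-comments/SKILL.md
# Paste Skill 2 into ~/.claude/skills/resolve-pr-comments/SKILL.md
```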

Skill 1: review-pr-comments

---
name: review-pr-comments
description: Investigates unresolved PR review comments on the current branch's PR and presents findings for user review. Read-only — never modifies the PR, never resolves threads, never changes code.
---

# review-pr-comments

Investigate every unresolved review comment on the current branch's PR, classify each as **Valid** or **Invalid** with evidence, and present findings for the user to review. This skill is strictly read-only: it never posts, resolves, edits, or pushes anything.

## Guiding principle

**Accuracy matters more than speed.** A wrong dismissal ships a bug; a wrong acceptance wastes developer time. When in doubt, mark **Valid**.

- **Valid** is the default verdict.
- **Invalid** requires concrete, citable evidence — specific file paths and line numbers proving the comment wrong.
- If you cannot conclusively prove a comment wrong, it stays **Valid**.

## Procedure

Follow these steps in order on every invocation. Output nothing to the user until the final summary in step 7.

### Step 1 — Detect the project stack (runtime, every invocation)

Before touching the PR, build a mental model of this repo. Do not assume anything from prior sessions. Findings live only in working context — never write them into this skill file.

1. **Identify primary language(s)** by listing top-level files and checking for manifests:
   - `package.json` (JavaScript/TypeScript)
   - `Cargo.toml` (Rust)
   - `pyproject.toml`, `setup.py`, `requirements.txt` (Python)
   - `go.mod` (Go)
   - `Package.swift`, `*.xcodeproj`, `*.xcworkspace` (Swift)
   - `Gemfile` (Ruby)
   - `pom.xml`, `build.gradle`, `build.gradle.kts` (Java/Kotlin)
   - `composer.json` (PHP)
   - `mix.exs` (Elixir)
   - `pubspec.yaml` (Dart/Flutter)
   - `*.csproj`, `*.sln` (C#/.NET)
   - Any other manifest you recognize — don't stop at this list.
2. **Identify the test framework and test-file naming conventions**:
   - Read test-related scripts in the manifest (e.g. `scripts`, `[tool.pytest]`, `[dev-dependencies]`).
   - Look for test directories (`test/`, `tests/`, `__tests__/`, `spec/`) and note the file-name pattern (`*_test.go`, `*.test.ts`, `test_*.py`, `*Spec.swift`, etc.).
   - Open one or two existing test files to confirm the actual framework in use.
3. **Locate dependency manifests and their local caches**. Examples:
   - `node_modules/`, `vendor/`, `.venv/`, `target/`, `Pods/`, `.gradle/`, `~/.cargo/`.
   Knowing where third-party code lives lets you read library source when a comment depends on external behavior.
4. **Skim a handful of source files** (3–6) from the main source tree to pick up conventions: naming, error handling, import style, module layout. You will reference these conventions when investigating comments.
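
For example, a minimal inventory sketch (illustrative only; adapt to what the repo actually contains):

```bash
ls -1                                         # top-level manifests: package.json, Cargo.toml, go.mod, ...
ls -d test tests __tests__ spec 2>/dev/null   # common test directory names
```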

Keep all of the above in working memory for this run only.

### Step 2 — Fetch PR context

1. Confirm the current branch: `git rev-parse --abbrev-ref HEAD`.
2. Find the PR for the current branch: `gh pr view --json number,url,headRefName,baseRefName,title,author,state`.
   - If no PR exists for this branch, stop and tell the user.
3. List commits on the PR so you can later check whether a fix has already landed: `gh pr view --json commits`.
4. Fetch **all review threads** with resolution state via the GraphQL API. Example:


   ```bash
   gh api graphql -f query='
     query($owner:String!, $repo:String!, $num:Int!) {
       repository(owner:$owner, name:$repo) {
         pullRequest(number:$num) {
           reviewThreads(first:100) {
             nodes {
               id
               isResolved
               isOutdated
               path
               line
               originalLine
               comments(first:50) {
                 nodes {
                   id
                   author { login }
                   body
                   createdAt
                   diffHunk
                   commit { oid }
                   originalCommit { oid }
                 }
               }
             }
           }
         }
       }
     }' -F owner=OWNER -F repo=REPO -F num=PR_NUMBER
   ```


5. **Filter to threads where `isResolved` is `false`.** Ignore resolved threads entirely.
6. If pagination is needed (more than 100 threads or 50 comments in a thread), page through it — do not silently truncate.
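
For example, pagination can be handled by adding `pageInfo` to the `reviewThreads` selection and re-running with the returned cursor (sketch; trim the node fields to what you need):

```bash
# Omit the cursor argument on the first page; pass pageInfo.endCursor on later pages.
gh api graphql -f query='
  query($owner:String!, $repo:String!, $num:Int!, $cursor:String) {
    repository(owner:$owner, name:$repo) {
      pullRequest(number:$num) {
        reviewThreads(first:100, after:$cursor) {
          pageInfo { hasNextPage endCursor }
          nodes { id isResolved path line }
        }
      }
    }
  }' -F owner=OWNER -F repo=REPO -F num=PR_NUMBER -f cursor=CURSOR_FROM_PREVIOUS_PAGE
```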

### Step 3 — Investigate each unresolved comment

For each unresolved thread, do not rely on the `diffHunk` alone. Read real code.

1. **Read the full file** the comment points at, at the current tip of the PR branch — not just the hunk. The reviewer's claim may depend on code elsewhere in the file.
2. **Trace code paths the claim depends on.** If the comment is about a function, follow the callers, the callees, and any types involved until you can confirm or rule out the claim. Cross-file reads are expected.
3. **Check whether a later commit has already addressed it.** Compare the commit the comment was written against (`originalCommit.oid`) with the PR's latest commit (see the sketch after this list). If the file at the latest commit no longer has the issue, note that as evidence.
4. **Use the stack knowledge from Step 1** when the comment references language behavior, framework idioms, or test patterns. Verify claims against the actual framework in use, not a generic assumption.
5. If the comment references external library behavior, read the library source from the local cache identified in Step 1 rather than guessing.
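
A minimal way to make that comparison (sketch; substitute the real commit SHA and file path):

```bash
git show ORIGINAL_COMMIT_SHA:path/to/file   # the file as it was when the comment was written
git show HEAD:path/to/file                  # the file at the current tip of the PR branch
```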

### Step 4 — Consider both directions before deciding

For every comment, explicitly hold both hypotheses before concluding:

- **What evidence supports the comment being correct?** List it.
- **What evidence supports the comment being incorrect?** List it.

Do not stop at the first piece of evidence you find in either direction. A single matching or mismatching line is not enough — look for confirming and disconfirming evidence on both sides.

Only after weighing both sides do you choose a verdict.

### Step 5 — Classify

- **Valid** (default): the comment appears correct, or you cannot prove it wrong with concrete evidence.
- **Invalid**: you have concrete, citable evidence — specific file paths and line numbers — that the comment is wrong, already fixed on this PR, or based on a misreading of the code.

If the evidence is ambiguous, the verdict is **Valid**. Do not downgrade a valid concern to invalid because it seems minor.

### Step 6 — Re-read before finalizing

For each verdict, do one final pass:

1. Re-read the original comment verbatim.
2. Re-read the code you cited.
3. Confirm you investigated **the exact claim the reviewer made**, not a neighboring or paraphrased version. If you find you investigated the wrong thing, redo Step 3 for that comment.

### Step 7 — Present findings

Output only the final summary. Do not narrate the investigation.

Use a single numbering sequence across both sections: list all **Valid** items first, then continue the same count into **Invalid**. This way an override like "actually #3 is valid" unambiguously identifies one item.

Format:



```markdown
## Valid (investigate further / likely to act on)

1. **<file>:<line>** — @<reviewer>
   > <reviewer's comment, quoted or excerpted>
   Evidence: <what the code shows, with file:line citations>

2. ...

## Invalid (recommend dismissing)

3. **<file>:<line>** — @<reviewer>
   > <reviewer's comment, quoted or excerpted>
   Evidence: <concrete proof the comment is wrong, with file:line citations>

4. ...
```



End with exactly one question: **"Would you like to proceed, or adjust any verdicts?"**

Do not suggest fixes, do not draft replies, do not take any action beyond presenting the list.

## Hard constraints

- **Read-only.** Never run `gh pr review`, `gh pr comment`, `gh api ... -X POST/PATCH/DELETE`, `git commit`, `git push`, or any mutating command.
- **Never resolve threads.** Do not call `resolveReviewThread` or any equivalent.
- **Never modify files** in the working tree.
- **Never write stack-detection findings back into this skill file.** They are per-run context only.
- **Do not output intermediate progress.** The only thing the user sees is the final summary in Step 7.

Skill 2: resolve-pr-comments

---
name: resolve-pr-comments
description: Acts on a prior triage produced by review-pr-comments — rejects items marked invalid (replies + resolves threads) and fixes items marked valid (edits code, replies, resolves). Refuses to run without a prior triage in the current conversation. Does not commit or push.
---

# resolve-pr-comments

Execute the triage produced earlier in this conversation by `review-pr-comments`: reject the comments marked invalid, fix the comments marked valid, and leave the branch ready for the user to commit and push.

## Hard preconditions

- **A triage from `review-pr-comments` must already exist in this conversation.** If it does not, stop immediately and respond with:
  > No triage found. Run `review-pr-comments` first, review the results, then invoke this skill.
  Do not attempt to re-investigate the comments from scratch.
- **Never commit, never push, never force-push.** The user handles version control after reviewing the result.

## Procedure

Follow these steps in order on every invocation.

### Step 1 — Detect the project stack (runtime, every invocation)

Build a mental model of the repo. Do not carry assumptions from prior sessions. These findings live only in working context — never write them back into this skill file.

1. **Identify primary language(s)** from top-level manifest files. Check for (non-exhaustive): `package.json`, `Cargo.toml`, `pyproject.toml` / `setup.py` / `requirements.txt`, `go.mod`, `Package.swift` / `*.xcodeproj`, `Gemfile`, `pom.xml` / `build.gradle` / `build.gradle.kts`, `composer.json`, `mix.exs`, `pubspec.yaml`, `*.csproj` / `*.sln`. Treat any unfamiliar manifest you encounter as a signal too.
2. **Identify the test runner command.** Look in order at:
   - Scripts in the primary manifest (e.g. `scripts.test`, `[tool.*]` sections, `[dev-dependencies]`)
   - `Makefile`, `justfile`, `Taskfile`, `Rakefile`, or similar task definitions
   - `README` / `CONTRIBUTING` for documented commands
   - Language defaults (`npm test`, `pytest`, `go test ./...`, `cargo test`, `swift test`, `bundle exec rspec`, etc.) only if the above give nothing
3. **Identify the test framework and test-file naming convention** by opening one or two existing test files. Note path patterns (e.g. sibling to source vs. a dedicated tree) and file-name patterns.
4. **Skim 3–6 source files** from the main source tree to absorb conventions: naming, error handling, import/module style, logging, typing, formatting.
5. **Note existing utility/helper modules.** Record the names and purposes of shared helpers (validation, HTTP, logging, date handling, etc.) so fixes can reuse them rather than reinventing.

Keep all of the above in working memory for this run only.

### Step 2 — Apply user overrides to the triage

Read everything the user said in this conversation after the triage was presented.

- If they reclassified any numbered item (e.g. "actually #3 is valid", "mark #5 invalid"), update the verdict for that item.
- If they said to skip or defer any item, flag it as **skip**.
- If they gave rationale for an override, keep that rationale — you will use it when writing the reply on the thread.
- **User overrides always win** over the triage's original verdicts.

If the user's instructions are ambiguous (e.g. "actually the last one"), stop and ask them to clarify before taking any action.

### Step 3 — Re-fetch review threads and match to triage items

The triage numbering may no longer correspond to anything on GitHub if threads shifted. Re-establish IDs.

1. Get the PR number for the current branch: `gh pr view --json number,url,headRefName`.
2. Fetch all review threads via GraphQL, including thread IDs and comment IDs:


   ```bash
   gh api graphql -f query='
     query($owner:String!, $repo:String!, $num:Int!) {
       repository(owner:$owner, name:$repo) {
         pullRequest(number:$num) {
           reviewThreads(first:100) {
             nodes {
               id
               isResolved
               path
               line
               originalLine
               comments(first:50) {
                 nodes { id author { login } body }
               }
             }
           }
         }
       }
     }' -F owner=OWNER -F repo=REPO -F num=PR_NUMBER
   ```


3. Page through results if needed. Do not silently truncate.
4. For each triage item, match to a thread by **file path + line** (falling back to `originalLine`) and by the first reviewer comment body (see the jq sketch after this list). If a triage item cannot be matched to any unresolved thread, stop and ask the user how to proceed.
5. Ignore threads already resolved.
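
One way to pull out the fields used for matching (sketch; assumes the GraphQL response above was saved to `threads.json`):

```bash
# List unresolved threads with the path, line, and first comment needed to match triage items.
jq '.data.repository.pullRequest.reviewThreads.nodes[]
    | select(.isResolved | not)
    | {id, path, line, originalLine,
       firstCommentId: .comments.nodes[0].id,
       firstCommentBody: .comments.nodes[0].body}' threads.json
```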

### Step 4 — Phase 1: reject invalid items

For each item currently classified **invalid** (after user overrides), do two things — and run them in parallel where the tool layer allows, since they are independent across threads:

1. **Post a reply** on the thread explaining the reason. Keep it brief and cite the evidence from the triage. Example shape:
   > <brief reason>. See `<file>:<line>` — <evidence>.
   Use `gh api` to POST a review comment reply:


   ```bash
   gh api -X POST \
     repos/OWNER/REPO/pulls/PR_NUMBER/comments/COMMENT_ID/replies \
     -f body='REPLY_TEXT'
   ```


   (Reply to the **first comment id** on the thread.)
2. **Resolve the thread** via the `resolveReviewThread` GraphQL mutation:


   ```bash
   gh api graphql -f query='
     mutation($id:ID!) {
       resolveReviewThread(input:{threadId:$id}) { thread { id isResolved } }
     }' -f id=THREAD_ID
   ```



Do **not** touch code in this phase.

### Step 5 — Phase 2: fix valid items, one at a time

Handle each **valid** item sequentially. Do not batch. For each item, complete every sub-step before moving to the next item.

1. **Re-read the file and surrounding code.** Do not rely only on the diff hunk or the triage evidence. Understand the patterns around the change site.
2. **Make the fix**, matching the conventions you detected in Step 1:
   - Same naming style, import style, error handling idioms, and module layout
   - Reuse helpers identified in Step 1 rather than duplicating logic
   - Don't introduce a new dependency unless the comment specifically asks for one
3. **Verify the fix locally before moving on**:
   - Re-read the changed region and its callers. Confirm the fix addresses the exact claim from the original comment, not an adjacent one.
   - If tests exist that exercise the changed code, run them with the test runner from Step 1. If a test fails, investigate and adjust — don't move on with a red test.
4. **Reply and resolve the thread**:
   - Post `Fixed.` as a reply to the thread (same reply endpoint as Step 4).
   - Resolve the thread via `resolveReviewThread`.
5. **Tell the user** what changed — one line naming the file(s) touched and a short description.

If a valid item turns out to be too complex, ambiguous, or to require context you don't have, **do not guess**. Treat it as a skip (see Step 7).

### Step 6 — Final review pass

After every planned fix is done, do a holistic review before reporting.

1. Re-read every file you changed, together. Confirm:
   - Each fix follows the codebase's existing patterns from Step 1.
   - Each fix uses existing helpers instead of duplicating logic.
   - Fixes don't conflict with, shadow, or duplicate each other.
   - No unrelated behavior was changed, no new TODOs left behind, no debug output left in.
2. **Run the full test suite once**, using the runner from Step 1. Per-fix runs from Step 5 don't reflect interactions between fixes.
3. If anything fails or looks wrong, fix it before reporting. If a failure is outside the scope of this PR's review comments, stop and tell the user rather than silently papering over it.

### Step 7 — Handle skipped items

For any item the user asked to skip, or that you determined you can't responsibly handle:

1. Reply on the thread: `Acknowledged — not addressing in this PR.`
2. Resolve the thread.
3. Include it in the skipped count in the final report.

### Step 8 — Report

Output exactly one line:



```plaintext
Done — N invalid rejected, M valid fixed and resolved.
```



If any items were skipped, append:



```plaintext
 K skipped and acknowledged.
```



Do not commit. Do not push. Do not open a new PR. The user takes it from here.

## Hard constraints

- **Precondition-gated.** No triage in the conversation ⇒ refuse and direct the user to `review-pr-comments`.
- **User overrides always win** over the original triage verdicts.
- **Never commit, stage, push, force-push, or amend.**
- **Never write stack-detection findings back into this skill file.** They are per-run context only.
- **Sequential fixes only.** Valid items are handled one at a time, with local verification between them. Only the reject operations in Phase 1 may run in parallel.
- **No silent truncation** when paging through threads or comments.
- **No guessing.** If a fix is unclear or an override is ambiguous, ask or skip — don't fabricate.

That’s it. You’re set up.

Step 4: Run the loop

Once you’ve got a notification saying Macroscope’s review is complete:

  1. Run /review-pr-comments.

  2. Push back if you disagree.

  3. Run /resolve-pr-comments.

  4. Commit, push, repeat.
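
In shell terms, the commit-and-push step is just the usual (the message below is a placeholder):

```bash
# Commit the fixes the skill made, push, and let Macroscope review the new commit.
git add -A
git commit -m "Address review comments"
git push
```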

That’s the whole workflow. Write code. Push. Triage. Respond. Repeat until clean. It's a fast and low-friction defensive layer to catch bugs before they make it into production.

Enjoy!


References

Rahman, M. & Shihab, E. (2026). Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source. EASE 2026. https://arxiv.org/abs/2601.16809

Watanabe, K., Shirai, T., Kashiwa, Y., & Iida, H. (2026). What to Cut? Predicting Unnecessary Methods in Agentic Code Generation. MSR 2026. https://arxiv.org/abs/2602.17091

Financial Times. (2026). Amazon holds ‘deep dive’ into impact of AI coding tools after outages. https://www.ft.com/content/7cab4ec7-4712-4137-b602-119a44f771de

The author has a professional affiliation with Macroscope. AI tooling moves fast, and some of the implementation details above may have changed since publication.

➡️ Get Macroscope with $100 of free credit.
