Every day, a CI job adds new entries to test-titles.json in my Clusterflick repo. When it finds a cinema listing title the normaliser hasn't seen before, it records the input and the current output, then opens a pull request. Someone — usually me — then has to review whether those outputs are actually correct, fix anything that isn't, and merge.
It's not complicated work. Review the output and confirm the normaliser has done the correct job. If it hasn't, fix the output (the test now fails ❌) and then fix the normaliser (until the test passes again ✅). But it happens twice a day, and "not complicated" doesn't mean "not context switching".
So I decided to try automating it with Claude. Several hours and $5 later, I don't think it was worth it — and I think the reasons why are worth writing up 💸
The Task
The normaliser — normalize-title.js — converts raw cinema listing titles into a consistent string. I've written about it more in depth in my previous post, Cleaning Cinema Titles Before You Can Even Search.
When the CI job adds new test entries, it records whatever the normaliser currently produces. The reviewer's job is to decide whether that output is correct. There's a docs/reviewing-title-normalisation-test-cases.md file with detailed guidance on how to classify and fix different types of issues.
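For illustration, I think of each entry as a raw input paired with the output the test currently expects. This shape (and the title) is illustrative, not copied from the repo:

```json
{
  "UK Jewish Film Festival 2024: The Property": "The Property"
}
```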
The automation task, then: look at the new entries, use the guide to decide if they look correct, fix anything that's wrong, commit. Automating this with Claude seemed like a reasonable fit, especially as I'd been doing it semi-automated locally using a very basic prompt:
In this branch we've had some automated updates to `common/tests/test-titles.json`.
Confirm these changes are correct, or if they're not correct then fix them.
There's details on how this setup works in `docs/reviewing-title-normalisation-test-cases.md`
The Approach
I set up the Claude platform and added $5 of credit, then set up a GitHub Actions workflow triggered by a @claude review titles comment on any PR. The Claude Code GitHub Action handles the Claude integration: it checks out the PR branch, runs Claude Code against it, and can commit fixes back to the branch.
The workflow was straightforward in principle:
```yaml
on:
  issue_comment:
    types: [created]

jobs:
  claude-review:
    if: >-
      contains(github.event.comment.body, '@claude review titles') &&
      github.event.issue.pull_request != null
    runs-on: ubuntu-latest
```
Claude gets the diff, reads the documentation, checks each new entry, and either accepts it as correct or fixes it. Should be straightforward, and a manual trigger to kick it off so no surprises.
For this, I was also going to double down on Claude: Claude.ai to guide me through the setup, and the Claude API (via the GitHub Action) to do the review itself. But getting there took a few attempts.
The Problems
Something worth noting upfront: every failed run here cost money, especially if Claude spirals and chews through tokens. There's not much feedback (or too much, once I figured out how to stream it back), so it's much harder than working locally to see what Claude is thinking, and there's no reprompting to bring it back on track. On top of that, each run takes several minutes before you find out what went wrong, so the feedback loop is slow and expensive. Debugging a GitHub Actions workflow normally costs you time. Debugging this one cost time and cash.
Permissions. The first run failed with OIDC token errors. The Claude Code Action uses OIDC to generate a GitHub App token, which requires id-token: write in the workflow permissions. I'm not sure why Claude.ai didn't include that in the initial workflow.
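The fix was a permissions block on the job. Something like this (only id-token: write is the confirmed requirement from the error; the other entries depend on what you let Claude do):

```yaml
permissions:
  id-token: write      # required for the action's OIDC token exchange
  contents: write      # so fixes can be committed back to the branch
  pull-requests: write # so Claude can comment on the PR
```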
Branch checkout. The PR branch wasn't checked out by default — the runner was on main, so Claude found no diff (and chewed through tokens). I added an explicit checkout step with ref: refs/pull/${{ github.event.issue.number }}/head and fetch-depth: 0 so git diff had something to work with. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.
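The checkout step that made it work looked roughly like this:

```yaml
- name: Checkout PR branch
  uses: actions/checkout@v4
  with:
    ref: refs/pull/${{ github.event.issue.number }}/head
    fetch-depth: 0 # full history, so git diff against main has something to compare
```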
I probably should have caught this one myself. Checking out the PR branch is a well-known requirement when working with pull requests in Actions. I assumed a language model with broad knowledge of GitHub Actions would have it covered. The lesson there is the same as always with LLM output: trust but verify.
Missing --dangerously-skip-permissions. Without this flag, Claude keeps pausing to ask permission before running bash commands or editing files. In a non-interactive GitHub Actions environment that means it loops forever waiting for input it'll never get. Required flag for any autonomous use. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.
--allowedTools has a bug. I initially used --allowedTools Bash,Read,Edit,Write to restrict Claude to just the tools it needs. But there's a known issue where the init message still reports all available tools, which can confuse Claude into thinking it can use them. Swapped to --disallowedTools instead, which works correctly.
By this point I'd spent half my budget just getting the plumbing right, without the PR being updated at all. For context, this PR added 11 new titles, so it wasn't a huge amount of data to review.
The 30-Turn Failure
The first run that got past all the setup issues hit the 30-turn limit and stopped without committing anything. It cost $0.59 and took about five minutes.
What happened was actually Claude doing the right thing. It ran all 11 inputs through the normaliser, saw that every output matched what was recorded, and then — correctly — kept going. Because matching the normaliser isn't the same as being correct. The documentation I'd pointed it at says it plainly:
> The output field in test-titles.json is what the test expects, not necessarily what is correct.
So Claude spent the next 25+ turns reading through normalize-title.js, known-removable-phrases.js, and the existing test data, reasoning about whether each output was actually right. That's exactly the job. The problem was it cost $0.59 and ran out of turns before committing anything useful.
I asked Claude.ai to help diagnose this, and it suggested adding an explicit stopping condition to the prompt — something like "if it matches, accept it, don't investigate further." I took that suggestion at face value without thinking through what it actually meant. It would stop the spiralling. It would also stop the reasoning. Those are the same thing 🤦
I added the stopping condition, dropped --max-turns to 15, and declared the cost problem fixed. It wasn't — I'd just hidden it.
A "Successful" Run That Wasn't
With the prompt fixed and tools switched to --disallowedTools, the next run completed in 6 turns and 45 seconds. Cost: $0.19.
The full sequence: check the git log, get the diff, read the docs, run all 11 inputs through the normaliser in a single batch, conclude "All 11 new entries match the recorded output exactly. No fixes needed."
The problem is that this conclusion is always true, by construction. The CI job that creates these PRs records normalizer(input) as the output, so of course it matches when you run the normaliser again. Confirming that match only confirms that the CI job did its recording correctly, nothing more.
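A minimal sketch of why, with a stand-in normaliser (the real normalize-title.js is far more involved):

```javascript
// Stand-in normaliser: strips a trailing "(Subtitled)" marker. Illustrative only.
const normalize = (title) => title.replace(/\s*\(subtitled\)\s*$/i, "").trim();

// This is effectively what the CI job does when it sees a new title:
// record the input alongside whatever the normaliser currently produces.
const input = "Some Film (Subtitled)";
const entry = { input, output: normalize(input) };

// So re-running the normaliser can only ever agree with the recorded output,
// even when that output is wrong.
console.log(normalize(entry.input) === entry.output); // true, by construction
```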
What I actually needed was the second step: reasoning about whether those outputs are correct, spotting event prefixes that should be stripped, recognising real film titles that are getting mangled, and updating known-removable-phrases.js accordingly. That's the work. In solving the cost problem by narrowing the prompt, I'd removed the work entirely.
When I went back through the PR manually, I found several entries that still needed fixing.
The Cost Problem Underneath
What kept nagging at me: the task is reviewing 11 strings. There's a large corpus of existing examples, a detailed instructions document, and an LLM with a vast amount of general knowledge. It shouldn't require 30 turns and $0.59 to do this — and the fact that it did suggests something isn't well-suited here, not just misconfigured.
Part of it is visibility. With each run costing real money and taking several minutes, debugging is expensive. You can't easily see why Claude went down a particular path until you're staring at a full JSON trace of every tool call, and every misconfiguration costs money and ten minutes before you understand what went wrong. Several of those cycles add up quickly — the $5 I spent getting here went on debugging, not useful work.
And even when the infrastructure is right, the cost curve for this type of task is awkward. Simple cases (all outputs correct) should be cheap, but you can't know in advance whether the run will be simple. If Claude starts investigating an ambiguous case, you're back to 20+ turns and $0.50+. The unpredictability makes it hard to budget.
For a task this focused — a small number of strings, a clear pattern to match against, a fixed corpus to consult — perhaps a deterministic script would be more reliable (and much cheaper). The Claude Code GitHub Action is well-suited to open-ended tasks where you're not sure what tools you'll need... and maybe if you've got a healthy budget to back that too. A free, open-source, personal project trying to automate reviewing normaliser outputs against a known pattern isn't really any of that.
What I'd Do Differently
I wouldn't abandon the idea entirely. The local Claude Code workflow — where I can watch it reason, reprompt when it goes off track, and apply fixes interactively — has worked well and saved real time. The problem is trying to make that fully autonomous in a way that's cost-effective.
If I came back to this, I'd probably try a direct API call with a tighter prompt and explicit output format rather than the full Claude Code agentic setup. Something that gets the diff, asks Claude to classify each entry as "looks correct" or "has issue: [reason]", and only triggers the expensive autonomous work when there's actually something to fix.
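If I do revisit it, the shape I have in mind is roughly this: build one prompt with a forced output format and make a single Messages API call, only escalating to the agentic setup on a "has issue" verdict. The entries and model name here are illustrative:

```javascript
// Illustrative new entries as the CI job might record them.
const newEntries = [
  { input: "Parasite (Subtitled)", output: "Parasite" },
  { input: "Event Cinema: Matilda", output: "Event Cinema: Matilda" },
];

// One tight prompt with an explicit output format, instead of an open-ended
// agentic loop.
const prompt = [
  "For each entry, reply with a JSON array of objects shaped like:",
  '{ "input": "...", "verdict": "looks correct" | "has issue", "reason": "..." }',
  "Entries:",
  JSON.stringify(newEntries, null, 2),
].join("\n");

// Request body for a single Anthropic Messages API call; only entries that
// come back as "has issue" would trigger the expensive autonomous workflow.
const requestBody = {
  model: "claude-sonnet-4-20250514", // model name is illustrative
  max_tokens: 1024,
  messages: [{ role: "user", content: prompt }],
};

console.log(requestBody.messages[0].content.includes("Parasite")); // true
```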
But for now, some things are still faster and cheaper done by hand. 🍿