Phil Rentier Digital

Posted on Jun 22 • Originally published at rentierdigital.xyz

Wasted AI Credits Are Every Builder's Dumb Tax. Here's What I Do Instead.

#ai #programming #claude #aiagents

Some mornings I open my Claude dashboard before anything else and see 15-20% of my weekly credits sitting there, a few hours before the reset. Credits I already paid for. Credits about to expire for nothing.

Most builders watch that counter hit zero. (You weren't going to do anything useful with them anyway, right?)

I use those hours differently. I drop an autonomous refactoring agent onto a real project, give it a verifiable finish condition, and go do something else. Live production code, not a tutorial project. Last week: a WooCommerce catalog backend at 118,000 lines of TypeScript with uptime monitored every 30 minutes, and a second project with 6 supplier integration files all running the same sync loop, about 500 lines of code copy-pasted across all 6. Setup: Opus 4.8, /effort ultracode, /goal. I configured both and left.

TLDR: an agent doesn't "improve" things. It optimizes for the finish condition you give it. A vague condition gets vague results. A bad prompt finds the shortest path to "done," which is sometimes the wrong path. 🤓 These sessions are where that difference showed up.

Opus 4.8, /effort ultracode, /goal

The trio needs some unpacking before the sessions make sense.

Opus 4.8 is the only Claude model compatible with xhigh reasoning, and it carries a 1M token context window. That second part matters when you're handing it 118,000 lines of TypeScript and telling it to find its own way through.

/effort ultracode is a session setting in Claude Code, released May 28 with v2.1.154. Setting it activates xhigh reasoning and enables Dynamic Workflows automatically. Claude decides on its own when a subtask is worth fanning out to parallel subagents: up to 16 running concurrently on a standard machine, up to 1,000 per run. Some subagents attack the problem from independent angles. Others try to refute the first pass's conclusions, iterating until results converge. Adversarial verification, basically (a raid team where half the group is there specifically to wipe the other half). Intermediate outputs live in script variables, not the main context window, so the primary context stays lean while the work fans out. I went into why the CLI-native architecture changes what's possible for autonomous agents in why CLIs outperform MCP for AI agents.

/goal is available from v2.1.139+. You define a verifiable finish condition in plain language. A separate Haiku instance reads the session transcript after each turn and replies with yes or no plus a 1-line reason. If no, that reason becomes Claude's next instruction. If yes, the goal clears and the session ends. Without /goal: roughly 40 manual "keep going" messages. With it: you define the end state and walk away.

Put the 3 together and you have a session that can run on its own for hours. What you give it to run toward is the part that actually matters.

Before the First Line Changes

The first thing the agent did in both sessions wasn't write code. It was count.

Catalog backend: it found the project's own test suite, ran typecheck, and locked the baseline. 303 tests passing, typecheck green. Those numbers went into the tracking file before anything else moved.

Supplier integration session: 404 tests green. 701 lint errors already present in the codebase, counted and recorded before the first change.

This is not optional ceremony. The mechanism that makes autonomous refactoring work is the same one that makes it safe: the agent optimizes for the finish condition you define, not for some abstract notion of better code. Without a known baseline established before the first commit, there is no condition to optimize for and no reference state to return to if something breaks mid-change. The baseline defines what green means. It's what every verification in the session points back to. Skip it and the agent makes changes it can't verify, continues because nothing says not to, and by the time you check the diff it's hours deep with no clean path back. It takes minutes to establish before touching anything. Skipping it can cost the entire session.

You don't pull the boss without setting your spawn point. The baseline is the spawn point.

First notable behavior in the catalog backend: the agent spawned parallel subagents to map the codebase before writing a single line. Each finding went back through grep for cross-checking. It wasn't browsing the code. It was building a case.

Two Moments About Verification

Phantom error, catalog backend

Mid-extraction of a component, the console threw an error: CategoryEditor defined twice. The kind of thing that makes you want to roll back immediately.

The agent ran typecheck. Clean. Grepped for any duplicate class definition in the codebase. Nothing found. Loaded the actual page in a browser: it rendered correctly.

It was a dev server artifact, a ghost from the transition between old and new module paths during the extraction. A restart at the next checkpoint: zero errors.

The procedure: typecheck, grep, browser render. Each check independent. A console error is noise until 2 independent signals agree with it. "It compiles" is not a live test. Neither is "the terminal threw something once."

A small tangent: while I was reading through the supplier integration session logs, I got a Slack from a client asking about an invoice from 3 months ago. Spent 20 minutes digging through email threads to find the original. Meanwhile the agent had processed 40 files and was running 3 parallel subagents. Different kind of throughput.

Deploy refusal, supplier integration

At one point the session hit a guardrail that blocked production access. The agent didn't try to work around it. It verified what it could through the test suite, confirmed the core changes held, and kept working through the remaining scope without the production path. That's the difference between a tool and a liability, and it's not a small distinction.

What the Agent Decided Not to Touch

The catalog backend had 2 slugify functions that looked nearly identical. Any automated audit flags those as obvious merge candidates. Reading the actual code: 1 handles apostrophes differently, the other truncates at 80 characters. The difference between them isn't cosmetic. It's about which products get which URLs, and those URLs are already indexed across the catalog. Merging them would have rewritten live product links without warning.

The same logic applied to 5 variants of a metadata parser that all rewrote live product records. The audit found the duplication. Reading the code showed why each variant existed. The 10% that diverges between them is exactly the 10% that matters for live data. The agent flagged all 5, merged nothing, and moved on.

The Convex backend was a harder call. It runs in a single environment, meaning dev is prod. A live test requires deploying to production. The potential gain was cosmetic cleanup. The risk was unverifiable without a deploy. The agent deferred it, documented it in the tracking file, and said so. I'm honestly not sure the refactor would have been worth it even with a real test environment, but that's a judgment that belongs to a human with full context, not to an agent running without supervision.

The supplier integration session ended with 334 lint errors still in place. Those 334 were already there before the session started, spread across hundreds of files. Cleaning them in batch would have meant touching everything for zero architectural gain. The old migration scripts looked dead but turned out to be manual emergency hatches, the kind you only remember exist when prod is on fire and the automated failover is down (ask me how I know). The agent left a note and skipped them rather than deciding for me.

"Satisfied" doesn't mean everything was touched. It means the real offenders got addressed and verified, and everything else is either risky to change or low-value to fix. Inventing additional work is failure, not diligence.

The Prompt With Scars

The original prompt I sent for the catalog backend session had gaps.

No mention of production anywhere. "Until satisfied" with no verifiable endpoint. "Live-test" without a definition of what counts. An "autoreview" instruction that doesn't correspond to any real Claude Code command.

Each gap maps to a moment that could have gone differently. The slugify functions would have been merged. The Convex backend might have been touched. The phantom error would have triggered a rollback on a single unconfirmed signal.

These 2 sessions produced something worth keeping:

/goal Improve this project's architecture through safe, incremental refactoring.
First: discover the project's own check/test/build/run commands (package.json
scripts, Makefile, justfile, CI config, README) and establish a green baseline
with them — record it. Then audit the codebase for concrete debt (dead code,
duplication, oversized files/modules) with grep-level evidence BEFORE touching
anything. If the project isn't green at the start, stop and report instead of
refactoring blind.
Work in priority order of (value × safety × verifiability). For each step:
  1. one coherent change,
  2. live-test through the REAL path — exercise the change end-to-end for THIS
     project type (browser for a web app, run the CLI/binary, the test suite for
     a library, a real request for an API, a dry-run for a cron). "It compiles"
     is NOT a live-test,
  3. run /code-review on the diff and address its findings,
  4. commit (conventional message, stage by path).
Hard constraints:
  - Production code: prefer surgical, reversible changes over rewrites.
  - Never change a public contract (API paths, published URLs, output formats)
    or any externally observable behavior.
  - Never refactor anything you cannot verify green. If you can't safely
    live-test it, DEFER it and say so explicitly.
  - If reading the real code shows a duplication or variation exists for a
    reason, skip the "cleanup."
Done when the remaining debt is either risky or low-value — don't manufacture work.
Track progress in ~/tmp/refactor-<project>.md (derive <project> from the repo):
log every step, every decision, AND every item you deliberately rejected with why.
Don't push or deploy without asking.

Every clause is a scar from a session that didn't have it.

That's what the full prompt contracts framework is actually about: not structure for its own sake, but constraints that come from something going wrong.

Your Credits Are Expiring Tonight

What the 2 sessions actually produced:

Catalog backend: 4 clean commits, 2 bloated files reduced, tests from 303 to 313, net -57 lines. Zero broken in production.

Supplier integration: +288/-711 lines, tests from 404 to 407, a dead 504-line component purged, lint from 701 down to 334 errors.

If you want to go deeper on the method, Vibe Coding, For Real covers the step-by-step approach to shipping real projects with AI. Free on Kindle Unlimited.

Your reset is coming. The credits sitting in your dashboard are already paid for.

Improving the architecture of your projects is the best use of expiring credits.

Sources

Keep Claude working toward a goal, Claude Code official documentation
Claude Code Ultracode: What It Does, When to Use It, and How to Control Cost, Vibe Coding Academy
What Is "ultracode" Mode in Claude Code?, BitsMinds
Claude Code /goal: Set a Finish Line, Walk Away, FindSkill.ai

This post may contain affiliate links. If you click them, I might earn a small commission (costs you nothing, and helps me keep shipping quality articles every day for your reading pleasure).