Part 3 was supposed to be me walking you through building a real project with this workflow. The "build with me" post where I show you how it all comes together on a production app.
That's not what happened.
Opus 4.6 dropped last week. I switched to it immediately, and my output tripled – speed and quality, genuinely night and day. But at 3x the cost. That's when I realised: if I'm going to run the most expensive model, I need to make sure I'm not wasting a single token getting there. So, I did what any sensible person would do. I paused everything, tore my entire workflow apart, and rebuilt it from scratch.
This proved too valuable not to share. So here we are.
I Followed My Own Advice (And That Was the Problem)
Remember Part 2? I told you: "Every time Claude does something stupid, I add a rule. That's how the file evolves."
Solid advice. I stand by it. But here's what I didn't anticipate: what happens when you follow that advice religiously – not whilst building a fun little side project, but whilst building something headed for production.
My GLOBAL_STANDARDS file ballooned past 500 lines and kept growing. Not just from me manually adding rules – I'd also been asking Claude to iterate on it, to learn from what it had done right or wrong, and to update the standards accordingly. Sounds smart, right? Claude updating its own rulebook based on experience?
In theory, brilliant. In practice, I'd created a document that was teaching Claude things it already knows, repeating itself in slightly different ways, and growing with absolutely no ceiling in sight. The commands got verbose. Design docs loaded into every single task. The workflow was still producing exactly what I wanted.
And then I actually looked at what it was costing me.
The Wake-Up Call
I'd been hitting my usage limit for the day quite quickly – then for the week. So, I switched to token billing as an experiment, just to see what was actually happening under the hood.
31,500 tokens consumed before Claude wrote a single line of code.
That's not the work. That's the startup cost. All the context loading, standards reading, file searching that happens before any actual development begins. Those 31,500 tokens? About a quid. Just to get started.
Then I ran /address-review to handle some PR feedback – refactoring, package installation, and a few iterations. Three quid. For reviewing comments. Using Opus 4.6 felt like popping down to the shops in a Ferrari – technically impressive, financially idiotic.
With Opus 4.6 making everything simultaneously better and more expensive, I knew I had to optimise what I already had. Because if I'm going to run the most expensive model available, I need to make absolutely certain I'm not wasting a single token getting there.
What Was Actually Happening
Every time I ran a /ship command, here's what Claude was doing:
- CLAUDE.md gets auto-loaded by Claude Code (319 lines) – can't avoid this, it's how the system works
- My /plan command explicitly tells Claude to read CLAUDE.md again – completely redundant, it's already loaded
- Then it reads GLOBAL_STANDARDS.md (over 500 lines, still growing) – every single time
- Then the design system docs (557 lines) – even for backend work
- Then design patterns (540 lines) – even for database tasks
Over 15,000 tokens of documentation loaded before Claude even looks at the GitHub issue. Then add the issue itself, codebase search, related files, past PR analysis – and you're sitting at 31,500 tokens before work begins.
And here's the bit that properly stung: most of those 500+ lines in GLOBAL_STANDARDS were teaching Claude things it already knows.
SOLID principles? TypeScript strict mode? React hooks patterns? All in the training data. I just didn't trust it at first. I was still treating it like ChatGPT in '23 – spelling out every little thing because I didn't believe it knew what I meant. But Claude doesn't need you to explain what composition over inheritance is. It needs you to tell it that you prefer it. Ten words instead of two hundred.
Asking the Engine What It's Doing Wrong
First, I read a bunch of articles. Watched a few videos. Dug into Anthropic's documentation and community findings. Built my own agenda.
But then I thought – what better way to optimise than to ask the nearly-sentient engine itself what the fuck it's doing wrong?
I fed Claude my entire setup: the commands, the standards file, the project structure. Gave it everything I'd gathered from Anthropic's own best practices, community benchmarks, the lot. Asked it to audit my workflow and build me an optimisation plan.
What came back was humbling. Not because it was genius – but because the problems were obvious once someone pointed them out.
The chained commands were re-referencing files that were already loaded. The standards file was lecturing Claude on things in its training data. Design docs were loading for backend work because I'd never added a condition to check whether the task actually involved UI. I'd built a system with circular references, redundant reads, and zero awareness of what was actually needed versus what was always dumped in.
I'd been so focused on guardrails – so determined to stop this thing going rogue (mainly because of past experience losing my mind at AI tools doing stupid shit) – that I'd pushed it into checking the same things three different ways. Some of it was oversight on my part. Some of it was stuff I only understood later, once I could see the token costs laid bare.
Digging Deeper Into What Claude Code Can Actually Do
Here's the thing – I'd been using Claude Code's command system since I started, and it worked. But once I started optimising, I dug deeper into what was actually available and realised I'd been ignoring features that solve exactly the problems I was brute-forcing.
Skills, hooks, sub-agents, model routing – these aren't new; they've been there since late last year. But they seemed either trivial or just not relevant when I first glanced at them. Who needs hooks when you can just tell Claude to run Prettier? Who needs sub-agents for simple tasks?
You do. The moment you start hitting your context window regularly and watching your credits evaporate, suddenly these features become essential.
Skills replaced commands. Same markdown files, but with frontmatter that controls behaviour – which model to use, whether to fork into a sub-agent, which tools are allowed. My old commands were flat scripts. Skills are configurable.
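For a sense of what that looks like, here's a minimal sketch of a skill file. The name, body, and comments are hypothetical, and the frontmatter keys that control model choice, sub-agent forking, and tool access have shifted between Claude Code releases – so treat this as the shape of the thing, not a copy-paste schema:

```markdown
---
name: plan-feature
description: Turn a GitHub issue into an approved implementation plan before any code is written
# Model routing, sub-agent forking, and allowed tools are also declared here –
# check the skills docs for your Claude Code version for the exact keys.
---

Read the issue, search the codebase for related modules, and produce a
step-by-step plan. Do not start implementing until the plan is approved.
```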
Hooks handle the shit I was wasting tokens on. PostToolUse hooks can auto-format after every edit. Claude never sees a Prettier error now – it gets fixed silently before Claude even knows it existed. No more loops where Opus burns thirty pence trying to fix a semicolon. That alone was a game-changer.
Sub-agents let you route the right brain to the right task. Opus for planning and architecture. Sonnet for implementation – roughly 5x cheaper and more than capable of the typing work. Haiku for fast codebase search. Instead of running everything through the most expensive model, you match the brain to the job.
I built six specialised review agents – dependencies, performance, security, bundle size, git history, and final optimisation pass. They all run on Sonnet or Haiku. They execute in parallel, aggregate their findings, and catch issues before PR creation. Faster, cheaper, more thorough than having Opus handle everything. Cost? Negligible. Value? Caught 50% more bugs pre-PR.
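As a rough illustration, each review agent is just a markdown file with frontmatter pinning it to a cheaper model. The name, prompt, and exact frontmatter keys below are illustrative rather than my literal files:

```markdown
---
name: security-reviewer
description: Reviews the staged diff for security issues before PR creation
tools: Read, Grep, Glob
model: sonnet
---

You are a security reviewer. Inspect the diff for injection risks, committed
secrets, missing input validation, and unsafe dependency usage. Report findings
as a prioritised list with file references. Do not modify any files.
```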
Model routing via opusplan – Opus handles the thinking; Sonnet handles the execution, automatically. No manual switching.
These aren't nice-to-haves. They're the features that made a lean architecture possible. Without hooks, I'd still be burning tokens on lint loops. Without skills, I'd still be loading everything upfront. Without model routing, every task would cost Opus prices regardless of complexity.
The Rebuild
I tore it all down and rebuilt it lean. One principle: load exactly what's needed, when it's needed.
CLAUDE.md went from 319 lines to 61. Deleted code examples, verbose patterns, and data model definitions. Replaced with pointers to other files using Claude Code's @import syntax. Working on data layer stuff? Only architecture.md loads. UI work? Only then does ui.md get pulled in.
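To make that concrete, the slimmed-down file reads more like a routing table than a manual. This is a sketch with made-up paths and wording, not my literal file – and whether you use @imports for the always-needed bits or plain path pointers for the conditional ones, the principle is the same:

```markdown
# Project conventions (slim)

- TypeScript strict mode, React, composition over inheritance
- @.claude/standards/code-quality.md
- @.claude/standards/git.md

## Load only when relevant
- Data layer or API work → read .claude/standards/architecture.md
- UI work → read .claude/standards/ui.md plus the design system docs
- Writing tests → read .claude/standards/testing.md
```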
GLOBAL_STANDARDS.md got decomposed and deleted. Split into five small, targeted files in .claude/standards/: architecture (28 lines), code quality (29 lines), UI (31 lines), testing (25 lines), git conventions (22 lines). Total: 135 lines across five files, loaded on demand.
I'd discussed with people that if my standards file got too big, I'd split it out. That's exactly what happened. But what I didn't expect was how much I could delete entirely. About 350 lines of things Claude inherently knows – SOLID explanations, DRY explanations, TypeScript strict mode examples. Gone. You're not teaching Claude TypeScript. You're declaring your preferences. Big difference.
The learnings system replaced verbose explanations. Sounds smart to let Claude update its own rulebook based on experience, and it is – but I was doing it wrong. Instead of verbose explanations about why something's an anti-pattern, I built a compact learnings file: 10-word bullets on what went wrong, not 200 words on why it's wrong. Lives in its own .claude/state/learnings.md, doesn't bloat the global standards, survives compaction.
The system's smart about it too – automated keyword matching with a 70% similarity threshold catches duplicates before they get added. If a new learning is substantially similar to an existing one, it prompts for a merge rather than creating noise. No manual deduplication faff. And it only captures learnings from meaningful PRs – 100+ lines changed, 5+ files touched, or manually tagged. Trivial PRs get skipped. Result? 70% fewer learning extractions, better signal-to-noise ratio.
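The gate itself is simple enough to sketch. This is an illustrative TypeScript version of the idea, not the actual implementation (which lives in the skill's instructions rather than in application code) – the keyword-overlap measure and helper names are my own stand-ins:

```typescript
// Sketch of the deduplication gate. Keyword overlap (Jaccard similarity) with
// a 0.7 threshold decides whether a new learning gets merged into an existing
// bullet or appended as a new one.

type Learning = { text: string };

const keywords = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/\W+/).filter((w) => w.length > 3));

const similarity = (a: string, b: string): number => {
  const ka = keywords(a);
  const kb = keywords(b);
  const overlap = [...ka].filter((w) => kb.has(w)).length;
  const union = new Set([...ka, ...kb]).size;
  return union === 0 ? 0 : overlap / union;
};

// Only extract learnings from PRs big enough to matter.
const isMeaningfulPr = (linesChanged: number, filesTouched: number, tagged: boolean): boolean =>
  tagged || linesChanged >= 100 || filesTouched >= 5;

function addLearning(existing: Learning[], candidate: string): "merged" | "added" {
  const duplicate = existing.find((l) => similarity(l.text, candidate) >= 0.7);
  if (duplicate) return "merged"; // the real workflow prompts for a merge here
  existing.push({ text: candidate });
  return "added";
}
```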
Commands became skills with just-in-time context. The old /plan command always read 1,100 lines of design docs. Even for backend work. Even for database migrations. New version: "IF this involves UI changes, read design docs. IF not, don't." That alone saves roughly 8,000 tokens per non-UI plan.
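Paraphrased, the context-loading section of the new planning skill boils down to something like this (wording and paths illustrative):

```markdown
## Context loading

1. Read the issue and classify the work: UI, API, data layer, or tooling.
2. IF the task touches UI → read .claude/standards/ui.md and the design docs.
   IF not → skip them entirely.
3. Always read .claude/state/learnings.md – it's small.
```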
But I took it further. Context thresholds are now dynamic based on what phase you're in and how much work remains. Planning needs more headroom (60% threshold) because you're exploring the codebase. Building can run tighter (70%) because it's more predictable. Review phase? You can push to 80% safely because it's mostly reading. And if you're 7 steps into a 10-step plan, the system adjusts upward because there's less unknown work ahead. Smart scaling instead of arbitrary limits.
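In sketch form, the threshold logic looks something like this – the base numbers come straight from the prose above, while the late-plan bonus and the 85% cap are illustrative values rather than my exact implementation:

```typescript
type Phase = "planning" | "building" | "review";

function contextThreshold(phase: Phase, stepsDone: number, totalSteps: number): number {
  // Base headroom per phase: planning explores, review mostly reads.
  const base = { planning: 0.6, building: 0.7, review: 0.8 }[phase];

  // Late in a plan there's less unknown work ahead, so allow a little more
  // context usage before forcing a compact (capped so it never runs too hot).
  const progress = totalSteps > 0 ? stepsDone / totalSteps : 0;
  const bonus = progress > 0.6 ? 0.05 : 0;

  return Math.min(base + bonus, 0.85);
}
```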
PostToolUse hooks for auto-formatting. Every file edit gets Prettier run against it silently. Claude never sees formatting errors. No more lint loops where it burns money trying to fix a missing semicolon.
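The hook itself is a few lines of JSON in .claude/settings.json. A minimal sketch, assuming a project-local Prettier – the matcher and command are illustrative, and the hook schema has changed between Claude Code releases, so check the current hooks docs rather than copying this verbatim:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path // empty' | while read -r f; do npx prettier --write \"$f\" > /dev/null 2>&1; done"
          }
        ]
      }
    ]
  }
}
```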
The Two-Strike Rule. If Claude fails to fix a build error after 2 attempts, it stops and asks me for guidance. No more infinite loops where it burns money trying to fix something it clearly doesn't understand. Token-looping is the new memory leak – catch it early or it'll bleed you dry.
Context monitoring with emergency handlers. Three-tier system: warning at 70% (controlled compact with state preservation), emergency save at 90% (everything critical gets backed up – uncommitted changes, merge conflicts, untracked files, stashes, the lot), SessionStart recovery (pick up exactly where you left off if interrupted within the last 2 hours). Never lose work to context overflow again. The emergency handler uses git porcelain to capture everything – even edge cases like active merge conflicts or untracked files that a simple git diff would miss.
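A stripped-down sketch of that emergency save, assuming a small Node script triggered by the context monitor (file names and structure are illustrative, not the real handler):

```typescript
import { execSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";

// git status --porcelain lists staged, unstaged, untracked AND unmerged
// (conflict) paths in one machine-readable report – which is why it catches
// edge cases a plain `git diff` would miss.
function emergencySave(dir = ".claude/state/emergency"): string {
  mkdirSync(dir, { recursive: true });
  const run = (cmd: string) => execSync(cmd, { encoding: "utf8" });

  const snapshot = [
    "## status (porcelain)", run("git status --porcelain=v1"),
    "## staged diff", run("git diff --cached"),
    "## unstaged diff", run("git diff"),
    "## stashes", run("git stash list"),
  ].join("\n");

  const path = `${dir}/save-${Date.now()}.md`;
  writeFileSync(path, snapshot);
  return path; // the SessionStart recovery step reads this back in
}
```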
Visual QA that doesn't bankrupt you. I integrated Claude's Chrome extension for visual regression testing but made it smart. It only runs when UI files change significantly – new components or 10+ lines modified in UI files. When it does run, it has severity levels: critical issues (3+ layout shifts, broken flows) block PR creation and trigger an automated fix agent. High-severity issues (console errors, missing elements) prompt for manual decisions. Low-severity stuff (minor spacing, colours) just gets logged. Reduces the cost fivefold compared to running it on every ship. Value? Catches visual regressions before they hit production.
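In sketch form, the gating and triage amount to something like this – the thresholds mirror the prose above, but the types and functions are illustrative, not the real Chrome-extension integration:

```typescript
type Severity = "critical" | "high" | "low";
type Action = "block-and-autofix" | "ask-human" | "log-only";

interface VisualFinding {
  severity: Severity;
  description: string;
}

// Gate: only run visual QA at all when UI files changed meaningfully.
const shouldRunVisualQa = (uiFilesChanged: number, uiLinesChanged: number, newComponents: number): boolean =>
  newComponents > 0 || (uiFilesChanged > 0 && uiLinesChanged >= 10);

// Triage: the worst finding decides what happens to the PR.
function triage(findings: VisualFinding[]): Action {
  if (findings.some((f) => f.severity === "critical")) return "block-and-autofix";
  if (findings.some((f) => f.severity === "high")) return "ask-human";
  return "log-only";
}
```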
And here's something I found in Anthropic's own research that validated the whole exercise: instruction-following quality decreases as instruction count rises. LLMs can reliably follow about 150–200 instructions. Beyond that, compliance drops. My 500+ line standards file? Claude was ignoring half of it. I was paying more for worse results.
The Results
Same workflow. Same guardrails. Same checkpoints. But now with six specialised agents catching more bugs, visual regression testing on UI changes, automated learning capture, and dynamic context management. For about 19p per ship – including all the new quality improvements – I'm catching 50% more issues before they hit PR review.
That's the trade-off that makes sense. Not "how cheap can I make this," but "what's the optimal balance of cost and quality." And now I can run Opus 4.6 – the best model I've used – without wincing at the bill.
What I Actually Learned
This whole exercise was pure evidence of something I keep banging on about: learning comes first. Scrutinise your own approach. Refactor, refactor, refactor. Especially when new capabilities are available that you haven't explored yet.
I'd overengineered my own system. Not because the thinking in Parts 1 and 2 was wrong – the principles are sound. You need guardrails. You need a plan-first approach. You need human-in-the-loop approval. All of that still holds.
But I'd applied those principles bluntly, when a more precise approach would have done. The standards file grew unchecked. The commands duplicated work. The context loading had no awareness of what was actually relevant. Just-in-time context matters – make sure you aren't overfeeding this thing, because it'll eat everything you put in front of it and charge you for the privilege.
Here's what to take away if you're running any AI coding workflow:
- Check your always-loaded context. Run whatever cost-tracking your tool offers and see what you're actually burning just to get started. CLAUDE.md plus any files loaded by every command – that's your tax. Mine was 1,377 lines. Yours is probably similar. I built a simple cost reporter that aggregates logs and shows me actual spend per ship (there's a sketch of the idea after this list). Visibility matters.
- Stop teaching Claude what it already knows. SOLID, DRY, YAGNI, TypeScript patterns, React hooks – it's all in the training data. Declare your preferences in 10 words, not 200. Delete the lectures. You're not onboarding a junior developer who's never seen JavaScript. You're configuring a tool that already knows the fundamentals.
- Never re-read auto-loaded files. CLAUDE.md gets loaded automatically. Every command that says "Read project CLAUDE.md" is double-loading it. Once is enough. Check your commands for redundant reads.
- Split monolithic standards into domain-specific files. Architecture, UI, testing, git – separate concerns, loaded on demand. Same principle as good code: single responsibility. My 500+ line monolith became 135 lines across five files. Easier to maintain, cheaper to load.
- Add just-in-time guards to everything. "IF this involves UI, THEN read design docs." Conditional context loading saves thousands of tokens per task.
- Use hooks for formatting and linting. PostToolUse hooks running Prettier after every edit eliminates an entire class of expensive loops. Never let an LLM waste tokens fixing a semicolon. Handle it at the tooling layer.
- Route the right model to the right task. Opus for planning and architecture. Sonnet for implementation. Haiku for codebase search. Match the tool to the job. My specialised review agents all run on Sonnet or Haiku – cheaper, faster, just as effective for focused tasks.
- Enforce the Two-Strike Rule. If Claude can't fix a build error in two attempts, it stops and asks you. Token-looping will bleed you dry. Catch it early, intervene manually, move on.
- Build a learnings system that doesn't bloat standards. Capture what went wrong in compact bullets (10 words, not 200), store separately from global standards, reference on-demand. Automate deduplication with keyword matching. Only extract learnings from meaningful PRs (100+ lines, 5+ files, or manually tagged). Knowledge retention without context bloat.
- Implement dynamic context monitoring with safety nets. Adjust thresholds based on phase complexity and remaining work. Planning needs more headroom than review. Almost done with implementation? You can push closer to the limit safely. Three-tier approach: early warning (60–70% depending on phase), emergency save (90%), SessionStart recovery. Never lose work to context overflow.
- Use specialised agents for review tasks. Six cheap Sonnet/Haiku agents running specific checks (dependencies, performance, security, bundle size, git history, optimisation) catch more bugs than one expensive Opus review. They run in parallel, degrade gracefully if some fail (0–1 failures? Continue. 4+ failures? Something is broken, abort), and aggregate reports with conflict resolution. Faster, cheaper, and more thorough.
- Make visual QA smart, not wasteful. Only run it when UI files change significantly. Use severity levels to decide what's blocking vs. what's informational. Critical regressions trigger automated fix agents. Minor spacing differences just get logged.
- Keep a rollback escape hatch. When you're rebuilding a working system, back up the old version. Keep all your original versions as variants during the transition. Test the new approach on small issues first and validate before committing fully. If something breaks, you can revert immediately.
- Treat your AI setup like code. It needs refactoring. It accumulates tech debt. New capabilities make old patterns obsolete. Review it with the same rigour you'd review a pull request. I optimised mine after seeing the costs. You should probably do the same.
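On the cost-reporter point above: the idea is nothing clever, just aggregation. A hypothetical TypeScript sketch, assuming you log one JSON line per model call with a ship ID, token counts, and a pre-computed cost – the log location and field names are made up for illustration:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical log format: one JSON object per line, written by the workflow.
interface UsageEntry {
  shipId: string;        // e.g. the issue or PR number being shipped
  model: string;         // "opus" | "sonnet" | "haiku"
  inputTokens: number;
  outputTokens: number;
  costGbp: number;       // derived from the provider's per-token pricing
}

function costPerShip(logPath: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of readFileSync(logPath, "utf8").split("\n")) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line) as UsageEntry;
    totals.set(entry.shipId, (totals.get(entry.shipId) ?? 0) + entry.costGbp);
  }
  return totals;
}

// Usage: print spend per ship, highest first.
for (const [ship, cost] of [...costPerShip(".claude/state/usage.jsonl")].sort((a, b) => b[1] - a[1])) {
  console.log(`${ship}: £${cost.toFixed(2)}`);
}
```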
The Uncomfortable Bit
If my setup was bloated – and I'd specifically built it to be efficient – yours probably is too.
Every AI workflow has a startup tax. Most people just pay it without realising. And the cost isn't only money. It's degraded performance. You're loading instructions that Claude can't follow, paying for context it ignores, and wondering why the output isn't consistent.
The developers winning with AI aren't the ones using it most. They're the ones thinking about how they use it. Lean context. Just-in-time loading. Model routing. Circuit breakers. Specialised agents. Dynamic thresholds. Visual regression testing. Automated learning capture. These aren't advanced optimisations. They're fundamentals.
If you're burning a quid per command just to get started and accepting that as "the cost of AI," someone else is doing the same work for a fraction of the cost, catching more bugs, and getting better results.
The question isn't whether you should optimise. It's whether you can afford not to.
I'm Mike Nash, an Independent Technical Consultant who's spent 10+ years building software, leading engineering teams, and apparently optimising AI workflows because Opus 4.6 made me look at the bill.
These days I help companies solve gnarly technical problems, modernise legacy systems, and build software that doesn't fall over – or burn money unnecessarily.
Want to see the full setup, chat about AI optimisation, or swap war stories? Drop a comment or connect on LinkedIn.
