Originally published at chudi.dev
I finished the first version of Polyphemus in 6 weeks. Fully autonomous Polymarket trading bot. 4,000+ lines, real money on the line. Couldn't have shipped it without Claude Code. I also wasted $340 in one month using it wrong.
Four months later, the same codebase is 36,000+ lines. Two live instances running pair arbitrage on BTC/ETH/SOL/XRP. Strategy evolved from directional to market-neutral. Eight silent production bugs found and killed before they touched live capital. All caught by a shadow-first deployment gate I built after month two.
The five principles that got it to 4,000 lines are the same ones that got it to 36,000 without entropy. Here's the case study, updated April 2026.
Why Does Claude Get Dumber as Your Project Gets Bigger?
The bigger your codebase grows, the faster Claude's context window fills up. Each session becomes less useful because the window is full of stale context instead of relevant code. By week three of Polyphemus, I was spending 20 minutes re-explaining context Claude had already "seen." By month one, my token bill hit $340. The problem isn't Claude. It's the missing system around it.
Every Claude Code guide starts with CLAUDE.md tips. Not wrong, just backwards.
The first thing I had to understand was not "how do I write better prompts." It was: why does Claude get dumber as my project gets bigger?
The answer is the context window. Every session loads your project into Claude's working memory. As your codebase grows, that memory fills up faster. By week three, I was re-explaining decisions Claude had made with me two days earlier. Architecture choices, naming conventions, API patterns: all gone. I was paying for Claude to relearn the project I'd already taught it.
That's not a Claude problem. That's a system problem. And the symptoms compound fast: re-explaining context is demoralizing, it produces worse outputs because you're summarizing instead of being precise, and eventually you stop correcting Claude because it feels pointless. You start accepting mediocre outputs. You start doing the "hard parts" manually. You've turned an AI assistant into an expensive autocomplete. By month one, I was close to giving up on the whole approach.
Here are the five things that fixed it.
Principle 1: Context is a Resource. Manage it Like One.
Most developers treat Claude's context window like unlimited RAM: load everything, let it sort out what matters. That approach blew my token budget by 58% and produced hallucinations on files Claude "remembered" but didn't actually have in scope. The fix is three tiers: always-loaded project identity (under 500 tokens), per-session task file (under 1,000 tokens), and explicit on-demand file loading. Nothing else.
Tiered context loading in practice:
Tier 1. Always loaded (under 500 tokens). CLAUDE.md at project root. What the project is, file structure, conventions. Nothing else. The map, not the territory.
Tier 2. Per-session (under 1,000 tokens). A CURRENT_TASK.md file. What you're building today, what files are involved, what "done" looks like.
Tier 3. On demand. Specific files, loaded explicitly. "Read src/core/kelly.py before we start."
Result: average session token usage dropped from ~10,000 to ~4,200 tokens. 58% reduction from one workflow change.
The rule that makes Tier 3 work: never reference a file by name without loading it first. "Update the execution module" produces hallucination. "Read src/execution/orders.py, then update the retry logic" produces accurate output.
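For illustration, a Tier 2 CURRENT_TASK.md might look like this. The section headings are my own convention, not a Claude Code requirement, and the task is a made-up example:

```markdown
# CURRENT_TASK.md

## Today
Add retry logic to order submission in src/execution/orders.py.

## Files involved
- src/execution/orders.py
- src/core/config.py

## Done means
- Failed submissions retry with backoff, capped at a configurable limit
- Tests cover the retry and give-up paths

## Next session
Start here. Retry logic is in review; nothing else is in flight.
```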
Principle 2: Claude's Built-in Memory is Better Than Manual Note-Taking
Claude Code has two memory systems: the CLAUDE.md file you write by hand, and Auto Memory, which Claude writes itself based on corrections you make. Most developers only use the first. Using both cuts the manual overhead of session management by more than half and produces more accurate recall than notes you wrote yourself.
I wasted two weeks maintaining a sprawling set of markdown notes before I discovered this. I was carefully updating files that Claude was already tracking more accurately through auto memory.
What auto memory doesn't do: strategic decisions. If you've chosen PostgreSQL over SQLite for a reason, write that in CLAUDE.md. Auto memory captures patterns. CLAUDE.md captures architecture.
The CLAUDE.md that actually worked for Polyphemus:
```markdown
# Polyphemus — Claude Context

## What this is
Autonomous Polymarket trading bot. Real money. Kelly Criterion sizing.
Never lets an exception stop the main loop.

## Hard rules
- Never hardcode API keys. Doppler only.
- All amounts in USDC, not cents. One violation cost a real trade.
- Log every trade decision with rationale BEFORE executing.
- MAX_POSITION_SIZE is a ceiling, not a suggestion.

## What we are NOT doing
- No async. Sync is predictable.
- No ML models. Signal threshold is a float.
- No framework for the trading loop. Too much magic.
```
300 tokens. That's it. Short CLAUDE.md, accurate auto memory, clean context.
Principle 3: Plan Mode Before You Write a Single Line
Plan mode (/plan) lets Claude research your codebase and propose an approach without making any changes. You review the plan, redirect if needed, then approve. On any task touching more than two files, this single step eliminates the most expensive class of Claude mistake: confident, multi-file output that conflicts with existing architecture.
Without plan mode, Claude wrote 200 lines of code conflicting with an architectural decision buried in a different file. Confident. Wrong. Two hours lost.
With plan mode on anything touching more than two files:
```
Me: /plan Add circuit breaker to execution module.
    Pause trading after 3 consecutive losses.

Claude: [researches without touching anything]
        Proposed approach: [plan]
        Files: src/execution/orders.py, src/core/state.py

        I noticed MAX_LOSS_DAILY in src/core/config.py —
        should the circuit breaker integrate with that?

Me: Yes, but use config module, not state.py.

Claude: Understood. Implementing now...
```
Claude caught an integration point I hadn't mentioned. I redirected before any code was written. Plan mode costs nothing and consistently saves hours.
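For concreteness, here is a sketch of what an approved plan like that might produce. The names (`CircuitBreaker`, `LOSS_STREAK_LIMIT`) are my illustration, not Polyphemus source:

```python
from dataclasses import dataclass

# Assumed config value; in the post's layout this would live in src/core/config.py.
LOSS_STREAK_LIMIT = 3

@dataclass
class CircuitBreaker:
    """Pause trading after N consecutive losing trades."""
    max_streak: int = LOSS_STREAK_LIMIT
    loss_streak: int = 0
    tripped: bool = False

    def record_trade(self, pnl: float) -> None:
        # A win resets the streak; a loss extends it and may trip the breaker.
        if pnl < 0:
            self.loss_streak += 1
            if self.loss_streak >= self.max_streak:
                self.tripped = True
        else:
            self.loss_streak = 0

    def can_trade(self) -> bool:
        return not self.tripped
```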
Principle 4: Two Gates, Not One
Two-gate verification means every Claude output clears an automated check (type checks, linting, tests in under 30 seconds) before it gets a human review pass using a fixed 6-question checklist. Before this system, 1 in 6 Claude outputs reached production with an error. After: 1 in 40. On a trading bot, that gap is the difference between an incident log and a boring afternoon.
Gate 1 is automated. A bash script: type checks, linting, tests. 30 seconds. Catches ~60% of errors.
```bash
#!/bin/bash
# Gate 1: automated checks. Any failure stops the pipeline immediately.
python -m mypy . --ignore-missing-imports && echo "✓ Types" || exit 1
python -m ruff check . && echo "✓ Lint" || exit 1
python -m pytest tests/ -q && echo "✓ Tests" || exit 1
```
If Gate 1 fails, paste the error back: "Fix only what's causing this error. Nothing else." That last sentence matters. Without it, Claude fixes the error and refactors three other things.
Gate 2 is a 6-question checklist. Five minutes, manual, non-negotiable:
1. Does this do exactly what I asked — not more?
2. Are external API calls using the correct endpoints?
3. Is error handling present on every I/O and network call?
4. Are there hardcoded values that should be env vars?
5. Does this break anything that was already working?
6. Can I explain every line if someone asks me tomorrow?
Question 6 catches the most issues. If I can't explain a line, I don't ship it.
Principle 5: Treat Compaction Like a Power Outage. Plan for It.
When Claude's context window fills, it compacts: nuance gets discarded, recent decisions disappear, and the next response starts from a degraded state. The fix is a pre-compaction ritual at the end of every meaningful session. One prompt to update CURRENT_TASK.md, record new decisions, and write a 2-sentence handoff note for the next Claude instance. Recovery time: 90 seconds.
At the end of every meaningful session, one prompt:
```
We're wrapping up. Please:
1. Update CURRENT_TASK.md with current state
2. Add new decisions to CLAUDE.md's decisions section
3. Write a 2-sentence next-session starter — what the next
   instance of you needs to know to resume immediately
```
That third item is the key. Claude writing handoff notes for Claude produces better handoffs than I can write myself. When a new session starts:
```
Read CLAUDE.md and the "Next session" section of CURRENT_TASK.md.
Confirm your understanding before we continue.
```
90 seconds. Full speed.
The Numbers, Updated April 2026
These aren't projections. This is the actual state of a production system that started at 4,247 lines in December 2025 and hit 36,096 lines four months later, running pair arbitrage on BTC/ETH/SOL/XRP. Every number below is from a live system, not a benchmark.
| Metric | Before System | After System |
|---|---|---|
| Lines of code | 4,247 (Dec 2025) | 36,096 (Apr 2026) |
| Avg session tokens | ~10,000 | ~4,200 |
| API cost/month | ~$340 (month 1, unoptimized) | $136 |
| Error rate to production | 1 per 6 outputs | 1 per 40 outputs |
| Silent bugs caught by shadow gate | 0 | 8 (none hit live capital) |
| Test coverage | 41% | 73% |
| Uptime since December | N/A | 99.2% |
| Claude Code sessions | ~180 (Dec) | 400+ (Apr) |
The system, not the prompts, made the difference. Claude Code is a force multiplier. Without a system, it's an expensive way to ship buggy code faster. For a deeper look at verification workflows, see my post on evidence-based AI code verification.
The 6th principle I'd add today: shadow-first before every live deploy. Run a dry-run instance in parallel. Collect evidence. Gate on n_completed >= 50 before promoting. In April alone, that gate caught a bug where set("BTC") was returning {'B','T','C'} instead of {'BTC'}. The bot would have traded the wrong assets live. Eight bugs like that. Zero P&L damage.
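That bug is worth seeing in two lines, because it's a classic Python footgun: set() iterates whatever you hand it, and a string iterates character by character. The gate check below (`ready_to_promote`) is my own name for the threshold, not Polyphemus source:

```python
# set() on a string iterates per character: three one-letter "assets".
wrong = set("BTC")
assert wrong == {"B", "T", "C"}

# A set literal (or set(["BTC"])) keeps the symbol whole.
right = {"BTC"}
assert right == {"BTC"}

# The shadow gate itself is a simple evidence threshold:
def ready_to_promote(n_completed: int, min_trades: int = 50) -> bool:
    """Promote a shadow instance to live only after enough completed dry runs."""
    return n_completed >= min_trades
```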
Every principle here was learned the hard way: real bugs, real money at risk, real debugging sessions at 2am on a QuantVPS SSH terminal. I also built a self-improving RAG system that captures these learnings automatically so future Claude sessions don't repeat past mistakes.
What I Didn't Cover Here
The five principles in this post are the foundation. There's a full advanced layer above them: hooks that run Gate 1 automatically after every file write, subagents routing cheap tasks to smaller models, agent teams for parallel feature development, and MCP servers giving Claude direct live database access. Each of these requires the foundation to be working first.
The full advanced system is in the Claude Code Guide: Advanced Edition. It includes hooks that run Gate 1 automatically after every file write (no manual step), subagents routing cheap tasks to smaller models (44% cost reduction with progressive context loading), agent teams for parallel feature development, checkpointing for safe architectural experiments, and MCP servers giving Claude direct access to the live database for debugging. You need the foundation before the advanced layer is useful.
The guide includes the complete Polyphemus architecture walkthrough. If you're deploying your own bot, here's my step-by-step VPS setup on DigitalOcean.
If this case study was useful, the best thing you can do is send it to one developer still burning money using Claude Code without a system.
— Chudi
hello@chudi.dev | chudi.dev