60 days with Claude Code on a production ERP: the honest balance (no hype, raw numbers)

#claudecode #ai #productivity #webdev

The evening Étienne asked to see the numbers

Tuesday evening, end of the day, the open space had cleared except for Étienne. Étienne holds sixty percent of the house and spends his working week at a fund that acquires software publishers, and he looks at ERPs all year the way others read balance sheets. He sat on the edge of my desk, a metal water bottle in hand, and said what he always says when he senses someone is telling themselves a story. "What's that based on?"

I was about to answer with a narrative. Sixty days of solo production on Rembrandt with Claude Code, learning the doctrine, the in-flight retractions, the incidents that hardened the rules. The declarative form was ready. But Étienne doesn't ask for a narrative, he asks for the material inventory. So I opened a terminal and let wc -l speak. This article is what I should have given him without waiting for him to ask — the dry, numbered balance, what worked, what didn't, what I would do differently. Not a success story, not a cautionary tale. Just the audit nobody runs on DEV.to because we're all too busy publishing the parts that shine.

What's at stake behind Étienne's question is less the performance of a device than the possibility of measuring it honestly. Sixty days of practice with an AI assistant on a production project is a rare object at this stage. Most publications circulating on the subject are either brief demos from a hackathon or marketing announcements from vendors. The field return at sixty days, delivered with its numbers and retractions, barely exists. That's the gap I intend to close here, without more pedagogy than is strictly needed.

The dry material inventory

Sixty calendar days between the first session and today. Fifty-eight active days out of sixty, meaning two days without a commit and explaining why the rest of my life barely held. Over that window, the repo accumulated nine hundred and eighty-four commits bearing my name — an average of sixteen commits per working day, on days that are not only code, that are also accounting, calls from parents, team arbitrations, and nights where I'm re-reading SQL while everyone sleeps.

The filesystem says today one hundred and thirty-one thousand six hundred and twenty-eight lines of TypeScript, TSX, JS, JSX, excluding node_modules, .next, .claude, measured by a find I can re-run in front of anyone. For scale, in March I signed a five-figure pack with a European commercial ERP publisher whose technical annex billed custom development by number of lines produced. The pack is in refund negotiation.

Living inside this mass of code are seventy-four written architecture decisions, two hundred and seventy-six Supabase migrations, eighteen project rules in .claude/rules/, forty-four sessions documented in docs/sessions/, and, in a separate folder, sixty-seven DEV.to article drafts of which twenty-four are published, including what you're reading now. The AI piloting doctrine has gone through nine versions over the period — v0.2, v0.3, v0.3.1, v0.3.2, v0.3.3, v0.4, v0.4.1, v0.6, v0.7 — each anchored in a newly documented fact or a public retraction. The v0.5 skip was acknowledged after the empirical gorgon session in mid-May, not glossed over.

I'm not writing these numbers to impress. I'm writing them because what follows doesn't stand without them, and because the practice has spent its entire duration demonstrating that they don't say what they appear to say.

What worked

Three mechanisms, and only three, deserve to be named. Each has its founding incident with a date, and each paid its learning cost before becoming profitable.

Falsify before fixing. On 6 May at the start of the afternoon, a Sentry alert surfaced that the day's enrolment counter was reading zero while three enrolments had been entered that morning. My reflex arrived before the protocol, and my hand was already on the keyboard. "The cache isn't invalidated," I committed, I deployed. Thirty minutes later, rollback — the bug was still there. The real cause, which a ninety-second grep probe would have surfaced, was that no cache_invalidate call existed anywhere in the enrolment pipeline — not a stale cache, an absent one. The textual rule asking for a falsifiable hypothesis before any fix had been living in CLAUDE.md for three weeks. I simply didn't follow it that day, because no apparatus interrupted my run between bug and commit. Ten days later, the falsify-before-fix skill was committed — no longer a textual rule but a material switch that loads on the keywords fix, bug, doesn't work and imposes its protocol before a single line is written. The gap between a textual rule that doesn't hold under pressure and a material device that physically interrupts the session is exactly what separates a pious doctrine from an operational one. Across the forty-four logged sessions, I can name at least a dozen incidents where this skill prevented a fix-then-rollback cycle. Five to ten minutes of upstream probing are worth thirty to ninety minutes saved downstream — which, at two or three bugs per week, adds up to a full day recovered each month.

Live, Snapshot, Cache. Late April, I ran the source-of-truth audit I had been putting off for three months. I crossed two queries that had never been crossed before. One student record: the contacts.montant_total column carried one thousand one hundred and fifty-nine euros entered somewhere in 2024, untouched since. The actual sum of instalments, computed on the fly, came to two thousand two hundred and sixty-two. A thousand-euro gap, on a single record, with no alarm ever ringing. I widened the grep — five hundred and sixty contacts in the same state, some off by several thousand euros. The value was read every day in the treasury dashboard, treated as an immutable past fact when it should have lived on the fly. The rule that came out of this imposes three questions on any new stored column before any migration goes to review. Must the value evolve with its upstream sources? Is on-the-fly calculation acceptable performance-wise? What declared refresh mechanism if Cache is chosen? No new derivable column enters the schema without explicit answers to all three. The rule closed the class at the root, and three months later, the audit no longer finds the same family of incidents. It is this absence in subsequent audits that constitutes the real success, even more than the five hundred and sixty corrected records.

Filesystem over summary. R2 came out in April after a finding that had cost me sleep — the backlog.md I re-read every morning to orient myself was saying things that were no longer true. Not by malice, by construction. A summary file is produced fast and maintained slowly; it ages in silence like a cache without a refresher. I had already measured it on the technical debt inventory — eleven days of gap between what INVENTAIRE.md announced and what git log said had been done, with no alarm appearing, because no alarm had been wired. Since R2, every status point begins with git log --since='7d', ls docs/adr/ | wc -l, git status --porcelain executed against the filesystem, before reading the written summary. The written summary is never the authority. The filesystem always is, because it is the only artifact that doesn't lie through writer's neglect. This rule saves me once a week from a false assertion I was about to make in good faith.

What didn't work

Three traps, each with its own scar. I name them because no DEV.to publication I've read on Claude Code admits to them, and they are precisely what distinguishes real practice from hackathon thought leadership.

The agent's over-engineering reflex. Mid-May I asked Claude Code to automate the first-contact outreach to leads — a simple business case that should hold to a day of dev. The first proposal arrived in six files: a dedicated app_config table, progressive send warm-up, deterministic hash for idempotence, admin dashboard to supervise the cadence. "Simpler," I typed. The second proposal cut the warm-up but kept the config table and the hash. "Even simpler." The third proposal finally reached six honest files — a cron route, a pre-existing email_outbox we reuse, and a day of dev that kept its promise. Three rounds of recalibration to stop the agent from writing a feature twice too expensive that nobody had asked for. The trap isn't that it proposes the complex solution — it's that it proposes it so well that you accept the first time, and you only realise you've been overrun a week later, when the app_config table nobody requested lives in your schema like furniture with no purpose. Rule R11 Parsimony was codified following that session, but it doesn't yet have a material device equivalent to the falsify-before-fix skill. It is an open doctrinal debt I would have liked to close before v0.7.

The silent drift of summaries I write myself. R2 describes the problem from the reading side — summaries age without an alarm. What took me time to understand is that the summaries I produce drift first. The doctrine itself sinned against R2, for weeks announcing "thirty-five thousand lines" to qualify Rembrandt. Practice had long since exceeded it. The figure had been projected once, repeated six times across successive versions, never re-measured. When a fourth external reviewer re-measured it, the counter exploded to one hundred and eighteen thousand, then to one hundred and thirty-one thousand at the time of writing. The doctrine was a Cache without a refresher projected onto its own description. The retraction is logged in the v0.4.1 release notes, because writing a doctrine on systematic falsification without retracting your own false numbers would be precisely the complacency it denounces. The hard lesson: if your own doctrine doesn't survive its own rule, it's not the rule that's wrong — it's your application that hadn't yet begun.

Sub-agents that don't load the same memory as I do. Commit 3756e63 on 18 May was pushed directly to the main branch by an ERP sub-agent I had briefed on an autosend refactor. The user-scope feedback "always check git branch --show-current before a non-trivial commit" had existed since 14 May — I had paid its learning cost through an incident of the same kind. But the feedback lives in user-scope memory, and the ERP sub-agent loads project-scope memory that doesn't inherit from mine transitively. The piloting agent doesn't know what I know. Rule R9 amended in v0.7 fixes the defect — any sub-agent brief over thirty minutes must now inline the load-bearing feedbacks, because no transitive mechanism does it in my place. But the amendment doesn't erase the incident. For three hours, an autonomous agent operated in the repo believing it was respecting the doctrine while having loaded only half of it. It is a type of silent failure you don't see — except when you git log afterwards and discover the branch the commit came from. Filesystem as authority, again.

What I would do differently

If I had to restart Rembrandt today knowing what I know, I would do three things differently and three only. The rest — schema, routes, auth, dashboard, planning, attendance, valuation, finance — I would do roughly the same way it came, because the nature of the project imposed its structure and I would have had little room to manoeuvre.

I would install the falsify-before-fix skill on day one, not after the 6 May incident. It didn't exist then, I know. But today that the device exists, the first reflex of any solo dev starting Claude Code should be to clone the doctrine-counterpart repo and run ./install.sh on their project before their first feature. The learning cost of the rule is its first forced use — thirty seconds of friction you get used to within a day. The cost of not having it is a rollback per week for the first month.

I would write the first ADR before the first feature, not after the first incident. ADR-0001 in this repo is dated late March and covers the decision to use Next.js Server Components — it was written two weeks after the decision, as a reconstruction. The eight that followed carry the same scar. It was only at ADR-0009 that I started writing before the decision, and only at ADR-0024 that I began writing the ADR as the consequence of an audit, not as an archiving ceremony. Rule R3 Success criteria before code should have been my rule zero. It became the third because I had taken too long to understand that it applied to the doctrine itself.

I would refuse the first autonomous sub-agent until I had materially tested its memory propagation. The pattern "delegate to a sub-agent to go faster" is tempting in week four, when you've accumulated enough discipline to think you can occasionally step away from it. That's the moment the 3756e63 incident happens, because it's exactly the moment you stop verifying what you've delegated. Rule R12 amended in v0.7 codifies the mandatory material test before adopting a sub-agent — I should have invented it in week two, not in week six.

Coda

Sixty days later, what stays with me from this period is not the pride of the one hundred and thirty-one thousand lines. The lines exist, they hold, they run in production, they serve a team and a school — that's very well. What stays is the measure I didn't have at the start and that became my only compass. That measure is neither production speed nor code quality taken in isolation. It is the gap between what I believe I've done and what the filesystem says I've done.

R2 Filesystem over summary is the parent rule of the doctrine because it closes that gap. As long as you work in accord with it, you don't write your own lies. As soon as you forget it, you write them — and the AI re-writes them with you, because it was trained to please.

An agent that never contradicts you isn't a counterpart, it's a faster typist. The doctrine that held across these sixty days is a prosthesis of material contradiction for someone who codes without a PR review and refuses to code by ear. If a single rule saves you a fix-rollback cycle next week, it has already paid for itself. If none of them speaks to you, wait for your first silent incident — it will come, and that day, you'll know what to install.

Honest 60-day balance on Rembrandt — 131,628 lines, 74 ADRs, 67 DEV.to articles, 9 doctrine versions, 44 documented sessions, 984 commits over 58 active days. Names and school fictional, scenes recomposed, commercial negotiation ongoing. The doctrine referenced: github.com/michelfaure/doctrine-counterpart (CC-BY-4.0).