Anatomy of a Claude Code setup that pays for itself
The most common reaction when I show people my Claude Code workflow is some version of: "isn't that a lot of tokens?"
It is. The flow front-loads context, plans before it implements, runs scripted checks after edits, and writes structured artifacts to disk for later steps to pick up. Compared to typing "build me a feature" into a fresh chat, it spends more.
It also does the thing.
That second sentence is the one most takes on AI-assisted development skip over. The cost objection treats tokens as the only line item on the invoice. The bigger line item, by a wide margin, is the cost of correcting an agent that drifted out of scope, hallucinated an API, edited the wrong file, or confidently produced something that has to be thrown away. Once you account for that, the calculus inverts. Structure is cheaper than chaos.
This article is an anatomy of the structure I landed on. Everything described here lives in my dotfiles, public, at github.com/ku5ic/dotfiles/tree/main/claude. I will link the actual files as I go so you can read or steal whatever is useful.
The interface was already there
Before walking through the parts, it is worth saying out loud what the parts are made of, because nobody had this on their 2026 bingo card.
The interface that makes AI agents predictable and produces quality output is not a vector database. Not a fine-tuned model. Not a proprietary framework. Not an orchestration layer with a clever name.
It is markdown files in a sensible folder structure.
CLAUDE.md at the repo root. A docs folder the agent reads before it touches code. Command files that encode a workflow. Rules files that encode standards. Plain text. Version controlled. Diffable. Greppable. The exact tooling we have had for three decades.
For a couple of years the industry poured capital into building new primitives for LLMs. New storage layers, new retrieval mechanisms, new agent protocols, new runtime abstractions. Most of it was solving a problem the models did not actually have.
The models were trained on open source. Open source runs on markdown and folders. READMEs, contributing guides, architecture docs, ADRs, issue templates. That is the native format of the corpus. Of course the models respond to it. Of course structure in a repo produces structure in the output.
The unlock was not technical. It was noticing that the interface was already there.
Three things follow from that, and the rest of this article is mostly working out the implications.
First, the quality of your output is bounded by the quality of your written context. A thin CLAUDE.md produces thin work. A precise one produces precise work. Architecture documents, coding standards, and explicit constraints are no longer dead weight. They are executable.
Second, folder structure is a contract. When the agent can infer where things belong from the tree alone, it stops guessing. When it cannot, it invents. The same property that makes a codebase readable to a new hire makes it legible to a model.
Third, the skills that compound here are not prompt engineering. They are the unglamorous ones. Writing clearly. Structuring information. Keeping documentation close to the code it describes. The things senior engineers were already supposed to be doing.
The dotfiles I am about to walk through are not exotic. They are markdown files in folders. Every guardrail, every workflow stage, every skill, every command is a .md file with frontmatter, or a small shell script, or a JSON config. The whole repo is roughly 150KB of plain text. It works because the model was trained on plain text.
The joke writes itself. After all the frameworks and all the infrastructure, the winning move was a folder of markdown files and the discipline to keep them current. The future of AI-assisted engineering looks a lot like good engineering.
How it propagated
It did not start as a system. It started as one project.
There was a particular repo where I stopped treating Claude Code as a faster autocomplete and started treating it as a junior engineer who needed an onboarding document. I wrote a CLAUDE.md for it. The file captured the things I kept correcting Claude on: stop adding decorative comments, match the existing test style, do not edit lockfiles, do not invent file paths. It worked. The next project I touched, I copied it over. By the third repo I was diffing them against each other, and by the fourth I was tired of doing that.
The pattern was obvious in retrospect. The same five-stage rhythm kept emerging in every project regardless of stack: figure out what is here, plan the change, implement it in small steps, test, review. The same handful of guardrails kept being needed: do not push to main, do not commit AI signatures, do not pipe curl into a shell, do not write to my zshrc directly. The same handful of mini-context-collectors kept being needed: what is the project root, what stack is this, what is the base branch, what does the diff look like.
Once I saw the pattern, I extracted it. The dotfiles repo is the result. It is not a framework and not a methodology. It is just the parts of "use Claude Code well" that turned out to be project-independent.
The shape of the thing
Four namespaces of slash commands, six skills, five hooks, a handful of helper scripts, a settings.json that does real work, and a global CLAUDE.md that ties it together. Top-level layout:
```
claude/
  CLAUDE.md
  settings.json
  bin/
    detect-stack.sh
    project-name.sh
    project-root.sh
    git-base.sh
    run-checks.sh
  commands/
    flow/    preflight, plan, implement, test, review, fix, resume
    audit/   a11y, debt, doc-drift, perf, security
    meta/    feature, prompt, retro
    write/   commit, pr, release-notes, stakeholder
  hooks/
    inject-context.sh
    guard-bash.sh
    guard-edit.sh
    guard-commit.sh
    sanitize-output.sh
  skills/
    react-patterns, django-patterns, test-patterns,
    wcag-audit, security-patterns, markdown-report
```
The cost argument cuts through all of it. Each piece exists because the alternative, having Claude figure it out fresh every time, was demonstrably more expensive in the only currency that matters: my time spent re-steering it.
CLAUDE.md as the contract
Source: claude/CLAUDE.md.
The global CLAUDE.md is the document Claude Code reads on every session. Project-level CLAUDE.md files extend it. Mine is roughly 2,500 words and covers: project boot protocol, output rules, output discipline, token discipline, code style, verification before acting, anti-fabrication, environment and stack, commands and side effects, git workflow, decision frameworks, scope and planning, principles, ambiguity handling, communication style, the failure mode playbook, scratch artifact naming, and the canonical command namespace.
A few sections are worth singling out because they are the ones that pay for themselves the fastest.
The anti-fabrication section is short and explicit. Do not invent file paths that have not been seen via Read or Glob. Do not invent API shapes. Do not invent version numbers; read the lockfile or --version. Do not invent test results; if a test was not run, say "not run." This eliminates a specific failure mode that used to cost me real time: Claude confidently writing code against an imagined version of an API, me discovering it three steps later, both of us backing out the change.
The token discipline section tells Claude how to read efficiently: prefer rg and grep for locating, use Read only for understanding, cap git log to twenty entries unless justified, do not re-read a file in the same session unless an edit changed it, skip the obvious noise directories (node_modules, .next, dist, build, coverage, .turbo, vendor, target, __pycache__, .venv). This is direct token savings, but the more important effect is keeping Claude's context window populated with signal instead of noise.
The failure mode playbook is the most underrated section. It tells Claude what to do when quality checks fail, when the plan does not match reality, when a tool is unavailable, when context is exhausted, when the user issues a correction. Without it, the default behavior is to hide problems, keep going, and hope. With it, the default behavior is to stop and surface. Stopping early is one of the highest-leverage things an agent can do, and it does not happen unless you ask for it explicitly.
The flow namespace
Source: claude/commands/flow/.
Five commands form the main path of any feature work, plus two for off-ramp situations.
/flow:preflight
flow/preflight.md. Effort: medium. Read, do not write.
Preflight establishes shared understanding before any code is touched. It reads the project root CLAUDE.md, identifies the minimum file set the task will touch, reads those files, lists the CI checks the project actually defines (typecheck, lint, test, format) without running them, and notes uncommitted work and recent direction from git status and git log -5 --oneline.
The hard cap is twelve files read across preflight. If the minimum set exceeds twelve, the command stops and asks me to scope the task down. That cap is deliberate. A task that needs more than twelve files of context to understand is a task that needs to be split, not a task that needs more reading.
The output is a short preflight report written to ~/.claude/scratch/preflight-<project-name>-<YYYYMMDD-HHMM>.md. Subsequent commands read it.
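Composing that path is mechanical; a minimal sketch of how a command might build it, where "my-project" stands in for the output of bin/project-name.sh:

```shell
# Compose the scratch artifact path for a preflight report.
# "my-project" is a placeholder for the bin/project-name.sh slug.
project="my-project"
stamp="$(date +%Y%m%d-%H%M)"
artifact="$HOME/.claude/scratch/preflight-${project}-${stamp}.md"
echo "$artifact"
```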
/flow:plan
flow/plan.md. Effort: heavy.
Plan turns a confirmed task into an ordered, atomic implementation plan with explicit tradeoffs. The procedure is constrained: consider two implementation approaches, score each on scope, risk, effort, and reversibility, pick one and justify why, break the chosen approach into phased steps where each step is independently committable and leaves the codebase working, identify the test strategy per step, identify the rollback path.
Two approaches is the floor. It is not "consider many alternatives." It is "do not pick the first thing that comes to mind without comparing it to one other thing." That single constraint catches a surprising number of bad first instincts.
The plan is written to ~/.claude/scratch/plan-<project-name>-<task-slug>-<YYYYMMDD-HHMM>.md. I read it, push back on it, edit it directly, or send it back for revision. Implementation does not start until I have approved a plan I trust.
/flow:implement
flow/implement.md. Effort: heavy.
Implement executes one step of an approved plan. The scope rules are strict: stay in the files the plan names, do not refactor unrelated code, do not upgrade or add dependencies unless the plan explicitly includes them, comments only where the code does not explain itself, explain why and not what. If the plan turns out to be wrong mid-implementation, stop and surface the mismatch. Do not silently expand.
After each step the command runs the narrow verification the plan prescribed (one file's tests, one type check), then pauses and reports what was done, what was verified, and what is left in the step. The pause is enforced by the global rule in CLAUDE.md: pause after each /flow:* step and wait for user approval before continuing.
/flow:test
flow/test.md. Effort: medium.
Test adds or updates tests for the recent implementation work. It loads the test-patterns skill, scopes testing to the diff via git diff HEAD and git status, mirrors source paths for new test files, and writes tests that verify behavior rather than implementation. It runs the new tests narrowly first (single file), then the adjacent suite. After narrow tests pass, it runs the full check script (run-checks.sh, more on this below).
The discipline is in what it will not do: it will not introduce a new test framework, will not add snapshot tests if the project does not use them, and will not silently change implementation if a test reveals the implementation was wrong. That last one is important. A failing test is a signal, not a problem to suppress.
/flow:review
flow/review.md. Effort: heavy.
Review is the senior pass before handoff. It runs run-checks.sh to establish baseline check state (so pre-existing failures are noted before review begins), loads the relevant patterns skills for the stack, and reviews in eight ordered categories: correctness, types, accessibility, security, design principles, performance, maintainability, tests. Sections with no findings are skipped, not padded.
Two rules in this command matter more than the categories. First, an empty review is a valid result. Second, do not flag personal style unless it violates the project's lint config. Both push back against the agent's natural drift toward "find something to say." Reviews that invent findings to look thorough are reviews I have to spend tokens disagreeing with.
The review output goes to ~/.claude/scratch/review-<project-name>-<scope-slug>-<YYYYMMDD-HHMM>.md using the markdown-report skill format with a severity rubric.
/flow:fix and /flow:resume
flow/fix.md is for surgical fixes from a failing signal: test failure, type error, runtime error, lint error. Hypothesis stated in one sentence before changing anything. Smallest change that addresses the root cause. Stop conditions: if the hypothesis requires a refactor, stop and propose a separate /flow:plan. If the fix touches more than three files, stop and surface; this is no longer a fix. If the fix changes a public API, stop and ask for a plan.
flow/resume.md reorients against a partially executed plan. It diffs the plan against the current code state, reports which steps are done, partial, or pending, and recommends the next concrete action. It does not implement.
The audit namespace
Source: claude/commands/audit/.
Five audits, all stack-aware via the injected <repo-context> (described below).
audit/a11y.md runs a WCAG 2.2 AA audit using the wcag-audit skill. Bails if no frontend surface is detected.
audit/debt.md surfaces technical debt and architectural drift. Be direct and skeptical. Do not list minor style preferences.
audit/doc-drift.md detects divergence between code and the docs that describe it. Comparison task, does not rewrite docs. Routed to Haiku via model: haiku in the frontmatter, because the work is mechanical comparison rather than judgment.
audit/perf.md is static analysis. It does not run benchmarks. It flags what is likely slow and names what to measure.
audit/security.md is defensive review. Assume any input from outside the process boundary is hostile. Loads security-patterns.
Audits are not on the main path. They run when scope warrants.
The meta namespace
Source: claude/commands/meta/.
meta/feature.md shapes a fuzzy feature request into a structured brief before planning. It is the step before /flow:plan for tasks that are not yet sharp enough to plan against. Heavy effort, because the work is thinking.
meta/prompt.md turns a fuzzy ask into a sharp Claude Code prompt with context and acceptance criteria. Light effort, routed to Haiku. The work is rewriting with structure.
meta/retro.md is a structured retrospective for an incident, sprint, or completed feature. Routed to Haiku because it is structure-driven rather than analysis-heavy.
The write namespace
Source: claude/commands/write/.
All four are routed to Haiku via frontmatter, all four are light effort, all four are pure transformation tasks.
write/commit.md generates a commit message from the staged diff, matching project style. Reads git log --oneline -20 to detect the project's convention (Conventional Commits, ticket prefix, plain). Does not run git commit.
write/pr.md generates a PR description from the current diff. No codebase reading beyond the diff provided.
write/release-notes.md groups and rewrites commits unique to the current branch versus its base.
write/stakeholder.md reframes a technical finding for a non-technical audience.
These are the clearest case for model selection. Generating a commit message from a diff does not require Sonnet's reasoning. Haiku does it as well, faster, and at a fraction of the cost. The model choice is in the command's frontmatter, not in my head.
Effort tags and model routing
Every command starts with an effort tag: light, medium, or heavy. The tags are advisory for me when picking what to run, but they map cleanly to the model selection in frontmatter.
Light effort, mechanical transformation work runs on Haiku: the four write/* commands, audit:doc-drift, meta:prompt, and meta:retro. Everything else runs on the default model (Sonnet for me) because the work is judgment-heavy: planning, implementing, reviewing, debt audits, security audits, performance audits, feature briefs.
The cost argument is concrete here. Generating a PR description on Sonnet is a small overpayment. Doing it across every PR for a year is not. Routing it to Haiku via one line in frontmatter, model: haiku, recovers the difference without changing anything about how I invoke the command.
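In practice that routing is a single frontmatter line. A hedged sketch of what a write/commit.md header might look like; only model: haiku is taken directly from the setup described here, the other fields and wording are illustrative:

```markdown
---
description: Generate a commit message from the staged diff
model: haiku
---

Effort: light.

Read the staged diff and `git log --oneline -20`, detect the project's
commit convention, and produce a message that matches it.
Do not run `git commit`.
```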
Skills as on-demand expertise
Source: claude/skills/.
Six skills, each a markdown file with frontmatter that tells Claude when to load it.
- react-patterns: React and Next.js patterns, anti-patterns, and review checklist
- django-patterns: Django patterns and review checklist
- test-patterns: conventions for Vitest, Jest, RTL, Playwright, pytest
- wcag-audit: WCAG 2.2 AA checklist and severity rubric
- security-patterns: security checklist for frontend and backend
- markdown-report: consistent format for audit and review artifacts
Skills are loaded on demand by the commands that need them. /flow:plan loads the patterns skill matching the detected stack. /flow:test loads test-patterns. /audit:a11y loads wcag-audit. /audit:security loads security-patterns. /flow:review loads multiple skills depending on what the diff touches.
This is the design choice that does the most for the cost argument: skills are not in the system prompt by default. They are pulled in only when relevant. The token budget for "everything Claude could possibly know about React" is paid only on tasks that actually involve React.
Hooks and the permission system
Source: claude/hooks/ and claude/settings.json.
This is the layer that takes the agent from "useful most of the time" to "I trust it to operate in my repos." Five hooks, registered in settings.json against tool events.
inject-context.sh
hooks/inject-context.sh. Fires on UserPromptSubmit. Runs once per session, gated by a session marker file in ~/.claude/scratch/.
It calls project-name.sh and project-root.sh, looks for a cached stack report at ~/.claude/cache/stack/<project-name>.txt, invalidates the cache if any stack sentinel file (package.json, pyproject.toml, Gemfile, Cargo.toml, go.mod) is newer than the cache, regenerates by running detect-stack.sh if needed, and prepends a <repo-context> block to the prompt:
```
<repo-context>
root: /Users/me/Code/some-project
js: yes (typescript, react, vitest, eslint) [pnpm]
node: 20.11.1
python: no
ruby: no
rust: no
</repo-context>
```
The cache is the part that matters for cost. Stack detection is not free. It runs jq over package.json, greps Python config files, checks for monorepo signals. Doing it on every prompt would be wasteful. Doing it once per session and reusing it across the conversation is cheap. Doing it once per project until a sentinel file changes is cheaper still.
The detection itself lives in bin/detect-stack.sh. Output is terse on purpose; each line is meant to be scanned in under a few hundred tokens of context.
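The freshness check is a plain mtime comparison. A sketch of the invalidation logic under the assumptions above; the function name is mine, the sentinel list matches the one the hook uses:

```shell
# Return 0 if the cached stack report is still valid, 1 if it must be
# regenerated. A cache is stale when it is missing or when any stack
# sentinel file has been modified more recently than the cache.
stack_cache_fresh() {
  cache="$1"; root="$2"
  [ -f "$cache" ] || return 1            # no cache yet: must regenerate
  for sentinel in package.json pyproject.toml Gemfile Cargo.toml go.mod; do
    f="$root/$sentinel"
    if [ -f "$f" ] && [ "$f" -nt "$cache" ]; then
      return 1                           # sentinel changed since caching
    fi
  done
  return 0                               # cache still valid
}
```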
The flip side of inject-context is that the global CLAUDE.md requires it. The very first line of the project boot protocol is: "Check the injected <repo-context> block. If absent, surface the issue and stop. The inject-context.sh hook did not fire." If the hook fails silently, Claude does not silently proceed without context. It stops and tells me.
guard-bash.sh
hooks/guard-bash.sh. Fires on PreToolUse for Bash. Reads the proposed command from stdin and blocks patterns that the permission system cannot reliably express.
What it blocks outright:

- fork bombs
- piping network downloads into a shell interpreter
- writes to raw disk devices (/dev/sd*, /dev/nvme*, /dev/disk*)
- direct writes to shell rc files
- redundant shell redirects (2>&1 and &>, which the Bash tool does not need and which trigger permission prompts)

What it blocks per segment, after splitting on &&, ||, ;, and newlines:

- rm -rf against root, home, or the current directory
- low-level disk tools (dd, shred, wipefs, mkfs)
- chmod 777, and broad chmod +x against root or home
- git push --force without --force-with-lease, force pushes to protected branches, and git reset --hard on protected branches
- --no-verify on commit, push, merge, or rebase
- git config --global from a project session
- destructive SQL via psql -c, and destructive redis-cli commands (FLUSHALL, FLUSHDB, CONFIG SET)
- find -delete and find -exec rm
- keychain deletion and global package installs
The contract is simple: exit 0 to allow, exit 2 to block with a reason shown to Claude. Any other non-zero exit is a soft failure that does not block. The hook fails open on its own errors. That last detail matters: a buggy guardrail that blocks legitimate commands is its own kind of damage.
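The shape of that contract is easy to sketch. The real hook reads the proposed command from stdin; here it is a function argument for clarity, and the pattern list is a tiny illustrative subset of what guard-bash.sh actually covers:

```shell
# Minimal sketch of the allow/block contract: return 0 to allow,
# return 2 to block with a reason on stderr. Unrecognized commands
# are allowed, so the guard fails open on anything it does not know.
guard_bash() {
  cmd="$1"
  case "$cmd" in
    *"curl "*"| sh"*|*"curl "*"| bash"*|*"wget "*"| sh"*)
      echo "blocked: piping a network download into a shell" >&2
      return 2 ;;
    *"chmod 777"*)
      echo "blocked: chmod 777" >&2
      return 2 ;;
  esac
  return 0
}
```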
guard-edit.sh
hooks/guard-edit.sh. Fires on PreToolUse for Edit, Write, MultiEdit. Blocks edits to lockfiles (package-lock.json, pnpm-lock.yaml, yarn.lock, bun.lockb, Gemfile.lock, Cargo.lock, composer.lock, poetry.lock, uv.lock), edits inside .git/, and direct edits to shell rc files.
If the edit targets a CI workflow file (.github/workflows/*.yml), the hook logs a warning to stderr but does not block. CI workflow edits are sometimes legitimate but always worth surfacing.
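The same exit-code contract applies here. An illustrative subset of the path checks, with the lockfile list shortened; the real hook covers the full list above:

```shell
# Return 0 to allow the edit, 2 to block it. Path matching is done
# with shell globs: .git/ internals by path, lockfiles by basename.
guard_edit() {
  path="$1"
  case "$path" in
    */.git/*|.git/*)
      echo "blocked: edit inside .git/" >&2; return 2 ;;
  esac
  case "$(basename "$path")" in
    package-lock.json|pnpm-lock.yaml|yarn.lock|Gemfile.lock|Cargo.lock)
      echo "blocked: lockfile edit" >&2; return 2 ;;
  esac
  return 0
}
```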
guard-commit.sh
hooks/guard-commit.sh. Fires on PreToolUse for Bash, but only acts on git commit commands.
It blocks AI signatures (Co-Authored-By: Claude, Generated by Claude, robot emoji generation tags) and AI-tell phrasing in the commit subject (Certainly:, Here is:, In this commit:, etc., even when prefixed with a Conventional Commits type). This is the hook that exists because I forgot, more than once, to strip the AI signature from commits before pushing.
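The signature check itself is one grep. A sketch of the detection, with a deliberately partial pattern list; the real hook also catches generation tags and AI-tell subject phrasing:

```shell
# Return 0 if the commit message contains a known AI signature line.
# Case-insensitive; the alternation here is an illustrative subset.
has_ai_signature() {
  printf '%s\n' "$1" | grep -Eqi 'co-authored-by: claude|generated (by|with) claude'
}
```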
sanitize-output.sh
hooks/sanitize-output.sh. Fires on PostToolUse for Edit, Write, MultiEdit. Strips typographic punctuation from written files: em dashes, en dashes, smart quotes, ellipsis characters, Unicode arrows. Replaces them with their ASCII equivalents.
This sounds petty. It is petty. It also catches the case where Claude, despite the explicit rule in CLAUDE.md, writes an em dash anyway. Belt and suspenders. The rule exists in CLAUDE.md so the model usually does the right thing. The hook exists so the model cannot ship the wrong thing even when it slips.
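The sweep is a chain of literal byte-sequence substitutions. A sketch covering a few of the characters; the exact replacement choices here are illustrative, and the real hook handles more:

```shell
# Replace common typographic punctuation with ASCII equivalents.
# Each -e targets one UTF-8 character literally, so this works
# byte-wise regardless of locale.
sanitize() {
  sed -e 's/—/ - /g' \
      -e 's/–/-/g' \
      -e 's/“/"/g' -e 's/”/"/g' \
      -e "s/‘/'/g" -e "s/’/'/g" \
      -e 's/…/.../g'
}
```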
The permission split: allow / ask / deny
Source: claude/settings.json.
The hooks handle the cases where I want a hard block. The permission system in settings.json handles the gradient of "this is fine," "ask me first," and "never."
Examples from the allow list: git status *, git diff *, git log *, git switch *, git stash *, git worktree *, npm run *, pnpm test *, pnpm exec *, npx *, vitest *, tsc *, prettier *, eslint *, rg *, grep *, fd *, find *, gh pr view *, gh pr diff *, gh run list *, Read(**), Edit(**), Write(**), plus the bin scripts. Plus all my skills explicitly listed by name.
Examples from the ask list: npm install *, pnpm add *, brew install *, cargo install *, git push *, git reset *, git rebase *, git commit --amend*, gh pr merge *, gh release create *, rm *, rmdir *, curl *, wget *, edits to config files (.env*, tsconfig*.json, next.config.*, vite.config.*, eslint.config.*, biome.json, prettier.config.*).
Examples from the deny list: sudo *, su *, rm -rf against root or home or cwd, chmod 777, dd *, mkfs *, shutdown *, reboot *, system configuration commands (launchctl, defaults, nvram, scutil, pfctl), reading secrets (.env, *.pem, *.key, SSH private keys, AWS credentials, gh hosts.yml, netrc, pgpass, npmrc, macOS keychains).
The split exists so the agent can move at speed when the operation is safe and stops to ask when the operation is not. Allow-listing read and search tools, scoped git read commands, build commands, and the project's own scripts means the agent does not get stuck asking for permission on routine work. Ask-listing destructive or scope-changing commands means I see them before they happen. Deny-listing things I will never approve under any circumstance means I never have to read another permission prompt about them.
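For concreteness, a trimmed illustrative fragment of how the three lists sit in settings.json. The entries are taken from the examples above; the matcher spellings follow the article's shorthand, so check Claude Code's permission-rule syntax for the exact form before copying:

```json
{
  "permissions": {
    "allow": ["Bash(git status *)", "Bash(rg *)", "Read(**)"],
    "ask": ["Bash(git push *)", "Bash(rm *)", "Bash(curl *)"],
    "deny": ["Bash(sudo *)", "Read(.env)", "Read(**/*.pem)"]
  }
}
```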
Helper scripts
Source: claude/bin/.
Five small scripts, each doing exactly one thing.
bin/project-root.sh returns the git repo root, falling back to $PWD outside a working tree.
bin/project-name.sh returns a slug-safe project identifier used in scratch artifact filenames. Lowercased, leading dots stripped, non-alphanumeric replaced with dashes, collapsed dashes, trimmed. $HOME becomes home. / becomes root. Empty becomes unknown. The slug stability is what makes "the most recent plan for this project" a reliable thing to ask for.
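A sketch of that slugification, following the rules just described; the function takes the name as an argument, where the real script derives it from the repo root:

```shell
# Produce a stable, slug-safe identifier: lowercase, leading dots
# stripped, non-alphanumerics collapsed to single dashes, dashes
# trimmed at both ends. Special cases: $HOME -> home, / -> root,
# empty result -> unknown.
slugify() {
  s="$1"
  [ "$s" = "$HOME" ] && { echo home; return; }
  [ "$s" = "/" ] && { echo root; return; }
  out="$(printf '%s' "$s" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/^\.*//' -e 's/[^a-z0-9]/-/g' -e 's/--*/-/g' \
          -e 's/^-//' -e 's/-$//')"
  [ -n "$out" ] && echo "$out" || echo unknown
}
```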
bin/git-base.sh prints the base branch for the current checkout. Detection order: explicit argument, upstream tracking branch, remote HEAD, common defaults (main, master, develop, trunk). Used by /write:release-notes and elsewhere to compute "commits unique to this branch."
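The detection order translates directly into a fall-through function. A sketch under the stated order, using only standard git plumbing; the real script may differ in details:

```shell
# Print the base branch: explicit argument, then upstream tracking
# branch, then remote HEAD, then the first common default that exists
# locally. Returns 1 if nothing can be determined.
git_base() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"; return 0
  fi
  u="$(git rev-parse --abbrev-ref --symbolic-full-name '@{upstream}' 2>/dev/null)" \
    && { printf '%s\n' "${u#*/}"; return 0; }
  h="$(git symbolic-ref --short refs/remotes/origin/HEAD 2>/dev/null)" \
    && { printf '%s\n' "${h#origin/}"; return 0; }
  for b in main master develop trunk; do
    git show-ref --verify --quiet "refs/heads/$b" 2>/dev/null \
      && { printf '%s\n' "$b"; return 0; }
  done
  return 1
}
```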
bin/detect-stack.sh emits the compact stack report consumed by inject-context.sh. Detects JS/TS (with framework, package manager, Node version), Python (with framework, in ., backend/, server/, api/), Ruby (with Rails detection), Rust, and monorepo signals (pnpm workspaces, Turbo, Nx, Lerna).
bin/run-checks.sh detects and runs the project's typecheck, lint, format-check, and tests. Each section is independent; failures are reported, not aborted. It is the script /flow:test and /flow:review shell out to.
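The "independent sections" property is the whole design. A sketch of the structure, with `true` and `false` standing in for real commands like `pnpm exec tsc --noEmit`:

```shell
# Run every check section regardless of earlier failures, record which
# ones failed, and report at the end instead of aborting on the first.
failures=""
run_section() {
  name="$1"; shift
  if "$@"; then
    echo "ok: $name"
  else
    echo "FAIL: $name"
    failures="$failures $name"
  fi
}
run_section typecheck true    # stand-in for the project's type check
run_section lint false        # a failing check does not stop the rest
run_section tests true
[ -z "$failures" ] || echo "failed sections:$failures"
```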
These are all intentionally boring. They are the parts of "context Claude needs about this project" that should never be regenerated by the agent itself. Boring shell scripts produce stable, slug-safe output. Stable output is what makes scratch artifact naming work, which is what makes /flow:resume work, which is what makes the whole flow recoverable when context runs out mid-task.
Scratch artifacts and the project boundary
The whole flow is held together by a naming convention for artifacts written to ~/.claude/scratch/:
```
~/.claude/scratch/<kind>-<project-name>-<scope-slug>-<YYYYMMDD-HHMM>.md
```
kind is preflight, plan, review, feature, retro, etc. project-name is the output of project-name.sh. scope-slug is the task slug or, for some kinds, omitted.
Any command that needs "the most recent X" filters by current project name:
```
ls -t ~/.claude/scratch/<kind>-<project-name>-*.md | head -1
```
Never read across projects. If no artifact exists for the current project, run the predecessor command first.
This is what makes the flow durable. A plan written for project A in the morning is still findable by /flow:resume in project A in the afternoon, even after a half-dozen unrelated sessions in other projects in between. The naming convention is the bridge between sessions.
The cost argument, restated
Add it up: a CLAUDE.md that pre-loads the contract. A flow namespace that gates work behind preflight, plan, implement, test, review. An audit namespace that loads heavy checklists only when invoked. A meta namespace for shaping fuzzy work into structured input. A write namespace routed to Haiku for transformation tasks. Skills loaded on demand instead of bundled into the system prompt. A <repo-context> block injected once per session and cached per project. Effort tags that map to model selection. Hooks that block the destructive operations that cost the most to undo. A permission split that lets the agent move on safe work and stop on dangerous work. Scratch artifacts that survive across sessions.
Every one of those is a markdown file, a shell script, or a JSON entry. None of it is a framework. None of it is a vendor primitive. The whole thing is the kind of structure any senior engineer would recognize from a well-run open source project, applied to the agent instead of to a new hire.
Yes, all of this spends tokens. It also prevents the much larger token spend of correcting an under-context, over-confident agent that drifted, hallucinated, or did the wrong thing in the right file. And it prevents the largest cost of all, the one that does not show up on any token meter: my time spent re-steering, re-explaining, and rolling back work that should not have happened.
The constraint was never knowledge. The constraint, once you let an agent into your editor, is discipline at runtime. Discipline does not scale through willpower. It scales through tooling. The dotfiles are what that scaling looks like for me. Yours will look different. The point is that you should have one.
I made a related argument about planning a while back: AI removed the time constraint that used to prevent proper architectural work upstream. This is the downstream version of the same argument. Once planning is cheap, runtime discipline becomes the next thing worth engineering.
Source for everything in this article: github.com/ku5ic/dotfiles/tree/main/claude. MIT licensed. Copy what is useful, ignore what is not, and tell me what you would change.