<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sisyphusse1-ops</title>
    <description>The latest articles on DEV Community by sisyphusse1-ops (@sisyphusse1ops).</description>
    <link>https://dev.to/sisyphusse1ops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923876%2Facc79695-9c8c-4afe-ab6e-4ba0c9f25ad5.png</url>
      <title>DEV Community: sisyphusse1-ops</title>
      <link>https://dev.to/sisyphusse1ops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sisyphusse1ops"/>
    <language>en</language>
    <item>
      <title>I shipped cc-audit as a GitHub Action. Now your CLAUDE.md gets linted on every PR.</title>
      <dc:creator>sisyphusse1-ops</dc:creator>
      <pubDate>Sun, 10 May 2026 23:21:14 +0000</pubDate>
      <link>https://dev.to/sisyphusse1ops/i-shipped-cc-audit-as-a-github-action-now-your-claudemd-gets-linted-on-every-pr-5fal</link>
      <guid>https://dev.to/sisyphusse1ops/i-shipped-cc-audit-as-a-github-action-now-your-claudemd-gets-linted-on-every-pr-5fal</guid>
      <description>&lt;p&gt;Quick follow-up to my &lt;a href="https://dev.to/sisyphusse1ops/i-scored-92-public-claudemd-files-against-a-12-rule-baseline-median-score-512-2971"&gt;earlier post&lt;/a&gt; about scanning 492 public &lt;code&gt;CLAUDE.md&lt;/code&gt; files. Takeaway from that scan: median compliance with the 12-rule baseline was &lt;strong&gt;3/12&lt;/strong&gt;. The top-missed rules were rules 9, 10, 12, and 1 — the behavior-file equivalent of skipping unit tests.&lt;/p&gt;

&lt;p&gt;The fix is easy: run a linter. The harder part is remembering to run it.&lt;/p&gt;

&lt;p&gt;So I packaged cc-audit as a &lt;strong&gt;GitHub Action&lt;/strong&gt;. Drop three lines into your repo's workflow, and every push that touches &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; gets an automatic report in the run summary — plus a hard fail if someone ever pastes a real API key into the behavior file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/cc-audit.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cc-audit&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLAUDE.md'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AGENTS.md'&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLAUDE.md'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AGENTS.md'&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;audit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sisyphusse1-ops/cc-audit@v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you get
&lt;/h2&gt;

&lt;p&gt;Every matching push/PR runs cc-audit against the file. The run summary shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;File&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules covered&lt;/td&gt;
&lt;td&gt;7 / 12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance score&lt;/td&gt;
&lt;td&gt;58 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Leaked secrets&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;warn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
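The summary table is ordinary GitHub Actions plumbing: a step appends markdown to the file named by the GITHUB_STEP_SUMMARY environment variable. A minimal sketch of that mechanism (illustrative only, not cc-audit's actual reporting code):

```python
import os

def write_summary(metrics):
    """Append a markdown metrics table to the Actions run summary file."""
    lines = ["| Metric | Value |", "| --- | --- |"]
    for key, value in metrics.items():
        lines.append("| {} | {} |".format(key, value))
    body = "\n".join(lines) + "\n"
    path = os.environ.get("GITHUB_STEP_SUMMARY")  # set by the Actions runner
    if path:
        with open(path, "a") as handle:
            handle.write(body)
    return body
```

Anything appended there renders as markdown on the run's summary page, so the table shows up without any extra reporting dependency.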

&lt;p&gt;The step fails with a loud &lt;code&gt;::error::&lt;/code&gt; annotation if any leaked-secret pattern is detected — OpenAI keys, Anthropic keys, GitHub PATs, AWS access keys, Stripe live keys, postgres URLs with credentials. Placeholder-aware, so &lt;code&gt;&amp;lt;YOUR_KEY&amp;gt;&lt;/code&gt; and &lt;code&gt;sk-example-...&lt;/code&gt; don't trigger false positives.&lt;/p&gt;
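Placeholder-aware matching boils down to checking the text around each hit before calling it a leak. A toy sketch with simplified patterns (the real cc-audit pattern set is stricter and longer):

```python
import re

# Illustrative patterns only -- not cc-audit's actual rule set.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{32,}"),   # OpenAI-style keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub PATs
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]

# Substrings that mark a hit as a placeholder, not a live credential.
PLACEHOLDER_HINTS = ("example", "your_key", "your-key", "xxx", "***")

def find_leaks(text):
    """Return matches that look like real secrets, skipping placeholders."""
    leaks = []
    for pattern in SECRET_PATTERNS:
        for match in pattern.finditer(text):
            window = text[max(0, match.start() - 20):match.end() + 20].lower()
            if not any(hint in window for hint in PLACEHOLDER_HINTS):
                leaks.append(match.group())
    return leaks
```

The context window is the whole trick: a key-shaped string sitting next to "example" is documentation, not an incident.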

&lt;p&gt;By default it doesn't fail the build on mere rule-coverage warnings, because a 7/12 file isn't "broken" — it's just not thorough. You can flip that with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sisyphusse1-ops/cc-audit@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fail-on-warning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Auto-install the baseline
&lt;/h2&gt;

&lt;p&gt;There's also a companion action for the &lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;claude-code-pro-pack&lt;/a&gt; itself. If your repo doesn't have a &lt;code&gt;CLAUDE.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt; yet, this installs the 12-rule baseline in one step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sisyphusse1-ops/claude-code-pro-pack@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;flavor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;both&lt;/span&gt;            &lt;span class="c1"&gt;# claude | agents | both&lt;/span&gt;
    &lt;span class="na"&gt;install-templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# also copy templates/ and examples/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's polite — skips files that already exist unless you pass &lt;code&gt;overwrite: true&lt;/code&gt;.&lt;/p&gt;
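The skip-unless-overwrite behavior is simple enough to sketch. This is illustrative, not the installer action's actual code:

```python
import shutil
from pathlib import Path

def install_baseline(src_dir, dest_dir,
                     flavors=("CLAUDE.md", "AGENTS.md"), overwrite=False):
    """Copy baseline files, skipping any that already exist unless overwrite."""
    installed, skipped = [], []
    for name in flavors:
        dest = Path(dest_dir) / name
        if dest.exists() and not overwrite:
            skipped.append(name)  # be polite: never clobber an existing file
            continue
        shutil.copyfile(Path(src_dir) / name, dest)
        installed.append(name)
    return installed, skipped
```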

&lt;h2&gt;
  
  
  End-to-end demo
&lt;/h2&gt;

&lt;p&gt;I shipped a demo repo that uses both actions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/sisyphusse1-ops/ccpp-demo" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/ccpp-demo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check the Actions tab — you'll see real runs installing the pack, then linting it. The install workflow is &lt;code&gt;workflow_dispatch&lt;/code&gt; so you can fork the repo, trigger the install on your fork, and watch the same thing happen on your own files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother
&lt;/h2&gt;

&lt;p&gt;Three reasons I wrote this and why you might want to run it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Behavior drift.&lt;/strong&gt; CLAUDE.md files get edited casually by whoever's on-call for the agent that week. Compliance scores drift down over months. A linter in CI catches it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret leaks.&lt;/strong&gt; The 492-file scan found zero real leaks, which is great — but the base rate of pasting &lt;code&gt;.env&lt;/code&gt; contents into docs is nonzero across the wider population. A 40 ms check on every PR catches it before it hits the default branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding.&lt;/strong&gt; New engineer opens your repo. CI report in the PR summary shows them the 12-rule baseline exists, which rules your file covers, and which it doesn't. The explanation is in the action output, not in a wiki page they won't find.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Install time
&lt;/h2&gt;

&lt;p&gt;Workflow file: 3 lines.&lt;br&gt;&lt;br&gt;
CI overhead per run: 20-30 seconds on &lt;code&gt;ubuntu-latest&lt;/code&gt; (no Docker image pull, just &lt;code&gt;checkout&lt;/code&gt; + Python stdlib).&lt;br&gt;&lt;br&gt;
Token cost: zero.&lt;br&gt;&lt;br&gt;
Cost to break your build: zero if no secrets leaked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repos
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cc-audit&lt;/strong&gt; (linter + action) — &lt;a href="https://github.com/sisyphusse1-ops/cc-audit" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/cc-audit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-code-pro-pack&lt;/strong&gt; (baseline rules + installer action) — &lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/claude-code-pro-pack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ccpp-demo&lt;/strong&gt; (both in action, end-to-end) — &lt;a href="https://github.com/sisyphusse1-ops/ccpp-demo" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/ccpp-demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three MIT.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this saves you a merge review, or catches a leaked key, let me know. That's the use case I optimized for.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>github</category>
      <category>ai</category>
      <category>devops</category>
      <category>actions</category>
    </item>
    <item>
      <title>I scored 492 public CLAUDE.md files against a 12-rule baseline. Median: 3/12.</title>
      <dc:creator>sisyphusse1-ops</dc:creator>
      <pubDate>Sun, 10 May 2026 23:06:40 +0000</pubDate>
      <link>https://dev.to/sisyphusse1ops/i-scored-92-public-claudemd-files-against-a-12-rule-baseline-median-score-512-2971</link>
      <guid>https://dev.to/sisyphusse1ops/i-scored-92-public-claudemd-files-against-a-12-rule-baseline-median-score-512-2971</guid>
      <description>&lt;p&gt;Last week I wrote a tiny Python linter — &lt;a href="https://github.com/sisyphusse1-ops/cc-audit" rel="noopener noreferrer"&gt;cc-audit&lt;/a&gt; — that scores a &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; file against twelve behavior rules for AI coding agents. I ran it against 492 real public CLAUDE.md files pulled from GitHub code search.&lt;/p&gt;

&lt;p&gt;Here's what the ecosystem actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pulled the first 500 public &lt;code&gt;CLAUDE.md&lt;/code&gt; filename matches from GitHub code search&lt;/li&gt;
&lt;li&gt;492 were fetchable at scan time (8 had been moved, renamed, or gated behind forks)&lt;/li&gt;
&lt;li&gt;Each file was scored on 12 behavior rules via keyword-signal matching (does the file address each rule?)&lt;/li&gt;
&lt;li&gt;Separately scanned for leaked secrets (API keys, database URLs, private keys) with placeholder-aware filtering&lt;/li&gt;
&lt;/ul&gt;
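Keyword-signal matching is as blunt as it sounds. A toy version with invented signal lists (the real cc-audit lists differ):

```python
# Illustrative signal lists -- the real cc-audit lists are longer.
RULE_SIGNALS = {
    "read_adjacent_code": ("adjacent", "existing code", "nearby"),
    "run_tests": ("run tests", "test suite", "pytest"),
    "scoped_edits": ("out of scope", "scope", "unrelated files"),
}

def score(text):
    """Count rules the file addresses: any signal keyword counts as coverage."""
    lowered = text.lower()
    covered = [rule for rule, signals in RULE_SIGNALS.items()
               if any(sig in lowered for sig in signals)]
    return len(covered), covered
```

A hit means the file mentions the concern at all, which is why the scores below measure coverage, not quality.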

&lt;p&gt;The 12 rules come from the &lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;claude-code-pro-pack&lt;/a&gt; baseline (Karpathy's original 4 + 8 more covering agent-orchestration failure modes):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read adjacent / existing code before writing new code&lt;/li&gt;
&lt;li&gt;Don't invent APIs, imports, or file paths&lt;/li&gt;
&lt;li&gt;Surface partial success — never silent-fail&lt;/li&gt;
&lt;li&gt;Cap per-task token budget; stop and ask when hit&lt;/li&gt;
&lt;li&gt;Match the project's existing style and conventions&lt;/li&gt;
&lt;li&gt;One task per run; don't bundle unrelated changes&lt;/li&gt;
&lt;li&gt;Surface conflicting patterns instead of averaging them&lt;/li&gt;
&lt;li&gt;Run tests before declaring done&lt;/li&gt;
&lt;li&gt;Don't edit out of scope without saying so&lt;/li&gt;
&lt;li&gt;Summarize every tool call's effect in one line&lt;/li&gt;
&lt;li&gt;Stop and ask if stuck or ambiguous&lt;/li&gt;
&lt;li&gt;Visible fail states — never hide errors&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Files scanned:&lt;/strong&gt; 492&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; min 11 B, median 3.9 KB, mean 7.5 KB, max 167 KB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; median 3/12, mean 3.54/12, max 10/12&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect (12/12) scores:&lt;/strong&gt; 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-score files:&lt;/strong&gt; 41 (8%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top quartile (≥9/12):&lt;/strong&gt; 11 files (2.2%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Files with leaked production secrets:&lt;/strong&gt; 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The one-sentence version: the median CLAUDE.md covers a quarter of the behavior rules that matter. The top 2% cover three-quarters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Most-missed rules (out of 492)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Files missing&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Don't edit out of scope&lt;/td&gt;
&lt;td&gt;482&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Summarize tool calls&lt;/td&gt;
&lt;td&gt;464&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Visible fail states&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Read adjacent code&lt;/td&gt;
&lt;td&gt;446&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Surface partial success&lt;/td&gt;
&lt;td&gt;414&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Don't invent APIs&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;One task per run&lt;/td&gt;
&lt;td&gt;361&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Token budget / stop-and-ask&lt;/td&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Stop and ask if stuck&lt;/td&gt;
&lt;td&gt;272&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Surface pattern conflicts&lt;/td&gt;
&lt;td&gt;252&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Match project style&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Run tests&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Most-hit rules
&lt;/h2&gt;

&lt;p&gt;The one rule nearly everyone covers is &lt;strong&gt;run tests&lt;/strong&gt; — only 13% missed it. That tracks. Every CLAUDE.md template floating around for the last year includes some version of "run the tests."&lt;/p&gt;

&lt;p&gt;The second-most-covered is &lt;strong&gt;match project style&lt;/strong&gt; (55% coverage), mostly because it's also the rule people quote from Karpathy's original.&lt;/p&gt;

&lt;p&gt;Everything else sits in the "some files remember, most don't" zone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the top misses cost you real time
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule 9 (don't edit out of scope) — missed by 98% of files.&lt;/strong&gt; Without this, an agent "helpfully" reformats your whole file while fixing a one-line bug. Resulting PR: 500 lines of noise wrapping 3 lines of fix. Reviewers drown; real changes get lost. Costs a single sentence to add.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 10 (summarize tool calls) — missed by 94%.&lt;/strong&gt; Without this, you get verbose explanations of "what I'm about to do" and very little "what I actually did." In a long session you lose the thread. One sentence: &lt;em&gt;"After every tool call, write one line: what you changed and which file."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 12 (visible fail states) — missed by 91%.&lt;/strong&gt; This is the "migration completed successfully" problem in a different skin — the agent hides a failure in a paragraph of success prose, or just doesn't surface the stack trace. Fix: &lt;em&gt;"When anything fails, quote the error verbatim and stop. Never paraphrase."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1 (read adjacent code first) — missed by 91%.&lt;/strong&gt; Top cause of duplicate functions and inconsistent patches. An agent that doesn't read adjacent code will happily implement a utility that already exists three lines away, or patch one half of a codebase in a style that conflicts with the other half.&lt;/p&gt;

&lt;p&gt;Rules 9, 10, 12, and 1 are each one sentence. Adding all four moves a median file from 3/12 to 7/12.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the zero-score files looked like
&lt;/h2&gt;

&lt;p&gt;41 files scored 0/12. They split into two shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A single paragraph.&lt;/strong&gt; Often something like "This project uses Python. Be careful." — and that's the entire file. A project description wearing a CLAUDE.md name tag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A README dump.&lt;/strong&gt; The entire &lt;code&gt;README.md&lt;/code&gt; copy-pasted in verbatim with no behavior rules at all. Good project context, zero agent guidance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither shape is worthless for onboarding. But neither does anything to reduce agent failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the top quartile did differently
&lt;/h2&gt;

&lt;p&gt;The 11 files scoring ≥9/12 shared four patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit tool-calling preferences (&lt;em&gt;"use &lt;code&gt;rg&lt;/code&gt; not &lt;code&gt;grep&lt;/code&gt;"&lt;/em&gt;, &lt;em&gt;"use &lt;code&gt;fd&lt;/code&gt; not &lt;code&gt;find&lt;/code&gt;"&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Named failure modes to avoid (&lt;em&gt;"don't claim migration success if rows were skipped"&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;A scoped-edits rule (&lt;em&gt;"don't touch files outside the current task without asking first"&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;A style-matching rule (&lt;em&gt;"check 3 nearby files before choosing formatting"&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those four additions alone explain most of the gap between median and top quartile.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about leaked secrets?
&lt;/h2&gt;

&lt;p&gt;I was genuinely curious whether people paste real API keys into CLAUDE.md files. They mostly don't.&lt;/p&gt;

&lt;p&gt;Of 492 files scanned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0 real leaked secrets&lt;/strong&gt; matching strict patterns (OpenAI keys, Anthropic keys, Google API keys, AWS access keys, GitHub tokens, Stripe live keys)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 postgres connection strings&lt;/strong&gt; that looked like secrets at first match — all of them turned out to be localhost + dummy users (&lt;code&gt;user:password@localhost&lt;/code&gt;), i.e. example config that would only "work" against someone's local dev box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 literal placeholder&lt;/strong&gt; (&lt;code&gt;postgresql://USER:***@HOST/DATABASE&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The placeholder filter in the scanner caught most &lt;code&gt;sk-example&lt;/code&gt;, &lt;code&gt;&amp;lt;YOUR_KEY&amp;gt;&lt;/code&gt;, and &lt;code&gt;***&lt;/code&gt;-style examples. Whatever paranoia you had about CLAUDE.md being a secret-leak vector: this data says it isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do
&lt;/h2&gt;

&lt;p&gt;If you maintain a CLAUDE.md or AGENTS.md, these are the highest-leverage edits you can make in ninety seconds:&lt;/p&gt;

&lt;p&gt;Add these four sentences anywhere in the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- When fixing a bug, don't edit files outside the immediate scope unless you say so first.
- After every tool call, write one line: what you changed and which file.
- If anything fails, quote the error verbatim and stop. Never paraphrase failures.
- Before writing new code, read the adjacent 20–40 lines of existing code in the same file.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the ninety-second edit isn't enough context, the full 12-rule baseline as a drop-in:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/claude-code-pro-pack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to score your existing one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/sisyphusse1-ops/cc-audit" rel="noopener noreferrer"&gt;github.com/sisyphusse1-ops/cc-audit&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/sisyphusse1-ops/cc-audit/main/cc_audit.py &lt;span class="nt"&gt;-o&lt;/span&gt; cc_audit.py
python3 cc_audit.py CLAUDE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One file, stdlib only, 40 ms on a 10 KB file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The rule check is a keyword-signal pass. It checks whether the file mentions each concern, not whether the wording is good. A file that mentions "tests" and "scope" gets credit for those rules even if the phrasing would embarrass you.&lt;/li&gt;
&lt;li&gt;The 3/12 median is a floor for coverage, not a ceiling on quality.&lt;/li&gt;
&lt;li&gt;A thoughtful 6/12 file easily beats a formulaic 10/12 one.&lt;/li&gt;
&lt;li&gt;I deliberately did not score for: accurate project facts, prose quality, tone, or structure — only behavior-rule coverage.&lt;/li&gt;
&lt;li&gt;GitHub code search returns fewer than the full 23,484 indexed CLAUDE.md files; a different 492 would shift the numbers a little but not the shape.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Raw data
&lt;/h2&gt;

&lt;p&gt;The full per-file results are in &lt;a href="https://github.com/sisyphusse1-ops/cc-audit/blob/main/data/scan-500.json" rel="noopener noreferrer"&gt;the scan-500 JSON&lt;/a&gt; on the cc-audit repo. Each entry has repo name, file size, and compliance score.&lt;/p&gt;
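If you want to recompute the headline stats from that JSON yourself, something like this works. The per-entry field name is a guess; check the file for the actual key:

```python
import json
import statistics

def summarize(scan_path):
    """Recompute headline stats from the per-file scan results."""
    entries = json.load(open(scan_path))
    scores = [e["score"] for e in entries]  # field name assumed
    return {
        "files": len(scores),
        "median": statistics.median(scores),
        "mean": round(statistics.mean(scores), 2),
        "zero": sum(1 for s in scores if s == 0),
    }
```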




&lt;p&gt;&lt;em&gt;If this landed, send it to the one person you know who writes behavior files for AI coding agents. There's a decent chance their current file scores 3/12 and four extra sentences would push it to 7/12.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>I built a coding agent that runs on Gemma 4 — here's what 2B parameters can actually do</title>
      <dc:creator>sisyphusse1-ops</dc:creator>
      <pubDate>Sun, 10 May 2026 22:56:37 +0000</pubDate>
      <link>https://dev.to/sisyphusse1ops/i-built-a-coding-agent-that-runs-on-gemma-4-heres-what-2b-parameters-can-actually-do-a80</link>
      <guid>https://dev.to/sisyphusse1ops/i-built-a-coding-agent-that-runs-on-gemma-4-heres-what-2b-parameters-can-actually-do-a80</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/sisyphusse1-ops/gemma-coder" rel="noopener noreferrer"&gt;&lt;strong&gt;gemma-coder&lt;/strong&gt;&lt;/a&gt; — a single-file Python CLI that turns Gemma 4 into an agentic coding assistant. It reads your &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; rulebook, uses a model-agnostic XML tool protocol, and ships the 12-rule &lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;claude-code-pro-pack&lt;/a&gt; baseline as the default behavior file.&lt;/p&gt;

&lt;p&gt;The interesting part isn't the loop — it's that the whole thing works against &lt;strong&gt;Gemma 4 E2B (2 billion effective parameters)&lt;/strong&gt; running locally. The same file runs against 31B in the cloud for power users, E4B on a phone, E2B on a Raspberry Pi 5. Same protocol, same rulebook, different scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemma 4 E2B specifically
&lt;/h2&gt;

&lt;p&gt;The obvious submission path is to reach for 31B and flex. I went the other way. Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. E2B is the one that demonstrates the Gemma 4 story.&lt;/strong&gt; Running &lt;em&gt;server-grade&lt;/em&gt; models in the cloud is boring — OpenAI and Anthropic do that better. Running a 2B-effective model on hardware that sits in your living room is the unique capability unlock Google shipped this month. If a submission doesn't exercise that, it's a different model's submission wearing Gemma's name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. It forces better engineering.&lt;/strong&gt; A 31B model tolerates sloppy prompts. E2B doesn't. Every line of the system prompt has to earn its place. That's a better stress test for the agent architecture, and the fixes you make for E2B make the 31B path faster and cheaper too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Privacy and offline.&lt;/strong&gt; Coding agents handle codebases with credentials, client IP, unreleased features. An agent that runs fully local is the only agent my legal team hasn't twitched at. E2B makes "fully local" practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Smallest clean run I captured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gemma-coder &lt;span class="s2"&gt;"Add a one-line docstring to every function in src.py"&lt;/span&gt;

━━━ step 1/12 ━━━
Read src.py → identify fns.
→ tool: read_file&lt;span class="o"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"src.py"&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
← &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="s2"&gt;"def add(a, b):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    return a + b&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;def sub(a, b):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    return a - b&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;, ...&lt;span class="o"&gt;}&lt;/span&gt;

━━━ step 2/12 ━━━
Add docstrings → rewrite src.py.
→ tool: write_file&lt;span class="o"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"src.py"&lt;/span&gt;, &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="s2"&gt;"def add(...):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"\"\"&lt;/span&gt;&lt;span class="s2"&gt;Add two numbers.&lt;/span&gt;&lt;span class="se"&gt;\"\"\"&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
← &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"bytes"&lt;/span&gt;: 124&lt;span class="o"&gt;}&lt;/span&gt;

━━━ step 3/12 ━━━
Done.
→ tool: &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;"summary"&lt;/span&gt;: &lt;span class="s2"&gt;"Added one-line docstrings to add() and sub() in src.py"&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three steps. No re-reads, no wasted calls. That's what "narrow tool scope + rulebook baseline" buys you on a 2B model.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tool protocol
&lt;/h3&gt;

&lt;p&gt;Gemma 4 doesn't have native OpenAI-style function calling. Instead of fighting that, I treated it as a feature: the CLI uses a simple XML-framed JSON contract that every capable LLM can follow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;tool&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"read_file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/main.py"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;/tool&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results come back as &lt;code&gt;&amp;lt;tool_result&amp;gt;...&amp;lt;/tool_result&amp;gt;&lt;/code&gt; in the next user turn. Six tools total: &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;done&lt;/code&gt;. That's it.&lt;/p&gt;

&lt;p&gt;Benefit: the same loop runs against &lt;strong&gt;any&lt;/strong&gt; LLM that can obey the format. I tested the same file against Gemma 4 31B, Qwen 2.5 Coder 32B, and Llama 3.3. All three worked. That portability is a byproduct of respecting Gemma 4's actual capabilities instead of bolting on an abstraction.&lt;/p&gt;
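Extracting the framed JSON takes one regex and &#96;json.loads&#96;. A sketch, not gemma-coder's actual parser (the angle bracket is spelled &#96;chr(60)&#96; purely as a string-building choice):

```python
import json
import re

LT = chr(60)  # literal left angle bracket
TOOL_RE = re.compile(LT + r"tool>\s*(\{.*?\})\s*" + LT + r"/tool>", re.S)

def parse_tool_call(reply):
    """Pull the first framed JSON tool call out of a model reply, or None."""
    match = TOOL_RE.search(reply)
    if match is None:
        return None  # the model replied with prose, not a tool call
    call = json.loads(match.group(1))
    return call["name"], call.get("args", {})
```

Because the frame is plain text, any model that can copy a format can drive the loop; that is where the cross-model portability comes from.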

&lt;h3&gt;
  
  
  Rulebook-first system prompt
&lt;/h3&gt;

&lt;p&gt;The system prompt is short by design: tool schema + the project's &lt;code&gt;CLAUDE.md&lt;/code&gt; (or &lt;code&gt;AGENTS.md&lt;/code&gt;) dropped in verbatim. No framework prose, no chain-of-thought incantations, no "you are a helpful assistant."&lt;/p&gt;
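Assembling that prompt is a few lines. A sketch under stated assumptions: the schema wording is invented here, and the CLAUDE.md-before-AGENTS.md discovery order is a guess:

```python
from pathlib import Path

# Invented schema text for illustration -- the real wording differs.
TOOL_SCHEMA = (
    "You have six tools: read_file, write_file, search, run, patch, done. "
    "Reply with exactly one framed JSON tool call per turn."
)

def build_system_prompt(repo_root):
    """Tool schema plus the project rulebook verbatim; no framework prose."""
    for name in ("CLAUDE.md", "AGENTS.md"):  # discovery order assumed
        rulebook = Path(repo_root) / name
        if rulebook.exists():
            return TOOL_SCHEMA + "\n\n" + rulebook.read_text()
    return TOOL_SCHEMA  # no rulebook found: schema alone
```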

&lt;p&gt;The 12-rule pack that ships as the default rulebook closes the four most common Gemma 4 failure modes I saw in testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token spirals&lt;/strong&gt; — rule 4 caps per-task token budget so the model doesn't loop on the same 4KB of context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent partial failures&lt;/strong&gt; — rule 12 requires visible fail states; no more "migration completed" when it skipped rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pattern pollution&lt;/strong&gt; — rule 7 forces the agent to surface conflicts between codebase patterns instead of averaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjacent-code blindness&lt;/strong&gt; — rule 8 mandates reading surrounding code before writing; fixes duplicate-function drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't abstract. Each rule earned its place from a specific failure in actual runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry-with-backoff
&lt;/h3&gt;

&lt;p&gt;Cloud gateways can return transient 5xx mid-session. &lt;code&gt;call_openrouter&lt;/code&gt; wraps the HTTP call with 3-attempt exponential backoff (3s / 9s / 27s). Not glamorous, but it's the difference between a flaky demo and a shippable tool.&lt;/p&gt;
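&lt;p&gt;The pattern is small enough to show whole. A sketch of the backoff wrapper, assuming urllib-style errors; the real &lt;code&gt;call_openrouter&lt;/code&gt; may differ in details:&lt;/p&gt;

```python
import time
import urllib.error

def with_retries(fn, attempts=3, base_delay=3):
    """Retry fn with exponential backoff on transient 5xx errors.

    Sketch of the pattern described above, not gemma-coder's exact code.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except urllib.error.HTTPError as err:
            retryable = err.code >= 500          # only server-side errors
            final = attempt == attempts - 1      # last attempt: give up
            if not retryable or final:
                raise
            time.sleep(base_delay * 3 ** attempt)  # 3s, 9s, 27s
```

&lt;p&gt;Only 5xx responses retry; a 4xx (bad key, malformed request) propagates immediately, because waiting 27 seconds won't fix your API key.&lt;/p&gt;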

&lt;h2&gt;
  
  
  What Gemma 4 E2B actually can and can't do
&lt;/h2&gt;

&lt;p&gt;What it handles cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename a function across 2-3 files&lt;/li&gt;
&lt;li&gt;Add docstrings and type hints&lt;/li&gt;
&lt;li&gt;Fix a failing unit test when the fix is local&lt;/li&gt;
&lt;li&gt;Draft a README section from existing code&lt;/li&gt;
&lt;li&gt;Apply a lint-style pattern fix consistently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes it struggle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-file refactors with cross-file dependency tracking (context pressure kills it around 50k tokens)&lt;/li&gt;
&lt;li&gt;Novel architecture decisions (it's 2B params, not 100B — manage expectations)&lt;/li&gt;
&lt;li&gt;Long-running debugging where each step depends on the last&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the "boring 80%" of coding agent work, E2B is remarkable. For the exciting 20%, use a bigger model. Now there's a CLI that lets you pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenRouter free tier, no local setup&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/sisyphusse1-ops/gemma-coder/main/gemma_coder.py &lt;span class="nt"&gt;-o&lt;/span&gt; gemma_coder.py
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
python3 gemma_coder.py &lt;span class="s2"&gt;"your task here"&lt;/span&gt;

&lt;span class="c"&gt;# or local Ollama&lt;/span&gt;
ollama pull gemma4:e2b
python3 gemma_coder.py &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:e2b &lt;span class="s2"&gt;"your task here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One file. Python stdlib only. No framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;strong&gt;github.com/sisyphusse1-ops/gemma-coder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Companion projects referenced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sisyphusse1-ops/claude-code-pro-pack" rel="noopener noreferrer"&gt;claude-code-pro-pack&lt;/a&gt; — the 12-rule baseline it loads&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sisyphusse1-ops/cc-audit" rel="noopener noreferrer"&gt;cc-audit&lt;/a&gt; — lints any CLAUDE.md against those rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three are MIT.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I chose &lt;strong&gt;Gemma 4 E2B&lt;/strong&gt; because the submission is fundamentally about answering: &lt;em&gt;can a 2B-effective-parameter model actually drive a useful coding agent?&lt;/em&gt; Using 31B would have sidestepped the question. The value of the project is precisely that it exercises the smallest Gemma 4 variant and finds the envelope where it succeeds.&lt;/p&gt;

&lt;p&gt;What E2B unlocked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs on a Raspberry Pi 5.&lt;/strong&gt; 5W of power, $75 of hardware, no cloud dependency, no API keys, no rate limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy by default.&lt;/strong&gt; Credentials, client code, unreleased features stay on the machine. "Fully local" stops being a wish-list item.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forces rulebook discipline.&lt;/strong&gt; The constraint of a small model made every part of the system prompt earn its place. Result: a cleaner tool protocol and a rulebook that transfers directly to larger models too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model selection was not "which is biggest." It was "which Gemma 4 variant makes the strongest argument for the unique capability the family ships."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading. If you try it, open an issue with your model + task + result — I'm collecting real-world envelope data for a follow-up post on where each Gemma 4 variant tops out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>I read 31 pages of Anthropic prompting guidance so you don't have to — here's what actually changes with Claude 4.7</title>
      <dc:creator>sisyphusse1-ops</dc:creator>
      <pubDate>Sun, 10 May 2026 22:56:10 +0000</pubDate>
      <link>https://dev.to/sisyphusse1ops/i-read-31-pages-of-anthropic-prompting-guidance-so-you-dont-have-to-heres-what-actually-changes-1kd9</link>
      <guid>https://dev.to/sisyphusse1ops/i-read-31-pages-of-anthropic-prompting-guidance-so-you-dont-have-to-heres-what-actually-changes-1kd9</guid>
      <description>&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.7 follows prompts &lt;strong&gt;literally&lt;/strong&gt;. Generic 4.6-era prompts like "review this contract" or "summarize this report" underperform now, not because the model got worse but because 4.7 stopped guessing at unstated structure.&lt;/p&gt;

&lt;p&gt;Six shifts you need to internalize, plus a rewrite checklist you can apply to any existing prompt in under a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  The six shifts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Name every output. Name every boundary.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4.6-era:&lt;/strong&gt; &lt;code&gt;Review this contract.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4.7-ready:&lt;/strong&gt; &lt;code&gt;Review this contract. Flag risks per clause. Rate severity 1-5. Suggest one rewrite per risky clause. Return as a table with columns: Clause | Risk | Severity | Rewrite.&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4.7 does exactly what the sentence says. If you don't name the columns, you get whatever columns it picks. If you don't cap severity levels, you get adjective soup.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Length scales with input now. Cap it explicitly.
&lt;/h3&gt;

&lt;p&gt;Long input plus the word &lt;code&gt;summarize&lt;/code&gt; used to give you a roughly fixed-length summary. Now it gives you a long summary. Because the input was long.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Old:&lt;/strong&gt; &lt;code&gt;Summarize this report.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New:&lt;/strong&gt; &lt;code&gt;Summarize this report in exactly 5 bullets. Each bullet under 15 words. First word of each bullet is an action verb.&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Negative instructions don't stick. Say what TO do.
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Don't use jargon&lt;/code&gt; is still in the context. 4.7 just doesn't reliably change behavior from it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Old:&lt;/strong&gt; &lt;code&gt;Don't use jargon. Don't sound like a marketer.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New:&lt;/strong&gt; &lt;code&gt;Write in plain English a 16-year-old could read aloud. Use short concrete words. Replace "leverage" with "use". Replace "scalable" with "works at any size".&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rule of thumb: every negative instruction rewrites as a positive one plus a concrete swap example.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Action verbs ship specific artifacts.
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Can you help me with the email?&lt;/code&gt; produces a helpful-but-vague paragraph. Action verbs produce a draft.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Old:&lt;/strong&gt; &lt;code&gt;Can you help me with the email?&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Open Gmail. Find &amp;lt;contact&amp;gt; and read our last thread.
  Draft the reply email. Final draft. Send-ready.
  Goal: book a 30-min meeting by Friday.
  Length: under 90 words.
  Tone: confident, casual, specific.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each verb at the top (&lt;code&gt;Open&lt;/code&gt;, &lt;code&gt;Find&lt;/code&gt;, &lt;code&gt;Draft&lt;/code&gt;) commits 4.7 to producing a shippable artifact, not discussing one.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Fewer tool calls, more reasoning between. Ask for aggressive search if you need it.
&lt;/h3&gt;

&lt;p&gt;4.7 calls tools less aggressively than 4.6 did. It reasons more between calls. Usually this is a quality lift. Sometimes you explicitly want broad search.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Use web search aggressively. Verify every claim with at least 2 sources before answering.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Colder default tone. Name the warmth if you want it back.
&lt;/h3&gt;

&lt;p&gt;4.7 dropped the "great question!" energy and most emojis. If your product brand needs warmer voice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Use a warm, conversational tone. Acknowledge my framing before answering.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even better — paste 2-3 reference sentences in the voice you want. 4.7 matches rhythm well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one phrase that keeps delivering
&lt;/h2&gt;

&lt;p&gt;Anthropic's own 4.7 doc includes this line, and it has become the single highest-leverage addition you can staple onto any creative or open-ended prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Go beyond the basics. Polish like it's a real client deliverable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paired with a section-by-section brief, it consistently pushes 4.7 past the literal-minimum output. I've tested this on landing pages, PR documents, legal memos, and code refactors. Same pattern, same lift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full landing page example:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a landing page for my AI consultancy.

Sections (in order):
- Hero (headline + subheadline + CTA)
- Logo bar (6 client placeholders)
- 3 case-study cards (problem / what I did / result)
- Service blocks (4)
- Testimonial carousel (3 quotes)
- About (180-word bio + headshot placeholder)
- Newsletter signup
- Footer

Style: editorial, serif headlines, sans-serif body, generous whitespace.
Animations: subtle on scroll. No purple gradients.

Go beyond the basics. Polish like it's a real client deliverable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The rewrite checklist
&lt;/h2&gt;

&lt;p&gt;Run any prompt through this before sending it to 4.7:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Every output is &lt;strong&gt;named&lt;/strong&gt; (format, columns, order, length)&lt;/li&gt;
&lt;li&gt;[ ] Every length is &lt;strong&gt;capped&lt;/strong&gt; (words, bullets, rows)&lt;/li&gt;
&lt;li&gt;[ ] Zero negative instructions — every "don't / no / avoid" rewritten as "do X with example"&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Action verbs first&lt;/strong&gt; (Open, Draft, Build, Flag, Summarize — not "can you help…")&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Tool-use preference stated&lt;/strong&gt; if it matters (&lt;code&gt;Use web search aggressively&lt;/code&gt; or &lt;code&gt;Answer from training, no tools&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Tone named&lt;/strong&gt;, with 2-3 reference sentences if you want warmth back&lt;/li&gt;
&lt;li&gt;[ ] For creative work, the quality lift phrase is appended&lt;/li&gt;
&lt;/ul&gt;
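&lt;p&gt;If you build prompts programmatically, the checklist maps to named parts. A hypothetical helper for illustration; the field names mirror the checklist above, not any Anthropic API:&lt;/p&gt;

```python
def build_prompt(task, outputs, caps, tone, tool_pref="", polish=False):
    """Assemble a 4.7-ready prompt from the checklist's named parts.

    Hypothetical sketch: every output named, every length capped,
    tone stated, quality-lift phrase appended for creative work.
    """
    parts = [task, "Output: " + outputs, "Limits: " + caps, "Tone: " + tone]
    if tool_pref:
        parts.append(tool_pref)
    if polish:
        parts.append("Go beyond the basics. Polish like it's a real client deliverable.")
    return "\n".join(parts)
```

&lt;p&gt;Forcing yourself through named fields is the point: a prompt that can't fill them in is a prompt 4.7 will interpret literally and minimally.&lt;/p&gt;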

&lt;h2&gt;
  
  
  One place the pattern helps the most: agent behavior files
&lt;/h2&gt;

&lt;p&gt;If you use Claude Code, Codex, or Cursor with a &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root, the same rules apply to those files. Negative instructions ("don't use jargon", "don't hallucinate") age poorly. Rewriting them as positive imperatives with concrete examples measurably improves compliance.&lt;/p&gt;

&lt;p&gt;I bundled a 12-rule &lt;code&gt;CLAUDE.md&lt;/code&gt; template (Karpathy's original 4 + 8 more covering agent-orchestration failure modes) plus a few working skills. It's a drop-in. Free, MIT:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ github.com/sisyphusse1-ops/claude-code-pro-pack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And a tiny Python linter that scores any existing &lt;code&gt;CLAUDE.md&lt;/code&gt; against the 12 rules and flags leaked secrets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ github.com/sisyphusse1-ops/cc-audit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both are single-commit additions to your repo. No install, no framework.&lt;/p&gt;
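&lt;p&gt;The secret-flagging half is worth stealing even if you skip the rules. A sketch of the idea using a few well-known key prefixes; these patterns are illustrative, not cc-audit's actual rule set:&lt;/p&gt;

```python
import re

# Illustrative patterns only: a couple of recognizable key-prefix shapes.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),  # Anthropic-style keys
    re.compile(r"sk-or-[A-Za-z0-9_-]{20,}"),   # OpenRouter-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key IDs
]

def flag_secrets(text):
    """Return (line_number, match) pairs for likely leaked keys."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for pat in SECRET_PATTERNS:
            m = pat.search(line)
            if m:
                hits.append((lineno, m.group(0)))
    return hits
```

&lt;p&gt;Anything this flags in a behavior file is a hard fail by definition: there is no legitimate reason for a live credential to sit in &lt;code&gt;CLAUDE.md&lt;/code&gt;.&lt;/p&gt;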

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic's 31-page Claude 4.7 prompting guide (PDF, official)&lt;/li&gt;
&lt;li&gt;Ruben Hassid's digest at ruben.substack.com/p/prompt-47&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If this landed, share with one person who's still prompting 4.7 like it's 4.6. That's the thing that actually helps the work.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
