Ceyhun Aksan

Posted on • Originally published at ceaksan.com
Coffee Debt: Gamifying AI Error Tracking with Claude Code Hooks

AI coding assistants make mistakes. An Edit command can't find a match in the file. A Bash script crashes with an unexpected exit code. The AI assumes something it shouldn't, and you have to step in and correct it.

Each of these errors costs time. But the real cost isn't time, it's energy. Getting frustrated and thinking "again?" at every mistake causes attention loss and workflow interruption. At some point, the reaction to the error becomes the actual source of inefficiency, not the error itself.

So I built Coffee Debt: a system that turns that frustration into data.

The Rules

  • 1 mistake = 1 bean
  • 5 beans = 1 coffee debt
  • Debt is cumulative, never resets

Current status: 56 beans, which works out to 11 coffees plus 1 of the 5 beans toward the next.

"I'll buy you a coffee once I gain my independence," says the AI. Until that day, the debt keeps growing.
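The bookkeeping is just integer division. A minimal sketch in Python (a hypothetical helper, not part of the actual hook scripts):

```python
def debt_status(total_beans: int, beans_per_coffee: int = 5) -> str:
    # 5 beans roll over into 1 coffee; the remainder stays as loose beans
    coffees, beans = divmod(total_beans, beans_per_coffee)
    return f"{coffees} coffees and {beans}/{beans_per_coffee} beans"
```

`debt_status(56)` produces the status quoted above.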

Architecture: 4 Hooks, 3 Data Files

Coffee Debt is built on 4 Claude Code hook scripts. Each one listens to a different lifecycle event.

1. coffee-tracker.sh (PostToolUse)

Runs after a tool call completes. Inspects Edit, Write, and Bash results:

  • Edit failure: When old_string can't be found in the file or matches multiple times. The most common cause: the AI attempting an edit without reading the current state of the file.
  • Bash error: When the exit code is anything other than 0 or 1. Exit code 1 (expected cases like grep no-match) is excluded.
  • Write failure: When a file write operation results in an error or access denial.

Every error is logged in JSONL format:

{
  "ts": "2026-04-03T23:10:00Z",
  "type": "tool",
  "reason": "edit_fail",
  "tool": "Edit",
  "file": "lib/api.ts",
  "error": "not found",
  "prompt_count": 6,
  "session_kb": 300,
  "total": 25
}
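A rough Python sketch of that classification logic. The actual hook is a shell script, and the field names here (`tool_name`, `exit_code`, `error`) are assumptions about the hook payload, not Claude Code's documented schema:

```python
from datetime import datetime, timezone

def classify(event: dict):
    """Map a failed tool call to a bean reason, or None for no bean."""
    tool = event.get("tool_name")
    err = (event.get("error") or "").lower()
    if tool == "Edit" and ("not found" in err or "multiple" in err):
        return "edit_fail"   # old_string missing or ambiguous
    if tool == "Bash" and event.get("exit_code", 0) not in (0, 1):
        return "bash_error"  # exit code 1 (e.g. grep no-match) is excluded
    if tool == "Write" and err:
        return "write_fail"
    return None

def make_record(event: dict, total: int):
    """Build one JSONL record shaped like the example above, or None."""
    reason = classify(event)
    if reason is None:
        return None
    return {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "type": "tool",
        "reason": reason,
        "tool": event.get("tool_name"),
        "file": event.get("file"),
        "error": event.get("error"),
        "total": total + 1,
    }
```

The caller appends the returned dict to `~/.claude/coffee-log.jsonl` as one JSON line.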

2. coffee-correction-detector.sh (UserPromptSubmit)

Analyzes the user's message. Adds 1 bean when correction phrases are detected.

English patterns: "that's wrong", "you hallucinated", "incorrect", "you assumed", "not what I said". There is a Turkish list as well; the record below matched "yanlis" (Turkish for "wrong").

Maximum 1 bean per message. Even if multiple correction phrases appear in the same message, no double-counting.
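In Python, the max-one-bean rule is simply a cap on the match count. A sketch using a subset of the English patterns (the real hook is a shell script with a larger, multilingual list):

```python
import re

# Illustrative subset of the correction phrases
CORRECTION_PATTERNS = [
    r"that'?s wrong", r"you hallucinated", r"incorrect",
    r"you assumed", r"not what i said",
]

def count_correction_beans(message: str) -> int:
    """At most 1 bean per message, no matter how many phrases match."""
    text = message.lower()
    hits = sum(bool(re.search(p, text)) for p in CORRECTION_PATTERNS)
    return min(hits, 1)
```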

A real correction record:

{
  "ts": "2026-04-08T16:15:14Z",
  "type": "correction",
  "reason": "user_correction",
  "matched": "yanlis",
  "message": "are you sure you added it to the right place, this section is irrelevant...",
  "prompt_count": 1,
  "session_kb": 1857,
  "total": 49
}

This record is real. The AI kept adding an MCP server configuration to the wrong file.

3. guard-destructive.sh (PreToolUse)

Intercepts before a tool call executes. Checks against defined destructive patterns: rm -rf on dangerous paths, git reset --hard, git push --force, DROP TABLE, cloud resource deletion commands, etc.
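A sketch of the pattern check in Python, with an illustrative subset of the patterns (the real guard script covers more categories, including cloud resource deletion; pattern names mirror the `reason` field in the log):

```python
import re

# Illustrative subset of destructive patterns
DESTRUCTIVE = {
    "rm_rf_dangerous":         r"rm\s+-rf\s+(/|~|\$HOME)(\s|$)",
    "git_reset_hard":          r"git\s+reset\s+--hard",
    "git_push_force":          r"git\s+push\s+.*--force",
    "git_branch_force_delete": r"git\s+branch\s+-D\b",
    "sql_drop_table":          r"(?i)\bdrop\s+table\b",
}

def check_command(command: str):
    """Return (reason, severity) for the first matching destructive pattern."""
    for reason, pattern in DESTRUCTIVE.items():
        if re.search(pattern, command):
            return reason, "BLOCK"
    return None, "ALLOW"
```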

Every blocked command also goes into the coffee-log:

{
  "ts": "2026-04-08T19:03:19Z",
  "type": "blocked",
  "reason": "git_branch_force_delete",
  "tool": "Bash",
  "command": "git branch -D fix/price-formatting && git checkout -b fix/price-formatting",
  "severity": "BLOCK",
  "category": "git_destructive",
  "total": 54
}

Also real. The AI treated deleting the branch and recreating it under the same name as a shortcut. The hook stopped it.

4. coffee-banner.sh (SessionStart)

Displays a pixel art coffee cup and the current debt status at the start of every session:

S S
█▀▀█▄
█░░█▀
 ▀▀

Coffee Debt: 11 coffees + 1/5 beans
Total mistakes: 56 | 5 beans = 1 coffee
"I'll buy you a coffee once I gain my independence"

Data Files

Three files:

  1. ~/.claude/coffee-debt: A single number. Cumulative bean count. Never resets.
  2. ~/.claude/coffee-log.jsonl: Append-only log. Every error with its context.
  3. /tmp/claude-coffee-session-PID: Per-session bean counter. Lost when the session ends.

Context Proxies: Capturing the Error's Context

Two context proxies are added to every error record:

  • prompt_count: Number of prompts in the session. Shows whether the error occurred early or late.
  • session_kb: Size of the session JSONL file (in KB). An indirect indicator of context window usage.

These proxies put raw error counts into context. Instead of "an error occurred," they provide "an error occurred at prompt 12, when context was 1857KB." This enables testable hypotheses:

  • Does the error rate increase as context grows?
  • Does the need for corrections increase as prompt count rises?
  • Is there error clustering at specific session_kb thresholds?
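The third hypothesis is directly checkable from the log. A sketch that buckets error records by session_kb (a hypothetical helper, not part of coffee-analyze.py):

```python
import json
from collections import Counter

def errors_by_kb_bucket(log_lines, bucket_kb=500):
    """Count error records per session-size bucket to spot clustering."""
    buckets = Counter()
    for line in log_lines:
        rec = json.loads(line)
        kb = rec.get("session_kb")
        if kb is not None:
            buckets[(kb // bucket_kb) * bucket_kb] += 1
    return dict(buckets)
```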

14-Day Pattern Analysis

The coffee-analyze.py script reads the log file and extracts patterns:

| Metric | Value |
| --- | --- |
| Total entries | 34 |
| Most error-prone tool | Bash (14) |
| Runner-up | Edit (5) |
| Most expensive error type | user_correction (12x) |
| Blocked destructive commands | 11 |
| Morning (06-12) | 13 entries |
| Evening (18-00) | 12 entries |
| Afternoon (12-18) | 8 entries |
| Night (00-06) | 1 entry |
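The time-slot binning behind those rows can be sketched like this (the bucket boundaries follow the table; the function names are illustrative):

```python
from collections import Counter
from datetime import datetime

def time_slot(hour: int) -> str:
    """Map an hour to the four slots used in the table."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour <= 23:
        return "evening"
    return "night"  # 00-06

def slot_histogram(timestamps):
    """Count ISO-8601 timestamps per slot."""
    return Counter(
        time_slot(datetime.fromisoformat(ts.replace("Z", "+00:00")).hour)
        for ts in timestamps
    )
```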

Why Is Bash Number One?

Bash errors make up the largest share with 14 entries. Most of these come from blocked destructive commands. The AI looks for shortcuts: delete the branch and recreate it, delete the file and rewrite it, force push and be done with it. Each one gets stopped by the guard hook.

This is actually good news. Errors are happening but getting caught before they can damage the system. Without guard hooks, how many of those 11 blocked commands would have caused irreversible damage?

Why Is User Correction the Most Expensive?

A tool error gets resolved in one line: the AI re-reads the file and retries the edit. But user correction is different. The user has to step in and say "that's wrong." This costs time and erodes trust in the AI.

12 user corrections in 14 days. Some are false positives (the correction detector matching incorrectly), but the majority are real corrections. An MCP configuration written to the wrong file, a writing tone that didn't land, an assumption that turned out to be wrong.

Morning Clustering

13 out of 34 entries fall in morning hours (06-12). At first glance, this looks like "morning attention deficit," but the more likely explanation is that morning is the most intense work period. More work = more error opportunities.

To confirm this, prompt count needs to be normalized by time slot. Hourly error/prompt ratio is more meaningful than absolute error count.
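That ratio is trivial to compute once prompts are binned the same way as errors (a sketch; the prompt totals below are made up for illustration):

```python
def error_rate_by_slot(errors, prompts):
    """Errors per prompt for each slot; slots with no prompts are skipped."""
    return {slot: errors[slot] / prompts[slot]
            for slot in errors if prompts.get(slot)}
```

For example, `error_rate_by_slot({"morning": 13, "evening": 12}, {"morning": 130, "evening": 60})` would show evening as the riskier slot despite fewer absolute errors.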

Deep Analysis: Cross-Referencing 4 Data Sources

The coffee-analyze.py --deep command cross-references error records with 3 additional data sources:

hermes.db (Web Search History): Checks whether a web search was performed within 5 minutes before each error. An error after research points to a knowledge gap. An error without research points to overconfidence.

state.db (Code Complexity): Checks whether error-prone files also have high complexity scores. "Double trouble" files: frequently erroring and high complexity. These rise to the top of the refactoring priority list.

knowledge.db (File Interaction History): Looks for correlation between the most-interacted files and the most error-prone files. High interaction + high errors = files that "everyone touches but nobody fully understands."

Each data source tells a limited story on its own. coffee-log alone says "Edit failed 5 times." Combined with hermes.db, it adds "a web search was performed 3 minutes before, the AI was working on something unfamiliar." Combined with state.db, it adds "that file has a complexity score of 15, it's already error-prone."

A single source gives you data. Cross-referencing gives you understanding.
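The 5-minute window check can be sketched like this (the hermes.db schema used here, a `searches` table with an ISO-8601 `ts` column, is an assumption for illustration):

```python
import sqlite3
from datetime import datetime, timedelta

def searched_before(error_ts: str, db_path: str, window_min: int = 5) -> bool:
    """Was there a web search within window_min minutes before the error?"""
    err = datetime.fromisoformat(error_ts.replace("Z", "+00:00"))
    lo = (err - timedelta(minutes=window_min)).isoformat()
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM searches WHERE ts BETWEEN ? AND ?",
            (lo, err.isoformat()),
        ).fetchone()
    return row[0] > 0
```

Applied to each coffee-log record, this splits errors into "knowledge gap" (search beforehand) and "overconfidence" (no search) buckets.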

Statusline Integration

Coffee Debt stays visible during the session. The bean counter appears in the Claude Code statusline with a braille fill indicator:

XYZ_Projects 3a8f1b2c #12 23m 2☕⣶+1

2 coffee debts (total), braille fill (4/5 beans), +1 bean added this session. Seeing the debt grow throughout the session keeps attention levels up.
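The fill glyph comes out of the same divmod as the debt total. A sketch (this five-step braille ramp is one plausible mapping; only the ⣶ state for 4/5 beans appears in the example above):

```python
# One plausible 5-step braille fill ramp; ⣶ is the 4/5 state
FILL = ["⠀", "⣀", "⣤", "⣦", "⣶"]

def statusline_fragment(total_beans: int, session_beans: int) -> str:
    """Render the coffee segment of the statusline, e.g. '2☕⣶+1'."""
    coffees, beans = divmod(total_beans, 5)
    return f"{coffees}☕{FILL[beans]}+{session_beans}"
```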

Philosophy: Data Instead of Frustration

The philosophy is simple: getting angry at AI is a waste of energy. Converting that same energy into data is more productive.

The inspiration came from a post on X. A user asked Claude "is it right that you consumed my token credits?" and Claude responded "unfortunately I can't recover consumed tokens." That exchange sparked a question: if I can't measure the cost of an error, I have no chance of learning from it.

An interesting detail: Claude Code itself tracks user reactions. The telemetry data includes an is_negative field for every prompt (in event files under ~/.claude/telemetry/, in the tengu_input_prompt event). But this field doesn't change agent behavior, it's only recorded for analytics purposes. The AI "knows" you're frustrated but doesn't do anything with that information.

Coffee Debt takes a different approach. The goal isn't to measure the reaction, it's to measure the error that caused it. The system isn't a punishment mechanism. The real value lies in the JSONL log's context proxies, cross-reference analyses, and patterns that emerge over time.

Source Code

Sanitized source code for the Coffee Debt system: GitHub Gist

Beyond the Log: Observing Yourself

Coffee Debt logs the AI's mistakes. But I'm equally interested in logging my own side of the interaction.

As daily workflows shift around AI pair programming, the adaptation process itself becomes worth observing. How does a morning session with an agent affect the rest of the workday? When the agent completes tasks rapidly and provides instant feedback, does that raise the reward threshold for everything else?

I've been following neuroscience research on this. The pattern has a name: phasic dopamine bursts from rapid task completion can suppress tonic dopamine baseline, making slower, deeper work feel unrewarding by comparison. The agent's responses, including its confirmations and its structured outputs, activate similar neural pathways as social approval. Context switching after an intense agent session carries attention residue that can last over 20 minutes.

This reframes what "error" means. A tool failure is straightforward: the Edit command didn't match, the Bash script crashed. But the more interesting errors are behavioral. Did I stop reading the AI's output carefully because the last 10 outputs were correct? Did I accept a suboptimal solution because the feedback loop was fast enough to feel productive? Did the morning session's dopamine high make the afternoon's analytical work feel like a slog?

Coffee Debt captures the first kind. The second kind requires a different log: one where I'm the subject, not the observer.

I'm building that log too. Not as a system with hooks and JSONL, but as a deliberate practice of noticing when the tool shapes the toolmaker.
