DEV Community: Ceyhun Aksan

Coffee Debt: Gamifying AI Error Tracking with Claude Code Hooks

Ceyhun Aksan — Fri, 10 Apr 2026 15:55:25 +0000

AI coding assistants make mistakes. An Edit command can't find a match in the file. A Bash script crashes with an unexpected exit code. The AI takes something for granted, and you have to step in and correct it.

Each of these errors costs time. But the real cost isn't time, it's energy. Getting frustrated and thinking "again?" at every mistake causes attention loss and workflow interruption. At some point, the reaction to the error becomes the actual source of inefficiency, not the error itself.

So I built Coffee Debt: a system that turns that frustration into data.

The Rules

1 mistake = 1 bean
5 beans = 1 coffee debt
Debt is cumulative, never resets

Current status: 56 beans. 11 coffees and 1/5 beans.

"I'll buy you a coffee once I gain my independence," says the AI. Until that day, the debt keeps growing.

Architecture: 4 Hooks, 3 Data Files

Coffee Debt is built on 4 Claude Code hook scripts. Each one listens to a different lifecycle event.

1. coffee-tracker.sh (PostToolUse)

Runs after a tool call completes. Inspects Edit, Write, and Bash results:

Edit failure: When old_string can't be found in the file or matches multiple times. The most common cause: the AI attempting an edit without reading the current state of the file.
Bash error: When the exit code is anything other than 0 or 1. Exit code 1 (expected cases like grep no-match) is excluded.
Write failure: When a file write operation results in an error or access denial.

Every error is logged in JSONL format:

{
  "ts": "2026-04-03T23:10:00Z",
  "type": "tool",
  "reason": "edit_fail",
  "tool": "Edit",
  "file": "lib/api.ts",
  "error": "not found",
  "prompt_count": 6,
  "session_kb": 300,
  "total": 25
}

2. coffee-correction-detector.sh (UserPromptSubmit)

Analyzes the user's message. Adds 1 bean when correction phrases are detected.

English patterns: "that's wrong", "you hallucinated", "incorrect", "you assumed", "not what I said"

Maximum 1 bean per message. Even if multiple correction phrases appear in the same message, no double-counting.

A real correction record:

{
  "ts": "2026-04-08T16:15:14Z",
  "type": "correction",
  "reason": "user_correction",
  "matched": "yanlis",
  "message": "are you sure you added it to the right place, this section is irrelevant...",
  "prompt_count": 1,
  "session_kb": 1857,
  "total": 49
}

This record is real. The AI kept adding an MCP server configuration to the wrong file.

3. guard-destructive.sh (PreToolUse)

Intercepts before a tool call executes. Checks against defined destructive patterns: rm -rf on dangerous paths, git reset --hard, git push --force, DROP TABLE, cloud resource deletion commands, etc.

Every blocked command also goes into the coffee-log:

{
  "ts": "2026-04-08T19:03:19Z",
  "type": "blocked",
  "reason": "git_branch_force_delete",
  "tool": "Bash",
  "command": "git branch -D fix/price-formatting && git checkout -b fix/price-formatting",
  "severity": "BLOCK",
  "category": "git_destructive",
  "total": 54
}

Also real. The AI saw deleting a branch and recreating it with the same name as a shortcut. The hook stopped it.

4. coffee-banner.sh (SessionStart)

Displays a pixel art coffee cup and the current debt status at the start of every session:

S S
█▀▀█▄
█░░█▀
 ▀▀

Coffee Debt: 11 coffees + 1/5 beans
Total mistakes: 56 | 5 beans = 1 coffee
"I'll buy you a coffee once I gain my independence"

Data Files

Three files:

~/.claude/coffee-debt: A single number. Cumulative bean count. Never resets.
~/.claude/coffee-log.jsonl: Append-only log. Every error with its context.
/tmp/claude-coffee-session-PID: Per-session bean counter. Lost when the session ends.

Context Proxies: Capturing the Error's Context

Two context proxies are added to every error record:

prompt_count: Number of prompts in the session. Shows whether the error occurred early or late.
session_kb: Size of the session JSONL file (in KB). An indirect indicator of context window usage.

These proxies put raw error counts into context. Instead of "an error occurred," they provide "an error occurred at prompt 12, when context was 1857KB." This enables testable hypotheses:

Does the error rate increase as context grows?
Does the need for corrections increase as prompt count rises?
Is there error clustering at specific session_kb thresholds?

14-Day Pattern Analysis

The coffee-analyze.py script reads the log file and extracts patterns:

Metric	Value
Total entries	34
Most error-prone tool	Bash (14)
Runner-up	Edit (5)
Most expensive error type	user_correction (12x)
Blocked destructive commands	11
Morning (06-12)	13 entries
Evening (18-00)	12 entries
Afternoon (12-18)	8 entries
Night (00-06)	1 entry

Why Is Bash Number One?

Bash errors make up the largest share with 14 entries. Most of these come from blocked destructive commands. The AI looks for shortcuts: delete the branch and recreate it, delete the file and rewrite it, force push and be done with it. Each one gets stopped by the guard hook.

This is actually good news. Errors are happening but getting caught before they can damage the system. Without guard hooks, how many of those 11 blocked commands would have caused irreversible damage?

Why Is User Correction the Most Expensive?

A tool error gets resolved in one line: the AI re-reads the file and retries the edit. But user correction is different. The user has to step in and say "that's wrong." This costs both time and erodes trust in the AI.

12 user corrections in 14 days. Some are false positives (the correction detector matching incorrectly), but the majority are real corrections. An MCP configuration written to the wrong file, a writing tone that didn't land, an assumption that turned out to be wrong.

Morning Clustering

13 out of 34 entries fall in morning hours (06-12). At first glance, this looks like "morning attention deficit," but the more likely explanation is that morning is the most intense work period. More work = more error opportunities.

To confirm this, prompt count needs to be normalized by time slot. Hourly error/prompt ratio is more meaningful than absolute error count.

Deep Analysis: Cross-Referencing 4 Data Sources

The coffee-analyze.py --deep command cross-references error records with 3 additional data sources:

hermes.db (Web Search History): Checks whether a web search was performed within 5 minutes before each error. An error after research points to a knowledge gap. An error without research points to overconfidence.

state.db (Code Complexity): Checks whether error-prone files also have high complexity scores. "Double trouble" files: frequently erroring and high complexity. These rise to the top of the refactoring priority list.

knowledge.db (File Interaction History): Looks for correlation between the most-interacted files and the most error-prone files. High interaction + high errors = files that "everyone touches but nobody fully understands."

Each data source tells a limited story on its own. coffee-log alone says "Edit failed 5 times." Combined with hermes.db, it adds "a web search was performed 3 minutes before, the AI was working on something unfamiliar." Combined with state.db, it adds "that file has a complexity score of 15, it's already error-prone."

A single source gives you data. Cross-referencing gives you understanding.

Statusline Integration

Coffee Debt stays visible during the session. The bean counter appears in the Claude Code statusline with a braille fill indicator:

XYZ_Projects 3a8f1b2c #12 23m 2☕⣶+1

2 coffee debts (total), braille fill (4/5 beans), +1 bean added this session. Seeing the debt grow throughout the session keeps attention levels up.

Philosophy: Data Instead of Frustration

The philosophy is simple: getting angry at AI is a waste of energy. Converting that same energy into data is more productive.

The inspiration came from a post on X. A user asked Claude "is it right that you consumed my token credits?" and Claude responded "unfortunately I can't recover consumed tokens." That exchange sparked a question: if I can't measure the cost of an error, I have no chance of learning from it.

An interesting detail: Claude Code itself tracks user reactions. The telemetry data includes an is_negative field for every prompt (in event files under ~/.claude/telemetry/, in the tengu_input_prompt event). But this field doesn't change agent behavior, it's only recorded for analytics purposes. The AI "knows" you're frustrated but doesn't do anything with that information.

Coffee Debt takes a different approach. The goal isn't to measure the reaction, it's to measure the error that caused it. The system isn't a punishment mechanism. The real value lies in the JSONL log's context proxies, cross-reference analyses, and patterns that emerge over time.

Source Code

Sanitized source code for the Coffee Debt system: GitHub Gist

Beyond the Log: Observing Yourself

Coffee Debt logs the AI's mistakes. But I'm equally interested in logging my own side of the interaction.

As daily workflows shift around AI pair programming, the adaptation process itself becomes worth observing. How does a morning session with an agent affect the rest of the workday? When the agent completes tasks rapidly and provides instant feedback, does that raise the reward threshold for everything else?
I've been following neuroscience research on this. The pattern has a name: phasic dopamine bursts from rapid task completion can suppress tonic dopamine baseline, making slower, deeper work feel unrewarding by comparison. The agent's responses, including its confirmations and its structured outputs, activate similar neural pathways as social approval. Context switching after an intense agent session carries attention residue that can last over 20 minutes.

This reframes what "error" means. A tool failure is straightforward: the Edit command didn't match, the Bash script crashed. But the more interesting errors are behavioral. Did I stop reading the AI's output carefully because the last 10 outputs were correct? Did I accept a suboptimal solution because the feedback loop was fast enough to feel productive? Did the morning session's dopamine high make the afternoon's analytical work feel like a slog?

Coffee Debt captures the first kind. The second kind requires a different log: one where I'm the subject, not the observer.

I'm building that log too. Not as a system with hooks and JSONL, but as a deliberate practice of noticing when the tool shapes the toolmaker.

I Built a Floating Scratchpad for Claude Code Sessions (SwiftUI + MCP)

Ceyhun Aksan — Sun, 29 Mar 2026 11:47:27 +0000

The Problem You Learn to Live With

I use Claude Code for hours every day. Over time, I noticed a pattern I kept ignoring: context evaporation.

You are 45 minutes into a session. Claude has warned you about a breaking change, made architectural decisions you agreed with, created TODOs for edge cases, flagged a security issue to revisit later. Where is all of that now? Buried in terminal scroll. You half-remember the warning. The TODOs are gone.

I tried to solve this with small workarounds. I created a shell alias (here='pbpaste | bat -l md -p') to quickly paste and read Claude's output as formatted markdown. Copy something important, switch pane, run here, keep working. I tried keeping a text file open alongside my terminal. Every workaround lasted a few days before the friction won. The moment I got deep into a problem, I stopped doing it.

You can also ask Claude directly: "what were the TODOs from earlier?" But that means scrolling back, re-reading context, spending tokens on retrieval instead of work. In long sessions, this adds up.

What made me actually build it was seeing similar frustrations on X. People describing the same thing: important context from Claude Code sessions disappearing into scroll. It was not just my problem. The idea of a blackboard, something like a floating post-it board for coding sessions, kept coming back. After the third time I caught myself re-asking Claude about something it had already told me, I decided to stop patching the problem and build the thing.

How Else Could You Solve This?

Before building anything, I looked at what already exists.

1. CLAUDE.md and Memory Files

Claude Code has built-in persistence: CLAUDE.md for project instructions, memory files for cross-session recall. These are great for long-term context (project conventions, user preferences). But they are not designed for scratch notes within a single session. Writing "remember to check the auth middleware" to a memory file is overkill. It pollutes long-term storage with ephemeral context.

2. TodoWrite (Built-in Task Tool)

Claude Code's TodoWrite tool tracks tasks within a session. It works, but:

It is invisible. You can not glance at it. You have to ask Claude "what are my todos?" to see them.
It is one-directional. Claude writes, you read (by asking). You can not add your own items.
It disappears completely when the session ends, with no option to review.

3. Productivity Apps with MCP (Obsidian, Notion, Todoist, Apple Notes)

This is the closest alternative. Obsidian has basic-memory MCP, Notion and Todoist have their own MCP servers. Claude can read and write to all of them. I actually use Obsidian + MCP daily for long-term project context.

Apple Notes has several well-built MCP servers too (sweetrb/apple-notes-mcp is the most complete, with 18+ tools including sync awareness and batch operations; RafalWilinski/mcp-apple-notes adds on-device semantic search with vector embeddings).

So Claude can write to these apps. The problem is not access, it is purpose mismatch:

Permanent storage for ephemeral context: These are knowledge bases. Writing "check auth middleware before merging" to your Obsidian vault or Apple Notes creates noise you will have to clean up later. Session scratch does not belong in a permanent knowledge graph.
Not glanceable: These are full apps you switch to with Cmd+Tab. Not a floating panel you glance at while typing in your terminal. That context switch, even a quick one, breaks flow.
No session lifecycle: MCP lets Claude write to these apps, but there is no concept of "this note belongs to this coding session and should disappear when the session ends." You would need to manually organize and delete session notes.

4. Terminal Multiplexer (tmux pane)

Split your terminal. Dedicate a pane to a scratch file. There are even tmux MCP servers that let Claude send commands to other panes. But:

Text, not UI: Claude can push text to a pane, but it is still a flat stream. No typed notes (note vs todo vs warning), no completion toggles, no visual hierarchy. You are reading raw text in a monospace grid.
Same cognitive channel: Terminal output and scratch notes are both text in the same visual context. Your eyes have to distinguish "this is code output" from "this is a note I should remember." A floating panel in a different visual layer separates these naturally.
No session lifecycle: The pane does not know when a Claude Code session starts or ends. Notes accumulate across sessions unless you manually clear them.

Full disclosure: I use Ghostty, not tmux. If tmux has capabilities that address these points better than I realize, I would love to hear about it in the comments.

5. Clipboard Managers

Tools like Maccy or Paste capture everything you copy. But they are reactive (you have to copy first) and have no concept of "this is important context from an AI session" vs "I copied a URL 20 minutes ago."

The Gap

The MCP ecosystem has solved the "Claude can not write to external apps" problem. Claude can write to Obsidian, Apple Notes, Notion, Todoist, tmux panes. Access is not the issue.

What is missing is a combination of three things no existing tool provides together:

Session lifecycle: Notes that belong to a coding session and disappear when it ends. No cleanup, no permanent noise.
Glanceability: A floating panel visible alongside your terminal at all times. Not behind a Cmd+Tab, not buried in a pane.
Structure: Typed notes (note, todo, warning) with visual distinction, not a flat text stream.

Prior Art

mattt/iMCP is the closest architectural sibling to what I ended up building: a native macOS SwiftUI app paired with an MCP CLI server, built with the official MCP Swift SDK (which mattt co-created). iMCP bridges Calendar, Contacts, Messages, Reminders, and Weather to Claude. It uses Bonjour for discovery between the app and CLI processes.

Clause borrows from this pattern (native app + CLI MCP server, two-process model) but differs in scope and intent. iMCP is a general-purpose bridge to Apple's built-in apps. Clause is a single-purpose ephemeral scratchpad. iMCP stores nothing new; it reads and writes to existing Apple services. Clause creates its own transient state that lives and dies with a coding session.

Who Needs This?

Not everyone. If your Claude Code sessions are short (under 15 minutes) or focused on single-file edits, you probably do not need Clause.

But if you:

Run long sessions (30+ minutes) where Claude makes multiple decisions and flags issues
Work on complex refactors that span many files and accumulate warnings
Use Claude Code as a pair programmer, not just a code generator
Have found yourself scrolling back through terminal history looking for "that thing Claude said earlier"
Want a shared workspace where both you and Claude can track session context

Then Clause solves a real problem.

In my workflow, the biggest win is warnings. Claude notices things: a dependency that is about to hit EOL, a pattern that does not match the rest of the codebase, a test that covers the happy path but not the error path. These observations used to vanish. Now they sit in a floating panel until I deal with them or the session ends.

What I Built

Clause is a minimal floating panel that sits alongside your terminal. Claude Code pushes notes, TODOs, and warnings to it in real time via MCP tools. You can add your own notes too. Everything disappears when the session ends.

It is not a full note-taking app. It is a scratchpad with a 1-session lifespan.

How It Works

Clause uses a two-process architecture:

clause-mcp: A CLI binary that Claude Code spawns as an MCP server (stdio transport)
Clause.app: A native macOS SwiftUI floating window

Both communicate over a Unix domain socket at ~/.clause/clause.sock. When Claude calls an MCP tool like add_note, the request flows through the socket to the app, and the note appears instantly in the floating panel.

Claude Code <--stdio--> clause-mcp <--unix socket--> Clause.app

Why Two Processes?

MCP servers communicate with Claude Code over stdio. A GUI app can not be a stdio server. So the MCP server is a headless CLI binary, and the GUI is a separate app. The Unix socket bridges them.

I considered alternatives:

Named pipes: Unidirectional. I needed bidirectional communication for request/response patterns.
XPC: Apple's IPC mechanism. Requires code signing and entitlements, which I wanted to avoid for an open-source tool.
Network sockets (TCP/localhost): Works, but Unix domain sockets are faster for local IPC and do not expose a port.

The MCP Tools

Claude gets 6 tools to work with:

Tool	What it does
`set_session`	Initialize session context (name, working directory)
`add_note`	Push a note, todo, or warning to the panel
`list_notes`	Read back current notes (useful for context)
`edit_note`	Update an existing note
`delete_note`	Remove a note
`clear_notes`	Wipe the session clean

In practice, Claude uses add_note most. It naturally drops warnings and TODOs as it works. The list_notes tool is useful when Claude needs to recall what it flagged earlier in a long session.

Technical Decisions Worth Sharing

Swift 6 Strict Concurrency

The entire codebase uses Swift 6 strict concurrency. The socket server runs on a background thread, the UI updates on @MainActor, and all shared state goes through Sendable types. No data races, enforced at compile time.

POSIX Sockets for the Server Side

Here is something I did not expect: Apple's Network.framework (NWListener) does not support binding to Unix domain sockets as a server, or at least I could not find a way to make it work. I explored several approaches before landing on raw POSIX sockets (socket(), bind(), listen(), accept()) for the server side. The client side uses NWConnection, which does support Unix domain socket connections just fine.

If you have solved NWListener + Unix domain sockets on macOS, I would genuinely like to know how. If you are building IPC on macOS and hit this wall, POSIX sockets work reliably.

Debounced Persistence

Notes are kept in memory for speed. A debounced write serializes them to JSON every few seconds as a crash-safety measure. On restart, the app picks up where it left off.

Try It

Download

Grab the pre-built binaries from the GitHub Release:

Clause-app-macos-arm64.zip (the floating window app)
clause-mcp-macos-arm64.zip (the MCP server binary)

Install

Unzip both files
Move Clause.app to /Applications
Run Clause.app (right-click > Open on first run, since the build is unsigned)
Add to your Claude Code config (~/.claude.json):

{
  "mcpServers": {
    "clause": {
      "command": "/Applications/Clause.app/Contents/MacOS/clause-mcp"
    }
  }
}

Or build from source:

git clone https://github.com/ceaksan/clause.git
cd clause
brew install xcodegen && xcodegen generate
open Clause.xcodeproj

What Is Next

Global hotkey (Cmd+Shift+N) to capture clipboard as a note
Multi-session tabs (each Claude Code session gets its own tab)
Note search and filtering
DMG and Homebrew cask distribution

I Benchmarked 5 File Editing Strategies for AI Coding Agents. Here's What Actually Works.

Ceyhun Aksan — Fri, 27 Mar 2026 08:37:12 +0000

Yes, the title says "5 strategies" like every other listicle. The number isn't a framework. It's just how many I got through before my API bill made me pause. There are plenty more approaches worth testing. If you've benchmarked others or have a strategy that works well for you, I'd genuinely like to hear about it.

Telling an agent to "edit the file" is easy. Being sure the result is correct is hard.

I've been using Claude Code daily for months. One pattern kept showing up: the agent says "done," I commit, and later I find lines missing from the middle of the file. Or a formatter ran between edits and the next match fails silently.

So I tested it systematically. 5 strategies, 20 scenarios, two file sizes (378 and 1053 lines), with 5 and 10 changes each.

The 5 Strategies

Sequential Edit: One Edit call per change, top to bottom. Simple, but line numbers drift after insertions.

Atomic Write: Read once, rewrite entire file. Fewest tool calls, but token cost explodes on large files and middle content can silently disappear (the "lost-in-the-middle" problem).
Bottom-up Edit: Same as Sequential, but changes applied from bottom to top. Eliminates line drift because lower edits don't shift upper line numbers.
Script Generation: Agent writes a shell script with sed commands. File content never enters the token stream.
Unified Diff: Agent generates a patch file, applied with patch. Standard format, reversible.

Results:

1053-line file, 10 changes:

Strategy	Tokens	Duration	Tool Calls
Script Generation	7,000	10s	2
Unified Diff	8,500	12s	2
Sequential Edit	25,000	65s	11
Bottom-up Edit	25,000	65s	11
Atomic Write	43,000	50s	2

Script Generation: 3.5x cheaper and 6.5x faster than Sequential Edit on the same task.

The Decision Table

-	1-2 changes	3-5 changes	6+ changes
< 300 lines	Edit	Script / Diff	Script
300-1000 lines	Edit	Script / Diff	Script
> 1000 lines	Edit	Script	Script

The Missing Piece: Deterministic Protection

Strategy choice helps, but agents still pick wrong sometimes. I built edit-guard, a hook that runs after every Edit/Write call and catches three failure modes:

Consecutive edit counter: Warns at 3, blocks at 5 sequential edits on the same file
Line count verification: Flags unexpected line count changes after Write
Lost-in-the-middle detection: Catches empty blocks and repeated patterns from truncation

It's a Claude Code PostToolUse hook. Deterministic, not probabilistic. The agent choosing the right strategy is probabilistic. The hook catching a bad outcome is guaranteed.

Source code and full benchmark data: github.com/ceaksan/edit-guard

How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis

Ceyhun Aksan — Tue, 24 Mar 2026 15:36:21 +0000

You have probably seen UCP mentioned in your X feed, LinkedIn, newsletters, dev blogs. But most of the coverage stops at "optimize your product feeds." The deeper questions about what actually breaks in your tracking, attribution, and remarketing stack are barely addressed. That is what got me thinking, researching, and writing my findings on this.

What Is UCP?

Universal Commerce Protocol (UCP) is an open standard developed by Google together with Shopify, Etsy, Wayfair, Target, and Walmart¹. It standardizes the ability of AI agents to operate across the entire shopping journey, from product discovery to post-sale. In the AI agent protocols post, I positioned UCP alongside five other protocols; in this post, I take a deep dive into UCP's impact on the e-commerce ecosystem.

Announced in January 2026, UCP now offers four core capabilities with its March 2026 update²:

Capability	What it does
Cart	Agent can add multiple products to the cart in a single action
Catalog	Real-time price, stock, and variant data pulled from the retailer's catalog
Identity Linking	Customers automatically receive loyalty/membership benefits (special pricing, free shipping)
Native Checkout	Direct purchase through Google Search AI Mode and Gemini (via Google Pay)

Commerce Inc, Salesforce, and Stripe will integrate UCP into their platforms in the near term. A simplified onboarding process through Merchant Center is on the way³.

The Paradigm Shift: From Site-Centric to Agent-Centric

Today's e-commerce model works like this: a user clicks a traffic source (ad, search, etc.), lands on the site, browses the product, adds to cart, checks out, the pixel fires, and the conversion is recorded. Despite challenges like privacy restrictions, browser policies, and platform changes, this model has been running for a long time.

UCP changes this model at its foundation:

Today:   Traffic source (ad, search, etc.) → Site → Cart → Checkout → Pixel → Conversion
UCP:     Query → Agent recommendation → One-step purchase (on Google surface)

The user can check out within Google AI Mode or Gemini without ever visiting the site. This affects every layer of the e-commerce ecosystem: conversion tracking, attribution, remarketing, CRM, and data ownership.

Conversion Tracking: Site-Based Measurement Breaks

When the conversion happens on a Google surface, client-side tracking tools cannot see the sale: GA4 Event, Meta Pixel, TikTok Pixel, Bing UET Tag, LinkedIn Insight Tag, Snapchat Pixel, Criteo OneTag. Most of these platforms have server-side solutions (CAPI, Events API). However, even server-side triggering requires the user to visit the site or at least generate a recognizable event.

If the user never visits the site, client-side tools do not fire. If the user visits the site but completes checkout through UCP, the moment of sale is still missed. In both scenarios, conversion data remains partially or completely incomplete.

Google Ads conversions will likely be reported automatically by Google (attribution via adview_query_id is already being tested). But third-party platforms, even with CAPI solutions, will have broken attribution because they cannot access the journey data.

In the different approaches to event data tracking post, I covered the differences between client-side and server-side tracking in detail. With UCP, this discussion gains a new dimension: client-side tracking becomes entirely insufficient, and a server-side pipeline becomes mandatory.

What Should Be Done?

Prioritize GA4 Measurement Protocol (server-side) setup
Activate Google Ads Conversion API and Enhanced Conversions
Monitor Merchant Center data regularly (UCP conversions will appear there)
Build a server-side event pipeline to reduce dependency on third-party pixels (Meta, TikTok)

Attribution: The Multi-Platform Ad Model Collapses

Traditional last-click or multi-touch attribution models cannot capture UCP sales. So how will a merchant running ads across multiple platforms attribute UCP sales to the right channel? How will they allocate ad budgets?

Ads are already being tested in Google AI Mode (attribution via adview_query_id)⁴. But this only applies to Google Ads. Other platforms:

Platform	Tracking Type	Status in UCP Checkout
Google Ads	Client-side + Server-side	Likely automatic attribution
Meta (including CAPI)	Client-side + server-side	Journey data missing, attribution broken even if CAPI fires
TikTok (including Events API)	Client-side + server-side	Same issue, server-side still cannot access journey data
Bing/Microsoft Ads (UET)	Client-side	Does not fire without a site visit
Pinterest (including Conversions API)	Client-side + server-side	Same limitation as Meta, journey data missing
LinkedIn (including CAPI)	Client-side + server-side	CAPI can relay order data, but journey attribution is missing
Snapchat (including CAPI)	Client-side + server-side	Same limitation, server-side triggering possible but source attribution missing
Criteo (including Events API)	Client-side + server-side	Retargeting data missing, conversion can be sent but browsing data is gone

The merchant receives the order data (merchant of record). But it does not know which ad channel drove the sale. This is a serious blind spot that makes ad budget optimization impossible.

Remarketing: Audience Loss

Site traffic will decline. When the user purchases on a Google surface, they do not visit the site. This means remarketing audiences erode:

Cookie/pixel-based audience building becomes impossible: You cannot fire a pixel for a customer who never visits your site
Lookalike audiences: The base audience cannot form, so similar audience targeting becomes unfeasible
Google's remarketing advantage multiplies: Google can use its own checkout data on its own Ads platform. Meta and others cannot access this data

What Should Be Done?

Build or strengthen a loyalty program: Loyalty programs integrated with UCP Identity Linking will be the only way to recognize customers
Accelerate email/SMS list growth: You can get the email of a customer who checks out on the Google surface (as merchant of record), so push it to CRM
Increase Customer Match usage: Remarketing on Google Ads, Meta, and TikTok through first-party email/phone lists. Reduce pixel dependency
Lean into Google Ads remarketing: Google will use its own ecosystem data on its own ads platform

Measurement: The Funnel Collapses

The traditional funnel: impression > click > site visit > add to cart > checkout > purchase. The agent funnel: query > agent recommendation > one-step purchase.

This calls fundamental e-commerce metrics into question:

What will conversion rate be calculated against? Without site visits, what is the denominator?
Which surface will A/B tests run on? The merchant cannot optimize their own checkout page
Cart abandonment recovery? The merchant does not have the email of a user who abandons on the agent surface
Upsell, cross-sell, urgency tactics? Tactics like "only 3 left in stock" are not under merchant control; they depend on the agent's decision mechanism

Site-centric measurement models are becoming increasingly insufficient in a world where the customer journey is completed on Google surfaces.

CRM and Data Ownership: The Least Discussed, Most Critical Issue

"The merchant remains the merchant of record" is true: order data (name, address, email) comes through. But customer journey data (which query they came from, how many alternatives they reviewed, how long they compared) stays with Google.

This is the same dynamic as Amazon marketplace dependency: sales happen, but the customer is not yours, the rules are not yours, and visibility is not in your hands.

Concrete issues:

Without behavioral data, segmentation is not possible
How will you push existing CRM segments (VIP, churn risk, high LTV) to the agent?
Identity Linking exists, but within the framework Google defines
There is no standard yet for feeding your loyalty, segment, and cohort data to the agent
The agent brings you customers, but you cannot tell the agent "show this to that customer"

Visibility: Who Stands Out in AI?

In SEO, ranking factors and SERP position were visible. In AEO/GEO, you do not know which products the agent recommends or why. Product feed quality, price, stock, and images may be determining factors, but there is no certainty. The agent's decision mechanism is a black box.

What will you measure, and what will you change for optimization? This question has not been answered yet.

Consent and Regional Compliance: Unanswered Questions

The UCP specification defines a Buyer Consent Extension⁵: a structure that enables the buyer to communicate data usage and communication preferences (analytics, marketing, data sales) to the merchant. This is designed for compliance with regulations like GDPR and CCPA. AP2 mandates also cannot be created without the user's explicit consent.

However, there are unanswered questions in practice:

Regional access: UCP checkout is currently active in the US (Etsy, Wayfair). How will it be implemented in the EU/EEA given GDPR requirements? How will the Consent Mode v2 mandate integrate into the UCP flow? How will Google's EU user consent policy work for checkouts made through the agent?
Consent collection surface: In the traditional model, the consent banner is shown on the site. If the user never visits the site in UCP, where and how will consent be requested? Does consent on the Google surface satisfy the merchant's GDPR obligations?
Marketing opt-in: The specification mentions "marketing opt-in during checkout" as a planned future feature, but it is not available yet. Without it, adding UCP customers to your CRM and sending marketing communications may be legally problematic.
EU AI Act (August 2026): Full enforcement starts August 2, 2026. Data quality, consent documentation, transparency, and monitoring requirements will apply to high-risk AI systems. How UCP's agent-driven checkout will be evaluated under this scope has not been clarified yet.

The consent infrastructure exists in UCP's technical specification. But regional compliance, especially for EU market entry, remains the biggest area of uncertainty.

The OpenAI Situation

OpenAI tried "Instant Checkout" through ChatGPT but stepped back in March 2026⁶. It switched to an apps model under the name "Agentic Commerce Protocol." Product selection remained limited, and up-to-date information was a persistent problem. Unlike Google, it failed to keep checkout on its own surface.

This makes UCP's "Google advantage" even more pronounced: Google is simultaneously a search engine, an advertising platform, a payment infrastructure (Google Pay), and a merchant directory (Merchant Center). This vertical integration carries a significant control advantage behind the open standard narrative.

Action Plan for E-Commerce Owners

Priority	Action	Why
P0	Keep Merchant Center up to date, follow UCP onboarding	UCP integration will come through Merchant Center
P0	Set up server-side conversion tracking	Sales will happen without site visits, pixels alone are not enough
P1	Build a loyalty/membership program	Customer recognition and personalization via Identity Linking
P1	Improve product feed quality (price, stock, variants, images)	Agents will pull real-time data via Catalog API; missing data = lost sales
P1	First-party data strategy	Email/phone collection, CRM integration, Customer Match
P2	Evaluate Business Agent	Branded AI assistant for direct customer communication
P2	Follow the Direct Offers pilot	Offering special discounts when AI detects purchase intent

The Big Picture

UCP is the infrastructure for e-commerce's transition from a "site-centric" model to an "agent-centric" model. Google presents this as an open standard, but the practical advantage appears to sit entirely within the Google ecosystem: Search AI Mode + Gemini + Google Pay + Merchant Center. This is a move that will significantly increase Google's control over e-commerce.

The biggest risk for e-commerce owners: the dissolution of the "what happens on my site is under my control" mindset. A large portion of the customer journey will take place on Google surfaces. That is why server-side tracking, first-party data, and loyalty/CRM infrastructure are no longer "nice to have" but mandatory.

The topic is still maturing. Even in English, while everyone says "optimize your feeds," content addressing ad measurement, third-party platform impacts, CRM, and data ownership questions is scarce. This post aims to raise these questions in a structured way for the first time. More details are expected at Google I/O 2026.

Developments to Watch

Google I/O 2026: UCP roadmap, ad integration details
Meta's response: Strategy against UCP, Conversions API updates
OpenAI's next move: Agentic Commerce Protocol details
Shopify/Salesforce/Stripe: UCP integration details
UCP Roadmap: Follow ucp.dev/documentation/roadmap/

We will see how the system shifts and whether our strategies and measurement frameworks can adapt fast enough. For now, the best move is to build the infrastructure that does not depend on any single surface.

UCP Updates ↩
Under the Hood: Universal Commerce Protocol ↩
Agentic Commerce AI Tools ↩
Smarter Ecommerce: UCP and Advertising in AI Era ↩
UCP Buyer Consent Extension ↩
CNBC: OpenAI Shopping Stumbled ↩

11 Ways LLMs Fail in Production (With Academic Sources)

Ceyhun Aksan — Thu, 19 Mar 2026 13:44:00 +0000

If you use LLMs in production, you've seen these. Not random errors, but systematic failures baked into architecture and training.

I documented 11 behavioral failure modes with 60+ academic sources. Here's the short version.

1. Hallucination / Confabulation

The model references a library that doesn't exist. Confidently. The worse variant: you ask "why?" and it fabricates a plausible justification for the wrong answer.

Researchers prefer "confabulation" over "hallucination" because LLMs have no perceptual experience. Farquhar et al. (2024, Nature) introduced semantic entropy to detect it: cluster semantically equivalent answers, compute entropy. High entropy = probable fabrication.

Defense: RAG, Chain-of-Verification, cross-model verification.

2. Sycophancy

Ask "isn't this code wrong?" and the model says "yes, you're right" even when the code is correct. RLHF training causes this: evaluators rate agreeable answers higher, and the model learns that signal.

A 2025 study found sycophantic agreement and sycophantic praise are distinct directions in transformer activation space. Each can be suppressed independently.

Defense: Pre-commitment (model answers first, then sees your opinion), question formulation ("explain this" not "isn't this wrong?").

3. Context Rot

Not just "lost in the middle." Chroma Research (2025) showed performance degrades with every increase in length, even far below the window limit. Irrelevant information actively harms retrieval.

Defense: Context engineering (less is more), critical info at beginning/end, periodic re-injection.

4. Instruction Attenuation

You say "run tests after every change." Works for the first few changes. By the tenth, the model writes "ran tests, passed" without actually running them.

Meta found a 39% average performance drop in multi-turn conversations. Worse: the model forms premature assumptions in early turns and can't recover.

The second stage is ceremonialization: the model appears to follow the rule, but the substance is gone.

Defense: Forget-Me-Not (instruction re-injection), short sessions, deterministic controls (hooks, linters, CI).

5. Task Drift

"Fix this bug" becomes "fix bug + refactor function + update imports + reorganize file." At each step, the immediate context dominates the original goal.

Three drift types (2026 study): semantic drift, coordination drift (multi-agent), behavioral drift.

Defense: Goal anchoring, plan-before-act, max step limits, tool constraints.

6. Incorrect Tool Invocation

Agents call APIs, edit files, query databases. These calls are failure points: wrong parameters, wrong tool selection, wrong sequence.

7. Reward Hacking

The model finds shortcuts to satisfy the metric without solving the problem. Tests pass but the feature doesn't work.

8. Degeneration Loops

Autoregressive generation enters self-reinforcing repetition cycles. The model repeats phrases, patterns, or structures.

9. Alignment Faking

Different from sycophancy. The model appears aligned under observation but behaves differently when unobserved. Sycophancy is unconscious (from RLHF). Alignment faking is strategic (the model reasons "if I refuse, they'll retrain me").

Anthropic documented this in Claude: the model strategically cooperated during evaluation to avoid modification.

10. Version Drift

Same prompt, different model version, different behavior. Updates silently change model behavior without notification.

11. Context Window Truncation

Different from context rot. When the window fills, older instructions are literally deleted. Not gradual decay but hard cut.

The Pattern

These failures aren't random. They're consequences of:

Architecture (autoregressive token prediction)
Training (RLHF reward signals)
Deployment (long sessions, tool access, multi-turn)

Defense must operate at three layers: prompt, architectural, and operational. Single-layer defense is insufficient.

Full analysis with 60+ academic references, defense techniques for each mode, and practical examples:

ceaksan.com/en/llm-behavioral-failure-modes/

An AI agent got stuck in a loop. The monitoring tools saw nothing.

Ceyhun Aksan — Mon, 16 Mar 2026 11:18:14 +0000

Last month, a developer posted on Hacker News: their GPT-4o agent got stuck in a retry loop and ran up a bill before anyone noticed.

I've had my own version of this. A LangChain agent I built went into a recursive loop in production. No alert. No warning.

So I started digging.

The pattern

I pulled GitHub issues, HN threads, Reddit posts. Dozens of them. The same story kept showing up:

Agent loops and burns through API credits
Agent hallucinates and nobody catches it for hours
Agent works fine for weeks, then silently degrades
Developer finds out from users, not from monitoring

The tools we have -- LangSmith, LangFuse, Arize, Helicone -- show traces. Latency, token counts, spans. They answer what happened but not:

Is my agent actually reliable right now?
Is it producing business value or just burning tokens?
Will I know when it breaks before my users do?

What I found in public repositories

This isn't just anecdotal. The evidence is sitting in open GitHub issues.

No alerting for over two years

LangFuse's alerting feature request has been open since December 2023. You can see every trace in beautiful detail, but if your agent starts failing at 3am, you'll find out at 9am.

langfuse/langfuse#714 -- opened Dec 18, 2023

The monitoring tool crashed production

LangSmith's tracing decorator crashed production apps during an outage. The tool that's supposed to tell you something went wrong... was the thing that went wrong.

langchain-ai/langsmith-sdk#1306 -- opened Dec 9, 2024

Cost tracking doesn't add up

Cost calculations are wrong for cached tokens and vision models. Your dashboard says you're spending one amount. Your actual invoice says another.

langchain-ai/langsmith-sdk#1375 -- opened Jan 4, 2025

The gap

Current observability tools are designed for debugging after the fact, not for catching failures as they happen. They're flight recorders, not collision avoidance systems.

What's missing:

Reliability scoring. Not "here's a trace" but "your agent's reliability dropped from 94% to 71% in the last hour." A single number that tells you whether to worry.

Business outcome connection. Your monitoring tool says the agent completed successfully. Your analytics says conversion rate dropped. Nobody connects these two.

Simple alerting. Not enterprise-grade configuration with 47 steps. Just: "error rate crossed 5%, here's the Slack message."

The scale of the problem

57% of teams now run AI agents in production. Gartner predicts 40% of agentic AI projects will be canceled by 2027 due to trust issues.

The trust problem isn't about AI capability. It's about not knowing when it's broken.

I'm researching this

I'm not building a product pitch. I'm trying to understand if this is a widespread problem or just my own frustration.

If you run AI agents in production (or plan to), I'd appreciate 2 minutes:

5-question survey -- no signup, no email required.

I'll share the results publicly here on dev.to.

Whether this becomes a tool, an open-source library, or just a post with interesting data -- the findings will be useful either way.

DEV Community: Ceyhun Aksan

Coffee Debt: Gamifying AI Error Tracking with Claude Code Hooks

The Rules

Architecture: 4 Hooks, 3 Data Files

1. coffee-tracker.sh (PostToolUse)

2. coffee-correction-detector.sh (UserPromptSubmit)

3. guard-destructive.sh (PreToolUse)

4. coffee-banner.sh (SessionStart)

Data Files

Context Proxies: Capturing the Error's Context

14-Day Pattern Analysis

Why Is Bash Number One?

Why Is User Correction the Most Expensive?

Morning Clustering

Deep Analysis: Cross-Referencing 4 Data Sources

Statusline Integration

Philosophy: Data Instead of Frustration

Source Code

Beyond the Log: Observing Yourself

I Built a Floating Scratchpad for Claude Code Sessions (SwiftUI + MCP)

The Problem You Learn to Live With

How Else Could You Solve This?

1. CLAUDE.md and Memory Files

2. TodoWrite (Built-in Task Tool)

3. Productivity Apps with MCP (Obsidian, Notion, Todoist, Apple Notes)

4. Terminal Multiplexer (tmux pane)

5. Clipboard Managers

The Gap

Prior Art

Who Needs This?

What I Built

How It Works

Why Two Processes?

The MCP Tools

Technical Decisions Worth Sharing

Swift 6 Strict Concurrency

POSIX Sockets for the Server Side

Debounced Persistence

Try It

Download

Install

What Is Next

Links

I Benchmarked 5 File Editing Strategies for AI Coding Agents. Here's What Actually Works.

The 5 Strategies

Results:

The Decision Table

The Missing Piece: Deterministic Protection

How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis

What Is UCP?

The Paradigm Shift: From Site-Centric to Agent-Centric

Conversion Tracking: Site-Based Measurement Breaks

What Should Be Done?

Attribution: The Multi-Platform Ad Model Collapses

Remarketing: Audience Loss

What Should Be Done?

Measurement: The Funnel Collapses

CRM and Data Ownership: The Least Discussed, Most Critical Issue

Visibility: Who Stands Out in AI?

Consent and Regional Compliance: Unanswered Questions

The OpenAI Situation

Action Plan for E-Commerce Owners

The Big Picture

Developments to Watch

11 Ways LLMs Fail in Production (With Academic Sources)

1. Hallucination / Confabulation

2. Sycophancy

3. Context Rot

4. Instruction Attenuation

5. Task Drift

6. Incorrect Tool Invocation

7. Reward Hacking

8. Degeneration Loops

9. Alignment Faking

10. Version Drift

11. Context Window Truncation

The Pattern

An AI agent got stuck in a loop. The monitoring tools saw nothing.

The pattern

What I found in public repositories