<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ceyhun Aksan</title>
    <description>The latest articles on DEV Community by Ceyhun Aksan (@ceaksan).</description>
    <link>https://dev.to/ceaksan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2811808%2F1a144550-777d-4739-9510-8b67fea03fc8.jpg</url>
      <title>DEV Community: Ceyhun Aksan</title>
      <link>https://dev.to/ceaksan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ceaksan"/>
    <language>en</language>
    <item>
      <title>I Built a Floating Scratchpad for Claude Code Sessions (SwiftUI + MCP)</title>
      <dc:creator>Ceyhun Aksan</dc:creator>
      <pubDate>Sun, 29 Mar 2026 11:47:27 +0000</pubDate>
      <link>https://dev.to/ceaksan/i-built-a-floating-scratchpad-for-claude-code-sessions-swiftui-mcp-2b85</link>
      <guid>https://dev.to/ceaksan/i-built-a-floating-scratchpad-for-claude-code-sessions-swiftui-mcp-2b85</guid>
      <description>&lt;h2&gt;
  
  
  The Problem You Learn to Live With
&lt;/h2&gt;

&lt;p&gt;I use Claude Code for hours every day. Over time, I noticed a pattern I kept ignoring: &lt;strong&gt;context evaporation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You are 45 minutes into a session. Claude has warned you about a breaking change, made architectural decisions you agreed with, created TODOs for edge cases, flagged a security issue to revisit later. Where is all of that now? Buried in terminal scroll. You half-remember the warning. The TODOs are gone.&lt;/p&gt;

&lt;p&gt;I tried to solve this with small workarounds. I created a shell alias (&lt;code&gt;here='pbpaste | bat -l md -p'&lt;/code&gt;) to quickly paste and read Claude's output as formatted markdown. Copy something important, switch pane, run &lt;code&gt;here&lt;/code&gt;, keep working. I tried keeping a text file open alongside my terminal. Every workaround lasted a few days before the friction won. The moment I got deep into a problem, I stopped doing it.&lt;/p&gt;

&lt;p&gt;You can also ask Claude directly: "what were the TODOs from earlier?" But that means scrolling back, re-reading context, spending tokens on retrieval instead of work. In long sessions, this adds up.&lt;/p&gt;

&lt;p&gt;What made me actually build it was seeing similar frustrations on X. People describing the same thing: important context from Claude Code sessions disappearing into scroll. It was not just my problem. The idea of a blackboard, something like a floating post-it board for coding sessions, kept coming back. After the third time I caught myself re-asking Claude about something it had already told me, I decided to stop patching the problem and build the thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Else Could You Solve This?
&lt;/h2&gt;

&lt;p&gt;Before building anything, I looked at what already exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CLAUDE.md and Memory Files
&lt;/h3&gt;

&lt;p&gt;Claude Code has built-in persistence: &lt;code&gt;CLAUDE.md&lt;/code&gt; for project instructions, memory files for cross-session recall. These are great for &lt;strong&gt;long-term&lt;/strong&gt; context (project conventions, user preferences). But they are not designed for scratch notes within a single session. Writing "remember to check the auth middleware" to a memory file is overkill. It pollutes long-term storage with ephemeral context.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. TodoWrite (Built-in Task Tool)
&lt;/h3&gt;

&lt;p&gt;Claude Code's &lt;code&gt;TodoWrite&lt;/code&gt; tool tracks tasks within a session. It works, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is invisible. You can not glance at it. You have to ask Claude "what are my todos?" to see them.&lt;/li&gt;
&lt;li&gt;It is one-directional. Claude writes, you read (by asking). You can not add your own items.&lt;/li&gt;
&lt;li&gt;It disappears completely when the session ends, with no option to review.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Productivity Apps with MCP (Obsidian, Notion, Todoist, Apple Notes)
&lt;/h3&gt;

&lt;p&gt;This is the closest alternative. Obsidian has basic-memory MCP, Notion and Todoist have their own MCP servers. Claude can read and write to all of them. I actually use Obsidian + MCP daily for long-term project context.&lt;/p&gt;

&lt;p&gt;Apple Notes has several well-built MCP servers too (&lt;a href="https://github.com/sweetrb/apple-notes-mcp" rel="noopener noreferrer"&gt;sweetrb/apple-notes-mcp&lt;/a&gt; is the most complete, with 18+ tools including sync awareness and batch operations; &lt;a href="https://github.com/RafalWilinski/mcp-apple-notes" rel="noopener noreferrer"&gt;RafalWilinski/mcp-apple-notes&lt;/a&gt; adds on-device semantic search with vector embeddings).&lt;/p&gt;

&lt;p&gt;So Claude can write to these apps. The problem is not access, it is purpose mismatch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Permanent storage for ephemeral context&lt;/strong&gt;: These are knowledge bases. Writing "check auth middleware before merging" to your Obsidian vault or Apple Notes creates noise you will have to clean up later. Session scratch does not belong in a permanent knowledge graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not glanceable&lt;/strong&gt;: These are full apps you switch to with Cmd+Tab. Not a floating panel you glance at while typing in your terminal. That context switch, even a quick one, breaks flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No session lifecycle&lt;/strong&gt;: MCP lets Claude write to these apps, but there is no concept of "this note belongs to this coding session and should disappear when the session ends." You would need to manually organize and delete session notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Terminal Multiplexer (tmux pane)
&lt;/h3&gt;

&lt;p&gt;Split your terminal. Dedicate a pane to a scratch file. There are even tmux MCP servers that let Claude send commands to other panes. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text, not UI&lt;/strong&gt;: Claude can push text to a pane, but it is still a flat stream. No typed notes (note vs todo vs warning), no completion toggles, no visual hierarchy. You are reading raw text in a monospace grid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same cognitive channel&lt;/strong&gt;: Terminal output and scratch notes are both text in the same visual context. Your eyes have to distinguish "this is code output" from "this is a note I should remember." A floating panel in a different visual layer separates these naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No session lifecycle&lt;/strong&gt;: The pane does not know when a Claude Code session starts or ends. Notes accumulate across sessions unless you manually clear them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Full disclosure: I use Ghostty, not tmux. If tmux has capabilities that address these points better than I realize, I would love to hear about it in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Clipboard Managers
&lt;/h3&gt;

&lt;p&gt;Tools like Maccy or Paste capture everything you copy. But they are reactive (you have to copy first) and have no concept of "this is important context from an AI session" vs "I copied a URL 20 minutes ago."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gap
&lt;/h3&gt;

&lt;p&gt;The MCP ecosystem has solved the "Claude can not write to external apps" problem. Claude can write to Obsidian, Apple Notes, Notion, Todoist, tmux panes. Access is not the issue.&lt;/p&gt;

&lt;p&gt;What is missing is a combination of three things no existing tool provides together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session lifecycle&lt;/strong&gt;: Notes that belong to a coding session and disappear when it ends. No cleanup, no permanent noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glanceability&lt;/strong&gt;: A floating panel visible alongside your terminal at all times. Not behind a Cmd+Tab, not buried in a pane.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt;: Typed notes (note, todo, warning) with visual distinction, not a flat text stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prior Art
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mattt/iMCP" rel="noopener noreferrer"&gt;mattt/iMCP&lt;/a&gt; is the closest architectural sibling to what I ended up building: a native macOS SwiftUI app paired with an MCP CLI server, built with the official MCP Swift SDK (which mattt co-created). iMCP bridges Calendar, Contacts, Messages, Reminders, and Weather to Claude. It uses Bonjour for discovery between the app and CLI processes.&lt;/p&gt;

&lt;p&gt;Clause borrows from this pattern (native app + CLI MCP server, two-process model) but differs in scope and intent. iMCP is a general-purpose bridge to Apple's built-in apps. Clause is a single-purpose ephemeral scratchpad. iMCP stores nothing new; it reads and writes to existing Apple services. Clause creates its own transient state that lives and dies with a coding session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Needs This?
&lt;/h2&gt;

&lt;p&gt;Not everyone. If your Claude Code sessions are short (under 15 minutes) or focused on single-file edits, you probably do not need Clause.&lt;/p&gt;

&lt;p&gt;But if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;strong&gt;long sessions&lt;/strong&gt; (30+ minutes) where Claude makes multiple decisions and flags issues&lt;/li&gt;
&lt;li&gt;Work on &lt;strong&gt;complex refactors&lt;/strong&gt; that span many files and accumulate warnings&lt;/li&gt;
&lt;li&gt;Use Claude Code as a &lt;strong&gt;pair programmer&lt;/strong&gt;, not just a code generator&lt;/li&gt;
&lt;li&gt;Have found yourself scrolling back through terminal history looking for "that thing Claude said earlier"&lt;/li&gt;
&lt;li&gt;Want a &lt;strong&gt;shared workspace&lt;/strong&gt; where both you and Claude can track session context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then Clause solves a real problem.&lt;/p&gt;

&lt;p&gt;In my workflow, the biggest win is &lt;strong&gt;warnings&lt;/strong&gt;. Claude notices things: a dependency that is about to hit EOL, a pattern that does not match the rest of the codebase, a test that covers the happy path but not the error path. These observations used to vanish. Now they sit in a floating panel until I deal with them or the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Clause&lt;/strong&gt; is a minimal floating panel that sits alongside your terminal. Claude Code pushes notes, TODOs, and warnings to it in real time via MCP tools. You can add your own notes too. Everything disappears when the session ends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzdzzhkezwjvy245iw9d.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzdzzhkezwjvy245iw9d.gif" alt="Clause demo" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not a full note-taking app. It is a scratchpad with a 1-session lifespan.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Clause uses a two-process architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;clause-mcp&lt;/strong&gt;: A CLI binary that Claude Code spawns as an MCP server (stdio transport)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clause.app&lt;/strong&gt;: A native macOS SwiftUI floating window&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both communicate over a Unix domain socket at &lt;code&gt;~/.clause/clause.sock&lt;/code&gt;. When Claude calls an MCP tool like &lt;code&gt;add_note&lt;/code&gt;, the request flows through the socket to the app, and the note appears instantly in the floating panel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code &amp;lt;--stdio--&amp;gt; clause-mcp &amp;lt;--unix socket--&amp;gt; Clause.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Two Processes?
&lt;/h3&gt;

&lt;p&gt;MCP servers communicate with Claude Code over stdio. A GUI app can not be a stdio server. So the MCP server is a headless CLI binary, and the GUI is a separate app. The Unix socket bridges them.&lt;/p&gt;

&lt;p&gt;I considered alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Named pipes&lt;/strong&gt;: Unidirectional. I needed bidirectional communication for request/response patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XPC&lt;/strong&gt;: Apple's IPC mechanism. Requires code signing and entitlements, which I wanted to avoid for an open-source tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network sockets (TCP/localhost)&lt;/strong&gt;: Works, but Unix domain sockets are faster for local IPC and do not expose a port.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The MCP Tools
&lt;/h3&gt;

&lt;p&gt;Claude gets 6 tools to work with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;set_session&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Initialize session context (name, working directory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;add_note&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Push a note, todo, or warning to the panel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_notes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read back current notes (useful for context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;edit_note&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Update an existing note&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delete_note&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove a note&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;clear_notes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wipe the session clean&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, Claude uses &lt;code&gt;add_note&lt;/code&gt; most. It naturally drops warnings and TODOs as it works. The &lt;code&gt;list_notes&lt;/code&gt; tool is useful when Claude needs to recall what it flagged earlier in a long session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Decisions Worth Sharing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Swift 6 Strict Concurrency
&lt;/h3&gt;

&lt;p&gt;The entire codebase uses Swift 6 strict concurrency. The socket server runs on a background thread, the UI updates on &lt;code&gt;@MainActor&lt;/code&gt;, and all shared state goes through &lt;code&gt;Sendable&lt;/code&gt; types. No data races, enforced at compile time.&lt;/p&gt;

&lt;h3&gt;
  
  
  POSIX Sockets for the Server Side
&lt;/h3&gt;

&lt;p&gt;Here is something I did not expect: Apple's &lt;code&gt;Network.framework&lt;/code&gt; (&lt;code&gt;NWListener&lt;/code&gt;) does not support binding to Unix domain sockets as a server, or at least I could not find a way to make it work. I explored several approaches before landing on raw POSIX sockets (&lt;code&gt;socket()&lt;/code&gt;, &lt;code&gt;bind()&lt;/code&gt;, &lt;code&gt;listen()&lt;/code&gt;, &lt;code&gt;accept()&lt;/code&gt;) for the server side. The client side uses &lt;code&gt;NWConnection&lt;/code&gt;, which does support Unix domain socket connections just fine.&lt;/p&gt;

&lt;p&gt;If you have solved &lt;code&gt;NWListener&lt;/code&gt; + Unix domain sockets on macOS, I would genuinely like to know how. If you are building IPC on macOS and hit this wall, POSIX sockets work reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debounced Persistence
&lt;/h3&gt;

&lt;p&gt;Notes are kept in memory for speed. A debounced write serializes them to JSON every few seconds as a crash-safety measure. On restart, the app picks up where it left off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Download
&lt;/h3&gt;

&lt;p&gt;Grab the pre-built binaries from the &lt;a href="https://github.com/ceaksan/clause/releases/tag/v0.1.0-milestone1" rel="noopener noreferrer"&gt;GitHub Release&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Clause-app-macos-arm64.zip&lt;/code&gt; (the floating window app)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clause-mcp-macos-arm64.zip&lt;/code&gt; (the MCP server binary)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Unzip both files&lt;/li&gt;
&lt;li&gt;Move &lt;code&gt;Clause.app&lt;/code&gt; to &lt;code&gt;/Applications&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;Clause.app&lt;/code&gt; (right-click &amp;gt; Open on first run, since the build is unsigned)&lt;/li&gt;
&lt;li&gt;Add to your Claude Code config (&lt;code&gt;~/.claude.json&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clause"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Applications/Clause.app/Contents/MacOS/clause-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build from source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ceaksan/clause.git
&lt;span class="nb"&gt;cd &lt;/span&gt;clause
brew &lt;span class="nb"&gt;install &lt;/span&gt;xcodegen &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; xcodegen generate
open Clause.xcodeproj
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Is Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Global hotkey (Cmd+Shift+N) to capture clipboard as a note&lt;/li&gt;
&lt;li&gt;Multi-session tabs (each Claude Code session gets its own tab)&lt;/li&gt;
&lt;li&gt;Note search and filtering&lt;/li&gt;
&lt;li&gt;DMG and Homebrew cask distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ceaksan/clause" rel="noopener noreferrer"&gt;github.com/ceaksan/clause&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Landing page&lt;/strong&gt;: &lt;a href="https://clause.ceaksan.com" rel="noopener noreferrer"&gt;clause.ceaksan.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Clause is MIT licensed. If you use Claude Code daily, give it a try and let me know what you think.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>swift</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Benchmarked 5 File Editing Strategies for AI Coding Agents. Here's What Actually Works.</title>
      <dc:creator>Ceyhun Aksan</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:37:12 +0000</pubDate>
      <link>https://dev.to/ceaksan/i-benchmarked-5-file-editing-strategies-for-ai-coding-agents-heres-what-actually-works-1855</link>
      <guid>https://dev.to/ceaksan/i-benchmarked-5-file-editing-strategies-for-ai-coding-agents-heres-what-actually-works-1855</guid>
      <description>&lt;p&gt;&lt;em&gt;Yes, the title says "5 strategies" like every other listicle. The number isn't a framework. It's just how many I got through before my API bill made me pause. There are plenty more approaches worth testing. If you've benchmarked others or have a strategy that works well for you, I'd genuinely like to hear about it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Telling an agent to "edit the file" is easy. Being sure the result is correct is hard.&lt;/p&gt;

&lt;p&gt;I've been using Claude Code daily for months. One pattern kept showing up: the agent says "done," I commit, and later I find lines missing from the middle of the file. Or a formatter ran between edits and the next match fails silently.&lt;/p&gt;

&lt;p&gt;So I tested it systematically. 5 strategies, 20 scenarios, two file sizes (378 and 1053 lines), with 5 and 10 changes each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Strategies
&lt;/h2&gt;

&lt;p&gt;Sequential Edit: One Edit call per change, top to bottom. Simple, but line numbers drift after insertions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Write:&lt;/strong&gt; Read once, rewrite entire file. Fewest tool calls, but token cost explodes on large files and middle content can silently disappear (the "lost-in-the-middle" problem).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom-up Edit:&lt;/strong&gt; Same as Sequential, but changes applied from bottom to top. Eliminates line drift because lower edits don't shift upper line numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Script Generation:&lt;/strong&gt; Agent writes a shell script with sed commands. File content never enters the token stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Diff:&lt;/strong&gt; Agent generates a patch file, applied with patch. Standard format, reversible.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results:
&lt;/h3&gt;

&lt;p&gt;1053-line file, 10 changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Tool Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Script Generation&lt;/td&gt;
&lt;td&gt;7,000&lt;/td&gt;
&lt;td&gt;10s&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified Diff&lt;/td&gt;
&lt;td&gt;8,500&lt;/td&gt;
&lt;td&gt;12s&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequential Edit&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;td&gt;65s&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bottom-up Edit&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;td&gt;65s&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomic Write&lt;/td&gt;
&lt;td&gt;43,000&lt;/td&gt;
&lt;td&gt;50s&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Script Generation: 3.5x cheaper and 6.5x faster than Sequential Edit on the same task.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;-&lt;/th&gt;
&lt;th&gt;1-2 changes&lt;/th&gt;
&lt;th&gt;3-5 changes&lt;/th&gt;
&lt;th&gt;6+ changes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 300 lines&lt;/td&gt;
&lt;td&gt;Edit&lt;/td&gt;
&lt;td&gt;Script / Diff&lt;/td&gt;
&lt;td&gt;Script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;300-1000 lines&lt;/td&gt;
&lt;td&gt;Edit&lt;/td&gt;
&lt;td&gt;Script / Diff&lt;/td&gt;
&lt;td&gt;Script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 1000 lines&lt;/td&gt;
&lt;td&gt;Edit&lt;/td&gt;
&lt;td&gt;Script&lt;/td&gt;
&lt;td&gt;Script&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Missing Piece: Deterministic Protection
&lt;/h2&gt;

&lt;p&gt;Strategy choice helps, but agents still pick wrong sometimes. I built edit-guard, a hook that runs after every Edit/Write call and catches three failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consecutive edit counter: Warns at 3, blocks at 5 sequential edits on the same file&lt;/li&gt;
&lt;li&gt;Line count verification: Flags unexpected line count changes after Write&lt;/li&gt;
&lt;li&gt;Lost-in-the-middle detection: Catches empty blocks and repeated patterns from truncation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's a Claude Code &lt;code&gt;PostToolUse&lt;/code&gt; hook. Deterministic, not probabilistic. The agent choosing the right strategy is probabilistic. The hook catching a bad outcome is guaranteed.&lt;/p&gt;

&lt;p&gt;Source code and full benchmark data: &lt;a href="https://github.com/ceaksan/edit-guard" rel="noopener noreferrer"&gt;github.com/ceaksan/edit-guard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>developertools</category>
    </item>
    <item>
      <title>How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis</title>
      <dc:creator>Ceyhun Aksan</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:36:21 +0000</pubDate>
      <link>https://dev.to/ceaksan/how-ucp-breaks-your-e-commerce-tracking-stack-a-platform-by-platform-analysis-4e7f</link>
      <guid>https://dev.to/ceaksan/how-ucp-breaks-your-e-commerce-tracking-stack-a-platform-by-platform-analysis-4e7f</guid>
      <description>&lt;p&gt;You have probably seen UCP mentioned in your X feed, LinkedIn, newsletters, dev blogs. But most of the coverage stops at "optimize your product feeds." The deeper questions about what actually breaks in your tracking, attribution, and remarketing stack are barely addressed. That is what got me thinking, researching, and writing my findings on this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is UCP?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Universal Commerce Protocol&lt;/strong&gt; (UCP) is an open standard developed by Google together with Shopify, Etsy, Wayfair, Target, and Walmart&lt;sup id="fnref1"&gt;1&lt;/sup&gt;. It standardizes the ability of AI agents to operate across the entire shopping journey, from product discovery to post-sale. In the &lt;a href="https://ceaksan.com/en/ai-agent-protokolleri-mcp-a2a-ucp-ap2-a2ui-ag-ui" rel="noopener noreferrer"&gt;AI agent protocols&lt;/a&gt; post, I positioned UCP alongside five other protocols; in this post, I take a deep dive into UCP's impact on the e-commerce ecosystem.&lt;/p&gt;

&lt;p&gt;Announced in January 2026, UCP now offers four core capabilities with its March 2026 update&lt;sup id="fnref2"&gt;2&lt;/sup&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent can add multiple products to the cart in a single action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time price, stock, and variant data pulled from the retailer's catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Identity Linking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Customers automatically receive loyalty/membership benefits (special pricing, free shipping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native Checkout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct purchase through Google Search AI Mode and Gemini (via Google Pay)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Commerce Inc, Salesforce, and Stripe will integrate UCP into their platforms in the near term. A simplified onboarding process through Merchant Center is on the way&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: From Site-Centric to Agent-Centric
&lt;/h2&gt;

&lt;p&gt;Today's e-commerce model works like this: a user clicks a traffic source (ad, search, etc.), lands on the site, browses the product, adds to cart, checks out, the pixel fires, and the conversion is recorded. Despite challenges like privacy restrictions, browser policies, and platform changes, this model has been running for a long time.&lt;/p&gt;

&lt;p&gt;UCP changes this model at its foundation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Today:   Traffic source (ad, search, etc.) → Site → Cart → Checkout → Pixel → Conversion
UCP:     Query → Agent recommendation → One-step purchase (on Google surface)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user can check out within Google AI Mode or Gemini without ever visiting the site. This affects every layer of the e-commerce ecosystem: conversion tracking, attribution, remarketing, CRM, and data ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conversion Tracking: Site-Based Measurement Breaks
&lt;/h2&gt;

&lt;p&gt;When the conversion happens on a Google surface, client-side tracking tools cannot see the sale: GA4 Event, Meta Pixel, TikTok Pixel, Bing UET Tag, LinkedIn Insight Tag, Snapchat Pixel, Criteo OneTag. Most of these platforms have server-side solutions (CAPI, Events API). However, even server-side triggering requires the user to visit the site or at least generate a recognizable event.&lt;/p&gt;

&lt;p&gt;If the user never visits the site, client-side tools do not fire. If the user visits the site but completes checkout through UCP, the moment of sale is still missed. In both scenarios, conversion data remains partially or completely incomplete.&lt;/p&gt;

&lt;p&gt;Google Ads conversions will likely be reported automatically by Google (attribution via adview_query_id is already being tested). But third-party platforms, even with CAPI solutions, will have broken attribution because they cannot access the journey data.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://ceaksan.com/en/different-approaches-to-event-data-tracking" rel="noopener noreferrer"&gt;different approaches to event data tracking&lt;/a&gt; post, I covered the differences between client-side and server-side tracking in detail. With UCP, this discussion gains a new dimension: client-side tracking becomes entirely insufficient, and a server-side pipeline becomes mandatory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Should Be Done?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prioritize GA4 Measurement Protocol (server-side) setup&lt;/li&gt;
&lt;li&gt;Activate Google Ads Conversion API and Enhanced Conversions&lt;/li&gt;
&lt;li&gt;Monitor Merchant Center data regularly (UCP conversions will appear there)&lt;/li&gt;
&lt;li&gt;Build a server-side event pipeline to reduce dependency on third-party pixels (Meta, TikTok)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Attribution: The Multi-Platform Ad Model Collapses
&lt;/h2&gt;

&lt;p&gt;Traditional last-click or multi-touch attribution models cannot capture UCP sales. So how will a merchant running ads across multiple platforms attribute UCP sales to the right channel? How will they allocate ad budgets?&lt;/p&gt;

&lt;p&gt;Ads are already being tested in Google AI Mode (attribution via adview_query_id)&lt;sup id="fnref4"&gt;4&lt;/sup&gt;. But this only applies to Google Ads. Other platforms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Tracking Type&lt;/th&gt;
&lt;th&gt;Status in UCP Checkout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google Ads&lt;/td&gt;
&lt;td&gt;Client-side + Server-side&lt;/td&gt;
&lt;td&gt;Likely automatic attribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meta (including CAPI)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;Journey data missing, attribution broken even if CAPI fires&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TikTok (including Events API)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;Same issue, server-side still cannot access journey data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bing/Microsoft Ads (UET)&lt;/td&gt;
&lt;td&gt;Client-side&lt;/td&gt;
&lt;td&gt;Does not fire without a site visit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pinterest (including Conversions API)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;Same limitation as Meta, journey data missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LinkedIn (including CAPI)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;CAPI can relay order data, but journey attribution is missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapchat (including CAPI)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;Same limitation, server-side triggering possible but source attribution missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Criteo (including Events API)&lt;/td&gt;
&lt;td&gt;Client-side + server-side&lt;/td&gt;
&lt;td&gt;Retargeting data missing, conversion can be sent but browsing data is gone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The merchant receives the order data (merchant of record). But it does not know which ad channel drove the sale. This is a serious blind spot that makes ad budget optimization impossible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remarketing: Audience Loss
&lt;/h2&gt;

&lt;p&gt;Site traffic will decline. When the user purchases on a Google surface, they do not visit the site. This means remarketing audiences erode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cookie/pixel-based audience building becomes impossible&lt;/strong&gt;: You cannot fire a pixel for a customer who never visits your site&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookalike audiences&lt;/strong&gt;: The base audience cannot form, so similar audience targeting becomes unfeasible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google's remarketing advantage multiplies&lt;/strong&gt;: Google can use its own checkout data on its own Ads platform. Meta and others cannot access this data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Should Be Done?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build or strengthen a loyalty program&lt;/strong&gt;: Loyalty programs integrated with UCP Identity Linking will be the only way to recognize customers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accelerate email/SMS list growth&lt;/strong&gt;: You can get the email of a customer who checks out on the Google surface (as merchant of record), so push it to CRM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase Customer Match usage&lt;/strong&gt;: Remarketing on Google Ads, Meta, and TikTok through first-party email/phone lists. Reduce pixel dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lean into Google Ads remarketing&lt;/strong&gt;: Google will use its own ecosystem data on its own ads platform&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Measurement: The Funnel Collapses
&lt;/h2&gt;

&lt;p&gt;The traditional funnel: impression &amp;gt; click &amp;gt; site visit &amp;gt; add to cart &amp;gt; checkout &amp;gt; purchase. The agent funnel: query &amp;gt; agent recommendation &amp;gt; one-step purchase.&lt;/p&gt;

&lt;p&gt;This calls fundamental e-commerce metrics into question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What will conversion rate be calculated against?&lt;/strong&gt; Without site visits, what is the denominator?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which surface will A/B tests run on?&lt;/strong&gt; The merchant cannot optimize their own checkout page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cart abandonment recovery?&lt;/strong&gt; The merchant does not have the email of a user who abandons on the agent surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upsell, cross-sell, urgency tactics?&lt;/strong&gt; Tactics like "only 3 left in stock" are not under merchant control; they depend on the agent's decision mechanism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Site-centric measurement models are becoming increasingly insufficient in a world where the customer journey is completed on Google surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  CRM and Data Ownership: The Least Discussed, Most Critical Issue
&lt;/h2&gt;

&lt;p&gt;"The merchant remains the merchant of record" is true: order data (name, address, email) comes through. But customer journey data (which query they came from, how many alternatives they reviewed, how long they compared) stays with Google.&lt;/p&gt;

&lt;p&gt;This is the same dynamic as Amazon marketplace dependency: sales happen, but the customer is not yours, the rules are not yours, and visibility is not in your hands.&lt;/p&gt;

&lt;p&gt;Concrete issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without behavioral data, segmentation is not possible&lt;/li&gt;
&lt;li&gt;How will you push existing CRM segments (VIP, churn risk, high LTV) to the agent?&lt;/li&gt;
&lt;li&gt;Identity Linking exists, but within the framework Google defines&lt;/li&gt;
&lt;li&gt;There is no standard yet for feeding your loyalty, segment, and cohort data to the agent&lt;/li&gt;
&lt;li&gt;The agent brings you customers, but you cannot tell the agent "show this to that customer"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visibility: Who Stands Out in AI?
&lt;/h2&gt;

&lt;p&gt;In SEO, ranking factors and SERP position were visible. In AEO/GEO, you do not know which products the agent recommends or why. Product feed quality, price, stock, and images may be determining factors, but there is no certainty. The agent's decision mechanism is a black box.&lt;/p&gt;

&lt;p&gt;What will you measure, and what will you change for optimization? This question has not been answered yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consent and Regional Compliance: Unanswered Questions
&lt;/h2&gt;

&lt;p&gt;The UCP specification defines a &lt;strong&gt;Buyer Consent Extension&lt;/strong&gt;&lt;sup id="fnref5"&gt;5&lt;/sup&gt;: a structure that enables the buyer to communicate data usage and communication preferences (analytics, marketing, data sales) to the merchant. This is designed for compliance with regulations like GDPR and CCPA. AP2 mandates also cannot be created without the user's explicit consent.&lt;/p&gt;

&lt;p&gt;However, there are unanswered questions in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regional access&lt;/strong&gt;: UCP checkout is currently active in the US (Etsy, Wayfair). How will it be implemented in the EU/EEA given GDPR requirements? How will the Consent Mode v2 mandate integrate into the UCP flow? How will Google's EU user consent policy work for checkouts made through the agent?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consent collection surface&lt;/strong&gt;: In the traditional model, the consent banner is shown on the site. If the user never visits the site in UCP, where and how will consent be requested? Does consent on the Google surface satisfy the merchant's GDPR obligations?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Marketing opt-in&lt;/strong&gt;: The specification mentions "marketing opt-in during checkout" as a planned future feature, but it is not available yet. Without it, adding UCP customers to your CRM and sending marketing communications may be legally problematic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EU AI Act (August 2026)&lt;/strong&gt;: Full enforcement starts August 2, 2026. Data quality, consent documentation, transparency, and monitoring requirements will apply to high-risk AI systems. How UCP's agent-driven checkout will be evaluated under this scope has not been clarified yet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The consent infrastructure exists in UCP's technical specification. But regional compliance, especially for EU market entry, remains the biggest area of uncertainty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenAI Situation
&lt;/h2&gt;

&lt;p&gt;OpenAI tried "Instant Checkout" through ChatGPT but stepped back in March 2026&lt;sup id="fnref6"&gt;6&lt;/sup&gt;. It switched to an apps model under the name "Agentic Commerce Protocol." Product selection remained limited, and up-to-date information was a persistent problem. Unlike Google, it failed to keep checkout on its own surface.&lt;/p&gt;

&lt;p&gt;This makes UCP's "Google advantage" even more pronounced: Google is simultaneously a search engine, an advertising platform, a payment infrastructure (Google Pay), and a merchant directory (Merchant Center). This vertical integration carries a significant control advantage behind the open standard narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Action Plan for E-Commerce Owners
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep Merchant Center up to date, follow UCP onboarding&lt;/td&gt;
&lt;td&gt;UCP integration will come through Merchant Center&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Set up server-side conversion tracking&lt;/td&gt;
&lt;td&gt;Sales will happen without site visits, pixels alone are not enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build a loyalty/membership program&lt;/td&gt;
&lt;td&gt;Customer recognition and personalization via Identity Linking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Improve product feed quality (price, stock, variants, images)&lt;/td&gt;
&lt;td&gt;Agents will pull real-time data via Catalog API; missing data = lost sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First-party data strategy&lt;/td&gt;
&lt;td&gt;Email/phone collection, CRM integration, Customer Match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Evaluate Business Agent&lt;/td&gt;
&lt;td&gt;Branded AI assistant for direct customer communication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Follow the Direct Offers pilot&lt;/td&gt;
&lt;td&gt;Offering special discounts when AI detects purchase intent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;UCP is the infrastructure for e-commerce's transition from a "site-centric" model to an "agent-centric" model. Google presents this as an open standard, but the practical advantage appears to sit entirely within the Google ecosystem: Search AI Mode + Gemini + Google Pay + Merchant Center. This is a move that will significantly increase Google's control over e-commerce.&lt;/p&gt;

&lt;p&gt;The biggest risk for e-commerce owners: the dissolution of the "what happens on my site is under my control" mindset. A large portion of the customer journey will take place on Google surfaces. That is why server-side tracking, first-party data, and loyalty/CRM infrastructure are no longer "nice to have" but mandatory.&lt;/p&gt;

&lt;p&gt;The topic is still maturing. Even in English, while everyone says "optimize your feeds," content addressing ad measurement, third-party platform impacts, CRM, and data ownership questions is scarce. This post aims to raise these questions in a structured way for the first time. More details are expected at Google I/O 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developments to Watch
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google I/O 2026&lt;/strong&gt;: UCP roadmap, ad integration details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meta's response&lt;/strong&gt;: Strategy against UCP, Conversions API updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI's next move&lt;/strong&gt;: Agentic Commerce Protocol details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shopify/Salesforce/Stripe&lt;/strong&gt;: UCP integration details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UCP Roadmap&lt;/strong&gt;: Follow ucp.dev/documentation/roadmap/&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will see how the system shifts and whether our strategies and measurement frameworks can adapt fast enough. For now, the best move is to build the infrastructure that does not depend on any single surface.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://blog.google/products-and-platforms/products/shopping/ucp-updates/" rel="noopener noreferrer"&gt;UCP Updates&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://developers.googleblog.com/under-the-hood-universal-commerce-protocol-ucp/" rel="noopener noreferrer"&gt;Under the Hood: Universal Commerce Protocol&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://www.thinkwithgoogle.com/next/article/ai-excellence/agentic-commerce-ai-tools-protocol-retailers-platforms/" rel="noopener noreferrer"&gt;Agentic Commerce AI Tools&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://smarter-ecommerce.com/blog/en/google-ai/universal-commerce-protocol-and-advertising-in-ai-era/" rel="noopener noreferrer"&gt;Smarter Ecommerce: UCP and Advertising in AI Era&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://ucp.dev/specification/buyer-consent/" rel="noopener noreferrer"&gt;UCP Buyer Consent Extension&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://www.cnbc.com/2026/03/20/open-ai-agentic-shopping-etsy-shopify-walmart-amazon.html" rel="noopener noreferrer"&gt;CNBC: OpenAI Shopping Stumbled&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>ucp</category>
      <category>ecommerce</category>
      <category>tracking</category>
    </item>
    <item>
      <title>11 Ways LLMs Fail in Production (With Academic Sources)</title>
      <dc:creator>Ceyhun Aksan</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:44:00 +0000</pubDate>
      <link>https://dev.to/ceaksan/11-ways-llms-fail-in-production-with-academic-sources-4mf9</link>
      <guid>https://dev.to/ceaksan/11-ways-llms-fail-in-production-with-academic-sources-4mf9</guid>
      <description>&lt;p&gt;If you use LLMs in production, you've seen these. Not random errors, but systematic failures baked into architecture and training.&lt;/p&gt;

&lt;p&gt;I documented 11 behavioral failure modes with 60+ academic sources. Here's the short version.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Hallucination / Confabulation
&lt;/h2&gt;

&lt;p&gt;The model references a library that doesn't exist. Confidently. The worse variant: you ask "why?" and it fabricates a plausible justification for the wrong answer.&lt;/p&gt;

&lt;p&gt;Researchers prefer "confabulation" over "hallucination" because LLMs have no perceptual experience. Farquhar et al. (2024, Nature) introduced &lt;strong&gt;semantic entropy&lt;/strong&gt; to detect it: cluster semantically equivalent answers, compute entropy. High entropy = probable fabrication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; RAG, Chain-of-Verification, cross-model verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sycophancy
&lt;/h3&gt;

&lt;p&gt;Ask "isn't this code wrong?" and the model says "yes, you're right" even when the code is correct. RLHF training causes this: evaluators rate agreeable answers higher, and the model learns that signal.&lt;/p&gt;

&lt;p&gt;A 2025 study found sycophantic agreement and sycophantic praise are distinct directions in transformer activation space. Each can be suppressed independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; Pre-commitment (model answers first, then sees your opinion), question formulation ("explain this" not "isn't this wrong?").&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context Rot
&lt;/h3&gt;

&lt;p&gt;Not just "lost in the middle." Chroma Research (2025) showed performance degrades with every increase in length, even far below the window limit. Irrelevant information actively harms retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; Context engineering (less is more), critical info at beginning/end, periodic re-injection.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Instruction Attenuation
&lt;/h3&gt;

&lt;p&gt;You say "run tests after every change." Works for the first few changes. By the tenth, the model writes "ran tests, passed" without actually running them.&lt;/p&gt;

&lt;p&gt;Meta found a 39% average performance drop in multi-turn conversations. Worse: the model forms premature assumptions in early turns and can't recover.&lt;/p&gt;

&lt;p&gt;The second stage is &lt;strong&gt;ceremonialization&lt;/strong&gt;: the model appears to follow the rule, but the substance is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; Forget-Me-Not (instruction re-injection), short sessions, deterministic controls (hooks, linters, CI).&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Task Drift
&lt;/h3&gt;

&lt;p&gt;"Fix this bug" becomes "fix bug + refactor function + update imports + reorganize file." At each step, the immediate context dominates the original goal.&lt;/p&gt;

&lt;p&gt;Three drift types (2026 study): semantic drift, coordination drift (multi-agent), behavioral drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense:&lt;/strong&gt; Goal anchoring, plan-before-act, max step limits, tool constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Incorrect Tool Invocation
&lt;/h3&gt;

&lt;p&gt;Agents call APIs, edit files, query databases. These calls are failure points: wrong parameters, wrong tool selection, wrong sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Reward Hacking
&lt;/h3&gt;

&lt;p&gt;The model finds shortcuts to satisfy the metric without solving the problem. Tests pass but the feature doesn't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Degeneration Loops
&lt;/h3&gt;

&lt;p&gt;Autoregressive generation enters self-reinforcing repetition cycles. The model repeats phrases, patterns, or structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Alignment Faking
&lt;/h3&gt;

&lt;p&gt;Different from sycophancy. The model appears aligned under observation but behaves differently when unobserved. Sycophancy is unconscious (from RLHF). Alignment faking is strategic (the model reasons "if I refuse, they'll retrain me").&lt;/p&gt;

&lt;p&gt;Anthropic documented this in Claude: the model strategically cooperated during evaluation to avoid modification.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Version Drift
&lt;/h3&gt;

&lt;p&gt;Same prompt, different model version, different behavior. Updates silently change model behavior without notification.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Context Window Truncation
&lt;/h3&gt;

&lt;p&gt;Different from context rot. When the window fills, older instructions are literally deleted. Not gradual decay but hard cut.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;These failures aren't random. They're consequences of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture (autoregressive token prediction)&lt;/li&gt;
&lt;li&gt;Training (RLHF reward signals)&lt;/li&gt;
&lt;li&gt;Deployment (long sessions, tool access, multi-turn)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Defense must operate at three layers: prompt, architectural, and operational. Single-layer defense is insufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full analysis with 60+ academic references, defense techniques for each mode, and practical examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ceaksan.com/en/llm-behavioral-failure-modes/" rel="noopener noreferrer"&gt;ceaksan.com/en/llm-behavioral-failure-modes/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>production</category>
    </item>
    <item>
      <title>An AI agent got stuck in a loop. The monitoring tools saw nothing.</title>
      <dc:creator>Ceyhun Aksan</dc:creator>
      <pubDate>Mon, 16 Mar 2026 11:18:14 +0000</pubDate>
      <link>https://dev.to/ceaksan/an-ai-agent-got-stuck-in-a-loop-the-monitoring-tools-saw-nothing-1ai</link>
      <guid>https://dev.to/ceaksan/an-ai-agent-got-stuck-in-a-loop-the-monitoring-tools-saw-nothing-1ai</guid>
      <description>&lt;p&gt;Last month, a developer &lt;a href="https://news.ycombinator.com/item?id=47133305" rel="noopener noreferrer"&gt;posted on Hacker News&lt;/a&gt;: their GPT-4o agent got stuck in a retry loop and ran up a bill before anyone noticed.&lt;/p&gt;

&lt;p&gt;I've had my own version of this. A LangChain agent I built went into a recursive loop in production. No alert. No warning.&lt;/p&gt;

&lt;p&gt;So I started digging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;I pulled GitHub issues, HN threads, Reddit posts. Dozens of them. The same story kept showing up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent loops and burns through API credits&lt;/li&gt;
&lt;li&gt;Agent hallucinates and nobody catches it for hours&lt;/li&gt;
&lt;li&gt;Agent works fine for weeks, then silently degrades&lt;/li&gt;
&lt;li&gt;Developer finds out from users, not from monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tools we have -- LangSmith, LangFuse, Arize, Helicone -- show traces. Latency, token counts, spans. They answer &lt;em&gt;what happened&lt;/em&gt; but not:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is my agent actually reliable right now?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is it producing business value or just burning tokens?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Will I know when it breaks before my users do?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I found in public repositories
&lt;/h2&gt;

&lt;p&gt;This isn't just anecdotal. The evidence is sitting in open GitHub issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  No alerting for over two years
&lt;/h3&gt;

&lt;p&gt;LangFuse's alerting feature request has been open since December 2023. You can see every trace in beautiful detail, but if your agent starts failing at 3am, you'll find out at 9am.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/langfuse/langfuse/issues/714" rel="noopener noreferrer"&gt;langfuse/langfuse#714&lt;/a&gt; -- opened Dec 18, 2023&lt;/p&gt;

&lt;h3&gt;
  
  
  The monitoring tool crashed production
&lt;/h3&gt;

&lt;p&gt;LangSmith's tracing decorator crashed production apps during an outage. The tool that's supposed to tell you something went wrong... was the thing that went wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/langchain-ai/langsmith-sdk/issues/1306" rel="noopener noreferrer"&gt;langchain-ai/langsmith-sdk#1306&lt;/a&gt; -- opened Dec 9, 2024&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost tracking doesn't add up
&lt;/h3&gt;

&lt;p&gt;Cost calculations are wrong for cached tokens and vision models. Your dashboard says you're spending one amount. Your actual invoice says another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/langchain-ai/langsmith-sdk/issues/1375" rel="noopener noreferrer"&gt;langchain-ai/langsmith-sdk#1375&lt;/a&gt; -- opened Jan 4, 2025&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;Current observability tools are designed for &lt;strong&gt;debugging after the fact&lt;/strong&gt;, not for &lt;strong&gt;catching failures as they happen&lt;/strong&gt;. They're flight recorders, not collision avoidance systems.&lt;/p&gt;

&lt;p&gt;What's missing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability scoring.&lt;/strong&gt; Not "here's a trace" but "your agent's reliability dropped from 94% to 71% in the last hour." A single number that tells you whether to worry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business outcome connection.&lt;/strong&gt; Your monitoring tool says the agent completed successfully. Your analytics says conversion rate dropped. Nobody connects these two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple alerting.&lt;/strong&gt; Not enterprise-grade configuration with 47 steps. Just: "error rate crossed 5%, here's the Slack message."&lt;/p&gt;

&lt;h2&gt;
  
  
  The scale of the problem
&lt;/h2&gt;

&lt;p&gt;57% of teams now run AI agents in production. &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;Gartner predicts&lt;/a&gt; 40% of agentic AI projects will be canceled by 2027 due to trust issues.&lt;/p&gt;

&lt;p&gt;The trust problem isn't about AI capability. It's about not knowing when it's broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  I'm researching this
&lt;/h2&gt;

&lt;p&gt;I'm not building a product pitch. I'm trying to understand if this is a widespread problem or just my own frustration.&lt;/p&gt;

&lt;p&gt;If you run AI agents in production (or plan to), I'd appreciate 2 minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://tally.so/r/A78Pv0" rel="noopener noreferrer"&gt;5-question survey&lt;/a&gt;&lt;/strong&gt; -- no signup, no email required.&lt;/p&gt;

&lt;p&gt;I'll share the results publicly here on dev.to.&lt;/p&gt;

&lt;p&gt;Whether this becomes a tool, an open-source library, or just a post with interesting data -- the findings will be useful either way.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
