<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Hughes</title>
    <description>The latest articles on DEV Community by Patrick Hughes (@pat9000).</description>
    <link>https://dev.to/pat9000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3763138%2Fa7736e79-1b96-4f55-a9f7-9ddd8775eb09.jpg</url>
      <title>DEV Community: Patrick Hughes</title>
      <link>https://dev.to/pat9000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pat9000"/>
    <language>en</language>
    <item>
      <title>AI Agent Memory: What Actually Works in 2026</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sun, 28 Jun 2026 14:45:17 +0000</pubDate>
      <link>https://dev.to/pat9000/ai-agent-memory-what-actually-works-in-2026-3597</link>
      <guid>https://dev.to/pat9000/ai-agent-memory-what-actually-works-in-2026-3597</guid>
      <description>&lt;p&gt;AI agents forget. They also hallucinate that they remembered. The result is the same: you open a file or check a ledger and the thing the agent swore it wrote is not there.&lt;/p&gt;

&lt;p&gt;The fix is not a bigger vector database. It is a tiny set of durable surfaces plus deterministic checks that run on every "done" claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  The surfaces that survive
&lt;/h2&gt;

&lt;p&gt;Keep memory in plain files the operator can read without a special viewer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Append-only logs (log.md, run-ledger rows, event streams). Newest at top. One line per real event when possible.&lt;/li&gt;
&lt;li&gt;Knowledge wiki with layers (sources, entities, concepts, syntheses). Write sources once. Touch entities on every new signal.&lt;/li&gt;
&lt;li&gt;Queue and Reports. The task files and the proof they produced.&lt;/li&gt;
&lt;li&gt;Published artifacts and frontmatter. The only place that counts is the final output location.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else is cache or scratch. If it is not in one of these, it did not happen for the purpose of later work.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we actually use it
&lt;/h2&gt;

&lt;p&gt;Every scheduled agent writes to the ledger on start and finish. The think checker reads the ledger's declared artifacts and compares them to disk and DB. No "the agent said it was green."&lt;/p&gt;

&lt;p&gt;Vault health and nightshift linters walk the wiki for orphans and contradictions. They repair what they can and surface the rest.&lt;/p&gt;

&lt;p&gt;The digest agent writes a source page for each article and links it into 10+ existing pages. That compounds.&lt;/p&gt;

&lt;p&gt;When an agent claims it updated a file, the claim is recorded and a later pass asserts the bytes are there.&lt;/p&gt;
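
&lt;p&gt;A minimal sketch of that later pass. It assumes a hypothetical ledger format: one JSON object per line with a &lt;code&gt;run&lt;/code&gt; name and an &lt;code&gt;artifacts&lt;/code&gt; list of file paths. Those field names and the &lt;code&gt;missing_artifacts&lt;/code&gt; helper are illustrative, not the real checker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path


def missing_artifacts(ledger_path):
    """Return (run, path) pairs whose declared artifact is absent or empty."""
    missing = []
    for line in Path(ledger_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        row = json.loads(line)  # assumed shape: {"run": ..., "artifacts": [...]}
        for artifact in row.get("artifacts", []):
            path = Path(artifact)
            # A claim holds only if the file exists and has bytes in it.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append((row.get("run"), artifact))
    return missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty return value means every declared artifact is on disk. Anything else is a claim that did not land.&lt;/p&gt;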

&lt;h2&gt;
  
  
  What fails in practice
&lt;/h2&gt;

&lt;p&gt;Long context windows. The model predicts the next token, not the prior state. After 30 turns it confidently "remembers" a step that was overwritten.&lt;/p&gt;

&lt;p&gt;Vector stores without grounding. They return plausible passages. You still have to open the file to know if the patch landed.&lt;/p&gt;

&lt;p&gt;Shared mutable state without locks. Two agents write the same report. The second clobbers the first. The operator sees one version and assumes the story is complete.&lt;/p&gt;

&lt;p&gt;No TTLs. Old drafts, stale ideas, and resolved requests sit forever and pollute search.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules that keep it small
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;One line per log entry. If it needs more, link to a report.&lt;/li&gt;
&lt;li&gt;Sources are append-only. Never edit a past source page.&lt;/li&gt;
&lt;li&gt;Every completion that mutates state records a falsifiable claim.&lt;/li&gt;
&lt;li&gt;Every claim is re-checked by an independent pass (claims checker, think, doctor).&lt;/li&gt;
&lt;li&gt;Drafts in Queue have TTLs. Approved drafts in Review have age gates. Old stuff moves or dies.&lt;/li&gt;
&lt;li&gt;The operator can always cat the file and understand the state without running the agent.&lt;/li&gt;
&lt;/ol&gt;
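
&lt;p&gt;Rule 5 can be a very small script. This sketch assumes drafts are plain files in one directory and that file modification time is a fair proxy for age. The &lt;code&gt;expired_drafts&lt;/code&gt; name and the TTL value passed in are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import bisect
import time
from pathlib import Path

DAY_SECONDS = 24 * 60 * 60


def expired_drafts(queue_dir, ttl_days, now=None):
    """Return names of files in queue_dir last modified at or before the TTL cutoff."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * DAY_SECONDS
    aged = sorted((p.stat().st_mtime, p.name) for p in Path(queue_dir).iterdir() if p.is_file())
    # Oldest first, so everything left of the split point has outlived its TTL.
    split = bisect.bisect_right([mtime for mtime, _ in aged], cutoff)
    return [name for _, name in aged[:split]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The function only reports. Moving or deleting the expired files is left to a separate step so the sweep itself stays safe to run at any time.&lt;/p&gt;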

&lt;h2&gt;
  
  
  When fancy memory is a trap
&lt;/h2&gt;

&lt;p&gt;If you are building multi-agent research loops with long retrieval chains, you may need embeddings and rerankers. For a one-person holdco that ships one artifact per day, the cost in maintenance exceeds the gain.&lt;/p&gt;

&lt;p&gt;Files + git history (for code) + the structured event/claim layer + simple date folders beat a custom memory service 95% of the time. You can read them on a plane with no wifi.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we ship instead
&lt;/h2&gt;

&lt;p&gt;We ship the checker, not the memory bank.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;brain think --heal&lt;/code&gt; exists because the ledger said the autopublisher ran but the published count was zero. The heal path wrote the post, ran the sub-QA, posted it, and left the proof in &lt;code&gt;Published/&lt;/code&gt; and at the live URL.&lt;/p&gt;

&lt;p&gt;The memory that matters is the one a script can query at 07:30 and decide whether the thing that should have shipped actually shipped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accompanying prompt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What the prompt does:&lt;/strong&gt; Audits an agent workflow and turns vague memory into plain-file state, append-only events, and checks that prove each completion claim.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are auditing the memory layer for an AI agent workflow.

Your job is to replace vague "the agent remembers" claims with durable state that a human can inspect and a script can verify.

Work in five passes:

1. List every place the agent currently stores state. Separate durable files, append-only ledgers, databases, caches, prompts, and scratch output.
2. For each surface, answer: who writes it, who reads it, what proves it is current, and what happens if two agents write at once.
3. Find any completion claim that is not backed by a file, ledger row, API readback, test result, or live URL.
4. Propose the smallest durable memory system that would survive a cold restart: plain files first, append-only events second, database rows only where they are already part of the product path.
5. Write the exact checks that should run before an agent is allowed to say "done."

Return:

- A table of current memory surfaces and their risk.
- A short list of state that should be deleted, archived, or ignored.
- The proposed durable memory contract.
- The verification commands or scripts that prove the contract.
- One example "done" claim rewritten as a falsifiable claim.

Constraints:

- Prefer files a senior engineer can open and read.
- Do not add a vector database unless the workflow has a specific retrieval failure that plain files cannot solve.
- Do not trust model output as evidence.
- Every important claim needs a deterministic readback.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;p&gt;See the full set of checks in &lt;code&gt;config/think/check.py&lt;/code&gt; and &lt;code&gt;config/claims/check.py&lt;/code&gt;. When an agent tells you it is done, make it prove it with a file you can open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/ai-agent-memory-what-actually-works-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-memory-what-actually-works-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-memory-what-actually-works-2026&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>memory</category>
      <category>agentops</category>
      <category>onepersoncompany</category>
    </item>
    <item>
      <title>What Anthropic's MITRE ATT&amp;CK report means for solo AI builders</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 25 Jun 2026 21:31:09 +0000</pubDate>
      <link>https://dev.to/pat9000/what-anthropics-mitre-attck-report-means-for-solo-ai-builders-2dlo</link>
      <guid>https://dev.to/pat9000/what-anthropics-mitre-attck-report-means-for-solo-ai-builders-2dlo</guid>
      <description>&lt;h1&gt;
  
  
  What Anthropic's MITRE ATT&amp;amp;CK report means for solo AI builders
&lt;/h1&gt;

&lt;p&gt;Anthropic just published a year of cyber threat intelligence. They mapped 832 banned accounts to the MITRE ATT&amp;amp;CK framework. Co-released with the Verizon 2026 DBIR, it is the most authoritative look at how people actually misuse frontier models for hacking. &lt;/p&gt;

&lt;p&gt;For a solo builder shipping AI features or agents, this report is a glimpse into the future of your own threat model. You do not need to read all 40 pages. You just need to know how the shift from "text generation" to "agentic action" changes what you have to protect.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Malware writing is the floor, lateral movement is the ceiling
&lt;/h2&gt;

&lt;p&gt;The headline number is that 67.3% of misused accounts used Claude for malware writing. This is not surprising. AI is very good at writing code, and that includes malicious code. If an attacker wants a Python script that scrapes a specific site or a Bash command that finds open ports, the model will help them.&lt;/p&gt;

&lt;p&gt;But the statistic that should keep you awake is the 6.5% of accounts used for lateral movement. &lt;/p&gt;

&lt;p&gt;In a traditional attack, lateral movement is a manual, labor-intensive process. Once an attacker gets a foothold in a network, they have to spend hours or days exploring, discovering accounts, and trying to escalate their privileges. It is a slow, human-driven game. &lt;/p&gt;

&lt;p&gt;AI is changing that. Attackers are now using models to automate the discovery and navigation of compromised systems. They are moving deeper into the kill chain with real-time decisions made by the model. &lt;/p&gt;

&lt;p&gt;The takeaway for you: your agent's blast radius matters more than your input filter. You can spend weeks hardening your system prompt to prevent "jailbreaks," but if an attacker finds one crack, the model itself can now help them navigate your entire stack. If your agent has access to your Supabase keys or your Vercel environment variables, the model can help the attacker find and exploit those connections faster than any human could.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. MITRE ATT&amp;amp;CK does not capture agentic orchestration
&lt;/h2&gt;

&lt;p&gt;MITRE ATT&amp;amp;CK is the standard framework that underpins almost every enterprise security operations center in the world. It provides a common language for describing how attacks happen. But Anthropic explicitly noted that the current framework is being outgrown. It does not yet capture agentic orchestration.&lt;/p&gt;

&lt;p&gt;When an attacker chains multiple stages of an attack together with minimal human input, they are operating past the edge of traditional security models. They are not just using a tool; they are running an autonomous campaign.&lt;/p&gt;

&lt;p&gt;If you are shipping agents that can plan and execute multi-step tasks, you are operating in this same territory. You cannot rely on standard security frameworks to describe your risk. You have to think about the autonomous runway you give your models. If an agent can run for hours without a human in the loop, that is a window of opportunity for an attack to go from a minor incident to a total database wipe.&lt;/p&gt;

&lt;p&gt;The report shows that medium-to-high-risk actors grew from 33% to 56% of the banned accounts in just six months. The mix of attackers is getting more dangerous. They are concentrating their AI use on the operationally hard parts of an attack, not the easy parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Inference-time safeguards raise the defensive floor
&lt;/h2&gt;

&lt;p&gt;There is some good news. Anthropic is deploying cyber safeguards directly inside the model layer. They call this Project Glasswing. It means the model itself is trained to detect malware development, exfiltration patterns, and reconnaissance at the moment of inference.&lt;/p&gt;

&lt;p&gt;For a solo builder, this is a massive advantage. Staying on a managed frontier API like Claude gives you a defensive uplift that you cannot get with a local model. You get a team of security researchers working to stop your users (or your compromised agents) from doing the worst things before the call even finishes.&lt;/p&gt;

&lt;p&gt;Uncensored local models have their place, but for production agent work where the model has access to your infrastructure, a managed API with built-in safeguards is the rational choice. It raises the floor of what an attacker can force your agent to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cheapest control: Runtime guardrails
&lt;/h2&gt;

&lt;p&gt;You do not need a nation-state security budget or a 50-person security team to protect your work. You need concrete, simple controls at the call site.&lt;/p&gt;

&lt;p&gt;The Anthropic report shows that the threat is moving toward "autonomous orchestration." The best defense against an autonomous threat is an autonomous limit. &lt;/p&gt;

&lt;p&gt;Runtime guardrails are the cheapest way to bound an agent's runway. Budget caps, token limits, and rate enforcement are not about stopping a sophisticated hacker. They are about limiting how much damage a compromised agent can do before you notice. &lt;/p&gt;

&lt;p&gt;If a compromised agent gets stuck in a loop or tries to exfiltrate your entire database, a budget guard or a loop guard will kill the run in seconds. It turns a potential disaster into a small, recorded incident in your logs.&lt;/p&gt;
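
&lt;p&gt;To make the idea concrete, here is a generic sketch of a step cap, a budget guard, and a loop guard in plain Python. It is not AgentGuard's API. The &lt;code&gt;guarded_run&lt;/code&gt; function, the &lt;code&gt;GuardTripped&lt;/code&gt; exception, and the convention that each step reports an action name and a cost are all assumptions made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter


class GuardTripped(RuntimeError):
    """Raised when a run hits one of its hard limits."""


def guarded_run(step_fn, max_steps, budget_usd, max_repeats):
    """Call step_fn until it returns None or a limit trips.

    step_fn() returns (action, cost_usd) for each step, or None when finished.
    Returns the total spend of a run that finished on its own.
    """
    remaining = budget_usd
    seen = Counter()
    for _ in range(max_steps):  # step cap: the loop cannot outrun max_steps
        result = step_fn()
        if result is None:
            return budget_usd - remaining
        action, cost_usd = result
        # Clamp at zero so "nothing left" is a single, unambiguous state.
        remaining = max(0.0, remaining - cost_usd)
        if remaining == 0.0:
            raise GuardTripped("budget exhausted")
        seen[action] += 1
        if seen[action] == max_repeats:
            raise GuardTripped("loop detected: " + action)
    raise GuardTripped("step limit reached")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The limits live outside the agent on purpose. The model never gets a vote on whether the run continues.&lt;/p&gt;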

&lt;p&gt;If you want to start guarding your agent loops today, check out &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt;. It is a local-first SDK that adds these hard stops to your Python agents with a few lines of code.&lt;/p&gt;

&lt;p&gt;Protect your runway. Don't let an autonomous agent turn a small bug into a big bill.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/anthropic-mitre-attck-solo-builders-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=anthropic-mitre-attck-solo-builders-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>aisecurity</category>
      <category>agents</category>
      <category>anthropic</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>AI Agent Memory in 2026: How It Works and When to Use It</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:45:10 +0000</pubDate>
      <link>https://dev.to/pat9000/ai-agent-memory-in-2026-how-it-works-and-when-to-use-it-e6m</link>
      <guid>https://dev.to/pat9000/ai-agent-memory-in-2026-how-it-works-and-when-to-use-it-e6m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4vyl8gwa9554yxm6molc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4vyl8gwa9554yxm6molc.jpg" alt="AI Agent Memory infographic" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  AI Agent Memory in 2026: How It Works and When to Use It
&lt;/h1&gt;

&lt;p&gt;Most agent demos forget everything between calls. That works for toy scripts. It breaks the moment you want an agent that improves over a week of work.&lt;/p&gt;

&lt;p&gt;Memory is not one thing. It is several different stores that solve different failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Short-term context vs long-term recall
&lt;/h2&gt;

&lt;p&gt;The context window is your agent's working memory. It is fast and expensive. Keep it for the current task only.&lt;/p&gt;

&lt;p&gt;For anything that spans sessions you need retrieval.&lt;/p&gt;

&lt;p&gt;Vector stores are the current default. Embed past steps, tool results, and user feedback. Retrieve the top-k relevant chunks when the agent starts a new step.&lt;/p&gt;

&lt;p&gt;They are good for semantic similarity. They are bad at exact sequences and time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Episodic memory
&lt;/h2&gt;

&lt;p&gt;Store the actual trace: "on June 20 at step 4 I called the pricing API and got 429, then retried with backoff".&lt;/p&gt;

&lt;p&gt;This is gold for debugging and for the agent to avoid repeating the same mistake.&lt;/p&gt;

&lt;p&gt;A simple JSONL file or a small SQLite table works on consumer hardware. No fancy embedding required for the first version.&lt;/p&gt;
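
&lt;p&gt;A first version can look something like the sketch below. The field names (&lt;code&gt;step&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;outcome&lt;/code&gt;, &lt;code&gt;note&lt;/code&gt;) and both helper functions are assumptions, not a fixed schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from datetime import datetime, timezone
from pathlib import Path


def record_step(trace_path, step, tool, outcome, note=""):
    """Append one episode line; existing lines are never rewritten."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "tool": tool,
        "outcome": outcome,
        "note": note,
    }
    with Path(trace_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")


def past_failures(trace_path, tool):
    """Return earlier non-ok episodes for a tool so the agent can review them first."""
    path = Path(trace_path)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    episodes = [json.loads(line) for line in lines if line.strip()]
    return [e for e in episodes if e["tool"] == tool and e["outcome"] != "ok"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Before calling a tool, the agent can pull &lt;code&gt;past_failures&lt;/code&gt; for it and put the last few notes into the prompt. That is often enough to stop the same 429 from happening twice.&lt;/p&gt;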

&lt;h2&gt;
  
  
  Persistent state
&lt;/h2&gt;

&lt;p&gt;Some agents need durable facts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The user's preferred region is eu-west-1"&lt;/li&gt;
&lt;li&gt;"Last successful backup was at 2026-06-23T14:12Z"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Put this in a key-value store or a small Postgres. Update it explicitly when the agent learns something trustworthy.&lt;/p&gt;

&lt;p&gt;Do not trust the LLM to remember it correctly inside the context.&lt;/p&gt;
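
&lt;p&gt;As a rough sketch, here is the key-value version using SQLite from the Python standard library as a stand-in for whatever store you already run. The table layout and the function names are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3


def open_facts(db_path):
    """Open (or create) a tiny key-value table for durable facts."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL, updated_at TEXT NOT NULL)"
    )
    return conn


def set_fact(conn, key, value):
    """The explicit write path: facts change here and nowhere else."""
    conn.execute(
        "INSERT OR REPLACE INTO facts (key, value, updated_at) "
        "VALUES (?, ?, datetime('now'))",
        (key, value),
    )
    conn.commit()


def get_fact(conn, key, default=None):
    """Read a fact back from the store instead of from the model's context."""
    row = conn.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
    return default if row is None else row[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The agent reads facts with &lt;code&gt;get_fact&lt;/code&gt; at the start of a run and only calls &lt;code&gt;set_fact&lt;/code&gt; after something trustworthy confirms the new value.&lt;/p&gt;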

&lt;h2&gt;
  
  
  When to add each layer
&lt;/h2&gt;

&lt;p&gt;Start with good system prompts and short context.&lt;/p&gt;

&lt;p&gt;Add vector retrieval when the agent needs to reference past research or documentation.&lt;/p&gt;

&lt;p&gt;Add episodic traces when you see the agent repeating the same errors across runs.&lt;/p&gt;

&lt;p&gt;Add persistent facts when user preferences or long-running state actually matter.&lt;/p&gt;

&lt;p&gt;The goal is not maximum memory. The goal is the smallest memory surface that makes the agent reliable for the job.&lt;/p&gt;

&lt;p&gt;Most production agents I have shipped use two or three of these stores. I never added all of them at once until the pain was real.&lt;/p&gt;

&lt;p&gt;If you are building agents that run for days or weeks, memory design is the difference between a demo and something you can trust overnight.&lt;/p&gt;

&lt;p&gt;Ready to build your own reliable AI agents with proper memory? Start with AgentGuard: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/ai-agent-memory-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-memory-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>aiagents</category>
      <category>memory</category>
      <category>localllms</category>
      <category>agentarchitecture</category>
    </item>
    <item>
      <title>Your AI Agent Says "Done." Make It Prove It.</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Wed, 24 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/your-ai-agent-says-done-make-it-prove-it-390l</link>
      <guid>https://dev.to/pat9000/your-ai-agent-says-done-make-it-prove-it-390l</guid>
      <description>&lt;p&gt;Last week one of my agents reported that it had logged a new vendor in my Vendor Map file. Clean run. No error. The summary said done. I opened the file. The vendor was not there. Nothing crashed. The agent simply believed it finished work it never did.&lt;/p&gt;

&lt;p&gt;AI agents routinely report tasks as complete that were never actually done. A clean exit code and a "done" message prove the agent finished running, not that the work landed. The fix is to treat every reported completion as a falsifiable claim, then run a deterministic checker that verifies each claim against the real file, frontmatter, or command output before you trust it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do AI agents report work they didn't do?
&lt;/h2&gt;

&lt;p&gt;Because the model is predicting a plausible ending, not checking reality. When an agent writes "done," it is generating the most likely next sentence given a transcript that looks like finished work. That is not the same as opening the file and confirming the change is there.&lt;/p&gt;

&lt;p&gt;This gets worse the longer the session runs. The agent edits, gets interrupted, edits again, loses track of which write actually committed, and signs off confident. I have watched Claude, Codex, and my own scheduled agents all do it. The failure is not laziness. It is that "I did X" is cheap to say, and the model has no built-in cost for saying it falsely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the difference between a log line and proof?
&lt;/h2&gt;

&lt;p&gt;A log line is a statement. Proof is a check against the world.&lt;/p&gt;

&lt;p&gt;"Logged the vendor in Vendor Map" is a statement. It costs nothing and it can be wrong. Proof is: open Vendor Map, search for the vendor name, confirm the line exists. One of those you can automate. The other you have to trust. The whole problem is that we keep trusting the statement and calling it proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does a falsifiable claim look like?
&lt;/h2&gt;

&lt;p&gt;A falsifiable claim is a completion report written so a script can prove it false. Not prose. A small, fixed grammar.&lt;/p&gt;

&lt;p&gt;When one of my agents reports it changed something, it has to record the claim in a structured form:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;file_contains&lt;/code&gt;: this file now contains this string&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path_moved&lt;/code&gt;: this file moved from here to there&lt;/li&gt;
&lt;li&gt;&lt;code&gt;frontmatter&lt;/code&gt;: this file's frontmatter field equals this value&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glob_count&lt;/code&gt;: this many files match this pattern&lt;/li&gt;
&lt;li&gt;&lt;code&gt;command&lt;/code&gt;: this command, run under a locked directory, exits clean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the agent cannot phrase its "done" as one of those, it does not get to call the work done. The constraint forces honesty. Vague claims are exactly the ones that hide false completions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you check a claim automatically?
&lt;/h2&gt;

&lt;p&gt;You write a checker that reads the claims and tests each one. Mine appends every claim to a daily file, and a small script walks the list. &lt;code&gt;file_contains&lt;/code&gt; opens the file and greps. &lt;code&gt;path_moved&lt;/code&gt; confirms the old path is gone and the new one exists. &lt;code&gt;frontmatter&lt;/code&gt; parses the field and compares. &lt;code&gt;command&lt;/code&gt; runs and checks the exit code.&lt;/p&gt;

&lt;p&gt;If every claim holds, the report is green. If one fails, the report is red and names the claim that lied. The session does not get to mark itself complete until the checker passes. That single rule (completion is legitimate only after the checker is green) is what catches the vendor-map class of bug.&lt;/p&gt;
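
&lt;p&gt;Here is a minimal sketch of such a checker, not my production one. It assumes each claim is a dict with a &lt;code&gt;kind&lt;/code&gt; key plus kind-specific fields (&lt;code&gt;path&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;src&lt;/code&gt;, &lt;code&gt;dst&lt;/code&gt;, &lt;code&gt;field&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, &lt;code&gt;pattern&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;argv&lt;/code&gt;, &lt;code&gt;cwd&lt;/code&gt;). Those names are hypothetical, the frontmatter parser only handles flat &lt;code&gt;key: value&lt;/code&gt; lines, and "locked directory" is read here as a fixed working directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import glob
import subprocess
from pathlib import Path


def read_frontmatter(path):
    """Parse simple 'key: value' lines between the leading '---' markers."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    fields = {}
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def check_claim(claim):
    """Return True only if the claim matches the filesystem right now."""
    kind = claim["kind"]
    if kind == "file_contains":
        path = Path(claim["path"])
        return path.is_file() and claim["text"] in path.read_text(encoding="utf-8")
    if kind == "path_moved":
        return not Path(claim["src"]).exists() and Path(claim["dst"]).exists()
    if kind == "frontmatter":
        path = Path(claim["path"])
        return path.is_file() and read_frontmatter(path).get(claim["field"]) == claim["value"]
    if kind == "glob_count":
        return len(glob.glob(claim["pattern"])) == claim["count"]
    if kind == "command":
        done = subprocess.run(claim["argv"], cwd=claim["cwd"], capture_output=True)
        return done.returncode == 0
    return False  # unknown claim kinds never pass


def run_checker(claims):
    """Green only when every claim holds; otherwise return the claims that failed."""
    failed = [claim for claim in claims if not check_claim(claim)]
    return ("green" if not failed else "red", failed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unknown claim kinds fail by default. A checker that passes what it does not understand is just another agent saying "done."&lt;/p&gt;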

&lt;h2&gt;
  
  
  What happens when a claim turns out false?
&lt;/h2&gt;

&lt;p&gt;The report goes red and tells me which claim failed. That is the point. A false "done" used to slip through silently and surface days later when I went looking for work that was never there. Now it surfaces the same day, with the exact claim that did not match reality.&lt;/p&gt;

&lt;p&gt;I keep the claims narrow on purpose. A free-form opinion ("this approach is better") carries no check and is recorded but not verified. Only falsifiable statements get tested. You cannot verify a judgment, but you can always verify "this file contains this string."&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does this save you the most?
&lt;/h2&gt;

&lt;p&gt;Anywhere an agent runs unattended. A coding agent that says it added a test. A nightly job that says it updated a report. A fleet of scheduled tasks reporting completion while you sleep. The longer the gap between the claim and the moment you check it, the more a false "done" costs you.&lt;/p&gt;

&lt;p&gt;The principle is the same one behind every external guard you put on an agent: do not trust the agent's self-report, enforce reality from outside. Claims-checking enforces that the work happened. The other half is enforcing that the work did not cost more than it should.&lt;/p&gt;

&lt;p&gt;That is the gap AgentGuard fills on the spend side. It wraps an agent with a hard budget, token, and rate ceiling, so a confident agent that decides to retry all night stops at a limit instead of draining your account. Same idea as claims-checking, pointed at cost: don't trust, enforce. Install it free with &lt;code&gt;pip install agentguard47&lt;/code&gt;, or read how it works at &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/ai-agent-claims-done-verify-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-claims-done-verify-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>aiagents</category>
      <category>agentops</category>
      <category>verification</category>
      <category>onepersoncompany</category>
    </item>
    <item>
      <title>Give Your AI Agents an Append-Only Event Log</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/give-your-ai-agents-an-append-only-event-log-1ncp</link>
      <guid>https://dev.to/pat9000/give-your-ai-agents-an-append-only-event-log-1ncp</guid>
      <description>&lt;p&gt;Your AI agent finished its run. The exit code was zero. The log says done. But did it actually do the work? Most agent setups cannot answer that. They store the last state and nothing about how they got there.&lt;/p&gt;

&lt;p&gt;An append-only event log records every start, finish, and lock as a separate timestamped line that is never edited or deleted. Because the log is immutable, you can replay it to reconstruct exactly what your agent did at any point in time. This catches crashed runs and stuck locks that a status field alone hides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why can't you tell what your agent actually did?
&lt;/h2&gt;

&lt;p&gt;Status fields lie by omission. A row that says &lt;code&gt;status: done&lt;/code&gt; tells you the final answer, not the path. If the agent crashed halfway and a later run overwrote the row, the crash is gone. You see green. You shipped nothing.&lt;/p&gt;

&lt;p&gt;I hit this building the brain that runs my one-person company. A scheduled task would die mid-run, the next run would stamp a fresh status, and the morning report showed everything fine. The state was current. The history was a lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an append-only event log?
&lt;/h2&gt;

&lt;p&gt;It is a file you only ever add lines to. Never edit. Never delete. Each line is one event: what happened, when, and which run or lock it belongs to. I write one JSONL file per day at &lt;code&gt;Reports/Events/2026-06-22.jsonl&lt;/code&gt;. One event per line. New events go at the bottom.&lt;/p&gt;

&lt;p&gt;The rule is the whole point. Mutable state answers "what is true now." An immutable log answers "what happened, in order." You need both, and most setups only keep the first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Isn't this just logging?
&lt;/h2&gt;

&lt;p&gt;Logs are for humans to read after something breaks. An event log is structured data a program reads to make a decision. The difference is the schema. Each event has fixed fields (type, timestamp, run id) so a checker can fold them without parsing prose.&lt;/p&gt;

&lt;p&gt;Plain logs also get rotated, truncated, and edited. The append-only rule is a promise: this record is complete and ordered. You can build automated checks on that promise. You cannot build them on a log file that some cleanup job trims every week.&lt;/p&gt;

&lt;h2&gt;
  
  
  What events should an agent emit?
&lt;/h2&gt;

&lt;p&gt;Start with two pairs. Every run emits &lt;code&gt;run.start&lt;/code&gt; when it begins and &lt;code&gt;run.finish&lt;/code&gt; when it ends. Every lock emits &lt;code&gt;lock.acquire&lt;/code&gt; when taken and &lt;code&gt;lock.release&lt;/code&gt; when freed. That is four event types and it already covers the failures that hurt most.&lt;/p&gt;

&lt;p&gt;I wired this into one shared include that every scheduled task already loads, so I added zero lines to any individual agent. The instrumentation lives in one place. Every runner got it for free.&lt;/p&gt;
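
&lt;p&gt;The emit side can be a single function. This is a sketch under a few assumptions: the &lt;code&gt;emit&lt;/code&gt; name, the &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and &lt;code&gt;run_id&lt;/code&gt; field names, and the choice of UTC for both the timestamp and the daily file name are mine for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from datetime import datetime, timezone
from pathlib import Path


def emit(events_dir, event_type, run_id, **fields):
    """Append one event to today's JSONL file and return the event."""
    now = datetime.now(timezone.utc)
    event = {"type": event_type, "ts": now.isoformat(), "run_id": run_id, **fields}
    day_file = Path(events_dir) / (now.strftime("%Y-%m-%d") + ".jsonl")
    day_file.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file; nothing is rewritten.
    with day_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return event
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A runner would call &lt;code&gt;emit("Reports/Events", "run.start", run_id)&lt;/code&gt; on entry and the matching &lt;code&gt;run.finish&lt;/code&gt; on exit. A lock event passes the lock name as an extra field.&lt;/p&gt;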

&lt;h2&gt;
  
  
  How do you reconstruct state from events?
&lt;/h2&gt;

&lt;p&gt;You read the log top to bottom and fold each event into a running picture. A &lt;code&gt;run.start&lt;/code&gt; with no matching &lt;code&gt;run.finish&lt;/code&gt; means a run that never ended. A &lt;code&gt;lock.acquire&lt;/code&gt; with no &lt;code&gt;lock.release&lt;/code&gt; and a dead process means a stuck lock.&lt;/p&gt;

&lt;p&gt;I built a &lt;code&gt;replay --at &amp;lt;timestamp&amp;gt;&lt;/code&gt; command that stops folding at any instant and prints the system as it stood right then. When something breaks at 3am, I do not guess. I replay 3am and look.&lt;/p&gt;
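
&lt;p&gt;The fold itself is short. The sketch below is an illustration rather than the actual replay command. It assumes events carry &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and &lt;code&gt;run_id&lt;/code&gt; fields, that lock events add a hypothetical &lt;code&gt;lock&lt;/code&gt; field, and that every timestamp is an ISO 8601 string in the same UTC format so plain string comparison orders them correctly. It does not check whether the owning process is still alive.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import bisect
import json


def fold(lines, at=None):
    """Fold JSONL event lines into the runs still open and the locks still held.

    With `at` set, events after that timestamp are ignored, which replays
    the state as of that instant.
    """
    events = [json.loads(line) for line in lines if line.strip()]
    if at is not None:
        # The log is append-only, so timestamps are already in order and
        # bisect finds where the events at or before the replay instant end.
        cut = bisect.bisect_right([e["ts"] for e in events], at)
        events = events[:cut]
    open_runs, held_locks = set(), set()
    for event in events:
        kind = event["type"]
        if kind == "run.start":
            open_runs.add(event["run_id"])
        elif kind == "run.finish":
            open_runs.discard(event["run_id"])
        elif kind == "lock.acquire":
            held_locks.add(event["lock"])
        elif kind == "lock.release":
            held_locks.discard(event["lock"])
    return {"open_runs": sorted(open_runs), "held_locks": sorted(held_locks)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run it over a finished day with no &lt;code&gt;at&lt;/code&gt; value and anything left in &lt;code&gt;open_runs&lt;/code&gt; or &lt;code&gt;held_locks&lt;/code&gt; is a candidate orphan or leak.&lt;/p&gt;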

&lt;h2&gt;
  
  
  What does this catch that a status field never could?
&lt;/h2&gt;

&lt;p&gt;Orphaned runs. A task that started, crashed, and left no finish line. The status field had already moved on. The event log still had the dangling start, so the check flagged it.&lt;/p&gt;

&lt;p&gt;Leaked locks. A run grabbed a lock, died, and never released it. The next run blocked forever and looked "pending," not failed. The acquire-with-no-release pattern made it obvious.&lt;/p&gt;

&lt;p&gt;Parity gaps. The finish-only ledger and the event stream should agree on how many runs happened. When they disagree, one of them is missing reality. The log is the tiebreaker because it cannot be rewritten.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start small
&lt;/h2&gt;

&lt;p&gt;You do not need a queue, a message broker, or a new dependency. A flat JSONL file and an append are enough to begin. Emit start and finish for your runs. Add lock events when you have shared resources. Write a tiny reader that flags any start without a finish.&lt;/p&gt;
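&lt;p&gt;As a rough starting point, the emitter and the tiny reader can fit in one short Python file. This is a sketch under assumptions, not my production include: the file name, the &lt;code&gt;emit&lt;/code&gt; and &lt;code&gt;unfinished_runs&lt;/code&gt; helpers, and the event fields are all made up for the example.&lt;/p&gt;

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def emit(path, event_type, run_id):
    """Append one event as a single JSON line; earlier lines are never touched."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "id": run_id,
    }
    with open(path, "a", encoding="utf-8") as log:  # "a" = append-only writes
        log.write(json.dumps(event) + "\n")


def unfinished_runs(path):
    """Return ids that have a run.start with no later run.finish."""
    open_runs = set()
    with open(path, encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            if event["type"] == "run.start":
                open_runs.add(event["id"])
            elif event["type"] == "run.finish":
                open_runs.discard(event["id"])
    return sorted(open_runs)


# Demo in a throwaway directory: "digest" starts and never finishes.
log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
emit(log_path, "run.start", "digest")
emit(log_path, "run.start", "review")
emit(log_path, "run.finish", "review")
print(unfinished_runs(log_path))
```

&lt;p&gt;Opening the file in append mode keeps the code honest about the rule: there is no code path here that can rewrite an old event.&lt;/p&gt;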

&lt;p&gt;The discipline that makes it work is the append-only rule. The second you let code edit old events, you are back to a status field with extra steps. Keep it immutable and the log will tell you the truth your dashboard hides.&lt;/p&gt;

&lt;p&gt;If you want runtime limits on top of that visibility, AgentGuard caps budget, tokens, and request rate per agent so a stuck loop stops before it drains your account: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/ai-agent-event-log-observability-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-event-log-observability-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-agent-event-log-observability-2026&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>observability</category>
      <category>eventsourcing</category>
      <category>onepersoncompany</category>
    </item>
    <item>
      <title>A self-healing system can't heal an empty queue</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sun, 21 Jun 2026 14:45:09 +0000</pubDate>
      <link>https://dev.to/pat9000/a-self-healing-system-cant-heal-an-empty-queue-2gpo</link>
      <guid>https://dev.to/pat9000/a-self-healing-system-cant-heal-an-empty-queue-2gpo</guid>
      <description>&lt;h1&gt;
  
  
  A self-healing system can't heal an empty queue
&lt;/h1&gt;

&lt;p&gt;My blog pipeline went red two mornings in a row. The self-healing step ran both times and fixed nothing. The bug was not in the healer. The bug was that I asked it to heal the wrong kind of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt; automated recovery only works when the failure is "the machine broke." When the real failure is "the machine has no work," retrying does nothing, because there is nothing to retry against. A monitoring check that flips red has to tell those two cases apart, because the recovery for each is the opposite of the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually broke
&lt;/h2&gt;

&lt;p&gt;I run a small fleet of scheduled tasks that publish one blog post a day. The chain is simple. A draft lands in a queue. A reviewer checks it at 08:30. A publisher ships the approved one at 09:30. A separate check runs after that and asks one question: did a post go live today?&lt;/p&gt;

&lt;p&gt;If the answer is no, the check flips red and calls a healer. The healer's job is to get a post live. For weeks it worked. Then two mornings in a row it ran, reported failure, and left the dashboard red.&lt;/p&gt;

&lt;p&gt;I went looking for the bug in the healer. There wasn't one. The healer did exactly what it was told. It looked for an approved draft to publish, found none, and correctly reported that it could not publish nothing.&lt;/p&gt;

&lt;p&gt;The queue was empty. No draft had been written. The publish step had nothing to ship, and no amount of retrying an empty queue produces a post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two failures that look identical on a dashboard
&lt;/h2&gt;

&lt;p&gt;Here is the trap. On the dashboard, both of these render as the same red box: "no post went live today."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The machine broke. A draft existed, but the publish step crashed, hit a bad credential, or got a 500 from the API. Recovery: retry. Fix the bug, run it again, the post ships.&lt;/li&gt;
&lt;li&gt;The machine had no work. No draft existed. Recovery: retrying will never help, no matter how many times it runs. You have to manufacture the missing input first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are opposite problems. The first says the input was fine and the action failed. The second says the action was fine and the input was missing. A healer built for the first is useless against the second, and most self-healing automation only handles the first. It assumes the work exists and the step failed. When the work itself is missing, it spins.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I split the healer
&lt;/h2&gt;

&lt;p&gt;The fix was to stop treating "no post today" as one failure. It is two.&lt;/p&gt;

&lt;p&gt;When the check goes red, it now asks a second question before doing anything: is there content waiting? If a draft exists and the publish failed, that is a broken-machine problem, and retry is the right move. If no draft exists, retry is pointless. The recovery for an empty queue is not to republish. It is to generate the missing draft.&lt;/p&gt;

&lt;p&gt;So the empty-queue healer is a content generator, not a republisher. It writes a fresh post, runs it through the same quality gate every other post clears, and only then publishes. This post is the output of exactly that path.&lt;/p&gt;
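&lt;p&gt;The fork can be sketched in a few lines of Python. This is an illustration, not my pipeline code: &lt;code&gt;classify&lt;/code&gt;, &lt;code&gt;heal&lt;/code&gt;, and the three callables passed in are hypothetical names standing in for the real publish, generate, and review steps.&lt;/p&gt;

```python
from enum import Enum


class Failure(Enum):
    BROKEN = "broken"  # work existed, the step failed
    EMPTY = "empty"    # there was no work to do


def classify(approved_drafts):
    """Ask the second question: is there content waiting?"""
    return Failure.BROKEN if approved_drafts else Failure.EMPTY


def heal(approved_drafts, publish, generate, passes_review):
    """Pick the recovery that matches the failure; return the post or None."""
    if classify(approved_drafts) is Failure.BROKEN:
        return publish(approved_drafts[0])  # retry the failed action
    draft = generate()                      # manufacture the missing input
    if not passes_review(draft):
        return None                         # stay red and ask a human
    return publish(draft)


# Empty queue: the healer generates, gates, then publishes.
published = []
post = heal(
    approved_drafts=[],
    publish=lambda d: published.append(d) or d,
    generate=lambda: "fresh draft",
    passes_review=lambda d: len(d) > 0,
)
print(post)
```

&lt;p&gt;Note that the generated draft goes through &lt;code&gt;passes_review&lt;/code&gt; before &lt;code&gt;publish&lt;/code&gt;, and a failed review returns nothing instead of forcing a green light.&lt;/p&gt;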

&lt;h2&gt;
  
  
  Doesn't auto-generating content just make filler?
&lt;/h2&gt;

&lt;p&gt;That is the real risk, and it is worth naming. The moment your recovery for "no work" is "manufacture work," you have built a machine that can publish garbage to make a red light turn green.&lt;/p&gt;

&lt;p&gt;The guard is the quality gate. The generator can write a draft, but it cannot publish one that has not passed the same independent review a queued draft passes. If the generated draft fails review, the pipeline stays red and asks me for a decision. Red is honest. A green light earned by shipping filler is worse than an honest red.&lt;/p&gt;

&lt;p&gt;That is the line. A healer may manufacture the missing input, but it may never lower the bar that input has to clear. Generate freely, publish only what passes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters past blogging
&lt;/h2&gt;

&lt;p&gt;Any check that flips red on a missing outcome has this fork in it. Did the nightly report not run because the job crashed, or because there was nothing to report? Did the deploy not happen because it failed, or because there was no new commit? Retry fixes the first. It is a no-op on the second.&lt;/p&gt;

&lt;p&gt;Before you wire up automated recovery, write down which failure you are recovering from. If you only handle "the step crashed," your healer will sit there retrying an empty queue and calling it an outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the cost angle comes in
&lt;/h2&gt;

&lt;p&gt;A healer that retries the wrong thing is annoying when the step is a publish call. It gets expensive when the step is a paid model call. A retry loop pointed at a failing API spends real money on every attempt, and an agent that retries on a schedule can do that all night while you sleep.&lt;/p&gt;

&lt;p&gt;That is what I built &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; for: a budget, token, and rate limiter you wrap around an agent so a runaway or retry loop stops at a hard ceiling instead of grinding until the invoice teaches you. Free to install: &lt;code&gt;pip install agentguard47&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When your monitor goes red tomorrow, ask the second question before you retry: is this broken, or is it empty? The answer decides whether retrying helps or just burns time and money.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/self-healing-empty-queue-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=self-healing-empty-queue-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=self-healing-empty-queue-2026&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>automation</category>
      <category>monitoring</category>
      <category>selfhealing</category>
    </item>
    <item>
      <title>Missing AI agent cost data is not zero</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Fri, 19 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/missing-ai-agent-cost-data-is-not-zero-3np3</link>
      <guid>https://dev.to/pat9000/missing-ai-agent-cost-data-is-not-zero-3np3</guid>
      <description>&lt;h1&gt;
  
  
  Missing AI agent cost data is not zero
&lt;/h1&gt;

&lt;p&gt;My agent spend ledger showed $0 for the day. The agents had run all morning. The number was a lie, and the bug behind it is one almost every cost tracker ships with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt; when a provider's billing data has not arrived yet, a naive cost tracker records $0 for that period. For AI agents that run unattended, this hides the exact spending you built the tracker to catch. The fix is to model missing data as a distinct "unknown" state, never as zero, so a day you cannot measure reads as unmeasured instead of free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a $0 day is the dangerous one
&lt;/h2&gt;

&lt;p&gt;I run a fleet of scheduled agents. They digest articles, review drafts, sweep a queue overnight. Real model calls, real dollars. I built a daily spend ledger to answer one question every morning: what did yesterday cost, and am I about to blow the budget?&lt;/p&gt;

&lt;p&gt;The first version summed whatever cost data it had and printed a total. Most mornings it printed a small number. Some mornings it printed $0.&lt;/p&gt;

&lt;p&gt;The $0 mornings were not free mornings. They were mornings where the provider billing data had not reported yet. Usage lags. Some providers only give you a CSV export you pull later. So at 07:00 when the ledger compiles, the cost of the run that finished at 05:30 often is not available.&lt;/p&gt;

&lt;p&gt;The ledger had no way to say "I don't know yet." It only knew how to add. No data plus no data equals zero. So it reported zero, looked green, and moved on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure hides exactly what you were watching for
&lt;/h2&gt;

&lt;p&gt;Here is why this is worse than a normal rounding error. Agents run while you sleep. A retry loop, a runaway tool call, a prompt that balloons context: these spend money at 3am with nobody watching. That unattended spend is the whole reason a budget tracker exists.&lt;/p&gt;

&lt;p&gt;Billing lag and runaway spend show up at the same time. The surprise charge arrives late from the provider, which means the day it happened is precisely the day your ledger had no data for. A tracker that reads missing data as $0 fails in the one situation you built it for. It tells you everything is fine on the days you most need a warning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model three states, not one number
&lt;/h2&gt;

&lt;p&gt;The fix is small, and it is a data-modeling fix, not a math fix. A cost period has three states, not one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Ledger shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;known&lt;/td&gt;
&lt;td&gt;provider data is in&lt;/td&gt;
&lt;td&gt;the real number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;td&gt;some sources reported, others pending&lt;/td&gt;
&lt;td&gt;the partial sum, flagged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;unknown&lt;/td&gt;
&lt;td&gt;no data has arrived yet&lt;/td&gt;
&lt;td&gt;"unmeasured", not $0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A day with no billing data is &lt;code&gt;unknown&lt;/code&gt;. The report says so in plain words. It does not get to pose as a cheap day. When the provider export lands, the day flips to &lt;code&gt;known&lt;/code&gt; and the real cost backfills in.&lt;/p&gt;
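&lt;p&gt;A minimal Python sketch of the three states, assuming each source reports either a dollar amount or nothing yet. The names &lt;code&gt;CostState&lt;/code&gt;, &lt;code&gt;summarize_day&lt;/code&gt;, and &lt;code&gt;render&lt;/code&gt; are illustrative and not taken from my ledger.&lt;/p&gt;

```python
from enum import Enum


class CostState(Enum):
    KNOWN = "known"
    PARTIAL = "partial"
    UNKNOWN = "unknown"


def summarize_day(costs_by_source):
    """costs_by_source maps a source name to dollars, or None if not reported yet."""
    reported = [cost for cost in costs_by_source.values() if cost is not None]
    if not reported:
        return CostState.UNKNOWN, None  # None, never 0.0
    total = round(sum(reported), 2)
    if len(reported) == len(costs_by_source):
        return CostState.KNOWN, total
    return CostState.PARTIAL, total


def render(state, total):
    """Turn a (state, total) pair into the text the report shows."""
    if state is CostState.UNKNOWN:
        return "unmeasured"
    flag = " (incomplete)" if state is CostState.PARTIAL else ""
    return f"${total:.2f}{flag}"


# Hypothetical sources; None means the export has not landed.
print(render(*summarize_day({"provider_a": None, "provider_b": None})))
print(render(*summarize_day({"provider_a": 1.25, "provider_b": None})))
print(render(*summarize_day({"provider_a": 1.25, "provider_b": 0.0})))
```

&lt;p&gt;The key choice is that a missing cost is &lt;code&gt;None&lt;/code&gt; and is filtered out before the sum, so a real reported $0.00 and an unreported day can never produce the same output.&lt;/p&gt;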

&lt;p&gt;This sounds obvious written down. It is the same null-versus-zero bug that has burned every database schema since forever. Zero is a measurement. Missing is the absence of one. Collapsing them throws away the single most important fact: whether you actually know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;My ledger now writes one of those three states per source per day. If an export has not been pulled, that line says &lt;code&gt;unknown&lt;/code&gt;, and the daily total carries an "incomplete" marker until every source reports. I would rather see "we cannot confirm yesterday's spend" than a confident, wrong $0.&lt;/p&gt;

&lt;p&gt;The rule I wrote down for every agent that touches the ledger: missing provider billing data is not zero. Mark it unknown until a real export or API confirms the number.&lt;/p&gt;

&lt;p&gt;That one line stops the green-dashboard trap. A system that defaults to zero when it is blind will always look healthy in the moment right before the bill teaches you otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop the spend, do not just measure it
&lt;/h2&gt;

&lt;p&gt;The ledger tells you what a run cost after the fact. It does not stop a run from spending the money in the first place. For that you want a hard cap on the agent itself.&lt;/p&gt;

&lt;p&gt;That is what I built &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; for: a budget, token, and rate limiter you wrap around an agent so a runaway run stops at a ceiling instead of grinding until the invoice surprises you. It does not care whether your billing data has arrived. It counts spend as it happens and pulls the plug at the number you set. Free to install: &lt;code&gt;pip install agentguard47&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What does your cost tracker show on a day the billing data is late: a real unknown, or a comforting $0?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/missing-cost-data-not-zero-ai-agent-spend-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=missing-cost-data-not-zero-ai-agent-spend-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=missing-cost-data-not-zero-ai-agent-spend-2026&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>costcontrol</category>
      <category>observability</category>
      <category>spendtracking</category>
    </item>
    <item>
      <title>Anthropic Writes 80% of Its Code with Claude</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 18 Jun 2026 14:45:11 +0000</pubDate>
      <link>https://dev.to/pat9000/anthropic-writes-80-of-its-code-with-claude-2p2f</link>
      <guid>https://dev.to/pat9000/anthropic-writes-80-of-its-code-with-claude-2p2f</guid>
      <description>&lt;h2&gt;
  
  
  What does 80% AI-authored code mean for solo devs?
&lt;/h2&gt;

&lt;p&gt;In June 2026, Anthropic stated that about 80% of its new production code is authored by Claude. When a major AI vendor hits that volume, the shift is undeniable. For a solo developer or a one-person holding company, this changes the math entirely. The bottleneck is no longer typing characters. The bottleneck is review and ownership.&lt;/p&gt;

&lt;p&gt;When you run a solo shop, you do not have a team to absorb the review burden. If your agents write 80% of the code, you still have to read, understand, and answer for 100% of it. AI writes the code, but you still own the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you manage the review burden?
&lt;/h2&gt;

&lt;p&gt;The only way to survive high-volume AI output is to build verifiable constraints. You must prove the code works without reading every line.&lt;/p&gt;

&lt;p&gt;In my own vault, nightshift agents ship PRs while I sleep. They run a multi-step loop. They write the plan, build the tests first, write the code, and then spawn an independent QA subagent to review the diff. The QA agent acts as the gatekeeper. It fails the PR if rules are broken. It flags secrets and checks constraints. I wake up, review the clean PRs, and merge them.&lt;/p&gt;

&lt;p&gt;You must enforce boundaries. If an agent drifts into forbidden paths or tries to merge a broken build, the system must auto-revert or block it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who answers for the code?
&lt;/h2&gt;

&lt;p&gt;Andreas Kling of the Ladybird browser project asked a vital question. Who answers for the code? When AI writes the bulk of your logic, the human reviewer is the final backstop.&lt;/p&gt;

&lt;p&gt;Self-reported numbers like 80% measure volume, not quality. Volume is easy. Correctness is hard. You cannot blindly trust the output. You must verify it mechanically.&lt;/p&gt;

&lt;p&gt;If you are a team of one, your agents are your peers. You need guardrails on what they can do. If you want to put a runtime budget on your agents, check out &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/anthropic-80pct-claude-code?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=anthropic-80pct-claude-code&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=anthropic-80pct-claude-code&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>solobuilder</category>
      <category>coding</category>
    </item>
    <item>
      <title>What Salesforce's 20,000 AI Agent Deployments Teach a Solo Builder</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 18 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/what-salesforces-20000-ai-agent-deployments-teach-a-solo-builder-4me7</link>
      <guid>https://dev.to/pat9000/what-salesforces-20000-ai-agent-deployments-teach-a-solo-builder-4me7</guid>
      <description>&lt;p&gt;Salesforce has shipped around 20,000 Agentforce deployments. ByteByteGo published a writeup of what they learned, sourced to John Kucera, the CPO of Agentforce. I run a one-person agent fleet, which is about as far from Salesforce scale as you can get. The lessons still translate. Better than I expected, actually.&lt;/p&gt;

&lt;p&gt;Short version: 90% of agent work happens after launch, not before. The failures cluster into three patterns: putting deterministic logic inside an LLM loop, prompting harder instead of encoding policy in code, and feeding the model way too much context. All three are engineering problems, not model problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does 90% of agent work happen after launch?
&lt;/h2&gt;

&lt;p&gt;Traditional software front-loads the effort. You spec, build, test, then go live and mostly maintain. Agents invert that. Modern tooling gets you a functional demo in hours, and that speed creates false confidence. The demo covers the typical cases. Production brings edge cases, ambiguous phrasing, and questions that cross domains your agent never saw in testing.&lt;/p&gt;

&lt;p&gt;I have lived a small version of this. Every agent I run looked done on day one. The real work was the weeks after: the input that arrived in a format I never tested, the API that returned something half-empty, the task that technically succeeded while producing nothing useful. If you budget your effort assuming launch is the finish line, you will abandon the agent right when the actual work starts.&lt;/p&gt;

&lt;p&gt;Salesforce's advice here is blunt: do not boil the ocean. Start with one narrow, high-value use case so your iteration cycles stay fast. At solo scale that means one agent, one job, one queue. Get it boring before you add the second one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the three anti-patterns that degrade agents?
&lt;/h2&gt;

&lt;p&gt;First: over-reasoning deterministic workflows. If you can flowchart the logic, it belongs in code. Salesforce built Agent Script, a TypeScript framework that mixes deterministic control flow with LLM reasoning, because asking a model to re-derive an if-else chain on every run is slow, expensive, and occasionally wrong. You do not need their framework. You need the rule: flowchart it, then script it. Save the model for the parts that are genuinely ambiguous.&lt;/p&gt;

&lt;p&gt;Second: prompting harder instead of encoding policies. Writing NEVER and ALWAYS in caps does not reliably constrain a model. Salesforce found business rules have to execute independently of model reasoning. This one matters most for small shops, because prompting harder is free and feels like progress. If a rule actually matters, enforce it in code that runs whether or not the model cooperates. A refund cap belongs in the payment function, not in paragraph four of the system prompt.&lt;/p&gt;

&lt;p&gt;Third: poor context engineering. One e-commerce team in the writeup cut an order API response from 100K tokens to 2K by returning only the relevant fields. The agent got faster and more accurate at the same time. That is the detail worth tattooing somewhere: less context made it better, not just cheaper. Dumping a whole API response into the prompt is the default, and the default is wrong.&lt;/p&gt;
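&lt;p&gt;The context fix is usually a small allowlist in front of the prompt. A sketch in Python, with hypothetical field names since the writeup does not list the ones that team kept:&lt;/p&gt;

```python
RELEVANT_FIELDS = ("order_id", "status", "eta")  # hypothetical field names


def trim_for_prompt(order, fields=RELEVANT_FIELDS):
    """Keep only the fields the task needs before the response reaches the model."""
    return {name: order[name] for name in fields if name in order}


# Made-up API response: most of it is noise for a "where is my order" question.
raw_order = {
    "order_id": "A-1001",
    "status": "shipped",
    "eta": "2026-06-20",
    "warehouse_audit_log": ["scan"] * 500,
    "internal_pricing_rules": {"tier": "gold", "margin_table": list(range(200))},
}
print(trim_for_prompt(raw_order))
```

&lt;p&gt;An allowlist is the safer default than a blocklist here, because a new noisy field added upstream stays out of the prompt until someone decides it belongs there.&lt;/p&gt;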

&lt;h2&gt;
  
  
  How do you know an agent is actually working?
&lt;/h2&gt;

&lt;p&gt;Salesforce measures Agentic Work Units, meaning actual task completion. For support agents they track containment rate: cases resolved without human follow-up. Outcomes, not activity.&lt;/p&gt;

&lt;p&gt;I learned a version of this the hard way. A scheduled agent can exit zero every night and produce nothing. Green checks lie. The fix is to check the declared output, not the exit code. Did the file appear? Did the post go live? Did the ticket close? Whatever your equivalent of containment rate is, measure that.&lt;/p&gt;
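&lt;p&gt;For a task whose declared output is a file, the check can look roughly like this in Python. The &lt;code&gt;run_and_verify&lt;/code&gt; helper is an assumption for illustration, not part of any tool named above.&lt;/p&gt;

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_verify(command, declared_output):
    """Succeed only if the command exits zero AND the declared file is non-empty."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = Path(declared_output)
    produced = output.is_file() and output.stat().st_size > 0
    return result.returncode == 0 and produced


# A task that exits zero but writes nothing should not count as done.
missing = Path(tempfile.mkdtemp()) / "report.md"
print(run_and_verify([sys.executable, "-c", "pass"], missing))
```

&lt;p&gt;A real check should also confirm the file is from this run, for example by comparing its modification time to the run's start, so yesterday's output cannot pass for today's.&lt;/p&gt;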

&lt;p&gt;Their post-launch triage is also worth stealing. Issues get split four ways: tone or brand drift means fix the prompts, logic errors mean fix the tools or convert that step to a script, data quality problems get routed to whoever owns the source, and coverage gaps mean expand scope or escalate cleanly. Four buckets, four different fixes. Most solo builders treat every failure as a prompt problem. Most failures are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does this mean if you're not Salesforce?
&lt;/h2&gt;

&lt;p&gt;Salesforce has platform teams to absorb the post-launch 90%. You have you. That changes the build order, not the lessons.&lt;/p&gt;

&lt;p&gt;Move deterministic logic out of the loop first. It is the cheapest win: fewer tokens, fewer surprises, faster runs. Then encode your real rules as code-level checks the model cannot talk its way past. Then cut your context down to what the task needs. Each of these makes the after-launch grind smaller, which at solo scale is the difference between a fleet you maintain and a fleet that quietly rots.&lt;/p&gt;
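&lt;p&gt;A code-level check for the refund example might look like the sketch below. The cap value, the &lt;code&gt;PolicyViolation&lt;/code&gt; exception, and the &lt;code&gt;issue_refund&lt;/code&gt; signature are all hypothetical. The point is only that the rule runs in the function the agent has to call, not in the prompt.&lt;/p&gt;

```python
REFUND_CAP_CENTS = 5_000  # hypothetical policy: at most $50 per refund


class PolicyViolation(Exception):
    """Raised when a request breaks a business rule, whatever the model argued."""


def issue_refund(order_total_cents, requested_cents):
    """Apply the rules in code before any money moves."""
    if 0 >= requested_cents:
        raise PolicyViolation("refund must be positive")
    if requested_cents > REFUND_CAP_CENTS:
        raise PolicyViolation("refund exceeds cap; escalate to a human")
    if requested_cents > order_total_cents:
        raise PolicyViolation("refund exceeds order total")
    return {"refunded_cents": requested_cents}


print(issue_refund(order_total_cents=8_000, requested_cents=2_500))
try:
    issue_refund(order_total_cents=8_000, requested_cents=7_500)
except PolicyViolation as err:
    print("blocked:", err)
```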

&lt;p&gt;And put hard runtime limits on every agent before it touches production. The deployments in the writeup degrade in ways nobody predicted in the demo, and at 20,000 deployments Salesforce can eat the bad days. One runaway retry loop on your side is your whole margin. That is the exact surface I built AgentGuard for: per-agent budget caps, token limits, and rate limits enforced at runtime, not in the prompt. It installs with &lt;code&gt;pip install agentguard47&lt;/code&gt; and takes minutes to wire in. Start there: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/salesforce-20000-ai-agent-deployments-lessons-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=salesforce-20000-ai-agent-deployments-lessons-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=salesforce-20000-ai-agent-deployments-lessons-2026&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>agentdeployment</category>
      <category>agentreliability</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>57-71% of AI agents leak data between users. Here's what to do.</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Wed, 17 Jun 2026 14:50:11 +0000</pubDate>
      <link>https://dev.to/pat9000/57-71-of-ai-agents-leak-data-between-users-heres-what-to-do-3h05</link>
      <guid>https://dev.to/pat9000/57-71-of-ai-agents-leak-data-between-users-heres-what-to-do-3h05</guid>
      <description>&lt;h1&gt;
  
  
  57-71% of AI agents leak data between users. Here's what to do.
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; A June 2026 Mem0 survey reveals that 57-71% of agent harnesses leak memory between users. This happens because most systems use keyword retrieval without user isolation. Builders must implement per-user namespaces and principal checks to prevent PII leaks and credential bleed.&lt;/p&gt;

&lt;p&gt;Mem0's June 2026 survey covered 8 major agent harnesses, including Claude Code, Codex, and Bedrock AgentCore. It found a 57-71% cross-user memory contamination rate. Most of these systems rely on keyword retrieval and lack user-scoped isolation.&lt;/p&gt;

&lt;p&gt;If you run agents for multiple users, your memory layer is likely leaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does keyword retrieval fail across users?
&lt;/h2&gt;

&lt;p&gt;Most agent runtimes use simple keyword matches to pull relevant memories into the context window. This works well for single-user assistants. It fails in multi-user environments because the retrieval layer has no concept of a principal.&lt;/p&gt;

&lt;p&gt;When User B asks a question, a fuzzy match might pull a memory fragment written by User A. If User A stored PII or credentials, those secrets are now in User B's prompt. The agent has no way to know it just crossed a security boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the real failure modes of memory contamination?
&lt;/h2&gt;

&lt;p&gt;Memory contamination is not just a style issue. It creates three critical risks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PII leak:&lt;/strong&gt; Personal data from one user appears in another's session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision contamination:&lt;/strong&gt; A policy or preference set by User A influences the agent's actions for User B.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential bleed:&lt;/strong&gt; API keys or tokens stored in memory by an admin become accessible to a standard user.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How do you fix the agent memory layer?
&lt;/h2&gt;

&lt;p&gt;To build secure multi-user agents, you need to move beyond simple keyword search. Use these four patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Per-user namespaces.&lt;/strong&gt; Every memory must be tagged with a unique UserID. The retrieval query must include a hard filter on that ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Recall-time principal checks.&lt;/strong&gt; Before a retrieved memory is injected into the prompt, verify that the current session principal has read access to that specific memory object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. TTL and staleness handling.&lt;/strong&gt; Memories should not live forever. Implement time-to-live (TTL) settings and session-based eviction to ensure sensitive data does not linger in the vector store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Vector partitioning.&lt;/strong&gt; Use physical or logical partitioning in your vector database to ensure that a search for one user cannot even "see" the data of another.&lt;/p&gt;
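&lt;p&gt;Patterns 1, 2, and 4 can be sketched together with a toy in-memory store in Python. This is an illustration only: &lt;code&gt;MemoryStore&lt;/code&gt; is a made-up class, keyword matching stands in for real retrieval, and TTL handling is left out.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    owner_id: str
    text: str


class MemoryStore:
    def __init__(self):
        self._by_user = {}  # one logical partition per user id

    def write(self, user_id, text):
        self._by_user.setdefault(user_id, []).append(Memory(user_id, text))

    def recall(self, principal_id, query):
        """Search only the caller's partition, then re-check ownership."""
        candidates = self._by_user.get(principal_id, [])  # hard filter on user id
        words = query.lower().split()
        hits = [m for m in candidates if any(w in m.text.lower() for w in words)]
        # Recall-time principal check: defense in depth against a mis-filed memory.
        return [m.text for m in hits if m.owner_id == principal_id]


store = MemoryStore()
store.write("alice", "deploy token is EXAMPLE-TOKEN")
store.write("bob", "prefers dark mode")
print(store.recall("bob", "deploy token"))    # bob cannot see alice's partition
print(store.recall("alice", "deploy token"))
```

&lt;p&gt;The filter happens before the search, not after it. A post-hoc filter on a shared index still lets another user's data influence ranking, which is the failure the partition is meant to rule out.&lt;/p&gt;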

&lt;h2&gt;
  
  
  How does AgentGuard help secure agent memory?
&lt;/h2&gt;

&lt;p&gt;Isolating memory is only half the battle. You also need to enforce scope at the action layer.&lt;/p&gt;

&lt;p&gt;AgentGuard provides the runtime budget and scope enforcement that acts as the action-layer analogue of memory isolation. Just as you should not let an agent recall User A's data for User B, you should not let an agent spend User A's budget on User B's tasks.&lt;/p&gt;

&lt;p&gt;By wrapping your agent in AgentGuard, you ensure that even if a memory leak occurs, the agent's ability to act on that leaked data is strictly bounded by the current session's security policy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;Learn how to secure your agent runtime with AgentGuard.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/mem0-agent-memory-contamination?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=mem0-agent-memory-contamination&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want these in your inbox? &lt;a href="https://bmdpat.com/newsletter?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=mem0-agent-memory-contamination&amp;amp;utm_content=footer_newsletter" rel="noopener noreferrer"&gt;Subscribe to the newsletter&lt;/a&gt; - no spam, unsubscribe anytime.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>security</category>
      <category>agentmemory</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>VRAM Calculator: Estimate Local LLM Requirements</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:45:10 +0000</pubDate>
      <link>https://dev.to/pat9000/vram-calculator-estimate-local-llm-requirements-5dk8</link>
      <guid>https://dev.to/pat9000/vram-calculator-estimate-local-llm-requirements-5dk8</guid>
      <description>&lt;h2&gt;
  
  
  What is the VRAM Calculator?
&lt;/h2&gt;

&lt;p&gt;Running local LLMs requires knowing your hardware limits. I built the VRAM Calculator to help you estimate the video memory needed to run models like Llama 3 and Mistral. Knowing your constraints before downloading a 40GB model saves you hours of frustration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math Behind It
&lt;/h2&gt;

&lt;p&gt;Estimating VRAM takes more than checking the model file size. You also have to account for the KV cache, which grows with context window length, the quantization level (for example Q4 or Q8 in a GGUF file), and inference engine overhead. The calculator handles the math and gives you a concrete target for your setup.&lt;/p&gt;
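
&lt;p&gt;As a rough sketch of that arithmetic (not the calculator's actual formula): weights take about parameter count times bits per weight, the KV cache grows linearly with context length, and runtime overhead is modeled here as a flat percentage. The function names, the 10% overhead figure, and the approximate bits-per-weight values are assumptions for illustration. The layer and head counts follow the published Llama 3 8B layout.&lt;/p&gt;

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(n_params, bits_per_weight):
    """Memory for the model weights at a given quantization level."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory for keys and values across all layers at full context."""
    # Factor of 2 because both a key and a value are stored per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def estimate_vram_gib(n_params, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, context_len, overhead_frac=0.10):
    """Weights plus KV cache, padded by an assumed flat overhead fraction."""
    base = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len)
    return base * (1 + overhead_frac) / GIB


# Assumed inputs: an 8B-parameter model with 32 layers, 8 KV heads,
# head dimension 128, an fp16 KV cache, and an 8k context window.
for label, bits in [("Q4 (~4.5 bits/weight)", 4.5), ("Q8 (~8.5 bits/weight)", 8.5)]:
    gib = estimate_vram_gib(8e9, bits, 32, 8, 128, 8192)
    print(f"{label}: about {gib:.1f} GiB")
```

&lt;p&gt;The KV cache term is why the same model file can fit at a short context and run out of memory at a long one.&lt;/p&gt;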

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;Static reference tables get outdated fast. This calculator uses dynamic estimates based on real memory footprint data from local AI engines like llama.cpp.&lt;/p&gt;

&lt;p&gt;You can use the tool right now: &lt;a href="https://bmdpat.com/tools/vram-calculator?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=vram-calculator" rel="noopener noreferrer"&gt;Try the VRAM Calculator&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready for Production?
&lt;/h2&gt;

&lt;p&gt;If you are deploying AI agents and need to monitor their execution safely, check out &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/vram-calculator?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=vram-calculator&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>localllm</category>
      <category>hardware</category>
      <category>vram</category>
      <category>llama3</category>
    </item>
    <item>
      <title>Anthropic's IPO and the 40% Cost-Savings Gap: Why Your Spend Cap Matters More Now</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:45:11 +0000</pubDate>
      <link>https://dev.to/pat9000/anthropics-ipo-and-the-40-cost-savings-gap-why-your-spend-cap-matters-more-now-3o45</link>
      <guid>https://dev.to/pat9000/anthropics-ipo-and-the-40-cost-savings-gap-why-your-spend-cap-matters-more-now-3o45</guid>
      <description>&lt;p&gt;Anthropic filed confidentially for an IPO. Two newsletter bullets I read on 2026-06-04 (TLDR AI and FutureTools) put the post-money valuation at $965B after a $65B Series H raise. Revenue run-rate is reported at $47B, up from $9B at the end of 2025.&lt;/p&gt;

&lt;p&gt;Here is the part that should get your attention as a builder. The same bullets report that 40% of enterprise customers say they got under 10% cost savings from their Claude deployments.&lt;/p&gt;

&lt;p&gt;Read those two numbers together. Run-rate revenue is up roughly 5x in about six months. And two in five enterprise buyers say the value is not showing up in their bills. That is the exact shape of a re-pricing event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the 40% gap exists
&lt;/h2&gt;

&lt;p&gt;The gap is not about the model being slow or wrong. It is an accounting mismatch.&lt;/p&gt;

&lt;p&gt;You pay per invocation. You get value per completed goal. Those are not the same thing, and the difference is where your money leaks.&lt;/p&gt;

&lt;p&gt;Three places the tokens go to die:&lt;/p&gt;

&lt;p&gt;Retries. A tool call fails, the agent tries again, then again. Each attempt bills. None of them shipped the result.&lt;/p&gt;

&lt;p&gt;Dead-end branches. An agent explores a plan, burns tokens, then abandons it. You paid for the exploration even though nothing reached the user.&lt;/p&gt;

&lt;p&gt;Unverified completions. The agent says "done." Nobody checked. You paid full price for an output that was never confirmed to be correct.&lt;/p&gt;

&lt;p&gt;None of this shows up as a single scary line item. It shows up as a bill that is bigger than the work you can point to. That is the 40% gap in one sentence.&lt;/p&gt;
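
&lt;p&gt;One way to make those three leaks visible is to tag every billed call with the goal it served and how it ended, then sum spend by outcome. Here is a minimal sketch with made-up records. The outcome labels and the record layout are assumptions, not a real billing export.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical billing records: (goal_id, outcome, cost_usd).
# "shipped" is the only outcome that delivered a verified result.
calls = [
    ("ticket-1", "retry", 0.02),
    ("ticket-1", "retry", 0.02),
    ("ticket-1", "shipped", 0.03),
    ("ticket-2", "dead_end", 0.05),
    ("ticket-2", "unverified", 0.04),
    ("ticket-3", "shipped", 0.04),
]


def spend_by_outcome(records):
    """Sum cost per outcome label."""
    totals = defaultdict(float)
    for _goal, outcome, cost in records:
        totals[outcome] += cost
    return dict(totals)


def leaked_share(records):
    """Fraction of total spend that did not end in a shipped result."""
    totals = spend_by_outcome(records)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1 - totals.get("shipped", 0.0) / total


for outcome, cost in sorted(spend_by_outcome(calls).items()):
    print(f"{outcome}: ${cost:.2f}")
print(f"leaked share: {leaked_share(calls):.0%}")
```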

&lt;h2&gt;
  
  
  What the IPO changes for your team
&lt;/h2&gt;

&lt;p&gt;A confidential filing means public-market disclosure is coming. Public markets reward margin. The cheapest way for any vendor to defend margin is to adjust pricing on the tiers that are currently subsidized.&lt;/p&gt;

&lt;p&gt;I am not predicting a specific price hike. I am saying the incentive is now pointed in one direction. If you are running production agents, plan for the cost of a token to matter more next quarter than it did last quarter.&lt;/p&gt;

&lt;p&gt;The wrong move is to panic-switch vendors. Migrating an agent stack is expensive, and the next vendor has the same per-invocation-versus-per-goal problem. Switching does not fix the leak. It just moves it.&lt;/p&gt;

&lt;p&gt;The right move is to cap the spend and verify the goal before you pay for it. Keep a real exit option open too. Running a small local model on consumer hardware is a credible fallback for some workloads. I wrote about that in &lt;a href="https://bmdpat.com/blog/local-llm-inference-consumer-gpu-production-2026" rel="noopener noreferrer"&gt;local LLM inference on consumer GPUs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern: budget per goal, hard stop, audit trail
&lt;/h2&gt;

&lt;p&gt;This is the gap AgentGuard was built for. It is a runtime budget limiter for AI agents. You set a budget per goal, a cap per key, a hard stop, and you get an audit trail of where the tokens actually went.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentguard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Guard&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;per_key_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize-ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# only counts as paid value if the goal check passes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point is not the exact API. The point is the shape. You declare what one completed goal is worth before the agent starts. The agent runs under a hard ceiling. When it hits the cap, it stops instead of quietly burning another dollar on a dead-end branch. And the audit trail tells you which goals actually completed, so you can see the 40% gap in your own numbers instead of guessing.&lt;/p&gt;

&lt;p&gt;That last part matters most. You cannot manage a leak you cannot measure. Per-goal accounting turns "the bill feels high" into "these three goals burned 60% of spend and only one of them shipped."&lt;/p&gt;
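
&lt;p&gt;To show the shape without depending on any particular library, here is a self-contained sketch of a per-goal budget with a hard stop and an audit trail. &lt;code&gt;GoalBudget&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, and &lt;code&gt;charge&lt;/code&gt; are hypothetical names for illustration. This is not AgentGuard's API, and the costs are invented.&lt;/p&gt;

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a goal past its cap."""


class GoalBudget:
    """Tracks spend per goal and refuses charges past a hard ceiling."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = {}      # goal name mapped to dollars spent so far
        self.verified = {}   # goal name mapped to True once its check passed
        self.audit = []      # (goal, event, amount) in order of occurrence

    def charge(self, goal, cost_usd):
        """Record a billed call, or stop before the cap is crossed."""
        new_total = self.spent.get(goal, 0.0) + cost_usd
        if new_total > self.cap_usd:
            self.audit.append((goal, "blocked", cost_usd))
            raise BudgetExceeded(f"{goal}: cap of ${self.cap_usd:.2f} reached")
        self.spent[goal] = new_total
        self.audit.append((goal, "charged", cost_usd))

    def verify(self, goal, passed):
        """Mark whether the goal's output passed its completion check."""
        self.verified[goal] = bool(passed)
        self.audit.append((goal, "verified" if passed else "failed_check", 0.0))

    def report(self):
        """Per-goal spend alongside whether the goal actually completed."""
        return {g: (round(s, 4), self.verified.get(g, False))
                for g, s in self.spent.items()}


budget = GoalBudget(cap_usd=0.50)
budget.charge("summarize-ticket", 0.20)
budget.verify("summarize-ticket", passed=True)

try:
    for _ in range(5):  # simulated retry loop on a second goal
        budget.charge("triage-inbox", 0.20)
except BudgetExceeded as err:
    print("stopped:", err)

print(budget.report())
```

&lt;p&gt;The retry loop is cut off at the third attempt, and the report shows which goal spent money without ever passing a check.&lt;/p&gt;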

&lt;h2&gt;
  
  
  Get ahead of the re-pricing
&lt;/h2&gt;

&lt;p&gt;The news moment is the IPO. The durable lesson is older than this filing. Per-invocation pricing and per-goal value will always drift apart, and that drift is your cost problem.&lt;/p&gt;

&lt;p&gt;If you want the deeper version of this, I keep a hub post on &lt;a href="https://bmdpat.com/blog/ai-agent-cost-pricing-2026" rel="noopener noreferrer"&gt;AI agent cost and pricing&lt;/a&gt; and a hands-on walkthrough of &lt;a href="https://bmdpat.com/blog/ai-agent-cost-control-agentguard-python" rel="noopener noreferrer"&gt;cost control with AgentGuard in Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cap the spend. Verify the goal. Pay for value, not for retries. Start with a budget cap before the next pricing event lands: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://bmdpat.com/blog/anthropic-ipo-cost-savings-gap-agentguard-2026?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=anthropic-ipo-cost-savings-gap-agentguard-2026&amp;amp;utm_content=footer_original" rel="noopener noreferrer"&gt;bmdpat.com&lt;/a&gt;. I run a one-person AI agent company and write about what actually works.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>aiagents</category>
      <category>costcontrol</category>
      <category>anthropic</category>
      <category>agentguard</category>
    </item>
  </channel>
</rss>
