<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shridhar Shah</title>
    <description>The latest articles on DEV Community by Shridhar Shah (@shridhar_shah2297).</description>
    <link>https://dev.to/shridhar_shah2297</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4005841%2F20cbe6ff-759d-4040-982d-5be692b089b9.png</url>
      <title>DEV Community: Shridhar Shah</title>
      <link>https://dev.to/shridhar_shah2297</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shridhar_shah2297"/>
    <language>en</language>
    <item>
      <title>Agents Are Learning to Write Their Own SKILL.md Files</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:53:22 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/agents-are-learning-to-write-their-own-skillmd-files-3foo</link>
      <guid>https://dev.to/shridhar_shah2297/agents-are-learning-to-write-their-own-skillmd-files-3foo</guid>
      <description>&lt;p&gt;&lt;em&gt;The Agent Skills open standard today, and the 2026 research on agents that write their own skills.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; In late 2025, "Agent Skills" became a thing — a dead-simple way to teach an AI agent a task: a folder with a &lt;code&gt;SKILL.md&lt;/code&gt; file (some instructions in Markdown). It's already an open standard. The wild part is what's coming next: agents that &lt;strong&gt;write their own skills.&lt;/strong&gt; I built a demo where an agent solves a task the hard way once, saves a real &lt;code&gt;SKILL.md&lt;/code&gt;, and then reuses it — cutting its total effort almost in half. ~130 lines, no API key.&lt;/p&gt;




&lt;h2&gt;
  
  
  First, what's a "skill"?
&lt;/h2&gt;

&lt;p&gt;If you've used Claude Code or similar tools lately, you've probably seen &lt;code&gt;SKILL.md&lt;/code&gt; files. The idea is refreshingly low-tech. A "skill" is just a folder with a Markdown file that says &lt;em&gt;how to do something&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv-to-markdown&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Turn comma-separated text into a Markdown table. Use when the input looks&lt;/span&gt;
  &lt;span class="s"&gt;like CSV and the user wants a table.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# CSV to Markdown&lt;/span&gt;

&lt;span class="c1"&gt;## Instructions&lt;/span&gt;
&lt;span class="s"&gt;Split the text into rows on newlines and columns on commas. Make the first row the&lt;/span&gt;
&lt;span class="s"&gt;header, add a `---` divider row, then format every row as `| a | b | c |`.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No SDK, no config. Anthropic introduced this in October 2025 and then published it as an &lt;strong&gt;open standard&lt;/strong&gt; (&lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;) in December 2025, so the same skill folder now works across 30+ agent tools (Claude Code, Cursor, Copilot, and more).&lt;/p&gt;

&lt;p&gt;The full rules are short (&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io/specification&lt;/a&gt;): the only &lt;strong&gt;required&lt;/strong&gt; fields are &lt;code&gt;name&lt;/code&gt; (1–64 chars, lowercase-with-hyphens, and it must match the folder name) and &lt;code&gt;description&lt;/code&gt; (≤1024 chars, saying &lt;em&gt;what it does and when to use it&lt;/em&gt;). Everything else — &lt;code&gt;license&lt;/code&gt;, &lt;code&gt;metadata&lt;/code&gt;, &lt;code&gt;compatibility&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt; — is optional. That's the whole spec. The &lt;code&gt;SKILL.md&lt;/code&gt; files my demo writes follow it to the letter, so they'd load unmodified in any compatible CLI.&lt;/p&gt;
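
&lt;p&gt;As a rough illustration, the two required-field rules above can be checked in a few lines. This is a minimal sketch, not the official validator: the function name &lt;code&gt;check_skill_frontmatter&lt;/code&gt; is hypothetical, the regex is one assumed reading of "lowercase-with-hyphens", and the non-empty check on &lt;code&gt;description&lt;/code&gt; is an assumption. The specification text remains the authority.&lt;/p&gt;

```python
import re

# Assumed reading of "lowercase-with-hyphens": lowercase letters and digits,
# joined by single hyphens. The spec text is the authority, not this regex.
NAME_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def check_skill_frontmatter(folder_name: str, name: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means both required fields look valid."""
    problems = []
    if len(name) == 0 or len(name) > 64:
        problems.append("name must be 1-64 characters")
    if NAME_PATTERN.fullmatch(name) is None:
        problems.append("name must be lowercase words joined by hyphens")
    if name != folder_name:
        problems.append("name must match the folder name")
    # Assumption: an empty description is treated as invalid.
    if len(description.strip()) == 0 or len(description) > 1024:
        problems.append("description must be non-empty and at most 1024 characters")
    return problems


print(check_skill_frontmatter("csv-to-markdown", "csv-to-markdown",
                              "Turn comma-separated text into a Markdown table."))
print(check_skill_frontmatter("csv-to-markdown", "CSV_To_Markdown", ""))
```

&lt;p&gt;The first call prints an empty list; the second reports three problems (bad casing, folder mismatch, empty description).&lt;/p&gt;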

&lt;h2&gt;
  
  
  The clever trick: progressive disclosure
&lt;/h2&gt;

&lt;p&gt;Here's the smart part. If you just dumped 50 skills' worth of instructions into the agent's context, you'd fill it up and leave no room for actual work. So skills load in &lt;strong&gt;stages&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always loaded:&lt;/strong&gt; just the &lt;code&gt;name&lt;/code&gt; and one-line &lt;code&gt;description&lt;/code&gt; of every skill (tiny).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loaded only when it matches:&lt;/strong&gt; the full instructions, once a task actually needs them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loaded only if referenced:&lt;/strong&gt; extra files or scripts the skill bundles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the agent can have &lt;em&gt;hundreds&lt;/em&gt; of skills installed and barely pay for it — it only reads the short descriptions until one matches, then pulls in the details. My demo shows the math: to use 1 skill out of 3 installed, loading everything costs ~1500 "tokens"; the SKILL.md way costs ~560. That gap gets huge as your library grows.&lt;/p&gt;
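
&lt;p&gt;The arithmetic behind that gap can be sketched as follows. The per-skill sizes are hypothetical, picked only so the totals land on the rough figures quoted above; they are not taken from the demo's code.&lt;/p&gt;

```python
# Hypothetical sizes, chosen only so the totals match the article's rough figures.
METADATA_TOKENS = 20   # assumed cost of one skill's name + description
BODY_TOKENS = 500      # assumed cost of one skill's full instructions


def load_everything_cost(installed: int) -> int:
    """Every installed skill's full instructions go into context up front."""
    return installed * BODY_TOKENS


def progressive_cost(installed: int, used: int) -> int:
    """Metadata for every skill, plus full instructions only for the skills used."""
    return installed * METADATA_TOKENS + used * BODY_TOKENS


for installed in (3, 50, 300):
    print(installed, load_everything_cost(installed), progressive_cost(installed, used=1))
# 3 1500 560
# 50 25000 1500
# 300 150000 6500
```

&lt;p&gt;Under these assumptions the eager cost grows with the full size of every skill, while the staged cost grows only with the short descriptions.&lt;/p&gt;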

&lt;p&gt;This is also why people say skills and &lt;strong&gt;MCP&lt;/strong&gt; are teammates, not rivals: MCP is how an agent &lt;em&gt;connects to tools&lt;/em&gt;; a skill is how an agent &lt;em&gt;knows the procedure&lt;/em&gt; for using them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frontier: agents that write their own skills
&lt;/h2&gt;

&lt;p&gt;Today, humans write &lt;code&gt;SKILL.md&lt;/code&gt; files. The 2026 research is about agents that write their &lt;strong&gt;own&lt;/strong&gt; — and get better over time as their skill library grows. This goes back to &lt;strong&gt;Voyager&lt;/strong&gt; (2023), an agent that played Minecraft and saved working code as reusable skills, getting dramatically faster at the game. The new wave makes it general:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/html/2605.27366" rel="noopener noreferrer"&gt;MUSE-Autoskill&lt;/a&gt;&lt;/strong&gt; (2026) treats a skill as a &lt;em&gt;living asset&lt;/em&gt; with a full lifecycle — create it, give it its own memory file, manage it, test it, and refine it. Each skill even keeps a &lt;code&gt;.memory.md&lt;/code&gt; of notes about itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/pdf/2603.18743v1" rel="noopener noreferrer"&gt;Memento-Skills&lt;/a&gt;&lt;/strong&gt; (2026) stores skills as Markdown files that double as the agent's evolving memory, and turns task &lt;em&gt;failures&lt;/em&gt; into new skills automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2602.01869" rel="noopener noreferrer"&gt;Skill-Pro&lt;/a&gt;&lt;/strong&gt; (2026) defines a skill as "when to use it + how to do it + when to stop," and only keeps a new skill if it passes a quality gate — so the library improves instead of filling up with junk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: &lt;strong&gt;solve it once, save the recipe, reuse it forever&lt;/strong&gt; — and let the collection get smarter on its own.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📄 &lt;strong&gt;The "this is the future" link:&lt;/strong&gt; Anthropic's own writeup, &lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;&lt;em&gt;Equipping agents for the real world with Agent Skills&lt;/em&gt;&lt;/a&gt;, and the open standard at &lt;strong&gt;&lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;&lt;/strong&gt;. For the research direction, &lt;a href="https://arxiv.org/abs/2605.27366" rel="noopener noreferrer"&gt;MUSE-Autoskill (arXiv:2605.27366)&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2602.01869" rel="noopener noreferrer"&gt;Skill-Pro (arXiv:2602.01869)&lt;/a&gt; are the clearest reads on agents that grow their own skill libraries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  You can do this &lt;em&gt;today&lt;/em&gt; in the Claude Code CLI
&lt;/h2&gt;

&lt;p&gt;This isn't theoretical — the exact pattern from my demo already ships in coding CLIs. In &lt;strong&gt;Claude Code&lt;/strong&gt;, a skill is just a folder under &lt;code&gt;.claude/skills/&lt;/code&gt; in your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Anywhere in your project — drop a skill in and the CLI auto-discovers it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; .claude/skills/csv-to-markdown
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt; .claude/skills/csv-to-markdown/SKILL.md   &lt;span class="c"&gt;# same SKILL.md format as my demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the agent loads only that skill's one-line &lt;code&gt;description&lt;/code&gt; until a task matches — then pulls in the full instructions (that's progressive disclosure doing its job). Type &lt;code&gt;/skills&lt;/code&gt; inside the CLI to see what's loaded.&lt;/p&gt;

&lt;p&gt;The best part: because it's an &lt;strong&gt;open standard&lt;/strong&gt;, the &lt;em&gt;same&lt;/em&gt; folder works unmodified across tools. You're not locked in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/strong&gt; — Anthropic's CLI, where the format started.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/sst/opencode" rel="noopener noreferrer"&gt;opencode&lt;/a&gt;&lt;/strong&gt; — a popular open-source terminal agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;Goose&lt;/a&gt;&lt;/strong&gt; — Block's open-source agent.&lt;/li&gt;
&lt;li&gt;Plus Cursor, GitHub Copilot, and 30+ others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write the skill once, use it everywhere. The future bit my demo points at: instead of &lt;em&gt;you&lt;/em&gt; hand-writing that file, the agent writes it for itself after solving the task the first time — and from then on, your repo quietly accumulates a library of skills your agent earned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version (my demo)
&lt;/h2&gt;

&lt;p&gt;Same stream of 7 tasks. "Cost" is how much effort each one took.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;No-skills agent&lt;/th&gt;
&lt;th&gt;Skill-writing agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it does&lt;/td&gt;
&lt;td&gt;re-solves everything from scratch&lt;/td&gt;
&lt;td&gt;learns a task once, saves a &lt;code&gt;SKILL.md&lt;/code&gt;, reuses it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Both correct?&lt;/td&gt;
&lt;td&gt;7/7&lt;/td&gt;
&lt;td&gt;7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[5] csv-to-markdown  learned it and wrote SKILL.md
[5] slugify          learned it and wrote SKILL.md
[1] csv-to-markdown  reused skill 'csv-to-markdown'   ← cheap now
[5] extract-emails   learned it and wrote SKILL.md
[1] slugify          reused skill 'slugify'
[1] csv-to-markdown  reused skill 'csv-to-markdown'
[1] extract-emails   reused skill 'extract-emails'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It writes &lt;strong&gt;real &lt;code&gt;SKILL.md&lt;/code&gt; files&lt;/strong&gt; into a &lt;code&gt;./skills&lt;/code&gt; folder you can open. The first time it sees a task it pays full price; after that, it finds its own saved skill and reuses it for cheap.&lt;/p&gt;
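
&lt;p&gt;The learn-once, reuse-after loop can be sketched like this. It is not the code from the repository: the costs of 5 and 1 are read off the log above, the function names are hypothetical, and the &lt;code&gt;SKILL.md&lt;/code&gt; body is a placeholder rather than a real learned procedure.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

LEARN_COST = 5   # assumed cost of solving a task from scratch
REUSE_COST = 1   # assumed cost of following a saved skill

# Same task order as the log above.
TASKS = ["csv-to-markdown", "slugify", "csv-to-markdown", "extract-emails",
         "slugify", "csv-to-markdown", "extract-emails"]


def run_with_skills(tasks: list[str], skills_dir: Path) -> int:
    """Pay LEARN_COST the first time a task appears, then reuse the saved SKILL.md."""
    total = 0
    for task in tasks:
        skill_file = skills_dir / task / "SKILL.md"
        if skill_file.exists():
            total += REUSE_COST
        else:
            skill_file.parent.mkdir(parents=True)
            # Placeholder body; a real agent would write the procedure it discovered.
            skill_file.write_text(
                f"---\nname: {task}\ndescription: Placeholder for {task}.\n---\n\n"
                f"# {task}\n\n## Instructions\n(learned steps go here)\n"
            )
            total += LEARN_COST
    return total


def run_without_skills(tasks: list[str]) -> int:
    """Re-solve every task from scratch."""
    return len(tasks) * LEARN_COST


with tempfile.TemporaryDirectory() as tmp:
    print(run_without_skills(TASKS), run_with_skills(TASKS, Path(tmp)))  # 35 19
```

&lt;p&gt;Three first-time tasks at 5 plus four reuses at 1 gives 19, against 7 × 5 = 35 for the agent that never saves anything.&lt;/p&gt;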

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Two big reasons engineers should care:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents stop repeating themselves.&lt;/strong&gt; Right now most agents re-derive the same thing over and over, paying for it every time. A skill library means "figure it out once, then it's free" — like a teammate who writes things down instead of relearning them daily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A whole new ecosystem.&lt;/strong&gt; There are already 65,000+ shared skills and a scramble to build "the npm of agent skills" — registries and marketplaces where you install a skill like a package. Skills are becoming a unit of &lt;em&gt;shareable expertise&lt;/em&gt;: a senior engineer's know-how, packaged in a folder, that any agent can pick up.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools tell an agent &lt;em&gt;what it can do&lt;/em&gt;. Skills tell it &lt;em&gt;how to do things well&lt;/em&gt; — and soon, agents will write that part themselves, and trade it with each other.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/06-agent-skills
python demo.py
&lt;span class="nb"&gt;cat &lt;/span&gt;skills/csv-to-markdown/SKILL.md   &lt;span class="c"&gt;# a skill the agent wrote itself&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honest note: this is a POC. Real systems decide &lt;em&gt;when&lt;/em&gt; a new skill is worth saving, test it, and refine it over time (that's exactly what the 2026 papers above tackle). Mine keeps that part simple so the core idea — &lt;em&gt;learn once, save a SKILL.md, reuse it&lt;/em&gt; — is easy to see.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 6 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; Anthropic, "Equipping agents for the real world with Agent Skills" (2025) and the Agent Skills open standard (agentskills.io); Voyager (arXiv:2305.16291); MUSE-Autoskill (arXiv:2605.27366); Memento-Skills (arXiv:2603.18743); Skill-Pro (arXiv:2602.01869); MemSkill (arXiv:2602.02474).&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>How Do You Trust an AI Agent With Your Money? You Don't — You Check Its Receipt</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:53:20 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/how-do-you-trust-an-ai-agent-with-your-money-you-dont-you-check-its-receipt-38ff</link>
      <guid>https://dev.to/shridhar_shah2297/how-do-you-trust-an-ai-agent-with-your-money-you-dont-you-check-its-receipt-38ff</guid>
      <description>&lt;p&gt;&lt;em&gt;Cryptographically verifiable agent behavior: swap, edit, or forge a step and it's rejected.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; As we let AI agents do real things (issue refunds, move data, call APIs), "just trust it" stops being good enough. The fix: the agent hands you a tamper-proof &lt;strong&gt;receipt&lt;/strong&gt; that proves it followed the &lt;em&gt;approved&lt;/em&gt; rules and didn't fake anything. I built a demo — change the rules, edit a step, or fake the signature, and the check fails every time. ~120 lines, normal everyday crypto, no API key.&lt;/p&gt;




&lt;h2&gt;
  
  
  The scary question
&lt;/h2&gt;

&lt;p&gt;You're about to let an agent issue refunds, move files, or hit your production APIs. How do you &lt;em&gt;actually know&lt;/em&gt; it followed the rules you approved — and not some changed version? And how do you know the log it gives you afterward wasn't edited?&lt;/p&gt;

&lt;p&gt;Right now, the honest answer is usually: you don't. You trust the logs. But logs can be edited, the rules an agent runs can be quietly swapped, and a compromised agent can claim it did one thing while doing another.&lt;/p&gt;

&lt;p&gt;The 2026 fix is called &lt;strong&gt;verifiable agent behavior&lt;/strong&gt; (the closest research term is "zkML", short for zero-knowledge machine learning): the agent produces a tamper-proof receipt that proves it ran &lt;em&gt;exactly&lt;/em&gt; the approved process, and &lt;em&gt;anyone&lt;/em&gt; can check that receipt without having to trust the agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent ran the approved refund rules, honestly&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;ACCEPT&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Someone swapped in sneaky "refund anything" rules&lt;/td&gt;
&lt;td&gt;🚨 &lt;strong&gt;REJECT&lt;/strong&gt; — rules don't match the approved ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Someone edited a step (turned a $40 refund into $5000)&lt;/td&gt;
&lt;td&gt;🚨 &lt;strong&gt;REJECT&lt;/strong&gt; — receipt doesn't add up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Someone faked the receipt without the secret key&lt;/td&gt;
&lt;td&gt;🚨 &lt;strong&gt;REJECT&lt;/strong&gt; — signature is invalid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only the honest run passes. Every kind of cheating gets caught.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works (in plain terms)
&lt;/h2&gt;

&lt;p&gt;Three normal building blocks, no magic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A fingerprint of the approved rules.&lt;/strong&gt; Run the rules through a cryptographic hash function (the demo uses SHA-256) and you get a short fingerprint that is unique for all practical purposes. Anyone can fingerprint the &lt;em&gt;approved&lt;/em&gt; rules and compare: if the agent used different rules, the fingerprints won't match.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A receipt you can't edit.&lt;/strong&gt; Every step the agent takes is chained together so each step depends on all the steps before it. Change any one step and the whole thing stops adding up — like a tamper-evident seal:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;seal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;seal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seal&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# each step folds into the seal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;A signature.&lt;/strong&gt; The agent signs the final seal with a secret key. If someone tries to forge a receipt without that key, the signature won't check out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To verify, you just redo all three and ask: &lt;em&gt;Did it use the approved rules? Is the receipt intact? Is the signature real?&lt;/em&gt; All three have to pass.&lt;/p&gt;
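
&lt;p&gt;Here is one way the three checks could fit together, using only the Python standard library. It is a sketch under stated assumptions rather than the demo's actual code: the function names and receipt fields are hypothetical, and the "signature" is an HMAC-SHA256 tag, which means the checker must hold the same key. A setup where truly anyone can verify would use an asymmetric signature scheme instead.&lt;/p&gt;

```python
import hashlib
import hmac

DEMO_KEY = b"not-a-real-secret"   # obvious placeholder, never a real credential


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def seal_steps(rules: str, steps: list[str]) -> str:
    """Fold every step into a running hash that starts from the rules fingerprint."""
    seal = fingerprint(rules)
    for step in steps:
        seal = fingerprint(seal + "\n" + step)
    return seal


def make_receipt(rules: str, steps: list[str], key: bytes) -> dict:
    """Build a receipt: rules fingerprint, the steps, the chained seal, and an HMAC tag."""
    seal = seal_steps(rules, steps)
    signature = hmac.new(key, seal.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"rules_fingerprint": fingerprint(rules), "steps": list(steps),
            "seal": seal, "signature": signature}


def verify(receipt: dict, approved_rules: str, key: bytes) -> str:
    """Redo all three checks; any mismatch rejects the receipt."""
    if receipt["rules_fingerprint"] != fingerprint(approved_rules):
        return "REJECT: rules don't match the approved ones"
    if receipt["seal"] != seal_steps(approved_rules, receipt["steps"]):
        return "REJECT: receipt doesn't add up"
    expected = hmac.new(key, receipt["seal"].encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, receipt["signature"]):
        return "REJECT: signature is invalid"
    return "ACCEPT"


rules = "refund only if amount is at most 100"
steps = ["look up order 17", "refund $40"]
honest = make_receipt(rules, steps, DEMO_KEY)
edited = dict(honest, steps=["look up order 17", "refund $5000"])

print(verify(honest, rules, DEMO_KEY))                                    # ACCEPT
print(verify(make_receipt("refund anything", steps, DEMO_KEY), rules, DEMO_KEY))
print(verify(edited, rules, DEMO_KEY))
print(verify(make_receipt(rules, steps, b"wrong-key"), rules, DEMO_KEY))
```

&lt;p&gt;The last three calls are rejected for swapped rules, an edited step, and a forged signature, in that order, which mirrors the table above.&lt;/p&gt;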

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Every other post in this series makes agents &lt;em&gt;more independent&lt;/em&gt; — they rewrite their own code, sleep, model other people, get curious. This one is the safety net for all of that: &lt;strong&gt;independence without a way to check up on it is a liability.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The more power we hand to agents, the less we can afford to just trust them — and the more we need a way to &lt;em&gt;check&lt;/em&gt; them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The end goal of the real research is even stronger: prove an agent followed the approved rules &lt;strong&gt;without re-running it and without exposing any private data or secret model.&lt;/strong&gt; That lets two companies trust each other's agents — yours proves it behaved, mine checks the proof, and neither of us has to reveal our secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/05-verifiable-agent
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honest note: the real research uses heavier cryptography so the checker doesn't have to re-run anything and never sees the secret model. My demo re-checks a signed, sealed receipt instead — much simpler, and it shows the same payoff (cheat in any way ⇒ rejected) so you can feel what "verifiable behavior" actually buys you. It uses only standard, modern hashing (SHA-256), and the "secret key" is an obvious fake, never a real credential.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 5 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; "zkML" / verifiable inference — proving an AI model ran exactly as claimed. See "Verifiable evaluations of machine learning models using zkSNARKs" (&lt;a href="https://arxiv.org/abs/2402.02675" rel="noopener noreferrer"&gt;arXiv:2402.02675&lt;/a&gt;) and the survey "Zero-Knowledge Proof Based Verifiable Machine Learning" (&lt;a href="https://arxiv.org/abs/2502.18535" rel="noopener noreferrer"&gt;arXiv:2502.18535&lt;/a&gt;). Tools like &lt;a href="https://docs.ezkl.xyz/" rel="noopener noreferrer"&gt;EZKL&lt;/a&gt; do this for real ONNX models today.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI Agent That Gets Curious On Its Own</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:43:35 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/i-built-an-ai-agent-that-gets-curious-on-its-own-4oe1</link>
      <guid>https://dev.to/shridhar_shah2297/i-built-an-ai-agent-that-gets-curious-on-its-own-4oe1</guid>
      <description>&lt;p&gt;&lt;em&gt;Active inference: curiosity emerges for free from minimizing surprise — 48% vs 100% on a foraging task.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Most AI agents chase rewards — they pick whatever action scores the most points. I tried a different, brain-inspired goal: avoid &lt;em&gt;surprises&lt;/em&gt;. Something neat happened — the agent became &lt;strong&gt;curious without being told to.&lt;/strong&gt; It goes looking for information before acting, and that takes it from 48% to 100% on a simple task. ~100 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two different ways to make decisions
&lt;/h2&gt;

&lt;p&gt;Most AI agents are "reward chasers." Give them points for doing well, and they'll pick whatever action they expect to score highest. Simple and effective.&lt;/p&gt;

&lt;p&gt;There's another idea from brain science: instead of chasing points, &lt;strong&gt;try to avoid being surprised&lt;/strong&gt; — act so the world matches what you expected. It sounds almost too simple, but it leads to a surprising bonus: &lt;strong&gt;when you're trying not to be surprised, going and finding out what you don't know becomes valuable all by itself.&lt;/strong&gt; In other words, curiosity isn't something you have to bolt on. It comes for free.&lt;/p&gt;

&lt;p&gt;This is called &lt;em&gt;active inference&lt;/em&gt;, and in 2026 it jumped from neuroscience into AI as a serious approach (&lt;a href="https://arxiv.org/abs/2606.22813" rel="noopener noreferrer"&gt;here's a 2026 paper&lt;/a&gt;). Here's the smallest demo that makes it click.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version
&lt;/h2&gt;

&lt;p&gt;The task: a reward is hidden behind either the &lt;strong&gt;LEFT&lt;/strong&gt; door or the &lt;strong&gt;RIGHT&lt;/strong&gt; door (50/50). There's also a &lt;strong&gt;hint&lt;/strong&gt; you can check that tells you which door — &lt;em&gt;if you bother to look.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;❌ Reward-chaser&lt;/th&gt;
&lt;th&gt;✅ Curious agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it cares about&lt;/td&gt;
&lt;td&gt;getting the reward, right now&lt;/td&gt;
&lt;td&gt;getting the reward &lt;strong&gt;+ not being unsure&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What it does&lt;/td&gt;
&lt;td&gt;guesses a door&lt;/td&gt;
&lt;td&gt;checks the hint first, &lt;em&gt;then&lt;/em&gt; opens the right door&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Success (400 tries)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nobody told the second agent "go check the hint." It did it on its own, because being unsure &lt;em&gt;bothered&lt;/em&gt; it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Before acting, the agent scores each option on two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Does this get me closer to the reward?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does this make me less unsure about what's going on?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;value_of_checking_the_hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;how_unsure_am_i&lt;/span&gt;    &lt;span class="c1"&gt;# high when it's a total coin-flip
&lt;/span&gt;&lt;span class="n"&gt;value_of_just_guessing&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chance_of_being_right&lt;/span&gt;  &lt;span class="c1"&gt;# only ~50% on a blind guess
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value_of_checking_the_hint&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;value_of_just_guessing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;check_the_hint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;     &lt;span class="c1"&gt;# this is where curiosity shows up
&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_door&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# now actually go get the reward
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it's a total coin-flip, checking the hint is worth a lot (it removes all the doubt), way more than a 50/50 guess. So it looks first. Once it &lt;em&gt;knows&lt;/em&gt;, there's nothing left to be unsure about, so it just grabs the reward. The reward-chaser never sees any value in the hint, so it flips a coin forever.&lt;/p&gt;
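
&lt;p&gt;A runnable version of that decision might look like the sketch below. It follows the simplified comparison in this post, not the full expected-free-energy math: uncertainty is measured as Shannon entropy in bits, the hint is assumed to be perfectly reliable, and all function names are hypothetical.&lt;/p&gt;

```python
import math
import random


def uncertainty_bits(belief: dict[str, float]) -> float:
    """Shannon entropy of the belief over doors: 1.0 for a coin-flip, 0.0 when certain."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def curious_agent(reward_door: str) -> tuple[str, bool]:
    """Return (door opened, whether the hint was checked first)."""
    belief = {"LEFT": 0.5, "RIGHT": 0.5}
    checked = False
    value_of_checking = uncertainty_bits(belief)   # doubt the hint would remove
    value_of_guessing = max(belief.values())       # chance a blind guess is right
    if value_of_checking > value_of_guessing:
        # Assumption: the hint is perfectly reliable, so belief collapses to the truth.
        belief = {door: 1.0 if door == reward_door else 0.0 for door in belief}
        checked = True
    return max(belief, key=belief.get), checked


def reward_chaser(rng: random.Random) -> str:
    """Sees no value in the hint, so it picks a door at random."""
    return rng.choice(["LEFT", "RIGHT"])


rng = random.Random(0)
doors = [rng.choice(["LEFT", "RIGHT"]) for _ in range(400)]
curious_wins = sum(curious_agent(d)[0] == d for d in doors)
chaser_wins = sum(reward_chaser(rng) == d for d in doors)
print(f"curious: {curious_wins}/400, reward-chaser: {chaser_wins}/400")
```

&lt;p&gt;With a 50/50 belief the entropy is 1 bit, which beats the 0.5 chance of a blind guess, so the curious agent always looks first and always wins. The reward-chaser hovers around half, with the exact count depending on the random seed.&lt;/p&gt;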

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Two reasons engineers should care:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curiosity for free.&lt;/strong&gt; A long-standing headache in AI is agents getting stuck doing the same thing, never trying anything new. People hand-tune "exploration bonuses" to force them to explore. This approach gives you curiosity automatically — the agent looks for info exactly when it's unsure, and stops once it isn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It handles surprises.&lt;/strong&gt; An agent built to avoid surprises is built to deal with situations it wasn't trained for. When reality stops matching its expectations, closing that gap &lt;em&gt;becomes&lt;/em&gt; its goal — so it keeps adapting instead of breaking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;A reward-chaser asks "what gets me the most points?" A surprise-avoider asks "what don't I understand yet?" — and that second question is what makes it adapt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/04-active-inference
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honest note: the full version of this idea has a fair bit of math behind it. I've boiled it down to the one decision that makes it obvious — &lt;em&gt;being unsure has a cost&lt;/em&gt; — so you can watch curiosity appear in a few lines of code.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 4 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; Karl Friston's "Free Energy Principle" (the brain-science origin); "Active Inference as the Test-Time Scaling Law for Physical AI Agents" (arXiv:2606.22813).&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Can an AI Agent Pass the Test We Give 4-Year-Olds?</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:43:33 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/can-an-ai-agent-pass-the-test-we-give-4-year-olds-5825</link>
      <guid>https://dev.to/shridhar_shah2297/can-an-ai-agent-pass-the-test-we-give-4-year-olds-5825</guid>
      <description>&lt;p&gt;&lt;em&gt;Theory of Mind and the Sally-Anne false-belief test, in ~60 lines of Python.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; There's a famous test that kids pass around age 4. It checks whether you understand that &lt;em&gt;other people can believe things that aren't true.&lt;/em&gt; I built two AI agents: one that only knows "what's actually happening" (fails, like a toddler) and one that keeps track of what &lt;em&gt;each person&lt;/em&gt; believes (passes). It's ~110 lines, and it's the foundation for agents that can actually work &lt;em&gt;together&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The test
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Sally puts her marble in the &lt;strong&gt;basket&lt;/strong&gt;, then leaves the room.&lt;/li&gt;
&lt;li&gt;While she's gone, Anne moves the marble to the &lt;strong&gt;box&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Sally comes back. &lt;strong&gt;Where will she look for her marble?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you said &lt;em&gt;basket&lt;/em&gt;, nice — you just used something called "theory of mind." Sally never saw the marble move, so in her head it's still in the basket. What's &lt;em&gt;actually&lt;/em&gt; true (it's in the box) and what &lt;em&gt;Sally believes&lt;/em&gt; (it's in the basket) are two different things, and you kept them separate without even thinking about it.&lt;/p&gt;

&lt;p&gt;A typical 3-year-old says "box" because they can't yet separate what &lt;em&gt;they&lt;/em&gt; know from what &lt;em&gt;Sally&lt;/em&gt; knows. By around age 4, most kids say "basket." It's one of the most famous tests in child psychology, and in 2026 it's become a real test for AI agents too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;❌ Agent with no "theory of mind"&lt;/th&gt;
&lt;th&gt;✅ Agent that models other minds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it tracks&lt;/td&gt;
&lt;td&gt;only what's actually true&lt;/td&gt;
&lt;td&gt;what &lt;em&gt;each person&lt;/em&gt; believes, separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where will Sally look?&lt;/td&gt;
&lt;td&gt;"box"&lt;/td&gt;
&lt;td&gt;"basket"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result&lt;/td&gt;
&lt;td&gt;FAIL (only knows reality)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PASS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How it works (the whole trick)
&lt;/h2&gt;

&lt;p&gt;The only difference between the two agents is one rule: &lt;strong&gt;a person's belief only updates when that person is actually in the room to see it happen.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;someone_moves_the_marble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_place&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;who_is_watching&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;who_is_watching&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="c1"&gt;# only people in the room
&lt;/span&gt;        &lt;span class="n"&gt;beliefs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_place&lt;/span&gt;        &lt;span class="c1"&gt;# update THEIR mental picture
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So when Anne moves the marble while Sally is out, only Anne's mental picture updates. Sally's is frozen at "basket." Ask the simple agent and it just reports reality ("box"). Ask the smarter agent and it answers from &lt;em&gt;Sally's&lt;/em&gt; point of view ("basket").&lt;/p&gt;

&lt;p&gt;That's the whole thing. But keeping a separate picture of "what does each &lt;em&gt;other&lt;/em&gt; person know" is the difference between an agent that's a good teammate and one that isn't.&lt;/p&gt;
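&lt;p&gt;Here is a minimal, self-contained sketch of the full scenario, assuming the simplest possible setup. The &lt;code&gt;World&lt;/code&gt; class and the two answer functions are illustrative names of mine, not necessarily what the repo's &lt;code&gt;demo.py&lt;/code&gt; uses.&lt;/p&gt;

```python
class World:
    """Tracks reality plus one separate belief per person (hypothetical names)."""

    def __init__(self, people, marble_at):
        self.reality = marble_at
        self.in_room = set(people)
        # Everyone starts in the room with an accurate picture.
        self.beliefs = {person: marble_at for person in people}

    def leave(self, person):
        self.in_room.discard(person)

    def enter(self, person):
        # Walking back in does not reveal where the marble is now.
        self.in_room.add(person)

    def move_marble(self, new_place):
        self.reality = new_place
        for person in self.in_room:          # only witnesses update
            self.beliefs[person] = new_place


def reality_only_answer(world, person):
    # No theory of mind: ignores who saw what.
    return world.reality


def belief_tracking_answer(world, person):
    # Theory of mind: answers from that person's point of view.
    return world.beliefs[person]


world = World(people=["Sally", "Anne"], marble_at="basket")
world.leave("Sally")
world.move_marble("box")                     # Anne moves it while Sally is out
world.enter("Sally")

print(reality_only_answer(world, "Sally"))      # box
print(belief_tracking_answer(world, "Sally"))   # basket
print(belief_tracking_answer(world, "Anne"))    # box
```

&lt;p&gt;The two agents share the same world. They differ only in which field they read when asked about Sally.&lt;/p&gt;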

&lt;h2&gt;
  
  
  Why this isn't just a cute puzzle
&lt;/h2&gt;

&lt;p&gt;Almost everything useful about multiple agents (or an agent working with a human) needs this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handing off work:&lt;/strong&gt; to delegate, I need to know what you already know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explaining things:&lt;/strong&gt; I should tell you the part you're &lt;em&gt;missing&lt;/em&gt;, not dump everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warning someone:&lt;/strong&gt; "Heads up, Sally still thinks the marble's in the basket" only works if I can track Sally's wrong belief.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not causing chaos:&lt;/strong&gt; an agent that assumes everyone knows what &lt;em&gt;it&lt;/em&gt; knows will skip important info and make bad assumptions.&lt;/li&gt;
&lt;/ul&gt;
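&lt;p&gt;The "explaining things" case is easy to sketch once beliefs are tracked per person: compare what I know with what I think you believe, and send only the difference. The function name and the extra facts (&lt;code&gt;door&lt;/code&gt;, &lt;code&gt;lights&lt;/code&gt;) are made up for illustration.&lt;/p&gt;

```python
def what_to_tell(my_knowledge, their_beliefs):
    """Return only the facts the other party is missing or has wrong."""
    return {
        key: value
        for key, value in my_knowledge.items()
        if their_beliefs.get(key) != value
    }


my_knowledge = {"marble": "box", "door": "closed", "lights": "on"}
sally_beliefs = {"marble": "basket", "door": "closed"}

update = what_to_tell(my_knowledge, sally_beliefs)
print(update)   # {'marble': 'box', 'lights': 'on'}
```

&lt;p&gt;Sally already has the door right, so it is left out. The stale marble belief and the fact she never heard are the only things worth sending.&lt;/p&gt;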

&lt;p&gt;Most AI today reasons about &lt;em&gt;the world&lt;/em&gt;. The 2026 shift is reasoning about &lt;em&gt;the people in the world&lt;/em&gt; — including when they're wrong. That's what turns a smart tool into a real collaborator.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Being smart about the world makes a good tool. Being smart about &lt;em&gt;other people&lt;/em&gt; makes a good teammate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/03-theory-of-mind
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honest note: real versions have to &lt;em&gt;figure out&lt;/em&gt; what someone believes by watching their behavior, which is much harder. Here I just tell the agent who was in the room, so the core idea — track beliefs separately from reality — is as clear as possible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 3 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; the Sally-Anne false-belief test (Baron-Cohen, Leslie &amp;amp; Frith, 1985); Kosinski, "Evaluating Large Language Models in Theory of Mind Tasks" (PNAS 2024 / &lt;a href="https://arxiv.org/abs/2302.02083" rel="noopener noreferrer"&gt;arXiv:2302.02083&lt;/a&gt;); and a 2026 follow-up showing how brittle this still is — "Understanding Artificial Theory of Mind" (&lt;a href="https://arxiv.org/abs/2602.22072" rel="noopener noreferrer"&gt;arXiv:2602.22072&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Do AI Agents Need to Sleep? I Built One That Does</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:36:55 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/do-ai-agents-need-to-sleep-i-built-one-that-does-53c4</link>
      <guid>https://dev.to/shridhar_shah2297/do-ai-agents-need-to-sleep-i-built-one-that-does-53c4</guid>
      <description>&lt;p&gt;&lt;em&gt;A sleep-like phase that consolidates noisy daily experience into durable memory — 75% vs 100% recall.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; There's a wave of 2026 research giving AI a "sleep" phase — time spent &lt;em&gt;not&lt;/em&gt; answering questions, just tidying up what it learned that day. I built a 90-line demo of the idea. The agent that "sleeps" remembers &lt;strong&gt;100%&lt;/strong&gt; of what it learned. The exact same agent &lt;em&gt;without&lt;/em&gt; sleep remembers only &lt;strong&gt;75%&lt;/strong&gt; and gets confused by bad info. Runs on a laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The memory problem every AI app hits
&lt;/h2&gt;

&lt;p&gt;If you've built anything with an LLM, you know the pain: the model only "remembers" what's in its current context window. Once the conversation gets long enough, the oldest stuff scrolls off the top and is just... gone. Forgotten.&lt;/p&gt;

&lt;p&gt;The usual fix is "make the context window bigger." But that's like fixing a messy desk by buying a bigger desk. It's expensive, and the model still gets worse as you cram more in (a real, measured effect — more text in the window can actually &lt;em&gt;lower&lt;/em&gt; accuracy).&lt;/p&gt;

&lt;p&gt;Your brain doesn't work this way. You don't remember every sentence anyone said today. While you sleep, your brain replays the day, keeps the important bits as long-term memory, and dumps the rest. That's how you remember "I like coffee" without remembering every single cup.&lt;/p&gt;

&lt;p&gt;A couple of 2026 papers ask the obvious question: &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2605.26099" rel="noopener noreferrer"&gt;Do Language Models Need Sleep?&lt;/a&gt;&lt;/strong&gt; Their answer: giving an AI a quiet "offline" phase to consolidate memories makes it remember better. So I built the simplest version that shows why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;❌ Agent with no sleep&lt;/th&gt;
&lt;th&gt;✅ Agent that sleeps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How it remembers&lt;/td&gt;
&lt;td&gt;keeps only the last N messages&lt;/td&gt;
&lt;td&gt;saves a tidy summary every night&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After 30 noisy days&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75% recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% recall&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tricked by bad info?&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no — it goes with what it saw most often&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same experiences, same noise, same memory test. The only difference is whether the agent sleeps.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Each "day," the agent hears facts like &lt;code&gt;Alice → drinks → coffee&lt;/code&gt;. To make it realistic, about 1 in 5 facts is wrong (people misremember, logs have errors).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;no-sleep agent&lt;/strong&gt; only keeps the last 10 things it heard. Anything older falls off the edge and is forgotten. And one bad recent day can flip its answer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;sleeping agent&lt;/strong&gt; does one extra thing each night: it goes back through the day, updates a small running tally of what it heard, and then clears out the raw log:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;todays_notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;# add today's notes to the long-term tally
&lt;/span&gt;    &lt;span class="n"&gt;todays_notes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;             &lt;span class="c1"&gt;# forget the raw firehose, keep the summary
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tiny step buys two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't forget.&lt;/strong&gt; The summary sticks around even after the raw messages are gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It filters out bad info.&lt;/strong&gt; Because it counts how often it heard each thing across many days, the occasional wrong fact gets outvoted by the truth.&lt;/li&gt;
&lt;/ol&gt;
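&lt;p&gt;To see both effects side by side, here is a small sketch with a hand-written, deterministic stream of notes instead of random noise. The class names and the three-day stream are my own assumptions, and the scores it prints are for this toy stream only, not the 75% and 100% from the 30-day demo.&lt;/p&gt;

```python
from collections import Counter, defaultdict, deque


class NoSleepAgent:
    def __init__(self, window=10):
        self.recent = deque(maxlen=window)    # older notes silently fall off

    def hear(self, person, relation, value):
        self.recent.append((person, relation, value))

    def end_of_day(self):
        pass                                   # nothing is consolidated

    def recall(self, person):
        for who, _relation, value in reversed(self.recent):
            if who == person:
                return value                   # latest note wins, right or wrong
        return None                            # fell out of the window


class SleepingAgent:
    def __init__(self):
        self.todays_notes = []
        self.memory = defaultdict(Counter)     # person -> tally of values heard

    def hear(self, person, relation, value):
        self.todays_notes.append((person, relation, value))

    def end_of_day(self):
        for person, _relation, value in self.todays_notes:
            self.memory[person][value] += 1    # fold today into the tally
        self.todays_notes.clear()              # drop the raw log

    def recall(self, person):
        tally = self.memory.get(person)
        return tally.most_common(1)[0][0] if tally else None


truth = {"Alice": "coffee", "Bob": "tea", "Carol": "juice"}
days = [
    [("Alice", "drinks", "coffee")] * 3,
    [("Bob", "drinks", "tea")] * 4 + [("Carol", "drinks", "juice")] * 4,
    [("Carol", "drinks", "juice")] * 3 + [("Bob", "drinks", "cola")],  # wrong note
]


def score(agent):
    for day in days:
        for note in day:
            agent.hear(*note)
        agent.end_of_day()
    return sum(agent.recall(person) == value for person, value in truth.items())


print(score(NoSleepAgent()), "of", len(truth))    # 1 of 3
print(score(SleepingAgent()), "of", len(truth))   # 3 of 3
```

&lt;p&gt;The no-sleep agent loses Alice entirely (her notes fell out of the 10-item window) and is fooled about Bob by the single wrong note at the end. The sleeping agent gets both right because the tally survives and the wrong note is outvoted.&lt;/p&gt;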

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Everyone's trying to fix AI memory by making the context window huge. But a bigger window is still just a bigger pile of raw text — expensive, and it still overflows.&lt;/p&gt;

&lt;p&gt;Sleep is a smarter bet: &lt;strong&gt;do the cleanup when the agent is idle.&lt;/strong&gt; Spend a little time while nobody's waiting to turn today's messy notes into a clean, permanent summary — so when someone &lt;em&gt;does&lt;/em&gt; ask, the answer is fast, cheap, and correct. It's the same theme as an agent that improves its own code: get better while you run, not just when a human retrains you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The better AI agent doesn't have a bigger memory. It has a &lt;em&gt;tidier&lt;/em&gt; one — because it sleeps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/02-agents-that-dream
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honest note: real systems fold these summaries into the model itself with fancier methods. Mine just uses a plain dictionary. The &lt;em&gt;idea&lt;/em&gt; (replay the day → save a summary → clear the raw log) is exactly the same; the code is kept tiny on purpose.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 2 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; "Do Language Models Need Sleep?" (arXiv:2605.26099); "Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories" (arXiv:2606.03979).&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI Agent That Rewrites Its Own Code (in ~150 lines)</title>
      <dc:creator>Shridhar Shah</dc:creator>
      <pubDate>Sat, 27 Jun 2026 21:36:53 +0000</pubDate>
      <link>https://dev.to/shridhar_shah2297/i-built-an-ai-agent-that-rewrites-its-own-code-in-150-lines-3jjo</link>
      <guid>https://dev.to/shridhar_shah2297/i-built-an-ai-agent-that-rewrites-its-own-code-in-150-lines-3jjo</guid>
      <description>&lt;p&gt;&lt;em&gt;A tiny Darwin Gödel Machine that edits itself and keeps only changes that verifiably score higher.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built a small program that improves &lt;em&gt;itself&lt;/em&gt;. It looks at the tasks it's failing, edits its own code to fix them, and keeps a change only if the change actually makes it score better on a test. It goes from passing &lt;strong&gt;1 of 8&lt;/strong&gt; tasks to &lt;strong&gt;8 of 8&lt;/strong&gt; — and nobody wrote those fixes but the program itself. It runs on a laptop in under a second. No fancy hardware, no API key.&lt;/p&gt;




&lt;h2&gt;
  
  
  The old dream: software that improves itself
&lt;/h2&gt;

&lt;p&gt;Normally, software only gets better when &lt;em&gt;we&lt;/em&gt; make it better. You write code, you find a bug, you fix it, you ship again. The program never improves on its own.&lt;/p&gt;

&lt;p&gt;People have wanted "software that improves itself" for decades. The classic version (called a "Gödel Machine") had one rule that made it impractical to build: before the program could change a line of its own code, it had to &lt;em&gt;mathematically prove&lt;/em&gt; the change would help. Proving that about real code is basically impossible, so the idea stayed mostly theoretical.&lt;/p&gt;

&lt;p&gt;In 2025, researchers found a way around it with the &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2505.22954" rel="noopener noreferrer"&gt;Darwin Gödel Machine&lt;/a&gt;&lt;/strong&gt;. They dropped the "prove it first" rule and replaced it with something every engineer already trusts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Try the change. Run the tests. If the score went up, keep it. If not, throw it away.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it. It's basically how we all work — make an edit, run the test suite, keep what passes. The twist is that &lt;em&gt;the program&lt;/em&gt; is the one making the edits. In the real paper, this let an AI coding assistant improve its own tooling and jump from solving &lt;strong&gt;20%&lt;/strong&gt; to &lt;strong&gt;50%&lt;/strong&gt; of a hard benchmark of real GitHub issues.&lt;/p&gt;

&lt;p&gt;I wanted to actually see this happen, so I built the tiniest version I could.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-second version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Start&lt;/th&gt;
&lt;th&gt;After improving itself&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it can do&lt;/td&gt;
&lt;td&gt;only &lt;code&gt;uppercase&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;learned 6 more skills on its own&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test score&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;1 / 8&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;🟢 &lt;strong&gt;8 / 8&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who wrote the fixes?&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;the program did&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start:  ███░░░░░░░░░░░░░░░░░░░░░  1/8   (only knows: uppercase)
+reverse            ██████░░░░░░░░░░░░  2/8
+dedup_csv          █████████░░░░░░░░░  3/8
+sum_csv            ████████████░░░░░░  4/8
+sort_csv           ███████████████░░░  5/8
+title              ██████████████████  6/8
+normalize_inputs   ████████████████████  8/8   ← one fix unlocked TWO tasks
✅ SOLVED 8/8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How it works (the whole thing)
&lt;/h2&gt;

&lt;p&gt;There are only three pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The "agent" is just a bag of skills.&lt;/strong&gt; Each skill is a tiny function — uppercase text, reverse it, sort a list, etc. It starts out knowing almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A test with known answers.&lt;/strong&gt; Every task has a correct answer, so checking the score is a plain equality check: &lt;code&gt;output == expected&lt;/code&gt;. No human grading it, no second AI judging it. Just: did it get the right answer or not? (Scoring against an automatic checker, rather than a human or a judge model, is also how many of today's reasoning models are trained.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The loop.&lt;/strong&gt; Over and over: look at what's failing, add one skill to try to fix it, re-run the test, and &lt;strong&gt;keep the change only if the score went up.&lt;/strong&gt; It also saves every improved version, so it can branch off any of them later instead of getting stuck.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;new_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;old_version&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;add_a_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;things_it_is_failing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_version&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;   &lt;span class="c1"&gt;# did the test score actually improve?
&lt;/span&gt;    &lt;span class="nf"&gt;keep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                          &lt;span class="c1"&gt;# yes -&amp;gt; save it and build on it
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
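&lt;p&gt;Here is a runnable sketch of that loop, under simplifying assumptions: the "edits" come from a small fixed pool, the task list is three items instead of eight, and the archive is only recorded, not branched from. All names (&lt;code&gt;CANDIDATE_SKILLS&lt;/code&gt;, &lt;code&gt;improve&lt;/code&gt;, the &lt;code&gt;noop&lt;/code&gt; skill) are illustrative and may not match the repo.&lt;/p&gt;

```python
CANDIDATE_SKILLS = {
    "uppercase": str.upper,
    "reverse": lambda text: text[::-1],
    "sort_csv": lambda text: ",".join(sorted(text.split(","))),
    "noop": lambda text: text,               # a useless edit that should be rejected
}

# Each task: (skill it needs, input, known correct answer).
TASKS = [
    ("uppercase", "abc", "ABC"),
    ("reverse", "abc", "cba"),
    ("sort_csv", "b,c,a", "a,b,c"),
]


def score(agent):
    """Count tasks whose output equals the known answer."""
    passed = 0
    for skill_name, given, expected in TASKS:
        skill = agent.get(skill_name)
        if skill is not None and skill(given) == expected:
            passed += 1
    return passed


def improve(agent):
    archive = [dict(agent)]                  # every kept version, oldest first
    for name, skill in CANDIDATE_SKILLS.items():
        if name in agent:
            continue
        candidate = {**agent, name: skill}   # propose one edit
        if score(candidate) > score(agent):  # keep only verified improvements
            agent = candidate
            archive.append(dict(agent))
    return agent, archive


start = {"uppercase": str.upper}
final, archive = improve(start)
print(score(start), "then", score(final))    # 1 then 3
print(sorted(final))                         # ['reverse', 'sort_csv', 'uppercase']
print(len(archive))                          # 3
```

&lt;p&gt;The &lt;code&gt;noop&lt;/code&gt; skill never makes it in, because adding it does not raise the score. That rejection is the whole safety net: an edit that cannot show an improvement on the test is discarded.&lt;/p&gt;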



&lt;h2&gt;
  
  
  The cool part: small fixes unlock big ones
&lt;/h2&gt;

&lt;p&gt;One of the skills it adds, "clean up the input" (trim weird spacing), does &lt;strong&gt;nothing&lt;/strong&gt; by itself. But the agent had earlier learned a "title-case" skill that kept breaking on messy text like &lt;code&gt;"  the   quick   fox "&lt;/code&gt;. The moment it adds the cleanup step, &lt;strong&gt;two stuck tasks start passing at once&lt;/strong&gt; — that's the +2 jump at the end.&lt;/p&gt;

&lt;p&gt;This is the whole point in miniature: the agent isn't just adding features. It's making itself &lt;em&gt;better at getting better&lt;/em&gt;. A boring little fix becomes the stepping stone that makes later fixes work. The real research sees the same thing at full scale — the AI invents helpers like "try a few solutions and pick the best one," which then make &lt;em&gt;every&lt;/em&gt; future fix more effective.&lt;/p&gt;
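&lt;p&gt;A small sketch of the stepping-stone effect, assuming the cleanup step runs before whichever skill a task needs. The title-case input is the one from above. The second task (uppercase on messy text) is my guess at what a second stuck task could look like, not taken from the repo.&lt;/p&gt;

```python
def title(text):
    # Naive title-case: splits on single spaces, so messy spacing survives.
    return " ".join(word.capitalize() for word in text.split(" "))


def normalize_inputs(text):
    return " ".join(text.split())            # collapse whitespace runs, trim ends


def run(task_skill, text, use_cleanup):
    if use_cleanup:
        text = normalize_inputs(text)
    return task_skill(text)


TASKS = [
    (title, "  the   quick   fox ", "The Quick Fox"),
    (str.upper, " hello   world ", "HELLO WORLD"),   # hypothetical second task
]


def score(use_cleanup):
    return sum(
        run(skill, given, use_cleanup) == expected
        for skill, given, expected in TASKS
    )


print(score(use_cleanup=False))   # 0
print(score(use_cleanup=True))    # 2
```

&lt;p&gt;The cleanup step solves no task on its own, yet switching it on moves the score by two. Under the keep-if-better rule it is still accepted, because what gets measured is the whole agent, not the individual skill.&lt;/p&gt;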

&lt;h2&gt;
  
  
  Why I think this is where things are going
&lt;/h2&gt;

&lt;p&gt;For ten years, the way to make AI better was: make the model bigger. The newer idea is to make it &lt;strong&gt;improve itself while it runs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This post&lt;/strong&gt; — an agent that rewrites its own code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2606.03979" rel="noopener noreferrer"&gt;"Language Models Need Sleep"&lt;/a&gt;&lt;/strong&gt; (2026) — agents that tidy up their own memory during an offline "sleep."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small models that think harder&lt;/strong&gt; instead of being bigger.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: improvement is shifting from &lt;em&gt;us retraining the model&lt;/em&gt; to &lt;em&gt;the program improving itself&lt;/em&gt;, with a simple test telling it whether each change was good. Software that edits itself starts to feel less like a fixed program and more like something that grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it (under a minute)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Shridhar-2205/living-software
&lt;span class="nb"&gt;cd &lt;/span&gt;living-software/01-self-rewriting-agent
python demo_cli.py     &lt;span class="c"&gt;# watch the score climb 1/8 → 8/8&lt;/span&gt;
pytest &lt;span class="nt"&gt;-q&lt;/span&gt;              &lt;span class="c"&gt;# the same claims, as automated tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One honest note on safety: a &lt;em&gt;real&lt;/em&gt; self-rewriting agent runs code it wrote itself, which is risky. In my version the "edits" come from a fixed list of safe skills, so nothing dangerous ever runs — the &lt;em&gt;loop&lt;/em&gt; matches the research, the &lt;em&gt;risk&lt;/em&gt; is zero. (The real one runs inside a sandbox for exactly this reason.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The old dream needed a mathematical proof before changing any code. The new version just needs a &lt;strong&gt;test&lt;/strong&gt;. If you can write a check that says "this got better," you can let a program improve itself — and watch it find clever fixes you never wrote.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Shridhar Shah&lt;/strong&gt; — Senior Software Engineer on the AI team at Cisco. Part 1 of &lt;em&gt;Toward Living Software&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Shridhar-2205" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/shridhar-shah-220b1721b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; Zhang, Hu, Lu, Lange, Clune, "Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents," arXiv:2505.22954 (2025) — reports SWE-bench 20.0% → 50.0%.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
