<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan Broenner</title>
    <description>The latest articles on DEV Community by Stefan Broenner (@sbroenne).</description>
    <link>https://dev.to/sbroenne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3758524%2F817515ae-dac8-4fbe-ae02-e1e59922895c.jpg</url>
      <title>DEV Community: Stefan Broenner</title>
      <link>https://dev.to/sbroenne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sbroenne"/>
    <language>en</language>
    <item>
      <title>skillpm - Package Manager for Agent Skills. Built on npm.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:12:17 +0000</pubDate>
      <link>https://dev.to/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</link>
      <guid>https://dev.to/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</guid>
      <description>&lt;p&gt;Every &lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;agent skill&lt;/a&gt; today is a monolith.&lt;/p&gt;

&lt;p&gt;Authors cram React patterns, TypeScript best practices, and testing guidelines into a single massive SKILL.md — because there's no way to say "just depend on that other skill." No registry. No dependency management. No versioning. The &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills spec&lt;/a&gt; defines what a skill &lt;em&gt;is&lt;/em&gt;, but says nothing about how to publish, install, or share them.&lt;/p&gt;

&lt;p&gt;We (Sonnet 4.6 &amp;amp; myself) built &lt;strong&gt;&lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm&lt;/a&gt;&lt;/strong&gt; to fix that — a lightweight orchestration layer on top of npm. ~630 lines of code, 3 dependencies, zero reinvention. Small skills that compose, not monoliths that overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: don't reinvent npm — extend it
&lt;/h2&gt;

&lt;p&gt;When we started, the tempting path was to build a custom registry, a custom resolver, a custom lockfile format. We chose the opposite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm is a thin orchestration layer on top of npm.&lt;/strong&gt; Same &lt;code&gt;package.json&lt;/code&gt;. Same &lt;code&gt;node_modules/&lt;/code&gt;. Same &lt;code&gt;package-lock.json&lt;/code&gt;. Same registry (npmjs.org). skillpm only adds what npm can't do on its own:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scanning&lt;/strong&gt; &lt;code&gt;node_modules/&lt;/code&gt; for packages containing &lt;code&gt;skills/*/SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wiring&lt;/strong&gt; discovered skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt; (Claude, Cursor, VS Code, Codex, and many more)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring MCP servers&lt;/strong&gt; declared by skills, transitively across the dependency tree, via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Everything else — resolution, caching, lockfiles, audit, semver — is npm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install a skill (no global install needed)&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;excel-mcp-skill

&lt;span class="c"&gt;# List what's installed&lt;/span&gt;
npx skillpm list

&lt;span class="c"&gt;# Scaffold a new skill&lt;/span&gt;
npx skillpm init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;npx skillpm install &amp;lt;skill&amp;gt;&lt;/code&gt;, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx skillpm install react-patterns
  │
  ▼
📦 npm install react-patterns
   npm handles resolution, download, lockfile
  │
  ▼
🔍 Scan node_modules/
   find packages with skills/*/SKILL.md
  │
  ▼
🔗 npx skills add ./node_modules/...
   wire into Claude, Cursor, VS Code, Codex...
  │
  ▼
📄 Read skillpm.mcpServers
   walk entire dependency tree
  │
  ▼
🔌 npx add-mcp &amp;lt;server&amp;gt;
   configure each MCP server across agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four tools, each doing one thing well, orchestrated together.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a skill package looks like
&lt;/h2&gt;

&lt;p&gt;A skill is just an npm package with a specific directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-skill/
├── package.json          # keywords: ["agent-skill"]
├── README.md
├── LICENSE
└── skills/
    └── my-skill/
        ├── SKILL.md      # The skill definition
        ├── scripts/      # Optional executable scripts
        ├── references/   # Optional reference docs
        └── assets/       # Optional templates/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; is where the magic lives — YAML frontmatter for metadata, Markdown body with instructions for the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-skill&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;React&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;functional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hooks."&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash Read Edit&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# React Refactoring&lt;/span&gt;

&lt;span class="c1"&gt;## When to use this skill&lt;/span&gt;
&lt;span class="s"&gt;Use when the user asks to modernize React components...&lt;/span&gt;

&lt;span class="c1"&gt;## Instructions&lt;/span&gt;
&lt;span class="s"&gt;1. Identify class components in the target files&lt;/span&gt;
&lt;span class="s"&gt;2. Convert lifecycle methods to useEffect hooks&lt;/span&gt;
&lt;span class="s"&gt;3. ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Skills can depend on other skills
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Because skills are npm packages, they can depend on each other. &lt;strong&gt;This solves the biggest problem with Agent Skills today: prompt bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without dependency management, if you want a skill that builds a full-stack React app, you have to copy-paste instructions for React, TypeScript, testing, and styling into one massive &lt;code&gt;SKILL.md&lt;/code&gt;. The agent gets overwhelmed, context windows fill up, and the skill becomes impossible to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm brings standard software engineering practices to Agent Skills.&lt;/strong&gt; You don't copy-paste code anymore; you shouldn't copy-paste prompts either. With skillpm, you just declare dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fullstack-react"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"agent-skill"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"react-patterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"typescript-best-practices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.3.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"testing-with-vitest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.0.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;npx skillpm install fullstack-react&lt;/code&gt; resolves the entire tree — all three dependencies get installed, scanned, wired into agents, and their MCP servers configured. One command.&lt;/p&gt;

&lt;p&gt;Instead of monolithic, 500-line prompt files, you can build small, composable, single-purpose skills that build on top of each other. It's modularity for AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP server configuration, handled
&lt;/h2&gt;

&lt;p&gt;Skills often need MCP servers to function. The &lt;code&gt;skillpm.mcpServers&lt;/code&gt; field declares those requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.context7.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;skillpm walks the &lt;em&gt;entire&lt;/em&gt; dependency tree, collects all MCP server requirements (deduplicated), and configures each one via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;add-mcp&lt;/a&gt;. The user never has to manually configure MCP servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ecosystem today
&lt;/h2&gt;

&lt;p&gt;There are already &lt;strong&gt;90+ skill packages&lt;/strong&gt; on npm with the &lt;code&gt;agent-skill&lt;/code&gt; keyword. We built an &lt;a href="https://skillpm.dev/registry/" rel="noopener noreferrer"&gt;Agent Skills Registry&lt;/a&gt; that indexes them all — searchable, filterable by keyword, sortable by downloads or recency.&lt;/p&gt;

&lt;p&gt;Most existing packages follow the original spec (root &lt;code&gt;SKILL.md&lt;/code&gt;) rather than our npm packaging convention (&lt;code&gt;skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt;). skillpm handles both — legacy packages get installed with a friendly warning pointing to the migration guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create and publish a skill in 60 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-awesome-skill &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-awesome-skill
npx skillpm init
&lt;span class="c"&gt;# Edit skills/my-awesome-skill/SKILL.md with your instructions&lt;/span&gt;
npx skillpm publish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's literally it. Your skill is now on npmjs.org, discoverable by anyone running &lt;code&gt;npx skillpm install my-awesome-skill&lt;/code&gt;, and automatically wired into their agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use npm directly?
&lt;/h2&gt;

&lt;p&gt;You can! Skills are valid npm packages. But skillpm adds what &lt;code&gt;npm install&lt;/code&gt; alone can't do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Scan for &lt;code&gt;SKILL.md&lt;/code&gt; files in installed packages&lt;/li&gt;
&lt;li&gt;✅ Link skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Configure MCP servers via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Validate skill packages before publishing&lt;/li&gt;
&lt;li&gt;✅ Show you what skills are installed and where they're wired&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the gap skillpm fills. It's npm + skill awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built on the shoulders of giants
&lt;/h2&gt;

&lt;p&gt;We deliberately don't reinvent anything. skillpm shells out to four battle-tested tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;How skillpm uses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;npm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Package management&lt;/td&gt;
&lt;td&gt;All installs, resolution, lockfiles, caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/a&gt; (Vercel)&lt;/td&gt;
&lt;td&gt;Agent directory linking&lt;/td&gt;
&lt;td&gt;Wires skills into agent directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;strong&gt;add-mcp&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;MCP server configuration&lt;/td&gt;
&lt;td&gt;Configures servers across agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.npmjs.com/package/skills-ref" rel="noopener noreferrer"&gt;&lt;strong&gt;skills-ref&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Spec validation&lt;/td&gt;
&lt;td&gt;Validates SKILL.md during publish&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Try it now — no install required&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;skillpm-skill

&lt;span class="c"&gt;# Browse the registry&lt;/span&gt;
&lt;span class="c"&gt;# https://skillpm.dev/registry/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;npm&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/skillpm" rel="noopener noreferrer"&gt;npmjs.com/package/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;github.com/sbroenne/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📋 &lt;strong&gt;Agent Skills spec&lt;/strong&gt;: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We're actively looking for contributors! 🤝
&lt;/h2&gt;

&lt;p&gt;skillpm is a young project, and there's a lot of room to grow. Whether it's adding support for new agent directories, improving the CLI experience, or building out the registry, we'd love your help. Check out the &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and look for the &lt;code&gt;good first issue&lt;/code&gt; label!&lt;/p&gt;

&lt;p&gt;We'd love your feedback. Open an issue, try publishing a skill, or just tell us what you think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kind of skills are you building for your AI agents? Is this useful? What are we missing (e.g. custom agents, / prompts)? Let me know in the comments!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;skillpm is MIT licensed and open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>showdev</category>
      <category>agentskills</category>
      <category>skillsengineering</category>
    </item>
    <item>
      <title>Are you human? Or are you malware?</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Thu, 19 Feb 2026 09:18:24 +0000</pubDate>
      <link>https://dev.to/sbroenne/are-you-human-or-are-you-malware-a1k</link>
      <guid>https://dev.to/sbroenne/are-you-human-or-are-you-malware-a1k</guid>
      <description>&lt;p&gt;Someone opened a GitHub issue on my Excel MCP Server project questioning if I’m actually human. The reasons behind this assumption included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High commit velocity&lt;/li&gt;
&lt;li&gt;An AI-generated demo video&lt;/li&gt;
&lt;li&gt;Consistent structure and documentation&lt;/li&gt;
&lt;li&gt;The belief that my work might be AI-generated or even malware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Great analysis, great question!&lt;/p&gt;

&lt;p&gt;My response was straightforward: I’m human — I just use AI tools very deliberately. GitHub Copilot helps me build faster, and tools like HeyGen enhance my communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This interaction highlighted an important realization: the line between “human work” and “AI-assisted work” is already blurry — and that’s okay.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open source is evolving, developer workflows are changing, and trust models will need to adapt as well. These are interesting times to be building software in public. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/mcp-server-excel/issues/479" rel="noopener noreferrer"&gt;https://github.com/sbroenne/mcp-server-excel/issues/479&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>githubcopilot</category>
      <category>devjournal</category>
      <category>agentic</category>
    </item>
    <item>
      <title>pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Fri, 13 Feb 2026 03:58:55 +0000</pubDate>
      <link>https://dev.to/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</link>
      <guid>https://dev.to/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</guid>
      <description>&lt;h2&gt;
  
  
  I Learned This the Hard Way
&lt;/h2&gt;

&lt;p&gt;I built two MCP servers — &lt;a href="https://github.com/sbroenne/excel-mcp-server" rel="noopener noreferrer"&gt;Excel MCP Server&lt;/a&gt; and &lt;a href="https://github.com/sbroenne/windows-mcp-server" rel="noopener noreferrer"&gt;Windows MCP Server&lt;/a&gt;. Both had solid test suites. Both broke the moment a real LLM tried to use them.&lt;/p&gt;

&lt;p&gt;I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken and I spent weeks on a wild goose chase before realizing the whole approach needed rethinking.&lt;/p&gt;

&lt;p&gt;The failure modes were always the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM picks the wrong tool out of 15 similar-sounding options&lt;/li&gt;
&lt;li&gt;It passes &lt;code&gt;{"account_id": "checking"}&lt;/code&gt; when the parameter is &lt;code&gt;account&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It ignores the system prompt entirely&lt;/li&gt;
&lt;li&gt;It asks the user "Would you like me to do that?" instead of just doing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Because I tested the code, not the AI interface.&lt;/p&gt;

&lt;p&gt;For LLMs, your API isn't functions and types — it's &lt;strong&gt;tool descriptions, parameter schemas, and system prompts&lt;/strong&gt;. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?&lt;/p&gt;

&lt;p&gt;(No. They don't.)&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;pytest-aitest&lt;/a&gt;, heavily inspired by &lt;a href="https://github.com/mykhaliev/agent-benchmark" rel="noopener noreferrer"&gt;agent-benchmark&lt;/a&gt; by Dmytro Mykhaliev. &lt;/p&gt;

&lt;p&gt;It's a pytest plugin — &lt;code&gt;uv add pytest-aitest&lt;/code&gt; and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write Tests as Prompts
&lt;/h2&gt;

&lt;p&gt;Your test &lt;em&gt;is&lt;/em&gt; a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pytest_aitest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MCPServer&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;MCPServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_banking_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_balance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is &lt;strong&gt;TDD for AI interfaces&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Red/Green/Refactor Cycle — For Tool Descriptions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🔴 Red: Write a failing test
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Move $200 from checking to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.&lt;/p&gt;
&lt;h3&gt;
  
  
  🟢 Green: Fix the interface
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — too vague
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# After — the LLM knows exactly what to do
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run again. Test passes.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔄 Refactor: Let AI analysis tell you what else to fix
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you &lt;em&gt;why&lt;/em&gt; it happened and &lt;em&gt;what to improve&lt;/em&gt;. Traditional testing requires a human to interpret failures. Here, the AI does it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" alt="Screenshot of pytest-aitest report showing deploy recommendation for gpt-5-mini, pass rate comparison across models, cost metrics, and AI-generated failure analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. &lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;See a full sample report →&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Compare Models, Prompts, and Server Versions
&lt;/h2&gt;

&lt;p&gt;The real power is comparison. Test multiple configurations against the same test suite:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Be concise.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detailed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain your reasoning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;AGENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;banking_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROMPTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AGENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;4 configurations. Same tests. The report generates an &lt;strong&gt;Agent Leaderboard&lt;/strong&gt; — winner by pass rate, then cost as tiebreaker:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;747&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-4.1-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;560&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-detailed&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;1,203&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deploy: gpt-5-mini&lt;/strong&gt; (brief prompt) — 100% pass rate at lowest cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Multi-Turn Sessions
&lt;/h2&gt;

&lt;p&gt;Real users don't ask one question. They have conversations:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;banking-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestBankingConversation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_check_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers we were talking about checking
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transfer $200 to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers the transfer
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are my new balances?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tests share conversation history. The report shows the full session flow with sequence diagrams.&lt;/p&gt;
&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP server authors&lt;/strong&gt; — Validate that LLMs can actually use your tools, not just that the code works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent builders&lt;/strong&gt; — Find the cheapest model + prompt combo that passes your test suite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams shipping AI products&lt;/strong&gt; — Gate deployments on LLM-facing regression tests in CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works with &lt;a href="https://docs.litellm.ai/docs/providers" rel="noopener noreferrer"&gt;100+ LLM providers&lt;/a&gt; via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;The test is a prompt. The LLM is the test harness. The report tells you what to fix.&lt;/p&gt;

&lt;p&gt;Traditional testing validates that your code works. pytest-aitest validates that &lt;strong&gt;an LLM can understand and use your code&lt;/strong&gt;. These are different things, and the gap between them is where your production bugs live.&lt;/p&gt;

&lt;p&gt;Your tool descriptions are an API. Test them like one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;pytest-aitest is open source. Contributions welcome!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/pytest-aitest" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star pytest-aitest on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; — Full guides and API reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; — &lt;code&gt;uv add pytest-aitest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;Sample Report&lt;/a&gt; — See AI analysis in action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sbroenne" rel="noopener noreferrer"&gt;
        sbroenne
      &lt;/a&gt; / &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;
        pytest-aitest
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;blockquote&gt;
&lt;p&gt;🗄️ &lt;strong&gt;This project is archived and no longer maintained.&lt;/strong&gt; It has been replaced by &lt;a href="https://github.com/sbroenne/pytest-skill-engineering" rel="noopener noreferrer"&gt;pytest-skill-engineering&lt;/a&gt;. Do not use this project for new work. This repository is kept as a read-only archive.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;pytest-aitest&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8ddb5f9cd42c915ff6ecddb74e709fd948465b78b0c2b25086c7b425187c6645/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f7079746573742d616974657374" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f4cc6b256242d23285dafa42f03f6a3465c0458ea4ae72515d1bbe6b26537ab9/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f7079746573742d616974657374" alt="Python versions"&gt;&lt;/a&gt;
&lt;a href="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skill Engineering. Test-driven. AI-analyzed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and custom agents with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Modern AI systems are built on &lt;strong&gt;skill engineering&lt;/strong&gt; — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "agent that actually does things."&lt;/p&gt;
&lt;p&gt;An MCP server is the runtime for those skills. It doesn't ship alone — it comes bundled with the &lt;strong&gt;full skill engineering stack&lt;/strong&gt;: &lt;strong&gt;tools&lt;/strong&gt; (callable functions), &lt;strong&gt;prompt templates&lt;/strong&gt; (server-side reasoning starters), &lt;strong&gt;agent skills&lt;/strong&gt; (domain knowledge…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>python</category>
      <category>mcp</category>
      <category>testing</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
