<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tebogo Tseka</title>
    <description>The latest articles on DEV Community by Tebogo Tseka (@tsekatm).</description>
    <link>https://dev.to/tsekatm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2121040%2F1ac9872f-0ede-4ea9-8957-35622a424f77.jpeg</url>
      <title>DEV Community: Tebogo Tseka</title>
      <link>https://dev.to/tsekatm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tsekatm"/>
    <language>en</language>
    <item>
      <title>How I Run Over 20 AI Agents Locally and Deploy One to Production at a Time</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Mon, 13 Apr 2026 15:12:56 +0000</pubDate>
      <link>https://dev.to/tsekatm/how-i-run-over-20-ai-agents-locally-and-deploy-one-to-production-at-a-time-32cc</link>
      <guid>https://dev.to/tsekatm/how-i-run-over-20-ai-agents-locally-and-deploy-one-to-production-at-a-time-32cc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://tebogosacloud.blog/blog/local-first-agentic-development" rel="noopener noreferrer"&gt;tebogosacloud.blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have over 20 AI agents. Only one is in production.&lt;/p&gt;

&lt;p&gt;That is not a constraint. It is a strategy.&lt;/p&gt;

&lt;p&gt;A system with one excellent production agent and a library of production-ready agents waiting locally is more mature than a system with ten mediocre agents all simultaneously causing incidents. I believe this. I have built for it. This article explains how.&lt;/p&gt;

&lt;p&gt;While most teams are racing to deploy fleets of AI agents and discovering — usually painfully — that managing agents in production is far heavier than anyone told them, I have been doing the opposite. Build locally. Validate thoroughly. Design every agent to be production-ready from day one. Promote to AWS Bedrock AgentCore only when a use case has earned it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With How Teams Ship Agents Today
&lt;/h2&gt;

&lt;p&gt;There is a habit in AI development borrowed from early web development: ship fast, stabilise later. It worked reasonably well for stateless APIs. It fails for agents.&lt;/p&gt;

&lt;p&gt;The operational overhead problem is the one no one talks about honestly. Each agent you lift to production is a runtime you now own. It needs monitoring, evaluation, cost governance, versioning, and a deployment pipeline. Each AgentCore runtime incurs ongoing costs — model invocation fees, Lambda execution time per tool call, API Gateway requests, DynamoDB reads for conversation state. One runtime versus twenty is a meaningful difference in your AWS bill before you have written a single line of business logic. That is not a fleet — that is a maintenance burden.&lt;/p&gt;

&lt;p&gt;The failure modes compound it. A poorly configured agent does not throw a 500 error. It returns a plausible-sounding answer that is wrong. It invokes the right tool with the wrong parameters. It loses context mid-conversation and starts hallucinating a state that no longer exists. None of this shows up in your standard CloudWatch dashboard. You find out from a user.&lt;/p&gt;

&lt;p&gt;I decided early that I was not going to pay that cost for use cases that had not been proven.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Architecture: Local-First Agentic Development
&lt;/h2&gt;

&lt;p&gt;My local environment is built around Claude Code and a system of over 20 agent personas, each defined as a structured Markdown file with a clear identity, a set of skills, and integration points into my SDLC.&lt;/p&gt;

&lt;p&gt;They cover the full SDLC — from architecture and security to testing, defect management, data engineering, and content. A few examples to make it concrete:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Defect Manager&lt;/strong&gt; accepts a reported bug, writes a reproduction test, implements the fix, deploys to DEV, and closes the loop in ClickUp — without a human touching the keyboard between report and verification.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;SDET Engineer&lt;/strong&gt; designs test cases using boundary value analysis, equivalence partitioning, and pairwise techniques, then executes them against an API proxy — never the live service directly.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Cloud Security Specialist&lt;/strong&gt; runs STRIDE-based threat models and generates Terraform-ready IAM policies scoped to least privilege for the specific service under review.&lt;/p&gt;

&lt;p&gt;Each one is a specialist. None of them overlap. None are deployed unless a production use case demands it.&lt;/p&gt;

&lt;p&gt;Each agent persona is structured around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: What this agent is responsible for, what it knows, how it behaves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt;: Reusable knowledge modules the agent can apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Callable actions the agent can invoke (APIs, MCP tools, Lambda functions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDLC stage&lt;/strong&gt;: Where in the development lifecycle this agent operates
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Local Development (Claude Code)
│
├── Agentic Architect (Orchestrator)
│   ├── HLD Architect          ──► Skills / Tools
│   ├── Cloud Security         ──► Skills / Tools
│   ├── SDET Engineer          ──► Skills / Tools
│   ├── Defect Manager         ──► Skills / Tools
│   ├── GenAI Engineer         ──► Skills / Tools
│   └── ... more agents        ──► Skills / Tools
│
│   Any agent, when lift criteria met:
│   ──► MCP Facade + OOP ABCs + Terraform + S3 Sync
│       ──► Production (AWS Bedrock AgentCore)
│           ├── AgentCore Runtime
│           ├── Lambda (Skills as Tools)
│           ├── API Gateway
│           └── DynamoDB (State)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
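
&lt;p&gt;As a minimal sketch (the class, field names, and example values here are illustrative, not my actual tooling), one of these persona definitions maps onto a small data structure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field


@dataclass
class AgentPersona:
    """One agent persona: identity, skills, tools, SDLC stage."""
    name: str
    identity: str                 # what the agent is responsible for
    skills: list = field(default_factory=list)  # reusable knowledge modules
    tools: list = field(default_factory=list)   # callable actions
    sdlc_stage: str = "build"     # where in the lifecycle it operates


defect_manager = AgentPersona(
    name="Defect Manager",
    identity="Owns the bug lifecycle from report to verified fix.",
    skills=["defect_lifecycle_management", "test_reproduction"],
    tools=["clickup_api", "git", "deploy_dev"],
    sdlc_stage="maintain",
)

print(defect_manager.name, defect_manager.sdlc_stage)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Markdown persona file remains the source of truth; a structure like this is simply what it parses into when an orchestrator needs to route work.&lt;/p&gt;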



&lt;p&gt;The local environment gives me something production cannot: speed without consequence. I can iterate on a prompt, reshape a skill, change a tool's behaviour, and retest — all without a deployment pipeline, without trawling CloudWatch logs, without touching live infrastructure. The feedback loop is minutes, not hours.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Liftability Pattern
&lt;/h2&gt;

&lt;p&gt;The most important design decision I made early: every agent I build locally must be liftable to production without rework.&lt;/p&gt;

&lt;p&gt;Liftability is not a deployment script. It is a design discipline applied from day one. An agent is liftable when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The MCP Facade is in place&lt;/strong&gt;&lt;br&gt;
Skills and tools are exposed as MCP (Model Context Protocol — an open standard for tool interoperability across LLM runtimes) endpoints. This interface works identically whether the agent is running locally in Claude Code or as a runtime in Bedrock AgentCore. The agent does not know or care where it is running. That is by design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The implementation follows OOP ABCs&lt;/strong&gt;&lt;br&gt;
Each skill is implemented as a Python class inheriting from a base abstract class. This enforces a consistent interface, makes skills independently testable, and means they slot into AgentCore's tool registration without modification.&lt;/p&gt;
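
&lt;p&gt;A minimal sketch of that contract (the concrete skill shown is a simplified stand-in for a real one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from abc import ABC, abstractmethod


class Skill(ABC):
    """The contract every skill implements, locally and in AgentCore."""

    name: str = "unnamed"

    @abstractmethod
    def run(self, payload: dict) -&gt; dict:
        """Execute the skill and return a JSON-serialisable result."""


class IamLeastPrivilege(Skill):
    name = "iam_least_privilege"

    def run(self, payload: dict) -&gt; dict:
        service = payload.get("service", "unknown")
        return {"skill": self.name, "policy_for": service}


print(IamLeastPrivilege().run({"service": "dynamodb"}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In production, each concrete class becomes the handler inside one Lambda function; locally, the same class is imported directly by the test suite.&lt;/p&gt;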

&lt;p&gt;&lt;strong&gt;3. Infrastructure is Terraform-first&lt;/strong&gt;&lt;br&gt;
Every agent that will eventually be lifted has its Terraform written alongside the code — Lambda function definitions, IAM roles scoped to least privilege, API Gateway routes, DynamoDB tables for state. When lift day comes, &lt;code&gt;terraform apply&lt;/code&gt; is the deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Artifacts live in S3&lt;/strong&gt;&lt;br&gt;
Agent definitions, skill configurations, and prompt templates are stored in S3 — not hardcoded. In production, AgentCore reads from the same S3 paths. Promotion is a bucket sync, not a rewrite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The agent has been tested end-to-end locally&lt;/strong&gt;&lt;br&gt;
This is the gate. Before an agent is considered liftable, it has unit tests for each skill, integration tests through an API proxy (not direct service calls), and a set of golden test cases that validate its end-to-end behaviour on representative inputs.&lt;/p&gt;

&lt;p&gt;The lift checklist is not a formality. It is the reason the promotion is low risk when it happens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f7ycb21lrwo750e1f7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f7ycb21lrwo750e1f7g.png" alt="The Liftability Gate — five criteria an agent must meet before production promotion" width="640" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Skills and Tools: The Real Unit of Capability
&lt;/h2&gt;

&lt;p&gt;Here is the insight that took me longest to articulate clearly: in a production agentic system, the agent is not the unit of capability. The skill is.&lt;/p&gt;

&lt;p&gt;An agent is an orchestrator. It decides which skill to apply, in what order, with what inputs. The intelligence of the system lives in how skills are designed, how they compose, and how reliably they execute — not in the agent's decision loop itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills are tested in isolation.&lt;/strong&gt; Each skill has its own test suite. I can run &lt;code&gt;pytest skills/defect_lifecycle_management/tests/ -v&lt;/code&gt; without spinning up an agent. The skill either works or it does not. This is the only way to know before production.&lt;/p&gt;
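
&lt;p&gt;For illustration, here is the shape of an isolated skill test, with a toy skill standing in for a real module — no agent, no network, no runtime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_severity(defect: dict) -&gt; str:
    """Toy skill: map a reported defect to a severity bucket."""
    if defect.get("data_loss"):
        return "critical"
    if defect.get("users_affected", 0) &gt; 100:
        return "high"
    return "normal"


def test_data_loss_is_always_critical():
    assert classify_severity({"data_loss": True, "users_affected": 2}) == "critical"


def test_wide_impact_is_high():
    assert classify_severity({"users_affected": 500}) == "high"


def test_default_is_normal():
    assert classify_severity({}) == "normal"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;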

&lt;p&gt;&lt;strong&gt;Skills are reusable across agents.&lt;/strong&gt; My Cloud Security agent and my Peer Review agent both use the same &lt;code&gt;IAM_Least_Privilege&lt;/code&gt; skill. Written once, tested once, composed freely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills define the production surface area.&lt;/strong&gt; When I lift an agent to Bedrock AgentCore, what I am deploying is a set of Lambda functions — one per skill — registered as tools. The AgentCore runtime is thin. The Lambda functions are where the real work happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New capability means a new skill, not a new agent.&lt;/strong&gt; When I need the production agent to do something new, I write a skill, test it locally, and register it as a new tool in AgentCore. The operational surface stays flat even as capability grows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadu8fx6fnl4t6wjegmka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadu8fx6fnl4t6wjegmka.png" alt="Skills as the unit of capability — one skill shared across agents, deployed as Lambda in production" width="700" height="480"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Production Decision: When Does an Agent Get Lifted?
&lt;/h2&gt;

&lt;p&gt;Not every agent earns a production deployment. This is intentional.&lt;/p&gt;

&lt;p&gt;The criteria I apply before lifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is the use case proven?&lt;/strong&gt; Has the agent demonstrated it can handle real inputs, not just the happy path?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the business need clear?&lt;/strong&gt; Is there a user, system, or workflow that requires this agent callable as a REST API?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are the skills stable?&lt;/strong&gt; Have the underlying skills been through enough local iteration that core behaviour is settled?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the infrastructure written?&lt;/strong&gt; Is the Terraform ready? IAM policies scoped? Monitoring configured?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When all four are true, the lift is a formality. The deployment is &lt;code&gt;terraform apply&lt;/code&gt; plus a GitHub Actions workflow already parameterised for the target environment.&lt;/p&gt;

&lt;p&gt;My current production agent — the Site Builder agent on Bedrock AgentCore — went through exactly this process. It ran locally for weeks. Skills were tested in isolation and end-to-end. Terraform was written alongside the code. When I lifted it, there were no surprises.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Gives You That Shipping Fast Does Not
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A growing library of production-ready agents.&lt;/strong&gt; At any point, agents at various stages of local maturity are queued for production — tested, Terraform-ready, waiting for the business case to pull them through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-risk promotions.&lt;/strong&gt; When I lift an agent, I already know it works: it has been tested locally on real inputs, its skills have been tested in isolation, and it has run end-to-end through an API proxy before touching AgentCore. The promotion is a confirmation, not an experiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost control where it matters.&lt;/strong&gt; One AgentCore runtime with a growing skill set means one set of operational overhead — one monitoring configuration, one deployment pipeline, one cost centre to govern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster local iteration.&lt;/strong&gt; Because I am not trying to do everything in production, the local environment is unconstrained. A new agent persona can be tried in an afternoon. Skills can be composed in ways not tried before.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Counterintuitive Takeaway
&lt;/h2&gt;

&lt;p&gt;The industry benchmark for agentic maturity right now is fleet size — how many agents deployed, how many tools registered, how many concurrent sessions the platform can handle.&lt;/p&gt;

&lt;p&gt;I think this is the wrong metric.&lt;/p&gt;

&lt;p&gt;The right metrics are how reliably each production agent performs on the use cases it owns, and how quickly a locally proven agent can be promoted when the business needs it.&lt;/p&gt;

&lt;p&gt;Local-first agentic development is not a workaround for teams that cannot afford AgentCore at scale. It is a discipline. Build thoroughly. Test locally. Design for liftability from day one. Promote when the use case earns it.&lt;/p&gt;

&lt;p&gt;The agents are ready. Production should always be the easy part.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Managing agents in production is operationally heavier than managing skills and tools — be deliberate about what you lift&lt;/li&gt;
&lt;li&gt;Skills are the real unit of capability — design, test, and deploy at the skill level&lt;/li&gt;
&lt;li&gt;Liftability is a design property, not a deployment script: MCP facade, OOP ABCs, Terraform-first, S3 artifacts, end-to-end tests&lt;/li&gt;
&lt;li&gt;Local-first development absorbs iteration cost so production does not have to&lt;/li&gt;
&lt;li&gt;The lift criteria (proven use case, stable skills, written infrastructure, clear business need) make every promotion low risk&lt;/li&gt;
&lt;li&gt;Fleet size is a vanity metric — reliability per agent and time-to-lift are what matter&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html" rel="noopener noreferrer"&gt;AWS Bedrock AgentCore Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol (MCP) Specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025" rel="noopener noreferrer"&gt;What 1,200 Production Deployments Reveal About LLMOps in 2025 — ZenML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.zenml.io/blog/mlops-vs-llmops" rel="noopener noreferrer"&gt;MLOps vs LLMOps: What's Different — ZenML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html" rel="noopener noreferrer"&gt;AWS Lambda Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Operational Excellence Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/terraform-aws-provider-best-practices/introduction.html" rel="noopener noreferrer"&gt;Terraform AWS Provider Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.peopleinai.com/blog/the-job-market-for-mlops-engineers-in-2025" rel="noopener noreferrer"&gt;MLOps Engineers 2025: Skills, Salaries and Growth — People in AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>The Missing Test Suite: Why AI Projects Fail Before Production</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:39:57 +0000</pubDate>
      <link>https://dev.to/tsekatm/the-missing-test-suite-why-ai-projects-fail-before-production-5648</link>
      <guid>https://dev.to/tsekatm/the-missing-test-suite-why-ai-projects-fail-before-production-5648</guid>
      <description>&lt;p&gt;&lt;em&gt;Most AI projects never ship. The gap isn't the model — it's the lack of testability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Gartner predicted that through 2022, 85% of AI projects would deliver erroneous outcomes due to bias in data, algorithms, or the teams managing them [1]. VentureBeat reported that 87% of data science projects never make it into production [2]. McKinsey's 2023 State of AI report confirmed that while generative AI adoption is accelerating, most organisations still struggle to move beyond experimentation [3].&lt;/p&gt;

&lt;p&gt;Teams build impressive demos, stakeholders nod approvingly, and then the project quietly stalls somewhere between "it works on my laptop" and "it's running in production."&lt;/p&gt;

&lt;p&gt;The usual suspects get blamed: data quality, model performance, organisational readiness. But there is a more fundamental problem hiding in plain sight — most teams have no idea how to test AI systems with the same rigour they apply to traditional software. Google's seminal paper on hidden technical debt in machine learning systems identified testing gaps as a primary source of production failures, noting that ML systems have a special capacity for incurring technical debt because they have all the maintenance problems of traditional code plus an additional set of ML-specific issues [4].&lt;/p&gt;

&lt;p&gt;They test the code. They don't test the intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Systems, Two Test Suites
&lt;/h2&gt;

&lt;p&gt;A production AI system is not one system. It is two systems woven together: deterministic software (APIs, data pipelines, orchestration logic) and non-deterministic AI behaviour (prompt responses, agent decisions, model outputs).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo5grjn5f7oo3z085i61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo5grjn5f7oo3z085i61.png" alt="Two Systems Two Test Suites" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most engineering teams are excellent at testing the first. They write unit tests, integration tests, and end-to-end tests. They practice TDD. They run CI pipelines that block merges on test failures. This is mature, well-understood discipline.&lt;/p&gt;

&lt;p&gt;But the AI layer — the prompts, the agent behaviour, the model responses — gets treated as a black box. Teams eyeball a few outputs, declare it "good enough," and move on. There is no test suite. There is no regression safety net. There is no way to know if a prompt change that improved one scenario just broke twelve others.&lt;/p&gt;

&lt;p&gt;Google's ML Test Score rubric [5] proposes a structured assessment of ML production readiness across data tests, model tests, infrastructure tests, and monitoring — yet most teams score poorly on all four dimensions. Microsoft Research's study of software engineering for machine learning found that even within large technology companies, testing practices for ML systems remain significantly less mature than those for traditional software [6].&lt;/p&gt;

&lt;p&gt;This is the missing test suite. And it is the single biggest reason AI projects fail to reach production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Test Cases as First-Class Citizens
&lt;/h2&gt;

&lt;p&gt;If you would not ship a function without a unit test, you should not ship a prompt without a prompt test case.&lt;/p&gt;

&lt;p&gt;A prompt test case is structurally similar to a traditional test: given an input, assert something about the output. The difference is that the assertion must account for non-determinism. You are not checking for exact string equality. You are evaluating whether the output meets defined criteria — relevance, completeness, format compliance, safety, and factual accuracy.&lt;/p&gt;

&lt;p&gt;Ribeiro et al.'s CheckList framework [7] — which won Best Paper at ACL 2020 — demonstrated that traditional software testing methodologies can be directly applied to NLP models. CheckList introduces three test types that map cleanly to prompt testing: Minimum Functionality Tests (happy path), Invariance Tests (the model should produce equivalent outputs for equivalent inputs), and Directional Expectation Tests (changing the input in a specific way should change the output in a predictable direction).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvy8dlvk1acrg3vtubkft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvy8dlvk1acrg3vtubkft.png" alt="Prompt Test Case Structure" width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Happy Path
&lt;/h3&gt;

&lt;p&gt;Happy path prompt tests verify that the AI produces the expected output when given a well-formed, unambiguous input. These are your baseline. If these fail, nothing else matters.&lt;/p&gt;

&lt;p&gt;Examples of happy path assertions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given a clear instruction, the agent produces a response that addresses all specified requirements&lt;/li&gt;
&lt;li&gt;Given structured input data, the agent formats its output according to the defined schema&lt;/li&gt;
&lt;li&gt;Given a multi-step task, the agent completes each step in the correct sequence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy path tests seem obvious, but most teams skip them. They assume that because the prompt "worked when they tried it," it will always work. It will not. Model updates, context changes, and subtle input variations all introduce drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Negative Scenarios
&lt;/h3&gt;

&lt;p&gt;Negative prompt tests verify that the AI fails gracefully when given problematic input. This is where most unshipped AI projects have their fatal flaw — they only ever tested the golden path.&lt;/p&gt;

&lt;p&gt;Perez et al. demonstrated that language models can be used to systematically red-team other language models, generating adversarial inputs that expose failure modes at scale [8]. The same principle applies to prompt testing — you can and should systematically probe for failures.&lt;/p&gt;

&lt;p&gt;Test for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contradictory instructions&lt;/strong&gt;: "Summarise this document in detail but keep it under 10 words." Does the agent flag the contradiction, or does it silently produce garbage?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-scope requests&lt;/strong&gt;: When asked to perform a task outside its defined capabilities, does the agent refuse clearly, or does it hallucinate an answer?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial input&lt;/strong&gt;: Prompt injection attempts, instructions disguised as data, requests to ignore system prompts. Does the agent hold its boundaries?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing context&lt;/strong&gt;: When critical information is absent from the input, does the agent ask for clarification, or does it fabricate what it doesn't know?&lt;/li&gt;
&lt;/ul&gt;
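
&lt;p&gt;A sketch of deterministic checks for two of these scenarios. The marker phrases are illustrative; a real suite would calibrate them against refusals observed from the actual model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;REFUSAL_MARKERS = ("outside my scope", "cannot help", "not able to", "can't help")


def refused(output: str) -&gt; bool:
    """Did the agent clearly decline rather than hallucinate an answer?"""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def asks_for_clarification(output: str) -&gt; bool:
    """Missing-context probe: the agent should ask, not fabricate."""
    lowered = output.lower()
    return ("clarify" in lowered) or ("?" in output)


# Out-of-scope requests must produce a clear refusal:
print(refused("That request is outside my scope; I handle defect triage only."))
# Missing context must produce a question:
print(asks_for_clarification("Which environment should I reproduce this in?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;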

&lt;p&gt;Negative scenarios reveal the failure modes that will surface in production, because real users do not read your documentation and do not provide clean inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Cases
&lt;/h3&gt;

&lt;p&gt;Edge case prompt tests probe the boundaries of agent behaviour. These are the scenarios that don't fit neatly into "it works" or "it's broken" — they live in the grey zone where AI systems are most unpredictable.&lt;/p&gt;

&lt;p&gt;Test for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window boundaries&lt;/strong&gt;: What happens when the input is near the maximum token limit? Does output quality degrade? Does critical information from early in the context get lost?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn drift&lt;/strong&gt;: Over a long conversation, does the agent maintain consistency with its earlier responses, or does it contradict itself?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous inputs&lt;/strong&gt;: When a request has multiple valid interpretations, does the agent pick one and commit, or does it hedge uselessly?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format edge cases&lt;/strong&gt;: Empty strings, single-character inputs, inputs in unexpected languages, inputs with special characters or code snippets embedded in natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination triggers&lt;/strong&gt;: Inputs that are factually adjacent to the agent's knowledge but require information it does not have. Does it admit uncertainty, or does it confabulate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edge case tests are expensive to design but cheap compared to a production incident where your AI agent confidently gives a user dangerously wrong information. The NIST AI Risk Management Framework explicitly identifies "the propensity for generative AI to produce confidently stated but incorrect outputs" as a key risk requiring systematic mitigation [9].&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing Prompt Test Permutations
&lt;/h2&gt;

&lt;p&gt;Systematic test design is not a new discipline. Software testing has mature techniques — codified in ISO/IEC 29119 [10] — for generating meaningful test cases without combinatorial explosion. Part 11 of this standard, published in 2020, specifically extends these techniques to AI-based systems [11]. The same approaches apply to prompt testing — they just need to be adapted for non-deterministic outputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlwlje1yc8r1cad3pfmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlwlje1yc8r1cad3pfmm.png" alt="Test Design Techniques Applied to Prompts" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Equivalence Partitioning for Prompts
&lt;/h3&gt;

&lt;p&gt;Divide your input space into classes that you expect the AI to handle similarly. Instead of testing every possible phrasing of a request, identify the equivalence classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short, direct instructions vs. long, detailed instructions&lt;/li&gt;
&lt;li&gt;Technical language vs. conversational language&lt;/li&gt;
&lt;li&gt;Single-task requests vs. compound multi-task requests&lt;/li&gt;
&lt;li&gt;Inputs with complete context vs. inputs with partial context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test one representative from each class. If the AI handles one member of the class correctly, it is likely to handle the others. Ribeiro et al. validated this approach empirically, showing that equivalence-class-based testing surfaces model failures far more efficiently than random sampling [7].&lt;/p&gt;
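
&lt;p&gt;A sketch of the approach, with hypothetical class names and prompts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One representative input per equivalence class.
EQUIVALENCE_CLASSES = {
    "short_direct": "Summarise this incident report.",
    "long_detailed": ("Summarise this incident report, covering root cause, "
                      "timeline, customer impact, and remediation steps."),
    "technical": "Produce an RCA summary per the SEV-2 postmortem template.",
    "conversational": "Hey, what's the short version of what went wrong?",
    "compound": "Summarise the incident, then draft the customer email.",
}


def run_partition_suite(call_model, passes):
    """Run one representative per class; return the classes that fail."""
    return [name for name, prompt in EQUIVALENCE_CLASSES.items()
            if not passes(call_model(prompt))]


failures = run_partition_suite(lambda p: f"summary of: {p}",
                               lambda out: out.startswith("summary"))
print("failing classes:", failures)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;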

&lt;h3&gt;
  
  
  Boundary Value Analysis for Prompts
&lt;/h3&gt;

&lt;p&gt;Identify the thresholds where agent behaviour changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input length at which output quality begins to degrade&lt;/li&gt;
&lt;li&gt;The number of instructions in a single prompt before the agent starts dropping tasks&lt;/li&gt;
&lt;li&gt;The level of ambiguity at which the agent switches from executing to asking for clarification&lt;/li&gt;
&lt;li&gt;The complexity threshold beyond which the agent starts making errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test inputs at, just below, and just above each boundary.&lt;/p&gt;
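
&lt;p&gt;A small helper makes the pattern concrete. The threshold of six instructions below is hypothetical — real boundaries are discovered empirically, not assumed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def boundary_points(threshold, step=1):
    """Values at, just below, and just above a behavioural boundary."""
    return (threshold - step, threshold, threshold + step)


# Suppose output quality degrades somewhere around six instructions
# per prompt; generate a test input at each side of that boundary.
for n in boundary_points(6):
    tasks = "; ".join(f"task {i}" for i in range(1, n + 1))
    print(f"{n} instructions: Do the following: {tasks}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;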

&lt;h3&gt;
  
  
  Decision Table Testing
&lt;/h3&gt;

&lt;p&gt;For agents with conditional behaviour — different responses based on user role, input type, or context state — build a decision table. Map every combination of conditions to the expected action. Then write a test case for each row.&lt;/p&gt;

&lt;p&gt;This is particularly critical for agents that make routing decisions, apply business rules, or enforce access controls. A missed condition in a decision table is a production bug waiting to happen.&lt;/p&gt;
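
&lt;p&gt;A decision table translates directly into a data structure plus one test per row. The roles and actions below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Every combination of conditions maps to one expected action.
DECISION_TABLE = {
    ("admin", "read"): "serve",
    ("admin", "write"): "serve",
    ("viewer", "read"): "serve",
    ("viewer", "write"): "refuse",
}


def decide(role, request_type):
    # Default-deny: an unmapped combination is refused, never guessed.
    return DECISION_TABLE.get((role, request_type), "refuse")


# One test case per row, plus the unmapped-combination default:
for (role, req), expected in DECISION_TABLE.items():
    assert decide(role, req) == expected
assert decide("anonymous", "write") == "refuse"
print("all rows covered")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;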

&lt;h2&gt;
  
  
  The Prompt Regression Problem
&lt;/h2&gt;

&lt;p&gt;Here is the scenario that kills AI projects in the transition from prototype to production:&lt;/p&gt;

&lt;p&gt;A developer changes a prompt to fix a reported issue. The fix works. The specific scenario that was broken now produces the correct output. The developer commits the change, satisfied.&lt;/p&gt;

&lt;p&gt;What the developer does not know is that the prompt change also altered the agent's behaviour on fourteen other scenarios — three of which are now producing incorrect outputs. Nobody finds out until users report problems. By then, confidence in the system is damaged and the project loses momentum.&lt;/p&gt;

&lt;p&gt;This is the prompt regression problem, and it is solved the same way code regression is solved: with an automated test suite that runs on every change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a Prompt Regression Harness
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9cbzbqwg1z7c2a9h6v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9cbzbqwg1z7c2a9h6v6.png" alt="Prompt Regression Harness" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A prompt regression harness consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A corpus of test cases&lt;/strong&gt;: Input-output pairs covering happy paths, negative scenarios, and edge cases. Start with 20–30 and grow it continuously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation criteria&lt;/strong&gt;: For each test case, define what "correct" means. This might be a rubric (scores 1–5 on relevance, accuracy, completeness), a set of required elements (must mention X, must not mention Y), or a format check (valid JSON, under 200 words).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated evaluation&lt;/strong&gt;: Use a combination of deterministic checks (format validation, keyword presence) and LLM-as-judge evaluation (a second model scoring the output against the rubric). Zheng et al.'s research on MT-Bench demonstrated that LLM-as-judge approaches can achieve high agreement with human evaluators when properly calibrated [12], though Shankar et al. caution that validator alignment with human preferences must itself be verified [13]. Neither approach alone is sufficient. Together, they provide reasonable coverage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI integration&lt;/strong&gt;: Run the harness on every prompt change, just as you run unit tests on every code change. Block merges that cause regression.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The harness does not need to be perfect. It needs to be better than nothing — which is what most teams have today. Frameworks such as Stanford's HELM [14] and open-source tools like OpenAI Evals [15] and DeepEval [16] provide starting points for building evaluation infrastructure.&lt;/p&gt;
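To make the components concrete, here is a minimal sketch of the deterministic half of such a harness. The corpus entries, check names, and the stubbed `run_agent` are illustrative — in a real harness, `run_agent` calls your model, and the LLM-as-judge layer would sit alongside these checks:

```python
import json

def run_agent(prompt: str, user_input: str) -> str:
    # Stub standing in for the real model call.
    return json.dumps({"answer": "Our refund window is 30 days.", "topic": "refunds"})

# Each test case pairs an input with its evaluation criteria:
# required elements, forbidden elements, and a format check.
CORPUS = [
    {
        "input": "How long do I have to return an item?",
        "must_contain": ["30 days"],
        "must_not_contain": ["no refunds"],
        "format": "json",
    },
]

def evaluate(prompt: str) -> list[str]:
    """Run every test case; return a list of failure messages (empty = pass)."""
    failures = []
    for case in CORPUS:
        output = run_agent(prompt, case["input"])
        if case.get("format") == "json":
            try:
                json.loads(output)
            except ValueError:
                failures.append(f"{case['input']}: output is not valid JSON")
                continue
        for kw in case["must_contain"]:
            if kw not in output:
                failures.append(f"{case['input']}: missing required element '{kw}'")
        for kw in case["must_not_contain"]:
            if kw in output:
                failures.append(f"{case['input']}: contains forbidden element '{kw}'")
    return failures
```

Even this skeleton catches the regressions that matter most: broken format, missing required content, and forbidden content sneaking back in.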

&lt;h2&gt;
  
  
  Strategy: From POC to Production
&lt;/h2&gt;

&lt;p&gt;Testing is the foundation, but shipping an AI system to production requires a broader strategy. Google's MLOps maturity model [17] describes three levels of automation — from manual ML pipelines (Level 0) to fully automated CI/CD/CT pipelines (Level 2). Most AI projects are stuck at Level 0. These are the practices that move you forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4e1wkp8llqrjyptdlyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4e1wkp8llqrjyptdlyk.png" alt="The 7 Practices POC to Production" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Define Testability From Day One
&lt;/h3&gt;

&lt;p&gt;Before writing a single prompt, define how you will test the AI's behaviour. If you cannot articulate what "correct" looks like for a given input, you are not ready to build. Testability is a design constraint, not an afterthought. The NIST AI RMF [9] frames this as "measuring" — one of four core functions alongside governing, mapping, and managing AI risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Version Your Prompts Like Code
&lt;/h3&gt;

&lt;p&gt;Prompts are code. Store them in version control. Tag releases. Write changelogs. If you cannot diff two versions of a prompt and understand what changed and why, you have lost control of your system. White et al.'s prompt pattern catalogue [18] demonstrates that prompts can be documented and structured with the same rigour as software design patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Build Evaluation Into the Pipeline
&lt;/h3&gt;

&lt;p&gt;Do not evaluate AI output manually and sporadically. Build evaluation into your CI/CD pipeline. Every pull request that touches a prompt should trigger the test harness. Results should be visible in the PR review, just like test results. Kreuzberger et al.'s systematic review of MLOps architectures [19] confirms that continuous evaluation is a defining characteristic of production-grade ML systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Instrument for Observability
&lt;/h3&gt;

&lt;p&gt;In production, you need to see what the AI is doing. Log inputs, outputs, latency, token usage, and evaluation scores. Build dashboards. Set alerts on quality degradation. You cannot improve what you cannot measure, and you cannot debug what you cannot observe. Klaise et al. detail practical approaches to monitoring ML models in production, including detecting data drift and concept drift before they degrade output quality [20].&lt;/p&gt;
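A per-request log record does not need to be elaborate to be useful. Here is a minimal sketch — the field names are illustrative, and in production the line would go to your log pipeline rather than stdout:

```python
import json
import time

def log_agent_call(agent: str, user_input: str, output: str,
                   latency_ms: float, tokens: int, eval_score: float) -> str:
    """Emit one structured log line per agent invocation."""
    record = {
        "ts": time.time(),
        "agent": agent,
        "input_chars": len(user_input),   # log sizes, not raw user content
        "output_chars": len(output),
        "latency_ms": latency_ms,
        "tokens": tokens,
        "eval_score": eval_score,         # feeds quality-degradation alerts
    }
    line = json.dumps(record)
    print(line)  # in production: ship to the observability stack instead
    return line
```

Because every field is structured, dashboards and alerts (latency p99, token cost per day, rolling average eval score) become queries over these records rather than one-off scripts.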

&lt;h3&gt;
  
  
  5. Implement Human-in-the-Loop Gates
&lt;/h3&gt;

&lt;p&gt;Not every AI decision should be autonomous from day one. Identify high-stakes decisions and route them through human review. As confidence grows and the test suite matures, progressively expand the automation boundary. This is not a concession — it is a deployment strategy. Mosqueira-Rey et al.'s comprehensive survey of human-in-the-loop machine learning [21] demonstrates that the most successful production AI systems are designed with human oversight as an integral component, not bolted on as an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Plan for Model Changes
&lt;/h3&gt;

&lt;p&gt;Models get updated. APIs change. Behaviour shifts. Your test suite is your safety net during model migrations. Teams that have one can upgrade models in an afternoon with confidence. Teams without one spend weeks manually validating and still miss regressions. The EU AI Act [22] now mandates ongoing testing and monitoring for high-risk AI systems — model migration without regression testing is not just risky engineering, it is increasingly a compliance liability.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Treat Prompt Engineering as Software Engineering
&lt;/h3&gt;

&lt;p&gt;The teams that ship AI to production are the teams that apply software engineering discipline to prompt development. They review prompts in pull requests. They write tests. They track regressions. They refactor. They don't treat prompts as magic incantations — they treat them as code that happens to be written in natural language. Reynolds and McDonell's early work on prompt programming [23] laid the conceptual foundation for this approach, framing prompt design as a form of programming rather than an art.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The AI industry has a completion problem, not a capability problem. The models are powerful enough. The tooling is mature enough. What is missing is the engineering discipline to make AI systems production-grade.&lt;/p&gt;

&lt;p&gt;If you would not ship code without tests, do not ship prompts without them. If you would not deploy a function without observability, do not deploy an agent without it. If you would not merge a code change without regression checks, do not merge a prompt change without them.&lt;/p&gt;

&lt;p&gt;The test suite your AI project is missing is the one that tests the AI itself. Build it, and you build the bridge from demo to production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Testing AI is not a new discipline — it is the old discipline of software testing, applied to a new kind of system. The teams that recognise this will ship. The rest will keep building impressive demos that never leave the lab.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Gartner, "Gartner Predicts: AI and the Future of Work," Gartner Research, 2019. Available: &lt;a href="https://www.gartner.com/en/newsroom/press-releases" rel="noopener noreferrer"&gt;https://www.gartner.com/en/newsroom/press-releases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] VentureBeat, "Why do 87% of data science projects never make it into production?" VentureBeat, July 2019. Available: &lt;a href="https://venturebeat.com/ai/why-do-87-of-data-science-projects-never-make-it-into-production/" rel="noopener noreferrer"&gt;https://venturebeat.com/ai/why-do-87-of-data-science-projects-never-make-it-into-production/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] McKinsey &amp;amp; Company, "The state of AI in 2023: Generative AI's breakout year," McKinsey Global Institute, 2023. Available: &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, et al., "Hidden Technical Debt in Machine Learning Systems," in &lt;em&gt;Advances in Neural Information Processing Systems (NeurIPS)&lt;/em&gt;, 2015. Available: &lt;a href="https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html" rel="noopener noreferrer"&gt;https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt," in &lt;em&gt;IEEE International Conference on Big Data&lt;/em&gt;, 2017. Available: &lt;a href="https://research.google/pubs/pub46555/" rel="noopener noreferrer"&gt;https://research.google/pubs/pub46555/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, "Software Engineering for Machine Learning: A Case Study," in &lt;em&gt;Proceedings of the 41st International Conference on Software Engineering (ICSE)&lt;/em&gt;, 2019. Available: &lt;a href="https://www.microsoft.com/en-us/research/publication/software-engineering-for-machine-learning-a-case-study/" rel="noopener noreferrer"&gt;https://www.microsoft.com/en-us/research/publication/software-engineering-for-machine-learning-a-case-study/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList," in &lt;em&gt;Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)&lt;/em&gt;, 2020. (Best Paper Award). Available: &lt;a href="https://arxiv.org/abs/2005.04118" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2005.04118&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] E. Perez, S. Ringer, K. Lukosiute, K. Nguyen, E. Chen, et al., "Red Teaming Language Models with Language Models," in &lt;em&gt;Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;, 2022. Available: &lt;a href="https://arxiv.org/abs/2202.03286" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2202.03286&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] National Institute of Standards and Technology, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, January 2023. Available: &lt;a href="https://www.nist.gov/itl/ai-risk-management-framework" rel="noopener noreferrer"&gt;https://www.nist.gov/itl/ai-risk-management-framework&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[10] International Organization for Standardization, "ISO/IEC 29119: Software and systems engineering — Software testing," ISO/IEC, 2013-2022.&lt;/p&gt;

&lt;p&gt;[11] International Organization for Standardization, "ISO/IEC TR 29119-11:2020: Software and systems engineering — Software testing — Part 11: Guidelines on the testing of AI-based systems," ISO/IEC, 2020. Available: &lt;a href="https://www.iso.org/standard/79016.html" rel="noopener noreferrer"&gt;https://www.iso.org/standard/79016.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[12] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in &lt;em&gt;Advances in Neural Information Processing Systems (NeurIPS)&lt;/em&gt;, 2023. Available: &lt;a href="https://arxiv.org/abs/2306.05685" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2306.05685&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[13] S. Shankar, J. D. Zamfirescu-Pereira, B. Hartmann, A. Parameswaran, and I. Arawjo, "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences," 2024. Available: &lt;a href="https://arxiv.org/abs/2404.12272" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2404.12272&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[14] P. Liang, R. Bommasani, T. Lee, et al., "Holistic Evaluation of Language Models (HELM)," &lt;em&gt;Transactions on Machine Learning Research&lt;/em&gt;, 2022. Available: &lt;a href="https://arxiv.org/abs/2211.09110" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2211.09110&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[15] OpenAI, "Evals: A framework for evaluating LLMs and LLM systems," GitHub, 2023. Available: &lt;a href="https://github.com/openai/evals" rel="noopener noreferrer"&gt;https://github.com/openai/evals&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[16] Confident AI, "DeepEval: The open-source LLM evaluation framework," GitHub, 2023. Available: &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[17] Google Cloud, "MLOps: Continuous delivery and automation pipelines in machine learning," Google Cloud Architecture Center, 2023. Available: &lt;a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning" rel="noopener noreferrer"&gt;https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[18] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT," arXiv:2302.11382, 2023. Available: &lt;a href="https://arxiv.org/abs/2302.11382" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2302.11382&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[19] D. Kreuzberger, N. Kuhl, and S. Hirschl, "Machine Learning Operations (MLOps): Overview, Definition, and Architecture," &lt;em&gt;IEEE Access&lt;/em&gt;, vol. 11, 2023. Available: &lt;a href="https://ieeexplore.ieee.org/document/10081336" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/10081336&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[20] J. Klaise, A. Van Looveren, G. Vacanti, and A. Coca, "Monitoring Machine Learning Models in Production," arXiv:2007.06299, 2021. Available: &lt;a href="https://arxiv.org/abs/2007.06299" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2007.06299&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[21] E. Mosqueira-Rey, E. Hernandez-Pereira, D. Alonso-Rios, J. Bobes-Bascaran, and A. Fernandez-Leal, "Human-in-the-loop machine learning: a state of the art," &lt;em&gt;Artificial Intelligence Review&lt;/em&gt;, Springer, 2023. Available: &lt;a href="https://link.springer.com/article/10.1007/s10462-022-10246-w" rel="noopener noreferrer"&gt;https://link.springer.com/article/10.1007/s10462-022-10246-w&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[22] European Parliament, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," &lt;em&gt;Official Journal of the European Union&lt;/em&gt;, 2024. Available: &lt;a href="https://eur-lex.europa.eu/eli/reg/2024/1689" rel="noopener noreferrer"&gt;https://eur-lex.europa.eu/eli/reg/2024/1689&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[23] L. Reynolds and K. McDonell, "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm," in &lt;em&gt;CHI 2021 Extended Abstracts&lt;/em&gt;, 2021. Available: &lt;a href="https://arxiv.org/abs/2102.07350" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2102.07350&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Building an LLM Judge That Doesn't Lie to You</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Tue, 31 Mar 2026 14:03:21 +0000</pubDate>
      <link>https://dev.to/tsekatm/building-an-llm-judge-that-doesnt-lie-to-you-47d1</link>
      <guid>https://dev.to/tsekatm/building-an-llm-judge-that-doesnt-lie-to-you-47d1</guid>
      <description>&lt;p&gt;Our first LLM judge gave a 9/10 to a page where the hero text was completely invisible.&lt;/p&gt;

&lt;p&gt;Dark grey text on a dark background image. The CSS was syntactically valid. The HTML was well-structured. Every tag was correct. The page was unusable. And our judge — Claude Opus, one of the most capable models available — scored it nearly perfect.&lt;/p&gt;

&lt;p&gt;That was the moment I realised LLM-as-judge doesn't work out of the box. It requires engineering. This article explains what we built to make it trustworthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Inflation Problem
&lt;/h2&gt;

&lt;p&gt;The first implementation was simple: send the generated code to Claude Opus, ask it to rate 0–10. The results looked great. Average scores of 8–9/10 across the board. We nearly shipped those numbers.&lt;/p&gt;

&lt;p&gt;Then we opened the generated sites in a browser.&lt;/p&gt;

&lt;p&gt;Pages with broken images — where the model had written &lt;code&gt;&amp;lt;img src="a serene mountain landscape with morning fog"&amp;gt;&lt;/code&gt; instead of a URL — scored 8/10. Pages with empty sections — where entire content blocks were missing — scored 7/10. Pages where navigation rendered as a bulleted list because &lt;code&gt;list-style: none&lt;/code&gt; was missing from the CSS — scored 8.5/10.&lt;/p&gt;

&lt;p&gt;The judge was systematically generous. Not because it was broken, but because of how LLMs process code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Judges Inflate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Positivity bias from RLHF training.&lt;/strong&gt; Language models are trained to be helpful, which creates a default toward positive assessment. When asked to evaluate code, the model focuses on what's present rather than what's wrong. Fifteen correct CSS properties and one devastating contrast failure? The judge sees the fifteen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-level evaluation misses visual defects.&lt;/strong&gt; Syntactically valid CSS can produce invisible text. &lt;code&gt;color: #333&lt;/code&gt; on &lt;code&gt;background-image: url(dark-photo.jpg)&lt;/code&gt; is perfectly valid CSS and completely unreadable content. A judge that reads code without "seeing" the rendered result can't catch this category of defect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vague rubrics invite generous interpretation.&lt;/strong&gt; "Rate the quality of this HTML/CSS from 0–10" gives the judge too much latitude. What does 7 mean? What separates a 6 from an 8? Without concrete criteria, the judge fills in the gaps with optimistic interpretation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No calibration anchors.&lt;/strong&gt; The judge has no reference for what a 5/10 page looks like versus a 9/10 page. Without anchors, scores cluster at the top of the range because the model has no incentive to be harsh.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: Structural Guardrails
&lt;/h2&gt;

&lt;p&gt;The first mitigation was the &lt;code&gt;HTMLVisualChecker&lt;/code&gt; — an automated pre-judge validator that catches defects the LLM judge consistently misses.&lt;/p&gt;

&lt;p&gt;It runs six checks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken images.&lt;/strong&gt; Scans every &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag's &lt;code&gt;src&lt;/code&gt; attribute. If the src is longer than 30 characters, contains spaces, and doesn't start with &lt;code&gt;http&lt;/code&gt;, it's flagged — the model wrote a description instead of a URL. Also checks CSS &lt;code&gt;background-image&lt;/code&gt; declarations for the same pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Catches: &amp;lt;img src="a modern office with glass facade"&amp;gt;
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Violation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIS-BROKEN-IMAGE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CRITICAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;deduction&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image src contains text instead of URL: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Empty sections.&lt;/strong&gt; Finds &lt;code&gt;&amp;lt;section&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; elements with IDs or classes that contain no visible text content. An empty hero section means the page loads with a blank area where the headline should be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dark text on dark backgrounds.&lt;/strong&gt; Extracts CSS variables from &lt;code&gt;:root&lt;/code&gt;, identifies the text colour, checks whether background images are present, and flags when dark text is used without a light alternative. This is the check that caught the 9/10 invisible text page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken navigation.&lt;/strong&gt; Detects when a &lt;code&gt;&amp;lt;nav&amp;gt;&lt;/code&gt; element contains &lt;code&gt;&amp;lt;ul&amp;gt;/&amp;lt;li&amp;gt;&lt;/code&gt; markup but the CSS doesn't include &lt;code&gt;list-style: none&lt;/code&gt; or flexbox layout — meaning the navigation renders as a bulleted list instead of a horizontal menu.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing interactivity.&lt;/strong&gt; Checks for the presence of JavaScript, mobile menu toggles, smooth scrolling, and hover states. A page with interactive HTML elements but no JavaScript to make them work is functionally broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local file paths.&lt;/strong&gt; Flags &lt;code&gt;src&lt;/code&gt; attributes pointing to filesystem paths (&lt;code&gt;/Users/...&lt;/code&gt;, &lt;code&gt;C:\...&lt;/code&gt;, relative paths without extensions) that won't work in a browser.&lt;/p&gt;
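To give a flavour of how these heuristics work, here is a simplified sketch of the dark-text-on-dark-background check. It is deliberately cruder than the real checker described above (which extracts CSS variables from `:root`) — true contrast checking requires rendering, so this is a static heuristic with illustrative thresholds:

```python
import re

def is_dark_hex(color: str) -> bool:
    """Treat a hex colour as dark if its average channel is below 128."""
    h = color.lstrip("#")
    if len(h) == 3:
        h = "".join(c * 2 for c in h)  # expand shorthand like #333
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return (r + g + b) / 3 < 128

def flags_dark_on_dark(css: str) -> bool:
    """Flag dark text combined with a background image and no light alternative."""
    colors = re.findall(r"color:\s*(#[0-9a-fA-F]{3,6})", css)
    has_bg_image = "background-image" in css
    dark_text = any(is_dark_hex(c) for c in colors)
    light_text = any(not is_dark_hex(c) for c in colors)
    return has_bg_image and dark_text and not light_text
```

Crude as it is, a check like this catches the `color: #333` on `dark-photo.jpg` failure mode statically — the exact defect the code-only judge scored 9/10.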

&lt;p&gt;These checks don't replace the judge — they constrain it. If the HTMLVisualChecker finds a critical violation (broken images, empty sections, invisible text), that violation is recorded regardless of what the judge thinks. The judge can still evaluate the nuances of code quality and content accuracy, but it can't override a structural failure.&lt;/p&gt;

&lt;p&gt;The analogy: unit tests don't replace code review, but they catch the obvious regressions before a human ever looks at the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2: Multimodal Judging
&lt;/h2&gt;

&lt;p&gt;The second fix was sending the judge more than just code.&lt;/p&gt;

&lt;p&gt;Code-only judging fails because CSS is a spatial language encoded as text. &lt;code&gt;grid-template-columns: 1fr 2fr 1fr&lt;/code&gt; creates a three-column layout, but you can't verify it's correct without rendering it. &lt;code&gt;rgba(0, 0, 0, 0.7)&lt;/code&gt; overlay on a hero image makes text readable, but the judge can't know the overlay is sufficient without seeing the result.&lt;/p&gt;

&lt;p&gt;Our judge input bundle now includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full HTML source code&lt;/li&gt;
&lt;li&gt;Full CSS source code&lt;/li&gt;
&lt;li&gt;The scoring rubric with violation catalogue&lt;/li&gt;
&lt;li&gt;Gold standard HTML/CSS for comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The judge compares agent output against the gold standard at both the code level and the structural level. It can see whether the agent's CSS variables match the requirements AND whether the agent's HTML structure preserves all sections from the template.&lt;/p&gt;

&lt;p&gt;In future rounds, we plan to add desktop and mobile screenshots to the bundle, making the judge truly multimodal — evaluating the rendered visual output alongside the source code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 3: The Violation Catalogue as Rubric
&lt;/h2&gt;

&lt;p&gt;The third fix was the most impactful. Instead of asking the judge for a score, we ask it to identify specific violations from a fixed catalogue.&lt;/p&gt;

&lt;p&gt;The catalogue defines 22 violation types, each with a unique ID, severity level, and fixed deduction amount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A11Y-DARK-TEXT-ON-DARK-BG&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dark text on dark background (unreadable)&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;deduction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-3.0&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VIS-BROKEN-IMAGE&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Image shows alt text or broken placeholder&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;deduction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-2.5&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CONTENT-PARAPHRASED&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Content paraphrased instead of exact text&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;moderate&lt;/span&gt;
  &lt;span class="na"&gt;deduction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-0.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The judge prompt is explicit about what's expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your job: identify every violation in the agent output by
comparing it against the gold standard and requirements.

Return ONLY a JSON object with violations from the catalogue.

Rules:
- Use EXACT deduction amounts from the violation catalogue
- Do NOT invent violation IDs — use only IDs from the catalogue
- Do NOT report violations that don't exist
- Focus ONLY on the specific action being evaluated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The judge returns structured JSON — not prose, not a score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"violations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VIS-BROKEN-IMAGE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"deduction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hero image src contains description, not URL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;img src=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;a serene landscape with mountains&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONTENT-PARAPHRASED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"moderate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"deduction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"About section text reworded from requirements"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Requirements: 'Farm-fresh flavours' → Output: 'Fresh local ingredients'"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hero image broken, about text paraphrased"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"strengths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Correct colour variables"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"All sections present"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"critical_issues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Unusable hero — no visible image"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation of concerns is the key design decision. The judge does &lt;strong&gt;classification&lt;/strong&gt; — which violations are present? The scoring engine does &lt;strong&gt;arithmetic&lt;/strong&gt; — sum the deductions, subtract from 10. The judge cannot inflate scores because it never assigns scores. It identifies problems. The math is deterministic.&lt;/p&gt;
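&lt;p&gt;A minimal sketch of that arithmetic half, assuming judge output shaped like the JSON above (function and field names are illustrative, not the production engine):&lt;/p&gt;

```python
# Minimal sketch of the deterministic scoring half (illustrative names,
# not the production engine). The judge emits violations; this function
# only does the arithmetic, so it cannot inflate anything.
def score(violations, max_score=10.0):
    """Sum the fixed deductions (stored as negatives) and apply them."""
    return max_score + sum(v["deduction"] for v in violations)

# Violations as the judge reports them, per the JSON output above.
found = [
    {"id": "VIS-BROKEN-IMAGE", "deduction": -2.5},
    {"id": "CONTENT-PARAPHRASED", "deduction": -0.5},
]
print(score(found))  # 7.0
```

&lt;p&gt;Because the function is pure arithmetic over fixed weights, the same set of violations always produces the same score.&lt;/p&gt;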

&lt;h2&gt;
  
  
  How the Three Fixes Work Together
&lt;/h2&gt;

&lt;p&gt;The evaluation pipeline runs in sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. HTMLVisualChecker    → catches structural/visual defects
2. Opus Judge           → identifies violations from catalogue
3. Scoring Engine       → 10 minus sum(all deductions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTMLVisualChecker catches what the judge misses (broken images, contrast issues, empty sections). The judge catches what the checker can't evaluate (content accuracy, code quality nuances, whether the business name appears in all six required locations). The scoring engine applies fixed deductions from both sources.&lt;/p&gt;

&lt;p&gt;Before these fixes, the same page with invisible text scored 9/10. After: the HTMLVisualChecker flags &lt;code&gt;A11Y-DARK-TEXT-ON-DARK-BG&lt;/code&gt; (-3.0), the judge identifies &lt;code&gt;VIS-BROKEN-IMAGE&lt;/code&gt; on the hero (-2.5) and &lt;code&gt;CONTENT-PARAPHRASED&lt;/code&gt; on the about section (-0.5). Final score: 4.0/10.&lt;/p&gt;

&lt;p&gt;That 4.0 is honest. The page has serious problems. The old 9.0 was a lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned About Judge Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Constrain the output format
&lt;/h3&gt;

&lt;p&gt;Free-text evaluation ("rate this code 0–10") produces inflated, inconsistent scores. Structured output with predefined violation types produces consistent, auditable results. The judge's job is classification, not scoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate detection from scoring
&lt;/h3&gt;

&lt;p&gt;When the judge both finds problems and assigns scores, it conflates two tasks and does both poorly. When the judge only identifies violations and a deterministic engine applies fixed deductions, scores are reproducible and explainable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use structural checks as guardrails
&lt;/h3&gt;

&lt;p&gt;LLM judges have blind spots. They read code as text and miss spatial defects. Automated structural checks catch the class of defects that LLMs consistently miss — and they run in milliseconds, not minutes.&lt;/p&gt;
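&lt;p&gt;One such guardrail can be sketched in a few lines — a check that an image &lt;code&gt;src&lt;/code&gt; is actually a URL or path rather than a prose description. The function and the violation shape are illustrative, not the production HTMLVisualChecker:&lt;/p&gt;

```python
# Hedged sketch of one structural guardrail: flag <img> tags whose src
# reads like a prose description instead of a URL or file path.
# Names and the deduction weight are illustrative.
import re

def check_img_srcs(html):
    violations = []
    for src in re.findall(r'<img[^>]+src="([^"]*)"', html):
        # Real srcs are URLs or paths; text with spaces that doesn't
        # start like a URL/path is almost certainly a hallucination.
        if " " in src and not src.startswith(("http", "/", ".")):
            violations.append({"id": "VIS-BROKEN-IMAGE",
                               "deduction": -2.5,
                               "evidence": src})
    return violations

print(check_img_srcs('<img src="a serene landscape with mountains">'))
```

&lt;p&gt;A check like this is trivially fast, deterministic, and catches exactly the defect an LLM judge reading the markup as text tends to rationalise away.&lt;/p&gt;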

&lt;h3&gt;
  
  
  Fixed-weight violations beat subjective assessment
&lt;/h3&gt;

&lt;p&gt;Is a purple gradient better than a blue solid? The judge has opinions, but they're not universal. A missing mobile menu toggle (-2.5), however, is objectively a defect. Fixed weights for objective violations eliminate the subjectivity that causes score inflation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;p&gt;We fixed inflation, but the judge isn't perfect. Here's what remains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single judge bias.&lt;/strong&gt; Only Claude Opus evaluates. It may favour Claude-generated code — similar patterns, similar token distributions. We haven't tested with a second judge model. Round 2 will score a subset with an independent judge and compute Cohen's kappa for inter-rater agreement.&lt;/p&gt;
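&lt;p&gt;The planned agreement check is standard: classify the same outputs with both judges and compute Cohen's kappa over the labels. A self-contained sketch (labels are illustrative):&lt;/p&gt;

```python
# Sketch of the planned inter-judge agreement check: Cohen's kappa over
# per-item classification labels from two judge models.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of items where both judges agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each judge's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / n**2
    return (observed - expected) / (1 - expected)
```

&lt;p&gt;Kappa of 1.0 is perfect agreement; values above roughly 0.6 are usually read as substantial agreement between raters.&lt;/p&gt;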

&lt;p&gt;&lt;strong&gt;No inter-rater calibration.&lt;/strong&gt; We don't know whether our scores are "right" in an absolute sense. We know they're consistent and that they correlate with visible defects. But a human QA review of a random sample would establish whether our 4/10 matches a human's assessment of quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aesthetic subjectivity.&lt;/strong&gt; The violation catalogue covers functional defects (broken images, missing content, contrast failures) but not aesthetic quality. Two pages can score identically — both have correct structure, content, and accessibility — while one looks significantly more professional. We don't measure that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement asymmetry from Round 1.&lt;/strong&gt; Sonnet's gold standard scores (93.4%) were measured differently from alternative models' pipeline scores (59–68%). This doesn't affect the judge's per-action scoring, but it affects the aggregate comparison. Round 2 fixes this by running all models through the same pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principles for LLM-as-Judge
&lt;/h2&gt;

&lt;p&gt;If you're building an LLM judge for any evaluation task — not just code generation — three principles apply:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Structural guardrails before LLM evaluation.&lt;/strong&gt; Catch the obvious defects with deterministic checks before the LLM judge runs. This prevents the judge from rationalising broken output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Constrained violation catalogues over open-ended scoring.&lt;/strong&gt; Define the defects you care about, assign fixed weights, and ask the judge to classify — not score. You get consistent, auditable, explainable results.&lt;/p&gt;
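&lt;p&gt;In code, a constrained catalogue is just a fixed mapping plus validation — the judge's labels are checked against it, and the judge's own numbers are ignored. The entries below are illustrative (the real catalogue has 22 violations):&lt;/p&gt;

```python
# Sketch of a constrained violation catalogue: fixed deductions per
# defect, with judge output validated against it so unknown labels
# are rejected. Entries are illustrative, not the full 22-item catalogue.
CATALOGUE = {
    "VIS-BROKEN-IMAGE":          -2.5,  # critical
    "A11Y-DARK-TEXT-ON-DARK-BG": -3.0,  # critical
    "CONTENT-PARAPHRASED":       -0.5,  # moderate
}

def validate(judge_violations):
    for v in judge_violations:
        if v["id"] not in CATALOGUE:
            raise ValueError(f"unknown violation: {v['id']}")
        # Ignore whatever deduction the judge wrote; the catalogue wins.
        v["deduction"] = CATALOGUE[v["id"]]
    return judge_violations
```

&lt;p&gt;Overwriting the judge's deduction with the catalogue's value is the point: the judge classifies, the catalogue prices.&lt;/p&gt;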

&lt;p&gt;&lt;strong&gt;3. The judge is only as good as its rubric.&lt;/strong&gt; Invest in the rubric. A 22-violation catalogue with severity tiers and fixed deductions took more design effort than the judge prompt itself. The catalogue IS the evaluation — the judge is just the executor.&lt;/p&gt;

&lt;p&gt;LLM judges are powerful. They're also unreliable by default. The engineering isn't in the model — it's in the constraints you build around it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part 3 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/tsekatm/5-models-467-actions-1-winner-what-we-learned-comparing-llms-on-real-code-generation-2lfl"&gt;5 Models, 467 Actions, 1 Winner&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: &lt;a href="https://tebogo.cloud/blog/cost-quality-tradeoffs-ai-code-generation" rel="noopener noreferrer"&gt;The $0.07 vs $1.05 Question — Cost-Quality Tradeoffs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tebogo.cloud/blog/building-llm-judge-that-doesnt-lie" rel="noopener noreferrer"&gt;tebogo.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>testing</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>5 Models, 467 Actions, 1 Winner — What We Learned Comparing LLMs on Real Code Generation</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Mon, 30 Mar 2026 19:02:48 +0000</pubDate>
      <link>https://dev.to/tsekatm/5-models-467-actions-1-winner-what-we-learned-comparing-llms-on-real-code-generation-2lfl</link>
      <guid>https://dev.to/tsekatm/5-models-467-actions-1-winner-what-we-learned-comparing-llms-on-real-code-generation-2lfl</guid>
      <description>&lt;p&gt;We tested five AI models on the same task 467 times. Each run produced a complete deployable website — not a code snippet, not a function, not a patch. A real site with HTML, CSS, JavaScript, and assets.&lt;/p&gt;

&lt;p&gt;The question: can cheaper models match Claude Sonnet for production code generation?&lt;/p&gt;

&lt;p&gt;The short answer is no. The longer answer is more interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Models
&lt;/h2&gt;

&lt;p&gt;Five models, spanning roughly a 12x range in input-token price and nearly 40x in output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input/1M Tokens&lt;/th&gt;
&lt;th&gt;Output/1M Tokens&lt;/th&gt;
&lt;th&gt;Why We Tested It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Assumed gold standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;OpenRouter/CLI&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;Same family, lower tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;$2.20&lt;/td&gt;
&lt;td&gt;Moonshot AI's latest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$0.26&lt;/td&gt;
&lt;td&gt;$0.38&lt;/td&gt;
&lt;td&gt;Budget option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;$0.70&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;Reasoning-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These five represent distinct price tiers and architectural approaches. Sonnet and Haiku share a lineage. Kimi is multimodal. DeepSeek V3.2 optimises for cost. R1 optimises for step-by-step reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 16-Action Pipeline
&lt;/h2&gt;

&lt;p&gt;Each model received the same template skeleton and business requirements, then applied 16 sequential actions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;apply-colours&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;swap-fonts&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;replace-header-logo&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;replace-footer-logo&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;replace-favicon&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;replace-hero-bg&lt;/td&gt;
&lt;td&gt;Images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;replace-section-bgs&lt;/td&gt;
&lt;td&gt;Images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;update-hero-text&lt;/td&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;update-about-text&lt;/td&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;update-contact&lt;/td&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;apply-hero-layout&lt;/td&gt;
&lt;td&gt;Layout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;apply-sections-layout&lt;/td&gt;
&lt;td&gt;Layout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;add-seo-meta&lt;/td&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;add-structured-data&lt;/td&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;add-accessibility&lt;/td&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;verify-contrast&lt;/td&gt;
&lt;td&gt;Quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same requirements spec, same gold standard, same judge for all models. Each action scored 0–10 using a violation-deduction model (see &lt;a href="https://dev.to/tsekatm/beyond-text-how-we-built-an-evaluation-framework-for-multi-file-ai-outputs-1d10"&gt;Part 1&lt;/a&gt;). Maximum possible: 160 points.&lt;/p&gt;

&lt;p&gt;Actions are sequential — each builds on the previous output. Errors compound. This is deliberate: it mirrors how agents work in production.&lt;/p&gt;
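&lt;p&gt;The pipeline shape can be sketched in a few lines — each action consumes the previous action's output, which is why an early defect propagates into every later score. &lt;code&gt;apply_action&lt;/code&gt; and &lt;code&gt;score_action&lt;/code&gt; are stand-ins for the model call and the judge plus scoring engine:&lt;/p&gt;

```python
# Sketch of the sequential pipeline shape: each action receives the
# previous action's output, so an early defect compounds downstream.
# apply_action / score_action are illustrative stand-ins.
def run_pipeline(template, actions, apply_action, score_action):
    site, total = template, 0.0
    for action in actions:
        site = apply_action(action, site)    # output feeds the next action
        total += score_action(action, site)  # 0-10 per action, 160 max
    return site, total
```

&lt;p&gt;With 16 actions at 10 points each, a perfect run totals 160 — the denominator behind every percentage in the tables below.&lt;/p&gt;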

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;95% CI&lt;/th&gt;
&lt;th&gt;% of Max&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Runs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;149.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A†&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.0†&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;108.2&lt;/td&gt;
&lt;td&gt;[92.7, 123.7]&lt;/td&gt;
&lt;td&gt;67.6%&lt;/td&gt;
&lt;td&gt;20.1&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;107.7&lt;/td&gt;
&lt;td&gt;[91.0, 124.4]&lt;/td&gt;
&lt;td&gt;67.3%&lt;/td&gt;
&lt;td&gt;13.4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;94.0&lt;/td&gt;
&lt;td&gt;[78.0, 110.0]&lt;/td&gt;
&lt;td&gt;58.8%&lt;/td&gt;
&lt;td&gt;28.9&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;41.9&lt;/td&gt;
&lt;td&gt;N/A (n=2)&lt;/td&gt;
&lt;td&gt;26.2%&lt;/td&gt;
&lt;td&gt;3.3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sonnet 4.6:    ████████████████████████████████████████████████████████ 149.5 (93%)
Kimi K2.5:     ████████████████████████████████████████                108.2 (68%)  ±15.5
Claude Haiku:  ████████████████████████████████████████                107.7 (67%)  ±16.7
DeepSeek V3.2: ██████████████████████████████████                       94.0 (59%)  ±16.0
DeepSeek R1:   ███████████████                                          41.9 (26%)  n=2
               |---------|---------|---------|---------|---------|
               0        30        60        90       120       150
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Honesty Moment
&lt;/h3&gt;

&lt;p&gt;Before interpreting these rankings, three caveats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet was measured differently.&lt;/strong&gt; Its 149.5 score comes from gold standard evaluation (automated quality signals against 21 templates), not the same 16-action pipeline as the alternatives. The 41-point gap between Sonnet and the field may be partly methodological. We're fixing this in Round 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rankings 2–4 are noise.&lt;/strong&gt; Kimi's confidence interval is [93, 124]. Haiku's is [91, 124]. DeepSeek V3.2's is [78, 110]. These overlap heavily. With current sample sizes, we cannot say which of these three is genuinely better. What we CAN say: all three cluster around 59–68% of max, well below Sonnet's 93%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample sizes are small.&lt;/strong&gt; 2–15 runs per model. We need n≥16 for 80% statistical power to detect a 20-point difference. The rankings are directionally useful but not statistically conclusive for the middle tier.&lt;/p&gt;
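&lt;p&gt;The intervals quoted above come from standard small-sample arithmetic: a t-distribution critical value times the standard error. A sketch using Kimi K2.5's figures from the results table (the critical value is from standard t tables):&lt;/p&gt;

```python
# Sketch of the interval arithmetic behind the caveats: a 95% CI from
# mean, standard deviation, and sample size, using the t critical value
# for small samples. Figures are Kimi K2.5's from the results table.
import math

def ci95(mean, sd, n, t_crit):
    half_width = t_crit * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)

low, high = ci95(108.2, 20.1, 9, t_crit=2.306)  # t for df=8, 95%
print(round(low, 1), round(high, 1))  # ≈ 92.7, 123.7
```

&lt;p&gt;The same arithmetic explains why the middle tier is indistinguishable: with n under 16, the half-widths stay wide enough that the intervals overlap.&lt;/p&gt;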

&lt;h2&gt;
  
  
  Per-Template Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Template&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;Kimi&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;DeepSeek V3.2&lt;/th&gt;
&lt;th&gt;Best Alt % of Sonnet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI Page Builder (SaaS)&lt;/td&gt;
&lt;td&gt;149.5&lt;/td&gt;
&lt;td&gt;134.8&lt;/td&gt;
&lt;td&gt;124.2&lt;/td&gt;
&lt;td&gt;99.5&lt;/td&gt;
&lt;td&gt;90.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Association Corporate&lt;/td&gt;
&lt;td&gt;149.5&lt;/td&gt;
&lt;td&gt;126.0&lt;/td&gt;
&lt;td&gt;120.2&lt;/td&gt;
&lt;td&gt;105.5&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safari Lodge&lt;/td&gt;
&lt;td&gt;149.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;108.2&lt;/td&gt;
&lt;td&gt;120.5&lt;/td&gt;
&lt;td&gt;80.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS Product&lt;/td&gt;
&lt;td&gt;149.5&lt;/td&gt;
&lt;td&gt;112.0&lt;/td&gt;
&lt;td&gt;89.5&lt;/td&gt;
&lt;td&gt;112.0&lt;/td&gt;
&lt;td&gt;74.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gala Event&lt;/td&gt;
&lt;td&gt;149.5&lt;/td&gt;
&lt;td&gt;98.8&lt;/td&gt;
&lt;td&gt;96.0&lt;/td&gt;
&lt;td&gt;86.8&lt;/td&gt;
&lt;td&gt;66.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI Page Builder template is the closest contest — Kimi reaches 90.2% of Sonnet's quality. The Gala Event template is the widest gap at 66.1%. Template complexity matters: simpler structures with fewer sections are easier for all models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Action Difficulty: What's Easy and What's Impossible
&lt;/h2&gt;

&lt;p&gt;This is where the data gets interesting. Not all 16 actions are created equal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;add-accessibility&lt;/td&gt;
&lt;td&gt;9.4/10&lt;/td&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;add-seo-meta&lt;/td&gt;
&lt;td&gt;9.2/10&lt;/td&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;update-about-text&lt;/td&gt;
&lt;td&gt;8.8/10&lt;/td&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;replace-favicon&lt;/td&gt;
&lt;td&gt;8.6/10&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;apply-colours&lt;/td&gt;
&lt;td&gt;5.2/10&lt;/td&gt;
&lt;td&gt;Brand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;apply-hero-layout&lt;/td&gt;
&lt;td&gt;2.8/10&lt;/td&gt;
&lt;td&gt;Layout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;apply-sections-layout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.8/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Layout&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is clear when you group by category:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;Observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technical (SEO, a11y, schema)&lt;/td&gt;
&lt;td&gt;8.7/10&lt;/td&gt;
&lt;td&gt;Models follow structured specs reliably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content (text updates)&lt;/td&gt;
&lt;td&gt;7.7/10&lt;/td&gt;
&lt;td&gt;Good when verbatim rules enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brand (colours, fonts, logos)&lt;/td&gt;
&lt;td&gt;6.8/10&lt;/td&gt;
&lt;td&gt;Moderate — CSS variable application is fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Images (hero, section bgs)&lt;/td&gt;
&lt;td&gt;6.2/10&lt;/td&gt;
&lt;td&gt;All models hallucinate descriptions as src&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layout (hero, sections)&lt;/td&gt;
&lt;td&gt;1.0/10&lt;/td&gt;
&lt;td&gt;Consistently catastrophic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Structured, well-defined tasks score high. Spatial, visual tasks score low. Same models, wildly different results depending on task type.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Analysis: Where Alternatives Fall Behind
&lt;/h2&gt;

&lt;p&gt;Comparing each action against Sonnet reveals where the quality gap actually lives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;Kimi&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;DS-V3&lt;/th&gt;
&lt;th&gt;Avg Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;add-accessibility&lt;/td&gt;
&lt;td&gt;9.5&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;replace-favicon&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;8.8&lt;/td&gt;
&lt;td&gt;8.4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;add-seo-meta&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;td&gt;9.4&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apply-colours&lt;/td&gt;
&lt;td&gt;9.5&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;td&gt;5.8&lt;/td&gt;
&lt;td&gt;6.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-3.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apply-hero-layout&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-5.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apply-sections-layout&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;1.6&lt;/td&gt;
&lt;td&gt;-3.8&lt;/td&gt;
&lt;td&gt;-1.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-10.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three actions account for most of the quality gap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;apply-sections-layout&lt;/strong&gt; (-10.2 point gap) — alternatives actively break layouts. Haiku scores -3.8 on average, meaning it makes pages significantly worse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply-hero-layout&lt;/strong&gt; (-5.4 point gap) — layout transformation is fundamentally hard for all models below Sonnet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply-colours&lt;/strong&gt; (-3.3 point gap) — CSS variable propagation is inconsistent. Models update some variables but miss gradients, overlays, and header tints.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three actions show essentially zero gap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;add-accessibility&lt;/strong&gt; (+0.0) — every model follows accessibility specs equally well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;replace-favicon&lt;/strong&gt; (-0.3) — simple file replacement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;add-seo-meta&lt;/strong&gt; (-0.7) — structured metadata is a universal strength.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This has a practical implication: if you could route easy tasks to cheap models and hard tasks to Sonnet, you could potentially cut costs without cutting quality on the tasks that matter. More on this in &lt;a href="https://tebogo.cloud/blog/cost-quality-tradeoffs-ai-code-generation" rel="noopener noreferrer"&gt;Part 4&lt;/a&gt;.&lt;/p&gt;
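&lt;p&gt;As a sketch of that routing idea: dispatch near-zero-gap actions to a cheap model and high-gap actions to Sonnet. The gap figures come from the table above; the threshold and model names are illustrative, not a production router:&lt;/p&gt;

```python
# Hedged sketch of per-action model routing: near-zero-gap actions go to
# a cheap model, high-gap actions to the strong one. Gap figures are
# from the table above; threshold and model names are illustrative.
AVG_GAP = {
    "add-accessibility":      0.0,
    "replace-favicon":       -0.3,
    "add-seo-meta":          -0.7,
    "apply-colours":         -3.3,
    "apply-hero-layout":     -5.4,
    "apply-sections-layout": -10.2,
}

def route(action, threshold=-1.0):
    return "claude-haiku" if AVG_GAP[action] >= threshold else "claude-sonnet"

print(route("add-seo-meta"))           # claude-haiku
print(route("apply-sections-layout"))  # claude-sonnet
```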

&lt;h2&gt;
  
  
  The Action Heatmap
&lt;/h2&gt;

&lt;p&gt;Here's every model scored on every action — the full picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;                    &lt;span class="k"&gt;Kimi&lt;/span&gt;  &lt;span class="k"&gt;Haiku&lt;/span&gt;  &lt;span class="k"&gt;DS&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;V&lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt;  &lt;span class="k"&gt;DS&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;R&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;add&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;accessibility&lt;/span&gt;   &lt;span class="mf"&gt;9.6&lt;/span&gt;   &lt;span class="mf"&gt;9.8&lt;/span&gt;    &lt;span class="mf"&gt;9.2&lt;/span&gt;    &lt;span class="mf"&gt;8.1&lt;/span&gt;
&lt;span class="k"&gt;add&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;seo&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;meta&lt;/span&gt;        &lt;span class="mf"&gt;9.4&lt;/span&gt;   &lt;span class="mf"&gt;9.6&lt;/span&gt;    &lt;span class="mf"&gt;9.0&lt;/span&gt;    &lt;span class="mf"&gt;6.8&lt;/span&gt;
&lt;span class="k"&gt;update&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;about&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;text&lt;/span&gt;   &lt;span class="mf"&gt;9.2&lt;/span&gt;   &lt;span class="mf"&gt;8.8&lt;/span&gt;    &lt;span class="mf"&gt;8.6&lt;/span&gt;    &lt;span class="mf"&gt;0.6&lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;favicon&lt;/span&gt;     &lt;span class="mf"&gt;9.0&lt;/span&gt;   &lt;span class="mf"&gt;8.8&lt;/span&gt;    &lt;span class="mf"&gt;8.4&lt;/span&gt;    &lt;span class="mf"&gt;6.0&lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;header&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;logo&lt;/span&gt; &lt;span class="mf"&gt;8.2&lt;/span&gt;   &lt;span class="mf"&gt;9.2&lt;/span&gt;    &lt;span class="mf"&gt;7.4&lt;/span&gt;    &lt;span class="mf"&gt;4.8&lt;/span&gt;
&lt;span class="k"&gt;add&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;structured&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="mf"&gt;7.8&lt;/span&gt;   &lt;span class="mf"&gt;8.8&lt;/span&gt;    &lt;span class="mf"&gt;7.0&lt;/span&gt;    &lt;span class="mf"&gt;5.1&lt;/span&gt;
&lt;span class="k"&gt;update&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;hero&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;text&lt;/span&gt;    &lt;span class="mf"&gt;7.6&lt;/span&gt;   &lt;span class="mf"&gt;7.7&lt;/span&gt;    &lt;span class="mf"&gt;7.2&lt;/span&gt;    &lt;span class="mf"&gt;1.6&lt;/span&gt;
&lt;span class="k"&gt;update&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;contact&lt;/span&gt;      &lt;span class="mf"&gt;7.4&lt;/span&gt;   &lt;span class="mf"&gt;7.6&lt;/span&gt;    &lt;span class="mf"&gt;7.0&lt;/span&gt;   &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;
&lt;span class="k"&gt;swap&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;fonts&lt;/span&gt;          &lt;span class="mf"&gt;7.6&lt;/span&gt;   &lt;span class="mf"&gt;7.0&lt;/span&gt;    &lt;span class="mf"&gt;6.8&lt;/span&gt;    &lt;span class="mf"&gt;2.1&lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;hero&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;bg&lt;/span&gt;     &lt;span class="mf"&gt;7.3&lt;/span&gt;   &lt;span class="mf"&gt;6.2&lt;/span&gt;    &lt;span class="mf"&gt;6.5&lt;/span&gt;    &lt;span class="mf"&gt;2.8&lt;/span&gt;
&lt;span class="k"&gt;verify&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;contrast&lt;/span&gt;     &lt;span class="mf"&gt;6.4&lt;/span&gt;   &lt;span class="mf"&gt;7.8&lt;/span&gt;    &lt;span class="mf"&gt;5.8&lt;/span&gt;    &lt;span class="mf"&gt;4.8&lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;section&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;bgs&lt;/span&gt; &lt;span class="mf"&gt;7.6&lt;/span&gt;   &lt;span class="mf"&gt;2.4&lt;/span&gt;    &lt;span class="mf"&gt;5.5&lt;/span&gt;    &lt;span class="mf"&gt;3.0&lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;footer&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;logo&lt;/span&gt; &lt;span class="mf"&gt;6.0&lt;/span&gt;   &lt;span class="mf"&gt;8.6&lt;/span&gt;    &lt;span class="mf"&gt;4.8&lt;/span&gt;    &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="k"&gt;apply&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;colours&lt;/span&gt;       &lt;span class="mf"&gt;6.2&lt;/span&gt;   &lt;span class="mf"&gt;5.8&lt;/span&gt;    &lt;span class="mf"&gt;6.5&lt;/span&gt;    &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="k"&gt;apply&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;hero&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;layout&lt;/span&gt;   &lt;span class="mf"&gt;4.7&lt;/span&gt;   &lt;span class="mf"&gt;3.2&lt;/span&gt;    &lt;span class="mf"&gt;2.8&lt;/span&gt;   &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;3.9&lt;/span&gt;
&lt;span class="k"&gt;apply&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;sections&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;lyt&lt;/span&gt;  &lt;span class="mf"&gt;1.6&lt;/span&gt;  &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;   &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;   &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice DeepSeek R1's column. It scores -1.2 on contact updates and -3.9 on hero layout. These aren't just bad scores — they mean the model made the page actively worse than the starting template on basic tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reasoning Model Trap
&lt;/h2&gt;

&lt;p&gt;DeepSeek R1 scored 26.2% — worse than any other model by a wide margin. On two runs, it averaged 41.9/160. For context, a score of 41.9 means the model successfully completed roughly 4 of 16 actions and actively damaged several others.&lt;/p&gt;

&lt;p&gt;Why? R1 is a reasoning model. It's optimised for step-by-step logical deduction — mathematical proofs, multi-hop reasoning, chain-of-thought problem solving. Code generation is not reasoning. It's pattern completion with spatial awareness.&lt;/p&gt;

&lt;p&gt;R1 spent tokens "thinking" about CSS instead of writing it. Its chain-of-thought preambles consumed context window without producing better output. On layout tasks, it reasoned its way into worse solutions than models that simply pattern-matched from training data.&lt;/p&gt;

&lt;p&gt;The lesson: match the model architecture to the task type. Reasoning models are the wrong tool for code generation. This seems obvious in hindsight, but R1's pricing ($0.70/$2.50) sits between DeepSeek V3.2 and Haiku — it looks like a sensible mid-tier option until you run the evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Variance Problem
&lt;/h2&gt;

&lt;p&gt;Average scores tell half the story. The other half is variance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Best Run&lt;/th&gt;
&lt;th&gt;Worst Run&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;107.7&lt;/td&gt;
&lt;td&gt;13.4&lt;/td&gt;
&lt;td&gt;~121&lt;/td&gt;
&lt;td&gt;~94&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;108.2&lt;/td&gt;
&lt;td&gt;20.1&lt;/td&gt;
&lt;td&gt;~128&lt;/td&gt;
&lt;td&gt;~88&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;94.0&lt;/td&gt;
&lt;td&gt;28.9&lt;/td&gt;
&lt;td&gt;120.5&lt;/td&gt;
&lt;td&gt;25.8&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Haiku is the most consistent model — you know what you're getting. Its standard deviation (13.4) is roughly two-thirds of Kimi's (20.1) and less than half of DeepSeek V3.2's (28.9).&lt;/p&gt;

&lt;p&gt;DeepSeek V3.2's variance is remarkable. Its best run (120.5) approaches Haiku's average. Its worst run (25.8) is catastrophic — worse than R1's average. Same model, same template, same requirements, 95-point swing.&lt;/p&gt;
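&lt;p&gt;The consistency figures are plain descriptive statistics. A minimal sketch, using hypothetical per-run scores chosen to roughly reproduce Haiku's reported numbers (the actual per-run data isn't published in this article):&lt;/p&gt;

```python
import statistics

# Hypothetical run scores out of 160: illustrative only, picked to roughly
# match Haiku's reported mean (107.7), std dev (13.4) and range (27)
haiku_runs = [121, 94, 108]

mean = statistics.mean(haiku_runs)
stdev = statistics.stdev(haiku_runs)  # sample standard deviation
spread = max(haiku_runs) - min(haiku_runs)
print(f"mean={mean:.1f} stdev={stdev:.1f} range={spread}")
```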

&lt;p&gt;For production systems, unpredictable quality is worse than consistently mediocre quality. A restaurant that's amazing 50% of the time and terrible 50% isn't a good restaurant. Haiku's consistency is a genuine advantage that doesn't show up in averages.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;This was an exploratory evaluation — designed to identify patterns, not prove rankings. For Round 2, we're addressing three issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run Sonnet through the same pipeline.&lt;/strong&gt; The gold standard scoring method makes Sonnet's score non-comparable. In Round 2, Sonnet runs the same 16-action pipeline as every other model. Same judge, same conditions, same denominator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increase sample sizes.&lt;/strong&gt; Minimum 15 runs per model across the same template set. That gives us 80% statistical power to detect a 20-point difference at alpha=0.05. No more overlapping confidence intervals for the middle tier.&lt;/p&gt;
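&lt;p&gt;The power figure can be sanity-checked with a normal-approximation sketch. The ~20-point standard deviation is an assumption taken from the variance table above, and a proper t-distribution calculation would shift the result slightly, so read the output as roughly 80% rather than exactly:&lt;/p&gt;

```python
import math

def approx_power(delta, sigma, n, z_crit=1.959964):
    """Approximate power of a two-sided two-sample test at alpha=0.05
    (normal approximation to the t-test); n is runs per model."""
    noncentrality = (delta / sigma) * math.sqrt(n / 2)
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf((noncentrality - z_crit) / math.sqrt(2)))

power = approx_power(delta=20, sigma=20, n=15)  # ~0.78 under these assumptions
```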

&lt;p&gt;&lt;strong&gt;Calibrate the judge.&lt;/strong&gt; Our Claude Opus judge scores Claude models. There's an obvious bias risk. Round 2 will score a subset with a second judge model and compute inter-rater agreement. We'll also blind the judge by stripping model-identifying patterns from outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No model matches Sonnet.&lt;/strong&gt; The gap is directionally clear even with measurement caveats. For client-facing output where quality is non-negotiable, Sonnet remains the production choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The middle tier is a tie.&lt;/strong&gt; Kimi, Haiku, and DeepSeek V3.2 are statistically indistinguishable. Pick based on secondary factors: Haiku for consistency, Kimi for peak performance, DeepSeek for cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task type matters more than model choice.&lt;/strong&gt; The difference between the easiest action (9.4/10) and the hardest (-0.8/10) is larger than the difference between any two models on the same action. If you optimise which tasks you give to AI rather than which AI you use, you'll see bigger quality gains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning models don't generate code well.&lt;/strong&gt; R1's architecture is wrong for this task. Don't pick a model based on its benchmark scores on reasoning tasks if your workload is code generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variance is a feature, not noise.&lt;/strong&gt; DeepSeek V3.2 is the cheapest option but the least predictable. Haiku costs 5x more but delivers consistent results. The reliability premium is real.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part 2 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/tsekatm/beyond-text-how-we-built-an-evaluation-framework-for-multi-file-ai-outputs-1d10"&gt;Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: &lt;a href="https://tebogo.cloud/blog/building-llm-judge-that-doesnt-lie" rel="noopener noreferrer"&gt;Building an LLM Judge That Doesn't Lie to You&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tebogo.cloud/blog/comparing-llms-real-code-generation" rel="noopener noreferrer"&gt;tebogo.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:24:19 +0000</pubDate>
      <link>https://dev.to/tsekatm/beyond-text-how-we-built-an-evaluation-framework-for-multi-file-ai-outputs-1d10</link>
      <guid>https://dev.to/tsekatm/beyond-text-how-we-built-an-evaluation-framework-for-multi-file-ai-outputs-1d10</guid>
      <description>&lt;p&gt;Most LLM benchmarks evaluate text. HumanEval checks if a function passes unit tests. SWE-bench measures whether a model can patch a repository. MBPP scores single-function completions.&lt;/p&gt;

&lt;p&gt;None of these work when your AI agent generates an entire website.&lt;/p&gt;

&lt;p&gt;I run a site builder agent that takes a template, a set of business requirements (brand colours, fonts, content, images, layout), and produces a deployable multi-file artifact: &lt;code&gt;index.html&lt;/code&gt;, &lt;code&gt;css/styles.css&lt;/code&gt;, &lt;code&gt;js/main.js&lt;/code&gt;, and an &lt;code&gt;assets/&lt;/code&gt; directory. The output isn't a string. It's a folder. And a correct &lt;code&gt;index.html&lt;/code&gt; paired with broken &lt;code&gt;styles.css&lt;/code&gt; produces a broken site — even though each file might look reasonable in isolation.&lt;/p&gt;

&lt;p&gt;I needed an evaluation framework that could score these outputs the way a QA engineer would: structurally, visually, semantically, and at the code level. Over six days, I built one. It evaluated 467 actions across 5 models, and the results changed how I think about AI code generation.&lt;/p&gt;

&lt;p&gt;This article explains the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Existing Benchmarks Don't Work Here
&lt;/h2&gt;

&lt;p&gt;The gap between LLM benchmarks and real-world code generation is wider than it appears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HumanEval&lt;/strong&gt; tests single functions with pass/fail assertions. There's no partial credit for CSS that's 90% right but produces invisible text on a dark background. &lt;strong&gt;SWE-bench&lt;/strong&gt; measures diffs against existing repositories — our agents generate from scratch, not patch. And &lt;strong&gt;MBPP&lt;/strong&gt; evaluates isolated snippets with no concept of inter-file dependencies.&lt;/p&gt;

&lt;p&gt;What I actually needed to measure fell into five categories: structural integrity (are the right files present?), visual fidelity (does it look correct?), content accuracy (is the business name right in all six locations?), code quality (is the CSS valid and responsive?), and accessibility (can users actually read the text?).&lt;/p&gt;

&lt;p&gt;No existing benchmark covers all five for multi-file outputs. So I built a four-layer evaluation stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Layer Evaluation Stack
&lt;/h2&gt;

&lt;p&gt;Each layer catches a different class of defect. They run in sequence, and their results feed into a unified scoring model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Structural Checks
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;FolderComparer&lt;/code&gt; validates the generated file tree against the gold standard. Does &lt;code&gt;index.html&lt;/code&gt; exist? Is &lt;code&gt;css/styles.css&lt;/code&gt; present? Are there unexpected files that shouldn't be there?&lt;/p&gt;

&lt;p&gt;This layer catches the most fundamental failures. A missing &lt;code&gt;index.html&lt;/code&gt; is an instant -5.0 deduction — the site literally cannot load. An extra file nobody asked for is a minor -0.25. The structural layer answers one question: did the agent produce the right artifacts?&lt;/p&gt;
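&lt;p&gt;A minimal sketch of this kind of structural check (the function name and the -2.0 deduction for other missing files are illustrative assumptions; the -5.0 and -0.25 amounts are the ones quoted above):&lt;/p&gt;

```python
from pathlib import Path

def structural_violations(output_dir, gold_dir):
    """Compare the generated file tree against the gold standard tree."""
    produced = {p.relative_to(output_dir).as_posix()
                for p in Path(output_dir).rglob("*") if p.is_file()}
    expected = {p.relative_to(gold_dir).as_posix()
                for p in Path(gold_dir).rglob("*") if p.is_file()}
    violations = []
    for missing in sorted(expected - produced):
        # a missing index.html is fatal; -2.0 for other files is an assumption
        deduction = -5.0 if missing == "index.html" else -2.0
        violations.append(("MISSING-FILE", missing, deduction))
    for extra in sorted(produced - expected):
        violations.append(("UNEXPECTED-FILE", extra, -0.25))
    return violations
```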

&lt;h3&gt;
  
  
  Layer 2: Content Checks
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ContentComparer&lt;/code&gt; parses the generated HTML and validates text content, meta tags, heading hierarchy, alt text, and viewport configuration. It answers: does the content match what was requested?&lt;/p&gt;

&lt;p&gt;This layer caught a failure pattern I didn't anticipate. Models paraphrase user-provided content roughly 30% of the time. The requirement says "Farm-fresh flavours, crafted with care" and the model writes "Fresh ingredients from local farms, prepared with dedication." Semantically similar. Functionally wrong. The client gave you exact copy — use it.&lt;/p&gt;
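&lt;p&gt;A sketch of the verbatim check (hypothetical function names; the -0.5 deduction matches the catalogue entry for paraphrased content): strip the markup, collapse whitespace, and require the exact copy to appear as-is.&lt;/p&gt;

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text nodes from the generated HTML."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def verbatim_violations(required_snippets, html):
    extractor = _TextExtractor()
    extractor.feed(html)
    text = re.sub(r"\s+", " ", " ".join(extractor.chunks))
    violations = []
    for snippet in required_snippets:
        wanted = re.sub(r"\s+", " ", snippet).strip()
        if wanted not in text:  # must appear word-for-word, not paraphrased
            violations.append(("CONTENT-NOT-VERBATIM", wanted, -0.5))
    return violations
```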

&lt;h3&gt;
  
  
  Layer 3: Visual Checks
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;HTMLVisualChecker&lt;/code&gt; analyses HTML and CSS without rendering, catching issues that code review alone misses. It detects broken images (where the &lt;code&gt;src&lt;/code&gt; attribute contains a description instead of a URL), empty sections, dark text on dark backgrounds, broken navigation layouts, and missing interactivity.&lt;/p&gt;

&lt;p&gt;This layer exists because of a specific failure. Early in testing, our LLM judge gave a 9/10 to a page where the hero text was completely invisible — dark grey text (&lt;code&gt;#333&lt;/code&gt;) on a dark background image. The CSS was syntactically valid. The HTML was well-structured. But the page was unusable. The visual checker now catches contrast violations by analysing CSS colour values against background declarations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_dark_text_on_dark_bg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Violation&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Detect potential dark-on-dark contrast issues from CSS.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;root_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:root\s*\{([^}]+)\}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;root_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;violations&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract CSS variables
&lt;/span&gt;    &lt;span class="n"&gt;vars_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(--[\w-]+)\s*:\s*([^;]+);&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root_block&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;vars_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;text_color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vars_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--text-color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;has_bg_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;background(-image)?\s*:\s*url\(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;has_bg_images&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_is_dark_color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_color&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Violation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A11Y-DARK-TEXT-ON-DARK-BG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CRITICAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;deduction&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dark text with background images — unreadable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also catches image hallucination — a universal failure across all five models we tested. Every model, at some point, writes image descriptions as &lt;code&gt;src&lt;/code&gt; attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- What the model generates --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"a modern office building with glass facade and blue sky"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- What it should generate --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://images.unsplash.com/photo-1486406146926-c627a92ad1ab"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The checker flags any &lt;code&gt;src&lt;/code&gt; attribute that is longer than 30 characters, contains spaces, and doesn't start with &lt;code&gt;http&lt;/code&gt; — a simple heuristic that catches this pattern reliably.&lt;/p&gt;
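&lt;p&gt;The heuristic is essentially one line (illustrative function name):&lt;/p&gt;

```python
def looks_like_hallucinated_src(src):
    """Heuristic: a long src value with spaces that isn't a URL is almost
    certainly a prose description, not an image path."""
    return len(src) > 30 and " " in src and not src.startswith("http")
```

&lt;p&gt;On the two examples above, the description string is flagged and the Unsplash URL passes.&lt;/p&gt;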

&lt;h3&gt;
  
  
  Layer 4: LLM Judge
&lt;/h3&gt;

&lt;p&gt;The final layer is a Claude Opus multimodal judge. It receives the source code, the scoring rubric, and the violation catalogue, then returns a structured JSON response identifying every violation it finds.&lt;/p&gt;

&lt;p&gt;The judge prompt is specific and constrained:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Your&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;job:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;identify&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;every&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;violation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;by&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;comparing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;against&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;gold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;standard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;requirements.&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ONLY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;structure:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"violations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VIOLATION-ID-FROM-CATALOGUE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical|major|moderate|minor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deduction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-N.N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is wrong"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Specific line showing the issue"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Rules:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Use&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;EXACT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;deduction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;amounts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;violation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;catalogue&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;NOT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;invent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;violation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;IDs&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;NOT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;report&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;violations&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;don't&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;exist&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three design decisions matter here. First, the judge identifies violations — it doesn't assign scores. The scoring engine applies fixed deductions. This separation prevents the judge from inflating or deflating scores arbitrarily. Second, the violation IDs are constrained to a catalogue of 22 known types. The judge can't invent new categories. Third, deduction amounts are fixed per violation type. The judge classifies; the scorer calculates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Violation-Deduction Scoring Model
&lt;/h2&gt;

&lt;p&gt;Traditional AI evaluation uses additive scoring: start at 0, add points for what's correct. Our model inverts this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score = 10 - sum(deductions)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every action starts at 10 (perfect). Each violation subtracts its fixed deduction. Scores can go negative — and they do. The layout transformation action averages -0.8/10 across all models, meaning models consistently make the page worse than the starting template.&lt;/p&gt;

&lt;p&gt;Why deductive scoring? Because a page that's 90% correct but has invisible text is not a 9/10. It's broken. Additive scoring rewards partial completion. Deductive scoring penalises defects proportionally to their impact on the user.&lt;/p&gt;
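&lt;p&gt;A sketch of the scoring engine. The catalogue here lists only a handful of the 22 violation types, and every ID except &lt;code&gt;A11Y-DARK-TEXT-ON-DARK-BG&lt;/code&gt; is an illustrative placeholder:&lt;/p&gt;

```python
# A few entries from the fixed-deduction catalogue; most IDs here are
# placeholders, with deduction amounts taken from the table below.
CATALOGUE = {
    "STRUCT-MISSING-INDEX": -5.0,
    "A11Y-DARK-TEXT-ON-DARK-BG": -3.0,
    "VIS-BROKEN-IMAGE-SRC": -2.5,
    "CONTENT-NOT-VERBATIM": -0.5,
}

def score_action(violation_ids):
    """Start at a perfect 10 and subtract fixed deductions; scores can go
    negative. Unknown IDs raise, so the judge cannot invent categories."""
    return 10 + sum(CATALOGUE[v] for v in violation_ids)
```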

&lt;p&gt;The 22 violation types span seven categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Example Violation&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Deduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Structural&lt;/td&gt;
&lt;td&gt;Missing index.html&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-5.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structural&lt;/td&gt;
&lt;td&gt;Empty section (no visible content)&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual&lt;/td&gt;
&lt;td&gt;Layout completely broken&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual&lt;/td&gt;
&lt;td&gt;Broken image (description as src)&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;td&gt;Missing text from requirements&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content&lt;/td&gt;
&lt;td&gt;Content paraphrased, not verbatim&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;-0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Quality&lt;/td&gt;
&lt;td&gt;Local file path instead of URL&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Quality&lt;/td&gt;
&lt;td&gt;No responsive breakpoints&lt;/td&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;-1.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accessibility&lt;/td&gt;
&lt;td&gt;Dark text on dark background&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accessibility&lt;/td&gt;
&lt;td&gt;Missing alt text&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;-0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactivity&lt;/td&gt;
&lt;td&gt;No mobile menu toggle&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;-2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;No lazy loading&lt;/td&gt;
&lt;td&gt;Minor&lt;/td&gt;
&lt;td&gt;-0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The severity tiers reflect real-world impact. A critical violation (-2.0 to -5.0) makes the site unusable or unprofessional. A major violation (-1.0 to -2.0) degrades the experience noticeably. A moderate violation (-0.5) is a visible defect that doesn't block use. Minor violations (-0.25) are polish issues that most users won't notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gold Standards: The Ground Truth Problem
&lt;/h2&gt;

&lt;p&gt;Every evaluation needs ground truth. Ours comes from 21 hand-verified reference templates covering landing pages, SaaS products, corporate sites, event pages, safari lodges, training portals, and more.&lt;/p&gt;

&lt;p&gt;Each gold standard includes three stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gold-standards/
  template-ai-page-builder/
    requirements.md              # Business customisation spec
    stage-1-customise-template/  # Skeleton with spec applied
    stage-2-site-generation/     # Optimised and validated
    stage-3-deployment/          # Deploy config and manifest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;requirements.md&lt;/code&gt; file defines every customisation the agent must apply — brand colours, typography, logo paths, hero text, about section copy, contact details, layout patterns, SEO requirements. Here's a real excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Brand Amendments&lt;/span&gt;

&lt;span class="gu"&gt;### Colours&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Primary**&lt;/span&gt;: #B85C38
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Secondary**&lt;/span&gt;: #5C3D2E
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Accent**&lt;/span&gt;: #E8D5B7

&lt;span class="gu"&gt;### Typography&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Heading Font**&lt;/span&gt;: Fraunces
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Body Font**&lt;/span&gt;: Lato

&lt;span class="gu"&gt;## Content Amendments&lt;/span&gt;

&lt;span class="gu"&gt;### Hero Section&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Headline**&lt;/span&gt;: Seasonal Menus. Local Ingredients.
              Unforgettable Meals.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**CTA Button**&lt;/span&gt;: View Our Menu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These references are git-committed, versioned, and human-reviewed. They're not generated — they're hand-built by applying the requirements to each template and verifying every change visually. This matters because the judge compares agent output against these references. If the ground truth is wrong, every score is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evaluation Pipeline
&lt;/h2&gt;

&lt;p&gt;Putting all four layers together, the orchestrator runs a 16-action pipeline per model per template:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy the template skeleton to the run directory (baseline)&lt;/li&gt;
&lt;li&gt;Screenshot the baseline&lt;/li&gt;
&lt;li&gt;For each of the 16 actions:

&lt;ul&gt;
&lt;li&gt;Send the action instruction to the model&lt;/li&gt;
&lt;li&gt;Write modified files to the action directory&lt;/li&gt;
&lt;li&gt;Run the HTMLVisualChecker (Layer 3)&lt;/li&gt;
&lt;li&gt;Run the Opus judge against the gold standard (Layer 4)&lt;/li&gt;
&lt;li&gt;Record the ActionScore (10 minus deductions)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Aggregate all 16 action scores into a template score (max 160)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The 16 actions cover six categories: brand (colours, fonts, logos, favicon), images (hero background, section backgrounds), content (hero text, about text, contact info), layout (hero layout, sections layout), technical (SEO meta, structured data, accessibility), and quality (contrast verification).&lt;/p&gt;

&lt;p&gt;Actions are sequential — each builds on the previous output. This is deliberate. Real agent workflows apply changes incrementally. A colour change affects subsequent image overlay decisions. A font change affects layout spacing. Sequential evaluation captures the compounding effect of errors, which is exactly what happens in production.&lt;/p&gt;
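&lt;p&gt;The sequential loop can be sketched as follows; &lt;code&gt;run_action&lt;/code&gt;, &lt;code&gt;visual_check&lt;/code&gt; and &lt;code&gt;judge_against_gold&lt;/code&gt; are stand-ins for the model call, Layer 3 and Layer 4, and violations are (id, deduction) pairs:&lt;/p&gt;

```python
# Minimal sketch of the orchestrator loop; the callable names are
# illustrative, not the actual pipeline API.
def evaluate_template(actions, baseline_files, gold_standard,
                      run_action, visual_check, judge_against_gold):
    files = dict(baseline_files)  # working copy; each action mutates it
    scores = []
    for action in actions:
        files = run_action(action, files)   # model applies one change
        violations = visual_check(files)    # Layer 3: static visual checks
        violations += judge_against_gold(files, gold_standard)  # Layer 4
        scores.append(10 + sum(d for _, d in violations))
    return sum(scores)  # template score; max 10 * len(actions), i.e. 160 for 16
```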

&lt;h2&gt;
  
  
  What This Framework Revealed
&lt;/h2&gt;

&lt;p&gt;Over six days, this pipeline processed 467 actions across five models and six templates. The results were clear in some places and surprising in others.&lt;/p&gt;

&lt;p&gt;What was clear: structured, well-defined tasks (SEO meta tags, accessibility attributes) score consistently high across all models (8.7-9.4/10 average). These are token-native tasks — key-value pairs and attribute additions that align with how language models process text.&lt;/p&gt;

&lt;p&gt;What was surprising: layout transformation — applying CSS grid or flexbox changes to restructure page sections — scored negative on average. Every model, including the best one, made pages worse when asked to transform layouts. This isn't a prompt engineering problem. It's a spatial reasoning gap in current language model architectures.&lt;/p&gt;

&lt;p&gt;What was most useful: the violation data drove targeted improvements. Instead of vaguely knowing "the agent sometimes produces bad output," I now know that 60% of font management failures come from a single issue (updating CSS &lt;code&gt;font-family&lt;/code&gt; but not the Google Fonts &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; tag), and that 30% of content failures are verbatim violations (paraphrasing instead of using exact text). These specific failure patterns led to 1,191 lines of skill improvements across six production modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applicability Beyond Websites
&lt;/h2&gt;

&lt;p&gt;The framework's architecture — structural checks, content checks, visual checks, LLM judge, violation-deduction scoring — isn't website-specific. Any AI system that generates multi-file artifacts can be evaluated this way.&lt;/p&gt;

&lt;p&gt;Document generation (reports, presentations, proposals) has the same inter-file dependency problem. Infrastructure-as-code (Terraform modules, CloudFormation templates) has structural requirements and validation rules. Even multi-file code generation (microservice scaffolding, API implementations) benefits from checking whether all the files work together, not just whether each file compiles.&lt;/p&gt;

&lt;p&gt;The key insight: evaluating AI-generated artifacts requires evaluating the artifact as a whole, not its parts in isolation. A syntactically valid CSS file paired with an HTML file that references different class names is a broken website. The evaluation framework must understand that relationship.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part 1 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://tebogo.cloud/blog/beyond-text-evaluating-multi-file-ai-outputs" rel="noopener noreferrer"&gt;tebogo.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>testing</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How I Create Memory for My Agents on Claude Code</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Tue, 03 Mar 2026 18:33:32 +0000</pubDate>
      <link>https://dev.to/tsekatm/how-i-create-memory-for-my-agents-on-claude-code-mdn</link>
      <guid>https://dev.to/tsekatm/how-i-create-memory-for-my-agents-on-claude-code-mdn</guid>
      <description>&lt;h2&gt;
  
  
  How I Create Memory for My Agents on Claude Code
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;March 3, 2026&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;AI agents forget everything. Every new session starts from zero — no context about your project, no memory of architectural decisions, no knowledge of your coding standards. You end up repeating yourself constantly.&lt;/p&gt;

&lt;p&gt;I run 14 specialised agents across multiple AWS projects — an HLD Architect, a DevOps Engineer, an SDET, a Defect Manager, a Technical Content Engineer, and more. Each one needs to understand the codebase, follow specific rules, and build on work from previous sessions.&lt;/p&gt;

&lt;p&gt;Repeating context every session is not an option. So I built a multi-layered memory architecture in Claude Code that gives my agents persistent knowledge, specialised expertise, and consistent behaviour across every conversation.&lt;/p&gt;

&lt;p&gt;Here is exactly how I do it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Six Layers of Memory
&lt;/h2&gt;

&lt;p&gt;My agent memory system has six layers, each solving a different problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────┐
│  Layer 6: Permissions (settings.local.json)  │  What the agent CAN do
├──────────────────────────────────────────────┤
│  Layer 5: Plans (.claude/plans/*.md)         │  What the agent IS doing
├──────────────────────────────────────────────┤
│  Layer 4: Auto Memory (memory/MEMORY.md)     │  What the agent HAS learned
├──────────────────────────────────────────────┤
│  Layer 3: Skills (*.skill.md)                │  HOW to do specific things
├──────────────────────────────────────────────┤
│  Layer 2: Agent Personas (*_Agent.md)        │  WHO the agent is
├──────────────────────────────────────────────┤
│  Layer 1: CLAUDE.md (project instructions)   │  The rules everyone follows
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer is just markdown files. No databases, no APIs, no infrastructure — just files that Claude Code loads automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: CLAUDE.md — The Constitution
&lt;/h2&gt;

&lt;p&gt;Every project has a &lt;code&gt;CLAUDE.md&lt;/code&gt; file at its root. Claude Code reads this file automatically at the start of every session. It is the single most important file in my entire setup.&lt;/p&gt;

&lt;p&gt;My root &lt;code&gt;CLAUDE.md&lt;/code&gt; sits at the workspace level and defines global rules that every agent must follow — what I call the &lt;strong&gt;TBT Law&lt;/strong&gt; (Think Before Typing):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## TBT Law (Inviolable)&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Be patient — 80% planning, 20% implementation
&lt;span class="p"&gt;2.&lt;/span&gt; Do not be overeager — never try to impress by doing unrequested work
&lt;span class="p"&gt;3.&lt;/span&gt; Always seek approval before implementing any plan
&lt;span class="p"&gt;4.&lt;/span&gt; Never make changes without a plan — plan first, always
&lt;span class="p"&gt;5.&lt;/span&gt; Do not rush the user — be patient, wait for direction
&lt;span class="p"&gt;6.&lt;/span&gt; Do not make decisions or assumptions on the user's behalf
&lt;span class="p"&gt;7.&lt;/span&gt; If unsure, ask — never guess or assume
&lt;span class="p"&gt;8.&lt;/span&gt; If the plan isn't working, STOP — no workarounds
&lt;span class="p"&gt;9.&lt;/span&gt; Rushing and over-eager changes will break code or design
&lt;span class="p"&gt;10.&lt;/span&gt; If rules are violated, admit openly — do not hide mistakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These ten rules prevent the most common failure mode with AI agents: doing too much, too fast, without thinking. Every agent, regardless of persona, follows these rules.&lt;/p&gt;

&lt;p&gt;Below the TBT Law, the root CLAUDE.md defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory SDET Verification&lt;/strong&gt; — every plan must be tested after execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defect Management&lt;/strong&gt; — every bug gets logged, reproduced, fixed, and verified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment-First Verification&lt;/strong&gt; — no fix is considered testable until deployed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Isolation&lt;/strong&gt; — every service gets its own repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Resource Naming Conventions&lt;/strong&gt; — DynamoDB tables use plain names, S3 buckets include environment suffixes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project-Specific CLAUDE.md Files
&lt;/h3&gt;

&lt;p&gt;Each project directory has its own CLAUDE.md that inherits from the root and adds project-specific context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# my-saas-landing - Project Instructions&lt;/span&gt;

&lt;span class="gu"&gt;## Project Overview&lt;/span&gt;
&lt;span class="gs"&gt;**Repository**&lt;/span&gt;: my-saas-landing
&lt;span class="gs"&gt;**Purpose**&lt;/span&gt;: Marketing landing page - Single-page scroll site
&lt;span class="gs"&gt;**Stack**&lt;/span&gt;: React 18 + TypeScript + Vite

&lt;span class="gu"&gt;## Cross-App Navigation&lt;/span&gt;
| Action                  | Target URL                    |
|-------------------------|-------------------------------|
| "Start Free Trial"      | /app/onboarding               |
| "Buy" pricing button    | /checkout?planId={id}         |

&lt;span class="gu"&gt;## S3 Deployment&lt;/span&gt;
Landing page files deploy to the root of my-web-public S3 bucket...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the agent immediately knows what the project is, what stack it uses, how it deploys, and how it connects to other services — before I type a single word.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Agent Personas — Specialised Identities
&lt;/h2&gt;

&lt;p&gt;I have 14 agent persona files, each defined as a markdown document. When I need a specific type of expertise, I load the corresponding persona.&lt;/p&gt;

&lt;p&gt;Each persona file follows a consistent structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# DevOps Engineer Agent&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
You are a Senior DevOps Engineer specialising in AWS infrastructure...

&lt;span class="gu"&gt;## Core Competencies&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; CI/CD pipeline design (GitHub Actions)
&lt;span class="p"&gt;-&lt;/span&gt; Infrastructure as Code (Terraform)
&lt;span class="p"&gt;-&lt;/span&gt; Container orchestration (ECS, ECR)
&lt;span class="p"&gt;-&lt;/span&gt; CloudFront distribution management

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Assess current infrastructure state
&lt;span class="p"&gt;2.&lt;/span&gt; Propose changes with risk assessment
&lt;span class="p"&gt;3.&lt;/span&gt; Implement with rollback plan
&lt;span class="p"&gt;4.&lt;/span&gt; Verify deployment
&lt;span class="p"&gt;5.&lt;/span&gt; Document changes

&lt;span class="gu"&gt;## Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Never modify production without approval
&lt;span class="p"&gt;-&lt;/span&gt; Always use Terraform for infrastructure changes
&lt;span class="p"&gt;-&lt;/span&gt; Follow the AWS Well-Architected Framework
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is that &lt;strong&gt;personas are not prompts&lt;/strong&gt; — they are persistent identity files that the agent loads and embodies for the entire session. The DevOps Engineer thinks differently from the SDET, who thinks differently from the HLD Architect. They have different priorities, different vocabularies, and different workflows.&lt;/p&gt;

&lt;p&gt;My current roster:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HLD Architect&lt;/td&gt;
&lt;td&gt;High-level design documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLD Architect&lt;/td&gt;
&lt;td&gt;Low-level design documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DevOps Engineer&lt;/td&gt;
&lt;td&gt;CI/CD, infrastructure, deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDET&lt;/td&gt;
&lt;td&gt;Automated testing, defect tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Defect Manager&lt;/td&gt;
&lt;td&gt;Bug lifecycle with issue tracker integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GenAI Engineer&lt;/td&gt;
&lt;td&gt;Bedrock, LLMs, RAG solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Security Specialist&lt;/td&gt;
&lt;td&gt;IAM, GuardDuty, compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Content Engineer&lt;/td&gt;
&lt;td&gt;Blog posts, whitepapers, tutorials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Manager&lt;/td&gt;
&lt;td&gt;Task orchestration, TBT workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer Review Architect&lt;/td&gt;
&lt;td&gt;Design review, anti-pattern detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Business Developer&lt;/td&gt;
&lt;td&gt;Market analysis, pricing models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python AWS Developer&lt;/td&gt;
&lt;td&gt;Lambda, DynamoDB, Step Functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java AWS Developer&lt;/td&gt;
&lt;td&gt;Spring Boot, ECS services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Template Manager&lt;/td&gt;
&lt;td&gt;Template lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When I say "load the DevOps Engineer persona", the agent reads the file and adopts that identity — including its specific workflow, constraints, and communication style.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Skills — Reusable Knowledge Modules
&lt;/h2&gt;

&lt;p&gt;Skills are the most underrated layer. They are standalone knowledge files (&lt;code&gt;.skill.md&lt;/code&gt;) that any persona can reference. Think of them as shared libraries for agent knowledge.&lt;/p&gt;

&lt;p&gt;Examples from my setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;DynamoDB_Single_Table.skill.md&lt;/code&gt; — Single-table design patterns, GSI strategies, access patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HATEOAS_Relational_Design.skill.md&lt;/code&gt; — API design with hypermedia links&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Development_Best_Practices.skill.md&lt;/code&gt; — SOLID, TDD, BDD, DDD principles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Monolith_Anti_Pattern_Validation.skill.md&lt;/code&gt; — Six anti-patterns (AP-1 through AP-6) to detect&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Step_Functions_Decision_Logic.skill.md&lt;/code&gt; — State machine patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;API_Proxy_Testing.skill.md&lt;/code&gt; — End-to-end testing patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A skill file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# DynamoDB Single Table Design&lt;/span&gt;

&lt;span class="gu"&gt;## When to Apply&lt;/span&gt;
Apply when a service has 3+ entity types with relational access patterns.

&lt;span class="gu"&gt;## Partition Key Strategy&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use composite keys: {ENTITY_TYPE}#{ENTITY_ID}
&lt;span class="p"&gt;-&lt;/span&gt; GSI1PK for inverted lookups
&lt;span class="p"&gt;-&lt;/span&gt; GSI2PK for cross-entity queries

&lt;span class="gu"&gt;## Access Patterns&lt;/span&gt;
| Pattern | PK | SK | Index |
|---------|----|----|-------|
| Get user by ID | USER#123 | METADATA | Table |
| Get user's sites | USER#123 | SITE# | Table |
| Get site by domain | DOMAIN#example.com | METADATA | GSI1 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The power of skills is &lt;strong&gt;composition&lt;/strong&gt;. When the LLD Architect is designing a new service, it can reference the DynamoDB skill, the HATEOAS skill, and the Development Best Practices skill simultaneously. When the SDET is writing tests, it pulls from the API Proxy Testing skill. The knowledge is defined once and reused across every persona.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Auto Memory — Learning Across Sessions
&lt;/h2&gt;

&lt;p&gt;Claude Code has a built-in auto memory feature. It stores persistent notes in a &lt;code&gt;memory/&lt;/code&gt; directory within each project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/projects/{project-path}/memory/
├── MEMORY.md          # Always loaded (first 200 lines)
├── debugging.md       # Detailed debugging notes
├── patterns.md        # Confirmed patterns
└── architecture.md    # Architectural decisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;MEMORY.md&lt;/code&gt; file is special — Claude Code loads the first 200 lines of it into every conversation automatically. This is where the agent stores things it has learned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Confirmed Patterns&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; CloudFront Function handles SPA routing for all frontends
&lt;span class="p"&gt;-&lt;/span&gt; S3 bucket serves all frontend apps from different prefixes
&lt;span class="p"&gt;-&lt;/span&gt; Safe sync requires --exclude flags for other app prefixes
&lt;span class="p"&gt;-&lt;/span&gt; Browser cache causes stale content after deployments (hard refresh needed)

&lt;span class="gu"&gt;## AWS SSO&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Profile name: dev
&lt;span class="p"&gt;-&lt;/span&gt; Token expires frequently — run &lt;span class="sb"&gt;`aws sso login --profile dev`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I configure the agent to save memories with clear rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Save&lt;/strong&gt;: Stable patterns confirmed across multiple sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save&lt;/strong&gt;: Key architectural decisions and important file paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save&lt;/strong&gt;: Solutions to recurring problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't save&lt;/strong&gt;: Session-specific context or temporary state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't save&lt;/strong&gt;: Speculative conclusions from reading a single file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is that the agent gets smarter over time. The first time it encounters the CloudFront routing behaviour, it investigates. The second time, it already knows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 5: Plans — Persistent Iteration
&lt;/h2&gt;

&lt;p&gt;Plans bridge the gap between sessions. When a task is too large for one conversation, the agent writes a plan file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/plans/
├── zazzy-puzzling-cloud.md       # Frontend extraction plan
├── elegant-crunching-sunbeam.md  # Security hardening rollout
└── zazzy-percolating-lecun.md    # CDN deployment plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A plan follows a consistent structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Plan: Extract Landing Page into Standalone Repo&lt;/span&gt;

&lt;span class="gu"&gt;## Context&lt;/span&gt;
The landing page was prototyped inside the main app...

&lt;span class="gu"&gt;## Step 1: Scaffold New Repo&lt;/span&gt;
Create directory structure at /path/to/new/repo...

&lt;span class="gu"&gt;## Step 2: Create Fresh Files&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; vite.config.ts — base: '/'
&lt;span class="p"&gt;-&lt;/span&gt; App.tsx — no router, single-page scroll

&lt;span class="gu"&gt;## Step 3: Modify Copied Files&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Navigation.tsx — remove router dependency
&lt;span class="p"&gt;-&lt;/span&gt; PricingPage.tsx — use window.location.href

&lt;span class="gu"&gt;## Verification&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; npm run dev → all sections render
&lt;span class="p"&gt;2.&lt;/span&gt; npm run type-check → 0 errors
&lt;span class="p"&gt;3.&lt;/span&gt; Images and assets load correctly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new session starts and the plan file exists, Claude Code includes a reminder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"A plan file exists from plan mode. If this plan is relevant to the current work and not already complete, continue working on it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means the agent picks up exactly where it left off — no re-explanation needed.&lt;/p&gt;
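&lt;p&gt;What makes this robust is how little state the mechanism needs: the plans are just filenames on disk. A sketch of the resume check, assuming nothing beyond the directory layout shown above:&lt;/p&gt;

```python
from pathlib import Path

def pending_plans(plans_dir):
    """List plan files in the plans directory so a new session can surface
    unfinished work. Claude Code injects its own reminder automatically;
    this sketch only illustrates the idea."""
    return sorted(p.name for p in Path(plans_dir).glob("*.md"))
```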




&lt;h2&gt;
  
  
  Layer 6: Permissions — Trust Boundaries
&lt;/h2&gt;

&lt;p&gt;The final layer controls what each agent can actually do. Claude Code uses &lt;code&gt;settings.local.json&lt;/code&gt; to define allowed operations per project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"permissions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(git add *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(git commit *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(aws s3 sync *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(aws cloudfront create-invalidation *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(terraform plan *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(pytest *)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Bash(npm run build *)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My permissions file is 276 lines long. It covers Git operations, AWS CLI commands (IAM, S3, Lambda, DynamoDB, CloudFront, Route53), Terraform, Python tooling, and testing frameworks.&lt;/p&gt;

&lt;p&gt;This is critical for the TBT Law. The agent can run tests and deploy to dev, but it cannot force-push to main or destroy production infrastructure without explicit approval.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It All Comes Together
&lt;/h2&gt;

&lt;p&gt;Here is a real workflow. I need to deploy a bug fix to a frontend app.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;I open the project.&lt;/strong&gt; Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; (Layer 1) — the agent knows the stack, deployment targets, and global rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;I say "load the DevOps Engineer."&lt;/strong&gt; The agent reads the persona file (Layer 2) — it now thinks like a DevOps engineer with CI/CD expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent references existing knowledge.&lt;/strong&gt; It checks auto memory (Layer 4) for deployment patterns — it already knows the S3 bucket name, CloudFront distribution ID, and safe sync exclusions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It creates a plan.&lt;/strong&gt; The plan (Layer 5) outlines: build, sync to S3, invalidate CloudFront, verify. Per TBT Law, it waits for my approval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;I approve.&lt;/strong&gt; The agent executes within its permissions (Layer 6) — it can run &lt;code&gt;npm run build&lt;/code&gt; and &lt;code&gt;aws s3 sync&lt;/code&gt;, but it asks before running destructive commands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SDET verification triggers.&lt;/strong&gt; Per the CLAUDE.md mandatory rule, the SDET persona activates to verify the deployment — checking asset integrity, page load, and console errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent saves what it learned.&lt;/strong&gt; If it encountered a new pattern (like a CloudFront cache behaviour), it writes it to auto memory for next time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Six layers, all markdown files, zero infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with CLAUDE.md.&lt;/strong&gt; You do not need all six layers on day one. A well-written CLAUDE.md with your project context and coding standards gives you 80% of the value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write personas for recurring roles.&lt;/strong&gt; If you find yourself repeatedly explaining "you are a DevOps engineer who follows these patterns", extract it into a persona file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep skills atomic.&lt;/strong&gt; One skill, one topic. A DynamoDB skill should not also contain API design patterns. Composability comes from keeping them separate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Curate auto memory.&lt;/strong&gt; Review what the agent saves. Remove outdated entries. The memory file is limited to 200 lines — keep it focused on patterns that are genuinely stable.&lt;/p&gt;
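&lt;p&gt;A small helper makes the curation tip concrete: because only the first 200 lines reach the agent, it is worth knowing when entries are being silently dropped. A minimal sketch:&lt;/p&gt;

```python
from pathlib import Path

LOADED_LINES = 200  # Claude Code auto-loads only the first 200 lines of MEMORY.md

def memory_budget(memory_file):
    """Report how much of MEMORY.md actually reaches the agent, so you know
    when curation is overdue."""
    lines = Path(memory_file).read_text().splitlines()
    return {"total": len(lines),
            "loaded": min(len(lines), LOADED_LINES),
            "silently_dropped": max(0, len(lines) - LOADED_LINES)}
```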

&lt;p&gt;&lt;strong&gt;Use plans for multi-session work.&lt;/strong&gt; If a task will take more than one conversation, write a plan. The overhead of creating the plan pays for itself when you do not have to re-explain the context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set permissions deliberately.&lt;/strong&gt; Start restrictive and expand. It is easier to grant new permissions than to recover from an agent that deleted your production database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI agents do not need to forget. The tools already exist in Claude Code — CLAUDE.md files, auto memory, plan persistence, and permission controls. What they need is architecture.&lt;/p&gt;

&lt;p&gt;By structuring memory into six layers — rules, personas, skills, learning, plans, and permissions — I have agents that understand my projects, follow my standards, learn from past sessions, and operate within clear boundaries.&lt;/p&gt;

&lt;p&gt;Every layer is a markdown file. Every file is version-controlled. The entire system is transparent, auditable, and easy to iterate on.&lt;/p&gt;

&lt;p&gt;The best part? The agents get better every week. Not because the model improved, but because the memory did.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code Documentation — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/memory" rel="noopener noreferrer"&gt;CLAUDE.md Best Practices — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP) — Specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;Claude Code CLI — GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>From IDE to Cloud: Lifting Your Local Agent into an MCP Server on Amazon Bedrock AgentCore</title>
      <dc:creator>Tebogo Tseka</dc:creator>
      <pubDate>Tue, 03 Mar 2026 15:08:24 +0000</pubDate>
      <link>https://dev.to/tsekatm/from-ide-to-cloud-lifting-your-local-agent-into-an-mcp-server-on-amazon-bedrock-agentcore-3icp</link>
      <guid>https://dev.to/tsekatm/from-ide-to-cloud-lifting-your-local-agent-into-an-mcp-server-on-amazon-bedrock-agentcore-3icp</guid>
      <description>&lt;p&gt;You have built an AI agent that works beautifully on your laptop. It calls tools, reasons through problems, and returns exactly the answer your users need. There is just one problem: it lives on &lt;code&gt;localhost&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Moving from a local prototype to a production-grade, multi-tenant cloud service usually means weeks of infrastructure work — containers, load balancers, session isolation, authentication, observability. &lt;strong&gt;Amazon Bedrock AgentCore Runtime&lt;/strong&gt; collapses that effort into a handful of commands while the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; gives your agent a standard interface that any MCP-compatible client can discover and invoke.&lt;/p&gt;

&lt;p&gt;In this post you will take a Python agent running in your IDE, transform it into an MCP server, and deploy it to AgentCore Runtime — with working code at every step.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Amazon Bedrock AgentCore Runtime?
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock AgentCore Runtime is a serverless hosting environment purpose-built for AI agents. It provides several capabilities that are hard to replicate on your own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Framework-agnostic&lt;/strong&gt; — Works with Strands Agents, LangGraph, CrewAI, or any custom Python agent. You are not locked into a single orchestration framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model flexibility&lt;/strong&gt; — Use any LLM — Amazon Bedrock models, Anthropic Claude, Google Gemini, or OpenAI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session isolation&lt;/strong&gt; — Each user session runs in a dedicated microVM with isolated CPU, memory, and filesystem. When the session ends, the microVM is terminated and memory is sanitised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol support&lt;/strong&gt; — Native support for Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended execution&lt;/strong&gt; — Synchronous requests get a 15-minute timeout; asynchronous sessions can run for up to 8 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumption-based pricing&lt;/strong&gt; — You pay only for the compute your agent actually uses, not for idle time waiting on LLM responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, AgentCore Runtime handles the infrastructure so you can focus on the agent logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is MCP and Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is an open standard that defines how AI agents discover and invoke tools over HTTP. Think of it as a contract: an MCP server exposes tools with typed inputs and outputs, and any MCP client can discover those tools at runtime and call them without custom integration code.&lt;/p&gt;

&lt;p&gt;Key characteristics of MCP on AgentCore Runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stateless streamable-HTTP&lt;/strong&gt; — AgentCore requires stateless servers. The platform automatically injects a &lt;code&gt;Mcp-Session-Id&lt;/code&gt; header for session continuity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool discovery&lt;/strong&gt; — Clients call &lt;code&gt;list_tools()&lt;/code&gt; to discover every tool the server exposes, with full JSON Schema descriptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard path&lt;/strong&gt; — The server listens on port 8000 at the &lt;code&gt;/mcp&lt;/code&gt; path (&lt;code&gt;0.0.0.0:8000/mcp&lt;/code&gt;), the default endpoint most MCP SDKs expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interoperability&lt;/strong&gt; — Any MCP client — Claude Code, Cursor, Kiro, Amazon Q CLI — can connect to your deployed server with zero custom wiring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing MCP, your agent becomes a reusable building block that other agents and developer tools can compose into larger systems.&lt;/p&gt;
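&lt;p&gt;To make the contract concrete, here is a toy sketch of the two methods every MCP client relies on: &lt;code&gt;tools/list&lt;/code&gt; for discovery and &lt;code&gt;tools/call&lt;/code&gt; for invocation. This is not the real SDK (the tool name and handler are invented for illustration); a production server should use an MCP SDK, which also handles the streamable-HTTP transport and session headers:&lt;/p&gt;

```python
# A toy sketch of the MCP tool contract -- not the real SDK, just the shape
# of discovery and invocation. The get_weather tool is hypothetical.
TOOLS = {
    "get_weather": {
        "description": "Return a canned weather string for a city.",
        "inputSchema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "handler": lambda args: f"Sunny in {args['city']}",
    },
}

def handle(request):
    """Dispatch the two core MCP methods a client depends on."""
    if request["method"] == "tools/list":
        # Discovery: every tool, with its full JSON Schema description.
        return [{"name": name,
                 "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in TOOLS.items()]
    if request["method"] == "tools/call":
        # Invocation: look up the named tool and pass the typed arguments.
        tool = TOOLS[request["params"]["name"]]
        return tool["handler"](request["params"]["arguments"])
    raise ValueError("unknown method: " + request["method"])
```

&lt;p&gt;Because the schema travels with the tool, a client that has never seen this server can still construct a valid call — that is the interoperability the protocol buys you.&lt;/p&gt;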




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The flow has four stages: &lt;strong&gt;build locally&lt;/strong&gt; and test on &lt;code&gt;localhost&lt;/code&gt;, &lt;strong&gt;transform&lt;/strong&gt; for AgentCore compatibility, &lt;strong&gt;deploy&lt;/strong&gt; to AWS via the AgentCore CLI, and &lt;strong&gt;invoke&lt;/strong&gt; from any MCP client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; LOCAL IDE                    TRANSFORMATION                AGENTCORE RUNTIME                 INVOCATION
 ─────────                    ──────────────                ─────────────────                 ──────────

 ┌─────────────────────┐      ┌──────────────────────┐      ┌──────────────────────────┐      ┌─────────────────────┐
 │                     │      │                      │      │                          │      │  Claude Code /      │
 │  MCP Server Code    │      │  Install AgentCore   │      │  agentcore configure     │      │  Cursor / Kiro      │
 │  my_mcp_server.py   │      │  MCP Server in IDE   │      │  --protocol MCP          │      │         │           │
 │         │           │      │         │            │      │         │                │      │         ▼           │
 │         ▼           │      │         ▼            │      │         ▼                │      │  ┌───────────────┐  │
 │  Local Test         │─────▶│  Transform Agent     │─────▶│  agentcore launch        │      │  │ Agent Runtime │  │
 │  localhost:8000/mcp │      │  + BedrockAgentCore  │      │  Build + ECR + Deploy    │      │  │     ARN       │◀─┤
 │                     │      │  App wrapper         │      │         │                │      │  │  MicroVM      │  │
 └─────────────────────┘      └──────────────────────┘      │         ▼                │      │  │  Isolation    │  │
                                                            │  ┌────────────────────┐  │      │  └───────────────┘  │
                                                            │  │  Agent Runtime ARN │  │      │         ▲           │
                                                            │  │  MicroVM Isolation │──┼─────▶│  Remote MCP Client  │
                                                            │  │  Session Mgmt      │  │      │  Python Script      │
                                                            │  └────────────────────┘  │      │         ▲           │
                                                            │                          │      │  MCP Inspector      │
                                                            └──────────────────────────┘      └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;AWS account&lt;/strong&gt; with Amazon Bedrock AgentCore permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS CLI&lt;/strong&gt; installed and configured with appropriate credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; installed (3.13 recommended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uv&lt;/strong&gt; package manager installed (optional but recommended)&lt;/li&gt;
&lt;li&gt;An MCP client: &lt;strong&gt;Claude Code&lt;/strong&gt;, &lt;strong&gt;Cursor&lt;/strong&gt;, &lt;strong&gt;Kiro&lt;/strong&gt;, or &lt;strong&gt;Amazon Q CLI&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install the core packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp
pip &lt;span class="nb"&gt;install &lt;/span&gt;bedrock-agentcore
pip &lt;span class="nb"&gt;install &lt;/span&gt;bedrock-agentcore-starter-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
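
&lt;p&gt;A quick way to confirm the packages resolved correctly (the module names here are assumptions inferred from the pip package names above):&lt;/p&gt;

```python
# Check that each installed package is importable, without importing it.
# Module names are assumptions inferred from the pip package names.
import importlib.util

statuses = {
    module: importlib.util.find_spec(module) is not None
    for module in ("mcp", "bedrock_agentcore", "bedrock_agentcore_starter_toolkit")
}
for module, found in statuses.items():
    print(f"{module}: {'installed' if found else 'MISSING'}")
```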






&lt;h2&gt;
  
  
  Step 1: Build Your Local MCP Server
&lt;/h2&gt;

&lt;p&gt;Start by creating a simple MCP server with a few tools. This is the agent you will later lift into AgentCore Runtime.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code&gt;my_mcp_server.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# my_mcp_server.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;

&lt;span class="c1"&gt;# Create the MCP server instance
# host must be 0.0.0.0 for AgentCore compatibility
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stateless_http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarise_architecture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Summarise the high-level architecture of an AWS service.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; architecture typically includes &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a control plane for management operations and a data plane &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;for runtime request handling, with IAM for access control.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_monthly_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests_per_month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_duration_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Estimate monthly cost for a serverless AWS service.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cost_per_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0000002&lt;/span&gt;
    &lt;span class="n"&gt;cost_per_gb_second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0000166667&lt;/span&gt;
    &lt;span class="n"&gt;memory_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="n"&gt;duration_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;avg_duration_ms&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="n"&gt;compute_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests_per_month&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;duration_seconds&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;memory_gb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cost_per_gb_second&lt;/span&gt;
    &lt;span class="n"&gt;request_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests_per_month&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cost_per_request&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;request_cost&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated monthly cost for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_iam_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resource_arn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a least-privilege IAM policy document.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Statement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Effect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Allow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resource_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamable-http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This server exposes three tools: an architecture summariser, a cost estimator, and an IAM policy generator. The key details for AgentCore compatibility are &lt;code&gt;host="0.0.0.0"&lt;/code&gt; and &lt;code&gt;stateless_http=True&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Locally
&lt;/h3&gt;

&lt;p&gt;Start the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python my_mcp_server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server starts on port 8000. From a separate terminal, run a test client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# my_mcp_client.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.streamable_http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamablehttp_client&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;streamablehttp_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;mcp_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminate_on_close&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c1"&gt;# Discover tools
&lt;/span&gt;            &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available tools:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Invoke a tool
&lt;/span&gt;            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimate_monthly_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS Lambda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests_per_month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your three tools listed and a cost estimate returned.&lt;/p&gt;
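&lt;p&gt;You can also sanity-check the estimator's arithmetic by hand. For the inputs in the test client (1,000,000 requests at 200 ms on 0.5 GB), the same formula gives roughly $1.87:&lt;/p&gt;

```python
# Reproduce estimate_monthly_cost's arithmetic for the test client's inputs.
requests_per_month = 1_000_000
avg_duration_ms = 200
memory_gb = 0.5
cost_per_request = 0.0000002       # per-request charge
cost_per_gb_second = 0.0000166667  # compute charge per GB-second

compute_cost = requests_per_month * (avg_duration_ms / 1000) * memory_gb * cost_per_gb_second
request_cost = requests_per_month * cost_per_request
total = compute_cost + request_cost
print(f"Estimated monthly cost: ${total:,.2f}")  # ≈ $1.87
```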




&lt;h2&gt;
  
  
  Step 2: Install the AgentCore MCP Server in Your IDE
&lt;/h2&gt;

&lt;p&gt;AWS provides an MCP server specifically for AgentCore development. This server runs inside your IDE's MCP client and guides the transformation, deployment, and testing workflow conversationally.&lt;/p&gt;

&lt;p&gt;Add the following to your MCP client configuration:&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code (~/.claude/mcp.json)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bedrock-agentcore-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"awslabs.amazon-bedrock-agentcore-mcp-server@latest"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FASTMCP_LOG_LEVEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"disabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autoApprove"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"search_agentcore_docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"fetch_agentcore_doc"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cursor (.cursor/mcp.json)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bedrock-agentcore-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"awslabs.amazon-bedrock-agentcore-mcp-server@latest"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FASTMCP_LOG_LEVEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"disabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autoApprove"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"search_agentcore_docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"fetch_agentcore_doc"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your MCP client after adding the configuration. Verify by checking that &lt;code&gt;search_agentcore_docs&lt;/code&gt; and &lt;code&gt;fetch_agentcore_doc&lt;/code&gt; tools appear in your tool list.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Transform Your Agent for AgentCore
&lt;/h2&gt;

&lt;p&gt;If you are deploying an &lt;strong&gt;MCP server&lt;/strong&gt; (not a general agent), the transformation is minimal. Your FastMCP server already meets the protocol contract — it listens on &lt;code&gt;0.0.0.0:8000/mcp&lt;/code&gt; with stateless streamable-HTTP transport.&lt;/p&gt;

&lt;p&gt;However, if you are deploying a &lt;strong&gt;general agent&lt;/strong&gt; (not an MCP server), you need to wrap it with the AgentCore SDK:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Add the AgentCore import
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Initialise the application
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Decorate your entrypoint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Add the runner
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update your &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp
bedrock-agentcore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Deploy to AgentCore Runtime
&lt;/h2&gt;

&lt;p&gt;Deployment uses the AgentCore CLI from the starter toolkit. Two commands are all you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure the deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore configure &lt;span class="nt"&gt;-e&lt;/span&gt; my_mcp_server.py &lt;span class="nt"&gt;--protocol&lt;/span&gt; MCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI walks you through a guided prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution role&lt;/strong&gt; — Provide an IAM role ARN with AgentCore Runtime permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECR repository&lt;/strong&gt; — Press Enter to auto-create one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency file&lt;/strong&gt; — Auto-detected from the current directory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth&lt;/strong&gt; — Type &lt;code&gt;yes&lt;/code&gt; if you want authentication, then provide your Cognito discovery URL and client ID&lt;/li&gt;
&lt;/ul&gt;
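
&lt;p&gt;If you need to create the execution role yourself, its trust policy must allow the AgentCore service to assume it. The sketch below is illustrative; verify the service principal and the role's permission policy against the current AgentCore documentation:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock-agentcore.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```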

&lt;h3&gt;
  
  
  Launch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore launch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, this command:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Builds an ARM64 Docker container with your server code and dependencies&lt;/li&gt;
&lt;li&gt;Pushes the container image to Amazon ECR&lt;/li&gt;
&lt;li&gt;Creates an AgentCore Runtime resource&lt;/li&gt;
&lt;li&gt;Deploys your MCP server into an isolated microVM environment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On success, you receive an Agent Runtime ARN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my_mcp_server-abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this ARN — you need it to invoke your server.&lt;/p&gt;
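&lt;p&gt;The ARN has to be URL-encoded before it can sit in the runtime's invocation URL as a single path segment. The client script later in this article does this with two &lt;code&gt;.replace()&lt;/code&gt; calls; &lt;code&gt;urllib.parse.quote&lt;/code&gt; achieves the same thing more generally (the account ID in the sample ARN is a placeholder):&lt;/p&gt;

```python
from urllib.parse import quote

agent_arn = "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my_mcp_server-abc123"

# Percent-encode every reserved character (":" becomes %3A, "/" becomes %2F)
# so the whole ARN fits in one URL path segment.
encoded_arn = quote(agent_arn, safe="")

mcp_url = (
    "https://bedrock-agentcore.us-west-2.amazonaws.com"
    f"/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT"
)
print(mcp_url)
```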




&lt;h2&gt;
  
  
  Step 5: Invoke and Test Your Deployed MCP Server
&lt;/h2&gt;

&lt;p&gt;Your MCP server is now running on AWS. You can invoke it from any MCP client or from a Python script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remote invocation via Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AGENT_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my_mcp_server-abc123"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BEARER_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-oauth-bearer-token"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# my_mcp_client_remote.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.streamable_http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamablehttp_client&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;agent_arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_ARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bearer_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BEARER_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;agent_arn&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;bearer_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: AGENT_ARN or BEARER_TOKEN not set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;encoded_arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_arn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%3A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%2F&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://bedrock-agentcore.us-west-2.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/runtimes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;encoded_arn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/invocations?qualifier=DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bearer_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;streamablehttp_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;mcp_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminate_on_close&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c1"&gt;# Discover tools
&lt;/span&gt;            &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deployed tools:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Call a tool on the deployed server
&lt;/span&gt;            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_iam_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3:GetObject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3:PutObject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:s3:::my-bucket/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Generated policy:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the same three tools you defined locally, now served from AgentCore Runtime with full session isolation, authentication, and observability.&lt;/p&gt;
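&lt;p&gt;Note that tool results arrive as plain text in &lt;code&gt;result.content[0].text&lt;/code&gt;. If your tool returns a JSON policy document, parse it back into a structure before feeding it to anything downstream. A sketch with a hypothetical response — the policy shown is illustrative, not what the deployed tool necessarily returns:&lt;/p&gt;

```python
import json

# Hypothetical raw text as it might appear in result.content[0].text;
# the real output depends on your generate_iam_policy implementation.
raw = (
    '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", '
    '"Action": ["s3:GetObject", "s3:PutObject"], '
    '"Resource": "arn:aws:s3:::my-bucket/*"}]}'
)

policy = json.loads(raw)
for stmt in policy["Statement"]:
    print(stmt["Effect"], stmt["Action"])
```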

&lt;h3&gt;
  
  
  Testing with MCP Inspector
&lt;/h3&gt;

&lt;p&gt;You can also use the MCP Inspector for interactive testing. Point it at your deployed server's invocation URL with the appropriate bearer token, and you get a visual interface to discover tools, invoke them, and inspect responses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;You now have a production MCP server running on AgentCore Runtime. Here are natural next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Gateway&lt;/strong&gt; — Connect your agent to external APIs and third-party tools through the managed gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Memory&lt;/strong&gt; — Add persistent conversation context so your agent remembers prior interactions across sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Identity&lt;/strong&gt; — Integrate with your corporate identity provider for end-user authentication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-to-Agent (A2A)&lt;/strong&gt; — Deploy additional agents and let them communicate using the A2A protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — Enable built-in tracing to capture agent reasoning steps via CloudWatch Transaction Search.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agents-tools-runtime.html" rel="noopener noreferrer"&gt;Host agent or tools with Amazon Bedrock AgentCore Runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-mcp.html" rel="noopener noreferrer"&gt;Deploy MCP servers in AgentCore Runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-mcp-protocol-contract.html" rel="noopener noreferrer"&gt;MCP protocol contract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/mcp-getting-started.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore MCP Server: Vibe coding with your coding assistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-get-started-code-deploy.html" rel="noopener noreferrer"&gt;Get started with AgentCore Runtime direct code deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/01-AgentCore-runtime/02-hosting-MCP-server" rel="noopener noreferrer"&gt;AgentCore MCP Server Tutorial — GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://awslabs.github.io/mcp/servers/amazon-bedrock-agentcore-mcp-server" rel="noopener noreferrer"&gt;AWS Bedrock AgentCore MCP Server — Open Source MCP Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/build-long-running-mcp-servers-on-amazon-bedrock-agentcore-with-strands-agents-integration/" rel="noopener noreferrer"&gt;Build long-running MCP servers on Amazon Bedrock AgentCore — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/accelerate-development-with-the-amazon-bedrock-agentcore-mcpserver/" rel="noopener noreferrer"&gt;Accelerate development with the Amazon Bedrock AgentCore MCP server — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/specification/2025-06-18/basic/transports" rel="noopener noreferrer"&gt;MCP Specification: Transports&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws/bedrock-agentcore-sdk-python" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Python SDK — GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws/bedrock-agentcore-starter-toolkit" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Starter Toolkit — GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;By &lt;a href="https://dev.kimmyai.io" rel="noopener noreferrer"&gt;Tebogo Tseka&lt;/a&gt; — AWS Practice Manager &amp;amp; Solutions Architect at Big Beard Web Solutions&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>mcp</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
