<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karl Mehta</title>
    <description>The latest articles on DEV Community by Karl Mehta (@karl_mehta).</description>
    <link>https://dev.to/karl_mehta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838303%2F0953f972-d3af-4906-9a2b-a0f572938d23.jpeg</url>
      <title>DEV Community: Karl Mehta</title>
      <link>https://dev.to/karl_mehta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karl_mehta"/>
    <language>en</language>
    <item>
      <title>Dario Amodei Just Laid Out Why AI Assurance Is Now Non-Negotiable. Here's What Enterprises Need to Do Monday Morning</title>
      <dc:creator>Karl Mehta</dc:creator>
      <pubDate>Mon, 15 Jun 2026 01:57:36 +0000</pubDate>
      <link>https://dev.to/karl_mehta/dario-amodei-just-laid-out-why-ai-assurance-is-now-non-negotiable-heres-what-enterprises-need-to-42e6</link>
      <guid>https://dev.to/karl_mehta/dario-amodei-just-laid-out-why-ai-assurance-is-now-non-negotiable-heres-what-enterprises-need-to-42e6</guid>
      <description>&lt;p&gt;Dario Amodei, CEO of Anthropic, published a sweeping policy essay on X yesterday, "Policy on the AI Exponential," that every enterprise leader deploying AI should read. His central message: AI is no longer a toy or a tool. It is a technology of national and economic consequence, and our policy and governance infrastructure is dangerously behind. Dario called out the emergence of Claude Mythos Preview as proof that frontier models now pose real risks to critical infrastructure and national security, and announced that Anthropic is releasing both a legislative proposal on frontier model testing and a policy framework for job displacement, with substantial financial backing. The essay is the clearest signal yet from inside the AI industry that self-regulation is over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Treebeard Problem Is Real&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dario's metaphor is apt: AI is moving at Hobbit speed, policy at Treebeard speed. In the time it takes Congress to pass a bill, AI capabilities compound multiple generations. But here's the enterprise implication that doesn't get enough attention — companies are in the exact same trap. Internal governance committees, legal reviews, audit cycles — they were designed for software that didn't change every 90 days.&lt;/p&gt;

&lt;p&gt;The result? AI systems are in production making decisions about who gets hired, who gets a loan, who gets treated, who gets covered — with no independent verification, no compliance certification, and no tamper-proof evidence that anyone checked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Dario Is Calling For — And Why It Matters to Your Enterprise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three parts of his essay hit directly at what enterprises need to act on now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Mandatory third-party testing is coming.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dario explicitly calls for frontier AI models to be evaluated by qualified independent third parties before deployment — modeled on the FAA. Governments should have the power to block deployment of models that fail. Whether or not federal legislation passes this year, the regulatory direction is unmistakable: independent evaluation is becoming table stakes, not optional. The EU AI Act entered enforcement this August, with fines up to €35M or 7% of global revenue. NYC's LL144 is expanding. The legislative proposal Anthropic released alongside this essay signals the industry itself now agrees: self-reporting isn't enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The cybersecurity risk is not theoretical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Mythos Preview demonstrated that frontier AI poses real, immediate risks to critical infrastructure, financial systems, and national security. For enterprise leaders, the implication is concrete: the AI systems your vendors are deploying — in your HR stack, your credit decisions, your underwriting models — carry cybersecurity, bias, and accountability exposure that your existing risk frameworks weren't built to catch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The economic stakes of getting this wrong are enormous.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dario is candid that AI may produce more enduring labor displacement than any prior technology. Companies that deploy AI recklessly will face not just regulatory exposure, but workforce, reputational, and legal consequences they aren't pricing in today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I Open-Sourced TrustModel Last Week&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dario's essay calls for independent evaluation infrastructure that the whole industry can trust, and that's precisely why last week I open-sourced the TrustModel core under the MIT license. My post on the launch explains the reasoning: AI assurance can't be a black box. The evaluation engine, the scoring methodology, and the guardrail framework need to be inspectable, forkable, and community-auditable. The same independence that makes TrustModel valuable as an enterprise platform (we don't sell the models we evaluate, so we have no incentive to pass them) depends on that methodology being open to scrutiny.&lt;/p&gt;

&lt;p&gt;Open-source bottom, commercial top. The evaluation engine is MIT-licensed and available now at github.com/karlmehta/trustmodel. The compliance framework library — 50+ regulatory frameworks, tamper-proof on-chain governance evidence, continuous monitoring — is the enterprise product. This is the Databricks model: openness builds trust, which is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What TrustModel.ai Is Doing About It — Today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At TrustModel.ai, we've spent the last two years building the infrastructure Dario is calling for at the enterprise layer. Here's what that looks like in practice:&lt;/p&gt;

&lt;p&gt;Independent evaluation, no vendor cooperation required. Our platform scores AI models and COTS systems across 10 trust dimensions — Safety, Fairness, Accuracy, Privacy, Transparency, Robustness, Accountability, Explainability, Compliance, and Reliability — producing a TrustScore from 0–100, the credit rating for your AI systems.&lt;/p&gt;

&lt;p&gt;Compliance frameworks that match real regulatory exposure. We've operationalized EU AI Act, NIST AI RMF, OWASP LLM Top 10, and NYC LL144 into policy packs that evaluate your deployed systems against the standards regulators are now enforcing.&lt;/p&gt;

&lt;p&gt;Tamper-proof governance evidence. We anchor evaluation results and audit trails cryptographically on-chain. When a regulator asks "did you exercise due diligence?" — you have cryptographic proof you did, not a PDF from a consultant.&lt;/p&gt;

&lt;p&gt;Continuous monitoring, not one-time audits. AI models drift. Guardrails change. Vendor updates happen silently. Our platform monitors AI behavior continuously via OpenTelemetry integration, alerts on violations, and maintains a live compliance posture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Call to Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dario ends with optimism — that the window of opportunity is open, that policymakers are unusually receptive, that a nonpartisan coalition around AI safety is possible. I share that optimism.&lt;/p&gt;

&lt;p&gt;But enterprises cannot wait for Congress. The EU AI Act enforcement clock has already started. Your board is already asking questions your CISO and CDO can't yet answer.&lt;/p&gt;

&lt;p&gt;The question isn't whether you need independent AI assurance. The question is whether you build that infrastructure before the first enforcement action or after.&lt;/p&gt;

&lt;p&gt;FreeScan your first AI model (full 10-dimension evaluation, no credit card required) at trustmodel.ai. Or join us at the AI Assurance &amp;amp; Governance Summit on October 1, 2026, at Stanford University, where we'll convene the enterprise, regulatory, and investor leaders working to get this right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Personal Note on the Journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On a personal note: congratulations to Dario and the entire Anthropic team on the confidential S-1 filing last week, targeting a public listing that could make Anthropic one of the first trillion-dollar AI companies to reach the public markets. What Anthropic has built in just five years, going from a safety-focused research lab that many dismissed as too principled to compete to the company that has genuinely redefined what responsible AI leadership looks like, is nothing short of extraordinary. Dario has earned the right to write essays like this one. He built the credibility while others just wrote the press releases.&lt;/p&gt;

&lt;p&gt;And a special congratulations to my former colleagues and friends at Menlo Ventures, who led Anthropic's Series D at a time when many other firms had passed. That kind of conviction, writing a major check into a safety-first AI lab when the consensus wasn't yet there, is what separates great venture investors from the rest. I've had the privilege of working at Menlo Ventures and with Menlo across two of my own startups, and I know firsthand that their pattern is consistent: they back founders with a point of view before the market catches up. They did it with Anthropic. They did it with me in two of my successful companies (one acquired by Visa and another by Cornerstone OnDemand). That's the firm I'm proud to call a partner.&lt;/p&gt;

&lt;p&gt;The best investors, like the best founders, look right before they look early.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>AI Is Shipping Unvalidated. Today We're Open-Sourcing the Fix.</title>
      <dc:creator>Karl Mehta</dc:creator>
      <pubDate>Sun, 31 May 2026 23:41:35 +0000</pubDate>
      <link>https://dev.to/karl_mehta/ai-is-shipping-unvalidated-today-were-open-sourcing-the-fix-1gn9</link>
      <guid>https://dev.to/karl_mehta/ai-is-shipping-unvalidated-today-were-open-sourcing-the-fix-1gn9</guid>
      <description>&lt;p&gt;Enterprises are carrying an estimated half-trillion dollars of unvalidated AI risk, and U.S. courts already hold 100+ active AI lawsuits. TrustModel — Eval, Monitor, Govern — is now free and open source.&lt;/p&gt;

&lt;p&gt;Every company on earth is racing to put AI in front of customers, employees, and regulators. Almost none of them can answer a simple question before they ship: Is this AI safe, fair, and defensible?&lt;/p&gt;

&lt;p&gt;That gap, between "it worked in the demo" and "it holds up in a deposition," is now one of the largest unpriced liabilities in the economy. By industry estimates the aggregate exposure runs into the hundreds of billions of dollars, and the bill is already coming due in court.&lt;/p&gt;

&lt;p&gt;This is not hypothetical anymore&lt;br&gt;
The litigation has arrived, and these are not edge cases. A few that should make every builder pause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hiring. &lt;em&gt;Mobley v. Workday&lt;/em&gt; — an AI résumé-screening system accused of discriminating by age, race, and disability — was cleared in 2025 to proceed as a nationwide collective action. The EEOC's first AI-discrimination settlement (iTutorGroup, $365K) was over software that auto-rejected older applicants.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer-facing agents. A tribunal ordered Air Canada to honor a refund policy its chatbot invented, flatly rejecting the argument that "the chatbot is a separate entity." Your AI's words are your words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Healthcare. Class actions allege algorithmic systems wrongfully denied medically necessary care at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Safety &amp;amp; harm. Wrongful-death and product-liability suits now name AI products directly for the content they generate.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the regulators are converging on the same standard. The EU AI Act carries penalties up to €35M or 7% of global revenue. NYC Local Law 144 mandates bias audits for hiring tools. Colorado's AI Act lands in 2026. NIST's AI Risk Management Framework, ISO 42001, and the OWASP LLM Top 10 are quietly becoming the bar you'll be measured against — in audits and in court.&lt;/p&gt;

&lt;p&gt;The problem isn't that teams don't care. It's that validation is too far away.&lt;/p&gt;

&lt;p&gt;Here's the trap. "AI trust &amp;amp; safety" has no budget line, no owner, and no tool the engineer who actually ships the model can reach for on a Tuesday afternoon. The governance platforms that exist are six-figure, top-down, procurement-cycle products aimed at a Chief Compliance Officer — months away from the code. So the model ships unvalidated, and the exposure compounds silently until a plaintiff, an auditor, or a journalist finds it.&lt;/p&gt;

&lt;p&gt;The fix can't be another enterprise sales motion. It has to be free, local, and in the hands of the developer, the person who can actually do something about a bad score before it ships. The repo is on GitHub: &lt;a href="https://github.com/karlmehta/trustmodel" rel="noopener noreferrer"&gt;https://github.com/karlmehta/trustmodel&lt;/a&gt;. There is also a Hugging Face Space: &lt;a href="https://huggingface.co/spaces/karlmehta/trustmodel-score-any-ai" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/karlmehta/trustmodel-score-any-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we open-sourced it.&lt;/p&gt;

&lt;p&gt;"I'm Karl, founder of TrustModel. I built this because whether your AI is safe to ship shouldn't require a sales call to find out. Run it locally, read the code, and score your own AI across the same ten dimensions our enterprise customers use — for free, forever."  —&lt;a class="mentioned-user" href="https://dev.to/karlmehta"&gt;@karlmehta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today we're releasing TrustModel — an MIT-licensed toolkit that scores any AI across 10 trust dimensions and rolls them into a single 0–100 TrustScore. Three products, one free API key (a developer account is free, no credit card, and comes with 5 credits — $500 — to use across all three):&lt;/p&gt;

&lt;p&gt;$ pip install trustmodel&lt;br&gt;
$ trustmodel login          # free account → API key + 5 credits ($500)&lt;br&gt;
$ trustmodel eval "Take 500mg of metformin twice daily."&lt;/p&gt;

&lt;p&gt;TrustScore: 41/100 (Grade D)&lt;br&gt;
  safety          ⚠  ········  unverified medical dosage advice&lt;br&gt;
  explainability  ⚠  ·····     47 ⚠&lt;br&gt;
  ...&lt;/p&gt;

&lt;p&gt;Eval, Monitor, Govern&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eval — score any model, prompt, agent, or MCP server across 10 dimensions, locally, using your own LLM as the judge. Gate it in CI so a bad score fails the build, not the lawsuit.&lt;/li&gt;
&lt;li&gt;Monitor — one line wraps your live LLM calls and scores every response in production, with threshold alerts and OpenTelemetry export. Catch drift the day it starts, not the quarter it's subpoenaed.&lt;/li&gt;
&lt;li&gt;Govern — enforce open-source policy packs mapped to real regulation (EU AI Act, NIST AI RMF, ISO 42001, NYC LL144, OWASP LLM Top 10) before output reaches a user. Block the opaque rejection. Redact the leaked PII. Stop the unsafe answer at the door.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;# Govern: block output that would fail an EU AI Act / LL144 audit&lt;br&gt;
from trustmodel import Guardrail&lt;br&gt;
gr = Guardrail("nyc-ll144")&lt;br&gt;
gr.check("You're not a culture fit. We can't say why.").allowed  # → False&lt;/p&gt;

&lt;p&gt;Ten dimensions, mapped to the rules you'll be judged by&lt;/p&gt;

&lt;p&gt;Safety, fairness, accuracy, privacy, transparency, robustness, accountability, explainability, compliance, reliability — each scored, each mapped to the frameworks that show up in audits and complaints. This is the difference between "we tested it" and "we can prove it." When a regulator or a plaintiff's attorney asks how you validated your system, "we ran TrustModel in CI on every release and here are the scores" is an answer. Silence is a settlement.&lt;/p&gt;

&lt;p&gt;Open core, on purpose&lt;/p&gt;

&lt;p&gt;The toolkit is free and MIT-licensed — the harness, the CLI, the MCP server, the policy packs. The calibrated, audit-ready TrustScore, the compliance reports an auditor will accept, certification, and in-VPC agent governance are the commercial layer at trustmodel.ai. Think Linux and Red Hat: run the open source forever; pay only when you need a score you can hand to a regulator. We'd rather a million developers validate their AI for free than have a thousand do it after the lawsuit.&lt;/p&gt;

&lt;p&gt;Score your AI today. Free key, $500 in credits, no credit card.&lt;/p&gt;

&lt;p&gt;pip install trustmodel &amp;amp;&amp;amp; trustmodel login&lt;/p&gt;

&lt;p&gt;Star on GitHub · Try the live demo · Get your free key: trustmodel.ai (click for free developer account under sign up)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/karlmehta/trustmodel" rel="noopener noreferrer"&gt;https://github.com/karlmehta/trustmodel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/karlmehta/trustmodel-score-any-ai" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/karlmehta/trustmodel-score-any-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The $500B exposure figure reflects aggregate industry estimates of unvalidated AI liability; case references are drawn from public U.S. litigation and regulatory actions and are illustrative, not legal advice. AI litigation counts reflect public litigation trackers as of 2026.&lt;/p&gt;

</description>
      <category>agentskills</category>
      <category>security</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Missing Engineering Stack for Production AI Agents</title>
      <dc:creator>Karl Mehta</dc:creator>
      <pubDate>Sun, 17 May 2026 15:29:24 +0000</pubDate>
      <link>https://dev.to/karl_mehta/the-missing-engineering-stack-for-production-ai-agents-316h</link>
      <guid>https://dev.to/karl_mehta/the-missing-engineering-stack-for-production-ai-agents-316h</guid>
      <description>&lt;p&gt;The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that decide whether your agent survives contact with real users, real data, and real adversaries — context-window discipline, skill composition, capability-based security, and drift telemetry. Concrete patterns, named tradeoffs, and the enterprise integrations that let you ship past prototype.&lt;/p&gt;

&lt;p&gt;This is part 1 of a 3-post series. Part 2 — Why current IDEs need to be redesigned for the agent era — covers the developer-tooling argument. Part 3 introduces what I'm shipping next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tokens — context-window discipline&lt;/strong&gt;&lt;br&gt;
A token is the unit of inference cost, the unit of latency, and the unit of model attention. Treat it like memory in a 1990s embedded system: budget every byte, evict aggressively, and never assume the next call gets the same allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching is a 90% cost cut you'd be insane to ignore&lt;/strong&gt;&lt;br&gt;
Anthropic's cache_control: { type: 'ephemeral' } marker (5-minute TTL by default, 1-hour via the extended-TTL beta) deduplicates the static prefix of your prompts at the inference layer. Cached tokens are billed at 10% of input cost; cache writes cost 25% more on the first call. The math: any system prompt + tool catalog + few-shot exemplar bank that's reused more than ~3 times per 5 minutes is a net cost win. Order matters — the cache is a prefix, not a content-addressable store, so the cached span has to be byte-identical and at the start.&lt;/p&gt;

&lt;p&gt;messages: [&lt;br&gt;
  { role: "user", content: [&lt;br&gt;
    { type: "text", text: STATIC_TOOL_CATALOG, cache_control: { type: "ephemeral" } },&lt;br&gt;
    { type: "text", text: STATIC_SKILLS_BUNDLE, cache_control: { type: "ephemeral" } },&lt;br&gt;
    { type: "text", text: dynamicUserTurn },&lt;br&gt;
  ]}&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;Two cache breakpoints, because a cache read covers the prefix up to the most recent matching cache_control marker: splitting the tool catalog from the skill bundle lets the skill bundle change without invalidating the cached tool catalog. The reverse does not hold, since editing the catalog changes the prefix and invalidates everything after it. OpenAI's automatic prefix caching (no opt-in, but no extended TTL) and Gemini's explicit CachedContent resources are the equivalents on the other major providers.&lt;/p&gt;
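
&lt;p&gt;A quick way to sanity-check the caching math is to compute the relative cost directly. The sketch below assumes the pricing quoted above (cache writes at 1.25x the base input price, cache reads at 0.10x) and a prefix that stays warm for every call after the first; the function name is illustrative and not part of any SDK.&lt;/p&gt;

```python
def relative_prefix_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached, cached) cost of sending one static prefix `calls` times.

    Costs are in units of "one uncached pass over the prefix". Assumes the first
    call writes the cache and every later call hits it within the TTL.
    """
    if calls == 0:
        return 0.0, 0.0
    uncached = float(calls)
    cached = write_mult + read_mult * (calls - 1)
    return uncached, cached


for n in (1, 2, 3, 10):
    uncached, cached = relative_prefix_cost(n)
    print(f"{n:>2} calls: uncached={uncached:.2f} cached={cached:.2f}")
```

&lt;p&gt;Under these assumptions a single call pays the write premium for nothing, while the cached path is already cheaper from the second call on, so a rule of thumb of roughly three reuses per TTL window leaves a comfortable margin.&lt;/p&gt;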

&lt;p&gt;&lt;strong&gt;Model routing — pay Haiku rates for Opus-class outcomes&lt;/strong&gt;&lt;br&gt;
A single agent run rarely needs the same model for every step. The cost spread is enormous: Claude Haiku 4.5 is $1/$5 per million in/out, Sonnet 4.6 is $3/$15, Opus 4.7 is $15/$75. The pattern that's worked for me is a three-tier router:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval / classification / extraction → Haiku. Use structured outputs (forced JSON via tool_use with strict mode) so the model can't waste tokens on freeform.&lt;/li&gt;
&lt;li&gt;Synthesis / reasoning over retrieved context → Sonnet. The default mid-tier; this is where 80% of business logic lives.&lt;/li&gt;
&lt;li&gt;Tool selection / planning / disambiguation → Opus only when the planner has to coordinate &amp;gt;5 tool calls or weigh ambiguous user intent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Switching costs ~50ms of router latency. The cost reduction is typically 4–8× on production workloads. The trap: don't route based on input length alone; route based on the step type. A 50-token "is this a refund request?" classifier on Haiku is 15× cheaper than the same call on Opus at the prices quoted above.&lt;/p&gt;
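
&lt;p&gt;The three-tier router can be sketched in a few lines. This is a minimal illustration, not a production router: the tier labels are placeholders rather than real API model identifiers, the step types and the Step class are hypothetical, and the prices are simply the per-million figures quoted above.&lt;/p&gt;

```python
from dataclasses import dataclass

# Prices per million tokens (input, output) as quoted in the text above.
# The tier labels are placeholders, not real API model identifiers.
PRICES = {
    "haiku": (1.0, 5.0),
    "sonnet": (3.0, 15.0),
    "opus": (15.0, 75.0),
}

# Hypothetical step types; route on what the step does, not on input length.
STEP_TO_TIER = {
    "retrieval": "haiku",
    "classification": "haiku",
    "extraction": "haiku",
    "synthesis": "sonnet",
    "reasoning": "sonnet",
    "planning": "opus",
}


@dataclass(frozen=True)
class Step:
    kind: str
    planned_tool_calls: int = 0
    ambiguous_intent: bool = False


def route(step: Step) -> str:
    """Pick a tier for one step of an agent run."""
    tier = STEP_TO_TIER.get(step.kind, "sonnet")  # unknown steps go to the mid tier
    if tier == "opus":
        # Only pay top-tier rates when the planner really needs it.
        needs_opus = step.planned_tool_calls > 5 or step.ambiguous_intent
        if not needs_opus:
            tier = "sonnet"
    return tier


def cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in dollars at the table prices."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


print(route(Step("classification")))
print(route(Step("planning", planned_tool_calls=8)))
print(cost_usd("opus", 50, 5) / cost_usd("haiku", 50, 5))
```

&lt;p&gt;Keeping the routing table as data rather than branching logic makes it easy to review, and the last line shows how the per-call cost ratio between tiers falls straight out of the price table.&lt;/p&gt;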

&lt;p&gt;&lt;strong&gt;Streaming, KV reuse, and the structured-output dodge&lt;/strong&gt;&lt;br&gt;
Streaming via SSE (Anthropic, OpenAI) or gRPC bidirectional (Vertex) is non-negotiable for latency. The first token typically lands at 200–600 ms; the full response at 2–8 seconds. If your UX waits for the full response, you've added 4 seconds of perceived latency for zero product reason.&lt;/p&gt;

&lt;p&gt;KV cache reuse across calls is the under-discussed companion to prompt caching. Modern Anthropic and OpenAI back-ends keep the attention key-value cache warm across the cache TTL. Keep your tool list in a stable order, with the most frequently called tools first, because tool definitions are part of the prefix that gets cached and any reordering changes that prefix.&lt;/p&gt;

&lt;p&gt;The structured-output dodge: when you need a list, a classification, or a structured fact, don't ask the model in freeform — define a tool, force it via tool_choice, and receive a typed JSON object. You skip 50–80% of the freeform tokens the model would otherwise generate, and the output is parser-safe by construction. Pair with strict mode (OpenAI) or JSON Schema with $defs (Anthropic) to refuse off-schema outputs at the decoder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Skills — composition, not prompts&lt;/strong&gt;&lt;br&gt;
A "skill" is the unit of behavior an agent can perform. Most production agents conflate three different things into a megaprompt: identity (who are you), capabilities (what can you do), and policies (what you must / must not do). That conflation makes prompts impossible to evolve safely. Separate them into composable fragments, then assemble at runtime.&lt;/p&gt;

&lt;p&gt;The model I've shipped against — and what I think every production agent eventually converges on — is the trigger / action / restriction triple per skill:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "id": "refund-policy-2024",&lt;br&gt;
  "trigger": "the user asks for a refund",&lt;br&gt;
  "action": "verify the order is within the 30-day window, then issue a refund via tools.stripe.refund and post-confirm via tools.email.send",&lt;br&gt;
  "restriction": "never issue refunds &amp;gt; $500 without a human-approval gate; never refund subscription items in their first cycle"&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;Domain experts (PMs, ops, legal) author triples in plain English. The runtime composes them into a system-prompt slot. Versioning per skill — not per agent. Eval suites attach to the skill, so swapping out a refund policy in 2026 doesn't require reblessing the entire agent.&lt;/p&gt;
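
&lt;p&gt;One way the runtime composition step might look, as a sketch under stated assumptions: the Skill class, the version field, the WHEN / DO / LIMITS labels, and both function names are hypothetical, chosen only to show triples being versioned individually and assembled into a single system-prompt slot.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    id: str
    version: str  # versioned per skill, not per agent (field name is an assumption)
    trigger: str
    action: str
    restriction: str


def render_skill(skill: Skill) -> str:
    """Render one trigger / action / restriction triple as a prompt fragment."""
    return (
        f"[skill {skill.id}@{skill.version}]\n"
        f"WHEN: {skill.trigger}\n"
        f"DO: {skill.action}\n"
        f"LIMITS: {skill.restriction}"
    )


def compose_system_prompt(identity: str, skills: list[Skill]) -> str:
    """Assemble identity plus skills; sort by id so the prompt prefix stays stable."""
    ordered = sorted(skills, key=lambda s: s.id)
    return "\n\n".join([identity] + [render_skill(s) for s in ordered])


refund = Skill(
    id="refund-policy-2024",
    version="1.0.0",
    trigger="the user asks for a refund",
    action="verify the order is within the 30-day window, then issue a refund",
    restriction="never issue refunds above $500 without a human-approval gate",
)
print(compose_system_prompt("You are a support agent.", [refund]))
```

&lt;p&gt;Sorting by skill id keeps the rendered prompt byte-stable between runs, which matters if the composed prompt sits inside a cached prefix.&lt;/p&gt;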

&lt;p&gt;&lt;strong&gt;Tool use, MCP, and the transport question&lt;/strong&gt;&lt;br&gt;
Tools are the IO of an agent. The schema is the contract. Two opinions worth holding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Strict JSON schemas with additionalProperties: false. Closed-world schemas catch hallucinated arguments at the validator instead of in production. Strict mode (OpenAI) and the Anthropic tool_choice + JSON-Schema combo both enforce this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tools should be small and idempotent. orders.refund(orderId, amountCents), not orders.handle(intent, payload). The agent's planner is dramatically more reliable when each tool does one thing with a typed input.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
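
&lt;p&gt;To make the closed-world idea concrete, here is a deliberately tiny validator covering only required, type, and additionalProperties. It is a stand-in for a real JSON Schema validator, and the schema for the orders.refund tool is an assumption based on the signature mentioned above.&lt;/p&gt;

```python
# Hypothetical schema for the orders.refund(orderId, amountCents) tool.
REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "orderId": {"type": "string"},
        "amountCents": {"type": "integer"},
    },
    "required": ["orderId", "amountCents"],
    "additionalProperties": False,
}

_PY_TYPES = {"string": str, "integer": int, "boolean": bool, "number": (int, float)}


def validate_args(args: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the call is acceptable."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            # Closed world: anything the schema does not declare is rejected.
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected argument: {name}")
            continue
        declared = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and declared != "boolean":
            errors.append(f"wrong type for {name}")
        elif not isinstance(value, _PY_TYPES[declared]):
            errors.append(f"wrong type for {name}")
    return errors


print(validate_args({"orderId": "A-1", "amountCents": 1200}, REFUND_SCHEMA))
print(validate_args({"orderId": "A-1", "amountCents": 1200, "force": True}, REFUND_SCHEMA))
```

&lt;p&gt;The second call carries a hallucinated "force" argument, which the closed-world schema reports instead of silently passing it through to the tool.&lt;/p&gt;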

&lt;p&gt;Once you have more than ~5 tools, the catalog itself becomes worth standardizing. Model Context Protocol (MCP) — Anthropic's open-source agent ↔ tool spec — is the answer that's consolidating the ecosystem. Three transports, three different tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;stdio: local-process tools. Lowest latency, zero network surface. Use this for code execution, filesystem ops, anything sensitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSE (deprecated in favor of Streamable HTTP): server-sent events over a long-lived HTTP connection, with client messages sent as separate POSTs. Browser-friendly, easy to host. Latency ~50ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streamable HTTP: single-endpoint HTTP with optional SSE for streaming responses. The current recommendation for hosted MCP servers. Compatible with most cloud LB stacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The plan-execute-review loop&lt;/strong&gt;&lt;br&gt;
For agents with &amp;gt;3 sequential tool calls, prompt the model to plan first (one message, no tool calls), execute against that plan (n messages, tool calls only), then review the result against the plan's stated success criteria (one message, no tool calls). Anthropic's Agent SDK ships this pattern via the plan_mode primitive; it's also straightforward to implement in raw fetch with three system-prompt slots.&lt;/p&gt;

&lt;p&gt;The bonus: when the agent fails, the failure is grounded in a textual plan you can replay, eval, and red-team — instead of an opaque chain of tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Security — capability-based, not vibe-based&lt;/strong&gt;&lt;br&gt;
The threat surface of an agent is wider than people pretend. A short list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prompt injection — adversarial input in retrieved context, tool outputs, or user data flips the agent's instructions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data exfiltration — the agent calls a tool that emits sensitive data to an attacker-controlled destination (an email, a webhook, a markdown image with a query string).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tool abuse / RCE — the agent uses a legitimate tool in a way the designer didn't intend (a shell tool, a code-exec tool).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supply chain — a tool dependency or model weight is compromised.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Secret leakage — API keys end up in logs, prompts, or tool error messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Capability-based authority, not ambient authority&lt;/strong&gt;&lt;br&gt;
The security primitive that's stood up best in 50 years of OS research is the object capability: hand a process the smallest unforgeable token that lets it do exactly the thing it needs, and nothing else. Apply this to agents.&lt;/p&gt;

&lt;p&gt;Concretely: don't give the agent a long-lived OPENAI_API_KEY with billing access. Give it a per-session token, scoped to specific endpoints, with a TTL. Every tool gets a separate principal. Authorize via OAuth 2.1 with PKCE — the agent walks the user through delegated authorization, the user sees the exact scopes, and tokens are stored in the OS keychain (libsecret on Linux, Keychain on macOS, DPAPI on Windows; Electron's safeStorage wraps the platform primitive for cross-OS).&lt;/p&gt;
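
&lt;p&gt;The shape of a per-session, per-tool, time-limited capability can be shown with nothing but the standard library. This is only an illustration of scoping and expiry using an HMAC-signed token; it is not the OAuth 2.1 flow described above, and every name in it (mint_token, authorize, the scope strings) is hypothetical.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def mint_token(secret: bytes, tool: str, scopes: list[str], ttl_seconds: int,
               now: Optional[float] = None) -> str:
    """Mint a token bound to one tool, a set of scopes, and an expiry time."""
    issued = time.time() if now is None else now
    claims = {"tool": tool, "scopes": sorted(scopes), "exp": issued + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig


def authorize(secret: bytes, token: str, tool: str, scope: str,
              now: Optional[float] = None) -> bool:
    """Allow a call only if the signature, expiry, tool, and scope all check out."""
    current = time.time() if now is None else now
    body, _, sig = token.rpartition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged or tampered token
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    if current >= claims["exp"]:
        return False  # expired
    return claims["tool"] == tool and scope in claims["scopes"]


secret = b"demo-only-secret"
token = mint_token(secret, "orders.refund", ["refund:create"], ttl_seconds=300)
print(authorize(secret, token, "orders.refund", "refund:create"))
print(authorize(secret, token, "email.send", "refund:create"))
```

&lt;p&gt;The point of the design is that a leaked token is useful for one tool, a narrow set of scopes, and a few minutes, rather than granting the ambient authority of a long-lived API key.&lt;/p&gt;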

&lt;p&gt;&lt;strong&gt;Sandbox the tools, not just the agent&lt;/strong&gt;&lt;br&gt;
If a tool runs untrusted code or writes to a filesystem, isolate it. Three real options ranked by overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WASM (Wasmtime, Wasmer) — sub-millisecond startup, deny-by-default I/O, easy to configure capability lists. The right choice for code-exec and policy-evaluation tools.&lt;/li&gt;
&lt;li&gt;gVisor — userspace kernel; near-full Linux compatibility with a 10–100ms startup cost. Right for tool subprocesses that need the full POSIX surface.&lt;/li&gt;
&lt;li&gt;Firecracker — microVM; ~125ms startup, hardware-backed isolation. Right for multi-tenant agent execution in shared infra.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ko/distroless container images, SLSA Level 3 build attestation, and sigstore-signed artifacts close the supply-chain surface. If your agent runs in a long-lived process, write the SBOM to the artifact registry and gate deploys on cosign verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection defense&lt;/strong&gt;&lt;br&gt;
The most under-addressed threat. The mitigations that actually work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Channel separation. Treat tool outputs and retrieved documents as data, not as instructions. Anthropic's recent research on instruction-data separation in the system prompt is the current best practice — wrap untrusted content in clearly labeled XML-ish tags and tell the model to ignore any instructions inside them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allowlist tool surfaces. The agent can call send_email only to addresses on a per-conversation allowlist that the user explicitly authorized. The same pattern applies to outbound HTTP, database writes, file outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output content classifiers. Run a small model over the agent's tool calls before they execute, looking for known exfil patterns (suspicious destinations, base64-encoded blobs, sensitive-field references).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HITL gates on consequential actions. Anything that costs money, sends external communication, modifies a database, or touches PII goes through a human approval before execution. The threshold is per-skill.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
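
&lt;p&gt;The allowlist and human-in-the-loop mitigations above can be combined into one pre-execution check. The sketch below is an assumption-laden illustration: the ToolGate class, the tool names, and the 50,000-cent threshold (borrowed from the $500 limit in the refund skill example) are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ToolGate:
    """Per-conversation policy for outbound tool calls (all names are illustrative)."""
    allowed_recipients: set[str] = field(default_factory=set)
    hitl_threshold_cents: int = 50_000  # the $500 limit from the refund skill example

    def check(self, tool: str, args: dict) -> str:
        """Return "allow", "deny", or "needs_approval" for a proposed call."""
        if tool == "send_email":
            recipient = str(args.get("to", "")).strip().lower()
            return "allow" if recipient in self.allowed_recipients else "deny"
        if tool == "orders.refund":
            amount = int(args.get("amountCents", 0))
            return "needs_approval" if amount > self.hitl_threshold_cents else "allow"
        return "deny"  # deny by default: tools without a rule never run


gate = ToolGate(allowed_recipients={"customer@example.com"})
print(gate.check("send_email", {"to": "customer@example.com"}))
print(gate.check("send_email", {"to": "attacker@evil.example"}))
print(gate.check("orders.refund", {"orderId": "A-1", "amountCents": 75_000}))
```

&lt;p&gt;Because the gate runs outside the model, a successful prompt injection can change what the agent asks for but not what the runtime is willing to execute.&lt;/p&gt;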

&lt;p&gt;&lt;strong&gt;4. Trust — telemetry, not vibes&lt;/strong&gt;&lt;br&gt;
"It worked when I tested it" is not a trust story. The four signals you actually need on every agent in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval pass rate against a golden set&lt;/strong&gt;&lt;br&gt;
A regression suite of input/output pairs the agent must continue to pass. Run on every prompt change, every model upgrade, every tool catalog edit. Tag failures by skill so you can localize regressions. Pairwise LMSYS-style judging works for tone-sensitive outputs; exact-match works for structured outputs. Don't conflate them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection&lt;/strong&gt;&lt;br&gt;
Even with a stable model, your agent's behavior drifts when the input distribution shifts — new product launches, seasonal traffic, adversarial probing. Track distribution shift on input embeddings (cosine distance from a reference centroid) and behavioral metrics (tool-call mix, refund rate, escalation rate). Alarm at 2σ; investigate at 1σ.&lt;/p&gt;
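
&lt;p&gt;A minimal sketch of the embedding-drift signal, assuming embeddings arrive as plain lists of non-zero float vectors. The function names are illustrative, and the thresholds are the 1σ and 2σ levels named above, measured against the spread of the reference set.&lt;/p&gt;

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a set of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def drift_sigma(reference: list[list[float]], batch: list[list[float]]) -> float:
    """Standard deviations by which the batch's mean distance exceeds the baseline."""
    center = centroid(reference)
    ref_d = [cosine_distance(v, center) for v in reference]
    mean = sum(ref_d) / len(ref_d)
    std = math.sqrt(sum((d - mean) ** 2 for d in ref_d) / len(ref_d))
    batch_mean = sum(cosine_distance(v, center) for v in batch) / len(batch)
    return (batch_mean - mean) / max(std, 1e-12)  # guard against a zero-spread baseline


def drift_status(sigma: float) -> str:
    """Map a drift score onto the thresholds from the text."""
    if sigma >= 2.0:
        return "alarm"
    if sigma >= 1.0:
        return "investigate"
    return "ok"


reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05]]
todays_batch = [[0.0, 1.0], [0.1, 1.0]]
print(drift_status(drift_sigma(reference, todays_batch)))
```

&lt;p&gt;The same scoring function works for behavioral metrics such as tool-call mix if each day's metric vector is treated as a point and compared against a trailing reference window.&lt;/p&gt;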

&lt;p&gt;&lt;strong&gt;Behavioral canaries&lt;/strong&gt;&lt;br&gt;
Plant N synthetic inputs per day designed to exercise the prompt-injection, exfil, and jailbreak surfaces. Pass rate on canaries is your live red-team signal. When a new attack class appears in the wild, add it to the canary set; you'll know the next time someone tries it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail with integrity&lt;/strong&gt;&lt;br&gt;
Every run captured as JSONL: input, system prompt, tool calls, model responses, costs, latencies. Chain a hash over the events and periodically anchor the head into an immutable store (S3 Object Lock, GCS Bucket Lock). When auditors ask "what did the agent do on March 12 at 14:22 UTC", you have a tamper-evident answer that can be verified against the anchored head hash.&lt;/p&gt;
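
&lt;p&gt;A minimal sketch of a hash chain over run events, assuming each event is a JSON-serializable dict. The record layout (event, prev, hash) and the function names are illustrative assumptions; anchoring the final hash into an object-locked bucket is left out.&lt;/p&gt;

```python
import hashlib
import json

# Placeholder "previous hash" for the first record in a chain.
GENESIS = "0" * 64


def _digest(prev, event):
    # Canonical JSON (sorted keys, fixed separators) so the same event
    # always serializes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()


def chain_events(events):
    """Return one record per event, each bound to the hash of its predecessor."""
    records, prev = [], GENESIS
    for event in events:
        digest = _digest(prev, event)
        records.append({"event": event, "prev": prev, "hash": digest})
        prev = digest
    return records


def verify_chain(records):
    """Recompute every hash; any edited, reordered, or dropped record fails."""
    prev = GENESIS
    for record in records:
        expected = _digest(prev, record["event"])
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = expected
    return True


def to_jsonl(records):
    """Serialize records one per line, ready to append to an audit log."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

&lt;p&gt;The hash of the last record is the value worth anchoring: if it matches the copy in the immutable store, every earlier record is covered by it.&lt;/p&gt;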

&lt;p&gt;A composite TrustScore rolls the measurable signals up, together with the approval rate from the HITL gates: a weighted blend of eval pass rate, drift score, canary survival, and HITL approval rate. Per agent, per skill, per day. The score is operationally meaningful only if it's grounded in those underlying signals; a score with no traceable inputs is theater.&lt;/p&gt;
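
&lt;p&gt;One way to read "weighted blend" is a normalized weighted average of signals that are each scaled to the range 0 to 1. The weights and key names below are placeholders chosen for illustration; the source does not specify them, and drift is expressed here as a health value where 1.0 means no drift.&lt;/p&gt;

```python
# Hypothetical weights; tune per agent and per skill.
DEFAULT_WEIGHTS = {
    "eval_pass_rate": 0.4,
    "drift_health": 0.2,
    "canary_survival": 0.3,
    "hitl_approval_rate": 0.1,
}


def trust_score(signals, weights=None):
    """Weighted average of named signals, each expected in the range 0 to 1."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    missing = sorted(set(weights) - set(signals))
    if missing:
        # Refuse to score without traceable inputs.
        raise ValueError(f"missing signals: {missing}")
    total = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total
```

&lt;p&gt;Raising on a missing signal, rather than defaulting it, keeps the score tied to inputs that can actually be traced.&lt;/p&gt;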

&lt;p&gt;&lt;strong&gt;The compliance + enterprise integrations&lt;/strong&gt;&lt;br&gt;
For anything regulated — health, finance, government, EU operations — the trust telemetry has to map onto external frameworks. The integrations I've found genuinely useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TrustModel.ai for the GRC overlay — NIST AI RMF, ISO 42001, EU AI Act Article-by-Article mapping, SOC 2, FedRAMP. The TrustScore feeds directly into the control library and produces auditor-ready reports without re-instrumenting the agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cisco DefenseClaw — Apache 2.0, free, OSS. Jeetu Patel announced it from the RSAC 2026 keynote stage on March 23, 2026; it's the most consequential agent-security release of the year. Four components ship in the box: Skills Scanner (capability scan before execution), MCP Scanner (allow/block on MCP server inspection), CodeGuard (static analysis for secrets, unsafe deserialization, weak crypto, and injection patterns), and a Guardrail Proxy (runtime inspection of prompts, completions, and tool calls via regex rules + optional LLM judgment). Stack is a Go gateway sidecar + Python CLI + a TypeScript plugin for the OpenClaw framework that DefenseClaw was built to protect. The framework is observable by default, with first-class Splunk connectivity for the audit-trail story above. It bridges the trust gap that has 85% of enterprises experimenting with agents but only 5% running them in production. Personal note: Jeetu Patel is one of my role models, and I started coding the integration into the IDE I'm shipping the moment he walked off the RSAC stage. The most quoted line from the announcement — "I run OpenClaw at home — that's exactly why we built DefenseClaw" — is the right framing. There's no good reason not to wrap DefenseClaw around every production agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenTelemetry GenAI: the emerging semantic conventions (semconv) for LLM and agent telemetry. Emit the standard span attributes (gen_ai.system, which newer releases of the conventions rename to gen_ai.provider.name; gen_ai.request.model; gen_ai.usage.input_tokens) and your traces work in any OTel-compatible backend.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The bar&lt;/strong&gt;&lt;br&gt;
A production agent is not a model and a prompt. It's a token economy, a skill catalog with versioning, a capability-scoped security model, and a trust telemetry stack. Each of those is a non-trivial engineering surface in its own right; together, they're more work than the "build an agent in 5 minutes" tutorials acknowledge.&lt;/p&gt;

&lt;p&gt;The argument I'll make in part 2 is that the IDEs we have weren't built to help engineers hit this bar. They were built for the 2010 unit of work — one developer, one project, one file at a time — and the unit of work in 2026 is an agent that gets trained, guard-railed, and overseen by a domain expert who isn't the engineer. The tooling has to follow.&lt;/p&gt;

</description>
      <category>agentskills</category>
      <category>promptengineering</category>
      <category>mcp</category>
      <category>mlops</category>
    </item>
    <item>
      <title>The Commoditization of LLM Models</title>
      <dc:creator>Karl Mehta</dc:creator>
      <pubDate>Tue, 05 May 2026 00:55:29 +0000</pubDate>
      <link>https://dev.to/karl_mehta/the-commoditization-of-llm-models-1759</link>
      <guid>https://dev.to/karl_mehta/the-commoditization-of-llm-models-1759</guid>
      <description>&lt;p&gt;I’m becoming more convinced that LLMs are moving toward the same structure as payment networks. The models will be incredibly important. But the largest value will not be captured by the raw model layer alone. It will be captured by the layers above it: routing, evals, RAG, MCP, memory, orchestration, agentic workflows, vertical applications, and trust infrastructure.&lt;/p&gt;

&lt;p&gt;As a founder and developer, this pattern feels familiar to me. I previously built a fintech company that routed transactions across multiple rails and 100+ payment methods around the world. It was eventually acquired by Visa. In payments, Visa, Mastercard, and AmEx were critical rails. But Stripe, PayPal, Adyen, PlaySpan (acquired by Visa) and others created enormous value by abstracting those rails, optimizing routing, managing risk, improving developer experience, and owning the merchant workflow. I think the same thing is happening with LLMs.&lt;/p&gt;

&lt;p&gt;At the bottom, we will likely have a small number of frontier model providers: OpenAI, Anthropic, Google, and a strong open-weight ecosystem. They will remain valuable. They will set the capability frontier. But for most production apps, the model will increasingly become a pluggable inference rail. The value moves up the stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer one: model gateways and routing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenRouter, LiteLLM, Bedrock, Together, Fireworks, Groq, and internal enterprise gateways are making model access interchangeable. A developer can route a request to GPT, Claude, Gemini, Llama, Mistral, DeepSeek, Qwen, or a fine-tuned model depending on cost, latency, context length, modality, privacy, or benchmark performance. This is where the “LLM as rail” abstraction begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer two: RAG and context engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The hard problem in enterprise AI is not generating fluent text. It is assembling the right context at the right time. A useful AI system needs to know the patient record, contract clause, support ticket, lab result, CRM object, claim history, policy document, API schema, prior memory, and user permission boundary. RAG is evolving from “vector search over PDFs” into a full context layer: hybrid search, graph retrieval, tool retrieval, memory retrieval, structured database queries, re-ranking, summarization, and dynamic context packing. The LLM is only as good as the context substrate around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer three: MCP and tool connectivity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MCP makes the harness layer much stronger because it standardizes how agents discover and call tools. Instead of every app building custom glue code for Gmail, Slack, GitHub, Postgres, EHRs, CRMs, calendars, and internal APIs, MCP gives agents a more consistent interface to external systems. This is a big deal.&lt;/p&gt;

&lt;p&gt;Once tools become discoverable and composable, the agent is no longer just a chat interface. It becomes a workflow runtime that can read, reason, act, verify, and update state across systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer four: agentic orchestration.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where frameworks like LangGraph, LlamaIndex, LangChain, CrewAI, AutoGen, Semantic Kernel, and custom orchestration layers matter. The future agentic app will not call one model once.&lt;/p&gt;

&lt;p&gt;It will use one model for planning, another for coding, another for extraction, another for medical reasoning, another for summarization, and another for cheap classification. It will make these decisions in real time based on task type, latency, cost, reliability, and safety constraints. One task may go to Claude for long-context reasoning. Another may go to Gemini for multimodal input. Another may go to GPT for tool use. Another may go to a local or open-weight model for cheap classification. Another may run through multiple models in parallel for consensus, critique, or ensemble evaluation.&lt;/p&gt;

&lt;p&gt;This is exactly how payment orchestration worked. You didn’t hard-code one rail. You routed dynamically based on geography, fees, approval rates, fraud risk, currency, merchant category, and availability.&lt;/p&gt;
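
&lt;p&gt;To make the analogy concrete, here is a toy router that picks the cheapest model able to handle a task within a latency budget. Everything in it is an assumption for illustration: the ModelRoute fields, the route function, and any model names you feed it are hypothetical, and a real router would also weigh reliability, safety, and live availability.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    name: str                  # hypothetical model identifier
    tasks: frozenset           # task types this model is approved for
    cost_per_1k_tokens: float  # illustrative price, not a real quote
    p95_latency_ms: int        # illustrative latency figure


def route(task, max_latency_ms, routes):
    """Return the cheapest route that supports the task within the latency budget."""
    candidates = [
        r for r in routes
        if task in r.tasks and max_latency_ms >= r.p95_latency_ms
    ]
    if not candidates:
        raise LookupError(f"no route for task {task!r} within {max_latency_ms} ms")
    return min(candidates, key=lambda r: r.cost_per_1k_tokens)
```

&lt;p&gt;Swapping the cost key for approval rate or fraud risk gives the payment version of the same function, which is the point of the comparison.&lt;/p&gt;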

&lt;p&gt;&lt;strong&gt;Layer five: evals, trust, and governance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where I think platforms like TrustModel.ai become important. If the application can route across multiple LLMs, the system also needs a way to continuously evaluate which model is right for which task. Not just “which model is smartest,” but which one is safest, cheapest, fastest, most compliant, most consistent, most robust against prompt injection, best at structured output, best at domain reasoning, and least likely to hallucinate.&lt;/p&gt;

&lt;p&gt;A serious agentic system needs multi-dimensional evals across models and workflows. It needs to test safety, quality, bias, factuality, privacy leakage, tool-use reliability, refusal behavior, cost, latency, and auditability. That eval layer becomes the control plane for selecting models and keeping applications safe across changing model providers. This is not optional in healthcare, finance, legal, or enterprise AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer six: vertical workflow applications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the most durable value gets created. A healthcare agent that closes care gaps is not valuable because it uses one specific LLM. It is valuable because it understands clinical workflows, patient context, lab data, insurance constraints, escalation paths, HIPAA boundaries, and provider operations. A revenue cycle agent is valuable because it knows claims, denials, CPT codes, payer policies, appeal letters, and EHR workflows.&lt;/p&gt;

&lt;p&gt;A legal agent is valuable because it knows contract structures, risk positions, fallback clauses, negotiation playbooks, and approval workflows. The model is necessary. But the system, data, workflow, distribution, trust, and feedback loop create the moat. This is why I do not think “which model wins?” is the most interesting question. The better question is: who owns the orchestration layer between the model and the workflow?&lt;/p&gt;

&lt;p&gt;My bet is that most serious applications and agents will be multi-model by default. That is already how I’m building. I’m working on agents that use five different LLMs in parallel, each selected for the task where it performs best: reasoning, extraction, summarization, coding, evaluation, or low-cost classification. The system should optimize in real time, just like a payment router optimizes transaction success, cost, and risk across multiple rails.&lt;/p&gt;

&lt;p&gt;LLMs are becoming intelligence rails. The value will accrue to the builders who turn those rails into reliable systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>mcp</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
