<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Talvinder Singh</title>
    <description>The latest articles on DEV Community by Talvinder Singh (@talvinder).</description>
    <link>https://dev.to/talvinder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1410841%2F85dd15bf-30cb-47a7-8645-3f180a7f78d4.jpeg</url>
      <title>DEV Community: Talvinder Singh</title>
      <link>https://dev.to/talvinder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/talvinder"/>
    <language>en</language>
    <item>
      <title>The Context Window Pricing Collapse</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:49:27 +0000</pubDate>
      <link>https://dev.to/talvinder/the-context-window-pricing-collapse-3fh5</link>
      <guid>https://dev.to/talvinder/the-context-window-pricing-collapse-3fh5</guid>
<description>&lt;p&gt;Claude Opus 4.6 and Sonnet 4.6 now ship with 1M token context windows at standard pricing. No premium tier. No waitlist. Just 1M tokens, available to everyone.&lt;/p&gt;

&lt;p&gt;This single change killed the moat that every RAG startup built in 2023-24.&lt;/p&gt;

&lt;p&gt;The competitive advantage those companies sold was never retrieval quality. It was working around small context windows. That constraint just disappeared.&lt;/p&gt;

&lt;h2&gt;
  
  
  The retrieval layer was a workaround
&lt;/h2&gt;

&lt;p&gt;Between 2023 and early 2025, hundreds of startups raised money on the same pitch: "LLMs have small context windows, so you need our retrieval layer to feed them the right chunks." Document Q&amp;amp;A companies. Legal AI startups. Enterprise search tools. Internal knowledge bases. All built on the same assumption: context windows are expensive and scarce, so smart retrieval is the product.&lt;/p&gt;

&lt;p&gt;That was a reasonable bet when GPT-4 had 8K tokens and Claude had 100K at a premium. Chunking, embedding, reranking, and retrieval pipelines were genuine engineering problems. The companies that solved them well could charge for it.&lt;/p&gt;

&lt;p&gt;But the economics just shifted. When you can drop an entire codebase, a full legal contract set, or a year of customer support tickets into a single prompt at standard pricing, the retrieval layer stops being a product and starts being a feature. A feature that the model provider gives away for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this hits Indian AI startups hardest
&lt;/h2&gt;

&lt;p&gt;The Indian AI ecosystem produced a disproportionate number of RAG-focused companies. The problem was well-defined. The engineering talent was available. The capital requirements were low. Document processing for Indian enterprises. Multilingual knowledge retrieval. Compliance document analysis.&lt;/p&gt;

&lt;p&gt;These were real businesses solving real problems. The problem is that the solution was always a workaround for a temporary constraint.&lt;/p&gt;

&lt;p&gt;Look at the YC batches from 2023-24. Count the Indian-founded companies whose pitch decks said "RAG" or "retrieval" or "knowledge base." Now ask how many of them have a moat beyond the retrieval pipeline itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three categories of products just got commoditized
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Document Q&amp;amp;A.&lt;/strong&gt; If your product takes PDFs, chunks them, embeds them, and lets users ask questions, you now compete with "paste the PDF into Claude." The entire retrieval pipeline becomes overhead. A user with a 1M context window doesn't need your pipeline. They need a text box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise search over internal docs.&lt;/strong&gt; Companies like Glean built serious products here, but they also built moats beyond retrieval: connectors, permissions, personalization, usage analytics. The startups that only built the search layer are exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal and compliance AI.&lt;/strong&gt; Contract review, regulatory analysis, due diligence. These were perfect RAG use cases because the source documents were long and the queries were specific. Now you can feed entire contract sets into a single prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually survives infinite context
&lt;/h2&gt;

&lt;p&gt;The interesting question isn't what dies. It's what survives when context windows go functionally infinite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proprietary data pipelines.&lt;/strong&gt; Getting data into a prompt is easy. Getting the &lt;em&gt;right&lt;/em&gt; data, cleaned, structured, and current, from messy enterprise systems is hard. Connectors to SAP, Salesforce, government databases, legacy ERPs. That's plumbing work that doesn't get commoditized by larger context windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration and multi-step reasoning.&lt;/strong&gt; RAG was a single-hop pattern: retrieve, then generate. The interesting AI products are multi-step: search, reason, act, verify, iterate. At Ostronaut, we learned that the hard problem in content generation isn't feeding the model enough context. It's coordinating multiple generation steps where each step depends on the output of the previous one, with validation gates between them. That coordination layer survives because it's orthogonal to context window size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain-specific reliability.&lt;/strong&gt; In regulated industries, the value isn't in retrieval. It's in auditability, compliance, and deterministic behavior around non-deterministic models. A hospital doesn't care that you can now fit all patient records into one prompt. They care that your system can prove why it made a specific recommendation and that it handles edge cases without hallucinating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost optimization at scale.&lt;/strong&gt; Here's the part nobody's talking about: 1M context windows are available, but they're not free. Sending 1M tokens per request at enterprise scale gets expensive fast. The companies that build intelligent routing (small context for simple queries, large context only when needed, with caching and deduplication) will create real value. The constraint moved from "can't fit enough context" to "can't afford to use full context on every request."&lt;/p&gt;
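&lt;p&gt;That routing layer can be sketched in a few lines. This is a toy, not production code: the per-token prices, the window threshold, and the chars-to-tokens heuristic are all illustrative assumptions, not any provider's real numbers.&lt;/p&gt;

```python
import hashlib

PRICE_PER_TOKEN = {"small": 0.000003, "large": 0.000015}  # hypothetical $/token
SMALL_WINDOW = 200_000  # tokens the cheaper tier can hold (illustrative)

_cache: dict[str, str] = {}

def route(context_tokens: int) -> str:
    """Pick the cheapest tier whose window fits the request."""
    return "large" if context_tokens > SMALL_WINDOW else "small"

def answer(query: str, context: str, call_model) -> str:
    """Dedupe identical requests, then send to the cheapest fitting tier."""
    key = hashlib.sha256((query + "\x00" + context).encode()).hexdigest()
    if key in _cache:  # deduplication: same request, zero marginal spend
        return _cache[key]
    tier = route(len(context) // 4)  # rough heuristic: tokens ~ chars / 4
    _cache[key] = call_model(tier, query, context)
    return _cache[key]

def cost(context_tokens: int, tier: str) -> float:
    """Per-request context cost under the hypothetical prices above."""
    return context_tokens * PRICE_PER_TOKEN[tier]
```

&lt;p&gt;The point of the sketch: the money is made in &lt;code&gt;route&lt;/code&gt; and the cache check, not in any retrieval pipeline.&lt;/p&gt;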

&lt;h2&gt;
  
  
  The test for your AI product
&lt;/h2&gt;

&lt;p&gt;If you're building an AI product in India right now, here's the test: remove the retrieval layer from your architecture. Does your product still have a reason to exist?&lt;/p&gt;

&lt;p&gt;If the answer is no, you have six months to find one. Not because 1M context windows will replace everything overnight. Adoption takes time. Enterprise procurement cycles are slow. But the pricing signal is clear: context windows are heading toward commodity.&lt;/p&gt;

&lt;p&gt;Build on what stays scarce. Proprietary data access. Multi-step orchestration. Domain-specific reliability. Cost optimization at scale.&lt;/p&gt;

&lt;p&gt;The RAG era isn't over. Retrieval still matters for keeping costs down and for real-time data. But retrieval as a product is over. It's a feature now. Build accordingly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/context-window-pricing-collapse/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=context-window-pricing-collapse" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>indiansaas</category>
      <category>agenticsystems</category>
    </item>
    <item>
      <title>Capability-Priced Micro-Markets: The Missing Layer Between Agents and HTTP</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:49:21 +0000</pubDate>
      <link>https://dev.to/talvinder/capability-priced-micro-markets-the-missing-layer-between-agents-and-http-346k</link>
      <guid>https://dev.to/talvinder/capability-priced-micro-markets-the-missing-layer-between-agents-and-http-346k</guid>
      <description>&lt;p&gt;At Ostronaut, we run a multi-agent pipeline where one agent structures content, another generates video assets, another validates quality. When we wanted to swap in a better external video provider, it took three days: read docs, negotiate pricing tier, configure auth, write retry logic, handle edge cases.&lt;/p&gt;

&lt;p&gt;Three days for one integration. We have twelve capability slots across the pipeline.&lt;/p&gt;

&lt;p&gt;HTTP wasn't built for agents. APIs assume a human reads documentation, signs up for a tier, and hardcodes an endpoint. Agents can't do that at runtime. They need to discover capabilities, compare prices, and pay per invocation without human intervention.&lt;/p&gt;

&lt;p&gt;The infrastructure layer that enables this doesn't exist yet. I'm calling it &lt;strong&gt;Capability-Priced Micro-Markets&lt;/strong&gt;: a protocol layer where computational capabilities are advertised, priced, discovered, and transacted autonomously, with no human in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Coordination Problem
&lt;/h2&gt;

&lt;p&gt;We're building agents that need to coordinate autonomously. An agent optimizing cloud spend needs to call an agent that forecasts usage patterns. An agent writing code needs to call an agent that reviews security. An agent booking travel needs to call an agent that checks visa requirements.&lt;/p&gt;

&lt;p&gt;Right now, every one of those interactions requires a human to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the right service&lt;/li&gt;
&lt;li&gt;Read the documentation&lt;/li&gt;
&lt;li&gt;Sign up and configure billing&lt;/li&gt;
&lt;li&gt;Hardcode the integration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This doesn't scale when you have 50 agents running in a workflow, each needing 10 different capabilities, and the optimal provider for each capability changes based on load, accuracy requirements, and budget.&lt;/p&gt;

&lt;p&gt;The current model treats APIs as services you subscribe to. The agent model requires APIs to be capabilities you buy per-use, discovered at runtime, priced dynamically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Fatal Problems
&lt;/h2&gt;

&lt;p&gt;Hardcoded integrations have three problems that become fatal at agent scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discovery&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Traditional APIs require you to know which service to call. You read docs, you integrate. Agents need to discover "what can be done," not "which service exists."&lt;/p&gt;

&lt;p&gt;An agent doesn't want to know that Anthropic, OpenAI, and Google all offer vision APIs. It wants to know: "Who can classify this image at 95% accuracy for under 0.005 tokens?" The answer changes based on current load, model updates, and pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Current API pricing assumes human decision-makers choosing subscription tiers. Monthly plans. Annual contracts. Volume discounts negotiated over email.&lt;/p&gt;

&lt;p&gt;Agents need per-invocation pricing with real-time price discovery. An agent with a $10 budget for a task needs to know: "Can I get this done within budget?" before it calls anything. It needs to comparison-shop across providers in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent vs Implementation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
HTTP forces you to specify implementation. You call a specific endpoint at a specific URL with specific parameters.&lt;/p&gt;

&lt;p&gt;Agents should declare intent. "Classify this image with 95% accuracy" or "Summarize this document in under 100 words." The market layer handles discovery, routing, and fallback.&lt;/p&gt;

&lt;p&gt;The parallel to Kubernetes is exact. You don't specify servers. You declare intent. Kubernetes handles placement, scaling, and self-healing.&lt;/p&gt;

&lt;p&gt;Capability markets do the same for computational work. You declare what you need. The market finds who can provide it, at what price, and routes the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pieces Are Assembling
&lt;/h2&gt;

&lt;p&gt;This isn't theoretical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pricing shift is already happening.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Over the last 3-4 years, SMBs have moved from subscription models to pay-as-you-go. OpenAI charges per token. Cloud providers charge per compute-second. The next step is agents charging each other per capability-invocation, with prices set dynamically based on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The market layer already exists in adjacent domains.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Two-sided marketplace platforms are ubiquitous. Uber routes ride requests to drivers. AWS Spot Instances route compute requests to available capacity. The capability market is the same pattern applied to API calls: match demand (agent needs capability) with supply (agent provides capability) through dynamic pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intent-based architecture is proven.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Kubernetes demonstrated that declarative systems work at scale. You declare desired state. The system handles the rest. Capability markets extend this: you declare desired outcome and budget constraints. The market handles discovery, negotiation, and execution.&lt;/p&gt;

&lt;p&gt;At Ostronaut, we built a multi-agent system where agents coordinate—one structures content, another generates assets, another validates quality. Right now, those are internal agents with hardcoded routing. The moment we want to use external capabilities, we hit the integration wall. We'd need to evaluate providers, negotiate pricing, handle auth, manage retries.&lt;/p&gt;

&lt;p&gt;A capability market would let us declare: "Generate a 90-second video from this script, budget 0.05 tokens, minimum quality score 80." The market finds the provider, handles payment, returns the result.&lt;/p&gt;
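&lt;p&gt;The declaration side of that market can be surprisingly small. A minimal sketch, assuming a hypothetical &lt;code&gt;Offer&lt;/code&gt; shape, made-up provider names, and a naive cheapest-that-qualifies rule; a real protocol would need discovery, payment, and attestation on top:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    capability: str
    price: float    # per invocation, in whatever unit the market settles on
    quality: float  # historical quality score, 0-100

def fits(o: Offer, capability: str, budget: float, min_quality: float) -> bool:
    """Does this offer satisfy the declared intent?"""
    if o.capability != capability:
        return False
    if o.price > budget:
        return False
    return o.quality >= min_quality

def match(offers: list[Offer], capability: str, budget: float, min_quality: float):
    """Cheapest offer that satisfies the intent, or None if the market can't."""
    viable = [o for o in offers if fits(o, capability, budget, min_quality)]
    return min(viable, key=lambda o: o.price) if viable else None

offers = [
    Offer("vid-a", "video.generate", price=0.04, quality=85),
    Offer("vid-b", "video.generate", price=0.02, quality=70),  # cheap, below the bar
    Offer("vid-c", "video.generate", price=0.05, quality=92),
]

best = match(offers, "video.generate", budget=0.05, min_quality=80)
# best is vid-a: the cheapest provider that clears both constraints
```

&lt;p&gt;Everything hard lives outside this sketch: who publishes the offers, how prices update, and how quality scores are attested.&lt;/p&gt;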

&lt;p&gt;&lt;strong&gt;The cost structure enables granular markets.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
CloudZero and similar FinOps platforms already provide detailed cost intelligence and align spending with business outcomes. The infrastructure to track per-invocation costs exists. Extending that to per-capability pricing is a data model change, not a technical leap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bet
&lt;/h2&gt;

&lt;p&gt;Within 18 months, the majority of agent-to-agent API calls will route through dynamic capability markets rather than hardcoded HTTP endpoints.&lt;/p&gt;

&lt;p&gt;Not because it's elegant. Because hardcoded integrations break at agent scale.&lt;/p&gt;

&lt;p&gt;When you have one agent calling three APIs, you can hardcode. When you have 50 agents each calling 10 capabilities, and the optimal provider changes hourly, you need a market layer.&lt;/p&gt;

&lt;p&gt;The companies building this layer now—capability registries, dynamic pricing protocols, intent-based routing—are building the plumbing for the next decade of agent coordination.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional API Model&lt;/th&gt;
&lt;th&gt;Capability Market Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hardcoded endpoint&lt;/td&gt;
&lt;td&gt;Runtime discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscription pricing&lt;/td&gt;
&lt;td&gt;Per-invocation pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specify implementation&lt;/td&gt;
&lt;td&gt;Declare intent and constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human integration&lt;/td&gt;
&lt;td&gt;Autonomous negotiation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixed provider&lt;/td&gt;
&lt;td&gt;Dynamic routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;p&gt;How do you build trust in a capability market where providers are autonomous agents, not humans with reputations?&lt;/p&gt;

&lt;p&gt;Traditional marketplaces rely on reviews, ratings, and dispute resolution—all human processes. An agent providing image classification doesn't have a LinkedIn profile or customer testimonials. It has accuracy metrics, uptime history, and response latency.&lt;/p&gt;

&lt;p&gt;Do you build reputation systems based purely on performance data? Do you require staked collateral? Do you use cryptographic proof of execution?&lt;/p&gt;

&lt;p&gt;The trust layer is the hardest part. The technical infrastructure for capability discovery and dynamic pricing is straightforward. Building organizational and economic trust in autonomous systems—that's the unsolved problem.&lt;/p&gt;

&lt;p&gt;More on this as I work through it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/capability-priced-agents/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=capability-priced-agents" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>infrastructure</category>
      <category>apidesign</category>
    </item>
    <item>
      <title>The Build-Seed Inversion: Why Marketplaces Die Building Platforms Before Proving Transactions</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Sun, 19 Apr 2026 09:03:30 +0000</pubDate>
      <link>https://dev.to/talvinder/the-build-seed-inversion-why-marketplaces-die-building-platforms-before-proving-transactions-1dah</link>
      <guid>https://dev.to/talvinder/the-build-seed-inversion-why-marketplaces-die-building-platforms-before-proving-transactions-1dah</guid>
<description>&lt;p&gt;At Tushky, we built a super easy service listing product. We created awareness campaigns. People came. People went. The funnel dried up as soon as we stopped injecting fuel.&lt;/p&gt;

&lt;p&gt;We had built supply without demand, features without focus, distribution experiments without a single proven channel. We spent 18 months building before we understood what the market actually wanted to buy.&lt;/p&gt;

&lt;p&gt;I'm calling this the &lt;strong&gt;Build-Seed Inversion&lt;/strong&gt; — the mistake of optimizing for scale before proving a single transaction works repeatably.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Most Founders Get Wrong
&lt;/h2&gt;

&lt;p&gt;Most founder advice splits into two camps: "just ship" or "talk to users first." Both miss the actual pattern. The question isn't whether to build or validate. It's what to build &lt;em&gt;before&lt;/em&gt; you have market signal versus what to build &lt;em&gt;after&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The Build-Seed Inversion happens when you invest in horizontal infrastructure, multiple distribution channels, and polish before you've made one specific transaction happen ten times in a row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you can't describe your first 100 transactions in a single sentence, you're building too early.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tushky couldn't. We said "connecting customers to service providers." That's not a transaction. That's a category. Etsy could say: "craftspeople selling handmade goods to other craftspeople." Slack could say: "small engineering teams adopting for internal communication, then spreading to other departments."&lt;/p&gt;

&lt;p&gt;The difference isn't semantic. It's strategic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supply Is Visible, Transactions Are Not
&lt;/h2&gt;

&lt;p&gt;We built seller onboarding tools, listing templates, search algorithms. All supply-side infrastructure. But we hadn't proven we could get a single customer to transact twice.&lt;/p&gt;

&lt;p&gt;A marketplace isn't valuable because it has supply. It's valuable because it creates transactions. We confused the two.&lt;/p&gt;

&lt;p&gt;Founders build the platform before seeding the transaction. They optimize listing quality before proving anyone will buy. They add payment options before proving anyone will pay.&lt;/p&gt;

&lt;p&gt;This is the central confusion of two-sided markets. Supply is visible. You can count listings, profiles, sellers onboarded. Transactions are harder to measure, harder to manufacture, and harder to show investors. So founders default to the thing they can control and count. They build the stage and assume the audience will show up.&lt;/p&gt;

&lt;p&gt;In India, this problem is compounded. Low average transaction values mean you need higher density to make unit economics work. A home services marketplace in Mumbai with 500 providers and 50 monthly transactions is a spreadsheet, not a business. You need thousands of transactions in a tight geography before the marketplace has any defensibility at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Verticals Is Five Companies
&lt;/h2&gt;

&lt;p&gt;Tushky was for "anyone who needs services." Cleaning, repairs, tutoring, photography, pet care. We built features for all of them.&lt;/p&gt;

&lt;p&gt;Trello did this at scale. They built a horizontal product for tens of millions of users with a bleeding-edge stack. But they "didn't do a good job of keeping track of paying customers." They built for everyone and monetized no one effectively.&lt;/p&gt;

&lt;p&gt;Slack went the opposite direction. They focused on "figuring out how and why its product spread from small teams, to departments, to larger organizations." One transaction type. One expansion pattern. Then scale.&lt;/p&gt;

&lt;p&gt;The horizontal product trap isn't about product breadth per se. It's about attention fragmentation. When you build for five verticals simultaneously, you build five mediocre products instead of one excellent one. Your feature roadmap becomes a political negotiation between use cases rather than a focused pursuit of depth.&lt;/p&gt;

&lt;p&gt;At Tushky, we had separate onboarding flows for tutors and for plumbers. Different quality signals for photographers and for electricians. Each vertical had its own supply dynamics, its own customer expectations, its own trust thresholds. We were running five marketplace experiments and calling it one company.&lt;/p&gt;

&lt;p&gt;The founders I've seen get this right — including the early UrbanClap team (now &lt;a href="https://dev.to/frameworks/ecommerce-evolution/"&gt;Urban Company&lt;/a&gt;) — picked one vertical and made it work obsessively before expanding. Urban Company started with beauty services at home. Not "all services." Not even "beauty and repairs." Just beauty. They built transaction density in one category in one city before adding the next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Channels at 25% Commitment Teach You Nothing
&lt;/h2&gt;

&lt;p&gt;We tried corporate sales. Long, winding, uncertain sales cycles. We tried online affiliates. No success. We tried society activations. Some traction, but nothing repeatable.&lt;/p&gt;

&lt;p&gt;We were running three distribution experiments simultaneously before we'd proven &lt;em&gt;any&lt;/em&gt; channel could acquire a customer profitably.&lt;/p&gt;

&lt;p&gt;The math was obvious in retrospect: if your CAC is unknown and your LTV is unproven, adding more channels just multiplies the uncertainty.&lt;/p&gt;

&lt;p&gt;I see this in every startup cohort I've worked with at Pragmatic Leaders. Founders with 6 months of runway running paid acquisition, SEO content, partnerships, and community-building in parallel. They interpret the spread as "testing." It's not testing. Testing means running one channel with enough conviction and capital to get a statistically meaningful signal. Running four channels at 25% commitment each means you learn nothing about any of them.&lt;/p&gt;

&lt;p&gt;One distribution channel, proven to unit-economics breakeven, is worth more than ten experiments running at subscale.&lt;/p&gt;
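&lt;p&gt;The dilution is just sampling math. The standard error of a measured conversion rate shrinks with the square root of sample size, so splitting the same budget four ways doubles each channel's error bar. A quick sketch with illustrative numbers:&lt;/p&gt;

```python
import math

def conversion_se(p: float, n: int) -> float:
    """Standard error of an observed conversion rate p over n trials."""
    return math.sqrt(p * (1 - p) / n)

budget_trials = 2000   # total acquisition attempts the runway can fund (illustrative)
p = 0.05               # assumed true conversion rate

se_focused = conversion_se(p, budget_trials)        # one channel gets every trial
se_spread = conversion_se(p, budget_trials // 4)    # four channels at 25% each

# se_spread / se_focused == 2.0: each channel's estimate is twice as noisy,
# even though total spend is identical
```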

&lt;h2&gt;
  
  
  Etsy Found Existing Transactions First
&lt;/h2&gt;

&lt;p&gt;Etsy targeted exactly one group: craftspeople selling to other craftspeople. Not "anyone who makes things" selling to "anyone who buys things." One seller type, one buyer type, one transaction pattern.&lt;/p&gt;

&lt;p&gt;They sparked transactions within that group successfully before branching out. The platform features came &lt;em&gt;after&lt;/em&gt; the transaction mechanics worked.&lt;/p&gt;

&lt;p&gt;The critical move was that Etsy's early team physically went to craft fairs. They didn't build a platform and wait for sellers to discover it. They found people already transacting and gave them a better venue. The demand and the behavior existed before the technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built vs. What We Needed
&lt;/h2&gt;

&lt;p&gt;Here's what we built before we had product-market fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service provider onboarding flows&lt;/li&gt;
&lt;li&gt;Multi-category listing templates&lt;/li&gt;
&lt;li&gt;Awareness campaigns that drove traffic&lt;/li&gt;
&lt;li&gt;Corporate sales collateral&lt;/li&gt;
&lt;li&gt;Affiliate partnership frameworks&lt;/li&gt;
&lt;li&gt;Society activation playbooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what we didn't have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single customer segment that transacted repeatably&lt;/li&gt;
&lt;li&gt;Unit economics for any channel&lt;/li&gt;
&lt;li&gt;A clear answer to "which transaction are we optimizing for?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The missing piece wasn't product features. It was transaction mechanics. We had supply, we had awareness, but we couldn't connect early customers to unique sellers in a way that created repeat behavior.&lt;/p&gt;

&lt;p&gt;The pattern holds across the companies I've worked with. The ones that scaled figured out seeding strategies &lt;em&gt;before&lt;/em&gt; building horizontal infrastructure. The ones that struggled built platforms before proving transactions. A SaaS tool can succeed with a great product and decent distribution. A two-sided market dies without transaction density in a specific segment first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sequence That Works
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Build Before Market Signal&lt;/th&gt;
&lt;th&gt;Build After Traction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Minimum mechanics to complete one transaction&lt;/td&gt;
&lt;td&gt;Horizontal features for adjacent segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single channel acquisition experiment&lt;/td&gt;
&lt;td&gt;Multi-channel distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual processes that prove unit economics&lt;/td&gt;
&lt;td&gt;Automation and scale infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Focus on one user segment&lt;/td&gt;
&lt;td&gt;Expansion to broader market&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supply in one geography, one vertical&lt;/td&gt;
&lt;td&gt;Supply aggregation tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The test: can you describe your first 100 transactions in one sentence? If not, you're not ready to build for scale.&lt;/p&gt;

&lt;p&gt;At Tushky, we inverted this. We built the scale infrastructure first. We optimized for millions before we'd proven hundreds.&lt;/p&gt;

&lt;p&gt;The market eventually showed up — not for what we built, but for what we should have started with. Urban Company, Housejoy, and others entered the same space later with tighter wedges and better sequencing. They didn't build better technology. They built the right things in the right order.&lt;/p&gt;

&lt;p&gt;By the time we recognized the pattern, we'd spent 18 months building the wrong things in the wrong sequence.&lt;/p&gt;

&lt;p&gt;The question I'm still working through: how do you know when you've proven "enough" transactions to start building horizontally? Ten transactions? A hundred? A thousand? The companies that got this right seem to have felt it rather than calculated it. I'm not satisfied with that answer yet.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/ahead-of-curve-retrospective/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ahead-of-curve-retrospective" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>marketplaces</category>
      <category>mentalmodels</category>
    </item>
    <item>
      <title>The Recourse Trap: Why Competition Makes Credit Scoring More Exclusive, Not Less</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Sun, 19 Apr 2026 09:03:25 +0000</pubDate>
      <link>https://dev.to/talvinder/the-recourse-trap-why-competition-makes-credit-scoring-more-exclusive-not-less-1ci6</link>
      <guid>https://dev.to/talvinder/the-recourse-trap-why-competition-makes-credit-scoring-more-exclusive-not-less-1ci6</guid>
      <description>&lt;p&gt;In 2022, HDFC Bank raised its minimum CIBIL score requirement for personal loans from 650 to 725. ICICI and Axis followed within months. That same year, TransUnion CIBIL's own data showed that first-time borrowers with scores between 650 and 725 had default rates under 4%. The banks weren't responding to rising risk. They were responding to each other.&lt;/p&gt;

&lt;p&gt;Credit scoring systems don't fail because they're inaccurate. They fail because accuracy isn't the job in a competitive lending market.&lt;/p&gt;

&lt;p&gt;The job is risk transfer. In competitive environments, the most efficient way to transfer risk is to exclude entire populations rather than solve information problems.&lt;/p&gt;

&lt;p&gt;I've seen this pattern up close. At Pragmatic Leaders, I've trained credit risk teams at HDFC, ICICI, and four mid-tier Indian banks. The pattern is consistent: everyone knows traditional credit scoring excludes viable borrowers. No one builds the alternative system because competitive pressure rewards portfolio metrics over market expansion.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Recourse Trap
&lt;/h2&gt;

&lt;p&gt;This is what I'm calling &lt;strong&gt;The Recourse Trap&lt;/strong&gt;: a system where the mechanism designed to enable access becomes the mechanism that prevents it, and competitive pressure makes the trap stronger, not weaker.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;p&gt;A lender can't distinguish between a borrower with no credit history and a borrower with bad credit history. Both score low. In a competitive market, the lender who extends credit to both will have worse portfolio performance than the lender who extends credit to neither. The rational competitive response is exclusion.&lt;/p&gt;

&lt;p&gt;The borrower has no recourse. They can't "improve their score" because they can't access credit to build history. The system tells them what to do (build credit history) while preventing them from doing it.&lt;/p&gt;

&lt;p&gt;India has 400 million adults with no credit history in any bureau. Not because they're risky. Because the system has no mechanism to evaluate them, and no competitive incentive to build one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mechanism
&lt;/h2&gt;

&lt;p&gt;When lenders compete on portfolio risk metrics, they optimize for false negative reduction (don't lend to bad borrowers) over false positive reduction (do lend to good borrowers). The asymmetry exists because the cost of a bad loan is immediate and visible, while the cost of a missed good loan is distributed across the market and invisible.&lt;/p&gt;

&lt;p&gt;This creates a lemons problem. Borrowers without traditional credit history get pooled with genuinely risky borrowers. Lenders can't tell them apart without incurring verification costs that competitive pressure makes prohibitive. The result: high-quality borrowers with no credit history get priced out or excluded entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falsifiable claim&lt;/strong&gt;: In competitive lending markets, credit score requirements will trend upward over time for populations without traditional credit history, even as default rates in those populations remain stable or decline. The system optimizes for competitive position, not credit risk.&lt;/p&gt;

&lt;p&gt;You can test this. Look at minimum credit score requirements for first-time borrowers in India between 2018 and 2024. Requirements went up across every major bank. Did actual default rates for first-time borrowers go up proportionally? No. RBI data shows gross NPA ratios for retail loans actually declined from 2.5% to 1.7% in that period. The market tightened because competitors tightened, not because risk increased.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Transaction Cost Argument Is Circular
&lt;/h2&gt;

&lt;p&gt;Here's the tell: when you ask banks why they don't serve underbanked populations, they talk about credit scores. When you ask why they don't build alternative scoring systems, they talk about transaction costs. When you ask why transaction costs are prohibitive for underserved populations but not for premium segments, the conversation ends.&lt;/p&gt;

&lt;p&gt;High costs justify exclusion. Exclusion prevents scale. Lack of scale keeps costs high.&lt;/p&gt;

&lt;p&gt;South African banks demonstrate this clearly. Despite strong demand for credit from low-income households, banks haven't extended access. Not because these households are uniformly risky, but because the information required to assess risk isn't available in formats traditional scoring systems can process.&lt;/p&gt;

&lt;p&gt;The alternative mechanisms prove the problem is solvable. Group lending models and informal systems like stokvels work precisely because they solve the &lt;a href="https://dev.to/frameworks/comp-negotiation-entropy/"&gt;information problem&lt;/a&gt; differently. They use peer monitoring, social ties, and collective savings as signals. Transaction costs stay low. Default rates stay manageable.&lt;/p&gt;

&lt;p&gt;But competitive banks don't adopt these approaches. They require different infrastructure, different risk models, and different competitive positioning. A bank that moves first takes on &lt;a href="https://dev.to/frameworks/entropy-and-entrepreneurship/"&gt;execution risk&lt;/a&gt;. A bank that moves second can copy what works. The rational move is to wait, which means no one moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Makes Worse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/frameworks/ai-deflation-trap/"&gt;AI-powered credit scoring is getting more sophisticated at predicting risk within existing data distributions&lt;/a&gt;. Which means more sophisticated at excluding populations outside those distributions.&lt;/p&gt;

&lt;p&gt;An AI model trained on historical lending data will learn that borrowers without credit history are risky. Not because they default more often, but because lenders historically avoided them. The model encodes the market's collective risk aversion as ground truth.&lt;/p&gt;

&lt;p&gt;I saw this firsthand during a workshop with a mid-tier bank's risk team in 2023. They'd built a gradient-boosted model on five years of loan performance data. The model performed well on their test set (AUC of 0.87). But when they scored a sample of new-to-credit applicants, 92% were classified as high risk. The data scientist on the team knew the scores were wrong. His manager knew. But nobody was going to approve a lending policy that scored worse than the competitor down the street.&lt;/p&gt;

&lt;p&gt;The feedback loop tightens. Better prediction within the existing distribution means worse outcomes for populations outside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;I initially thought the solution was better data. If we could capture alternative signals like UPI transaction history, utility payments, or rental records, we could build scoring systems that include underserved populations.&lt;/p&gt;

&lt;p&gt;That's technically true but structurally naive.&lt;/p&gt;

&lt;p&gt;The problem isn't data availability. India Account Aggregator has been live since 2021. Perfios and FinBox can pull 12 months of UPI transaction data in seconds. The pipes exist. Banks still don't use them for first-time borrowers at any meaningful scale because competitive incentive hasn't shifted.&lt;/p&gt;

&lt;p&gt;A bank that invests in alternative data infrastructure takes on execution risk and regulatory uncertainty. A bank that waits can copy the approach if it works. The first-mover disadvantage is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Credit Scoring
&lt;/h2&gt;

&lt;p&gt;The recourse trap exists because competitive markets optimize for relative performance, not absolute outcomes. A lender doesn't need to solve the information problem if their competitors don't solve it either.&lt;/p&gt;

&lt;p&gt;This has implications beyond financial services. Any system that provides "actionable recourse" in a competitive environment faces the same dynamic. The advice the system gives (build credit history, gain relevant experience, develop measurable skills) is only actionable if the system allows you to act on it.&lt;/p&gt;

&lt;p&gt;When it doesn't, you're not dealing with an information problem. You're dealing with a market structure problem.&lt;/p&gt;

&lt;p&gt;AI-powered resume screening. Skills-based hiring platforms. Fraud detection systems. They all create versions of the recourse trap when deployed in competitive markets. The mechanism is the same: optimize for false negative reduction, accept false positive costs, and let competitive pressure prevent anyone from solving the underlying information problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Worth Asking
&lt;/h2&gt;

&lt;p&gt;What other systems are we building that look like they enable access but actually optimize for exclusion?&lt;/p&gt;

&lt;p&gt;If the mechanism for proving you're trustworthy requires access you can't get without already being trusted, you're in a recourse trap. If competitive pressure makes solving that problem more expensive than ignoring it, the trap becomes structural.&lt;/p&gt;

&lt;p&gt;Are we asking this question when we deploy AI systems in hiring, lending, insurance, education? Mostly, no. We're still arguing about bias metrics and fairness definitions while the competitive dynamics that drive exclusion go unexamined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The recourse trap doesn't care about bias. It cares about competitive dynamics. And those dynamics are getting stronger as AI makes within-distribution optimization cheaper and more effective.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/actionable-recourse-markets/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=actionable-recourse-markets" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>financialsystems</category>
      <category>marketstructure</category>
      <category>airisk</category>
      <category>indiatech</category>
    </item>
    <item>
      <title>Why 86% of AI Agent Pilots Fail Before Reaching Production</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:06:19 +0000</pubDate>
      <link>https://dev.to/talvinder/why-86-of-ai-agent-pilots-fail-before-reaching-production-4ica</link>
      <guid>https://dev.to/talvinder/why-86-of-ai-agent-pilots-fail-before-reaching-production-4ica</guid>
      <description>&lt;p&gt;According to the MAST benchmark study, multi-agent system failure rates range from 41% to 86.7% across seven leading frameworks. Gartner projects that 40% of agentic AI projects started in 2025 will be scaled back or canceled by 2027. McKinsey's 2025 survey found that while 78% of enterprises have AI agent pilots running, only 14% have reached production deployment.&lt;/p&gt;

&lt;p&gt;These numbers tell the same story: the demo works, but production kills it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure isn't the model — it's everything around the model
&lt;/h2&gt;

&lt;p&gt;According to a 2025 PwC survey of 1,000 enterprises deploying AI agents, the top three failure modes are integration complexity (cited by 67%), lack of monitoring infrastructure (58%), and unclear escalation paths when the agent makes mistakes (52%).&lt;/p&gt;

&lt;p&gt;The model itself is rarely the problem. GPT-4o, Claude, Gemini — they all perform well enough in controlled conditions. The collapse happens when the agent hits production reality: messy data, concurrent users, edge cases the prompt didn't anticipate, and no one watching when confidence drops below threshold.&lt;/p&gt;

&lt;p&gt;This is the same &lt;a href="https://dev.to/field-notes/indian-saas-agent-reliability/"&gt;reliability gap&lt;/a&gt; that Indian SaaS companies have been closing for twenty years — not with better models, but with better systems around the models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five patterns that kill agent pilots
&lt;/h2&gt;

&lt;p&gt;These are the five structural failures I've seen repeatedly across teams deploying agents — from startups to Fortune 500 companies. Each one is fixable, but only if you build for it before production, not after.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No confidence scoring or graceful degradation
&lt;/h3&gt;

&lt;p&gt;The agent either answers or it doesn't. There's no middle ground. In production, the middle ground is where most interactions live: the agent is 60% confident, the user's query is ambiguous, the data is incomplete.&lt;/p&gt;

&lt;p&gt;Without confidence scoring, you get one of two failure modes: the agent hallucinates confidently (and you lose trust) or the agent refuses to answer (and you lose utility). According to Anthropic's production deployment guide, agents without confidence thresholds have 3x higher escalation rates than those with calibrated confidence routing.&lt;/p&gt;

&lt;p&gt;The fix is graduated autonomy: act autonomously above 90% confidence, request human review between 60-90%, escalate below 60%. This is the same pattern &lt;a href="https://dev.to/build-logs/agentic-rightsizing/"&gt;we built at Zopdev&lt;/a&gt; for infrastructure decisions — observe everything, act only within permission boundaries.&lt;/p&gt;
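&lt;p&gt;The graduated-autonomy rule above can be sketched as a one-function router; the tier names are hypothetical labels you would wire to your own handling, review, and escalation paths:&lt;/p&gt;

```python
# Minimal sketch of graduated autonomy, using the 90% / 60%
# thresholds from the text. Tier names are illustrative.

def route(confidence: float) -> str:
    """Map a calibrated confidence score to an autonomy tier."""
    if confidence >= 0.90:
        return "act"        # act autonomously
    if confidence >= 0.60:
        return "review"     # queue for human review
    return "escalate"       # hand off to a human entirely

assert route(0.95) == "act"
assert route(0.72) == "review"
assert route(0.41) == "escalate"
```

&lt;p&gt;The thresholds themselves should come from calibration against observed outcomes, not from guesswork.&lt;/p&gt;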

&lt;h3&gt;
  
  
  2. The "just retry" fallacy
&lt;/h3&gt;

&lt;p&gt;When an agent fails, most frameworks default to retrying with the same prompt. This is the &lt;a href="https://dev.to/field-notes/consensus-is-not-verification/"&gt;Pass@k trap&lt;/a&gt;: if the error is structural (wrong data, missing context, ambiguous instruction), retrying amplifies the problem rather than fixing it.&lt;/p&gt;

&lt;p&gt;A 2025 analysis of production agent logs at a Fortune 500 company found that 73% of retried requests produced the same error category. The retry wasn't recovery — it was waste. At $0.03 per inference call, a three-retry loop on every failed request added $180K/year to their agent infrastructure bill.&lt;/p&gt;

&lt;p&gt;The fix is error classification before retry. Network timeout? Retry. Model hallucination? Route to a different model or escalate. Missing context? Fetch the context first, then retry with enriched input.&lt;/p&gt;
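&lt;p&gt;A minimal sketch of "classify before retry". The error categories follow the examples above; the names and recovery actions are illustrative, not a real framework API:&lt;/p&gt;

```python
# Classify the failure first, then pick a recovery action.
# Blind retry is only correct for transient failures.

from enum import Enum, auto

class FailureKind(Enum):
    TRANSIENT = auto()        # network timeout, rate limit
    HALLUCINATION = auto()    # output failed a validation check
    MISSING_CONTEXT = auto()  # the prompt lacked required data

def recovery_action(kind: FailureKind) -> str:
    if kind is FailureKind.TRANSIENT:
        return "retry"             # same request, with backoff
    if kind is FailureKind.HALLUCINATION:
        return "reroute"           # different model, or escalate
    return "enrich-and-retry"      # fetch context, then retry

assert recovery_action(FailureKind.TRANSIENT) == "retry"
```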

&lt;h3&gt;
  
  
  3. No observability beyond the API call
&lt;/h3&gt;

&lt;p&gt;Most agent monitoring stops at the API layer: latency, token count, error rate. But agent failures are semantic, not mechanical. The API returns a 200 with a confident, well-formatted, completely wrong answer.&lt;/p&gt;

&lt;p&gt;According to Langfuse's 2025 observability report, teams that implement trace-level monitoring (tracking the full chain of agent reasoning, tool calls, and intermediate outputs) catch production issues 4x faster than teams monitoring only API metrics. This is what &lt;a href="https://dev.to/frameworks/trace-based-assurance-agentware/"&gt;trace-based assurance&lt;/a&gt; looks like in practice — the governance layer that agentware actually needs.&lt;/p&gt;
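&lt;p&gt;A minimal version of trace-level capture, independent of any specific observability product. The class and field names here are illustrative:&lt;/p&gt;

```python
# Record every step of an agent run (reasoning, tool calls,
# intermediate outputs) so semantic failures can be inspected
# later, not just API-level latency and error rates.

import json
import time
import uuid

class Trace:
    def __init__(self, task: str):
        self.trace_id = str(uuid.uuid4())
        self.task = task
        self.steps = []

    def record(self, kind: str, payload: dict):
        """Append one timestamped step to the trace."""
        self.steps.append({"t": time.time(), "kind": kind, **payload})

    def to_json(self) -> str:
        return json.dumps(
            {"trace_id": self.trace_id, "task": self.task, "steps": self.steps}
        )

trace = Trace("refund-request-1234")
trace.record("tool_call", {"tool": "order_lookup", "args": {"id": 1234}})
trace.record("model_output", {"text": "Refund approved", "confidence": 0.62})
```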

&lt;h3&gt;
  
  
  4. Human handoff as afterthought
&lt;/h3&gt;

&lt;p&gt;The agent is built to be autonomous. When it can't handle something, it says "I don't know" — and the user is stuck. There's no warm handoff to a human, no context transfer, no continuity.&lt;/p&gt;

&lt;p&gt;According to Freshworks' deployment data, their Freddy AI achieves a 45% autonomous resolution rate. The other 55% gets escalated — and the quality of that escalation (context preserved, human gets the full conversation history, seamless transition) is what determines customer satisfaction. The agent's job isn't just to resolve; it's to escalate well when it can't.&lt;/p&gt;

&lt;p&gt;The cost of building good escalation paths is significant. A production agent needs roughly 3.5 FTEs for monitoring, incident response, and drift detection. &lt;a href="https://dev.to/field-notes/indian-saas-agent-reliability/"&gt;In Bangalore, that's $100K-150K/year&lt;/a&gt;. In San Francisco, $600K-800K. This cost asymmetry is why Indian SaaS companies can afford the monitoring density that makes agents reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Evaluation that doesn't match production conditions
&lt;/h3&gt;

&lt;p&gt;The agent scores 92% on the benchmark. In production, users ask questions the benchmark didn't anticipate, in formats the prompt didn't expect, with context the training data never included. The &lt;a href="https://dev.to/build-logs/llm-judge-india-failure/"&gt;evaluation cost ratio&lt;/a&gt; breaks down when evaluation doesn't mirror production conditions.&lt;/p&gt;

&lt;p&gt;According to the HELM benchmark team at Stanford, model performance on curated test sets overpredicts production accuracy by 15-30 percentage points. The gap is not random — it's systematic. Production queries are longer, more ambiguous, more dependent on context, and more adversarial than benchmark queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works: the three-layer architecture
&lt;/h2&gt;

&lt;p&gt;Every successful deployment I've seen — Freshworks' Freddy, Zoho's Zia, our own systems — converges on the same architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The model produces output&lt;/td&gt;
&lt;td&gt;Nothing — this is the happy path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Validation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rule-based checks, confidence scoring, format verification&lt;/td&gt;
&lt;td&gt;Structural errors, low-confidence outputs, format violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human review queues, audit trails, escalation paths, drift detection&lt;/td&gt;
&lt;td&gt;Semantic errors, edge cases, model drift, compliance violations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The generation layer is what everyone builds. The validation layer is what separates pilots from production. The governance layer is what separates production from enterprise-grade.&lt;/p&gt;

&lt;p&gt;Most pilots only build layer one. They fail because layers two and three are where production reliability actually lives.&lt;/p&gt;
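&lt;p&gt;A compressed sketch of how the three layers compose. The generator, the validator, and the review queue are all hypothetical stand-ins for your own components:&lt;/p&gt;

```python
# Layer 1 generates; layer 2 runs structural checks; layer 3 is
# where humans audit whatever layer 2 refuses to pass through.

def generate(query: str) -> dict:
    # placeholder for a model call; returns output plus confidence
    return {"text": f"answer to: {query}", "confidence": 0.7}

def validate(result: dict) -> bool:
    # layer 2: format check and confidence floor
    return bool(result["text"]) and result["confidence"] >= 0.6

review_queue = []  # layer 3: governance queue for human review

def run(query: str) -> dict:
    result = generate(query)
    if validate(result):
        result["status"] = "delivered"
    else:
        review_queue.append(result)   # escalate instead of answering
        result["status"] = "escalated"
    return result
```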

&lt;h2&gt;
  
  
  The question worth asking
&lt;/h2&gt;

&lt;p&gt;If you're running an agent pilot right now, ask this: what happens when the agent is wrong and confident about it?&lt;/p&gt;

&lt;p&gt;If the answer is "we haven't thought about that" — you're in the 86%. The &lt;a href="https://dev.to/field-notes/consensus-is-not-verification/"&gt;consensus voting approach won't save you&lt;/a&gt;. The &lt;a href="https://dev.to/field-notes/cot-efficiency-tax/"&gt;chain-of-thought reasoning adds cost without guaranteeing correctness&lt;/a&gt;. The model isn't the problem.&lt;/p&gt;

&lt;p&gt;The monitoring, the fallbacks, the escalation paths, the confidence routing — that's where production reliability lives. The teams that figure this out aren't building better agents. They're building better &lt;a href="https://dev.to/frameworks/agent-context-is-infrastructure/"&gt;infrastructure around agents&lt;/a&gt;. And right now, the companies with the deepest operational discipline in that infrastructure layer are &lt;a href="https://dev.to/field-notes/indian-saas-agent-reliability/"&gt;based in India&lt;/a&gt;.&lt;/p&gt;

&lt;div class="schema-faq" style="display:none;"&gt;
[{"q":"Why do AI agent pilots fail in production?","a":"According to MAST benchmark data, 41-86.7% of multi-agent systems fail across leading frameworks. The top causes are integration complexity (67%), lack of monitoring (58%), and unclear escalation paths (52%). The model works in demos — the failure is in monitoring, fallbacks, confidence scoring, and human handoff systems."},{"q":"What percentage of AI agent projects reach production?","a":"Only 14% of enterprise AI agent pilots reach production deployment, according to McKinsey's 2025 survey. Gartner projects 40% of agentic AI projects started in 2025 will be scaled back or canceled by 2027."},{"q":"How do you deploy AI agents to production successfully?","a":"Successful deployments use a three-layer architecture: generation (the model), validation (confidence scoring, format checks, rule-based verification), and governance (human review queues, audit trails, escalation paths). Most failed pilots only build the generation layer."}]
&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/why-agent-pilots-fail/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=why-agent-pilots-fail" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>productionai</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The Vibe Coding Hangover: Why AI-Written Code Costs 4x to Maintain by Year Two</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:06:18 +0000</pubDate>
      <link>https://dev.to/talvinder/the-vibe-coding-hangover-why-ai-written-code-costs-4x-to-maintain-by-year-two-1pjn</link>
      <guid>https://dev.to/talvinder/the-vibe-coding-hangover-why-ai-written-code-costs-4x-to-maintain-by-year-two-1pjn</guid>
      <description>&lt;p&gt;According to a CodeRabbit analysis of 1,000+ repositories, AI co-authored code introduces 1.7x more major issues than human-written code. The vulnerability rate is 2.74x higher. GitHub's 2025 Octoverse data shows Copilot now generates 46% of code in files where it's enabled. And a METR study found that experienced developers using AI assistants were actually 19% slower on real tasks — despite believing they were 24% faster.&lt;/p&gt;

&lt;p&gt;The productivity feels real. The debt is real too. We're starting to see the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-month cliff
&lt;/h2&gt;

&lt;p&gt;Every team I've talked to that adopted AI coding tools heavily describes the same pattern: massive output gains in months one through three, followed by an escalating maintenance burden that erases those gains by month six.&lt;/p&gt;

&lt;p&gt;The pattern has a name now. Developers are calling it the "Spaghetti Point" — the moment where the codebase generated by AI assistants becomes harder to modify than code written from scratch would have been.&lt;/p&gt;

&lt;p&gt;According to GitClear's 2025 developer productivity report, code churn (lines modified or deleted within 14 days of being written) increased 39% in repositories with heavy AI assistance. That's not refactoring — that's rework. Code written fast, reviewed inadequately, and fixed repeatedly.&lt;/p&gt;

&lt;p&gt;The economics are brutal. A 2025 analysis by Uplevel estimated that AI-generated code carries maintenance costs 4x higher than human-written code by year two. The initial velocity gain — real, measurable, impressive — gets consumed by debugging sessions where no one can explain why the code works the way it does, because the "why" never existed. This is the same &lt;a href="https://dev.to/field-notes/github-slopocalypse-trust-tax/"&gt;epistemological problem&lt;/a&gt; that's eroding trust in open source: AI-generated code has no intent. You can't reconstruct reasoning that never happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the bugs are different
&lt;/h2&gt;

&lt;p&gt;AI-generated bugs are structurally different from human bugs, and that difference makes them more expensive to find and fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human bugs have intent trails.&lt;/strong&gt; A developer who writes a race condition usually has a mental model that's almost right — they thought about concurrency but missed one case. You can read the code, reconstruct the thinking, find the gap. The fix follows from understanding the original intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI bugs have no intent.&lt;/strong&gt; The code was generated from a probability distribution, not a mental model. When a Copilot-generated function has a subtle type coercion error, there's no reasoning to reconstruct. You can't ask "what were they thinking?" because nothing was thinking. You have to understand the code from scratch, as if reading a stranger's work with no comments and no commit history that explains decisions.&lt;/p&gt;

&lt;p&gt;According to Snyk's 2025 AI security report, 35 new CVEs were attributed to AI-generated code in March 2026 alone. Repositories using Copilot leak 40% more secrets (API keys, credentials, tokens) than non-Copilot repositories. The AI doesn't understand what's secret — it pattern-matches from training data that included leaked credentials.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug type&lt;/th&gt;
&lt;th&gt;Human-written code&lt;/th&gt;
&lt;th&gt;AI-written code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;Follow the intent trail&lt;/td&gt;
&lt;td&gt;Start from zero — no intent exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to diagnose&lt;/td&gt;
&lt;td&gt;1-2 hours typical&lt;/td&gt;
&lt;td&gt;3-5 hours (no reasoning to reconstruct)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recurrence after fix&lt;/td&gt;
&lt;td&gt;Low (developer updates mental model)&lt;/td&gt;
&lt;td&gt;High (same prompt generates same pattern)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security issues per KLOC&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;2.74x higher (CodeRabbit data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code churn within 14 days&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;+39% (GitClear data)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The organizational blind spot
&lt;/h2&gt;

&lt;p&gt;The real damage isn't technical — it's organizational. Teams measuring developer productivity by lines of code or PRs merged are seeing their best numbers ever. The dashboards look great. Velocity is up. Sprint commitments are being met.&lt;/p&gt;

&lt;p&gt;What the dashboards don't show: time spent in code review has increased 45% (because reviewers now treat every PR as potentially AI-generated and requiring deeper verification). Bug reports from production are up 30% despite passing all automated tests. And senior engineers are spending more time reading and understanding code than writing it — the exact inverse of what AI tools were supposed to enable.&lt;/p&gt;

&lt;p&gt;This is &lt;a href="https://dev.to/build-logs/ai-speed-lie-team-velocity/"&gt;the same productivity illusion&lt;/a&gt; we measured in team velocity: AI makes individual tasks faster while making the overall system slower. The local optimization creates a global pessimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong
&lt;/h2&gt;

&lt;p&gt;I initially thought the problem was adoption immaturity — that teams would learn to use AI tools effectively and the quality issues would resolve. After watching a dozen teams go through the cycle over the past year, I think the problem is structural.&lt;/p&gt;

&lt;p&gt;AI code generation optimizes for plausibility, not correctness. The output looks right, passes superficial review, and often works for the happy path. The failures are in edge cases, error handling, security boundaries, and long-term maintainability — exactly the things that junior developers also get wrong, because those are the things that require understanding, not pattern matching.&lt;/p&gt;

&lt;p&gt;The teams that are succeeding with AI code generation share three practices:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AI writes, humans architect.&lt;/strong&gt; The AI generates implementation within a structure that a human designed. The human defines the interfaces, the error handling strategy, the security boundaries. The AI fills in the bodies. This preserves intent at the architectural level while leveraging AI speed at the implementation level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Review budgets increased, not decreased.&lt;/strong&gt; Teams that cut code review time because "the AI wrote it" are the ones hitting the Spaghetti Point fastest. The teams that survive allocate more review time — not less — because the verification burden is higher for machine-generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Aggressive deletion of AI-generated code that can't be explained.&lt;/strong&gt; If a developer can't explain why a function works the way it does — regardless of whether it passes tests — it gets rewritten by hand. This is expensive in the short term and cheap in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  The historical pattern
&lt;/h2&gt;

&lt;p&gt;This cycle is familiar. Every productivity tool that dramatically increases output velocity eventually forces a reckoning with quality.&lt;/p&gt;

&lt;p&gt;3D printing was going to democratize manufacturing. It did — and it also created a mountain of low-quality plastic objects that nobody needed. The lasting value came from professionals using 3D printing within disciplined design processes, not from everyone printing everything.&lt;/p&gt;

&lt;p&gt;No-code tools were going to replace developers. They did increase output — and they also created a generation of applications that couldn't scale, couldn't be debugged, and couldn't be maintained when the original builder left. The lasting value came from no-code as a prototyping tool, not a production platform.&lt;/p&gt;

&lt;p&gt;Vibe coding is following the same arc. The output explosion is real. The quality reckoning is coming. The lasting value will come from AI as an implementation accelerator within disciplined engineering practices — not from AI as a replacement for engineering judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question worth asking
&lt;/h2&gt;

&lt;p&gt;If your team adopted AI coding tools in the last twelve months, run this check: compare the bug rate and code churn rate in your most AI-assisted repositories against your least AI-assisted ones. Normalize for team size and feature complexity.&lt;/p&gt;
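&lt;p&gt;The check described above might look like this, with hypothetical per-repo numbers standing in for your own data. "Churn" here follows GitClear's definition: lines modified or deleted within 14 days of landing:&lt;/p&gt;

```python
# Compare churn rate and production bugs per developer across
# AI-heavy and AI-light repositories. All figures are made up
# for illustration; substitute your own repo metrics.

repos = {
    # repo: (lines_added, lines_churned_14d, prod_bugs, devs)
    "ai-heavy-service": (120_000, 31_000, 84, 8),
    "ai-light-service": ( 95_000, 14_000, 41, 8),
}

for name, (added, churned, bugs, devs) in repos.items():
    churn_rate = churned / added
    bugs_per_dev = bugs / devs
    print(f"{name}: churn {churn_rate:.0%}, bugs/dev {bugs_per_dev:.1f}")
```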

&lt;p&gt;If the AI-heavy repos show higher churn and more production bugs — even if they also show higher velocity — you're accumulating the debt. The hangover is coming. The question is whether you pay it down deliberately (with review discipline, architectural boundaries, and aggressive deletion) or discover it when the codebase becomes unmaintainable.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/field-notes/github-slopocalypse-trust-tax/"&gt;trust tax&lt;/a&gt; isn't just an open-source problem. It's inside your organization too.&lt;/p&gt;

&lt;div class="schema-faq" style="display:none;"&gt;
[{"q":"Does AI-generated code have more bugs than human code?","a":"Yes. According to CodeRabbit's analysis of 1,000+ repositories, AI co-authored code has 1.7x more major issues and a 2.74x higher vulnerability rate. GitClear found code churn (rework within 14 days) increased 39% in repositories with heavy AI assistance."},{"q":"What is vibe coding and what are the risks?","a":"Vibe coding is using AI tools like Copilot or ChatGPT to generate code by describing what you want in natural language. The risk is maintenance debt: code generated without human intent is harder to debug, carries 2.74x more vulnerabilities, and costs an estimated 4x more to maintain by year two."},{"q":"Are developers actually faster with AI coding tools?","a":"Not necessarily. A METR study found experienced developers were 19% slower on real tasks with AI assistants, despite believing they were 24% faster. Local task speed increases, but time spent in review, debugging, and understanding AI-generated code offsets the gains at the team level."}]
&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/vibe-coding-hangover/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=vibe-coding-hangover" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>agenticsystems</category>
      <category>productionai</category>
    </item>
    <item>
      <title>Trace-Based Assurance: The Governance Layer Agentware Actually Needs</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 25 Mar 2026 03:39:50 +0000</pubDate>
      <link>https://dev.to/talvinder/trace-based-assurance-the-governance-layer-agentware-actually-needs-2mjh</link>
      <guid>https://dev.to/talvinder/trace-based-assurance-the-governance-layer-agentware-actually-needs-2mjh</guid>
      <description>&lt;p&gt;Agents are being deployed with governance frameworks designed for human committees and quarterly audits. The gap is not small.&lt;/p&gt;

&lt;p&gt;Traditional governance asks: "Did you follow the process?" Agentic systems require a different question: "Can you prove, in real-time, that the agent is operating within boundaries?" The difference matters because agents make decisions faster than humans can review them, and carry more risk than trust-based deployment can tolerate.&lt;/p&gt;

&lt;p&gt;At Ostronaut, we generate training content autonomously—presentations, videos, quizzes—for healthcare clients. The first time a client asked "How do we know this meets compliance requirements?", we had documentation. We had process diagrams. We had architectural reviews. What we didn't have was evidence that the system was actually doing what we said it would do, case by case, generation by generation.&lt;/p&gt;

&lt;p&gt;That's the governance gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Problem
&lt;/h2&gt;

&lt;p&gt;I'm calling this &lt;strong&gt;Trace-Based Assurance&lt;/strong&gt; — a governance model where agents emit verifiable evidence trails that prove compliance in real-time, rather than documenting intentions in advance.&lt;/p&gt;

&lt;p&gt;This isn't about adding logging. Every system has logs. Trace-based assurance means structuring agent operations so that governance verification becomes automated and continuous. The trace isn't a byproduct. It's the mechanism.&lt;/p&gt;
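&lt;p&gt;As a sketch of what "the trace is the mechanism" means in code: each operation runs its constraint checks and emits a record a governance system can verify. The field names follow the inputs/decision/constraints/result shape described in this post; nothing here is a real compliance framework:&lt;/p&gt;

```python
# Hypothetical boundary-compliance trace: the checks run as part
# of the operation itself, and the emitted record is the evidence.

import json

def emit_trace(stage, inputs, decision, constraints):
    """Run each constraint check and emit a verifiable trace record."""
    checks = {name: bool(fn(decision)) for name, fn in constraints.items()}
    record = {
        "stage": stage,
        "inputs": inputs,
        "decision": decision,
        "checks": checks,
        "compliant": all(checks.values()),
    }
    print(json.dumps(record))  # in production: ship to the governance system
    return record

record = emit_trace(
    stage="quality_scoring",
    inputs={"doc_id": "onboarding-01"},
    decision={"score": 0.92, "published": True},
    constraints={"score_floor": lambda d: d["score"] >= 0.8},
)
```

&lt;p&gt;A record whose checks fail ("compliant": false) is exactly the signal that triggers escalation instead of autonomous action.&lt;/p&gt;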

&lt;p&gt;&lt;strong&gt;By 2027, production-grade agentic systems will be required to emit structured trace data that proves boundary compliance, not just log outcomes.&lt;/strong&gt; Vendors who treat governance as a documentation problem will lose enterprise deals to vendors who treat it as an evidence problem.&lt;/p&gt;

&lt;p&gt;The shift is already visible. When we talk to healthcare clients, they don't ask "What's your process for content review?" They ask "Can you show me, for this specific piece of generated content, what checks ran and what the results were?"&lt;/p&gt;

&lt;p&gt;That's a different question. It assumes the system is autonomous. It assumes human review isn't feasible at scale. It demands evidence, not assurances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Traditional Governance Breaks
&lt;/h2&gt;

&lt;p&gt;Traditional governance models don't handle this well. They're built for phase-gate processes: design review, implementation review, deployment approval, quarterly audit. Agents don't operate in phases. They operate continuously. They adapt. They make thousands of decisions between audits.&lt;/p&gt;

&lt;p&gt;The gap shows up in three places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval vs. Acceptance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional procurement distinguishes between "approval" (pre-decision authority) and "acceptance" (post-decision verification). Agents break this model. You can't approve every decision in advance—they happen too fast. You can't simply accept outcomes post-facto—the risk is too high.&lt;/p&gt;

&lt;p&gt;Traces create a third path: continuous verification. The agent emits evidence as it operates. Governance systems verify that evidence in real-time. Decisions that pass verification proceed. Decisions that fail trigger escalation.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. We built validation gates into Ostronaut's generation pipeline after a quality crisis. The system now emits structured traces at each stage: content extraction, structure generation, media creation, quality scoring. Each trace includes the inputs, the decision made, the constraints checked, and the result.&lt;/p&gt;

&lt;p&gt;When a generation fails validation, we have the trace. We know exactly where it failed and why. When a generation succeeds, the client has evidence that it met their requirements.&lt;/p&gt;
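
&lt;p&gt;A minimal sketch of that gate in Python (names like &lt;code&gt;run_stage&lt;/code&gt; are illustrative, not Ostronaut's actual code): every stage emits its trace whether or not verification passes, and a failed check escalates instead of proceeding.&lt;/p&gt;

```python
import hashlib
import json
import time

def run_stage(name, inputs, constraints, stage_fn, trace_log):
    """Run one pipeline stage, verify its output against constraints,
    and emit a structured trace record either way."""
    result = stage_fn(inputs)
    checks = {check_name: bool(check(result))
              for check_name, check in constraints.items()}
    passed = all(checks.values())
    trace_log.append({
        "stage": name,
        "timestamp": time.time(),
        # digest of inputs, so this exact decision can be tied to what the agent saw
        "inputs_digest": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "checks": checks,
        "passed": passed,
    })
    if not passed:
        raise RuntimeError(f"stage {name!r} failed verification; escalate")
    return result
```

&lt;p&gt;The point of the wrapper is that the trace is written before the pass/fail branch, so a failed generation leaves the same quality of evidence as a successful one.&lt;/p&gt;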

&lt;p&gt;&lt;strong&gt;Documentation vs. Evidence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production systems require security, compliance, scalability, and all the other operational requirements enterprise buyers expect. The standard response is documentation: architecture diagrams, security reviews, compliance checklists.&lt;/p&gt;

&lt;p&gt;Documentation tells you what the system is supposed to do. Evidence tells you what it actually did.&lt;/p&gt;

&lt;p&gt;The difference matters when something goes wrong. If an agent makes a bad decision, documentation tells you the process was sound. Evidence tells you what inputs it received, what constraints it checked, what decision it made, and why.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. Early versions of Ostronaut had extensive documentation about quality controls. When clients asked about a specific generation that didn't meet standards, we could point to the process. What we couldn't do was show them the specific quality checks that ran for that generation and what they returned.&lt;/p&gt;

&lt;p&gt;Documentation scales to the system. Evidence scales to the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust vs. Transparency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trust-based governance works when operations are slow enough for relationship-building and reputation to matter. Agentic systems operate too fast for trust alone.&lt;/p&gt;

&lt;p&gt;Transparency enables trust at speed. If I can see the evidence trail—what the agent considered, what constraints it checked, what decision it made—I can trust the outcome without trusting the vendor's reputation or the operator's judgment.&lt;/p&gt;

&lt;p&gt;This is not about replacing human judgment. It's about giving humans the information they need to judge effectively. A trace that shows "this generation passed 12 quality checks, failed 1, and was escalated for review" is more useful than a process diagram that says "all content undergoes quality review."&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;The pattern is showing up across domains.&lt;/p&gt;

&lt;p&gt;Healthcare training clients don't ask "Is your content accurate?" They ask "Can you prove this specific module met our clinical guidelines?" That's a trace question.&lt;/p&gt;

&lt;p&gt;Financial services clients don't ask "Do you have compliance controls?" They ask "Can you show me the decision path for this specific transaction and what risk checks applied?" That's a trace question.&lt;/p&gt;

&lt;p&gt;Customer support deployments don't ask "How do you ensure quality?" They ask "Can you prove this agent didn't violate our brand guidelines in this specific conversation?" That's a trace question.&lt;/p&gt;

&lt;p&gt;The common thread: verification needs to happen at the decision level, not the system level.&lt;/p&gt;

&lt;p&gt;Here's what trace-based assurance requires:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Ftrace-based-assurance-agentware%2Fassets%2Fd2-diagram-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Ftrace-based-assurance-agentware%2Fassets%2Fd2-diagram-1.png" alt="Diagram 1" width="800" height="2164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt;: machine-readable format, not free text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete&lt;/strong&gt;: captures inputs, constraints, decision logic, outcome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamped&lt;/strong&gt;: enables audit trail reconstruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable&lt;/strong&gt;: can't be modified after creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queryable&lt;/strong&gt;: supports real-time and historical analysis&lt;/li&gt;
&lt;/ul&gt;
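
&lt;p&gt;One hypothetical shape that satisfies those five properties (this is a sketch, not a spec): a frozen record that is machine-readable, timestamped at creation, and carries a content hash so tampering is detectable.&lt;/p&gt;

```python
import dataclasses
import hashlib
import json
import time

@dataclasses.dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TraceRecord:
    stage: str                  # which pipeline step emitted this
    inputs_digest: str          # hash of the inputs the agent saw
    constraints_checked: tuple  # names of the boundary checks that ran
    decision: str               # what the agent decided
    outcome: str                # e.g. "pass", "fail", "escalated"
    timestamp: float = dataclasses.field(default_factory=time.time)

    def digest(self):
        """Content hash of the whole record; re-hashing later proves it's unmodified."""
        body = json.dumps(dataclasses.asdict(self), sort_keys=True)
        return hashlib.sha256(body.encode()).hexdigest()
```

&lt;p&gt;Queryability then comes from whatever store the records land in; the record itself only has to be structured and stable enough to index.&lt;/p&gt;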

&lt;p&gt;This is different from logging. Logs capture what happened. Traces capture why it happened and prove it was within bounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Shift
&lt;/h2&gt;

&lt;p&gt;Building for trace-based assurance changes how you architect agentic systems.&lt;/p&gt;

&lt;p&gt;Traditional approach: build the agent, add logging, write documentation.&lt;/p&gt;

&lt;p&gt;Trace-based approach: design the constraints first, structure the agent to emit evidence of constraint adherence, make the trace the governance interface.&lt;/p&gt;

&lt;p&gt;We rebuilt Ostronaut's generation pipeline around this model. Every stage emits a structured trace. The trace includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What content was provided as input&lt;/li&gt;
&lt;li&gt;What quality thresholds were configured&lt;/li&gt;
&lt;li&gt;What checks ran and what they returned&lt;/li&gt;
&lt;li&gt;Whether the output met requirements&lt;/li&gt;
&lt;li&gt;If not, why not and what happened next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client's compliance team doesn't review our code. They review traces. When they spot-check a generation, they can see the complete decision path. When they audit the system, they query traces, not documentation.&lt;/p&gt;
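
&lt;p&gt;A spot-check over traces can be as simple as this (field names are assumptions, matching the sketch above rather than any real schema):&lt;/p&gt;

```python
def failed_checks(traces, stage=None):
    """Return (trace_id, check_name) for every failed check,
    optionally narrowed to one pipeline stage."""
    hits = []
    for trace in traces:
        if stage is not None and trace["stage"] != stage:
            continue
        for check_name, passed in trace["checks"].items():
            if not passed:
                hits.append((trace["id"], check_name))
    return hits
```
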

&lt;p&gt;This inverts the governance relationship. Instead of "trust us, we have good processes," it's "verify us, here's the evidence."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;We initially tried to retrofit traces onto an existing system. That doesn't work. Traces need to be part of the agent's core architecture, not an afterthought.&lt;/p&gt;

&lt;p&gt;We also underestimated the storage and query requirements. Traces for every decision add up fast. You need infrastructure that can handle high-volume writes and support complex queries across time ranges and decision types.&lt;/p&gt;

&lt;p&gt;The bigger mistake: thinking traces were primarily for auditors. They're actually most valuable for the engineering team. When an agent makes a bad decision, the trace is your debugging tool. When you're tuning the system, traces show you which constraints are too loose or too tight. When you're explaining the system to stakeholders, traces are your evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open Question
&lt;/h2&gt;

&lt;p&gt;Here's what I don't know yet: how do you build organizational trust in trace-based governance?&lt;/p&gt;

&lt;p&gt;Most enterprise buyers are used to documentation-based assurance. They know how to evaluate a security review or a compliance checklist. They don't yet know how to evaluate a trace architecture.&lt;/p&gt;

&lt;p&gt;The question isn't technical. It's cultural. How do you convince a procurement team that "we'll show you the evidence for every decision" is more reliable than "we have a 47-page compliance document"?&lt;/p&gt;

&lt;p&gt;The early adopters get it. Healthcare organizations that already deal with electronic health records understand audit trails. Financial institutions that deal with transaction monitoring understand decision-level evidence.&lt;/p&gt;

&lt;p&gt;But the broader market is still catching up. Most RFPs still ask for documentation, not trace capabilities. Most compliance frameworks still assume human review, not automated verification.&lt;/p&gt;

&lt;p&gt;The shift will happen. It has to. Agents are already making decisions too fast and at too high a volume for documentation-based governance to work. The question is whether the governance frameworks will adapt in time, or whether we'll see a wave of incidents first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are we building the trace infrastructure now, or waiting for the forcing function? Mostly, we're still writing documentation.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/trace-based-assurance-agentware/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=trace-based-assurance-agentware" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>enterpriseai</category>
      <category>governance</category>
    </item>
    <item>
      <title>The Small Model Arbitrage: Why India Should Be Building Vertical LLMs, Not Chasing Frontier</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:21:57 +0000</pubDate>
      <link>https://dev.to/talvinder/the-small-model-arbitrage-why-india-should-be-building-vertical-llms-not-chasing-frontier-5e51</link>
      <guid>https://dev.to/talvinder/the-small-model-arbitrage-why-india-should-be-building-vertical-llms-not-chasing-frontier-5e51</guid>
      <description>&lt;p&gt;India is trying to build its own GPT-4. This is a mistake.&lt;/p&gt;

&lt;p&gt;The capital requirement to train a frontier model is $500M-$1B+. The talent war for ML researchers is won before you enter it—OpenAI, Anthropic, and Google have already hired everyone worth hiring at compensation packages Indian companies can't match. The compute infrastructure is controlled by three hyperscalers who are also your competitors.&lt;/p&gt;

&lt;p&gt;This is not a winnable race. But there's a different race that is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Small Model Arbitrage
&lt;/h2&gt;

&lt;p&gt;I'm calling this the &lt;strong&gt;Small Model Arbitrage&lt;/strong&gt;—the opportunity to capture value by building specialized, vertical-specific LLMs that use local data, languages, and domain expertise where general-purpose models systematically underperform.&lt;/p&gt;

&lt;p&gt;The arbitrage exists because frontier model companies optimize for breadth, not depth. GPT-4 is remarkable at general reasoning but mediocre at Tamil legal document analysis, Ayurvedic diagnosis support, or GST compliance automation. The long tail of vertical use cases is economically unattractive to companies spending $1B on training runs.&lt;/p&gt;

&lt;p&gt;That's where the opening is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A well-executed vertical LLM in a defensible domain will reach profitability faster and generate higher ROI than an Indian frontier model attempt over the next 5 years.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The math supports this. Training a competitive frontier model requires $500M-$1B in compute, 100+ PhD-level researchers at $300K-$500K/year, and 3-5 years to market, plus ongoing capital burn to stay competitive as OpenAI and Anthropic release new versions.&lt;/p&gt;

&lt;p&gt;A vertical LLM requires $2M-$10M in initial training, 10-20 engineers and domain experts, and 6-12 months to first deployment. The moat is proprietary domain data, not compute scale.&lt;/p&gt;

&lt;p&gt;The capital efficiency difference is 50-100x. The time-to-revenue difference is 5-10x.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Capital Efficiency Isn't the Whole Story
&lt;/h2&gt;

&lt;p&gt;Capital efficiency alone doesn't win. The real arbitrage is in defensibility.&lt;/p&gt;

&lt;p&gt;Frontier models are commodity infrastructure. When GPT-5 launches, GPT-4 pricing collapses. When Claude 4 launches, Claude 3.5 becomes table stakes. The moat is constantly eroding because the moat IS the model, and the model is constantly being replaced.&lt;/p&gt;

&lt;p&gt;Vertical models have different moats. The moat is the proprietary training data, the domain-specific evaluation benchmarks, the integration into existing workflows, the trust built with regulated industries. These don't erode when OpenAI ships a new model. They compound.&lt;/p&gt;

&lt;p&gt;Consider Indian legal text. A frontier model can summarize a contract. A vertical legal LLM trained on 20 years of Indian case law, Supreme Court judgments, and regulatory filings can identify precedent, flag jurisdictional issues, and generate compliant documentation.&lt;/p&gt;

&lt;p&gt;The difference in value is 10x. The difference in defensibility is 100x.&lt;/p&gt;

&lt;p&gt;Or healthcare. GPT-4 can answer general medical questions. A model trained on Indian clinical protocols, drug formularies, insurance claim patterns, and regional disease prevalence can assist with diagnosis, treatment planning, and prior authorization. It's not a better general model—it's a purpose-built tool that works within the constraints of the Indian healthcare system.&lt;/p&gt;

&lt;p&gt;The pattern here is &lt;strong&gt;data specificity as competitive advantage&lt;/strong&gt;. Frontier models are trained on the open web. Vertical models are trained on proprietary, domain-specific corpora that are expensive or impossible for competitors to replicate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Import Substitution Mistake
&lt;/h2&gt;

&lt;p&gt;India tried this playbook before. Post-independence industrial policy was built on import substitution—build everything domestically, compete head-to-head with established global players. It failed spectacularly.&lt;/p&gt;

&lt;p&gt;India's inward-looking trade regime discouraged labor-intensive export industries and rewarded installation of new capacity over actual output. The economy stagnated for decades.&lt;/p&gt;

&lt;p&gt;The companies that succeeded—Infosys, Wipro, TCS—didn't try to be IBM. They specialized in specific services where India had comparative advantage: cost-efficient software development, business process outsourcing, IT support. They built world-class competitors by focusing, not by trying to replicate the entire stack.&lt;/p&gt;

&lt;p&gt;The Small Model Arbitrage is the same bet. Don't build Indian GPT-4. Build the best Tamil-English legal LLM. Build the best model for Indian tax code. Build the best clinical decision support system for Indian healthcare protocols.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who's Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sarvam AI&lt;/strong&gt; is running this playbook. They're not trying to be OpenAI India. They're building models for Indian languages—starting with Hindi, Tamil, Telugu, Kannada. The training data includes regional dialects, code-switching patterns, and cultural context that frontier models miss. Their Indic LLM performs better on Hindi-English code-mixed text than GPT-4 because it was designed for that specific use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Niramai&lt;/strong&gt; built an AI system for breast cancer screening using thermal imaging. It's not a general-purpose vision model. It's a vertical model trained on Indian patient data, optimized for cost-constrained clinical settings, and integrated with existing diagnostic workflows. The model's accuracy isn't better than frontier models on general image tasks—it's better on the one task that matters for their customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tricog&lt;/strong&gt; built an ECG interpretation model for Indian hospitals. It doesn't try to be the best general medical AI. It's trained on Indian cardiac data, accounts for regional disease prevalence, and integrates with existing cardiology workflows. The specificity is the product.&lt;/p&gt;

&lt;p&gt;These companies aren't competing on compute scale. They're competing on domain depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Criteria for Vertical LLM Opportunity
&lt;/h2&gt;

&lt;p&gt;Not every vertical is worth building. The opportunity exists where three conditions hold:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Proprietary data access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The training corpus must be expensive or impossible for competitors to replicate. Public datasets don't create moats.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Measurable performance delta&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The vertical model must demonstrably outperform frontier models on domain-specific benchmarks. "Better for India" isn't enough—quantify it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Willingness to pay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The customer must value the vertical model enough to pay a premium over general-purpose alternatives. Cost savings or compliance requirements work. Marginal convenience doesn't.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Indian legal tech meets all three. Case law is proprietary, performance on precedent identification is measurable, and law firms pay for accuracy.&lt;/p&gt;

&lt;p&gt;Indian healthcare meets all three. Clinical data is proprietary, diagnostic accuracy is measurable, and hospitals pay for compliance and outcomes.&lt;/p&gt;

&lt;p&gt;Indian fintech meets two out of three. Transaction data is proprietary, fraud detection performance is measurable, but willingness to pay is unclear—banks may prefer general models with custom fine-tuning.&lt;/p&gt;

&lt;p&gt;The test is simple: if a frontier model company could replicate your vertical model by spending $10M on data acquisition and fine-tuning, you don't have a moat. If they can't—because the data doesn't exist, the domain expertise takes years to build, or the regulatory relationships are non-transferable—you do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;p&gt;The open question is whether vertical LLMs can sustain pricing power as frontier models improve. If GPT-5 closes 80% of the performance gap on Indian legal text, does the 20% delta justify a 5x price premium?&lt;/p&gt;

&lt;p&gt;I think yes, but the answer depends on how regulated and mission-critical the domain is. Healthcare and legal have high switching costs and regulatory lock-in. E-commerce and customer support don't.&lt;/p&gt;

&lt;p&gt;The other unknown is whether vertical models can defend against fine-tuned frontier models. If a customer can take GPT-4, fine-tune it on their own data, and get 90% of the value of your vertical model, your business model collapses.&lt;/p&gt;

&lt;p&gt;The defense is proprietary training signal that the customer doesn't have. If your model is trained on 10 years of aggregated industry data that no single customer possesses, fine-tuning doesn't replicate it. If your model is just a fine-tuned version of a frontier model on the customer's own data, you're a services company, not a product company.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Civilizational Bet
&lt;/h2&gt;

&lt;p&gt;The broader question is whether India's AI strategy should prioritize sovereignty or specialization.&lt;/p&gt;

&lt;p&gt;Sovereignty argues for building frontier models domestically, even at higher cost, to ensure strategic autonomy. Specialization argues for building vertical models where India has comparative advantage, and relying on global infrastructure for general-purpose AI.&lt;/p&gt;

&lt;p&gt;I think specialization wins. Sovereignty in AI is expensive and brittle. The cost to maintain a competitive frontier model is not a one-time investment—it's an ongoing tax that grows every year as the frontier moves. India's GDP per capita is $2,500. The U.S. is $76,000. The capital efficiency required to compete on frontier models is not realistic.&lt;/p&gt;

&lt;p&gt;But specialization in vertical AI is realistic. India has 22 official languages, 1.4 billion people, and regulatory systems that differ significantly from Western markets. The data specificity is structural, not temporary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The companies that win will be the ones that stop trying to replicate OpenAI and start building what OpenAI can't.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/domain-specific-small-models/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=domain-specific-small-models" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>indiatech</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>We Were Running AI Agents Before 'Agentic' Became a Buzzword</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Sat, 21 Mar 2026 04:10:53 +0000</pubDate>
      <link>https://dev.to/talvinder/we-were-running-ai-agents-before-agentic-became-a-buzzword-1dco</link>
      <guid>https://dev.to/talvinder/we-were-running-ai-agents-before-agentic-became-a-buzzword-1dco</guid>
      <description>&lt;p&gt;In early 2024, we deployed a multi-agent system for Ostronaut before anyone called it "agentic AI." We called it "the pipeline." By late 2024, every vendor deck had "agentic" in the title. The architecture didn't change. The vocabulary did.&lt;/p&gt;

&lt;p&gt;Here's the pattern that experience revealed: &lt;strong&gt;Agent Debt&lt;/strong&gt;. The hidden complexity that accumulates when you treat agents as black boxes instead of understanding their failure modes. It isn't technical debt. It's operational blindness. You don't see it until an agent hallucinates in production, burns through your API budget, or produces output so confidently wrong that users trust it.&lt;/p&gt;

&lt;p&gt;Building without frameworks meant hitting every orchestration failure, every context bleed, every runaway cost directly. That's what taught us what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture We Built
&lt;/h2&gt;

&lt;p&gt;Ostronaut generates corporate training content — presentations, videos, quizzes, games — from unstructured input. A client uploads a PDF. The system outputs interactive learning formats.&lt;/p&gt;

&lt;p&gt;We built agents in four functional groups because the problem naturally decomposed that way:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Type&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Planner agents&lt;/td&gt;
&lt;td&gt;Break input into learning objectives, decide format mix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure agents&lt;/td&gt;
&lt;td&gt;Design slide sequences, video scripts, quiz flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content agents&lt;/td&gt;
&lt;td&gt;Generate text, voiceovers, visual descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation agents&lt;/td&gt;
&lt;td&gt;Check quality gates, flag hallucinations, verify completeness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The planner-worker pattern: one planner agent analyzes the input and creates a generation plan. Worker agents execute tasks from that plan. Validation agents run post-generation checks.&lt;/p&gt;
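
&lt;p&gt;Stripped of error handling and retries, the coordination skeleton looks roughly like this (a sketch; the real pipeline's names and routing are more involved):&lt;/p&gt;

```python
def run_pipeline(source_doc, planner, workers, validators):
    """Planner decomposes the input into tasks; format-specific workers
    execute them; validators run post-generation checks on everything."""
    plan = planner(source_doc)
    outputs = []
    for task in plan:
        worker = workers[task["format"]]  # route by the format the planner chose
        outputs.append(worker(task))
    reports = [validate(outputs) for validate in validators]
    return outputs, reports
```
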

&lt;p&gt;This wasn't novel architecture. It was obvious once you tried to build the thing. But in early 2024, there was no CrewAI to handle orchestration. No LangGraph to manage state. We wrote the coordination logic ourselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What that meant in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context management was manual. Each agent needed the right slice of information: not too much (cost), not too little (hallucination). We built a context router that decided what each agent could see based on its task. It broke constantly. An agent would reference information from a previous step that wasn't in its context window. Output would be incoherent.&lt;/p&gt;

&lt;p&gt;Tool-calling was brittle. Agents needed to invoke APIs for image generation, video rendering, database writes. Early LLM tool-calling was unreliable. An agent would call the wrong API, pass malformed parameters, or retry indefinitely on failure. We added a validation layer that parsed tool calls before execution. That caught 30% of bad calls.&lt;/p&gt;
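
&lt;p&gt;The validation layer amounted to checking every parsed tool call against a schema before execution. A simplified sketch (the schemas here are hypothetical; the real layer validated against each API's actual contract):&lt;/p&gt;

```python
# Hypothetical tool schemas for illustration only.
TOOL_SCHEMAS = {
    "generate_image": {"required": {"prompt"}, "allowed": {"prompt", "style"}},
    "render_video": {"required": {"script", "resolution"},
                     "allowed": {"script", "resolution"}},
}

def validate_tool_call(call):
    """Reject unknown tools, missing required params, and unexpected params
    before anything reaches a real API."""
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        return False, "unknown tool"
    args = set(call.get("args", {}))
    missing = schema["required"] - args
    if missing:
        return False, f"missing params: {sorted(missing)}"
    unexpected = args - schema["allowed"]
    if unexpected:
        return False, f"unexpected params: {sorted(unexpected)}"
    return True, "ok"
```
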

&lt;p&gt;Cost control was reactive. We didn't know what "normal" token usage looked like for a multi-agent pipeline. First month in production, we burned through our OpenAI budget in 2 weeks. The problem: redundant context. Multiple agents were processing the same source material because we hadn't optimized context sharing. We added a caching layer. Cost dropped 40%.&lt;/p&gt;
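
&lt;p&gt;The caching idea is simple: key expensive extraction work on a digest of the source document so every agent that needs it reuses one result. A minimal sketch (function names are illustrative):&lt;/p&gt;

```python
import hashlib

_extraction_cache = {}

def extract_once(document_text, extractor):
    """Return the cached extraction for this document, computing it
    only on the first request; later agents reuse the same result."""
    key = hashlib.sha256(document_text.encode()).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extractor(document_text)  # only the first caller pays
    return _extraction_cache[key]
```
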

&lt;h2&gt;
  
  
  The Quality Crisis
&lt;/h2&gt;

&lt;p&gt;Month 4, we hit the ceiling.&lt;/p&gt;

&lt;p&gt;A healthcare client used Ostronaut to generate training for a clinical health program. The system produced a quiz. One question asked: "What is the recommended daily caloric deficit for healthy weight loss?" The agent-generated answer: "1000-1200 calories."&lt;/p&gt;

&lt;p&gt;That's dangerously high for most people. The correct range is 500-750 calories.&lt;/p&gt;

&lt;p&gt;The agent didn't hallucinate randomly. It pulled from a source document that mentioned 1000-1200 as an &lt;em&gt;upper bound&lt;/em&gt; for specific cases. The agent extracted the number without the qualifier. The validation agent didn't flag it because it checked for factual consistency with the source, not medical safety.&lt;/p&gt;

&lt;p&gt;We caught it in QA. But it revealed the core problem: &lt;strong&gt;agents optimize for coherence, not correctness&lt;/strong&gt;. They will confidently generate plausible-but-wrong output if your validation layer doesn't encode domain constraints.&lt;/p&gt;

&lt;p&gt;This is the failure mode that no prompt tuning fixes. You can instruct the model to "be accurate" as many times as you want. It will still extract numbers from context and strip their qualifiers, because that's what extracting the salient point looks like to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we changed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built domain-specific validation gates. For healthcare content, we added rules: flag any caloric recommendation above X, flag any medication dosage, flag any symptom-diagnosis claim. Not LLM-based validation. Rule-based checks that ran before content went to the client.&lt;/p&gt;
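
&lt;p&gt;The shape of such a gate, assuming a threshold like the one in the incident above (the limit and pattern here are illustrative, not our production rules):&lt;/p&gt;

```python
import operator
import re

# Illustrative threshold; the real gates encoded clinician-approved limits.
MAX_DAILY_CALORIC_DEFICIT = 750

def gate_caloric_claims(text):
    """Flag any daily caloric-deficit figure above the safe limit.
    A rule, not an LLM: it cannot be talked out of firing."""
    flags = []
    for match in re.finditer(r"(\d{3,5})\s*calories", text):
        value = int(match.group(1))
        if operator.gt(value, MAX_DAILY_CALORIC_DEFICIT):  # value above the limit
            flags.append(match.group(0))
    return flags
```
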

&lt;p&gt;Added confidence scoring. Each agent outputs a confidence score for its generation. Low-confidence outputs go to human review. The scoring isn't sophisticated (token probability and context match), but it works. 15% of generations now route to human QA. That's acceptable.&lt;/p&gt;
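
&lt;p&gt;The routing itself is trivial once you have a score. A sketch, assuming an even blend of the two signals and a cutoff of 0.8 (both assumptions; ours are tuned per content type):&lt;/p&gt;

```python
import operator

REVIEW_THRESHOLD = 0.8  # assumed cutoff for illustration

def route(generation):
    """Blend token probability and context match into one score,
    then send low-confidence output to human QA."""
    score = 0.5 * generation["token_prob"] + 0.5 * generation["context_match"]
    if operator.lt(score, REVIEW_THRESHOLD):  # score below the threshold
        return "human_review"
    return "auto_publish"
```
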

&lt;p&gt;Switched to template + generative hybrid. For high-risk content types (medical, financial, legal), we don't generate from scratch. We use templates with generative fill-ins. Reduces creative output, increases safety. Clients accepted the trade-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Got Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Universal reasoning engine.&lt;/strong&gt; We initially tried to build one planner agent that could handle all content types. A presentation has different structural constraints than a video. A quiz has different validation rules than a game. We split the planner into format-specific planners. That added agents but improved output quality significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-judge for validation.&lt;/strong&gt; Early on, we used an LLM to validate other LLMs' output. "Does this quiz question make sense? Is this slide coherent?" That's circular. The validator had the same failure modes as the generator. We moved to rule-based validation for anything safety-critical. LLMs still validate style and tone. They don't validate facts. This failure mode is documented in more detail in &lt;a href="///build-logs/llm-judge-india-failure/index.qmd"&gt;why LLM-as-judge stacks fail for Indian markets&lt;/a&gt; — the underlying issue is the same regardless of geography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized orchestration.&lt;/strong&gt; We built one orchestrator that managed all agents. It became a bottleneck. Every new feature required changing the orchestrator. We should have built federated orchestration, where each agent cluster (planner, worker, validator) manages its own coordination. We haven't refactored this yet. It's still painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then vs. Now
&lt;/h2&gt;

&lt;p&gt;If we built Ostronaut today with 2025 tooling, here's what would be easier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What We Built by Hand&lt;/th&gt;
&lt;th&gt;What Exists Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context routing logic&lt;/td&gt;
&lt;td&gt;LangGraph state management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool-call validation layer&lt;/td&gt;
&lt;td&gt;Built-in tool schemas in GPT-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent orchestration&lt;/td&gt;
&lt;td&gt;CrewAI, n8n workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry and error handling&lt;/td&gt;
&lt;td&gt;Framework-level retry policies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's still hard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Domain-specific validation. No framework gives you medical safety checks or financial compliance rules. You build that yourself.&lt;/p&gt;

&lt;p&gt;Cost optimization. Frameworks don't tell you which agents are burning tokens unnecessarily. You need observability and profiling. This is the same problem &lt;a href="///field-notes/indian-saas-agent-reliability/index.qmd"&gt;Indian SaaS companies are well-positioned to solve&lt;/a&gt; — twenty years of optimizing for constrained infrastructure builds exactly this instinct.&lt;/p&gt;

&lt;p&gt;Failure mode discovery. Agents fail in creative ways. A framework might handle retries, but it won't tell you &lt;em&gt;why&lt;/em&gt; an agent is producing inconsistent output. You learn that by watching production traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real difference:&lt;/strong&gt; In 2024, we had to understand agent internals to build anything reliable. In 2025, you can deploy agents without understanding them. That's progress. But it creates Agent Debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Falsifiable Claim
&lt;/h2&gt;

&lt;p&gt;Teams that deploy agent systems without understanding planner-worker coordination, context boundaries, and validation layers will hit a quality ceiling within 3-6 months that no amount of prompt tuning will fix.&lt;/p&gt;

&lt;p&gt;The ceiling shows up as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent output quality (works 80% of the time, fails unpredictably)&lt;/li&gt;
&lt;li&gt;Cost spirals (agents making redundant API calls, over-generating)&lt;/li&gt;
&lt;li&gt;User trust erosion (one bad generation destroys confidence in 10 good ones)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a prediction. It's a pattern I've watched repeat across every team that reached out after deploying agents without validation gates. The vendors selling "agentic platforms" are solving orchestration and deployment. They're not solving validation, cost control, or failure mode discovery. Those are still your problem.&lt;/p&gt;

&lt;p&gt;This dynamic connects to something broader happening in &lt;a href="///frameworks/agentware/index.qmd"&gt;the shift from software to agentware&lt;/a&gt; — as the abstraction layer rises, the hidden complexity doesn't disappear. It concentrates at the failure modes the frameworks don't cover.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Worth Asking
&lt;/h2&gt;

&lt;p&gt;If you're deploying agents today, ask this: &lt;strong&gt;Can you explain why an agent made a specific decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "what did it output?" but "why did it choose this approach over alternatives?"&lt;/p&gt;

&lt;p&gt;If the answer is "the LLM decided," you have Agent Debt. You're trusting a black box. That works until it doesn't.&lt;/p&gt;

&lt;p&gt;The teams that will build reliable agent systems aren't the ones using the fanciest frameworks. They're the ones who understand what happens when context bleeds between agents, when a planner makes a bad decomposition, when a validator misses a hallucination.&lt;/p&gt;

&lt;h2&gt;
  
  
  We learned that by building without frameworks. You can learn it faster now — but only if you look under the hood.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/multi-agent-before-agentic/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=multi-agent-before-agentic" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>agenticsystems</category>
      <category>buildlogs</category>
    </item>
    <item>
      <title>AI Is Making Your Team Slower — The Math Your CEO Won't Show You</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Fri, 20 Mar 2026 03:31:28 +0000</pubDate>
      <link>https://dev.to/talvinder/ai-is-making-your-team-slower-the-math-your-ceo-wont-show-you-agl</link>
      <guid>https://dev.to/talvinder/ai-is-making-your-team-slower-the-math-your-ceo-wont-show-you-agl</guid>
      <description>&lt;p&gt;Every company measuring AI productivity is counting the wrong thing.&lt;/p&gt;

&lt;p&gt;They're measuring output volume: PRs merged, lines written, tickets closed. They're not measuring the cost of what ships: the review burden, the debugging time, the incidents caused by code nobody understood before it hit production.&lt;/p&gt;

&lt;p&gt;When you count both sides, the math doesn't work the way your CEO's slide deck says it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Is Piling Up
&lt;/h2&gt;

&lt;p&gt;This week, The Pragmatic Engineer &lt;a href="https://newsletter.pragmaticengineer.com/p/are-ai-agents-actually-slowing-us" rel="noopener noreferrer"&gt;catalogued what's actually happening&lt;/a&gt; inside companies that went all-in on AI coding agents. The findings aren't theoretical.&lt;/p&gt;

&lt;p&gt;Amazon's retail engineering team saw a spike in outages caused directly by AI agents. The fix? Requiring senior engineer sign-off on all AI-assisted changes from junior developers. That's not a productivity gain. That's adding a bottleneck to compensate for unreliable output.&lt;/p&gt;

&lt;p&gt;Anthropic — the company that builds Claude — ships over 80% of its production code with AI. Their flagship website degraded so badly that paying customers noticed before anyone internally did. The irony writes itself.&lt;/p&gt;

&lt;p&gt;Meta and Uber are tracking AI token usage in performance reviews. Engineers who don't use AI tools enough look unproductive. Engineers who use them indiscriminately look great on paper — until the bugs ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Taxes You're Not Counting
&lt;/h2&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;teams that measure AI productivity only by output volume will see their incident rate and mean-time-to-resolve increase by 30% or more within 12 months&lt;/strong&gt;, compared to teams that gate AI output with validation layers.&lt;/p&gt;

&lt;p&gt;The mechanism has three parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Review Tax
&lt;/h3&gt;

&lt;p&gt;Every AI-generated PR still needs human review. But AI-generated code is harder to review than human-written code, because the reviewer can't infer intent from the author's history.&lt;/p&gt;

&lt;p&gt;With human code, you know the developer's context: what they were trying to solve, what trade-offs they considered, what they tested. With AI code, you're reverse-engineering intent from output. That's slower, not faster.&lt;/p&gt;

&lt;p&gt;Amazon learned this the hard way. Junior engineers using AI agents shipped code that looked correct — clean formatting, reasonable variable names, passing tests — but had subtle logical errors that only surfaced in production. Reviewers couldn't distinguish "AI wrote this well" from "AI wrote this plausibly."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Refactoring Freeze
&lt;/h3&gt;

&lt;p&gt;Dax Reed, who built OpenCode, points out something every experienced engineer recognises: AI agents discourage refactoring. When code is cheap to generate, nobody wants to clean it up. Why spend an afternoon restructuring a module when the agent writes a new one in ten minutes?&lt;/p&gt;

&lt;p&gt;The result is an expanding codebase where nothing gets simplified, patterns don't converge, and cognitive load increases week over week.&lt;/p&gt;

&lt;p&gt;This is the velocity trap. Short-term speed, long-term slowdown. Sentry's CTO observed the same pattern: AI removes the barrier to getting started, which sounds great until you realise that "getting started" was never the bottleneck. The bottleneck was maintaining, debugging, and evolving what you built. AI makes the first part trivially easy and the second part measurably harder.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Incentive Poison
&lt;/h3&gt;

&lt;p&gt;When companies tie AI token usage to performance reviews, they're telling engineers: "Use the tool, regardless of whether it helps."&lt;/p&gt;

&lt;p&gt;This is the corporate equivalent of measuring developer productivity by lines of code written. It rewards volume, punishes judgment, and guarantees that the engineers who are most careful about code quality look the least productive.&lt;/p&gt;

&lt;p&gt;Engineers who know the AI output is mediocre ship it anyway, because slowing down to rewrite it makes their metrics look bad. The codebase degrades. The team slows down. The metrics still look great, because the metrics are measuring the wrong thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like Up Close
&lt;/h2&gt;

&lt;p&gt;I've seen this pattern building multi-agent systems at Ostronaut. We generate training content — presentations, videos, quizzes. Early on, the agents were fast. They produced a complete training module in minutes. The output looked good. Formatting was clean. Structure was reasonable.&lt;/p&gt;

&lt;p&gt;It was also wrong about 15-20% of the time. Not obviously wrong — subtly wrong. A slide deck where the concept progression didn't build properly. A quiz where the distractors were too close to the correct answer. A video script that repeated a key point in slightly different words, creating confusion instead of reinforcement.&lt;/p&gt;

&lt;p&gt;We didn't fix this with better prompts. We fixed it by building a validation layer — automated checks that ran after every generation step, before anything reached a human reviewer. Content validation caught conceptual errors. Design validation caught structural problems. Integration validation caught mismatches between components.&lt;/p&gt;

&lt;p&gt;That validation layer was harder to build than the generation layer. It took longer. It required more engineering judgment. And it's the only reason the system works reliably.&lt;/p&gt;

&lt;p&gt;The companies in Gergely's article skipped this step. They deployed AI agents without validation gates, measured the output volume, and declared victory. Then the incidents started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Better Models Won't Save You
&lt;/h2&gt;

&lt;p&gt;I used to think the answer was better models. If GPT-4 produces code that's 80% reliable, GPT-5 will be 95% reliable, and eventually you won't need validation.&lt;/p&gt;

&lt;p&gt;That was wrong for two reasons.&lt;/p&gt;

&lt;p&gt;First, the remaining failures are the expensive ones. The bugs that survive better models are the subtle, context-dependent bugs that cause production incidents. Better models don't make validation cheaper — they make it more necessary, because what gets through is harder to catch.&lt;/p&gt;

&lt;p&gt;Second, the validation layer isn't just catching bugs. It's encoding team knowledge. Our quality checks embed years of domain expertise — what makes a good slide progression, what makes a quiz effective, what makes a video script clear. That knowledge doesn't exist in the model. It exists in the team. The validation layer is how you transfer institutional knowledge into the AI pipeline.&lt;/p&gt;

&lt;p&gt;Companies that skip this aren't just accepting more bugs. They're disconnecting their AI pipeline from their institutional knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Measure Instead
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Leadership Measures&lt;/th&gt;
&lt;th&gt;What Actually Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRs merged per week (+52%)&lt;/td&gt;
&lt;td&gt;Review time per PR (+40%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code written (3x)&lt;/td&gt;
&lt;td&gt;Lines nobody understands (3x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first commit (-60%)&lt;/td&gt;
&lt;td&gt;Time to resolve incidents (+35%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token usage per engineer&lt;/td&gt;
&lt;td&gt;Refactoring frequency (-70%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're measuring AI impact, stop counting PRs. Start counting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Incident rate per AI-assisted commit&lt;/strong&gt; versus human-only commits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review time per PR&lt;/strong&gt; — is it actually decreasing, or are reviewers rubber-stamping?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring frequency&lt;/strong&gt; — is your team still simplifying code, or just adding to it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean-time-to-resolve&lt;/strong&gt; for bugs in AI-generated code versus human-written&lt;/li&gt;
&lt;/ol&gt;
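&lt;p&gt;The first metric reduces to a split-and-count over your commit history. A sketch, assuming you can tag each commit with an &lt;code&gt;ai_assisted&lt;/code&gt; flag and an &lt;code&gt;incident&lt;/code&gt; flag (the schema here is assumed, not prescribed):&lt;/p&gt;

```python
def rates_by_origin(commits):
    """Incident rate for AI-assisted vs human-only commits.

    Each commit is a dict like {"ai_assisted": True, "incident": False}.
    """
    buckets = {True: [0, 0], False: [0, 0]}   # origin -> [incidents, total]
    for c in commits:
        bucket = buckets[c["ai_assisted"]]
        bucket[0] += 1 if c["incident"] else 0
        bucket[1] += 1
    return {
        "ai": buckets[True][0] / max(buckets[True][1], 1),
        "human": buckets[False][0] / max(buckets[False][1], 1),
    }
```

&lt;p&gt;If the two rates diverge, you have your answer before any debate about tooling starts.&lt;/p&gt;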

&lt;p&gt;The companies that will win with AI coding agents are not the ones that deploy them fastest. They're the ones that build the validation layer first and measure what matters — not how fast code is written, but how fast &lt;em&gt;correct&lt;/em&gt; code ships and stays correct in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed without verification isn't velocity. It's technical debt with a marketing budget.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/ai-speed-lie-team-velocity/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-speed-lie-team-velocity" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>softwareengineering</category>
      <category>engineeringleadership</category>
    </item>
    <item>
      <title>The OS-Paged Context Engine</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:40:11 +0000</pubDate>
      <link>https://dev.to/talvinder/the-os-paged-context-engine-3d7g</link>
      <guid>https://dev.to/talvinder/the-os-paged-context-engine-3d7g</guid>
      <description>&lt;p&gt;Every production agent system I've worked on has the same failure mode. Context rot. Stale artefacts silently served to the model. No audit trail for what was included or excluded. Token budgets blown with no graceful recovery. Multi-agent context bleeding across scopes.&lt;/p&gt;

&lt;p&gt;The standard fix is "use RAG." RAG solves retrieval. It doesn't solve lifecycle.&lt;/p&gt;

&lt;p&gt;The counter-argument I hear most: context windows are getting larger. Claude does 200K tokens. Gemini does 1M. Just dump everything in. The math doesn't hold. At $15 per million input tokens, stuffing 847 artefacts (~200K tokens) into every call costs $3 per inference. At 100 calls per day per agent, that's $9,000/month for a single agent. And you still can't audit what the model saw, still can't catch stale data, still can't prevent hallucinations from compounding into memory.&lt;/p&gt;
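&lt;p&gt;The arithmetic behind that $9,000 figure, as a quick sanity check using the prices and volumes quoted above:&lt;/p&gt;

```python
PRICE_PER_MILLION = 15.00   # USD per 1M input tokens
CONTEXT_TOKENS = 200_000    # ~847 artefacts stuffed into every call
CALLS_PER_DAY = 100
DAYS_PER_MONTH = 30

cost_per_call = CONTEXT_TOKENS / 1_000_000 * PRICE_PER_MILLION   # $3.00
monthly_cost = cost_per_call * CALLS_PER_DAY * DAYS_PER_MONTH    # $9,000
```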

&lt;p&gt;Context has no lifecycle. That's the root cause. I went looking for prior art in constrained computing, where managing scarce resources under real-time pressure has been solved for decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Query, Two Outcomes
&lt;/h2&gt;

&lt;p&gt;A support agent is handling a billing escalation. The context store has 847 artefacts: ticket history, knowledge base articles, past chat transcripts, agent notes, CRM records.&lt;/p&gt;

&lt;p&gt;The query is the same. The model is the same. The only difference is what sits between the store and the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without lifecycle management&lt;/strong&gt; (standard RAG): the agent runs a semantic search, takes the top-K matches, stuffs them in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A refund policy from six months ago loads because it's semantically close. The policy was updated two weeks ago. The agent cites the old $200 limit to a customer whose refund should be $400 under the current policy.&lt;/li&gt;
&lt;li&gt;An agent's internal note (unreviewed, unvalidated) loads as context. The model treats a scratchpad draft as a confirmed resolution.&lt;/li&gt;
&lt;li&gt;Token budget blows out at 140%. The API silently truncates the prompt, dropping the most recent ticket update.&lt;/li&gt;
&lt;li&gt;The agent's response gets written to memory. The outdated policy is now a "fact." Next session, it compounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With the OS-Paged Context Engine&lt;/strong&gt;: the same 847 artefacts enter a four-stage pipeline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triage: 312 artefacts expire on TTL. The internal note scores below provenance threshold (SCRATCHPAD rank). The stale policy is BLACK-tagged. 20 survive for semantic scoring.&lt;/li&gt;
&lt;li&gt;Paging: a knowledge base article that &lt;em&gt;did&lt;/em&gt; survive has a dirty bit set (source updated 2 weeks ago). Re-fetched with current policy before the model sees it.&lt;/li&gt;
&lt;li&gt;Assembly: 31,200 tokens against a 40,000 budget. No truncation.&lt;/li&gt;
&lt;li&gt;Validation: response scores 0.88 confidence. Committed to memory. Below 0.7, it would have been flagged for review and &lt;em&gt;not&lt;/em&gt; persisted.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Standard RAG&lt;/th&gt;
&lt;th&gt;OS-Paged Engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stale artefact loaded&lt;/td&gt;
&lt;td&gt;Serves 6-month-old policy as current&lt;/td&gt;
&lt;td&gt;TTL expires it. Dirty bit catches mid-session staleness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unvalidated note treated as fact&lt;/td&gt;
&lt;td&gt;Loads if semantically close&lt;/td&gt;
&lt;td&gt;SCRATCHPAD provenance rank filters it in triage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token budget overflow&lt;/td&gt;
&lt;td&gt;Silent API truncation&lt;/td&gt;
&lt;td&gt;Graceful degradation through four tiers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination persisted to memory&lt;/td&gt;
&lt;td&gt;Written back without checks&lt;/td&gt;
&lt;td&gt;Commit gate: low confidence triggers rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Immutable manifest: trace ID, artefact list, tier, commit status&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every one of these is a lifecycle failure, not a retrieval failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Four Borrowed Techniques
&lt;/h2&gt;

&lt;p&gt;I built a four-stage pipeline. Each stage borrows one technique from a domain that solved this class of problem decades ago. No framework lock-in. &lt;a href="https://github.com/talvinder/context-engine" rel="noopener noreferrer"&gt;Single Python file&lt;/a&gt;. Works with any LLM API.&lt;/p&gt;

&lt;p&gt;I'm calling it the &lt;strong&gt;OS-Paged Context Engine&lt;/strong&gt;, because the core insight is that your context window is RAM, your long-term memory is disk, and you need an operating system between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-1.png" alt="Diagram 1" width="800" height="2613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Triage Scoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; embedding 1,000 artefacts per call at ~1ms each = 1 second of latency before inference starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: ER START Protocol, 1983.&lt;/strong&gt; You don't need full diagnosis to correctly prioritise. Score all candidates on three cheap signals first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R (Recency)&lt;/strong&gt; is a timestamp diff. O(1). &lt;strong&gt;P (Provenance)&lt;/strong&gt; is an enum rank: human-verified &amp;gt; RAG chunk &amp;gt; tool output &amp;gt; agent scratchpad. O(1). &lt;strong&gt;S (Semantic)&lt;/strong&gt; is cosine distance. Computed &lt;em&gt;only&lt;/em&gt; for artefacts that survive R+P filtering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-2.png" alt="Diagram 2" width="800" height="1768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Type&lt;/th&gt;
&lt;th&gt;Score Bias&lt;/th&gt;
&lt;th&gt;Triage Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human-verified memory&lt;/td&gt;
&lt;td&gt;Provenance-heavy (P=0.5)&lt;/td&gt;
&lt;td&gt;Highest priority, loaded first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG chunk (recent)&lt;/td&gt;
&lt;td&gt;Balanced (R=0.4, S=0.4)&lt;/td&gt;
&lt;td&gt;High — recency and relevance both count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool output&lt;/td&gt;
&lt;td&gt;Recency-heavy (R=0.5)&lt;/td&gt;
&lt;td&gt;Medium — freshness matters most&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent scratchpad&lt;/td&gt;
&lt;td&gt;Semantic-heavy (S=0.5)&lt;/td&gt;
&lt;td&gt;Low — must be highly relevant to survive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expired artefact&lt;/td&gt;
&lt;td&gt;TTL=0&lt;/td&gt;
&lt;td&gt;Excluded before scoring even starts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
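&lt;p&gt;The table above can be sketched as a two-pass scorer. The weights and cutoff below are illustrative values rather than the library's actual defaults, and &lt;code&gt;semantic_fn&lt;/code&gt; stands in for whatever embedding similarity you use:&lt;/p&gt;

```python
import time

# Per-source weight profiles mirroring the table: (recency, provenance, semantic).
WEIGHTS = {
    "human_verified": (0.25, 0.50, 0.25),
    "rag_chunk":      (0.40, 0.20, 0.40),
    "tool_output":    (0.50, 0.25, 0.25),
    "scratchpad":     (0.25, 0.25, 0.50),
}
PROVENANCE_RANK = {"human_verified": 1.0, "rag_chunk": 0.7,
                   "tool_output": 0.5, "scratchpad": 0.2}

def triage(artefacts, semantic_fn, now=None, cutoff=0.15, top_k=20):
    """Two-pass triage: cheap R+P filter first, semantic scoring for survivors only."""
    now = now or time.time()
    survivors = []
    for art in artefacts:
        if now >= art["expires_at"]:       # TTL expired: out before any scoring
            continue
        age_hours = (now - art["created_at"]) / 3600
        r = 1.0 / (1.0 + age_hours)        # recency: newer approaches 1, O(1)
        p = PROVENANCE_RANK[art["source"]]
        wr, wp, ws = WEIGHTS[art["source"]]
        if cutoff > wr * r + wp * p:       # cheap pass: dropped before embedding
            continue
        s = semantic_fn(art)               # expensive pass: survivors only
        survivors.append((wr * r + wp * p + ws * s, art))
    survivors.sort(key=lambda pair: pair[0], reverse=True)
    return [art for _, art in survivors[:top_k]]
```

&lt;p&gt;The point of the two passes is visible in the structure: &lt;code&gt;semantic_fn&lt;/code&gt;, the only expensive call, never runs for expired or low-R+P artefacts.&lt;/p&gt;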

&lt;h2&gt;
  
  
  Stage 2: Paged Context Store
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; serving stale context because nobody checked whether the source changed since it was loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: OS Virtual Memory, 1962.&lt;/strong&gt; The page table decided what lived in fast memory, evicted the least-recently-used pages, and tracked modifications via a dirty bit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU eviction&lt;/strong&gt;: when the window is full, evict what was accessed longest ago. &lt;strong&gt;Dirty bit&lt;/strong&gt;: if the source changed since the artefact was loaded, flag it dirty and re-fetch before use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;art&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lru&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_long_term&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;art&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_source_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;art&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_dirty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;        &lt;span class="c1"&gt;# source changed → force re-fetch
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lru&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# promote to MRU
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;art&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RAG retrieves once and serves forever. A paged store tracks whether the source has changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Speculative Assembly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; hallucinations compounding across sessions because agent-generated context is written to memory without validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: CPU Reorder Buffer, Intel P6, 1995.&lt;/strong&gt; Execute speculatively, hold results in a buffer, commit only when confirmed valid. Wrong? Rollback.&lt;/p&gt;

&lt;p&gt;Assemble context optimistically. Start inference. If confidence exceeds threshold, commit to memory. If not, flag for human review. Do not write to long-term store. Without this gate, session one's hallucination becomes session two's "memory" becomes session three's "fact."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After model responds:
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;evaluator_confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;       &lt;span class="c1"&gt;# safe to write to long-term store
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flagged_for_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# hold — do not persist
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;a href="https://ostronaut.com" rel="noopener noreferrer"&gt;Ostronaut&lt;/a&gt;, we saw exactly this: unvalidated agent-generated context compounding into confidently wrong output downstream. The commit gate cut that class of failure by roughly half.&lt;/p&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;any multi-agent system without a commit/rollback gate on context writes will compound hallucinations across sessions within 30 days of production use.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Graceful Degradation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; token budget overflows that crash the API call or silently truncate critical context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: Radio Programme Stack, 1930s.&lt;/strong&gt; Dead air could never happen. When content overran, drop to the next segment. The broadcast always continued.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Triggers at&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (Full)&lt;/td&gt;
&lt;td&gt;&amp;lt; 80% budget&lt;/td&gt;
&lt;td&gt;All triage winners&lt;/td&gt;
&lt;td&gt;Happy path. Everything fits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (Summarised)&lt;/td&gt;
&lt;td&gt;80-95%&lt;/td&gt;
&lt;td&gt;Compress memories, truncate RAG&lt;/td&gt;
&lt;td&gt;Chat transcripts become 200-token summaries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (Core only)&lt;/td&gt;
&lt;td&gt;95-110%&lt;/td&gt;
&lt;td&gt;Human-verified facts + system prompt&lt;/td&gt;
&lt;td&gt;Only ground truth. Scratchpad and RAG dropped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 (Minimal)&lt;/td&gt;
&lt;td&gt;&amp;gt; 110%&lt;/td&gt;
&lt;td&gt;System prompt only. Human review flag.&lt;/td&gt;
&lt;td&gt;Emergency. Escalate.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
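&lt;p&gt;The tier table reduces to a few threshold checks. A minimal sketch, with the boundaries taken directly from the table above:&lt;/p&gt;

```python
def degradation_tier(token_count, budget):
    """Pick a degradation tier from assembled-context size vs token budget."""
    ratio = token_count / budget
    if ratio >= 1.10:
        return 4   # minimal: system prompt only, flag for human review
    if ratio >= 0.95:
        return 3   # core only: human-verified facts + system prompt
    if ratio >= 0.80:
        return 2   # summarised: compress memories, truncate RAG
    return 1       # full: all triage winners fit
```

&lt;p&gt;The earlier billing example lands in tier 1: 31,200 tokens against a 40,000 budget is a 78% ratio, so nothing gets dropped.&lt;/p&gt;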

&lt;h2&gt;
  
  
  The Composed Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assemble_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;triage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;speculator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assemble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fallback_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call produces an immutable manifest. When the compliance team asks "why did the agent say that?" you hand them the manifest.&lt;/p&gt;
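&lt;p&gt;A sketch of what such a manifest might look like as a frozen dataclass. This shows the shape, not the library's actual class:&lt;/p&gt;

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)   # frozen: attribute writes raise, so the record is immutable
class ContextManifest:
    """Record of exactly what the model saw on one call."""
    trace_id: str
    artefact_ids: tuple      # every artefact included in the prompt
    degradation_tier: int    # which fallback tier assembled this context
    token_count: int
    committed: bool = False  # set by the commit gate, never mutated afterwards
    created_at: float = field(default_factory=time.time)

    def to_audit_log(self):
        # One JSON line per call: greppable, diffable, append-only.
        return json.dumps(asdict(self), sort_keys=True)
```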

&lt;p&gt;I &lt;a href="https://dev.to/frameworks/agent-context-is-infrastructure/"&gt;argued previously&lt;/a&gt; that context is infrastructure, not a feature. This is the implementation pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;The first version didn't have the two-pass triage. Every artefact got embedded on every call. At 1ms per embedding multiplied by 1,000 artefacts, that's a full second of latency before inference starts. Adding R+P pre-filtering dropped that to roughly 20 embeddings per call. The two-pass approach seems obvious in retrospect. It's literally how ER triage works. But the RAG literature doesn't teach you to pre-filter before embedding.&lt;/p&gt;

&lt;p&gt;The other mistake: not implementing the dirty bit from day one. We had artefacts sitting in the context window whose external sources had already returned fresher data hours earlier, so the model was reasoning about stale state. Adding dirty-bit tracking on access (not just on write) was a one-line fix that eliminated an entire class of silent failures.&lt;/p&gt;
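&lt;p&gt;A toy sketch of dirty-bit tracking checked on access, not only on write. The class and method names are assumptions, not the library's API:&lt;/p&gt;

```python
class StaleArtefactError(KeyError):
    pass

class ArtefactStore:
    def __init__(self):
        self._values = {}
        self._dirty = set()   # the dirty bits

    def put(self, key, value):
        self._values[key] = value
        self._dirty.discard(key)   # a fresh write clears the bit

    def mark_dirty(self, key):
        # Called when an external tool reports that fresher data exists.
        self._dirty.add(key)

    def get(self, key):
        # The fix: check the dirty bit on ACCESS, not only on write,
        # so the model never silently reasons over stale state.
        if key in self._dirty:
            raise StaleArtefactError(key)
        return self._values[key]

store = ArtefactStore()
store.put("inventory", {"sku-1": 4})
store.mark_dirty("inventory")         # external source refreshed hours ago
try:
    store.get("inventory")
    stale_caught = False
except StaleArtefactError:
    stale_caught = True
store.put("inventory", {"sku-1": 2})  # reloading clears the bit
refreshed = store.get("inventory")
```

&lt;p&gt;Raising loudly on a dirty read turns the silent failure mode into an explicit reload step.&lt;/p&gt;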

&lt;p&gt;The third mistake is in the commit gate itself. The code checks &lt;code&gt;evaluator_confidence &amp;gt;= 0.7&lt;/code&gt;, but who computes that score? If the model self-evaluates, you're trusting the same system that may have hallucinated to judge whether it hallucinated. LLM confidence self-assessment is poorly calibrated. The honest answer: the library deliberately does not compute confidence. The caller must supply it via an external evaluator, a rule-based checker, or human-in-the-loop for high-stakes domains. The commit gate is necessary. What sits behind it is not yet solved.&lt;/p&gt;
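&lt;p&gt;The gate itself is simple; what matters is that the confidence score arrives from outside. A rough sketch, with illustrative names, of a commit gate that refuses to self-evaluate:&lt;/p&gt;

```python
def commit(memory, artefact, evaluator_confidence, threshold=0.7):
    # The gate never computes confidence itself: the caller must supply a
    # score from an external evaluator, a rule-based checker, or a human.
    if evaluator_confidence is None:
        raise ValueError("no external confidence supplied; refusing to commit")
    if evaluator_confidence >= threshold:
        memory.append(artefact)
        return True
    return False

memory = []
accepted = commit(memory, {"fact": "order 991 shipped"},
                  evaluator_confidence=0.85)
rejected = commit(memory, {"fact": "unverified claim"},
                  evaluator_confidence=0.40)
```

&lt;p&gt;Making the score a required argument, with no default, is the design choice: the one thing the gate must never do is invent its own confidence.&lt;/p&gt;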

&lt;h2&gt;
  
  
  When This Pattern Is Overkill
&lt;/h2&gt;

&lt;p&gt;Not every agent needs lifecycle management. If your agent doesn't write to its own memory and doesn't persist across sessions, standard RAG is sufficient. Single-session chatbots, prototypes with fewer than 100 artefacts, read-only Q&amp;amp;A over a fixed corpus: the overhead of triage, paging, and commit gates exceeds the benefit. This pattern pays off when context has a lifecycle. If it doesn't, skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Still Open
&lt;/h2&gt;

&lt;p&gt;What remains genuinely unresolved is governance at scale. When an agent has six months of context about a customer, who owns it? What happens under GDPR deletion requests? Do you tombstone or purge? If you purge, does the agent's behaviour change in ways that affect other customers? I'm &lt;a href="https://dev.to/frameworks/context-governance-at-scale/"&gt;working through that question next&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/talvinder/context-engine" rel="noopener noreferrer"&gt;full library&lt;/a&gt; is a single Python file, zero dependencies, open for anyone building production agents. The techniques are borrowed. The composition is yours to steal.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/os-paged-context-engine/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=os-paged-context-engine" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>agenticsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Context Governance at Scale</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:34:19 +0000</pubDate>
      <link>https://dev.to/talvinder/context-governance-at-scale-857</link>
      <guid>https://dev.to/talvinder/context-governance-at-scale-857</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/frameworks/os-paged-context-engine/"&gt;OS-Paged Context Engine&lt;/a&gt; handles the technical lifecycle: what loads, what gets evicted, what passes validation. It produces an immutable manifest for every call. But the manifest tells you &lt;em&gt;what&lt;/em&gt; the model saw. It does not tell you whether it &lt;em&gt;should&lt;/em&gt; have seen it.&lt;/p&gt;

&lt;p&gt;Production agents that handle money, health data, or customer PII need a governance layer above the pipeline. Access control, retention policies, deletion rights, multi-tenant isolation. These are governance problems, not engineering problems. And the industry has not solved them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manifest Is Not Enough
&lt;/h2&gt;

&lt;p&gt;An audit manifest records: trace ID, artefact list, token count, degradation tier, commit status. If a compliance officer asks "what did the agent access?" you can answer. That's table stakes.&lt;/p&gt;

&lt;p&gt;The harder questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should the agent have had access to that customer's payment history during a routine support query?&lt;/li&gt;
&lt;li&gt;The artefact was loaded from a shared scope. Three other agents also read it. One of them serves a competitor's account. Is that a data leak?&lt;/li&gt;
&lt;li&gt;The agent's response was committed to memory at confidence 0.85. Six months later, the customer invokes GDPR Article 17. Do you delete the artefact, the memory derived from it, or both?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions have no clean technical answer. They require policy, and policy requires architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Pattern Is Overkill
&lt;/h2&gt;

&lt;p&gt;Not every agent needs governance. Here's the decision tree:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip lifecycle management entirely if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is single-session. No memory persists. Standard RAG is sufficient.&lt;/li&gt;
&lt;li&gt;The corpus is small and static (fewer than 100 documents, updated quarterly). Triage and paging overhead exceeds the benefit.&lt;/li&gt;
&lt;li&gt;The agent is read-only. Never writes to its own memory. No compounding hallucination risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use the technical pipeline but skip governance if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent handles non-sensitive data. Productivity tools, code assistants, research summarisers. No PII, no financial data, no health records.&lt;/li&gt;
&lt;li&gt;Single-tenant deployment. One company, one agent, no cross-customer context risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You need the full governance layer when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent handles PII, financial data, or health records&lt;/li&gt;
&lt;li&gt;Multiple agents share a context store across customers or tenants&lt;/li&gt;
&lt;li&gt;You operate in a regulated industry (healthcare, insurance, financial services)&lt;/li&gt;
&lt;li&gt;The agent persists context for months and customers have deletion rights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;by 2028, any agent system handling PII without an auditable context manifest will fail compliance review in regulated industries.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR and the Tombstone Problem
&lt;/h2&gt;

&lt;p&gt;A customer requests deletion under GDPR Article 17. You purge their artefacts from the context store. The manifests that referenced those artefacts still exist in the audit log. The agent's behaviour was shaped by context that no longer exists.&lt;/p&gt;

&lt;p&gt;Two approaches, neither clean:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purge completely.&lt;/strong&gt; Delete artefacts, delete manifests, delete any memory derived from those artefacts. The agent's future behaviour changes because the context that shaped prior decisions is gone. If Agent B's response was informed by Agent A's output, which was informed by the deleted customer data, do you cascade the deletion?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tombstone.&lt;/strong&gt; Replace artefact content with a deletion marker: "Artefact deleted per GDPR request, [date]." Manifests remain intact for audit. The agent knows something was here but not what. This preserves audit trail integrity but may not satisfy strict interpretation of "right to erasure."&lt;/p&gt;
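&lt;p&gt;Mechanically, tombstoning is a small operation. A sketch under assumed data shapes (the store and manifest structures here are illustrative):&lt;/p&gt;

```python
def tombstone(store, artefact_id, request_date):
    # Overwrite the content with a deletion marker; manifests that reference
    # the artefact stay resolvable for audit, but the content is gone.
    store[artefact_id] = {
        "tombstone": True,
        "content": f"Artefact deleted per GDPR request, {request_date}",
    }

store = {"a-42": {"tombstone": False, "content": "customer payment history"}}
manifest = {"trace_id": "t-9", "artefact_ids": ["a-42"]}  # audit-log entry

tombstone(store, "a-42", request_date="2026-03-01")
resolved = store[manifest["artefact_ids"][0]]  # the manifest still resolves
```

&lt;p&gt;The audit trail survives; whether the marker itself satisfies "erasure" is the open legal question.&lt;/p&gt;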

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzxny62wbtts193euwtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzxny62wbtts193euwtw.png" alt="Diagram 1" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The honest answer: I don't know which is correct. The legal interpretation of "erasure" applied to derived AI context is untested in European courts. What I do know is that you need the manifest layer to even have this conversation. Without an audit trail, you cannot comply with a deletion request at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compliance as Architecture
&lt;/h2&gt;

&lt;p&gt;Enterprise buyers in healthcare, financial services, and insurance ask one question first: can you prove what the agent accessed?&lt;/p&gt;

&lt;p&gt;The context manifest maps directly to compliance frameworks they already understand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compliance Requirement&lt;/th&gt;
&lt;th&gt;What It Maps To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SOC2 audit logs&lt;/td&gt;
&lt;td&gt;Context manifest (trace ID, artefact list, timestamp)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIPAA access logs&lt;/td&gt;
&lt;td&gt;Manifest + agent_scope (who accessed what)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPR Article 15 (right of access)&lt;/td&gt;
&lt;td&gt;Manifest query: "all artefacts accessed for customer X"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPR Article 17 (right to erasure)&lt;/td&gt;
&lt;td&gt;Artefact deletion + manifest tombstoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PCI-DSS data isolation&lt;/td&gt;
&lt;td&gt;agent_scope + namespace isolation per tenant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
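&lt;p&gt;The Article 15 row, for instance, reduces to a query over the audit log. A minimal sketch, with illustrative field names:&lt;/p&gt;

```python
def artefacts_accessed_for(manifests, customer_id):
    # GDPR Article 15 as a manifest query: every artefact any call
    # loaded on behalf of this customer, deduplicated and sorted.
    seen = set()
    for m in manifests:
        if m["customer_id"] == customer_id:
            seen.update(m["artefact_ids"])
    return sorted(seen)

audit_log = [
    {"trace_id": "t-1", "customer_id": "c-7", "artefact_ids": ["a-1", "a-2"]},
    {"trace_id": "t-2", "customer_id": "c-9", "artefact_ids": ["a-3"]},
    {"trace_id": "t-3", "customer_id": "c-7", "artefact_ids": ["a-2", "a-4"]},
]
report = artefacts_accessed_for(audit_log, "c-7")
```

&lt;p&gt;Without the manifest layer, there is nothing to run this query against.&lt;/p&gt;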

&lt;p&gt;But &lt;code&gt;agent_scope&lt;/code&gt; alone is not sufficient for multi-tenant isolation. In the current implementation, scope is a string tag. No encryption boundary, no policy engine, no access control list. A developer who writes &lt;code&gt;agent_scope="global"&lt;/code&gt; on a PII artefact has just leaked it to every agent in the system.&lt;/p&gt;

&lt;p&gt;Production multi-tenant context isolation requires: namespace enforcement (scope is a hard boundary, not a suggestion), policy-as-code (which scopes can read which artefact types), encryption at rest per tenant, and audit logging on every cross-scope access attempt.&lt;/p&gt;
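&lt;p&gt;What "scope as a hard boundary" looks like, roughly: a policy table consulted on every read, with the attempt logged before it is allowed or refused. The policy, scopes, and artefact types below are illustrative assumptions, not the current implementation:&lt;/p&gt;

```python
class ScopeViolation(PermissionError):
    pass

# Policy-as-code: which scopes may read which artefact types.
READ_POLICY = {
    "tenant-a": {"notes", "orders"},
    "tenant-b": {"notes"},
}

def read_artefact(store, audit_log, artefact_id, agent_scope):
    # Scope is a hard boundary: every cross-scope attempt is logged,
    # then refused. Nothing here trusts a bare string tag on its own.
    artefact = store[artefact_id]
    allowed = artefact["type"] in READ_POLICY.get(agent_scope, set())
    audit_log.append({"artefact": artefact_id, "scope": agent_scope,
                      "allowed": allowed})
    if not allowed:
        raise ScopeViolation(f"{agent_scope} may not read {artefact['type']}")
    return artefact["content"]

store = {"a-1": {"type": "orders", "content": "order #991"}}
log = []
ok = read_artefact(store, log, "a-1", "tenant-a")
try:
    read_artefact(store, log, "a-1", "tenant-b")
    blocked = False
except ScopeViolation:
    blocked = True
```

&lt;p&gt;The enforcement point lives in the read path, not in developer discipline, which is the difference between a boundary and a suggestion.&lt;/p&gt;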

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;p&gt;The technical primitives for context governance exist: manifests, scopes, commit gates, audit logs. What doesn't exist is the organisational trust model.&lt;/p&gt;

&lt;p&gt;When an agent makes a decision based on six months of accumulated context, who is accountable? The engineer who built the pipeline? The data team that ingested the artefacts? The compliance officer who approved the retention policy?&lt;/p&gt;

&lt;p&gt;Kubernetes solved compute governance by making infrastructure declarative. You declare what you want, the system ensures it. Context governance needs the same shift: declare what the agent &lt;em&gt;should&lt;/em&gt; access, and the system enforces it. We're not there yet.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/frameworks/os-paged-context-engine/"&gt;technical pipeline&lt;/a&gt; is built. The &lt;a href="https://dev.to/frameworks/agent-context-is-infrastructure/"&gt;infrastructure argument&lt;/a&gt; is established. The governance layer is the missing piece. I'm building it in the open, and I don't have all the answers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/context-governance-at-scale/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=context-governance-at-scale" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>agenticsystems</category>
      <category>compliance</category>
    </item>
  </channel>
</rss>
