<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Talwar</title>
    <description>The latest articles on DEV Community by Nick Talwar (@talweezy).</description>
    <link>https://dev.to/talweezy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3117179%2F98df51dc-a114-4e60-9e38-87b83249f2ee.jpeg</url>
      <title>DEV Community: Nick Talwar</title>
      <link>https://dev.to/talweezy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/talweezy"/>
    <language>en</language>
    <item>
      <title>The CIO Role Just Split in Two. Here’s What You Need to Know.</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 16 Jun 2026 17:14:47 +0000</pubDate>
      <link>https://dev.to/talweezy/the-cio-role-just-split-in-two-heres-what-you-need-to-know-34f2</link>
      <guid>https://dev.to/talweezy/the-cio-role-just-split-in-two-heres-what-you-need-to-know-34f2</guid>
      <description>&lt;p&gt;Why the Best AI Leaders Run Offense and Defense Simultaneously&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb24u8h4fjja2yor3cdfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb24u8h4fjja2yor3cdfk.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fourteen AI initiatives on a single roadmap, governed by one steering committee, measured against one set of success criteria. Half are automating existing workflows to protect margins. The other half are building capabilities the company has never offered before. Meanwhile, the budget, risk framework, and quarterly check-in schedule remain stagnant.&lt;/p&gt;

&lt;p&gt;This is what most enterprise AI portfolios look like right now. And it explains why so many of them feel stuck.&lt;/p&gt;

&lt;p&gt;The two halves of that portfolio are fundamentally different games. One is about protecting what already works. The other is about building what comes next. Each requires different ownership, different timelines, different metrics, and different tolerance for ambiguity. Running them as a single strategy is like training for a marathon and a sprint on the same schedule. The structure guarantees that one of them suffers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most Organizations Miss
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mckinsey.com/capabilities/mckinsey-technology/our-insights/mckinsey-global-tech-agenda-2026" rel="noopener noreferrer"&gt;McKinsey’s Global Tech Agenda 2026&lt;/a&gt; found that the CIOs delivering measurable value have made a specific shift. They’ve moved technology from a cost center to what McKinsey calls a “value creator,” embedding AI and data directly into operating models.&lt;/p&gt;

&lt;p&gt;But the research surfaced a clear divide between organizations that are simply modernizing their technology estate and those that are rewiring for competitive advantage.&lt;/p&gt;

&lt;p&gt;That divide maps to a pattern I keep running into with enterprise leaders. The companies actually moving forward are playing two distinct games at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;On defense, they’re using AI and Agents to protect the core business. Automating manual workflows, tightening operational efficiency, reducing cost structures that have been bloated for years.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On offense, they’re building new capabilities. New products, new revenue streams, new ways of reaching customers that weren’t possible eighteen months ago.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most organizations don’t have a mental model for this split. They’re either in pure cost-cutting mode or chasing growth, and the AI and Agentic AI strategy simply reflects whichever game the board happens to be pushing this quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Defense Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Defensive AI and Agent work targets processes you understand well, with outcomes you can measure in months and risk profiles you can model. Automated claims processing. Intelligent document extraction. Predictive maintenance on equipment that’s already generating revenue.&lt;/p&gt;

&lt;p&gt;The success criteria are clear. Faster cycle times, lower error rates, reduced headcount for routine tasks, better margins on existing lines of business. The value case is arithmetic, and the ROI conversation is relatively straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Offense Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Offensive AI builds capabilities that didn’t exist before. You’re not optimizing a known process. You’re testing whether a new process should exist at all.&lt;/p&gt;

&lt;p&gt;These projects look like using AI to enter adjacent markets with personalized products, or building recommendation engines that fundamentally change how customers discover what you sell, or creating internal decision-support tools that give your operators information advantages competitors don’t have.&lt;/p&gt;

&lt;p&gt;The success criteria are murkier. You’re measuring learning velocity, market signal, and option value. The ROI conversation is harder, and the organizational patience required is significantly higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Efficiency Eats Innovation
&lt;/h2&gt;

&lt;p&gt;When companies run offense and defense under the same governance structure, the defensive projects almost always win the resource fight.&lt;/p&gt;

&lt;p&gt;Defense gets measured on efficiency, cost reduction, and operational reliability. The governance is tighter and accountability sits with operational leaders who own the processes being improved.&lt;/p&gt;

&lt;p&gt;Offense gets measured on learning rate, market validation, and strategic optionality. The governance is much lighter, and the timelines are longer.&lt;/p&gt;

&lt;p&gt;Overall, defensive projects are easier to justify, easier to measure, and easier to get approved. So offensive projects get deprioritized because they can’t compete on the same ROI framework.&lt;/p&gt;

&lt;p&gt;The result is a portfolio that looks busy, but only plays one game. The company gets more efficient at what it already does while falling behind on what it could become. The board sees cost savings and assumes the AI and Agent strategy is working, but nobody’s building anything that changes the company’s competitive position.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic
&lt;/h2&gt;

&lt;p&gt;If you’re running AI and Agent initiatives right now, here’s a quick test. Look at your active portfolio and sort every project into one of two columns. Column one: protecting existing revenue and margin. Column two: building something you’ve never had before.&lt;/p&gt;

&lt;p&gt;If you can’t sort them cleanly, your strategy is probably conflated.&lt;/p&gt;

&lt;p&gt;The companies losing ground on AI and Agents aren’t necessarily the ones spending too little. They’re the ones who never made the split visible, never assigned ownership to each side, and ended up with a portfolio that defaults to whichever pressure is loudest.&lt;/p&gt;

&lt;p&gt;Making the split explicit is the first step toward making it work.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cioleadership</category>
      <category>enterprisestrategy</category>
      <category>technologyleadership</category>
    </item>
    <item>
      <title>5 Org Chart Mistakes That Are Killing ROI in the AI and Agent Era</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 09 Jun 2026 12:41:57 +0000</pubDate>
      <link>https://dev.to/talweezy/5-org-chart-mistakes-that-are-killing-roi-in-the-ai-and-agent-era-24b4</link>
      <guid>https://dev.to/talweezy/5-org-chart-mistakes-that-are-killing-roi-in-the-ai-and-agent-era-24b4</guid>
      <description>&lt;p&gt;Organizational structure determines AI outcomes more than technology ever will&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8y1l0zenzunvfwzzj4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8y1l0zenzunvfwzzj4d.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mckinsey.com/~/media/mckinsey/business%20functions/quantumblack/our%20insights/the%20state%20of%20ai/2025/the-state-of-ai-how-organizations-are-rewiring-to-capture-value_final.pdf" rel="noopener noreferrer"&gt;McKinsey’s research&lt;/a&gt; found that more than 80% of organizations are not yet seeing a tangible impact on enterprise-level EBIT from AI and Agents. This suggests that while adoption is broadening, most companies are still struggling to turn AI and Agents into scaled financial results.&lt;/p&gt;

&lt;p&gt;But there is an important piece of the story that is missing. &lt;a href="https://www.aigovernancetoday.com/news/enterprise-ai-spending-crisis-2026" rel="noopener noreferrer"&gt;A separate analysis&lt;/a&gt; of 140 enterprise AI implementations found that 77% of failures were organizational in nature, with technical issues like model performance, data quality, and integration complexity accounting for less than a quarter.&lt;/p&gt;

&lt;p&gt;Your org chart is the first system AI has to survive before it reaches a single customer or workflow, and these five structural mistakes consistently prevent it from getting there.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your Chief AI Officer Reports Nowhere Near the P&amp;amp;L
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://static1.squarespace.com/static/62adf3ca029a6808a6c5be30/t/6942c3cb535da44088c2dbff/1765983179572/2026+AI+%26+Data+Leadership+Executive+Benchmark+Survey+Final.pdf" rel="noopener noreferrer"&gt;The 2026 AI &amp;amp; Data Leadership Executive Benchmark Survey&lt;/a&gt; found that 38.5% of companies have now appointed a Chief AI Officer or equivalent, but there’s almost no consensus on where that role sits. Reporting lines are split across technology, business, and transformation leadership, with no dominant model emerging and no clear pattern connecting any one reporting structure to better outcomes.&lt;/p&gt;

&lt;p&gt;That fragmentation carries real downstream consequences. When AI leadership reports into a CTO or CIO function, the role tends to optimize for infrastructure and tooling decisions rather than business impact. When it reports into a transformation office, it gravitates toward strategy decks and governance frameworks that rarely survive contact with operational reality.&lt;/p&gt;

&lt;p&gt;Neither path connects AI or Agents directly to revenue, margin, or operational throughput, which means the person nominally responsible for AI results often has no line of sight into the metrics that define them.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Your AI or Agent Team Lives in IT Instead of in the Business
&lt;/h2&gt;

&lt;p&gt;When AI or Agent capability gets housed inside the IT department, it inherits IT’s entire operating model, meaning projects get scoped through a service request lens, prioritization follows the IT backlog, and success gets measured in uptime and deployment velocity rather than business outcomes.&lt;/p&gt;

&lt;p&gt;This is a fundamental structural mismatch. AI is a business capability that requires technical infrastructure, and the distinction matters because AI initiatives that start with a business problem and work backward toward the right technical approach tend to survive past the pilot stage, while initiatives that start with a model and go looking for a use case tend to stall indefinitely.&lt;/p&gt;

&lt;p&gt;Organizations running AI teams embedded within business units, or at minimum co-located with business leadership, consistently outperform centralized IT-led models on both adoption and value delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Your Steering Committee Owns Accountability for Nothing
&lt;/h2&gt;

&lt;p&gt;AI steering committees are one of the most popular governance structures in enterprise AI programs, and they’re also one of the least effective.&lt;/p&gt;

&lt;p&gt;The typical setup includes senior representatives from multiple functions who meet monthly to review progress, offer guidance, and align priorities, but in practice, these committees almost always devolve into a venue for status updates where no actual decisions get made.&lt;/p&gt;

&lt;p&gt;The root issue is accountability without power. Steering committees rarely control budget allocation, staffing decisions, or deployment timelines, which means they can recommend changes but have no mechanism to compel them. When an AI initiative hits an organizational obstacle (and every one does), the committee discusses it, documents it, and then waits for someone else to resolve it, creating a governance layer that absorbs time without reducing friction.&lt;/p&gt;

&lt;p&gt;Research on AI governance maturity from &lt;a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/tech-forward/state-of-ai-trust-in-2026-shifting-to-the-agentic-era" rel="noopener noreferrer"&gt;McKinsey’s 2026 AI Trust Maturity Survey&lt;/a&gt; reinforces how widespread this gap is, with only about 30% of organizations reaching a maturity level of three or higher in governance, even as their technical and data capabilities continue to advance. The organizational decision-making apparatus simply hasn’t kept pace with the technology it’s supposed to govern.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. You Built AI Skills in One Team and Called It Done
&lt;/h2&gt;

&lt;p&gt;Concentrating AI talent in a single team feels efficient at first, but the problems with this approach emerge at scale. When every AI initiative has to flow through the same team, that team becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;This pattern appears so frequently in enterprise organizations that it has earned a name in organizational design circles. It’s called the Center of Excellence trap.&lt;/p&gt;

&lt;p&gt;The CoE starts as a strategic asset and gradually evolves into a capacity constraint that chokes the very pipeline it was built to open. &lt;a href="https://www.cio.com/article/4099513/how-to-keep-ai-plans-intact-before-agents-run-amok.html" rel="noopener noreferrer"&gt;A CIO article from late 2025&lt;/a&gt; described the resulting dynamic well, noting that business units inevitably branch off on their own when the central AI team can’t keep pace, creating fragmented and ungoverned efforts scattered across the company with no shared standards or oversight.&lt;/p&gt;

&lt;p&gt;The more sustainable model is capability distribution. Instead of hoarding AI expertise in one group, the investment goes into building baseline AI literacy and applied skills across functions. This allows the central team to shift from doing the work to enabling others to do it by providing tooling, standards, training, and quality guardrails while the business units own execution and outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Your Center of Excellence Has No Authority to Make Anything Stick
&lt;/h2&gt;

&lt;p&gt;This is the inverse of mistake four. Some organizations do build a Center of Excellence with a genuine mandate to drive AI adoption across the enterprise, staffing it well, giving it a clear charter, and expecting it to set standards for how AI gets developed, deployed, and monitored. Then they forget to give it any enforcement power.&lt;/p&gt;

&lt;p&gt;What follows is predictable. The CoE publishes best practices that business units ignore, develops governance frameworks that project teams route around, and recommends tooling standards that departments override. Without budget influence or the organizational standing to block non-compliant deployments, the CoE becomes an advisory function that advises no one in particular and enforces nothing at all.&lt;/p&gt;

&lt;p&gt;This is a design failure at the leadership level. A CoE with clear standards but no enforcement mechanism creates the illusion of governance while fragmented, uncoordinated AI adoption continues underneath it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Infrastructure Problem
&lt;/h2&gt;

&lt;p&gt;These five mistakes share a common thread. They all treat AI as something that can be added to an existing organizational structure without redesigning how decisions get made, who owns outcomes, and where authority actually lives.&lt;/p&gt;

&lt;p&gt;AI underperformance in most organizations traces back to an org chart that was built for a different kind of work and never updated to reflect how AI-driven operations actually need to function.&lt;/p&gt;

&lt;p&gt;The companies capturing real returns in 2026 are the ones willing to redesign reporting lines, redistribute decision rights, and place AI leadership where it can actually influence how the business operates on a daily basis.&lt;/p&gt;

&lt;p&gt;If you’re reviewing your AI strategy this quarter, start with the org chart. The structure you’re running determines the ceiling of what AI can deliver, and right now, most ceilings are set lower than anyone realizes.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>aistrategy</category>
      <category>enterpriseai</category>
      <category>organizationaldesign</category>
    </item>
    <item>
      <title>4 Ways to Keep Your AI and Agent Costs Down</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:10:20 +0000</pubDate>
      <link>https://dev.to/talweezy/4-ways-to-keep-your-ai-and-agent-costs-down-38no</link>
      <guid>https://dev.to/talweezy/4-ways-to-keep-your-ai-and-agent-costs-down-38no</guid>
      <description>&lt;p&gt;The architectural decisions that separate controlled spend from compounding surprises&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49o24x14t2ije4enpauo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49o24x14t2ije4enpauo.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI and Agentic AI costs have a way of looking reasonable right up until they aren’t.&lt;/p&gt;

&lt;p&gt;The early pilots run on contained use cases with limited traffic, so the numbers stay small and nobody questions the architecture behind them. Then the product scales. Teams start layering inference calls into features that weren’t in the original cost model, and the spend starts compounding in places nobody is watching.&lt;/p&gt;

&lt;p&gt;By the time finance flags the invoice, the architecture driving those costs is already embedded in production and expensive to change. A Gartner survey found that more than 90% of CIOs say managing cost limits their ability to extract value from AI at scale.&lt;/p&gt;

&lt;p&gt;The problem is rarely any single API call. It’s the accumulation of decisions that were never designed to hold up under real production volume. These four levers address that directly. Each one targets a different layer of the cost structure, and together they give you a system that stays predictable as usage grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Right-Size Model Selection to Task Complexity
&lt;/h2&gt;

&lt;p&gt;The fastest way to cut AI costs without changing outcomes is to stop sending every request to your most capable model. Most production AI workloads follow a clear pattern where a small percentage of requests require deep reasoning while the majority involve extraction, classification, or short-form responses that a lighter model handles just as well.&lt;/p&gt;

&lt;p&gt;A model routing layer evaluates each incoming request and directs it to the appropriate model based on complexity, confidence thresholds, or task type. Simple queries go to smaller, faster, cheaper models. Only the requests that genuinely need frontier-class reasoning get routed to the expensive option.&lt;/p&gt;

&lt;p&gt;The impact is significant. Industry benchmarks consistently show that intelligent routing reduces inference costs by 30% to 60% in mixed-workload environments, and in some configurations the savings reach even higher. IBM research has highlighted estimates that routing a portion of queries to smaller models can reduce inference costs by up to 85% compared to always using the largest available model.&lt;/p&gt;

&lt;p&gt;When 70% to 80% of your traffic can be handled by a model that costs a fraction of your top-tier option, the math changes quickly. The key is building this routing logic into the architecture early, before usage patterns are established and before teams develop habits around defaulting to a single model for everything.&lt;/p&gt;
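&lt;p&gt;As a rough illustration, the routing idea above can be sketched in a few lines of Python. The tier names, prices, task types, and length cutoff are all hypothetical placeholders, and a production router would more likely rely on a classifier or a confidence score than on a fixed rule like this one.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical tiers and prices (USD per 1K tokens); substitute real figures.
PRICE_PER_1K_TOKENS = {"small": 0.0002, "frontier": 0.01}

# Task types assumed simple enough for the small model (an assumption).
SIMPLE_TASKS = {"extraction", "classification", "short_answer"}


@dataclass
class Request:
    task_type: str
    prompt: str


def route(request: Request, max_simple_chars: int = 2000) -> str:
    """Pick a model tier from task type and prompt length."""
    is_simple = request.task_type in SIMPLE_TASKS
    is_short = max_simple_chars >= len(request.prompt)
    return "small" if is_simple and is_short else "frontier"


def blended_price(share_to_small: float) -> float:
    """Average price per 1K tokens when a share of traffic uses the small tier."""
    small = PRICE_PER_1K_TOKENS["small"]
    frontier = PRICE_PER_1K_TOKENS["frontier"]
    return share_to_small * small + (1 - share_to_small) * frontier


if __name__ == "__main__":
    print(route(Request("classification", "Label this support ticket.")))
    print(route(Request("multi_step_reasoning", "Draft a migration plan.")))
    print(round(blended_price(0.75), 5))
```

&lt;p&gt;With these made-up prices, sending 75% of traffic to the small tier brings the blended price to about $0.00265 per 1K tokens, against $0.01 when every request goes to the frontier tier. The real figure depends entirely on your own price list and traffic mix.&lt;/p&gt;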

&lt;h2&gt;
  
  
  2. Build Caching Layers for Predictable and Repetitive Inputs
&lt;/h2&gt;

&lt;p&gt;Every time your system pays for an inference call that produces the same output as a previous call with identical or near-identical input, you’re burning money on redundant compute. In most production AI and Agent systems, this happens more often than teams realize. Support workflows, document processing pipelines, and internal tools all generate repetitive queries that trigger fresh inference calls unnecessarily.&lt;/p&gt;

&lt;p&gt;Caching addresses this by storing responses to previous inputs and returning cached results when a sufficiently similar request comes in. Semantic caching takes this further by using embedding similarity to match new queries against previously answered ones, so you don’t need exact string matches to get a cache hit.&lt;/p&gt;

&lt;p&gt;For applications with stable system prompts or repeated reference documents, prompt caching alone can cut costs by 50% to 90% on eligible workloads. That’s a significant margin improvement for what is fundamentally an infrastructure decision, not a product change.&lt;/p&gt;
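&lt;p&gt;A minimal sketch of the simplest version of this, an exact-match response cache, is shown here. It assumes a hypothetical &lt;code&gt;call_model&lt;/code&gt; function standing in for whatever inference client you use, and it only normalizes case and whitespace. It does not implement semantic caching or a provider’s prompt caching, and it has no expiry or size limit.&lt;/p&gt;

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Exact-match cache keyed on a normalized prompt."""

    def __init__(self, call_model: Callable[[str], str]):
        # call_model is a hypothetical stand-in for a paid inference call.
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for one inference call and remember the result.
        self.misses += 1
        response = self._call_model(prompt)
        self._store[key] = response
        return response
```

&lt;p&gt;Tracking &lt;code&gt;hits&lt;/code&gt; and &lt;code&gt;misses&lt;/code&gt; gives you a hit rate, which is a reasonable first signal of whether a workload repeats itself enough to justify a more elaborate cache.&lt;/p&gt;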

&lt;h2&gt;
  
  
  3. Monitor Cost Per Outcome, Not Cost Per API Call
&lt;/h2&gt;

&lt;p&gt;Most teams track AI and Agent spend at the wrong level of granularity. They watch cost per API call or cost per token, optimize those numbers, and then wonder why the overall bill keeps climbing. The problem is that per-call metrics tell you how efficiently your infrastructure runs, but they tell you nothing about whether the spend is generating proportional business value.&lt;/p&gt;

&lt;p&gt;The metric that actually matters is cost per outcome. What does it cost to resolve one support ticket, process one document, or generate one qualified recommendation? When you measure at the outcome level, you start seeing which features and workflows are efficient and which ones burn through tokens without producing proportional results.&lt;/p&gt;

&lt;p&gt;This shift in measurement changes how teams make decisions. A workflow that costs $0.002 per API call looks cheap in isolation, but if it takes 40 calls to produce one usable output, your effective cost per outcome is $0.08. Another workflow might cost $0.01 per call but deliver a result in three calls, for an effective cost of $0.03 per outcome, which makes it more than twice as cost-effective despite the higher per-call price. Without outcome-level tracking, teams end up optimizing the wrong variable. They hit their API budget targets while the business bleeds margin on features that consume far more inference than their value justifies.&lt;/p&gt;

&lt;p&gt;Building this visibility requires tagging inference calls by feature, workflow, and business outcome so you can attribute costs accurately. It’s operational overhead up front, but it gives you the data to make allocation decisions that actually improve unit economics.&lt;/p&gt;
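&lt;p&gt;The attribution step can be sketched roughly as follows, assuming each inference call is already logged as a pair of workflow tag and cost. The function name, the tags, and the record shape are illustrative assumptions, and the sample figures reuse the per-call prices and call counts from the example above.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_per_outcome(
    calls: Iterable[Tuple[str, float]],
    outcomes: Dict[str, int],
) -> Dict[str, float]:
    """Sum tagged call costs per workflow, then divide by outcomes delivered."""
    spend: Dict[str, float] = defaultdict(float)
    for workflow, cost_usd in calls:
        spend[workflow] += cost_usd
    return {
        workflow: spend[workflow] / count
        for workflow, count in outcomes.items()
        if count > 0  # skip workflows with no completed outcomes
    }


if __name__ == "__main__":
    # Each tuple is (workflow tag, cost of one inference call in USD).
    calls = [("workflow_a", 0.002)] * 40 + [("workflow_b", 0.01)] * 3
    outcomes = {"workflow_a": 1, "workflow_b": 1}
    for name, cost in cost_per_outcome(calls, outcomes).items():
        print(name, round(cost, 4))
```

&lt;p&gt;The workflow with the cheaper calls comes out at roughly $0.08 per outcome and the one with the pricier calls at roughly $0.03, which is the comparison a per-call dashboard hides.&lt;/p&gt;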

&lt;h2&gt;
  
  
  4. Create a Deprecation Practice for Low-Value Use Cases
&lt;/h2&gt;

&lt;p&gt;Not every AI-powered feature deserves to keep running. As products evolve, teams tend to accumulate use cases without revisiting whether each one still clears a reasonable cost-to-value threshold. A feature that made sense during a pilot, when call volume was low and the marginal cost was negligible, can become a quiet drain on your budget once it’s processing thousands of requests per day in production.&lt;/p&gt;

&lt;p&gt;A formal deprecation practice addresses this by establishing a regular review cycle where every active AI use case and Agent gets evaluated against its actual cost and measured value. Use cases that fall below the threshold get flagged for rearchitecting, downsizing to a cheaper model, or retiring entirely.&lt;/p&gt;

&lt;p&gt;This is where most AI cost problems actually live. They aren’t unit cost problems. They’re accumulation problems. Twenty features each burning a small amount of unjustified spend add up to a significant line item that nobody owns because nobody is looking at the portfolio as a whole.&lt;/p&gt;

&lt;p&gt;The review doesn’t need to be complicated. Quarterly is a reasonable cadence. The criteria should include cost per outcome (from the monitoring practice above), usage volume trends, and a clear-eyed assessment of whether the feature still aligns with product priorities.&lt;/p&gt;
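&lt;p&gt;One way to make the review repeatable is to encode the three criteria as a simple rule. The sketch below is only an assumption about how such a rule might look: the field names, the value estimate, and the default threshold are hypothetical, and any real review would still involve judgment.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    cost_per_outcome: float   # USD, from outcome-level monitoring
    value_per_outcome: float  # USD, an estimate the business supplies
    volume_trend: float       # fractional change in usage vs. last quarter
    aligned: bool             # still matches product priorities?


def review(use_case: UseCase, min_value_ratio: float = 1.0) -> str:
    """Return 'keep', 'rearchitect', or 'retire' for one use case."""
    if not use_case.aligned:
        return "retire"
    threshold = min_value_ratio * use_case.cost_per_outcome
    if use_case.value_per_outcome >= threshold:
        return "keep"
    # Below threshold: growing usage justifies a cheaper design; otherwise retire.
    return "rearchitect" if use_case.volume_trend > 0 else "retire"


if __name__ == "__main__":
    portfolio = [
        UseCase("ticket_summaries", 0.03, 0.50, 0.10, True),
        UseCase("auto_tagging", 0.08, 0.05, 0.25, True),
        UseCase("legacy_digest", 0.08, 0.05, -0.30, True),
    ]
    for item in portfolio:
        print(item.name, review(item))
```

&lt;p&gt;Here “rearchitect” covers both redesigning the workflow and moving it to a cheaper model. Running the rule over the whole portfolio each quarter is what surfaces the accumulation that nobody owns.&lt;/p&gt;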

&lt;h2&gt;
  
  
  Revisit Your Architecture to Sustain Your ROI
&lt;/h2&gt;

&lt;p&gt;Each of these four levers operates at a different layer of the cost structure, and none of them require you to sacrifice capability or slow down product development. Model routing targets per-call efficiency. Caching eliminates redundant compute. Outcome-level monitoring gives you the data to allocate intelligently. And deprecation keeps your portfolio from accumulating dead weight.&lt;/p&gt;

&lt;p&gt;The common thread is that AI cost management is an architecture problem. The decisions that determine your spend at scale are made by engineering teams during system design, not by finance teams during contract negotiation. The organizations that keep their costs predictable are the ones that treat these decisions as first-class architectural concerns from the beginning, rather than scrambling to retrofit controls after the bill becomes a boardroom conversation.&lt;/p&gt;


</description>
      <category>aicostoptimization</category>
      <category>enterpriseai</category>
      <category>llminfrastructure</category>
      <category>aistrategy</category>
    </item>
    <item>
      <title>Your AI and Agent Rollout Needs a Problem-Definition Process</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 26 May 2026 14:04:02 +0000</pubDate>
      <link>https://dev.to/talweezy/your-ai-and-agent-rollout-needs-a-problem-definition-process-1hll</link>
      <guid>https://dev.to/talweezy/your-ai-and-agent-rollout-needs-a-problem-definition-process-1hll</guid>
      <description>&lt;p&gt;How Product Management Discipline Separates Lasting AI and Agent Adoption from Expensive Shelf-Ware&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlv9a8ci27l06z360j45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlv9a8ci27l06z360j45.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve all read about the AI rollouts that go awry. Tools get purchased, training gets scheduled, an adoption campaign goes out, but within two months the usage curve flattens because nobody in the organization can answer a simple question: &lt;/p&gt;

&lt;p&gt;What specific problem are we solving, and how will we know we solved it?&lt;/p&gt;

&lt;p&gt;I’ve spent years leading teams from both an engineering and product management perspective, so I’ve seen from the trenches why this obvious question can get skipped. The urgency to “adopt AI” pushes companies straight into tool selection and training programs while the harder, slower work of defining which problems are actually worth solving never happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Discipline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hbr.org/2026/02/to-drive-ai-adoption-build-your-teams-product-management-skills" rel="noopener noreferrer"&gt;A recent Harvard Business Review study&lt;/a&gt; by Amanda Pratt and Melissa Valentine examined AI adoption at a major tech company and surfaced a finding that should reframe how every operator thinks about this problem.&lt;/p&gt;

&lt;p&gt;It was no surprise to me that the area most correlated with successful, sustained AI adoption turned out to be product management, not prompt engineering or technical fluency. The disciplines that mattered most were defining which problems are worth solving, designing structured experiments, and integrating solutions into the way work already happens.&lt;/p&gt;

&lt;p&gt;These findings line up with what I've observed across dozens of AI and Agentic AI engagements. The companies where AI actually takes root are the ones that approach adoption with product discipline, starting with a specific workflow, identifying a measurable friction point, building a small test, and evaluating results before scaling anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Companies, Two Approaches
&lt;/h2&gt;

&lt;p&gt;Consider the difference between two real patterns I see repeatedly in enterprise AI and Agentic AI work.&lt;/p&gt;

&lt;p&gt;1) Company A purchases an AI platform, negotiates an enterprise license, builds a prompt library, and launches a change management campaign complete with lunch-and-learns, weekly tip emails, and a login dashboard to track "adoption." After three months, a handful of power users have integrated the tool into their workflows, and everyone else has moved on.&lt;/p&gt;

&lt;p&gt;2) Company B takes a different path. Before selecting any tool, they run a structured problem-definition process across three business units. Each unit identifies its highest-friction workflow, documents the current state in detail, and defines what a measurable improvement would look like. Only then does the team evaluate which AI capabilities (if any) could address those specific problems. They run 30-day pilots with clear success criteria, and when two of the three pilots produce measurable gains, those two scale while the third gets killed early, saving months of wasted effort.&lt;/p&gt;

&lt;p&gt;One of those pilots, for example, targeted a procurement approval workflow that averaged nine days from request to sign-off. The team mapped every handoff, identified two steps where AI-assisted document review could eliminate manual bottlenecks, and set a target of reducing cycle time to under four days. After the pilot, cycle time dropped to three and a half days. That result gave leadership concrete evidence to fund a broader rollout in procurement, and the specificity of the success made it easy to communicate across the organization.&lt;/p&gt;

&lt;p&gt;Company B spent less money, took slightly longer to get started, and ended up with AI embedded in actual workflows producing actual results. Company A spent more, moved faster, and ended up with an expensive tool that sits mostly unused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Problem-Definition Keeps Getting Skipped
&lt;/h2&gt;

&lt;p&gt;The rise of AI has put immense pressure on companies to move fast, and problem definition feels slow. Buying a tool and launching a training program, by contrast, feel like action.&lt;/p&gt;

&lt;p&gt;There's also a structural gap. Most organizations assign AI adoption to IT or to a newly created "AI team" that reports to the CTO. Those teams are good at evaluating technology. They're less practiced at the product management work of scoping problems, defining success metrics, and designing experiments within business workflows they don't own. The people closest to the workflows (operations leads, department managers, senior ICs) rarely get pulled into the problem-definition phase because the initiative is framed as a technology project, not a workflow improvement project.&lt;/p&gt;

&lt;p&gt;Velocity without direction is just expensive motion. The organizations I work with that have the strongest AI adoption results are the ones that invested the first four to six weeks in problem definition and a Data Story / IP Moat audit before evaluating a single vendor. That initial patience created clarity that made everything downstream faster, from tool selection to pilot design to scaling decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic Question
&lt;/h2&gt;

&lt;p&gt;If you want to know whether your AI or Agentic AI adoption effort has legs, ask one question across every team that's supposed to be using AI. Can they answer, specifically, what problem they're solving and how they'll know if they've solved it?&lt;/p&gt;

&lt;p&gt;If the answer is vague ("We're using AI to be more efficient") or circular ("We're adopting AI because we need to adopt AI"), the rollout is already in trouble. Clear problem statements are the leading indicator of whether AI adoption will stick or stall.&lt;/p&gt;

&lt;p&gt;The companies that bring product management discipline to AI adoption, with defined problems, scoped experiments, and honest evaluation, end up with AI embedded in their actual operations. Everyone else ends up with a line item on the budget and a login dashboard nobody checks.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts. &lt;br&gt;
→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;br&gt;
→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>productmanagement</category>
      <category>enterprisetechnology</category>
      <category>startupstrategy</category>
    </item>
    <item>
      <title>6 Things Your AI Agents Need That You're Probably Not Building</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 19 May 2026 17:35:06 +0000</pubDate>
      <link>https://dev.to/talweezy/6-things-your-ai-agents-need-that-youre-probably-not-building-32hi</link>
      <guid>https://dev.to/talweezy/6-things-your-ai-agents-need-that-youre-probably-not-building-32hi</guid>
      <description>&lt;p&gt;The infrastructure that separates agents that demo well from agents that actually run&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnc6ueycn1gwicc6oqjys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnc6ueycn1gwicc6oqjys.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You would never bring a new hire onto your team without performance feedback, escalation paths, or a way to know when they're struggling. Yet that's exactly how most organizations deploy AI agents. &lt;a href="https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/" rel="noopener noreferrer"&gt;MIT Sloan and BCG's 2025 research&lt;/a&gt; found that 76% of executives now describe agents as coworkers rather than tools, but almost none of them are managing agents that way. They ship the agent and move on.&lt;/p&gt;

&lt;p&gt;Deciding to call your agents “coworkers” is easy. Setting up the feedback loops, escalation paths, and failure signals that let you actually manage them like coworkers is where teams stall. It's almost entirely an infrastructure problem, and these are the six pieces most teams skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Evaluation Frameworks
&lt;/h2&gt;

&lt;p&gt;A working agent and a reliable agent are two different things. Evaluation frameworks give you the ability to measure the difference before your users discover it for you. This means building structured test suites that run against your agent's outputs on a regular cadence, scoring for accuracy, relevance, and task completion across a range of realistic scenarios.&lt;/p&gt;

&lt;p&gt;Good evaluation suites include both deterministic checks (did the agent call the right tool with the right parameters?) and judgment-based scoring (was the response actually useful to the person asking?). &lt;/p&gt;

&lt;p&gt;The key is that evaluation has to be continuous, running in CI/CD pipelines and against live traffic, because agent behavior shifts as underlying models update and data distributions change. LLMs, the technology underlying agents, are probabilistic at their core: their outputs are drawn from an often opaque statistical distribution, and that distribution can shift over time, changing performance and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic's engineering team has written publicly&lt;/a&gt; about maintaining evaluation suites as living artifacts, with dedicated teams owning the infrastructure while domain experts contribute tasks and run the tests themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Fallback and Escalation Logic
&lt;/h2&gt;

&lt;p&gt;Every agent will encounter situations it cannot handle. The question is whether you've decided in advance what happens next, or whether the agent improvises.&lt;/p&gt;

&lt;p&gt;Fallback logic defines the boundaries. When confidence drops below a threshold, when a tool call returns unexpected data, when the task exceeds the agent's defined scope, the system needs a predetermined path. That path might route to a simpler deterministic process, a different model, or a human operator. Escalation logic layers on top of that by adding severity awareness.&lt;/p&gt;

&lt;p&gt;Without explicit escalation tiers, every failure gets the same treatment, which means either everything gets flagged (and humans stop paying attention) or nothing does (and real problems slip through). The organizations successfully scaling agents build these paths before deployment, treating them as load-bearing architecture.&lt;/p&gt;
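
&lt;p&gt;One way to picture predetermined paths with severity tiers is the sketch below. The route names, the confidence floor, and the 1-to-3 severity scale are assumptions chosen for illustration. Real values would come from your own evaluation data.&lt;/p&gt;

```python
from enum import Enum


class Route(Enum):
    AGENT = "agent"                # agent proceeds on its own
    FALLBACK = "fallback"          # simpler deterministic process
    HUMAN = "human"                # human operator, normal queue
    HUMAN_URGENT = "human_urgent"  # human operator, flagged immediately


# Hypothetical values, not recommendations.
CONFIDENCE_FLOOR = 0.75
URGENT_SEVERITY = 3  # severity assumed to run from 1 (minor) to 3 (critical)


def route_task(confidence: float, in_scope: bool, severity: int) -> Route:
    """Pick a predetermined path instead of letting the agent improvise."""
    if in_scope and confidence >= CONFIDENCE_FLOOR:
        return Route.AGENT
    if severity >= URGENT_SEVERITY:
        return Route.HUMAN_URGENT
    if not in_scope:
        return Route.HUMAN
    return Route.FALLBACK
```

&lt;p&gt;Because only the highest severity pages a person immediately, routine low-confidence cases do not drown out the failures that matter.&lt;/p&gt;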

&lt;h2&gt;
  
  
  3. Monitoring for Drift
&lt;/h2&gt;

&lt;p&gt;AI agents degrade quietly. Model updates, shifts in input data, changes to upstream APIs, seasonal variation in user behavior. Any of these can erode agent performance without triggering a single error.&lt;/p&gt;

&lt;p&gt;Drift monitoring tracks the gap between how your agent performed when you validated it and how it performs now. This includes statistical monitoring of output distributions, latency tracking across individual tool calls, and automated quality scoring against baseline benchmarks. In practice, effective drift detection requires capturing baseline metrics during your evaluation phase and then running the same scoring pipeline against production traffic on an ongoing basis. When scores diverge from your baseline by more than an acceptable margin, you have a concrete signal to investigate rather than a vague feeling that things seem off.&lt;/p&gt;
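
&lt;p&gt;A simplified version of that baseline comparison might look like the following. It compares only mean quality scores, which is a deliberate simplification of full distribution monitoring, and the scores and margin are made-up values.&lt;/p&gt;

```python
from statistics import mean


def drift_report(baseline: list[float], production: list[float], margin: float) -> dict:
    """Compare mean quality scores and flag divergence beyond the margin."""
    baseline_mean = mean(baseline)
    production_mean = mean(production)
    delta = production_mean - baseline_mean
    return {
        "baseline_mean": round(baseline_mean, 3),
        "production_mean": round(production_mean, 3),
        "delta": round(delta, 3),
        "drifted": abs(delta) > margin,
    }


# Hypothetical scores on a 0-1 scale, produced by the same scoring pipeline.
baseline_scores = [0.92, 0.90, 0.94, 0.91]
production_scores = [0.85, 0.83, 0.88, 0.84]

print(drift_report(baseline_scores, production_scores, margin=0.05))
```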

&lt;h2&gt;
  
  
  4. Human-in-the-Loop Checkpoints
&lt;/h2&gt;

&lt;p&gt;Full autonomy sounds efficient until you realize what it costs when the agent is wrong. Human-in-the-loop checkpoints create structured moments where a person reviews, approves, or redirects agent output before it reaches the end user or triggers a downstream action.&lt;/p&gt;

&lt;p&gt;The design challenge is placement. Too many checkpoints and you've built an expensive autocomplete system. Too few and you've handed off accountability to a system that can't actually hold it. The right approach maps checkpoints to consequence.&lt;/p&gt;

&lt;p&gt;Low-risk, reversible actions can run autonomously. High-stakes decisions, anything involving money, legal exposure, or customer-facing commitments, need a human gate. As agents take on more complex workflows, these checkpoints also become your training data pipeline. Every human correction is a signal about where the agent needs improvement, but only if you're logging it (which brings us to the next point).&lt;/p&gt;
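
&lt;p&gt;Mapping checkpoints to consequence can be as small as a single predicate. In this sketch the ProposedAction fields are hypothetical, and treating every irreversible action as gated is my assumption rather than a rule stated above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    reversible: bool
    involves_money: bool = False
    legal_exposure: bool = False
    customer_commitment: bool = False


def needs_human_gate(action: ProposedAction) -> bool:
    """High-stakes or irreversible actions wait for a person; the rest run."""
    high_stakes = (
        action.involves_money or action.legal_exposure or action.customer_commitment
    )
    return high_stakes or not action.reversible


# Hypothetical actions an agent might propose.
print(needs_human_gate(ProposedAction("tag_ticket", reversible=True)))
print(needs_human_gate(ProposedAction("issue_refund", reversible=False, involves_money=True)))
```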

&lt;h2&gt;
  
  
  5. Logging for Auditability
&lt;/h2&gt;

&lt;p&gt;When an agent makes a decision, you need to be able to reconstruct exactly how it got there. Full execution logging captures the chain of reasoning, tool invocations, retrieved context, intermediate outputs, and final actions across every run.&lt;/p&gt;

&lt;p&gt;This serves three purposes simultaneously:&lt;/p&gt;

&lt;p&gt;First, debugging. When something goes wrong, you need the trace, not a guess.&lt;/p&gt;

&lt;p&gt;Second, compliance. Regulated industries require demonstrable decision trails, and even unregulated ones are moving in that direction.&lt;/p&gt;

&lt;p&gt;Third, improvement. Logged executions become the dataset you use to identify failure patterns, tune prompts, and build better evaluation suites.&lt;/p&gt;

&lt;p&gt;The tooling for this has matured significantly. OpenTelemetry-based tracing, structured span capture, and production replay capabilities now exist across multiple frameworks. The infrastructure cost is low relative to the cost of operating an agent you cannot inspect.&lt;/p&gt;
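
&lt;p&gt;To make the idea concrete without tying it to a specific tracing library, here is a standard-library sketch that writes one structured record per step and then replays the run. The step names and payload fields are illustrative assumptions, and a production system would send these records to a trace store rather than an in-memory list.&lt;/p&gt;

```python
import json
import time
import uuid


def log_step(run_id: str, step: str, payload: dict, sink: list) -> dict:
    """Append one structured record for a single step of an agent run."""
    record = {
        "run_id": run_id,
        "step": step,  # e.g. "retrieval", "tool_call", "final_action"
        "timestamp": time.time(),
        "payload": payload,
    }
    sink.append(json.dumps(record))
    return record


# Hypothetical run.
trace: list[str] = []
run_id = str(uuid.uuid4())
log_step(run_id, "retrieval", {"documents": ["policy_v3"]}, trace)
log_step(run_id, "tool_call", {"tool": "lookup_order", "params": {"order_id": 1002}}, trace)
log_step(run_id, "final_action", {"reply": "Order 1002 ships tomorrow."}, trace)

# Reconstruct the run in order from the stored lines.
for line in trace:
    event = json.loads(line)
    print(event["step"], event["payload"])
```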

&lt;h2&gt;
  
  
  6. A Defined Handoff Protocol
&lt;/h2&gt;

&lt;p&gt;Agents rarely operate in isolation. They pass work to other agents, to human operators, to downstream systems, and occasionally back to the user. Every one of those transitions is a potential failure point.&lt;/p&gt;

&lt;p&gt;A handoff protocol specifies what information transfers with the task, what context the receiving party needs, what constitutes a successful handoff versus a dropped one, and who owns the outcome after the transition.&lt;/p&gt;

&lt;p&gt;This gets more complex in multi-agent systems where one agent's output becomes another agent's input. If the first agent summarizes a customer issue and strips out a critical detail before passing it along, the second agent makes a decision on incomplete information. Neither agent has failed individually, but the system has failed completely.&lt;/p&gt;

&lt;p&gt;Without this kind of structural clarity, you get the agent equivalent of a game of telephone. Context gets lost between steps, responsibilities blur, and when something fails mid-workflow, nobody can pinpoint where.&lt;/p&gt;
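
&lt;p&gt;A handoff contract can be sketched as a small data structure plus a completeness check. The field names and the set of required keys below are hypothetical. The point is only that the receiver can reject a handoff that arrives without the context it needs.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    task_id: str
    sender: str
    receiver: str
    owner_after: str  # who owns the outcome after the transition
    summary: str
    required_context: dict = field(default_factory=dict)


# Hypothetical contract: fields the receiver cannot work without.
REQUIRED_KEYS = {"customer_id", "issue_type", "original_message"}


def missing_context(handoff: Handoff) -> set:
    """Return required keys that are absent or empty; an empty set means accept."""
    return {k for k in REQUIRED_KEYS if not handoff.required_context.get(k)}


handoff = Handoff(
    task_id="t1",
    sender="triage_agent",
    receiver="billing_agent",
    owner_after="billing_agent",
    summary="Customer reports a double charge",
    required_context={"customer_id": "c42", "issue_type": "billing"},
)
print(missing_context(handoff))
```

&lt;p&gt;Requiring the original message alongside the summary is one guard against the stripped-detail failure described above.&lt;/p&gt;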

&lt;h2&gt;
  
  
  The Management Layer You Can't Skip
&lt;/h2&gt;

&lt;p&gt;These six elements share a common thread. They're all infrastructure that exists to manage the agent after it's built.&lt;/p&gt;

&lt;p&gt;The agent itself (the model, the prompts, the tool integrations) is maybe 40% of what a production deployment actually requires. The other 60% is the system that keeps the agent honest, visible, and recoverable when things go sideways.&lt;/p&gt;

&lt;p&gt;Organizations that treat agent deployment as a build-and-ship exercise will spend the next six months doing manual cleanup on failures they could have prevented. The ones that invest in this management layer first will find that their agents get better over time instead of quietly getting worse.&lt;/p&gt;

&lt;p&gt;The technology is mature enough. The question is whether your operational infrastructure is ready to match it.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>aiagents</category>
      <category>enterprisetechnology</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Your Product Doesn't Need GPT-5. And It’s Costing You More Than You Think.</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 12 May 2026 12:13:37 +0000</pubDate>
      <link>https://dev.to/talweezy/your-product-doesnt-need-gpt-5-and-its-costing-you-more-than-you-think-5eme</link>
      <guid>https://dev.to/talweezy/your-product-doesnt-need-gpt-5-and-its-costing-you-more-than-you-think-5eme</guid>
      <description>&lt;p&gt;How Fine-Tuned Small Models Outperform Frontier AI for Most Production Workloads&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwq7imruvduulzhu8p7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwq7imruvduulzhu8p7w.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Serving a 7B parameter model costs roughly $0.0004 per 1,000 tokens. A frontier model like GPT-5 charges up to $0.09 for the same volume. That's a spread of roughly 225x on per-token cost, and at production scale, it compounds into the kind of line item that makes CFOs start asking uncomfortable questions.&lt;/p&gt;

&lt;p&gt;Yet most enterprise AI strategies still start in the same place. Frontier model API, default configuration, build everything on top.&lt;/p&gt;

&lt;p&gt;I’ve heard the same reasoning for this decision countless times. The plan is to start here, and optimize later. But "optimize later" rarely happens. The API dependency becomes load-bearing, and switching costs quickly accumulate. More often than not, teams discover much too late that 70-80% of their inference calls are handling structured, repeatable tasks that never needed frontier-class reasoning in the first place. Meanwhile, a fine-tuned small model handles all of it at a fraction of the cost, often with better accuracy on the specific domain, and without the vendor dependency.&lt;/p&gt;

&lt;p&gt;The question worth asking before you architect anything isn't "which model is most powerful." It's whether the task even requires that power.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compounding Cost Problem
&lt;/h2&gt;

&lt;p&gt;The per-token price gap between frontier and small models tells only part of the story. The real damage happens at volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025" rel="noopener noreferrer"&gt;Gartner’s analysis&lt;/a&gt; found that agentic AI workflows consume 5 to 30 times more tokens per task than standard chatbot interactions. When your agents are running thousands of structured, repeatable tasks per day, each one burning frontier-priced tokens, monthly inference bills can scale from manageable to alarming before anyone notices. A system handling 50,000 daily agent tasks on frontier APIs accumulates costs that a finance team will eventually flag, and "but the model is really smart" isn't a satisfying answer when 80% of those tasks are pattern execution.&lt;/p&gt;

&lt;p&gt;API pricing has dropped significantly. Frontier-quality model costs fell roughly 80% between 2025 and early 2026. But cheaper tokens don't change the underlying architectural mistake. You're still paying for general-purpose reasoning capacity on tasks that need specialized precision. It's the equivalent of provisioning a 256-core cluster to run a cron job.&lt;/p&gt;
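
&lt;p&gt;To see how the gap compounds, the sketch below applies the per-token prices quoted in this article to the 50,000-tasks-per-day example. The tokens-per-task figure and the 30-day month are assumptions picked purely for illustration.&lt;/p&gt;

```python
# Per-1,000-token prices as quoted in this article.
SMALL_PRICE = 0.0004
FRONTIER_PRICE = 0.09

# Assumptions for illustration only.
TASKS_PER_DAY = 50_000
TOKENS_PER_TASK = 2_000
DAYS_PER_MONTH = 30


def monthly_cost(price_per_1k: float) -> float:
    """Monthly inference spend in dollars at the assumed volume."""
    tokens = TASKS_PER_DAY * TOKENS_PER_TASK * DAYS_PER_MONTH
    return tokens / 1_000 * price_per_1k


print(f"price ratio: {FRONTIER_PRICE / SMALL_PRICE:.0f}x")
print(f"frontier:    ${monthly_cost(FRONTIER_PRICE):,.0f} per month")
print(f"small model: ${monthly_cost(SMALL_PRICE):,.0f} per month")
```

&lt;p&gt;Under these assumptions the frontier bill comes to about $270,000 a month against about $1,200 for the small model. The absolute numbers move with the assumptions, but the ratio does not.&lt;/p&gt;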

&lt;h2&gt;
  
  
  Where Small Models Win (And Where They Don't)
&lt;/h2&gt;

&lt;p&gt;Small language models, typically under 10 billion parameters, have crossed a performance threshold that changes the production calculus. &lt;a href="https://arxiv.org/abs/2512.15943" rel="noopener noreferrer"&gt;Research from late 2025&lt;/a&gt; demonstrated that a fine-tuned 350M parameter model outperformed generalist frontier models on structured tool-calling and API orchestration tasks. A 3B parameter model trained on domain-specific data can match frontier accuracy on classification, extraction, and routing while delivering 150 to 300 tokens per second compared to the 50 to 100 range typical of large models.&lt;/p&gt;

&lt;p&gt;The production evidence is growing. &lt;a href="https://florinelchis.medium.com/how-companies-actually-use-small-language-models-what-287-case-studies-reveal-d9ea4b61e530" rel="noopener noreferrer"&gt;An analysis of 287 documented SLM deployments&lt;/a&gt; found companies like Checkr, NVIDIA, Bayer, and DoorDash replacing frontier models with 7B to 14B parameter alternatives at 5 to 150 times lower cost, with equal or better performance on their specific tasks.&lt;/p&gt;

&lt;p&gt;But small models have real limits. They fall apart on tasks requiring deep reasoning across long, unstructured documents. Complex multi-step inference, novel problem synthesis, and ambiguous decision-making still belong to frontier architectures. Pretending otherwise leads to brittle systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Decision Framework for Model Selection
&lt;/h2&gt;

&lt;p&gt;The architectural question isn't "which model is best." It's what the specific task actually requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route to a small model when the task is&lt;/strong&gt; structured, repeatable, and well-defined. Classification, entity extraction, document routing, templated generation, API orchestration, and status parsing all fit. If you can describe the task with clear input-output examples and the domain is bounded, a fine-tuned small model will likely match frontier performance at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route to a frontier model when the task demands&lt;/strong&gt; open-ended reasoning, novel problem-solving, or synthesis across large unstructured contexts. Strategic analysis, complex code generation, multi-document research, and ambiguous judgment calls still benefit from frontier-scale reasoning. These tasks involve genuine inference, not pattern execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hybrid architecture is where most production systems should land&lt;/strong&gt;. Use a frontier model as the orchestration layer for planning, decision routing, and edge cases. Deploy fine-tuned small models as the execution layer for the high-volume structured tasks that account for the bulk of actual inference calls. One documented deployment using this approach, a frontier model as "master controller" with specialized small models handling task execution, showed a 90% reduction in monthly API costs and a 70% improvement in response speed.&lt;/p&gt;
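
&lt;p&gt;The routing rule above can be sketched as a lookup. The task categories come from the framework itself, while the tier labels and the choice to send unrecognized tasks to the frontier model are assumptions for the example.&lt;/p&gt;

```python
# Task categories named in the framework above; tier labels are hypothetical.
SMALL_MODEL_TASKS = {
    "classification",
    "entity_extraction",
    "document_routing",
    "templated_generation",
    "api_orchestration",
    "status_parsing",
}


def select_model(task_type: str) -> str:
    """Send bounded, repeatable work to the small model; everything else goes up."""
    if task_type in SMALL_MODEL_TASKS:
        return "small-finetuned"
    # Open-ended tasks and anything unrecognized are treated as edge cases.
    return "frontier"


for task in ("classification", "strategic_analysis", "something_new"):
    print(task, select_model(task))
```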

&lt;h2&gt;
  
  
  The Vendor Lock-In Problem
&lt;/h2&gt;

&lt;p&gt;There's a second cost that doesn't show up on the monthly invoice. Every API call to a frontier model is a dependency you don't control. Pricing changes, rate limits, model deprecations, and terms-of-service updates all happen on someone else's timeline.&lt;/p&gt;

&lt;p&gt;Fine-tuned small models running on your own infrastructure eliminate that variable. You control the model weights, the serving stack, the update cycle, and the data pipeline. For regulated industries where sensitive data can't touch third-party APIs, self-hosted small models aren't just a cost optimization. They're the compliance baseline.&lt;/p&gt;

&lt;p&gt;The breakeven point for self-hosting versus API consumption is lower than most teams assume. Analysis across production deployments puts the threshold around 8,000 conversations per day, or roughly $500 per month in API spend. Above that line, owning your inference infrastructure starts paying for itself.&lt;/p&gt;
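
&lt;p&gt;The breakeven logic itself is a single division. This sketch assumes a flat monthly serving bill and a fixed API cost per conversation, and it ignores the marginal cost of self-hosting. Both input figures are hypothetical, chosen only to land near the threshold mentioned above.&lt;/p&gt;

```python
def breakeven_conversations_per_day(
    hosting_cost_per_month: float,
    api_cost_per_conversation: float,
    days_per_month: int = 30,
) -> float:
    """Daily volume at which API spend equals a flat self-hosting bill."""
    return hosting_cost_per_month / api_cost_per_conversation / days_per_month


# Hypothetical inputs: a flat $500 monthly serving bill and an API cost of
# roughly a fifth of a cent per conversation.
print(round(breakeven_conversations_per_day(500.0, 0.0021)))
```

&lt;p&gt;With these inputs the breakeven lands just under 8,000 conversations per day. Plugging in your own hosting and API costs is the only way to find your actual line.&lt;/p&gt;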

&lt;h2&gt;
  
  
  Right-Sizing as an Engineering Discipline
&lt;/h2&gt;

&lt;p&gt;Treating model selection with the same rigor you'd apply to database provisioning or infrastructure architecture is the move that separates production-grade AI systems from expensive experiments.&lt;/p&gt;

&lt;p&gt;A frontier model is a tool. A small model is a tool. The discipline is knowing which tool fits which job, and building the architectural flexibility to use both without locking yourself into either. For most production workloads running structured, repeatable agent tasks at scale, the 7B parameter model on your own infrastructure will outperform the frontier API call to a model that's orders of magnitude larger than the task requires.&lt;/p&gt;

&lt;p&gt;The smartest infrastructure decision you make this year might be choosing the smaller model, most of the time.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>enterpriseai</category>
      <category>smalllanguagemodels</category>
      <category>aiarchitecture</category>
    </item>
    <item>
      <title>$4M Revenue Per Employee Is the New Benchmark. Most Companies Can’t Get There.</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 05 May 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/talweezy/4m-revenue-per-employee-is-the-new-benchmark-most-companies-cant-get-there-2301</link>
      <guid>https://dev.to/talweezy/4m-revenue-per-employee-is-the-new-benchmark-most-companies-cant-get-there-2301</guid>
      <description>&lt;p&gt;What AI-Native Operations Actually Look Like and Why Retrofitting Falls Short&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeuqf57t5leudoph3r2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeuqf57t5leudoph3r2y.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor crossed $2 billion in annualized revenue in early 2026. The team that built it? Roughly 300 people. Gamma, the AI presentation platform, hit $100 million ARR with about 50 employees and has been profitable for over two years. Midjourney generates hundreds of millions in annual revenue with a team you could fit in a mid-sized conference room. Lovable reached $100M ARR in eight months with 45 people.&lt;/p&gt;

&lt;p&gt;Meanwhile, the median private SaaS company generates about $130,000 per employee. Five years ago, $100K was considered a reasonable benchmark. At scale, the best traditional SaaS companies were proud to reach $300K.&lt;/p&gt;

&lt;p&gt;The gap between these numbers tells you something specific about how these companies are built. The four companies I opened with have something in common beyond the headcount math.&lt;/p&gt;

&lt;p&gt;From the first hire, they were built around AI as a core operator, with every workflow, every role, and every system designed on that assumption. The label for this is AI-native. &lt;/p&gt;

&lt;p&gt;And for founders and executives running $5-30M ARR companies right now, the gap between AI-native operations and everyone else sets a competitive deadline, and the window to respond is already shrinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "AI-Native" Actually Means at the Operational Level
&lt;/h2&gt;

&lt;p&gt;The phrase gets thrown around loosely, so let me be specific. An AI-native company designs its workflows from scratch around what AI can do. Every process, every role, every system assumes AI as a core participant from day one.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from what most companies do, which is take their existing workflows and add AI tools to them. The distinction matters because the architecture of your operations determines the ceiling of your efficiency.&lt;/p&gt;

&lt;p&gt;Consider how a traditional SaaS company handles content. A marketing team writes briefs. Writers produce drafts. Editors review. Designers format. A project manager coordinates the whole thing. Five or six people touch every piece of content before it ships.&lt;/p&gt;

&lt;p&gt;An AI-native company designs that workflow differently from the start. AI generates first drafts from structured inputs. A single editor shapes the output. Distribution happens programmatically. The entire pipeline might involve one or two people instead of six, and the throughput is three to five times higher.&lt;/p&gt;

&lt;p&gt;Multiply that across customer support, engineering, sales enablement, onboarding, and internal operations. The compounding effect explains how Cursor runs at $6 million per employee while companies with similar revenue require ten times the headcount.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Retrofitting Existing Operations Fails
&lt;/h2&gt;

&lt;p&gt;The instinct most established companies have is to layer AI tools onto what already exists. Buy a few licenses, integrate a copilot, maybe automate some ticket routing. This feels productive. It rarely moves the needle in a meaningful way.&lt;/p&gt;

&lt;p&gt;The problem is structural. Your existing workflows were designed around human throughput. Your org chart reflects that design. Your hiring plans, your meeting cadences, your approval chains, your reporting structures all assume that humans do the work and other humans coordinate that work.&lt;/p&gt;

&lt;p&gt;Bolting AI onto this foundation creates an awkward hybrid. AI generates a draft, but then it still goes through the same five-person review chain that existed before. AI triages support tickets, but the staffing model hasn't changed to reflect the reduced load. The tool saves twenty minutes per task, but the organizational overhead around that task stays identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Realistic Options for Established Companies
&lt;/h2&gt;

&lt;p&gt;If you're running a $5-30M ARR company, you probably aren't going to tear everything down and rebuild from scratch. That's fine. But pretending the efficiency gap will close on its own is a mistake with a deadline.&lt;/p&gt;

&lt;p&gt;Here's what actually works for companies that aren't starting from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with one workflow, redesigned from zero&lt;/strong&gt;. Pick your highest-volume, most repeatable process and redesign it from scratch with AI as the primary operator. Don't optimize the existing process. Design the new one as if the old one didn't exist. Customer onboarding, content production, and first-line support are common starting points because they're high-volume and have clear inputs and outputs. The goal is to prove to your own organization what redesigned throughput looks like before you try to scale the approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hire for the new architecture.&lt;/strong&gt; The next time you open a role, ask whether the function that role serves could be restructured around AI instead. This doesn't mean replacing people. It means designing the role so one person with AI leverage can do what previously required three. The companies generating $2M+ per employee didn't get there by giving existing employees AI tools. They built teams where every person operates as a force multiplier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure the right ratio.&lt;/strong&gt; Track revenue per employee quarterly. If you're below $150K and growing, you're adding headcount faster than you're adding efficiency. That was fine in 2020. Today, it means you're falling behind the curve that AI-native competitors are setting. For context, top-quartile SaaS companies now generate $350K-$700K per employee, and the AI-native outliers are running at five to ten times that range.&lt;/p&gt;
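
&lt;p&gt;Tracking this takes very little tooling. Here is a minimal sketch, assuming you already have annualized revenue and headcount per quarter. The class and function names, the sample figures, and the use of $150K as a configurable floor are all illustrative, not a prescribed method.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    label: str
    revenue_annualized: float  # annualized revenue in USD
    headcount: int


def revenue_per_employee(q):
    """Annualized revenue divided by headcount for one quarter."""
    return q.revenue_annualized / q.headcount


def efficiency_report(quarters, floor=150_000):
    """Return (label, revenue per employee, meets_floor) for each quarter."""
    report = []
    for q in quarters:
        rpe = revenue_per_employee(q)
        report.append((q.label, round(rpe), rpe >= floor))
    return report


# Hypothetical figures for illustration only.
history = [
    Quarter("Q1", 6_000_000, 45),
    Quarter("Q2", 6_600_000, 52),
]
for label, rpe, meets_floor in efficiency_report(history):
    print(label, rpe, "ok" if meets_floor else "below floor")
```

&lt;p&gt;With these made-up numbers, revenue grows 10% quarter over quarter while revenue per employee falls from about $133K to about $127K. That is the pattern to watch for: growth driven by headcount instead of efficiency.&lt;/p&gt;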

&lt;p&gt;&lt;strong&gt;Accept that partial adoption produces partial results.&lt;/strong&gt; A company that redesigns 30% of its operations around AI-native principles will capture meaningful efficiency gains. A company that gives everyone a ChatGPT license and calls it transformation will not. Architectural commitment drives the outcome here. Tool selection alone never has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequence your investment around leverage.&lt;/strong&gt; Most companies adopt AI where it's easiest to implement. The better approach is to start where the ratio of human labor to repeatable output is highest. That's usually operations and fulfillment, where the actual throughput gains live.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Clock Is Running
&lt;/h2&gt;

&lt;p&gt;The revenue-per-employee gap between AI-native companies and everyone else keeps widening. Gartner projects a wave of companies generating $2M+ per employee by 2030, and the leaders are already well past that mark.&lt;/p&gt;

&lt;p&gt;For operators and founders at the $1-5M stage, this isn't a future problem. Your next funding round, your next hire, your next operational decision is happening in a market where competitors might need one-fifth the headcount to deliver the same output.&lt;/p&gt;

&lt;p&gt;The companies that approach this as an architectural challenge will adapt. The ones running a tool-buying exercise will learn the hard way that efficiency at this scale comes from how you build and how you design the work itself, not from which tools you buy.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startupstrategy</category>
      <category>operationalefficiency</category>
      <category>saas</category>
    </item>
    <item>
      <title>The Job Title That Didn’t Exist Last Year</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:07:28 +0000</pubDate>
      <link>https://dev.to/talweezy/the-job-title-that-didnt-exist-last-year-4bb7</link>
      <guid>https://dev.to/talweezy/the-job-title-that-didnt-exist-last-year-4bb7</guid>
      <description>&lt;p&gt;Why Enterprise AI Needs a Translation Layer Between Data and Decisions&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlsnulclht6zuudnjutl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlsnulclht6zuudnjutl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gartner projects that &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;over 40% of agentic AI projects will be canceled by the end of 2027&lt;/a&gt;. Reading that, a reasonable person might conclude that there is an inherent issue with the technology.&lt;/p&gt;

&lt;p&gt;However, I know from my own experience building agents that when done correctly, they deliver.&lt;/p&gt;

&lt;p&gt;The failure pattern we keep hearing about has nothing to do with model quality or infrastructure maturity. It's that organizations have no single agreed-upon definition for their own data.&lt;/p&gt;

&lt;p&gt;Different departments define the same terms differently, and agents consume whatever definition they hit first at 10x the speed any human team would. Humans reconciled those gaps in quarterly meetings and footnotes. Agents just produce confident, expensive wrong answers.&lt;/p&gt;

&lt;p&gt;The real fix requires a role that most companies haven't named yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  When "Revenue" Means Different Things
&lt;/h2&gt;

&lt;p&gt;Humans have always tolerated semantic drift inside organizations. If marketing and finance calculate revenue differently, they reconcile the gap in quarterly meetings or bury it in footnotes. The cost of ambiguity stayed low because humans processed data slowly enough to catch the mismatches.&lt;/p&gt;

&lt;p&gt;AI agents don't reconcile by themselves. They ingest whatever schema they can access, apply whatever definition they encounter first, and produce output that sounds authoritative regardless of whether the underlying logic holds.&lt;/p&gt;

&lt;p&gt;The confidence of the output actually makes the problem worse, because stakeholders trust polished summaries more than they trust raw numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role Sitting Between Data and Meaning
&lt;/h2&gt;

&lt;p&gt;The people solving this problem function as a translation layer between raw enterprise data and business meaning. They define what terms actually mean across the organization, map those definitions into the semantic structures that AI systems rely on, and maintain the consistency of that layer as business logic evolves.&lt;/p&gt;

&lt;p&gt;The skillset is specific and rare. You need someone who understands data modeling well enough to audit pipeline logic, but who also understands the business well enough to know that "active customer" means something different to the retention team than it does to the billing team. You need someone who can sit in a room with a CFO and a data engineer simultaneously and translate in both directions.&lt;/p&gt;

&lt;p&gt;Most companies don't have this person because the job didn't exist until AI agents started consuming enterprise data fast enough to make the gaps visible.&lt;/p&gt;

&lt;p&gt;Some organizations are calling this a semantic architect. Others are folding it into "context engineering," which has emerged as a recognized discipline for designing the information environment that AI models operate within.&lt;/p&gt;

&lt;p&gt;Cognizant's CIO, Neal Ramasamy, recently described context engineering as the factor that separates enterprise AI experimentation from sustainable scale, noting that most of the critical context in organizations still lives in people's heads rather than in systems where agents can access it.&lt;/p&gt;

&lt;p&gt;Whatever you call the role, the function is the same: someone owns the relationship between what the data says and what the business means.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Role Could Look Like
&lt;/h2&gt;

&lt;p&gt;Here's how I'd scope this role if I were hiring for it today.&lt;/p&gt;

&lt;p&gt;This person sits between the data engineering team and business leadership. They own the company's business glossary, the single source of truth that defines what every key term means across the organization.&lt;/p&gt;

&lt;p&gt;Before any new data source enters the AI pipeline, they confirm that field names map to actual business logic. When two departments define "customer" differently, they make the call on which definition the system uses. And they have enough authority to make that call stick.&lt;/p&gt;

&lt;p&gt;The technical work is straightforward. The hard part is the authority. A semantic layer without organizational backing is just a wiki nobody reads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.marketsandmarkets.com/Market-Reports/semantic-web-market-15328110.html" rel="noopener noreferrer"&gt;The semantic layer market is projected to grow from $2.7 billion to $7.7 billion by 2030&lt;/a&gt; precisely because companies are realizing that the technical infrastructure only works when someone with real authority governs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Org Chart Hasn't Caught Up
&lt;/h2&gt;

&lt;p&gt;Companies are spending millions on model selection, compute infrastructure, and agent orchestration while leaving the semantic layer as an afterthought managed by whichever data engineer happens to notice the inconsistency. It's the organizational equivalent of building a Formula 1 car and forgetting to hire someone who reads the track map.&lt;/p&gt;

&lt;p&gt;The companies getting reliable output from their AI systems in 2026 will be the ones that treated this translation function as a first-class strategic hire, reporting to the CTO or CDO with real authority over definitions. The ones still debugging confident-sounding garbage will be the ones who assumed the data would speak for itself.&lt;/p&gt;

&lt;p&gt;It won't. It never did. Humans just papered over the gaps. AI agents don't have that option.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts. &lt;br&gt;
→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;br&gt;
→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI. &lt;/p&gt;

</description>
      <category>enterpriseai</category>
      <category>semanticlayer</category>
      <category>datastrategy</category>
      <category>ctoinsights</category>
    </item>
    <item>
      <title>The 8-Hour Agent Doesn’t Fit Into Your Business Model</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:26:25 +0000</pubDate>
      <link>https://dev.to/talweezy/the-8-hour-agent-doesnt-fit-into-your-business-model-3kj1</link>
      <guid>https://dev.to/talweezy/the-8-hour-agent-doesnt-fit-into-your-business-model-3kj1</guid>
      <description>&lt;p&gt;Why AI Workstream Duration Changes Everything About Hiring, Teams, and Accountability&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2ip3dkqhuj0pakykvtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2ip3dkqhuj0pakykvtq.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A year ago, agents could reliably handle about an hour of autonomous work. Tasks like summarizing a document or running a data pull. Useful, but contained. You could bolt those tasks onto existing workflows without changing anything structural.&lt;/p&gt;

&lt;p&gt;That window is closing fast.&lt;/p&gt;

&lt;p&gt;METR, the AI evaluation research organization, published findings last year that reframed how I think about planning horizons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The length of tasks that frontier AI agents can complete with 50% reliability has been doubling approximately every seven months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the 2024-2025 period, the pace accelerated to roughly every four months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extrapolating that trend, agents that managed one-hour workflows in early 2025 would be handling full eight-hour workstreams by late 2026.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
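
&lt;p&gt;The arithmetic behind that eight-hour extrapolation is short: going from one hour to eight hours is three doublings. Here is a rough sketch, assuming clean exponential growth, which real progress will not follow exactly. The function name is mine, not METR's.&lt;/p&gt;

```python
import math


def months_to_reach(start_hours, target_hours, doubling_months):
    """Months for achievable task length to grow from start to target."""
    doublings = math.log2(target_hours / start_hours)
    return doublings * doubling_months


# Compare the long-run pace (7 months) with the recent pace (4 months).
for pace in (7, 4):
    print(pace, "month doubling:", months_to_reach(1, 8, pace), "months")
```

&lt;p&gt;At the seven-month pace, three doublings take about 21 months, which from early 2025 lands in late 2026. At the four-month pace the same jump takes about 12 months, so the late-2026 estimate is the conservative one.&lt;/p&gt;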

&lt;p&gt;An eight-hour workstream is a fundamentally different unit of work than a one-hour task. And most companies have no operating model for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Staffing Problem Nobody's Solving Yet
&lt;/h2&gt;

&lt;p&gt;When an agent handles a one-hour task, it fits neatly inside your existing org chart. But when an agent handles an eight-hour workflow, you've crossed into project-level work.&lt;/p&gt;

&lt;p&gt;This raises questions your org chart wasn't designed to answer. Who scopes the work? Who reviews quality at intermediate checkpoints, not just at the end? If the agent makes a judgment call four hours in that sends the remaining four hours in the wrong direction, whose problem is that?&lt;/p&gt;

&lt;p&gt;Most executives are still thinking about AI as a task-level tool, something that makes individual contributors faster. The planning shift required here goes deeper. If an agent can own a full workday of output, you're making staffing decisions, not automation decisions. And staffing decisions cascade. They affect headcount planning, team composition, project timelines, and how you think about accountability for deliverables.&lt;/p&gt;

&lt;p&gt;Consider a concrete example. A three-person analytics team currently handles weekly reporting, ad hoc data pulls, and quarterly business reviews. At the one-hour level, agents might handle the data pulls. The team stays intact, just faster. At the eight-hour level, an agent can own the entire weekly reporting cycle, from data extraction through visualization to narrative summary. Now you're looking at a different team shape entirely. Maybe two analysts and one workflow architect who designs and monitors the agent pipelines. Same output, different organizational logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tomtunguz.com/agent-asana-inflection/" rel="noopener noreferrer"&gt;Tomasz Tunguz&lt;/a&gt; has been writing about this transition from the venture side. He's running 31 agent tasks a day through his own workflows and watching software engineers manage 15 parallel AI workstreams through GitHub. The throughput numbers are real. But throughput without organizational redesign just creates a different kind of mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Breaks When You Map Agent Capabilities Onto Human Structures
&lt;/h2&gt;

&lt;p&gt;Here's where most companies get stuck. They take their existing team structure, identify tasks within that structure, and hand those tasks to agents. That works fine at the one-hour level. At the eight-hour level, you start hitting structural mismatches.&lt;/p&gt;

&lt;p&gt;Human team structures assume certain things. People accumulate context over days and weeks. They build judgment through repeated exposure to similar decisions. They escalate ambiguity upward. But agents don't operate on any of those assumptions. They start fresh each time (unless you architect context persistence). And they'll confidently proceed through a six-hour workflow on a flawed assumption made in hour one.&lt;/p&gt;

&lt;p&gt;That's a critical insight for anyone planning around agent-length workflows. The longer the workflow, the more you need architectural guardrails, not because the agent is incompetent, but because compounding errors over eight hours of unsupervised work can waste the entire output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing Work Around Agent-Length Workflows
&lt;/h2&gt;

&lt;p&gt;So what actually changes in practice? Three things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, decomposition becomes an engineering discipline.&lt;/strong&gt; When you're handing off an eight-hour workstream, the quality of your work breakdown determines the quality of the output. Vague briefs that a senior employee could interpret and correct on the fly become expensive failures when an agent executes them literally for a full workday. The skill shifts from "manage the person doing the work" to "architect the specification precisely enough that autonomous execution succeeds."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, review cadence matters more than review depth.&lt;/strong&gt; A single end-of-day review of eight hours of agent work is a recipe for rework. The Deloitte research on agentic AI adoption found that organizations succeeding with agent workflows redesigned their review processes around intermediate checkpoints, not final deliverable review. The parallel in software engineering is obvious. You don't wait for the entire codebase to be written before doing a code review. You review at the pull request level. Agent workflows need the same kind of incremental quality gates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, accountability has to be redesigned, not just reassigned.&lt;/strong&gt; When a human employee produces bad work, the feedback loop is straightforward. When an agent produces bad work after eight hours, the accountability question splits in several directions. Was the specification wrong? Was the workflow architecture missing a checkpoint? Did the person who scoped the work understand what the agent could and couldn't handle? These are systems questions, not performance questions. And they require a different management muscle than most organizations have built.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Planning Horizon Question
&lt;/h2&gt;

&lt;p&gt;Companies that wait until agents can reliably own full workdays before restructuring will be rebuilding their operating models under time pressure. Companies that start now, rethinking work decomposition, review cadences, and accountability frameworks, will have the organizational muscle in place when the capability arrives.&lt;/p&gt;

&lt;p&gt;The point here goes beyond headcount replacement. The unit of work you're managing is about to change scale. A hiring plan built around task-level automation looks very different from one built around project-level agent staffing. The team structure that works when agents handle one-hour tasks won't hold when they handle eight.&lt;/p&gt;

&lt;p&gt;The businesses that get this right won't be the ones with the best AI models. They'll be the ones that redesigned their operations to match what agents can actually own.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts.&lt;br&gt;
→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;br&gt;
→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>engineeringleadership</category>
      <category>businessoperations</category>
    </item>
    <item>
      <title>4 Questions to Redesign Your Org for AI Agents</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:01:04 +0000</pubDate>
      <link>https://dev.to/talweezy/4-questions-to-redesign-your-org-for-ai-agents-10d2</link>
      <guid>https://dev.to/talweezy/4-questions-to-redesign-your-org-for-ai-agents-10d2</guid>
      <description>&lt;p&gt;What High-Performing AI Companies Have Already Figured Out &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd3jsyzpcewbr3tig7k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd3jsyzpcewbr3tig7k2.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every workflow has invisible seams, steps that only function because a human with ten years of context fills the gaps.&lt;/p&gt;

&lt;p&gt;Most companies don't notice these gaps because the process works well enough, and nobody's job is documented step by step (an unreasonable expectation, of course). What usually happens is that people route around the broken handoff, apply judgment where the documentation runs out, and quietly absorb complexity that was never formally accounted for.&lt;/p&gt;

&lt;p&gt;Humans filling those gaps works fine when humans run the workflow. But as AI agents take on more of the work, each gap becomes a place where the agent fails and never recovers.&lt;/p&gt;

&lt;p&gt;Drop an agent into a workflow built on informal human compensation, and the agent will execute the process exactly as written. Which means the real question is whether the workflow itself was ever designed to run without a human quietly holding it together.&lt;/p&gt;

&lt;p&gt;For most companies, the answer is no. And that means the work needs to start with workflow redesign.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pilots Succeed and Scaling Breaks
&lt;/h2&gt;

&lt;p&gt;Pilots work because a small team compensates for every gap the agent can't handle. Scaling through agents and technology removes that team. What's left is a workflow designed for humans, now being executed by software with zero tolerance for ambiguity.&lt;/p&gt;

&lt;p&gt;Agents don't adapt to broken handoffs. They don't infer ownership when it's unclear. All they do is follow the process as defined. If the process is being held together by informal knowledge and human workarounds, the agent will expose every seam.&lt;/p&gt;

&lt;p&gt;About 90% of the function-specific AI use cases that hold real transformative potential are still stuck in pilot, according to McKinsey. The problem is process and workflows, not technology. High-performing AI companies are roughly three times more likely to redesign workflows from scratch rather than layer agents onto what already exists. The redesign is where the real value lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Workflow Redesign Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Redesigning a workflow for agents means answering four questions at every stage of the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 1: Which steps can an agent fully own?
&lt;/h2&gt;

&lt;p&gt;These are tasks with clear inputs, defined outputs, and minimal need for contextual judgment. Data extraction. Standardized formatting. Pulling records from structured sources. If the step can be described as a contract (this input produces this output, within these constraints), an agent can own it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 2: Which steps require a human decision point?
&lt;/h2&gt;

&lt;p&gt;Anywhere the process involves evaluating trade-offs, exercising risk tolerance, or making a call that depends on relationships or institutional context. These steps don't disappear when agents arrive. They become more visible, because the agent will stop and wait rather than guess.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 3: Where does the agent hand back?
&lt;/h2&gt;

&lt;p&gt;The handoff points matter more than most teams realize. A poorly defined handoff creates the same ambiguity problem that broke the original workflow. Every transition between agent and human needs an explicit output contract. The agent delivers a specific artifact, in a specific format, with a clear expectation for what the human does next. Vague handoffs like "the agent prepares a draft for review" just move the ambiguity to a different part of the chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 4: What does the output contract look like at each stage?
&lt;/h2&gt;

&lt;p&gt;This is where most redesigns fail quietly. Teams define what the agent does but skip defining what "done" looks like at each step. Without an output contract, downstream steps inherit uncertainty, and the compounding effect makes the whole workflow fragile.&lt;/p&gt;
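
&lt;p&gt;To make the idea concrete, here is one way an output contract could be written down and checked at a handoff. This is a sketch under my own assumptions: the class names, the field names, and the invoice example are hypothetical, and a real contract would likely also cover formats and value ranges.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class OutputContract:
    """What "done" means for one workflow stage (hypothetical shape)."""
    stage: str
    required_fields: list
    next_owner: str  # who acts on the artifact next: a human role or an agent step


@dataclass
class StageResult:
    stage: str
    artifact: dict = field(default_factory=dict)


def missing_fields(contract, result):
    """Return required fields the artifact lacks or leaves empty."""
    return [
        name for name in contract.required_fields
        if result.artifact.get(name) in (None, "", [])
    ]


def hand_off(contract, result):
    """Deliver the artifact only when the contract is met; otherwise block."""
    gaps = missing_fields(contract, result)
    if gaps:
        return {"status": "blocked", "missing": gaps}
    return {"status": "delivered", "to": contract.next_owner}


# Hypothetical stage: an agent extracts invoice data for a human reviewer.
contract = OutputContract(
    stage="invoice_extraction",
    required_fields=["vendor", "amount", "due_date"],
    next_owner="accounts payable reviewer",
)
draft = StageResult("invoice_extraction", {"vendor": "Acme", "amount": 1200.0})
print(hand_off(contract, draft))
```

&lt;p&gt;The draft above is blocked because &lt;code&gt;due_date&lt;/code&gt; is missing. The point of the design is that an incomplete artifact stops at the stage that produced it, so the downstream step never inherits the uncertainty.&lt;/p&gt;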

&lt;h2&gt;
  
  
  This Is an Org Design Decision
&lt;/h2&gt;

&lt;p&gt;Most conversations about AI agents stay in the technology lane. Which model, which framework, which vendor.&lt;/p&gt;

&lt;p&gt;But deploying an agent into an existing workflow is an organizational design decision. You're changing who does what, where decisions get made, and what information flows where. That makes it a structural change to how your operation runs, and it deserves the same rigor you'd apply to any reorg.&lt;/p&gt;

&lt;p&gt;Skipping the redesign means the agent will faithfully execute a process that was already broken. It will do it faster, at scale, and with none of the informal corrections that made it barely work before. Every workaround your team normalized over the years becomes a failure point. Every undocumented decision becomes a gap in the chain.&lt;/p&gt;

&lt;p&gt;The companies pulling real value from agents share one thing in common. They were willing to look at a workflow that "works fine" and admit it only works because humans have been compensating for design flaws the org stopped noticing years ago.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiagents</category>
      <category>engineeringleadership</category>
      <category>startupstrategy</category>
    </item>
    <item>
      <title>Why Engineering-Led AI and Agent Initiatives Collapse in Production</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:40:35 +0000</pubDate>
      <link>https://dev.to/talweezy/why-engineering-led-ai-and-agent-initiatives-collapse-in-production-bbi</link>
      <guid>https://dev.to/talweezy/why-engineering-led-ai-and-agent-initiatives-collapse-in-production-bbi</guid>
      <description>&lt;p&gt;The staffing and governance gaps that turn working demos into unmaintainable systems&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0byznr1n7u3ydva25h6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0byznr1n7u3ydva25h6h.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your engineering team just showed off a new AI feature, and everyone left the room feeling good about the future of the initiative.&lt;/p&gt;

&lt;p&gt;But fast forward three months and the system is crashing twice a week. The team is spending weeks trying to reproduce bugs that only appear in production.&lt;/p&gt;

&lt;p&gt;In my time as a fractional CTO serving AI-first organizations, I’ve noticed that many companies structure AI projects the same way they structure any other software build. Leadership sets a roadmap, hands it to engineering, and expects execution to follow the usual patterns.&lt;/p&gt;

&lt;p&gt;However, the underlying assumption here is that building intelligent systems follows the same rules as building deterministic ones. This assumption kills most AI initiatives within six months of launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Talent Gap Shows Up Too Late
&lt;/h2&gt;

&lt;p&gt;Machine learning systems break three key assumptions that traditional software engineering relies on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Predictable behavior&lt;br&gt;
— A model that returns one answer today might return a different answer tomorrow given identical input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testable edge cases&lt;br&gt;
— Edge cases don’t come from a finite list of scenarios you can test against. They emerge from novel combinations of features your training data never represented.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debuggable logic&lt;br&gt;
— When something fails, you can’t just step through the code to find the bug because the decision logic was learned through statistical optimization, not explicitly programmed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your engineering team wasn’t hired to handle probabilistic systems. They won’t naturally catch biased training data, misleading accuracy metrics, or model architectures that can’t explain their predictions. That requires ML expertise.&lt;/p&gt;

&lt;p&gt;These aren’t skills you can pick up by reading documentation. They come from building and breaking enough ML systems to recognize patterns that lead to failure.&lt;/p&gt;

&lt;p&gt;All too often, teams don’t realize they need these skills until it’s too late. By that time, you’re hiring someone to audit months of work and explain which architectural decisions need to be unwound.&lt;/p&gt;

&lt;p&gt;Senior ML engineers know which approaches create technical debt you can’t maintain, which data quality problems cause drift, and which evaluation strategies mislead you during development. They catch these issues before roadmaps lock and budgets get allocated, not after engineering has already committed to the wrong direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demos That Look Great Until Production
&lt;/h2&gt;

&lt;p&gt;Demos operate in carefully controlled environments. The team selects clean input data, constrains the problem space to tested scenarios, and tunes prompts until the output looks impressive.&lt;/p&gt;

&lt;p&gt;Under these conditions, AI and agentic systems seem remarkably capable.&lt;/p&gt;

&lt;p&gt;Production removes every safety rail. Real users submit malformed inputs and unexpected data formats. Your data pipelines fail intermittently for reasons that don’t show up in logs. Third-party APIs change their response formats without warning. Models encounter distribution shifts (patterns in the data that differ fundamentally from training data) and produce outputs ranging from subtly wrong to completely nonsensical.&lt;/p&gt;

&lt;p&gt;Faced with these issues, an inexperienced engineering team will add retry logic, improve logging, and write better error handling. These help at the margins, but won’t fix what the team doesn’t understand.&lt;/p&gt;

&lt;p&gt;Without instrumentation built specifically for model behavior, you’re stuck just treating symptoms. The system logs show normal operation. The model is still running. But somewhere between input and output, quality degraded in ways you never instrumented for.&lt;/p&gt;

&lt;p&gt;This is where the lack of ML expertise during architecture becomes expensive. ML engineers build observability into the system from the start because they know models behave unpredictably in production. They instrument confidence thresholds, track prediction distributions, monitor for data drift, and create alerts when model behavior deviates from expected patterns.&lt;/p&gt;

&lt;p&gt;Without that foundation, you’re trying to add monitoring for problems you don’t fully understand while simultaneously keeping a broken system running.&lt;/p&gt;
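
&lt;p&gt;As a rough illustration of what tracking prediction distributions can look like, here is a minimal Python sketch that compares recent predicted labels against a baseline window using the Population Stability Index. The function names, the add-one smoothing, and the 0.2 alert threshold are assumptions chosen for this example, not values from any particular monitoring stack.&lt;/p&gt;

```python
import math
from collections import Counter


def class_distribution(labels, classes):
    """Return the share of predictions per class, with add-one smoothing to avoid zeros."""
    counts = Counter(labels)
    total = len(labels) + len(classes)
    return {c: (counts[c] + 1) / total for c in classes}


def population_stability_index(baseline, recent, classes):
    """PSI between two sets of predicted labels; larger values mean a bigger shift."""
    p = class_distribution(baseline, classes)
    q = class_distribution(recent, classes)
    return sum((q[c] - p[c]) * math.log(q[c] / p[c]) for c in classes)


def drift_alert(baseline, recent, classes, threshold=0.2):
    """Flag drift when PSI exceeds the threshold (0.2 is an illustrative value)."""
    return population_stability_index(baseline, recent, classes) > threshold
```

&lt;p&gt;A check like this would typically run on a schedule against logged predictions, with the threshold tuned to your own traffic rather than taken from this sketch.&lt;/p&gt;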

&lt;h2&gt;
  
  
  What Actually Needs to Change
&lt;/h2&gt;

&lt;p&gt;The very first thing teams should do is bring in a senior ML or data science lead before finalizing the roadmap. You need ML expertise in decision-making before commitments happen, not after engineering has spent two months building in the wrong direction.&lt;/p&gt;

&lt;p&gt;Build your operating model around daily collaboration between ML and engineering, not sequential handoffs. The traditional approach where product writes specifications, engineering builds features, and ML practitioners “add intelligence” creates silos that guarantee failure. ML engineers need to work directly with the people building data pipelines, API interfaces, and monitoring systems. These components depend on each other in ways that don’t map to separate work streams.&lt;/p&gt;

&lt;p&gt;Establish governance before launch, not after the first incident. Define explicit boundaries: which predictions execute automatically, which require human review, and which should fail safely rather than guess. Implement monitoring that tracks model behavior, confidence score distributions, and output quality trends over time. Create clear escalation paths so when something breaks (and it will) there’s an obvious owner who can diagnose root cause and implement fixes.&lt;/p&gt;
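
&lt;p&gt;One way to make those boundaries explicit is to encode them as a small routing function. The sketch below is a hypothetical example: the route names and both thresholds are assumptions, and real values would come from your own validation data and risk tolerance.&lt;/p&gt;

```python
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    HUMAN_REVIEW = "human_review"
    FAIL_SAFE = "fail_safe"


# Hypothetical thresholds, for illustration only.
AUTO_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.60


def route_prediction(confidence: float) -> Route:
    """Map a model confidence score in [0, 1] to one of three governance routes."""
    if not (confidence >= 0.0 and 1.0 >= confidence):
        # Out-of-range or NaN scores are never trusted.
        return Route.FAIL_SAFE
    if confidence >= AUTO_THRESHOLD:
        return Route.AUTO_EXECUTE
    if confidence >= REVIEW_THRESHOLD:
        return Route.HUMAN_REVIEW
    # Low confidence: fail safely rather than guess.
    return Route.FAIL_SAFE
```

&lt;p&gt;Keeping the thresholds in one place gives the escalation owner a single spot to adjust when monitoring shows the boundaries are too loose or too strict.&lt;/p&gt;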

&lt;p&gt;This feels like overhead until you ship without it and realize nobody can answer basic questions about system behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Systems That Actually Work
&lt;/h2&gt;

&lt;p&gt;Team composition should match the problem:&lt;/p&gt;

&lt;p&gt;ML engineers bring expertise in navigating probabilistic systems and understanding where models break.&lt;/p&gt;

&lt;p&gt;Software engineers bring discipline around building maintainable infrastructure that operates at scale.&lt;/p&gt;

&lt;p&gt;Product brings judgment about where automation creates value and where it introduces unacceptable risk.&lt;/p&gt;

&lt;p&gt;All three perspectives need equal weight in planning. Companies that understand this stop launching impressive demos that collapse under real-world load. They build reliable systems that work consistently because they planned for production complexity from day one.&lt;/p&gt;

&lt;p&gt;Get the team structure, governance, and collaboration patterns right, and technical challenges become tractable. Skip these foundational changes, and engineering will keep building systems that work beautifully until the moment they encounter reality.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.linkedin.com/in/nicktalwar/" rel="noopener noreferrer"&gt;Follow him on LinkedIn&lt;/a&gt; to catch his latest thoughts.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://nicktalwar.substack.com/" rel="noopener noreferrer"&gt;Subscribe to his free Substack&lt;/a&gt; for in-depth articles delivered straight to your inbox.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://techleaders.kit.com/ai-workflows-for-regulated-content" rel="noopener noreferrer"&gt;Watch the live session&lt;/a&gt; to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.&lt;/p&gt;

</description>
      <category>aistrategy</category>
      <category>machinelearning</category>
      <category>engineeringleadership</category>
      <category>aiimplementation</category>
    </item>
    <item>
      <title>The #1 Reason Agentic AI Fails in Production</title>
      <dc:creator>Nick Talwar</dc:creator>
      <pubDate>Tue, 10 Mar 2026 11:52:03 +0000</pubDate>
      <link>https://dev.to/talweezy/the-1-reason-agentic-ai-fails-in-production-3c7l</link>
      <guid>https://dev.to/talweezy/the-1-reason-agentic-ai-fails-in-production-3c7l</guid>
      <description>&lt;p&gt;What happens when you let the LLM make every decision in Agentic AI use cases (and how to fix it)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yp4h5w7aydczq6t7lr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yp4h5w7aydczq6t7lr1.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few months ago, I watched a Series B startup demo their “production-ready” Agentic AI system. In testing, it worked just fine. But when they gave it real users and edge cases started appearing, the behavior became unpredictable.&lt;/p&gt;

&lt;p&gt;The issue was architectural: they’d given the LLM complete autonomy over execution decisions, and LLMs simply aren’t built to provide deterministic control at that level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;Gartner predicts that over 40% of Agentic AI projects will fail to reach production by 2027&lt;/a&gt;. The difference between systems that scale reliably and those that collapse under real-world conditions comes down to whether you separate reasoning from execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Failures Actually Originate
&lt;/h2&gt;

&lt;p&gt;The latest LLMs demonstrate remarkable reasoning capabilities. They can break down complex tasks, weigh tradeoffs, and generate sophisticated action plans. The problem emerges when organizations confuse reasoning capability with execution reliability.&lt;/p&gt;

&lt;p&gt;LLMs are probabilistic pattern matchers trained on text, and these characteristics propagate to Agentic AI systems built on top of them. LLMs excel at understanding context and generating plausible responses. But they struggle with deterministic execution, maintaining consistent behavior across edge cases, and guaranteeing the same output given similar inputs, even when they appear to be well understood during pre-production testing and simulation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://labs.zenity.io/p/moving-the-decision-boundary-of-llm-safety-classifiers" rel="noopener noreferrer"&gt;Zenity Labs found that classifiers fail when inputs take unexpected paths through activation space.&lt;/a&gt; The classifier works perfectly on inputs it recognizes, but novel paths (even semantically similar ones) can produce completely different classifications. The same dynamic applies to Agentic AI: systems trained and tested on known scenarios encounter unfamiliar patterns in production, and their responses become unpredictable.&lt;/p&gt;

&lt;p&gt;When you let the LLM make execution decisions directly, you’re betting that production will only present scenarios the model has learned to handle reliably. That bet fails more often than teams expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Autonomy Creates Unpredictability
&lt;/h2&gt;

&lt;p&gt;In production environments, Agents don’t receive clean, well-formatted inputs. They encounter ambiguity, partial information, conflicting signals, and edge cases that fall outside training distributions.&lt;/p&gt;

&lt;p&gt;Consider an Agent tasked with processing refund requests. In testing, requests follow predictable patterns. In production, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests that qualify for refunds but use non-standard phrasing&lt;/li&gt;
&lt;li&gt;Borderline cases where policy interpretation matters&lt;/li&gt;
&lt;li&gt;Situations requiring escalation that don’t match trained escalation triggers&lt;/li&gt;
&lt;li&gt;Inputs that combine multiple issues in ways the model hasn’t seen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the Agent has full autonomy, it must decide in real-time which action to take. Small variations in input phrasing can trigger entirely different action sequences. Run the same ambiguous request twice, and you might get different outcomes. This happens not because the model is malfunctioning, but because probabilistic systems don’t guarantee determinism.&lt;/p&gt;

&lt;p&gt;This behavior compounds across interactions. An Agent processing hundreds or thousands of decisions daily will inevitably encounter scenarios that push it outside reliable operating ranges. Without external controls, there’s no mechanism to catch these situations before they produce incorrect actions.&lt;/p&gt;
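
&lt;p&gt;To see why volume matters, here is a back-of-the-envelope sketch. It assumes each decision fails independently at the same rate, which is a simplification, and the 0.1% error rate used in the note below is an invented figure for illustration rather than a measured one.&lt;/p&gt;

```python
def probability_of_any_failure(per_decision_error_rate: float, decisions: int) -> float:
    """Chance of at least one incorrect action, assuming independent decisions."""
    return 1.0 - (1.0 - per_decision_error_rate) ** decisions
```

&lt;p&gt;Under those assumptions, an Agent that is wrong only 0.1% of the time has roughly a 63% chance of at least one incorrect action across 1,000 decisions.&lt;/p&gt;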

&lt;h2&gt;
  
  
  The Control Layer Solution
&lt;/h2&gt;

&lt;p&gt;The architectural fix is a Control Layer that separates what LLMs do well (reasoning) from what they do poorly (deterministic execution).&lt;/p&gt;

&lt;p&gt;In this model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Agent analyzes the situation and proposes an action&lt;/li&gt;
&lt;li&gt;A control layer validates whether that action is permitted&lt;/li&gt;
&lt;li&gt;Only validated actions execute&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The control layer uses rule-based logic that encodes business constraints, compliance requirements, and operational boundaries. When the Agent proposes an action, the control layer checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this action fall within permitted operations?&lt;/li&gt;
&lt;li&gt;Do the action parameters meet safety constraints?&lt;/li&gt;
&lt;li&gt;Are required conditions satisfied?&lt;/li&gt;
&lt;li&gt;Does the user context allow this operation?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If validation passes, the action executes. If not, the Agent receives feedback and can propose an alternative. Taking time to address these questions as a team, distill the answers into requirements, and then work with engineering to translate those requirements into a Control Layer architecture is a core mitigation strategy for these business risks.&lt;/p&gt;

&lt;p&gt;This architecture maintains the Agent’s flexibility while ensuring predictable boundaries. The Agent can still reason about complex scenarios and adapt to novel situations. The control layer ensures that adaptation happens within defined limits.&lt;/p&gt;
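
&lt;p&gt;The propose, validate, execute loop described above can be sketched in a few lines of Python. Everything specific here is hypothetical: the refund workflow, the &lt;code&gt;ProposedAction&lt;/code&gt; fields, the policy constants, and the fallback to human escalation are assumptions made for the example, and the &lt;code&gt;propose&lt;/code&gt; callable stands in for whatever LLM-backed Agent you use.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ProposedAction:
    name: str             # e.g. "issue_refund"
    amount: float         # action parameter checked against safety limits
    user_role: str        # context of the user the Agent is acting for
    order_verified: bool  # a required precondition


# Hypothetical policy values for a refund workflow.
PERMITTED_ACTIONS = {"issue_refund", "escalate_to_human"}
MAX_REFUND = 200.00
ROLES_ALLOWED_REFUNDS = {"customer", "support_agent"}
ESCALATION = ProposedAction("escalate_to_human", 0.0, "system", True)


def validate(action: ProposedAction) -> Optional[str]:
    """Return None if the action is permitted, else a reason the Agent can act on."""
    if action.name not in PERMITTED_ACTIONS:
        return f"action '{action.name}' is not a permitted operation"
    if action.name == "issue_refund":
        if 0 > action.amount or action.amount > MAX_REFUND:
            return f"refund must be between 0 and the limit of {MAX_REFUND:.2f}"
        if not action.order_verified:
            return "order has not been verified"
        if action.user_role not in ROLES_ALLOWED_REFUNDS:
            return f"role '{action.user_role}' may not request refunds"
    return None


def run_with_control_layer(
    propose: Callable[[Optional[str]], ProposedAction],
    execute: Callable[[ProposedAction], object],
    max_attempts: int = 3,
):
    """Only validated actions execute; rejected proposals are fed back to the Agent."""
    feedback = None
    for _ in range(max_attempts):
        action = propose(feedback)   # the Agent reasons and proposes
        feedback = validate(action)  # rule-based checks, no LLM involved
        if feedback is None:
            return execute(action)
    # Assumed fallback: hand off to a human once the attempts are used up.
    return execute(ESCALATION)
```

&lt;p&gt;The design choice worth noting is that &lt;code&gt;validate&lt;/code&gt; is plain deterministic code, so the same proposal always receives the same verdict regardless of how the Agent arrived at it.&lt;/p&gt;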

&lt;h2&gt;
  
  
  The Right Level of Control
&lt;/h2&gt;

&lt;p&gt;Building systems that consistently do the right things matters more than maximizing autonomy.&lt;/p&gt;

&lt;p&gt;Control layers define boundaries that let Agents operate confidently within them. Inside those boundaries, Agents can be remarkably flexible, adapting to novel scenarios and learning from outcomes. The boundaries simply ensure that adaptation doesn’t violate business requirements or create unpredictable behavior. They also give you a backstop for monitoring and closing feedback loops, gradually improving the system so that fewer escalations occur.&lt;/p&gt;

&lt;p&gt;Organizations that skip this step typically discover the need for controls after production failures. By then, retrofitting governance becomes significantly harder than building it from the start (akin to putting a genie back in a bottle).&lt;/p&gt;

&lt;p&gt;The systems that succeed in production share a common architecture: they separate reasoning from execution, maintain clear decision boundaries, and enforce validation before actions reach production systems. That architectural choice (more than model selection, training approach, or testing strategy) determines whether Agentic AI delivers predictable value or unpredictable failures.&lt;/p&gt;


</description>
      <category>aiagents</category>
      <category>productionsystems</category>
      <category>systemdesign</category>
      <category>aiengineering</category>
    </item>
  </channel>
</rss>
