DEV Community: Nick Talwar

When Engineers Manage Agents and Managers Engineer

Nick Talwar — Tue, 28 Jul 2026 14:00:00 +0000

Redesigning the AI engineering team structure before unclear roles slow everyone down

Engineers working with AI tools now spend more hours reviewing generated code than writing new code. Digital Applied’s Q1 2026 survey of 2,847 developers recorded the crossover, with review overtaking writing as the largest AI-assisted time sink after writing held a four-hour lead as recently as 2024.

Over the same period engineering managers have moved in the opposite direction. They are now more technically hands-on than they have been in a decade as Agentic AI lowers the barrier to direct code contribution.

Both trends meet in the middle of the org chart. The division of responsibilities between engineer and EM was doing structural work that few leaders ever named, and coding agents are dissolving it with nothing arriving to replace it.

The Job Descriptions Stopped Matching the Work

Look at how a senior engineer on an AI-heavy team actually spends a Tuesday. She kicks off two agent runs before standup, reviews a stack of generated pull requests mid-morning, fixes a prompt configuration that started producing flaky tests, and switches contexts across three tools before lunch.

The Pragmatic Engineer’s 2026 survey of over 900 engineers and engineering leaders captured exactly this. Engineers orchestrate more and context-switch more often, managers can be more hands-on, and the survey’s authors flagged the conclusion themselves. The engineer and manager roles are becoming similar.

Managers are converging from the other side. An EM can now ship a fix or prototype a feature between one-on-ones, and plenty of them do. The technical distance that used to accumulate after two years in management has stopped accumulating.

The Old Division Was the Long Pole

The classic split assigned implementation quality to engineers and gave managers allocation, priorities, and people. AI broke this in both directions.

When agents produce the majority of a feature’s code, the person directing them is making allocation decisions: which tasks go to the machine, which stay human, and how much scrutiny each output deserves. That used to be manager territory. And when a manager merges their own AI-assisted fix, they have re-entered the codebase their role was designed to stay out of.

What Role Confusion Costs

One client I worked with last year had three teams improving the same AI workflow at the same time.

Engineering upgraded the model to reduce latency. The AI team refined prompts and retrieval settings to improve answer quality. Operations updated the business rules the agent was expected to follow.

Each team shipped good changes. Each team achieved its own goals. A month later, overall accuracy had dropped.

No single change caused the problem. It was the interaction between all three. Everyone was optimizing their part of the system, but no one owned the system itself.

We uncovered it during a retrospective and established a single owner for end-to-end evaluation, along with shared metrics across the teams.

That problem happened to be visible. Most aren’t.

A Tilburg University study of Copilot adoption in open-source projects found core developers reviewing 6.5% more code while their original output dropped 19%. Stack Overflow’s 2025 survey found 45% of developers citing time-consuming debugging of AI-generated code as a top frustration. And in a March 2026 SmartBear survey of 273 software leaders, 70% said application quality had already degraded as AI accelerated development.

Those numbers tell a consistent story. Code production is accelerating faster than organizational ownership.

When an engineering manager merges agent-generated code and a production incident surfaces two weeks later, who owns the postmortem? The engineer who approved the pull request? The team that tuned the prompts? The platform team that selected the model? The product manager who defined the workflow?

Teams without a clear answer pay for AI twice. Once for the tokens, and again for the coordination overhead of figuring out whose responsibility the output became.

A Redesign That Fits on One Page

The fix requires less machinery than most reorgs. In my work with engineering teams adopting agentic workflows, four decisions cover most of the confusion.

Name one accountable reviewer per code surface. Agent-generated pull requests get a single human owner, assigned by code area and written into the CODEOWNERS file. That owner can be an engineer or an EM. What matters is that exactly one name appears, so accountability for quality survives the increase in volume.

Give manager code contribution explicit rules. If an EM ships code, it goes through the same review path as everyone else’s, and its scope stays bounded. Prototypes, internal tooling, and spikes work well. Critical-path features do not, because a manager who owns production code has become an engineer with a reporting-line problem.

Put orchestration in the engineer job description. Hours spent directing agents, writing evals, and maintaining prompt configurations should count as engineering work in performance reviews. If promotion criteria still reward hand-written lines, engineers will optimize for the old job while the actual work goes unmeasured.

Rebuild the EM role around what AI left behind. Stakeholder negotiation, cross-team decisions, career development, and the judgment calls agents consistently fumble. Those responsibilities gained value as everything around them got automated, and a manager whose calendar reflects that is doing the redesigned job instead of competing with their own engineers for the review queue.

Then revisit the whole arrangement quarterly. The tools are changing fast enough that a role definition written in January describes a different workflow by June.
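The first of those decisions, exactly one name per code surface, is the easiest to check mechanically. What follows is a minimal sketch, assuming the common CODEOWNERS layout of a path pattern followed by owner handles, with `#` starting a comment. The function name, the sample paths, and the handles are all hypothetical, and the parsing is deliberately simplified (it does not handle escaped `#` characters, and it counts a team handle as one owner).

```python
def find_ownership_gaps(codeowners_text: str) -> list[tuple[int, str, int]]:
    """Return (line_number, pattern, owner_count) for every rule that
    does not name exactly one owner."""
    gaps = []
    for number, raw in enumerate(codeowners_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        pattern, *owners = line.split()
        if len(owners) != 1:
            gaps.append((number, pattern, len(owners)))
    return gaps


# Hypothetical file content, for illustration only.
SAMPLE = """\
# Ownership by code area
/services/billing/   @alice
/services/search/    @bob @carol
/agents/prompts/
"""

print(find_ownership_gaps(SAMPLE))
# [(3, '/services/search/', 2), (4, '/agents/prompts/', 0)]
```

A check like this could run in CI so that a rule with two names, or none, fails before it merges. Whether a team handle should count as "one name" is a policy choice the sketch leaves open.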

An org chart is a claim about how work gets done. Each quarter the chart goes unedited while the work underneath changes, the claim gets a little less true. The teams outperforming with AI-assisted delivery share one habit that costs nothing to copy: they wrote down what changed. Engineers who manage agents, managers who touch code, and one name on every review.

Role convergence turns out to be a design problem, and design problems reward the leader willing to name them.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

Your Agent Platform Choice Is a Decade-Long Bet

Nick Talwar — Tue, 21 Jul 2026 13:01:10 +0000

Where control is accumulating in the Agentic AI stack, and how to choose on purpose

For about fifteen years, the force that decided where enterprise value collected had a name. Dave McCrory called it data gravity in 2010, and the idea aged well.

Applications drift toward data because moving data is slow, costly, and risky. Whoever controlled the data layer controlled the decisions that stacked on top of it, from analytics to applications to budgets.

That logic still holds, but what reaches for your data has changed. The dashboards and pipelines that used to sit beside the warehouse are giving way to agents, and an agent does not stay next to the data. It runs on some platform, reasons over whatever it can reach, and moves results between systems on its own.

The platform you pick to run your agents is taking over the position the data layer used to hold. It is becoming the thing that owns the relationship with your data.

That is a bigger decision than it looks, and most teams are making it without noticing.

Once a platform builds walls around that relationship, the position is expensive to win back, which is what makes this a decade-long bet and not a procurement round.

Agent gravity is the newer force

Tomasz Tunguz called this shift “agent gravity” in a recent essay. The argument runs parallel to the old one. Agents demand enormous compute, that compute is a large and growing business, and the platforms hosting agent workloads will fight to keep them. The more agents and data flowing through a platform, the heavier its pull.

Agents are turning into the main surface through which people and systems touch enterprise data.

An employee asks an agent instead of opening a dashboard. A customer interacts with an agent instead of a form. Other automated systems call an agent instead of hitting a database directly.

Once that becomes the default path, the platform running the agent sits closer to the value than the platform storing the data. Proximity to the work now accumulates more leverage than custody of the bytes.

Why running the agents is where the moat forms

Running an agent is expensive, and that expense is the point. Inference at scale, orchestration, memory, tool calls, retries, and the guardrails that stop an agent from doing something costly all burn compute, and compute is the business these platforms are in.

Tunguz has written separately about the harness, the orchestration and control layer that turns a raw model into something an enterprise can trust. That harness is where the hard engineering lives now. Whoever owns it owns the relationship with everything the agent reads, writes, and moves.

This is why the platform decision outlasts the model decision. Models will keep leapfrogging each other on every leaderboard.

The harness around them, the place your agents are configured, governed, and run, is sticky in a way individual models never were. I have seen an arrangement like this start as one convenient integration and end, two years later, with the bulk of a company’s analytical work running somewhere nobody picked on purpose.

The doors are already closing

Incumbents understand the dynamic, and they are not waiting for you to notice it. In April, Microsoft removed the compatibility mode that let Power BI query Databricks metric views through the standard connector, which broke the reports that relied on it (the release notes state it without ceremony).

At Build 2026, Microsoft positioned Fabric as the data platform for its Copilot and agent ecosystem, wired Fabric IQ into Microsoft 365 Copilot, and shipped Agent Skills that let agents build models and reports directly on governed Fabric data.

The behavior repeats across the field. Snowflake pushes Cortex, Google leans on BigQuery, and every one of them wants your agents reasoning over data inside its own walls.

The friction a vendor removes inside its own stack becomes friction everywhere else. That asymmetry is the gravity well, and it is built on purpose.

The question worth asking

This reframes what an evaluation should measure. Benchmark scores age in weeks, and a model that tops a chart today will sit mid-table by the next release. Tuning a ten-year decision around this quarter’s numbers misreads the timeline.

The operators I work with tend to ask a sharper question once they see the mechanics.

Which layer of the stack will own the relationship with our data over the next five to ten years? A company that stores its customer data in one system and runs its agents through another has already answered that question, whether it meant to or not. It handed the relationship to whoever controls the agent runtime, and it did so without holding a meeting about it.

Three checks separate a deliberate choice from an accidental one. First, can your agents read and write across platforms, or does every convenient path keep everything inside one vendor? Second, when an agent copies or moves data, who holds the audit trail and the off switch? Third, if you had to move your agent workloads to a different platform in three years, what would break, and what would it cost?

When the honest answer to the third question is that nobody has ever priced it, the platform has already priced it for you.

Make the bet on purpose

None of this argues for paralysis. Single-vendor stacks are convenient, and convenience earns its keep when a team is small and shipping fast. The narrower point is the one worth holding onto. The choice of where your agents run is compounding into control over your data, and that control is hard to win back once a vendor has built the gravity well around it.

Pick with open eyes, and price the exit before you need it. A platform decision you file under tactical has a habit of turning into the most strategic call you made all decade.

…

What Your AI and Agent Dashboard Is Hiding

Nick Talwar — Wed, 15 Jul 2026 12:31:01 +0000

There is a version of the enterprise AI story told in board meetings, and a version told in weekly standups, and the uncomfortable truth is that both are accurate.

In the board meeting, the chart goes up and to the right. Adoption is up, usage is up, time-to-first-output is down. Agents are running, employees are experimenting, and the company appears to have crossed from AI aspiration into Agentic AI execution. Nobody is lying. The chart is real.

In the standup, sales says the AI-generated account briefs are useful, after someone verifies the facts. Support says the agent drafts good responses, except the policy-sensitive ones, which is to say the ones that matter. Engineering says coding agents accelerate scaffolding, and that senior engineers just lost most of a sprint untangling an AI-generated migration that passed review and failed in staging. Operations says the workflow agent handles the happy path, and that when it doesn’t, someone spends two days reconstructing what the agent actually did (which systems it touched, what data it relied on, why it made the call it made) because nothing was built to replay it.

The temptation is to decide one group is wrong: the executives are high on their own supply, or the operators are foot-dragging. Neither. They are looking at different layers of the same system. Executives see the application layer. Operators live in the integration layer. And only one of those layers makes it onto the dashboard.

That is the abstraction error at the center of most enterprise AI programs, and it is worth being precise about, because the companies that fix it first are going to be very hard to catch.

Production Is Not Absorption

Dashboards measure what is easy to instrument: users, prompts, drafts, summaries, agent runs completed, time saved to first output. None of these numbers is fake. All of them measure the same thing, AI production, and production was never in doubt.

Producing more output faster is the entire point of the technology. Celebrating it is like celebrating that the printing press produces pages.

The enterprise question is absorption: can the organization convert that output into trusted work at lower total cost and risk? Because a draft is not a closed deal, a generated pull request is not shipped software, and an answer is not trust. Every one of those gaps gets closed by humans, and the dashboard is silent about all of it.

The model is fast. The company may not be. A dashboard that measures only the model will systematically overstate the company.

The AI and Agent Cleanup Tax

The missing line item deserves a name: the AI cleanup tax, the human effort required to turn AI output into something the business can actually stand behind.

It would be convenient if this tax were small. It is not. The mistake is imagining it as light editing, a fact-check here, a rewrite there. In practice, the expensive version looks like this:

A coding agent opens a pull request that compiles, passes tests, yet misunderstands a core architectural assumption. A senior engineer, your scarcest resource, spends a day and a half finding that out, because the failure isn’t in any single line; it’s in the intent.

An agent updates records across the CRM and a downstream billing system. Three weeks later the numbers don’t reconcile. Now someone is doing forensic archaeology across two systems of record, with no execution trace, trying to determine which of four hundred automated writes was wrong and whether the error propagated.

A support agent gives a customer an answer that touches a regulatory boundary. Compliance asks the only questions compliance ever asks: where did this come from, what policy did it rely on, and can you show me? If the honest answer is “we can’t reconstruct it,” the cleanup tax on that single output is measured in days of legal and engineering time, and in the awkward decision to pull the agent back from anything that matters.

This is the crucial point: the tax doesn’t hide because it’s small. It hides because it’s misattributed. It shows up as debugging, as review, as reconciliation, as “just being careful”, as ordinary work done by your most senior people. It never shows up as a cost of the AI program, so the AI program books the time savings and someone else’s budget absorbs the verification. The board sees a 40% reduction in time-to-draft; the workflow sees a 5% improvement in cycle time. The gap between those two numbers is the tax, and today almost nobody is measuring it.

Trust Is the Expensive Part

Why does this pattern repeat everywhere? Because AI is unusually good at making the visible part of knowledge work cheap: drafting, summarizing, classifying, generating, the demo-friendly layer. But that was never the expensive part of enterprise work. The expensive part is trust: can I send this to a customer, ship this code, make this decision, and defend it in front of compliance, legal, procurement, or the board?

Trust is not a prompt problem. It is a system problem. It requires context, governed data access, policy gates, evaluation against real cases, observability, replayable execution traces, human review at the right points, and clear ownership. The value of enterprise AI does not come from the model; it comes from the operating system around the model, and that operating system is precisely what the first-generation dashboard leaves out.

Agents raise the stakes on all of this, because an agent is not an interface; it is a workflow participant. It doesn’t just answer; it acts, retrieves, decides, routes, updates, escalates, and commits changes into systems of record. That means the cleanup tax escalates with capability: when an agent drafts, the tax is editorial; when it writes to systems, the tax is operational; when it touches clinical, financial, or legal workflows, the tax is risk.

“Agent runs completed” is therefore a dangerously shallow metric. Completion is not correctness, correctness is not trust, and the right question is not did the agent run but did it complete the workflow correctly, with the right context, under the right controls, at lower total cost than the process it replaced?

Cost Per Trusted Output

The board does not need less AI measurement. It needs a dashboard built around four questions: What did the system produce? What did humans have to do before it became usable? What happened downstream? And can we prove it?

The second question is the one nearly everyone skips, which is why model output and business outcome refuse to correlate on so many dashboards. The bridge between them is rework, and rework is measurable: edit distance between raw and approved output, regeneration rates, rejection rates, review latency, exception and escalation rates, policy-block events, manual handoffs, downstream acceptance. If 600 of 1,000 generated support responses need material edits, that is not an anecdote. If reps trust 40 of 200 AI account plans enough to use, that is not a vibe. If a coding agent opens 50 pull requests and senior review time doesn’t fall, that is a system telling you exactly where it is broken.
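Several of the rework signals listed above can be computed from data most teams already keep: the raw output and the version a human approved. The sketch below is one hedged way to do that. It uses the similarity ratio from Python's standard `difflib` module as a stand-in for edit distance, and the `MATERIAL_EDIT_THRESHOLD` cutoff is an assumption to be tuned against real review data, not a value from any survey.

```python
from difflib import SequenceMatcher

# Assumed cutoff: below this similarity an edit counts as "material".
MATERIAL_EDIT_THRESHOLD = 0.8


def similarity(raw: str, approved: str) -> float:
    """Similarity in [0, 1] between the raw AI output and the approved text."""
    return SequenceMatcher(None, raw, approved).ratio()


def material_edit_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of (raw, approved) pairs whose approved text diverged materially."""
    if not pairs:
        return 0.0
    edited = sum(
        1 for raw, approved in pairs
        if similarity(raw, approved) < MATERIAL_EDIT_THRESHOLD
    )
    return edited / len(pairs)


def rate(count: int, total: int) -> float:
    """Plain ratio, for counts such as rejections or regenerations."""
    return count / total if total else 0.0


# The examples from the paragraph above, expressed as rates.
print(rate(600, 1000))  # support responses needing material edits: 0.6
print(rate(40, 200))    # account plans reps trusted enough to use: 0.2
```

Tracked per workflow and per week, these rates show whether rework is falling as the system matures or merely moving to a different team.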

All of it rolls up into one economic unit. Not cost per token. Not cost per prompt. Not cost per draft.

Cost per trusted output.

That is the number that connects the AI program to the P&L, the number that survives contact with a skeptical CFO, and the number that, once you start driving it down, turns AI from a line item into a compounding advantage.
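The article does not give a formula for this unit, so the following is only one plausible way to compute it: total cost of getting output to a trusted state (model spend plus human review and rework time) divided by the number of outputs that were actually accepted and used. Every field name and every figure in the example is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkflowPeriod:
    """One reporting period for one AI-assisted workflow (hypothetical schema)."""
    model_cost: float      # tokens, inference, platform fees
    review_hours: float    # human time spent checking AI output
    rework_hours: float    # human time spent fixing AI output
    hourly_rate: float     # loaded cost of the people doing that work
    outputs_produced: int  # everything the system generated
    outputs_trusted: int   # outputs accepted and used downstream


def cost_per_output(p: WorkflowPeriod) -> float:
    """The dashboard-friendly number: model spend per generated output."""
    return p.model_cost / p.outputs_produced


def cost_per_trusted_output(p: WorkflowPeriod) -> float:
    """Model spend plus human verification, per output the business used."""
    if p.outputs_trusted == 0:
        raise ValueError("no trusted outputs in this period")
    human_cost = (p.review_hours + p.rework_hours) * p.hourly_rate
    return (p.model_cost + human_cost) / p.outputs_trusted


# Invented figures, chosen only to show how far the two numbers can diverge.
period = WorkflowPeriod(
    model_cost=500.0, review_hours=40.0, rework_hours=60.0,
    hourly_rate=100.0, outputs_produced=1000, outputs_trusted=400,
)
print(cost_per_output(period))          # 0.5
print(cost_per_trusted_output(period))  # 26.25
```

The gap between the two printed numbers is the cleanup tax in miniature: the first is what a token-based dashboard reports, and the second is what the workflow actually paid.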

The Roadmap Writes Itself

Here is the payoff for doing the harder measurement: once the cleanup tax is visible, it stops being demoralizing and becomes a diagnostic. Every failure mode points at a specific fix.

If output needs too much correction, the task boundary is too broad; narrow it. If the system lacks context, fix retrieval, data access, and memory. If reviewers don’t trust it, build evals against real production examples, not synthetic demos. If exceptions take days to trace, you’re missing observability and replay, so build the trace before you scale the agent. If people are shuttling data between systems by hand, finish the integration. If everything requires senior sign-off, either the use case exceeds your current maturity or your controls are underbuilt, and now you know which.

This is also why “AI strategy” fails when it arrives as a use-case inventory. A list of applications is not a strategy; it is a backlog. The strategic work is sequencing: which workflows are valuable enough, bounded enough, instrumented enough, and safe enough to absorb AI output without drowning the organization in verification? That is the difference between a demo roadmap and a production roadmap, and the market will eventually price the difference.

Down the Stack

The first wave of enterprise AI abstraction happened at the interface: type into a box, get useful output. Everyone got it roughly for free. The next wave happens lower in the stack, and it will not be evenly distributed: shared context, governed data access, workflow-specific agents, eval harnesses, policy gates, approval records, cost telemetry, execution traces. This is the unglamorous machinery that lets a company know not just what the AI produced, but how that output moved through the business and what it cost to trust it.

This work photographs badly. It will never demo like the chatbot did. It is also where the durable value is, precisely because it is hard to copy. Prompt volume is a commodity; anyone can buy tokens. An organization that has instrumented its cleanup tax, driven down its cost per trusted output, and learned to convert every unit of rework into a system improvement has built a capability, and capabilities compound.

The board should still get a number. It should just be the right one: how much trusted work did the AI or Agent system help complete, at what total cost, with what risk, with what evidence?

That number is harder to produce than an adoption chart. It is also the only one worth funding against, and unlike the next model release, it is entirely within your control. The tax is real, it is large, and right now it is invisible. The first company in your market to see it clearly wins.

So, go make it visible.

…

Build Governance That Matches What Agentic AI Actually Does

Nick Talwar — Tue, 07 Jul 2026 15:49:44 +0000

Why oversight models built for supervised tools fall short once agents start acting

A supervised AI tool hands you a draft and waits. You read it, you edit it, you decide whether it ships. But an agent does not wait. It reads a support ticket, queries a database, updates a CRM record, sends a few emails, and schedules a follow-up, finishing most of that before anyone looks at the outcome.

The model underneath can be identical. The oversight problem is a different beast entirely.

Most governance frameworks running in production were written for the first kind of system. They assume a person checks each output before it carries consequences, so the controls cluster around the moment of approval. That design holds up well when AI generates something and stops. It comes apart the moment an agent chains actions together across systems and accounts, where each step sets up the next and no one is standing at the gate.

The data shows how wide this gap has grown. In McKinsey’s 2026 AI Trust Maturity Survey, only about 30 percent of organizations reached maturity level three or higher in strategy, governance, and agentic AI controls, even as deployment footprints kept expanding. Technical capability is racing ahead. The oversight structures meant to keep it accountable are lagging, and the distance keeps widening.

Sequences change what oversight has to catch

The reason supervised guardrails fall short with agents comes down to how the two systems fail. A supervised tool fails at a single point. It produces a bad draft, a person catches it, and the cost stops there.

An agent fails along a path. It misreads one input, acts on that reading, and every action after it inherits the error. By the time anyone notices, the agent has touched five systems and the original mistake is buried three steps back.

This is why security and risk concerns now sit at the top of the list of barriers to scaling agentic AI, cited by close to two-thirds of respondents in the same survey. The worry has shifted from capability to control. Teams want to know what happens when an agent does something it was never explicitly told to do, and whether anyone can reconstruct the chain of events well enough to undo it.

A governance framework built for agents has to account for the sequence rather than the endpoint. That means defining the boundaries of what an agent may touch, building checkpoints into the path instead of bolting them onto the final output, and deciding in advance what happens when an agent operates outside its intended scope.
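Those three elements (a boundary on what the agent may touch, checkpoints inside the path, and a predefined response to out-of-scope behavior) can be sketched in a few lines. This is an illustration under stated assumptions, not a reference design: `AgentPolicy`, `GuardedRunner`, and `OutOfScopeError` are hypothetical names, and the approval callback stands in for whatever human or policy gate a real system would use.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class OutOfScopeError(Exception):
    """Raised when the agent attempts an action outside its boundary."""


@dataclass
class AgentPolicy:
    allowed: set[str]                                       # what the agent may touch
    needs_approval: set[str] = field(default_factory=set)   # checkpoints in the path


@dataclass
class GuardedRunner:
    policy: AgentPolicy
    approve: Callable[[str, tuple], bool]  # stand-in for a human or policy gate
    log: list[tuple[str, str]] = field(default_factory=list)

    def run(self, action: str, handler: Callable[..., Any], *args: Any) -> Any:
        # Boundary: anything not explicitly allowed is blocked and recorded.
        if action not in self.policy.allowed:
            self.log.append(("blocked", action))
            raise OutOfScopeError(f"{action} is outside this agent's scope")
        # Checkpoint: sensitive steps pause mid-path, not after the fact.
        if action in self.policy.needs_approval and not self.approve(action, args):
            self.log.append(("held", action))
            return None
        self.log.append(("ran", action))
        return handler(*args)


policy = AgentPolicy(
    allowed={"read_ticket", "update_crm", "send_email"},
    needs_approval={"send_email"},
)
runner = GuardedRunner(policy, approve=lambda action, args: False)

runner.run("read_ticket", lambda ticket: f"read {ticket}", "T-1")
runner.run("send_email", lambda to: f"sent to {to}", "customer@example.com")
print(runner.log)  # [('ran', 'read_ticket'), ('held', 'send_email')]
```

The design choice worth noting is that the check happens before each step, so an error early in the chain cannot silently propagate through the sensitive actions that follow it.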

Make every agent decision traceable

When an agent acts across systems, the most valuable thing you can have afterward is a record of why it did what it did, which input triggered which action, and which decision produced which outcome. Without that trail, an incident becomes a forensic exercise with no evidence, and the team is left guessing at a system that already moved on.

McKinsey’s survey found that the rate of AI incidents has held steady at roughly 8 percent, yet confidence in how organizations respond to them has dropped. Close to 60 percent of respondents who experienced an incident rated their organization’s response as no better than satisfactory. Incident frequency has stayed flat. The ability to trace, explain, and contain those incidents has fallen behind the complexity of the systems creating them.

Traceability is an engineering problem before it becomes a compliance one. It means logging the agent’s reasoning and actions in a form a human can reconstruct, designing systems so a single decision can be traced back to its trigger, and building the audit trail into the architecture instead of adding it after something goes wrong.

Agents that cannot explain themselves are agents you cannot govern.
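As a rough sketch of what "traced back to its trigger" can mean in practice, the example below records each step with a pointer to the step that caused it, then walks that chain backward from any outcome. The `TraceEvent` schema and the sample events are assumptions made for illustration; a production system would more likely build on an existing tracing or audit-log standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    event_id: str
    parent_id: Optional[str]  # None marks the original trigger
    kind: str                 # "trigger", "decision", or "action"
    detail: str               # what happened
    reason: str = ""          # why the agent did it, in its own words


def trace_back(events: list[TraceEvent], event_id: str) -> list[TraceEvent]:
    """Return the chain from the original trigger down to the given event."""
    by_id = {e.event_id: e for e in events}
    chain: list[TraceEvent] = []
    seen: set[str] = set()
    current = by_id.get(event_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guards against accidental cycles
        chain.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


# Hypothetical events modeled on the support-ticket scenario.
EVENTS = [
    TraceEvent("e1", None, "trigger", "support ticket received"),
    TraceEvent("e2", "e1", "decision", "classified as billing issue",
               reason="ticket mentions a duplicate charge"),
    TraceEvent("e3", "e2", "action", "updated CRM record"),
    TraceEvent("e4", "e2", "action", "sent email to customer"),
]

for event in trace_back(EVENTS, "e4"):
    print(event.kind, "-", event.detail)
# trigger - support ticket received
# decision - classified as billing issue
# action - sent email to customer
```

With a record like this, the question "why did the customer get that email?" has a three-line answer instead of a two-day investigation.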

Governance belongs in engineering before it reaches compliance

A lot of organizations are waiting for regulation to tell them what good looks like. That instinct is understandable, and it is also fragile. The EU AI Act’s high-risk obligations for stand-alone systems were originally set to apply in August 2026, and in May 2026 EU lawmakers reached a political agreement to push most of them to December 2027.

Transparency rules still land in August 2026, but the headline deadline that many teams were planning around moved by more than a year.

This is the core problem with running agentic oversight off a regulatory calendar. The calendar reflects political negotiation, and it tells you nothing about how your specific agents fail, what they can reach, or how you would catch them when they drift. Those are engineering questions, and they get answered well only by people who understand the architecture.

What to ask next

Agentic governance comes down to a few honest questions:

What can this agent reach?
What does it do when it gets something wrong?
How can you trace any outcome back to the decision that caused it?

A team that can answer these has already built the things a large enterprise customer or regulator asks for: an agent with bounded access, a defined response when it gets something wrong, and an audit trail someone can actually follow.

Now think back to the agent I described at the beginning. It read the ticket, queried the database, updated the record, and sent the emails before anyone looked at the outcome. In that scenario, oversight waits until the end of the chain.

Apply these questions and the workflow looks different. The control sits inside the system instead of at the final output, put there by the people who built it before the agent ever runs.

Your agents are already acting across live systems, and the only governance that protects you is the kind you build into how they work.

…

The CIO Role Just Split in Two. Here’s What You Need to Know.

Nick Talwar — Tue, 16 Jun 2026 17:14:47 +0000

Why the Best AI Leaders Run Offense and Defense Simultaneously

Fourteen AI initiatives on a single roadmap, governed by one steering committee, measured against one set of success criteria. Half are automating existing workflows to protect margins. The other half are building capabilities the company has never offered before. Meanwhile, the budget, risk framework, and quarterly check-in schedule remain stagnant.

This is what most enterprise AI portfolios look like right now. And it explains why so many of them feel stuck.

The two halves of that portfolio are fundamentally different games. One is about protecting what already works. The other is about building what comes next. Each requires different ownership, different timelines, different metrics, and different tolerance for ambiguity. Running them as a single strategy is like training for a marathon and a sprint on the same schedule. The structure guarantees that one of them suffers.

What Most Organizations Miss

McKinsey’s Global Tech Agenda 2026 found that the CIOs delivering measurable value have made a specific shift. They’ve moved technology from a cost center to what McKinsey calls a “value creator,” embedding AI and data directly into operating models.

But the research surfaced a clear divide between organizations that are simply modernizing their technology estate and those that are rewiring for competitive advantage.

That divide maps to a pattern I keep running into with enterprise leaders. The companies actually moving forward are playing two distinct games at once:

On defense, they’re using AI and Agents to protect the core business. Automating manual workflows, tightening operational efficiency, reducing cost structures that have been bloated for years.
On offense, they’re building new capabilities. New products, new revenue streams, new ways of reaching customers that weren’t possible eighteen months ago.

Most organizations don’t have a mental model for this split. They’re either in pure cost-cutting mode or chasing growth, and the AI and Agentic AI strategy simply reflects whichever game the board happens to be pressuring this quarter.

What Defense Actually Looks Like

Defensive AI and Agent work targets processes you understand well, with outcomes you can measure in months and risk profiles you can model. Automated claims processing. Intelligent document extraction. Predictive maintenance on equipment that’s already generating revenue.

The success criteria are clear. Faster cycle times, lower error rates, reduced headcount for routine tasks, better margins on existing lines of business. The value case is arithmetic, and the ROI conversation is relatively straightforward.

What Offense Actually Looks Like

Offensive AI builds capabilities that didn’t exist before. You’re not optimizing a known process. You’re testing whether a new process should exist at all.

These projects look like using AI to enter adjacent markets with personalized products, or building recommendation engines that fundamentally change how customers discover what you sell, or creating internal decision-support tools that give your operators information advantages competitors don’t have.

The success criteria are murkier. You’re measuring learning velocity, market signal, and option value. The ROI conversation is harder, and the organizational patience required is significantly higher.

When Efficiency Eats Innovation

When companies run offense and defense under the same governance structure, the defensive projects almost always win the resource fight.

Defense gets measured on efficiency, cost reduction, and operational reliability. The governance is tighter and accountability sits with operational leaders who own the processes being improved.

Offense gets measured on learning rate, market validation, and strategic optionality. The governance is much lighter, and the timelines are longer.

Overall, defensive projects are easier to justify, easier to measure, and easier to get approved. So offensive projects get deprioritized because they can’t compete on the same ROI framework.

The result is a portfolio that looks busy, but only plays one game. The company gets more efficient at what it already does while falling behind on what it could become. The board sees cost savings and assumes the AI and Agent strategy is working, but nobody’s building anything that changes the company’s competitive position.

The Diagnostic

If you’re running AI and Agent initiatives right now, here’s a quick test. Look at your active portfolio and sort every project into one of two columns. Column one: protecting existing revenue and margin. Column two: building something you’ve never had before.

If you can’t sort them cleanly, your strategy is probably conflated.

The companies losing ground on AI and Agents aren’t necessarily the ones spending too little. They’re the ones who never made the split visible, never assigned ownership to each side, and ended up with a portfolio that defaults to whichever pressure is loudest.

Making the split explicit is the first step toward making it work.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.


5 Org Chart Mistakes That Are Killing ROI in the AI and Agent Era

Nick Talwar — Tue, 09 Jun 2026 12:41:57 +0000

Organizational structure determines AI outcomes more than technology ever will

McKinsey’s research found that more than 80% of organizations are not yet seeing a tangible impact on enterprise-level EBIT from AI and Agents. This suggests that while adoption is broadening, most companies are still struggling to turn AI and Agents into scaled financial results.

But there is an important piece of the story that is missing. A separate analysis of 140 enterprise AI implementations found that 77% of failures were organizational in nature, with technical issues like model performance, data quality, and integration complexity accounting for less than a quarter.

Your org chart is the first system AI has to survive before it reaches a single customer or workflow, and these five structural mistakes consistently prevent it from getting there.

1. Your Chief AI Officer Reports Nowhere Near the P&L

The 2026 AI & Data Leadership Executive Benchmark Survey found that 38.5% of companies have now appointed a Chief AI Officer or equivalent, but there’s almost no consensus on where that role sits. Reporting lines are split across technology, business, and transformation leadership, with no dominant model emerging and no clear pattern connecting any one reporting structure to better outcomes.

That fragmentation carries real downstream consequences. When AI leadership reports into a CTO or CIO function, the role tends to optimize for infrastructure and tooling decisions rather than business impact. When it reports into a transformation office, it gravitates toward strategy decks and governance frameworks that rarely survive contact with operational reality.

Neither path connects AI or Agents directly to revenue, margin, or operational throughput, which means the person nominally responsible for AI results often has no line of sight into the metrics that define them.

2. Your AI or Agent Team Lives in IT Instead of in the Business

When AI or Agent capability gets housed inside the IT department, it inherits IT’s entire operating model, meaning projects get scoped through a service request lens, prioritization follows the IT backlog, and success gets measured in uptime and deployment velocity rather than business outcomes.

This is a fundamental structural mismatch. AI is a business capability that requires technical infrastructure, and the distinction matters because AI initiatives that start with a business problem and work backward toward the right technical approach tend to survive past the pilot stage, while initiatives that start with a model and go looking for a use case tend to stall indefinitely.

Organizations running AI teams embedded within business units, or at minimum co-located with business leadership, consistently outperform centralized IT-led models on both adoption and value delivery.

3. Your Steering Committee Owns Accountability for Nothing

AI steering committees are one of the most popular governance structures in enterprise AI programs, and they’re also one of the least effective.

The typical setup includes senior representatives from multiple functions who meet monthly to review progress, offer guidance, and align priorities, but in practice, these committees almost always devolve into a venue for status updates where no actual decisions get made.

The root issue is accountability without power. Steering committees rarely control budget allocation, staffing decisions, or deployment timelines, which means they can recommend changes but have no mechanism to compel them. When an AI initiative hits an organizational obstacle (and every one does), the committee discusses it, documents it, and then waits for someone else to resolve it, creating a governance layer that absorbs time without reducing friction.

Research on AI governance maturity from McKinsey’s 2026 AI Trust Maturity Survey reinforces how widespread this gap is, with only about 30% of organizations reaching a maturity level of three or higher in governance, even as their technical and data capabilities continue to advance. The organizational decision-making apparatus simply hasn’t kept pace with the technology it’s supposed to govern.

4. You Built AI Skills in One Team and Called It Done

Concentrating AI talent in a single team feels efficient at first, but the problems with this approach emerge at scale. When every AI initiative has to flow through the same team, that team becomes a bottleneck.

This pattern appears so frequently in enterprise organizations that it has earned a name in organizational design circles. It’s called the Center of Excellence trap.

The CoE starts as a strategic asset and gradually evolves into a capacity constraint that chokes the very pipeline it was built to open. A CIO article from late 2025 described the resulting dynamic well, noting that business units inevitably branch off on their own when the central AI team can’t keep pace, creating fragmented and ungoverned efforts scattered across the company with no shared standards or oversight.

The more sustainable model is capability distribution. Instead of hoarding AI expertise in one group, the investment goes into building baseline AI literacy and applied skills across functions. This allows the central team to shift from doing the work to enabling others to do it by providing tooling, standards, training, and quality guardrails while the business units own execution and outcomes.

5. Your Center of Excellence Has No Authority to Make Anything Stick

This is the inverse of mistake four. Some organizations do build a Center of Excellence with a genuine mandate to drive AI adoption across the enterprise, staffing it well, giving it a clear charter, and expecting it to set standards for how AI gets developed, deployed, and monitored. Then they forget to give it any enforcement power.

What follows is predictable. The CoE publishes best practices that business units ignore, develops governance frameworks that project teams route around, and recommends tooling standards that departments override. Without budget influence or the organizational standing to block non-compliant deployments, the CoE becomes an advisory function that advises no one in particular and enforces nothing at all.

This is a design failure at the leadership level. A CoE with clear standards but no enforcement mechanism creates the illusion of governance while fragmented, uncoordinated AI adoption continues underneath it.

The Real Infrastructure Problem

These five mistakes share a common thread. They all treat AI as something that can be added to an existing organizational structure without redesigning how decisions get made, who owns outcomes, and where authority actually lives.

AI underperformance in most organizations traces back to an org chart that was built for a different kind of work and never updated to reflect how AI-driven operations actually need to function.

The companies capturing real returns in 2026 are the ones willing to redesign reporting lines, redistribute decision rights, and place AI leadership where it can actually influence how the business operates on a daily basis.

If you’re reviewing your AI strategy this quarter, start with the org chart. The structure you’re running determines the ceiling of what AI can deliver, and right now, most ceilings are set lower than anyone realizes.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.


4 Ways to Keep Your AI and Agent Costs Down

Nick Talwar — Wed, 03 Jun 2026 13:10:20 +0000

The architectural decisions that separate controlled spend from compounding surprises

AI and Agentic AI costs have a way of looking reasonable right up until they aren’t.

The early pilots run on contained use cases with limited traffic, so the numbers stay small and nobody questions the architecture behind them. Then the product scales. Teams start layering inference calls into features that weren’t in the original cost model, and the spend starts compounding in places nobody is watching.

By the time finance flags the invoice, the architecture driving those costs is already embedded in production and expensive to change. A Gartner survey found that more than 90% of CIOs say managing cost limits their ability to extract value from AI at scale.

The problem is rarely any single API call. It’s the accumulation of decisions that were never designed to hold up under real production volume. These four levers address that directly. Each one targets a different layer of the cost structure, and together they give you a system that stays predictable as usage grows.

1. Right-Size Model Selection to Task Complexity

The fastest way to cut AI costs without changing outcomes is to stop sending every request to your most capable model. Most production AI workloads follow a clear pattern where a small percentage of requests require deep reasoning while the majority involve extraction, classification, or short-form responses that a lighter model handles just as well.

A model routing layer evaluates each incoming request and directs it to the appropriate model based on complexity, confidence thresholds, or task type. Simple queries go to smaller, faster, cheaper models. Only the requests that genuinely need frontier-class reasoning get routed to the expensive option.

The impact is significant. Industry benchmarks consistently show that intelligent routing reduces inference costs by 30% to 60% in mixed-workload environments, and in some configurations the savings reach even higher. IBM research has highlighted estimates that routing a portion of queries to smaller models can reduce inference costs by up to 85% compared to always using the largest available model.

When 70% to 80% of your traffic can be handled by a model that costs a fraction of your top-tier option, the math changes quickly. The key is building this routing logic into the architecture early, before usage patterns are established and before teams develop habits around defaulting to a single model for everything.
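The routing idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production router: the tier names, the per-1K-token prices, the set of "simple" task types, and the prompt-length cutoff are all hypothetical placeholders you would replace with your own.

```python
from dataclasses import dataclass

# Hypothetical tiers and per-1K-token prices, for illustration only.
MODELS = {
    "small": {"price_per_1k": 0.0004},
    "frontier": {"price_per_1k": 0.09},
}

# Task types assumed (for this sketch) to be safe for the lighter model.
SIMPLE_TASKS = {"extraction", "classification", "short_answer"}


@dataclass
class Request:
    task_type: str
    prompt: str


def route(request: Request, max_simple_chars: int = 2000) -> str:
    """Pick a tier from the task type and prompt length."""
    is_simple = request.task_type in SIMPLE_TASKS
    is_short = len(request.prompt) <= max_simple_chars
    return "small" if is_simple and is_short else "frontier"


def estimated_cost(request: Request, tokens: int) -> float:
    """Estimate the cost of a request at the routed tier's price."""
    tier = route(request)
    return MODELS[tier]["price_per_1k"] * tokens / 1000
```

A real router would likely add a confidence signal (for example, escalating to the larger model when the small one returns a low-confidence answer), but even a rule this simple makes the default explicit instead of leaving every team to pick the biggest model.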

2. Build Caching Layers for Predictable and Repetitive Inputs

Every time your system pays for an inference call that produces the same output as a previous call with identical or near-identical input, you’re burning money on redundant compute. In most production AI and Agent systems, this happens more often than teams realize. Support workflows, document processing pipelines, and internal tools all generate repetitive queries that trigger fresh inference calls unnecessarily.

Caching addresses this by storing responses to previous inputs and returning cached results when a sufficiently similar request comes in. Semantic caching takes this further by using embedding similarity to match new queries against previously answered ones, so you don’t need exact string matches to get a cache hit.

For applications with stable system prompts or repeated reference documents, prompt caching alone can cut costs by 50% to 90% on eligible workloads. That’s a significant margin improvement for what is fundamentally an infrastructure decision, not a product change.
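Here is a rough sketch of the semantic caching pattern described above. It is deliberately simplified: the `embed` function is a toy word-count stand-in for a real embedding model, and the class name, the 0.8 similarity threshold, and the linear scan over entries are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached answer) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached answer for the most similar past query, if close enough."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def get_or_call(self, query: str, call_model: Callable[[str], str]) -> str:
        """Serve from cache when possible; otherwise pay for one inference call."""
        cached = self.lookup(query)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        answer = call_model(query)
        self.entries.append((embed(query), answer))
        return answer
```

The threshold is the design decision that matters. Set it too low and users get answers to questions they didn't ask; set it too high and the cache behaves like exact string matching. Tracking `hits` and `misses` gives you the data to tune it.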

3. Monitor Cost Per Outcome, Not Cost Per API Call

Most teams track AI and Agent spend at the wrong level of granularity. They watch cost per API call or cost per token, optimize those numbers, and then wonder why the overall bill keeps climbing. The problem is that per-call metrics tell you how efficiently your infrastructure runs, but they tell you nothing about whether the spend is generating proportional business value.

The metric that actually matters is cost per outcome. What does it cost to resolve one support ticket, process one document, or generate one qualified recommendation? When you measure at the outcome level, you start seeing which features and workflows are efficient and which ones burn through tokens without producing proportional results.

This shift in measurement changes how teams make decisions. A workflow that costs $0.002 per API call looks cheap in isolation, but if it takes 40 calls to produce one usable output, your effective cost per outcome is $0.08. Another workflow might cost $0.01 per call but deliver a result in three calls, for an effective cost of $0.03 per outcome, making it roughly 2.7 times more cost-effective at the outcome level. Without outcome-level tracking, teams end up optimizing the wrong variable. They hit their API budget targets while the business bleeds margin on features that consume far more inference than their value justifies.

Building this visibility requires tagging inference calls by feature, workflow, and business outcome so you can attribute costs accurately. It’s operational overhead up front, but it gives you the data to make allocation decisions that actually improve unit economics.
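As a small sketch of what that tagging buys you, the snippet below rolls tagged inference calls up to cost per outcome, using the two workflows from the example above. The `InferenceCall` record and its field names are hypothetical; in practice these tags would come from whatever logging or billing export you already have.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InferenceCall:
    workflow: str    # tag: which workflow issued the call
    outcome_id: str  # tag: which business outcome the call contributed to
    cost_usd: float


def cost_per_outcome(calls):
    """Return {workflow: average cost per distinct outcome}."""
    total = defaultdict(float)
    outcomes = defaultdict(set)
    for call in calls:
        total[call.workflow] += call.cost_usd
        outcomes[call.workflow].add(call.outcome_id)
    return {w: total[w] / len(outcomes[w]) for w in total}


# 40 cheap calls for one outcome vs. 3 pricier calls for one outcome.
calls = [InferenceCall("workflow_a", "out-1", 0.002) for _ in range(40)]
calls += [InferenceCall("workflow_b", "out-2", 0.01) for _ in range(3)]

report = cost_per_outcome(calls)
for workflow, cost in sorted(report.items()):
    print(f"{workflow}: ${cost:.3f} per outcome")
```

Running it prints $0.080 per outcome for the first workflow and $0.030 for the second, which is the comparison the per-call numbers hide.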

4. Create a Deprecation Practice for Low-Value Use Cases

Not every AI-powered feature deserves to keep running. As products evolve, teams tend to accumulate use cases without revisiting whether each one still clears a reasonable cost-to-value threshold. A feature that made sense during a pilot, when call volume was low and the marginal cost was negligible, can become a quiet drain on your budget once it’s processing thousands of requests per day in production.

A formal deprecation practice addresses this by establishing a regular review cycle where every active AI use case and Agent gets evaluated against its actual cost and measured value. Use cases that fall below the threshold get flagged for rearchitecting, downsizing to a cheaper model, or retiring entirely.

This is where most AI cost problems actually live. They aren’t unit cost problems. They’re accumulation problems. Twenty features each burning a small amount of unjustified spend add up to a significant line item that nobody owns because nobody is looking at the portfolio as a whole.

The review doesn’t need to be complicated. Quarterly is a reasonable cadence. The criteria should include cost per outcome (from the monitoring practice above), usage volume trends, and a clear-eyed assessment of whether the feature still aligns with product priorities.
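One way to keep the quarterly review from turning into a debate is to encode the criteria. The sketch below is an assumption-heavy illustration: the `UseCase` fields, the 1.5 value-to-cost threshold, and the three labels are hypothetical, and estimating value per outcome is the hard part that no code solves for you.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    cost_per_outcome: float        # from outcome-level monitoring
    value_per_outcome: float       # estimated business value of one outcome
    volume_trend: float            # fractional change in usage vs. last quarter
    aligned_with_priorities: bool  # still on the product roadmap?


def review(use_case: UseCase, min_value_ratio: float = 1.5) -> str:
    """Classify a use case as 'keep', 'rearchitect', or 'retire'."""
    if not use_case.aligned_with_priorities:
        return "retire"
    if use_case.cost_per_outcome <= 0:
        return "keep"  # nothing is being spent, so there is nothing to cut
    ratio = use_case.value_per_outcome / use_case.cost_per_outcome
    if ratio >= min_value_ratio:
        return "keep"
    # Below threshold: shrinking usage suggests retiring,
    # otherwise try a cheaper model or design first.
    return "retire" if use_case.volume_trend < 0 else "rearchitect"
```

The output is a starting point for the conversation, not a verdict. Its value is that every use case gets looked at against the same criteria every quarter.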

Revisit Your Architecture to Sustain Your ROI

Each of these four levers operates at a different layer of the cost structure, and none of them require you to sacrifice capability or slow down product development. Model routing targets per-call efficiency. Caching eliminates redundant compute. Outcome-level monitoring gives you the data to allocate intelligently. And deprecation keeps your portfolio from accumulating dead weight.

The common thread is that AI cost management is an architecture problem. The decisions that determine your spend at scale are made by engineering teams during system design, not by finance teams during contract negotiation. The organizations that keep their costs predictable are the ones that treat these decisions as first-class architectural concerns from the beginning, rather than scrambling to retrofit controls after the bill becomes a boardroom conversation.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.


Your AI and Agent Rollout Needs a Problem-Definition Process

Nick Talwar — Tue, 26 May 2026 14:04:02 +0000

How Product Management Discipline Separates Lasting AI and Agent Adoption from Expensive Shelf-Ware

We’ve all read about the AI rollouts that go awry. Tools get purchased, training gets scheduled, an adoption campaign goes out, but within two months the usage curve flattens because nobody in the organization can answer a simple question:

What specific problem are we solving, and how will we know we solved it?

I’ve spent years leading teams from both an engineering and product management perspective, so I’ve seen from the trenches why this obvious question can get skipped. The urgency to "adopt AI" pushes companies straight into tool selection and training programs while the harder, slower work of defining which problems are actually worth solving never happens.

The Missing Discipline

A recent Harvard Business Review study by Amanda Pratt and Melissa Valentine examined AI adoption at a major tech company and surfaced a finding that should reframe how every operator thinks about this problem.

It was no surprise to me that the area most correlated with successful, sustained AI adoption turned out to be product management, not prompt engineering or technical fluency. The disciplines that mattered most were defining which problems are worth solving, designing structured experiments, and integrating solutions into the way work already happens.

These findings line up with what I've observed across dozens of AI and Agentic AI engagements. The companies where AI actually takes root are the ones that approach adoption with product discipline, starting with a specific workflow, identifying a measurable friction point, building a small test, and evaluating results before scaling anything.

Two Companies, Two Approaches

Consider the difference between two real patterns I see repeatedly in enterprise AI and Agentic AI work.

1) Company A purchases an AI platform, negotiates an enterprise license, builds a prompt library, and launches a change management campaign complete with lunch-and-learns, weekly tip emails, and a login dashboard to track "adoption." After three months, a handful of power users have integrated the tool into their workflows, and everyone else has moved on.

2) Company B takes a different path. Before selecting any tool, they run a structured problem-definition process across three business units. Each unit identifies its highest-friction workflow, documents the current state in detail, and defines what a measurable improvement would look like. Only then does the team evaluate which AI capabilities (if any) could address those specific problems. They run 30-day pilots with clear success criteria, and when two of the three pilots produce measurable gains, those two scale while the third gets killed early, saving months of wasted effort.

One of those pilots, for example, targeted a procurement approval workflow that averaged nine days from request to sign-off. The team mapped every handoff, identified two steps where AI-assisted document review could eliminate manual bottlenecks, and set a target of reducing cycle time to under four days. After the pilot, cycle time dropped to three and a half days. That result gave leadership concrete evidence to fund a broader rollout in procurement, and the specificity of the success made it easy to communicate across the organization.

Company B spent less money, took slightly longer to get started, and ended up with AI embedded in actual workflows producing actual results. Company A spent more, moved faster, and ended up with an expensive tool that sits mostly unused.

Why Problem-Definition Keeps Getting Skipped

The rise of AI has put immense pressure on companies to move fast, and the problem-definition process feels slow. Buying a tool and launching a training program, by contrast, feels like action.

There's also a structural gap. Most organizations assign AI adoption to IT or to a newly created "AI team" that reports to the CTO. Those teams are good at evaluating technology. They're less practiced at the product management work of scoping problems, defining success metrics, and designing experiments within business workflows they don't own. The people closest to the workflows (operations leads, department managers, senior ICs) rarely get pulled into the problem-definition phase because the initiative is framed as a technology project, not a workflow improvement project.

Velocity without direction is just expensive motion. The organizations I work with that have the strongest AI adoption results are the ones that invested the first four to six weeks in problem definition and a Data Story / IP Moat audit before evaluating a single vendor. That initial patience created clarity that made everything downstream faster, from tool selection to pilot design to scaling decisions.

The Diagnostic Question

If you want to know whether your AI or Agentic AI adoption effort has legs, ask one question across every team that's supposed to be using AI. Can they answer, specifically, what problem they're solving and how they'll know if they've solved it?

If the answer is vague ("We're using AI to be more efficient") or circular ("We're adopting AI because we need to adopt AI"), the rollout is already in trouble. Clear problem statements are the leading indicator of whether AI adoption will stick or stall.

The companies that bring product management discipline to AI adoption, with defined problems, scoped experiments, and honest evaluation, end up with AI embedded in their actual operations. Everyone else ends up with a line item on the budget and a login dashboard nobody checks.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

6 Things Your AI Agents Need That You're Probably Not Building

Nick Talwar — Tue, 19 May 2026 17:35:06 +0000

The infrastructure that separates agents that demo well from agents that actually run

You would never bring a new hire onto your team without performance feedback, escalation paths, or a way to know when they're struggling. Yet that's exactly how most organizations deploy AI agents. MIT Sloan and BCG's 2025 research found that 76% of executives now describe agents as coworkers rather than tools, but almost none of them are managing agents that way. They ship the agent and move on.

Deciding to call your agents “coworkers” is easy. Setting up the feedback loops, escalation paths, and failure signals that actually make one is where teams stall. It's almost entirely an infrastructure problem, and these are the six pieces most teams skip.

1. Evaluation Frameworks

A working agent and a reliable agent are two different things. Evaluation frameworks give you the ability to measure the difference before your users discover it for you. This means building structured test suites that run against your agent's outputs on a regular cadence, scoring for accuracy, relevance, and task completion across a range of realistic scenarios.

Good evaluation suites include both deterministic checks (did the agent call the right tool with the right parameters?) and judgment-based scoring (was the response actually useful to the person asking?).
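A minimal sketch of a suite that combines both kinds of check might look like the following. Everything here is hypothetical scaffolding: `agent` is any callable that returns the tool it chose plus its response, and `judge` is any callable that returns a score between 0 and 1 (in practice a rubric, a human rater, or another model).

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_tool: str
    expected_params: dict


@dataclass
class AgentResult:
    tool: str
    params: dict
    response: str


def deterministic_check(case: EvalCase, result: AgentResult) -> bool:
    """Did the agent call the right tool with the right parameters?"""
    return result.tool == case.expected_tool and result.params == case.expected_params


def run_suite(cases, agent, judge, min_judge_score=0.7):
    """Run every case through the agent and score it both ways.

    agent(prompt) -> AgentResult; judge(prompt, response) -> float in [0, 1].
    """
    report = {"total": len(cases), "tool_ok": 0, "judged_ok": 0, "failures": []}
    for case in cases:
        result = agent(case.prompt)
        tool_ok = deterministic_check(case, result)
        judged_ok = judge(case.prompt, result.response) >= min_judge_score
        report["tool_ok"] += int(tool_ok)
        report["judged_ok"] += int(judged_ok)
        if not (tool_ok and judged_ok):
            report["failures"].append(case.prompt)
    return report
```

Keeping the two scores separate matters. An agent that calls the right tool but writes an unhelpful answer fails differently from one that writes a pleasant answer after calling the wrong tool, and you want to see which one you have.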

The key is that evaluation has to be continuous, running in CI/CD pipelines and against live traffic, because agent behavior shifts as underlying models update and data distributions change. LLMs, the technology underlying agents, are probabilistic at their core. Their outputs follow an often opaque statistical distribution that can shift over time, and those shifts affect performance and accuracy.

Anthropic's engineering team has written publicly about maintaining evaluation suites as living artifacts, with dedicated teams owning the infrastructure while domain experts contribute tasks and run the tests themselves.

2. Fallback and Escalation Logic

Every agent will encounter situations it cannot handle. The question is whether you've decided in advance what happens next, or whether the agent improvises.

Fallback logic defines the boundaries. When confidence drops below a threshold, when a tool call returns unexpected data, when the task exceeds the agent's defined scope, the system needs a predetermined path. That path might route to a simpler deterministic process, a different model, or a human operator. Escalation logic layers on top of that by adding severity awareness.

Without explicit escalation tiers, every failure gets the same treatment, which means either everything gets flagged (and humans stop paying attention) or nothing does (and real problems slip through). The organizations successfully scaling agents build these paths before deployment, treating them as load-bearing architecture.
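To make the idea of a predetermined path concrete, here is a minimal sketch of fallback and escalation as a single decision function. The tier names, the 0.75 confidence threshold, and the destination labels are assumptions for illustration; your own tiers will depend on what your operations team can actually staff.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def decide_path(confidence, tool_ok, in_scope, severity, min_confidence=0.75):
    """Return 'agent', 'deterministic_fallback', 'human_queue', or 'page_on_call'."""
    # Happy path: the agent is confident, its tools behaved, and the task is in scope.
    if in_scope and tool_ok and confidence >= min_confidence:
        return "agent"
    # Anything below this line is a failure; severity decides who hears about it.
    if severity is Severity.HIGH:
        return "page_on_call"
    if severity is Severity.MEDIUM or not in_scope:
        return "human_queue"
    return "deterministic_fallback"
```

The point of writing it down this way is that the agent never gets to improvise. A low-severity failure quietly drops to a simpler process, and only high-severity failures interrupt a person.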

3. Monitoring for Drift

AI agents degrade quietly. Model updates, shifts in input data, changes to upstream APIs, seasonal variation in user behavior. Any of these can erode agent performance without triggering a single error.

Drift monitoring tracks the gap between how your agent performed when you validated it and how it performs now. This includes statistical monitoring of output distributions, latency tracking across individual tool calls, and automated quality scoring against baseline benchmarks. In practice, effective drift detection requires capturing baseline metrics during your evaluation phase and then running the same scoring pipeline against production traffic on an ongoing basis. When scores diverge from your baseline by more than an acceptable margin, you have a concrete signal to investigate rather than a vague feeling that things seem off.
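The baseline comparison described above can be as simple as the sketch below. It assumes you already have a scoring pipeline that produces a quality score per run; the function name and the 0.05 acceptable margin are hypothetical, and a real deployment would probably also look at the spread of scores rather than only the mean.

```python
from statistics import mean


def drift_report(baseline_scores, production_scores, max_drop=0.05):
    """Compare mean quality scores from validation against production traffic."""
    base = mean(baseline_scores)
    prod = mean(production_scores)
    drop = base - prod
    return {
        "baseline": base,
        "production": prod,
        "drop": drop,
        "drifted": drop > max_drop,  # concrete signal to investigate
    }
```

For example, calling `drift_report([0.90, 0.92, 0.88], [0.80, 0.82, 0.78])` flags drift, because the mean has fallen by about 0.10 against a margin of 0.05.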

4. Human-in-the-Loop Checkpoints

Full autonomy sounds efficient until you realize what it costs when the agent is wrong. Human-in-the-loop checkpoints create structured moments where a person reviews, approves, or redirects agent output before it reaches the end user or triggers a downstream action.

The design challenge is placement. Too many checkpoints and you've built an expensive autocomplete system. Too few and you've handed off accountability to a system that can't actually hold it. The right approach maps checkpoints to consequence.

Low-risk, reversible actions can run autonomously. High-stakes decisions, anything involving money, legal exposure, or customer-facing commitments, need a human gate. As agents take on more complex workflows, these checkpoints also become your training data pipeline. Every human correction is a signal about where the agent needs improvement, but only if you're logging it (which brings us to the next point).
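Mapping checkpoints to consequence can be expressed directly in code. The following is a sketch under assumptions: the `Action` fields, the gating rule, and the shape of the review log are hypothetical, and `approve` stands in for whatever review UI or queue a human actually uses.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    reversible: bool
    involves_money: bool = False
    legal_exposure: bool = False
    customer_commitment: bool = False


def needs_human_gate(action: Action) -> bool:
    """High-stakes or irreversible actions require a human decision."""
    high_stakes = (
        action.involves_money or action.legal_exposure or action.customer_commitment
    )
    return high_stakes or not action.reversible


def execute(action, run, approve, review_log):
    """Run an action, gating on a human approver when needed.

    Every human decision is appended to review_log so it can later
    serve as a signal about where the agent needs improvement.
    """
    if needs_human_gate(action):
        approved = approve(action)
        review_log.append({"action": action.name, "approved": approved})
        if not approved:
            return "rejected"
    run(action)
    return "executed"
```

Low-risk actions never touch the approver, which keeps the human queue short enough that people still read what lands in it.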

5. Logging for Auditability

When an agent makes a decision, you need to be able to reconstruct exactly how it got there. Full execution logging captures the chain of reasoning, tool invocations, retrieved context, intermediate outputs, and final actions across every run.

This serves three purposes simultaneously:

First, debugging. When something goes wrong, you need the trace, not a guess.

Second, compliance. Regulated industries require demonstrable decision trails, and even unregulated ones are moving in that direction.

Third, improvement. Logged executions become the dataset you use to identify failure patterns, tune prompts, and build better evaluation suites.

The tooling for this has matured significantly. OpenTelemetry-based tracing, structured span capture, and production replay capabilities now exist across multiple frameworks. The infrastructure cost is low relative to the cost of operating an agent you cannot inspect.
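To show the shape of the data rather than any particular tool, here is a standard-library-only sketch of a run trace. It is not the OpenTelemetry API. The `RunTrace` class, the span kinds, and the example task and identifiers are all hypothetical; a real setup would emit equivalent records through your tracing framework of choice.

```python
import json
import time
import uuid


class RunTrace:
    """Collects an ordered record of everything one agent run did."""

    def __init__(self, task: str):
        self.run_id = str(uuid.uuid4())
        self.task = task
        self.spans = []

    def record(self, kind: str, name: str, **data):
        """Append one step: retrieval, tool call, intermediate output, or final action."""
        self.spans.append({"ts": time.time(), "kind": kind, "name": name, "data": data})

    def to_json(self) -> str:
        """Serialize the whole run so it can be stored and replayed later."""
        return json.dumps({"run_id": self.run_id, "task": self.task, "spans": self.spans})


# Illustrative run with made-up identifiers.
trace = RunTrace("refund request")
trace.record("retrieval", "policy_lookup", doc_ids=["refund-policy-v3"])
trace.record("tool_call", "issue_refund", params={"order_id": "A-1001", "amount": 25.0})
trace.record("final_action", "reply_sent", text="Refund issued.")
```

With records like these, debugging, compliance, and improvement all read from the same source: one serialized trace per run, keyed by `run_id`.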

6. A Defined Handoff Protocol

Agents rarely operate in isolation. They pass work to other agents, to human operators, to downstream systems, and occasionally back to the user. Every one of those transitions is a potential failure point.

A handoff protocol specifies what information transfers with the task, what context the receiving party needs, what constitutes a successful handoff versus a dropped one, and who owns the outcome after the transition.

This gets more complex in multi-agent systems where one agent's output becomes another agent's input. If the first agent summarizes a customer issue and strips out a critical detail before passing it along, the second agent makes a decision on incomplete information. Neither agent has failed individually, but the system has failed completely.

Without this kind of structural clarity, you get the agent equivalent of a game of telephone. Context gets lost between steps, responsibilities blur, and when something fails mid-workflow, nobody can pinpoint where.
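
A handoff protocol can be as simple as a typed envelope plus a check that runs before the receiver accepts the task. The sketch below assumes a hypothetical billing_agent that declares which context keys it cannot work without; the Handoff fields and the REQUIRED_CONTEXT table are illustrative names, not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Handoff:
    task_id: str
    sender: str
    receiver: str
    new_owner: str          # who owns the outcome after the transition
    summary: str
    context: dict = field(default_factory=dict)


# Hypothetical: context each receiver declares it needs before accepting work.
REQUIRED_CONTEXT = {
    "billing_agent": {"customer_id", "original_message", "amount_disputed"},
}


def validate(handoff: Handoff) -> list[str]:
    """Return a list of problems; an empty list means the handoff is complete."""
    problems = []
    if not handoff.new_owner:
        problems.append("no owner named after the transition")
    missing = REQUIRED_CONTEXT.get(handoff.receiver, set()) - set(handoff.context)
    problems.extend(f"missing context: {key}" for key in sorted(missing))
    return problems


dropped = Handoff(
    task_id="t1",
    sender="support_agent",
    receiver="billing_agent",
    new_owner="billing_agent",
    summary="Customer disputes a charge",
    context={"customer_id": "c9"},  # the summary step stripped the rest
)
issues = validate(dropped)
```

In this example issues lists the two context keys that were stripped, so the failure is caught at the transition instead of surfacing later as a bad decision by the second agent.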

The Management Layer You Can't Skip

These six elements share a common thread. They're all infrastructure that exists to manage the agent after it's built.

The agent itself (the model, the prompts, the tool integrations) is maybe 40% of what a production deployment actually requires. The other 60% is the system that keeps the agent honest, visible, and recoverable when things go sideways.

Organizations that treat agent deployment as a build-and-ship exercise will spend the next six months doing manual cleanup on failures they could have prevented. The ones that invest in this management layer first will find that their agents get better over time instead of quietly getting worse.

The technology is mature enough. The question is whether your operational infrastructure is ready to match it.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

Your Product Doesn't Need GPT-5. And It’s Costing You More Than You Think.

Nick Talwar — Tue, 12 May 2026 12:13:37 +0000

How Fine-Tuned Small Models Outperform Frontier AI for Most Production Workloads

Serving a 7B parameter model costs roughly $0.0004 per 1,000 tokens. A frontier model like GPT-5 charges up to $0.09 for the same volume. That's a spread of more than 200x on per-token cost (about 225x at those two prices), and at production scale, it compounds into the kind of line item that makes CFOs start asking uncomfortable questions.

Yet most enterprise AI strategies still start in the same place. Frontier model API, default configuration, build everything on top.

I’ve heard the same reasoning for this decision countless times. The plan is to start here, and optimize later. But "optimize later" rarely happens. The API dependency becomes load-bearing, and switching costs quickly accumulate. More often than not, teams discover much too late that 70-80% of their inference calls are handling structured, repeatable tasks that never needed frontier-class reasoning in the first place. Meanwhile, a fine-tuned small model handles all of it at a fraction of the cost, often with better accuracy on the specific domain, and without the vendor dependency.

The question worth asking before you architect anything isn't "which model is most powerful." It's whether the task even requires that power.

The Compounding Cost Problem

The per-token price gap between frontier and small models tells only part of the story. The real damage happens at volume.

Gartner’s analysis found that agentic AI workflows consume 5 to 30 times more tokens per task than standard chatbot interactions. When your agents are running thousands of structured, repeatable tasks per day, each one burning frontier-priced tokens, monthly inference bills can scale from manageable to alarming before anyone notices. A system handling 50,000 daily agent tasks on frontier APIs accumulates costs that a finance team will eventually flag, and "but the model is really smart" isn't a satisfying answer when 80% of those tasks are pattern execution.
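
A rough worked example shows how the volume math plays out. The two per-1,000-token prices and the 50,000 daily tasks come from this article; the 2,000 tokens per task and the 30-day month are assumptions picked purely for illustration, so treat the output as an order-of-magnitude sketch.

```python
SMALL_PRICE_PER_1K = 0.0004    # USD per 1,000 tokens, figure quoted above
FRONTIER_PRICE_PER_1K = 0.09   # USD per 1,000 tokens, the "up to" figure quoted above


def monthly_cost(tasks_per_day, tokens_per_task, price_per_1k, days=30):
    """Monthly inference spend in USD for one model tier."""
    return tasks_per_day * tokens_per_task / 1000 * price_per_1k * days


def hybrid_monthly_cost(tasks_per_day, tokens_per_task, structured_share, days=30):
    """Structured tasks go to the small model, everything else to the frontier model."""
    small = monthly_cost(tasks_per_day * structured_share, tokens_per_task,
                         SMALL_PRICE_PER_1K, days)
    frontier = monthly_cost(tasks_per_day * (1 - structured_share), tokens_per_task,
                            FRONTIER_PRICE_PER_1K, days)
    return small + frontier


TASKS_PER_DAY = 50_000     # from the example above
TOKENS_PER_TASK = 2_000    # assumption, not a measured number

all_frontier = monthly_cost(TASKS_PER_DAY, TOKENS_PER_TASK, FRONTIER_PRICE_PER_1K)
all_small = monthly_cost(TASKS_PER_DAY, TOKENS_PER_TASK, SMALL_PRICE_PER_1K)
hybrid = hybrid_monthly_cost(TASKS_PER_DAY, TOKENS_PER_TASK, structured_share=0.8)

print(f"all frontier: ${all_frontier:,.0f}/month")
print(f"all small:    ${all_small:,.0f}/month")
print(f"80/20 hybrid: ${hybrid:,.0f}/month")
```

Under those assumptions the all-frontier bill is about $270,000 a month, the all-small bill about $1,200, and the 80/20 split about $54,960. The absolute numbers move with tokens per task, but the ratio between the tiers does not.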

API pricing has dropped significantly. Frontier-quality model costs fell roughly 80% between 2025 and early 2026. But cheaper tokens don't change the underlying architectural mistake. You're still paying for general-purpose reasoning capacity on tasks that need specialized precision. It's the equivalent of provisioning a 256-core cluster to run a cron job.

Where Small Models Win (And Where They Don't)

Small language models, typically under 10 billion parameters, have crossed a performance threshold that changes the production calculus. Research from late 2025 demonstrated that a fine-tuned 350M parameter model outperformed generalist frontier models on structured tool-calling and API orchestration tasks. A 3B parameter model trained on domain-specific data can match frontier accuracy on classification, extraction, and routing while delivering 150 to 300 tokens per second compared to the 50 to 100 range typical of large models.

The production evidence is growing. An analysis of 287 documented SLM deployments found companies like Checkr, NVIDIA, Bayer, and DoorDash replacing frontier models with 7B to 14B parameter alternatives at 5 to 150 times lower cost, with equal or better performance on their specific tasks.

But small models have real limits. They fall apart on tasks requiring deep reasoning across long, unstructured documents. Complex multi-step inference, novel problem synthesis, and ambiguous decision-making still belong to frontier architectures. Pretending otherwise leads to brittle systems.

A Decision Framework for Model Selection

The architectural question isn't "which model is best." It's what the specific task actually requires.

Route to a small model when the task is structured, repeatable, and well-defined. Classification, entity extraction, document routing, templated generation, API orchestration, and status parsing all fit. If you can describe the task with clear input-output examples and the domain is bounded, a fine-tuned small model will likely match frontier performance at a fraction of the cost.

Route to a frontier model when the task demands open-ended reasoning, novel problem-solving, or synthesis across large unstructured contexts. Strategic analysis, complex code generation, multi-document research, and ambiguous judgment calls still benefit from frontier-scale reasoning. These tasks involve genuine inference, not pattern execution.

The hybrid architecture is where most production systems should land. Use a frontier model as the orchestration layer for planning, decision routing, and edge cases. Deploy fine-tuned small models as the execution layer for the high-volume structured tasks that account for the bulk of actual inference calls. One documented deployment using this approach, a frontier model as "master controller" with specialized small models handling task execution, showed a 90% reduction in monthly API costs and a 70% improvement in response speed.
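
The routing rule above can be sketched in a few lines. This is a deliberately simple illustration: the task kinds mirror the list in this section, while the Task fields and the 8,000-token context cutoff are hypothetical knobs you would tune for your own workload.

```python
from dataclasses import dataclass

# Task kinds named above as good fits for a fine-tuned small model.
SMALL_MODEL_TASKS = {
    "classification",
    "entity_extraction",
    "document_routing",
    "templated_generation",
    "api_orchestration",
    "status_parsing",
}


@dataclass
class Task:
    kind: str
    context_tokens: int
    has_io_examples: bool   # can the task be described with clear input-output pairs?


def route(task: Task, max_small_context: int = 8_000) -> str:
    """Return "small" for bounded, well-defined work and "frontier" otherwise."""
    if (
        task.kind in SMALL_MODEL_TASKS
        and task.has_io_examples
        and task.context_tokens <= max_small_context
    ):
        return "small"
    # Open-ended reasoning, long unstructured context, and edge cases fall through.
    return "frontier"


ticket = route(Task("classification", context_tokens=500, has_io_examples=True))
research = route(Task("multi_document_research", context_tokens=120_000, has_io_examples=False))
```

Defaulting to the frontier model when any condition fails keeps the failure mode expensive instead of wrong, which is usually the safer direction while the small models are still being evaluated.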

The Vendor Lock-In Problem

There's a second cost that doesn't show up on the monthly invoice. Every API call to a frontier model is a dependency you don't control. Pricing changes, rate limits, model deprecations, and terms-of-service updates all happen on someone else's timeline.

Fine-tuned small models running on your own infrastructure eliminate that variable. You control the model weights, the serving stack, the update cycle, and the data pipeline. For regulated industries where sensitive data can't touch third-party APIs, self-hosted small models aren't just a cost optimization. They're the compliance baseline.

The breakeven point for self-hosting versus API consumption is lower than most teams assume. Analysis across production deployments puts the threshold around 8,000 conversations per day, or roughly $500 per month in API spend. Above that line, owning your inference infrastructure starts paying for itself.
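
The breakeven logic itself is simple arithmetic: self-hosting pays off once the per-conversation saving, multiplied by volume, covers the fixed cost of running your own stack. The sketch below uses made-up inputs (a $500 monthly hosting cost and hypothetical per-conversation costs) to show the calculation; it does not reproduce the analysis behind the figures above.

```python
import math


def breakeven_conversations_per_day(fixed_monthly_cost, api_cost_per_conv,
                                    self_cost_per_conv, days=30):
    """Daily volume at which self-hosting matches API spend, or None if it never does."""
    saving_per_conv = api_cost_per_conv - self_cost_per_conv
    if saving_per_conv <= 0:
        return None  # the API is already cheaper per conversation
    return math.ceil(fixed_monthly_cost / (saving_per_conv * days))


# All three inputs are illustrative assumptions, not measured values.
threshold = breakeven_conversations_per_day(
    fixed_monthly_cost=500.0,
    api_cost_per_conv=0.0025,
    self_cost_per_conv=0.0004,
)
```

Plugging in your own hosting cost and per-conversation costs tells you which side of the line you are on; the function returns None when there is no saving to recover.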

Right-Sizing as an Engineering Discipline

Treating model selection with the same rigor you'd apply to database provisioning or infrastructure architecture is the move that separates production-grade AI systems from expensive experiments.

A frontier model is a tool. A small model is a tool. The discipline is knowing which tool fits which job, and building the architectural flexibility to use both without locking yourself into either. For most production workloads running structured, repeatable agent tasks at scale, the 7B parameter model on your own infrastructure will outperform the frontier API call to a model that's three orders of magnitude larger than what the task requires.

The smartest infrastructure decision you make this year might be choosing the smaller model, most of the time.


$4M Revenue Per Employee Is the New Benchmark. Most Companies Can’t Get There.

Nick Talwar — Tue, 05 May 2026 14:00:00 +0000

What AI-Native Operations Actually Look Like and Why Retrofitting Falls Short

Cursor crossed $2 billion in annualized revenue in early 2026. The team that built it? Roughly 300 people. Gamma, the AI presentation platform, hit $100 million ARR with about 50 employees and has been profitable for over two years. Midjourney generates hundreds of millions in annual revenue with a team you could fit in a mid-sized conference room. Lovable reached $100M ARR in eight months with 45 people.

Meanwhile, the median private SaaS company generates about $130,000 per employee. Five years ago, $100K was considered a reasonable benchmark. At scale, the best traditional SaaS companies were proud to reach $300K.

The gap between these numbers tells you something specific about how these companies are built. All four companies I mentioned initially have something in common beyond the headcount math.

From the first hire, they were built around AI as a core operator, with every workflow, every role, and every system designed on that assumption. The label for this is AI-native.

And for founders and executives running $5-30M ARR companies right now, the gap between AI-native operations and everyone else sets a competitive timeline, and that timeline is already shrinking.

What "AI-Native" Actually Means at the Operational Level

The phrase gets thrown around loosely, so let me be specific. An AI-native company designs its workflows from scratch around what AI can do. Every process, every role, every system assumes AI as a core participant from day one.

This is fundamentally different from what most companies do, which is take their existing workflows and add AI tools to them. The distinction matters because the architecture of your operations determines the ceiling of your efficiency.

Consider how a traditional SaaS company handles content. A marketing team writes briefs. Writers produce drafts. Editors review. Designers format. A project manager coordinates the whole thing. Five or six people touch every piece of content before it ships.

An AI-native company designs that workflow differently from the start. AI generates first drafts from structured inputs. A single editor shapes the output. Distribution happens programmatically. The entire pipeline might involve one or two people instead of six, and the throughput is three to five times higher.

Multiply that across customer support, engineering, sales enablement, onboarding, and internal operations. The compounding effect explains how Cursor runs at $6 million per employee while companies with similar revenue require ten times the headcount.

Why Retrofitting Existing Operations Fails

The instinct most established companies have is to layer AI tools onto what already exists. Buy a few licenses, integrate a copilot, maybe automate some ticket routing. This feels productive. It rarely moves the needle in a meaningful way.

The problem is structural. Your existing workflows were designed around human throughput. Your org chart reflects that design. Your hiring plans, your meeting cadences, your approval chains, your reporting structures all assume that humans do the work and other humans coordinate that work.

Bolting AI onto this foundation creates an awkward hybrid. AI generates a draft, but then it still goes through the same five-person review chain that existed before. AI triages support tickets, but the staffing model hasn't changed to reflect the reduced load. The tool saves twenty minutes per task, but the organizational overhead around that task stays identical.

The Realistic Options for Established Companies

If you're running a $5-30M ARR company, you probably aren't going to tear everything down and rebuild from scratch. That's fine. But pretending the efficiency gap will close on its own is a mistake with a deadline.

Here's what actually works for companies that aren't starting from zero.

Start with one workflow, redesigned from zero. Pick your highest-volume, most repeatable process and redesign it from scratch with AI as the primary operator. Don't optimize the existing process. Design the new one as if the old one didn't exist. Customer onboarding, content production, and first-line support are common starting points because they're high-volume and have clear inputs and outputs. The goal is to prove to your own organization what redesigned throughput looks like before you try to scale the approach.

Hire for the new architecture. The next time you open a role, ask whether the function that role serves could be restructured around AI instead. This doesn't mean replacing people. It means designing the role so one person with AI leverage can do what previously required three. The companies generating $2M+ per employee didn't get there by giving existing employees AI tools. They built teams where every person operates as a force multiplier.

Measure the right ratio. Track revenue per employee quarterly. If you're below $150K and growing, you're adding headcount faster than you're adding efficiency. That was fine in 2020. Today, it means you're falling behind the curve that AI-native competitors are setting. For context, top-quartile SaaS companies now generate $350K-$700K per employee, and the AI-native outliers are running at five to ten times that range.

Accept that partial adoption produces partial results. A company that redesigns 30% of its operations around AI-native principles will capture meaningful efficiency gains. A company that gives everyone a ChatGPT license and calls it transformation will not. Architectural commitment drives the outcome here. Tool selection alone never has.

Sequence your investment around leverage. Most companies adopt AI where it's easiest to implement. The better approach is to start where the ratio of human labor to repeatable output is highest. That's usually operations and fulfillment, where the actual throughput gains live.
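
The ratio itself is easy to track. The sketch below computes revenue per employee and flags the situation described above, read here as: headcount grew quarter over quarter while the ratio sits under $150K. That reading, the function names, and the sample quarters are assumptions for illustration; the company figures are the ones quoted earlier in this article.

```python
def revenue_per_employee(annual_revenue: float, headcount: int) -> float:
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return annual_revenue / headcount


def headcount_outpacing_efficiency(quarters, threshold=150_000):
    """quarters: list of (annualized_revenue, headcount) tuples, oldest first."""
    (_, previous_headcount), (revenue, headcount) = quarters[-2], quarters[-1]
    return headcount > previous_headcount and revenue_per_employee(revenue, headcount) < threshold


# Figures quoted earlier in this article: (annualized revenue, approximate headcount).
quoted = {
    "Cursor": (2_000_000_000, 300),
    "Gamma": (100_000_000, 50),
    "Lovable": (100_000_000, 45),
}
ratios = {name: revenue_per_employee(rev, heads) for name, (rev, heads) in quoted.items()}

# Hypothetical company: revenue up 12%, headcount up 20%.
warning = headcount_outpacing_efficiency([(5_000_000, 40), (5_600_000, 48)])
```

On the quoted figures, Gamma works out to $2 million per employee and Cursor to roughly $6.7 million. The hypothetical company trips the warning because it lands near $117K per employee after adding eight people.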

The Clock Is Running

The revenue-per-employee gap between AI-native companies and everyone else keeps widening. Gartner projects a wave of companies generating $2M+ per employee by 2030, and the leaders are already well past that mark.

For operators and founders at the $1-5M stage, this isn't a future problem. Your next funding round, your next hire, your next operational decision is happening in a market where competitors might need one-fifth the headcount to deliver the same output.

The companies that approach this as an architectural challenge will adapt. The ones running a tool-buying exercise will learn the hard way that efficiency at this scale comes from how you build, from how you design the work itself.


The Job Title That Didn’t Exist Last Year

Nick Talwar — Tue, 21 Apr 2026 12:07:28 +0000

Why Enterprise AI Needs a Translation Layer Between Data and Decisions

Gartner projects that over 40% of agentic AI initiatives will be abandoned by 2027. Reading that, a reasonable person might conclude that there is an inherent issue with the technology.

However, I know from my own experience building agents that when done correctly, they deliver.

The failure pattern we keep hearing about has nothing to do with model quality or infrastructure maturity. It's that organizations have no single agreed-upon definition for their own data.

Different departments define the same terms differently, and agents consume whatever definition they hit first at 10x the speed any human team would. Humans reconciled those gaps in quarterly meetings and footnotes. Agents just produce confident, expensive wrong answers.

The real fix requires a role that most companies haven't named yet.

When "Revenue" Means Different Things

Humans have always tolerated semantic drift inside organizations. If marketing and finance calculate revenue differently, they reconcile the gap in quarterly meetings or bury it in footnotes. The cost of ambiguity stayed low because humans processed data slowly enough to catch the mismatches.

AI agents don't reconcile by themselves. They ingest whatever schema they can access, apply whatever definition they encounter first, and produce output that sounds authoritative regardless of whether the underlying logic holds.

The confidence of the output actually makes the problem worse, because stakeholders trust polished summaries more than they trust raw numbers.

The Role Sitting Between Data and Meaning

The people solving this problem function as a translation layer between raw enterprise data and business meaning. They define what terms actually mean across the organization, map those definitions into the semantic structures that AI systems rely on, and maintain the consistency of that layer as business logic evolves.

The skillset is specific and rare. You need someone who understands data modeling well enough to audit pipeline logic, but who also understands the business well enough to know that "active customer" means something different to the retention team than it does to the billing team. You need someone who can sit in a room with a CFO and a data engineer simultaneously and translate in both directions.

Most companies don't have this person because the job didn't exist until AI agents started consuming enterprise data fast enough to make the gaps visible.

Some organizations are calling this a semantic architect. Others are folding it into "context engineering," which has emerged as a recognized discipline for designing the information environment that AI models operate within.

Cognizant's CIO, Neal Ramasamy, recently described context engineering as the factor that separates enterprise AI experimentation from sustainable scale, noting that most of the critical context in organizations still lives in people's heads rather than in systems where agents can access it.

Whatever you call the role, the function is the same: someone owns the relationship between what the data says and what the business means.

What This Role Could Look Like

Here's how I'd scope this role if I were hiring for it today.

This person sits between the data engineering team and business leadership. They own the company's business glossary, the single source of truth that defines what every key term means across the organization.

Before any new data source enters the AI pipeline, they confirm that field names map to actual business logic. When two departments define "customer" differently, they make the call on which definition the system uses. And they have enough authority to make that call stick.
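
That pre-ingestion check can be partly automated. The sketch below assumes a hypothetical glossary keyed by term name and a proposed mapping from a new source's field names to glossary terms; the definitions shown are invented examples, and a real glossary would live in whatever catalog or semantic layer the company already runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    name: str
    definition: str
    owner: str   # the person with authority over this definition


# Invented example entries; the definitions are placeholders.
GLOSSARY = {
    "active_customer": Term("active_customer", "Paid an invoice in the last 30 days", "semantic_owner"),
    "revenue": Term("revenue", "Recognized revenue net of refunds", "semantic_owner"),
}


def check_source(field_to_term: dict[str, Optional[str]], glossary: dict[str, Term]) -> dict[str, list[str]]:
    """Report fields with no mapping and fields mapped to terms nobody has defined."""
    report = {"unmapped": [], "unknown_term": []}
    for field_name, term in sorted(field_to_term.items()):
        if term is None:
            report["unmapped"].append(field_name)
        elif term not in glossary:
            report["unknown_term"].append(field_name)
    return report


new_source = {
    "cust_active_flag": "active_customer",
    "rev_total": "revenue",
    "mrr_adj": "adjusted_mrr",   # term not in the glossary yet
    "seg_code": None,            # nobody has said what this field means
}
report = check_source(new_source, GLOSSARY)
ready_for_pipeline = not report["unmapped"] and not report["unknown_term"]
```

The code only finds the gaps. Deciding what "adjusted_mrr" means, and making that decision stick, is still the human part of the job.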

The technical work is straightforward. The hard part is the authority. A semantic layer without organizational backing is just a wiki nobody reads.

The semantic layer market is projected to grow from $2.7 billion to $7.7 billion by 2030 precisely because companies are realizing that the technical infrastructure only works when someone with real authority governs it.

The Org Chart Hasn't Caught Up

Companies are spending millions on model selection, compute infrastructure, and agent orchestration while leaving the semantic layer as an afterthought managed by whichever data engineer happens to notice the inconsistency. It's the organizational equivalent of building a Formula 1 car and forgetting to hire someone who reads the track map.

The companies getting reliable output from their AI systems in 2026 will be the ones that treated this translation function as a first-class strategic hire, reporting to the CTO or CDO with real authority over definitions. The ones still debugging confident-sounding garbage will be the ones who assumed the data would speak for itself.

It won't. It never did. Humans just papered over the gaps. AI agents don't have that option.
