- Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
- Also by me: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
On April 13, 2026, PwC dropped a number that should be pinned to every staff-engineer Slack: 74% of AI's measured economic value is being captured by 20% of companies. The other 80% are running pilots, attending summits, and watching their margin not move. You can read the full write-up in PwC's 2026 AI Performance Study press release. The methodology covers 1,217 senior executives across 25 sectors. The headline finding is brutally specific: leaders use AI for growth, not just cost cuts, and the leading group reports a 7.2× AI-driven performance boost over their peers.
If you're an engineer, you might read that and shrug. Strategy. Not your problem. Wrong. The 7.2× gap shows up in code before it shows up in a board deck. The teams in that 20% are doing four engineering things differently, and none of them are about which model they call.
The strategy gap is downstream of an engineering gap
PwC's main finding sounds like a McKinsey quote: leaders point AI at new revenue, laggards point it at headcount math. But that strategy choice doesn't survive contact with a team that can't ship reliable AI features. You can't grow with AI if your AI features hallucinate, regress silently, or lose money on every call. The same PwC press release reports the leading group is "2.6× as likely as peers to report AI improves their ability to reinvent their business model," and that comes from somewhere concrete: they ship more AI features that actually work, faster, and they catch breakages before users do.
Companies trying to grow with broken AI quietly stop trying. The 33% number in the PwC report is the other side of this — only one in three companies report any cost or revenue gain at all. Most teams ship a chatbot, watch the eval score drift, lose the budget fight, and fold the experiment. The leaders ship the same chatbot, instrument it on day one, catch the drift on a Tuesday, and have the receipts when the CFO asks.
So the question isn't "are we using AI." Eight in ten companies are. The question is whether your team has the four engineering signals below.
Signal 1: Ship with instrumentation on day one
The fastest way to spot a laggard team is to ask one question: "show me the trace for the last AI call your product made." If they open Datadog and find nothing, they're a laggard. If they have to pull from raw logs, they're behind. Leaders treat tracing as table stakes the same way they treat HTTP request logging.
Five lines. Pick one of OpenTelemetry, OpenInference, LangSmith, or Langfuse. The shape barely changes:
```python
from langfuse.decorators import observe

@observe(name="customer-summary")
def summarize(case_id: str, transcript: str) -> str:
    return llm.complete(prompt=build_prompt(transcript))
```
That decorator gives you input, output, latency, token count, cost, and a trace ID per call. The first time a customer says "the assistant told me my refund was approved" and you can pull the exact prompt, response, and timestamp in 30 seconds, you stop being a laggard. Without it, every incident becomes archaeology.
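When that incident comes in, the pull looks roughly like this. A sketch assuming the Langfuse v2 Python SDK's fetch_traces() helper and that your instrumentation tags each trace with a user ID; adapt the filter names to whichever tool you picked:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys from the environment

# Pull the last ten traces for the complaining customer
traces = langfuse.fetch_traces(user_id="cus_1234", limit=10)
for trace in traces.data:
    print(trace.id, trace.timestamp, trace.input, trace.output)
```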
Signal 2: Eval discipline that runs in CI
Laggards run evals when they remember to. Leaders run evals on every pull request that touches a prompt, a tool definition, or a model version. The eval rig doesn't have to be fancy — it has to exist and gate merges.
```yaml
# .github/workflows/llm-eval.yml
name: llm-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python evals/run.py --threshold 0.82
```
The threshold is the recall, accuracy, or task-success score on a frozen test set of real questions, scored by a deterministic checker or an LLM-judge with a saved prompt. If the score drops below the threshold, the build fails, and the PR doesn't merge.
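For concreteness, here is a minimal sketch of what evals/run.py can look like, assuming a frozen JSONL test set of question/expected pairs and a deterministic substring checker; run_pipeline is a stand-in for your product's actual LLM entry point, and the scorer is the part you'd swap for your own checker or a saved LLM-judge prompt:

```python
import argparse
import json
import sys
from pathlib import Path

from app import run_pipeline  # assumption: your product's LLM entry point


def score(expected: str, actual: str) -> float:
    """Deterministic checker: 1.0 if the expected answer appears in the output."""
    return 1.0 if expected.lower() in actual.lower() else 0.0


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, required=True)
    args = parser.parse_args()

    cases = [
        json.loads(line)
        for line in Path("evals/cases.jsonl").read_text().splitlines()
        if line.strip()
    ]
    mean = sum(score(c["expected"], run_pipeline(c["question"])) for c in cases) / len(cases)

    print(f"eval score: {mean:.3f} (threshold {args.threshold})")
    if mean < args.threshold:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the merge


if __name__ == "__main__":
    main()
```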
This is the single highest-leverage practice. Teams without it ship prompt changes on Friday, hear about regressions on Monday, and have no idea which commit caused them. Teams with it ship five prompt changes a day and roll back the one that misfires before lunch. The PwC report doesn't use the word "evals" but the gap it describes is exactly this gap, expressed in revenue.
Signal 3: A real cost ceiling, enforced in code
Every laggard team I've seen has the same conversation: "the OpenAI bill jumped 4× last month, can someone look into it." The leaders don't have that conversation because they put a ceiling in the code path.
```python
MAX_TOKENS_PER_REQUEST = 4_000
MAX_USD_PER_USER_PER_DAY = 2.50

def gated_complete(user_id: str, prompt: str) -> str:
    if usage.cost_today(user_id) > MAX_USD_PER_USER_PER_DAY:
        raise BudgetExceeded(user_id)
    return llm.complete(
        prompt=prompt,
        max_tokens=MAX_TOKENS_PER_REQUEST,
    )
```
Five lines, one Redis counter, and the bill stops surprising the CFO. This matters more than it sounds. The PwC study notes that the leading 20% have AI as a P&L line item with positive return. That is only possible if the cost side is bounded. Laggards run AI as an unbounded variable cost: the bill grows with usage, and once per-user cost outruns what the feature earns, every call loses money. Then the experiment dies.
A real cost ceiling makes AI features survive their growth phase. Without it, success kills you.
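The "one Redis counter" can be as small as this sketch, assuming redis-py; the key scheme and the 48-hour TTL are illustrative choices, and record_cost would be called after each completed LLM call with the computed dollar cost:

```python
import datetime

import redis

r = redis.Redis()


def _key(user_id: str) -> str:
    today = datetime.date.today().isoformat()
    return f"llm_cost:{user_id}:{today}"


def record_cost(user_id: str, usd: float) -> None:
    key = _key(user_id)
    pipe = r.pipeline()
    pipe.incrbyfloat(key, usd)      # atomic add per completed call
    pipe.expire(key, 60 * 60 * 48)  # let yesterday's keys age out
    pipe.execute()


def cost_today(user_id: str) -> float:
    value = r.get(_key(user_id))
    return float(value) if value else 0.0
```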
Signal 4: Idempotent retries on every external call
LLM APIs fail. Tools called by agents fail. Networks blip. Laggard code retries naively, dual-writes, double-charges, and double-emails. Leader code is idempotent by default, every external side-effect keyed on a stable request ID.
```python
def send_email(idempotency_key: str, to: str, body: str):
    if outbox.seen(idempotency_key):
        return outbox.result(idempotency_key)
    result = ses.send_email(to=to, body=body)
    outbox.record(idempotency_key, result)
    return result
```
The key is derived from the agent's plan step, not generated fresh each retry. So when the agent retries the "send_email" tool call after a network blip, the second attempt sees the first attempt's result and short-circuits. The customer gets one email, the audit log shows one event, and the agent moves on.
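Key derivation is the part teams get wrong, so here is a hedged sketch: run_id and step_index are assumed to come from your agent loop's plan, and hashing the step (not the attempt) is what makes the key stable across retries:

```python
import hashlib
import json


def step_key(run_id: str, step_index: int, tool: str, args: dict) -> str:
    # Same plan step + same args => same key, no matter how many retries.
    payload = json.dumps(
        {"run": run_id, "step": step_index, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```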
This is the unglamorous code that separates an agent demo from an agent in production. Teams that have shipped agents at scale converge on the same pattern, and the canonical write-up is Stripe's idempotency post. Teams still stuck in pilot almost never have it.
The "AI-ready" team checklist
Print this. Pin it next to your screen. Score your team honestly:
- [ ] Every LLM call emits a trace with input, output, latency, cost, model version
- [ ] Evals run on every PR that touches prompts, tools, or model config
- [ ] A per-user-per-day cost ceiling is enforced in code, not in a dashboard alert
- [ ] Every tool call from an agent is keyed by an idempotency token
- [ ] The product owner can name the top three failure modes from last week's traces
- [ ] Prompt changes are versioned in git, not in a Notion page
- [ ] At least one eval score is reported in the same dashboard as DAU and revenue
- [ ] A model swap (e.g., GPT-5.5 to Claude Opus 4.7) takes hours of code change, not weeks (see the sketch below)
Leaders score 7 or 8. Laggards score 2 or 3 and don't notice the gap until the budget review.
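The model-swap item is the one most teams fail, so here is a sketch of the seam that makes a swap cheap. The provider SDK calls are real; the adapter names and model strings are placeholders for whatever you actually run:

```python
from openai import OpenAI
from anthropic import Anthropic


def openai_complete(prompt: str, max_tokens: int) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder: your current model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content or ""


def anthropic_complete(prompt: str, max_tokens: int) -> str:
    resp = Anthropic().messages.create(
        model="claude-sonnet-4-5",  # placeholder: your target model
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text


# Every call site binds to this name; a model swap edits one line,
# then the eval gate from Signal 2 tells you whether it held up.
complete = openai_complete
```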
You can spot the gap from outside
The leading 20% put AI features in the critical path of their main product: search, support, onboarding, the thing the company sells. The other 80% put AI features in a side panel labeled "AI Assistant (Beta)" that nobody opens. Side panels die at the first budget review. Critical-path features get more budget. Clearing that bar is an engineering problem, and the four signals above are the minimum kit for solving it.
What this means for your next sprint
You are not going to fix all four signals in a sprint. Pick one. The one with the highest leverage for most teams is Signal 1 — get tracing on every LLM call by the end of the week. Five lines of code. The minute you can pull a trace by user ID, every other conversation about AI quality gets faster.
The PwC number (20% capturing 74%) is going to keep widening. The companies in the leading group are compounding their advantage every month they ship reliably while their competitors don't. The engineering gap underneath the strategy gap is real, measurable, and mostly fixable. The only thing standing between your team and the 20% is whether you do the boring work that nobody writes a blog post about.
If this was useful
Most of the four signals above come down to two skills: building agents that don't fall over, and instrumenting LLM calls so you can see what's happening. The pocket guides linked at the top of this post cover both ends: agent patterns that survive production, and the tracing/evals tooling that catches the regressions you'd otherwise miss.