What AI & Data @Scale 2026 Revealed
A couple of weeks ago at Meta's AI & Data @Scale 2026, Boris Cherny, Head of Claude Code at Anthropic, mentioned he hasn't written a single line of code manually in eight months. Every commit generated by Claude Code, the very system he leads.
That's how paradigm shifts happen — quietly, and then all at once.
The Bottleneck Has Shifted
The conference theme this year was AI Native Transformation, a phrase that sounds like marketing until you sit through a full day of sessions and realize it's a precise technical description.
For the past decade, the bottleneck in building AI systems was algorithmic: more capable models, scalable architecture, better training data. That bottleneck is no longer the primary one; the frontier has evolved. Today, models are capable enough that the constraint has shifted, to infrastructure, to governance, and to our ability to evaluate what we've built.
No one illustrated this more viscerally than Cherny. He described days where he manages hundreds of agents, and days where that number reaches thousands. Sub-agents prompt each other. The human sits above the system, directing intent, unblocking bottlenecks, reviewing outputs. He described his current workflow built around routines, higher-order async automations, that let you set up work and deliver it into mergeable pull requests. The developer experience has fundamentally changed. The IDE is no longer the only interface; it's been joined by a new layer: the prompt, the task description, and the agent dashboard. The craft is shifting from execution to direction.
Governance: Your Security Model Wasn't Built for Agents
If the model capability problem is largely solved, the next constraint blocking safe deployment is governance. One of the most important sessions of the day came from Meta's Komal Mangtani and Ilia Shumailov, a two-part treatment of why much of what we know about securing software systems stops working the moment agents enter the picture.
Shumailov, whose background spans Cambridge and Google DeepMind, opened with a taxonomy of failures that rules-based systems cannot address. The core problem: traditional security was built for authenticated human identities making discrete, enumerable requests. Agents are neither discrete nor enumerable. They chain, recurse, and inherit permissions through context windows.
More dangerously, they can be manipulated by the data they consume. An agent that retrieves a document, a database row, or an API response can be instructed by that content to behave in unintended ways. This is indirect prompt injection; unlike direct attacks, the user did nothing wrong. The attack surface is semantic, not syntactic; rules-based filters cannot catch it.
Across both sessions, four failure modes emerged as the core framework:
- Identity confusion: agents misrepresenting or inheriting user identity
- Entitlement creep: permissions compounding across multi-agent chains
- Recursive leakage: sensitive context bleeding between agents
- Prompt injection via data payloads: attacks embedded in retrieved content
Mangtani then presented Meta's architectural response: a defense-in-depth stack anchored by something called DataVM, a unified trusted execution environment that bounds an agent's inputs, tool calls, and outputs under one governed scope at instantiation. The blast radius is constrained structurally, not by policy. What makes DataVM architecturally significant is what it inverts: instead of trusting data stores and denying access by exception, the agent's entire operational scope is bounded before it begins.
For anyone building AI products in regulated industries, financial services, healthcare, legal, this is not a future concern. It is a present one.
Evaluation: We're Building Faster Than We Can Measure
If governance is the first new bottleneck, evaluation is the second, and arguably the most invisible. Alex Ratner, CEO of Snorkel AI, made the most underappreciated argument of the day: our ability to measure AI has been outpaced by our ability to develop it.
He called this the evaluation gap, framing closing it as one of the most important problems facing the field.
The observability stack for AI agents has three layers: monitoring and logging, benchmarks, and evaluation tools. All three matter, but benchmarks are the keystone, because only a benchmark with real signal can answer the question that everything else is trying to avoid asking directly: is this agent actually doing what it's supposed to?
Ratner described three dimensions where today's agents most commonly break down, and where the next generation of benchmarks must deliver signal:
- Environment complexity: how dynamic and rich is the real operating world? Static, sandboxed evals massively underestimate production failure rates.
- Autonomy horizon: how far can an agent act independently before accumulating errors? Single-turn evals don't capture multi-step failure propagation.
- Output complexity: how sophisticated and verifiable is what the agent produces? Pass/fail metrics miss partial correctness entirely.
He then introduced the concept of the full work loop, the data an agentic benchmark must capture: tasks, environments, traces, outputs, and verifiers. In practice, the two elements most teams skip are traces (the step-by-step record of agent reasoning) and verifiers (the mechanism that confirms correctness). Without both, you don't have a benchmark; you have a demo.
He named concrete benchmarks illustrating what real signal looks like, testing agents on end-to-end tasks in real environments and measuring quality degradation over time, not just at a single snapshot. The underlying argument is harder to dismiss than any score; the future of AI progress may depend less on architectural breakthroughs than on whether we can build evaluation instruments that keep pace with the systems they are meant to govern.
User Agency: When Natural Language Becomes the Interface
The third bottleneck is the most human one: even if your agents are governed and evaluated, users won't adopt what they don't understand or trust. The most product-forward session of the day came from Qi Guo at Meta, presenting Instagram's Tune-Your-Algorithm (TYA) feature.
The premise is simple and long overdue: users should be able to understand and control the algorithm shaping what they see, not through settings menus and toggles, but in plain language.
TYA is built on two innovations. The first is the MRS Memory System, which constructs a persistent, structured "biography" of each user, a continuously updated summary of their interests and intent derived from behavioral signals. The second is Think-Then-Recommend (TTR), a reasoning layer that decomposes user interests and complex intents into personalized sub-goals before generating recommendations.
The key architectural shift: recommendation as a reasoning problem over a user model, not a retrieval problem over historical signals. The system thinks before it recommends.
Early results showed strong product-market fit, with users specifically citing transparency and agency as the source of satisfaction. That finding deserves emphasis. Users didn't just want better recommendations. They wanted to understand them. For years, recommendation algorithms were intentionally hidden, black boxes that decided what you saw without explanation or recourse. TYA doesn't fully open that box, but it makes it significantly more legible; users can see how the algorithm has interpreted their interests and correct it in plain language. That shift, from passive recipient to active participant even within a constrained system, is what drove the product-market fit signal.
The design implication is harder to dismiss than it looks. Natural language isn't just a more convenient input method; it's a fundamentally different relationship between user and system. When users can describe what they want instead of navigating what a designer anticipated, the interface stops being a constraint and starts being a conversation.
What This Means
The through-line across every session was a renegotiation of the relationship between the engineer, the user, and the machine. As someone leading the work on the AI platform developer experience at Intuit, these aren't abstract observations; they're the design and engineering challenges landing on our roadmap right now.
Engineers have started to become fleet managers and intent directors. Users are being handed the controls. The machine is taking over execution.
That shift creates three new imperatives:
Govern before you ship. The governance frameworks and attack taxonomies presented across the day are a checklist for what any agentic system touching sensitive data needs to have in place before it reaches users. In financial services, retrofitting governance after launch is not an option.
Evaluate what you build. The full work loop, tasks, environments, traces, outputs, verifiers, is a spec, not a suggestion. Every team building agents needs a benchmark alongside their dashboard, and before they ship to production.
Design for agency and transparency. TYA's product-market fit signal is a direct challenge to every AI product team: your users don't just want a better algorithm. They want to understand it, and increasingly, to talk back to it. That means designing not just for the happy path, but for legibility, making the system's reasoning visible enough that users can meaningfully correct it. The question isn't whether to design for that; it's how far you're willing to open the box.
The engineers who define the next decade won't be the fastest coders. They'll be the ones who understand how to govern, evaluate, and direct intelligent systems; and, perhaps most importantly, how to design and build the experiences that let users do the same.
What bottleneck is your team hitting first?
Top comments (0)