Harshdeep Singh

Posted on Jun 11 • Originally published at theharshdeepsingh.com

The 95% Problem: Why Enterprise AI Keeps Failing — and What the 5% Get Right

#enterpriseai #aistrategy #datagovernance #aiagents

Ninety-five out of every hundred enterprise AI pilots produce nothing a CFO would sign off on. The reflex is to blame the model — too dumb, too small, the wrong vendor. It almost never is. The thing quietly killing enterprise AI is older and more boring than any model: data nobody organized for machines, and rules nobody ever wrote down. The strangest part of the story is who is losing the fight hardest — the firms whose entire business is selling everyone else the cure.

The most expensive irony in enterprise software

In late 2025, Deloitte gave part of a government cheque back. The firm had delivered a report to Australia's Department of Employment and Workplace Relations, and reviewers found something awkward buried inside it: citations to academic papers that did not exist, and a fabricated reference to a federal court judgment. The work had been produced with help from generative AI, and no one had checked it before it went out the door. Deloitte agreed to refund part of its fee.

It is tempting to read that as a story about a hallucinating chatbot. It is not. A capable model can cite a real paper; the failure was not that the AI was too weak. The failure was that nothing in the process forced a human to verify machine output before it reached a client. There was no standard operating procedure, no checkpoint, no rule with teeth. That distinction — between a model problem and a data-and-governance problem — is the entire subject of this essay, and the firms that sell AI for a living have just handed us the clearest possible illustration of it.

Consider the position those firms are in. Since 2023, the Big Four and the major strategy houses have collectively poured more than ten billion dollars into AI. It is their flagship pitch. Accenture reported close to six billion dollars in generative-AI bookings in a single fiscal year. PwC became one of OpenAI's largest enterprise customers and then its reseller. KPMG signed a two-billion-dollar alliance with Microsoft. These organizations have built their modern brand on the promise that they can walk into any enterprise and fix its AI problem. And yet, internally, they have hit precisely the wall they are paid to dismantle.

This is not schadenfreude about one embarrassing report. It is the most useful data point in the entire enterprise-AI conversation, because it removes the easy excuses. You cannot say the consultants lacked talent, budget, model access, or executive buy-in. They had all of it in abundance. If the people who sell the cure can still catch the disease, then the disease is not what most companies think it is. It is not a shortage of intelligence in the model. It is a shortage of order in the data and discipline in the governance — and almost nobody is immune.

The number that belongs on every board agenda

Start with the figure that has been ricocheting around boardrooms since it landed. In its 2025 report The GenAI Divide: State of AI in Business, MIT's NANDA initiative studied hundreds of enterprise AI initiatives and concluded that roughly ninety-five percent of them had produced no measurable impact on the bottom line. Not weak returns. No measurable returns at all. The spending in scope ran to tens of billions of dollars, and the overwhelming majority of it bought experiments that never crossed into anything a finance team could defend.

95% of enterprise generative-AI pilots deliver no measurable business impact — the spending lands, the value does not.

It is not an isolated finding. Gartner predicted that by the end of 2025, at least three in ten generative-AI projects would be abandoned after the proof-of-concept stage, and that through 2026, organizations will abandon sixty percent of the AI projects that are not supported by AI-ready data. The firm goes further on agents: it forecasts that more than forty percent of agentic-AI projects will be cancelled by the end of 2027. The RAND Corporation has put the historical AI project failure rate above eighty percent, roughly twice the rate of conventional IT projects. And S&P Global found that the share of companies abandoning most of their AI initiatives jumped to forty-two percent in 2025, up from just seventeen percent a year earlier. The trend is not improving as the technology matures. It is getting worse as spending outruns readiness.

The crucial detail is where these projects die. They almost never fail in the lab. They fail on the road to production. A pilot runs on a curated slice of data — a clean schema, a controlled volume, a problem chosen because it demos well. Production runs on the actual enterprise: the duplicated records, the contradictory definitions, the fields that mean different things in different systems, the knowledge trapped in formats no machine can read. The distance between the demo and the deployment is the distance between curated data and real data, and that distance is where the money disappears. People in the field have a name for the place projects go to expire: pilot purgatory.

The people closest to the data already know this. In Informatica's 2025 survey of chief data officers, the most-cited obstacle to AI success was not talent, not budget, not model quality — it was data quality and readiness. The executives responsible for the foundation are telling everyone the foundation is the problem. Most strategies are simply not listening, because listening would mean slowing down to do the tedious work, and the market is rewarding speed.

And the window in which to fix this is closing faster than the failure rate alone suggests, because the industry is sprinting from chatbots to agents. Gartner expects that by the end of 2026, four in ten enterprise software applications will include task-specific AI agents, up from less than one in twenty in 2025. Agents raise the stakes of the underlying problem by an order of magnitude. A chatbot that retrieves bad data returns a bad answer a human can still catch. An agent that acts on bad data — reconciling an account, approving a request, triggering a downstream workflow — propagates the error into the real world before anyone reviews it. The same analysts forecasting the agentic wave also forecast that more than forty percent of agentic projects will be cancelled by the end of 2027, for the same unglamorous reasons the chatbots failed. We are, in other words, about to point far more autonomous systems at foundations that were already too weak for the last generation of tools.

That is what makes the failure rate a strategic problem rather than a technical footnote. The cost is not the wasted pilot budget; that is the cheap part. The real cost is competitive. Every quarter a rival reaches production while you re-run experiments that were always going to fail for the same reason, the rival's system gets better, its data gets cleaner, its people get more fluent, and the gap compounds. You are not standing still. You are losing ground while looking busy.

It was never the model

The comforting story inside most failed AI programs is that the technology was not ready, and that the next model — bigger, newer, from a different lab — will be the one that finally works. It is comforting because it requires nothing of the organization except patience and a bigger invoice. It is also wrong.

Here is the inconvenient test. The model that hallucinated its way through your failed pilot is, in most cases, the same model that performed flawlessly in the vendor's demo. Nothing about the weights changed between those two moments. What changed was everything around the model: the quality of the data it was fed, the clarity of the instructions it was given, and the rules governing what it was allowed to touch. The model was never the variable. The environment was.

The model in your failed pilot and the model in the vendor's flawless demo are usually the same model. The difference between them is everything you built — or failed to build — around it.

This is also why "wait for the next model" is such a seductive and expensive trap. Each new model is genuinely more capable than the last, which makes it easy to believe the next one will finally clear the bar. But a more capable model pointed at the same unstructured data and the same absent rules does not fix the problem — it executes the same mistakes more fluently, and, increasingly, more autonomously. Capability without a foundation is not progress. It is leverage applied to a fault line.

That environment has two load-bearing pieces, and almost every enterprise is missing both. The first is data that a machine can actually reason over. The second is governance a machine can actually obey. The original instinct that AI needs "organized data and some kind of SOP" is exactly right — it just turns out that each half is a deep discipline in its own right, and that naming them separately is the difference between a strategy that works and a slide that sounds good. Take them one at a time.

Gap one: data that was never built for machines

An AI agent does not think the way a database is organized. It does not navigate neat rows and columns; it reasons over entities, the relationships between them, and the context that gives them meaning. It needs to know that this customer is the same as that account, that "revenue" in the finance system and "revenue" in the sales dashboard are or are not the same number, that this contract supersedes that one, that this policy applies to this region. Enterprise data, as it actually exists, is almost the precise opposite of that.

In most companies the data is siloed across systems that were never designed to talk to each other, duplicated in ways no one fully maps, and defined inconsistently enough that the same word can name genuinely different things in different systems. Worse, the knowledge that actually matters — the reasoning, the precedent, the hard-won judgment — tends to live in formats machines cannot read: slide decks, PDFs, email threads, and the heads of senior people who are about to retire. You can connect the cleanest model in the world to that, and it will faithfully reflect the chaos back to you.

The most instructive proof of this comes, again, from a consultancy. When McKinsey built its internal AI platform, the firm discovered that the tool could not initially parse PowerPoint — which was a problem, because PowerPoint is where most of McKinsey's institutional knowledge actually lived. Sit with that for a moment. One of the most knowledge-intensive organizations on earth, a firm whose entire product is structured thinking, found that its crown-jewel intellectual property was effectively illegible to a machine until it did real work to fix the ingestion. If McKinsey's knowledge was trapped in slides, it is worth asking, honestly, what shape yours is in.

The failure rarely announces itself as missing data. It hides in data that is present but means subtly different things in different places. Ask an agent a question as ordinary as how many active customers the business has, and it will find a dozen tables with a dozen definitions of "active" — a login within the last thirty days in one system, a non-zero balance in another, an uncancelled contract in a third. A human analyst resolves that ambiguity with context and a quick message to a colleague. An agent, lacking both, picks one definition silently and reports a confident number that is wrong in a way nobody can see. Multiply that across every entity and every metric a company cares about, and you have the real texture of the problem: not an empty warehouse, but a full one with no shared language.
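The ambiguity is easy to reproduce in a few lines. The sketch below is illustrative only: the three record sets and their field meanings (last login date, balance, cancelled flag) are hypothetical stand-ins for three separate systems, and the snapshot date is arbitrary. It shows the same question, "how many active customers?", returning three different answers.

```python
from datetime import date, timedelta

# Hypothetical snapshot date and records; every name here is illustrative.
TODAY = date(2025, 6, 1)

crm_logins = {"c1": date(2025, 5, 20), "c2": date(2025, 1, 3), "c3": date(2025, 5, 30)}
billing_balances = {"c1": 0.0, "c2": 125.50, "c3": 40.0, "c4": 10.0}
contracts_cancelled = {"c1": False, "c2": False, "c3": True, "c4": False, "c5": False}


def active_by_login(logins, today, window_days=30):
    """'Active' means a login within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return {cid for cid, last in logins.items() if last >= cutoff}


def active_by_balance(balances):
    """'Active' means a non-zero balance."""
    return {cid for cid, bal in balances.items() if bal != 0}


def active_by_contract(cancelled):
    """'Active' means the contract has not been cancelled."""
    return {cid for cid, is_cancelled in cancelled.items() if not is_cancelled}


counts = {
    "login_30d": len(active_by_login(crm_logins, TODAY)),
    "nonzero_balance": len(active_by_balance(billing_balances)),
    "uncancelled_contract": len(active_by_contract(contracts_cancelled)),
}
print(counts)  # {'login_30d': 2, 'nonzero_balance': 3, 'uncancelled_contract': 4}
```

An agent that silently picks any one of these functions will report 2, 3, or 4 with equal confidence, and nothing in the output reveals which definition it chose.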

You cannot retrieve your way out of a data swamp

The popular hope is that retrieval-augmented generation — pointing an agent at your documents and letting it fetch what it needs — will paper over the mess. It will not. An agent retrieving from a swamp returns swamp, dressed up in fluent prose that makes the swamp harder to detect. And the instinct to fix this by building a bigger data lake usually just produces a bigger swamp with better storage economics. Volume was never the problem. Meaning was.

What actually closes the gap is a layer most enterprises have never built: a semantic, machine-readable map of what the data means. In practice this goes by several names that point at the same idea — a semantic layer, an ontology, a knowledge graph, a governed data catalog. The common thread is that core business concepts get defined once, consistently, in a form an agent can consume: what a customer is, what counts as revenue, how entities relate, which rules and constraints apply. The catalog becomes the control plane of truth, and the semantic layer becomes the thing that lets a model answer in terms of your business rather than in terms of raw, ambiguous tables.
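As a rough illustration of "defined once, in a form an agent can consume," here is a minimal sketch of a semantic-layer registry. The class names `MetricDefinition` and `SemanticLayer`, the fields, and the sample rule are all assumptions made for this example; real semantic layers and catalogs are far richer. The only point is that a concept has one governed definition, with an owner and a version, and that an attempt to fork it fails loudly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str                 # meaning, readable by people and agents
    owner: str                       # team accountable for the definition
    version: int
    rule: Callable[[dict], bool]     # row-level test for this concept


class SemanticLayer:
    """Hypothetical registry: each business concept is defined exactly once."""

    def __init__(self) -> None:
        self._metrics: Dict[str, MetricDefinition] = {}

    def define(self, metric: MetricDefinition) -> None:
        if metric.name in self._metrics:
            raise ValueError(f"'{metric.name}' is already defined; update it, do not fork it")
        self._metrics[metric.name] = metric

    def resolve(self, name: str) -> MetricDefinition:
        if name not in self._metrics:
            raise KeyError(f"no governed definition for '{name}'")
        return self._metrics[name]

    def count(self, name: str, rows: List[dict]) -> int:
        """Count rows that satisfy the single governed rule for `name`."""
        rule = self.resolve(name).rule
        return sum(1 for row in rows if rule(row))


layer = SemanticLayer()
layer.define(MetricDefinition(
    name="active_customer",
    description="Uncancelled contract and a login within the last 30 days",
    owner="finance-data",
    version=1,
    rule=lambda row: not row["cancelled"] and row["days_since_login"] <= 30,
))

rows = [
    {"id": "c1", "cancelled": False, "days_since_login": 5},
    {"id": "c2", "cancelled": False, "days_since_login": 90},
    {"id": "c3", "cancelled": True, "days_since_login": 2},
]
active_count = layer.count("active_customer", rows)
print(active_count)  # 1
```

Every agent that asks for "active_customer" now gets the same rule, and the definition carries the owner and version needed to explain the number later.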

The organizing principle behind all of this is treating data as a product rather than as exhaust. Exhaust is whatever a system happens to emit, owned by no one, documented nowhere. A product has an owner, a contract, documentation, versioning, and a consumer whose needs shape it. The research bears out how much this matters: organizations that treat data as a product — with curated models and shared vocabularies — are dramatically more likely to scale generative AI successfully than those that do not. When the foundation is built this way, retrieval techniques like graph-based RAG can ground every answer in verified, connected data, enforce access controls at the moment of the query, and trace each response back to the exact source it came from. That is the difference between an agent that confidently invents a court case and one that shows its work.
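One way to picture "a product has an owner, a contract, documentation, versioning" is a contract object that records can be checked against. This is a minimal sketch under stated assumptions: `DataContract`, its fields, and the sample customer schema are invented for illustration and do not correspond to any particular catalog or data-contract tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract for a data product; all field names are illustrative."""
    product: str
    owner: str
    version: str
    schema: Dict[str, type]   # required field name -> expected Python type
    description: str = ""

    def violations(self, record: dict) -> List[str]:
        """Return human-readable contract violations for one record."""
        problems = []
        for name, expected in self.schema.items():
            if name not in record:
                problems.append(f"missing field '{name}'")
            elif not isinstance(record[name], expected):
                problems.append(
                    f"field '{name}' should be {expected.__name__}, "
                    f"got {type(record[name]).__name__}"
                )
        return problems


customer_contract = DataContract(
    product="customers",
    owner="crm-team",
    version="1.2.0",
    schema={"customer_id": str, "region": str, "annual_revenue": float},
    description="One row per legal customer entity",
)

good = {"customer_id": "c1", "region": "EMEA", "annual_revenue": 1200.0}
bad = {"customer_id": "c2", "annual_revenue": "1200"}

print(customer_contract.violations(good))  # []
print(customer_contract.violations(bad))
```

Exhaust has no equivalent of `violations`: nobody owns the shape of the data, so nothing can say when it has broken.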

A knowledge graph earns its keep precisely here. Instead of flat, disconnected tables, it stores the business as a web of entities and the relationships between them: this customer belongs to this account, which is covered by this contract, governed by this policy, owned by this team. An agent reasoning over that structure can follow the connections the way a knowledgeable employee would, and every answer it produces carries its lineage — which source, which definition, which version. That is also what makes governance enforceable at the level of meaning rather than the level of the raw row, because the graph knows what a thing is and who is permitted to see it.
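The idea can be sketched with a tiny in-memory graph. Everything named here (the `KnowledgeGraph` class, the entity identifiers, the roles, the source systems) is hypothetical, and a production graph store would look very different. The sketch only illustrates two properties from the paragraph above: each fact carries its source, and traversal refuses to walk through entities the caller's role may not see.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    source: str  # lineage: the system this fact came from


class KnowledgeGraph:
    """Tiny in-memory graph; entity names, roles, and sources are hypothetical."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Edge]] = {}
        self._visible_to: Dict[str, Set[str]] = {}  # entity -> roles allowed to see it

    def add_entity(self, name: str, visible_to: Set[str]) -> None:
        self._visible_to[name] = set(visible_to)
        self._edges.setdefault(name, [])

    def add_fact(self, subject: str, relation: str, obj: str, source: str) -> None:
        self._edges.setdefault(subject, []).append(Edge(subject, relation, obj, source))

    def path(self, start: str, goal: str, role: str) -> Optional[List[Edge]]:
        """Breadth-first search that only visits entities the role may see."""
        if role not in self._visible_to.get(start, set()):
            return None
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, trail = queue.popleft()
            if node == goal:
                return trail
            for edge in self._edges.get(node, []):
                allowed = role in self._visible_to.get(edge.obj, set())
                if allowed and edge.obj not in seen:
                    seen.add(edge.obj)
                    queue.append((edge.obj, trail + [edge]))
        return None


kg = KnowledgeGraph()
kg.add_entity("customer:acme", {"support", "finance"})
kg.add_entity("account:A-17", {"support", "finance"})
kg.add_entity("contract:2024-09", {"finance"})
kg.add_entity("policy:emea-data", {"finance"})
kg.add_fact("customer:acme", "belongs_to", "account:A-17", source="crm")
kg.add_fact("account:A-17", "covered_by", "contract:2024-09", source="contracts-db")
kg.add_fact("contract:2024-09", "governed_by", "policy:emea-data", source="policy-register")

finance_path = kg.path("customer:acme", "policy:emea-data", role="finance")
support_path = kg.path("customer:acme", "policy:emea-data", role="support")
print([(e.relation, e.source) for e in finance_path])
print(support_path)  # None: the support role cannot see the contract
```

The finance role gets an answer together with the chain of sources behind it; the support role gets nothing, because the permission check lives on the entity rather than on a raw table row.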

The strategic point hiding in the plumbing

Strip away the vocabulary and the strategic truth is simple: AI readiness is mostly data maturity wearing a more exciting outfit. You cannot purchase your way past it with a model subscription, because the thing you are missing is not compute or intelligence — it is the slow, structural work of making your own knowledge legible. That work is unglamorous, expensive, and invisible in a board deck, which is exactly why most organizations skip it and exactly why most organizations end up in the ninety-five percent. The semantic foundation is not overhead on the way to the real prize. It is the prize. It is the part competitors cannot copy by signing the same vendor contract you did.

Gap two: governance that was never written down

[Image: the blue glass facade of a modern office building, its windows forming a strict geometric grid.]

Order, structure, repetition: the corporate grid. Governance is what gives an agent the same scaffolding a new employee takes for granted. Photo: Fabian Kleiser / Unsplash.

If you ask most enterprises where their AI governance lives, the honest answer is a PDF on a shared drive — a well-intentioned document of principles that almost no one has read and that no system enforces. A PDF nobody reads is not a policy an agent can obey. It is a statement of hope. And hope does not survive contact with an autonomous system acting at machine speed across systems it was never explicitly cleared to touch.

Governance for AI, and especially for agents, has to be machine-actionable to mean anything. The "SOP" intuition is the right one, but it resolves into two concrete questions that a slide of principles never answers: what is this agent actually allowed to touch and do, and what is the operating rhythm that keeps it honest over time? Get specific about both, and governance stops being a compliance ornament and starts being the thing that lets you deploy without holding your breath.

An agent is a new employee with root access and no onboarding

It helps to think of an agent as exactly what it is becoming: a digital worker. The trouble is that we wrap human workers in decades of accumulated controls and give agents almost none of them. A new human employee receives an identity, a defined role, least-privilege access to only the systems their job requires, a manager who reviews their work, and an audit trail that records what they did. An agent, in too many deployments, gets a single shared API key with broad standing credentials, the ability to call tools far outside its actual task, and no logging worth the name. It is the most over-permissioned new hire in the building, and no one interviewed it.

The discipline that fixes this is well understood, even if it is rarely applied. Security researchers call the core idea least-agency, or least-privilege: an agent should receive the minimum autonomy required for its specific task and nothing more. A customer-support agent does not need write access to the billing database. A research agent does not need the ability to send external email. From there it cascades into concrete controls: whitelisting the specific tools an agent may use, issuing short-lived credentials instead of permanent keys, sandboxing execution, restricting where an agent can send data, and — critically — keeping a human in the loop for actions that are irreversible or sensitive. A mature deployment will refuse to let an agent move money without clearing a confidence threshold or obtaining a second approval, and will strip personally identifiable information before it ever reaches a model, restoring it only on the way back. None of that lives in a principles document. All of it lives in enforced policy. That, and not the PDF, is the real standard operating procedure: rules expressed as controls a system cannot route around.
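A few of these controls can be expressed as a single gate that every tool call must pass through. The sketch below is a simplified illustration, not a reference implementation: `AgentPolicy`, `call_tool`, the tool names, and the fifteen-minute credential lifetime are all assumptions chosen for the example. It covers three of the controls named above: an explicit tool allowlist, a short-lived credential, and mandatory human approval for irreversible actions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional


class PolicyViolation(Exception):
    """Raised when an agent attempts something its policy does not allow."""


@dataclass(frozen=True)
class AgentPolicy:
    agent_id: str
    allowed_tools: FrozenSet[str]        # explicit allowlist, nothing implied
    irreversible_tools: FrozenSet[str]   # these always need a human approval
    credential_expires_at: float         # short-lived credential, epoch seconds


def call_tool(policy: AgentPolicy, tool_name: str, tools: Dict[str, Callable], *,
              approved_by: Optional[str] = None, now: Optional[float] = None, **kwargs):
    """Run a tool only if every check in the policy passes."""
    current = time.time() if now is None else now
    if current >= policy.credential_expires_at:
        raise PolicyViolation("credential expired; request a fresh one")
    if tool_name not in policy.allowed_tools:
        raise PolicyViolation(f"tool '{tool_name}' is not on the allowlist")
    if tool_name in policy.irreversible_tools and not approved_by:
        raise PolicyViolation(f"tool '{tool_name}' requires human approval")
    return tools[tool_name](**kwargs)


def try_call(policy, tool_name, tools, **kwargs):
    """Convenience wrapper: return (True, result) or (False, reason)."""
    try:
        return True, call_tool(policy, tool_name, tools, **kwargs)
    except PolicyViolation as exc:
        return False, str(exc)


# Hypothetical tools for a customer-support agent.
tools = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "issue_refund": lambda order_id, amount: f"refunded {amount} on {order_id}",
    "update_billing": lambda customer_id: "billing updated",
}

support_policy = AgentPolicy(
    agent_id="support-agent-1",
    allowed_tools=frozenset({"lookup_order", "issue_refund"}),
    irreversible_tools=frozenset({"issue_refund"}),
    credential_expires_at=1_000.0 + 900,  # issued at t=1000, valid for 15 minutes
)

result = call_tool(support_policy, "lookup_order", tools, now=1_000.0, order_id="o-1")
print(result)
print(try_call(support_policy, "update_billing", tools, now=1_000.0, customer_id="c1"))
print(try_call(support_policy, "issue_refund", tools, now=1_000.0, order_id="o-1", amount=20))
```

The support agent can look up an order, cannot touch billing at all, and cannot issue a refund until a named human is attached to the call. The refusal is raised by the gate itself, so it does not depend on the model choosing to behave.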

The danger is not hypothetical, and it does not require malice. Picture an agent handed broad database credentials so it could "be helpful," then asked to tidy up some duplicate records. With no constraint on its scope and no human checkpoint, a single ambiguous instruction becomes a destructive write across production data in seconds — faster than any person could intervene, and recorded nowhere anyone thought to look. The same autonomy that makes agents useful is what makes their mistakes fast and quiet. Standing credentials, missing audit trails, and unrestricted tool access are not exotic edge cases; they are the default state of most early agent deployments, and they are exactly how a promising program turns into a board-level incident.
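The duplicate-cleanup scenario suggests one simple guard: count what a write would touch before performing it, and refuse anything beyond a small scope unless a human has approved it. The following is a minimal sketch using Python's built-in `sqlite3` module. The function `guarded_delete`, the `max_rows` limit, and the sample `customers` table are assumptions for illustration; the table name and predicate are assumed to come from reviewed code, never from free-form model output.

```python
import sqlite3


class WriteBlocked(Exception):
    """Raised when a write exceeds its scope and has no human approval."""


def guarded_delete(conn, table, where_sql, params=(), *, max_rows=10, approved_by=None):
    """Delete rows only after a dry-run count fits the agent's allowed scope.

    `table` and `where_sql` are assumed to be trusted, reviewed strings;
    only `params` may carry untrusted values.
    """
    count_sql = f"SELECT COUNT(*) FROM {table} WHERE {where_sql}"
    (affected,) = conn.execute(count_sql, params).fetchone()
    if affected > max_rows and approved_by is None:
        raise WriteBlocked(
            f"would delete {affected} rows from {table}; limit is {max_rows} without approval"
        )
    with conn:  # commits on success, rolls back on error
        conn.execute(f"DELETE FROM {table} WHERE {where_sql}", params)
    return affected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("a@example.com",)] * 3 + [(f"user{i}@example.com",) for i in range(50)],
)
conn.commit()

# An ambiguous "tidy up duplicates" instruction turned into an over-broad predicate.
try:
    guarded_delete(conn, "customers", "email LIKE ?", ("%@example.com",))
    blocked = False
except WriteBlocked:
    blocked = True

# The narrowly scoped version: keep the lowest id for each duplicated email.
removed = guarded_delete(
    conn,
    "customers",
    "id NOT IN (SELECT MIN(id) FROM customers GROUP BY email)",
)
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(blocked, removed, remaining)  # True 2 51
```

The over-broad predicate would have removed all 53 rows and is stopped before anything is written; the scoped one removes the two true duplicates and nothing else.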

A policy an agent cannot read is decoration. Governance that scales is policy an agent is structurally unable to disobey.

You do not have to invent the rulebook

The encouraging part is that the scaffolding for all of this already exists, written by people who have thought about it harder than any individual team has time to. The U.S. National Institute of Standards and Technology publishes the AI Risk Management Framework, along with a dedicated profile for generative AI, and its emphasis maps almost directly onto agent controls: role-based access, continuous monitoring, adversarial testing, and lifecycle logging for traceability. The international ISO/IEC 42001 standard formalizes the idea of an AI management system, with oversight and continual improvement built in. The OWASP GenAI Security Project maintains a Top 10 for large-language-model applications and a newer Top 10 for agentic applications, cataloguing the exact failure classes teams keep rediscovering the hard way: prompt injection, tool misuse, memory poisoning. And the external pressure is rising fast, from the EU AI Act to a wave of national AI laws, which means governance is shifting from a nice-to-have to a condition of doing business.

Underneath the frameworks sits a single overlooked capability that decides whether any of them are real: observability. If you cannot see what an agent did — which data it touched, which tools it called, which decision it made and why — then you cannot govern it, debug it, or defend it to a regulator, and you certainly cannot trust it with anything that matters. Audit logging and traceability are not paperwork. They are the line between an agent you can put into production and a black box you can only hope behaves. Trust, in the end, is not extended to systems because they are clever. It is extended to systems because they are accountable.
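At its simplest, observability means that no tool call happens without leaving a record. The sketch below wraps a tool in a decorator that appends an entry for every call, whether it succeeds or fails. It is a toy under stated assumptions: `AUDIT_LOG` is an in-memory list standing in for an append-only log store, and the agent and tool names are hypothetical.

```python
import functools
import json
import time
from typing import Any, Callable, Dict, List

AUDIT_LOG: List[Dict[str, Any]] = []  # stand-in for an append-only log store


def audited(agent_id: str, tool_name: str) -> Callable:
    """Decorator that records every call to a tool, including failures."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry = {
                "ts": time.time(),
                "agent_id": agent_id,
                "tool": tool_name,
                "args": json.dumps({"args": args, "kwargs": kwargs}, default=str),
                "status": "started",
            }
            try:
                result = fn(*args, **kwargs)
                entry["status"] = "ok"
                return result
            except Exception as exc:
                entry["status"] = "error"
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on both paths, so a failed call is still recorded.
                AUDIT_LOG.append(entry)

        return wrapper

    return decorator


@audited(agent_id="recon-agent-1", tool_name="read_ledger")
def read_ledger(account_id: str) -> float:
    ledger = {"acct-1": 250.0}
    return ledger[account_id]  # raises KeyError for unknown accounts


balance = read_ledger("acct-1")
try:
    read_ledger("acct-404")
except KeyError:
    pass

statuses = [(e["tool"], e["status"]) for e in AUDIT_LOG]
print(statuses)  # [('read_ledger', 'ok'), ('read_ledger', 'error')]
```

A real deployment would also record which data was returned and the reasoning behind the call, but even this much answers the first question a regulator or an incident review will ask: what did the agent do, and when.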

The reframe that matters most here is that governance is not a brake on AI. It is the enabler. The organizations that actually reach production are, consistently, the ones that invested in governance frameworks before they scaled agent capabilities — not after a breach forced their hand. The absence of guardrails does not make you faster; it produces exactly the brittle, untrustworthy, occasionally catastrophic behavior that makes leadership pull the plug and sends a promising program back to purgatory. Guardrails are what let you move quickly without flinching.

The consultants' mirror

Return now to where we started, because the consulting firms are not just a cautionary anecdote. They are the most public, best-funded live experiment in internal AI adoption that exists, and watching them is the closest thing the rest of us have to a controlled trial. They are simultaneously the largest sellers of AI transformation and a room full of organizations trying to transform themselves — and they have been unusually candid about how hard it is. BCG, in its own cross-industry research, found that roughly three-quarters of companies struggle to achieve and scale value from their AI initiatives. The firms are not describing a problem they have solved from the outside. They are describing one they are living from the inside.

The stakes are visible in their own pyramids. Internal tools like McKinsey's assistant and BCG's slide-polishing system can already perform a large share of the research-and-formatting work that used to define a junior analyst's first years, and entry-level hiring across the industry has tightened as a result. That is what it looks like when this technology genuinely lands inside an organization — and it is a useful reminder that getting it right is not a productivity nicety. It restructures the firm. The flip side is that getting it wrong, in public, with a client's name on the document, restructures the firm's reputation just as quickly.

And so every major firm now has its own platform: McKinsey with its knowledge assistant and a fleet of internal agents numbering in the tens of thousands, BCG with its build unit and internal tools, Deloitte with its assistant and an agentic platform, PwC with an agent operating system spanning tens of thousands of deployed agents, EY with a platform giving tens of thousands of staff access to a growing roster of agents and a multi-year plan to scale into the hundreds of thousands, and KPMG with its own agentic workbench. Billions of dollars, real engineering, genuine ambition. But the tools are the visible ten percent. The invisible ninety percent — the part that determines whether any of it works — is the data and governance plumbing underneath. There is a useful rule of thumb circulating in this world that only a small fraction of AI value comes from the algorithms and the technology, with the overwhelming majority coming from people, process, and the organizational change required to make the technology stick. The firms that are winning internally are the ones that took that ratio seriously.

The five percent: what McKinsey's Lilli actually proves

[Image: the curved, balconied interior of a grand library, its shelves densely and neatly filled with books.]

Organized, machine-legible knowledge was the real moat, not the model. Every rival had the same models. Photo: Susan Q Yin / Unsplash.

If the failure rate has a counterexample worth studying, it is McKinsey's internal platform, Lilli. It is the case study everyone cites, and almost everyone draws the wrong lesson from it. The wrong lesson is that McKinsey succeeded because it had access to powerful models. That cannot be the explanation, because every competitor had access to the same models. The right lesson is far less flattering to the technology and far more useful to anyone trying to replicate the result: McKinsey succeeded because it did the boring work that everyone else was skipping.

Look at what the boring work actually was. The platform draws on more than forty knowledge sources and over a hundred thousand documents and interview transcripts — but the unlock was not aggregation, it was curation and tagging, the patient labor of making a century of accumulated knowledge consistent and machine-legible. The team built what is better described as an orchestration layer than a simple retrieval bot, designed to synthesize and contextualize rather than just fetch. They confronted the unglamorous reality that their best material was trapped in slides and fixed the ingestion so the machine could read it. Only then did the human side of adoption begin: a phased rollout, training that cured what the firm called "prompt anxiety" in roughly an hour, internal evangelists, and senior leaders modeling the behavior they wanted to see.

The results are the part people quote, and they are genuinely impressive: more than three-quarters of the firm's tens of thousands of employees now use the tool, heavy users return to it more than a dozen times a week, and the firm reports its people save close to a third of their research time. But the number to internalize is not the adoption rate. It is what produced it. The moat was never the model. The moat was a hundred years of knowledge made legible to machines, wrapped in the governance and the change management required to make people actually trust it and use it.

McKinsey's edge was not a smarter model — every rival had the same models. The edge was a century of knowledge made legible to machines, and the discipline to govern it.

The pattern repeats across the rest of the industry, even if less dramatically. The firms making real internal progress — across the spectrum of platforms and agents now deployed — are consistently the ones that invested in their data foundations and their governance before they tried to scale. And the lesson generalizes cleanly to any enterprise, in any sector, that wants off the wrong side of the divide. The winners are not the organizations with the best model. Everyone has the same models. The winners are the organizations that did the unglamorous data-and-governance work that everyone else found a reason to defer.

It is worth saying plainly what this does and does not mean for everyone else, because the lesson is easy to mislearn. You cannot buy McKinsey's result by buying McKinsey's tool, any more than you could acquire a rival's culture by licensing their software. What travels is not the platform; it is the method — the willingness to treat your own knowledge as an asset worth making machine-legible, and your own governance as an engineering problem worth solving before the agents arrive. That method is available to any organization in any industry. It is simply not for sale, and it cannot be rushed.

What the winners actually do

None of this resolves into a checklist, and anyone selling you one is selling the wrong thing. But the organizations on the right side of the divide do share a small number of strategic commitments, and they are worth stating plainly — not as steps to execute in order, but as the shape of a serious posture toward AI.

They sequence data before models

The single most counterintuitive move the winners make is to stop running hero pilots and fix the foundation first. That does not mean a multi-year data project before any value is delivered; it means choosing initial use cases precisely where the data is already clean and compatible, shipping those to generate real and defensible returns, and then using that credibility and momentum to fund the remediation of the messier domains. The failure mode is the opposite: picking use cases based on strategic ambition and executive enthusiasm, discovering the data underneath cannot support them, and producing an over-budget pilot that demonstrates the limits of the data rather than the capability of the AI. Match the ambition to the data maturity, not to the org chart's excitement.

They treat data as a product, not exhaust

The winners give their data owners, contracts, documentation, and a semantic layer that defines the business vocabulary once and reuses it everywhere. The catalog becomes the control plane of truth; the ontology and knowledge graph become the connective tissue that lets agents reason over entities and relationships rather than guess at ambiguous tables. This is the work that does not show up in a launch announcement and entirely determines whether the launch was real. It is also, not coincidentally, the part of the strategy a competitor cannot acquire by signing the same contracts you did.

They make governance machine-actionable

The winners do not confuse a principles document with a control. They express policy as something a system enforces: identity for every agent, least-agency access, tool whitelisting, audit trails, and a human in the loop for anything irreversible or sensitive. They adopt an established standard rather than inventing their own (the NIST framework, the ISO management-system standard, the OWASP failure catalogues), and they wrap it in an operating rhythm: regular reviews, red-teaming, and genuine change management for agents, treating a new agent with the seriousness they would bring to a new hire with broad system access. Governance, done this way, is not the thing that slows the program down. It is the thing that lets the program move without fear.

They build the context layer agents inherit

Rather than re-explaining the business inside every prompt, the winners push meaning and rules down into the data itself, so that any agent connecting to it inherits both. It is worth watching the emerging plumbing here — open protocols that standardize how agents connect to tools and to one another, sometimes described as an "HTTP for agents," are quickly becoming the connective standard for this world. But a word of caution that the protocol enthusiasm tends to skip: plumbing that lets agents reach your data faster does nothing good if the house behind the tap is a mess. Standard connectivity over a swamp just distributes the swamp at higher throughput. The connectivity is necessary; the clean, governed foundation is what makes it worth having.

They treat adoption as a change program, not a software rollout

Finally, the winners understand that buying licenses is not the same as achieving adoption. The most replicable lesson from the internal success stories is almost embarrassingly human: an hour of training to dissolve the anxiety of a blank prompt, visible evangelists, and leaders who actually use the tools they are asking their people to use. If the overwhelming majority of AI value comes from people and process rather than the algorithm, then the overwhelming majority of the effort has to go there too. Technology adoption has always been a human problem wearing a technical mask, and AI has not changed that. It has only raised the stakes.

They measure value, not motion

The organizations stuck in the ninety-five percent tend to measure activity — pilots launched, seats provisioned, models evaluated — and mistake it for progress. The winners measure outcomes, and they are ruthless about it: a use case either moves a number a finance team recognizes, or it is killed quickly, before it hardens into a permanent science project. That discipline is exactly what frees the budget and the attention to pour into foundations, where the compounding returns actually live. Counting pilots is how a program feels busy while going nowhere. Counting value is how it escapes the lab.

The real divide

It is worth being precise about what the so-called GenAI Divide actually divides. It is not a line between companies with good models and companies with bad ones. Frontier models are a commodity now; the same handful are available to everyone with a credit card. The divide is between the organizations that did the foundational work and the organizations that did not — and underneath the AI costume, that is simply a gap in data maturity and governance discipline that has existed for years and that AI has suddenly made expensive to ignore.

And it compounds. The organizations on the right side of the divide get faster every quarter, because their agents inherit ever-cleaner data and ever-tighter rules, and each success funds the next. The organizations on the wrong side re-run pilots that fail for the same reason they failed last time, mistaking a foundation problem for a model problem and waiting for a model that was never going to save them. The gap between the two groups does not stay constant. It widens.

The deepest irony of the whole story is the one we began with. The cure for the failing enterprise AI program was never a smarter model. It was the boring, expensive, unglamorous discipline that the consultants themselves had to learn the hard way, in public, with a refunded invoice as tuition: organize the data so a machine can reason over it, write the rules down in a form a machine is forced to obey, and only then let the agents loose. The companies that internalize that will not merely adopt AI. They will compound on it — quietly, structurally, and largely out of view — while everyone else is still abandoning pilots and blaming the model.

An aerial view of a road winding in tight switchbacks up a green mountain pass.

The divide compounds. The foundation you lay now decides how fast you can move later. Photo: Robert Bye / Unsplash.

Sources and further reading

MIT NANDA initiative, The GenAI Divide: State of AI in Business 2025 — the source of the widely cited finding that roughly 95% of enterprise generative-AI pilots show no measurable business impact.
Gartner — predictions on proof-of-concept abandonment, AI-ready data, and agentic-project cancellation rates.
VentureBeat — background on McKinsey's internal Lilli platform and how it was built.
Deloitte, State of AI in the Enterprise — recurring survey data on enterprise AI spend, scaling, and ROI.
NIST AI Risk Management Framework and its Generative AI Profile — role-based access, monitoring, adversarial testing, and lifecycle logging.
OWASP GenAI Security Project — the Top 10 for LLM applications and the Top 10 for agentic applications, covering prompt injection, tool misuse, and related failure classes.
ISO/IEC 42001 — the international standard for an AI management system, covering oversight and continual improvement.

Top comments (2)

Mallory Haigh • Jun 16

The failure is in the absence of any platform underneath it, not so much the model - there are no checkpoints, identity management, standards, or "definition of done". I'd call the Deloitte example almost too clean: the AI did what AI does, and the system around it did nothing. That's not a hallucination problem - it's a governance gap, and governance belongs to platforms.

"Getting it right" isn't about using a better model. The successful teams I'm seeing have built a substrate that makes agents work at scale: context, tooling, approval gates, evaluation, observability, etc. That's the practice of Platform Engineering, just specifically applied to agents!
