
Alex Lipinski

Posted on • Originally published at keymarkinc.com

The Document Data Crisis

AI agents become more reliable when unstructured data is properly managed, from capture through formatting for AI analysis and RAG.

Enterprise AI demos show a best-case scenario: a system with the advantage of a clean, up-to-date, and properly formatted document data pipeline. That pipeline is seldom the reality. Instead, AI models often operate on incomplete or inaccessible data, and the result is frequent failure. Intelligent Document Processing (IDP) creates structure and format from unstructured data.

Takeaways

  • Enterprises can’t rely on agents they don’t trust, and AI agents can’t act on what they can’t retrieve and parse.
  • Crisis ensues when unstructured document data becomes dark data: stored and retained but not analyzed, or hidden/lost from view.
  • Less than 1% of enterprise unstructured data is suitable for AI consumption.
  • AI reliability is a data problem before it’s ever a model problem.

When a lack of context equals a drop in AI reliability

If you’ve used a generic generative AI platform like ChatGPT or Gemini for personal use, then you are almost certainly familiar with what AI failure looks like, and with the mistrust that failure can create. Three extremely common red flags of AI failure include:

  • An answer that is plausible but inaccurate.
  • A confident recommendation that conflicts with known documentation.
  • Different answers for the same queries over time.

This behavior is frustrating in a personal setting, but for businesses, particularly those in highly regulated industries, it’s dangerous, especially when an AI system fails quietly and unnoticed instead of raising an exception. These mistakes don’t usually stem from poor models. Rather, poor data quality and limited context lead to misinformed decisions on unstructured data, which can account for over 80% of enterprise content. They occur when businesses run AI programs as if all content were readily accessible for AI consumption, regardless of format.

What is unstructured data, and what is the Document Data Crisis?

Unstructured data is data without a defined format or schema, usually text-heavy documents or rich media.

We’re coining the term Document Data Crisis for what happens when an enterprise’s most important operational data exists as unstructured data in documents AI cannot analyze, or as data that has already been lifted from documents but never properly formatted.

Dark data is a frequent, but not the sole, contributor to the crisis. Dark data is any data that is collected and stored somewhere in the enterprise but goes unused or underutilized in analysis and decision-making. Data goes dark at a variety of stages of document management:

  • A dilapidated ECM or shared drive that’s become unsearchable.
  • Documents stored without reliable metadata, classification, or tagging.
  • Scanned documents that stay as images rather than machine-readable content.
  • Attachments to files severed from the source, or sources with poor version control.

In a recent post on dark data we noted that the world now produces nearly 400 ZB of data annually, of which an estimated 250 ZB is enterprise-owned unstructured data. Over half of that unstructured data goes unused in analysis. A recent IDC report provides a different, equally troubling angle: less than 1% of enterprise unstructured data is even suitable for AI consumption to begin with [1].

So the Document Data Crisis doesn’t exist because organizations lack data to analyze. In fact, in that same post, we note that 47% of enterprise unstructured data already resides in a repository. Captured or not, it remains unready for AI use.

We’ve harped on data readiness ever since enterprise AI catapulted to the top of everyone’s priority list. In those posts, we were vocal about the importance of document versioning, cleaning and optimizing your ECM, and capturing all the dark data you didn’t realize you had. But while an ECM is a fantastic solution for document storage and automation, it alone isn’t where content transforms into usable data for database analytics or AI analysis.

Data readiness is still a major hurdle

A silly comic circulating around our office offers relevant commentary on the state of AI and the importance of data readiness, even several years into the ongoing hype.

[Comic: CEOs rally to adopt AI without clear direction]

The frenzy to adopt AI continues. Shareholders want to see profit, the executive level wants to deliver it, and the engineers, analysts, and IT teams responsible for vetting, recommending, and purchasing systems feel that pressure from upstream. Models are then deployed into the reality of 1% unstructured data readiness, and their missing contextual awareness leads to poor decisions. Those poor decisions breed mistrust among users and weak buy-in, not to mention the extreme costs and risk such mistakes can incur.

And all because data readiness doesn’t stop at enterprise-wide capture.

In their 2026 Market Trends report, Deep Analysis writes that “the GenAI surge has exposed the ‘deplorable state’ of enterprise unstructured data collections.” [2]


Just because you’ve captured the data doesn’t mean it’s immediately usable for AI consumption. Structuring content post-capture, for AI consumption among other use cases, is what deters crisis, failure, and hallucination.

Curing the state of enterprise unstructured data collections

At its core, intelligent document processing (IDP) is data capture technology that bridges the gap between traditional optical character recognition (OCR) and AI through a combination of machine learning, natural language processing, human-in-the-loop review, and traditional techniques.

IDP already does much to ingest, scan, lift, label, and route data that would otherwise be trapped in the worst-case offender document types: those that are highly variable and lack predictable structure. But as described, it isn’t just about capturing data; it’s about producing a structure that an AI model can read, understand, and ultimately use to inform decisions.

On the back end, an IDP system goes beyond intelligent capture by producing structured outputs in formats preferred by AI systems, including JSON, CSV, XML, and other highly structured formats designed for ingestion into data lakes, warehouses, and data pipelines.
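To make this concrete, here is a minimal sketch of what a post-capture structured output might look like. The field names, values, and confidence scores are illustrative assumptions, not any specific vendor's schema; the point is that extracted content becomes a machine-readable record ready for a data lake or warehouse.

```python
import json

# Hypothetical fields lifted from a scanned invoice by an IDP pipeline.
# All names and values here are illustrative, not a real vendor schema.
extracted = {
    "document_type": "invoice",
    "invoice_number": "INV-1042",
    "invoice_date": "2025-03-14",
    "total_amount": {"value": 1250.00, "currency": "USD"},
    "vendor_name": "Acme Supply Co.",
    # Per-field confidence scores, as many IDP systems report them.
    "field_confidences": {
        "invoice_number": 0.99,
        "invoice_date": 0.97,
        "total_amount": 0.94,
    },
}

# Serialize to JSON for ingestion into a data lake, warehouse, or pipeline.
payload = json.dumps(extracted, indent=2)
print(payload)
```

The same record could just as easily be emitted as a CSV row or an XML element; what matters is that downstream systems receive a defined schema instead of a page image.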

For CEOs, data engineers, IT teams, and other stakeholders pursuing agentic solutions, particularly in document-heavy industries, IDP as a foundation for corporate GenAI is not optional. Front-end AI systems depend on the back-end data readiness provided by IDP to support analytics and Retrieval-Augmented Generation (RAG), in which systems retrieve, summarize, and contextualize information from the data lake. In other words, IDP is table stakes for success.
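The retrieve-then-augment flow behind RAG can be sketched in a few lines. This toy version scores documents by simple term overlap; production systems use embeddings and a vector store, but the shape of the flow, retrieve relevant records and prepend them to the model prompt as context, is the same. All document contents and IDs below are invented for illustration.

```python
# Toy RAG retrieval: rank structured document records against a query by
# term overlap, then assemble the top matches as prompt context.
def retrieve(query, documents, k=2):
    q_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(q_terms & set(doc["text"].lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that actually matched something.
    return [doc for score, doc in scored[:k] if score > 0]

docs = [
    {"id": "policy-7", "text": "refund policy allows returns within 30 days"},
    {"id": "invoice-1", "text": "invoice total due in 60 days"},
    {"id": "memo-3", "text": "office closed on public holidays"},
]

hits = retrieve("what is the refund policy", docs)
context = "\n".join(d["text"] for d in hits)
# `context` would be prepended to the model prompt, grounding the answer
# in retrieved enterprise content rather than the model's own memory.
```

If the underlying documents were never captured, or were captured as unreadable images, `retrieve` returns nothing and the model falls back on guesswork, which is exactly the failure mode described above.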

Averting crisis

Lack of context equals a drop in quality. If a model can’t retrieve a document because it’s out of reach (never captured in the first place), or can’t read it because it’s in the wrong format, then your agents lack the details they need to make decisions, and misinformation follows.

Intelligent capture is the tip of the iceberg. It is supported by extraction, classification, and validation, with human-in-the-loop oversight that pushes low-confidence results toward full reliability, which is essential in regulated industries.
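One common way to wire in that human-in-the-loop oversight is confidence-based routing: extractions whose scores fall below a threshold are sent to a reviewer instead of flowing straight into downstream systems. The threshold and field names below are illustrative assumptions.

```python
# Sketch of confidence-based routing for human-in-the-loop review.
# The 0.95 threshold is an illustrative assumption; regulated workflows
# tune this per field and per document type.
REVIEW_THRESHOLD = 0.95

def route(extraction):
    """Return ('auto', []) if every field clears the threshold,
    else ('human_review', [fields that need checking])."""
    low = [field for field, conf in extraction["field_confidences"].items()
           if conf < REVIEW_THRESHOLD]
    return ("human_review", low) if low else ("auto", [])

doc = {
    "invoice_number": "INV-1042",
    "field_confidences": {"invoice_number": 0.99, "total_amount": 0.88},
}
decision, flagged = route(doc)
# total_amount (0.88) falls below the threshold, so the document is
# routed to a reviewer rather than loaded automatically.
```

Once a reviewer confirms or corrects the flagged fields, the record can be promoted to the auto path with verified values, which is how the overall output stream approaches full reliability.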

Choose outputs for database loading that feed AI systems the information they need to analyze, summarize, and contextualize the data in your content ecosystem, and you avert the crisis.

Sources:

  1. https://www.box.com/resources/unstructured-data-paper

  2. https://info.aiim.org/market-momentum-index-idp-survey-2025


The post "The Document Data Crisis" was originally published on https://www.keymarkinc.com/the-data-context-crisis/
