DEV Community: Alex Lipinski

The Document Data Crisis

Alex Lipinski — Wed, 21 Jan 2026 18:46:41 +0000

AI Agents become more reliable when unstructured data is properly managed from capture to formatting for AI analysis and RAG.

Enterprise AI demos show a best-case scenario: a system with the advantage of a clean, up-to-date, and properly formatted document data pipeline. But such a pipeline is seldom the reality. Instead, AI models often operate on incomplete or inaccessible data, which leads to frequent failure. Intelligent Document Processing (IDP) creates structure and format from unstructured data.

Takeaways

Enterprises can’t rely on agents they don’t trust, and AI agents can’t act on what they can’t retrieve and parse.
Crisis ensues when unstructured document data becomes dark data: stored and retained but not analyzed, or hidden/lost from view.
Less than 1% of enterprise unstructured data is suitable for AI consumption.
AI reliability is a data problem before it’s ever a model problem.

When a lack of context equals a drop in AI reliability

If you’ve used any generic generative AI platform like ChatGPT or Gemini for personal use then you are almost certainly familiar with what AI failure looks like, and can identify with the mistrust failure can create. Three extremely common red flags of AI failure include:

An answer that is plausible but inaccurate.
A confident recommendation that conflicts with known documentation.
Different answers for the same queries over time.

This type of behavior is frustrating in a personal setting, but for businesses — particularly those in a highly regulated industry — it’s dangerous, especially when an AI system forgoes raising an exception and fails quietly and unnoticed. But these mistakes don’t usually stem from poor models. Rather, poor data quality and limited context lead to misinformed decisions on unstructured data — data that can account for over 80% of enterprise content. They occur when businesses run AI programs as if all content is readily accessible for AI consumption, regardless of data formats.

What is unstructured data and the Document Data Crisis

Unstructured data is data without a defined format or schema, usually text-heavy documents or rich media.

We use the term Document Data Crisis for what happens when an enterprise’s most important operational data exists as unstructured data in documents that AI cannot analyze, or as data that has already been lifted from documents but not properly formatted.

Dark Data is a frequent, but not the sole contributor to the crisis. Dark Data is any piece of data that is collected and stored somewhere in the enterprise, but that goes unused or underutilized in analysis and decision-making. Data goes dark in a variety of ways and stages of document management:

A dilapidated ECM or shared drive that’s become unsearchable.
Documents stored without reliable metadata, classification, or tagging.
Scanned documents that stay as images rather than machine-readable content.
Attachments to files severed from the source, or sources with poor version control.

In a recent post on dark data, we noted that the world is approaching 400 ZB of data produced annually, of which an estimated 250 ZB is enterprise-owned unstructured data. Over half of that unstructured data remains unused in data analysis. But a recent report by IDC provides a different, equally troubling angle: less than 1% of enterprise unstructured data is even suitable for AI consumption to begin with [1].

So this document data crisis doesn’t exist because organizations don’t have enough data to analyze. In fact, in that same blog post, we note that 47% of enterprise unstructured data already resides in a repository. Regardless of whether it’s been captured, it remains unready for AI use.

I think we’ve harped a lot on data readiness since working enterprise AI catapulted to the top of everyone’s priority list. In those past instances, we were vocal about the importance of document versioning, cleaning/optimizing your ECM, and capturing all the dark data you didn’t realize you had. But while an ECM is a fantastic solution for document storage and automation, it alone isn’t where content transforms into usable data for database analytics or AI analysis.

Data readiness is still a major hurdle

A silly comic circulated around our office gives a relevant commentary to the state of AI and the importance of data readiness, even after several years of the ongoing hype.

The frenzy to adopt AI continues. Stakeholders and shareholders want to see profit. The executive level wants to be profitable. The engineers, analysts, and IT teams responsible for vetting, recommending, and purchasing systems face pressure from above to adopt. Models are deployed into the reality of 1% unstructured data readiness, and their missing contextual awareness leads to poor decisions. Those poor decisions lead to mistrust and weak buy-in among users, not to mention the extreme costs and risks these mistakes can incur.

And all because data readiness doesn’t stop at enterprise-wide capture.

In their 2026 Market Trends report, Deep Analysis writes, “the GenAI surge has exposed the ‘deplorable state’ of enterprise unstructured data collections.” [2]

Just because you may have captured the data, doesn’t mean it’s immediately usable for AI consumption. Structuring content post-capture for a variety of use cases, including AI consumption, is needed to deter crisis, failure, and hallucinations.

Curing the state of enterprise unstructured data collections

Intelligent document processing at its core is data capture technology that bridges the gap between traditional optical character recognition (OCR) and AI through a combination of machine learning, natural language processing, human-in-the-loop, and traditional means.

IDP already does much to ingest, scan, lift, label, and route data that could otherwise be trapped in a variety of the worst-case offender document types – those that are highly variable and lack predictable structure. But as described, it isn’t just about capturing data; it’s about producing a structure that an AI model can read, understand, and ultimately inform decision-making from.

On the back end, an IDP system goes beyond intelligent capture by producing structured outputs in formats preferred by AI systems, including JSON, CSV, XML, and other highly structured formats designed for ingestion into data lakes, warehouses, and data pipelines.
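As a rough illustration of what "structured outputs" can mean in practice, the sketch below serializes one extracted record into JSON, CSV, and XML using only the Python standard library. The record, its field names, and the helper functions are hypothetical and do not represent the output of any particular IDP product.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical extraction result; field names and values are illustrative only.
record = {
    "doc_type": "invoice",
    "invoice_number": "INV-1001",
    "total": "1250.00",
}


def to_json(rec: dict) -> str:
    """Serialize the record as pretty-printed JSON."""
    return json.dumps(rec, indent=2)


def to_csv(rec: dict) -> str:
    """Serialize the record as a header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_xml(rec: dict) -> str:
    """Serialize the record as one XML element per field."""
    root = ET.Element("document")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_csv(record))
print(to_xml(record))
```

The same record feeds all three writers, so the choice of format can follow whatever the downstream lake, warehouse, or pipeline expects.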

For CEOs, data engineers, IT teams, and other stakeholders pursuing agentic solutions, particularly in document-heavy industries, IDP as a foundation for corporate GenAI is not optional. Front-end AI systems depend on the back-end data readiness provided by IDP to support analytics and Retrieval-Augmented Generation (RAG), in which systems retrieve, summarize, and contextualize information from the data lake. In other words, IDP is table stakes for success.

Averting crisis

Lack of context equals a drop in quality. If a model cannot retrieve a document because it’s out of reach (not captured in the first place) or can’t read the document because it’s in the wrong format, then your agents don’t have the necessary details to make decisions, resulting in misinformation.
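To make the retrieval gap concrete, here is a minimal sketch of a keyword retriever over a toy repository. It is not a real RAG stack; the `StoredDocument` class, the file names, and the matching rule are assumptions chosen only to show that a document with no machine-readable text never reaches the agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredDocument:
    name: str
    text: Optional[str]  # None models a scan that was never converted to text


def retrieve(query: str, docs: list[StoredDocument]) -> list[str]:
    """Return names of documents whose extracted text contains every query term."""
    terms = query.lower().split()
    hits = []
    for doc in docs:
        if doc.text is None:
            continue  # nothing to search: the content is invisible to the agent
        body = doc.text.lower()
        if all(term in body for term in terms):
            hits.append(doc.name)
    return hits


repository = [
    StoredDocument("policy_2025.txt", "Refund policy: refunds are issued within 30 days."),
    StoredDocument("policy_2026_scan.tiff", None),  # newer policy, stored as an image only
]

print(retrieve("refund policy", repository))
```

Only the older text file is returned. The newer policy exists in the repository, but because it was stored as an image, any answer built on this retrieval would rest on the outdated document.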

Intelligent capture is the tip of the iceberg. It is supported by extraction, classification, and validation, with human-in-the-loop oversight that verifies any result scored below 100% confidence, which is essential in regulated industries.

Select outputs for database loading that serve AI systems with the necessary information to analyze, summarize, and contextualize data in your content ecosystem, and avert a crisis.

Sources:

The post "The Document Data Crisis" was originally published on https://www.keymarkinc.com/the-data-context-crisis/

What is due diligence for IDP and why is it important?

Alex Lipinski — Tue, 02 Dec 2025 15:41:48 +0000

Due diligence is the investigative process of vetting an investment or agreement to verify facts and make informed decisions. Good due diligence reduces risk and protects decision-makers from signing off on costly mistakes.

*With new intelligent document processing vendors emerging monthly, technology iterating quarterly, and orgs cycling through solutions like flavors of the month, your ability to analyze a market full of showy IDP software and determine whether it’s the right fit for your enterprise is becoming an insanely valuable skill.*

Takeaways

Analysts now track over 450 IDP entrants, a 15% increase year-over-year, marking an essential need for tight decision-making and software selection skills.
IDP is no longer a primarily back-office thing with 62% of IDP systems now involving external users.
66% of new IDP projects are initiated to replace a previous IDP system.
Proof‑of‑concept evaluations are essential to verify AI accuracy, integration, and security before signing a contract.

Why are there so many IDP solutions all of a sudden?

Traditionally, intelligent capture has been the champion of back-office workflows like mailroom automation, AP/AR invoice processing, and audit prep — activities involving predominantly structured and semi-structured documents. As advancements in machine learning and natural language processing have given way to richer capabilities for mastering unstructured documents, there’s been a shift from back-office use-cases to industry-specific front-office functions.

Today, over 60% of IDP use-cases are in support of processes where external users create, access, and share unstructured documents/data, including customer service, employee onboarding, contract and agreement analysis, claims intake, licenses and permits processing, and beyond. [2]

Combine that with the understanding that 90% of enterprise data is unstructured — and that data quality and data quantity have a dramatic impact on enterprise GenAI results — and the demand for IDP to capture unstructured data climbs higher than ever before.

As such, the IDP market is growing fast, at a rate of 15% year-over-year [1], as new entrants with document intelligence capabilities come out of the woodwork in response to the rising number of use cases and the growing demand for data to support better GenAI results and data analysis.

Solution red flags to avoid

Deep Analysis, a market analysis and due diligence firm, reports that they’re now tracking 456 companies globally that sell IDP as a standalone product or a feature [1].

The problem for those 456 vendors and for you is that differentiation between IDP products is extremely difficult with product messaging that’s more or less the same. To filter at least a few of the not-so-great solutions out of the decision process, beware of these common red flags.

Claims of 99% or “near-perfect accuracy” without proof. Analysts call out this claim as one of the most misleading in the market because the claim doesn’t tell the whole story. What were the sample documents? Did anyone check if the results were actually correct?
Unclear Data Policies. Data privacy still ranks among the biggest concerns with AI and IDP. Training a model requires sample documents, but who is providing those? You? If so, how is that data being handled?
Consumption or token pricing. Uncapped consumption models mean you can go way over budget during document surges. And token pricing is unpredictable and may vary depending on document complexity. Roughly 88% of surveyed IDP purchasers have indicated that they prefer the predictability and stability of fixed pricing models. [2]
Human-in-the-loop as an upsell. The work is only done if it’s accurate. Classification with the risk of errors, even 1% errors, is dangerous. Human-in-the-loop verification is still a necessity to reach high‑quality outcomes and retrain models safely. To sell it as an add-on is to sell an incomplete solution. Experts advise always placing HITL where accuracy must be guaranteed.
Really neat demos and UI... and that’s it. Demos are meticulously curated for specific use cases that look great in the demo but fall apart in the real world. PoCs are a must.
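One way to test an accuracy claim during a proof of concept is to score the vendor's extracted fields against values your own team has verified by hand. The sketch below assumes exact string matching and a made-up two-invoice sample; the function name and data are hypothetical, and a real evaluation would need a much larger and more varied document set.

```python
def field_accuracy(predicted: list[dict], truth: list[dict]) -> float:
    """Share of ground-truth fields whose extracted value matches exactly."""
    correct = 0
    total = 0
    for pred, gold in zip(predicted, truth):
        for field, expected in gold.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


# Hand-verified values for a tiny, hypothetical sample.
truth = [
    {"invoice_number": "INV-1001", "total": "1250.00"},
    {"invoice_number": "INV-1002", "total": "89.90"},
]

# What the system under evaluation returned for the same documents.
predicted = [
    {"invoice_number": "INV-1001", "total": "1250.00"},
    {"invoice_number": "INV-1002", "total": "39.90"},  # one misread digit
]

print(f"Field-level accuracy: {field_accuracy(predicted, truth):.0%}")
```

Scoring on your own documents answers the two questions a headline accuracy number leaves open: which samples were used, and whether anyone checked the results.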

The last claim — GenAI as the problem-solver

The effect of GenAI hype and advertising on IDP purchase decisions is apparent. Today, over 66% of new IDP projects are started just to replace old ones that either don’t work as promised or don’t deliver the same GenAI capabilities as promised in the new one. [2]

These replacements coincide with the dramatic rise of GenAI baked into solutions. As of early 2025, over 80% of IDP vendors have advertised GenAI capabilities somewhere in their IDP solutions, with some touting it as the predominant feature [3] even though data quality is the cause for successful GenAI, and not the other way around.

Putting the cart before the horse is a problem. Confusing the cart for the horse is Don Quixote-level insane.

That’s not to paint GenAI a villain. LLMs are fantastic at zero-shot/few-shot learning and summarization. But for raw data extraction at scale and at relatively lower costs, discriminative machine learning is superior.

Basic due diligence questions

Here’s a loose framework for due diligence that combines categories used by analysts to rate vendors with criteria that aim to pinpoint performance while avoiding red flags.

Is the solution purpose-built to align with our priority use cases?
Is there evidence that the solution can handle a wide variety of document types utilizing modern machine learning and natural language processing?
Can we verify data encryption, access controls, and a clear no‑training‑on‑my‑data policy?
Is the platform easy to deploy and maintain? Is it predictable and transparent with pricing, with clear visibility into costs over time and usage safeguards? Are we being sold this one solution or an entire platform?
Does the vendor seem like they know what’s going on? Do they have a history of innovation and a clear roadmap?
Can we see confidence scores? Track model versioning? Can the vendor show us every step a document takes from upload to storage and what happens to data and processing records afterwards?
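For the last question, it can help to picture what a traceable processing record might look like. The sketch below is one hypothetical shape for such a trail, with a step name, model version, and confidence score per event. The class names, step names, and version strings are all assumptions, not any vendor's actual log format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessingEvent:
    step: str
    model_version: str
    confidence: Optional[float] = None  # None for steps that involve no model


@dataclass
class DocumentTrail:
    document_id: str
    events: list[ProcessingEvent] = field(default_factory=list)

    def record(self, step: str, model_version: str, confidence: Optional[float] = None) -> None:
        """Append one processing step to the trail."""
        self.events.append(ProcessingEvent(step, model_version, confidence))

    def lowest_confidence(self) -> Optional[float]:
        """Return the weakest model score seen for this document, if any."""
        scores = [e.confidence for e in self.events if e.confidence is not None]
        return min(scores) if scores else None


trail = DocumentTrail("doc-001")
trail.record("upload", "n/a")
trail.record("classification", "clf-v3", 0.97)  # hypothetical model versions
trail.record("extraction", "ext-v5", 0.82)
trail.record("storage", "n/a")

for event in trail.events:
    print(event.step, event.model_version, event.confidence)
print("Lowest confidence:", trail.lowest_confidence())
```

If a vendor cannot produce something equivalent to this for a sample document, that is worth noting in the evaluation.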

Do your homework, and IDP will work for you

Intelligent Document Processing (IDP) has matured, but the market is crowded and the 450+ vendors are not all created equal. Case studies are better than demos, and success depends on data quality, model transparency, and measurable ROI, not GenAI claims. Ask vendors to prove not just what their models can do, but how they do it, and how they’ll perform over time. In the age of AI everywhere all the time, skepticism is a virtue.

Download a free interactive worksheet to grade IDP solutions based on proven criteria.

Sources:

The post “What is due diligence for IDP and why is it important?” was originally published on keymarkinc.com.

What is Intelligent Document Processing?

Alex Lipinski — Wed, 29 Oct 2025 17:55:12 +0000

What is IDP?

IDP combines AI-powered tools like natural language processing and machine learning with traditional capture methods such as optical character recognition (OCR) to extract data from unstructured and structured sources—then formats the data for easier analysis.

Intelligent Document Processing achieves mastery over unstructured document data, which can account for 80–90% of enterprise data globally. By recognizing, extracting, classifying, and structuring data for use in agentic AI projects, across workflows, and in data lakehouses, IDP reduces risk and gives a much clearer picture of business operations for strategic decision-making.

Takeaways

IDP modernizes traditional capture technology with AI capabilities.
IDP looks beyond individual characters to understand words, sentences, and context—leading to much more accurate results.
IDP roots out hidden data, or captures data regardless of schema.
IDP fuels agentic AI, workflow, and data analysis by structuring semantic data in a variety of formats (including JSON and Markdown).

Intelligent Document Processing Stats and Facts

Enterprise data generated: ~318 ZB
Unstructured enterprise data: ~90%
Data already living in a repository: ~47%

Structured Data vs. Unstructured Data

Unlike structured data (which fits tidily into predefined formats like tables and form fields), unstructured data lacks a clear, organized format and can come in a variety of forms:

Multimedia files: Images, audio, and video parsed to text
Social media content and data
Web page content
Physical documents or e-files

When variation occurs in format or schema, or when data lives outside easily defined fields (such as rich media), traditional OCR methods may miss data unless specifically trained for each source. That’s highly inefficient and unsuitable for most organizations.

How Does IDP Work?

Intelligent Document Processing combines traditional OCR with additional AI capabilities (like machine learning and natural language processing) to dramatically improve document understanding by recognizing context. This improves first-pass data capture and classification by spotting and labeling data regardless of where it lives on a page or how it’s shared in rich media. Classification can proceed with near 100% accuracy.

Why is Human-in-the-Loop Validation Still Important?

IDP uses machine learning, but it still needs feedback. During validation, the system is told when it’s wrong and learns from its mistakes. Data is classified and given a confidence rating. In most cases, the confidence will be near 100%. But exceptions persist—these are moments when IDP says: “I’m pretty sure this is a purchase order, but I’m not 100% sure. Can you verify?” Validating exceptions steadily improves IDP’s performance, reducing mistakes over time.
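The exception handling described above can be sketched as a simple threshold rule: results at or above a chosen confidence are accepted, and the rest are queued for a person to verify. The 0.95 cut-off, the `Classification` type, and the reviewer input below are assumptions for illustration, and the retraining step itself is left out.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    doc_id: str
    label: str
    confidence: float


def split_for_review(results, threshold=0.95):
    """Separate confident results from exceptions that need a human check."""
    accepted, exceptions = [], []
    for result in results:
        if result.confidence >= threshold:
            accepted.append(result)
        else:
            exceptions.append(result)
    return accepted, exceptions


results = [
    Classification("doc-1", "invoice", 0.99),
    Classification("doc-2", "purchase_order", 0.81),  # "pretty sure, but not 100%"
]
accepted, exceptions = split_for_review(results)

# A reviewer confirms or corrects each exception. The verified labels become
# feedback examples for a later training round (not shown here).
reviewer_labels = {"doc-2": "purchase_order"}  # hypothetical reviewer input
feedback = [(r.doc_id, reviewer_labels[r.doc_id]) for r in exceptions]

print("Auto-accepted:", [r.doc_id for r in accepted])
print("Feedback for retraining:", feedback)
```

Where the threshold sits is a business decision: a stricter cut-off sends more documents to people, while a looser one accepts more risk.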

Benefits of IDP With Human-in-the-Loop

IDP Reduces the Risk of AI Project Failure

As enterprises generate over 80% of the world’s data (Global Market Insights), most of that data is created to support agentic AI experiences. But AI is only as smart as the data it’s given. Errors in data can be catastrophic for AI agents and the people making decisions with them. With unstructured data accounting for up to 90% (IDC) of enterprise data, there’s a lot of room for error.

IDP Enables Straight-Through Batch Processing and Workflow Automation

IDP can quickly sort large batches of files, separate documents by type (without separator sheets or barcodes), then extract, classify, and route data down workflow streams to the people that depend on it. This can be scheduled, manual, or triggered automatically as documents flow into the organization.
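As a simplified picture of the routing step, the sketch below takes a batch whose document types are assumed to have been classified already and groups the files into workflow queues. The queue names, the routing table, and the fallback queue are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from document type to the workflow queue that handles it.
ROUTES = {"invoice": "accounts_payable", "claim": "claims_intake"}
DEFAULT_QUEUE = "manual_triage"  # assumed fallback for unrecognized types


def route_batch(classified):
    """Group (file_name, doc_type) pairs by destination queue."""
    queues = defaultdict(list)
    for file_name, doc_type in classified:
        queues[ROUTES.get(doc_type, DEFAULT_QUEUE)].append(file_name)
    return dict(queues)


# Output of an upstream classification step, assumed for this example.
batch = [
    ("a.pdf", "invoice"),
    ("b.pdf", "claim"),
    ("c.pdf", "invoice"),
    ("d.pdf", "unknown"),
]

print(route_batch(batch))
```

The same function could run on a schedule, on demand, or as each new batch arrives, which matches the three trigger modes described above.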

IDP Tidies Document Data for Data Lakehouse Integration

IDP takes unstructured document data and provides structure in new formats like JSON (critical for data lakehouses) and Markdown (the language of AI). Without this capability, querying newly captured data would be terribly tedious.
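To show how one extraction result can serve both targets, the sketch below prints a record as JSON and renders the same record as a small Markdown table. The record, its fields, and the `to_markdown` helper are hypothetical.

```python
import json

# Hypothetical extraction result for a single document.
extracted = {
    "title": "Purchase Order 7731",
    "fields": {"vendor": "Acme Supply", "total": "412.50"},
}


def to_markdown(doc: dict) -> str:
    """Render the record as a Markdown heading followed by a two-column table."""
    lines = [f"# {doc['title']}", "", "| Field | Value |", "| --- | --- |"]
    for name, value in doc["fields"].items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines)


print(json.dumps(extracted, indent=2))  # machine-friendly form for the lakehouse
print(to_markdown(extracted))           # text-friendly form for an LLM prompt
```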

The post “What is Intelligent Document Processing (IDP)?” was originally published on keymarkinc.com.

How IDP Boosts ELT & Lakehouse Analytics

Alex Lipinski — Wed, 29 Oct 2025 17:39:14 +0000

Takeaways

Data availability is catching up with document complexity.
Intelligent Document Processing (IDP) uses machine learning and natural language processing to catch what ETL and ELT can’t.
IDP extracts data from documents at ingestion points and outputs structured data in formats like JSON, CSV, and XML.
The IDP market is accelerating as it improves data quality and expands data possibilities in lakehouses.

Historically, traditional methods of capturing document data have fallen short of large-scale analytics—mostly offering indexing and basic metadata. Variations in document structure and schema have always posed a challenge for capture solutions.

Intelligent Document Processing (IDP) is now the second hottest tool on every analyst and CEO’s list (right after AI). As of 2025, 63% of Fortune 250 companies have implemented IDP to add structure to locked document data, vastly improving access for analytics and AI. Industry stats suggest an 80–90% increase in access to data when analyzing content once confined within documents[^2].

Let’s break down the why, how, and where IDP fits in a document-driven data pipeline.

Comparing Modern Data Querying Pipelines

Schema-on-Write (ETL: Extract, Transform, Load)

Takes raw structured data (typically RDBMS, logs, APIs)
Normalizes and structures with a predefined schema
Loads into a Data Warehouse

Benefits: High performance/consistency, fast business reporting, reliable data quality, easy querying.

Schema-on-Read (ELT: Extract, Load, Transform)

Extracts raw data
Loads into a Data Lake
Adds structure during queries, batch jobs, or scheduled tasks

Benefits: Handles all varieties (tables, logs, unstructured content). Schema is applied at query time.
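The difference between the two approaches can be sketched in a few lines. In this toy version, plain Python lists stand in for the warehouse and the lake, and the order records are made up. Schema-on-write enforces types and columns at load time, while schema-on-read stores the raw lines and interprets them only when a query runs.

```python
import json

# Raw source records; the second one carries an extra field.
raw_events = [
    '{"order_id": "1", "amount": "19.99"}',
    '{"order_id": "2", "amount": "5.00", "note": "gift"}',
]


def etl_load(raw_lines):
    """Schema-on-write: enforce a fixed schema before anything is stored."""
    warehouse = []
    for line in raw_lines:
        row = json.loads(line)
        # Only the predefined columns survive, already cast to their types.
        warehouse.append({"order_id": int(row["order_id"]), "amount": float(row["amount"])})
    return warehouse


def elt_load(raw_lines):
    """Schema-on-read: store the raw lines untouched."""
    return list(raw_lines)


def elt_query_total(lake):
    """Apply structure at query time by parsing each raw line on demand."""
    return sum(float(json.loads(line)["amount"]) for line in lake)


warehouse = etl_load(raw_events)
lake = elt_load(raw_events)

print(warehouse)              # typed rows; the extra "note" field was dropped
print(elt_query_total(lake))  # total computed from raw lines at read time
```

Note that the warehouse copy has lost the `note` field, while the lake still holds it for any later query that wants it. That is the flexibility trade-off in miniature.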

The Data Lakehouse

Cloud architecture in platforms like Databricks and Snowflake blends data warehouse management with data lake flexibility. Today, 85% of organizations (a 20% increase from last year) leverage Data Lakehouse architecture to store enterprise data for analytics and AI/ML projects [2]. (And AI needs lots of data!)

Where Does the Lakehouse Struggle?

With so much data in the lakehouse, organizations either:

Lack enough unstructured document data (that’s ~80% of enterprise data), or
Have unformatted document data tossed in the lake—like sunken treasure

A lake without structure becomes a “data swamp”: a poorly managed, unusable repository of raw data.

Many lakehouses feature native toolsets for some document types, and custom Python can convert regular schemas to JSON, CSV, or XML.

But with inconsistent or shifting formats, manual scripting error rates spike.
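A small example shows why hand-written parsing is fragile. The pattern below is a typical one-off script written against a single invoice layout; both sample layouts and the pattern itself are invented for illustration.

```python
import re

# Pattern written against one known layout: "Total: $1250.00".
TOTAL_PATTERN = re.compile(r"Total:\s*\$(\d+\.\d{2})")


def extract_total(text: str):
    """Return the invoice total as a string, or None if the layout is not recognized."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


layout_a = "Invoice 1001\nTotal: $1250.00"
layout_b = "Invoice 1002\nAmount due (USD) 1,250.00"  # same information, different wording

print(extract_total(layout_a))
print(extract_total(layout_b))
```

The second call finds nothing, even though a person would read both invoices the same way. Each new layout means another pattern to write, test, and maintain.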

Parsing Unstructured Data to JSON

IDP uses NLP to make sense of documents with variable structures, extracting context and insights for analytics.

After capture, IDP delivers results as JSON, XML, CSV—no manual scripting or debugging needed.

Just well-trained AI/ML models, adapting instantly to new formats, providing digestible data as documents flow into your business.

Why does this matter?

Whether your data is a highly structured invoice or a rambling CEO letter, crucial data is littered everywhere.

IDP makes it analyzable—fueling decisions for data engineers and powerful prompts for LLMs.

The post “How IDP Boosts ELT & Lakehouse Analytics” was originally published on keymarkinc.com.

Your Content Repository Could Be a Data Gold Mine for IDP

Alex Lipinski — Wed, 29 Oct 2025 17:01:24 +0000

Takeaways

Dark data is data that can't be analyzed — and it's everywhere.
Approximately 55% of enterprise data is dark.
47% of dark data could already be living in content services or ECM, waiting for extraction.
Intelligent Document Processing rapidly makes sense of dark, unstructured data, preparing it with the structure needed for analysis in a data lake.

Organizations collect huge volumes of information, but very seldom, if ever, analyze all of it. Unanalyzed “dark data” is hidden everywhere, from PDFs and spreadsheets to Teams chats and nearly every place where humans exchange ideas.

With it commonly accepted that anywhere from 80-90% of enterprise data is unstructured¹, and almost half of enterprise data goes unused in decision-making², there are some insanely valuable insights locked away.

In his adventure of the Copper Beeches, Sherlock Holmes exclaimed,

"Data! Data! Data! I cannot make bricks without clay!"

Data is paramount to informed decisions, and even Mr. Holmes — a man capable of making the most astute, albeit absurd, deductions — needs data to succeed. Your organization is no different.

But while the amount of dark data can be alarming when put in terms like zettabytes or compared to lost floppy disks, 47% of it could already be in an ECM or content services system³.

Those are our proverbial mineshafts. That’s where the gold is.

It’s time to go spelunking.

What is Dark Data?

Dark data is any data, structured or unstructured, that is collected but not utilized to inform business decisions.

While structured data stored in legacy systems, personal devices, private spreadsheets, and department chats can contribute to dark data buildup, unstructured document/content data wins as the biggest offender of data gone dark — and it’s not even close.

How Does Unstructured Data Go Dark?

Unstructured data can go "dark" when it gets lost in the shuffle of:

Data siloes
Legacy systems
Poor lifecycle management
Bad document storage practices

Even properly stored data can become dark if it's too complex to parse into data lakes for analysis, or is directly loaded into a lake in its raw format.

The reason unstructured data is so difficult to master is primarily because it is often human-generated — arriving in many differing document and content types, including emails, paper files, social posts, images, or any document without a consistent format or layout.

Some Unstructured Data Statistics (and Absurd Holmesian Deductions)

Today, about 175 ZB of data is created, replicated, and consumed each year, with exponential growth expected.⁴ That's about 122 quadrillion floppy disks, in case you were wondering...
Based on IDC's global datasphere predictions, yearly world data will reach almost 400 ZB by 2028.⁵ That's more than double the data... and floppy disks.
Of the 393 ZB of world data in 2028, 81% will be generated by enterprises hunting for data analysis and gen AI.⁵ That's 318 ZB of data.
Taking the commonly cited statistic that 80–90% of the world's enterprise data is unstructured⁶ and being conservative, by 2028 enterprises alone will generate more unstructured data than the world of today.
Current reports assess that 55% of enterprise data is unanalyzable, or "dark."⁷ So by 2028, you'd be better off taking 122 quadrillion floppy disks containing important enterprise data and recycling them for plasticware at the office.
Of the unstructured enterprise data out there, nearly half is exchanged via a central content repository like an ECM or content services platform.⁶
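For anyone who wants to check the deductions above, the arithmetic can be reproduced in a few lines. The sketch assumes decimal units (1 ZB = 10^21 bytes), a 1.44 MB floppy taken as 1.44 × 10^6 bytes, and the conservative 80% end of the unstructured range. The other inputs come from the figures cited in the list.

```python
ZB = 10**21            # bytes per zettabyte (decimal units, an assumption)
FLOPPY_BYTES = 1.44e6  # a "1.44 MB" floppy taken as 1.44 million bytes (an assumption)

world_today_zb = 175       # data created per year today
world_2028_zb = 393        # projected yearly world data in 2028
enterprise_share = 0.81    # share generated by enterprises
unstructured_share = 0.80  # conservative end of the 80-90% range

floppies = world_today_zb * ZB / FLOPPY_BYTES
enterprise_2028_zb = world_2028_zb * enterprise_share
unstructured_2028_zb = enterprise_2028_zb * unstructured_share

print(f"{floppies / 1e15:.0f} quadrillion floppy disks")
print(f"{enterprise_2028_zb:.0f} ZB of enterprise data in 2028")
print(f"{unstructured_2028_zb:.0f} ZB of unstructured enterprise data in 2028")
print("More than the world of today:", unstructured_2028_zb > world_today_zb)
```

Under these assumptions, the unstructured enterprise figure for 2028 comes out at roughly 255 ZB, comfortably above today's 175 ZB for the whole world.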

How Can I Better Utilize My Content Data?

Start by understanding where unstructured data lives. Because dark data has almost a 50% chance of being unstructured data contained in a centralized content repository, that’s a great and easy place to start.

Unstructured content flows into your organization through many ingestion points within inbound communication channels like:

Email or chat
Uploads and sharing
APIs and integrations
Automated systems

Ideally, the end of that workflow lands the content in some form of centralized system.

So that’s where the gold is.

Use AI Better — Find Your Data Gold

Intelligent Document Processing (IDP) combines natural language processing, machine learning, and a variety of capture methods to organize unstructured data. It can rapidly capture, label, index, and route data as it comes into your organization — or, unlock that 50% of dark content data sitting in your repository for analysis.

How Does IDP Enable Data Analysis from Unstructured Data?

Unstructured data is hard to parse automatically and slow to process via scripts — because humans think in words and context clues that become riddles to Python.

Humans also create many variations in how content is organized, which is not ideal for extraction.

The way IDP analyzes documents — focusing on context, natural language patterns, and learning over time — allows it to:

Translate content to a finer semantic layer
Provide necessary structure for data analysis
Shine light on dark data for extraction and analysis

Your content repository is a mine. Data is gold. IDP is a shovel.

If Holmes were here, he’d be stacking gold bricks — because data, data, data!

The post Your Content Repository Is a Data Gold Mine — Here’s How IDP Can Mine It appeared first on keymarkinc.com.

¹ IDC
² Splunk
³ IDC
⁴ IDC/Seagate: The Digitization of the World: From Edge to Core
⁵ IDC DataSphere: Latest Trend Forecast
⁶ IDC: Untapped Value: What Every Executive Needs to Know About Unstructured Data
⁷ Splunk: Dark Data: An Introduction