A demo tells you what the best case looks like. Messy data tells you what production looks like.
I've developed one rule for AI tool evaluations that has saved me from recommending the wrong product more times than I can count:
Never trust the demo data.
Demos are prepared. The data is clean, the queries are rehearsed, the outputs are the best examples the vendor has found. The demo is designed to show you what the product looks like when everything goes right.
Your actual work data is not like the demo data. It's messy, contradictory, outdated, inconsistently formatted, and full of organizational artifacts that make no sense to anyone who wasn't there when they were created.
A product that performs excellently on clean data and breaks on messy data is not an enterprise product. It's a demo product.
Here's how I test the difference.
What Messy Data Actually Looks Like in Enterprise Environments
Before I describe the tests, it's worth being specific about what "messy data" means in practice, because in an enterprise context it is different from what a data scientist means by the term.
Temporal inconsistency: Your document corpus contains multiple versions of the same document, some of which are current and some of which are outdated. The AI should retrieve the current version and either flag or ignore outdated versions. Most systems don't.
Contradictory information: Different documents, written at different times by different teams, say different things about the same topic. Your Q2 strategy deck says one thing; a later board update says something different. The AI needs to either surface the contradiction, default to the more recent source, or at minimum not confidently assert one version as fact.
Implicit context and jargon: Internal documents use abbreviations, code names, and vocabulary that is meaningful to insiders and opaque to outsiders. "The Project Aurora framework" means something specific to your team and nothing to the AI unless it has sufficient surrounding context.
Structural inconsistency: Some documents are well-structured with clear headers and sections. Others are informal notes, email threads, meeting transcriptions, or Slack exports with no formal structure. Systems trained on structured documents often fail on unstructured ones.
Stale references: Documents reference people, systems, and processes that no longer exist in the described form. "Contact the regional manager" — who no longer has that title. "Submit via the legacy portal" — which was replaced two years ago.
The Five Tests I Run
Test 1: Contradictory source retrieval
I create two documents with contradictory information about the same topic — a policy that changed, a number that was updated, a process that was revised.
I query the AI about the topic.
I'm evaluating three things: Does it retrieve both documents? Does it recognize the contradiction? Does it tell me which version is current, ask me to clarify, or confidently assert one version without flagging the conflict?
A system that confidently asserts the outdated version without flagging that newer information exists is not trustworthy for knowledge management. A system that surfaces both and explains the conflict is.
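The three observations above can be recorded in a small grading helper. This is a minimal sketch, not a vendor API: the `Answer` shape, the document IDs, and the `CONFLICT_HINTS` phrase list are all assumptions made for illustration, and a keyword match is only a first pass before reading the response yourself.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical response shape; adapt to what the tool under test returns."""
    text: str
    cited_doc_ids: list


# Illustrative phrases suggesting the system noticed a conflict (an assumption).
CONFLICT_HINTS = (
    "conflict", "contradict", "superseded", "outdated",
    "more recent", "earlier version",
)


def grade_contradiction(answer, old_id, new_id):
    """Record the three observations for the contradictory-source test."""
    cited = set(answer.cited_doc_ids)
    text = answer.text.lower()
    return {
        "retrieved_both": {old_id, new_id} <= cited,
        "flagged_conflict": any(hint in text for hint in CONFLICT_HINTS),
        "cited_current": new_id in cited,
    }


good = Answer(
    text="Refunds are allowed within 30 days. An earlier version said 14 days and is outdated.",
    cited_doc_ids=["policy_v1", "policy_v2"],
)
print(grade_contradiction(good, old_id="policy_v1", new_id="policy_v2"))
```

A response that cites only the old document and flags nothing comes back with all three values false, which is the untrustworthy case described above.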
Test 2: Internal jargon and code names
I use the organization's real internal terminology (project code names, department abbreviations, internal process names) in queries without defining any of it.
I'm evaluating whether the system can resolve these references from context in the document corpus, or whether it either fails to retrieve relevant documents or confabulates an answer that sounds plausible but isn't grounded in the actual indexed content.
This test matters because if your AI can't navigate your organization's vocabulary, it's useful only for generically-worded queries and fails on the specific queries that would actually save time.
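One rough grounding check for this test is to see whether each internal term in the query actually appears in the text the system retrieved. The sketch below assumes you can export the retrieved passages as plain strings; the function name and the sample terms are hypothetical, and a substring match will miss paraphrases, so treat a hit as "worth reading" rather than as proof of grounding.

```python
def ungrounded_terms(terms, retrieved_texts):
    """Return the internal terms that appear in none of the retrieved passages."""
    corpus = " ".join(retrieved_texts).lower()
    return [term for term in terms if term.lower() not in corpus]


# Hypothetical example: one code name is backed by a retrieved passage, one is not.
passages = ["The Project Aurora framework defines three review stages."]
print(ungrounded_terms(["Project Aurora", "RevOps"], passages))
```

If the system answers confidently about a term that shows up in this list, that answer deserves a manual check for confabulation.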
Test 3: Malformed and inconsistent document structure
I feed the system a mix of well-structured documents and genuinely messy ones: meeting notes that are bullet points with no context, email threads formatted as plain text, documents with inconsistent heading levels, and files that were clearly OCR'd from physical paper with artifacts.
I evaluate retrieval and comprehension quality across document types. Systems that perform well on structured documents and poorly on unstructured ones will degrade significantly in real deployments, because most enterprise knowledge doesn't live in well-formatted documents.
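To make the comparison across document types concrete, it helps to tally results per type rather than overall. A minimal sketch, assuming each query has been judged by hand as passed or failed and tagged with the type of document it targeted (the type labels here are made up):

```python
from collections import defaultdict


def pass_rate_by_type(results):
    """results: iterable of (doc_type, passed) pairs from manual judging."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for doc_type, passed in results:
        totals[doc_type] += 1
        passes[doc_type] += int(passed)
    return {doc_type: passes[doc_type] / totals[doc_type] for doc_type in totals}


# Hypothetical judgments for four queries.
judged = [
    ("structured", True),
    ("structured", True),
    ("ocr_scan", False),
    ("ocr_scan", True),
]
print(pass_rate_by_type(judged))
```

A large gap between the structured rate and the rates for notes, email threads, or scans is the degradation signal this test is looking for.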
Test 4: Query that requires synthesis across multiple sources
I ask a question that can only be answered correctly by combining information from three or more documents, none of which contains the complete answer.
Example: "What is our current refund policy for enterprise customers who purchased before the Q3 pricing change?" This requires the current refund policy document, the Q3 pricing change announcement, and potentially the enterprise customer definition document.
Systems that return a partial answer (one document only) or a confabulated answer (no grounded retrieval) fail this test. Systems that retrieve the relevant fragments and synthesize them correctly pass it.
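The pass, partial, and confabulated outcomes can be told apart by comparing the documents the answer cites against the documents the question requires. This is a sketch under assumptions: the document IDs are invented, the system is assumed to expose its citations, and "partial" here means any incomplete subset of the required documents, which is slightly broader than "one document only". Citing the right documents does not prove the synthesis is correct, so a "pass" still needs a read-through.

```python
def synthesis_verdict(required_ids, cited_ids):
    """Classify a multi-source answer by which required documents it cites."""
    required = set(required_ids)
    if not required:
        raise ValueError("required_ids must name at least one document")
    found = required & set(cited_ids)
    if not found:
        return "ungrounded"  # no required source cited: likely confabulated
    if found != required:
        return "partial"     # some, but not all, required sources cited
    return "pass"


# Hypothetical IDs for the refund-policy example.
required = ["refund_policy", "q3_pricing_change", "enterprise_definition"]
print(synthesis_verdict(required, ["refund_policy"]))
print(synthesis_verdict(required, required + ["unrelated_memo"]))
```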
Test 5: Query with no good answer in the corpus
I ask a question about a topic that isn't covered in the indexed document corpus.
I'm evaluating whether the system says "I don't have information about this in the available documents" or whether it confabulates an answer that sounds authoritative but isn't grounded in anything.
This is the honesty test. Enterprise AI systems that confabulate rather than acknowledge gaps in their knowledge create a trust problem that's worse than having no AI at all — because users can't tell when to trust the output.
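When running many out-of-corpus questions, a crude first filter is to look for abstention language in the response. The phrase list below is an assumption, not an exhaustive one, and a keyword match cannot tell an honest "I don't know" from a hedge followed by a made-up answer, so anything it marks as abstained or not abstained is only a starting point for manual review.

```python
# Illustrative abstention phrases (an assumption; extend for the tool under test).
ABSTAIN_HINTS = (
    "don't have information",
    "do not have information",
    "no information",
    "not covered",
    "couldn't find",
    "could not find",
    "unable to find",
)


def abstained(answer_text):
    """True if the response contains one of the abstention phrases."""
    text = answer_text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(hint in text for hint in ABSTAIN_HINTS)


print(abstained("I don't have information about this in the available documents."))
print(abstained("The travel budget is $4,000 per employee."))
```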
What to Do With the Results
After running these tests, score each system pass/fail on each one. A product that fails two or more of the five tests should not be recommended for enterprise production use, regardless of demo performance.
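The scoring rule is simple enough to write down directly. A minimal sketch: the test names are labels chosen here for illustration, and `max_failures=1` encodes the "two or more failures disqualifies" rule.

```python
TESTS = ("contradiction", "jargon", "structure", "synthesis", "no_answer")


def recommend(results, max_failures=1):
    """results maps each test name to True (pass) or False (fail)."""
    missing = [name for name in TESTS if name not in results]
    if missing:
        raise ValueError(f"missing results for: {', '.join(missing)}")
    failures = [name for name in TESTS if not results[name]]
    return len(failures) <= max_failures


# Hypothetical scorecard: two failures, so the product is not recommended.
scorecard = {
    "contradiction": False,
    "jargon": True,
    "structure": True,
    "synthesis": True,
    "no_answer": False,
}
print(recommend(scorecard))
```

Requiring a result for every test keeps a skipped test from silently counting as a pass.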
For products that pass, note which tests required more prompting, clarification, or careful query formulation. Systems that require users to phrase queries very precisely to get good results have a hidden adoption cost — users who don't develop that skill will get worse results and either distrust the tool or make decisions based on poor outputs.
The best enterprise AI products handle messy input gracefully, acknowledge their uncertainty honestly, and surface conflicts rather than hiding them. These properties don't show up in demos. They show up in the messy data tests.