
TI for Kreuzberg

Why most enterprise AI projects underperform

Enterprises are investing more in AI now than ever before, but most of that investment is not delivering what boards expect. Agents often miss important context. RAG pipelines pull up the wrong information. Internal copilots look good in demos but struggle with real user content.

When problems show up, teams usually check the model first. Then they look at the pipeline: better chunking, new embeddings, a different vector database, a re-ranker. These steps can help, but they rarely address the root cause.

The real problem starts with the input. The documents sent into the pipeline are not in a format the AI can use, and no amount of downstream work can fully fix that. If scanned PDFs turn into a mess of unstructured characters before reaching the embeddings, the embeddings are working with bad content. If multilingual contracts are treated as if they are only in English, the model is making decisions on text it cannot understand. If your codebase is split by line count instead of by function, your code-aware agent is working with pieces that have no real meaning.

This is the layer where the industry has underinvested, and it is often the deciding factor in whether an AI stack works.

What 'AI-ready' means

AI-ready content is more than just raw text. For example, a contract has parties, clauses, definitions, and an obligation structure. A research report includes sections, figures, tables, footnotes, and citations that all connect. A codebase has modules, functions, imports, and call graphs. If you reduce any of these to just characters, you lose most of their value.

When a document is AI-ready, its structure stays intact. Headings remain as headings. Tables keep their rows and columns. Each section's language is identified, instead of assuming the whole document is in one language. Code is broken into units that follow its syntax. Cross-references between documents are kept as links, not turned into plain text.

The test is straightforward: can a downstream AI system use this content correctly without restructuring it? If not, the content is not AI-ready, no matter how clean the text appears.
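The difference shows up clearly in data. Below is a minimal sketch of that test; the schema is purely illustrative (it is not the output format of Kreuzberg or any other tool), but it shows how a structure that keeps headings, tables, and per-section language lets a downstream system answer correctly without restructuring anything.

```python
# Illustrative only: a hypothetical "AI-ready" document structure,
# not the output schema of any particular tool.

flat = "Agreement Parties Acme GmbH Beta Inc 2024 Fee 10,000 EUR Term 12 months"

structured = {
    "title": "Agreement",
    "sections": [
        {
            "heading": "Parties",
            "language": "en",
            "text": "Acme GmbH and Beta Inc, effective 2024.",
        },
        {
            "heading": "Commercials",
            "language": "en",
            # The table keeps rows and columns instead of collapsing to prose.
            "table": {
                "columns": ["Item", "Value"],
                "rows": [["Fee", "10,000 EUR"], ["Term", "12 months"]],
            },
        },
    ],
}

def answer_fee(doc: dict) -> str:
    """A downstream system can read the fee directly from the table."""
    for section in doc["sections"]:
        table = section.get("table")
        if table:
            for item, value in table["rows"]:
                if item == "Fee":
                    return value
    return "unknown"

print(answer_fee(structured))  # → 10,000 EUR
```

With the flat string, the same question requires the model to guess which number is the fee; with the structured form, it is a lookup.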

What breaks downstream when the input layer is weak

These problems are often misdiagnosed because the symptoms appear far from the actual cause. A retrieval system might show an outdated policy because the document lacked clear timestamp metadata, and three nearly identical copies were embedded, with no way to tell them apart. The team blames the vector database, but the database is not the problem. The ingestion process never pulled out the metadata.

An agent might answer questions about a quarterly report with a confident but wrong number. The PDF was scanned, the OCR missed characters, a row in a financial table was misaligned, and the agent repeated the wrong figure as fact. The team suspects the model, but the model is working faithfully with the text it was given.

A multilingual support copilot works well in English but struggles in German and Japanese. The pipeline assumed each document was in only one language, so tickets with mixed languages were handled incorrectly from the start.

A code-aware agent might suggest a refactor that breaks a function because it never saw the whole function together. The repository was split into 800-character segments, so the function was split across two embeddings. The cause is at the input. Pipelines that look healthy on synthetic test data degrade on real enterprise content, and the degradation is most pronounced where document complexity is highest.
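The code-chunking failure is easy to reproduce. Here is a small sketch using only Python's standard ast module: fixed-size character windows cut functions in half, while syntax-aware splitting keeps each function whole. This is illustrative, not how any particular framework chunks code.

```python
import ast

source = '''\
def parse_config(path):
    with open(path) as f:
        raw = f.read()
    return {line.split("=")[0]: line.split("=")[1]
            for line in raw.splitlines() if "=" in line}

def validate(config):
    missing = [k for k in ("host", "port") if k not in config]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return config
'''

# Naive chunking: fixed-size character windows, blind to syntax.
def chunk_by_chars(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

# Syntax-aware chunking: one chunk per top-level function.
def chunk_by_function(text: str) -> list[str]:
    tree = ast.parse(text)
    lines = text.splitlines()
    return [
        "\n".join(lines[node.lineno - 1:node.end_lineno])
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

def parses(text: str) -> bool:
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False

naive = chunk_by_chars(source, 120)
syntactic = chunk_by_function(source)

print([parses(c) for c in naive])      # at least one False: a function is cut mid-body
print([parses(c) for c in syntactic])  # [True, True]: each chunk is a whole function
```

An embedding built from a chunk that does not even parse cannot represent what the function does; an embedding built from a whole function can.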

Extraction doesn't equal understanding

Most pipelines do not make this distinction. Extraction pulls characters out of files. Understanding produces output that another system can use without losing structure or meaning.

In practice, that means handling:

  • PDFs that are scanned, born-digital, mixed, or partially corrupted.
  • Images, including charts and diagrams that contain the actual information.
  • Audio and video, where speaker attribution, timestamps, and topic structure carry meaning beyond transcription.
  • Web content that depends on the rendered DOM structure rather than plain HTML.
  • Code, parsed as code, with awareness of language, file boundaries, and symbol structure.

This process needs to be consistent. The same contract should be handled the same way, whether it comes as a PDF, a Word file, or a scanned image. The output format should stay the same for all content types. Tables should remain tables, and hierarchy should remain hierarchy.

It also needs to handle nested content. For example, an email with three PDF attachments and an embedded image is really a group of documents, not just one. A repository is organized like a tree, not just a list of files. If the content layer flattens these structures, it loses much of what made the input valuable.
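One way to see what "handling nested content" means: represent the input as a tree and keep the path to every piece of content, rather than flattening everything into one string. The sketch below is hypothetical (the Node type and field names are invented for illustration), but it shows the provenance that is lost when an email and its attachments are collapsed into a single blob.

```python
# Illustrative sketch: nested content as a tree instead of a flat string.
# The Node type and its fields are hypothetical, invented for this example.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str          # e.g. a filename or "body"
    kind: str          # "email", "pdf", "image", ...
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def flatten_with_provenance(node: Node, path: str = "") -> list[tuple[str, str]]:
    """Walk the tree, keeping the path to each piece of content."""
    here = f"{path}/{node.name}" if path else node.name
    out = [(here, node.text)] if node.text else []
    for child in node.children:
        out.extend(flatten_with_provenance(child, here))
    return out

email = Node("quarterly-update.eml", "email", "Please see attached.", [
    Node("report.pdf", "pdf", "Q3 revenue grew 12%."),
    Node("chart.png", "image", "Bar chart: revenue by region."),
    Node("appendix.pdf", "pdf", "Methodology notes."),
])

for path, text in flatten_with_provenance(email):
    print(path, "->", text)
```

Each chunk now knows which attachment it came from, so retrieval can cite "report.pdf inside quarterly-update.eml" instead of an anonymous blob.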

Where it fits in an existing stack

Most AI engineering teams have already chosen a framework, a vector database, an LLM provider, and an orchestration tool. They don't want a content layer that forces them to rebuild their pipeline.

A good document intelligence layer should not require that. Kreuzberg connects at the start of your existing pipeline and works directly with LangChain, Haystack, LlamaIndex, Spring AI, txtai, and CrewAI (the list is growing!). The rest of the pipeline stays the same. It just starts getting content in the right format.
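Concretely, "connects at the start" means extraction runs once and its output is adapted into whatever document type your framework expects. The sketch below is framework-agnostic: `extract` is a stand-in for your extraction call (Kreuzberg's actual API may differ, so check its docs), and the `Document` shape mimics what frameworks like LangChain consume.

```python
# Sketch of wiring a document intelligence layer into an existing pipeline.
# `extract` is a stand-in for the real extraction call; the Document shape
# mimics what frameworks like LangChain expect downstream.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Document:
    page_content: str
    metadata: dict

def load_documents(
    paths: list[str],
    extract: Callable[[str], tuple[str, dict]],
) -> list[Document]:
    """Run extraction at the start of the pipeline; downstream is unchanged."""
    docs = []
    for path in paths:
        content, metadata = extract(path)
        docs.append(Document(page_content=content,
                             metadata={"source": path, **metadata}))
    return docs

# A fake extractor for demonstration; swap in the real one.
def fake_extract(path: str) -> tuple[str, dict]:
    return f"text of {path}", {"languages": ["en"]}

docs = load_documents(["contract.pdf", "report.docx"], fake_extract)
print(docs[0].metadata["source"])  # → contract.pdf
```

Because extraction is isolated behind one callable, swapping extractors or upgrading the content layer never touches the retrieval or generation code.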

The same idea applies to deployment. Some content can go through a managed cloud service. Some must stay self-hosted for compliance. Some must run in an air-gapped environment. For most enterprises, a document intelligence layer that operates in only one mode is not a practical solution.

Kreuzberg Cloud as an infrastructure layer

Teams often end up rebuilding their content layer because they treated it as a one-time integration task: get the documents in, extract the text, move on. Then a new content type appears, a multilingual customer signs, someone wants to add video, or an auditor asks where the data is processed.

Each of these is a small change on its own. But together, they explain why a content pipeline that seemed complete six months ago is now holding up three items on your roadmap.

This is why document intelligence is an infrastructure issue. It needs to work at scale, handle every content type your enterprise uses, run wherever your data is allowed to run, and remain reliable as your AI stack changes.

Kreuzberg Cloud is built for that role.

Do you want to find out where your AI projects are losing accuracy? The quickest test is to take some real content from your pipeline, run it through Kreuzberg, and compare the results to what your current pipeline produces. If there is only a small difference, the problem lies elsewhere. If there is a big difference, you have just found the most cost-effective fix in your stack.
