Kasia Ryniak

Posted on Jun 17

Knowledge Graphs from Unstructured Documents: Why the Hard Part Isn't the Extraction

Rafal Cymerys, CTO at Applied AI consultancy Upside, wrote an insightful post on the reality of working with Knowledge Graphs:

Institutional Knowledge, Machine-Readable: What It Actually Takes to Build It

Large institutions generate documents the way they generate bureaucracy: continuously, in slightly different formats, by different people, across decades. Legal departments accumulate contracts and precedents. Regulators publish guidance that amends earlier guidance that references older frameworks. Engineering organizations build up internal standards, compliance manuals, and technical specifications - each written by whoever was responsible at the time, in whatever format made sense to them then.

The result is the same almost everywhere: thousands of documents that are technically structured - they have sections, clauses, definitions, cross-references - but not machine-readable in any useful sense. Semi-freeform is the right word for it. There's enough regularity that a human can navigate them, not enough that a parser can.

For a long time, the only answer was people. Subject matter experts who knew where things were. Ctrl+F and institutional memory. The occasional internal wiki that was always six months out of date.

The obvious question now is whether LLMs change that. The answer is: partly, and the "partly" is doing a lot of work.

Models are genuinely good at reading technical prose and pulling out entities, relationships, and constraints. The extraction step works. What doesn't work automatically is everything that happens afterward - verifying the model didn't confabulate a relationship, enforcing a schema that's still evolving, keeping the graph consistent as documents change, and exposing all of it through an API that downstream systems can actually depend on.

That gap between "the LLM can read documents" and "the knowledge graph is a reliable system of record" is where the real engineering lives.

Why This Is Harder Than It Looks

The extraction step is the part that demos well. You paste in a clause of dense regulatory prose, the model returns a clean JSON object with entities, relationships, and attributes. It's genuinely impressive the first time you see it. So the natural instinct is to build a pipeline around that step - better prompts, smarter chunking, structured output formatting - and assume the hard work is done.

It isn't.

The extraction step is maybe 20% of the problem. The other 80% is what happens to that output afterward, and it's where most teams hit a wall they didn't see coming.

The first thing that goes wrong is consistency. An LLM reading the same clause twice won't always produce the same output. Change the surrounding context, update the model version, adjust the prompt slightly - the extraction shifts. For a demo that's fine. For a knowledge graph that's supposed to be a source of truth for the systems built on top of it, that's a serious problem. Variance in the extraction propagates directly into every system that depends on it.

The second thing that goes wrong is schema drift. You start with a reasonable ontology - entities, relationships, a few key attributes - and it works for the first batch of documents. Then a new document type surfaces a relationship you hadn't modeled. A domain expert reviews the output and pushes back on how you've represented a concept. A downstream consumer needs a different shape. The ontology changes, and now everything you've already extracted is potentially inconsistent with the new model. If you didn't build for this from the start, it's expensive to fix.

The third thing - and this one is underestimated most often - is that the documents themselves aren't as consistent as they look. Semi-freeform means there's a structure, but it isn't enforced. The same concept appears under different headings in different documents. Implicit relationships that any human reader would infer are simply absent from the text. Older documents use terminology that was later revised. An LLM will do its best with all of this, but "its best" isn't the same as "correct and consistent," and there's no automatic way to tell the difference.

None of this means LLMs aren't useful here. They are - significantly so. But treating the extraction step as the problem, and everything else as plumbing, is where projects get into trouble.

The Pipeline You Actually Need

Figure: the data extraction pipeline using an LLM.

So what does a pipeline that takes this seriously actually look like?

The extraction step stays, but it gets bounded. Good prompt engineering, structured output formatting, and sensible document chunking are all worth investing in - they improve the quality of the raw material. But from a pipeline perspective, you also need to validate whatever the extraction step returned.

Schema enforcement before storage. Whatever comes out of the LLM needs to pass validation against your ontology before it touches the graph. This means defining your data model precisely enough that you can write constraints against it - mandatory relationships, type checks, value ranges, cardinality rules. The tooling depends on your stack: JSON Schema covers the basics for simpler use cases, while more expressive options like SHACL handle the full complexity of RDF-based semantic graphs. The principle is universal either way. Run validation on every extraction, without exception. A violation means one of two things: the data wasn't present in the source document to begin with, or the model failed to extract it despite it being there. Often the right move is to feed the validation result back to the extraction step, so the model gets a second attempt with explicit feedback about what's missing. But pushing the model to fill gaps also risks hallucination, which is why every violation needs to be flagged for human review regardless of how it resolves.
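To make the idea concrete, here is a minimal sketch of validating one extraction before storage. It uses plain Python in place of JSON Schema or SHACL, and the ontology itself (the entity types, relation names, and the cardinality rule) is entirely hypothetical, as is the shape of the extraction dictionary.

```python
from dataclasses import dataclass

# Hypothetical ontology: every name below is an assumption made for illustration.
ALLOWED_ENTITY_TYPES = {"Regulation", "Clause", "Organization"}
ALLOWED_RELATIONS = {
    # relation name -> (source type, target type)
    "AMENDS": ("Regulation", "Regulation"),
    "PART_OF": ("Clause", "Regulation"),
    "ISSUED_BY": ("Regulation", "Organization"),
}
# Cardinality rule: every Regulation must have exactly one ISSUED_BY edge.
REQUIRED_EXACTLY_ONE = {"Regulation": "ISSUED_BY"}


@dataclass(frozen=True)
class Violation:
    subject: str
    message: str


def validate_extraction(extraction: dict) -> list[Violation]:
    """Check one LLM extraction against the ontology and return all violations."""
    violations = []
    entities = {e["id"]: e for e in extraction.get("entities", [])}

    # Type check: every entity must use a type the ontology knows.
    for entity_id, entity in entities.items():
        if entity.get("type") not in ALLOWED_ENTITY_TYPES:
            violations.append(Violation(entity_id, f"unknown type {entity.get('type')!r}"))

    # Relation checks: known name, existing endpoints, matching endpoint types.
    outgoing = {}
    for rel in extraction.get("relations", []):
        name, src, dst = rel.get("type"), rel.get("source"), rel.get("target")
        if name not in ALLOWED_RELATIONS:
            violations.append(Violation(str(src), f"unknown relation {name!r}"))
            continue
        if src not in entities or dst not in entities:
            violations.append(Violation(str(src), f"{name} points at a missing entity"))
            continue
        expected = ALLOWED_RELATIONS[name]
        actual = (entities[src].get("type"), entities[dst].get("type"))
        if actual != expected:
            violations.append(Violation(src, f"{name} expects {expected}, got {actual}"))
        outgoing[(src, name)] = outgoing.get((src, name), 0) + 1

    # Cardinality check: mandatory relationships must appear exactly once.
    for entity_id, entity in entities.items():
        required = REQUIRED_EXACTLY_ONE.get(entity.get("type"))
        if required and outgoing.get((entity_id, required), 0) != 1:
            violations.append(Violation(entity_id, f"needs exactly one {required} edge"))
    return violations


extraction = {
    "entities": [
        {"id": "reg-2", "type": "Regulation"},
        {"id": "reg-1", "type": "Regulation"},
        {"id": "org-1", "type": "Organization"},
    ],
    "relations": [
        {"type": "AMENDS", "source": "reg-2", "target": "reg-1"},
        {"type": "ISSUED_BY", "source": "reg-2", "target": "org-1"},
    ],
}
violations = validate_extraction(extraction)
for violation in violations:
    print(violation.subject, "-", violation.message)
```

Here the only violation is that reg-1 has no issuer. That is exactly the ambiguous case described above: either the source never named one, or the model missed it. The list of violations is what you would feed back to the extraction step and attach to the review item.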

Deterministic validation on top of probabilistic extraction. LLM outputs are non-deterministic by nature. That's acceptable for extraction - you're asking a model to interpret ambiguous text, and some variance is inevitable. It isn't acceptable for what ends up in the graph, so the variance has to be caught before storage. Rule-based checks handle what schema validation doesn't: referential integrity, duplicate detection, conflict identification against existing graph data, and basic sanity checks on extracted values - a date that falls outside the expected range, an integer that doesn't map to a known category, a measurement that's physically implausible. Models make these kinds of mistakes, and they're cheap to catch deterministically. These checks form a deterministic layer around the probabilistic one, and they're what lets you make claims about graph consistency with any confidence.
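A rough sketch of what such rule-based checks can look like follows. The fact shape, the attribute names (`effective_date`, `risk_category`), the date bound, and the category set are all assumptions for illustration. It also assumes values were already typed by schema validation, so dates arrive as `date` objects.

```python
from datetime import date

# Hypothetical bounds and categories, assumed for illustration.
EARLIEST_VALID_DATE = date(1950, 1, 1)
KNOWN_RISK_CATEGORIES = {1, 2, 3}


def check_facts(new_facts, known_entity_ids, stored_values, today):
    """Return (fact label, failed rule) pairs for a batch of extracted facts."""
    problems = []
    seen_in_batch = set()
    for fact in new_facts:
        key = (fact["subject"], fact["attribute"])
        label = f"{fact['subject']}.{fact['attribute']}"
        value = fact["value"]

        # Referential integrity: the subject must already exist in the graph.
        if fact["subject"] not in known_entity_ids:
            problems.append((label, "unknown_subject"))
        # Duplicate detection within the batch.
        if key in seen_in_batch:
            problems.append((label, "duplicate_in_batch"))
        seen_in_batch.add(key)
        # Conflict identification against what the graph already stores.
        if key in stored_values and stored_values[key] != value:
            problems.append((label, "conflicts_with_graph"))
        # Sanity checks on the extracted value itself.
        if fact["attribute"] == "effective_date" and not (EARLIEST_VALID_DATE <= value <= today):
            problems.append((label, "date_out_of_range"))
        if fact["attribute"] == "risk_category" and value not in KNOWN_RISK_CATEGORIES:
            problems.append((label, "unknown_category"))
    return problems


problems = check_facts(
    new_facts=[
        {"subject": "reg-1", "attribute": "effective_date", "value": date(2031, 5, 1)},
        {"subject": "reg-1", "attribute": "risk_category", "value": 7},
        {"subject": "reg-9", "attribute": "risk_category", "value": 1},
    ],
    known_entity_ids={"reg-1"},
    stored_values={("reg-1", "risk_category"): 2},
    today=date(2025, 1, 1),
)
for label, rule in problems:
    print(label, rule)
```

Given the same inputs, these checks always return the same result, which is the property the extraction step can't offer.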

Provenance on everything. Every node and edge written to the graph should carry metadata: which source document it came from, which model version produced it, when it was extracted, whether it passed automated validation, whether a human reviewed it. This feels like overhead until you need to re-process a document because the model changed, or audit a specific claim a downstream system is relying on, or trace a query result back to its source. Provenance is what makes the graph auditable.
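As an illustration, provenance can be modeled as a small immutable record attached to every edge. This is a sketch, not a prescribed data model: the field names mirror the list above, while the document names and model version strings are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_document: str
    model_version: str
    extracted_at: datetime
    passed_validation: bool
    human_reviewed: bool = False


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    provenance: Provenance


def needing_reextraction(edges, current_model_version):
    """Edges produced by a different model version and never confirmed by a human."""
    return [
        edge
        for edge in edges
        if edge.provenance.model_version != current_model_version
        and not edge.provenance.human_reviewed
    ]


# Hypothetical documents and model versions.
extracted = datetime(2024, 3, 1, tzinfo=timezone.utc)
edges = [
    Edge("reg-2", "AMENDS", "reg-1",
         Provenance("doc-17.pdf", "model-2024-01", extracted, True)),
    Edge("reg-2", "ISSUED_BY", "org-1",
         Provenance("doc-17.pdf", "model-2024-01", extracted, True, human_reviewed=True)),
    Edge("reg-3", "AMENDS", "reg-2",
         Provenance("doc-21.pdf", "model-2025-01", extracted, True)),
]
stale = needing_reextraction(edges, current_model_version="model-2025-01")
print([(e.source, e.relation, e.target) for e in stale])
```

With this metadata in place, "which parts of the graph came from the old model and were never checked by a person" becomes a simple filter instead of an investigation.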

A routing layer for confidence. Not all extractions are equal. Some outputs will be clean, well-formed, and consistent with existing graph data. Others will be ambiguous, partially invalid, or in conflict with something already there. The pipeline should route these differently - high-confidence extractions go straight through, lower-confidence ones go to a review queue. Confidence, in this context, is a composite of schema validation results, conflict detection, and similarity to known-good extractions. Getting this routing right is what makes human review tractable at scale.
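One possible way to combine those three signals is sketched below. It treats schema violations and conflicts as hard gates and applies a threshold to the similarity score. That is a design assumption rather than the only reasonable choice: a weighted score would work too. The `Signals` shape and the 0.8 default are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    schema_violations: int           # count from schema validation
    conflicts: int                   # count from conflict detection against the graph
    similarity_to_known_good: float  # 0.0 to 1.0, however you choose to compute it


def route(signals: Signals, similarity_threshold: float = 0.8) -> str:
    """Decide whether an extraction is written directly or queued for review."""
    # Any violation or conflict sends the extraction to a human, regardless of similarity.
    if signals.schema_violations > 0 or signals.conflicts > 0:
        return "review"
    # Clean but unfamiliar extractions are also reviewed.
    if signals.similarity_to_known_good < similarity_threshold:
        return "review"
    return "auto_accept"


print(route(Signals(0, 0, 0.93)))
print(route(Signals(0, 0, 0.55)))
print(route(Signals(1, 0, 0.99)))
```

The threshold is the knob you tune: start high, so most extractions get reviewed, and lower it as the pipeline earns trust.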

All of the above results in a layered pipeline, where each layer handles a different category of failure - and the failures are different enough that no single mechanism catches all of them.

The Storage Layer Underneath

When people think about knowledge graphs, they reach for a graph database - Neo4j, GraphDB, or similar - and start building. That's reasonable, but there are a few architectural decisions worth making before you commit to a storage topology.

Graph and relational aren't mutually exclusive. A knowledge graph is the right system of record for interconnected entities and semantic relationships - that's what it's designed for. It's less obviously right for everything else. High-volume filtering, aggregation queries, caching, and workflow state tend to work better in a relational database. In practice, most production systems end up with both: the graph as the authoritative source for structured knowledge, and a relational layer for specific access patterns that don't map well to graph traversal. The mistake is assuming one technology handles everything, and discovering the gaps under load.

Performance is non-obvious at scale. Knowledge graph technologies can exhibit surprising behavior depending on workload patterns. Write-heavy scenarios with certain ontologies, complex multi-hop traversals, and large queries can all degrade significantly as data volume grows. Validate your expected workload characteristics against your chosen technology before you're in production, not after. The performance profile of a graph store with ten thousand nodes is not the same as that of one with ten million, and the differences aren't always predictable from first principles.

The Schema Evolution Problem

Your ontology will change. Plan for it.

The first version of your schema is based on the documents you've seen so far, the use cases you understand now, and the abstractions that made sense when you started. All three of those shift. A new document type surfaces a relationship you hadn't modeled. A domain expert reviews your representation of a concept and tells you it's subtly wrong. A downstream team needs to query the graph in a way that your current structure makes very complicated. Each of these is a reasonable, expected event. The question is whether the system can absorb those changes without falling apart.

The teams that handle this well treat schema changes the same way they treat database migrations. Every change to the ontology is versioned. There's a changelog. Scripts run against the graph when the schema updates, transforming existing data to match the new model. Nothing gets changed informally, because informal changes are the ones that create silent inconsistencies - data written under the old schema sitting alongside data written under the new one, with no record of which is which.
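A minimal sketch of that migration discipline is shown below, using an in-memory dictionary as a stand-in for the graph. The graph shape, the relation names, and both migration steps are hypothetical. A real system would run the equivalent scripts against the graph store.

```python
def rename_modifies_to_amends(graph):
    # Hypothetical change: the ontology renamed the MODIFIES relation to AMENDS.
    for edge in graph["edges"]:
        if edge["relation"] == "MODIFIES":
            edge["relation"] = "AMENDS"


def add_default_status(graph):
    # Hypothetical change: edges gained a mandatory "status" attribute.
    for edge in graph["edges"]:
        edge.setdefault("status", "active")


# Ordered, versioned list of ontology changes. Nothing is applied outside this list.
MIGRATIONS = [
    (1, "rename MODIFIES to AMENDS", rename_modifies_to_amends),
    (2, "add default status to edges", add_default_status),
]


def migrate(graph):
    """Apply every migration newer than the graph's recorded schema version."""
    for version, description, step in MIGRATIONS:
        if version > graph["schema_version"]:
            step(graph)
            graph["schema_version"] = version
            graph["changelog"].append(f"v{version}: {description}")
    return graph


graph = {
    "schema_version": 0,
    "changelog": [],
    "edges": [{"source": "reg-2", "relation": "MODIFIES", "target": "reg-1"}],
}
migrate(graph)
print(graph["schema_version"], graph["edges"], graph["changelog"])
```

Because the graph records its own schema version, running `migrate` a second time is a no-op, and there is always a record of which model the stored data conforms to.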

The harder question is what happens to data that was extracted under an old schema version. There are two options: re-extract from the source documents using the updated schema, or transform and backfill the existing data in place based on what's already in the graph. Neither is always right. Re-extraction is cleaner but expensive - you're running the full pipeline again, which costs time and compute, and assumes the source documents are still available and unchanged. In-place transformation is faster but riskier, because you're inferring what the new schema would have produced rather than actually producing it. The decision depends on how significant the schema change is and how much trust you have in the transformation logic.

Making It Queryable, and Keeping It That Way

Getting knowledge into the graph is one problem. Making it reliably accessible to the systems that need it is another.

The instinct is to expose the graph directly - give downstream consumers a raw endpoint to the knowledge graph or the underlying data store and let them query. This works fine for internal tooling and exploratory work. It's a bad foundation for anything that needs stability guarantees. Query languages like SPARQL or SQL are expressive, but exposing them directly binds API consumers to your schema - which is exactly what you want to avoid. When the ontology changes - and it will - every query that touched the affected parts breaks. You find out when something downstream stops working, not before.

A versioned API layer sitting in front of the graph solves this. It could be a RESTful API, a search index like Elasticsearch, or any other interface that decouples what consumers see from how the data is actually stored internally. When the schema changes, you version the API - consumers stay on the old version until they're ready to migrate, and breaking changes become manageable rather than catastrophic.
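Here is a small sketch of the decoupling, with a dictionary standing in for the graph. It assumes a hypothetical schema change in which a regulation went from having a single amendment to having a list of them. The function names, record fields, and version labels are all illustrative.

```python
def to_v1(record: dict) -> dict:
    # v1 consumers expect a single "amended_by" value: the latest amendment, or None.
    amendments = record["amendments"]
    return {
        "id": record["id"],
        "title": record["title"],
        "amended_by": amendments[-1] if amendments else None,
    }


def to_v2(record: dict) -> dict:
    # v2 exposes the full amendment history introduced by the schema change.
    return {
        "id": record["id"],
        "title": record["title"],
        "amendments": list(record["amendments"]),
    }


SERIALIZERS = {"v1": to_v1, "v2": to_v2}


def get_regulation(store: dict, api_version: str, regulation_id: str) -> dict:
    """Serve one internal record in the shape a given API version promises."""
    if api_version not in SERIALIZERS:
        raise ValueError(f"unsupported API version: {api_version}")
    return SERIALIZERS[api_version](store[regulation_id])


# Internal storage after the schema change (hypothetical record).
store = {
    "reg-1": {
        "id": "reg-1",
        "title": "Data Retention Policy",
        "amendments": ["reg-2", "reg-5"],
    }
}
print(get_regulation(store, "v1", "reg-1"))
print(get_regulation(store, "v2", "reg-1"))
```

The internal model changed, but v1 consumers still receive the shape they were built against. They move to v2 when they are ready, not when the ontology forces them to.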

The Evaluation Problem

A pipeline without an evaluation framework is essentially running on faith. You're assuming the extractions are good because they look reasonable in spot checks. That assumption tends to hold until it doesn't - and when it breaks, it usually breaks quietly, across a large portion of the graph, before anyone notices.

The core problem is that LLM output quality isn't static. It shifts when you change the prompt, update the model, encounter a new document type, or hit edge cases in the source material. Models also get deprecated - often on a shorter timeline than you'd expect - which means you may be forced to swap the underlying model with limited notice. Without systematic measurement, you have no way to know whether a new model produces equivalent results, or whether quality has quietly shifted across your document types.

The practical answer is a curated set of documents with verified expected outputs. You run your pipeline against this set automatically, on every significant change: a prompt update, a model swap, a schema revision. The output tells you whether the pipeline still behaves the way it should, or whether something has shifted. Without it, you're relying on spot checks and intuition, which tend to catch problems after they've already propagated into the graph.
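A bare-bones version of that check might look like the sketch below. It scores extracted triples against a golden set using precision and recall, which is one reasonable choice of metric rather than the only one. The golden documents are invented, and `stub_extract` is a hard-coded stand-in for the real pipeline so the example runs on its own.

```python
def evaluate(extract, golden_set):
    """Compare extracted triples against verified ones across the golden set."""
    true_pos = false_pos = false_neg = 0
    for text, expected in golden_set:
        predicted = extract(text)
        true_pos += len(predicted & expected)
        false_pos += len(predicted - expected)
        false_neg += len(expected - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return {"precision": precision, "recall": recall}


# Hypothetical golden set: (document text, verified triples).
GOLDEN_SET = [
    ("Regulation B amends Regulation A.",
     {("B", "AMENDS", "A")}),
    ("Regulation C was issued by Agency X and amends Regulation B.",
     {("C", "ISSUED_BY", "X"), ("C", "AMENDS", "B")}),
]

# Stand-in for the real pipeline: canned outputs, one of them partly wrong.
CANNED_OUTPUTS = {
    GOLDEN_SET[0][0]: {("B", "AMENDS", "A")},
    GOLDEN_SET[1][0]: {("C", "AMENDS", "B"), ("C", "AMENDS", "A")},
}


def stub_extract(text):
    return CANNED_OUTPUTS[text]


scores = evaluate(stub_extract, GOLDEN_SET)
print(scores)
```

In practice you would run this in CI and fail the build when either score drops below the last accepted baseline. That way a prompt tweak or model swap can't quietly degrade the graph.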

The evaluation dataset deserves the same care as the pipeline code. It should be version-controlled, reviewed when the schema changes, and expanded whenever a new edge case reaches the review queue. Every extraction that gets corrected by a human is a candidate for the ground truth set. The dataset compounds in value over time - it's the institutional memory of what "correct" looks like for your specific domain and document types.

Human-in-the-Loop: Where and How Much

Full automation is the right long-term goal. It's almost never the right starting point.

The gap between "the pipeline produces output" and "the output is trustworthy enough to enter the graph without review" is real, and it's different for every domain and document type. Trying to close that gap by tightening the pipeline before you understand where it fails is backwards. You need human review first - not as a permanent state, but as the mechanism that tells you where the pipeline needs work.

The practical model is a confidence-based routing layer. High-confidence extractions - those that pass all validation checks, don't conflict with existing graph data, and closely resemble known-good examples - go straight through. Lower-confidence ones get queued for review. The threshold between the two is something you tune over time, starting conservative and relaxing it as the pipeline matures.

What makes this work in practice is the review interface. Reviewers need to see the source passage alongside the proposed extraction, understand what would change in the graph if they approve it, and make a decision quickly. A poorly designed review interface becomes a bottleneck - slow to use, cognitively expensive, and prone to reviewer fatigue that degrades decision quality.
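As a sketch of the data a review screen needs, the snippet below pairs the source passage with a preview of what approval would change. The `ReviewItem` shape and the triple representation of edges are assumptions. A real interface would also surface conflicts and provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewItem:
    source_passage: str
    proposed_edges: frozenset  # of (source, relation, target) triples


def preview_changes(item: ReviewItem, current_edges: set) -> dict:
    """Summarize what approving the item would do to the graph."""
    proposed = set(item.proposed_edges)
    return {
        "added": sorted(proposed - current_edges),
        "already_present": sorted(proposed & current_edges),
    }


current_edges = {("reg-2", "AMENDS", "reg-1")}
item = ReviewItem(
    source_passage="Regulation 2, issued by Agency X, amends Regulation 1.",
    proposed_edges=frozenset({
        ("reg-2", "AMENDS", "reg-1"),
        ("reg-2", "ISSUED_BY", "org-x"),
    }),
)
preview = preview_changes(item, current_edges)
print(item.source_passage)
print(preview)
```

Showing only the delta keeps the reviewer's attention on the one new claim instead of everything the model restated.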

Every decision a reviewer makes is data. An approved extraction with no changes confirms the pipeline is working for that case. A corrected extraction tells you something more valuable - exactly where the model went wrong, in a form you can add to your ground truth set and use to improve evaluation. The human review queue is, in this sense, your best source of training signal. Teams that treat it as a necessary evil miss this entirely.

Over time, as prompts improve and the evaluation framework catches regressions earlier, the review rate should fall. Track it explicitly. If it plateaus or rises, something in the pipeline has regressed - a new document type is exposing gaps, a model update shifted behavior, or the schema changed in a way the prompts haven't caught up with. The review rate is one of the most honest signals you have about overall pipeline health.
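Tracking this doesn't need much machinery. The sketch below compares the average review rate of the most recent periods with the periods before them. The window size, the tolerance, and the weekly counts are illustrative assumptions, not recommended values.

```python
def review_rate(routed_to_review: int, total: int) -> float:
    """Fraction of extractions in a period that needed human review."""
    return routed_to_review / total if total else 0.0


def trend(rates, window: int = 2, tolerance: float = 0.02) -> str:
    """Compare the mean of the last `window` periods with the `window` before it."""
    if len(rates) < 2 * window:
        return "insufficient data"
    recent = sum(rates[-window:]) / window
    previous = sum(rates[-2 * window:-window]) / window
    delta = recent - previous
    if delta < -tolerance:
        return "falling"
    if delta > tolerance:
        return "rising"
    return "plateau"


# Hypothetical weekly counts: (routed to review, total extractions).
improving = [review_rate(r, t) for r, t in [(40, 100), (30, 100), (22, 100), (21, 100)]]
stalled = [review_rate(r, t) for r, t in [(22, 100), (21, 100), (22, 100), (21, 100)]]
print(trend(improving))
print(trend(stalled))
```

A "plateau" or "rising" result is the cue to look at what changed recently: a new document type, a model update, or a schema revision.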

Where This Is Going

The tooling around all of this is improving fast. Models produce better-structured output than they did a year ago. Evaluation frameworks are becoming less bespoke. The case for building these systems is stronger than it's ever been.

What isn't changing is the underlying challenge. A knowledge graph is only as useful as it is consistent and trustworthy, and LLMs - left without the surrounding infrastructure - aren't consistent by nature. The non-determinism is a fundamental property of how these models work, and the engineering response to that is a governance layer: validation, provenance, evaluation, controlled schema evolution, human review where it's needed. That layer is what turns the probabilistic output of a model into something a production system can depend on.

I don't know exactly how this space will look in three years. The models will be better. Some of the validation work we do manually today will probably be automatable. The boundary between what needs human review and what doesn't will move. But the separation of concerns - extraction on one side, governance on the other - feels durable. The problem of turning loosely structured institutional knowledge into something machines can reliably reason over isn't going away, and neither is the need to do it carefully.

Building knowledge graph systems for institutional knowledge is an end-to-end problem - extraction, validation, and evaluation as parts of a single process. There's no other way to build something you can actually depend on.