DEV Community: Kasia Ryniak

Knowledge Graphs from Unstructured Documents: Why the Hard Part Isn't the Extraction

Kasia Ryniak — Wed, 17 Jun 2026 09:01:19 +0000

Rafal Cymerys, CTO at Applied AI consultancy Upside, wrote an insightful post on the reality of working with Knowledge Graphs:

Institutional Knowledge, Machine-Readable: What It Actually Takes to Build It

Large institutions generate documents the way they generate bureaucracy: continuously, in slightly different formats, by different people, across decades. Legal departments accumulate contracts and precedents. Regulators publish guidance that amends earlier guidance that references older frameworks. Engineering organizations build up internal standards, compliance manuals, and technical specifications - each written by whoever was responsible at the time, in whatever format made sense to them then.

The result is the same almost everywhere: thousands of documents that are technically structured - they have sections, clauses, definitions, cross-references - but not machine-readable in any useful sense. Semi-freeform is the right word for it. There's enough regularity that a human can navigate them, not enough that a parser can.

For a long time, the only answer was people. Subject matter experts who knew where things were. Ctrl+F and institutional memory. The occasional internal wiki that was always six months out of date.

The obvious question now is whether LLMs change that. The answer is: partly, and the "partly" is doing a lot of work.

Models are genuinely good at reading technical prose and pulling out entities, relationships, and constraints. The extraction step works. What doesn't work automatically is everything that happens afterward - verifying the model didn't confabulate a relationship, enforcing a schema that's still evolving, keeping the graph consistent as documents change, and exposing all of it through an API that downstream systems can actually depend on.

That gap between "the LLM can read documents" and "the knowledge graph is a reliable system of record" is where the real engineering lives.

Why This Is Harder Than It Looks

The extraction step is the part that demos well. You paste in a clause of dense regulatory prose, the model returns a clean JSON object with entities, relationships, and attributes. It's genuinely impressive the first time you see it. So the natural instinct is to build a pipeline around that step - better prompts, smarter chunking, structured output formatting - and assume the hard work is done.

It isn't.

The extraction step is maybe 20% of the problem. The other 80% is what happens to that output afterward, and it's where most teams hit a wall they didn't see coming.

The first thing that goes wrong is consistency. An LLM reading the same clause twice won't always produce the same output. Change the surrounding context, update the model version, adjust the prompt slightly - the extraction shifts. For a demo that's fine. For a knowledge graph that's supposed to be a source of truth for the systems built on top of it, that's a serious problem. Variance in the extraction propagates directly into every system that depends on it.

The second thing that goes wrong is schema drift. You start with a reasonable ontology - entities, relationships, a few key attributes - and it works for the first batch of documents. Then a new document type surfaces a relationship you hadn't modeled. A domain expert reviews the output and pushes back on how you've represented a concept. A downstream consumer needs a different shape. The ontology changes, and now everything you've already extracted is potentially inconsistent with the new model. If you didn't build for this from the start, it's expensive to fix.

The third thing - and this one is underestimated most often - is that the documents themselves aren't as consistent as they look. Semi-freeform means there's a structure, but it isn't enforced. The same concept appears under different headings in different documents. Implicit relationships that any human reader would infer are simply absent from the text. Older documents use terminology that was later revised. An LLM will do its best with all of this, but "its best" isn't the same as "correct and consistent," and there's no automatic way to tell the difference.

None of this means LLMs aren't useful here. They are - significantly so. But treating the extraction step as the problem, and everything else as plumbing, is where projects get into trouble.

The Pipeline You Actually Need

The data extraction pipeline using an LLM

So what does a pipeline that takes this seriously actually look like?

The extraction step stays, but it gets bounded. Good prompt engineering, structured output formatting, and sensible document chunking are all worth investing in - they improve the quality of the raw material. But from a pipeline perspective, you also need to validate whatever the extraction step returned.

Schema enforcement before storage. Whatever comes out of the LLM needs to pass validation against your ontology before it touches the graph. This means defining your data model precisely enough that you can write constraints against it - mandatory relationships, type checks, value ranges, cardinality rules. The tooling depends on your stack - something as simple as JSON Schema covers the basics for simpler use cases, while more expressive options like SHACL handle the full complexity of RDF-based semantic graphs - but the principle is universal. Run validation on every extraction, without exception. A violation means one of two things: the data wasn't present in the source document to begin with, or the model failed to extract it despite it being there. Often the right move is to feed the validation result back to the extraction step - the model gets a second attempt with explicit feedback about what's missing. But pushing the model to fill gaps also risks hallucination, which is why every violation needs to be flagged for human review regardless of how it resolves.
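To make the idea concrete, here is a minimal sketch of validation before storage, using only the Python standard library rather than JSON Schema or SHACL. The ontology, the field names, and the helpers `validate_extraction` and `retry_feedback` are all hypothetical, chosen for illustration; a real pipeline would derive its constraints from its own data model.

```python
from dataclasses import dataclass

# Hypothetical ontology: relationship type -> (source entity type, target entity type).
ONTOLOGY = {
    "AMENDS": ("Guidance", "Guidance"),
    "DEFINES": ("Clause", "Term"),
}
MANDATORY_ENTITY_FIELDS = ("id", "type", "name")


@dataclass
class Violation:
    path: str
    message: str


def validate_extraction(extraction: dict) -> list[Violation]:
    """Return every ontology violation found in one LLM extraction."""
    violations: list[Violation] = []
    entities = extraction.get("entities", [])
    by_id = {e["id"]: e for e in entities if e.get("id")}

    # Mandatory attributes on every entity.
    for i, entity in enumerate(entities):
        for name in MANDATORY_ENTITY_FIELDS:
            if not entity.get(name):
                violations.append(Violation(f"entities[{i}].{name}", "missing mandatory field"))

    # Relationship types must be known and must connect the right entity types.
    for i, rel in enumerate(extraction.get("relationships", [])):
        expected = ONTOLOGY.get(rel.get("type"))
        if expected is None:
            violations.append(Violation(f"relationships[{i}].type", "unknown relationship type"))
            continue
        for end, expected_type in zip(("source", "target"), expected):
            node = by_id.get(rel.get(end))
            if node is None:
                violations.append(Violation(f"relationships[{i}].{end}", "unknown entity id"))
            elif node.get("type") != expected_type:
                violations.append(
                    Violation(f"relationships[{i}].{end}", f"expected type {expected_type}")
                )
    return violations


def retry_feedback(violations: list[Violation]) -> str:
    """Turn violations into explicit feedback for a second extraction attempt."""
    return "\n".join(f"- {v.path}: {v.message}" for v in violations)
```

The list of violations serves two purposes in this sketch: it can be rendered into feedback for a retry, and it can be stored alongside the extraction so the case is flagged for human review however the retry turns out.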

Deterministic validation on top of probabilistic extraction. LLM outputs are non-deterministic by nature. That's acceptable for extraction - you're asking a model to interpret ambiguous text, and some variance is inevitable. It isn't acceptable for what gets written to the graph, so the variance has to be contained before storage. Rule-based checks handle what schema validation doesn't: referential integrity, duplicate detection, conflict identification against existing graph data, and basic sanity checks on extracted values - a date that falls outside the expected range, an integer that doesn't map to a known category, a measurement that's physically implausible. Models make these kinds of mistakes, and they're cheap to catch deterministically. Together, these checks form a deterministic layer around the probabilistic one, and they're what lets you make claims about graph consistency with any confidence.
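A rough sketch of what such rule-based checks might look like follows. The date bounds, the category codes, the attribute names (`effective_date`, `category`), and the function `check_candidate` are assumptions made for the example, not part of any particular graph store.

```python
from datetime import date

# Hypothetical bounds and codes; real values come from the domain.
EARLIEST, LATEST = date(1900, 1, 1), date(2100, 1, 1)
KNOWN_CATEGORIES = {1, 2, 3}


def check_candidate(node: dict, edges: list[dict], graph_nodes: dict[str, dict]) -> list[str]:
    """Deterministic checks for one extracted node and its edges against existing graph data."""
    problems = []

    # Duplicate and conflict detection against what is already stored.
    existing = graph_nodes.get(node["id"])
    if existing is not None:
        if existing == node:
            problems.append("duplicate: identical node already in graph")
        else:
            problems.append("conflict: node exists with different attributes")

    # Referential integrity: every edge endpoint must be a known node.
    known_ids = set(graph_nodes) | {node["id"]}
    for edge in edges:
        for end in ("source", "target"):
            if edge[end] not in known_ids:
                problems.append(f"dangling reference: {edge[end]}")

    # Sanity checks on extracted values.
    effective = node.get("effective_date")
    if effective is not None and not (EARLIEST <= effective <= LATEST):
        problems.append(f"implausible date: {effective.isoformat()}")

    category = node.get("category")
    if category is not None and category not in KNOWN_CATEGORIES:
        problems.append(f"unknown category code: {category}")

    return problems
```

Because every check is a plain comparison, the same input always yields the same list of problems, which is the property the extraction step itself cannot offer.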

Provenance on everything. Every node and edge written to the graph should carry metadata: which source document it came from, which model version produced it, when it was extracted, whether it passed automated validation, whether a human reviewed it. This feels like overhead until you need to re-process a document because the model changed, or audit a specific claim a downstream system is relying on, or trace a query result back to its source. Provenance is what makes the graph auditable.
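As an illustration, the metadata listed above could be modeled roughly like this. The class and field names are hypothetical, and `stale_documents` is just one example of a question that provenance makes answerable: which documents need re-processing after a model change.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_document: str      # which document the fact came from
    model_version: str        # which model version produced it
    extracted_at: datetime    # when it was extracted
    passed_validation: bool   # result of automated validation
    human_reviewed: bool = False


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    provenance: Provenance


def stale_documents(edges: list[Edge], current_model: str) -> set[str]:
    """Source documents with at least one edge produced by a different model version."""
    return {
        e.provenance.source_document
        for e in edges
        if e.provenance.model_version != current_model
    }


# Example records (document names and model identifiers are made up).
old = Provenance("doc-1.pdf", "model-a", datetime(2025, 1, 1, tzinfo=timezone.utc), True)
new = Provenance("doc-2.pdf", "model-b", datetime(2025, 6, 1, tzinfo=timezone.utc), True, True)
edges = [Edge("x", "AMENDS", "y", old), Edge("y", "DEFINES", "z", new)]
```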

A routing layer for confidence. Not all extractions are equal. Some outputs will be clean, well-formed, and consistent with existing graph data. Others will be ambiguous, partially invalid, or in conflict with something already there. The pipeline should route these differently - high-confidence extractions go straight through, lower-confidence ones go to a review queue. Confidence, in this context, is a composite of schema validation results, conflict detection, and similarity to known-good extractions. Getting this routing right is what makes human review tractable at scale.
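One conservative way to combine those signals is sketched below. The `Signals` fields, the `route` function, and the default threshold of 0.9 are assumptions for illustration; how similarity to known-good extractions is computed is left out, and the threshold is exactly the kind of value that needs tuning per domain.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    schema_violations: int   # count reported by schema validation
    conflicts: int           # conflicts with existing graph data
    similarity: float        # 0.0-1.0 similarity to known-good extractions


def route(signals: Signals, threshold: float = 0.9) -> str:
    """Return "accept" for high-confidence extractions and "review" for everything else."""
    # Any hard failure sends the extraction to the review queue, whatever the similarity.
    if signals.schema_violations > 0 or signals.conflicts > 0:
        return "review"
    return "accept" if signals.similarity >= threshold else "review"
```

Treating violations and conflicts as hard gates rather than as weighted terms keeps the rule easy to explain to reviewers, at the cost of a larger queue early on.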

All of the above results in a layered pipeline, where each layer handles a different category of failure - and the failures are different enough that no single mechanism catches all of them.

The Storage Layer Underneath

When people think about knowledge graphs, they reach for a graph database - Neo4j, GraphDB, or similar - and start building. That's reasonable, but there are a few architectural decisions worth making before you commit to a storage topology.

Graph and relational aren't mutually exclusive. A knowledge graph is the right system of record for interconnected entities and semantic relationships - that's what it's designed for. It's less obviously right for everything else. High-volume filtering, aggregation queries, caching, and workflow state tend to work better in a relational database. In practice, most production systems end up with both: the graph as the authoritative source for structured knowledge, a relational layer for specific access patterns that don't map well to graph traversal. The mistake is assuming one technology handles everything, and discovering the gaps under load.

Performance is non-obvious at scale. Knowledge graph technologies can exhibit surprising behavior depending on workload patterns. Write-heavy scenarios with certain ontologies, complex multi-hop traversals, and large queries can all degrade significantly as data volume grows. Validate your expected workload characteristics against your chosen technology before you're in production, not after. The performance profile of a graph store with ten thousand nodes is not the same as one with ten million, and the differences aren't always predictable from first principles.

The Schema Evolution Problem

Your ontology will change. Plan for it.

The first version of your schema is based on the documents you've seen so far, the use cases you understand now, and the abstractions that made sense when you started. All three of those shift. A new document type surfaces a relationship you hadn't modeled. A domain expert reviews your representation of a concept and tells you it's subtly wrong. A downstream team needs to query the graph in a way that your current structure makes very complicated. Each of these is a reasonable, expected event. The question is whether the system can absorb those changes without falling apart.

The teams that handle this well treat schema changes the same way they treat database migrations. Every change to the ontology is versioned. There's a changelog. Scripts run against the graph when the schema updates, transforming existing data to match the new model. Nothing gets changed informally, because informal changes are the ones that create silent inconsistencies - data written under the old schema sitting alongside data written under the new one, with no record of which is which.
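A minimal sketch of that migration discipline, assuming each stored record carries a `schema_version` field. The two migrations shown (a renamed relationship type and a new mandatory attribute) are invented examples, as are the function names.

```python
def v1_to_v2(record: dict) -> dict:
    # Hypothetical change: relationship type REFERS_TO was renamed to REFERENCES.
    updated = dict(record)
    if updated.get("relation") == "REFERS_TO":
        updated["relation"] = "REFERENCES"
    return updated


def v2_to_v3(record: dict) -> dict:
    # Hypothetical change: every edge gains a mandatory "status" attribute.
    updated = dict(record)
    updated.setdefault("status", "unknown")
    return updated


# Ordered, versioned changelog: migration N upgrades schema version N to N + 1.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def upgrade(record: dict, target_version: int) -> dict:
    """Apply migrations in order until the record reaches target_version."""
    current = dict(record)  # never mutate the stored record in place
    while current["schema_version"] < target_version:
        version = current["schema_version"]
        current = MIGRATIONS[version](current)
        current["schema_version"] = version + 1
    return current
```

Because the version travels with the record, data written under the old schema can always be told apart from data written under the new one.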

The harder question is what happens to data that was extracted under an old schema version. There are two options: re-extract from the source documents using the updated schema, or transform and backfill the existing data in place based on what's already in the graph. Neither is always right. Re-extraction is cleaner but expensive - you're running the full pipeline again, which costs time and compute, and assumes the source documents are still available and unchanged. In-place transformation is faster but riskier, because you're inferring what the new schema would have produced rather than actually producing it. The decision depends on how significant the schema change is and how much trust you have in the transformation logic.

Making It Queryable, and Keeping It That Way

Getting knowledge into the graph is one problem. Making it reliably accessible to the systems that need it is another.

The instinct is to expose the graph directly - give downstream consumers a raw endpoint to the knowledge graph or the underlying data store and let them query. This works fine for internal tooling and exploratory work. It's a bad foundation for anything that needs stability guarantees. Query languages like SPARQL or SQL are expressive, but exposing them directly binds API consumers to your schema - which is exactly what you want to avoid. When the ontology changes - and it will - every query that touched the affected parts breaks. You find out when something downstream stops working, not before.

A versioned API layer sitting in front of the graph solves this. It could be a RESTful API, a search index like Elasticsearch, or any other interface that decouples what consumers see from how the data is actually stored internally. When the schema changes, you version the API - consumers stay on the old version until they're ready to migrate, and breaking changes become manageable rather than catastrophic.
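The decoupling can be sketched without any web framework. In the example below, the internal record shape, the two response contracts, and the names `to_v1`, `to_v2` and `get_entity` are all hypothetical; the point is only that a rename in storage does not leak into the v1 contract.

```python
# Internal storage shape (hypothetical). Suppose a schema change renamed
# "title" to "label" and added "kind"; v1 consumers must not notice.
STORE = {
    "doc-1": {"id": "doc-1", "label": "Data Retention Policy", "kind": "Policy"},
}


def to_v1(node: dict) -> dict:
    # v1 contract: {"id", "title"}, kept stable after the internal rename.
    return {"id": node["id"], "title": node["label"]}


def to_v2(node: dict) -> dict:
    # v2 contract: {"id", "label", "kind"}
    return {"id": node["id"], "label": node["label"], "kind": node["kind"]}


SERIALIZERS = {"v1": to_v1, "v2": to_v2}


def get_entity(api_version: str, entity_id: str) -> dict:
    """Look up an entity and render it in the contract of the requested API version."""
    return SERIALIZERS[api_version](STORE[entity_id])
```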

The Evaluation Problem

A pipeline without an evaluation framework is essentially running on faith. You're assuming the extractions are good because they look reasonable in spot checks. That assumption tends to hold until it doesn't - and when it breaks, it usually breaks quietly, across a large portion of the graph, before anyone notices.

The core problem is that LLM output quality isn't static. It shifts when you change the prompt, update the model, encounter a new document type, or hit edge cases in the source material. Models also get deprecated - often on a shorter timeline than you'd expect - which means you may be forced to swap the underlying model with limited notice. Without systematic measurement, you have no way to know whether a new model produces equivalent results, or whether quality has quietly shifted across your document types.

The practical answer is a curated set of documents with verified expected outputs. You run your pipeline against this set automatically, on every significant change: a prompt update, a model swap, a schema revision. The output tells you whether the pipeline still behaves the way it should, or whether something has shifted. Without it, you're relying on spot checks and intuition, which tend to catch problems after they've already propagated into the graph.
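A bare-bones version of such a check might look like the sketch below. It assumes extractions can be compared as sets of (subject, relation, object) triples and reports micro-averaged precision and recall; the `evaluate` function and the triple representation are illustrative choices, not a standard.

```python
from typing import Callable

Triple = tuple[str, str, str]  # (subject, relation, object)


def evaluate(
    extract: Callable[[str], set[Triple]],
    golden: list[tuple[str, set[Triple]]],
) -> dict:
    """Run the extractor over the golden set and report micro-averaged precision and recall."""
    true_positives = predicted_total = expected_total = 0
    for text, expected in golden:
        predicted = extract(text)
        true_positives += len(predicted & expected)
        predicted_total += len(predicted)
        expected_total += len(expected)
    # With nothing predicted (or nothing expected) there is nothing to get wrong.
    precision = true_positives / predicted_total if predicted_total else 1.0
    recall = true_positives / expected_total if expected_total else 1.0
    return {"precision": precision, "recall": recall}
```

Running this on every prompt update, model swap, or schema revision, and failing the change when either number drops below an agreed floor, turns "it looks reasonable" into something that can be compared across versions.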

The evaluation dataset deserves the same care as the pipeline code. It should be version-controlled, reviewed when the schema changes, and expanded whenever a new edge case reaches the review queue. Every extraction that gets corrected by a human is a candidate for the ground truth set. The dataset compounds in value over time - it's the institutional memory of what "correct" looks like for your specific domain and document types.

Human-in-the-Loop: Where and How Much

Full automation is the right long-term goal. It's almost never the right starting point.

The gap between "the pipeline produces output" and "the output is trustworthy enough to enter the graph without review" is real, and it's different for every domain and document type. Trying to close that gap by tightening the pipeline before you understand where it fails is backwards. You need human review first - not as a permanent state, but as the mechanism that tells you where the pipeline needs work.

The practical model is a confidence-based routing layer. High-confidence extractions - those that pass all validation checks, don't conflict with existing graph data, and closely resemble known-good examples - go straight through. Lower-confidence ones get queued for review. The threshold between the two is something you tune over time, starting conservative and relaxing it as the pipeline matures.

What makes this work in practice is the review interface. Reviewers need to see the source passage alongside the proposed extraction, understand what would change in the graph if they approve it, and make a decision quickly. A poorly designed review interface becomes a bottleneck - slow to use, cognitively expensive, and prone to reviewer fatigue that degrades decision quality.

Every decision a reviewer makes is data. An approved extraction with no changes confirms the pipeline is working for that case. A corrected extraction tells you something more valuable - exactly where the model went wrong, in a form you can add to your ground truth set and use to improve evaluation. The human review queue is, in this sense, your best source of training signal. Teams that treat it as a necessary evil miss this entirely.

Over time, as prompts improve and the evaluation framework catches regressions earlier, the review rate should fall. Track it explicitly. If it plateaus or rises, something in the pipeline has regressed - a new document type is exposing gaps, a model update shifted behavior, or the schema changed in a way the prompts haven't caught up with. The review rate is one of the most honest signals you have about overall pipeline health.

Where This Is Going

The tooling around all of this is improving fast. Models produce better-structured output than they did a year ago. Evaluation frameworks are becoming less bespoke. The case for building these systems is stronger than it's ever been.

What isn't changing is the underlying challenge. A knowledge graph is only as useful as it is consistent and trustworthy, and LLMs - left without the surrounding infrastructure - aren't consistent by nature. The non-determinism is a fundamental property of how these models work, and the engineering response to that is a governance layer: validation, provenance, evaluation, controlled schema evolution, human review where it's needed. That layer is what makes the probabilistic output of a model into something a production system can depend on.

I don't know exactly how this space looks in three years. The models will be better. Some of the validation work we do manually today will probably be automatable. The boundary between what needs human review and what doesn't will move. But the separation of concerns - extraction on one side, governance on the other - feels durable. The problem of turning loosely structured institutional knowledge into something machines can reliably reason over isn't going away, and neither is the need to do it carefully.

Building knowledge graph systems for institutional knowledge is an end-to-end problem - extraction, validation, and evaluation as parts of a single process. There's no other way to build something you can actually depend on.

Migrating an ERP-Driven Storefront Using a Message-Broker Architecture (RabbitMQ)

Kasia Ryniak — Wed, 19 Nov 2025 11:00:21 +0000

Modern e-commerce platforms increasingly rely on modular, API-driven components. ERPs… do not. They are deterministic, slow-moving systems built around the idea that consistency matters more than speed.

Recently, when migrating a custom storefront for one of our clients to Solidus, a complete open source e-commerce solution built with Ruby on Rails, we faced this architectural tension head-on: How do you build a fast, flexible, customer-facing storefront when the ERP must remain the single source of truth for products, stock, and order states?

The problem that arose here was ensuring data consistency between the [Solidus](https://solidus.io/) storefront and the ERP system. It was obvious that event-driven two-way communication between them needed to be established, but the question was how to implement it in a way that would be efficient, maintainable, scalable and fault-tolerant - or, to put it briefly, how to ensure that the communication meets the standards of a modern, well-engineered system.

Our answer to that problem was to design the integration around a message-broker-centric architecture. The message broker of choice in this particular case was RabbitMQ, mainly for the simplicity of integration.

Why message brokers?

Efficiency

Message brokers are software components built to handle large volumes of data, packaged into discrete units called “messages”. Messages are published to the broker and consumed from it by the services subscribed to it (publish/consume/subscribe is RabbitMQ nomenclature; other message brokers name these actions differently, with little functional difference). This is the crux of message brokers - they serve as a middleman that systems can publish data to or consume data from.

Diagram illustrating data flow between clients via message broker.
One client publishes the data to the message broker queue. The client named “Consumer” is subscribed to that specific queue. This way, whenever some message (data package) is published to the subscribed queue, the said client will “consume” (receive) the data.
Note that publish/subscribe/consume is a RabbitMQ-specific nomenclature. Other message brokers name those actions differently but in principle work similarly.

This architecture pattern facilitates event-driven communication between the systems - whenever an event occurs that the other system should know about (e.g. there is a stock movement for particular products on the ERP side, or a user has placed an order on the storefront side), all that is needed is to publish the relevant message to the broker and ensure that the other side is subscribed to it.

Subscription to the queue is defined on the consumer's side. With a properly subscribed client, the message can be consumed and processed accordingly. In our case, we used this mechanism to update the receiving system's internal state right after an update message was sent to the relevant queue.

The inherent ease of implementing event-driven communication that characterizes message brokers is a big part of what makes them a state-of-the-art solution for efficient two-way communication between systems.
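The flow described above can be illustrated with a toy, in-memory stand-in for a broker. This is deliberately not the RabbitMQ client API: `ToyBroker`, the queue name `stock_updates`, and the message fields are all invented for the sketch, which only shows the publish/subscribe/consume roles and the buffering of messages published before a consumer connects.

```python
from collections import defaultdict, deque
from typing import Callable


class ToyBroker:
    """In-memory stand-in for a message broker; NOT the RabbitMQ client API."""

    def __init__(self) -> None:
        self.queues: dict[str, deque] = defaultdict(deque)
        self.subscribers: dict[str, Callable[[dict], None]] = {}

    def publish(self, queue: str, message: dict) -> None:
        self.queues[queue].append(message)
        self._deliver(queue)

    def subscribe(self, queue: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[queue] = handler
        self._deliver(queue)  # drain anything published before the consumer connected

    def _deliver(self, queue: str) -> None:
        handler = self.subscribers.get(queue)
        while handler is not None and self.queues[queue]:
            handler(self.queues[queue].popleft())


stock: dict[str, int] = {}  # storefront-side state, keyed by SKU (hypothetical)


def on_stock_update(message: dict) -> None:
    # The consumer updates its internal state as soon as a message arrives.
    stock[message["sku"]] = message["quantity"]


broker = ToyBroker()
broker.publish("stock_updates", {"sku": "ABC-1", "quantity": 5})  # consumer not connected yet
broker.subscribe("stock_updates", on_stock_update)                # buffered message is delivered
broker.publish("stock_updates", {"sku": "ABC-1", "quantity": 4})  # delivered immediately
```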

Maintainability

Message brokers allow easy logical division of data streams (via queues, or - on a higher level - virtual hosts). This makes it straightforward to integrate numerous systems with one message broker while keeping their messages logically separated. It also makes the setup easy to extend: you can integrate new services/systems by creating new queues/virtual hosts.

Diagram illustrating the logical division of data streams facilitated by message brokers

Virtual hosts serve as logical environments inside a broker that provide isolation between systems sharing broker infrastructure. Queues serve as logical data buckets, isolating data between services/components inside one system.

The diagram above illustrates an example architecture using mechanisms for logical separation of data streams provided by message brokers:

Virtual hosts - ideal for isolating data streams between systems that share a broker’s infrastructure. For example, one ERP system can be connected to two separate stores via a single broker instance with two dedicated virtual hosts; this ensures that one store cannot access the data intended for the other.
Queues - logical containers for messages of a specific type (e.g., product updates or new post comments). They act as the data buckets that subscribers connect to and separate data between components within a system (e.g., two services for product updates and user updates connected to two separate queues, “products” and “users”).

This logical division of data streams into virtual hosts and queues simplifies system organization and isolation, making it easier to maintain and extend complex projects.

Another upside of message broker queues, in terms of data flow, is that they implement a FIFO (First In, First Out) mechanism, delivering messages in the order they were received. This provides predictable and consistent data processing, helps prevent race conditions, and maintains the logical sequence of events across distributed systems. Note that the ordering holds per queue: with several consumers competing on one queue, or when messages are redelivered, the order in which processing finishes can differ.

There is an additional advantage when it comes to maintainability of message broker architecture, appreciated by any pragmatic software developer - namely the wide range of existing libraries and client integrations. RabbitMQ, for instance, provides integration libraries for virtually every popular web framework and programming language, providing seamless connectivity and simplifying development across diverse tech stacks.

Scalability

Message brokers scale well both horizontally (more instances of broker services) and vertically (more resources assigned to one instance), which makes them a good choice for dynamic and growing systems.

Horizontal scaling needs to be integrated with a load balancing layer or be supported by the specific message broker via cluster awareness mechanisms (so that the client can connect to the broker instance, and the messages are correctly routed internally within the cluster of brokers). Obviously this comes with additional complexity on the configuration side. That being said, in most systems vertical scaling will be absolutely enough, as message brokers are highly efficient in message processing and can generally withstand heavy throughput on one instance.

The logical separation described above also makes it easy to set up mirrored services. That was the case in our project, where we had to set up three separate instances of the eCommerce storefront, all connected to the same ERP. We simply created new virtual hosts inside the RabbitMQ instance and connected the mirrored web server instances to them. Since a client's attempt to publish to or subscribe to a queue is enough to declare it, connecting a mirrored service (with a shared codebase) automatically takes care of defining the queue structure.

Fault tolerance

Message brokers store messages in their internal memory until they are successfully consumed by the designated service. This improves the fault tolerance of the communication, because messages do not get lost when, for example, the target service is temporarily down. Most message brokers also support message persistence, so that messages survive a failure or temporary outage of the broker itself, as well as consumption acknowledgment mechanisms, which let consumers control whether or not a given message should be removed from the queue. This can be used, for example, to keep messages that caused a consumer to crash while processing them.

In traditional flows based on API or webhooks, if the receiving service is unavailable, the request simply fails, and the data may be lost unless additional retry logic is implemented. In contrast, message brokers buffer and persist messages, allowing the consumer to process them later, once it’s back online. This makes them a far more robust solution for asynchronous, fault-tolerant communication between distributed systems.
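The acknowledgment idea can be shown with another toy, in-memory sketch. Again, this is not the RabbitMQ client API: `AckQueue` and its `get`/`ack`/`nack` methods are hypothetical names that only mirror the concept of a message staying with the broker until the consumer confirms it was processed.

```python
from collections import deque
from typing import Optional


class AckQueue:
    """Toy queue with manual acknowledgments; NOT the RabbitMQ client API."""

    def __init__(self) -> None:
        self.ready: deque = deque()
        self.unacked: dict[int, dict] = {}
        self._next_tag = 0

    def publish(self, message: dict) -> None:
        self.ready.append(message)

    def get(self) -> Optional[tuple[int, dict]]:
        """Hand out the next message; it stays tracked until acked or nacked."""
        if not self.ready:
            return None
        self._next_tag += 1
        self.unacked[self._next_tag] = self.ready.popleft()
        return self._next_tag, self.unacked[self._next_tag]

    def ack(self, tag: int) -> None:
        del self.unacked[tag]  # processed successfully: remove for good

    def nack(self, tag: int) -> None:
        self.ready.appendleft(self.unacked.pop(tag))  # failed: put back at the front


queue = AckQueue()
queue.publish({"order_id": 42})

tag, message = queue.get()
queue.nack(tag)            # simulate a consumer that failed while processing
tag, message = queue.get() # the same message is handed out again
queue.ack(tag)             # this time processing succeeded
```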

Lessons Learned (for teams building ERP-driven e-commerce systems)

This project reinforced a pattern we see frequently in ERP-centric e-commerce environments: the technical challenges are rarely about the tools themselves. They stem from how systems think, how they communicate, and how teams design the boundaries between them. Below are our lessons:

1. Event models are more important than queues

A message broker won’t save a poorly designed event contract. Invest time in defining:

event schema
versioning
idempotency
domain boundaries
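As a small illustration of the first three points, here is a sketch of a versioned event with an idempotent consumer. The event name `StockChanged`, its fields, and the `StockConsumer` class are assumptions for the example; in a real system the set of already-seen event IDs would live in durable storage rather than in memory.

```python
from dataclasses import dataclass

SUPPORTED_VERSIONS = {1}  # schema versions this consumer understands


@dataclass(frozen=True)
class StockChanged:
    event_id: str        # unique per event; used for deduplication
    schema_version: int  # lets producers evolve the contract explicitly
    sku: str
    quantity: int


class StockConsumer:
    def __init__(self) -> None:
        self.stock: dict[str, int] = {}
        self.seen: set[str] = set()

    def handle(self, event: StockChanged) -> bool:
        """Apply the event once; return False if it was a redelivery and got skipped."""
        if event.schema_version not in SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported schema version {event.schema_version}")
        if event.event_id in self.seen:
            return False  # already applied, so a redelivery changes nothing
        self.stock[event.sku] = event.quantity
        self.seen.add(event.event_id)
        return True
```

With this shape, a broker redelivering the same event after a crash or a missed acknowledgment cannot overwrite a newer stock value with an older one.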

2. Monitoring is non-negotiable

Most issues in distributed systems aren’t failures but rather slowdowns. RabbitMQ metrics let us spot:

backlog accumulation
slow consumers
malformed messages
retries and dead-letter queues

3. ERP constraints should drive architectural decisions, not vice versa

ERPs are deterministic, linear, often slow, and sometimes unavailable. Storefronts are fast, parallel, and customer-facing. The message broker acts as a translator between these fundamentally different worlds. Designing with this asymmetry in mind (instead of pretending both systems operate at the same tempo) leads to more stable and predictable commerce operations.

4. Broker-backed integration accelerates future modernization

Once event-driven messaging is in place, you can:

add new microservices
replace parts of the storefront
extend ERP responsibilities
integrate PIM, OMS, CRM without rewriting the whole integration layer.

This is why message brokers are becoming standard in composable commerce architecture.

Conclusion

Migrating to a Solidus storefront and connecting it tightly to an existing ERP taught us, once again, that message brokers can be a foundational architectural element for e-commerce systems where reliability, extensibility, and operational resilience matter.

And if the ERP is the system of record, as it often is, treating the message broker as the backbone of your data synchronization is a sustainable approach that we've seen consistently survive real-world operational complexity.