<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tae Kim</title>
    <description>The latest articles on DEV Community by Tae Kim (@hannune).</description>
    <link>https://dev.to/hannune</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3678969%2F0d047df5-2d0b-45b9-bbde-b4dbad88e550.png</url>
      <title>DEV Community: Tae Kim</title>
      <link>https://dev.to/hannune</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hannune"/>
    <language>en</language>
    <item>
      <title>I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.</title>
      <dc:creator>Tae Kim</dc:creator>
      <pubDate>Sat, 13 Jun 2026 02:19:42 +0000</pubDate>
      <link>https://dev.to/hannune/i-built-a-local-llm-rig-to-escape-api-bills-then-i-paid-openai-again-4bi7</link>
      <guid>https://dev.to/hannune/i-built-a-local-llm-rig-to-escape-api-bills-then-i-paid-openai-again-4bi7</guid>
      <description>&lt;p&gt;I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.&lt;/p&gt;

&lt;p&gt;The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.&lt;/p&gt;

&lt;p&gt;Quick results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Gemma 4 26B on llama.cpp&lt;/strong&gt; (RTX 4090 + W6800): live serving fine. Batch lane blocked. vLLM has no 4-bit MoE path I need, container wants CUDA 12.9, host driver is 12.8. &lt;code&gt;GGML_CUDA_DISABLE_GRAPHS=1&lt;/code&gt; keeps llama.cpp alive when graph optimizer segfaults.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt;: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini batch SDK&lt;/strong&gt;: silently inline-concatenates documents into one context. Cross-document leak. Neo4j rollback. Upstream &lt;code&gt;googleapis/python-genai&lt;/code&gt; issue 1984 is not-planned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Batch&lt;/strong&gt; (&lt;code&gt;gpt-5.4-mini&lt;/code&gt;): JSONL line-isolated, 50 percent off, 100-doc nano gate in 2.7 min, zero 429s, around 1 cent per document.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The local rig stays for live serving, ER API LLM gate, multimodal, and ablations. The batch lane moves to OpenAI.&lt;/p&gt;

&lt;p&gt;Full retrospective with the side-by-side table: &lt;a href="https://hannune.ai/blog/local-llm-to-openai-batch.html" rel="noopener noreferrer"&gt;https://hannune.ai/blog/local-llm-to-openai-batch.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>openai</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Cross-Lingual Entity Resolution in a Trade Knowledge Graph: Adding 39,534 Aliases to 6,883 Nodes</title>
      <dc:creator>Tae Kim</dc:creator>
      <pubDate>Sun, 07 Jun 2026 05:37:11 +0000</pubDate>
      <link>https://dev.to/hannune/cross-lingual-entity-resolution-in-a-trade-knowledge-graph-adding-39534-aliases-to-6883-nodes-34o3</link>
      <guid>https://dev.to/hannune/cross-lingual-entity-resolution-in-a-trade-knowledge-graph-adding-39534-aliases-to-6883-nodes-34o3</guid>
      <description>&lt;p&gt;In the previous post, I described how 2asy.ai moved from plain vector search to a cross-domain ontology Graph RAG that resolves entities across documents and traverses causal chains. That post ended with an honest note: the graph is sparse, and it will get denser. This post is about one specific dimension of that density problem, and how I addressed it this week.&lt;/p&gt;

&lt;p&gt;The problem is not just that the graph needs more documents. The problem is that the same real-world entity arrives in multiple languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Cross-Lingual Mentions Break a Knowledge Graph
&lt;/h2&gt;

&lt;p&gt;2asy.ai reads Asian trade and tariff news published in English, but the underlying entities, companies, regulators, ministries, and policymakers, are often referenced in their local-language forms inside those articles. A Korean article about trade policy mentions 한국은행. A Chinese article about semiconductors mentions 三星电子. A Japanese article on auto tariffs mentions トヨタ自動車. And a wire story covering the same events writes Bank of Korea, Samsung Electronics, and Toyota Motor Corporation.&lt;/p&gt;

&lt;p&gt;If entity resolution only handles English strings, each foreign-language mention becomes a separate node. The graph now has 한국은행 and Bank of Korea as two distinct entities. Any causal path that crosses the language boundary breaks. A tariff chain that starts with a Korean regulator's decision and ends at a multinational company's earnings call will never close.&lt;/p&gt;

&lt;p&gt;This is the specific failure mode: the graph looks connected per language and disconnected across languages. Cross-domain ontology gets you over the per-document barrier. Cross-lingual resolution gets you over the per-language barrier. They are separate problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Filling the Registry with Alias Tables, Not Rewriting Code
&lt;/h2&gt;

&lt;p&gt;The entity registry at the center of 2asy.ai already had the mechanism. When a new mention arrives during extraction, the registry resolves it against every known surface form for each canonical entity. If the mention matches any alias, it maps to that entity's canonical ID. The registry was already doing this for English variants (partial names, abbreviations, alternate spellings). It just had no non-English entries.&lt;/p&gt;

&lt;p&gt;So the change was to fill the alias tables, not to redesign the resolution layer.&lt;/p&gt;

&lt;p&gt;The registry holds 13,371 canonical entities. Of those, 6,883 had at least one non-English surface form worth resolving, so I added ko, ja, and zh alias lists for them. The total additions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ko (Korean): 13,757 aliases&lt;/li&gt;
&lt;li&gt;ja (Japanese): 12,615 aliases&lt;/li&gt;
&lt;li&gt;zh (Chinese): 13,162 aliases&lt;/li&gt;
&lt;li&gt;Total: 39,534 aliases across 6,883 entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Almost all of them made it in. The pipeline flagged two small categories for exclusion before writing to the registry:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collision:&lt;/strong&gt; 3 alias candidates already resolved to a different canonical entity. All three were the abbreviation &lt;code&gt;CENTCOM&lt;/code&gt;, generated as a Korean, Japanese, and Chinese alias for United States Central Command, which was already mapped to an existing US Central Command node. Writing them would have made one string resolve to two IDs, so they were skipped automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low confidence:&lt;/strong&gt; 5 candidates were held in a staging area rather than written, because the source could not confirm a verified local-language form. For example, LG Electronics has confirmed Korean forms (LG전자, 엘지전자) but no verified official Japanese or Simplified Chinese legal name was found, so its non-Korean entries scored zero and were held instead of guessed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The excluded set is tiny next to the 39,534 accepted, but it is the part that matters: the collision check and confidence gate exist specifically to keep a wrong alias from silently corrupting every mention it touches.&lt;/p&gt;

&lt;p&gt;After writing the accepted aliases, my local ER registry resolves these on exact lookup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;한국은행 (and the Japanese 韓国銀行, Chinese 韩国银行) resolves to the canonical Bank of Korea entity.&lt;/li&gt;
&lt;li&gt;三星电子 (and the Korean 삼성전자 주식회사) resolves to Samsung Electronics.&lt;/li&gt;
&lt;li&gt;미국 상무부 (and the Japanese 米国商務省, Chinese 美国商务部) resolves to United States Department of Commerce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before this change, each foreign-language form would have produced no match and been written to the graph as a new, orphaned node.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for the Causal Graph in 2asy.ai
&lt;/h2&gt;

&lt;p&gt;The effect, once this alias layer is rolled into the live graph, is that Korean, Japanese, and Chinese entity mentions in trade news resolve to the same canonical nodes as their English counterparts. When a Korean regulator, a Chinese manufacturer, and an American import duty appear in the same causal chain, the chain can close across all three languages instead of fragmenting at each language boundary.&lt;/p&gt;

&lt;p&gt;This matters most for the cross-document chains. The whole point of moving from per-article graphs to a shared ontology was to let causality span the full corpus. A cause in one article connects to an effect in another. That connection only works if both articles are writing about the same canonical entity. Without cross-lingual resolution, the connection breaks the moment the two articles use different language forms for the same entity.&lt;/p&gt;

&lt;p&gt;Cross-lingual entity resolution is not a cosmetic feature. In a multilingual news corpus, it is a prerequisite for the graph to be coherent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Harder Lesson About Entity Resolution in Graph RAG
&lt;/h2&gt;

&lt;p&gt;Running 2asy.ai has made the priority order clear to me.&lt;/p&gt;

&lt;p&gt;Extraction is the part that looks impressive. You throw a document at a language model, it returns entities and relations, and the graph grows. It is satisfying to watch. But extraction quality has a ceiling set by what comes after it: resolution. If two mentions that should be the same node stay as two nodes, extraction accuracy does not matter. The chain is broken regardless.&lt;/p&gt;

&lt;p&gt;Entity resolution in Graph RAG is harder than most write-ups acknowledge. The English-only case is already non-trivial: abbreviations, partial names, acquired companies that change their names, subsidiaries that share names with parent entities. Add cross-lingual surface forms and the problem surface expands significantly.&lt;/p&gt;

&lt;p&gt;The registry-with-aliases approach I use in 2asy.ai is a deterministic layer on top of the probabilistic extraction. Extraction guesses what is in the text. The registry decides what that thing resolves to. Keeping those two responsibilities separate makes the system easier to debug and correct: if a mention is resolving wrong, I fix the registry, not the extraction model.&lt;/p&gt;

&lt;p&gt;39,534 aliases is a lot of entries. It is also a manageable data problem. The hard part is the collision detection and confidence gating, because a bad alias silently corrupts every mention it touches. Only 8 candidates were excluded this pass, 3 collisions and 5 held, but those were the gate doing exactly its job: an abbreviation that already belonged to another entity, and entities with no verifiable local-language name. As the corpus grows, that excluded set will grow with it, and it is the set that needs the most care.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Still Missing
&lt;/h2&gt;

&lt;p&gt;Cross-lingual resolution is one layer. There are still open problems.&lt;/p&gt;

&lt;p&gt;The 5 held candidates need review, and more low-confidence cases will surface as the corpus grows. Some held entries are genuine cases that should stay held, and some are probably correct but scored cautiously because the source did not have enough context. Working through that set as it grows will expand coverage further.&lt;/p&gt;

&lt;p&gt;The 3 collision candidates need manual triage. A collision means two canonical entities share a surface form, which is either a real-world ambiguity (two companies with similar names, a ministry that was renamed) or a registry error (two canonical entries that should be merged). The CENTCOM case is the first example: it points to a likely duplicate canonical entry that should be reconciled.&lt;/p&gt;

&lt;p&gt;And the registry has no coverage for Arabic, Russian, or Southeast Asian language forms yet. Those are smaller portions of the current corpus but not zero.&lt;/p&gt;

&lt;p&gt;The alias-filling step this week was the first pass. It is enough to close the most common cross-lingual gaps. The rest follows as the pipeline keeps running.&lt;/p&gt;




&lt;p&gt;This alias layer is validated on my local ER instance. Rolling it into the live 2asy.ai graph is the next step, once the held and collision cases are triaged. The public graph is at &lt;a href="https://www.2asy.ai/" rel="noopener noreferrer"&gt;https://www.2asy.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If cross-lingual entity resolution or Graph RAG pipelines are something you are working on, the entity resolution service is at &lt;a href="https://api.hannune.ai/entity-resolution/v1" rel="noopener noreferrer"&gt;https://api.hannune.ai/entity-resolution/v1&lt;/a&gt;. That public endpoint serves the core resolver; the multilingual alias layer described here is validated locally and not yet loaded into it.&lt;/p&gt;

&lt;p&gt;The original post: &lt;a href="https://hannune.ai/blog/cross-domain-ontology-graph-rag.html" rel="noopener noreferrer"&gt;From Vector Search to a Cross-Domain Ontology Graph&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>knowledgegraph</category>
    </item>
    <item>
      <title>From Vector Search to a Cross-Domain Ontology Graph: How 2asy.ai Reads Tariff News</title>
      <dc:creator>Tae Kim</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:18:24 +0000</pubDate>
      <link>https://dev.to/hannune/from-vector-search-to-a-cross-domain-ontology-graph-how-2asyai-reads-tariff-news-9kh</link>
      <guid>https://dev.to/hannune/from-vector-search-to-a-cross-domain-ontology-graph-how-2asyai-reads-tariff-news-9kh</guid>
      <description>&lt;p&gt;A new tariff briefing went up on 2asy.ai this week, and for the first time you can see the graph it was built from, right there on the page. That graph is the visible end of a quiet rebuild. The retrieval behind 2asy.ai went from plain vector search, to a simple per-article graph, to a cross-domain ontology Graph RAG. This is what each step bought, and what is still missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where 2asy.ai started: plain vector RAG
&lt;/h2&gt;

&lt;p&gt;The first version of 2asy.ai was ordinary vector RAG. I chunked each trade and tariff article, embedded the chunks, and retrieved by similarity. For a question like "what is happening with steel duties," that works. The system finds passages that look like the question and hands them to a model to summarize.&lt;/p&gt;

&lt;p&gt;What vector RAG cannot do is answer "why." Similarity retrieval finds text that resembles your query. It does not know that a sunset review on one product is connected to an antidumping order on another, or that an action against one country pulls in suppliers in a second. Each chunk sits alone. The causal structure that makes trade news worth reading is exactly the thing embeddings throw away.&lt;/p&gt;

&lt;h2&gt;
  
  
  The next step: a simple per-article graph
&lt;/h2&gt;

&lt;p&gt;So I moved to Graph RAG. For each article I extracted entities, events, and the relations between them, and stored them as a small graph instead of a bag of chunks. This was a real improvement. Inside a single article you could now follow a chain: this investigation led to this duty, which affected this set of producers.&lt;/p&gt;

&lt;p&gt;The limit showed up between articles. Each document produced its own little graph, and those graphs did not talk to each other. The "South Korea" mentioned in a steel story and the "South Korea" in a tire story were two unrelated nodes. There was no shared vocabulary of entity types and relation types, so the system could not connect a cause reported in one article to its effect reported in another. The graph was per-document, not per-world.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cross-domain ontology Graph RAG changes
&lt;/h2&gt;

&lt;p&gt;The current version of 2asy.ai runs on a shared ontology. Every entity, event, and relation is extracted against the same fixed set of types and the same canonical relation vocabulary, and entities are resolved across documents so that the same real-world thing becomes one node no matter how many articles mention it. That is the cross-domain part. A "Sunset Review" is the same Sunset Review whether it shows up in a methionine story, a tire story, or a steel story, and the edges between events can now cross from one domain to another.&lt;/p&gt;

&lt;p&gt;You can see the result on the latest briefing. The causal map for the June 2 story, "US Trade Remedies Expand Amid Global Investigations," puts a Sunset Review node at the center as the root, with directed relations fanning out to the entities it touches. Some edges carry qualifiers like "via South Korea" or "via Taiwan," which is the system recording the path a cause took, not just that two things are related. The full extraction for that one story is around 41 nodes and 60 edges, and the view focuses on the root and the bridge nodes so the chain stays readable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The graph is thin right now, and that is expected
&lt;/h2&gt;

&lt;p&gt;If you open it today, the graph will look sparse. For this story it is a few dozen nodes, and some of them are still coarse: a node typed as "unknown" rather than placed cleanly in the ontology, or an entity that is more general than it should be. I want to be honest about that rather than hide it.&lt;/p&gt;

&lt;p&gt;This is a data-accumulation problem, not a design ceiling. A cross-domain graph gets better the more documents flow through it, because resolution and typing both improve with volume. The first time an entity appears it has little context, so it lands as a weak or untyped node. The tenth time it appears, across different stories, it has enough surrounding structure to resolve confidently and connect to the rest of the world. The shape is already correct. What it needs is time and throughput, and both are accruing as the pipeline keeps running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built it this way
&lt;/h2&gt;

&lt;p&gt;The whole point of 2asy.ai is causal chains: not "here is some news that matches your query," but "here is how this trade action connects to that one, and to the producers and countries in between." Vector RAG cannot represent that. A per-article graph can represent it inside one story but not across the corpus. A cross-domain ontology graph is the first version where the connections can actually span the whole body of news, which is where the interesting causality lives.&lt;/p&gt;

&lt;p&gt;All of this runs on local hardware, an RTX 4090 and an AMD W6800, with open models doing the extraction and resolution. There is no cloud inference bill behind it. If you want to see where it stands, the latest briefing and its graph are live at &lt;a href="https://www.2asy.ai/" rel="noopener noreferrer"&gt;https://www.2asy.ai/&lt;/a&gt; . It is early, and it will get denser as the corpus grows.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>knowledgegraph</category>
    </item>
    <item>
      <title>How I Caught My LLM Fabricating Its Own Evidence</title>
      <dc:creator>Tae Kim</dc:creator>
      <pubDate>Sat, 30 May 2026 17:00:14 +0000</pubDate>
      <link>https://dev.to/hannune/how-i-caught-my-llm-fabricating-its-own-evidence-1i8d</link>
      <guid>https://dev.to/hannune/how-i-caught-my-llm-fabricating-its-own-evidence-1i8d</guid>
      <description>&lt;p&gt;The language model behind my Graph RAG pipeline did something worse than getting a fact wrong. It fabricated the evidence. Each relation it extracted carried a quote that was supposed to come straight from the source article, and many of those quotes had never been written. They read perfectly. They did not exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does fabricated evidence mean in a knowledge graph?
&lt;/h2&gt;

&lt;p&gt;I am building the seed knowledge graph for 2asy.ai, a causal-chain intelligence system over trade and tariff news. Every relation and event in the graph carries an evidence field: the exact sentence from the source document that justifies it. That evidence is the whole point. It is what lets me, or a reader, trace a claim back to where it came from instead of trusting the model on faith.&lt;/p&gt;

&lt;p&gt;The problem is that I was asking a language model to produce that evidence by quoting the source. And a language model is a text generator, not a copier. When I checked the evidence against the original articles, a large share of the quotes were not verbatim. They were fluent, on-topic, and invented.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ellipsis was the tell
&lt;/h2&gt;

&lt;p&gt;The clearest pattern was the ellipsis. The model would take two sentences from completely different parts of an article, drop a &lt;code&gt;...&lt;/code&gt; between them, and present the result as one continuous quotation. The seam looked like a normal editorial cut. It was not. It was two unrelated fragments fused to manufacture support for a relation the model had already decided to extract.&lt;/p&gt;

&lt;p&gt;This is the dangerous kind of hallucination, because it is shaped exactly like real evidence. A wrong fact stands out. A fabricated quote that paraphrases something true reads as completely credible until you go back to the source and search for it character by character.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why truncating the input made it worse
&lt;/h2&gt;

&lt;p&gt;I had been doing something that looked harmless: truncating each article body to a few thousand characters before extraction, to stay inside a comfortable context window. That truncation was quietly licensing the fabrication. When the sentence that actually supported a relation sat past the cutoff, the model did not refuse. It filled the gap with a confident reconstruction of what the missing text probably said.&lt;/p&gt;

&lt;p&gt;So I removed the truncation entirely and switched the collectors to full-body-or-skip: either the pipeline has the complete article text, or it does not process that document at all. A partial document is more dangerous than a missing one, because a partial document still produces output, and the output looks finished.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: check the quote, do not grade it
&lt;/h2&gt;

&lt;p&gt;The fix is almost embarrassingly simple, and that is the point. At commit time, before any relation is written to the graph, I check that its evidence string appears as an exact substring of the source document. If the quote is not literally in the text, the relation is rejected. No fuzzy matching, no second model asked to judge whether the evidence is good enough.&lt;/p&gt;

&lt;p&gt;The instinct in this situation is to reach for another language model to verify the first one. I think that instinct is usually wrong. If you can check an output with a deterministic string operation, do that instead of grading one generator with another generator. A substring test cannot be talked into a plausible answer. It is cheap, it is exact, and it cannot hallucinate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up what had already shipped
&lt;/h2&gt;

&lt;p&gt;The guard stops new fabrication, but it does not undo the relations already sitting in the graph. So I ran the same substring check backward over everything that had already been committed. I reverted 122 documents whose evidence had been stitched together from separate sentences, then cleaned out hundreds more whose quotes simply did not match their source, more than 500 documents in total across the cleanup passes.&lt;/p&gt;

&lt;p&gt;That number is the real cost of having trusted generated evidence in the first place. Every one of those documents had passed through a pipeline that ran cleanly and produced output that looked correct. The volume of the cleanup is a measure of how convincing fabricated evidence is when nobody checks it against the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson: store pointers, verify with code
&lt;/h2&gt;

&lt;p&gt;If you are extracting structured claims from text with a language model, treat any quote it gives you as a hypothesis, not a fact, until a string operation confirms it. Store evidence as something you can verify, a span or an exact substring, not as free text the model is trusted to have copied faithfully.&lt;/p&gt;

&lt;p&gt;All of this runs on local hardware, an RTX 4090 and an AMD W6800, with open models doing the extraction. The substring guard adds no model calls and no cloud cost. It is a few lines of code standing between a graph you can trust and a graph that quietly lies to you in complete sentences.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Real Work in Graph RAG Is Not Extraction</title>
      <dc:creator>Tae Kim</dc:creator>
      <pubDate>Fri, 29 May 2026 03:00:34 +0000</pubDate>
      <link>https://dev.to/hannune/the-real-work-in-graph-rag-is-not-extraction-1mc4</link>
      <guid>https://dev.to/hannune/the-real-work-in-graph-rag-is-not-extraction-1mc4</guid>
      <description>&lt;p&gt;Extraction is the easy part of Graph RAG. I learned this building the seed knowledge graph for 2asy.ai. The pipeline ran, the numbers looked fine, and the graph was still broken. The real work was normalization, and it took far longer than extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does a freshly extracted knowledge graph look fine but break?
&lt;/h2&gt;

&lt;p&gt;I built a seed knowledge graph for 2asy.ai from roughly 450 documents, extracting more than 3,000 entities and over 1,000 events. The extraction pipeline ran cleanly. The counts looked healthy. Then I actually looked at the graph, and it was not navigable.&lt;/p&gt;

&lt;p&gt;A knowledge graph is only useful when you can walk it: follow an entity to an event, an event to its cause, a cause to the entity behind it. Mine could not be walked, because the same real-world idea had been recorded under many different names. The numbers measured volume, not consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The relation layer: how 360 labels became 80 canonical types
&lt;/h2&gt;

&lt;p&gt;The relation types were the worst offender. Over 360 distinct labels had accumulated. The language model named the same structural idea differently depending on the article it was reading at the time. &lt;code&gt;caused_by&lt;/code&gt;, &lt;code&gt;is_caused_by&lt;/code&gt;, and &lt;code&gt;was_caused_by&lt;/code&gt; were three labels for one relationship. &lt;code&gt;triggers&lt;/code&gt;, &lt;code&gt;trigger&lt;/code&gt;, and &lt;code&gt;triggered_by&lt;/code&gt; were three more.&lt;/p&gt;

&lt;p&gt;A graph with 360 relation types is not a knowledge graph. It is an expensive document store. So I stopped extracting and went backward. I mapped every relation type down to 80 canonical forms and renamed the edges, hundreds of them. The graph shrank in variety and became navigable for the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The entity layer: why "aluminium" and "aluminum" break a causal chain
&lt;/h2&gt;

&lt;p&gt;The entity layer had a different failure. "Aluminium", "aluminum", and "ALUMINUM" were three separate nodes with zero edges connecting them. A causal chain breaks cleanly at the first spelling inconsistency, because the graph does not know the three nodes are the same metal.&lt;/p&gt;

&lt;p&gt;I merged the duplicates. I also fixed evidence substrings so each relation points back to the exact sentence that produced it, not a paraphrased approximation. Evidence that does not match its source is evidence you cannot trust later.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you stop the cleanup from happening twice?
&lt;/h2&gt;

&lt;p&gt;Cleaning once is not enough if the next extraction run reinvents the mess. So I added validation that enforces the 80-type vocabulary at extraction time. New runs must use an existing canonical relation type instead of inventing a new label. The cleanup becomes a one-time cost, not a recurring tax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the whole pipeline on local hardware
&lt;/h2&gt;

&lt;p&gt;All of this ran on local hardware: an RTX 4090 and an AMD W6800, with Gemma and Qwen models handling the language work. There was no cloud bill for the extraction, the normalization, or the validation. For a one-person operation, owning the inference cost structure is what makes a daily Graph RAG pipeline affordable to run at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson: plan for cleanup before extraction
&lt;/h2&gt;

&lt;p&gt;If you are building Graph RAG, plan for the cleanup phase before you start the extraction phase. The language model will produce plausible output. The graph will look dense. The counts will feel satisfying. None of that means the graph is usable.&lt;/p&gt;

&lt;p&gt;A knowledge graph becomes useful only when the entity and relation layers are consistent enough to walk. In my case, extraction took days and normalization took weeks. The seed graph for 2asy.ai now runs causal-chain queries against 80 canonical relation types, with every edge linked back to its source sentence.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://hannune.ai/blog/graph-rag-normalization.html" rel="noopener noreferrer"&gt;hannune.ai&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>rag</category>
      <category>knowledgegraph</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
