DEV Community: Tae Kim

15 AI Agent Frameworks, One Side-by-Side Table

Tae Kim — Wed, 01 Jul 2026 17:26:16 +0000

Four production agent projects in the last two years. Four different frameworks.

LangGraph for a stateful multi-step pipeline with human review gates at critical decision points. CrewAI for a research workflow that needed role-based task delegation. Pydantic AI for typed tool calls behind a thin API wrapper. OpenAI Agents SDK for one that was going to live inside the OpenAI runtime anyway.

Every one of those projects started the same way: two weeks of reading documentation, building toy demos, and trying to understand whether the problem called for a graph, a crew, a typed tool loop, or something else entirely.

The decision I kept failing to make fast was framework selection. Not because the information was hidden, but because it was scattered. GitHub stars on one site. Benchmark numbers on another, measuring canned tasks I did not recognize from real production. Marketing pages using language shaped to attract everyone rather than describe anything specific.

What I wanted was a table. Control style: graph, role crew, typed tool, conversational. State model: how the framework holds and passes state between steps. License. What the framework is actually shaped for. A liveness signal so I could tell whether the project was maintained. One row per framework. All in one place.

That table did not exist, so I built it.

What the table covers

15 frameworks at launch: LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents SDK, Mastra, LangChain Agents, LlamaIndex Agents, Semantic Kernel, Haystack Agents, Smolagents, Atomic Agents, Phidata, DSPy, AG2.

Each row has a tagline I would write to a friend making the selection decision, not the one from the project marketing page. Control style. State model. License. What the framework is shaped for. GitHub stars as a rough liveness proxy.

The full directory: https://compare-lab.xyz/ai-agent-frameworks/

Why not a benchmark

I do not believe the public agent benchmarks right now. They measure tool-call success on small canned tasks and miss the three things that actually determine which framework fits a real project.

The first is state-model fit. If your task requires explicit state that survives across steps and branches conditionally, you need LangGraph or Semantic Kernel. If your task is naturally decomposable into independent roles, CrewAI or multi-agent AutoGen patterns fit better. Picking the wrong state model means fighting the framework rather than building with it.

The second is abstraction escape. Every framework has an abstraction layer designed for the common case. Production agents hit edge cases constantly. What matters is whether you can break out of the abstraction cleanly when you need to, without rewriting the whole pipeline. Some frameworks make this easy. Some make it very hard.

The third is failure recovery. What does the framework actually do when a tool call fails halfway through a long-running run? Can you retry from the checkpoint? Can you log the partial state and restart at step three? These questions are invisible in toy demos and critical in production.

Contributing

The data file is open. If you have shipped on one of these in production and a row gets a detail wrong, a correction is a one-line PR. If you have shipped on a framework not yet listed, send me the slug.

The longer write-up on why I went with a directory format instead of a benchmark: https://hannune.ai/blog/why-i-built-ai-agent-framework-compare.html

I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

Tae Kim — Sat, 13 Jun 2026 02:19:42 +0000

I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.

The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.

Quick results.

Local Gemma 4 26B on llama.cpp (RTX 4090 + W6800): live serving fine. Batch lane blocked. vLLM has no 4-bit MoE path I need, container wants CUDA 12.9, host driver is 12.8. GGML_CUDA_DISABLE_GRAPHS=1 keeps llama.cpp alive when graph optimizer segfaults.
OpenRouter: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.
Gemini batch SDK: silently inline-concatenates documents into one context. Cross-document leak. Neo4j rollback. Upstream googleapis/python-genai issue 1984 is not-planned.
OpenAI Batch (gpt-5.4-mini): JSONL line-isolated, 50 percent off, 100-doc nano gate in 2.7 min, zero 429s, around 1 cent per document.

The local rig stays for live serving, ER API LLM gate, multimodal, and ablations. The batch lane moves to OpenAI.

Full retrospective with the side-by-side table: https://hannune.ai/blog/local-llm-to-openai-batch.html

Cross-Lingual Entity Resolution in a Trade Knowledge Graph: Adding 39,534 Aliases to 6,883 Nodes

Tae Kim — Sun, 07 Jun 2026 05:37:11 +0000

In the previous post, I described how 2asy.ai moved from plain vector search to a cross-domain ontology Graph RAG that resolves entities across documents and traverses causal chains. That post ended with an honest note: the graph is sparse, and it will get denser. This post is about one specific dimension of that density problem, and how I addressed it this week.

The problem is not just that the graph needs more documents. The problem is that the same real-world entity arrives in multiple languages.

Why Cross-Lingual Mentions Break a Knowledge Graph

2asy.ai reads Asian trade and tariff news published in English, but the underlying entities, companies, regulators, ministries, and policymakers, are often referenced in their local-language forms inside those articles. A Korean article about trade policy mentions 한국은행. A Chinese article about semiconductors mentions 三星电子. A Japanese article on auto tariffs mentions トヨタ自動車. And a wire story covering the same events writes Bank of Korea, Samsung Electronics, and Toyota Motor Corporation.

If entity resolution only handles English strings, each foreign-language mention becomes a separate node. The graph now has 한국은행 and Bank of Korea as two distinct entities. Any causal path that crosses the language boundary breaks. A tariff chain that starts with a Korean regulator's decision and ends at a multinational company's earnings call will never close.

This is the specific failure mode: the graph looks connected per language and disconnected across languages. Cross-domain ontology gets you over the per-document barrier. Cross-lingual resolution gets you over the per-language barrier. They are separate problems.

The Fix: Filling the Registry with Alias Tables, Not Rewriting Code

The entity registry at the center of 2asy.ai already had the mechanism. When a new mention arrives during extraction, the registry resolves it against every known surface form for each canonical entity. If the mention matches any alias, it maps to that entity's canonical ID. The registry was already doing this for English variants (partial names, abbreviations, alternate spellings). It just had no non-English entries.

So the change was to fill the alias tables, not to redesign the resolution layer.

The registry holds 13,371 canonical entities. Of those, 6,883 had at least one non-English surface form worth resolving, so I added ko, ja, and zh alias lists for them. The total additions:

ko (Korean): 13,757 aliases
ja (Japanese): 12,615 aliases
zh (Chinese): 13,162 aliases
Total: 39,534 aliases across 6,883 entities

Almost all of them made it in. The pipeline flagged two small categories for exclusion before writing to the registry:

Collision: 3 alias candidates already resolved to a different canonical entity. All three were the abbreviation CENTCOM, generated as a Korean, Japanese, and Chinese alias for United States Central Command, which was already mapped to an existing US Central Command node. Writing them would have made one string resolve to two IDs, so they were skipped automatically.
Low confidence: 5 candidates were held in a staging area rather than written, because the source could not confirm a verified local-language form. For example, LG Electronics has confirmed Korean forms (LG전자, 엘지전자) but no verified official Japanese or Simplified Chinese legal name was found, so its non-Korean entries scored zero and were held instead of guessed.

The excluded set is tiny next to the 39,534 accepted, but it is the part that matters: the collision check and confidence gate exist specifically to keep a wrong alias from silently corrupting every mention it touches.

After writing the accepted aliases, my local ER registry resolves these on exact lookup:

한국은행 (and the Japanese 韓国銀行, Chinese 韩国银行) resolves to the canonical Bank of Korea entity.
三星电子 (and the Korean 삼성전자 주식회사) resolves to Samsung Electronics.
미국 상무부 (and the Japanese 米国商務省, Chinese 美国商务部) resolves to United States Department of Commerce.

Before this change, each foreign-language form would have produced no match and been written to the graph as a new, orphaned node.

What This Means for the Causal Graph in 2asy.ai

The effect, once this alias layer is rolled into the live graph, is that Korean, Japanese, and Chinese entity mentions in trade news resolve to the same canonical nodes as their English counterparts. When a Korean regulator, a Chinese manufacturer, and an American import duty appear in the same causal chain, the chain can close across all three languages instead of fragmenting at each language boundary.

This matters most for the cross-document chains. The whole point of moving from per-article graphs to a shared ontology was to let causality span the full corpus. A cause in one article connects to an effect in another. That connection only works if both articles are writing about the same canonical entity. Without cross-lingual resolution, the connection breaks the moment the two articles use different language forms for the same entity.

Cross-lingual entity resolution is not a cosmetic feature. In a multilingual news corpus, it is a prerequisite for the graph to be coherent.

The Harder Lesson About Entity Resolution in Graph RAG

Running 2asy.ai has made the priority order clear to me.

Extraction is the part that looks impressive. You throw a document at a language model, it returns entities and relations, and the graph grows. It is satisfying to watch. But extraction quality has a ceiling set by what comes after it: resolution. If two mentions that should be the same node stay as two nodes, extraction accuracy does not matter. The chain is broken regardless.

Entity resolution in Graph RAG is harder than most write-ups acknowledge. The English-only case is already non-trivial: abbreviations, partial names, acquired companies that change their names, subsidiaries that share names with parent entities. Add cross-lingual surface forms and the problem surface expands significantly.

The registry-with-aliases approach I use in 2asy.ai is a deterministic layer on top of the probabilistic extraction. Extraction guesses what is in the text. The registry decides what that thing resolves to. Keeping those two responsibilities separate makes the system easier to debug and correct: if a mention is resolving wrong, I fix the registry, not the extraction model.

39,534 aliases is a lot of entries. It is also a manageable data problem. The hard part is the collision detection and confidence gating, because a bad alias silently corrupts every mention it touches. Only 8 candidates were excluded this pass, 3 collisions and 5 held, but those were the gate doing exactly its job: an abbreviation that already belonged to another entity, and entities with no verifiable local-language name. As the corpus grows, that excluded set will grow with it, and it is the set that needs the most care.

What Is Still Missing

Cross-lingual resolution is one layer. There are still open problems.

The 5 held candidates need review, and more low-confidence cases will surface as the corpus grows. Some held entries are genuine cases that should stay held, and some are probably correct but scored cautiously because the source did not have enough context. Working through that set as it grows will expand coverage further.

The 3 collision candidates need manual triage. A collision means two canonical entities share a surface form, which is either a real-world ambiguity (two companies with similar names, a ministry that was renamed) or a registry error (two canonical entries that should be merged). The CENTCOM case is the first example: it points to a likely duplicate canonical entry that should be reconciled.

And the registry has no coverage for Arabic, Russian, or Southeast Asian language forms yet. Those are smaller portions of the current corpus but not zero.

The alias-filling step this week was the first pass. It is enough to close the most common cross-lingual gaps. The rest follows as the pipeline keeps running.

This alias layer is validated on my local ER instance. Rolling it into the live 2asy.ai graph is the next step, once the held and collision cases are triaged. The public graph is at https://www.2asy.ai/

If cross-lingual entity resolution or Graph RAG pipelines are something you are working on, the entity resolution service is at https://api.hannune.ai/entity-resolution/v1. That public endpoint serves the core resolver; the multilingual alias layer described here is validated locally and not yet loaded into it.

The original post: From Vector Search to a Cross-Domain Ontology Graph

From Vector Search to a Cross-Domain Ontology Graph: How 2asy.ai Reads Tariff News

Tae Kim — Tue, 02 Jun 2026 17:18:24 +0000

A new tariff briefing went up on 2asy.ai this week, and for the first time you can see the graph it was built from, right there on the page. That graph is the visible end of a quiet rebuild. The retrieval behind 2asy.ai went from plain vector search, to a simple per-article graph, to a cross-domain ontology Graph RAG. This is what each step bought, and what is still missing.

Where 2asy.ai started: plain vector RAG

The first version of 2asy.ai was ordinary vector RAG. I chunked each trade and tariff article, embedded the chunks, and retrieved by similarity. For a question like "what is happening with steel duties," that works. The system finds passages that look like the question and hands them to a model to summarize.

What vector RAG cannot do is answer "why." Similarity retrieval finds text that resembles your query. It does not know that a sunset review on one product is connected to an antidumping order on another, or that an action against one country pulls in suppliers in a second. Each chunk sits alone. The causal structure that makes trade news worth reading is exactly the thing embeddings throw away.

The next step: a simple per-article graph

So I moved to Graph RAG. For each article I extracted entities, events, and the relations between them, and stored them as a small graph instead of a bag of chunks. This was a real improvement. Inside a single article you could now follow a chain: this investigation led to this duty, which affected this set of producers.

The limit showed up between articles. Each document produced its own little graph, and those graphs did not talk to each other. The "South Korea" mentioned in a steel story and the "South Korea" in a tire story were two unrelated nodes. There was no shared vocabulary of entity types and relation types, so the system could not connect a cause reported in one article to its effect reported in another. The graph was per-document, not per-world.

What cross-domain ontology Graph RAG changes

The current version of 2asy.ai runs on a shared ontology. Every entity, event, and relation is extracted against the same fixed set of types and the same canonical relation vocabulary, and entities are resolved across documents so that the same real-world thing becomes one node no matter how many articles mention it. That is the cross-domain part. A "Sunset Review" is the same Sunset Review whether it shows up in a methionine story, a tire story, or a steel story, and the edges between events can now cross from one domain to another.

You can see the result on the latest briefing. The causal map for the June 2 story, "US Trade Remedies Expand Amid Global Investigations," puts a Sunset Review node at the center as the root, with directed relations fanning out to the entities it touches. Some edges carry qualifiers like "via South Korea" or "via Taiwan," which is the system recording the path a cause took, not just that two things are related. The full extraction for that one story is around 41 nodes and 60 edges, and the view focuses on the root and the bridge nodes so the chain stays readable.

The graph is thin right now, and that is expected

If you open it today, the graph will look sparse. For this story it is a few dozen nodes, and some of them are still coarse: a node typed as "unknown" rather than placed cleanly in the ontology, or an entity that is more general than it should be. I want to be honest about that rather than hide it.

This is a data-accumulation problem, not a design ceiling. A cross-domain graph gets better the more documents flow through it, because resolution and typing both improve with volume. The first time an entity appears it has little context, so it lands as a weak or untyped node. The tenth time it appears, across different stories, it has enough surrounding structure to resolve confidently and connect to the rest of the world. The shape is already correct. What it needs is time and throughput, and both are accruing as the pipeline keeps running.

Why I built it this way

The whole point of 2asy.ai is causal chains: not "here is some news that matches your query," but "here is how this trade action connects to that one, and to the producers and countries in between." Vector RAG cannot represent that. A per-article graph can represent it inside one story but not across the corpus. A cross-domain ontology graph is the first version where the connections can actually span the whole body of news, which is where the interesting causality lives.

All of this runs on local hardware, an RTX 4090 and an AMD W6800, with open models doing the extraction and resolution. There is no cloud inference bill behind it. If you want to see where it stands, the latest briefing and its graph are live at https://www.2asy.ai/ . It is early, and it will get denser as the corpus grows.

How I Caught My LLM Fabricating Its Own Evidence

Tae Kim — Sat, 30 May 2026 17:00:14 +0000

The language model behind my Graph RAG pipeline did something worse than getting a fact wrong. It fabricated the evidence. Each relation it extracted carried a quote that was supposed to come straight from the source article, and many of those quotes had never been written. They read perfectly. They did not exist.

What does fabricated evidence mean in a knowledge graph?

I am building the seed knowledge graph for 2asy.ai, a causal-chain intelligence system over trade and tariff news. Every relation and event in the graph carries an evidence field: the exact sentence from the source document that justifies it. That evidence is the whole point. It is what lets me, or a reader, trace a claim back to where it came from instead of trusting the model on faith.

The problem is that I was asking a language model to produce that evidence by quoting the source. And a language model is a text generator, not a copier. When I checked the evidence against the original articles, a large share of the quotes were not verbatim. They were fluent, on-topic, and invented.

The ellipsis was the tell

The clearest pattern was the ellipsis. The model would take two sentences from completely different parts of an article, drop a ... between them, and present the result as one continuous quotation. The seam looked like a normal editorial cut. It was not. It was two unrelated fragments fused to manufacture support for a relation the model had already decided to extract.

This is the dangerous kind of hallucination, because it is shaped exactly like real evidence. A wrong fact stands out. A fabricated quote that paraphrases something true reads as completely credible until you go back to the source and search for it character by character.

Why truncating the input made it worse

I had been doing something that looked harmless: truncating each article body to a few thousand characters before extraction, to stay inside a comfortable context window. That truncation was quietly licensing the fabrication. When the sentence that actually supported a relation sat past the cutoff, the model did not refuse. It filled the gap with a confident reconstruction of what the missing text probably said.

So I removed the truncation entirely and switched the collectors to full-body-or-skip: either the pipeline has the complete article text, or it does not process that document at all. A partial document is more dangerous than a missing one, because a partial document still produces output, and the output looks finished.

The fix: check the quote, do not grade it

The fix is almost embarrassingly simple, and that is the point. At commit time, before any relation is written to the graph, I check that its evidence string appears as an exact substring of the source document. If the quote is not literally in the text, the relation is rejected. No fuzzy matching, no second model asked to judge whether the evidence is good enough.

The instinct in this situation is to reach for another language model to verify the first one. I think that instinct is usually wrong. If you can check an output with a deterministic string operation, do that instead of grading one generator with another generator. A substring test cannot be talked into a plausible answer. It is cheap, it is exact, and it cannot hallucinate.

Cleaning up what had already shipped

The guard stops new fabrication, but it does not undo the relations already sitting in the graph. So I ran the same substring check backward over everything that had already been committed. I reverted 122 documents whose evidence had been stitched together from separate sentences, then cleaned out hundreds more whose quotes simply did not match their source, more than 500 documents in total across the cleanup passes.

That number is the real cost of having trusted generated evidence in the first place. Every one of those documents had passed through a pipeline that ran cleanly and produced output that looked correct. The volume of the cleanup is a measure of how convincing fabricated evidence is when nobody checks it against the source.

The lesson: store pointers, verify with code

If you are extracting structured claims from text with a language model, treat any quote it gives you as a hypothesis, not a fact, until a string operation confirms it. Store evidence as something you can verify, a span or an exact substring, not as free text the model is trusted to have copied faithfully.

All of this runs on local hardware, an RTX 4090 and an AMD W6800, with open models doing the extraction. The substring guard adds no model calls and no cloud cost. It is a few lines of code standing between a graph you can trust and a graph that quietly lies to you in complete sentences.

The Real Work in Graph RAG Is Not Extraction

Tae Kim — Fri, 29 May 2026 03:00:34 +0000

Extraction is the easy part of Graph RAG. I learned this building the seed knowledge graph for 2asy.ai. The pipeline ran, the numbers looked fine, and the graph was still broken. The real work was normalization, and it took far longer than extraction.

Why does a freshly extracted knowledge graph look fine but break?

I built a seed knowledge graph for 2asy.ai from roughly 450 documents, extracting more than 3,000 entities and over 1,000 events. The extraction pipeline ran cleanly. The counts looked healthy. Then I actually looked at the graph, and it was not navigable.

A knowledge graph is only useful when you can walk it: follow an entity to an event, an event to its cause, a cause to the entity behind it. Mine could not be walked, because the same real-world idea had been recorded under many different names. The numbers measured volume, not consistency.

The relation layer: how 360 labels became 80 canonical types

The relation types were the worst offender. Over 360 distinct labels had accumulated. The language model named the same structural idea differently depending on the article it was reading at the time. caused_by, is_caused_by, and was_caused_by were three labels for one relationship. triggers, trigger, and triggered_by were three more.

A graph with 360 relation types is not a knowledge graph. It is an expensive document store. So I stopped extracting and went backward. I mapped every relation type down to 80 canonical forms and renamed the edges, hundreds of them. The graph shrank in variety and became navigable for the first time.

The entity layer: why "aluminium" and "aluminum" break a causal chain

The entity layer had a different failure. "Aluminium", "aluminum", and "ALUMINUM" were three separate nodes with zero edges connecting them. A causal chain breaks cleanly at the first spelling inconsistency, because the graph does not know the three nodes are the same metal.

I merged the duplicates. I also fixed evidence substrings so each relation points back to the exact sentence that produced it, not a paraphrased approximation. Evidence that does not match its source is evidence you cannot trust later.

How do you stop the cleanup from happening twice?

Cleaning once is not enough if the next extraction run reinvents the mess. So I added validation that enforces the 80-type vocabulary at extraction time. New runs must use an existing canonical relation type instead of inventing a new label. The cleanup becomes a one-time cost, not a recurring tax.

Running the whole pipeline on local hardware

All of this ran on local hardware: an RTX 4090 and an AMD W6800, with Gemma and Qwen models handling the language work. There was no cloud bill for the extraction, the normalization, or the validation. For a one-person operation, owning the inference cost structure is what makes a daily Graph RAG pipeline affordable to run at all.

The lesson: plan for cleanup before extraction

If you are building Graph RAG, plan for the cleanup phase before you start the extraction phase. The language model will produce plausible output. The graph will look dense. The counts will feel satisfying. None of that means the graph is usable.

A knowledge graph becomes useful only when the entity and relation layers are consistent enough to walk. In my case, extraction took days and normalization took weeks. The seed graph for 2asy.ai now runs causal-chain queries against 80 canonical relation types, with every edge linked back to its source sentence.

Originally published at hannune.ai.