<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dan Shalev</title>
    <description>The latest articles on DEV Community by Dan Shalev (@danshalev7).</description>
    <link>https://dev.to/danshalev7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2251943%2F12ead377-c03a-4292-91db-4efd2bae17ef.png</url>
      <title>DEV Community: Dan Shalev</title>
      <link>https://dev.to/danshalev7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danshalev7"/>
    <language>en</language>
    <item>
      <title>I built a GraphRAG demo with FalkorDB’s new SDK, then benchmarked it against Neo4j</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Wed, 29 Apr 2026 04:59:16 +0000</pubDate>
      <link>https://dev.to/danshalev7/i-built-a-graphrag-demo-with-falkordbs-new-sdk-then-benchmarked-it-against-neo4j-3hh</link>
      <guid>https://dev.to/danshalev7/i-built-a-graphrag-demo-with-falkordbs-new-sdk-then-benchmarked-it-against-neo4j-3hh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1029x9skgq1hd7uhbe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1029x9skgq1hd7uhbe0.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FalkorDB shipped graphrag-sdk v1.0.0rc1 and I wanted to see how it feels on real content, not a toy dataset. An afternoon of "let me just try it" turned into a few days of "if I'm going to have an opinion, I should measure it against something."&lt;/p&gt;

&lt;p&gt;The something, obviously, was neo4j-graphrag. Same corpus, same LLM, same embedder, same 25-question set, same blind judge. The whole thing — ingest, 25 queries, and the judge rubric across both stacks — costs about $0.15 to reproduce end-to-end.&lt;/p&gt;

&lt;p&gt;This is a write-up of what I did, what broke, what the numbers actually say, and what I'd do differently. I'm not here to crown a winner. I'm here to show what it took to compare them honestly.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/FalkorDB/graphrag-sdk-demo"&gt;github.com/FalkorDB/graphrag-sdk-demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The corpus and the pipeline&lt;/p&gt;

&lt;p&gt;The corpus is 8 FalkorDB blog posts and case studies, roughly 140 KB of Markdown. Topics range from "what is GraphRAG" to the Securin threat-intel case study to a March 2026 cybersecurity webinar announcement. That mix matters later: some questions are short factual lookups, some need multi-hop joining across documents, some ask for specific numbers buried in a single paragraph.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             ┌───────────────────┐
             │  scrape.py        │  Firecrawl → content/*.md
             └─────────┬─────────┘
                       │
            ┌──────────┴──────────┐
            ▼                     ▼
 ┌───────────────────┐   ┌───────────────────┐
 │  ingest.py        │   │  neo4j_ingest.py  │
 │  GraphRAG (async) │   │  SimpleKGPipeline │
 │  + postprocess.py │   │  + entity embeds  │
 └─────────┬─────────┘   └─────────┬─────────┘
           ▼                       ▼
  ┌────────────────┐      ┌────────────────┐
  │  FalkorDB      │      │  Neo4j 5.24    │
  │  :6379 / :3000 │      │  :7474 / :7687 │
  └────────┬───────┘      └───────┬────────┘
           │                      │
           └──────────┬───────────┘
                      ▼
         ┌──────────────────────────┐
         │ benchmark_compare.py     │
         │ bench/ (25 Qs + judge)   │
         │ → COMPARISON.md          │
         └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Stack: Python 3.14, graphrag-sdk[litellm]==1.0.0rc1, neo4j-graphrag[openai]&amp;gt;=1.14, LiteLLM, FalkorDB and Neo4j in docker-compose.yml. The LLM is gpt-4o-mini (extraction + generation) at temperature=0. Embedder is text-embedding-3-small (1536 dim). The judge is gpt-4o with temperature=0 and seed=42. Chunking on both sides is fixed-size 1000 with 100 overlap, no approximation. Extraction is open-schema on both sides — no hand-tuned GraphSchema. I wanted the libraries to show their defaults, not my schema design.&lt;/p&gt;

&lt;p&gt;Part 1 — Getting the FalkorDB demo working&lt;br&gt;
The FalkorDB ingest is almost boring to write down, which is the point. The whole thing fits in roughly 60 lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async with GraphRAG(
    connection=ConnectionConfig(host="localhost", graph_name="falkordb_blog_kg"),
    llm=LiteLLM(model="openai/gpt-4o-mini"),
    embedder=LiteLLMEmbedder(model="openai/text-embedding-3-small"),
) as rag:
    for path in content_files:
        text = path.read_text(encoding="utf-8")
        await rag.ingest(path.stem, text=text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stats = await rag.finalize()   # dedup + embeddings + indexes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;ingest() is per-file; finalize() is the cleanup pass that deduplicates entities, backfills embeddings, and creates the HNSW indexes. After the 8 files, I had 509 nodes, 1 228 edges, 160 LLM calls, ~230 seconds wall time, and a cost of $0.054. For an afternoon of "try the SDK," this is a good story.&lt;/p&gt;

&lt;p&gt;Then I opened the FalkorDB browser and it was a hairball.&lt;/p&gt;

&lt;p&gt;Every relationship came out as [:RELATES {rel_type: "USES"}], [:RELATES {rel_type: "INTEGRATES_WITH"}], and so on. The relation type lives as a property on a single generic edge, not as the edge label itself. This is fine for the retriever — it reads the property — but it's ugly in the browser and it's a pain to query by hand (WHERE r.rel_type = 'USES' is not index-accelerated in FalkorDB; the skill docs are explicit about this).&lt;/p&gt;

&lt;p&gt;So I wrote postprocess.py. Two idempotent passes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Promote (:Entity)-[:RELATES {rel_type:'INTEGRATES_WITH'}]-&amp;gt;(:Entity)
# into real typed edges.
for raw in distinct_rel_types:
    safe = _TYPE_SAFE.sub("", raw.upper().replace(" ", "_").replace("-", "_"))
    graph.query(
        "MATCH (a)-[r:RELATES {rel_type: $t}]-&amp;gt;(b) "
        f"MERGE (a)-[r2:{safe}]-&amp;gt;(b) "
        "SET r2.fact = r.fact, r2.description = r.description, "
        "    r2.source_chunk_ids = r.source_chunk_ids, r2.spans = r.spans "
        "DELETE r",
        params={"t": raw},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The string-substitution into the Cypher is necessary because relationship types can't be parameterized — so I sanitize the type name to [A-Z0-9_] before interpolating. The result: 336 generic RELATES became 161 real typed edges (INTEGRATES_WITH, SUPPORTS, USES, ...), idempotently, on every re-ingest.&lt;/p&gt;
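&lt;p&gt;For reference, the sanitizer is tiny. A minimal sketch of the idea described above; the regex and function name here are illustrative, not the repo's exact code:&lt;/p&gt;

```python
import re

# Keep only [A-Z0-9_] after normalizing, so the result is safe to
# interpolate as a Cypher relationship type (labels and relationship
# types cannot be passed as query parameters).
_TYPE_SAFE = re.compile(r"[^A-Z0-9_]")

def safe_rel_type(raw: str) -> str:
    return _TYPE_SAFE.sub("", raw.upper().replace(" ", "_").replace("-", "_"))

print(safe_rel_type("integrates with"))  # INTEGRATES_WITH
```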

&lt;p&gt;The other thing I learned by looking, not by reading docs, is that the SDK does two LLM calls per query: a keyword-extraction pass over the question, then the final generation. Between them the retriever ranks candidate entities deterministically by term frequency — no LLM. I only nailed this down when I ran GRAPH.SLOWLOG against a corrected benchmark harness (my first version was double-counting a call; see the addendum). It matters later in the numbers.&lt;/p&gt;

&lt;p&gt;Part 2 — "Is this actually good?"&lt;/p&gt;

&lt;p&gt;My five demo questions produced answers that looked great. That proved nothing. Cherry-picking five queries against a knowledge graph is not evidence; it is ambience.&lt;/p&gt;

&lt;p&gt;I needed three things: a question set with known ground truth, a comparable second stack so the numbers meant something in context, and a judge that didn't know which stack produced which answer.&lt;/p&gt;

&lt;p&gt;Before writing any benchmark code I wrote down the fairness constraints: same corpus, same chunking (1000/100, fixed-size), same LLM, same embedder, same 25 questions, same judge with a fixed seed, same price sheet for cost math. Anything I couldn't equalize, I would disclose.&lt;/p&gt;
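&lt;p&gt;In code, that amounts to one shared constants block both harness sides import, so nothing can drift. A sketch; the names are mine, not the repo's:&lt;/p&gt;

```python
# Everything both stacks must share for the comparison to be fair.
# One dict, imported by both harness sides.
SHARED = dict(
    llm_model="openai/gpt-4o-mini",              # extraction + generation
    embed_model="openai/text-embedding-3-small",
    temperature=0,
    chunk_size=1000,
    chunk_overlap=100,
    judge_model="openai/gpt-4o",
    judge_seed=42,
    n_questions=25,
)
```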

&lt;p&gt;Part 3 — Building the Neo4j side&lt;/p&gt;

&lt;p&gt;This was where it got interesting. neo4j-graphrag is a good library, but the defaults don't give you parity with FalkorDB's out-of-the-box retrieval; you have to build it.&lt;/p&gt;

&lt;p&gt;Four non-obvious things bit me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Document dedup on a stale path. SimpleKGPipeline writes Document.path = 'document.txt' for every file and deduplicates on that path. If you loop over your files, the second file silently merges into the first. The fix is to rename the freshly-created Document node to the source slug right after each run_async().&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Missing chunk vector index. The pipeline writes chunk embeddings as properties, but doesn't always create the vector index to query them. create_vector_index(driver, CHUNK_INDEX, ...) after ingest, manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No entity embeddings at all. This one took me a while. FalkorDB builds an entity HNSW as part of finalize(); SimpleKGPipeline does not. So after ingest, I walk every &lt;code&gt;__Entity__&lt;/code&gt; node, embed name + description in batches of 64, write with db.create.setNodeVectorProperty, and then create an entity_embedding_idx. Without this pass, "entity vector search" on the Neo4j side would have been meaningless and the comparison would have been dishonest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GraphRAG rejects custom composite retrievers. This is the fun one. I wanted a retriever that mirrors FalkorDB's MultiPathRetrieval: vector search over entities with 1-hop fact expansion, plus vector search over chunks. In neo4j-graphrag, the obvious shape is two VectorCypherRetrievers composed into a wrapper. But when you pass a wrapper into GraphRAG(...), its pydantic validation rejects anything that isn't a Retriever subclass.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
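&lt;p&gt;The fix for the first bite (the stale shared path) is a one-line Cypher rename after each file. A minimal sketch, assuming the label and property names implied by the dedup behaviour above; verify against your pipeline version:&lt;/p&gt;

```python
# Rename the Document node the pipeline just wrote under the shared
# stale path, so the next file cannot silently merge into it.
RENAME_DOC = "MATCH (d:Document {path: 'document.txt'}) SET d.path = $slug"

def rename_latest_document(session, slug: str) -> None:
    # Call immediately after each run_async() completes.
    session.run(RENAME_DOC, slug=slug)
```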

&lt;p&gt;I drove the retrievers directly instead. About 40 lines cleaner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENTITY_QUERY = """
WITH node, score
OPTIONAL MATCH (node)-[r]-(nbr:__Entity__)
WITH node, score,
     collect(DISTINCT {
        rel: type(r),
        neighbour: coalesce(nbr.name, nbr.id, ''),
        fact: coalesce(r.fact, r.description, '')
     })[..8] AS facts
RETURN coalesce(node.name, node.id, '') AS entity_name,
       labels(node) AS entity_labels,
       coalesce(node.description, '') AS entity_description,
       facts, score
"""

self.entity = VectorCypherRetriever(
    driver=driver, index_name="entity_embedding_idx",
    retrieval_query=ENTITY_QUERY, result_formatter=_entity_fmt,
    embedder=embedder, neo4j_database=NEO4J_DATABASE,
)
self.chunk = VectorCypherRetriever(
    driver=driver, index_name="chunk_embedding_idx",
    retrieval_query=CHUNK_QUERY, result_formatter=_chunk_fmt,
    embedder=embedder, neo4j_database=NEO4J_DATABASE,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then: build context → prompt → llm.invoke(). Behavior is equivalent to what GraphRAG would do internally; I just don't get the pydantic validator in my way.&lt;/p&gt;
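&lt;p&gt;The driving loop itself is short. A sketch of the shape, assuming the standard retriever search() and LLM invoke() interfaces; the prompt wording is illustrative:&lt;/p&gt;

```python
def answer(question, entity_retriever, chunk_retriever, llm, top_k=5):
    # Two retrievals, one merged context, one generation. No GraphRAG
    # wrapper, so the pydantic Retriever check never runs.
    ents = entity_retriever.search(query_text=question, top_k=top_k)
    chunks = chunk_retriever.search(query_text=question, top_k=top_k)
    context = "\n".join(str(item.content) for item in ents.items + chunks.items)
    prompt = (
        "Answer only from the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.invoke(prompt).content
```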

&lt;p&gt;There's a tradeoff worth naming: I did not write a Neo4j equivalent of postprocess.py's typed-edge promotion. I thought about it. I decided that giving Neo4j a cleanup pass that its pipeline doesn't provide would bias toward Neo4j, and the whole point was to compare defaults. So Neo4j keeps its out-of-the-box relation-type distribution (226 types across 785 nodes) while FalkorDB gets the cleanup it natively benefits from (154 types across 504 nodes). I'll come back to this in the asymmetries section.&lt;/p&gt;

&lt;p&gt;Part 4 — Making cost and token tracking honest&lt;/p&gt;

&lt;p&gt;Token counting is the part of a benchmark that you'd think is easy and is not.&lt;/p&gt;

&lt;p&gt;FalkorDB side. LiteLLM returns usage and exposes litellm.completion_cost, but I didn't want two different price sheets (one from LiteLLM's snapshot, one for Neo4j's raw OpenAI usage). I subclassed LiteLLM and LiteLLMEmbedder, overrode ainvoke / ainvoke_messages / _raw_embed_async, captured usage from each call, and sent the numbers through a single price sheet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bench/costs.py — the whole thing.
PRICES = {
    "openai/gpt-4o-mini":      {"in": 0.15, "out": 0.60},   # per 1M tokens
    "openai/gpt-4o":           {"in": 2.50, "out": 10.00},
    "openai/text-embedding-3-small": {"in": 0.02},
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Neo4j side. This is ugly. neo4j-graphrag's OpenAILLM returns an LLMResponse that doesn't expose token usage at all. The underlying OpenAI client has it on the response object, but the wrapper drops it. So I subclassed OpenAILLM and monkey-patched the client:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TrackingOpenAILLM(OpenAILLM):
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self.call_count = 0; self.prompt_tokens = 0; self.completion_tokens = 0
        sync_create = self.client.chat.completions.create
        def sync_wrap(**kwargs):
            r = sync_create(**kwargs)
            self._record(r, self.model_name)
            return r
        self.client.chat.completions.create = sync_wrap  # same for async_client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Same trick for OpenAIEmbeddings.client.embeddings.create. This is the ugliest code in the repo and I'm at peace with it — it's a benchmark harness, not a library. Both stacks' numbers now go through the same bench/costs.py. No drift, no surprises.&lt;/p&gt;
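&lt;p&gt;Once both wrappers are counting tokens, the cost math itself is one function over the single price dict. A sketch of that arithmetic; the function name is mine:&lt;/p&gt;

```python
PRICES = {
    "openai/gpt-4o-mini":            {"in": 0.15, "out": 0.60},  # $ per 1M tokens
    "openai/gpt-4o":                 {"in": 2.50, "out": 10.00},
    "openai/text-embedding-3-small": {"in": 0.02},
}

def usage_cost(model, prompt_tokens=0, completion_tokens=0):
    # One sheet for both stacks: LiteLLM usage and raw OpenAI usage
    # converge on identical dollar figures, so there is no drift.
    p = PRICES[model]
    dollars = prompt_tokens * p["in"] + completion_tokens * p.get("out", 0.0)
    return dollars / 1_000_000
```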

&lt;p&gt;Part 5 — The 25-question set and the judge&lt;/p&gt;

&lt;p&gt;I wrote 25 questions in four categories:&lt;/p&gt;

&lt;p&gt;Factual (8): single-hop lookups. "What is GraphRAG?" "What ports does FalkorDB expose?"&lt;br&gt;
Multi-hop (6): joins across entities or documents. "Which FalkorDB integrations does Securin use together?"&lt;br&gt;
Comparative (5): "How does FalkorDB compare to Neo4j for knowledge graphs?" "In-memory vs on-disk tradeoffs?"&lt;br&gt;
Numeric (6): specific numbers from the corpus. "What was Securin's average query latency?" "When is the cybersecurity webinar?"&lt;/p&gt;

&lt;p&gt;Every question is paired with reference_facts (the ground truth from the source document) and expected_source_docs (which files should be hit). The dataclass is frozen and asserts exactly 25 questions with unique IDs, so you can't silently drift the set.&lt;/p&gt;
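&lt;p&gt;The frozen-dataclass guard looks roughly like this. The field names match the ones above; everything else is an illustrative sketch, not the repo's exact code:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    qid: str
    category: str            # factual / multi_hop / comparative / numeric
    text: str
    reference_facts: tuple   # ground truth from the source document
    expected_source_docs: tuple

def validate(questions):
    # Exactly 25 questions, all IDs unique: the set cannot silently drift.
    assert len(questions) == 25, "question count drifted"
    assert len({q.qid for q in questions}) == 25, "duplicate question IDs"
```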

&lt;p&gt;The judge lives in bench/judge.py. It's blind A/B:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Blind pairing so the judge never sees "FalkorDB" vs "Neo4j".
rng = random.Random(42)
for q in questions:
    a, b = (falk, neo) if rng.random() &amp;lt; 0.5 else (neo, falk)
    # ... send judge Answer A vs Answer B ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rubric: four dimensions (groundedness, correctness, completeness, conciseness), integer 1–5 per dimension per answer, plus a one-sentence rationale. The judge is given the reference_facts from the question set, so the scoring is grounded in known-correct content, not in the judge's vibes about the question. The judge is gpt-4o with temperature=0, seed=42, response_format={"type": "json_object"}. Running the 25-question rubric cost $0.0456.&lt;/p&gt;

&lt;p&gt;Every run writes a timestamped JSON under results/ so I can re-render the comparison report later without re-running.&lt;/p&gt;
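&lt;p&gt;The persistence step is trivial but worth having from run one. A sketch of the idea, with names of my own choosing:&lt;/p&gt;

```python
import json
import time
from pathlib import Path

def save_run(results: dict, outdir: str = "results") -> Path:
    # Timestamped filename: every run is preserved, so comparison
    # reports can be re-rendered later without re-running the stacks.
    Path(outdir).mkdir(exist_ok=True)
    path = Path(outdir) / time.strftime("run_%Y%m%d_%H%M%S.json")
    path.write_text(json.dumps(results, indent=2))
    return path
```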

&lt;p&gt;Part 6 — The numbers&lt;/p&gt;

&lt;p&gt;Three tables. This is the whole benchmark, stripped of narration.&lt;/p&gt;

&lt;p&gt;Ingestion&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;FalkorDB&lt;/th&gt;&lt;th&gt;Neo4j&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Wall time&lt;/td&gt;&lt;td&gt;233.4 s&lt;/td&gt;&lt;td&gt;251.7 s&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LLM calls&lt;/td&gt;&lt;td&gt;160&lt;/td&gt;&lt;td&gt;159&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Input / output / embedding tokens&lt;/td&gt;&lt;td&gt;112 925 / 60 211 / 43 759&lt;/td&gt;&lt;td&gt;172 419 / 28 208 / 35 130&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cost&lt;/td&gt;&lt;td&gt;$0.0539&lt;/td&gt;&lt;td&gt;$0.0435&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Nodes&lt;/td&gt;&lt;td&gt;504&lt;/td&gt;&lt;td&gt;785&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Edges&lt;/td&gt;&lt;td&gt;1 202&lt;/td&gt;&lt;td&gt;1 632&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Entities / chunks&lt;/td&gt;&lt;td&gt;335 / 160&lt;/td&gt;&lt;td&gt;612 / 159&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Relationship types&lt;/td&gt;&lt;td&gt;154&lt;/td&gt;&lt;td&gt;226&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;FalkorDB's extractor produces a tighter graph (fewer entities, fewer rel types); Neo4j's produces more fragments. Neither is better in the abstract — different defaults. FalkorDB reads less, writes more output tokens; Neo4j reads more (more prompt context per extraction call), writes less.&lt;/p&gt;

&lt;p&gt;Per-query aggregates (25 questions)&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;FalkorDB&lt;/th&gt;&lt;th&gt;Neo4j&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Avg retrieve ms&lt;/td&gt;&lt;td&gt;1 493&lt;/td&gt;&lt;td&gt;496&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg LLM ms&lt;/td&gt;&lt;td&gt;1 641&lt;/td&gt;&lt;td&gt;1 759&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg total ms&lt;/td&gt;&lt;td&gt;3 094&lt;/td&gt;&lt;td&gt;2 255&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg LLM calls per Q&lt;/td&gt;&lt;td&gt;2.0&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg input / output tokens&lt;/td&gt;&lt;td&gt;4 125 / 59&lt;/td&gt;&lt;td&gt;2 952 / 55&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg cost per Q&lt;/td&gt;&lt;td&gt;$0.000654&lt;/td&gt;&lt;td&gt;$0.000476&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;p95 latency&lt;/td&gt;&lt;td&gt;4 793 ms&lt;/td&gt;&lt;td&gt;4 506 ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg retrieved entities / chunks / docs&lt;/td&gt;&lt;td&gt;11.4 / 1.0 / 3.5&lt;/td&gt;&lt;td&gt;3.7 / 10.0 / 2.6&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;25-Q total cost&lt;/td&gt;&lt;td&gt;$0.01635&lt;/td&gt;&lt;td&gt;$0.01191&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Neo4j wins latency (−27 %) and cost (−27 %). Structurally this is because Neo4j does one LLM call per query and FalkorDB does two — keyword extraction, then generation, with deterministic ranking in between. The difference is not a configuration bug; it is what the SDK is doing on your behalf.&lt;/p&gt;

&lt;p&gt;Judge rubric (gpt-4o, seed=42, blind A/B)&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;FalkorDB&lt;/th&gt;&lt;th&gt;Neo4j&lt;/th&gt;&lt;th&gt;Δ&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Groundedness&lt;/td&gt;&lt;td&gt;3.88&lt;/td&gt;&lt;td&gt;3.84&lt;/td&gt;&lt;td&gt;+0.04&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Correctness&lt;/td&gt;&lt;td&gt;3.84&lt;/td&gt;&lt;td&gt;3.60&lt;/td&gt;&lt;td&gt;+0.24&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Completeness&lt;/td&gt;&lt;td&gt;3.52&lt;/td&gt;&lt;td&gt;3.24&lt;/td&gt;&lt;td&gt;+0.28&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Conciseness&lt;/td&gt;&lt;td&gt;4.56&lt;/td&gt;&lt;td&gt;4.60&lt;/td&gt;&lt;td&gt;−0.04&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Overall&lt;/td&gt;&lt;td&gt;3.95&lt;/td&gt;&lt;td&gt;3.82&lt;/td&gt;&lt;td&gt;+0.13&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Category&lt;/th&gt;&lt;th&gt;n&lt;/th&gt;&lt;th&gt;FalkorDB&lt;/th&gt;&lt;th&gt;Neo4j&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Factual&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;4.38&lt;/td&gt;&lt;td&gt;4.00&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multi-hop&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;3.83&lt;/td&gt;&lt;td&gt;3.50&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Numeric&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;3.92&lt;/td&gt;&lt;td&gt;3.92&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Comparative&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;3.45&lt;/td&gt;&lt;td&gt;3.80&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Win/loss/tie over 25 questions (tie threshold |Δ| ≤ 0.125): 5 / 5 / 15 — tied on wins, but FalkorDB has the higher overall mean. Source-document recall is 97 % vs 90 % in FalkorDB's favour.&lt;/p&gt;
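&lt;p&gt;The win/loss/tie split is simple arithmetic over per-question score deltas. A sketch, using the same tie threshold as above; the function name is mine:&lt;/p&gt;

```python
def win_loss_tie(deltas, tie=0.125):
    # delta = FalkorDB overall score minus Neo4j overall score, per question.
    # Anything within the tie threshold counts as a tie.
    wins = sum(1 for d in deltas if d > tie)
    losses = sum(1 for d in deltas if -d > tie)
    return wins, losses, len(deltas) - wins - losses
```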

&lt;p&gt;Reframe, plainly: FalkorDB pays roughly 27 % more latency and 27 % more cost per query. In exchange it wins factual (+0.38) and multi-hop (+0.33) quality, leads on correctness (+0.24) and completeness (+0.28), and retrieves the right source document more often (97 % vs 90 %). It loses comparative questions (−0.35) where its pipeline tends to over-elaborate, and ties numeric extraction — both stacks get the same 4/6 numbers right and fail the same 2. The extra LLM call isn't free, but it's doing work on the substantive categories.&lt;/p&gt;

&lt;p&gt;Part 7 — Where each one actually failed&lt;/p&gt;

&lt;p&gt;Aggregates hide the interesting failures.&lt;/p&gt;

&lt;p&gt;FalkorDB's worst: the comparative category. On c2 ("How does GraphRAG outperform vector RAG on complex questions?") and c3 ("What are the tradeoffs of in-memory vs on-disk graph storage?") FalkorDB produced longer, more elaborated answers than Neo4j. The elaborations weren't fabricated — they were grounded in the retrieved entities — but they drifted past the reference facts the judge was scoring against. Neo4j's tighter single-pass answers hewed closer to exactly what the source said and won both questions. Average comparative score: FalkorDB 3.45 vs Neo4j 3.80.&lt;/p&gt;

&lt;p&gt;Lesson: the keyword-extraction pre-step pulls a wider entity set into context, which helps on factual and multi-hop questions but can encourage over-explanation on contrast questions where terseness is a virtue. The extra call is doing work; on some categories that work is counterproductive.&lt;/p&gt;

&lt;p&gt;Neo4j's worst: abstention on broad multi-hop and numeric questions. On m4 ("Which companies or products are described as using or integrating with FalkorDB?") and n5 ("What are FalkorDB's default ports?") Neo4j returned "I don't know based on the provided context." FalkorDB answered both — correctly naming Snowflake and LangChain on m4, though it happened to fail n5 on this run too (retrieval variance; the context didn't surface the ports chunk). In general Neo4j's retriever had the information in context and the single-pass prompt declined to use it.&lt;/p&gt;

&lt;p&gt;This is the flip side of no rewrite loop. FalkorDB's extra keyword-extraction call pushes the model to use what was retrieved; Neo4j's cautious single prompt occasionally refuses when a broader context pull would have landed the answer.&lt;/p&gt;

&lt;p&gt;I want to state this plainly: fabrication and abstention are both real failure modes. Neither is strictly worse. In a production system you'd probably tune the prompts to move each one toward the safer behavior for your use case. The point is not that one stack is wrong — it's that they fail differently.&lt;/p&gt;

&lt;p&gt;The asymmetries I did not fix&lt;br&gt;
Three of them, and I called every one out in the repo, in COMPARISON_FULL.md, and I'll call them out here too.&lt;/p&gt;

&lt;p&gt;Typed-edge promotion runs only on FalkorDB. Porting postprocess.py to Neo4j would have given Neo4j a cleanup its pipeline doesn't provide. I chose to benchmark defaults.&lt;/p&gt;

&lt;p&gt;Retrieval shape differs. FalkorDB's MultiPathRetrieval returns ~11 entities + 1 chunk per query. My Neo4j composite returns ~4 entities + 10 chunks. Both are tunable; I left them at reasonable defaults for each side. This likely explains part of FalkorDB's edge on multi-hop.&lt;/p&gt;

&lt;p&gt;Two LLM calls vs one. I did not strip out FalkorDB's keyword-extraction pre-step to "match" Neo4j. It's what the SDK does by default, and measuring the SDK means measuring that work.&lt;/p&gt;

&lt;p&gt;If you want the benchmark to tell a different story, you can rerun it with adjusted parameters. The harness is ~500 lines and the whole comparison costs fifteen cents.&lt;/p&gt;

&lt;p&gt;What I'd do differently&lt;/p&gt;

&lt;p&gt;Start with the benchmark harness, not the demo. The demo's code shaped itself around "five cool queries" and I ended up rewriting half of it when the 25-question set arrived. The right order is: questions → stacks → demo as a special case.&lt;br&gt;
Put bench/costs.py in from day one. I burned time reconciling LiteLLM's cost calc with the Neo4j-side raw usage before I realized a single price dict would erase the drift entirely.&lt;br&gt;
Expose community summaries in FalkorDB's retrieval. finalize() generates them but MultiPathRetrieval doesn't surface them on short queries. A custom retriever that includes community summaries for broad thematic questions (where the multi-hop expansion doesn't cover the space) is probably worth 15 minutes.&lt;br&gt;
Add a "refusal" dimension to the judge rubric (or a fifth score). Right now "I don't know" scores 1/5 on correctness, which is mathematically right — it isn't correct — but doesn't distinguish hallucination from honest abstention. A production benchmark should treat those separately.&lt;br&gt;
Use gpt-4o-mini as the judge on a larger sample. gpt-4o on 25 questions is fine for signal; gpt-4o-mini on 250 questions would probably be noisier per-question but more robust in aggregate, for the same budget.&lt;/p&gt;

&lt;p&gt;A note on the FalkorDB skills pack&lt;br&gt;
One thing that shaped how I worked on this: the repo ships a .falkordb-skills/ pack — SKILL.md plus cypher-skills/, operations-skills/, and udf-skills/ subfolders, each containing narrow, tested "how to do X in FalkorDB" notes. Copilot loads them automatically when I'm writing Cypher or operating the container. They're not tutorials; they're a short, opinionated reference for the things that are easy to get wrong.&lt;/p&gt;

&lt;p&gt;A few places they saved me time — or saved me from shipping something subtly broken:&lt;/p&gt;

&lt;p&gt;use-merge-to-avoid-duplicates and update-and-remove-properties. My postprocess.py rewrites relationships on every re-ingest. The skill pack is explicit that FalkorDB has no REMOVE clause (set to NULL instead) and that MERGE is the idiom for idempotent upserts. Both shaped the final shape of the typed-edge promotion code.&lt;br&gt;
use-parameterized-queries. The same skill is what prompted me to pass rel_type as a parameter (params={"t": raw}) while interpolating the sanitized type name into the Cypher. One is untrusted data; the other is part of the query structure. The skill makes the distinction concrete.&lt;/p&gt;

&lt;p&gt;track-slow-queries (GRAPH.SLOWLOG). This is how I eventually pinned down that the SDK makes two LLM calls per query — keyword extraction and generation — not three as I initially thought (my benchmark harness was double-counting a call; see the addendum). I wasn't looking for the call count at all; I was looking at what was slow during ingest. GRAPH.SLOWLOG also surfaced the real ingest bottleneck: a single UNWIND $batch AS item MATCH (e:__Entity__ {id:item.eid}) SET e.embedding = vecf32(item.vector) at ~333 ms per call, which is finalize() backfilling entity embeddings. Knowing the bottleneck is the embedding write, not the extraction, changes which optimizations are worth attempting.&lt;br&gt;
inspect-graphs-and-memory (GRAPH.LIST / GRAPH.INFO / GRAPH.MEMORY USAGE). This is what query_demo.py uses to print that the whole 509-node / 1 228-edge graph is 4 MB resident. That number is genuinely useful for capacity planning; it's also the kind of thing that's easy to forget to measure.&lt;br&gt;
apply-cypher-limitations-correctly. Specifically: &amp;lt;&amp;gt; filters aren't index-accelerated. I avoided at least one "let me just exclude this one type" query in postprocess.py that would have degraded on larger graphs.&lt;/p&gt;

&lt;p&gt;inspect-query-plans / profile-query-runtime. GRAPH.EXPLAIN and GRAPH.PROFILE. Not cited explicitly in the demo, but referenced in my .github/copilot-instructions.md so that any Cypher generated in this workspace is validated against an explain plan before being considered optimized.&lt;/p&gt;

&lt;p&gt;The value of the pack, to me, is less that it teaches me Cypher — I can read the docs — and more that it encodes the operational patterns that distinguish a working demo from a working system. Idempotency, introspection, index-awareness, parameter safety. A lot of LLM-generated Cypher looks fine and is quietly not idempotent or quietly scans a whole table. Having the skills loaded means the assistant writes code that wouldn't embarrass me in a review.&lt;/p&gt;

&lt;p&gt;If you adopt the SDK, copy the skills pack too. It's in the repo under .falkordb-skills/, MIT-licensed.&lt;/p&gt;

&lt;p&gt;Who actually needs this&lt;br&gt;
FalkorDB's graphrag-sdk is not the right tool for every retrieval problem. Being concrete about who it is for:&lt;/p&gt;

&lt;p&gt;You should reach for it when:&lt;/p&gt;

&lt;p&gt;Your corpus has implicit structure the model has to discover — case studies, research reports, customer-support tickets, product documentation, threat-intel feeds. Anything where "what relates to what" isn't already a table. The SDK's open-schema extraction is how you turn that into a graph without designing the graph yourself.&lt;/p&gt;

&lt;p&gt;Your questions regularly span multiple documents or require joining facts — "which of our customers using product X are on plan tier Y and had an incident last month?" The multi-hop and numeric wins on the judge rubric weren't theoretical; they were on exactly these question shapes.&lt;/p&gt;

&lt;p&gt;You need provenance on every answer — "which document did that claim come from?" The SDK tracks chunk provenance end-to-end; my query_demo.py prints it. This is the baseline for anything regulated or customer-facing.&lt;/p&gt;

&lt;p&gt;You're a Python team shipping a chat or agent feature and you do not want to stand up a separate Neo4j instance, a separate vector DB, and a separate ingestion pipeline. FalkorDB runs as a single container alongside your cache (it is a Redis module). The whole stack for this demo is docker compose up -d.&lt;br&gt;
You want to keep the graph small and fast — in-memory graph, 4 MB resident for this corpus, sub-millisecond single-hop Cypher. This matters when the graph is inline with a request path, not a nightly batch job.&lt;/p&gt;

&lt;p&gt;It's less obviously a fit when:&lt;/p&gt;

&lt;p&gt;Your "knowledge base" is actually a structured database (SaaS CRM, an ERP) where the relationships are already explicit. A graph projection over SQL, or a text-to-SQL pipeline, is a shorter path. The SDK's extraction cost is wasted on content you already have structured.&lt;br&gt;
Your workload is pure short-factual lookups where sub-second latency matters more than nuance. Neo4j's single-pass pipeline wins on latency and cost; vector RAG alone would win by even more. Do not pay for a keyword-extraction pre-pass if the question is "what's the phone number on page 3."&lt;br&gt;
You need a globally-distributed, multi-region write path. FalkorDB is single-instance or primary-replica. Fine for most apps; not for multi-region active-active.&lt;br&gt;
Your corpus is so large (hundreds of GB) that in-memory is not an option. FalkorDB can persist to disk, but the design center is "graph fits comfortably in RAM."&lt;br&gt;
If I had to describe the ideal user in one sentence: a Python developer building an agent, copilot, or customer-facing Q&amp;amp;A feature over a few hundred MB to a few GB of unstructured domain content, who needs multi-hop reasoning with source citations, wants ~60 lines of pipeline code to do most of the work, and cares more about being right than being the fastest responder by 300 ms.&lt;/p&gt;

&lt;p&gt;If that's you, the ~$0.05 ingest and ~$0.00065 per-query numbers from this benchmark are the shape you'll see in production. If that's not you, use something else — and the same harness in this repo will tell you that honestly.&lt;/p&gt;

&lt;p&gt;Running it yourself&lt;/p&gt;

&lt;p&gt;Everything is in the repo: the 8 Markdown files, the 25-question set, the judge prompt, all four raw-result JSON directories, the auto-rendered COMPARISON.md, and the hand-written COMPARISON_FULL.md with per-question transcripts and judge rationales.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/FalkorDB/graphrag-sdk-demo
cd graphrag-sdk-demo
cp .env.example .env &amp;amp;&amp;amp; $EDITOR .env    # OPENAI_API_KEY
docker compose up -d
python3.14 -m venv .venv &amp;amp;&amp;amp; .venv/bin/pip install -r requirements.txt
.venv/bin/python benchmark_compare.py all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;About fifteen minutes, about fifteen cents.&lt;/p&gt;

&lt;p&gt;If you want a recommendation, here's the one I'm willing to commit to: if your workload is dominated by factual completeness and multi-hop reasoning where correctness and source recall matter more than latency, FalkorDB's extra keyword-extraction pass is paying for something measurable (+0.38 factual, +0.33 multi-hop, 97 % vs 90 % source-doc recall, +0.24 correctness, +0.28 completeness). If your workload is latency- or cost-sensitive — especially short comparative questions where Neo4j's tighter single-pass prompt actually wins, or numeric extraction where the two stacks tie — Neo4j is 27 % faster and 27 % cheaper and gives equivalent answers.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>falkordb</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Text2SQL on xxxx's of tables?</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Mon, 05 Jan 2026 13:56:33 +0000</pubDate>
      <link>https://dev.to/danshalev7/text2sql-on-xxxxs-of-tables-4775</link>
      <guid>https://dev.to/danshalev7/text2sql-on-xxxxs-of-tables-4775</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7khs4z1till7adolazzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7khs4z1till7adolazzn.png" alt="QueryWeaver text-to-sql interface" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Text-to-SQL tools work fine in demos. Production with 30K+ tables? They hallucinate relationships and fail in ways you can't debug because the code is closed.&lt;/p&gt;

&lt;p&gt;We've shipped several updates focused on solving this.&lt;/p&gt;

&lt;h3&gt;What's New&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;React UI Rebuild&lt;/strong&gt;&lt;br&gt;
Cleaned up the frontend, fixed button logic issues, improved overall UX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Release&lt;/strong&gt;&lt;br&gt;
New API endpoint for programmatic access. Full docs in the repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Performance Improvements&lt;/strong&gt;&lt;br&gt;
Optimized graph generation and query processing for large-scale schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Schema Auditability&lt;/strong&gt;&lt;br&gt;
You can now inspect exactly how your schema gets mapped into the knowledge graph—see the indexing logic, relationship detection, everything.&lt;/p&gt;

&lt;h3&gt;The Architecture Difference&lt;/h3&gt;

&lt;p&gt;Vector embeddings can't capture the semantic depth needed for complex relational schemas. QueryWeaver uses knowledge graphs to map table relationships structurally, not statistically. The result is less hallucination and better accuracy, on both first and subsequent queries.&lt;/p&gt;
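&lt;p&gt;A sketch of the idea (the labels and relationship types here are illustrative, not QueryWeaver’s actual internal schema): once tables, columns, and foreign keys are graph elements, join paths can be walked structurally instead of guessed from embedding similarity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Tables become nodes; foreign keys become explicit edges
CREATE (o:Table {name: 'orders'}),
       (c:Table {name: 'customers'}),
       (o)-[:REFERENCES {via: 'customer_id'}]-&gt;(c)

// "Which tables can I join to customers?" is a structural query,
// not a similarity guess
MATCH (t:Table)-[r:REFERENCES]-&gt;(:Table {name: 'customers'})
RETURN t.name, r.via
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;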

&lt;p&gt;Check it out, it's free: &lt;a href="https://app.queryweaver.ai/" rel="noopener noreferrer"&gt;https://app.queryweaver.ai/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>falkordb</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Graph to store your security data?</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Sun, 17 Aug 2025 06:15:15 +0000</pubDate>
      <link>https://dev.to/falkordb/graph-to-store-your-security-data-531l</link>
      <guid>https://dev.to/falkordb/graph-to-store-your-security-data-531l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The Challenge: You run a multi-tenant security platform and need to guarantee full tenant isolation, without commingling customers' data in the same database and without spinning up a dedicated database for every new customer.&lt;/p&gt;

&lt;p&gt;How it affects you: You either introduce risk of data leakage or waste infrastructure resources on isolated stacks.&lt;/p&gt;

&lt;p&gt;Why choose graph: You can manage 10,000+ isolated graph tenants per database (across all pricing tiers). Each tenant gets a private namespace and query surface.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;🟰 Business impact: Zero tenant data commingling. Minimal DevOps overhead. Efficient scaling of your infrastructure as you grow.&lt;/p&gt;

&lt;p&gt;The other 5 reasons are &lt;a href="https://www.falkordb.com/blog/store-security-data-in-a-graph/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>graph</category>
      <category>database</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Graph database v4.10 is out!</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Thu, 12 Jun 2025 14:02:00 +0000</pubDate>
      <link>https://dev.to/falkordb/graph-database-v410-is-out-883</link>
      <guid>https://dev.to/falkordb/graph-database-v410-is-out-883</guid>
<description>&lt;p&gt;The new release (v4.10.0) is out, and I wanted to share some of the updates and ask for feedback from folks who care about performance and memory efficiency in graph-heavy systems.&lt;/p&gt;

&lt;p&gt;FalkorDB is an open-source property graph database that supports OpenCypher (with our own extensions) and is used under the hood in retrieval-augmented generation setups where accuracy matters.&lt;/p&gt;

&lt;p&gt;The big problem we’re working on is scaling graph databases without memory bloat or unpredictable performance in prod. Indexing support for array fields tends to be limited. And if you want to do something basic like compare a current value to the previous one in a sequence (think time-series modeling), the query engine often makes you jump through hoops.&lt;/p&gt;

&lt;p&gt;We started FalkorDB after working for years on RedisGraph (we were the original authors). Rather than patch the old codebase, we built FalkorDB on a sparse matrix algebra backend for performance. Our goal was to build something that could hold up under pressure, like 10K+ graphs in a single instance, and still let you answer complex queries interactively.&lt;/p&gt;

&lt;p&gt;To get closer to this goal, we’ve added the following improvements in this version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;String interning&lt;/strong&gt; via a new intern() function. It lets you deduplicate identical strings across graphs, which is surprisingly useful in, for example, recommender systems where you have millions of “US” strings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory reporting&lt;/strong&gt; via a new command (GRAPH.MEMORY USAGE) that breaks down memory consumption by nodes, edges, matrices, and indices, per graph. It’s useful when you’re trying to figure out whether your heap is getting crushed by edge cardinality or indexing overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smarter indexing&lt;/strong&gt;: arrays are now natively indexable in a way that’s actually usable in production (Neo4j doesn’t do this natively, last I checked).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graph analytics&lt;/strong&gt;: we added CDLP (community detection via label propagation), WCC (weakly connected components), and betweenness centrality, all exposed as procedures. These came out of working with teams in fraud detection and behavioral clustering, where you don’t want to guess the number of communities in advance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to try FalkorDB, we recommend running it via Docker. The code’s also available on GitHub (&lt;a href="https://github.com/FalkorDB/falkordb" rel="noopener noreferrer"&gt;https://github.com/FalkorDB/falkordb&lt;/a&gt;), and we have a live sandbox you can play with at &lt;a href="https://browser.falkordb.com" rel="noopener noreferrer"&gt;https://browser.falkordb.com&lt;/a&gt;. No login or install needed to run queries. Docs are at &lt;a href="https://docs.falkordb.com" rel="noopener noreferrer"&gt;https://docs.falkordb.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Curious to hear from anyone who’s building graph-heavy systems, especially if you’ve hit memory or indexing limits elsewhere. We’re heads-down building and always learning, and grateful for any feedback or test cases you throw at us.&lt;/p&gt;
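&lt;p&gt;If it helps, here’s a minimal way to spin it up and poke at the two new features mentioned above. The image name and argument shapes below are the usual defaults, so double-check against the docs before relying on them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start FalkorDB locally (assumes the official falkordb/falkordb image)
docker run --rm -it -p 6379:6379 falkordb/falkordb

# Deduplicate a repeated string value with the new intern() function
redis-cli GRAPH.QUERY demo "CREATE (:User {country: intern('US')})"

# Break down memory usage for one graph: nodes, edges, matrices, indices
redis-cli GRAPH.MEMORY USAGE demo
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;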

</description>
      <category>database</category>
    </item>
    <item>
      <title>How to use a knowledge graph ft. Yohei Nakajima</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Tue, 27 May 2025 11:29:14 +0000</pubDate>
      <link>https://dev.to/falkordb/how-to-use-a-knowledge-graph-ft-yohei-nakajima-2mf7</link>
      <guid>https://dev.to/falkordb/how-to-use-a-knowledge-graph-ft-yohei-nakajima-2mf7</guid>
      <description>&lt;p&gt;​In this workshop we’ll show 2 live builds: Fractal KG, a UI for building knowledge graphs from a natural language prompt, and VCPedia, a real-time startup intelligence Crunchbase-like platform that is graph powered by hourly Twitter pulls, LLM-based funding-round extraction, and automated newsletters.&lt;/p&gt;

&lt;p&gt;Get information: Map out Fractal KG’s architecture: ingestion, embedding, dedupe, hierarchy, and newsletters.&lt;br&gt;
Get inspired: Break down VCPedia’s architecture.&lt;br&gt;
Get started: Integrate FalkorDB for graph queries, multi-tenant support, and setup.&lt;/p&gt;

&lt;p&gt;18 June, 2025 ⏰ PDT: 10:00 AM | EDT: 1:00 PM | CEST: 7:00 PM&lt;/p&gt;

&lt;p&gt;Sign up link: &lt;a href="https://lu.ma/192rvxcg" rel="noopener noreferrer"&gt;Sign up here (free)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. You can register and we’ll send you a recap, even if you can’t attend.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>graphql</category>
    </item>
    <item>
      <title>Questions to ask before you build a knowledge graph</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Tue, 06 May 2025 08:10:45 +0000</pubDate>
      <link>https://dev.to/falkordb/questions-to-ask-before-you-build-a-knowledge-graph-47ff</link>
      <guid>https://dev.to/falkordb/questions-to-ask-before-you-build-a-knowledge-graph-47ff</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Are you planning to develop intelligent chatbots that require advanced understanding and interaction capabilities?&lt;/li&gt;
&lt;li&gt;Is your focus on enabling dynamic, complex research endeavors?&lt;/li&gt;
&lt;li&gt;Do you want to visualize or monitor asset flows and risks within your organization?&lt;/li&gt;
&lt;li&gt;Do you aim to unlock siloed data or enhance connectivity between disparate data environments?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.falkordb.com/blog/how-to-build-a-knowledge-graph/" rel="noopener noreferrer"&gt;Knowledge graphs&lt;/a&gt; help structure information by capturing relationships between disparate data points. They allow users to integrate data from diverse sources and discover hidden patterns and connections.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why I Fell in Love with Rust Procedural Macros</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Mon, 05 May 2025 13:14:31 +0000</pubDate>
      <link>https://dev.to/falkordb/why-i-fell-in-love-with-rust-procedural-macros-49mh</link>
      <guid>https://dev.to/falkordb/why-i-fell-in-love-with-rust-procedural-macros-49mh</guid>
      <description>&lt;p&gt;I consider myself a junior Rust developer. I have been learning Rust for a few months now, and I have thoroughly enjoyed&lt;br&gt;
the process. Recently, we started writing the &lt;a href="https://github.com/FalkorDB/falkordb-rs-next-gen" rel="noopener noreferrer"&gt;next generation&lt;/a&gt; of&lt;br&gt;
FalkorDB using Rust. We chose Rust because of its&lt;br&gt;
performance, safety, and rich type system.&lt;/p&gt;

&lt;p&gt;One part we are implementing by hand is the scanner and parser. We do this to optimize performance and to maintain a&lt;br&gt;
clean AST (abstract syntax tree). We are working with&lt;br&gt;
the &lt;a href="https://github.com/FalkorDB/falkordb-rs-next-gen/blob/main/graph/src/Cypher.g4" rel="noopener noreferrer"&gt;Antlr4 Cypher grammar&lt;/a&gt;, where each rule in the grammar maps to a Rust function.&lt;/p&gt;

&lt;p&gt;For example, consider the parse rule for a NOT expression:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;oC_NotExpression&lt;br&gt;
: ( NOT SP? )* oC_ComparisonExpression ;&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This corresponds to the Rust function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;parse_not_expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;not_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Not&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;not_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_comparison_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;not_count&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we compress consecutive NOT expressions during parsing, but otherwise the procedure closely resembles the Antlr4&lt;br&gt;
grammar. The function first consumes zero or more NOT tokens, then calls parse_comparison_expr.&lt;/p&gt;

&lt;p&gt;While working on the parser, a recurring pattern emerged. Many expressions follow the form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;oC_ComparisonExpression
: oC_OrExpression ( ( SP? COMPARISON_OPERATOR SP? ) oC_OrExpression )* ;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which translates roughly to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;parse_comparison_expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_or_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ComparisonOperator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_or_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;BinaryOp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, for addition and subtraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;oC_AddOrSubtractExpression
: oC_MultiplyDivideModuloExpression ( ( SP? '+' SP? oC_MultiplyDivideModuloExpression ) | ( SP? '-' SP? oC_MultiplyDivideModuloExpression ) )* ;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which looks like this in Rust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;parse_add_sub_expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_mul_div_modulo_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Plus&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_mul_div_modulo_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Dash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_mul_div_modulo_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Plus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Dash&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nf"&gt;.contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern appeared repeatedly with one, two, or three operators. Although the code is not very complicated, it would&lt;br&gt;
be nice to have a macro that generates this code for us.&lt;/p&gt;

&lt;p&gt;We envisioned a macro that takes the expression parser and pairs of (token, AST constructor) like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;parse_binary_expr!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.parse_mul_div_modulo_expr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plus&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dash&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Sub&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I started exploring how to write procedural macros in Rust, and I must say it was a very pleasant experience. With&lt;br&gt;
the help of the crates quote and syn, I was able to write a procedural macro that generates this code automatically. The&lt;br&gt;
quote crate lets you generate token streams from templates, and syn allows parsing Rust code into syntax trees and token&lt;br&gt;
streams. Using these two crates makes writing procedural macros in Rust feel like writing a compiler extension.&lt;/p&gt;

&lt;p&gt;Let's get into the code.&lt;/p&gt;

&lt;p&gt;The first step is to model your macro syntax using Rust data structures. In our case, I used two structs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;BinaryOp&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Expr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;binary_op_alts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BinaryOpAlt&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;BinaryOpAlt&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;token_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Ident&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ast_constructor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Ident&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaves of these structs are data types from the syn crate. Expr represents any Rust expression, and syn::Ident&lt;br&gt;
represents an identifier.&lt;/p&gt;

&lt;p&gt;Next, we parse the token stream into these data structures. This is straightforward with syn by implementing the Parse&lt;br&gt;
trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Parse&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BinaryOp&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ParseStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;parse_exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="nf"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="py"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;Token!&lt;/span&gt;&lt;span class="p"&gt;[,]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;binary_op_alts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;punctuated&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Punctuated&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BinaryOpAlt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;Token!&lt;/span&gt;&lt;span class="p"&gt;[,]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;parse_separated_nonempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;binary_op_alts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;binary_op_alts&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Parse&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BinaryOpAlt&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ParseStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;token_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="nf"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="py"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nn"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;Token!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;ast_constructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="nf"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;token_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ast_constructor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syn crate smartly parses the token stream into the data structures based on the expected types (Token, Expr, Ident,&lt;br&gt;
or BinaryOpAlt).&lt;/p&gt;

&lt;p&gt;The final step is to generate the appropriate code from these data structures using the quote crate, which lets you&lt;br&gt;
write Rust code templates that generate token streams. This is done by implementing the ToTokens trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nn"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ToTokens&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BinaryOp&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;to_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="nn"&gt;proc_macro2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TokenStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;binary_op_alts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.binary_op_alts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;parse_exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.parse_exp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_token_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;binary_op_alts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="nf"&gt;.extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;generate_token_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Expr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;alts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BinaryOpAlt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;proc_macro2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TokenStream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;whiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alts&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;token_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="py"&gt;.token_match&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;ast_constructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="py"&gt;.ast_constructor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nn"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;quote!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;#&lt;span class="n"&gt;token_match&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;#&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nn"&gt;QueryExprIR&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;#&lt;span class="nf"&gt;ast_constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alts&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;token_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="py"&gt;.token_match&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nn"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;quote!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nn"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;#&lt;span class="n"&gt;token_match&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nn"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;quote!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;#&lt;span class="n"&gt;parse_exp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
#&lt;span class="p"&gt;(&lt;/span&gt;#&lt;span class="n"&gt;whiles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;#&lt;span class="p"&gt;(&lt;/span&gt;#&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nf"&gt;.contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.lexer&lt;/span&gt;&lt;span class="nf"&gt;.current&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="nf"&gt;.pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In generate_token_stream, we first generate one while loop per operator alternative, then splice them all inside an outer loop using the repetition syntax &lt;code&gt;#(#whiles)*&lt;/code&gt;. And that's it!&lt;/p&gt;

&lt;p&gt;You can find the full code &lt;a href="https://github.com/FalkorDB/falkordb-rs-next-gen/tree/main/falkordb-macro" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
    </item>
    <item>
      <title>VectorRAG is naive, lacks domain awareness, and can’t handle full dataset retrieval</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Tue, 29 Apr 2025 12:13:40 +0000</pubDate>
      <link>https://dev.to/falkordb/vectorrag-is-naive-lacks-domain-awareness-and-cant-handle-full-dataset-retrieval-5g59</link>
      <guid>https://dev.to/falkordb/vectorrag-is-naive-lacks-domain-awareness-and-cant-handle-full-dataset-retrieval-5g59</guid>
      <description>&lt;p&gt;If we were building a GenAI stack today, we'd start with one question: Can your retrieval system handle multi-hop logic?&lt;/p&gt;

&lt;p&gt;Trick question, because most can’t. They treat retrieval as nearest-neighbor search.&lt;/p&gt;

&lt;p&gt;Today, we discussed scaling &lt;a href="https://github.com/FalkorDB/GraphRAG-SDK/blob/main/README.md" rel="noopener noreferrer"&gt;GraphRAG&lt;/a&gt; at AWS DevOps Day, and the takeaway is clear: VectorRAG is naive, lacks domain awareness, and can’t handle full dataset retrieval.&lt;/p&gt;

&lt;p&gt;GraphRAG &lt;a href="https://falkordb.com/" rel="noopener noreferrer"&gt;builds a knowledge graph&lt;/a&gt; from source documents, allowing for a deeper understanding of the data and higher accuracy.&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>X0,000s Ops/sec with Multigraph Topology</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Wed, 23 Apr 2025 13:44:32 +0000</pubDate>
      <link>https://dev.to/falkordb/x0000s-opssec-with-multigraph-topology-4nea</link>
      <guid>https://dev.to/falkordb/x0000s-opssec-with-multigraph-topology-4nea</guid>
      <description>&lt;p&gt;Yesterday's 'Easily Achieve X0,000s Ops/sec with Multigraph Topology' workshop was awesome. Great questions, too! &lt;/p&gt;

&lt;p&gt;We started with a standalone 16-core machine handling 26k queries per second. We then tripled and doubled the number of cores, and throughput scaled linearly.&lt;/p&gt;

&lt;p&gt;Recording link:  &lt;a href="https://www.youtube.com/watch?v=LbeA0-xy1f8" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=LbeA0-xy1f8&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>discuss</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Your GenAI system is only as smart as its retrieval layer.</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Wed, 16 Apr 2025 11:20:32 +0000</pubDate>
      <link>https://dev.to/falkordb/your-genai-system-is-only-as-smart-as-its-retrieval-layer-2963</link>
      <guid>https://dev.to/falkordb/your-genai-system-is-only-as-smart-as-its-retrieval-layer-2963</guid>
      <description>&lt;p&gt;A recent enterprise GenAI survey just confirmed what we’ve been seeing in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;85% of teams are deploying LLMs&lt;/li&gt;
&lt;li&gt;71% are &lt;em&gt;already seeing output risks&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;99% agree: human oversight is still mandatory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s be clear—this isn’t about model tuning. It’s about &lt;em&gt;retrieval failure&lt;/em&gt; at the infrastructure level.&lt;/p&gt;

&lt;p&gt;You can’t generate correct answers if your stack can’t model relationships.&lt;/p&gt;

&lt;p&gt;You can’t trace decisions if your data lacks structure.&lt;/p&gt;

&lt;p&gt;You can’t scale trust if your system hallucinates under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s what actually works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load your structured + unstructured knowledge into a graph database&lt;/li&gt;
&lt;li&gt;Model entities, relationships, and policies—don’t flatten them&lt;/li&gt;
&lt;li&gt;Route results into your LLM prompt—clean, fast, explainable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how you replace retrieval duct tape with graph-native reasoning.&lt;/p&gt;

&lt;p&gt;Exploring advanced RAG &amp;amp; GraphRAG? Start here: &lt;a href="https://github.com/FalkorDB/GraphRAG-SDK" rel="noopener noreferrer"&gt;https://github.com/FalkorDB/GraphRAG-SDK&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Add a Knowledge Graph 3x better</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Wed, 09 Apr 2025 07:27:09 +0000</pubDate>
      <link>https://dev.to/falkordb/add-a-knowledge-graph-3x-better-24mb</link>
      <guid>https://dev.to/falkordb/add-a-knowledge-graph-3x-better-24mb</guid>
      <description>&lt;p&gt;If your AI agent doesn’t know when it’s wrong, it doesn’t belong in production.&lt;/p&gt;

&lt;p&gt;We reviewed a recent study that tested LLM pipelines against enterprise data environments, benchmarking their performance on enterprise datasets using the Yale Spider schema.&lt;/p&gt;

&lt;p&gt;Same model. Same questions. Different architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s what changed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL-only → 17.28% accuracy&lt;/li&gt;
&lt;li&gt;Add a Knowledge Graph → 3x better&lt;/li&gt;
&lt;li&gt;Add ontology-based query checks + repair loop → 72.55% accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not incremental progress. That’s a systems-level shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here’s what mattered most:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;70% of fixes came from domain constraints in the query body&lt;/li&gt;
&lt;li&gt;Most gains showed up in complex schema environments—think KPIs and strategic planning&lt;/li&gt;
&lt;li&gt;When the model couldn’t repair itself, it admitted it with “unknown,” cutting hallucinated outputs by a huge margin&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The architecture looks like this:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ontologies validate logic pre-execution&lt;/li&gt;
&lt;li&gt;Knowledge graphs serve as real-time reasoning layers&lt;/li&gt;
&lt;li&gt;LLM repair loops handle failure cases autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://falkordb.com" rel="noopener noreferrer"&gt;FalkorDB&lt;/a&gt; is already solving the low-latency challenge here—serving graphs in real time for reasoning-heavy queries.&lt;/p&gt;

&lt;p&gt;The lesson: you don’t need smarter prompts. You need systems that can detect when the logic breaks—and fix it before it hits the user.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Vector Recall Reasoning</title>
      <dc:creator>Dan Shalev</dc:creator>
      <pubDate>Tue, 01 Apr 2025 13:47:41 +0000</pubDate>
      <link>https://dev.to/falkordb/vector-recall-reasoning-5cko</link>
      <guid>https://dev.to/falkordb/vector-recall-reasoning-5cko</guid>
      <description>&lt;p&gt;If your GenAI agent can’t reason across relationships, memory, and context—it’s not an agent. It’s a demo.&lt;/p&gt;

&lt;p&gt;We came across a research system called DEMENTIA-PLAN—focused on dementia care, but it exposed something bigger: the fatal flaw in most GenAI stacks (source in comments).&lt;/p&gt;

&lt;p&gt;Agents that run vector-only retrieval pipelines fail to answer questions grounded in human context.&lt;/p&gt;

&lt;p&gt;DEMENTIA-PLAN used multiple &lt;a href="https://www.falkordb.com/blog/kpmg-ai-report-graphrag-ai-agents/" rel="noopener noreferrer"&gt;knowledge graphs + a planning agent&lt;/a&gt; to adapt retrieval in real time. Result? 30% better memory support. 10% higher coherence.&lt;/p&gt;

&lt;p&gt;This isn’t just about healthcare.&lt;br&gt;
It’s a blueprint for every agent stack that actually needs to think. If your RAG pipeline gets retrieval wrong, you’re shipping guessware.&lt;/p&gt;

</description>
      <category>agentaichallenge</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
