DEV Community

gibs-dev

We built an open Neo4j expert dataset — here's what we learned

We're building GibsGraph, an open-source tool that lets you query any Neo4j graph in plain English — or build new graphs from unstructured text. To generate good Cypher, the agent needs real Neo4j expertise. Not LLM training data. Actual documentation, patterns, and best practices.

So we built a curated expert dataset from scratch. 920 records. 5 categories. Fully bundled as JSONL — no setup needed.

Here's what we learned along the way.

What's in the dataset

We parsed the official Neo4j documentation — the Cypher manual, modeling guides, knowledge base articles — into structured records:

| Type | Count | Source |
| --- | --- | --- |
| Cypher examples | 446 | Official docs, parsed from AsciiDoc |
| Best practices | 318 | Knowledge base articles, modeling guides |
| Cypher functions | 133 | Cypher manual function reference |
| Cypher clauses | 36 | Cypher manual clause reference |
| Modeling patterns | 23 | Data modeling docs + curated additions |
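
The AsciiDoc parsing step can be sketched as a regex over `[source, cypher]` blocks. This is a simplified illustration, not the project's actual parser; the exact block delimiters it handles are an assumption:

```python
import re

# AsciiDoc marks Cypher listings like:
#   [source, cypher]
#   ----
#   MATCH (n) RETURN n
#   ----
CYPHER_BLOCK = re.compile(
    r"\[source,\s*cypher\]\s*\n-{4,}\n(.*?)\n-{4,}",
    re.DOTALL | re.IGNORECASE,
)

def extract_cypher(asciidoc: str) -> list[str]:
    """Pull every Cypher source block out of one AsciiDoc page."""
    return [m.strip() for m in CYPHER_BLOCK.findall(asciidoc)]
```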

Each record has a source_file tracing back to the original documentation, an authority_level (1 = official docs, 2 = curated), and for some records a quality_tier that controls whether they get loaded at runtime.

After quality filtering, 849 records make it into the live system.
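
Loading the bundled JSONL and applying the quality filter looks roughly like this. The field names (`quality_tier`) come from the dataset; the tier values `"high"` and `"standard"` and the default-load behavior are assumptions for illustration:

```python
import json
from pathlib import Path

def load_expert_records(data_dir: str, allowed_tiers=("high", "standard")):
    """Load every JSONL record, keeping only quality tiers allowed at runtime."""
    records = []
    for path in Path(data_dir).glob("*.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                rec = json.loads(line)
                # quality_tier is optional; records without one load by default
                if rec.get("quality_tier", "standard") in allowed_tiers:
                    records.append(rec)
    return records
```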

The automated audit

Trusting your own data is dangerous. So we built a 4-tier audit script that verifies everything it can mechanically:

Tier 1 — Completeness: Every record has its required fields, valid enums, real source paths.
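
A Tier 1 check is mechanical field validation. A minimal sketch, assuming `source_file` and `authority_level` are the required fields (the full schema has more):

```python
REQUIRED_FIELDS = {"source_file", "authority_level"}  # assumed minimal schema
VALID_AUTHORITY = {1, 2}  # 1 = official docs, 2 = curated

def check_completeness(record: dict) -> list[str]:
    """Return the Tier 1 issues for one record (empty list = clean)."""
    issues = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in (None, ""):
            issues.append(f"missing field: {field}")
    if record.get("authority_level") not in VALID_AUTHORITY:
        issues.append("invalid authority_level")
    return issues
```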

Tier 2 — Cypher validation: All 1,131 Cypher snippets checked for balanced syntax, string interpolation, and pseudocode detection.
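
The balanced-syntax part of Tier 2 can be sketched as a delimiter stack that skips over string literals (this simplified version ignores escaped quotes, which the real audit would also need to handle):

```python
def is_balanced(cypher: str) -> bool:
    """Check that (), [], {} and quotes pair up, ignoring brackets inside strings."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack, quote = [], None
    for ch in cypher:
        if quote:                      # inside a string literal
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack and quote is None
```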

Tier 3 — Cross-reference: Every function and clause name checked against the official Neo4j 5.x reference. We hardcoded 126 built-in functions, plus known APOC and GDS entries.
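
The Tier 3 lookup reduces to a whitelist check. A sketch with a tiny slice of the function list (the real audit hardcodes 126 built-ins and specific APOC/GDS entries; accepting any `apoc.`/`gds.` name by prefix here is a simplification):

```python
# A small sample of the built-in whitelist, not the full 126-entry list
NEO4J_FUNCTIONS = {"coalesce", "collect", "size", "toUpper", "allReduce"}

def is_known_function(name: str) -> bool:
    """True if a function name appears in the reference data."""
    return (
        name in NEO4J_FUNCTIONS
        or name.startswith("apoc.")
        or name.startswith("gds.")
    )
```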

Tier 4 — Duplicates: Within-file duplicate detection, naming convention checks (PascalCase node labels, UPPER_SNAKE relationship types).
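
The naming-convention half of Tier 4 is two regexes, one per identifier kind (a sketch; the exact patterns the audit uses are assumptions):

```python
import re

PASCAL_CASE = re.compile(r"^[A-Z][a-zA-Z0-9]*$")  # node labels, e.g. Person
UPPER_SNAKE = re.compile(r"^[A-Z][A-Z0-9_]*$")    # relationship types, e.g. ACTED_IN

def naming_issues(labels, rel_types):
    """Flag identifiers that break the enforced Neo4j naming conventions."""
    issues = [f"label not PascalCase: {l}" for l in labels if not PASCAL_CASE.match(l)]
    issues += [f"rel type not UPPER_SNAKE: {r}" for r in rel_types if not UPPER_SNAKE.match(r)]
    return issues
```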

What the audit caught

First run results:

```
Records: 956 | Cypher snippets: 1,131

Verified:     3,604 checks passed
Flagged:      33 issues
Human review: 341 records
```

The 7 real failures included:

  • "Introduction" — a parser artifact that got classified as a Cypher function. Not a function. It's the intro paragraph of the functions docs page.
  • allReduce — flagged as unrecognized. Turns out this is a valid Neo4j 5.x predicate function, but our reference list was missing it. The audit caught a gap in our own reference data.
  • Unbalanced Cypher in LET clause — the LET clause examples had syntax issues. LET is a newer GQL-conformance addition and the docs examples are still maturing.
  • A best practice with an unclosed bracket — "Query to kill transactions" had unbalanced Cypher in its example.
  • An APOC example with unbalanced strings — the apoc.load.jsonParams example had mismatched quotes.

Every one of these is a real data quality issue that would have degraded the agent's Cypher generation.

What we can't auto-verify

The 23 modeling patterns and 318 best practices contain domain advice: "When to use this pattern", "What the anti-pattern is", "Why this practice matters." No script can tell you if a modeling recommendation is sound. That takes a human who writes Cypher professionally.
This is the honest gap. The mechanical data is 98.8% clean. The advice data needs expert eyes.

The dataset is open

Everything is MIT-licensed and on GitHub:

  • Raw data: src/gibsgraph/data/*.jsonl
  • Audit script: data/scripts/audit_expert_data.py
  • Review CSVs: data/review_modeling_patterns.csv and data/review_best_practices.csv

If you work with Neo4j and want to help verify the modeling patterns and best practices, we've set up review CSVs with empty review_status and reviewer_notes columns. Even reviewing 10 records helps.
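
Tracking progress over those CSVs is a one-function job. A sketch using the `review_status` column named above (the other column names are assumptions):

```python
import csv

def review_progress(csv_path: str):
    """Count how many rows in a review CSV have a non-empty review_status."""
    done = total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            if row.get("review_status", "").strip():
                done += 1
    return done, total
```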

Join the discussion on GitHub

We'd rather ship 200 verified records than 900 unverified ones.


Written with AI assistance, reviewed and published by https://gibs.dev.

Thank you, Claude.
