Identity resolution is a classic graph problem hiding inside what many teams still try to solve with relational joins.
In this project, I built a local-first identity resolution engine that:
- stores fragmented user identifiers in Neo4j
- generates synthetic overlapping user data
- resolves hidden identity links through recursive Cypher traversals
- uses Ollama to translate natural language into Cypher
- exposes the whole workflow through both a CLI and a lightweight web UI
This post walks through the design, code, and implementation details.
The Problem
Imagine this simplified user journey:
- one account signs in with tech_user@gmail.com
- another signs in with tech_user+alt@gmail.com
- both use the same device
- a third identity later shares a cookie or phone with the second identity
Those records may look independent in tabular storage, but in a graph they form a connected component.
That is exactly what identity resolution needs: path discovery across shared identifiers.
Why a Graph Database Fits Better
In SQL, link analysis usually becomes:
- self-joins on login tables
- joins on device tables
- more joins on email or cookie tables
- increasingly difficult reasoning as hop depth increases
In Neo4j, identifiers become nodes and relationships become explicit edges.
So instead of forcing the query planner through repeated joins, you traverse the graph directly.
High-Level Architecture
flowchart TD
USER["Developer / Analyst"] --> UI["CLI or Web UI"]
UI --> APP["Python Service Layer"]
APP --> OLLAMA["Ollama Local Model"]
APP --> NEO["Neo4j"]
APP --> SEED["Synthetic Data Generator"]
OLLAMA --> GEN["Generated Cypher"]
GEN --> APP
APP --> PROFILE["PROFILE Metrics"]
PROFILE --> UI
NEO --> UI
The Graph Model
The core schema is intentionally small:
(:Identity {identity_id, full_name})
(:Email {address})
(:Phone {number})
(:Device {device_id})
(:Cookie {cookie_id})
Relationships:
(:Identity)-[:HAS_EMAIL]->(:Email)
(:Identity)-[:HAS_PHONE]->(:Phone)
(:Identity)-[:USED_DEVICE]->(:Device)
(:Identity)-[:ASSOCIATED_WITH]->(:Cookie)
This keeps the graph readable while still modeling real-world identity fragmentation.
Local Infrastructure
Neo4j runs locally via Docker Compose:
services:
  neo4j:
    image: neo4j:5.26
    container_name: neo4j-identity
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      NEO4J_AUTH: neo4j/password
This lets the entire project stay local:
- local database
- local seeding
- local LLM inference
- local UI
Seeding Synthetic Overlap Data
The project uses Faker to generate 100 identities, but the important part is not just fake data. It is fake data with overlap.
The seeding script deliberately creates:
- shared devices
- shared cookies
- reused emails
- reused phones
- explicit demo chains
Code excerpt:
shared_devices = [f"device-shared-{idx:03d}" for idx in range(1, 11)]
shared_cookies = [f"cookie-shared-{idx:03d}" for idx in range(1, 11)]
shared_emails = [f"shared_user_{idx}@example.com" for idx in range(1, 7)]
shared_phones = [f"+14155550{idx:03d}" for idx in range(1, 7)]
Then, for selected rows, those identifiers get reused.
The script also injects a deterministic demo chain:
chain_overrides = [
("identity-001", "tech_user@gmail.com", "+14155550111", "device-demo-001", "cookie-demo-001"),
("identity-002", "tech_user+alt@gmail.com", "+14155550112", "device-demo-001", "cookie-demo-002"),
("identity-003", "bob@info.com", "+14155550113", "device-demo-002", "cookie-demo-002"),
("identity-004", "robert@info.com", "+14155550114", "device-demo-002", "cookie-demo-004"),
]
That gives you predictable live demo scenarios.
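The overlap is what makes resolution work at all. As a quick sanity check of the idea, independent of Neo4j, a few lines of pure Python confirm that the demo chain above collapses into a single connected component; the rows mirror chain_overrides:

```python
from collections import defaultdict

# Rows mirror chain_overrides: (identity, email, phone, device, cookie)
rows = [
    ("identity-001", "tech_user@gmail.com", "+14155550111", "device-demo-001", "cookie-demo-001"),
    ("identity-002", "tech_user+alt@gmail.com", "+14155550112", "device-demo-001", "cookie-demo-002"),
    ("identity-003", "bob@info.com", "+14155550113", "device-demo-002", "cookie-demo-002"),
    ("identity-004", "robert@info.com", "+14155550114", "device-demo-002", "cookie-demo-004"),
]

def connected_components(rows):
    """Group identities that share any identifier (email, phone, device, cookie)."""
    owners = defaultdict(list)  # identifier -> identities that use it
    for identity, *identifiers in rows:
        for ident in identifiers:
            owners[ident].append(identity)

    # Union-find over identities
    parent = {identity: identity for identity, *_ in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in owners.values():
        for other in group[1:]:
            parent[find(group[0])] = find(other)

    clusters = defaultdict(set)
    for identity in parent:
        clusters[find(identity)].add(identity)
    return list(clusters.values())

components = connected_components(rows)
# All four identities end up in one cluster: 001-002 share a device,
# 002-003 share a cookie, 003-004 share a device.
```

This is the same result the Neo4j traversal produces, just computed naively; the graph database earns its keep once the data grows past toy size.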
Data Integrity with Constraints
The project creates uniqueness constraints before seeding:
CREATE CONSTRAINT email_address_unique IF NOT EXISTS
FOR (e:Email) REQUIRE e.address IS UNIQUE
CREATE CONSTRAINT phone_number_unique IF NOT EXISTS
FOR (p:Phone) REQUIRE p.number IS UNIQUE
This matters because graph ingestion should preserve identity semantics:
- a given email node should not duplicate accidentally
- a phone number should resolve to one canonical node
The Core Resolution Query
The simplest demonstration query is also one of the most important:
MATCH (start:Email {address: $email})
MATCH p=(start)-[*1..4]-(related)
RETURN p
That [*1..4] is the key.
It says:
- start at a known identifier
- traverse any relationship type
- explore up to 4 hops
This exposes transitive links like:
Email -> Identity -> Device -> Identity -> Cookie
For an identity resolution demo, that is the moment when the graph model becomes obvious.
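The bounded traversal can also be illustrated without a database. This sketch runs a breadth-first search up to 4 hops over a toy adjacency list shaped like the Email -> Identity -> Device -> Identity -> Cookie chain; the node names are illustrative, not pulled from the real dataset:

```python
from collections import deque

# Toy undirected adjacency list mirroring the schema's relationships
graph = {
    "email:tech_user@gmail.com": ["identity-001"],
    "identity-001": ["email:tech_user@gmail.com", "device-demo-001"],
    "device-demo-001": ["identity-001", "identity-002"],
    "identity-002": ["device-demo-001", "cookie-demo-002"],
    "cookie-demo-002": ["identity-002"],
}

def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops edges,
    mimicking Cypher's (start)-[*1..max_hops]-(related) pattern."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # [*1..4] excludes the zero-hop start node itself
    return seen      # node -> hop distance

reachable = within_hops(graph, "email:tech_user@gmail.com", 4)
```

The cookie sits exactly 4 hops from the starting email, which is why the hop bound matters: [*1..3] would miss it, while an unbounded [*] risks exploding on dense graphs.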
Natural Language to Cypher with Ollama
The next layer is accessibility. Not everyone wants to write Cypher by hand.
So the app sends the user’s question to Ollama along with a tightly constrained prompt:
- graph schema
- relationship directions
- allowed Cypher clauses
- examples
- output shape requirements
Prompt excerpt:
Rules:
- Return JSON only.
- JSON shape must be {"cypher":"...", "params":{...}, "explanation":"..."}.
- Use only MATCH, OPTIONAL MATCH, WITH, RETURN, ORDER BY, LIMIT.
- Never use CREATE, MERGE, DELETE, SET, REMOVE, DROP, CALL, LOAD CSV, APOC, or schema operations.
This matters because local models are fast and convenient, but they are not always consistent.
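Assembling that prompt is mostly string plumbing. A minimal sketch, assuming a schema summary and rule list (the wording below is illustrative, not the project's exact prompt text):

```python
# Illustrative schema excerpt and rules; the project's real prompt is longer.
SCHEMA = """\
(:Identity {identity_id, full_name})
(:Email {address})
(:Identity)-[:HAS_EMAIL]->(:Email)"""

RULES = [
    'Return JSON only.',
    'JSON shape must be {"cypher":"...", "params":{...}, "explanation":"..."}.',
    "Use only MATCH, OPTIONAL MATCH, WITH, RETURN, ORDER BY, LIMIT.",
    "Never use CREATE, MERGE, DELETE, SET, REMOVE, DROP, CALL, LOAD CSV.",
]

def build_prompt(question: str) -> str:
    """Combine schema, rules, and the user's question into one constrained prompt."""
    rules = "\n".join(f"- {rule}" for rule in RULES)
    return f"Schema:\n{SCHEMA}\n\nRules:\n{rules}\n\nQuestion: {question}"

prompt = build_prompt("Find all accounts linked to bob@info.com")
```

Keeping the schema and rules in one place means every interface (CLI or web) sends the model the same constraints.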
Hardening the LLM Path
A naive LLM-to-query pipeline is brittle.
To make it practical, this project adds several guardrails:
1. JSON extraction from noisy model output
Some models wrap valid JSON in extra prose. The parser scans for valid JSON objects instead of assuming perfect output.
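One robust way to do this scan (a sketch; the project's parser may differ in detail) is to try json.JSONDecoder.raw_decode at every opening brace until something parses:

```python
import json

def extract_json(text: str):
    """Scan model output for the first parseable JSON object, tolerating
    surrounding prose or markdown fences. Returns None if nothing parses."""
    decoder = json.JSONDecoder()
    for start in range(len(text)):
        if text[start] != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None

noisy = 'Sure! Here is the query:\n```json\n{"cypher": "MATCH (n) RETURN n LIMIT 5", "params": {}}\n```'
parsed = extract_json(noisy)
```

raw_decode is the key: unlike json.loads, it parses a value starting at an offset and ignores whatever trails it, so closing fences and sign-off prose do not break the parse.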
2. Cypher sanitization
The app rejects write or dangerous clauses:
blocked = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ", "DROP ", "CALL ", "LOAD CSV")
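One way that tuple can be applied (a sketch; the project's exact check may differ) is a case-insensitive scan of the generated query before it ever reaches the database:

```python
# Clauses that must never appear in generated, read-only Cypher
BLOCKED = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ", "DROP ", "CALL ", "LOAD CSV")

def sanitize(cypher: str) -> str:
    """Raise if the query contains a write or procedure clause; otherwise pass through."""
    upper = cypher.upper()
    for clause in BLOCKED:
        if clause in upper:
            raise ValueError(f"Blocked clause in generated Cypher: {clause.strip()}")
    return cypher

safe = sanitize("MATCH (e:Email {address: $email}) RETURN e LIMIT 10")
```

A substring check like this is deliberately conservative: it may reject a rare legitimate query, but for a pipeline executing model-generated text against a database, false rejections are the cheap failure mode.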
3. Semantic linting
The app checks for reversed graph edges such as:
(:Email)-[:HAS_EMAIL]->(:Identity), which is incorrect for this schema
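A reversed-edge check can be as simple as a regular expression over the known labels and relationship types; this is a sketch of the idea, not the project's exact lint rule:

```python
import re

# In this schema, identifier nodes (Email, Phone, Device, Cookie) never own
# the outgoing edge; Identity does. Flag patterns that point the wrong way.
REVERSED_EDGE = re.compile(
    r"\(\s*\w*\s*:\s*(Email|Phone|Device|Cookie)\b[^)]*\)\s*-\s*\[[^\]]*:"
    r"(HAS_EMAIL|HAS_PHONE|USED_DEVICE|ASSOCIATED_WITH)\b[^\]]*\]\s*->"
)

def lint(cypher: str) -> list:
    """Return a list of schema problems found in the generated query."""
    problems = []
    if REVERSED_EDGE.search(cypher):
        problems.append("relationship direction is reversed for this schema")
    return problems

bad = "MATCH (e:Email)-[:HAS_EMAIL]->(i:Identity) RETURN i"
good = "MATCH (i:Identity)-[:HAS_EMAIL]->(e:Email) RETURN e"
```

Direction errors are worth catching because they fail silently: a reversed pattern is valid Cypher that simply matches nothing.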
4. Retry with corrective feedback
If the model returns malformed or semantically invalid output, the app sends back a correction request and asks for a fixed JSON-only response.
That turns the system from “prompt once and hope” into a more resilient local pipeline.
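The retry step can be sketched as a small loop around the model call; call_model here is a stand-in for the real Ollama request, and the correction wording is illustrative:

```python
import json

def ask_with_retry(question, call_model, max_attempts=3):
    """Call the model, validate its output, and feed the failure back
    as a correction prompt until it produces usable JSON."""
    prompt = question
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
            if "cypher" in parsed:
                return parsed
            error = "missing 'cypher' key"
        except json.JSONDecodeError as exc:
            error = str(exc)
        prompt = (f"{question}\n\nYour previous reply was invalid ({error}). "
                  "Respond again with JSON only.")
    raise RuntimeError("model never produced valid JSON")

# Stub model: fails once with prose, then returns valid JSON.
replies = iter(['Here you go: MATCH (n)...', '{"cypher": "MATCH (n) RETURN n"}'])
result = ask_with_retry("list nodes", lambda prompt: next(replies))
```

Two or three attempts is usually enough in practice; if a small local model cannot produce valid JSON by then, the right move is to surface the failure rather than keep burning inference time.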
Query Execution and Performance Profiling
When PROFILE mode is enabled, the app prefixes the generated query:
query = f"PROFILE {cypher}" if profile else cypher
After execution, it walks the returned profile tree and sums DB hits.
That gives two advantages:
- a visible performance story for demos
- a useful debugging aid while iterating on generated Cypher
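The tree walk itself is small. A sketch, assuming the profile arrives as a nested dict of operators each carrying a hit count and a list of children (the real driver's field names may differ, and the values below are invented for illustration):

```python
def total_db_hits(plan: dict) -> int:
    """Recursively sum DB hits across a profiled query plan tree."""
    hits = plan.get("dbHits", 0)
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits

# Toy plan tree shaped like a PROFILE result; numbers are made up.
plan = {
    "operator": "ProduceResults", "dbHits": 0,
    "children": [
        {"operator": "Expand(All)", "dbHits": 42,
         "children": [
             {"operator": "NodeIndexSeek", "dbHits": 3, "children": []},
         ]},
    ],
}
total = total_db_hits(plan)
```

Watching the per-operator numbers is also instructive: an index seek at the leaf with a few hits confirms the uniqueness constraints are doing their job, while a large Expand count shows where the [*1..4] traversal spends its budget.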
CLI and Web UI
The project exposes two interfaces:
CLI
Direct resolution:
python3 main.py resolve --email tech_user@gmail.com --profile
Natural-language query:
python3 main.py ask \
"Find all accounts linked to the device used by 'bob@info.com'." \
--profile \
--model llama3.2:latest
Web UI
The browser UI is intentionally lightweight and dependency-free.
It uses:
- Python’s built-in HTTP server
- HTML/CSS/JS embedded in webapp.py
- the same backend functions as the CLI
That keeps the project easy to run and easy to inspect.
Internal Request Flow
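End to end, a natural-language request moves through the same stages regardless of interface. A condensed sketch of that flow, with the model and database calls stubbed out (function names here are illustrative, not the project's exact API):

```python
import json

def handle_question(question, call_model, run_query):
    """Both the CLI and the web UI funnel into the same pipeline:
    prompt -> local model -> parse -> sanitize -> execute."""
    raw = call_model(f"Rules: return JSON only.\nQuestion: {question}")
    parsed = json.loads(raw)  # the real code scans noisy output and retries
    cypher = parsed["cypher"]
    if any(clause in cypher.upper() for clause in ("CREATE ", "MERGE ", "DELETE ")):
        raise ValueError("write clause rejected")
    return run_query(cypher, parsed.get("params", {}))

# Stubs standing in for Ollama and Neo4j.
fake_model = lambda prompt: '{"cypher": "MATCH (n:Identity) RETURN n LIMIT 5", "params": {}}'
fake_db = lambda cypher, params: [{"identity_id": "identity-001"}]
rows = handle_question("show five identities", fake_model, fake_db)
```

Because the pipeline is a single function of its two external dependencies, swapping the stubs for the real Ollama client and Neo4j driver is the only difference between a unit test and the live app.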
What Makes This Interesting
There are lots of small AI demos and lots of small graph demos. What makes this one useful is the combination:
- practical graph schema
- deterministic overlap generation
- local model integration
- query validation
- performance introspection
- inspectable outputs
It is not just “ask an LLM a database question.” It is a constrained local graph workflow.
Potential Extensions
If I were taking this further, I would add:
1. Graph visualization
Render linked identities as an interactive node-edge graph.
2. Confidence scoring
Use weighted evidence:
- shared device strength
- phone uniqueness
- cookie freshness
- overlap count
3. Batch resolution
Accept CSV input and generate clusters for many identifiers at once.
4. Multi-model evaluation
Run the same prompt across several local Ollama models and compare:
- validity
- speed
- correctness
- DB hits
5. Exportable investigation reports
Generate analyst-friendly summaries for fraud or trust-and-safety teams.
Final Thoughts
This project is a good example of where local AI and graph databases complement each other well.
Neo4j provides the right data model for connected identity evidence.
Ollama provides a convenient natural-language interface for querying that graph.
Python keeps the whole system compact and inspectable.
The result is not just a demo. It is a strong foundation for:
- identity stitching
- risk analysis
- entity resolution
- graph-assisted investigative tooling
If you want to study or extend the code, start with:
- seed_data.py for graph ingestion
- main.py for LLM-to-Cypher and query execution
- webapp.py for the local interface
GitHub repo: https://github.com/harishkotra/identity-resolution-engine


