Identity resolution is a classic graph problem hiding inside what many teams still try to solve with relational joins.
In this project, I built a local-first identity resolution engine that:
- stores fragmented user identifiers in Neo4j
- generates synthetic overlapping user data
- resolves hidden identity links through recursive Cypher traversals
- uses Ollama to translate natural language into Cypher
- exposes the whole workflow through both a CLI and a lightweight web UI
This post walks through the design, code, and implementation details.
The Problem
Imagine this simplified user journey:
- one account signs in with tech_user@gmail.com
- another signs in with tech_user+alt@gmail.com
- both use the same device
- a third identity later shares a cookie or phone with the second identity
Those records may look independent in tabular storage, but in a graph they form a connected component.
That is exactly what identity resolution needs: path discovery across shared identifiers.
Why a Graph Database Fits Better
In SQL, link analysis usually becomes:
- self-joins on login tables
- joins on device tables
- more joins on email or cookie tables
- increasingly difficult reasoning as hop depth increases
In Neo4j, identifiers become nodes and relationships become explicit edges.
So instead of forcing the query planner through repeated joins, you traverse the graph directly.
High-Level Architecture
flowchart TD
USER["Developer / Analyst"] --> UI["CLI or Web UI"]
UI --> APP["Python Service Layer"]
APP --> OLLAMA["Ollama Local Model"]
APP --> NEO["Neo4j"]
APP --> SEED["Synthetic Data Generator"]
OLLAMA --> GEN["Generated Cypher"]
GEN --> APP
APP --> PROFILE["PROFILE Metrics"]
PROFILE --> UI
NEO --> UI
The Graph Model
The core schema is intentionally small:
(:Identity {identity_id, full_name})
(:Email {address})
(:Phone {number})
(:Device {device_id})
(:Cookie {cookie_id})
Relationships:
(:Identity)-[:HAS_EMAIL]->(:Email)
(:Identity)-[:HAS_PHONE]->(:Phone)
(:Identity)-[:USED_DEVICE]->(:Device)
(:Identity)-[:ASSOCIATED_WITH]->(:Cookie)
This keeps the graph readable while still modeling real-world identity fragmentation.
Local Infrastructure
Neo4j runs locally via Docker Compose:
services:
  neo4j:
    image: neo4j:5.26
    container_name: neo4j-identity
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      NEO4J_AUTH: neo4j/password
This lets the entire project stay local:
- local database
- local seeding
- local LLM inference
- local UI
Seeding Synthetic Overlap Data
The project uses Faker to generate 100 identities, but the important part is not just fake data. It is fake data with overlap.
The seeding script deliberately creates:
- shared devices
- shared cookies
- reused emails
- reused phones
- explicit demo chains
Code excerpt:
shared_devices = [f"device-shared-{idx:03d}" for idx in range(1, 11)]
shared_cookies = [f"cookie-shared-{idx:03d}" for idx in range(1, 11)]
shared_emails = [f"shared_user_{idx}@example.com" for idx in range(1, 7)]
shared_phones = [f"+14155550{idx:03d}" for idx in range(1, 7)]
Then, for selected rows, those identifiers get reused.
The script also injects a deterministic demo chain:
chain_overrides = [
("identity-001", "tech_user@gmail.com", "+14155550111", "device-demo-001", "cookie-demo-001"),
("identity-002", "tech_user+alt@gmail.com", "+14155550112", "device-demo-001", "cookie-demo-002"),
("identity-003", "bob@info.com", "+14155550113", "device-demo-002", "cookie-demo-002"),
("identity-004", "robert@info.com", "+14155550114", "device-demo-002", "cookie-demo-004"),
]
That gives you predictable live demo scenarios.
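The overlap is what makes resolution work at all. As a quick sanity check of the idea, independent of Neo4j, a few lines of pure Python confirm that the demo chain above collapses into a single connected component; the rows mirror chain_overrides:

```python
from collections import defaultdict

# Rows mirror chain_overrides: (identity, email, phone, device, cookie)
rows = [
    ("identity-001", "tech_user@gmail.com", "+14155550111", "device-demo-001", "cookie-demo-001"),
    ("identity-002", "tech_user+alt@gmail.com", "+14155550112", "device-demo-001", "cookie-demo-002"),
    ("identity-003", "bob@info.com", "+14155550113", "device-demo-002", "cookie-demo-002"),
    ("identity-004", "robert@info.com", "+14155550114", "device-demo-002", "cookie-demo-004"),
]

def connected_components(rows):
    """Group identities that share any identifier (email, phone, device, cookie)."""
    owners = defaultdict(list)  # identifier -> identities that use it
    for identity, *identifiers in rows:
        for ident in identifiers:
            owners[ident].append(identity)

    # Union-find over identities
    parent = {identity: identity for identity, *_ in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in owners.values():
        for other in group[1:]:
            parent[find(group[0])] = find(other)

    clusters = defaultdict(set)
    for identity in parent:
        clusters[find(identity)].add(identity)
    return list(clusters.values())

components = connected_components(rows)
# All four identities end up in one cluster: 001-002 share a device,
# 002-003 share a cookie, 003-004 share a device.
```

This is the same result the Neo4j traversal produces, just computed naively; the graph database earns its keep once the data grows past toy size.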
Data Integrity with Constraints
The project creates uniqueness constraints before seeding:
CREATE CONSTRAINT email_address_unique IF NOT EXISTS
FOR (e:Email) REQUIRE e.address IS UNIQUE
CREATE CONSTRAINT phone_number_unique IF NOT EXISTS
FOR (p:Phone) REQUIRE p.number IS UNIQUE
This matters because graph ingestion should preserve identity semantics:
- a given email node should not duplicate accidentally
- a phone number should resolve to one canonical node
The Core Resolution Query
The simplest demonstration query is also one of the most important:
MATCH (start:Email {address: $email})
MATCH p=(start)-[*1..4]-(related)
RETURN p
That [*1..4] is the key.
It says:
- start at a known identifier
- traverse any relationship type
- explore up to 4 hops
This exposes transitive links like:
Email -> Identity -> Device -> Identity -> Cookie
For an identity resolution demo, that is the moment when the graph model becomes obvious.
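The bounded traversal can also be illustrated without a database. This sketch runs a breadth-first search up to 4 hops over a toy adjacency list shaped like the Email -> Identity -> Device -> Identity -> Cookie chain; the node names are illustrative, not pulled from the real dataset:

```python
from collections import deque

# Toy undirected adjacency list mirroring the schema's relationships
graph = {
    "email:tech_user@gmail.com": ["identity-001"],
    "identity-001": ["email:tech_user@gmail.com", "device-demo-001"],
    "device-demo-001": ["identity-001", "identity-002"],
    "identity-002": ["device-demo-001", "cookie-demo-002"],
    "cookie-demo-002": ["identity-002"],
}

def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops edges,
    mimicking Cypher's (start)-[*1..max_hops]-(related) pattern."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # [*1..4] excludes the zero-hop start node itself
    return seen      # node -> hop distance

reachable = within_hops(graph, "email:tech_user@gmail.com", 4)
```

The cookie sits exactly 4 hops from the starting email, which is why the hop bound matters: [*1..3] would miss it, while an unbounded [*] risks exploding on dense graphs.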
Natural Language to Cypher with Ollama
The next layer is accessibility. Not everyone wants to write Cypher by hand.
So the app sends the user’s question to Ollama along with a tightly constrained prompt:
- graph schema
- relationship directions
- allowed Cypher clauses
- examples
- output shape requirements
Prompt excerpt:
Rules:
- Return JSON only.
- JSON shape must be {"cypher":"...", "params":{...}, "explanation":"..."}.
- Use only MATCH, OPTIONAL MATCH, WITH, RETURN, ORDER BY, LIMIT.
- Never use CREATE, MERGE, DELETE, SET, REMOVE, DROP, CALL, LOAD CSV, APOC, or schema operations.
This matters because local models are fast and convenient, but they are not always consistent.
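Assembling that prompt is mostly string plumbing. A minimal sketch, assuming a schema summary and rule list (the wording below is illustrative, not the project's exact prompt text):

```python
# Illustrative schema excerpt and rules; the project's real prompt is longer.
SCHEMA = """\
(:Identity {identity_id, full_name})
(:Email {address})
(:Identity)-[:HAS_EMAIL]->(:Email)"""

RULES = [
    'Return JSON only.',
    'JSON shape must be {"cypher":"...", "params":{...}, "explanation":"..."}.',
    "Use only MATCH, OPTIONAL MATCH, WITH, RETURN, ORDER BY, LIMIT.",
    "Never use CREATE, MERGE, DELETE, SET, REMOVE, DROP, CALL, LOAD CSV.",
]

def build_prompt(question: str) -> str:
    """Combine schema, rules, and the user's question into one constrained prompt."""
    rules = "\n".join(f"- {rule}" for rule in RULES)
    return f"Schema:\n{SCHEMA}\n\nRules:\n{rules}\n\nQuestion: {question}"

prompt = build_prompt("Find all accounts linked to bob@info.com")
```

Keeping the schema and rules in one place means every interface (CLI or web) sends the model the same constraints.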
Hardening the LLM Path
A naive LLM-to-query pipeline is brittle.
To make it practical, this project adds several guardrails:
1. JSON extraction from noisy model output
Some models wrap valid JSON in extra prose. The parser scans for valid JSON objects instead of assuming perfect output.
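One robust way to do this scan (a sketch; the project's parser may differ in detail) is to try json.JSONDecoder.raw_decode at every opening brace until something parses:

```python
import json

def extract_json(text: str):
    """Scan model output for the first parseable JSON object, tolerating
    surrounding prose or markdown fences. Returns None if nothing parses."""
    decoder = json.JSONDecoder()
    for start in range(len(text)):
        if text[start] != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None

noisy = 'Sure! Here is the query:\n```json\n{"cypher": "MATCH (n) RETURN n LIMIT 5", "params": {}}\n```'
parsed = extract_json(noisy)
```

raw_decode is the key: unlike json.loads, it parses a value starting at an offset and ignores whatever trails it, so closing fences and sign-off prose do not break the parse.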
2. Cypher sanitization
The app rejects write or dangerous clauses:
blocked = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ", "DROP ", "CALL ", "LOAD CSV")
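One way that tuple can be applied (a sketch; the project's exact check may differ) is a case-insensitive scan of the generated query before it ever reaches the database:

```python
# Clauses that must never appear in generated, read-only Cypher
BLOCKED = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ", "DROP ", "CALL ", "LOAD CSV")

def sanitize(cypher: str) -> str:
    """Raise if the query contains a write or procedure clause; otherwise pass through."""
    upper = cypher.upper()
    for clause in BLOCKED:
        if clause in upper:
            raise ValueError(f"Blocked clause in generated Cypher: {clause.strip()}")
    return cypher

safe = sanitize("MATCH (e:Email {address: $email}) RETURN e LIMIT 10")
```

A substring check like this is deliberately conservative: it may reject a rare legitimate query, but for a pipeline executing model-generated text against a database, false rejections are the cheap failure mode.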
3. Semantic linting
The app checks for reversed graph edges such as:
(:Email)-[:HAS_EMAIL]->(:Identity), which is incorrect for this schema
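A reversed-edge check can be as simple as a regular expression over the known labels and relationship types; this is a sketch of the idea, not the project's exact lint rule:

```python
import re

# In this schema, identifier nodes (Email, Phone, Device, Cookie) never own
# the outgoing edge; Identity does. Flag patterns that point the wrong way.
REVERSED_EDGE = re.compile(
    r"\(\s*\w*\s*:\s*(Email|Phone|Device|Cookie)\b[^)]*\)\s*-\s*\[[^\]]*:"
    r"(HAS_EMAIL|HAS_PHONE|USED_DEVICE|ASSOCIATED_WITH)\b[^\]]*\]\s*->"
)

def lint(cypher: str) -> list:
    """Return a list of schema problems found in the generated query."""
    problems = []
    if REVERSED_EDGE.search(cypher):
        problems.append("relationship direction is reversed for this schema")
    return problems

bad = "MATCH (e:Email)-[:HAS_EMAIL]->(i:Identity) RETURN i"
good = "MATCH (i:Identity)-[:HAS_EMAIL]->(e:Email) RETURN e"
```

Direction errors are worth catching because they fail silently: a reversed pattern is valid Cypher that simply matches nothing.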
4. Retry with corrective feedback
If the model returns malformed or semantically invalid output, the app sends back a correction request and asks for a fixed JSON-only response.
That turns the system from “prompt once and hope” into a more resilient local pipeline.
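The retry step can be sketched as a small loop around the model call; call_model here is a stand-in for the real Ollama request, and the correction wording is illustrative:

```python
import json

def ask_with_retry(question, call_model, max_attempts=3):
    """Call the model, validate its output, and feed the failure back
    as a correction prompt until it produces usable JSON."""
    prompt = question
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
            if "cypher" in parsed:
                return parsed
            error = "missing 'cypher' key"
        except json.JSONDecodeError as exc:
            error = str(exc)
        prompt = (f"{question}\n\nYour previous reply was invalid ({error}). "
                  "Respond again with JSON only.")
    raise RuntimeError("model never produced valid JSON")

# Stub model: fails once with prose, then returns valid JSON.
replies = iter(['Here you go: MATCH (n)...', '{"cypher": "MATCH (n) RETURN n"}'])
result = ask_with_retry("list nodes", lambda prompt: next(replies))
```

Two or three attempts is usually enough in practice; if a small local model cannot produce valid JSON by then, the right move is to surface the failure rather than keep burning inference time.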
Query Execution and Performance Profiling
When PROFILE mode is enabled, the app prefixes the generated query:
query = f"PROFILE {cypher}" if profile else cypher
After execution, it walks the returned profile tree and sums DB hits.
That gives two advantages:
- a visible performance story for demos
- a useful debugging aid while iterating on generated Cypher
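The tree walk itself is small. A sketch, assuming the profile arrives as a nested dict of operators each carrying a hit count and a list of children (the real driver's field names may differ, and the values below are invented for illustration):

```python
def total_db_hits(plan: dict) -> int:
    """Recursively sum DB hits across a profiled query plan tree."""
    hits = plan.get("dbHits", 0)
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits

# Toy plan tree shaped like a PROFILE result; numbers are made up.
plan = {
    "operator": "ProduceResults", "dbHits": 0,
    "children": [
        {"operator": "Expand(All)", "dbHits": 42,
         "children": [
             {"operator": "NodeIndexSeek", "dbHits": 3, "children": []},
         ]},
    ],
}
total = total_db_hits(plan)
```

Watching the per-operator numbers is also instructive: an index seek at the leaf with a few hits confirms the uniqueness constraints are doing their job, while a large Expand count shows where the [*1..4] traversal spends its budget.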
CLI and Web UI
The project exposes two interfaces:
CLI
Direct resolution:
python3 main.py resolve --email tech_user@gmail.com --profile
Natural-language query:
python3 main.py ask \
"Find all accounts linked to the device used by 'bob@info.com'." \
--profile \
--model llama3.2:latest
Web UI
The browser UI is intentionally lightweight and dependency-free.
It uses:
- Python’s built-in HTTP server
- HTML/CSS/JS embedded in webapp.py
- the same backend functions as the CLI
That keeps the project easy to run and easy to inspect.
Internal Request Flow
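End to end, a natural-language request moves through the same stages regardless of interface. A condensed sketch of that flow, with the model and database calls stubbed out (function names here are illustrative, not the project's exact API):

```python
import json

def handle_question(question, call_model, run_query):
    """Both the CLI and the web UI funnel into the same pipeline:
    prompt -> local model -> parse -> sanitize -> execute."""
    raw = call_model(f"Rules: return JSON only.\nQuestion: {question}")
    parsed = json.loads(raw)  # the real code scans noisy output and retries
    cypher = parsed["cypher"]
    if any(clause in cypher.upper() for clause in ("CREATE ", "MERGE ", "DELETE ")):
        raise ValueError("write clause rejected")
    return run_query(cypher, parsed.get("params", {}))

# Stubs standing in for Ollama and Neo4j.
fake_model = lambda prompt: '{"cypher": "MATCH (n:Identity) RETURN n LIMIT 5", "params": {}}'
fake_db = lambda cypher, params: [{"identity_id": "identity-001"}]
rows = handle_question("show five identities", fake_model, fake_db)
```

Because the pipeline is a single function of its two external dependencies, swapping the stubs for the real Ollama client and Neo4j driver is the only difference between a unit test and the live app.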
What Makes This Interesting
There are lots of small AI demos and lots of small graph demos. What makes this one useful is the combination:
- practical graph schema
- deterministic overlap generation
- local model integration
- query validation
- performance introspection
- inspectable outputs
It is not just “ask an LLM a database question.” It is a constrained local graph workflow.
Potential Extensions
If I were taking this further, I would add:
1. Graph visualization
Render linked identities as an interactive node-edge graph.
2. Confidence scoring
Use weighted evidence:
- shared device strength
- phone uniqueness
- cookie freshness
- overlap count
3. Batch resolution
Accept CSV input and generate clusters for many identifiers at once.
4. Multi-model evaluation
Run the same prompt across several local Ollama models and compare:
- validity
- speed
- correctness
- DB hits
5. Exportable investigation reports
Generate analyst-friendly summaries for fraud or trust-and-safety teams.
Final Thoughts
This project is a good example of where local AI and graph databases complement each other well.
Neo4j provides the right data model for connected identity evidence.
Ollama provides a convenient natural-language interface for querying that graph.
Python keeps the whole system compact and inspectable.
The result is not just a demo. It is a strong foundation for:
- identity stitching
- risk analysis
- entity resolution
- graph-assisted investigative tooling
If you want to study or extend the code, start with:
- seed_data.py for graph ingestion
- main.py for LLM-to-Cypher and query execution
- webapp.py for the local interface
GitHub repo: https://github.com/harishkotra/identity-resolution-engine


