Harish Kotra (he/him)

Building a Local Identity Resolution Engine with Neo4j, Python, and Ollama

Identity resolution is a classic graph problem hiding inside what many teams still try to solve with relational joins.

In this project, I built a local-first identity resolution engine that:

  • stores fragmented user identifiers in Neo4j
  • generates synthetic overlapping user data
  • resolves hidden identity links through recursive Cypher traversals
  • uses Ollama to translate natural language into Cypher
  • exposes the whole workflow through both a CLI and a lightweight web UI

This post walks through the design, code, and implementation details.

The Problem

Imagine this simplified user journey:

  • one account signs in with tech_user@gmail.com
  • another signs in with tech_user+alt@gmail.com
  • both use the same device
  • a third identity later shares a cookie or phone with the second identity

Those records may look independent in tabular storage, but in a graph they form a connected component.

That is exactly what identity resolution needs: path discovery across shared identifiers.

Why a Graph Database Fits Better

In SQL, link analysis usually becomes:

  • self-joins on login tables
  • joins on device tables
  • more joins on email or cookie tables
  • increasingly difficult reasoning as hop depth increases

In Neo4j, identifiers become nodes and relationships become explicit edges.

So instead of forcing the query planner through repeated joins, you traverse the graph directly.

High-Level Architecture

flowchart TD
    USER["Developer / Analyst"] --> UI["CLI or Web UI"]
    UI --> APP["Python Service Layer"]
    APP --> OLLAMA["Ollama Local Model"]
    APP --> NEO["Neo4j"]
    APP --> SEED["Synthetic Data Generator"]

    OLLAMA --> GEN["Generated Cypher"]
    GEN --> APP
    APP --> PROFILE["PROFILE Metrics"]
    PROFILE --> UI
    NEO --> UI

The Graph Model

The core schema is intentionally small:

(:Identity {identity_id, full_name})
(:Email {address})
(:Phone {number})
(:Device {device_id})
(:Cookie {cookie_id})

Relationships:

(:Identity)-[:HAS_EMAIL]->(:Email)
(:Identity)-[:HAS_PHONE]->(:Phone)
(:Identity)-[:USED_DEVICE]->(:Device)
(:Identity)-[:ASSOCIATED_WITH]->(:Cookie)

This keeps the graph readable while still modeling real-world identity fragmentation.
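In Python, this schema can be populated with parameterized MERGE statements. Here is a minimal sketch of ingesting one identity row (a hypothetical helper, not the project's actual seed_data.py code):

```python
# Cypher template for one identity and its identifiers. MERGE (rather than
# CREATE) keeps a shared email or device as a single node, which is what
# lets later traversals surface the overlap.
INGEST_QUERY = """
MERGE (i:Identity {identity_id: $identity_id})
SET i.full_name = $full_name
MERGE (e:Email {address: $email})
MERGE (d:Device {device_id: $device_id})
MERGE (i)-[:HAS_EMAIL]->(e)
MERGE (i)-[:USED_DEVICE]->(d)
"""

def ingest_identity(session, identity_id, full_name, email, device_id):
    # `session` is any object with a run(query, **params) method,
    # such as a neo4j driver session.
    session.run(INGEST_QUERY, identity_id=identity_id, full_name=full_name,
                email=email, device_id=device_id)

# Usage with the official driver (pip install neo4j):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "password"))
#   with driver.session() as session:
#       ingest_identity(session, "identity-001", "Tech User",
#                       "tech_user@gmail.com", "device-demo-001")
```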

Local Infrastructure

Neo4j runs locally via Docker Compose:

services:
  neo4j:
    image: neo4j:5.26
    container_name: neo4j-identity
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      NEO4J_AUTH: neo4j/password

This lets the entire project stay local:

  • local database
  • local seeding
  • local LLM inference
  • local UI

Seeding Synthetic Overlap Data

The project uses Faker to generate 100 identities, but the important part is not the fake data itself: it is fake data with deliberate overlap.

The seeding script deliberately creates:

  • shared devices
  • shared cookies
  • reused emails
  • reused phones
  • explicit demo chains

Code excerpt:

shared_devices = [f"device-shared-{idx:03d}" for idx in range(1, 11)]
shared_cookies = [f"cookie-shared-{idx:03d}" for idx in range(1, 11)]
shared_emails = [f"shared_user_{idx}@example.com" for idx in range(1, 7)]
shared_phones = [f"+14155550{idx:03d}" for idx in range(1, 7)]

Then, for selected rows, those identifiers get reused.
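One way that reuse can be implemented is to draw a fraction of identities from the shared pools, deterministically via a seed. This is a hypothetical sketch of the idea, not the project's exact seeding logic:

```python
import random

def assign_identifiers(identity_ids, shared_devices, overlap_rate=0.3, seed=42):
    """Give roughly `overlap_rate` of identities a device from the shared
    pool; everyone else gets a unique device. Deterministic via the seed,
    so demos are reproducible."""
    rng = random.Random(seed)
    assignments = {}
    for identity_id in identity_ids:
        if rng.random() < overlap_rate:
            # Reused identifier: this is what creates hidden links.
            assignments[identity_id] = rng.choice(shared_devices)
        else:
            assignments[identity_id] = f"device-{identity_id}"
    return assignments
```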

The script also injects a deterministic demo chain:

chain_overrides = [
    ("identity-001", "tech_user@gmail.com", "+14155550111", "device-demo-001", "cookie-demo-001"),
    ("identity-002", "tech_user+alt@gmail.com", "+14155550112", "device-demo-001", "cookie-demo-002"),
    ("identity-003", "bob@info.com", "+14155550113", "device-demo-002", "cookie-demo-002"),
    ("identity-004", "robert@info.com", "+14155550114", "device-demo-002", "cookie-demo-004"),
]

That gives you predictable live demo scenarios.

Data Integrity with Constraints

The project creates uniqueness constraints before seeding:

CREATE CONSTRAINT email_address_unique IF NOT EXISTS
FOR (e:Email) REQUIRE e.address IS UNIQUE

CREATE CONSTRAINT phone_number_unique IF NOT EXISTS
FOR (p:Phone) REQUIRE p.number IS UNIQUE

This matters because graph ingestion should preserve identity semantics:

  • a given email node should not duplicate accidentally
  • a phone number should resolve to one canonical node
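Applying those constraints from Python could look like the sketch below. The device and cookie constraints are my additions for symmetry; the post only shows email and phone:

```python
# Uniqueness constraints to run before seeding. In Neo4j these also create
# a backing index, so MERGE upserts hit an index lookup instead of a scan.
CONSTRAINTS = [
    "CREATE CONSTRAINT email_address_unique IF NOT EXISTS "
    "FOR (e:Email) REQUIRE e.address IS UNIQUE",
    "CREATE CONSTRAINT phone_number_unique IF NOT EXISTS "
    "FOR (p:Phone) REQUIRE p.number IS UNIQUE",
    "CREATE CONSTRAINT device_id_unique IF NOT EXISTS "
    "FOR (d:Device) REQUIRE d.device_id IS UNIQUE",
    "CREATE CONSTRAINT cookie_id_unique IF NOT EXISTS "
    "FOR (c:Cookie) REQUIRE c.cookie_id IS UNIQUE",
]

def apply_constraints(session):
    # `session` is any object with a run(statement) method,
    # such as a neo4j driver session.
    for stmt in CONSTRAINTS:
        session.run(stmt)
```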

The Core Resolution Query

The simplest demonstration query is also one of the most important:

MATCH (start:Email {address: $email})
MATCH p=(start)-[*1..4]-(related)
RETURN p

That [*1..4] is the key.

It says:

  • start at a known identifier
  • traverse any relationship type
  • explore up to 4 hops

This exposes transitive links like:

Email -> Identity -> Device -> Identity -> Cookie

For an identity resolution demo, that is the moment when the graph model becomes obvious.
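Because Cypher does not accept a variable-length hop bound as a query parameter, the depth has to be interpolated into the query text. A small hypothetical wrapper can validate and clamp it first:

```python
RESOLVE_QUERY_TEMPLATE = """
MATCH (start:Email {address: $email})
MATCH p=(start)-[*1..%d]-(related)
RETURN p
"""

def build_resolve_query(max_hops=4):
    # Clamp the hop bound: Cypher cannot parameterize pattern lengths,
    # and an unbounded traversal can explode on densely connected
    # components, so the depth is validated before interpolation.
    max_hops = max(1, min(int(max_hops), 6))
    return RESOLVE_QUERY_TEMPLATE % max_hops
```

The email itself still travels as a proper `$email` parameter; only the structurally required hop bound is interpolated.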

Natural Language to Cypher with Ollama

The next layer is accessibility. Not everyone wants to write Cypher by hand.

So the app sends the user’s question to Ollama along with a tightly constrained prompt:

  • graph schema
  • relationship directions
  • allowed Cypher clauses
  • examples
  • output shape requirements

Prompt excerpt:

Rules:
- Return JSON only.
- JSON shape must be {"cypher":"...", "params":{...}, "explanation":"..."}.
- Use only MATCH, OPTIONAL MATCH, WITH, RETURN, ORDER BY, LIMIT.
- Never use CREATE, MERGE, DELETE, SET, REMOVE, DROP, CALL, LOAD CSV, APOC, or schema operations.

This matters because local models are fast and convenient, but they are not always consistent.

Hardening the LLM Path

A naive LLM-to-query pipeline is brittle.

To make it practical, this project adds several guardrails:

1. JSON extraction from noisy model output

Some models wrap valid JSON in extra prose. The parser scans for valid JSON objects instead of assuming perfect output.
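A balanced-brace scan is one way to do this. The sketch below walks the text character by character, tracking string and escape state so braces inside quoted values do not confuse the depth count:

```python
import json

def extract_json(text):
    """Return the first parseable JSON object embedded in noisy model
    output, or None if no valid object is found."""
    depth = 0
    start = None
    in_string = False
    escape = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a JSON string: only care about escapes and the
            # closing quote; braces in here are literal characters.
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            if depth > 0:
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        start = None  # keep scanning past this candidate
    return None
```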

2. Cypher sanitization

The app rejects write or dangerous clauses:

blocked = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ", "DROP ", "CALL ", "LOAD CSV")
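A complete check built on that tuple might look like this sketch (the project's actual function lives in main.py):

```python
def assert_read_only(cypher):
    """Raise if the generated Cypher contains a write or schema clause;
    otherwise return the query unchanged."""
    blocked = ("CREATE ", "MERGE ", "DELETE ", "SET ", "REMOVE ",
               "DROP ", "CALL ", "LOAD CSV")
    upper = cypher.upper()
    for clause in blocked:
        if clause in upper:
            raise ValueError(f"blocked clause in generated query: {clause.strip()}")
    return cypher
```

The trailing spaces in the tuple keep identifiers like `created_at` from tripping the check, while "DETACH DELETE" is still caught by "DELETE ".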

3. Semantic linting

The app checks for reversed graph edges such as:

  • (:Email)-[:HAS_EMAIL]->(:Identity) which is incorrect for this schema
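A lightweight lint for this can be a regular expression that flags any edge whose source node carries an identifier label. A sketch (assuming this project's four relationship types):

```python
import re

# In this schema, Identity is always the source of an edge. Any pattern
# where an identifier node (Email, Phone, Device, Cookie) points outward
# through one of these relationships is reversed.
REVERSED_EDGE = re.compile(
    r"\(\s*\w*\s*:\s*(Email|Phone|Device|Cookie)[^)]*\)\s*-\s*\[\s*:?\s*"
    r"(HAS_EMAIL|HAS_PHONE|USED_DEVICE|ASSOCIATED_WITH)[^\]]*\]\s*->"
)

def lint_cypher(cypher):
    """Return a list of reversed-edge problems found in the query."""
    problems = []
    for match in REVERSED_EDGE.finditer(cypher):
        problems.append(f"reversed edge: {match.group(1)} -[:{match.group(2)}]->")
    return problems
```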

4. Retry with corrective feedback

If the model returns malformed or semantically invalid output, the app sends back a correction request and asks for a fixed JSON-only response.

That turns the system from “prompt once and hope” into a more resilient local pipeline.
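The retry loop can be sketched as below. The model call is injected as a plain function so the loop stays testable; the real prompt construction lives in main.py:

```python
import json

def ask_with_retry(model_call, question, max_attempts=3):
    """model_call(prompt) -> raw model text. Retries with corrective
    feedback when the output is not valid JSON of the expected shape."""
    required = {"cypher", "params", "explanation"}
    prompt, last_error = question, None
    for _ in range(max_attempts):
        raw = model_call(prompt)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and required <= parsed.keys():
                return parsed
            last_error = "missing required keys"
        except json.JSONDecodeError as exc:
            last_error = str(exc)
        # Feed the failure back so the model can correct itself.
        prompt = (question + "\n\nYour previous answer was invalid ("
                  + last_error + "). Respond with JSON only, shaped as "
                  '{"cypher":"...", "params":{...}, "explanation":"..."}')
    raise ValueError("model never produced valid output: " + str(last_error))
```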

Query Execution and Performance Profiling

When PROFILE mode is enabled, the app prefixes the generated query:

query = f"PROFILE {cypher}" if profile else cypher

After execution, it walks the returned profile tree and sums DB hits.

That gives two advantages:

  • a visible performance story for demos
  • a useful debugging aid while iterating on generated Cypher
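The walk itself is a short recursion. This sketch assumes the nested-dict plan shape the Python driver exposes for profiles (an `args` mapping per operator and a `children` list); the DB-hits key casing has varied across server versions, so both spellings are checked:

```python
def total_db_hits(plan):
    """Sum DB hits over a profile plan tree of nested dicts."""
    args = plan.get("args", {})
    hits = args.get("DbHits", args.get("dbHits", 0))
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits
```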

CLI and Web UI

The project exposes two interfaces:

CLI

Direct resolution:

python3 main.py resolve --email tech_user@gmail.com --profile

Natural-language query:

python3 main.py ask \
  "Find all accounts linked to the device used by 'bob@info.com'." \
  --profile \
  --model llama3.2:latest

Web UI

The browser UI is intentionally lightweight and dependency-free.

It uses:

  • Python’s built-in HTTP server
  • HTML/CSS/JS embedded in webapp.py
  • the same backend functions as the CLI

That keeps the project easy to run and easy to inspect.
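A stripped-down version of that pattern looks like the sketch below. The routes here are illustrative, not webapp.py's actual ones; the routing logic is kept as a pure function so it can be tested without binding a port:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_route(path):
    """Pure routing logic: path -> (status, content type, body)."""
    if path == "/":
        return 200, "text/html", "<h1>Identity Resolution</h1>"
    if path.startswith("/resolve"):
        # In the real app this would call the same backend as the CLI.
        return 200, "application/json", json.dumps({"status": "ok"})
    return 404, "text/plain", "not found"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, ctype, body = handle_route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.end_headers()
        self.wfile.write(body.encode())

# To serve locally:
#   HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```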

Internal Request Flow


What Makes This Interesting

There are lots of small AI demos and lots of small graph demos. What makes this one useful is the combination:

  • practical graph schema
  • deterministic overlap generation
  • local model integration
  • query validation
  • performance introspection
  • inspectable outputs

It is not just “ask an LLM a database question.” It is a constrained local graph workflow.

Potential Extensions

If I were taking this further, I would add:

1. Graph visualization

Render linked identities as an interactive node-edge graph.

2. Confidence scoring

Use weighted evidence:

  • shared device strength
  • phone uniqueness
  • cookie freshness
  • overlap count
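One possible shape for such a scorer, with invented placeholder weights (real values would be tuned against labeled link data):

```python
# Hypothetical per-evidence weights; not derived from real data.
WEIGHTS = {"device": 0.4, "phone": 0.3, "cookie": 0.2, "email": 0.1}

def link_confidence(shared_counts):
    """shared_counts maps evidence type -> number of shared identifiers.
    Each extra identifier of a type adds diminishing evidence, and the
    total is capped at 1.0."""
    score = 0.0
    for kind, count in shared_counts.items():
        if count > 0:
            score += WEIGHTS.get(kind, 0.0) * (1 - 0.5 ** count)
    return min(score, 1.0)
```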

3. Batch resolution

Accept CSV input and generate clusters for many identifiers at once.

4. Multi-model evaluation

Run the same prompt across several local Ollama models and compare:

  • validity
  • speed
  • correctness
  • DB hits

5. Exportable investigation reports

Generate analyst-friendly summaries for fraud or trust-and-safety teams.

Final Thoughts

This project is a good example of where local AI and graph databases complement each other well.

Neo4j provides the right data model for connected identity evidence.
Ollama provides a convenient natural-language interface for querying that graph.
Python keeps the whole system compact and inspectable.

The result is not just a demo. It is a strong foundation for:

  • identity stitching
  • risk analysis
  • entity resolution
  • graph-assisted investigative tooling

If you want to study or extend the code, start with:

  • seed_data.py for graph ingestion
  • main.py for LLM-to-Cypher and query execution
  • webapp.py for the local interface


GitHub repo: https://github.com/harishkotra/identity-resolution-engine
