Rost

Posted on Mar 24 • Originally published at glukhov.org

Neo4j graph database for GraphRAG, install, Cypher, vectors, ops

#rag #docker #kubernetes #devops

Neo4j is what you reach for when the relationships are the data. If your domain looks like a whiteboard of circles and arrows, forcing it into tables is painful.

Neo4j models that picture as a property graph and queries it with Cypher.

This guide covers what Neo4j is used for, ACID behaviour, Neo4j vs Amazon Neptune vs TigerGraph (and peers), GraphRAG with vector indexes, local and production install paths, ports and neo4j.conf, and copy-paste Cypher and Python patterns.

For broader context on data infrastructure choices, see the Data Infrastructure for AI Systems pillar.

What is Neo4j used for in production graph workloads

Neo4j is for connected data where you need to ask connected questions, repeatedly, under production constraints. For most teams, that is the whole answer to what Neo4j is used for.

Property graph data model with nodes, relationships, and properties

Neo4j uses the property graph model: nodes represent entities, relationships connect nodes, and both can have properties. Labels and relationship types give structure without locking you into a brittle schema.

You can start with a thin model, ship value, and evolve the graph as new questions appear.
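To make the model concrete, here is a minimal in-memory sketch in Python. The Node and Relationship classes, the sample data, and the neighbours helper are illustrative assumptions; this is not how Neo4j stores or queries data, only a picture of the shape.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity: zero or more labels plus arbitrary key/value properties."""
    id: int
    labels: frozenset
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed edge between two nodes, with its own properties."""
    type: str
    start: int  # id of the source node
    end: int    # id of the target node
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: a person who works at a company.
alice = Node(1, frozenset({"Person"}), {"name": "Alice"})
acme = Node(2, frozenset({"Company"}), {"name": "Acme"})
works_at = Relationship("WORKS_AT", alice.id, acme.id, {"since": 2021})

nodes = {n.id: n for n in (alice, acme)}
relationships = [works_at]


def neighbours(node_id, rel_type):
    """Return nodes reached from node_id over outgoing rel_type edges."""
    return [nodes[r.end] for r in relationships
            if r.start == node_id and r.type == rel_type]


# Evolving the model later means adding a label or a property in place.
alice.labels = alice.labels | {"Employee"}
```

Note that the relationship carries its own property (since), which is the part that tends to get awkward in a join table.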

Cypher graph query language for pattern matching without join soup

Cypher is declarative and built around pattern matching. You describe subgraph shapes and let the planner execute them.

If SQL is about sets, Cypher is about subgraphs. That matters for multi-hop traversal, path queries, recommendations, provenance, and “who touched what via which system” questions.
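As a rough sketch of the "who touched what via which system" question, the pattern below is run through the official Python driver's execute_query method. The User, System, and Resource labels, the USED and ACCESSED relationship types, and the who_touched helper are hypothetical names chosen for illustration.

```python
# Hypothetical schema: (:User)-[:USED]->(:System)-[:ACCESSED]->(:Resource).
WHO_TOUCHED_WHAT = """
MATCH (u:User)-[:USED]->(s:System)-[:ACCESSED]->(r:Resource {id: $resource_id})
RETURN u.name AS user, s.name AS system
ORDER BY user, system
"""


def who_touched(driver, resource_id, database="neo4j"):
    """Return (user, system) pairs for every path that reaches the resource.

    `driver` is expected to be a neo4j.Driver (official Python driver 5.x),
    whose execute_query() returns records, a summary, and the result keys.
    """
    records, _summary, _keys = driver.execute_query(
        WHO_TOUCHED_WHAT,
        parameters_={"resource_id": resource_id},
        database_=database,
    )
    return [(rec["user"], rec["system"]) for rec in records]
```

The two-hop path is written once as a shape; the SQL equivalent would need two joins through link tables, and each extra hop adds another.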

Is Neo4j ACID compliant and why you should care

Is Neo4j ACID compliant? Yes. Creating or updating a relationship touches more than one piece of structure (the relationship and the nodes at both ends), and the database keeps that structure consistent under failure and concurrency.

Design graph apps around strong transactional guarantees unless you are forced otherwise. That makes debugging and reasoning about behaviour much easier than assuming vague eventual consistency.
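In application code, leaning on those guarantees usually means grouping related writes into one unit of work. The sketch below assumes the official Python driver 5.x and a hypothetical Account/Device model; the function names are illustrative.

```python
def link_account_to_device(tx, account_id, device_id):
    """Unit of work: all three statements commit together or not at all.

    If any statement raises, the driver rolls the whole transaction back,
    so a half-linked account never becomes visible to readers.
    """
    tx.run("MERGE (a:Account {id: $id})", id=account_id)
    tx.run("MERGE (d:Device {id: $id})", id=device_id)
    tx.run(
        "MATCH (a:Account {id: $a}), (d:Device {id: $d}) "
        "MERGE (a)-[:USED]->(d)",
        a=account_id,
        d=device_id,
    )


def record_login(driver, account_id, device_id, database="neo4j"):
    """Run the unit of work in a managed write transaction."""
    with driver.session(database=database) as session:
        session.execute_write(link_account_to_device, account_id, device_id)
```

Managed transactions may be retried by the driver on transient errors, so the unit of work should be idempotent; MERGE is used here for that reason.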

Neo4j vs Amazon Neptune vs TigerGraph: a senior engineer comparison

A “Neo4j vs X” question is usually “Which ecosystem will we live in for years?”

Here is a short, opinionated view, based on engineering time rather than benchmark slides.

Product	Core model and query style	Where it wins	Where it bites
Neo4j	Property graph and Cypher	Strong ergonomics for connected data, mature tooling, graph plus vector retrieval	Graph modelling is a skill you must invest in
Amazon Neptune	Managed graph on AWS (Gremlin, openCypher, SPARQL for RDF)	AWS-centric contracts and operations	Query language mix can feel platform-driven
TigerGraph	GSQL and OpenCypher-related patterns	Analytics-style workloads and compiled query approaches	Different mental model; not drop-in Cypher everywhere
JanusGraph	Distributed graph with external storage backends	Open source with pluggable backends	You operate the backend stack
ArangoDB	Multi-model (documents, KV, graph)	One database for mixed shapes	Graph depth varies versus graph-first engines
Memgraph	Property graph, Cypher compatible	Streaming and fresh-data workflows	Engine behaviour differs; compatibility is not identity

What to decide before you pick a graph database

Pick query language and operations model first.

If your team wants Cypher and a graph-first workflow, Neo4j is a strong default. If you already have Gremlin expertise, Neptune or JanusGraph can fit. If you want one multi-model store, ArangoDB can reduce moving parts.

Be honest about operations. “We will run a distributed storage backend” is easy to say until you are paged about compactions or JVM pressure at 03:00.

Neo4j for RAG and GraphRAG: vector search plus graph context

Many RAG stacks start as vector search plus prompt. That works until you need provenance, entity resolution, multi-hop context, or disambiguation—then you risk rebuilding a knowledge graph in application code.

How does GraphRAG improve retrieval augmented generation? It uses the graph to pull structured context—entities, relationships, neighbourhoods—that similarity alone often misses, which helps grounding and trustworthiness.

Neo4j vector index for embedding similarity search

Can Neo4j do vector search for RAG? Yes. Neo4j supports vector indexes for similarity over embeddings (commonly HNSW-style approximate nearest neighbour search).

Vectors find “things that look similar”. They do not by themselves encode “how they relate” in your domain. Neo4j lets you combine similarity with traversals.
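For intuition about what "looks similar" means, here is a brute-force cosine similarity sketch in plain Python. It is exact and linear in the number of documents, unlike the approximate index a database would use, and the toy two-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def top_k(query, docs, k):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
                    reverse=True)
    return ranked[:k]


# Toy embeddings keyed by a hypothetical document id.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], docs, k=2)
```

The ranking says that "a" and "c" point in a similar direction to the query. It says nothing about whether they cite each other, share an author, or contradict each other; that is the part the graph supplies.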

Using the SEARCH subclause for vector-constrained pattern matching

Neo4j’s SEARCH subclause lets you constrain a Cypher MATCH pattern using approximate nearest neighbour hits from a vector index. That is the ergonomic bridge for hybrid retrieval.

Practical pattern: vector retrieval for candidates, then graph expansion for context, filters, and explanation.

GraphRAG in Python with neo4j-graphrag

Neo4j’s neo4j-graphrag package for Python wires a driver, retriever, and LLM interface into a GraphRAG flow. You can still use external vector stores if you want to split responsibilities.

How to install Neo4j locally and in production

How do you install Neo4j locally? Match the option to your risk profile.

Install Neo4j with Docker for local development

Docker is the fastest path to a repeatable server.

# Minimal run. No volume is mounted, so data is lost when the container is removed.
docker run \
  --restart always \
  --publish=7474:7474 --publish=7687:7687 \
  neo4j:5

For real work, set an initial password and mount a data volume.

docker run \
  --restart always \
  --publish=7474:7474 --publish=7687:7687 \
  --env NEO4J_AUTH=neo4j/your_password \
  --volume=$HOME/neo4j/data:/data \
  neo4j:5

Docker Compose for a team-friendly setup

services:
  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/your_password
    volumes:
      - $HOME/neo4j/logs:/logs
      - $HOME/neo4j/config:/conf
      - $HOME/neo4j/data:/data
      - $HOME/neo4j/plugins:/plugins
    restart: always
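In CI, tests often start before the container is ready to accept connections. A minimal readiness check, using only the standard library, might look like the sketch below; wait_for_port is a hypothetical helper, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is itself bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A typical call would be wait_for_port("localhost", 7687) before running integration tests. An open TCP port only shows that something is listening; for a stronger check, the Python driver's verify_connectivity() method performs a real Bolt handshake.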

Neo4j Desktop

Neo4j Desktop is strong for prototyping and teaching—projects, GUI, local instances. For CI and integration tests, Docker usually wins.

Linux, Windows, or macOS servers

For long-running hosts, follow official OS install paths. You will eventually care about service management, logs, memory, backups, and upgrades.

Neo4j AuraDB (managed)

If you prefer shipping product to running databases, AuraDB is the managed Neo4j cloud option.

Kubernetes with Helm

If the platform is Kubernetes, use the Helm-based deployment and expose Bolt and HTTP through services. Only deploy databases on K8s if your organisation can run state reliably there.

Neo4j configuration essentials: ports, connectors, and neo4j.conf

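Settings live in neo4j.conf (key=value, # comments). With strict validation (server.config.strict_validation.enabled), the server refuses to start on unknown or malformed settings, which catches typos before you serve traffic.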

Default Neo4j ports and connectors

What are the default Neo4j ports? Bolt uses 7687, HTTP uses 7474, and HTTPS uses 7473. In production, expose only what you need: often that means Bolt on a private network and the HTTP UI restricted to localhost or a bastion.

Example hardening (adapt IPs and TLS to your environment):

server.bolt.listen_address=10.0.1.10:7687
server.http.listen_address=127.0.0.1:7474
server.https.enabled=true
server.https.listen_address=10.0.1.10:7473
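A config review can be partly automated. The sketch below is a hypothetical lint, not a Neo4j tool: it reads key=value lines and flags listen addresses bound to every interface. The real configuration parser may handle cases this ignores.

```python
def parse_conf(text):
    """Parse key=value lines, ignoring blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def wide_open_listeners(settings):
    """Return the listen-address settings bound to all interfaces."""
    flagged = []
    for key, value in sorted(settings.items()):
        if key.endswith("listen_address"):
            # Drop a trailing :port if present; keep the host part.
            host = value.rsplit(":", 1)[0] if ":" in value else value
            if host in ("0.0.0.0", "[::]"):
                flagged.append(key)
    return flagged


# Made-up sample: Bolt is private, but HTTP is exposed everywhere.
SAMPLE = """
# Bolt on the private interface only
server.bolt.listen_address=10.0.1.10:7687
server.http.listen_address=0.0.0.0:7474
"""

findings = wide_open_listeners(parse_conf(SAMPLE))
```

Running a check like this in CI against the rendered config makes an accidentally exposed connector a failed build rather than a production finding.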

Transaction settings that limit unbounded damage

Useful levers in reviews include db.transaction.timeout for runaway queries and db.transaction.concurrent.maximum to avoid thundering herds.

db.transaction.timeout=10s
db.transaction.concurrent.maximum=1000

Practical Cypher and vector index examples for RAG

Create a vector index and store embeddings

CREATE VECTOR INDEX doc_embeddings
FOR (d:Document) ON (d.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: "cosine"
}};
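With the index in place, embeddings are stored as an ordinary list property on each node. The sketch below assumes the official Python driver 5.x; the store_documents helper and the text property are hypothetical, and the dimension check exists so that a mismatch with the index fails loudly in the application.

```python
EXPECTED_DIMENSIONS = 1536  # must match `vector.dimensions` on the index

UPSERT_DOCUMENTS = """
UNWIND $rows AS row
MERGE (d:Document {id: row.id})
SET d.text = row.text, d.embedding = row.embedding
"""


def store_documents(driver, docs, database="neo4j"):
    """Upsert documents with their embeddings in one batched query.

    `docs` is an iterable of dicts with the keys id, text, and embedding.
    Returns the number of rows sent to the server.
    """
    rows = []
    for doc in docs:
        size = len(doc["embedding"])
        if size != EXPECTED_DIMENSIONS:
            raise ValueError(
                f"document {doc['id']}: expected {EXPECTED_DIMENSIONS} "
                f"dimensions, got {size}"
            )
        rows.append({"id": doc["id"], "text": doc["text"],
                     "embedding": [float(x) for x in doc["embedding"]]})
    driver.execute_query(UPSERT_DOCUMENTS, parameters_={"rows": rows},
                         database_=database)
    return len(rows)
```

Batching through UNWIND keeps the write to one round trip per batch; for large corpora you would likely split docs into chunks of a few hundred rows.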

Vector retrieval then graph expansion

1. Vector search for candidate nodes.
2. Traverse for neighbours, provenance, and constraints.
3. Format context for the LLM with clear boundaries.

Example using SEARCH inside MATCH (syntax may vary slightly by Neo4j version—check the manual for your server version):

MATCH (d:Document)
  SEARCH d IN (
    VECTOR INDEX doc_embeddings
    FOR $queryEmbedding
    LIMIT 10
  ) SCORE AS score
MATCH (d)-[:MENTIONS]->(e:Entity)
RETURN d.id AS doc_id, score, collect(distinct e.name) AS entities
ORDER BY score DESC
LIMIT 5;
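On servers where SEARCH is not available, the same retrieve-then-expand flow can be sketched with the db.index.vector.queryNodes procedure from Neo4j 5, called through the official Python driver. The retrieve and format_context helpers and the delimiter format are assumptions for illustration, not part of any library.

```python
VECTOR_THEN_EXPAND = """
CALL db.index.vector.queryNodes($index_name, $k, $query_embedding)
YIELD node AS d, score
MATCH (d)-[:MENTIONS]->(e:Entity)
RETURN d.id AS doc_id, score, collect(DISTINCT e.name) AS entities
ORDER BY score DESC
"""


def retrieve(driver, query_embedding, k=10, index_name="doc_embeddings",
             database="neo4j"):
    """Vector search for candidates, then one hop of graph expansion."""
    records, _summary, _keys = driver.execute_query(
        VECTOR_THEN_EXPAND,
        parameters_={"index_name": index_name, "k": k,
                     "query_embedding": query_embedding},
        database_=database,
    )
    return [{"doc_id": r["doc_id"], "score": r["score"],
             "entities": list(r["entities"])} for r in records]


def format_context(rows):
    """Render rows as delimited blocks so the LLM can tell sources apart."""
    blocks = []
    for i, row in enumerate(rows, start=1):
        entities = ", ".join(sorted(row["entities"])) or "none"
        blocks.append(
            f"[source {i}] doc_id={row['doc_id']} score={row['score']:.3f}\n"
            f"entities: {entities}"
        )
    return "\n---\n".join(blocks)
```

As in the query above, the plain MATCH drops documents that mention no entities; switching to OPTIONAL MATCH would keep them with an empty entity list.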

Minimal GraphRAG in Python

from neo4j import GraphDatabase
from neo4j_graphrag.retrievers import VectorRetriever
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.generation import GraphRAG

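# Connect over Bolt; replace the credentials with your own.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# text-embedding-3-small returns 1536-dimensional vectors, matching the
# doc_embeddings index above. The OpenAI classes read OPENAI_API_KEY from the environment.
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
retriever = VectorRetriever(driver, "doc_embeddings", embedder)

llm = OpenAILLM(model_name="gpt-4o", model_params={"temperature": 0})

rag = GraphRAG(retriever=retriever, llm=llm)

response = rag.search(
    query_text="How do I do similarity search in Neo4j?",
    retriever_config={"top_k": 5},
)
print(response.answer)

driver.close()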

Real-world Neo4j use cases: fraud, recommendations, and knowledge graphs

Fraud detection and risk graphs

Fraud is rarely one row. It is patterns across accounts, devices, IPs, merchants, identities, and time. Graphs express neighbourhoods and multi-hop paths without ten-way join mazes.

Recommendations with behaviour and explicit relationships

Production recommendations combine scored candidates with inventory, constraints, hierarchies, and explainability. Graphs help you return paths people can reason about.

Knowledge graphs for RAG and agents

RAG needs grounding; agents need memory, provenance, and constraints. A knowledge graph stores entities, relationships, sources, and embeddings in one model, which makes it a natural fit for GraphRAG.

When should you choose Neo4j over Amazon Neptune or TigerGraph?

Choose Neo4j for a Cypher-first graph with vector search and traversal in one product. Choose Neptune when AWS and Gremlin or RDF line up with your organisation. Choose TigerGraph when GSQL and analytics-style workloads are the primary bet.
