Choosing the Right LLM for Cognee: Local Ollama Setup

Choosing the best LLM for Cognee means balancing graph-building quality, hallucination rates, and hardware constraints.
Cognee works best with larger, low-hallucination models (32B+) via Ollama, but mid-size options can handle lighter setups.

Key Cognee Requirements

Cognee relies on the LLM for entity extraction, relation inference, and metadata generation. Models under 32B often produce noisy graphs, and high hallucination rates (e.g., 90%+) pollute nodes and edges, degrading retrieval. The official docs recommend deepseek-r1:32b or llama3.3-70b-instruct-q3_K_M paired with Mistral embeddings.
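
To get a feel for how a candidate model handles this kind of task, you can prompt it directly through Ollama before wiring it into Cognee. This is only an informal smoke test with a prompt of my own, not the prompt Cognee uses internally:

# Ask the model to do roughly what Cognee's extraction step needs:
# entities and typed relations as structured output
ollama run deepseek-r1:32b "Extract entities and relations as JSON triples (subject, relation, object) from: 'Cognee builds a knowledge graph from documents using an LLM and an embedding model.'"

If the model answers with clean, parseable triples instead of free-form prose, that's a good early sign for graph construction.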

Model Comparison Table

| Model | Params | Hallucination (SimpleQA/est.) | VRAM (quantized) | Cognee Strengths | Weaknesses |
|---|---|---|---|---|---|
| gpt-oss:20b | 20B | 91.4% | ~16GB | Fast inference, tool-calling | Severe graph noise |
| Qwen3:14b | 14B | ~40-45% | ~12-14GB | Efficient on modest hardware | Limited depth for graphs |
| Devstral Small 2 | 24B | ~8-10% | ~18-20GB | Coding focus, clean entities | Higher VRAM than Qwen3 |
| Llama3.3-70b | 70B | ~30-40% | ~40GB+ | Optimal graph quality | Heavy resource needs |
| Deepseek-r1:32b | 32B | Low (recommended) | ~24-32GB | Best for reasoning/graphs | Slower on consumer GPUs |

Data synthesized from Cognee docs, model cards, and public benchmarks. The hallucination figures may look out of whack, but they are probably not far off...
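
To see what a pulled model actually looks like locally (parameter count, quantization, context length), `ollama show` prints the details. The model name below is just one row from the table above; swap in whichever one you pulled:

# Inspect a local model's architecture, parameters, quantization and context length
ollama show qwen3:14b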

Recommendations by Hardware

  • High-end (32GB+ VRAM): Deepseek-r1:32b or Llama3.3-70b. These yield the cleanest graphs per Cognee guidance.
  • Mid-range (16-24GB VRAM): Devstral Small 2. Low hallucination and coding prowess suit structured memory tasks.
  • Budget (12-16GB VRAM): Qwen3:14b over gpt-oss:20b - avoid 91% hallucination pitfalls.
  • I'm inclined to avoid gpt-oss:20b for Cognee; there are reports that its errors amplify in unfiltered graph construction. That said, its inference speed on my GPU is 2+ times faster... (a quick VRAM check is sketched after this list).
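
If you're unsure which tier your machine falls into, here is a rough sketch for checking available GPU memory (assumes an NVIDIA card with nvidia-smi on the PATH; Apple Silicon and AMD setups need different tools):

# Report total and free VRAM; pick a model whose quantized size fits comfortably
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# See which models Ollama currently has loaded and how much memory they take
ollama ps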

Quick Ollama + Cognee Setup

# 1. Pull model (e.g., Devstral)
ollama pull devstral-small-2:24b  # or qwen3:14b, etc.

# 2. Install Cognee
pip install "cognee[ollama]"

# 3. Env vars
export LLM_PROVIDER="ollama"
export LLM_MODEL="devstral-small-2:24b"
export EMBEDDING_PROVIDER="ollama"
export EMBEDDING_MODEL="nomic-embed-text"  # 768 dims
export EMBEDDING_DIMENSIONS=768

# 4. Test graph
cognee add --file "your_data.txt" --name "test_graph"
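
Before pointing Cognee at the model, it's worth confirming that Ollama is actually serving it. A minimal check against the local Ollama API (default port 11434; the model name should match whatever you pulled above):

# List models known to the local Ollama daemon
curl -s http://localhost:11434/api/tags

# One-shot generation to confirm the model loads and responds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "devstral-small-2:24b", "prompt": "Say hello", "stream": false}'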

Match embedding dimensions (e.g., 768, 1024) across the Cognee config and the vector store. Qwen3 embeddings (unproven with Cognee) could work at 1024-4096 dims if your Ollama build supports them.
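
To verify a model's actual output dimensionality before setting EMBEDDING_DIMENSIONS, you can request a single embedding from Ollama and count it (assumes jq is installed; swap in whichever embedding model you use):

# Request one embedding and count its dimensions
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "dimension check"}' \
  | jq '.embedding | length'  # should print 768 for nomic-embed-text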

Prioritize low-hallucination models for production Cognee pipelines—your graphs will thank you.
Test on your hardware and monitor graph coherence.

Embedding models

I didn't think much about this one, but here is a table I put together for future reference.

| Ollama Model | Size (GB) | Embedding Dimensions | Context Length |
|---|---|---|---|
| nomic-embed-text:latest | 0.274 | 768 | 2K |
| jina-embeddings-v2-base-en:latest | 0.274 | 768 | 8K |
| nomic-embed-text-v2-moe | 0.958 | 768 | 512 |
| qwen3-embedding:0.6b | 0.639 | 1024 | 32K |
| qwen3-embedding:4b | 2.5 | 2560 | 32K |
| qwen3-embedding:8b | 4.7 | 4096 | 32K |
| avr/sfr-embedding-mistral:latest | 4.4 | 4096 | 32K |

Useful links
