I Built a Research Memory Agent with Cognee—Five API Breaks, One Knowledge Graph, Seven Days

Sourabha Kallapur — Sun, 05 Jul 2026 18:18:14 +0000

I applied strict TDD to build a Cognee knowledge graph agent — here is every upstream API break I hit and how I fixed it

What I Built

ChronoScholar is a temporally-aware research memory agent that ingests arXiv papers into a Cognee knowledge graph and detects when stored scientific beliefs are contradicted by incoming literature. Built for the WeMakeDevs x Cognee hackathon.

The core problem: standard RAG returns answers from all stored papers with equal confidence. An agent that ingested Paper A in January and Paper B in March — where B refutes A's central claim — surfaces both indefinitely. Fixing this is not a retrieval problem. It requires persistent, typed memory with explicit contradiction awareness.

The system ingests papers via the arXiv API, builds a typed knowledge graph using cognee.add() and cognee.cognify(), classifies paper pairs with Gemini 2.5 Flash, and synthesizes cross-paper answers using Cognee's GRAPH_COMPLETION search mode.

Pilot numbers: 10 papers ingested, 494 knowledge graph entities, 1,025 edges, 6 typed edge types (contradicts, supports, extends, invalidates, replicates, authored_by). Contradiction detection F1=1.000 on a 10-pair benchmark — 4 positive, 6 negative. Results are directional; n=10 is insufficient for confidence interval estimation.

GitHub: https://github.com/SourabhaKK/ChronoScholar

Why TDD on an AI Memory System Is Different

Standard TDD works well on deterministic code. You define inputs, specify expected outputs, mock external dependencies, and confirm behavior. The tests stay fast and the feedback loop stays tight.

AI memory systems break this model in two distinct ways. First, the behavior you care about is often emergent from library internals you do not control — Cognee's auth posture, LiteLLM's model routing, LadybugDB's file locking. Second, mocking these dependencies hides their real behavior by design.

I ran strict Red→Green→Refactor across 8 components and 60 tests, with all external calls mocked. The test suite was valuable. It also missed every integration failure.

What TDD caught

B005 strip() bug. The contradiction response parser called response.strip("`json"). Python's str.strip() strips individual characters from a set, not a substring. This silently corrupted JSON parsing whenever the model output started with a backtick. The test caught it by asserting on the parsed dict, not the raw string.

System prompt never passed to Groq. The detect() method constructed a detailed system prompt and passed it to groq_complete(), but the function signature accepted only prompt. The system prompt was silently discarded. The fix required a keyword argument and a test that mocked the Groq client and asserted on the messages list passed to the API call. Without that assertion, detect() returned plausible-looking output with no scientific grounding and no visible error.

string.Template greedy parsing. Prompt templates used $variable substitution. Template substitutes greedily — $var in a sentence could match a longer key than intended, leaving substitution gaps. Switching to ${variable} throughout, with tests asserting each variable appeared exactly once in rendered output, caught two silent failures.

What TDD could not catch

All 60 tests mock cognee.add(), cognee.cognify(), and cognee.search(). The mocks return controlled values. Real Cognee does not.

Every one of the five integration failures described below was invisible to the test suite until the real system ran. The mocks were correct relative to the API surface documented when the tests were written. The actual library behavior differed.

The lesson is specific: AI memory systems need a second test tier — integration tests that run against real dependencies on a minimal one-document corpus, with real API keys, in CI. Unit tests with mocks are necessary but not sufficient.

Five Cognee 1.2.2 Integration Failures

Failure 1: set_llm_config() rejected

Problem: cognee.config.set_llm_config(provider="groq", model="llama-3.1-8b-instant", api_key=key) raised InvalidConfigAttributeError on the provider key at startup.

Root cause: Cognee 1.2.2 removed programmatic LLM configuration. All LLM config is now read from environment variables via LiteLLM.

Fix: Remove the set_llm_config() call entirely. Set LLM_MODEL and LLM_API_KEY in .env.

Lesson: Cognee routes through LiteLLM internally. Model names require LiteLLM provider prefix format — groq/llama-3.1-8b-instant, not llama-3.1-8b-instant. Check the LiteLLM model list, not the provider's native model list.

Failure 2: get_nodes() and get_edges() do not exist

Problem: AttributeError on graph_engine.get_nodes() inside CogneeService.get_stats().

Root cause: The internal graph engine API changed between versions with no changelog entry.

Fix: Replace with get_graph_data(), which returns a (nodes, edges) tuple.

`python nodes, edges = await graph_engine.get_graph_data() `

Lesson: Cognee's internal graph API has no stability guarantee. Wrap all internal API calls in try/except with a fallback that returns zero counts — the stats endpoint should degrade gracefully, not crash.

Failure 3: Multi-tenant auth reads the wrong database

Problem: entity_count=0 after successful cognify(). The knowledge graph appeared empty on every query.

Root cause: ENABLE_BACKEND_ACCESS_CONTROL=true is the default. In multi-tenant mode, cognify() writes to a UUID-scoped .lbug file while get_graph_data() reads the global file. Different databases — cognify writes, queries read nothing.

Fix: Set ENABLE_BACKEND_ACCESS_CONTROL=false in .env. This requires load_dotenv() to execute before any app.* import, because Cognee resolves its auth posture at module import time.

`python

app/main.py — order is mandatory

from dotenv import load_dotenv
load_dotenv()
from app.config import Settings # Cognee imports happen inside app modules
`

Lesson: Module-level side effects create import-order dependencies that unit tests cannot detect. If load_dotenv() runs after Cognee's internal modules initialize, the env var arrives too late.

Failure 4: Windows file lock contention (Error 33)

Problem: Both CHUNKS and GRAPH_COMPLETION searches in the /compare endpoint failed with "Could not set lock on cognee_graph_ladybug" when run concurrently.

Root cause: LadybugDB is a single-file graph store. On Windows, the OS acquires an exclusive lock. Concurrent async reads both attempt to acquire it and one fails.

Fix: Run CHUNKS first, await completion, then run GRAPH_COMPLETION. Sequential, not concurrent.

Lesson: File-backed stores have OS-specific concurrency semantics. LadybugDB works with concurrent readers on Linux. On Windows it does not. Test on the target OS.

Failure 5: Groq 6K TPM shared between two LLM consumers

Problem: GRAPH_COMPLETION synthesis returned 7 characters. No exception raised, no error logged.

Root cause: ContradictionService.detect() and Cognee's internal synthesis both used LLM_API_KEY pointing to Groq. A single GRAPH_COMPLETION synthesis call requests 6,910 tokens — over Groq's 6,000 TPM free-tier limit. Cognee exhausted the budget silently and returned an empty result.

Fix: Assign separate providers.

`ini

.env

APP_LLM_PROVIDER=gemini # application LLM (detect, query grounding)
LLM_MODEL=openai/gpt-4o-mini # Cognee internal LLM
GROQ_API_KEY=... # tier-2 fallback only
`

Gemini 2.5 Flash for the application layer. GPT-4o-mini for Cognee internals. Groq as tier-2 fallback. Three separate TPM pools, no contention.

Lesson: Any library that makes its own LLM calls is a separate consumer of your rate limit budget. Assign it a dedicated API key.

Results

10 papers ingested. 494 knowledge graph entities. 1,025 edges across 6 typed edge types. Cognify runtime: 63.3 seconds per 10-paper batch. Contradiction detection: Precision=1.000, Recall=1.000, F1=1.000 on a 10-pair pilot benchmark (4 positive, 6 negative).

Statistical caveat: n=10 is a pilot evaluation. Results are directional only. Sample size is insufficient for confidence interval estimation.

One additional finding: Gemini 2.5 Flash correctly classified all 4 positive pairs with zero false negatives. Groq llama-3.1-8b-instant missed one, giving F1=0.857 on the same benchmark. Model capability on scientific NLI is not uniform across providers.

What Is Next

Expand the benchmark to 200+ pairs to reach statistical validity. Add a second CI test tier: integration tests against real Cognee on a one-document corpus, with real API keys in a separate workflow.

Run systematic provider comparison — Groq vs Gemini vs GPT-4o-mini — on contradiction detection F1 to quantify the capability gap and understand at what scale it matters.