Large Language Models (LLMs) are powerful, yet they are forgetful by design. Once a conversation grows beyond a certain number of tokens, the model begins to lose information unless that context is explicitly re-injected with every prompt. Sophisticated AI assistant platforms like ChatGPT or Claude only appear to "remember" previous messages. In reality, an orchestration layer (not the model itself) continually feeds a curated history of the conversation back into the LLM with every request.
In practice, a system like ChatGPT may be fed the equivalent of 100–200 pages of context per request, consisting of carefully selected previous messages, metadata, and auxiliary documents. The illusion of memory is created by the system constantly refreshing that context.
LLMs appear to remember, but in truth they rely entirely on carefully retrieved context fed back to them through the context window. This gap is what MemCortex aims to fill: a developer-friendly, long-term memory layer.
Why Context Matters
An LLM cannot truly store past conversations. Its only "memory" is the context window, a fixed-length input buffer (e.g., 128k tokens in GPT-4o, 200k+ in Claude 3.5 Sonnet, and up to 2 million tokens in Gemini 1.5 Pro). When the conversation exceeds that limit, the orchestrator must perform three critical steps for the next query (a rough sketch follows the list):
- Decide what information is most important.
- Compress or summarise the history.
- Re-inject relevant history into the prompt.
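To make those steps concrete, here is a deliberately naive sketch of the budget-trimming an orchestrator performs before each call. It is illustrative only: the word-count token estimate stands in for a real tokenizer, and none of this is MemCortex code.

```python
# A simple illustration of the orchestration step: fit a long chat history
# into a fixed token budget before the next model call.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 1.3 tokens per English word; a real system
    # would use the model's own tokenizer.
    return int(len(text.split()) * 1.3)

def build_context(history: list[str], new_query: str, budget: int = 4000) -> list[str]:
    """Keep the newest messages that fit the budget; older ones would be
    summarised or retrieved semantically instead of included verbatim."""
    context, used = [], estimate_tokens(new_query)
    for message in reversed(history):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break                          # older history gets dropped or summarised
        context.insert(0, message)
        used += cost
    return context + [new_query]
```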
For developers building custom agents, this orchestration layer does not come out of the box, even when you integrate the APIs behind these large-scale assistants. You have to build it yourself, and that necessity is where the idea for MemCortex originated.
What MemCortex Does Differently
The core difference is that MemCortex is a semantic memory layer, not just a simple list of all previous conversations. Instead of pushing raw text history into each request, MemCortex stores vector embeddings of past messages and retrieves only the relevant ones using vector search. This architecture follows the industry pattern known as Retrieval Augmented Generation (RAG).
MemCortex uses Ollama to run the open-source nomic-embed-text embedding model locally for fast, privacy-preserving vector generation, and Weaviate for vector storage and indexing. All of these components are packaged into a single Docker container, making MemCortex a portable, customisable memory layer that runs locally, on servers, or in the cloud. With a single exposed /chat endpoint, MemCortex acts as context-rich middleware for your applications.
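For example, an application might talk to the container like this. The port and JSON fields below are assumptions for illustration, so check the repository for the actual /chat contract.

```python
# Hypothetical call to the MemCortex /chat endpoint. The host, port, and
# request/response fields are assumptions, not the project's documented schema.
import requests

resp = requests.post(
    "http://localhost:8000/chat",   # assumed address of the MemCortex container
    json={"user_id": "alice", "message": "What did we decide about the Q3 launch?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())                  # reply produced with retrieved memories as context
```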
How it Works (High-Level)
Ingestion (a code sketch follows these steps):
- Take every new message or event.
- Generate an embedding vector using the nomic-embed-text model via Ollama.
- Store: the original text, its vector, and associated metadata (like timestamps).
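A minimal sketch of this ingestion flow, assuming Ollama's REST API on its default port and the v4 weaviate-client with a pre-created "Memory" collection. The collection name and schema are illustrative, not necessarily what MemCortex uses internally.

```python
# Sketch of the ingestion path: embed a message with nomic-embed-text via
# Ollama's REST API, then store text + vector + metadata in Weaviate.
from datetime import datetime, timezone

import requests
import weaviate

def embed(text: str) -> list[float]:
    # Call Ollama's embeddings endpoint (default port 11434).
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["embedding"]

client = weaviate.connect_to_local()           # weaviate-client v4
try:
    memories = client.collections.get("Memory")  # assumes the collection already exists
    message = "User prefers weekly summaries over daily digests."
    memories.data.insert(
        properties={
            "text": message,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        vector=embed(message),                 # store the embedding alongside the text
    )
finally:
    client.close()
```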
Retrieval (a code sketch follows these steps):
- A new user query arrives.
- Embed the query.
- Perform a vector search in Weaviate.
- Fetch the top-k similar items as "memories".
- Inject only these relevant memories back into the LLM context.
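And the matching retrieval sketch. Again, the collection name and prompt format are assumptions made for illustration.

```python
# Sketch of the retrieval path: embed the query, run a near-vector search in
# Weaviate, and inject only the top-k hits into the prompt.
import requests
import weaviate
from weaviate.classes.query import MetadataQuery

def embed(text: str) -> list[float]:
    # Same nomic-embed-text call as in the ingestion sketch.
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["embedding"]

client = weaviate.connect_to_local()
try:
    memories = client.collections.get("Memory")
    query = "How does the user want their summaries delivered?"
    result = memories.query.near_vector(
        near_vector=embed(query),                    # embed with the same model as ingestion
        limit=5,                                     # top-k memories
        return_metadata=MetadataQuery(distance=True),
    )
    retrieved = [obj.properties["text"] for obj in result.objects]
    prompt = (
        "Relevant memories:\n"
        + "\n".join(f"- {m}" for m in retrieved)
        + f"\n\nUser: {query}"
    )
    # `prompt` is what gets sent to the LLM instead of the full chat history.
finally:
    client.close()
```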
This process is nearly identical to how enterprise AI systems maintain long-term coherence, except that they do it at massive scale with additional scoring and ranking algorithms. MemCortex is simply the lightweight, developer-friendly version, aimed at demystifying how long-term context is handled.
Why I Built It: Solving the Memory Problem for Agents
When building a sophisticated AI agent, you need three things:
- Long-Term Recall: The agent must remember important facts across sessions.
- Relevance: It must retrieve only context relevant to the current task.
- Efficiency: You must avoid feeding the entire conversation into every prompt.
MemCortex addresses these points through specific features:
- Relevance Scoring: A configurable vector distance score and relevance threshold.
- Max Memory Distance: A tunable environment variable ensures that only high-similarity memories are returned (one possible wiring is sketched after this list).
- Persistence: Using Weaviate means memories live beyond process restarts, which is essential for real-world agents.
- Pluggable Backends: Developers can easily swap embedding models, swap vector stores, or add custom ranking logic.
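As an example of the relevance and distance controls, here is one way such a threshold could be wired up with the v4 weaviate-client. The MAX_MEMORY_DISTANCE variable name and its default value are assumptions, not necessarily MemCortex's own configuration.

```python
# Illustration of the "max memory distance" idea: read a cutoff from an
# environment variable and let Weaviate drop anything less similar than that.
import os

import weaviate
from weaviate.classes.query import MetadataQuery

MAX_DISTANCE = float(os.getenv("MAX_MEMORY_DISTANCE", "0.6"))  # assumed variable name

def fetch_memories(query_vector: list[float], k: int = 5) -> list[str]:
    client = weaviate.connect_to_local()
    try:
        memories = client.collections.get("Memory")
        result = memories.query.near_vector(
            near_vector=query_vector,
            limit=k,
            distance=MAX_DISTANCE,              # hard cutoff on vector distance
            return_metadata=MetadataQuery(distance=True),
        )
        return [obj.properties["text"] for obj in result.objects]
    finally:
        client.close()
```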
Where MemCortex Fits Today
MemCortex is part proof of concept (POC), part production-ready scaffold. It is a powerful foundation for:
- AI agents
- Customer-support bots
- Workflow assistants
- Knowledge-augmented chat systems
- Memory-RAG prototypes
It is designed to be simple, flexible, and intentionally un-opinionated about the surrounding application logic.
Limitations
While a powerful scaffold, MemCortex has constraints as a standalone component:
- Scalability and speed depend entirely on your chosen storage/indexing solution.
- Accuracy and relevance depend on the quality of the embeddings and retrieval logic.
- Persistence, backups, and security are the responsibility of the developer integrating the container.
- Cost scales with storage, embeddings, and retrieval frequency.
- It does not inherently reason, summarise, or prioritise beyond the retrieval logic you implement.
Future Enhancements
Some clear next steps for an evolving system like this include:
- Temporal scoring (recency decay; one possible approach is sketched after this list)
- Memory summarisation
- Topic clustering (for more efficient retrieval)
- Multi-vector per memory
- Event-driven memory ("only save meaningful messages")
- Emotional/contextual tagging
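For temporal scoring, one possible shape is to blend vector similarity with an exponential recency decay so that fresh memories outrank stale ones of equal similarity. The weighting and half-life below are illustrative assumptions, not a committed design.

```python
# Combine similarity (0..1) with an exponential recency weight (0..1).
import time

HALF_LIFE_SECONDS = 7 * 24 * 3600     # a memory loses half its recency weight per week

def recency_weight(created_at: float, now: float | None = None) -> float:
    age = (now or time.time()) - created_at
    return 0.5 ** (age / HALF_LIFE_SECONDS)

def combined_score(similarity: float, created_at: float, alpha: float = 0.8) -> float:
    """Blend similarity with recency; alpha controls how much similarity dominates."""
    return alpha * similarity + (1 - alpha) * recency_weight(created_at)

# Example: a week-old memory with similarity 0.82 vs. a fresh one with 0.78.
print(combined_score(0.82, time.time() - 7 * 24 * 3600))   # ≈ 0.756
print(combined_score(0.78, time.time()))                    # ≈ 0.824
```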
Existing open-source projects like LangMem provide tooling to extract important information from conversations, optimise agent behaviour through prompt refinement, and maintain long-term memory.
Conclusion
MemCortex is a small but critical step toward giving your AI-powered applications the persistent, semantic memory they need to evolve from short-term chat partners into capable long-term agents. As AI agents grow more capable, systems like this will bridge the gap between short-term context and true long-term reasoning. For those interested in extending, optimising, or integrating with the system, the source code is available on GitHub.
