title: "A-Modular-Kingdom - The Infrastructure Layer AI Agents Deserve"
published: true
description: "Production-ready MCP server with RAG, memory, and tools. Connect any AI agent to long-term memory, document retrieval, and 10+ powerful tools."
tags: ai, rag, mcp, python

canonical_url: https://masihmoafi.com/blog/a-modular-kingdom

A-Modular-Kingdom

The infrastructure layer AI agents deserve

Why I Built This

Every AI agent I built had the same problem: I kept rebuilding the same infrastructure from scratch.

  • RAG system? Build it again.
  • Long-term memory? Implement it again.
  • Web search, code execution, vision? Wire them up again.

After the third project, I stopped. I extracted everything into a single, production-ready foundation that any agent can plug into via the Model Context Protocol (MCP).

A-Modular-Kingdom is that foundation.

What It Does

Start the MCP server:

python src/agent/host.py

Now any AI agent—Claude Desktop, Gemini, custom chatbots—instantly gets:

  • Document retrieval (RAG) with Qdrant + BM25 + CrossEncoder reranking
  • Hierarchical memory that persists across sessions and projects
  • 10+ tools: web search, browser automation, code execution, vision, TTS/STT

One server. Unlimited applications.

The Tools

| Tool | What It Does |
| --- | --- |
| query_knowledge_base | Search documents with hybrid retrieval (vector + keyword + reranking) |
| save_memory | Store memories with automatic scope inference |
| search_memories | Retrieve with priority: global rules → preferences → project context |
| save_fact | Structured fact storage with metadata |
| set_global_rule | Persistent instructions across all sessions |
| list_all_memories | View everything stored |
| delete_memory | Remove by ID |
| web_search | DuckDuckGo integration |
| browser_automation | Playwright scraping (text + screenshots) |
| code_execute | Safe Python sandbox |
| analyze_media | Ollama vision for images/videos |
| text_to_speech | Multiple engines (pyttsx3, gtts, kokoro) |
| speech_to_text | Whisper transcription |
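
Want to try these tools without wiring up a full agent? Here's a minimal sketch using the official mcp Python client. The tool name comes from the table above; the query argument name is my assumption, not a confirmed part of the server's schema:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the A-Modular-Kingdom server as a stdio subprocess.
params = StdioServerParameters(command="python", args=["src/agent/host.py"])

async def main() -> None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover what the server exposes, then call one tool.
            listing = await session.list_tools()
            print([tool.name for tool in listing.tools])
            result = await session.call_tool(
                "query_knowledge_base",
                arguments={"query": "How does auth work?"},  # argument name assumed
            )
            print(result.content)

asyncio.run(main())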

RAG: Not Just Vector Search

Most RAG implementations are naive: embed documents, find nearest neighbors, return results. This works for demos. It fails in production.

A-Modular-Kingdom uses a three-stage pipeline:

Anthropic's Contextual Retrieval - the inspiration for this RAG implementation.

Stage 1: Hybrid Retrieval

  • Vector search (Qdrant Cloud) finds semantically similar chunks
  • BM25 keyword search catches exact term matches vectors miss

Stage 2: Ensemble Fusion

  • Results from both methods are combined with configurable weights
  • Neither method dominates—they complement each other

Stage 3: CrossEncoder Reranking

  • A cross-encoder model (ms-marco-MiniLM-L-6-v2) scores each result against the query
  • Top 5 most relevant results are returned
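
To make the three stages concrete, here's a self-contained sketch of the same pipeline shape. It swaps Qdrant for a brute-force cosine search and uses reciprocal rank fusion (RRF) for stage 2, matching the architecture diagram below; the models are real, but the constants and weighting are illustrative, not the project's actual values:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = [
    "Auth middleware validates JWT tokens on every request.",
    "The RAG pipeline combines vector search with BM25 keyword matching.",
    "Qdrant stores document embeddings for semantic retrieval.",
    "CrossEncoder reranking scores each candidate against the query.",
]
query = "How does authentication work?"

# Stage 1a: vector search (stand-in for Qdrant; brute-force cosine here).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
q_vec = embedder.encode(query, normalize_embeddings=True)
vector_ranking = np.argsort(-(doc_vecs @ q_vec)).tolist()

# Stage 1b: BM25 keyword search catches exact-term matches vectors miss.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_ranking = np.argsort(-np.array(bm25.get_scores(query.lower().split()))).tolist()

# Stage 2: reciprocal rank fusion; k=60 is the conventional constant.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([vector_ranking, bm25_ranking])

# Stage 3: CrossEncoder scores each (query, doc) pair; keep the top 5.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, docs[i]) for i in fused])
top = [docs[i] for _, i in sorted(zip(ce_scores, fused), reverse=True)[:5]]
print(top)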

V3 RAG Architecture - Hybrid retrieval with RRF fusion and CrossEncoder reranking.

The Numbers

Accuracy:

  • Focused FAQ: 100%
  • Real documents: 83-86%
  • LLM-as-Judge: 84-98%

Performance:

  • V2: 26.8s cold start, 0.31s warm query
  • V3: 13.9s cold start, 0.02s warm query

Supports: Python, Markdown, PDF, Jupyter notebooks, JavaScript, TypeScript

Anthropic's evaluation showing contextual retrieval improvements - benchmark reference for our implementation.

Memory: Scoped and Hierarchical

Memory architecture inspired by Mem0 - hierarchical, scoped, and persistent.

Flat memory systems don't scale. When you have hundreds of memories, search becomes noise.

A-Modular-Kingdom organizes memory into scopes:

| Scope | Persistence | Example |
| --- | --- | --- |
| global_rules | Forever, all projects | "Always use type hints" |
| global_preferences | Forever, all projects | "Prefer concise responses" |
| global_personas | Forever, all projects | Reusable agent personalities |
| project_context | Current project only | "Uses FastAPI backend" |

Smart Inference

You don't need to specify scopes manually. The system infers from content:

save_memory("User prefers dark mode")  # → global_preferences
save_memory("Always validate input")   # → global_rules
save_memory("Uses PostgreSQL")         # → project_context
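
The source doesn't show the inference logic, but a keyword heuristic of roughly this shape would reproduce the routing above. Treat it as a sketch of the idea, not the project's real rules:

def infer_scope(content: str) -> str:
    """Guess a memory scope from content (illustrative heuristic only)."""
    text = content.lower()
    if any(word in text for word in ("always", "never", "must")):
        return "global_rules"        # imperative phrasing reads as a rule
    if any(word in text for word in ("prefer", "prefers", "likes")):
        return "global_preferences"  # stated tastes become preferences
    return "project_context"         # default: tied to the current project

assert infer_scope("Always validate input") == "global_rules"
assert infer_scope("User prefers dark mode") == "global_preferences"
assert infer_scope("Uses PostgreSQL") == "project_context"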

Priority Search

When you search, results come back in priority order:

  1. Global rules (highest priority)
  2. Global preferences
  3. Global personas
  4. Project context

This means your persistent instructions always surface first.
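
A minimal sketch of that ordering: rank matches by scope first and relevance second, so a rule outranks a project note even with a lower similarity score. The scope names come from the table above; the tie-breaking is my assumption:

SCOPE_PRIORITY = {
    "global_rules": 0,
    "global_preferences": 1,
    "global_personas": 2,
    "project_context": 3,
}

def order_results(matches: list[dict]) -> list[dict]:
    # Primary key: scope priority; secondary key: descending relevance.
    return sorted(matches, key=lambda m: (SCOPE_PRIORITY[m["scope"]], -m["score"]))

hits = [
    {"scope": "project_context", "score": 0.95, "text": "Uses FastAPI backend"},
    {"scope": "global_rules", "score": 0.70, "text": "Always use type hints"},
]
print([h["text"] for h in order_results(hits)])
# -> ['Always use type hints', 'Uses FastAPI backend']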

Integration

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "a-modular-kingdom": {
      "command": "python",
      "args": ["/path/to/src/agent/host.py"]
    }
  }
}

Custom Agents

from smolagents import LiteLLMModel, ToolCallingAgent, ToolCollection
from mcp import StdioServerParameters

params = StdioServerParameters(
    command="python",
    args=["/path/to/host.py"]
)

# ToolCallingAgent needs a model; any LiteLLM-supported id works here.
model = LiteLLMModel(model_id="gemini/gemini-2.0-flash")

# trust_remote_code=True is required by recent smolagents releases.
with ToolCollection.from_mcp(params, trust_remote_code=True) as tools:
    agent = ToolCallingAgent(tools=list(tools.tools), model=model)
    result = agent.run("Search the codebase for auth logic")

Standalone Package

Don't need the full server? Install just the RAG and memory components:

pip install rag-mem

from memory_mcp import RAGPipeline, MemoryStore

# RAG
pipeline = RAGPipeline(document_paths=["./docs"])
pipeline.index()
results = pipeline.search("How does auth work?")

# Memory
memory = MemoryStore()
memory.add("Important fact")
results = memory.search("facts")

CLI

memory-mcp init
memory-mcp serve --docs ./documents
memory-mcp index ./path/to/files

Technical Stack

  • Embeddings: Pluggable providers (Ollama, sentence-transformers, or OpenAI); see the sketch after this list
  • Vector DB: Qdrant (local or cloud)
  • Keyword Search: BM25 (rank-bm25)
  • Reranking: CrossEncoder (ms-marco-MiniLM-L-6-v2)
  • Memory: Qdrant with hierarchical scoping
  • Protocol: Model Context Protocol (MCP)
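
The pluggable-embeddings point deserves a picture. A small protocol like this is one way to make providers swappable; the class names here are illustrative, not the project's API:

from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class SentenceTransformersEmbedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()

class OllamaEmbedder:
    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name

    def embed(self, texts: list[str]) -> list[list[float]]:
        import ollama  # assumes a local Ollama server is running
        return [
            ollama.embeddings(model=self.model_name, prompt=t)["embedding"]
            for t in texts
        ]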

Real-World Application: Google Hackathon

The modularity of A-Modular-Kingdom was demonstrated in my Google Kaggle Hackathon submission—a multi-agent emotional AI system built on Gemma 3n.

Multi-agent architecture using A-Modular-Kingdom's RAG and Memory modules.

The system uses a modular pipeline: Vocal Emotion Detection analyzes speech while Gemma 3n's vision assesses facial expressions. The combined emotion tag and transcribed query are passed to a Router Agent that delegates to specialist sub-agents, each backed by A-Modular-Kingdom's RAG and Memory modules for personalized, context-aware responses.
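
As a sketch of that delegation step (the emotion tags and agent names are mine, not the hackathon code):

# Hypothetical router: pick a specialist by detected emotion tag.
SPECIALISTS = {
    "distressed": "crisis_support_agent",
    "frustrated": "troubleshooting_agent",
    "neutral": "general_assistant",
}

def route(emotion_tag: str, query: str) -> str:
    agent = SPECIALISTS.get(emotion_tag, "general_assistant")
    # Each specialist would call query_knowledge_base and search_memories
    # from A-Modular-Kingdom before generating its reply.
    return f"{agent} handles: {query!r}"

print(route("frustrated", "Why does my login keep failing?"))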

Modules Used

  • RAG: Each sub-agent retrieves relevant context from persistent knowledge bases
  • Memory: Long-term storage of user preferences, conversation history, and learned behaviors
  • Browser Automation: Playwright MCP tool for web interactions

Links


A-Modular-Kingdom: Stop rebuilding. Start building.
