title: "A-Modular-Kingdom - The Infrastructure Layer AI Agents Deserve"
published: true
description: "Production-ready MCP server with RAG, memory, and tools. Connect any AI agent to long-term memory, document retrieval, and 10+ powerful tools."
tags: ai, rag, mcp, python

canonical_url: https://masihmoafi.com/blog/a-modular-kingdom

A-Modular-Kingdom

The infrastructure layer AI agents deserve

Why I Built This

Every AI agent I built had the same problem: I kept rebuilding the same infrastructure from scratch.

  • RAG system? Build it again.
  • Long-term memory? Implement it again.
  • Web search, code execution, vision? Wire them up again.

After the third project, I stopped. I extracted everything into a single, production-ready foundation that any agent can plug into via the Model Context Protocol (MCP).

A-Modular-Kingdom is that foundation.

What It Does

Start the MCP server:

python src/agent/host.py

Now any AI agent—Claude Desktop, Gemini, custom chatbots—instantly gets:

  • Document retrieval (RAG) with Qdrant + BM25 + CrossEncoder reranking
  • Hierarchical memory that persists across sessions and projects
  • 10+ tools: web search, browser automation, code execution, vision, TTS/STT

One server. Unlimited applications.

The Tools

| Tool | What It Does |
| --- | --- |
| query_knowledge_base | Search documents with hybrid retrieval (vector + keyword + reranking) |
| save_memory | Store memories with automatic scope inference |
| search_memories | Retrieve with priority: global rules → preferences → project context |
| save_fact | Structured fact storage with metadata |
| set_global_rule | Persistent instructions across all sessions |
| list_all_memories | View everything stored |
| delete_memory | Remove by ID |
| web_search | DuckDuckGo integration |
| browser_automation | Playwright scraping (text + screenshots) |
| code_execute | Safe Python sandbox |
| analyze_media | Ollama vision for images/videos |
| text_to_speech | Multiple engines (pyttsx3, gtts, kokoro) |
| speech_to_text | Whisper transcription |
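
Want to try these tools without wiring up a full agent? Here's a minimal sketch using the official mcp Python client. The tool name comes from the table above; the query argument name is my assumption, not a confirmed part of the server's schema:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the A-Modular-Kingdom server as a stdio subprocess.
params = StdioServerParameters(command="python", args=["src/agent/host.py"])

async def main() -> None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover what the server exposes, then call one tool.
            listing = await session.list_tools()
            print([tool.name for tool in listing.tools])
            result = await session.call_tool(
                "query_knowledge_base",
                arguments={"query": "How does auth work?"},  # argument name assumed
            )
            print(result.content)

asyncio.run(main())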

RAG: Not Just Vector Search

Most RAG implementations are naive: embed documents, find nearest neighbors, return results. This works for demos. It fails in production.

A-Modular-Kingdom uses a three-stage pipeline:

Anthropic's Contextual Retrieval - the inspiration for this RAG implementation.

Stage 1: Hybrid Retrieval

  • Vector search (Qdrant Cloud) finds semantically similar chunks
  • BM25 keyword search catches exact term matches vectors miss

Stage 2: Ensemble Fusion

  • Results from both methods are combined with configurable weights
  • Neither method dominates—they complement each other

Stage 3: CrossEncoder Reranking

  • A cross-encoder model (ms-marco-MiniLM-L-6-v2) scores each result against the query
  • Top 5 most relevant results are returned
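
To make the three stages concrete, here's a self-contained sketch of the same pipeline shape. It swaps Qdrant for a brute-force cosine search and uses reciprocal rank fusion (RRF) for stage 2, matching the architecture diagram below; the models are real, but the constants and weighting are illustrative, not the project's actual values:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = [
    "Auth middleware validates JWT tokens on every request.",
    "The RAG pipeline combines vector search with BM25 keyword matching.",
    "Qdrant stores document embeddings for semantic retrieval.",
    "CrossEncoder reranking scores each candidate against the query.",
]
query = "How does authentication work?"

# Stage 1a: vector search (stand-in for Qdrant; brute-force cosine here).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
q_vec = embedder.encode(query, normalize_embeddings=True)
vector_ranking = np.argsort(-(doc_vecs @ q_vec)).tolist()

# Stage 1b: BM25 keyword search catches exact-term matches vectors miss.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_ranking = np.argsort(-np.array(bm25.get_scores(query.lower().split()))).tolist()

# Stage 2: reciprocal rank fusion; k=60 is the conventional constant.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([vector_ranking, bm25_ranking])

# Stage 3: CrossEncoder scores each (query, doc) pair; keep the top 5.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, docs[i]) for i in fused])
top = [docs[i] for _, i in sorted(zip(ce_scores, fused), reverse=True)[:5]]
print(top)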

V3 RAG Architecture - Hybrid retrieval with RRF fusion and CrossEncoder reranking.

The Numbers

Accuracy:

  • Focused FAQ: 100%
  • Real documents: 83-86%
  • LLM-as-Judge: 84-98%

Performance:

  • V2: 26.8s cold start, 0.31s warm query
  • V3: 13.9s cold start, 0.02s warm query

Supports: Python, Markdown, PDF, Jupyter notebooks, JavaScript, TypeScript

Anthropic's evaluation showing contextual retrieval improvements - benchmark reference for our implementation.

Memory: Scoped and Hierarchical

Memory architecture inspired by Mem0 - hierarchical, scoped, and persistent.

Flat memory systems don't scale. When you have hundreds of memories, search becomes noise.

A-Modular-Kingdom organizes memory into scopes:

| Scope | Persistence | Example |
| --- | --- | --- |
| global_rules | Forever, all projects | "Always use type hints" |
| global_preferences | Forever, all projects | "Prefer concise responses" |
| global_personas | Forever, all projects | Reusable agent personalities |
| project_context | Current project only | "Uses FastAPI backend" |

Smart Inference

You don't need to specify scopes manually. The system infers from content:

save_memory("User prefers dark mode")  # → global_preferences
save_memory("Always validate input")   # → global_rules
save_memory("Uses PostgreSQL")         # → project_context
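
The source doesn't show the inference logic, but a keyword heuristic of roughly this shape would reproduce the routing above. Treat it as a sketch of the idea, not the project's real rules:

def infer_scope(content: str) -> str:
    """Guess a memory scope from content (illustrative heuristic only)."""
    text = content.lower()
    if any(word in text for word in ("always", "never", "must")):
        return "global_rules"        # imperative phrasing reads as a rule
    if any(word in text for word in ("prefer", "prefers", "likes")):
        return "global_preferences"  # stated tastes become preferences
    return "project_context"         # default: tied to the current project

assert infer_scope("Always validate input") == "global_rules"
assert infer_scope("User prefers dark mode") == "global_preferences"
assert infer_scope("Uses PostgreSQL") == "project_context"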

Priority Search

When you search, results come back in priority order:

  1. Global rules (highest priority)
  2. Global preferences
  3. Global personas
  4. Project context

This means your persistent instructions always surface first.
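
A minimal sketch of that ordering: rank matches by scope first and relevance second, so a rule outranks a project note even with a lower similarity score. The scope names come from the table above; the tie-breaking is my assumption:

SCOPE_PRIORITY = {
    "global_rules": 0,
    "global_preferences": 1,
    "global_personas": 2,
    "project_context": 3,
}

def order_results(matches: list[dict]) -> list[dict]:
    # Primary key: scope priority; secondary key: descending relevance.
    return sorted(matches, key=lambda m: (SCOPE_PRIORITY[m["scope"]], -m["score"]))

hits = [
    {"scope": "project_context", "score": 0.95, "text": "Uses FastAPI backend"},
    {"scope": "global_rules", "score": 0.70, "text": "Always use type hints"},
]
print([h["text"] for h in order_results(hits)])
# -> ['Always use type hints', 'Uses FastAPI backend']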

Integration

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "a-modular-kingdom": {
      "command": "python",
      "args": ["/path/to/src/agent/host.py"]
    }
  }
}

Custom Agents

from smolagents import LiteLLMModel, ToolCallingAgent, ToolCollection
from mcp import StdioServerParameters

params = StdioServerParameters(
    command="python",
    args=["/path/to/host.py"]
)

# ToolCallingAgent needs a model; any LiteLLM-supported id works here.
model = LiteLLMModel(model_id="gemini/gemini-2.0-flash")

# trust_remote_code=True is required by recent smolagents releases.
with ToolCollection.from_mcp(params, trust_remote_code=True) as tools:
    agent = ToolCallingAgent(tools=list(tools.tools), model=model)
    result = agent.run("Search the codebase for auth logic")

Standalone Package

Don't need the full server? Install just the RAG and memory components:

pip install rag-mem

from memory_mcp import RAGPipeline, MemoryStore

# RAG
pipeline = RAGPipeline(document_paths=["./docs"])
pipeline.index()
results = pipeline.search("How does auth work?")

# Memory
memory = MemoryStore()
memory.add("Important fact")
results = memory.search("facts")

CLI

memory-mcp init
memory-mcp serve --docs ./documents
memory-mcp index ./path/to/files

Technical Stack

  • Embeddings: Pluggable providers (Ollama, sentence-transformers, or OpenAI); see the sketch after this list
  • Vector DB: Qdrant (local or cloud)
  • Keyword Search: BM25 (rank-bm25)
  • Reranking: CrossEncoder (ms-marco-MiniLM-L-6-v2)
  • Memory: Qdrant with hierarchical scoping
  • Protocol: Model Context Protocol (MCP)
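
The pluggable-embeddings point deserves a picture. A small protocol like this is one way to make providers swappable; the class names here are illustrative, not the project's API:

from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class SentenceTransformersEmbedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()

class OllamaEmbedder:
    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name

    def embed(self, texts: list[str]) -> list[list[float]]:
        import ollama  # assumes a local Ollama server is running
        return [
            ollama.embeddings(model=self.model_name, prompt=t)["embedding"]
            for t in texts
        ]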

Real-World Application: Google Hackathon

The modularity of A-Modular-Kingdom was demonstrated in my Google Kaggle Hackathon submission—a multi-agent emotional AI system built on Gemma 3n.

Multi-agent architecture using A-Modular-Kingdom's RAG and Memory modules.

The system uses a modular pipeline: Vocal Emotion Detection analyzes speech while Gemma 3n's vision assesses facial expressions. The combined emotion tag and transcribed query are passed to a Router Agent that delegates to specialist sub-agents, each backed by A-Modular-Kingdom's RAG and Memory modules for personalized, context-aware responses.
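
As a sketch of that delegation step (the emotion tags and agent names are mine, not the hackathon code):

# Hypothetical router: pick a specialist by detected emotion tag.
SPECIALISTS = {
    "distressed": "crisis_support_agent",
    "frustrated": "troubleshooting_agent",
    "neutral": "general_assistant",
}

def route(emotion_tag: str, query: str) -> str:
    agent = SPECIALISTS.get(emotion_tag, "general_assistant")
    # Each specialist would call query_knowledge_base and search_memories
    # from A-Modular-Kingdom before generating its reply.
    return f"{agent} handles: {query!r}"

print(route("frustrated", "Why does my login keep failing?"))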

Modules Used

  • RAG: Each sub-agent retrieves relevant context from persistent knowledge bases
  • Memory: Long-term storage of user preferences, conversation history, and learned behaviors
  • Browser Automation: Playwright MCP tool for web interactions

Links


A-Modular-Kingdom: Stop rebuilding. Start building.
