Most AI projects call an API and return the response. I built something different — a cognitive layer that classifies every input, routes it to the optimal LLM provider based on complexity and cost, executes tools autonomously in an iterative loop, and compresses memory so conversations never lose context.
It's called Akatskii, and it runs as the brain behind Luna, a personal AI assistant I use daily for managing a SaaS platform, trading systems, and creative projects.
This isn't a tutorial. It's what I learned building a production cognitive architecture from scratch.
## The Problem
I needed an AI assistant that could:
- Think at different speeds. A status check doesn't need Claude's reasoning. A complex architecture decision doesn't belong on Groq's fast path.
- Use tools autonomously. Not "call one function and return" — actually reason through multi-step problems, calling tools iteratively until the task is done.
- Remember everything. Conversations shouldn't break when you hit context limits. The system should compress, not truncate.
- Cost close to zero. I'm running this 24/7. Paying $0.15/1K tokens for every status check is insane.
No existing framework solved all four. So I built one.
## The Router: Regex Before LLM
The most counterintuitive decision I made: use regex for routing, not an LLM.
Here's why. About 80% of inputs to an AI assistant are predictable. "What's the system status?" is always a status check. "Why are posts failing?" always needs reasoning. Pattern matching handles these instantly, deterministically, and for free.
```
User: "check system health"
→ Regex match: REFLEX (confidence: 0.9)
→ Route to Groq (free, ~150ms)
→ Pre-fetch system_health capability data
→ LLM generates response grounded in real data
```
Only when regex confidence drops below 0.8 does the system fall back to an LLM for classification — Groq's Llama 3.3 70B, which adds ~100ms and costs nothing on the free tier.
Result: 80% of routing is instant and free. The remaining 20% costs fractions of a cent.
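The confidence-gated routing described above can be sketched in a few lines. The pattern table, threshold, and function names here are illustrative, not Akatskii's actual config:

```python
import re

# Hypothetical pattern table: (compiled regex, thought type, confidence)
PATTERNS = [
    (re.compile(r"\b(status|health|uptime)\b", re.I), "REFLEX", 0.9),
    (re.compile(r"\bwhy\b.*\b(fail|error|broken)", re.I), "EXECUTIVE", 0.85),
]

CONFIDENCE_FLOOR = 0.8  # below this, fall back to LLM classification

def route(text: str) -> tuple[str, float]:
    """Return (thought_type, confidence); defer to the LLM when unsure."""
    best = ("UNKNOWN", 0.0)
    for pattern, thought_type, confidence in PATTERNS:
        if pattern.search(text) and confidence > best[1]:
            best = (thought_type, confidence)
    if best[1] < CONFIDENCE_FLOOR:
        return ("LLM_CLASSIFY", best[1])  # hand off to Groq for classification
    return best
```

The design point is that the deterministic path answers first; the LLM classifier only runs for the inputs the patterns can't claim with confidence.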
## Five Thought Types, Three Providers
Every input gets classified into one of five thought types, each mapped to a different LLM provider:
| Type | Provider | Cost | When |
|---|---|---|---|
| Reflex | Groq Llama 3.3 70B | Free | Status checks, lookups, simple queries |
| Executive | Gemini 2.5 Flash | Free | Reasoning, planning, debugging |
| Sensory | Gemini 2.5 Flash | Free | Image analysis, multimodal |
| Creative | Gemini 2.5 Flash | Free | Narrative, prose, worldbuilding |
| Deep | Claude Sonnet 4 | Paid | Complex architecture (explicit escalation only) |
The key insight: Claude is never invoked unless the user explicitly asks for it ("need claude", "escalate", "deep dive"). Everything else runs on free providers. My monthly LLM cost for a 24/7 AI assistant: effectively zero.
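A minimal sketch of the table as a lookup, with the explicit-escalation gate in front of Claude. The dict shape and function names are my illustration; the model names come from the table above:

```python
# Thought-type -> provider map (structure is illustrative)
PROVIDERS = {
    "reflex":    {"provider": "groq",      "model": "llama-3.3-70b", "paid": False},
    "executive": {"provider": "gemini",    "model": "gemini-2.5-flash", "paid": False},
    "sensory":   {"provider": "gemini",    "model": "gemini-2.5-flash", "paid": False},
    "creative":  {"provider": "gemini",    "model": "gemini-2.5-flash", "paid": False},
    "deep":      {"provider": "anthropic", "model": "claude-sonnet-4", "paid": True},
}

ESCALATION_PHRASES = ("need claude", "escalate", "deep dive")

def select_provider(thought_type: str, user_text: str) -> dict:
    # Claude only on explicit user escalation; everything else stays free-tier
    if any(phrase in user_text.lower() for phrase in ESCALATION_PHRASES):
        return PROVIDERS["deep"]
    return PROVIDERS.get(thought_type, PROVIDERS["executive"])
```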
## The Agentic Loop: Let the LLM Decide When It's Done
This is where it gets interesting. When a thought requires tools, the brain enters an iterative loop:
```python
for iteration in range(15):
    response = await llm.call_with_tools(messages, available_tools)
    if response.is_text:  # LLM decided it's done
        return response
    # Execute tool calls (parallel if multiple) via asyncio.gather
    results = await asyncio.gather(
        *(execute_mcp_tool(tc) for tc in response.tool_calls)
    )
    for result in results:
        messages.append(tool_result(result))
    # Loop: LLM sees results, decides next action
```
The LLM reasons, calls tools, gets results, reasons again — up to 15 iterations. It decides when the task is complete, not the developer. Multiple tool calls in a single response execute in parallel via asyncio.gather.
If the iteration limit is hit, instead of silently returning incomplete work, the system runs a continuation pass — a second invocation with the partial results as context, giving the LLM one more chance to synthesize a complete response.
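The continuation pass is easy to express as a single extra invocation. This is a sketch with an injected `llm_call` so the shape is testable; the prompt wording and function names are placeholders, not the project's actual code:

```python
def continuation_pass(messages: list, partial_results: str, llm_call) -> str:
    """Hypothetical iteration-cap fallback: one more invocation with the
    partial tool results as context and no tool schema, forcing a synthesis."""
    prompt = messages + [{
        "role": "user",
        "content": ("Iteration limit reached. Using the tool results gathered "
                    "so far, synthesize a complete final answer.\n\n"
                    f"Partial results:\n{partial_results}"),
    }]
    # Plain completion call: with no tools offered, the LLM can only answer
    return llm_call(prompt)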
The tool bridge connects to a Model Context Protocol (MCP) server with 40+ tools via direct Python imports — no HTTP overhead, no serialization, no auth handshake. The LLM can check system health, query databases, manage content pipelines, execute trades, and signal other agents.
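"Direct Python imports, no HTTP" reduces to a name-to-callable registry. The tools and dispatch shape below are illustrative stand-ins, not the actual guardia-mcp surface:

```python
# Illustrative direct-import tool bridge: tool names map to plain callables,
# so dispatch is a function call with no serialization or network hop.
def system_health() -> dict:
    return {"status": "ok"}

def echo(text: str) -> str:
    return text

TOOL_REGISTRY = {"system_health": system_health, "echo": echo}

def execute_mcp_tool(tool_call: dict):
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call.get("args", {}))
```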
## Memory Compaction: Compress, Don't Truncate
Long conversations hit context limits. Most systems handle this by dropping old messages. That's like treating amnesia by forgetting harder.
Akatskii's compaction engine does something different:
- A daemon monitors conversation token count
- At 180K tokens, it triggers compaction
- Gemini extracts structured facts: decisions, preferences, technical details, context
- It generates dual summaries — a short continuity hint (2-3 lines) and a detailed snapshot (1-2 paragraphs)
- Facts persist to a knowledge base. Snapshots persist as episodic memory.
- The conversation resumes with compressed context
Target compression: 3x. A 180K token conversation becomes ~60K tokens of distilled context, and nothing meaningful is lost. The user sees: "A moment while I gather my thoughts..." — then the conversation continues seamlessly.
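The trigger and the 3x target fit in a few lines. This is a back-of-envelope sketch: the threshold comes from the text above, but the `summarize` hook (standing in for the Gemini extraction call) and the message shapes are my assumptions:

```python
COMPACTION_THRESHOLD = 180_000  # tokens; daemon triggers compaction here
TARGET_RATIO = 3                # 180K tokens -> ~60K of distilled context

def should_compact(token_count: int) -> bool:
    return token_count >= COMPACTION_THRESHOLD

def compact(messages: list, token_count: int, summarize):
    # summarize() stands in for the LLM extraction pass; it returns
    # (facts, continuity_hint, snapshot) as described above
    facts, hint, snapshot = summarize(messages)
    compressed = [
        {"role": "system", "content": f"Continuity: {hint}"},
        {"role": "system", "content": f"Snapshot: {snapshot}"},
    ]
    budget = token_count // TARGET_RATIO  # keep roughly a third of the context
    return compressed, facts, budget
```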
## Room-Aware Personality
Luna isn't one personality. She adapts based on context.
In an engineering room, she's technical and direct. In a trading room, she's analytical and cautious. In a creative room, she writes prose. Each room also filters which tools are available — the trading room gets market tools, the build room gets code execution, and they can't cross boundaries.
Same brain, different capabilities, different voice. This is implemented through room configs that define system prompts, tool allowlists, and persona adjustments per context.
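A room config can be as small as a prompt plus an allowlist. The field names and example tools below are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoomConfig:
    name: str
    system_prompt: str
    tool_allowlist: frozenset[str]

ROOMS = {
    "engineering": RoomConfig("engineering", "Be technical and direct.",
                              frozenset({"code_exec", "system_health"})),
    "trading": RoomConfig("trading", "Be analytical and cautious.",
                          frozenset({"market_data", "execute_trade"})),
}

def allowed(room: str, tool: str) -> bool:
    # Tools can't cross room boundaries: the allowlist is the filter
    return tool in ROOMS[room].tool_allowlist
```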
## Background Autonomy
Three daemon processes run independently:
- Heartbeat (every 5 min): evaluates trigger conditions, syncs calendar, fires proactive notifications
- Scout (every 4 hours): monitors HackerNews, Reddit, and GitHub for relevant news, scores findings, caches for later surfacing
- Shadow Runner: picks up queued tasks and executes them through the full agentic loop — autonomously, with no user interaction
The Shadow Runner is the most interesting. It's essentially the same brain, but running headless on queued work. It loads room-specific configs, decomposes complex tasks, retries on failure, and reports results back to the task queue. Autonomous execution without supervision.
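The Shadow Runner's drain-retry-report cycle can be sketched as a plain loop. The retry count and the `process_task` hook (standing in for the full agentic loop) are assumptions for illustration:

```python
MAX_RETRIES = 2  # illustrative; each task gets a few attempts before failing

def shadow_run(tasks: list, process_task) -> list:
    """Drain a task queue headlessly; report (task, result) pairs,
    with result None when all retries are exhausted."""
    results = []
    while tasks:
        task = tasks.pop(0)
        for attempt in range(MAX_RETRIES + 1):
            try:
                results.append((task, process_task(task)))
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    results.append((task, None))  # report failure upstream
    return results
```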
## What I'd Do Differently
The router patterns are hand-crafted. 40 regex patterns work, but they're brittle. A learned classifier (fine-tuned on routing decisions) would generalize better. I haven't done this because the regex approach works well enough and costs nothing.
Tool definitions are static. The MCP bridge maps capabilities to tools at import time. A more dynamic system would discover tools at runtime and let the LLM choose from the full inventory. MCP's tool discovery protocol supports this — I just haven't needed it yet.
Compaction loses nuance. Fact extraction captures the "what" but sometimes misses the "how we got there." The reasoning chain that led to a decision is harder to compress than the decision itself. I'm exploring storing reasoning traces alongside facts.
## The Stack
| Layer | Technology |
|---|---|
| Runtime | Python 3.11, FastAPI, asyncio |
| LLM Providers | Gemini 2.5 Flash, Groq Llama 3.3 70B, Claude Sonnet 4 |
| Tools | Model Context Protocol (MCP) — 40+ tools |
| Memory | SQLite + fastembed (all-MiniLM-L6-v2, 384-dim vectors) |
| Voice | OpenAI Whisper (STT), ElevenLabs soundboard + Edge TTS |
| Process Management | PM2 |
## Try It
The full source is at github.com/AlexlaGuardia/Akatskii. The MCP server it connects to is at github.com/AlexlaGuardia/guardia-mcp.
If you're building something similar — or if you've solved any of the problems I mentioned differently — I'd genuinely like to hear about it.
I'm Alex, a full-stack software engineer building AI systems, game engines, and production platforms. I write Python, Rust, and TypeScript. Find me on GitHub or LinkedIn.