The hardest part of building a self-hosted AI isn't the architecture — it's choosing the right LLM for your actual use case and your actual hardware. A multimodal model that understands images is pointless when your task is explaining bank transactions. A model that tops benchmarks is useless if it takes 20 seconds to respond on your CPU. Here's what happened when I learned both lessons.
The Hardware
The entire gharasathi system runs on a ByteNUC mini PC:
| Spec | Value |
|---|---|
| CPU | Intel (no discrete GPU) |
| RAM | 6GB total |
| Disk | 2TB SSD |
| OS | Talos Linux (immutable, K8s-native) |
| Cost | ~$200 |
6GB RAM. That's it. And it needs to run the OS, Kubernetes, Neo4j, four microservices, and an LLM.
The Memory Budget
Every megabyte matters. Here's how the 6GB is carved up, based on actual K8s resource limits from our deployment manifests:
| Service | K8s requests | K8s limits | Notes |
|---|---|---|---|
| Ollama sidecar | 2Gi | 3Gi | LLM inference engine |
| FastAPI (aapla-hushar) | 512Mi | 1Gi | Python service |
| Neo4j | 512Mi | 1Gi | Graph database |
| aapla-dhan | 256Mi | 512Mi | Go finance service |
The LLM service gets ~3GB total (Ollama + FastAPI together in one K8s pod via sidecar pattern). After the Python runtime, FastAPI, and LangChain dependencies eat ~500MB, the actual model has roughly 2.5GB to work with.
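A quick way to sanity-check a budget like this is to add up the requests and limits from the table and compare them with the physical RAM. The sketch below does only that arithmetic. The dictionary layout and the assumption that "6GB" means 6 GiB of usable memory are illustrative, not taken from the deployment manifests.

```python
# Resource figures copied from the table above, converted to MiB.
# The structure and key names are illustrative, not from the real manifests.
GIB = 1024  # MiB per GiB

services = {
    "ollama-sidecar": {"request": 2 * GIB, "limit": 3 * GIB},
    "aapla-hushar": {"request": 512, "limit": 1 * GIB},
    "neo4j": {"request": 512, "limit": 1 * GIB},
    "aapla-dhan": {"request": 256, "limit": 512},
}

TOTAL_RAM_MIB = 6 * GIB  # assumption: the 6GB box exposes 6 GiB of usable RAM


def budget(services, total_mib):
    """Sum requests and limits and report what is left if every limit is hit."""
    requested = sum(s["request"] for s in services.values())
    limited = sum(s["limit"] for s in services.values())
    return {
        "requested_mib": requested,
        "limit_mib": limited,
        "headroom_at_limits_mib": total_mib - limited,
    }


summary = budget(services, TOTAL_RAM_MIB)
print(summary)
```

Under those assumptions the limits add up to 5.5Gi, which leaves roughly 512Mi for the OS and Kubernetes itself if every service runs at its limit.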
The Model Selection Journey
I evaluated 15+ open-source models across four tiers. The question: what's the best LLM you can run in ~2.5GB?
The top 5 candidates that passed the filter:
| Model | Size | Context | Key Strength |
|---|---|---|---|
| Qwen3:4b | 2.5GB | 256K | Rivals 72B performance |
| Phi-3-mini | 2.2GB | 128K | Strong reasoning (Microsoft) |
| Llama 3.2-3B | 2.0GB | 128K | Instruction following |
| Qwen2.5-3B | 1.9GB | 32K | Good coding/math |
| Gemma3-4B | 3.3GB | 128K | Multimodal (images) |
The research was clear: Qwen3:4b was the winner. 2.5GB, 256K context window, benchmarks rivaling 72B models. A no-brainer.
Reality Disagreed
I deployed Qwen3:4b to the ByteNUC. It fit in memory. It loaded fine. Then I asked it a question.
Problem 1: Painfully slow. On a CPU-only Intel NUC with 6GB RAM, Qwen3:4b took 15-20+ seconds for simple responses. For a household chat assistant, that's unusable. You ask "how much did I spend on groceries?" and wait long enough to check the bank app yourself.
Problem 2: Hallucinations in Chinese. Qwen3 is a multilingual model supporting 100+ languages. On constrained hardware with limited context, the model would occasionally bleed into Chinese mid-response. Great for a multilingual product. Confusing for a household assistant that only needs English.
The benchmarks didn't lie: Qwen3:4b is a remarkable model. But published benchmark numbers typically come from datacenter GPUs such as an A100 with 80GB of VRAM, not from a $200 mini PC with shared system RAM and no GPU.
The Downgrade
I switched to phi3:mini (Microsoft's Phi-3 Mini, 3.8B parameters, 2.2GB):
| | Qwen3:4b | phi3:mini |
|---|---|---|
| Size | 2.5GB | 2.2GB |
| Context | 256K | 128K |
| Speed on ByteNUC | 15-20s+ | 3-7s |
| Language stability | Occasional Chinese bleed | Stable English |
| Benchmark ranking | Higher | Lower |
| Actually usable? | No | Yes |
phi3:mini is less capable on paper. Smaller context window. Lower benchmark scores. But it responds in seconds, stays in English, and gives coherent answers about household data. That's what matters.
Lesson: benchmarks ≠ real-world performance on constrained hardware. Test on your actual target device, not on specs.
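Testing on the target device can be as simple as timing one prompt per candidate model against the local Ollama server. The sketch below assumes Ollama's HTTP API is reachable on its default port and uses the `eval_count` and `eval_duration` fields it returns for non-streaming requests. The helper names (`post_json`, `time_prompt`) are made up for this example.

```python
import json
import time
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def time_prompt(model, prompt, send=post_json):
    """Run one non-streaming generation and return wall-clock and token timings."""
    start = time.perf_counter()
    body = send(OLLAMA_URL, {"model": model, "prompt": prompt, "stream": False})
    wall_seconds = time.perf_counter() - start
    # Ollama reports generated-token count and generation time (in nanoseconds).
    eval_count = body.get("eval_count", 0)
    eval_ns = body.get("eval_duration", 0)
    tokens_per_second = eval_count / (eval_ns / 1e9) if eval_ns else 0.0
    return {
        "model": model,
        "wall_seconds": wall_seconds,
        "tokens_per_second": tokens_per_second,
    }
```

Calling `time_prompt("qwen3:4b", ...)` and `time_prompt("phi3:mini", ...)` with the same household-style question, on the box that will actually serve it, gives a like-for-like comparison that a leaderboard cannot. The `send` parameter exists only so the function can be exercised without a running server.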
A Note on Choosing Models for Constrained Hardware
A few things I wish I'd prioritized earlier in the evaluation:
- Quantized models matter. On hardware like this, quantized variants (Q4_0, Q4_K_M) are far more practical than full-precision weights. They reduce memory footprint and improve inference speed with minimal quality loss. All the Ollama models above are already quantized — that's how a 4B parameter model fits in 2.5GB.
- Match the model to the task, not the leaderboard. A multimodal model that understands images is wasted when your use case is explaining bank transactions. A model with a 256K context window is overkill when your prompts are 200 tokens. Pick for what you actually need.
- CPU inference is slow — plan for it. Even with phi3:mini, responses take around 30 seconds end-to-end on the ByteNUC. That's liveable for a household assistant you check a few times a day, but it rules out anything conversational or real-time. If you need snappy responses, you need a GPU or a smaller model.
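The quantization point is easy to check with back-of-the-envelope arithmetic: the size of the weights is roughly parameters times bits per weight. The sketch below uses 4.5 bits per weight, which is the storage cost of Q4_0 (4-bit values plus one 16-bit scale per block of 32 weights). Other schemes such as Q4_K_M come out slightly different, and the function name is illustrative.

```python
def weights_size_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# 16 bits/weight is half precision; 4.5 bits/weight corresponds to Q4_0.
for label, bits in (("FP16", 16), ("Q4_0 (~4.5 bpw)", 4.5)):
    print(f"4B params at {label}: {weights_size_gb(4, bits):.2f} GB")
```

For a 4B-parameter model this works out to about 8 GB at FP16 versus about 2.25 GB at Q4_0, before the KV cache and runtime overhead are added on top.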
Why the LLM Barely Matters
Here's the counterintuitive insight that makes a smaller model viable: the LLM is just the natural language interface. Neo4j does the heavy lifting.
The architecture follows a "tools first, LLM explains" pattern:
- Intent classification is deterministic keyword matching — no LLM needed
- Data retrieval uses specialized tools that run Cypher queries against Neo4j
- The LLM only turns the structured results into a natural-language answer
The LLM never calculates, never queries, never decides what data to fetch. It receives pre-fetched results and writes a human-readable response. For that job, phi3:mini is more than sufficient.
From the actual code (aapla-hushar/src/agents/graph.py):
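```python
# Import path assumed; the excerpt does not show the original imports.
from langchain_core.messages import HumanMessage, SystemMessage

# Deterministic routing: no LLM involved.
# `last_message` is the latest user message; the keyword list is truncated in this excerpt.
finance_keywords = ["spend", "expense", "budget", "transaction", ...]
if any(k in last_message for k in finance_keywords):
    intent = "finance"

# Tools fetch data first, then the LLM explains it.
data = get_spending_summary()  # Cypher query to Neo4j
response = llm.invoke([
    SystemMessage(content="You are a financial assistant."),
    HumanMessage(content=f"ONLY interpret this data:\n{data}"),
])
```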
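Since that excerpt depends on the rest of the service, here is a self-contained sketch of the same "tools first, LLM explains" flow that runs anywhere. Everything in it is a stand-in: the keyword subset, the `"general"` fallback intent, the made-up spending figures, and the stub that replaces the real model call are assumptions for illustration, not the project's actual code.

```python
FINANCE_KEYWORDS = ("spend", "expense", "budget", "transaction")  # illustrative subset


def classify_intent(message):
    """Deterministic routing: plain keyword matching, no model call."""
    text = message.lower()
    if any(keyword in text for keyword in FINANCE_KEYWORDS):
        return "finance"
    return "general"  # hypothetical fallback intent


def fake_spending_summary():
    """Stand-in for a tool that would run a Cypher query against Neo4j."""
    # Made-up values, purely for demonstration.
    return {"category": "groceries", "total": 412.50, "currency": "USD"}


def build_prompt(data):
    """The model only sees pre-fetched data and is told to explain it."""
    return f"ONLY interpret this data:\n{data}"


def answer(message, explain):
    """Route, fetch, then hand the result to `explain` (the LLM call)."""
    if classify_intent(message) == "finance":
        data = fake_spending_summary()
        return explain(build_prompt(data))
    return explain(message)


# A stub in place of the real model, so the flow can be run without Ollama.
echo_llm = lambda prompt: f"[model would explain] {prompt}"
print(answer("How much did I spend on groceries?", echo_llm))
```

The useful property is that routing and retrieval can be tested without any model at all. Swapping `echo_llm` for a real client changes only the wording of the answer, not which data gets fetched.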
The Ollama sidecar runs alongside FastAPI in the same K8s pod — localhost communication, zero network latency between the Python service and the LLM.
Future: Better LLM Requires Better Hardware
The current setup works for structured queries against Neo4j. But for actual data analysis — "what trends do you see in my spending?" or "suggest ways to save money" — the model needs to reason over larger contexts with more capability.
That upgrade means upgrading the ByteNUC's RAM first. A 7B or 8B model needs 8-16GB total system memory. Until then, phi3:mini handles the structured query-and-explain pattern well.
What's Next
The model runs. The queries work. But is "runs locally" enough to keep your data safe?
In Part 3, I look at what happened to OpenClaw — another local-first AI assistant — and why the real security lesson isn't about where your AI runs, but what it's allowed to do.
This is Part 2 of a 3-part series. Use the series navigation above to read Part 1 (Architecture & Neo4j) and Part 3 (What OpenClaw Teaches Us).