Koustubh

Posted on Feb 8

Running an LLM on 6GB RAM — Model Selection for Edge AI

#ai #selfhosted #llm #kubernetes

The hardest part of building a self-hosted AI isn't the architecture — it's choosing the right LLM for your actual use case and your actual hardware. A multimodal model that understands images is pointless when your task is explaining bank transactions. A model that tops benchmarks is useless if it takes 20 seconds to respond on your CPU. Here's what happened when I learned both lessons.

The Hardware

The entire gharasathi system runs on a ByteNUC mini PC:

Spec	Value
CPU	Intel (no discrete GPU)
RAM	6GB total
Disk	2TB SSD
OS	Talos Linux (immutable, K8s-native)
Cost	~$200

6GB RAM. That's it. And it needs to run the OS, Kubernetes, Neo4j, four microservices, and an LLM.

The Memory Budget

Every megabyte matters. Here's how the 6GB is carved up, based on actual K8s resource limits from our deployment manifests:

Service	K8s requests	K8s limits	Notes
Ollama sidecar	2Gi	3Gi	LLM inference engine
FastAPI (aapla-hushar)	512Mi	1Gi	Python service
Neo4j	512Mi	1Gi	Graph database
aapla-dhan	256Mi	512Mi	Go finance service

The LLM service gets ~3GB total (Ollama + FastAPI together in one K8s pod via sidecar pattern). After the Python runtime, FastAPI, and LangChain dependencies eat ~500MB, the actual model has roughly 2.5GB to work with.
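As a quick sanity check on the table above, the requests and limits can be added up. This is a minimal sketch: the figures are copied from the table, while the `BUDGET` dictionary and the `to_mib` helper are illustrative names, not taken from the deployment manifests.

```python
# Resource figures copied from the table above: (request, limit) per service.
BUDGET = {
    "ollama": ("2Gi", "3Gi"),
    "aapla-hushar": ("512Mi", "1Gi"),
    "neo4j": ("512Mi", "1Gi"),
    "aapla-dhan": ("256Mi", "512Mi"),
}


def to_mib(quantity: str) -> int:
    """Convert a K8s memory quantity with a Mi or Gi suffix to MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity}")


total_requests = sum(to_mib(req) for req, _ in BUDGET.values())
total_limits = sum(to_mib(lim) for _, lim in BUDGET.values())

print(f"requests: {total_requests} MiB ({total_requests / 1024:.2f} GiB)")
print(f"limits:   {total_limits} MiB ({total_limits / 1024:.2f} GiB)")
```

For the services listed, requests come to 3.25GiB and limits to 5.5GiB. If every service hit its limit at once, roughly half a gigabyte would be left for the OS and Kubernetes, which is why every megabyte matters here.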

The Model Selection Journey

I evaluated 15+ open-source models across four tiers. The question: what's the best LLM you can run in ~2.5GB?

The top 5 candidates that passed the filter:

Model	Size	Context	Key Strength
Qwen3:4b	2.5GB	256K	Rivals 72B performance
Phi-3-mini	2.2GB	128K	Strong reasoning (Microsoft)
Llama 3.2-3B	2.0GB	128K	Instruction following
Qwen2.5-3B	1.9GB	32K	Good coding/math
Gemma3-4B	3.3GB	128K	Multimodal (images)

The research was clear: Qwen3:4b was the winner. 2.5GB, 256K context window, benchmarks rivaling 72B models. A no-brainer.

Reality Disagreed

I deployed Qwen3:4b to the ByteNUC. It fit in memory. It loaded fine. Then I asked it a question.

Problem 1: Painfully slow. On a CPU-only Intel NUC with 6GB RAM, Qwen3:4b took 15-20+ seconds for simple responses. For a household chat assistant, that's unusable. You ask "how much did I spend on groceries?" and wait long enough to check the bank app yourself.

Problem 2: Responses drifting into Chinese. Qwen3 is a multilingual model supporting 100+ languages. On constrained hardware with limited context, it would occasionally switch to Chinese mid-response. Great for a multilingual product. Confusing for a household assistant that only needs English.

The benchmarks didn't lie — Qwen3:4b is a remarkable model. But benchmarks run on A100 GPUs with 80GB VRAM, not on a $200 mini PC with shared system RAM and no GPU.

The Downgrade

I switched to phi3:mini (Microsoft's Phi-3 Mini, 3.8B parameters, 2.2GB):

Metric	Qwen3:4b	phi3:mini
Size	2.5GB	2.2GB
Context	256K	128K
Speed on ByteNUC	15-20s+	3-7s
Language stability	Occasional Chinese bleed	Stable English
Benchmark ranking	Higher	Lower
Actually usable?	No	Yes

phi3:mini is less capable on paper. Smaller context window. Lower benchmark scores. But it responds in seconds, stays in English, and gives coherent answers about household data. That's what matters.

Lesson: benchmarks ≠ real-world performance on constrained hardware. Test on your actual target device, not on specs.
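One way to test on the target device is to time the same prompt against each candidate through Ollama's HTTP API. The sketch below is not the script used for the numbers above. It assumes Ollama is listening on its default local port, and the function names `summarize_timing` and `time_prompt` are made up for illustration. The timing fields it reads (`total_duration`, `load_duration`, `eval_count`, `eval_duration`) are reported by Ollama in nanoseconds for non-streaming requests.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def summarize_timing(reply: dict) -> dict:
    """Turn Ollama's nanosecond timing fields into seconds and tokens/sec."""
    ns = 1e9
    eval_seconds = reply["eval_duration"] / ns
    return {
        "total_s": reply["total_duration"] / ns,
        "load_s": reply.get("load_duration", 0) / ns,
        "tokens": reply["eval_count"],
        "tokens_per_s": reply["eval_count"] / eval_seconds if eval_seconds else 0.0,
    }


def time_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming prompt and return its timing summary."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return summarize_timing(json.load(response))
```

Running `time_prompt("qwen3:4b", "How much did I spend on groceries?")` and the same call with `"phi3:mini"` on the device itself gives a like-for-like comparison. It is worth running each model more than once, since the first call also includes the time to load the model into memory.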

A Note on Choosing Models for Constrained Hardware

A few things I wish I'd prioritized earlier in the evaluation:

Quantized models matter. On hardware like this, quantized variants (Q4_0, Q4_K_M) are far more practical than full-precision weights. They reduce memory footprint and improve inference speed with minimal quality loss. All the Ollama models above are already quantized — that's how a 4B parameter model fits in 2.5GB.
Match the model to the task, not the leaderboard. A multimodal model that understands images is wasted when your use case is explaining bank transactions. A model with a 256K context window is overkill when your prompts are 200 tokens. Pick for what you actually need.
CPU inference is slow — plan for it. Even with phi3:mini, responses take around 30 seconds end-to-end on the ByteNUC. That's liveable for a household assistant you check a few times a day, but it rules out anything conversational or real-time. If you need snappy responses, you need a GPU or a smaller model.
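The arithmetic behind the quantization point can be sketched in a few lines. This is a back-of-the-envelope estimate of the weights alone, assuming a flat number of bits per weight. The function name `weights_size_gb` is illustrative, and real quantized formats differ in detail.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"4B params at {label}: ~{weights_size_gb(4, bits):.1f} GB")
```

At a flat 4 bits per weight, a 4B model needs about 2GB, against about 8GB at 16 bits. Real 4-bit formats store some extra scaling data alongside the weights, which is consistent with the 2.5GB listed for Qwen3:4b. The KV cache and runtime overhead come on top of that.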

Why the LLM Barely Matters

Here's the counterintuitive insight that makes a smaller model viable: the LLM is just the natural language interface. Neo4j does the heavy lifting.

The architecture follows a "tools first, LLM explains" pattern:

Intent classification is deterministic keyword matching — no LLM needed
Data retrieval uses specialized tools that run Cypher queries against Neo4j
The LLM only turns the structured results into a natural-language answer

The LLM never calculates, never queries, never decides what data to fetch. It receives pre-fetched results and writes a human-readable response. For that job, phi3:mini is more than sufficient.

From the actual code (aapla-hushar/src/agents/graph.py):

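# Import path assumed; the excerpt in the original post does not show its imports.
from langchain_core.messages import HumanMessage, SystemMessage

# Deterministic routing: plain keyword matching, no LLM involved.
# The keyword list is truncated in the original excerpt.
finance_keywords = ["spend", "expense", "budget", "transaction", ...]
if any(k in last_message for k in finance_keywords):
    intent = "finance"

# Tools fetch the data first, then the LLM explains it.
data = get_spending_summary()  # Cypher query to Neo4j
response = llm.invoke([
    # The system prompt fixes the role; the human message carries only the fetched data.
    SystemMessage(content="You are a financial assistant."),
    HumanMessage(content=f"ONLY interpret this data:\n{data}")
])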
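The excerpt above depends on names defined elsewhere in the file. For readers who want to run the pattern in isolation, here is a self-contained sketch under stated assumptions: the keyword tuple, the stubbed `get_spending_summary` with made-up values, the `answer` function, and the fake `echo_llm` are all hypothetical stand-ins, not the project's real code.

```python
from typing import Callable

# Hypothetical keyword list; the real one lives in graph.py and is longer.
FINANCE_KEYWORDS = ("spend", "expense", "budget", "transaction")


def classify_intent(message: str) -> str:
    """Deterministic routing: keyword match only, no model call."""
    text = message.lower()
    if any(keyword in text for keyword in FINANCE_KEYWORDS):
        return "finance"
    return "unknown"


def get_spending_summary() -> dict:
    """Stand-in for the tool that would run a Cypher query against Neo4j.

    The values are made up for illustration.
    """
    return {"category": "groceries", "total": 120.50, "month": "January"}


def answer(message: str, llm: Callable[[str], str]) -> str:
    """Fetch data with a tool first, then ask the model only to explain it."""
    if classify_intent(message) != "finance":
        return "This sketch only handles finance questions."
    data = get_spending_summary()
    return llm(f"ONLY interpret this data:\n{data}")


def echo_llm(prompt: str) -> str:
    """Fake model that echoes its prompt, so the flow can be inspected offline."""
    return f"[model saw] {prompt}"


print(answer("How much did I spend on groceries?", echo_llm))
```

The point of the design is visible in `answer`: the model never sees the user's question as something to compute, only a block of already-fetched data to put into words.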

The Ollama sidecar runs alongside FastAPI in the same K8s pod, so the Python service talks to the LLM over localhost with no network hop in between.

Future: Better LLM Requires Better Hardware

The current setup works for structured queries against Neo4j. But for actual data analysis — "what trends do you see in my spending?" or "suggest ways to save money" — the model needs to reason over larger contexts with more capability.

That upgrade means upgrading the ByteNUC's RAM first. A 7B or 8B model needs 8-16GB total system memory. Until then, phi3:mini handles the structured query-and-explain pattern well.

What's Next

The model runs. The queries work. But is "runs locally" enough to keep your data safe?

In Part 3, I look at what happened to OpenClaw — another local-first AI assistant — and why the real security lesson isn't about where your AI runs, but what it's allowed to do.

This is Part 2 of a 3-part series. Use the series navigation above to read Part 1 (Architecture & Neo4j) and Part 3 (What OpenClaw Teaches Us).
