Koustubh
Running an LLM on 6GB RAM — Model Selection for Edge AI

The hardest part of building a self-hosted AI isn't the architecture — it's choosing the right LLM for your actual use case and your actual hardware. A multimodal model that understands images is pointless when your task is explaining bank transactions. A model that tops benchmarks is useless if it takes 20 seconds to respond on your CPU. Here's what happened when I learned both lessons.

The Hardware

The entire gharasathi system runs on a ByteNUC mini PC:

| Spec | Value |
| --- | --- |
| CPU | Intel (no discrete GPU) |
| RAM | 6GB total |
| Disk | 2TB SSD |
| OS | Talos Linux (immutable, K8s-native) |
| Cost | ~$200 |

6GB RAM. That's it. And it needs to run the OS, Kubernetes, Neo4j, four microservices, and an LLM.

The Memory Budget

Every megabyte matters. Here's how the 6GB is carved up, based on actual K8s resource limits from our deployment manifests:


| Service | K8s requests | K8s limits | Notes |
| --- | --- | --- | --- |
| Ollama sidecar | 2Gi | 3Gi | LLM inference engine |
| FastAPI (aapla-hushar) | 512Mi | 1Gi | Python service |
| Neo4j | 512Mi | 1Gi | Graph database |
| aapla-dhan | 256Mi | 512Mi | Go finance service |

The LLM service gets ~3GB total (Ollama + FastAPI together in one K8s pod via sidecar pattern). After the Python runtime, FastAPI, and LangChain dependencies eat ~500MB, the actual model has roughly 2.5GB to work with.
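The sidecar layout from the table above can be sketched as a pod spec roughly like this (container names, image tags, and the exact structure are illustrative, not the project's actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aapla-hushar          # hypothetical pod name
spec:
  containers:
    - name: fastapi            # Python service: FastAPI + LangChain
      image: aapla-hushar:latest
      resources:
        requests: { memory: 512Mi }
        limits:   { memory: 1Gi }
    - name: ollama             # sidecar: shares localhost with FastAPI
      image: ollama/ollama:latest
      resources:
        requests: { memory: 2Gi }
        limits:   { memory: 3Gi }
```

Because both containers live in one pod, the Python service reaches Ollama over localhost, and the kubelet enforces the memory budget per container.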

The Model Selection Journey

I evaluated 15+ open-source models across four tiers. The question: what's the best LLM you can run in ~2.5GB?


The top 5 candidates that passed the filter:

| Model | Size | Context | Key Strength |
| --- | --- | --- | --- |
| Qwen3:4b | 2.5GB | 256K | Rivals 72B performance |
| Phi-3-mini | 2.2GB | 128K | Strong reasoning (Microsoft) |
| Llama 3.2-3B | 2.0GB | 128K | Instruction following |
| Qwen2.5-3B | 1.9GB | 32K | Good coding/math |
| Gemma3-4B | 3.3GB | 128K | Multimodal (images) |

The research was clear: Qwen3:4b was the winner. 2.5GB, 256K context window, benchmarks rivaling 72B models. A no-brainer.

Reality Disagreed

I deployed Qwen3:4b to the ByteNUC. It fit in memory. It loaded fine. Then I asked it a question.

Problem 1: Painfully slow. On a CPU-only Intel NUC with 6GB RAM, Qwen3:4b took 15-20+ seconds for simple responses. For a household chat assistant, that's unusable. You ask "how much did I spend on groceries?" and wait long enough to check the bank app yourself.
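The latency gap is easy to reason about with back-of-the-envelope math. The throughput figures below are rough assumptions for illustration, not measurements from this post:

```python
# Back-of-the-envelope latency model for CPU-only inference.
def response_time(tokens_out: int, tok_per_sec: float, first_token_s: float = 1.0) -> float:
    """Seconds for a full answer: time-to-first-token plus generation time."""
    return first_token_s + tokens_out / tok_per_sec

# A ~100-token answer at ~5 tok/s (plausible for a 4B model on a NUC CPU):
qwen_latency = response_time(100, 5.0)    # 21.0 seconds, in the observed 15-20s+ range
# The same answer at ~25 tok/s (plausible for a smaller, faster model):
phi_latency = response_time(100, 25.0)    # 5.0 seconds, in the observed 3-7s range
```

A few tokens per second either way makes the difference between "usable" and "check the bank app yourself."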

Problem 2: Hallucinations in Chinese. Qwen3 is a multilingual model supporting 100+ languages. On constrained hardware with limited context, the model would occasionally bleed into Chinese mid-response. Great for a multilingual product. Confusing for a household assistant that only needs English.

The benchmarks didn't lie — Qwen3:4b is a remarkable model. But benchmarks run on A100 GPUs with 80GB VRAM, not on a $200 mini PC with shared system RAM and no GPU.

The Downgrade

I switched to phi3:mini (Microsoft's Phi-3 Mini, 3.8B parameters, 2.2GB):

| | Qwen3:4b | phi3:mini |
| --- | --- | --- |
| Size | 2.5GB | 2.2GB |
| Context | 256K | 128K |
| Speed on ByteNUC | 15-20s+ | 3-7s |
| Language stability | Occasional Chinese bleed | Stable English |
| Benchmark ranking | Higher | Lower |
| Actually usable? | No | Yes |

phi3:mini is less capable on paper. Smaller context window. Lower benchmark scores. But it responds in seconds, stays in English, and gives coherent answers about household data. That's what matters.

Lesson: benchmarks ≠ real-world performance on constrained hardware. Test on your actual target device, not on specs.

A Note on Choosing Models for Constrained Hardware

A few things I wish I'd prioritized earlier in the evaluation:

  • Quantized models matter. On hardware like this, quantized variants (Q4_0, Q4_K_M) are far more practical than full-precision weights. They reduce memory footprint and improve inference speed with minimal quality loss. All the Ollama models above are already quantized — that's how a 4B parameter model fits in 2.5GB.
  • Match the model to the task, not the leaderboard. A multimodal model that understands images is wasted when your use case is explaining bank transactions. A model with a 256K context window is overkill when your prompts are 200 tokens. Pick for what you actually need.
  • CPU inference is slow — plan for it. Even with phi3:mini, generation takes 3-7 seconds and a full end-to-end response can approach 30 seconds on the ByteNUC. That's liveable for a household assistant you check a few times a day, but it rules out anything conversational or real-time. If you need snappy responses, you need a GPU or a smaller model.
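The quantization point is worth quantifying. A minimal estimator, assuming ~4.5 bits per weight for Q4_K_M-style quantization and a small fixed runtime overhead (both figures are rough assumptions, not from this post):

```python
def model_memory_gb(params_billions: float, bits_per_weight: float,
                    overhead_gb: float = 0.3) -> float:
    """Approximate RAM for quantized weights plus runtime overhead.

    KV cache grows with context length and is ignored here, so treat
    this as a lower bound.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# phi3:mini (3.8B params) at ~4.5 bits/weight vs full FP16:
quantized = model_memory_gb(3.8, 4.5)   # ~2.4 GB, close to the 2.2GB in the table
fp16      = model_memory_gb(3.8, 16)    # ~7.9 GB, far past the 6GB machine
```

That 3x gap is why quantization is non-negotiable on this class of hardware: the full-precision model wouldn't fit even with nothing else running.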

Why the LLM Barely Matters

Here's the counterintuitive insight that makes a smaller model viable: the LLM is just the natural language interface. Neo4j does the heavy lifting.


The architecture follows a "tools first, LLM explains" pattern:

  1. Intent classification is deterministic keyword matching — no LLM needed
  2. Data retrieval uses specialized tools that run Cypher queries against Neo4j
  3. The LLM only presents the pre-fetched, structured results in natural language

The LLM never calculates, never queries, never decides what data to fetch. It receives pre-fetched results and writes a human-readable response. For that job, phi3:mini is more than sufficient.

From the actual code (aapla-hushar/src/agents/graph.py):

```python
# Deterministic routing — no LLM involved
finance_keywords = ["spend", "expense", "budget", "transaction", ...]
if any(k in last_message for k in finance_keywords):
    intent = "finance"

# Tools fetch data first, then LLM explains
data = get_spending_summary()  # Cypher query to Neo4j
response = llm.invoke([
    SystemMessage(content="You are a financial assistant."),
    HumanMessage(content=f"ONLY interpret this data:\n{data}")
])
```

The Ollama sidecar runs alongside FastAPI in the same K8s pod — localhost communication, zero network latency between the Python service and the LLM.
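That localhost hop can be sketched with Ollama's HTTP API and only the standard library. The prompt wording and the `explain` helper are illustrative, not the project's actual code; `/api/generate` is Ollama's standard non-chat endpoint:

```python
import json
import urllib.request

# Sidecar pattern: Ollama listens on its default port inside the same pod.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, data: str) -> dict:
    """Wrap pre-fetched Neo4j results in an 'explain only' prompt."""
    return {
        "model": model,
        "prompt": f"ONLY interpret this data:\n{data}",
        "stream": False,  # one JSON response instead of a token stream
    }

def explain(data: str, model: str = "phi3:mini") -> str:
    """Send structured results to the sidecar and return the prose answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, data)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Since the call never leaves the pod, there is no TLS, no service discovery, and no network failure mode between the Python service and the model.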

Future: Better LLM Requires Better Hardware

The current setup works for structured queries against Neo4j. But for actual data analysis — "what trends do you see in my spending?" or "suggest ways to save money" — the model needs to reason over larger contexts with more capability.

That means upgrading the ByteNUC's RAM first: a 7B or 8B model needs 8-16GB of total system memory. Until then, phi3:mini handles the structured query-and-explain pattern well.

What's Next

The model runs. The queries work. But is "runs locally" enough to keep your data safe?

In Part 3, I look at what happened to OpenClaw — another local-first AI assistant — and why the real security lesson isn't about where your AI runs, but what it's allowed to do.


This is Part 2 of a 3-part series. Use the series navigation above to read Part 1 (Architecture & Neo4j) and Part 3 (What OpenClaw Teaches Us).
