DEV Community

Julien L for WiScale


Run your AI assistant fully offline: a local-first architecture

What if your AI assistant worked on an airplane? In a hospital? On a classified network?

Most AI stacks fall apart without internet. They depend on OpenAI for inference, Pinecone for vectors, and half a dozen cloud APIs for everything in between. Kill the connection, kill the assistant.

This article builds a complete AI assistant that works offline. Not "mostly offline." Fully offline. After initial setup, you can unplug the ethernet cable and everything still runs.

The cloud dependency problem

Here is a typical AI assistant stack:

User query
   → OpenAI API (inference)          ← needs internet
   → Pinecone/Weaviate (vectors)     ← needs internet
   → Redis (session state)           ← needs server
   → PostgreSQL (structured data)    ← needs server

Four network dependencies. Four points of failure. Four things that do not work on a plane, in a hospital server room, or inside a SCIF.

The local-first stack

Here is the same assistant, rebuilt to run entirely on your machine:

User query
   → Ollama (local LLM)                  ← runs on your CPU/GPU
   → sentence-transformers (embeddings)   ← runs on your CPU
   → VelesDB (vectors + memory)          ← a file on disk

Three components. Zero network calls. Everything fits on a laptop.

| Component | Cloud version | Local version | Size |
|-----------|---------------|---------------|------|
| LLM | OpenAI GPT-4 | Ollama + Llama 3 | ~4GB model |
| Embeddings | OpenAI ada-002 | all-MiniLM-L6-v2 | ~80MB model |
| Vector DB | Pinecone | VelesDB | ~3MB binary |
| Memory | Redis + Postgres | VelesDB Agent Memory | included |

Setup (the only part that needs internet)

Download everything once. Then go dark.

# 1. Install Ollama (local LLM runtime)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3

# 2. Install Python dependencies
pip install velesdb sentence-transformers requests

# 3. Download the embedding model (first run caches it locally)
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

After this, disconnect. Everything below runs without internet.
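Before you pull the plug, it is worth confirming the local runtime actually answers. A small standard-library helper (a sketch; it pings Ollama's `/api/tags` endpoint, which lists installed models):

```python
import urllib.error
import urllib.request

def ollama_available(url: str = "http://localhost:11434/api/tags",
                     timeout: float = 2.0) -> bool:
    """Return True if a local Ollama server answers on the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama up:", ollama_available())
```

If this prints `False`, start the server with `ollama serve` before disconnecting.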

Step 1: Local embeddings

The embedding model runs entirely on your CPU. No API key. No network call.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Test it
vector = embed("VelesDB runs fully offline")
print(f"Dimension: {len(vector)}")  # 384

Step 2: Store documents in VelesDB

VelesDB is a source-available vector database that runs as a library. No server process. No Docker. Just a folder on disk.

import velesdb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Open a local database (creates a folder on disk)
db = velesdb.Database("./offline_assistant")
collection = db.get_or_create_collection("knowledge", dimension=384)

# Your knowledge base
documents = [
    "Patient records must be stored on-premises per HIPAA regulations.",
    "Air-gapped networks prohibit any outbound internet connections.",
    "Edge devices in manufacturing often lack reliable connectivity.",
    "Privacy-conscious users prefer local processing over cloud APIs.",
    "Retrieval-augmented generation grounds LLM answers in real data.",
    "Vector similarity search finds semantically related documents.",
    "Local-first architectures work offline and sync when connected.",
    "Ollama runs open-weight LLMs locally on consumer hardware.",
]

# Embed and store
for i, doc in enumerate(documents):
    collection.upsert(i, vector=embed(doc), payload={"text": doc})

print(f"Stored {len(documents)} documents")

Step 3: RAG retrieval (offline)

Search your knowledge base by meaning, not keywords:

def retrieve(query: str, top_k: int = 3) -> list[str]:
    results = collection.search(vector=embed(query), top_k=top_k)
    return [r["payload"]["text"] for r in results]

# Test retrieval
context = retrieve("Why would someone need offline AI?")
for doc in context:
    print(f"  - {doc}")

Output:

  - Air-gapped networks prohibit any outbound internet connections.
  - Privacy-conscious users prefer local processing over cloud APIs.
  - Edge devices in manufacturing often lack reliable connectivity.

No internet was used. The embedding model and the vector search both run locally.
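For intuition, searches like this are typically ranked by cosine similarity between embedding vectors. A minimal standalone sketch of the math (not VelesDB's actual internals):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the embedding model maps similar meanings to nearby directions, documents about air-gapped networks score high against a query about offline AI even though they share no keywords.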

Step 4: Agent memory (persistent context)

RAG retrieves documents, but your assistant also needs to remember conversations and learned procedures. VelesDB's Agent Memory SDK handles this with three subsystems:

import time

memory = db.agent_memory(384)

# Semantic memory: facts the agent knows
memory.semantic.store(1, "User prefers concise answers", embed("User prefers concise answers"))
memory.semantic.store(2, "User works in healthcare IT", embed("User works in healthcare IT"))

# Episodic memory: events that happened
memory.episodic.record(1, "User asked about HIPAA compliance", int(time.time()))

# Procedural memory: learned workflows
memory.procedural.learn(
    1,
    "answer_with_context",
    ["retrieve relevant docs", "check episodic memory for prior questions", "generate answer with Ollama"],
    confidence=0.9
)

Now the assistant can recall facts, events, and procedures:

# What does the agent know about the user?
facts = memory.semantic.query(embed("What does the user need?"), top_k=2)
for f in facts:
    print(f"  Fact: {f['content']} (score: {f['score']:.2f})")

# What happened recently?
events = memory.episodic.recent(5)
for e in events:
    print(f"  Event: {e['description']}")

# What procedure should the agent follow?
procedures = memory.procedural.recall(embed("how to answer a question"), top_k=1)
for p in procedures:
    print(f"  Procedure: {p['name']} -> {p['steps']}")

Step 5: Connect to a local LLM

Ollama exposes a local HTTP API on localhost:11434. No internet required.

import requests
import json

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_llm(question: str, context: list[str], facts: list[str]) -> str:
    context_block = "\n".join(f"- {c}" for c in context)
    facts_block = "\n".join(f"- {f}" for f in facts)

    prompt = f"""You are a helpful assistant. Answer the question using ONLY the provided context and facts.

Context (retrieved documents):
{context_block}

Known facts about the user:
{facts_block}

Question: {question}

Answer:"""

    response = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False
    })

    return response.json()["response"]
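If you would rather drop the requests dependency, the same call works with the standard library. This variant (a sketch under those assumptions) also adds an explicit timeout so a stopped Ollama server fails loudly instead of hanging:

```python
import json
import urllib.error
import urllib.request

def generate(prompt: str, model: str = "llama3",
             url: str = "http://localhost:11434/api/generate",
             timeout: float = 120.0) -> str:
    """Call Ollama's generate endpoint using only the standard library."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except urllib.error.URLError as exc:
        raise RuntimeError(f"Ollama not reachable at {url}: {exc}") from exc
```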

The full pipeline

Putting it all together:

def offline_assistant(question: str) -> str:
    # 1. Retrieve relevant documents
    context = retrieve(question, top_k=3)

    # 2. Recall user-specific facts from memory
    facts_results = memory.semantic.query(embed(question), top_k=2)
    facts = [f["content"] for f in facts_results]

    # 3. Pull recent interactions (available for richer prompting later)
    events = memory.episodic.recent(5)

    # 4. Ask the local LLM
    answer = ask_local_llm(question, context, facts)

    # 5. Record this interaction in episodic memory
    #    (the timestamp doubles as a unique event id)
    memory.episodic.record(
        int(time.time()),
        f"Answered: {question[:100]}",
        int(time.time())
    )

    return answer

# Use it
answer = offline_assistant("What are the requirements for handling patient data?")
print(answer)

This entire pipeline runs without a single network call. The embedding model, the vector database, the memory system, and the LLM all execute locally.
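You can even enforce that claim at runtime. A hypothetical guard that monkeypatches `socket.connect` so any accidental non-loopback connection fails loudly:

```python
import socket

_real_connect = socket.socket.connect

def _guarded_connect(self, address):
    # AF_INET/AF_INET6 addresses are tuples; anything non-loopback is a bug.
    # String addresses (Unix domain sockets) are local by definition.
    if isinstance(address, tuple) and address[0] not in ("127.0.0.1", "localhost", "::1"):
        raise RuntimeError(f"Blocked non-local connection to {address[0]!r}")
    return _real_connect(self, address)

socket.socket.connect = _guarded_connect
```

Run your assistant with this guard installed: Ollama on `localhost:11434` still works, but any library that quietly phones home raises immediately.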

Cloud vs. local: a practical comparison

| | Cloud stack | Local stack |
|---|-------------|-------------|
| Works offline | No | Yes |
| Latency | 200-500ms per API call | 10-50ms (embeddings), 1-10s (LLM) |
| Cost | Pay per token/query | Free after download |
| Privacy | Data leaves your machine | Data never leaves your machine |
| Setup | API keys, accounts, billing | One-time download |
| Model quality | GPT-4 level | Good (Llama 3 8B) to great (70B) |
| Scalability | Unlimited | Limited by hardware |

The tradeoff is clear: cloud gives you better models and infinite scale. Local gives you privacy, offline capability, and zero ongoing cost. For many use cases, local is not just "good enough." It is the only option.

Where this matters

Healthcare: regulations like HIPAA tightly restrict how patient data may be shared, and many organizations respond by keeping it on-premises. A local AI assistant can help clinicians without sending PHI to third-party servers.

Defense and government: Air-gapped networks exist for a reason. An offline AI assistant works behind any security perimeter.

Travel and field work: No Wi-Fi on the plane, no signal in the field. Your assistant still works.

Privacy-conscious users: Some people simply do not want their questions going to a third-party server. Fair enough.

Edge computing: IoT gateways, manufacturing floors, remote sensors. Connectivity is unreliable. Local AI is the only reliable option.

Getting started

Everything you need:

  1. Ollama for local LLM inference
  2. sentence-transformers for local embeddings
  3. VelesDB for local vector storage and agent memory (~3MB binary, source-available under Elastic License 2.0)

The complete code from this article works as a starting point. Swap in your own documents, adjust the retrieval parameters, and you have a private AI assistant that never phones home.
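When you swap in real documents, you will usually want to chunk them before embedding, since embedding models work best on short passages. A minimal word-window chunker (a hypothetical helper; tune the sizes to your corpus):

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows for embedding.

    Overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks. Requires max_words > overlap.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk then gets its own `upsert` call, with a payload field pointing back to the source file so answers can cite where they came from.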


What is the most interesting offline use case you can think of? I would love to hear about environments where cloud AI simply is not an option. Drop a comment below.
