DEV Community

Julien L for WiScale


Run your AI assistant fully offline: a local-first architecture

What if your AI assistant worked on an airplane? In a hospital? On a classified network?

Most AI stacks fall apart without internet. They depend on OpenAI for inference, Pinecone for vectors, and half a dozen cloud APIs for everything in between. Kill the connection, kill the assistant.

This article builds a complete AI assistant that works offline. Not "mostly offline." Fully offline. After initial setup, you can unplug the ethernet cable and everything still runs.

The cloud dependency problem

Here is a typical AI assistant stack:

User query
   → OpenAI API (inference)          ← needs internet
   → Pinecone/Weaviate (vectors)     ← needs internet
   → Redis (session state)           ← needs server
   → PostgreSQL (structured data)    ← needs server

Four network dependencies. Four points of failure. Four things that do not work on a plane, in a hospital server room, or inside a SCIF.

The local-first stack

Here is the same assistant, rebuilt to run entirely on your machine:

User query
   → Ollama (local LLM)                  ← runs on your CPU/GPU
   → sentence-transformers (embeddings)   ← runs on your CPU
   → VelesDB (vectors + memory)          ← a file on disk

Three components. Zero network calls. Everything fits on a laptop.

| Component | Cloud version | Local version | Size |
|-----------|---------------|---------------|------|
| LLM | OpenAI GPT-4 | Ollama + Llama 3 | ~4GB model |
| Embeddings | OpenAI ada-002 | all-MiniLM-L6-v2 | ~80MB model |
| Vector DB | Pinecone | VelesDB | ~3MB binary |
| Memory | Redis + Postgres | VelesDB Agent Memory | included |

Setup (the only part that needs internet)

Download everything once. Then go dark.

# 1. Install Ollama (local LLM runtime)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3

# 2. Install Python dependencies
pip install velesdb sentence-transformers requests

# 3. Download the embedding model (first run caches it locally)
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

After this, disconnect. Everything below runs without internet.
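Before you pull the plug, it is worth confirming the local runtime actually answers. A small standard-library helper (a sketch; it pings Ollama's `/api/tags` endpoint, which lists installed models):

```python
import urllib.error
import urllib.request

def ollama_available(url: str = "http://localhost:11434/api/tags",
                     timeout: float = 2.0) -> bool:
    """Return True if a local Ollama server answers on the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama up:", ollama_available())
```

If this prints `False`, start the server with `ollama serve` before disconnecting.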

Step 1: Local embeddings

The embedding model runs entirely on your CPU. No API key. No network call.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Test it
vector = embed("VelesDB runs fully offline")
print(f"Dimension: {len(vector)}")  # 384

Step 2: Store documents in VelesDB

VelesDB is a source-available vector database that runs as a library. No server process. No Docker. Just a folder on disk.

import velesdb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Open a local database (creates a folder on disk)
db = velesdb.Database("./offline_assistant")
collection = db.get_or_create_collection("knowledge", dimension=384)

# Your knowledge base
documents = [
    "Patient records must be stored on-premises per HIPAA regulations.",
    "Air-gapped networks prohibit any outbound internet connections.",
    "Edge devices in manufacturing often lack reliable connectivity.",
    "Privacy-conscious users prefer local processing over cloud APIs.",
    "Retrieval-augmented generation grounds LLM answers in real data.",
    "Vector similarity search finds semantically related documents.",
    "Local-first architectures work offline and sync when connected.",
    "Ollama runs open-weight LLMs locally on consumer hardware.",
]

# Embed and store
for i, doc in enumerate(documents):
    collection.upsert(i, vector=embed(doc), payload={"text": doc})

print(f"Stored {len(documents)} documents")

Step 3: RAG retrieval (offline)

Search your knowledge base by meaning, not keywords:

def retrieve(query: str, top_k: int = 3) -> list[str]:
    results = collection.search(vector=embed(query), top_k=top_k)
    return [r["payload"]["text"] for r in results]

# Test retrieval
context = retrieve("Why would someone need offline AI?")
for doc in context:
    print(f"  - {doc}")

Output:

  - Air-gapped networks prohibit any outbound internet connections.
  - Privacy-conscious users prefer local processing over cloud APIs.
  - Edge devices in manufacturing often lack reliable connectivity.

No internet was used. The embedding model and the vector search both run locally.
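For intuition, searches like this are typically ranked by cosine similarity between embedding vectors. A minimal standalone sketch of the math (not VelesDB's actual internals):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the embedding model maps similar meanings to nearby directions, documents about air-gapped networks score high against a query about offline AI even though they share no keywords.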

Step 4: Agent memory (persistent context)

RAG retrieves documents, but your assistant also needs to remember conversations and learned procedures. VelesDB's Agent Memory SDK handles this with three subsystems:

import time

memory = db.agent_memory(384)

# Semantic memory: facts the agent knows
memory.semantic.store(1, "User prefers concise answers", embed("User prefers concise answers"))
memory.semantic.store(2, "User works in healthcare IT", embed("User works in healthcare IT"))

# Episodic memory: events that happened
memory.episodic.record(1, "User asked about HIPAA compliance", int(time.time()))

# Procedural memory: learned workflows
memory.procedural.learn(
    1,
    "answer_with_context",
    ["retrieve relevant docs", "check episodic memory for prior questions", "generate answer with Ollama"],
    confidence=0.9
)

Now the assistant can recall facts, events, and procedures:

# What does the agent know about the user?
facts = memory.semantic.query(embed("What does the user need?"), top_k=2)
for f in facts:
    print(f"  Fact: {f['content']} (score: {f['score']:.2f})")

# What happened recently?
events = memory.episodic.recent(5)
for e in events:
    print(f"  Event: {e['description']}")

# What procedure should the agent follow?
procedures = memory.procedural.recall(embed("how to answer a question"), top_k=1)
for p in procedures:
    print(f"  Procedure: {p['name']} -> {p['steps']}")

Step 5: Connect to a local LLM

Ollama exposes a local HTTP API on localhost:11434. No internet required.

import requests
import json

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_llm(question: str, context: list[str], facts: list[str]) -> str:
    context_block = "\n".join(f"- {c}" for c in context)
    facts_block = "\n".join(f"- {f}" for f in facts)

    prompt = f"""You are a helpful assistant. Answer the question using ONLY the provided context and facts.

Context (retrieved documents):
{context_block}

Known facts about the user:
{facts_block}

Question: {question}

Answer:"""

    response = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False
    })

    return response.json()["response"]
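If you would rather drop the requests dependency, the same call works with the standard library. This variant (a sketch under those assumptions) also adds an explicit timeout so a stopped Ollama server fails loudly instead of hanging:

```python
import json
import urllib.error
import urllib.request

def generate(prompt: str, model: str = "llama3",
             url: str = "http://localhost:11434/api/generate",
             timeout: float = 120.0) -> str:
    """Call Ollama's generate endpoint using only the standard library."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except urllib.error.URLError as exc:
        raise RuntimeError(f"Ollama not reachable at {url}: {exc}") from exc
```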

The full pipeline

Putting it all together:

def offline_assistant(question: str) -> str:
    # 1. Retrieve relevant documents
    context = retrieve(question, top_k=3)

    # 2. Recall user-specific facts from memory
    facts_results = memory.semantic.query(embed(question), top_k=2)
    facts = [f["content"] for f in facts_results]

    # 3. Pull recent interactions (available for richer prompting later)
    events = memory.episodic.recent(5)

    # 4. Ask the local LLM
    answer = ask_local_llm(question, context, facts)

    # 5. Record this interaction in episodic memory
    #    (the timestamp doubles as a unique event id)
    memory.episodic.record(
        int(time.time()),
        f"Answered: {question[:100]}",
        int(time.time())
    )

    return answer

# Use it
answer = offline_assistant("What are the requirements for handling patient data?")
print(answer)

This entire pipeline runs without a single network call. The embedding model, the vector database, the memory system, and the LLM all execute locally.
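You can even enforce that claim at runtime. A hypothetical guard that monkeypatches `socket.connect` so any accidental non-loopback connection fails loudly:

```python
import socket

_real_connect = socket.socket.connect

def _guarded_connect(self, address):
    # AF_INET/AF_INET6 addresses are tuples; anything non-loopback is a bug.
    # String addresses (Unix domain sockets) are local by definition.
    if isinstance(address, tuple) and address[0] not in ("127.0.0.1", "localhost", "::1"):
        raise RuntimeError(f"Blocked non-local connection to {address[0]!r}")
    return _real_connect(self, address)

socket.socket.connect = _guarded_connect
```

Run your assistant with this guard installed: Ollama on `localhost:11434` still works, but any library that quietly phones home raises immediately.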

Cloud vs. local: a practical comparison

| | Cloud stack | Local stack |
|---|-------------|-------------|
| Works offline | No | Yes |
| Latency | 200-500ms per API call | 10-50ms (embeddings), 1-10s (LLM) |
| Cost | Pay per token/query | Free after download |
| Privacy | Data leaves your machine | Data never leaves your machine |
| Setup | API keys, accounts, billing | One-time download |
| Model quality | GPT-4 level | Good (Llama 3 8B) to great (70B) |
| Scalability | Unlimited | Limited by hardware |

The tradeoff is clear: cloud gives you better models and infinite scale. Local gives you privacy, offline capability, and zero ongoing cost. For many use cases, local is not just "good enough." It is the only option.

Where this matters

Healthcare: regulations like HIPAA tightly restrict how patient data may be shared, and many organizations respond by keeping it on-premises. A local AI assistant can help clinicians without sending PHI to third-party servers.

Defense and government: Air-gapped networks exist for a reason. An offline AI assistant works behind any security perimeter.

Travel and field work: No Wi-Fi on the plane, no signal in the field. Your assistant still works.

Privacy-conscious users: Some people simply do not want their questions going to a third-party server. Fair enough.

Edge computing: IoT gateways, manufacturing floors, remote sensors. Connectivity is unreliable. Local AI is the only reliable option.

Getting started

Everything you need:

  1. Ollama for local LLM inference
  2. sentence-transformers for local embeddings
  3. VelesDB for local vector storage and agent memory (~3MB binary, source-available under Elastic License 2.0)

The complete code from this article works as a starting point. Swap in your own documents, adjust the retrieval parameters, and you have a private AI assistant that never phones home.
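When you swap in real documents, you will usually want to chunk them before embedding, since embedding models work best on short passages. A minimal word-window chunker (a hypothetical helper; tune the sizes to your corpus):

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows for embedding.

    Overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks. Requires max_words > overlap.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk then gets its own `upsert` call, with a payload field pointing back to the source file so answers can cite where they came from.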


What is the most interesting offline use case you can think of? I would love to hear about environments where cloud AI simply is not an option. Drop a comment below.
