What if your AI assistant worked on an airplane? In a hospital? On a classified network?
Most AI stacks fall apart without internet. They depend on OpenAI for inference, Pinecone for vectors, and half a dozen cloud APIs for everything in between. Kill the connection, kill the assistant.
This article builds a complete AI assistant that works offline. Not "mostly offline." Fully offline. After initial setup, you can unplug the ethernet cable and everything still runs.
The cloud dependency problem
Here is a typical AI assistant stack:
```
User query
  → OpenAI API (inference)         ← needs internet
  → Pinecone/Weaviate (vectors)    ← needs internet
  → Redis (session state)          ← needs server
  → PostgreSQL (structured data)   ← needs server
```
Four network dependencies. Four points of failure. Four things that do not work on a plane, in a hospital server room, or inside a SCIF.
The local-first stack
Here is the same assistant, rebuilt to run entirely on your machine:
```
User query
  → Ollama (local LLM)                  ← runs on your CPU/GPU
  → sentence-transformers (embeddings)  ← runs on your CPU
  → VelesDB (vectors + memory)          ← a file on disk
```
Three components. Zero network calls. Everything fits on a laptop.
| Component | Cloud version | Local version | Size |
|---|---|---|---|
| LLM | OpenAI GPT-4 | Ollama + Llama 3 | ~4GB model |
| Embeddings | OpenAI ada-002 | all-MiniLM-L6-v2 | ~80MB model |
| Vector DB | Pinecone | VelesDB | ~3MB binary |
| Memory | Redis + Postgres | VelesDB Agent Memory | included |
Setup (the only part that needs internet)
Download everything once. Then go dark.
```bash
# 1. Install Ollama (local LLM runtime)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3

# 2. Install Python dependencies
pip install velesdb sentence-transformers requests

# 3. Download the embedding model (first run caches it locally)
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```
After this, disconnect. Everything below runs without internet.
Step 1: Local embeddings
The embedding model runs entirely on your CPU. No API key. No network call.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Test it
vector = embed("VelesDB runs fully offline")
print(f"Dimension: {len(vector)}")  # 384
```
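Vector search ranks documents by a similarity metric between embeddings, commonly cosine similarity (the exact metric depends on how the database is configured). A minimal sketch in plain Python, no model required, to show what "semantically close" means numerically:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of magnitudes:
    # 1.0 = same direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Two sentences with similar meaning produce vectors with a high cosine score even when they share no words, which is why retrieval "by meaning" works at all.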
Step 2: Store documents in VelesDB
VelesDB is a source-available vector database that runs as a library. No server process. No Docker. Just a folder on disk.
```python
import velesdb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    return model.encode(text).tolist()

# Open a local database (creates a folder on disk)
db = velesdb.Database("./offline_assistant")
collection = db.get_or_create_collection("knowledge", dimension=384)

# Your knowledge base
documents = [
    "Patient records must be stored on-premises per HIPAA regulations.",
    "Air-gapped networks prohibit any outbound internet connections.",
    "Edge devices in manufacturing often lack reliable connectivity.",
    "Privacy-conscious users prefer local processing over cloud APIs.",
    "Retrieval-augmented generation grounds LLM answers in real data.",
    "Vector similarity search finds semantically related documents.",
    "Local-first architectures work offline and sync when connected.",
    "Ollama runs open-weight LLMs locally on consumer hardware.",
]

# Embed and store
for i, doc in enumerate(documents):
    collection.upsert(i, vector=embed(doc), payload={"text": doc})

print(f"Stored {len(documents)} documents")
```
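The toy corpus above is single sentences, but real documents need to be split into chunks before embedding, since embedding models work best on short passages. A minimal word-window chunker as a sketch (`chunk_text` is a hypothetical helper, not part of VelesDB; production pipelines often split on sentences or tokens instead):

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    # Split into overlapping word windows so context isn't cut mid-thought.
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk then gets its own vector and payload in the collection, so retrieval returns the relevant passage rather than a whole document.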
Step 3: RAG retrieval (offline)
Search your knowledge base by meaning, not keywords:
```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    results = collection.search(vector=embed(query), top_k=top_k)
    return [r["payload"]["text"] for r in results]

# Test retrieval
context = retrieve("Why would someone need offline AI?")
for doc in context:
    print(f" - {doc}")
```
Output:

```
 - Air-gapped networks prohibit any outbound internet connections.
 - Privacy-conscious users prefer local processing over cloud APIs.
 - Edge devices in manufacturing often lack reliable connectivity.
```
No internet was used. The embedding model and the vector search both run locally.
Step 4: Agent memory (persistent context)
RAG retrieves documents, but your assistant also needs to remember conversations and learned procedures. VelesDB's Agent Memory SDK handles this with three subsystems:
```python
import time

memory = db.agent_memory(384)

# Semantic memory: facts the agent knows
memory.semantic.store(1, "User prefers concise answers", embed("User prefers concise answers"))
memory.semantic.store(2, "User works in healthcare IT", embed("User works in healthcare IT"))

# Episodic memory: events that happened
memory.episodic.record(1, "User asked about HIPAA compliance", int(time.time()))

# Procedural memory: learned workflows
memory.procedural.learn(
    1,
    "answer_with_context",
    ["retrieve relevant docs", "check episodic memory for prior questions", "generate answer with Ollama"],
    confidence=0.9,
)
```
Now the assistant can recall facts, events, and procedures:
```python
# What does the agent know about the user?
facts = memory.semantic.query(embed("What does the user need?"), top_k=2)
for f in facts:
    print(f" Fact: {f['content']} (score: {f['score']:.2f})")

# What happened recently?
events = memory.episodic.recent(5)
for e in events:
    print(f" Event: {e['description']}")

# What procedure should the agent follow?
procedures = memory.procedural.recall(embed("how to answer a question"), top_k=1)
for p in procedures:
    print(f" Procedure: {p['name']} -> {p['steps']}")
```
Step 5: Connect to a local LLM
Ollama exposes a local HTTP API on localhost:11434. No internet required.
```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_llm(question: str, context: list[str], facts: list[str]) -> str:
    context_block = "\n".join(f"- {c}" for c in context)
    facts_block = "\n".join(f"- {f}" for f in facts)
    prompt = f"""You are a helpful assistant. Answer the question using ONLY the provided context and facts.

Context (retrieved documents):
{context_block}

Known facts about the user:
{facts_block}

Question: {question}

Answer:"""
    response = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False,
    })
    return response.json()["response"]
```
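Setting `"stream": False` keeps the example simple, but Ollama can also stream its output as newline-delimited JSON chunks, so tokens appear as they are generated instead of after a multi-second wait. A sketch of the chunk parsing (the HTTP call itself is left as a comment since it assumes a running Ollama server; `parse_ollama_stream` is an illustrative helper, not an Ollama API):

```python
import json
from typing import Iterable, Iterator

def parse_ollama_stream(lines: Iterable[bytes]) -> Iterator[str]:
    # Each streamed line is a JSON object carrying a partial "response"
    # string, with "done": true on the final chunk.
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            return

# With a live server, feed it the response body:
#   r = requests.post(OLLAMA_URL, json={"model": "llama3", "prompt": p, "stream": True}, stream=True)
#   for token in parse_ollama_stream(r.iter_lines()):
#       print(token, end="", flush=True)
```

Streaming matters more locally than in the cloud: a laptop-class model may take several seconds per answer, and visible progress makes that wait feel acceptable.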
The full pipeline
Putting it all together:
```python
def offline_assistant(question: str) -> str:
    # 1. Retrieve relevant documents
    context = retrieve(question, top_k=3)

    # 2. Recall user-specific facts from memory
    facts_results = memory.semantic.query(embed(question), top_k=2)
    facts = [f["content"] for f in facts_results]

    # 3. Check recent events (not used in this minimal version,
    #    but available for handling follow-up questions)
    events = memory.episodic.recent(5)

    # 4. Ask the local LLM
    answer = ask_local_llm(question, context, facts)

    # 5. Record this interaction in episodic memory
    memory.episodic.record(
        int(time.time()),  # the timestamp doubles as a unique event ID
        f"Answered: {question[:100]}",
        int(time.time()),
    )

    return answer

# Use it
answer = offline_assistant("What are the requirements for handling patient data?")
print(answer)
This entire pipeline runs without a single network call. The embedding model, the vector database, the memory system, and the LLM all execute locally.
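If you want to verify that claim rather than take it on faith, you can patch `socket.socket.connect` to refuse anything that is not loopback before running the pipeline. Any hidden cloud call then fails loudly. A sketch (the guard is illustrative; a firewall rule or pulling the cable works just as well):

```python
import socket

_real_connect = socket.socket.connect
_ALLOWED_HOSTS = {"127.0.0.1", "localhost", "::1"}

def _local_only_connect(self, address):
    # TCP addresses are (host, port) tuples; Unix domain sockets pass a
    # path string and are local by definition, so let those through.
    if isinstance(address, tuple) and address[0] not in _ALLOWED_HOSTS:
        raise ConnectionError(f"Blocked non-local connection to {address[0]!r}")
    return _real_connect(self, address)

socket.socket.connect = _local_only_connect

# Ollama on localhost:11434 still works; any outbound call raises immediately.
```

Run the assistant with the guard in place and it behaves identically, which is the whole point.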
Cloud vs. local: a practical comparison
| | Cloud stack | Local stack |
|---|---|---|
| Works offline | No | Yes |
| Latency | 200-500ms per API call | 10-50ms (embeddings), 1-10s (LLM) |
| Cost | Pay per token/query | Free after download |
| Privacy | Data leaves your machine | Data never leaves your machine |
| Setup | API keys, accounts, billing | One-time download |
| Model quality | GPT-4 level | Good (Llama 3 8B) to great (70B) |
| Scalability | Unlimited | Limited by hardware |
The tradeoff is clear: cloud gives you better models and infinite scale. Local gives you privacy, offline capability, and zero ongoing cost. For many use cases, local is not just "good enough." It is the only option.
Where this matters
Healthcare: HIPAA compliance often keeps patient data on-premises. A local AI assistant can help clinicians without sending PHI to cloud servers.
Defense and government: Air-gapped networks exist for a reason. An offline AI assistant works behind any security perimeter.
Travel and field work: No Wi-Fi on the plane, no signal in the field. Your assistant still works.
Privacy-conscious users: Some people simply do not want their questions going to a third-party server. Fair enough.
Edge computing: IoT gateways, manufacturing floors, remote sensors. Connectivity is unreliable. Local AI is the only reliable option.
Getting started
Everything you need:
- Ollama for local LLM inference
- sentence-transformers for local embeddings
- VelesDB for local vector storage and agent memory (~3MB binary, source-available under Elastic License 2.0)
The complete code from this article works as a starting point. Swap in your own documents, adjust the retrieval parameters, and you have a private AI assistant that never phones home.
What is the most interesting offline use case you can think of? I would love to hear about environments where cloud AI simply is not an option. Drop a comment below.