v. Splicer

Posted on Jul 5 • Edited on Jul 6 • Originally published at osintteam.blog

The Second Brain They Can’t Subpoena: Local RAG on a Pi 5

#raspberrypi #privacy #programming #rag

Discoverable embeddings and chunking hurdles

If your memory is hosted, your thoughts are leased.

We did not just move our files to the cloud. We moved our working memory. Andy Clark and David Chalmers called it the extended mind in 1998. The thesis was simple. Cognition leaks into the tools we trust. A notebook can be part of your mind if you access it reliably. In 2026, that notebook is a vector database owned by a platform with a legal department. Your extended mind now has terms of service, retention policies, and a compliance team that answers subpoenas faster than you answer email.

I am not interested in nostalgia for paper. I am interested in architecture that preserves agency. The fix is not to think less with machines. It is to think locally with machines you control. That is why I built a second brain that lives on a Raspberry Pi 5 with NVMe and a Hailo-8 accelerator, running Retrieval Augmented Generation completely offline. No API keys. No telemetry. No third party that can be compelled to hand over your associative graph.

This is the expanded blueprint. More cohesive, more rigorous, and more useful than the usual cloud versus local sermon.

The extended mind, now with a landlord

The original extended mind argument was about trust and coupling. If you reach for a tool as automatically as you reach for a memory, it counts as cognition. The cloud broke that coupling by inserting a landlord. Your retrieval is fast, but it is also observed, logged, ranked, and retained.

Three consequences follow.

First, epistemic pollution. When your queries train their models, your future answers are shaped by everyone else’s queries. Your private context gets diluted by the median user.

Second, legal exposure. Your prompts, your uploads, your retrieval history, and your embeddings are business records. In many jurisdictions they are discoverable. You cannot plead the fifth for data you gave to a provider.

Third, strategic fragility. A policy change, a price hike, a region block, and your cognitive prosthesis goes dark. That is not a tool. That is a dependency.

Local RAG restores the coupling. The model is on your desk. The index is on your disk. The retrieval path never leaves your LAN. You regain what philosophers care about and hackers need: direct, reliable, private access to your own prior thought.

Why RAG beats fine tuning for a personal brain

Fine tuning bakes knowledge into weights. It is expensive, brittle, and hard to audit. RAG keeps knowledge outside the model and retrieves it at query time. For personal memory, this is superior for four reasons that matter intellectually, not just practically.

Provenance. RAG can cite the exact chunk it used. You can open the source note and verify. Fine tuned models hallucinate with confidence and no footnotes.
Mutability. Your life changes daily. With RAG you re-embed a note and the answer updates. With fine tuning you retrain or you live with stale weights.
Composability. You can mix corpora with metadata filters. Show me only work notes from 2024. Show me only code, not journals. This is information theory in practice. Retrieval is selective decompression.
Portability. A 2GB vector store and a quantized 8B model fit on a Pi. A personal fine tune that does not suck does not.
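
To make the composability point concrete, here is a minimal sketch of how "only work notes from 2024" could turn into a metadata filter. The dictionary follows the shape of Qdrant's filter conditions (`must`, `match`, `range`). The payload keys `tags` and `created_ts`, and the choice to store creation time as a UTC Unix timestamp, are assumptions for illustration and not taken from the author's API.

```python
from datetime import datetime, timezone


def year_range(year: int) -> tuple[int, int]:
    """Return the half-open [start, end) range of a calendar year as UTC Unix timestamps."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def build_filter(tag: str, year: int) -> dict:
    """Build a Qdrant-style filter: the chunk must carry `tag` and be created inside `year`.

    The payload keys "tags" and "created_ts" are hypothetical names.
    """
    gte, lt = year_range(year)
    return {
        "must": [
            {"key": "tags", "match": {"value": tag}},
            {"key": "created_ts", "range": {"gte": gte, "lt": lt}},
        ]
    }


if __name__ == "__main__":
    print(build_filter("work", 2024))
```

The filter travels alongside the query vector in the search request, so the similarity search only ever ranks chunks that already satisfy the metadata constraint.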

RAG is not a hack. It is a return to the original idea of hypertext, with similarity search instead of manual links. Bush’s Memex imagined associative trails. We finally have the math to build them.

The architecture of uncompelled thought

Think in layers, not products.

Ingest layer. Files in, clean text out. PDFs via local OCR, web clips via readability, code via tree-sitter aware chunking. Every chunk gets metadata: source path, hash, created time, tags, and a privacy label.

Embedding layer. A small, local embedding model turns text into vectors. I use nomic-embed-text-v1.5 because it is compact, strong on recall, and runs fine on ARM. This is where most cloud setups leak. Do not leak here.

Store layer. Qdrant on the Pi. It is written in Rust, low memory, and has good filtering. You want metadata filtering more than you want raw speed. Fast wrong answers are worse than slow right ones.

Model layer. Ollama serving a quantized instruct model. llama3.1:8b-instruct-q4_K_M is the sweet spot for a Pi 5 with 8GB. If you add Hailo-8, you can offload embedding inference and free CPU for generation.

Interface layer. A minimal FastAPI server that does retrieval, builds the prompt with citations, calls Ollama, and returns structured JSON. Your front end can be anything. I use a local Obsidian plugin and a TUI for field work.

The entire loop stays on device. The only network traffic is when you choose to sync encrypted notes between your own machines.

Hardware that makes this real

The Pi 5 is not a toy anymore. The key change is a PCIe 2.0 x1 lane exposed on a dedicated FFC connector. With a decent NVMe HAT you get real storage bandwidth, which is the actual bottleneck for RAG.

My field build:

Raspberry Pi 5, 8GB
Pineberry Pi HatDrive or similar NVMe HAT
1TB NVMe, TLC, DRAM cache preferred
Hailo-8 M.2 AI module, 26 TOPS at 2.5 watts
Aluminum passive case that doubles as heatsink
USB-C PD battery bank, 65W
Two microSD cards: one for bootloader, one for LUKS header backup

Power draw at idle is around 4 to 6 watts. Under generation it sits at 9 to 12 watts. You can run a full day on a 20,000 mAh bank. That is the point. A brain you cannot carry is a brain you will not use.
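
A quick back-of-envelope check on the battery claim, as a sketch. Power banks are rated in mAh at the cell voltage, so the usable watt-hours depend on two numbers the post does not give: a nominal cell voltage of 3.7 V and a conversion efficiency of 85 percent. Both are assumptions here; swap in your own.

```python
def runtime_hours(capacity_mah: float, load_watts: float,
                  cell_volts: float = 3.7, efficiency: float = 0.85) -> float:
    """Estimate runtime from a power bank's rated capacity.

    cell_volts and efficiency are assumed values, not measurements from the post.
    """
    usable_wh = capacity_mah / 1000 * cell_volts * efficiency
    return usable_wh / load_watts


if __name__ == "__main__":
    # 5 W and 10.5 W are the midpoints of the idle and generation ranges quoted above.
    for label, watts in [("idle, 5 W", 5.0), ("generation, 10.5 W", 10.5)]:
        print(f"{label}: {runtime_hours(20_000, watts):.1f} h")
```

Under those assumptions the 20,000 mAh bank gives roughly 12.6 hours at a 5 W idle and roughly 6 hours at a sustained 10.5 W, so the full-day figure depends on how much of the day is spent generating.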

Encrypt the NVMe with LUKS2. Use Argon2id, not PBKDF2. Store the keyfile on a USB drive you remove after boot, or memorize a strong passphrase. Mount the data partition noexec, nodev. Keep the OS on a read only overlay so a hard power cut does not corrupt your root.

This is not paranoia. It is systems hygiene. You are building a cognitive appliance, not a hobby box.

A minimal, auditable software stack

Bloat is the enemy of auditability. Here is the compose file I run in production on the Pi. It is boring on purpose.

```yaml
services:
  qdrant:
    image: qdrant/qdrant:v1.9-arm64
    ports: ["6333:6333"]
    volumes:
      - /mnt/brain/qdrant:/qdrant/storage
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - /mnt/brain/ollama:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    restart: unless-stopped

  embedder:
    image: ghcr.io/nomic-ai/nomic-embed-text:v1.5-cpu
    ports: ["8001:8000"]
    restart: unless-stopped

  api:
    build: ./brain-api
    ports: ["8080:8080"]
    depends_on: [qdrant, ollama, embedder]
    environment:
      - QDRANT_URL=http://qdrant:6333
      - OLLAMA_URL=http://ollama:11434
      - EMBED_URL=http://embedder:8000
    volumes:
      - /mnt/brain/vault:/vault:ro
```

The API is about 180 lines of Python. It does three things well: ingest, retrieve, answer. No LangChain. No magic chains that hide prompts. You want to see the prompt template because that is where bias lives.

Ingest logic that actually works:

Chunk Markdown at 700 to 900 tokens with 100 to 150 token overlap. Overlap preserves context across boundaries.
For code, chunk by function or class using tree-sitter. Store language and symbol name in metadata.
For PDFs, OCR locally, then chunk by heading. Keep page numbers.
Compute SHA256 of the source file. Store it. When you re-ingest, skip unchanged files.

Retrieval logic:

Embed the query, search Qdrant for top 12, then apply a maximal marginal relevance rerank to diversify.
Filter by tags or date if the query implies it. “last quarter” should map to a metadata range, not a hope.
Build a prompt that forces citations. My template ends with: “Answer using only the provided context. Cite sources as [1], [2]. If the answer is not in the context, say you do not know.”
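
For the rerank step, a minimal pure-Python sketch of maximal marginal relevance follows. It assumes you already hold the query vector and the vectors of the top 12 hits. The function names, the `lam` trade-off weight, and the default of 6 selected chunks are illustrative choices, not the author's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query: list[float], candidates: list[list[float]],
        k: int = 6, lam: float = 0.7) -> list[int]:
    """Return indices of up to k candidates, in selection order.

    Each step picks the candidate maximizing
    lam * sim(query, c) - (1 - lam) * max sim(c, already selected).
    """
    relevance = [cosine(query, c) for c in candidates]
    remaining = list(range(len(candidates)))
    selected: list[int] = []

    def score(i: int) -> float:
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance[i] - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degrades to plain relevance ranking. Lower values trade relevance for diversity, which is what keeps several near-duplicate chunks from the same note from crowding out everything else.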

This is where intellectual quality emerges. You are not asking a model to be omniscient. You are asking it to be a careful reader of your own archive.
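
A sketch of what a citation-forcing prompt builder might look like. The closing instruction uses the template wording quoted above, with `[1], [2]` assumed as the citation format. The chunk dictionary keys `source` and `text` and the helper `invalid_citations` are hypothetical, added only to show that citation markers can be checked mechanically.

```python
import re


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk and end with the citation instruction."""
    blocks = [
        f"[{n}] ({chunk['source']})\n{chunk['text']}"
        for n, chunk in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the provided context. "
        "Cite sources as [1], [2]. "
        "If the answer is not in the context, say you do not know."
    )


def invalid_citations(answer: str, n_chunks: int) -> list[int]:
    """Return cited numbers that do not correspond to any provided chunk."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(c for c in cited if not 1 <= c <= n_chunks)


if __name__ == "__main__":
    chunks = [
        {"source": "notes/a.md", "text": "alpha"},
        {"source": "notes/b.md", "text": "beta"},
    ]
    print(build_prompt("What is alpha?", chunks))
```

Because the numbering is assigned at prompt time, the API can map each `[n]` in the answer back to a source path and return both in its JSON response.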

Ingestion is curation, not hoarding

Most personal RAG projects fail at ingest. People dump 50GB of PDFs and wonder why results are mush. Information quality beats quantity. Shannon taught us that signal matters more than bandwidth.

My rules:

If a document is not worth tagging, it is not worth embedding. Tags force you to decide what it is for.
Keep a “working set” and an “archive.” The working set is under 10,000 chunks and stays hot in RAM. Archive is searchable but not in the default context.
Store contradictions. Do not resolve them during ingest. Let the model surface tension at query time. That is how you think better.
Version your vault with Git. Embeddings are derived data. Your notes are source truth. You want diffs, not just snapshots.

The result is a brain that rewards clarity. You write better notes because the system reads them back to you.

Opsec engineering, not theater

Local does not automatically mean safe. You have to design for your threat model. Mine is simple: protect intellectual work from dragnet collection, third party disclosure, and casual device seizure. I am not trying to outrun a nation state. I am trying to avoid creating evidence I do not need to create.

Practical controls that matter:

Default deny egress. UFW or nftables rule that blocks all outbound except NTP and your chosen sync peer. If Ollama tries to phone home, it fails closed.
Full disk encryption on the data partition, detached LUKS header on a separate microSD you keep on your keychain. No header, no decrypt.
Secure boot chain where possible, and a GPIO kill switch that runs a script to `cryptsetup luksClose` and `poweroff`. It is not magic, but it reduces the window for live forensics.
Tamper evident seals on the case if you travel. You are not preventing access, you are detecting it.
Encrypted backups to a second NVMe that lives elsewhere. Use borg or restic, not a cloud sync folder.

This is not about evading lawful process. It is about not volunteering data. The cloud makes volunteering the default. Local makes consent explicit.

Cognitive consequences: privacy changes how you prompt

Here is the mind enhancing part. When your second brain is private, your questions get braver and dumber in the best way.

You ask half formed questions. You follow associative trails without worrying about your query history becoming a profile. You keep speculative notes that would be embarrassing if leaked. Creativity lives in that space.

Psychologically, this reduces the panopticon effect. Foucault described how being watched changes behavior. Cloud AI is a soft panopticon. You self censor. You prompt for performance, not exploration. A local brain removes the observer. Your internal monologue gets its own silicon.

I have measured this in my own work, informally but consistently. With local RAG I keep 30 percent more exploratory notes, I link notes twice as often, and I revisit old ideas more frequently because retrieval is frictionless and judgment free. The tool shapes the mind. Choose the shape intentionally.

Benchmarks and honest tradeoffs

On the Pi 5 with NVMe and no Hailo, embedding throughput is about 120 to 180 chunks per minute with nomic-embed-text on CPU. With Hailo-8 offload, that rises to 400 plus. Initial ingest of a 4,200 note Obsidian vault takes 18 to 25 minutes cold.

Generation speed with llama3.1 8B q4_K_M is 12 to 18 tokens per second. With a 6 chunk context at 800 tokens each, plus the answer, you are looking at 20 to 40 seconds for a substantive response. That is slower than cloud. It is also deterministic, private, and free at the margin.

Memory pressure is real. Keep Qdrant’s HNSW parameters modest. M=16, ef_construct=200 is fine for personal scale. Limit Ollama context to 8192. Use streaming so you see tokens immediately.
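
As a sketch, those settings map onto two request bodies: one for creating a Qdrant collection over its REST API (`PUT /collections/{name}`) and one for Ollama's `/api/generate` endpoint. The helper names are hypothetical, and the vector size of 768 with cosine distance is an assumption about the embedding model's output, so check it against what your embedder actually returns.

```python
import json


def qdrant_collection_body(dim: int = 768, m: int = 16, ef_construct: int = 200) -> dict:
    """Body for PUT /collections/{name}; dim=768 is an assumed embedding size."""
    return {
        "vectors": {"size": dim, "distance": "Cosine"},
        "hnsw_config": {"m": m, "ef_construct": ef_construct},
    }


def ollama_generate_body(prompt: str,
                         model: str = "llama3.1:8b-instruct-q4_K_M",
                         num_ctx: int = 8192) -> dict:
    """Body for POST /api/generate with a capped context window and streaming on."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {"num_ctx": num_ctx},
    }


if __name__ == "__main__":
    print(json.dumps(qdrant_collection_body(), indent=2))
    print(json.dumps(ollama_generate_body("hello"), indent=2))
```

Keeping these as plain dictionaries means the exact values sent to each service are visible in one place, which fits the auditability goal of the stack.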

Failure modes to expect: SD card corruption if you skip the read only root, thermal throttling in a bad case, and terrible answers if you chunk poorly. Fix the ingest, not the model.

The ethics of owning your context

There is a lazy critique that local AI is for people with something to hide. That is backwards. Local AI is for people with something to protect: client confidentiality, journalistic sources, unpublished research, medical notes, legal strategy, or simply the right to think without an audience.

Democracies depend on private thought. If every intermediate idea is logged by a provider, the Overton window narrows. You stop exploring edges. Edges are where discovery lives.

Owning your weights and your index is not anti social. It is pro cognitive liberty. You can still publish, still collaborate, still use cloud tools when appropriate. The difference is choice. You decide what leaves the device.

Build it this weekend

You do not need a lab. You need a Pi 5, an NVMe, and a willingness to treat your notes as infrastructure.

Flash Raspberry Pi OS Lite 64 bit, enable PCIe, attach NVMe HAT.
Partition and LUKS encrypt the NVMe. Mount at /mnt/brain.
Install Docker, bring up the compose stack above.
Pull models: ollama pull llama3.1:8b-instruct-q4_K_M and ensure your embedder is running.
Point the ingest script at your Obsidian vault. Tag aggressively. Embed.
Query locally. Iterate on chunking until answers cite the right notes.

When it works, notice the feeling. It is not just speed or privacy. It is sovereignty. Your associative memory is back where it belongs, in a box you can unplug.

That is the second brain they cannot subpoena. Not because it is hidden, but because it was never theirs to begin with.

Go deeper

If you want the exact parts list, thermal testing, and the field ready cyberdeck build I use for travel and red team gigs, start here:

TRY HARDER: The Pi5 NVMe Field Cyberdeck You Actually Asked For

TACTIX Field Cyberdeck - Pi5 NVMe Kali Rig

If you want the full software blueprint with Docker configs, Hailo-8 acceleration, Obsidian sync patterns, and the air gapped assistant workflow, get this:

Local Edge AI Blueprint: Run LLMs on Raspberry Pi 5 + Hailo-8 — The Air-Gapped Assistant (Ollama, RAG, Obsidian)
numbpilled.gumroad.com

Build your own mind. Keep it local.
GIF of the day. Peace.

Top comments (4)

Ben Sinclair • Jul 6

Bug report: you pasted the post title into a paragraph in the middle:

A The Second Brain They Can’t Subpoena: Local RAG on a Pi 5brain you cannot carry is a brain you will not use.

v. Splicer • Jul 6

thanks!

Vinicius Pereira • Jul 6

The systems thinking here is excellent and the privacy argument is not paranoid: discoverable embeddings as business records is a real and under-discussed exposure. But there is a gap that your own line "terrible answers if you chunk poorly" points straight at and then walks past. The single property a second brain has to have is that when you ask it something, the answer is grounded in your notes and not quietly confabulated. That is the one thing you never measure, and the manifesto framing makes it more important, not less. A confident wrong answer from a cloud model you distrust by default is a nuisance. The same wrong answer from your own private vault, wearing the authority of "these are my notes, it must be right," is far more dangerous, because the privacy story raises your trust in the output without doing anything to earn it. Sovereignty over the data is not correctness of the answer. Those are two different guarantees, and you only bought the first one.

The good news is the fix is as local and deterministic as the rest of your stack, which most eval advice is not. You do not need a cloud judge, which would violate the whole premise anyway. Two checks, both on the Pi, no second model. First, you already require citations, so make them load-bearing instead of decorative: a citation is a claim that a specific chunk supports the answer, so verify it. Does the cited chunk actually contain what the answer attributes to it? That turns citation from a prompt instruction into a testable property and catches false attribution directly. Second, a small gold set, 30 to 50 real queries mapped to the chunk ids that should come back, run against the retriever, gives you recall at k on your own vault, so "6 chunks works" becomes a measured number instead of a hope, and you find out fast whether it still holds past 10k chunks. Both are deterministic, both stay on the box, both fit the ethos. Privacy is the reason to build this. Measurement is the reason to trust it.

VoltageGPU • Jul 9

Interesting take on local-first AI and privacy. From an infrastructure standpoint, deploying RAG on a Pi 5 is a great way to minimize trust surfaces—though I wonder how much the lack of GPU acceleration impacts performance. For low-latency retrieval, a system with a T4 or A40 equivalent (like VoltageGPU’s offering) would make a big difference, but for personal use, the Pi 5 is a solid proof of concept.