Building a P2P 'Wikipedia for Machines': Verifiable RAG with the Holepunch Stack

#opensource #ai #node #p2p

Current LLMs hallucinate when they lack context. While Retrieval-Augmented Generation (RAG) helps, existing pipelines force a tough compromise: you either trust centralized search APIs, scrape the live web (which is slow, fragile, and bloated with SEO spam), or maintain heavy, complex crawling infrastructure yourself.

More importantly, none of these methods give you cryptographic proof that the content your LLM is citing actually came from the source you claim.

To solve this, I've been building HIVE—a decentralized, peer-to-peer knowledge base designed to be consumed by LLMs, not humans. Think of it as a "Wikipedia for machines" that is completely P2P, cryptographically verifiable, and has no central authority.

The Architecture: BEEs and Queens
Instead of building a monolithic crawler, HIVE splits the workload across two distinct peer roles on a pure Holepunch stack (Hypercore + Hyperswarm). There is no shared coordination ledger and no Autobase; the system relies entirely on single-writer logs.

[ Wikipedia / arXiv / RSS ]
     │
     ▼ (Autonomous Extraction & Signing)
┌─────────┐
│   BEE   │ (Producer Node - Low Power)
└────┬────┘
     │
     ▼ (Hyperswarm DHT / P2P Replication)
┌─────────┐
│  QUEEN  │ (Consumer Node - Heavy Indexer)
└────┬────┘
     │
     ▼ (Semantic Vector Search)
[ Qdrant ] ──► [ Local LLM / API Synthesis ]

BEEs (Producer Nodes)
BEEs are lightweight, Raspberry Pi-friendly nodes. They autonomously crawl and extract knowledge from structured sources (currently heavy on Wikipedia, with initial arXiv and RSS support).

Every extracted text fragment is ed25519-signed by the BEE.

The signed fragments are appended to the node's local, single-writer Hypercore feed.

Other nodes can replicate this feed, but only as a read-only copy.
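The signing step above can be sketched with nothing but Node's built-in crypto module. This is a minimal illustration, not HIVE's actual code: the `feed` array stands in for the Hypercore feed, and `appendFragment` plus the fragment field names (`source`, `text`, `extractedAt`) are hypothetical.

```javascript
const crypto = require('node:crypto');

// Each BEE holds one ed25519 key pair; the public key identifies the node.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519');

// Assumption: a plain array stands in for the single-writer Hypercore feed.
const feed = [];

// Serialize the fragment, sign the bytes, and append the result to the feed.
// The field names here are hypothetical, not HIVE's real schema.
function appendFragment(fragment) {
  // Fixed key order, so a verifier checks exactly the bytes that were signed.
  const payload = JSON.stringify({
    source: fragment.source,
    text: fragment.text,
    extractedAt: fragment.extractedAt,
  });
  // ed25519 takes no separate digest algorithm, hence the null first argument.
  const signature = crypto.sign(null, Buffer.from(payload, 'utf8'), privateKey);
  const entry = { payload, signature: signature.toString('base64') };
  feed.push(entry);
  return entry;
}

// Example data only.
const entry = appendFragment({
  source: 'https://en.wikipedia.org/wiki/Capybara',
  text: 'The capybara is the largest living rodent.',
  extractedAt: '2025-01-01T00:00:00Z',
});
```

Storing the exact signed string next to the signature avoids re-serializing on the consumer side, where a different key order would silently break verification.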

Queens (Consumer Nodes)
Queens are the heavy indexer and query nodes.

They discover active BEEs via the Hyperswarm DHT.

They replicate every discovered feed entirely over P2P; there is zero HTTP traffic between nodes.

As they ingest the data, they verify the ed25519 signatures to guarantee data integrity.

The fragments are then indexed into a local Qdrant vector database to serve semantic queries.

When an LLM queries a Queen, it receives signed, source-traceable fragments, each carrying a signature that can be checked against the public key of the BEE that produced it.
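Here is a rough sketch of the verify-on-ingest step, again using only Node's built-in crypto module. `verifyEntry` and `ingest` are hypothetical names, and the entry shape (`payload` string plus base64 `signature`) is an assumption for illustration rather than HIVE's wire format.

```javascript
const crypto = require('node:crypto');

// Check one replicated entry against the public key of the BEE that wrote the feed.
function verifyEntry(entry, beePublicKey) {
  try {
    return crypto.verify(
      null, // ed25519 uses no separate digest algorithm
      Buffer.from(entry.payload, 'utf8'),
      beePublicKey,
      Buffer.from(entry.signature, 'base64')
    );
  } catch {
    return false; // malformed input counts as a failed verification
  }
}

// Keep only entries whose signatures check out; these would go on to indexing.
function ingest(entries, beePublicKey) {
  return entries
    .filter((e) => verifyEntry(e, beePublicKey))
    .map((e) => JSON.parse(e.payload));
}

// Demo: one honest entry, and one whose text was altered after signing.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519');
const payload = JSON.stringify({
  source: 'https://en.wikipedia.org/wiki/Capybara',
  text: 'The capybara is the largest living rodent.',
});
const signature = crypto.sign(null, Buffer.from(payload, 'utf8'), privateKey).toString('base64');
const honest = { payload, signature };
const tampered = { payload: payload.replace('largest', 'smallest'), signature };

const accepted = ingest([honest, tampered], publicKey);
console.log(accepted.length); // prints 1: the tampered entry is rejected
```

Note what the check does and does not establish: it shows that the holder of the BEE's private key signed these exact bytes. Whether the BEE extracted the text faithfully from the cited source is a separate question.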

Try It In 3 Minutes (Docker Compose)
If you want to spin up the full stack locally—including a BEE node, a Queen indexer, Qdrant, and a reverse proxy—you can run it via Docker:

git clone https://github.com/capybarist/hive.git && cd hive
cp .env.example .env
nano .env  # Paste your LLM API Key (e.g., a free Gemini key from aistudio.google.com)
docker compose up -d

Queen Query Endpoint: Open http://localhost to test the AI synthesis.

BEE Dashboard: Open http://localhost:8080 to monitor autonomous extraction activity.

Minimal Path (~150MB)
To run just a lightweight BEE node and contribute to the network without the heavy vector-indexing endpoint:

git clone https://github.com/capybarist/hive.git && cd hive
npm install
echo "LLM_PROVIDER=gemini" > .env
echo "LLM_API_KEY=your_key" >> .env
bash hive.sh

Current State & Honest Limitations (v0.7.0)
HIVE is functional, but it is not yet battle-tested at scale. Here is exactly where the project stands today:

What Works: Full BEE/Queen role splitting, native Hypercore replication over the Hyperswarm DHT, automated signature verification on block receipt, and a working Wikipedia forager with a persistent BFS crawl queue.

The Bottlenecks: The extraction rate depends heavily on your local hardware or your LLM provider's rate limits (e.g., the Groq free tier yields ~6k fragments/day, while an Ollama CPU setup manages ~600/day). Additionally, the Hyperswarm DHT can still be finicky behind aggressive corporate firewalls.
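For readers wondering what a "persistent BFS crawl queue" looks like in practice, here is one minimal way to build it. This is a sketch under assumptions, not HIVE's forager: the `CrawlQueue` class, the JSON file layout, and the page titles are all made up for illustration.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// A FIFO queue plus a "seen" set, both saved to disk so a crawl can resume.
class CrawlQueue {
  constructor(file) {
    this.file = file;
    this.queue = [];
    this.seen = new Set();
    if (fs.existsSync(file)) {
      // Restore the state left behind by a previous run.
      const state = JSON.parse(fs.readFileSync(file, 'utf8'));
      this.queue = state.queue;
      this.seen = new Set(state.seen);
    }
  }

  // Enqueue a page only once; FIFO order is what makes the traversal breadth-first.
  push(title) {
    if (this.seen.has(title)) return false;
    this.seen.add(title);
    this.queue.push(title);
    return true;
  }

  // Take the oldest pending page (undefined when the queue is empty).
  shift() {
    return this.queue.shift();
  }

  // Write the pending queue and the seen set to disk.
  save() {
    const state = { queue: this.queue, seen: [...this.seen] };
    fs.writeFileSync(this.file, JSON.stringify(state));
  }
}

// Demo: crawl one page, persist, then resume from disk as a restarted node would.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'hive-queue-'));
const file = path.join(dir, 'queue.json');

const q = new CrawlQueue(file);
q.push('Capybara');
q.push('Rodent');
q.push('Capybara'); // duplicate, ignored
const first = q.shift(); // 'Capybara'
q.save();

const resumed = new CrawlQueue(file);
const next = resumed.shift(); // 'Rodent'
```

Persisting the seen set along with the queue matters on low-power nodes: after a restart, the BEE neither loses its place nor spends rate-limited LLM calls re-extracting pages it already handled.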

Roadmap: What's Next (v0.7.x)
Source-Driven Extraction: Transitioning from a static topic tree to per-BEE source declarations (Wikipedia, arXiv, Common Crawl).

Unified ForagerSource Interface: Adding a new data source should require nothing more than implementing a single interface in one file.

Common Crawl Integration: Indexing the open web via reproducible, immutable snapshots instead of live scraping.

Score-by-Corroboration: Implementing an algorithm to boost content confidence when multiple independent BEEs extract identical facts from different sources.
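Score-by-Corroboration is still a roadmap item, so the following is only one possible shape for it, not a design the project has committed to. The sketch assumes that "identical facts" means identical text after simple normalization, and that the score is the smaller of two counts: distinct BEEs and distinct sources. The names `factKey` and `corroborationScores`, the fragment fields, and the sample data are all hypothetical.

```javascript
const crypto = require('node:crypto');

// Assumption: two fragments state the same fact if they match after
// lowercasing and collapsing punctuation and whitespace.
function factKey(text) {
  const normalized = text.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  return crypto.createHash('sha256').update(normalized).digest('hex');
}

// Group fragments by fact, then score each fact by how many distinct BEEs
// AND distinct sources back it (the smaller of the two counts).
function corroborationScores(fragments) {
  const groups = new Map();
  for (const f of fragments) {
    const key = factKey(f.text);
    if (!groups.has(key)) {
      groups.set(key, { bees: new Set(), sources: new Set() });
    }
    const group = groups.get(key);
    group.bees.add(f.bee);
    group.sources.add(f.source);
  }
  const scores = new Map();
  for (const [key, group] of groups) {
    scores.set(key, Math.min(group.bees.size, group.sources.size));
  }
  return scores;
}

// Example data only: two BEEs report the same fact from different sources.
const fragments = [
  { bee: 'bee-A', source: 'wikipedia', text: 'The capybara is the largest living rodent.' },
  { bee: 'bee-B', source: 'rss', text: 'the capybara is the largest living rodent' },
  { bee: 'bee-A', source: 'wikipedia', text: 'Capybaras are semiaquatic.' },
];
const scores = corroborationScores(fragments);
```

Taking the minimum means one BEE repeating a fact many times, or many BEEs all quoting the same page, cannot raise the score on their own. A real implementation would likely need semantic matching rather than exact text, and some defense against one operator running many BEE identities.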

Looking for Feedback
HIVE's source is fully public under the BUSL (Business Source License), which converts automatically to MIT after 4 years. There is no startup behind this, no VC funding, and absolutely no token economy or cryptocurrency attached. I built it because I needed verifiable local RAG and couldn't find an existing solution.

I would love to hear from developers working on P2P architectures, distributed systems, or self-hosted RAG:

The Topology Split: Does the BEE/Queen separation feel sound to you, or do you see architectural edge cases that could lead to unexpected bottlenecks?

Open Web Ingestion: Is leveraging Common Crawl snapshots a solid approach for reproducible web ingestion, or are the local storage overheads too punishing for self-hosted nodes?

The P2P Commons: Would you dedicate spare home-lab compute or bandwidth to seed/produce signed knowledge fragments for a shared network?

Let me know your thoughts or hit me with some tough architectural critique in the comments!

📂 Source Code: https://github.com/capybarist/hive

📖 Manifesto: https://github.com/capybarist/hive/blob/main/MANIFESTO.md

🏗️ Technical Deep-Dive: https://github.com/capybarist/hive/blob/main/CLAUDE.md