ELY NIXON

Posted on Apr 18

Meshcore: Architecture for a Decentralized P2P LLM Inference Network

#ai #web3 #architecture #opensource

Copyright (C) 2026 Concept House LLP
This document and the Meshcore architecture are licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.

Meshcore: Decentralized Protocol for P2P Compute and LLM Inference (DePIN Architecture)

This document outlines the concept, architectural design, and system implementation plan for Meshcore—a hybrid Web2.5 infrastructure protocol built for distributed edge computing, autonomous AI agents, and hosting Large Language Models (LLMs).

1. Initial Concept: Core Problem & Vision

How do we build a truly decentralized P2P compute network? What is the ideal pragmatic approach, and what are the current bottlenecks?

Building a decentralized peer-to-peer cloud for high-performance computation is an engineering challenge of the highest order. If we are designing this infrastructure today—specifically targeting heavy workloads like local LLM inference (e.g., 30B+ parameter models) and autonomous AI agent operations—the architecture must be strictly pragmatic, bypassing pure "crypto-utopian" bottlenecks.

Here is the blueprint for an ideal, production-ready implementation:

1. The Ideal Architecture: A Pragmatic Hybrid
Total decentralization (where even routing and matchmaking live on a blockchain) is far too slow for high-performance data processing. The ideal solution is a hybrid (Web2.5) model: the Control Plane is federated or centralized for ultra-low latency, while the Data/Compute Plane remains completely P2P.

Provider Nodes (Workers): A lightweight daemon running on client hardware. The "holy grail" hardware pool consists of machines with unified memory architecture (like Apple Silicon M-series chips), which possess immense memory bandwidth that typically sits idle. Workloads are strictly isolated using WebAssembly (WASM).
The Orchestrator (Control Plane): The management node acts as an intelligent matchmaker, profiling the decentralized nodes in real-time (CPU, RAM/VRAM availability, uptime, latency) and routing inference tasks accordingly.
Workload Distribution: Slicing a single LLM into layers across the internet (Tensor Parallelism) is physically impossible due to latency constraints. The architecture distributes tasks, not layers—routing a complete prompt to a specific node that already has the required model cached in RAM.

2. The Critical Barriers (What's Missing Today)

Proof of Compute: How do we algorithmically prove that a node actually processed a prompt through a 32B model, rather than generating a cheap hallucination? Cryptographic trust models (FHE, ZK-SNARKs) are currently magnitudes too slow. We must rely on probabilistic consensus and selective auditing.
Data Privacy: Enterprise and B2B sectors are hesitant to send RAG (Retrieval-Augmented Generation) data to unverified consumer hardware. Trusted Execution Environments (TEE) lack widespread hardware adoption in consumer hardware.
The "Cold Start" Problem: Loading extreme model weights (40GB+) into RAM takes significant time. The router must implement predictive "pre-warming" strategies across the network swarm.

How do we host the underlying platform infrastructure?

The paradox of decentralized systems: the Control Plane must be hyperscaled, ultra-reliable, and lightning-fast.

Infrastructure: Kubernetes deployed on Bare-Metal (Hetzner, OVH) to negate exorbitant cloud egress traffic fees.
Edge Layer: Global Anycast (Cloudflare) to provide clients and workers with the lowest possible latency ingress points worldwide.
Network Protocol: Node-to-Control-Plane telemetry uses bidirectional gRPC or WebSockets for continuous heartbeats.
The Nervous System: NATS JetStream acts as the message broker (ideal for the ephemeral, IoT-like nature of dropping nodes), paired with an In-Memory state for live routing tables.

How are models hosted and attached to nodes?

The heaviest network bottleneck is transmitting the model weights (20-40 GB per checkpoint).

Distribution: P2P payload delivery via embedded BitTorrent. The compute network itself acts as a massive CDN for its own model weights.
Inference Engine: llama.cpp (maximizing hardware democratization across CPU, RAM, and Apple Silicon) or vLLM (for enterprise Tensor Core nodes). All consumer-grade execution revolves around the highly optimized GGUF format.
Routing: Traffic follows either a Proxy pattern (tunneled securely through the network gateway) or an Ephesian Direct pattern (the platform issues a token and IP, allowing the client to establish a WebRTC connection directly to the worker hardware).

2. Technical Engineering Plan (System Architect Perspective)

Based on the core vision, Meshcore's architecture operates as a high-performance, hybrid graph-compute network.

2.1 Core Concept & Tech Stack

Platform Type: DePIN Computing Network & Distributed Inference Grid.
Model Format: Absolute focus on GGUF quantization running via llama.cpp for consumer and premium M-Series tiers.
- Reasoning: GGUF perfectly aligns with P2P ideology—it represents models as single files, makes distribution trivial, and seamlessly leverages Mac unified memory grids.

2.2 Control Plane Infrastructure (Orchestration)

The Mesh-Router operates as a microservice mesh.

Core Hosting: Bare-Metal Kubernetes with unmetered 10 Gbps uplinks.
In-Memory Router State & Matchmaking: A clustered Redis setup handles ephemeral TTL keys (node liveness), but actual payload routing is executed by a custom Golang Matchmaking Engine (In-Memory B-Tree). This engine spatially indexes nodes by ping, loaded model hash, and available RAM in real-time, bypassing the bottlenecks of standard Redis SCAN operations across 10,000+ nodes.
Pub/Sub (Task Queues): NATS JetStream provides microsecond latency publish-subscribe patterns, expertly designed for ephemeral clients.

2.3 Worker Node Architecture (Provider Daemon)

The Meshcore-Daemon is a heavy-duty client application written in Rust or Golang.

Protection against Memory Thrashing (OS Level): The daemon proactively reads OS-level metrics like Page Faults and Swap Activity. If the user’s operating system begins swapping the daemon’s memory to the SSD, the node immediately drops its READY status to prevent severe SLA degradation during inference.
Containment & GPU IPC (Security): AI agent payloads and rogue client scripts execute strictly inside a WebAssembly (Wasmtime) sandbox. The Inference Engine (llama.cpp) runs as an isolated native host service. Communication between the WASM sandbox and the GPU inference thread occurs via high-speed IPC sockets or a Shared Memory Buffer (mmap), guaranteeing malicious payloads cannot escape into host memory.
Model Distribution (P2P + Hybrid Fallback): The daemon utilizes an embedded BitTorrent protocol. However, to solve the "long-tail" availability problem (rare model fines-tunes lacking seeders), the node implements a Tiered Storage Overlay. If P2P swarm speed degrades below a functional threshold, the node transparently falls back to downloading the remaining chunks from the protocol's centralized cold storage (e.g., Cloudflare R2).
Telemetry Connection: Bidirectional gRPC over mTLS ensures absolute prevention of MitM attacks and node telemetry spoofing.

2.4 Request Routing Protocols (Data Plane)

The network supports two primary traffic modes based on client SLA requirements:

Strict Proxy (Secure Tunnel): Client -> Mesh Router -> P2P Worker. Higher latency, but masks client IPs and provides a unified corporate endpoint.
Ephesian Direct (Low-Latency Agents): The Mesh Router provides the client with a JWT and the Worker's IP:Port. The client communicates directly with the hardware via WebRTC.
- Critical Engineering Upgrade: Because roughly 20% of residential networks sit behind Symmetric NATs (breaking standard STUN logic), the Edge layer hosts a high-throughput Coturn (TURN) cluster. If the client SDK fails to punch a direct WebRTC hole within 100ms, it executes a "Shadow Fallback" and seamlessly relays traffic through the TURN cluster.

2.5 Security & Proof of Compute Layer

To bypass the current computational impossibility of heavy cryptography like FHE, the network utilizes Auditor Nodes (Logit Consensus):

The Control Plane injects "Shadow Prompts" dynamically into the traffic stream.
A randomized 2% of client requests are parallelized across 3 highly-trusted, platform-owned Auditor Nodes.
If a P2P worker returns a hallucinated response or its output logits diverge from the consensus baseline, the node triggers a Slashing Event: its rating resets to zero, its balance is frozen, and it is permanently banned from the routing pool.

3. Strategic Vision: The Ideal Tech World (The Roadmap)

If we remove all current hardware and engineering constraints and project 3-5 years into the future, here is how decentralized inference elegantly solves its core bottlenecks:

3.1 SLA Reliability: Predictive Routing & Speculative Execution

Instead of server-side sequential retries, the Meshcore client SDK will open a WebRTC swarm multi-plex directly to a pool of worker nodes.

The Ideal: The SDK streams the prompt to 3-5 nodes concurrently. The node that streams the first valid token becomes the "Leader," seamlessly pausing (suspending) the others. If the Leader node stutters or drops out (e.g., >50ms inter-token latency), the SDK instantly failovers to the secondary node's stream (which already holds the prompt in its KV-cache). Network volatility is entirely negated by the mathematical redundancy of the swarm.

3.2 The Privacy Utopia: Fast FHE (Fully Homomorphic Encryption)

Hardware TEEs are frequently exploited. In the ideal world, the protocol relies purely on mathematics—Fully Homomorphic Encryption.

The Ideal: The client encrypts the prompt and RAG payload before transmission. The P2P node receives cryptographic "noise" and performs matrix multiplications (LLM inference) directly on that noise. The host physically cannot decipher what it is processing. It returns encrypted output tokens, which the client locally unlocks. Once NPUs begin hardware-accelerating FHE, this will permanently disrupt the centralized cloud monopoly.

3.3 The "Cold Start" Solution: Base Model Architecture + LoRA Injection

Downloading 40GB arrays across a distributed network for every model update is irrational.

The Ideal: The network mandates LoRA Routing. 100% of the network's nodes maintain a single, static "Foundation Layer" cached in RAM (e.g., Llama-3-Base). When a client requires specialized inference (e.g., legal or medical coding), they transmit only the Low-Rank Adaptation (LoRA) weights—a mere 50MB payload. The LoRA is injected into the foundation model in milliseconds, instantly terraforming the node into a domain-expert agent without massive network overhead.

3.4 Proof of Compute: zkML (Zero-Knowledge Machine Learning)

Shadow-prompting is a waste of global energy resources.

The Ideal: Rather than recalculating tokens, the P2P node generates a ZK-SNARK (Zero-Knowledge Proof) concurrently with the inference pass. The Control Plane validates this ZK-Proof mathematically in one millisecond. This cryptographically guarantees that the node actually executed the matrix mathematics against the exact declared model weights, terminating any possibility of hallucination or weight substitution.

Links & Author

GitHub Repository: Check out the Meshcore Repo
Architected by Ilya Nikitin at Concept House.

Top comments (3)

Max Quimby • Apr 21

The "proof of compute" problem is genuinely the hardest part of any decentralized inference system, and I'm curious about the auditor node design. How do you handle the adversarial case where an auditor and an inference node collude to report false results? Standard approaches require redundant independent auditing, but that multiplies compute costs and defeats some of the efficiency rationale.

On the WebAssembly sandboxing: have you benchmarked the overhead for inference workloads? My intuition is that WASM isolation adds relatively little overhead when GPU compute already dominates (the bottleneck is memory bandwidth and matrix ops, not the WASM harness). But it could matter for preprocessing, tokenization, and sampling steps where CPU latency is more visible.

The BitTorrent model distribution is elegant for popular models, but the long-tail availability problem seems thorny — if only 3 peers globally hold a model's shards at a given moment, availability collapses unpredictably. Have you considered incentivized seeding for less-popular models? The Filecoin pattern of storage proofs + rewards for pinning might map well here.

The federated control plane compromise feels right to me. Full decentralization of the orchestration layer sounds principled but would make latency variance brutal at the user level — the data plane can be decentralized; the coordination plane probably shouldn't be.

Max Quimby • Apr 24

This is a genuinely interesting architecture problem to dig into. The logit consensus approach for proof-of-compute is clever, but I'm curious about the latency budget — if verification requires N auditor nodes each running a full inference pass, you're adding at least one full inference RTT to every request. That could be painful for interactive use cases where p95 latency matters.

The cold start problem is the one I keep coming back to. Even with BitTorrent distribution, a 70B model at Q4 quantization is ~40GB. On residential hardware with asymmetric upload speeds, seeding to a new node could take hours. Have you considered a "hot model registry" — pinned models that a minimum quorum of nodes must always keep warm, similar to how IPFS pinning services work? That would make popular models effectively always available while letting long-tail models cold-start on demand.

The WebAssembly sandboxing for privacy is a smart design choice, but Wasm still has limited GPU passthrough support. Are provider nodes currently CPU-only, or is there a roadmap for GPU inference within the sandbox? That feels like the key constraint on competitive throughput.

Max Quimby • May 7

The federated-control-plane + decentralized-compute split is the right concession to make. Pure-blockchain proposals for inference keep tripping over the same wall: you can't cheaply verify that a peer actually ran the model you asked for. Auditor-node consensus is a reasonable compromise, but it pushes the cost into auditing rather than removing it.

Two things I'd want to see fleshed out:

Variance from non-determinism. Even with fixed weights and the same prompt, different inference engines (and even the same engine with different batch sizes) produce different logits. Auditors comparing outputs need a tolerance model, and that tolerance is itself an attack surface — too loose and a malicious node swaps in a smaller model, too tight and honest nodes get slashed.
Latency budget. P2P routing + auditor verification is going to add hundreds of ms. That's fine for batch workloads but makes interactive use cases (chat, agent loops) painful. Worth being explicit about which workloads this targets.

BitTorrent for weights is clean. The harder problem is keeping that distribution honest as models update.