Copyright (C) 2026 Concept House LLP
This document and the Meshcore architecture are licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
Meshcore: Decentralized Protocol for P2P Compute and LLM Inference (DePIN Architecture)
This document outlines the concept, architectural design, and system implementation plan for Meshcore—a hybrid Web2.5 infrastructure protocol built for distributed edge computing, autonomous AI agents, and hosting Large Language Models (LLMs).
1. Initial Concept: Core Problem & Vision
How do we build a truly decentralized P2P compute network? What is the ideal pragmatic approach, and what are the current bottlenecks?
Building a decentralized peer-to-peer cloud for high-performance computation is an engineering challenge of the highest order. If we are designing this infrastructure today—specifically targeting heavy workloads like local LLM inference (e.g., 30B+ parameter models) and autonomous AI agent operations—the architecture must be strictly pragmatic, bypassing pure "crypto-utopian" bottlenecks.
Here is the blueprint for an ideal, production-ready implementation:
1. The Ideal Architecture: A Pragmatic Hybrid
Total decentralization (where even routing and matchmaking live on a blockchain) is far too slow for high-performance data processing. The ideal solution is a hybrid (Web2.5) model: the Control Plane is federated or centralized for ultra-low latency, while the Data/Compute Plane remains completely P2P.
- Provider Nodes (Workers): A lightweight daemon running on client hardware. The "holy grail" hardware pool consists of machines with unified memory architecture (like Apple Silicon M-series chips), which possess immense memory bandwidth that typically sits idle. Workloads are strictly isolated using WebAssembly (WASM).
- The Orchestrator (Control Plane): The management node acts as an intelligent matchmaker, profiling the decentralized nodes in real-time (CPU, RAM/VRAM availability, uptime, latency) and routing inference tasks accordingly.
- Workload Distribution: Slicing a single LLM into layers across the internet (Tensor Parallelism) is physically impossible due to latency constraints. The architecture distributes tasks, not layers—routing a complete prompt to a specific node that already has the required model cached in RAM.
2. The Critical Barriers (What's Missing Today)
- Proof of Compute: How do we algorithmically prove that a node actually processed a prompt through a 32B model, rather than generating a cheap hallucination? Cryptographic trust models (FHE, ZK-SNARKs) are currently magnitudes too slow. We must rely on probabilistic consensus and selective auditing.
- Data Privacy: Enterprise and B2B sectors are hesitant to send RAG (Retrieval-Augmented Generation) data to unverified consumer hardware. Trusted Execution Environments (TEE) lack widespread hardware adoption in consumer hardware.
- The "Cold Start" Problem: Loading extreme model weights (40GB+) into RAM takes significant time. The router must implement predictive "pre-warming" strategies across the network swarm.
How do we host the underlying platform infrastructure?
The paradox of decentralized systems: the Control Plane must be hyperscaled, ultra-reliable, and lightning-fast.
- Infrastructure: Kubernetes deployed on Bare-Metal (Hetzner, OVH) to negate exorbitant cloud egress traffic fees.
- Edge Layer: Global Anycast (Cloudflare) to provide clients and workers with the lowest possible latency ingress points worldwide.
- Network Protocol: Node-to-Control-Plane telemetry uses bidirectional gRPC or WebSockets for continuous heartbeats.
- The Nervous System: NATS JetStream acts as the message broker (ideal for the ephemeral, IoT-like nature of dropping nodes), paired with an In-Memory state for live routing tables.
How are models hosted and attached to nodes?
The heaviest network bottleneck is transmitting the model weights (20-40 GB per checkpoint).
- Distribution: P2P payload delivery via embedded BitTorrent. The compute network itself acts as a massive CDN for its own model weights.
- Inference Engine:
llama.cpp(maximizing hardware democratization across CPU, RAM, and Apple Silicon) orvLLM(for enterprise Tensor Core nodes). All consumer-grade execution revolves around the highly optimized GGUF format. - Routing: Traffic follows either a Proxy pattern (tunneled securely through the network gateway) or an Ephesian Direct pattern (the platform issues a token and IP, allowing the client to establish a WebRTC connection directly to the worker hardware).
2. Technical Engineering Plan (System Architect Perspective)
Based on the core vision, Meshcore's architecture operates as a high-performance, hybrid graph-compute network.
2.1 Core Concept & Tech Stack
- Platform Type: DePIN Computing Network & Distributed Inference Grid.
- Model Format: Absolute focus on GGUF quantization running via
llama.cppfor consumer and premium M-Series tiers.- Reasoning: GGUF perfectly aligns with P2P ideology—it represents models as single files, makes distribution trivial, and seamlessly leverages Mac unified memory grids.
2.2 Control Plane Infrastructure (Orchestration)
The Mesh-Router operates as a microservice mesh.
- Core Hosting: Bare-Metal Kubernetes with unmetered 10 Gbps uplinks.
- In-Memory Router State & Matchmaking: A clustered Redis setup handles ephemeral TTL keys (node liveness), but actual payload routing is executed by a custom Golang Matchmaking Engine (In-Memory B-Tree). This engine spatially indexes nodes by ping, loaded model hash, and available RAM in real-time, bypassing the bottlenecks of standard Redis
SCANoperations across 10,000+ nodes. - Pub/Sub (Task Queues): NATS JetStream provides microsecond latency publish-subscribe patterns, expertly designed for ephemeral clients.
2.3 Worker Node Architecture (Provider Daemon)
The Meshcore-Daemon is a heavy-duty client application written in Rust or Golang.
- Protection against Memory Thrashing (OS Level): The daemon proactively reads OS-level metrics like Page Faults and Swap Activity. If the user’s operating system begins swapping the daemon’s memory to the SSD, the node immediately drops its
READYstatus to prevent severe SLA degradation during inference. - Containment & GPU IPC (Security): AI agent payloads and rogue client scripts execute strictly inside a WebAssembly (Wasmtime) sandbox. The Inference Engine (
llama.cpp) runs as an isolated native host service. Communication between the WASM sandbox and the GPU inference thread occurs via high-speed IPC sockets or a Shared Memory Buffer (mmap), guaranteeing malicious payloads cannot escape into host memory. - Model Distribution (P2P + Hybrid Fallback): The daemon utilizes an embedded BitTorrent protocol. However, to solve the "long-tail" availability problem (rare model fines-tunes lacking seeders), the node implements a Tiered Storage Overlay. If P2P swarm speed degrades below a functional threshold, the node transparently falls back to downloading the remaining chunks from the protocol's centralized cold storage (e.g., Cloudflare R2).
- Telemetry Connection: Bidirectional gRPC over mTLS ensures absolute prevention of MitM attacks and node telemetry spoofing.
2.4 Request Routing Protocols (Data Plane)
The network supports two primary traffic modes based on client SLA requirements:
- Strict Proxy (Secure Tunnel): Client
->Mesh Router->P2P Worker. Higher latency, but masks client IPs and provides a unified corporate endpoint. - Ephesian Direct (Low-Latency Agents): The Mesh Router provides the client with a JWT and the Worker's IP:Port. The client communicates directly with the hardware via WebRTC.
- Critical Engineering Upgrade: Because roughly 20% of residential networks sit behind Symmetric NATs (breaking standard STUN logic), the Edge layer hosts a high-throughput Coturn (TURN) cluster. If the client SDK fails to punch a direct WebRTC hole within 100ms, it executes a "Shadow Fallback" and seamlessly relays traffic through the TURN cluster.
2.5 Security & Proof of Compute Layer
To bypass the current computational impossibility of heavy cryptography like FHE, the network utilizes Auditor Nodes (Logit Consensus):
- The Control Plane injects "Shadow Prompts" dynamically into the traffic stream.
- A randomized 2% of client requests are parallelized across 3 highly-trusted, platform-owned Auditor Nodes.
- If a P2P worker returns a hallucinated response or its output logits diverge from the consensus baseline, the node triggers a Slashing Event: its rating resets to zero, its balance is frozen, and it is permanently banned from the routing pool.
3. Strategic Vision: The Ideal Tech World (The Roadmap)
If we remove all current hardware and engineering constraints and project 3-5 years into the future, here is how decentralized inference elegantly solves its core bottlenecks:
3.1 SLA Reliability: Predictive Routing & Speculative Execution
Instead of server-side sequential retries, the Meshcore client SDK will open a WebRTC swarm multi-plex directly to a pool of worker nodes.
- The Ideal: The SDK streams the prompt to 3-5 nodes concurrently. The node that streams the first valid token becomes the "Leader," seamlessly pausing (suspending) the others. If the Leader node stutters or drops out (e.g., >50ms inter-token latency), the SDK instantly failovers to the secondary node's stream (which already holds the prompt in its KV-cache). Network volatility is entirely negated by the mathematical redundancy of the swarm.
3.2 The Privacy Utopia: Fast FHE (Fully Homomorphic Encryption)
Hardware TEEs are frequently exploited. In the ideal world, the protocol relies purely on mathematics—Fully Homomorphic Encryption.
- The Ideal: The client encrypts the prompt and RAG payload before transmission. The P2P node receives cryptographic "noise" and performs matrix multiplications (LLM inference) directly on that noise. The host physically cannot decipher what it is processing. It returns encrypted output tokens, which the client locally unlocks. Once NPUs begin hardware-accelerating FHE, this will permanently disrupt the centralized cloud monopoly.
3.3 The "Cold Start" Solution: Base Model Architecture + LoRA Injection
Downloading 40GB arrays across a distributed network for every model update is irrational.
- The Ideal: The network mandates LoRA Routing. 100% of the network's nodes maintain a single, static "Foundation Layer" cached in RAM (e.g., Llama-3-Base). When a client requires specialized inference (e.g., legal or medical coding), they transmit only the Low-Rank Adaptation (LoRA) weights—a mere 50MB payload. The LoRA is injected into the foundation model in milliseconds, instantly terraforming the node into a domain-expert agent without massive network overhead.
3.4 Proof of Compute: zkML (Zero-Knowledge Machine Learning)
Shadow-prompting is a waste of global energy resources.
- The Ideal: Rather than recalculating tokens, the P2P node generates a ZK-SNARK (Zero-Knowledge Proof) concurrently with the inference pass. The Control Plane validates this ZK-Proof mathematically in one millisecond. This cryptographically guarantees that the node actually executed the matrix mathematics against the exact declared model weights, terminating any possibility of hallucination or weight substitution.
--
Links & Author
- GitHub Repository: Check out the Meshcore Repo
- Architected by Ilya Nikitin at Concept House.
Top comments (0)