
Mohi Rostami

Why I'm Building a Decentralized AI Inference Protocol

I've spent the past decade building distributed infrastructure: bare-metal servers, multi-region Kubernetes clusters, P2P networking, and monitoring stacks. Most of that work was in blockchain infrastructure, managing hundreds of machines across multiple continents running validator nodes for various networks.

That background gave me a front-row seat to a problem that's been growing for years: we have all this distributed compute sitting idle, and meanwhile, AI inference is expensive, centralized, and gated.

So I started building the thing I couldn't find anywhere else.

The Problem

The numbers are hard to ignore.

On one side, developers and startups are spending $5–25 per million tokens on centralized inference providers. AI agent builders are burning through budgets. Hobbyists and researchers can't afford to experiment at scale. And everyone is locked into a handful of providers with opaque pricing.

On the other side, there's a massive amount of compute sitting underutilized around the world. Homelabbers with powerful machines. Ex-crypto miners with GPU rigs gathering dust. Small businesses with servers running at 10–20% capacity. Students with gaming rigs that only see full load a few hours a day.

The gap between idle compute and expensive inference is enormous, and there's no efficient way to bridge it.

What Exists Today (and Why It's Not Enough)

I'm far from the first person to notice this. The "decentralized AI" space has attracted serious talent and capital. But when you look closely at what's been built, a clear pattern emerges.

Petals (github.com/bigscience-workshop/petals) pioneered the BitTorrent-style approach: share GPU layers to collaboratively run large models. Brilliant concept. But when I checked their network recently, it was effectively empty. My take: no economic incentive to contribute meant nobody stuck around. The "if you build it, they will come" approach doesn't work for infrastructure that requires sustained participation.

Exo (github.com/exo-explore/exo) is genuinely impressive engineering for running models across devices on your local network. But that's also its key constraint: it's LAN-only. No internet routing, no discovery of remote nodes, no payments. Great for a home cluster; not a protocol you can build an economy around.

Parallax (github.com/GradientHQ/parallax) has arguably the best scheduler for distributed inference I've seen. Smart, well-engineered. But same story: no payment layer, no economic coordination. Without an incentive, there's no reason for strangers to contribute compute to your network.

Bittensor (github.com/opentensor/bittensor) has the largest decentralized AI network: 128 active subnets, real inference usage, $43M in Q1 2026 AI revenue. But it's built on a fundamentally different model: token emissions subsidize inference pricing, making it artificially cheap today.

Akash Network tackles decentralized compute, but at the raw GPU rental level. You get a virtual machine with a GPU; you still need to set up your own inference stack, handle scaling, and manage ops. It's cloud compute, not an inference protocol.

See the pattern? Every project solved either the technical problem (how to distribute inference across nodes) or the economic problem (how to incentivize participation), but nobody has shipped a working system where both work together. A coordination layer where discovery, routing, trust, and payments all function as one protocol doesn't exist.

That's the gap.

What I'm Building: Tooti

Tooti (tooti.network) is my attempt to build that missing coordination layer.

The name means "parrot". Parrots relay messages, and that's exactly what the protocol does: relay AI inference requests across a global network of nodes.

The Key Architecture Decision

Tooti is not an inference engine; it's the protocol layer. We don't touch CUDA kernels, tensor math, or model weights. We build the layer that sits on top of existing inference engines and handles everything around them: discovery, routing, trust, and payments.

The analogy I keep coming back to: Kubernetes doesn't run containers; Docker does. Kubernetes orchestrates. Tooti doesn't run inference; Ollama, vLLM, llama.cpp, and Exo do. Tooti orchestrates.

This means Tooti is backend-agnostic. Whether you're running a single Mac Mini with Ollama, a homelab cluster with Exo, or a datacenter rack with vLLM, you can join the same network with the same protocol. The inference engine is a plugin; the coordination is the product.

How It Works

The system has two core components:

Node Agent: installed by anyone who wants to provide compute and earn USDC. You point it at your local inference backend, configure which models you serve and at what price, and start it up:

```shell
tooti node start -file ./node.yaml
```

The node agent joins the P2P network via libp2p, announces your models, hardware specs, pricing, and current load. It handles incoming inference requests and streams responses back. You earn USDC for every request served.

Any device can be a node: a gaming PC, a homelab server, a Mac with Apple Silicon, a cloud GPU instance, even a Raspberry Pi running a small model. If it can run inference, it can join Tooti and earn.
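To make "announces your models, hardware specs, pricing, and current load" concrete, here's a hypothetical sketch of what such an announcement might carry. The field names are my invention for illustration; the real messages are Protocol Buffers, not JSON:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeAnnouncement is a hypothetical sketch of what a node agent
// might announce to the network: which models it serves, at what
// price, and how loaded it currently is.
type NodeAnnouncement struct {
	PeerID       string   `json:"peer_id"`
	Models       []string `json:"models"`         // e.g. "llama3.1:70b"
	PricePerMTok float64  `json:"price_per_mtok"` // USDC per million tokens
	VRAMGB       int      `json:"vram_gb"`
	LoadPct      int      `json:"load_pct"` // current utilization, 0-100
}

// encode renders the announcement as JSON for illustration only.
func encode(a NodeAnnouncement) string {
	b, _ := json.Marshal(a)
	return string(b)
}

func main() {
	fmt.Println(encode(NodeAnnouncement{
		PeerID:       "12D3KooWExample",
		Models:       []string{"llama3.1:70b"},
		PricePerMTok: 0.5,
		VRAMGB:       48,
		LoadPct:      15,
	}))
}
```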

Gateway: the API endpoint that consumers call. It's OpenAI-compatible, which means any developer using the OpenAI SDK (or any tool that calls OpenAI's API) can switch by changing a single URL:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:70b",
    "stream": true,
    "messages": [{"role": "user", "content": "hello"}]
  }'
```

Behind the scenes, the gateway queries the network for nodes that serve the requested model, scores them by latency, current load, and reputation, routes to the best one, and streams the response back. The consumer doesn't know or care that inference happened on a homelab in Berlin or a cloud GPU in Singapore. They just get a response.
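A node-scoring function along these lines is easy to imagine. This Go sketch is illustrative only: the weights, formula, and type names are mine, not Tooti's actual routing algorithm:

```go
package main

import "fmt"

// Candidate is a node that advertises the requested model.
type Candidate struct {
	ID         string
	LatencyMs  float64 // measured round-trip to the node
	LoadPct    float64 // 0-100 current utilization
	Reputation float64 // 0-1, derived from past request outcomes
}

// score returns a higher value for better candidates: low latency,
// low load, high reputation. Weights are made up for illustration.
func score(c Candidate) float64 {
	latency := 1.0 / (1.0 + c.LatencyMs/100.0) // 0..1, higher = faster
	load := 1.0 - c.LoadPct/100.0              // 0..1, higher = idler
	return 0.4*latency + 0.3*load + 0.3*c.Reputation
}

// pickBest returns the highest-scoring candidate.
func pickBest(cands []Candidate) Candidate {
	best := cands[0]
	for _, c := range cands[1:] {
		if score(c) > score(best) {
			best = c
		}
	}
	return best
}

func main() {
	nodes := []Candidate{
		{ID: "berlin-homelab", LatencyMs: 40, LoadPct: 70, Reputation: 0.9},
		{ID: "sg-cloud-gpu", LatencyMs: 120, LoadPct: 10, Reputation: 0.8},
	}
	fmt.Println("routing to:", pickBest(nodes).ID)
}
```

Note the trade-off this makes visible: the nearby node loses here because it's heavily loaded, which is exactly the kind of decision the gateway makes on every request.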

Payment is settled in USDC on Base L2. The node operator earns, the consumer pays, and the protocol takes no cut.

The Technical Stack

For those interested in the internals:

  • Language: Go, built for infrastructure, great concurrency model, easy cross-platform compilation
  • P2P Networking: libp2p, the same library powering IPFS and Filecoin. Handles peer discovery via DHT, NAT traversal with hole-punching, gossipsub for health updates, and relay fallback for nodes behind restrictive NATs
  • Coordination Messages: Protocol Buffers, typed, versioned, efficient
  • Payments: USDC on Base L2 via the x402 protocol, per-request settlement, no batching delays
  • Supported Backends: Ollama, vLLM, llama.cpp, Exo, and any server exposing an OpenAI-compatible API

What's Working Today

The core protocol is built and tested end-to-end over real internet:

  • Node agent with config-driven model advertisement and hardware detection
  • Full OpenAI-compatible API gateway with SSE streaming
  • Multi-node discovery with model-aware routing and latency/load/price scoring
  • Automatic failover: if a node fails mid-request, the gateway retries on the next-best node
  • Heartbeat-based health monitoring (30s intervals, 90s timeout for deregistration)
  • NAT traversal with hole-punching, tested behind typical home routers
  • x402 payment verification and settlement on Base (USDC)
  • Per-request pricing (operators set their own rates)
  • CLI for all operations
  • Tested with multiple nodes across different regions and networks

The project is open-source on GitHub: github.com/mrostamii/tooti

The Five Players in the Ecosystem

Tooti isn't just a two-sided marketplace. There are five distinct roles in the ecosystem:

  1. Consumers: developers, AI agents, and tools that need cheap inference. They call the OpenAI-compatible API and pay per request. They don't care about decentralization; they care about price, latency, and reliability.
  2. Node Providers: anyone with compute who wants to earn. From a single laptop running Ollama to a GPU datacenter running vLLM. Install the binary, set your price, earn USDC.
  3. Gateway Operators: the entities running the API endpoints. We'll run the official gateway, but the protocol is designed so anyone can run their own, with their own pricing, SLA guarantees, and brand. Think Infura vs. Alchemy vs. QuickNode for Ethereum: same network, different gateways.
  4. Model Creators: people who fine-tune and publish models. Today they upload to Hugging Face for free. Eventually, Tooti could give them a monetization channel: their models run on nodes, they earn a royalty per inference request. (This is a later-phase feature.)
  5. Integrators: tools that embed Tooti as a backend. An OpenClaw plugin that uses Tooti for economy-tier inference. A CLI tool that routes to Tooti. An AI agent framework that uses Tooti as its default LLM provider. Each integrator brings their entire user base to the network.

How You Can Contribute

I'm building this in the open because the best infrastructure protocols are built by their communities.

If you have compute, any device, from a laptop to a GPU rig, you'll be able to run a Tooti node and earn USDC once mainnet payments are live. Star the repo to get notified.

If you're a developer paying too much for inference, the API is OpenAI-compatible.

If you're a Go developer interested in P2P networking, distributed systems, or protocol design, the codebase is clean, well-structured, and there are open issues for every skill level.

If you know the decentralized AI space, I want to hear what I'm getting wrong. Open an issue, start a discussion, or DM me.

If you want to run an early node, I'm looking for node operators across different regions and hardware profiles to stress-test the network before public launch.

⭐ GitHub: github.com/mrostamii/tooti

🌐 Website: tooti.network (not live yet)
