Aman Sachan

Posted on Jun 11

How ComputePool allocates work across a peer-to-peer GPU mesh in under 50ms

#python #distributed #gpu #opensource

A centralized cloud makes job scheduling easy: one big scheduler, one big pool of identical machines, one big bill. A peer-to-peer mesh makes it interesting — every node is different (an RTX 4090 in Bangalore, a 16-core Threadripper in Berlin, a 4GB-VRAM laptop in a Tier-3 Indian town), they appear and disappear on Wi-Fi drops, and the "marketplace" is supposed to be the thing that makes all of this coherent. I built ComputePool as a hub-and-spoke orchestrator that pairs that messy physical reality with a typed, observable matching layer. Here's the part I found most fun to write: the scoring allocator.

The problem in one sentence

Match an incoming job (needs ≥ 16GB VRAM, ≥ 8 cores) to the best idle node in a fleet that is constantly churning — without doing a full DB roundtrip per heartbeat.

The hub is not the bottleneck — the registry is

Every node runs a small Python agent (node-repo/agent.py) that polls the hub every 10 seconds. Each poll reports six capacity fields:

capacity = {
    "gpu_vram":   24.0,   # GB, parsed from nvidia-smi or GPUtil
    "gpu_tier":   "rtx-4090",
    "cpu_cores":  32,
    "ram_gb":     64.0,
    "disk_gb":    412.3,
    "region":     "in",   # ISO-2, used for price normalization
}

The hub keeps a hot in-memory registry of every live node — not a Postgres table, not a Redis hash, just a Python dict keyed by node_id. The DB exists for billing and audit; the registry exists for speed. Heartbeats refresh a last_heartbeat timestamp; a background task in monitor_fleet flips any node silent for > 60s to offline and reclaims its in-flight jobs.

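import asyncio

async def monitor_fleet(self):
    while True:
        # Monotonic loop clock; last_heartbeat must be stamped with the same clock.
        now = asyncio.get_running_loop().time()
        for node_id, data in list(self.nodes.items()):
            # Skip nodes already offline so the warning fires once per timeout.
            if data.get("status") != "offline" and now - data["last_heartbeat"] > 60:
                logger.warning("Node %s timed out. Reclaiming...", node_id)
                data["status"] = "offline"
                # Re-queueing of the node's in-flight jobs is not shown in this excerpt.
        await asyncio.sleep(10)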

The whole match path lives in memory. Allocation latency is dominated by a linear scan over the self.nodes dict, so it grows with len(self.nodes); at 500 active nodes we're still inside 5ms on a single core.
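To make the heartbeat-and-sweep idea concrete, here is a minimal sketch of a dict-backed registry. It is not the code from the repo: the class name NodeRegistry, the methods heartbeat and sweep, the per-node "jobs" list, and the injectable clock are all assumptions for illustration. Only the 60-second timeout and the "plain dict keyed by node_id" design come from the text above.

```python
import time
from typing import Any, Callable, Dict, List

HEARTBEAT_TIMEOUT_S = 60  # same threshold the post uses for marking a node offline


class NodeRegistry:
    """Hypothetical in-memory registry: a plain dict keyed by node_id."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.nodes: Dict[str, Dict[str, Any]] = {}
        self._clock = clock  # injectable so the sweep can be exercised without sleeping

    def heartbeat(self, node_id: str, capacity: Dict[str, Any]) -> None:
        """Create or refresh a node entry; nothing on this path touches a database."""
        entry = self.nodes.setdefault(node_id, {"status": "idle", "jobs": []})
        entry["capacity"] = capacity
        entry["last_heartbeat"] = self._clock()
        if entry["status"] == "offline":
            entry["status"] = "idle"  # a node that comes back is schedulable again

    def sweep(self) -> List[str]:
        """Mark silent nodes offline and return the job ids that need re-queueing."""
        now = self._clock()
        reclaimed: List[str] = []
        for entry in self.nodes.values():
            silent = now - entry["last_heartbeat"] > HEARTBEAT_TIMEOUT_S
            if silent and entry["status"] != "offline":
                entry["status"] = "offline"
                reclaimed.extend(entry["jobs"])
                entry["jobs"] = []
        return reclaimed
```

Passing the clock in keeps the timeout logic testable: a test can advance a fake clock past 60 seconds and check that sweep() hands back the in-flight job ids exactly once.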

The weighted scoring algorithm

Here's where it gets opinionated. A simple "lowest price wins" is wrong for compute: a 4090 is 6× faster than a 3060 for the same job, so picking the 3060 to save 30% is a net loss. I score every idle node against the request and pick the highest scorer.

from typing import Any, Dict

def _calculate_score(self, capacity: Dict[str, Any],
                     requirements: Dict[str, Any]) -> float:
    """Add a fixed weight for each minimum the node satisfies."""
    score = 0.0
    # VRAM is the dominant cost driver, so weight it hard.
    if capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0):
        score += 10.0
    if capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0):
        score += 5.0
    return score

That looks boring. The interesting part is that the minimums here are typed — the orchestrator never allocates a job to a node that can't actually run it. Hardware affinity is a hard gate, not a hint. If you ask for 24GB VRAM and the best idle node has 12GB, allocate_task returns None and the job sits in the queue until something better comes online. Underprovisioning is the only failure mode worse than dropping a job.
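The scoring function on its own only adds weights, so the hard gate has to be applied somewhere around it. Below is a minimal sketch of one way to layer the gate on top of the score. The helper meets_minimums, the standalone allocate_task signature, and the node layout with "status" and "capacity" keys are assumptions for illustration, not the repo's actual code.

```python
from typing import Any, Dict, Optional


def meets_minimums(capacity: Dict[str, Any], requirements: Dict[str, Any]) -> bool:
    """Hard gate (hypothetical helper): every stated minimum must be satisfied."""
    return (
        capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0)
        and capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0)
    )


def calculate_score(capacity: Dict[str, Any], requirements: Dict[str, Any]) -> float:
    """Same weights as the post's scoring function: 10 for VRAM, 5 for cores."""
    score = 0.0
    if capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0):
        score += 10.0
    if capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0):
        score += 5.0
    return score


def allocate_task(nodes: Dict[str, Dict[str, Any]],
                  requirements: Dict[str, Any]) -> Optional[str]:
    """Return the id of the best idle node that passes the gate, or None."""
    best_id: Optional[str] = None
    best_score = float("-inf")
    for node_id, data in nodes.items():
        if data.get("status") != "idle":
            continue  # busy and offline nodes are never candidates
        capacity = data.get("capacity", {})
        if not meets_minimums(capacity, requirements):
            continue  # gate first: an under-provisioned node is skipped, not down-ranked
        score = calculate_score(capacity, requirements)
        if score > best_score:  # strict comparison keeps the first node on ties
            best_id, best_score = node_id, score
    return best_id
```

One consequence worth noticing: once the gate is applied, every surviving node scores the same 15.0 with these two weights, so the first idle match in dict order wins. Any real differentiation has to come from extra terms or a tie-breaker.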

Why a marketplace and not just a scheduler

Spot pricing sounds like a finance thing, but it's actually a load-shedding thing. When 400 nodes come online simultaneously in a single region after a thunderstorm, the price floor has to move or every job will stampede the cheapest node. The marketplace layer (backend/app/api/v1/endpoints/market.py) lets node providers post floors and users post ceilings, and the orchestrator only considers nodes whose floor is at or below the user's ceiling. This is also where the region multiplier lives: nodes in the "in" region (India) currently price at 0.7× of US/EU equivalents to reflect that 80% of organic demand is domestic. A 20% platform fee covers the hub's infra and the credit-to-INR cashout path (min ₹500 payout).
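As a rough illustration of the floor/ceiling intersection, here is a minimal sketch. It assumes a provider's floor is quoted in US/EU-equivalent credits and then scaled by the region multiplier; that ordering, the offer dict layout, and the function names are my assumptions for the example, not the contents of market.py. Only the 0.7 multiplier for "in" and the 20% fee come from the text above.

```python
from typing import Any, Dict, List

# Only the 0.7 entry for "in" is from the post; unknown regions default to 1.0.
REGION_MULTIPLIER: Dict[str, float] = {"in": 0.7}
PLATFORM_FEE = 0.20  # the 20% platform fee mentioned in the post


def effective_floor(base_floor: float, region: str) -> float:
    """Scale a provider's floor by its region multiplier (assumed application order)."""
    return base_floor * REGION_MULTIPLIER.get(region, 1.0)


def within_budget(offers: List[Dict[str, Any]], ceiling: float) -> List[str]:
    """Keep only the nodes whose effective floor is at or below the user's ceiling."""
    return [
        offer["node_id"]
        for offer in offers
        if effective_floor(offer["floor"], offer["region"]) <= ceiling
    ]


def provider_payout(price: float) -> float:
    """Credits the provider keeps after the platform fee is taken."""
    return price * (1.0 - PLATFORM_FEE)
```

With a ceiling of 0.8 credits, a node in "in" with a base floor of 1.0 stays eligible (effective floor 0.7), while a node with the same base floor in a region with no multiplier does not.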

The GPU tier table

The node agent sniffs the GPU at registration time and tags it with a tier string. This isn't a benchmark — it's a routing key.

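def get_gpu_info():
    # `name` is the GPU model string reported by nvidia-smi or GPUtil
    # (the lookup is elided in this excerpt). Returns (tier, VRAM in GB).
    if "4090" in name: return "rtx-4090", 24
    elif "5090" in name: return "rtx-5090", 32
    elif "3090" in name: return "rtx-3090", 24
    elif "4070" in name: return "rtx-4070", 12
    elif "3060" in name: return "rtx-3060", 12
    ...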

The dashboard at man44.zo.space/pool reads these tiers to color nodes on the mesh map. A future iteration will use them to bias the score: a 4090 should win ties against a 3060 even if both technically meet the minimum, because it's the same job done in a fraction of the wall-clock time.
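One simple way that tie-breaking could look is to rank candidates by a (score, tier rank) pair, so the tier only matters when scores are equal. This is a sketch of the idea, not shipped code: the TIER_RANK ordering and the function names are hypothetical, and only the tier strings come from the table above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical ordering, higher means preferred. Only the tier strings are from the post.
TIER_RANK: Dict[str, int] = {
    "rtx-3060": 1,
    "rtx-4070": 2,
    "rtx-3090": 3,
    "rtx-4090": 4,
    "rtx-5090": 5,
}


def ranking_key(score: float, gpu_tier: str) -> Tuple[float, int]:
    """Compare by score first; the tier rank only decides ties. Unknown tiers rank 0."""
    return (score, TIER_RANK.get(gpu_tier, 0))


def pick_best(candidates: Dict[str, Tuple[float, str]]) -> Optional[str]:
    """candidates maps node_id -> (score, gpu_tier); returns the winning node_id."""
    if not candidates:
        return None
    return max(candidates, key=lambda node_id: ranking_key(*candidates[node_id]))
```

Because tuples compare element by element, a node with a higher score still beats a faster card with a lower score; the tier is a routing key that settles ties, not a benchmark.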

What's not in production yet

Gaming workloads. The agent is built around subprocess.run and psutil — game streaming needs a real-time frame-bus, not a 10s poll. Probably a separate agent binary.
Sandboxing. Right now a job runs with the node owner's user permissions. I want to push node execution through a gVisor or Firecracker shim before I let untrusted code anywhere near a 4090.
Reputation decay. A node that returns success: false for three jobs in a row should drop in the score, not just be re-tried. I have the data, I haven't shipped the formula.

Try it

git clone https://github.com/AmSach/compute-pool
cd compute-pool
docker compose up backend frontend
# in another shell
HUB_URL=http://localhost:8000 python3 node-repo/agent.py

The dashboard is live at man44.zo.space/pool. Spin up a node, watch it appear on the mesh in < 15 seconds, and ping me on GitHub Issues if a job hangs — the logs are written to /dev/shm/computepool_*.log and are very chatty on purpose.

If you've built a peer-to-peer scheduler that does this differently, I'd love to compare notes. Especially around reputation decay: I think the answer is something between an exponential moving average and a Bayesian Beta-distribution estimate, but I haven't found a writeup that convinces me yet.
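For anyone who wants to compare the two candidates side by side, here is a minimal sketch of both updates. Neither is the formula ComputePool uses (none has shipped). The smoothing factor alpha = 0.3, the uniform Beta(1, 1) prior, and all names are assumptions picked for illustration.

```python
from dataclasses import dataclass


def ema_update(reputation: float, success: bool, alpha: float = 0.3) -> float:
    """Exponential moving average: recent outcomes count more, old ones fade."""
    outcome = 1.0 if success else 0.0
    return (1.0 - alpha) * reputation + alpha * outcome


@dataclass
class BetaReputation:
    """Beta-Bernoulli estimate of a node's success rate, starting from a uniform prior."""

    successes: float = 1.0  # prior pseudo-count (Beta(1, 1) is uniform)
    failures: float = 1.0

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1.0
        else:
            self.failures += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean of the success probability."""
        return self.successes / (self.successes + self.failures)
```

The trade-off shows up quickly in use: the EMA forgets old history at a fixed rate no matter how much evidence there is, while the Beta estimate weighs every job equally and moves less as a node accumulates history.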

#Python #FastAPI #DistributedSystems #GPU #OpenSource #BuildInPublic
