How ComputePool allocates work across a peer-to-peer GPU mesh in under 50ms
A centralized cloud makes job scheduling easy: one big scheduler, one big pool of identical machines, one big bill. A peer-to-peer mesh makes it interesting — every node is different (an RTX 4090 in Bangalore, a 16-core Threadripper in Berlin, a 4GB-VRAM laptop in a Tier-3 Indian town), they appear and disappear on Wi-Fi drops, and the "marketplace" is supposed to be the thing that makes all of this coherent. I built ComputePool as a hub-and-spoke orchestrator that pairs that messy physical reality with a typed, observable matching layer. Here's the part I found most fun to write: the scoring allocator.
The problem in one sentence
Match an incoming job (needs ≥ 16GB VRAM, ≥ 8 cores) to the best idle node in a fleet that is constantly churning — without doing a full DB roundtrip per heartbeat.
The hub is not the bottleneck — the registry is
Every node runs a small Python agent (node-repo/agent.py) that polls the hub every 10 seconds. Each poll reports a capacity payload with six fields:
capacity = {
    "gpu_vram": 24.0,        # GB, parsed from nvidia-smi or GPUtil
    "gpu_tier": "rtx-4090",
    "cpu_cores": 32,
    "ram_gb": 64.0,
    "disk_gb": 412.3,
    "region": "in",          # ISO-2, used for price normalization
}
The hub keeps a hot in-memory registry of every live node — not a Postgres table, not a Redis hash, just a Python dict keyed by node_id. The DB exists for billing and audit; the registry exists for speed. Heartbeats refresh a last_heartbeat timestamp; a background task in monitor_fleet flips any node silent for > 60s to offline and reclaims its in-flight jobs.
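async def monitor_fleet(self):
    while True:
        # last_heartbeat must be stamped with this same event-loop clock.
        now = asyncio.get_running_loop().time()
        # Iterate over a snapshot so heartbeats can add nodes during the scan.
        for node_id, data in list(self.nodes.items()):
            # Skip nodes already marked offline so the warning fires once.
            if data.get("status") != "offline" and now - data["last_heartbeat"] > 60:
                logger.warning(f"Node {node_id} timed out. Reclaiming...")
                data["status"] = "offline"
        await asyncio.sleep(10)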
The whole match path lives in memory. Allocation latency is dominated by a linear scan over the self.nodes dict; at 500 active nodes we're still inside 5ms on a single core.
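The heartbeat side of the registry is not shown above. Here is a minimal sketch of what it could look like, assuming a plain dict registry; record_heartbeat and stale_nodes are hypothetical helper names, not functions from the ComputePool repo. The clock value is passed in so it can match whatever clock monitor_fleet compares against.

```python
import time
from typing import Any, Dict, List, Optional

Registry = Dict[str, Dict[str, Any]]


def record_heartbeat(nodes: Registry, node_id: str,
                     capacity: Dict[str, Any],
                     now: Optional[float] = None) -> None:
    """Insert or refresh a node entry without touching the database."""
    if now is None:
        now = time.monotonic()
    # New nodes start out idle (an assumption of this sketch).
    entry = nodes.setdefault(node_id, {"status": "idle"})
    entry["capacity"] = capacity
    entry["last_heartbeat"] = now
    # A node that had timed out becomes schedulable again on its next poll.
    if entry["status"] == "offline":
        entry["status"] = "idle"


def stale_nodes(nodes: Registry, now: float,
                timeout: float = 60.0) -> List[str]:
    """Return ids of nodes silent for longer than `timeout` seconds."""
    return [node_id for node_id, data in nodes.items()
            if now - data["last_heartbeat"] > timeout]
```

Each heartbeat is a dict write, so the per-poll cost stays constant no matter how large the fleet is; only the timeout sweep and the allocator walk the whole registry.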
The weighted scoring algorithm
Here's where it gets opinionated. A simple "lowest price wins" is wrong for compute: a 4090 is 6× faster than a 3060 for the same job, so picking the 3060 to save 30% is a net loss. I score every idle node against the request and pick the highest scorer.
def _calculate_score(self, capacity: Dict[str, Any],
                     requirements: Dict[str, Any]) -> float:
    score = 0.0
    # VRAM is the dominant cost driver, so weight it hard.
    if capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0):
        score += 10.0
    if capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0):
        score += 5.0
    return score
That looks boring. The interesting part is that the minimums here are typed — the orchestrator never allocates a job to a node that can't actually run it. Hardware affinity is a hard gate, not a hint. If you ask for 24GB VRAM and the best idle node has 12GB, allocate_task returns None and the job sits in the queue until something better comes online. Underprovisioning is the only failure mode worse than dropping a job.
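The score function on its own only adds points, so the hard gate has to live in the allocation path. A minimal sketch of that path, assuming the registry layout described earlier (a dict of node_id to status and capacity); the body of allocate_task here is my guess at the shape, not the code from the repo, and meets_minimums is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def meets_minimums(capacity: Dict[str, Any],
                   requirements: Dict[str, Any]) -> bool:
    """Hard gate: every stated minimum must be satisfied."""
    return (capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0)
            and capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0))


def calculate_score(capacity: Dict[str, Any],
                    requirements: Dict[str, Any]) -> float:
    """Same weights as _calculate_score in the post."""
    score = 0.0
    if capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0):
        score += 10.0
    if capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0):
        score += 5.0
    return score


def allocate_task(nodes: Dict[str, Dict[str, Any]],
                  requirements: Dict[str, Any]) -> Optional[str]:
    """Return the id of the best idle node, or None if nothing qualifies."""
    best_id, best_score = None, float("-inf")
    for node_id, data in nodes.items():
        if data.get("status") != "idle":
            continue  # busy and offline nodes are never candidates
        if not meets_minimums(data["capacity"], requirements):
            continue  # hard gate, not a hint
        score = calculate_score(data["capacity"], requirements)
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id
```

One consequence worth noticing: once the gate is applied, every surviving node scores 15.0, so the first idle node in dict order wins. The score only starts to discriminate when more terms are added to it.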
Why a marketplace and not just a scheduler
Spot pricing sounds like a finance thing, but it's actually a load-shedding thing. When 400 nodes come online simultaneously in a single region after a thunderstorm, the price floor has to move or every job will stampede the cheapest node. The marketplace layer (backend/app/api/v1/endpoints/market.py) lets node providers post floors and users post ceilings, and the orchestrator only considers nodes whose floor is at or below the user's ceiling. This is also where the region multiplier lives: nodes in the "in" region (India) currently price at 0.7× their US/EU equivalents to reflect that 80% of organic demand is domestic. A 20% platform fee covers the hub's infra and the credit-to-INR cashout path (min ₹500 payout).
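To make the floor/ceiling filter concrete, here is a rough sketch. Only the 0.7 multiplier and the 20% fee come from the text above; the offer layout, the function names, the idea that the multiplier scales a posted base floor, and the choice to deduct the fee from the provider's payout are all assumptions for illustration, not what market.py actually does.

```python
from typing import Any, Dict, List

REGION_MULTIPLIER = {"in": 0.7}  # from the post; other regions assumed 1.0
PLATFORM_FEE = 0.20              # from the post


def effective_floor(base_floor: float, region: str) -> float:
    """Scale a provider's base floor by its region multiplier (assumed)."""
    return base_floor * REGION_MULTIPLIER.get(region, 1.0)


def within_budget(offers: List[Dict[str, Any]], ceiling: float) -> List[str]:
    """Return ids of nodes whose regional floor fits under the ceiling."""
    return [offer["node_id"] for offer in offers
            if effective_floor(offer["base_floor"], offer["region"]) <= ceiling]


def provider_payout(price: float) -> float:
    """Credits left for the provider after the platform fee (assumed split)."""
    return price * (1.0 - PLATFORM_FEE)


# Hypothetical offers, priced in credits per hour.
offers = [
    {"node_id": "a", "base_floor": 10.0, "region": "in"},
    {"node_id": "b", "base_floor": 10.0, "region": "us"},
    {"node_id": "c", "base_floor": 6.0, "region": "eu"},
]
eligible = within_budget(offers, ceiling=8.0)
```

Running the price filter before scoring keeps the scan cheap: nodes outside the user's budget never reach the scoring step.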
The GPU tier table
The node agent sniffs the GPU at registration time and tags it with a tier string. This isn't a benchmark — it's a routing key.
def get_gpu_info():
    # name holds the detected GPU model string (detection elided in this excerpt)
    if "4090" in name: return "rtx-4090", 24
    elif "5090" in name: return "rtx-5090", 32
    elif "3090" in name: return "rtx-3090", 24
    elif "4070" in name: return "rtx-4070", 12
    elif "3060" in name: return "rtx-3060", 12
    ...
The dashboard at man44.zo.space/pool reads these tiers to color nodes on the mesh map. A future iteration will use them to bias the score: a 4090 should win ties against a 3060 even if both technically meet the minimum, because it finishes the same job in a fraction of the wall-clock time.
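One way that tier bias could work is a small bonus that breaks ties without ever outweighing a requirement. The sketch below is hypothetical: the TIER_BONUS values and their ordering are placeholders I made up, not benchmarks and not shipped code. The only real constraint is that each bonus stays below the smallest requirement weight (5.0).

```python
# Placeholder tie-break weights per tier string; values are illustrative only.
TIER_BONUS = {
    "rtx-5090": 0.9,
    "rtx-4090": 0.8,
    "rtx-3090": 0.6,
    "rtx-4070": 0.4,
    "rtx-3060": 0.2,
}


def biased_score(base_score: float, gpu_tier: str) -> float:
    """Add a sub-1.0 tier bonus so faster cards win ties between equal scores."""
    return base_score + TIER_BONUS.get(gpu_tier, 0.0)


# Two nodes that both meet the minimums (base score 15.0 each).
fast = biased_score(15.0, "rtx-4090")
slow = biased_score(15.0, "rtx-3060")
```

Because every bonus is below 1.0, a node that misses a 5-point or 10-point requirement can never overtake one that meets it, whatever its tier. Unknown tier strings get no bonus.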
What's not in production yet
- Gaming workloads. The agent is built around subprocess.run and psutil; game streaming needs a real-time frame-bus, not a 10s poll. Probably a separate agent binary.
- Sandboxing. Right now a job runs with the node owner's user permissions. I want to push node execution through a gVisor or Firecracker shim before I let untrusted code anywhere near a 4090.
- Reputation decay. A node that returns success: false for three jobs in a row should drop in the score, not just be re-tried. I have the data, I haven't shipped the formula.
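For the reputation item, two standard candidates are easy to sketch side by side: an exponential moving average over job outcomes, and the posterior mean of a Beta distribution over the success rate. Neither is the shipped formula (there isn't one yet); the function names, the alpha value, and the uniform prior are assumptions chosen for illustration.

```python
def ema_update(reputation: float, success: bool, alpha: float = 0.3) -> float:
    """Exponential moving average: recent outcomes dominate, old ones decay."""
    outcome = 1.0 if success else 0.0
    return (1.0 - alpha) * reputation + alpha * outcome


def beta_mean(successes: int, failures: int,
              prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of a Beta(prior_a, prior_b) prior after the observed jobs."""
    return (successes + prior_a) / (successes + failures + prior_a + prior_b)


# A node with a clean record that then fails three jobs in a row.
rep = 1.0
for _ in range(3):
    rep = ema_update(rep, success=False)
```

The trade-off between the two: the EMA forgets history at a fixed rate, so a long-reliable node is punished as quickly as a new one, while the Beta mean moves slowly once a node has many jobs behind it and never forgets unless the counts are decayed separately.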
Try it
git clone https://github.com/AmSach/compute-pool
cd compute-pool
docker compose up backend frontend
# in another shell
HUB_URL=http://localhost:8000 python3 node-repo/agent.py
The dashboard is live at man44.zo.space/pool. Spin up a node, watch it appear on the mesh in < 15 seconds, and ping me on GitHub Issues if a job hangs — the logs are written to /dev/shm/computepool_*.log and are very chatty on purpose.
If you've built a peer-to-peer scheduler that does this differently, I'd love to compare notes. Especially around reputation decay — I think the answer is something between an exponential moving average and a Bayesian beta, but I haven't found a writeup that convinces me yet.