Aman Sachan

Posted on Jun 11 • Originally published at github.com

I built a distributed compute grid where your idle laptop runs ML jobs — the orchestrator behind it

#distributedsystems #machinelearning #python #showdev

I built a distributed compute grid where your idle laptop runs ML jobs — the orchestrator behind it

The pitch: a single FastAPI hub takes compute jobs from ML researchers, and a fleet of home PCs and gaming rigs (RTX 4090s, M2 MacBooks, anything with a GPU and a Python interpreter) polls in, picks up work, and ships results back. A 20% platform fee funds the hub. An interactive dashboard shows the mesh in real time.

I have been living inside this codebase for a few weeks. This post is about the part that actually determines whether the thing works or does not — the orchestrator. No frontend, no marketing — just the brain.

Live dashboard: man44.zo.space/compute-pool
Repo: github.com/AmSach/ComputePool-Grid

The problem with "dumb" schedulers

The first version of ComputeOrchestrator had a one-line bug that took down a 12-node stress test. Two jobs hit the hub at the same millisecond. Both saw the same node as idle. Both wrote busy to the same row. One node ended up double-allocated, the other starved, and the test logs looked like a hostage negotiation.

The fix had to be three things at once:

A scoring function that picks the right node, not just the first idle one.
An async lock so concurrent submissions cannot race on a single node.
A heartbeat monitor that reclaims nodes that ghosted.

Here is what it looks like now.

The scoring algorithm

def _calculate_score(self, capacity: Dict[str, Any], requirements: Dict[str, Any]) -> float:
    """Heuristic for node-task matching."""
    score = 0.0
    if capacity.get("gpu_vram", 0) >= requirements.get("min_vram", 0):
        score += 10.0
    if capacity.get("cpu_cores", 0) >= requirements.get("min_cores", 0):
        score += 5.0
    return score

The weights are deliberately lopsided. A node that satisfies a job's VRAM requirement gets a 2x bonus over a node that just barely has enough cores. The intuition: GPU work is the long pole. If you cannot fit the model in VRAM, nothing else matters, no matter how many cores you have. A job that needs 24GB on a 16GB GPU gets scored as a hard fail and is skipped, even if the node is otherwise idle.

The scoring layer is intentionally simple. I tried a more elaborate weighted sum with reliability scores, regional latency, and price-per-hour, and it made the correct answer harder to reason about without meaningfully changing which node was picked. The simpler version is easier to A/B test, and the more elaborate version is on the roadmap once there is a real fleet to train on.

The async lock

The race condition above is a textbook distributed-systems bug. In a single-process FastAPI app you can dodge it with asyncio.Lock, but a lock that only protects Python state does not survive a hub restart. For a network this small, the right tradeoff is: in-process lock for the hot path, a database-level SELECT ... FOR UPDATE SKIP LOCKED for durability.

from sqlalchemy import text

async def allocate_task(self, task_id: str, requirements: Dict[str, Any]):
    async with self._allocation_lock:                 # in-process guard
        async with self._db.begin() as tx:
            row = await tx.execute(
                text("""
                    SELECT id FROM nodes
                    WHERE status = 'idle'
                      AND gpu_vram >= :min_vram
                      AND cpu_cores >= :min_cores
                    ORDER BY last_score DESC
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                """),
                requirements,
            )
            node_id = row.scalar_one_or_none()
            if not node_id:
                return None
            await tx.execute(
                text("UPDATE nodes SET status='busy' WHERE id=:id"),
                {"id": node_id},
            )
        self.active_jobs[task_id] = node_id
        return node_id

Two things to notice:

The asyncio.Lock keeps the Python-side state coherent during a burst. Two coroutines entering allocate_task simultaneously queue up instead of stomping each other's in-memory self.nodes map.
FOR UPDATE SKIP LOCKED is the real safety net. It tells Postgres "give me one row that nobody else is currently holding, and skip any row another transaction has locked." That is the entire reason the stress test passes now.

The heartbeat reclaim loop

Nodes crash. Laptops sleep. Wi-Fi drops. If the hub trusts the last status it saw, jobs vanish into a black hole. So we run a coroutine that walks the fleet every 10 seconds:

async def monitor_fleet(self):
    while True:
        now = asyncio.get_event_loop().time()
        for node_id, data in list(self.nodes.items()):
            if now - data["last_heartbeat"] > 60:
                logger.warning(f"Node {node_id} timed out. Reclaiming...")
                data["status"] = "offline"
        await asyncio.sleep(10)

The 60-second timeout is a 30-second grace on top of the agent's 30-second heartbeat interval. Without that grace, heartbeat jitter from a slow network connection would mark healthy nodes offline. With it, we tolerate a missed heartbeat or two but still reclaim a node that genuinely disappeared within about a minute. (Full code: backend/app/core/orchestrator.py.)

The reclaim path also releases any job the dead node was holding back to the queue, so the work gets re-dispatched to a healthy node. That part is in agent.py and a bit gnarly — happy to do a follow-up post on it if there is interest.

The spot-pricing layer

Above the orchestrator sits a thin market (backend/app/api/v1/endpoints/market.py). Each node can post an offer: "I have 24GB VRAM, charge ₹40/hour." Job submitters specify a requirement, not a node. The market returns the cheapest offer that meets the bar, and the orchestrator only sees the resulting pair.

def get_best_price(self, requirements):
    suitable = [o for o in self.offers if self._matches(o.capacity, requirements)]
    if not suitable:
        return None
    return min(suitable, key=lambda x: x.price_per_hour)

The interesting design call is that the market never knows about the orchestrator. It just answers "cheapest matching offer." That keeps the unit test surface tiny (it is 8 lines) and lets the orchestrator be replaced without rewriting the price logic.

What surprised me

Locks are cheap, race conditions are expensive. The cost of acquiring asyncio.Lock is unmeasurable next to the cost of a DB roundtrip. But the cost of a double-allocated node is one confused ML researcher and a Slack thread that takes 20 minutes to debug. The lock is the highest-leverage code in the whole project.

The scoring function is a UX decision. I keep wanting to treat it as engineering, but it is product. If I score VRAM too aggressively, a researcher running a CPU-only data-prep job will wait forever while an idle RTX 4090 sits there. If I score it too lightly, a 70B model gets dropped on a 12GB card. The "correct" scoring depends on what kind of jobs you want to encourage. I have been tuning it for ML jobs; if the use case shifts, the weights shift too.

Heartbeats are a contract, not a value. A node sending heartbeats every 30 seconds is a feature, not a bug. The orchestrator is the one that needs to be forgiving, not the agent. Hardcode the agent to be chatty; tune the orchestrator to be patient.

Try it

Live dashboard: man44.zo.space/compute-pool — interactive mesh view, current node count, GPU tier mix, recent jobs.
Repo + quickstart: github.com/AmSach/ComputePool-Grid — pip install -r backend/requirements.txt && python3 -m app.main and you have a hub. python3 backend/agent.py and you have a node.
Simulate a fleet without hardware: python3 scripts/spawn_fleet.py boots 5 synthetic nodes (RTX 4090, 3090, 4070, 3060, CPU-only) so you can watch the orchestrator pick winners in real time.

If you build a node agent for a hardware class I do not support yet (Apple Silicon, Jetson, Raspberry Pi), I would love a PR — agent.py is intentionally tiny so it is easy to fork.

Built for the Stardance Challenge. Distributed systems are just three locks and a heartbeat away.

Python #FastAPI #DistributedSystems #OpenSource #GPU

DEV Community

I built a distributed compute grid where your idle laptop runs ML jobs — the orchestrator behind it

I built a distributed compute grid where your idle laptop runs ML jobs — the orchestrator behind it

The problem with "dumb" schedulers

The scoring algorithm

The async lock

The heartbeat reclaim loop

The spot-pricing layer

What surprised me

Try it

Python #FastAPI #DistributedSystems #OpenSource #GPU

Top comments (0)