NeuralGridAI
How I built a GPU job matching system for decentralized AI inference

The Challenge
When you have hundreds of GPU nodes with different specs (VRAM, TFLOPS, models supported) scattered worldwide, how do you route an inference request to the right node in milliseconds?

This is the core engineering problem behind NeuralGrid, the decentralized GPU network I'm building. Here's how I solved it.

Architecture Overview
```
Client Request → API Gateway → Job Matcher → Node Selection → Inference → Response
                      ↓              ↓
                Auth + Rate     Score each node:
                  Limiting      - Available VRAM
                                - TFLOPS capacity
                                - Network latency
                                - Current load
```
The Matching Algorithm
Each node reports its specs when it joins the network:

```typescript
interface NodeSpec {
  gpu_model: string;                       // "RTX 4090", "A100", etc.
  vram_gb: number;                         // Available VRAM
  tflops: number;                          // Compute capacity
  hourlyRate: number;                      // Cost per hour (used in scoring)
  status: 'online' | 'busy' | 'offline';
}
```
When a job comes in, the matcher scores every online node:

```typescript
function scoreNode(node: NodeSpec, job: InferenceJob): number {
  if (node.status !== 'online') return -1;
  if (node.vram_gb < job.minVram) return -1;

  const vramScore = job.minVram / node.vram_gb;    // ≤ 1; highest when right-sized
  const computeScore = node.tflops / 100;          // Normalize TFLOPS
  const costScore = 1 / (node.hourlyRate + 0.01);  // Prefer cheaper

  return (vramScore * 0.3) + (computeScore * 0.5) + (costScore * 0.2);
}
```
The top-scoring node gets the job. If it fails, we cascade to the next one.
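The ranking-plus-cascade step can be sketched like this. The types are re-declared so the snippet is self-contained, and `dispatch`/`run` are names I made up for illustration; `run` is synchronous here for brevity, where the real inference call would be async.

```typescript
// Self-contained sketch of "score, rank, cascade on failure".
interface NodeSpec {
  gpu_model: string;
  vram_gb: number;
  tflops: number;
  hourlyRate: number;
  status: 'online' | 'busy' | 'offline';
}

interface InferenceJob { minVram: number; }

function scoreNode(node: NodeSpec, job: InferenceJob): number {
  if (node.status !== 'online') return -1;
  if (node.vram_gb < job.minVram) return -1;
  const vramScore = job.minVram / node.vram_gb;
  const computeScore = node.tflops / 100;
  const costScore = 1 / (node.hourlyRate + 0.01);
  return (vramScore * 0.3) + (computeScore * 0.5) + (costScore * 0.2);
}

// Rank eligible nodes by score, then walk down the list until one succeeds.
function dispatch(
  nodes: NodeSpec[],
  job: InferenceJob,
  run: (node: NodeSpec) => string,  // hypothetical inference call; throws on failure
): string {
  const ranked = nodes
    .map((node) => ({ node, score: scoreNode(node, job) }))
    .filter((entry) => entry.score >= 0)          // Drop ineligible nodes
    .sort((a, b) => b.score - a.score);           // Best score first

  for (const { node } of ranked) {
    try {
      return run(node);                           // First success wins
    } catch {
      continue;                                   // Node failed → cascade to next
    }
  }
  throw new Error('No node could serve the job');
}
```

The cascade keeps failure handling out of the scoring function: a node that errors at dispatch time is simply skipped, without needing its status flipped first.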

Lessons Learned

  1. Health checks matter more than you think
    Nodes go offline without warning. We ping every 30 seconds and mark unresponsive nodes as offline after 3 missed pings.

  2. Right-sizing beats max-sizing
    Sending a small Llama-7B job to an A100 wastes expensive compute. The VRAM score rewards nodes that are just big enough.

  3. Cold starts are the real latency killer
    Model loading takes 10-30 seconds. We keep track of which models are already loaded on each node to prefer "warm" nodes.
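The health-check bookkeeping from lesson 1 can be sketched as a small tracker. The 30-second interval and 3-missed-pings threshold come from the post; the class and method names are hypothetical, not NeuralGrid's actual code.

```typescript
// Tracks consecutive missed pings per node and marks a node offline
// after 3 misses, as described in the post.
const MAX_MISSED_PINGS = 3;

type NodeStatus = 'online' | 'offline';

class NodeHealthTracker {
  private missed = new Map<string, number>();
  private status = new Map<string, NodeStatus>();

  // A ping was answered: reset the miss counter and mark the node online.
  recordPong(nodeId: string): void {
    this.missed.set(nodeId, 0);
    this.status.set(nodeId, 'online');
  }

  // Called on each 30 s sweep for nodes that did not answer the ping.
  recordMiss(nodeId: string): void {
    const misses = (this.missed.get(nodeId) ?? 0) + 1;
    this.missed.set(nodeId, misses);
    if (misses >= MAX_MISSED_PINGS) this.status.set(nodeId, 'offline');
  }

  // Unknown nodes default to offline, so the matcher never picks them.
  statusOf(nodeId: string): NodeStatus {
    return this.status.get(nodeId) ?? 'offline';
  }
}
```

Counting consecutive misses rather than reacting to a single timeout keeps one dropped packet from taking a healthy node out of rotation.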

Tech Stack
- Frontend: React + TypeScript + Tailwind
- Backend: Supabase (Postgres + Edge Functions + Auth)
- Real-time: Supabase Realtime for node status updates
- API: OpenAI-compatible REST endpoints
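Because the endpoints are OpenAI-compatible, clients can send the standard chat-completions request shape. A minimal sketch of building such a request; the base URL and model name here are placeholders, not the real endpoint:

```typescript
// Build a standard OpenAI-style chat completion request.
// BASE_URL and the model name are placeholders for illustration only.
const BASE_URL = 'https://example.invalid/v1';

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    url: `${BASE_URL}/chat/completions`,   // OpenAI-compatible path
    body: { model, messages, max_tokens: 256 },
  };
}
```

Keeping the wire format OpenAI-compatible means existing SDKs can point at the network just by swapping the base URL.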
What's Next
I'm working on:

- Predictive routing: using historical data to pre-warm models on likely nodes
- Geographic awareness: routing to the nearest node to minimize network latency
- Reputation system: nodes build trust scores based on uptime and job completion rates

Try It
The platform is live at https://starshot-venture.lovable.app. You can:

- Browse the real-time network map
- Sign up and get API keys
- Deploy your own GPU node
If you're working on anything similar or have questions about the architecture, I'd love to hear from you in the comments.
