luckrig: a concept for tasting LLM rigs, not just models
HuggingFace Spaces lets you try models.
LMSys Arena lets you compare models.
Neither lets you try a specific rig.
Exact GPU. Exact quantization. Exact context length.
Someone's actual tuning notes — with your own prompt, right now.
That's the gap. luckrig is a concept to fill it.
If Arena maps models, luckrig maps the rigs.
| Service | What you taste | Hardware visible? |
|---|---|---|
| HF Spaces | The author's wrapped model | Whatever the author discloses |
| LMSys Arena | Blind A/B models | Model name. Nothing else. |
| AI Horde | Any worker that fits | Abstracted away |
| luckrig | A specific rig | GPU · quant · ctx · tuning |
AI Horde abstracts the worker away.
luckrig makes the hardware the star.
Access earned by contribution, not money.
Inspired by Hotline Connect, the late-1990s Mac P2P tool where
contribution score, not payment, determined access rights.
Register a node → write tuning notes → upload timing measurements.
That's how you earn access to other people's rigs.
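The register, notes, timings flow can be sketched as a simple score gate. This is a minimal sketch, not the POC's actual logic: the point values, the threshold, and the names `POINTS`, `ACCESS_THRESHOLD`, and `Contributor` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical point values and threshold; the post defines no numbers.
POINTS = {"register_node": 1, "tuning_note": 1, "timing_upload": 1}
ACCESS_THRESHOLD = 3


@dataclass
class Contributor:
    name: str
    score: int = 0

    def contribute(self, kind: str) -> None:
        # Raises KeyError for a contribution kind that is not listed in POINTS.
        self.score += POINTS[kind]

    def can_access_rigs(self) -> bool:
        # Access depends only on the contribution score; there is no payment path.
        return self.score >= ACCESS_THRESHOLD


alice = Contributor("alice")
for step in ("register_node", "tuning_note", "timing_upload"):
    alice.contribute(step)

print(alice.can_access_rigs())  # True
```

A newcomer who has only registered a node stays below the threshold in this sketch, which matches the idea that access is earned step by step.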
Three seed nodes exist in the POC — not yet public.
- first-5090-qwen3 — RTX 5090, Qwen3-35B-A3B, Q4_K_XL, 267 tok/s
- weekend-m3max — Apple M3 Max, Qwen2.5-14B, Q5_K_M
- shed-pi5 — Raspberry Pi 5, llama3.2-1B, 2.3 tok/s
These are local test nodes to demonstrate the concept.
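One way to picture a node record is a small descriptor holding the fields listed above. This is a sketch under assumptions: the class name `RigNode` and its field names are hypothetical, and values the list does not give (quantization for the Pi, speed for the M3 Max, context length for all three) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RigNode:
    """Hypothetical descriptor for one rig; field names are assumptions."""
    node_id: str
    hardware: str
    model: str
    quant: Optional[str] = None             # None where no quantization is listed
    tokens_per_sec: Optional[float] = None  # None where no speed is listed
    ctx_tokens: Optional[int] = None        # context length is not listed for any seed node

    def summary(self) -> str:
        # Join only the fields that are actually known.
        parts = [self.hardware, self.model]
        if self.quant is not None:
            parts.append(self.quant)
        if self.tokens_per_sec is not None:
            parts.append(f"{self.tokens_per_sec:g} tok/s")
        return f"{self.node_id}: " + " · ".join(parts)


SEED_NODES = [
    RigNode("first-5090-qwen3", "RTX 5090", "Qwen3-35B-A3B", "Q4_K_XL", 267.0),
    RigNode("weekend-m3max", "Apple M3 Max", "Qwen2.5-14B", "Q5_K_M"),
    RigNode("shed-pi5", "Raspberry Pi 5", "llama3.2-1B", tokens_per_sec=2.3),
]

for node in SEED_NODES:
    print(node.summary())
```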
Looking for early contributors who want to register a real node.
Rarity-first, not leaderboard.
The Pi node ranks higher than the 5090 because it's rarer.
Not a speed competition — a showcase of diversity.
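A rarity-first ordering could look like the sketch below. The scoring rule (one minus the hardware's share of the registry) and the registry counts are assumptions made up for illustration; the post does not say how rarity is computed.

```python
from collections import Counter

# Hypothetical registry: one hardware label per registered node (made-up counts).
registry = [
    "RTX 5090", "RTX 5090", "RTX 5090",
    "Apple M3 Max", "Apple M3 Max",
    "Raspberry Pi 5",
]


def rarity_ranking(hardware_labels):
    """Order hardware from rarest to most common; speed plays no role."""
    counts = Counter(hardware_labels)
    total = len(hardware_labels)
    # Rarity as 1 - share of the registry: rarer hardware scores closer to 1.
    scores = {hw: 1 - n / total for hw, n in counts.items()}
    # Sort by descending rarity, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda hw: (-scores[hw], hw))


ranking = rarity_ranking(registry)
print(ranking)  # ['Raspberry Pi 5', 'Apple M3 Max', 'RTX 5090']
```

With these made-up counts the Pi lands first and the 5090 last, even though the 5090 is by far the fastest node.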
Working POC. No external dependencies.
git clone https://github.com/prospectorlabs/luckrig
cd luckrig
npm start
→ http://127.0.0.1:8787
Concept + full spec + working code, all open.
https://github.com/prospectorlabs/luckrig
https://prospectorlabs.dev/luckrig/
Top comments (1)
The gap you're identifying is real, and I'd add one more dimension that hardware specs alone can't capture: inference quality consistency under realistic load patterns.
Two rigs with identical GPU/quant/ctx specs can behave very differently when you run 50 concurrent requests vs 5. The 5090 with Qwen3-35B might give you 267 tok/s in isolation but degrade to 180 tok/s with queue depth 10 — and the output quality often degrades before the throughput does. Smaller context requests start hallucinating earlier when KV cache is under pressure.
What I'd love to see luckrig expose (maybe as optional fields in tuning notes): sustained throughput at different concurrency levels, and — harder but more valuable — a quality degradation curve. At what queue depth does the rig start producing outputs that would fail a basic factual eval?
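As a rough sketch of the throughput half of that idea, a tuning note could carry tok/s samples keyed by concurrency level, from which retention is easy to derive. The function names and the dict shape are assumptions, the two sample values are the illustrative figures from the paragraph above (treating "in isolation" as depth 1), and the quality curve is not modeled here.

```python
from typing import Dict, Optional


def throughput_retention(samples: Dict[int, float]) -> Dict[int, float]:
    """Fraction of the lowest-concurrency throughput kept at each level."""
    baseline = samples[min(samples)]
    return {depth: tps / baseline for depth, tps in sorted(samples.items())}


def first_depth_below(samples: Dict[int, float], floor: float) -> Optional[int]:
    """Smallest measured depth whose retention falls under `floor`, else None."""
    for depth, kept in throughput_retention(samples).items():
        if kept < floor:
            return depth
    return None


# Illustrative figures only: 267 tok/s in isolation, 180 tok/s at queue depth 10.
samples = {1: 267.0, 10: 180.0}

print(round(throughput_retention(samples)[10], 2))  # 0.67
print(first_depth_below(samples, 0.8))              # 10
print(first_depth_below(samples, 0.5))              # None
```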
The Pi node rarity framing is clever. In practice, though, what matters most for production workloads is whether a rig maintains output quality at my expected p95 request rate. The rarity metric is interesting for discovery; quality-under-load is what determines whether a rig is actually usable.
Solid concept — the harness-aware comparison space is completely empty right now.