Training a 100-billion-parameter language model has, until recently, been the exclusive domain of organizations with access to tightly coupled, high-bandwidth GPU clusters — the kind that cost tens of millions of dollars to build and operate. Macrocosmos, a team building on the Bittensor decentralized AI network, just published results showing that this assumption is no longer strictly true.
Their Orion-100B project completed a full distributed pretraining run of a 100B-parameter model across nodes spread over five U.S. datacenters, connected via the public internet. The result is not a toy experiment: the system achieved 30–38% Model FLOPs Utilization (MFU) and ran at roughly 65% of the throughput of an equivalent centralized setup, while allowing individual participants to contribute compute for as little as $1.25 per hour.
Here is what they actually built, and why the engineering choices matter.
The Core Problem: Why Distributed Training Is Hard
Most large-scale model training relies on distributed data parallelism (DDP): each node holds a full copy of the model, processes a different batch of data, and synchronizes gradients across all nodes after each step. DDP works well when nodes are colocated in the same datacenter with high-bandwidth interconnects (NVLink, InfiniBand), but it has a critical weakness for heterogeneous, internet-connected setups: the system's effective capacity is bounded by the memory of the smallest participating node. A single underpowered machine can bottleneck the entire run.
Macrocosmos chose a different approach: distributed pipeline parallelism (DPP). Instead of replicating the full model across nodes, DPP shards the model's layers across multiple machines, with each node responsible for a contiguous slice of the network. Data flows through the pipeline sequentially — node 1 processes the first set of layers, passes activations to node 2, and so on. The total model capacity scales with the aggregate memory of all participants, not the minimum.
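The capacity difference between the two schemes comes down to a minimum versus a sum. The sketch below illustrates that with a hypothetical mixed fleet; the node sizes are invented for the example, and the calculation ignores optimizer state and activation memory.

```python
def ddp_max_model_gb(node_memory_gb):
    """Data parallelism: every node holds a full replica,
    so the smallest node sets the limit."""
    return min(node_memory_gb)


def dpp_max_model_gb(node_memory_gb):
    """Pipeline parallelism: each node holds one slice of the layers,
    so the capacities of all nodes add up."""
    return sum(node_memory_gb)


# Hypothetical fleet, GB of accelerator memory per node (not from the Orion-100B run).
fleet = [80, 80, 40, 24]

print(ddp_max_model_gb(fleet))  # 24
print(dpp_max_model_gb(fleet))  # 224
```

In this toy fleet, the single 24 GB node caps a data-parallel run at 24 GB of model state, while the same four nodes offer 224 GB in aggregate when the model is sharded by layer.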
For Orion-100B, the team configured 16 pipeline stages across 48 devices (16 stages × 3 replicas each), all running on Nvidia A100 80GB GPUs distributed across five non-colocated datacenters.
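To make the 16 × 3 layout concrete, here is a minimal sketch of how layers might be split into contiguous slices and mapped onto a stage-by-replica grid. The stage and replica counts come from the article; the layer count of 96 and the function name `partition_layers` are assumptions for illustration only.

```python
def partition_layers(n_layers, n_stages):
    """Split layer indices 0..n_layers-1 into n_stages contiguous,
    near-equal slices (earlier stages absorb any remainder)."""
    base, extra = divmod(n_layers, n_stages)
    slices, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


N_STAGES, N_REPLICAS = 16, 3  # from the article
N_LAYERS = 96                 # hypothetical: the layer count is not given

slices = partition_layers(N_LAYERS, N_STAGES)
devices = [(stage, replica)
           for stage in range(N_STAGES)
           for replica in range(N_REPLICAS)]

print(len(devices))     # 48
print(list(slices[0]))  # [0, 1, 2, 3, 4, 5]
```

Each `(stage, replica)` pair stands for one GPU: the three replicas of a stage hold the same slice of layers, which is what later allows traffic to be rerouted when one of them drops out.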
The Bandwidth Problem — and How ResBM Solves It
Pipeline parallelism introduces its own bottleneck: every time activations pass from one pipeline stage to the next, those tensors must travel over the network. In a standard setup, transferring activations between stages for a 100B-parameter model requires moving roughly 140.6 MB per step. Over a public internet connection, that is prohibitive.
Macrocosmos addressed this with ResBM activation compression, a technique applied specifically to the inter-stage activation tensors. The project describes it as lossless; at this ratio that presumably means no measurable loss in training quality rather than bit-exact reconstruction of the tensors. ResBM reduced the transfer size from 140.6 MB to 2.2 MB, a roughly 64× reduction, which brings the bandwidth requirements within reach of commodity internet connections (the cluster's median upload speed was 856 Mbps, download 1,322 Mbps).
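A quick back-of-the-envelope calculation shows what those figures mean per transfer. This sketch uses only the numbers quoted above and assumes decimal units (1 MB = 8 megabits) with no latency or protocol overhead, so real transfers would be somewhat slower.

```python
def transfer_seconds(size_mb, link_mbps):
    """Seconds to send size_mb megabytes over a link of link_mbps
    megabits per second, ignoring latency and protocol overhead."""
    return size_mb * 8 / link_mbps


RAW_MB, COMPRESSED_MB = 140.6, 2.2  # from the article
UPLOAD_MBPS = 856                   # median upload speed from the article

ratio = RAW_MB / COMPRESSED_MB
print(f"compression ratio: {ratio:.1f}x")                                    # 63.9x
print(f"raw:        {transfer_seconds(RAW_MB, UPLOAD_MBPS):.3f} s")          # 1.314 s
print(f"compressed: {transfer_seconds(COMPRESSED_MB, UPLOAD_MBPS):.3f} s")   # 0.021 s
```

Under these assumptions an uncompressed hand-off costs over a second per stage boundary at the median upload speed, while the compressed one takes about 20 milliseconds.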
This is arguably the most important technical contribution of the project. Without it, the communication overhead would dominate training time and make the approach impractical. With it, the system can sustain meaningful throughput even across geographically distributed nodes.
Keeping the Pipeline Coherent: IOTA and Stochastic Pathfinding
Running a pipeline across non-colocated, potentially unreliable nodes requires solving two additional problems: synchronizing model weights across replicas, and handling node failures gracefully.
For synchronization, Macrocosmos built the IOTA Bridge Service, which manages distributed variable synchronization across pipeline stages. The system runs 10 inner gradient accumulation steps (H=10) per synchronization cycle. Synchronization time scales inversely with the number of pipeline stages, presumably because each stage only has to synchronize its own slice of the weights across its replicas, so this design keeps sync overhead low. The team estimates that with further tuning (pseudogradient compression, H=100), synchronization could be reduced to just 0.5% of total training time, pushing utilization toward 97.8%.
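The effect of H and of the stage count can be illustrated with a simple cost model. This is a sketch under stated assumptions, not the team's own accounting: it treats each cycle as H local steps followed by one blocking sync, and the step and sync durations are placeholder values chosen only to show the trend.

```python
def sync_fraction(inner_steps, step_seconds, sync_seconds):
    """Share of wall-clock time spent synchronizing when one blocking
    sync follows every `inner_steps` local steps."""
    cycle = inner_steps * step_seconds + sync_seconds
    return sync_seconds / cycle


def params_per_stage(total_params, n_stages):
    """Parameters each stage holds (and therefore has to synchronize),
    assuming an even split across stages."""
    return total_params / n_stages


# 100B parameters over 16 stages (both figures from the article).
print(params_per_stage(100e9, 16))  # 6250000000.0

# Placeholder timings, NOT measurements from the run.
STEP_S, SYNC_S = 1.0, 2.0
print(f"H=10:  {sync_fraction(10, STEP_S, SYNC_S):.1%}")   # 16.7%
print(f"H=100: {sync_fraction(100, STEP_S, SYNC_S):.1%}")  # 2.0%
```

In this model, raising H spreads a fixed sync cost over more local steps, and adding stages shrinks the per-stage payload; compressing the pseudogradients would lower `sync_seconds` further.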
For fault tolerance, the system uses a stochastic pathfinding algorithm (co-developed with Bittensor Subnet 1) that dynamically reroutes data flow when a node drops out, maintaining training coherence without requiring a full restart.
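The article does not describe the pathfinding algorithm itself, so the following is only an illustrative stand-in: it picks one live replica per stage uniformly at random, which is enough to show how a batch can route around a dropped node. The function name `sample_path` and the data layout are hypothetical.

```python
import random


def sample_path(live_replicas, rng=random):
    """Pick one live replica per stage uniformly at random.

    live_replicas: list indexed by stage, each entry a set of replica ids
    that are currently reachable. Returns a list of (stage, replica)
    pairs, or None if some stage has no live replica left.
    """
    path = []
    for stage, replicas in enumerate(live_replicas):
        if not replicas:
            return None  # pipeline is broken at this stage
        path.append((stage, rng.choice(sorted(replicas))))
    return path


# 16 stages with 3 replicas each, as in the Orion-100B layout.
live = [{0, 1, 2} for _ in range(16)]
live[4].discard(1)  # simulate replica 1 of stage 4 dropping out

path = sample_path(live)
print(path[4])  # stage 4 is served by replica 0 or 2, never 1
```

Because every batch samples a fresh path, losing a node only shrinks the set of candidates for its stage; training stalls only if every replica of a single stage is unavailable.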
What the Numbers Actually Mean
The headline metrics from the Orion-100B run:
- MFU: 30.8% sustained, 38% peak
- Throughput: ~65% of an equivalent centralized datacenter setup
- Entry cost: $1.25/hr for a single contributing node, or about $20/hr for 16 non-colocated A100s, compared with about $50/hr for an enterprise 8×B200 peer
A 30% MFU is not exceptional by datacenter standards: well-optimized centralized runs on H100s can reach 50–60% MFU. But the relevant comparison here is not a centralized cluster. For organizations that cannot afford or access a dedicated GPU cluster, the realistic alternative is not training at all, and 65% of datacenter throughput at a fraction of the cost is a meaningful option.
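To put the MFU and cost figures in absolute terms, the sketch below converts them into per-GPU throughput and hourly spend. It assumes MFU is measured against the A100's dense BF16 tensor-core peak of 312 TFLOPS (the article does not state the precision used) and that every device is billed at the same $1.25/hr entry rate.

```python
A100_PEAK_TFLOPS = 312.0  # NVIDIA's published dense BF16 peak; assumed baseline


def achieved_tflops(mfu, peak_tflops=A100_PEAK_TFLOPS):
    """Useful model FLOPs per second per GPU implied by an MFU figure."""
    return mfu * peak_tflops


def hourly_cost(n_nodes, rate_per_node=1.25):
    """Total hourly spend if every node is billed at the same rate."""
    return n_nodes * rate_per_node


print(round(achieved_tflops(0.308), 1))  # 96.1  (sustained)
print(round(achieved_tflops(0.38), 1))   # 118.6 (peak)
print(hourly_cost(16))                   # 20.0, matching the ~$20/hr figure
print(hourly_cost(48))                   # 60.0, the full 48-device layout at the same rate
```

The $60/hr result for all 48 devices is an extrapolation from the per-node rate, not a number reported by the project.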
The economic model is also worth noting. Orion-100B runs on Bittensor Subnet 9, where participants are compensated in TAO tokens for contributing compute. This creates an incentive structure for distributed contributors that does not exist in traditional cloud training setups.
What Comes Next
Macrocosmos has outlined a roadmap for progressively relaxing the constraints of the current system:
- Heterogeneous hardware: Mixing different GPU generations to utilize stranded or underutilized compute
- Interruptible compute: Using low-cost spot-market instances that can be preempted and resumed
- Permissionless participation: Removing centralized coordination to allow untrusted global contributions
- Consumer hardware: Onboarding RTX 4090/5090 cards and Apple Silicon systems via the existing "Train at Home" initiative on Bittensor
Each step increases the pool of available compute while introducing new engineering challenges around fault tolerance, security, and gradient integrity.
Why This Matters for the Field
The standard narrative around frontier model training is that it requires centralized infrastructure at a scale only a handful of organizations can afford. Orion-100B does not overturn that narrative entirely — a 30% MFU run on A100s is not going to out-compete a well-funded lab's H100 cluster. But it demonstrates that the technical barriers to distributed, internet-scale training are lower than previously assumed.
The key enablers — pipeline parallelism over DDP, aggressive activation compression, and fault-tolerant synchronization — are all transferable techniques. As ResBM and similar compression methods mature, and as the Bittensor ecosystem grows, the cost floor for training large models will continue to drop.
For developers and researchers who want to follow the project, the primary technical writeup is available on the Macrocosmos Substack. Additional technical analysis can be found at SimplyTao and Tao.media. The economic breakdown is covered in detail at Ayen.
The broader question Orion-100B raises is not whether decentralized training can match centralized infrastructure today — it cannot, yet. The question is how quickly the gap closes as compression, fault tolerance, and incentive mechanisms improve. Based on this run, the answer appears to be: faster than most expected.