In the last few months, I have been very interested in large language models. At the same time, the GPU world is also changing. Nvidia is still the market leader, but AMD, Intel, and even Chinese companies are making cheaper GPUs. The main challenge is that CUDA is still the dominant software stack, and it is proprietary and tied to Nvidia hardware. Because of this, using non‑Nvidia GPUs is still not smooth.
As someone who runs a homelab, I wanted a setup where I can use different GPUs together. But even mixing two Nvidia GPUs of different generations is hard. If you upgrade from RTX 3090 to RTX 5090, you may need a different CUDA version, a different Python version, and a different PyTorch version. New architectures like Blackwell also take time to enter mainstream frameworks.
So many people end up buying the same model GPU again just to do multi‑GPU training.
I wanted to avoid that and see if mixed‑GPU training is possible.
System Architecture Diagram
The system auto-generates a topology diagram after you configure and run the coordinator once. The generated file is saved as architecture.png.
What Current ML Systems Support
I looked into many systems like:
- DeepSpeed
- Megatron‑LM
- PyTorch Distributed + torchgpipe
- vLLM
- Colossal‑AI
All of these are powerful, but none properly support mixing CUDA and ROCm GPUs in one training job.
There is something called UCC (Unified Collective Communication) that tries to help. But its PyTorch integration (torch‑ucc) was experimental, and the repository is now archived:
https://github.com/openucx/torch-ucc
UCX developers also said in this discussion that mixing CUDA and ROCm should work “in theory”, but that mixed setups were never fully tested:
https://github.com/openucx/ucx/discussions/9985
So true heterogeneous GPU training is still not ready in major frameworks.
Papers Trying to Solve This
I found some research papers that aim to solve heterogeneous GPU training:
- HetHub https://arxiv.org/pdf/2405.16256
- HyperPipe https://ieeexplore.ieee.org/document/11033309
- Cephalo https://dl.acm.org/doi/10.1145/3721145.3730418
- HeterMoE https://arxiv.org/pdf/2504.03871
- Zorse https://arxiv.org/abs/2507.10392
These papers show that the idea is possible, but:
- None of these are open source
- Real‑world implementations are still missing
- Homelab users cannot use these systems directly
Because of all these limitations, I decided to build my own simple framework.
How my HeteroGPU framework enables mixed‑GPU pipeline training in homelabs
My goal was very simple:
I wanted to run LLM training across different GPUs in my homelab, even if they belong to different generations or vendors, without depending on complicated distributed frameworks.
My HeteroGPU framework helps to do this by providing:
- Layer‑based pipeline parallelism: The model is split by layers so it can run across GPUs with different VRAM sizes.
- Simple coordinator–worker design: The main machine holds the first part of the model. Remote machines run later layers. They communicate using a lightweight socket interface over 10Gb Ethernet. Thunderbolt is planned but not implemented yet.
- Support for mixed GPU speeds: A faster GPU can take more layers, and a slower GPU can take fewer.
- Small and hackable codebase: Ideal for homelab experimentation, unlike large frameworks like DeepSpeed.
- Profiler inspired by Cephalo: Helps decide how to split layers between GPUs based on compute speed, memory capacity, and communication delay.
- Works even when GPUs require different drivers or CUDA versions: Because each machine only loads its own shard locally and communicates via raw tensors over the network, you do not need unified CUDA versions.
This makes heterogeneous pipeline training practical for home users who may have a strong Nvidia GPU as main device, an older GPU on another machine, or even an integrated GPU like Strix Halo. With this design, training becomes possible even if a single GPU cannot fit the model.
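To make the "raw tensors over the network" idea concrete, here is a minimal sketch of length-prefixed framing over a plain socket. It is not the framework's actual wire format: the names `send_floats`, `recv_floats`, and `recv_exact` are hypothetical, and a flat list of float32 values stands in for a real tensor so the example needs only the standard library.

```python
import socket
import struct
from array import array

# Hypothetical framing: an 8-byte big-endian length, then raw float32 bytes.
HEADER = struct.Struct("!Q")


def send_floats(sock, values):
    """Send a flat list of floats as one length-prefixed message."""
    # array("f") uses the machine's native byte order; this sketch assumes
    # both ends share it (true for typical x86-64 homelab machines).
    payload = array("f", values).tobytes()
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_floats(sock):
    """Receive one message written by send_floats."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    out = array("f")
    out.frombytes(recv_exact(sock, length))
    return out.tolist()


if __name__ == "__main__":
    # socketpair() stands in for the coordinator <-> worker TCP connection.
    coordinator, worker = socket.socketpair()
    send_floats(coordinator, [0.5, -1.25, 3.0])
    print(recv_floats(worker))  # [0.5, -1.25, 3.0]
    coordinator.close()
    worker.close()
```

Because only bytes cross the wire, neither side needs to know which driver, CUDA, or ROCm version the other is running. A real implementation would also have to send the tensor's shape and dtype.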
Quick Explanation of Parallelism
- Data Parallelism: Copy the whole model to each GPU and split the batch.
- Tensor / Model Parallelism: Split each layer across GPUs. Very communication heavy.
- Pipeline Parallelism: Split the model layer‑wise. GPU 1 runs early layers, GPU 2 runs later layers.
Pipeline parallelism is the easiest for mixed GPUs. The main drawback is that the stages run one after another, so one GPU often waits while the other works. But it still allows training when a model cannot fit into one GPU.
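The layer-wise split and the waiting it causes can be sketched with toy "layers" that are plain Python functions. This is an illustration under simple assumptions, not the framework's code: `split_layers`, `run_stage`, and `idle_fraction` are hypothetical names, and the idle estimate assumes a naive schedule where only one stage is busy at a time.

```python
def split_layers(layers, boundary):
    """Split a list of layers into two pipeline stages at `boundary`."""
    return layers[:boundary], layers[boundary:]


def run_stage(stage, x):
    """Run the input through every layer of one stage, in order."""
    for layer in stage:
        x = layer(x)
    return x


def idle_fraction(stage_times):
    """Fraction of a step each stage spends waiting in a naive pipeline.

    Assumes no micro-batch overlap: while one stage computes, the others idle.
    """
    total = sum(stage_times)
    return [1 - t / total for t in stage_times]


if __name__ == "__main__":
    # 32 toy layers; layer k simply adds k to its input.
    layers = [lambda x, k=k: x + k for k in range(32)]

    stage0, stage1 = split_layers(layers, 16)  # layers 0-15 and 16-31
    hidden = run_stage(stage0, 0)              # would run on GPU 1
    output = run_stage(stage1, hidden)         # would run on GPU 2

    # The split pipeline computes the same result as the unsplit model.
    assert output == run_stage(layers, 0)

    # If stage 0 takes 1 s and stage 1 takes 3 s per step:
    print(idle_fraction([1.0, 3.0]))  # [0.75, 0.25]
```

In this toy case the faster stage sits idle for three quarters of every step, which is why the split point matters so much when the GPUs differ in speed.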
My Experiments With LLaMA Finetuning
I tested the same training script on:
- RTX 5090 single GPU
- AMD Strix Halo single GPU
- Two‑machine pipeline setup
The results showed how mixed GPU training behaves.
RTX 5090 (Single GPU)
» python examples/alpaca_example_singlemachine.py
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
Epoch 0 | Step 10 | Loss 2.4383 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8139 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.4709 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.2903 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2693 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2671 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.
Sample generation:
<s>You are a helpful assistant.
<|user|>
Write a haiku about GPUs.
<|assistant|>
In the lab, the GPU
Is the heart of the machine,
Running calculations.
</s>
Total training time: 289.11 seconds
Training time: 289 seconds
Loss dropped smoothly from 2.44 to 1.27.
Fast and stable.
Strix Halo (Single GPU)
$ python examples/alpaca_example_singlemachine.py
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
Epoch 0 | Step 10 | Loss 2.4027 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8115 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.2460 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.4227 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2628 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2507 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.
Sample generation:
<s>You are a helpful assistant.
<|user|>
Write a haiku about GPUs.
<|assistant|>
A GPU, a powerful tool
For processing data and computing
A helpful aid for many a task.
</s>
Total training time: 3242.91 seconds
(.venv) [alpha@toolbx HeteroShard]$
Training time: 3243 seconds
Loss also went down correctly, but speed was extremely slow. Around 11 times slower than the 5090. This shows the large performance gap between GPU types.
Distributed Pipeline Training (Two GPUs)
» python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: doraemon-arch (192.168.1.153)
✓ Role: COORDINATOR
======================================================================
COORDINATOR MODE - LLAMA 8B 4-BIT TRAINING
======================================================================
Device: cuda
Worker: worker1 (192.168.1.166:9999)
Split: Layers 0-15 (local) | 16-31 (remote)
Connecting to worker...
✓ Connected
Loading tokenizer...
Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
Creating local shard...
✓ Local shard ready (Embedding + Layers 0-15)
Loading dataset...
✓ Dataset: 100 examples
======================================================================
TRAINING
======================================================================
Steps: 25 | Batch: 1 | Accum: 4
/mnt/sdc3/Documents/hetrogpu/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
Epoch 0 | Step 1/25 | Loss 2.3243 | LR 0.000020
Epoch 0 | Step 2/25 | Loss 2.4754 | LR 0.000040
Epoch 0 | Step 3/25 | Loss 2.4923 | LR 0.000060
Epoch 0 | Step 4/25 | Loss 2.7389 | LR 0.000080
Epoch 0 | Step 5/25 | Loss 2.1877 | LR 0.000100
Epoch 0 | Step 6/25 | Loss 2.0371 | LR 0.000120
Epoch 0 | Step 7/25 | Loss 2.3928 | LR 0.000140
Epoch 0 | Step 8/25 | Loss 1.5122 | LR 0.000160
Epoch 0 | Step 9/25 | Loss 1.9724 | LR 0.000180
Epoch 0 | Step 10/25 | Loss 2.2792 | LR 0.000200
Epoch 0 | Step 11/25 | Loss 1.9573 | LR 0.000198
Epoch 0 | Step 12/25 | Loss 1.4388 | LR 0.000192
Epoch 0 | Step 13/25 | Loss 1.8510 | LR 0.000183
Epoch 0 | Step 14/25 | Loss 1.6279 | LR 0.000170
Epoch 0 | Step 15/25 | Loss 1.4549 | LR 0.000155
Epoch 0 | Step 16/25 | Loss 1.2129 | LR 0.000138
Epoch 0 | Step 17/25 | Loss 1.3626 | LR 0.000119
Epoch 0 | Step 18/25 | Loss 1.2285 | LR 0.000101
Epoch 0 | Step 19/25 | Loss 1.4700 | LR 0.000082
Epoch 0 | Step 20/25 | Loss 1.3244 | LR 0.000065
Epoch 0 | Step 21/25 | Loss 1.4875 | LR 0.000050
Epoch 0 | Step 22/25 | Loss 1.4656 | LR 0.000037
Epoch 0 | Step 23/25 | Loss 1.0804 | LR 0.000028
Epoch 0 | Step 24/25 | Loss 1.5531 | LR 0.000022
Epoch 0 | Step 25/25 | Loss 1.0947 | LR 0.000020
✓ Training complete!
Total training time: 184.59 seconds
Saved LoRA adapters to: ./lora_unsloth_sft_distributed/lora
Sample generation:
You are a helpful assistant.
<|user|>
Write a short haiku about distributed training.
<|assistant|>
Distributed training,
Like a symphony,
All the parts work together.
---
$ python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: toolbx (192.168.1.166)
✓ Role: WORKER 1
======================================================================
WORKER MODE - LLAMA 8B 4-BIT (LAYERS 16-31)
======================================================================
Device: cuda
Port: 9999
Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Creating remote shard...
✓ Remote shard ready (Layers 16-31)
Listening on 0.0.0.0:9999...
✓ Connected to coordinator at ('192.168.1.153', 46384)
[Step 0] Waiting for data...
/torch-therock/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1035: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
[Step 0] Loss: 1.6613
[Step 0] ✓ Complete
[Step 1] Waiting for data...
[Step 1] Loss: 2.5880
[Step 1] ✓ Complete
[Step 2] Waiting for data...
[Step 2] Loss: 3.1850
[Step 2] ✓ Complete
[Step 3] Waiting for data...
[Step 3] Loss: 1.8631
[Step 3] ✓ Complete
[Step 4] Waiting for data...
[Step 4] Loss: 2.3016
[Step 4] ✓ Complete
[Step 5] Waiting for data...
[Step 5] Loss: 2.4796
[Step 5] ✓ Complete
[Step 6] Waiting for data...
[Step 6] Loss: 2.7196
[Step 6] ✓ Complete
[Step 7] Waiting for data...
[Step 7] Loss: 2.4008
[Step 7] ✓ Complete
[Step 8] Waiting for data...
[Step 8] Loss: 1.9301
[Step 8] ✓ Complete
[Step 9] Waiting for data...
[Step 9] Loss: 1.9098
[Step 9] ✓ Complete
[Step 10] Waiting for data...
[Step 10] Loss: 3.0177
[Step 10] ✓ Complete
[Step 11] Waiting for data...
[Step 11] Loss: 3.1114
[Step 11] ✓ Complete
[Step 12] Waiting for data...
[Step 12] Loss: 1.7507
[Step 12] ✓ Complete
[Step 13] Waiting for data...
[Step 13] Loss: 3.0108
[Step 13] ✓ Complete
[Step 14] Waiting for data...
[Step 14] Loss: 2.5046
[Step 14] ✓ Complete
[Step 15] Waiting for data...
[Step 15] Loss: 3.6894
[Step 15] ✓ Complete
[Step 16] Waiting for data...
[Step 16] Loss: 1.8336
[Step 16] ✓ Complete
[Step 17] Waiting for data...
[Step 17] Loss: 1.5026
[Step 17] ✓ Complete
[Step 18] Waiting for data...
[Step 18] Loss: 3.4676
[Step 18] ✓ Complete
[Step 19] Waiting for data...
[Step 19] Loss: 1.9469
[Step 19] ✓ Complete
[Step 20] Waiting for data...
[Step 20] Loss: 2.0781
[Step 20] ✓ Complete
[Step 21] Waiting for data...
[Step 21] Loss: 1.7651
[Step 21] ✓ Complete
[Step 22] Waiting for data...
[Step 22] Loss: 2.0139
[Step 22] ✓ Complete
[Step 23] Waiting for data...
[Step 23] Loss: 2.2912
[Step 23] ✓ Complete
[Step 24] Waiting for data...
[Step 24] Loss: 2.6897
[Step 24] ✓ Complete
[Step 25] Waiting for data...
[Step 25] Loss: 2.8378
[Step 25] ✓ Complete
[Step 26] Waiting for data...
[Step 26] Loss: 1.9898
[Step 26] ✓ Complete
[Step 27] Waiting for data...
[Step 27] Loss: 2.0538
[Step 27] ✓ Complete
[Step 28] Waiting for data...
[Step 28] Loss: 1.6081
[Step 28] ✓ Complete
[Step 29] Waiting for data...
[Step 29] Loss: 1.4623
[Step 29] ✓ Complete
[Step 30] Waiting for data...
[Step 30] Loss: 1.2606
[Step 30] ✓ Complete
[Step 31] Waiting for data...
[Step 31] Loss: 1.7178
[Step 31] ✓ Complete
[Step 32] Waiting for data...
[Step 32] Loss: 1.9203
[Step 32] ✓ Complete
[Step 33] Waiting for data...
[Step 33] Loss: 1.6814
[Step 33] ✓ Complete
[Step 34] Waiting for data...
[Step 34] Loss: 2.5819
[Step 34] ✓ Complete
[Step 35] Waiting for data...
[Step 35] Loss: 1.7061
[Step 35] ✓ Complete
[Step 36] Waiting for data...
[Step 36] Loss: 2.3311
[Step 36] ✓ Complete
[Step 37] Waiting for data...
[Step 37] Loss: 2.2990
[Step 37] ✓ Complete
[Step 38] Waiting for data...
[Step 38] Loss: 1.8855
[Step 38] ✓ Complete
[Step 39] Waiting for data...
[Step 39] Loss: 2.6010
[Step 39] ✓ Complete
[Step 40] Waiting for data...
[Step 40] Loss: 2.3807
[Step 40] ✓ Complete
[Step 41] Waiting for data...
[Step 41] Loss: 2.0204
[Step 41] ✓ Complete
[Step 42] Waiting for data...
[Step 42] Loss: 1.7209
[Step 42] ✓ Complete
[Step 43] Waiting for data...
[Step 43] Loss: 1.7073
[Step 43] ✓ Complete
[Step 44] Waiting for data...
[Step 44] Loss: 1.1900
[Step 44] ✓ Complete
[Step 45] Waiting for data...
[Step 45] Loss: 1.8439
[Step 45] ✓ Complete
[Step 46] Waiting for data...
[Step 46] Loss: 1.1291
[Step 46] ✓ Complete
[Step 47] Waiting for data...
[Step 47] Loss: 1.5923
[Step 47] ✓ Complete
[Step 48] Waiting for data...
[Step 48] Loss: 1.9110
[Step 48] ✓ Complete
[Step 49] Waiting for data...
[Step 49] Loss: 1.1971
[Step 49] ✓ Complete
[Step 50] Waiting for data...
[Step 50] Loss: 3.0576
[Step 50] ✓ Complete
[Step 51] Waiting for data...
[Step 51] Loss: 1.2383
[Step 51] ✓ Complete
[Step 52] Waiting for data...
[Step 52] Loss: 1.6820
[Step 52] ✓ Complete
[Step 53] Waiting for data...
[Step 53] Loss: 1.7755
[Step 53] ✓ Complete
[Step 54] Waiting for data...
[Step 54] Loss: 1.2515
[Step 54] ✓ Complete
[Step 55] Waiting for data...
[Step 55] Loss: 1.8027
[Step 55] ✓ Complete
[Step 56] Waiting for data...
[Step 56] Loss: 1.2692
[Step 56] ✓ Complete
[Step 57] Waiting for data...
[Step 57] Loss: 1.6293
[Step 57] ✓ Complete
[Step 58] Waiting for data...
[Step 58] Loss: 1.1256
[Step 58] ✓ Complete
[Step 59] Waiting for data...
[Step 59] Loss: 1.7956
[Step 59] ✓ Complete
[Step 60] Waiting for data...
[Step 60] Loss: 1.3114
[Step 60] ✓ Complete
[Step 61] Waiting for data...
[Step 61] Loss: 1.4944
[Step 61] ✓ Complete
[Step 62] Waiting for data...
[Step 62] Loss: 0.9233
[Step 62] ✓ Complete
[Step 63] Waiting for data...
[Step 63] Loss: 1.1224
[Step 63] ✓ Complete
[Step 64] Waiting for data...
[Step 64] Loss: 1.4849
[Step 64] ✓ Complete
[Step 65] Waiting for data...
[Step 65] Loss: 1.0226
[Step 65] ✓ Complete
[Step 66] Waiting for data...
[Step 66] Loss: 1.3064
[Step 66] ✓ Complete
[Step 67] Waiting for data...
[Step 67] Loss: 1.6367
[Step 67] ✓ Complete
[Step 68] Waiting for data...
[Step 68] Loss: 1.6595
[Step 68] ✓ Complete
[Step 69] Waiting for data...
[Step 69] Loss: 1.3235
[Step 69] ✓ Complete
[Step 70] Waiting for data...
[Step 70] Loss: 0.8673
[Step 70] ✓ Complete
[Step 71] Waiting for data...
[Step 71] Loss: 1.0639
[Step 71] ✓ Complete
[Step 72] Waiting for data...
[Step 72] Loss: 1.6803
[Step 72] ✓ Complete
[Step 73] Waiting for data...
[Step 73] Loss: 1.5877
[Step 73] ✓ Complete
[Step 74] Waiting for data...
[Step 74] Loss: 1.3728
[Step 74] ✓ Complete
[Step 75] Waiting for data...
[Step 75] Loss: 1.2393
[Step 75] ✓ Complete
[Step 76] Waiting for data...
[Step 76] Loss: 1.4007
[Step 76] ✓ Complete
[Step 77] Waiting for data...
[Step 77] Loss: 0.9818
[Step 77] ✓ Complete
[Step 78] Waiting for data...
[Step 78] Loss: 1.3658
[Step 78] ✓ Complete
[Step 79] Waiting for data...
[Step 79] Loss: 1.5493
[Step 79] ✓ Complete
[Step 80] Waiting for data...
[Step 80] Loss: 1.3884
[Step 80] ✓ Complete
[Step 81] Waiting for data...
[Step 81] Loss: 1.3920
[Step 81] ✓ Complete
[Step 82] Waiting for data...
[Step 82] Loss: 1.9356
[Step 82] ✓ Complete
[Step 83] Waiting for data...
[Step 83] Loss: 1.2340
[Step 83] ✓ Complete
[Step 84] Waiting for data...
[Step 84] Loss: 1.2280
[Step 84] ✓ Complete
[Step 85] Waiting for data...
[Step 85] Loss: 1.7844
[Step 85] ✓ Complete
[Step 86] Waiting for data...
[Step 86] Loss: 1.2704
[Step 86] ✓ Complete
[Step 87] Waiting for data...
[Step 87] Loss: 1.5795
[Step 87] ✓ Complete
[Step 88] Waiting for data...
[Step 88] Loss: 0.9333
[Step 88] ✓ Complete
[Step 89] Waiting for data...
[Step 89] Loss: 0.9236
[Step 89] ✓ Complete
[Step 90] Waiting for data...
[Step 90] Loss: 1.0831
[Step 90] ✓ Complete
[Step 91] Waiting for data...
[Step 91] Loss: 1.3817
[Step 91] ✓ Complete
[Step 92] Waiting for data...
[Step 92] Loss: 1.3752
[Step 92] ✓ Complete
[Step 93] Waiting for data...
[Step 93] Loss: 1.9094
[Step 93] ✓ Complete
[Step 94] Waiting for data...
[Step 94] Loss: 1.6458
[Step 94] ✓ Complete
[Step 95] Waiting for data...
[Step 95] Loss: 1.2820
[Step 95] ✓ Complete
[Step 96] Waiting for data...
[Step 96] Loss: 1.5715
[Step 96] ✓ Complete
[Step 97] Waiting for data...
[Step 97] Loss: 0.8391
[Step 97] ✓ Complete
[Step 98] Waiting for data...
[Step 98] Loss: 0.9126
[Step 98] ✓ Complete
[Step 99] Waiting for data...
[Step 99] Loss: 1.0555
[Step 99] ✓ Complete
[Step 100] Waiting for data...
Connection closed.
(.venv) [alpha@toolbx HeteroShard]$
Training time: 184 seconds
Model was split:
- Layers 0–15 on the main machine
- Layers 16–31 on the worker machine
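Both GPUs handled their parts. The worker log repeats the same cycle: Waiting for data, Loss, Complete. Each wait is a pipeline stall while the coordinator runs its layers, which is expected. The total time was still lower than the single 5090 run, but the two scripts ran different numbers of steps (25 here versus at least 60 there), so the times are not a like-for-like comparison.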
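The two logs count steps differently, and a little arithmetic connects them. The coordinator reports 25 optimizer steps with an accumulation of 4, while the worker numbers 100 micro-batches (0 to 99). My reading of the logs, which the sketch below checks against the first eight worker losses, is that each coordinator loss is the mean of four consecutive worker losses. The helper name `group_means` is hypothetical.

```python
def group_means(losses, accumulation):
    """Average consecutive groups of `accumulation` micro-batch losses."""
    return [
        sum(losses[i:i + accumulation]) / accumulation
        for i in range(0, len(losses), accumulation)
    ]


if __name__ == "__main__":
    optimizer_steps, accumulation, batch_size = 25, 4, 1

    # 25 optimizer steps x 4 micro-batches each = 100 worker steps,
    # which with batch size 1 is exactly the 100-example dataset.
    micro_batches = optimizer_steps * accumulation
    assert micro_batches == 100
    assert micro_batches * batch_size == 100

    # Worker losses for steps 0-7 and coordinator losses for steps 1-2,
    # copied from the logs above.
    worker_losses = [1.6613, 2.5880, 3.1850, 1.8631,
                     2.3016, 2.4796, 2.7196, 2.4008]
    coordinator_losses = [2.3243, 2.4754]

    for mean, logged in zip(group_means(worker_losses, accumulation),
                            coordinator_losses):
        # Agreement to within the rounding of the 4-decimal log output.
        assert abs(mean - logged) < 1e-3
        print(f"mean of 4 worker losses ~ {mean:.3f}, coordinator logged {logged}")
```

This also explains why the worker's per-step losses look noisier than the coordinator's: they are single examples, not averages.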
What I Learnt From These Runs
- Mixed‑GPU pipeline training works in real life, not just in papers.
- Speed depends on the slowest GPU, so good splitting is important.
- Distributed training has waiting time and communication cost. In my test it still finished in less wall‑clock time than the single 5090 run, although the two runs used different step counts.
- Consumer GPUs vary hugely in speed, which is why homelab users need flexible systems.
- A simple framework like HeteroGPU can achieve things that big frameworks do not support yet.
My Profiler System
The profiler I added does the following:
- Runs tiny batches on each GPU
- Measures latency and memory usage
- Builds simple linear models to predict performance
- Measures communication cost
- Chooses the best pipeline split
This matches the idea in the Cephalo paper:
https://dl.acm.org/doi/10.1145/3721145.3730418
This allows the system to work even when one GPU is fast but has little VRAM, and another GPU is slow but has a lot of VRAM.
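The profiling steps above can be sketched in a few lines. This is a simplified illustration, not the profiler in the repo or the Cephalo algorithm: `fit_line` and `best_split` are hypothetical names, every measurement is made up, and the step time is modelled as the sum of both stage times plus one communication delay (a naive pipeline with no overlap).

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return mean_y - slope * mean_x, slope


def best_split(total_layers, models, mem_per_layer_gb, vram_gb, comm_s):
    """Pick how many layers go on GPU 0 (the rest go on GPU 1).

    Skips splits that exceed either GPU's VRAM, then minimises the
    predicted step time. Returns (layers_on_gpu0, predicted_seconds),
    or None if no split fits in memory.
    """
    best = None
    for k in range(1, total_layers):
        counts = (k, total_layers - k)
        if any(c * mem_per_layer_gb > cap for c, cap in zip(counts, vram_gb)):
            continue  # this split does not fit in memory
        step = comm_s + sum(a + b * c for (a, b), c in zip(models, counts))
        if best is None or step < best[1]:
            best = (k, step)
    return best


if __name__ == "__main__":
    # Made-up profiling runs: seconds per tiny batch for 2, 4, and 8 layers.
    fast_gpu = fit_line([2, 4, 8], [0.02, 0.04, 0.08])  # fast, only 8 GB VRAM
    slow_gpu = fit_line([2, 4, 8], [0.10, 0.20, 0.40])  # slow, 64 GB VRAM

    split = best_split(
        total_layers=32,
        models=[fast_gpu, slow_gpu],
        mem_per_layer_gb=0.5,   # assumed memory cost per layer
        vram_gb=(8, 64),
        comm_s=0.01,            # assumed activation transfer time
    )
    print(split[0])  # 16
```

With these numbers the fast GPU would ideally take almost every layer, but its 8 GB only holds 16 of them, so memory rather than speed decides the split.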
Next Steps
Now I plan to experiment with:
- HeterMoE: https://arxiv.org/pdf/2504.03871 or maybe
- Zorse: https://arxiv.org/abs/2507.10392
MoE (Mixture‑of‑Experts) models are naturally suited for heterogeneous hardware, so they may perform better in mixed GPU clusters.
GitHub repo: https://github.com/0xrushi/HeteroShard
