How I pooled 24GB of RAM across two discarded PCs, ran a 13B LLM, and discovered exactly why modern AI infrastructure exists.
Sometimes engineering is about solving a problem. Sometimes it’s about proving why a problem exists in the first place.
Coming from a background in data engineering, I’ve spent years chasing bottlenecks.
Whether it was optimizing data transformations across dozens of workflows, debugging slow pipelines, or cutting cloud storage usage by more than a terabyte, there was always a constraint hiding somewhere in the system.
Most of the time, constraints can be engineered away.
So when I started working more deeply with Generative AI and wanted to build a local MVP using open-source LLMs, I naturally assumed the same rule applied.
I was wrong.
The Challenge
Cloud GPUs are expensive.
For experimentation, prototypes, and personal projects, renting powerful hardware can quickly become the most expensive part of the stack.
My available hardware wasn’t exactly encouraging either.
In one corner sat an aging desktop powered by an Intel i5-3470 with 16GB of DDR3 RAM.
In another corner sat its equally elderly sibling: another Intel i5-3470, this time with 8GB of RAM.
No GPUs.
No accelerators.
No fancy networking.
Just two forgotten PCs from 2012 collecting dust.
A 13B parameter model was clearly too large for either machine individually.
But then a dangerous thought appeared:
What if I connected them together and treated them as a tiny cluster?
If one machine couldn’t hold the model, perhaps two machines could.
And thus began the creation of what I lovingly call The Poor Man’s AI Cluster.
The Plan
The idea was surprisingly simple.
Instead of connecting both machines through a router, I connected them directly using a Cat5e Gigabit Ethernet cable.
I assigned static IP addresses:
- Master Node: 192.168.1.10 (16GB RAM)
- Worker Node: 192.168.1.20 (8GB RAM)
After a bit of firewall configuration, the two systems could communicate directly over a dedicated full-duplex 1 Gbps link.
In theory, that gave me roughly:
- 1 Gbps bandwidth
- ~125 MB/s theoretical peak transfer speed (slightly less in practice once protocol overhead is counted)
- Zero router overhead
Not exactly a supercomputer.
But enough to experiment.
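The 125 MB/s figure is simply the line rate divided by eight. As a rough sketch, the snippet below also estimates how much of that is left for TCP payload. The function names are illustrative, and the overhead model is an assumption: standard 1500-byte frames carrying IPv4 and TCP without options.

```python
# Back-of-envelope throughput for a Gigabit Ethernet link.
LINK_BITS_PER_SECOND = 1_000_000_000  # 1 Gbps line rate, as in the article


def raw_megabytes_per_second(bits_per_second: float) -> float:
    """Convert a line rate in bits/s to decimal megabytes/s."""
    return bits_per_second / 8 / 1_000_000


def tcp_payload_efficiency(mtu: int = 1500) -> float:
    """Fraction of on-wire bytes that end up as TCP payload.

    Assumption: standard frames, IPv4 + TCP headers without options.
    """
    # Preamble 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12 = 38 bytes
    wire_bytes = mtu + 38
    # IPv4 header 20 + TCP header 20 = 40 bytes inside the MTU
    payload_bytes = mtu - 40
    return payload_bytes / wire_bytes


raw = raw_megabytes_per_second(LINK_BITS_PER_SECOND)
usable = raw * tcp_payload_efficiency()
print(f"raw: {raw:.0f} MB/s, usable TCP payload: {usable:.1f} MB/s")
```

Under these assumptions the usable rate lands a little under 119 MB/s, so treating 125 MB/s as the budget is slightly optimistic but close enough for estimates.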
Bringing the Monster to Life
Using llama.cpp and its RPC server running inside WSL, I split a quantized 13B model across both machines.
The architecture looked something like this:
User Prompt
│
▼
Master Node (16GB)
│
▼
Worker Node (8GB)
│
▼
Shared Inference
│
▼
Generated Response
The master node handled prompt orchestration while the worker node processed the portions of the model that no longer fit in the master's memory.
And then something unexpected happened.
It worked.
Against all common sense, against every reasonable hardware recommendation, I was chatting with a 13B parameter language model running across two decade-old machines.
For a brief moment, I felt like I had cheated the system.
Then I looked at the token generation speed.
Reality Arrives at 1 Token per Second
The model was generating roughly 1–1.5 tokens per second.
A moderately sized prompt could take close to a minute before the AI even started responding.
The cluster was technically functioning.
But it felt less like modern AI and more like waiting for dial-up internet.
The reason came down to three unavoidable hardware bottlenecks.
Bottleneck #1: The Compute Wall
The Intel i5-3470 was released in 2012.
While it was a respectable CPU for its era, modern LLMs demand absurd amounts of computation.
Using the common rule of thumb of about 2 floating-point operations per parameter per token, a 13B parameter model requires approximately 26 billion floating-point operations per token during prompt processing.
For a 100-token prompt:
26 billion FLOPs per token × 100 tokens = 2.6 trillion FLOPs
Meanwhile, my CPU could sustain roughly 50 GFLOPS.
The result?
About 52 seconds (2.6 trillion FLOPs at 50 billion per second): nearly a minute of pure mathematical suffering before the model could move forward.
Physics wasn’t impressed by my creativity.
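The compute estimate above can be reproduced in a few lines. This is a minimal sketch using the article's own numbers; the function name and the 2-FLOPs-per-parameter rule of thumb are assumptions made for illustration, not measurements.

```python
def prefill_seconds(params: float, prompt_tokens: int,
                    sustained_flops: float,
                    flops_per_param: float = 2.0) -> float:
    """Estimate prompt-processing time for a dense model.

    Assumes roughly `flops_per_param` floating-point operations per
    parameter per token, and a CPU that sustains `sustained_flops`.
    """
    total_flops = params * flops_per_param * prompt_tokens
    return total_flops / sustained_flops


# 13B parameters, 100-token prompt, ~50 GFLOPS sustained (figures from the article)
seconds = prefill_seconds(params=13e9, prompt_tokens=100, sustained_flops=50e9)
print(f"estimated time to first token: {seconds:.0f} s")
```

Doubling the prompt length doubles the wait, which is why longer prompts felt so much worse than short ones.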
Bottleneck #2: The Memory Wall
Even after solving the memory-capacity problem, I still had to deal with memory bandwidth.
Every generated token requires streaming essentially all of the model's weights out of RAM.
The DDR3 memory in these systems delivered roughly:
~15 GB/s bandwidth
The model itself occupied around:
~8 GB
Which meant the CPU spent most of its time waiting for data to arrive.
No amount of clever engineering could change the fact that old memory moves data slowly.
Dividing ~15 GB/s by ~8 GB gives a practical ceiling of roughly two tokens per second.
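The ceiling follows from dividing memory bandwidth by model size. Here is a small sketch of that calculation, assuming every generated token reads all weights from RAM exactly once and ignoring caches and compute time. The function name is hypothetical.

```python
def decode_tokens_per_second_ceiling(model_bytes: float,
                                     bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed when weight reads dominate.

    Assumption: each token requires one full pass over the weights.
    """
    return bandwidth_bytes_per_s / model_bytes


# ~8 GB quantized model, ~15 GB/s DDR3 bandwidth (figures from the article)
ceiling = decode_tokens_per_second_ceiling(model_bytes=8e9,
                                           bandwidth_bytes_per_s=15e9)
print(f"bandwidth-bound ceiling: {ceiling:.2f} tokens/s")
```

The observed 1 to 1.5 tokens per second sits just below this bound, which is consistent with memory bandwidth being the main limit during generation.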
Bottleneck #3: The Network Tax
Then came the hidden enemy.
Networking.
Splitting the model meant constantly exchanging activations between machines.
Every layer crossing the machine boundary introduced additional latency and synchronization overhead.
On paper, Gigabit Ethernet sounds fast.
For AI workloads, it is painfully slow.
The cluster spent a surprising amount of time simply moving data from one machine to another instead of performing useful computation.
Then I Considered Fine-Tuning
Inference was slow.
But perhaps training a LoRA adapter would still be possible?
That’s when the numbers became truly ridiculous.
Distributed training relies heavily on a communication pattern called Ring-AllReduce, where the nodes are arranged in a ring and each one repeatedly passes chunks of its gradients to its neighbor until every node holds the full sum.
In other words:
Compute
→ Synchronize
→ Compute
→ Synchronize
→ Repeat
The synchronization step quickly became the dominant cost.
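To make the synchronization step concrete, here is a toy, single-process simulation of ring all-reduce. It is only a sketch: real frameworks run this over the network, and the function name and list-based "nodes" are invented for illustration.

```python
def ring_allreduce(node_vectors):
    """Simulate ring all-reduce (sum) over equal-length vectors.

    Phase 1 (reduce-scatter): each node ends up owning one fully summed chunk.
    Phase 2 (all-gather): the summed chunks travel around the ring.
    """
    n = len(node_vectors)
    length = len(node_vectors[0])
    bounds = [(length * k) // n for k in range(n + 1)]  # chunk boundaries
    data = [list(v) for v in node_vectors]              # per-node working copy

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    # Reduce-scatter: n-1 steps, node i sends chunk (i - step) to its neighbor.
    for step in range(n - 1):
        sends = [((i - step) % n, None) for i in range(n)]
        sends = [(k, [data[i][j] for j in chunk(k)])
                 for i, (k, _) in enumerate(sends)]      # snapshot before applying
        for i, (k, payload) in enumerate(sends):
            dst = (i + 1) % n
            for off, j in enumerate(chunk(k)):
                data[dst][j] += payload[off]

    # All-gather: n-1 steps, node i forwards chunk (i + 1 - step) to its neighbor.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, None) for i in range(n)]
        sends = [(k, [data[i][j] for j in chunk(k)])
                 for i, (k, _) in enumerate(sends)]
        for i, (k, payload) in enumerate(sends):
            dst = (i + 1) % n
            for off, j in enumerate(chunk(k)):
                data[dst][j] = payload[off]

    return data


result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
print(result)
```

Each node sends 2(N-1)/N of the gradient payload in total. With two nodes that works out to one full payload per node per step, so the link speed directly sets the synchronization time.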
The Math That Ended the Dream
Imagine synchronizing an 8GB gradient payload across a 1 Gbps connection.
8,000 MB / 125 MB/s = 64 seconds
Just to transfer the gradients.
One training step.
No computation included.
If a training run required only 1,000 optimization steps:
64 seconds × 1,000 steps = 64,000 seconds
That’s almost 18 hours spent purely moving data across an Ethernet cable.
Not training.
Not learning.
Just waiting.
Even after aggressively optimizing the payload down to roughly 1GB, synchronization still consumed around 8 seconds per step.
Add approximately 40 seconds of CPU computation per step and each step takes about 48 seconds, so a modest 1,000-step training run would still take more than 13 hours.
Suddenly, cloud GPUs didn’t seem expensive anymore.
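The training arithmetic in this section can be checked with a short script. This is a sketch using the article's figures (125 MB/s link, 1,000 steps, 40 seconds of compute per step); the function names are hypothetical and the model ignores any overlap between communication and computation.

```python
def sync_seconds(payload_mb: float, link_mb_per_s: float = 125.0) -> float:
    """Time to push one gradient payload across the link."""
    return payload_mb / link_mb_per_s


def run_hours(steps: int, sync_s: float, compute_s: float = 0.0) -> float:
    """Total wall-clock hours, assuming sync and compute do not overlap."""
    return steps * (sync_s + compute_s) / 3600


full_sync = sync_seconds(8_000)    # 8 GB payload
small_sync = sync_seconds(1_000)   # payload reduced to 1 GB

transfer_only_hours = run_hours(1_000, full_sync)
optimized_hours = run_hours(1_000, small_sync, compute_s=40.0)

print(f"8 GB payload: {full_sync:.0f} s/step, {transfer_only_hours:.1f} h of pure transfer")
print(f"1 GB payload: {small_sync:.0f} s/step, {optimized_hours:.1f} h including compute")
```

Even in the optimized case, the 40 seconds of CPU work per step dominates, so a faster link alone would not rescue the run.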
Why Data Centers Look the Way They Do
This experiment taught me something more valuable than a successful fine-tuning run ever could.
When people see AI clusters powered by dozens of GPUs connected through NVLink and specialized interconnects, it’s easy to assume it’s overengineering.
It isn’t.
Modern AI infrastructure exists because the laws of physics demand it.
When GPUs exchange data at hundreds of gigabytes per second, they aren’t chasing luxury.
They’re avoiding exactly the bottlenecks I spent weeks fighting.
The challenge isn’t storing the model.
The challenge is moving enormous amounts of data fast enough to keep every processor busy.
Final Thoughts
My two-node cluster was never going to compete with enterprise AI infrastructure.
But that wasn’t really the point.
The project succeeded in proving something fascinating:
If you’re memory-constrained, you can absolutely stitch together old hardware and run models that technically shouldn’t fit.
The experience was equal parts engineering, experimentation, and stubborn curiosity.
For a brief moment, two forgotten PCs from 2012 became an AI cluster.
And while they ultimately lost the battle against compute, memory bandwidth, and network latency, they taught me a lesson every AI engineer eventually learns:
In machine learning, clever architecture can bend the rules. Eventually, physics collects the bill.
Have you ever tried running an LLM on absurdly underpowered hardware? I’d love to hear the most ridiculous AI infrastructure experiments you’ve attempted.


