There’s a strange moment that happens when you first start running local LLMs seriously.
At the beginning, everything feels almost magical. You pull a model, type a prompt, and watch tokens stream back from your own machine. No API keys. No cloud dependency. No latency from some distant server farm. Just raw local inference running quietly beside your editor.
For a while, it feels like the future arrived early.
Then reality shows up.
Your browser opens fifteen tabs. Cursor agents are running. Docker wakes up. A few terminals pile on. Maybe a frontend dev server starts watching files, a backend process begins rebuilding itself every few seconds, and suddenly your laptop is no longer “running AI” — it’s negotiating for survival with swap memory.
That’s where this setup came from.
I was running a local model (gemma4:e2b, specifically) and noticed that inference alone was consuming roughly 7.2GB of RAM during active usage. On paper, that sounds manageable. In practice, once the rest of a modern development workflow joins the party, things become tight very quickly, especially on lightweight machines built more for portability than sustained inference workloads.
The obvious solution wasn’t buying new hardware immediately.
It was separating responsibilities.
One machine became the dedicated inference node running Ollama. Another handled development, editors, browsers, containers, and application runtime. The result was cleaner, faster, quieter — and honestly, far more stable than I expected.
This article walks through that exact setup.
Why Run Ollama on a Separate Device?
Local AI workloads become memory-heavy long before they become compute-heavy.
A normal development session alone can easily consume:
- 1–2GB from browser tabs
- 1–3GB from VSCode and extensions
- additional memory from Docker, databases, terminals, and background tooling
Then the model arrives.
Even relatively efficient quantized models can occupy several gigabytes while running:
| Model | Approx. runtime memory |
| --- | --- |
| DeepSeek-R1 1.5B | ~1.5–2GB |
| Gemma 4 | ~7GB+ |
| Larger 8B models | 8–12GB |
The real issue isn’t simply:
“Can it run?”
It’s whether your entire workflow remains usable while it runs.
That distinction matters.
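To make the squeeze concrete, here is a rough back-of-the-envelope budget. It is only a sketch: the browser, editor, and model figures come from the numbers above, while the 16GB total and the 2GB for Docker and tooling are illustrative assumptions, and the operating system's own usage is ignored.

```python
# Rough memory budget for running a model beside a dev workflow.
# Figures marked "from the article" come from the text above; the rest are
# illustrative assumptions, not measurements.

def remaining_headroom_gb(total_ram_gb: float, usage_gb: dict[str, float]) -> float:
    """Return RAM left over after every listed consumer takes its share."""
    return total_ram_gb - sum(usage_gb.values())

workflow = {
    "browser tabs": 2.0,           # upper end of the 1-2GB range (from the article)
    "editor and extensions": 3.0,  # upper end of the 1-3GB range (from the article)
    "docker and tooling": 2.0,     # assumption: the article gives no figure
}
model = {"gemma4:e2b": 7.2}        # observed during active usage (from the article)

TOTAL_RAM_GB = 16.0                # assumption: a typical lightweight laptop

# Everything on one machine: workflow and model compete for the same RAM.
single_machine = remaining_headroom_gb(TOTAL_RAM_GB, {**workflow, **model})
# Split setup: the development machine only carries the workflow.
dev_machine_only = remaining_headroom_gb(TOTAL_RAM_GB, workflow)

print(f"single machine headroom: {single_machine:.1f} GB")
print(f"dev machine headroom (model elsewhere): {dev_machine_only:.1f} GB")
```

Under these assumptions the single machine is left with under 2GB of slack, while the development machine in a split setup keeps about 9GB free.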
A multi-device setup solves this elegantly.
One system handles:
- model execution
- token generation
- embeddings
- inference serving
The other handles:
- application development
- IDEs
- frontend/backend processes
- testing
- browser-heavy workflows
You essentially create your own lightweight local AI server — except it lives entirely inside your home network.
Architecture Overview
The setup is surprisingly simple:
[Development Machine]
Frontend / Backend / IDE
↓
HTTP Requests
↓
[Inference Machine]
Ollama + Local Models
Everything communicates over your local network.
No cloud infrastructure.
No tunneling.
No external APIs.
Just two machines doing separate jobs well.
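On the development machine, this architecture mostly boils down to knowing one address. A minimal sketch of how application code might hold it is below; `InferenceNode` is a hypothetical helper (not part of Ollama), and 192.168.1.15 is just the example LAN address used in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceNode:
    """Address of the machine running Ollama (hypothetical helper, not part of Ollama)."""
    host: str = "192.168.1.15"  # example LAN address; substitute your own
    port: int = 11434           # Ollama's default port

    def url(self, path: str) -> str:
        """Build a full HTTP URL for an API path such as /api/tags."""
        return f"http://{self.host}:{self.port}/{path.lstrip('/')}"

node = InferenceNode()
print(node.url("/api/tags"))
```

Keeping the address in one place means moving the inference machine to a different IP is a one-line change on the development side.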
Installing Ollama
Install Ollama on the machine that will handle inference.
Verify installation:
ollama --version
Start the Ollama server:
ollama serve
In another terminal, test a model:
ollama run gemma4:e2b
Or pull the model first if needed:
ollama pull gemma4:e2b
At this point, everything works — but only locally on that device.
Understanding Ollama’s Default Behavior
By default, Ollama binds to:
localhost:11434
That means:
- requests from the same machine work
- requests from other devices do not
This is intentional and generally safer by default.
To expose Ollama to your local network, it needs to listen on all interfaces instead of only localhost.
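The difference between the two bindings is ordinary socket behavior rather than anything specific to Ollama. The sketch below illustrates it with a throwaway TCP socket; `bound_address` is a hypothetical helper name, and port 0 is used so the OS picks a free port instead of colliding with a running server.

```python
import socket

def bound_address(host: str) -> tuple[str, int]:
    """Bind a TCP socket to `host` on an OS-chosen free port and report where it listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 asks the OS for any free port
        server.listen()
        return server.getsockname()

# Loopback only: reachable from this machine, invisible to the rest of the LAN.
print(bound_address("127.0.0.1"))
# All interfaces: reachable through any address this machine has, including its LAN IP.
print(bound_address("0.0.0.0"))
```

A server bound the first way can only be reached from the same machine, which is exactly the default described above.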
Exposing Ollama to Your Local Network
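First, stop existing Ollama processes:
pkill ollama
If Ollama runs as a background service on your machine (for example, the systemd unit set up by the Linux install script, or the macOS menu bar app), it may restart on its own; stop or quit that service instead.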
Now launch Ollama like this:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
This changes the binding from:
localhost
to:
0.0.0.0
Which effectively means:
accept incoming connections on every network interface, including connections from other devices on the local network.
A small change.
A very important one.
Finding Your Local IP Address
On macOS:
ipconfig getifaddr en0
On Linux:
hostname -I
That becomes your Ollama server address.
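If you would rather look the address up from code, a common trick is to ask the OS which source address it would use for outbound traffic. This is a sketch, not a replacement for the commands above: `local_ip` is a hypothetical helper, and on machines with several interfaces it returns whichever one owns the default route.

```python
import ipaddress
import socket

def local_ip(probe_host: str = "192.0.2.1") -> str:
    """Return the IPv4 address the OS would use for outbound traffic.

    Connecting a UDP socket sends no packets; it only makes the OS pick a
    route, and therefore a source address. 192.0.2.1 is a reserved
    documentation address, used here purely as a routing target.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.connect((probe_host, 80))
            return probe.getsockname()[0]
        except OSError:
            # No usable route (for example, networking is down): fall back to loopback.
            return "127.0.0.1"

address = local_ip()
# ip_address() raises ValueError if the string is not a valid address.
print(ipaddress.ip_address(address))
```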
Testing the API
First test locally on the inference machine:
curl http://localhost:11434/api/tags
Expected response:
{
"models": [...]
}
Now test using the machine’s actual network IP (192.168.1.15 is the example address used here; substitute your own):
curl http://192.168.1.15:11434/api/tags
If this works, Ollama is exposed correctly.
Connecting From Another Device
Now move to your development machine and test the same request:
curl http://192.168.1.15:11434/api/tags
If successful, you now have remote inference access.
And this is the moment where things start feeling interesting.
Because at this point, any application capable of calling the Ollama API can connect over LAN:
- VSCode extensions
- local AI agents
- Python backends
- Node.js apps
- browser interfaces
- internal tooling
- RAG pipelines
Your model is no longer tied to a single machine.
It becomes infrastructure.
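As one example of such a client, a Python backend could list the server's models using only the standard library. This is a minimal sketch: `parse_model_names` and `list_models` are hypothetical helper names, and the sample body is hand-written to mirror the `models` array returned by `/api/tags`.

```python
import json
import urllib.request

def parse_model_names(payload: bytes) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(payload)
    return [entry["name"] for entry in data.get("models", [])]

def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Ask a remote Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return parse_model_names(response.read())

# Offline check of the parsing step with a hand-written sample body.
sample = b'{"models": [{"name": "gemma4:e2b"}]}'
print(parse_model_names(sample))
```

From the development machine, the live call would be `list_models("http://192.168.1.15:11434")`, with the IP replaced by your inference machine's address.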
Streaming Responses Over the Network
One of the nicest parts about Ollama is that streaming works identically over network requests.
Example:
curl http://192.168.1.15:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [
    {
      "role": "user",
      "content": "Explain vector embeddings simply"
    }
  ],
  "stream": true
}'
Responses arrive incrementally as tokens generate.
- No extra configuration.
- No websocket setup.
- No special transport layer.
It just streams.
That simplicity makes local development workflows feel surprisingly polished.
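The same stream can be consumed from application code. With `"stream": true`, the response body is a sequence of JSON objects, one per line, each carrying a piece of the message and a `done` flag. The sketch below uses only the standard library; `iter_chat_text` and `stream_chat` are hypothetical helper names, and the sample chunks are hand-written to match that shape.

```python
import json
import urllib.request
from typing import Iterable, Iterator

def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield message text from a newline-delimited JSON chat stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final object carries done=true

def stream_chat(base_url: str, model: str, prompt: str) -> Iterator[str]:
    """POST to /api/chat with stream=true and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object iterates line by line, one JSON object each.
        yield from iter_chat_text(response)

# Offline check with two hand-written chunks shaped like the stream.
sample_stream = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": true}\n',
]
print("".join(iter_chat_text(sample_stream)))
```

Against a live server, looping over `stream_chat("http://192.168.1.15:11434", "gemma4:e2b", "Explain vector embeddings simply")` and printing each piece with `end=""` reproduces the incremental output of the curl command.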
Common Issues
- Firewall Blocking Connections
Operating systems often block inbound traffic silently.
Temporarily disable the firewall, or allow incoming connections for:
- Terminal
- Ollama
Then retest connectivity.
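A quick way to retest from the development machine without involving HTTP at all is a plain TCP connection attempt. The sketch below is self-contained: `port_open` is a hypothetical helper, and the demonstration runs against a throwaway listener on the local machine rather than a real Ollama server.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Self-contained demonstration against a throwaway listener on this machine.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen()
demo_port = listener.getsockname()[1]
while_listening = port_open("127.0.0.1", demo_port)
listener.close()
print(while_listening)
```

In practice you would call `port_open("192.168.1.15", 11434)` from the development machine. A `False` result while the server is running points at the firewall or the network rather than at Ollama's API.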
- Devices Cannot Reach Each Other
Verify both devices are:
- on the same local network (WiFi or ethernet)
- on the same subnet
- not separated by guest-network or client isolation on the router
Healthy subnet alignment usually looks like:
192.168.1.x
192.168.1.x
If one device is on something like 192.168.0.x while the other is 10.0.0.x, they may not communicate directly.
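The subnet comparison can also be checked mechanically with Python's `ipaddress` module. This sketch assumes a /24 network (the usual 255.255.255.0 mask on home routers), which is an assumption about your network rather than something the addresses themselves guarantee; `same_subnet` is a hypothetical helper name.

```python
import ipaddress

def same_subnet(ip_a: str, ip_b: str, prefix: int = 24) -> bool:
    """Return True if both addresses fall inside the same network of the given prefix length."""
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network

print(same_subnet("192.168.1.15", "192.168.1.42"))  # both 192.168.1.x
print(same_subnet("192.168.0.15", "10.0.0.42"))     # different networks
```

If your router uses a different mask, pass the matching prefix length instead of the default.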
- Ollama Still Listening on Localhost
Verify using:
lsof -i :11434
You want to see:
*:11434 (LISTEN)
Not:
localhost:11434
That tiny difference determines whether your server is accessible across the network or trapped locally.
Performance Notes
A dedicated inference machine changes the experience more than I expected.
The development machine remains responsive because:
- editors stop fighting with model memory
- browsers stop competing for RAM
- swap usage drops significantly
- thermal throttling becomes less common
Meanwhile, the inference machine focuses almost entirely on token generation.
Even over WiFi, latency remains surprisingly usable for day-to-day development.
For heavier workloads:
- prefer 5GHz WiFi
- or ideally ethernet
Streaming becomes noticeably smoother, especially with larger models.
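If you want numbers rather than impressions when comparing WiFi and ethernet, a small timing helper is enough. This is a sketch under assumptions: `median_latency_ms` is a hypothetical helper, and the demonstration times a stand-in function that just sleeps, not a real request.

```python
import statistics
import time
from typing import Callable

def median_latency_ms(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Offline demonstration with a stand-in that sleeps for about 10 ms.
print(f"{median_latency_ms(lambda: time.sleep(0.01)):.1f} ms")
```

To measure the network path, pass a function that performs one small request against the inference machine (for example, fetching `/api/tags`) and compare the result across connection types. The median is used so a single slow outlier does not skew the figure.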
And interestingly, older machines suddenly become useful again.
An older M1 Air with 8GB RAM may struggle trying to do everything at once. But as a dedicated inference node? It becomes a surprisingly capable local AI server for smaller quantized models that fit within its memory.
That’s a much better use of aging hardware than letting it collect dust.
Why This Setup Is Worth It
What surprised me most wasn’t just the performance improvement.
It was the mental separation.
Inference stopped feeling like a fragile experiment running beside my editor and started feeling like infrastructure — something stable, persistent, and always available on the network.
That changes how you build.
You stop thinking:
“Can my machine handle this?”
And start thinking:
“What can I build now that inference is always available?”
That shift matters.
Because once local AI becomes dependable instead of temporary, your workflow changes completely. Models stop being demos. They become building blocks.
And honestly, that’s when local AI development starts becoming truly fun again.

