Siddharth Chopda

Posted on May 23

Running Ollama Across Multiple Devices: A Low-Cost, Practical Setup for Local AI Development

#ai

There’s a strange moment that happens when you first start running local LLMs seriously.

At the beginning, everything feels almost magical. You pull a model, type a prompt, and watch tokens stream back from your own machine. No API keys. No cloud dependency. No latency from some distant server farm. Just raw local inference running quietly beside your editor.

For a while, it feels like the future arrived early.

Then reality shows up.

Your browser opens fifteen tabs. Cursor agents are running. Docker wakes up. A few terminals pile on. Maybe a frontend dev server starts watching files, a backend process begins rebuilding itself every few seconds, and suddenly your laptop is no longer “running AI” — it’s negotiating for survival with swap memory.

That’s where this setup came from.

I was running a local model (gemma4:e2b, specifically) and noticed that inference alone was consuming roughly 7.2GB of RAM during active usage. On paper, that sounds manageable. In practice, once the rest of a modern development workflow joins the party, things become tight very quickly, especially on lightweight machines built more for portability than sustained inference workloads.

The obvious solution wasn’t buying new hardware immediately.

It was separating responsibilities.

One machine became the dedicated inference node running Ollama. Another handled development, editors, browsers, containers, and application runtime. The result was cleaner, faster, quieter — and honestly, far more stable than I expected.

This article walks through that exact setup.

Why Run Ollama on a Separate Device?

Local AI workloads become memory-heavy long before they become compute-heavy.

A normal development session alone can easily consume:

1–2GB from browser tabs
1–3GB from VSCode and extensions
additional memory from Docker, databases, terminals, and background tooling

Then the model arrives.

Even relatively efficient quantized models can occupy several gigabytes while running:

Model               Approx. runtime memory
DeepSeek-R1 1.5B    ~1.5–2GB
Gemma 4             ~7GB+
Larger 8B models    8–12GB
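To make the squeeze concrete, here is a rough back-of-the-envelope tally in Python. The browser and editor figures are the upper ends of the ranges above, and the model figure is the 7.2GB I measured for gemma4:e2b. The 2GB for Docker and tooling and the 16GB total are assumptions picked purely for illustration, so swap in your own numbers.

```python
# Rough memory budget for a single-machine setup (all values in GB).
# Browser, editor, and model figures come from the ranges above; the
# Docker/tooling figure and the 16 GB total are illustrative assumptions.
workloads_gb = {
    "browser tabs": 2.0,
    "VSCode and extensions": 3.0,
    "Docker, databases, tooling (assumed)": 2.0,
    "gemma4:e2b inference": 7.2,
}


def headroom_gb(total_ram_gb, workloads):
    """Return the RAM left over after every listed workload is resident."""
    return total_ram_gb - sum(workloads.values())


total_ram_gb = 16.0  # assumed laptop configuration
used = sum(workloads_gb.values())
print(f"used: {used:.1f} GB, headroom: {headroom_gb(total_ram_gb, workloads_gb):.1f} GB")

# The same session with inference moved to another machine.
dev_only = {name: gb for name, gb in workloads_gb.items() if "inference" not in name}
print(f"headroom without the model: {headroom_gb(total_ram_gb, dev_only):.1f} GB")
```

With these assumed numbers, the single machine is left with under 2GB before the operating system takes its share, while the same machine without the model keeps about 9GB free.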

The real issue isn’t simply:

“Can it run?”

It’s whether your entire workflow remains usable while it runs.

That distinction matters.

A multi-device setup solves this elegantly.

One system handles:

model execution
token generation
embeddings
inference serving

The other handles:

application development
IDEs
frontend/backend processes
testing
browser-heavy workflows

You essentially create your own lightweight local AI server — except it lives entirely inside your home network.

Architecture Overview

The setup is surprisingly simple:

[Development Machine]
Frontend / Backend / IDE
            ↓
     HTTP Requests
            ↓
[Inference Machine]
Ollama + Local Models

Everything communicates over your local network.

No cloud infrastructure.
No tunneling.
No external APIs.

Just two machines doing separate jobs well.

Installing Ollama

Install Ollama on the machine that will handle inference.

Verify installation:

ollama --version

Start the Ollama server:

ollama serve

In another terminal, test a model:

ollama run gemma4:e2b

Or pull the model first if needed:

ollama pull gemma4:e2b

At this point, everything works — but only locally on that device.

Understanding Ollama’s Default Behavior

By default, Ollama binds to:

localhost:11434

That means:

requests from the same machine work
requests from other devices do not

This is intentional: the Ollama API has no built-in authentication, so listening only on localhost is the safer default.

To expose Ollama to your local network, it needs to listen on all interfaces instead of only localhost.

Exposing Ollama to Your Local Network

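First, stop any existing Ollama processes. If Ollama was installed as a desktop app or a system service, quit or stop it that way instead, since a supervised process can restart on its own: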

pkill ollama

Now launch Ollama like this:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

This changes the binding from:

localhost

to:

0.0.0.0

Which effectively means:

listen on every network interface, so other devices on the local network can connect.
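If the difference between the two addresses feels abstract, Python's standard ipaddress module can spell it out. This is only an illustrative sketch: describe_bind_address is a hypothetical helper of mine, not part of Ollama, and the labels it returns are my own wording.

```python
import ipaddress


def describe_bind_address(host):
    """Classify a bind address as loopback, all interfaces, or one interface."""
    if host == "localhost":
        return "loopback only: reachable from this machine"
    addr = ipaddress.ip_address(host)
    if addr.is_loopback:
        return "loopback only: reachable from this machine"
    if addr.is_unspecified:
        # 0.0.0.0 is the "unspecified" address: listen on every interface.
        return "all interfaces: other devices can connect (firewall permitting)"
    return "single interface: reachable through that address only"


for host in ("localhost", "127.0.0.1", "0.0.0.0", "192.168.1.15"):
    print(f"{host:>13} -> {describe_bind_address(host)}")
```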

A small change.
A very important one.

Finding Your Local IP Address

On macOS:

ipconfig getifaddr en0

On Linux:

hostname -I

That address becomes your Ollama server address. The examples below use 192.168.1.15; substitute your own.
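If you would rather not remember a different command per operating system, a small Python sketch can ask the OS which local address it would use for outbound traffic. The function name local_ip and the probe address are my own choices; on a machine with a VPN or several interfaces, the result may not be the LAN address you want, so treat it as a hint.

```python
import socket


def local_ip(probe_host="192.0.2.1", probe_port=80):
    """Return the IPv4 address of the interface used for outbound traffic.

    Connecting a UDP socket sends no packets; it only asks the OS which
    local address it would route from. The probe address belongs to a
    block reserved for documentation (TEST-NET-1) and is never contacted.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


print(local_ip())
```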

Testing the API

First test locally on the inference machine:

curl http://localhost:11434/api/tags

Expected response:

{
  "models": [...]
}

Now test using the actual network IP:

curl http://192.168.1.15:11434/api/tags

If this works, Ollama is exposed correctly.

Connecting From Another Device

Now move to your development machine and test the same request:

curl http://192.168.1.15:11434/api/tags

If successful, you now have remote inference access.
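The same check is easy to run from application code. Below is a minimal sketch using only the Python standard library; the helper names (tags_url, model_names, list_remote_models) are hypothetical, and 192.168.1.15 is just the example address used throughout this article.

```python
import json
import urllib.request


def tags_url(host, port=11434):
    """Build the /api/tags URL for an Ollama server."""
    return f"http://{host}:{port}/api/tags"


def model_names(payload):
    """Extract model names from a decoded /api/tags response."""
    return [model["name"] for model in payload.get("models", [])]


def list_remote_models(host, port=11434, timeout=5.0):
    """Fetch the model names served by a remote Ollama instance."""
    with urllib.request.urlopen(tags_url(host, port), timeout=timeout) as resp:
        return model_names(json.load(resp))


# Example (needs a reachable server, so it is left commented out):
# print(list_remote_models("192.168.1.15"))
```

Keeping the URL building and response parsing in separate functions means the network call is the only part that needs a live server.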

And this is the moment where things start feeling interesting.

Because at this point, any application capable of calling the Ollama API can connect over LAN:

VSCode extensions
local AI agents
Python backends
Node.js apps
browser interfaces
internal tooling
RAG pipelines

Your model is no longer tied to a single machine.

It becomes infrastructure.

Streaming Responses Over the Network

One of the nicest parts about Ollama is that streaming works the same over the network as it does locally.

Example:

curl http://192.168.1.15:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [
    {
      "role": "user",
      "content": "Explain vector embeddings simply"
    }
  ],
  "stream": true
}'

Responses arrive incrementally as tokens generate.

No extra configuration.
No WebSocket setup.
No special transport layer.

It just streams.

That simplicity makes local development workflows feel surprisingly polished.
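For a sense of what consuming that stream looks like in code, here is a hedged Python sketch using only the standard library. It assumes the streamed body is one JSON object per line, each carrying a message.content fragment and a done flag, which is how the chat endpoint responds when stream is true. The function names are my own.

```python
import json
import urllib.request


def iter_content(lines):
    """Yield text fragments from an iterable of JSON lines (bytes or str)."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        chunk = json.loads(line)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


def stream_chat(host, model, prompt, port=11434, timeout=120.0):
    """POST to /api/chat with streaming on and yield fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # The response object is iterable line by line, which matches the
        # one-JSON-object-per-line framing of the stream.
        yield from iter_content(resp)


# Example (needs a reachable server, so it is left commented out):
# for piece in stream_chat("192.168.1.15", "gemma4:e2b", "Explain vector embeddings simply"):
#     print(piece, end="", flush=True)
```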

Common Issues

Firewall Blocking Connections

Operating systems often block inbound traffic silently.

Temporarily disable the firewall to confirm it is the cause, or allow incoming connections for:

Terminal
Ollama

Then retest connectivity.
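A quick way to retest from the development machine is a plain TCP connection attempt, which takes HTTP out of the picture entirely. This is a small sketch with a hypothetical helper name, can_connect. A False result only says the port was not reachable; it cannot tell a firewall apart from a server that is still bound to localhost.

```python
import socket


def can_connect(host, port=11434, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example (run from the development machine; the IP is this article's example):
# print(can_connect("192.168.1.15"))
```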

Devices Cannot Reach Each Other

Verify both devices are:

on the same WiFi network
on the same subnet
not connected through guest isolation

Healthy subnet alignment usually looks like:

192.168.1.x
192.168.1.x

If one device is on something like 192.168.0.x while the other is 10.0.0.x, they may not communicate directly.
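The subnet comparison can be checked mechanically with Python's ipaddress module. This sketch assumes a /24 prefix (netmask 255.255.255.0), which is common on home routers but not universal, and same_subnet is a hypothetical helper name.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len=24):
    """Return True if both IPv4 addresses fall in the same network.

    prefix_len=24 (netmask 255.255.255.0) is an assumption that matches
    many home routers; pass the real prefix length if yours differs.
    """
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


print(same_subnet("192.168.1.15", "192.168.1.42"))  # True
print(same_subnet("192.168.0.20", "10.0.0.7"))      # False
```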

Ollama Still Listening on Localhost

Verify using:

lsof -i :11434

You want to see:

*:11434 (LISTEN)

Not:

localhost:11434

That tiny difference determines whether your server is accessible across the network or trapped locally.

Performance Notes

A dedicated inference machine changes the experience more than I expected.

The development machine remains responsive because:

editors stop fighting with model memory
browsers stop competing for RAM
swap usage drops significantly
thermal throttling becomes less common

Meanwhile, the inference machine focuses almost entirely on token generation.

Even over WiFi, latency remains surprisingly usable for day-to-day development.

For heavier workloads:

prefer 5GHz WiFi
or ideally ethernet

Streaming becomes noticeably smoother, especially with larger models.
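If you want numbers instead of impressions when comparing WiFi against ethernet, a tiny timing harness is enough. The sketch below times any callable, so it makes no assumptions about Ollama itself; measure_latency_ms is a hypothetical name, and the commented example simply times the /api/tags request against the example address used earlier.

```python
import statistics
import time


def measure_latency_ms(call, runs=5):
    """Time `call` several times and summarize the results in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Example (needs a reachable server, so it is left commented out):
# import urllib.request
# ping = lambda: urllib.request.urlopen("http://192.168.1.15:11434/api/tags", timeout=5).read()
# print(measure_latency_ms(ping))
```

This measures request round-trip time only, not token generation speed, which depends on the inference machine and not on the network.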

And interestingly, older machines suddenly become useful again.

An older M1 Air with 8GB RAM may struggle trying to do everything at once. But as a dedicated inference node? It becomes a surprisingly capable local AI server for lightweight and medium-sized models.

That’s a much better use of aging hardware than letting it collect dust.

Why This Setup Is Worth It

What surprised me most wasn’t just the performance improvement.

It was the mental separation.

Inference stopped feeling like a fragile experiment running beside my editor and started feeling like infrastructure — something stable, persistent, and always available on the network.

That changes how you build.

You stop thinking:

“Can my machine handle this?”

And start thinking:

“What can I build now that inference is always available?”

That shift matters.

Because once local AI becomes dependable instead of temporary, your workflow changes completely. Models stop being demos. They become building blocks.

And honestly, that’s when local AI development starts becoming truly fun again.