Ollama turns local AI model deployment into a single terminal command
Running 70B parameter models on a Mac Studio hits 80% of daily coding needs
Local models cost zero per request after hardware investment
Cloud APIs still win for complex reasoning but lose for routine tasks
A hybrid setup cut my monthly AI spend from 180 EUR to 65 EUR
The gap between local and cloud model quality is closing faster than anyone predicted
Why I Moved 80% of My AI Workload Off the Cloud
Something clicked for me about three weeks ago. I was looking at my API usage dashboard and realized I had spent 180 EUR in a single month on cloud AI requests. Not because I was doing anything extraordinary. Just coding. Writing functions, debugging, generating tests, refactoring. The kind of work that fills 90% of a developer's day.
Most of those requests did not need the most powerful model available. They needed a competent model that understood code, followed instructions, and responded quickly. That realization sent me down the Ollama rabbit hole. Three weeks later, my cloud AI bill is roughly a third of what it was, and I have not noticed a quality difference for the tasks I moved locally.
This is not a theoretical argument for local models. This is a practical guide based on running both setups side by side and measuring the results.
What Ollama Actually Is
Ollama is a tool that lets you run large language models on your own machine. One command installs it. One command downloads a model. One command starts it. That is not an exaggeration for marketing purposes. It is literally three commands.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama4
ollama run llama4
Behind the scenes, Ollama handles model quantization, memory management, GPU acceleration, and serving. It exposes an API that is compatible with the OpenAI API format, which means most tools that work with OpenAI or Claude can be pointed at your local Ollama instance with a URL change.
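As a sketch of what that "URL change" amounts to: the request below follows the OpenAI chat-completions format but targets Ollama's local endpoint instead of a cloud API. This builds the request without sending it (sending requires a running `ollama serve`); the model tag is whatever you have pulled.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint, served locally by default
OLLAMA_BASE = "http://localhost:11434/v1"

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-format chat completion request aimed at local Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("llama4", "Write a Python function that reverses a string.")
# urllib.request.urlopen(req) would return an OpenAI-shaped JSON response
```

The point is that nothing in the payload is Ollama-specific, which is why most OpenAI-compatible tools work with a base-URL swap.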
The project is open source. It runs on macOS, Linux, and Windows. It supports dozens of model families including Llama, Qwen, Gemma, Mistral, DeepSeek, and CodeLlama. New models typically appear on Ollama within days of their public release.
The reason Ollama matters now, more than it did a year ago, is that the models available to run locally have crossed a quality threshold. A year ago, local models were noticeably worse than cloud models for anything beyond simple tasks. Today, the best local models handle code generation, refactoring, documentation, and test writing at a level that is functionally equivalent to mid-tier cloud models.
The Hardware Question
The most common pushback against local models is hardware cost. Let me break down what actually works.
Mac with Apple Silicon (M2 Pro or better, 32GB+ unified memory). This is the sweet spot. Apple's unified memory architecture means the GPU can access all system memory directly, which is critical for running large models. A MacBook Pro with 36GB can run 13B parameter models comfortably and 34B models with some quantization. A Mac Studio with 192GB can run 70B+ models at production-quality speeds.
Linux with NVIDIA GPU (RTX 4090 or better, 24GB+ VRAM). Faster inference than Apple Silicon for equivalent model sizes. The RTX 4090 with 24GB VRAM handles 13B models natively and larger models with offloading. A dual-GPU setup extends this range significantly.
Any machine with 16GB+ RAM. You can run 7B parameter models on modest hardware. These smaller models are surprisingly capable for focused tasks like code completion, simple refactoring, and documentation generation.
The cost math works like this. A Mac Studio with 64GB unified memory runs about 2,300 EUR. At my previous cloud AI spend of 180 EUR per month, it pays for itself in 13 months. With 192GB, which opens up 70B models, you are looking at roughly 4,500 EUR and a 25-month payback. These numbers shift dramatically if you are spending more on cloud APIs.
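The payback arithmetic is simple enough to write down. The figures below are the ones from this section (2,300 EUR and 4,500 EUR hardware, 180 EUR per month in cloud spend); plug in your own numbers.

```python
import math

def payback_months(hardware_eur: float, monthly_cloud_eur: float) -> int:
    """Months until the hardware cost equals cumulative cloud spend."""
    return math.ceil(hardware_eur / monthly_cloud_eur)

print(payback_months(2300, 180))  # 13 months for the 64GB Mac Studio
print(payback_months(4500, 180))  # 25 months for the 192GB configuration
```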
Which Models to Run and When
Not all local models are equal. After testing extensively, here is what I settled on for different tasks.
Qwen 3.5 72B (quantized to Q4). My default for complex coding tasks. Understands context well, generates clean code, handles multi-file refactoring without losing track of dependencies. Runs at about 15 tokens per second on my Mac Studio, which is fast enough for interactive use.
Gemma 4 27B. The speed champion. Runs at 40+ tokens per second and handles straightforward coding tasks beautifully. I use this for writing individual functions, generating tests, explaining code, and quick debugging. The quality-to-speed ratio is exceptional.
Llama 4 Scout. General-purpose workhorse with a massive context window. Good for tasks that need to process large amounts of code or documentation at once. The 10M token context makes it useful for codebase-wide analysis that would be prohibitively expensive on cloud APIs.
CodeLlama 34B. Still competitive for pure code generation tasks. Particularly strong at completing code from partial implementations and understanding coding patterns.
DeepSeek R1 (distilled). When I need actual reasoning about architecture decisions or complex debugging, this gets closer to cloud model quality than anything else running locally. Slower due to the reasoning overhead, but the output quality justifies the wait for hard problems.
Cloud vs Local: The Honest Comparison
I tracked my usage across both setups for three weeks. Here is what the data shows.
Where local models win decisively:
Code completion and generation (single functions, components)
Test writing from existing code
Documentation and comment generation
Simple refactoring (rename, extract, inline)
Explaining code sections
Generating boilerplate and scaffolding
Any task where latency matters more than peak quality
Where cloud models (Opus 4.6 specifically) still win:
Multi-step architectural reasoning
Debugging complex cross-system issues
Creative problem-solving with novel constraints
Tasks requiring deep domain knowledge outside of coding
Long, multi-turn conversations where context coherence matters
Anything where being 5% better justifies the cost
Where it is genuinely a toss-up:
Code review and quality analysis
Refactoring across multiple files
Writing technical content
Build system configuration
API integration work
The surprise was how large the "local wins" category turned out to be. I estimated it would be 50% of my workload. It was closer to 80%. Most coding work is not architecturally complex. It is execution. Write this function. Test this component. Refactor this module. Format this data. For execution work, local models are not a compromise. They are the right tool.
Setting Up the Hybrid Workflow
The setup that works for me uses Ollama as the default for everything, with automatic escalation to Claude API for tasks that need it. Here is how it works in practice.
Step 1: Install and configure Ollama.
ollama pull qwen3.5:72b-q4
ollama pull gemma4:27b
ollama serve
Ollama serves on localhost:11434 by default. The API is OpenAI-compatible, so any tool that supports custom endpoints can point there.
Step 2: Configure your coding tools to use local models first.
Most AI-powered dev tools support custom model endpoints. Point them at your Ollama instance. For tools that support model routing, set local models as the default and cloud models as the fallback.
Step 3: Set up cost-aware routing.
The approach I use is simple. Routine tasks go to local models automatically. When I need Opus-level quality, I explicitly request it. This puts me in control of when I spend money on cloud compute, rather than having every request hit the API by default.
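A minimal version of that routing logic looks like this. Everything here is a hypothetical sketch, not a real tool's API: the idea is just that cloud use is an explicit opt-in rather than the default.

```python
def pick_endpoint(task: str, force_cloud: bool = False) -> str:
    """Route a request: local Ollama by default, cloud only on explicit request.

    `force_cloud` is the manual escalation switch; by default every
    request stays on the local instance.
    """
    if force_cloud:
        return "https://api.anthropic.com"   # paid, explicit opt-in
    return "http://localhost:11434"          # free after the hardware cost

# Routine work never touches the network
endpoint = pick_endpoint("generate unit tests")

# Escalation is a deliberate decision, visible at the call site
hard = pick_endpoint("debug cross-system race condition", force_cloud=True)
```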
Step 4: Monitor and adjust.
Track which tasks you escalate to cloud models and whether the escalation was actually necessary. After a week, you will have a clear picture of your actual cloud dependency versus your assumed cloud dependency. For most developers, the actual number is much lower.
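The tracking can be as simple as a tally: log each task, whether you escalated, and whether the escalation was necessary in hindsight, then compute your real cloud dependency. The log entries below are illustrative, not my actual data.

```python
def cloud_dependency(log: list[dict]) -> float:
    """Fraction of logged tasks that genuinely needed a cloud model."""
    if not log:
        return 0.0
    needed = sum(1 for e in log if e["escalated"] and e["was_necessary"])
    return needed / len(log)

week = [
    {"task": "write tests",          "escalated": False, "was_necessary": False},
    {"task": "refactor module",      "escalated": True,  "was_necessary": False},
    {"task": "debug race condition", "escalated": True,  "was_necessary": True},
    {"task": "generate docstrings",  "escalated": False, "was_necessary": False},
]
print(f"{cloud_dependency(week):.0%}")  # 25% — assumed and actual often differ
```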
The Privacy and Speed Advantages Nobody Talks About
Cost savings get all the attention, but two other benefits of local models are arguably more important.
Privacy. When code runs through a local model, it never leaves your machine. No cloud provider sees your proprietary code. No API request gets logged on someone else's infrastructure. For anyone working with sensitive codebases, client code, or unreleased products, this is not a nice-to-have. It is a requirement.
I work on projects where the code itself is the product. Sending that code to a third-party API for processing has always felt uncomfortable, even with the privacy policies and data handling agreements. Local models eliminate that concern entirely.
Latency. Cloud API requests have inherent network latency. Even with fast connections, you are adding 100 to 300 milliseconds of round-trip time before the model even starts generating. Local models start generating instantly. For interactive coding where you are making rapid requests, the latency difference compounds into a meaningfully better experience.
On my setup, a typical code generation request returns the first token in under 50 milliseconds locally versus 200+ milliseconds from a cloud API. Over hundreds of requests per day, that adds up to a noticeably smoother workflow.
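The compounding is easy to quantify with the numbers above (roughly 50 ms local versus 200 ms cloud to first token); the request volume is a rough assumption, not a measurement.

```python
def daily_savings_seconds(requests_per_day: int,
                          cloud_ms: float = 200.0,
                          local_ms: float = 50.0) -> float:
    """Total time-to-first-token waiting removed per day by staying local."""
    return requests_per_day * (cloud_ms - local_ms) / 1000

print(daily_savings_seconds(400))  # 60.0 seconds of pure waiting, every day
```

A minute a day sounds trivial, but the waits land exactly when you are mid-thought, which is why the workflow feels so much smoother.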
The Ollama Ecosystem
Ollama is the engine, but the ecosystem around it makes local AI practical for daily work.
Open WebUI provides a ChatGPT-style interface for your local models. Useful for non-coding tasks like writing, brainstorming, and research where a chat interface is more natural than a terminal.
Continue is an IDE extension that connects to local models for code completion and chat. Works with VS Code and JetBrains IDEs.
LiteLLM is a proxy that unifies access to local and cloud models behind a single API. Useful for building tools that need to route between providers.
Modelfile customization lets you create specialized versions of base models with custom system prompts, temperature settings, and context configurations. I have a Modelfile that turns Qwen into a focused code reviewer with specific instructions about my coding standards.
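A Modelfile along these lines produces such a reviewer. This is a sketch: the base model tag and the prompt text are placeholders, not my actual configuration, but `FROM`, `PARAMETER`, and `SYSTEM` are the real Modelfile directives.

```
FROM qwen3.5:72b-q4
PARAMETER temperature 0.2
SYSTEM """
You are a code reviewer. Flag unclear naming, missing error handling,
and deviations from the project's style guide. Be terse and specific.
"""
```

Build and run it with `ollama create code-reviewer -f Modelfile` followed by `ollama run code-reviewer`.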
What Changes in the Next Six Months
The trajectory of local model quality is steep. Every major model release pushes the quality floor higher. Qwen 3.5 at 72B today is better than GPT-4 was at launch. Gemma 4 at 27B outperforms models that cost hundreds of millions to train just two years ago.
At the current pace, by October 2026, local models will handle 90% or more of typical development tasks at cloud-equivalent quality. The remaining 10%, the genuinely hard reasoning tasks, will still benefit from cloud models. But the economic case for running everything in the cloud will be gone.
The hardware trajectory matters too. Apple's next generation of silicon will push unified memory higher and inference speeds faster. NVIDIA's next consumer GPUs will handle larger models in less VRAM. AMD is finally competitive in the AI inference space. The hardware you buy today will run better models tomorrow.
My Recommendation
If you are still running 100% of your AI workload through cloud APIs, you are overpaying and introducing unnecessary dependencies. Here is the minimum viable local setup.
Install Ollama (5 minutes)
Pull one general-purpose model like Gemma 4 27B (10 minutes depending on bandwidth)
Run it alongside your existing cloud setup for one week
Track which tasks you can handle locally without quality loss
Gradually shift those tasks to local models
Keep cloud API access for the tasks that genuinely need it
The investment is minimal and the downside is zero. The upside compounds over time: lower costs, better privacy, faster response times, and independence from provider policy changes.
The developers who figure out the local-cloud hybrid workflow now will have a significant advantage when the next round of pricing changes, capacity restrictions, or terms-of-service updates inevitably arrives. That is not pessimism. It is pattern recognition.
Your AI infrastructure should not have a single point of failure. Today is a good day to start building redundancy.