DEV Community

Nrk Raju Guthikonda

The Developer's Guide to Running LLMs Locally: Ollama, Gemma 4, and Why Your Side Projects Don't Need an API Key

Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call?

I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.


Why Local LLMs?

Before diving into the how, let's talk about why:

1. Zero Cost Per Request

Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.
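As a back-of-the-envelope sketch of that comparison (every number below — request volume, token counts, prices, wattage — is an illustrative assumption, not a quote from any provider):

```python
# Rough cost comparison; all constants are illustrative assumptions.
REQUESTS_PER_DAY = 1_000
TOKENS_PER_REQUEST = 1_500          # prompt + completion, assumed
CLOUD_PRICE_PER_1K_TOKENS = 0.002   # assumed blended $/1K tokens
GPU_WATTS = 300                     # assumed draw under full load
ELECTRICITY_PER_KWH = 0.15          # assumed $/kWh

# Cloud: tokens per month, priced per 1K tokens.
cloud_monthly = REQUESTS_PER_DAY * 30 * TOKENS_PER_REQUEST / 1000 * CLOUD_PRICE_PER_1K_TOKENS

# Local (pessimistic): GPU at full load 24/7 for a month.
local_monthly = GPU_WATTS / 1000 * 24 * 30 * ELECTRICITY_PER_KWH

print(f"Cloud: ${cloud_monthly:.2f}/month, local power: ${local_monthly:.2f}/month")
```

Even with the GPU pinned at full load around the clock, the local electricity bill stays well under the cloud figure for this workload, and real GPUs idle most of the time.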

2. No Rate Limits

I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.

3. Privacy by Default

No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.

4. Offline Capability

Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.

5. Reproducibility

Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.

Getting Started: 5 Minutes to Your First Local LLM

Step 1: Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```

Step 2: Pull Gemma 4

```bash
ollama pull gemma4
```

This downloads the model (~5GB). One-time cost, then it's on your machine forever.

Step 3: Test It

```bash
ollama run gemma4 "Explain quantum computing in one paragraph"
```

That's it. You now have a local LLM running on your machine.

Building Applications with Python + Ollama

Here's a minimal Python application:

```python
import ollama

def ask(question: str) -> str:
    response = ollama.generate(
        model="gemma4",
        prompt=question,
        options={"temperature": 0.3}
    )
    return response["response"]

# That's literally it
print(ask("What are the SOLID principles in software engineering?"))
```
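For longer answers you'll usually want streaming, so tokens appear as they're generated instead of after a long pause. A sketch using the Ollama client's `stream=True` flag (`ask_streaming` and `join_chunks` are illustrative names of my own, not part of any library):

```python
def join_chunks(chunks) -> str:
    # Each streamed chunk is a dict carrying a partial "response" string.
    return "".join(chunk["response"] for chunk in chunks)

def ask_streaming(question: str) -> str:
    import ollama  # imported here so the pure helper above has no dependencies
    parts = []
    for chunk in ollama.generate(model="gemma4", prompt=question, stream=True):
        print(chunk["response"], end="", flush=True)  # show tokens live
        parts.append(chunk)
    print()
    return join_chunks(parts)
```

With `stream=True`, `ollama.generate` returns an iterator of chunks rather than a single response dict, which is why the pieces are collected and joined at the end.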

Adding Structure: The Pattern I Use in 90+ Projects

```python
import ollama

class LocalLLMApp:
    def __init__(self, model: str = "gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str, temperature: float = 0.3,
                 system: str | None = None) -> str:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        response = self.client.chat(
            model=self.model,
            messages=messages,
            options={"temperature": temperature}
        )
        return response["message"]["content"]
```

This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.
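One way to keep that separation clean is to put prompt construction in plain functions the subclass calls — they can be unit-tested with no model running. A hypothetical summarizer sketch (`summary_prompt` and its wording are illustrative, not from any specific project):

```python
def summary_prompt(text: str, max_sentences: int = 3) -> str:
    # Domain-specific prompt construction, kept outside the LLM plumbing
    # so it can be tested without Ollama installed or running.
    return (
        f"Summarize the following in at most {max_sentences} sentences. "
        f"Keep technical terms intact.\n\n{text}"
    )
```

A subclass then stays tiny: a `Summarizer(LocalLLMApp)` with a `summarize` method that just calls `self.generate(summary_prompt(text), system=...)`.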

Adding a Web Interface: Streamlit

```python
import streamlit as st

app = LocalLLMApp()

st.title("My Local AI Tool")
user_input = st.text_area("Enter your text:")

if st.button("Analyze"):
    with st.spinner("Thinking..."):
        result = app.generate(user_input)
    st.write(result)
```

One import and about ten lines of code, and you have a full web interface for your local AI tool.

Adding an API: FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()
app = LocalLLMApp()

class Query(BaseModel):
    text: str
    temperature: float = 0.3

@api.post("/analyze")
async def analyze(query: Query):
    result = app.generate(query.text, temperature=query.temperature)
    return {"result": result}
```

Now you have a REST API that any frontend, mobile app, or service can call — all running locally.
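To sanity-check the endpoint, here's a stdlib-only client sketch. It assumes the FastAPI app above is being served at `localhost:8000` (e.g. via `uvicorn`); `build_payload` and `analyze_remote` are illustrative names of my own:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/analyze"  # assumes the FastAPI app is running here

def build_payload(text: str, temperature: float = 0.3) -> bytes:
    # Field names must match the Query model exactly.
    return json.dumps({"text": text, "temperature": temperature}).encode("utf-8")

def analyze_remote(text: str, temperature: float = 0.3) -> str:
    req = urllib.request.Request(
        API_URL,
        data=build_payload(text, temperature),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]
```

Because the payload builder is separate, you can verify the request shape without the server up, then point any frontend at the same URL.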

Docker: One-Command Deployment

Every project I build ships with this `docker-compose.yml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  app:
    build: .
    ports:
      - "8501:8501"
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434
```

`docker compose up` is the entire deployment story. It works on any machine with Docker and a GPU.
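For this to work, the app container has to honor the `OLLAMA_HOST` variable the compose file sets. A small sketch (`ollama_host` is an illustrative helper name of my own):

```python
import os

def ollama_host() -> str:
    # Inside Docker Compose this resolves to the ollama service;
    # on a bare machine it falls back to Ollama's default local port.
    return os.environ.get("OLLAMA_HOST", "http://localhost:11434")
```

Then construct the client with `ollama.Client(host=ollama_host())` instead of the zero-argument default, and the same image runs both inside and outside Compose.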

Performance: What to Expect

On consumer hardware (RTX 3080, 16GB RAM):

  • Simple Q&A: 0.5-1 second
  • Paragraph generation: 2-5 seconds
  • Document analysis (2-3 pages): 5-15 seconds
  • Long-form generation (1000+ words): 15-30 seconds

These are practical, usable response times for interactive applications.
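Numbers like these vary a lot with hardware, quantization, and prompt length, so it's worth measuring on your own machine. A tiny stopwatch helper (`timed` is an illustrative name of my own):

```python
import time
from typing import Callable

def timed(fn: Callable[[], str]) -> tuple[str, float]:
    # Returns the function's result plus wall-clock seconds elapsed.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Example usage, assuming the `ask` function from earlier in the article:
# answer, seconds = timed(lambda: ask("Explain the CAP theorem briefly"))
# print(f"{seconds:.2f}s")
```

Wrapping calls this way gives you per-request latency numbers you can compare directly against the table above.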

When to Use Cloud vs. Local

| Use case | Local | Cloud |
|----------|-------|-------|
| Prototyping | ✅ Zero cost | ❌ Token costs add up |
| Sensitive data | ✅ Privacy by default | ❌ Requires BAA/DPA |
| Production (small scale) | ✅ Fixed hardware cost | ✅ Easy to scale |
| Production (large scale) | ❌ Hardware limits | ✅ Elastic scaling |
| Offline/air-gapped | ✅ Works anywhere | ❌ Requires internet |
| Cutting-edge capability | ❌ Smaller models | ✅ Latest models |

My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.

90+ Projects and Counting

I've applied this pattern across:

  • Healthcare: Patient intake, lab results, EHR de-identification
  • Legal: Contract analysis, brief generation, compliance checking
  • Education: Study bots, exam generators, flashcard creators
  • Creative: Story generators, poetry engines, mood journals
  • Developer Tools: Code review, API docs, performance profiling
  • Finance: Budget analyzers, financial report summarizers
  • Security: Vulnerability scanners, alert summarizers

Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.

The code is open source: github.com/kennedyraju55

Start building locally. Your AI projects don't need an API key.


*Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on dev.to.*
