Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call?
I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.
Why Local LLMs?
Before diving into the how, let's talk about why:
1. Zero Cost Per Request
Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.
2. No Rate Limits
I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.
3. Privacy by Default
No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.
4. Offline Capability
Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.
5. Reproducibility
Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.
Getting Started: 5 Minutes to Your First Local LLM
Step 1: Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
Step 2: Pull Gemma 4
ollama pull gemma4
This downloads the model (~5GB). One-time cost, then it's on your machine forever.
Step 3: Test It
ollama run gemma4 "Explain quantum computing in one paragraph"
That's it. You now have a local LLM running on your machine.
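Under the hood, ollama run talks to a local server on port 11434, which also exposes a REST API. A quick sketch with curl, assuming a default install ("stream": false returns one JSON object instead of a token stream):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Explain quantum computing in one paragraph",
  "stream": false
}'
```

This is the same endpoint the Python client wraps, so anything that can make an HTTP request can use your local model.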
Building Applications with Python + Ollama
Here's a minimal Python application:
import ollama

def ask(question: str) -> str:
    response = ollama.generate(
        model="gemma4",
        prompt=question,
        options={"temperature": 0.3},
    )
    return response["response"]

# That's literally it
print(ask("What are the SOLID principles in software engineering?"))
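One refinement I sometimes add on top of ask(): local generation is the slow step, so a small in-memory cache makes repeated questions instant. A hypothetical sketch using functools.lru_cache — the _model_call stub stands in for the real ollama.generate call, and caching assumes you're happy serving a stored answer instead of re-sampling:

```python
from functools import lru_cache

def _model_call(question: str) -> str:
    # Stand-in for the real ollama.generate call in ask() above.
    return f"answer to: {question}"

@lru_cache(maxsize=256)
def cached_ask(question: str) -> str:
    # Identical questions hit the cache instead of the model.
    return _model_call(question)
```

Note the trade-off: an LLM at nonzero temperature can give different answers to the same prompt, so only cache where a fixed answer is acceptable.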
Adding Structure: The Pattern I Use in 90+ Projects
import ollama

class LocalLLMApp:
    def __init__(self, model: str = "gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str, temperature: float = 0.3,
                 system: str | None = None) -> str:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        response = self.client.chat(
            model=self.model,
            messages=messages,
            options={"temperature": temperature},
        )
        return response["message"]["content"]
This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.
Adding a Web Interface: Streamlit
import streamlit as st

app = LocalLLMApp()

st.title("My Local AI Tool")
user_input = st.text_area("Enter your text:")

if st.button("Analyze"):
    with st.spinner("Thinking..."):
        result = app.generate(user_input)
    st.write(result)
One import and about ten lines of code: a full web interface for your local AI tool.
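To launch it (assuming the snippet lives in a file called app.py — the filename is just my convention):

```shell
streamlit run app.py
# then open http://localhost:8501 in your browser
```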
Adding an API: FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
api = FastAPI()
app = LocalLLMApp()
class Query(BaseModel):
text: str
temperature: float = 0.3
@api.post("/analyze")
async def analyze(query: Query):
result = app.generate(query.text, temperature=query.temperature)
return {"result": result}
Now you have a REST API that any frontend, mobile app, or service can call — all running locally.
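Assuming the snippet above lives in main.py, serve it with uvicorn and exercise it with curl (the filename and port are my conventions — adjust to taste):

```shell
uvicorn main:api --port 8000
# in another terminal:
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"text": "Summarize the SOLID principles", "temperature": 0.2}'
```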
Docker: One-Command Deployment
Every project I build ships with this docker-compose.yml:
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  app:
    build: .
    ports:
      - "8501:8501"
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama-data:
docker compose up — that's the entire deployment story. Works on any machine with Docker and a GPU.
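The app service's build: . expects a Dockerfile next to the compose file. Here's a minimal sketch — the filenames main.py and app.py and the requirements.txt are assumptions about your project layout:

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# FastAPI on 8000, Streamlit UI on 8501
EXPOSE 8501 8000
CMD ["sh", "-c", "uvicorn main:api --host 0.0.0.0 --port 8000 & streamlit run app.py --server.address 0.0.0.0 --server.port 8501"]
```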
Performance: What to Expect
On consumer hardware (RTX 3080, 16GB RAM):
- Simple Q&A: 0.5-1 second
- Paragraph generation: 2-5 seconds
- Document analysis (2-3 pages): 5-15 seconds
- Long-form generation (1000+ words): 15-30 seconds
These are practical, usable response times for interactive applications.
When to Use Cloud vs. Local
| Use Case | Local | Cloud |
|---|---|---|
| Prototyping | ✅ Zero cost | ❌ Token costs add up |
| Sensitive data | ✅ Privacy by default | ❌ Requires BAA/DPA |
| Production (small scale) | ✅ Fixed hardware cost | ✅ Easy to scale |
| Production (large scale) | ❌ Hardware limits | ✅ Elastic scaling |
| Offline/air-gapped | ✅ Works anywhere | ❌ Requires internet |
| Cutting-edge capability | ❌ Smaller models | ✅ Latest models |
My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.
90+ Projects and Counting
I've applied this pattern across:
- Healthcare: Patient intake, lab results, EHR de-identification
- Legal: Contract analysis, brief generation, compliance checking
- Education: Study bots, exam generators, flashcard creators
- Creative: Story generators, poetry engines, mood journals
- Developer Tools: Code review, API docs, performance profiling
- Finance: Budget analyzers, financial report summarizers
- Security: Vulnerability scanners, alert summarizers
Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.
The code is open source: github.com/kennedyraju55
Start building locally. Your AI projects don't need an API key.
*Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on dev.to.*