Last year, I set myself a challenge: build AI-powered tools for every domain I could think of — healthcare, legal, education, finance, security, creative writing — but with one hard constraint.
Nothing leaves my machine.
No API keys. No usage-based billing. No data flying to someone else's server. Every single inference runs locally, on my own hardware, with open-weight models.
The result? Over 90 production-quality AI tools (and counting — we're past 100 now), all sharing a single reusable architecture pattern. They run offline. They cost nothing after the initial setup. And they handle sensitive data — patient intake forms, legal contracts, financial records — without ever phoning home.
Here's exactly how I built them.
Why Local?
Three reasons kept pulling me toward local inference:
Privacy is non-negotiable. When you're building a patient intake summarizer or a differential diagnosis assistant, you can't casually send medical records to a cloud API. HIPAA doesn't care how convenient GPT-4 is. Running locally means the data never leaves the machine. Period.
Cost compounds fast. I was burning through OpenAI credits during prototyping. Multiply that by 90+ tools, each with their own usage patterns, and the bill gets ugly. Local inference is a one-time investment in compute. After that? Free forever.
Offline capability matters. I wanted tools that work on a plane, in a coffee shop with bad WiFi, during an internet outage. Cloud APIs are a single point of failure. Local models just... work.
The Stack
Every tool in the portfolio uses the same core stack. Here's what I chose and why:
Gemma 4 + Ollama
Ollama is the backbone. It's a local LLM runtime that makes running open-weight models as simple as `ollama run gemma4`. No CUDA configuration nightmares. No manual weight downloading. It exposes a clean REST API on port `11434` that's compatible with the OpenAI chat completions format.
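To make that concrete, here's a minimal stdlib-only sketch of hitting Ollama's OpenAI-compatible endpoint directly. The `build_payload` and `ask` helpers are illustrative names of my own, not part of any Ollama client library, and this assumes Ollama is listening on its default port:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint (default port).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(model, prompt):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def ask(model, prompt):
    """POST the payload to Ollama and return the assistant's reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses nest the text under choices[0].
    return body["choices"][0]["message"]["content"]
```

Calling `ask("gemma4", "Summarize this note...")` requires a running Ollama instance, but the payload shape is the same one any OpenAI-compatible client would send.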
I standardized on Google's Gemma 4 as the default model. It punches well above its weight for instruction-following, structured output, and domain-specific tasks. For 90% of my tools, it's more than enough.
Streamlit (Web UI) + FastAPI (REST API)
Every tool ships with two interfaces:
- **Streamlit** on port `8501` — for interactive use, demos, and rapid prototyping. Users get a web UI in seconds.
- **FastAPI** on port `8000` — for programmatic access, integration with other systems, and batch processing.
This dual-interface pattern means every tool is both human-friendly and machine-friendly from day one.
Docker Compose
Everything is containerized. The app, the LLM runtime, the API server — all orchestrated with Docker Compose. This makes deployment reproducible and eliminates "works on my machine" problems.
Click CLI + Rich
For command-line interfaces, I use Click for argument parsing and Rich for beautiful terminal output. Some tools are CLI-first, some are web-first — but the core logic is always the same underneath.
The Shared Architecture Pattern
Here's the secret sauce: every single tool follows the same structural pattern. Once I nailed this template, spinning up a new tool takes 30 minutes, not 3 days.
1. Configuration — config.yaml
Every tool reads its settings from a YAML config file:
```yaml
model: "gemma4"
temperature: 0.3
max_tokens: 2048
log_level: "INFO"
ollama_url: "http://localhost:11434"
```
This is intentionally simple. Want to swap models? Change one line. Need more creative output? Bump the temperature. Debugging? Set log_level to DEBUG. No code changes required.
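In practice you'd parse this file with PyYAML (`yaml.safe_load`), but since the config is a flat key/value map, a stdlib-only loader is enough to sketch the idea. The `Config` dataclass and its field names mirror the file above; the parser itself is illustrative, not my actual code:

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Defaults match the config.yaml shown above.
    model: str = "gemma4"
    temperature: float = 0.3
    max_tokens: int = 2048
    log_level: str = "INFO"
    ollama_url: str = "http://localhost:11434"


def load_config(text):
    """Parse flat `key: "value"` lines into a Config, keeping defaults."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs survive intact.
        key, _, raw = line.partition(":")
        values[key.strip()] = raw.strip().strip('"')

    cfg = Config()
    if "model" in values:
        cfg.model = values["model"]
    if "temperature" in values:
        cfg.temperature = float(values["temperature"])
    if "max_tokens" in values:
        cfg.max_tokens = int(values["max_tokens"])
    if "log_level" in values:
        cfg.log_level = values["log_level"]
    if "ollama_url" in values:
        cfg.ollama_url = values["ollama_url"]
    return cfg
```

The point is the shape: typed defaults in code, overridden by whatever the file provides, so a missing key never crashes a tool.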
2. The Shared LLM Client
This is the most important piece. Every tool imports from a shared common.llm_client module that handles all LLM communication:
```python
from common.llm_client import chat, check_ollama_running
from .config import load_config

CONFIG = load_config()

SYSTEM_PROMPT = """You are a clinical intake summarization assistant.
Extract key medical information, organize by category,
and flag any urgent findings."""


def summarize_intake(
    intake_text,
    summary_format="structured",
    focus_areas=None,
    conversation_history=None,
):
    # Build the user prompt from the intake text and options.
    user_prompt = f"Produce a {summary_format} summary of this intake form."
    if focus_areas:
        user_prompt += f" Focus on: {', '.join(focus_areas)}."
    user_prompt += f"\n\n{intake_text}"

    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation_history or [])
    messages.append({"role": "user", "content": user_prompt})

    return chat(messages)
```
The chat() function handles the HTTP call to Ollama, error handling, retries, and response parsing. The check_ollama_running() function is a health check that every tool calls on startup. This pattern means the LLM integration code is written once and shared everywhere. Each new tool only needs to define:
- A system prompt
- A domain-specific function that builds the message list
- Any post-processing logic
That's it. The plumbing is invisible.
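The article doesn't show `chat()` itself, so here's a plausible stdlib-only sketch of what it does: POST to Ollama's native `/api/chat` endpoint, with the retries and error handling described above. The retry count and backoff are assumptions for illustration, not the shared module's exact code:

```python
import json
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def check_ollama_running(base_url=OLLAMA_URL):
    """Startup health check: can we list the installed models?"""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5):
            return True
    except (urllib.error.URLError, OSError):
        return False


def chat(messages, model="gemma4", retries=3, base_url=OLLAMA_URL):
    """Send a message list to Ollama and return the reply text."""
    payload = json.dumps(
        {"model": model, "messages": messages, "stream": False}
    ).encode()
    last_err = None
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                f"{base_url}/api/chat",
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=120) as resp:
                # Non-streaming responses carry the text in message.content.
                return json.load(resp)["message"]["content"]
        except urllib.error.URLError as err:
            last_err = err
            time.sleep(2**attempt)  # simple exponential backoff
    raise RuntimeError(f"Ollama unreachable after {retries} attempts") from last_err
```

Because every tool funnels through one function like this, adding logging, retries, or a model swap happens in exactly one place.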
3. Docker Compose — The Sidecar Pattern
Every tool uses the same Docker Compose template with Ollama running as a sidecar:
```yaml
services:
  app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy
    networks:
      - llm-network

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 5
    networks:
      - llm-network

  api:
    build: .
    command: ["uvicorn", "src.module.api:app", "--host", "0.0.0.0", "--port", "8000"]
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy
    networks:
      - llm-network

networks:
  llm-network:
    driver: bridge

volumes:
  ollama_data:
```
Key details that matter:
- **Health checks:** The app doesn't start until Ollama is confirmed healthy. No more race conditions where the app tries to call a model that hasn't loaded yet.
- **Bridge network:** Services communicate by name (`http://ollama:11434`), not localhost. Clean, predictable, portable.
- **Persistent volume:** Model weights persist across container restarts. You download Gemma 4 once, not every time you `docker compose up`.
- **Sidecar pattern:** Ollama runs alongside the app, not inside it. This keeps containers focused and means you can share one Ollama instance across multiple tools if needed.
4. Testing — pytest
Every tool includes a pytest test suite. Tests cover the core logic, config loading, and — where possible — mock the LLM responses to keep tests fast and deterministic.
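Here's one way to mock the LLM so tests stay fast and deterministic. The `summarize` function below is a stand-in for any tool's core logic (not my actual code); the trick is injecting the `chat` callable so a test can substitute a `MagicMock` and inspect exactly what would have been sent to the model:

```python
from unittest.mock import MagicMock


def summarize(text, chat):
    """Toy core logic with the LLM client injected for testability."""
    messages = [
        {"role": "system", "content": "Summarize clinical intake notes."},
        {"role": "user", "content": text},
    ]
    return chat(messages)


def test_summarize_builds_messages():
    # The fake LLM returns a canned answer and records its arguments.
    fake_chat = MagicMock(return_value="SUMMARY")
    result = summarize("Patient reports chest pain.", chat=fake_chat)
    assert result == "SUMMARY"

    # Verify the prompt construction without ever touching Ollama.
    sent = fake_chat.call_args.args[0]
    assert sent[0]["role"] == "system"
    assert "chest pain" in sent[1]["content"]
```

In the real suite you'd patch `common.llm_client.chat` with `unittest.mock.patch` or pytest's `monkeypatch` instead of passing it explicitly, but the principle is identical.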
What I've Built
The 90-local-llm-projects repo (now past 100 projects) catalogs the full portfolio. Here's a sampling by category:
🏥 Healthcare AI
- Patient Intake Summarizer — Extracts structured medical information from intake forms, flags urgent findings
- Differential Diagnosis Assistant — Suggests potential diagnoses from symptom descriptions, ranked by likelihood
- Clinical note generators, medication interaction checkers, and more
⚖️ Legal AI
- Contract analyzers that flag risky clauses
- Legal document summarizers for non-lawyers
- NDA review tools
🎓 Education
- Personalized tutoring assistants
- Quiz and flashcard generators
- Research paper summarizers
💻 Developer Tools
- Code review assistants
- Documentation generators
- Git commit message writers
- Architecture diagram generators
🎨 Creative AI
- Story writing assistants
- Poetry generators
- Dialogue writers for game developers
🔒 Security
- Log analysis tools
- Vulnerability description parsers
- Incident report generators
💰 Finance
- Expense categorizers
- Financial report summarizers
- Investment research assistants
Each one follows the exact same architecture. Swap the system prompt, adjust the config, and you have a new domain-specific AI tool.
Performance & Tradeoffs
Let's be honest about the comparison:
| | Local (Gemma 4 + Ollama) | Cloud (GPT-4 / Claude) |
|---|---|---|
| Privacy | ✅ Data never leaves machine | ❌ Sent to third-party servers |
| Cost | ✅ Free after hardware | ❌ Pay per token, scales with usage |
| Offline | ✅ Works anywhere | ❌ Requires internet |
| Speed (first token) | ⚡ ~200ms on decent GPU | ⚡ ~300-500ms (network + inference) |
| Speed (throughput) | ⚠️ Limited by local hardware | ✅ Scales with cloud infra |
| Model quality | ⚠️ Good, not SOTA | ✅ Best-in-class reasoning |
| Setup complexity | ⚠️ Docker + Ollama needed | ✅ One API key |
The tradeoff is real. For tasks that require frontier-level reasoning — complex multi-step logic, nuanced creative writing, advanced code generation — cloud models still win. But for the 80% of tasks that are "read this text, extract structured information, follow this template"? Local models are more than enough. And the privacy, cost, and offline benefits are massive.
Lessons Learned
After building 100+ tools on this architecture, here's what I'd tell someone starting today:
Start with the template, not the tool. Get your Docker Compose, shared LLM client, and config pattern right first. Then cranking out new tools is trivial.
Health checks save hours of debugging. Without them, you'll spend half your time figuring out why the app can't reach Ollama. The service_healthy condition in Docker Compose is non-negotiable.
YAML config > environment variables for model settings. Environment variables are fine for deployment overrides, but YAML is readable, version-controllable, and self-documenting. Use both: YAML for defaults, env vars for overrides.
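The "YAML for defaults, env vars for overrides" layering can be a few lines. The env-var names here follow the Compose file (`OLLAMA_HOST`); `LOG_LEVEL` is a hypothetical addition for illustration:

```python
import os


def apply_env_overrides(cfg, env=None):
    """Layer environment overrides on top of file-based defaults."""
    env = os.environ if env is None else env
    merged = dict(cfg)
    # Docker Compose injects OLLAMA_HOST so containers find the sidecar.
    if "OLLAMA_HOST" in env:
        merged["ollama_url"] = env["OLLAMA_HOST"]
    # Hypothetical deployment-time override for verbosity.
    if "LOG_LEVEL" in env:
        merged["log_level"] = env["LOG_LEVEL"]
    return merged
```

The same config file then works unchanged on a laptop (no env vars, localhost defaults) and inside Compose (env vars point at the `ollama` service).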
System prompts are 90% of the work. The code patterns barely change between tools. The system prompt is where the domain expertise lives. Invest time crafting and iterating on prompts. The architecture is just plumbing.
Temperature matters more than you think. For medical and legal tools, I use 0.1–0.3. For creative tools, 0.7–0.9. Getting this wrong makes an otherwise good tool feel broken.
Test with mocked responses. Don't make your CI dependent on a running Ollama instance. Mock the chat() function, test your prompt construction and response parsing independently.
Persistent volumes for model weights. Downloading a 5GB+ model every time you rebuild a container is painful. The ollama_data volume solves this permanently.
What's Next
I'm continuing to expand the portfolio. The architecture scales effortlessly — each new tool is just a new system prompt, a new core.py, and a copy of the Docker Compose template. I'm also experimenting with:
- Multi-model pipelines — chaining different models for different subtasks within a single tool
- RAG integration — adding retrieval-augmented generation with local vector stores for tools that need domain-specific knowledge bases
- Model hot-swapping — changing models at runtime based on task complexity
The entire portfolio is open source. If you're interested in building local AI tools — or just want to see how 100+ tools can share a single architecture — check out the repos:
- 📂 90-local-llm-projects — The full catalog
- 🏥 patient-intake-summarizer — Healthcare AI example
- 🩺 differential-diagnosis-assistant — Another healthcare tool
Star the repos if you find them useful. Fork them. Build your own. The whole point is that this architecture is simple enough that anyone can run it.
Your data. Your hardware. Your AI.
About the Author
Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft, working on Copilot Search Infrastructure — specifically Semantic Indexing and Retrieval-Augmented Generation (RAG) systems that power GitHub Copilot. By night, he builds open-source AI tools that run entirely locally, with a portfolio of 100+ projects spanning healthcare, legal, education, developer tools, and more. Find him on GitHub and LinkedIn.
Originally published on Medium. Cross-posted to dev.to.
Tags: #AI #LLM #LocalAI #Ollama #Gemma #Python #Docker #OpenSource #MachineLearning #Privacy #HealthcareAI #DevTools