Last year, I set myself a challenge: build AI-powered tools for every domain I could think of — healthcare, legal, education, finance, security, creative writing — but with one hard constraint.
Nothing leaves my machine.
No API keys. No usage-based billing. No data flying to someone else's server. Every single inference runs locally, on my own hardware, with open-weight models.
The result? Over 90 production-quality AI tools (and counting — we're past 100 now), all sharing a single reusable architecture pattern. They run offline. They cost nothing after the initial setup. And they handle sensitive data — patient intake forms, legal contracts, financial records — without ever phoning home.
Here's exactly how I built them.
Why Local?
Three reasons kept pulling me toward local inference:
Privacy is non-negotiable. When you're building a patient intake summarizer or a differential diagnosis assistant, you can't casually send medical records to a cloud API. HIPAA doesn't care how convenient GPT-4 is. Running locally means the data never leaves the machine. Period.
Cost compounds fast. I was burning through OpenAI credits during prototyping. Multiply that by 90+ tools, each with their own usage patterns, and the bill gets ugly. Local inference is a one-time investment in compute. After that? Free forever.
Offline capability matters. I wanted tools that work on a plane, in a coffee shop with bad WiFi, during an internet outage. Cloud APIs are a single point of failure. Local models just... work.
The Stack
Every tool in the portfolio uses the same core stack. Here's what I chose and why:
Gemma 4 + Ollama
Ollama is the backbone. It's a local LLM runtime that makes running open-weight models as simple as `ollama run gemma4`. No CUDA configuration nightmares. No manual weight downloading. It exposes a clean REST API on port `11434` that's compatible with the OpenAI chat completions format.
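To make that concrete, here's a minimal stdlib-only sketch of hitting Ollama's OpenAI-compatible endpoint directly. The `build_payload` and `ask` helpers are illustrative names of my own, not part of any Ollama client library, and this assumes Ollama is listening on its default port:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint (default port).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(model, prompt):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def ask(model, prompt):
    """POST the payload to Ollama and return the assistant's reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses nest the text under choices[0].
    return body["choices"][0]["message"]["content"]
```

Calling `ask("gemma4", "Summarize this note...")` requires a running Ollama instance, but the payload shape is the same one any OpenAI-compatible client would send.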
I standardized on Google's Gemma 4 as the default model. It punches well above its weight for instruction-following, structured output, and domain-specific tasks. For 90% of my tools, it's more than enough.
Streamlit (Web UI) + FastAPI (REST API)
Every tool ships with two interfaces:
- **Streamlit** on port `8501` — for interactive use, demos, and rapid prototyping. Users get a web UI in seconds.
- **FastAPI** on port `8000` — for programmatic access, integration with other systems, and batch processing.
This dual-interface pattern means every tool is both human-friendly and machine-friendly from day one.
Docker Compose
Everything is containerized. The app, the LLM runtime, the API server — all orchestrated with Docker Compose. This makes deployment reproducible and eliminates "works on my machine" problems.
Click CLI + Rich
For command-line interfaces, I use Click for argument parsing and Rich for beautiful terminal output. Some tools are CLI-first, some are web-first — but the core logic is always the same underneath.
The Shared Architecture Pattern
Here's the secret sauce: every single tool follows the same structural pattern. Once I nailed this template, spinning up a new tool takes 30 minutes, not 3 days.
1. Configuration — config.yaml
Every tool reads its settings from a YAML config file:
```yaml
model: "gemma4"
temperature: 0.3
max_tokens: 2048
log_level: "INFO"
ollama_url: "http://localhost:11434"
```
This is intentionally simple. Want to swap models? Change one line. Need more creative output? Bump the temperature. Debugging? Set log_level to DEBUG. No code changes required.
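In practice you'd parse this file with PyYAML (`yaml.safe_load`), but since the config is a flat key/value map, a stdlib-only loader is enough to sketch the idea. The `Config` dataclass and its field names mirror the file above; the parser itself is illustrative, not my actual code:

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Defaults match the config.yaml shown above.
    model: str = "gemma4"
    temperature: float = 0.3
    max_tokens: int = 2048
    log_level: str = "INFO"
    ollama_url: str = "http://localhost:11434"


def load_config(text):
    """Parse flat `key: "value"` lines into a Config, keeping defaults."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs survive intact.
        key, _, raw = line.partition(":")
        values[key.strip()] = raw.strip().strip('"')

    cfg = Config()
    if "model" in values:
        cfg.model = values["model"]
    if "temperature" in values:
        cfg.temperature = float(values["temperature"])
    if "max_tokens" in values:
        cfg.max_tokens = int(values["max_tokens"])
    if "log_level" in values:
        cfg.log_level = values["log_level"]
    if "ollama_url" in values:
        cfg.ollama_url = values["ollama_url"]
    return cfg
```

The point is the shape: typed defaults in code, overridden by whatever the file provides, so a missing key never crashes a tool.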
2. The Shared LLM Client
This is the most important piece. Every tool imports from a shared common.llm_client module that handles all LLM communication:
```python
from common.llm_client import chat, check_ollama_running
from .config import load_config

CONFIG = load_config()

SYSTEM_PROMPT = """You are a clinical intake summarization assistant.
Extract key medical information, organize by category,
and flag any urgent findings."""


def summarize_intake(
    intake_text,
    summary_format="structured",
    focus_areas=None,
    conversation_history=None,
):
    # Build the user prompt from the intake text and options.
    user_prompt = f"Produce a {summary_format} summary of this intake form."
    if focus_areas:
        user_prompt += f" Focus on: {', '.join(focus_areas)}."
    user_prompt += f"\n\n{intake_text}"

    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation_history or [])
    messages.append({"role": "user", "content": user_prompt})

    return chat(messages)
```
The chat() function handles the HTTP call to Ollama, error handling, retries, and response parsing. The check_ollama_running() function is a health check that every tool calls on startup. This pattern means the LLM integration code is written once and shared everywhere. Each new tool only needs to define:
- A system prompt
- A domain-specific function that builds the message list
- Any post-processing logic
That's it. The plumbing is invisible.
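The article doesn't show `chat()` itself, so here's a plausible stdlib-only sketch of what it does: POST to Ollama's native `/api/chat` endpoint, with the retries and error handling described above. The retry count and backoff are assumptions for illustration, not the shared module's exact code:

```python
import json
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def check_ollama_running(base_url=OLLAMA_URL):
    """Startup health check: can we list the installed models?"""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5):
            return True
    except (urllib.error.URLError, OSError):
        return False


def chat(messages, model="gemma4", retries=3, base_url=OLLAMA_URL):
    """Send a message list to Ollama and return the reply text."""
    payload = json.dumps(
        {"model": model, "messages": messages, "stream": False}
    ).encode()
    last_err = None
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                f"{base_url}/api/chat",
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=120) as resp:
                # Non-streaming responses carry the text in message.content.
                return json.load(resp)["message"]["content"]
        except urllib.error.URLError as err:
            last_err = err
            time.sleep(2**attempt)  # simple exponential backoff
    raise RuntimeError(f"Ollama unreachable after {retries} attempts") from last_err
```

Because every tool funnels through one function like this, adding logging, retries, or a model swap happens in exactly one place.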
3. Docker Compose — The Sidecar Pattern
Every tool uses the same Docker Compose template with Ollama running as a sidecar:
```yaml
services:
  app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy
    networks:
      - llm-network

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 5
    networks:
      - llm-network

  api:
    build: .
    command: ["uvicorn", "src.module.api:app", "--host", "0.0.0.0", "--port", "8000"]
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy
    networks:
      - llm-network

networks:
  llm-network:
    driver: bridge

volumes:
  ollama_data:
```
Key details that matter:
- **Health checks:** The app doesn't start until Ollama is confirmed healthy. No more race conditions where the app tries to call a model that hasn't loaded yet.
- **Bridge network:** Services communicate by name (`http://ollama:11434`), not localhost. Clean, predictable, portable.
- **Persistent volume:** Model weights persist across container restarts. You download Gemma 4 once, not every time you `docker compose up`.
- **Sidecar pattern:** Ollama runs alongside the app, not inside it. This keeps containers focused and means you can share one Ollama instance across multiple tools if needed.
4. Testing — pytest
Every tool includes a pytest test suite. Tests cover the core logic, config loading, and — where possible — mock the LLM responses to keep tests fast and deterministic.
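Here's one way to mock the LLM so tests stay fast and deterministic. The `summarize` function below is a stand-in for any tool's core logic (not my actual code); the trick is injecting the `chat` callable so a test can substitute a `MagicMock` and inspect exactly what would have been sent to the model:

```python
from unittest.mock import MagicMock


def summarize(text, chat):
    """Toy core logic with the LLM client injected for testability."""
    messages = [
        {"role": "system", "content": "Summarize clinical intake notes."},
        {"role": "user", "content": text},
    ]
    return chat(messages)


def test_summarize_builds_messages():
    # The fake LLM returns a canned answer and records its arguments.
    fake_chat = MagicMock(return_value="SUMMARY")
    result = summarize("Patient reports chest pain.", chat=fake_chat)
    assert result == "SUMMARY"

    # Verify the prompt construction without ever touching Ollama.
    sent = fake_chat.call_args.args[0]
    assert sent[0]["role"] == "system"
    assert "chest pain" in sent[1]["content"]
```

In the real suite you'd patch `common.llm_client.chat` with `unittest.mock.patch` or pytest's `monkeypatch` instead of passing it explicitly, but the principle is identical.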
What I've Built
The 90-local-llm-projects repo (now past 100 projects) catalogs the full portfolio. Here's a sampling by category:
🏥 Healthcare AI
- Patient Intake Summarizer — Extracts structured medical information from intake forms, flags urgent findings
- Differential Diagnosis Assistant — Suggests potential diagnoses from symptom descriptions, ranked by likelihood
- Clinical note generators, medication interaction checkers, and more
⚖️ Legal AI
- Contract analyzers that flag risky clauses
- Legal document summarizers for non-lawyers
- NDA review tools
🎓 Education
- Personalized tutoring assistants
- Quiz and flashcard generators
- Research paper summarizers
💻 Developer Tools
- Code review assistants
- Documentation generators
- Git commit message writers
- Architecture diagram generators
🎨 Creative AI
- Story writing assistants
- Poetry generators
- Dialogue writers for game developers
🔒 Security
- Log analysis tools
- Vulnerability description parsers
- Incident report generators
💰 Finance
- Expense categorizers
- Financial report summarizers
- Investment research assistants
Each one follows the exact same architecture. Swap the system prompt, adjust the config, and you have a new domain-specific AI tool.
Performance & Tradeoffs
Let's be honest about the comparison:
| | Local (Gemma 4 + Ollama) | Cloud (GPT-4 / Claude) |
|---|---|---|
| Privacy | ✅ Data never leaves machine | ❌ Sent to third-party servers |
| Cost | ✅ Free after hardware | ❌ Pay per token, scales with usage |
| Offline | ✅ Works anywhere | ❌ Requires internet |
| Speed (first token) | ⚡ ~200ms on decent GPU | ⚡ ~300-500ms (network + inference) |
| Speed (throughput) | ⚠️ Limited by local hardware | ✅ Scales with cloud infra |
| Model quality | ⚠️ Good, not SOTA | ✅ Best-in-class reasoning |
| Setup complexity | ⚠️ Docker + Ollama needed | ✅ One API key |
The tradeoff is real. For tasks that require frontier-level reasoning — complex multi-step logic, nuanced creative writing, advanced code generation — cloud models still win. But for the 80% of tasks that are "read this text, extract structured information, follow this template"? Local models are more than enough. And the privacy, cost, and offline benefits are massive.
Lessons Learned
After building 100+ tools on this architecture, here's what I'd tell someone starting today:
Start with the template, not the tool. Get your Docker Compose, shared LLM client, and config pattern right first. Then cranking out new tools is trivial.
Health checks save hours of debugging. Without them, you'll spend half your time figuring out why the app can't reach Ollama. The service_healthy condition in Docker Compose is non-negotiable.
YAML config > environment variables for model settings. Environment variables are fine for deployment overrides, but YAML is readable, version-controllable, and self-documenting. Use both: YAML for defaults, env vars for overrides.
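The "YAML for defaults, env vars for overrides" layering can be a few lines. The env-var names here follow the Compose file (`OLLAMA_HOST`); `LOG_LEVEL` is a hypothetical addition for illustration:

```python
import os


def apply_env_overrides(cfg, env=None):
    """Layer environment overrides on top of file-based defaults."""
    env = os.environ if env is None else env
    merged = dict(cfg)
    # Docker Compose injects OLLAMA_HOST so containers find the sidecar.
    if "OLLAMA_HOST" in env:
        merged["ollama_url"] = env["OLLAMA_HOST"]
    # Hypothetical deployment-time override for verbosity.
    if "LOG_LEVEL" in env:
        merged["log_level"] = env["LOG_LEVEL"]
    return merged
```

The same config file then works unchanged on a laptop (no env vars, localhost defaults) and inside Compose (env vars point at the `ollama` service).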
System prompts are 90% of the work. The code patterns barely change between tools. The system prompt is where the domain expertise lives. Invest time crafting and iterating on prompts. The architecture is just plumbing.
Temperature matters more than you think. For medical and legal tools, I use 0.1–0.3. For creative tools, 0.7–0.9. Getting this wrong makes an otherwise good tool feel broken.
Test with mocked responses. Don't make your CI dependent on a running Ollama instance. Mock the chat() function, test your prompt construction and response parsing independently.
Persistent volumes for model weights. Downloading a 5GB+ model every time you rebuild a container is painful. The ollama_data volume solves this permanently.
What's Next
I'm continuing to expand the portfolio. The architecture scales effortlessly — each new tool is just a new system prompt, a new core.py, and a copy of the Docker Compose template. I'm also experimenting with:
- Multi-model pipelines — chaining different models for different subtasks within a single tool
- RAG integration — adding retrieval-augmented generation with local vector stores for tools that need domain-specific knowledge bases
- Model hot-swapping — changing models at runtime based on task complexity
The entire portfolio is open source. If you're interested in building local AI tools — or just want to see how 100+ tools can share a single architecture — check out the repos:
- 📂 90-local-llm-projects — The full catalog
- 🏥 patient-intake-summarizer — Healthcare AI example
- 🩺 differential-diagnosis-assistant — Another healthcare tool
Star the repos if you find them useful. Fork them. Build your own. The whole point is that this architecture is simple enough that anyone can run it.
Your data. Your hardware. Your AI.
About the Author
Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft, working on Copilot Search Infrastructure — specifically Semantic Indexing and Retrieval-Augmented Generation (RAG) systems that power GitHub Copilot. By night, he builds open-source AI tools that run entirely locally, with a portfolio of 100+ projects spanning healthcare, legal, education, developer tools, and more. Find him on GitHub and LinkedIn.
Originally published on Medium. Cross-posted to dev.to.
Tags: #AI #LLM #LocalAI #Ollama #Gemma #Python #Docker #OpenSource #MachineLearning #Privacy #HealthcareAI #DevTools