Shashank Kumar Singh

Why I chose Ollama over cloud AI for my screen-reading assistant (and what I learned)

When I started building Clicky — a Windows AI assistant that reads your screen and answers out loud — I had to make a fundamental choice: cloud AI or local AI?

I chose local (Ollama). Here's exactly why, and what I learned the hard way.

The problem with cloud AI for a screen assistant

Clicky's core loop is:

  1. Take a screenshot of your screen
  2. Record your voice question
  3. Send both to an LLM
  4. Speak the answer back

Step 3 is where cloud AI becomes a problem. You're sending a screenshot of your screen to a remote server every single time someone presses the hotkey.

Think about what's on your screen:

  • Passwords in password managers
  • Emails with sensitive info
  • Code with API keys
  • Personal documents

I wasn't comfortable sending that to OpenAI servers (or any server) by default. And I couldn't expect users to be comfortable with it either.

What Ollama actually gives you

Ollama lets you run LLMs locally. You pull a model, it runs on your GPU or CPU, and responses never leave your machine.

For Clicky, this means:

  • Zero privacy risk — screenshots stay on your PC
  • Zero cost per query — no API bills
  • Zero internet required — works offline
  • Zero rate limits — query as fast as your hardware allows

The tradeoff: you need a decent GPU and roughly 4–8GB of VRAM for a usable model. But most modern gaming PCs qualify.
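Going local-first also means the app has to cope with Ollama not running at all. A minimal sketch of a health check against Ollama's real GET /api/tags endpoint on the default port (the function name is mine, not Clicky's actual code):

```python
import urllib.request


def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused / timeout: no server running
        return False
```

Clicky can run this once at startup and show a "please install Ollama" screen instead of a cryptic connection error.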

Models I tested

| Model | VRAM needed | Quality | Speed |
|---|---|---|---|
| Llama 3.1 8B | 6GB | Excellent | Fast |
| Mistral 7B | 5GB | Great | Fast |
| Phi-3 mini | 3GB | Good | Very fast |
| Llama 3.2 3B | 2GB | OK | Blazing |

I use Llama 3.1 8B as the default. It handles screen content, code, and general questions well.

The Ollama HTTP API is dead simple

import httpx
import base64

async def ask_ollama(screenshot_path: str, question: str) -> str:
    # Encode the screenshot as base64 so it can travel inside the JSON payload
    with open(screenshot_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": "llama3.1",
        "messages": [
            {
                "role": "user",
                "content": question,
                "images": [img_b64]
            }
        ],
        "stream": False
    }

    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json=payload
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError
        return response.json()["message"]["content"]

That's it. No SDK, no auth, no API key. Just a POST to localhost.

When I kept cloud AI as a fallback

Despite going local-first, I built a CompanionManager class that routes to different providers:

class CompanionManager:
    def get_provider(self):
        if self.settings.provider == "ollama":
            return OllamaProvider()
        elif self.settings.provider == "openai":
            return OpenAIProvider(api_key=self.settings.openai_key)
        elif self.settings.provider == "claude":
            return ClaudeProvider(api_key=self.settings.claude_key)
        else:
            return OllamaProvider()  # default

Power users who want GPT-4V or Claude's vision can switch in settings. But the default is Ollama, with zero configuration.

What I learned

1. Ollama model management is the real UX problem.
Users don't want to open a terminal and run ollama pull llama3.1. I had to build a model-picker UI that shows which models are installed and lets you pull new ones from inside Clicky.
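A model-picker like this can lean on the same GET /api/tags endpoint, which returns the locally pulled models as JSON. A sketch using only the standard library (the function names are illustrative, not Clicky's actual code):

```python
import json
import urllib.request


def model_names(tags_json: dict) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    return [m["name"] for m in tags_json.get("models", [])]


def list_installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

The UI then only has to diff this list against the models it wants to offer, and call Ollama's pull API for anything missing.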

2. First response latency matters more than throughput.
For a voice assistant, you want the first token fast. Streaming responses (even over localhost) made Clicky feel much snappier.
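Streaming from /api/chat arrives as newline-delimited JSON: one object per line, each carrying a message fragment, with "done": true on the last. A sketch of that loop using the standard library instead of httpx (the helper names are mine, not Clicky's):

```python
import json
import urllib.request


def parse_chunk(line: bytes) -> tuple[str, bool]:
    """Parse one NDJSON line from /api/chat into (token, done)."""
    obj = json.loads(line)
    return obj.get("message", {}).get("content", ""), obj.get("done", False)


def stream_answer(question: str, model: str = "llama3.1",
                  base_url: str = "http://localhost:11434"):
    """Yield tokens as Ollama produces them, instead of waiting for the full reply."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            token, done = parse_chunk(line)
            if token:
                yield token  # hand off to TTS as soon as the token exists
            if done:
                break
```

Feeding tokens to text-to-speech as they arrive is what makes the first spoken word land in well under a second, even if the full answer takes longer.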

3. GPU vs CPU is night and day.
On CPU only, Llama 3.1 8B takes 15–30 seconds. On a GTX 1080, it's 2–4 seconds. I added a GPU detection check and recommend CPU-only users use Phi-3 mini instead.
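One cheap way to approximate that GPU check is to probe for nvidia-smi on the PATH. This is a rough heuristic that only covers NVIDIA cards, and both function names are illustrative rather than Clicky's real code:

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Rough check: is nvidia-smi on PATH, and does it exit cleanly?"""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=5)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def recommended_model() -> str:
    # Heuristic from the timings above: big model on GPU, tiny model on CPU
    return "llama3.1" if has_nvidia_gpu() else "phi3:mini"
```

A fuller version would also query VRAM (nvidia-smi can report it) before recommending the 8B model, but even this crude check avoids the 30-second CPU surprise.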

4. Context window size limits what you can do with screenshots.
A 1080p screenshot as base64 JPEG at q=70 is ~200–400KB. Some models handle this fine; others choke. I compress to 768px max width before encoding.
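The downscale-plus-JPEG step might look like this with Pillow (assumed installed; the function names are mine, not Clicky's):

```python
import base64
import io


def scaled_size(width: int, height: int, max_width: int = 768) -> tuple[int, int]:
    """Downscale dimensions so width <= max_width, preserving aspect ratio."""
    if width <= max_width:
        return width, height
    return max_width, round(height * max_width / width)


def screenshot_to_b64(path: str, max_width: int = 768, quality: int = 70) -> str:
    """Resize and JPEG-compress a screenshot before base64-encoding it."""
    from PIL import Image  # local import: Pillow is an extra dependency here
    img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
    img = img.resize(scaled_size(*img.size, max_width))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode()
```

At 768px wide, a 1920x1080 capture becomes 768x432, which keeps the base64 payload small enough for every model I tested.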

The verdict

For a privacy-sensitive app like a screen assistant, local AI is the right default. Cloud AI makes sense as an opt-in for users who want more power and don't mind the tradeoff.

Ollama made this possible without asking users to do anything complicated. Install Clicky, install Ollama, pull a model — done.


Download Clicky free (Windows): https://clicky.foo

Article 1 (what it does): https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np

Article 2 (full technical breakdown): https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354

What model are you running locally? Let me know in the comments.
