Shashank Kumar Singh

Why I chose Ollama over cloud AI for my screen-reading assistant (and what I learned)

When I started building Clicky — a Windows AI assistant that reads your screen and answers out loud — I had to make a fundamental choice: cloud AI or local AI?

I chose local (Ollama). Here's exactly why, and what I learned the hard way.

The problem with cloud AI for a screen assistant

Clicky's core loop is:

  1. Take a screenshot of your screen
  2. Record your voice question
  3. Send both to an LLM
  4. Speak the answer back

Step 3 is where cloud AI becomes a problem. You're sending a screenshot of your screen to a remote server every single time someone presses the hotkey.

Think about what's on your screen:

  • Passwords in password managers
  • Emails with sensitive info
  • Code with API keys
  • Personal documents

I wasn't comfortable sending that to OpenAI servers (or any server) by default. And I couldn't expect users to be comfortable with it either.

What Ollama actually gives you

Ollama lets you run LLMs locally. You pull a model, it runs on your GPU or CPU, and responses never leave your machine.

For Clicky, this means:

  • Zero privacy risk — screenshots stay on your PC
  • Zero cost per query — no API bills
  • Zero internet required — works offline
  • Zero rate limits — query as fast as your hardware allows

The tradeoff: you need a decent GPU and roughly 4–8GB of VRAM for a usable model. But most modern gaming PCs qualify.
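Going local-first also means the app has to cope with Ollama not running at all. A minimal sketch of a health check against Ollama's real GET /api/tags endpoint on the default port (the function name is mine, not Clicky's actual code):

```python
import urllib.request


def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused / timeout: no server running
        return False
```

Clicky can run this once at startup and show a "please install Ollama" screen instead of a cryptic connection error.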

Models I tested

| Model | VRAM needed | Quality | Speed |
|---|---|---|---|
| Llama 3.1 8B | 6GB | Excellent | Fast |
| Mistral 7B | 5GB | Great | Fast |
| Phi-3 mini | 3GB | Good | Very fast |
| Llama 3.2 3B | 2GB | OK | Blazing |

I use Llama 3.1 8B as the default. It handles screen content, code, and general questions well.

The Ollama HTTP API is dead simple

import httpx
import base64

async def ask_ollama(screenshot_path: str, question: str) -> str:
    # Encode the screenshot as base64 so it can travel inside the JSON payload
    with open(screenshot_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": "llama3.1",
        "messages": [
            {
                "role": "user",
                "content": question,
                "images": [img_b64]
            }
        ],
        "stream": False
    }

    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json=payload
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError
        return response.json()["message"]["content"]

That's it. No SDK, no auth, no API key. Just a POST to localhost.

When I kept cloud AI as a fallback

Despite going local-first, I built a CompanionManager class that routes to different providers:

class CompanionManager:
    def get_provider(self):
        if self.settings.provider == "ollama":
            return OllamaProvider()
        elif self.settings.provider == "openai":
            return OpenAIProvider(api_key=self.settings.openai_key)
        elif self.settings.provider == "claude":
            return ClaudeProvider(api_key=self.settings.claude_key)
        else:
            return OllamaProvider()  # default

Power users who want GPT-4V or Claude's vision can switch in settings. But the default is Ollama, with zero configuration.

What I learned

1. Ollama model management is the real UX problem.
Users don't want to open a terminal and run ollama pull llama3.1. I had to build a model-picker UI that shows which models are installed and lets you pull new ones from inside Clicky.
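A model-picker like this can lean on the same GET /api/tags endpoint, which returns the locally pulled models as JSON. A sketch using only the standard library (the function names are illustrative, not Clicky's actual code):

```python
import json
import urllib.request


def model_names(tags_json: dict) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    return [m["name"] for m in tags_json.get("models", [])]


def list_installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

The UI then only has to diff this list against the models it wants to offer, and call Ollama's pull API for anything missing.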

2. First response latency matters more than throughput.
For a voice assistant, you want the first token fast. Streaming responses (even over localhost) made Clicky feel much snappier.
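Streaming from /api/chat arrives as newline-delimited JSON: one object per line, each carrying a message fragment, with "done": true on the last. A sketch of that loop using the standard library instead of httpx (the helper names are mine, not Clicky's):

```python
import json
import urllib.request


def parse_chunk(line: bytes) -> tuple[str, bool]:
    """Parse one NDJSON line from /api/chat into (token, done)."""
    obj = json.loads(line)
    return obj.get("message", {}).get("content", ""), obj.get("done", False)


def stream_answer(question: str, model: str = "llama3.1",
                  base_url: str = "http://localhost:11434"):
    """Yield tokens as Ollama produces them, instead of waiting for the full reply."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            token, done = parse_chunk(line)
            if token:
                yield token  # hand off to TTS as soon as the token exists
            if done:
                break
```

Feeding tokens to text-to-speech as they arrive is what makes the first spoken word land in well under a second, even if the full answer takes longer.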

3. GPU vs CPU is night and day.
On CPU only, Llama 3.1 8B takes 15–30 seconds. On a GTX 1080, it's 2–4 seconds. I added a GPU detection check and recommend CPU-only users use Phi-3 mini instead.
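One cheap way to approximate that GPU check is to probe for nvidia-smi on the PATH. This is a rough heuristic that only covers NVIDIA cards, and both function names are illustrative rather than Clicky's real code:

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Rough check: is nvidia-smi on PATH, and does it exit cleanly?"""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=5)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def recommended_model() -> str:
    # Heuristic from the timings above: big model on GPU, tiny model on CPU
    return "llama3.1" if has_nvidia_gpu() else "phi3:mini"
```

A fuller version would also query VRAM (nvidia-smi can report it) before recommending the 8B model, but even this crude check avoids the 30-second CPU surprise.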

4. Context window size limits what you can do with screenshots.
A 1080p screenshot as base64 JPEG at q=70 is ~200–400KB. Some models handle this fine; others choke. I compress to 768px max width before encoding.
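The downscale-plus-JPEG step might look like this with Pillow (assumed installed; the function names are mine, not Clicky's):

```python
import base64
import io


def scaled_size(width: int, height: int, max_width: int = 768) -> tuple[int, int]:
    """Downscale dimensions so width <= max_width, preserving aspect ratio."""
    if width <= max_width:
        return width, height
    return max_width, round(height * max_width / width)


def screenshot_to_b64(path: str, max_width: int = 768, quality: int = 70) -> str:
    """Resize and JPEG-compress a screenshot before base64-encoding it."""
    from PIL import Image  # local import: Pillow is an extra dependency here
    img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
    img = img.resize(scaled_size(*img.size, max_width))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode()
```

At 768px wide, a 1920x1080 capture becomes 768x432, which keeps the base64 payload small enough for every model I tested.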

The verdict

For a privacy-sensitive app like a screen assistant, local AI is the right default. Cloud AI makes sense as an opt-in for users who want more power and don't mind the tradeoff.

Ollama made this possible without asking users to do anything complicated. Install Clicky, install Ollama, pull a model — done.


Download Clicky free (Windows): https://clicky.foo

Article 1 (what it does): https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np

Article 2 (full technical breakdown): https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354

What model are you running locally? Let me know in the comments.
