<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shashank Kumar Singh</title>
    <description>The latest articles on DEV Community by Shashank Kumar Singh (@bitshank2338).</description>
    <link>https://dev.to/bitshank2338</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3915859%2Fc4cbcf69-dc76-44ef-8171-efa70efb8666.jpeg</url>
      <title>DEV Community: Shashank Kumar Singh</title>
      <link>https://dev.to/bitshank2338</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bitshank2338"/>
    <language>en</language>
    <item>
      <title>Why I chose Ollama over cloud AI for my screen-reading assistant (and what I learned)</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:38:43 +0000</pubDate>
      <link>https://dev.to/bitshank2338/why-i-chose-ollama-over-cloud-ai-for-my-screen-reading-assistant-and-what-i-learned-171m</link>
      <guid>https://dev.to/bitshank2338/why-i-chose-ollama-over-cloud-ai-for-my-screen-reading-assistant-and-what-i-learned-171m</guid>
      <description>&lt;p&gt;When I started building Clicky — a Windows AI assistant that reads your screen and answers out loud — I had to make a fundamental choice: cloud AI or local AI?&lt;/p&gt;

&lt;p&gt;I chose local (Ollama). Here's exactly why, and what I learned the hard way.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/WYY9yJHDaEU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;The problem with cloud AI for a screen assistant&lt;/h2&gt;

&lt;p&gt;Clicky's core loop is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a screenshot of your screen&lt;/li&gt;
&lt;li&gt;Record your voice question&lt;/li&gt;
&lt;li&gt;Send both to an LLM&lt;/li&gt;
&lt;li&gt;Speak the answer back&lt;/li&gt;
&lt;/ol&gt;
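
&lt;p&gt;In code, the loop is just those four calls chained together. Here's a stubbed sketch; every helper below is a placeholder for the real capture, speech-to-text, LLM, and TTS pieces covered later and in the companion posts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

# placeholder helpers — stand-ins for the real mss capture, Whisper STT,
# Ollama call, and edge-tts playback
def take_screenshot() -&amp;gt; str: return "screen.jpg"
def record_question() -&amp;gt; str: return "what does this error mean?"
async def ask_llm(image: str, question: str) -&amp;gt; str: return "stub answer"
async def speak(text: str) -&amp;gt; None: print(text)

async def core_loop():
    image = take_screenshot()                # 1. screenshot
    question = record_question()             # 2. voice question
    answer = await ask_llm(image, question)  # 3. send both to the LLM
    await speak(answer)                      # 4. speak the answer back

asyncio.run(core_loop())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;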

&lt;p&gt;Step 3 is where cloud AI becomes a problem. You're sending a &lt;strong&gt;screenshot of your screen&lt;/strong&gt; to a remote server every single time someone presses the hotkey.&lt;/p&gt;

&lt;p&gt;Think about what's on your screen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passwords in password managers&lt;/li&gt;
&lt;li&gt;Emails with sensitive info&lt;/li&gt;
&lt;li&gt;Code with API keys&lt;/li&gt;
&lt;li&gt;Personal documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wasn't comfortable sending that to OpenAI servers (or any server) by default. And I couldn't expect users to be comfortable with it either.&lt;/p&gt;

&lt;h2&gt;What Ollama actually gives you&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ollama.ai" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; lets you run LLMs locally: pull a model, run it on your own GPU/CPU, and responses never leave your machine.&lt;/p&gt;

&lt;p&gt;For Clicky, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero privacy risk&lt;/strong&gt; — screenshots stay on your PC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero cost per query&lt;/strong&gt; — no API bills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero internet required&lt;/strong&gt; — works offline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero rate limits&lt;/strong&gt; — query as fast as your hardware allows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: you need a decent GPU and ~4–8GB RAM for a usable model. But most modern gaming PCs qualify.&lt;/p&gt;

&lt;h2&gt;Models I tested&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;6GB&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B&lt;/td&gt;
&lt;td&gt;5GB&lt;/td&gt;
&lt;td&gt;Great&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3 mini&lt;/td&gt;
&lt;td&gt;3GB&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Blazing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I use Llama 3.1 8B as the default. It handles screen content, code, and general questions well.&lt;/p&gt;

&lt;h2&gt;The Ollama HTTP API is dead simple&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;img_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;img_b64&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No SDK, no auth, no API key. Just a POST to localhost.&lt;/p&gt;
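
&lt;p&gt;And calling it is one line. A usage sketch, assuming the &lt;code&gt;ask_ollama&lt;/code&gt; helper above, a running Ollama server, and a placeholder screenshot path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

# "screen.png" is a stand-in; Clicky passes its own capture here
answer = asyncio.run(ask_ollama("screen.png", "What does this error mean?"))
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;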

&lt;h2&gt;When I kept cloud AI as a fallback&lt;/h2&gt;

&lt;p&gt;Despite going local-first, I built a &lt;code&gt;CompanionManager&lt;/code&gt; class that routes to different providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompanionManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claude_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# default
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Power users who want GPT-4V or Claude's vision can switch in settings. But the default is Ollama, zero config.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ollama model management is the real UX problem.&lt;/strong&gt;&lt;br&gt;
Users don't want to open a terminal and run &lt;code&gt;ollama pull llama3.1&lt;/code&gt;. I had to build a model-picker UI that shows which models are installed and lets you pull new ones from inside Clicky.&lt;/p&gt;
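
&lt;p&gt;The API side of that UI is small. A sketch of listing installed models via Ollama's standard &lt;code&gt;/api/tags&lt;/code&gt; endpoint (pulling new ones goes through &lt;code&gt;/api/pull&lt;/code&gt; the same way):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

def installed_models() -&amp;gt; list[str]:
    # Ollama lists every pulled model at /api/tags
    r = httpx.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    return [m["name"] for m in r.json().get("models", [])]

print(installed_models())  # e.g. ['llama3.1:latest', 'phi3:mini']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;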

&lt;p&gt;&lt;strong&gt;2. First response latency matters more than throughput.&lt;/strong&gt;&lt;br&gt;
For a voice assistant, you want the first token fast. Streaming responses (even over localhost) made Clicky feel much snappier.&lt;/p&gt;
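
&lt;p&gt;Here's roughly what that looks like against Ollama's chat endpoint. With &lt;code&gt;"stream": true&lt;/code&gt; the server emits one JSON object per line, so tokens can be handed to the TTS as they arrive (a sketch reusing the payload shape from &lt;code&gt;ask_ollama&lt;/code&gt; above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import httpx

async def stream_ollama(payload: dict):
    # same payload as ask_ollama, but streamed line by line
    payload = {**payload, "stream": True}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:11434/api/chat",
                                 json=payload) as response:
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk.get("done"):
                    break
                yield chunk["message"]["content"]

# usage: async for token in stream_ollama(payload): feed token to the TTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;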

&lt;p&gt;&lt;strong&gt;3. GPU vs CPU is night and day.&lt;/strong&gt;&lt;br&gt;
On CPU only, Llama 3.1 8B takes 15–30 seconds. On a GTX 1080, it's 2–4 seconds. I added a GPU detection check and recommend CPU-only users use Phi-3 mini instead.&lt;/p&gt;
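
&lt;p&gt;The detection itself can be as simple as probing for &lt;code&gt;nvidia-smi&lt;/code&gt;. A sketch, not Clicky's exact check (AMD and Intel GPUs would need their own probes, and the model tags are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil
import subprocess

def has_nvidia_gpu() -&amp;gt; bool:
    # nvidia-smi on PATH and exiting cleanly is a decent proxy for a usable GPU
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

default_model = "llama3.1" if has_nvidia_gpu() else "phi3:mini"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;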

&lt;p&gt;&lt;strong&gt;4. Context window size limits what you can do with screenshots.&lt;/strong&gt;&lt;br&gt;
A 1080p screenshot as base64 JPEG at q=70 is ~200–400KB. Some models handle this fine; others choke. I compress to 768px max width before encoding.&lt;/p&gt;
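
&lt;p&gt;The downscaling step is plain Pillow, roughly like this (width and quality match the numbers above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64
import io

from PIL import Image

def encode_for_llm(path: str, max_width: int = 768) -&amp;gt; str:
    img = Image.open(path).convert("RGB")
    if img.width &amp;gt; max_width:
        # cap the width, keep the aspect ratio
        img = img.resize((max_width, round(img.height * max_width / img.width)),
                         Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=70)
    return base64.b64encode(buf.getvalue()).decode()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;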

&lt;h2&gt;The verdict&lt;/h2&gt;

&lt;p&gt;For a privacy-sensitive app like a screen assistant, &lt;strong&gt;local AI is the right default&lt;/strong&gt;. Cloud AI makes sense as an opt-in for users who want more power and don't mind the tradeoff.&lt;/p&gt;

&lt;p&gt;Ollama made this possible without asking users to do anything complicated. Install Clicky, install Ollama, pull a model — done.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Download Clicky free (Windows):&lt;/strong&gt; &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Article 1 (what it does):&lt;/strong&gt; &lt;a href="https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np"&gt;https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Article 2 (full technical breakdown):&lt;/strong&gt; &lt;a href="https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354"&gt;https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What model are you running locally? Let me know in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I built a screen-aware AI assistant in Python – full stack breakdown (PyQt6 + Whisper + Ollama)</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:35:22 +0000</pubDate>
      <link>https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354</link>
      <guid>https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354</guid>
      <description>&lt;p&gt;Three months ago I started building Clicky — a Windows AI assistant that reads your screen and answers out loud. Here's the full technical breakdown of every piece.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; PyQt6 system tray → Ctrl+Alt+Space hotkey → screenshot + Whisper STT → Ollama/OpenAI/Claude → edge-tts speaks answer back. Open source, free, no API key needed.&lt;/p&gt;

&lt;h2&gt;Architecture overview&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User presses Ctrl+Alt+Space
        ↓
GlobalHotkey listener (pynput)
        ↓
Screenshot all monitors (mss)
        ↓
Whisper.cpp transcribes audio
        ↓
CompanionManager routes to AI provider
        ↓
Ollama (local) / OpenAI / Claude / Copilot
        ↓
edge-tts speaks answer + arrow overlay on screen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;1. System tray + hotkey (PyQt6 + pynput)&lt;/h2&gt;

&lt;p&gt;The app lives in the system tray — no window, zero friction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pynput&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keyboard&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_activate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;QMetaObject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invokeMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;companion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_listening&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Qt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueuedConnection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;hotkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keyboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GlobalHotKeys&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;ctrl&amp;gt;+&amp;lt;alt&amp;gt;+&amp;lt;space&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;on_activate&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;hotkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key trick: &lt;code&gt;QMetaObject.invokeMethod&lt;/code&gt; with &lt;code&gt;Qt.ConnectionType.QueuedConnection&lt;/code&gt; — this crosses the thread boundary safely from pynput's background thread into Qt's main thread.&lt;/p&gt;
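
&lt;p&gt;The signal-based alternative works the same way; emitting a &lt;code&gt;pyqtSignal&lt;/code&gt; from the listener thread gets queued onto the main thread automatically. A minimal sketch (not Clicky's actual wiring, and it assumes a running &lt;code&gt;QApplication&lt;/code&gt; event loop):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PyQt6.QtCore import QObject, pyqtSignal

class HotkeyBridge(QObject):
    # cross-thread signal: emitted on pynput's thread, delivered on Qt's
    activated = pyqtSignal()

bridge = HotkeyBridge()
bridge.activated.connect(lambda: print("hotkey pressed"))  # runs on the Qt side

def on_activate():
    bridge.activated.emit()  # safe to call from the background thread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;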

&lt;h2&gt;2. Screen capture (mss)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;capture_all_screens&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;  &lt;span class="c1"&gt;# skip monitor[0] (all combined)
&lt;/span&gt;            &lt;span class="n"&gt;shot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bgra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BGRX&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# encode as JPEG base64 for vision API
&lt;/span&gt;            &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality 75 JPEG keeps the payload under API limits while preserving readability.&lt;/p&gt;

&lt;h2&gt;3. Speech-to-text (Whisper.cpp)&lt;/h2&gt;

&lt;p&gt;I use the &lt;code&gt;whisper-cpp&lt;/code&gt; Python bindings — runs on CPU, no GPU needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;whispercpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt;

&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base.en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;base.en&lt;/code&gt; model is 142MB and transcribes ~10s of audio in ~2s on a mid-range CPU. Fast enough to feel instant.&lt;/p&gt;

&lt;h2&gt;4. AI provider routing&lt;/h2&gt;

&lt;p&gt;This was the trickiest part — supporting 4 providers with one interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompanionManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copilot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GitHubCopilotProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Identity questions skip screenshots (avoids vision API refusals)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_identity_question&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;screenshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;is_identity_question()&lt;/code&gt; filter was a fun challenge — vision APIs refuse to identify people in images. So I detect "who is X" patterns with regex and strip the screenshots before sending.&lt;/p&gt;
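
&lt;p&gt;A simplified version of that filter (the production pattern list is longer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# toy pattern list; Clicky's real filter covers many more phrasings
IDENTITY_RE = re.compile(
    r"\b(who\s+is|who's|whose|identify|name\s+th(is|at)\s+person)\b",
    re.IGNORECASE,
)

def is_identity_question(question: str) -&amp;gt; bool:
    return bool(IDENTITY_RE.search(question))

assert is_identity_question("Who is this person on my screen?")
assert not is_identity_question("What does this stack trace mean?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;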

&lt;h2&gt;5. Local AI with Ollama&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5vl:3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# vision model
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;6. Text-to-speech (edge-tts)&lt;/h2&gt;

&lt;p&gt;Microsoft's neural TTS — free, no API key, sounds great:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;edge_tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;speak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;communicate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;edge_tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Communicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US-AriaNeural&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;communicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/response.mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# play with pygame
&lt;/span&gt;    &lt;span class="n"&gt;pygame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;music&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/response.mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pygame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;music&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;play&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;7. Packaging (PyInstaller + Inno Setup)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# build.bat&lt;/span&gt;
pyinstaller clicky.spec &lt;span class="nt"&gt;--noconfirm&lt;/span&gt;
&lt;span class="c"&gt;# Then Inno Setup builds Setup-Clicky.exe&lt;/span&gt;
iscc installer.iss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.spec&lt;/code&gt; file needs explicit hidden imports for everything dynamic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hiddenimports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai.ollama_bootstrap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ui.setup_wizard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whispercpp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Lessons learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread safety with PyQt6 is non-negotiable&lt;/strong&gt; — never call Qt UI methods from background threads. Use &lt;code&gt;QMetaObject.invokeMethod&lt;/code&gt; or signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper base.en is the sweet spot&lt;/strong&gt; — tiny is too inaccurate, small is too slow on CPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision APIs hate face identification&lt;/strong&gt; — build the filter early, not after your first support ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyInstaller + Ollama = packaging nightmare&lt;/strong&gt; — Ollama runs as a separate process, not bundled. The setup wizard that auto-installs it saved countless support issues.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;macOS port (the hotkey system is the main blocker)&lt;/li&gt;
&lt;li&gt;Plugin system for custom AI skills&lt;/li&gt;
&lt;li&gt;Sub-200ms response time optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The full source is on GitHub and it's MIT licensed. If you build something on top of it, let me know.&lt;/p&gt;

&lt;p&gt;📥 Download: &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;&lt;br&gt;
⭐ GitHub: &lt;a href="https://github.com/Bitshank-2338/clicky-windows" rel="noopener noreferrer"&gt;https://github.com/Bitshank-2338/clicky-windows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I built a free AI that reads my screen and answers out loud – no API key, runs offline with Ollama</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:15:16 +0000</pubDate>
      <link>https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np</link>
      <guid>https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np</guid>
      <description>&lt;p&gt;I kept hitting ChatGPT's $20/mo wall. So I built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clicky&lt;/strong&gt; is a free, open-source AI assistant for Windows that floats next to your cursor, reads your screen, and speaks the answer back out loud. No copy-paste. No alt-tab. No subscription.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/WYY9yJHDaEU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;The problem I was solving&lt;/h2&gt;

&lt;p&gt;Every time I hit an API limit or saw that $20/mo charge, I thought — I have a decent laptop, Ollama runs locally, why am I paying for this?&lt;/p&gt;

&lt;p&gt;But existing local AI setups require you to open a terminal, type prompts, copy-paste context... it's friction. I wanted something that &lt;em&gt;just worked&lt;/em&gt; — press a hotkey, ask a question, get an answer.&lt;/p&gt;

&lt;h2&gt;How Clicky works&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Press &lt;strong&gt;Ctrl+Alt+Space&lt;/strong&gt; anywhere on Windows&lt;/li&gt;
&lt;li&gt;It captures all your screens and starts listening&lt;/li&gt;
&lt;li&gt;You ask your question out loud&lt;/li&gt;
&lt;li&gt;It reads the screen context, queries the AI, and &lt;strong&gt;speaks the answer back&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the entire flow. No window switching. No typing.&lt;/p&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python + PyQt6&lt;/strong&gt; — system tray app, always running in background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper.cpp&lt;/strong&gt; — fast local speech-to-text (runs on CPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — local LLM inference (llama3.2:3b for text, qwen2.5vl:3b for vision)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;edge-tts&lt;/strong&gt; — Microsoft's neural TTS, free and offline-capable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyInstaller&lt;/strong&gt; — packages everything into a single Windows .exe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inno Setup&lt;/strong&gt; — builds the one-click installer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;AI provider flexibility&lt;/h2&gt;

&lt;p&gt;Clicky isn't locked to Ollama. It supports:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Requires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (local)&lt;/td&gt;
&lt;td&gt;Free forever&lt;/td&gt;
&lt;td&gt;~4GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;Pay per use&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude&lt;/td&gt;
&lt;td&gt;Pay per use&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;Free with Student Pack&lt;/td&gt;
&lt;td&gt;GitHub account&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;What it can do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Read whatever is on your screen (YouTube, code, emails, docs, anything)&lt;/li&gt;
&lt;li&gt;Search the web and answer instantly&lt;/li&gt;
&lt;li&gt;Point at exact UI elements with a glowing arrow overlay&lt;/li&gt;
&lt;li&gt;Click buttons on your behalf just by asking&lt;/li&gt;
&lt;li&gt;Answer questions about code you're looking at without copy-pasting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Setup in v1.1.1&lt;/h2&gt;

&lt;p&gt;The latest version ships with a &lt;strong&gt;first-run setup wizard&lt;/strong&gt; — it detects if Ollama is missing, downloads and installs it automatically, then pulls the right models. One click from installer to working AI.&lt;/p&gt;
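
&lt;p&gt;Under the hood, the wizard's two checks are straightforward. A sketch against Ollama's standard local API (the actual wizard code differs, and it also handles downloading the Ollama installer itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

def ollama_running() -&amp;gt; bool:
    # a running Ollama daemon answers on port 11434
    try:
        return httpx.get("http://localhost:11434/api/version",
                         timeout=2).status_code == 200
    except httpx.HTTPError:
        return False

def pull_model(name: str = "llama3.2:3b") -&amp;gt; None:
    # /api/pull streams JSON progress events line by line
    with httpx.stream("POST", "http://localhost:11434/api/pull",
                      json={"name": name}, timeout=None) as r:
        for line in r.iter_lines():
            print(line)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;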

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📥 &lt;strong&gt;Download free&lt;/strong&gt;: &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub (open source)&lt;/strong&gt;: &lt;a href="https://github.com/Bitshank-2338/clicky-windows" rel="noopener noreferrer"&gt;https://github.com/Bitshank-2338/clicky-windows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐦 &lt;strong&gt;Twitter&lt;/strong&gt;: &lt;a href="https://x.com/iamshashank_dev" rel="noopener noreferrer"&gt;https://x.com/iamshashank_dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions about the Ollama integration, the PyQt6 architecture, or the PyInstaller packaging pipeline. AMA.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
