Shashank Kumar Singh

How I built a screen-aware AI assistant in Python – full stack breakdown (PyQt6 + Whisper + Ollama)

Three months ago I started building Clicky — a Windows AI assistant that reads your screen and answers out loud. Here's the full technical breakdown of every piece.

TL;DR: PyQt6 system tray → Ctrl+Alt+Space hotkey → screenshot + Whisper STT → Ollama/OpenAI/Claude → edge-tts speaks answer back. Open source, free, no API key needed.

Architecture overview

User presses Ctrl+Alt+Space
        ↓
GlobalHotkey listener (pynput)
        ↓
Screenshot all monitors (mss)
        ↓
Whisper.cpp transcribes audio
        ↓
CompanionManager routes to AI provider
        ↓
Ollama (local) / OpenAI / Claude / Copilot
        ↓
edge-tts speaks answer + arrow overlay on screen
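
Before digging into each piece, here's roughly how they chain together in code. This is a simplified sketch, not Clicky's literal flow (the real app wires these steps through Qt signals), and record_question is a stand-in name for the mic-capture step covered in section 3:

async def handle_hotkey(manager: CompanionManager):
    audio_path = record_question()              # record the spoken question (stand-in helper)
    question = transcribe(audio_path)           # Whisper.cpp STT (section 3)
    screenshots = list(capture_all_screens())   # grab every monitor (section 2)
    answer = await manager.ask(question, screenshots)  # provider routing (sections 4-5)
    await speak(answer)                         # edge-tts playback (section 6)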

1. System tray + hotkey (PyQt6 + pynput)

The app lives in the system tray — no window, zero friction.

from pynput import keyboard
from PyQt6.QtCore import QMetaObject, Qt

def on_activate():
    # Hop from pynput's listener thread onto the Qt main thread
    QMetaObject.invokeMethod(
        companion, "start_listening", Qt.ConnectionType.QueuedConnection
    )

hotkey = keyboard.GlobalHotKeys({'<ctrl>+<alt>+<space>': on_activate})
hotkey.start()

The key trick: QMetaObject.invokeMethod with Qt.ConnectionType.QueuedConnection — this crosses the thread boundary safely from pynput's background listener thread into Qt's main thread, where all UI work has to happen.
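
The tray itself is ordinary PyQt6. A minimal sketch (the icon path and menu label are placeholders, not Clicky's actual assets):

from PyQt6.QtGui import QAction, QIcon
from PyQt6.QtWidgets import QApplication, QMenu, QSystemTrayIcon

app = QApplication([])
app.setQuitOnLastWindowClosed(False)  # keep running with no visible window

tray = QSystemTrayIcon(QIcon("clicky.ico"))  # placeholder icon path
menu = QMenu()
quit_action = QAction("Quit")
quit_action.triggered.connect(app.quit)
menu.addAction(quit_action)
tray.setContextMenu(menu)
tray.show()

app.exec()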

2. Screen capture (mss)

import base64
import io

import mss
from PIL import Image

def capture_all_screens():
    with mss.mss() as sct:
        for monitor in sct.monitors[1:]:  # skip monitor[0] (all monitors combined)
            shot = sct.grab(monitor)
            img = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
            # Encode as base64 JPEG for the vision API
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=75)
            yield base64.b64encode(buffer.getvalue()).decode()

Quality 75 JPEG keeps the payload under API limits while preserving readability.
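
A quick way to sanity-check that on your own setup is to measure the base64 strings that actually get sent:

shots = list(capture_all_screens())
payload_kb = sum(len(s) for s in shots) / 1024  # size of the base64 payload per request
print(f"{len(shots)} screenshot(s), ~{payload_kb:.0f} KB of image data")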

3. Speech-to-text (Whisper.cpp)

I use the whispercpp Python bindings — everything runs on the CPU, no GPU needed.

from whispercpp import Whisper

w = Whisper.from_pretrained("base.en")

def transcribe(audio_path: str) -> str:
    result = w.transcribe(audio_path)
    return w.extract_text(result)[0].strip()

The base.en model is 142MB and transcribes ~10s of audio in ~2s on a mid-range CPU. Fast enough to feel instant.
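
The recording step isn't shown above; here's one minimal way to produce the audio file that transcribe() consumes, assuming the sounddevice and soundfile packages (a sketch, not Clicky's actual capture code):

import sounddevice as sd
import soundfile as sf

def record_question(seconds: int = 8, rate: int = 16000) -> str:
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    path = "question.wav"
    sf.write(path, audio, rate)  # 16 kHz mono WAV, which Whisper handles natively
    return path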

4. AI provider routing

This was the trickiest part — supporting 4 providers with one interface:

class CompanionManager:
    def get_provider(self):
        match self.config["provider"]:
            case "ollama":   return OllamaProvider()
            case "openai":   return OpenAIProvider()
            case "claude":   return ClaudeProvider()
            case "copilot":  return GitHubCopilotProvider()

    async def ask(self, question: str, screenshots: list[str]) -> str:
        provider = self.get_provider()
        # Identity questions skip screenshots (avoids vision API refusals)
        if is_identity_question(question):
            screenshots = []
        return await provider.complete(question, screenshots, self.system_prompt)
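
The "one interface" is just a small async contract that every provider implements, something like this (a sketch of the shape, not the exact class):

from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Shared contract for the Ollama, OpenAI, Claude and Copilot backends."""

    @abstractmethod
    async def complete(self, question: str, screenshots: list[str], system_prompt: str) -> str:
        """Return the assistant's answer for one question plus optional screenshots."""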

The is_identity_question() filter was a fun challenge — vision APIs refuse to identify people in images. So I detect "who is X" patterns with regex and strip the screenshots before sending.
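
The real pattern list lives in the repo; the core idea is just a few regexes over the lowercased question, roughly like this:

import re

IDENTITY_PATTERNS = [
    r"\bwho\s+is\s+(this|that|he|she|they)\b",
    r"\bwhose\s+(face|photo|picture)\b",
    r"\bidentify\s+(this|that)\s+person\b",
]

def is_identity_question(question: str) -> bool:
    q = question.lower()
    return any(re.search(pattern, q) for pattern in IDENTITY_PATTERNS)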

5. Local AI with Ollama

import httpx

async def complete(self, question, images, system_prompt):
    payload = {
        "model": "qwen2.5vl:3b",  # vision model
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question, "images": images}
        ],
        "stream": False
    }
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post("http://localhost:11434/api/chat", json=payload)
        return r.json()["message"]["content"]
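
A quick way to smoke-test the local path, assuming qwen2.5vl:3b has been pulled (ollama pull qwen2.5vl:3b) and the Ollama service is listening on localhost:11434. SYSTEM_PROMPT here is a placeholder for whatever system prompt you use:

import asyncio

provider = OllamaProvider()
answer = asyncio.run(
    provider.complete("Summarize what's on my screen.", list(capture_all_screens()), SYSTEM_PROMPT)
)
print(answer)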

6. Text-to-speech (edge-tts)

Microsoft's neural TTS — free, no API key, sounds great:

import edge_tts, asyncio
import os, tempfile, pygame

async def speak(text: str):
    # Write the MP3 to the OS temp directory
    out_path = os.path.join(tempfile.gettempdir(), "response.mp3")
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(out_path)
    # Play the synthesized audio with pygame
    pygame.mixer.init()
    pygame.mixer.music.load(out_path)
    pygame.mixer.music.play()
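
speak() is a coroutine, so a one-off call is driven with asyncio.run (inside the app it runs on the existing event loop). The voice is easy to swap; the edge-tts package also ships a CLI whose --list-voices flag prints the full catalogue.

asyncio.run(speak("Here's what I found on your screen."))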

7. Packaging (PyInstaller + Inno Setup)

:: build.bat
pyinstaller clicky.spec --noconfirm
:: Then Inno Setup builds Setup-Clicky.exe
iscc installer.iss

The .spec file needs explicit hidden imports for everything dynamic:

hiddenimports=["ai.ollama_bootstrap", "ui.setup_wizard", "whispercpp", ...]
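
For context, that list sits in the spec's Analysis() call (the .spec file is itself Python). A trimmed sketch of the shape; the entry-script name here is illustrative:

a = Analysis(
    ["main.py"],  # illustrative entry script
    hiddenimports=["ai.ollama_bootstrap", "ui.setup_wizard", "whispercpp"],
)
pyz = PYZ(a.pure)
exe = EXE(pyz, a.scripts, a.binaries, a.datas, name="Clicky", console=False)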

Lessons learned

  1. Thread safety with PyQt6 is non-negotiable — never call Qt UI methods from background threads. Use QMetaObject.invokeMethod or signals.
  2. Whisper base.en is the sweet spot — tiny is too inaccurate, small is too slow on CPU.
  3. Vision APIs hate face identification — build the filter early, not after your first support ticket.
  4. PyInstaller + Ollama = packaging nightmare — Ollama runs as a separate process, not bundled. The setup wizard that auto-installs it saved countless support issues (a sketch of the availability check it starts from follows this list).
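
That check can be as simple as probing the local Ollama API before deciding whether to launch the installer. A minimal sketch (not Clicky's exact bootstrap code):

import httpx

def ollama_running() -> bool:
    # Ollama's HTTP API answers on port 11434 when the service is up
    try:
        return httpx.get("http://localhost:11434/api/version", timeout=2).status_code == 200
    except httpx.HTTPError:
        return False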

What's next

  • macOS port (the hotkey system is the main blocker)
  • Plugin system for custom AI skills
  • Sub-200ms response time optimization

The full source is on GitHub and it's MIT licensed. If you build something on top of it, let me know.

📥 Download: https://clicky.foo
⭐ GitHub: https://github.com/Bitshank-2338/clicky-windows
