Shashank Kumar Singh

How I built a screen-aware AI assistant in Python – full stack breakdown (PyQt6 + Whisper + Ollama)

Three months ago I started building Clicky — a Windows AI assistant that reads your screen and answers out loud. Here's the full technical breakdown of every piece.

TL;DR: PyQt6 system tray → Ctrl+Alt+Space hotkey → screenshot + Whisper STT → Ollama/OpenAI/Claude → edge-tts speaks answer back. Open source, free, no API key needed.

Architecture overview

User presses Ctrl+Alt+Space
        ↓
GlobalHotkey listener (pynput)
        ↓
Screenshot all monitors (mss)
        ↓
Whisper.cpp transcribes audio
        ↓
CompanionManager routes to AI provider
        ↓
Ollama (local) / OpenAI / Claude / Copilot
        ↓
edge-tts speaks answer + arrow overlay on screen
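
Before digging into each piece, here's roughly how they chain together in code. This is a simplified sketch, not Clicky's literal flow (the real app wires these steps through Qt signals), and record_question is a stand-in name for the mic-capture step covered in section 3:

async def handle_hotkey(manager: CompanionManager):
    audio_path = record_question()              # record the spoken question (stand-in helper)
    question = transcribe(audio_path)           # Whisper.cpp STT (section 3)
    screenshots = list(capture_all_screens())   # grab every monitor (section 2)
    answer = await manager.ask(question, screenshots)  # provider routing (sections 4-5)
    await speak(answer)                         # edge-tts playback (section 6)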

1. System tray + hotkey (PyQt6 + pynput)

The app lives in the system tray — no window, zero friction.

from pynput import keyboard
from PyQt6.QtCore import QMetaObject, Qt

def on_activate():
    # Hop from pynput's listener thread onto the Qt main thread
    QMetaObject.invokeMethod(
        companion, "start_listening", Qt.ConnectionType.QueuedConnection
    )

hotkey = keyboard.GlobalHotKeys({'<ctrl>+<alt>+<space>': on_activate})
hotkey.start()

The key trick: QMetaObject.invokeMethod with Qt.ConnectionType.QueuedConnection — this crosses the thread boundary safely from pynput's background listener thread into Qt's main thread, where all UI work has to happen.
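
The tray itself is ordinary PyQt6. A minimal sketch (the icon path and menu label are placeholders, not Clicky's actual assets):

from PyQt6.QtGui import QAction, QIcon
from PyQt6.QtWidgets import QApplication, QMenu, QSystemTrayIcon

app = QApplication([])
app.setQuitOnLastWindowClosed(False)  # keep running with no visible window

tray = QSystemTrayIcon(QIcon("clicky.ico"))  # placeholder icon path
menu = QMenu()
quit_action = QAction("Quit")
quit_action.triggered.connect(app.quit)
menu.addAction(quit_action)
tray.setContextMenu(menu)
tray.show()

app.exec()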

2. Screen capture (mss)

import base64
import io

import mss
from PIL import Image

def capture_all_screens():
    with mss.mss() as sct:
        for monitor in sct.monitors[1:]:  # skip monitor[0] (all monitors combined)
            shot = sct.grab(monitor)
            img = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
            # Encode as base64 JPEG for the vision API
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=75)
            yield base64.b64encode(buffer.getvalue()).decode()

Quality 75 JPEG keeps the payload under API limits while preserving readability.
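
A quick way to sanity-check that on your own setup is to measure the base64 strings that actually get sent:

shots = list(capture_all_screens())
payload_kb = sum(len(s) for s in shots) / 1024  # size of the base64 payload per request
print(f"{len(shots)} screenshot(s), ~{payload_kb:.0f} KB of image data")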

3. Speech-to-text (Whisper.cpp)

I use the whispercpp Python bindings — everything runs on the CPU, no GPU needed.

from whispercpp import Whisper

w = Whisper.from_pretrained("base.en")

def transcribe(audio_path: str) -> str:
    result = w.transcribe(audio_path)
    return w.extract_text(result)[0].strip()

The base.en model is 142MB and transcribes ~10s of audio in ~2s on a mid-range CPU. Fast enough to feel instant.
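
The recording step isn't shown above; here's one minimal way to produce the audio file that transcribe() consumes, assuming the sounddevice and soundfile packages (a sketch, not Clicky's actual capture code):

import sounddevice as sd
import soundfile as sf

def record_question(seconds: int = 8, rate: int = 16000) -> str:
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    path = "question.wav"
    sf.write(path, audio, rate)  # 16 kHz mono WAV, which Whisper handles natively
    return path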

4. AI provider routing

This was the trickiest part — supporting 4 providers with one interface:

class CompanionManager:
    def get_provider(self):
        match self.config["provider"]:
            case "ollama":   return OllamaProvider()
            case "openai":   return OpenAIProvider()
            case "claude":   return ClaudeProvider()
            case "copilot":  return GitHubCopilotProvider()

    async def ask(self, question: str, screenshots: list[str]) -> str:
        provider = self.get_provider()
        # Identity questions skip screenshots (avoids vision API refusals)
        if is_identity_question(question):
            screenshots = []
        return await provider.complete(question, screenshots, self.system_prompt)
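
The "one interface" is just a small async contract that every provider implements, something like this (a sketch of the shape, not the exact class):

from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Shared contract for the Ollama, OpenAI, Claude and Copilot backends."""

    @abstractmethod
    async def complete(self, question: str, screenshots: list[str], system_prompt: str) -> str:
        """Return the assistant's answer for one question plus optional screenshots."""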

The is_identity_question() filter was a fun challenge — vision APIs refuse to identify people in images. So I detect "who is X" patterns with regex and strip the screenshots before sending.
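
The real pattern list lives in the repo; the core idea is just a few regexes over the lowercased question, roughly like this:

import re

IDENTITY_PATTERNS = [
    r"\bwho\s+is\s+(this|that|he|she|they)\b",
    r"\bwhose\s+(face|photo|picture)\b",
    r"\bidentify\s+(this|that)\s+person\b",
]

def is_identity_question(question: str) -> bool:
    q = question.lower()
    return any(re.search(pattern, q) for pattern in IDENTITY_PATTERNS)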

5. Local AI with Ollama

import httpx

async def complete(self, question, images, system_prompt):
    payload = {
        "model": "qwen2.5vl:3b",  # vision model
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question, "images": images}
        ],
        "stream": False
    }
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post("http://localhost:11434/api/chat", json=payload)
        return r.json()["message"]["content"]
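
A quick way to smoke-test the local path, assuming qwen2.5vl:3b has been pulled (ollama pull qwen2.5vl:3b) and the Ollama service is listening on localhost:11434. SYSTEM_PROMPT here is a placeholder for whatever system prompt you use:

import asyncio

provider = OllamaProvider()
answer = asyncio.run(
    provider.complete("Summarize what's on my screen.", list(capture_all_screens()), SYSTEM_PROMPT)
)
print(answer)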

6. Text-to-speech (edge-tts)

Microsoft's neural TTS — free, no API key, sounds great:

import edge_tts, asyncio
import os, tempfile, pygame

async def speak(text: str):
    # Write the MP3 to the OS temp directory
    out_path = os.path.join(tempfile.gettempdir(), "response.mp3")
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(out_path)
    # Play the synthesized audio with pygame
    pygame.mixer.init()
    pygame.mixer.music.load(out_path)
    pygame.mixer.music.play()
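
speak() is a coroutine, so a one-off call is driven with asyncio.run (inside the app it runs on the existing event loop). The voice is easy to swap; the edge-tts package also ships a CLI whose --list-voices flag prints the full catalogue.

asyncio.run(speak("Here's what I found on your screen."))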

7. Packaging (PyInstaller + Inno Setup)

:: build.bat
pyinstaller clicky.spec --noconfirm
:: Then Inno Setup builds Setup-Clicky.exe
iscc installer.iss

The .spec file needs explicit hidden imports for everything dynamic:

hiddenimports=["ai.ollama_bootstrap", "ui.setup_wizard", "whispercpp", ...]
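
For context, that list sits in the spec's Analysis() call (the .spec file is itself Python). A trimmed sketch of the shape; the entry-script name here is illustrative:

a = Analysis(
    ["main.py"],  # illustrative entry script
    hiddenimports=["ai.ollama_bootstrap", "ui.setup_wizard", "whispercpp"],
)
pyz = PYZ(a.pure)
exe = EXE(pyz, a.scripts, a.binaries, a.datas, name="Clicky", console=False)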

Lessons learned

  1. Thread safety with PyQt6 is non-negotiable — never call Qt UI methods from background threads. Use QMetaObject.invokeMethod or signals.
  2. Whisper base.en is the sweet spot — tiny is too inaccurate, small is too slow on CPU.
  3. Vision APIs hate face identification — build the filter early, not after your first support ticket.
  4. PyInstaller + Ollama = packaging nightmare — Ollama runs as a separate process, not bundled. The setup wizard that auto-installs it saved countless support issues (a sketch of the availability check it starts from follows this list).
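
That check can be as simple as probing the local Ollama API before deciding whether to launch the installer. A minimal sketch (not Clicky's exact bootstrap code):

import httpx

def ollama_running() -> bool:
    # Ollama's HTTP API answers on port 11434 when the service is up
    try:
        return httpx.get("http://localhost:11434/api/version", timeout=2).status_code == 200
    except httpx.HTTPError:
        return False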

What's next

  • macOS port (the hotkey system is the main blocker)
  • Plugin system for custom AI skills
  • Sub-200ms response time optimization

The full source is on GitHub and it's MIT licensed. If you build something on top of it, let me know.

📥 Download: https://clicky.foo
⭐ GitHub: https://github.com/Bitshank-2338/clicky-windows
