Three months ago I started building Clicky — a Windows AI assistant that reads your screen and answers out loud. Here's the full technical breakdown of every piece.
TL;DR: PyQt6 system tray → Ctrl+Alt+Space hotkey → screenshot + Whisper STT → Ollama/OpenAI/Claude → edge-tts speaks answer back. Open source, free, no API key needed.
Architecture overview
User presses Ctrl+Alt+Space
↓
GlobalHotkey listener (pynput)
↓
Screenshot all monitors (mss)
↓
Record mic audio; Whisper.cpp transcribes it
↓
CompanionManager routes to AI provider
↓
Ollama (local) / OpenAI / Claude / Copilot
↓
edge-tts speaks answer + arrow overlay on screen
1. System tray + hotkey (PyQt6 + pynput)
The app lives in the system tray — no window, zero friction.
from pynput import keyboard
from PyQt6.QtCore import QMetaObject, Qt

def on_activate():
    # Hop from pynput's listener thread onto the Qt main thread
    QMetaObject.invokeMethod(companion, "start_listening",
                             Qt.ConnectionType.QueuedConnection)

hotkey = keyboard.GlobalHotKeys({'<ctrl>+<alt>+<space>': on_activate})
hotkey.start()
The key trick: QMetaObject.invokeMethod with Qt.ConnectionType.QueuedConnection. It safely crosses the thread boundary from pynput's background listener thread into Qt's main thread (PyQt6 requires the fully qualified enum; the PyQt5-style Qt.QueuedConnection raises an AttributeError).
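If you prefer signals, the same thread hop works with a pyqtSignal; a minimal sketch, assuming a Companion QObject (names here are illustrative, not Clicky's actual class):

from PyQt6.QtCore import QObject, pyqtSignal, pyqtSlot

class Companion(QObject):
    hotkey_pressed = pyqtSignal()  # safe to emit from any thread

    def __init__(self):
        super().__init__()
        # Cross-thread emissions are automatically queued onto this object's thread
        self.hotkey_pressed.connect(self.start_listening)

    @pyqtSlot()
    def start_listening(self):
        ...  # runs on the Qt main thread

def on_activate():
    companion.hotkey_pressed.emit()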
2. Screen capture (mss)
import io
import base64
import mss
from PIL import Image

def capture_all_screens():
    with mss.mss() as sct:
        for monitor in sct.monitors[1:]:  # skip monitors[0], the combined virtual screen
            shot = sct.grab(monitor)
            img = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
            # Encode as JPEG base64 for the vision API
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=75)
            yield base64.b64encode(buffer.getvalue()).decode()
Quality 75 JPEG keeps the payload under API limits while preserving readability.
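If you want to verify that on your own monitors, a quick diagnostic (not part of Clicky itself) is to print the encoded size per screen:

for i, encoded in enumerate(capture_all_screens(), start=1):
    # base64 adds ~33% overhead; most vision APIs cap images at a few MB
    print(f"monitor {i}: {len(encoded) / 1024:.0f} KB encoded")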
3. Speech-to-text (Whisper.cpp)
I use the whisper-cpp Python bindings — everything runs on CPU, no GPU needed.
from whispercpp import Whisper

w = Whisper.from_pretrained("base.en")

def transcribe(audio_path: str) -> str:
    result = w.transcribe(audio_path)
    return w.extract_text(result)[0].strip()
The base.en model is 142MB and transcribes ~10s of audio in ~2s on a mid-range CPU. Fast enough to feel instant.
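The recording step that feeds transcribe() isn't shown above. Here's a hedged sketch of what it might look like, assuming sounddevice and soundfile for capture (the library choice and function name are my assumption, not necessarily Clicky's):

import os
import tempfile
import sounddevice as sd
import soundfile as sf

def record_question(seconds: float = 10.0, rate: int = 16000) -> str:
    # Whisper models expect 16 kHz mono input
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="float32")
    sd.wait()  # block until the recording buffer is full
    path = os.path.join(tempfile.gettempdir(), "question.wav")
    sf.write(path, audio, rate)
    return path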
4. AI provider routing
This was the trickiest part — supporting 4 providers with one interface:
class CompanionManager:
    def get_provider(self):
        match self.config["provider"]:
            case "ollama": return OllamaProvider()
            case "openai": return OpenAIProvider()
            case "claude": return ClaudeProvider()
            case "copilot": return GitHubCopilotProvider()

    async def ask(self, question: str, screenshots: list[str]) -> str:
        provider = self.get_provider()
        # Identity questions skip screenshots (avoids vision API refusals)
        if is_identity_question(question):
            screenshots = []
        return await provider.complete(question, screenshots, self.system_prompt)
The is_identity_question() filter was a fun challenge — vision APIs refuse to identify people in images. So I detect "who is X" patterns with regex and strip the screenshots before sending.
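The exact patterns aren't in the post, so this sketch uses illustrative regexes; the real list would be tuned against actual user queries:

import re

# Hypothetical patterns, not Clicky's actual list
_IDENTITY_PATTERNS = [
    r"\bwho\s+is\s+(this|that|she|he|they)\b",
    r"\bwho'?s\s+(this|that)\s+(person|guy|woman|man)\b",
    r"\bidentify\s+(this|that)\s+person\b",
    r"\bwhat'?s\s+(his|her|their)\s+name\b",
]

def is_identity_question(question: str) -> bool:
    q = question.lower()
    return any(re.search(p, q) for p in _IDENTITY_PATTERNS)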
5. Local AI with Ollama
import httpx

# Method on OllamaProvider
async def complete(self, question, images, system_prompt):
    payload = {
        "model": "qwen2.5vl:3b",  # vision-capable model
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question, "images": images},
        ],
        "stream": False,
    }
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post("http://localhost:11434/api/chat", json=payload)
        return r.json()["message"]["content"]
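Side note: the same endpoint can stream tokens as they're generated. Clicky uses the blocking call above, but a streaming variant would look roughly like this (a sketch, with stream_answer being a hypothetical name):

import json
import httpx

async def stream_answer(payload: dict):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:11434/api/chat",
                                 json={**payload, "stream": True}) as r:
            # Ollama emits one JSON object per line until "done" is true
            async for line in r.aiter_lines():
                if line:
                    yield json.loads(line)["message"]["content"]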
6. Text-to-speech (edge-tts)
Microsoft's neural TTS — free, no API key, sounds great:
import os, tempfile
import edge_tts, pygame

async def speak(text: str):
    # tempfile.gettempdir() instead of /tmp, which doesn't exist on Windows
    path = os.path.join(tempfile.gettempdir(), "response.mp3")
    await edge_tts.Communicate(text, voice="en-US-AriaNeural").save(path)
    # Play with pygame (the mixer must be initialized before loading)
    pygame.mixer.init()
    pygame.mixer.music.load(path)
    pygame.mixer.music.play()
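Wired together, the full hotkey-to-voice round trip looks roughly like this, assuming the helpers sketched in the earlier sections (Clicky's real orchestration adds error handling plus the arrow overlay):

async def handle_hotkey(manager: CompanionManager):
    shots = list(capture_all_screens())          # section 2
    question = transcribe(record_question())     # section 3
    answer = await manager.ask(question, shots)  # section 4
    await speak(answer)                          # section 6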
7. Packaging (PyInstaller + Inno Setup)
:: build.bat (batch files use :: or REM for comments)
pyinstaller clicky.spec --noconfirm
:: Then Inno Setup builds Setup-Clicky.exe
iscc installer.iss
The .spec file needs explicit hidden imports for everything dynamic:
hiddenimports=["ai.ollama_bootstrap", "ui.setup_wizard", "whispercpp", ...]
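For context, that list lives in the Analysis block of the .spec file. A trimmed sketch; the entry-script name is illustrative, and the hidden-import names are the ones shown above:

# clicky.spec (excerpt)
a = Analysis(
    ["main.py"],  # entry point; the actual script name may differ
    hiddenimports=[
        "ai.ollama_bootstrap",  # imported dynamically, so PyInstaller can't see it
        "ui.setup_wizard",
        "whispercpp",
    ],
)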
Lessons learned
- Thread safety with PyQt6 is non-negotiable: never call Qt UI methods from background threads. Use QMetaObject.invokeMethod or signals.
- Whisper base.en is the sweet spot: tiny is too inaccurate, small is too slow on CPU.
- Vision APIs hate face identification: build the filter early, not after your first support ticket.
- PyInstaller + Ollama = packaging nightmare: Ollama runs as a separate process, not bundled. The setup wizard that auto-installs it (sketched below) saved countless support issues.
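Here's a minimal sketch of the detection half of that wizard. The /api/tags endpoint is Ollama's real model-list route; everything else (the function name, the install flow in the comment) is illustrative:

import httpx

def ollama_running() -> bool:
    # Ollama serves its model list at this endpoint when the daemon is up
    try:
        return httpx.get("http://localhost:11434/api/tags", timeout=2).status_code == 200
    except httpx.HTTPError:
        return False

# if not ollama_running(): prompt the user, download the Ollama installer, run it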
What's next
- macOS port (the hotkey system is the main blocker)
- Plugin system for custom AI skills
- Sub-200ms response time optimization
The full source is on GitHub and it's MIT licensed. If you build something on top of it, let me know.
📥 Download: https://clicky.foo
⭐ GitHub: https://github.com/Bitshank-2338/clicky-windows