I built a local screen reader that reads your screen aloud — no cloud, no API keys

#python #ai #opensource #a11y

I got tired of switching between reading and listening, so I built sttts — a local pipeline that watches any region of your screen, OCRs it, and speaks it aloud in real time. Everything runs on your own machine.

Demo

What it does

🖱️ You draw a rectangle on any part of your screen
📸 It snapshots that region every N seconds
🔍 Pixel diff check — skips frames where nothing changed
🧠 LightOnOCR-2-1B reads the text (runs on AMD GPU via ROCm)
🗣️ Kokoro-82M speaks it through your speakers (runs on CPU)

🖥️ screen → 🔍 diff → 🧠 OCR → ✨ clean text → 🗣️ TTS → 🔊 speaker
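
The flow above can be sketched as one loop iteration. This is a minimal sketch, not the actual sttts code: `run_once`, `clean_text`, and the stage callables are hypothetical names, and every stage is passed in as a function so the real screen grab, OCR model, and TTS engine could be swapped in.

```python
import re
from typing import Callable, Optional


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace so the TTS stage gets one tidy line."""
    return re.sub(r"\s+", " ", raw).strip()


def run_once(
    grab: Callable[[], bytes],
    has_changed: Callable[[Optional[bytes], bytes], bool],
    ocr: Callable[[bytes], str],
    speak: Callable[[str], None],
    previous: Optional[bytes],
) -> bytes:
    """Run one screen -> diff -> OCR -> clean -> TTS pass and return the frame."""
    frame = grab()
    if has_changed(previous, frame):
        text = clean_text(ocr(frame))
        if text:  # an empty region gives the TTS stage nothing to say
            speak(text)
    return frame
```

Calling `run_once` every N seconds and feeding the returned frame back in as `previous` would give the behaviour described above: unchanged frames never reach the OCR or TTS stages.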

The killer feature — auto page-turn

You can draw a second rectangle over any button on screen. Once TTS has finished speaking and the screen has stayed idle, sttts clicks that button automatically. I use this with Kindle for PC: it reads the entire book hands-free, turning the pages on its own.

# Draw OCR region, then draw the next-page button
uv run python capture.py --next-btn -i 2
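
The page-turn decision can be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: the function names and the `idle_needed` parameter are hypothetical, and using `xdotool` for the click is a guess based on it being listed among the system dependencies.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height in screen pixels


def rect_center(rect: Rect) -> Tuple[int, int]:
    """Return the pixel at the middle of the drawn button rectangle."""
    x, y, w, h = rect
    return x + w // 2, y + h // 2


def should_turn_page(speaking: bool, idle_seconds: float, idle_needed: float) -> bool:
    """Click only after speech has ended and the screen has been idle long enough."""
    return (not speaking) and idle_seconds >= idle_needed


def click_command(rect: Rect) -> List[str]:
    """Build an xdotool command that moves the pointer to the button and left-clicks."""
    cx, cy = rect_center(rect)
    return ["xdotool", "mousemove", str(cx), str(cy), "click", "1"]
```

On an X11 session with `xdotool` installed, passing the result to `subprocess.run(click_command(rect), check=True)` would perform the click. Keeping the command builder separate makes the logic easy to test without moving the real pointer.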

Models used

OCR: LightOnOCR-2-1B — fast, accurate, runs on AMD GPU via ROCm
TTS: Kokoro-82M — high quality, ~100ms latency on CPU

Both download automatically from HuggingFace on first run. No API keys, no subscriptions.

Smart idle detection

Pixel-level diff comparison means OCR and TTS only fire when something has actually changed. Reading a static page? Silent. New content loaded? It speaks as soon as the next snapshot picks up the change.

# Only trigger OCR when >1% of pixels changed
uv run python capture.py --diff-threshold 1.0
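
A percentage-based check like this could look something like the sketch below. It assumes each frame is a flat `bytes` object of greyscale pixels, which is a simplification; a real implementation would more likely use an array library. The names `changed_percent`, `should_run_ocr`, and `tolerance` are hypothetical, and only the 1% threshold comes from the flag above.

```python
from typing import Optional


def changed_percent(prev: bytes, curr: bytes, tolerance: int = 0) -> float:
    """Percentage of pixels whose greyscale value moved by more than `tolerance`."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    if not curr:
        return 0.0
    differing = sum(1 for a, b in zip(prev, curr) if abs(a - b) > tolerance)
    return 100.0 * differing / len(curr)


def should_run_ocr(
    prev: Optional[bytes], curr: bytes, threshold_percent: float = 1.0
) -> bool:
    """Trigger OCR on the first frame, or when more than the threshold changed."""
    if prev is None:
        return True
    return changed_percent(prev, curr) > threshold_percent
```

A small `tolerance` above zero would let the check ignore faint noise such as a blinking cursor's anti-aliasing, at the cost of missing very low-contrast changes.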

Quick start

# Install system deps
sudo apt-get install -y slop xdotool libportaudio2 libsndfile1

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and run
git clone https://github.com/paradisecy/sttts
cd sttts
uv sync
uv run python capture.py