I got tired of switching between reading and listening, so I built sttts — a local pipeline that watches any region of your screen, OCRs it, and speaks it aloud in real time. Everything runs on your own machine.
What it does
- 🖱️ You draw a rectangle on any part of your screen
- 📸 It snapshots that region every N seconds
- 🔍 Pixel diff check — skips frames where nothing changed
- 🧠 LightOnOCR-2-1B reads the text (runs on AMD GPU via ROCm)
- 🗣️ Kokoro-82M speaks it through your speakers (runs on CPU)
🖥️ screen → 🔍 diff → 🧠 OCR → ✨ clean text → 🗣️ TTS → 🔊 speaker
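To make the flow above concrete, here is a minimal sketch of one pass through the pipeline. It is not the actual sttts code: `process_frame` and `clean_text` are hypothetical names, the OCR and TTS stages are passed in as plain callables, a simple equality check stands in for the pixel diff, and the cleanup rules (re-joining hyphenated line breaks, collapsing whitespace) are my assumption about what a "clean text" step might do.

```python
import re
from typing import Callable, Optional


def clean_text(raw: str) -> str:
    """Hypothetical cleanup: re-join words split across lines, collapse whitespace."""
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def process_frame(
    frame: bytes,
    previous: Optional[bytes],
    ocr: Callable[[bytes], str],
    speak: Callable[[str], None],
) -> bool:
    """Run one pipeline step. Returns True if something was spoken."""
    # Stand-in for the diff stage: skip frames identical to the last one.
    if previous is not None and frame == previous:
        return False
    text = clean_text(ocr(frame))
    if not text:
        return False  # nothing readable in the region
    speak(text)
    return True


if __name__ == "__main__":
    # Fake OCR and TTS stages so the sketch runs without a GPU or speakers.
    process_frame(b"frame-1", None, lambda _: "Hel-\nlo   world\n", print)
```

Passing the OCR and TTS stages in as callables keeps the loop testable without loading either model.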
The killer feature — auto page-turn
You can draw a second rectangle over any button on screen. After TTS finishes speaking and the screen stays idle, sttts automatically clicks it. I use this with Kindle for PC — it reads the entire book hands-free, turning pages automatically.
# Draw OCR region, then draw the next-page button
uv run python capture.py --next-btn -i 2
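The page-turn decision can be sketched roughly like this. The names `Region`, `should_turn_page`, and `click_command` are hypothetical, and the "idle for a number of consecutive checks" rule is my assumption about how "stays idle" might be measured; the only thing taken from the project is that clicks go through xdotool.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A screen rectangle in pixels (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def should_turn_page(tts_busy: bool, idle_checks: int, required_idle_checks: int = 2) -> bool:
    """Turn the page only when speech has finished and the screen has stayed unchanged."""
    return not tts_busy and idle_checks >= required_idle_checks


def click_command(button: Region) -> list[str]:
    """Build an xdotool invocation that left-clicks the middle of the button."""
    cx, cy = button.center()
    return ["xdotool", "mousemove", str(cx), str(cy), "click", "1"]


def click(button: Region) -> None:
    """Perform the click (requires xdotool and an X11 session)."""
    subprocess.run(click_command(button), check=True)
```

Keeping the command builder separate from the `subprocess` call makes the logic easy to check on a machine without a display.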
Models used
- OCR: LightOnOCR-2-1B — fast, accurate, runs on AMD GPU via ROCm
- TTS: Kokoro-82M — high quality, ~100ms latency on CPU
Both download automatically from HuggingFace on first run. No API keys, no subscriptions.
Smart idle detection
Pixel-level diff comparison means OCR and TTS only fire when something actually changed. Reading a static page? Silent. New content loaded? Speaks immediately.
# Only trigger OCR when >1% of pixels changed
uv run python capture.py --diff-threshold 1.0
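For illustration, a percentage-based pixel diff can be written in a few lines. This is a sketch rather than the sttts implementation: `changed_percent` and `should_run_ocr` are hypothetical names, frames are assumed to be raw RGB byte buffers of equal size, and a real capture loop would likely use NumPy for speed instead of a pure-Python loop.

```python
def changed_percent(prev: bytes, curr: bytes, bytes_per_pixel: int = 3) -> float:
    """Return the percentage of pixels that differ between two raw frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same size")
    if not curr or len(curr) % bytes_per_pixel:
        raise ValueError("frame length must be a non-zero multiple of bytes_per_pixel")
    total = len(curr) // bytes_per_pixel
    # A pixel counts as changed if any of its channel bytes differ.
    changed = sum(
        prev[i:i + bytes_per_pixel] != curr[i:i + bytes_per_pixel]
        for i in range(0, len(curr), bytes_per_pixel)
    )
    return 100.0 * changed / total


def should_run_ocr(prev: bytes, curr: bytes, threshold: float = 1.0) -> bool:
    """Fire OCR only when strictly more than `threshold` percent of pixels changed."""
    return changed_percent(prev, curr) > threshold


if __name__ == "__main__":
    before = bytes(300)            # 100 black RGB pixels
    after = bytearray(before)
    after[0:6] = b"\xff" * 6       # repaint the first two pixels
    print(changed_percent(before, bytes(after)), should_run_ocr(before, bytes(after)))
```

A strict `>` comparison matches the "more than 1%" wording above, so a frame with exactly 1% of its pixels changed is still treated as idle.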
Quick start
# Install system deps
sudo apt-get install -y slop xdotool libportaudio2 libsndfile1
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and run
git clone https://github.com/paradisecy/sttts
cd sttts
uv sync
uv run python capture.py
Use cases
- 📖 Hands-free ebook reading (Kindle, epub readers, PDFs)
- 📊 Financial dashboards spoken aloud as they update
- ♿ Accessibility tool for any app that lacks screen reader support
- 💻 Read terminal output or logs aloud while working
- 🌐 Listen to any webpage without a browser extension
Tech stack
- Python 3.13
- PyTorch 2.8 + ROCm 6.3 (AMD GPU)
- mss for fast screen capture
- transformers for OCR
- kokoro for TTS
- sounddevice for audio playback
- slop + xdotool for region selection and mouse clicks
⭐ GitHub: paradisecy/sttts