Ayush Shekhar

I Built a Privacy-First Alternative to Microsoft Recall — Using All 3 Gemma 4 Modalities

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

After Microsoft's Recall got torn apart for storing screenshots in plaintext with telemetry phoning home, I thought: the idea is genuinely useful. It means knowing what you were doing 3 days ago, finding that one message someone sent you, and remembering what you were working on before lunch. The execution was just terrible.

So I built ScreenMind — an open-source screen activity journal that runs entirely on your machine. It captures your screen, analyzes every screenshot with Gemma 4's vision, and builds a searchable, chat-able AI memory of your digital life.

The difference from Recall: nothing ever leaves your computer. No cloud. No telemetry. No "trust us with your screenshots." Everything — capture, analysis, search, chat — runs locally on a single GPU.

But it's way more than a Recall clone. Here's what it actually does:

📸 Smart Capture — Doesn't blindly screenshot every 30 seconds. Uses perceptual hashing to detect when your screen actually changes. Cursor blinks and clock updates get ignored. Real content changes get captured.
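
A rough sketch of the idea behind change detection with a perceptual hash. It uses the simple "average hash" variant on a plain grid of grayscale values, which is an assumption for illustration and not necessarily the hash ScreenMind uses; the function names are hypothetical.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 grayscale values


def shrink(image: Gray, size: int = 8) -> List[float]:
    """Downscale to size x size cells by averaging the pixels in each cell."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    cells = []
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    return cells


def average_hash(image: Gray, size: int = 8) -> int:
    """One bit per cell, set when the cell is brighter than the overall mean."""
    cells = shrink(image, size)
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Because every bit summarizes a whole cell, a single changed pixel (a blinking cursor, say) usually leaves the hash untouched, while a real content change flips many bits.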

🧠 Gemma 4 Vision Analysis — Every screenshot goes through Gemma 4 with OCR context. It figures out what app you're using, what you're doing, categorizes the activity, detects your mood, and writes a detailed scene description. Not just "user is in Chrome" — more like "user is reading a pull request review on GitHub for the auth-middleware refactor."

📐 Spatial Layout Detection — OCR boxes get classified into screen regions (sidebar, chat area, toolbar, profile panel) using coordinate-based parsing. Text gets organized by section so when you search or chat, you get structured context — not a wall of raw OCR.
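
Coordinate-based region parsing could look roughly like the sketch below. The `OcrBox` type, the region cut-offs (top 8%, left 20%, right 25%), and the function names are all assumptions made for illustration; the real thresholds are not given here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrBox:
    text: str
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float


def classify_region(box: OcrBox, screen_w: float, screen_h: float) -> str:
    """Assign a box to a region based on where its center falls (assumed cut-offs)."""
    cx = (box.x + box.width / 2) / screen_w
    cy = (box.y + box.height / 2) / screen_h
    if cy < 0.08:
        return "toolbar"
    if cx < 0.20:
        return "sidebar"
    if cx > 0.75:
        return "profile panel"
    return "chat area"


def group_by_region(boxes: List[OcrBox], screen_w: float, screen_h: float) -> Dict[str, List[str]]:
    """Group OCR text by region, keeping top-to-bottom, left-to-right reading order."""
    regions: Dict[str, List[str]] = defaultdict(list)
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        regions[classify_region(box, screen_w, screen_h)].append(box.text)
    return dict(regions)
```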

🔍 Hybrid Search — Semantic search (MiniLM embeddings + cosine similarity) combined with FTS5 keyword search. Ask "debugging the auth module" and it finds screenshots by meaning, not just exact word matches. Results show OCR text highlighted directly on the screenshot.

💬 Chat With Your Screen History — This is the feature people love most. Ask "what should I reply to that Discord message?" and it pulls up the relevant screenshot, reads the organized text, and answers. Ask "did I get any email from Zerodha?" and it finds your inbox screenshot and tells you. It's RAG over your actual life, not documents.

🎙️ Voice Memos — Hold Ctrl+Shift+V, speak, release. Gemma 4's native audio encoder transcribes it. A screenshot is captured alongside so you have visual context with every memo.

🎤 Meeting Transcription — Auto-detects when you're in Zoom, Teams, Discord, or Meet. Records audio, transcribes in 15-second chunks using Gemma's audio encoder, then runs map-reduce summarization for long meetings. Outputs structured summaries with topics, decisions, and action items.

🤖 Agent Platform — This is the part I'm most proud of. You can build custom automations by writing a markdown file in plain English:

---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.

Drop it in a folder. It runs automatically. Gemma processes your prompt with injected screen data. No code needed. For developers who want more control, there's a full Python SDK with state persistence and GPU-safe LLM access.
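
To make the file format concrete, here is a minimal sketch of how an agent file like the one above could be parsed. `AgentSpec`, `parse_agent_file`, and `schedule_seconds` are hypothetical names, and the `every <number><unit>` schedule grammar is inferred from the single example, so treat it as an assumption.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentSpec:
    name: str
    schedule: str
    data: List[str]
    output: List[str]
    prompt: str


def _split_list(value: str) -> List[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_agent_file(text: str) -> AgentSpec:
    """Split a '---' delimited front matter block from the plain-English prompt."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' front matter block")
    closing = [i for i in range(1, len(lines)) if lines[i].strip() == "---"]
    if not closing:
        raise ValueError("front matter block is not closed")
    end = closing[0]
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        meta[key.strip()] = value.strip()
    return AgentSpec(
        name=meta["name"],
        schedule=meta.get("schedule", ""),
        data=_split_list(meta.get("data", "")),
        output=_split_list(meta.get("output", "")),
        prompt="\n".join(lines[end + 1:]).strip(),
    )


UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def schedule_seconds(schedule: str) -> int:
    """Convert a schedule such as 'every 6h' into an interval in seconds."""
    match = re.fullmatch(r"every\s+(\d+)\s*([smhd])", schedule.strip())
    if not match:
        raise ValueError(f"unsupported schedule: {schedule!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]
```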

🔌 MCP Server — Exposes your screen history to Claude Desktop, Cursor, and VS Code via Model Context Protocol. It provides 8 tools, including search, recent activity, time-range queries, daily summaries, meeting transcripts, and instant capture.

🔐 Privacy — Auto-redacts credit cards, SSNs, API keys, and passwords from captured text before storage. Optional AES encryption at rest. Dashboard PIN lock. App blocklist. Incognito mode.
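
As an illustration of redaction before storage, the sketch below masks card numbers (checked with the Luhn algorithm to cut false positives), US SSNs, and key-like tokens. The regular expressions, the placeholder strings, and the `sk-`/`pk-`/`api-`/`key-` prefixes are assumptions, not ScreenMind's actual rules, and password detection is left out.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Hypothetical key shape: a short prefix followed by 16+ token characters.
KEY_RE = re.compile(r"\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{16,}")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace sensitive-looking spans with placeholders."""

    def mask_card(match: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED_CARD]" if luhn_ok(digits) else match.group()

    text = CARD_RE.sub(mask_card, text)
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    text = KEY_RE.sub("[REDACTED_KEY]", text)
    return text
```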

📊 Analytics — Category breakdown, top apps, hourly heatmap, meeting stats. See where your time actually goes.

⏪ Day Rewind — Timelapse playback of your entire day with play/pause/scrub/speed controls.

🔗 Integrations — Obsidian vault sync, Notion database export, webhooks (Slack, Discord, IFTTT) with HMAC signing and auto-retry.
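
For readers unfamiliar with HMAC-signed webhooks, this is a minimal sketch using Python's standard `hmac` module. The choice of SHA-256, the `timestamp.body` message layout, and the function names are assumptions for illustration; the receiver recomputes the signature with the shared secret and compares in constant time.

```python
import hashlib
import hmac
import json
from typing import Tuple


def sign_payload(secret: bytes, payload: dict, timestamp: int) -> Tuple[bytes, str]:
    """Serialize the payload and return (body, hex signature)."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    message = str(timestamp).encode() + b"." + body
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return body, signature


def verify_payload(secret: bytes, body: bytes, timestamp: int, signature: str) -> bool:
    """Recompute the signature on the receiving side and compare in constant time."""
    message = str(timestamp).encode() + b"." + body
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Including the timestamp in the signed message lets a receiver reject old deliveries that are replayed later.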

Demo

Code

ayushh0110 / ScreenMind

🧠 AI-powered screen memory — captures, analyzes, and lets you search/chat your screen history. Powered by Gemma 4 E2B. 100% local, 100% private.


ScreenMind

Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.


Microsoft showed the world wants screen-aware AI with Recall. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.


✨ Features

🧠 Core Intelligence

  • 📸 Smart Capture — Content-change detection, not a fixed timer. Captures when your screen actually changes.
  • 🔬 Gemma 4 Vision Analysis — Every screenshot analyzed: app detection, activity categorization, mood…

How I Used Gemma 4

I chose the Gemma 4 family — and it's not a preference, it's an architectural requirement. E2B is the default for 4GB GPUs, E4B for users with more headroom. Let me explain why no other model family works here.

ScreenMind runs continuously in the background. It needs to analyze a screenshot every 30-40 seconds, transcribe voice memos on demand, power a chat interface, and run agent prompts — all on a single consumer GPU. These constraints eliminate everything else:

| Constraint | What it eliminates |
| --- | --- |
| Must run continuously in background on 4GB VRAM | Rules out 12B+ models |
| Must understand screenshots natively | Rules out text-only models |
| Must transcribe audio natively | Rules out models without audio encoder |
| Must stay 100% local | Rules out cloud APIs |
| Must be fast enough for 30-40s capture cycle | E2B does it in 12-76s depending on mode |

Gemma 4 E2B is the only model that checks all five boxes.

All three modalities in one product:

Vision — Every screenshot gets sent to Gemma 4 with OCR text as context. The prompt asks for structured JSON: app name, activity category, summary, detailed context, mood, confidence, scene description, and layout regions. I built three analysis modes:

  • Fast (~12s) — uses a no-thinking prefill trick (pre-fill <think>\n</think> in the assistant message to skip reasoning)
  • Balanced (~40s) — natural thinking, analysis only
  • Accurate (~76s) — thinking + spatial layout detection in one call
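
Since the vision prompt asks for structured JSON, the reply has to be parsed defensively, because a model can wrap the object in extra text. A minimal sketch follows; the key names in `REQUIRED` are hypothetical stand-ins for the fields listed above, and `parse_analysis` is not a ScreenMind function.

```python
import json
from typing import Any, Dict

# Hypothetical key names for a subset of the fields the prompt requests.
REQUIRED = ("app", "category", "summary", "mood", "confidence")


def parse_analysis(raw: str) -> Dict[str, Any]:
    """Extract the outermost JSON object from model output and validate it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    # Clamp confidence into [0, 1] so downstream code can rely on the range.
    data["confidence"] = min(1.0, max(0.0, float(data["confidence"])))
    return data
```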

Audio — Gemma 4 E2B has a native audio encoder. I use it for voice memo transcription and meeting transcription. No Whisper, no separate ASR model. One model handles everything. For meetings, audio gets chunked into 15-second segments, each transcribed by Gemma, then a final Gemma call does map-reduce summarization.
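
The chunk-then-summarize flow can be sketched as below. The model calls are passed in as plain callables (`transcribe`, `summarize`) so that nothing about the actual Gemma or llama.cpp API is assumed; `group_size` and both function names are hypothetical.

```python
from typing import Callable, List, Sequence


def chunk_audio(samples: Sequence[float], sample_rate: int, seconds: float = 15.0) -> List[Sequence[float]]:
    """Split raw samples into consecutive fixed-length segments (the last may be shorter)."""
    step = int(sample_rate * seconds)
    if step <= 0:
        raise ValueError("sample_rate * seconds must be at least 1")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def summarize_meeting(
    chunks: List[Sequence[float]],
    transcribe: Callable[[Sequence[float]], str],
    summarize: Callable[[str], str],
    group_size: int = 8,
) -> str:
    """Map: summarize groups of chunk transcripts. Reduce: summarize the partial summaries."""
    transcripts = [transcribe(chunk) for chunk in chunks]
    if not transcripts:
        return ""
    partials = [
        summarize("\n".join(transcripts[i:i + group_size]))
        for i in range(0, len(transcripts), group_size)
    ]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))
```

Grouping transcripts before summarizing keeps each prompt short, which matters for long meetings on a small context window.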

Reasoning — Daily summaries use think=True for deep reasoning over a day's activities. Chat uses Gemma to answer questions grounded in screen context. Agents feed screen data into Gemma prompts for custom analysis.

Performance engineering around a single GPU:

Since there's only one GPU slot, I built a priority system. Chat cancels in-flight analysis instantly (closes the HTTP client → llama-server frees the slot in <1s). The cancelled analysis gets re-queued at the front, not the back. Users never wait for background work to finish.
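
The queueing half of that priority system could be modeled as in the sketch below. It only covers ordering; cancelling the in-flight request by closing the HTTP client is not shown. `GpuQueue` and the priority constants are hypothetical names.

```python
import heapq
import itertools
from typing import Any, List, Tuple

CHAT, ANALYSIS = 0, 1  # lower number runs first


class GpuQueue:
    """Priority queue where a cancelled job can be re-queued ahead of its peers."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._back = itertools.count()          # 0, 1, 2, ... keeps FIFO order
        self._front = itertools.count(-1, -1)   # -1, -2, ... sorts before all FIFO entries

    def push(self, priority: int, job: Any, front: bool = False) -> None:
        order = next(self._front) if front else next(self._back)
        heapq.heappush(self._heap, (priority, order, job))

    def pop(self) -> Any:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = GpuQueue()
queue.push(ANALYSIS, "screenshot-1")
queue.push(ANALYSIS, "screenshot-2")
running = queue.pop()                     # screenshot-1 starts on the GPU

# A chat request arrives: the running analysis is cancelled and re-queued first in line.
queue.push(ANALYSIS, running, front=True)
queue.push(CHAT, "chat: what was I doing before lunch?")
order = [queue.pop() for _ in range(len(queue))]
```

After this runs, `order` holds the chat request first, then the cancelled `screenshot-1`, then `screenshot-2`.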

I also built a per-app pHash cache with three tiers:

  • Identical screens (diff ≤3): skip everything, copy from cache — 0ms
  • Minor changes (diff ≤9): re-run OCR only, reuse Gemma analysis — 3-10s
  • Full change (diff 10+): run the complete pipeline — 12-76s

This cuts Gemma inference calls by 30-50% during typical usage. Combined with the three analysis modes, ScreenMind runs comfortably on my GTX 1650 with 4GB VRAM as a daily driver.
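
The three tiers map naturally onto Hamming-distance thresholds between hashes. In this sketch the thresholds (3 and 9) come from the list above; the class name, the tier labels, and the choice to compare against the hash of the last fully analyzed screenshot (so small changes cannot drift unnoticed) are assumptions.

```python
from typing import Dict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two perceptual hashes."""
    return bin(a ^ b).count("1")


class AppHashCache:
    """Per-app cache that decides how much of the pipeline to re-run."""

    def __init__(self, skip_max: int = 3, ocr_max: int = 9) -> None:
        self.skip_max = skip_max
        self.ocr_max = ocr_max
        self._reference: Dict[str, int] = {}  # app name -> hash of last full analysis

    def decide(self, app: str, new_hash: int) -> str:
        reference = self._reference.get(app)
        if reference is not None:
            diff = hamming(reference, new_hash)
            if diff <= self.skip_max:
                return "skip"       # identical screen: copy the cached result
            if diff <= self.ocr_max:
                return "ocr_only"   # minor change: re-run OCR, reuse the analysis
        # First sighting of this app or a large change: run everything.
        self._reference[app] = new_hash
        return "full"
```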

The multi-model pipeline:

Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                                     ↑
                              OCR text fed as context

Three AI models and one full-text index working together, with Gemma 4 as the brain. EasyOCR extracts what's written. Gemma understands what you're doing. MiniLM enables semantic search. SQLite's FTS5 handles instant keyword lookup. Each component does what it's best at.
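
How the semantic and keyword results get merged is not spelled out here, so the following is only one plausible approach: a weighted sum of cosine similarity and a keyword score already normalized to [0, 1] (for example, derived from FTS5 ranking). The `alpha` weight and the function names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(
    query_vec: Sequence[float],
    doc_vecs: Dict[int, Sequence[float]],
    keyword_scores: Dict[int, float],
    alpha: float = 0.6,
) -> List[Tuple[int, float]]:
    """Blend semantic and keyword relevance; higher combined score ranks first."""
    combined = {}
    for doc_id, vec in doc_vecs.items():
        semantic = max(0.0, cosine(query_vec, vec))
        keyword = keyword_scores.get(doc_id, 0.0)
        combined[doc_id] = alpha * semantic + (1 - alpha) * keyword
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

With this weighting, a screenshot that matches the query's meaning can outrank one that merely contains the exact words, while exact matches still get a boost.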

I've been using this daily for two weeks. The chat feature is genuinely addictive — being able to ask "what was I working on before lunch?" or "what did that email say?" and getting an actual answer from your own screen history changes how you think about your computer.
