Ayush Shekhar

Posted on May 23 • Edited on Jun 18

I Built a Privacy-First Alternative to Microsoft Recall — Using All 3 Gemma 4 Modalities

#devchallenge #gemmachallenge #gemma #ai

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

I kept losing things. Not files — context. "What was that error I saw 2 hours ago?" "What did that Slack message say before I scrolled past it?" "What was I even working on before lunch?"

Microsoft Recall tried to solve this but stored everything in plaintext with telemetry phoning home. The idea is genuinely useful — the execution was terrible. So I built my own.

ScreenMind is an open-source screen activity journal that runs entirely on your machine. It captures your screen, analyzes every screenshot with Gemma 4's vision, and builds a searchable, chat-able AI memory of your digital life.

After two weeks of daily driving it, the thing I didn't expect: you stop screenshotting Slack messages "just in case." You stop bookmarking stuff you know you'll forget. You just trust your computer to remember. The chat is the real feature — "What did Alex say on Discord?" pulls up the actual message. "What was I working on at 3pm?" shows you. It's grep for your visual memory.

Agents	Chat with your memory

But it goes way beyond a Recall clone. Here's what it actually does:

Smart Capture

Doesn't blindly screenshot every 30 seconds. Uses perceptual hashing to detect when your screen actually changes. Cursor blinks and clock updates get ignored. Real content changes get captured. This alone significantly reduces unnecessary inference calls.

Gemma 4 Vision Analysis

Every screenshot goes through Gemma 4 with OCR text as context. It figures out what app you're using, what you're doing, categorizes the activity, detects mood, and writes a detailed scene description. Not just "user is in Chrome" — more like "user is reading a pull request review on GitHub for the auth-middleware refactor."

Spatial Layout Detection

OCR boxes get classified into screen regions (sidebar, chat area, toolbar, profile panel) using coordinate-based parsing. Text gets organized by section so when you search or chat, you get structured context — not a wall of raw OCR.

Hybrid Search

Semantic search (MiniLM embeddings + cosine similarity) combined with SQLite FTS5 keyword search. Ask "debugging the auth module" and it finds screenshots by meaning, not just exact word matches. Results show OCR text highlighted directly on the screenshot.

Chat With Your Screen History

Ask "what should I reply to that Discord message?" and it pulls up the relevant screenshot, reads the organized text, and answers. Ask "did I get any email from Zerodha?" and it finds your inbox screenshot and tells you. It's RAG over your actual life, not documents.

Voice Memos

Hold Ctrl+Shift+V, speak, release. Gemma 4's native audio encoder transcribes it. A screenshot is captured alongside so you have visual context with every memo.

Meeting Transcription

Auto-detects when you're in Zoom, Teams, Discord, or Meet. Records audio, transcribes in 15-second chunks using Gemma's audio encoder, then runs map-reduce summarization. Outputs structured summaries with topics, decisions, and action items.

Agent Platform

Build custom automations by writing a markdown file in plain English:

---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---
Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.

Drop it in a folder. It runs automatically. For developers who want more control, there's a full Python SDK with state persistence and GPU-safe LLM access.

MCP Server

Exposes your screen history to Claude Desktop, Cursor, and VS Code via Model Context Protocol. 8 tools: search, recent activity, time-range queries, daily summaries, meeting transcripts, instant capture.

Privacy

Auto-redacts credit cards, SSNs, API keys, and passwords from captured text before storage. AES encryption at rest. Dashboard PIN lock. App blocklist. Incognito mode. Nothing ever leaves your machine.

Plus

Analytics dashboard — category breakdown, top apps, hourly heatmap, meeting stats
Day Rewind — timelapse playback of your entire day with play/pause/scrub controls
Integrations — Obsidian vault sync, Notion export, webhooks (Slack, Discord, IFTTT)

Demo

Code

ayushh0110 / ScreenMind

AI-powered screen memory — captures, analyzes, and lets you search/chat your screen history. Powered by Gemma 4 . 100% local, 100% private.

Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.

Features · Gemma 4 Deep Dive · Quick Start · Architecture · Agent Platform · MCP · API

Agents	Chat with your memory

Microsoft showed the world wants screen-aware AI with Recall. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.

✨ Features

🧠 Core Intelligence

📸 Smart Capture — Content-change detection, not a fixed timer. Captures when your screen actually changes.
🔬 Gemma 4 Vision Analysis — Every screenshot analyzed: app detection, activity categorization, mood…

View on GitHub

How I Used Gemma 4

This is where it gets interesting. The hard part wasn't capturing screenshots — it was making continuous AI analysis sustainable on a single consumer GPU that also needs to answer questions in real-time.

Why Gemma 4 E2B — and why nothing else works

I chose the Gemma 4 family — and it's not a preference, it's an architectural requirement. E2B is the default for 4GB GPUs, E4B for users with more headroom. Let me explain why no other model family works here.

ScreenMind runs continuously in the background. It needs to analyze a screenshot every 30-40 seconds, transcribe voice memos on demand, power a chat interface, and run agent prompts — all on a single consumer GPU. These constraints eliminate everything else:

Constraint	What it eliminates
Must run continuously in background on 4GB VRAM	Rules out 12B+ models
Must understand screenshots natively	Rules out text-only models
Must transcribe audio natively	Rules out models without audio encoder
Must stay 100% local	Rules out cloud APIs
Must be fast enough for 30-40s capture cycle	E2B does it in 12-76s on a 4GB GTX 1650 (faster on bigger GPUs)

Gemma 4 E2B is the only model that checks all five boxes. One model in VRAM instead of two or three. Runs through llama.cpp with Q4 quantization on my GTX 1650.

The GPU scheduling problem

If you send every screenshot to a vision model, your GPU is permanently busy with background work. Nothing left when you actually want to chat. That's useless.

I built three systems to solve this:

1. Perceptual hash caching (3 tiers)

Screen change	What happens	Time
Identical (pHash diff ≤3)	Skip everything, copy from cache	0ms
Minor change (diff ≤9)	Re-run OCR only, reuse Gemma analysis	3-10s
Real change (diff 10+)	Full pipeline	12-76s

The trick: cache staleness is per-app. Discord/Slack screenshots expire faster than VS Code. Chat moves fast, code doesn't.

2. Chat-first GPU priority

When you ask a question, it kills whatever background analysis is running — closes the HTTP client, llama-server frees the slot in <1s. The cancelled analysis gets re-queued at the front, not the back. Users never wait for background work to finish.

3. Three analysis modes using Gemma's thinking

Fast (~12s) — pre-fill <think>\n</think> in the assistant message to skip reasoning entirely. Good enough for app detection and basic categorization
Balanced (~40s) — natural thinking. Better scene descriptions and activity understanding
Accurate (~76s) — full thinking + spatial layout detection in one call. Used for complex screens and meetings

Quick note on these numbers: they're all from my GTX 1650. 4GB VRAM, which is about as low as you can go for multimodal local AI. The model doesn't fully fit in 4GB so it's spilling to CPU, and that spill is where most of the time goes.

On a card where E2B actually fits — 8GB and up, something like a 3060 — the heavier modes get roughly 4-6x faster.Accurate goes from ~76s to somewhere around 10-15s. Fast mode barely moves though. Once you strip out the thinking you're not waiting on Gemma anymore, you're waiting on OCR and embeddings — and right now those run on CPU. I could push them onto the GPU later, which would help Fast mode, but haven't yet.

Figured I'd be upfront that this is the floor, not the ceiling.

All three modalities in one product

Vision — Every screenshot gets sent to Gemma 4 with OCR context. The prompt asks for structured JSON: app name, activity category, summary, detailed context, mood, confidence, scene description, and layout regions. The structured output means search and chat get organized data, not a blob of text.

Audio — Gemma 4 E2B has a native audio encoder. I use it for voice memo transcription and meeting transcription. No Whisper, no separate ASR model. One model handles everything. For meetings, audio gets chunked into 15-second segments, each transcribed by Gemma, then a final Gemma call does map-reduce summarization.

Reasoning — Daily summaries use think=True for deep reasoning over a full day's activities. Chat uses Gemma to answer questions grounded in screen context with retrieved screenshots. Agents feed screen data into Gemma prompts for custom analysis.

The full pipeline

Screenshot → EasyOCR (text extraction) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                ↑ OCR text fed as context to Gemma

Four AI models working together, with Gemma 4 as the brain. OCR extracts what's written. Gemma understands what you're doing. MiniLM enables semantic search. FTS5 handles instant keyword lookup. Each model does what it's best at.

This significantly cuts Gemma inference calls during typical usage. Combined with the three analysis modes, ScreenMind runs comfortably on my GTX 1650 with 4GB VRAM as a daily driver.

I've been using this daily for two weeks. The chat feature is genuinely addictive — being able to ask "what was I working on before lunch?" or "what did that email say?" and getting an actual answer from your own screen history changes how you think about your computer. It makes me wonder — does personal AI get fundamentally more useful once it has persistent context about what you're actually doing? Would love to hear what you think.