DEV Community: Ayush Shekhar

I Built a Privacy-First Alternative to Microsoft Recall — Using All 3 Gemma 4 Modalities

Ayush Shekhar — Sat, 23 May 2026 10:43:57 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

After Microsoft's Recall got torn apart for storing screenshots in plaintext with telemetry phoning home, I thought — the idea is genuinely useful. Knowing what you were doing 3 days ago, finding that one message someone sent you, remembering what you were working on before lunch. The execution was just terrible.

So I built ScreenMind — an open-source screen activity journal that runs entirely on your machine. It captures your screen, analyzes every screenshot with Gemma 4's vision, and builds a searchable, chat-able AI memory of your digital life.

The difference from Recall: nothing ever leaves your computer. No cloud. No telemetry. No "trust us with your screenshots." Everything — capture, analysis, search, chat — runs locally on a single GPU.

But it's way more than a Recall clone. Here's what it actually does:

📸 Smart Capture — Doesn't blindly screenshot every 30 seconds. Uses perceptual hashing to detect when your screen actually changes. Cursor blinks and clock updates get ignored. Real content changes get captured.

🧠 Gemma 4 Vision Analysis — Every screenshot goes through Gemma 4 with OCR context. It figures out what app you're using, what you're doing, categorizes the activity, detects your mood, and writes a detailed scene description. Not just "user is in Chrome" — more like "user is reading a pull request review on GitHub for the auth-middleware refactor."

📐 Spatial Layout Detection — OCR boxes get classified into screen regions (sidebar, chat area, toolbar, profile panel) using coordinate-based parsing. Text gets organized by section so when you search or chat, you get structured context — not a wall of raw OCR.

🔍 Hybrid Search — Semantic search (MiniLM embeddings + cosine similarity) combined with FTS5 keyword search. Ask "debugging the auth module" and it finds screenshots by meaning, not just exact word matches. Results show OCR text highlighted directly on the screenshot.

💬 Chat With Your Screen History — This is the feature people love most. Ask "what should I reply to that Discord message?" and it pulls up the relevant screenshot, reads the organized text, and answers. Ask "did I get any email from Zerodha?" and it finds your inbox screenshot and tells you. It's RAG over your actual life, not documents.

🎙️ Voice Memos — Hold Ctrl+Shift+V, speak, release. Gemma 4's native audio encoder transcribes it. A screenshot is captured alongside so you have visual context with every memo.

🎤 Meeting Transcription — Auto-detects when you're in Zoom, Teams, Discord, or Meet. Records audio, transcribes in 15-second chunks using Gemma's audio encoder, then runs map-reduce summarization for long meetings. Outputs structured summaries with topics, decisions, and action items.

🤖 Agent Platform — This is the part I'm most proud of. You can build custom automations by writing a markdown file in plain English:

---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.

Drop it in a folder. It runs automatically. Gemma processes your prompt with injected screen data. No code needed. For developers who want more control, there's a full Python SDK with state persistence and GPU-safe LLM access.

🔌 MCP Server — Exposes your screen history to Claude Desktop, Cursor, and VS Code via Model Context Protocol. 8 tools: search, recent activity, time-range queries, daily summaries, meeting transcripts, instant capture.

🔐 Privacy — Auto-redacts credit cards, SSNs, API keys, and passwords from captured text before storage. Optional AES encryption at rest. Dashboard PIN lock. App blocklist. Incognito mode.

📊 Analytics — Category breakdown, top apps, hourly heatmap, meeting stats. See where your time actually goes.

⏪ Day Rewind — Timelapse playback of your entire day with play/pause/scrub/speed controls.

🔗 Integrations — Obsidian vault sync, Notion database export, webhooks (Slack, Discord, IFTTT) with HMAC signing and auto-retry.

Demo

Code

ayushh0110 / ScreenMind

🧠 AI-powered screen memory — captures, analyzes, and lets you search/chat your screen history. Powered by Gemma 4 E2B. 100% local, 100% private.

Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.

Features · Gemma 4 Deep Dive · Quick Start · Architecture · Agent Platform · MCP · API

Agents	Chat with your memory

Microsoft showed the world wants screen-aware AI with Recall. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.

✨ Features

🧠 Core Intelligence

📸 Smart Capture — Content-change detection, not a fixed timer. Captures when your screen actually changes.
🔬 Gemma 4 Vision Analysis — Every screenshot analyzed: app detection, activity categorization, mood…

View on GitHub

How I Used Gemma 4

I chose the Gemma 4 family — and it's not a preference, it's an architectural requirement. E2B is the default for 4GB GPUs, E4B for users with more headroom. Let me explain why no other model family works here.

ScreenMind runs continuously in the background. It needs to analyze a screenshot every 30-40 seconds, transcribe voice memos on demand, power a chat interface, and run agent prompts — all on a single consumer GPU. These constraints eliminate everything else:

Constraint	What it eliminates
Must run continuously in background on 4GB VRAM	Rules out 12B+ models
Must understand screenshots natively	Rules out text-only models
Must transcribe audio natively	Rules out models without audio encoder
Must stay 100% local	Rules out cloud APIs
Must be fast enough for 30-40s capture cycle	E2B does it in 12-76s depending on mode

Gemma 4 E2B is the only model that checks all five boxes.

All three modalities in one product:

Vision — Every screenshot gets sent to Gemma 4 with OCR text as context. The prompt asks for structured JSON: app name, activity category, summary, detailed context, mood, confidence, scene description, and layout regions. I built three analysis modes:

Fast (~12s) — uses a no-thinking prefill trick (pre-fill <think>\n</think> in the assistant message to skip reasoning)
Balanced (~40s) — natural thinking, analysis only
Accurate (~76s) — thinking + spatial layout detection in one call

Audio — Gemma 4 E2B has a native audio encoder. I use it for voice memo transcription and meeting transcription. No Whisper, no separate ASR model. One model handles everything. For meetings, audio gets chunked into 15-second segments, each transcribed by Gemma, then a final Gemma call does map-reduce summarization.

Reasoning — Daily summaries use think=True for deep reasoning over a day's activities. Chat uses Gemma to answer questions grounded in screen context. Agents feed screen data into Gemma prompts for custom analysis.

Performance engineering around a single GPU:

Since there's only one GPU slot, I built a priority system. Chat cancels in-flight analysis instantly (closes the HTTP client → llama-server frees the slot in <1s). The cancelled analysis gets re-queued at the front, not the back. Users never wait for background work to finish.

I also built a per-app pHash cache with three tiers:

Identical screens (diff ≤3): skip everything, copy from cache — 0ms
Minor changes (diff ≤9): re-run OCR only, reuse Gemma analysis — 3-10s
Full change (diff 10+): run the complete pipeline — 12-76s

This cuts Gemma inference calls by 30-50% during typical usage. Combined with the three analysis modes, ScreenMind runs comfortably on my GTX 1650 with 4GB VRAM as a daily driver.

The multi-model pipeline:

Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                                     ↑
                              OCR text fed as context

Four AI models working together, with Gemma 4 as the brain. OCR extracts what's written. Gemma understands what you're doing. MiniLM enables semantic search. FTS5 handles instant keyword lookup. Each model does what it's best at.

I've been using this daily for two weeks. The chat feature is genuinely addictive — being able to ask "what was I working on before lunch?" or "what did that email say?" and getting an actual answer from your own screen history changes how you think about your computer.

From Heuristics to Fine-Tuning: Teaching a Model to Use Tools

Ayush Shekhar — Sun, 26 Apr 2026 13:28:09 +0000

How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.

The Problem

I built an autonomous AI agent with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. The first question every request must answer is deceptively simple:

Which tool should I use?

My first solution was a heuristic classifier — a function called classify_query() that uses regex patterns to detect intent:

# 200+ lines of patterns like this:
_SEARCH_INDICATORS = re.compile(
    r"\b(latest|current|news|today|recent|who won|score|price|"
    r"stock|update|happening|trending|release|launched)\b", re.IGNORECASE
)

_KNOWLEDGE_INDICATORS = re.compile(
    r"\b(explain|what is|how does|define|difference between|"
    r"why do|concept of|overview|meaning of|works)\b", re.IGNORECASE
)

It worked. About 75% of the time.

The remaining 25% was a graveyard of edge cases: "say hello in Japanese" (needs translate, matched nothing), "what's 15% of 2850" (needs calculator, matched what's → routed to search), "compare React vs Vue" (needs autonomous executor, matched compare → routed to direct answer).

Every fix introduced new regressions. Regex-based routing doesn't scale.

The Idea

What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:

A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.

This became ToolForge.

Step 1: Generating Training Data (The Hard Part)

I had 9 tools but no labeled dataset. Creating one manually would take weeks. Instead, I used teacher distillation — using a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.

The Distillation Pipeline

User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset

The trick was diversity. I needed queries covering:

Single-tool requests ("What's the weather in Tokyo?")
Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")
No-tool queries ("Explain recursion")
Ambiguous queries ("Tell me about Python" — search or direct answer?)
Edge cases ("sqrt of 44567" — calculator, not search)

I built a ClientPool that rotates across 6 free-tier Gemini API keys to avoid rate limits:

class ClientPool:
    """Round-robin pool of (key, model) slots for maximum throughput."""

    def next_client(self):
        # Pick the slot that has rested the longest
        best = min(self._slots, key=lambda s: s.last_used)
        elapsed = time.time() - best.last_used
        if elapsed < self._min_gap:
            time.sleep(self._min_gap - elapsed)
        return best

After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had 1,173 clean examples — enough for fine-tuning.

Dataset Distribution

Tool	Count	%
`web_search`	287	24%
`calculator`	156	13%
`weather`	143	12%
`translate`	132	11%
`wikipedia`	128	11%
`no_tool`	119	10%
`dictionary`	78	7%
`datetime`	68	6%
`unit_converter`	62	5%

The distribution is intentionally skewed toward web_search — mirroring real-world query patterns.

Step 2: Training with QLoRA

I trained on a Kaggle T4 GPU (free tier). The key insight: you don't need an A100 for fine-tuning. QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.

Configuration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization saves ~0.4GB
)

lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,          # Scaling factor (alpha/r = 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)

Why these choices?

r=64: Higher rank = more parameters = more capacity to learn tool routing patterns. I tested r=16 (too small) and r=64 (sweet spot).
All attention + MLP layers: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.
alpha=128 (2×r): Standard scaling that prevents gradient instability.

Step 3: The Ablation Study

This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:

Run	Base Model	LoRA Rank	LR	Result
1	Mistral-7B-Instruct-v0.3	16	2e-4	78.4%
2	Mistral-7B-Instruct-v0.3	64	2e-4	81.7%
3	Qwen2.5-7B-Instruct	16	2e-4	83.1%
4	Qwen2.5-7B-Instruct	64	2e-4	86.2%

All tracked on Weights & Biases.

Key Findings

1. Qwen > Mistral for tool routing (+4.5%)

Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.

2. r=64 > r=16 for both models (+3-4%)

The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.

3. Eval loss converges by epoch 2

All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. load_best_model_at_end=True was essential.

Step 4: Integration

The integration into the autonomous agent was designed as a feature flag — zero behavior change in production unless explicitly enabled:

# In executor.py
if is_toolforge_available():
    decision = toolforge_classify(query, memory_hits, has_memory)
    router_source = "toolforge"

if decision is None:
    decision = classify_query(query, memory_hits, has_memory)  # heuristic fallback

The toolforge_classify() function:

Loads the LoRA adapter lazily on first query
Runs inference with greedy decoding (deterministic routing)
Parses the model's tool-call output
Maps specific tools to the agent's decision types (web_search → needs_search, no tool → direct_answer)
Returns None on any failure → heuristic takes over

This means:

Production (HF Spaces, CPU): heuristic runs as before
GPU-enabled environments: ToolForge model handles routing
The code is always visible: interviewers can see the integration pattern

Results

Metric	Heuristic (Regex)	ToolForge (QLoRA)
Overall Accuracy	~75%	86.2%
Approach	200 lines of regex	Fine-tuned Qwen2.5-7B
Latency	0ms (regex)	~200ms (GPU)
Handles edge cases	❌ Constant regressions	✅ Learned from data
Maintenance cost	High (new regex per bug)	Low (retrain on new data)

The 15% accuracy improvement isn't just a number — it means:

"Say hello in Japanese" → correctly routes to translate (was: missed entirely)
"sqrt(44567)" → correctly routes to calculator (was: matched "what" → search)
"Compare React vs Vue for 2026" → correctly routes to autonomous_task (was: partial match → direct answer)

What I'd Do Differently

More data: 1.1K examples is enough for proof-of-concept, but 5K+ would likely push accuracy above 90%. The distillation pipeline can scale — I just ran out of free API quota.
Argument extraction evaluation: I evaluated tool selection accuracy but didn't formally measure argument extraction quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric would be stronger.
GGUF quantization for CPU inference: The current serving path requires GPU. Converting to GGUF and using llama.cpp would enable CPU inference at ~1-2s latency — viable for production on free-tier hosting.

The Story

This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run SFTTrainer. The story is:

I built an agent with hand-crafted routing
I measured where it failed (75% accuracy, constant regex regressions)
I generated training data using teacher distillation from my own pipeline
I trained and compared models with systematic ablation studies
I proved it works with quantitative evaluation (86.2% accuracy)
I integrated it as a production-ready feature flag

That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.