LocalFind Gemma — AI-Powered Semantic Search and Chat for Your Local Files

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

LocalFind Gemma is a fully local, privacy-first semantic search engine for your own files — documents, images, and audio — powered by Gemma 4 running on Ollama.

Most search tools match filenames or keywords. LocalFind Gemma understands content:

Images indexed by what's in them — Gemma 4 captions every image at sync time so you can search "whiteboard with the system architecture diagram" or "receipt from the coffee shop" and actually find it.
Agent that reads images to answer questions — ask "how much does that invoice say?" and the agent finds the image, sends it to Gemma 4 vision, and gives you the number, not a file path.
Audio fully searchable — Whisper transcribes recordings at index time so you can search across hours of meetings by what was said.
Cross-lingual search — the nomic-embed-text-v2-moe embedding model supports ~100 languages in a shared vector space. Search in French, find English documents.

Supported file types: PDF, DOCX, TXT, MD, CSV, JPG, PNG, GIF, BMP, WEBP, MP3, WAV, FLAC, M4A.

Everything — Gemma 4, Whisper, the ChromaDB vector store — runs on your machine. No API keys, no cloud, no data leaving your device. There's also an optional Claude Desktop integration via MCP for files you're comfortable sharing with a third party.

Demo

Code

https://github.com/maliklovable1-spec/localfind-gemma

How I Used Gemma 4

Gemma 4 isn't just the chat model here — it's active at three distinct points in the pipeline:

1. Index time: captioning every image
When you sync a folder, each image is sent to Gemma 4 via Ollama's vision API. The caption is embedded and stored permanently in ChromaDB. Future searches use the stored caption; the model isn't called again unless you re-sync. This means fast search without repeated inference.

2. Agent reasoning and tool use
The conversational agent runs on gemma4:e4b (the recommended default). It decides when to search, what query to issue, and how to synthesise results into a direct answer rather than just returning file paths.

I chose e4b over e2b because it follows tool-use instructions more reliably — which matters a lot in an agentic loop where the model needs to decide between search, image reading, and response synthesis. e2b is also supported for users with less RAM (~12 GB vs 16 GB).

3. Live image reading
When the agent finds an image relevant to your question, it sends the image bytes directly to Ollama's native /api/chat API with your question as context. Gemma 4 reads the image and the agent uses that to answer you. The bytes go from your disk to your local Ollama process —nowhere else.

A note on audio
Gemma 4 E2B and E4B natively support audio transcription at the architecture level — multilingual, up to 30 seconds, built into the model. LocalFind Gemma currently uses Whisper for audio because
Ollama doesn't expose audio input via its API yet. Once Ollama ships that support
([issue #11798(https://github.com/ollama/ollama/issues/11798)), the transcription backend can
switch to Gemma 4 — the architecture is already designed with that transition in mind, though it will require some code changes depending on how Ollama exposes the audio API.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.