This is a submission for the Gemma 4 Challenge: Build with Gemma 4
Imagine receiving an official envelope and not being able to read it. Not because you're careless, but rather because English isn't your first language. Worse yet, the letter is written in bureaucratic legalese that would confuse many native speakers, too. Is it urgent? Do you need to respond? Who do you call?
That's the problem Legible tries to solve.
What it does
Legible lets you photograph or upload any official document: a tax notice, school letter, eviction warning, lease, utility bill, or government letter. It returns:
- A plain-language explanation in your own language
- A deadline countdown ("You have 23 days to respond")
- Numbered next steps — concrete, specific actions to take
- A glossary of legal or bureaucratic terms found in the document, with simple definitions
- An encrypted local history of past scans, so you can refer back without re-uploading

Eleven languages are supported: Spanish, Chinese (Simplified), Tagalog, Vietnamese, Arabic, Hindi, Korean, French, Portuguese, Russian, and English.
The model choice: why Gemma 4
This is a Build With Gemma 4 submission, and I want to be specific about why Gemma 4 — running locally via Ollama — is the right model here, not just a convenient one.
I ran gemma4:latest, which Ollama resolved to the Gemma 4 Effective 4B (E4B) instruction-tuned model. The privacy and offline arguments below apply equally to any Gemma 4 variant served locally. Still, document parsing with structured XML output and multilingual generation is exactly the kind of task where the extra capacity of E4B shows. The key point is that nothing leaves the machine.
Privacy is the core value proposition. The documents this app handles contain some of the most sensitive information a person owns: Social Security numbers, tax IDs, case numbers, home addresses, and medical details. These are exactly the fields that get scraped, leaked, or sold when you upload documents to a cloud API.
Gemma 4 runs entirely on the user's machine via Ollama. The image never leaves the device. There's no API key, no usage logs, and no third-party server that ever sees a user's tax notice. It's not just a privacy policy; it's a privacy architecture.
Multimodal input is essential. Real documents aren't clean PDFs. They're photos taken at an angle under fluorescent lighting, or scans of crumpled letters that have been in someone's bag. Gemma 4's native image understanding handles this skillfully. The model reads the document directly from the photo rather than depending on a separate OCR pipeline that might fail on non-Latin scripts.
Offline capability matters for this audience. Immigrant communities often rely on metered mobile data or shared Wi-Fi. The model downloads once, then runs indefinitely with no internet connection.
How it works
The backend is a FastAPI app that proxies a streaming request to Ollama's local API. The frontend is a single HTML file — no build step, no framework, no dependencies to install beyond Python. Open the browser at localhost:8000 and it works.
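The proxy pattern fits in a few lines. Here is a minimal sketch of it, not the project's exact code: the `/analyze` endpoint name and prompt text are illustrative, but Ollama's local `/api/generate` endpoint with a base64 `images` field is its documented multimodal interface.

```python
# Sketch: FastAPI receives an uploaded image, forwards it to Ollama's
# local API, and relays the newline-delimited JSON token stream back.
import base64
import httpx
from fastapi import FastAPI, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local API

@app.post("/analyze")
async def analyze(image: UploadFile):
    payload = {
        "model": "gemma4:latest",
        "prompt": "Explain this document.",  # the real system prompt is much longer
        "images": [base64.b64encode(await image.read()).decode()],
        "stream": True,
    }

    async def relay():
        # Stream Ollama's response chunks straight through to the browser.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", OLLAMA_URL, json=payload) as resp:
                async for line in resp.aiter_lines():
                    yield line + "\n"

    return StreamingResponse(relay(), media_type="application/x-ndjson")
```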
The system prompt
Getting structured output from a vision model reliably is mostly a prompting problem. I landed on asking Gemma 4 to respond in a fixed XML schema along these lines (the full prompt is in the repo):

```xml
<doc_type>Tax Notice | Lease Agreement | School Letter | ...</doc_type>
<deadline>YYYY-MM-DD or none</deadline>
<days_remaining>integer or none</days_remaining>
<summary>Plain-language summary in {target_language}</summary>
<next_steps>1. action\n2. action...</next_steps>
<glossary>English Term | Simple explanation in {target_language}</glossary>
```
Injecting today's date into the prompt and asking the model to calculate days remaining directly (rather than doing it in code) turned out to be more reliable than parsing a date string and computing the delta separately. Gemma 4 handles this arithmetic accurately.
The full system prompt is pinned to the user's chosen language so that the glossary definitions, explanation, and next steps all arrive in one language, no mixing.
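As a sketch, the prompt assembly might look like this. The `build_system_prompt` helper and its exact wording are my illustration, not the shipped prompt, and the tag names mirror the schema shown above:

```python
# Sketch: inject today's date and the user's chosen language into the
# system prompt, and ask the model to do the deadline arithmetic itself.
from datetime import date

def build_system_prompt(target_language: str) -> str:
    today = date.today().isoformat()
    return (
        f"Today's date is {today}. Read the document in the image.\n"
        f"Respond ONLY in the XML schema below. Write all free text in "
        f"{target_language}.\n"
        f"If a field does not apply, output the literal word 'none'.\n"
        f"For <days_remaining>, compute the number of days from {today} "
        f"to the deadline yourself."
    )
```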
Encrypted history
Past scans are stored as Fernet-encrypted files in .history/records/. The encryption key lives in .history/key.bin and is generated fresh on first run. Both paths are .gitignored. Deleting the .history/ folder clears everything.
Each record stores a JPEG thumbnail of the document alongside the parsed results, so the history panel shows what the document looked like without re-processing it.
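A minimal sketch of that storage scheme, using the paths named above; the record layout and helper names are illustrative:

```python
# Sketch: Fernet-encrypted records in .history/records/, with the key
# generated on first run in .history/key.bin (both paths .gitignored).
import base64
import json
from pathlib import Path
from cryptography.fernet import Fernet

KEY_PATH = Path(".history/key.bin")
RECORDS_DIR = Path(".history/records")

def load_or_create_key() -> Fernet:
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    if not KEY_PATH.exists():
        KEY_PATH.write_bytes(Fernet.generate_key())
    return Fernet(KEY_PATH.read_bytes())

def save_record(record_id: str, parsed: dict, thumbnail_jpeg: bytes) -> None:
    # Store the parsed results and a JPEG thumbnail in one encrypted blob.
    fernet = load_or_create_key()
    blob = json.dumps({
        "parsed": parsed,
        "thumbnail_b64": base64.b64encode(thumbnail_jpeg).decode(),
    }).encode()
    (RECORDS_DIR / f"{record_id}.bin").write_bytes(fernet.encrypt(blob))

def load_record(record_id: str) -> dict:
    fernet = load_or_create_key()
    return json.loads(fernet.decrypt((RECORDS_DIR / f"{record_id}.bin").read_bytes()))
```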
Stack
- Backend: FastAPI (Python)
- Frontend: a single HTML file, no build step
- Model: Gemma 4 (E4B) served locally via Ollama
- Storage: Fernet-encrypted files on disk
Running it yourself
Prerequisites: Ollama installed and running.
Pull the model (one-time, ~3 GB):

```bash
ollama pull gemma4:latest
```

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

Start the app:

```bash
uvicorn main:app --reload
```

Open http://localhost:8000 and that's it.
Environment variables can override the defaults; the full list is in the repository's setup instructions.
What I learned
Local multimodal inference has crossed a usability threshold. A year ago, running a vision model locally meant wrestling with quantization, drivers, and memory issues. With Ollama and Gemma 4, `ollama pull gemma4:latest` and `uvicorn main:app` are the entire setup. That simplicity matters enormously for a tool meant to be shared with non-technical communities.
Structured output from vision models is still a prompt engineering problem. Gemma 4 followed the XML schema reliably once I made the format explicit and gave it examples of what "none" should look like for optional fields. Before that, it occasionally invented its own tags or wrapped the XML in markdown code fences — easy to handle in the parser, but cleaner to prevent at the prompt level.
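A sketch of that defensive parsing: strip any markdown fences the model wraps around its answer, then pull fields out with a forgiving pattern match. The `extract_field` helper is my illustration, not the project's actual parser.

```python
# Sketch: tolerate stray ``` fences around the model's XML, and treat
# the literal string "none" as a missing optional field.
import re

def extract_field(raw: str, tag: str) -> str | None:
    # Remove ```xml ... ``` fences if the model added them.
    cleaned = re.sub(r"```[a-zA-Z]*\n?", "", raw).strip()
    match = re.search(rf"<{tag}>(.*?)</{tag}>", cleaned, re.DOTALL)
    if not match:
        return None
    value = match.group(1).strip()
    return None if value.lower() == "none" else value
```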
The privacy architecture is the product. For this use case, "runs locally" isn't a feature — it's the reason the tool is trustworthy enough to use with sensitive documents. That framing changed how I thought about the whole design.
What's next
- Mobile-optimised layout for direct phone camera capture
- Support for multi-page documents (PDF input)
- Offline-first PWA packaging so it can be installed like an app
- Optional audio read-aloud of the explanation for users with low literacy
Repo
Source code, setup instructions, and the full system prompt are in the repository: https://github.com/RealWorldApplications/legible. Questions welcome in the comments.
Built for the DEV × Google Gemma 4 Challenge — Build With Gemma 4 track.



