Ndimofor Aretas

Posted on May 24

Building an Air-Gapped NotebookLM Alternative via Gemma 4 E4B, FAISS, and Strands Agents

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

_This is a submission for the Gemma 4 Challenge: Build with Gemma 4_

What I Built

The Privacy Imperative: Why I Built CogniVault

As an IT trainer based in Germany helping career-changers pivot into the technology sector, a significant portion of my daily work is subject to strict non-disclosure agreements (NDAs), regional privacy mandates such as the EU General Data Protection Regulation (GDPR/DSGVO), and evolving legal frameworks like the EU AI Act. Whether handling proprietary corporate training material, confidential client curricula, or sensitive student records, data security is an absolute prerequisite.

When mainstream cloud-based AI tools go through a corporate compliance audit, they hit an unyielding barrier: cloud dependencies represent an unacceptable data leak surface. For a legal compliance officer, "zero data retention" policies are insufficient; the core problem is that data leaves the hardware boundary at all. Furthermore, mandatory cloud registrations, per-token API costs, and restrictive usage quotas severely limit the continuous, exploratory workflow required for deep academic and professional study.

Gemma CogniVault is my solution to this challenge. It is a 100% local, privacy-first AI study companion engineered to turn confidential documents into structured learning resources entirely on-device. By running localized inference and embedding models on existing workstation hardware, CogniVault delivers intelligent chat interfaces, multi-lesson workshops, visual mindmaps, customized quizzes, and interactive flashcard decks without a single byte of data ever escaping the machine.

CogniVault is built explicitly for professionals and organizations operating in highly regulated spaces:

Educators and Corporate Trainers who require an intelligent didactic assistant that honors ironclad institutional confidentiality.
Academic Researchers processing sensitive grant proposals, proprietary datasets, or pre-publication manuscripts.
Healthcare, Finance, and Legal Practitioners bound by regulatory frameworks (such as HIPAA, FINRA, or GDPR) who need advanced text and document analytics on isolated hardware.

System Architecture & Capabilities

CogniVault bridges the gap between secure, local data storage and advanced generative AI capabilities through the following core architectures:

Local Hybrid Retrieval-Augmented Generation (RAG): The platform ingests eight foundational document types: PDF (with native OCR fallback via Tesseract), DOCX, PPTX, XLSX, Markdown, CSV, TXT, and HTML. Files are parsed using structure-aware chunkers (e.g., Markdown retains header breadcrumbs; CSV chunks repeat the initial header row), embedded as dense 768-dimensional vectors via embeddinggemma, and indexed using an in-memory FAISS IndexFlatIP coupled with a sparse BM25 retrieval layer. The dense and sparse rankings are merged with Reciprocal Rank Fusion (RRF) to return the top 7 chunks, with matching completing in under a millisecond.
Two-Phase Streaming with Auditable Chain-of-Thought: To build trust in high-stakes compliance environments, chat responses execute in two discrete phases. Phase 1 queries Gemma 4 with thinking: True enabled, streaming the model's raw reasoning steps directly into a collapsible frontend panel. Phase 2 utilizes a localized Strands Agent loop to evaluate relevant context chunks, generating a factual final answer pinned with interactive inline source citations.
Autonomous Strands Agent Loop: Gemma 4 E4B is integrated into the Strands Agents SDK and acts as a central decision engine equipped with 6 local tools: search_knowledge_base, list_documents, analyze_document, compare_documents, calculator, and current_time. The model evaluates incoming requests, determines optimal execution order, and solves complex analytical queries entirely on-device without hardcoded routing rules.
Grammar-Constrained Study Generation: The Study Hub generates interactive materials (quizzes, multi-lesson workshops, 3D flashcards, and radial mindmaps) using Ollama's grammar-constrained format="json" parameter. Token generation is restricted at the runtime level to match the target schema, yielding valid JSON output that avoids typical parsing errors.
Crash-Safe Ingestion & Content Hashing: Ingestion operations run as durable, checkpointed workflows powered by a DBOS backend. If a device power cycle occurs mid-ingestion, processing resumes from the last successfully indexed batch. Additionally, SHA-256 content hashing automatically tracks file variations on re-upload, soft-deleting obsolete chunks while seamlessly re-indexing modified text.
Localized Analytics & Progress Tracking: Learning sessions and metrics are monitored completely offline using a standalone SQLite layer (progress.db). The dashboard tracks total study durations, active streaks, and automatically grants 25 distinct achievement badges across study milestones, presenting data visually via a GitHub-style 90-day activity heatmap.
Secure Document Exports: Generated quizzes, mindmaps, and workshop materials can be natively exported as Markdown, high-resolution PNGs, or printable PDFs. The system leverages the browser's native File System Access API to trigger authentic "Save As" OS dialog boxes entirely inside the client environment.
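The hybrid retrieval step described in the first item above can be sketched roughly as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the code in vector_db.py: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions, while top_n=7 mirrors the top-7 figure quoted above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=7):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical result lists: dense (FAISS) order and sparse (BM25) order.
dense_hits = ["c3", "c1", "c7", "c2"]
sparse_hits = ["c1", "c9", "c3", "c4"]
fused_ids = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that rank well in both lists (here c1 and c3) rise to the top, which is why RRF needs no score normalization between cosine similarities and BM25 scores.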

Demo

Watch the full architectural walkthrough and deep-dive on YouTube: [https://youtu.be/rw5Rse_TP2o?si=4rIh8staIPwTiuyc]

Code

Full source repository: [https://github.com/ndimoforaretas/local-gemma-rag]

Key modules and implementations to explore:

backend/services/rag_agent.py: Implements the two-phase asynchronous stream (raw reasoning extraction followed by Strands agent orchestration) alongside multi-session history encapsulation.
backend/services/vector_db.py: Coordinates the dense FAISS and sparse BM25 retrieval algorithms, executing Reciprocal Rank Fusion (RRF) and post-fusion scope metadata filtering.
backend/services/ingest.py: Houses the DBOS-checkpointed extraction routines for all 8 supported document formats and coordinates SHA-256 identity checks.
backend/services/quiz_generator.py (alongside workshop_, flashcard_, and mindmap_generator.py): Manages the structural formatting parameters, zero-comma JSON repair processes, and model retry fallbacks.
frontend/src/components/study/: Contains the React orchestrators, SVG canvas rendering engines, and custom animation definitions for the client-side user experience.
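To make the SHA-256 identity check mentioned for ingest.py more concrete, here is a small sketch of hash-based change detection with soft deletion. It is an assumption-laden illustration rather than the repository's code: the in-memory chunk table, its field names, the handle_upload function, and the blank-line chunking are all hypothetical stand-ins.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def handle_upload(chunks: list, filename: str, data: bytes) -> str:
    """Decide what an upload requires: 'indexed', 'unchanged', or 'reindexed'.

    `chunks` is a hypothetical in-memory chunk table; each row is a dict
    with filename, content_hash, text, and deleted keys.
    """
    new_hash = sha256_hex(data)
    live = [c for c in chunks if c["filename"] == filename and not c["deleted"]]
    if live and all(c["content_hash"] == new_hash for c in live):
        return "unchanged"  # identical bytes: nothing to re-embed
    for row in live:
        row["deleted"] = True  # soft delete: keep the row, hide it from search
    # Naive blank-line chunking, used here only to keep the sketch short.
    for part in data.decode("utf-8").split("\n\n"):
        if part.strip():
            chunks.append({"filename": filename, "content_hash": new_hash,
                           "text": part.strip(), "deleted": False})
    return "reindexed" if live else "indexed"


table = []
first = handle_upload(table, "notes.txt", b"alpha\n\nbeta")
second = handle_upload(table, "notes.txt", b"alpha\n\nbeta")
third = handle_upload(table, "notes.txt", b"alpha\n\ngamma")
```

Hashing the file bytes rather than comparing filenames or timestamps means a renamed but identical file could also be detected, if the table were keyed by hash instead.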

How I Used Gemma 4

Primary Inference Model: gemma4:e4b (4.5B effective parameters, 8B when embedding parameters are counted; 128K context window; served locally via Ollama).

Vector Embedding Model: embeddinggemma (768-dimensional dense vector embeddings).

Model Selection Rationale

Choosing the right runtime model required balancing raw analytical reasoning with the hardware profiles of standard developer and trainee laptops. Testing determined that the E4B variant offered the ideal intersection of performance and local feasibility.

| Feature / Benchmark | Gemma 4 E2B | Gemma 4 E4B (Chosen) | Gemma 4 31B Dense |
| --- | --- | --- | --- |
| Effective Parameter Weight | 2.3B (5.1B embedded) | 4.5B (8B embedded) | 31B Dense |
| Native Context Length | 128K | 128K | 256K |
| Quantized VRAM/RAM Target (GGUF) | ~2.5 GB | ~5.1 GB | ~20+ GB |
| Standard 16 GB Workstation Footprint | Low resource utilization | Optimal balance; ample memory headroom | Exceeds system limits; heavy thrashing |
| MMLU Pro (Reasoning Capability) | 60.0% | 69.4% | 85.2% |
| GPQA Diamond (Advanced Science) | 43.4% | 58.6% | 84.3% |
| MMMU Pro (Visual Analytics) | 44.2% | 52.6% | 76.9% |
| MMMLU (Multilingual Evaluation) | 67.4% | 76.6% | 88.4% |
| OmniDocBench 1.5 (lower is better) | 0.290 | 0.181 | 0.131 |

The choice of Gemma 4 E4B was driven by three architectural necessities:

Host Memory Headroom: At a quantized GGUF profile requiring roughly 5 GB of system memory, E4B loads comfortably alongside standard system processes, the FastAPI runtime, localized database containers, and browser development instances on a standard 16 GB laptop.
Context Density and Retention: The 128K token boundary easily accommodates extensive system prompts, specialized JSON schemas, recursive tool definitions, multi-session conversation text, and 15 highly detailed RAG text chunks without requiring aggressive context truncation techniques.
Multi-Lingual and Visual Consistency: A 76.6% score on the MMMLU benchmark suggests that E4B handles technical material in German and English with comparable analytical fidelity. Its OmniDocBench score (0.181, where lower is better) supports stable text extraction from complex whiteboard diagrams and smartphone photos uploaded directly into the chat interface.

End-to-End Implementation Flow

[Confidential Document] ──► Ingestion Pipeline (DBOS & SHA-256) ──► Chunking Strategy
                                                                         │
    ┌────────────────────────────────────────────────────────────────────┘
    ▼
Ollama Embeddings (embeddinggemma) ──► Local Hybrid Database (FAISS + BM25)
                                                                         │
    ┌────────────────────────────────────────────────────────────────────┘
    ▼
User Query ──► Phase 1: Local Asynchronous Streaming (gemma4:e4b - thinking=true)
    │
    └────────► Phase 2: Strands Agent Loop Evaluation ──► Deterministic GenUI Layout

Embedding Matrix Generation: Context blocks are vectorized via ollama.embed using embeddinggemma in automated batches of 5, speeding up local processing pipelines by avoiding single-item API round trips.
Phase 1 Reasoning Capture: Each prompt first passes a context-routing validation layer; an asynchronous stream is then opened using ollama.AsyncClient.chat with options={"thinking": True}. The raw reasoning tokens are written directly to a dedicated UI panel.
Phase 2 Strands Execution: When execution transitions to the tool matrix, a dedicated strands.Agent initializes with an explicit, system-level operational block, suppressing further <think> output tags to maintain clean Markdown delivery in the primary workspace UI.
Grammar-Restricted Output Generation: Structured layouts are generated through Ollama's native grammar engine. Data layout requests use lower temperatures (0.3 to 0.4) to encourage strict adherence to the data schemas, while rich long-form text uses a standard 0.5 setting.
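The batching described in the first step above can be illustrated with a short sketch. It assumes the embedding call is injected as a plain function so the example runs without Ollama; in the real pipeline that function would presumably wrap ollama.embed with embeddinggemma. The names batched, embed_all, and fake_embed are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(chunks: Sequence[str],
              embed_batch: Callable[[Sequence[str]], List[List[float]]],
              batch_size: int = 5) -> List[List[float]]:
    """Embed every chunk using one call per batch instead of one per chunk."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder that records batch sizes; a real one would call the model.
calls: List[int] = []


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_all([f"chunk {i}" for i in range(12)], fake_embed)
```

With 12 chunks and a batch size of 5, the embedder is called three times instead of twelve, which is where the round-trip saving comes from.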

Engineering Insights & Takeaways

Architecturally, grammar-constrained decoding through Ollama's format="json" parameter changes how local multi-tasking modules behave with Gemma 4. Older edge-compute models frequently produced erratic token fragments or broke syntax constraints during long generations, forcing developers to maintain fragile regex filters or expansive string recovery logic. By applying grammar constraints natively at the token-generation level, CogniVault turns local structured output from a brittle experimental feature into a dependable one.
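Constrained decoding handles syntax, but a generator can still check the shape of the payload and retry when it is off, in the spirit of the retry fallbacks listed for the generator modules. The sketch below shows one way to do that; the quiz schema, the function names, and the stubbed model replies are assumptions for illustration, not CogniVault's actual implementation.

```python
import json
from typing import Callable

# Hypothetical minimal schema for a single quiz question.
REQUIRED_KEYS = {"question", "options", "answer"}


def parse_quiz_item(raw: str) -> dict:
    """Parse model output and verify it has the expected quiz structure."""
    item = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(item, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(item["options"], list):
        raise ValueError("options must be a list")
    if item["answer"] not in item["options"]:
        raise ValueError("answer must be one of the options")
    return item


def generate_with_retry(generate: Callable[[], str], attempts: int = 3) -> dict:
    """Call `generate` until its output validates or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_quiz_item(generate())
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed, then retry
    raise RuntimeError(f"no valid output after {attempts} attempts") from last_error


# Stubbed model replies: the first is valid JSON but incomplete.
replies = iter([
    '{"question": "2+2?"}',
    '{"question": "2+2?", "options": ["3", "4"], "answer": "4"}',
])
quiz_item = generate_with_retry(lambda: next(replies))
```

Keeping the validation separate from the generation call makes it easy to swap the stub for a real model client without touching the checks.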

Honest Design Compromises

Absolute Edge-Execution Boundary: CogniVault purposefully eliminates cloud network fallbacks. The absolute requirement of a ~10 GB local model initialization is the unavoidable trade-off for a system designed to guarantee complete data privacy.
Interim Audio Ingestion Strategy: While Gemma 4 E4B possesses an integrated audio encoder, local speech transcription is handled by a separate Whisper container in the current version. Migrating this directly into the native Gemma pipeline remains a priority on the development roadmap.

What This Unlocks For The User

Until now, professionals working in highly regulated fields were effectively locked out of the generative AI landscape due to strict compliance standards and data security risks. The massive gap between cloud-based model performance and what could be safely deployed in an air-gapped corporate environment created a major barrier to adoption.

Gemma 4 successfully eliminates that barrier.

Corporate Training Instructors can systematically convert highly proprietary curricula into interactive tests without triggering network security audits.
Students and Trainees can safely query proprietary or confidential material without violating institutional privacy policies.
Academic and Scientific Researchers can analyze draft manuscripts or sensitive research data completely free from data-leak concerns.

By running advanced reasoning, multi-tool agent logic, native vision ingestion, and grammar-precise structural parsing on local hardware, CogniVault provides a secure, high-performance solution for the privacy-conscious professional.

_CogniVault is engineered to operate completely independent of external infrastructure, leveraging FastAPI, FAISS, DBOS, SQLite, and the Strands Agents SDK to deliver secure, local AI applications.

Connect with me on LinkedIn to track ongoing development updates._
