
Muhammad Dafi

Active Page: Tackling Local AI for Transforming Passive Reading into Active Recall

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Most readers suffer from the "forgetting curve." By the time we finish the later chapters of a dense book, the foundational concepts from the introduction have already begun to blur.

As a middle school student trying to learn new things from books and scientific journal articles, I wanted a better way to retain knowledge. My inspiration came from observing National Science Olympiad winners, including a friend of mine and other figures, who maintain peak retention not through passive rereading but by answering many questions every day.

Active Page is a local-first application that transforms passive reading into an interactive learning experience. It automatically generates high-quality, analytical, and contextual quizzes directly from your reading material for immediate memory reinforcement. To help users build a sustainable learning habit, Active Page also features a built-in streak mechanics system to keep readers motivated daily. 🔥🔥

Because Active Page runs locally, its operational cost is zero (aside from using the device itself), with the side benefit that you can read books without an internet connection. While local compute constraints often drive developers toward over-engineering, Active Page takes a more elegant path.

Demo

demo video

Code

wsad-wsad / ActivePage

recall quiz on book

📖 Active Page: Tackling Local AI for the Best Active Recall Experiences

Active Page is a privacy-first, local-LLM-powered reading companion designed to solve the "forgetting curve." By leveraging the cutting-edge Gemma 4 E2B model, it transforms passive reading into an interactive learning session through real-time, contextual active recall—running entirely on your machine.


⚡ Quick Start

1. Prerequisites

2. Initialization

The init.sh script automates the heavy lifting: it manages dependencies via uv, compiles llama.cpp for your specific hardware, and pulls the optimized Gemma 4 E2B weights.

bash init.sh

Note for Apple Silicon/AMD: If using Apple M-Series or AMD GPUs, edit init.sh to enable GGML_METAL=ON or GGML_HIPBLAS=ON respectively for hardware acceleration.

3. Launch

Launch the inference engine and the interactive web interface simultaneously:

bash run.sh

Access the application at: http://localhost:8000

🔧 Troubleshooting

System Crashing / Out of Memory in init.sh: If your RAM or CPU is limited, adjust the parallelism of the build…

How I Used Gemma 4

Why Gemma 4 E2B

I selected the Gemma-4-E2B model because it balances performance and efficiency well for local deployment. It uses Per-Layer Embeddings (PLE) and a hybrid attention mechanism that combines Sliding Window Attention (SWA) with Grouped Query Attention (GQA). This architecture gives it a 128K context window and output quality that rivals much larger models, while staying lightweight and fast enough for edge devices.

Beyond simply powering the app, Gemma-4-E2B's design unlocked sophisticated long-context capabilities on-device. Its compact size leaves room for aggressive use and manipulation of the KV cache, which is essential for keeping the reading experience seamless and responsive while running active recall across extended contexts.

Strategic KV Cache Management

The "memory" of an AI (KV Cache) is usually treated as a linear path. In most apps, the book data is treated as a fresh prompt every time, which is slow and memory-intensive.

Normal KV cache usage

  1. System Prompt (Cached): The cached prefix, but it differs from task to task.
  2. Data (Not Cached): Even though the book data is far larger than the system prompt, it has to be recomputed every time the system prompt changes.

I inverted this structure to maximize Prefix Caching:
Improved KV cache usage

  1. System instruction (Cached): Core instructions on how to analyze text.
  2. Data (Append-Only Cache): As the user reads, the book data is appended to the cache. Because the cache is only ever appended to, the model "remembers" previous chapters without re-processing them.
  3. Specific Instruction (Dynamic): Task-specific instructions (e.g., "Generate a quiz" or "Explain this concept") are swapped in at the very end.
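
The two layouts can be compared with a small sketch. It is not the code from the repository: the function names, the `[page]` and `[task]` markers, and the instruction text are made up for illustration, and character counts stand in for tokens. It only shows how much of a previous prompt stays byte-identical, which is the part a prefix cache can reuse.

```python
from os.path import commonprefix

# Hypothetical fixed instruction; the real app's wording is not shown in this post.
SYSTEM_INSTRUCTION = "You are a reading tutor. Analyze the book text that follows.\n"


def book_block(pages_read):
    """Book text in reading order. Pages are only ever appended."""
    return "".join(f"[page]\n{page}\n" for page in pages_read)


def inverted_prompt(pages_read, task):
    """Stable instruction first, append-only book text next, task last."""
    return SYSTEM_INSTRUCTION + book_block(pages_read) + f"[task]\n{task}\n"


def task_first_prompt(pages_read, task):
    """Contrast layout: a task-specific prompt in front of the book text."""
    return f"[task]\n{task}\n" + book_block(pages_read)


def reusable_prefix_len(previous_prompt, new_prompt):
    """Length of the shared prefix (characters here, tokens in a real engine)."""
    return len(commonprefix([previous_prompt, new_prompt]))


pages = ["Chapter 1: cells are the basic unit of life.",
         "Chapter 2: mitochondria produce most of the cell's ATP."]

quiz = inverted_prompt(pages[:1], "Generate a quiz")
explain = inverted_prompt(pages[:1], "Explain this concept")
next_page = inverted_prompt(pages[:2], "Generate a quiz")

stable_part = len(SYSTEM_INSTRUCTION + book_block(pages[:1]))
print("stable part:", stable_part)
print("reused when the task changes:", reusable_prefix_len(quiz, explain))
print("reused when a page is appended:", reusable_prefix_len(quiz, next_page))
print("reused with task-first layout:",
      reusable_prefix_len(task_first_prompt(pages[:1], "Generate a quiz"),
                          task_first_prompt(pages[:1], "Explain this concept")))
```

With the inverted layout, changing the task or appending a page leaves the instruction and all earlier pages untouched at the front of the prompt, so only the new tail needs to be processed. With the task-first layout, the shared prefix ends as soon as the task text differs.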

Decode and memory: MTP & TurboQuant

To tackle the memory constraints and decode speed, I used two techniques, both of which also come from Google.

  • Multi-Token Prediction (MTP): I used the Gemma 4 E2B assistant drafter model to enable speculative decoding. This resulted in a 40% increase in output tokens per second, making the AI feel like a real-time conversation partner.
  • TurboQuant Compression: By applying TurboQuant to the KV cache, I reduced the memory footprint of a full 128K-token context from 800MB to just 200MB. This 4x reduction allows the local application to sit quietly in the background without causing system-wide lag.
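
The draft-and-verify idea behind speculative decoding can be illustrated with a toy sketch. This is not how llama.cpp or Gemma implement it: `target_next` and `draft_next` are made-up deterministic functions standing in for the main model and the drafter, decoding is greedy, and the usual "bonus" token after a fully accepted draft is left out to keep the loop short.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model's greedy next token."""
    return (ctx[-1] * 3 + 1) % 17


def draft_next(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the target except after multiples of 5."""
    guess = (ctx[-1] * 3 + 1) % 17
    return guess if ctx[-1] % 5 else (guess + 1) % 17


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. The cheap drafter proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real engine scores all
        #    proposed positions in one batched forward pass.
        verify_passes += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out, verify_passes


baseline = greedy_decode(target_next, [1], 12)
spec_out, passes = speculative_decode(target_next, draft_next, [1], 12)
print("same output as target-only decoding:", spec_out == baseline)
print("verification passes:", passes, "for 12 new tokens")
```

Because every kept token is the one the target model would have produced anyway, the output is unchanged. Any speedup comes from the target verifying several drafted tokens per pass instead of generating one token per pass.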

Tackling Latency Using Asynchronous Pre-Fetching

Even with an optimized KV cache, generating a multiple-choice question (MCQ) quiz takes a short processing window. Forcing a reader to wait at a loading spinner when a quiz triggers would break their reading immersion.

Active Page hides local generation latency from the reader by decoupling the generation engine from the UI through an Asynchronous Pre-Fetching Pipeline:
prefetch strategy

  • The Read-Ahead Engine: While the user is stationary and reading the current page, a background inference engine fills the KV cache with that page.
  • Zero-Delay Quiz: The inference engine generates quizzes and buffers them in the quiz queue.
  • The UX Result: Whether a random checkpoint quiz triggers automatically or the user clicks the manual "Quiz Me" button, the application pulls directly from the pre-fetched queue. The quiz appears instantly, faster than a cloud-hosted model could respond, because it is simply fetched from the quiz queue.
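
A minimal sketch of such a pipeline, using Python's asyncio, might look like the following. The names `QuizPrefetcher`, `generate_quiz`, `on_page_opened`, and `pop_ready_quiz` are hypothetical and not taken from the repository, and the inference call is replaced by a short sleep.

```python
import asyncio
from typing import Optional


async def generate_quiz(page_text: str) -> dict:
    """Hypothetical stand-in for a call to the local inference engine."""
    await asyncio.sleep(0.01)  # pretend the model needs time to generate
    return {"question": f"What is the main idea of: {page_text[:30]!r}?"}


class QuizPrefetcher:
    """Generates quizzes in the background and buffers them in a queue."""

    def __init__(self, max_buffered: int = 3) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
        self._tasks: set = set()

    def on_page_opened(self, page_text: str) -> None:
        """Called when the reader lands on a page. Returns immediately."""
        task = asyncio.create_task(self._fill(page_text))
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)

    async def _fill(self, page_text: str) -> None:
        quiz = await generate_quiz(page_text)
        await self.queue.put(quiz)

    def pop_ready_quiz(self) -> Optional[dict]:
        """Called by the UI. Never waits on the model."""
        try:
            return self.queue.get_nowait()
        except asyncio.QueueEmpty:
            return None


async def demo() -> Optional[dict]:
    prefetcher = QuizPrefetcher()
    prefetcher.on_page_opened("Photosynthesis converts light into chemical energy.")
    await asyncio.sleep(0.2)  # the reader is still reading the page
    return prefetcher.pop_ready_quiz()  # e.g. the "Quiz Me" button was clicked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The UI path only ever touches the queue, so the time to show a quiz does not depend on how long generation takes, as long as the reader spends longer on the page than the model needs. If the queue is still empty, `pop_ready_quiz` returns `None` and the caller can decide whether to wait or skip the checkpoint.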
