DEV Community

Andreas
A Deep Dive Into Page Sync

Page Sync is the feature in Earleaf where you photograph a page from your physical book and the app finds that position in the audiobook. It takes about two seconds. Everything runs on your phone. This post is about how it actually works under the hood.

The problem

You're reading a physical book at home. You get in the car and switch to the audiobook. Where were you?

You could scrub around trying to find the right spot. You could try to remember the chapter number and estimate. Or you could take a photo of the page you were on and let the app figure it out.

That last option sounds simple until you think about what it actually requires. You need to extract text from a photograph (OCR), extract text from audio (speech recognition), and then figure out where those two texts overlap. Both the OCR and the speech recognition will make mistakes. Different mistakes.

Two imperfect signals

Here's what makes Page Sync tricky. You're not matching clean text against clean text. You're matching the output of one ML model against the output of another, and both of them are wrong in different ways.

OCR mistakes are visual. It reads "rn" as "m", or "cl" as "d". It drops characters at the edge of the page. It picks up text from the other side of thin paper (bleed-through). It includes the page header and footer.

Speech recognition mistakes are phonetic. It hears "propper" instead of "proper". It mangles names, especially fantasy names. "Daenerys" comes out as something only vaguely recognizable. It can't tell the difference between "their", "there", and "they're", but that doesn't matter here because we're only matching word shapes, not meaning.

So the matching has to be fuzzy enough to tolerate errors from both sides, but precise enough to find the right spot in a 10+ hour audiobook.

Step one: transcribe the audiobook

Before Page Sync can work, the audiobook needs a transcription. Not a human transcription. An on-device one, generated by Vosk, an offline speech recognition engine. The Vosk model is about 40MB and downloads once.

Transcription runs as a background process. A 10-hour book takes roughly 4-7 hours to transcribe, depending on device speed. Vosk is the bottleneck, eating 40-60% of the processing time, with audio decoding and resampling splitting the rest.

The pipeline never loads the full audiobook into memory. It streams through three stages: MediaCodec decodes compressed audio into PCM one buffer at a time, a resampling step converts whatever sample rate the file uses (usually 44.1kHz) down to the 16kHz that Vosk expects, and Vosk ingests the resampled audio and spits out JSON with word-level timestamps.

Each word gets stored individually in a database with millisecond timestamps:

"the"    → 142560ms - 142710ms
"castle" → 142740ms - 143120ms

A 10-hour audiobook at roughly 120 words per minute produces about 72,000 of these entries. The text column is indexed in an FTS4 full-text search table. Total storage: about 5-6MB per book. Not nothing, but not a problem on modern phones.

For books with multiple files (one per chapter), a running time offset keeps all timestamps in absolute book time rather than chapter-relative time. If transcription gets interrupted (app killed, phone rebooted), it picks up where it left off by checking the last timestamp in the database.
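A minimal sketch of that running-offset bookkeeping (the types and names here are illustrative, not Earleaf's actual code):

```kotlin
// Each chapter file's word timestamps get shifted by the total duration of the
// files before it, so every stored timestamp is in absolute book time.

data class Word(val text: String, val startMs: Long, val endMs: Long)

fun toAbsoluteBookTime(
    fileDurationsMs: List<Long>,       // duration of each chapter file, in order
    wordsPerFile: List<List<Word>>     // chapter-relative transcriptions, same order
): List<Word> {
    var offsetMs = 0L
    val absolute = mutableListOf<Word>()
    for ((i, words) in wordsPerFile.withIndex()) {
        for (w in words) {
            absolute += w.copy(startMs = w.startMs + offsetMs, endMs = w.endMs + offsetMs)
        }
        offsetMs += fileDurationsMs[i]  // everything after this file shifts by its length
    }
    return absolute
}
```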

Step two: photograph a page

You point your camera at a page. ML Kit runs OCR and returns text blocks with bounding boxes. But before any matching happens, the raw OCR output needs cleaning.

Filtering out garbage

OCR picks up more than you want. The facing page bleeds through thin paper. The header says "Chapter 12 — The Return" on every page. The footer has a page number. None of this helps with matching and some of it actively hurts.

The filtering is heuristic. It finds the main text column by looking at the 5 largest text blocks, then throws out anything too far left or right of that column (bleed-through from the other page). For headers, it looks for an unusually large gap in the top 30% of the page — if there's a gap 2.5x the normal spacing between blocks, everything above it gets cut. Footers are simpler: short text in the bottom 10% of the image, especially if it contains a digit, gets removed.

It's conservative. Better to accidentally include a header (slightly noisier query) than to accidentally remove the first line of body text.
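Here's roughly what those heuristics look like in code. This is a simplified sketch, not the real implementation: the thresholds (bottom 10%, 2.5x gap, top 30%) come from the description above, but using the median gap as the "normal" spacing and a 10%-of-width column margin are my assumptions.

```kotlin
// An OCR block is just its text plus a bounding box in image coordinates.
data class Block(val text: String, val top: Int, val bottom: Int, val left: Int, val right: Int)

fun filterOcrBlocks(blocks: List<Block>, imageHeight: Int, imageWidth: Int): List<Block> {
    // 1. Footer: short text in the bottom 10% containing a digit (page numbers).
    var kept = blocks.filterNot { b ->
        b.top > imageHeight * 0.9 && b.text.length < 30 && b.text.any { it.isDigit() }
    }
    if (kept.isEmpty()) return kept

    // 2. Main column: estimated from the 5 largest blocks; anything far outside
    //    it is likely bleed-through from the facing page.
    val largest = kept.sortedByDescending { (it.bottom - it.top) * (it.right - it.left) }.take(5)
    val colLeft = largest.minOf { it.left }
    val colRight = largest.maxOf { it.right }
    val margin = imageWidth * 0.1
    kept = kept.filter { it.right > colLeft - margin && it.left < colRight + margin }

    // 3. Header: an unusually large vertical gap (2.5x the typical spacing) in the
    //    top 30% of the page; everything above the gap gets cut.
    val sorted = kept.sortedBy { it.top }
    val gaps = sorted.zipWithNext { a, b -> b.top - a.bottom }
    if (gaps.isNotEmpty()) {
        val median = gaps.sorted()[gaps.size / 2]
        for (i in gaps.indices) {
            if (sorted[i + 1].top < imageHeight * 0.3 && median > 0 && gaps[i] > 2.5 * median) {
                return sorted.drop(i + 1)
            }
        }
    }
    return sorted
}
```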

Building the query

The surviving text is normalized (lowercase, strip punctuation, collapse whitespace) and split into words. From those, up to 20 "query words" are selected: at least 4 characters long, not common stopwords like "the" or "and". Shorter and more common words are kept as fallbacks but ranked last.
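A sketch of that selection logic, with a tiny stand-in stopword list (the real list is larger):

```kotlin
// Illustrative subset; the real stopword list is much longer.
val STOPWORDS = setOf("the", "and", "that", "with", "this", "from", "have", "were", "was", "for")

fun buildQuery(pageText: String, maxWords: Int = 20): List<String> {
    val words = pageText.lowercase()
        .replace(Regex("[^a-z0-9\\s]"), " ")  // strip punctuation
        .split(Regex("\\s+"))                 // collapse whitespace
        .filter { it.isNotEmpty() }
    // Preferred query words: at least 4 characters and not a stopword.
    // Shorter and more common words are kept as fallbacks, ranked last.
    val (strong, weak) = words.distinct().partition { it.length >= 4 && it !in STOPWORDS }
    return (strong + weak).take(maxWords)
}
```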

Step three: find the position

OK, so now you have ~20 query words from the photograph and ~72,000 word entries in the transcription index. The task: find the right 30-second window in a 10+ hour book.

[Diagram: the Page Sync search pipeline. 72,000 word segments are narrowed to 200-500 FTS hits, then to 5-15 candidate windows, then fuzzy matched to produce the top results and a final seek position, all in 100-500ms.]

Phase one: cheap broad search

Each query word gets a prefix search against the FTS4 index:

WHERE fts.text MATCH 'castle*'

The prefix wildcard is important. "castle" matches "castles" and vice versa. It handles pluralization and partial OCR reads without needing a stemmer.

Typical numbers for a 10-hour book: 15 query words might produce 200-500 FTS hits across the entire transcription.

Phase two: time window grouping

All those hits are grouped into 30-second time windows. Each window is scored by how many distinct query words matched within it. A window where 8 different query words appear in the same 30 seconds is probably the right spot. A window with only 1 or 2 hits is probably a coincidence.

Windows with 4 or more distinct matching words survive. The rest are discarded. This usually gets you from hundreds of hits down to 5-15 candidate positions. The search space just shrank by orders of magnitude, and we haven't done anything expensive yet.
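The grouping step can be sketched like this, assuming each FTS hit carries the query word it matched and the timestamp of the transcription word it hit:

```kotlin
data class Hit(val queryWord: String, val timestampMs: Long)

fun candidateWindows(
    hits: List<Hit>,
    windowMs: Long = 30_000,
    minDistinctWords: Int = 4
): List<Long> {
    // Bucket hits into fixed 30-second windows and count *distinct* query words
    // per window; repeated matches of the same word don't add evidence.
    val distinctPerWindow = hits
        .groupBy { it.timestampMs / windowMs }
        .mapValues { (_, ws) -> ws.map { it.queryWord }.toSet().size }
    return distinctPerWindow
        .filter { it.value >= minDistinctWords }
        .entries
        .sortedByDescending { it.value }
        .map { it.key * windowMs }  // window start time in ms
}
```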

Phase three: fuzzy matching

Now the expensive part, but only on a handful of candidates. For each candidate position, the system loads the surrounding transcription segments and slides a window across them, scoring what fraction of the query words have a match with at least 70% Levenshtein similarity.

That 70% threshold means a 5-letter word tolerates 1 edit, and a 10-letter word tolerates 3 edits. This is where OCR's "rn"→"m" errors and Vosk's "propper"→"proper" errors get absorbed.
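For concreteness, here's that similarity measure as standard dynamic-programming Levenshtein, with the distance normalized by the longer word's length (the exact normalization Earleaf uses is my assumption):

```kotlin
// Classic DP edit distance: dp[i][j] is the distance between the first i chars
// of a and the first j chars of b.
fun levenshtein(a: String, b: String): Int {
    val dp = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) dp[i][0] = i
    for (j in 0..b.length) dp[0][j] = j
    for (i in 1..a.length) for (j in 1..b.length) {
        val cost = if (a[i - 1] == b[j - 1]) 0 else 1
        dp[i][j] = minOf(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    }
    return dp[a.length][b.length]
}

fun similarity(a: String, b: String): Double =
    if (a.isEmpty() && b.isEmpty()) 1.0
    else 1.0 - levenshtein(a, b).toDouble() / maxOf(a.length, b.length)

fun isMatch(a: String, b: String) = similarity(a, b) >= 0.7
```

So "propper" vs "proper" scores about 0.86 and matches, while two unrelated words fall well below the threshold.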

A minimum overall score of 0.5 is required — at least half the query words need to match. Results are deduplicated within 30-second windows, and the top 5 are returned sorted by score.

End-to-end, from finished OCR to results: typically 100-500ms. The FTS queries take under 1ms each. Almost all the time is in the fuzzy matching, and that's only running on 5-15 candidates instead of 72,000 words.

The resampling bug

For several days during development, Page Sync was landing about 30 seconds early. Consistently. And it got worse the further into a book you went.

The matching was working. It was finding the right text. But the timestamps attached to that text were slightly wrong, and the error accumulated over time.

The problem was in the resampling step. Vosk needs 16kHz audio. Audiobooks are usually 44.1kHz. The ratio is 16000/44100, which reduces to 160/441. A chunk only converts to a whole number of target samples when its length in source samples is divisible by 441, so in practice every chunk needs rounding.

The original code calculated target frames per chunk independently:

val targetFrames = (sourceFrames * ratio).roundToInt()

Each chunk introduces a rounding error of up to half a sample. At 16kHz, that's about 31 microseconds. And because MediaCodec delivers buffers of roughly constant size, the error has the same sign and similar magnitude chunk after chunk, so it accumulates steadily instead of cancelling out. Over a 12-hour audiobook with roughly 465,000 chunks, that systematic accumulation is how I ended up about 30 seconds off.

The fix was to track cumulative frames globally instead of rounding per-chunk:

var totalSourceFramesProcessed = 0L
var totalTargetFramesProduced = 0L

// Per chunk:
val newTotalSource = totalSourceFramesProcessed + sourceFrames
val expectedTotalTarget = round(newTotalSource * ratio).toLong()
val targetFramesThisChunk = (expectedTotalTarget - totalTargetFramesProduced).toInt()

totalSourceFramesProcessed = newTotalSource
totalTargetFramesProduced = expectedTotalTarget

Now the rounding happens once on the cumulative total. Any rounding error in one chunk is automatically compensated by the next. Maximum drift at any point in the file is bounded to one sample (about 63 microseconds at 16kHz), regardless of how long the book is.
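You can watch both behaviors in a toy simulation (the chunk size of 1152 frames and the chunk count are illustrative, not Earleaf's actual buffer size):

```kotlin
import kotlin.math.abs
import kotlin.math.round

// Returns the worst drift (in target frames) seen at any point in the stream for
// the buggy per-chunk rounding vs. the cumulative-rounding fix.
fun maxDriftFrames(chunks: Int, framesPerChunk: Int, ratio: Double): Pair<Double, Double> {
    var buggy = 0L   // target frames produced by per-chunk rounding
    var fixed = 0L   // target frames produced by rounding the running total
    var src = 0L
    var worstBuggy = 0.0
    var worstFixed = 0.0
    repeat(chunks) {
        src += framesPerChunk
        buggy += round(framesPerChunk * ratio).toLong()
        fixed = round(src * ratio).toLong()
        val exact = src * ratio  // the true (fractional) target position
        worstBuggy = maxOf(worstBuggy, abs(buggy - exact))
        worstFixed = maxOf(worstFixed, abs(fixed - exact))
    }
    return Pair(worstBuggy, worstFixed)
}
```

With 465,000 chunks of 1152 frames at the 44.1kHz-to-16kHz ratio, the per-chunk version drifts by thousands of frames (seconds of audio) while the cumulative version never strays more than one frame from the true position.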

Small bug. Days of frustration. Six lines to fix.

What trips it up

Page Sync is not magic. It's a system built on two ML models that both have failure modes, and it helps to know where those are.

Proper nouns and invented words. Vosk was trained on general English. "Malazan" or "Daenerys" will be transcribed as phonetic approximations. The fuzzy matching helps, but it can only absorb so much distance. Fantasy novels are the hardest genre for Page Sync.

Very short pages. If the OCR only extracts 3-4 usable words, there isn't enough signal to narrow down the position. The system might return multiple candidates and you'd have to pick.

Numbers and abbreviations. The page says "Dr. Smith arrived at 3:15 p.m." The OCR produces "dr smith arrived at 3 15 p m". The speech recognition produces "doctor smith arrived at three fifteen pm". None of those words match each other.

Heavy dialogue with short utterances. Pages of "Yes." "No." "Why?" produce almost no searchable words after filtering.

Where it works best: Standard prose, novels and non-fiction, with paragraphs of normal English. Pages with distinctive vocabulary (technical terms, unusual words) are the easiest to match. Good lighting and a flat page help the OCR side. Clear narration by a single narrator helps the speech recognition side.

In practice, I'd estimate it gets the right position about 90-95% of the time on standard prose, dropping to 70-80% on the harder cases above. When it's wrong, it's usually close (within a page or two of the right spot) rather than completely off.

Why this and not that

Some stuff that might not be obvious from the description above.

Word-level timestamps instead of sentences. Vosk produces word boundaries natively. Storing individual words lets the player seek to within about 200ms of the target, and the matching window can start at any word boundary rather than waiting for a sentence break. The trade-off is more database rows (72K for a 10-hour book vs maybe 7K for sentences), but 5-6MB is nothing.

FTS4 + fuzzy matching instead of longest common substring. A naive "find the longest shared substring" approach would be O(n*m) where n is the transcription length. For 72K words, that's slow. The two-phase approach (cheap FTS to find candidates, expensive fuzzy matching only on the candidates) turns a search through 72,000 words into a search through 5-15 positions. The total time stays under 500ms.

Levenshtein distance instead of phonetic similarity. Soundex or Metaphone would help with speech-recognition errors, but wouldn't help with OCR errors (which are visual, not phonetic). Levenshtein handles both kinds of errors with one metric. The 0.7 threshold was tuned empirically across a few dozen books.

No stemming or lemmatization. The prefix wildcard on FTS queries (castle*) already handles basic pluralization. A real stemmer would add complexity and risk false positives (matching words that stem the same but mean different things). Given that the fuzzy matching layer already provides error tolerance, stemming didn't seem worth it.

Try it

Page Sync is part of Earleaf, an audiobook player for Android. $4.99, no ads, no subscriptions. The transcription runs on your phone, nothing leaves your device.

If you have questions about any of this, I'm at arcadianalpaca@gmail.com.
