The problem
I'm building biblie-school, an open-source LMS for Bible schools. The product is bilingual (Russian and English) and most teacher-authored content goes through a Gemini-backed translation pipeline.
For 95% of the content this is fine. For course titles, chapter prose, quiz questions, and announcements, the LLM does an honest job and the worst that happens is a slightly clumsy turn of phrase.
For Bible quotes it isn't fine. At all.
A teacher writes a Russian course quoting Acts 1:8 from the Synodal translation, the canonical Russian-language text since 1876. An English-locale student reads it. The expected behaviour is that they see Acts 1:8 in the King James Version, the canonical English-language text since 1769. Not Gemini's interpretation of the Synodal text.
This isn't a stylistic preference. KJV and Synodal are the texts these communities cite, memorize, and study from. A model paraphrase, even a "good" one, breaks the contract: students need to read the same wording their pastor will quote on Sunday. And every LLM, including the strongest ones, paraphrases scripture. Sometimes subtly, sometimes egregiously.
The naive fix is a prompt rule: "Bible verses must be preserved verbatim." It does not work. The model still paraphrases, especially across languages where there is no direct passthrough. So we built something else.
The constraint
Every quote we care about comes from one of two public-domain corpora:
- King James Version (1769) for English. ~31,103 verses, ~4.5 MB as flat JSON.
- Synodal (1876) for Russian. ~30,111 verses, ~6.1 MB as flat JSON.
Both bundle into the backend. They never change. They are the source of truth for any rendered Bible quote in either locale.
The only problem is figuring out, for any given chunk of teacher-authored HTML, which substrings are quotes that need this canonical-text treatment, and what verse exactly each one is.
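Because both corpora are static and ship with the backend, verse lookup can be a plain in-memory dictionary read. The sketch below is a minimal illustration, not the project's actual store: the `VerseStore` class, its method names, and the `"book.chapter.verse"` key layout are all assumptions, since the article only says the data is "flat JSON".

```python
import json
from pathlib import Path
from typing import Dict, Optional


class VerseStore:
    """In-memory lookup over one bundled corpus (hypothetical helper)."""

    def __init__(self, verses: Dict[str, str]) -> None:
        # ASSUMPTION: keys look like "acts.1.8"; the real JSON layout may differ.
        self._verses = verses

    @classmethod
    def from_json_file(cls, path: Path) -> "VerseStore":
        # The corpora never change, so loading once at startup is enough.
        with path.open(encoding="utf-8") as handle:
            return cls(json.load(handle))

    def lookup(self, book: str, chapter: int, verse: int) -> Optional[str]:
        # Return None on a miss so callers can decide how to fall back.
        return self._verses.get(f"{book}.{chapter}.{verse}")
```

Returning `None` on a miss, rather than raising, keeps the fallback decision with the caller.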
The architecture
The pipeline wraps each Gemini call in two passes, pre_substitute before it and post_substitute after it:
    HTML in source locale
            |
            v
    pre_substitute(html, source_locale)
      - find <blockquote> + reference pairs
      - confirm canonical via similarity match
      - swap verse text for VERSE_<hex> marker
            |
            v
    markered HTML -----> Gemini translate -----> translated HTML
                                                      |
                                                      v
                                  post_substitute(html, subs, target_locale)
                                    - replace each marker with
                                      the canonical target-locale verse
                                    - localize the (Acts 1:8) reference too
                                                      |
                                                      v
                                            HTML in target locale
1. Detect
We walk every <blockquote> in the HTML. Two layouts in real-world content:
<!-- A: reference inside the blockquote (most Synodal-style citations) -->
<blockquote>...verse text... (Деян. 20:28).</blockquote>
<!-- B: reference outside, immediately after </blockquote> -->
<blockquote>...verse text...</blockquote> (Acts 1:8).
We try the inside layout first, then fall back to a 120-character lookahead window after the closing tag. The reference parser is a regex built from a 66-book canonical alias list (Matthew / Matt. / Mt. / Матфей / Мф. / Матф. / etc.) rather than a loose book pattern, so it doesn't accidentally swallow surrounding prose like "See Acts" as a book name.
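The detection step can be sketched roughly as follows. This is a simplified illustration, not the code in references.py: the two-book `ALIASES` map, the `Ref` tuple, and the function names are made up for the example, and it scans blockquotes with a regex where the real code may well use an HTML parser.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical, heavily truncated alias map: canonical slug -> accepted spellings.
ALIASES = {
    "matthew": ["Matthew", "Matt.", "Mt.", "Матфей", "Матф.", "Мф."],
    "acts": ["Acts", "Деяния", "Деян."],
}


class Ref(NamedTuple):
    book: str
    chapter: int
    verse: int
    verse_end: Optional[int] = None


# Reverse index: lowercased alias -> canonical slug.
_ALIAS_TO_SLUG = {
    alias.lower(): slug for slug, names in ALIASES.items() for alias in names
}

# Longest aliases first, so "Matthew" is tried before "Matt." and "Mt.".
_BOOKS = "|".join(
    re.escape(alias) for alias in sorted(_ALIAS_TO_SLUG, key=len, reverse=True)
)

# "(Book chapter:verse)" with an optional "-verse_end" range.
_REF_RE = re.compile(
    rf"\(\s*(?P<book>{_BOOKS})\s*(?P<ch>\d+):(?P<v>\d+)(?:-(?P<end>\d+))?\s*\)",
    re.IGNORECASE,
)

# Simplification: a regex scan instead of a real HTML parser.
_BLOCKQUOTE_RE = re.compile(
    r"<blockquote[^>]*>(?P<body>.*?)</blockquote>", re.DOTALL | re.IGNORECASE
)
LOOKAHEAD = 120  # characters inspected after </blockquote>


def parse_reference(text: str) -> Optional[Ref]:
    """Return the first known-book reference in text, or None."""
    match = _REF_RE.search(text)
    if match is None:
        return None
    end = match.group("end")
    return Ref(
        _ALIAS_TO_SLUG[match.group("book").lower()],
        int(match.group("ch")),
        int(match.group("v")),
        int(end) if end else None,
    )


def find_quote_refs(html: str) -> List[Tuple[str, Ref]]:
    """Pair each blockquote body with its reference, if one is found."""
    pairs = []
    for block in _BLOCKQUOTE_RE.finditer(html):
        ref = parse_reference(block.group("body"))  # layout A: inside
        if ref is None:  # layout B: shortly after the closing tag
            ref = parse_reference(html[block.end():block.end() + LOOKAHEAD])
        if ref is not None:
            pairs.append((block.group("body"), ref))
    return pairs
```

Because the book alternation contains only known aliases, a parenthetical like "(p. 3)" or loose prose such as "See Acts" never parses as a reference.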
2. Confirm
Detection alone isn't enough. The author might have paraphrased the verse themselves, or written commentary, or quoted only part of the verse. We don't want to "correct" intentional paraphrases.
So for every detected blockquote+reference pair, we look up the canonical text in the source locale (Synodal for ru, KJV for en) and compare it to the author's text using difflib.SequenceMatcher. If similarity is at least 0.80, this is a real canonical quote and gets the substitution treatment. Below 0.80, we leave it alone and the LLM handles it under the existing "leave verses untouched" prompt rule (a fallback, not the main mechanism).
We tested the threshold empirically on real course content. Author copy-pastes of Synodal hit 0.95 and above. Paraphrases land below 0.6. The 0.80 threshold tolerates minor punctuation differences (em-dashes, smart quotes, ё vs е normalization) without false-matching a paraphrase.
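A minimal sketch of the confirmation check, using `difflib.SequenceMatcher` as described above. The explicit normalization table and the function names are assumptions for illustration; the article only says the threshold tolerates these differences, not how (or whether) the text is normalized first.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.80

# ASSUMPTION: fold the punctuation variants the article mentions before comparing.
_NORMALIZE = str.maketrans({
    "ё": "е", "Ё": "Е",
    "\u2014": "-", "\u2013": "-",                            # em and en dashes
    "«": '"', "»": '"', "\u201c": '"', "\u201d": '"',  # guillemets, smart quotes
})


def normalize(text: str) -> str:
    """Fold punctuation variants, lowercase, and collapse whitespace."""
    return " ".join(text.translate(_NORMALIZE).lower().split())


def similarity(author_text: str, canonical_text: str) -> float:
    # autojunk=False keeps long verses (200+ characters) from being skewed
    # by difflib's "popular element" heuristic.
    matcher = SequenceMatcher(
        None, normalize(author_text), normalize(canonical_text), autojunk=False
    )
    return matcher.ratio()


def is_canonical_quote(
    author_text: str,
    canonical_text: str,
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """True when the author's text is close enough to the canonical verse."""
    return similarity(author_text, canonical_text) >= threshold
```

Keeping the threshold as a parameter makes it easy to re-run the empirical check against new course content.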
3. Substitute
When we accept a quote, we replace the verse text with a marker:
VERSE_a1b2c3d4e5f6a7b8
Constraints this marker satisfies:
- Plain ASCII. An earlier version used Unicode Private-Use Area characters as fences. They were invisible in the editor and silently broke ASCII assertions in tests.
- Postgres-TEXT-safe. The first prototype used NUL-byte fences (\x00...\x00). Postgres TEXT rejects NUL. The translated marker came back stripped, leaving raw VERSE_<hex> substrings visible to students. Took an embarrassing prod inspection to catch.
- Identifier-shaped. This matters because Gemini's "preserve placeholders verbatim" prompt rule applies to identifier-shaped tokens. VERSE_a1b2c3d4e5f6a7b8 reads as a placeholder. ≪V≫ does not.
- Random hex suffix. Multiple verses in one document each get a unique marker so substitutions round-trip independently.
We also extend the marker leftwards through the opening parenthesis of the reference if there is one, so we don't leave a stray ( inside the marker-replaced verse text. The reference notation itself, (Деян. 20:28)., is preserved as-is in the markered HTML and survives translation untouched (parens-with-digits looks like data to the model).
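Generating and applying such a marker might look like the sketch below. The `Substitution` record and both function names are hypothetical, and the parenthesis handling described above is left out to keep the example short.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Tuple

# Plain ASCII, identifier-shaped, 16 random hex characters.
MARKER_RE = re.compile(r"VERSE_[0-9a-f]{16}")


@dataclass
class Substitution:
    marker: str       # placeholder sent through the LLM
    ref_text: str     # reference as written, e.g. "(Матф. 28:19)"
    source_text: str  # original verse text, kept as a fallback


def new_marker() -> str:
    # secrets.token_hex(8) returns 16 lowercase hex characters.
    return "VERSE_" + secrets.token_hex(8)


def swap_verse(html: str, verse_text: str, ref_text: str) -> Tuple[str, Substitution]:
    """Replace the first occurrence of verse_text with a fresh marker."""
    marker = new_marker()
    markered = html.replace(verse_text, marker, 1)
    return markered, Substitution(marker, ref_text, verse_text)
```

The reference text stays in the HTML untouched; only the verse body is hidden from the model.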
4. Restore
After Gemini returns the translated HTML, post_substitute walks the substitution list and:
- Replaces each marker with the canonical target-locale verse (canonical_target = lookup(ref, target_locale)).
- Falls back to the original source-locale text when the target lookup misses (e.g. a verse the bundled JSON happens to lack), better than leaving a marker visible.
- Rewrites the surviving reference notation to the target locale's conventional short form. (Acts 1:8) becomes (Деян. 1:8). (Матф. 28:19) becomes (Matt. 28:19). The book-name display table is keyed by the same canonical slug as the alias parser, so adding a new book is one row in two places.
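The restore pass could be sketched like this. Everything here is illustrative: the in-memory `VERSES` and `DISPLAY` tables stand in for the bundled JSON and the display table, and the `Ref` and `Substitution` shapes are assumptions rather than the project's real types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Ref:
    book: str
    chapter: int
    verse: int
    verse_end: Optional[int] = None


@dataclass
class Substitution:
    marker: str
    ref: Ref
    source_text: str  # original verse text, used when the target lookup misses
    ref_text: str     # reference as written in the source, e.g. "(Матф. 28:19)"


# Hypothetical stand-ins for the bundled corpora and the display table.
VERSES: Dict[Tuple[str, str, int, int], str] = {
    ("en", "matthew", 28, 19): (
        "Go ye therefore, and teach all nations, baptizing them in the "
        "name of the Father, and of the Son, and of the Holy Ghost:"
    ),
}
DISPLAY = {
    "matthew": {"en": "Matt.", "ru": "Матф."},
    "acts": {"en": "Acts", "ru": "Деян."},
}


def lookup(ref: Ref, locale: str) -> Optional[str]:
    return VERSES.get((locale, ref.book, ref.chapter, ref.verse))


def format_ref(ref: Ref, locale: str) -> str:
    tail = f"-{ref.verse_end}" if ref.verse_end else ""
    return f"({DISPLAY[ref.book][locale]} {ref.chapter}:{ref.verse}{tail})"


def post_substitute(html: str, subs: List[Substitution], target_locale: str) -> str:
    for sub in subs:
        canonical = lookup(sub.ref, target_locale)
        # Fall back to the source text rather than leave a marker visible.
        verse = canonical if canonical is not None else sub.source_text
        html = html.replace(sub.marker, verse)
        # Localize the reference notation that survived translation.
        html = html.replace(sub.ref_text, format_ref(sub.ref, target_locale))
    return html
```

Both tables are keyed by the same canonical slug, which is what lets one parsed reference drive the verse lookup and the display name.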
What it looks like in production
Russian-locale teacher writes:
<p>Завершающее повеление Иисуса ученикам:</p>
<blockquote>«Итак идите, научите все народы, крестя их во имя
Отца и Сына и Святаго Духа» (Матф. 28:19).</blockquote>
English-locale student reads:
<p>The final command of Jesus to His disciples:</p>
<blockquote>Go ye therefore, and teach all nations, baptizing them in the
name of the Father, and of the Son, and of the Holy Ghost: (Matt. 28:19).</blockquote>
Note three things:
- The blockquote's verse text is the canonical KJV verbatim, not Gemini's interpretation of the Synodal.
- The reference is (Matt. 28:19), not (Матф. 28:19). Gemini didn't translate the book name. Our display table did.
- The surrounding prose ("The final command of Jesus to His disciples:") is the LLM doing its job on the parts the LLM should be doing.
Edge cases that bit us
Synodal short forms missing from the alias map. First version had матфей / мф / от матфея for Matthew but missed Матф., the most common abbreviation in actual Synodal-printed Bibles. Russian-authored content silently failed substitution and the verse leaked through Gemini paraphrased. Caught in production via Datadog RUM. Fixed by expanding aliases for every Gospel and Pauline epistle (мар, лук, иоан, фил, 1 фесс, 1 иоан, etc.) plus a contract test that asserts every canonical slug has an alias entry.

Reference outside the blockquote. Some content puts the citation after </blockquote> instead of inside it. The first version captured the verse correctly but didn't track the outside reference, so a Russian student saw a Synodal verse next to a stray English (Acts 1:8). Fixed by storing the outside ref text on the substitution record and running the same locale rewrite on it during post.

Marker spacing. The marker swallowed the trailing whitespace and closing curly quote of the blockquote, so the post-substituted output read …canonical text.(Matt. 28:19). with no space. Re-introduced a single ASCII space when the tail starts with (.

Verse range references. (Acts 1:8-10) localizes correctly to (Деян. 1:8-10) because the display formatter respects the verse_end field on the ref struct. The corresponding canonical lookup joins the verses with a single space and falls back to None if any verse in the range is missing.
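Two of these fixes are small enough to sketch: the all-or-nothing range lookup and the alias contract check. The function names and the tuple-keyed verse dictionary are assumptions for illustration, not the project's actual API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

VerseKey = Tuple[str, int, int]  # (book slug, chapter, verse)


def lookup_range(
    verses: Dict[VerseKey, str],
    book: str,
    chapter: int,
    verse: int,
    verse_end: Optional[int] = None,
) -> Optional[str]:
    """Join verse..verse_end with single spaces; None if any verse is missing."""
    last = verse if verse_end is None else verse_end
    if last < verse:
        return None  # malformed range such as 10-8
    parts: List[str] = []
    for number in range(verse, last + 1):
        text = verses.get((book, chapter, number))
        if text is None:
            return None  # all-or-nothing: never return a partial range
        parts.append(text)
    return " ".join(parts)


def slugs_without_aliases(
    canon: Iterable[str], aliases: Dict[str, List[str]]
) -> List[str]:
    """Contract check: canonical slugs with no (or an empty) alias entry."""
    return [slug for slug in canon if not aliases.get(slug)]
```

A contract test can then assert that `slugs_without_aliases` returns an empty list for the full canon.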
What's interesting about this approach
The Bible substitution layer doesn't compete with the LLM. It uses the LLM for what it's good at (translating prose, preserving HTML structure, transliterating proper nouns) and replaces the LLM where the LLM is wrong (touching canonical text). Each layer has a clean job.
The same pattern applies anywhere you have:
- A small, public-domain or licensed corpus of canonical text
- A larger surface that needs LLM translation
- A reliable way to detect quotes from the corpus inside the surface
Examples I can think of: legal contracts citing statute, scientific writing citing equations or named constants, classical literature quoting older works in their established translations. The shape is the same. Detect the canonical chunk, swap it for a placeholder, let the LLM handle the surrounding prose, restore the canonical chunk in the target locale.
Code
All of this lives in backend/app/services/bible/:
- books.py: 66-book canon, alias map, per-locale display names
- references.py: regex parser built from the alias list
- store.py: bundled JSON loader (KJV / Synodal)
- substitution.py: pre/post substitute, similarity threshold, marker tokens
- data/: kjv-en.json (4.5 MB) + synodal-ru.json (6.1 MB), both public-domain
39 unit tests cover the alias map, the reference parser, the locale store, full round-trips for both directions, the verse-range case, and the spacing regression. The pipeline integrates into the broader translation registry which also handles non-Bible content.
Repo: github.com/ArVaViT/biblie-school (MIT)
How you can help
The pipeline above is one corner of a small open-source LMS for Bible schools and volunteer-run training programs. If you've worked on translation pipelines, LLM I/O hardening, or just like the idea of an LMS that respects scripture as a source of truth, the issues tab is open. Star the repo if you want to follow along.
Thanks for reading.