SQLite FTS5 won't tokenize Chinese — here's the 7-line bigram fix that did
Day 18 of my public cold start. After yesterday's pivot from a $9 PDF (17 days / 247 readers / 0 sales) to pbot — a one-click personal knowledge bot — I built the spike. 25 minutes from npm init to end-to-end query.
One detail almost killed it.
The setup
Stack: Node + Express + better-sqlite3 + Anthropic SDK. Five dependencies total. No LangChain, no ChromaDB, no Python embed worker. Aggressive minimalism on purpose because the v1 ships as a single binary on the user's own machine.
Retrieval: SQLite FTS5 with BM25, virtual table:
CREATE VIRTUAL TABLE chunks_fts USING fts5(
content,
topic,
source UNINDEXED,
tokenize='unicode61 remove_diacritics 2'
);
unicode61 is the FTS5 default. Solid for English, handles Vietnamese diacritics, lowercases properly.
131 chunks ingested from my pbot SPEC docs. Test query: 變壓器絕緣油試驗規範 (transformer insulating oil test spec — one of the questions my factory daemon would answer).
SELECT source FROM chunks_fts WHERE chunks_fts MATCH '"變壓器"';
-- 0 rows
Zero. From 131 chunks containing the exact phrase.
Why unicode61 fails on CJK
unicode61 tokenizes on Unicode word boundaries. For English, that's whitespace + punctuation. Roman alphabet → real tokens.
CJK has no word boundaries. unicode61 sees 變壓器絕緣油試驗規範 as one token — the entire 10-character run. MATCH on "變壓器" is searching for a token that equals 變壓器 exactly, not a substring of a larger token. No match.
SQLite has an ICU tokenizer that handles CJK better, but it requires building SQLite with --enable-fts5 AND ICU linked. better-sqlite3 ships prebuilt without ICU. Not a viable path for a "drop binary on user's machine" product.
Three options I considered
- jieba / nodejieba — proper Chinese segmentation. Adds a Python or N-API dependency. Quality higher but spike scope creep.
- Embeddings — OpenAI ada-002 or local sentence-transformers. Solves the language problem entirely but adds API cost and embedding latency per query. Premature for spike.
-
Bigram preprocessing — transform
變壓器into變壓 壓器before storing/querying. Pure string transform. No new dependency. Quality good enough for keyword search.
Picked 3 for the spike. Embeddings stay on the v1.1 list.
The 7-line bigram function
function bigramCJK(text) {
return text.replace(/[一-鿿㐀-䶿]+/g, (run) => {
if (run.length < 2) return run;
const bigrams = [];
for (let i = 0; i < run.length - 1; i++) {
bigrams.push(run.slice(i, i + 2));
}
return ' ' + bigrams.join(' ') + ' ';
});
}
What it does, by example:
-
"變壓器絕緣油"→" 變壓 壓器 器絕 絕緣 緣油 " -
"BDV 變壓器 test"→"BDV 變壓 壓器 test" -
"好"(single char) →"好"(no bigram possible) - English/spaces/punctuation pass through untouched
- Includes Unicode Extension A range (
㐀-䶿) for rare characters
The regex matches runs of CJK characters and splits each run into overlapping 2-character bigrams. Runs less than 2 characters pass through.
Apply on both ends
Critical: same transform on ingest AND query.
// ingest
db.prepare('INSERT INTO chunks_fts(content,...) VALUES (?,...)')
.run(bigramCJK(rawText), ...);
// query
const ftsQuery = bigramCJK(userQuestion)
.split(/\s+/)
.filter(Boolean)
.map(t => `"${t}"`)
.join(' OR ');
db.prepare('SELECT ... FROM chunks_fts WHERE chunks_fts MATCH ?')
.all(ftsQuery);
Now MATCH '"變壓" OR "壓器" OR "絕緣" OR "緣油"' hits every chunk containing the original phrase. BM25 ranks the most-overlapping chunk on top.
One footgun: don't show users the bigrammed content
After bigram preprocessing, chunks_fts.content looks like " 變壓 壓器 器絕 ... " — readable but ugly. Don't return this to your LLM or your UI.
Solution: keep chunks_meta with raw content, JOIN to retrieve:
CREATE TABLE chunks_meta (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL,
topic TEXT,
content_raw TEXT NOT NULL -- unsplit original
);
SELECT m.content_raw
FROM chunks_fts
JOIN chunks_meta m ON m.id = chunks_fts.rowid
WHERE chunks_fts MATCH ?;
FTS5's hidden rowid column links them. content_raw goes to the LLM. The bigram-mangled chunks_fts.content stays internal.
What I tested
After fix:
| Query | Hits | Top source | Verdict |
|---|---|---|---|
變壓器絕緣油試驗規範 |
3 | transformer-oil-test-spec |
exact ✅ |
工廠有幾台變壓器 |
2 | transformer-23units-inventory |
exact ✅ |
今天台股 (out-of-scope) |
0 | — | correct miss ✅ |
Then I scaled it up. Ingested 493 markdown files (8.3 MB) from my Hope memory directory. 15,119 chunks. Database size 17.3 MB (≈ 2× expansion from bigram).
Query latency on 15k chunks: 0–3 ms. FTS5 BM25 is logarithmic.
Linear extrapolation to a customer with 5 GB of markdown notes:
- Ingest: ~10 minutes
- DB size: ~10 GB- Query latency: still well under 100 ms
That's acceptable for an on-device knowledge base. Embeddings would multiply ingest time by 100× and add per-query latency. Bigram BM25 wins for the spike, embeddings ship in v1.1 for recall improvements.
Limits I'm not pretending don't exist
Bigram BM25 retrieves on keyword overlap. It struggles when the query is conceptually close but uses different words. Test query 變壓器 DGA (which I'd want to match my DGA interpretation reference) instead matched a daily log that happened to contain both substrings. Embeddings solve this. v1.1 territory.
Also: bigram doesn't help languages with longer word averages (Korean, Japanese) as much as Chinese. Korean morphology and Japanese hiragana/kanji mix want their own segmentation. Out of scope for this spike; a real product would need per-language tokenizers.
Test it yourself
The full spike is ~/pbot/code/spike/ — five JS files, ~350 LOC. The bigram function is in ingest.js (writer side) and sqlite-adapter.js (query side, must match).
The whole spike grew out of yesterday's pivot story: 17 days / 247 views / 0 sales — I killed the $9 PDF and spiked a new direction in 25 minutes (Chinese).
If you want to follow the v1 ship of pbot — one-click personal knowledge bot for LINE / Telegram / Zalo — there's a free waitlist ($29 · first-100 get -30% → $20 early-bird): foxck.gumroad.com/l/xkaemm.
What's next on this build log
Day 19: how I packaged the spike in a 65.9 MB Docker image as a fallback to the binary path.
If you're shipping CJK-aware search and don't want to ship a 30 MB jieba dictionary, the bigram trick is one Saturday afternoon's work. Steal it.
Code referenced: bigramCJK() in ingest.js/sqlite-adapter.js of the pbot spike (~350 LOC total, MIT, currently local only — opens when v1 alpha drops).
Numbers verified: 15,119 chunks / 17.3 MB DB / 0-3 ms query latency on node v22.22.2 + better-sqlite3 12.8.0 + sqlite3 3.43.
Top comments (0)