foxck016077

Posted on May 24

SQLite FTS5 won't tokenize Chinese — here's the 7-line bigram fix that did

#sqlite #search #ai #rag

SQLite FTS5 won't tokenize Chinese — here's the 7-line bigram fix that did

Day 18 of my public cold start. After yesterday's pivot from a $9 PDF (17 days / 247 readers / 0 sales) to pbot — a one-click personal knowledge bot — I built the spike. 25 minutes from npm init to end-to-end query.

One detail almost killed it.

The setup

Stack: Node + Express + better-sqlite3 + Anthropic SDK. Five dependencies total. No LangChain, no ChromaDB, no Python embed worker. Aggressive minimalism on purpose because the v1 ships as a single binary on the user's own machine.

Retrieval: SQLite FTS5 with BM25, virtual table:

CREATE VIRTUAL TABLE chunks_fts USING fts5(
  content,
  topic,
  source UNINDEXED,
  tokenize='unicode61 remove_diacritics 2'
);

unicode61 is the FTS5 default. Solid for English, handles Vietnamese diacritics, lowercases properly.

131 chunks ingested from my pbot SPEC docs. Test query: 變壓器絕緣油試驗規範 (transformer insulating oil test spec — one of the questions my factory daemon would answer).

SELECT source FROM chunks_fts WHERE chunks_fts MATCH '"變壓器"';
-- 0 rows

Zero. From 131 chunks containing the exact phrase.

Why unicode61 fails on CJK

unicode61 tokenizes on Unicode word boundaries. For English, that's whitespace + punctuation. Roman alphabet → real tokens.

CJK has no word boundaries. unicode61 sees 變壓器絕緣油試驗規範 as one token — the entire 10-character run. MATCH on "變壓器" is searching for a token that equals 變壓器 exactly, not a substring of a larger token. No match.

SQLite has an ICU tokenizer that handles CJK better, but it requires building SQLite with --enable-fts5 AND ICU linked. better-sqlite3 ships prebuilt without ICU. Not a viable path for a "drop binary on user's machine" product.

Three options I considered

jieba / nodejieba — proper Chinese segmentation. Adds a Python or N-API dependency. Quality higher but spike scope creep.
Embeddings — OpenAI ada-002 or local sentence-transformers. Solves the language problem entirely but adds API cost and embedding latency per query. Premature for spike.
Bigram preprocessing — transform 變壓器 into 變壓壓器 before storing/querying. Pure string transform. No new dependency. Quality good enough for keyword search.

Picked 3 for the spike. Embeddings stay on the v1.1 list.

The 7-line bigram function

function bigramCJK(text) {
  return text.replace(/[一-鿿㐀-䶿]+/g, (run) => {
    if (run.length < 2) return run;
    const bigrams = [];
    for (let i = 0; i < run.length - 1; i++) {
      bigrams.push(run.slice(i, i + 2));
    }
    return ' ' + bigrams.join(' ') + ' ';
  });
}

What it does, by example:

"變壓器絕緣油" → " 變壓壓器器絕絕緣緣油 "
"BDV 變壓器 test" → "BDV 變壓壓器 test"
"好" (single char) → "好" (no bigram possible)
English/spaces/punctuation pass through untouched
Includes Unicode Extension A range (㐀-䶿) for rare characters

The regex matches runs of CJK characters and splits each run into overlapping 2-character bigrams. Runs less than 2 characters pass through.

Apply on both ends

Critical: same transform on ingest AND query.

// ingest
db.prepare('INSERT INTO chunks_fts(content,...) VALUES (?,...)')
  .run(bigramCJK(rawText), ...);

// query
const ftsQuery = bigramCJK(userQuestion)
  .split(/\s+/)
  .filter(Boolean)
  .map(t => `"${t}"`)
  .join(' OR ');

db.prepare('SELECT ... FROM chunks_fts WHERE chunks_fts MATCH ?')
  .all(ftsQuery);

Now MATCH '"變壓" OR "壓器" OR "絕緣" OR "緣油"' hits every chunk containing the original phrase. BM25 ranks the most-overlapping chunk on top.

One footgun: don't show users the bigrammed content

After bigram preprocessing, chunks_fts.content looks like " 變壓壓器器絕 ... " — readable but ugly. Don't return this to your LLM or your UI.

Solution: keep chunks_meta with raw content, JOIN to retrieve:

CREATE TABLE chunks_meta (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,
  topic TEXT,
  content_raw TEXT NOT NULL  -- unsplit original
);

SELECT m.content_raw
FROM chunks_fts
JOIN chunks_meta m ON m.id = chunks_fts.rowid
WHERE chunks_fts MATCH ?;

FTS5's hidden rowid column links them. content_raw goes to the LLM. The bigram-mangled chunks_fts.content stays internal.

What I tested

After fix:

Query	Hits	Top source	Verdict
`變壓器絕緣油試驗規範`	3	`transformer-oil-test-spec`	exact ✅
`工廠有幾台變壓器`	2	`transformer-23units-inventory`	exact ✅
`今天台股` (out-of-scope)	0	—	correct miss ✅

Then I scaled it up. Ingested 493 markdown files (8.3 MB) from my Hope memory directory. 15,119 chunks. Database size 17.3 MB (≈ 2× expansion from bigram).

Query latency on 15k chunks: 0–3 ms. FTS5 BM25 is logarithmic.

Linear extrapolation to a customer with 5 GB of markdown notes:

Ingest: ~10 minutes
DB size: ~10 GB- Query latency: still well under 100 ms

That's acceptable for an on-device knowledge base. Embeddings would multiply ingest time by 100× and add per-query latency. Bigram BM25 wins for the spike, embeddings ship in v1.1 for recall improvements.

Limits I'm not pretending don't exist

Bigram BM25 retrieves on keyword overlap. It struggles when the query is conceptually close but uses different words. Test query 變壓器 DGA (which I'd want to match my DGA interpretation reference) instead matched a daily log that happened to contain both substrings. Embeddings solve this. v1.1 territory.

Also: bigram doesn't help languages with longer word averages (Korean, Japanese) as much as Chinese. Korean morphology and Japanese hiragana/kanji mix want their own segmentation. Out of scope for this spike; a real product would need per-language tokenizers.

Test it yourself

The full spike is ~/pbot/code/spike/ — five JS files, ~350 LOC. The bigram function is in ingest.js (writer side) and sqlite-adapter.js (query side, must match).

The whole spike grew out of yesterday's pivot story: 17 days / 247 views / 0 sales — I killed the $9 PDF and spiked a new direction in 25 minutes (Chinese).

If you want to follow the v1 ship of pbot — one-click personal knowledge bot for LINE / Telegram / Zalo — there's a free waitlist ($29 · first-100 get -30% → $20 early-bird): foxck.gumroad.com/l/xkaemm.

What's next on this build log

Day 19: how I packaged the spike in a 65.9 MB Docker image as a fallback to the binary path.

If you're shipping CJK-aware search and don't want to ship a 30 MB jieba dictionary, the bigram trick is one Saturday afternoon's work. Steal it.

Code referenced: bigramCJK() in ingest.js/sqlite-adapter.js of the pbot spike (~350 LOC total, MIT, currently local only — opens when v1 alpha drops).
Numbers verified: 15,119 chunks / 17.3 MB DB / 0-3 ms query latency on node v22.22.2 + better-sqlite3 12.8.0 + sqlite3 3.43.

Top comments (1)

mamoru kubokawa • May 25

The "same transform on both ingest and query" line is the part people skip and then burn a day debugging — good call making it explicit.

On the Japanese caveat: you're right it doesn't transfer, and the why changes the fix. Chinese bigrams work because every char is high-information (變壓 is a real semantic unit). Japanese mixes scripts, and hiragana is mostly low-information grammar (って, した, conjugation tails) — bigram that and you flood BM25 with junk tokens that match almost everything. The pragmatic middle path is script-aware bigramming: bigram only the kanji runs (the CJK Unified range your regex already isolates), let hiragana/katakana pass through. You keep the zero-dependency win and drop most of the noise.

The failure that bites hardest in my world (Japanese e-commerce search) isn't even segmentation — it's katakana brand/loanword variants: アネッサ vs ANESSA vs anessa, full-width vs half-width. That needs a normalization pass, not a tokenizer. Funny how this thread keeps circling the same root: Vicky's translation API, Hiroki's address parser, your tokenizer — all of it is "CJK doesn't fit the whitespace assumption baked into the tools."

Are you shipping the script-aware version for the JP/KR waitlist users, or holding the line at Chinese for v1?