DEV Community

 Omochi
Omochi

Posted on

Why SQLite FTS5's default tokenizer drops your Japanese substrings (and the one-line fix)

If you're building any kind of personal-memory layer on top of SQLite — Claude Code conversation history, notes app, indexed knowledge base — there's a sharp edge in FTS5 that takes most people by surprise the first time they hit it.

The default tokenizer (unicode61) silently drops most Japanese substring queries. The fix is one line of SQL. But the failure mode is invisible enough that you can ship a personal search tool, use it for weeks, and never realize half your content is unreachable.

This post walks through:

  1. The failure, reproducible in 20 lines of Python
  2. The one-line fix (tokenize='trigram') and what it actually does under the hood
  3. A two-layer Git + SQLite design that uses this index in production for ~800 Claude Code conversations
  4. A separate FTS5 footgun around the - character that breaks time-blocking-style queries
  5. A free GitHub sample at the end if you want to run the same approach against your own data

The failure, reproducible in 20 lines

Spin up a fresh SQLite FTS5 table with the default settings and insert a single multilingual sentence:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE notes USING fts5(content)
""")
conn.execute("""
    INSERT INTO notes(content) VALUES
    ('Tried time-blocking with the new 朝の運用フロー — '
     'the 9-11 slot worked but the 午後 part collapsed again.')
""")

for q in ["time", "blocking", "朝の運用", "午後"]:
    hits = conn.execute(
        "SELECT count(*) FROM notes WHERE content MATCH ?", (q,)
    ).fetchone()[0]
    print(f"{q!r}: {hits} hit(s)")
Enter fullscreen mode Exit fullscreen mode

Output:

'time': 1 hit(s)
'blocking': 1 hit(s)
'朝の運用': 0 hit(s)
'午後': 0 hit(s)
Enter fullscreen mode Exit fullscreen mode

Same row. Same content. English queries land, Japanese substring queries don't. That's not a bug, it's the default tokenizer behavior — and the default doesn't print a warning about it.

The reason: unicode61 segments text on whitespace and unicode word-break properties. English words have spaces between them, so individual tokens are recoverable. Japanese (and other CJK scripts) doesn't put spaces between words, so the entire run "朝の運用フロー" becomes one opaque token. Searching for the substring 朝の運用 is asking FTS5 "is this token equal to this query?" — and the token is 朝の運用フロー, which isn't equal.

The same problem exists for any non-space-delimited script: Chinese, Korean, Thai. If your index has any content in those languages and you're using the default tokenizer, you have a silent recall problem.


The one-line fix

CREATE VIRTUAL TABLE notes USING fts5(
    content,
    tokenize = 'trigram'
);
Enter fullscreen mode Exit fullscreen mode

Re-running the same test against this table:

'time': 1 hit(s)
'blocking': 1 hit(s)
'朝の運用': 1 hit(s)
'午後': 1 hit(s)
Enter fullscreen mode Exit fullscreen mode

All four queries land. Here's what the trigram tokenizer does differently.

Instead of trying to find word boundaries, it indexes every overlapping 3-character window of every input string. The Japanese substring "朝の運用フロー" becomes:

朝の運
の運用
運用フ
用フロ
フロー
Enter fullscreen mode Exit fullscreen mode

A search for 朝の運用 now matches because the substring "朝の運" and "の運用" are both windows the index already knows about. The trigram doesn't care whether a language has spaces — it operates below the word level.

The trade-off is index size. Every line gets indexed as N - 2 trigram entries (where N is the line length). On real corpora this comes out to roughly 1.5× to 2× the raw text size. For a personal memory tool with megabytes of content, that's nothing. For a 100GB corpus, you'd want to think harder.

Two practical notes:

  • Trigram requires queries of at least 3 characters. A search for (1 char) or 朝の (2 chars) gets degraded behavior — it falls back to a more expensive query plan. Filter out short queries at the app layer.
  • BM25 ranking still works on top of trigram tokenization. Searching mochi across 800 documents and getting the top 5 most relevant doesn't change.

The two-layer design: Git as source of truth, SQLite as disposable index

This index is the search half of a memory layer I built for Claude Code conversation history. ~800 conversations after several months of use, the architecture has stabilized around a deliberate split:

┌────────────────────────────────────────┐
│  Git repo: sample_events/              │   ← source of truth
│    └── 2026-06-09/                     │      (durable, human-readable,
│        ├── abc-def.json                │       diff-able, irreplaceable)
│        └── ghi-jkl.json                │
├────────────────────────────────────────┤
│  _data/memory_index.db                 │   ← disposable index
│    └── events_fts (FTS5 trigram)       │      (rebuildable in one command,
│                                        │       gitignored)
└────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Each conversation (or each meaningful decision inside one) becomes a JSON event on disk:

{
  "event_id": "0bf61ebd-c835-4a45-a147-675369258d61",
  "timestamp": "2026-05-09T12:00:00Z",
  "type": "decision",
  "content": "Picked SQLite over a vector store because grep needs to still work.",
  "metadata": {
    "session": "S8",
    "tags": ["architecture", "memory"]
  }
}
Enter fullscreen mode Exit fullscreen mode

The save path does three things in order:

def save_and_index(event: dict) -> tuple[Path, bool]:
    path = write_event(event)                       # 1. JSON to disk
    rel  = str(path.relative_to(MEMORY_DIR))
    msg  = f"note({event['type']}): {event['content'][:60]}"
    committed = git_commit(rel, msg)                # 2. Git commit
    con = get_db()
    ensure_schema(con)
    index_event(con, event)                         # 3. SQLite index
    con.close()
    return path, committed
Enter fullscreen mode Exit fullscreen mode

The reason for the split: SQLite indexes are great for query speed but bad at being trusted with the only copy of anything. A schema migration that goes wrong, an FTS5 corruption, a rm _data/* typo — any of these should be recoverable in seconds, not a panic. With this design, deleting the index and rebuilding it from the Git repo is a one-command operation. The Markdown / JSON in Git is the contract; the database is just a cache that happens to be fast.

This also makes the SQLite file itself completely disposable — there's no risk in adding it to .gitignore, no concern about syncing it across machines. Whichever machine builds the index first wins, and rebuilding on a new machine takes seconds.


A separate FTS5 footgun: the - operator

There's a second invisible failure in FTS5 that you'll hit the moment any of your users searches for a hyphenated term. The hyphen - is parsed as the NOT operator. So if someone runs:

conn.execute("SELECT * FROM notes WHERE content MATCH ?", ("time-blocking",))
Enter fullscreen mode Exit fullscreen mode

FTS5 sees this as time NOT blocking and crashes with:

sqlite3.OperationalError: no such column: blocking
Enter fullscreen mode Exit fullscreen mode

Fix is to wrap any user-provided query in double quotes before handing it to FTS5, which tells the parser to treat the whole thing as a literal phrase:

def to_fts_phrase(query: str) -> str:
    # Escape internal double-quotes by doubling them per FTS5 syntax.
    escaped = query.replace('"', '""')
    return f'"{escaped}"'
Enter fullscreen mode Exit fullscreen mode

Apply this to every user-input query, regardless of whether it looks like it has special characters. It's the only safe default. Phrase-wrapped queries still match substrings correctly under the trigram tokenizer, so there's no downside.

If you're building anything that lets a user type a free-form string into an FTS5 query, this should land before your first user.


What I'm actually using this for

Full disclosure on motivation: I keep most of my project notes inside Claude.ai conversations — strategy decisions, half-written drafts, debug logs from the day before. After ~800 conversations the built-in sidebar search started failing me, and the failure was the trigram-tokenizer story above. My notes are partly in Japanese (I'm a Japanese developer, can't help it), so half my own substring queries were returning nothing.

The setup that came out of solving it:

  • A SessionStart hook for Claude Code that pulls the last 7 days of events out of SQLite, feeds them to claude -p for a one-paragraph summary, and injects that as a system message at the top of every new session. The result is that Claude opens already knowing where I left off.
  • A Stop hook that records decisions automatically when a session ends.
  • A CLI: memory_ask "朝の運用" returns ranked hits in under 50ms over the full ~800 conversations on a laptop. The fact that this query works at all is the entire point of the trigram switch.

The most surprising upside has been Claude bringing up reasoning I'd forgotten: "We considered this approach three weeks ago and rejected it because of X." Reasons that feel obvious when you write them down go faded enough by month 3 that this kind of recall pays for the whole system.


Try it on your own machine

There's an open-source sample of the indexer that runs end-to-end on three fictional events:

https://github.com/tititi533551-create/mochi-memory-sample-en

git clone https://github.com/tititi533551-create/mochi-memory-sample-en.git
cd mochi-memory-sample-en
python3 check_mochi.py
Enter fullscreen mode Exit fullscreen mode

It clones, builds the SQLite FTS5 index, runs five mixed English/Japanese queries against the three sample events, and prints the trigram index doing its thing in your local environment. No external dependencies — only the Python standard library and a sqlite3 module with FTS5 compiled in (which is the default on macOS, Homebrew Python, official Windows installer, and most Linux distros).

The sample is MIT-licensed. Use it, fork it, drop the trigram pattern into whatever search layer you're building.


What I'd still like to figure out

A few things this design doesn't solve yet, in case anyone reading has thoughts:

  • Semantic recall on top. Trigram nails substring search, but "find me the conversation where we discussed pricing strategy" is a different shape of query and would want vector embeddings. The two indexes can coexist over the same JSON events, but I haven't built that second layer yet.
  • Cross-project memory. Right now each project gets its own repo. There's an argument for a federated query that looks across all of them — useful for "have I ever decided anything about X before?" — but the right design isn't obvious. Per-project context isolation has value too.
  • Event-granularity heuristics. What's the right unit to record? I've landed on "decisions, shipped things, learnings, getting stuck" empirically, but I'd be curious whether anyone has formalized this.

If you've solved any of these — or hit different versions of the same problems — I'd genuinely like to hear how. Comments below.

Top comments (0)