Building a Book Keyword Index with FTS5 Search, Auto-Suggest, and Daily Scanning

#ai #automation #opensource

In knowledge management systems, text search is rarely sufficient—users expect instant suggestions and up-to-date indexes. Our book keyword index module leverages SQLite’s FTS5 for full-text search, implements auto-suggest through prefix queries, and runs a daily scan to keep the index fresh. Here’s how we built it without extra dependencies.

The core requirement was straightforward: index book keywords from a relational schema, support fast lookups, and return suggestions as the user types. FTS5, a virtual table module in SQLite, handles the heavy lifting. We create an FTS5 table mapped to the main keyword data, using content= to link it to an external table. This keeps the source data normalized while FTS5 manages the inverted index.

CREATE VIRTUAL TABLE keywords_fts USING fts5(
    keyword,
    book_id UNINDEXED,
    content=keywords,
    tokenize='unicode61 remove_diacritics 1'
);

The tokenize parameter strips accents and normalizes Unicode, which is critical for multilingual book metadata. For auto-suggest, we query FTS5 with a prefix operator. When a user types "neur", we append * to the search term. FTS5’s BM25 ranking returns top matches quickly, even for partial queries.

SELECT keyword, rank
FROM keywords_fts
WHERE keywords_fts MATCH 'neur*'
ORDER BY rank
LIMIT 10;

FTS5 prefix queries are efficient because they leverage the same inverted index. To avoid showing stale data, we restrict suggestions to keywords updated within the last 30 days, using a separate timestamp column in the source table. The query joins the FTS table with the source to filter by updated_at.

Daily scanning is the third piece. A scheduled background job (implemented via a simple cron entry or a scheduler like systemd timers) triggers rebuild_fts_index(). This function reads all keywords modified since the last scan, deletes stale rows from the FTS table using the content= sync, and inserts new ones. The key insight is to avoid a full rebuild—only delta syncs via the UPDATE and DELETE handlers that FTS5 provides when content= is set. The daily scan calls INSERT OR REPLACE on the source table to trigger these handlers automatically.

-- Example function called by daily cron
INSERT OR REPLACE INTO keywords (book_id, keyword, updated_at)
VALUES (?, ?, datetime('now'));
-- FTS5 automatically syncs because of content= link

We avoided external search engines (like Elasticsearch) to keep the stack simple and the dependency count low. For most book databases under a few million rows, FTS5 performs admirably. The auto-suggest latency stays under 10 ms, and the daily scan runs in seconds.

One caveat: FTS5 does not support incremental updates through the virtual table directly; it relies on triggers or the content= synchronization. Our daily scan ensures the updated_at field is set, and a trigger on the source table keeps FTS5 in sync for insertions and deletions. During the scan, we also rebuild the FTS index on the full table if corruption is detected, but this is rare.

The module integrates with the broader knowledge management system through a simple API: the search endpoint accepts a q parameter, the suggest endpoint returns a JSON array of completions, and the scan is invoked by an internal service. This design keeps the codebase modular and testable.

For experienced developers, the trade-offs are clear. FTS5 lacks distributed capabilities and advanced scoring, but for a single-machine setup with moderate data sizes, it’s a pragmatic choice. The auto-suggest prefix query requires a trailing wildcard, which can be customized with minimum token length filters to avoid empty results. Daily scanning uses a lock file to prevent concurrent runs, handled through a simple file-based mutex.

In production, the system handles thousands of queries per day without issues. The real lesson is that you don’t always need a heavyweight search infrastructure—sometimes a well-tuned FTS5 table and a daily cron job are enough. Focus on indexing the right fields, tuning the tokenizer for your dataset, and keeping the sync precise.

Top comments (1)

Hayrullah Kar • Jun 27

Using FTS5 with external content tables is such an underrated pattern for low-dependency stacks. Keeping it lean with SQLite instead of pulling in Elasticsearch for a sub-million row dataset is top-tier pragmatism.