Jeffrey.Feillp
How Tian AI Builds a Million-Entry Knowledge Base on Your Phone

SQLite at Scale: Million-Entry Knowledge Base

Tian AI demonstrates that you don't need a cloud database to build a powerful knowledge base. Using SQLite with FTS5 full-text search and custom optimizations, we achieve sub-0.05 second retrieval across a million entries -- all on your phone.

Data Generation Strategy

# Synthetic data generation with realistic patterns
import json

entries = []
for i in range(1_000_000):
    title = generate_title(i)        # project-specific helpers
    content = generate_content(i)
    tags = random_tags(2, 5)         # 2-5 random tags per entry
    entries.append((title, content, json.dumps(tags)))

The key insight: batch-insert 10,000 rows at a time with executemany(), wrapped in explicit transactions. This cuts per-insert overhead from roughly 10ms to about 0.02ms.
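A minimal sketch of that batching pattern. The table schema is illustrative, the row count is scaled down, and `isolation_level=None` is used so the script (rather than Python's sqlite3 driver) controls transaction boundaries:

```python
import sqlite3

def bulk_insert(conn, rows, batch_size=10_000):
    # One COMMIT per 10k-row batch instead of one per row:
    # the fsync cost is amortized across the whole batch.
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.execute("BEGIN")
        cur.executemany(
            "INSERT INTO knowledge_base (title, content, tags) VALUES (?, ?, ?)",
            rows[start:start + batch_size],
        )
        cur.execute("COMMIT")

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we manage transactions
conn.execute(
    "CREATE TABLE knowledge_base (id INTEGER PRIMARY KEY, title TEXT, content TEXT, tags TEXT)"
)
rows = [(f"title {i}", f"content {i}", "[]") for i in range(25_000)]
bulk_insert(conn, rows)
print(conn.execute("SELECT COUNT(*) FROM knowledge_base").fetchone()[0])  # 25000
```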

FTS5 with Chinese Text Segmentation

Chinese text doesn't have spaces between words, making full-text search challenging. The solution uses jieba for tokenization:

import jieba

def chinese_fts5_tokenize(text):
    words = jieba.cut(text, cut_all=False)
    return ' '.join(words)

For each entry, we store both the raw text AND a space-separated tokenized version, allowing FTS5 to match Chinese terms effectively.
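A self-contained sketch of the dual-column approach. To keep it dependency-free, a crude per-character segmenter stands in for jieba (the real pipeline would use `' '.join(jieba.cut(text))`), and the table names are illustrative:

```python
import sqlite3

def tokenize(text):
    # Stand-in for jieba: split every character with spaces so FTS5's
    # default tokenizer sees each Chinese character as a separate term.
    return ' '.join(text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kb (id INTEGER PRIMARY KEY, content TEXT, content_seg TEXT)")
conn.execute(
    "CREATE VIRTUAL TABLE kb_fts USING fts5(content_seg, content='kb', content_rowid='id')"
)

text = "数据库全文搜索"  # raw text is stored untouched...
conn.execute("INSERT INTO kb (content, content_seg) VALUES (?, ?)", (text, tokenize(text)))
# ...while the segmented copy is what FTS5 indexes
conn.execute("INSERT INTO kb_fts(rowid, content_seg) SELECT id, content_seg FROM kb")

# Queries are tokenized the same way before matching
hits = conn.execute(
    "SELECT rowid FROM kb_fts WHERE kb_fts MATCH ?", (tokenize("搜索"),)
).fetchall()
print(hits)  # [(1,)]
```

The essential point is symmetry: whatever segmentation is applied at index time must also be applied to the query string.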

Index Optimization

CREATE INDEX idx_knowledge_timestamp ON knowledge_base(created_at);
CREATE INDEX idx_knowledge_tags ON knowledge_base(tags, id);
CREATE VIRTUAL TABLE knowledge_fts USING fts5(title, content, tags, content=knowledge_base);
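One caveat worth noting: `content=knowledge_base` makes `knowledge_fts` an external-content table, which SQLite does not keep in sync automatically. A hedged sketch of the standard trigger pattern (the trigger name is mine; a matching DELETE trigger would be needed in practice):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knowledge_base (
    id INTEGER PRIMARY KEY, title TEXT, content TEXT, tags TEXT, created_at TEXT
);
CREATE INDEX idx_knowledge_timestamp ON knowledge_base(created_at);
CREATE VIRTUAL TABLE knowledge_fts USING fts5(title, content, tags, content=knowledge_base);
-- Mirror base-table inserts into the external-content FTS index
CREATE TRIGGER knowledge_ai AFTER INSERT ON knowledge_base BEGIN
  INSERT INTO knowledge_fts(rowid, title, content, tags)
  VALUES (new.id, new.title, new.content, new.tags);
END;
""")

conn.execute(
    "INSERT INTO knowledge_base (title, content, tags, created_at) VALUES (?, ?, ?, ?)",
    ("SQLite tips", "FTS5 indexing notes", "[]", "2024-01-01"),
)
rows = conn.execute(
    "SELECT rowid FROM knowledge_fts WHERE knowledge_fts MATCH 'indexing'"
).fetchall()
print(rows)  # [(1,)]
```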

Performance Tuning

  1. Page size: PRAGMA page_size=4096 for better read performance (must be set before the first table is written)
  2. Cache: PRAGMA cache_size=-8000 (negative values are in KiB, so an ~8MB page cache)
  3. MMAP: PRAGMA mmap_size=268435456 (256MB memory-mapped I/O)
  4. WAL mode: PRAGMA journal_mode=WAL so readers don't block the writer
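The four settings above can be applied in one place at connection time; WAL requires a file-backed database, so this sketch uses a temp file:

```python
import os
import sqlite3
import tempfile

def tune(conn):
    # Apply the PRAGMAs listed above (values as in the article)
    conn.execute("PRAGMA page_size=4096")       # before the first write
    conn.execute("PRAGMA cache_size=-8000")     # negative = KiB, ~8MB cache
    conn.execute("PRAGMA mmap_size=268435456")  # 256MB memory-mapped I/O
    conn.execute("PRAGMA journal_mode=WAL")     # concurrent readers

path = os.path.join(tempfile.mkdtemp(), "kb.db")
conn = sqlite3.connect(path)  # WAL needs a real file, not :memory:
tune(conn)
print(conn.execute("PRAGMA journal_mode").fetchone()[0])  # wal
```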

Retrieval Speed

  • Single lookup by ID: 0.001s
  • FTS5 search (top 10): 0.02s
  • Complex join query: 0.04s
  • Full text search across 1M entries: 0.04s
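Numbers like these are easy to reproduce with `time.perf_counter()`; a scaled-down sketch (100k in-memory rows instead of 1M on-device, so the absolute timing will differ):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kb (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO kb (content) VALUES (?)",
    [(f"entry {i}",) for i in range(100_000)],
)

# Time a single primary-key lookup
start = time.perf_counter()
row = conn.execute("SELECT content FROM kb WHERE id = ?", (54321,)).fetchone()
elapsed = time.perf_counter() - start
print(row, f"{elapsed:.6f}s")  # rowid lookups are O(log n) on the B-tree
```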

The entire knowledge base occupies only 380MB on disk, making it perfectly viable for mobile deployment.

Architecture

The knowledge base integrates seamlessly with the Thinker module's Deep mode -- when the LLM needs factual context, the KB retrieves relevant entries, formats them as context, and injects them into the prompt. The entire pipeline completes in under 100ms.
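A sketch of what that retrieve-format-inject step might look like. The `build_prompt` helper, prompt layout, and table names are mine, not Tian AI's actual code, and a real Chinese query would be run through jieba first:

```python
import sqlite3

def build_prompt(conn, question, k=3):
    # Retrieve the top-k entries by FTS5 relevance (rank = bm25),
    # then format them as bullet-point context for the LLM prompt.
    hits = conn.execute(
        "SELECT title, content FROM kb_fts WHERE kb_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (question, k),
    ).fetchall()
    context = "\n".join(f"- {title}: {content}" for title, content in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE kb_fts USING fts5(title, content)")
conn.executemany("INSERT INTO kb_fts VALUES (?, ?)", [
    ("WAL mode", "journal_mode=WAL allows concurrent readers"),
    ("Cache", "cache_size controls the page cache"),
])
prompt = build_prompt(conn, "WAL")
print(prompt)
```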
