How Tian AI Builds a Million-Entry Knowledge Base on Your Phone
Tian AI ships with a massive local knowledge base — millions of indexed concepts across 100+ domains, stored in a single SQLite file and searchable in roughly 0.04-0.1 seconds.
The Problem
Large language models like GPT-4 store knowledge in their weights. Smaller local models (around 1.5B parameters) simply cannot hold as much. The solution: augment the local LLM with an external knowledge base it can retrieve from at answer time.
The Architecture
User Query → KnowledgeRetriever → Confidence > 0.8?
                                      │ Yes → Direct Answer
                                      ↓ No
                          Inject context into LLM prompt
                                      ↓
                          LLM generates augmented response
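The routing above can be sketched in a few lines. The 0.8 threshold and the `KnowledgeRetriever` role come from the diagram; the method names (`search`, `generate`), the `Hit` type, and the prompt format are illustrative assumptions, not Tian AI's actual code:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # threshold from the architecture diagram


@dataclass
class Hit:
    """A knowledge-base match (hypothetical shape)."""
    answer: str
    confidence: float  # 0.0-1.0 retrieval score


def answer_query(query, retriever, llm):
    """Route a query: answer directly on a confident KB hit,
    otherwise inject the retrieved context into the LLM prompt."""
    hit = retriever.search(query)  # hypothetical retriever API
    if hit and hit.confidence > CONFIDENCE_THRESHOLD:
        return hit.answer  # direct answer, no LLM call needed
    context = hit.answer if hit else ""
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)  # augmented response
```

The key design point is that a high-confidence match skips the LLM entirely, which is what makes sub-100ms answers possible on-device.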
Database Schema
CREATE TABLE IF NOT EXISTS concepts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
concept TEXT NOT NULL,
category TEXT,
response_template TEXT,
question_patterns TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS concepts_fts USING fts5(
concept, category, response_template, question_patterns
);
Each concept stores:
- The concept name (e.g., "artificial intelligence")
- Category (e.g., "technology")
- Response template (the knowledge content)
- 30 question patterns for flexible retrieval
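A minimal sketch of writing one concept row with Python's stdlib `sqlite3`, using the schema above. How the question patterns are serialized inside the `question_patterns` TEXT column isn't shown in the article; storing them as a JSON array is an assumption here:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the real KB is a single on-disk file
conn.executescript("""
CREATE TABLE IF NOT EXISTS concepts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    concept TEXT NOT NULL,
    category TEXT,
    response_template TEXT,
    question_patterns TEXT
);
""")

# The real KB stores 30 patterns per concept; two shown for brevity.
patterns = ["what is artificial intelligence",
            "define artificial intelligence"]

conn.execute(
    "INSERT INTO concepts (concept, category, response_template, question_patterns) "
    "VALUES (?, ?, ?, ?)",
    ("artificial intelligence", "technology",
     "Artificial intelligence is the simulation of human intelligence by machines.",
     json.dumps(patterns)),  # assumption: patterns serialized as JSON
)
conn.commit()
row = conn.execute("SELECT concept, category FROM concepts").fetchone()
```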
Batch Generation Strategy
Building millions of entries requires careful batch processing:
- No primary key on batch insert — omitting the `id` column lets AUTOINCREMENT assign keys, so plain `INSERT` never hits a key conflict and `INSERT OR IGNORE` is unnecessary
- Chinese tokenization — split text into single characters (each Chinese character is a token) instead of tokenizing with the regex `r'[\w]+'`; in Python `\w` matches Chinese characters, so that regex would clump an entire Chinese phrase into one unsearchable token
- Index after insert — build the FTS5 index only after all data is loaded, rather than maintaining it row by row during the bulk insert
Retrieval Performance
| Metric | Value |
|---|---|
| Query time | 0.04-0.1s |
| Database size | ~34GB (indexed) |
| Concepts | Millions |
| Domains | 100+ |
| Question patterns per concept | 30 |
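A lookup like the one timed above boils down to a single FTS5 `MATCH` ranked by relevance. The sketch below shows the query shape and how such a timing could be measured; the synthetic 10,000-row table and the query string are illustrative, and absolute timings will vary by machine:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE concepts_fts USING fts5(
    concept, category, response_template, question_patterns);
""")
conn.executemany(
    "INSERT INTO concepts_fts VALUES (?, ?, ?, ?)",
    [(f"concept {i}", "technology", f"template {i}", f"patterns {i}")
     for i in range(10_000)])
conn.commit()

start = time.perf_counter()
# ORDER BY rank uses FTS5's built-in BM25 relevance ranking.
best = conn.execute(
    "SELECT concept, response_template FROM concepts_fts "
    "WHERE concepts_fts MATCH ? ORDER BY rank LIMIT 1",
    ("concept AND 42",)).fetchone()
elapsed = time.perf_counter() - start
```

Because FTS5 resolves the query against an inverted index rather than scanning rows, lookup time stays nearly flat as the table grows into the millions, which is what keeps queries in the 0.04-0.1s range.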
The Result
Even without a cloud connection, Tian AI can answer questions about science, technology, history, medicine, finance, and more — drawing from its local knowledge base rather than relying on model parameters.
Published on 2026-04-25 21:19 UTC by Tian AI Dev Team