How Tian AI Builds a Million-Entry Knowledge Base on Your Phone
Tian AI ships with a massive local knowledge base — millions of indexed concepts across 100+ domains, stored in a single SQLite file and searchable in roughly 0.04-0.1 seconds.
The Problem
Large language models like GPT-4 store knowledge in their weights. Smaller local models (around 1.5B parameters) simply cannot hold as much. The solution: augment the local LLM with an external knowledge base it can retrieve from at answer time.
The Architecture
User Query → KnowledgeRetriever → Confidence > 0.8?
                                      │ Yes → Direct Answer
                                      ↓ No
                          Inject context into LLM prompt
                                      ↓
                          LLM generates augmented response
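The routing above can be sketched in a few lines. The 0.8 threshold and the `KnowledgeRetriever` role come from the diagram; the method names (`search`, `generate`), the `Hit` type, and the prompt format are illustrative assumptions, not Tian AI's actual code:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # threshold from the architecture diagram


@dataclass
class Hit:
    """A knowledge-base match (hypothetical shape)."""
    answer: str
    confidence: float  # 0.0-1.0 retrieval score


def answer_query(query, retriever, llm):
    """Route a query: answer directly on a confident KB hit,
    otherwise inject the retrieved context into the LLM prompt."""
    hit = retriever.search(query)  # hypothetical retriever API
    if hit and hit.confidence > CONFIDENCE_THRESHOLD:
        return hit.answer  # direct answer, no LLM call needed
    context = hit.answer if hit else ""
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)  # augmented response
```

The key design point is that a high-confidence match skips the LLM entirely, which is what makes sub-100ms answers possible on-device.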
Database Schema
CREATE TABLE IF NOT EXISTS concepts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
concept TEXT NOT NULL,
category TEXT,
response_template TEXT,
question_patterns TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS concepts_fts USING fts5(
concept, category, response_template, question_patterns
);
Each concept stores:
- The concept name (e.g., "artificial intelligence")
- Category (e.g., "technology")
- Response template (the knowledge content)
- 30 question patterns for flexible retrieval
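A minimal sketch of writing one concept row with Python's stdlib `sqlite3`, using the schema above. How the question patterns are serialized inside the `question_patterns` TEXT column isn't shown in the article; storing them as a JSON array is an assumption here:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the real KB is a single on-disk file
conn.executescript("""
CREATE TABLE IF NOT EXISTS concepts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    concept TEXT NOT NULL,
    category TEXT,
    response_template TEXT,
    question_patterns TEXT
);
""")

# The real KB stores 30 patterns per concept; two shown for brevity.
patterns = ["what is artificial intelligence",
            "define artificial intelligence"]

conn.execute(
    "INSERT INTO concepts (concept, category, response_template, question_patterns) "
    "VALUES (?, ?, ?, ?)",
    ("artificial intelligence", "technology",
     "Artificial intelligence is the simulation of human intelligence by machines.",
     json.dumps(patterns)),  # assumption: patterns serialized as JSON
)
conn.commit()
row = conn.execute("SELECT concept, category FROM concepts").fetchone()
```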
Batch Generation Strategy
Building millions of entries requires careful batch processing:
- No primary key on batch insert — omitting the `id` column lets AUTOINCREMENT assign keys, so plain `INSERT` never hits a key conflict and `INSERT OR IGNORE` is unnecessary
- Chinese tokenization — split text into single characters (each Chinese character is a token) instead of tokenizing with the regex `r'[\w]+'`; in Python `\w` matches Chinese characters, so that regex would clump an entire Chinese phrase into one unsearchable token
- Index after insert — build the FTS5 index only after all data is loaded, rather than maintaining it row by row during the bulk insert
Retrieval Performance
| Metric | Value |
|---|---|
| Query time | 0.04-0.1s |
| Database size | ~34GB (indexed) |
| Concepts | Millions |
| Domains | 100+ |
| Question patterns per concept | 30 |
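A lookup like the one timed above boils down to a single FTS5 `MATCH` ranked by relevance. The sketch below shows the query shape and how such a timing could be measured; the synthetic 10,000-row table and the query string are illustrative, and absolute timings will vary by machine:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE concepts_fts USING fts5(
    concept, category, response_template, question_patterns);
""")
conn.executemany(
    "INSERT INTO concepts_fts VALUES (?, ?, ?, ?)",
    [(f"concept {i}", "technology", f"template {i}", f"patterns {i}")
     for i in range(10_000)])
conn.commit()

start = time.perf_counter()
# ORDER BY rank uses FTS5's built-in BM25 relevance ranking.
best = conn.execute(
    "SELECT concept, response_template FROM concepts_fts "
    "WHERE concepts_fts MATCH ? ORDER BY rank LIMIT 1",
    ("concept AND 42",)).fetchone()
elapsed = time.perf_counter() - start
```

Because FTS5 resolves the query against an inverted index rather than scanning rows, lookup time stays nearly flat as the table grows into the millions, which is what keeps queries in the 0.04-0.1s range.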
The Result
Even without a cloud connection, Tian AI can answer questions about science, technology, history, medicine, finance, and more — drawing from its local knowledge base rather than relying on model parameters.
Published on 2026-04-25 21:19 UTC by Tian AI Dev Team