The Problem
Buddhist texts are scattered across hundreds of databases worldwide — CBETA (Chinese Buddhist Canon), SuttaCentral (Pali texts), 84000 (Tibetan translations), BDRC (manuscripts), GRETIL (Sanskrit), and 430+ more. Each has different interfaces, languages, and data formats.
Researchers spend more time finding texts than reading them.
I wanted to fix that.
What I Built
FoJin (佛津) is an open-source platform that aggregates all of these into one unified interface. Think of it as "Google Scholar for Buddhist texts" — but with features no search engine offers.
Live demo: fojin.app
Source code: github.com/xr843/fojin
Key Features
- Multilingual full-text search across 440+ sources, using Elasticsearch with the ICU tokenizer (handles Chinese, Sanskrit, Pali, Tibetan, Japanese, Korean, and 24 more languages)
- 4,488 fascicles available for online reading
- Parallel reading — compare translations side-by-side in 30 languages
- Knowledge graph — 9,600+ entities and 3,800+ relationships, visualized with interactive force-directed graphs
- RAG-based AI Q&A — ask questions and get answers grounded in ~11 million characters of canonical texts, always with source citations
- 6 dictionaries — 237,593 entries across Chinese, Sanskrit, Pali, and English
- IIIF manuscript viewer — browse digitized manuscripts from BDRC and other institutions
Tech Stack
Here's what powers FoJin under the hood:
| Layer | Technology |
|---|---|
| Frontend | React 18 + TypeScript + Vite + Ant Design 5 |
| State | Zustand + TanStack Query |
| Backend | FastAPI + SQLAlchemy (async) + Pydantic v2 |
| Database | PostgreSQL 15 + pgvector + pg_trgm |
| Search | Elasticsearch 8 with ICU tokenizer |
| Cache | Redis 7 |
| AI/RAG | Dify + dual retrieval (keyword + semantic) |
| Deployment | Docker Compose + Nginx + GitHub Actions CI |
The Hardest Part: Multilingual Search
Searching across Buddhist texts isn't like searching English documents. You're dealing with:
- Classical Chinese (no spaces, no punctuation in original texts)
- Sanskrit (complex compound words, multiple romanization schemes)
- Pali (diacritical marks: ā, ī, ū, ṃ, ṇ, ṭ)
- Tibetan (unique script with syllable markers)
- Japanese (mixed kanji + kana)
- Korean (Hangul + Hanja)
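The solution: Elasticsearch's ICU tokenizer with language-specific analyzers. Each language gets its own analysis chain, and queries are normalized so that diacritical variants match (searching "nirvana" also finds "nirvāṇa"). Matching a cross-script equivalent such as "涅槃" goes beyond diacritic folding and relies on mapping equivalent terms across languages.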
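To make the normalization idea concrete, here is a minimal sketch rather than FoJin's actual configuration. The settings dict uses `icu_tokenizer` and `icu_folding` from Elasticsearch's ICU analysis plugin; the analyzer name `buddhist_text` is an assumption. The `fold_diacritics` helper is a hypothetical pure-Python approximation of what folding does to romanized Sanskrit and Pali, and it is narrower than the real filter.

```python
import unicodedata

# Hypothetical index settings: "buddhist_text" is an illustrative name,
# not taken from the FoJin repository.
INDEX_SETTINGS = {
    "analysis": {
        "analyzer": {
            "buddhist_text": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",  # script-aware word segmentation
                "filter": ["icu_folding"],     # case and diacritic folding
            }
        }
    }
}


def fold_diacritics(text: str) -> str:
    """Approximate diacritic folding: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


if __name__ == "__main__":
    for term in ["Nirvāṇa", "saṃsāra", "涅槃"]:
        print(term, "->", fold_diacritics(term))
```

Folding brings "nirvāṇa" and "nirvana" to the same token, but it leaves "涅槃" untouched, which is why cross-script matching needs a separate mapping layer.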
RAG Over Ancient Texts
The AI Q&A feature ("XiaoJin") uses Retrieval-Augmented Generation to answer questions about Buddhist philosophy, history, and textual analysis.
Why RAG matters here: Buddhist studies demand accuracy. You can't have an AI hallucinate a sutra citation. Every answer must be grounded in actual canonical texts with traceable sources.
The pipeline:
- User asks a question
- Dual retrieval: keyword search (BM25) + semantic search (pgvector embeddings)
- Retrieved passages are fed to the LLM as context
- LLM generates an answer with inline citations pointing to specific texts
The corpus covers ~11 million characters of canonical texts across multiple languages.
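The post does not say how the keyword and semantic result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below under that assumption. The function name, the `k=60` constant, and the passage IDs are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage IDs into one list.

    Each passage scores 1 / (k + rank) per list it appears in, so a
    passage ranked well by both retrievers rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # e.g. from a BM25 query
    semantic_hits = ["b", "d", "a"]  # e.g. from a pgvector similarity query
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

The fused list is what would be passed to the LLM as context. Keeping the passage IDs attached to each chunk is what makes inline citations possible.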
Knowledge Graph
The knowledge graph maps relationships among five types of entities:
- People → teachers, translators, historical figures
- Texts → sutras, commentaries, treatises
- Schools → Theravada, Mahayana, Vajrayana traditions
- Monasteries → historical and active institutions
- Concepts → philosophical terms and doctrines
In total, the graph holds 9,600+ entities connected by 3,800+ typed relationships, rendered as an interactive force-directed graph using D3.js.
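As a rough sketch of what "typed relationships" can look like in code, the example below models entities and relations and serializes them into the nodes/links shape commonly passed to a D3 force simulation. The class names, field names, and sample data are assumptions for illustration, not FoJin's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    kind: str  # e.g. "person", "text", "school", "monastery", "concept"


@dataclass(frozen=True)
class Relation:
    source: str  # Entity.id of the origin
    target: str  # Entity.id of the destination
    type: str    # e.g. "translated", "teacher_of"


def to_node_link(entities, relations):
    """Build a {"nodes": [...], "links": [...]} dict for a force-directed graph.

    Relations that point at unknown entity IDs are dropped so the
    front end never receives a dangling link.
    """
    known = {e.id for e in entities}
    nodes = [{"id": e.id, "label": e.label, "group": e.kind} for e in entities]
    links = [
        {"source": r.source, "target": r.target, "type": r.type}
        for r in relations
        if r.source in known and r.target in known
    ]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    entities = [
        Entity("p1", "Kumārajīva", "person"),
        Entity("t1", "Diamond Sutra", "text"),
    ]
    relations = [Relation("p1", "t1", "translated")]
    print(to_node_link(entities, relations))
```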
Self-Hosting
FoJin is fully self-hostable with Docker Compose:
```bash
git clone https://github.com/xr843/fojin.git
cd fojin
cp .env.example .env
docker compose up -d
```
Visit http://localhost:3000 and you're done.
What I Learned
Multilingual NLP is hard. ICU tokenization solved 80% of the problem, but edge cases in Classical Chinese and Sanskrit required custom analyzers.
RAG needs domain-specific tuning. Generic chunking strategies don't work well for ancient texts with verse-prose mixed formats. I had to design custom chunking that respects the structural units of Buddhist canons.
Knowledge graphs are underrated. The force-directed visualization became one of the most-used features — researchers love exploring conceptual relationships visually.
Docker Compose is the right deployment choice for academic tools. Researchers aren't DevOps engineers. One command should get everything running.
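To illustrate the chunking lesson above, here is a minimal sketch of structure-aware chunking. It assumes that structural units (a verse, a prose paragraph) are separated by blank lines, which may not match how FoJin's sources are actually formatted. The function name and the `max_chars` parameter are hypothetical.

```python
def chunk_by_units(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole structural units into chunks without splitting any unit.

    Units are assumed to be separated by blank lines. A unit longer than
    max_chars is kept intact as its own chunk rather than cut mid-verse.
    """
    units = [u.strip() for u in text.split("\n\n") if u.strip()]
    chunks: list[str] = []
    current = ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            # Adding this unit would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "prose paragraph one\n\nverse line a\nverse line b\n\nprose paragraph two"
    for chunk in chunk_by_units(sample, max_chars=40):
        print(repr(chunk))
```

The design choice is to treat the verse or paragraph as the atomic unit, so retrieval never returns half a verse. A fixed-size character window gives no such guarantee.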
Get Involved
FoJin is Apache 2.0 licensed and actively looking for contributors:
- Translate the UI — we need Japanese, Korean, Thai, Vietnamese, and more (good first issues)
- Add data sources — know a Buddhist text database we're missing?
- Improve search — better tokenization, ranking, or multilingual support
- Frontend/Backend — dark mode, mobile responsiveness, citation export
Links:
- Live: fojin.app
- GitHub: github.com/xr843/fojin
- Discord: discord.gg/76SZeuJekq
If you're interested in NLP, search engines, knowledge graphs, or digital humanities, I'd love to hear your feedback. Drop a comment or star the repo!