Tim Ren

Posted on Mar 17

How I Built an Open-Source Search Engine for 440+ Buddhist Text Sources in 30 Languages

#opensource #elasticsearch #rag #knowledgegraph

The Problem

Buddhist texts are scattered across hundreds of databases worldwide: CBETA (Chinese Buddhist Canon), SuttaCentral (Pali texts), 84000 (Tibetan translations), BDRC (manuscripts), GRETIL (Sanskrit), and 430+ more. Each has its own interface, languages, and data formats.

Researchers spend more time finding texts than reading them.

I wanted to fix that.

What I Built

FoJin (佛津) is an open-source platform that aggregates these sources into one unified interface. Think of it as "Google Scholar for Buddhist texts," with features general-purpose search engines lack, such as parallel reading, a knowledge graph, and AI answers with source citations.

Live demo: fojin.app
Source code: github.com/xr843/fojin

Key Features

Multilingual full-text search across 440+ sources using Elasticsearch with the ICU tokenizer (handles Chinese, Sanskrit, Pali, Tibetan, Japanese, Korean, and 24 more languages)
4,488 fascicles available for online reading
Parallel reading — compare translations side-by-side in 30 languages
Knowledge graph — 9,600+ entities and 3,800+ relationships, visualized with interactive force-directed graphs
RAG-based AI Q&A — ask questions and get answers grounded in ~11 million characters of canonical texts, always with source citations
6 dictionaries — 237,593 entries across Chinese, Sanskrit, Pali, and English
IIIF manuscript viewer — browse digitized manuscripts from BDRC and other institutions

Tech Stack

Here's what powers FoJin under the hood:

Layer	Technology
Frontend	React 18 + TypeScript + Vite + Ant Design 5
State	Zustand + TanStack Query
Backend	FastAPI + SQLAlchemy (async) + Pydantic v2
Database	PostgreSQL 15 + pgvector + pg_trgm
Search	Elasticsearch 8 with ICU tokenizer
Cache	Redis 7
AI/RAG	Dify + dual retrieval (keyword + semantic)
Deployment	Docker Compose + Nginx + GitHub Actions CI

The Hardest Part: Multilingual Search

Searching across Buddhist texts isn't like searching English documents. You're dealing with:

Classical Chinese (no spaces, no punctuation in original texts)
Sanskrit (complex compound words, multiple romanization schemes)
Pali (diacritical marks: ā, ī, ū, ṃ, ṇ, ṭ)
Tibetan (unique script with syllable markers)
Japanese (mixed kanji + kana)
Korean (Hangul + Hanja)

The solution: Elasticsearch's ICU tokenizer with language-specific analyzers. Each language gets its own analysis chain, and queries are normalized so that diacritical variants match (searching "nirvana" also finds "nirvāṇa"), with cross-language equivalents such as "涅槃" returned as well.
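
To make the normalization idea concrete, here is a minimal Python sketch. It is not FoJin's actual configuration. The settings dict uses the `icu_tokenizer` and `icu_folding` components from Elasticsearch's analysis-icu plugin; the analyzer name `buddhist_text` and the helper `fold_diacritics` are hypothetical names chosen for illustration. The helper only approximates, in plain Python, what folding does to romanized Pali and Sanskrit.

```python
import unicodedata

# Hypothetical index settings: the analyzer name "buddhist_text" is an assumption.
# "icu_tokenizer" and "icu_folding" come from the analysis-icu plugin.
INDEX_SETTINGS = {
    "analysis": {
        "analyzer": {
            "buddhist_text": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_folding"],
            }
        }
    }
}


def fold_diacritics(text: str) -> str:
    """Approximate diacritic folding: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


print(fold_diacritics("nirvāṇa"))  # nirvana
```

Folding of this kind only covers diacritics. Matching "涅槃" from a Latin-script query would need a separate term mapping (for example a synonym list), which this sketch does not attempt.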

RAG Over Ancient Texts

The AI Q&A feature ("XiaoJin") uses Retrieval-Augmented Generation to answer questions about Buddhist philosophy, history, and textual analysis.

Why RAG matters here: Buddhist studies demand accuracy. You can't have an AI hallucinate a sutra citation. Every answer must be grounded in actual canonical texts with traceable sources.

The pipeline:

User asks a question
Dual retrieval: keyword search (BM25) + semantic search (pgvector embeddings)
Retrieved passages are fed to the LLM as context
LLM generates an answer with inline citations pointing to specific texts
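
The post does not say how the keyword and semantic result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as FoJin's method. The function name, the constant `k = 60`, and the single-letter passage IDs are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of passage IDs; each list adds 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical passage IDs standing in for BM25 and pgvector results.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Passages that rank well in both lists ("a" and "b" here) rise to the top, which suits a pipeline where the fused passages become the LLM's context. RRF also avoids having to put BM25 scores and vector distances on a common scale.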

The corpus covers ~11 million characters of canonical texts across multiple languages.

Knowledge Graph

The knowledge graph maps relationships between Buddhist concepts:

People → teachers, translators, historical figures
Texts → sutras, commentaries, treatises
Schools → Theravada, Mahayana, Vajrayana traditions
Monasteries → historical and active institutions
Concepts → philosophical terms and doctrines

The graph holds 9,600+ entities connected by 3,800+ typed relationships and is rendered as an interactive force-directed graph using D3.js.
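
As a rough illustration of how typed entities and relationships can be shaped for a force-directed layout, here is a sketch in Python. The class names, field names, and IDs are assumptions, not FoJin's schema. The output uses the `nodes` / `links` shape with `source` and `target` keys that D3's force-link layout can consume when configured with an `id` accessor.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    kind: str  # e.g. "person", "text", "school"


@dataclass(frozen=True)
class Relation:
    source: str  # id of the source Entity
    target: str  # id of the target Entity
    type: str    # e.g. "translated"


def to_node_link(entities: list[Entity], relations: list[Relation]) -> dict:
    """Build {"nodes": [...], "links": [...]}, dropping links to unknown nodes."""
    known = {e.id for e in entities}
    return {
        "nodes": [asdict(e) for e in entities],
        "links": [
            asdict(r) for r in relations
            if r.source in known and r.target in known
        ],
    }


# Illustrative data with hypothetical IDs.
entities = [
    Entity("person:kumarajiva", "Kumārajīva", "person"),
    Entity("text:diamond-sutra", "Diamond Sutra", "text"),
]
relations = [
    Relation("person:kumarajiva", "text:diamond-sutra", "translated"),
    Relation("person:kumarajiva", "text:missing", "translated"),  # dangling
]
graph = to_node_link(entities, relations)
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

Filtering dangling links before serialization keeps the layout from failing on a relationship whose endpoint was never loaded.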

Self-Hosting

FoJin is fully self-hostable with Docker Compose:

git clone https://github.com/xr843/fojin.git
cd fojin
cp .env.example .env
docker compose up -d

Visit http://localhost:3000 and you're done.

What I Learned

Multilingual NLP is hard. ICU tokenization solved 80% of the problem, but edge cases in Classical Chinese and Sanskrit required custom analyzers.
RAG needs domain-specific tuning. Generic chunking strategies don't work well for ancient texts with verse-prose mixed formats. I had to design custom chunking that respects the structural units of Buddhist canons.
Knowledge graphs are underrated. The force-directed visualization became one of the most-used features — researchers love exploring conceptual relationships visually.
Docker Compose is the right deployment choice for academic tools. Researchers aren't DevOps engineers. One command should get everything running.
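
The post does not describe the custom chunking itself, so the following is only a sketch of the general idea from the RAG lesson above. It assumes structural units (a verse, a prose paragraph) are separated by blank lines in the source text, and it packs whole units into chunks without ever splitting one. The function name and the `max_chars` budget are hypothetical.

```python
def chunk_by_units(text: str, max_chars: int = 500) -> list[str]:
    """Pack blank-line-separated units into chunks without splitting any unit."""
    units = [u.strip() for u in text.split("\n\n") if u.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for unit in units:
        # Start a new chunk if adding this unit would exceed the budget.
        if current and size + len(unit) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        # A unit longer than max_chars still becomes its own (oversized) chunk.
        current.append(unit)
        size += len(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Placeholder text: one prose unit, one two-line verse, one prose unit.
sample = "prose one\n\nverse line 1\nverse line 2\n\nprose two"
for chunk in chunk_by_units(sample, max_chars=30):
    print(repr(chunk))
```

With a budget of 30 characters the verse stays intact as its own chunk instead of being cut between its two lines, which is the property a fixed-size splitter would not guarantee.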

Get Involved

FoJin is Apache 2.0 licensed and actively looking for contributors:

Translate the UI — we need Japanese, Korean, Thai, Vietnamese, and more (good first issues)
Add data sources — know a Buddhist text database we're missing?
Improve search — better tokenization, ranking, or multilingual support
Frontend/Backend — dark mode, mobile responsiveness, citation export

If you're interested in NLP, search engines, knowledge graphs, or digital humanities, I'd love to hear your feedback. Drop a comment or star the repo!

DEV Community