I Built an Open-Source Buddhist Text Search Engine — Here's What I Learned

Tim Ren — Sun, 22 Mar 2026 13:10:00 +0000

If you've ever tried researching Buddhist texts, you know the pain: scriptures are scattered across hundreds of databases worldwide, including CBETA, SuttaCentral, BDRC, and 84000. Searching a single topic means juggling a dozen sites, each with its own formats, languages, and uneven usability.

So I built FoJin (佛津) to bring it all together.

What is FoJin?

FoJin is an open-source platform that aggregates 504 data sources and 9,200+ Buddhist texts across 21 languages (Chinese, Sanskrit, Pali, Tibetan, and more) into a single, searchable interface.

🔗 Live: fojin.app
📦 Source: github.com/xr843/fojin (Apache 2.0)

Core Features

🔍 Multilingual Full-Text Search

Built on Elasticsearch with ICU tokenization, supporting CJK, Sanskrit, Pali, and Tibetan scripts. Filter by dynasty, category, and source.
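
Here's a minimal sketch of how a filtered full-text query like this could be expressed as an Elasticsearch request body. The field names (`content`, `dynasty`, `category`, `source`) and the helper `build_search_body` are assumptions for illustration, not FoJin's actual index mapping or code.

```python
from typing import Any, Optional


def build_search_body(
    text: str,
    dynasty: Optional[str] = None,
    category: Optional[str] = None,
    source: Optional[str] = None,
    size: int = 20,
) -> dict[str, Any]:
    """Build a request body: one full-text match plus optional exact filters.

    All field names are hypothetical, not FoJin's real mapping.
    """
    # Only include the filters the user actually selected.
    candidates = {"dynasty": dynasty, "category": category, "source": source}
    filters = [{"term": {field: value}} for field, value in candidates.items() if value]

    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"content": text}}],
                # "filter" clauses narrow results without affecting the score.
                "filter": filters,
            }
        },
    }


body = build_search_body("般若", dynasty="唐", source="CBETA")
```

Putting the dynasty, category, and source constraints in `filter` rather than `must` keeps them out of scoring, so ranking depends only on how well the text matches.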

📖 Parallel Reading

Compare different language translations of the same text side by side — for example, the Heart Sutra in Sanskrit, Chinese, Tibetan, and English.

🤖 AI Q&A (XiaoJin)

A RAG-powered assistant that answers questions by citing canonical sources. Anonymous users get a free quota, and you can bring your own API key (BYOK) for more.

🕸️ Knowledge Graph

9,600 entities and 3,800 relations, visualized with D3. Explore connections between texts, authors, schools, and concepts.

📚 Dictionary Integration

6 authoritative dictionaries with 230,000+ entries, accessible inline while reading.

📝 Academic Export

BibTeX, RIS, and APA citation formats for researchers.

Tech Stack

Frontend: React 18 + TypeScript + Ant Design 5
Backend: FastAPI + SQLAlchemy (async)
Database: PostgreSQL 15 (pgvector) + Elasticsearch 8 + Redis 7
Deploy: Docker Compose — one command to run everything

Why I Built This

I started FoJin as a personal tool. I was frustrated by how fragmented Buddhist digital resources are — hundreds of databases, each with its own search interface, its own text format, its own limitations.

I wanted one place where I could:

Search across all major collections at once
Read texts in multiple languages side by side
Ask questions and get answers grounded in primary sources

After using it privately for a while, I decided to open-source it in case it's useful to others in Buddhist studies or digital humanities.

Architecture Overview

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  React UI   │────▶│   FastAPI    │────▶│  PostgreSQL 15  │
│ Ant Design  │     │  async APIs  │     │   (pgvector)    │
└─────────────┘     └──────┬───────┘     └─────────────────┘
                           │
                    ┌──────┴───────┐
                    │              │
              ┌─────▼─────┐   ┌────▼────┐
              │   ES 8    │   │ Redis 7 │
              │ full-text │   │  cache  │
              └───────────┘   └─────────┘

Key technical decisions:

Elasticsearch + ICU plugin for proper CJK/Sanskrit/Tibetan tokenization
pgvector for semantic search embeddings
SSE streaming for AI chat responses
Docker Compose for one-command deployment
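
To make the SSE decision concrete, here's a small sketch of what streaming a chat answer as Server-Sent Events can look like. The event names, the JSON payload shape, and the helpers `format_sse` and `stream_answer` are assumptions for illustration; FoJin's actual wire format isn't described here.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable


def format_sse(data: dict, event: str = "message") -> str:
    """Encode one SSE frame: an event line, a data line, then a blank line."""
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"


async def stream_answer(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield one SSE frame per token, followed by a final 'done' frame."""
    for token in tokens:
        yield format_sse({"token": token})
        await asyncio.sleep(0)  # hand control back to the event loop between frames
    yield format_sse({}, event="done")


async def collect(tokens: Iterable[str]) -> list[str]:
    """Drain the stream into a list (stands in for a client reading the response)."""
    return [frame async for frame in stream_answer(tokens)]


frames = asyncio.run(collect(["Form", " is", " emptiness"]))
```

In a FastAPI route, a generator like `stream_answer` could be returned via `StreamingResponse(..., media_type="text/event-stream")`, which lets the browser render tokens as they arrive instead of waiting for the full answer.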

Get Involved

FoJin is actively maintained and welcomes contributions:

🐛 Report issues
🌐 Help with translations (currently supporting 8 UI languages)
📚 Suggest new data sources to integrate

If you work in Buddhist studies or digital humanities, or just find this interesting, give it a try at fojin.app and let me know what you think!

⭐ Star the repo if you find it useful: github.com/xr843/fojin

How I Built an Open-Source Search Engine for 440+ Buddhist Text Sources in 30 Languages

Tim Ren — Tue, 17 Mar 2026 04:06:34 +0000

The Problem

Buddhist texts are scattered across hundreds of databases worldwide — CBETA (Chinese Buddhist Canon), SuttaCentral (Pali texts), 84000 (Tibetan translations), BDRC (manuscripts), GRETIL (Sanskrit), and 430+ more. Each has different interfaces, languages, and data formats.

Researchers spend more time finding texts than reading them.

I wanted to fix that.

What I Built

FoJin (佛津) is an open-source platform that aggregates all of these into one unified interface. Think of it as "Google Scholar for Buddhist texts", with features a general-purpose search engine doesn't offer, such as parallel reading, a knowledge graph, and source-cited AI answers.

Live demo: fojin.app
Source code: github.com/xr843/fojin

Key Features

Multilingual full-text search across 440+ sources using Elasticsearch with ICU tokenizer (handles Chinese, Sanskrit, Pali, Tibetan, Japanese, Korean, and 24 more languages)
4,488 fascicles available for online reading
Parallel reading — compare translations side-by-side in 30 languages
Knowledge graph — 9,600+ entities and 3,800+ relationships, visualized with interactive force-directed graphs
RAG-based AI Q&A — ask questions and get answers grounded in ~11 million characters of canonical texts, always with source citations
6 dictionaries — 237,593 entries across Chinese, Sanskrit, Pali, and English
IIIF manuscript viewer — browse digitized manuscripts from BDRC and other institutions

Tech Stack

Here's what powers FoJin under the hood:

Layer	Technology
Frontend	React 18 + TypeScript + Vite + Ant Design 5
State	Zustand + TanStack Query
Backend	FastAPI + SQLAlchemy (async) + Pydantic v2
Database	PostgreSQL 15 + pgvector + pg_trgm
Search	Elasticsearch 8 with ICU tokenizer
Cache	Redis 7
AI/RAG	Dify + dual retrieval (keyword + semantic)
Deployment	Docker Compose + Nginx + GitHub Actions CI

The Hardest Part: Multilingual Search

Searching across Buddhist texts isn't like searching English documents. You're dealing with:

Classical Chinese (no spaces, no punctuation in original texts)
Sanskrit (complex compound words, multiple romanization schemes)
Pali (diacritical marks: ā, ī, ū, ṃ, ṇ, ṭ)
Tibetan (unique script with syllable markers)
Japanese (mixed kanji + kana)
Korean (Hangul + Hanja)

The solution: Elasticsearch's ICU tokenizer with language-specific analyzers. Each language gets its own analysis chain, and queries are normalized to handle diacritical variations, so searching "nirvana" also finds "nirvāṇa". Cross-language equivalents are matched as well, so the same search also surfaces "涅槃".
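
As a rough sketch of the idea, the snippet below shows index settings that combine the `icu_tokenizer` with the `icu_folding` token filter (both provided by Elasticsearch's analysis-icu plugin), plus a pure-Python stand-in for diacritic folding so the effect is easy to see. The analyzer name `buddhist_text` and the function `fold_diacritics` are made up for this example, and FoJin's real analyzers are not shown in this post.

```python
import unicodedata

# Hypothetical index settings; they require the analysis-icu plugin.
ICU_INDEX_SETTINGS = {
    "analysis": {
        "analyzer": {
            "buddhist_text": {  # analyzer name is invented for this example
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_folding"],
            }
        }
    }
}


def fold_diacritics(term: str) -> str:
    """Approximate folding for romanized text: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", term)
    # Category "Mn" covers the combining macrons and dots used in IAST-style romanization.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped).casefold()


folded = fold_diacritics("nirvāṇa")
```

The Python function is only meant for Latin-script romanizations; it would damage scripts such as Tibetan, where combining marks carry vowels. Folding also does not link "nirvana" to "涅槃". That kind of cross-script match needs something extra, such as a synonym list, and how FoJin handles it is not covered here.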

RAG Over Ancient Texts

The AI Q&A feature ("XiaoJin") uses Retrieval-Augmented Generation to answer questions about Buddhist philosophy, history, and textual analysis.

Why RAG matters here: Buddhist studies demand accuracy. You can't have an AI hallucinate a sutra citation. Every answer must be grounded in actual canonical texts with traceable sources.

The pipeline:

User asks a question
Dual retrieval: keyword search (BM25) + semantic search (pgvector embeddings)
Retrieved passages are fed to the LLM as context
LLM generates an answer with inline citations pointing to specific texts
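
The post doesn't say how the keyword and semantic result lists are combined before they reach the LLM. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as FoJin's method. The passage IDs are invented, and `k = 60` is just the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of passage IDs into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


bm25_hits = ["T0251:1", "T0235:4", "T1509:12"]   # made-up IDs from keyword search
vector_hits = ["T0235:4", "T0220:7", "T0251:1"]  # made-up IDs from semantic search
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only looks at ranks, so it sidesteps the problem that BM25 scores and vector distances live on different scales. Passages that both retrievers agree on rise to the top.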

The corpus covers ~11 million characters of canonical texts across multiple languages.

Knowledge Graph

The knowledge graph maps relationships between Buddhist concepts:

People → teachers, translators, historical figures
Texts → sutras, commentaries, treatises
Schools → Theravada, Mahayana, Vajrayana traditions
Monasteries → historical and active institutions
Concepts → philosophical terms and doctrines

9,600+ entities connected by 3,800+ typed relationships, rendered as an interactive force-directed graph using D3.js.
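
For a sense of the data shape involved, here's a minimal sketch that turns entities and typed relationships into the `{nodes, links}` structure a D3 force layout typically consumes. The `Entity` and `Relation` classes, their field names, and the sample records are assumptions for illustration, not FoJin's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    kind: str  # e.g. "person", "text", "school"


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    type: str  # e.g. "translated", "belongs_to"


def to_d3_graph(entities: list[Entity], relations: list[Relation]) -> dict:
    """Convert entities and typed relations into a {nodes, links} dictionary."""
    known = {e.id for e in entities}
    return {
        "nodes": [{"id": e.id, "label": e.label, "group": e.kind} for e in entities],
        # Skip relations that point at entities missing from the node list.
        "links": [
            {"source": r.source, "target": r.target, "type": r.type}
            for r in relations
            if r.source in known and r.target in known
        ],
    }


entities = [
    Entity("kumarajiva", "Kumārajīva", "person"),
    Entity("diamond_sutra", "Diamond Sutra", "text"),
]
relations = [
    Relation("kumarajiva", "diamond_sutra", "translated"),
    Relation("kumarajiva", "unknown_text", "translated"),  # dropped: unknown target
]
graph = to_d3_graph(entities, relations)
```

On the frontend, `d3.forceLink(links).id(d => d.id)` lets links refer to nodes by these string IDs rather than by array index.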

Self-Hosting

FoJin is fully self-hostable with Docker Compose:

git clone https://github.com/xr843/fojin.git
cd fojin
cp .env.example .env
docker compose up -d

Visit http://localhost:3000 and you're done.

What I Learned

Multilingual NLP is hard. ICU tokenization solved 80% of the problem, but edge cases in Classical Chinese and Sanskrit required custom analyzers.
RAG needs domain-specific tuning. Generic chunking strategies don't work well for ancient texts with verse-prose mixed formats. I had to design custom chunking that respects the structural units of Buddhist canons.
Knowledge graphs are underrated. The force-directed visualization became one of the most-used features — researchers love exploring conceptual relationships visually.
Docker Compose is the right deployment choice for academic tools. Researchers aren't DevOps engineers. One command should get everything running.
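
The custom chunking mentioned above isn't shown in this post. As a simplified sketch of the general idea, the function below packs whole structural units into chunks without ever splitting one. It assumes units (a verse, a prose paragraph) are separated by blank lines, which is a stand-in for whatever structural markup the real canons use; `chunk_by_units` and `max_chars` are hypothetical names.

```python
def chunk_by_units(text: str, max_chars: int = 500) -> list[str]:
    """Pack blank-line-separated units into chunks of roughly max_chars.

    A unit is never split; a unit longer than max_chars becomes its own chunk.
    The budget counts unit text only, not the separators between units.
    """
    units = [u.strip() for u in text.split("\n\n") if u.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        # Start a new chunk if adding this unit would exceed the budget.
        if current and current_len + len(unit) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(unit)
        current_len += len(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "prose one\n\nverse line a\nverse line b\n\nprose two"
chunks = chunk_by_units(sample, max_chars=30)
```

Compared with fixed-size windows, this keeps a verse intact inside a single chunk, so a retrieved passage never starts or ends in the middle of a stanza.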

Get Involved

FoJin is Apache 2.0 licensed and actively looking for contributors:

Translate the UI — we need Japanese, Korean, Thai, Vietnamese, and more (good first issues)
Add data sources — know a Buddhist text database we're missing?
Improve search — better tokenization, ranking, or multilingual support
Frontend/Backend — dark mode, mobile responsiveness, citation export

If you're interested in NLP, search engines, knowledge graphs, or digital humanities, I'd love to hear your feedback. Drop a comment or star the repo!

DEV Community: Tim Ren
