Why your RAG chatbot fails in Thai — and how to fix it
A real-world walkthrough of how we built a customer service chatbot for a Thai e-commerce company — and the chunking problem nobody warns you about.
When I started building a RAG (Retrieval-Augmented Generation) chatbot for a Thai e-commerce company, I made the same mistake every developer makes: I copied the LangChain quickstart example, set chunk_size=500, and expected things to just work.
They didn't.
This is the story of why naive chunking fails for Thai text, what we built instead, and the full pipeline from PDF product manuals to chatbot answers — using Python, Qdrant, and OpenAI.
The Problem Nobody Warns You About
Most RAG tutorials are written with English in mind. The chunking logic looks like this:
# Works fine for English
chunks = text.split('. ')
# or
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
This works because English has clear word boundaries — spaces between every word. When you split on periods or character count, you still get coherent, searchable chunks.
Thai is completely different.
Thai has no spaces between words.
This sentence — "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ" — means "Our store has many product categories to choose from." But to a naive chunker, it looks like one enormous, unsplittable blob. There are 7 meaningful words in there, with zero whitespace between them.
Here's what happens when you embed that raw blob versus properly tokenized words:
| Input to embedding model | What it sees |
|---|---|
| `ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ` | One opaque token sequence |
| `ร้านค้า \| ของเรา \| มี \| สินค้า \| หลาย \| หมวดหมู่ \| ให้เลือกซื้อ` | Seven distinct semantic units |
The second form produces embeddings that actually capture the meaning of each concept — "store", "product", "category" — which leads to better retrieval when a user asks "มีสินค้าหมวดหมู่ไหนบ้าง" (what product categories are available?).
The Pipeline We Built
Here's the full architecture:
```
PDF product manuals / FAQ documents
        |
Python (PyMuPDF) → extract raw text
        |
Sentence splitting by '. '
        |
[Stored in MongoDB as raw sentences]
        |
Python → pythainlp tokenization
        |
OpenAI text-embedding-3-small
        |
Qdrant vector database (cosine similarity, 1536 dims)
        |
User query → tokenize → embed → search → top-7 chunks
        |
GPT-4o-mini + context → answer
```
Let's walk through each step with real code. Here are the dependencies we'll use:
```
# requirements.txt
pymupdf==1.27.2.2
pythainlp==5.2.0
openai==2.32.0
qdrant-client==1.17.1
pymongo==4.10.1
```
Step 1 — Extract Text from PDF
We use PyMuPDF (the fitz library) instead of PyPDF2 because it handles Thai character encoding much more reliably.
```python
# app/python/PdfToSentences.py
import re

import pymupdf as fitz  # PyMuPDF 1.27+ (legacy: import fitz)


def extract_sentences_from_pdf(pdf_path):
    pdf_file = fitz.open(pdf_path)
    text = ""
    for page in pdf_file:
        text += page.get_text("text")
    # Split on English period + space — works for mixed Thai/English documents
    sentences = [sentence.strip() for sentence in text.split('. ') if sentence.strip()]
    return sentences


def clean_text(text):
    cleaned_text = re.sub(r'\u2022', '', text)  # Remove bullet points
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text
```
Two things to note here:
Why PyMuPDF over PyPDF2? Thai PDF documents often use non-standard font encodings. PyMuPDF handles these much better — with PyPDF2 you'd frequently get garbled output or empty strings for Thai text blocks. Note: as of PyMuPDF 1.24+, the recommended import is import pymupdf (the old import fitz still works but is considered legacy).
Why split on . (period + space)? Our documents are mixed Thai/English — product names, SKUs, and technical specs are often in English, while descriptions are Thai. The period-space split is a pragmatic middle ground that preserves Thai paragraphs as single chunks rather than fragmenting them randomly at character 500.
⚠️ Limitation: Formal Thai text often ends paragraphs with a line break rather than a period. If your PDFs have no periods at all, `text.split('. ')` will return one giant chunk per page. In that case, use `pythainlp`'s sentence tokenizer instead:

```python
from pythainlp.tokenize import sent_tokenize

sentences = sent_tokenize(text, engine="crfcut")
```
Step 2 — Thai Word Tokenization Before Embedding
This is the most important step, and the one that differs most from English RAG.
Before sending any Thai text to the embedding model, we tokenize it with pythainlp:
```python
# thai_tokenizer.py
from pythainlp.tokenize import word_tokenize


def word_cut(text: str) -> str:
    tokens = word_tokenize(text, engine="newmm")
    # Join with pipe separator so the embedding model sees distinct units
    return "|".join(tokens)
```
pythainlp uses a dictionary-based approach (newmm engine) to segment Thai text into individual words:
```
Input:  "สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี"
Output: "สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี"
```
Now the embedding model sees four distinct semantic units instead of one long string. The cosine similarity between "ส่งฟรี" (free shipping) and a user's query "จัดส่งฟรีไหม" (is shipping free?) will be much higher and more meaningful after proper tokenization.
We also tried attacut (a neural-network-based engine in pythainlp) but settled on newmm for its speed and dictionary coverage — important when your domain includes product jargon and Thai promotional phrases like "ลดราคา", "ส่งฟรี", "ผ่อนชำระ".
Step 3 — Generate and Store Embeddings
We use OpenAI's text-embedding-3-small for embeddings — the current-generation model that replaced text-embedding-ada-002. It scores 44% on the MIRACL multilingual benchmark vs 31.4% for the old model, and costs 5x less. The key is that we tokenize before embedding — not after:
```python
# ingest_embeddings.py
from thai_tokenizer import word_cut
from openai_module import create_embedding

for item in data:
    # ✅ Tokenize Thai text FIRST
    tokenized = word_cut(item["keyword"])

    # Then embed the tokenized version
    result = create_embedding(tokenized)
    if result["status"]:
        sentence = {
            "id": item["id"],
            "sentence": item["text"],     # store original for display
            "keyword": item["keyword"],   # store original keyword
            "embeded": result["embed"],   # embedding of the tokenized version
        }
        sentences_collection.insert_one(sentence)
```
Notice we store the original text as the payload but create the embedding from the tokenized version. This way, when a match is found, the chatbot returns the human-readable original sentence — not the pipe-separated tokenized form.
The embedding function itself:
```python
# openai_module.py
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Rough character-based guard; the model's actual limit is 8,191 tokens
MAX_INPUT_LENGTH = 10000


def create_embedding(text: str) -> dict:
    if len(text) > MAX_INPUT_LENGTH:
        return {"status": False, "message": "Text too long"}
    response = client.embeddings.create(
        model="text-embedding-3-small",  # replaces text-embedding-ada-002
        input=text,
        dimensions=1536,  # if you change this, update the Qdrant collection size too!
    )
    return {
        "status": True,
        "embed": response.data[0].embedding,
    }
```
Step 4 — Qdrant as the Vector Store
We use Qdrant running in Docker as our vector database. It's fast, lightweight, and the REST API is straightforward to call with Python's requests:
```python
# qdrant_module.py
import os

import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")


def create_rag_collection(collection_name: str, vector_size: int):
    requests.put(
        f"{QDRANT_URL}/collections/{collection_name}",
        json={
            "vectors": {
                "chatgpt_vector": {
                    "size": vector_size,  # 1536 for text-embedding-3-small (default)
                    "distance": "Cosine",
                }
            }
        },
    )


def search(collection_name: str, vector: dict, limit: int = 5) -> dict:
    response = requests.post(
        f"{QDRANT_URL}/collections/{collection_name}/points/search",
        json={
            "vector": vector,
            "limit": limit,
            "with_payload": True,
        },
    )
    return response.json()
```
Start Qdrant locally with one Docker command:
```shell
docker run -dt --name VectorDB \
  -p 6333:6333 \
  -v /your/path/storage:/qdrant/storage \
  qdrant/qdrant:latest
```
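The ingest script above stores sentences in MongoDB, but they also need to be upserted into Qdrant. The article doesn't show that step, so here is a minimal sketch of what it might look like, assuming the `chatgpt_vector` name from `create_rag_collection` above (`build_point` and `upsert_points` are hypothetical helpers, not the production code):

```python
import os

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")


def build_point(point_id: int, embedding: list, original_text: str) -> dict:
    """Build one Qdrant point: a named vector plus a human-readable payload."""
    return {
        "id": point_id,
        "vector": {"chatgpt_vector": embedding},  # key must match the collection's vector name
        "payload": {"sentence": original_text},   # original text, not the tokenized form
    }


def upsert_points(collection_name: str, points: list) -> dict:
    import requests  # imported lazily so build_point stays dependency-free

    response = requests.put(
        f"{QDRANT_URL}/collections/{collection_name}/points",
        json={"points": points},
    )
    return response.json()
```

Note that the embedding goes in under the named-vector key while the payload carries the original sentence, mirroring the "store original, embed tokenized" split from step 3.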
We use Cosine similarity rather than Euclidean distance. For semantic search in Thai, cosine similarity performs better because it measures the angle between vectors (meaning similarity) rather than the absolute distance, which is sensitive to text length differences.
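To make the distinction concrete, here is a plain-Python illustration (not the production code): scaling a vector changes its Euclidean distance to the original but leaves the cosine score untouched, since only the direction matters.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list, b: list) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction, twice the magnitude

print(cosine_similarity(v, scaled))   # 1.0: same "meaning direction"
print(euclidean_distance(v, scaled))  # ~3.74: distance penalizes magnitude
```

As a side note, OpenAI embeddings come pre-normalized to unit length, so for them cosine similarity reduces to a simple dot product; the choice still matters if you ever mix in non-normalized vectors.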
Step 5 — The RAG Query Flow
When a user asks a question, here's what happens:
```python
# chat_module.py
from openai_module import create_embedding
from qdrant_module import search
from thai_tokenizer import word_cut


def rag(question: str, category_name: str) -> str:
    # 1. Build a context-rich search query
    search_query = "สินค้า" + category_name  # "Product [category]"

    # 2. Tokenize, then embed — the same preprocessing as at ingest time
    question_embed = create_embedding(word_cut(search_query))

    # 3. Search Qdrant for the top 7 most similar sentences
    gpt_vector = {"name": "chatgpt_vector", "vector": question_embed["embed"]}
    search_result = search("chatgpt", gpt_vector, limit=7)

    # 4. Assemble context from the matched payloads
    context = retrieve_relevant_context(search_result["result"])
    return context


def retrieve_relevant_context(results: list) -> str:
    context = ""
    for item in results:
        context += item["payload"]["sentence"] + "\n\n"
    return context
```
The assembled context is then injected into GPT-4o-mini's system prompt:
```python
system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""
```
That last Thai instruction tells the model: "Reply in the same language as the user's most recent message." This handles the bilingual nature of our users — some ask in Thai, some in English, some mix both.
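Wiring it all together, the final answer call might look like the sketch below. The model name and message layout follow the article; `build_messages` and `ask` are hypothetical helper names, and the real system likely also appends prior chat history from MongoDB before the latest user message:

```python
def build_messages(context: str, question: str) -> list:
    system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": question},
    ]


def ask(context: str, question: str) -> str:
    from openai import OpenAI  # lazy import: build_messages stays testable offline

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(context, question),
    )
    return response.choices[0].message.content
```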
Step 6 — Question Classification Before RAG
One non-obvious optimization: not every question needs a RAG lookup. We classify questions first with GPT-4o-mini to decide which path to take:
```python
# chat_module.py
import json

from openai import OpenAI

client = OpenAI()


def question_classification(question: str) -> dict:
    prompt = """วิเคราะห์คำถามของ User ว่าเป็นคำถามประเภทไหน โดยให้ตอบเป็น JSON { "type": value }
type 0 = ทักทาย / ไม่เกี่ยวกับสินค้าหรือบริการ
type 1 = ถามเกี่ยวกับโปรโมชั่น / ส่วนลด / หมวดหมู่สินค้า
type 2 = ถามเกี่ยวกับสาขา / พื้นที่จัดส่ง
type 3 = ถามเกี่ยวกับข้อมูลสินค้าหรือบริการ  ← needs RAG
type 4 = ถามทั่วไปเกี่ยวกับบริษัท"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Only type 3 (specific product info questions) triggers the full RAG pipeline. Promotion and branch questions (type 1-2) use structured data from a JSON catalog instead. Greetings (type 0) go straight to the LLM without any retrieval at all.
This classification step saves both latency and API cost — you're not doing a vector search for "สวัสดีครับ" (hello).
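The routing itself can be a plain dispatch on the classifier's `{"type": n}` output. A minimal sketch of that idea, with the classifier injected as a callable and the three handlers stubbed out (all names here are placeholders, not the production code):

```python
def route(question: str, classify) -> str:
    """Dispatch a user message based on the classifier's {"type": n} result."""
    qtype = classify(question).get("type")
    if qtype == 3:
        return handle_with_rag(question)      # full RAG pipeline (step 5)
    if qtype in (1, 2):
        return handle_from_catalog(question)  # structured JSON catalog lookup
    return handle_direct(question)            # greetings / general: plain LLM call


# Placeholder handlers so the sketch runs standalone
def handle_with_rag(q):
    return f"RAG: {q}"


def handle_from_catalog(q):
    return f"CATALOG: {q}"


def handle_direct(q):
    return f"DIRECT: {q}"
```

Passing `classify` in as a parameter keeps the router trivially testable: in tests you swap in a lambda returning a fixed type instead of calling GPT-4o-mini.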
What We Learned
1. Tokenize before embedding, always. The single biggest quality improvement came from running pythainlp on every piece of text before it touches the embedding model — both at ingest time and at query time. Without this, retrieval quality was noticeably worse for Thai-only queries.
2. Use PyMuPDF, not PyPDF2. For Thai PDF documents, PyMuPDF is dramatically more reliable. PyPDF2 would silently drop or garble Thai characters from complex layouts. Also note: as of v1.24+, use import pymupdf instead of the legacy import fitz.
3. Store original text, embed tokenized text. Users should see natural language in responses. Keep these as separate fields.
4. Sentence-level chunks beat character-level chunks for Thai. Because Thai sentences naturally carry complete thoughts, splitting at sentence boundaries (.) gives the model coherent context units rather than arbitrary fragments. A chunk_size=500 cut might land in the middle of a Thai word — or more precisely, in the middle of a run of characters that spans multiple words, since there's no space to safely break at.
5. Question classification as a router saves money. Not every user message needs vector search. A cheap classification step routes simple questions to a direct LLM call and complex ones to the full RAG pipeline.
The Stack at a Glance
| Layer | Tool | Version |
|---|---|---|
| PDF extraction | PyMuPDF (`pymupdf`) | 1.27.2.2 |
| Thai tokenization | `pythainlp` (newmm engine) | 5.2.0 |
| Embedding model | OpenAI `text-embedding-3-small` (1536d) | — |
| Vector database | Qdrant + `qdrant-client` | 1.17.1 |
| LLM | OpenAI GPT-4o-mini | — |
| OpenAI SDK | `openai` | 2.32.0 |
| Backend | Python / FastAPI or Flask | — |
| Chat history | MongoDB | — |
Final Thoughts
Building RAG for Thai taught me that most of the "standard" chunking advice assumes English. Once you work with a language that has no word boundaries, the whole pipeline has to be rethought — from how you split sentences to how you normalize text before embedding.
The good news: the fix is not complicated. A single tokenization step with pythainlp before embedding makes a significant difference. The hard part is knowing you need it in the first place.
If you're building RAG for other Asian languages — Japanese, Chinese, Korean — the same principle applies. Never assume your text has whitespace-delimited tokens. Always pre-process with a language-appropriate tokenizer before hitting your embedding model.