Why your RAG chatbot fails in Thai — and how to fix it
A real-world walkthrough of how we built a customer service chatbot for a Thai e-commerce company — and the chunking problem nobody warns you about.
When I started building a RAG (Retrieval-Augmented Generation) chatbot for a Thai e-commerce company, I made the same mistake every developer makes: I copied the LangChain quickstart example, set chunk_size=500, and expected things to just work.
They didn't.
This is the story of why naive chunking fails for Thai text, what we built instead, and the full pipeline from PDF product manuals to chatbot answers — using Python, Qdrant, and OpenAI.
The Problem Nobody Warns You About
Most RAG tutorials are written with English in mind. The chunking logic looks like this:
# Works fine for English
chunks = text.split('. ')
# or
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
This works because English has clear word boundaries — spaces between every word. When you split on periods or character count, you still get coherent, searchable chunks.
Thai is completely different.
Thai has no spaces between words.
This sentence — "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ" — means "Our store has many product categories to choose from." But to a naive chunker, it looks like one enormous, unsplittable blob. There are 7 meaningful words in there, with zero whitespace between them.
Here's what happens when you embed that raw blob versus properly tokenized words:
| Input to embedding model | What it sees |
|---|---|
| `ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ` | One opaque token sequence |
| `ร้านค้า \| ของเรา \| มี \| สินค้า \| หลาย \| หมวดหมู่ \| ให้เลือกซื้อ` | Seven distinct semantic units |
The second form produces embeddings that actually capture the meaning of each concept — "store", "product", "category" — which leads to better retrieval when a user asks "มีสินค้าหมวดหมู่ไหนบ้าง" (what product categories are available?).
The Pipeline We Built
Here's the full architecture:
```
PDF product manuals / FAQ documents
        |
Python (PyMuPDF) → extract raw text
        |
Sentence splitting by '. '
        |
[Stored in MongoDB as raw sentences]
        |
Python → pythainlp tokenization
        |
OpenAI text-embedding-3-small
        |
Qdrant vector database (cosine similarity, 1536 dims)
        |
User query → tokenize → embed → search → top-7 chunks
        |
GPT-4o-mini + context → answer
```
Let's walk through each step with real code. Here are the dependencies we'll use:
```
# requirements.txt
pymupdf==1.27.2.2
pythainlp==5.2.0
openai==2.32.0
qdrant-client==1.17.1
pymongo==4.10.1
```
Step 1 — Extract Text from PDF
We use PyMuPDF (the fitz library) instead of PyPDF2 because it handles Thai character encoding much more reliably.
```python
# app/python/PdfToSentences.py
import re

import pymupdf as fitz  # PyMuPDF 1.27+ (legacy: import fitz)


def extract_sentences_from_pdf(pdf_path):
    pdf_file = fitz.open(pdf_path)
    text = ""
    for page in pdf_file:
        text += page.get_text("text")
    # Split on English period + space — works for mixed Thai/English documents
    sentences = [sentence.strip() for sentence in text.split('. ') if sentence.strip()]
    return sentences


def clean_text(text):
    cleaned_text = re.sub(r'\u2022', '', text)  # Remove bullet points
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text
```
Two things to note here:
Why PyMuPDF over PyPDF2? Thai PDF documents often use non-standard font encodings. PyMuPDF handles these much better — with PyPDF2 you'd frequently get garbled output or empty strings for Thai text blocks. Note: as of PyMuPDF 1.24+, the recommended import is import pymupdf (the old import fitz still works but is considered legacy).
Why split on . (period + space)? Our documents are mixed Thai/English — product names, SKUs, and technical specs are often in English, while descriptions are Thai. The period-space split is a pragmatic middle ground that preserves Thai paragraphs as single chunks rather than fragmenting them randomly at character 500.
⚠️ Limitation: Formal Thai text often ends paragraphs with a line break rather than a period. If your PDFs have no periods at all, `text.split('. ')` will return one giant chunk per page. In that case, use `pythainlp`'s sentence tokenizer instead:

```python
from pythainlp.tokenize import sent_tokenize

sentences = sent_tokenize(text, engine="crfcut")
```
Step 2 — Thai Word Tokenization Before Embedding
This is the most important step, and the one that differs most from English RAG.
Before sending any Thai text to the embedding model, we tokenize it with pythainlp:
```python
# thai_tokenizer.py
from pythainlp.tokenize import word_tokenize


def word_cut(text: str) -> str:
    tokens = word_tokenize(text, engine="newmm")
    # Join with pipe separator so the embedding model sees distinct units
    return "|".join(tokens)
```
pythainlp uses a dictionary-based approach (newmm engine) to segment Thai text into individual words:
```
Input:  "สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี"
Output: "สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี"
```
Now the embedding model sees four distinct semantic units instead of one long string. The cosine similarity between "ส่งฟรี" (free shipping) and a user's query "จัดส่งฟรีไหม" (is shipping free?) will be much higher and more meaningful after proper tokenization.
We also tried attacut (a neural-network-based engine in pythainlp) but settled on newmm for its speed and dictionary coverage — important when your domain includes product jargon and Thai promotional phrases like "ลดราคา", "ส่งฟรี", "ผ่อนชำระ".
Step 3 — Generate and Store Embeddings
We use OpenAI's text-embedding-3-small for embeddings — the current-generation model that replaced text-embedding-ada-002. It scores 44% on the MIRACL multilingual benchmark vs 31.4% for the old model, and costs 5x less. The key is that we tokenize before embedding — not after:
```python
# ingest_embeddings.py
from thai_tokenizer import word_cut
from openai_module import create_embedding

for item in data:
    # ✅ Tokenize Thai text FIRST
    tokenized = word_cut(item["keyword"])

    # Then embed the tokenized version
    result = create_embedding(tokenized)
    if result["status"]:
        sentence = {
            "id": item["id"],
            "sentence": item["text"],     # store original for display
            "keyword": item["keyword"],   # store original keyword
            "embeded": result["embed"],   # embedding of the tokenized version
        }
        sentences_collection.insert_one(sentence)
```
Notice we store the original text as the payload but create the embedding from the tokenized version. This way, when a match is found, the chatbot returns the human-readable original sentence — not the pipe-separated tokenized form.
The embedding function itself:
```python
# openai_module.py
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Rough character-based guard; the model's actual limit is 8,191 tokens
MAX_INPUT_LENGTH = 10000


def create_embedding(text: str) -> dict:
    if len(text) > MAX_INPUT_LENGTH:
        return {"status": False, "message": "Text too long"}
    response = client.embeddings.create(
        model="text-embedding-3-small",  # replaces text-embedding-ada-002
        input=text,
        dimensions=1536,  # if you change this, update the Qdrant collection size too!
    )
    return {
        "status": True,
        "embed": response.data[0].embedding,
    }
```
Step 4 — Qdrant as the Vector Store
We use Qdrant running in Docker as our vector database. It's fast, lightweight, and the REST API is straightforward to call with Python's requests:
```python
# qdrant_module.py
import os

import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")


def create_rag_collection(collection_name: str, vector_size: int):
    requests.put(
        f"{QDRANT_URL}/collections/{collection_name}",
        json={
            "vectors": {
                "chatgpt_vector": {
                    "size": vector_size,  # 1536 for text-embedding-3-small (default)
                    "distance": "Cosine",
                }
            }
        },
    )


def search(collection_name: str, vector: dict, limit: int = 5) -> dict:
    response = requests.post(
        f"{QDRANT_URL}/collections/{collection_name}/points/search",
        json={
            "vector": vector,
            "limit": limit,
            "with_payload": True,
        },
    )
    return response.json()
```
Start Qdrant locally with one Docker command:
```shell
docker run -dt --name VectorDB \
  -p 6333:6333 \
  -v /your/path/storage:/qdrant/storage \
  qdrant/qdrant:latest
```
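The ingest script above stores sentences in MongoDB, but they also need to be upserted into Qdrant. The article doesn't show that step, so here is a minimal sketch of what it might look like, assuming the `chatgpt_vector` name from `create_rag_collection` above (`build_point` and `upsert_points` are hypothetical helpers, not the production code):

```python
import os

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")


def build_point(point_id: int, embedding: list, original_text: str) -> dict:
    """Build one Qdrant point: a named vector plus a human-readable payload."""
    return {
        "id": point_id,
        "vector": {"chatgpt_vector": embedding},  # key must match the collection's vector name
        "payload": {"sentence": original_text},   # original text, not the tokenized form
    }


def upsert_points(collection_name: str, points: list) -> dict:
    import requests  # imported lazily so build_point stays dependency-free

    response = requests.put(
        f"{QDRANT_URL}/collections/{collection_name}/points",
        json={"points": points},
    )
    return response.json()
```

Note that the embedding goes in under the named-vector key while the payload carries the original sentence, mirroring the "store original, embed tokenized" split from step 3.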
We use Cosine similarity rather than Euclidean distance. For semantic search in Thai, cosine similarity performs better because it measures the angle between vectors (meaning similarity) rather than the absolute distance, which is sensitive to text length differences.
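To make the distinction concrete, here is a plain-Python illustration (not the production code): scaling a vector changes its Euclidean distance to the original but leaves the cosine score untouched, since only the direction matters.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list, b: list) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction, twice the magnitude

print(cosine_similarity(v, scaled))   # 1.0: same "meaning direction"
print(euclidean_distance(v, scaled))  # ~3.74: distance penalizes magnitude
```

As a side note, OpenAI embeddings come pre-normalized to unit length, so for them cosine similarity reduces to a simple dot product; the choice still matters if you ever mix in non-normalized vectors.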
Step 5 — The RAG Query Flow
When a user asks a question, here's what happens:
```python
# chat_module.py
from openai_module import create_embedding
from qdrant_module import search
from thai_tokenizer import word_cut


def rag(question: str, category_name: str) -> str:
    # 1. Build a context-rich search query
    search_query = "สินค้า" + category_name  # "Product [category]"

    # 2. Tokenize, then embed — the same preprocessing as at ingest time
    question_embed = create_embedding(word_cut(search_query))

    # 3. Search Qdrant for the top 7 most similar sentences
    gpt_vector = {"name": "chatgpt_vector", "vector": question_embed["embed"]}
    search_result = search("chatgpt", gpt_vector, limit=7)

    # 4. Assemble context from the matched payloads
    context = retrieve_relevant_context(search_result["result"])
    return context


def retrieve_relevant_context(results: list) -> str:
    context = ""
    for item in results:
        context += item["payload"]["sentence"] + "\n\n"
    return context
```
The assembled context is then injected into GPT-4o-mini's system prompt:
```python
system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""
```
That last Thai instruction tells the model: "Reply in the same language as the user's most recent message." This handles the bilingual nature of our users — some ask in Thai, some in English, some mix both.
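Wiring it all together, the final answer call might look like the sketch below. The model name and message layout follow the article; `build_messages` and `ask` are hypothetical helper names, and the real system likely also appends prior chat history from MongoDB before the latest user message:

```python
def build_messages(context: str, question: str) -> list:
    system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": question},
    ]


def ask(context: str, question: str) -> str:
    from openai import OpenAI  # lazy import: build_messages stays testable offline

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(context, question),
    )
    return response.choices[0].message.content
```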
Step 6 — Question Classification Before RAG
One non-obvious optimization: not every question needs a RAG lookup. We classify questions first with GPT-4o-mini to decide which path to take:
```python
# chat_module.py
import json

from openai import OpenAI

client = OpenAI()


def question_classification(question: str) -> dict:
    prompt = """วิเคราะห์คำถามของ User ว่าเป็นคำถามประเภทไหน โดยให้ตอบเป็น JSON { "type": value }
type 0 = ทักทาย / ไม่เกี่ยวกับสินค้าหรือบริการ
type 1 = ถามเกี่ยวกับโปรโมชั่น / ส่วนลด / หมวดหมู่สินค้า
type 2 = ถามเกี่ยวกับสาขา / พื้นที่จัดส่ง
type 3 = ถามเกี่ยวกับข้อมูลสินค้าหรือบริการ  ← needs RAG
type 4 = ถามทั่วไปเกี่ยวกับบริษัท"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Only type 3 (specific product info questions) triggers the full RAG pipeline. Promotion and branch questions (type 1-2) use structured data from a JSON catalog instead. Greetings (type 0) go straight to the LLM without any retrieval at all.
This classification step saves both latency and API cost — you're not doing a vector search for "สวัสดีครับ" (hello).
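The routing itself can be a plain dispatch on the classifier's `{"type": n}` output. A minimal sketch of that idea, with the classifier injected as a callable and the three handlers stubbed out (all names here are placeholders, not the production code):

```python
def route(question: str, classify) -> str:
    """Dispatch a user message based on the classifier's {"type": n} result."""
    qtype = classify(question).get("type")
    if qtype == 3:
        return handle_with_rag(question)      # full RAG pipeline (step 5)
    if qtype in (1, 2):
        return handle_from_catalog(question)  # structured JSON catalog lookup
    return handle_direct(question)            # greetings / general: plain LLM call


# Placeholder handlers so the sketch runs standalone
def handle_with_rag(q):
    return f"RAG: {q}"


def handle_from_catalog(q):
    return f"CATALOG: {q}"


def handle_direct(q):
    return f"DIRECT: {q}"
```

Passing `classify` in as a parameter keeps the router trivially testable: in tests you swap in a lambda returning a fixed type instead of calling GPT-4o-mini.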
What We Learned
1. Tokenize before embedding, always. The single biggest quality improvement came from running pythainlp on every piece of text before it touches the embedding model — both at ingest time and at query time. Without this, retrieval quality was noticeably worse for Thai-only queries.
2. Use PyMuPDF, not PyPDF2. For Thai PDF documents, PyMuPDF is dramatically more reliable. PyPDF2 would silently drop or garble Thai characters from complex layouts. Also note: as of v1.24+, use import pymupdf instead of the legacy import fitz.
3. Store original text, embed tokenized text. Users should see natural language in responses. Keep these as separate fields.
4. Sentence-level chunks beat character-level chunks for Thai. Because Thai sentences naturally carry complete thoughts, splitting at sentence boundaries (.) gives the model coherent context units rather than arbitrary fragments. A chunk_size=500 cut might land in the middle of a Thai word — or more precisely, in the middle of a run of characters that spans multiple words, since there's no space to safely break at.
5. Question classification as a router saves money. Not every user message needs vector search. A cheap classification step routes simple questions to a direct LLM call and complex ones to the full RAG pipeline.
The Stack at a Glance
| Layer | Tool | Version |
|---|---|---|
| PDF extraction | PyMuPDF (`pymupdf`) | 1.27.2.2 |
| Thai tokenization | `pythainlp` (newmm engine) | 5.2.0 |
| Embedding model | OpenAI `text-embedding-3-small` (1536d) | — |
| Vector database | Qdrant + `qdrant-client` | 1.17.1 |
| LLM | OpenAI GPT-4o-mini | — |
| OpenAI SDK | `openai` | 2.32.0 |
| Backend | Python / FastAPI or Flask | — |
| Chat history | MongoDB | — |
Final Thoughts
Building RAG for Thai taught me that most of the "standard" chunking advice assumes English. Once you work with a language that has no word boundaries, the whole pipeline has to be rethought — from how you split sentences to how you normalize text before embedding.
The good news: the fix is not complicated. A single tokenization step with pythainlp before embedding makes a significant difference. The hard part is knowing you need it in the first place.
If you're building RAG for other Asian languages — Japanese, Chinese, Korean — the same principle applies. Never assume your text has whitespace-delimited tokens. Always pre-process with a language-appropriate tokenizer before hitting your embedding model.