DEV Community

Ekrem MUTLU
Building an Enterprise RAG System for Non-English Documents: A Turkish/Multilingual Case Study

Retrieval Augmented Generation (RAG) systems are revolutionizing how we interact with information, allowing us to ask complex questions and receive answers grounded in a vast sea of documents. While much of the focus has been on English language applications, the real power of RAG lies in its ability to unlock knowledge hidden within documents in any language. This article dives into the challenges and solutions of building a production-ready RAG system specifically for non-English documents, using Turkish as our primary example but with insights applicable to many other languages.

The Challenge: Beyond Vanilla RAG

The basic RAG pipeline is deceptively simple:

  1. Chunking: Divide your documents into manageable pieces.
  2. Embedding: Convert each chunk into a vector representation.
  3. Indexing: Store these vectors in a vector database.
  4. Retrieval: Based on a user query, find the most relevant chunks in the database.
  5. Generation: Use a large language model (LLM) to generate an answer based on the retrieved chunks.
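The five steps above can be sketched end to end in a few lines. This is a toy version for intuition only: it uses bag-of-words counts and cosine similarity in place of a real embedding model and vector database, and all function names are illustrative.

```python
# Toy end-to-end RAG pipeline: chunk -> embed -> index -> retrieve.
# Bag-of-words counts stand in for real embeddings; a list stands in
# for the vector database. Illustrative only, not a production API.
from collections import Counter
from math import sqrt

def chunk(doc, size=5):
    # Naive fixed-size chunking by word count
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # "Embedding" = term-frequency vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(index, query, k=2):
    q = embed(query)
    scored = sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)
    return [c["text"] for c in scored[:k]]

doc = "RAG systems retrieve relevant chunks and then generate grounded answers from them"
index = [{"text": c, "vector": embed(c)} for c in chunk(doc)]
print(retrieve(index, "retrieve relevant chunks", k=1))
```

The generation step (5) would then feed the retrieved chunks to an LLM; the rest of this article is about making steps 1–4 work for morphologically rich languages.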

However, this vanilla approach often falls short when dealing with non-English languages, especially those with complex morphology. Languages like Turkish, Finnish, Hungarian, and many others rely heavily on suffixes and prefixes to convey grammatical information. This presents several problems:

  • Poor Embedding Quality: Pre-trained language models, often trained primarily on English data, may struggle to accurately represent the semantic meaning of words with complex inflections.
  • Inefficient Chunking: Splitting sentences naively can lead to chunks that contain only parts of words or incomplete grammatical structures, hindering retrieval.
  • Reduced Retrieval Accuracy: The vector database may not be able to effectively match query embeddings with document chunk embeddings due to morphological variations.

Our Approach: Morphological Preprocessing and Semantic Chunking

To overcome these challenges, we implemented a RAG pipeline incorporating morphological preprocessing and advanced chunking strategies. Here's a breakdown of our approach:

1. Morphological Analysis and Normalization

Before embedding, we perform morphological analysis to reduce words to their root forms. This helps the embedding model focus on the core meaning of the word, rather than being distracted by grammatical variations. For Turkish, we used the Zemberek-NLP library, a powerful open-source tool for Turkish natural language processing.

from zemberek import TurkishMorphology

morphology = TurkishMorphology.create_with_defaults()

def normalize_turkish_word(word):
    # Strip surrounding punctuation so the analyzer sees a bare word form
    bare = word.strip(".,;:!?\"'()")
    analysis_results = morphology.analyze(bare)
    if analysis_results:
        # Return the lemma (root form) of the first analysis
        return analysis_results[0].get_lemma()
    # Fall back to the original word if analysis fails
    return word

def normalize_text(text):
    normalized_words = [normalize_turkish_word(word) for word in text.split()]
    return " ".join(normalized_words)

example_text = "Evdekilere bakmalıyız."
normalized_text = normalize_text(example_text)
print(f"Original Text: {example_text}")
print(f"Normalized Text: {normalized_text}")

This snippet demonstrates how to use Zemberek to lemmatize Turkish words. Notice how "bakmalıyız" (we should look) is reduced to its root form "bak". This normalization significantly improves the embedding quality, especially for less common inflections.
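To see why this matters for retrieval, here is a toy illustration using a hand-written lemma table standing in for Zemberek (the table entries are hypothetical, for demonstration only): two different inflections of "bakmak" (to look) only share a token after normalization, so a lexical or embedding match becomes possible.

```python
# Hand-written lemma table standing in for a real morphological analyzer.
# Entries are hypothetical, chosen only to illustrate the effect.
LEMMAS = {"bakmalıyız": "bak", "baktı": "bak", "evdekilere": "ev"}

def normalize(text):
    # Map each word to its lemma when known, else keep it as-is
    return [LEMMAS.get(w.lower(), w.lower()) for w in text.split()]

query_tokens = normalize("baktı")        # "(he/she) looked"
chunk_tokens = normalize("bakmalıyız")   # "we should look"
# Before normalization the surface forms share no token; after, both map to "bak"
print(set(query_tokens) & set(chunk_tokens))
```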

2. Sentence-Boundary Chunking with Overlap

Instead of blindly splitting documents into fixed-size chunks, we use sentence-boundary chunking. This ensures that each chunk contains a complete thought or idea. We also introduce an overlap between adjacent chunks to provide more context during retrieval. This mitigates the risk of cutting off crucial information that bridges two chunks.

import nltk
nltk.download('punkt') # Download the Punkt sentence tokenizer if you haven't already
from nltk.tokenize import sent_tokenize

def chunk_text(text, overlap=50):
    sentences = sent_tokenize(text, language="turkish")  # Punkt ships a Turkish model
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= 200:  # Adjust chunk size (in characters) as needed
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Carry over the last 'overlap' words of the previous chunk for context
            overlap_words = current_chunk.split()[-overlap:]
            current_chunk = " ".join(overlap_words) + " " + sentence + " "
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

example_text = "Bu bir cümle. Başka bir cümle daha. Ve son bir cümle."
chunks = chunk_text(example_text, overlap=1)
print(f"Chunks: {chunks}")

This code snippet demonstrates sentence-based chunking with a specified overlap. Adjust the overlap parameter to control the amount of shared context between chunks. The 200 in the if condition controls the maximum chunk size (in characters); adapt this according to your needs and the capabilities of your LLM.

3. Weaviate Hybrid Search: Combining Semantic and Keyword Search

We chose Weaviate as our vector database due to its powerful hybrid search capabilities. Hybrid search combines the strengths of vector-based semantic search with traditional keyword-based search. This is particularly beneficial for non-English languages, where morphological variations can still impact the effectiveness of vector embeddings.

By combining both approaches, we ensure that the system can retrieve relevant chunks even if the query or the document contains uncommon inflections or variations in spelling.

Here's how you'd configure hybrid search in Weaviate (simplified example using the v3 Python client):

import weaviate

client = weaviate.Client(
    url="YOUR_WEAVIATE_URL",  # Replace with your Weaviate instance URL
    auth_client_secret=weaviate.AuthClientPassword(
        username="YOUR_WEAVIATE_USERNAME",
        password="YOUR_WEAVIATE_PASSWORD"
    ),
    additional_headers={
        "X-HuggingFace-Api-Key": "YOUR_HUGGINGFACE_API_KEY"  # If using Hugging Face embeddings
    }
)

class_obj = {
    "class": "DocumentChunk",
    "description": "A chunk of a document",
    "vectorizer": "text2vec-huggingface",  # Or your preferred vectorizer
    "moduleConfig": {
        "text2vec-huggingface": {
            "modelName": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",  # Choose a multilingual model
            "poolingMode": "mean"
        }
    },
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        }
    ]
}

client.schema.create_class(class_obj)

# Example: Hybrid search query
response = (
    client.query
    .get("DocumentChunk", ["content"])
    .with_hybrid(query="örnek sorgu", alpha=0.5)  # 'örnek sorgu' means 'example query' in Turkish
    .with_limit(5)
    .do()
)

print(response)

The alpha parameter in the with_hybrid function controls the weighting between vector search and keyword search. A value of 0.5 gives equal weight to both. Experiment with different values to optimize performance for your specific dataset and use case.
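Conceptually, alpha blends the two retrieval signals. The sketch below illustrates the trade-off as a simple weighted sum; note this is a conceptual simplification, as Weaviate's actual hybrid fusion merges ranked result sets rather than summing raw scores.

```python
# Conceptual illustration of the alpha trade-off in hybrid search.
# Not Weaviate's exact fusion algorithm, which merges ranked result sets.
def hybrid_score(vector_score, keyword_score, alpha):
    # alpha=1.0 -> pure vector search; alpha=0.0 -> pure keyword search
    return alpha * vector_score + (1 - alpha) * keyword_score

print(hybrid_score(0.9, 0.2, alpha=1.0))  # only the vector score counts
print(hybrid_score(0.9, 0.2, alpha=0.0))  # only the keyword score counts
print(hybrid_score(0.9, 0.2, alpha=0.5))  # equal weighting of both
```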

4. LLM for Generation

Finally, we use a powerful LLM to generate answers based on the retrieved chunks. For multilingual applications, it's crucial to choose an LLM that is proficient in the target language. Models like mBART, mT5, and multilingual versions of GPT-3/4 are excellent choices. We found that fine-tuning the LLM on a relevant dataset significantly improved the quality of generated answers.
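Whichever model you choose, the generation step boils down to assembling the retrieved chunks into a grounded prompt. A minimal sketch follows; the prompt wording and the example chunks are placeholders, and the downstream model call is left out since it depends on your LLM's API.

```python
# Assemble retrieved chunks into a grounded prompt for the LLM.
# Prompt wording and sample chunks are illustrative placeholders.
def build_prompt(question, chunks):
    # Number the chunks so the model (and reader) can trace answers to sources
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = ["Kira sözleşmesi yazılı yapılmalıdır.", "Süre en az bir yıldır."]
prompt = build_prompt("Kira sözleşmesi nasıl yapılır?", chunks)
print(prompt)
```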

Benchmarking and Results

We evaluated our RAG system on a dataset of Turkish legal documents. We measured retrieval recall, which is the percentage of relevant chunks that were successfully retrieved by the system. Our results showed a 93% recall rate, a significant improvement over a vanilla RAG system without morphological preprocessing and hybrid search. This demonstrates the effectiveness of our approach in handling the complexities of the Turkish language.
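For reference, per-query retrieval recall is the fraction of ground-truth relevant chunks that appear in the retrieved set, macro-averaged over all queries. A minimal sketch with toy chunk IDs:

```python
# Retrieval recall: fraction of ground-truth relevant chunks that were
# actually retrieved, averaged over queries. IDs below are toy values.
def recall(retrieved_ids, relevant_ids):
    relevant = set(relevant_ids)
    return len(relevant & set(retrieved_ids)) / len(relevant) if relevant else 0.0

per_query = [
    recall([1, 2, 3], [1, 2]),  # both relevant chunks retrieved
    recall([4, 5], [4, 6]),     # one of two relevant chunks retrieved
]
print(sum(per_query) / len(per_query))  # macro-averaged recall
```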

Conclusion

Building a production-ready RAG system for non-English documents requires careful consideration of the specific linguistic characteristics of the target language. Morphological preprocessing, semantic chunking, and hybrid search are essential techniques for achieving high retrieval accuracy and generating informative answers. By addressing these challenges, we can unlock the vast potential of RAG systems to access and utilize knowledge from documents in any language.

Are you looking to implement a robust RAG system for your multilingual documents? Visit https://bilgestore.com/product/rag-system to learn more about our enterprise RAG solution and how we can help you unlock the power of your data.
