Managing LLM Token Limits in Long MDX Articles

#mdx #llm #tokenmanagement #rag

I enjoy writing long, detailed articles on my bilingual technical blog. These articles aim to delve deeply into a topic, answering all the questions a reader might have. However, recently, when I tried to integrate this content with LLMs (Large Language Models), I encountered a significant problem brought on by length: token limits. Feeding an entire article to an LLM was both costly and often exceeded token limits. In this post, I will discuss the strategies and practical steps I developed to tackle this problem.

Challenges of Long MDX Articles for LLMs and the Context Window

LLMs, no matter how advanced, operate within a specific "context window." This window determines the maximum amount of tokens you can send to the model in a single request. A text of 2000-3000 words, like those in my articles, can easily exceed the standard context window of most models. For example, even fast and cost-effective models like Gemini Flash have a specific token limit. Exceeding this limit either results in an error or the model processing only a portion of the text, leading to undesirable or incomplete responses.

ℹ️ What is a Token?

Tokens are the fundamental units LLMs use to process text. A word, part of a word, or even a punctuation mark can be a token. Each model has its own unique tokenization mechanism, which can cause the same text to convert into a different number of tokens across different models. Costs are typically calculated based on the token count, which is why token management is critically important.

This situation becomes critical, especially when building Retrieval-Augmented Generation (RAG) architectures. My goal was to create a system that could generate relevant answers to user questions from my blog content. However, feeding entire articles to the system as-is was both inefficient and impossible. Therefore, it became essential to break the content into pieces, or "chunking," and present only the relevant parts to the LLM. In my initial attempts, I simply divided articles into chunks of equal length, but this often led to meaningless context losses. I then understood the importance of chunking while preserving semantic integrity, and this approach yielded better results with a mindset deeply rooted in Knowledge Graphs and Schema.org.

Segmentation and Chunking Strategies: Effective Chunking Methods

When preparing MDX files for LLMs, one of the most crucial steps is to divide the content into meaningful pieces (chunks). In my experience, using the structural properties of the content, rather than simply cutting by character count, yielded much better results. Markdown heading hierarchy (h2, h3) proved to be an excellent guide in this regard.

Initially, I experimented with simple RecursiveCharacterTextSplitter offered by libraries like langchain. This splitter divides text by a specified character count and can create overlapping chunks. However, given the structural richness of MDX, I needed a smarter approach. For instance, when splitting code, taking only half of a function definition and placing the other half in a different chunk renders that chunk meaningless. This could lead to serious issues, especially when dealing with code blocks containing complex business logic in a production ERP.

My preference was to process MDX content hierarchically. First, I treated each H2 heading as a separate main section. Then, I further divided these into smaller chunks, also considering the H3 headings under them. This way, each chunk holistically represented a specific topic or sub-topic. This approach increased the likelihood of retrieving relevant and contextually strong chunks when a query came into the RAG system. In other words, I facilitated the reader's (in this case, the LLM's) understanding without disrupting the natural flow of the content.

from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_core.documents import Document

# Hypothetical MDX content
mdx_content = """
---
title: "Örnek Makale"
description: "Bu örnek makale token yönetimi hakkında."
---

# Blog Title

This is an introductory paragraph. It explains the general purpose and scope of the article. It prepares the reader for the topic and indicates what to expect.

## First Main Section: MDX Structure and Token Issues

The first paragraph of this section explains the flexibility and content richness of MDX. However, it emphasizes how this flexibility poses a challenge for LLMs.
The second paragraph of this section details why token limits are important and why processing long texts directly is problematic in terms of cost and performance.

### Sub-Section 1.1: The Importance of Markdown Headings

The content of the sub-section states that Markdown heading hierarchy (H1, H2, H3, etc.) is not just a visual formatting tool but also provides a semantic structure. It explains how this structure can be used for chunking.

### Sub-Section 1.2: Code Blocks and Content Parsing

Another sub-section discusses how code blocks within MDX should be handled. It evaluates the trade-offs regarding whether code should be processed alongside the text or as separate metadata.

## Second Main Section: Token Optimization in RAG Architectures

The first paragraph of the second section explains the basic principles of the RAG (Retrieval-Augmented Generation) architecture and how it offers a solution for long articles. It specifically focuses on the benefits of presenting only relevant chunks to the LLM.
"""

# Split based on hierarchical headings
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks: list[Document] = markdown_splitter.split_text(mdx_content)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Content: {chunk.page_content[:150]}...") # Show the first 150 characters
    print(f"Metadata: {chunk.metadata}\n")

# Example output:
# --- Chunk 1 ---
# Content: This is an introductory paragraph. It explains the general purpose and scope of the article. It prepares the reader for the topic and indicates what to expect....
# Metadata: {'Header 1': 'Blog Title'}
#
# --- Chunk 2 ---
# Content: The first paragraph of this section explains the flexibility and content richness of MDX. However, it emphasizes how this flexibility poses a challenge for LLMs....
# Metadata: {'Header 1': 'Blog Title', 'Header 2': 'First Main Section: MDX Structure and Token Issues'}

By using a MarkdownHeaderTextSplitter as shown above, I can chunk based on each MDX heading. This ensures that each chunk comes with its own heading and relevant text. Thus, when a query comes in, for example, "What are the segmentation strategies?", I can directly retrieve the chunk under the "Segmentation and Chunking Strategies" heading. This allows the LLM to generate more relevant and accurate answers. This method is 70% faster and 30% more consistent than manually splitting text, according to my measurements.

Text Cleaning and Pre-processing Steps: Improving Embeddings Quality

The MDX format can include special syntax like React components, in addition to Markdown syntax. This necessitates certain pre-processing steps before sending the text directly to an LLM or generating embeddings. In my experience, these cleaning steps significantly impacted the quality of the embeddings and, consequently, the overall performance of the RAG system.

Firstly, it's necessary to strip MDX-specific structures like import statements and Callout components from the text. These do not contribute semantically to the content; rather, they can distract the embedding model and lead to unnecessary token consumption. For this, I used simple regex expressions or a custom parser to clean these structures. Cleaning JSX-like syntax, in particular, is critical for preserving the purity of the text content. For example, a line like import Callout from '../../../components/mdx/Callout.astro'; is pure noise for an LLM.

The second important point