
jasu.dev

Originally published at jasu.dev

Content Transformation: The First Step Most RAG Tutorials Skip

Info: This article is part of a series on building a production RAG pipeline. Start with the overview if you haven't read it yet.

The most important thing to understand when working with LLMs is this: if you put trash in, you get trash out.

Before you even start to build a RAG system, you should think about which kinds of documents you want to store in it and which format the text should have.
This decision strongly influences how you build your system: which chunking mechanisms you can use, how much information fits into one document, and which technology you can use.

Most systems accept multiple file types like PDF, HTML, CSV or Markdown. In the database, however, they all need to share the same format.

Markdown

The text format in the database needs to have a couple of properties, and Markdown covers them.

Readable by Humans and LLMs Natively

The content of the inserted documents will be returned to the LLM in the retrieval process to craft an answer to a specific query.

For the LLM to craft a useful answer, the content of the document needs to be natively readable by LLMs. That means it needs to be a text format.
PDF and DOCX, for example, are containers that need extraction before they are readable, so they are disqualified. Markdown is a proper text format and can be read without any kind of parsing. It is also easy for humans to read, which makes debugging your pipeline easier.

Cost Effective

To save tokens and storage space, the stored text needs to deliver as much information as possible in as few words as possible.

Of course the text still needs to make sense and be structured to preserve context (more on that in a later article), but the principle holds: more text for the same information means more tokens.
Storage is much cheaper than tokens, but each character in the DB column still costs money, and that accumulates quickly when you serve millions of documents (rows in the DB).

So you need a readable text format that can be structured with as few characters as possible. Markdown turns out to be very efficient at that.

Markdown vs HTML

In the last couple of weeks HTML has received quite some attention on social media as the new go-to format for feeding content to LLMs.

However, I would like to push back here. Especially in terms of cost effectiveness and readability for humans, Markdown is still the king.
Writing # instead of <h1>..</h1> for a heading makes a real difference once it is repeated for every heading in every stored document.
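To make the heading comparison concrete, here is a small sketch that counts the markup characters around the same heading text. The helper name markup_overhead is made up for this illustration, and it counts characters, not tokens; the actual token count depends on the tokenizer of the model you use.

```python
def markup_overhead(rendered: str, content: str) -> int:
    """Number of characters in `rendered` that are markup rather than content."""
    return len(rendered) - len(content)


title = "Content Transformation"

print(markup_overhead(f"# {title}", title))         # Markdown heading: prints 2
print(markup_overhead(f"<h1>{title}</h1>", title))  # HTML heading: prints 9
```

The same gap shows up for lists, emphasis and tables, so it grows with the amount of structure in a document.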

Parsing Different File Formats into Markdown

Now that the format is decided, each file type needs its own path to get there.

HTML

Embedding website content into a RAG system is the most common use case.

Transforming HTML to Markdown is fairly easy, and there are many libraries that can do the job. I personally like Crawl4AI.
It offers crawling, asynchronous execution and a default Markdown generator. Important things to look out for are:

  • define tags you don't want included in your markdown (navigation, footers, headers, images)
  • define what the markdown generator should ignore (links, images)
  • find the right crawling strategy for your use case (consult the Crawl4AI documentation)
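The first two points can be illustrated without any crawler. The sketch below is not Crawl4AI's API; it is a deliberately tiny standard-library converter that shows what "excluded tags" and "ignored links and images" mean in practice. The names EXCLUDED_TAGS, MarkdownExtractor and html_to_markdown are hypothetical, and it only understands headings, paragraphs and list items.

```python
from html.parser import HTMLParser

# Assumption: these tags are page chrome and are dropped with everything inside them.
EXCLUDED_TAGS = {"nav", "footer", "header", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADING_PREFIX) | {"p", "li"}


class MarkdownExtractor(HTMLParser):
    """Collects headings, paragraphs and list items; skips excluded subtrees."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.skip_depth = 0  # > 0 while inside an excluded tag
        self.prefix = None   # Markdown prefix of the block being read
        self.buffer = []     # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS and self.prefix is not None:
            # Normalise whitespace inside the block before storing it.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        # Links and images add no markup here: only the visible text is kept.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

For example, html_to_markdown("<nav><p>Menu</p></nav><h1>Title</h1><p>Hello <a href='/x'>world</a></p>") drops the navigation and the link target and keeps only the heading and the paragraph text. A real library handles far more cases (tables, nested lists, code blocks), which is why using one is still the better choice.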

PDF

Transforming PDF documents to Markdown is the most complicated step of document transformation.

Yes, there are many libraries that do the job and are fairly easy to use, but the problem is the process itself.

Transforming PDF content into Markdown requires:

  • Downloading the PDF
  • Reading the PDF page by page
  • Extracting the content page by page
  • Transforming the content to Markdown

Many PDFs are several megabytes in size and take a while to read. On top of that, a PDF may contain images, which are much harder to extract than text.

The problem with these steps is that they are resource-hungry. Depending on your infrastructure, you need to find a good balance between speed and memory/CPU usage (more on that later).

After trying out multiple libraries, I found that pymupdf4llm does the best job.
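A minimal sketch of the page-by-page idea, assuming pymupdf and pymupdf4llm are installed. The function names page_batches and pdf_to_markdown and the batch_size parameter are made up for this example; the library calls used are pymupdf.open, the document's page_count, and pymupdf4llm.to_markdown with its pages argument (0-based page numbers). Whether batching actually helps memory or speed depends on your documents and infrastructure, so treat batch_size as a knob to measure, not a recommendation.

```python
from typing import Iterator


def page_batches(page_count: int, batch_size: int) -> Iterator[list[int]]:
    """Yield 0-based page numbers in batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield list(range(start, min(start + batch_size, page_count)))


def pdf_to_markdown(path: str, batch_size: int = 10) -> str:
    """Convert a PDF to Markdown a few pages at a time."""
    # Third-party imports are kept local so the batching helper
    # stays usable (and testable) without them installed.
    import pymupdf
    import pymupdf4llm

    with pymupdf.open(path) as doc:
        parts = [
            pymupdf4llm.to_markdown(doc, pages=batch)
            for batch in page_batches(doc.page_count, batch_size)
        ]
    return "\n\n".join(parts)
```

Working in batches also gives you a natural point to report progress or to stop early on a broken document.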

CSV/XLS

CSV and XLS(X) files are pretty straightforward to transform into Markdown. I found the MarkItDown library to do a solid job in transforming the content into
proper Markdown tables.
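To show what the target format looks like, here is a standard-library sketch that turns CSV text into a Markdown table. It is not how MarkItDown works internally, and the function name csv_to_markdown_table is hypothetical; it assumes the first row is the header and does not cover XLS(X), which needs a library.

```python
import csv
import io


def csv_to_markdown_table(csv_text: str) -> str:
    """Render CSV text as a Markdown table; the first row is the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *map(render, body)])
```

Calling csv_to_markdown_table("name,price\nApple,1.20\n") yields a three-line table: the header row, a separator row of dashes, and one data row.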

MD/TXT

Markdown and TXT files don't need to be transformed. They are listed here for completeness.

Content Cleaning

All your input content needs to be properly cleaned before you embed and insert it into your vector database.

After transforming different file types to Markdown, you end up with a lot of noise. Even files that don't need transformation are worth cleaning.
You do not want blank lines and other noise that add no value to the system and only occupy space.

In general, I recommend the following cleaning steps:

  • Collapse runs of consecutive blank lines to a maximum of two
  • Strip trailing and leading whitespace from each line
  • Remove lines that contain only symbols and no text
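The three steps above can be sketched in a few lines of Python. The names is_noise and clean_markdown are made up for this example. One assumption goes beyond the list: table separator rows and code fences consist only of symbols but carry structure, so the sketch keeps them. Note also that stripping leading whitespace flattens nested lists and indented code, so check whether your documents rely on indentation.

```python
def is_table_separator(line: str) -> bool:
    """True for Markdown table separator rows such as '| --- | --- |'."""
    return "|" in line and "-" in line and set(line) <= set("|-: ")


def is_noise(line: str) -> bool:
    """True for non-empty lines that contain no letter or digit at all."""
    # Assumption: code fences and table separators are structural, so keep them.
    if not line or line.startswith("```") or is_table_separator(line):
        return False
    return not any(ch.isalnum() for ch in line)


def clean_markdown(text: str, max_blank_lines: int = 2) -> str:
    cleaned = []
    blank_run = 0
    for raw in text.splitlines():
        line = raw.strip()               # strip leading and trailing whitespace
        if is_noise(line):
            continue                     # drop symbol-only lines
        if not line:
            blank_run += 1
            if blank_run > max_blank_lines:
                continue                 # collapse long runs of blank lines
        else:
            blank_run = 0
        cleaned.append(line)
    return "\n".join(cleaned).strip("\n")
```

Running the same function over files that needed no transformation (MD, TXT) keeps everything in the database consistent.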

The Speed Problem

Once you try to scrape a website which embeds a couple of PDFs you will notice one thing:

It's incredibly slow.

You need to download each PDF, extract it page by page and transform it. In a multi-tenant setup you most likely also have to handle multi-language PDFs and save only the information in the relevant language to your system. All this not only takes time, but also eats a lot of resources. To solve this, I recommend using a queue system and letting the processing run somewhere in the background.

Depending on your resources you can also try to run several processes in parallel.
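As an in-process illustration of the queue idea, the sketch below feeds jobs to a few worker threads through a queue and collects results and failures separately. The function name run_in_background and its parameters are hypothetical, and process stands in for whatever download-and-convert function you use. This is an assumption-heavy toy: a production setup would more likely use a persistent job queue and separate worker processes, since threads do not speed up CPU-bound extraction in Python.

```python
import queue
import threading
from typing import Callable


def run_in_background(
    jobs: list[str],
    process: Callable[[str], str],
    workers: int = 2,
) -> tuple[dict[str, str], dict[str, str]]:
    """Process jobs on worker threads; returns (results, failures) keyed by job."""
    pending: queue.Queue = queue.Queue()
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work for this worker
                pending.task_done()
                return
            try:
                output = process(job)
                with lock:
                    results[job] = output
            except Exception as exc:  # one bad PDF must not stop the worker
                with lock:
                    failures[job] = str(exc)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()         # wait until every job and sentinel is handled
    return results, failures
```

The number of workers is the parallelism knob mentioned above: raise it only as far as your memory and CPU budget allows.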

Luckily, filling a RAG system with data is usually not time critical, and customers are willing to wait. You can even frame it as training the AI with their data.
Just make sure that you remove the PDF files after processing them to save storage space.

Conclusion

Content transformation is an often overlooked but crucial part of a RAG system.

The quality of your input determines whether your RAG system produces high-quality output or not.
Spending some time thinking about what input formats you want to support and how to ensure your content is clean and resource efficient saves you a lot of headaches down the line.

It is also worth thinking about performance early on, especially when working with PDF files.

In the next article we will focus on chunking.
