DEV Community: jasu.dev

Content Transformation: The First Step Most RAG Tutorials Skip

jasu.dev — Sat, 23 May 2026 04:27:48 +0000

Info: This article is part of a series on building a production RAG pipeline. Start with the overview if you haven't.

The most important thing to understand when working with LLMs is this: if you insert trash, you get trash back.

Before you even start to build a RAG system, you should think about which kinds of documents you want to store in the system and which format the text should have.
This decision strongly influences how you build your system:
which chunking mechanisms you can use, how much information fits into one document and which technology you can use.

Most systems accept multiple file types like PDF, HTML, CSV or Markdown. But in the database they all need to be in the same format.

Markdown

The text format in the database needs to have a couple of properties and markdown covers them.

Readable by Humans and LLMs Natively

The content of the inserted documents will be returned to the LLM in the retrieval process to craft an answer to a specific query.

In order for the LLM to craft a useful answer, the content of the document needs to be readable natively by LLMs. That means it needs to be a text format.
PDF and DOCX, for example, are containers that need extraction before they are readable, so they are already disqualified. Markdown is a proper text format and
can be read without any kind of parsing. It's also pretty easy for humans to read, so debugging your pipeline is easy.

Cost Effective

To save tokens and storage space, the stored text needs to deliver as much information as possible in as few words as possible.

Of course the text still needs to make sense and be structured somehow to preserve context (more on that in a later article), but you get my point, right?
More text for the same information means more tokens. Although much cheaper than tokens, each character in the DB column also costs money in the form of storage, which can
accumulate quickly when you serve millions of documents (rows in the DB).

So you need a readable text format that can be structured with as few characters as possible. Markdown turns out to be very efficient with that.

Markdown vs HTML

Over the last couple of weeks HTML has received quite a bit of attention on social media as the new go-to
format for LLMs.

However, I would like to push back here. Especially in terms of cost effectiveness and readability for humans, Markdown is still the king.
It makes quite a difference whether you write # or <h1>..</h1> for a heading.
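
To make the difference concrete, here is a minimal sketch that compares the character count of the same small snippet in both formats. The sample content is made up for illustration, and characters are only a rough proxy for tokens, since the real token count depends on the tokenizer of your model.

```python
# The same (made-up) content, once as Markdown and once as HTML.
markdown_doc = "# Pricing\n\n- Free plan\n- **Pro** plan\n"
html_doc = (
    "<h1>Pricing</h1>\n"
    "<ul>\n"
    "<li>Free plan</li>\n"
    "<li><strong>Pro</strong> plan</li>\n"
    "</ul>\n"
)


def size_ratio(smaller: str, larger: str) -> float:
    """Return how many times longer `larger` is than `smaller`, in characters."""
    return len(larger) / len(smaller)


markdown_chars = len(markdown_doc)
html_chars = len(html_doc)
ratio = size_ratio(markdown_doc, html_doc)

print(f"Markdown: {markdown_chars} chars, HTML: {html_chars} chars, ratio: {ratio:.1f}x")
```

On a single heading the saving looks small, but it repeats for every heading, list item and emphasis in every stored document.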

Parsing Different File Formats into Markdown

Now that the format is decided, each file type needs its own path to get there.

HTML

Embedding website content into a RAG system is the most common use case.

Transforming HTML to Markdown is fairly easy. There are many different libraries out there that can do the job. I personally like Crawl4AI.
It offers crawl functionality, asynchronous behaviour and a default Markdown generator. Important things to look out for are:

Define the tags you don't want included in your Markdown (navigation, footers, headers, images)
Define what the Markdown generator should ignore (links, images)
Find the right crawling strategy for your use case (consult the Crawl4AI documentation)
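
To illustrate the first two points without depending on any particular library, here is a minimal sketch built on Python's standard `html.parser`. It is not Crawl4AI's API. The class name `TinyMarkdownConverter`, the excluded tag set and the handful of supported tags are assumptions chosen for the example, and a real converter handles far more cases.

```python
from html.parser import HTMLParser

# Assumption: tags whose whole content should never reach the Markdown output.
EXCLUDED_TAGS = {"nav", "footer", "header", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li"} | set(HEADING_PREFIX)


class TinyMarkdownConverter(HTMLParser):
    """Converts headings, paragraphs and list items; drops links and images."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self.skip_depth = 0      # > 0 while inside an excluded tag
        self.prefix = None       # Markdown prefix of the block being read
        self.buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS and self.prefix is not None:
            # Normalize whitespace inside the block before storing it.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        # Link text is kept because it arrives as plain data; the URL is ignored.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = TinyMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown(page_html)` on a page keeps the main content and silently drops navigation, footers and image tags, which is the behaviour you want to configure in whichever library you pick.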

PDF

Transforming PDF documents to Markdown is the most complicated step of document transformation.

Yes, there are many different libraries out there that do the job and are fairly easy to use but the problem is the process itself.

Transforming PDF content into Markdown requires:

Downloading the PDF
Reading the PDF page by page
Extracting the PDF page by page
Transforming the content to Markdown

Most PDFs are several megabytes in size and take a while to read. On top of that, there might be images inside the PDF that are much harder to extract.

The problem with these steps is that they are quite hungry when it comes to resources. Depending on your infrastructure you need to find a good balance between speed
and memory/CPU usage (more on that later).

After trying out multiple libraries I found that pymupdf4llm does the best job.

CSV/XLS

CSV and XLS(X) files are pretty straightforward to transform into Markdown. I found the MarkItDown library to do a solid job in transforming the content into
proper Markdown tables.

MD/TXT

Markdown and TXT files don't need to be transformed. I listed them for completeness here.

Content Cleaning

All your input content needs to be properly cleaned before you embed and insert it into your vector database.

After transformation from different filetypes to Markdown you end up with a lot of noise. Even files that don't need transformation are worth cleaning.
You do not want to end up with blank lines and other noise that does not add any value to the system and just occupies space.

In general I recommend doing the following things in content cleaning.

Collapse runs of consecutive blank lines to a maximum of two
Strip trailing and leading whitespace from each line
Remove lines that consist only of symbols and contain no text
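
The three rules above can be sketched in a few lines of Python. This is a minimal version under some assumptions: the function name `clean_markdown` is made up, "text" is interpreted as "at least one letter or digit", and lines starting with `|` are kept so that Markdown table separator rows survive the symbol filter.

```python
import re


def clean_markdown(text: str) -> str:
    """Apply the three cleaning rules to a Markdown string."""
    cleaned = []
    for line in text.splitlines():
        # Rule 2: strip leading and trailing whitespace from each line.
        line = line.strip()
        # Rule 3: drop lines that contain symbols but no letter or digit.
        # Assumption: table rows (starting with "|") are kept, otherwise
        # separator rows like "|---|---|" would be removed as well.
        if line and not line.startswith("|") and not re.search(r"[^\W_]", line):
            continue
        cleaned.append(line)
    result = "\n".join(cleaned)
    # Rule 1: two blank lines equal three newlines, so collapse longer runs.
    result = re.sub(r"\n{4,}", "\n\n\n", result)
    return result.strip("\n")
```

Note that stripping leading whitespace also flattens nested lists and indented code blocks. If your documents rely on indentation, restrict that rule to trailing whitespace.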

The Speed Problem

Once you try to scrape a website which embeds a couple of PDFs you will notice one thing:

It's incredibly slow.

You need to download each PDF, extract it page by page and transform it. For multi-tenancy you most likely also want to cover multi-language PDFs and only save the relevant information in
a specific language to your system. All this not only takes time but also eats a lot of resources. To solve this, I recommend using a queue system and letting the processing run somewhere
in the background.

Depending on your resources you can also try to run several processes in parallel.

Luckily filling a RAG system with data is usually not super time critical and customers are willing to wait. At best, you can frame it as training the AI with their data.
Just make sure that you remove the PDF files after processing them to save storage space.
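
As a rough illustration of the idea, here is a minimal in-process sketch using only the standard library: jobs go into a queue, a few background workers process them in parallel, and each PDF is deleted once it has been handled. The function `process_pdf` is a hypothetical stand-in for the real download, extraction and conversion work. A production setup would more likely use a dedicated job queue with separate worker processes, since threads do not speed up CPU-heavy extraction in Python.

```python
import queue
import threading
from pathlib import Path


def process_pdf(path: Path) -> str:
    """Hypothetical stand-in for extraction and Markdown conversion."""
    return f"# {path.stem}\n\n({path.stat().st_size} bytes processed)"


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock) -> None:
    while True:
        path = jobs.get()
        if path is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        try:
            markdown = process_pdf(path)
            with lock:
                results[path.name] = markdown
        except Exception as exc:  # keep the worker alive if one file fails
            with lock:
                results[path.name] = f"failed: {exc}"
        finally:
            # Remove the PDF after processing to save storage space.
            path.unlink(missing_ok=True)
            jobs.task_done()


def run_in_background(paths: list[Path], num_workers: int = 2) -> dict:
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock), daemon=True)
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for path in paths:
        jobs.put(path)
    for _ in threads:  # one sentinel per worker
        jobs.put(None)
    jobs.join()
    for thread in threads:
        thread.join()
    return results
```

The `num_workers` parameter is the knob for the balance between speed and memory/CPU usage mentioned earlier.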

Conclusion

Content transformation is an often overlooked but crucial part of a RAG system.

The quality of your input determines whether your RAG system produces high-quality output or not.
Spending some time thinking about what input formats you want to support and how to ensure your content is clean and resource efficient saves you a lot of headaches down the line.

It is also worth thinking about performance early on, especially when working with PDF files.

In the next article we will focus on chunking.

How You Can Build a Computer From a Single Gate

jasu.dev — Sun, 17 May 2026 03:04:22 +0000

Most software engineers can't explain how their computer works.

Me included: I work as a backend developer but know nothing about the internals of my laptop.
In modern computer science we have so many layers of abstraction that you don't need to know what's underneath.
But knowing a thing or two about it will certainly make you a better developer.

And you can start by building your own computer from just a single logic gate.

Nand2Tetris

Nand2Tetris is a course by Noam Nisan and Shimon Schocken where you build a fully functional computer by starting with just one simple logic gate - the NAND gate.

The course is available for free on the website nand2tetris.org and it does not require any pre-existing knowledge about computers or programming (although it helps).
It's divided into two parts: Hardware and Software, and includes twelve projects.
In the first six projects you build the computer from scratch, and in projects 7 to 12 you build the software stack: a VM translator, a compiler, and a small OS.

The course is widely known as one of the best courses in computer science for understanding how things work on a low level.

Boolean Logic

The first project is about Boolean logic: you build 15 different chips from a single logic gate.

The documents (or videos) explain everything that is needed to solve this task:

the theory of boolean logic
the connection between boolean logic and electrical circuits
an introduction to hardware description files and hardware simulation

You are provided with stub, comparison and test files for each gate. These files provide the necessary information on how the gate/chip
is supposed to work. The actual implementation is your part.

For the Xor gate, for example, these are the stub and comparison files:

// This file is part of www.nand2tetris.org
// and the book "The Elements of Computing Systems"
// by Nisan and Schocken, MIT Press.
// File name: projects/1/Xor.hdl
/**
 * Exclusive-or gate:
 * if ((a and Not(b)) or (Not(a) and b)) out = 1, else out = 0
 */
CHIP Xor {
    IN a, b;
    OUT out;

    PARTS:
    //// Replace this comment with your code.
}

| a | b |out|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
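
The comment in the stub already states the formula. As a rough illustration, the same logic can be modelled in Python with NAND as the only primitive. The function names are made up for this sketch and the course itself expects HDL, so treat it as a way to check the truth table, not as the solution file.

```python
def nand(a: int, b: int) -> int:
    """The single primitive gate: 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    # De Morgan: a OR b == NOT(NOT a AND NOT b)
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    # Formula from the stub comment: (a and Not(b)) or (Not(a) and b)
    return or_(and_(a, not_(b)), and_(not_(a), b))


# Rows of (a, b, out), in the same order as the comparison file.
truth_table = [(a, b, xor(a, b)) for a in (0, 1) for b in (0, 1)]
```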

The first gates like NOT, AND and XOR were straightforward, but Mux and DMux were a different story.

Routing

Mux and DMux broke my pattern.

While the simpler gates are all basic boolean arithmetic, Mux and DMux do not calculate anything.
The Mux takes two input signals (a and b) and a select signal.
Based on the select signal the output has either the value of a or b. The DMux does the opposite: it takes one input and routes it to one of two outputs based on the selector.

This is the truth table for the Mux:

| a | b |sel|out|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |

The tricky part here is the shift in the pattern. NOT, AND, XOR all follow the principle of combining inputs and calculating a result. Routing needs a different pattern and requires you to shift your thinking.
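
The routing behaviour of both chips can be modelled in a few lines of Python. This is a behavioural sketch with made-up function names, not a gate-level implementation. `MUX_TABLE` repeats the rows of the truth table above so the model can be checked against it.

```python
def mux(a: int, b: int, sel: int) -> int:
    """Route a to the output when sel is 0, and b when sel is 1."""
    return b if sel else a


def dmux(inp: int, sel: int) -> tuple[int, int]:
    """Route inp to the first output when sel is 0, to the second when sel is 1."""
    return (0, inp) if sel else (inp, 0)


# Rows of (a, b, sel, out), copied from the Mux truth table.
MUX_TABLE = [
    (0, 0, 0, 0),
    (0, 0, 1, 0),
    (0, 1, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 0, 1, 0),
    (1, 1, 0, 1),
    (1, 1, 1, 1),
]

matches_table = all(mux(a, b, sel) == out for a, b, sel, out in MUX_TABLE)
```

Neither function combines its inputs into a new value. They only decide where an existing value goes, which is the shift in thinking described above.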

Another challenge is the 4-Way and 8-Way Mux. At first I built them the same way as the Mux (but with a lot more code, as you may imagine) before realizing that each can be built from just three smaller chips: the 4-way version from three Mux chips, and the 8-way version from two 4-way Muxes plus one Mux.
I discovered that you can use the divide and conquer pattern for both software and hardware.
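
The divide and conquer idea can be sketched in Python as follows. This is a single-bit behavioural model with hypothetical function names, whereas the chips in the course are 16 bits wide. The select bits are passed separately, with `sel0` as the least significant bit.

```python
def mux(a: int, b: int, sel: int) -> int:
    return b if sel else a


def mux4way(a: int, b: int, c: int, d: int, sel1: int, sel0: int) -> int:
    """sel1 sel0: 00 -> a, 01 -> b, 10 -> c, 11 -> d. Uses three Mux chips."""
    ab = mux(a, b, sel0)      # pick within the first pair
    cd = mux(c, d, sel0)      # pick within the second pair
    return mux(ab, cd, sel1)  # pick between the two pairs


def mux8way(inputs: list[int], sel2: int, sel1: int, sel0: int) -> int:
    """Two 4-way Muxes plus one Mux select one of eight inputs."""
    low = mux4way(*inputs[:4], sel1, sel0)
    high = mux4way(*inputs[4:], sel1, sel0)
    return mux(low, high, sel2)
```

Each level halves the number of candidates, so every additional select bit costs one more layer of Mux chips instead of a rewrite of the whole circuit.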

My Verdict on Project 1

In total the first project took me around five hours including going through the course material.

So far it's worth it to go through this course to deepen your knowledge of how computers work and eventually become a better software engineer.

The next project will be about building a calculator from logic gates.

Originally published on jasu.dev

Building a multi-tenant RAG pipeline with Postgres. Part 0: Overview

jasu.dev — Sun, 10 May 2026 04:28:36 +0000

Today I want to start a series of articles describing my experience building a multi-tenant RAG system powered by Postgres that serves
millions of documents while still delivering end-to-end responses in under 4 seconds (including the latency from AI providers). This article serves as the overview
before I dive deeper into the individual topics in the upcoming weeks. I put a lot of research into most of the steps until I reached a reasonably
stable and fast system. I was heavily involved in building this at my company, but I wasn't the only one, and many of the ideas came from working through problems together with the team.
If you are thinking about building a RAG-based system, this series could help you make decisions regarding architecture or provider choice.

What makes a good RAG system?

In my opinion a good RAG system is mainly defined by recall and latency because these two things directly impact the end-user experience.
One could argue that recall is more important than latency, since a fast answer is worth nothing if the system gives users false answers.
But I think a good, well-crafted answer is also worth nothing when users need to wait 10 to 20 seconds each time they ask something. The internet is a
fast-paced environment and people don’t like to wait.
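
Both properties can be measured with very little code. The following is a minimal sketch, assuming you have a small evaluation set that lists the relevant document IDs per query. The function names and the recall@k definition used here (share of relevant documents that appear in the top k results) are my choice for illustration, not necessarily the metric used in the system described in this series.

```python
import time


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def timed(fn, *args, **kwargs):
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping the whole request in `timed`, including the calls to AI providers, gives the end-to-end latency the user actually experiences.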

Furthermore, a good RAG system should have guardrails against misuse. You are dealing with untrusted user input and therefore prompt injection is a real
threat for RAG systems. You need to find mechanisms to prevent misuse and still make the system respond in a friendly way.
Since LLMs tend to hallucinate (this is not exclusively LLM behavior, humans do that too), you also need a way to minimize the risk of giving users false answers.
And you need a good fallback for cases where your documents do not provide any useful information.

As you can see, there are quite a few things you need to think of when building a RAG system that is somewhat publicly available. But before I jump into
how you can overcome the listed challenges, let’s take a look at what a RAG system is made of.

The Two Major Parts

A RAG system is generally split into two major parts: Ingestion and Retrieval.

Ingestion means storing documents in the vector database and retrieval is the process of getting these documents.
Both of these steps have some nuances that highly influence how good the documents that you feed the LLM are.
We will start with the ingestion since what you retrieve is only as good as what you put into the system in the first place.
If you put good stuff into the system, you have a fair chance to receive a proper answer from the LLM. If you put bad stuff into the system, chances are very
low that you receive anything useful back.

Ingestion

When it comes to Ingestion, there are a couple of steps that need to be done before a piece of text can be stored in a vector database.

Content Transformation
Content Chunking
Embedding and Storage

I will go into detail on these three steps in the upcoming articles. For now I just want to say that it’s beneficial to think beforehand about:

What kind of files you want to support, what text format you want to use for storage.
How big your chunks should be and how you want to split the documents.
What storage and embedding you want to use.

Especially for the third point there are many different providers that all have their own benefits. In the course of this series I will explain why the project team
I worked in decided to go with Postgres.

 Document --> Transform --> Chunk --> Embed --> pgvector

Retrieval

The Retrieval part of a RAG system includes significantly more steps than the ingestion part.

I personally split the retrieval part into six different steps:

Input Processing
Document Retrieval
Context Preparation
Reranking
Response Generation
Output Processing and Delivery

While most RAG tutorials focus on the core steps like document retrieval and response generation, I think a production RAG system needs way more than that,
especially if you want to build it for multi-tenancy. It needs input and output guardrails, multiple retrievers and optimizations for token cost and latency,
like the context preparation step that I included in my list. Simpler use cases may not need everything. In general, a RAG system built for one specific use case
will always lead to the best results. However, in the real world you cannot build and maintain a custom system for each and every customer due to time and cost restrictions.

Most of the steps mentioned above include several sub-steps. The input processing for example includes spam guards,
query rewriting and, depending on your use case, maybe even routing to or away from the document retrieval. I will try to write one article for each of the steps listed above and
dive deep into the sub-steps so you have a proper understanding of which part is needed for what.

Query --> Input Processing --> Document Retrieval --> Context Preparation ──┐
┌───────────────────────────────────────────────────────────────────────────┘
└──> Reranking --> Response Generation --> Output Processing and Delivery

What’s Next

In the next couple of articles I will go over the ingestion part of the RAG system. I will explain how to transform different file formats into a text format, how to split
documents into chunks that make sense for retrieval and how to store them in Postgres using pgvector. I will also provide some code examples.
We will be using Python and LangChain for this series.

Originally published on jasu.dev

How to make Claude Code actually follow your rules

jasu.dev — Tue, 05 May 2026 10:32:21 +0000

Coding agents are great and evolving fast. I personally use Claude Code on every project. It’s super powerful, but it still needs a lot of handholding, especially when it comes to code consistency. Oftentimes the implementation it comes up with works properly, but the code it produces is not optimal. If you just want something that works somehow, you don’t need to care. But in most professional environments you want the codebase to be consistent in style and easy to maintain, so everyone follows some rules or principles on how certain things should be done.

Example: Layered Architecture in Laravel

One example of such a rule is following the layered architecture when building projects in Laravel. The layered architecture consists of a presentation layer, an application/service layer,
a domain layer and an infrastructure/persistence layer. In Laravel terms this means: Controller, Service, Repository, Model.

The controller (in other languages often called the handler) naturally stays thin and is just for validating the request and returning a success or error response. The service handles the actual business logic and calls the repository to update models.

In order to follow this architecture, you have a bunch of rules. Here are some of them:

The controller should never execute business logic. It should always call a service.
The service should never update a model or store something in the database. It should always call a repository for that.
The repository should never execute business logic. It should only be responsible for creating, editing or deleting models.

CLAUDE.md rules don't stick

Naturally, I also include these rules in my CLAUDE.md file (the file at your project root for project-specific instructions like coding conventions, architecture decisions and workflow preferences) and expect Claude Code to follow them. But despite these rules being very clear, Claude still sometimes tries to sneak business logic into the controller or wants to update a model inside a service. This is quite annoying and slows me down a lot, since I constantly need to be aware of what Claude is doing and steer it in the right direction. Even though it stores these corrections in its memory, mistakes like these keep happening.

Path-specific rules

I was quite upset about this and told a colleague. He suggested that I try out rules.
Rules are sets of instructions scoped to specific files or paths. They have a very high priority and are only loaded when Claude accesses matching files. For example, if I want Claude to not update models inside a service class, I can set a rule like this:

---
paths: "**/Services/*Service.php"
---

# Service Class Rules

## Rule

Service classes MUST NOT interact with the database directly. Delegate all persistence and query logic to a Repository.

## Forbidden in Service classes

- Eloquent model statics that hit the DB: `Model::create()`, `::find()`, `::findOrFail()`, `::where()`, `::first()`, `::firstOrCreate()`, `::updateOrCreate()`, `::all()`, `::query()`, `::destroy()`.
- Instance persistence: `$model->save()`, `->delete()`, `->update()`, `->forceDelete()`, `->restore()`, `->push()`.
- Relationship persistence: `->create()`, `->save()`, `->attach()`, `->detach()`, `->sync()`, `->associate()`, `->dissociate()`, `->updateExistingPivot()`.
- Query builder / raw SQL: `DB::table()`, `DB::select/insert/update/delete()`, `DB::statement()`.
- `DB::transaction()` allowed ONLY to compose multiple Repository calls atomically — reads/writes inside MUST still go through Repositories.

## Required pattern

```php
class ExampleService
{
    public function __construct(
        private readonly ExampleRepository $repository,
    ) {}

    public function doThing(ExampleData $data): Example
    {
        return $this->repository->create($data);
    }
}
```

- Inject Repository via constructor.
- Pass DTOs (Spatie Data) or typed params — never arrays.
- Service = business logic. Repository = persistence.

How it works

This file should be saved inside the .claude/rules folder and can be called anything you like. I tend to name the files after the layer or type they describe rules for, so the rule file for services is called services.md.

Only the frontmatter of the rule is always present in the context. When a file that matches the paths: ... pattern is loaded, the full rule is loaded into the context and Claude applies it with a higher priority.

Results

Since I implemented a couple of these rules, Claude makes fewer “mistakes” and follows the principles more strictly.

Previously, Claude tried to sneak in a violation of the rules roughly once per feature.
Since I added some files to the .claude/rules folder, I have worked on five features and haven’t seen Claude try to sneak in a violation so far.

If you want to try out rules and want to take a deeper look at what is possible you can visit the Claude Code Documentation on Rules.

Originally published on jasu.dev