DEV Community

An

Ok RAG, but what about data extraction from documents?

Hi everyone,
I'm working on my RAG system, or rather, thinking about how to make it robust and general-purpose. I've started with the first step: extracting data from uploaded documents.

I'm running into a lot of issues. There are many models to choose from, some open source and others quite costly, but none can guarantee that 100% (or even close to it) of the data will be extracted correctly from my PDFs, DOCX files, or other formats.

I also have another problem: I'm Italian and want to build this for an Italian audience, so the documents will be in Italian, and some extractors don't handle that very well.

So my question is: what kinds of systems, tools, or approaches do you use to extract all the information from your documents before the chunking and embedding phase?

Let me know, thanks!
