I'm Done With Data Preprocessing Nightmares in RAG Applications
Background
When I started working on RAG projects last year, the first challenge I faced wasn't choosing the right model or vector database. It was something much more fundamental: How do I convert various document formats into data that AI can actually use?
Tables in PDFs wouldn't parse correctly. Word documents came out as a formatting mess. OCR results from scanned documents were hit-or-miss at best. After testing numerous open-source tools, I found they either handled only a single format or required writing tons of glue code to connect different tools together.
The situation got even worse when our company required on-premise deployment—no sending documents to third-party APIs. We had multiple GPUs available, but most parsing tools either didn't support GPU acceleration or had terrible multi-GPU scheduling that constantly ran into out-of-memory errors.
So I decided to build my own tool to solve these problems.
That's how MinerU Tianshu was born.
What Does It Do?
Simply put, it converts various data formats into structured formats that AI can directly consume.
📄 Document Processing
Supports common formats like PDF, Word, Excel, PPT, and HTML, with output in Markdown or JSON.
- MinerU Pipeline: Complete document parsing that recognizes tables, formulas, and images
- DeepSeek OCR: High-precision OCR that handles even scanned documents accurately
- PaddleOCR-VL: Supports 109+ languages including Chinese, English, Japanese, and Korean
🎙️ Audio Processing (NEW)
This is a recently added feature using the SenseVoice engine, and the results exceeded my expectations:
- Multi-language Recognition: Chinese, English, Japanese, Korean, and Cantonese
- Speaker Diarization: Automatically distinguishes between different speakers
- Emotion Recognition: Detects neutral, happy, angry, and sad emotions
- Event Detection: Identifies speech, applause, background music, and laughter
Best of all, transcription results are automatically annotated with emotion and event emojis like 😊, 😠, and 🎵, making them highly readable.
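To make the idea concrete, the annotation step could look roughly like the sketch below. The emoji mappings and the segment fields (`speaker`, `emotion`, `events`) are my own illustration of the concept, not SenseVoice's actual tag set or the project's exact implementation.

```python
# Hypothetical sketch: map detected emotions/events to emojis and
# append them to each transcribed segment. Field names are assumptions.
EMOTION_EMOJI = {"happy": "😊", "angry": "😠", "sad": "😔", "neutral": ""}
EVENT_EMOJI = {"applause": "👏", "music": "🎵", "laughter": "😄"}

def annotate(segment):
    """Append emotion/event emojis to one transcribed segment."""
    marks = EMOTION_EMOJI.get(segment.get("emotion", ""), "")
    marks += "".join(EVENT_EMOJI.get(e, "") for e in segment.get("events", []))
    text = f"[{segment['speaker']}] {segment['text']}"
    return f"{text} {marks}".rstrip()
```

A segment like `{"speaker": "S1", "text": "Great job!", "emotion": "happy", "events": ["applause"]}` would come out as `[S1] Great job! 😊👏`, which stays readable as plain text while preserving the paralinguistic signal.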
🖼️ Image Processing
Direct text extraction from JPG and PNG files, supporting multiple OCR engines with GPU acceleration.
Why Did I Build This?
There's no shortage of document parsing tools on the market, but in practice, they all have their issues:
1. Fragmented Tools, Inconsistent APIs
RAG applications need to handle various data formats, but no single tool covers all scenarios. This means maintaining multiple codebases, each with different calling conventions, parameter formats, and output structures. Want unified processing? You'll need to write your own abstraction layer.
Even worse, these tools often have conflicting dependencies. One OCR library requires PyTorch 1.13, while a PDF tool needs 2.0. For compatibility, you end up managing multiple virtual environments. Every time you deploy to a new machine, you spend hours just setting up the environment.
2. Multi-GPU Concurrency Is a Minefield
We have multiple GPUs and wanted to process documents in parallel for better efficiency. But in practice, we discovered:
- Multiple processes calling models simultaneously easily exhaust VRAM
- Manual GPU allocation? You need to write your own scheduling logic and handle edge cases
- Using `CUDA_VISIBLE_DEVICES` for restrictions? Configuration becomes chaotic as the number of tasks grows
I tried multi-processing with fixed GPU IDs, but when one task got stuck, its GPU remained occupied while other tasks waited in queue.
I also tried Ray for distributed scheduling, but setting up a Ray cluster just for document processing felt like overkill. Plus, Ray's GPU management still requires writing significant code.
The ideal scenario would be: I submit tasks, and the system automatically distributes them to idle GPUs with automatic retry on failure. But most open-source tools lack this capability.
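The "submit tasks, let the system find an idle GPU, retry on failure" behavior can be sketched in a few lines of plain Python. This is not Tianshu's actual scheduler (which uses LitServe), just a minimal illustration of the semantics: a queue of GPU ids acts as the pool of free devices, so a task only starts when a GPU is available, and the GPU is released even if the task crashes.

```python
import queue
import threading

def run_tasks(tasks, gpu_ids, max_retries=2):
    """Run callables (each taking a gpu_id) across a pool of GPUs.

    A task blocks until a GPU is free, and a failed task is retried
    up to max_retries times; the GPU is always returned to the pool.
    """
    free_gpus = queue.Queue()
    for g in gpu_ids:
        free_gpus.put(g)

    results = [None] * len(tasks)
    lock = threading.Lock()

    def worker(idx, task):
        for attempt in range(max_retries + 1):
            gpu = free_gpus.get()          # block until a GPU is idle
            try:
                out = task(gpu)
                with lock:
                    results[idx] = out
                return
            except Exception:
                if attempt == max_retries:
                    with lock:
                        results[idx] = "failed"
            finally:
                free_gpus.put(gpu)         # release the GPU either way

    threads = [threading.Thread(target=worker, args=(i, t))
               for i, t in enumerate(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The key detail is the `finally` block: a stuck or crashed task never leaves its GPU permanently occupied, which was exactly the failure mode of the fixed-GPU-ID approach above.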
3. No Production-Grade Task Management
Open-source tools typically follow a "one-off script" approach without considering production environment needs:
- Batch Processing: No queue mechanism—you process files one by one or write your own multi-processing
- Error Retry: When document parsing fails, you manually identify and rerun it
- Priority: Urgent tasks from your boss wait for earlier tasks to complete
- Progress Tracking: No idea where tasks are—stuck or processing normally?
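The queue semantics above (batch submission, priority ordering, retry counting, inspectable status) fit naturally in a single database table. The sketch below uses SQLite with a schema I invented for illustration; it is not the project's actual schema, just the shape of the idea.

```python
import sqlite3

# Illustrative schema: one row per task, with priority, status,
# and a retry counter so failed work can be re-queued, not lost.
def init_db(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            path TEXT NOT NULL,
            priority INTEGER DEFAULT 0,     -- higher runs first
            status TEXT DEFAULT 'pending',  -- pending/running/done/failed
            retries INTEGER DEFAULT 0
        )""")

def submit(conn, paths, priority=0):
    """Batch-submit files; urgent work gets a higher priority."""
    conn.executemany(
        "INSERT INTO tasks (path, priority) VALUES (?, ?)",
        [(p, priority) for p in paths])

def claim_next(conn, max_retries=3):
    """Pick the highest-priority pending task and mark it running."""
    row = conn.execute(
        "SELECT id, path FROM tasks "
        "WHERE status = 'pending' AND retries <= ? "
        "ORDER BY priority DESC, id ASC LIMIT 1",
        (max_retries,)).fetchone()
    if row:
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?",
                     (row[0],))
    return row

def mark_failed(conn, task_id):
    """Send the task back to the queue with its retry counter bumped."""
    conn.execute(
        "UPDATE tasks SET status = 'pending', retries = retries + 1 "
        "WHERE id = ?", (task_id,))
```

Because every task's status lives in the table, a web UI or a CLI can answer "stuck or processing?" with a single query, and an overnight crash loses at most the tasks that were mid-flight.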
Once, I processed thousands of documents overnight. In the morning, I discovered an error occurred midway, and previous results weren't saved. I had to run everything again.
4. Not RAG-Friendly
RAG applications prioritize data quality and format consistency, but many tools fall short:
- Messy Output Formats: Some output plain text, others HTML, and some use custom formats. Before connecting to a RAG system, you write numerous cleaning scripts
- Lost Structure Information: Headers, lists, and tables lose their structure, making semantic chunking difficult later
- Missing Metadata: No information about which page text came from or whether it's a table or body text, affecting RAG retrieval accuracy
- Difficult On-Premise Deployment: Many commercial APIs work well, but sensitive enterprise documents can't be sent to external servers
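To show why structure and metadata matter downstream, here is a hypothetical example of block-level output feeding a chunker. The block schema (`type`, `text`, `page`) is my own sketch of what a RAG-friendly parser should emit, not the project's exact format; the point is that page numbers survive into chunks and tables stay intact instead of being shredded mid-row.

```python
# Hypothetical chunker over structured parser output. Consecutive
# body-text blocks are merged up to a size limit; tables become
# standalone chunks; each chunk records which pages it came from.
def blocks_to_chunks(blocks, max_chars=500):
    chunks, buf, pages = [], [], set()

    def flush():
        if buf:
            chunks.append({"text": "\n".join(buf),
                           "pages": sorted(pages), "type": "text"})
            buf.clear()
            pages.clear()

    for b in blocks:
        if b["type"] == "table":
            flush()  # a table is never split across chunks
            chunks.append({"text": b["text"], "pages": [b["page"]],
                           "type": "table"})
        else:
            if sum(len(t) for t in buf) + len(b["text"]) > max_chars:
                flush()
            buf.append(b["text"])
            pages.add(b["page"])
    flush()
    return chunks
```

With plain-text output none of this is possible: there is no page to cite back to, and no way to tell a table row from a sentence.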
Finally, there's the maintenance cost. More tools mean more complex environment dependencies—Python version conflicts, CUDA incompatibility, library conflicts... Every deployment to a new machine becomes an ordeal.
5. Commercial Solutions Are Expensive, Open-Source Solutions Are Scattered
Commercial solutions (like various OCR APIs) are convenient but have several issues:
- High Cost: Pay-per-use pricing becomes scary with large volumes
- Data Security: Documents from finance, healthcare, and legal sectors can't be sent externally
- Limited Functionality: You're restricted to their provided features with no customization
Open-source solutions are free but too fragmented:
- Different formats require different tools
- Each tool needs separate deployment and maintenance
- No unified interface or output format between tools
- When something breaks, you don't know which component is at fault
I wondered: Could there be a tool that solves all these pain points?
It doesn't need to be comprehensive, but it should at least cover the most common data types for RAG applications. It doesn't need to be perfect, but it should be comfortable to use in production environments.
That's how Tianshu came to be.
So, How Did I Build It?
🎯 Multi-Engine Integration
Don't reinvent the wheel. There are already many excellent open-source engines like MinerU, DeepSeek OCR, PaddleOCR, and SenseVoice. My goal was to integrate them so users can choose the most suitable engine for their scenario.
- Need complete parsing? Use MinerU Pipeline
- Want high-precision OCR? Use DeepSeek OCR
- Multi-language documents? Use PaddleOCR-VL
- Audio to text? Use SenseVoice
⚡ GPU Load Balancing
LitServe handles GPU scheduling: tasks are automatically distributed to idle GPUs, multi-GPU concurrency works out of the box, and VRAM conflicts are avoided. In my tests, GPU utilization stays above 90%.
📦 Complete Task Management
- Task Queue: Supports batch submission with automatic queuing
- Priority Management: Urgent tasks can jump the queue
- Auto Retry: Automatic retry on failure without manual intervention
- Real-time Monitoring: Web interface shows status of all tasks
🔗 MCP Protocol Support
This was added later but I find it incredibly useful.
MCP is an open protocol proposed by Anthropic that allows AI assistants to call external tools. With this integration, Claude Desktop can directly invoke Tianshu to process documents without manual uploading and downloading.
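Registering an MCP server in Claude Desktop's `claude_desktop_config.json` follows the standard `mcpServers` format; the command and module name below are illustrative placeholders (check the project README for the real invocation):

```json
{
  "mcpServers": {
    "tianshu": {
      "command": "python",
      "args": ["-m", "tianshu_mcp_server"]
    }
  }
}
```

Once registered, Claude Desktop discovers the server's tools on startup and can pass documents to Tianshu directly in a conversation.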
Use Cases
- Legal Document Processing: Batch OCR for scanned court judgments with on-premise deployment ensuring data security
- Multi-language Documents: Unified processing of technical documents in Chinese, English, Japanese, Korean, etc., with standard Markdown output
- Meeting Minutes: Audio transcription + speaker diarization + emotion annotation for automatic meeting summary generation
- Knowledge Base Construction: Batch processing of internal enterprise documents to prepare data for RAG retrieval systems
Why the Name "Tianshu"?
Tianshu (天枢) is the first star of the Big Dipper, used in ancient times for navigation and finding direction.
In RAG applications, data preprocessing is like this "Tianshu" star—it determines the data quality of the entire system. Once data is properly prepared, subsequent retrieval and generation can proceed smoothly.
Additionally, the project's core capability is engine scheduling, similar to how Tianshu "orchestrates" other stars in the sky.
Open Source License and Contributions
The project is licensed under Apache 2.0 and welcomes all forms of contributions:
- 🐛 Report Bugs
- 💡 Suggest New Features
- 📝 Improve Documentation
- 🔧 Submit Code
GitHub: https://github.com/magicyuan876/mineru-tianshu
Final Thoughts
The motivation behind this project is simple: solve my own problems.
I'm not trying to build a comprehensive platform or compete with commercial products. I just hope that when other developers encounter the same problems I faced, they'll have a ready-to-use tool and won't need to step on the same landmines I did.
If you're also working on RAG applications and have been frustrated by data preprocessing, give Tianshu a try.
If you find this project useful, starring it on GitHub would be the greatest support.
If you have any suggestions or questions, feel free to open an issue on GitHub—I'll respond as soon as possible.