I'm Done With Data Preprocessing Nightmares in RAG Applications
Background
When I started working on RAG projects last year, the first challenge I faced wasn't choosing the right model or vector database. It was something much more fundamental: How do I convert various document formats into data that AI can actually use?
Tables in PDFs wouldn't parse correctly. Word documents came out as a formatting mess. OCR results from scanned documents were hit-or-miss at best. After testing numerous open-source tools, I found they either handled only a single format or required writing tons of glue code to connect different tools together.
The situation got even worse when our company required on-premise deployment—no sending documents to third-party APIs. We had multiple GPUs available, but most parsing tools either didn't support GPU acceleration or had terrible multi-GPU scheduling that constantly ran into out-of-memory errors.
So I decided to build my own tool to solve these problems.
That's how MinerU Tianshu was born.
What Does It Do?
Simply put, it converts various data formats into structured formats that AI can directly consume.
📄 Document Processing
Supports common formats like PDF, Word, Excel, PPT, and HTML, with output in Markdown or JSON.
- MinerU Pipeline: Complete document parsing that recognizes tables, formulas, and images
- DeepSeek OCR: High-precision OCR that handles even scanned documents accurately
- PaddleOCR-VL: Supports 109+ languages including Chinese, English, Japanese, and Korean
🎙️ Audio Processing (NEW)
This is a recently added feature using the SenseVoice engine, and the results exceeded my expectations:
- Multi-language Recognition: Chinese, English, Japanese, Korean, and Cantonese
- Speaker Diarization: Automatically distinguishes between different speakers
- Emotion Recognition: Detects neutral, happy, angry, and sad emotions
- Event Detection: Identifies speech, applause, background music, and laughter
Best of all, transcription results are automatically annotated with emotion and event emojis like 😊, 😠, and 🎵, making them highly readable.
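To make the idea concrete, the annotation step could look roughly like the sketch below. The emoji mappings and the segment fields (`speaker`, `emotion`, `events`) are my own illustration of the concept, not SenseVoice's actual tag set or the project's exact implementation.

```python
# Hypothetical sketch: map detected emotions/events to emojis and
# append them to each transcribed segment. Field names are assumptions.
EMOTION_EMOJI = {"happy": "😊", "angry": "😠", "sad": "😔", "neutral": ""}
EVENT_EMOJI = {"applause": "👏", "music": "🎵", "laughter": "😄"}

def annotate(segment):
    """Append emotion/event emojis to one transcribed segment."""
    marks = EMOTION_EMOJI.get(segment.get("emotion", ""), "")
    marks += "".join(EVENT_EMOJI.get(e, "") for e in segment.get("events", []))
    text = f"[{segment['speaker']}] {segment['text']}"
    return f"{text} {marks}".rstrip()
```

A segment like `{"speaker": "S1", "text": "Great job!", "emotion": "happy", "events": ["applause"]}` would come out as `[S1] Great job! 😊👏`, which stays readable as plain text while preserving the paralinguistic signal.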
🖼️ Image Processing
Direct text extraction from JPG and PNG files, supporting multiple OCR engines with GPU acceleration.
Why Did I Build This?
There's no shortage of document parsing tools on the market, but in practice, they all have their issues:
1. Fragmented Tools, Inconsistent APIs
RAG applications need to handle various data formats, but no single tool covers all scenarios. This means maintaining multiple codebases, each with different calling conventions, parameter formats, and output structures. Want unified processing? You'll need to write your own abstraction layer.
Even worse, these tools often have conflicting dependencies. One OCR library requires PyTorch 1.13, while a PDF tool needs 2.0. For compatibility, you end up managing multiple virtual environments. Every time you deploy to a new machine, you spend hours just setting up the environment.
2. Multi-GPU Concurrency Is a Minefield
We have multiple GPUs and wanted to process documents in parallel for better efficiency. But in practice, we discovered:
- Multiple processes calling models simultaneously easily exhaust VRAM
- Manual GPU allocation? You need to write your own scheduling logic and handle edge cases
- Using `CUDA_VISIBLE_DEVICES` for restrictions? Configuration becomes chaotic as the number of tasks grows
I tried multi-processing with fixed GPU IDs, but when one task got stuck, its GPU remained occupied while other tasks waited in queue.
I also tried Ray for distributed scheduling, but setting up a Ray cluster just for document processing felt like overkill. Plus, Ray's GPU management still requires writing significant code.
The ideal scenario would be: I submit tasks, and the system automatically distributes them to idle GPUs with automatic retry on failure. But most open-source tools lack this capability.
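The "submit tasks, let the system find an idle GPU, retry on failure" behavior can be sketched in a few lines of plain Python. This is not Tianshu's actual scheduler (which uses LitServe), just a minimal illustration of the semantics: a queue of GPU ids acts as the pool of free devices, so a task only starts when a GPU is available, and the GPU is released even if the task crashes.

```python
import queue
import threading

def run_tasks(tasks, gpu_ids, max_retries=2):
    """Run callables (each taking a gpu_id) across a pool of GPUs.

    A task blocks until a GPU is free, and a failed task is retried
    up to max_retries times; the GPU is always returned to the pool.
    """
    free_gpus = queue.Queue()
    for g in gpu_ids:
        free_gpus.put(g)

    results = [None] * len(tasks)
    lock = threading.Lock()

    def worker(idx, task):
        for attempt in range(max_retries + 1):
            gpu = free_gpus.get()          # block until a GPU is idle
            try:
                out = task(gpu)
                with lock:
                    results[idx] = out
                return
            except Exception:
                if attempt == max_retries:
                    with lock:
                        results[idx] = "failed"
            finally:
                free_gpus.put(gpu)         # release the GPU either way

    threads = [threading.Thread(target=worker, args=(i, t))
               for i, t in enumerate(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The key detail is the `finally` block: a stuck or crashed task never leaves its GPU permanently occupied, which was exactly the failure mode of the fixed-GPU-ID approach above.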
3. No Production-Grade Task Management
Open-source tools typically follow a "one-off script" approach without considering production environment needs:
- Batch Processing: No queue mechanism—you process files one by one or write your own multi-processing
- Error Retry: When document parsing fails, you manually identify and rerun it
- Priority: Urgent tasks from your boss wait for earlier tasks to complete
- Progress Tracking: No idea where tasks are—stuck or processing normally?
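The queue semantics above (batch submission, priority ordering, retry counting, inspectable status) fit naturally in a single database table. The sketch below uses SQLite with a schema I invented for illustration; it is not the project's actual schema, just the shape of the idea.

```python
import sqlite3

# Illustrative schema: one row per task, with priority, status,
# and a retry counter so failed work can be re-queued, not lost.
def init_db(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            path TEXT NOT NULL,
            priority INTEGER DEFAULT 0,     -- higher runs first
            status TEXT DEFAULT 'pending',  -- pending/running/done/failed
            retries INTEGER DEFAULT 0
        )""")

def submit(conn, paths, priority=0):
    """Batch-submit files; urgent work gets a higher priority."""
    conn.executemany(
        "INSERT INTO tasks (path, priority) VALUES (?, ?)",
        [(p, priority) for p in paths])

def claim_next(conn, max_retries=3):
    """Pick the highest-priority pending task and mark it running."""
    row = conn.execute(
        "SELECT id, path FROM tasks "
        "WHERE status = 'pending' AND retries <= ? "
        "ORDER BY priority DESC, id ASC LIMIT 1",
        (max_retries,)).fetchone()
    if row:
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?",
                     (row[0],))
    return row

def mark_failed(conn, task_id):
    """Send the task back to the queue with its retry counter bumped."""
    conn.execute(
        "UPDATE tasks SET status = 'pending', retries = retries + 1 "
        "WHERE id = ?", (task_id,))
```

Because every task's status lives in the table, a web UI or a CLI can answer "stuck or processing?" with a single query, and an overnight crash loses at most the tasks that were mid-flight.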
Once, I processed thousands of documents overnight. In the morning, I discovered an error occurred midway, and previous results weren't saved. I had to run everything again.
4. Not RAG-Friendly
RAG applications prioritize data quality and format consistency, but many tools fall short:
- Messy Output Formats: Some output plain text, others HTML, and some use custom formats. Before connecting to a RAG system, you write numerous cleaning scripts
- Lost Structure Information: Headers, lists, and tables lose their structure, making semantic chunking difficult later
- Missing Metadata: No information about which page text came from or whether it's a table or body text, affecting RAG retrieval accuracy
- Difficult On-Premise Deployment: Many commercial APIs work well, but sensitive enterprise documents can't be sent to external servers
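To show why structure and metadata matter downstream, here is a hypothetical example of block-level output feeding a chunker. The block schema (`type`, `text`, `page`) is my own sketch of what a RAG-friendly parser should emit, not the project's exact format; the point is that page numbers survive into chunks and tables stay intact instead of being shredded mid-row.

```python
# Hypothetical chunker over structured parser output. Consecutive
# body-text blocks are merged up to a size limit; tables become
# standalone chunks; each chunk records which pages it came from.
def blocks_to_chunks(blocks, max_chars=500):
    chunks, buf, pages = [], [], set()

    def flush():
        if buf:
            chunks.append({"text": "\n".join(buf),
                           "pages": sorted(pages), "type": "text"})
            buf.clear()
            pages.clear()

    for b in blocks:
        if b["type"] == "table":
            flush()  # a table is never split across chunks
            chunks.append({"text": b["text"], "pages": [b["page"]],
                           "type": "table"})
        else:
            if sum(len(t) for t in buf) + len(b["text"]) > max_chars:
                flush()
            buf.append(b["text"])
            pages.add(b["page"])
    flush()
    return chunks
```

With plain-text output none of this is possible: there is no page to cite back to, and no way to tell a table row from a sentence.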
Finally, there's the maintenance cost. More tools mean more complex environment dependencies—Python version conflicts, CUDA incompatibility, library conflicts... Every deployment to a new machine becomes an ordeal.
5. Commercial Solutions Are Expensive, Open-Source Solutions Are Scattered
Commercial solutions (like various OCR APIs) are convenient but have several issues:
- High Cost: Pay-per-use pricing becomes scary with large volumes
- Data Security: Documents from finance, healthcare, and legal sectors can't be sent externally
- Limited Functionality: You're restricted to their provided features with no customization
Open-source solutions are free but too fragmented:
- Different formats require different tools
- Each tool needs separate deployment and maintenance
- No unified interface or output format between tools
- When something breaks, you don't know which component is at fault
I wondered: Could there be a tool that solves all these pain points?
It doesn't need to be comprehensive, but it should at least cover the most common data types for RAG applications. It doesn't need to be perfect, but it should be comfortable to use in production environments.
That's how Tianshu came to be.
So, How Did I Build It?
🎯 Multi-Engine Integration
Don't reinvent the wheel. There are already many excellent open-source engines like MinerU, DeepSeek OCR, PaddleOCR, and SenseVoice. My goal was to integrate them so users can choose the most suitable engine for their scenario.
- Need complete parsing? Use MinerU Pipeline
- Want high-precision OCR? Use DeepSeek OCR
- Multi-language documents? Use PaddleOCR-VL
- Audio to text? Use SenseVoice
⚡ GPU Load Balancing
LitServe handles GPU scheduling: tasks are automatically distributed to idle GPUs, multi-GPU concurrency works out of the box, and VRAM conflicts are avoided. In my tests, GPU utilization stays above 90%.
📦 Complete Task Management
- Task Queue: Supports batch submission with automatic queuing
- Priority Management: Urgent tasks can jump the queue
- Auto Retry: Automatic retry on failure without manual intervention
- Real-time Monitoring: Web interface shows status of all tasks
🔗 MCP Protocol Support
This was added later but I find it incredibly useful.
MCP is an open protocol proposed by Anthropic that allows AI assistants to call external tools. With this integration, Claude Desktop can directly invoke Tianshu to process documents without manual uploading and downloading.
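Registering an MCP server in Claude Desktop's `claude_desktop_config.json` follows the standard `mcpServers` format; the command and module name below are illustrative placeholders (check the project README for the real invocation):

```json
{
  "mcpServers": {
    "tianshu": {
      "command": "python",
      "args": ["-m", "tianshu_mcp_server"]
    }
  }
}
```

Once registered, Claude Desktop discovers the server's tools on startup and can pass documents to Tianshu directly in a conversation.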
Use Cases
- Legal Document Processing: Batch OCR for scanned court judgments with on-premise deployment ensuring data security
- Multi-language Documents: Unified processing of technical documents in Chinese, English, Japanese, Korean, etc., with standard Markdown output
- Meeting Minutes: Audio transcription + speaker diarization + emotion annotation for automatic meeting summary generation
- Knowledge Base Construction: Batch processing of internal enterprise documents to prepare data for RAG retrieval systems
Why the Name "Tianshu"?
Tianshu (天枢) is the first star of the Big Dipper, used in ancient times for navigation and finding direction.
In RAG applications, data preprocessing is like this "Tianshu" star—it determines the data quality of the entire system. Once data is properly prepared, subsequent retrieval and generation can proceed smoothly.
Additionally, the project's core capability is engine scheduling, similar to how Tianshu "orchestrates" other stars in the sky.
Open Source License and Contributions
The project is licensed under Apache 2.0 and welcomes all forms of contributions:
- 🐛 Report Bugs
- 💡 Suggest New Features
- 📝 Improve Documentation
- 🔧 Submit Code
GitHub: https://github.com/magicyuan876/mineru-tianshu
Final Thoughts
The motivation behind this project is simple: solve my own problems.
I'm not trying to build a comprehensive platform or compete with commercial products. I just hope that when other developers encounter the same problems I faced, they'll have a ready-to-use tool and won't need to step on the same landmines I did.
If you're also working on RAG applications and have been frustrated by data preprocessing, give Tianshu a try.
If you find this project useful, starring it on GitHub would be the greatest support.
If you have any suggestions or questions, feel free to open an issue on GitHub—I'll respond as soon as possible.