Yevhenii Molchanov

I Built an Open-Source Pipeline to Convert Documents into LLM Training Data

Every time I wanted to fine-tune an LLM or build a RAG system, I hit the same wall: I have documents, how do I turn them into training data?

PDFs, HTML pages, JSON files, CSVs, LaTeX papers... Each project meant new one-off scripts, no reproducibility, bloated contexts that wasted tokens, and numbers that silently got corrupted.

So I built 3DCF/doc2dataset to fix this.

What It Does

  • 30+ Document Formats Supported
    • PDF, Markdown, Plain Text, HTML, XML, JSON, YAML, TOML, CSV, TSV, LaTeX, BibTeX, images with OCR (PNG, JPG, GIF, WebP), RTF, and more.
  • 5-6x Token Compression
    • Instead of dumping raw text, 3DCF creates macro-cells with layout preservation and importance scoring. Same information, fraction of the tokens.
  • NumGuard: Numeric Integrity
    • When processing financial or legal documents, numbers can get corrupted. NumGuard extracts every number, computes a SHA-1 hash, and tracks it through the pipeline. If anything changes, you know immediately (a minimal sketch of the idea follows this list).
  • Multi-Framework Export
    • Process once, export to HuggingFace, LLaMA-Factory, Axolotl, the OpenAI fine-tuning format, and RAG triples (the OpenAI target is sketched below).
  • Built in Rust
    • Fast parallel processing with Python and Node.js bindings available.
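
To make the NumGuard idea concrete, here is a minimal sketch of the kind of check it describes. This is not the actual 3DCF/doc2dataset API; the regex, function name, and fingerprinting scheme are my own assumptions about how a numeric-integrity check could work.

```python
# Hypothetical sketch of a NumGuard-style check (not the 3DCF API):
# extract every number from a document, hash the collection with SHA-1,
# and compare the hash before and after processing to catch corruption.
import hashlib
import re

NUMBER_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def numeric_fingerprint(text: str) -> str:
    """Return a SHA-1 hash over all numbers found in the text."""
    numbers = sorted(n.replace(",", "") for n in NUMBER_RE.findall(text))
    return hashlib.sha1("|".join(numbers).encode("utf-8")).hexdigest()

original = "Revenue grew to $1,250,000 in 2023, up 14.5% year over year."
processed = "Revenue grew to $1,250,000 in 2023, up 14.5% year over year."

if numeric_fingerprint(original) != numeric_fingerprint(processed):
    raise ValueError("Numeric corruption detected between pipeline stages")
```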
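
And to show what one of the export targets looks like, here is a hedged sketch of writing QA pairs in the OpenAI fine-tuning chat format (JSONL, one `messages` array per line). The `chunks` list and its field names are placeholders for illustration, not 3DCF's actual output schema.

```python
# Hypothetical export of processed QA pairs to the OpenAI fine-tuning
# JSONL chat format; `chunks` is a stand-in for the pipeline's output.
import json

chunks = [
    {"question": "What was 2023 revenue?", "answer": "$1,250,000."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for c in chunks:
        record = {
            "messages": [
                {"role": "user", "content": c["question"]},
                {"role": "assistant", "content": c["answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```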

Evaluation Results

I tested it on policy documents, financial reports, technical docs, and scientific papers.

  • QA Accuracy: 98.0% (vs 91.3% baseline)
  • Average Context Tokens: 35.9 (vs 206 baseline)
  • Numeric Corruption Detection: 100% recall on 18,501 test cases

Who Is This For

  • Building RAG systems on your documents
  • Fine-tuning LLMs on domain-specific content
  • Processing financial or legal docs where numbers matter
  • Anyone tired of writing ad-hoc document-processing scripts

Try It Out

Install with cargo or pip, and check the repo for documentation.
Star on GitHub if you find it useful!
Questions? Drop a comment or open an issue on GitHub.
