Yevhenii Molchanov

I Built an Open-Source Pipeline to Convert Documents into LLM Training Data

Every time I wanted to fine-tune an LLM or build a RAG system, I hit the same wall: I have documents, how do I turn them into training data?

PDFs, HTML pages, JSON files, CSVs, LaTeX papers... Each project meant new one-off scripts, no reproducibility, bloated contexts that wasted tokens, and numbers that silently got corrupted.

So I built 3DCF/doc2dataset to fix this.

What It Does

  • 30+ Document Formats Supported
    • PDF, Markdown, Plain Text, HTML, XML, JSON, YAML, TOML, CSV, TSV, LaTeX, BibTeX, images with OCR (PNG, JPG, GIF, WebP), RTF, and more.
  • 5-6x Token Compression
    • Instead of dumping raw text, 3DCF creates macro-cells with layout preservation and importance scoring. Same information, fraction of the tokens.
  • NumGuard: Numeric Integrity
    • When processing financial or legal documents, numbers can get corrupted. NumGuard extracts every number, computes a SHA-1 hash, and tracks it through the pipeline. If anything changes, you know immediately (a minimal sketch of the idea follows this list).
  • Multi-Framework Export
    • Process once, export to HuggingFace, LLaMA-Factory, Axolotl, the OpenAI fine-tuning format, and RAG triples (the OpenAI target is sketched below).
  • Built in Rust
    • Fast parallel processing with Python and Node.js bindings available.
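
To make the NumGuard idea concrete, here is a minimal sketch of the kind of check it describes. This is not the actual 3DCF/doc2dataset API; the regex, function name, and fingerprinting scheme are my own assumptions about how a numeric-integrity check could work.

```python
# Hypothetical sketch of a NumGuard-style check (not the 3DCF API):
# extract every number from a document, hash the collection with SHA-1,
# and compare the hash before and after processing to catch corruption.
import hashlib
import re

NUMBER_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def numeric_fingerprint(text: str) -> str:
    """Return a SHA-1 hash over all numbers found in the text."""
    numbers = sorted(n.replace(",", "") for n in NUMBER_RE.findall(text))
    return hashlib.sha1("|".join(numbers).encode("utf-8")).hexdigest()

original = "Revenue grew to $1,250,000 in 2023, up 14.5% year over year."
processed = "Revenue grew to $1,250,000 in 2023, up 14.5% year over year."

if numeric_fingerprint(original) != numeric_fingerprint(processed):
    raise ValueError("Numeric corruption detected between pipeline stages")
```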
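
And to show what one of the export targets looks like, here is a hedged sketch of writing QA pairs in the OpenAI fine-tuning chat format (JSONL, one `messages` array per line). The `chunks` list and its field names are placeholders for illustration, not 3DCF's actual output schema.

```python
# Hypothetical export of processed QA pairs to the OpenAI fine-tuning
# JSONL chat format; `chunks` is a stand-in for the pipeline's output.
import json

chunks = [
    {"question": "What was 2023 revenue?", "answer": "$1,250,000."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for c in chunks:
        record = {
            "messages": [
                {"role": "user", "content": c["question"]},
                {"role": "assistant", "content": c["answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```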

Evaluation Results

I tested it on policy documents, financial reports, technical docs, and scientific papers.

  • QA Accuracy: 98.0% (vs 91.3% baseline)
  • Average Context Tokens: 35.9 (vs 206 baseline)
  • Numeric Corruption Detection: 100% recall on 18,501 test cases

Who Is This For

  • Building RAG systems on your documents
  • Fine-tuning LLMs on domain-specific content
  • Processing financial or legal docs where numbers matter
  • Anyone tired of writing ad-hoc document-processing scripts

Try It Out

Install with cargo or pip, and check the repo for documentation.
Star on GitHub if you find it useful!
Questions? Drop a comment or open an issue on GitHub.
