<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yevhenii Molchanov </title>
    <description>The latest articles on DEV Community by Yevhenii Molchanov  (@yevh).</description>
    <link>https://dev.to/yevh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1010482%2Ffe712272-056b-4bd3-a15b-f73d4ded53e3.jpeg</url>
      <title>DEV Community: Yevhenii Molchanov </title>
      <link>https://dev.to/yevh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yevh"/>
    <language>en</language>
    <item>
      <title>I Built an Open-Source Pipeline to Convert Documents into LLM Training Data</title>
      <dc:creator>Yevhenii Molchanov </dc:creator>
      <pubDate>Sun, 07 Dec 2025 11:00:10 +0000</pubDate>
      <link>https://dev.to/yevh/i-built-an-open-source-pipeline-to-convert-documents-into-llm-training-data-37pb</link>
      <guid>https://dev.to/yevh/i-built-an-open-source-pipeline-to-convert-documents-into-llm-training-data-37pb</guid>
      <description>&lt;p&gt;Every time I wanted to fine-tune an LLM or build a RAG system, I hit the same wall: I have documents, how do I turn them into training data?&lt;/p&gt;

&lt;p&gt;PDFs, HTML pages, JSON files, CSVs, LaTeX papers... Each project meant new scripts, no reproducibility, bloated contexts wasting tokens, and numbers silently getting corrupted.&lt;/p&gt;

&lt;p&gt;So I built 3DCF/doc2dataset to fix this.&lt;/p&gt;

&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;30+ Document Formats Supported&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;PDF, Markdown, plain text, HTML, XML, JSON, YAML, TOML, CSV, TSV, LaTeX, BibTeX, images with OCR (PNG, JPG, GIF, WebP), RTF, and more.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5-6x Token Compression&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Instead of dumping raw text, 3DCF creates macro-cells with layout preservation and importance scoring. Same information, a fraction of the tokens.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NumGuard: Numeric Integrity&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;When processing financial or legal documents, numbers can get corrupted. NumGuard extracts every number, computes a SHA-1 hash, and tracks it through the pipeline. If anything changes, you know immediately.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Framework Export&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Process once, then export to HuggingFace, LLaMA-Factory, Axolotl, the OpenAI fine-tuning format, and RAG triples.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built in Rust&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Fast parallel processing, with Python and Node.js bindings available.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
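&lt;p&gt;The NumGuard idea can be sketched in a few lines, independently of the tool itself. This is a minimal illustration of the hashing scheme described above, not doc2dataset's actual API: extract every numeric token, hash the ordered list, and compare digests before and after processing.&lt;/p&gt;

```python
import hashlib
import re

def number_fingerprint(text):
    """Hash the ordered list of numeric tokens in a text.

    If any number is altered, dropped, or reordered during
    processing, the digest changes and the corruption is caught.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return hashlib.sha1("|".join(numbers).encode("utf-8")).hexdigest()

source    = "Net revenue was 4300 units, up 12.5% year over year."
faithful  = "Net revenue: 4300 units (+12.5% YoY)."   # reworded, numbers intact
corrupted = "Net revenue was 4800 units, up 12.5% year over year."

print(number_fingerprint(source) == number_fingerprint(faithful))   # True
print(number_fingerprint(source) == number_fingerprint(corrupted))  # False
```

&lt;p&gt;Because only the numbers feed the hash, the check survives rewording and reformatting while still flagging any numeric change.&lt;/p&gt;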
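&lt;p&gt;The export targets differ mainly in record shape. As one concrete, publicly documented example, OpenAI's chat fine-tuning format expects one JSON object per line with a &lt;code&gt;messages&lt;/code&gt; list of role/content turns. The QA pair below is made up for illustration and is not produced by any doc2dataset call:&lt;/p&gt;

```python
import json

# A QA pair as it might come out of document processing (illustrative data).
qa_pair = {
    "question": "What license does the project use?",
    "answer": "Apache-2.0.",
}

# One JSONL record in OpenAI's chat fine-tuning shape.
record = {
    "messages": [
        {"role": "user", "content": qa_pair["question"]},
        {"role": "assistant", "content": qa_pair["answer"]},
    ]
}

line = json.dumps(record)
print(line)
```

&lt;p&gt;Formats like LLaMA-Factory's Alpaca-style &lt;code&gt;instruction&lt;/code&gt;/&lt;code&gt;output&lt;/code&gt; records are the same idea with different field names, which is why a process-once, export-many design works.&lt;/p&gt;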

&lt;h2&gt;Evaluation Results&lt;/h2&gt;

&lt;p&gt;We tested on policy documents, financial reports, technical docs, and scientific papers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QA Accuracy: 98.0% (vs 91.3% baseline)&lt;/li&gt;
&lt;li&gt;Average Context Tokens: 35.9 (vs 206 baseline)&lt;/li&gt;
&lt;li&gt;Numeric Corruption Detection: 100% recall on 18,501 test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Who Is This For&lt;/h2&gt;

&lt;p&gt;Anyone building RAG systems on their own documents, fine-tuning LLMs on domain-specific content, or processing financial or legal docs where the numbers matter. Anyone tired of writing ad-hoc document-processing scripts.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/3DCF-Labs/doc2dataset" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/3DCF-Labs/doc2dataset/blob/main/docs/doc2dataset_paper.pdf" rel="noopener noreferrer"&gt;Research paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;License: Apache-2.0 (fully open source)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;Install with cargo or pip; see the repo for documentation.&lt;br&gt;
Star it on GitHub if you find it useful!&lt;br&gt;
Questions? Drop a comment or open an issue on GitHub.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>rag</category>
      <category>rust</category>
    </item>
    <item>
      <title>VulnPlanet vulnerable code examples and fixes for Web2, Web3, API, etc.</title>
      <dc:creator>Yevhenii Molchanov </dc:creator>
      <pubDate>Wed, 18 Jan 2023 11:02:50 +0000</pubDate>
      <link>https://dev.to/yevh/vulnplanet-vulnerable-code-examples-and-fixes-for-web2-web3-apietc-fam</link>
      <guid>https://dev.to/yevh/vulnplanet-vulnerable-code-examples-and-fixes-for-web2-web3-apietc-fam</guid>
      <description>&lt;p&gt;Link: &lt;a href="https://github.com/yevh/VulnPlanet"&gt;https://github.com/yevh/VulnPlanet&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
      <category>code</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
