Every time I wanted to fine-tune an LLM or build a RAG system, I hit the same wall: I have documents, how do I turn them into training data?
PDFs, HTML pages, JSON files, CSVs, LaTeX papers... Each project meant new scripts, no reproducibility, bloated contexts wasting tokens, and numbers silently getting corrupted.
So I built 3DCF/doc2dataset to fix this.
What It Does
30+ Document Formats Supported
PDF, Markdown, Plain Text, HTML, XML, JSON, YAML, TOML, CSV, TSV, LaTeX, BibTeX, images with OCR (PNG, JPG, GIF, WebP), RTF, and more.
5-6x Token Compression
Instead of dumping raw text, 3DCF creates macro-cells with layout preservation and importance scoring. Same information, a fraction of the tokens.
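To make the macro-cell idea concrete, here is a toy Python sketch of importance-scored cell selection. The scoring heuristic, cell boundaries, and budget are invented for illustration; this is not 3DCF's actual algorithm.

```python
# Toy illustration of the macro-cell idea, NOT 3DCF's real algorithm:
# split a document into cells, score each cell's importance, and keep
# only the top-scoring cells within a fixed budget.

def score_cell(cell: str) -> float:
    """Naive importance score: reward digit density and key terms."""
    keywords = ("total", "revenue", "shall", "must")
    digit_density = sum(ch.isdigit() for ch in cell) / max(len(cell), 1)
    keyword_hits = sum(kw in cell.lower() for kw in keywords)
    return digit_density + 0.5 * keyword_hits

def compress(text: str, budget: int = 3) -> str:
    # One cell per paragraph; real macro-cells also preserve layout.
    cells = [c.strip() for c in text.split("\n\n") if c.strip()]
    keep = set(sorted(cells, key=score_cell, reverse=True)[:budget])
    # Emit surviving cells in their original document order.
    return "\n\n".join(c for c in cells if c in keep)
```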
NumGuard: Numeric Integrity
When processing financial or legal documents, numbers can get corrupted. NumGuard extracts every number, computes a SHA-1 hash, and tracks it through the pipeline. If anything changes, you know immediately.
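A minimal sketch of the concept, using only Python's standard library; 3DCF's real extraction rules and pipeline tracking are assumed to be more involved than this regex:

```python
import hashlib
import re

# Sketch of the NumGuard idea: fingerprint the numbers in a document
# so any downstream change to them is detectable.
NUMBER_RE = re.compile(r"-?\d[\d,]*\.?\d*")

def numeric_fingerprint(text: str) -> str:
    """SHA-1 over the ordered sequence of numbers found in the text."""
    numbers = NUMBER_RE.findall(text)
    return hashlib.sha1("|".join(numbers).encode("utf-8")).hexdigest()

source = "Q3 revenue was 4,215.7 million, up 12.4% year over year."
processed = "Q3 revenue was 4,215.7 million, up 12.4% year over year."

# If any number is dropped or altered downstream, the fingerprints
# diverge and the corruption is flagged immediately.
assert numeric_fingerprint(source) == numeric_fingerprint(processed)
```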
Multi-Framework Export
Process once, export to HuggingFace, LLaMA-Factory, Axolotl, OpenAI fine-tuning format, and RAG triples.
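As a rough picture of what "process once, export many" means, here is one made-up QA record rendered into two of those target formats. The formats themselves are public; the field names 3DCF actually emits may differ:

```python
import json

# Hypothetical extracted QA record, invented for this example.
record = {
    "question": "What was Q3 revenue?",
    "answer": "4,215.7 million.",
    "context": "Q3 revenue was 4,215.7 million, up 12.4% year over year.",
}

# OpenAI chat fine-tuning expects JSONL: one {"messages": [...]} per line.
openai_line = json.dumps({"messages": [
    {"role": "user", "content": f"{record['context']}\n\n{record['question']}"},
    {"role": "assistant", "content": record["answer"]},
]})

# Alpaca-style row, as consumed by LLaMA-Factory and Axolotl.
alpaca_row = {
    "instruction": record["question"],
    "input": record["context"],
    "output": record["answer"],
}
```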
Built in Rust
Fast parallel processing, with Python and Node.js bindings available.
Evaluation Results
We tested on policy documents, financial reports, technical docs, and scientific papers.
- QA Accuracy: 98.0% (vs 91.3% baseline)
- Average Context Tokens: 35.9 (vs 206 baseline)
- Numeric Corruption Detection: 100% recall on 18,501 test cases
Who Is This For
- Building RAG systems on your documents
- Fine-tuning LLMs on domain-specific content
- Processing financial or legal docs where numbers matter
- Anyone tired of writing ad-hoc document scripts
Links
- See the GitHub repo link
- See the research paper link
- License: Apache-2.0 (fully open source)
Try It Out
Install with cargo or pip, and check the repo for documentation.
Star on GitHub if you find it useful!
Questions? Drop a comment or open an issue on GitHub.