<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yevhenii Molchanov </title>
    <description>The latest articles on DEV Community by Yevhenii Molchanov  (@yevh).</description>
    <link>https://dev.to/yevh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1010482%2Ffe712272-056b-4bd3-a15b-f73d4ded53e3.jpeg</url>
      <title>DEV Community: Yevhenii Molchanov </title>
      <link>https://dev.to/yevh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yevh"/>
    <language>en</language>
    <item>
      <title>I Built an Open-Source Pipeline to Convert Documents into LLM Training Data</title>
      <dc:creator>Yevhenii Molchanov </dc:creator>
      <pubDate>Sun, 07 Dec 2025 11:00:10 +0000</pubDate>
      <link>https://dev.to/yevh/i-built-an-open-source-pipeline-to-convert-documents-into-llm-training-data-37pb</link>
      <guid>https://dev.to/yevh/i-built-an-open-source-pipeline-to-convert-documents-into-llm-training-data-37pb</guid>
      <description>&lt;p&gt;Every time I wanted to fine-tune an LLM or build a RAG system, I hit the same wall: I have documents, how do I turn them into training data?&lt;/p&gt;

&lt;p&gt;PDFs, HTML pages, JSON files, CSVs, LaTeX papers... Each project meant new scripts, no reproducibility, bloated contexts wasting tokens, and numbers silently getting corrupted.&lt;/p&gt;

&lt;p&gt;So I built 3DCF/doc2dataset to fix this.&lt;/p&gt;

&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;30+ Document Formats Supported&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;PDF, Markdown, plain text, HTML, XML, JSON, YAML, TOML, CSV, TSV, LaTeX, BibTeX, images with OCR (PNG, JPG, GIF, WebP), RTF, and more.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5-6x Token Compression&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Instead of dumping raw text, 3DCF creates macro-cells with layout preservation and importance scoring. Same information, a fraction of the tokens.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NumGuard: Numeric Integrity&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;When processing financial or legal documents, numbers can get corrupted. NumGuard extracts every number, computes a SHA-1 hash, and tracks it through the pipeline. If anything changes, you know immediately.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Framework Export&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Process once, then export to HuggingFace, LLaMA-Factory, Axolotl, the OpenAI fine-tuning format, and RAG triples.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built in Rust&lt;/strong&gt;
&lt;ul&gt;&lt;li&gt;Fast parallel processing, with Python and Node.js bindings available.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
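&lt;p&gt;The NumGuard idea can be sketched in a few lines, independently of the tool itself. This is a minimal illustration of the hashing scheme described above, not doc2dataset's actual API: extract every numeric token, hash the ordered list, and compare digests before and after processing.&lt;/p&gt;

```python
import hashlib
import re

def number_fingerprint(text):
    """Hash the ordered list of numeric tokens in a text.

    If any number is altered, dropped, or reordered during
    processing, the digest changes and the corruption is caught.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return hashlib.sha1("|".join(numbers).encode("utf-8")).hexdigest()

source    = "Net revenue was 4300 units, up 12.5% year over year."
faithful  = "Net revenue: 4300 units (+12.5% YoY)."   # reworded, numbers intact
corrupted = "Net revenue was 4800 units, up 12.5% year over year."

print(number_fingerprint(source) == number_fingerprint(faithful))   # True
print(number_fingerprint(source) == number_fingerprint(corrupted))  # False
```

&lt;p&gt;Because only the numbers feed the hash, the check survives rewording and reformatting while still flagging any numeric change.&lt;/p&gt;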
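&lt;p&gt;The export targets differ mainly in record shape. As one concrete, publicly documented example, OpenAI's chat fine-tuning format expects one JSON object per line with a &lt;code&gt;messages&lt;/code&gt; list of role/content turns. The QA pair below is made up for illustration and is not produced by any doc2dataset call:&lt;/p&gt;

```python
import json

# A QA pair as it might come out of document processing (illustrative data).
qa_pair = {
    "question": "What license does the project use?",
    "answer": "Apache-2.0.",
}

# One JSONL record in OpenAI's chat fine-tuning shape.
record = {
    "messages": [
        {"role": "user", "content": qa_pair["question"]},
        {"role": "assistant", "content": qa_pair["answer"]},
    ]
}

line = json.dumps(record)
print(line)
```

&lt;p&gt;Formats like LLaMA-Factory's Alpaca-style &lt;code&gt;instruction&lt;/code&gt;/&lt;code&gt;output&lt;/code&gt; records are the same idea with different field names, which is why a process-once, export-many design works.&lt;/p&gt;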

&lt;h2&gt;Evaluation Results&lt;/h2&gt;

&lt;p&gt;We tested on policy documents, financial reports, technical docs, and scientific papers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QA Accuracy: 98.0% (vs 91.3% baseline)&lt;/li&gt;
&lt;li&gt;Average Context Tokens: 35.9 (vs 206 baseline)&lt;/li&gt;
&lt;li&gt;Numeric Corruption Detection: 100% recall on 18,501 test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Who Is This For&lt;/h2&gt;

&lt;p&gt;Anyone building RAG systems on their own documents, fine-tuning LLMs on domain-specific content, or processing financial or legal docs where the numbers matter. Anyone tired of writing ad-hoc document-processing scripts.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/3DCF-Labs/doc2dataset" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/3DCF-Labs/doc2dataset/blob/main/docs/doc2dataset_paper.pdf" rel="noopener noreferrer"&gt;Research paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;License: Apache-2.0 (fully open source)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;Install with cargo or pip; see the repo for documentation.&lt;br&gt;
Star it on GitHub if you find it useful!&lt;br&gt;
Questions? Drop a comment or open an issue on GitHub.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>rag</category>
      <category>rust</category>
    </item>
    <item>
      <title>VulnPlanet vulnerable code examples and fixes for Web2, Web3, API, etc.</title>
      <dc:creator>Yevhenii Molchanov </dc:creator>
      <pubDate>Wed, 18 Jan 2023 11:02:50 +0000</pubDate>
      <link>https://dev.to/yevh/vulnplanet-vulnerable-code-examples-and-fixes-for-web2-web3-apietc-fam</link>
      <guid>https://dev.to/yevh/vulnplanet-vulnerable-code-examples-and-fixes-for-web2-web3-apietc-fam</guid>
      <description>&lt;p&gt;Link: &lt;a href="https://github.com/yevh/VulnPlanet"&gt;https://github.com/yevh/VulnPlanet&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
      <category>code</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
