<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Radosław</title>
    <description>The latest articles on DEV Community by Radosław (@arondaron).</description>
    <link>https://dev.to/arondaron</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891416%2F386e2f47-a8ad-4b89-8cda-06ba1082965c.png</url>
      <title>DEV Community: Radosław</title>
      <link>https://dev.to/arondaron</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arondaron"/>
    <language>en</language>
    <item>
      <title>Desktop app to generate LLM fine-tuning datasets — got +16pp on HumanEval</title>
      <dc:creator>Radosław</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:37:46 +0000</pubDate>
      <link>https://dev.to/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</link>
      <guid>https://dev.to/arondaron/desktop-app-to-generate-llm-fine-tuning-datasets-got-16pp-on-humaneval-4ng3</guid>
      <description>&lt;p&gt;I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.&lt;/p&gt;

&lt;p&gt;So I built one. It actually worked: my fine-tuned model went from &lt;strong&gt;55.5% to 72.3% on HumanEval&lt;/strong&gt; (5 runs averaged, Q4_K_M GGUF via Ollama).&lt;/p&gt;

&lt;p&gt;Here's what I built, what I learned, and what didn't work in this fine-tuning experiment.&lt;/p&gt;

&lt;h2&gt;What it is&lt;/h2&gt;

&lt;p&gt;A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.&lt;/p&gt;

&lt;p&gt;Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. Your data and job state never leave your machine; the only outbound traffic is the model calls, which go through OpenRouter (~300 models, one key).&lt;/p&gt;
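
&lt;p&gt;To make the decomposition concrete, here's a minimal sketch of the three-stage idea in Python. This is not the app's real code; &lt;code&gt;llm()&lt;/code&gt; is a stand-in for one OpenRouter chat call, and the counts are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the three-stage engine: decompose first, generate last.
def llm(prompt):
    raise NotImplementedError("wrap your OpenRouter chat-completion call here")

def build_dataset(category, n_topics=20):
    examples = []
    # Stage 1: ask for a diverse topic list instead of examples directly.
    topics = llm(f"List {n_topics} distinct topics for category: {category}").splitlines()
    for topic in topics:
        # Stage 2: expand each topic into concrete scenario outlines.
        outlines = llm(f"Write 3 short scenario outlines for: {topic}").split("\n\n")
        for outline in outlines:
            # Stage 3: only now generate the final multi-turn example.
            examples.append(llm(f"Write one multi-turn training example for:\n{outline}"))
    return examples
&lt;/code&gt;&lt;/pre&gt;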

&lt;p&gt;PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.&lt;/p&gt;

&lt;p&gt;A few features that made my life easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-category models&lt;/strong&gt; — different generators for different example types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge&lt;/strong&gt; — every example gets scored, low-quality ones rejected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding deduplication&lt;/strong&gt; — cosine similarity removes near-duplicates before export (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace upload&lt;/strong&gt; — push straight to the Hub when done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Report&lt;/strong&gt; — score histograms, token stats, per-category accept rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume on crash&lt;/strong&gt; — interrupted jobs restart from where they stopped (this saved me hours)&lt;/li&gt;
&lt;/ul&gt;
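
&lt;p&gt;The dedup step is conceptually tiny. A minimal sketch of the idea, assuming you already have one embedding per example; the 0.92 threshold is a placeholder, not the app's actual setting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def dedup_by_cosine(embeddings, threshold=0.92):
    # Greedy filter: keep an example only if it isn't too similar
    # to anything already kept.
    vecs = np.asarray(embeddings, dtype=np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize once
    kept = []
    for i, v in enumerate(vecs):
        if kept and np.max(vecs[kept] @ v) &gt;= threshold:
            continue  # near-duplicate of a kept example
        kept.append(i)
    return kept  # indices of examples that survive dedup
&lt;/code&gt;&lt;/pre&gt;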

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;p&gt;Everything is open source and reproducible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset (2,248 examples)&lt;/strong&gt;: &lt;a href="https://huggingface.co/datasets/AronDaron/OctoBench-2.2k" rel="noopener noreferrer"&gt;huggingface.co/datasets/AronDaron/OctoBench-2.2k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt;: &lt;a href="https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune" rel="noopener noreferrer"&gt;huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App repo&lt;/strong&gt;: &lt;a href="https://github.com/AronDaron/dataset-generator" rel="noopener noreferrer"&gt;github.com/AronDaron/dataset-generator&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).&lt;/p&gt;
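
&lt;p&gt;For context, the fine-tune itself followed the standard Unsloth QLoRA recipe. A minimal sketch of the setup; the LoRA hyperparameters here are common defaults, not necessarily my exact run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from unsloth import FastLanguageModel

# Load the base model with 4-bit quantized weights (the "Q" in QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# From here: train on the generated JSONL with trl's SFTTrainer,
# export to GGUF (Q4_K_M), and serve with Ollama.
&lt;/code&gt;&lt;/pre&gt;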

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Base&lt;/th&gt;
&lt;th&gt;Fine-tuned&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;55.5% (±2.1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.3% (±2.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.8pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (5 runs avg, n=164, t=0.2)&lt;/td&gt;
&lt;td&gt;49.0% (±1.9)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65.1% (±1.6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigCodeBench full instruct (1 run, n=1140)&lt;/td&gt;
&lt;td&gt;39.3%&lt;/td&gt;
&lt;td&gt;39.7%&lt;/td&gt;
&lt;td&gt;+0.4pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6 (1 run, n=1055, t=0.0)&lt;/td&gt;
&lt;td&gt;29.0%&lt;/td&gt;
&lt;td&gt;26.9%&lt;/td&gt;
&lt;td&gt;-2.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a45q9xpey9mvn16vk4y.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HumanEval and HumanEval+ were the clear wins. BigCodeBench barely moved, and LiveCodeBench actually regressed. Both outcomes led to useful lessons.&lt;/p&gt;

&lt;h2&gt;What surprised me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LCB regressed because of a format mismatch, not a knowledge gap.&lt;/strong&gt; I checked the fail cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with &lt;code&gt;input()&lt;/code&gt; / &lt;code&gt;print()&lt;/code&gt;. Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.&lt;/p&gt;
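
&lt;p&gt;Concretely, the mismatch looks like this (toy example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# What my training data taught the model to emit: a bare function.
def solve(a, b):
    return a + b

# What LiveCodeBench-style tests actually run: a full program
# that reads stdin and prints to stdout.
a, b = map(int, input().split())
print(a + b)
&lt;/code&gt;&lt;/pre&gt;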

&lt;p&gt;&lt;strong&gt;Judge model matters more than generator model.&lt;/strong&gt; I tested several judges. Some flash-tier models rubber-stamped almost everything (scores 95-100 across the board), while smaller models skipped 70% of examples they didn't understand. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.&lt;/p&gt;
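
&lt;p&gt;If you want to sanity-check a judge yourself, the core of it is one chat call. A minimal sketch against OpenRouter's OpenAI-compatible endpoint; the prompt and model choice are illustrative, and real code needs sturdier score parsing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from openai import OpenAI  # OpenRouter speaks the OpenAI-compatible API

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

JUDGE_PROMPT = (
    "Score this training example from 0-100 for correctness, clarity, "
    "and difficulty. Reply with the number only.\n\n{example}"
)

def judge_score(example, model="openai/gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(example=example)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
&lt;/code&gt;&lt;/pre&gt;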

&lt;p&gt;&lt;strong&gt;Concise prompts beat elaborate ones.&lt;/strong&gt; I started with detailed multi-paragraph category descriptions. Generation quality got &lt;em&gt;worse&lt;/em&gt;. Stripping them down to 2-3 sentences plus a 4-6 item judge checklist made the accept rate jump and the output noticeably cleaner.&lt;/p&gt;
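
&lt;p&gt;For a sense of scale, the kind of category spec that worked for me ended up about this small (illustrative wording, not my exact config):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CATEGORY = {
    "name": "Algorithmic drill",
    "description": (
        "Short, self-contained coding problems solved by a single function. "
        "Favor edge cases: empty inputs, off-by-one boundaries, large values."
    ),
    "judge_criteria": [
        "Code is syntactically valid and runs",
        "The solution actually solves the stated problem",
        "The problem statement is unambiguous",
        "At least one non-trivial edge case is covered",
        "No boilerplate repeated from other examples",
    ],
}
&lt;/code&gt;&lt;/pre&gt;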

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.&lt;/p&gt;

&lt;p&gt;I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.&lt;/p&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;Nothing exotic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js 16 (static export) + Tailwind + shadcn/ui&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI + SQLite (WAL mode) + Pydantic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop&lt;/strong&gt;: pywebview (WebKit2 on Linux, WebView2 on Windows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packaging&lt;/strong&gt;: PyInstaller — Linux AppImage works (~73 MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM access&lt;/strong&gt;: OpenRouter (no vendor lock-in, switch models freely)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup&lt;/strong&gt;: OpenRouter embeddings + numpy cosine similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.&lt;/p&gt;

&lt;p&gt;Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.&lt;/p&gt;

&lt;p&gt;If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclosure: I drafted this post with AI help — same way I built the app.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
