
Radosław

Desktop app to generate LLM fine-tuning datasets — got +16pp on HumanEval

I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.

So I built one. It actually worked: my fine-tuned model went from 55.5% to 72.3% on HumanEval (5 runs averaged, Q4_K_M GGUF via Ollama).

Here's what I built, what I learned, and what didn't work in this fine-tuning experiment.

What it is

A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.

Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. The app and your data stay on your machine; model calls go out through OpenRouter (~300 models, one API key).
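
To make that concrete, here's a minimal sketch of the three-stage idea. This is not the app's actual code: the model ID and prompts are placeholders, and it leans on OpenRouter's OpenAI-compatible endpoint.

```python
# Minimal sketch of the three-stage engine (not the app's actual code).
# OpenRouter speaks the OpenAI API, so the standard client works.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
MODEL = "some/model-id"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def generate(category: str, n_topics: int = 10, per_topic: int = 5) -> list[dict]:
    # Stage 1: plan distinct topics so output doesn't collapse into one pattern
    topics = json.loads(ask(
        f"List {n_topics} distinct topics for the category '{category}' "
        "as a JSON array of strings."))
    examples = []
    for topic in topics:
        # Stage 2: outline concrete tasks inside each topic
        outlines = json.loads(ask(
            f"Outline {per_topic} concrete coding tasks for '{topic}' "
            "as a JSON array of strings."))
        # Stage 3: expand each outline into a full training example
        for outline in outlines:
            examples.append({"topic": topic, "example": ask(
                f"Write a multi-turn instruction/response example for: {outline}")})
    return examples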

PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.

A few features that made my life easier:

  • Per-category models — different generators for different example types
  • LLM-as-judge — every example gets scored, low-quality ones rejected
  • Embedding deduplication — cosine similarity removes near-duplicates before export
  • HuggingFace upload — push straight to the Hub when done
  • Quality Report — score histograms, token stats, per-category accept rates
  • Resume on crash — interrupted jobs restart from where they stopped (this saved me hours)

Resources

Everything is open source and reproducible:

The result

I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).
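
For orientation, the training side followed Unsloth's standard QLoRA recipe. This is a sketch, not my exact config: the hyperparameters, the file name, and the assumption that each JSONL row is pre-rendered into a `text` field are all illustrative.

```python
# Sketch of a QLoRA run in the spirit of Unsloth's documented recipe.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit base weights, LoRA adapters on top
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumes each JSONL row already carries the rendered conversation in "text".
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```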

| Benchmark | Base | Fine-tuned | Δ |
| --- | --- | --- | --- |
| HumanEval (5 runs avg, n=164, t=0.2) | 55.5% (±2.1) | 72.3% (±2.0) | +16.8pp |
| HumanEval+ (5 runs avg, n=164, t=0.2) | 49.0% (±1.9) | 65.1% (±1.6) | +16.1pp |
| BigCodeBench full instruct (1 run, n=1140) | 39.3% | 39.7% | +0.4pp |
| LiveCodeBench v6 (1 run, n=1055, t=0.0) | 29.0% | 26.9% | -2.1pp |

HumanEval and HumanEval+ were the win. BigCodeBench barely moved and LiveCodeBench actually regressed. Both led to interesting lessons.

What surprised me

LCB regressed because of a format mismatch, not a knowledge gap. I checked the fail cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with input() / print(). Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.
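
Here's the kind of mismatch I mean (an illustrative toy task, not an actual LCB problem):

```python
# Style my training data rewarded: "return only the function".
def solve(nums: list[int]) -> int:
    return max(nums) - min(nums)

# Style LCB-like harnesses actually execute: a full program over stdin/stdout.
# Same logic, different wrapper.
if __name__ == "__main__":
    nums = list(map(int, input().split()))
    print(max(nums) - min(nums))
```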

Judge model matters more than generator model. I tested several judges. Some flash-tier models rubber-stamped almost everything (scores of 95-100 across the board), while smaller models rejected 70% of examples simply because they didn't understand them. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.
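
The filtering mechanics are simple; choosing the judge is the hard part. A minimal sketch of the judge pass, reusing the hypothetical `ask` helper from the pipeline sketch above (the prompt and schema are illustrative, not the app's):

```python
# Sketch of LLM-as-judge filtering; real code should validate the JSON reply.
import json

JUDGE_PROMPT = """Score this training example from 0-100 for correctness,
clarity, and instruction-following. Reply as JSON: {{"score": <int>}}

{example}"""

def keep(example: dict, threshold: int = 80) -> bool:
    verdict = json.loads(ask(JUDGE_PROMPT.format(example=json.dumps(example))))
    return verdict["score"] >= threshold

accepted = [ex for ex in examples if keep(ex)]
```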

Concise prompts beat elaborate ones. I started with detailed multi-paragraph category descriptions. Generation quality got worse. Stripped them down to 2-3 sentences plus a 4-6 item judge rubric: accept rate jumped, output got cleaner.
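
In practice the winning shape looked something like this (a hypothetical category, not one of my actual eight; field names are illustrative):

```python
# Hypothetical category config in the spirit of what worked; not my real prompts.
CATEGORY = {
    "name": "Error handling",
    "description": (
        "Tasks where the user asks for Python code with robust error handling. "
        "Cover try/except, custom exceptions, and input validation. "
        "Keep tasks small and self-contained."
    ),
    "judge_criteria": [
        "Code runs without syntax errors",
        "Exceptions are specific, never a bare except",
        "The answer explains the failure mode it guards against",
        "Example is non-trivial but stays under ~40 lines",
    ],
}
```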

What didn't work

I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.

I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.

Tech stack

Nothing exotic:

  • Frontend: Next.js 16 (static export) + Tailwind + shadcn/ui
  • Backend: FastAPI + SQLite (WAL mode) + Pydantic
  • Desktop: pywebview (WebKit2 on Linux, WebView2 on Windows)
  • Packaging: PyInstaller — Linux AppImage works (~73 MB)
  • LLM access: OpenRouter (no vendor lock-in, switch models freely)
  • Dedup: OpenRouter embeddings + numpy cosine similarity
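
The dedup step is the most self-contained piece, so here's a sketch of the idea. `embed()` stands in for whatever embeddings endpoint you use, and the threshold is illustrative:

```python
# Greedy near-duplicate filter: drop an example if it's too close to
# anything already kept. embed() is a stand-in returning a vector per string.
import numpy as np

def dedupe(examples: list[str], embed, threshold: float = 0.92) -> list[str]:
    kept_texts, kept_vecs = [], []
    for text in examples:
        v = np.asarray(embed(text), dtype=np.float32)
        v /= np.linalg.norm(v)  # unit-normalize: dot product == cosine similarity
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ v)) >= threshold:
            continue  # near-duplicate of an earlier example
        kept_texts.append(text)
        kept_vecs.append(v)
    return kept_texts
```

It's an O(n²) scan, which is fine at a few thousand examples; anything much bigger would want an approximate nearest-neighbor index instead.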

License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.

What's next

Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.
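
The plumbing for that is already within reach, since Ollama exposes an OpenAI-compatible endpoint; supporting it should mostly mean swapping the base URL (a sketch; the model tag is whatever you've pulled locally):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost; the key is ignored.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = local.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Generate one training example about list slicing."}],
)
print(resp.choices[0].message.content)
```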

Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.

If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.


Disclosure: I drafted this post with AI help — same way I built the app.
