
Radosław

Desktop app to generate LLM fine-tuning datasets — got +16pp on HumanEval

I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.

So I built one. It actually worked: my fine-tuned model went from 55.5% to 72.3% on HumanEval (5 runs averaged, Q4_K_M GGUF via Ollama).

Here's what I built, what I learned, and what didn't work in this fine-tuning experiment.

What it is

A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.

Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. The app and your data stay on your machine; model calls go out through OpenRouter (~300 models, one API key).
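
To make that concrete, here's a minimal sketch of the three-stage idea. This is not the app's actual code: the model ID and prompts are placeholders, and it leans on OpenRouter's OpenAI-compatible endpoint.

```python
# Minimal sketch of the three-stage engine (not the app's actual code).
# OpenRouter speaks the OpenAI API, so the standard client works.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
MODEL = "some/model-id"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def generate(category: str, n_topics: int = 10, per_topic: int = 5) -> list[dict]:
    # Stage 1: plan distinct topics so output doesn't collapse into one pattern
    topics = json.loads(ask(
        f"List {n_topics} distinct topics for the category '{category}' "
        "as a JSON array of strings."))
    examples = []
    for topic in topics:
        # Stage 2: outline concrete tasks inside each topic
        outlines = json.loads(ask(
            f"Outline {per_topic} concrete coding tasks for '{topic}' "
            "as a JSON array of strings."))
        # Stage 3: expand each outline into a full training example
        for outline in outlines:
            examples.append({"topic": topic, "example": ask(
                f"Write a multi-turn instruction/response example for: {outline}")})
    return examples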

PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.

A few features that made my life easier:

  • Per-category models — different generators for different example types
  • LLM-as-judge — every example gets scored, low-quality ones rejected
  • Embedding deduplication — cosine similarity removes near-duplicates before export
  • HuggingFace upload — push straight to the Hub when done
  • Quality Report — score histograms, token stats, per-category accept rates
  • Resume on crash — interrupted jobs restart from where they stopped (this saved me hours)

Resources

Everything is open source and reproducible:

The result

I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).
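
For orientation, the training side followed Unsloth's standard QLoRA recipe. This is a sketch, not my exact config: the hyperparameters, the file name, and the assumption that each JSONL row is pre-rendered into a `text` field are all illustrative.

```python
# Sketch of a QLoRA run in the spirit of Unsloth's documented recipe.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit base weights, LoRA adapters on top
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumes each JSONL row already carries the rendered conversation in "text".
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```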

| Benchmark | Base | Fine-tuned | Δ |
| --- | --- | --- | --- |
| HumanEval (5 runs avg, n=164, t=0.2) | 55.5% (±2.1) | 72.3% (±2.0) | +16.8pp |
| HumanEval+ (5 runs avg, n=164, t=0.2) | 49.0% (±1.9) | 65.1% (±1.6) | +16.1pp |
| BigCodeBench full instruct (1 run, n=1140) | 39.3% | 39.7% | +0.4pp |
| LiveCodeBench v6 (1 run, n=1055, t=0.0) | 29.0% | 26.9% | -2.1pp |

HumanEval and HumanEval+ were the win. BigCodeBench barely moved and LiveCodeBench actually regressed. Both led to interesting lessons.

What surprised me

LCB regressed because of a format mismatch, not a knowledge gap. I checked the fail cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with input() / print(). Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.
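
Here's the kind of mismatch I mean (an illustrative toy task, not an actual LCB problem):

```python
# Style my training data rewarded: "return only the function".
def solve(nums: list[int]) -> int:
    return max(nums) - min(nums)

# Style LCB-like harnesses actually execute: a full program over stdin/stdout.
# Same logic, different wrapper.
if __name__ == "__main__":
    nums = list(map(int, input().split()))
    print(max(nums) - min(nums))
```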

Judge model matters more than generator model. I tested several judges. Some flash-tier models rubber-stamped almost everything (scores of 95-100 across the board), while smaller models rejected 70% of examples simply because they didn't understand them. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.
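
The filtering mechanics are simple; choosing the judge is the hard part. A minimal sketch of the judge pass, reusing the hypothetical `ask` helper from the pipeline sketch above (the prompt and schema are illustrative, not the app's):

```python
# Sketch of LLM-as-judge filtering; real code should validate the JSON reply.
import json

JUDGE_PROMPT = """Score this training example from 0-100 for correctness,
clarity, and instruction-following. Reply as JSON: {{"score": <int>}}

{example}"""

def keep(example: dict, threshold: int = 80) -> bool:
    verdict = json.loads(ask(JUDGE_PROMPT.format(example=json.dumps(example))))
    return verdict["score"] >= threshold

accepted = [ex for ex in examples if keep(ex)]
```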

Concise prompts beat elaborate ones. I started with detailed multi-paragraph category descriptions. Generation quality got worse. Stripped them down to 2-3 sentences plus a 4-6 item judge rubric: accept rate jumped, output got cleaner.
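
In practice the winning shape looked something like this (a hypothetical category, not one of my actual eight; field names are illustrative):

```python
# Hypothetical category config in the spirit of what worked; not my real prompts.
CATEGORY = {
    "name": "Error handling",
    "description": (
        "Tasks where the user asks for Python code with robust error handling. "
        "Cover try/except, custom exceptions, and input validation. "
        "Keep tasks small and self-contained."
    ),
    "judge_criteria": [
        "Code runs without syntax errors",
        "Exceptions are specific, never a bare except",
        "The answer explains the failure mode it guards against",
        "Example is non-trivial but stays under ~40 lines",
    ],
}
```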

What didn't work

I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.

I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.

Tech stack

Nothing exotic:

  • Frontend: Next.js 16 (static export) + Tailwind + shadcn/ui
  • Backend: FastAPI + SQLite (WAL mode) + Pydantic
  • Desktop: pywebview (WebKit2 on Linux, WebView2 on Windows)
  • Packaging: PyInstaller — Linux AppImage works (~73 MB)
  • LLM access: OpenRouter (no vendor lock-in, switch models freely)
  • Dedup: OpenRouter embeddings + numpy cosine similarity
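
The dedup step is the most self-contained piece, so here's a sketch of the idea. `embed()` stands in for whatever embeddings endpoint you use, and the threshold is illustrative:

```python
# Greedy near-duplicate filter: drop an example if it's too close to
# anything already kept. embed() is a stand-in returning a vector per string.
import numpy as np

def dedupe(examples: list[str], embed, threshold: float = 0.92) -> list[str]:
    kept_texts, kept_vecs = [], []
    for text in examples:
        v = np.asarray(embed(text), dtype=np.float32)
        v /= np.linalg.norm(v)  # unit-normalize: dot product == cosine similarity
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ v)) >= threshold:
            continue  # near-duplicate of an earlier example
        kept_texts.append(text)
        kept_vecs.append(v)
    return kept_texts
```

It's an O(n²) scan, which is fine at a few thousand examples; anything much bigger would want an approximate nearest-neighbor index instead.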

License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.

What's next

Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.
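
The plumbing for that is already within reach, since Ollama exposes an OpenAI-compatible endpoint; supporting it should mostly mean swapping the base URL (a sketch; the model tag is whatever you've pulled locally):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost; the key is ignored.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = local.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Generate one training example about list slicing."}],
)
print(resp.choices[0].message.content)
```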

Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.

If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.


Disclosure: I drafted this post with AI help — same way I built the app.
