I'm not a professional developer. I learned by doing — vibe-coding with AI assistance — and a few months ago I wanted to fine-tune Qwen2.5-Coder-7B on my own data. The problem: there's no good way to generate a quality dataset without writing custom scripts every time, and existing tools are either CLI-heavy or built for researchers, not curious tinkerers.
So I built one. It actually worked: my fine-tuned model went from 55.5% to 72.3% on HumanEval (5 runs averaged, Q4_K_M GGUF via Ollama).
Here's what I built, what I learned, and what didn't work.
## What it is
A no-code desktop app (Linux, Windows) that automates the full dataset generation pipeline — topic planning, multi-turn example generation, quality scoring via LLM Judge, deduplication, and HuggingFace Hub upload. Pick categories, set proportions, click Generate, get a ready-to-train JSONL.
Under the hood it runs a three-stage engine: topics → outlines → examples. Instead of a naive "generate 100 examples" prompt, the app decomposes the job first, which kills the repetitive patterns you get from one-shot generation. Your data and jobs stay on your machine; model calls go out through OpenRouter (~300 models, one API key).
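The decomposition above can be sketched in a few lines. This is a hypothetical illustration, not the app's actual code: `call_model` is a stub standing in for a real OpenRouter request, and the stage prompts are invented.

```python
# Hypothetical sketch of the three-stage engine; call_model is a stub
# standing in for a real chat-completions request over OpenRouter.
def call_model(prompt: str, n: int) -> list[str]:
    # Stub: returns n placeholder items instead of real model output.
    return [f"{prompt} -> item {i}" for i in range(n)]

def generate_dataset(category: str, topics_n=2, outlines_n=2, examples_n=2):
    examples = []
    # Stage 1: plan distinct topics so examples don't cluster on one theme.
    for topic in call_model(f"List topics for {category}", topics_n):
        # Stage 2: expand each topic into concrete task outlines.
        for outline in call_model(f"Outline tasks for: {topic}", outlines_n):
            # Stage 3: only now generate the actual training examples.
            examples.extend(call_model(f"Write examples for: {outline}", examples_n))
    return examples

data = generate_dataset("Data Libraries")
print(len(data))  # 2 topics * 2 outlines * 2 examples = 8
```

The point of the fan-out is that variety is decided up front, at the topic and outline stages, instead of hoping one giant prompt produces it.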
PS: I know there are similar apps for generating fine-tuning data — but as always, I build the tools I want to use myself.
A few features that made my life easier:
- Per-category models — different generators for different example types
- LLM-as-judge — every example gets scored, low-quality ones rejected
- Embedding deduplication — cosine similarity removes near-duplicates before export
- HuggingFace upload — push straight to the Hub when done
- Quality Report — score histograms, token stats, per-category accept rates
- Resume on crash — interrupted jobs restart from where they stopped (this saved me hours)
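To make the judge step concrete, here is a minimal sketch of score-and-filter. Everything in it is an assumption for illustration: `judge` is a stub heuristic where the real app asks a judge model for a score, and the threshold of 70 is not the app's actual cutoff.

```python
# Minimal sketch of LLM-as-judge filtering. judge() is a stub heuristic;
# the real pipeline would ask a judge model to score each example 0-100.
ACCEPT_THRESHOLD = 70  # assumed cutoff, not the app's actual value

def judge(example: dict) -> int:
    # Stub: penalize very short answers instead of calling a model.
    return 90 if len(example["answer"]) > 20 else 40

examples = [
    {"question": "Reverse a string in Python",
     "answer": "Use slicing: s[::-1] returns the reversed string."},
    {"question": "What is a list?", "answer": "A container."},
]
accepted = [ex for ex in examples if judge(ex) >= ACCEPT_THRESHOLD]
print(len(accepted))  # only the detailed answer passes
```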
## Resources
Everything is open source and reproducible:
- Dataset (2,248 examples): huggingface.co/datasets/AronDaron/OctoBench-2.2k
- Fine-tuned model: huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k
- App repo: github.com/AronDaron/dataset-generator
## The result
I generated 2,248 examples across 8 categories targeting different code skills, then fine-tuned Qwen2.5-Coder-7B-Instruct (QLoRA via Unsloth, Q4_K_M GGUF served via Ollama).
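For readers who haven't seen a training file: each line of the JSONL is one self-contained JSON object. The exact schema below is an assumption, a common chat-style layout, not necessarily the dataset's actual format.

```python
import json

# One training record in a common chat-style JSONL layout. The dataset's
# actual schema may differ; this is an illustrative assumption.
record = {
    "messages": [
        {"role": "user",
         "content": "Write a function that returns the n-th Fibonacci number."},
        {"role": "assistant",
         "content": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n"
                    "        a, b = b, a + b\n    return a"},
    ]
}
line = json.dumps(record)           # one JSON object per line of the file
assert json.loads(line) == record   # round-trips cleanly
```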
| Benchmark | Base | Fine-tuned | Δ |
|---|---|---|---|
| HumanEval (5 runs avg, n=164, t=0.2) | 55.5% (±2.1) | 72.3% (±2.0) | +16.8pp |
| HumanEval+ (5 runs avg, n=164, t=0.2) | 49.0% (±1.9) | 65.1% (±1.6) | +16.1pp |
| BigCodeBench full instruct (1 run, n=1140) | 39.3% | 39.7% | +0.4pp |
| LiveCodeBench v6 (1 run, n=1055, t=0.0) | 29.0% | 26.9% | -2.1pp |
HumanEval and HumanEval+ were the clear wins. BigCodeBench barely moved and LiveCodeBench actually regressed; both outcomes taught me something.
## What surprised me
LiveCodeBench regressed because of a format mismatch, not a knowledge gap. I checked the failing cases — model output had correct logic but the wrong wrapper. My training data said "return only the function" while LCB tests need full programs with input() / print(). Format mismatches show up as "wrong answer" on benchmarks, but they're way easier to fix than actual missing knowledge.
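Here's the mismatch in miniature, using an invented toy task (max minus min of a list) rather than a real benchmark problem:

```python
# The same logic in two wrappers. Training taught the first style;
# LiveCodeBench grades the second (a full program: stdin in, stdout out).

# Style 1: "return only the function" -- what the fine-tune saw.
def solve(nums: list[int]) -> int:
    return max(nums) - min(nums)

# Style 2: full program with input()/print() -- what LCB expects.
def main() -> None:
    nums = list(map(int, input().split()))
    print(max(nums) - min(nums))

if __name__ == "__main__":
    main()
```

A model trained only on style 1 emits a bare `solve`, the harness runs it as a program, nothing reads stdin or prints, and the case is marked wrong despite correct logic.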
Judge model matters more than generator model. I tested several judges. Some flash-tier models rubber-stamped almost everything (scores 95-100 across the board), while smaller models skipped 70% of examples they didn't understand. Pick the wrong judge and your "quality dataset" is just noise with a fancy filter.
Concise prompts beat elaborate ones. I started with detailed multi-paragraph category descriptions. Generation quality got worse. Stripped them down to 2-3 sentences plus 4-6 judge criteria; accept rate jumped and output got cleaner.
## What didn't work
I tried to be clever with judge criteria. I added more and more filters trying to catch every edge case I noticed in pilot runs. Accept rate dropped from ~85% to 10%. The filters were technically correct, but the generator couldn't deliver against all of them. Lesson: it's better to accept some noise than to over-constrain and stall the whole pipeline.
I also wasted time on BigCodeBench. My "Data Libraries" category was too generic — "any 2+ libs from this list" — and BCB tests precise library API usage with concrete kwargs. Result: +0.4pp. To actually move BCB, I'd need a category seeded from BCB's own taxonomy of ~139 libraries with specific signature drilling.
## Tech stack
Nothing exotic:
- Frontend: Next.js 16 (static export) + Tailwind + shadcn/ui
- Backend: FastAPI + SQLite (WAL mode) + Pydantic
- Desktop: pywebview (WebKit2 on Linux, WebView2 on Windows)
- Packaging: PyInstaller — Linux AppImage works (~73 MB)
- LLM access: OpenRouter (no vendor lock-in, switch models freely)
- Dedup: OpenRouter embeddings + numpy cosine similarity
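The dedup step is simple enough to show in full. This is a sketch under assumptions: embeddings are pretended to already exist as a numpy array (the app fetches them via OpenRouter), and the 0.9 threshold is illustrative, not the app's setting.

```python
import numpy as np

# Sketch of embedding dedup: drop any example whose embedding is too close
# (by cosine similarity) to one already kept. Threshold 0.9 is an assumption.
def dedupe(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    # Normalize rows so a dot product of two rows equals their cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        # Keep the example only if it isn't near-identical to anything kept.
        if not kept or np.dot(normed[kept], vec).max() < threshold:
            kept.append(i)
    return kept

embs = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(dedupe(embs))  # [0, 2] -- the middle vector nearly duplicates the first
```

Greedy first-come-first-kept is O(n²) in the worst case but fine at a few thousand examples; at larger scale you'd reach for an approximate nearest-neighbor index instead.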
License is AGPL-3.0 — I picked it over MIT on purpose. If someone wraps this as SaaS, I want the changes to come back to the project.
## What's next
Local LLM support (Ollama / LM Studio) so people can generate datasets without paying for API calls. After that, a system tray version for quieter long-running jobs.
Already in progress: two new categories targeting LiveCodeBench (algorithmic drill with edge-case coverage) and BigCodeBench (API-precise library taxonomy). Goal is to lift the two benchmarks where this run fell flat.
If you've fine-tuned a model on a synthetic dataset, I'd love to hear what worked for you — especially around judge model selection and category design. Drop a comment.
Disclosure: I drafted this post with AI help — same way I built the app.
