Dataset Generator v1.0.3-beta ships local LLM support — fine-tune your model without paying a cent for API

Some time ago I shipped a desktop app that generates LLM fine-tuning datasets. It worked: my Qwen2.5-Coder-7B fine-tune jumped from 55.5% → 72.3% on HumanEval. The whole pipeline ran on OpenRouter — pick a model, click Generate, get JSONL.

v1.0.3-beta ships multi-provider LLM support — Ollama, LM Studio, llama.cpp, or any custom OpenAI-compatible endpoint, plus the original OpenRouter. Mix and match: generate on your local Qwen3-14B, judge on a cheap cloud model. Or stay fully offline.

Here's what shipped, what was harder than I expected, and what I learned along the way.

What's new in v1.0.3-beta

One-click local LLM. Open Settings → Providers → "Auto-detect local". The app probes localhost:11434 (Ollama), 1234 (LM Studio), 8080 (llama.cpp). Anything that answers gets a one-click "Add" button. Onboarding for an offline-first user takes ~30 seconds.
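
For the curious, the probe boils down to roughly this — a minimal sketch, not the app's actual code, assuming each backend answers a cheap GET on its default port (Ollama's /api/tags, the OpenAI-style /v1/models for LM Studio and llama.cpp):

```python
import httpx

# Default ports + a cheap "are you alive?" endpoint per backend.
CANDIDATES = [
    ("Ollama",    "http://localhost:11434/api/tags"),
    ("LM Studio", "http://localhost:1234/v1/models"),
    ("llama.cpp", "http://localhost:8080/v1/models"),
]

def detect_local_providers(timeout: float = 1.0) -> list[str]:
    found = []
    for name, url in CANDIDATES:
        try:
            if httpx.get(url, timeout=timeout).status_code == 200:
                found.append(name)  # becomes a one-click "Add" button in the UI
        except httpx.HTTPError:
            pass  # nothing listening on that port
    return found
```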

Mixed mode. Each category can use its own provider. Generate on a local Qwen2.5-Coder:14B, judge on a cheap cloud model like GPT-4o mini. Or use different generators per category — say, the algorithm category on a code-specialised local model. The pipeline routes each call to the right backend automatically.
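
In config terms, mixed mode is just a per-category provider/model pair plus a separate judge entry. Purely illustrative — these names aren't the app's real schema:

```python
# Hypothetical shape of a mixed-mode job: each category picks its own
# provider/model, and the judge can live on a different backend entirely.
job_config = {
    "categories": {
        "algorithms":  {"provider": "ollama-local", "model": "qwen2.5-coder:14b"},
        "library_api": {"provider": "ollama-local", "model": "qwen2.5-coder:14b"},
    },
    "judge": {"provider": "openrouter", "model": "openai/gpt-4o-mini"},
}
```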

Custom endpoints. Any OpenAI-compatible URL works: vLLM, TGI, your buddy's self-hosted gateway. Paste base URL + optional bearer token, done.
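
If you've used the OpenAI Python client, it's the same idea — point it at your own base URL. Illustrative only; the URL, token, and model name below are placeholders for whatever you paste into the form:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your server's OpenAI-compatible root
    api_key="optional-bearer-token",      # some servers ignore this entirely
)

resp = client.chat.completions.create(
    model="qwen2.5-coder-14b",  # whatever name your server registers
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```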

Instant cancel for local jobs. Cloud APIs answer in seconds — cooperative cancel between calls is fine. Local 14B can sit on a single chat completion for minutes. v1.0.3-beta wires asyncio.Task.cancel() straight into the in-flight HTTP request, so cancel feels instant (~1s) instead of "wait 8 minutes for the chat call to time out".
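
A minimal sketch of why that feels instant (assuming httpx + asyncio; not the app's exact code): cancelling the task raises CancelledError inside the awaited HTTP call, which drops the connection instead of waiting out the generation.

```python
import asyncio
import httpx

async def long_local_completion(client: httpx.AsyncClient) -> dict:
    # A 14B model can chew on this for minutes, so no client-side timeout.
    resp = await client.post(
        "http://localhost:11434/v1/chat/completions",
        json={"model": "qwen3:14b",
              "messages": [{"role": "user", "content": "Generate ten examples."}]},
        timeout=None,
    )
    return resp.json()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        task = asyncio.create_task(long_local_completion(client))
        await asyncio.sleep(1.0)  # user hits Cancel in the UI
        task.cancel()             # propagates straight into the in-flight request
        try:
            await task
        except asyncio.CancelledError:
            print("cancelled in ~1s, not after the full generation")

asyncio.run(main())
```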

Auto-handling for reasoning models. Qwen3, DeepSeek-R1, and friends emit <think>...</think> blocks that eat the whole token budget before any actual output. The pipeline detects "reasoning starvation" (empty content + finish=length + reasoning present) and auto-retries with a 4× budget. No manual fiddling.
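
The detection itself is three checks on the response. Roughly (a sketch, where `call` is a stand-in for however you wrap the provider request):

```python
async def complete_with_reasoning_guard(call, max_tokens: int) -> dict:
    resp = await call(max_tokens=max_tokens)
    choice = resp["choices"][0]
    msg = choice.get("message", {})
    starved = (
        not (msg.get("content") or "").strip()       # nothing usable came back
        and choice.get("finish_reason") == "length"  # the budget ran out
        and bool(msg.get("reasoning") or msg.get("reasoning_content"))  # but it *was* thinking
    )
    if starved:
        resp = await call(max_tokens=max_tokens * 4)  # one retry with 4x the budget
    return resp
```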

Why this matters

Three concrete user types this unlocks:

  • Privacy-conscious — corporate data, NDA'd code, anything you can't send to a third-party API. Now stays on your laptop.
  • Cost-conscious — generating 5000 multi-turn examples on cloud GPT-4 is $$$. On local Qwen3-14B it's electricity. Mixed mode (cheap local gen + cloud judge for quality control) is roughly 1/10th the bill.
  • No-cloud-account — regulations, no credit card, country without payment methods. The whole pipeline now runs without a single API call to anyone.

What was harder than expected

Token accounting across providers. OpenRouter cleanly breaks out reasoning_tokens in the usage payload. Ollama doesn't — completion_tokens is the full think+content figure. So when DeepSeek-R1 via Ollama generates 80 tokens of actual output after 800 tokens of <think>, the bill says 880, the dataset preview says 80, and the budget check trips constantly.

Fix: detect <think> blocks (Format A) or message.reasoning field (Format B), strip the reasoning, recount the kept content with tiktoken, write the corrected number back into usage.completion_tokens. tiktoken is an estimate (not the model's native tokeniser), but it's the only signal available when the provider doesn't surface a breakdown. Quality Report and per-example token counts now agree.
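
In code, the correction is small. A sketch using the field names described above (again, the tiktoken count is an estimate, not the model's native tokeniser):

```python
import re
import tiktoken

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
ENC = tiktoken.get_encoding("cl100k_base")  # estimate only

def correct_usage(message: dict, usage: dict) -> None:
    content = message.get("content") or ""
    if "<think>" in content:            # Format A: reasoning inlined in content
        content = THINK_RE.sub("", content)
        message["content"] = content
    elif not message.get("reasoning"):  # Format B: separate reasoning field
        return                          # no reasoning anywhere -> nothing to fix
    # Recount only what actually lands in the dataset.
    usage["completion_tokens"] = len(ENC.encode(content))
```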

LM Studio uses yet another field name. Same idea as Ollama's message.reasoning, but they call it message.reasoning_content. Discovered this with a curl, double-checked with another model, sigh. Pipeline still works because LM Studio does surface reasoning_tokens in completion_tokens_details (more like OpenRouter), so the subtract path catches it. But the per-provider response shape table grew another row.

Capability-driven branching, not provider-kind switches. First draft of the integration had if provider.kind == "ollama" peppered through the pipeline. That doesn't scale — the next user wants TabbyAPI, the one after wants their custom corporate gateway. So I refactored to ProviderCapabilities flags: supports_provider_routing, supports_reasoning, requires_api_key, has_pricing, supports_embeddings. Adding a new backend is now one class + one registry entry. Zero changes to job_runner.py.
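
The flags are exactly what they sound like. Sketched out (the flag names are from the pipeline; the Ollama values here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderCapabilities:
    supports_provider_routing: bool = False
    supports_reasoning: bool = False
    requires_api_key: bool = True
    has_pricing: bool = True
    supports_embeddings: bool = False

REGISTRY = {
    "ollama": ProviderCapabilities(
        supports_reasoning=True,
        requires_api_key=False,
        has_pricing=False,
        supports_embeddings=True,
    ),
}

# The pipeline asks "can you do X?", never "are you Ollama?":
# if REGISTRY[provider].supports_embeddings: ...
```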

Default reassignment UX. User clicks "Disable" on the OpenRouter provider (which happens to be the default). Old behaviour: silent orphan state, the next job hits a "Provider 'openrouter-default' is disabled" 422 error, and the user has no idea why. New behaviour: the backend auto-promotes the next enabled provider to default, and the frontend shows a 4-second toast: "Default switched to Ollama (local)". An annoyingly small bug to track down, easy to fix once seen.
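
The fix is a few lines in the disable handler. A hedged sketch, not the real endpoint code:

```python
def disable_provider(providers: list[dict], provider_id: str) -> str | None:
    disabled = next((p for p in providers if p["id"] == provider_id), None)
    if disabled is None:
        return None
    disabled["enabled"] = False
    if not disabled.get("is_default"):
        return None                       # wasn't the default, nothing to reassign
    disabled["is_default"] = False
    new_default = next((p for p in providers if p["enabled"]), None)
    if new_default:
        new_default["is_default"] = True  # frontend toasts "Default switched to ..."
        return new_default["id"]
    return None                           # nothing enabled left -> surface that now, not as a 422 later
```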

What I learned

<14B local models aren't worth it for dataset generation. I tested 7B and 9B variants for a week. The output is technically valid but constantly drifts off-topic, repeats patterns, or misunderstands category descriptions. Whatever you save on cloud tokens, you spend 5× over on rejected examples. 14B is the floor; 32B is the comfortable middle. If you have the VRAM, use it.

The judge model still matters more than the generator. Same lesson as in the original post, amplified now that local models are in the mix. I tried using small local judges to "save the judge cost too". Some 8B judges rubber-stamp 95-100 across the board. Some 14B judges reject 70% of perfectly good examples because they don't understand the category. Spend cloud money on the judge — or use a 32B+ local judge if you have the hardware.

Mixed mode is the killer feature. I expected "fully offline" to be the win. Turns out the workflow most people actually want is: cheap local model for the volume work (7000 examples), strong cloud model as judge (because rubber-stamp judges silently kill dataset quality). v1.0.3-beta makes this a one-line config — pick gen model from one provider, judge from another, ship it.

What didn't work

Per-provider concurrency limits. I prototyped this — you'd configure "Ollama: 1, OpenRouter: 10" so the global semaphore doesn't drown your local GPU. Turned out to be enterprise-flavoured complexity for ~zero real-world benefit (a single-user, single-GPU setup covers 99% of users, I'd guess). Cut from v1.0.3-beta, parked for someone with multi-GPU vLLM who actually needs it.

Provider badge in the model picker. When two providers serve the same model name (llama-3.1-8b on both Ollama and OpenRouter), the picker shows two identical-looking entries. I sketched a small badge UI to differentiate them, then realised typical setups don't have name collisions (you know which models you put where). Punted to a future polish pass.

Tech stack updates

Same foundations, new layer:

  • Frontend: Next.js 16 (static export) + Tailwind + base-ui — added ProvidersSection for CRUD + auto-detect + per-row connection test
  • Backend: FastAPI + SQLite (WAL) + Pydantic — added app/services/llm/ provider abstraction (LLMProvider ABC + ProviderCapabilities) and app/routers/providers.py
  • Schema migration: providers table added in v6 with backfill of the legacy single OpenRouter key — your existing setup migrates silently on first launch
  • Tests: 460 passing (up from 329 in the previous release) — full coverage for the four backends, registry resolution, auto-detect, mixed-mode jobs

Same AGPL-3.0 license. Same one-binary distribution (Linux AppImage, Windows exe).

Resources

Everything is open source:

What's next

System tray version. Long generation runs (5000+ examples on local hardware = hours) deserve a quieter UX than a permanently open window. Tray icon, "next job ready" notification, click to bring back the dashboard.

Embedding provider picker. Right now dedup works multi-provider on the backend, but the UI only exposes the OpenRouter embedding models. Adding a small dropdown so local users can run dedup on nomic-embed-text via Ollama too.

Two new categories targeting LiveCodeBench and BigCodeBench. The previous post explained why those benchmarks barely moved (format mismatch on LCB, too-generic library category for BCB). Both fixes are in progress — algorithmic drill with edge-case coverage for LCB, library-API-precise taxonomy for BCB.

If you generate datasets locally — what model size are you using and what's your accept rate? Especially curious if anyone got real value out of <14B local models for dataset gen, because my tests said no but I'd love to be wrong.


Disclosure: I drafted this post with AI help — same way I built the app.
