Fine-tuning a model requires data. Good data requires human labeling. Human labeling doesn't scale. And most synthetic generation pipelines stop at generation: they produce candidate pairs but have no mechanism to filter them, measure quality, or feed failure cases back into the next round.
Synthetic Data Flywheel is a closed-loop pipeline that handles the full cycle: generate candidate instruction-output pairs, validate them deterministically, score them with an LLM-as-judge, calibrate that judge against human labels, export clean training data, and feed the failure cases from one cycle as seeds into the next. It ships as a CLI, a Python library, and an A2A-protocol agent surface for multi-agent orchestration.
Everything except the optional fine-tuning step runs on CPU.
The Problem
Synthetic data generation without a quality gate produces noise at scale. And quality gates without calibration produce a judge whose scores you can't trust. The flywheel addresses both: every candidate pair is scored, every score can be validated against human labels, and every failure becomes signal for the next generation cycle rather than a dead end.
How It Works
A dataset moves through a series of additive stages, each producing artifacts keyed by the dataset name. Every stage is idempotent and re-runnable.
Generation: Candidate pairs are produced from seed prompts via OpenRouter, using one of four prompt templates: QA, INSTRUCTION, REASONING, or CREATIVE.
Validation: Deterministic checks run over each pair: schema, length, dedup, PII, language, profanity. Results are written as a JSON report with per-issue severity levels (error, warning). A cleaned copy of the dataset can be written at this stage.
Judging: An LLM-as-judge scores each pair against a rubric. The judge supports three backends: Ollama, OpenRouter, and Anthropic. Judgments are cached on disk keyed by (backend, model, pair.id, rubric.name@version), so repeated judge passes on unchanged pairs are free; a small keying sketch follows after this list of stages.
Labeling: Three modes: interactive (a human reviews pairs one by one), bulk (apply a status to a filtered subset), and auto-from-judge (derive labels from judgment scores above a threshold). Labels are stored append-only so sessions can be interrupted and resumed.
Calibration: Treats human labels (status == approved) as ground truth and measures the judge's precision, recall, F1, and accuracy.
Compare: Two or more judgment runs on the same dataset are compared: pass-agreement, Cohen's kappa, and Pearson correlation on the overall score.
Export: Pairs that clear the judge filter are written to a train/val split. The filter expression uses a safe evaluator: only arithmetic, comparisons, and subscript access into the context dict are allowed. Attribute access and function calls are rejected.
Cycle feedback: Failure instructions from one cycle are extracted and fed as additional seeds into cycle N+1. The autonomous loop stops when the pass rate drops below min_pass_rate (default 0.5) or max_cycles is reached.
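Because judging is the expensive step, the cache key matters. Below is a minimal sketch of how a disk cache keyed by (backend, model, pair id, rubric name@version) can work; the function and argument names are illustrative assumptions, not the project's actual JudgmentCache API.
import hashlib
from pathlib import Path

def judgment_cache_path(root: Path, backend: str, model: str,
                        pair_id: str, rubric_name: str, rubric_version: str) -> Path:
    # Hypothetical helper: the same (backend, model, pair, rubric) tuple always
    # maps to the same file, so re-judging an unchanged pair is a cache hit
    # instead of another LLM call.
    key = f"{backend}|{model}|{pair_id}|{rubric_name}@{rubric_version}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return root / f"{digest}.json"

print(judgment_cache_path(Path(".cache/judge"), "ollama", "gemma4:latest",
                          "pair-0001", "default", "1"))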
Getting Started
Install
git clone https://github.com/dakshjain-1616/synthetic-data-flywheel
cd synthetic-data-flywheel
pip install -e .
Requires Python 3.11+. Generation requires OPENROUTER_API_KEY. The local judge path requires Ollama, verified against gemma4:latest. Fine-tuning requires Unsloth and a GPU; the repo was verified on a free Colab T4.
Initialize
init creates the directory structure the rest of the pipeline writes into.
flywheel init
Synthetic Data Flywheel Initialized
Data Directory: ./data
Checkpoint Directory: ./data/checkpoints
Report Directory: ./reports
Directories created successfully
Ingest
ingest normalises an existing dataset into the flywheel's internal JSONL format. It supports jsonl, csv, and HuggingFace datasets, and accepts a field mapping flag when the source uses different column names.
flywheel ingest -i demo.jsonl -n demo --tag demo1
Ingested 8 pairs -> data/user/demo.jsonl
Other ingest forms:
flywheel ingest -i data.csv -n my_dataset -f csv
flywheel ingest -i hf://tatsu-lab/alpaca -n alpaca --limit 500 --hf-split train
flywheel ingest -i data.jsonl -n aliased --map "instruction=prompt,output=completion"
flywheel ingest -i data.jsonl -n x --dry-run
Each successful ingest writes data/user/<name>.jsonl and data/user/<name>.meta.json.
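A minimal normalised record looks roughly like the following. The instruction and output fields are the ones the rest of the pipeline reads; the id shown here is an assumption (judgments are cached per pair.id), and the exact schema may carry additional metadata.
{"id": "demo-0001", "instruction": "What are the benefits of green tea?", "output": "Green tea is rich in antioxidants and may support focus and metabolism."}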
Validate
Before any judging happens, the validator runs deterministic checks over the dataset. This catches structural problems (duplicate pairs, PII, malformed schema) before spending LLM calls on them.
flywheel validate -d demo --checks schema,length,dedup,pii --write-clean data/user/demo.clean.jsonl
Validation: demo
Total pairs 8
pii 1
severity:warning 1
Report: data/validation/demo.report.json
Clean dataset written (8 pairs): data/user/demo.clean.jsonl
The --fail-on error|warning|never flag lets you gate CI on validation issues.
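For example, a CI job can fail the build on any error-level finding; the command is expected to exit non-zero when the gate trips:
flywheel validate -d demo --checks schema,length,dedup,pii --fail-on error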
Judge
With a clean dataset, the judge scores each pair against a rubric. The default rubric is built-in; custom rubrics can be passed with --rubric. Results are cached, so re-running after adding new pairs only scores the new ones.
flywheel judge -d demo --backend ollama --model gemma4:latest --tag v1 --max-pairs 3
Judging 3 pairs with ollama:gemma4:latest
Judged 3
Passed 0 (0.0%)
Avg overall (scored) 5.00
Output data/judgments/demo.v1.jsonl
Cache hits=0 misses=3 writes=3
Judgments land at data/judgments/<dataset>.<tag>.jsonl. The --tag flag is how multiple judgment runs on the same dataset are tracked separately.
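To set up the comparison shown later, a second judgment run with a different backend or model can be written under its own tag. The OpenRouter model name below is the one used elsewhere in this post and is purely illustrative:
flywheel judge -d demo --backend openrouter --model qwen/qwen3-8b:free --tag judge_b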
Label
Labeling bridges human judgment and automated scoring. auto-from-judge derives labels directly from the judgment scores: pairs above the threshold are approved, pairs below are rejected.
flywheel label -d demo --mode auto-from-judge --judgments data/judgments/demo.v1.jsonl --reject-below 3.5
For manual review, --mode interactive walks through pairs one by one. For bulk operations, --mode bulk applies a status to a filtered subset. All labels are stored append-only at data/labels/<dataset>.jsonl.
Compare
When you have two judgment runs, say from two different models, compare measures how much they agree. Cohen's kappa close to 1.0 means the two judges are making the same pass/fail decisions.
flywheel compare -d demo --tags judge_a,judge_b
Judge comparison: judge_a vs judge_b
Common pairs 8
judge_a passed / mean 6 / 7.44
judge_b passed / mean 6 / 7.19
Pass agreement 100.0%
Cohen's kappa (p/f) 1.000 (near-perfect)
Score Pearson r 0.965
Output reports/demo/compare.json
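For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed pass/fail agreement between the two runs and p_e is the agreement expected by chance given each judge's pass rate; 1.0 means perfect agreement beyond chance, 0 means no better than chance.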
Calibrate
Calibration answers the question you need to answer before trusting your judge: does its passed decision align with human labels? A precision of 1.0 means every pair the judge passed was also approved by a human. A recall of 0.75 means the judge missed 25% of the pairs humans would have kept.
flywheel calibrate -d demo --tag judge_a --approved-is approved
Evaluated pairs 8
Precision 1.000
Recall 0.750
F1 0.857
Accuracy 0.750
TP/FP/TN/FN 6/0/0/2
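These metrics follow directly from the confusion counts: precision = TP / (TP + FP) = 6 / 6 = 1.000, recall = TP / (TP + FN) = 6 / 8 = 0.750, F1 = 2 × 1.000 × 0.750 / (1.000 + 0.750) = 0.857, and accuracy = (TP + TN) / total = 6 / 8 = 0.750.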
Visualize
visualize renders a suite of PNG charts and an index.html for a dataset, covering label distribution, score distributions, pass/fail breakdown, pair lengths, categories, judge agreement matrix, and validation results.
flywheel visualize -d demo
categories reports/demo/categories.png
lengths reports/demo/lengths.png
validation reports/demo/validation.png
pass_fail reports/demo/pass_fail.png
scores reports/demo/scores.png
criteria reports/demo/criteria.png
labels reports/demo/labels.png
judge_agreement reports/demo/judge_agreement.png
index.html reports/demo/index.html
Dataset inspection and export
Before exporting, dataset ls and dataset info show what artifacts exist for each dataset.
flywheel dataset ls
name pairs source tags
demo 8 jsonl demo1
flywheel dataset info demo
pairs data/user/demo.jsonl present
meta data/user/demo.meta.json present
validation data/validation/demo.report.json present
labels data/labels/demo.jsonl present
judgments data/judgments 5 set(s)
Export filters pairs using a safe expression; here, only pairs with an overall score of 7 or above are written, split 80/20 into train and val.
flywheel dataset export demo \
--to data/exports/demo.jsonl \
--format jsonl \
--judgments data/judgments/demo.judge_a.jsonl \
--filter "scores['overall'] >= 7" \
--split train=0.8,val=0.2
Wrote 4 pairs -> data/exports/demo.train.jsonl
Wrote 2 pairs -> data/exports/demo.val.jsonl
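For the curious, a restricted evaluator like the one the export filter describes can be built on Python's ast module. The sketch below is illustrative, not the project's actual implementation: it allows literals, context names, subscripting, arithmetic, comparisons, and boolean operators, and rejects everything else, including attribute access and function calls.
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Gt: operator.gt, ast.GtE: operator.ge,
            ast.Lt: operator.lt, ast.LtE: operator.le,
            ast.Eq: operator.eq, ast.NotEq: operator.ne}

def safe_eval(expression: str, context: dict):
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return context[node.id]
        if isinstance(node, ast.Subscript):
            return _eval(node.value)[_eval(node.slice)]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Compare):
            left, result = _eval(node.left), True
            for op, comparator in zip(node.ops, node.comparators):
                right = _eval(comparator)
                result = result and _CMP_OPS[type(op)](left, right)
                left = right
            return result
        if isinstance(node, ast.BoolOp):
            vals = [_eval(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        # Attribute access, calls, lambdas, etc. all land here and are rejected.
        raise ValueError(f"disallowed expression node: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_eval("scores['overall'] >= 7", {"scores": {"overall": 8.5}}))  # True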
Run the autonomous loop
flywheel run ties everything together into a seeds-to-checkpoint cycle. Generation goes through OpenRouter; judging goes through Ollama. If Ollama isn't running, generation still succeeds and pairs are saved in the checkpoint, but every judgment falls back to passed=false. The standalone flywheel judge --backend openrouter works fully without Ollama.
export OPENROUTER_API_KEY=sk-or-...
export OPENROUTER_MODEL=meta-llama/llama-3.2-3b-instruct
flywheel run -s "benefits of green tea,history of python language" --max-cycles 1
╭───── Configuration ─────╮
│ Synthetic Data Flywheel │
│ Seeds: 2                │
│ Max Cycles: 1           │
╰─────────────────────────╯
Starting Flywheel with max_cycles=1
============================================================
Starting Cycle 1
============================================================
Using 2 seeds
Generating synthetic data...
Generated 2 pairs
Judging quality...
Passed: 0, Failed: 2
Cycle 1 complete. Pass rate: 0.00%
Flywheel complete. Ran 1 cycles.
Flywheel Summary
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric             ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Cycles       │ 1     │
│ Total Passed Pairs │ 0     │
│ Avg Pass Rate      │ 0.00% │
└────────────────────┴───────┘
Each cycle writes a checkpoint. The generated pair is saved verbatim inside data/checkpoints/checkpoint_001.json:
{
"instruction": "benefits of green tea",
"output": "Here is an example of an instruction-following training data in JSON format:\n\n{\n \"instruction\": \"What are some of the benefits of drinking green tea?\",\n \"output\": \"Green tea has numerous benefits, including: - High antioxidant content - Anti-inflammatory properties - May help with weight loss ...\",\n \"category\": \"instruction\"\n}",
"source_seed": "benefits of green tea"
}
Status and report
flywheel status
flywheel report
status summarises checkpoint state. report produces an HTML report across cycles written to reports/flywheel_report_<timestamp>.html.
CLI Reference
flywheel --help lists the command groups. Every command has --help with full flag docs.
$ flywheel --help
Usage: flywheel [OPTIONS] COMMAND [ARGS]...
Synthetic Data Flywheel - Autonomous data generation pipeline.
Commands:
calibrate Measure judge 'passed' against human labels (precision/recall/F1).
compare Compare two+ judgment runs (Cohen's kappa, agreement, ...).
dataset Dataset management: ls | info | export.
ingest Ingest a user dataset into the flywheel's JSONL format.
init Initialize flywheel configuration.
judge Judge a dataset with an LLM-as-judge backend.
label Label a dataset: interactive/bulk/auto-from-judge.
pipeline Run declarative YAML pipelines.
report Generate HTML report from checkpoints.
run Run the synthetic data flywheel.
status Show current flywheel status.
validate Validate a dataset and write a ValidationReport.
visualize Render a suite of PNG charts + index.html for a dataset.
Pipeline Runner
Individual commands can be composed into a declarative YAML pipeline and run as a single step. This is useful for repeatable workflows: the pipeline dispatches through the same Click commands as manual runs, so behaviour is identical.
# pipeline_demo.yaml
dataset: demo
steps:
- validate:
checks: [schema, length, dedup]
- export:
to: data/user/demo_pipeline.jsonl
format: jsonl
flywheel pipeline run pipeline_demo.yaml
[1/2] flywheel validate -d demo --checks schema,length,dedup
[2/2] flywheel dataset export demo --to data/user/demo_pipeline.jsonl --format jsonl
Pipeline: demo
1 validate ok 0
2 export ok 0
Python API
The full pipeline is available as a library. The minimal end-to-end call scores a dataset with an async judge backed by Ollama:
import asyncio
from pathlib import Path
from synthetic_data_flywheel.ingest import load_dataset_jsonl
from synthetic_data_flywheel.rubrics import default_rubric
from synthetic_data_flywheel.judge import AsyncQualityJudge
from synthetic_data_flywheel.judge_backends import get_backend
from synthetic_data_flywheel.judge_cache import JudgmentCache
pairs = load_dataset_jsonl("data/user/demo.jsonl")
backend = get_backend("ollama", model="gemma4:latest")
judge = AsyncQualityJudge(
backend=backend,
rubric=default_rubric(),
cache=JudgmentCache(root=Path(".cache/judge")),
backend_name="ollama",
)
judgments = asyncio.run(judge.judge_batch(pairs, concurrency=2))
print(sum(j.passed for j in judgments), "/", len(judgments), "passed")
The statistical functions used internally by calibrate and compare are also directly callable:
from synthetic_data_flywheel.stats import cohens_kappa, pearson, prf
cohens_kappa([True, False, True, False], [True, True, True, False])
# 0.5
pearson([1,2,3,4], [1,3,2,5])
# 0.8315...
prf([True,True,False,False], [True,False,True,False])
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5,
# 'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1}
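The kappa value above checks out by hand: the two lists agree on 3 of 4 items, so observed agreement p_o = 0.75; the first list is True half the time and the second 75% of the time, so chance agreement p_e = 0.5 × 0.75 + 0.5 × 0.25 = 0.5, giving kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.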
A2A Agent
The flywheel exposes a FastAPI application implementing the A2A protocol surface (/a2a/capabilities, /a2a/tasks/send, /a2a/tasks/get, /a2a/tasks/cancel) so it can be orchestrated as a node in a multi-agent ML pipeline.
python -m synthetic_data_flywheel.a2a_agent
# or
uvicorn synthetic_data_flywheel.a2a_agent:app --host 0.0.0.0 --port 8080
Three capabilities are exposed: generate_synthetic_data, get_status, generate_report. Querying /a2a/capabilities returns the agent's identity and the full capability list:
from fastapi.testclient import TestClient
from synthetic_data_flywheel.a2a_agent import app
client = TestClient(app)
print(client.get("/a2a/capabilities").json())
# {'agent_name': 'synthetic_data_flywheel', 'version': '0.1.0',
# 'capabilities': [{'name': 'generate_synthetic_data', ...},
# {'name': 'get_status', ...},
# {'name': 'generate_report', ...}]}
r = client.post("/a2a/tasks/send", json={
"capability": "get_status",
"inputs": [],
"parameters": {},
})
print(r.json())
# {'task_id': '...', 'status': {'state': 'completed'},
# 'result': {'type': 'status_result',
# 'content': {'checkpoints_found': 1,
# 'checkpoint_dir': 'data/checkpoints'}}}
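Kicking off a generation run through the same surface looks similar. The parameter names below (seeds, max_cycles) are assumptions inferred from the CLI flags rather than a documented schema; the /a2a/capabilities response is the authoritative source.
r = client.post("/a2a/tasks/send", json={
    "capability": "generate_synthetic_data",
    "inputs": [],
    "parameters": {"seeds": ["benefits of green tea"], "max_cycles": 1},  # hypothetical parameter names
})
print(r.json()["status"])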
Configuration
All settings are read from environment variables or a .env file:
OPENROUTER_API_KEY=sk-or-...
OPENROUTER_MODEL=qwen/qwen3-8b:free
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=gemma4:latest
DEFAULT_JUDGE_BACKEND=ollama # ollama | openrouter | anthropic
JUDGE_CONCURRENCY=4
JUDGE_TIMEOUT=600
QUALITY_MIN_SCORE=7.0
MAX_CYCLES=10
PII_POLICY=warn # strict | warn | off
A2A_HOST=0.0.0.0
A2A_PORT=8080
JUDGE_TIMEOUT defaults to 600 seconds; large local models can take over two minutes on the first call.
Limitations
Fine-tuning requires a GPU: Trainer.prepare_training_artifacts writes a Colab-ready Unsloth notebook under notebooks/training_cycle_NNN.ipynb. Running the training step locally on CPU is not supported by Unsloth.
Autonomous generation requires OpenRouter: flywheel run requires OPENROUTER_API_KEY. The in-loop judge is hardcoded to Ollama (engine.create_judge constructs a sync QualityJudge over OllamaClient); if Ollama isn't available, pairs are persisted but every judgment falls back to passed=false.
Large local judges are slow to cold-start: Gemma 4 (9 GB) takes about 130 seconds the first time it loads into VRAM/RAM. The default JUDGE_TIMEOUT is 600 seconds to cover this.
HuggingFace ingest requires the datasets library: it is already a dependency, but gated datasets additionally require HUGGINGFACE_TOKEN.
Anthropic judge backend requires ANTHROPIC_API_KEY: no offline fallback.
How I Built This Using NEO
This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.
The problem was defined at a high level: a closed-loop pipeline that generates synthetic instruction-tuning pairs, filters them with a calibrated LLM judge, and feeds failure cases back as seeds for the next cycle. NEO generated the full implementation: the FlywheelEngine cycle loop with checkpointing, the AsyncQualityJudge with three pluggable backends and a disk-backed cache, the deterministic Validator with six check types, the LabelStore with append-only storage, the statistical calibration layer (cohens_kappa, pearson, prf), the safe-eval export filter, the declarative YAML pipeline runner, the Matplotlib visualisation suite, and the A2A FastAPI agent surface. 100 tests pass.
How You Can Build Further With NEO
Additional judge backends: the three existing backends share a common interface via get_backend. Any OpenAI-compatible endpoint can be wired in as a new backend, and the judge cache, calibration, and compare logic all work with it immediately without any changes.
Additional generation templates: the generator ships with four templates: QA, INSTRUCTION, REASONING, and CREATIVE. New domain-specific templates (code generation, structured extraction, tool use) would let the flywheel produce specialised training data while the cycle loop, judge, and export pipeline stay entirely unchanged.
Additional validation checks: the Validator already supports six check types plugged into the same --checks flag and report format. New checks for domain-specific quality signals would run in the same validation pass and appear in the same JSON report and visualisation output.
Multi-judge ensembling: compare already computes agreement metrics across judgment runs. Taking the average or majority vote across two or more judge scores before the pass/fail decision would reduce the noise that small local models introduce, without touching the labeling, calibration, or export logic downstream.
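A minimal sketch of what that majority vote could look like, assuming each judgment run is a JSONL file whose records carry a pair id and a passed flag (the field names pair_id and passed are assumptions, not the project's exact judgment schema):
import json
from collections import defaultdict
from pathlib import Path

def majority_vote(judgment_files: list[str]) -> dict[str, bool]:
    # Collect each judge's pass/fail vote per pair, then require a strict majority.
    votes: defaultdict[str, list[bool]] = defaultdict(list)
    for path in judgment_files:
        for line in Path(path).read_text().splitlines():
            record = json.loads(line)
            votes[record["pair_id"]].append(bool(record["passed"]))
    return {pid: sum(v) > len(v) / 2 for pid, v in votes.items()}

ensemble = majority_vote([
    "data/judgments/demo.judge_a.jsonl",
    "data/judgments/demo.judge_b.jsonl",
])
print(sum(ensemble.values()), "of", len(ensemble), "pairs pass the ensemble")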
Final Notes
Synthetic Data Flywheel closes the loop that most synthetic data pipelines leave open. It generates, validates, judges, calibrates, exports, and feeds what failed back into the next cycle. The result is a data pipeline that improves with each run rather than producing a static batch.
The code is at https://github.com/dakshjain-1616/synthetic-data-flywheel
You can also build with NEO in your IDE using the VS Code extension or Cursor.