Fine-tuning a model requires data. Good data requires human labeling. Human labeling doesn't scale. And most synthetic generation pipelines stop at generation: they produce candidate pairs but have no mechanism to filter them, measure quality, or feed failure cases back into the next round.
Synthetic Data Flywheel is a closed-loop pipeline that handles the full cycle: generate candidate instruction-output pairs, validate them deterministically, score them with an LLM-as-judge, calibrate that judge against human labels, export clean training data, and feed the failure cases from one cycle as seeds into the next. It ships as a CLI, a Python library, and an A2A-protocol agent surface for multi-agent orchestration.
Everything except the optional fine-tuning step runs on CPU.
The Problem
Synthetic data generation without a quality gate produces noise at scale. And quality gates without calibration produce a judge whose scores you can't trust. The flywheel addresses both: every candidate pair is scored, every score can be validated against human labels, and every failure becomes signal for the next generation cycle rather than a dead end.
How It Works
A dataset moves through a series of additive stages, each producing artifacts keyed by the dataset name. Every stage is idempotent and re-runnable.
Generation: Candidate pairs are produced from seed prompts via OpenRouter, using one of four prompt templates: QA, INSTRUCTION, REASONING, or CREATIVE.
Validation: Deterministic checks run over each pair: schema, length, dedup, PII, language, profanity. Results are written as a JSON report with per-issue severity levels (error, warning). A cleaned copy of the dataset can be written at this stage.
Judging: An LLM-as-judge scores each pair against a rubric. The judge supports three backends: Ollama, OpenRouter, and Anthropic. Judgments are cached on disk keyed by (backend, model, pair.id, rubric.name@version), so repeated judge passes on unchanged pairs are free; a small keying sketch follows after this list of stages.
Labeling: Three modes: interactive (a human reviews pairs one by one), bulk (apply a status to a filtered subset), and auto-from-judge (derive labels from judgment scores above a threshold). Labels are stored append-only so sessions can be interrupted and resumed.
Calibration: Treats human labels (status == approved) as ground truth and measures the judge's precision, recall, F1, and accuracy.
Compare: Two or more judgment runs on the same dataset are compared: pass-agreement, Cohen's kappa, and Pearson correlation on the overall score.
Export: Pairs that clear the judge filter are written to a train/val split. The filter expression uses a safe evaluator: only arithmetic, comparisons, and subscript access into the context dict are allowed. Attribute access and function calls are rejected.
Cycle feedback: Failure instructions from one cycle are extracted and fed as additional seeds into cycle N+1. The autonomous loop stops when the pass rate drops below min_pass_rate (default 0.5) or max_cycles is reached.
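Because judging is the expensive step, the cache key matters. Below is a minimal sketch of how a disk cache keyed by (backend, model, pair id, rubric name@version) can work; the function and argument names are illustrative assumptions, not the project's actual JudgmentCache API.
import hashlib
from pathlib import Path

def judgment_cache_path(root: Path, backend: str, model: str,
                        pair_id: str, rubric_name: str, rubric_version: str) -> Path:
    # Hypothetical helper: the same (backend, model, pair, rubric) tuple always
    # maps to the same file, so re-judging an unchanged pair is a cache hit
    # instead of another LLM call.
    key = f"{backend}|{model}|{pair_id}|{rubric_name}@{rubric_version}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return root / f"{digest}.json"

print(judgment_cache_path(Path(".cache/judge"), "ollama", "gemma4:latest",
                          "pair-0001", "default", "1"))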
Getting Started
Install
git clone https://github.com/dakshjain-1616/synthetic-data-flywheel
cd synthetic-data-flywheel
pip install -e .
Requires Python 3.11+. Generation requires OPENROUTER_API_KEY. The local judge path requires Ollama, verified against gemma4:latest. Fine-tuning requires Unsloth and a GPU; the repo was verified on a free Colab T4.
Initialize
init creates the directory structure the rest of the pipeline writes into.
flywheel init
Synthetic Data Flywheel Initialized
Data Directory: ./data
Checkpoint Directory: ./data/checkpoints
Report Directory: ./reports
Directories created successfully
Ingest
ingest normalises an existing dataset into the flywheel's internal JSONL format. It supports jsonl, csv, and HuggingFace datasets, and accepts a field mapping flag when the source uses different column names.
flywheel ingest -i demo.jsonl -n demo --tag demo1
Ingested 8 pairs -> data/user/demo.jsonl
Other ingest forms:
flywheel ingest -i data.csv -n my_dataset -f csv
flywheel ingest -i hf://tatsu-lab/alpaca -n alpaca --limit 500 --hf-split train
flywheel ingest -i data.jsonl -n aliased --map "instruction=prompt,output=completion"
flywheel ingest -i data.jsonl -n x --dry-run
Each successful ingest writes data/user/<name>.jsonl and data/user/<name>.meta.json.
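A minimal normalised record looks roughly like the following. The instruction and output fields are the ones the rest of the pipeline reads; the id shown here is an assumption (judgments are cached per pair.id), and the exact schema may carry additional metadata.
{"id": "demo-0001", "instruction": "What are the benefits of green tea?", "output": "Green tea is rich in antioxidants and may support focus and metabolism."}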
Validate
Before any judging happens, the validator runs deterministic checks over the dataset. This catches structural problems (duplicate pairs, PII, malformed schema) before spending LLM calls on them.
flywheel validate -d demo --checks schema,length,dedup,pii --write-clean data/user/demo.clean.jsonl
Validation: demo
Total pairs 8
pii 1
severity:warning 1
Report: data/validation/demo.report.json
Clean dataset written (8 pairs): data/user/demo.clean.jsonl
The --fail-on error|warning|never flag lets you gate CI on validation issues.
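For example, a CI job can fail the build on any error-level finding; the command is expected to exit non-zero when the gate trips:
flywheel validate -d demo --checks schema,length,dedup,pii --fail-on error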
Judge
With a clean dataset, the judge scores each pair against a rubric. The default rubric is built-in; custom rubrics can be passed with --rubric. Results are cached, so re-running after adding new pairs only scores the new ones.
flywheel judge -d demo --backend ollama --model gemma4:latest --tag v1 --max-pairs 3
Judging 3 pairs with ollama:gemma4:latest
Judged 3
Passed 0 (0.0%)
Avg overall (scored) 5.00
Output data/judgments/demo.v1.jsonl
Cache hits=0 misses=3 writes=3
Judgments land at data/judgments/<dataset>.<tag>.jsonl. The --tag flag is how multiple judgment runs on the same dataset are tracked separately.
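To set up the comparison shown later, a second judgment run with a different backend or model can be written under its own tag. The OpenRouter model name below is the one used elsewhere in this post and is purely illustrative:
flywheel judge -d demo --backend openrouter --model qwen/qwen3-8b:free --tag judge_b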
Label
Labeling bridges human judgment and automated scoring. auto-from-judge derives labels directly from the judgment scores: pairs above the threshold are approved, pairs below are rejected.
flywheel label -d demo --mode auto-from-judge --judgments data/judgments/demo.v1.jsonl --reject-below 3.5
For manual review, --mode interactive walks through pairs one by one. For bulk operations, --mode bulk applies a status to a filtered subset. All labels are stored append-only at data/labels/<dataset>.jsonl.
Compare
When you have two judgment runs, say from two different models, compare measures how much they agree. Cohen's kappa close to 1.0 means the two judges are making the same pass/fail decisions.
flywheel compare -d demo --tags judge_a,judge_b
Judge comparison: judge_a vs judge_b
Common pairs 8
judge_a passed / mean 6 / 7.44
judge_b passed / mean 6 / 7.19
Pass agreement 100.0%
Cohen's kappa (p/f) 1.000 (near-perfect)
Score Pearson r 0.965
Output reports/demo/compare.json
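For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed pass/fail agreement between the two runs and p_e is the agreement expected by chance given each judge's pass rate; 1.0 means perfect agreement beyond chance, 0 means no better than chance.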
Calibrate
Calibration answers the question you need to answer before trusting your judge: does its passed decision align with human labels? A precision of 1.0 means every pair the judge passed was also approved by a human. A recall of 0.75 means the judge missed 25% of the pairs humans would have kept.
flywheel calibrate -d demo --tag judge_a --approved-is approved
Evaluated pairs 8
Precision 1.000
Recall 0.750
F1 0.857
Accuracy 0.750
TP/FP/TN/FN 6/0/0/2
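These metrics follow directly from the confusion counts: precision = TP / (TP + FP) = 6 / 6 = 1.000, recall = TP / (TP + FN) = 6 / 8 = 0.750, F1 = 2 × 1.000 × 0.750 / (1.000 + 0.750) = 0.857, and accuracy = (TP + TN) / total = 6 / 8 = 0.750.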
Visualize
visualize renders a suite of PNG charts and an index.html for a dataset, covering label distribution, score distributions, pass/fail breakdown, pair lengths, categories, judge agreement matrix, and validation results.
flywheel visualize -d demo
categories reports/demo/categories.png
lengths reports/demo/lengths.png
validation reports/demo/validation.png
pass_fail reports/demo/pass_fail.png
scores reports/demo/scores.png
criteria reports/demo/criteria.png
labels reports/demo/labels.png
judge_agreement reports/demo/judge_agreement.png
index.html reports/demo/index.html
Dataset inspection and export
Before exporting, dataset ls and dataset info show what artifacts exist for each dataset.
flywheel dataset ls
name pairs source tags
demo 8 jsonl demo1
flywheel dataset info demo
pairs data/user/demo.jsonl present
meta data/user/demo.meta.json present
validation data/validation/demo.report.json present
labels data/labels/demo.jsonl present
judgments data/judgments 5 set(s)
Export filters pairs using a safe expression; here, only pairs with an overall score of 7 or above are written, split 80/20 into train and val.
flywheel dataset export demo \
--to data/exports/demo.jsonl \
--format jsonl \
--judgments data/judgments/demo.judge_a.jsonl \
--filter "scores['overall'] >= 7" \
--split train=0.8,val=0.2
Wrote 4 pairs -> data/exports/demo.train.jsonl
Wrote 2 pairs -> data/exports/demo.val.jsonl
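For the curious, a restricted evaluator like the one the export filter describes can be built on Python's ast module. The sketch below is illustrative, not the project's actual implementation: it allows literals, context names, subscripting, arithmetic, comparisons, and boolean operators, and rejects everything else, including attribute access and function calls.
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Gt: operator.gt, ast.GtE: operator.ge,
            ast.Lt: operator.lt, ast.LtE: operator.le,
            ast.Eq: operator.eq, ast.NotEq: operator.ne}

def safe_eval(expression: str, context: dict):
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return context[node.id]
        if isinstance(node, ast.Subscript):
            return _eval(node.value)[_eval(node.slice)]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Compare):
            left, result = _eval(node.left), True
            for op, comparator in zip(node.ops, node.comparators):
                right = _eval(comparator)
                result = result and _CMP_OPS[type(op)](left, right)
                left = right
            return result
        if isinstance(node, ast.BoolOp):
            vals = [_eval(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        # Attribute access, calls, lambdas, etc. all land here and are rejected.
        raise ValueError(f"disallowed expression node: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_eval("scores['overall'] >= 7", {"scores": {"overall": 8.5}}))  # True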
Run the autonomous loop
flywheel run ties everything together into a seeds-to-checkpoint cycle. Generation goes through OpenRouter; judging goes through Ollama. If Ollama isn't running, generation still succeeds and pairs are saved in the checkpoint, but every judgment falls back to passed=false. The standalone flywheel judge --backend openrouter works fully without Ollama.
export OPENROUTER_API_KEY=sk-or-...
export OPENROUTER_MODEL=meta-llama/llama-3.2-3b-instruct
flywheel run -s "benefits of green tea,history of python language" --max-cycles 1
╭───── Configuration ─────╮
│ Synthetic Data Flywheel │
│ Seeds: 2                │
│ Max Cycles: 1           │
╰─────────────────────────╯
Starting Flywheel with max_cycles=1
============================================================
Starting Cycle 1
============================================================
Using 2 seeds
Generating synthetic data...
Generated 2 pairs
Judging quality...
Passed: 0, Failed: 2
Cycle 1 complete. Pass rate: 0.00%
Flywheel complete. Ran 1 cycles.
Flywheel Summary
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric             ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Cycles       │ 1     │
│ Total Passed Pairs │ 0     │
│ Avg Pass Rate      │ 0.00% │
└────────────────────┴───────┘
Each cycle writes a checkpoint. The generated pair is saved verbatim inside data/checkpoints/checkpoint_001.json:
{
"instruction": "benefits of green tea",
"output": "Here is an example of an instruction-following training data in JSON format:\n\n{\n \"instruction\": \"What are some of the benefits of drinking green tea?\",\n \"output\": \"Green tea has numerous benefits, including: - High antioxidant content - Anti-inflammatory properties - May help with weight loss ...\",\n \"category\": \"instruction\"\n}",
"source_seed": "benefits of green tea"
}
Status and report
flywheel status
flywheel report
status summarises checkpoint state. report produces an HTML report across cycles written to reports/flywheel_report_<timestamp>.html.
CLI Reference
flywheel --help lists the command groups. Every command has --help with full flag docs.
$ flywheel --help
Usage: flywheel [OPTIONS] COMMAND [ARGS]...
Synthetic Data Flywheel - Autonomous data generation pipeline.
Commands:
calibrate Measure judge 'passed' against human labels (precision/recall/F1).
compare Compare two+ judgment runs (Cohen's kappa, agreement, ...).
dataset Dataset management: ls | info | export.
ingest Ingest a user dataset into the flywheel's JSONL format.
init Initialize flywheel configuration.
judge Judge a dataset with an LLM-as-judge backend.
label Label a dataset: interactive/bulk/auto-from-judge.
pipeline Run declarative YAML pipelines.
report Generate HTML report from checkpoints.
run Run the synthetic data flywheel.
status Show current flywheel status.
validate Validate a dataset and write a ValidationReport.
visualize Render a suite of PNG charts + index.html for a dataset.
Pipeline Runner
Individual commands can be composed into a declarative YAML pipeline and run as a single step. This is useful for repeatable workflows: the pipeline dispatches through the same Click commands as manual runs, so behaviour is identical.
# pipeline_demo.yaml
dataset: demo
steps:
- validate:
checks: [schema, length, dedup]
- export:
to: data/user/demo_pipeline.jsonl
format: jsonl
flywheel pipeline run pipeline_demo.yaml
[1/2] flywheel validate -d demo --checks schema,length,dedup
[2/2] flywheel dataset export demo --to data/user/demo_pipeline.jsonl --format jsonl
Pipeline: demo
1 validate ok 0
2 export ok 0
Python API
The full pipeline is available as a library. The minimal end-to-end call scores a dataset with an async judge backed by Ollama:
import asyncio
from pathlib import Path
from synthetic_data_flywheel.ingest import load_dataset_jsonl
from synthetic_data_flywheel.rubrics import default_rubric
from synthetic_data_flywheel.judge import AsyncQualityJudge
from synthetic_data_flywheel.judge_backends import get_backend
from synthetic_data_flywheel.judge_cache import JudgmentCache
pairs = load_dataset_jsonl("data/user/demo.jsonl")
backend = get_backend("ollama", model="gemma4:latest")
judge = AsyncQualityJudge(
backend=backend,
rubric=default_rubric(),
cache=JudgmentCache(root=Path(".cache/judge")),
backend_name="ollama",
)
judgments = asyncio.run(judge.judge_batch(pairs, concurrency=2))
print(sum(j.passed for j in judgments), "/", len(judgments), "passed")
The statistical functions used internally by calibrate and compare are also directly callable:
from synthetic_data_flywheel.stats import cohens_kappa, pearson, prf
cohens_kappa([True, False, True, False], [True, True, True, False])
# 0.5
pearson([1,2,3,4], [1,3,2,5])
# 0.8315...
prf([True,True,False,False], [True,False,True,False])
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5,
# 'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1}
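The kappa value above checks out by hand: the two lists agree on 3 of 4 items, so observed agreement p_o = 0.75; the first list is True half the time and the second 75% of the time, so chance agreement p_e = 0.5 × 0.75 + 0.5 × 0.25 = 0.5, giving kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.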
A2A Agent
The flywheel exposes a FastAPI application implementing the A2A protocol surface (/a2a/capabilities, /a2a/tasks/send, /a2a/tasks/get, /a2a/tasks/cancel) so it can be orchestrated as a node in a multi-agent ML pipeline.
python -m synthetic_data_flywheel.a2a_agent
# or
uvicorn synthetic_data_flywheel.a2a_agent:app --host 0.0.0.0 --port 8080
Three capabilities are exposed: generate_synthetic_data, get_status, generate_report. Querying /a2a/capabilities returns the agent's identity and the full capability list:
from fastapi.testclient import TestClient
from synthetic_data_flywheel.a2a_agent import app
client = TestClient(app)
print(client.get("/a2a/capabilities").json())
# {'agent_name': 'synthetic_data_flywheel', 'version': '0.1.0',
# 'capabilities': [{'name': 'generate_synthetic_data', ...},
# {'name': 'get_status', ...},
# {'name': 'generate_report', ...}]}
r = client.post("/a2a/tasks/send", json={
"capability": "get_status",
"inputs": [],
"parameters": {},
})
print(r.json())
# {'task_id': '...', 'status': {'state': 'completed'},
# 'result': {'type': 'status_result',
# 'content': {'checkpoints_found': 1,
# 'checkpoint_dir': 'data/checkpoints'}}}
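Kicking off a generation run through the same surface looks similar. The parameter names below (seeds, max_cycles) are assumptions inferred from the CLI flags rather than a documented schema; the /a2a/capabilities response is the authoritative source.
r = client.post("/a2a/tasks/send", json={
    "capability": "generate_synthetic_data",
    "inputs": [],
    "parameters": {"seeds": ["benefits of green tea"], "max_cycles": 1},  # hypothetical parameter names
})
print(r.json()["status"])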
Configuration
All settings are read from environment variables or a .env file:
OPENROUTER_API_KEY=sk-or-...
OPENROUTER_MODEL=qwen/qwen3-8b:free
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=gemma4:latest
DEFAULT_JUDGE_BACKEND=ollama # ollama | openrouter | anthropic
JUDGE_CONCURRENCY=4
JUDGE_TIMEOUT=600
QUALITY_MIN_SCORE=7.0
MAX_CYCLES=10
PII_POLICY=warn # strict | warn | off
A2A_HOST=0.0.0.0
A2A_PORT=8080
JUDGE_TIMEOUT defaults to 600 seconds; large local models can take over two minutes on the first call.
Limitations
Fine-tuning requires a GPU: Trainer.prepare_training_artifacts writes a Colab-ready Unsloth notebook under notebooks/training_cycle_NNN.ipynb. Running the training step locally on CPU is not supported by Unsloth.
Autonomous generation requires OpenRouter: flywheel run requires OPENROUTER_API_KEY. The in-loop judge is hardcoded to Ollama (engine.create_judge constructs a sync QualityJudge over OllamaClient); if Ollama isn't available, pairs are persisted but every judgment falls back to passed=false.
Large local judges are slow to cold-start: Gemma 4 (9 GB) takes about 130 seconds the first time it loads into VRAM/RAM. The default JUDGE_TIMEOUT is 600 seconds to cover this.
HuggingFace ingest requires the datasets library: it is already a dependency, but gated datasets additionally require HUGGINGFACE_TOKEN.
Anthropic judge backend requires ANTHROPIC_API_KEY: no offline fallback.
How I Built This Using NEO
This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.
The problem was defined at a high level: a closed-loop pipeline that generates synthetic instruction-tuning pairs, filters them with a calibrated LLM judge, and feeds failure cases back as seeds for the next cycle. NEO generated the full implementation: the FlywheelEngine cycle loop with checkpointing, the AsyncQualityJudge with three pluggable backends and a disk-backed cache, the deterministic Validator with six check types, the LabelStore with append-only storage, the statistical calibration layer (cohens_kappa, pearson, prf), the safe-eval export filter, the declarative YAML pipeline runner, the Matplotlib visualisation suite, and the A2A FastAPI agent surface. 100 tests pass.
How You Can Build Further With NEO
Additional judge backends: the three existing backends share a common interface via get_backend. Any OpenAI-compatible endpoint can be wired in as a new backend, and the judge cache, calibration, and compare logic all work with it immediately without any changes.
Additional generation templates: the generator ships with four templates: QA, INSTRUCTION, REASONING, and CREATIVE. New domain-specific templates (code generation, structured extraction, tool use) would let the flywheel produce specialised training data while the cycle loop, judge, and export pipeline stay entirely unchanged.
Additional validation checks: the Validator already supports six check types plugged into the same --checks flag and report format. New checks for domain-specific quality signals would run in the same validation pass and appear in the same JSON report and visualisation output.
Multi-judge ensembling: compare already computes agreement metrics across judgment runs. Taking the average or majority vote across two or more judge scores before the pass/fail decision would reduce the noise that small local models introduce, without touching the labeling, calibration, or export logic downstream.
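A minimal sketch of what that majority vote could look like, assuming each judgment run is a JSONL file whose records carry a pair id and a passed flag (the field names pair_id and passed are assumptions, not the project's exact judgment schema):
import json
from collections import defaultdict
from pathlib import Path

def majority_vote(judgment_files: list[str]) -> dict[str, bool]:
    # Collect each judge's pass/fail vote per pair, then require a strict majority.
    votes: defaultdict[str, list[bool]] = defaultdict(list)
    for path in judgment_files:
        for line in Path(path).read_text().splitlines():
            record = json.loads(line)
            votes[record["pair_id"]].append(bool(record["passed"]))
    return {pid: sum(v) > len(v) / 2 for pid, v in votes.items()}

ensemble = majority_vote([
    "data/judgments/demo.judge_a.jsonl",
    "data/judgments/demo.judge_b.jsonl",
])
print(sum(ensemble.values()), "of", len(ensemble), "pairs pass the ensemble")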
Final Notes
Synthetic Data Flywheel closes the loop that most synthetic data pipelines leave open. It generates, validates, judges, calibrates, exports, and feeds what failed back into the next cycle. The result is a data pipeline that improves with each run rather than producing a static batch.
The code is at https://github.com/dakshjain-1616/synthetic-data-flywheel
You can also build with NEO in your IDE using the VS Code extension or Cursor.