DEV Community

Gaurav Vij
A CLI tool to score fine-tuning dataset quality before training starts


One of the most frustrating outcomes in machine learning is spending time and GPU budget on a fine-tuning run, only to discover later that the real issue was the dataset.

A few missing fields, inconsistent structure, duplicated samples, weak coverage, or noisy records can quietly drag down results. And by the time you notice, you have already paid for the experiment.

To make that easier to catch upfront, we built the Fine-tune Dataset Quality Scorer using NEO, the first autonomous AI engineering agent.

It is a CLI tool that analyzes fine-tuning datasets before training begins and returns an actionable quality score in seconds.

What it does

Instead of waiting for model behavior to reveal data problems, the tool scans your JSONL dataset ahead of time and surfaces issues with exact row references and concrete recommendations.

It runs 11 automated checks across four layers:

  • data integrity
  • content coverage
  • LLM-based review
  • cross-dataset safety

It also auto-detects the dataset schema, so it can adapt to formats like:

  • Alpaca
  • ChatML
  • Prompt/Completion
  • ShareGPT
  • Generic JSONL
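To illustrate how schema auto-detection could work, here is a minimal sketch based on the common field conventions for each format. The field names and fallback logic are assumptions for illustration; the tool's actual heuristics may differ.

```python
import json

# Hypothetical sketch of schema auto-detection. The key sets below follow
# the usual conventions for each format; the real tool may check more.
def detect_schema(record: dict) -> str:
    keys = set(record.keys())
    if {"instruction", "output"} <= keys:
        return "alpaca"              # Alpaca: instruction / input / output
    if "messages" in keys:
        return "chatml"              # ChatML: list of {role, content} turns
    if {"prompt", "completion"} <= keys:
        return "prompt_completion"
    if "conversations" in keys:
        return "sharegpt"            # ShareGPT: list of {from, value} turns
    return "generic"

line = '{"prompt": "What is 2+2?", "completion": "4"}'
print(detect_schema(json.loads(line)))  # prompt_completion
```

In practice you would run this on the first few records of the JSONL file and flag the dataset if they disagree.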

How scoring works

Each check contributes to a weighted final score from 0 to 100.

That score maps to four grades:

  • READY: 92–100
  • CAUTION: 80–91
  • NEEDS WORK: 60–79
  • NOT READY: below 60
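The grade bands above map directly to a threshold check, sketched here for clarity:

```python
# Minimal sketch of the score-to-grade mapping described above.
def grade(score: float) -> str:
    if score >= 92:
        return "READY"
    if score >= 80:
        return "CAUTION"
    if score >= 60:
        return "NEEDS WORK"
    return "NOT READY"

print(grade(88.8))  # CAUTION
```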

The weights are configurable through YAML, so teams can tune the scoring logic to match their own standards.
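As a sketch of what "weighted final score" means here: each check layer produces a 0–100 score, and the configured weights fold them into one number. The layer names and weight values below are illustrative (only the 15% LLM-review weight is stated in the post), and in the real tool they would be loaded from the YAML config rather than hard-coded.

```python
# Illustrative weights — in the tool these come from a YAML file.
weights = {
    "data_integrity": 0.40,
    "content_coverage": 0.30,
    "llm_review": 0.15,
    "cross_dataset_safety": 0.15,
}
# Hypothetical per-layer scores, each on a 0-100 scale.
check_scores = {
    "data_integrity": 95.0,
    "content_coverage": 82.0,
    "llm_review": 88.0,
    "cross_dataset_safety": 90.0,
}
# Weighted sum gives the final 0-100 quality score.
final = sum(weights[k] * check_scores[k] for k in weights)
print(round(final, 1))  # 89.3
```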

Domain-specific analysis

One part I especially like is that it does not stop at generic validation.

The tool can also detect the dataset domain automatically, such as:

  • coding
  • QA
  • translation
  • summarization
  • conversation

Then it runs coverage analysis that is specific to that type of dataset.

For example, a coding dataset can be checked for things like task-type balance and error-handling coverage, instead of receiving only generic warnings.
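The post does not describe how domain detection is implemented, but a simple version could be a keyword heuristic over sampled records, something like the sketch below. Everything here (signal words, fallback to "conversation") is an assumption for illustration only.

```python
# Illustrative only: a naive keyword heuristic for domain detection.
# The real tool's detection logic is not documented in the post.
def detect_domain(texts: list[str]) -> str:
    signals = {
        "coding": ("def ", "return", "import ", "function"),
        "translation": ("translate", "translation"),
        "summarization": ("summarize", "summary", "tl;dr"),
    }
    counts = {
        domain: sum(t.lower().count(kw) for t in texts for kw in kws)
        for domain, kws in signals.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "conversation"

print(detect_domain(["def add(a, b):\n    return a + b", "import os"]))  # coding
```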

Optional LLM-based review

There is also an llm-review mode.

This samples records and asks a Claude model to evaluate them on clarity, quality, and coherence. That score can be folded into the overall result with a 15% weight. If no API key is present, it skips this step automatically.
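The sample-then-review-then-skip flow could look roughly like this. The function name, parameters, and sampling details are illustrative; only the behavior (sample records, score them via Claude, skip without an API key) comes from the post.

```python
import random

# Hedged sketch of the optional llm-review step. The actual tool's
# interface and prompt are not documented here.
def maybe_llm_review(records, api_key=None, sample_size=5, seed=0):
    if not api_key:
        return None  # no API key: the step is skipped automatically
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    # ...send `sample` to a Claude model, score each record on clarity,
    # quality, and coherence, and return an averaged 0-100 score...
    raise NotImplementedError("LLM call elided in this sketch")

print(maybe_llm_review([{"prompt": "hi", "completion": "hello"}]))  # None
```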

Example output

As a concrete example, we generated an HTML report for a Hacker News comments dataset.

It scored 88.8 / 100, which landed in CAUTION. Most checks passed, but the report flagged missing values as the main issue, with completeness at 85.6%. That is a good example of the kind of problem that often slips through until much later in the pipeline.

Why we built it

This project was also a useful demonstration of what we are building with NEO.

Rather than using AI only for snippets or one-off code suggestions, we wanted to show that an autonomous agent can build something practical end-to-end: a real tool, with structured logic, useful outputs, and production relevance.

The result is not just a demo. It is something teams could actually plug into their workflow or CI pipeline to catch dataset issues before training starts.

Repo

https://github.com/dakshjain-1616/Fine-tune-Dataset-Quality-Scorer

I think dataset quality is still one of the most under-appreciated bottlenecks in fine-tuning workflows.

A lot of “model quality” problems are really data quality problems in disguise.
