DEV Community

Gaurav Vij
A CLI tool to score fine-tuning dataset quality before training starts


One of the most frustrating outcomes in machine learning is spending time and GPU budget on a fine-tuning run, only to discover later that the real issue was the dataset.

A few missing fields, inconsistent structure, duplicated samples, weak coverage, or noisy records can quietly drag down results. And by the time you notice, you have already paid for the experiment.

To make that easier to catch upfront, we built the Fine-tune Dataset Quality Scorer using NEO, the first autonomous AI engineering agent.

It is a CLI tool that analyzes fine-tuning datasets before training begins and returns an actionable quality score in seconds.

What it does

Instead of waiting for model behavior to reveal data problems, the tool scans your JSONL dataset ahead of time and surfaces issues with exact row references and concrete recommendations.

It runs 11 automated checks across four layers:

  • data integrity
  • content coverage
  • LLM-based review
  • cross-dataset safety

It also auto-detects the dataset schema, so it can adapt to formats like:

  • Alpaca
  • ChatML
  • Prompt/Completion
  • ShareGPT
  • Generic JSONL
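To illustrate how schema auto-detection could work, here is a minimal sketch based on the common field conventions for each format. The field names and fallback logic are assumptions for illustration; the tool's actual heuristics may differ.

```python
import json

# Hypothetical sketch of schema auto-detection. The key sets below follow
# the usual conventions for each format; the real tool may check more.
def detect_schema(record: dict) -> str:
    keys = set(record.keys())
    if {"instruction", "output"} <= keys:
        return "alpaca"              # Alpaca: instruction / input / output
    if "messages" in keys:
        return "chatml"              # ChatML: list of {role, content} turns
    if {"prompt", "completion"} <= keys:
        return "prompt_completion"
    if "conversations" in keys:
        return "sharegpt"            # ShareGPT: list of {from, value} turns
    return "generic"

line = '{"prompt": "What is 2+2?", "completion": "4"}'
print(detect_schema(json.loads(line)))  # prompt_completion
```

In practice you would run this on the first few records of the JSONL file and flag the dataset if they disagree.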

How scoring works

Each check contributes to a weighted final score from 0 to 100.

That score maps to four grades:

  • READY: 92–100
  • CAUTION: 80–91
  • NEEDS WORK: 60–79
  • NOT READY: below 60
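The grade bands above map directly to a threshold check, sketched here for clarity:

```python
# Minimal sketch of the score-to-grade mapping described above.
def grade(score: float) -> str:
    if score >= 92:
        return "READY"
    if score >= 80:
        return "CAUTION"
    if score >= 60:
        return "NEEDS WORK"
    return "NOT READY"

print(grade(88.8))  # CAUTION
```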

The weights are configurable through YAML, so teams can tune the scoring logic to match their own standards.
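As a sketch of what "weighted final score" means here: each check layer produces a 0–100 score, and the configured weights fold them into one number. The layer names and weight values below are illustrative (only the 15% LLM-review weight is stated in the post), and in the real tool they would be loaded from the YAML config rather than hard-coded.

```python
# Illustrative weights — in the tool these come from a YAML file.
weights = {
    "data_integrity": 0.40,
    "content_coverage": 0.30,
    "llm_review": 0.15,
    "cross_dataset_safety": 0.15,
}
# Hypothetical per-layer scores, each on a 0-100 scale.
check_scores = {
    "data_integrity": 95.0,
    "content_coverage": 82.0,
    "llm_review": 88.0,
    "cross_dataset_safety": 90.0,
}
# Weighted sum gives the final 0-100 quality score.
final = sum(weights[k] * check_scores[k] for k in weights)
print(round(final, 1))  # 89.3
```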

Domain-specific analysis

One part I especially like is that it does not stop at generic validation.

The tool can also detect the dataset domain automatically, such as:

  • coding
  • QA
  • translation
  • summarization
  • conversation

Then it runs coverage analysis that is specific to that type of dataset.

For example, a coding dataset can be checked for things like task-type balance and error-handling coverage, instead of receiving only generic warnings.
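The post does not describe how domain detection is implemented, but a simple version could be a keyword heuristic over sampled records, something like the sketch below. Everything here (signal words, fallback to "conversation") is an assumption for illustration only.

```python
# Illustrative only: a naive keyword heuristic for domain detection.
# The real tool's detection logic is not documented in the post.
def detect_domain(texts: list[str]) -> str:
    signals = {
        "coding": ("def ", "return", "import ", "function"),
        "translation": ("translate", "translation"),
        "summarization": ("summarize", "summary", "tl;dr"),
    }
    counts = {
        domain: sum(t.lower().count(kw) for t in texts for kw in kws)
        for domain, kws in signals.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "conversation"

print(detect_domain(["def add(a, b):\n    return a + b", "import os"]))  # coding
```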

Optional LLM-based review

There is also an llm-review mode.

This samples records and asks a Claude model to evaluate them on clarity, quality, and coherence. That score can be folded into the overall result with a 15% weight. If no API key is present, it skips this step automatically.
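The sample-then-review-then-skip flow could look roughly like this. The function name, parameters, and sampling details are illustrative; only the behavior (sample records, score them via Claude, skip without an API key) comes from the post.

```python
import random

# Hedged sketch of the optional llm-review step. The actual tool's
# interface and prompt are not documented here.
def maybe_llm_review(records, api_key=None, sample_size=5, seed=0):
    if not api_key:
        return None  # no API key: the step is skipped automatically
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    # ...send `sample` to a Claude model, score each record on clarity,
    # quality, and coherence, and return an averaged 0-100 score...
    raise NotImplementedError("LLM call elided in this sketch")

print(maybe_llm_review([{"prompt": "hi", "completion": "hello"}]))  # None
```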

Example output

As a concrete example, we generated an HTML report for a Hacker News comments dataset.

It scored 88.8 / 100, which landed in CAUTION. Most checks passed, but the report flagged missing values as the main issue, with completeness at 85.6%. That is a good example of the kind of problem that often slips through until much later in the pipeline.

Why we built it

This project was also a useful demonstration of what we are building with NEO.

Rather than using AI only for snippets or one-off code suggestions, we wanted to show that an autonomous agent can build something practical end-to-end: a real tool, with structured logic, useful outputs, and production relevance.

The result is not just a demo. It is something teams could actually plug into their workflow or CI pipeline to catch dataset issues before training starts.

Repo

https://github.com/dakshjain-1616/Fine-tune-Dataset-Quality-Scorer

I think dataset quality is still one of the most under-appreciated bottlenecks in fine-tuning workflows.

A lot of “model quality” problems are really data quality problems in disguise.
