You've decided to fine-tune a language model. You generate a few hundred samples, load them into Axolotl or LLaMA-Factory, kick off training, and wait.
The model comes out... worse. Or weirdly repetitive. Or it answers every question the same way regardless of context.
The problem usually isn't your training setup. It's the data.
The Uncomfortable Truth About Synthetic Datasets
Generating synthetic SFT data is easy. Generating good synthetic SFT data is not.
Most pipelines do something like this:
- Write a prompt: "Generate 500 customer support examples"
- Call an LLM
- Save the output
- Fine-tune
The problem? There's no mechanism to catch what goes wrong. And a lot goes wrong.
What a Bad Sample Looks Like
Here's a real example of what gets generated without any quality filter:
Instruction: Respond to a customer asking about their refund status.
Output: "Thank you for reaching out. Your refund is being processed. Please wait and we will get back to you soon."
Looks fine at first glance. But think about what your model learns from this:
- When someone asks about their refund, say something vague
- Don't give a timeline
- Don't offer any next steps
- Just acknowledge and deflect
Fine-tune on enough samples like this and your model becomes an expert at saying nothing helpfully.
Now here's what a good sample looks like:
Instruction: Respond to a customer asking about their refund status.
Output: "Your refund was approved on the 14th and has been submitted to your bank. Most banks process these within 3–5 business days, so you should see it by the 19th. If it hasn't appeared by then, reply here and I'll escalate immediately."
Same instruction. Completely different training signal.
The 4 Things That Actually Matter
When I evaluate an SFT sample, I look at four dimensions:
1. Relevance
Does the response directly address the instruction? An off-topic or partially relevant answer teaches your model to drift. Even a grammatically perfect response scores zero if it doesn't answer what was asked.
2. Factual Consistency
Are the claims in the response plausible and internally consistent? Hallucinated order numbers, impossible timelines, contradictory policies - these all slip through if you're not checking. Your model will learn to hallucinate the same way.
3. Format Quality
Is the response correctly structured for the schema you're using? A broken JSON field or a response that ignores the output format contaminates your training data at the structural level.
4. Response Usefulness
Would this response actually help someone? This is the hardest one to catch automatically - a response can be relevant, factually consistent, and correctly formatted while still being completely useless. Vague acknowledgements without concrete next steps fail here.
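One way to make the four dimensions concrete is to store them as a small record per sample. The sketch below is illustrative only: the 1-5 scale, the field names, and the rule that the weakest dimension decides are all assumptions, not a fixed scheme.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SampleScores:
    """Hypothetical per-sample scores, each on an assumed 1-5 scale."""
    relevance: int
    factual_consistency: int
    format_quality: int
    usefulness: int

    def lowest(self) -> int:
        # The weakest dimension decides, so perfect formatting
        # cannot rescue a vague answer.
        return min(getattr(self, f.name) for f in fields(self))

    def passes(self, minimum: int = 4) -> bool:
        return self.lowest() >= minimum


# The vague refund reply: relevant and well-formed, but not useful.
vague = SampleScores(relevance=5, factual_consistency=4, format_quality=5, usefulness=2)
# The specific refund reply: concrete date, timeline, and next step.
specific = SampleScores(relevance=5, factual_consistency=4, format_quality=5, usefulness=5)

print(vague.passes(), specific.passes())
```

Using the minimum rather than the average is a deliberate choice here: an average lets three strong dimensions hide one failing one, which is exactly the "relevant but useless" case.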
Why You Can't Just Prompt Your Way Out of This
A common fix people try: write a better generation prompt. Add instructions like "be specific", "include timelines", "don't be vague".
It helps. But it doesn't solve the problem.
Generation quality tends to drift across long runs. The first 50 samples might follow your instructions carefully. By sample 300, the model is often taking shortcuts, repeating patterns, and producing outputs that technically match the format but miss the intent.
You need a separate evaluation pass - not more instructions in the generation prompt.
How a Judge Stage Works
The idea is simple: use a second, stronger model to score every sample your generation model produces.
The judge model doesn't generate. It evaluates. It reads each instruction-output pair and scores it on the four dimensions above, using a calibrated rubric with fixed anchor examples so the scoring stays consistent across the entire dataset.
Samples below a quality threshold get cut. You deliberately generate more than you need so the filtering doesn't leave you short.
The result: every sample that reaches your training loop has been independently evaluated, not just generated.
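A minimal sketch of that filtering loop, assuming the judge is any callable that returns a score per dimension. The `stub_judge` below is a hypothetical stand-in for a real judge-model call (it only checks for digits as a crude proxy for concrete dates and timelines), and the threshold of 4.0 on a 1-5 scale is an assumed value.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Scores = Dict[str, float]

DIMENSIONS = ("relevance", "factual_consistency", "format_quality", "usefulness")


def filter_with_judge(
    samples: List[Sample],
    judge: Callable[[Sample], Scores],
    threshold: float = 4.0,
    target: Optional[int] = None,
) -> List[Sample]:
    """Keep samples where every dimension meets the threshold.

    `target` trims the result to the dataset size you actually need,
    which is why you generate more samples than that up front.
    """
    kept = []
    for sample in samples:
        scores = judge(sample)
        if min(scores[d] for d in DIMENSIONS) >= threshold:
            kept.append(sample)
    return kept if target is None else kept[:target]


def stub_judge(sample: Sample) -> Scores:
    # Placeholder only: a real judge would call a stronger model with a rubric.
    has_specifics = any(ch.isdigit() for ch in sample["output"])
    return {
        "relevance": 5.0,
        "factual_consistency": 5.0,
        "format_quality": 5.0,
        "usefulness": 5.0 if has_specifics else 1.0,
    }


samples = [
    {"instruction": "Respond to a customer asking about their refund status.",
     "output": "Your refund is being processed. Please wait and we will get back to you soon."},
    {"instruction": "Respond to a customer asking about their refund status.",
     "output": "Your refund was approved on the 14th. Most banks process these within 3-5 business days."},
]

kept = filter_with_judge(samples, stub_judge)
print(len(kept))
```

Keeping the judge behind a plain callable means the filter logic can be tested with a stub like this before any model calls are wired in.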
The Full Pipeline
In practice, the judge stage is one part of a larger quality process. Here's what a complete pipeline looks like:
Stage 1 - Generation
Domain-aware generation with a domain-specific system prompt. Rolling context injection prevents semantic drift across batches.
Stage 2 - Validation & Deduplication
Schema validation rejects malformed rows. Token length filtering removes samples outside the training-safe range. Deduplication (MinHash or embedding-based semantic similarity) removes near-duplicate samples that inflate dataset size without adding diversity.
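The three checks in this stage can be sketched in plain Python. Several things here are assumptions for illustration: the Alpaca-style field names, the length bounds, whitespace word counts standing in for real tokenizer counts, and exact Jaccard similarity over word shingles standing in for MinHash (which approximates the same measure and scales better).

```python
import re
from typing import Dict, List, Set, Tuple

REQUIRED_FIELDS = ("instruction", "output")  # assumed Alpaca-style keys


def is_valid(row: Dict) -> bool:
    """Schema check: required fields exist and are non-empty strings."""
    return all(isinstance(row.get(k), str) and row[k].strip() for k in REQUIRED_FIELDS)


def within_length(row: Dict, min_tokens: int = 5, max_tokens: int = 512) -> bool:
    # Whitespace word count is a rough proxy for a tokenizer count.
    n = len(row["output"].split())
    return min_tokens <= n <= max_tokens


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping k-word windows, lowercased and punctuation-free."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedupe(rows: List[Dict], max_similarity: float = 0.8) -> List[Dict]:
    """Drop rows too similar to an already-kept row. O(n^2), fine for small sets."""
    kept, kept_shingles = [], []
    for row in rows:
        s = shingles(row["output"])
        if all(jaccard(s, other) < max_similarity for other in kept_shingles):
            kept.append(row)
            kept_shingles.append(s)
    return kept


def clean(rows: List[Dict]) -> List[Dict]:
    valid = [r for r in rows if is_valid(r)]
    sized = [r for r in valid if within_length(r)]
    return dedupe(sized)


rows = [
    {"instruction": "Refund status?",
     "output": "Your refund was approved on the 14th and has been submitted to your bank."},
    {"instruction": "Refund status?",
     "output": "Your refund was approved on the 14th and has been submitted to your bank!"},
    {"instruction": "Refund status?", "output": "Please wait."},
    {"instruction": "Refund status?"},
    {"instruction": "Reset password?",
     "output": "Open Settings, choose Security, then select Reset password and follow the emailed link."},
]

cleaned = clean(rows)
print(len(cleaned))
```

Running the cheap checks first keeps malformed or out-of-range rows away from the pairwise comparison, and later away from the judge.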
Stage 3 - LLM-as-Judge Scoring
Every sample scored on the four dimensions above. Only samples above the threshold proceed.
Stage 4 - Human Review
Outputs reviewed for quality patterns before the final split. If a recurring issue is found, the threshold is adjusted and the affected stage re-runs.
Stage 5 - Split & Export
Shuffled 90/10 train/validation split. Output as production-ready JSONL in Alpaca or ShareGPT format.
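A small sketch of the split and export step, assuming rows are dicts with `instruction` and `output` keys and an optional `input`. The seed value and function names are illustrative choices.

```python
import json
import random
from typing import Dict, List, Tuple


def split_dataset(
    rows: List[Dict], val_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle a copy of the rows with a fixed seed, then split train/validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def to_alpaca_jsonl(rows: List[Dict]) -> str:
    """Serialize rows as JSONL with the Alpaca keys: instruction, input, output."""
    lines = []
    for r in rows:
        record = {
            "instruction": r["instruction"],
            "input": r.get("input", ""),
            "output": r["output"],
        }
        # ensure_ascii=False keeps non-English text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


rows = [{"instruction": f"Question {i}", "output": f"Answer {i}"} for i in range(500)]
train, val = split_dataset(rows)
train_jsonl = to_alpaca_jsonl(train)
print(len(train), len(val))
```

Writing `train_jsonl` and its validation counterpart to disk with `encoding="utf-8"` gives files that the usual fine-tuning tools can read as Alpaca-format datasets. A fixed seed makes the split reproducible if a stage has to be re-run.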
What This Looks Like in Practice
On a recent 500-sample run, here's what the filter funnel looked like:
- Generated: 600 samples
- After schema validation: 584 (-16 malformed rows)
- After token length filter: 572 (-12 too short / too long)
- After deduplication: 569 (-3 near-duplicates)
- After LLM judge: 500 (-69 below quality threshold)
The judge stage removed 69 samples: 11.5% of the 600 generated, or about 12% of the 569 it actually scored. Deduplication removed less than 1%. Schema errors were caught early.
The 69 samples removed by the judge are the ones that would have quietly degraded your model.
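The funnel arithmetic is easy to reproduce for your own runs. The sketch below uses the counts from the run above; the function name and report layout are illustrative.

```python
from typing import List, Tuple

# (stage name, samples remaining after the stage), from the run above
funnel = [
    ("generated", 600),
    ("schema validation", 584),
    ("token length filter", 572),
    ("deduplication", 569),
    ("LLM judge", 500),
]


def stage_losses(funnel: List[Tuple[str, int]]) -> List[Tuple[str, int, float, float]]:
    """For each stage: samples removed, share of its input, share of all generated."""
    total = funnel[0][1]
    report = []
    for (_, before), (name, after) in zip(funnel, funnel[1:]):
        removed = before - after
        report.append((name, removed, removed / before, removed / total))
    return report


losses = stage_losses(funnel)
for name, removed, of_input, of_total in losses:
    print(f"{name}: -{removed} ({of_input:.1%} of stage input, {of_total:.1%} of generated)")

# How much to over-generate next time to land on the same final count.
overgeneration_factor = funnel[0][1] / funnel[-1][1]
print(f"over-generation factor: {overgeneration_factor:.2f}")
```

Tracking the loss per stage over several runs shows where to over-generate and which stage to investigate when yield drops.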
The Takeaway
A good SFT sample isn't just one that looks correct - it's one that teaches your model the right behavior.
Vague responses teach vague behavior. Hallucinated details teach hallucination. Near-duplicate samples teach repetition.
If you're building a domain-specific LLM, the quality of your training data matters more than almost any other variable. A better base model won't save you from bad data.
What's Next
I'm currently building a public Vietnamese legal Q&A dataset using this pipeline - one of the few domains with almost no public SFT data available. I'll share it on HuggingFace when it's ready.
If you're working on a fine-tuning project and need a validated dataset for your domain, I build these as a service on Fiverr. Link in the comments.
If this was useful, follow for more posts on LLM fine-tuning, dataset preparation, and practical ML engineering.
