Introduction: Why This Project Matters
Training instruction-following LLMs is no longer just about scaling models. It is about scaling data quality.
In high-resource languages like English, datasets such as Alpaca and OpenAssistant already exist. However, in low-resource languages like Persian, high-quality instruction datasets are extremely limited.
Most available Persian corpora suffer from:
• lack of instruction structure
• Arabic language contamination
• low diversity
• poor alignment quality
As a result, even strong base models fail to:
• follow instructions consistently
• generate fluent Persian
• maintain coherent structure
The core bottleneck is not model capacity but the scarcity of high-quality instruction data.
This project addresses that problem through a full synthetic data generation and fine-tuning pipeline.
System Overview: End-to-End Pipeline
The system is designed as a modular data engine:
Topic Tree > LLM Generation > Deduplication > Quality Scoring > Dataset Export > QLoRA Fine-Tuning > Evaluation
Each component is independent, allowing scalability and reproducibility.
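One way to picture this independence is to treat every stage as a function from a list of records to a new list of records, so stages can be swapped or rerun separately. The sketch below is an illustration only: `run_pipeline` and the two stand-in stages are hypothetical names, not the project's actual code.

```python
from functools import reduce
from typing import Callable, Iterable, List

Record = dict
Stage = Callable[[List[Record]], List[Record]]


def run_pipeline(records: List[Record], stages: Iterable[Stage]) -> List[Record]:
    """Apply each stage in order; every stage maps a list of records to a new list."""
    return reduce(lambda current, stage: stage(current), stages, records)


# Hypothetical stand-in stages, used only to show the composition.
def drop_empty_outputs(records: List[Record]) -> List[Record]:
    return [r for r in records if r.get("output")]


def tag_exported(records: List[Record]) -> List[Record]:
    return [{**r, "exported": True} for r in records]


result = run_pipeline(
    [{"output": "a"}, {"output": ""}],
    [drop_empty_outputs, tag_exported],
)
```

Because each stage only sees and returns plain records, a stage such as deduplication can be tested or replaced without touching generation or scoring.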
Core Design Philosophy: Controlled Diversity
Instead of free-form generation, the pipeline uses a structured topic tree with:
• 51 domains
• approximately 350 subtopics
This keeps coverage balanced across domains and reduces the risk of mode collapse.
Multi-Layer Filtering
Raw synthetic data is inherently noisy. The system applies multiple filtering stages:
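A topic tree like this can be walked to produce one generation task per planned call, which is what keeps the coverage even. The sketch below assumes the tree is a plain mapping from domain to subtopics. The excerpt is hypothetical apart from "Education" and "Entrance Exams", which appear in the sample record, and `iter_generation_tasks` is an illustrative name.

```python
# Hypothetical excerpt; the real tree has 51 domains and roughly 350 subtopics.
TOPIC_TREE = {
    "Education": ["Entrance Exams", "Study Skills"],
    "Health": ["Stress Management"],
}


def iter_generation_tasks(tree, calls_per_subtopic):
    """Yield one (topic, subtopic, call_index) task per planned generation call."""
    for topic, subtopics in tree.items():
        for subtopic in subtopics:
            for call_index in range(calls_per_subtopic):
                yield topic, subtopic, call_index


tasks = list(iter_generation_tasks(TOPIC_TREE, calls_per_subtopic=2))
```

Every subtopic receives the same number of calls, so no single domain can dominate the raw data simply because the model finds it easier to write about.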
• semantic deduplication
• LLM-based quality scoring
This transforms raw outputs into curated training data.
Model-Agnostic Design
The pipeline supports multiple models across stages:
• GPT 4.1 mini and GPT 4.1 nano for generation
• a second LLM for evaluation
• Qwen2.5 3B Instruct for fine-tuning
This makes the system reusable across languages and domains.
Data Generation Engine
Prompting Strategy
Each generation call produces structured instruction data:
{
"instruction": "How can I prepare for university entrance exams?",
"input": "",
"output": "To prepare for entrance exams, you should...",
"topic": "Education",
"subtopic": "Entrance Exams"
}
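Since model output is not guaranteed to match this shape, a record can be checked before it reaches the later stages. The following is a minimal sketch, assuming each record arrives as one JSON object string; `parse_record` and `REQUIRED_FIELDS` are illustrative names rather than part of the project.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output", "topic", "subtopic")


def parse_record(raw: str) -> dict:
    """Parse one generated JSON object and check it has the expected string fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    # "input" may legitimately be empty; instruction and output may not.
    if not record["instruction"].strip() or not record["output"].strip():
        raise ValueError("instruction and output must be non-empty")
    return record
```

Rejecting malformed records at this point keeps parsing errors out of deduplication and scoring.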
Generation Configuration
Key parameters include:
• pairs per call: 3
• calls per subtopic: 2
• max tokens: 1500
• delay between calls: 0.3 seconds
These parameters balance cost, diversity, and stability.
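The parameters above imply a simple budget. The sketch below holds them in a small config object and derives an upper bound on raw pairs, assuming each generation model makes one pass over the tree with these settings (the source does not say how the calls are split between models). `GenerationConfig`, `raw_pair_budget`, and `min_pacing_seconds` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    pairs_per_call: int = 3
    calls_per_subtopic: int = 2
    max_tokens: int = 1500
    delay_seconds: float = 0.3


def raw_pair_budget(config: GenerationConfig, num_subtopics: int) -> int:
    """Upper bound on pairs one pass over the tree yields before any filtering."""
    return num_subtopics * config.calls_per_subtopic * config.pairs_per_call


def min_pacing_seconds(config: GenerationConfig, num_subtopics: int) -> float:
    """Time spent only on the fixed delay between calls, ignoring API latency."""
    total_calls = num_subtopics * config.calls_per_subtopic
    return max(total_calls - 1, 0) * config.delay_seconds
```

With roughly 350 subtopics, one pass is at most 350 × 2 × 3 = 2,100 pairs, and the fixed delay alone adds about three and a half minutes across 700 calls.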
Multi-Model Generation
Using multiple models reduces bias and increases diversity:
• GPT 4.1 mini provides structured reasoning
• GPT 4.1 nano increases variation and reduces cost
Deduplication Layer: Semantic Filtering
Synthetic datasets often contain semantically similar entries.
Example:
• “How to reduce stress?”
• “Methods for anxiety control”
Although the wording differs, both express nearly the same intent.
To address this, embedding-based similarity is used:
if similarity(instruction_a, instruction_b) > 0.75: remove instruction_b as a duplicate
This step preserves semantic diversity and reduces the risk of overfitting to repetitive patterns.
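The threshold rule can be sketched as a greedy filter. This version assumes cosine similarity and precomputed embedding vectors, neither of which the text specifies. The function names and toy vectors are illustrative; real embeddings would come from whatever embedding model the pipeline uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(records, embeddings, threshold=0.75):
    """Greedy filter: keep a record only if it is not too similar to any kept one."""
    kept_records, kept_vectors = [], []
    for record, vector in zip(records, embeddings):
        if all(cosine_similarity(vector, kv) <= threshold for kv in kept_vectors):
            kept_records.append(record)
            kept_vectors.append(vector)
    return kept_records


# Toy 2-D vectors standing in for real embeddings.
instructions = [
    "How to reduce stress?",
    "Methods for anxiety control",
    "How can I prepare for university entrance exams?",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
unique = deduplicate(instructions, vectors)
```

The pairwise comparison is quadratic in the number of records, which is manageable at a few thousand samples. A much larger dataset would likely need an approximate nearest-neighbor index instead.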
Quality Scoring: LLM as a Judge
After deduplication, data is evaluated using a second LLM.
Each sample is scored based on:
• Fluency: naturalness and grammatical correctness of the language
• Relevance: whether the response correctly addresses the instruction
• Completeness: whether the answer is sufficiently detailed and useful
Only samples with an average score above 3.5 out of 5 are retained.
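The retention rule reduces to a small filter once the judge's scores are available. The sketch below assumes the judge returns one score per criterion on a 1 to 5 scale and that the average is an unweighted mean. The dictionary keys and `passes_quality` are hypothetical names.

```python
from statistics import mean

CRITERIA = ("fluency", "relevance", "completeness")


def passes_quality(scores, threshold=3.5):
    """Return True when the mean of the three judge scores exceeds the threshold."""
    values = [scores[name] for name in CRITERIA]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("scores must be between 1 and 5")
    return mean(values) > threshold


kept = passes_quality({"fluency": 4, "relevance": 4, "completeness": 3})
dropped = passes_quality({"fluency": 4, "relevance": 3, "completeness": 3})
```

The comparison is strict, so a sample averaging exactly 3.5 is dropped, matching the "above 3.5" wording.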
Dataset Outcome
The final dataset contains:
• approximately 4,000 instruction pairs
• 51 domains
• around 350 subtopics
However, the key value is not size but structured diversity and filtering quality.
Fine-Tuning Phase: QLoRA on Qwen2.5 3B
Setup:
• Base model: Qwen2.5 3B Instruct
• Method: QLoRA
• Framework: Unsloth
• Hardware: Google Colab T4
• Training: 3 epochs, 714 steps
Why QLoRA
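QLoRA enables efficient fine-tuning by keeping the base model frozen in 4-bit quantized form and training small low-rank adapters instead of the full model weights. This cuts memory usage enough to train on a single T4 GPU while maintaining strong performance.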
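The saving from low-rank adapters can be made concrete with a parameter count. For one weight matrix of shape d_out × d_in, a rank-r adapter adds r × (d_in + d_out) trainable parameters. The layer size and rank below are hypothetical values chosen for illustration, not the settings used in this project.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters a LoRA adapter adds to one d_out x d_in weight matrix.

    The update is learned as B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the dense weight matrix itself."""
    return d_in * d_out


# Hypothetical layer: a 2048 x 2048 projection with a rank-16 adapter.
adapter = lora_param_count(2048, 2048, 16)  # 65,536
full = full_param_count(2048, 2048)         # 4,194,304
ratio = adapter / full                      # 0.015625, about 1.6%
```

Only the adapter parameters need gradients and optimizer state, which is where most of the memory saving comes from.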
Training Behavior
The training loss converges steadily, with no instability or obvious signs of overfitting, which suggests:
• high dataset consistency
• low noise after filtering
• stable learning dynamics
Evaluation
Key Observations: Base vs. Fine-Tuned Model
The base model exhibits:
• occasional language switching to Arabic
• incomplete or repetitive responses
• weak instruction adherence
The fine-tuned model shows:
• fluent and consistent Persian output
• structured reasoning
• improved instruction-following behavior
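The language-switching observation could be spot-checked with a simple character heuristic. The text does not say how switching was detected, so the sketch below is only one possible approach. It counts a few code points that are typical of Arabic spelling but not of standard Persian spelling (Arabic yeh, Arabic kaf, teh marbuta). `ARABIC_ONLY` and `arabic_char_ratio` are hypothetical names, and the character set is deliberately small.

```python
# Code points typical of Arabic orthography but not of standard Persian spelling:
# Arabic yeh (U+064A), Arabic kaf (U+0643), teh marbuta (U+0629).
ARABIC_ONLY = {"\u064A", "\u0643", "\u0629"}


def arabic_char_ratio(text):
    """Share of Arabic-block characters in text that are Arabic-specific forms."""
    block_chars = [ch for ch in text if "\u0600" <= ch <= "\u06FF"]
    if not block_chars:
        return 0.0
    return sum(ch in ARABIC_ONLY for ch in block_chars) / len(block_chars)


persian_word = "\u06A9\u062A\u0627\u0628"  # "book" spelled with Persian kaf
arabic_word = "\u0643\u062A\u0627\u0628"   # same word spelled with Arabic kaf
persian_ratio = arabic_char_ratio(persian_word)
arabic_ratio = arabic_char_ratio(arabic_word)
```

A heuristic like this only flags orthography. It cannot tell Arabic sentences from Persian text that happens to use Arabic keyboard characters, so it would complement a manual review rather than replace it.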
Key Insight
The improvement is not driven by model scaling but by data engineering. This highlights a central principle in modern LLM systems: data quality is often more important than model size.
Key Technical Insights
Insight 1: Data quality is the primary bottleneck
Even a small dataset (about 4,000 samples) can significantly improve performance when properly curated.
Insight 2: Dual filtering is essential
Both semantic deduplication and LLM-based scoring are required to maintain dataset quality.
Insight 3: Structured topic graphs outperform free-form prompting
Controlled topic distribution leads to better coverage and diversity.
Insight 4: LLM as a judge is a core system component
Automated evaluation is necessary for scalable dataset construction.
What This Project Demonstrates
This system is not just a dataset generator. It is a complete synthetic data engine for low-resource LLM alignment, consisting of:
• structured generation
• semantic filtering
• quality evaluation
• fine-tuning integration
• performance benchmarking
Future Work
Potential improvements include:
• scaling dataset size beyond 50,000 samples
• integrating preference optimization (DPO)
• adding multilingual support
• incorporating human feedback loops (RLHF-style training)
Conclusion
This project demonstrates a shift in LLM development: performance improvements are increasingly driven by data systems rather than model scaling.
By combining structured generation, filtering, and lightweight fine-tuning, significant improvements can be achieved even in low-resource language settings.