DEV Community

Mohammad Heydari

Designing a Synthetic Data Pipeline for Persian LLM Fine Tuning: From Topic Graphs to QLoRA Evaluation

Introduction: Why This Project Matters

Training instruction-following LLMs is no longer just about scaling models. It is about scaling data quality.
In high-resource languages like English, datasets such as Alpaca and OpenAssistant already exist. However, in low-resource languages like Persian, high-quality instruction datasets are extremely limited.

Most available Persian corpora suffer from:

• lack of instruction structure
• Arabic language contamination
• low diversity
• poor alignment quality

As a result, even strong base models fail to:

• follow instructions consistently
• generate fluent Persian
• maintain coherent structure

The core bottleneck is not model capacity but the scarcity of high-quality data.

This project addresses that problem through a full synthetic data generation and fine tuning pipeline.

System Overview: End to End Pipeline
The system is designed as a modular data engine:

> Topic Tree > LLM Generation > Deduplication > Quality Scoring > Dataset Export > QLoRA Fine Tuning > Evaluation

Each component is independent, allowing scalability and reproducibility.

Core Design Philosophy: Controlled Diversity
Instead of free form generation, a structured topic tree is used with:

• 51 domains
• approximately 350 subtopics

This ensures balanced coverage and prevents mode collapse.
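
As a rough illustration of how a topic tree can drive generation, the sketch below walks a two-level tree and emits one job per planned call. The tree contents, the `TOPIC_TREE` name, and the `iter_generation_jobs` helper are hypothetical placeholders, not the project's actual code; the default of two calls per subtopic mirrors the setting listed under Generation Configuration.

```python
# Hypothetical two-level topic tree: domain -> list of subtopics.
# The real tree has 51 domains and roughly 350 subtopics; these are placeholders.
TOPIC_TREE = {
    "Education": ["Entrance Exams", "Study Skills"],
    "Health": ["Stress Management", "Nutrition"],
}


def iter_generation_jobs(tree, calls_per_subtopic=2):
    """Yield one (domain, subtopic, call_index) job per planned generation call."""
    for domain, subtopics in tree.items():
        for subtopic in subtopics:
            for call_index in range(calls_per_subtopic):
                yield domain, subtopic, call_index


jobs = list(iter_generation_jobs(TOPIC_TREE))
print(len(jobs))  # 4 subtopics x 2 calls = 8 jobs
```

Because every subtopic receives the same number of calls, coverage stays even across domains regardless of which topics the generator would favor on its own.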

Multi-layer Filtering
Raw synthetic data is inherently noisy. The system applies multiple filtering stages:

• semantic deduplication
• LLM based quality scoring

This transforms raw outputs into curated training data.

Model-Agnostic Design
The pipeline supports multiple models across stages:

• GPT 4.1 mini and GPT 4.1 nano for generation
• a second LLM for evaluation
• Qwen2.5 3B Instruct for fine-tuning

This makes the system reusable across languages and domains.

Data Generation Engine
Prompting Strategy

Each generation call produces structured instruction data:

{
"instruction": "How can I prepare for university entrance exams?",
"input": "",
"output": "To prepare for entrance exams, you should...",
"topic": "Education",
"subtopic": "Entrance Exams"
}
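
Model output does not always follow the requested schema, so a validation step before storage is useful. The following is a minimal sketch, assuming each generated pair arrives as a JSON string; `REQUIRED_FIELDS` and `parse_record` are illustrative names rather than part of the original pipeline.

```python
import json

# Field names taken from the record format shown above.
REQUIRED_FIELDS = ("instruction", "input", "output", "topic", "subtopic")


def parse_record(raw):
    """Return the parsed record as a dict, or None if it is malformed."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Every field must be present and must be a string.
    if any(not isinstance(record.get(field), str) for field in REQUIRED_FIELDS):
        return None
    # "input" may be empty, but instruction and output must carry text.
    if not record["instruction"].strip() or not record["output"].strip():
        return None
    return record
```

Rejecting malformed records at this point keeps the later deduplication and scoring stages from having to handle missing fields.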

Generation Configuration

Key parameters include:

• pairs per call: 3
• calls per subtopic: 2
• max tokens: 1500
• delay between calls: 0.3 seconds

These parameters balance cost, diversity, and stability.
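
To see what these settings imply, the sketch below computes the raw output of one generation pass over the topic tree. The `GenerationConfig` class and helper names are hypothetical, and the calculation assumes a single model visiting all of the roughly 350 subtopics; the article does not state how calls are split between the two generation models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    # Values taken from the parameter list above.
    pairs_per_call: int = 3
    calls_per_subtopic: int = 2
    max_tokens: int = 1500
    delay_seconds: float = 0.3


def raw_pair_budget(cfg, num_subtopics):
    """Upper bound on pairs produced by one pass, before any filtering."""
    return cfg.pairs_per_call * cfg.calls_per_subtopic * num_subtopics


def total_delay_seconds(cfg, num_subtopics):
    """Time spent purely on the fixed delay between calls."""
    return cfg.delay_seconds * (cfg.calls_per_subtopic * num_subtopics)


cfg = GenerationConfig()
print(raw_pair_budget(cfg, 350))      # 2100
print(total_delay_seconds(cfg, 350))  # roughly 210 seconds of enforced waiting
```

Under these assumptions one pass yields at most 2,100 raw pairs, so the fixed delay adds only a few minutes of wall-clock time.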

Multi model generation

Using multiple models reduces bias and increases diversity:

• GPT 4.1 mini provides structured reasoning
• GPT 4.1 nano increases variation and reduces cost

Deduplication Layer: Semantic Filtering

Synthetic datasets often contain semantically similar entries.
Example:

• “How to reduce stress?”
• “Methods for anxiety control”

Although different in wording, both represent the same intent.
To address this, embedding based similarity is used:

if similarity(instruction_a, instruction_b) > 0.75:
    remove instruction_b as a duplicate

This step preserves semantic diversity and prevents overfitting on repetitive patterns.
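
One simple way to apply the threshold rule is a greedy pass that keeps a record only if it is not too similar to anything already kept. The sketch below assumes embeddings have already been computed by some sentence-embedding model (the article does not name one) and uses plain cosine similarity; the function names and toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(records, embeddings, threshold=0.75):
    """Greedy dedup: keep a record only if no kept record exceeds the threshold."""
    kept_records, kept_vectors = [], []
    for record, vector in zip(records, embeddings):
        if all(cosine_similarity(vector, kept) <= threshold for kept in kept_vectors):
            kept_records.append(record)
            kept_vectors.append(vector)
    return kept_records


# Toy 2-D vectors standing in for real embeddings.
records = ["How to reduce stress?", "Methods for anxiety control", "How to bake bread?"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
print(deduplicate(records, embeddings))
# ['How to reduce stress?', 'How to bake bread?']
```

The pairwise comparison is quadratic in the number of records, which is manageable for a few thousand samples; larger datasets would likely need an approximate nearest-neighbor index instead.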

Quality Scoring: LLM as a Judge

After deduplication, data is evaluated using a second LLM.
Each sample is scored based on:

Fluency
Naturalness and grammatical correctness of language

Relevance
Whether the response correctly addresses the instruction

Completeness
Whether the answer is sufficiently detailed and useful

Only samples with an average score above 3.5 out of 5 are retained.
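
The retention rule can be expressed as a small filter. This is a minimal sketch, assuming the judge returns one numeric score per criterion on a 1 to 5 scale and that each sample stores them under a `scores` key; that data layout and the helper names are assumptions, not the project's actual format.

```python
from statistics import mean

CRITERIA = ("fluency", "relevance", "completeness")


def passes_quality(scores, threshold=3.5):
    """True if the average of the three criterion scores is strictly above the threshold."""
    return mean(scores[criterion] for criterion in CRITERIA) > threshold


def filter_by_quality(samples, threshold=3.5):
    """Keep only the samples whose judge scores pass the threshold."""
    return [s for s in samples if passes_quality(s["scores"], threshold)]
```

Note that the comparison is strict, so a sample averaging exactly 3.5 is dropped, matching the "above 3.5" wording.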

Dataset Outcome
The final dataset contains:

• approximately 4,000 instruction pairs
• 51 domains
• around 350 subtopics

However, the key value is not size but structured diversity and filtering quality.

Fine-Tuning Phase: QLoRA on Qwen2.5 3B

Setup:

• Base model: Qwen2.5 3B Instruct
• Method: QLoRA
• Framework: Unsloth
• Hardware: Google Colab T4
• Training: 3 epochs, 714 steps

Why QLoRA

QLoRA enables efficient fine tuning by training low rank adapters instead of full model weights. This reduces memory usage while maintaining strong performance.
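
The memory saving comes from simple arithmetic: a rank-r adapter on a weight matrix of shape d_out x d_in adds r * (d_out + d_in) trainable parameters, while the frozen base weights stay quantized. The sketch below works this out for one matrix; the dimensions and rank are illustrative values, not the configuration used in this project.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the full weight matrix the adapter sits beside."""
    return d_out * d_in


# Illustrative sizes only; not taken from the project's training config.
d_out, d_in, rank = 2048, 2048, 16
adapter = lora_trainable_params(d_out, d_in, rank)
full = full_params(d_out, d_in)
print(adapter, full, adapter / full)  # 65536 4194304 0.015625
```

For this example the adapter trains under 2% of the parameters in the matrix it modifies, which is what makes training feasible on a single T4.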

Training Behavior

The training loss converges steadily with no visible instability, which is consistent with:

• high dataset consistency
• low noise after filtering
• stable learning dynamics

Evaluation

Key Observations: Base vs. Fine-Tuned Model

The base model exhibits:

• occasional language switching to Arabic
• incomplete or repetitive responses
• weak instruction adherence

The fine tuned model shows:

• fluent and consistent Persian output
• structured reasoning
• improved instruction following behavior
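
Language switching of this kind can be roughly quantified at the character level. The sketch below is a crude heuristic, not the evaluation method used in this project: it counts letters that are typical of Arabic spelling but rare in standard Persian orthography. The character sets, the `max_ratio` value of 0.02, and the function names are all assumptions chosen for illustration, and the check cannot catch Arabic words written entirely with letters the two scripts share.

```python
# Letters typical of Arabic text and rare in standard Persian orthography:
# Arabic kaf, Arabic yeh, teh marbuta, alef maksura.
ARABIC_ONLY = set("\u0643\u064a\u0629\u0649")
# Letters specific to Persian: pe, che, zhe, gaf, keheh, Farsi yeh.
PERSIAN_ONLY = set("\u067e\u0686\u0698\u06af\u06a9\u06cc")


def script_counts(text):
    """Count Arabic-typical and Persian-specific letters in the text."""
    arabic = sum(1 for ch in text if ch in ARABIC_ONLY)
    persian = sum(1 for ch in text if ch in PERSIAN_ONLY)
    return {"arabic_only": arabic, "persian_only": persian}


def looks_arabic_contaminated(text, max_ratio=0.02):
    """Flag text whose share of Arabic-typical letters exceeds max_ratio."""
    letters = sum(1 for ch in text if ch.isalpha())
    if letters == 0:
        return False
    return script_counts(text)["arabic_only"] / letters > max_ratio
```

Running such a check over base-model and fine-tuned outputs for the same prompts would give a simple number to accompany the qualitative observations above.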

Key Insight

The improvement is not driven by model scaling but by data engineering. This highlights a central principle in modern LLM systems: data quality is often more important than model size.

Key Technical Insights

Insight 1: Data quality is the primary bottleneck
Even a small dataset (4,000 samples) can significantly improve performance when properly curated.

Insight 2: Dual filtering is essential
Both semantic deduplication and LLM based scoring are required to maintain dataset quality.

Insight 3: Structured topic graphs outperform free-form prompting
Controlled topic distribution leads to better coverage and diversity.

Insight 4: LLM as a judge is a core system component
Automated evaluation is necessary for scalable dataset construction.

What This Project Demonstrates

This system is not just a dataset generator. It is a complete synthetic data engine for low resource LLM alignment, consisting of:

• structured generation
• semantic filtering
• quality evaluation
• fine tuning integration
• performance benchmarking

Future Work

Potential improvements include:

• scaling dataset size beyond 50,000 samples
• integrating preference optimization (DPO)
• adding multilingual support
• incorporating human feedback loops (RLHF style training)

Conclusion

This project demonstrates a shift in LLM development:
performance improvements are increasingly driven by data systems rather than model scaling. By combining structured generation, filtering, and lightweight fine-tuning, significant improvements can be achieved even in low-resource language settings.

Links:
GitHub Repository
Dataset on Hugging Face