<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammad Heydari</title>
    <description>The latest articles on DEV Community by Mohammad Heydari (@mohammadheydari).</description>
    <link>https://dev.to/mohammadheydari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3997373%2Fc51cf885-16c1-4c93-9d65-ef595604e59e.jpg</url>
      <title>DEV Community: Mohammad Heydari</title>
      <link>https://dev.to/mohammadheydari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammadheydari"/>
    <language>en</language>
    <item>
      <title>Designing a Synthetic Data Pipeline for Persian LLM Fine Tuning: From Topic Graphs to QLoRA Evaluation</title>
      <dc:creator>Mohammad Heydari</dc:creator>
      <pubDate>Mon, 22 Jun 2026 17:44:08 +0000</pubDate>
      <link>https://dev.to/mohammadheydari/designing-a-synthetic-data-pipeline-for-persian-llm-fine-tuning-from-topic-graphs-to-qlora-5cg5</link>
      <guid>https://dev.to/mohammadheydari/designing-a-synthetic-data-pipeline-for-persian-llm-fine-tuning-from-topic-graphs-to-qlora-5cg5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction: Why This Project Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Training instruction-following LLMs is no longer just about scaling models. It is about scaling data quality.&lt;br&gt;
In high-resource languages like English, datasets such as Alpaca and OpenAssistant already exist. However, in low-resource languages like Persian, high-quality instruction datasets are extremely limited.&lt;/p&gt;

&lt;p&gt;Most available Persian corpora suffer from:&lt;/p&gt;

&lt;p&gt;• lack of instruction structure &lt;br&gt;
• Arabic language contamination &lt;br&gt;
• low diversity &lt;br&gt;
• poor alignment quality &lt;/p&gt;

&lt;p&gt;As a result, even strong base models fail to:&lt;/p&gt;

&lt;p&gt;• follow instructions consistently &lt;br&gt;
• generate fluent Persian &lt;br&gt;
• maintain coherent structure &lt;/p&gt;

&lt;p&gt;The core bottleneck is not model capacity but data scarcity.&lt;/p&gt;

&lt;p&gt;This project addresses that problem through a full synthetic data generation and fine-tuning pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Overview: End-to-End Pipeline&lt;/strong&gt;&lt;br&gt;
The system is designed as a modular data engine:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Topic Tree &amp;gt; LLM Generation &amp;gt; Deduplication &amp;gt; Quality Scoring &amp;gt; Dataset Export &amp;gt; QLoRA Fine-Tuning &amp;gt; Evaluation&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each component is independent, so stages can be scaled or re-run separately, which helps reproducibility.&lt;/p&gt;
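&lt;p&gt;One way to picture this modularity is to treat every stage as a function from a list of records to a list of records and chain them. The sketch below is only an illustration under that assumption: &lt;code&gt;run_pipeline&lt;/code&gt; and the three stand-in stages are hypothetical names, not code from the repository.&lt;/p&gt;

```python
from functools import reduce


def run_pipeline(seed, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, seed)


# Hypothetical stand-ins for the real stages. Each one takes a list and returns a list.
def generate(topics):
    return [{"instruction": f"Question about {t}", "output": f"Answer about {t}"} for t in topics]


def deduplicate(records):
    # Exact-match placeholder; the real stage compares embeddings.
    seen, unique = set(), []
    for record in records:
        if record["instruction"] not in seen:
            seen.add(record["instruction"])
            unique.append(record)
    return unique


def score(records):
    # Placeholder: attach a fixed score instead of calling a judge model.
    return [dict(record, score=4.0) for record in records]


dataset = run_pipeline(["Education", "Health", "Education"], [generate, deduplicate, score])
print(len(dataset))  # 2
```

&lt;p&gt;Because each stage only depends on the shape of its input, any one of them can be replaced or re-run without touching the others.&lt;/p&gt;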

&lt;p&gt;&lt;strong&gt;Core Design Philosophy: Controlled Diversity&lt;/strong&gt;&lt;br&gt;
Instead of free-form generation, a structured topic tree is used with:&lt;/p&gt;

&lt;p&gt;• 51 domains &lt;br&gt;
• approximately 350 subtopics &lt;/p&gt;

&lt;p&gt;This ensures balanced coverage and prevents mode collapse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Filtering&lt;/strong&gt;&lt;br&gt;
Raw synthetic data is inherently noisy. The system applies multiple filtering stages:&lt;/p&gt;

&lt;p&gt;• semantic deduplication &lt;br&gt;
• LLM-based quality scoring &lt;/p&gt;

&lt;p&gt;This transforms raw outputs into curated training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model-Agnostic Design&lt;/strong&gt;&lt;br&gt;
The pipeline supports multiple models across stages:&lt;/p&gt;

&lt;p&gt;• &lt;code&gt;GPT-4.1 mini&lt;/code&gt; and &lt;code&gt;GPT-4.1 nano&lt;/code&gt; for generation &lt;br&gt;
• a second LLM for evaluation &lt;br&gt;
• &lt;code&gt;Qwen2.5-3B-Instruct&lt;/code&gt; for fine-tuning &lt;/p&gt;

&lt;p&gt;This makes the system reusable across languages and domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Generation Engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompting Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each generation call produces structured instruction data:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "instruction": "How can I prepare for university entrance exams?",&lt;br&gt;
  "input": "",&lt;br&gt;
  "output": "To prepare for entrance exams, you should...",&lt;br&gt;
  "topic": "Education",&lt;br&gt;
  "subtopic": "Entrance Exams"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;
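&lt;p&gt;Model output does not always match the requested schema, so it helps to validate each record before it enters the dataset. The following is a minimal sketch, assuming the five fields shown above are all strings and that only &lt;code&gt;input&lt;/code&gt; may be empty. &lt;code&gt;parse_record&lt;/code&gt; is a hypothetical helper, not a function from the project.&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output", "topic", "subtopic")
# Assumption: "input" may be empty (as in the example record); the other fields may not.
NON_EMPTY_FIELDS = ("instruction", "output", "topic", "subtopic")


def parse_record(raw):
    """Parse one JSON object and return it as a dict, or raise ValueError."""
    record = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in REQUIRED_FIELDS:
        if not isinstance(record[name], str):
            raise ValueError(f"field {name!r} must be a string")
    empty = [name for name in NON_EMPTY_FIELDS if not record[name].strip()]
    if empty:
        raise ValueError(f"empty fields: {empty}")
    return record


sample = json.dumps({
    "instruction": "How can I prepare for university entrance exams?",
    "input": "",
    "output": "To prepare for entrance exams, you should...",
    "topic": "Education",
    "subtopic": "Entrance Exams",
})
print(parse_record(sample)["subtopic"])  # Entrance Exams
```

&lt;p&gt;Records that raise &lt;code&gt;ValueError&lt;/code&gt; can be dropped or sent back for regeneration before deduplication runs.&lt;/p&gt;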

&lt;p&gt;&lt;strong&gt;Generation Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Key parameters include:&lt;/p&gt;

&lt;p&gt;• pairs per call: 3 &lt;br&gt;
• calls per subtopic: 2 &lt;br&gt;
• max tokens: 1500 &lt;br&gt;
• delay between calls: 0.3 seconds &lt;/p&gt;

&lt;p&gt;These parameters balance cost, diversity, and stability.&lt;/p&gt;
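&lt;p&gt;To make the trade-off concrete, the parameters above can be turned into a call plan and a rough budget. This is a sketch under stated assumptions: &lt;code&gt;GenerationConfig&lt;/code&gt;, &lt;code&gt;plan_calls&lt;/code&gt;, and &lt;code&gt;estimate_budget&lt;/code&gt; are hypothetical names, the topic tree is assumed to be a mapping from domain to subtopics, and the estimate covers a single pass over the tree with one model. The post does not say how the two generation models share the tree.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    pairs_per_call: int = 3
    calls_per_subtopic: int = 2
    max_tokens: int = 1500
    delay_seconds: float = 0.3


def plan_calls(topic_tree, config):
    """Yield one (topic, subtopic, call_index) job per planned API call."""
    for topic, subtopics in topic_tree.items():
        for subtopic in subtopics:
            for call_index in range(config.calls_per_subtopic):
                yield topic, subtopic, call_index


def estimate_budget(num_subtopics, config):
    """Return (API calls, raw pairs before filtering, minimum seconds spent in delays)."""
    calls = num_subtopics * config.calls_per_subtopic
    return calls, calls * config.pairs_per_call, calls * config.delay_seconds


config = GenerationConfig()
toy_tree = {"Education": ["Entrance Exams", "Study Skills"], "Health": ["Stress"]}
jobs = list(plan_calls(toy_tree, config))
calls, raw_pairs, wait_seconds = estimate_budget(350, config)
print(len(jobs), calls, raw_pairs)  # 6 700 2100
```

&lt;p&gt;With roughly 350 subtopics, one pass comes to about 700 calls and 2,100 raw pairs before deduplication and scoring, and the 0.3 second delay adds only a few minutes of waiting.&lt;/p&gt;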

&lt;p&gt;&lt;strong&gt;Multi-Model Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using multiple models reduces bias and increases diversity:&lt;/p&gt;

&lt;p&gt;• &lt;code&gt;GPT-4.1 mini&lt;/code&gt; provides structured reasoning &lt;br&gt;
• &lt;code&gt;GPT-4.1 nano&lt;/code&gt; increases variation and reduces cost&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication Layer: Semantic Filtering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Synthetic datasets often contain semantically similar entries.&lt;br&gt;
Example:&lt;/p&gt;

&lt;p&gt;• “How to reduce stress?” &lt;br&gt;
• “Methods for anxiety control” &lt;/p&gt;

&lt;p&gt;Although different in wording, both represent the same intent.&lt;br&gt;
To address this, embedding-based similarity is used:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;if similarity(instruction_a, instruction_b) &amp;gt; 0.75: remove the duplicate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This step preserves semantic diversity and prevents overfitting on repetitive patterns.&lt;/p&gt;
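&lt;p&gt;The threshold rule above can be sketched as a greedy filter over embedding vectors. The post does not name the embedding model or the similarity measure, so this example assumes cosine similarity and uses tiny hand-written vectors in place of real embeddings. &lt;code&gt;cosine_similarity&lt;/code&gt; and &lt;code&gt;deduplicate&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(items, embeddings, threshold=0.75):
    """Greedy filter: keep an item only if it is not too similar to any item already kept."""
    kept_items, kept_vectors = [], []
    for item, vector in zip(items, embeddings):
        if any(cosine_similarity(vector, other) > threshold for other in kept_vectors):
            continue  # near-duplicate of something already kept
        kept_items.append(item)
        kept_vectors.append(vector)
    return kept_items


instructions = ["How to reduce stress?", "Methods for anxiety control", "How to bake bread?"]
# Toy 3-dimensional vectors standing in for real sentence embeddings.
toy_vectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.0], [0.0, 0.1, 0.9]]
print(deduplicate(instructions, toy_vectors))
# ['How to reduce stress?', 'How to bake bread?']
```

&lt;p&gt;This version keeps the first item of each near-duplicate group and compares every candidate against all kept items, which is quadratic in the worst case but manageable for a few thousand records.&lt;/p&gt;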

&lt;p&gt;&lt;strong&gt;Quality Scoring: LLM as a Judge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After deduplication, data is evaluated using a second LLM.&lt;br&gt;
Each sample is scored based on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fluency&lt;/strong&gt;&lt;br&gt;
Naturalness and grammatical correctness of language&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relevance&lt;/strong&gt;&lt;br&gt;
Whether the response correctly addresses the instruction&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Completeness&lt;/strong&gt;&lt;br&gt;
Whether the answer is sufficiently detailed and useful&lt;/p&gt;

&lt;p&gt;Only samples with an average score above &lt;code&gt;3.5&lt;/code&gt; out of &lt;code&gt;5&lt;/code&gt; are retained.&lt;/p&gt;
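&lt;p&gt;The retention rule can be expressed as a small filter. This sketch assumes the judge returns one numeric score per criterion on a 1 to 5 scale and that the average is an unweighted mean. The record layout and the name &lt;code&gt;passes_quality&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from statistics import mean

CRITERIA = ("fluency", "relevance", "completeness")


def passes_quality(scores, threshold=3.5):
    """Return True when the mean of the three criterion scores is above the threshold."""
    return mean(scores[name] for name in CRITERIA) > threshold


judged = [
    {"id": 1, "scores": {"fluency": 5, "relevance": 4, "completeness": 4}},  # mean 4.33
    {"id": 2, "scores": {"fluency": 4, "relevance": 3, "completeness": 3}},  # mean 3.33
    {"id": 3, "scores": {"fluency": 4, "relevance": 4, "completeness": 3}},  # mean 3.67
]
kept = [sample["id"] for sample in judged if passes_quality(sample["scores"])]
print(kept)  # [1, 3]
```

&lt;p&gt;Averaging lets one strong criterion offset a weaker one. A stricter variant could also require a minimum score on each criterion.&lt;/p&gt;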

&lt;p&gt;&lt;strong&gt;Dataset Outcome&lt;/strong&gt;&lt;br&gt;
The final dataset contains:&lt;/p&gt;

&lt;p&gt;• approximately 4,000 instruction pairs &lt;br&gt;
• 51 domains &lt;br&gt;
• around 350 subtopics &lt;/p&gt;

&lt;p&gt;However, the key value is not size but structured diversity and filtering quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-Tuning Phase: QLoRA on Qwen2.5-3B&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Setup:&lt;/p&gt;

&lt;p&gt;• Base model: &lt;code&gt;Qwen2.5-3B-Instruct&lt;/code&gt; &lt;br&gt;
• Method: QLoRA &lt;br&gt;
• Framework: Unsloth &lt;br&gt;
• Hardware: &lt;code&gt;Google Colab T4&lt;/code&gt; &lt;br&gt;
• Training: &lt;code&gt;3 epochs, 714 steps&lt;/code&gt; &lt;/p&gt;
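&lt;p&gt;The step count relates to dataset size and batch size in a simple way. The post reports only the epochs and total steps, so the sample count and effective batch size in the second call below are hypothetical values, chosen to show one combination that reproduces 714 steps. The formula is also a simplification: trainers differ in how they round a partial final batch.&lt;/p&gt;

```python
import math


def total_steps(num_samples, effective_batch_size, epochs):
    """Optimizer steps for a run: ceil(samples / batch) per epoch, times the number of epochs."""
    return math.ceil(num_samples / effective_batch_size) * epochs


# From the reported setup: 714 steps over 3 epochs.
steps_per_epoch = 714 // 3
print(steps_per_epoch)  # 238

# Hypothetical values, not reported in the post.
print(total_steps(num_samples=3800, effective_batch_size=16, epochs=3))  # 714
```

&lt;p&gt;At 238 steps per epoch and roughly 4,000 samples, the effective batch size would be somewhere around 16 to 17 samples per step.&lt;/p&gt;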

&lt;p&gt;&lt;strong&gt;Why QLoRA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;QLoRA enables efficient fine-tuning by keeping the base model frozen in 4-bit quantized form and training small low-rank adapters instead of the full model weights. This reduces memory usage while maintaining strong performance.&lt;/p&gt;
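&lt;p&gt;A quick calculation shows where the savings on trainable parameters come from. For a weight matrix of shape d_out by d_in, a rank-r adapter trains two factors with r * (d_out + d_in) parameters in total, instead of d_out * d_in. The sizes below are illustrative only and are not the layer shapes or rank used in this project.&lt;/p&gt;

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only, not taken from the project configuration.
d_out, d_in, rank = 2048, 2048, 16
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)
print(full, adapter, f"{adapter / full:.1%}")  # 4194304 65536 1.6%
```

&lt;p&gt;The 4-bit quantization of the frozen base weights is a separate source of memory savings on top of this reduction.&lt;/p&gt;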

&lt;p&gt;&lt;strong&gt;Training Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The training loss converges steadily, with no visible instability or sign of overfitting, which is consistent with:&lt;/p&gt;

&lt;p&gt;• high dataset consistency &lt;br&gt;
• low noise after filtering &lt;br&gt;
• stable learning dynamics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Key observations, base vs. fine-tuned model:&lt;/p&gt;

&lt;p&gt;The base model exhibits:&lt;/p&gt;

&lt;p&gt;• occasional language switching to Arabic &lt;br&gt;
• incomplete or repetitive responses &lt;br&gt;
• weak instruction adherence &lt;/p&gt;

&lt;p&gt;The fine-tuned model shows:&lt;/p&gt;

&lt;p&gt;• fluent and consistent Persian output &lt;br&gt;
• structured reasoning &lt;br&gt;
• improved instruction-following behavior &lt;/p&gt;
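&lt;p&gt;Observations like the switch to Arabic can be backed by a rough automatic signal. The sketch below is a crude heuristic and not the evaluation method used in this project: it measures how many letters in a response use code points that are common in Arabic but normally replaced in standard Persian spelling. The character set and the name &lt;code&gt;arabic_specific_ratio&lt;/code&gt; are assumptions, and Persian text typed on an Arabic keyboard would also trigger it.&lt;/p&gt;

```python
# Code points common in Arabic text that standard Persian spelling normally avoids.
ARABIC_SPECIFIC = {
    "\u0643",  # ARABIC LETTER KAF (Persian uses U+06A9 KEHEH)
    "\u064A",  # ARABIC LETTER YEH (Persian uses U+06CC FARSI YEH)
    "\u0649",  # ARABIC LETTER ALEF MAKSURA
    "\u0629",  # ARABIC LETTER TEH MARBUTA
}


def arabic_specific_ratio(text):
    """Share of Arabic-block letters in the text that are Arabic-specific forms."""
    letters = [ch for ch in text if ord(ch) in range(0x0600, 0x0700) and ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in ARABIC_SPECIFIC for ch in letters) / len(letters)


persian_word = "\u06A9\u062A\u0627\u0628"  # "book" spelled with Persian keheh
arabic_word = "\u0643\u062A\u0627\u0628"   # same word spelled with Arabic kaf
print(arabic_specific_ratio(persian_word))  # 0.0
print(arabic_specific_ratio(arabic_word))   # 0.25
```

&lt;p&gt;Comparing the average ratio over base and fine-tuned outputs on the same prompts gives a simple number to track alongside manual review.&lt;/p&gt;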

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The improvement is not driven by model scaling but by data engineering. This highlights a central principle in modern LLM systems: data quality is often more important than model size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Technical Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight 1:&lt;/strong&gt; Data quality is the primary bottleneck&lt;br&gt;
Even a small dataset (about 4,000 samples) can significantly improve performance when properly curated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight 2:&lt;/strong&gt; Dual filtering is essential&lt;br&gt;
Both semantic deduplication and LLM-based scoring are required to maintain dataset quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight 3:&lt;/strong&gt; Structured topic graphs outperform free-form prompting&lt;br&gt;
Controlled topic distribution leads to better coverage and diversity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight 4:&lt;/strong&gt; LLM as a judge is a core system component&lt;br&gt;
Automated evaluation is necessary for scalable dataset construction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Project Demonstrates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This system is not just a dataset generator. It is a complete synthetic data engine for low-resource LLM alignment, consisting of:&lt;/p&gt;

&lt;p&gt;• structured generation &lt;br&gt;
• semantic filtering &lt;br&gt;
• quality evaluation &lt;br&gt;
• fine-tuning integration &lt;br&gt;
• performance benchmarking &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Potential improvements include:&lt;/p&gt;

&lt;p&gt;• scaling dataset size beyond 50,000 samples &lt;br&gt;
• integrating preference optimization (DPO) &lt;br&gt;
• adding multilingual support &lt;br&gt;
• incorporating human feedback loops (RLHF-style training) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project demonstrates a shift in LLM development:&lt;br&gt;
performance improvements are increasingly driven by data systems rather than model scaling. By combining structured generation, filtering, and lightweight fine-tuning, significant improvements can be achieved even in low-resource language settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://github.com/MohammadHeydari/FarsiSyntheticData" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/datasets/Heydaritoday/Persian-Synthetic-Instruct" rel="noopener noreferrer"&gt;Dataset in Huggingface&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>graph</category>
      <category>qlora</category>
    </item>
  </channel>
</rss>
