Mateus Andrade
Structured Reasoning Distillation in SQL via CoT and Reinforcement Learning with GRPO

(Cover image generated with DALL-E 3)

This study presents an efficient, low-cost methodology for generating SQL with explicit Chain of Thought (CoT) reasoning, using supervised distillation from the Llama 3.1 70B model followed by fine-tuning with Group Relative Policy Optimization (GRPO), a modern reinforcement learning technique. By combining strategies inspired by DeepSeekMath with practical techniques such as QLoRA and GRPO, we show that a model with only 3B parameters can be fine-tuned to produce coherent, explainable, and executable SQL responses for under ~$17.

1. Introduction

The task of generating SQL queries from natural language requires structured symbolic reasoning. Recent works like DeepSeekMath have shown that Chain of Thought (CoT) can guide large models to generate interpretable solutions. However, the cost of using these models remains high.

This work aims to demonstrate that it's possible to transfer this reasoning to smaller models through:

  • Supervised distillation using the distilabel framework
  • Fine-tuning with Low-Rank Adaptation (QLoRA 4bit)
  • Reinforcement learning with GRPO, a PPO-based technique that replaces the learned value function with group-relative reward baselines for more stable policy updates ([arXiv][1])

2. Methodology

2.1 Distillation via distilabel

We used the distilabel library to generate CoT reasoning with the meta-llama/Llama-3.1-70B model, applying the <think>…</think><sql>…</sql> format to the gretelai/synthetic_text_to_sql dataset. The distillation ran on Hugging Face Inference Endpoints (4x L40S, 192 GB VRAM), totaling:

  • Time: 1h45min
  • Cost: ~$14.52
  • Examples generated: 5,000
  • Final dataset: proton98/sql-distill-llama-3-1-70b-instruct-reasoning
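
Below is a minimal sketch of how such a distillation pipeline can be wired up. It assumes a distilabel 1.x-style API and serverless Hugging Face Inference Endpoints; the system prompt, generation parameters, and the sql_prompt column name are illustrative assumptions rather than the exact configuration used here.

```python
# Sketch only: assumes a distilabel 1.x-style API; prompts, columns, and
# generation parameters are illustrative assumptions.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

SYSTEM_PROMPT = (
    "You are a SQL expert. Reason step by step inside <think>...</think>, "
    "then output only the final query inside <sql>...</sql>."
)

with Pipeline(name="sql-cot-distillation") as pipeline:
    # Load natural-language questions from the source dataset
    # ('sql_prompt' is an assumption about the dataset schema).
    load_data = LoadDataFromHub(
        repo_id="gretelai/synthetic_text_to_sql",
        split="train",
        output_mappings={"sql_prompt": "instruction"},
    )

    # Generate <think>...</think><sql>...</sql> traces with the 70B teacher.
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Llama-3.1-70B-Instruct",
            generation_kwargs={"max_new_tokens": 1024, "temperature": 0.7},
        ),
        system_prompt=SYSTEM_PROMPT,
    )

    load_data >> generate

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("proton98/sql-distill-llama-3-1-70b-instruct-reasoning")
```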

2.2 Training with Unsloth + GRPO

We used the unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit model with 4-bit QLoRA for fine-tuning. Training was done on RunPod using a single RTX 4090 GPU (24 GB VRAM). A sketch of this setup is shown below.
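
The sequence length and LoRA hyperparameters in this sketch (rank 16, as mentioned in the conclusion) are assumptions beyond what is stated above.

```python
# Sketch: loading the 3B student in 4-bit and attaching LoRA adapters with Unsloth.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit",
    max_seq_length=2048,   # assumption: long enough for <think> + <sql> traces
    load_in_4bit=True,     # QLoRA: 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank (the conclusion mentions rank 16)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```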

Steps:

  1. Supervised Fine-Tuning (SFT):
    Supervised fine-tuning on the CoT dataset, following DeepSeek's pipeline. Result: a coherent model with a final loss of ~0.38.

  2. GRPO Reinforcement Learning:
    Implemented following the GRPO paper, a PPO variant that computes advantages relative to a group of sampled responses instead of a learned value function. For each prompt, the model generates multiple responses that are scored by three reward functions (sketched after this list):

  • match_format_exactly: Checks adherence to the exact <think> + <sql> format.
  • match_format_approximately: Scores for correct structure, even if partial.
  • check_sql_reward: Validates the generated SQL by actually executing it in a test environment. ([Blog da Trybe][4])
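
A minimal sketch of what these three reward functions can look like is shown below, written against TRL's reward-function interface (a list of plain-text completions in, a list of scores out). The scoring values, regexes, and the in-memory SQLite stand-in for the execution environment are all assumptions.

```python
# Sketch: illustrative reward functions; scoring values and regexes are assumptions.
import re
import sqlite3

THINK_SQL_RE = re.compile(r"^<think>.*?</think>\s*<sql>.*?</sql>\s*$", re.DOTALL)

def match_format_exactly(completions, **kwargs):
    """Full reward only when the response is exactly <think>...</think><sql>...</sql>."""
    return [3.0 if THINK_SQL_RE.match(c) else 0.0 for c in completions]

def match_format_approximately(completions, **kwargs):
    """Partial credit for each required tag that appears exactly once."""
    scores = []
    for c in completions:
        score = 0.0
        for tag in ("<think>", "</think>", "<sql>", "</sql>"):
            score += 0.5 if c.count(tag) == 1 else -0.5
        scores.append(score)
    return scores

def check_sql_reward(completions, **kwargs):
    """Reward queries that actually execute; here a toy in-memory database stands in
    for the real evaluation environment built from the dataset's schema."""
    scores = []
    for c in completions:
        match = re.search(r"<sql>(.*?)</sql>", c, re.DOTALL)
        if not match:
            scores.append(-1.0)
            continue
        conn = sqlite3.connect(":memory:")
        try:
            conn.execute(match.group(1))
            scores.append(2.0)
        except sqlite3.Error:
            scores.append(-1.0)
        finally:
            conn.close()
    return scores
```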

The smaller model is optimized with group-normalized advantages, while a KL-divergence penalty keeps its outputs close to the reference behavior distilled from the larger model.
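
Putting the pieces together, the RL stage can be sketched with TRL's GRPOTrainer (which Unsloth patches for 4-bit training). The hyperparameters below (group size, KL coefficient, learning rate, batch size) are illustrative assumptions, not the values from the original run.

```python
# Sketch: GRPO fine-tuning with TRL's GRPOTrainer; hyperparameters are assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumption: the distilled dataset exposes a "prompt" column, as GRPOTrainer expects.
dataset = load_dataset("proton98/sql-distill-llama-3-1-70b-instruct-reasoning", split="train")

training_args = GRPOConfig(
    output_dir="sql-llama3.2-3b-it-reasoning",
    learning_rate=5e-6,
    num_generations=8,            # group size: responses sampled per prompt
    max_completion_length=1024,
    beta=0.04,                    # KL penalty toward the reference (SFT) policy
    max_steps=300,                # matches the 300 RL steps reported below
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model=model,                  # the QLoRA model prepared in section 2.2
    processing_class=tokenizer,
    reward_funcs=[match_format_exactly, match_format_approximately, check_sql_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Per GRPO, each reward is converted into a group-normalized advantage, roughly A_i = (r_i - mean(r_group)) / std(r_group), so every response is compared only against its siblings generated for the same prompt.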

3. Results

After 1 epoch of RL (300 steps), we observed:

  • Consistent generation of CoT with valid SQL
  • Structured reasoning present even in new prompts
  • Smaller model (3B) capable of competing with outputs from models 10x larger ([arXiv][1])

We also included incorrect SQL examples during SFT so the model learns to identify and handle errors, improving its robustness.
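
For reference, querying the fine-tuned model and checking the output structure can be sketched as follows; the full hub path (with the proton98 namespace), the prompt, and the chat-template details are assumptions based on the links at the end of the post.

```python
# Sketch: generating from the fine-tuned student; hub path and prompt are assumptions.
# If the repo holds only LoRA adapters, peft must be installed for the pipeline to load them.
from transformers import pipeline

generator = pipeline("text-generation", model="proton98/sql-llama3.2-3b-it-reasoning")

messages = [
    {"role": "system", "content": "Reason inside <think>...</think>, then answer inside <sql>...</sql>."},
    {"role": "user", "content": "List the three customers with the highest total order value."},
]

result = generator(messages, max_new_tokens=512)[0]["generated_text"]
answer = result[-1]["content"]  # last message is the assistant's reply
print(answer)  # expected shape: <think>...</think><sql>SELECT ... LIMIT 3;</sql>
```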

4. Conclusion

This project demonstrated that it is possible to train a small, efficient, and functional model for under ~$17. By combining CoT distillation, QLoRA fine-tuning, and GRPO, we achieved solid results in generating interpretable SQL.

Next steps:

  • Expand the dataset to 10k–20k examples
  • Train a larger model (8B or 13B)
  • Include semantic metrics for <think> (e.g., embedding similarity)
  • Increase LoRA rank (currently 16)
  • Remove quantization to evaluate full performance

Repo: SQL-LLM-Distillation-GRPO
Dataset: sql-distill-llama-3-1-70b-instruct-reasoning
Fine-tuned model: sql-llama3.2-3b-it-reasoning
