Introduction
When training language models for mathematical reasoning, one of the key challenges is getting the model not just to produce correct answers, but to show its work in a structured, verifiable way. This is where GRPO (Group Relative Policy Optimization) comes in.
DeepFabric's GRPO formatter transforms your datasets into the precise format needed for GRPO training pipelines, wrapping reasoning traces and solutions in configurable tags that enable reward-based optimization. In this post, we'll dive deep into how to use it effectively.
What is GRPO?
GRPO is a reinforcement learning technique that optimizes language models using reward signals computed over groups of sampled completions. For mathematical reasoning tasks, this typically means:
- Structured reasoning: The model's thought process is wrapped in specific tags
- Extractable solutions: Final answers are clearly delineated for verification
- Reward computation: Numerical answers can be automatically validated against ground truth
The format enables training systems to parse the model's output, extract the answer, compute rewards based on correctness, and update the model accordingly.
The GRPO Format Structure
A properly formatted GRPO sample looks like this:
{
  "messages": [
    {
      "role": "system",
      "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION> and </SOLUTION>."
    },
    {
      "role": "user",
      "content": "What is 15% of 240?"
    },
    {
      "role": "assistant",
      "content": "<start_working_out>To find 15% of 240, I need to multiply 240 by 0.15. 240 × 0.15 = 36<end_working_out><SOLUTION>36</SOLUTION>"
    }
  ]
}
Notice the clear separation:
- Reasoning section: wrapped in <start_working_out>...<end_working_out>
- Solution section: wrapped in <SOLUTION>...</SOLUTION>
This structure allows the training pipeline to:
- Extract the numerical answer (36)
- Compare it against ground truth
- Compute reward signals
- Update the model, propagating the reward back through the reasoning tokens (a minimal sketch of this extraction-and-reward step follows below)
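To make that flow concrete, here is a minimal sketch of what the extraction-and-reward step might look like. Everything in it (the function name, the regex, the 0/1 reward scheme) is a hypothetical stand-in for whatever your training pipeline actually uses.
import re

# Hypothetical reward function, shown only to illustrate the flow.
# Real GRPO pipelines define their own extraction and scoring rules.
SOLUTION_RE = re.compile(r"<SOLUTION>\s*([+-]?[\d.,]+)\s*</SOLUTION>", re.DOTALL)

def compute_reward(completion: str, ground_truth: str) -> float:
    match = SOLUTION_RE.search(completion)
    if match is None:
        return 0.0  # no extractable answer, no reward
    try:
        predicted = float(match.group(1).rstrip(".,").replace(",", ""))
        expected = float(ground_truth)
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0

reward = compute_reward(
    "<start_working_out>240 × 0.15 = 36<end_working_out><SOLUTION>36</SOLUTION>",
    "36",
)  # -> 1.0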
Using DeepFabric's GRPO Formatter
Basic Usage
The GRPO formatter is built into DeepFabric and supports multiple input formats out of the box:
from deepfabric.formatters.builtin.grpo import GrpoFormatter
# Initialize with default configuration
formatter = GrpoFormatter()
# Your raw dataset sample
sample = {
    "question": "If a train travels 120 km in 2 hours, what is its average speed?",
    "final_answer": "60",
    "chain_of_thought": "Speed = Distance / Time. Speed = 120 km / 2 hours = 60 km/h"
}
# Format for GRPO training
formatted = formatter.format_dataset([sample])
Configuration Options
The formatter is highly configurable to match your training pipeline's requirements:
config = {
    "reasoning_start_tag": "<think>",
    "reasoning_end_tag": "</think>",
    "solution_start_tag": "<answer>",
    "solution_end_tag": "</answer>",
    "system_prompt": "Solve the following problem step by step.",
    "validate_numerical": True  # Enforce numerical answer extraction
}
formatter = GrpoFormatter(config=config)
This flexibility means you can adapt to different GRPO implementations (like Qwen, DeepSeek, or custom pipelines) that might use different tag conventions.
Supported Input Formats
One of the formatter's strengths is its ability to handle diverse dataset structures. Let's explore each:
1. Question-Answer Format
The simplest format - just questions and answers:
sample = {
    "question": "What is 25²?",
    "final_answer": "625"
}
The formatter automatically generates a basic reasoning wrapper if no reasoning trace is provided.
2. Chain-of-Thought Format
Includes explicit reasoning steps:
sample = {
    "question": "Solve: 3x + 5 = 20",
    "chain_of_thought": "Subtract 5 from both sides: 3x = 15. Divide both sides by 3: x = 5.",
    "final_answer": "5"
}
3. Structured Chain-of-Thought
The most detailed format with message structure and reasoning traces:
sample = {
    "messages": [
        {"role": "user", "content": "Calculate 15! / 13!"}
    ],
    "reasoning_trace": [
        {"thought": "Using factorial properties, 15! / 13! = 15 × 14 × 13! / 13!"},
        {"thought": "The 13! cancels out, leaving 15 × 14"},
        {"action": "Calculate: 15 × 14 = 210"}
    ],
    "final_answer": "210"
}
4. Conversation Format
Already has messages but needs GRPO formatting:
sample = {
    "messages": [
        {"role": "user", "content": "What is 2³ + 3²?"},
        {"role": "assistant", "content": "Let me calculate: 2³ = 8 and 3² = 9. Therefore 8 + 9 = 17"}
    ]
}
The formatter intelligently extracts reasoning and answer, then wraps them in GRPO tags.
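The exact extraction logic lives inside the formatter, but conceptually it comes down to a heuristic like the one below: treat the assistant text as the reasoning and pull the last number out of it as the answer. This is an illustrative sketch, not DeepFabric's actual implementation.
import re

def split_reasoning_and_answer(assistant_text: str) -> tuple[str, str | None]:
    """Illustrative heuristic: use the whole reply as reasoning and the
    last number that appears in it as the extractable answer."""
    numbers = re.findall(r"[+-]?\d+(?:\.\d+)?", assistant_text)
    answer = numbers[-1] if numbers else None
    return assistant_text, answer

reasoning, answer = split_reasoning_and_answer(
    "Let me calculate: 2³ = 8 and 3² = 9. Therefore 8 + 9 = 17"
)
# answer == "17"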
5. Generic Format
For datasets with non-standard field names:
sample = {
    "problem": "Find the area of a circle with radius 5",
    "solution": "78.54",
    "reasoning": "Area = πr². With r=5: Area = π × 5² = π × 25 ≈ 78.54"
}
The formatter searches for common field-name patterns, such as problem, prompt, and input for questions, and solution, output, and response for answers.
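A rough sketch of that kind of field resolution is shown below. The candidate lists and helper are hypothetical; DeepFabric's actual lists may differ.
QUESTION_FIELDS = ("question", "problem", "prompt", "input")
ANSWER_FIELDS = ("final_answer", "solution", "output", "response")

def first_present(sample: dict, candidates: tuple[str, ...]) -> str | None:
    # Return the value of the first candidate field that exists and is non-empty
    for name in candidates:
        if sample.get(name):
            return str(sample[name])
    return None

question = first_present(sample, QUESTION_FIELDS)  # "Find the area of a circle with radius 5"
answer = first_present(sample, ANSWER_FIELDS)      # "78.54"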
Validation and Quality Control
The formatter includes robust validation at two levels:
Input Validation
Before formatting, each sample is validated:
# This returns False for invalid samples
is_valid = formatter.validate(sample)
Validation checks:
- Required fields are present
- Data types are correct
- Format can be detected and handled
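In practice this makes it easy to pre-filter a dataset before formatting. Assuming raw_samples is your list of unformatted samples:
valid_samples = [s for s in raw_samples if formatter.validate(s)]
print(f"{len(valid_samples)}/{len(raw_samples)} samples passed input validation")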
Output Validation
After formatting, samples are validated against GRPO requirements:
formatted_sample = formatter._format_single_sample(sample)
is_grpo_compliant = formatter.validate_output(formatted_sample)
This ensures:
- All required roles (system, user, assistant) are present
- GRPO formatting tags are correctly applied
- Numerical answers are extractable (if validation enabled)
Numerical Answer Extraction
When validate_numerical: True
, the formatter uses regex patterns to ensure answers can be extracted:
import re

# The formatter compiles these patterns
format_regex = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL
)
number_regex = re.compile(
    r"<SOLUTION>.*?\s*([+-]?[\d\.,]+)",
    flags=re.MULTILINE | re.DOTALL
)
This ensures the training pipeline can reliably extract answers for reward computation.
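Applied to the earlier example, those patterns recover the number like so (a small usage sketch):
completion = (
    "<start_working_out>240 × 0.15 = 36<end_working_out>"
    "<SOLUTION>36</SOLUTION>"
)

match = number_regex.search(completion)
if match:
    answer = float(match.group(1).replace(",", ""))  # 36.0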
Real-World Example: Formatting a Math Dataset
Let's walk through a complete example using a mathematical reasoning dataset:
from deepfabric.formatters.builtin.grpo import GrpoFormatter
import json
# Configuration matching your training pipeline
config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>",
    "validate_numerical": True
}

formatter = GrpoFormatter(config=config)

# Your raw dataset
raw_samples = [
    {
        "question": "A rectangle has length 8cm and width 5cm. What is its perimeter?",
        "chain_of_thought": "Perimeter of rectangle = 2(length + width). P = 2(8 + 5) = 2(13) = 26cm",
        "final_answer": "26"
    },
    {
        "question": "Simplify: (x² - 9) / (x - 3)",
        "chain_of_thought": "Factor numerator: (x+3)(x-3) / (x-3). Cancel (x-3): x + 3",
        "final_answer": "x + 3"  # symbolic answers like this may be filtered when validate_numerical is True
    }
]

# Format the entire dataset
formatted_dataset = formatter.format_dataset(raw_samples)

# Save for training
with open("grpo_training_data.jsonl", "w") as f:
    for sample in formatted_dataset:
        f.write(json.dumps(sample) + "\n")
print(f"Formatted {len(formatted_dataset)} samples for GRPO training")
The output JSONL file is ready to feed into your GRPO training pipeline.
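If your training stack builds on Hugging Face datasets (as many GRPO trainers do, for example TRL's GRPOTrainer), loading the file back takes one call. This assumes the datasets package is installed:
from datasets import load_dataset

train_ds = load_dataset("json", data_files="grpo_training_data.jsonl", split="train")
print(train_ds[0]["messages"][0]["role"])  # expect "system" for the format shown earlier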
Integration with DeepFabric Dataset Generation
DeepFabric can generate synthetic datasets and format them for GRPO in one pipeline:
# config.yaml
dataset_system_prompt: |
  Generate mathematical reasoning problems suitable for GRPO training.
  Include step-by-step reasoning and numerical answers.

topic_tree:
  args:
    model: "gpt-4"
    depth: 2
    branching_factor: 3
    root_prompt: "Mathematical reasoning: algebra, geometry, arithmetic"
  save_as: "topics.jsonl"

data_engine:
  args:
    model: "gpt-4"
    samples_per_topic: 10
    temperature: 0.7

dataset:
  creation:
    formatter: "grpo"
    formatter_config:
      reasoning_start_tag: "<start_working_out>"
      reasoning_end_tag: "<end_working_out>"
      solution_start_tag: "<SOLUTION>"
      solution_end_tag: "</SOLUTION>"
      validate_numerical: true
  save_as: "grpo_dataset.jsonl"
Run the pipeline:
deepfabric start config.yaml
This generates a complete GRPO-formatted dataset from scratch.
Best Practices
1. Match Your Training Pipeline
Different GRPO implementations use different tags. Always configure the formatter to match your training code:
# For Qwen-style GRPO
qwen_config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>"
}

# For custom pipeline
custom_config = {
    "reasoning_start_tag": "[REASONING]",
    "reasoning_end_tag": "[/REASONING]",
    "solution_start_tag": "[ANSWER]",
    "solution_end_tag": "[/ANSWER]"
}
2. Enable Validation for Math Tasks
For mathematical reasoning, always enable numerical validation:
config = {"validate_numerical": True}
This ensures your reward function can extract answers reliably.
3. Provide Quality Reasoning Traces
The better your input reasoning, the better your GRPO training:
# Good: Detailed step-by-step
sample = {
    "question": "What is 15% of 80?",
    "chain_of_thought": "Convert percentage to decimal: 15% = 0.15. Multiply: 80 × 0.15 = 12",
    "final_answer": "12"
}

# Less ideal: Minimal reasoning
sample = {
    "question": "What is 15% of 80?",
    "final_answer": "12"
}
4. Customize System Prompts
The system prompt guides model behavior during training:
config = {
    "system_prompt": """Solve mathematical problems by:
1. Breaking down the problem
2. Showing all calculation steps
3. Providing the final numerical answer"""
}
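If you override both the tags and the system prompt, make sure the prompt mentions your custom tags, just as the default prompt references the default ones. A combined configuration might look like this:
config = {
    "reasoning_start_tag": "<think>",
    "reasoning_end_tag": "</think>",
    "solution_start_tag": "<answer>",
    "solution_end_tag": "</answer>",
    "system_prompt": (
        "Solve the problem step by step. Put your reasoning between "
        "<think> and </think>, then give the final answer between "
        "<answer> and </answer>."
    ),
}

formatter = GrpoFormatter(config=config)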
5. Validate Your Output
Always check a few formatted samples before training:
formatted = formatter.format_dataset(raw_samples)
# Inspect first sample
print(json.dumps(formatted[0], indent=2))
# Validate all samples
valid_count = sum(1 for s in formatted if formatter.validate_output(s))
print(f"{valid_count}/{len(formatted)} samples are GRPO-compliant")
Troubleshooting Common Issues
Issue: Samples Being Filtered Out
Problem: Some samples don't appear in the formatted output.
Solution: Check validation errors:
for sample in raw_samples:
    if not formatter.validate(sample):
        print(f"Invalid sample: {sample}")
Common causes:
- Missing required fields
- Empty answers
- Incompatible format
Issue: Answer Extraction Fails
Problem: validate_numerical
rejects valid samples.
Solution: Check answer format:
# The regex expects numbers in the solution tags
# This works:
"<SOLUTION>42</SOLUTION>"
# This might fail:
"<SOLUTION>The answer is forty-two</SOLUTION>"
For non-numerical answers, disable validation:
config = {"validate_numerical": False}
Issue: Reasoning Not Preserved
Problem: Original reasoning is lost during formatting.
Solution: Ensure reasoning is in a recognized field:
Recognized fields for reasoning:
- chain_of_thought
- reasoning
- reasoning_trace
Not recognized:
- explanation
- steps
- working
Advanced: Custom Format Handling
If you have a unique dataset structure, you can extend the formatter:
from deepfabric.formatters.builtin.grpo import GrpoFormatter
class CustomGrpoFormatter(GrpoFormatter):
    def _format_custom_format(self, sample: dict) -> dict:
        """Handle custom dataset structure."""
        question = sample["problem_statement"]
        steps = sample["solution_steps"]  # List of steps
        answer = sample["correct_answer"]

        # Combine steps into reasoning
        reasoning = " ".join(steps)

        assistant_content = (
            f"{self.reasoning_start_tag}{reasoning}{self.reasoning_end_tag}"
            f"{self.solution_start_tag}{answer}{self.solution_end_tag}"
        )

        return {
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
                {"role": "assistant", "content": assistant_content},
            ]
        }

    def _format_single_sample(self, sample: dict) -> dict | None:
        # Try custom format first
        if "problem_statement" in sample and "solution_steps" in sample:
            return self._format_custom_format(sample)
        # Fall back to standard formats
        return super()._format_single_sample(sample)
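The subclass drops in anywhere the stock formatter would. For example, where custom_samples is whatever list holds your non-standard records:
custom_formatter = CustomGrpoFormatter(config=config)
formatted = custom_formatter.format_dataset(custom_samples)
print(f"Formatted {len(formatted)} custom samples")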
Conclusion
DeepFabric's GRPO formatter provides a robust, flexible way to prepare datasets for reward-based optimization training. Key takeaways:
- Multiple format support: Works with Q&A, CoT, conversations, and generic formats
- Configurable tags: Adapt to any GRPO training pipeline
- Built-in validation: Ensures quality and compliance
- Numerical extraction: Enables reliable reward computation
- Integration ready: Works seamlessly with DeepFabric's generation pipeline
Whether you're formatting existing datasets or generating new synthetic data, the GRPO formatter handles the complex transformations needed for successful mathematical reasoning training.
Happy training!