Fine-Tuning with GRPO Datasets: A Developer's Guide to DeepFabric's GRPO Formatter

Luke Hinds

Introduction

When training language models for mathematical reasoning, one of the key challenges is getting the model not just to produce correct answers, but to show its work in a structured, verifiable way. This is where GRPO (Group Relative Policy Optimization) comes in.

DeepFabric's GRPO formatter transforms your datasets into the precise format needed for GRPO training pipelines, wrapping reasoning traces and solutions in configurable tags that enable reward-based optimization. In this post, we'll dive deep into how to use it effectively.

What is GRPO?

GRPO is a reinforcement learning technique that optimizes a language model using reward signals computed over groups of sampled responses. For mathematical reasoning tasks, this typically means:

  1. Structured reasoning: The model's thought process is wrapped in specific tags
  2. Extractable solutions: Final answers are clearly delineated for verification
  3. Reward computation: Numerical answers can be automatically validated against ground truth

The format enables training systems to parse the model's output, extract the answer, compute rewards based on correctness, and update the model accordingly.

The GRPO Format Structure

A properly formatted GRPO sample looks like this:

{
  "messages": [
    {
      "role": "system",
      "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION> and </SOLUTION>."
    },
    {
      "role": "user",
      "content": "What is 15% of 240?"
    },
    {
      "role": "assistant",
      "content": "<start_working_out>To find 15% of 240, I need to multiply 240 by 0.15. 240 × 0.15 = 36<end_working_out><SOLUTION>36</SOLUTION>"
    }
  ]
}

Notice the clear separation:

  • Reasoning section: Wrapped in <start_working_out>...<end_working_out>
  • Solution section: Wrapped in <SOLUTION>...</SOLUTION>

This structure allows the training pipeline to:

  • Extract the numerical answer (36)
  • Compare it against ground truth
  • Compute reward signals
  • Update the model so that reasoning leading to correct answers is reinforced
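
To make that concrete, here is a minimal sketch of the kind of reward function such a pipeline might use. It is not DeepFabric code; the tag names match the default format above, and extract_solution and exact_match_reward are hypothetical helpers written for this example.

import re

SOLUTION_RE = re.compile(r"<SOLUTION>(.+?)</SOLUTION>", flags=re.DOTALL)

def extract_solution(completion: str) -> str | None:
    """Return the text between the solution tags, or None if absent."""
    match = SOLUTION_RE.search(completion)
    return match.group(1).strip() if match else None

def exact_match_reward(completion: str, ground_truth: str) -> float:
    """1.0 when the extracted answer exactly matches the reference, else 0.0."""
    answer = extract_solution(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

completion = (
    "<start_working_out>240 × 0.15 = 36<end_working_out>"
    "<SOLUTION>36</SOLUTION>"
)
print(exact_match_reward(completion, "36"))  # 1.0

Real pipelines usually combine several such signals (format compliance, numeric correctness, and so on), but the parsing step looks much like this.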

Using DeepFabric's GRPO Formatter

Basic Usage

The GRPO formatter is built into DeepFabric and supports multiple input formats out of the box:

from deepfabric.formatters.builtin.grpo import GrpoFormatter

# Initialize with default configuration
formatter = GrpoFormatter()

# Your raw dataset sample
sample = {
    "question": "If a train travels 120 km in 2 hours, what is its average speed?",
    "final_answer": "60",
    "chain_of_thought": "Speed = Distance / Time. Speed = 120 km / 2 hours = 60 km/h"
}

# Format for GRPO training
formatted = formatter.format_dataset([sample])

Configuration Options

The formatter is highly configurable to match your training pipeline's requirements:

config = {
    "reasoning_start_tag": "<think>",
    "reasoning_end_tag": "</think>",
    "solution_start_tag": "<answer>",
    "solution_end_tag": "</answer>",
    "system_prompt": "Solve the following problem step by step.",
    "validate_numerical": True  # Enforce numerical answer extraction
}

formatter = GrpoFormatter(config=config)

This flexibility means you can adapt to different GRPO implementations (like Qwen, DeepSeek, or custom pipelines) that might use different tag conventions.

Supported Input Formats

One of the formatter's strengths is its ability to handle diverse dataset structures. Let's explore each:

1. Question-Answer Format

The simplest format contains just questions and answers:

sample = {
    "question": "What is 25² ?",
    "final_answer": "625"
}

The formatter automatically generates a basic reasoning wrapper if no reasoning trace is provided.
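
To see what wrapper the formatter produced for an answer-only sample, format it and print the assistant message:

from deepfabric.formatters.builtin.grpo import GrpoFormatter

formatted = GrpoFormatter().format_dataset([sample])
# The assistant turn wraps the auto-generated reasoning in the reasoning tags
# and places "625" inside the solution tags
print(formatted[0]["messages"][-1]["content"])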

2. Chain-of-Thought Format

Includes explicit reasoning steps:

sample = {
    "question": "Solve: 3x + 5 = 20",
    "chain_of_thought": "Subtract 5 from both sides: 3x = 15. Divide both sides by 3: x = 5.",
    "final_answer": "5"
}

3. Structured Chain-of-Thought

The most detailed format with message structure and reasoning traces:

sample = {
    "messages": [
        {"role": "user", "content": "Calculate 15! / 13!"}
    ],
    "reasoning_trace": [
        {"thought": "Using factorial properties, 15! / 13! = 15 × 14 × 13! / 13!"},
        {"thought": "The 13! cancels out, leaving 15 × 14"},
        {"action": "Calculate: 15 × 14 = 210"}
    ],
    "final_answer": "210"
}

4. Conversation Format

Already has messages but needs GRPO formatting:

sample = {
    "messages": [
        {"role": "user", "content": "What is 2³ + 3²?"},
        {"role": "assistant", "content": "Let me calculate: 2³ = 8 and 3² = 9. Therefore 8 + 9 = 17"}
    ]
}

The formatter intelligently extracts reasoning and answer, then wraps them in GRPO tags.

5. Generic Format

For datasets with non-standard field names:

sample = {
    "problem": "Find the area of a circle with radius 5",
    "solution": "78.54",
    "reasoning": "Area = πr². With r=5: Area = π × 5² = π × 25 ≈ 78.54"
}

The formatter searches for common field name patterns: problem, prompt, or input for questions, and solution, output, or response for answers.
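
The exact lookup logic lives inside the formatter, but conceptually it is a first-match search over candidate keys, roughly like this illustrative sketch (pick_field is a hypothetical helper, not DeepFabric's implementation):

def pick_field(sample: dict, candidates: list[str]) -> str | None:
    """Return the first non-empty value found under any of the candidate keys."""
    for key in candidates:
        value = sample.get(key)
        if value:
            return value
    return None

question = pick_field(sample, ["question", "problem", "prompt", "input"])
answer = pick_field(sample, ["final_answer", "solution", "output", "response"])
reasoning = pick_field(sample, ["chain_of_thought", "reasoning"])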

Validation and Quality Control

The formatter includes robust validation at two levels:

Input Validation

Before formatting, each sample is validated:

# This returns False for invalid samples
is_valid = formatter.validate(sample)

Validation checks:

  • Required fields are present
  • Data types are correct
  • Format can be detected and handled

Output Validation

After formatting, samples are validated against GRPO requirements:

formatted_sample = formatter._format_single_sample(sample)
is_grpo_compliant = formatter.validate_output(formatted_sample)

This ensures:

  • All required roles (system, user, assistant) are present
  • GRPO formatting tags are correctly applied
  • Numerical answers are extractable (if validation enabled)

Numerical Answer Extraction

When validate_numerical is set to True, the formatter uses regex patterns to ensure answers can be extracted:

import re

# The formatter compiles these patterns
format_regex = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL
)

number_regex = re.compile(
    r"<SOLUTION>.*?\s*([+-]?[\d\.,]+)",
    flags=re.MULTILINE | re.DOTALL
)

This ensures the training pipeline can reliably extract answers for reward computation.
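
For example, a pipeline might turn the matched text into a float for a tolerance-based comparison. The snippet below reuses number_regex from the block above and is illustrative rather than DeepFabric's own code:

def extract_number(completion: str) -> float | None:
    """Convert the matched answer text into a float, dropping thousands separators."""
    match = number_regex.search(completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

value = extract_number("<SOLUTION>1,234.5</SOLUTION>")
print(value)                       # 1234.5
print(abs(value - 1234.5) < 1e-6)  # tolerance-based correctness check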

Real-World Example: Formatting a Math Dataset

Let's walk through a complete example using a mathematical reasoning dataset:

from deepfabric.formatters.builtin.grpo import GrpoFormatter
import json

# Configuration matching your training pipeline
config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>",
    "validate_numerical": True
}

formatter = GrpoFormatter(config=config)

# Your raw dataset
raw_samples = [
    {
        "question": "A rectangle has length 8cm and width 5cm. What is its perimeter?",
        "chain_of_thought": "Perimeter of rectangle = 2(length + width). P = 2(8 + 5) = 2(13) = 26cm",
        "final_answer": "26"
    },
    {
        "question": "Solve for x: 2x + 7 = 19",
        "chain_of_thought": "Subtract 7 from both sides: 2x = 12. Divide both sides by 2: x = 6.",
        "final_answer": "6"
    }
]

# Format the entire dataset
formatted_dataset = formatter.format_dataset(raw_samples)

# Save for training
with open("grpo_training_data.jsonl", "w") as f:
    for sample in formatted_dataset:
        f.write(json.dumps(sample) + "\n")

print(f"Formatted {len(formatted_dataset)} samples for GRPO training")

The output JSONL file is ready to feed into your GRPO training pipeline.
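
As a quick sanity check, you can load the file back with the Hugging Face datasets library before handing it to your trainer (assumed here; any JSONL reader works):

from datasets import load_dataset

# Each line of the JSONL becomes one record with a "messages" list
ds = load_dataset("json", data_files="grpo_training_data.jsonl", split="train")
print(len(ds))
print(ds[0]["messages"][0]["role"])  # "system"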

Integration with DeepFabric Dataset Generation

DeepFabric can generate synthetic datasets and format them for GRPO in one pipeline:

# config.yaml
dataset_system_prompt: |
  Generate mathematical reasoning problems suitable for GRPO training.
  Include step-by-step reasoning and numerical answers.

topic_tree:
  args:
    model: "gpt-4"
    depth: 2
    branching_factor: 3
    root_prompt: "Mathematical reasoning: algebra, geometry, arithmetic"
  save_as: "topics.jsonl"

data_engine:
  args:
    model: "gpt-4"
    samples_per_topic: 10
    temperature: 0.7

dataset:
  creation:
    formatter: "grpo"
    formatter_config:
      reasoning_start_tag: "<start_working_out>"
      reasoning_end_tag: "<end_working_out>"
      solution_start_tag: "<SOLUTION>"
      solution_end_tag: "</SOLUTION>"
      validate_numerical: true
  save_as: "grpo_dataset.jsonl"

Run the pipeline:

deepfabric start config.yaml

This generates a complete GRPO-formatted dataset from scratch.

Best Practices

1. Match Your Training Pipeline

Different GRPO implementations use different tags. Always configure the formatter to match your training code:

# For Qwen-style GRPO
qwen_config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>"
}

# For custom pipeline
custom_config = {
    "reasoning_start_tag": "[REASONING]",
    "reasoning_end_tag": "[/REASONING]",
    "solution_start_tag": "[ANSWER]",
    "solution_end_tag": "[/ANSWER]"
}

2. Enable Validation for Math Tasks

For mathematical reasoning, always enable numerical validation:

config = {"validate_numerical": True}

This ensures your reward function can extract answers reliably.

3. Provide Quality Reasoning Traces

The better your input reasoning, the better your GRPO training:

# Good: Detailed step-by-step
sample = {
    "question": "What is 15% of 80?",
    "chain_of_thought": "Convert percentage to decimal: 15% = 0.15. Multiply: 80 × 0.15 = 12",
    "final_answer": "12"
}

# Less ideal: Minimal reasoning
sample = {
    "question": "What is 15% of 80?",
    "final_answer": "12"
}

4. Customize System Prompts

The system prompt guides model behavior during training:

config = {
    "system_prompt": """Solve mathematical problems by:
1. Breaking down the problem
2. Showing all calculation steps
3. Providing the final numerical answer"""
}

5. Validate Your Output

Always check a few formatted samples before training:

formatted = formatter.format_dataset(raw_samples)

# Inspect first sample
print(json.dumps(formatted[0], indent=2))

# Validate all samples
valid_count = sum(1 for s in formatted if formatter.validate_output(s))
print(f"{valid_count}/{len(formatted)} samples are GRPO-compliant")

Troubleshooting Common Issues

Issue: Samples Being Filtered Out

Problem: Some samples don't appear in the formatted output.

Solution: Check validation errors:

for sample in raw_samples:
    if not formatter.validate(sample):
        print(f"Invalid sample: {sample}")

Common causes:

  • Missing required fields
  • Empty answers
  • Incompatible format

Issue: Answer Extraction Fails

Problem: validate_numerical rejects valid samples.

Solution: Check answer format:

# The regex expects numbers in the solution tags
# This works:
"<SOLUTION>42</SOLUTION>"

# This might fail:
"<SOLUTION>The answer is forty-two</SOLUTION>"

For non-numerical answers, disable validation:

config = {"validate_numerical": False}

Issue: Reasoning Not Preserved

Problem: Original reasoning is lost during formatting.

Solution: Ensure reasoning is in a recognized field:

Recognized fields for reasoning:

  • chain_of_thought
  • reasoning
  • reasoning_trace

Not recognized:

  • explanation
  • steps
  • working
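
If your data uses one of the unrecognized keys, renaming it before formatting is usually the simplest fix:

# Move reasoning from an unrecognized key into one the formatter looks for
for sample in raw_samples:
    if "explanation" in sample and "chain_of_thought" not in sample:
        sample["chain_of_thought"] = sample.pop("explanation")

formatted = formatter.format_dataset(raw_samples)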

Advanced: Custom Format Handling

If you have a unique dataset structure, you can extend the formatter:

from deepfabric.formatters.builtin.grpo import GrpoFormatter

class CustomGrpoFormatter(GrpoFormatter):
    def _format_custom_format(self, sample: dict) -> dict:
        """Handle custom dataset structure."""
        question = sample["problem_statement"]
        steps = sample["solution_steps"]  # List of steps
        answer = sample["correct_answer"]

        # Combine steps into reasoning
        reasoning = " ".join(steps)

        assistant_content = (
            f"{self.reasoning_start_tag}{reasoning}{self.reasoning_end_tag}"
            f"{self.solution_start_tag}{answer}{self.solution_end_tag}"
        )

        return {
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
                {"role": "assistant", "content": assistant_content}
            ]
        }

    def _format_single_sample(self, sample: dict) -> dict | None:
        # Try custom format first
        if "problem_statement" in sample and "solution_steps" in sample:
            return self._format_custom_format(sample)

        # Fall back to standard formats
        return super()._format_single_sample(sample)
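
A quick usage sketch, assuming the hypothetical fields handled above and the single-sample method shown earlier:

custom_sample = {
    "problem_statement": "A car travels 150 km in 3 hours. What is its average speed?",
    "solution_steps": ["Speed = distance / time.", "150 / 3 = 50 km/h."],
    "correct_answer": "50"
}

formatter = CustomGrpoFormatter()
formatted_sample = formatter._format_single_sample(custom_sample)
print(formatted_sample["messages"][-1]["content"])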

Conclusion

DeepFabric's GRPO formatter provides a robust, flexible way to prepare datasets for reward-based optimization training. Key takeaways:

  1. Multiple format support: Works with Q&A, CoT, conversations, and generic formats
  2. Configurable tags: Adapt to any GRPO training pipeline
  3. Built-in validation: Ensures quality and compliance
  4. Numerical extraction: Enables reliable reward computation
  5. Integration ready: Works seamlessly with DeepFabric's generation pipeline

Whether you're formatting existing datasets or generating new synthetic data, the GRPO formatter handles the complex transformations needed for successful mathematical reasoning training.

Happy training!
