Prashant Nigam
Data Preparation and Training Formats (Part 3)

Data is the foundation of any successful AI model. In this part, we'll explore how to create, format, and prepare high-quality training data that will make our email sentiment classifier incredibly accurate.


Why Data Quality Matters More Than Model Size

Here's a truth that might surprise you: a smaller model trained on high-quality, domain-specific data can outperform a massive general-purpose model on specific tasks.

Think of it this way: would you rather have a Swiss Army knife or a scalpel for surgery? General models are Swiss Army knives - versatile but not optimized. Fine-tuned models are scalpels - precise tools for specific jobs.

Understanding Language Model Training Formats

Language models learn by predicting the next piece of text. For fine-tuning, we need to show them examples of the exact conversations we want them to have.

The Anatomy of a Training Example

Every training example teaches the model a specific pattern. For our email sentiment classifier, each example shows:

  1. The Question: "What's the sentiment of this email?"
  2. The Context: The actual email content
  3. The Expected Answer: The correct sentiment classification

Here's what this looks like in practice:

{
  "prompt": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment:",
  "completion": " positive"
}

Notice the space before "positive" in the completion: most subword tokenizers encode " positive" (with a leading space) and "positive" as different tokens, so the completion must match how the label would appear in naturally generated text.
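To see why the leading space matters at the string level, here's a minimal sketch (the prompt is abbreviated; a real check would use the model's own tokenizer):

```python
# The prompt template ends with "Sentiment:"; the completion supplies the label.
prompt = "Classify the sentiment of this email ...\n\nSentiment:"

with_space = prompt + " positive"
without_space = prompt + "positive"

# Without the space, the label is fused onto the colon, which tokenizes
# differently from the way the model would naturally generate the answer.
print(with_space[-19:])     # "Sentiment: positive"
print(without_space[-18:])  # "Sentiment:positive"
```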

Chat Templates: Teaching Models to Converse

Modern language models use chat templates to understand conversation structure. Think of them as formatting rules that help the model distinguish between:

  • User messages (questions/prompts)
  • Assistant messages (responses)
  • System messages (instructions)

Understanding the SmolLM2 Chat Template

Our base model (SmolLM2-1.7B-Instruct) uses this chat template:

<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}
<|im_end|>

The <|im_start|> and <|im_end|> tokens are special markers that help the model understand who's speaking.
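As a sketch, wrapping a single exchange in this template is just string formatting (the helper below is hypothetical, not part of MLX; in practice the tokenizer's `apply_chat_template` from the transformers library does this for you):

```python
def format_chat(user_message: str, assistant_response: str) -> str:
    """Wrap one user/assistant exchange in SmolLM2's ChatML-style template."""
    return (
        f"<|im_start|>user\n{user_message}\n<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_response}\n<|im_end|>"
    )

example = format_chat("Classify this email's sentiment: ...", "positive")
print(example)
```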

Why Chat Templates Matter

Without proper formatting, models get confused about who's saying what. It's like having a conversation without knowing when each person starts and stops talking. Chat templates provide this crucial structure.

Creating High-Quality Training Data

Let's build our email sentiment dataset step by step. We'll create examples that cover the full range of scenarios our model might encounter.

Step 1: Define Our Classification Categories

For email sentiment analysis, we'll use three clear categories:

  • Positive: Grateful, satisfied, complimentary emails
  • Negative: Complaints, frustration, dissatisfaction
  • Neutral: Informational, requests, general inquiries

Step 2: Create Diverse Email Examples

Here's our data creation script with detailed examples:

touch data_creation.py

# Create data_creation.py
import json
import random
from typing import List, Dict

def create_training_example(subject: str, email_body: str, sentiment: str) -> Dict[str, str]:
    """Create a properly formatted training example"""

    # Create the prompt in a consistent format
    prompt = f"""Classify the sentiment of this email as positive, negative, or neutral.

Subject: {subject}
Email: {email_body}

Sentiment:"""

    # The completion should start with a space for proper tokenization
    completion = f" {sentiment}"

    return {
        "prompt": prompt,
        "completion": completion
    }

def generate_positive_examples() -> List[Dict[str, str]]:
    """Generate positive sentiment email examples"""

    positive_examples = [
        {
            "subject": "Thank you for excellent service",
            "body": "I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional, and my issue was resolved quickly.",
            "sentiment": "positive"
        },
        {
            "subject": "Great job on the project",
            "body": "The deliverables exceeded our expectations. The attention to detail and quality of work was impressive. Looking forward to future collaborations.",
            "sentiment": "positive"
        },
        {
            "subject": "Wonderful experience",
            "body": "Just wanted to share that our experience with your service has been fantastic. The staff is knowledgeable and always willing to help.",
            "sentiment": "positive"
        },
        {
            "subject": "Love the new features",
            "body": "The latest update is amazing! The new features make everything so much easier. Thank you for listening to user feedback.",
            "sentiment": "positive"
        },
        {
            "subject": "Highly recommend",
            "body": "I've been using your service for months now and I'm consistently impressed. The reliability and quality are top-notch.",
            "sentiment": "positive"
        }
    ]

    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"]) 
            for ex in positive_examples]

def generate_negative_examples() -> List[Dict[str, str]]:
    """Generate negative sentiment email examples"""

    negative_examples = [
        {
            "subject": "Disappointed with service",
            "body": "I'm extremely frustrated with the poor quality of support I received. My issue has been ongoing for weeks without resolution.",
            "sentiment": "negative"
        },
        {
            "subject": "System outage - unacceptable",
            "body": "The constant system failures are disrupting our business operations. This is the third outage this month and it's completely unacceptable.",
            "sentiment": "negative"
        },
        {
            "subject": "Billing error needs immediate attention",
            "body": "I've been charged incorrectly for the third time. This is becoming a serious problem and I'm losing confidence in your billing system.",
            "sentiment": "negative"
        },
        {
            "subject": "Very poor customer experience",
            "body": "The representative was unhelpful and seemed disinterested in solving my problem. I've never experienced such poor customer service.",
            "sentiment": "negative"
        },
        {
            "subject": "Product quality issues",
            "body": "The product arrived damaged and doesn't match the description. I'm disappointed and expect a full refund immediately.",
            "sentiment": "negative"
        }
    ]

    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"]) 
            for ex in negative_examples]

def generate_neutral_examples() -> List[Dict[str, str]]:
    """Generate neutral sentiment email examples"""

    neutral_examples = [
        {
            "subject": "Account information update",
            "body": "Please update my billing address to the new address I provided. Let me know when this has been completed.",
            "sentiment": "neutral"
        },
        {
            "subject": "Question about pricing",
            "body": "Could you provide information about your enterprise pricing plans? We're evaluating options for our team of 50 users.",
            "sentiment": "neutral"
        },
        {
            "subject": "Meeting reschedule request",
            "body": "I need to reschedule our meeting from Tuesday to Thursday due to a scheduling conflict. Please confirm if this works.",
            "sentiment": "neutral"
        },
        {
            "subject": "Documentation request",
            "body": "Can you send me the technical documentation for the API integration? I need this for our development team.",
            "sentiment": "neutral"
        },
        {
            "subject": "Password reset",
            "body": "I'm unable to access my account and need to reset my password. Please send reset instructions to this email address.",
            "sentiment": "neutral"
        }
    ]

    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"]) 
            for ex in neutral_examples]

def create_balanced_dataset() -> List[Dict[str, str]]:
    """Create a balanced dataset with equal representation"""

    print("Creating balanced email sentiment dataset...")

    # Generate examples for each category
    positive_examples = generate_positive_examples()
    negative_examples = generate_negative_examples()
    neutral_examples = generate_neutral_examples()

    # Combine all examples
    all_examples = positive_examples + negative_examples + neutral_examples

    # Shuffle to avoid category clustering
    random.shuffle(all_examples)

    print(f"Created {len(all_examples)} training examples:")
    print(f"  Positive: {len(positive_examples)}")
    print(f"  Negative: {len(negative_examples)}")
    print(f"  Neutral: {len(neutral_examples)}")

    return all_examples

def save_training_data(examples: List[Dict[str, str]], filename: str = "training_data.jsonl"):
    """Save training data in JSONL format"""

    with open(filename, 'w') as f:
        for example in examples:
            f.write(json.dumps(example) + '\n')

    print(f"βœ… Saved {len(examples)} examples to {filename}")

def preview_examples(examples: List[Dict[str, str]], num_preview: int = 3):
    """Preview some training examples"""

    print(f"\nπŸ“‹ Preview of {num_preview} training examples:")
    print("=" * 80)

    for i, example in enumerate(examples[:num_preview]):
        print(f"\nExample {i+1}:")
        print(f"Prompt:\n{example['prompt']}")
        print(f"Expected completion: '{example['completion']}'")
        print("-" * 40)

if __name__ == "__main__":
    # Create the dataset
    training_examples = create_balanced_dataset()

    # Preview some examples
    preview_examples(training_examples)

    # Save to file
    save_training_data(training_examples)

    print("\nπŸŽ‰ Training data creation complete!")

Let's examine the data we just created and understand its format.

After running python3 data_creation.py, you will see the following output and a new file:

Terminal Output:

Creating balanced email sentiment dataset...
Created 15 training examples:
  Positive: 5
  Negative: 5
  Neutral: 5

βœ… Saved 15 examples to training_data.jsonl
πŸŽ‰ Training data creation complete!

New File Created:

  • training_data.jsonl (2-3 KB) - Your training dataset

Understanding JSONL Format

JSONL (JSON Lines) is the standard format for ML training data. Unlike regular JSON, each line is a separate JSON object:

Regular JSON:

  [
    {"prompt": "...", "completion": " positive"},
    {"prompt": "...", "completion": " negative"}
  ]

JSONL (what we created):

  {"prompt": "...", "completion": " positive"}
  {"prompt": "...", "completion": " negative"}

Why JSONL for training?

  • Memory efficient: Process one example at a time
  • Streamable: Handle huge datasets without loading everything
  • Standard: All ML frameworks expect this format

Your training_data.jsonl contains 15 examples (5 positive, 5 negative, 5 neutral) - each line teaching the model how to classify email sentiment. This file is the foundation for everything that follows.
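The streaming benefit is easy to demonstrate: JSONL can be consumed one line at a time, so memory use stays constant no matter how large the file grows. A minimal sketch (the filename is illustrative):

```python
import json

def stream_jsonl(path: str):
    """Yield one parsed training example at a time instead of loading all of them."""
    with open(path, "r") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)

# Usage (assuming training_data.jsonl from the script above):
# for example in stream_jsonl("training_data.jsonl"):
#     print(example["completion"])
```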

Converting Training Data to MLX Format

What is MLX?

  • MLX format refers to the specific data format expected by MLX (Apple'smachine learning framework for Apple Silicon).
  • Apple's ML framework optimized for M1/M2/M3 chips
  • Designed to leverage Apple Silicon's unified memory architecture
  • Efficient for training and running models on Mac hardware

MLX Training Data Format:

  • Uses JSONL (JSON Lines) where each line contains a single JSON object
  • Each object has a text field with the complete training example
  • Format: {"text": "your complete training text here"}

Why the specific format?
MLX's fine-tuning tools expect this simple structure so they can:

  1. Stream data efficiently during training
  2. Apply the model's chat template automatically
  3. Handle tokenization and batching internally

Original Format (JSONL):

{
  "prompt": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment:",
  "completion": " positive"
}

MLX Format (after conversion):

{
  "text": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment: positive"
}

Key Difference:

  • Original: Separate prompt and completion fields
  • MLX: Single text field combining both (concatenated together)

The conversion essentially does: text = prompt + completion

touch convert_to_mlx.py

# Create convert_to_mlx.py
import json
import os
from pathlib import Path

def convert_to_mlx_format(input_file: str = "training_data.jsonl", 
                         output_dir: str = "data/mlx_format"):
    """Convert JSONL training data to MLX format"""

    print(f"Converting {input_file} to MLX format...")

    # Create output directory
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Read training data
    examples = []
    with open(input_file, 'r') as f:
        for line in f:
            if line.strip():
                example = json.loads(line)
                # MLX format combines prompt and completion into a single text field
                text = example['prompt'] + example['completion']
                examples.append({"text": text})

    # Save training data
    train_file = os.path.join(output_dir, "train.jsonl")
    with open(train_file, 'w') as f:
        for example in examples:
            f.write(json.dumps(example) + '\n')

    print(f"βœ… Converted {len(examples)} examples")
    print(f"βœ… Saved to {train_file}")

    # Create a small validation set (10% of data)
    val_size = max(1, len(examples) // 10)
    val_examples = examples[:val_size]
    train_examples = examples[val_size:]

    # Save validation data
    val_file = os.path.join(output_dir, "valid.jsonl")
    with open(val_file, 'w') as f:
        for example in val_examples:
            f.write(json.dumps(example) + '\n')

    # Update training data to exclude validation examples
    with open(train_file, 'w') as f:
        for example in train_examples:
            f.write(json.dumps(example) + '\n')

    print(f"βœ… Created train set: {len(train_examples)} examples")
    print(f"βœ… Created validation set: {len(val_examples)} examples")

    return len(train_examples), len(val_examples)

def preview_mlx_format(output_dir: str = "data/mlx_format"):
    """Preview the MLX formatted data"""

    train_file = os.path.join(output_dir, "train.jsonl")

    print("\nπŸ“‹ Preview of MLX formatted data:")
    print("=" * 80)

    with open(train_file, 'r') as f:
        for i, line in enumerate(f):
            if i >= 2:  # Show first 2 examples
                break

            example = json.loads(line)
            print(f"\nExample {i+1}:")
            print(f"Text: {example['text'][:200]}...")  # Show first 200 chars
            print("-" * 40)

if __name__ == "__main__":
    # Convert the data
    train_count, val_count = convert_to_mlx_format()

    # Preview the results
    preview_mlx_format()

    print(f"\nπŸŽ‰ MLX format conversion complete!")
    print(f"Ready for training with {train_count} examples")

The script reserves roughly 10% of the examples for validation (at least one) and uses the remaining ~90% for training. With our 15 examples, that works out to 1 validation example and 14 training examples.
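The split arithmetic is worth making explicit. This small sketch mirrors the logic inside convert_to_mlx_format:

```python
def split_counts(total: int) -> tuple:
    """Mirror the script's split: at least 1 validation example, the rest train."""
    val = max(1, total // 10)  # integer division, floored at 1
    return total - val, val

print(split_counts(15))   # (14, 1) - our tiny dataset yields 1 validation example
print(split_counts(100))  # (90, 10)
```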

Run the conversion:

python3 convert_to_mlx.py

After running python3 convert_to_mlx.py, you will see two new files created under data/mlx_format/:

  • valid.jsonl
  • train.jsonl

With the data ready, we can move on to the next part of this series, where we get to the heart of the matter: actually running the fine-tuning.
