Data is the foundation of any successful AI model. In this part, we'll explore how to create, format, and prepare the high-quality training data that our email sentiment classifier will be fine-tuned on.
Why Data Quality Matters More Than Model Size
Here's a truth that might surprise you: a smaller model trained on high-quality, domain-specific data can outperform a massive general-purpose model on specific tasks.
Think of it this way: would you rather have a Swiss Army knife or a scalpel for surgery? General models are Swiss Army knives - versatile but not optimized. Fine-tuned models are scalpels - precise tools for specific jobs.
Understanding Language Model Training Formats
Language models learn by predicting the next piece of text. For fine-tuning, we need to show them examples of the exact conversations we want them to have.
The Anatomy of a Training Example
Every training example teaches the model a specific pattern. For our email sentiment classifier, each example shows:
- The Question: "What's the sentiment of this email?"
- The Context: The actual email content
- The Expected Answer: The correct sentiment classification
Here's what this looks like in practice:
{
"prompt": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment:",
"completion": " positive"
}
Notice the space before "positive" in the completion. Most tokenizers encode " positive" (with a leading space) differently from "positive" on its own, and the spaced version is what naturally follows "Sentiment:" in running text, so including it keeps the training target consistent with how the model tokenizes the prompt.
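If you want to see this for yourself, here's a minimal sketch using the Hugging Face tokenizer (it assumes the transformers library is installed and that HuggingFaceTB/SmolLM2-1.7B-Instruct is the Hub ID of the base model we use below):

```python
# Minimal sketch (assumes transformers is installed and the Hub ID below is correct).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Compare the token IDs for the completion with and without the leading space.
print(tokenizer.encode(" positive", add_special_tokens=False))
print(tokenizer.encode("positive", add_special_tokens=False))
```

The two encodings generally differ, which is why our completions start with a space.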
Chat Templates: Teaching Models to Converse
Modern language models use chat templates to understand conversation structure. Think of them as formatting rules that help the model distinguish between:
- User messages (questions/prompts)
- Assistant messages (responses)
- System messages (instructions)
Understanding the SmolLM2 Chat Template
Our base model (SmolLM2-1.7B-Instruct) uses this chat template:
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}
<|im_end|>
The <|im_start|> and <|im_end|> tokens are special markers that help the model understand who's speaking.
Why Chat Templates Matter
Without proper formatting, models get confused about who's saying what. It's like having a conversation without knowing when each person starts and stops talking. Chat templates provide this crucial structure.
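To see the template in action, you can let the tokenizer format a message list for you. Here's a small sketch, assuming the transformers library is installed and the base model is published as HuggingFaceTB/SmolLM2-1.7B-Instruct on the Hugging Face Hub:

```python
# Small sketch: ask the tokenizer to apply the model's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

messages = [
    {
        "role": "user",
        "content": "Classify the sentiment of this email as positive, negative, or neutral.\n\n"
                   "Subject: Thank you\nEmail: Great support, thanks!\n\nSentiment:",
    },
]

# tokenize=False returns the formatted string so we can read the markers;
# add_generation_prompt=True appends the assistant header so the model
# knows it is expected to respond next.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```

You should see your message wrapped in the <|im_start|>/<|im_end|> structure shown above.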
Creating High-Quality Training Data
Let's build our email sentiment dataset step by step. We'll create examples that cover the full range of scenarios our model might encounter.
Step 1: Define Our Classification Categories
For email sentiment analysis, we'll use three clear categories:
- Positive: Grateful, satisfied, complimentary emails
- Negative: Complaints, frustration, dissatisfaction
- Neutral: Informational, requests, general inquiries
Step 2: Create Diverse Email Examples
Here's our data creation script with detailed examples:
touch data_creation.py
# Create data_creation.py
import json
import random
from typing import List, Dict
def create_training_example(subject: str, email_body: str, sentiment: str) -> Dict[str, str]:
"""Create a properly formatted training example"""
# Create the prompt in a consistent format
prompt = f"""Classify the sentiment of this email as positive, negative, or neutral.
Subject: {subject}
Email: {email_body}
Sentiment:"""
# The completion should start with a space for proper tokenization
completion = f" {sentiment}"
return {
"prompt": prompt,
"completion": completion
}
def generate_positive_examples() -> List[Dict[str, str]]:
"""Generate positive sentiment email examples"""
positive_examples = [
{
"subject": "Thank you for excellent service",
"body": "I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional, and my issue was resolved quickly.",
"sentiment": "positive"
},
{
"subject": "Great job on the project",
"body": "The deliverables exceeded our expectations. The attention to detail and quality of work was impressive. Looking forward to future collaborations.",
"sentiment": "positive"
},
{
"subject": "Wonderful experience",
"body": "Just wanted to share that our experience with your service has been fantastic. The staff is knowledgeable and always willing to help.",
"sentiment": "positive"
},
{
"subject": "Love the new features",
"body": "The latest update is amazing! The new features make everything so much easier. Thank you for listening to user feedback.",
"sentiment": "positive"
},
{
"subject": "Highly recommend",
"body": "I've been using your service for months now and I'm consistently impressed. The reliability and quality are top-notch.",
"sentiment": "positive"
}
]
return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
for ex in positive_examples]
def generate_negative_examples() -> List[Dict[str, str]]:
"""Generate negative sentiment email examples"""
negative_examples = [
{
"subject": "Disappointed with service",
"body": "I'm extremely frustrated with the poor quality of support I received. My issue has been ongoing for weeks without resolution.",
"sentiment": "negative"
},
{
"subject": "System outage - unacceptable",
"body": "The constant system failures are disrupting our business operations. This is the third outage this month and it's completely unacceptable.",
"sentiment": "negative"
},
{
"subject": "Billing error needs immediate attention",
"body": "I've been charged incorrectly for the third time. This is becoming a serious problem and I'm losing confidence in your billing system.",
"sentiment": "negative"
},
{
"subject": "Very poor customer experience",
"body": "The representative was unhelpful and seemed disinterested in solving my problem. I've never experienced such poor customer service.",
"sentiment": "negative"
},
{
"subject": "Product quality issues",
"body": "The product arrived damaged and doesn't match the description. I'm disappointed and expect a full refund immediately.",
"sentiment": "negative"
}
]
return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
for ex in negative_examples]
def generate_neutral_examples() -> List[Dict[str, str]]:
"""Generate neutral sentiment email examples"""
neutral_examples = [
{
"subject": "Account information update",
"body": "Please update my billing address to the new address I provided. Let me know when this has been completed.",
"sentiment": "neutral"
},
{
"subject": "Question about pricing",
"body": "Could you provide information about your enterprise pricing plans? We're evaluating options for our team of 50 users.",
"sentiment": "neutral"
},
{
"subject": "Meeting reschedule request",
"body": "I need to reschedule our meeting from Tuesday to Thursday due to a scheduling conflict. Please confirm if this works.",
"sentiment": "neutral"
},
{
"subject": "Documentation request",
"body": "Can you send me the technical documentation for the API integration? I need this for our development team.",
"sentiment": "neutral"
},
{
"subject": "Password reset",
"body": "I'm unable to access my account and need to reset my password. Please send reset instructions to this email address.",
"sentiment": "neutral"
}
]
return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
for ex in neutral_examples]
def create_balanced_dataset() -> List[Dict[str, str]]:
"""Create a balanced dataset with equal representation"""
print("Creating balanced email sentiment dataset...")
# Generate examples for each category
positive_examples = generate_positive_examples()
negative_examples = generate_negative_examples()
neutral_examples = generate_neutral_examples()
# Combine all examples
all_examples = positive_examples + negative_examples + neutral_examples
# Shuffle to avoid category clustering
random.shuffle(all_examples)
print(f"Created {len(all_examples)} training examples:")
print(f" Positive: {len(positive_examples)}")
print(f" Negative: {len(negative_examples)}")
print(f" Neutral: {len(neutral_examples)}")
return all_examples
def save_training_data(examples: List[Dict[str, str]], filename: str = "training_data.jsonl"):
"""Save training data in JSONL format"""
with open(filename, 'w') as f:
for example in examples:
f.write(json.dumps(example) + '\n')
print(f"β
Saved {len(examples)} examples to {filename}")
def preview_examples(examples: List[Dict[str, str]], num_preview: int = 3):
"""Preview some training examples"""
print(f"\nπ Preview of {num_preview} training examples:")
print("=" * 80)
for i, example in enumerate(examples[:num_preview]):
print(f"\nExample {i+1}:")
print(f"Prompt:\n{example['prompt']}")
print(f"Expected completion: '{example['completion']}'")
print("-" * 40)
if __name__ == "__main__":
# Create the dataset
training_examples = create_balanced_dataset()
# Preview some examples
preview_examples(training_examples)
# Save to file
save_training_data(training_examples)
print("\nπ Training data creation complete!")
Now let's examine the data we just created and understand its format.
After running python data_creation.py, you will see this output and a new file:
Terminal Output:
Creating balanced email sentiment dataset...
Created 15 training examples:
Positive: 5
Negative: 5
Neutral: 5
✅ Saved 15 examples to training_data.jsonl
🎉 Training data creation complete!
New File Created:
- training_data.jsonl (2-3 KB) - Your training dataset
### Understanding JSONL Format
JSONL (JSON Lines) is the de facto standard format for ML fine-tuning data. Unlike regular JSON, each line is a separate JSON object:
Regular JSON:
[
{"prompt": "...", "completion": " positive"},
{"prompt": "...", "completion": " negative"}
]
JSONL (what we created):
{"prompt": "...", "completion": " positive"}
{"prompt": "...", "completion": " negative"}
Why JSONL for training?
- Memory efficient: Process one example at a time
- Streamable: Handle huge datasets without loading everything
- Widely supported: most ML training and fine-tuning tools accept it
Your training_data.jsonl contains 15 examples (5 positive, 5 negative, 5 neutral) - each line teaching the model how to classify email sentiment. This file is the foundation for everything that follows.
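As a quick sanity check, and to see the line-by-line property mentioned above in practice, you can stream the file one example at a time and tally the labels (a small sketch assuming training_data.jsonl is in the current directory):

```python
# Small sketch: stream the JSONL file one line (one example) at a time.
import json
from collections import Counter

label_counts = Counter()
with open("training_data.jsonl") as f:
    for line in f:  # only one example is held in memory at a time
        label_counts[json.loads(line)["completion"].strip()] += 1

print(label_counts)  # expected: 5 positive, 5 negative, 5 neutral
```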
Converting training data to MLX format
What is MLX?
- MLX format refers to the specific data format expected by MLX (Apple's machine learning framework for Apple Silicon).
- Apple's ML framework optimized for M1/M2/M3 chips
- Designed to leverage Apple Silicon's unified memory architecture
- Efficient for training and running models on Mac hardware
MLX Training Data Format:
- Uses JSONL (JSON Lines) where each line contains a single JSON object
- Each object has a text field with the complete training example
- Format: {"text": "your complete training text here"}
Why the specific format?
MLX's fine-tuning tools expect this simple structure so they can:
- Stream data efficiently during training
- Apply the model's chat template automatically
- Handle tokenization and batching internally
Original Format (JSONL):
{
"prompt": "Classify the sentiment of this email as positive, negative,
or neutral.\n\nSubject: Thank you for excellent service\nEmail: I
wanted to express my gratitude for the outstanding support I received.
The team was helpful and professional.\n\nSentiment:",
"completion": " positive"
}
MLX Format (after conversion):
{
"text": "Classify the sentiment of this email as positive, negative,
or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted
to express my gratitude for the outstanding support I received. The
team was helpful and professional.\n\nSentiment: positive"
}
Key Difference:
- Original: Separate prompt and completion fields
- MLX: Single text field combining both (concatenated together)
The conversion essentially does: text = prompt + completion
touch convert_to_mlx.py
# Create convert_to_mlx.py
import json
import os
from pathlib import Path
def convert_to_mlx_format(input_file: str = "training_data.jsonl",
output_dir: str = "data/mlx_format"):
"""Convert JSONL training data to MLX format"""
print(f"Converting {input_file} to MLX format...")
# Create output directory
Path(output_dir).mkdir(parents=True, exist_ok=True)
# Read training data
examples = []
with open(input_file, 'r') as f:
for line in f:
if line.strip():
example = json.loads(line)
# MLX format combines prompt and completion into a single text field
text = example['prompt'] + example['completion']
examples.append({"text": text})
# Save training data
train_file = os.path.join(output_dir, "train.jsonl")
with open(train_file, 'w') as f:
for example in examples:
f.write(json.dumps(example) + '\n')
print(f"β
Converted {len(examples)} examples")
print(f"β
Saved to {train_file}")
# Create a small validation set (10% of data)
val_size = max(1, len(examples) // 10)
val_examples = examples[:val_size]
train_examples = examples[val_size:]
# Save validation data
val_file = os.path.join(output_dir, "valid.jsonl")
with open(val_file, 'w') as f:
for example in val_examples:
f.write(json.dumps(example) + '\n')
# Update training data to exclude validation examples
with open(train_file, 'w') as f:
for example in train_examples:
f.write(json.dumps(example) + '\n')
print(f"β
Created train set: {len(train_examples)} examples")
print(f"β
Created validation set: {len(val_examples)} examples")
return len(train_examples), len(val_examples)
def preview_mlx_format(output_dir: str = "data/mlx_format"):
"""Preview the MLX formatted data"""
train_file = os.path.join(output_dir, "train.jsonl")
print("\nπ Preview of MLX formatted data:")
print("=" * 80)
with open(train_file, 'r') as f:
for i, line in enumerate(f):
if i >= 2: # Show first 2 examples
break
example = json.loads(line)
print(f"\nExample {i+1}:")
print(f"Text: {example['text'][:200]}...") # Show first 200 chars
print("-" * 40)
if __name__ == "__main__":
# Convert the data
train_count, val_count = convert_to_mlx_format()
# Preview the results
preview_mlx_format()
print(f"\nπ MLX format conversion complete!")
print(f"Ready for training with {train_count} examples")
The script holds out 10% of the examples for validation; the remaining 90% are used for training.
Run the conversion:
python3 convert_to_mlx.py
After running python3 convert_to_mlx.py, you will see two new files created under data/mlx_format/:
- train.jsonl
- valid.jsonl
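If you want to confirm the split, a quick check like this (assuming the default data/mlx_format output directory) prints how many examples landed in each file:

```python
# Quick check: count the examples in each converted file.
import json

for path in ("data/mlx_format/train.jsonl", "data/mlx_format/valid.jsonl"):
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(rows)} examples")
```

With our 15-example dataset you should see 14 training examples and 1 validation example (15 // 10 = 1 goes to validation).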
Now the data is ready, and we can head into the next section, where we get to the heart of this series: running the fine-tuning itself.