In this part, we'll set up everything you need to start fine-tuning Small Language Models on your local machine. We'll focus on Apple Silicon optimization.
Why Your Setup Matters
Setting up your development environment correctly is like having a well-organized workshop - it makes everything else easier and prevents countless headaches down the road. A proper setup ensures:
- Optimal Performance: Getting the most out of your hardware
- Reproducible Results: Consistent behavior across sessions
- Easy Debugging: Clean environments make problems easier to trace
- Future Flexibility: Easy to experiment and extend
What are the Hardware Requirements and Recommendations?
Minimum Requirements
- RAM: 8GB (though 16GB+ is highly recommended)
- Storage: 20GB free space
- Processor: Apple Silicon (M1/M2/M3/M4)
Recommended Setup
- RAM: 16GB or more (more RAM = larger models you can run)
- Storage: 50GB+ free space (SSD recommended)
- Processor: Apple Silicon M3 or M4 Pro for optimal performance
I did the fine-tuning on an Apple M3 Pro.
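If you'd like to check your own machine against these numbers before installing anything, here is a minimal sketch using only the Python standard library. The function names and the thresholds (copied from the minimum requirements above) are illustrative, and the RAM lookup through `os.sysconf` is assumed to be available on macOS and Linux; where it isn't, the script reports `None` instead of guessing.

```python
import os
import platform
import shutil
from pathlib import Path

MIN_RAM_GB = 8     # minimum RAM from the requirements list
MIN_DISK_GB = 20   # minimum free storage from the requirements list


def total_ram_gb():
    """Return installed RAM in decimal GB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (ValueError, OSError, AttributeError):
        return None


def check_hardware(path=Path.home()):
    """Collect the three facts the requirements list cares about."""
    ram = total_ram_gb()
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        # A native Python on an Apple Silicon Mac reports "Darwin" and "arm64"
        "apple_silicon": platform.system() == "Darwin" and platform.machine() == "arm64",
        "ram_gb": ram,
        "ram_ok": ram is not None and ram >= MIN_RAM_GB,
        "free_disk_gb": free_gb,
        "disk_ok": free_gb >= MIN_DISK_GB,
    }


if __name__ == "__main__":
    for key, value in check_hardware().items():
        print(f"{key}: {value}")
```

One caveat: a Python binary running under Rosetta reports `x86_64` even on an M-series Mac, so a `False` for `apple_silicon` can also mean you have an Intel build of Python installed.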
But, Why Apple Silicon?
First, this is what I have :).
Second, Apple Silicon processors have a unique architecture that's perfect for AI workloads:
- Unified Memory: CPU and GPU share the same memory pool
- High Memory Bandwidth: Extremely fast data transfer
- Efficient Compute: Optimized for matrix operations
- MLX Framework: Apple's specialized ML framework
Understanding MLX: Apple's AI Framework
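MLX is Apple's open-source machine learning framework, designed specifically for Apple Silicon. Loosely, think of it as playing the role on M-series chips that the CUDA-based stack plays on NVIDIA GPUs: it runs on top of Metal, Apple's GPU framework, so training and inference can use the GPU and the unified memory directly.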
Key MLX Benefits:
- Native Apple Silicon optimization
- Unified memory utilization
- Fast training and inference
- Python-friendly API
- Growing ecosystem of models
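To give a feel for the Python-friendly API, here is a small sketch that multiplies two matrices with MLX. It only does real work once MLX is installed (Step 3 below); the guarded import and the `matmul_demo` name are my own additions for illustration, not part of MLX.

```python
try:
    import mlx.core as mx  # available once MLX is installed
except ImportError:
    mx = None


def matmul_demo(n=256):
    """Multiply two n x n matrices of ones; every entry of the result equals n."""
    if mx is None:
        return None  # MLX is not installed in this environment
    a = mx.ones((n, n))
    b = mx.ones((n, n))
    c = a @ b      # MLX is lazy: this only records the computation
    mx.eval(c)     # this forces it to run on the default device
    return c[0, 0].item()


if __name__ == "__main__":
    result = matmul_demo()
    print("MLX not installed" if result is None else f"c[0, 0] = {result}")
```

Notice there is no explicit copy to or from the GPU: the arrays live in unified memory, which is the point of the "Unified memory utilization" bullet above.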
Step-by-Step Environment Setup
Step 1: Create Your Project Directory
Let's start by creating a clean, organized project structure:
# Create main project directory
mkdir email-sentiment-classifier
cd email-sentiment-classifier
# Create subdirectories for organization
mkdir data models adapters results logs scripts
This structure will keep everything organized as our project grows:
- data/: Training and evaluation datasets
- models/: Downloaded base models
- adapters/: Fine-tuned model adapters
- results/: Evaluation results and metrics
- logs/: Training and evaluation logs
- scripts/: All our Python scripts
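If you prefer to script this step, for example to recreate the layout on another machine, the same structure can be built from Python. This is a small sketch with `pathlib`; the `create_project` helper is a hypothetical name of mine, and the directory names are the ones used above.

```python
from pathlib import Path

SUBDIRS = ["data", "models", "adapters", "results", "logs", "scripts"]


def create_project(root="email-sentiment-classifier"):
    """Create the project root and its subdirectories; safe to run repeatedly."""
    root = Path(root)
    for name in SUBDIRS:
        # parents=True also creates the root; exist_ok=True makes reruns harmless
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `create_project()` from the parent folder gives the same result as the `mkdir` commands, and running it a second time changes nothing.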
Step 2: Set Up Python Virtual Environment
Virtual environments are crucial for avoiding dependency conflicts. Think of them as isolated workspaces where each project has its own set of Python packages.
# Create virtual environment
python3 -m venv email_sentiment_env
# Activate the environment
source email_sentiment_env/bin/activate
# Verify you're in the virtual environment
which python3
# Should show: /path/to/email-sentiment-classifier/email_sentiment_env/bin/python3
# Upgrade pip to latest version
pip install --upgrade pip
Important: Always activate your virtual environment before working on the project. You'll know it's active when you see (email_sentiment_env) in your terminal prompt.
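The prompt is easy to overlook, so scripts can also check for themselves. Here is a minimal sketch that relies on how `venv` works: inside an environment `sys.prefix` points at the environment, while `sys.base_prefix` still points at the base Python installation. The `in_virtualenv` name is just for illustration.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside an environment created by venv."""
    # Outside a venv, sys.prefix and sys.base_prefix are the same path
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run: source email_sentiment_env/bin/activate")
```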
Step 3: Install Core Dependencies
Now we'll install the packages we need. I'll explain each one as we go:
# MLX Framework - Apple's ML framework for Apple Silicon
pip install mlx mlx-lm
# Transformers ecosystem - Hugging Face's toolkit
pip install transformers datasets tokenizers
# Data manipulation and analysis
pip install numpy pandas matplotlib seaborn
# Machine learning utilities
pip install scikit-learn
# Web interface framework
pip install gradio
# General utilities
pip install tqdm requests
Let's understand what each package does:
MLX & MLX-LM: The core frameworks for training and running models on Apple Silicon. MLX handles the low-level computations, while MLX-LM provides high-level tools for language models.
Transformers: Hugging Face's library that gives us access to thousands of pre-trained models for tasks like text generation and translation, making it easy to work with language, images, and audio.
Datasets: Helps us find, load, and manage large collections of training data, so you don't have to build datasets from scratch.
Tokenizers: Breaks text down into small pieces called tokens, which is how text has to be prepared before a model can process it.
NumPy & Pandas: Essential for data manipulation and analysis.
Matplotlib & Seaborn: For creating visualizations of our results.
Scikit-learn: Provides simple tools for building machine learning models that classify, predict, and group patterns in data without writing everything by hand.
Gradio: Creates web interfaces for testing our models.
tqdm: Shows a progress bar in the terminal while a program runs through a long loop, so you can see how much work is left and how fast it's going.
Requests: Makes it easy to download or send data over the internet from Python, for example retrieving a webpage, posting data, or working with APIs.
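Once the installs finish, it can help to see at a glance which versions you actually got. The sketch below uses the standard library's `importlib.metadata`; the package list mirrors the `pip install` commands above, and `installed_versions` is a helper name I made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as they were passed to pip
PACKAGES = [
    "mlx", "mlx-lm", "transformers", "datasets", "tokenizers",
    "numpy", "pandas", "matplotlib", "seaborn", "scikit-learn",
    "gradio", "tqdm", "requests",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

This reads package metadata only and imports nothing, so it runs quickly even for heavy packages like Transformers.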
Step 4: Verify Your Installation (Optional)
Let's make sure everything is working correctly:
Create a test script: test_installation.py
touch test_installation.py
# test_installation.py
print("Testing MLX installation...")
try:
    import mlx.core as mx
    print("✅ MLX core imported successfully")
    print(f" MLX version: {mx.__version__}")
    # Test Metal availability (Apple's GPU framework)
    if mx.metal.is_available():
        print("✅ Metal GPU acceleration available")
        # Peak memory used so far by this process, so expect a value near zero here
        print(f" GPU memory: {mx.metal.get_peak_memory() / 1e9:.1f}GB peak usage")
    else:
        print("⚠️ Metal GPU not available (using CPU)")
except ImportError as e:
    print(f"❌ MLX import failed: {e}")

print("\nTesting MLX-LM...")
try:
    import mlx_lm
    print("✅ MLX-LM imported successfully")
except ImportError as e:
    print(f"❌ MLX-LM import failed: {e}")

print("\nTesting Transformers...")
try:
    import transformers
    print(f"✅ Transformers imported successfully (v{transformers.__version__})")
except ImportError as e:
    print(f"❌ Transformers import failed: {e}")

print("\nTesting other dependencies...")
# Import names, which can differ from pip names (scikit-learn imports as sklearn)
dependencies = ['numpy', 'pandas', 'sklearn', 'gradio']
for dep in dependencies:
    try:
        __import__(dep)
        print(f"✅ {dep} imported successfully")
    except ImportError:
        print(f"❌ {dep} import failed")

print("\n🎉 Installation test complete!")
Run the test:
python3 test_installation.py
You should see all green checkmarks. If anything fails, revisit the installation steps for that package.
Step 5: Download and Verify the Base Model
Before we start fine-tuning, we need to download the SmolLM2-1.7B-Instruct model. This is a one-time download that will be cached locally for all future use.
Model Details:
- Model: SmolLM2-1.7B-Instruct (1.7 billion parameters)
- Size: ~3.4GB
- Download time: 5-15 minutes (depending on internet speed)
- Storage location: ~/.cache/huggingface/hub/ (on macOS)
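Where does the ~3.4GB figure come from? A rough way to see it, assuming the weights are stored as 16-bit floats (my assumption for this back-of-the-envelope estimate), is parameters times bytes per parameter. The helper below is only arithmetic; it ignores tokenizer files and other small extras.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2}


def weights_size_gb(num_params, bytes_per_param):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 1.7e9  # SmolLM2-1.7B, rounded
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype:<18} ~{weights_size_gb(params, nbytes):.1f} GB")
```

The same arithmetic explains the RAM advice earlier: the weights have to fit in memory alongside everything else your Mac is running, so more RAM means larger models.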
Automatic Download (Recommended):
touch download_model.py
# download_model.py
import time

from mlx_lm import generate, load


def download_base_model():
    """Download and verify the base model"""
    print("🚀 Downloading SmolLM2-1.7B-Instruct...")
    print("📦 Model size: ~3.4GB")
    print("⏱️ This will take 5-15 minutes depending on your internet speed")
    print("💾 Model will be cached locally for future use")
    print("\nDownload starting...")
    try:
        start_time = time.time()
        model, tokenizer = load("HuggingFaceTB/SmolLM2-1.7B-Instruct")
        download_time = time.time() - start_time
        print("\n✅ Model downloaded successfully!")
        print(f"⏱️ Download time: {download_time:.1f} seconds")
        print("💾 Model cached at: ~/.cache/huggingface/hub/")
        print("🧪 Testing model...")
        # Quick inference test
        test_response = generate(
            model, tokenizer,
            prompt="The weather today is",
            max_tokens=3
        )
        print(f"✅ Model test successful: '{test_response.strip()}'")
        print("\n🎉 Ready to proceed with fine-tuning!")
        return True
    except Exception as e:
        print(f"\n❌ Download failed: {e}")
        print("\n🔧 Troubleshooting:")
        print(" - Check internet connection")
        print(" - Try running again (partial downloads will resume)")
        print(" - Ensure you have 5GB+ free disk space")
        return False


if __name__ == "__main__":
    download_base_model()
Run the download:
python download_model.py
Manual Download (Alternative):
If automatic download fails, you can download manually:
# Install huggingface-hub if not already installed
pip install huggingface-hub
# Download model manually
# No cache_dir argument is needed: the default cache is already ~/.cache/huggingface/hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('HuggingFaceTB/SmolLM2-1.7B-Instruct')
print('✅ Manual download complete!')
"
Verify Download:
touch verify_model.py
# verify_model.py
from pathlib import Path


def verify_model_download():
    """Verify the model was downloaded correctly"""
    # Check cache directory; the Hub cache names folders models--<org>--<model>
    cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
    model_dirs = list(cache_dir.glob("models--HuggingFaceTB--SmolLM2-1.7B-Instruct"))
    if model_dirs:
        model_dir = model_dirs[0]
        # Skip symlinks so files under snapshots/ (links into blobs/) are not counted twice
        model_size = sum(
            f.stat().st_size
            for f in model_dir.rglob('*')
            if f.is_file() and not f.is_symlink()
        )
        size_gb = model_size / (1024**3)
        print(f"✅ Model found at: {model_dir}")
        print(f"📦 Model size: {size_gb:.1f}GB")
        if size_gb > 3.0:
            print("✅ Model appears complete")
            return True
        else:
            print("⚠️ Model may be incomplete (too small)")
            return False
    else:
        print("❌ Model not found in cache")
        return False


if __name__ == "__main__":
    verify_model_download()
Troubleshooting Download Issues:
Issue: Download interrupted
# Resume interrupted download
python download_model.py # Will automatically resume
Issue: Insufficient disk space
# Check available space
df -h ~
# Need at least 5GB free (3.4GB model + temporary files)
Development Best Practices
Environment Management
Always use virtual environments and document your dependencies:
# Save your current environment
pip freeze > requirements.txt
# Later, recreate the environment
pip install -r requirements.txt
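A requirements.txt only helps reproducibility if the environment still matches it. As a rough sketch, the script below compares the `name==version` pins that `pip freeze` writes against what is currently installed. The helper names are my own, and the parser deliberately skips anything that isn't a plain `==` pin (such as packages installed from a URL).

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def parse_pins(lines):
    """Extract {name: version} from 'name==version' lines as written by pip freeze."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, _, ver = line.partition("==")
        pins[name.strip().lower()] = ver.strip()
    return pins


def installed_version(name):
    """Installed version of a distribution, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        current = lookup(name)
        if current != pinned:
            mismatches[name] = (pinned, current)
    return mismatches


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        drift = find_mismatches(parse_pins(req.read_text().splitlines()))
        for name, (pinned, current) in drift.items():
            print(f"{name}: pinned {pinned}, installed {current}")
        print("Environment matches requirements.txt" if not drift else f"{len(drift)} mismatch(es)")
```

The `lookup` parameter exists so the comparison can be tried against a plain dictionary without touching the real environment.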
Version Control Setup
Initialize git and create a proper .gitignore:
git init
Create .gitignore:
# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
email_sentiment_env/
# Models and data (large files)
models/
adapters/
*.bin
*.safetensors
# Logs and results
logs/
*.log
results/
# System files
.DS_Store
Thumbs.db
Verification and Next Steps
Let's create a final verification script to ensure everything is working:
touch final_verification.py
# final_verification.py
import time

import mlx.core as mx
from mlx_lm import generate, load


def verify_complete_setup():
    """Verify that our complete setup is working"""
    print("🔍 Final Setup Verification")
    print("=" * 50)
    # Check MLX
    print(f"✅ MLX version: {mx.__version__}")
    print(f"✅ Metal GPU: {'Available' if mx.metal.is_available() else 'Not available'}")
    # Test model loading (this will download the model if needed)
    print("\n📥 Testing model download and loading...")
    try:
        start_time = time.time()
        model, tokenizer = load("HuggingFaceTB/SmolLM2-1.7B-Instruct")
        load_time = time.time() - start_time
        print(f"✅ Model loaded successfully in {load_time:.1f}s")
        # Test inference
        print("\n🧪 Testing inference...")
        response = generate(
            model, tokenizer,
            prompt="The quick brown fox",
            max_tokens=5
        )
        print(f"✅ Inference test: '{response.strip()}'")
    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        return False
    print("\n🎉 Complete setup verification passed!")
    print("\nYou're ready to proceed to Part 3!")
    return True


if __name__ == "__main__":
    verify_complete_setup()
Run the final verification:
python3 final_verification.py
What We've Accomplished
Congratulations! You now have a complete, optimized development environment ready for fine-tuning Small Language Models. Here's what we've set up:
✅ Organized project structure
✅ Optimized virtual environment
✅ MLX framework for Apple Silicon acceleration
✅ All necessary dependencies installed
✅ SmolLM2-1.7B-Instruct base model downloaded and verified
✅ Dependency snapshot (requirements.txt) and .gitignore in place
✅ Troubleshooting steps for download issues
Looking Ahead
With your environment ready, we can now move on to the exciting part - working with data and training our first model!
In Part 3, we'll dive deep into:
- Understanding training data formats
- Creating high-quality datasets
- Data preprocessing and tokenization
- Chat templates and prompt engineering
Your development environment is the foundation that makes everything else possible. With this solid base, you're ready to start building amazing AI applications locally!