Prashant Nigam

Setting Up Your Local Development Environment (Part 2)

In this part, we'll set up everything you need to start fine-tuning Small Language Models on your local machine. We'll focus on Apple Silicon optimization.


Why Your Setup Matters

Setting up your development environment correctly is like having a well-organized workshop - it makes everything else easier and prevents countless headaches down the road. A proper setup ensures:

  • Optimal Performance: Getting the most out of your hardware
  • Reproducible Results: Consistent behavior across sessions
  • Easy Debugging: Clean environments make problems easier to trace
  • Future Flexibility: Easy to experiment and extend

What Are the Hardware Requirements and Recommendations?

Minimum Requirements

  • RAM: 8GB (though 16GB+ is highly recommended)
  • Storage: 20GB free space
  • Processor: Apple Silicon (M1/M2/M3/M4)

Recommended Setup

  • RAM: 16GB or more (more RAM = larger models you can run)
  • Storage: 50GB+ free space (SSD recommended)
  • Processor: Apple Silicon M3 or M4 Pro for optimal performance
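
A quick back-of-the-envelope to see why RAM matters: model weights in fp16 take about 2 bytes per parameter, so the 1.7B-parameter model we'll use needs roughly 1.7B × 2 bytes ≈ 3.4GB for the weights alone, before you account for activations and everything else running on your Mac.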

I did the fine-tuning on an Apple M3 Pro.

But Why Apple Silicon?

First, this is what I have :).

Second, Apple Silicon processors have a unique architecture that's perfect for AI workloads:

  • Unified Memory: CPU and GPU share the same memory pool
  • High Memory Bandwidth: Extremely fast data transfer
  • Efficient Compute: Optimized for matrix operations
  • MLX Framework: Apple's specialized ML framework

Understanding MLX: Apple's AI Framework

MLX is Apple's machine learning framework specifically designed for Apple Silicon. Think of it as Apple's answer to NVIDIA's CUDA - it unlocks the full potential of M-series chips for AI workloads.

Key MLX Benefits:

  • Native Apple Silicon optimization
  • Unified memory utilization
  • Fast training and inference
  • Python-friendly API
  • Growing ecosystem of models
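
To get a feel for MLX before we set anything up, here's a minimal sketch (just a taste, not part of our project): arrays live in unified memory, and computation is lazy until you force it.

import mlx.core as mx

a = mx.random.normal((1024, 1024))   # allocated in unified memory
b = mx.random.normal((1024, 1024))
c = a @ b                            # builds a lazy compute graph; nothing runs yet
mx.eval(c)                           # forces evaluation on the GPU via Metal
print(c.shape)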

Step-by-Step Environment Setup

Step 1: Create Your Project Directory

Let's start by creating a clean, organized project structure:

# Create main project directory
mkdir email-sentiment-classifier
cd email-sentiment-classifier

# Create subdirectories for organization
mkdir data models adapters results logs scripts

This structure will keep everything organized as our project grows:

  • data/: Training and evaluation datasets
  • models/: Downloaded base models
  • adapters/: Fine-tuned model adapters
  • results/: Evaluation results and metrics
  • logs/: Training and evaluation logs
  • scripts/: All our Python scripts

Step 2: Set Up Python Virtual Environment

Virtual environments are crucial for avoiding dependency conflicts. Think of them as isolated workspaces where each project has its own set of Python packages.

# Create virtual environment
python3 -m venv email_sentiment_env

# Activate the environment
source email_sentiment_env/bin/activate

# Verify you're in the virtual environment
which python3
# Should show: /path/to/email-sentiment-classifier/email_sentiment_env/bin/python3

# Upgrade pip to latest version
pip install --upgrade pip

Important: Always activate your virtual environment before working on the project. You'll know it's active when you see (email_sentiment_env) in your terminal prompt.

Step 3: Install Core Dependencies

Now we'll install the packages we need. I'll explain each one as we go:

# MLX Framework - Apple's ML framework for Apple Silicon
pip install mlx mlx-lm

# Transformers ecosystem - Hugging Face's toolkit
pip install transformers datasets tokenizers

# Data manipulation and analysis
pip install numpy pandas matplotlib seaborn

# Machine learning utilities
pip install scikit-learn

# Web interface framework
pip install gradio

# General utilities
pip install tqdm requests

Let's understand what each package does:

MLX & MLX-LM: The core frameworks for training and running models on Apple Silicon. MLX handles the low-level computations, while MLX-LM provides high-level tools for language models.

Transformers: Hugging Face's library that gives us access to thousands of pre-trained models for tasks like text generation and translation, making it easy to work with language, images, and audio.
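
As a quick taste of what Transformers gives you, here's a tiny sketch; note the pipeline call downloads a small default sentiment model on first use, so it needs network access.

from transformers import pipeline

# Downloads a default sentiment model on first use (network required)
classifier = pipeline("sentiment-analysis")
print(classifier("I love how easy this setup was!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]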

Datasets: Helps us find, load, and manage large collections of training data for machine learning, so you don't have to build datasets from scratch.
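
For example, pulling a slice of a public dataset is one call ("imdb" here is just an illustrative choice; it downloads on first use):

from datasets import load_dataset

# Load only the first 100 training rows of an illustrative public dataset
ds = load_dataset("imdb", split="train[:100]")
print(ds[0]["text"][:80], ds[0]["label"])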

Tokenizers: Breaks text into small pieces called tokens; this is how text is prepared for AI models and makes them work more efficiently.
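
Here's what that looks like in practice, using the tokenizer definition that ships with the SmolLM2 model we'll download in Step 5 (fetched from the Hugging Face Hub on first use):

from tokenizers import Tokenizer

# Fetches the tokenizer definition from the Hub on first use
tok = Tokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
enc = tok.encode("Fine-tuning small models locally!")
print(enc.tokens)  # the subword pieces
print(enc.ids)     # the integer IDs the model actually consumes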

NumPy & Pandas: Essential for data manipulation and analysis.

Matplotlib & Seaborn: For creating visualizations of our results.

Scikit-learn: Gives us simple tools to build and evaluate machine learning models that classify, predict, and group patterns in data, without writing everything by hand.
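
For instance, computing a metric like accuracy is a single call:

from sklearn.metrics import accuracy_score

# 3 of 4 predictions match the true labels
print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75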

Gradio: Creates beautiful web interfaces for testing our models.
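
A minimal sketch of the idea, with a stand-in classifier (later we'd plug in the fine-tuned model):

import gradio as gr

# Stand-in classifier just to demonstrate the interface
def classify(text):
    return "positive" if "great" in text.lower() else "negative"

# Launches a local web UI (at http://127.0.0.1:7860 by default)
gr.Interface(fn=classify, inputs="text", outputs="text").launch()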

tqdm: Shows a progress bar in the terminal while a program runs through a long loop, so you can easily see how much work is left and how fast it's going.
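
Using it is as simple as wrapping any iterable:

import time
from tqdm import tqdm

# Wrap any iterable to get a live progress bar in the terminal
for _ in tqdm(range(50), desc="demo"):
    time.sleep(0.02)  # stand-in for real work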

Requests: Makes it easy to download or send data over the internet from Python, for example retrieving a webpage, posting data, or working with APIs.
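
For example, here's a sketch of fetching metadata about our base model from the Hugging Face API (assuming this public endpoint, which needs no auth token for public models):

import requests

# Query the public Hugging Face API for model metadata (network access assumed)
r = requests.get("https://huggingface.co/api/models/HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(r.status_code)
print(r.json().get("downloads"))  # e.g. the model's download count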

Step 4: Verify Your Installation (Optional)

Let's make sure everything is working correctly:

Create a test script: test_installation.py

touch test_installation.py


print("Testing MLX installation...")


try:
    import mlx.core as mx
    print("✅ MLX core imported successfully")
    print(f"   MLX version: {mx.__version__}")

    # Test Metal availability (Apple's GPU framework)
    if mx.metal.is_available():
        print("✅ Metal GPU acceleration available")
        print(f"   GPU memory: {mx.metal.get_peak_memory() / 1e9:.1f}GB peak usage")
    else:
        print("⚠️  Metal GPU not available (using CPU)")

except ImportError as e:
    print(f"❌ MLX import failed: {e}")

print("\nTesting MLX-LM...")
try:
    import mlx_lm
    print("✅ MLX-LM imported successfully")
except ImportError as e:
    print(f"❌ MLX-LM import failed: {e}")

print("\nTesting Transformers...")
try:
    import transformers
    print(f"✅ Transformers imported successfully (v{transformers.__version__})")
except ImportError as e:
    print(f"❌ Transformers import failed: {e}")

print("\nTesting other dependencies...")
dependencies = ['numpy', 'pandas', 'sklearn', 'gradio']
for dep in dependencies:
    try:
        __import__(dep)
        print(f"✅ {dep} imported successfully")
    except ImportError:
        print(f"❌ {dep} import failed")

print("\n🎉 Installation test complete!")

Run the test:

python3 test_installation.py

You should see all green checkmarks. If anything fails, revisit the installation steps for that package.

Step 5: Download and Verify the Base Model

Before we start fine-tuning, we need to download the SmolLM2-1.7B-Instruct model. This is a one-time download that will be cached locally for all future use.

Model Details:

  • Model: SmolLM2-1.7B-Instruct (1.7 billion parameters)
  • Size: ~3.4GB
  • Download time: 5-15 minutes (depending on internet speed)
  • Storage location: ~/.cache/huggingface/hub/ (on macOS)

Automatic Download (Recommended):

touch download_model.py

# download_model.py
from mlx_lm import load
import time

def download_base_model():
    """Download and verify the base model"""

    print("🚀 Downloading SmolLM2-1.7B-Instruct...")
    print("📦 Model size: ~3.4GB")
    print("⏱️ This will take 5-15 minutes depending on your internet speed")
    print("💾 Model will be cached locally for future use")
    print("\nDownload starting...")

    try:
        start_time = time.time()
        model, tokenizer = load("HuggingFaceTB/SmolLM2-1.7B-Instruct")
        download_time = time.time() - start_time

        print(f"\n✅ Model downloaded successfully!")
        print(f"⏱️ Download time: {download_time:.1f} seconds")
        print(f"💾 Model cached at: ~/.cache/huggingface/hub/")
        print(f"🧪 Testing model...")

        # Quick inference test
        from mlx_lm import generate
        test_response = generate(
            model, tokenizer,
            prompt="The weather today is",
            max_tokens=3
        )

        print(f"✅ Model test successful: '{test_response.strip()}'")
        print("\n🎉 Ready to proceed with fine-tuning!")

        return True

    except Exception as e:
        print(f"\n❌ Download failed: {e}")
        print("\n🔧 Troubleshooting:")
        print("  - Check internet connection")
        print("  - Try running again (partial downloads will resume)")
        print("  - Ensure you have 5GB+ free disk space")
        return False

if __name__ == "__main__":
    download_base_model()

Run the download:
python download_model.py

Manual Download (Alternative):

If automatic download fails, you can download manually:

# Install huggingface-hub if not already installed
pip install huggingface-hub

# Download model manually (to the default cache at ~/.cache/huggingface/hub)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('HuggingFaceTB/SmolLM2-1.7B-Instruct')
print('✅ Manual download complete!')
"

Verify Download:

touch verify_model.py

# verify_model.py
from pathlib import Path

def verify_model_download():
    """Verify the model was downloaded correctly"""

    # Check cache directory
    cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
    model_dirs = list(cache_dir.glob("*SmolLM2*"))

    if model_dirs:
        model_dir = model_dirs[0]
        model_size = sum(f.stat().st_size for f in model_dir.rglob('*') if f.is_file())
        size_gb = model_size / (1024**3)

        print(f"✅ Model found at: {model_dir}")
        print(f"📦 Model size: {size_gb:.1f}GB")

        if size_gb > 3.0:
            print("✅ Model appears complete")
            return True
        else:
            print("⚠️ Model may be incomplete (too small)")
            return False
    else:
        print("❌ Model not found in cache")
        return False

if __name__ == "__main__":
    verify_model_download()

Troubleshooting Download Issues:

Issue: Download interrupted
# Resume interrupted download
python download_model.py # Will automatically resume

Issue: Insufficient disk space
# Check available space
df -h ~
# Need at least 5GB free (3.4GB model + temporary files)

Development Best Practices

Environment Management

Always use virtual environments and document your dependencies:

# Save your current environment
pip freeze > requirements.txt

# Later, recreate the environment
pip install -r requirements.txt
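
If you'd rather keep a hand-maintained requirements.txt than a raw pip freeze dump, a minimal one for this project is just the packages from Step 3, unpinned:

mlx
mlx-lm
transformers
datasets
tokenizers
numpy
pandas
matplotlib
seaborn
scikit-learn
gradio
tqdm
requests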

Version Control Setup

Initialize git and create a proper .gitignore:

git init

Create .gitignore:

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
email_sentiment_env/

# Models and data (large files)
models/
adapters/
*.bin
*.safetensors

# Logs and results
logs/
*.log
results/

# System files
.DS_Store
Thumbs.db

Verification and Next Steps

Let's create a final verification script to ensure everything is working:

touch final_verification.py

# final_verification.py
import mlx.core as mx
from mlx_lm import load
import time

def verify_complete_setup():
    """Verify that our complete setup is working"""

    print("🔍 Final Setup Verification")
    print("=" * 50)

    # Check MLX
    print(f"✅ MLX version: {mx.__version__}")
    print(f"✅ Metal GPU: {'Available' if mx.metal.is_available() else 'Not available'}")

    # Test model loading (this will download the model if needed)
    print("\n📥 Testing model download and loading...")
    try:
        start_time = time.time()
        model, tokenizer = load("HuggingFaceTB/SmolLM2-1.7B-Instruct")
        load_time = time.time() - start_time
        print(f"✅ Model loaded successfully in {load_time:.1f}s")

        # Test inference
        print("\n🧪 Testing inference...")
        from mlx_lm import generate
        response = generate(
            model, tokenizer, 
            prompt="The quick brown fox", 
            max_tokens=5
        )
        print(f"✅ Inference test: '{response.strip()}'")

    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        return False

    print("\n🎉 Complete setup verification passed!")
    print("\nYou're ready to proceed to Part 3!")
    return True

if __name__ == "__main__":
    verify_complete_setup()

Run the final verification:

python3 final_verification.py

What We've Accomplished

Congratulations! You now have a complete, optimized development environment ready for fine-tuning Small Language Models. Here's what we've set up:

  • Organized project structure
  • Isolated virtual environment
  • MLX framework for Apple Silicon acceleration
  • All necessary dependencies installed
  • Base model downloaded and verified
  • Version control with a sensible .gitignore
  • Troubleshooting guides

Looking Ahead

With your environment ready, we can now move on to the exciting part - working with data and training our first model!

In Part 3, we'll dive deep into:

  • Understanding training data formats
  • Creating high-quality datasets
  • Data preprocessing and tokenization
  • Chat templates and prompt engineering

Your development environment is the foundation that makes everything else possible. With this solid base, you're ready to start building amazing AI applications locally!

