Building Production AI: A Three-Part MLOps Journey - Pt.2

Part 2: Training & MLOps Pipeline

"From Data to Deployment: Building the Production Pipeline"

Now that we have the blueprint, it’s time to actually 'cook.' But here’s the thing: in production, you can’t just train a model once and hope for the best. You need a 'factory' that can do it over and over again perfectly. I spent my time setting up automated gates. If the AI creates something ugly, the system automatically 'fires' that version and refuses to deploy it. It’s like having a robot manager who never sleeps.

1. The Training Lab: Google Colab Setup

First things first: we need a place to work. Training AI is like running a marathon for a computer: it's exhausting. We use Google Colab because its free tier usually offers a T4 GPU, which is the 'engine' we need to train our Adire model.

We start by gathering our tools. We're installing diffusers (the main engine), peft (our LoRA 'sticky note' tool), and bitsandbytes (a clever hack that gives us an 8-bit optimizer, so we can train big models on small GPUs).

# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
             accelerate==0.25.0 peft==0.7.1 bitsandbytes

# We need to make sure the GPU is actually awake and ready to work
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")

# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
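# We don't need to write the training logic from scratch.
# We're grabbing a proven script from the HuggingFace team.
# The tag matches the diffusers version pinned in Cell 1, because the example scripts check for a minimum library version.
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.25.0/examples/dreambooth/train_dreambooth_lora.py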

# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
# This is where we tell the AI exactly what we want. 
# We're pointing it to our Adire images and telling it the "trigger word."
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4, # We 'save up' steps to act like a bigger batch
    "learning_rate": 1e-4, 
    "lr_scheduler": "constant",
    "max_train_steps": 800, # Around 800 steps is a reasonable starting point; tune it for your dataset
    "lora_rank": 4, # Not passed in Cell 4, so the script uses its own default rank
    "lora_alpha": 4, # Also not passed in Cell 4; recorded here for reference
    "seed": 42
}

# ========================================
# Cell 4: Ignition!
# ========================================
# We launch the training. This is where the magic happens.
!accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="{CONFIG['model']}" \
  --instance_data_dir="{CONFIG['instance_data_dir']}" \
  --output_dir="{CONFIG['output_dir']}" \
  --instance_prompt="{CONFIG['instance_prompt']}" \
  --resolution={CONFIG['resolution']} \
  --train_batch_size={CONFIG['train_batch_size']} \
  --gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
  --learning_rate={CONFIG['learning_rate']} \
  --lr_scheduler="{CONFIG['lr_scheduler']}" \
  --max_train_steps={CONFIG['max_train_steps']} \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
  --seed={CONFIG['seed']}
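If you want to run the same launch outside a notebook, where the `{CONFIG[...]}` interpolation is not available, the command can be assembled in plain Python. This is a minimal sketch: `build_launch_command` is a hypothetical helper name, and it only uses the flags that Cell 4 already passes.

```python
import shlex

# Same values as CONFIG in Cell 3 (only the keys that Cell 4 actually uses).
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,
    "seed": 42,
}


def build_launch_command(config: dict) -> list:
    """Hypothetical helper: turn the config dict into an argv-style list."""
    return [
        "accelerate", "launch", "train_dreambooth_lora.py",
        f"--pretrained_model_name_or_path={config['model']}",
        f"--instance_data_dir={config['instance_data_dir']}",
        f"--output_dir={config['output_dir']}",
        f"--instance_prompt={config['instance_prompt']}",
        f"--resolution={config['resolution']}",
        f"--train_batch_size={config['train_batch_size']}",
        f"--gradient_accumulation_steps={config['gradient_accumulation_steps']}",
        f"--learning_rate={config['learning_rate']}",
        f"--lr_scheduler={config['lr_scheduler']}",
        f"--max_train_steps={config['max_train_steps']}",
        "--use_8bit_adam",
        "--checkpointing_steps=100",
        f"--validation_prompt={config['instance_prompt']} sunset over Lagos",
        f"--seed={config['seed']}",
    ]


command = build_launch_command(CONFIG)
# shlex.join quotes arguments that contain spaces, so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the command as a list means it can be handed to `subprocess.run(command, check=True)` without worrying about shell quoting of the prompt.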

2. Tuning the Engine: Hyperparameter Analysis

You might wonder why I chose those specific numbers in the CONFIG. AI training is a bit like cooking: a pinch too much salt ruins the soup.

Learning Rate ($1e-4$): If this is too high, training becomes unstable; the AI 'panics' and learns nothing useful. Too low, and it takes far longer to learn.
Effective Batch Size: We feed the GPU one image at a time but accumulate gradients over four of them before each weight update, for an effective batch size of 1 $\times$ 4 = 4. That keeps training stable without running out of GPU memory.
LoRA Rank: A rank of 4 is lean and fast. The number of LoRA weights grows linearly with the rank, so going to 16 would make the file about 4x bigger without necessarily making the images look much better. We're going for efficiency here.
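The arithmetic behind those choices is easy to check by hand. The sketch below rests on two assumptions: that `max_train_steps` counts optimizer updates (so each step consumes one effective batch), and a made-up 320x320 layer size used only to show how LoRA weight counts scale with rank.

```python
train_batch_size = 1
gradient_accumulation_steps = 4
max_train_steps = 800

# Gradients from 4 single-image batches are summed before one weight update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: max_train_steps counts optimizer updates, not individual forward passes.
samples_seen = effective_batch_size * max_train_steps


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 320x320 projection layer, for illustration only.
small = lora_params(320, 320, rank=4)
large = lora_params(320, 320, rank=16)
size_ratio = large / small

print(effective_batch_size, samples_seen, size_ratio)
```

Because the weight count is `rank * (d_in + d_out)`, the ratio between rank 16 and rank 4 is 4 regardless of the layer size chosen.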

3. The Factory: Building the MLOps Pipeline

Now, we step away from the notebook and build a real software system. In a production environment, you don't want to manually copy-paste files. We use ZenML to build a conveyor belt.

Our pipeline has three main employees:

The Evaluator: Does the model actually create Adire patterns or is it just making noise?
The Promoter: The 'manager' who looks at the test scores and decides if this model is good enough for our customers.
The Deployer: The person who packs the model up and ships it to the cloud.

Step 1: The Evaluator (Quality Control)

This step loads our new model and asks it to draw a few pictures. We measure how fast it is and how well the images match our prompts. We log all these stats into MLflow so we have a permanent record of how this 'version' performed.

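from typing import Dict, List
import time

import mlflow
from diffusers import StableDiffusionPipeline
from zenml import step

# compute_clip_score is a project helper defined elsewhere (not shown in this post)

@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    # We load the brain (Stable Diffusion) and the 'notes' (our LoRA weights)
    pipe = StableDiffusionPipeline.from_pretrained(...)  # base model arguments omitted for brevity
    pipe.unet.load_attn_procs(model_path)

    times, scores = [], []
    for prompt in test_prompts:
        # We time the generation. In production, 'fast' is just as important as 'pretty.'
        start = time.time()
        image = pipe(prompt).images[0]
        times.append(time.time() - start)

        # We calculate a 'quality' score (using a tool called CLIP)
        scores.append(compute_clip_score(image, prompt))

    # Average over every test prompt so the metric names match what they hold
    metrics = {
        "avg_time": sum(times) / len(times),
        "avg_quality": sum(scores) / len(scores),
    }
    mlflow.log_metrics(metrics)  # Keep a receipt!
    return metrics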
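The post does not show what `compute_clip_score` does internally. One common way to score how well an image matches a prompt is the cosine similarity between the CLIP image embedding and the CLIP text embedding. The sketch below shows only that similarity step on toy vectors; it is an assumption about the approach, not the author's helper, and how the real helper scales its score is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
image_embedding = [1.0, 0.0, 0.0]
matching_text = [2.0, 0.0, 0.0]    # same direction, different length
unrelated_text = [0.0, 3.0, 0.0]   # orthogonal direction

match_score = cosine_similarity(image_embedding, matching_text)
mismatch_score = cosine_similarity(image_embedding, unrelated_text)
print(match_score, mismatch_score)
```

Cosine similarity ignores vector length, which is why the matching pair scores the maximum even though one vector is twice as long as the other.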

Step 2: The Promoter (The Decision Maker)

This is our automated 'Quality Gate.' We set strict rules: if the quality is below 0.75, or if it takes longer than 30 seconds to draw a picture, the model is 'fired.' If it passes, it gets promoted to 'Production' status.

from typing import Dict
from mlflow.tracking import MlflowClient
from zenml import step

@step
def promote_model(metrics: Dict[str, float], thresholds: Dict[str, float],
                  model_name: str, model_version: str) -> bool:
    # Does it meet our standards?
    checks = {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["max_time"],
    }
    passed = all(checks.values())

    if passed:
        # If yes, we officially tag it as 'Production' in our system
        client = MlflowClient()
        client.transition_model_version_stage(
            name=model_name, version=model_version, stage="Production"
        )
        print("✓ Model promoted!")
    return passed
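To see how the gate decides whether the Deployer ever runs, here is a plain-Python sketch with no ZenML or MLflow involved. The function names, the `deploy` callback, and the metric values are all made up for illustration; only the two thresholds (0.75 quality, 30 seconds) come from the rules above.

```python
from typing import Callable, Dict, List


def failed_checks(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of the quality-gate checks that the model fails."""
    failures = []
    if metrics["avg_quality"] < thresholds["quality"]:
        failures.append("quality_check")
    if metrics["avg_time"] > thresholds["max_time"]:
        failures.append("speed_check")
    return failures


def run_gate(metrics: Dict[str, float], thresholds: Dict[str, float],
             deploy: Callable[[], None]) -> bool:
    """Call deploy() only when every check passes; report the reasons otherwise."""
    failures = failed_checks(metrics, thresholds)
    if failures:
        print(f"Model rejected, failed: {failures}")
        return False
    deploy()
    return True


THRESHOLDS = {"quality": 0.75, "max_time": 30.0}
deployed = []  # stands in for "shipped to the cloud"

# Made-up example metrics, not measured results.
good = run_gate({"avg_quality": 0.81, "avg_time": 12.0}, THRESHOLDS,
                lambda: deployed.append("good-model"))
slow = run_gate({"avg_quality": 0.81, "avg_time": 45.0}, THRESHOLDS,
                lambda: deployed.append("slow-model"))
print(deployed)
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to log why a version was 'fired'.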

4. MLflow: The Project Diary

While the pipeline runs, MLflow is in the background taking notes. Every loss value, hyperparameter, and test image we log is saved with the run. If our model suddenly starts acting weird next week, we can look back at the 'diary' and see exactly what changed. It’s the closest thing to an 'Undo' button for your entire AI project.

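from mlflow.tracking import MlflowClient
client = MlflowClient()
# We can literally ask MLflow: "Which version of the Adire model was the best?"
runs = client.search_runs(experiment_ids=["0"], order_by=["metrics.avg_quality DESC"])
if runs:  # the experiment may not have any runs yet
    print(f"Our champion model is: {runs[0].info.run_id}")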

That's it! In one takeaway: we've moved from a single training script to a factory. Our model is trained, tested, and vetted by an automated manager. We're not just building a model; we're building a system that can reliably produce many models.

NOTE/ASIDE: How practical this is depends on your compute budget. The smaller the model, the less compute each retraining run needs.