Part 3: Deployment & Monitoring
"Production Deployment: UI, CI/CD, and Observability"
The Gist: We’ve built the engine and tested it on the track. Now, it’s time to open the showroom. In this final part, we aren’t just 'running code'—we’re launching a product. We’ll build a slick user interface, set up an automated 'safety net' (CI/CD) so we don't accidentally ship bugs, and install 'CCTV' (Monitoring) to make sure the AI stays healthy once it's out in the wild.
1. The Front Door: Gradio Application
Nobody wants to generate art by typing code into a terminal. We use Gradio to build a professional 'Front Door.' It’s a simple Python-based UI that lets users type a prompt and get an Adire masterpiece in seconds.
But here’s the pro secret: this app isn't just for the user. It’s wired into MLflow. Every time a user generates an image, the app 'telemeters' the performance data back to us. If a specific prompt is causing errors or taking 60 seconds to load, we’ll know immediately.
# app/gradio_app.py
import gradio as gr
from diffusers import StableDiffusionPipeline
import torch
import mlflow
from datetime import datetime


class InferenceApp:
    def __init__(self, model_path: str):
        # We wake up the brain and load our Adire LoRA
        self.pipe = self._load_model(model_path)
        mlflow.set_tracking_uri("../mlruns")
        mlflow.set_experiment("production_inference")

    def _load_model(self, model_path: str):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            # float16 only makes sense on GPU; fall back to float32 on CPU
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        )
        pipe.unet.load_attn_procs(model_path)
        return pipe.to(device)
    def generate(self, prompt: str, steps: int, guidance: float):
        """Generate image with full observability"""
        start = datetime.now()
        with mlflow.start_run(run_name="inference"):
            # We track exactly what the user asked for
            mlflow.log_params({"prompt": prompt, "steps": steps, "guidance": guidance})
            try:
                # Sliders hand us floats; the scheduler wants an int step count
                image = self.pipe(prompt, num_inference_steps=int(steps), guidance_scale=guidance).images[0]
                duration = (datetime.now() - start).total_seconds()
                # We log the 'health' of this specific request
                mlflow.log_metrics({"generation_time": duration, "prompt_length": len(prompt.split())})
                return image, f"✓ Generated in {duration:.2f}s"
            except Exception as e:
                mlflow.log_param("error", str(e))
                return None, f"✗ Error: {str(e)}"
    def launch(self):
        # The 'Layout' of our showroom
        with gr.Blocks() as demo:
            gr.Markdown("# Nigerian Adire Style Generator")
            with gr.Row():
                with gr.Column():
                    prompt = gr.Textbox(label="Prompt", placeholder="a nigerian_adire_style...")
                    steps = gr.Slider(20, 100, value=50, step=1, label="Steps")
                    guidance = gr.Slider(1, 15, value=7.5, label="Guidance")
                    btn = gr.Button("Generate", variant="primary")
                with gr.Column():
                    output = gr.Image(label="Generated Image")
                    status = gr.Textbox(label="Status")
            # Event listeners must be wired up inside the Blocks context
            btn.click(fn=self.generate, inputs=[prompt, steps, guidance], outputs=[output, status])
        demo.launch(share=True)
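To open the showroom, all we need is a small entry point. Here's a minimal sketch; the ./models/adire_lora path is a placeholder for wherever your trained LoRA weights ended up:

# app/main.py (hypothetical entry point)
from gradio_app import InferenceApp

if __name__ == "__main__":
    app = InferenceApp(model_path="./models/adire_lora")  # placeholder path
    app.launch()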
2. Shipping the Goods: HuggingFace Deployment
HuggingFace is the 'App Store' for AI. Instead of just sending someone a file, we deploy our model weights there. The upload script doesn't just push the model; it also creates a Model Card. Think of this as the 'Instruction Manual' and 'Nutrition Label' for your AI: it tells people what the model is, how to use it, and what its quality scores were during training.
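The actual upload script lives in the GitHub repo linked below; here is a minimal sketch of the idea using the huggingface_hub library (the folder path and card metadata are illustrative, not the exact values from the project):

# deploy/push_to_hub.py (sketch; assumes HF_TOKEN is set in your environment)
from huggingface_hub import HfApi, ModelCard

REPO_ID = "AfroLogicInsect/sd-lora-nigerian-adire"

api = HfApi()
api.create_repo(REPO_ID, exist_ok=True)

# Ship the LoRA weights themselves
api.upload_folder(folder_path="./models/adire_lora", repo_id=REPO_ID)  # placeholder path

# The 'Instruction Manual' / 'Nutrition Label': a Model Card
card = ModelCard("""---
license: mit
base_model: runwayml/stable-diffusion-v1-5
tags: [lora, stable-diffusion, adire]
---
# Nigerian Adire Style LoRA
LoRA fine-tuned on Adire fabric patterns. Trigger phrase: `nigerian_adire_style`.
""")
card.push_to_hub(REPO_ID)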
3. The Safety Net: CI/CD Pipeline
In professional software, we don't just 'upload and pray.' We use a GitHub Action (CI/CD). Every time we update the code, this automated 'Robot' wakes up and:
- Tests: Does the code even run?
- Evaluates: Does the new model version still meet our 0.75 quality score?
- Deploys: Only if everything passes does it push the update to the live app (a sketch of the quality gate is shown below).

It's how you sleep soundly at night, knowing a small typo won't crash your production service.
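The workflow file itself is standard GitHub Actions YAML (it lives in the repo); the interesting part is the quality gate it calls. A minimal sketch of that gate, where evaluate_model() is a hypothetical stand-in for the scoring logic from the evaluation stage:

# ci/quality_gate.py (sketch; evaluate_model is a placeholder)
import sys

QUALITY_THRESHOLD = 0.75  # the bar from our evaluation stage


def evaluate_model() -> float:
    """Placeholder: load the candidate model, score it, return a 0-1 quality score."""
    return 0.82  # replace with your real evaluation (e.g. CLIP similarity on a fixed prompt set)


if __name__ == "__main__":
    score = evaluate_model()
    print(f"Model quality score: {score:.3f} (threshold: {QUALITY_THRESHOLD})")
    # A non-zero exit code fails the GitHub Action, which blocks deployment
    sys.exit(0 if score >= QUALITY_THRESHOLD else 1)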
4. The CCTV: Production Monitoring
Once your model is live, it can 'drift.' Maybe users start using slang the model doesn't understand, or maybe the GPU starts slowing down. We build a Monitoring Dashboard that acts like a heart rate monitor for our AI. If the average generation time spikes above 30 seconds, the system sends us an Alert.
# monitoring/dashboard.py
from datetime import datetime, timedelta

import pandas as pd


class MonitoringDashboard:
    def check_degradation(self, df: pd.DataFrame) -> bool:
        """Alert if performance degrades"""
        recent = df[df["timestamp"] > datetime.now() - timedelta(hours=24)]
        avg_time = recent["generation_time"].mean()
        # If the model gets 'tired' (slow), we trigger an alarm
        if avg_time > 30:
            print(f"⚠️ ALERT: Avg generation time {avg_time:.2f}s > 30s SLA")
            return True
        return False
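Where does that DataFrame come from? The inference app is already logging every request to MLflow, so we can pull the runs straight back out. A sketch, relying on mlflow.search_runs' metrics.* / start_time column naming:

# monitoring/pull_runs.py (sketch: feed logged MLflow runs into the dashboard)
import mlflow

from dashboard import MonitoringDashboard

mlflow.set_tracking_uri("../mlruns")
runs = mlflow.search_runs(experiment_names=["production_inference"])

# Rename MLflow's columns to what the dashboard expects
df = runs.rename(columns={
    "metrics.generation_time": "generation_time",
    "start_time": "timestamp",
})
# MLflow timestamps are timezone-aware (UTC); drop the tz so they compare
# cleanly against the dashboard's naive datetime.now()
df["timestamp"] = df["timestamp"].dt.tz_localize(None)

MonitoringDashboard().check_degradation(df)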
5. The Turbo Boost: Performance Optimization
Finally, we want our AI to be fast. In 2026, we have a few 'Cheat Codes' to speed up Stable Diffusion. By using torch.compile and xFormers, we can often double the speed of generation. We calculate the Speedup Ratio as S = t_baseline / t_optimized. If S = 2.0, your users are getting their art twice as fast, and you're paying half the price for GPU time. It's a win-win.
import torch


def optimize_pipeline(pipe):
    # Reduces VRAM so we can run on cheaper hardware
    pipe.enable_attention_slicing()
    # Uses 'Flash Attention' for a ~20% speed boost (needs the xformers package)
    pipe.enable_xformers_memory_efficient_attention()
    # PyTorch 2.0+ 'compiles' the math into a faster format; do this last,
    # once the attention processors are final
    pipe.unet = torch.compile(pipe.unet)
    return pipe
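Does the turbo actually boost? Time a fixed prompt before and after. A rough sketch; a real benchmark would average several runs, and the warm-up call matters because torch.compile pays its compilation cost on the first invocation:

import time

def time_generation(pipe, prompt="a nigerian_adire_style pattern", steps=50):
    start = time.perf_counter()
    pipe(prompt, num_inference_steps=steps)
    return time.perf_counter() - start

# `pipe` is the StableDiffusionPipeline loaded in the Gradio app
t_baseline = time_generation(pipe)
pipe = optimize_pipeline(pipe)
time_generation(pipe)  # warm-up so compilation cost isn't counted
t_optimized = time_generation(pipe)

print(f"Speedup ratio S = {t_baseline / t_optimized:.2f}x")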
And with that, the series is complete: we've gone from a few images of Adire fabric to a production-ready AI system. It's fast, it's monitored, it's automated, and most importantly, it's ready for real users. You're no longer just playing with AI; you're an AI Engineer.
All the resources used here are available for free:
HuggingFace Repository: https://huggingface.co/AfroLogicInsect/sd-lora-nigerian-adire
GitHub Repository: https://github.com/AkanimohOD19A/adire_mlops_poc
Gradio UI: https://huggingface.co/AfroLogicInsect/sd-adire-demo