The One-Line Summary: A model is "good enough" when it beats a reasonable baseline, meets business requirements, performs consistently across segments, fails gracefully, and the cost of its errors is acceptable. High accuracy alone means nothing without this context.
The Restaurant That Served a 95% Perfect Dish
Chef Marco was proud of his new signature dish.
"I've tested it on 100 customers. 95 loved it! 95% satisfaction!"
He added it to the menu.
Week 1 Results:
Customers served: 500
Loved it: 475 (95%)
Hated it: 20 (4%)
Hospitalized: 5 (1%)
Wait... HOSPITALIZED?
The Investigation:
The 95% who loved it: No allergies, adventurous eaters
The 4% who hated it: Didn't like the spice level
The 1% hospitalized: Had shellfish allergies (ingredient wasn't disclosed)
Marco's "95% satisfaction" was ACCURATE.
But HOSPITALIZING 1% of his diners is a catastrophe.
The restaurant was shut down.
Marco was sued.
His career was over.
The Lesson:
Marco asked: "What percentage liked it?"
He should have asked:
- "What happens to the people who DON'T like it?"
- "Are there segments that react differently?"
- "What's the worst-case failure mode?"
- "What are the consequences of failure?"
- "Is 95% even good compared to alternatives?"
The Six Questions Before Deployment
Before shipping ANY model, answer these six questions:
┌─────────────────────────────────────────────────────────────┐
│ THE DEPLOYMENT READINESS CHECKLIST │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. BASELINE: Does it beat doing nothing / simple rules? │
│ │
│ 2. BUSINESS: Does it meet the actual business requirement? │
│ │
│ 3. SEGMENTS: Does it work for ALL user groups? │
│ │
│ 4. ERRORS: Are the failure modes acceptable? │
│ │
│ 5. STABILITY: Is performance consistent and reliable? │
│ │
│ 6. PRODUCTION: Will it work in the real world? │
│ │
└─────────────────────────────────────────────────────────────┘
ALL SIX must be "Yes" to deploy.
Question 1: Does It Beat the Baseline?
Your model is only valuable if it's better than the alternative.
What's a Baseline?
# BASELINE OPTIONS (from simplest to complex):
# 1. Random guessing
baseline_random = 0.50 # For binary classification
# 2. Always predict majority class
baseline_majority = y_train.value_counts().max() / len(y_train)
# If 90% are "No", predicting "No" always = 90% accuracy!
# 3. Simple rule-based system
# "If transaction > $10,000, flag as fraud"
baseline_rules = rule_based_accuracy
# 4. Current production model (if exists)
baseline_current = current_model_accuracy
# 5. Human performance
baseline_human = human_expert_accuracy
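For baseline option 3, the "rule" is whatever heuristic the business would use without ML. Here's a minimal sketch of how you might score one, assuming your test features live in a DataFrame; the $10,000 threshold and the 'amount' column are made-up examples:
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical rule: flag any transaction over $10,000 as fraud
rule_pred = (X_test['amount'] > 10_000).astype(int)

baseline_rules = accuracy_score(y_test, rule_pred)
print(f"Rule baseline accuracy: {baseline_rules:.1%}, F1: {f1_score(y_test, rule_pred):.3f}")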
The Baseline Test
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
def baseline_comparison(model, X_train, X_test, y_train, y_test):
"""Compare model against baselines."""
results = {}
# Baseline 1: Random
dummy_random = DummyClassifier(strategy='uniform')
dummy_random.fit(X_train, y_train)
results['Random Guess'] = {
'accuracy': dummy_random.score(X_test, y_test),
'f1': f1_score(y_test, dummy_random.predict(X_test), average='weighted')
}
# Baseline 2: Most frequent
dummy_frequent = DummyClassifier(strategy='most_frequent')
dummy_frequent.fit(X_train, y_train)
results['Always Majority'] = {
'accuracy': dummy_frequent.score(X_test, y_test),
'f1': f1_score(y_test, dummy_frequent.predict(X_test), average='weighted')
}
# Baseline 3: Stratified random
dummy_stratified = DummyClassifier(strategy='stratified')
dummy_stratified.fit(X_train, y_train)
results['Stratified Random'] = {
'accuracy': dummy_stratified.score(X_test, y_test),
'f1': f1_score(y_test, dummy_stratified.predict(X_test), average='weighted')
}
# Your model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
results['Your Model'] = {
'accuracy': accuracy_score(y_test, y_pred),
'f1': f1_score(y_test, y_pred, average='weighted')
}
# Print comparison
print("="*60)
print("BASELINE COMPARISON")
print("="*60)
print(f"{'Method':<20} {'Accuracy':>12} {'F1':>12} {'Status':>15}")
print("-"*60)
model_acc = results['Your Model']['accuracy']
for name, metrics in results.items():
status = ""
if name != 'Your Model':
if model_acc > metrics['accuracy'] + 0.02:
status = "✓ Beating"
elif model_acc < metrics['accuracy']:
status = "✗ LOSING!"
else:
status = "~ Tied"
print(f"{name:<20} {metrics['accuracy']:>12.1%} {metrics['f1']:>12.3f} {status:>15}")
return results
# Usage
results = baseline_comparison(your_model, X_train, X_test, y_train, y_test)
Output:
============================================================
BASELINE COMPARISON
============================================================
Method Accuracy F1 Status
------------------------------------------------------------
Random Guess 50.2% 0.487 ✓ Beating
Always Majority 90.0% 0.473 ✗ LOSING!
Stratified Random 82.1% 0.451 ✓ Beating
Your Model 89.2% 0.712
Wait — the model LOSES to "Always Majority" on accuracy!
This is common with imbalanced data. Check F1 instead — model wins there.
The Minimum Bar
YOUR MODEL MUST:
✓ Beat random guessing (obviously)
✓ Beat "always predict majority" (harder than you think!)
✓ Beat simple rules (if they exist)
✓ Beat the current solution (if replacing something)
If you can't beat these, your model has NEGATIVE value.
A simple rule that's 89% accurate is better than a complex
model that's 88% accurate but costs 10x more to run.
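To make that last point concrete, here's a rough back-of-the-envelope sketch. Every dollar figure below is an assumption for illustration; plug in your own error and serving costs:
def total_cost(accuracy, cost_per_error=5.00, serving_cost_per_pred=0.001, n=100_000):
    """Rough total cost of errors plus serving over n predictions."""
    n_errors = (1 - accuracy) * n
    return n_errors * cost_per_error + serving_cost_per_pred * n

rule_cost  = total_cost(0.89, serving_cost_per_pred=0.0001)  # simple rule, cheap to run
model_cost = total_cost(0.88, serving_cost_per_pred=0.0010)  # complex model, 10x to run

print(f"Simple rule:   ${rule_cost:,.0f}")   # lower error cost AND lower serving cost
print(f"Complex model: ${model_cost:,.0f}")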
Question 2: Does It Meet Business Requirements?
"Good accuracy" is meaningless without business context.
Translating Business to Metrics
# BUSINESS REQUIREMENT → TECHNICAL METRIC
# "We can't miss more than 5% of fraud"
requirement_recall = 0.95 # Recall ≥ 95%
# "Only 10% of flagged transactions should be false alarms"
requirement_precision = 0.90 # Precision ≥ 90%
# "Average prediction error under $50"
requirement_mae = 50 # MAE ≤ $50
# "Must process 1000 requests per second"
requirement_latency = 1 # Latency ≤ 1ms per request
# "Must work on mobile devices"
requirement_size = 50 # Model size ≤ 50MB
The Requirements Check
def check_business_requirements(y_true, y_pred, y_proba=None):
"""Check if model meets business requirements."""
from sklearn.metrics import precision_score, recall_score, f1_score
# Define YOUR business requirements here
requirements = {
'recall': {'min': 0.95, 'actual': None, 'met': False},
'precision': {'min': 0.80, 'actual': None, 'met': False},
'f1': {'min': 0.85, 'actual': None, 'met': False},
}
# Calculate actuals
requirements['recall']['actual'] = recall_score(y_true, y_pred)
requirements['precision']['actual'] = precision_score(y_true, y_pred)
requirements['f1']['actual'] = f1_score(y_true, y_pred)
# Check if met
for metric, vals in requirements.items():
vals['met'] = vals['actual'] >= vals['min']
# Print report
print("="*60)
print("BUSINESS REQUIREMENTS CHECK")
print("="*60)
print(f"{'Metric':<15} {'Required':>12} {'Actual':>12} {'Status':>15}")
print("-"*60)
all_met = True
for metric, vals in requirements.items():
status = "✓ PASS" if vals['met'] else "✗ FAIL"
if not vals['met']:
all_met = False
print(f"{metric:<15} {vals['min']:>12.1%} {vals['actual']:>12.1%} {status:>15}")
print("-"*60)
if all_met:
print("✓ ALL REQUIREMENTS MET — Ready for deployment consideration")
else:
print("✗ REQUIREMENTS NOT MET — Do not deploy")
return all_met, requirements
# Usage
ready, reqs = check_business_requirements(y_test, y_pred)
Output:
============================================================
BUSINESS REQUIREMENTS CHECK
============================================================
Metric Required Actual Status
------------------------------------------------------------
recall 95.0% 92.3% ✗ FAIL
precision 80.0% 87.5% ✓ PASS
f1 85.0% 89.7% ✓ PASS
------------------------------------------------------------
✗ REQUIREMENTS NOT MET — Do not deploy
Model has great F1 but misses the recall requirement. NOT deployable.
Common Business Translations
| Business Says | Technical Metric | Typical Target |
|---|---|---|
| "Don't miss fraud" | Recall | ≥ 95% |
| "Don't annoy customers with false alarms" | Precision | ≥ 80% |
| "Predict prices accurately" | MAE or MAPE | ≤ $X or ≤ Y% |
| "Rank good products higher" | AUC-ROC, NDCG | ≥ 0.90 |
| "Respond quickly" | Latency (p99) | ≤ 100ms |
| "Run on phones" | Model size | ≤ 50MB |
| "Beat the current system" | Lift over baseline | ≥ 10% improvement |
Question 3: Does It Work for All Segments?
A model that's 95% accurate overall but 20% accurate for a minority group is a disaster waiting to happen.
The Segment Analysis
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score
def segment_analysis(y_true, y_pred, segments, segment_name="Segment"):
"""Analyze model performance across different segments."""
results = []
for segment in segments.unique():
mask = segments == segment
segment_size = mask.sum()
if segment_size < 10: # Skip tiny segments
continue
segment_acc = accuracy_score(y_true[mask], y_pred[mask])
segment_f1 = f1_score(y_true[mask], y_pred[mask], average='weighted', zero_division=0)
segment_recall = recall_score(y_true[mask], y_pred[mask], average='weighted', zero_division=0)
results.append({
'segment': segment,
'n_samples': segment_size,
'accuracy': segment_acc,
'f1': segment_f1,
'recall': segment_recall
})
df = pd.DataFrame(results).sort_values('accuracy', ascending=False)
# Print report
print("="*70)
print(f"SEGMENT ANALYSIS BY {segment_name.upper()}")
print("="*70)
print(f"{'Segment':<20} {'N':>8} {'Accuracy':>10} {'F1':>10} {'Recall':>10}")
print("-"*70)
for _, row in df.iterrows():
print(f"{str(row['segment']):<20} {row['n_samples']:>8} {row['accuracy']:>10.1%} {row['f1']:>10.3f} {row['recall']:>10.3f}")
# Check for disparities
print("-"*70)
acc_range = df['accuracy'].max() - df['accuracy'].min()
if acc_range > 0.10:
print(f"⚠️ WARNING: {acc_range:.1%} accuracy gap between best and worst segments!")
worst = df.iloc[-1]
print(f" Worst segment: {worst['segment']} ({worst['accuracy']:.1%} accuracy)")
else:
print(f"✓ Performance is consistent across segments (gap: {acc_range:.1%})")
return df
# Example usage
segment_results = segment_analysis(
y_test, y_pred,
test_df['age_group'], # Check across age groups
segment_name="Age Group"
)
segment_results = segment_analysis(
y_test, y_pred,
test_df['region'], # Check across regions
segment_name="Region"
)
Output:
======================================================================
SEGMENT ANALYSIS BY AGE GROUP
======================================================================
Segment N Accuracy F1 Recall
----------------------------------------------------------------------
35-50 1245 94.2% 0.891 0.887
25-35 987 93.1% 0.878 0.869
50-65 654 91.8% 0.856 0.845
18-25 432 89.5% 0.823 0.812
65+ 234 71.2% 0.654 0.623
----------------------------------------------------------------------
⚠️ WARNING: 23.0% accuracy gap between best and worst segments!
Worst segment: 65+ (71.2% accuracy)
The model is 23 percentage points WORSE for seniors. This could be:
- Discrimination (legal liability!)
- Missing features for that group
- Insufficient training data
- Different behavior patterns
Do NOT deploy until investigated and fixed.
Segments to Check
ALWAYS check performance across:
□ Age groups
□ Gender
□ Geographic regions
□ Income levels
□ Device types (mobile vs desktop)
□ Time periods (weekday vs weekend)
□ Customer tenure (new vs loyal)
□ Any protected class (race, religion, etc.)
If ANY segment has significantly worse performance,
investigate before deployment.
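Rather than calling segment_analysis once per column by hand, you can sweep every candidate column and compare the worst gaps side by side. A small sketch that reuses the segment_analysis function above; the column names are placeholders for whatever actually exists in your test DataFrame:
def sweep_segments(y_true, y_pred, test_df, columns):
    """Run segment_analysis for each column and summarize the accuracy gaps."""
    gaps = {}
    for col in columns:
        df = segment_analysis(y_true, y_pred, test_df[col], segment_name=col)
        gaps[col] = df['accuracy'].max() - df['accuracy'].min()
    print("\nWorst accuracy gap per segmentation:")
    for col, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
        flag = "⚠️" if gap > 0.10 else "✓"
        print(f"  {flag} {col:<15} {gap:.1%}")
    return gaps

# gaps = sweep_segments(y_test, y_pred, test_df,
#                       columns=['age_group', 'gender', 'region', 'device_type'])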
Question 4: Are the Failure Modes Acceptable?
Not all errors are equal. A 5% error rate is fine if errors are minor. It's catastrophic if errors cause deaths.
Error Analysis
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
def error_analysis(y_true, y_pred, X_test=None, feature_names=None):
"""Deep dive into model errors."""
# Basic error stats
errors = y_true != y_pred
n_errors = errors.sum()
error_rate = n_errors / len(y_true)
print("="*60)
print("ERROR ANALYSIS")
print("="*60)
print(f"Total predictions: {len(y_true)}")
print(f"Total errors: {n_errors} ({error_rate:.1%})")
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:")
print(cm)
# Error types
if len(np.unique(y_true)) == 2: # Binary
tn, fp, fn, tp = cm.ravel()
print(f"\nError Breakdown:")
print(f" False Positives (Type I): {fp:>5} ({fp/len(y_true):.2%})")
print(f" False Negatives (Type II): {fn:>5} ({fn/len(y_true):.2%})")
# Which is worse for your business?
print(f"\n ⚠️ Consider: Which error type is MORE costly for your business?")
print(f" - FP ({fp}): Said YES when NO — e.g., flagged legitimate user as fraud")
print(f" - FN ({fn}): Said NO when YES — e.g., missed actual fraud")
# Sample some errors for inspection
if X_test is not None:
print(f"\nSample of errors to inspect:")
error_indices = np.where(errors)[0][:5] # First 5 errors
for idx in error_indices:
print(f"\n Error #{idx}:")
print(f" True: {y_true.iloc[idx] if hasattr(y_true, 'iloc') else y_true[idx]}")
print(f" Pred: {y_pred[idx]}")
if feature_names and len(feature_names) <= 5:
for i, feat in enumerate(feature_names):
print(f" {feat}: {X_test[idx, i]}")
# Usage
error_analysis(y_test, y_pred, X_test, feature_names=['age', 'amount', 'freq'])
The Cost Matrix
def calculate_error_cost(y_true, y_pred, cost_fp=10, cost_fn=100):
"""Calculate business cost of errors."""
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
total_cost = (fp * cost_fp) + (fn * cost_fn)
cost_per_prediction = total_cost / len(y_true)
print("="*60)
print("ERROR COST ANALYSIS")
print("="*60)
print(f"Cost assumptions:")
print(f" False Positive cost: ${cost_fp}")
print(f" False Negative cost: ${cost_fn}")
print(f"\nError counts:")
print(f" False Positives: {fp}")
print(f" False Negatives: {fn}")
print(f"\nTotal cost:")
print(f" FP cost: {fp} × ${cost_fp} = ${fp * cost_fp:,}")
print(f" FN cost: {fn} × ${cost_fn} = ${fn * cost_fn:,}")
print(f" ─────────────────────────")
print(f" TOTAL: ${total_cost:,}")
print(f" Per prediction: ${cost_per_prediction:.2f}")
# Compare to baseline
baseline_cost = len(y_true[y_true == 1]) * cost_fn # Missing all positives
savings = baseline_cost - total_cost
print(f"\nCompared to doing nothing:")
print(f" Baseline cost (miss all fraud): ${baseline_cost:,}")
print(f" Model cost: ${total_cost:,}")
print(f" Savings: ${savings:,} ({savings/baseline_cost:.1%} reduction)")
return total_cost, savings
# Example: Fraud detection
# FP = inconvenience customer (costs $10 in support)
# FN = miss fraud (costs $100 on average)
cost, savings = calculate_error_cost(y_test, y_pred, cost_fp=10, cost_fn=100)
Output:
============================================================
ERROR COST ANALYSIS
============================================================
Cost assumptions:
False Positive cost: $10
False Negative cost: $100
Error counts:
False Positives: 45
False Negatives: 12
Total cost:
FP cost: 45 × $10 = $450
FN cost: 12 × $100 = $1,200
─────────────────────────
TOTAL: $1,650
Per prediction: $1.65
Compared to doing nothing:
Baseline cost (miss all fraud): $15,000
Model cost: $1,650
Savings: $13,350 (89.0% reduction)
The model saves $13,350 compared to no model. That's your ROI case!
Question 5: Is Performance Stable?
A model that's 95% accurate... ±10% depending on the day is NOT production-ready.
Stability Check
import numpy as np
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
def stability_check(model, X, y, n_repeats=10):
"""Check if model performance is stable across different splits."""
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print("="*60)
print("STABILITY CHECK")
print("="*60)
print(f"Ran {n_repeats} × 5-fold CV = {len(scores)} evaluations")
print(f"\nResults:")
print(f" Mean: {scores.mean():.4f}")
print(f" Std: {scores.std():.4f}")
print(f" Min: {scores.min():.4f}")
print(f" Max: {scores.max():.4f}")
print(f" Range: {scores.max() - scores.min():.4f}")
# Coefficient of variation
cv_pct = (scores.std() / scores.mean()) * 100
print(f" CV%: {cv_pct:.1f}%")
# Assessment
print(f"\nStability Assessment:")
if scores.std() < 0.02:
print(" ✓ EXCELLENT: Very stable (std < 2%)")
elif scores.std() < 0.05:
print(" ✓ GOOD: Reasonably stable (std < 5%)")
elif scores.std() < 0.10:
print(" ⚠️ WARNING: Moderate variance (std 5-10%)")
print(" Performance may vary significantly in production")
else:
print(" ✗ UNSTABLE: High variance (std > 10%)")
print(" Model is unreliable — do not deploy")
# Approximate 95% range of fold scores (mean ± 1.96·std; not a formal CI, since CV folds overlap)
ci_low = scores.mean() - 1.96 * scores.std()
ci_high = scores.mean() + 1.96 * scores.std()
print(f"\nApprox. 95% score range: [{ci_low:.4f}, {ci_high:.4f}]")
print(f" → In production, expect accuracy between {ci_low:.1%} and {ci_high:.1%}")
return scores
# Usage
scores = stability_check(model, X, y, n_repeats=10)
Output:
============================================================
STABILITY CHECK
============================================================
Ran 10 × 5-fold CV = 50 evaluations
Results:
Mean: 0.9234
Std: 0.0187
Min: 0.8821
Max: 0.9612
Range: 0.0791
CV%: 2.0%
Stability Assessment:
✓ GOOD: Reasonably stable (std < 5%)
Approx. 95% score range: [0.8868, 0.9600]
→ In production, expect accuracy between 88.7% and 96.0%
Question 6: Will It Work in Production?
The test set is a simulation. Production is reality.
Production Readiness Checklist
def production_readiness_checklist():
"""Comprehensive production readiness checklist."""
checklist = {
"Data": [
"□ Training data distribution matches expected production data",
"□ Handles missing values gracefully",
"□ Handles unexpected categories (new user types, etc.)",
"□ Handles outliers without crashing",
"□ Works with the actual data pipeline (not just clean CSVs)",
],
"Performance": [
"□ Beats baseline on all key metrics",
"□ Meets business requirements",
"□ Consistent across all segments",
"□ Stable across different data splits",
"□ Error cost is acceptable",
],
"Technical": [
"□ Inference latency meets requirements",
"□ Model size fits deployment constraints",
"□ Memory usage is acceptable",
"□ Can handle expected throughput",
"□ Graceful degradation under load",
],
"Operational": [
"□ Monitoring is in place",
"□ Alerting thresholds defined",
"□ Rollback plan exists",
"□ A/B test designed (if applicable)",
"□ Data drift detection ready",
],
"Governance": [
"□ Bias/fairness audit completed",
"□ Model documentation complete",
"□ Stakeholder sign-off obtained",
"□ Legal/compliance review (if needed)",
"□ Model versioning in place",
]
}
print("="*70)
print("PRODUCTION READINESS CHECKLIST")
print("="*70)
for category, items in checklist.items():
print(f"\n{category.upper()}:")
for item in items:
print(f" {item}")
print("\n" + "="*70)
print("All boxes must be checked before deployment!")
print("="*70)
production_readiness_checklist()
The Final Deployment Test
def final_deployment_decision(
model_metrics,
baseline_metrics,
business_requirements,
segment_gaps,
stability_std,
error_cost_acceptable,
technical_ready
):
"""Make the final deployment decision."""
print("="*70)
print("FINAL DEPLOYMENT DECISION")
print("="*70)
checks = []
# Check 1: Beats baseline
beats_baseline = model_metrics['f1'] > baseline_metrics['f1'] + 0.02
checks.append(('Beats Baseline', beats_baseline,
f"Model F1 ({model_metrics['f1']:.3f}) vs Baseline ({baseline_metrics['f1']:.3f})"))
# Check 2: Meets requirements
meets_reqs = all(model_metrics[k] >= v for k, v in business_requirements.items())
checks.append(('Meets Requirements', meets_reqs,
f"All business thresholds satisfied"))
# Check 3: Segment consistency
segment_ok = segment_gaps < 0.15
checks.append(('Segment Consistency', segment_ok,
f"Max segment gap: {segment_gaps:.1%}"))
# Check 4: Stability
stable = stability_std < 0.05
checks.append(('Stable Performance', stable,
f"CV std: {stability_std:.3f}"))
# Check 5: Error cost
checks.append(('Acceptable Error Cost', error_cost_acceptable,
"Error costs within budget"))
# Check 6: Technical readiness
checks.append(('Technical Ready', technical_ready,
"Latency, memory, throughput OK"))
# Print results
print(f"\n{'Check':<25} {'Status':<10} {'Details':<35}")
print("-"*70)
all_passed = True
for name, passed, details in checks:
status = "✓ PASS" if passed else "✗ FAIL"
if not passed:
all_passed = False
print(f"{name:<25} {status:<10} {details:<35}")
print("-"*70)
if all_passed:
print("\n✅ ALL CHECKS PASSED — Model is READY for deployment!")
print("\nRecommendation: Proceed with staged rollout (5% → 25% → 100%)")
else:
print("\n❌ SOME CHECKS FAILED — Model is NOT ready for deployment")
print("\nRequired actions:")
for name, passed, _ in checks:
if not passed:
print(f" • Address: {name}")
return all_passed
# Example usage
ready = final_deployment_decision(
model_metrics={'f1': 0.89, 'recall': 0.94, 'precision': 0.85},
baseline_metrics={'f1': 0.45},
business_requirements={'recall': 0.90, 'precision': 0.80},
segment_gaps=0.12,
stability_std=0.03,
error_cost_acceptable=True,
technical_ready=True
)
Output:
======================================================================
FINAL DEPLOYMENT DECISION
======================================================================
Check Status Details
----------------------------------------------------------------------
Beats Baseline ✓ PASS Model F1 (0.890) vs Baseline (0.450)
Meets Requirements ✓ PASS All business thresholds satisfied
Segment Consistency ✓ PASS Max segment gap: 12.0%
Stable Performance ✓ PASS CV std: 0.030
Acceptable Error Cost ✓ PASS Error costs within budget
Technical Ready ✓ PASS Latency, memory, throughput OK
----------------------------------------------------------------------
✅ ALL CHECKS PASSED — Model is READY for deployment!
Recommendation: Proceed with staged rollout (5% → 25% → 100%)
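One way to make the "staged rollout" recommendation operational is a simple promotion gate: widen traffic only while the live metric stays within an agreed tolerance of the offline estimate, and roll back otherwise. A minimal sketch; the stage fractions, tolerance, and metric choice are assumptions, not part of the checklist above:
STAGES = [0.05, 0.25, 1.00]   # fraction of traffic at each stage (assumed)

def rollout_decision(stage_idx, live_metric, offline_metric, tolerance=0.03):
    """Decide whether to promote, stay fully rolled out, or roll back."""
    if live_metric < offline_metric - tolerance:
        return "ROLL BACK", None                    # live performance fell too far
    if stage_idx + 1 < len(STAGES):
        return "PROMOTE", STAGES[stage_idx + 1]     # widen to the next traffic slice
    return "FULLY ROLLED OUT", 1.00

# Example: offline F1 was 0.89, live F1 at the 5% stage is 0.87
# rollout_decision(0, live_metric=0.87, offline_metric=0.89)  → ("PROMOTE", 0.25)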
The Deployment Readiness Flowchart
START
│
▼
┌─────────────────────────┐
│ Does it beat baseline? │
└─────────────────────────┘
│
┌────────┴────────┐
NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────────────────┐
│ STOP! │ │ Does it meet business │
│ Why bother? │ │ requirements? │
└─────────────┘ └─────────────────────────┘
│
┌────────┴────────┐
NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────────────────┐
│ Tune more │ │ Works for all segments? │
│ or adjust │ └─────────────────────────┘
│requirements │ │
└─────────────┘ ┌────────┴────────┐
NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────────────────┐
│ Investigate │ │ Is performance stable? │
│ & fix │ └─────────────────────────┘
└─────────────┘ │
┌────────┴────────┐
NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────────────────┐
│ Need more │ │ Are errors acceptable? │
│ data or │ └─────────────────────────┘
│ simpler │ │
│ model │ ┌────────┴────────┐
└─────────────┘ NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────────┐
│ Adjust │ │ Technical ready?│
│ threshold │ └─────────────────┘
│ or accept │ │
└─────────────┘ ┌────────┴────────┐
NO YES
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Optimize │ │ DEPLOY! │
│ or accept │ │ (staged) │
│ limitations │ └─────────────┘
└─────────────┘
Common "Good Enough" Mistakes
Mistake 1: "High Accuracy = Good Model"
# ❌ WRONG
"Model has 97% accuracy! Ship it!"
# ✅ RIGHT
# What's the baseline? (Maybe 95% from "always predict no")
# What's the recall on the rare class? (Maybe 10%)
# What segments struggle? (Maybe 50% accuracy for elderly)
# What happens when it's wrong? (Maybe catastrophic)
Mistake 2: "Better Than Current = Good Enough"
# ❌ WRONG
"New model is 2% better than production. Deploy!"
# ✅ RIGHT
# Is 2% statistically significant? (Need confidence intervals)
# Is 2% PRACTICALLY significant? (Worth the deployment risk?)
# Is the new model more complex? (Tech debt?)
# What are the operational costs? (Latency, memory, maintenance)
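To actually answer the "is 2% statistically significant?" question, one simple option is a paired bootstrap over the same test set: resample the test rows and see whether the new model's advantage survives. A hedged sketch where y_pred_old and y_pred_new are hypothetical prediction arrays from the two models:
import numpy as np

def paired_bootstrap_diff(y_true, y_pred_old, y_pred_new, n_boot=10_000, seed=42):
    """Bootstrap the accuracy difference (new minus old) on a shared test set."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    correct_old = (np.asarray(y_pred_old) == y_true).astype(float)
    correct_new = (np.asarray(y_pred_new) == y_true).astype(float)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample rows with replacement
        diffs[b] = correct_new[idx].mean() - correct_old[idx].mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    print(f"Accuracy diff: {diffs.mean():+.3f}, 95% CI: [{low:+.3f}, {high:+.3f}]")
    if low > 0:
        print("✓ The improvement is unlikely to be noise")
    else:
        print("✗ The CI includes zero: the gain may be noise")
    return diffs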
Mistake 3: "Test Set Said It's Good"
# ❌ WRONG
"Test accuracy is 93%. We're golden!"
# ✅ RIGHT
# Is test data representative of production?
# Did you check multiple metrics (not just accuracy)?
# Did you run a segment analysis?
# What's the confidence interval?
# What happens with distribution shift?
Mistake 4: "Stakeholders Are Happy"
# ❌ WRONG
"VP loved the demo. Let's launch!"
# ✅ RIGHT
# Was the demo cherry-picked?
# Did you show failure cases?
# Did you explain limitations?
# Did you set appropriate expectations?
# Is there a monitoring plan?
Quick Reference: The Six Checks
| Check | Question | Minimum Bar |
|---|---|---|
| Baseline | Does it beat simple alternatives? | > 5% improvement |
| Business | Does it meet stated requirements? | All thresholds met |
| Segments | Does it work for everyone? | < 15% segment gap |
| Errors | Are failures acceptable? | Cost < value created |
| Stability | Is performance reliable? | Std < 5% |
| Production | Will it work in the real world? | All technical checks pass |
Key Takeaways
"Good metrics" without context is meaningless — Always compare to baseline
Beat the baseline by a meaningful margin — Not just statistically better
Meet business requirements, not arbitrary thresholds — Translate business needs to metrics
Check ALL segments, not just overall — Hidden disparities cause real-world disasters
Understand your errors — Not all mistakes are equal
Demand stability — High variance = unreliable in production
Production ≠ test set — Real world will surprise you
Deploy gradually — 5% → 25% → 100%, with monitoring
The One-Sentence Summary
Chef Marco's dish was 95% delicious, but it put 1% of his diners in the hospital because of an undisclosed allergen; before deploying your model, make sure you've checked not just how often it's right, but what happens when it's wrong, who it fails for, and whether "good enough" by your metrics is actually good enough for the people who will be affected by it.
What's Next?
Now that you know how to decide if your model is ready, you're ready for:
- A/B Testing — Validating in production
- Monitoring ML Models — Catching degradation
- Data Drift Detection — When your model goes stale
- Model Retraining — When and how to update
Follow me for the next article in this series!
Let's Connect!
If this deployment checklist is getting bookmarked, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your deployment horror story? I once deployed a model that worked great... until Christmas shopping season changed user behavior entirely. Lesson learned about temporal validation! 🎄😅
The difference between a model that "should work" and one that actually works in production? Asking the right questions before deployment. 95% accuracy sounds great until you find out what happens to the other 5%.
Share this with someone about to deploy their first model. This checklist might save them from a very public failure.
Happy (careful) deploying! 🚀