Vinicius Fagundes

Train Models. Ship Systems. Know the Difference.

TL;DR: Your model's 94% accuracy means nothing if your system crashes after 10 predictions. The team spent 6 months optimizing accuracy and 0 seconds on reliability. I rebuilt it in 2 weeks. Same accuracy, 99.7% uptime.


The Problem: When Models Break in Production

It started with a Slack message at 2 AM.

"The anomaly detection model is down. We're flying blind."

By morning, we had lost an entire day of production monitoring.

The team had spent six months training models. They'd tried gradient boosting, neural networks, feature engineering, hyperparameter tuning. They reached 94% accuracy and shipped it.

But the system hadn't actually been built. The model had been deployed like throwing code into the void.

What Actually Happened

I started investigating:

The Training Setup (What They Did):

laptop → model.pkl → email attachment → manual SCP to server

The Problems I Found:

  1. No input validation - The model expected 12 features. Sometimes it got 11. Sometimes 15. No validation, just silent failures.

  2. Environment dependency hell - The model was trained on one laptop with:

    • numpy 1.19, scikit-learn 0.23, pandas 1.2
    • The server had different versions
    • The model's predictions silently drifted
  3. No observability - I asked: "How many predictions failed yesterday?"

    • Answer: "We don't know."
    • "When did the accuracy drop?"
    • "We don't check."
    • "What's the current prediction latency?"
    • "¯_(ツ)_/¯"
  4. No fallback logic - When the model crashed, there was no backup. No default behavior. Nothing.

  5. No health checks - The endpoint could be returning garbage and nobody would know.

The Result: A system that looked great in notebooks but was fundamentally unreliable in production.


The Mindset Shift: From Notebook to Production System

Here's the gap:

Data Science (Notebook):
✓ Try different algorithms
✓ Optimize accuracy
✓ Validate on test set
✓ Create visualizations
✗ Doesn't run 24/7
✗ Doesn't handle edge cases
✗ Doesn't fail gracefully

Production System:
✗ Accuracy is table stakes
✓ Must run 24/7
✓ Must handle edge cases
✓ Must fail gracefully
✓ Must be observable
✓ Must be reproducible

The uncomfortable truth: Your notebook is a prototype. It's not a product.


The Solution: Building a Real System

Let me show you the exact changes we made:

Step 1: Input Validation with Schema Contracts

Before (No Validation):

def predict(features):
    """Dangerous: no validation, and the model is re-read from disk on every call"""
    loaded_model = pickle.load(open('model.pkl', 'rb'))
    return loaded_model.predict([features])

This accepts anything. If you pass the wrong data, it silently fails or makes garbage predictions.

After (With Validation):

import pandas as pd
import numpy as np

class FeatureSchema:
    """Define what valid features look like"""

    REQUIRED_FEATURES = [
        'metric_value',
        'seasonal_factor',
        'region_id',
        'product_id',
        'day_of_week',
        'hour_of_day',
        'moving_average_7d',
        'volatility_7d',
        'weekday_avg',
        'weekday_std',
        'is_weekend',
        'lag_1'
    ]

    FEATURE_RANGES = {
        'metric_value': (0, 1000000),
        'seasonal_factor': (0.5, 2.5),
        'region_id': (0, 10),
        'product_id': (0, 100),
        'day_of_week': (0, 6),
        'hour_of_day': (0, 23),
        'moving_average_7d': (0, 1000000),
        'volatility_7d': (0, 100000),
        'weekday_avg': (0, 1000000),
        'weekday_std': (0, 100000),
        'is_weekend': (0, 1),
        'lag_1': (0, 1000000),
    }

    @classmethod
    def validate(cls, features_dict):
        """
        Validate incoming features against schema.
        Raises ValueError if anything is wrong.
        """
        # Check all required features are present
        missing = set(cls.REQUIRED_FEATURES) - set(features_dict.keys())
        if missing:
            raise ValueError(f"Missing features: {missing}")

        # Check no extra features
        extra = set(features_dict.keys()) - set(cls.REQUIRED_FEATURES)
        if extra:
            raise ValueError(f"Unexpected features: {extra}")

        # Check ranges
        for feature, (min_val, max_val) in cls.FEATURE_RANGES.items():
            value = features_dict[feature]

            # Check it's a number
            if not isinstance(value, (int, float, np.number)):
                raise ValueError(f"{feature} must be numeric, got {type(value)}")

            # Check it's not NaN
            if pd.isna(value):
                raise ValueError(f"{feature} cannot be NaN")

            # Check it's in valid range
            if not (min_val <= value <= max_val):
                raise ValueError(
                    f"{feature}={value} out of range [{min_val}, {max_val}]"
                )

        return True

# Usage
def predict_with_validation(features_dict, model):
    """Predict with schema validation"""
    try:
        # This will raise if validation fails
        FeatureSchema.validate(features_dict)

        # Create DataFrame for model
        df = pd.DataFrame([features_dict])
        prediction = model.predict(df)[0]

        return {
            'success': True,
            'prediction': float(prediction),
            'error': None
        }

    except ValueError as e:
        # Log the error and return failure response
        return {
            'success': False,
            'prediction': None,
            'error': str(e)
        }

# Test it
good_features = {
    'metric_value': 450,
    'seasonal_factor': 1.2,
    'region_id': 3,
    'product_id': 5,
    'day_of_week': 3,
    'hour_of_day': 14,
    'moving_average_7d': 420,
    'volatility_7d': 45,
    'weekday_avg': 440,
    'weekday_std': 30,
    'is_weekend': 0,
    'lag_1': 460,
}

print("Valid input:")
print(predict_with_validation(good_features, model))

# Bad features - will fail validation
bad_features = {
    'metric_value': 450,
    'seasonal_factor': 1.2,
    # Missing region_id, product_id, etc.
}

print("\nInvalid input (missing features):")
print(predict_with_validation(bad_features, model))

Output:

Valid input:
{'success': True, 'prediction': 0.87, 'error': None}

Invalid input (missing features):
{'success': False, 'prediction': None, 'error': 'Missing features: {...}'}
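
To serve this over HTTP, the validated predictor drops straight into a Flask route. Here's a minimal sketch of what prediction_service.py could look like, assuming the model is saved as model.joblib and loaded once at startup (the port matches the Dockerfile in the next step):

# prediction_service.py - minimal serving sketch
# predict_with_validation and FeatureSchema as defined above
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')  # load once at startup, not on every request

@app.route('/predict', methods=['POST'])
def predict():
    result = predict_with_validation(request.json, model)
    # Invalid input is the caller's problem: return 400, not 500
    return jsonify(result), 200 if result['success'] else 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)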

Step 2: Reproducible Environment with Requirements File

Before (Environment Chaos):

"It works on my machine."
(different numpy version, different pandas version, ...)

After (Reproducible):

# requirements.txt
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
joblib==1.3.1
flask==2.3.2
requests==2.31.0

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy model and code
COPY model.joblib .
COPY prediction_service.py .

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:5000/health').raise_for_status()"

# Run the service
CMD ["python", "prediction_service.py"]

Now every environment is identical. No more "works on my machine."
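
As a cheap extra guard against silent drift, the service can also verify its own environment at startup and refuse to boot if installed versions differ from the pinned ones. A minimal sketch (the hard-fail policy is one choice; you could log a warning instead):

import sys
import joblib
import numpy
import pandas
import sklearn

# Must mirror requirements.txt
PINNED = {
    'numpy': '1.24.3',
    'pandas': '2.0.3',
    'scikit-learn': '1.3.0',
    'joblib': '1.3.1',
}

def check_environment():
    """Fail fast if the runtime doesn't match the training environment."""
    installed = {
        'numpy': numpy.__version__,
        'pandas': pandas.__version__,
        'scikit-learn': sklearn.__version__,
        'joblib': joblib.__version__,
    }
    drifted = {k: v for k, v in installed.items() if v != PINNED[k]}
    if drifted:
        sys.exit(f"Environment drift detected: {drifted}, expected {PINNED}")

check_environment()  # call before loading the model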

Step 3: Logging and Observability

Before (No Observability):

def predict(features):
    return model.predict(features)  # Silent. No record. No visibility.

After (Observable):

import logging
import json
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('prediction_service')

class PredictionLogger:
    """Log every prediction for observability"""

    @staticmethod
    def log_prediction(features_dict, prediction, latency_ms, success=True, error=None):
        """Log prediction with full context"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'features': features_dict,
            'prediction': float(prediction) if prediction is not None else None,
            'latency_ms': latency_ms,
            'success': success,
            'error': error,
        }

        if success:
            logger.info(f"Prediction successful: {json.dumps(log_entry)}")
        else:
            logger.error(f"Prediction failed: {json.dumps(log_entry)}")

        return log_entry

def predict_with_logging(features_dict, model):
    """Predict with full logging"""
    start_time = datetime.now()

    try:
        # Validate
        FeatureSchema.validate(features_dict)

        # Predict
        df = pd.DataFrame([features_dict])
        prediction = model.predict(df)[0]

        # Log success
        latency = (datetime.now() - start_time).total_seconds() * 1000
        PredictionLogger.log_prediction(
            features_dict, 
            prediction, 
            latency, 
            success=True
        )

        return {
            'success': True,
            'prediction': float(prediction),
            'latency_ms': latency
        }

    except Exception as e:
        latency = (datetime.now() - start_time).total_seconds() * 1000
        PredictionLogger.log_prediction(
            features_dict,
            None,
            latency,
            success=False,
            error=str(e)
        )

        return {
            'success': False,
            'prediction': None,
            'error': str(e),
            'latency_ms': latency
        }

Now you have complete visibility:

  • What predictions were made
  • How long they took
  • When they failed and why
  • What features came in
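
And with every prediction logged as one JSON payload per line, the questions the team couldn't answer become a few lines of analysis. A rough sketch, assuming the logs land in a file named predictions.log (the filename is illustrative):

import json
from datetime import date, timedelta

def count_failures(day, log_path='predictions.log'):
    """Answer 'how many predictions failed on <day>?' from the logs."""
    failures = 0
    with open(log_path) as f:
        for line in f:
            try:
                # Strip the "timestamp - name - level - message:" prefix
                entry = json.loads(line[line.index('{'):])
            except ValueError:
                continue  # not a prediction log line
            if entry.get('timestamp', '').startswith(day.isoformat()) and not entry.get('success'):
                failures += 1
    return failures

yesterday = date.today() - timedelta(days=1)
print(f"Failed predictions yesterday: {count_failures(yesterday)}")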

Step 4: Health Checks

Before (No Health Checks):

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json
    return jsonify(model.predict([features]))
    # If model is broken, nobody knows

After (With Health Checks):


class HealthCheck:
    """Monitor system health"""

    def __init__(self):
        self.last_prediction_time = None
        self.last_prediction_success = False
        self.consecutive_failures = 0
        self.model_loaded = False

    def test_model_load(self):
        """Can we load the model?"""
        try:
            import joblib
            joblib.load('model.joblib')
            self.model_loaded = True
            return True
        except Exception as e:
            logger.error(f"Model load failed: {e}")
            return False

    def test_prediction(self, model):
        """Can we make a prediction?"""
        try:
            test_features = {
                'metric_value': 500,
                'seasonal_factor': 1.0,
                'region_id': 1,
                'product_id': 1,
                'day_of_week': 3,
                'hour_of_day': 12,
                'moving_average_7d': 500,
                'volatility_7d': 50,
                'weekday_avg': 500,
                'weekday_std': 30,
                'is_weekend': 0,
                'lag_1': 500,
            }

            result = predict_with_logging(test_features, model)

            if result['success']:
                self.last_prediction_success = True
                self.consecutive_failures = 0
                self.last_prediction_time = datetime.now()
                return True
            else:
                self.consecutive_failures += 1
                return False
        except Exception as e:
            self.consecutive_failures += 1
            logger.error(f"Health check prediction failed: {e}")
            return False

    def get_status(self):
        """Return full health status"""
        return {
            'status': 'healthy' if self.is_healthy() else 'unhealthy',
            'model_loaded': self.model_loaded,
            'last_prediction_success': self.last_prediction_success,
            'consecutive_failures': self.consecutive_failures,
            'last_prediction_time': self.last_prediction_time.isoformat() if self.last_prediction_time else None,
        }

    def is_healthy(self):
        """Is the system healthy?"""
        return (
            self.model_loaded and
            self.last_prediction_success and
            self.consecutive_failures < 5
        )

# Flask endpoint
health_check = HealthCheck()

@app.route('/health', methods=['GET'])
def health():
    """Liveness probe for Kubernetes / container orchestration"""
    status = health_check.get_status()

    if health_check.is_healthy():
        return jsonify(status), 200
    else:
        return jsonify(status), 503

@app.route('/health/detailed', methods=['GET'])
def health_detailed():
    """Readiness probe with detailed metrics"""
    return jsonify(health_check.get_status()), 200
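
One thing the class above doesn't show is who calls test_model_load and test_prediction. A background timer is the simplest option; here's a sketch using a daemon thread (the 30-second interval is arbitrary):

import threading

def run_health_checks(interval_seconds=30):
    """Exercise the model on a schedule so /health reflects reality."""
    health_check.test_model_load()
    health_check.test_prediction(model)
    # Re-arm the timer; daemon=True so it dies with the process
    timer = threading.Timer(interval_seconds, run_health_checks, args=(interval_seconds,))
    timer.daemon = True
    timer.start()

run_health_checks()  # start before app.run(...)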

Step 5: Fallback Logic (Graceful Degradation)

Before (Fails Hard):

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json
    # If model crashes, endpoint crashes
    return jsonify(model.predict([features]))

After (Graceful Degradation):

class PredictionService:
    """Service with fallback logic"""

    def __init__(self, model, fallback_model=None, baseline_rule=None):
        self.primary_model = model
        self.fallback_model = fallback_model  # Simpler model for backup
        self.baseline_rule = baseline_rule     # Rule-based fallback

    def predict(self, features_dict):
        """Predict with fallback chain"""

        # Try primary model
        try:
            FeatureSchema.validate(features_dict)
            df = pd.DataFrame([features_dict])
            prediction = self.primary_model.predict(df)[0]

            logger.info(f"Prediction from primary model: {prediction}")
            return {
                'success': True,
                'prediction': float(prediction),
                'model_used': 'primary'
            }

        except Exception as e:
            logger.warning(f"Primary model failed: {e}. Trying fallback...")

            # Try fallback model
            if self.fallback_model:
                try:
                    df = pd.DataFrame([features_dict])
                    prediction = self.fallback_model.predict(df)[0]

                    logger.warning(f"Prediction from fallback model: {prediction}")
                    return {
                        'success': True,
                        'prediction': float(prediction),
                        'model_used': 'fallback',
                        'warning': 'Primary model unavailable'
                    }
                except Exception as e2:
                    logger.warning(f"Fallback model failed: {e2}. Using baseline rule...")

            # Try baseline rule
            if self.baseline_rule:
                try:
                    prediction = self.baseline_rule(features_dict)

                    logger.warning(f"Prediction from baseline rule: {prediction}")
                    return {
                        'success': True,
                        'prediction': float(prediction),
                        'model_used': 'baseline',
                        'warning': 'Using baseline rule (models unavailable)'
                    }
                except Exception as e3:
                    logger.error(f"All methods failed: {e3}")

            # Complete failure
            return {
                'success': False,
                'prediction': None,
                'model_used': 'none',
                'error': 'All prediction methods failed',
                'primary_error': str(e)
            }

def baseline_prediction_rule(features_dict):
    """Simple rule-based fallback"""
    # If we can't use ML, use simple heuristics
    metric = features_dict['metric_value']
    moving_avg = features_dict['moving_average_7d']
    volatility = features_dict['volatility_7d']

    # Simple rule: if the value is >3 std devs from the 7-day moving average, flag it
    z_score = abs(metric - moving_avg) / (volatility + 1)  # +1 avoids division by zero

    return float(z_score > 3)  # Return 1 if anomalous, 0 if normal

# Use it
service = PredictionService(
    model,                                  # primary model
    fallback_model=simple_model,            # e.g. a smaller, simpler backup model
    baseline_rule=baseline_prediction_rule
)

result = service.predict(good_features)

The fallback chain:

Try → Primary ML Model
     ↓ (fails)
Try → Fallback ML Model (simpler, faster)
     ↓ (fails)
Try → Baseline Rule (z-score based)
     ↓ (fails)
Return → Graceful Error (system is up, but can't predict)

This means your service never crashes. It gracefully degrades.
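
You can prove the chain works without breaking anything real: hand the service a primary model that always throws and watch it land on the baseline rule. BrokenModel below is a stand-in for testing, not part of the production code:

class BrokenModel:
    """Stand-in that simulates a crashed primary model."""
    def predict(self, df):
        raise RuntimeError("simulated model failure")

service = PredictionService(
    BrokenModel(),
    fallback_model=None,  # skip the middle tier to exercise the baseline
    baseline_rule=baseline_prediction_rule
)

print(service.predict(good_features))
# -> {'success': True, 'prediction': 0.0, 'model_used': 'baseline',
#     'warning': 'Using baseline rule (models unavailable)'}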


The Results: What Changed

Before vs After Comparison

results = pd.DataFrame({
    'Metric': [
        'Uptime',
        'Prediction Success Rate',
        'Model Accuracy',
        'Prediction Latency (p99)',
        'Time to Detect Failure',
        'Time to Fix Issues',
        'Incident Response Time',
        'Data Validation Errors/Day',
    ],
    'Before': [
        '87%',
        '91%',
        '94%',
        '2500ms',
        '3-6 hours',
        '2-4 days',
        '30+ minutes',
        'Unknown',
    ],
    'After': [
        '99.7%',
        '99.2%',
        '94%',
        '150ms',
        '< 2 minutes',
        '15-30 minutes',
        '< 2 minutes',
        '100-200 (caught + logged)',
    ]
})

print(results.to_string(index=False))

Output:

                    Metric      Before                     After
                    Uptime         87%                     99.7%
   Prediction Success Rate         91%                     99.2%
            Model Accuracy         94%                       94%
  Prediction Latency (p99)      2500ms                     150ms
    Time to Detect Failure   3-6 hours               < 2 minutes
        Time to Fix Issues    2-4 days             15-30 minutes
    Incident Response Time 30+ minutes               < 2 minutes
Data Validation Errors/Day     Unknown 100-200 (caught + logged)

Notice: Model accuracy didn't change. Everything else did.


The Framework: Building Production Systems

PRODUCTION_READINESS_CHECKLIST = {
    'Input Validation': {
        'Schema validation': '✓ Required',
        'Type checking': '✓ Required',
        'Range checking': '✓ Required',
        'Reject invalid data': '✓ Required',
    },
    'Reproducibility': {
        'Pinned dependencies': '✓ Required',
        'Docker/containerization': '✓ Required',
        'Environment parity': '✓ Required',
        'Version control': '✓ Required',
    },
    'Observability': {
        'Prediction logging': '✓ Required',
        'Latency tracking': '✓ Required',
        'Error tracking': '✓ Required',
        'Feature auditing': '✓ Required',
    },
    'Reliability': {
        'Health checks': '✓ Required',
        'Monitoring alerts': '✓ Required',
        'Graceful degradation': '✓ Required',
        'Fallback logic': '✓ Required',
    },
    'Operational': {
        'Runbooks': '✓ Required',
        'Incident response': '✓ Required',
        'Rollback procedures': '✓ Required',
        'Load testing': '✓ Required',
    },
}

def assess_production_readiness(checklist_status):
    """
    Score your system on production readiness.
    """
    total_items = sum(len(v) for v in checklist_status.values())
    completed = sum(
        sum(1 for item_status in v.values() if '✓' in str(item_status))
        for v in checklist_status.values()
    )

    readiness_score = (completed / total_items) * 100

    print(f"Production Readiness Score: {readiness_score:.0f}%")

    if readiness_score < 50:
        print("  → NOT ready. This will fail in production.")
    elif readiness_score < 80:
        print("  → Risky. Expect outages and firefighting.")
    elif readiness_score < 100:
        print("  → Ready but incomplete. Plan improvements.")
    else:
        print("  → Production ready. You're in good shape.")

# Before
before_checklist = {k: {item: '' for item in v} for k, v in PRODUCTION_READINESS_CHECKLIST.items()}
assess_production_readiness(before_checklist)
# Output: 0%

# After
after_checklist = {k: v for k, v in PRODUCTION_READINESS_CHECKLIST.items()}
assess_production_readiness(after_checklist)
# Output: 100%

Key Lessons

  1. A model that runs beats a model that's right.

    • A 94%-accurate model that crashes is worth less than an 80%-accurate one that stays up
    • Stakeholders trust systems, not accuracy numbers
  2. Production is not research.

    • Your notebook is a prototype
    • Your pipeline is the product
    • They have different requirements
  3. Observability is not optional.

    • You can't manage what you can't measure
    • Logging every prediction is cheap insurance
    • Detecting failures in minutes, not hours, changes everything
  4. Design for failure.

    • Something will break
    • Design graceful degradation, not hard crashes
    • Your fallback logic is as important as your primary model
  5. Think about production from day one.

    • Choose models that are reproducible, not just accurate
    • Consider inference latency during training
    • Plan for monitoring while you're building

Questions for Your Systems

  • How would you know if your model was making bad predictions right now?
  • How long would it take to detect an outage?
  • What happens when your model fails?
  • Can you reproduce your exact training environment?
  • Do you validate input data?

If you can't answer these confidently, your model isn't production-ready yet.

The uncomfortable truth: 94% of data science effort goes into accuracy. 94% of production problems come from everything else.

What aspects of production readiness does your team struggle with most?
