Vinicius Fagundes

Train Models. Ship Systems. Know the Difference.

TL;DR: Your model's 94% accuracy means nothing if your system crashes after 10 predictions. The team spent 6 months optimizing accuracy and 0 seconds on reliability. I rebuilt it in 2 weeks. Same accuracy, 99.7% uptime.


The Problem: When Models Break in Production

It started with a Slack message at 2 AM.

"The anomaly detection model is down. We're flying blind."

By morning, we had lost an entire day of production monitoring.

The team had spent six months training models. They'd tried gradient boosting, neural networks, feature engineering, hyperparameter tuning. They reached 94% accuracy and shipped it.

But the system hadn't actually been built. The model had been deployed like throwing code into the void.

What Actually Happened

I started investigating:

The Training Setup (What They Did):

laptop → model.pkl → email attachment → manual SCP to server

The Problems I Found:

  1. No input validation - The model expected 12 features. Sometimes it got 11. Sometimes 15. No validation, just silent failures.

  2. Environment dependency hell - The model was trained on one laptop with:

    • numpy 1.19, scikit-learn 0.23, pandas 1.2
    • The server had different versions
    • The model's predictions silently drifted
  3. No observability - I asked: "How many predictions failed yesterday?"

    • Answer: "We don't know."
    • "When did the accuracy drop?"
    • "We don't check."
    • "What's the current prediction latency?"
    • "¯_(ツ)_/¯"
  4. No fallback logic - When the model crashed, there was no backup. No default behavior. Nothing.

  5. No health checks - The endpoint could be returning garbage and nobody would know.

The Result: A system that looked great in notebooks but was fundamentally unreliable in production.


The Mindset Shift: From Notebook to Production System

Here's the gap:

Data Science (Notebook):
✓ Try different algorithms
✓ Optimize accuracy
✓ Validate on test set
✓ Create visualizations
✗ Doesn't run 24/7
✗ Doesn't handle edge cases
✗ Doesn't fail gracefully

Production System:
✗ Accuracy is table stakes
✓ Must run 24/7
✓ Must handle edge cases
✓ Must fail gracefully
✓ Must be observable
✓ Must be reproducible

The uncomfortable truth: Your notebook is a prototype. It's not a product.


The Solution: Building a Real System

Let me show you the exact changes we made:

Step 1: Input Validation with Schema Contracts

Before (No Validation):

def predict(features):
    """Dangerous: no validation, and the model is re-read from disk on every call"""
    loaded_model = pickle.load(open('model.pkl', 'rb'))
    return loaded_model.predict([features])

This accepts anything. If you pass the wrong data, it silently fails or makes garbage predictions.

After (With Validation):

import pandas as pd
import numpy as np

class FeatureSchema:
    """Define what valid features look like"""

    REQUIRED_FEATURES = [
        'metric_value',
        'seasonal_factor',
        'region_id',
        'product_id',
        'day_of_week',
        'hour_of_day',
        'moving_average_7d',
        'volatility_7d',
        'weekday_avg',
        'weekday_std',
        'is_weekend',
        'lag_1'
    ]

    FEATURE_RANGES = {
        'metric_value': (0, 1000000),
        'seasonal_factor': (0.5, 2.5),
        'region_id': (0, 10),
        'product_id': (0, 100),
        'day_of_week': (0, 6),
        'hour_of_day': (0, 23),
        'moving_average_7d': (0, 1000000),
        'volatility_7d': (0, 100000),
        'weekday_avg': (0, 1000000),
        'weekday_std': (0, 100000),
        'is_weekend': (0, 1),
        'lag_1': (0, 1000000),
    }

    @classmethod
    def validate(cls, features_dict):
        """
        Validate incoming features against schema.
        Raises ValueError if anything is wrong.
        """
        # Check all required features are present
        missing = set(cls.REQUIRED_FEATURES) - set(features_dict.keys())
        if missing:
            raise ValueError(f"Missing features: {missing}")

        # Check no extra features
        extra = set(features_dict.keys()) - set(cls.REQUIRED_FEATURES)
        if extra:
            raise ValueError(f"Unexpected features: {extra}")

        # Check ranges
        for feature, (min_val, max_val) in cls.FEATURE_RANGES.items():
            value = features_dict[feature]

            # Check it's a number
            if not isinstance(value, (int, float, np.number)):
                raise ValueError(f"{feature} must be numeric, got {type(value)}")

            # Check it's not NaN
            if pd.isna(value):
                raise ValueError(f"{feature} cannot be NaN")

            # Check it's in valid range
            if not (min_val <= value <= max_val):
                raise ValueError(
                    f"{feature}={value} out of range [{min_val}, {max_val}]"
                )

        return True

# Usage
def predict_with_validation(features_dict, model):
    """Predict with schema validation"""
    try:
        # This will raise if validation fails
        FeatureSchema.validate(features_dict)

        # Create DataFrame for model
        df = pd.DataFrame([features_dict])
        prediction = model.predict(df)[0]

        return {
            'success': True,
            'prediction': float(prediction),
            'error': None
        }

    except ValueError as e:
        # Log the error and return failure response
        return {
            'success': False,
            'prediction': None,
            'error': str(e)
        }

# Test it
good_features = {
    'metric_value': 450,
    'seasonal_factor': 1.2,
    'region_id': 3,
    'product_id': 5,
    'day_of_week': 3,
    'hour_of_day': 14,
    'moving_average_7d': 420,
    'volatility_7d': 45,
    'weekday_avg': 440,
    'weekday_std': 30,
    'is_weekend': 0,
    'lag_1': 460,
}

print("Valid input:")
print(predict_with_validation(good_features, model))

# Bad features - will fail validation
bad_features = {
    'metric_value': 450,
    'seasonal_factor': 1.2,
    # Missing region_id, product_id, etc.
}

print("\nInvalid input (missing features):")
print(predict_with_validation(bad_features, model))

Output:

Valid input:
{'success': True, 'prediction': 0.87, 'error': None}

Invalid input (missing features):
{'success': False, 'prediction': None, 'error': 'Missing features: {...}'}
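
To serve this over HTTP, the validated predictor drops straight into a Flask route. Here's a minimal sketch of what prediction_service.py could look like, assuming the model is saved as model.joblib and loaded once at startup (the port matches the Dockerfile in the next step):

# prediction_service.py - minimal serving sketch
# predict_with_validation and FeatureSchema as defined above
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')  # load once at startup, not on every request

@app.route('/predict', methods=['POST'])
def predict():
    result = predict_with_validation(request.json, model)
    # Invalid input is the caller's problem: return 400, not 500
    return jsonify(result), 200 if result['success'] else 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)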

Step 2: Reproducible Environment with Requirements File

Before (Environment Chaos):

"It works on my machine."
(different numpy version, different pandas version, ...)

After (Reproducible):

# requirements.txt
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
joblib==1.3.1
flask==2.3.2
requests==2.31.0

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy model and code
COPY model.joblib .
COPY prediction_service.py .

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:5000/health').raise_for_status()"

# Run the service
CMD ["python", "prediction_service.py"]

Now every environment is identical. No more "works on my machine."
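
As a cheap extra guard against silent drift, the service can also verify its own environment at startup and refuse to boot if installed versions differ from the pinned ones. A minimal sketch (the hard-fail policy is one choice; you could log a warning instead):

import sys
import joblib
import numpy
import pandas
import sklearn

# Must mirror requirements.txt
PINNED = {
    'numpy': '1.24.3',
    'pandas': '2.0.3',
    'scikit-learn': '1.3.0',
    'joblib': '1.3.1',
}

def check_environment():
    """Fail fast if the runtime doesn't match the training environment."""
    installed = {
        'numpy': numpy.__version__,
        'pandas': pandas.__version__,
        'scikit-learn': sklearn.__version__,
        'joblib': joblib.__version__,
    }
    drifted = {k: v for k, v in installed.items() if v != PINNED[k]}
    if drifted:
        sys.exit(f"Environment drift detected: {drifted}, expected {PINNED}")

check_environment()  # call before loading the model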

Step 3: Logging and Observability

Before (No Observability):

def predict(features):
    return model.predict(features)  # Silent. No record. No visibility.

After (Observable):

import logging
import json
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('prediction_service')

class PredictionLogger:
    """Log every prediction for observability"""

    @staticmethod
    def log_prediction(features_dict, prediction, latency_ms, success=True, error=None):
        """Log prediction with full context"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'features': features_dict,
            'prediction': float(prediction) if prediction is not None else None,
            'latency_ms': latency_ms,
            'success': success,
            'error': error,
        }

        if success:
            logger.info(f"Prediction successful: {json.dumps(log_entry)}")
        else:
            logger.error(f"Prediction failed: {json.dumps(log_entry)}")

        return log_entry

def predict_with_logging(features_dict, model):
    """Predict with full logging"""
    start_time = datetime.now()

    try:
        # Validate
        FeatureSchema.validate(features_dict)

        # Predict
        df = pd.DataFrame([features_dict])
        prediction = model.predict(df)[0]

        # Log success
        latency = (datetime.now() - start_time).total_seconds() * 1000
        PredictionLogger.log_prediction(
            features_dict, 
            prediction, 
            latency, 
            success=True
        )

        return {
            'success': True,
            'prediction': float(prediction),
            'latency_ms': latency
        }

    except Exception as e:
        latency = (datetime.now() - start_time).total_seconds() * 1000
        PredictionLogger.log_prediction(
            features_dict,
            None,
            latency,
            success=False,
            error=str(e)
        )

        return {
            'success': False,
            'prediction': None,
            'error': str(e),
            'latency_ms': latency
        }

Now you have complete visibility:

  • What predictions were made
  • How long they took
  • When they failed and why
  • What features came in
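
And with every prediction logged as one JSON payload per line, the questions the team couldn't answer become a few lines of analysis. A rough sketch, assuming the logs land in a file named predictions.log (the filename is illustrative):

import json
from datetime import date, timedelta

def count_failures(day, log_path='predictions.log'):
    """Answer 'how many predictions failed on <day>?' from the logs."""
    failures = 0
    with open(log_path) as f:
        for line in f:
            try:
                # Strip the "timestamp - name - level - message:" prefix
                entry = json.loads(line[line.index('{'):])
            except ValueError:
                continue  # not a prediction log line
            if entry.get('timestamp', '').startswith(day.isoformat()) and not entry.get('success'):
                failures += 1
    return failures

yesterday = date.today() - timedelta(days=1)
print(f"Failed predictions yesterday: {count_failures(yesterday)}")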

Step 4: Health Checks

Before (No Health Checks):

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json
    return jsonify(model.predict([features]))
    # If model is broken, nobody knows

After (With Health Checks):


class HealthCheck:
    """Monitor system health"""

    def __init__(self):
        self.last_prediction_time = None
        self.last_prediction_success = False
        self.consecutive_failures = 0
        self.model_loaded = False

    def test_model_load(self):
        """Can we load the model?"""
        try:
            import joblib
            joblib.load('model.joblib')
            self.model_loaded = True
            return True
        except Exception as e:
            logger.error(f"Model load failed: {e}")
            return False

    def test_prediction(self, model):
        """Can we make a prediction?"""
        try:
            test_features = {
                'metric_value': 500,
                'seasonal_factor': 1.0,
                'region_id': 1,
                'product_id': 1,
                'day_of_week': 3,
                'hour_of_day': 12,
                'moving_average_7d': 500,
                'volatility_7d': 50,
                'weekday_avg': 500,
                'weekday_std': 30,
                'is_weekend': 0,
                'lag_1': 500,
            }

            result = predict_with_logging(test_features, model)

            if result['success']:
                self.last_prediction_success = True
                self.consecutive_failures = 0
                self.last_prediction_time = datetime.now()
                return True
            else:
                self.consecutive_failures += 1
                return False
        except Exception as e:
            self.consecutive_failures += 1
            logger.error(f"Health check prediction failed: {e}")
            return False

    def get_status(self):
        """Return full health status"""
        return {
            'status': 'healthy' if self.is_healthy() else 'unhealthy',
            'model_loaded': self.model_loaded,
            'last_prediction_success': self.last_prediction_success,
            'consecutive_failures': self.consecutive_failures,
            'last_prediction_time': self.last_prediction_time.isoformat() if self.last_prediction_time else None,
        }

    def is_healthy(self):
        """Is the system healthy?"""
        return (
            self.model_loaded and
            self.last_prediction_success and
            self.consecutive_failures < 5
        )

# Flask endpoint
health_check = HealthCheck()

@app.route('/health', methods=['GET'])
def health():
    """Liveness probe for Kubernetes / container orchestration"""
    status = health_check.get_status()

    if health_check.is_healthy():
        return jsonify(status), 200
    else:
        return jsonify(status), 503

@app.route('/health/detailed', methods=['GET'])
def health_detailed():
    """Readiness probe with detailed metrics"""
    return jsonify(health_check.get_status()), 200
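
One thing the class above doesn't show is who calls test_model_load and test_prediction. A background timer is the simplest option; here's a sketch using a daemon thread (the 30-second interval is arbitrary):

import threading

def run_health_checks(interval_seconds=30):
    """Exercise the model on a schedule so /health reflects reality."""
    health_check.test_model_load()
    health_check.test_prediction(model)
    # Re-arm the timer; daemon=True so it dies with the process
    timer = threading.Timer(interval_seconds, run_health_checks, args=(interval_seconds,))
    timer.daemon = True
    timer.start()

run_health_checks()  # start before app.run(...)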

Step 5: Fallback Logic (Graceful Degradation)

Before (Fails Hard):

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json
    # If model crashes, endpoint crashes
    return jsonify(model.predict([features]))

After (Graceful Degradation):

class PredictionService:
    """Service with fallback logic"""

    def __init__(self, model, fallback_model=None, baseline_rule=None):
        self.primary_model = model
        self.fallback_model = fallback_model  # Simpler model for backup
        self.baseline_rule = baseline_rule     # Rule-based fallback

    def predict(self, features_dict):
        """Predict with fallback chain"""

        # Try primary model
        try:
            FeatureSchema.validate(features_dict)
            df = pd.DataFrame([features_dict])
            prediction = self.primary_model.predict(df)[0]

            logger.info(f"Prediction from primary model: {prediction}")
            return {
                'success': True,
                'prediction': float(prediction),
                'model_used': 'primary'
            }

        except Exception as e:
            logger.warning(f"Primary model failed: {e}. Trying fallback...")

            # Try fallback model
            if self.fallback_model:
                try:
                    df = pd.DataFrame([features_dict])
                    prediction = self.fallback_model.predict(df)[0]

                    logger.warning(f"Prediction from fallback model: {prediction}")
                    return {
                        'success': True,
                        'prediction': float(prediction),
                        'model_used': 'fallback',
                        'warning': 'Primary model unavailable'
                    }
                except Exception as e2:
                    logger.warning(f"Fallback model failed: {e2}. Using baseline rule...")

            # Try baseline rule
            if self.baseline_rule:
                try:
                    prediction = self.baseline_rule(features_dict)

                    logger.warning(f"Prediction from baseline rule: {prediction}")
                    return {
                        'success': True,
                        'prediction': float(prediction),
                        'model_used': 'baseline',
                        'warning': 'Using baseline rule (models unavailable)'
                    }
                except Exception as e3:
                    logger.error(f"All methods failed: {e3}")

            # Complete failure
            return {
                'success': False,
                'prediction': None,
                'model_used': 'none',
                'error': 'All prediction methods failed',
                'primary_error': str(e)
            }

def baseline_prediction_rule(features_dict):
    """Simple rule-based fallback"""
    # If we can't use ML, use simple heuristics
    metric = features_dict['metric_value']
    moving_avg = features_dict['moving_average_7d']
    volatility = features_dict['volatility_7d']

    # Simple rule: if the value is >3 std devs from the 7-day moving average, flag it
    z_score = abs(metric - moving_avg) / (volatility + 1)  # +1 avoids division by zero

    return float(z_score > 3)  # Return 1 if anomalous, 0 if normal

# Use it
service = PredictionService(
    model,                                  # primary model
    fallback_model=simple_model,            # e.g. a smaller, simpler backup model
    baseline_rule=baseline_prediction_rule
)

result = service.predict(good_features)

The fallback chain:

Try → Primary ML Model
     ↓ (fails)
Try → Fallback ML Model (simpler, faster)
     ↓ (fails)
Try → Baseline Rule (z-score based)
     ↓ (fails)
Return → Graceful Error (system is up, but can't predict)

This means your service never crashes. It gracefully degrades.
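
You can prove the chain works without breaking anything real: hand the service a primary model that always throws and watch it land on the baseline rule. BrokenModel below is a stand-in for testing, not part of the production code:

class BrokenModel:
    """Stand-in that simulates a crashed primary model."""
    def predict(self, df):
        raise RuntimeError("simulated model failure")

service = PredictionService(
    BrokenModel(),
    fallback_model=None,  # skip the middle tier to exercise the baseline
    baseline_rule=baseline_prediction_rule
)

print(service.predict(good_features))
# -> {'success': True, 'prediction': 0.0, 'model_used': 'baseline',
#     'warning': 'Using baseline rule (models unavailable)'}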


The Results: What Changed

Before vs After Comparison

results = pd.DataFrame({
    'Metric': [
        'Uptime',
        'Prediction Success Rate',
        'Model Accuracy',
        'Prediction Latency (p99)',
        'Time to Detect Failure',
        'Time to Fix Issues',
        'Incident Response Time',
        'Data Validation Errors/Day',
    ],
    'Before': [
        '87%',
        '91%',
        '94%',
        '2500ms',
        '3-6 hours',
        '2-4 days',
        '30+ minutes',
        'Unknown',
    ],
    'After': [
        '99.7%',
        '99.2%',
        '94%',
        '150ms',
        '< 2 minutes',
        '15-30 minutes',
        '< 2 minutes',
        '100-200 (caught + logged)',
    ]
})

print(results.to_string(index=False))

Output:

                    Metric      Before                     After
                    Uptime         87%                     99.7%
   Prediction Success Rate         91%                     99.2%
            Model Accuracy         94%                       94%
  Prediction Latency (p99)      2500ms                     150ms
    Time to Detect Failure   3-6 hours               < 2 minutes
        Time to Fix Issues    2-4 days             15-30 minutes
    Incident Response Time 30+ minutes               < 2 minutes
Data Validation Errors/Day     Unknown 100-200 (caught + logged)

Notice: Model accuracy didn't change. Everything else did.


The Framework: Building Production Systems

PRODUCTION_READINESS_CHECKLIST = {
    'Input Validation': {
        'Schema validation': '✓ Required',
        'Type checking': '✓ Required',
        'Range checking': '✓ Required',
        'Reject invalid data': '✓ Required',
    },
    'Reproducibility': {
        'Pinned dependencies': '✓ Required',
        'Docker/containerization': '✓ Required',
        'Environment parity': '✓ Required',
        'Version control': '✓ Required',
    },
    'Observability': {
        'Prediction logging': '✓ Required',
        'Latency tracking': '✓ Required',
        'Error tracking': '✓ Required',
        'Feature auditing': '✓ Required',
    },
    'Reliability': {
        'Health checks': '✓ Required',
        'Monitoring alerts': '✓ Required',
        'Graceful degradation': '✓ Required',
        'Fallback logic': '✓ Required',
    },
    'Operational': {
        'Runbooks': '✓ Required',
        'Incident response': '✓ Required',
        'Rollback procedures': '✓ Required',
        'Load testing': '✓ Required',
    },
}

def assess_production_readiness(checklist_status):
    """
    Score your system on production readiness.
    """
    total_items = sum(len(v) for v in checklist_status.values())
    completed = sum(
        sum(1 for item_status in v.values() if '✓' in str(item_status))
        for v in checklist_status.values()
    )

    readiness_score = (completed / total_items) * 100

    print(f"Production Readiness Score: {readiness_score:.0f}%")

    if readiness_score < 50:
        print("  → NOT ready. This will fail in production.")
    elif readiness_score < 80:
        print("  → Risky. Expect outages and firefighting.")
    elif readiness_score < 100:
        print("  → Ready but incomplete. Plan improvements.")
    else:
        print("  → Production ready. You're in good shape.")

# Before
before_checklist = {k: {item: '' for item in v} for k, v in PRODUCTION_READINESS_CHECKLIST.items()}
assess_production_readiness(before_checklist)
# Output: 0%

# After
after_checklist = {k: v for k, v in PRODUCTION_READINESS_CHECKLIST.items()}
assess_production_readiness(after_checklist)
# Output: 100%

Key Lessons

  1. A model that runs beats a model that's right.

    • A 94%-accurate model that crashes is worth less than an 80%-accurate one that stays up
    • Stakeholders trust systems, not accuracy numbers
  2. Production is not research.

    • Your notebook is a prototype
    • Your pipeline is the product
    • They have different requirements
  3. Observability is not optional.

    • You can't manage what you can't measure
    • Logging every prediction is cheap insurance
    • Detecting failures in minutes, not hours, changes everything
  4. Design for failure.

    • Something will break
    • Design graceful degradation, not hard crashes
    • Your fallback logic is as important as your primary model
  5. Think about production from day one.

    • Choose models that are reproducible, not just accurate
    • Consider inference latency during training
    • Plan for monitoring while you're building

Questions for Your Systems

  • How would you know if your model was making bad predictions right now?
  • How long would it take to detect an outage?
  • What happens when your model fails?
  • Can you reproduce your exact training environment?
  • Do you validate input data?

If you can't answer these confidently, your model isn't production-ready yet.

The uncomfortable truth: 94% of data science effort goes into accuracy. 94% of production problems come from everything else.

What aspects of production readiness does your team struggle with most?
