DEV Community

Ferit

From scikit-learn to Production: Deploying ML Models That Actually Work

There is a gap between training a model in a Jupyter Notebook and running it in production. Most tutorials stop at model.score() and call it done. This article covers the full pipeline: data preprocessing, model selection, evaluation, serialization, and serving a scikit-learn model behind a FastAPI endpoint.

The Problem

We needed a transaction risk scoring system for a crypto payment gateway. The model receives transaction features and returns a fraud probability between 0 and 1. Requirements:

  • Latency under 50ms per prediction
  • Handle 1,000 requests per second
  • Update the model weekly without downtime
  • Explainable predictions (regulators want to know why a transaction was flagged)

scikit-learn turned out to be the right tool. Not TensorFlow, not PyTorch. For tabular data with fewer than 100 features, gradient boosted trees in scikit-learn are hard to beat.

Data Preprocessing Pipeline

Raw transaction data is messy. Missing values, mixed types, different scales. scikit-learn pipelines handle this cleanly:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = [
    "amount_usd", "gas_price", "tx_count_24h",
    "address_age_days", "avg_tx_value"
]

categorical_features = [
    "chain", "token_type", "sender_country"
]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

The key insight here is that the preprocessor becomes part of the saved model. When you serialize the pipeline, all the preprocessing logic (imputation values, scaling parameters, encoding mappings) travels with the model. No separate preprocessing code needed at inference time.
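The round trip is easy to verify with a toy pipeline. The column names and data below are made up for illustration; the point is that the fitted imputer, scaler, and encoder state all survive serialization:

```python
import io
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nan = float("nan")
df = pd.DataFrame({
    "amount_usd": [10.0, nan, 250.0, 40.0],
    "chain": ["ETH", "BTC", nan, "ETH"],
    "is_fraud": [0, 0, 1, 0],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["amount_usd"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant",
                                                fill_value="unknown")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["chain"]),
])

model = Pipeline([("preprocessor", preprocessor),
                  ("classifier", LogisticRegression())])
model.fit(df[["amount_usd", "chain"]], df["is_fraud"])

# Round-trip through an in-memory buffer: imputer medians, scaler
# statistics, and encoder categories are restored along with the model.
buf = io.BytesIO()
joblib.dump(model, buf)
buf.seek(0)
restored = joblib.load(buf)

# A row with a missing amount and an unseen chain still scores cleanly.
raw = pd.DataFrame([{"amount_usd": nan, "chain": "SOL"}])
proba = restored.predict_proba(raw)[0, 1]
print(f"fraud probability: {proba:.3f}")
```

The restored pipeline needs no preprocessing code around it: missing values are imputed with the medians learned at fit time, and `handle_unknown="ignore"` means categories never seen in training simply encode to all zeros.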

Model Selection

We evaluated four models on our dataset:

from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=200),
    "adaboost": AdaBoostClassifier(n_estimators=200)
}

results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])

    scores = cross_val_score(
        pipeline, X_train, y_train,
        cv=5, scoring="roc_auc"
    )

    results[name] = {
        "mean_auc": scores.mean(),
        "std": scores.std()
    }
    print(f"{name}: AUC = {scores.mean():.4f} (+/- {scores.std():.4f})")

Results on our dataset:

Model                 AUC    Latency (p95)
Logistic Regression   0.87   0.2ms
Random Forest         0.93   3.1ms
Gradient Boosting     0.96   1.8ms
AdaBoost              0.91   2.4ms

Gradient boosting won on AUC, and its 1.8ms p95 latency is well within our 50ms budget.
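A per-prediction latency number like this comes from a simple micro-benchmark around predict_proba. A sketch (absolute numbers depend entirely on hardware and model size, and the synthetic data here stands in for the real dataset):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
clf = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X, y)

# Time single-row predictions, the shape of a real API call.
row = X[:1]
timings_ms = []
for _ in range(200):
    start = time.perf_counter()
    clf.predict_proba(row)
    timings_ms.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(timings_ms, 50):.2f}ms  "
      f"p95: {np.percentile(timings_ms, 95):.2f}ms")
```

Benchmarking single-row calls matters: batch predictions amortize overhead, so a per-batch number divided by batch size understates what the API will actually see.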

Hyperparameter Tuning

Grid search with cross-validation finds the best parameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__min_samples_leaf": [5, 10, 20]
}

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier())
])

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1
)

search.fit(X_train, y_train)

print(f"Best AUC: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")

Best configuration for our dataset:

best_model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.05,
        min_samples_leaf=10
    ))
])

Evaluation Beyond Accuracy

For fraud detection, accuracy is misleading because the dataset is imbalanced (99% legitimate, 1% fraud). We care about precision and recall:
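A two-line experiment shows why accuracy misleads here: a degenerate "model" that never flags anything scores 99% accuracy while catching zero fraud.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% fraud
y_pred = np.zeros_like(y_true)            # never flags anything

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.99
print(f"recall:   {recall_score(y_true, y_pred):.2f}")    # 0.00
```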

from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    precision_recall_curve
)

best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")

# Find the decision threshold. precision_recall_curve returns recall in
# decreasing order as the threshold rises, so scan from the highest
# threshold down and take the largest one that still reaches 90% recall
# (the first match when scanning forward would just be the lowest
# threshold, with the worst precision).
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# We want at least 90% recall (catch 90% of fraud)
for i in range(len(thresholds) - 1, -1, -1):
    if recalls[i] >= 0.90:
        optimal_threshold = thresholds[i]
        print(f"Threshold for 90% recall: {optimal_threshold:.3f}")
        print(f"Precision at this threshold: {precisions[i]:.3f}")
        break

Feature Importance

Regulators need explainability. scikit-learn makes this straightforward:

import numpy as np

classifier = best_model.named_steps["classifier"]
feature_names = (
    numeric_features +
    list(best_model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .named_steps["encoder"]
         .get_feature_names_out(categorical_features))
)

importances = classifier.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top 10 features:")
for i in range(min(10, len(indices))):
    idx = indices[i]
    print(f"  {feature_names[idx]}: {importances[idx]:.4f}")

Output:

Top 10 features:
  amount_usd: 0.2341
  tx_count_24h: 0.1876
  address_age_days: 0.1523
  gas_price: 0.0987
  avg_tx_value: 0.0834
  chain_ETH: 0.0612
  chain_TRC20: 0.0445
  sender_country_unknown: 0.0398
  token_type_USDT: 0.0321
  chain_BTC: 0.0287

This tells us: high transaction amounts from new addresses with unusual gas prices are the strongest fraud signals. That is explainable to a regulator.
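One caveat worth knowing: impurity-based `feature_importances_` can be biased toward high-cardinality features. `permutation_importance` from `sklearn.inspection`, which measures the AUC drop when each feature is shuffled, is a useful cross-check. A sketch on synthetic data (not the article's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the mean score drop.
result = permutation_importance(clf, X, y, scoring="roc_auc",
                                n_repeats=5, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

In practice you would pass the fitted pipeline and a raw held-out frame; run it on held-out data, since importances computed on the training set can overstate features the model has overfit to.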

Model Serialization

Save the entire pipeline (preprocessor + model) as a single artifact:

import joblib
import numpy as np
from datetime import datetime

model_version = datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = f"models/fraud_detector_{model_version}.joblib"

joblib.dump(best_model, model_path)

# Verify the saved model works
loaded_model = joblib.load(model_path)
assert np.allclose(
    loaded_model.predict_proba(X_test),
    best_model.predict_proba(X_test)
)

# Save metadata
metadata = {
    "version": model_version,
    "auc": roc_auc_score(y_test, y_proba),
    "threshold": optimal_threshold,
    "features": feature_names,
    "training_samples": len(X_train)
}

import json
with open(f"models/fraud_detector_{model_version}.json", "w") as f:
    json.dump(metadata, f, indent=2)

Serving with FastAPI

The production API loads the model once at startup and serves predictions:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import json
import pandas as pd

app = FastAPI()

# Load model at startup
model = joblib.load("models/fraud_detector_latest.joblib")
metadata = json.load(open("models/fraud_detector_latest.json"))
threshold = metadata["threshold"]

class TransactionInput(BaseModel):
    amount_usd: float
    gas_price: float
    tx_count_24h: int
    address_age_days: int
    avg_tx_value: float
    chain: str
    token_type: str
    sender_country: str

class PredictionOutput(BaseModel):
    fraud_probability: float
    is_flagged: bool
    model_version: str

@app.post("/predict", response_model=PredictionOutput)
async def predict(tx: TransactionInput):
    features = pd.DataFrame([tx.model_dump()])

    probability = model.predict_proba(features)[0][1]

    return PredictionOutput(
        fraud_probability=round(probability, 4),
        is_flagged=probability >= threshold,
        model_version=metadata["version"]
    )

@app.get("/health")
async def health():
    return {"status": "ok", "model_version": metadata["version"]}

Hot-Swapping Models

To update the model without downtime, we use a simple versioning scheme:

import glob
import json
import threading

import joblib

class ModelManager:
    def __init__(self, model_dir: str):
        self.model_dir = model_dir
        self.current_model = None
        self.current_metadata = None
        self.lock = threading.Lock()
        self.load_latest()

    def load_latest(self):
        model_files = sorted(glob.glob(f"{self.model_dir}/fraud_detector_*.joblib"))
        if not model_files:
            raise FileNotFoundError("No model found")

        latest = model_files[-1]
        new_model = joblib.load(latest)
        meta_path = latest.replace(".joblib", ".json")
        new_metadata = json.load(open(meta_path))

        with self.lock:
            self.current_model = new_model
            self.current_metadata = new_metadata

    def predict(self, features):
        with self.lock:
            return self.current_model.predict_proba(features)
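The scheme works because `%Y%m%d_%H%M%S` timestamps sort lexicographically in chronological order, so plain `sorted()` on the filenames always yields the newest artifact last. A quick check with dummy artifacts (the temp directory and `DummyClassifier` are only for illustration):

```python
import glob
import json
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

with tempfile.TemporaryDirectory() as d:
    # Write two versioned artifacts, oldest first.
    for version in ["20250101_000000", "20250201_000000"]:
        clf = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])
        joblib.dump(clf, f"{d}/fraud_detector_{version}.joblib")
        with open(f"{d}/fraud_detector_{version}.json", "w") as f:
            json.dump({"version": version}, f)

    # Lexicographic sort on the timestamped names picks the newest.
    latest = sorted(glob.glob(f"{d}/fraud_detector_*.joblib"))[-1]
    print(os.path.basename(latest))  # fraud_detector_20250201_000000.joblib
```

A background task or a `/reload` admin endpoint can then call `load_latest()` on a schedule; because the swap happens under the lock after the new model is fully loaded, in-flight requests never see a half-loaded model.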

Performance Numbers

After deploying this setup:

  • Prediction latency: 2ms p50, 8ms p95
  • Throughput: 2,400 requests/second on a single core
  • Model size: 12MB (serialized pipeline)
  • Fraud detection rate: 94% recall at 87% precision
  • False positive rate: 0.3%

The full implementation is on GitHub: python-machine-learning


If you are deploying ML models to production or working with scikit-learn at scale, I would like to hear about your experience. Find me on GitHub.
