There is a gap between training a model in a Jupyter Notebook and running it in production. Most tutorials stop at model.score() and call it done. This article covers the full pipeline: data preprocessing, model selection, evaluation, serialization, and serving a scikit-learn model behind a FastAPI endpoint.
The Problem
We needed a transaction risk scoring system for a crypto payment gateway. The model receives transaction features and returns a fraud probability between 0 and 1. Requirements:
- Latency under 50ms per prediction
- Handle 1,000 requests per second
- Update the model weekly without downtime
- Explainable predictions (regulators want to know why a transaction was flagged)
scikit-learn turned out to be the right tool. Not TensorFlow, not PyTorch. For tabular data with fewer than 100 features, gradient boosted trees in scikit-learn are hard to beat.
Data Preprocessing Pipeline
Raw transaction data is messy. Missing values, mixed types, different scales. scikit-learn pipelines handle this cleanly:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = [
    "amount_usd", "gas_price", "tx_count_24h",
    "address_age_days", "avg_tx_value"
]
categorical_features = [
    "chain", "token_type", "sender_country"
]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
```
The key insight here is that the preprocessor becomes part of the saved model. When you serialize the pipeline, all the preprocessing logic (imputation values, scaling parameters, encoding mappings) travels with the model. No separate preprocessing code needed at inference time.
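To make that concrete, here is a minimal, self-contained sketch of the same idea on toy data (the two columns are illustrative, not the full feature set): the fitted transformer remembers the training median and the known categories, so a missing value and an unseen chain are both handled at inference time without extra code.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    "amount_usd": [10.0, 200.0, 35.0, None],
    "chain": ["ETH", "BTC", "ETH", "TRON"],
})

pre = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["amount_usd"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["chain"]),
])
pre.fit(train)

# The learned median (35.0) fills the missing amount, and the unseen
# chain "SOLANA" encodes to an all-zero one-hot instead of raising.
new = pd.DataFrame({"amount_usd": [None], "chain": ["SOLANA"]})
out = pre.transform(new)
```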
Model Selection
We evaluated four models on our dataset:
```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=200),
    "adaboost": AdaBoostClassifier(n_estimators=200)
}

results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])
    scores = cross_val_score(
        pipeline, X_train, y_train,
        cv=5, scoring="roc_auc"
    )
    results[name] = {
        "mean_auc": scores.mean(),
        "std": scores.std()
    }
    print(f"{name}: AUC = {scores.mean():.4f} (+/- {scores.std():.4f})")
```
Results on our dataset:
| Model | AUC | Latency (p95) |
|---|---|---|
| Logistic Regression | 0.87 | 0.2ms |
| Random Forest | 0.93 | 3.1ms |
| Gradient Boosting | 0.96 | 1.8ms |
| AdaBoost | 0.91 | 2.4ms |
Gradient boosting won on AUC, and its 1.8ms latency is well within our 50ms budget.
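Latency numbers like those in the table are easy to reproduce yourself. A rough sketch, using synthetic data and a small model as stand-ins for the real pipeline (your features, model size, and hardware will obviously shift the numbers):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Time single-row predictions, the access pattern an API actually sees
timings = []
for row in X[:200]:
    start = time.perf_counter()
    clf.predict_proba(row.reshape(1, -1))
    timings.append(time.perf_counter() - start)

p95_ms = float(np.percentile(timings, 95) * 1000)
print(f"p95 latency: {p95_ms:.2f} ms")
```

Single-row timing matters: batch prediction amortizes per-call overhead, so batch benchmarks flatter the numbers a serving endpoint will actually see.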
Hyperparameter Tuning
Grid search with cross-validation finds the best parameters:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__min_samples_leaf": [5, 10, 20]
}

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier())
])

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1
)
search.fit(X_train, y_train)

print(f"Best AUC: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
```
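Note the cost: the grid above is 3 × 3 × 3 × 3 = 81 combinations, which means 405 fits with 5-fold CV. When that is too slow, `RandomizedSearchCV` samples the grid instead of exhausting it. A sketch on synthetic data (the real search would wrap the full pipeline, as above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_samples_leaf": [5, 10, 20],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # sample 5 of the 81 combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```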
Best configuration for our dataset:
```python
best_model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.05,
        min_samples_leaf=10
    ))
])
```
Evaluation Beyond Accuracy
For fraud detection, accuracy is misleading because the dataset is imbalanced (99% legitimate, 1% fraud). We care about precision and recall:
```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    precision_recall_curve
)

best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")

# Find the operating threshold. precision_recall_curve returns recall in
# non-increasing order, so we want the LAST index with recall >= 0.90:
# that is the highest threshold (best precision) that still catches
# 90% of fraud. Scanning from the front would stop at recall = 1.0 and
# pick the lowest threshold instead.
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
valid = np.where(recalls[:-1] >= 0.90)[0]
idx = valid[-1]
optimal_threshold = thresholds[idx]
print(f"Threshold for 90% recall: {optimal_threshold:.3f}")
print(f"Precision at this threshold: {precisions[idx]:.3f}")
```
Feature Importance
Regulators need explainability. scikit-learn makes this straightforward:
```python
import numpy as np

classifier = best_model.named_steps["classifier"]
feature_names = (
    numeric_features +
    list(best_model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .named_steps["encoder"]
         .get_feature_names_out(categorical_features))
)

importances = classifier.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top 10 features:")
for i in range(min(10, len(indices))):
    idx = indices[i]
    print(f"  {feature_names[idx]}: {importances[idx]:.4f}")
```
Output:
```
Top 10 features:
  amount_usd: 0.2341
  tx_count_24h: 0.1876
  address_age_days: 0.1523
  gas_price: 0.0987
  avg_tx_value: 0.0834
  chain_ETH: 0.0612
  chain_TRC20: 0.0445
  sender_country_unknown: 0.0398
  token_type_USDT: 0.0321
  chain_BTC: 0.0287
```
This tells us: high transaction amounts from new addresses with unusual gas prices are the strongest fraud signals. That is explainable to a regulator.
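One caveat worth knowing: impurity-based `feature_importances_` can overstate high-cardinality or continuous features. A more defensible complement for a regulator conversation is permutation importance on held-out data. A sketch with synthetic data standing in for the fitted pipeline and real test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature and measure the drop in AUC; a large drop means
# the model genuinely relies on that feature
result = permutation_importance(clf, X, y, n_repeats=5,
                                scoring="roc_auc", random_state=0)
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

In production this would run against `best_model` and `X_test`/`y_test`; running it on training data, as this toy does, can still flatter memorized features.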
Model Serialization
Save the entire pipeline (preprocessor + model) as a single artifact:
```python
import json
import joblib
import numpy as np
from datetime import datetime

model_version = datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = f"models/fraud_detector_{model_version}.joblib"
joblib.dump(best_model, model_path)

# Verify the saved model works
loaded_model = joblib.load(model_path)
assert np.allclose(
    loaded_model.predict_proba(X_test),
    best_model.predict_proba(X_test)
)

# Save metadata alongside the artifact
metadata = {
    "version": model_version,
    "auc": roc_auc_score(y_test, y_proba),
    "threshold": optimal_threshold,
    "features": feature_names,
    "training_samples": len(X_train)
}
with open(f"models/fraud_detector_{model_version}.json", "w") as f:
    json.dump(metadata, f, indent=2)
```
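One gotcha with joblib artifacts: they are not guaranteed to load across scikit-learn versions. A cautious extension (not in the original metadata above) is to record the training-time version and check it at load. The `sklearn_version` key and `check_compatibility` helper here are my additions, sketched for illustration:

```python
import json
import sklearn

# Extend the metadata with the library version used at training time
metadata = {
    "version": "20240101_000000",   # illustrative version string
    "sklearn_version": sklearn.__version__,
}

def check_compatibility(meta: dict) -> bool:
    """Warn (rather than crash) when the serialized model was trained
    under a different scikit-learn release."""
    saved = meta.get("sklearn_version")
    if saved != sklearn.__version__:
        print(f"warning: model trained with sklearn {saved}, "
              f"running {sklearn.__version__}")
        return False
    return True
```

Whether a mismatch should block startup or only warn is a policy choice; for a weekly retrain cadence, pinning the sklearn version in the serving image is the simpler guarantee.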
Serving with FastAPI
The production API loads the model once at startup and serves predictions:
```python
import json
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load model and metadata once at startup
model = joblib.load("models/fraud_detector_latest.joblib")
with open("models/fraud_detector_latest.json") as f:
    metadata = json.load(f)
threshold = metadata["threshold"]

class TransactionInput(BaseModel):
    amount_usd: float
    gas_price: float
    tx_count_24h: int
    address_age_days: int
    avg_tx_value: float
    chain: str
    token_type: str
    sender_country: str

class PredictionOutput(BaseModel):
    fraud_probability: float
    is_flagged: bool
    model_version: str

@app.post("/predict", response_model=PredictionOutput)
async def predict(tx: TransactionInput):
    # The pipeline expects a DataFrame with the original column names
    features = pd.DataFrame([tx.model_dump()])
    probability = model.predict_proba(features)[0][1]
    return PredictionOutput(
        fraud_probability=round(probability, 4),
        is_flagged=probability >= threshold,
        model_version=metadata["version"]
    )

@app.get("/health")
async def health():
    return {"status": "ok", "model_version": metadata["version"]}
```
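The Pydantic schema can be exercised on its own, without starting the server, which is handy for catching field-name drift between the API and the training features. A minimal check using the same model class and the Pydantic v2 `model_dump` API used in the endpoint (the payload values are made up):

```python
from pydantic import BaseModel

class TransactionInput(BaseModel):
    amount_usd: float
    gas_price: float
    tx_count_24h: int
    address_age_days: int
    avg_tx_value: float
    chain: str
    token_type: str
    sender_country: str

# An illustrative request body; a typo'd or missing field here raises
# a ValidationError before any model code runs
payload = {
    "amount_usd": 2500.0, "gas_price": 90.0, "tx_count_24h": 14,
    "address_age_days": 2, "avg_tx_value": 120.0,
    "chain": "ETH", "token_type": "USDT", "sender_country": "unknown",
}
tx = TransactionInput(**payload)
```

A stricter version would assert that `set(payload) == set(numeric_features + categorical_features)` so the API contract and the training pipeline cannot silently diverge.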
Hot-Swapping Models
To update the model without downtime, we use a simple versioning scheme:
```python
import glob
import json
import threading

import joblib

class ModelManager:
    def __init__(self, model_dir: str):
        self.model_dir = model_dir
        self.current_model = None
        self.current_metadata = None
        self.lock = threading.Lock()
        self.load_latest()

    def load_latest(self):
        # Timestamped filenames sort lexicographically, so the last is newest
        model_files = sorted(glob.glob(f"{self.model_dir}/fraud_detector_*.joblib"))
        if not model_files:
            raise FileNotFoundError("No model found")
        latest = model_files[-1]
        new_model = joblib.load(latest)
        meta_path = latest.replace(".joblib", ".json")
        with open(meta_path) as f:
            new_metadata = json.load(f)
        # Load outside the lock, swap inside it: requests in flight keep
        # the old model until the swap completes
        with self.lock:
            self.current_model = new_model
            self.current_metadata = new_metadata

    def predict(self, features):
        with self.lock:
            return self.current_model.predict_proba(features)
```
Performance Numbers
After deploying this setup:
- Prediction latency: 2ms p50, 8ms p95
- Throughput: 2,400 requests/second on a single core
- Model size: 12MB (serialized pipeline)
- Fraud detection rate: 94% recall at 87% precision
- False positive rate: 0.3%
The full implementation is on GitHub: python-machine-learning
If you are deploying ML models to production or working with scikit-learn at scale, I would like to hear about your experience. Find me on GitHub.