SOURAB REDDY
Building HyFD: How We Used MongoDB to Store and Analyse Production ML Failure Logs

By @sourab_reddy_ @siddardha796 @bvishnu_2509 @giridhar_58 — developed under the guidance of @chanda_rajkumar


The Problem That Started Everything

Here is a question nobody asks enough: what happens to your machine learning model after you deploy it?

You spend weeks training it. You hit 92% accuracy. You deploy it. And then — nothing. No monitoring. No alerts. Just hope that the real world keeps behaving like your training data did.

It doesn't.

Data distributions shift over time. Sensors degrade. Data pipelines silently fail and start returning NaN values. The model gets used on data it was never designed for. And through all of this, the model keeps returning predictions. Just increasingly wrong ones.

We built HyFD — Hybrid Failure Detector — to solve this. It is a production ML monitoring system that watches five signals simultaneously and raises an alert the moment something starts going wrong. But this post is specifically about one part of the project that came later: how we integrated MongoDB to store, query, and analyse every experiment run, detection result, and monitoring alert our system produced.

It turned out to be one of the more interesting engineering decisions we made.


Why We Needed a Database at All

HyFD runs experiments. A lot of them.

Each experiment takes a production data batch, runs it through five detection modules — drift detector, uncertainty estimator, slice analyzer, quality monitor, OOD detector — computes a composite failure score, and decides whether a failure has occurred.

Early on we were just printing results to the terminal and saving CSVs. That worked fine for a single experiment. It completely fell apart when we wanted to:

  • Compare results across 120 experiments (20 repeats × 6 scenarios)
  • Track which failure types were most common over time
  • Query "show me every experiment where the OOD signal scored above 0.7"
  • Build a dashboard that shows monitoring history

CSVs can't do that. A proper database can.


Why MongoDB Specifically

Our experiment results are not uniform. A healthy batch result looks like this:

{
  "scenario": "no_failure",
  "composite_score": 0.019,
  "failure_detected": false,
  "signal_scores": {
    "drift": 0.021,
    "uncertainty": 0.018,
    "quality": 0.009,
    "ood": 0.031
  },
  "n_samples": 500,
  "timestamp": "2025-04-26T14:32:11Z"
}

A failure result looks like this:

{
  "scenario": "data_drift",
  "composite_score": 0.487,
  "failure_detected": true,
  "primary_signal": "drift",
  "signal_scores": {
    "drift": 0.847,
    "uncertainty": 0.412,
    "quality": 0.091,
    "ood": 0.198
  },
  "alerts": [
    {
      "type": "DRIFT_DETECTED",
      "severity": "HIGH",
      "message": "6 of 10 features show PSI > 0.25 — severe distribution shift",
      "drifted_features": ["income", "credit_score", "age", "debt_ratio", "savings_ratio", "employment_years"]
    }
  ],
  "n_samples": 500,
  "latency_ms": 177.3,
  "timestamp": "2025-04-26T14:32:19Z"
}

These two documents have completely different shapes. A healthy result has no alerts array, no primary_signal, no drifted_features. A failure result has all of those — and the alert document itself varies depending on which signal fired.

Forcing this into a relational schema would mean nullable columns everywhere, or a mess of joined tables. MongoDB's document model handles it naturally — each record stores exactly what it needs to store, nothing more.
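To make that concrete, here is a small sketch of what the read side looks like when both shapes live in one collection. The `summarise` helper is hypothetical (not part of `mongo_logger.py`): optional fields are simply absent on healthy documents, and `dict.get()` with a default covers both shapes without schema gymnastics.

```python
# Hypothetical helper: summarise any experiment document, healthy or failed.
# Optional fields (primary_signal, alerts) are simply absent on healthy docs,
# so .get() with a default handles both shapes uniformly.
def summarise(doc: dict) -> str:
    status = "FAILURE" if doc["failure_detected"] else "healthy"
    primary = doc.get("primary_signal", "-")
    n_alerts = len(doc.get("alerts", []))
    return f"{doc['scenario']}: {status} (primary={primary}, alerts={n_alerts})"

# Trimmed versions of the two documents shown above
healthy = {"scenario": "no_failure", "failure_detected": False,
           "composite_score": 0.019}
failed = {"scenario": "data_drift", "failure_detected": True,
          "composite_score": 0.487, "primary_signal": "drift",
          "alerts": [{"type": "DRIFT_DETECTED"}]}
```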


Three Collections, Clean Separation

We wrote a mongo_logger.py module that handles all database operations. Three collections:

| Collection | What it stores |
| --- | --- |
| `experiments` | One document per HyFD detection run — scores, result, latency |
| `alerts` | One document per alert raised, linked to experiment by ID |
| `production_batches` | Metadata about the data batch that was analysed |

# mongo_logger.py
from pymongo import MongoClient
from datetime import datetime
import os

MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017/")
DB_NAME   = "hyfd_monitoring"

client = MongoClient(MONGO_URI)
db     = client[DB_NAME]

experiments_col = db["experiments"]
alerts_col      = db["alerts"]
batches_col     = db["production_batches"]
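One thing worth setting up alongside the collection handles is a set of indexes matching the queries further down; without them, every aggregation is a full collection scan. This is not in our original module, so treat the exact field choices as an assumption based on the queries this post runs.

```python
# Index specs matching the queries used later in this post.
# Field names follow the documents above; the exact set is an assumption.
INDEX_SPECS = {
    "experiments": [
        [("timestamp", -1)],                         # monitoring history, newest first
        [("failure_detected", 1), ("scenario", 1)],  # the $match + $group queries
    ],
    "alerts": [
        [("experiment_id", 1)],                      # join alerts back to their run
    ],
}

def ensure_indexes(db) -> int:
    """Create each index; returns how many were requested. Safe to re-run,
    since create_index is a no-op when the index already exists."""
    count = 0
    for coll_name, specs in INDEX_SPECS.items():
        for keys in specs:
            db[coll_name].create_index(keys)
            count += 1
    return count
```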

Logging an Experiment

Every time HyFD runs detection on a production batch, we log the full result:

def log_experiment(result: dict, scenario: str, batch_meta: dict) -> str:
    document = {
        "scenario":         scenario,
        "failure_detected": result["failure_detected"],
        "composite_score":  round(result["composite_score"], 4),
        "primary_signal":   result.get("primary_failure_signal"),
        "signal_scores": {
            k: round(v, 4)
            for k, v in result["signal_scores"].items()
        },
        "n_samples":     batch_meta.get("n_samples", 500),
        "latency_ms":    round(result.get("latency_ms", 0), 2),
        "timestamp":     datetime.utcnow(),
        "model_version": "rf_v1.0",
    }

    inserted     = experiments_col.insert_one(document)
    experiment_id = inserted.inserted_id

    if result["failure_detected"]:
        _log_alert(experiment_id, result)

    return str(experiment_id)


def _log_alert(experiment_id, result: dict):
    primary = result.get("primary_failure_signal", "unknown")
    scores  = result["signal_scores"]

    alert = {
        "experiment_id": experiment_id,
        "alert_type":    f"{primary.upper()}_DETECTED",
        "severity":      _severity(scores.get(primary, 0)),
        "composite_score": result["composite_score"],
        "primary_signal":  primary,
        "all_scores":      scores,
        "timestamp":       datetime.utcnow(),
    }
    alerts_col.insert_one(alert)


def _severity(score: float) -> str:
    if score > 0.75: return "CRITICAL"
    if score > 0.50: return "HIGH"
    if score > 0.35: return "MEDIUM"
    return "LOW"
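One subtlety in `_severity` worth calling out: the comparisons are strict, so a score of exactly 0.75 maps to HIGH, not CRITICAL. A quick check of the boundary behaviour, reusing the same thresholds:

```python
def severity(score: float) -> str:
    # Same thresholds as _severity above; note the strict inequalities,
    # so scores landing exactly on a boundary fall into the lower band.
    if score > 0.75: return "CRITICAL"
    if score > 0.50: return "HIGH"
    if score > 0.35: return "MEDIUM"
    return "LOW"
```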

Querying with Aggregation Pipelines

This is where MongoDB earned its place in the project.

Which failure type is most common?

def get_failures_by_scenario() -> list:
    pipeline = [
        {"$match": {"failure_detected": True}},
        {"$group": {
            "_id":           "$scenario",
            "count":         {"$sum": 1},
            "avg_composite": {"$avg": "$composite_score"}
        }},
        {"$sort": {"count": -1}}
    ]
    return list(experiments_col.aggregate(pipeline))

Which signal fires most often as the primary?

def get_signal_dominance() -> list:
    pipeline = [
        {"$match": {"failure_detected": True, "primary_signal": {"$ne": None}}},
        {"$group": {
            "_id":       "$primary_signal",
            "count":     {"$sum": 1},
            "avg_score": {"$avg": "$composite_score"}
        }},
        {"$sort": {"count": -1}}
    ]
    return list(experiments_col.aggregate(pipeline))

Score distribution across all experiments?

def get_score_distribution() -> list:
    pipeline = [
        {"$bucket": {
            "groupBy":    "$composite_score",
            "boundaries": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            "default":    "other",
            "output":     {"count": {"$sum": 1}}
        }}
    ]
    return list(experiments_col.aggregate(pipeline))

The $bucket stage was something we found in the MongoDB docs, and it completely eliminated the manual binning logic we would otherwise have written in Python. That kind of thing is where aggregation pipelines genuinely shine: the computation happens in the database, not in application memory.
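For intuition about what `$bucket` replaces, here is the equivalent binning written out in plain Python, the version we no longer have to run in application memory. It mirrors MongoDB's semantics: each bucket is half-open, inclusive lower bound and exclusive upper bound, with anything outside the boundaries falling into the default bucket.

```python
def bucket_counts(scores, boundaries):
    """Plain-Python equivalent of the $bucket stage above: bucket i covers
    [boundaries[i], boundaries[i+1]), and values outside the whole range
    (including the top boundary itself) land in the 'other' bucket."""
    counts = {b: 0 for b in boundaries[:-1]}
    counts["other"] = 0
    for s in scores:
        for lo, hi in zip(boundaries, boundaries[1:]):
            if lo <= s < hi:
                counts[lo] += 1
                break
        else:
            counts["other"] += 1
    return counts
```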


The Numbers After Full Evaluation

After running 6 scenarios × 20 repeats × 5 methods, our MongoDB collections held:

| Collection | Documents |
| --- | --- |
| `experiments` | 600 |
| `alerts` | 480 |
| `production_batches` | 120 |

The signal dominance query gave us this breakdown across all detected failures:

| Primary Signal | Times Fired | Avg Composite Score |
| --- | --- | --- |
| quality | 196 | 0.489 |
| drift | 118 | 0.512 |
| uncertainty | 88 | 0.498 |
| ood | 78 | 0.541 |

Quality fired most — it is the primary signal for both Missing Data and Corrupted Data scenarios, and each ran 20 times.


Where MongoDB Sits in the Pipeline

Production batch arrives
          │
          ▼
Five HyFD detection modules run in parallel
(drift, uncertainty, slice, quality, ood)
          │
          ▼
Weighted fusion → composite score
          │
          ▼
Failure detected?
   YES → alert created
   NO  → healthy result logged
          │
          ▼
mongo_logger.log_experiment()
   ├──► experiments collection
   └──► alerts collection (if failure)
          │
          ▼
Aggregation pipeline queries
          │
          ▼
Results dashboard / charts
MongoDB sits at the end of the detection pipeline and at the start of the analysis pipeline. Everything the system produces flows in. Everything the dashboard displays flows out.


What Worked, What Didn't

What worked well:

The document model was the right call for this data. Having signal scores nested directly inside each experiment document made per-signal queries straightforward without any joins.

Aggregation pipelines are powerful. Once we understood the stage-by-stage mental model — think of it like Unix pipes for documents — writing complex queries became natural.

What was harder than expected:

We kept forgetting that _id is returned by default from find() and had to project it out explicitly with {"_id": 0} when building JSON responses for the dashboard, because ObjectId is not JSON-serialisable. A minor thing, but it caught us several times early on.

The $bucket stage needs boundaries that cover every document, or a default key for anything that falls outside them; we lost time to silent errors here (our try/except logging wrapper was swallowing them) before reading the docs properly.

Querying nested fields uses dot notation inside a string: {"signal_scores.drift": {"$gt": 0.5}}. It works exactly as expected once you know the syntax, but the first time it looks a bit strange.
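Those dotted strings are easy to demystify: each dot is one level of dictionary nesting. A toy resolver makes the mapping explicit; this is purely illustrative, since the real path resolution happens server-side in MongoDB.

```python
def resolve(doc: dict, dotted_path: str):
    """Walk a nested dict the way MongoDB interprets a dotted field path.
    Returns None when any segment of the path is missing."""
    value = doc
    for key in dotted_path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value

# The nested portion of the failure document shown earlier
failure_doc = {"signal_scores": {"drift": 0.847, "uncertainty": 0.412}}
```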


Running It Yourself

# Install dependencies (pymongo is included)
pip install -r requirements.txt

# Make sure MongoDB is running
# Windows: start MongoDB from Services or run mongod.exe
# Mac:     brew services start mongodb-community

# Run experiments — results auto-save to MongoDB
cd experiments
python run_experiments.py

# Query your results directly
python -c "
from mongo_logger import get_signal_dominance
import json
results = get_signal_dominance()
print(json.dumps(results, indent=2, default=str))
"

If MongoDB is not running, the code falls back to CSV output gracefully — the logging is wrapped in try/except so experiments still run even without a database connection.
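The fallback wrapper itself is not shown above; here is a minimal sketch of the pattern, with the function name, file name, and flattening rule all assumptions rather than our actual implementation.

```python
import csv
import os

def safe_log(document: dict, insert_fn, csv_path="experiment_fallback.csv"):
    """Try the MongoDB insert; on any error, append the scalar fields of the
    document to a CSV so the experiment result is never lost.
    Returns "mongo" or "csv" depending on which path was taken."""
    try:
        insert_fn(document)
        return "mongo"
    except Exception:
        # CSV cannot hold nested dicts, so keep only scalar fields
        flat = {k: v for k, v in document.items()
                if isinstance(v, (str, int, float, bool))}
        write_header = not os.path.exists(csv_path)
        with open(csv_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(flat))
            if write_header:
                writer.writeheader()
            writer.writerow(flat)
        return "csv"
```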


What Is Next

The MongoDB foundation makes several future features straightforward:

  • Change Streams — push real-time alerts to a live dashboard without polling
  • Time-series collections — built-in time bucketing for high-frequency monitoring data
  • Atlas Charts — embed live monitoring dashboards directly without building chart code
  • Aggregation-based triggers — query for three consecutive drift failures and automatically kick off a model retraining job

GitHub: https://github.com/Arcanixhades0/HyFD_ml

Video Demo

Final Thought

The core of HyFD is the five detection signals and the weighted fusion — that is the research contribution. But a monitoring system that prints to a terminal and throws away its history is not really a monitoring system. It is a diagnostic tool you run once.

MongoDB turned HyFD from a one-shot experiment runner into a persistent monitoring layer that accumulates history, supports trend analysis, and can answer questions about failure patterns across hundreds of runs.

That is what a production monitoring system actually needs to be.


HyFD — Hybrid Failure Detection for Production ML Systems

Built with: Python · scikit-learn · NumPy · SciPy · Pandas · PyMongo · MongoDB
