As pretrained AI models become more common, one growing concern is whether those models can actually be trusted.
A model may appear completely normal during testing, but behave maliciously when exposed to a hidden trigger. These attacks are known as backdoor or poisoning attacks, and they represent a serious security risk for real-world AI systems.
This semester, our team built Mithridatium, an open-source framework designed to help detect hidden backdoors in pretrained machine learning models.
What is a Backdoor?
In simple terms, a backdoor attack hides malicious behavior inside an otherwise normal model.
Most of the time, the model behaves exactly as expected. But when a specific trigger appears in the input, the model changes its behavior in a way that benefits an attacker.
Imagine a self-driving vehicle that correctly recognizes stop signs during testing, but misclassifies them when a small sticker or visual trigger is placed on the sign. A hidden trigger like this could potentially cause extremely dangerous outcomes in real-world systems.
This problem becomes even more concerning because many developers rely heavily on pretrained models downloaded from external sources like Hugging Face or public repositories.
The question becomes:
How do we verify that a pretrained model has not been poisoned before deploying it?
That is the problem Mithridatium was designed to explore.
What Mithridatium Does
Mithridatium is a framework for evaluating pretrained image classification models for potential backdoor behavior.
The framework allows users to:
- Load local checkpoints or Hugging Face models
- Run multiple backdoor detection defenses
- Generate structured JSON reports
- Visualize results through a web demo interface
- Compare detection signals across different methods
The goal is to translate AI security research into practical and reusable tooling.
The Detection Defenses
One of the most interesting parts of the project was implementing and evaluating several different detection strategies. Each defense approaches the problem differently.
FreeEagle
FreeEagle is a white-box, data-free defense.
Instead of relying on datasets or trigger injection, it analyzes the internal behavior of the model itself and looks for abnormal class bias patterns that may indicate hidden backdoor behavior.
This makes it especially useful for quickly screening unknown models.
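FreeEagle's actual analysis inspects a model's internal representations, which is beyond a short snippet. As a loose, runnable illustration of the broader idea behind data-free screening (not the FreeEagle algorithm itself), one could probe a classifier with random inputs and check whether any single class wins far more often than chance would suggest. The `predict` callable below is a stand-in for a real model:

```python
import numpy as np

def class_bias_scores(predict, n_probe=256, input_shape=(3, 32, 32), seed=0):
    """Feed random probe inputs to a classifier and record how often each
    class wins. A class that dominates far beyond 1/num_classes may hint
    at a hidden bias worth deeper inspection. This is a simplified
    illustration of data-free screening, not FreeEagle's actual method."""
    rng = np.random.default_rng(seed)
    wins = {}
    for _ in range(n_probe):
        x = rng.standard_normal(input_shape)
        winner = int(np.argmax(predict(x)))  # predicted class for this probe
        wins[winner] = wins.get(winner, 0) + 1
    return {c: n / n_probe for c, n in sorted(wins.items())}
```

On a heavily biased model this returns a distribution concentrated on one class, whereas a well-behaved classifier spreads its wins more evenly across classes.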
STRIP
STRIP works by superimposing the input under test onto a set of other, known-clean images and observing how the model's predictions change.
The intuition is that a normal model should become less confident when the input changes significantly. However, backdoored models often remain unusually stable when the trigger is present.
If prediction entropy remains suspiciously low across perturbed inputs, STRIP raises a red flag.
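The core STRIP signal can be sketched in a few lines. The sketch below assumes a `predict` callable that returns a probability vector (a stand-in for the real model); everything else follows the entropy intuition described above:

```python
import numpy as np

def strip_entropy_score(x, clean_images, predict, n_blend=8, alpha=0.5):
    """Blend x with randomly chosen clean images and average the entropy
    of the model's predictions on the blends. A suspiciously low average
    entropy means predictions stay stable under heavy perturbation --
    the STRIP red flag for a triggered input."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(clean_images), size=n_blend, replace=False)
    entropies = []
    for i in idx:
        blended = alpha * x + (1 - alpha) * clean_images[i]
        p = np.clip(predict(blended), 1e-12, 1.0)  # probability vector
        entropies.append(-np.sum(p * np.log(p)))   # Shannon entropy
    return float(np.mean(entropies))
```

A model that always returns the same one-hot prediction scores near zero, while an uncertain (near-uniform) predictor scores near log(num_classes); a detection threshold sits somewhere between the two.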
MMBD
MMBD focuses on abnormal dominance patterns across output classes.
The defense looks for suspicious concentration or bias in the model's behavior that may suggest hidden trigger relationships.
This approach was especially interesting because it worked well even against some dynamic backdoor scenarios.
AEVA
AEVA takes a more adversarial approach.
It perturbs input images and observes how the model responds to trigger-like changes. By analyzing anomaly indices and perturbation behavior, the framework can identify suspicious patterns associated with backdoors.
Compared to some other defenses, AEVA can require significantly more queries and computation, especially in black-box settings.
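Defenses in this family typically reduce each class to a score and then flag statistical outliers. A common way to turn per-class scores into an anomaly index, used by several trigger-analysis defenses and shown here as a generic stand-in rather than AEVA's exact formulation, is a z-score based on the median absolute deviation (MAD):

```python
import numpy as np

def anomaly_indices(scores):
    """MAD-based anomaly index: how far each per-class score sits from the
    median, in units of scaled median absolute deviation. Indices above
    roughly 2 are commonly treated as suspicious."""
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) * 1.4826  # consistency constant
    return np.abs(scores - med) / max(mad, 1e-12)
```

The MAD is preferred over the standard deviation here because a single backdoored class is exactly the kind of outlier that would inflate the standard deviation and hide itself.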
Building the Project
Mithridatium was built primarily in Python using PyTorch and Hugging Face tooling.
The project currently includes:
- A modular CLI interface
- Support for Hugging Face models
- JSON report generation
- Multiple detection defenses
- Demo interfaces for visualization
- Compatibility validation for supported architectures
A typical CLI run looks like this:
mithridatium detect \
--model models/resnet18_cifar10.pt \
--data cifar10 \
--defense freeeagle \
--out reports/freeeagle_report.json \
--force
The framework can also evaluate models directly from Hugging Face using model IDs instead of local checkpoints.
Reports and Visual Output
One major goal of the project was usability.
A user should not need to read multiple research papers just to understand whether a model might be risky. Mithridatium attempts to translate complex detection signals into understandable verdicts and metrics.
The framework produces structured reports and can visualize outputs through the demo interface.
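For illustration, a structured report could look something like the following. The field names here are hypothetical, chosen to show the kind of information a report carries, and are not Mithridatium's actual schema:

```json
{
  "model": "models/resnet18_cifar10.pt",
  "defense": "freeeagle",
  "verdict": "suspicious",
  "scores": {
    "anomaly_index": 4.7,
    "flagged_class": 3
  },
  "runtime_seconds": 41.2
}
```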
Lessons Learned
One thing we learned very quickly is that ML security tooling is not just about implementing algorithms.
A practical tool also has to handle:
- dataset compatibility
- integration problems
- reporting
- usability
- deployment assumptions
- benchmarking
- reproducibility
One particularly important lesson involved dataset mismatch.
Some defenses behaved very differently depending on whether the evaluation dataset matched the dataset the model was originally trained on. In some cases, mismatched datasets produced false positives that initially looked like detection failures.
We also learned that different defenses come with different tradeoffs.
Some methods are lightweight and data-free, while others require large numbers of model queries or significant computational resources.
Another major takeaway was the importance of clear reporting. Security tooling becomes far more useful when results are understandable to developers who may not specialize in AI security research.
Current Developers
Mithridatium was developed through Open Source with SLU by:
- Pelumi Oluwategbe
- Gustavo Lucca
- Payton Guffey
- Will Phoenix
Project Links
- GitHub Repository: https://github.com/oss-slu/mithridatium
- Project Website: https://mithridatium.vercel.app/
- Hugging Face Demo: https://huggingface.co/spaces/williamphoenix/Mithridatium
Looking Ahead
Mithridatium currently focuses on image classification models, but the broader concept of model integrity verification is much larger.
As AI systems become more widely deployed, verifying pretrained models before deployment will likely become increasingly important.
This project represents one small step toward making AI security tooling more practical, accessible, and open source.