As pretrained AI models become more common, one growing concern is whether those models can actually be trusted.
A model may appear completely normal during testing, but behave maliciously when exposed to a hidden trigger. These attacks are known as backdoor or poisoning attacks, and they represent a serious security risk for real-world AI systems.
This semester, our team built Mithridatium, an open-source framework designed to help detect hidden backdoors in pretrained machine learning models.
What is a Backdoor?
In simple terms, a backdoor attack hides malicious behavior inside an otherwise normal model.
Most of the time, the model behaves exactly as expected. But when a specific trigger appears in the input, the model changes its behavior in a way that benefits an attacker.
Imagine a self-driving vehicle that correctly recognizes stop signs during testing, but misclassifies them when a small sticker or visual trigger is placed on the sign. A hidden trigger like this could potentially cause extremely dangerous outcomes in real-world systems.
This problem becomes even more concerning because many developers rely heavily on pretrained models downloaded from external sources like Hugging Face or public repositories.
The question becomes:
How do we verify that a pretrained model has not been poisoned before deploying it?
That is the problem Mithridatium was designed to explore.
What Mithridatium Does
Mithridatium is a framework for evaluating pretrained image classification models for potential backdoor behavior.
The framework allows users to:
- Load local checkpoints or Hugging Face models
- Run multiple backdoor detection defenses
- Generate structured JSON reports
- Visualize results through a web demo interface
- Compare detection signals across different methods
The goal is to translate AI security research into practical and reusable tooling.
The Detection Defenses
One of the most interesting parts of the project was implementing and evaluating several different detection strategies. Each defense approaches the problem differently.
FreeEagle
FreeEagle is a white-box, data-free defense.
Instead of relying on datasets or trigger injection, it analyzes the internal behavior of the model itself and looks for abnormal class bias patterns that may indicate hidden backdoor behavior.
This makes it especially useful for quickly screening unknown models.
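FreeEagle's actual analysis inspects a model's internal representations, which is beyond a short snippet. As a loose, runnable illustration of the broader idea behind data-free screening (not the FreeEagle algorithm itself), one could probe a classifier with random inputs and check whether any single class wins far more often than chance would suggest. The `predict` callable below is a stand-in for a real model:

```python
import numpy as np

def class_bias_scores(predict, n_probe=256, input_shape=(3, 32, 32), seed=0):
    """Feed random probe inputs to a classifier and record how often each
    class wins. A class that dominates far beyond 1/num_classes may hint
    at a hidden bias worth deeper inspection. This is a simplified
    illustration of data-free screening, not FreeEagle's actual method."""
    rng = np.random.default_rng(seed)
    wins = {}
    for _ in range(n_probe):
        x = rng.standard_normal(input_shape)
        winner = int(np.argmax(predict(x)))  # predicted class for this probe
        wins[winner] = wins.get(winner, 0) + 1
    return {c: n / n_probe for c, n in sorted(wins.items())}
```

On a heavily biased model this returns a distribution concentrated on one class, whereas a well-behaved classifier spreads its wins more evenly across classes.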
STRIP
STRIP works by superimposing the input under test onto a set of other, known-clean images and observing how the model's predictions change.
The intuition is that a normal model should become less confident when the input changes significantly. However, backdoored models often remain unusually stable when the trigger is present.
If prediction entropy remains suspiciously low across perturbed inputs, STRIP raises a red flag.
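The core STRIP signal can be sketched in a few lines. The sketch below assumes a `predict` callable that returns a probability vector (a stand-in for the real model); everything else follows the entropy intuition described above:

```python
import numpy as np

def strip_entropy_score(x, clean_images, predict, n_blend=8, alpha=0.5):
    """Blend x with randomly chosen clean images and average the entropy
    of the model's predictions on the blends. A suspiciously low average
    entropy means predictions stay stable under heavy perturbation --
    the STRIP red flag for a triggered input."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(clean_images), size=n_blend, replace=False)
    entropies = []
    for i in idx:
        blended = alpha * x + (1 - alpha) * clean_images[i]
        p = np.clip(predict(blended), 1e-12, 1.0)  # probability vector
        entropies.append(-np.sum(p * np.log(p)))   # Shannon entropy
    return float(np.mean(entropies))
```

A model that always returns the same one-hot prediction scores near zero, while an uncertain (near-uniform) predictor scores near log(num_classes); a detection threshold sits somewhere between the two.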
MMBD
MMBD focuses on abnormal dominance patterns across output classes.
The defense looks for suspicious concentration or bias in the model's behavior that may suggest hidden trigger relationships.
This approach was especially interesting because it worked well even against some dynamic backdoor scenarios.
AEVA
AEVA takes a more adversarial approach.
It perturbs input images and observes how the model responds to trigger-like changes. By analyzing anomaly indices and perturbation behavior, the framework can identify suspicious patterns associated with backdoors.
Compared to some other defenses, AEVA can require significantly more queries and computation, especially in black-box settings.
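Defenses in this family typically reduce each class to a score and then flag statistical outliers. A common way to turn per-class scores into an anomaly index, used by several trigger-analysis defenses and shown here as a generic stand-in rather than AEVA's exact formulation, is a z-score based on the median absolute deviation (MAD):

```python
import numpy as np

def anomaly_indices(scores):
    """MAD-based anomaly index: how far each per-class score sits from the
    median, in units of scaled median absolute deviation. Indices above
    roughly 2 are commonly treated as suspicious."""
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) * 1.4826  # consistency constant
    return np.abs(scores - med) / max(mad, 1e-12)
```

The MAD is preferred over the standard deviation here because a single backdoored class is exactly the kind of outlier that would inflate the standard deviation and hide itself.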
Building the Project
Mithridatium was built primarily in Python using PyTorch and Hugging Face tooling.
The project currently includes:
- A modular CLI interface
- Support for Hugging Face models
- JSON report generation
- Multiple detection defenses
- Demo interfaces for visualization
- Compatibility validation for supported architectures
A typical CLI run looks like this:
mithridatium detect \
--model models/resnet18_cifar10.pt \
--data cifar10 \
--defense freeeagle \
--out reports/freeeagle_report.json \
--force
The framework can also evaluate models directly from Hugging Face using model IDs instead of local checkpoints.
Reports and Visual Output
One major goal of the project was usability.
A user should not need to read multiple research papers just to understand whether a model might be risky. Mithridatium attempts to translate complex detection signals into understandable verdicts and metrics.
The framework produces structured reports and can visualize outputs through the demo interface.
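For illustration, a structured report could look something like the following. The field names here are hypothetical, chosen to show the kind of information a report carries, and are not Mithridatium's actual schema:

```json
{
  "model": "models/resnet18_cifar10.pt",
  "defense": "freeeagle",
  "verdict": "suspicious",
  "scores": {
    "anomaly_index": 4.7,
    "flagged_class": 3
  },
  "runtime_seconds": 41.2
}
```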
Lessons Learned
One thing we learned very quickly is that ML security tooling is not just about implementing algorithms.
A practical tool also has to handle:
- dataset compatibility
- integration problems
- reporting
- usability
- deployment assumptions
- benchmarking
- reproducibility
One particularly important lesson involved dataset mismatch.
Some defenses behaved very differently depending on whether the evaluation dataset matched the dataset the model was originally trained on. In some cases, mismatched datasets produced false positives that initially looked like detection failures.
We also learned that different defenses come with different tradeoffs.
Some methods are lightweight and data-free, while others require large numbers of model queries or significant computational resources.
Another major takeaway was the importance of clear reporting. Security tooling becomes far more useful when results are understandable to developers who may not specialize in AI security research.
Current Developers
Mithridatium was developed through Open Source with SLU by:
- Pelumi Oluwategbe
- Gustavo Lucca
- Payton Guffey
- Will Phoenix
Project Links
- GitHub Repository: https://github.com/oss-slu/mithridatium
- Project Website: https://mithridatium.vercel.app/
- Hugging Face Demo: https://huggingface.co/spaces/williamphoenix/Mithridatium
Looking Ahead
Mithridatium currently focuses on image classification models, but the broader concept of model integrity verification is much larger.
As AI systems become more widely deployed, verifying pretrained models before deployment will likely become increasingly important.
This project represents one small step toward making AI security tooling more practical, accessible, and open source.