<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tamar.S</title>
    <description>The latest articles on DEV Community by Tamar.S (@t_s_7da0b0e0e14e6b58).</description>
    <link>https://dev.to/t_s_7da0b0e0e14e6b58</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3653325%2Fec8c655d-5698-4ffe-bdfd-4450422917a5.png</url>
      <title>DEV Community: Tamar.S</title>
      <link>https://dev.to/t_s_7da0b0e0e14e6b58</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/t_s_7da0b0e0e14e6b58"/>
    <language>en</language>
    <item>
      <title>The Right Way to Measure Axiomatic Non-Sensitivity in XAI</title>
      <dc:creator>Tamar.S</dc:creator>
      <pubDate>Sun, 18 Jan 2026 15:55:17 +0000</pubDate>
      <link>https://dev.to/t_s_7da0b0e0e14e6b58/the-right-way-to-measure-axiomatic-non-sensitivity-in-xai-bc7</link>
      <guid>https://dev.to/t_s_7da0b0e0e14e6b58/the-right-way-to-measure-axiomatic-non-sensitivity-in-xai-bc7</guid>
      <description>&lt;h1&gt;
  
  
  The Right Way to Measure Axiomatic Non-Sensitivity
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Why your XAI metric might lie to you — and how we fixed it
&lt;/h3&gt;

&lt;p&gt;If you’ve ever tried to actually measure how stable your attribution maps are, you probably ran into the same surprising thing we did: the theory behind axiomatic metrics sounds clean and elegant…&lt;br&gt;
…but the real-world implementation?&lt;br&gt;
Not always.&lt;/p&gt;

&lt;p&gt;During the development of AIXPlainer, our explainability evaluation app, we wanted to include the Non-Sensitivity metric — a classic axiom that checks whether pixels with zero attribution truly have no influence on the model. Sounds simple, right?&lt;/p&gt;

&lt;p&gt;Well… almost.&lt;/p&gt;

&lt;p&gt;Once we moved from toy examples to real images, and from single-pixel perturbations to batch evaluations, things broke. Hard.&lt;/p&gt;

&lt;p&gt;Before diving into how we solved it, I would like to thank AMAT and Extra-Tech for providing the tools and professional environment for our project. This work was developed in collaboration with Shmuel Fine, whose insights shaped the structural direction of the solution, and I also benefited from the guidance of Odelia Movadat, who supported the design of the overall evaluation process.&lt;br&gt;
And that’s exactly what this post is about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what goes wrong,&lt;/li&gt;
&lt;li&gt;why it matters, and&lt;/li&gt;
&lt;li&gt;how we built a correct and efficient implementation that eventually became a PR to the Quantus library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h3&gt;
  
  
  Wait, what is Non-Sensitivity again?
&lt;/h3&gt;

&lt;p&gt;The idea is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If a pixel receives zero attribution, perturbing it shouldn’t change the model’s prediction.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s a sanity check.&lt;br&gt;
If your explainer claims a pixel “doesn’t matter”, then changing it shouldn’t matter.&lt;/p&gt;

&lt;p&gt;To test that, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Take the input image&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perturb some pixel(s)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-run the model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compare the attribution map vs. the prediction change&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns46s2bcmj5zv8uenqf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns46s2bcmj5zv8uenqf0.png" alt=" " width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79mvcgte36zhv2j59xsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79mvcgte36zhv2j59xsa.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Quantify the violations between what the heatmap claims and the actual difference between the predictions after the perturbation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple.&lt;/p&gt;

&lt;p&gt;Until you try to make it fast.&lt;/p&gt;
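&lt;p&gt;The per-pixel check described above can be sketched in a few lines of NumPy. This is a toy illustration with hypothetical names (&lt;code&gt;model&lt;/code&gt;, &lt;code&gt;non_sensitivity_violations&lt;/code&gt;), not the Quantus API:&lt;/p&gt;

```python
import numpy as np

def non_sensitivity_violations(model, x, attributions,
                               eps=1e-6, tol=1e-6, baseline=0.0):
    """Count Non-Sensitivity violations, one pixel at a time.

    A violation occurs when the attribution map and the model disagree:
    a pixel with (near-)zero attribution changes the prediction when
    perturbed, or a pixel with non-zero attribution does not.
    """
    base_pred = model(x)
    violations = 0
    for i in range(x.size):
        x_pert = x.copy()
        x_pert.flat[i] = baseline                 # perturb a single pixel
        delta = abs(model(x_pert) - base_pred)    # re-run the model
        claims_irrelevant = abs(attributions.flat[i]) <= eps
        had_no_effect = delta <= tol
        if claims_irrelevant != had_no_effect:    # claim vs. reality mismatch
            violations += 1
    return violations
```

&lt;p&gt;For a linear toy model whose weights double as the ground-truth attributions, this counts zero violations, while a deliberately wrong attribution map immediately surfaces them.&lt;/p&gt;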
&lt;h3&gt;
  
  
  Where theory collapsed: the &lt;code&gt;features_in_step&lt;/code&gt; trap
&lt;/h3&gt;

&lt;p&gt;For high-resolution images, evaluating one pixel at a time is simply too slow to be practical.&lt;br&gt;
So Quantus allows processing several pixels at once using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features_in_step = N   # number of pixels perturbed in each step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great idea… in theory.&lt;/p&gt;

&lt;p&gt;In practice, two big things went wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem #1 — Group perturbations break the math
&lt;/h3&gt;

&lt;p&gt;When you perturb N pixels together, the prediction difference reflects a mixed effect:&lt;br&gt;
Was it pixel #3 that caused the change? Pixel #4? Or the combination?&lt;/p&gt;

&lt;p&gt;But the metric treated each pixel as if the model responded to it individually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49sgqa28hab9jwtw249.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49sgqa28hab9jwtw249.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;br&gt;
This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True violations were hidden&lt;/li&gt;
&lt;li&gt;False violations appeared&lt;/li&gt;
&lt;li&gt;Results became inconsistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words — the metric was no longer measuring its own definition.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problem #2 — Shape mismatches exploded the runtime
&lt;/h3&gt;

&lt;p&gt;Deep inside the metric, Quantus tried to do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;preds_differences XOR non_features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But with group perturbations:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preds_differences&lt;/code&gt; = number of perturbation steps&lt;/p&gt;

&lt;p&gt;&lt;code&gt;non_features&lt;/code&gt; = number of pixels&lt;/p&gt;

&lt;p&gt;These two sizes are not equal unless &lt;code&gt;features_in_step&lt;/code&gt; = 1.&lt;/p&gt;

&lt;p&gt;Boom → &lt;code&gt;ValueError: operands could not be broadcast together&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;strong&gt;The metric literally could not run in the mode needed to make it efficient.&lt;/strong&gt;&lt;/p&gt;
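&lt;p&gt;Here is a minimal, self-contained illustration of the mismatch (toy sizes, not Quantus internals): there is one prediction difference per &lt;em&gt;step&lt;/em&gt; but one flag per &lt;em&gt;pixel&lt;/em&gt;, so any elementwise operation between the two arrays fails to broadcast:&lt;/p&gt;

```python
import numpy as np

n_pixels = 12
features_in_step = 4
n_steps = n_pixels // features_in_step           # 3 batched perturbation steps

preds_differences = np.zeros(n_steps)            # shape (3,): one per step
non_features = np.zeros(n_pixels, dtype=bool)    # shape (12,): one per pixel

try:
    np.logical_xor(preds_differences, non_features)
except ValueError as err:
    print("Broadcast failed:", err)              # shapes (3,) and (12,) are incompatible
```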

&lt;h2&gt;
  
  
  Our approach: make it correct, then make it fast
&lt;/h2&gt;

&lt;p&gt;To address the issues, we restructured the entire evaluation flow so that batching improves performance without compromising the integrity of pixel-level reasoning.&lt;/p&gt;

&lt;p&gt;✔ 1. Pixel influence is kept strictly independent&lt;/p&gt;

&lt;p&gt;We begin by separating pixels into “important” and “non-important” groups based on their attribution values.&lt;br&gt;
Even when multiple pixels are perturbed together, each pixel keeps its own dedicated evaluation record.&lt;br&gt;
This ensures that batching affects runtime only—not the meaning of the test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70vnl1v80ldyfn5zyhnj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70vnl1v80ldyfn5zyhnj.png" alt=" " width="726" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;✔ 2. Perturbations are organized and traceable&lt;/p&gt;

&lt;p&gt;Instead of random or ambiguous grouping, pixels are sorted and processed in stable, predictable batches.&lt;br&gt;
Every prediction difference returned from a batch is cleanly mapped back to the exact pixels involved, so there is never uncertainty about which change belongs to which feature.&lt;/p&gt;

&lt;p&gt;✔ 3. All internal shapes stay aligned and consistent&lt;/p&gt;

&lt;p&gt;By restructuring the flow around pixel-index mapping, all arrays used for tracking perturbation effects naturally share compatible dimensions.&lt;br&gt;
This removed the broadcasting conflicts and ensured that violation checks operate safely and efficiently.&lt;/p&gt;

&lt;p&gt;✔ 4. Stability holds across resolutions and configurations&lt;/p&gt;

&lt;p&gt;Because each pixel retains a one-to-one link with its own “effect slot”, the method behaves consistently whether the input is 32×32 or 224×224, and regardless of how large &lt;code&gt;features_in_step&lt;/code&gt; is.&lt;br&gt;
This allowed us to use batching to reduce runtime while preserving pixel-level correctness.&lt;/p&gt;
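&lt;p&gt;A minimal sketch of this restructuring, assuming a model that accepts a leading batch axis (hypothetical names, not the actual PR code): pixels are still processed in batches for speed, but each perturbed copy touches exactly one pixel, so every prediction difference maps back to its own per-pixel slot:&lt;/p&gt;

```python
import numpy as np

def per_pixel_differences(model, x, features_in_step=4, baseline=0.0):
    """Batched forward passes, strictly one perturbed pixel per copy.

    Batching only affects runtime: each row of the batch perturbs a single
    pixel, so each prediction difference lands in that pixel's own slot.
    """
    base_pred = model(x[None])[0]                # model takes a leading batch axis
    n = x.size
    diffs = np.empty(n)                          # one "effect slot" per pixel
    for start in range(0, n, features_in_step):
        idx = np.arange(start, min(start + features_in_step, n))
        batch = np.repeat(x[None], len(idx), axis=0)
        for row, i in enumerate(idx):
            batch[row].flat[i] = baseline        # each row perturbs one pixel only
        preds = model(batch)                     # one forward pass per batch
        diffs[idx] = np.abs(preds - base_pred)   # traceable pixel-to-slot mapping
    return diffs
```

&lt;p&gt;The arrays comparing attributions against prediction differences now always share a length of &lt;code&gt;n_pixels&lt;/code&gt;, whatever &lt;code&gt;features_in_step&lt;/code&gt; is, which is what removes the broadcasting conflict.&lt;/p&gt;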

&lt;p&gt;When the updated flow proved reliable across datasets and explainers, we contributed the implementation back to Quantus as an open-source PR.&lt;br&gt;
The PR is here: &lt;a href="https://github.com/understandable-machine-intelligence-lab/Quantus/pull/369" rel="noopener noreferrer"&gt;Fix NonSensitivity metric&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this enabled in AIXPlainer
&lt;/h2&gt;

&lt;p&gt;Once the metric was finally correct and efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We could measure Non-Sensitivity on real datasets, not just demos&lt;/li&gt;
&lt;li&gt;Evaluations became fast enough for users to actually explore explainers&lt;/li&gt;
&lt;li&gt;Stability comparisons across methods became meaningful&lt;/li&gt;
&lt;li&gt;Our metric suite gained a reliable axiomatic component&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;br&gt;
the app finally behaved like a real research-grade evaluation tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who should care about this?
&lt;/h3&gt;

&lt;p&gt;If you’re working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attribution method benchmarking&lt;/li&gt;
&lt;li&gt;Explainability evaluation&lt;/li&gt;
&lt;li&gt;Regulatory-grade transparency&lt;/li&gt;
&lt;li&gt;Model debugging tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then Non-Sensitivity is one of those axioms that helps you discover when your explainer might be silently misleading you.&lt;/p&gt;

&lt;p&gt;But only if the implementation is correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔚 Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Building XAI metrics is a fascinating blend of math, engineering, and “debugging the theory”.&lt;br&gt;
This journey taught us something important:&lt;/p&gt;

&lt;p&gt;Even the cleanest axioms require thoughtful engineering to become real, trustworthy tools.&lt;/p&gt;

&lt;p&gt;If you're building your own evaluation framework or integrating Quantus into your workflow, I hope this post saves you the same debugging hours it saved us.&lt;/p&gt;

&lt;p&gt;If you want the full implementation details or want to integrate similar metrics — feel free to reach out.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
  </channel>
</rss>
