<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tamar.S</title>
    <description>The latest articles on DEV Community by Tamar.S (@t_s_7da0b0e0e14e6b58).</description>
    <link>https://dev.to/t_s_7da0b0e0e14e6b58</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3653325%2Fec8c655d-5698-4ffe-bdfd-4450422917a5.png</url>
      <title>DEV Community: Tamar.S</title>
      <link>https://dev.to/t_s_7da0b0e0e14e6b58</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/t_s_7da0b0e0e14e6b58"/>
    <language>en</language>
    <item>
      <title>The Right Way to Measure Axiomatic Non-Sensitivity in XAI</title>
      <dc:creator>Tamar.S</dc:creator>
      <pubDate>Sun, 18 Jan 2026 15:55:17 +0000</pubDate>
      <link>https://dev.to/t_s_7da0b0e0e14e6b58/the-right-way-to-measure-axiomatic-non-sensitivity-in-xai-bc7</link>
      <guid>https://dev.to/t_s_7da0b0e0e14e6b58/the-right-way-to-measure-axiomatic-non-sensitivity-in-xai-bc7</guid>
      <description>&lt;h1&gt;
  
  
  The Right Way to Measure Axiomatic Non-Sensitivity
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Why your XAI metric might lie to you — and how we fixed it
&lt;/h3&gt;

&lt;p&gt;If you’ve ever tried to actually measure how stable your attribution maps are, you probably ran into the same surprising thing we did: the theory behind axiomatic metrics sounds clean and elegant…&lt;br&gt;
…but the real-world implementation?&lt;br&gt;
Not always.&lt;/p&gt;

&lt;p&gt;During the development of AIXPlainer, our explainability evaluation app, we wanted to include the Non-Sensitivity metric — a classic axiom that checks whether pixels with zero attribution truly have no influence on the model. Sounds simple, right?&lt;/p&gt;

&lt;p&gt;Well… almost.&lt;/p&gt;

&lt;p&gt;Once we moved from toy examples to real images, and from single-pixel perturbations to batch evaluations, things broke. Hard.&lt;/p&gt;

&lt;p&gt;Before diving into how we solved it, I would like to thank AMAT and Extra-Tech for providing the tools and professional environment for our project. This work was developed in collaboration with Shmuel Fine, whose insights shaped the structural direction of the solution, and I also benefited from the guidance of Odelia Movadat, who supported the design of the overall evaluation process.&lt;br&gt;
And that’s exactly what this post is about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what goes wrong,&lt;/li&gt;
&lt;li&gt;why it matters, and&lt;/li&gt;
&lt;li&gt;how we built a correct and efficient implementation that eventually became a PR to the Quantus library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h3&gt;
  
  
  Wait, what is Non-Sensitivity again?
&lt;/h3&gt;

&lt;p&gt;The idea is beautiful:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If a pixel receives zero attribution, perturbing it shouldn’t change the model’s prediction.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s a sanity check.&lt;br&gt;
If your explainer claims a pixel “doesn’t matter”, then changing it shouldn’t matter.&lt;/p&gt;

&lt;p&gt;To test that, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Take the input image&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perturb some pixel(s)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-run the model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compare the attribution map vs. the prediction change&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns46s2bcmj5zv8uenqf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns46s2bcmj5zv8uenqf0.png" alt=" " width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79mvcgte36zhv2j59xsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79mvcgte36zhv2j59xsa.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Quantify the violations between what the heatmap claims and the actual difference between the predictions after the perturbation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple.&lt;/p&gt;

&lt;p&gt;Until you try to make it fast.&lt;/p&gt;
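&lt;p&gt;The per-pixel check described above can be sketched in a few lines of NumPy. This is a toy illustration with hypothetical names (&lt;code&gt;model&lt;/code&gt;, &lt;code&gt;non_sensitivity_violations&lt;/code&gt;), not the Quantus API:&lt;/p&gt;

```python
import numpy as np

def non_sensitivity_violations(model, x, attributions,
                               eps=1e-6, tol=1e-6, baseline=0.0):
    """Count Non-Sensitivity violations, one pixel at a time.

    A violation occurs when the attribution map and the model disagree:
    a pixel with (near-)zero attribution changes the prediction when
    perturbed, or a pixel with non-zero attribution does not.
    """
    base_pred = model(x)
    violations = 0
    for i in range(x.size):
        x_pert = x.copy()
        x_pert.flat[i] = baseline                 # perturb a single pixel
        delta = abs(model(x_pert) - base_pred)    # re-run the model
        claims_irrelevant = abs(attributions.flat[i]) <= eps
        had_no_effect = delta <= tol
        if claims_irrelevant != had_no_effect:    # claim vs. reality mismatch
            violations += 1
    return violations
```

&lt;p&gt;For a linear toy model whose weights double as the ground-truth attributions, this counts zero violations, while a deliberately wrong attribution map immediately surfaces them.&lt;/p&gt;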
&lt;h3&gt;
  
  
  Where theory collapsed: the &lt;code&gt;features_in_step&lt;/code&gt; trap
&lt;/h3&gt;

&lt;p&gt;For high-resolution images, evaluating one pixel at a time is simply too slow to be practical.&lt;br&gt;
So Quantus allows processing several pixels at once using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features_in_step = N   # number of pixels perturbed in each step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great idea… in theory.&lt;/p&gt;

&lt;p&gt;In practice, two big things went wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem #1 — Group perturbations break the math
&lt;/h3&gt;

&lt;p&gt;When you perturb N pixels together, the prediction difference reflects a mixed effect:&lt;br&gt;
Was it pixel #3 that caused the change? Pixel #4? Or the combination?&lt;/p&gt;

&lt;p&gt;But the metric treated each pixel as if the model responded to it individually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49sgqa28hab9jwtw249.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg49sgqa28hab9jwtw249.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;br&gt;
This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True violations were hidden&lt;/li&gt;
&lt;li&gt;False violations appeared&lt;/li&gt;
&lt;li&gt;Results became inconsistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words — the metric was no longer measuring its own definition.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problem #2 — Shape mismatches exploded the runtime
&lt;/h3&gt;

&lt;p&gt;Deep inside the metric, Quantus tried to do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;preds_differences XOR non_features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But with group perturbations:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preds_differences&lt;/code&gt; = number of perturbation steps&lt;/p&gt;

&lt;p&gt;&lt;code&gt;non_features&lt;/code&gt; = number of pixels&lt;/p&gt;

&lt;p&gt;These two sizes are not equal unless &lt;code&gt;features_in_step&lt;/code&gt; = 1.&lt;/p&gt;

&lt;p&gt;Boom → &lt;code&gt;ValueError: operands could not be broadcast together&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;strong&gt;The metric literally could not run in the mode needed to make it efficient.&lt;/strong&gt;&lt;/p&gt;
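&lt;p&gt;Here is a minimal, self-contained illustration of the mismatch (toy sizes, not Quantus internals): there is one prediction difference per &lt;em&gt;step&lt;/em&gt; but one flag per &lt;em&gt;pixel&lt;/em&gt;, so any elementwise operation between the two arrays fails to broadcast:&lt;/p&gt;

```python
import numpy as np

n_pixels = 12
features_in_step = 4
n_steps = n_pixels // features_in_step           # 3 batched perturbation steps

preds_differences = np.zeros(n_steps)            # shape (3,): one per step
non_features = np.zeros(n_pixels, dtype=bool)    # shape (12,): one per pixel

try:
    np.logical_xor(preds_differences, non_features)
except ValueError as err:
    print("Broadcast failed:", err)              # shapes (3,) and (12,) are incompatible
```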

&lt;h2&gt;
  
  
  Our approach: make it correct, then make it fast
&lt;/h2&gt;

&lt;p&gt;To address the issues, we restructured the entire evaluation flow so that batching improves performance without compromising the integrity of pixel-level reasoning.&lt;/p&gt;

&lt;p&gt;✔ 1. Pixel influence is kept strictly independent&lt;/p&gt;

&lt;p&gt;We begin by separating pixels into “important” and “non-important” groups based on their attribution values.&lt;br&gt;
Even when multiple pixels are perturbed together, each pixel keeps its own dedicated evaluation record.&lt;br&gt;
This ensures that batching affects runtime only—not the meaning of the test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70vnl1v80ldyfn5zyhnj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70vnl1v80ldyfn5zyhnj.png" alt=" " width="726" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;✔ 2. Perturbations are organized and traceable&lt;/p&gt;

&lt;p&gt;Instead of random or ambiguous grouping, pixels are sorted and processed in stable, predictable batches.&lt;br&gt;
Every prediction difference returned from a batch is cleanly mapped back to the exact pixels involved, so there is never uncertainty about which change belongs to which feature.&lt;/p&gt;

&lt;p&gt;✔ 3. All internal shapes stay aligned and consistent&lt;/p&gt;

&lt;p&gt;By restructuring the flow around pixel-index mapping, all arrays used for tracking perturbation effects naturally share compatible dimensions.&lt;br&gt;
This removed the broadcasting conflicts and ensured that violation checks operate safely and efficiently.&lt;/p&gt;

&lt;p&gt;✔ 4. Stability holds across resolutions and configurations&lt;/p&gt;

&lt;p&gt;Because each pixel retains a one-to-one link with its own “effect slot”, the method behaves consistently whether the input is 32×32 or 224×224, and regardless of how large &lt;code&gt;features_in_step&lt;/code&gt; is.&lt;br&gt;
This allowed us to use batching to reduce runtime while preserving pixel-level correctness.&lt;/p&gt;
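&lt;p&gt;A minimal sketch of this restructuring, assuming a model that accepts a leading batch axis (hypothetical names, not the actual PR code): pixels are still processed in batches for speed, but each perturbed copy touches exactly one pixel, so every prediction difference maps back to its own per-pixel slot:&lt;/p&gt;

```python
import numpy as np

def per_pixel_differences(model, x, features_in_step=4, baseline=0.0):
    """Batched forward passes, strictly one perturbed pixel per copy.

    Batching only affects runtime: each row of the batch perturbs a single
    pixel, so each prediction difference lands in that pixel's own slot.
    """
    base_pred = model(x[None])[0]                # model takes a leading batch axis
    n = x.size
    diffs = np.empty(n)                          # one "effect slot" per pixel
    for start in range(0, n, features_in_step):
        idx = np.arange(start, min(start + features_in_step, n))
        batch = np.repeat(x[None], len(idx), axis=0)
        for row, i in enumerate(idx):
            batch[row].flat[i] = baseline        # each row perturbs one pixel only
        preds = model(batch)                     # one forward pass per batch
        diffs[idx] = np.abs(preds - base_pred)   # traceable pixel-to-slot mapping
    return diffs
```

&lt;p&gt;The arrays comparing attributions against prediction differences now always share a length of &lt;code&gt;n_pixels&lt;/code&gt;, whatever &lt;code&gt;features_in_step&lt;/code&gt; is, which is what removes the broadcasting conflict.&lt;/p&gt;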

&lt;p&gt;When the updated flow proved reliable across datasets and explainers, we contributed the implementation back to Quantus as an open-source PR.&lt;br&gt;
The PR is here: &lt;a href="https://github.com/understandable-machine-intelligence-lab/Quantus/pull/369" rel="noopener noreferrer"&gt;Fix NonSensitivity metric&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this enabled in AIXPlainer
&lt;/h2&gt;

&lt;p&gt;Once the metric was finally correct and efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We could measure Non-Sensitivity on real datasets, not just demos&lt;/li&gt;
&lt;li&gt;Evaluations became fast enough for users to actually explore explainers&lt;/li&gt;
&lt;li&gt;Stability comparisons across methods became meaningful&lt;/li&gt;
&lt;li&gt;Our metric suite gained a reliable axiomatic component&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;br&gt;
the app finally behaved like a real research-grade evaluation tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who should care about this?
&lt;/h3&gt;

&lt;p&gt;If you’re working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attribution method benchmarking&lt;/li&gt;
&lt;li&gt;Explainability evaluation&lt;/li&gt;
&lt;li&gt;Regulatory-grade transparency&lt;/li&gt;
&lt;li&gt;Model debugging tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then Non-Sensitivity is one of those axioms that helps you discover when your explainer might be silently misleading you.&lt;/p&gt;

&lt;p&gt;But only if the implementation is correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔚 Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Building XAI metrics is a fascinating blend of math, engineering, and “debugging the theory”.&lt;br&gt;
This journey taught us something important:&lt;/p&gt;

&lt;p&gt;Even the cleanest axioms require thoughtful engineering to become real, trustworthy tools.&lt;/p&gt;

&lt;p&gt;If you're building your own evaluation framework or integrating Quantus into your workflow, I hope this post saves you the same debugging hours it saved us.&lt;/p&gt;

&lt;p&gt;If you want the full implementation details or want to integrate similar metrics — feel free to reach out.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
  </channel>
</rss>
