DEV Community

Tova A

Cleaning Up Complexity: Preprocessing Attribution Maps for Better Evaluation

I wanted to compare attribution maps from different XAI methods for vision models, using the Complexity metric from the Quantus library.

The idea was simple:

If a heatmap looks clean and focused, it should have lower complexity than a noisy, scattered one.

In practice, that’s not what happened.
Some maps that were visually sharp and localised got high (bad) Complexity scores.
Other maps that looked messy or stretched over the whole image got surprisingly low scores.

Heatmaps Comparison

On the left is Guided Backprop, which spreads activation all over the image.
On the right is Fusion Grad, which is much more sparse and focused on the relevant structures.
But in my initial setup, the Quantus Complexity metric gave Fusion Grad a worse (higher) score than Guided Backprop: a clear mismatch between what the eye sees and what the metric reports.

The metric was doing exactly what it was defined to do, but it was reacting to things like scale, padding, resolution, and sign conventions, not just to the “shape” of the explanation.

That’s when it became clear: before evaluating attribution maps, you need to standardise them. Otherwise, you’re mostly comparing formatting differences between methods, not their actual behaviour.

In this post, I’ll show how I preprocess raw attribution maps into a canonical, evaluation-ready form before passing them to Quantus metrics.

At first I tried to “fix” this by using one of Quantus’s built-in functions for normalise_func, but it didn’t change the ranking in a meaningful way.
The real issue wasn’t the overall scale – it was the pedestal:
both methods produced a low but non-zero activation almost everywhere in the image.
Guided Backprop had a noisy background plus a pedestal, while Fusion Grad had a very thin, sharp signal on top of its own pedestal.
Complexity only sees “how much structure lives above zero”.

If you keep the pedestal, Fusion Grad’s thin signal sits on a wide plateau and ends up looking more complex numerically than the noisier Guided Backprop map.

That’s why the next step was not “better normalisation”, but explicitly removing or reducing the pedestal before computing Complexity.
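
The argument above can be checked on a toy example. This is a minimal sketch, assuming (as I understand the metric's definition) that Complexity is the Shannon entropy of the absolute attributions after they are normalised to sum to one. The helper entropy_complexity and the synthetic maps are hypothetical stand-ins, not Quantus code.

```python
import numpy as np

def entropy_complexity(attr_map: np.ndarray) -> float:
    """Shannon entropy of |attr| normalised to sum to 1 (toy stand-in, not Quantus code)."""
    p = np.abs(attr_map).ravel().astype(np.float64)
    p = p / p.sum()
    p = p[p > 0]  # by convention 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())

# A thin, one-pixel-wide line on an otherwise empty 64x64 map.
sparse = np.zeros((64, 64))
sparse[32, :] = 1.0

# The same line sitting on a small constant pedestal.
with_pedestal = sparse + 0.05

print("sparse:       ", entropy_complexity(sparse))
print("sparse x 10:  ", entropy_complexity(10 * sparse))    # rescaling changes nothing
print("with pedestal:", entropy_complexity(with_pedestal))  # pedestal raises the score
```

Under this assumption the score is invariant to multiplying the map by a constant, which would explain why a pure rescaling could not change the ranking, while even a faint pedestal spread over thousands of pixels carries enough total mass to dominate the entropy.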

Baseline-Subtraction Normalization

Instead of relying on the default normalise_func, I implemented a custom one that does two things per attribution map:

  1. Baseline removal (pedestal):
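    Compute a low quantile of the map and treat it as a baseline (the code below defaults to the 20th percentile, baseline_quantile=0.2; a smaller value such as the 5th percentile removes less).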
    Subtract this baseline from all values and clamp negatives to zero. This removes the global “pedestal” while keeping the meaningful peaks.

  2. 0–1 normalisation:
    After baseline removal, rescale the map to the [0, 1] range so that every sample reaches Complexity on a common scale rather than in raw, arbitrary units.

import numpy as np

def baseline_subtraction_norm(attr_map: np.ndarray,
                              baseline_quantile: float = 0.2) -> np.ndarray:
    """
    Normalize an attribution map for evaluation:
    1) subtract a low quantile as baseline (pedestal removal),
    2) clamp to >= 0,
    3) rescale to [0, 1].
    """
    # 1. pedestal removal
    baseline = np.quantile(attr_map, baseline_quantile)
    x = attr_map - baseline
    x = np.clip(x, a_min=0.0, a_max=None)

    # 2. scale to [0, 1]
    max_val = x.max()
    if max_val > 0:
        x = x / max_val
    return x


You can then pass the custom function to the Quantus Complexity metric through normalise_func:

import quantus

complexity_metric = quantus.Complexity(
    abs=True,
    normalise=True,
    normalise_func=baseline_subtraction_norm,
)

scores = complexity_metric(
    model=model,
    x_batch=x_batch,      # input images
    y_batch=y_batch,      # targets
    a_batch=attr_maps,    # attribution maps
)
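
One caveat I have not verified against every Quantus release: depending on the version, normalise_func may be called on the whole a_batch at once, possibly with extra keyword arguments, rather than once per map. In that case a single quantile would be computed across all images instead of per sample. Below is a minimal batch-aware sketch, assuming an input of shape (batch, ...). The function name, the first-argument name a, and the ignored **kwargs are my assumptions, not confirmed Quantus API.

```python
import numpy as np

def baseline_subtraction_norm_batch(a: np.ndarray,
                                    baseline_quantile: float = 0.2,
                                    **kwargs) -> np.ndarray:
    """Per-sample pedestal removal and [0, 1] scaling for a batch of maps.

    Hypothetical helper: `a` is assumed to have shape (batch, ...);
    any extra keyword arguments are accepted and ignored.
    """
    a = np.asarray(a, dtype=np.float64)
    flat = a.reshape(a.shape[0], -1)  # one row per attribution map

    # 1. pedestal removal, with one baseline per sample
    baseline = np.quantile(flat, baseline_quantile, axis=1, keepdims=True)
    x = np.clip(flat - baseline, 0.0, None)

    # 2. scale each sample to [0, 1]; maps that are all zero stay zero
    max_val = x.max(axis=1, keepdims=True)
    x = np.divide(x, max_val, out=np.zeros_like(x), where=max_val > 0)
    return x.reshape(a.shape)

# Two maps with the same shape but very different offset and scale.
ramp = np.linspace(0.0, 1.0, 16).reshape(4, 4)
demo_batch = np.stack([ramp, 5.0 + 100.0 * ramp])
demo_out = baseline_subtraction_norm_batch(demo_batch)
print(np.allclose(demo_out[0], demo_out[1]))
```

Because the quantile and the maximum are taken per row, a map with a large offset cannot shift the baseline of its neighbours in the batch.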


Below you can see the value distribution of Fusion Grad before and after pedestal removal.
After subtracting the baseline, most background pixels are exactly zero, and the Complexity metric reacts much more to the actual structure around the defect line and contact.
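
How many pixels end up at exactly zero depends on the quantile you pick and on how flat the pedestal is. The sketch below measures this on a synthetic map (a noisy pedestal plus one bright line, which is my own stand-in and not the data from the figures). The helper zero_fraction_after_baseline is hypothetical.

```python
import numpy as np

def zero_fraction_after_baseline(attr_map: np.ndarray,
                                 baseline_quantile: float) -> float:
    """Fraction of pixels that are exactly zero after quantile-baseline subtraction."""
    baseline = np.quantile(attr_map, baseline_quantile)
    x = np.clip(attr_map - baseline, 0.0, None)
    return float(np.mean(x == 0.0))

rng = np.random.default_rng(0)

# Synthetic stand-in for a heatmap: noisy pedestal plus one bright line.
toy_map = 0.1 + 0.02 * rng.random((64, 64))
toy_map[32, :] += 1.0

for q in (0.05, 0.2, 0.5, 0.9):
    print(f"quantile={q:.2f}  zero fraction={zero_fraction_after_baseline(toy_map, q):.3f}")
```

With continuous background noise, the share of zeroed pixels tracks the chosen quantile closely, so a low quantile only trims the bottom of the pedestal. When the pedestal is almost perfectly constant, many pixels tie at the baseline value and a larger share drops to zero. It is worth plotting the histogram for your own maps before fixing baseline_quantile.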

Distributions of fusion_grad heatmap, before vs after normalization


Best practice
Before applying quantitative metrics to attribution maps, make preprocessing explicit and consistent. Remove method-specific pedestals, standardize sign conventions, and rescale per sample. Otherwise, metrics like Complexity primarily measure implementation artefacts (background mass, padding, resolution) rather than explanatory structure.
