DEV Community

Itay Eylath

From Metrics to Action: Turning Embedding Analysis into Sprint Tickets


In an agile Computer Vision startup, global accuracy is a vanity metric.

It tells you the model is working.

It doesn’t tell you what to fix next.

To move fast, you need actionable signals, not dashboards.

Here’s how we transform raw embeddings into sprint-ready decisions.

1. Stop Calculating Accuracy. Start Finding Confusion.

A single accuracy score hides local failures.

Instead of asking “How accurate is the model?”, ask:

Where exactly is the model failing?

We use 1-Nearest Neighbor (1-NN) evaluation to expose structural confusion between visually similar items.

import numpy as np
from sklearn.metrics import pairwise_distances

# embeddings: (n_samples, n_dims) array; labels: (n_samples,) array
D = pairwise_distances(embeddings, metric="euclidean")

# Ignore self-comparisons
np.fill_diagonal(D, np.inf)

# Identify each sample's closest neighbor
closest_idx = np.argmin(D, axis=1)
is_correct = labels == labels[closest_idx]

Instead of celebrating 88% accuracy, we extract:

  • Which classes are confused

  • How often confusion happens

  • Whether errors are isolated or concentrated

This immediately narrows the problem space.
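A minimal sketch of that extraction step: counting which label pairs the 1-NN errors concentrate on. The toy embeddings and class names here are illustrative, not from the post.

```python
import numpy as np
from collections import Counter
from sklearn.metrics import pairwise_distances

# Toy embeddings: two clusters, plus one "dog" sample that strays
# toward the "cat" cluster (illustrative data).
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # class "cat"
    [5.0, 5.0], [5.1, 5.0], [0.2, 0.1],   # class "dog" (last one strays)
])
labels = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

D = pairwise_distances(embeddings, metric="euclidean")
np.fill_diagonal(D, np.inf)
closest_idx = np.argmin(D, axis=1)

# Count (true label -> nearest-neighbor label) pairs for errors only
errors = labels != labels[closest_idx]
confusion_pairs = Counter(zip(labels[errors], labels[closest_idx][errors]))

for (true_cls, confused_with), count in confusion_pairs.most_common(5):
    print(f"{true_cls} -> {confused_with}: {count} errors")
```

The `most_common` ranking is exactly what feeds step 2 of the iteration loop: the top pairs become candidate tickets.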

2. Margin: Measuring Model Uncertainty

Accuracy is binary.

Confidence is continuous.

We define Margin as the gap between the closest and second closest neighbors:

Margin = Dist_{2nd} - Dist_{1st}

If the margin is near zero, the model is effectively guessing between two nearly identical items.

# Sort distances per sample (D from the 1-NN step above,
# diagonal already set to inf)
sorted_dists = np.sort(D, axis=1)

# Margin: gap between closest and second-closest neighbor
margin = sorted_dists[:, 1] - sorted_dists[:, 0]

# Flag unstable samples (threshold depends on embedding scale)
low_confidence = margin < 0.05

Low-margin samples are early warning signals.

They tell you where instability lives before accuracy visibly drops.
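To turn per-sample margins into a per-class ranking (as in the iteration loop later), one option is to average margins by class and sort ascending. A minimal sketch with illustrative toy data:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy data (illustrative): classes "a" and "b" sit close together,
# class "c" is far away, so "c" should have the largest margins.
embeddings = np.array([
    [0.0, 0.0], [0.2, 0.0],    # class "a"
    [0.3, 0.0], [0.5, 0.0],    # class "b" (close to "a")
    [10.0, 0.0], [10.2, 0.0],  # class "c" (well separated)
])
labels = np.array(["a", "a", "b", "b", "c", "c"])

D = pairwise_distances(embeddings, metric="euclidean")
np.fill_diagonal(D, np.inf)
sorted_dists = np.sort(D, axis=1)
margin = sorted_dists[:, 1] - sorted_dists[:, 0]

# Mean margin per class: lower = more unstable boundary
class_margin = {cls: margin[labels == cls].mean() for cls in np.unique(labels)}
ranked = sorted(class_margin, key=class_margin.get)
print(ranked)  # most unstable classes first
```

The head of `ranked` is the shortlist of classes worth targeted data collection or fine-tuning.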

3. Visualization for Intuition (Not Decisions)

Metrics drive action.

Visualization builds intuition.

We use PCA or t-SNE to project embeddings into 2D space:

from sklearn.manifold import TSNE

tsne_results = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42
).fit_transform(embeddings)

Overlapping clusters often explain:

  • Why two classes are confused

  • Whether boundaries are weak

  • Whether the representation itself lacks separation

Visualization doesn’t replace metrics; it explains them.


4. The Operational Matrix: Converting Signals to Ownership

Analysis only matters if someone owns the fix.

We translate technical signals into clear action items:

| Signal | Meaning | Action Item | Owner |
| --- | --- | --- | --- |
| High intra-class distance | Same-class samples are visually inconsistent | Clean dataset (remove low-quality samples) | Data Ops |
| Strong confusion pair | Two distinct classes overlap heavily | Add discriminative features or review labeling | CV / Product |
| Low-margin cluster | Category boundary is unstable | Targeted data collection or fine-tuning | ML Team |

This is where evaluation becomes execution.
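The first signal in the table, intra-class distance, hasn’t appeared in code yet. One way to compute it is the mean pairwise distance within each class; a minimal sketch, with toy data and a placeholder threshold (both illustrative):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy data (illustrative): class "a" is tight, class "b" is scattered,
# so "b" should trigger the high-intra-class-distance signal.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # class "a": tight
    [5.0, 5.0], [9.0, 1.0], [1.0, 9.0],   # class "b": scattered
])
labels = np.array(["a", "a", "a", "b", "b", "b"])

def intra_class_distance(embeddings, labels):
    """Mean pairwise distance within each class (higher = less consistent)."""
    result = {}
    for cls in np.unique(labels):
        pts = embeddings[labels == cls]
        D = pairwise_distances(pts, metric="euclidean")
        n = len(pts)
        # Average over off-diagonal entries only
        result[cls] = D.sum() / (n * (n - 1))
    return result

scores = intra_class_distance(embeddings, labels)
flagged = [cls for cls, d in scores.items() if d > 1.0]  # placeholder threshold
print(scores, flagged)
```

Flagged classes map straight to the "clean dataset" row of the table, with Data Ops as the owner.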

5. The Agile Loop

Every iteration follows the same structure:

  1. Run embedding evaluation

  2. Extract top confusion pairs

  3. Rank unstable classes by margin

  4. Convert top issues into sprint tickets

  5. Re-run metrics after fixes

Metrics → Ownership → Action → Re-measure

No guesswork.

No intuition battles.

No blind retraining.

What We Learned

  • Most persistent errors were structural, not architectural.

  • Margin was a better early warning signal than accuracy.

  • Concentrated confusion mattered more than global percentage.

  • Ownership accelerated improvement more than model complexity.

The biggest shift wasn’t technical.

It was operational.

The Bottom Line

In real-world Computer Vision systems, the difference between stable and unstable performance isn’t a new architecture.

It’s a tighter feedback loop.

If a metric doesn’t lead to a sprint ticket, it’s just noise.

The best models aren’t the ones with the highest accuracy.

They’re the ones whose failures are measurable, owned, and shrinking every iteration.

This project is open source; feel free to explore the code on GitHub and try it yourself:
https://github.com/itayeylath/metrics-to-actions
