Hunting Duplicate Files in Azure Blob Storage (Without Melting Your Wallet)

A deep dive into how duplicate data builds up silently in Azure Blob Storage, how to find it programmatically, and an open-source tool to track it.

The Silent Cloud Creep

In any scaling enterprise cloud environment, storage costs don't jump—they creep. You look at your Azure billing dashboard at the end of the month, and the Blob Storage line item is 30% higher than it was last quarter.

When you start digging, you usually find the usual suspects:

Redundant staging backups from three migrations ago.
Multiple microservices uploading identical reporting assets.
App pipelines processing the same unindexed datasets repeatedly.

The problem? Azure Blob Storage doesn't have a native "find duplicates" button.

The usual answer is to handle deduplication at the application layer or to write custom automation. But writing a custom engine that safely audits millions of production blobs is a massive headache.

Let's look at how you can do this programmatically yourself, where it falls short, and a lightweight open-source tool I built to solve it in one command.

The Native Way: Scripting It with C# and the Azure SDK

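To find true duplicates, you can't rely on filenames or modified dates; you have to compare content hashes. Luckily, Azure stores a Content-MD5 property for many blobs and returns it in blob listings, so you can often compare hashes without downloading any content.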
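To make "compare content hashes" concrete, here's a minimal sketch of what a Content-MD5 value is: the MD5 digest of the blob's bytes, Base64-encoded. The ContentHasher class and the sample strings are made up for illustration; the in-memory streams stand in for blob content.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Two "files" with different names but identical bytes, plus one that differs.
// These in-memory streams are stand-ins for real blob content.
var reportCopy1 = new MemoryStream(Encoding.UTF8.GetBytes("quarterly totals: 42"));
var reportCopy2 = new MemoryStream(Encoding.UTF8.GetBytes("quarterly totals: 42"));
var otherReport = new MemoryStream(Encoding.UTF8.GetBytes("quarterly totals: 43"));

var hash1 = ContentHasher.ToContentMd5(reportCopy1);
var hash2 = ContentHasher.ToContentMd5(reportCopy2);
var hash3 = ContentHasher.ToContentMd5(otherReport);

// Same bytes give the same hash, whatever the file is called.
Console.WriteLine($"copy1 vs copy2 match: {hash1 == hash2}");
Console.WriteLine($"copy1 vs other match: {hash1 == hash3}");

// Hypothetical helper for this sketch, not part of the Azure SDK.
public static class ContentHasher
{
    // Reads the stream to the end and returns the Base64-encoded MD5 digest,
    // the same encoding the Content-MD5 header uses.
    public static string ToContentMd5(Stream content)
    {
        using var md5 = MD5.Create();
        return Convert.ToBase64String(md5.ComputeHash(content));
    }
}
```

MD5 is fine for spotting accidental duplicates, but it is not collision-resistant, so treat a match as a strong hint rather than proof before deleting anything.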

If you wanted to build a quick console app using the Azure.Storage.Blobs SDK to find duplicates, the core logic looks something like this:

using System;
using System.Collections.Generic;
using System.Linq;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var connectionString = "Your_Azure_Storage_Connection_String";
var containerName = "production-data";

var blobServiceClient = new BlobServiceClient(connectionString);
var containerClient = blobServiceClient.GetBlobContainerClient(containerName);

// Map each Content-MD5 value (Base64) to the blob paths that share it
var hashDictionary = new Dictionary<string, List<string>>();

// Content-MD5 is part of the standard listing properties, so no extra traits are requested
await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
{
    // ContentHash is null when no Content-MD5 was stored for the blob.
    // Skip those here; Convert.ToBase64String would throw on a null array.
    var contentHash = blobItem.Properties.ContentHash;
    if (contentHash is null || contentHash.Length == 0)
    {
        continue;
    }

    var md5Hash = Convert.ToBase64String(contentHash);

    if (!hashDictionary.TryGetValue(md5Hash, out var paths))
    {
        paths = new List<string>();
        hashDictionary[md5Hash] = paths;
    }
    paths.Add(blobItem.Name);
}

// Filter out the unique files to expose the duplicates
var duplicates = hashDictionary.Where(kvp => kvp.Value.Count > 1);

foreach (var item in duplicates)
{
    Console.WriteLine($"Duplicate Hash: {item.Key}");
    foreach (var path in item.Value)
    {
        Console.WriteLine($" -> {path}");
    }
}

Why this approach gets painful in production:

While this works fine for a small staging container with a few thousand blobs, it falls apart at scale:

API Throttling & Pagination: A listing call returns at most 5,000 blobs per page, so a container with tens of millions of blobs needs thousands of sequential requests. A full scan can take hours and may get throttled along the way.
Missing MD5 Headers: If files are uploaded in blocks or through third-party tools that don't set it, the Content-MD5 property can be null, forcing you to download the blob and compute the hash yourself. That means reading every byte, paying for the read transactions, and paying egress too if the scan runs outside the storage account's region.
No Financial ROI Context: A text list of duplicate paths doesn't tell a DevOps or FinOps engineer what they actually care about: How many gigabytes are wasted, and how much money are we throwing away right now?
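That last point is mostly arithmetic: every copy beyond the first in a duplicate group is wasted space. Here's a minimal sketch of the calculation, assuming you already have each blob's path, Base64 MD5, and size from the listing. The BlobRecord, WasteReport, and WasteCalculator types, the sample rows, and the price per GB are all placeholders I made up for illustration; plug in the real rate for your tier and region.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical listing rows. In a real scan these would come from
// BlobItem.Name, BlobItem.Properties.ContentHash and ContentLength.
var blobs = new List<BlobRecord>
{
    new("backups/2021/report.pdf", "hashA", 50L * 1024 * 1024),
    new("backups/2022/report.pdf", "hashA", 50L * 1024 * 1024),
    new("staging/report-final.pdf", "hashA", 50L * 1024 * 1024),
    new("assets/logo.png", "hashB", 10L * 1024 * 1024),
    new("assets/old/logo.png", "hashB", 10L * 1024 * 1024),
    new("data/unique.csv", "hashC", 5L * 1024 * 1024),
};

// Placeholder price in USD per GB per month, NOT a real Azure rate.
const decimal AssumedPricePerGbMonth = 0.02m;

var report = WasteCalculator.Summarize(blobs, AssumedPricePerGbMonth);
Console.WriteLine($"Duplicate groups: {report.DuplicateGroups}");
Console.WriteLine($"Wasted bytes: {report.WastedBytes}");
Console.WriteLine($"Estimated monthly cost (USD): {report.EstimatedMonthlyCostUsd}");

public record BlobRecord(string Path, string Md5Base64, long SizeBytes);
public record WasteReport(int DuplicateGroups, long WastedBytes, decimal EstimatedMonthlyCostUsd);

public static class WasteCalculator
{
    public static WasteReport Summarize(IEnumerable<BlobRecord> blobs, decimal pricePerGbMonth)
    {
        // Group on hash AND size so blobs only count as duplicates when both agree.
        var duplicateGroups = blobs
            .GroupBy(b => (b.Md5Base64, b.SizeBytes))
            .Where(g => g.Count() > 1)
            .ToList();

        // Keep one copy per group; every additional copy is wasted space.
        long wastedBytes = duplicateGroups.Sum(g => g.Key.SizeBytes * (g.Count() - 1));

        decimal wastedGb = wastedBytes / (1024m * 1024m * 1024m);
        decimal monthlyCost = Math.Round(wastedGb * pricePerGbMonth, 4);

        return new WasteReport(duplicateGroups.Count, wastedBytes, monthlyCost);
    }
}
```

Grouping on size as well as hash is a cheap extra guard against acting on an MD5 collision. The estimate covers storage capacity only; it ignores transaction costs, redundancy options, and access tiers.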

Enter BlobPulse: The Open-Source, Local-First Scanner

I ran into this problem frequently enough that I decided to build a dedicated, open-source storage visualizer named BlobPulse.

It’s built using .NET Core and Next.js, packaged entirely as a lightweight Docker container. It scans your target containers, processes deduplication metrics via content hashing, and visualizes the exact dollar impact of your redundant data.

Core Features:

🔍 True Deduplication: Maps out identical files across different paths and containers using deep property analysis.

💰 FinOps Impact Assessment: Instantly translates wasted gigabytes into your estimated monthly/annual Azure storage bill savings.

🔒 100% Read-Only & Local-First: Infrastructure security matters. BlobPulse only requires a SAS token or connection string with Read/List permissions. It processes everything locally in memory inside your Docker runtime—your access keys and metadata never leave your network.
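If you'd rather not hand over a full connection string, you can mint a short-lived SAS limited to Read and List yourself. The sketch below uses the Azure.Storage.Blobs SDK to do that for a single container. The ReadOnlySas helper, the environment variable names, the container name, and the four-hour lifetime are my assumptions for illustration, and how you paste the result into BlobPulse depends on its UI, which isn't covered here.

```csharp
using System;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Sas;

// Assumed environment variable names for this sketch; use whatever your setup provides.
var accountName = Environment.GetEnvironmentVariable("AZURE_STORAGE_ACCOUNT");
var accountKey = Environment.GetEnvironmentVariable("AZURE_STORAGE_KEY");

if (string.IsNullOrEmpty(accountName) || string.IsNullOrEmpty(accountKey))
{
    Console.WriteLine("Set AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY to run this sketch.");
}
else
{
    var sasUri = ReadOnlySas.CreateContainerSasUri(
        accountName, accountKey, "production-data", TimeSpan.FromHours(4));
    Console.WriteLine(sasUri);
}

// Hypothetical helper for this sketch, not part of the Azure SDK or BlobPulse.
public static class ReadOnlySas
{
    public static Uri CreateContainerSasUri(
        string accountName, string accountKey, string containerName, TimeSpan lifetime)
    {
        var credential = new StorageSharedKeyCredential(accountName, accountKey);
        var containerClient = new BlobContainerClient(
            new Uri($"https://{accountName}.blob.core.windows.net/{containerName}"),
            credential);

        var sasBuilder = new BlobSasBuilder
        {
            BlobContainerName = containerName,
            Resource = "c", // "c" scopes the SAS to the whole container
            ExpiresOn = DateTimeOffset.UtcNow.Add(lifetime),
            Protocol = SasProtocol.Https,
        };

        // Read + List only: enough to enumerate and read blobs, not to write or delete.
        sasBuilder.SetPermissions(
            BlobContainerSasPermissions.Read | BlobContainerSasPermissions.List);

        return containerClient.GenerateSasUri(sasBuilder);
    }
}
```

Signing happens locally with the account key, so generating the URI makes no network call. A short expiry keeps the blast radius small if the token leaks.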

Spin it up in 30 seconds

Because developers hate complex onboarding, you can run the tool locally or on your cloud cluster with a single Docker command:

docker run -d -p 3000:3000 -p 8000:8000 --name blobpulse tarun06/blobpulse:latest

Open http://localhost:3000 in your browser, plug in your read-only token, and let it map out your environment.

How does your team handle storage sprawl?

BlobPulse is completely open-source and actively evolving. I’d love to get the community's feedback on it:

How do you currently audit data drift and duplication in your cloud buckets?

What premium or enterprise metrics would make your FinOps reporting easier?

Check out the repository, submit an issue, or drop a star if it helps you save some cloud budget this week!

⭐ GitHub Repo: https://github.com/tarun06/blobpulse