VipraTech Solutions for VipraTech Solutions

Posted on Sep 8

Count-Min Sketch: A Memory-Efficient Way to Track Frequencies in Data Streams

#cms #systemdesign #interview #bigdata

Count-Min Sketch: The Fast, Memory-Efficient Way to Estimate Frequencies in Data Streams

In the realm of big data, accurately counting the frequency of elements in a massive stream can be a daunting task. Traditional methods often fall short due to memory constraints and the sheer volume of data. Enter Count-Min Sketch (CMS)—a probabilistic data structure designed to provide approximate frequency counts with minimal memory usage.

🧠 What Is Count-Min Sketch?

Count-Min Sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses multiple hash functions to map events to frequencies, but unlike a hash table, it uses only sub-linear space.

🔍 Core Idea

A 2D matrix of size d x w (where d is the number of hash functions and w is the width).
Each incoming element is hashed d times (once per row) using different seeds or hash functions.
Each hash function computes a column index where the count is incremented.
The estimated count is the minimum of the values across all hash functions.

✅ Simple Example

Problem

You have a small stream of words:

["apple", "banana", "apple"]

We use a CMS with:

d = 2 hash functions (rows)
w = 5 columns (columns indexed 0 to 4)

Step-by-Step Example

Start with a zero-initialized matrix:

Row\Col	0	1	2	3	4
1	0	0	0	0	0
2	0	0	0	0	0

Add "apple":
- Hash1("apple") → 1 → increment table[0][1]
- Hash2("apple") → 3 → increment table[1][3]

Row\Col	0	1	2	3	4
1	0	1	0	0	0
2	0	0	0	1	0

Add "banana":
- Hash1("banana") → 2 → increment table[0][2]
- Hash2("banana") → 1 → increment table[1][1]

Row\Col	0	1	2	3	4
1	0	1	1	0	0
2	0	1	0	1	0

Add "apple" again:
- Hash1("apple") → 1 → increment table[0]1
- Hash2("apple") → 3 → increment table[1]3

Final matrix:

Row\Col	0	1	2	3	4
1	0	2	1	0	0
2	0	1	0	2	0

Query Example: Count of "apple"

Hash1("apple") → 1 → table[0][1] = 2
Hash2("apple") → 3 → table[1][3] = 2

Estimated count → min(2, 2) = 2

⚖️ Trade-Offs: Accuracy vs. Memory

Feature	Count-Min Sketch
Memory Usage	Fixed, sub-linear
Accuracy	Approximate
Speed	Very Fast
Overcounting	Possible due to hash collisions

🚀 Real-World Use Cases

1. Tracking Popular Hashtags in Social Media

Efficiently track the frequency of hashtags in high-throughput environments like Twitter. CMS estimates frequencies without storing every individual tweet.

2. Network Traffic Monitoring

Estimate frequency of IP addresses or URLs accessed, useful for anomaly detection.

3. Recommendation Systems

Estimate popular items or user interactions to provide personalized recommendations.

💡 Example Java Implementation

Copy the following Java code into your project:

public class CountMinSketch {
    private final int[][] table;
    private final int[] seeds;
    private final int rows;
    private final int cols;
    private final Random rand = new Random();

    public CountMinSketch(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.table = new int[rows][cols];
        this.seeds = new int[rows];
        for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt();
    }

    public void add(String key) {
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            table[i][index]++;
        }
    }

    public int count(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            min = Math.min(min, table[i][index]);
        }
        return min;
    }

    private int hash(String key, int seed) {
        int hash = 0;
        for (char c : key.toCharArray()) {
            hash = hash * seed + c;
        }
        return hash;
    }
}

✅ Conclusion

Count-Min Sketch provides a space-efficient solution for estimating frequencies in data streams, with trade-offs acceptable in many real-world scenarios where approximate answers are good enough.

For an advanced use case showing how to track Top-K Trending Hashtags, see our dedicated article:

How to Find Top-K Trending Hashtags Using CMS.

DEV Community