Count-Min Sketch: The Fast, Memory-Efficient Way to Estimate Frequencies in Data Streams
In the realm of big data, accurately counting the frequency of elements in a massive stream can be a daunting task. Traditional methods often fall short due to memory constraints and the sheer volume of data. Enter Count-Min Sketch (CMS)βa probabilistic data structure designed to provide approximate frequency counts with minimal memory usage.
π§ What Is Count-Min Sketch?
Count-Min Sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses multiple hash functions to map events to frequencies, but unlike a hash table, it uses only sub-linear space.
π Core Idea
- A 2D matrix of size
d x w
(whered
is the number of hash functions andw
is the width). - Each incoming element is hashed
d
times (once per row) using different seeds or hash functions. - Each hash function computes a column index where the count is incremented.
- The estimated count is the minimum of the values across all hash functions.
β Simple Example
Problem
You have a small stream of words:
["apple", "banana", "apple"]
We use a CMS with:
-
d = 2
hash functions (rows) -
w = 5
columns (columns indexed 0 to 4)
Step-by-Step Example
- Start with a zero-initialized matrix:
Row\Col | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 |
- Add "apple":
- Hash1("apple") β 1 β increment table[0][1]
- Hash2("apple") β 3 β increment table[1][3]
Row\Col | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 0 |
- Add "banana":
- Hash1("banana") β 2 β increment table[0][2]
- Hash2("banana") β 1 β increment table[1][1]
Row\Col | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
1 | 0 | 1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 1 | 0 |
- Add "apple" again:
Final matrix:
Row\Col | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
1 | 0 | 2 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 2 | 0 |
Query Example: Count of "apple"
- Hash1("apple") β 1 β table[0][1] = 2
- Hash2("apple") β 3 β table[1][3] = 2
Estimated count β min(2, 2) = 2
βοΈ Trade-Offs: Accuracy vs. Memory
Feature | Count-Min Sketch |
---|---|
Memory Usage | Fixed, sub-linear |
Accuracy | Approximate |
Speed | Very Fast |
Overcounting | Possible due to hash collisions |
π Real-World Use Cases
1. Tracking Popular Hashtags in Social Media
Efficiently track the frequency of hashtags in high-throughput environments like Twitter. CMS estimates frequencies without storing every individual tweet.
2. Network Traffic Monitoring
Estimate frequency of IP addresses or URLs accessed, useful for anomaly detection.
3. Recommendation Systems
Estimate popular items or user interactions to provide personalized recommendations.
π‘ Example Java Implementation
Copy the following Java code into your project:
public class CountMinSketch {
private final int[][] table;
private final int[] seeds;
private final int rows;
private final int cols;
private final Random rand = new Random();
public CountMinSketch(int rows, int cols) {
this.rows = rows;
this.cols = cols;
this.table = new int[rows][cols];
this.seeds = new int[rows];
for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt();
}
public void add(String key) {
for (int i = 0; i < rows; i++) {
int index = Math.abs(hash(key, seeds[i]) % cols);
table[i][index]++;
}
}
public int count(String key) {
int min = Integer.MAX_VALUE;
for (int i = 0; i < rows; i++) {
int index = Math.abs(hash(key, seeds[i]) % cols);
min = Math.min(min, table[i][index]);
}
return min;
}
private int hash(String key, int seed) {
int hash = 0;
for (char c : key.toCharArray()) {
hash = hash * seed + c;
}
return hash;
}
}
β Conclusion
Count-Min Sketch provides a space-efficient solution for estimating frequencies in data streams, with trade-offs acceptable in many real-world scenarios where approximate answers are good enough.
For an advanced use case showing how to track Top-K Trending Hashtags, see our dedicated article:
How to Find Top-K Trending Hashtags Using CMS.
Top comments (0)