in this post I'll present part of my master degree field study: probabilistic data structures or just sketches. The latest post I describe what sketches means if you're missed.
In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.
Count–min sketches are essentially the same data structure as the counting Bloom filters introduced in 1998 by Fan et al. However, they are used differently and therefore sized differently: a count-min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.
- Input data: google.com (1), twitter.com (1), instagram.com (4), apple.com (10), amazon.com (5), ebay.com (3), aliexmpress.com (12) and register.com (2).
- Hash functions: h1, h2 and h3.
- Count-min matrix with 10 positions by hash function called
Now, iterating through all input data we're going to perform this operation:
M[h1('google.com')] += 1 M[h2('google.com')] += 1 ... M[h3('register.com')] += 1
Suppose this result:
Checking 'amazon.com' frequency at Count-min?
h1('amazon.com') = 3 h2('amazon.com') = 4 h3('amazon.com') = 7
M values we obtains this result: 5, 8, 6.
Now, the answer for frequency of 'amazon.com' is probably 5, minimal value, because the others maybe influenced by hash collision.
Checking if 'facebook.com' is presented at Count-min?
h1('facebook.com') = 9 h2('facebook.com') = 6 h3('facebook.com') = 4
M in this case is different of zero, so 'facebook.com' isn't presented at Count-min.
To see Count-min in practice I'm going to use PyProbables:
pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.
Let's try this sample code:
from probables import CountMinSketch # Init Count-min estimated with 50 positions and 5 hash functions cms = CountMinSketch(width=50, depth=5) # Add some items cms.add('google.com') cms.add('twitter.com',1) cms.add('instagram.com',4) cms.add('apple.com',10) cms.add('amazon.com',5) cms.add('ebay.com',3) cms.add('aliexpress.com',12) cms.add('register.com',2) # Checking print('twitter.com frequence:', cms.check('twitter.com')) print('facebook.com frequence:', cms.check('facebook.com')) print('amazon.com frequence:', cms.check('amazon.com')) print(cms)
twitter.com frequence: 1 facebook.com frequence: 0 amazon.com frequence: 5 Count-Min Sketch: Width: 50 Depth: 5 Confidence: 0.96875 Error Rate: 0.04 Elements Added: 38
- How to get highest frequency element.