DEV Community

Cover image for Skipfilter
Kevin Burns
Kevin Burns

Posted on • Updated on


A skipfilter is a skip list of arbitrary elements that can be efficiently filtered using roaring bitmaps stored in an LRU cache.

We will use this data structure to make an open source pub/sub messaging system 20,000x more efficient.

Use Case

Mercure is a pub/sub message protocol for Server Sent Events.

  • Topics are identified by their URI
  • Subscriptions are expressed as URI Templates (topic selectors)


Each time a message is published to the hub, every topic selector in every subscription must be tested against every one of the message's topics. So if you have 5000 subscribers with 3 topic selectors each and each message is published to 2 topics, that's 30,000 topic selector comparisons per message.

The Mercure Hub originally dealt with this issue using an in memory cache to eliminate duplicate regular expression evaluations.

  • Key (string): ${uri_template}-${topic_uri}
  • Value (boolean): regexp.Match(uri_template, topic_uri)

While this worked well initially to improve latency, it still required 30,000 hash table lookups per message which did not scale well to large numbers of subscribers.

[leak] High CPU usage with a lot of subscribers #558


This issue has been created to track down the problem that I'm having with Mercure:

Mercure uses almost %100 CPU, doesn't matter what specs you've.

The reason is probably a lot of subscribers and updates published to them:

# TYPE mercure_subscribers gauge
mercure_subscribers 4299
# HELP mercure_subscribers_total Total number of handled subscribers
# TYPE mercure_subscribers_total counter
mercure_subscribers_total 2.939146e+06
# HELP mercure_updates_total Total number of handled updates
# TYPE mercure_updates_total counter
mercure_updates_total 7.891176e+06

This is just for 2 days, this is a reduced amount, normally it's up to x4 of updates that are currently disabled.

Here is debug profile of CPU:

File: caddy
Type: cpu
Time: Aug 26, 2021 at 6:53am
Duration: 300.16s, Total samples = 461.92s (153.89%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 231.78s, 50.18% of 461.92s total
Dropped 1620 nodes (cum <= 2.31s)
Showing top 10 nodes out of 141
      flat  flat%   sum%        cum   cum%
    55.77s 12.07% 12.07%    191.39s 41.43%  runtime.selectgo
    41.19s  8.92% 20.99%     52.73s 11.42%  runtime.sellock
    24.36s  5.27% 26.26%     24.40s  5.28%  runtime.(*waitq).dequeue (inline)
    22.55s  4.88% 31.15%    114.28s 24.74%*Subscriber).CanDispatch
    18.83s  4.08% 35.22%     18.83s  4.08%*Logger).check
    16.90s  3.66% 38.88%     17.25s  3.73%  runtime.unlock2
    14.31s  3.10% 41.98%     32.20s  6.97%
    13.21s  2.86% 44.84%     13.21s  2.86%  runtime.empty
    12.48s  2.70% 47.54%     44.65s  9.67%  runtime.schedule
    12.18s  2.64% 50.18%     12.64s  2.74%  runtime.gopark


Here is the image of it: dump1

The problematic lines are:

At first, we thought the problem is with Zap(*Logger) check but PR should fix it.

Anyways, completely removing Zap logging, the situation didn't improve but instead showed more problems with runtime.selectgo

Here is debug profile of CPU:

File: caddy
Type: cpu
Time: Aug 31, 2021 at 12:27pm
Duration: 300.10s, Total samples = 568.67s (189.49%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 347.87s, 61.17% of 568.67s total
Dropped 1614 nodes (cum <= 2.84s)
Showing top 10 nodes out of 116
      flat  flat%   sum%        cum   cum%
    76.58s 13.47% 13.47%    263.92s 46.41%  runtime.selectgo
    50.10s  8.81% 22.28%     50.28s  8.84%  runtime.(*waitq).dequeue (inline)
    47.13s  8.29% 30.56%     59.44s 10.45%  runtime.sellock
    29.29s  5.15% 35.71%     29.29s  5.15%  memeqbody
    29.05s  5.11% 40.82%     86.37s 15.19%*Subscriber).CanDispatch
    28.25s  4.97% 45.79%    166.05s 29.20%*Subscriber).Dispatch
    25.76s  4.53% 50.32%     57.32s 10.08%
    25.14s  4.42% 54.74%     25.14s  4.42%  runtime.empty (inline)
    18.58s  3.27% 58.01%     19.21s  3.38%  runtime.gopark
    17.99s  3.16% 61.17%     17.99s  3.16%  runtime.(*gQueue).pop

Here is the image of it: dump2

Sorry, I can't write a benchmark for it but ready to test what changes is needed.

Data posted to Mercure is always to different topics.

Posted to topics: /users/613193cee6be0e1016042334

Private: true


    "@context": "/contexts/User",
    "@id": "/users/613193cee6be0e1016042334",
    "@type": "User",
    "id": "613193cee6be0e1016042334",
    "quota": 40000,
    "used": 7186,
    "isDisabled": false,
    "isDeleted": false,
    "createdAt": "2021-09-03T05:17:34+02:00",
    "updatedAt": "2021-09-03T05:18:29+02:00"


The O(n) growth in topic selector comparisons for each message meant that the computational complexity grows as a product of the number of subscribers and the volume of messages. Highly concurrent access to these caches caused a lot of lock contention which made the Go scheduler a bottleneck.


  1. The subscriber list is represented as a skipfilter. The test function uses the subscriber's topic selectors (compiled to RegExp) to match the topic.

  2. For each message, the list of subscribers that will receive the message is retrieved from the skipfilter.


Each subscriber is stored in a skip list. As subscribers are added, they are assigned an autoincrementing ID and the skip list grows. Skip lists are preferred over slices here due to their discontinuous nature. Subscribers can be added and removed at random and the memory usage will remain bounded.

Each topic has a roaring bitmap. Each bit corresponds to a subscriber in the skip list. For each topic, a cursor is maintained to ensure that newly added subscriptions are always tested. Deleted subscriptions are lazily evicted from the topic bitmaps on lookup. Roaring bitmaps are compressed and discontinuous so memory usage again remains bounded as subscribers come and go.

Finding the complete list of subscribers to which any message must be delivered thus requires only taking the logical And of a set of bitmaps. This sort of operation has been made exceptionally efficient by leveraging SIMD instructions on modern CPUs.


A multidimensional benchmark test shows a 20,000x improvement in throughput for certain highly concurrent and highly selective workloads.

An efficiency improvement of 4 orders of magnitude.


This technique has existed since at least 1980. ZeroMQ refers to it as inverted bitmaps.

The whitepaper was not discovered until after the implementation was complete. As with most computer algorithms, the creation of skipfilter turns out to be a rediscovery of a technique that predates the birth of its author.

GitHub logo kevburnsjr / skipfilter

An inverted bitmap index written in Go.


This package provides a data structure that combines a skiplist with a roaring bitmap cache.

GoDoc Go Report Card Go Coverage

This library was created to efficiently filter a multi-topic message input stream against a set of subscribers, each having a list of topic subscriptions expressed as regular expressions. Idealy, each subscriber should test each topic at most once to determine whether it wants to receive messages from the topic.

In this case, the skip list provides an efficient discontinuous slice of subscribers and the roaring bitmap for each topic provides an efficient ordered discontinuous set of all subscribers that have indicated that they wish to receive messages on the topic.

Filter bitmaps are stored in an LRU cache of variable size (default 100,000).

This package is theadsafe.

Top comments (0)