Cameron Archer for Tinybird

Originally published at tinybird.co


The simplest way to count 100 billion unique IDs: Part 1

I recently came across an interesting LinkedIn post about how, back in 2017, Reddit designed a system to count post views and unique viewers:

Using Kafka, Redis, and Cassandra, they built a system to count 100 billion unique 64-bit IDs with only 12KB of storage. Impressive!

However, as someone who's spent quite some time helping companies solve these exact problems, I couldn't help but think: "There's definitely a simpler way to do this."

The Reddit approach

Here's what Reddit had built:

  1. View events are ingested into Kafka and checked to see whether they should be counted.
  2. Countable events are written to another Kafka topic.
  3. A consumer reads that topic and updates unique counts in Redis.
  4. The values are persisted from Redis to Cassandra every 10 seconds in case they're evicted.
  5. HyperLogLog is used to approximate the counts (saving a massive amount of space).

And hey, it works! But that's a lot of moving parts just to count things.
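
Worth noting: the 12KB figure is the magic of HyperLogLog itself, a fixed-size sketch that trades a little accuracy for a lot of space. In ClickHouse, the database Tinybird is built on, that same trick is a single aggregate function. Here's a minimal sketch, with a hypothetical views table just for illustration:

-- Approximate unique viewers with an HLL-style sketch.
-- uniqCombined uses HyperLogLog internally for large cardinalities,
-- so its state stays small no matter how many IDs you feed it.
-- The `views` table is hypothetical, for illustration only.
SELECT
    post_id,
    uniqCombined(viewer_id) AS approx_unique_viewers
FROM views
GROUP BY post_id

You get the space savings without a separate Redis or Cassandra layer. The rest of this post takes that idea further, starting with an exact count.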

A simpler solution

What if you could:

  • Store all your raw events in one place
  • Count uniques with a single SQL query
  • Get real-time results
  • Filter by time
  • Keep it all crazy efficient

Turns out, you can. Here's how.

The implementation

First, I created a Tinybird project to store and count post views. To do this, I used the tb create command in the Tinybird CLI and supplied a prompt to bootstrap a first version. If you want to follow along and don’t have Tinybird installed, you can catch up to me in just 3 steps, detailed here.

tb create --prompt "A project to count Reddit post views. It'll capture views with a post_id, a timestamp in seconds, and a viewer_id as a long. It should also have an API for counting exact unique post viewers by post_id, filterable by date. A post_id will always be required for querying."

When prompting a project generator, I find it’s best to be as specific as possible with the requirements to get a good first pass.

This is what Tinybird came up with on the first pass:

datasources/post_views.datasource

DESCRIPTION >
    Reddit post views events with post_id, timestamp and viewer_id

SCHEMA >
    `post_id` String `json:$.post_id`,
    `timestamp` DateTime `json:$.timestamp`,
    `viewer_id` Int64 `json:$.viewer_id`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "post_id, timestamp"

endpoints/unique_post_viewers.pipe

DESCRIPTION >
    API to count unique viewers by post_id with optional date filtering

NODE unique_post_viewers_1
SQL >
    %
    SELECT 
        post_id,
        uniqExact(viewer_id) as unique_viewers
    FROM post_views
    WHERE 
        post_id = {{String(post_id, '')}}
        {% if defined(start_date) %}
        AND timestamp >= {{DateTime(start_date, '2023-01-01 00:00:00')}}
        {% end %}
        {% if defined(end_date) %}
        AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
        {% end %}
    GROUP BY post_id

TYPE endpoint

This endpoint pipe automatically becomes a REST API, for now deployed on localhost:

curl -G 'http://localhost:7181/v0/pipes/unique_post_viewers.json' \
   -H 'Authorization: Bearer YOUR_TOKEN' \
   -d 'post_id=abc123' \
   -d 'start_date=2024-01-01'

Tinybird also generates some fake data matching the schema of the data source, so I was able to verify that the endpoint accurately counted the unique viewers stored in the event log.

I can almost hear Javi Santana in the back of my head saying, “BOOM!”

So far, so good. I had a working API, but I did want to make one minor tweak to enforce that post_id is provided in the request.
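
That tweak is a one-liner in the pipe template. One way to do it (a sketch of the idea, not necessarily the exact change I shipped) is to mark the parameter as required, so the endpoint returns an error whenever post_id is missing:

-- In the WHERE clause of the unique_post_viewers_1 node:
post_id = {{String(post_id, required=True)}}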

I then deployed it to the cloud with tb --cloud deploy, and I had a hosted, production-ready API to count unique post viewers.

But does it scale?

The simplicity of this solution is nice, but it's not a solution if it doesn't scale.

So, I tested it using Mockingbird to send 10M views for a single post with ~1M unique viewers, which is about 1GB of raw data. These are the results:

  • Storage: ~57MB after compression
  • Latency: Queries return in ~20 milliseconds (even faster with date filters)
  • Throughput: Real-time ingestion at 100k events/second

While this isn't the 100B views from the post, it gives me an idea of how quickly I can read the data for a single post. Given how it's partitioned and sorted, I’d expect consistent performance even as the number of posts grows (hint: finding a post's data is a binary search on the primary key, so O(log2 n)).

As I add more views and unique viewers per post, it scales linearly. This is where the date filters really shine. Extrapolating from this, I’d expect the following for 10K posts with 10M views each (for a total of 100B events):

  • ~550GB stored (Remember, this is for timestamp + post_id + viewer_id. The implementation in the LinkedIn post has just viewer_id occupying 800GB)
  • Still ~20-40 milliseconds query latency! That's the beauty of O(log2 n).

Of course, I’d load test this before going to Production to make sure I didn’t make a mistake in my napkin math. 
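
For the record, the napkin math is simple: ~57MB compressed for one post with 10M views, times 10,000 posts, comes out to roughly 570GB, which I round to ~550GB on the assumption that compression holds (or improves slightly) as the data grows. And for scale, 100 billion 64-bit viewer IDs at 8 bytes each is 800GB of raw IDs before any compression, which lines up with the figure from the LinkedIn post.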

And yes, this includes all the raw data – not just pre-aggregated counts. Want to slice by hour? Add a dimension? Filter differently? Just modify the SQL.
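
For instance, a per-hour breakdown for a single post is just another query over the same raw table. A quick sketch, not part of the generated project:

-- Unique viewers per hour for one post: same data, different SELECT
SELECT
    toStartOfHour(timestamp) AS hour,
    uniqExact(viewer_id) AS unique_viewers
FROM post_views
WHERE post_id = 'abc123'
GROUP BY hour
ORDER BY hour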

The trade-offs

Nothing's perfect, so let's be upfront about the limitations:

  1. I'm storing more raw data (but modern columnar compression makes this surprisingly efficient and you can do other things with this data)
  2. Queries do more work than just fetching a pre-computed number
  3. At extreme scales, I might need some optimization (more on this in the next post)

But consider what I'm NOT dealing with:

  • No complex data pipeline
  • No synchronization between services or stores
  • No separate systems to monitor
  • No distributed system headaches

Getting started

Want to try this yourself? Check out the Tinybird documentation to get started. You'll be up and running in minutes, not days.

What's next?

This solution works great for most use cases. But what happens when you hit real scale? When you need to count trillions of events? When uniqExact starts eating too much memory?

That's what I'll cover in the next post. Stay tuned.
