
Markus

Originally published at the-main-thread.com

Counting Millions of Things with Kilobytes


How do you count millions of unique users, sessions, or IPs using just a few kilobytes of memory, without hitting an OutOfMemoryError under load?

In this tutorial, we’ll walk through how to use the HyperLogLog (HLL) data structure inside a Quarkus application to perform scalable, approximate cardinality estimation. We’ll go from a basic Java implementation to a real Quarkus microservice that ingests data from GitHub archives and estimates unique users.

Why HyperLogLog Beats HashSet at Scale

When you need to count unique identifiers in high-throughput systems, reaching for a HashSet seems obvious. But storing millions of UUIDs, IPs, or usernames in memory quickly becomes unsustainable: a HashSet holding a million UUID strings can easily consume well over 100 MB once object headers, hash table nodes, and load factor are accounted for. Do this per service instance, and your JVM falls off a memory cliff.

Enter HyperLogLog, a probabilistic data structure that answers “how many unique items have I seen?” with near-constant memory, regardless of input size. You trade a small amount of accuracy (typically a few percent, depending on precision) for massive space savings. It’s an ideal fit for analytics, observability, and stream processing use cases.

How HyperLogLog Works (Without the Math Headache)

Here’s the core intuition:

  1. Hash your inputs: Each input (e.g., a user ID) is hashed into a long binary string.

  2. Count leading zeros: Long runs of leading zeros are rare, so the longest run observed hints at how many distinct values have been hashed.

  3. Divide into buckets: Use the first few bits of the hash to choose a bucket, and store the highest leading-zero count observed for each bucket.

  4. Harmonic mean: Combine all bucket estimates using a specialized average (the harmonic mean) to compute the final count.

This gives you a small, fixed-size sketch that approximates how many unique elements you've seen. For example, 1,024 buckets (p = 10) give you roughly 3.3% standard error with just 1 KB of memory.
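To get a feel for the trade-off at other precisions, here is a tiny, illustrative Java loop (not part of the project) that applies the usual HLL rule of thumb: standard error is roughly 1.04/√m, and memory is about one byte per register.

// Rough rule of thumb: standard error ≈ 1.04 / sqrt(m), memory ≈ m bytes (one byte per register).
for (int p = 10; p <= 14; p++) {
    int m = 1 << p;
    System.out.printf("p=%d -> %,d registers ≈ %d KB, ~%.2f%% standard error%n",
            p, m, m / 1024, 104.0 / Math.sqrt(m));
}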

Build Your Own HyperLogLog in Java

Let’s implement a simplified version in plain Java, or go straight to my GitHub repository and clone the working project.

Define the Structure

package org.example;

public class HyperLogLog {

    private final int p; // Precision
    private final int m; // Number of registers, m = 2^p
    private final byte[] registers;
    private final double alphaM;

    public HyperLogLog(int p) {
        if (p < 4 || p > 16) {
            throw new IllegalArgumentException("Precision p must be between 4 and 16.");
        }
        this.p = p;
        this.m = 1 << p; // 2^p
        this.registers = new byte[m];
        this.alphaM = getAlpha(m);
    }

    // Hash function placeholder (in a real app, use Murmur3)
    private int hash(Object o) {
        return o.hashCode();
    }

    // Pre-calculated constant for bias correction
    private double getAlpha(int m) {
        switch (m) {
            case 16:
                return 0.673;
            case 32:
                return 0.697;
            case 64:
                return 0.709;
            default:
                return 0.7213 / (1 + 1.079 / m);
        }
    }

    public long getMemoryUsageBytes() {
        return this.m; // Each register is a byte
    }

    public void add(Object o) {
        int hash = hash(o);

        // 1. Determine the register index from the first 'p' bits
        int registerIndex = hash >>> (Integer.SIZE - p);

        // 2. Get the rest of the hash for counting leading zeros
        int valueForCounting = hash << p;

        // 3. Count leading zeros (+1 because we count from 1)
        byte leadingZeros = (byte) (Integer.numberOfLeadingZeros(valueForCounting) + 1);

        // 4. Update the register with the maximum value seen
        registers[registerIndex] = (byte) Math.max(registers[registerIndex], leadingZeros);
    }

    public double estimate() {
        double sum = 0.0;
        int zeroRegisters = 0;

        for (byte registerValue : registers) {
            if (registerValue == 0) {
                zeroRegisters++;
            }
            sum += 1.0 / (1 << registerValue); // sum of 2^(-R_j)
        }

        double rawEstimate = alphaM * m * m / sum;

        // Small and large range corrections (important for accuracy!)
        if (rawEstimate <= 2.5 * m) {
            if (zeroRegisters > 0) {
                // Small range correction
                return Math.round(m * Math.log((double) m / zeroRegisters));
            } else {
                return Math.round(rawEstimate);
            }
        } else if (rawEstimate > (1.0 / 30.0) * Math.pow(2, 32)) {
            // Large range correction
            return -Math.pow(2, 32) * Math.log(1.0 - rawEstimate / Math.pow(2, 32));
        } else {
            return Math.round(rawEstimate);
        }
    }
}
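A note on the hash() placeholder: Object.hashCode() produces poorly distributed bits for similar strings, which skews the estimate. If you want to stay dependency-free, one option (a sketch, not the implementation from the repository) is to push that value through a MurmurHash3-style fmix32 finalizer; for production, a real 64-bit hash such as Murmur3 is the better choice.

// Sketch only: spread Object.hashCode() bits with a MurmurHash3-style fmix32 finalizer.
private int hash(Object o) {
    int h = o.hashCode();
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
}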

Integrate HyperLogLog with Quarkus

Now, let's use our HyperLogLog class in a practical Quarkus microservice. We'll create an endpoint that can directly process a downloaded data file from gharchive.org.

Create a new Quarkus project

mvn io.quarkus.platform:quarkus-maven-plugin:create \
    -DprojectGroupId=org.example \
    -DprojectArtifactId=hll-tutorial \
    -DclassName="org.example.HllResource" \
    -Dpath="/stats" \
    -Dextensions="rest-jackson"
cd hll-tutorial

Move HyperLogLog.java to src/main/java/org/example/.

The Cardinality Service

This service is our stateful bean that holds the HLL sketches, completely decoupled from the data source.

package org.example;

import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class CardinalityService {
    // p=10 -> ~3.3% standard error, 1 KB of registers
    private final HyperLogLog dailyUsers = new HyperLogLog(10);
    // p=12 -> ~1.6% standard error, 4 KB of registers
    private final HyperLogLog weeklyRepos = new HyperLogLog(12);

    public void trackUser(String userId) {
        if (userId != null) {
            dailyUsers.add(userId);
        }
    }

    public void trackRepo(String repoName) {
        if (repoName != null) {
            weeklyRepos.add(repoName);
        }
    }

    public double getDailyUserEstimate() {
        return dailyUsers.estimate();
    }

    public double getWeeklyRepoEstimate() {
        return weeklyRepos.estimate();
    }

    public long getMemoryUsageBytes() {
        return dailyUsers.getMemoryUsageBytes() + weeklyRepos.getMemoryUsageBytes();
    }
}

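One caveat: the HyperLogLog class above is not thread-safe, and an @ApplicationScoped bean can be hit by many request threads at once. A minimal way to guard it, assuming you keep this single-bean design, is to synchronize the mutating methods; register updates are cheap, so a coarse lock is usually acceptable here.

    // Hypothetical variation: coarse-grained locking on the tracking methods.
    public synchronized void trackUser(String userId) {
        if (userId != null) {
            dailyUsers.add(userId);
        }
    }

    public synchronized void trackRepo(String repoName) {
        if (repoName != null) {
            weeklyRepos.add(repoName);
        }
    }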

Reading and Parsing the GitHub Archive Data

This component contains the logic for reading and parsing the gzipped archive file. It streams the file line by line without loading it all into memory.

package org.example;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class DataIngestor {

    @Inject
    CardinalityService cardinalityService;

    @Inject
    ObjectMapper mapper;

    public long processGhArchiveFile(Path filePath) throws Exception {
        long linesProcessed = 0;
        try (
                FileInputStream fileInputStream = new FileInputStream(filePath.toFile());
                GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
                InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream);
                BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                JsonNode event = mapper.readTree(line);

                JsonNode actor = event.get("actor");
                if (actor != null && actor.has("login")) {
                    cardinalityService.trackUser(actor.get("login").asText());
                }

                JsonNode repo = event.get("repo");
                if (repo != null && repo.has("name")) {
                    cardinalityService.trackRepo(repo.get("name").asText());
                }
                linesProcessed++;
            }
        }
        return linesProcessed;
    }
}
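GH Archive files are line-delimited JSON, so a single malformed line would currently abort the whole ingest because readTree throws. A hedged variation of the loop that skips bad lines instead of failing could look like this:

// Sketch: tolerate malformed lines instead of failing the whole file.
while ((line = bufferedReader.readLine()) != null) {
    JsonNode event;
    try {
        event = mapper.readTree(line);
    } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
        continue; // skip unparsable lines, keep streaming
    }
    // ... same actor/repo handling as above ...
    linesProcessed++;
}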

REST Endpoints

We'll add a new endpoint, POST /ingest, to the HllResource to kick off the file processing.

package org.example;

import java.nio.file.Paths;
import java.util.Map;

import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/")
public class HllResource {
    @Inject
    CardinalityService service;
    @Inject
    DataIngestor ingestor;

    @POST
    @Path("/ingest")
    @Produces(MediaType.APPLICATION_JSON)
    public Response ingest(Map<String, String> input) {
        try {
            long count = ingestor.processGhArchiveFile(Paths.get(input.get("path")));
            return Response.ok(Map.of("status", "ok", "lines", count)).build();
        } catch (Exception e) {
            return Response.serverError().entity(Map.of("error", e.getMessage())).build();
        }
    }

    @GET
    @Path("/stats/unique-users/today")
    public Response users() {
        return Response.ok(Map.of("estimate", service.getDailyUserEstimate())).build();
    }

    @GET
    @Path("/stats/unique-repos/weekly")
    public Response repos() {
        return Response.ok(Map.of("estimate", service.getWeeklyRepoEstimate())).build();
    }

    @GET
    @Path("/stats/memory-usage")
    public Response mem() {
        return Response.ok(Map.of("bytes", service.getMemoryUsageBytes())).build();
    }
}

Run and See the Magic

Download an hourly GitHub event file from https://www.gharchive.org/

wget https://data.gharchive.org/2025-07-20-15.json.gz

(If you’re on macOS and need wget, run brew install wget.)

Start your app:

./mvnw quarkus:dev

Trigger ingestion (adjust the path to wherever you saved the downloaded file):

curl -X POST http://localhost:8080/ingest \
  -H "Content-Type: application/json" \
  -d '{"path": "/tmp/2025-07-20-15.json.gz"}'

And you’ll get a result:

{
  "status": "ok",
  "lines": 157625
}

Query the estimate:

curl http://localhost:8080/stats/unique-users/today
curl http://localhost:8080/stats/memory-usage
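The memory-usage endpoint simply sums the two register arrays (2^10 + 2^12 bytes), so no matter how many events you ingest, it should report:

{"bytes":5120}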

Memory vs. Accuracy – The Trade-off

HashSet: ~55 MB for 300k users, 100% accuracy

HyperLogLog (p=12): 4 KB, ~1.6% standard error (≈98.4% accuracy)

This is the trade-off: you sacrifice a tiny amount of accuracy for a 13,000x reduction in memory usage. For most "how many" questions in monitoring, analytics, and data streaming, this is a phenomenal deal.

You get near-instant, low-cost estimates without worrying about JVM pressure.
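If you want to verify that trade-off yourself, a small, self-contained harness (hypothetical, not part of the project) can feed the same IDs into both a HashSet and the sketch and compare the counts; the exact error you observe will depend on the hash quality discussed earlier.

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class AccuracyCheck {
    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(12); // 4 KB sketch
        Set<String> exact = new HashSet<>();   // exact but memory-hungry baseline
        for (int i = 0; i < 300_000; i++) {
            String id = UUID.randomUUID().toString();
            hll.add(id);
            exact.add(id);
        }
        double estimate = hll.estimate();
        double errorPct = 100.0 * Math.abs(estimate - exact.size()) / exact.size();
        System.out.printf("exact=%d estimate=%.0f error=%.2f%%%n", exact.size(), estimate, errorPct);
    }
}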

Where to Go Next

Now that you have a working probabilistic counter:

  • Try using Redis’s native PFADD/PFCOUNT support.

  • Merge sketches across distributed services for global stats (see the sketch below).

  • Serialize the sketch for durable state or historical analysis.

  • Build a dashboard in Quarkus + Qute showing estimate convergence.
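Merging deserves a closer look, because it’s what makes HLL so useful across instances: two sketches with the same precision combine by taking the per-register maximum, and the merged result is the same as if one sketch had seen both streams. A minimal merge method for the class above (assuming equal precision on both sides) could look like this:

    // Sketch: merge another HLL into this one; both sketches must share the same precision.
    public void merge(HyperLogLog other) {
        if (other.p != this.p) {
            throw new IllegalArgumentException("Cannot merge sketches with different precision.");
        }
        for (int i = 0; i < m; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }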

This is how you count millions of things with kilobytes. HyperLogLog isn't just a clever trick; it's a serious engineering tool.
