Web Analytics Pipeline: Handling Scale Without Breaking the Bank
Imagine processing millions of page view events per second while simultaneously answering questions like "How many unique visitors did we have today?" in milliseconds. Web analytics pipelines sit at the intersection of real-time processing and historical analysis, and getting the architecture right means the difference between insights you can act on and data you can't trust. Let's explore how to build a system that captures user behavior, counts unique visitors efficiently, and powers interactive dashboards without requiring massive storage overhead.
Architecture Overview
A robust web analytics pipeline typically consists of four main layers working in concert: collection, stream processing, storage, and query. In the collection layer, lightweight SDKs or pixel trackers fire events from user browsers into a message queue like Kafka or Redis Streams, which acts as a shock absorber for traffic spikes. From there, a stream processing engine ingests these events in real time, aggregates them by various dimensions (page, session, funnel step), and pushes the results into both hot storage (for dashboards) and cold storage (for historical analysis). Finally, a query layer serves the analytics UI, enabling product teams to drill into metrics across different time windows and user segments.
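To make the flow concrete, here is a minimal sketch of the stream-processing step in plain Python. Everything in it is an assumption for illustration: the `PageView` event shape, the one-minute window, and the two dicts standing in for hot and cold storage. A real pipeline would read from the queue and write to actual databases.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """Hypothetical event shape; real SDKs send many more fields."""
    timestamp: int  # Unix seconds
    page: str
    session_id: str


def minute_bucket(ts: int) -> int:
    """Truncate a Unix timestamp to the start of its minute."""
    return ts - (ts % 60)


def aggregate(events):
    """Count page views per (minute, page), as a stream processor
    might do for each batch of events pulled off the queue."""
    counts = Counter()
    for ev in events:
        counts[(minute_bucket(ev.timestamp), ev.page)] += 1
    return counts


def fan_out(counts, hot_store, cold_store):
    """Write the same aggregates to both storage paths.
    Both stores are plain dicts here, standing in for real databases."""
    for key, n in counts.items():
        hot_store[key] = hot_store.get(key, 0) + n
        cold_store[key] = cold_store.get(key, 0) + n


events = [
    PageView(1_700_000_005, "/home", "s1"),
    PageView(1_700_000_030, "/home", "s2"),
    PageView(1_700_000_065, "/pricing", "s1"),
]
hot, cold = {}, {}
fan_out(aggregate(events), hot, cold)
print(hot)
```

The two `/home` views fall in the same minute and collapse into one row, which is the whole point: the stores hold one counter per window and dimension, not one record per event.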
The key design decision here is separating real-time aggregation from batch historical processing. Real-time dashboards need sub-second latency and can tolerate approximate numbers, while reporting systems often need exact counts and deeper historical context. By running parallel paths through your pipeline, you can answer "What's happening right now?" from a time-series database like InfluxDB or Prometheus, while simultaneously building more complete datasets in your data warehouse for tomorrow's retrospective analysis. This dual-path approach means you don't have to choose between speed and accuracy.
Another critical consideration is how events flow through your system. Instead of storing raw events forever, you pre-aggregate at multiple levels: first-pass aggregations in the stream processors for real-time metrics, then hourly and daily roll-ups for reporting. This compression keeps storage costs manageable and keeps queries fast, because a dashboard reads a handful of pre-computed rows rather than scanning raw events. Tools like InfraSketch help you visualize these data flows and the branching paths that make analytics systems so powerful.
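The roll-up idea can be sketched in a few lines. This assumes the fine-grained counts are keyed by `(window_start_seconds, page)`, which is a made-up layout for illustration, and the `roll_up` helper is hypothetical rather than part of any particular tool.

```python
from collections import defaultdict


def roll_up(counts, bucket_seconds):
    """Re-bucket {(timestamp, page): count} into coarser time windows."""
    rolled = defaultdict(int)
    for (ts, page), n in counts.items():
        rolled[(ts - ts % bucket_seconds, page)] += n
    return dict(rolled)


# Per-minute page view counts (timestamps are seconds since an arbitrary start).
minute_counts = {
    (0, "/home"): 3,
    (60, "/home"): 2,
    (3600, "/home"): 4,
    (86400, "/home"): 1,
    (120, "/pricing"): 5,
}

hourly = roll_up(minute_counts, 3600)   # minutes -> hours
daily = roll_up(hourly, 86400)          # hours -> days
print(hourly)
print(daily)
```

Note that this only works for additive metrics such as page views. Unique visitor counts cannot be summed across windows, because the same visitor may appear in several of them.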
Design Insight: Counting Unique Visitors at Scale
Here's where things get clever. Storing every user ID you've ever seen would balloon your infrastructure, but you still need accurate unique visitor counts. The answer lies in probabilistic data structures, specifically HyperLogLog. This data structure can estimate the number of unique elements in a dataset while using a tiny fraction of the memory that naive approaches require. When a user visits your site, you pass their ID through a HyperLogLog sketch for that day, which adds them to the count with constant memory overhead regardless of dataset size.
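For a feel of how this works, here is a stripped-down HyperLogLog in Python. It is a teaching sketch under stated assumptions, not a production implementation: the class name, the choice of SHA-1 as the hash, and the default of 4,096 registers are all mine, and it omits the refinements real libraries add. In practice you would reach for an existing implementation.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Use the first 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


daily_visitors = HyperLogLog()
for i in range(50_000):
    daily_visitors.add(f"user-{i}")
    daily_visitors.add(f"user-{i}")  # a repeat visit never raises a register
print(round(daily_visitors.count()))
```

The sketch holds 4,096 small integers whether it has seen fifty visitors or fifty million, and adding the same ID twice leaves it unchanged, which is what makes it a distinct counter rather than an event counter.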
The trade-off? HyperLogLog gives you estimates rather than exact counts, with an error margin you configure through the size of the sketch. A few thousand registers give a standard error of around 2 percent, and larger sketches do better: Redis's built-in HyperLogLog, for example, reports a standard error of 0.81 percent at about 12 KB per key. For most analytics use cases, this is more than acceptable. You may still want to store individual user IDs selectively, perhaps for your most important metrics or for specific cohorts you're analyzing, but for broad metrics like "daily active users" or "unique visitors by page," HyperLogLog lets you count billions of users while keeping memory usage under control. This is the kind of architectural decision that becomes obvious once you see it diagrammed out, which is exactly what you'd discover by sketching this system on InfraSketch.
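To put numbers on that trade-off, the usual approximation for HyperLogLog's relative standard error is 1.04 / sqrt(m), where m is the number of registers. The snippet below tabulates error against memory for a few sizes. The helper names are hypothetical, and the memory figure assumes a dense array at 6 bits per register, which is one common layout rather than a universal one.

```python
import math


def standard_error(p: int) -> float:
    """Approximate relative standard error for a sketch with 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def memory_bytes(p: int, bits_per_register: int = 6) -> int:
    """Size of a dense register array (assumed layout: 6 bits per register)."""
    return (2 ** p) * bits_per_register // 8


for p in (10, 12, 14):
    print(
        f"p={p}: {2 ** p} registers, "
        f"~{standard_error(p):.2%} error, {memory_bytes(p)} bytes"
    )
```

Each time you quadruple the register count you halve the error, so going from roughly 2 percent to under 1 percent costs a few extra kilobytes per sketch, not a few extra servers.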
Watch the Full Design Process
See how this architecture comes together in real time as an AI generates the complete system design, including all components, data flows, and the nuanced decisions around counting unique visitors efficiently.
Try It Yourself
Want to design your own analytics pipeline or explore variations on this architecture? Head over to InfraSketch and describe your system in plain English. In seconds, you'll have a professional architecture diagram, complete with a design document.
This is Day 85 of a 365-day system design challenge. What's the next architecture you'd like to explore?