It was supposed to be a routine product launch. Our e-commerce platform was expecting maybe 50,000 users during peak hours. Instead, we got 500,000. Within minutes, our analytics system was choking on the data influx, dashboards were showing stale data from hours ago, and our marketing team was flying blind during the most critical sales period of the year.
That night changed everything. What started as a crisis became our most valuable learning experience in building truly scalable real-time analytics. Six months later, we had rebuilt our entire analytics pipeline to handle 10 million events per hour without breaking a sweat. Here's how we did it, the mistakes we made, and the architecture decisions that saved our sanity.
The Problem: When Traditional Analytics Meets Real-Time Demands
Our original setup was a classic batch processing nightmare. We were using a traditional SQL database with hourly ETL jobs to populate our dashboards. When traffic spiked, everything broke:
- Data Lag: Dashboards showing data from 3-4 hours ago during critical periods
- Database Lock-ups: Heavy analytical queries blocking transactional operations
- Resource Contention: Analytics workloads competing with customer-facing features
- Incomplete Picture: Missing events due to database timeouts and connection limits
The business impact was immediate. Marketing couldn't optimize campaigns in real-time, product teams couldn't identify trending items, and customer support was answering questions with outdated information.
Architecture Decision: Event-Driven Real-Time Processing
We completely reimagined our approach around event streaming rather than batch processing. Here's the high-level architecture we built:
Core Components
- Event Ingestion Layer: Apache Kafka cluster with 12 partitions per topic
- Stream Processing: Apache Flink for real-time aggregations and transformations
- Storage Layer: ClickHouse for analytical queries + Redis for real-time metrics
- Visualization: Custom React dashboard with WebSocket connections for live updates
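To make the ingestion layer concrete, here's a minimal sketch of how a topic with that partition layout could be provisioned using the kafkajs admin client. The client library, broker addresses, and replication factor are assumptions for illustration; only the 12-partitions-per-topic layout reflects our actual setup.

```javascript
const { Kafka } = require('kafkajs');

// Broker addresses are placeholders
const kafka = new Kafka({
  clientId: 'analytics-admin',
  brokers: ['broker-1:9092', 'broker-2:9092', 'broker-3:9092']
});

async function createUserEventsTopic() {
  const admin = kafka.admin();
  await admin.connect();
  await admin.createTopics({
    topics: [{
      topic: 'user-events',
      numPartitions: 12,    // matches the 12-partitions-per-topic layout above
      replicationFactor: 3  // assumed; not stated elsewhere in this post
    }]
  });
  await admin.disconnect();
}
```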
The Event Flow
User Actions → Event Collectors → Kafka Topics → Flink Jobs → Storage → Dashboard
Every user interaction generates events: page views, clicks, purchases, cart additions. Instead of writing directly to our transactional database, we publish these events to Kafka topics.
Implementation Deep Dive
1. Event Collection and Ingestion
We built lightweight event collectors that buffer events locally before batch-sending to Kafka:
```javascript
class EventCollector {
  constructor(kafkaProducer, bufferSize = 1000, flushInterval = 5000) {
    this.producer = kafkaProducer;
    this.buffer = [];
    this.bufferSize = bufferSize;
    // Flush the buffer every `flushInterval` ms (5 seconds by default) or when it fills up
    setInterval(() => this.flush(), flushInterval);
  }

  track(eventType, userId, properties) {
    const event = {
      timestamp: Date.now(),
      type: eventType,
      userId: userId,
      properties: properties,
      sessionId: this.getSessionId() // session lookup omitted for brevity
    };

    this.buffer.push(event);

    if (this.buffer.length >= this.bufferSize) {
      this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return;

    // Drain the buffer atomically so events tracked mid-send aren't lost
    const events = this.buffer.splice(0);

    await this.producer.send({
      topic: 'user-events',
      messages: events.map(event => ({ value: JSON.stringify(event) }))
    });
  }
}
```
This approach reduced our event publishing latency from 50ms to 5ms while handling traffic spikes gracefully.
2. Stream Processing with Apache Flink
The magic happens in our Flink jobs. We run multiple parallel jobs for different aggregations:
- Real-time Counters: Page views, unique visitors, conversion rates updated every second
- Sliding Window Analytics: Revenue trends, popular products over 5-minute windows
- Complex Event Processing: User journey analysis and funnel conversions
Here's a simplified example of our real-time counter job:
```scala
// Count events per type in one-second tumbling windows
val eventStream = env
  .addSource(new FlinkKafkaConsumer("user-events", new EventSchema(), kafkaProps))
  .keyBy(_.eventType)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
  .aggregate(new EventCounter())

// Write each windowed count to both the hot store and the warm store
eventStream.addSink(new RedisSink(redisConfig))
eventStream.addSink(new ClickHouseSink(clickhouseConfig))
```
3. Storage Strategy: Hot and Cold Data
We implemented a tiered storage approach:
- Hot Data (Redis): Last 24 hours of real-time metrics, sub-second query response
- Warm Data (ClickHouse): Last 30 days of detailed analytics, optimized for complex queries
- Cold Data (S3): Historical data for compliance and deep analysis
This architecture reduced our dashboard load times from 8 seconds to under 200ms.
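To give a feel for the read path, here's a rough sketch of how a dashboard API might route a metrics query between the hot and warm tiers. It assumes the ioredis and @clickhouse/client Node packages; the Redis key, table, and column names are illustrative rather than our exact schema.

```javascript
const Redis = require('ioredis');
const { createClient } = require('@clickhouse/client');

const redis = new Redis();                                          // hot tier: last 24h
const clickhouse = createClient({ url: 'http://clickhouse:8123' }); // warm tier; options vary by client version

// Illustrative router: serve recent metrics from Redis, older ranges from ClickHouse
async function getPageViews(fromMs, toMs) {
  const dayAgo = Date.now() - 24 * 60 * 60 * 1000;

  if (fromMs >= dayAgo) {
    // Hot path: pre-aggregated counter kept up to date by the stream jobs
    return Number(await redis.get('metrics:page_views:24h'));
  }

  // Warm path: aggregate on the fly in ClickHouse (hypothetical table/columns)
  const result = await clickhouse.query({
    query: `
      SELECT count() AS page_views
      FROM user_events
      WHERE type = 'page_view'
        AND timestamp >= toDateTime(${Math.floor(fromMs / 1000)})
        AND timestamp <  toDateTime(${Math.floor(toMs / 1000)})`,
    format: 'JSONEachRow'
  });
  const rows = await result.json();
  return Number(rows[0].page_views);
}
```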
4. Dashboard with Real-Time Updates
Our React dashboard connects via WebSockets to receive live updates:
```javascript
import { useState, useEffect } from 'react';

const useRealTimeMetrics = (metricType) => {
  const [data, setData] = useState(null);

  useEffect(() => {
    const ws = new WebSocket(`wss://analytics-api.com/metrics/${metricType}`);

    // Merge each incoming payload into the current metrics snapshot
    ws.onmessage = (event) => {
      const metrics = JSON.parse(event.data);
      setData(prevData => ({
        ...prevData,
        ...metrics
      }));
    };

    // Close the socket on unmount or when the metric type changes
    return () => ws.close();
  }, [metricType]);

  return data;
};
```
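The server side of this is deliberately thin. Below is a hedged sketch (using the ws and ioredis packages; the port and Redis key names are placeholders, not our production values) that reads the hot-tier counters once per second and pushes a JSON payload to every connected dashboard.

```javascript
const { WebSocketServer, WebSocket } = require('ws');
const Redis = require('ioredis');

const redis = new Redis();
const wss = new WebSocketServer({ port: 8080 }); // placeholder port

// Once per second, read the hot-tier counters and broadcast them to all dashboards
setInterval(async () => {
  const payload = JSON.stringify({
    pageViews: Number(await redis.get('metrics:page_views:1s')),
    uniqueVisitors: Number(await redis.get('metrics:unique_visitors:1s')),
    conversionRate: Number(await redis.get('metrics:conversion_rate:1s'))
  });

  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(payload);
    }
  }
}, 1000);
```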
Performance Optimization Lessons
1. Partitioning Strategy
We learned that Kafka partitioning is crucial. Initially, we used random partitioning, which caused hot spots. Switching to user ID-based partitioning improved throughput by 40%.
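The change itself was small: Kafka partitions by message key, so routing by user just means setting the key when the collector flushes. Here's a sketch of the revised flush() from the collector above (the exact producer interface may differ):

```javascript
// Inside EventCollector: key each message by userId so Kafka routes all of a
// user's events to the same partition (and the same downstream Flink subtask).
async flush() {
  if (this.buffer.length === 0) return;
  const events = this.buffer.splice(0);

  await this.producer.send({
    topic: 'user-events',
    messages: events.map(event => ({
      key: String(event.userId),   // was unkeyed (effectively random) before
      value: JSON.stringify(event)
    }))
  });
}
```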
2. Batch Size Tuning
Finding the right balance between latency and throughput took weeks of testing. Our sweet spot:
- Event Collection: 1000 events or 5-second intervals
- Kafka Producer: 16KB batches with 10ms linger time
- Flink Processing: 1-second tumbling windows for counters
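The collector-level numbers plug straight into the EventCollector from the ingestion section; the producer and Flink settings live in their own configs. For example:

```javascript
// Wire the tuned values into the collector shown earlier: flush after 1000
// buffered events or every 5 seconds, whichever comes first.
// Assumes `kafkaProducer` is already connected.
const collector = new EventCollector(kafkaProducer, 1000, 5000);

collector.track('page_view', userId, { path: '/checkout' }); // example call
```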
3. Memory Management
ClickHouse memory usage was initially unpredictable. We implemented:
- Proper data types (using UInt64 instead of String for IDs)
- Compression algorithms (LZ4 for hot data, ZSTD for cold data)
- Query result caching for common dashboard queries
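As a rough illustration of the schema-level fixes, here's what an events table might look like, created through the Node @clickhouse/client (an assumption about tooling; the table and column names are hypothetical). In our setup the LZ4/ZSTD split is between hot and cold data; the per-column codecs below simply show where those knobs live in ClickHouse.

```javascript
const { createClient } = require('@clickhouse/client');

const clickhouse = createClient({ url: 'http://clickhouse:8123' }); // placeholder host

// Hypothetical events table: numeric IDs instead of strings,
// plus explicit compression codecs per column.
async function createEventsTable() {
  await clickhouse.command({
    query: `
      CREATE TABLE IF NOT EXISTS user_events (
        user_id    UInt64   CODEC(LZ4),   -- was a String in the first version
        type       String   CODEC(LZ4),   -- frequently queried columns stay on LZ4
        timestamp  DateTime CODEC(LZ4),
        properties String   CODEC(ZSTD)   -- bulky payloads get heavier compression
      )
      ENGINE = MergeTree()
      PARTITION BY toYYYYMMDD(timestamp)
      ORDER BY (type, timestamp)
    `
  });
}
```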
4. Monitoring and Alerting
We monitor everything:
- Event Ingestion Rate: Alert if drops below expected volume
- Processing Lag: Flink job lag should stay under 10 seconds
- Dashboard Response Time: P95 latency under 500ms
- Data Quality: Missing events or schema validation failures
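Most of these signals come from the standard Kafka, Flink, and ClickHouse metrics, but ingestion lag is easy to sanity-check directly. Below is a hedged sketch using the kafkajs admin API; the consumer group id is an assumption, and in practice Flink's own metrics are the more precise source for job lag.

```javascript
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'lag-check', brokers: ['broker-1:9092'] }); // placeholder broker

// Rough lag check: compare the latest offsets on 'user-events' with the offsets
// committed by the consuming group (group id is illustrative).
async function checkIngestionLag() {
  const admin = kafka.admin();
  await admin.connect();

  const latest = await admin.fetchTopicOffsets('user-events');
  const committed = await admin.fetchOffsets({ groupId: 'flink-analytics', topics: ['user-events'] });

  const committedByPartition = new Map(
    committed[0].partitions.map(p => [p.partition, Math.max(0, Number(p.offset))])
  );

  const totalLag = latest.reduce((lag, p) => {
    const done = committedByPartition.get(p.partition) ?? 0;
    return lag + Math.max(0, Number(p.offset) - done);
  }, 0);

  await admin.disconnect();
  return totalLag; // alert if this keeps growing
}
```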
The Results: From Crisis to Confidence
Six months after our rebuild, the numbers speak for themselves:
Scale Improvements:
- 10M+ events per hour during peak traffic (200x improvement)
- Sub-second dashboard updates vs. 3-4 hour delays
- 99.9% event processing reliability
Business Impact:
- Marketing campaigns now adjust in real-time based on conversion data
- Product teams identify trending items within minutes of traffic spikes
- Customer support has access to real-time user behavior context
Cost Efficiency:
- 60% reduction in infrastructure costs despite 200x scale improvement
- Eliminated expensive analytical database licenses
- Reduced engineering time spent on data pipeline maintenance
Key Takeaways for Your Implementation
- Start with Event-Driven Architecture: Don't try to retrofit real-time onto batch systems
- Choose the Right Storage: Match storage technology to query patterns and latency requirements
- Monitor Obsessively: Real-time systems fail in real time, so you need immediate visibility
- Plan for Failure: Circuit breakers, graceful degradation, and data replay capabilities are essential
- Optimize Incrementally: Start simple, measure everything, optimize the bottlenecks
Looking Forward: Enterprise-Scale Analytics
Building this system taught us that real-time analytics isn't just about technology—it's about enabling data-driven decision making at the speed of business. The architecture patterns we implemented here have become the foundation for much larger enterprise deployments.
For organizations looking to implement similar real-time analytics capabilities at enterprise scale, the combination of event streaming, distributed processing, and modern storage technologies provides a robust foundation. When integrated with comprehensive Microsoft technology solutions, these patterns can scale to handle billions of events while maintaining the reliability and security standards that enterprise environments demand.
The future of analytics is real-time, and the tools to build these systems have never been more accessible. The question isn't whether you need real-time analytics—it's how quickly you can implement them before your competitors do.