
Big Data Fundamentals: real-time analytics example

Real-Time Aggregation of Clickstream Data with Apache Hudi and Flink

1. Introduction

The need for immediate insights into user behavior drives the demand for real-time analytics. A common engineering challenge is building a system capable of aggregating clickstream data – events representing user interactions with a website or application – within seconds of arrival. Traditional batch processing, even with frameworks like Spark, often falls short when dealing with the velocity and volume of modern clickstream data. This necessitates a shift towards stream processing and incrementally updated data storage.

The constraints are concrete: data volumes in the tens of terabytes per day, with event rates peaking at hundreds of thousands of events per second. Schema evolution is frequent as new product features are rolled out, requiring a flexible data storage layer. Query latency requirements are stringent: dashboards need to reflect recent activity within 5 seconds. Cost-efficiency is paramount, demanding optimized storage and compute resources. This post details a production-grade architecture leveraging Apache Hudi and Flink to address these challenges.

2. What is Real-Time Aggregation in Big Data Systems?

Real-time aggregation, in this context, refers to the continuous computation of aggregate metrics (e.g., page views, unique users, conversion rates) over a sliding window of time, directly from incoming data streams. It’s not simply low-latency querying; it’s the continuous update of materialized views. This differs from traditional ETL where data is processed in batches.
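The "continuously updated materialized view" idea can be reduced to a few lines. A minimal, framework-free Python sketch (window size, event shape, and function names are illustrative, not part of any Flink API):

```python
from collections import defaultdict

def window_start(ts_seconds: int, window_size: int = 60) -> int:
    """Align an event timestamp to the start of its tumbling window."""
    return ts_seconds - (ts_seconds % window_size)

def aggregate_page_views(events, window_size: int = 60):
    """Maintain page-view counts per (window, page).

    Each incoming event upserts the materialized view in place instead of
    recomputing it from scratch -- the core idea behind real-time aggregation.
    """
    view = defaultdict(int)  # (window_start, page) -> running count
    for ts, page in events:
        view[(window_start(ts, window_size), page)] += 1
    return dict(view)

events = [(0, "/home"), (30, "/home"), (65, "/checkout")]
print(aggregate_page_views(events))
# {(0, '/home'): 2, (60, '/checkout'): 1}
```

In production, the same per-key running state lives in Flink's managed state and the resulting rows are upserted into Hudi rather than held in memory.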

Architecturally, it sits between data ingestion and querying. Data is ingested via a message queue (Kafka), processed by a stream processing engine (Flink), and stored in a data lake table format optimized for incremental updates (Hudi). Hudi provides ACID transactions and efficient upserts, enabling Flink to update aggregate tables without full rewrites. At the storage level, the Hudi Flink writer appends records to delta log files (for merge-on-read tables), which are periodically compacted into columnar Parquet base files. Hudi’s record-level indexing (e.g., the Bloom filter index) and Hive-style partitioning accelerate upserts and query pruning, respectively.
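In Flink SQL, the whole Kafka → windowed aggregate → Hudi path fits in three statements. A sketch with hypothetical table, topic, and bucket names, and connector options abridged (e.g., Kafka bootstrap servers and startup mode omitted):

```sql
-- Source: clickstream events from Kafka, with a 5s watermark for late data.
CREATE TABLE clicks (
  user_id STRING,
  page STRING,
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clickstream',
  'format' = 'json'
);

-- Sink: a merge-on-read Hudi table keyed by (window_start, page).
CREATE TABLE page_views_1m (
  window_start TIMESTAMP(3),
  page STRING,
  views BIGINT,
  PRIMARY KEY (window_start, page) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 's3a://my-bucket/page_views_1m',
  'table.type' = 'MERGE_ON_READ'
);

-- Continuous job: per-minute page-view counts, upserted into Hudi.
INSERT INTO page_views_1m
SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE), page, COUNT(*)
FROM clicks
GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), page;
```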

3. Real-World Use Cases

  • Real-time Personalization: Dynamically adjusting website content or product recommendations based on a user’s recent browsing history.
  • Fraud Detection: Identifying suspicious activity patterns (e.g., rapid-fire transactions from a single IP address) in real-time.
  • A/B Testing Analysis: Monitoring the performance of different website variations and making data-driven decisions about which version to deploy.
  • Operational Monitoring: Tracking key performance indicators (KPIs) like error rates, latency, and resource utilization to proactively identify and resolve issues.
  • Real-time Marketing Campaigns: Triggering targeted marketing messages based on user behavior (e.g., abandoned shopping carts).

4. System Design & Architecture

```mermaid
graph LR
    A[Clickstream Events] --> B(Kafka);
    B --> C{Flink Job};
    C --> D["Hudi Table (Parquet)"];
    D --> E(Presto/Trino);
    E --> F[Dashboards/Applications];

    subgraph Data Lake
        D
    end

    subgraph Stream Processing
        C
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#ccf,stroke:#333,stroke-width:2px
    style C fill:#cfc,stroke:#333,stroke-width:2px
    style D fill:#fcc,stroke:#333,stroke-width:2px
    style E fill:#cff,stroke:#333,stroke-width:2px
    style F fill:#ffc,stroke:#333,stroke-width:2px
```

This architecture utilizes Kafka as the ingestion layer, providing buffering and fault tolerance. Flink consumes events from Kafka, performs aggregations (e.g., counting page views per user per minute), and writes the results to a Hudi table stored in cloud object storage (S3, GCS, Azure Blob Storage). Presto/Trino is used for querying the Hudi table, powering real-time dashboards and applications.

A cloud-native setup on AWS EMR would involve:

  • EMR cluster with Flink and Presto installed.
  • Kafka managed by MSK.
  • Hudi table stored in S3.
  • IAM roles for secure access to resources.

5. Performance Tuning & Resource Management

Performance hinges on several factors. Flink parallelism is crucial. We typically configure parallelism.default to be 2-4x the number of task manager cores in the cluster.

```yaml
taskmanager.memory.process.size: 16g
taskmanager.memory.managed.fraction: 0.4
taskmanager.numberOfTaskSlots: 4
```

Hudi compaction frequency impacts query latency. Too infrequent, and queries scan many small log files. Too frequent, and compaction consumes excessive resources. The Hudi Flink writer triggers compaction based on accumulated delta commits and/or elapsed time rather than a wall-clock cron schedule:

```yaml
compaction.trigger.strategy: num_or_time
compaction.delta_commits: 5
compaction.delta_seconds: 900    # at most once per 15 minutes of data
compaction.max_memory: 4096      # MB of memory for the compaction merge
```
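The "schedule based on data volume" heuristic amounts to simple arithmetic: size compaction.delta_commits so that the delta logs accumulated between compactions roughly match the target base-file size. A back-of-envelope sketch (the function and all numbers are illustrative assumptions, not Hudi API):

```python
def delta_commits_for_compaction(ingest_mb_per_min: float,
                                 checkpoint_interval_min: float,
                                 target_file_mb: float = 200.0) -> int:
    """Pick a compaction.delta_commits value so that log data accumulated
    between compactions roughly matches the target base-file size.

    The Hudi Flink writer produces one delta commit per checkpoint, so
    data-per-commit = ingest rate x checkpoint interval.
    """
    mb_per_commit = ingest_mb_per_min * checkpoint_interval_min
    return max(1, round(target_file_mb / mb_per_commit))

# e.g. 50 MB/min ingest with 1-minute checkpoints and 200 MB target files
print(delta_commits_for_compaction(50, 1))  # 4
```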

File size optimization is critical. We aim for Parquet file sizes between 128MB and 256MB; Hudi’s hoodie.parquet.max.file.size property controls this. Shuffle reduction techniques in Flink (e.g., using keyed streams and avoiding unnecessary repartitioning) minimize network I/O. On the query side, Presto/Trino has its own concurrency knobs (e.g., task.concurrency), which we tune based on cluster size. S3 connection settings are also important:

```yaml
fs.s3a.connection.maximum: 1000
fs.s3a.connection.timeout: 10000
```

6. Failure Modes & Debugging

  • Data Skew: Uneven distribution of keys can lead to hot spots and out-of-memory errors. Solution: Use salting or pre-aggregation to redistribute the load.
  • Out-of-Memory Errors: Insufficient memory allocated to Flink task managers. Solution: Increase taskmanager.memory.process.size.
  • Job Retries: Transient errors (e.g., network issues) can cause job failures. Flink’s fault tolerance mechanisms handle retries, but excessive retries indicate underlying problems.
  • DAG Crashes: Errors in Flink code or configuration can lead to DAG crashes. The Flink UI provides detailed error messages and stack traces.
  • Hudi Compaction Failures: Can occur due to resource constraints or data corruption. Monitor Hudi metrics (e.g., compaction queue length, compaction success rate).
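The salting mitigation for data skew can be sketched in a few lines: a hot key is spread across N salted partitions for the parallel pre-aggregation stage, then the partial counts are merged back onto the original key. Plain Python stand-in (names and the bucket count are illustrative):

```python
import random
from collections import Counter

SALT_BUCKETS = 8  # tunable: more buckets = less skew, more merge fan-in

def salted_key(key: str) -> str:
    """Spread a hot key across SALT_BUCKETS sub-keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def desalt(key: str) -> str:
    return key.rsplit("#", 1)[0]

# Stage 1: pre-aggregate per salted key (runs in parallel across tasks).
events = ["hot_page"] * 1000 + ["cold_page"] * 10
partial = Counter(salted_key(k) for k in events)

# Stage 2: merge partial counts back onto the original key.
final = Counter()
for k, n in partial.items():
    final[desalt(k)] += n

assert final == Counter({"hot_page": 1000, "cold_page": 10})
```

The trade-off is a second aggregation stage; it pays off only for keys hot enough to overwhelm a single task slot.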

Monitoring tools like Datadog are essential for tracking key metrics: Flink job latency, Hudi compaction time, S3 read/write latency, and Kafka consumer lag. Flink’s web UI provides detailed information about task manager resource usage and job execution.

7. Data Governance & Schema Management

Hudi integrates with the Hive Metastore, providing a central metadata catalog. We use a schema registry (e.g., Confluent Schema Registry) to manage schema evolution. Avro is our preferred data format due to its schema evolution capabilities. Backward compatibility is enforced at the registry: new fields must carry default values, and fields are deprecated rather than abruptly removed, so records written with older schemas remain readable. Data quality checks are implemented in Flink using custom operators to validate data against predefined rules.
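The backward-compatibility rule is easy to see in miniature. An illustrative stand-in for Avro-style schema resolution (not the Avro library itself; schemas and field names are hypothetical): a reader on the newer schema fills missing fields from defaults, so old records still parse.

```python
# Field name -> default value (None means required, no default).
SCHEMA_V1_FIELDS = {"user_id": None, "page": None}
SCHEMA_V2_FIELDS = {"user_id": None, "page": None, "referrer": "direct"}  # new field has a default

def read_with_schema(record: dict, fields: dict) -> dict:
    """Resolve a record against a (possibly newer) schema, Avro-style:
    missing fields fall back to their defaults; required fields must exist."""
    out = {}
    for name, default in fields.items():
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default
        else:
            raise ValueError(f"missing required field {name!r}")
    return out

old_record = {"user_id": "u1", "page": "/home"}  # written with v1
print(read_with_schema(old_record, SCHEMA_V2_FIELDS))
# {'user_id': 'u1', 'page': '/home', 'referrer': 'direct'}
```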

8. Security and Access Control

Data encryption is enabled at rest (S3 encryption) and in transit (TLS). Apache Ranger is used to enforce fine-grained access control policies on the Hudi table, restricting access to authorized users and groups. Audit logging is enabled to track data access and modifications. Kerberos authentication is used to secure the Hadoop ecosystem.

9. Testing & CI/CD Integration

We use Great Expectations to validate data quality in Flink pipelines. DBT tests are used to validate the accuracy of aggregated data in the Hudi table. Unit tests are written for custom Flink operators. CI/CD pipelines are automated using Jenkins, with automated regression tests run on staging environments before deploying to production. Pipeline linting is performed to ensure code quality and adherence to coding standards.
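Unit tests for custom operators are cheapest when the validation rule is pure logic, testable without a running cluster. A minimal sketch (the rule and field names are hypothetical examples, not our actual checks):

```python
def is_valid_click(event: dict) -> bool:
    """Reject events with missing identifiers or implausible timestamps."""
    return (
        isinstance(event.get("user_id"), str) and event["user_id"] != ""
        and isinstance(event.get("ts"), int) and event["ts"] > 0
    )

def test_is_valid_click():
    assert is_valid_click({"user_id": "u1", "ts": 1700000000})
    assert not is_valid_click({"user_id": "", "ts": 1700000000})   # empty id
    assert not is_valid_click({"user_id": "u1", "ts": -5})         # bad timestamp
    assert not is_valid_click({"ts": 1700000000})                  # missing id

test_is_valid_click()
print("all validation tests passed")
```

Keeping the rule separate from the Flink operator that wraps it is what makes this kind of fast, cluster-free testing possible.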

10. Common Pitfalls & Operational Misconceptions

  • Ignoring Data Skew: Leads to performance bottlenecks and instability. Mitigation: Salting, pre-aggregation.
  • Underestimating Compaction Costs: Compaction can consume significant resources. Mitigation: Optimize compaction schedule and memory allocation.
  • Incorrect Hudi Configuration: Suboptimal settings can degrade performance. Mitigation: Thoroughly test different configurations.
  • Lack of Monitoring: Makes it difficult to identify and resolve issues. Mitigation: Implement comprehensive monitoring and alerting.
  • Treating Hudi as a Simple Parquet Store: Failing to leverage Hudi’s incremental update capabilities. Mitigation: Design pipelines to take advantage of Hudi’s features.

11. Enterprise Patterns & Best Practices

  • Data Lakehouse Architecture: Combining the benefits of data lakes and data warehouses.
  • Batch vs. Micro-batch vs. Streaming: Choosing the appropriate processing paradigm based on latency requirements. Flink processes records continuously, but its checkpoint-aligned commits to Hudi give micro-batch-like visibility, balancing latency and throughput.
  • Parquet as the Default Format: Provides efficient compression and columnar storage.
  • Storage Tiering: Moving infrequently accessed data to cheaper storage tiers (e.g., S3 Glacier).
  • Workflow Orchestration: Using Airflow or Dagster to manage complex data pipelines.
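The storage-tiering pattern above is typically expressed as an S3 lifecycle rule. A sketch with a hypothetical bucket prefix and transition windows (note: lifecycle rules should only cover cold data partitions, never Hudi's active timeline/metadata files):

```json
{
  "Rules": [
    {
      "ID": "tier-old-clickstream-partitions",
      "Status": "Enabled",
      "Filter": { "Prefix": "hudi/clickstream/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```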

12. Conclusion

Real-time aggregation of clickstream data is a critical capability for modern data-driven organizations. The architecture described here, leveraging Apache Hudi and Flink, provides a scalable, reliable, and cost-efficient solution. Next steps include benchmarking different Hudi compaction strategies, tightening schema enforcement via the schema registry, and evaluating Apache Iceberg, an alternative table format, for improved metadata management. Continuous monitoring and optimization are essential for maintaining optimal performance and reliability.
