Data Sharding for Scale: A Production Deep Dive
Introduction
The relentless growth of data volume and velocity presents a constant challenge for data platform engineers. We recently faced a critical issue with our clickstream analytics pipeline: query latency on daily aggregates had ballooned from sub-second to over 30 seconds, impacting real-time dashboards and downstream machine learning models. The root cause wasn’t a single bottleneck, but a combination of factors – a 5x increase in daily event volume, increasingly complex analytical queries, and a monolithic data lake architecture built around a single Hive table. Simply throwing more compute at the problem wasn’t a sustainable solution; it was becoming prohibitively expensive and didn’t address the underlying architectural limitations. This led us to embark on a comprehensive data sharding project, fundamentally restructuring how we store and process our clickstream data. This post details the technical considerations, implementation, and operational learnings from that project, focusing on a Spark-based data lake built on AWS S3.
What is "Data Sharding Project" in Big Data Systems?
Data sharding, in the context of Big Data, is the practice of horizontally partitioning a large dataset across multiple storage nodes or compute clusters. It’s not merely partitioning in the Hive sense; it’s a holistic architectural approach impacting ingestion, storage format, query execution, and metadata management. The goal is to distribute the load, enabling parallel processing and reducing the amount of data any single node needs to handle.
At the protocol level, this manifests as careful consideration of partitioning schemes (hash, range, list), data locality, and the ability of processing engines (Spark, Flink, Presto) to efficiently leverage the sharding strategy. We chose a time-based sharding strategy, partitioning data by event_time (year, month, day, hour) and storing it in Parquet format with snappy compression. Parquet’s columnar storage and efficient encoding are crucial for analytical workloads, and snappy provides a good balance between compression ratio and decompression speed. The choice of partitioning keys directly impacts query performance; selecting keys frequently used in WHERE clauses is paramount.
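As a concrete illustration, the sketch below shows how such a layout can be written from Spark. It is a minimal sketch, not our exact ingestion code: the bucket, paths, and the assumption of an events DataFrame with an event_time column are all illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-sharding").getOrCreate()

# Illustrative input: raw clickstream events with an event_time column (path is hypothetical).
events = spark.read.json("s3a://example-bucket/clickstream/raw/")

# Derive the time-based shard keys from event_time.
sharded = (events
    .withColumn("year",  F.year("event_time"))
    .withColumn("month", F.month("event_time"))
    .withColumn("day",   F.dayofmonth("event_time"))
    .withColumn("hour",  F.hour("event_time")))

# Write snappy-compressed Parquet, physically laid out by the shard keys.
(sharded.write
    .mode("append")
    .option("compression", "snappy")
    .partitionBy("year", "month", "day", "hour")
    .parquet("s3a://example-bucket/clickstream/events/"))
```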
Real-World Use Cases
Data sharding becomes essential in several production scenarios:
- CDC Ingestion: When ingesting data from Change Data Capture (CDC) streams, sharding allows for parallel ingestion into different partitions based on the primary key of the source table. This avoids a single ingestion bottleneck.
- Streaming ETL: Real-time ETL pipelines processing high-velocity streams benefit from sharding to distribute the processing load across multiple stream processing nodes (e.g., Flink tasks).
- Large-Scale Joins: Joining massive tables is a common performance killer. Sharding both tables on a common key allows for distributed joins, significantly reducing shuffle data (see the sketch after this list).
- Schema Validation & Quality Checks: Sharding enables parallel schema validation and data quality checks on individual partitions, accelerating the data onboarding process.
- ML Feature Pipelines: Generating features for machine learning models often involves aggregating data across large datasets. Sharding allows for parallel feature computation, reducing training time.
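For the large-scale join case above, a co-partitioned join can be sketched as follows. The table paths, the user_id key, and the partition count are illustrative assumptions, not our production job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sharded-join").getOrCreate()

# Hypothetical tables sharded on the same key (user_id); paths and columns are illustrative.
clicks   = spark.read.parquet("s3a://example-bucket/clickstream/events/")
profiles = spark.read.parquet("s3a://example-bucket/users/profiles/")

# Repartition both sides on the join key so matching rows land in the same Spark partitions,
# keeping shuffle volume proportional to the key distribution rather than the full tables.
n = 200
joined = (clicks.repartition(n, "user_id")
          .join(profiles.repartition(n, "user_id"), "user_id"))

joined.write.mode("overwrite").parquet("s3a://example-bucket/clickstream/enriched/")
```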
System Design & Architecture
Our architecture centers around a Spark-based data lake on AWS S3. The sharding project involved modifying our ingestion pipeline, storage layer, and query engine configurations.
```mermaid
graph LR
    A[Kafka] --> B{Spark Streaming};
    B --> C[S3 - Sharded Parquet];
    C --> D[Hive Metastore];
    D --> E[Presto/Trino];
    E --> F[Dashboards/ML Models];

    subgraph Ingestion
        A
        B
    end
    subgraph Storage
        C
        D
    end
    subgraph Querying
        E
        F
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#ccf,stroke:#333,stroke-width:2px
    style C fill:#fcf,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#fcf,stroke:#333,stroke-width:2px
```
The Spark Streaming application consumes clickstream events from Kafka, partitions them based on event_time, and writes them to S3 as Parquet files. The Hive Metastore maintains metadata about the sharded data, allowing Presto/Trino to query the data transparently. We leverage S3’s prefix-based partitioning to physically organize the sharded data.
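A simplified sketch of that ingestion path is shown below, assuming a hypothetical topic name, broker address, and event schema; the real job carries considerably more parsing and error handling.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Illustrative event schema; field names, topic, and broker address are assumptions.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "clickstream")
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("year",  F.year("event_time"))
          .withColumn("month", F.month("event_time"))
          .withColumn("day",   F.dayofmonth("event_time"))
          .withColumn("hour",  F.hour("event_time")))

# Micro-batch write into the sharded Parquet layout on S3.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/clickstream/events/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/clickstream/")
         .partitionBy("year", "month", "day", "hour")
         .trigger(processingTime="1 minute")
         .start())
```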
For cloud-native deployments, we considered using AWS EMR with Spark, but ultimately opted for a self-managed Spark cluster on EC2 for greater control over configuration and resource allocation.
Performance Tuning & Resource Management
Several tuning strategies were critical:
- Partitioning: We experimented with different partitioning granularities (daily, hourly) and settled on hourly partitioning for optimal query performance and manageable file sizes.
- File Size Compaction: Small Parquet files lead to increased metadata overhead and slower query performance. We implemented a daily compaction job using Spark to combine small files into larger, more efficient files (a sketch follows this list).
- Shuffle Reduction: For joins, we enabled adaptive query execution (AQE) in Spark and tuned spark.sql.shuffle.partitions to 200 to balance parallelism and shuffle overhead.
- I/O Optimization: We increased the S3 transfer buffer size (fs.s3a.buffer.size) to 16MB and the maximum concurrent requests (fs.s3a.connection.maximum) to 1000 to maximize I/O throughput.
- Memory Management: We carefully tuned Spark executor memory (spark.executor.memory) and driver memory (spark.driver.memory) to avoid out-of-memory errors. We also enabled off-heap memory allocation for large aggregations.
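The compaction job referenced above boils down to reading a partition and rewriting it as fewer, larger files. A minimal sketch with a hypothetical partition path and a hard-coded target file count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-compaction").getOrCreate()

# Hypothetical hourly partition to compact; in practice the orchestrator iterates over
# the previous day's partitions and derives the target file count from partition size.
src = "s3a://example-bucket/clickstream/events/year=2024/month=1/day=15/hour=3/"
dst = "s3a://example-bucket/clickstream/compacted/year=2024/month=1/day=15/hour=3/"

df = spark.read.parquet(src)

# coalesce() merges many small files into a handful of larger ones without a full shuffle.
df.coalesce(4).write.mode("overwrite").option("compression", "snappy").parquet(dst)
```

In practice the compacted output is written to a staging prefix and then swapped in (for example by repointing the partition location in the metastore) rather than overwriting the live path directly.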
Here's a snippet of our Spark configuration:
```yaml
spark:
  driver:
    memory: 8g
  executor:
    memory: 16g
    cores: 4
  sql:
    shuffle.partitions: 200
    adaptive.enabled: true
fs.s3a.buffer.size: 16m
fs.s3a.connection.maximum: 1000
```
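For completeness, the same settings can be expressed when building a Spark session. In our deployment they live in spark-defaults.conf, so treat this inline form as illustrative only (driver memory in particular must be set before the driver JVM starts).

```python
from pyspark.sql import SparkSession

# Mirrors the YAML above; the Hadoop/S3A keys are passed through with the spark.hadoop. prefix.
spark = (SparkSession.builder
         .appName("clickstream-analytics")
         .config("spark.driver.memory", "8g")
         .config("spark.executor.memory", "16g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "200")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.hadoop.fs.s3a.buffer.size", "16m")
         .config("spark.hadoop.fs.s3a.connection.maximum", "1000")
         .getOrCreate())
```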
These optimizations resulted in a 70% reduction in query latency and a 30% reduction in infrastructure cost.
Failure Modes & Debugging
Common failure modes include:
- Data Skew: Uneven distribution of data across partitions can lead to performance bottlenecks. We addressed this by using salting techniques to redistribute skewed keys (a sketch follows this list).
- Out-of-Memory Errors: Large aggregations or joins can exhaust executor memory. We mitigated this by increasing executor memory, enabling off-heap memory allocation, and optimizing query plans.
- Job Retries: Transient network errors or S3 outages can cause job failures. We configured Spark to automatically retry failed tasks and jobs.
- DAG Crashes: Complex Spark DAGs can sometimes crash due to unforeseen errors. The Spark UI is invaluable for debugging DAG crashes, allowing us to identify the failing stage and the root cause.
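The salting approach mentioned under Data Skew can be sketched as follows; the join key, salt factor, and paths are illustrative assumptions rather than our exact implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

# Hypothetical skewed join on user_id; names and salt factor are illustrative.
SALT_BUCKETS = 16
clicks   = spark.read.parquet("s3a://example-bucket/clickstream/events/")
profiles = spark.read.parquet("s3a://example-bucket/users/profiles/")

# Add a random salt to the skewed side so a single hot key is spread over SALT_BUCKETS partitions.
clicks_salted = clicks.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# Explode the other side once per salt value so every salted key still finds its match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
profiles_salted = profiles.crossJoin(salts)

joined = clicks_salted.join(profiles_salted, ["user_id", "salt"]).drop("salt")
```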
Monitoring metrics like executor memory usage, shuffle read/write times, and S3 request latency are crucial for identifying and diagnosing performance issues. We use Datadog to collect and visualize these metrics, and set up alerts to notify us of anomalies.
Data Governance & Schema Management
We use the AWS Glue Data Catalog as our central metadata repository. Schema evolution is handled using Avro schemas, which are stored in a schema registry. We enforce schema validation during ingestion to ensure data quality and prevent schema drift. Backward compatibility is maintained by allowing queries to read data written with older schemas. We also implemented a data lineage tracking system to understand the flow of data through our pipeline.
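Schema validation at ingestion can be as simple as comparing the incoming batch against the expected structure before writing. The sketch below hard-codes an expected schema and an illustrative landing path; in our pipeline the expected schema is derived from the Avro definition in the registry.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-check").getOrCreate()

# Expected structure (hard-coded here for illustration).
expected = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

incoming = spark.read.json("s3a://example-bucket/clickstream/landing/")  # hypothetical path

# Fail the batch early if required fields are missing or have drifted in type.
drifted = [f for f in expected.fields
           if f.name not in incoming.schema.fieldNames()
           or incoming.schema[f.name].dataType != f.dataType]
if drifted:
    raise ValueError(f"Schema drift detected on fields: {[f.name for f in drifted]}")
```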
Security and Access Control
Data in S3 is encrypted at rest using KMS keys. Access to S3 buckets and data is controlled using IAM policies. We use AWS Lake Formation to manage fine-grained access control at the table and column level. Audit logging is enabled on all S3 buckets to track data access and modifications.
Testing & CI/CD Integration
We use Great Expectations to validate data quality and schema consistency. DBT tests are used to validate data transformations. Our CI/CD pipeline includes automated regression tests that verify the correctness of the sharding logic and query performance. We use Terraform to manage our infrastructure and ensure reproducibility.
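Great Expectations and DBT own these checks in our pipeline; the sketch below is not their API, just the spirit of the assertions expressed in plain PySpark against an illustrative path.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-regression").getOrCreate()

df = spark.read.parquet("s3a://example-bucket/clickstream/events/")  # hypothetical path

# Representative assertions: no null primary keys, and event_time never lands in the future.
null_ids = df.filter(F.col("user_id").isNull()).count()
future_events = df.filter(F.col("event_time") > F.current_timestamp()).count()

assert null_ids == 0, f"{null_ids} events have a null user_id"
assert future_events == 0, f"{future_events} events have a future event_time"
```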
Common Pitfalls & Operational Misconceptions
- Incorrect Partitioning Key: Choosing a partitioning key not frequently used in queries renders sharding ineffective. Symptom: High full table scan times. Mitigation: Analyze query patterns and select appropriate keys.
- Small File Problem: Too many small files degrade performance. Symptom: Slow query performance, high metadata overhead. Mitigation: Implement a compaction job.
- Data Skew: Uneven data distribution leads to hotspots. Symptom: Long task times for specific partitions. Mitigation: Use salting or other skew mitigation techniques.
- Ignoring Metadata Management: Lack of a robust metadata catalog hinders data discovery and governance. Symptom: Difficulty finding and understanding data. Mitigation: Implement a metadata catalog like AWS Glue or Hive Metastore.
- Insufficient Monitoring: Lack of visibility into pipeline performance makes it difficult to diagnose and resolve issues. Symptom: Unexplained performance degradation. Mitigation: Implement comprehensive monitoring and alerting.
Enterprise Patterns & Best Practices
- Data Lakehouse vs. Warehouse: We opted for a data lakehouse architecture, combining the flexibility of a data lake with the reliability and performance of a data warehouse.
- Batch vs. Micro-Batch vs. Streaming: We use a combination of batch and micro-batch processing, depending on the latency requirements of the application.
- File Format Decisions: Parquet is our preferred file format for analytical workloads due to its columnar storage and efficient encoding.
- Storage Tiering: We use S3 lifecycle policies to automatically tier data to lower-cost storage classes based on access frequency (an example follows this list).
- Workflow Orchestration: Airflow is used to orchestrate our data pipelines and ensure reliable execution.
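Our lifecycle rules are managed through Terraform; the boto3 sketch below shows an equivalent rule purely for illustration, with a hypothetical bucket name, prefix, and transition schedule.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: after 90 days partitions move to infrequent access, after 365 days to Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-clickstream-partitions",
            "Filter": {"Prefix": "clickstream/events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```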
Conclusion
Data sharding is a complex undertaking, but it’s essential for building scalable and reliable Big Data infrastructure. By carefully considering the architectural trade-offs, implementing robust monitoring and alerting, and adhering to best practices, we were able to overcome the performance challenges we faced and deliver a more responsive and cost-effective data platform. Next steps include benchmarking newer Parquet compression codecs (e.g., Zstandard) and tightening schema enforcement through the existing schema registry to further improve data quality and pipeline reliability.