DEV Community

Xin Xu

High-Performance Data Processing: A Practical Guide from the Data Engineering Book

Data Processing & Transformation: Mastering ETL/ELT Workflows with Spark and Flink ⚡

In data engineering, transformation is where raw data becomes valuable insight. Based on the data_engineering_book, this post dives deep into the ETL vs. ELT paradigms, provides hands-on code for Spark (Batch) and Flink (Stream), and shares industry best practices for performance tuning.

👉 GitHub: datascale-ai/data_engineering_book


1. ETL vs. ELT: The Paradigm Shift

The fundamental difference lies in where and when the data is transformed.

Dimension | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform)
--- | --- | ---
Workflow | Extract → Transform (external engine) → Load | Extract → Load (raw lake) → Transform (in-situ)
Execution | Separate compute (e.g., Spark cluster) | Target engine (e.g., Snowflake, Delta Lake)
Schema | Schema-on-write (structured only) | Schema-on-read (structured/unstructured)
Flexibility | Low (rigid rules) | High (agile exploration)
Use Case | Traditional BI, small datasets | Big data, ML, data lakes, modern lakehouse
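To make the difference concrete, here is a toy sketch in plain Python (the data, the `transform` helper, and both "pipelines" are illustrative, not a real engine): both orderings end with the same cleaned table, but only ELT keeps an untouched raw layer that can be re-transformed later.

```python
raw_rows = [
    {"order_id": 1, "amount": "120.5"},
    {"order_id": 2, "amount": "-3"},   # bad record
]

def transform(rows):
    # Cleaning rule: cast amounts to float and drop non-positive values.
    cleaned = []
    for r in rows:
        amount = float(r["amount"])
        if amount > 0:
            cleaned.append({**r, "amount": amount})
    return cleaned

# ETL: transform on an external engine, load only the cleaned result.
etl_target = transform(raw_rows)

# ELT: load everything raw first, transform later inside the target system.
elt_raw_layer = list(raw_rows)          # raw zone keeps bad records for reprocessing
elt_target = transform(elt_raw_layer)   # in-situ transformation, re-runnable any time

print(etl_target == elt_target)
```

The end state is identical; what ELT buys you is the raw layer, so a changed cleaning rule only means re-running the transform, not re-extracting from the source.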

2. Hands-on: Batch & Stream Transformation

A. Batch Processing with Spark (ELT Paradigm)

In ELT, we load raw CSV data into a Delta Lake table first, then perform cleaning and aggregation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder \
    .appName("Spark_Batch_ELT") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# 1. Extract & Load (Raw Layer)
raw_df = spark.read.csv("./data/orders.csv", header=True, inferSchema=True)
raw_df.write.format("delta").mode("overwrite").save("./delta/raw/orders")

# 2. Transform (Cleaning & Aggregation)
clean_df = spark.read.format("delta").load("./delta/raw/orders") \
    .dropDuplicates(["order_id"]) \
    .filter(F.col("amount") > 0)

agg_df = clean_df.groupBy(F.to_date("create_time").alias("dt")) \
    .agg(F.sum("amount").alias("daily_total"))

agg_df.show()


B. Stream Processing with Flink (Real-time Transformation)

Real-time UV (unique visitor) calculation over a Kafka stream, persisting results to Delta Lake.

// Extract: consume raw events from Kafka
FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<>("user_behavior", new SimpleStringSchema(), props);
DataStream<UserBehavior> behaviorStream = env.addSource(source)
    .map(json -> parseUserBehavior(json));  // hypothetical helper: deserialize JSON into UserBehavior POJOs

// Transform: 5-minute tumbling window, aggregating unique users per action
DataStream<Tuple2<String, Long>> uvStream = behaviorStream
    .keyBy(behavior -> behavior.action)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .aggregate(new UvAggregateFunction());

// Load: sink results to Delta Lake
uvStream.sinkTo(deltaSink);


3. Transformation Best Practices

✅ Data Cleaning (The "Minimum Cleaning" Principle)

  • Deduplication: Use business keys (Order ID) for batch and time-windowed logic for streams.
  • Outlier Handling: Instead of deleting records, flag them (e.g., is_valid=false) to maintain auditability.
  • Null Values: Use fillna() for non-critical fields; discard if primary keys are missing.
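The three rules above can be sketched in plain Python (the field names and rows are illustrative; in Spark you would express the same logic with `dropDuplicates`, a flag column, and `fillna`):

```python
orders = [
    {"order_id": "A1", "amount": 100.0, "note": None},
    {"order_id": "A1", "amount": 100.0, "note": None},     # duplicate business key
    {"order_id": "A2", "amount": -5.0,  "note": "refund?"},# outlier amount
    {"order_id": None, "amount": 30.0,  "note": None},     # missing primary key
]

seen, cleaned = set(), []
for row in orders:
    if row["order_id"] is None:          # discard: primary key missing
        continue
    if row["order_id"] in seen:          # deduplicate on the business key
        continue
    seen.add(row["order_id"])
    row["is_valid"] = row["amount"] > 0  # flag outliers instead of deleting them
    row["note"] = row["note"] or ""      # fillna-style default for a non-critical field
    cleaned.append(row)

print(cleaned)
```

Note that the outlier row survives with `is_valid=False`, so an auditor can still trace it; only the row with no primary key is dropped outright.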

🔒 Data Masking & Privacy

Compliance is non-negotiable (GDPR/CCPA). Use hashing or masking for sensitive fields:

  • Phone Numbers: 138****5678
  • Emails: jo****@example.com
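A minimal masking sketch matching the two patterns above (plain Python; a production pipeline would apply the same logic as a Spark UDF or built-in masking function, and would also validate input formats):

```python
def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, mask the middle."""
    return phone[:3] + "****" + phone[-4:]

def mask_email(email: str) -> str:
    """Keep the first 2 characters of the local part, mask the rest."""
    local, domain = email.split("@", 1)
    return local[:2] + "****@" + domain

print(mask_phone("13812345678"))          # 138****5678
print(mask_email("john.doe@example.com")) # jo****@example.com
```

Masking is irreversible by design; when downstream jobs still need to join on the field, prefer a salted hash instead so the value stays consistent without being readable.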

📏 Standardization

  • Naming: Use snake_case (e.g., user_id) consistently.
  • Timezones: Always standardize to UTC (yyyy-MM-dd HH:mm:ss).
  • Units: Explicitly label units (e.g., amount_usd, weight_kg).
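For the timezone rule, a small sketch using only the standard library (the UTC+8 example timestamp is illustrative):

```python
from datetime import datetime, timezone, timedelta

def to_utc_string(ts: datetime) -> str:
    """Normalize a timezone-aware timestamp to UTC in yyyy-MM-dd HH:mm:ss."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# A timestamp recorded in UTC+8:
local = datetime(2024, 5, 1, 20, 30, 0, tzinfo=timezone(timedelta(hours=8)))
print(to_utc_string(local))  # 2024-05-01 12:30:00
```

Standardizing at ingestion means every downstream join and window operates on one clock; rendering in a local timezone becomes purely a presentation concern.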

4. Performance Tuning 101

Performance is about matching resources to demand. Focus on these two levers:

I. Parallelism (Concurrency)

Rule of Thumb: Parallelism ≈ Total Data Size / Per-Task Data Size, where a common target is roughly 128-256 MB of data per task.

  • Spark: Adjust spark.sql.shuffle.partitions. For 100GB of data, 400-800 partitions is a good starting point.
  • Flink: Set operator-level parallelism. Use rebalance() to redistribute records round-robin when an upstream operator produces skewed partitions.
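The back-of-the-envelope arithmetic behind the 400-800 figure, assuming the 128-256 MB-per-task target (both bounds are rules of thumb, not hard limits):

```python
# Estimate shuffle partition count from total data size and per-task target.
total_gb = 100
mb_per_task_low, mb_per_task_high = 128, 256

high = total_gb * 1024 // mb_per_task_low   # smaller tasks -> more partitions
low = total_gb * 1024 // mb_per_task_high   # larger tasks  -> fewer partitions
print(low, high)  # 400 800
```

Too few partitions means oversized tasks that spill to disk; far too many means scheduler and shuffle-file overhead dominates, so start inside this band and adjust from observed task durations.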

II. Resource Allocation

Resource | Spark Config | Flink Config
--- | --- | ---
Memory | spark.executor.memory | taskmanager.memory.process.size
CPU Cores | spark.executor.cores | taskmanager.numberOfTaskSlots
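On the Spark side, these settings are typically passed at session build time; a hypothetical configuration (the specific values are examples, not recommendations, and should be sized to your cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned_job")
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "800")  # parallelism lever from above
    .getOrCreate()
)

# Flink's counterparts live in flink-conf.yaml rather than in code, e.g.:
#   taskmanager.memory.process.size: 8g
#   taskmanager.numberOfTaskSlots: 4
```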

Summary: Correctness → Performance → Cost

As the data_engineering_book suggests: "First ensure correctness, then optimize performance, and finally reduce cost." Never sacrifice data availability for a few seconds of speed.

Are you team Spark or team Flink for your daily transformations? Let's settle the debate in the comments! 👇

