DEV Community

Xin Xu

High-Performance Data Processing: A Practical Guide from the Data Engineering Book

Data Processing & Transformation: Mastering ETL/ELT Workflows with Spark and Flink ⚡

In data engineering, transformation is where raw data becomes valuable insight. Based on the data_engineering_book, this post dives deep into the ETL vs. ELT paradigms, provides hands-on code for Spark (Batch) and Flink (Stream), and shares industry best practices for performance tuning.

👉 GitHub: datascale-ai/data_engineering_book


1. ETL vs. ELT: The Paradigm Shift

The fundamental difference lies in where and when the data is transformed.

Dimension | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform)
--- | --- | ---
Workflow | Extract → Transform (external engine) → Load | Extract → Load (raw lake) → Transform (in-situ)
Execution | Separate compute (e.g., Spark cluster) | Target engine (e.g., Snowflake, Delta Lake)
Schema | Schema-on-write (structured only) | Schema-on-read (structured/unstructured)
Flexibility | Low (rigid rules) | High (agile exploration)
Use Case | Traditional BI, small datasets | Big data, ML, data lakes, modern lakehouse
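To make the difference concrete, here is a toy sketch in plain Python (the data, the `transform` helper, and both "pipelines" are illustrative, not a real engine): both orderings end with the same cleaned table, but only ELT keeps an untouched raw layer that can be re-transformed later.

```python
raw_rows = [
    {"order_id": 1, "amount": "120.5"},
    {"order_id": 2, "amount": "-3"},   # bad record
]

def transform(rows):
    # Cleaning rule: cast amounts to float and drop non-positive values.
    cleaned = []
    for r in rows:
        amount = float(r["amount"])
        if amount > 0:
            cleaned.append({**r, "amount": amount})
    return cleaned

# ETL: transform on an external engine, load only the cleaned result.
etl_target = transform(raw_rows)

# ELT: load everything raw first, transform later inside the target system.
elt_raw_layer = list(raw_rows)          # raw zone keeps bad records for reprocessing
elt_target = transform(elt_raw_layer)   # in-situ transformation, re-runnable any time

print(etl_target == elt_target)
```

The end state is identical; what ELT buys you is the raw layer, so a changed cleaning rule only means re-running the transform, not re-extracting from the source.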

2. Hands-on: Batch & Stream Transformation

A. Batch Processing with Spark (ELT Paradigm)

In ELT, we load raw CSV data into a Delta Lake table first, then perform cleaning and aggregation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder \
    .appName("Spark_Batch_ELT") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# 1. Extract & Load (Raw Layer)
raw_df = spark.read.csv("./data/orders.csv", header=True, inferSchema=True)
raw_df.write.format("delta").mode("overwrite").save("./delta/raw/orders")

# 2. Transform (Cleaning & Aggregation)
clean_df = spark.read.format("delta").load("./delta/raw/orders") \
    .dropDuplicates(["order_id"]) \
    .filter(F.col("amount") > 0)

agg_df = clean_df.groupBy(F.to_date("create_time").alias("dt")) \
    .agg(F.sum("amount").alias("daily_total"))

agg_df.show()


B. Stream Processing with Flink (Real-time Transformation)

Real-time UV (unique visitor) calculation over a Kafka stream, persisting results to Delta Lake.

// Extract: consume raw events from Kafka
FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<>("user_behavior", new SimpleStringSchema(), props);
DataStream<UserBehavior> behaviorStream = env.addSource(source)
    .map(json -> parseUserBehavior(json));  // hypothetical helper: deserialize JSON into UserBehavior POJOs

// Transform: 5-minute tumbling window, aggregating unique users per action
DataStream<Tuple2<String, Long>> uvStream = behaviorStream
    .keyBy(behavior -> behavior.action)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .aggregate(new UvAggregateFunction());

// Load: sink results to Delta Lake
uvStream.sinkTo(deltaSink);


3. Transformation Best Practices

✅ Data Cleaning (The "Minimum Cleaning" Principle)

  • Deduplication: Use business keys (Order ID) for batch and time-windowed logic for streams.
  • Outlier Handling: Instead of deleting records, flag them (e.g., is_valid=false) to maintain auditability.
  • Null Values: Use fillna() for non-critical fields; discard if primary keys are missing.
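The three rules above can be sketched in plain Python (the field names and rows are illustrative; in Spark you would express the same logic with `dropDuplicates`, a flag column, and `fillna`):

```python
orders = [
    {"order_id": "A1", "amount": 100.0, "note": None},
    {"order_id": "A1", "amount": 100.0, "note": None},     # duplicate business key
    {"order_id": "A2", "amount": -5.0,  "note": "refund?"},# outlier amount
    {"order_id": None, "amount": 30.0,  "note": None},     # missing primary key
]

seen, cleaned = set(), []
for row in orders:
    if row["order_id"] is None:          # discard: primary key missing
        continue
    if row["order_id"] in seen:          # deduplicate on the business key
        continue
    seen.add(row["order_id"])
    row["is_valid"] = row["amount"] > 0  # flag outliers instead of deleting them
    row["note"] = row["note"] or ""      # fillna-style default for a non-critical field
    cleaned.append(row)

print(cleaned)
```

Note that the outlier row survives with `is_valid=False`, so an auditor can still trace it; only the row with no primary key is dropped outright.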

🔒 Data Masking & Privacy

Compliance is non-negotiable (GDPR/CCPA). Use hashing or masking for sensitive fields:

  • Phone Numbers: 138****5678
  • Emails: jo****@example.com
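A minimal masking sketch matching the two patterns above (plain Python; a production pipeline would apply the same logic as a Spark UDF or built-in masking function, and would also validate input formats):

```python
def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, mask the middle."""
    return phone[:3] + "****" + phone[-4:]

def mask_email(email: str) -> str:
    """Keep the first 2 characters of the local part, mask the rest."""
    local, domain = email.split("@", 1)
    return local[:2] + "****@" + domain

print(mask_phone("13812345678"))          # 138****5678
print(mask_email("john.doe@example.com")) # jo****@example.com
```

Masking is irreversible by design; when downstream jobs still need to join on the field, prefer a salted hash instead so the value stays consistent without being readable.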

📏 Standardization

  • Naming: Use snake_case (e.g., user_id) consistently.
  • Timezones: Always standardize to UTC (yyyy-MM-dd HH:mm:ss).
  • Units: Explicitly label units (e.g., amount_usd, weight_kg).
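For the timezone rule, a small sketch using only the standard library (the UTC+8 example timestamp is illustrative):

```python
from datetime import datetime, timezone, timedelta

def to_utc_string(ts: datetime) -> str:
    """Normalize a timezone-aware timestamp to UTC in yyyy-MM-dd HH:mm:ss."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# A timestamp recorded in UTC+8:
local = datetime(2024, 5, 1, 20, 30, 0, tzinfo=timezone(timedelta(hours=8)))
print(to_utc_string(local))  # 2024-05-01 12:30:00
```

Standardizing at ingestion means every downstream join and window operates on one clock; rendering in a local timezone becomes purely a presentation concern.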

4. Performance Tuning 101

Performance is about matching resources to demand. Focus on these two levers:

I. Parallelism (Concurrency)

Rule of Thumb: Parallelism ≈ Total Data Size / Per-Task Data Size, where a common target is roughly 128-256 MB of data per task.

  • Spark: Adjust spark.sql.shuffle.partitions. For 100GB of data, 400-800 partitions is a good starting point.
  • Flink: Set operator-level parallelism. Use rebalance() to redistribute records round-robin when an upstream operator produces skewed partitions.
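The back-of-the-envelope arithmetic behind the 400-800 figure, assuming the 128-256 MB-per-task target (both bounds are rules of thumb, not hard limits):

```python
# Estimate shuffle partition count from total data size and per-task target.
total_gb = 100
mb_per_task_low, mb_per_task_high = 128, 256

high = total_gb * 1024 // mb_per_task_low   # smaller tasks -> more partitions
low = total_gb * 1024 // mb_per_task_high   # larger tasks  -> fewer partitions
print(low, high)  # 400 800
```

Too few partitions means oversized tasks that spill to disk; far too many means scheduler and shuffle-file overhead dominates, so start inside this band and adjust from observed task durations.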

II. Resource Allocation

Resource | Spark Config | Flink Config
--- | --- | ---
Memory | spark.executor.memory | taskmanager.memory.process.size
CPU Cores | spark.executor.cores | taskmanager.numberOfTaskSlots
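On the Spark side, these settings are typically passed at session build time; a hypothetical configuration (the specific values are examples, not recommendations, and should be sized to your cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned_job")
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "800")  # parallelism lever from above
    .getOrCreate()
)

# Flink's counterparts live in flink-conf.yaml rather than in code, e.g.:
#   taskmanager.memory.process.size: 8g
#   taskmanager.numberOfTaskSlots: 4
```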

Summary: Correctness → Performance → Cost

As the data_engineering_book suggests: "First ensure correctness, then optimize performance, and finally reduce cost." Never sacrifice data availability for a few seconds of speed.

Are you team Spark or team Flink for your daily transformations? Let's settle the debate in the comments! 👇

