Data Processing & Transformation: Mastering ETL/ELT Workflows with Spark and Flink ⚡
In data engineering, transformation is where raw data becomes valuable insight. Based on the data_engineering_book, this post dives deep into the ETL vs. ELT paradigms, provides hands-on code for Spark (Batch) and Flink (Stream), and shares industry best practices for performance tuning.
👉 GitHub: datascale-ai/data_engineering_book
1. ETL vs. ELT: The Paradigm Shift
The fundamental difference lies in where and when the data is transformed.
| Dimension | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|---|---|---|
| Workflow | Extract → Transform (External Engine) → Load | Extract → Load (Raw Lake) → Transform (In-situ) |
| Execution | Separate Compute (e.g., Spark Cluster) | Target Engine (e.g., Snowflake, Delta Lake) |
| Schema | Schema-on-Write (Structured only) | Schema-on-Read (Structured/Unstructured) |
| Flexibility | Low (Rigid rules) | High (Agile exploration) |
| Use Case | Traditional BI, Small datasets | Big Data, ML, Data Lakes, Modern Lakehouse |
2. Hands-on: Batch & Stream Transformation
A. Batch Processing with Spark (ELT Paradigm)
In ELT, we load raw CSV data into a Delta Lake table first, then perform cleaning and aggregation.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder \
    .appName("Spark_Batch_ELT") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# 1. Extract & Load (Raw Layer): land the CSV as-is in Delta
raw_df = spark.read.csv("./data/orders.csv", header=True, inferSchema=True)
raw_df.write.format("delta").mode("overwrite").save("./delta/raw/orders")

# 2. Transform: dedupe on the business key, drop invalid amounts, aggregate daily
clean_df = spark.read.format("delta").load("./delta/raw/orders") \
    .dropDuplicates(["order_id"]) \
    .filter(F.col("amount") > 0)

agg_df = clean_df.groupBy(F.to_date("create_time").alias("dt")) \
    .agg(F.sum("amount").alias("daily_total"))
agg_df.show()
```
B. Stream Processing with Flink (Real-time Transformation)
Real-time UV/PV calculation from a Kafka stream, persisting results to Delta Lake.
```java
// Extract: consume raw events from Kafka
FlinkKafkaConsumer<String> source =
    new FlinkKafkaConsumer<>("user_behavior", new SimpleStringSchema(), props);
DataStream<UserBehavior> behaviorStream = env
    .addSource(source)
    .map(json -> parseUserBehavior(json)); // placeholder: deserialize JSON into a UserBehavior POJO

// Transform: 5-minute tumbling window for UV calculation
DataStream<Tuple2<String, Long>> uvStream = behaviorStream
    .keyBy(behavior -> behavior.action)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .aggregate(new UvAggregateFunction());

// Load: sink the windowed results to Delta Lake
uvStream.sinkTo(deltaSink);
```
3. Transformation Best Practices
✅ Data Cleaning (The "Minimum Cleaning" Principle)
- Deduplication: Use business keys (Order ID) for batch and time-windowed logic for streams.
- Outlier Handling: Instead of deleting records, flag them (e.g., `is_valid=false`) to maintain auditability.
- Null Values: Use `fillna()` for non-critical fields; discard records when primary keys are missing.
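These three rules are framework-agnostic, so here is a minimal pure-Python sketch of the same logic (the record shape and the `channel` field are illustrative, not from the post's dataset):

```python
def clean_orders(orders):
    """Apply the 'minimum cleaning' rules to a list of order dicts."""
    seen, cleaned = set(), []
    for o in orders:
        # Discard records whose primary key is missing
        if o.get("order_id") is None:
            continue
        # Deduplicate on the business key, keeping the first occurrence
        if o["order_id"] in seen:
            continue
        seen.add(o["order_id"])
        # Flag outliers instead of deleting them, for auditability
        o["is_valid"] = (o.get("amount") or 0) > 0
        # Fill nulls in non-critical fields only
        o["channel"] = o.get("channel") or "unknown"
        cleaned.append(o)
    return cleaned
```

In Spark the same steps map onto `dropDuplicates`, `withColumn`, and `fillna`; the point is that the rules, not the engine, define the cleaning contract.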
🔒 Data Masking & Privacy
Compliance is non-negotiable (GDPR/CCPA). Use hashing or masking for sensitive fields:
- Phone Numbers: `138****5678`
- Emails: `jo****@example.com`
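The two masking patterns above can be expressed as small helper functions; `mask_phone` and `mask_email` are hypothetical names, and the keep-3/keep-4 split assumes 11-digit phone numbers:

```python
def mask_phone(phone: str) -> str:
    # Keep the first 3 and last 4 digits, mask the middle
    return phone[:3] + "****" + phone[-4:]

def mask_email(email: str) -> str:
    # Keep the first 2 characters of the local part, mask the rest
    local, _, domain = email.partition("@")
    return local[:2] + "****@" + domain

print(mask_phone("13812345678"))        # 138****5678
print(mask_email("john@example.com"))   # jo****@example.com
```

For fields that must remain joinable (e.g., user IDs across tables), prefer a salted hash over masking so the value stays consistent without being reversible.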
📏 Standardization
- Naming: Use `snake_case` (e.g., `user_id`) consistently.
- Timezones: Always standardize to UTC (`yyyy-MM-dd HH:mm:ss`).
- Units: Explicitly label units (e.g., `amount_usd`, `weight_kg`).
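The timezone rule in particular is easy to get wrong; a minimal sketch of normalizing a timezone-aware timestamp to the UTC format above (the UTC+8 input is just an example):

```python
from datetime import datetime, timezone, timedelta

def to_utc_string(dt: datetime) -> str:
    """Normalize any timezone-aware datetime to the standard UTC format."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# A UTC+8 timestamp (e.g., Beijing time) normalized to UTC
local = datetime(2024, 5, 1, 20, 0, 0, tzinfo=timezone(timedelta(hours=8)))
print(to_utc_string(local))  # 2024-05-01 12:00:00
```

Storing only UTC and converting at display time avoids double-conversion bugs when pipelines span regions.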
4. Performance Tuning 101
Performance is about matching resources to demand. Focus on these two levers:
I. Parallelism (Concurrency)
Rule of Thumb: Parallelism = Total Data Size / Task Processing Capacity.
- Spark: Adjust `spark.sql.shuffle.partitions`. For 100 GB of data, 400-800 partitions is a good starting point.
- Flink: Set operator-level parallelism. Use `rebalance()` to mitigate data skew.
II. Resource Allocation
| Resource | Spark Config | Flink Config |
|---|---|---|
| Memory | `spark.executor.memory` | `taskmanager.memory.process.size` |
| CPU Cores | `spark.executor.cores` | `taskmanager.numberOfTaskSlots` |
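On the Spark side, these knobs are typically set at submission time; an illustrative invocation (the sizes and the `batch_elt_job.py` filename are placeholders to tune for your workload):

```shell
# Illustrative spark-submit — values are placeholders, not recommendations
spark-submit \
  --master yarn \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --conf spark.sql.shuffle.partitions=512 \
  batch_elt_job.py
```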
Summary: Correctness → Performance → Cost
As the data_engineering_book suggests: "First ensure correctness, then optimize performance, and finally reduce cost." Never sacrifice data availability for a few seconds of speed.
Are you team Spark or team Flink for your daily transformations? Let's settle the debate in the comments! 👇