Sandeep

🔥 Day 7: Mastering Joins, Unions, and GroupBy in PySpark - The Core ETL Operations

Welcome to Day 7 of your Spark Mastery journey!

Today is one of the most practical days because joins, unions, and aggregations are used in almost every pipeline you will ever build - be it feature engineering, building fact tables, or aggregating transactional data.

Let's master the fundamentals with clarity and real-world examples.

🌟 1. Joins in PySpark - The Heart of ETL Pipelines

A join merges two DataFrames based on keys, similar to SQL.

Basic join:

df.join(df2, df.id == df2.id, "inner")

Join on a shared column name (the join key appears only once in the output):

df.join(df2, ["id"], "left")
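A minimal, self-contained sketch (the sample data is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

df.join(df2, ["id"], "inner").show()  # only id = 1 matches
df.join(df2, ["id"], "left").show()   # both rows from df; dept is null for id = 2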

🔹 Types of Joins in Spark
Join Type - Meaning
inner - only matching rows
left - all rows from the left, matched rows from the right
right - all rows from the right, matched rows from the left
full - all rows from both sides
left_anti - rows in the left with NO match in the right
left_semi - rows in the left where a match exists in the right
cross - Cartesian product of both sides

Important:
left_semi = existence check (filter the left side by whether a match exists)
left_anti = anti-join (keep only the unmatched left rows)
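A quick sketch, reusing the hypothetical df and df2 from the join example above:

df.join(df2, ["id"], "left_semi").show()  # only id = 1, keeping df's columns only
df.join(df2, ["id"], "left_anti").show()  # only id = 2, the unmatched row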

🌟 2. Union - Stack DataFrames Vertically

Union by position (requires the same schema, with columns in the same order):

df.union(df2)

Union by column names:

df.unionByName(df2)
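Since Spark 3.1, unionByName can also tolerate schema drift - the allowMissingColumns flag fills columns missing from one side with nulls:

df.unionByName(df2, allowMissingColumns=True)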

Why is this important?
Because in real projects you routinely combine (see the sketch after this list):

  • monthly files
  • daily ingestion datasets
  • partitions
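A common pattern, sketched here with functools.reduce and hypothetical monthly extracts:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical monthly extracts with identical schemas
jan = spark.createDataFrame([(1, 100)], ["id", "amount"])
feb = spark.createDataFrame([(2, 200)], ["id", "amount"])
mar = spark.createDataFrame([(3, 300)], ["id", "amount"])

# Stack any number of DataFrames in one pass
combined = reduce(DataFrame.unionByName, [jan, feb, mar])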

🌟 3. GroupBy + Aggregation - Business Logic Layer

This is how reports, fact tables, and metrics are built.

Example:

df.groupBy("dept").agg(
sum("salary").alias("total_salary"),
avg("age").alias("avg_age")
)
Enter fullscreen mode Exit fullscreen mode

🔹 count vs countDistinct

from pyspark.sql.functions import count, countDistinct

df.select(count("id"))          # counts non-null values of id
df.select(countDistinct("id"))  # counts distinct non-null values of id

🔹 approx_count_distinct (faster!)

from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("id"))
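It trades a little accuracy for speed (Spark uses the HyperLogLog++ algorithm under the hood). The optional rsd parameter caps the maximum relative standard deviation of the estimate:

# Allow at most 1% relative error (slower, but more accurate)
df.select(approx_count_distinct("id", rsd=0.01))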

🌟 4. Real ETL Example - Sales Aggregation

Suppose you have:

  • sales table
  • product table

Join them:

df_joined = sales.join(products, "product_id", "left")

Aggregate revenue:

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)
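To turn the aggregate into a ranked report, one more line does it - a minimal sketch, assuming the df_agg above:

from pyspark.sql.functions import desc

# Highest-revenue categories first
df_agg.orderBy(desc("total_revenue")).show()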

This is EXACTLY how business dashboards are built.

🌟 5. Join Performance Optimization

Use Broadcast Join for small lookup tables:

from pyspark.sql.functions import broadcast

df.join(broadcast(df_small), "id")

Why?
The small table is copied to every executor, so Spark skips shuffling the large table → the join runs much faster.
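Spark also broadcasts small tables automatically when they fall below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A minimal sketch of tuning it:

# Raise the auto-broadcast cutoff to 50 MB (the value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)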

🚀 Summary of Day 7

Today we learnt:

  • Joins
  • Union / UnionByName
  • GroupBy
  • Aggregations
  • Broadcast join optimization

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
