πŸ”₯ Day 6: Essential PySpark DataFrame Transformations

Welcome back to Day 6 of the Spark Mastery Series!
Today we dive into the most frequently used PySpark transformations that power real-world ETL pipelines.

If you master today’s concepts, you’ll be comfortable writing the bulk of everyday PySpark ETL jobs.

Let’s begin.
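
All the snippets below assume a SparkSession and a small sample DataFrame with the columns used throughout this post (id, name, salary, age, city). The rows here are made up purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.appName("day6_transformations").getOrCreate()

# Hypothetical sample rows matching the columns used in the examples below
data = [
    (1, "Asha", 45000, 28, "Hyderabad"),
    (2, "Ravi", 28000, 17, "Chennai"),
    (3, "Meera", 90000, 64, "Bengaluru"),
]
df = spark.createDataFrame(data, ["id", "name", "salary", "age", "city"])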

🌟 1. Selecting Columns β€” The Most Basic Transformation

Just like SQL:

df.select("name", "salary").show()

You can modify columns inside select:

df.select(col("salary") * 2).show()
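
When you transform a column inside select, it usually helps to rename the result with alias(); the new column name here is just illustrative:

df.select((col("salary") * 2).alias("double_salary")).show()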

🌟 2. Adding / Modifying Columns with withColumn()

df = df.withColumn("salary_hike", col("salary") * 1.25)

Use withColumn for any of the following (a combined sketch follows the list):

  • deriving new columns
  • replacing existing columns
  • applying functions
  • adding constants
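
A short sketch covering all four uses on the sample DataFrame; the uppercase name, bonus rate, and company value are invented for illustration:

from pyspark.sql.functions import upper

df = (df
      .withColumn("bonus", col("salary") * 0.10)       # derive a new column
      .withColumn("salary", col("salary") * 1.25)      # replace an existing column
      .withColumn("name_upper", upper(col("name")))    # apply a function
      .withColumn("company", lit("Acme Corp")))        # add a constant (lit is covered next)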

🌟 3. Adding Constant Value with lit()

df = df.withColumn("country", lit("India"))

Use this when (a small sketch follows the list):

  • tagging data
  • adding metadata columns (pipeline_run_id, load_date)
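
A minimal sketch of the metadata pattern; the run id value is invented, and load_date uses current_date() from pyspark.sql.functions:

from pyspark.sql.functions import current_date

df = (df
      .withColumn("pipeline_run_id", lit("run_2024_11_30"))  # illustrative run id
      .withColumn("load_date", current_date()))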

🌟 4. Conditional Logic with when()

Equivalent to SQL CASE WHEN.

df = df.withColumn(
    "age_group",
    when(col("age") < 18, "Minor")
    .when(col("age") < 60, "Adult")
    .otherwise("Senior"))
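
If you prefer SQL syntax, the same logic can be written as a CASE WHEN string with expr(); a quick sketch:

from pyspark.sql.functions import expr

df = df.withColumn(
    "age_group",
    expr("CASE WHEN age < 18 THEN 'Minor' WHEN age < 60 THEN 'Adult' ELSE 'Senior' END"))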

🌟 5. Filtering Rows

df.filter(col("age") > 25).show()
df.filter((col("age") > 25) & (col("city") == "Hyderabad"))

You can also use .where(), which is an alias for filter().
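
OR conditions and membership checks follow the same pattern; the city names here are only examples:

df.filter((col("city") == "Hyderabad") | (col("city") == "Bengaluru")).show()
df.where(col("city").isin("Hyderabad", "Bengaluru")).show()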

🌟 6. Removing Columns

df = df.drop("middle_name")
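
drop() also accepts several column names at once; the second column here is hypothetical:

df = df.drop("middle_name", "nickname")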

🌟 7. Removing Duplicate Rows

df.dropDuplicates(["id"]).show()

To remove duplicates across all columns:

df.distinct()

🌟 8. Sorting Rows

df.orderBy(col("salary").desc())

Sorting triggers a full shuffle, which is expensive.
Use it only when necessary.
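
You can sort on several columns, and when you only need the top rows, follow the sort with limit() so Spark can often plan a top-N instead of a full global sort. A quick sketch on the sample columns:

df.orderBy(col("city").asc(), col("salary").desc()).show()
df.orderBy(col("salary").desc()).limit(10).show()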

🌟 9. Chaining Transformations (Best Practice)

Good code:

df = (df
      .filter(col("salary") > 30000)
      .withColumn("bonus", col("salary") * 0.10)
      .select("name", "salary", "bonus"))

Bad code:

df = df.filter(...)
df = df.withColumn(...)
df = df.select(...)

Always chain transformations for readability.

🌟 10. Real Use Case Example (Retail ETL)

Given sales data, add GST and categorize each purchase:

df = (df
      .withColumn("amount_gst", col("amount") * 1.18)
      .withColumn("category",
                  when(col("amount") > 1000, "Premium")
                  .otherwise("Regular"))
      .filter(col("amount_gst") > 500))

This is exactly how real-world ETL transformations look.

πŸš€ Summary

Today you learned:

  • select
  • filter
  • withColumn
  • lit
  • when
  • drop
  • distinct
  • orderBy
  • chaining

These are the building blocks of every PySpark pipeline.

Follow for more content like this, and let me know in the comments if I missed anything. Thank you!
