
Sandeep

🔥 Day 5: Introduction to DataFrames - The Most Important Spark API

Welcome to Day 5 of the Spark Mastery Series. Today we begin the journey with DataFrames - the heart of modern Spark programming.

If you're a Data Engineer, Data Scientist, or ETL developer, DataFrames will become your default tool.

Let's understand them deeply.

📌 What is a DataFrame?

A DataFrame in Spark is:

A distributed, column-based, optimized table-like structure used for efficient data processing.

  • It feels like SQL
  • It works like Pandas
  • It scales to terabytes effortlessly
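
A minimal sketch of that dual nature (the SparkSession setup is standard boilerplate; the app name and toy data are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day5-demo").getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B"), (7, "C")], ["id", "name"])

# SQL feel: register a temp view and query it with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id > 5").show()

# Pandas feel: filter with a column expression on the DataFrame itself
df[df.id > 5].select("name").show()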

🔥 Why DataFrames are better than RDDs
DataFrames outperform RDDs because:

  • They use the Catalyst optimizer → it rewrites your query plan for speed
  • They use the Tungsten execution engine → memory-efficient binary processing
  • They support whole-stage code generation
  • They allow SQL-like expressions
  • They support file formats like Parquet, ORC, JSON, and Avro

This is why almost every production Spark job uses DataFrames.
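
You don't have to take the optimizer on faith: explain() prints the plans Catalyst produces (a quick sketch reusing the toy df from above):

# extended=True shows the parsed, analyzed, and optimized logical plans,
# plus the physical plan that Tungsten executes
df.filter(df.id > 5).select("name").explain(True)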

🧠 Creating Your First DataFrame
1️⃣ From a Python list

df = spark.createDataFrame([(1,"A"), (2,"B")], ["id","name"])
df.show()

2️⃣ From CSV

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

3️⃣ From JSON

df = spark.read.json("users.json")

4️⃣ From Parquet (fastest!)

df = spark.read.parquet("events.parquet")
Parquet is columnar and carries its own schema, which is why it loads faster than CSV or JSON.

πŸ” Understanding Schema

Every DataFrame has a schema (column name + data type).

df.printSchema()

Example:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Schema is critical because Spark validates column names and types when it analyzes your query, so a wrong column name fails fast with an AnalysisException instead of surfacing deep inside a job.
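
In production jobs it's common to declare the schema explicitly instead of relying on inferSchema; a minimal sketch, reusing the sales.csv example from above (the variable name sales_df is just illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declaring the schema up front skips the extra inference pass over the file
schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

sales_df = spark.read.schema(schema).csv("sales.csv", header=True)
sales_df.printSchema()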

✨ DataFrame Operations You'll Use Daily

Select columns:

df.select("name", "id").show()

Filter rows:

from pyspark.sql.functions import col

df.filter(col("id") > 5).show()

Add new columns:

df = df.withColumn("new_value", col("id") * 100)

Drop columns:

df = df.drop("unwanted_column")

Rename columns:

df = df.withColumnRenamed("id", "user_id")
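
Because each operation returns a new DataFrame, these steps chain naturally. A sketch combining them, assuming df still has its original id and name columns:

from pyspark.sql.functions import col

result = (
    df.withColumnRenamed("id", "user_id")             # rename
      .withColumn("new_value", col("user_id") * 100)  # derive
      .filter(col("user_id") > 1)                     # filter
      .select("name", "new_value")                    # project
)
result.show()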

πŸ” DataFrame Actions β€” These Trigger Execution

Example actions:

df.count()     # number of rows
df.show()      # print the first rows to the console
df.collect()   # bring ALL rows back to the driver
df.take(5)     # first 5 rows as a Python list

Remember:
Transformations (select, filter, withColumn) = lazy
Actions (count, show, collect) = actual execution
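
You can see the difference directly: the first line below returns instantly because Spark only records a plan, and nothing runs until the action on the last line (assuming a df with an integer id column):

# No job runs here: Spark just records the plan
pipeline = df.filter(df.id > 0).withColumn("doubled", df.id * 2)

# The action is what actually launches the job
print(pipeline.count())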

💡 Tips for Best Practice
✔ Use col() instead of df["col"]
✔ Avoid collect() on huge data
✔ Prefer Parquet/ORC over CSV
✔ Use DataFrames, not RDDs
✔ Always inspect schema before transformations
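
Several of these tips in one place: a small sketch that lands the sales.csv example as Parquet for faster downstream reads (the output path is illustrative):

from pyspark.sql.functions import col

raw = spark.read.csv("sales.csv", header=True, inferSchema=True)
raw.printSchema()  # inspect the schema before transforming

# col() expressions, no collect(), and Parquet on the way out
raw.filter(col("id").isNotNull()) \
   .write.mode("overwrite") \
   .parquet("sales_parquet")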

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
