
Sandeep

🔥 Day 5: Introduction to DataFrames - The Most Important Spark API

Welcome to Day 5 of the Spark Mastery Series. Today we begin the journey with DataFrames - the heart of modern Spark programming.

If you're a Data Engineer, Data Scientist, or ETL developer, DataFrames will become your default tool.

Let's understand them deeply.

📌 What is a DataFrame?

A DataFrame in Spark is:

A distributed, column-based, optimized table-like structure used for efficient data processing.

  • It feels like SQL
  • It works like Pandas
  • It scales to terabytes effortlessly
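
A minimal sketch of that dual nature (the SparkSession setup is standard boilerplate; the app name and toy data are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day5-demo").getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B"), (7, "C")], ["id", "name"])

# SQL feel: register a temp view and query it with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id > 5").show()

# Pandas feel: filter with a column expression on the DataFrame itself
df[df.id > 5].select("name").show()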

🔥 Why DataFrames are better than RDDs
DataFrames outperform RDDs because:

  • They use the Catalyst optimizer → it rewrites your query plan for speed
  • They use the Tungsten execution engine → memory-efficient binary processing
  • They support whole-stage code generation
  • They allow SQL-like expressions
  • They support file formats like Parquet, ORC, JSON, and Avro

This is why almost every production Spark job uses DataFrames.
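
You don't have to take the optimizer on faith: explain() prints the plans Catalyst produces (a quick sketch reusing the toy df from above):

# extended=True shows the parsed, analyzed, and optimized logical plans,
# plus the physical plan that Tungsten executes
df.filter(df.id > 5).select("name").explain(True)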

🧠 Creating Your First DataFrame
1️⃣ From a Python list

df = spark.createDataFrame([(1,"A"), (2,"B")], ["id","name"])
df.show()

2️⃣ From CSV

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

3️⃣ From JSON

df = spark.read.json("users.json")

4️⃣ From Parquet (fastest!)

df = spark.read.parquet("events.parquet")
Parquet is columnar and carries its own schema, which is why it loads faster than CSV or JSON.

πŸ” Understanding Schema

Every DataFrame has a schema (column name + data type).

df.printSchema()

Example:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Schema is critical because Spark validates column names and types when it analyzes your query, so a wrong column name fails fast with an AnalysisException instead of surfacing deep inside a job.
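
In production jobs it's common to declare the schema explicitly instead of relying on inferSchema; a minimal sketch, reusing the sales.csv example from above (the variable name sales_df is just illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declaring the schema up front skips the extra inference pass over the file
schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

sales_df = spark.read.schema(schema).csv("sales.csv", header=True)
sales_df.printSchema()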

✨ DataFrame Operations You'll Use Daily

Select columns:

df.select("name", "id").show()

Filter rows:

from pyspark.sql.functions import col

df.filter(col("id") > 5).show()

Add new columns:

df = df.withColumn("new_value", col("id") * 100)

Drop columns:

df = df.drop("unwanted_column")

Rename columns:

df = df.withColumnRenamed("id", "user_id")
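
Because each operation returns a new DataFrame, these steps chain naturally. A sketch combining them, assuming df still has its original id and name columns:

from pyspark.sql.functions import col

result = (
    df.withColumnRenamed("id", "user_id")             # rename
      .withColumn("new_value", col("user_id") * 100)  # derive
      .filter(col("user_id") > 1)                     # filter
      .select("name", "new_value")                    # project
)
result.show()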

πŸ” DataFrame Actions β€” These Trigger Execution

Example actions:

df.count()     # number of rows
df.show()      # print the first rows to the console
df.collect()   # bring ALL rows back to the driver
df.take(5)     # first 5 rows as a Python list

Remember:
Transformations (select, filter, withColumn) = lazy
Actions (count, show, collect) = actual execution
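
You can see the difference directly: the first line below returns instantly because Spark only records a plan, and nothing runs until the action on the last line (assuming a df with an integer id column):

# No job runs here: Spark just records the plan
pipeline = df.filter(df.id > 0).withColumn("doubled", df.id * 2)

# The action is what actually launches the job
print(pipeline.count())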

💡 Tips for Best Practice
✔ Use col() instead of df["col"]
✔ Avoid collect() on huge data
✔ Prefer Parquet/ORC over CSV
✔ Use DataFrames, not RDDs
✔ Always inspect schema before transformations
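
Several of these tips in one place: a small sketch that lands the sales.csv example as Parquet for faster downstream reads (the output path is illustrative):

from pyspark.sql.functions import col

raw = spark.read.csv("sales.csv", header=True, inferSchema=True)
raw.printSchema()  # inspect the schema before transforming

# col() expressions, no collect(), and Parquet on the way out
raw.filter(col("id").isNotNull()) \
   .write.mode("overwrite") \
   .parquet("sales_parquet")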

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
