Welcome to Day 5 of the Spark Mastery Series. Today we begin the journey with DataFrames - the heart of modern Spark programming.
If you're a Data Engineer, Data Scientist, or ETL developer, DataFrames will become your default tool.
Let's understand them deeply.
What is a DataFrame?
A DataFrame in Spark is:
A distributed, column-based, optimized table-like structure used for efficient data processing.
- It feels like SQL
- It works like Pandas
- But it scales to terabytes by distributing the work across a cluster.
Why DataFrames are better than RDDs
DataFrames outperform RDDs because:
- They use the Catalyst optimizer, which rewrites your query plan for speed
- They use the Tungsten execution engine, which manages memory efficiently
- They support automatic code generation
- They allow SQL-like expressions
- They support file formats like Parquet, ORC, JSON, Avro
This is why most production Spark jobs today are written with DataFrames.
Creating Your First DataFrame
1. From a Python list
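from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])
df.show()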
2. From CSV
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
3. From JSON
df = spark.read.json("users.json")
4. From Parquet (usually the fastest)
df = spark.read.parquet("events.parquet")
Understanding Schema
Every DataFrame has a schema (column name + data type).
df.printSchema()
Example:
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Schema is critical because Spark checks column names and data types against it at runtime, when it analyzes your query, not at compile time.
DataFrame Operations You'll Use Daily
Select columns:
df.select("name", "id").show()
Filter rows:
from pyspark.sql.functions import col
df.filter(col("id") > 5).show()
Add new columns:
df = df.withColumn("new_value", col("id") * 100)
Drop columns:
df = df.drop("unwanted_column")
Rename columns:
df = df.withColumnRenamed("id", "user_id")
DataFrame Actions: These Trigger Execution
Example actions:
df.count()
df.show()
df.collect()
df.take(5)
Remember:
Transformations = lazy
Actions = actual execution
Tips for Best Practice
- Use col() instead of df["col"]
- Avoid collect() on large datasets; it pulls every row into the driver's memory
- Prefer Parquet/ORC over CSV
- Use DataFrames, not RDDs
- Always inspect the schema before applying transformations
Follow for more content like this. Let me know in the comments if I missed anything. Thank you!