Predicate Pushdown - Understanding Practically With An Example

#spark #optimisation #sql #interview

What is predicate pushdown?

The immediate theoretical answer that we get on searching is

Predicate pushdown is a query optimisation technique used in database technologies

Okay, I got to know that it is an optimisation technique. But I still did not understand...

How is the optimisation happening? 🤨
What is predicate? 🤔
What exactly is the meaning of pushed down here? 🤷‍♂️

I'm sure since you are also reading this article, even you have these questions in mind!

Now, lets explore this interesting topic practically in PySpark using explain()
(Similar phenomenon could be observed in relational databases as well)

1] Reading a csv file containing employee information.

emp = spark.read.format('csv')\
                .option('header', True)\
                .load('data/employee.csv')
emp.show()

emp.explain(mode="formatted")

Here, step (1) is related to read csv file

2] Lets do a group by and get number of employees on each department

df = emp.groupBy('deptID').count()
df.show()

df.explain(mode="formatted")

Here, step (2) is related to group by

3] Lets filter data for only dept number 10

dept_10 = df.filter(df.deptID == 10)
dept_10.show()

dept_10.explain(mode="formatted")

Here, we can see that step (2) is filtering and step (3) is grouping.

Now here is the catch,
Ideally if we go by the sequence of operations, grouping should be done first and then filtering.

However, the optimiser does filtering first and then grouping, because grouping is an expensive operation and it is optimal to filter first and then group data.