DEV Community

Timothy Cummins
Timothy Cummins

Posted on

Using pandasql to Sort Your DataFrame

For those of you who were introduced to SQL syntax for sorting through data before learning python (like I was) or people who just prefer using queries to find their data but are doing a project in python, here is a quick tutorial to help you understand how to use pandasql. I will be using some examples from a recent project of mine, where I was analyzing data from 2019 residential sales and seeing how different features of a home affect its sales price.

1) Setup:

The first thing you are going to need to to do is import pandas and pandasql.

import pandas as pd
import pandasql as ps

Then we are going to need to create some sort of data frame in Pandas. In this example I will be using a csv on Real Property Sales from the King County Department of Assessment. If you would like to follow along you can download them here. So to do this we are going to to use the read_csv function to turn our data into a pandas data frame and then we can start with our first query.

df = pd.read_csv("../src/data/EXTR_RPSale.csv")

So lets check out the structure of this process. First I am saving my query to an expression, this helps keep things clear for anyone trying to read your code. Then I am using triple quotes so that I can easily space my query out to multiple lines to give it a cleaner format like we are used to seeing with SQL. Finally I am running the query by plugging my query expression into ps.sqldf(q1). (I know this query is only doing the same thing as df.head() but this is only an example.)

q1 = ("""
SELECT *
FROM df
LIMIT 5
""")
ps.sqldf(q1)

2) Combining Columns

To create and combine columns in your DataFrame is very simple with pandasql. The only thing you need to do is after your select statement, place a comma and add the concat function, then give it a name. Just make sure you save the new DataFrame to a variable.

q2 = ("""
SELECT *, CONCAT(Major,'-',Minor) AS MajorMinor
FROM df
""")
df2 = ps.sqldf(q2)

3) Sorting Part 1

SQL syntax is prime for sorting through your data, first off SQL makes it very easy to combine “and”, “or” and “not” operators together to filter through your data but it also allows you to use wildcards so that you can grab similar rows as well.

Right now this Dataframe is huge so let's cut it down a little bit and just take the information from 2019. To do this we could convert the DocumentDate column to a Timestamp format, but I have an easier way. I prefer to take advantage of the wildcards SQL provides, the most common being % which represents 0 or more characters and _ representing a single character.

q3 = ("""
SELECT *
FROM df
WHERE DocumentDate like "%2019"
""")
df2019 =ps.sqldf(q3)

3) Combining DataFrames

To give an example on combing DataFrames we are going to have to download a second csv, so I am going to be using the csv in the Residential Building (.ZIP) from the King County Website. Now if we look at this file by using the read_csv in step one we can see that it shares the Major and Minor with the Real Property Sales. So what we are going to do is the same thing as we did in step 2 with this DataFrame and give it a column called MajorMinor then let's save it to the variable dfResBldg. Now since we have df2019 which only has rows from 2019 and dfResBldg which only has residential building we can INNER JOIN our Dataframes to get all of the rows that are residential and updated in 2019.

q4 = ("""
SELECT *
FROM df2019
INNER JOIN dfResBldg
ON df2019.MajorMinor = dfResBldg.MajorMinor
""")
dfFinal = ps.sqldf(q4)

4) Sorting Part 2

Our Dataframe is looking pretty good right now, but from doing some research about our PropertyType column I know that this is also including condos and multifamily homes, which for my purpose I want to filter out. Also by graphing out my sales prices I see that there are some homes that were sold for 0 dollars which is not very helpful so lets clear those things out real quick with a couple “and” and “or” operators.

q5 = ("""
SELECT *
FROM dfFinal
WHERE (PropertyType = 11 or PropertyType = 12 or PropertyType = 13 or PropertyType = 14) and SalePrice > 0
""")
dfresidence =ps.sqldf(q5)

Note that I placed all of my "or" statements within parentheses, this is so that they will all be attached to the "and" statement. If we did not include them we would end up still having sales prices of 0 where property types are between 11 and 13.

We now have Dataframe we can use for some analysis. Yay. From here we could do some extra cleaning with features such as CASE statements to do some basic encoding, remove some columns by defining a new Dataframe with only the columns we want to keep after the SELECT statement or combining columns into new columns with addition, subtraction and multiplication. Though python has great syntax to easily do those things as well so I will leave that up to you to decide.

Thank you for reading this Blog, it is my first tutorial and I hope that it helps someone out there.

Top comments (0)