Data Science - EDA with Pandas

Working with Pandas:

Earlier this month I started studying Data Science at the FlatIron School in New York, and during our first week we were given our first cumulative lab assignment. I was excited to start the program and take on the lab work, so I've decided to share how I tackled the lab problems in a comprehensive post. I hope someone out there finds this information helpful when working with Pandas in Python.

Introduction to Lab:

The data explored in this lab is housing data from Iowa. Some of the variables to consider with housing data are the Sale Price, the Overall Condition of the house, the Overall Quality of the house, the Total Rooms Above Grade (that is, above ground level), and the Year Sold. You'll see in the lab how some of these variables are more strongly correlated with Sale Price than others, and what that suggests about the housing data. Below is an example of a few of the integer variables, which were the ones we primarily focused on in the lab.

The DataFrame summary lists 1,460 observations for each of these integer columns: OverallQual, OverallCond, YearBuilt, and YearRemodAdd. Reading the summary from left to right, we have the column number, the column name, the number of non-null observations (1,460), and the Dtype, which is the data type of the column (for example object, float, or integer).
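The summary described above matches what `DataFrame.info()` prints. Here is a minimal sketch of producing and reading it. The four rows of values are made up for illustration (the lab's real data has 1,460 rows); only the column names come from the lab.

```python
import io

import pandas as pd

# Synthetic stand-in for the lab's housing data (values are made up).
df = pd.DataFrame(
    {
        "OverallQual": [7, 6, 7, 8],
        "OverallCond": [5, 8, 5, 5],
        "YearBuilt": [2003, 1976, 2001, 2000],
        "YearRemodAdd": [2003, 1976, 2002, 2000],
    }
)

# df.info() writes one row per column: index (#), column name,
# non-null count, and dtype. Passing buf captures the text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# The same facts are also available programmatically.
non_null_counts = df.count()  # observations per column
column_dtypes = df.dtypes     # data type per column
```

In a notebook you would normally just call `df.info()` and read the printed table; the buffer is only there so the text can be inspected in code.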

Histograms and Stats

To interpret the data we are working with, it is essential to create graphs and calculate the mean, median, and standard deviation so that we understand how the data is distributed. One of the most useful graphs for looking at the distribution of a single variable is a histogram.

Sale Price:

Here we can see the distribution of our Sale Price in a histogram that shows the frequency of prices in our data. Of the 1,460 Sale Prices in our data, the most common fall between $100,000 and $200,000. This is represented by the tallest bar in the graph, with a frequency near 700, well above the bars for less common sale prices. The average sale price was $180,921 and the median sale price was $163,000. Both values sit inside the range where the histogram peaks, and the mean being higher than the median suggests that a smaller number of expensive homes pull the average up. The standard deviation is $79,442, which measures how widely the sale prices are spread around the mean. All of these numbers are consistent with the histogram.

Other columns that were relevant for creating histograms were Total Rooms Above Grade and Overall Condition.
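To make the statistics concrete, here is a small sketch of how the mean, median, standard deviation, and histogram bin counts can be computed. The eight prices are made up, so the numbers will not match the lab's results; the $100,000-wide bins are also my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sale prices standing in for the lab's SalePrice column.
sale_price = pd.Series(
    [120000, 135000, 150000, 163000, 180000, 210000, 250000, 400000],
    name="SalePrice",
)

mean_price = sale_price.mean()
median_price = sale_price.median()
std_price = sale_price.std()  # sample standard deviation (ddof=1)

# A histogram is a count of values per bin; np.histogram returns
# those counts without drawing anything.
bin_edges = np.arange(100000, 500001, 100000)  # $100k-wide bins
counts, edges = np.histogram(sale_price, bins=bin_edges)

print(f"mean={mean_price:,.0f} median={median_price:,.0f} std={std_price:,.0f}")
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"${left:,.0f} to ${right:,.0f}: {count} houses")
```

With matplotlib installed, `sale_price.plot.hist(bins=bin_edges)` would draw the same counts as bars. Notice that the single $400,000 house pulls the mean above the median, the same pattern the lab data shows.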

Difference Between Subsets

Our next task was to find the difference between subsets of houses grouped by condition. The conditions were Below Average, Average, and Above Average. The code I wrote to create these subsets is below:

Subsets of Overall Condition

below_average_condition = df.loc[df['OverallCond'] < 5]
average_condition = df.loc[df['OverallCond'] == 5]
above_average_condition = df.loc[df['OverallCond'] > 5]

This code splits the data on the Overall Condition column into the three categories mentioned above. We used the .loc indexer in pandas with a boolean condition to select the rows for each category. A rating of 5 is the cutoff: below 5 is Below Average, exactly 5 is Average, and above 5 is Above Average.

Now we can create a comparative histogram that shows the difference between these three categories and their corresponding sale prices.
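One way a comparative histogram like this could be drawn is sketched below. This is my own illustration, not the lab's plotting code: the eight rows are made up, and `plot_subsets` is a hypothetical helper name. It needs matplotlib, which is imported only when the function is called.

```python
import pandas as pd

# Made-up rows standing in for the lab's DataFrame.
df = pd.DataFrame(
    {
        "OverallCond": [3, 4, 5, 5, 5, 6, 7, 8],
        "SalePrice": [90000, 110000, 160000, 185000,
                      230000, 140000, 155000, 170000],
    }
)

# Boolean masks passed to .loc keep only the matching rows.
subsets = {
    "below average condition": df.loc[df["OverallCond"] < 5],
    "average condition": df.loc[df["OverallCond"] == 5],
    "above average condition": df.loc[df["OverallCond"] > 5],
}

# Quick numeric comparison before plotting anything.
subset_sizes = {label: len(rows) for label, rows in subsets.items()}
median_prices = {
    label: rows["SalePrice"].median() for label, rows in subsets.items()
}


def plot_subsets(subsets, bins=10):
    """Overlay one semi-transparent SalePrice histogram per subset."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    # Use one shared price range so every subset gets the same bin edges.
    all_prices = pd.concat([rows["SalePrice"] for rows in subsets.values()])
    price_range = (all_prices.min(), all_prices.max())

    fig, ax = plt.subplots()
    for label, rows in subsets.items():
        ax.hist(rows["SalePrice"], bins=bins, range=price_range,
                alpha=0.5, label=label)
    ax.set_xlabel("Sale Price")
    ax.set_ylabel("Number of Houses")
    ax.legend()
    return fig, ax
```

Calling `plot_subsets(subsets)` returns the figure and axes. Sharing the bin edges across subsets matters: if each subset picked its own bins, the bars would not line up and the comparison would be misleading.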

Histogram of Subsets

We can see here that the histogram has two axes. The x-axis is the Sale Price and the y-axis is the Number of Houses. Our first subset, Above Average Condition, is highlighted in blue. We can see that a high number of houses were sold in Above Average condition, mostly near the average sale price mentioned earlier with our initial histogram. The second subset, Average Condition, is highlighted in gray. There seems to be a slightly larger number of houses sold in Average Condition, and at higher prices. This could simply mean that the difference between Above Average and Average houses in Iowa is not that drastic.

This seems like a reasonable hypothesis, although we can't be sure of it without looking at other variables that may contribute to the explanation. Other integer variables that could help us gain a greater understanding are the Overall Quality and the Year Built. As a Data Scientist it's important to think beyond the task at hand and use your imagination to sharpen your critical thinking. This analysis is an example of forming a hypothesis to explain why the data looks different from what you expected, even if the hypothesis later turns out to be wrong.

For our last subset, Below Average, fewer houses were sold, and at lower prices, compared to the other two subsets above. This result seems plausible because you would expect houses that fit the Below Average condition criteria to be sold or listed less frequently. Additionally, their sale prices will be lower due to their condition.

All in all, the conclusion we can draw from these subsets is that the Above Average and Average conditions make up the majority of houses sold and have the highest sale prices. The one surprise is that houses in Average condition sell for more than houses in Above Average condition. One possible explanation is that Iowa is not a major metropolitan market.

Correlations

In this lab we also checked whether there were strong correlations between variables. We found several correlations, some stronger than others. To simplify this process we looked at the strongest positive and negative correlations. A correlation coefficient ranges from -1.0 to 1.0. The strongest positive correlation you can have is 1.0, and the strongest negative correlation you can have is -1.0. Any value above 0 is a positive correlation, any value below 0 is a negative correlation, and the closer the value is to 0, the weaker the relationship. With this knowledge we can code for correlations and pick out the most positive and most negative ones. The strongest positive correlation was between the Overall Quality of the houses in our data and the Sale Price, with a correlation of 0.79. Below you can see that positive correlation as a box plot.
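Finding the most positively and most negatively correlated columns can be sketched as follows. The six rows are made up so that the pattern is obvious, and `YrSold` is my assumption for the name of the Year Sold column; the resulting coefficients are not the lab's.

```python
import pandas as pd

# Made-up values; only the idea (not the numbers) mirrors the lab.
df = pd.DataFrame(
    {
        "OverallQual": [4, 5, 6, 7, 8, 9],
        "OverallCond": [5, 6, 5, 7, 5, 6],
        "YrSold": [2010, 2009, 2009, 2008, 2007, 2006],
        "SalePrice": [100000, 120000, 150000, 180000, 220000, 300000],
    }
)

# Pearson correlation of every column with SalePrice. Drop SalePrice
# itself, since its correlation with itself is always 1.0.
correlations = df.corr()["SalePrice"].drop("SalePrice")

strongest_positive = correlations.idxmax()  # column with the largest value
strongest_negative = correlations.idxmin()  # column with the smallest value

print(correlations.sort_values(ascending=False))
print("most positive:", strongest_positive)
print("most negative:", strongest_negative)
```

`idxmax` and `idxmin` return the column label rather than the coefficient, so `correlations[strongest_positive]` gives the value itself.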

Positive Correlation (Overall Quality & Sale Price)

Here we can see this positive correlation. Each box summarizes the distribution of sale prices (the median, the quartiles, and any outliers) for one Overall Quality rating, so the boxes rising from left to right show prices increasing with quality.
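The numbers behind each box can also be computed directly. This is a rough sketch with made-up prices for two quality ratings; the column labels `q1`, `median`, and `q3` are names I chose for readability.

```python
import pandas as pd

# Made-up prices for two quality ratings (five houses each).
df = pd.DataFrame(
    {
        "OverallQual": [5, 5, 5, 5, 5, 7, 7, 7, 7, 7],
        "SalePrice": [100000, 120000, 130000, 140000, 160000,
                      180000, 200000, 210000, 230000, 280000],
    }
)

# For each rating, compute the three values that define the box:
# lower quartile, median, and upper quartile of SalePrice.
quartiles = (
    df.groupby("OverallQual")["SalePrice"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
print(quartiles)
```

With matplotlib installed, `df.boxplot(column="SalePrice", by="OverallQual")` would draw one box per rating from these same values.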

Negative Correlation (Year Sold & Sale Price)

Even though the negative correlation, at -0.5, is weaker than the positive correlation, it is still noticeable on the graph. Once again, the graph shows the statistical measures of sale price for each value of the Year Sold column.

Blog Summary

I hope you enjoyed reading my blog post! To summarize, we studied housing data in Iowa. We constructed histograms to look at the distribution of our data across 1,460 observations. Then we split the houses into three subsets by Overall Condition, compared their Sale Prices, and considered why we might be seeing those results.

We also looked at the correlations between all integer variables and Sale Price to determine which variables had the strongest positive and negative correlations with Sale Price. Then we graphed the results. We found that Overall Quality had the strongest positive correlation with Sale Price, at 0.79. We also found that Year Sold had the strongest negative correlation with Sale Price, at -0.5. The strongest negative correlation was not as strong as the strongest positive correlation.

References: FlatIron School Canvas
