Just sit right back
And you’ll hear a tale
A tale of analysis
That started from a data set
Saved on a laptop’s disk
The author is a student who
The process he has sought
To find the things that can be seen
In the data he will plot, the data he will plot
The going started getting tough
The student knew the cost
If not for the EDA he would do
His insight would be lost, his insight would be lost
Now that I’ve had a little fun, it’s time to get a little more serious about the topic of this post, exploratory data analysis (or EDA). Much like the hapless castaways of that famous television show were lost at sea, you'll be lost in a sea of data without performing EDA.
Before I get too far, however, I think it’s important to provide this disclaimer: at the time of this writing, I am new to the data science world and processes.
Part of my motivation in writing this post (despite the volume of articles on this subject you will find with a quick Google search) is to help me learn this concept and cement it for myself.
As such, I’m certain there will be things I’ve missed or overlooked. Feel free to enlighten me in the comments below!
What is EDA?
I feel that defining EDA will help provide some insight into the actual process itself.
Wikipedia defines EDA as “an approach to analyzing data sets to summarize their main characteristics, often with visual methods.”
The National Institute of Standards and Technology says “Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.” (NIST)
My personal working definition is “the art and science of reviewing a set of data, often using visualizations, with an eye toward answering a question or solving a problem.”
My definition differs slightly from those above it because, generally speaking, you won’t be doing EDA on a data set without some motivating factor. That motivating factor would generally be a question you want to answer, or a problem you’re trying to solve using that data.
I also feel that there is some artistry involved in doing EDA well. It’s true that there are many techniques that can be used and a general process that can be followed (the “science” in my definition). However, each question or problem and each set of data is different. Knowing how to apply the techniques and the process to your unique situation is where the “art” comes in.
There are some other things I’d like to mention that bear keeping in mind when performing EDA:
- Foremost, focus on understanding the data. It’s easy to get lost in detailed stats and pretty visualizations, but your main goal is to get a feel for the data and how you might use it for your purpose.
- EDA can (and really should) provide insights or guidance regarding data cleaning, feature engineering, and model selection, but is technically separate from those things. Make notes or even changes, but at this stage you shouldn’t get sidetracked too much from gaining an understanding of the data.
Okay, Now How Do I Do It?
A good way to start is by answering some general questions. The answers may lead to insights, or they may lead to more focused questions. More questions usually mean more answers and (one hopes) better understanding.
Common questions include:
- Do I have any applicable domain knowledge? What do I know that might help me interpret this data?
- How many columns/features are there? What kind of data is in those columns/features?
- Are all of the columns/features useful for answering my question or solving my problem?
- Is there data missing? Duplicate data? How will I deal with those issues?
- How are the values distributed? Are there outliers? How will I deal with them?
- What relationships do I see in the data? Do any of these relationships seem to point toward an answer or solution to my question or problem?
There are a lot of tools you can use to answer these questions. Popular summary statistics of numerical values include mean or median, minimum and maximum values, and the first and third quartiles. Commonly used visualizations include histograms, box and whisker plots, line charts, and scatter plots. There are many other tools and methods that can be applied as well, and listing or describing them all is beyond the scope of this article.
An Example Of EDA In Action
In this example, we’ll be looking at the Students Performance in Exams dataset from Kaggle.
For this example, I’ll mainly be using the Pandas and Seaborn libraries for Python.
Getting A Grip On The Data
First, the General Stuff
After reading the CSV file into a Pandas DataFrame, the first thing I did was to look at a handful of records using the “sample” method in Pandas.
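Here’s a minimal sketch of that step (the filename is an assumption; use whatever name your download of the Kaggle CSV has):

```python
import pandas as pd

# Read the CSV file into a DataFrame (filename assumed)
df = pd.read_csv("StudentsPerformance.csv")

# Look at a handful of randomly sampled records
print(df.sample(5))
```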
This gives me an outline of the sort of data contained in this dataset.
Next, I used the "info" method to have a look at datatypes and the count of non-null entries in each column.
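Continuing with the DataFrame from above, that looks like:

```python
# Datatypes and non-null counts for each column
df.info()
```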
From this, I see that there are 5 categorical data columns and 3 numerical data columns. I also see that there are no null values, although that doesn't preclude placeholder values.
A Brief Look At The Numbers
I wanted a summary of the numerical columns, so I used the "describe" method:
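For example:

```python
# Summary statistics for the numerical columns
print(df.describe())
```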
The main things that stand out to me here:
- The minimum math test score was a 0, which I want to look at. How many 0s are there? Could it be a placeholder? (A quick check is sketched after this list.)
- The standard deviations are fairly large, so the test scores are quite spread out.
- The minimums sit far below the first quartiles, roughly three standard deviations away, which suggests outliers at the low end.
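Here’s one quick way to answer the zero-score question (the column name assumes the dataset’s original headers):

```python
# Count the rows where the math score is exactly 0
zero_count = (df["math score"] == 0).sum()
print(f"Students with a math score of 0: {zero_count}")
```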
Visualizations Are Useful
This is a good point to briefly talk about the use of visualizations in EDA. Humans in general are visual creatures, and we're pretty good at pattern recognition. Visualizations take advantage of those broad tendencies by providing us with pictures we can spot patterns in.
Here's an example of this idea. Taking a quick boxplot of the numerical columns visually shows that there are indeed outliers on the low end of the test scores. Those outliers will need to be dealt with, so I make a note to myself.
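A sketch of how such a boxplot might be produced with Seaborn (styling details are my guesses, not the exact plot from the original post):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplots of the three score columns to surface outliers
sns.boxplot(data=df[["math score", "reading score", "writing score"]])
plt.title("Test score distributions")
plt.show()
```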
Next I looked at the distributions of the test scores, using histograms.
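Something along these lines works (the bin count is an arbitrary choice):

```python
import matplotlib.pyplot as plt

# One histogram per test-score column
df[["math score", "reading score", "writing score"]].hist(
    bins=20, figsize=(12, 4), layout=(1, 3)
)
plt.tight_layout()
plt.show()
```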
These distributions look fairly normal, but they are skewed to the left. Many models work better when features are closer to normally distributed. This is probably fine as-is, but we may want to do some feature engineering later to reduce the skew, so I make another note to myself.
On To The Categorical Data
Now that I have a feel for the numerical data, it's time to have a look at the categorical data.
First off, again taking advantage of visual representations, I did some quick plots of the counts of each value in the categorical columns.
The first three plots we'll look at are pretty basic, showing gender, lunch type, and test preparation course completion counts.
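A sketch of those three count plots (figure sizing is a guess):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One count plot per (roughly binary) categorical column
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["gender", "lunch", "test preparation course"]):
    sns.countplot(x=col, data=df, ax=ax)
plt.tight_layout()
plt.show()
```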
From these first three plots, we can easily see that these are basically binary values:
- male vs female
- standard lunch vs free/reduced cost lunch
- completed vs didn't complete a test preparation course.
We can also see:
- both genders are fairly evenly represented
- less than half of the students receive free/reduced cost lunches
- less than half of the students completed a test preparation course.
There are two more categorical plots to look at, and they carry a bit more information than the first three. The next plot looks at the education level of the students' parents.
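One way to draw it (ordering by frequency is my choice, not necessarily the original's):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Counts for each parental education level, most frequent first
order = df["parental level of education"].value_counts().index
sns.countplot(y="parental level of education", data=df, order=order)
plt.tight_layout()
plt.show()
```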
Some quick insights we can get from this plot:
- About 18% of the parents in this sample didn't complete high school
- The majority of the parents had at least some college education
- The smallest education group were the parents with master's degrees
- There may be value in combining the college-level categories, so I make a note of that.
The last plot gives us information about the race/ethnicity groups the students belong to.
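Sketched the same way:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Counts for each race/ethnicity group, in group-name order
order = sorted(df["race/ethnicity"].unique())
sns.countplot(x="race/ethnicity", data=df, order=order)
plt.show()
```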
At a glance, this plot shows us that this category is a bit skewed:
- Over 30% of the students belong to Group C
- Less than 10% of the students belong to Group A
- More than 25% belong to Group D
This means that two race/ethnicity groups account for more than half of the students in this data set.
Look For Relationships Or Correlations
Correlation (in the usual Pearson sense) can only be calculated for numerical data, so this table is pretty small.
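Producing it is a one-liner:

```python
# Pairwise (Pearson) correlations of the numerical columns
print(df[["math score", "reading score", "writing score"]].corr())
```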
Reading and writing test scores are highly correlated, which I would expect. Math test scores have more correlation with reading test scores than with writing test scores, but it's a small difference.
We can't calculate correlation for the categorical data, but we can still look for relationships using tools like Seaborn countplots or Pandas cross-tabulations. In the interest of brevity, I'm only going to show a couple of examples, but in a real scenario I would do this for all of the categorical columns.
The first example shows the distribution of math scores grouped by the parents' education level.
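The original chart type isn't shown here; a grouped boxplot is one reasonable way to get the same view:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Math score distribution for each parental education level
sns.boxplot(x="parental level of education", y="math score", data=df)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```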
Looking at this plot, we can see that the proportion of higher math test scores generally grows as the parents' education level increases. This may be a factor to consider when we begin creating a model.
The second example shows the median of the reading test scores of the standard lunch group vs the free/reduced lunch group.
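A simple group-by gets those medians:

```python
# Median reading score for each lunch type
print(df.groupby("lunch")["reading score"].median())
```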
We can see that the median reading test score of the free/reduced lunch group is lower than that of the standard lunch group. My tentative conclusion is that the economic situation behind the free/reduced lunch group may negatively impact the students' ability to study, and in turn their performance on tests.
Closing Thoughts
EDA is an important step in the data modeling process.
Once you have a general understanding of the data you're working with, you will usually have one or more ideas for features you want to add or remove. After you get started working on those ideas, you will probably find that you have more questions that require more analysis of your data.
This is when you realize that EDA isn't something you'll only do at the beginning of a project.
Your first pass was just the beginning!
In a sense, you're never really done with the EDA for a project until you're done with the project itself.