EDA
Exploratory data analysis (EDA) is looking for patterns that address our question(s). This entire article and future ones are heavily informed by Allen Downey's ThinkStats2. It is an excellent book in terms of both content and structure, and as a cherry on top it is free and written in Python.
> data and practical methods can answer questions and guide decisions under uncertainty
>
> — Allen Downey
That is exactly what we will attempt to do in this article and the subsequent ones: answer questions given data.
For all the beginners, or if this is your first time attempting data analysis, it can be a little hard to wrap your head around at first. I'll admit it took me a long time to gain even a basic understanding of what I am doing, and even today I sometimes feel like I have no idea what I am doing. The reason, really, is that data science, though being a science, involves a lot of different fields: statistics, some form of art, critical thinking; on top of that you need some domain knowledge for whatever data you are working with, and the list goes on. But worry not: a simple structure or system is usually useful for handling such fields, which is what ThinkStats2 attempts to do, successfully at that.
Statistical analysis and EDA are the most fundamental and important parts of actually doing data science; building and training models is actually the last step. If you have been around the data science world, there's a rather famous statement: garbage in, garbage out. It simply means: feed your model garbage data, expect a garbage model/output. Sadly, this is where most beginners get stuck, because they dive straight into building models without understanding the most fundamental and crucial steps. EDA is the first step towards success, meaning you are on the right path.
Question
Before we can do any analysis, we need a question to actually inform and guide it. We will be using the automobile dataset, which includes features such as:
brand/make - BMW, Audi, Volvo etc
possible questions:
what is the most expensive car?
what is the least expensive car?
what is the most affordable car by make/brand?
what is the least affordable car by make/brand?
which is the most profitable make/brand to invest in?
given a car with certain features, can you classify its brand/make?
given a car with certain features, including make/brand, can you estimate its price?
The first two questions are probably the simplest to answer: we don't care about the make of the car, so this is basically getting the min and max of the dataset in terms of price.
The second two are a little more involved: we want to isolate a group from the rest of the data by affordability. First, we need to know what affordable means in this context (data): does it mean more people are buying from that brand, or that the brand has many models and variations of cars (we will come back to this later)? What determines affordability? After answering that question, we need to identify that variable in the data and isolate it to answer the questions.
The fifth question may seem like the previous two, but in some form it's not, and I will admit we cannot "really" answer it with the data we have, because profitability is more than how many people are buying those cars or how many models the brand has; it encompasses stocks, how much it costs to produce the car (the entire production chain) and whatnot. This is where data collection comes in.
The last two inform models; this is where we look at linear separability of the data, correlation, variance, etc.
The gist of the matter is that your question(s) inform the approach and methods you will use to negotiate the data and how you handle it. Usually you will have a single question about the data, which of course can branch into multiple questions aimed at answering The question. We will attempt to answer the above questions to some extent, because we are learning.
Simple Analysis and Data Cleaning
Before we can answer the questions we need to get familiar with the data, clean it and check for errors. Data usually comes with a codebook which describes it; since we do not have one here, we will use our intuition (never do this, a codebook is very important).
We will be using bokke.js, as per the previous article I published here: if you have not read it yet, please do, or you may not understand what is going on here; I cannot cover bokke in this article. There are also new features in bokke that were not covered in the previous article but will be covered here.
npm i bokke.js
or, if you were following the last article, update bokke:
npm i bokke.js@1.0.1
setup:
index.html:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <style>
        table {
            border-collapse: collapse;
            margin-top: 10px;
            width: 100%;
        }

        table th {
            text-align: center;
            background-color: #3e6070;
            color: #fff;
            padding: 4px 30px 4px 2px;
        }

        table td {
            border: 1px solid #e3e3e3;
            padding: 4px 8px;
        }

        table tr:nth-child(odd) td {
            background-color: #e7edf0;
        }
    </style>
</head>
<body>
    <input type="file" id="csv">

    <div id="table"></div>
    <div id="har"></div>

    <script src="main.js"></script>
</body>
</html>
main.js:
import dataFrame, { Hist, Plot } from "bokke.js";

let file = document.getElementById("csv");

file.onchange = (e) => {
    dataFrame(file, (frame) => {
        document.getElementById("table").innerHTML = frame.table;
        window.frame_ = frame; // to use in the dev console
        window.Hist = Hist;
        window.Plot = Plot;
    });
};
devConsole:
const df = window.frame_ // dataframe
const tbl = document.getElementById("table") // div to append table
let df2 // to assign manipulated/changed dataframe
function callback(frame){
tbl.innerHTML = frame.table;
df2 = frame
}
bokke has a dataframe module which produces an HTML table and has many methods for data cleaning and exploration. You can find an overview and tutorial on bokke.js here; it is a prerequisite for this one.
drop row 205 and the duplicate index column:
df.drop(true, callback, 205, " ") // true means inplace
Because we do not have a codebook, we will clean the data in parallel with answering a question, meaning a question will determine which features (columns) we clean, to avoid cleaning "unnecessary" features (one of the few reasons the codebook is important). In essence we are doing multiple things: developing a codebook whilst analyzing the data.
what is the most expensive car? && what is the least expensive car?
This immediately tells us to look at the price column. Looking at unique values, especially in our case (no codebook), is super useful: we expect the price column to contain numbers, and we need to validate that and check whether there are missing values or strings; if there are, we handle them.
df.unique("price") // returns an array of unique values
// we have a '?', not a number
First discovery: the automobile dataset uses '?' to denote missing values.
Handling missing values
There are multiple ways to handle missing values, ranging from easy to advanced, the latter being beyond the scope of this article. For our case we can take the easy way and use descriptive statistics (the mean). I guess it's not that easy; dropping the row is way easier, but we won't.
The mean is sometimes criticized as a replacement (imputation) technique, but in our case it is highly unlikely to produce noise in the price column: if you think about it, the mean for each make will be between that make's min and max, and it is way easier to compute. Anyway, the focus is not the mean, but knowing that in your EDA step you need to handle missing values; you can easily google more advanced methods for doing so.
The cars are classified by their makes/brands. Let's assume they are representative of the population of cars in each brand (which I doubt is the case), but for our purposes it will work. The normal intuition would be to take the mean of the entire price column and fill the missing value(s) with it, but that is inaccurate: the mean of the entire column represents the distribution of the entire column (all makes (bmw, audi etc.) combined); we need to fill each missing value with a mean from its own make, as the sketch below illustrates.
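To make the idea concrete, here is a minimal plain-JavaScript sketch of per-make mean imputation (an illustration only, independent of bokke.js; the rows array and its shape are hypothetical):

function imputeByMakeMean(rows) {
    const sums = {};   // running sum of known prices per make
    const counts = {}; // count of known prices per make
    for (const { make, price } of rows) {
        if (price != null) {
            sums[make] = (sums[make] || 0) + price;
            counts[make] = (counts[make] || 0) + 1;
        }
    }
    // fill each missing price with the mean of the known prices for that make
    return rows.map((row) =>
        row.price == null
            ? { ...row, price: Math.round(sums[row.make] / counts[row.make]) }
            : row
    );
}

const rows = [
    { make: "isuzu", price: 6785 },
    { make: "isuzu", price: null }, // missing: gets the isuzu mean, not the column mean
    { make: "audi", price: 17450 },
];
console.log(imputeByMakeMean(rows));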
- we need to find out which make has the missing data
// will group all rows that return true
function predicate(val){
if(val === "?")
return true
else
return false
}
df.group_by("price", predicate, callback) // return all rows with '?' in the price column
which returns three makes:
isuzu
audi
porsche
- group by each make and compute basic stats
But first we need to replace "?" with 0 in the price column, which is safe since we know that no car costs 0 dollars. We are doing this to avoid getting "NaN" when calculating a mean; this is a minor limitation on bokke's side and will be fixed soon.
df.replace("?","NaN", callback, true) // replace all "?" with NaN df.map({'NaN': "0"}, "price", callback, true) // replacing all NaN's in the price column with "0"
we can check if it worked by:
function predicate(val){
    if(val === '0')
        return true
    else
        return false
}

df.group_by("price", predicate, callback)
You can check that there are no strings left in the price column by computing basic stats:
df.basicStat()
// if the last column has no NaN's we are good to go
I know we replaced NaN with the string "0", but this actually is a number; it is coerced during calculation. Always use strings with map to avoid problems.
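A quick illustration of the coercion at play (plain JavaScript, just to show why the string "0" behaves as a number during calculation):

// arithmetic operators like * and - coerce numeric strings to numbers
console.log("0" * 1); // 0
console.log("0" - 5); // -5
// but + concatenates strings instead:
console.log("0" + 10); // "010"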
2.1 Isuzu
function predicate(val){
if(val === 'isuzu')
return true
else
return false
}
df.group_by("make", predicate, callback)
df2 will now hold the cars whose make is Isuzu.
df2.basicStat()
// mean 4458.25
// actual mean 8916.5 (will become clear what this means later ignore for now)
// row 44 and 45
const q = df.query() // returns a proxy object to emulate indexing: q[1] returns row 1, q[[1, 2, 3]] returns rows 1, 2 and 3
let isuzu = q[[44, 1, 45]] // I ran into a problem with q[[44, 45]], so I added row 1 in between; I think query has a problem fetching some rows next to each other (under investigation)
isuzu.splice(1, 1) // remove row 1, not an Isuzu
isuzu.forEach((row)=> {
row[row.length -1] = Math.round(4458.25) // fill all missing values with mean
df.swap(row[0], row, callback) // swap the old row with a new row
})
/*
swap(rowIndex, row(array), callback)
rowindex - the index in the dataframe
row - Array with same columns as df
*/
2.2 Audi
function predicate(val){
if(val === 'audi')
return true
else
return false
}
df.group_by("make", predicate, callback)
// row 9
df2.basicStat()
// mean 15307.857142857143
let audi = q[9] // get row 9
audi[audi.length - 1] = Math.round(15307.857142857143) // change price(last col) to mean
df.swap(9, audi, callback)
2.3 Porsche
function predicate(val){
if(val === 'porsche')
return true
else
return false
}
df.group_by("make", predicate, callback)
// row 129
df2.basicStat()
// mean 25120.4
let porsche = q[129]
porsche[porsche.length - 1] = Math.round(25120.4)
df.swap(129, porsche, callback)
df.group_by("make", predicate, callback)
2.4 analysis
const Hist = window.Hist // histogram object
let car = new Hist(1, df.getCol("price").data)
/*
note on histograms - they calculate the frequency of values in an array, given a bin number
bin - an interval (values close to each other are binned together and their frequency calculated)
for example, given [1, 2, 3, 4, 5, 6]:
a bin of 1 counts each value individually
a bin of 2 bins values in intervals of two, etc.
new Hist(bin, array) returns a hist object with some functionality:
hist.smallest(10) will get the lowest 10 values
hist.largest() is the vice versa of smallest
*/
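For intuition, here is a rough plain-JavaScript sketch of what a fixed-width binning step does (an illustration of the idea only, not bokke's internal implementation):

function binFrequencies(values, binWidth) {
    const freq = {};
    for (const v of values) {
        // values in [k * binWidth, (k + 1) * binWidth) share a bin
        const bin = Math.floor(v / binWidth) * binWidth;
        freq[bin] = (freq[bin] || 0) + 1;
    }
    return freq;
}

console.log(binFrequencies([1, 2, 3, 4, 5, 6], 2));
// { '0': 1, '2': 2, '4': 2, '6': 1 }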
smallest and largest
car.smallest(1)
['4458']
car.largest(1)
['45400']
// getting the least and most expensive car
function predicate(val){
if(val === 4458 || val === 45400)
return true
else
return false
}
df.group_by("price", predicate, callback)
Oooh, shocker: I honestly thought Porsche was gonna take it; I actually did not see that coming (I am exploring this data for the first time too, pardon my excitement). Data can shape perceptions :) and prove us wrong, if you allow it to talk. Remember, we are pretending this is representative of all cars.
Look at fuel type: we can at least tell, though it is not conclusive, that it is not a good classifier, since the most and least expensive cars both use gas, while horsepower says something different. You see, a simple question is already building up to answer more complex ones and giving us a glimpse into the data (this is impressive considering we do not have a codebook).
But wait, something is wrong: the mean for Isuzu is below the minimum Isuzu price, and that is impossible.
What is a mean? The sum of all elements divided by the length.
All our means are wrong: we counted even the cars with missing prices as part of the length, while they contribute nothing to the sum (they are missing, remember). Always question your results; don't try to force them to be what you want them to be, let the data talk.
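To see the damage concretely, a worked example (the group size is deduced from the numbers above: the correct Isuzu mean is exactly double the wrong one, implying two of the four Isuzu rows were filled with 0):

const sumOfRealPrices = 8916.5 * 2; // = 17833, the two real Isuzu prices
console.log(sumOfRealPrices / 4);   // 4458.25 - wrong: the two zeros inflate the length
console.log(sumOfRealPrices / 2);   // 8916.5  - right: only real prices counted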
recomputing means
isuzu:
function predicate(val){
if(val === 'isuzu')
return true
else
return false
}
df.group_by("make", predicate, callback) // getting all suzu's assigning them to df2
function predicate2(val){
if(val !== 4458) // removing the previously filled row, 4458 is the wrong isuzu mean
return true
else
return false
}
df2.group_by("price", predicate2, callback) // removing the missing data
df2.basicStat() // actual correct mean : 8916.5
let isuzu = q[[44, 1,45]]
isuzu.splice(1, 1)
isuzu.forEach((row)=> {
row[row.length -1] = Math.round(8916.5)
df.swap(row[0], row, callback)
})
Audi:
function predicate(val){
if(val === 'audi')
return true
else
return false
}
df.group_by("make", predicate, callback)
function predicate2(val){
if(val !== 15308)
return true
else
return false
}
df2.group_by("price", predicate2, callback)
df2.basicStat() // correct mean: 17859.166666666668
let audi = q[9]
audi[audi.length - 1] = Math.round(17859.166666666668)
df.swap(9, audi, callback)
Porsche:
function predicate(val){
if(val === 'porsche')
return true
else
return false
}
df.group_by("make", predicate, callback)
function predicate2(val){
if(val !== 25120)
return true
else
return false
}
df2.group_by("price", predicate2, callback) // removing the previously filled row
df2.basicStat() // correct mean: 31400.5
let porsche = q[129]
porsche[porsche.length - 1] = Math.round(31400.5)
df.swap(129, porsche, callback)
correct analysis
const Hist = window.Hist
let car = new Hist(1, df.getCol("price").data)
car.smallest(1) // ['5118'] // least expensive car is a subaru
car.largest(1) // ['45400']
looking beyond doesn't hurt
car.smallest(3) // ['5118', '5151', '5195']
car.largest(3) // ['40960', '41315', '45400']
function predicate(val){
if(val === 5118 || val===5151 || val === 5195 || val===40960 || val === 41315 || val === 45400)
return true
else
return false
}
df.group_by("price", predicate, callback)
df2.writeCSV("leastandMostexpensive") // save the answer could becoming handy
findings
most expensive:
mercedes-benz
bmw
least expensive:
subaru
chev
mazda
always save progress towards clean data
df.drop(true, callback,"")
df2.writeCSV("clean-automobile")
We can probably use this data later to analyze for linear separability.
Looking at this data, we could already classify between a least and most expensive car using horsepower alone (but letting the data speak is probably the best way).
affordability and profitability
probably the most difficult questions to answer given the data we have
For affordability, we need to know what that means in the first place, i.e. what people regard as affordable, and we do not have such data. Secondly, the automobile dataset does not look like a record of how frequently people buy cars, but rather a comparison of different makes across different models. If we were to calculate the mean of the entire price column, we would be calculating the average price of a car, which tells us nothing unless we have data depicting affordability.
For profitability, too, we need to know more than the price of cars: the number of people buying those cars, the production cost for each car in each make, etc. We can try, and maybe say that given this data the frequency of each make depicts profitability, meaning that if a make has more cars, most people tend to buy from that make, hence the need for variance in its cars; it could also be totally the opposite. Let's do a simple histogram and see where it takes us.
Hist
let h = new Hist(500, df.getCol('price').data) // bin of 500
Plot.Hist(["blue"], h)
// 6500 - 18500 most frequent makes/cars
// mode between 6500 - 8000
Based on our definition (assuming it is correct), the most profitable makes are those with cars priced between 6500 and 18500, and to zoom in specifically, the mode is between 6500 and 8000. You can already see how unsatisfactory this answer is: what if the high-end cars, being pricey, are more profitable even if fewer people buy them? Let's take it further and compare the two groups.
function predicate(val){
if(val >= 20500 && val <= 45000)
return true
else
return false
}
df.group_by("price", predicate, callback)
df2.basicStat()
// mean: 30025.16
function predicate2(val){
if(val >= 6500 && val <= 18500)
return true
else
return false
}
df.group_by("price", predicate2, callback)
df2.basicStat()
// mean: 11060
Although this might look like something, it does not tell us anything: the mean 30025 does not tell us how many people actually buy these cars. It might be that the number of people who buy from the other group cancels it out, because the lower the price, the more people buy, and vice versa; the profit margins could be nearly proportional, if you think about it.
A simpler thing we could have done is take the cars below and above the mean of the entire price column and compare the two groups' means, or calculate the difference (a sketch follows below). The steps above omitted data, which you should never do; it was a quick example. Always use the entire dataset if possible, though I don't think it made much of a difference here.
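Here is a quick sketch of that simpler comparison on the price column (df.getCol is shown earlier; the Number conversion is defensive, since Hist returned string values above; the rest is ordinary array code, not bokke's API):

const prices = df.getCol("price").data.map(Number);
const mean = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;

const overallMean = mean(prices); // average price across all makes
const below = prices.filter((p) => p < overallMean); // no rows omitted this time
const above = prices.filter((p) => p >= overallMean);

console.log(mean(below), mean(above), mean(above) - mean(below));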
So how do we solve this? By data collection and research: we need a definite answer on what makes a car affordable according to buyers in a given country, and what makes a car profitable according to sellers (this is known as data collection), and secondly we need a codebook. Without a codebook we are just guessing.
The last two
Classification and regression deserve a separate article, where we will look at distributions, probability mass functions, correlation, variance, etc. We answered the first questions because they were easy and demonstrated some form of data cleaning and EDA.
The next article will tackle one of the two, where stuff gets interesting.
Summary
EDA and statistical analysis are fundamental, and a prerequisite to building robust models and answering questions.
A codebook is very important.
Ask simple questions besides the main question to help guide you in exploring the data and cleaning it in the process (especially if you have no codebook, which is bad). Exploring the main question after answering a few basic ones is somewhat helpful, as we then have some clue of what the data is (my opinion).
Conclusion
We did not really do much in this article; upcoming articles will build upon this one.