DEV Community

Posted on • Updated on

The AI Alpha Geek: It starts with EDA! - Part B

Before we start exploring each individual feature, let's take a look at some statistics for the dataset produced by `train_df.drop('PassengerId', axis=1).describe()` below:

In the summary statistics above, looking at the Age feature for example:

• the count is 714, which tells us there are 177 missing entries since the total entries are 891 - we would need to deal with this later on when handling missing values,
• the mean age is 29.699, which is the average age of passengers who were aboard i.e the value 29.699 was the typical or normal age of the passengers aboard,
• the std (standard deviation) of 14.526 tells us that most of the passengers are in the age range (29.699-14.526) to (29.699+14.526),
• the min age is 0.42, which tells us the least age is for a baby on board,
• the 25th percentile is 20.125 years shows that 25% of passengers is less than 20.125 years,
• the 50th percentile, which is the median is 28 years, tells us that half of the passengers onboard are below 28 years old - seems most of the passengers were young,
• the 75th percentile, which is 38, tells us that 75% of the passengers are less than 38 years, and
• the max age is 80 years, which is the age of the eldest passenger onboard - luckily, it seems there are no aliens onboard.

Now, it's time for some univariate analysis - this is just descriptive analysis of one variable at a time which it helps us understand the data distribution for that variable and even detect outliers. Let's start with the categorical variables -

In the code example above, taking a look at the output for the target variable, Survived, below -

• value_counts() is used to get the counts of unique values for this column - and it seems a lot more people did not survive. Note that it is not a perfectly balanced dataset but this is not a case where the number of those who didn't survive is far more significant than those who survived.
• to get the percentages of each class (i.e survived - 1 and deceased - 0), set the normalize parameter of value_counts() to True.
• to have a better view of the count for each class, we use count plot via Seaborn. The label_chart() is just a helper function to label the chart.

Let's see some insights gathered from the code output from eda_part_b.py above -

• For the Pclass feature, it seems a lot more people that were on board are in class 3 and from Part A of this series, we saw that these are people in the lower socio-economic class, which seem to mean most onboard got the cheap ticket,
• Seems more males boarded when you look at the Sex feature, as 64.76% of passengers are males,
• Most passengers boarded from the Southampton port, and it seems most passengers came alone since most have 0 siblings and/or travelled with just a nanny.

So, all these give us more insights to explore further - Stay tuned for the next parts on this topic, on this same series, where we go-ahead to explore individual numerical variables for patterns. Wish you an awesome October!