Joy Ada Uche

The AI Alpha Geek: It starts with EDA! - Part B

Before we start exploring each individual feature, let's take a look at some statistics for the dataset produced by train_df.drop('PassengerId', axis=1).describe() below:

Summary Stats

In the summary statistics above, looking at the Age feature for example:

  • the count is 714, which tells us there are 177 missing entries since there are 891 entries in total - we will need to deal with this later on when handling missing values,
  • the mean age is 29.699, which is the average of the recorded ages, i.e. a rough indication of the typical age of the passengers aboard,
  • the std (standard deviation) of 14.526 measures how spread out the ages are around the mean; if the ages were roughly normally distributed, about two-thirds of the passengers would fall in the range (29.699-14.526) to (29.699+14.526), i.e. roughly 15 to 44 years,
  • the min age is 0.42, which tells us the youngest passenger was a baby of about 5 months,
  • the 25th percentile of 20.125 years shows that 25% of the passengers are younger than 20.125 years,
  • the 50th percentile, which is the median, is 28 years and tells us that half of the passengers with a recorded age are younger than 28 - it seems most of the passengers were young,
  • the 75th percentile, which is 38, tells us that 75% of the passengers are younger than 38 years, and
  • the max age is 80 years, which is the age of the oldest passenger onboard - luckily, it seems there are no aliens onboard.
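
The readings above can be reproduced directly from the output of describe(). Here is a minimal sketch, assuming a small made-up ages Series in place of train_df['Age'] (the values are illustrative only and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical ages; None stands in for a missing entry.
ages = pd.Series([22.0, 38.0, 26.0, None, 35.0, 54.0, 2.0, None, 27.0, 14.0])

stats = ages.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# describe() counts only non-missing values, so the gap between len() and
# count is the number of missing entries (891 - 714 = 177 for Age above).
n_missing = len(ages) - int(stats['count'])

# Band of one standard deviation on either side of the mean.
low = stats['mean'] - stats['std']
high = stats['mean'] + stats['std']

# Share of the recorded ages that actually fall inside that band.
share_in_band = ages.dropna().between(low, high).mean()

print(f"missing entries: {n_missing}")
print(f"mean - std to mean + std: {low:.2f} to {high:.2f}")
print(f"share of ages inside the band: {share_in_band:.0%}")
```

Checking the share inside the band on the real column is a quick way to see whether "most passengers" really sit within one standard deviation of the mean, rather than assuming it.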

Now, it's time for some univariate analysis - this is just a descriptive analysis of one variable at a time, which helps us understand the distribution of that variable and even detect outliers. Let's start with the categorical variables -

# eda_part_b.py
import seaborn as sns

# train_df is the training DataFrame loaded in Part A of this series.
# The snippets are notebook-style: run each group in its own cell so that
# every count plot gets its own figure.

# Summary statistics (PassengerId is only an identifier, so it is dropped)
train_df.drop('PassengerId', axis=1).describe()


def label_chart(plot, points='.0f'):
    """Write the height of each bar just above the bar."""
    for p in plot.patches:
        plot.annotate(format(p.get_height(), points),
                      (p.get_x() + p.get_width() / 2., p.get_height()),
                      ha='center', va='center',
                      xytext=(0, 9),
                      textcoords='offset points')


# For each feature: counts, percentages, then a labelled count plot.
# The column is passed as x= with data= so the call works across seaborn versions.
train_df['Survived'].value_counts()
round(train_df['Survived'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='Survived', data=train_df))

train_df['Pclass'].value_counts()
round(train_df['Pclass'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='Pclass', data=train_df))

train_df['Sex'].value_counts()
round(train_df['Sex'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='Sex', data=train_df))

train_df['Embarked'].value_counts()
round(train_df['Embarked'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='Embarked', data=train_df))

train_df['SibSp'].value_counts()
round(train_df['SibSp'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='SibSp', data=train_df))

train_df['Parch'].value_counts()
round(train_df['Parch'].value_counts(normalize=True) * 100, 2)
label_chart(sns.countplot(x='Parch', data=train_df))

# Ticket and Cabin have too many unique values to plot, so only count them
train_df['Ticket'].value_counts()
train_df['Cabin'].value_counts()

From the code example above, let's take a look at the output for the target variable, Survived, below -
Output Example

  • value_counts() is used to get the counts of unique values for this column - and it seems a lot more people did not survive. Note that the dataset is not perfectly balanced (about 62% deceased versus 38% survived), but the imbalance is moderate: this is not a case where those who didn't survive vastly outnumber those who survived.
  • to get the percentage of each class (i.e. survived - 1 and deceased - 0), set the normalize parameter of value_counts() to True and multiply the result by 100.
  • to have a better view of the count for each class, we use a count plot via Seaborn. label_chart() is just a helper function that writes each bar's count above it.
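
Since the same counts-then-percentages pattern is repeated for every column, it can be wrapped in a small helper. The sketch below is only one way to do it: summarize_categorical is a hypothetical name that is not part of eda_part_b.py, and the survived Series is made-up data standing in for train_df['Survived'].

```python
import pandas as pd


def summarize_categorical(series):
    """Return the count and percentage of each unique value in one table."""
    counts = series.value_counts()
    # normalize=True returns fractions that sum to 1, so scale to percent.
    percents = (series.value_counts(normalize=True) * 100).round(2)
    return pd.DataFrame({'count': counts, 'percent': percents})


# Hypothetical target column: 0 = deceased, 1 = survived.
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name='Survived')

summary = summarize_categorical(survived)
print(summary)
```

With the real data, the same call could be made in a loop over ['Survived', 'Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch'].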

Let's see some insights gathered from the code output from eda_part_b.py above -

  • For the Pclass feature, it seems a lot more of the people on board were in class 3, and from Part A of this series we saw that these are people in the lower socio-economic class, which seems to mean most onboard got the cheap ticket,
  • It seems more males boarded when you look at the Sex feature, as 64.76% of passengers are male,
  • Most passengers boarded from the Southampton port, and it seems most passengers came alone, since most have 0 for SibSp (no siblings or spouses aboard) and 0 for Parch (no parents or children aboard; a child who travelled with just a nanny also has a Parch of 0).
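
The "came alone" reading combines two columns, so it is worth checking them together rather than one at a time. A minimal sketch, assuming a made-up DataFrame in place of train_df and a hypothetical IsAlone column that does not exist in the dataset:

```python
import pandas as pd

# Hypothetical rows: SibSp = siblings/spouses aboard, Parch = parents/children aboard.
df = pd.DataFrame({
    'SibSp': [1, 0, 0, 1, 0, 0, 3, 0],
    'Parch': [0, 0, 0, 0, 0, 2, 1, 0],
})

# Treat a passenger as travelling alone only when BOTH counts are zero.
df['IsAlone'] = (df['SibSp'] == 0) & (df['Parch'] == 0)

# The mean of a boolean column is the fraction of True values.
share_alone = df['IsAlone'].mean() * 100
print(f"{share_alone:.2f}% travelled without family")
```

Note that this only captures family relations recorded in SibSp and Parch; friends, nannies and other companions are not visible in these two columns.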

So, all of this gives us more insights to explore further - stay tuned for the next parts of this series, where we go ahead to explore individual numerical variables for patterns. Wish you an awesome October!
