Kimani Kanyutu

EXPLORATORY DATA ANALYSIS ULTIMATE GUIDE

INTRODUCTION

When buying a high-value commodity, let's say a car, do you just pick one at random or do you take your time to assess it? You check the engine condition, assess the tyres, the interior and the mileage, listen to the engine sound, check the service and maintenance history if there is one, and confirm there is proper documentation.
The same applies to data!

What is EDA?

Exploratory Data Analysis (EDA) is the initial investigation of a dataset with the aim of establishing patterns, detecting obvious errors, spotting anomalies and outliers, and testing hypotheses. This is often done with statistical or graphical tools. EDA builds a better understanding of the data, the variables present and the relationships that exist within the dataset.
EDA is often referred to as descriptive analytics. The insights obtained from this step determine the next undertaking in the data analysis process.
EDA can be categorized along two dimensions:

1. Graphical and non-graphical analysis
Graphical analysis uses visual tools such as boxplots, bar graphs and scatter plots to assess the data. Python libraries such as Matplotlib and Seaborn can be used to perform this analysis.
Non-graphical analysis uses summary and statistical tools to analyze the data. Python offers a variety of functions for this. On the other hand, tools such as Power BI can perform both graphical and non-graphical analysis.

2. Univariate and multivariate analysis
Univariate analysis examines a single variable to establish patterns in it, and can employ either graphical or non-graphical techniques.
Multivariate analysis, on the other hand, assesses two or more variables and establishes patterns and relationships among them. This too can take a graphical or a statistical form; see the sketch after this list.
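A minimal sketch of these categories on a toy dataframe (the values below are synthetic, purely for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data, purely illustrative
toy = pd.DataFrame({
    "Age":    [23, 25, 27, 29, 35, 41, 52],
    "Salary": [40, 45, 50, 58, 65, 80, 95],
})

# Univariate, non-graphical: summary statistics of a single variable
print(toy["Age"].describe())

# Univariate, graphical: the distribution of a single variable
sns.histplot(toy["Age"])
plt.show()

# Multivariate, graphical: the relationship between two variables
sns.scatterplot(data=toy, x="Age", y="Salary")
plt.show()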

Why do we need to perform EDA?

Exploratory data analysis is crucial because it helps us to:

  1. Make sense of the data
  2. Remove outliers and anomalies in the data
  3. Prepare the data for the next step of the analysis
  4. Achieve better performance and more accurate results

Remember:

Garbage In, Garbage Out (GIGO)

The CRISP-DM process

The CRoss-Industry Standard Process for Data Mining (CRISP-DM) provides a standard methodology that is widely accepted for data mining and data science projects. It dates back to the late nineties and is still one of the most widely adopted methodologies in the analytics world, as it offers an approach that is technology- and problem-neutral.

Fig 1: The CRISP-DM process

The EDA concept corresponds to the second step of the CRISP-DM framework: data understanding.

Now let's do some hands-on exploratory data analysis using Python. The data used in this case is from Kaggle: EDA: IT Salary Survey 2020

General information

Importing libraries
pyforest is a Python library that makes all the basic, frequently used Python data libraries available with one line of code, instead of having to import libraries such as pandas and numpy individually.

import pyforest
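pyforest is a third-party package (pip install pyforest) and it imports lazily, i.e. nothing is actually loaded until a library is first used. If you prefer explicit imports, the equivalent for this walkthrough is:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns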

Loading Dataset
The data is available in CSV format. We use the read_csv function from pandas to load it into a dataframe, then look at the first five rows.

df = pd.read_csv('IT Salary Survey EU  2020.csv')
df.head(5)

Output:
First 5 rows

Shape of the Dataset
The shape attribute gives the dimensions of the dataset as (rows, columns): the first value in the tuple is the number of rows, while the second is the number of columns.

df.shape

Output:
Shape of the dataset

Summary information on the data set and datatypes
The info() function gives general information about the columns of the dataset, such as the number of non-null values and the data type of each column.

df.info()

Output:
Summary info

Column names
The columns attribute gives us the names of the columns.

df.columns

Output:
Column names

Some of the column names are very long. We rename the columns and then replace the white spaces with underscores (_).

df.columns = ["Year", "Age", "Gender","City","Position","Years of experience","Germany experience","Seniority level","Tech program language",
                       "Other Language","Yearly salary","Yearly bonus and stocks", "Salary one year ago","Bonus and stocks last year","Vacation days",
                       "Employment_status","Сontract_duration","Language","Company size","Company type","Job loss COVID","Kurzarbeit","Monetary Support"]

df.columns = df.columns.str.replace(' ', '_')

Unique Value counts for columns
Here, we get the number of unique values in each column.

df.nunique(axis=0)

Output:
Number of unique values

Summary statistics
The describe() function gives summary statistics of the dataset such as count, mean, standard deviation, min, max and the 25%, 50% and 75% percentiles.
Note: by default this covers only the columns with numerical values.

df.describe()

Output:
Summary Statistics
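To cover the non-numerical columns as well, describe() accepts an include argument, which adds count, unique, top and freq for the categorical ones:

df.describe(include='all')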

Missing values
Missing values can greatly affect the performance of a model. There are different ways of handling missing data. One of them is dropping the affected rows, which is not advisable when a large number of values are null, since it throws away data. Alternatively, the fillna() approach can replace the missing values with the mean, median or any other value of that column.
In this case, we first visualize the missing values in the dataset.

df.isna().sum().sort_values().plot(kind='barh')
plt.title('Distribution of null values')

Missing values visualization
We then drop rows with missing values in the age, gender, position, years of experience, seniority level, salary one year ago and language columns, and remove any duplicate rows.

df = df.dropna(subset=['Age','Gender','Position','Years_of_experience','Seniority_level','Salary_one_year_ago','Language'])

df = df.drop_duplicates()

Output:
Dropping Missing Values
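If dropping rows would throw away too much data, fillna() can impute values instead. A sketch using two of the renamed columns (the choice of imputation strategy is up to you):

# Replace missing ages with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# Replace missing genders with the most frequent value (mode)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])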

Changing Data Types
To allow for easy analysis and computation, it is important to have the variables in the correct data types. In this case, we cast the Age column to integer and extract the year from the Year column using pandas' datetime functionality.

df['Age'] = df['Age'].astype(int)

df['Year'] = pd.to_datetime(df['Year']).dt.year

Output:

Changing Datatypes
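Note that astype(int) raises an error if any NaN values remain in the column. A more defensive sketch uses pd.to_numeric with coercion and pandas' nullable integer type:

# Coerce unparseable values to NaN instead of raising, then keep NaNs with the nullable Int64 dtype
df['Age'] = pd.to_numeric(df['Age'], errors='coerce').astype('Int64')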

Graphical Visualization
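Note: from here on the snippets use a dataframe named IT_Filtered. Assuming it is simply the cleaned dataframe from the steps above, you can create it with:

# The cleaned dataframe used for all the visualizations below
IT_Filtered = df.copy()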

Univariate analysis

Data distribution

Age

sns.set(font_scale=1.3)
sns.histplot(IT_Filtered.Age, bins=9, kde=True)
plt.title('Age Distribution')
plt.show()

print('Age Descriptive Statistics')
IT_Filtered['Age'].describe().round()

Age visualization

From this it is evident that:

  • The majority of the people surveyed are between 20 and 40 years old

Workplace City Location

# Top 10 cities by respondent count (rename_axis/reset_index keeps the column names stable across pandas versions)
city10 = IT_Filtered['City'].value_counts().iloc[:10].rename_axis('City').reset_index(name='Count')

sns.barplot(x=city10["Count"],y=city10["City"])
plt.title('Workplace City Location')
plt.show()

Location visualization
We can conclude that:

  • Berlin is the German city with the most tech opportunities, at least among the survey respondents.

Gender

plt.figure(figsize=(15,8))

IT_Filtered['Gender'].value_counts().plot.pie(
    autopct="%1.2f%%", fontsize=12, startangle=90, cmap='crest',
    explode=[0.05] * 3, pctdistance=1.1, labeldistance=1.3,
    textprops={'fontsize': 15})
plt.ylabel("")
plt.show()

Gender

  • Tech jobs in Germany are male-dominated.

Programming Language

# Top 15 programming languages by respondent count
program = IT_Filtered["Tech_program_language"].value_counts().iloc[:15].rename_axis("Program").reset_index(name="Count")
sns.barplot(data=program, y="Program", x="Count")
plt.title('Top 15 Program Languages')
plt.show()

Programming Language

  • Python and Java are among the most popular programming languages.

Multivariate analysis

Correlation

# numeric_only=True restricts the matrix to numeric columns (required in pandas >= 2.0)
IT_Filtered.corr(numeric_only=True)

Correlation Matrix

plt.figure(figsize=(15, 12))
sns.heatmap(IT_Filtered.corr(numeric_only=True), annot=True, cmap='coolwarm')

Correlation matrix visual

The correlation matrix shows the relationship between the different numerical columns in the dataset.
Correlation is measured on a scale from -1 to 1, where -1 indicates a perfect negative correlation, 1 a perfect positive correlation and 0 no linear relationship.
The heatmap makes the matrix easier to read by adding a color scale.
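To read the matrix for one variable of interest, say yearly salary, you can sort its column of correlations (a sketch using the renamed columns from above; the variable's correlation with itself, 1.0, appears first):

# Correlations of every numeric column with yearly salary, strongest first
IT_Filtered.corr(numeric_only=True)['Yearly_salary'].sort_values(ascending=False)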

Top 15 programming languages

# The entries in program['Program'] are already unique, so take them directly
program_top = program['Program'].tolist()

# Keep only the rows whose language is among the top 15
program_df = IT_Filtered[IT_Filtered['Tech_program_language'].isin(program_top)]
sns.boxplot(data=program_df, x="Tech_program_language", y="Yearly_salary")
plt.xlabel('Program Language')
plt.ylabel('Yearly Salary')
plt.gcf().set_size_inches(15, 8)
plt.suptitle('Salary by Top 15 Programming Languages')
plt.show()

Programming Languages

Gender and Age

plt.figure(figsize=(12,8))
sns.boxplot(x='Gender', y='Age', data=IT_Filtered, palette='Pastel2')
plt.xticks(fontsize=13)
plt.xlabel("")
plt.yticks(fontsize=13)
plt.ylabel("Age", fontsize=14)
plt.show()

Gender and Age

Job Loss due to COVID-19

sns.histplot(IT_Filtered, x="Job_loss_COVID", stat="percent", multiple="dodge", shrink=.8, hue='Seniority_level')
plt.suptitle('Job Loss due to COVID-19',y=1.01)
plt.xlabel('')
plt.show()

Job loss

Years of experience vs German experience

plt.figure(figsize=(20, 5))
sns.histplot(data=IT_Filtered[['Years_of_experience', 'Germany_experience']].melt(), x='value', hue='variable', kde=True)
plt.show()

German experience

References:

  1. Martinez-Plumed, F. et al. (2021) “CRISP-DM twenty years later: From data mining processes to data science trajectories,” IEEE Transactions on Knowledge and Data Engineering, 33(8), pp. 3048–3061. Available at: https://doi.org/10.1109/tkde.2019.2962680.
  2. Parul Pandey. IT Salary Survey for EU region (2018-2020). Kaggle. Accessed February 11, 2022. https://www.kaggle.com/datasets/parulpandey/2020-it-salary-survey-for-eu-region.
