Exploratory Data Analysis (EDA) is an approach to analyze and summarize data sets in order to understand their main characteristics and detect patterns, relationships, and anomalies. EDA is typically performed at the beginning of a data analysis project and is used to gain insight into the data and identify potential problems with the data quality or the analytical approach.
EDA can involve a range of statistical and visualization techniques, such as summary statistics, histograms, box plots, scatter plots, and correlation matrices. The purpose of EDA is to uncover the main features of the data, such as its distribution, range, central tendency, variability, and any outliers or missing values. This information can be used to guide the development of more sophisticated statistical models or machine learning algorithms.
Some common tasks involved in EDA include examining the data for missing values, exploring relationships between variables, identifying outliers or anomalies, and detecting patterns or trends in the data. EDA can also involve data transformations, such as normalization or scaling, to help make the data more amenable to analysis. Ultimately, the goal of EDA is to gain a better understanding of the data and its underlying structure in order to make more informed decisions about how to analyze it.
In this blog post, we will cover some basic concepts and techniques to help you get started with EDA. We will perform EDA on the IT Salary Survey dataset, which is available at https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv. The IT Salary Survey dataset contains information about the salaries of IT professionals in Europe, as well as their job titles, years of experience, and other demographic information.
Importing the Data
The first step in any data analysis project is to import the data into your programming environment. The IT Salary Survey dataset is available in CSV format, which can be imported using the pandas library in Python:
import pandas as pd
import numpy as np
import matplotlib as plt
import seaborn as sns
from ydata_profiling import ProfileReport
import ipywidgets as widgets
pd.options.display.float_format = "{:,.5f}".format
url = 'https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv'
df = pd.read_csv(url)
The read_csv function reads the CSV file from the specified URL and creates a pandas DataFrame.
After importing the data, we can start exploring it using various techniques.
Understanding the Dataset
Once you have imported the dataset, it's important to get a basic understanding of its structure and properties. Here are some questions to ask about the dataset:
- How many rows and columns does the dataset have?
- What are the names of the columns?
- What is the data type of each column?
- Are there any missing values?
We can use basic functions like shape, columns, info, and describe to answer the questions we have posed above:
# print the shape of the data (number of rows and columns)
print(df.shape)
# print the names of the columns
print(df.columns)
# print information about the data, including data types and number of non-null values
print(df.info())
# print summary statistics for the numeric columns
print(df.describe())
# print the sum of all null values
print(df.isnull().sum())
The output is as shown:
RangeIndex: 1253 entries, 0 to 1252
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 1253 non-null object
1 Age 1226 non-null float64
2 Gender 1243 non-null object
3 City 1253 non-null object
4 Position 1247 non-null object
5 Total years of experience 1237 non-null object
6 Years of experience in Germany 1221 non-null object
7 Seniority level 1241 non-null object
8 Your main technology / programming language 1126 non-null object
9 Other technologies/programming languages you use often 1096 non-null object
10 Yearly brutto salary (without bonus and stocks) in EUR 1253 non-null float64
11 Yearly bonus + stocks in EUR 829 non-null object
12 Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country 885 non-null float64
13 Annual bonus+stocks one year ago. Only answer if staying in same country 614 non-null object
14 Number of vacation days 1185 non-null object
15 Employment status 1236 non-null object
16 Сontract duration 1224 non-null object
17 Main language at work 1237 non-null object
18 Company size 1235 non-null object
19 Company type 1228 non-null object
...
21 Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week 373 non-null float64
22 Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR 462 non-null object
dtypes: float64(4), object(19)
memory usage: 225.3+ KB
From the output of these functions, we can see that the dataset contains 1,253 rows and 23 columns. The column names can be found by running df.columns as seen below:
print(df.columns)
Whose output is:
Index(['Timestamp', 'Age', 'Gender', 'City', 'Position ',
'Total years of experience', 'Years of experience in Germany',
'Seniority level', 'Your main technology / programming language',
'Other technologies/programming languages you use often',
'Yearly brutto salary (without bonus and stocks) in EUR',
'Yearly bonus + stocks in EUR',
'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
'Annual bonus+stocks one year ago. Only answer if staying in same country',
'Number of vacation days', 'Employment status', 'Сontract duration',
'Main language at work', 'Company size', 'Company type',
'Have you lost your job due to the coronavirus outbreak?',
'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR'],
dtype='object')
The dataset contains both categorical and numerical data, and there are some missing values within some columns.
An easy way of understanding the data would be to run a ProfileReport by the ydata_profiling library package that was earlier imported.
profile = ProfileReport(df, title ="IT Survey Profile Report", html={'style':{'full_width':True}})
profile.to_notebook_iframe()
## the iframe() exports the Jupyter Notebook to HTML
Cleaning the Data
Before we can begin analyzing the data, we need to clean it. This involves dealing with missing values, removing duplicates, and correcting any errors in the data.
Dealing with Missing Values
Missing values are a common problem in real-world datasets. Before we can analyze the dataset, we need to deal with these missing values. There are several ways to do this, including:
- Dropping rows or columns with missing values
- Imputing missing values with a statistical measure (e.g., mean or median)
- Imputing missing values with a machine learning algorithm
Let's begin by dealing with the missing values.
We first seek to understand the percentage of missing values across the columns:
df_length = len(df)
missing_percentages = df.isna().sum().sort_values(ascending= False)/df_length
missing_percentages
With the output being:
Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week 0.70231
Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR 0.63128
Annual bonus+stocks one year ago. Only answer if staying in same country 0.50998
Yearly bonus + stocks in EUR 0.33839
Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country 0.29370
Other technologies/programming languages you use often 0.12530
Your main technology / programming language 0.10136
Number of vacation days 0.05427
Years of experience in Germany 0.02554
Сontract duration 0.02314
Age 0.02155
Company type 0.01995
Have you lost your job due to the coronavirus outbreak? 0.01596
Company size 0.01437
Employment status 0.01357
Total years of experience 0.01277
Main language at work 0.01277
Seniority level 0.00958
Gender 0.00798
Position 0.00479
City 0.00000
Yearly brutto salary (without bonus and stocks) in EUR 0.00000
Timestamp 0.00000
dtype: float64
We can then use the fillna() method in pandas to replace missing values with a specified value.
For example, if we want to replace missing values in the 'Gender' column with 'Unknown', we can do the following:
df['Gender'] = df['Gender'].fillna('Unknown')
df['Salary one year ago'] = df['Salary one year ago'].fillna(0)
We seek to also fill missing values within the Age column by using the median age of the columns
##filled missing values in Age with the median value while making sure to skip Null values
median_age = df['Age'].median(skipna= True)
df['Age'] = df['Age'].fillna(median_age)
Correcting Errors
Finally, we need to correct any errors in the data. In this dataset, we can see that some of the values in the 'Experience' column are negative, which is clearly an error. We can correct this by taking the absolute value of the column:
df['Total years of experience'] = df['Total years of experience'].abs()
Dropping columns
We can use the drop function in pandas to remove the Timestamp column:
data = df.drop(columns=['Timestamp'])
Exploring the Dataset
Now that we have a clean dataset, we can start exploring it. Let's take a look at some basic statistics for the dataset:
# compute summary statistics for the dataset
print(df.describe())
Age Yearly_brutto_salary Annual_brutto_salary Shorter_working week
count 1,226.00000 1,253.00000 885.00000 373.00000
mean 32.50979 80,279,042.57872 632,245.87232 12.96783
std 5.66380 2,825,061,107.59049 16,805,081.75171 15.27517
min 20.00000 10,001.00000 11,000.00000 0.00000
25% 29.00000 58,800.00000 55,000.00000 0.00000
50% 32.00000 70,000.00000 65,000.00000 0.00000
75% 35.00000 80,000.00000 75,000.00000 30.00000
max 69.00000 99,999,999,999.00000 500,000,000.00000 40.00000
The output of this command shows that the Annual brutto salary in the dataset is 632,245.87232 Euros, with a median value of 75,000.00000 Euros.
Visualizing the Data
Visualizing the data is a great way to get an initial sense of the patterns and relationships in the dataset. We can use libraries like matplotlib or seaborn in Python to create visualizations. Here are some examples of how to create visualizations for the IT Salary Survey dataset:
# create a histogram of the salaries
sns.histplot(df['Current Salary'], kde=False, bins=20)
plt.title('Histogram of Salaries')
plt.xlabel('Salary (EUR)')
plt.ylabel('Count')
plt.show()
# create a box plot of the salaries by gender
sns.boxplot(x='Gender', y='Current Salary', data=data)
plt.title('Box Plot of Salaries by Gender')
plt.xlabel('Gender')
plt.ylabel('Salary (EUR)')
plt.show()
# create a scatter plot of the salaries by years of experience
sns.scatterplot
I hope this tutorial was of much help to you as a beginner. Feel free to reach out for any contributions or collaborations.
Happy coding,
Junn Hope
Top comments (0)