Maiyo
The Ultimate Exploratory Data Analysis

Introduction

Exploratory Data Analysis (EDA) is the process of examining and analyzing data sets to summarize their main characteristics and gain insights into their underlying structure, patterns, and relationships. EDA is typically used as a preliminary step before performing more complex statistical analysis or building predictive models.
For this article I will explore this dataset from Kaggle and share my findings from it. The dataset contains rich information about salary patterns among IT professionals in the EU region and offers some great insights.
I will carry out my data analysis using Python in a Jupyter notebook. Now let's get down to it.

Importing Libraries and Loading Datasets

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')


The initial step in analyzing data using Python is to import the Python libraries that provide data analysis functionality. These modules include pandas, numpy, matplotlib and seaborn. Here is a breakdown of the use of each module:

  • pandas - This is a common python library used for data manipulation, analysis, and cleaning. It provides data structures and functions for efficiently handling and processing large and complex data sets.
  • numpy - It is a popular python library that is used for mathematical and scientific computing. It provides a powerful array computing feature that enables developers to perform complex mathematical operations on large arrays and matrices of numeric data.
  • matplotlib - The library is used for data visualization. It provides tools for creating a wide range of static, animated, and interactive visualizations in Python.
  • seaborn - Seaborn is a Python library that is built on top of matplotlib and is primarily used for statistical data visualization. It provides a high-level interface for creating informative and attractive statistical graphics in Python.

With these libraries imported, it is possible to carry out the exploratory data analysis, which includes loading the dataset, cleaning it, analysing it and finally visualizing the findings.
df = pd.read_csv('IT Salary Survey EU  2020.csv', sep=',')

To load the dataset, first download it from the link provided and save the file in the same directory as the notebook being used. If the file is saved elsewhere, the path to its location must be provided when loading the dataset.
Once the data is downloaded, it is good to open it and check its separator value. Separator values can range from single spaces and tab characters to commas.
To load the dataset into a data frame, we use pandas' read_csv method. The method takes the location of the dataset as a string and the separator value as arguments.
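As an aside, if the file were saved somewhere else or used a different separator, the same call would simply take an adjusted path and sep value. The path below is only a placeholder:

df = pd.read_csv('path/to/IT Salary Survey EU  2020.csv', sep=',')  # placeholder path; use sep='\t' for a tab-separated file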

Understanding the dataset

One of the most important steps in carrying out EDA is understanding the dataset you are analyzing. Since the dataset can be very large, the pandas library offers methods that simplify the work of exploring the dataset to understand it.
Number of columns and rows of dataframe
To know the size of the dataset, run df.shape. From the result, the dataset has 1253 rows and 23 columns.
To display first five rows
To display the first five rows of the dataset, run df.head(5). Note that you can specify any number of rows in this case by changing the integer value in the argument.
The image above shows the first five rows. The image itself cannot capture all 23 columns, but you can extend the number of displayed columns by running pd.set_option('display.max_columns', 200) in the import section.
To display the last rows instead, the method is simply changed to df.tail(5). This line of code behaves in the same manner as the head method.
Display column names
To list all the column names, i.e. all 23 of them, run df.columns. This lists all the column names as an array.
Display column types
To display the column types, we run df.dtypes. From the output, the columns are either of object type or float type.
Display statistical analysis
Finally, to get information and statistics on the numerical data in the dataset, use df.describe(). This method outputs statistical parameters such as count, mean, standard deviation, minimum, maximum and the 25th, 50th and 75th percentiles for the numerical data within the dataset. Note that we only have four columns with numerical data, in the form of floats, in this dataset.
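In summary, the inspection steps above boil down to a handful of one-liners (the values shown in the comments are for this particular dataset):

df.shape        # (1253, 23) - number of rows and columns
df.head(5)      # first five rows
df.tail(5)      # last five rows
df.columns      # all column names as an array
df.dtypes       # data type of each column
df.describe()   # count, mean, std, min, max and 25th/50th/75th percentiles for numeric columns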

Cleaning data

Most datasets are going to be very large and contain lots of redundant information that might not be needed in the analysis. On the other hand, new columns can be derived from the existing ones and added to the data frame to be analyzed.
Cleaning the data requires both a deep understanding of the dataset and of the objective of the analysis. Cleaning the data can involve:

  • Dropping columns that are not required for the analysis.
  • Updating the column names to more appropriate names and removing whitespace from column names.
  • Updating the column data types to suit the type of data they hold, for seamless analysis using Python code.
  • Checking for null values within the dataset and understanding their distribution.
  • Checking for duplicate records in the dataset; when duplicates are found, they should be removed.

From the initial 23 columns, some are dropped for this analysis, retaining only ten. The columns to be retained are:

  • Age
  • Gender
  • City
  • Position
  • Total years of experience
  • Seniority level
  • Main technology/programming language
  • Company size
  • Company type
  • Contract duration

Dropping columns
To clean the data, it is advisable to create a subset of the original data frame. This is useful when the dataset is very large and complex, and analyzing the entire dataset would be time-consuming or computationally challenging.
For this dataset, a subset containing only the ten selected columns has been created. The rest of the columns have been commented out simply to make clear which ones were dropped. When df.shape is run, we note that the number of rows remains the same as before, but the columns reduce to ten.
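A minimal sketch of this subsetting step is shown below. The column names are assumed to match the survey headers listed above; the real CSV headers may be worded slightly differently, so adjust them to match your file:

# Keep only the ten columns needed for the analysis (header names assumed).
keep_cols = ['Age', 'Gender', 'City', 'Position',
             'Total years of experience', 'Seniority level',
             'Main technology / programming language',
             'Company size', 'Company type', 'Contract duration']
df = df[keep_cols]
df.shape  # rows stay at 1253, columns drop to 10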

Rename columns
The next step is to scrutinize the columns, checking their data types and naming conventions. For this dataset, the data types were not altered. On the naming of columns, it is good practice to use short and concise names without whitespace between words, since whitespace will bring up errors when code is written against specific columns. The columns were renamed as shown above.
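A sketch of what the renaming might look like, using pandas' rename with a mapping from the old headers to short, whitespace-free names (the new names here are just one reasonable choice, not necessarily the ones used in the notebook):

df = df.rename(columns={
    'Age': 'age',
    'Gender': 'gender',
    'City': 'city',
    'Position': 'position',
    'Total years of experience': 'experience_years',
    'Seniority level': 'seniority',
    'Main technology / programming language': 'main_technology',
    'Company size': 'company_size',
    'Company type': 'company_type',
    'Contract duration': 'contract_duration',
})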

Null values
To check for null values, the isna method is used together with the sum method. From the results above, it is clear that all columns apart from city have null values.
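The check itself is a short chain of calls:

df.isna().sum()         # number of missing values in each column
df.isna().mean() * 100  # the same information, as a percentage of all rows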

Duplicate values
Next is to check for duplicates in the dataset. I looked for duplicates but could not find conclusive evidence of duplicated records, because the data lacks a unique identifier for each person who filled in the form, although this does not rule out the possibility that someone refilled the form with exactly the same details. To be on the safe side, I checked for duplicates in the subset data frame across all columns and found none.
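A sketch of the duplicate check across all columns, together with the removal step that would be used if any duplicates had been found:

df.duplicated().sum()        # rows that are exact copies of an earlier row (0 here)
# df = df.drop_duplicates()  # would remove them if any were found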
At this point, the dataset is ready for data analysis.

Analyzing the data

To analyze the data subset, data visualization tools come in handy. Features of the data can be analyzed individually through univariate analysis, or compared against each other through correlation.
This data frame subset has only a few numerical columns that lend themselves to simple numerical analysis using histograms, boxplots, scatter diagrams and other visualizations.
For this tutorial I did an example of numerical analysis on the age feature. Here are the findings:

The age feature was plotted on a histogram. From the histogram, the following can be gathered about the age of the employees who undertook this survey:

  • Most of the employees working in IT in Europe are around 30 years old.
  • The distribution of ages ranges from 20 to almost 70 years.

This is one example of feature analysis. I would have wanted to explore every feature, but my current skillset limits me from analyzing text-based data, which makes up the majority of this data subset. I might as well take that up as a challenge and write a dedicated article on it after learning it.
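For reference, here is a minimal sketch of how a histogram like the one above can be produced, assuming the age column was renamed to 'age' as in the earlier renaming sketch:

ax = df['age'].dropna().plot(kind='hist', bins=20,
                             title='Age of IT salary survey respondents')  # drop missing ages before plotting
ax.set_xlabel('Age')
plt.show()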

References

Here's a link to the Jupyter notebook I used for this tutorial.
