Data is becoming more valuable to different institutions with time. Market behavior and potential customers can be predicted from existing data. This has led to the phenomenon of "Big Data": large amounts of data that are analyzed computationally.
Equally, EDA is a term that you will frequently come across in data science, and it can be a little heavy for newbies, especially when they are not familiar with what the abbreviation means. EDA in data science stands for Exploratory Data Analysis. In this article we shall take a look at this concept in depth and perhaps guide you through one or more issues around it.
Definition
There are important characteristics of a given data set that one seeks when analyzing data. After the analysis, one can present the discoveries (whether new or formerly known) in summary and maybe in visual form. Therefore, exploratory data analysis can be defined as an approach to analyzing data closely in order to become more familiar with it. Simply put, it is a way to get a basic understanding of the data at hand. Regardless, the end goal is to establish patterns in the data that are relevant to the given institution or to the analyst. There are several ways in which one can achieve this and different depths to which one can take the analysis. In the following part, we shall take a look at the steps used in EDA and how to accomplish them.
Steps in EDA
- Collect data: data is gathered from different sources, for example an Excel file, an API endpoint, or a CSV file from sites like Kaggle and GitHub.
- Load data: the data is what we want to explore, and it can be loaded in different ways:
Upload already available data from the local machine: this is done using a button in the notebook environment (in my case I am using Jupyter Notebook). On the Jupyter Notebook home page, there is a button on the upper right side of the page labelled "Upload". When pressed, it opens a file browser pointing to the area in which the file to be uploaded is located. After selecting the file, click the blue "Upload" button to bring the file into the Jupyter hub.
Data can also be loaded using the command line; next to the "Upload" button is the "New" button. When clicked, it gives several options, among them "Terminal". Click the "Terminal" button to open the command line, then enter the following command to download the data into the current directory:
wget <MY-FILE-URL>
In the case that the file being downloaded is a zip file, you will need the unzip utility. Note that unzip is a system tool rather than a Python package, so it is installed through the operating system's package manager, not pip; for example, on Debian or Ubuntu systems:
sudo apt-get install unzip
You can then now proceed to unzip the file using the following command:
unzip <downloaded_file_name>
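If you prefer to stay inside Python, the standard library's zipfile module can do the same job. Here is a minimal sketch; the archive name used is only a placeholder:
import zipfile

# Extract the downloaded archive into the current directory.
# "downloaded_file_name.zip" is a placeholder; use the actual file name.
with zipfile.ZipFile("downloaded_file_name.zip") as archive:
    archive.extractall()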
- Open the notebook and import a few libraries that will help you explore different aspects of your data: NumPy, pandas, Matplotlib, and seaborn. Please note, these are not the only libraries that can help in analysis, but they are the most commonly used, and thus they are preferred.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Afterward, click the "Run" button just above the cell to run it and get feedback. If no error is given, then the cell has been successfully executed.
Next, you read the data from where it is located. Use the following command:
variable_name = pd.read_csv("path/to/the/data.csv")
When you run this command and it does not return an error message, it means the file can now be referred to using the variable name assigned to it.
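As a concrete sketch (the file path and variable name here are hypothetical), loading a CSV and checking that it arrived could look like this:
import pandas as pd

# Hypothetical path; replace it with the location of your own CSV file.
df = pd.read_csv("data/house_prices.csv")

# A quick sanity check that the file loaded: print its dimensions.
print(df.shape)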
The Actual EDA
Now, this is the beginning of an interesting part of EDA for me, where we get our hands dirty with the real thing. Analysis can look at various aspects of the data and aim at different goals, and depending on the need it can be done using different libraries in the Python language. The following are some of the functions and tools that can be used to do data analysis.
a) Read head
Using the following command, one is able to get a preview of the nature of the data being used without much sweat.
variable_name.head()
It returns the first five rows if no number is specified. To specify the number of rows to preview, just enter the integer inside the brackets. For example:
variable_name.head(10) # which returns the first 10 rows of the data
Likewise, we can read the last 5 rows using the tail function that can be written in the following manner:
variable_name.tail()
b) Number of rows and columns
If you are interested in knowing how many rows and columns (the dimensions) you are working with, you can use the following:
variable_name.shape
c) Check for null values and types of data
We use the info() method to achieve this:
variable_name.info()
It returns the different columns present, their data types, and the number of non-null entries, so it tells you if any of the columns has null values.
d) A statistical summary of the data
It gives the count, mean, standard deviation, minimum and maximum values, and the quartiles of the numeric columns. We can get this information using the describe function as shown below:
variable_name.describe()
e) Getting the unique values in a given column
To analyze the presence and the nature of unique elements in the data set, use the unique() function as shown:
variable_name.column_name.unique()
f) To see how many times a value appears in a certain column
The frequency of a certain value can give a deeper insight during the analysis of data and can be gotten through the code below using the value_counts() function. It returns each unique value with the number of times it appears, alongside the column name and the data type.
variable_name.column_name.value_counts()
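As a small self-contained sketch (the column and its values are made up), the counts can also be expressed as proportions by passing normalize=True:
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Nairobi"]})

# Counts of each value, most frequent first.
print(df["city"].value_counts())

# The same counts expressed as proportions of the total.
print(df["city"].value_counts(normalize=True))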
g) Number of array dimension or axes
To see the nature of the dataframe in terms of dimensions, use the ndim attribute:
variable_name.ndim
h) Number of elements in an object
The size attribute returns an integer giving the number of elements in the object:
variable_name.size
i) Check if the dataframe is empty
We can check whether the dataframe contains any data at all using the following code; it returns True when the dataframe has no elements:
variable_name.empty
j) To check memory usage of the dataframe
Use the following command:
variable_name.memory_usage()
k) Access a single value
When we want to access a single value at a given row and column, we can use the at accessor, passing the row label and the column name:
variable_name.at[row_label, "column_name"]
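A minimal self-contained sketch of how the accessor behaves (the labels and values here are made up):
import pandas as pd

# Tiny illustrative DataFrame with made-up values and row labels.
df = pd.DataFrame({"age": [24, 31, 19]}, index=["alice", "bob", "carol"])

# Read the single value at row "bob", column "age".
print(df.at["bob", "age"])

# The same accessor can also set a single value.
df.at["carol", "age"] = 20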
l) Get columns in the data
To achieve this, we use the columns attribute, which returns all the column labels in their order as an Index object:
variable_name.columns
m) Correlation
To see the negative, moderate, and positive correlations between the numeric columns, we can use the corr() function to view them in a table.
variable_name.corr()
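For instance, to focus on how every numeric column relates to one column of interest, you can pick a single column out of the correlation table and sort it; the "price" column below is purely hypothetical:
# Correlations of all numeric columns with a hypothetical "price" column,
# sorted from the strongest positive to the strongest negative.
# In recent pandas versions you may need corr(numeric_only=True) if the
# dataframe also contains text columns.
variable_name.corr()["price"].sort_values(ascending=False)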
Graphical Representation in Data Analysis
We can also have a visual representation of data using the functions offered by numpy, pandas, seaborn, and matplotlib. In this section, we shall have a deep insight into these libraries and the range of interesting things they are capable of achieving.
Import pyplot from matplotlib (as plt) in case you have not already imported it.
a) Bar Chart
When we want a bar chart, we have to pass three parameters to the plot() function in matplotlib. They are the x-axis, the y-axis and the type of plotting we need. In our case, it is a bar chart.
We can use the following guideline to help us do this:
variable_name.plot(x="column1", y= "column2",kind ="bar",figsize=(20,15)
In the case that the figsize is not explicitly given, the plot returns a default size.
b) Line Graph
When we want to see our data on a line graph, we use the same method as above, but in the kind field we pass "line" as the type of graph we want drawn:
variable_name.plot(x="column_name", y= "column_name",kind ="line",figsize=(20,15)
c) Plot a single column
We use the seaborn library to achieve this. As part of the code, we use the distplot function to plot the distribution of the data in the given column as below:
sns.distplot(variable_name["column_name"])
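Note that distplot has been deprecated in recent versions of seaborn, so it may raise a warning or an error in newer environments. A rough equivalent with the newer histplot function, assuming the same DataFrame and column as above, would be:
# histplot replaces distplot in newer seaborn releases;
# kde=True overlays a density curve similar to distplot's default look.
sns.histplot(variable_name["column_name"], kde=True)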
d) General Information
We can plot all the columns in a single graph and analyze them visually. However, when dealing with huge data the analysis may be a little difficult because of the congestion. We can try to reduce the congestion by passing a figsize that is a little bigger than the default one.
We use the plot and the show function to achieve this as shown
variable_name.plot()
plt.show()
e) General Plot of a Column
Data from a single column can be plotted using this code:
variable_name["column_name"].plot()
plt.show()
f) Comparison of Two Different Columns
A comparison of two different columns can be done and a relationship traced if it exists. We should choose the columns under analysis carefully to avoid distorted graphs that stretch to accommodate an outrageous difference in data range. We can use the following guideline to achieve this:
variable_name.plot.scatter(x="column_name", y="column_name", alpha=0.5)
plt.show()
For this instance, we have chosen a scatter graph in order to see the variations and relationships between the columns.
g) Box Graph
A box graph represents a summary of a set of data in a five-number format: the minimum, first quartile, median, third quartile, and maximum. We can have a graph to represent this information using the plot() function as shown:
variable_name.plot.box(figsize=(n,m))
plt.show()
This displays the data in a graph in which each column is represented, and an analysis can be done from that information.
h) Correlation Objects
We have seen above that Python has a function that enables us to get the correlation of the data we have. Correlation objects will be very useful, as we will see in the next section. Here, we demonstrate how to create a correlation object:
variable_name_of_object = variable_name.corr()
i) Heatmaps
We can view correlations graphically using the heatmap function, whose colours change with the strength of the correlation between elements. In the following code, we demonstrate how to achieve this:
sns.heatmap(variable_name_of_object, cmap='Reds', annot=True)
We can also manipulate the data, adding columns and striking out others until we get what we desire. Python offers us these functionalities to enable us to explore the different possibilities that come with data analysis.
a) Create a column from derived information
Now let us take, for instance, a circumstance in which we want a new column that is the result of a mathematical operation on an already existing column. Doing the operation on the elements of the column one at a time can be tedious, especially when working with a very large data set, so it is important to seek a solution that is fast and efficient. Fortunately, a new column can be created from existing columns: a derived-information column.
The code below illustrates a guideline on how this can be achieved
variable_name_of_dataset["new_column_name"] = variable_name_of_dataset["existing_column_name"] * 2
In this example, the mathematical operation multiplies the elements of the existing column by 2, hence the "* 2" at the end of the statement.
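The same idea extends to operations that combine several existing columns; the data and column names below are made up purely for illustration:
import pandas as pd

# Made-up DataFrame with two numeric columns.
df = pd.DataFrame({"price": [200000, 350000, 150000],
                   "area": [100, 175, 60]})

# New column computed element-wise from the two existing columns.
df["price_per_area"] = df["price"] / df["area"]
print(df.head())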
b) Renaming Columns
We download or collect data whose columns are named according to the needs and knowledge of the person collecting the data. However, the naming may not be conventional enough to suit the needs of the person doing the analysis. To make the data more familiar and usable, the data analyst can rename the columns, giving each one the alternative name they desire. The guideline below shows how this is possible using the rename function, which takes a dictionary mapping each original column name to its new name:
# Renaming Columns
new_variable_name = variable_name.rename(columns={"old_column_name": "new_column_name"})
new_variable_name.head()
In the above guideline we have renamed just a single column, but more than one column can be renamed at once; we simply pass the old and new column names as additional key-value pairs of the dictionary, as sketched below.
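A short sketch of renaming several columns in one call; the column names here are hypothetical:
# Rename several columns at once by passing more entries in the dictionary.
new_variable_name = variable_name.rename(columns={
    "old_column_one": "new_column_one",
    "old_column_two": "new_column_two",
})
new_variable_name.head()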
The rename can also change the letter case of the column labels, that is, convert them to lower or upper case. We take the same steps, but instead of passing the old and new column names, we pass the conversion function to apply to each column label:
# Renaming Columns
new_variable_name = variable_name.rename(columns=str.lower)
new_variable_name.head()
Data analysis is not limited to the functions above, but these are key and among the most used for seeing different aspects of data. I have done a notebook on most, if not all, of the things mentioned above for reference in case you are stuck. It is in my GitHub account and can be accessed through the link below; if it is helpful, please give it a star: https://github.com/Gamalie/Data-Science
It is meant to illustrate the things discussed above and to assist those stuck on how to analyze their data in preparation for modeling with a machine learning or deep learning model.
There are also several sites that can help one do more data exploration. Kaggle has guidelines on how to analyze different data sets, and the official pandas, seaborn, NumPy, and Matplotlib documentation will help you gain a deeper understanding of the libraries and the analysis tools they offer.