DEV Community: Biruk Bizuayehu

Exploratory Data Analysis Ultimate Guide

Biruk Bizuayehu — Tue, 28 Feb 2023 17:40:00 +0000

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.

Understanding the dataset can mean a variety of things, including but not restricted to:
Removing unnecessary variables and keeping only the relevant ones.
Locating outliers, missing values, or mistakes made by humans.
Knowing the relationship(s) between variables.

In the end, the goal is to increase your insights from a dataset and reduce the chance of error along the way.

EDA components

I see three essential parts to data exploration:

being aware of your variables.
tidying up your dataset.
examining the connections between various variables.
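The three parts above can be sketched in a few lines of pandas. This is only a minimal illustration: the DataFrame and its column names (age, salary, team) are made up for the example, and a real dataset will need its own cleaning choices.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 32],
    "salary": [3000, 4200, 3900, 6100, 4200],
    "team": ["a", "b", "b", "c", "b"],
})

# 1. Know your variables: column types and missing values per column.
dtypes = df.dtypes
missing = df.isna().sum()

# 2. Tidy up: drop exact duplicate rows, then rows with a missing age.
clean = df.drop_duplicates().dropna(subset=["age"])

# 3. Examine relationships: correlation between the numeric columns.
corr = clean[["age", "salary"]].corr()

print(dtypes, missing, clean.shape, corr, sep="\n\n")
```

Dropping rows is just one way to handle missing values; filling them in is another option, depending on the data.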

Tools available for EDA

There are numerous EDA tools available. Among the most popular tools are:

Python is a popular programming language that is used in EDA. Pandas, NumPy, Matplotlib, and Seaborn are some of the most popular Python libraries for EDA.

R is a statistical programming language that is used in EDA. The most popular R packages for EDA are dplyr, ggplot2, and tidyr.

Excel is a spreadsheet program that can be used for EDA. It includes data visualization tools like charts and graphs.

Tableau is a data visualization tool that allows users to easily create interactive visualizations and dashboards.

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

To work with pandas, first install the library. If you use pip, you can install pandas with:

pip install pandas

Let's see some basic pandas operations that are useful for Exploratory Data Analysis.

import pandas as pd

df = pd.read_csv('employees.csv')

df.head()

The head() method returns a specified number of rows, starting from the top.

The head() method returns the first 5 rows if a number is not specified.
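Here is a small sketch of both behaviours. The DataFrame below is a made-up stand-in for employees.csv, so the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv with 8 rows.
df = pd.DataFrame({"id": range(1, 9), "name": list("ABCDEFGH")})

first_five = df.head()   # no argument: the first 5 rows
first_two = df.head(2)   # explicit row count: the first 2 rows

print(first_five)
print(first_two)
```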

df.shape

The shape attribute returns the dimensions of a DataFrame as a (rows, columns) tuple. Because it is an attribute and not a method, it is written without parentheses.

output (1000, 8)
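A quick sketch with a tiny made-up DataFrame shows how the tuple can be unpacked into row and column counts:

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
```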

df.describe()

The describe() method computes basic summary statistics for the dataset, such as the count of data points, the mean, the standard deviation, and the extreme values (minimum and maximum). Missing (NaN) values are automatically skipped.
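To see the NaN handling in action, here is a minimal sketch on a made-up column with one missing value. The count covers only the three non-missing entries.

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"age": [20.0, 30.0, None, 40.0]})

summary = df.describe()
print(summary)

# The missing value is ignored: count is 3 and the mean is (20 + 30 + 40) / 3.
print(summary.loc["count", "age"])  # 3.0
print(summary.loc["mean", "age"])   # 30.0
```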

df.info()

The info() method prints information about the DataFrame. The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column. Note: info() prints this information directly and returns None.
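Because info() prints instead of returning a value, one way to keep the report as a string is to pass a text buffer through its buf parameter. The sketch below uses made-up data.

```python
import io

import pandas as pd

# Hypothetical data with one missing name.
df = pd.DataFrame({"name": ["Ann", "Bob", None], "age": [31, 45, 28]})

buffer = io.StringIO()
result = df.info(buf=buffer)  # writes the report into the buffer
report = buffer.getvalue()

print(result)  # None
print(report)
```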

That is all for today's article. If you want to learn more about pandas, see the pandas documentation on the official website.

Pandas official doc

Thank you for taking the time to read my article and giving me honest feedback.

Python 101: Introduction to Python for Data Science

Biruk Bizuayehu — Thu, 16 Feb 2023 11:33:49 +0000

PYTHON

Python is a high-level, object-oriented programming language. High-level means the language abstracts away machine details, which makes it easier for people to read and write. Object-oriented means it emphasizes the use of objects to represent real-world entities and to allow interactions between objects.

Python is simple and easy to learn

Python is one of the best programming languages to learn if you're just getting started. Whether a user is experienced or not, they will be able to easily grasp each line of code and its purpose due to its basic syntax, English-based commands, and relatively simple layout.

Applications of Python

1 Data Science.
2 Machine Learning.
3 Web Development.
4 Computer Vision and Image Processing.
5 Game Development, and more.

Python Basics

Variables

Variables are containers for storing data values.

Example

x = str(3)    # x will be '3'
y = int(3)    # y will be 3
z = float(3)  # z will be 3.0
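To check what each variable actually holds, type() can be used. This short sketch also shows that the same + operator behaves differently depending on the type:

```python
x = str(3)    # '3'
y = int("3")  # 3, converted from a string
z = float(3)  # 3.0

print(type(x).__name__, type(y).__name__, type(z).__name__)  # str int float

print(x + x)  # 33 (strings are concatenated)
print(y + y)  # 6 (integers are added)
```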

Data Structure

Lists, sets, tuples, and dictionaries are some of the fundamental data structures used in Python. Each one behaves differently.

List

fruits = ['apple', 'mango', 'banana']

Set

fruits = {'apple', 'banana', 'mango'}

Tuple

fruits = ('apple', 'mango', 'banana')

Dictionary

fruits = {'fruit1': 'apple', 'fruit2': 'mango', 'fruit3': 'banana'}
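A short sketch of how the four structures differ in practice. The variable names and the extra 'kiwi' entry are only for illustration.

```python
# Lists are ordered, mutable, and allow duplicates.
fruits_list = ["apple", "mango", "banana"]
fruits_list.append("apple")
print(len(fruits_list))  # 4

# Sets keep only unique items, so adding an existing item changes nothing.
fruits_set = {"apple", "banana", "mango"}
fruits_set.add("apple")
print(len(fruits_set))  # 3

# Tuples are ordered but immutable; assigning to an index raises TypeError.
fruits_tuple = ("apple", "mango", "banana")
try:
    fruits_tuple[0] = "kiwi"
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)  # False

# Dictionaries map keys to values and can grow by assigning a new key.
fruits_dict = {"fruit1": "apple", "fruit2": "mango", "fruit3": "banana"}
fruits_dict["fruit4"] = "kiwi"
print(fruits_dict["fruit4"])  # kiwi
```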

If you want to learn more about Python, see the Python documentation on the official website.

Python official doc

Python for Data Science

If you want to perform data analysis, you need to import specific libraries. Some examples include:

NumPy - A powerful library that helps you create n-dimensional arrays.

Pandas - Used for structured data operations.

SciPy - Provides scientific capabilities, like linear algebra and Fourier transform.

Matplotlib - Primarily used for visualization purposes.

Scikit-learn - Used for machine learning tasks such as classification, regression, and clustering.

Seaborn - Provides a high-level interface for drawing attractive and informative statistical graphics.

Let's see some basic concepts of NumPy and pandas.

NumPy

NumPy is a Python library for high-performance numerical computing. It provides an array-oriented interface to matrix and vector operations, making it easy to perform complex mathematical operations on large data sets.
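As a rough illustration of what "array-oriented" means, the sketch below multiplies two arrays element by element and then multiplies a matrix by a vector, without writing any loops. It assumes NumPy is already installed, and the array names and values are made up.

```python
import numpy as np

# Hypothetical prices and quantities.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Element-wise multiplication over the whole array at once.
totals = prices * quantities
print(totals)        # [10. 40. 90.]
print(totals.sum())  # 140.0

# Matrix-vector product with the @ operator.
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([1, 1])
product = matrix @ vector
print(product)       # [3 7]
```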

To work with NumPy, first install the library. If you use pip, you can install NumPy with:

pip install numpy

Let's see some basic operations in NumPy.

# importing numpy

import numpy as np

# create a 1d array

arr1 = np.array([10, 20, 30])
print(arr1)  # output: [10 20 30]

# create a 2d array

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)  # output: [[1 2 3]
             #          [4 5 6]]

# to get an individual element from a numpy array, pass its index

arr1 = np.array([10, 20, 30])
print(arr1[0])  # output: 10

Note: indexing starts from zero. We can also perform more advanced indexing with Python's slice notation.
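A few slicing examples as a minimal sketch, using made-up arrays:

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1[1:4])   # [20 30 40]  (index 1 up to, but not including, index 4)
print(arr1[-1])    # 50          (negative indices count from the end)
print(arr2[1, 2])  # 6           (row 1, column 2)
print(arr2[:, 0])  # [1 4]       (first column of every row)
```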

If you want to learn more about NumPy, see the NumPy documentation on the official website.

NumPy official doc

Pandas

To work with pandas, first install the library. If you use pip, you can install pandas with:

pip install pandas

Let's see some basic operations in pandas.

# importing pandas

import pandas as pd

# loading csv to dataframe

df = pd.read_csv('data.csv')

# It returns the first 5 rows of the Dataframe

df.head()

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S


# It returns the last 5 rows of the Dataframe

df.tail()

     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked
886  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607  23.45  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q

# It helps in getting a quick overview of the dataset

df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
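In the output above, Age has 714 non-null values out of 891 entries, so 177 values are missing. The same kind of count can be computed directly with isna(). The sketch below uses a tiny made-up table with three of the same column names; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the dataset above; values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Number of missing values per column.
missing_counts = df.isna().sum()

# Share of missing values per column (between 0 and 1).
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```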

# Return a statistical summary for numerical columns present in the dataset

df.describe()

       PassengerId  Survived     Pclass       Age          SibSp        Parch        Fare
count  891.000000   891.000000   891.000000   714.000000   891.000000   891.000000   891.000000
mean   446.000000   0.383838     2.308642     29.699118    0.523008     0.381594     32.204208
std    257.353842   0.486592     0.836071     14.526497    1.102743     0.806057     49.693429
min    1.000000     0.000000     1.000000     0.420000     0.000000     0.000000     0.000000
25%    223.500000   0.000000     2.000000     20.125000    0.000000     0.000000     7.910400
50%    446.000000   0.000000     3.000000     28.000000    0.000000     0.000000     14.454200
75%    668.500000   1.000000     3.000000     38.000000    1.000000     0.000000     31.000000
max    891.000000   1.000000     3.000000     80.000000    8.000000     6.000000     512.329200
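By default describe() leaves out text columns such as Name or Sex. If a summary of those is also wanted, passing include="all" is one option. The sketch below uses a small made-up table, so the numbers are not from the dataset above.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Default: only the numeric column is summarized.
numeric_summary = df.describe()

# include="all": text columns get count, unique, top (most frequent value), and freq.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```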

If you want to learn more about pandas, see the pandas documentation on the official website.

Pandas official doc

Thank you for taking the time to read my article and giving me honest feedback.