Exploratory Data Analysis Ultimate Guide

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA, also known as data exploration, is a step in the data analysis process in which a number of techniques are used to better understand the dataset being used.

Understanding the dataset can mean a variety of things, including but not restricted to:
removing unnecessary variables and keeping only the relevant ones.
locating outliers, missing values, or human errors.
understanding the relationship(s) between variables.
ultimately, gaining more insight from the dataset and reducing the chance of error along the way.

EDA components

I see three essential parts to data exploration:

being aware of your variables.
tidying up your dataset.
examining the connections between various variables.
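
As a rough illustration, the three parts above might look like this in pandas. This is only a minimal sketch: the DataFrame, its column names, and its values are made up for the example and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical sample data; columns and values are invented for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "age": [28, 35, None, 41],          # one missing value on purpose
    "salary": [50000, 62000, 58000, 75000],
    "notes": ["", "", "", ""],          # a column that carries no information
})

# 1. Be aware of your variables: column names and data types.
print(df.dtypes)

# 2. Tidy up the dataset: drop an unnecessary column, count missing values.
df = df.drop(columns=["notes"])
missing = df.isna().sum()
print(missing)

# 3. Examine connections between variables: correlation of numeric columns.
corr = df[["age", "salary"]].corr()
print(corr)
```

Which columns count as unnecessary, and how missing values should be handled, depends on the dataset and the question being asked.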

Tools we can use for EDA

There are numerous EDA tools available. Among the most popular tools are:

Python is a popular programming language that is used in EDA. Pandas, NumPy, Matplotlib, and Seaborn are some of the most popular Python libraries for EDA.

Excel is a spreadsheet program that can be used for EDA. It includes data visualization tools like charts and graphs.

R is a statistical programming language that is used in EDA. The most popular R packages for EDA are dplyr, ggplot2, and tidyr.

Tableau is a data visualization tool that allows users to easily create interactive visualizations and dashboards.

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

To work with pandas, first install the library. If you use pip, you can install pandas with:

pip install pandas

Let's look at some basic pandas operations for Exploratory Data Analysis.

import pandas as pd

df = pd.read_csv('employees.csv')

df.head()

The head() method returns the first n rows of the DataFrame, starting from the top. If no number is given, it returns the first 5 rows.
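
To see how the row count works, here is a small sketch that uses a made-up DataFrame in place of employees.csv (the names and salaries are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for employees.csv so the example runs on its own.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan", "Eve", "Finn", "Gia"],
    "salary": [50000, 62000, 58000, 75000, 67000, 53000, 71000],
})

first_five = df.head()     # no argument: the first 5 rows
first_two = df.head(2)     # only the first 2 rows
last_three = df.tail(3)    # tail() does the same from the bottom

print(first_two)
print(last_three)
```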

df.describe()

The describe() method computes basic summary statistics for each numeric column: the count of data points, the mean, the standard deviation, the minimum and maximum values, and the quartiles. Missing (NaN) values are automatically skipped.
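
The way missing values are skipped shows up in the count row. A minimal sketch with invented data, where one age is deliberately missing:

```python
import pandas as pd

# Hypothetical data; one age is missing on purpose.
df = pd.DataFrame({
    "age": [28, 35, None, 41],
    "salary": [50000, 62000, 58000, 75000],
})

summary = df.describe()
print(summary)

# The count for "age" covers only the non-missing values.
print(summary.loc["count", "age"])     # 3.0
print(summary.loc["count", "salary"])  # 4.0
```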

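df.shape

The shape attribute returns the dimensions of a DataFrame as a tuple of (number of rows, number of columns). Because shape is an attribute and not a method, it is written without parentheses.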
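
Since shape is a plain tuple, it can be unpacked directly. A short sketch using a hypothetical four-row DataFrame:

```python
import pandas as pd

# Hypothetical data: 4 rows, 2 columns.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "salary": [50000, 62000, 58000, 75000],
})

print(df.shape)  # (4, 2)

# Unpack the tuple into separate row and column counts.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```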

df.info()

The info() method prints a summary of the DataFrame: the index range, the number of columns, the column labels, the column data types, the number of non-null values in each column, and the memory usage. Note: info() prints this summary directly and does not return it.
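
Because info() prints instead of returning, assigning its result to a variable gives None. If the summary is needed as text, one option is to pass a buffer through the buf parameter, as in this sketch (the data is hypothetical):

```python
import io

import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({
    "age": [28, 35, None, 41],
    "salary": [50000, 62000, 58000, 75000],
})

# Write the summary into an in-memory buffer instead of the console.
buffer = io.StringIO()
result = df.info(buf=buffer)
info_text = buffer.getvalue()

print(result)     # None: info() has no return value
print(info_text)  # the summary, now available as a string
```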

That is all for today's article. If you want to learn more about pandas, see the pandas documentation on the official website.

Pandas official doc

Thank you for taking the time to read my article and giving me honest feedback.
