Exploratory Data Analysis (EDA) is a fundamental step in any data science project, and it is often estimated to take up around 70% of a data scientist's work. EDA involves investigating and manipulating the dataset to answer questions about it, making it easier to discover patterns, test hypotheses, spot anomalies and check assumptions. It also informs the choice of machine learning model and helps the resulting model make better predictions.
Types of Exploratory Data Analysis
There are various types of EDA, and the choice depends on the goals of the analysis and the nature of the data. Based on the number of variables examined at a time, EDA is commonly divided into three types: univariate, bivariate and multivariate.
1. Univariate Analysis
Univariate EDA analyzes a single variable at a time. It uses techniques such as visualizations (bar charts, pie charts, box plots and histograms) and descriptive statistics (mode, mean, median, variance and standard deviation).
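As a rough illustration, here is a minimal univariate sketch using pandas and Matplotlib; the file name data.csv and the column age are placeholders rather than anything from a real dataset.

```python
# Minimal univariate EDA sketch (data.csv and the "age" column are hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Descriptive statistics for a single variable
print(df["age"].describe())          # count, mean, std, min, quartiles, max
print("mode:", df["age"].mode()[0])
print("variance:", df["age"].var())

# Histogram and box plot of the same variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["age"].plot.hist(bins=30, ax=axes[0], title="Histogram of age")
df["age"].plot.box(ax=axes[1], title="Box plot of age")
plt.show()
```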
2. Bivariate Analysis
Bivariate EDA analyzes the relationship between two variables. Techniques include scatter plots, correlation coefficients and contingency tables (cross tabulations).
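A minimal bivariate sketch along the same lines, again with hypothetical column names (height, weight, gender, purchased):

```python
# Minimal bivariate EDA sketch (all column names are hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Scatter plot and correlation coefficient for two numeric variables
df.plot.scatter(x="height", y="weight", title="Height vs. weight")
print("Pearson r:", df["height"].corr(df["weight"]))

# Contingency table (cross tabulation) for two categorical variables
print(pd.crosstab(df["gender"], df["purchased"]))

plt.show()
```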
3. Multivariate Analysis
Multivariate EDA analyzes relationships between three or more variables. Techniques used include multivariate plots, dimensionality reduction techniques, cluster analysis, correlation matrices and heatmaps.
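For example, a correlation heatmap and a pair plot give a compact multivariate view. This sketch assumes a dataset with several numeric columns; nothing in it comes from a specific real dataset.

```python
# Minimal multivariate EDA sketch: correlation heatmap and pair plot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv").select_dtypes("number")  # keep numeric columns only

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")   # correlation matrix as a heatmap
plt.show()

sns.pairplot(df)                                      # pairwise scatter plots and distributions
plt.show()
```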
Key aspects of EDA
Understanding the distribution of data: Examining how data points are distributed to understand their measures of central tendency and dispersion.
Graphical representations: Utilizing charts, plots and other advanced visualizations.
Outlier detection: Identifying unusual values in the data that might lead to errors when computing the statistical summary.
Correlation analysis: Analyzing the relationships between variables to understand how they might affect each other. This involves computing correlation coefficients and creating correlation matrices.
Handling missing values: Identifying missing values and deciding how to deal with them (impute, drop or flag).
Summary Statistics: Calculating summary statistics to provide insights into data trends (see the pandas sketch after this list).
Testing Assumptions: Verifying the assumptions made by the models you plan to use.
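To tie a few of these aspects together, here is a short pandas sketch covering summary statistics, missing values and a simple IQR-based outlier check; the dataset and the price column are assumptions for illustration only.

```python
# Sketch of a few key EDA checks: summary statistics, missing values, IQR-based outliers.
import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical dataset

print(df.describe())                         # summary statistics for numeric columns
print(df.isna().sum())                       # missing values per column

# Simple IQR rule for flagging outliers in one numeric column ("price" is a placeholder name)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers")
```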
EDA Tools
1. Python libraries
Pandas - Data cleaning, manipulation and summary statistics.
Matplotlib - Used for visualizations.
Seaborn - Built on Matplotlib. It is used for high-level statistical visualizations with attractive defaults.
SciPy - Provides scientific computing routines, including statistical tests and distributions.
Plotly - Makes dynamic and interactive graphs for visualization.
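As a hedged example of how some of these libraries fit together, the sketch below uses SciPy for a normality test and Plotly for an interactive chart; the column names (income, age, segment) are invented for illustration.

```python
# Quick sketch combining pandas, SciPy and Plotly (column names are placeholders).
import pandas as pd
import plotly.express as px
from scipy import stats

df = pd.read_csv("data.csv")

# SciPy: test whether a numeric column looks normally distributed
stat, p = stats.shapiro(df["income"].dropna())
print(f"Shapiro-Wilk p-value: {p:.4f}")

# Plotly: interactive scatter plot, colored by a categorical column
fig = px.scatter(df, x="age", y="income", color="segment")
fig.show()
```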
2. R libraries
ggplot2 - Used to make complex multilayered visualizations.
dplyr - For data wrangling and manipulation.
tidyr - Data cleaning and tidying.
shiny - Used to create interactive data analysis web apps.
Plotly - Visualization.
3. IDEs
Environments such as Jupyter Notebook for writing and running Python code interactively.
4. Data visualization tools
Tableau - For interactive and shareable dashboards.
Power BI - Interactive reports and dashboards.
5. Statistical analysis tools
SPSS - Used for complex statistical data analysis.
SAS - Statistical analysis and data management.
6. Data cleaning tools
OpenRefine - For cleaning and transformation.
SQL databases - Relational databases such as MySQL and PostgreSQL, queried with SQL to manage and retrieve data for analysis.
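Pulling data from a relational database straight into pandas is a common way to start EDA. The sketch below uses SQLite so it runs self-contained; the example.db file and the sales table are hypothetical.

```python
# Sketch of loading data from a relational database into pandas for EDA
# (SQLite keeps the example self-contained; "sales" is a hypothetical table).
import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")
df = pd.read_sql("SELECT * FROM sales WHERE amount IS NOT NULL", conn)
conn.close()

print(df.describe())
```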