Silvester

Posted on May 5, 2024

Mastering Data Exploration with Tidyverse: A Beginner-Friendly Guide

#datascience #data #beginners #newbie

Imagine you have spent a year recording your shopping in an Excel file on your computer. In December, you may want to see how you spent your money over the whole year. That is what exploratory data analysis (EDA) helps you with: it lets you break down complex data and better understand what it means.

As a data analyst, curiosity will help you understand your data and how the features in your dataset relate to one another. There are no strict rules about what you can or cannot do when performing exploratory data analysis. This article introduces exploratory data analysis using tidyverse, a collection of R packages that share a common design and make everyday data work easier.

Why EDA

1. One reason for performing EDA is to identify errors. Obvious errors, such as impossible or missing values, are easy to spot during EDA.
2. The analyst can detect patterns, anomalies, and interesting relationships between variables. For a data scientist, this helps ensure that a model is built on sound data and produces valid output.
3. EDA helps answer questions about the data, such as how spread out the values are (standard deviations, confidence intervals) and which categorical variables it contains.

EDA steps

Prerequisites

Basic knowledge of the R programming language
R and RStudio installed on your computer

Getting started

If you have not yet installed tidyverse, install it from the R console in RStudio and then load it. You only need to install it once, but you must load it in every new R session.

install.packages("tidyverse") #installing the library
library(tidyverse) #loading the library

Now that you have loaded your library, you are free to continue with the next part.

Importing Data

Before you perform exploratory data analysis, you need data. You can import data from a database or your computer. As a beginner, you can start with the inbuilt R datasets.
However, when you import data from your computer into RStudio or the R console, you can use the base R read.csv function (tidyverse also provides read_csv in its readr package). For example:

expenses <- read.csv("yearly_expenses.csv") #the file name must be a quoted string

In this code, yearly_expenses.csv is the file that you have on your computer, located in your current working directory. The code reads the file and stores it as a data frame in the variable expenses. With expenses holding the data frame, you can now proceed with the next steps of your data analysis.
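To try the import step without a real file, here is a minimal sketch that first writes a tiny CSV to a temporary folder and then reads it back. The column names (month, category, amount) and the values are made up for illustration; your own file will look different.

```r
# Hypothetical sample data standing in for a real yearly_expenses.csv
csv_path <- file.path(tempdir(), "yearly_expenses.csv")
writeLines(
  c("month,category,amount",
    "January,Groceries,120.50",
    "January,Transport,45.00",
    "February,Groceries,98.75"),
  csv_path
)

# Read the file into a data frame
expenses <- read.csv(csv_path)

str(expenses)   # column names and types
head(expenses)  # first few rows
```

Checking the result with str() and head() right after importing is a quick way to confirm that the columns were read with the types you expect.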

Basic Explorations

Summary statistics
Computing summary statistics is usually among the first steps you will take in your EDA process. They give you quick insight into the center, spread, and distribution of your data, and they serve as the foundation for the further exploration that you will perform. Common summary statistics include the mean, median, mode, variance, standard deviation, range, quartiles, and percentiles.

library(tidyverse) #the tidyverse package for Exploratory data analysis 
#inbuilt R dataset
air <- airquality #Assigning the inbuilt R dataset to the data frame named air
summary(air) #Getting summary of the air dataset

The output of summary() shows the summary statistics of the variables in the air dataset. We can see that there are 6 variables in the dataset (Ozone, Solar.R, Wind, Temp, Month, Day). For each variable, the table reports the minimum value, the first and third quartiles, the median, the mean, and the maximum value. Another observation is that the Ozone and Solar.R variables contain missing values (shown as NA's).
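summary() does not report every statistic listed earlier, such as the standard deviation or the variance. One way to get them is with summarise(), sketched here for the Ozone variable. The result column names (mean_ozone and so on) are just illustrative, and na.rm = TRUE tells each function to skip missing values.

```r
library(tidyverse)

air <- airquality

# Extra summary statistics for Ozone, ignoring missing values
ozone_stats <- air %>%
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    sd_ozone   = sd(Ozone, na.rm = TRUE),
    var_ozone  = var(Ozone, na.rm = TRUE),
    p90_ozone  = quantile(Ozone, 0.9, na.rm = TRUE, names = FALSE)
  )

ozone_stats
```

Without na.rm = TRUE, each of these functions would return NA, because Ozone contains missing values.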

Checking for missing values

When using data that you did not collect (secondary data), you may sometimes encounter missing values in some rows. In R, missing values are represented as NA (other tools may show them as NULL or as blank cells). If you analyze your data without dealing with these missing values, you risk generating incorrect insights from the data.
If you are a data scientist creating a model, missing values can greatly impact the performance of your predictive model. As a result, dealing with these missing values is an important task that you have to perform and this can only happen after you have noted the values that are missing.

air %>%
  summarise(across(everything(), ~sum(is.na(.))))

In the code above, we start with the data frame that we want to work on, which is the air data frame. The pipe operator %>% then takes the data frame and passes it as the first argument to the next function. The summarise() function collapses the many rows of the data frame into a single row of summary values. Inside it, across(everything(), ~sum(is.na(.))) applies the same calculation to every column: is.na() marks each missing value, and sum() counts them.

Output:

The output shows that Ozone and Solar.R are the only variables with missing values, with 37 and 7 missing values respectively.
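Once you know where the missing values are, you have to decide what to do with them. The sketch below shows two small follow-up steps: expressing the missing values as a percentage of each column, and dropping incomplete rows with drop_na() from tidyr. Dropping rows is only one option and is used here as an assumption for illustration; depending on your data, imputing values may be a better choice.

```r
library(tidyverse)

air <- airquality

# Percentage of missing values in each column
air %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

# One option: keep only the rows that have no missing values
air_complete <- air %>%
  drop_na()

nrow(air)           # number of rows before dropping
nrow(air_complete)  # number of rows after dropping
```

Comparing the row counts before and after tells you how much data you lose by dropping incomplete rows.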

Univariate analysis

Univariate analysis is the analysis of one variable in the dataset at a time. This analysis is important as it helps uncover aspects like the maximum, range, mode, median, and outliers of that variable. Univariate analysis is necessary for both numerical and categorical variables.
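For a categorical variable, the simplest univariate summary is a frequency table. The airquality data has no true categorical column, so this sketch treats Month (stored as a number from 5 to 9) as a category, which is an assumption made for illustration.

```r
library(tidyverse)

air <- airquality

# Number of observations recorded in each month
month_counts <- air %>%
  count(Month)

month_counts
```

count() returns one row per distinct value, with the number of rows for that value in the column n.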

Histogram for Ozone

library(tidyverse)
air <- airquality
#Histogram of Ozone variable
ggplot(air, aes(x=Ozone))+
  geom_histogram(fill="skyblue", color="black", bins=20)+
  labs(title="Ozone level distribution")

Output:

The output above shows the distribution of the Ozone variable. We can see that most of the values are concentrated between 0 and 50. Also, the maximum value is 168, which sits far from the rest of the data. When you run the code, R also prints a warning saying that ggplot2 removed the rows with missing Ozone values before drawing the histogram.

Box plot of Solar.R variable

ggplot(air, aes(x=Solar.R))+
  geom_boxplot(fill="skyblue", color="red")+
  labs(title="Solar.R boxplot")+
  theme(plot.title = element_text(hjust = 0.5)) #center the title

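The code above outputs a box plot of the variable Solar.R. A box plot is useful for a data analyst as it shows the distribution of the variable at a glance: the box spans the first to the third quartile, the line inside the box marks the median, and points beyond the whiskers are flagged as outliers.

From the box plot above, we can see that the Solar.R variable is left-skewed. There are also no outliers, since no points lie beyond the whiskers, which extend at most 1.5 times the interquartile range from the first and third quartiles.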
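You can check the box plot reading numerically by computing the 1.5 × IQR fences yourself. This is a rough sketch: quantile() and geom_boxplot() calculate quartiles in slightly different ways, so the fences may differ a little from the ones in the plot. The names solar_fences and solar_outliers are illustrative.

```r
library(tidyverse)

air <- airquality

# Quartiles, interquartile range, and outlier fences for Solar.R
solar_fences <- air %>%
  summarise(
    q1 = quantile(Solar.R, 0.25, na.rm = TRUE, names = FALSE),
    q3 = quantile(Solar.R, 0.75, na.rm = TRUE, names = FALSE)
  ) %>%
  mutate(
    iqr         = q3 - q1,
    lower_fence = q1 - 1.5 * iqr,
    upper_fence = q3 + 1.5 * iqr
  )

solar_fences

# Rows whose Solar.R value falls outside the fences
# (filter() drops rows where Solar.R is NA)
solar_outliers <- air %>%
  filter(Solar.R < solar_fences$lower_fence |
           Solar.R > solar_fences$upper_fence)

nrow(solar_outliers)
```

If nrow(solar_outliers) is 0, no value falls outside the fences, which matches a box plot with no outlier points.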

Tidyverse has powerful tools like ggplot2 that we used when performing the univariate analysis. The histogram and box plot have been created using ggplot2 and it has helped us understand the distribution of the Ozone and Solar.R variables. As you can see from the univariate plots above, visualizations help in seeing patterns in addition to showing us outliers that we might miss if we rely only on summary statistics.

Bivariate analysis

While univariate analysis focuses on one variable only, bivariate analysis examines the relationship between two variables. The pair of variables can be categorical-categorical, numerical-numerical, or numerical-categorical.
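As a sketch of the numerical-categorical case, you can summarise a numerical variable within each level of a categorical one. Here Month is treated as the category (an assumption for illustration, since it is stored as a number), and the column names median_ozone and n_measured are made up for this example.

```r
library(tidyverse)

air <- airquality

# Median Ozone level and number of non-missing readings per month
ozone_by_month <- air %>%
  group_by(Month) %>%
  summarise(
    median_ozone = median(Ozone, na.rm = TRUE),
    n_measured   = sum(!is.na(Ozone))
  )

ozone_by_month
```

Including the count of non-missing readings next to each median makes it clear how much data each group summary is based on.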

Scatter plot

# Scatter plot for Ozone vs. Solar.R
ggplot(air, aes(x = Solar.R, y = Ozone)) +
  geom_point(color = "darkblue") +
  labs(title = "Ozone vs. Solar Radiation")

In the code above, we use ggplot() together with geom_point() from ggplot2, one of the tidyverse packages, to create a scatter plot.

In this scatter plot of Ozone vs Solar.R, we can see what appears to be a positive relationship between the two variables, although the points are quite scattered. The plot suggests that higher solar radiation tends to be associated with higher Ozone levels.
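To back up the visual impression with a number, you can compute the correlation coefficient and add a straight trend line to the plot. This is a minimal sketch: it uses the Pearson correlation (the default for cor()), and use = "complete.obs" tells cor() to ignore rows with missing values. The names ozone_solar_cor and trend_plot are illustrative.

```r
library(tidyverse)

air <- airquality

# Pearson correlation between Ozone and Solar.R, ignoring missing values
ozone_solar_cor <- cor(air$Ozone, air$Solar.R, use = "complete.obs")
ozone_solar_cor

# The same scatter plot with a linear trend line added
trend_plot <- ggplot(air, aes(x = Solar.R, y = Ozone)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Ozone vs. Solar Radiation")

# Run print(trend_plot) to display the plot
```

A coefficient close to 1 or -1 means a strong linear relationship, while a value close to 0 means a weak one.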

Multivariate analysis

Multivariate analysis examines more than two variables at the same time. One commonly used multivariate analysis is correlation analysis, which shows how the numerical variables in a dataset are correlated with one another.

Pairwise scatterplot matrix for numerical variables

air %>%
  select(Ozone, Solar.R, Wind, Temp) %>%
  pairs()

The code above produces a scatterplot matrix using the base R pairs() function, a simple approach to understanding the rough linear relationships that exist between multiple variables. In this code, we are looking for linear relationships among the Ozone, Solar.R, Wind, and Temp variables.

In the pair plot above, the variable names are written along the diagonal from the top left to the bottom right. Each variable is plotted against every other variable, and the plots above the diagonal are mirror images of the plots below it.

To read a panel, find the variable named in its row and the variable named in its column: those are the two variables being compared. For instance, the panel for Ozone and Solar.R shows what appears to be a rough linear relationship, an indication that the variables may be positively correlated. The matching panel on the other side of the diagonal shows the same relationship with the axes swapped.

Another interesting scatter plot is that of Ozone and Wind. It shows a negative slope, indicating the possible presence of a negative correlation.
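A scatterplot matrix only hints at the direction of each relationship. To put numbers on it, you can compute a correlation matrix for the same four variables. This is a minimal sketch using the base R cor() function, where use = "complete.obs" drops rows with missing values before the calculation.

```r
library(tidyverse)

air <- airquality

# Pairwise Pearson correlations between the numerical variables
cor_matrix <- air %>%
  select(Ozone, Solar.R, Wind, Temp) %>%
  cor(use = "complete.obs")

round(cor_matrix, 2)
```

Each cell holds the correlation between the variable in its row and the variable in its column, so the sign of the Ozone and Wind cell tells you whether the negative slope seen in the plot holds up.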

Even though there are cases of correlation between some of the variables in the data, it is important to remember that correlation does not mean causation.

Conclusion

As a data analyst, Tidyverse will help you to explore variable distribution and variable relationships and this ultimately helps you transform messy data into valuable insights. Since you now have an idea of how to use the tidyverse library to extract insights from data, get out and explore your data!