<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Silvester</title>
    <description>The latest articles on DEV Community by Silvester (@mugultum).</description>
    <link>https://dev.to/mugultum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1174893%2Fcb3dd30f-dfaf-4d52-a622-eeba4c5ced1b.png</url>
      <title>DEV Community: Silvester</title>
      <link>https://dev.to/mugultum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mugultum"/>
    <language>en</language>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Sun, 11 Aug 2024 20:00:07 +0000</pubDate>
      <link>https://dev.to/mugultum/understanding-your-data-the-essentials-of-exploratory-data-analysis-dj5</link>
      <guid>https://dev.to/mugultum/understanding-your-data-the-essentials-of-exploratory-data-analysis-dj5</guid>
      <description>&lt;p&gt;Exploratory Data Analysis (EDA) is a critical data analysis process as it involves understanding and identifying patterns in the data. EDA processes include studying the data to discover patterns, identify how variables are related, and locate outliers. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why perform EDA?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;To identify patterns in the data. By visualizing the data and checking the statistical summaries of the numerical variables, one can see hidden patterns and how some variables are related. &lt;/li&gt;
&lt;li&gt;To detect outliers and anomalies. Outliers are values in a column that are abnormally far from the rest of the values. Outliers can greatly affect the results of an analysis, so detecting and handling them is key to reducing errors in the modelling or prediction process. &lt;/li&gt;
&lt;li&gt;To facilitate data cleaning. Through EDA, one can spot issues in the data such as missing values and errors, which informs how the data should be cleaned. &lt;/li&gt;
&lt;li&gt;To understand the structure of the data. With EDA, you get a better understanding of the features and their distributions, which helps inform how the data analysis and feature engineering will be done.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Techniques in EDA
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Univariate analysis
&lt;/h4&gt;

&lt;p&gt;Univariate analysis is the analysis of a single variable. The purpose of univariate analysis is to understand the summary statistics and distribution of the variable. Some of the activities in univariate analysis include computing summary statistics and visualizing the data using histograms, box plots, bar charts, line plots and violin plots, among others. &lt;/p&gt;
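&lt;p&gt;As a small illustration of univariate analysis, the sketch below (a hypothetical example using synthetic data, not taken from the article) computes summary statistics and skewness for a single numeric variable with pandas:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical example: univariate analysis of one numeric column
rng = np.random.default_rng(42)
bills = pd.Series(rng.normal(loc=20, scale=8, size=500).round(2), name="total_bill")

# Summary statistics describe the centre, spread and range of the variable
print(bills.describe()[["mean", "std", "min", "max"]])

# Skewness hints at how symmetric the distribution is (0 means roughly symmetric)
print("skewness:", round(bills.skew(), 3))
```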

&lt;h4&gt;
  
  
  Bivariate analysis
&lt;/h4&gt;

&lt;p&gt;Bivariate analysis refers to the analysis of how two variables are related. This analysis helps in uncovering patterns in the data, and the commonly used bivariate analysis techniques are pair plots, heatmaps and scatter plots. Other techniques include line graphs, cross-tabulation and covariance.&lt;/p&gt;
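&lt;p&gt;Covariance and correlation, two of the numeric bivariate techniques mentioned above, can be computed directly with pandas. The sketch below uses synthetic data (a hypothetical example, not from the article) where tips are roughly 15% of the bill:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical example: how strongly are two variables related?
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, size=200)
tip = total_bill * 0.15 + rng.normal(0, 0.5, size=200)  # tips roughly 15% of the bill
df = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# Covariance and Pearson correlation quantify the linear relationship
print(df.cov())
print(df.corr())  # a correlation close to 1 indicates a strong linear relationship
```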

&lt;h4&gt;
  
  
  Multivariate analysis
&lt;/h4&gt;

&lt;p&gt;Multivariate analysis is the simultaneous examination of the relationships between more than two variables. The aim of this analysis is to understand how the various features in the dataset are interacting. Commonly used multivariate analysis techniques include principal component analysis, pair plots and contour plots. &lt;/p&gt;
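&lt;p&gt;Principal component analysis, one of the multivariate techniques mentioned above, can be sketched with scikit-learn. In this hypothetical example (not from the article), three features are noisy copies of one underlying factor, so a single component captures almost all the variance:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical example: PCA compresses correlated features into a few components
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
# Three features that are noisy copies of one underlying factor
X = np.hstack([base + rng.normal(0, 0.1, size=(100, 1)) for _ in range(3)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # the first component dominates
```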

&lt;h4&gt;
  
  
  Statistical tests
&lt;/h4&gt;

&lt;p&gt;Statistical tests help in validating hypotheses and discerning significant differences between groups. Some of the statistical tests when performing EDA include t-tests, ANOVA, and chi-square tests. &lt;/p&gt;
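&lt;p&gt;A t-test, for example, can be run with the SciPy library (a hypothetical sketch on synthetic data, not from the article), comparing the means of two groups:&lt;/p&gt;

```python
import numpy as np
from scipy import stats

# Hypothetical example: do two groups differ in their mean?
rng = np.random.default_rng(7)
group_a = rng.normal(20, 5, size=100)
group_b = rng.normal(23, 5, size=100)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (below 0.05, say) suggests a real difference between the groups
```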

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;EDA is an important step in the data analysis or data science pipeline. In EDA, you can use univariate, bivariate and multivariate analysis, as well as statistical tests, to unlock hidden insights in the data. &lt;/p&gt;

&lt;p&gt;If done well, EDA can help a data professional make their data cleaner and more accurate, and ultimately build better-performing models. As a data professional, embracing best practices in EDA is important for understanding your dataset and ultimately generating reliable insights from the data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Additional readings
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dzone.com/articles/importance-and-impact-of-exploratory-data-analysis" rel="noopener noreferrer"&gt;https://dzone.com/articles/importance-and-impact-of-exploratory-data-analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2021/08/exploratory-data-analysis-and-visualization-techniques-in-data-science/" rel="noopener noreferrer"&gt;https://www.analyticsvidhya.com/blog/2021/08/exploratory-data-analysis-and-visualization-techniques-in-data-science/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>eventdriven</category>
      <category>beginners</category>
      <category>data</category>
    </item>
    <item>
      <title>Building a data science career as a beginner. How can you do it?</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Sat, 03 Aug 2024 15:41:24 +0000</pubDate>
      <link>https://dev.to/mugultum/building-a-data-science-career-as-a-beginner-how-can-you-do-it-eb6</link>
      <guid>https://dev.to/mugultum/building-a-data-science-career-as-a-beginner-how-can-you-do-it-eb6</guid>
      <description>&lt;p&gt;The data landscape has changed over the years, increasing opportunities for people seeking data-related jobs in companies. Among the careers that one can work in is the data science field which uses algorithms to generate insights and help companies make better use of their data. &lt;/p&gt;

&lt;p&gt;Even though the data science field is constantly changing, there are a few constant aspects that you must be aware of to succeed in the field. In this article, we will look into some insights on how you can build a successful data science career with a focus on education, skill development and job searching approaches. &lt;/p&gt;

&lt;h3&gt;
  
  
  Educational requirements
&lt;/h3&gt;

&lt;p&gt;The debate on whether one needs a degree to be a data scientist or not has been ongoing for some time now. While education has always been touted as the best way to enter the data science field, many professionals became data scientists without undertaking a relevant degree like data science or computer science. Typically, data scientists are expected to have a bachelor’s degree in data science, mathematics, computer science or statistics among other related fields. Some employers might prefer people with master's or doctoral degrees in data science depending on the nature of the job. It is generally expected that with the relevant education, one can perform their duties optimally. &lt;/p&gt;

&lt;p&gt;Data science as a field borrows heavily from computer science, statistics and mathematics. This means that a solid understanding of these three areas is key to deriving insights from the data, developing well-functioning models and analyzing data. Some courses that can greatly improve your performance as a data scientist are linear algebra, probability, statistics and calculus. &lt;/p&gt;

&lt;p&gt;Apart from the formal education approach, which entails getting a relevant bachelor’s or graduate degree, one can also transition to data science through boot camps and online courses. Data science boot camps are very intensive programs that prepare you for the data science field within a few months. These boot camps, and online courses on platforms like Coursera, teach you the skills that you will need to succeed as a data scientist. With many boot camps and online courses emerging, you have to choose carefully so that your preferred course aligns with your career goals. &lt;/p&gt;

&lt;h3&gt;
  
  
  The skills to master
&lt;/h3&gt;

&lt;p&gt;A data scientist must possess both hard and soft skills to excel in their job. The hard skills that one must develop to become a better data scientist include mastering Python, R, SQL, statistics, data visualization, deep learning, machine learning, cloud computing, natural language processing and big data. &lt;/p&gt;

&lt;p&gt;Soft skills are human skills that allow a person to work properly with their colleagues and clients, and they are not job-specific. Soft skills that one must possess include communication, critical thinking, problem-solving, storytelling and teamwork. &lt;/p&gt;

&lt;p&gt;As an entry-level professional, mastering hard and soft skills may not be enough to get your first job. You will need a strong portfolio that shows your mastery of the hard and soft skills that you possess. As a data scientist, a good portfolio should focus on your abilities in handling real-world data problems, starting with acquiring the data, cleaning, analysis, model building and deployment of the model. A platform like GitHub is a good place to showcase your portfolio as you build your online presence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Navigating the job market for data science roles
&lt;/h3&gt;

&lt;p&gt;Some of the roles within the data science field are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A data scientist focuses solely on building predictive models and deriving insights from data. &lt;/li&gt;
&lt;li&gt;A data engineer is responsible for developing and maintaining the infrastructure for generating, storing and retrieving data.&lt;/li&gt;
&lt;li&gt;A machine learning engineer is responsible for designing, implementing, and deploying machine learning models. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Securing a data science position involves using online job boards, networking and applying directly. Sites like LinkedIn, Glassdoor, Fuzu or Brighter Monday can serve as places for finding relevant job opportunities. Networking can also produce valuable job leads that can translate to jobs. After finding the relevant job opportunities, the next stage is to craft a compelling cover letter and resume that capture your technical skills and experience and align them with the job requirements. &lt;/p&gt;

&lt;h3&gt;
  
  
  Career growth as a data scientist
&lt;/h3&gt;

&lt;p&gt;As a practicing data scientist, there are various activities that you can engage in to further grow as a data scientist. Some of these activities include networking, seeking new mentors, joining a data science community and keeping updated on new developments in the data science field. &lt;/p&gt;

&lt;p&gt;Networking is key for a data scientist’s career growth. Avenues for networking include attending industry events, engaging with data science communities on platforms like X or LinkedIn and joining professional groups. Networking with peers and mentors offers professional opportunities for career growth. &lt;/p&gt;

&lt;p&gt;Because data science is a rapidly changing field, keeping abreast of new developments through continuing education or reading research papers will keep you informed of new techniques, tools and best practices, and keep you competitive in your field. &lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;A career in the data science field requires a good educational foundation, mastery of specific soft and technical skills, and continuous professional development to remain competitive in this dynamic field. This article has looked at some of the important steps that aspiring data scientists can take to position themselves for success. As stated in this article, the data science field is dynamic, and therefore you should do more research and reading to understand it more deeply and keep pace with its changes. &lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/blog/how-to-become-a-data-scientist" rel="noopener noreferrer"&gt;https://www.datacamp.com/blog/how-to-become-a-data-scientist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.coursera.org/articles/data-science-bootcamp" rel="noopener noreferrer"&gt;https://www.coursera.org/articles/data-science-bootcamp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graduate.northeastern.edu/resources/data-science-careers-shaping-our-future/" rel="noopener noreferrer"&gt;https://graduate.northeastern.edu/resources/data-science-careers-shaping-our-future/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>newbie</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building your first machine learning model in Python</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Tue, 28 May 2024 12:34:22 +0000</pubDate>
      <link>https://dev.to/mugultum/building-your-first-machine-learning-model-in-python-3pn2</link>
      <guid>https://dev.to/mugultum/building-your-first-machine-learning-model-in-python-3pn2</guid>
      <description>&lt;p&gt;Machine learning is the use of algorithms that can learn from data over time and therefore can detect and learn patterns from the data. Machine learning models are divided into Supervised, Unsupervised, and Reinforcement learning. The commonly used machine learning algorithms fall under Supervised learning and the linear regression model is usually the first model you will encounter in this category. &lt;/p&gt;

&lt;p&gt;Under linear regression models, we have simple linear and multiple linear models. A simple linear model involves one independent and one dependent variable. On the other hand, multiple linear models have one dependent variable and two or more independent variables. In this article, I will take you through the process of creating your first multiple linear model for predicting the tips that customers give waiters in restaurants. &lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;Before we start, there are some prerequisites that you should have. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Python&lt;/li&gt;
&lt;li&gt;Some familiarity with statistics&lt;/li&gt;
&lt;li&gt;Python libraries including pandas, numpy, matplotlib, seaborn and scikit-learn&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Linear regression
&lt;/h3&gt;

&lt;p&gt;Linear regression is one of the simplest but most commonly used algorithms, especially when the focus is to determine how variables are related. A linear regression model aims to find the best-fit line that minimizes the sum of squared differences between actual and predicted values.&lt;/p&gt;

&lt;p&gt;There are many uses of linear regression models. Some of the uses are market analysis, sports analysis, and financial analysis among other uses. &lt;/p&gt;
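&lt;p&gt;To make the idea of minimizing squared differences concrete, here is a small hypothetical sketch (not part of the article's notebook) using numpy's polyfit, which fits a line by least squares:&lt;/p&gt;

```python
import numpy as np

# Hypothetical illustration: least squares finds the line that minimizes the
# sum of squared differences between actual and predicted values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.0, -0.1])  # y is roughly 2x + 1 plus noise

slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # close to the true values 2 and 1
```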

&lt;h4&gt;
  
  
  Loading and understanding the dataset
&lt;/h4&gt;

&lt;p&gt;We will use the tips dataset embedded in the Seaborn library. The tips dataset contains simulated data on tips that waiters receive in restaurants in addition to other attributes. &lt;br&gt;
For this demonstration, this is the complete &lt;a href="https://colab.research.google.com/drive/11pcipYeLSxhjtZV7MRq6xJP-9HeqiH7d?usp=sharing"&gt;Google Colab&lt;/a&gt; that I used. We start by loading the necessary libraries and loading the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After loading the libraries, we first check for the datasets in the Seaborn library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(sns.get_dataset_names())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After looking at the various datasets and opting for the dataset of choice, we can now load the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tips = sns.load_dataset('tips')
tips.head(5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl9ods0cij3waflz1jjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl9ods0cij3waflz1jjh.png" alt="dataset head" width="598" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table above shows that there are 7 variables in the dataset. The numerical columns in the dataset are total_bill, tip and size while the categorical columns are sex, smoker, day and time.&lt;/p&gt;

&lt;p&gt;For basic statistics, we can use the describe() method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tips.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dv8m5rl0fzzqmv9ww7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dv8m5rl0fzzqmv9ww7l.png" alt="basic statistics" width="577" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The describe() method gives the summary statistics of the numerical variables only. From the output, we can see the mean, standard deviation, minimum, maximum, and percentiles of the variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data visualizations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Distribution of sex variable
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.countplot(x ='sex', data = tips)
plt.title('Distribution of Sex variable')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdm2rdsua36sc2095hfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdm2rdsua36sc2095hfq.png" alt="sex distribution" width="571" height="455"&gt;&lt;/a&gt;&lt;br&gt;
We can see from the plot above that male customers made up a large share of the customers represented in the dataset. &lt;/p&gt;

&lt;h4&gt;
  
  
  Total bill variable
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(x ='total_bill', data = tips)
plt.title('Histogram of the Total bill variable')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrk2mluakj2uihcynazc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrk2mluakj2uihcynazc.png" alt="total bill histogram" width="562" height="455"&gt;&lt;/a&gt;&lt;br&gt;
The histogram above shows the distribution of the total_bill variable. We can see that the majority of the bills fall between $10 and $20.&lt;/p&gt;
&lt;h4&gt;
  
  
  Scatterplot
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Scatter plot of total bill and tip variables')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bvq1gm69m3bnkbl8dio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bvq1gm69m3bnkbl8dio.png" alt="scatterplot" width="562" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Correlation plot
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_cols = tips.select_dtypes(include='number')
corr_matrix = num_cols.corr()
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwn5pxyhh33kttwow3y03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwn5pxyhh33kttwow3y03.png" alt="Correlation plot" width="515" height="435"&gt;&lt;/a&gt;&lt;br&gt;
The scatterplot above shows that the tip and total_bill have a strong linear relationship. The correlation plot shows that total_bill and tip have a correlation of 0.68, indicating a strong positive correlation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Model building
&lt;/h3&gt;

&lt;p&gt;Before building the model, the data has to be transformed into a format that is compatible with the machine learning algorithm. Machine learning algorithms work with numerical data, and that necessitates converting categorical values to numerical ones. There are various approaches for this, like Label Encoding and OneHotEncoding. For this project, we will use OneHotEncoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tips = pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time'], dtype=int)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;a href="https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/"&gt;OneHotEncoding&lt;/a&gt; creates new variables for each of the categorical values. For example, we had a variable named sex which has Male and Female as the values. After using the get_dummies() function, which encodes the data using OneHotEncoding, we have two new variables from the sex variable named sex_Male and sex_Female. Note that we started our data analysis with 7 variables and now, after applying OneHotEncoding, we have 13 variables.&lt;/p&gt;

&lt;p&gt;After encoding the data, we now have to scale the data to fall within the same range. For example, values in the total_bill column vary between 3 and 50, while for the majority of the remaining columns, the values are between 0 and 1. Scaling puts all features on a comparable range so that no single feature dominates the model simply because of its magnitude. For this, we are using the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html"&gt;MinMaxScaler&lt;/a&gt; class of the scikit-learn library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslpttozpcuo9f5ppgdfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslpttozpcuo9f5ppgdfj.png" alt="Un scaled data" width="598" height="218"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import MinMaxScaler
# Instantiate the scaler
MM = MinMaxScaler()
col_to_scale = ['total_bill']
# Fitting and transforming the scaler 
scaled_data = MM.fit_transform(tips[col_to_scale])
# Convert the scaled data into a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=col_to_scale)
# Dropping the original columns to avoid duplication
tips_df = tips.drop(columns=col_to_scale).join(scaled_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After scaling the total_bill column, we have the results below. You can see that the values in the total_bill column now range between 0 and 1 like the rest of the variables.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz76bdzxsrxx5v77pb00g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz76bdzxsrxx5v77pb00g.png" alt="Scaled data" width="574" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we split the data into train and test sets. We will use the training data to train the model and test data to test the performance of our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X= tips_df.drop(columns='tip', axis=1)
y=tips_df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=42)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The uppercase X represents the independent variables (features) that will be fed to our model, and the lowercase y represents the target variable. &lt;/p&gt;

&lt;p&gt;After splitting the data, we now proceed to instantiate the model and fit it to the training data as shown by the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
LR = LinearRegression()
LR.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model evaluation
&lt;/h3&gt;

&lt;p&gt;After fitting the model to the training data, we now proceed to test the model with our unseen data. Evaluating the model is important as it tells us whether our model performance is good or bad. For regression models, the &lt;a href="https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/"&gt;evaluation metrics&lt;/a&gt; include the mean absolute error, mean squared error, root mean squared error and R-squared, among others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred = LR.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
r2 = r2_score(y_test, y_pred)
print("R-Squared (R2) Score:", r2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is:&lt;br&gt;
Mean Squared Error: 0.7033566017436106&lt;br&gt;
R-Squared (R2) Score: 0.43730181943482493&lt;/p&gt;

&lt;p&gt;The mean squared error is relatively high, which means our predictions deviate noticeably from the actual tips, while the R-squared value of about 0.44 means the model explains less than half of the variance in the tip amount. Ideally, the mean squared error should be low and the R-squared value high. &lt;br&gt;
On visualizing the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(8,8))
plt.scatter(y_test, y_pred)
#adding labels to the plot
plt.xlabel("The Actual Tip Amount")
plt.ylabel("The Predicted Tip Amount")
plt.title("Plot of Actual versus Predicted Tip Amount")
plt.plot([0, max(y_test)], [0, max(y_test)], color='green', linestyle='--')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziigo21o7cnre76mhtly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziigo21o7cnre76mhtly.png" alt="Model performance" width="678" height="701"&gt;&lt;/a&gt;&lt;br&gt;
From the plot, we can see that there are many values below the diagonal line. This means that in many cases, the predicted tip amount tends to be lower than the actual tip amount. &lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we successfully built our first machine-learning model to predict the tips that customers pay. This regression model has provided us with a starting point to understand the relationship between several independent features and the tip amount. We also saw in the model evaluation that our model did not perform well in predicting the tip amount.&lt;/p&gt;

&lt;p&gt;The performance of our model highlights an important aspect of data science and machine learning which is improving models iteratively. To further improve our model, we may have to use &lt;a href="https://www.geeksforgeeks.org/what-is-feature-engineering/"&gt;feature engineering&lt;/a&gt;, perform &lt;a href="https://www.analyticsvidhya.com/blog/2022/02/a-comprehensive-guide-on-hyperparameter-tuning-and-its-techniques/"&gt;hyperparameter tuning&lt;/a&gt;, or do data quality checks. As you embark on this machine-learning journey, remember that your model may need several improvements before it achieves the desired performance. &lt;/p&gt;
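&lt;p&gt;As a hypothetical sketch of one such improvement (not part of the article's notebook), hyperparameter tuning can be as simple as a cross-validated grid search over the regularization strength of a Ridge regression, here shown on synthetic data:&lt;/p&gt;

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical sketch of hyperparameter tuning: search for the best
# regularization strength (alpha) of a Ridge regression via cross-validation.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
print("cross-validated R-squared:", round(search.best_score_, 3))
```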

&lt;h3&gt;
  
  
  Additional readings
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/regression-metrics/"&gt;https://www.geeksforgeeks.org/regression-metrics/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/feature-encoding-techniques-machine-learning/"&gt;https://www.geeksforgeeks.org/feature-encoding-techniques-machine-learning/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-science-simplified-simple-linear-regression-models-3a97811a6a3d"&gt;https://towardsdatascience.com/data-science-simplified-simple-linear-regression-models-3a97811a6a3d&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>newbie</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Guide to exploring data in Python</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Tue, 14 May 2024 07:00:43 +0000</pubDate>
      <link>https://dev.to/mugultum/guide-to-exploring-data-in-python-3l3</link>
      <guid>https://dev.to/mugultum/guide-to-exploring-data-in-python-3l3</guid>
      <description>&lt;p&gt;Data professionals rely on Exploratory Data Analysis (EDA) to understand the data and how variables within the data are related. There are various tools used when performing EDA but the key of them all is visualization. Through visualizations, we can easily see how the data looks and we can make assumptions that will guide how we will analyze the data.&lt;/p&gt;

&lt;p&gt;We will use &lt;a href="https://colab.research.google.com/drive/15Va8tUmaJJfkv3KSLc4ku_F-87AdApdL?usp=sharing"&gt;Google Colab&lt;/a&gt; for this demonstration to show that you do not need to download Python software locally to uncover insights in your data. Google Colab is a powerful platform that allows you to write and execute your Python code in your browser and hence convenient for your data analysis needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core EDA libraries in Python
&lt;/h2&gt;

&lt;p&gt;Python has numerous libraries tailored for manipulating and analyzing data. Below are some of the libraries that you will need for your EDA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; - This library helps in loading and cleaning the data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numpy&lt;/strong&gt; - This library helps when performing numerical computations in Python. NumPy works alongside Pandas and is well suited to manipulating large datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matplotlib&lt;/strong&gt; and &lt;strong&gt;Seaborn&lt;/strong&gt; are for visualizing the data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Loading the libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading the data
&lt;/h3&gt;

&lt;p&gt;The data for this demonstration was sourced from Milwaukee City datasets for 2023.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df = pd.read_csv('/content/armslengthsales_2023_valid.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Overview of the data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Head&lt;/strong&gt; - This method shows the first rows of the data (five by default)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exf4106uo0fksrsl0cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exf4106uo0fksrsl0cp.png" alt="Snapshot of the data" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a partial snapshot of the data (the dataset has too many rows and columns to capture in a single view). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shape&lt;/strong&gt; - This shows the number of rows and columns in the data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(5831, 20)&lt;br&gt;
The output (5831, 20) shows that there are 5831 rows and 20 columns &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data types&lt;/strong&gt; - This shows the data types of the variables in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlxaeo9wx5q4kl4lq61n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlxaeo9wx5q4kl4lq61n.png" alt="Data Types in the data" width="446" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the output above, we can see that our dataset contains the data types int64, float64, and object. In total, there are 6 categorical (object) variables and 14 numerical (int64 and float64) variables.&lt;/p&gt;
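&lt;p&gt;If you only need a quick tally of how many columns hold each data type, pandas can count the dtypes directly. Below is a small sketch on a toy frame; the column names are illustrative, not the full Milwaukee dataset:&lt;/p&gt;

```python
import pandas as pd

# Toy frame with the same three dtypes seen above
df = pd.DataFrame({
    "Style": ["Ranch", "Cape Cod"],   # object (categorical)
    "Rooms": [6.0, 8.0],              # float64
    "Year_Built": [1950, 1925],       # int64
})

# Count how many columns have each dtype
counts = df.dtypes.value_counts()
print(counts)
```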

&lt;h3&gt;
  
  
  Missing values
&lt;/h3&gt;

&lt;p&gt;Another important part of EDA is checking for missing values. Missing values are unknown, unspecified, or unrecorded values in the dataset. In Pandas, missing values are usually represented as NaN. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxz6wjfulp3o7t1hkeyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxz6wjfulp3o7t1hkeyx.png" alt="Missing Values" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the table above, we can see that the columns CondoProject, Rooms, and Bdrms have missing values represented by NaN values. &lt;/p&gt;

&lt;p&gt;A quick way to see the null values in your dataset is the .info() method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz605qcbh5te6nzu7z063.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz605qcbh5te6nzu7z063.png" alt="Null Values" width="419" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from the above that while the dataset has 5831 rows in total, some columns have fewer than 5831 non-null entries. The variables with missing values include CondoProject, Style, Extwall, Stories, Year_Built, Rooms, FinishedSqft, and Bdrms. Let’s check the number of missing values in each column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.isna().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3edyjtnfarpihvd17kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3edyjtnfarpihvd17kt.png" alt="missing Values in columns" width="412" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a variable has a large proportion of missing values, we can drop the affected column entirely. For variables with only a few missing values, we can drop the affected rows or replace the gaps with estimates such as the mean or median.&lt;/p&gt;
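&lt;p&gt;These strategies can be sketched in a few lines of pandas. The toy frame below is illustrative rather than the actual property data: a mostly-empty column is dropped, and the remaining gaps are filled with the column median:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the property data (values are invented)
df = pd.DataFrame({
    "CondoProject": [np.nan, "A", np.nan, np.nan, np.nan],  # mostly missing
    "Rooms": [6, np.nan, 8, 5, 7],                          # a few missing
    "Sale_price": [150000, 200000, 175000, 160000, 180000],
})

# Drop any column where more than half the values are missing
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Replace the remaining gaps with an estimate (here, the column median)
df["Rooms"] = df["Rooms"].fillna(df["Rooms"].median())
print(df)
```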

&lt;h3&gt;
  
  
  Dealing with the column with the most missing values
&lt;/h3&gt;

&lt;p&gt;From the previous tables, we saw that the &lt;em&gt;CondoProject&lt;/em&gt; variable has more than 80% missing values. Since so little of it is usable, the most practical option is to drop the column entirely, as done below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.drop(columns='CondoProject', axis=1, inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the remaining variables, which have only a small proportion of missing values, we can simply drop the affected rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df = property_df.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After dropping the CondoProject column and the rows with null values, we can see that the total rows have dropped to 4690 from the initial 5831 and that we have 19 columns instead of the initial 20. &lt;/p&gt;

&lt;p&gt;The outcome is: &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7daf1oeta0k0pwuiddjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7daf1oeta0k0pwuiddjf.png" alt="Dropped missing values" width="473" height="437"&gt;&lt;/a&gt;&lt;br&gt;
Now that we have clean data, we can proceed to data visualization. &lt;/p&gt;
&lt;h3&gt;
  
  
  Summary statistics
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;property_df.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwndu8qd9w1aalkmixx8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwndu8qd9w1aalkmixx8t.png" alt="Summary statistics" width="722" height="398"&gt;&lt;/a&gt;&lt;br&gt;
The summary statistics show the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum values for each numerical variable in the dataset. &lt;/p&gt;
&lt;h3&gt;
  
  
  Univariate variables
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Histogram of Year Built variable&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(property_df['Year_Built'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev8oqajf7xjo7g2ovpvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev8oqajf7xjo7g2ovpvc.png" alt="Year Built distribution" width="622" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from the histogram that most of the houses were built in the 1950s and 1920s. Since the 1980s, the number of houses built in Milwaukee has been declining. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Distribution of the Rooms variable&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(property_df['Rooms'])
plt.title("Distribution of Rooms Variable")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexzycjaqzygp3w8l9v7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexzycjaqzygp3w8l9v7h.png" alt="Rooms variable distribution" width="580" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the histogram of the rooms variable, we can conclude that many properties have between 5 and 12 rooms. Only a few properties have more than 20 rooms. &lt;br&gt;
&lt;strong&gt;3. Distribution of stories variable&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(property_df['Stories'])
plt.title("Distribution of Stories variable")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjvb5usnn368gbhcbey7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjvb5usnn368gbhcbey7.png" alt="Stories variable distribution" width="580" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of the properties are between 1 and 2 stories tall. There are a few properties that have between 2.5 and 4 stories. &lt;br&gt;
&lt;strong&gt;4. Distribution of Sales Price variable&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(property_df['Sale_price'])
plt.title("Distribution of Sales Price")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvgg2222rvvu6f4h9gu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvgg2222rvvu6f4h9gu1.png" alt="Sales price distribution" width="571" height="455"&gt;&lt;/a&gt;&lt;br&gt;
From the plot above, we can see that most of the properties are concentrated between roughly $15,000 and $500,000. A few properties cost more than $1,000,000. &lt;br&gt;
&lt;strong&gt;5. Property style distribution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;style_count = property_df['Style'].value_counts()
order = style_count.index
plt.figure(figsize=(12, 6))
sns.barplot(x=style_count.index, y=style_count.values, order=order, palette='viridis')
plt.ylabel('Frequency')
plt.title('Frequency of property styles')
plt.xticks(rotation=45)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74pv2lgnin3w9fx2n0l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74pv2lgnin3w9fx2n0l9.png" alt="Property styles" width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The frequency plot above shows that the most common property styles in Milwaukee are Ranch and Cape Cod while the least popular property styles are Office and Store buildings. &lt;br&gt;
&lt;strong&gt;6. Property type distribution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(property_df['PropType'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyie69k1oeq84opxv4nzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyie69k1oeq84opxv4nzw.png" alt="Property Type" width="580" height="432"&gt;&lt;/a&gt;&lt;br&gt;
We can see that the most common property type is residential property. &lt;/p&gt;
&lt;h3&gt;
  
  
  Bivariate variables
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scatterplot for finished square feet and sales price&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(8, 6))
sns.scatterplot(x='FinishedSqft', y='Sale_price', data=property_df)
plt.title('Relationship between Finished Sqft and Sale Price')
plt.xlabel('Finished Sqft')
plt.ylabel('Sale Price')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsjligjag1v8h8fk23oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsjligjag1v8h8fk23oh.png" alt="Sales price and finished sqft" width="700" height="547"&gt;&lt;/a&gt;&lt;br&gt;
Sales Price and Finished Sqft have a positive linear relationship: as Finished Sqft increases, the Sale Price tends to increase as well. &lt;/p&gt;
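&lt;p&gt;The strength of a linear relationship like this can be quantified with the Pearson correlation coefficient, which pandas computes via .corr(). A small sketch with made-up numbers (not the real Milwaukee sample):&lt;/p&gt;

```python
import pandas as pd

# Made-up mini-sample where price rises roughly linearly with size
df = pd.DataFrame({
    "FinishedSqft": [800, 1200, 1500, 2100, 3000],
    "Sale_price": [90000, 140000, 170000, 260000, 380000],
})

# Pearson correlation: a value near 1 indicates a strong positive linear relationship
r = df["FinishedSqft"].corr(df["Sale_price"])
print(round(r, 3))
```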
&lt;h3&gt;
  
  
  Multivariate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Correlation plot&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#computing correlation matrix
corr_matrix = property_df.select_dtypes(include='number').drop(columns=['PropertyID', 'taxkey', 'District']).corr()

# Plotting the heatmap of correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tgtlq7gogte5iatjezw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tgtlq7gogte5iatjezw.png" alt="Correlation plot" width="800" height="730"&gt;&lt;/a&gt;&lt;br&gt;
The correlation matrix above shows us how the variables are related. For instance, we can see that Rooms and Bedrooms variables are highly correlated with a correlation of 0.86. &lt;/p&gt;
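&lt;p&gt;Reading a large heatmap by eye gets tedious, so a common follow-up is to list only the strongly correlated pairs programmatically. Below is a small sketch on a toy frame; the column names mimic the article's but the values are invented:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy numeric frame where Rooms and Bdrms move together
df = pd.DataFrame({
    "Rooms": [5, 6, 8, 10, 12],
    "Bdrms": [2, 3, 4, 5, 6],
    "Stories": [1, 2, 1, 2, 1],
})
corr = df.corr()

# Keep each pair once (strict upper triangle), then filter by strength
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.8]
print(strong)
```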

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This comprehensive guide has shown you the fundamental steps that we use when exploring data. Throughout this tutorial, you have learned to load and have an overview of your dataset, handle missing values, perform both univariate and bivariate analysis, and finally examine multivariate relationships using correlation analysis. &lt;/p&gt;

&lt;p&gt;From this analysis, we have gained valuable insights about houses in Milwaukee, such as their price ranges, property sizes, and architectural styles. These insights highlight the power of EDA and why every data practitioner should be proficient at it. This guide has given you a solid foundation to keep exploring and visualizing your data. Continue exploring to unlock more insights! &lt;/p&gt;

&lt;h3&gt;
  
  
  Additional Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/"&gt;https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/"&gt;https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=Liv6eeb1VfE"&gt;https://www.youtube.com/watch?v=Liv6eeb1VfE&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>eventdriven</category>
      <category>datascience</category>
      <category>panda</category>
      <category>python</category>
    </item>
    <item>
      <title>Mastering Data Exploration with Tidyverse: A Beginner-Friendly Guide</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Sun, 05 May 2024 13:17:28 +0000</pubDate>
      <link>https://dev.to/mugultum/exploratory-data-analysis-with-tidyverse-3n1i</link>
      <guid>https://dev.to/mugultum/exploratory-data-analysis-with-tidyverse-3n1i</guid>
      <description>&lt;p&gt;Imagine you have been saving data on your shopping in an Excel file on your computer for a year. In December, you may want to see how you have been using your finances for the entire year. Well, that is what Exploratory data analysis helps you with. It allows you to break down complex data and have a better understanding of what it means. &lt;/p&gt;

&lt;p&gt;As a data analyst, curiosity will greatly help you understand your data and how the various features in your dataset are related. Usually, there are no rigid rules on what you can or cannot do when performing exploratory data analysis. This article will introduce you to exploratory data analysis using Tidyverse, a collection of R packages that makes working with data much easier. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why EDA
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;To identify errors. Obvious errors and patterns can be easily discerned using EDA. &lt;/li&gt;
&lt;li&gt;To detect patterns within the data, anomalies, and interesting relationships between variables. For a data scientist, EDA helps ensure that the model produces valid output.&lt;/li&gt;
&lt;li&gt;To answer questions about the data, such as confidence intervals, standard deviations, and the distribution of categorical variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  EDA steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic knowledge of the R programming language &lt;/li&gt;
&lt;li&gt;R and RStudio installed on your computer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Getting started&lt;/strong&gt;&lt;br&gt;
To use tidyverse, you first need to install it in RStudio and then load the library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"tidyverse"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#installing the library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#loading the library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you have loaded your library, you are free to continue with the next part. &lt;/p&gt;

&lt;h3&gt;
  
  
  Importing Data
&lt;/h3&gt;

&lt;p&gt;Before you perform exploratory data analysis, you need data. You can import data from a database or your computer. As a beginner, you can start with the inbuilt R datasets. &lt;br&gt;
However, when you are importing data from your computer into RStudio or the console, you will use the ‘read.csv’ function. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;Expenses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read.csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yearly_expenses.csv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, yearly_expenses.csv is the file that you have on your computer. The code reads the file and stores it in the variable Expenses. With Expenses holding the data frame, you can now proceed with the next steps of your data analysis. &lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Explorations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Summary statistics&lt;/strong&gt; &lt;br&gt;
This is usually among the first steps you will undertake in your EDA process. Summary statistics are useful when you want to gain insights quickly, as they reveal the patterns and distributions in your data, and they usually serve as the foundation for further exploration. Some of the measures you will obtain from summary statistics are the median, mode, variance, mean, quartiles, range, percentiles, and standard deviation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#the tidyverse package for Exploratory data analysis &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#inbuilt R dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;airquality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#Assigning the inbuilt R dataset to the data frame named air&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#Getting summary of the air dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1d5jtiidhja6ubywlma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1d5jtiidhja6ubywlma.png" alt="Summary statistics" width="472" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows the summary statistics of the variables in the air dataset. We can see that there are 6 variables in the dataset (Ozone, Solar.R, Wind, Temp, Month, Day). Some of the summary statistics shown are the minimum value, the first and third quartiles, the median, the mean, and the maximum value for each variable. Another observation is that there are null values in the Ozone and Solar.R variables. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checking for missing values&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;When using data that you did not collect (secondary data), you may sometimes encounter cases of missing values in some rows. The &lt;a href="https://rstudio-connect.hu.nl/redamoi_test/lab5eda.html"&gt;missing values&lt;/a&gt; can be represented as NA or NULLS. If you analyze your data without dealing with these missing values, you risk generating incorrect insights from the data.&lt;br&gt;
If you are a data scientist creating a model, missing values can greatly impact the performance of your predictive model. As a result, dealing with these missing values is an important task that you have to perform and this can only happen after you have noted the values that are missing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;summarise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;across&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;everything&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we start with the data frame that we want to work on, which is the air data frame. The pipe operator %&amp;gt;% takes the data frame and passes it as the first argument to the next step. The summarise() function computes summary statistics for the dataset and collapses the many rows into a single row. The across(everything(), ~sum(is.na(.))) expression counts the missing values in each column of the air data frame. &lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjfxsv0f62q374ha7ra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjfxsv0f62q374ha7ra.png" alt="Missing Values" width="577" height="132"&gt;&lt;/a&gt;&lt;br&gt;
From the image above, we can see that the Ozone and Solar.R variables are the only variables with missing values and they have 37 and 7 missing values respectively. &lt;/p&gt;
&lt;h3&gt;
  
  
  Univariate analysis
&lt;/h3&gt;

&lt;p&gt;Univariate analysis is the analysis of one variable in the dataset at a time. This analysis is important as it helps uncover aspects such as the maximum, range, mode, median, and outliers of a variable. Univariate analysis applies to both numerical and categorical variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Histogram for Ozone&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;airquality&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#Histogram of Ozone variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Ozone&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"skyblue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Ozone level distribution"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft22xndba9bhy66kydu3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft22xndba9bhy66kydu3d.png" alt="Histogram of Ozone variable" width="581" height="398"&gt;&lt;/a&gt;&lt;br&gt;
The output above shows the distribution of the Ozone variable. Most of the values are concentrated between 0 and 50, while the maximum value of 168 lies far from the rest of the data. Just above the plot, ggplot also prints a warning that rows containing missing (NA) values were removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Solar.R&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Solar.R&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"skyblue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Solar. R boxplot"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot.title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;element_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hjust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above outputs a box plot of the Solar.R variable. A box plot is useful to a data analyst because it summarizes the distribution of a variable, showing its median, quartiles, and any outliers. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg9wpuxs1bolwx0wqgsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg9wpuxs1bolwx0wqgsf.png" alt="Solar.R box plot" width="600" height="370"&gt;&lt;/a&gt;&lt;br&gt;
From the box plot above, we can see that the Solar.R variable is left-skewed. There are also no outliers, since all the values fall within the whiskers of the plot.&lt;/p&gt;

&lt;p&gt;The tidyverse includes powerful tools like ggplot2, which we used for the univariate analysis above. The histogram and box plot created with ggplot2 helped us understand the distributions of the Ozone and Solar.R variables. As these plots show, visualizations reveal patterns and outliers that we might miss if we relied only on summary statistics.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bivariate analysis
&lt;/h3&gt;

&lt;p&gt;While univariate analysis focuses on one variable only, bivariate analysis examines the relationship between two variables. Bivariate analysis can take the form of analyzing variable pairs such as categorical-categorical, numerical-numerical, and numerical-categorical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scatter plot&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scatter plot for Ozone vs. Solar.R&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Solar.R&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Ozone&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"darkblue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ozone vs. Solar Radiation"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we use ggplot2, part of the tidyverse, to create a scatter plot. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg7lk6lhy0jad2n7cqtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg7lk6lhy0jad2n7cqtn.png" alt="Scatterplot of Ozone v Solar.R" width="572" height="401"&gt;&lt;/a&gt;&lt;br&gt;
In this scatter plot of Ozone vs. Solar.R, we can see a positive relationship between the two variables: high solar radiation is associated with high ozone levels. &lt;/p&gt;
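&lt;p&gt;To make a trend like this easier to judge, a fitted line can be layered onto the scatter plot with geom_smooth(); a minimal sketch (ggplot2 drops the rows with missing values and prints a warning):&lt;/p&gt;

```r
# Scatter plot of Ozone vs. Solar.R with a linear trend line
library(tidyverse)
air = airquality
ggplot(air, aes(x = Solar.R, y = Ozone)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Ozone vs. Solar Radiation with a linear trend")
```

The slope of the fitted line makes the direction of the relationship explicit rather than leaving it to visual judgment.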
&lt;h3&gt;
  
  
  Multivariate analysis
&lt;/h3&gt;

&lt;p&gt;Multivariate analysis is used for datasets with many variables. One commonly used multivariate technique is correlation analysis, which shows how the numerical variables in a dataset are correlated with one another. &lt;/p&gt;
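&lt;p&gt;The pairwise correlations themselves can be computed with base R's cor(); setting use = "complete.obs" drops the rows with missing values. A minimal sketch:&lt;/p&gt;

```r
# Pairwise correlation matrix of the numerical variables,
# ignoring rows with missing values
air = airquality
cor(air[, c("Ozone", "Solar.R", "Wind", "Temp")], use = "complete.obs")
```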

&lt;p&gt;&lt;strong&gt;Pairwise scatterplot matrix for numerical variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;air&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Ozone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Solar.R&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Wind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above produces a scatterplot matrix, a simple way to get a rough view of the linear relationships among multiple variables. In this code, we are looking at the relationships among the Ozone, Solar.R, Wind, and Temp variables. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne557pfjsyjls9mfiaxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne557pfjsyjls9mfiaxr.png" alt="pair plot 1" width="600" height="370"&gt;&lt;/a&gt;&lt;br&gt;
In the &lt;a href="https://www.geeksforgeeks.org/how-to-create-and-interpret-pairs-plots-in-r/"&gt;pair plot&lt;/a&gt; above, the variable names run along the diagonal from top left to bottom right. Each variable is plotted against every other variable, and the panels above the diagonal show the same variable pairs as those below it, with the axes swapped. &lt;/p&gt;

&lt;p&gt;When interpreting this plot, each panel shows the relationship between the variable in its row and the variable in its column. For instance, the scatter plot of Ozone against Solar.R shows what appears to be a linear trend, an indication that the two variables are positively correlated. In the plot below, the panels on both sides of the diagonal represent the same relationship. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky9u8e7gvkpjnuo56ul0.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky9u8e7gvkpjnuo56ul0.PNG" alt="pair plot of Ozone v solar.R" width="290" height="162"&gt;&lt;/a&gt;&lt;br&gt;
Another interesting scatter plot is that of Ozone and Wind. In the plot below, the scatter plot of Ozone against Wind has a negative slope, indicating a possible negative correlation.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlina4eowwgr81sdbwbi.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlina4eowwgr81sdbwbi.PNG" alt="Ozone v Wind pair plot" width="416" height="244"&gt;&lt;/a&gt;&lt;br&gt;
Even though some of the variables in the data are correlated, it is important to remember that correlation does not imply causation. &lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;As a data analyst, you can use the tidyverse to explore variable distributions and relationships, ultimately turning messy data into valuable insights. Now that you have an idea of how to use the tidyverse to extract insights from data, go out and explore your own data!&lt;/p&gt;

&lt;h4&gt;
  
  
  Additional resources
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://rstudio-connect.hu.nl/redamoi_test/lab5eda.html"&gt;https://rstudio-connect.hu.nl/redamoi_test/lab5eda.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://blog.datascienceheroes.com/exploratory-data-analysis-in-r-intro/"&gt;https://blog.datascienceheroes.com/exploratory-data-analysis-in-r-intro/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://bookdown.dongzhuoer.com/hadley/r4ds/exploratory-data-analysis"&gt;https://bookdown.dongzhuoer.com/hadley/r4ds/exploratory-data-analysis&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.statology.org/pairs-plots-r/"&gt;https://www.statology.org/pairs-plots-r/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://rpubs.com/jovial/r"&gt;https://rpubs.com/jovial/r&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>beginners</category>
      <category>newbie</category>
    </item>
    <item>
      <title>Choosing a visualization tool as a beginner: R or Python in 2024?</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Fri, 19 Apr 2024 18:15:23 +0000</pubDate>
      <link>https://dev.to/mugultum/choosing-a-visualization-tool-as-a-beginner-r-or-python-in-2024-i20</link>
      <guid>https://dev.to/mugultum/choosing-a-visualization-tool-as-a-beginner-r-or-python-in-2024-i20</guid>
      <description>&lt;p&gt;As a data analyst, scientist, or business analyst, data visualization is your friend when you want to see trends in your data. Without visualization, seeing patterns and trends in your data will be a big challenge. With data visualization, you can easily visualize your data and spot trends. Some of the visuals that you are likely to use in your data analysis work include charts, histograms, line plots, scatter plots, tree maps, heatmaps, and box plots among others. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why visualization?
&lt;/h3&gt;

&lt;p&gt;Data visualization is beneficial in many ways:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It offers stakeholders a clear visual of the data allowing them to understand data insights.&lt;/li&gt;
&lt;li&gt;It helps stakeholders understand the relationships between variables. Some of the relationships of interest include trends, correlations, and connections.&lt;/li&gt;
&lt;li&gt;Visualizations make it easier to spot inaccuracies in the data by offering visual representations. This helps data scientists prepare the data by ensuring there are no missing values or outliers before passing the data through machine learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many data visualization tools that you will encounter in your data analysis journey. Some of the tools include Tableau, Looker, Microsoft Excel, Power BI, Google Data Studio, and programming languages like Python and R. For this article, we will talk about the data visualization tools in R and Python. &lt;/p&gt;

&lt;h3&gt;
  
  
  Python visualization tools
&lt;/h3&gt;

&lt;p&gt;Some of the popular data visualization libraries in Python are Matplotlib, seaborn, and Plotly. &lt;/p&gt;

&lt;h4&gt;
  
  
  Matplotlib
&lt;/h4&gt;

&lt;p&gt;This is the most widely used visualization library in Python. &lt;a href="https://matplotlib.org/stable/tutorials/pyplot.html"&gt;Matplotlib&lt;/a&gt; was Python's first major visualization tool, and other Python visualization libraries are built on top of it. The library supports graphical representations like bar plots, scatter plots, histograms, area plots, pie charts, and line plots. &lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statsmodels.api&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Listing the available datasets in statsmodels&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sm.datasets.__all__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Accessing the state crime dataset belonging to stats models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;statecrime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm.datasets.statecrime.load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#assigning statecrime data to crime_df&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statecrime.data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#matplotlib plot for murder vs poverty &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'murder'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'poverty'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Scatter plot of murder against poverty'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Murders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'poverty'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;:&lt;br&gt;
The first step is to load the necessary libraries: statsmodels, which contains the dataset we will use, and Matplotlib for creating the visualization. &lt;br&gt;
The second step is to list the available datasets and choose one to use. For this visualization, we use the statecrime dataset: we load it and assign it to the crime_df data frame. &lt;br&gt;
Finally, we create the visualization, a scatter plot showing the relationship between the ‘murder’ and ‘poverty’ variables. We first set the figure size, then draw the scatter plot, then add the title, xlabel, and ylabel, and finish by displaying the plot with plt.show(). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plot&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgv021zayele0md4dkfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgv021zayele0md4dkfb.png" alt="scatterplot of poverty and murder rates" width="800" height="509"&gt;&lt;/a&gt;&lt;br&gt;
A look at the scatterplot above shows that there is a positive relationship between poverty and murder rates. &lt;/p&gt;
&lt;h4&gt;
  
  
  Seaborn
&lt;/h4&gt;

&lt;p&gt;Seaborn is a visualization library for generating statistical graphs. The library is built on top of Matplotlib and provides a higher-level interface for statistical plots. When you want to understand how the variables in a dataset are related, statistical visualization like this shows you the trends and patterns in the data. Among the visuals you can create with Seaborn are line, scatter, point, count, violin, KDE, bar, swarm, and box plots. Check this &lt;a href="https://www.geeksforgeeks.org/introduction-to-seaborn-python/"&gt;site&lt;/a&gt; for more visuals on the Seaborn library.&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statsmodels.api&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Listing the available datasets in statsmodels&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sm.datasets.__all__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Accessing the state crime dataset belonging to stats models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;statecrime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm.datasets.statecrime.load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#assigning statecrime data to crime_df&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statecrime.data&lt;/span&gt;&lt;span class="c1"&gt;#creating the plot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;sns.scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'single'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'violent'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Scatterplot of single people and violence'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Single'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Violence'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plt.show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code explanation&lt;/strong&gt;:&lt;br&gt;
The first step, as usual, is loading the libraries: statsmodels, Matplotlib, and Seaborn. We also load the dataset from statsmodels and assign it to the crime_df data frame. &lt;br&gt;
The next step is using Seaborn to create the scatter plot showing the relationship between single people and violence. The scatter plot is created with the sns.scatterplot() function, which takes the x and y variables and the dataset as parameters. Next, the labels are added: the title, xlabel, and ylabel.&lt;br&gt;
Finally, plt.show() is called to display the scatter plot. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The plot&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha5eknlsi97ufa76txqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha5eknlsi97ufa76txqz.png" alt="Scatter plot of violence and singleness" width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
The plot shows that violence and singleness are positively correlated.&lt;/p&gt;
&lt;h4&gt;
  
  
  Plotly
&lt;/h4&gt;

&lt;p&gt;Plotly is another visualization library, used for producing interactive plots. You can zoom in and out of these plots to get a clearer picture of the relationship between variables or the distribution of a variable. Benefits of using Plotly include the ability to inspect points (and spot outliers) with the hover tool, as well as extensive customization options that make plots easier to understand. Check out &lt;a href="https://plotly.com/python/basic-charts/"&gt;these&lt;/a&gt; interactive plotly visuals.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statsmodels.api&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plotly.express&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Listing the available datasets in statsmodels&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sm.datasets.__all__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Accessing the state crime dataset belonging to stats models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;statecrime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sm.datasets.statecrime.load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#assigning statecrime data to crime_df&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statecrime.data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#creating the plot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;px.scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crime_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'hs_grad'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'poverty'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'violent'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
               &lt;/span&gt;&lt;span class="n"&gt;hover_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;crime_df.index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
               &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Scatter Plot of high school graduation vs poverty'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fig.show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code explanation&lt;/strong&gt;:&lt;br&gt;
The first step is importing the relevant libraries and loading the dataset. &lt;br&gt;
The next step is creating the scatterplot. Here, Plotly Express (imported as ‘px’) creates an interactive scatter plot of high school graduation against poverty. The ‘px.scatter’ function takes parameters such as the dataset, the x-axis and y-axis columns, and the title. The ‘size’ parameter scales each marker according to the violent crime rate. &lt;br&gt;
The final line, ‘fig.show()’, displays the interactive plot. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The plot&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw6cfzt2bu08jy8hqvjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw6cfzt2bu08jy8hqvjo.png" alt="poverty and high school graduation visual" width="629" height="525"&gt;&lt;/a&gt;&lt;br&gt;
The plot above shows that poverty is negatively related to high school graduation: states with high poverty rates tend to have low graduation rates. &lt;/p&gt;
&lt;h3&gt;
  
  
  R Visualization tools
&lt;/h3&gt;

&lt;p&gt;The most commonly used R visualization libraries are ggplot2 and plotly. &lt;/p&gt;
&lt;h4&gt;
  
  
  Ggplot2
&lt;/h4&gt;

&lt;p&gt;Ggplot2 is built on the grammar of graphics. &lt;a href="https://r-graph-gallery.com/ggplot2-package.html"&gt;This&lt;/a&gt; library creates visualizations such as error charts, scatter plots, histograms, pie charts, and bar charts. With ggplot2, you can layer aesthetics onto your visualization as needed. &lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Arrests&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Assault&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Murder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrests&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"darkred"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Murder vs. Assaults per region"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Assaults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Murders"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nudge_x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;check_overlap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;:&lt;br&gt;
The first step is loading the library and the data. The library used here is ‘tidyverse’ and the dataset is the built-in R dataset ‘USArrests’, which contains data on arrests made in the US. &lt;br&gt;
The second step is creating the scatterplot. After calling ‘ggplot’, a function from the ‘tidyverse’ (ggplot2) library, we specify the dataset ‘arrests’, the x-axis variable ‘Assault’, and the y-axis variable ‘Murder’. The ‘label = rownames(arrests)’ aesthetic supplies a label for each point in the plot. At this point, the core of the scatter plot is defined; the remaining layers add aesthetics and make it visually appealing.&lt;br&gt;&lt;br&gt;
The third step is adding points and labels. ‘geom_point(color = "darkred")’ colors the points dark red, while ‘geom_text(nudge_x = 0.5, check_overlap = TRUE)’ adds the text labels, nudging them to the right so they do not overlap. &lt;br&gt;
The final layer, ‘labs()’, gives the scatterplot its title, x-label, and y-label. &lt;br&gt;
Below is the scatterplot:&lt;/p&gt;

&lt;p&gt;In this plot, Murders and Assaults are directly correlated. States near 0 in both the x and y axes have low murder and assault rates while those in the top right end have higher murder and assault rates. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90le5hnd0w024es6xqzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90le5hnd0w024es6xqzw.png" alt="Murder and Assaults plot" width="507" height="313"&gt;&lt;/a&gt;&lt;br&gt;
As explained above, each point is colored dark red and labeled with the name of its state. The beauty of using ggplot2 is that it allows extensive customization of the plot. &lt;/p&gt;
&lt;h4&gt;
  
  
  Plotly
&lt;/h4&gt;

&lt;p&gt;Plotly in R offers the same functionality as Plotly in Python: apart from producing a wide array of plot types, the library also produces &lt;a href="https://plotly.com/r/"&gt;interactive plots&lt;/a&gt;. &lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plotly&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;arrests&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot_ly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arrests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;Murder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;Assault&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"State:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrests&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;br&amp;gt;Murders:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Murder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;br&amp;gt;Assault:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Assault&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"markers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"darkblue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;opacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Crimes per state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Murder"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;yaxis&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Assault"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;hovermode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closest"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;&lt;br&gt;
The first step is loading the plotly library and the data. The plotly library allows for the plotting of interactive visualizations. &lt;br&gt;
The next step is creating the interactive scatterplot. The ‘plot_ly()’ function creates the plot; in it, we specify the dataset (‘arrests’) and the x-axis and y-axis variables, ‘Murder’ and ‘Assault’ respectively. The ‘text’ parameter within ‘plot_ly()’ sets the information displayed when one hovers over a point in the plot.&lt;br&gt;
The third step is customizing the plot's appearance. The ‘mode’ argument specifies that the scatterplot be displayed with markers, while ‘marker’ controls how the markers appear. &lt;br&gt;
The last step is setting the layout options, which customize the appearance of the plot. The title, x-axis, and y-axis labels supply the text to be displayed, while ‘hovermode = "closest"’ makes the hover box show the point nearest to the cursor. &lt;br&gt;
&lt;strong&gt;Plotly scatter plot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuash0svy653tk7cvap9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuash0svy653tk7cvap9.png" alt="Assaults vs Murder rates plot" width="533" height="329"&gt;&lt;/a&gt;&lt;br&gt;
The plot above shows that assaults and murder rates are positively correlated. If you hover the cursor in the plot after running the code above on your computer, you will see that states with small murder and assault rates are closer to 0 in both the x and y axes. If you progressively move towards the top-right of the plot, you will see that the murder and assault rates are increasing. &lt;/p&gt;

&lt;p&gt;Hovering over the points will not display any information here since this is a ‘.png’ image. However, if you run the code above, you will get an interactive visualization in your RStudio session. &lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the right visualization tool
&lt;/h3&gt;

&lt;p&gt;Picking the ideal tool for your visualization needs can be challenging since both R and Python have robust libraries that will support your visualization needs. Let us look at the strengths and weaknesses of each language for data visualization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;R was designed for statistical data analysis and, as such, it has libraries that create polished visualizations. In particular, the ggplot2 package is known for its aesthetics and extensive chart types, which makes R a strong choice for data visualization. &lt;/li&gt;
&lt;li&gt;Python’s versatility as a language extends to its powerful visualization libraries, Matplotlib and Seaborn. Even though these libraries can feel less intuitive than ggplot2, Python’s overall learning curve is relatively gentle for beginners. Also, Python’s extensive data science ecosystem makes it ideal for projects that go beyond visualization. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, the choice of the tool to use for your visualization project will depend on your needs. If you want to perform statistical data analysis and visualization, then R will be a good choice for you. However, if you are interested in data science applications and you want to create your visuals after a short learning period, then Python fits your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional resources for your visualization needs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-guide-on-ggplot2-in-r/"&gt;https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-guide-on-ggplot2-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html"&gt;https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/matplotlib-tutorial/"&gt;https://www.geeksforgeeks.org/matplotlib-tutorial/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/python-seaborn-tutorial/"&gt;https://www.geeksforgeeks.org/python-seaborn-tutorial/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://plotly.com/python/plotly-express/"&gt;https://plotly.com/python/plotly-express/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/interactive-charts-using-plotly-in-r/"&gt;https://www.geeksforgeeks.org/interactive-charts-using-plotly-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Top machine learning algorithms for a beginner</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Wed, 10 Apr 2024 08:58:23 +0000</pubDate>
      <link>https://dev.to/mugultum/top-machine-learning-algorithms-for-a-beginner-4ogp</link>
      <guid>https://dev.to/mugultum/top-machine-learning-algorithms-for-a-beginner-4ogp</guid>
      <description>&lt;p&gt;With a lot of data available today, machines are learning at a fast pace. These machines learn through the use of machine learning algorithms which are the building blocks for artificial intelligence. Today, you can analyze large volumes of data and make predictions on what will happen tomorrow, next week, next month, or even in a year.&lt;/p&gt;

&lt;p&gt;Platforms like e-commerce sites use machine learning to suggest what you can pair with what you have already put in your cart. Businesses are also using machine learning to optimize their marketing campaigns and to understand their customers. These examples show that a future filled with machine learning is inevitable, which is why you should know some of the commonly used machine learning algorithms. Whether you are an aspiring data scientist or simply curious, this beginner-friendly article will give you a good foundation for learning more about machine learning algorithms. &lt;/p&gt;

&lt;h2&gt;
  
  
  Models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Linear Regression
&lt;/h3&gt;

&lt;p&gt;This is probably the first machine learning algorithm that you will build in your data science/machine learning journey because of its simplicity. The algorithm is used to model the relationship between input and output variables. &lt;br&gt;
Linear regression is represented by the linear equation y = mx + c, where y is the dependent variable, m is the gradient (slope), x is the independent variable, and c is the intercept (where the line cuts the y-axis). With linear regression, the goal is finding the best-fit line that shows how variables x and y are related.&lt;/p&gt;
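&lt;p&gt;The slope m and intercept c can be recovered directly from data. A minimal sketch with NumPy, using made-up numbers that lie exactly on a line:&lt;/p&gt;

```python
import numpy as np

# made-up data that lies exactly on the line y = 2x + 1
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = 2.0 * years + 1.0

# fit a degree-1 polynomial: returns the slope m and intercept c
m, c = np.polyfit(years, salary, 1)
print(m, c)  # recovers slope 2.0 and intercept 1.0
```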

&lt;p&gt;Let us look at this example by &lt;a href="https://www.javatpoint.com/simple-linear-regression-in-machine-learning"&gt;Javatpoint&lt;/a&gt;, a simple linear regression task to determine the relationship between salary and years of experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3mwqznvo5bg31vll3er.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3mwqznvo5bg31vll3er.png" alt="Linear Regression" width="543" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot above shows salary against years of experience. The red line represents the regression line while the blue dots represent the observations. The observations (blue dots) lie close to the regression line, indicating that salary indeed increases with years of experience and that the linear relationship is strong. &lt;br&gt;
Linear regression is used for prediction tasks and works best with continuous data that have a linear relationship. As a beginner, linear regression will give you the necessary foundation to learn other machine learning algorithms. &lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Tree
&lt;/h3&gt;

&lt;p&gt;The Decision Tree algorithm is based on a binary tree. With this model, a decision is reached by following a tree-like structure of nodes, branches, and leaves. Each internal node represents a test on an input variable (x) while each leaf node holds an output value (y). When using this model, you traverse the tree starting at the root node and passing through the splits until you arrive at a leaf node, which is the model's output.&lt;/p&gt;

&lt;p&gt;This image by &lt;a href="https://www.ibm.com/topics/decision-trees"&gt;IBM&lt;/a&gt; gives a good illustration of how the decision tree algorithm works. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F814m3sfw5y0hv6i6lmub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F814m3sfw5y0hv6i6lmub.png" alt="Decision Tree" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows how a person can use a decision tree to decide whether or not to surf. From the decision tree, you can see that the surfer only goes out when the wind direction is offshore or when there is low to no wind. &lt;br&gt;
A decision tree classifies using a series of questions, which gives it a flowchart-like structure. A decision tree learns quickly, and it is often used in banking to classify loan applicants by their probability of defaulting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logistic Regression
&lt;/h3&gt;

&lt;p&gt;Logistic Regression, just like linear regression, finds coefficient values that weight the input variables. With logistic regression, however, the dependent variable is binary. For instance, you can use it to predict whether an event will occur or not. If I wanted to predict whether a customer will default on their loan, this algorithm would make perfect sense to use. As another example, an e-commerce company can use this model to predict whether a customer will make a purchase or abandon their cart based on their browsing behavior.&lt;/p&gt;

&lt;p&gt;This illustration by &lt;a href="https://www.analyticsvidhya.com/blog/2021/10/building-an-end-to-end-logistic-regression-model/"&gt;Analytics Vidhya&lt;/a&gt; gives a simplified approach to how logistic regression works. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mba1wbyzsrwzbfnpdf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mba1wbyzsrwzbfnpdf0.png" alt="Logistic Regression" width="794" height="362"&gt;&lt;/a&gt;&lt;br&gt;
In the illustration, the logistic regression model takes in various features and predicts whether the bird is happy or sad. As discussed earlier, logistic regression works only where the target feature is binary; in the image above, the two values are sad and happy. &lt;br&gt;
The key things to know about this model are that it is ideal for binary classification and that it is often used for tasks like filtering spam or predicting customer churn. &lt;/p&gt;

&lt;h3&gt;
  
  
  Naïve Bayes algorithm
&lt;/h3&gt;

&lt;p&gt;This machine learning algorithm is based on the Bayesian probability model. Naïve Bayes is a supervised machine learning algorithm used for classification problems. The algorithm takes its name from the assumption that no variable is related to any other, which is naïve given that variables are related in the real world. &lt;br&gt;
The Bayes equation: P(X|Y) = (P(Y|X) × P(X)) / P(Y)&lt;br&gt;
This model is efficient for classifying data with (approximately) independent features. Examples of areas where you can use this algorithm include email spam filtering, sentiment analysis, recommendation systems, and weather prediction. &lt;/p&gt;

&lt;h3&gt;
  
  
  K-means
&lt;/h3&gt;

&lt;p&gt;K-Means is an unsupervised machine learning model used for clustering problems. The model partitions the data into K closely related groups and outputs them as clusters. When starting with this algorithm, you choose the value of K and the initial centroids. The next step is assigning each data point to its nearest centroid and then recomputing the centroids. The process is repeated until the clusters become stable, that is, until the centroids no longer change. The clusters are usually differentiated with colors to reduce confusion.&lt;br&gt;
K-means is probably the first unsupervised machine learning algorithm that you will use. It is commonly applied to problems that require clustering, such as determining the shopping habits of customers, anomaly detection, or market segmentation. &lt;br&gt;
This walkthrough by &lt;a href="https://stanford.edu/~cpiech/cs221/handouts/kmeans.html"&gt;Stanford&lt;/a&gt; offers a good demonstration of how k-means clustering works. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b291qo1vcxrmjs48u2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b291qo1vcxrmjs48u2p.png" alt="K-means clustering" width="501" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step (a) represents the initial dataset. In (b), two initial cluster centroids were chosen, represented by the blue and red crosses. From (c) to (f), the cluster centroids were recalculated and the process repeated until the final clustering in (f) was reached. If you look closely, you will see that the red and blue clusters are distinct, an indication that the clustering process is complete. &lt;/p&gt;
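&lt;p&gt;The assign-then-recompute loop described above can be sketched in a few lines of plain Python for one-dimensional points; the data and starting centroids below are made up:&lt;/p&gt;

```python
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = {c: [] for c in centroids}
        for x in points:
            nearest = min(centroids, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

# two clearly separated groups; the centroids settle at the group means
print(kmeans_1d([1, 2, 3, 10, 11, 12], [0, 5]))  # [2.0, 11.0]
```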

&lt;h3&gt;
  
  
  Random forest
&lt;/h3&gt;

&lt;p&gt;The Random Forest algorithm is a supervised machine learning algorithm that addresses the shortcomings of decision trees. It combines several decision trees to create a better-performing model, an approach that reduces overfitting, a problem individual decision trees face. Some of the uses of this algorithm include image recognition and customer churn prediction. &lt;br&gt;
Note that this algorithm is used for both regression and classification problems, and Random Forests generally perform better than single decision tree models.&lt;br&gt;
This demonstration from &lt;a href="https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/#:~:text=Random%20Forest%20is%20a%20widely,both%20classification%20and%20regression%20problems."&gt;Analytics Vidhya&lt;/a&gt; gives a better view of how the Random Forest algorithm works. &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff71lwhllae7j83dw1iwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff71lwhllae7j83dw1iwy.png" alt="Random Forest" width="540" height="450"&gt;&lt;/a&gt;&lt;br&gt;
In the image above, the majority decision was reached by comparing the predictions of the various trees. Most of the trees classified the fruit as an apple while a few classified it as a banana. The class with the most votes, apple, became the final prediction. &lt;/p&gt;
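&lt;p&gt;The majority vote in the figure is simple to sketch: collect each tree's prediction and take the most common one. The votes below are made up to match the apple/banana example:&lt;/p&gt;

```python
from collections import Counter

# predictions from four hypothetical trees in the forest
tree_votes = ["apple", "apple", "banana", "apple"]

# the forest's classification is the majority vote
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # apple
```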

&lt;h3&gt;
  
  
  K-nearest neighbor algorithm (KNN)
&lt;/h3&gt;

&lt;p&gt;KNN is a supervised machine learning model that handles both regression and classification problems. The algorithm is based on the assumption that similar observations lie close to each other in the feature space. As a result, an unseen point can be classified from the labels of its k nearest existing points.&lt;br&gt;&lt;br&gt;
The image below by &lt;a href="https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761"&gt;Towards Data Science&lt;/a&gt; gives a good visual of how KNN works. &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndj165dt0j5h4g0le44p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndj165dt0j5h4g0le44p.png" alt="KNN" width="611" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A closer look at the image shows that similar data points are close to each other. The image captures the model’s assumption that similar observations are near each other. &lt;br&gt;
Some of the uses of the algorithm include loan approvals, credit scoring, and optical character recognition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support Vector Machines (SVM)
&lt;/h3&gt;

&lt;p&gt;SVM is an ML algorithm used to handle both regression and classification problems. When using SVM, the objective is to find an optimal hyperplane to separate data points in different classes. The optimal hyperplane has the largest margin. &lt;br&gt;
This &lt;a href="https://www.datacamp.com/tutorial/svm-classification-scikit-learn-python"&gt;Datacamp&lt;/a&gt; visual shows how the SVM hyperplane works. &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefm9r16r78p6ijsoahdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefm9r16r78p6ijsoahdx.png" alt="SVM" width="583" height="244"&gt;&lt;/a&gt;&lt;br&gt;
In the visual, 3 hyperplanes were initially selected to separate the two classes. However, two of them (the blue and orange) did not separate the classes effectively. The only hyperplane that separated the classes correctly was the black one, which therefore became the chosen hyperplane, as shown in the second image. &lt;br&gt;
SVM is used for various problems like image classification, handwriting identification, anomaly detection, face detection, spam detection, gene expression analysis, and text classification. &lt;/p&gt;
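To make the margin idea concrete, here is a minimal, illustrative example (assumed scikit-learn setup, again on the iris dataset) where `SVC` with a linear kernel searches for the maximum-margin hyperplane described above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load data and split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A linear kernel fits the separating hyperplane with the largest margin
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)

acc = svm.score(X_test, y_test)
print("Test accuracy:", acc)
```

The `C` parameter trades margin width against misclassified points; swapping in `kernel="rbf"` lets SVM handle classes that are not linearly separable.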

&lt;h3&gt;
  
  
  Apriori (frequent pattern mining)
&lt;/h3&gt;

&lt;p&gt;This unsupervised algorithm uses prior knowledge of frequent itemsets to generate associations between events. Apriori creates association rules by observing which items frequently occurred together in past transactions. An association rule like ‘if a person bought item B, then they will also buy item C’ is represented as B -&amp;gt; C. &lt;br&gt;
This visual by &lt;a href="https://intellipaat.com/blog/data-science-apriori-algorithm/"&gt;Intellipaat&lt;/a&gt; summarizes what the apriori algorithm in market basket analysis is all about.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qv004kdsgac4hx8al8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qv004kdsgac4hx8al8h.png" alt="Apriori" width="768" height="249"&gt;&lt;/a&gt;&lt;br&gt;
This is the logic in the image above: if the itemset {wine, chips, bread} is frequent, then {wine, bread} must also be frequent, because every transaction that contains {wine, chips, bread} automatically contains {wine, bread} as well. &lt;br&gt;
The Apriori algorithm is used for problems like Google autocomplete and market basket analysis. &lt;/p&gt;
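The subset-frequency logic above can be sketched in plain Python. The toy transactions below are made up for illustration; the point is that the support of {wine, bread} can never be lower than the support of {wine, chips, bread}:

```python
# A toy market-basket dataset (hypothetical transactions)
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "bread"},
    {"wine", "chips", "bread"},
    {"chips", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# The Apriori (anti-monotonicity) property: a subset is never
# less frequent than its superset.
s_super = support({"wine", "chips", "bread"}, transactions)
s_sub = support({"wine", "bread"}, transactions)
print(s_super, s_sub)  # s_sub >= s_super always holds
```

Real implementations (for example the `apriori` function in the third-party mlxtend library) exploit this property to prune the search: once an itemset falls below the support threshold, none of its supersets need to be counted.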

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you dive into the data science world, you will encounter most of the machine learning algorithms listed above. No single algorithm works for every data problem; each is suited to specific kinds of problems. In this article, we have looked at the 10 most popular algorithms, spanning both supervised and unsupervised learning. &lt;/p&gt;

&lt;p&gt;Remember that while these algorithms are powerful, understanding their capabilities, strengths, and limitations is important to leverage their power critically. &lt;/p&gt;

&lt;p&gt;Through active experimentation with the various models, you will gain valuable technical skills and also get a better understanding of how to use and solve data science problems. With this knowledge, you can easily leverage the power of machine learning to solve any real-world problem. &lt;/p&gt;

</description>
      <category>beginners</category>
      <category>linear</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Popular data science libraries</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Sat, 06 Apr 2024 10:34:31 +0000</pubDate>
      <link>https://dev.to/mugultum/popular-data-science-libraries-gk1</link>
      <guid>https://dev.to/mugultum/popular-data-science-libraries-gk1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;With vast amounts of data generated daily, understanding what the data means can be challenging. The broader data analytics field helps you unlock the insights hidden in that data. As a data scientist or data analyst, libraries will be your greatest companions as they simplify data science processes. You will encounter several data science libraries in your data analytics journey, and below are some of the most common ones. Let’s have a look at the libraries you will need as a beginner!&lt;/p&gt;

&lt;h2&gt;
  
  
  Popular Python data science libraries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data wrangling libraries
&lt;/h3&gt;

&lt;p&gt;a) Pandas &lt;br&gt;
Pandas is your friend when it comes to manipulating data in table format. In data science, we call data in a table format a data frame. A data frame comprises rows and columns, and you can manipulate it with operations like merge, join, concatenate, and groupby. You can use Pandas to clean, explore, analyze, and manipulate data. Pandas also works with data stored in databases or spreadsheets. &lt;br&gt;
Here’s an example of how Pandas works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="c1"&gt;#let us create a dataframe
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Education&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bsc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Masters&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;College&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;highschool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80000&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#now we have a dataset/dataframe  
# let us now filter the high paying employees
&lt;/span&gt;&lt;span class="n"&gt;high_salary_employees&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;high_salary_employees&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bousdzqub8s7x3wosts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bousdzqub8s7x3wosts.png" alt="A table showing the results of filtering a dataset using the pandas library" width="296" height="91"&gt;&lt;/a&gt;&lt;br&gt;
In the code above, pandas was used to create a dataframe and then filter it to meet a specific condition, which in this case was to output employees earning more than 60000. &lt;/p&gt;
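Groupby, one of the operations mentioned above, is worth a quick sketch too. Using the same hypothetical employee data, it splits the dataframe into groups and aggregates each one:

```python
import pandas as pd

# The same illustrative employee data as above
df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Jane"],
    "Education": ["Bsc", "Masters", "Bsc", "Masters"],
    "Salary": [50000, 60000, 70000, 80000],
})

# Average salary per education level
avg_salary = df.groupby("Education")["Salary"].mean()
print(avg_salary)
```

Here `groupby("Education")` collects the rows for each education level, and `.mean()` reduces each group's salaries to a single average.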

&lt;p&gt;b) NumPy&lt;br&gt;
When it comes to manipulating data in array format, NumPy is your best friend. This library is useful when dealing with multidimensional matrices and large arrays. Some basic operations that you can perform using NumPy include multiplying, slicing, flattening, indexing, reshaping, and adding arrays.&lt;/p&gt;

&lt;p&gt;Here’s an example of a code using NumPy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="c1"&gt;#create the array
&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;],[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The array created is:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The array created is:&lt;br&gt;
 [[1 4 6]&lt;br&gt;
 [3 5 7]]&lt;br&gt;
In the code above, we have created a 2-dimensional NumPy array. The result printed out is the output of the code.&lt;/p&gt;
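Building on that array, here is a short sketch of the other operations mentioned earlier (reshaping, slicing, and element-wise multiplication); the variable names are illustrative:

```python
import numpy as np

# The same 2x3 array as above
a = np.array([[1, 4, 6], [3, 5, 7]])

# Reshape the 2x3 array into a 3x2 array
b = a.reshape(3, 2)

# Slice out the first row
first_row = a[0]

# Multiply every element by 2 (element-wise)
doubled = a * 2

print(b)
print(first_row)
print(doubled)
```

Note that `reshape` only rearranges how the same 6 values are laid out, while `a * 2` applies the multiplication to each element independently.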
&lt;h3&gt;
  
  
  Data visualization libraries
&lt;/h3&gt;

&lt;p&gt;a) Matplotlib&lt;br&gt;
Visualization is the easiest way of seeing patterns, trends, or relationships between variables in your data. With Matplotlib, you can create plots like histograms, line plots, scatterplots, pie charts, and bar charts among others. Matplotlib is customizable and as such, you can easily customize it with a little code to fit the plots you want. &lt;/p&gt;

&lt;p&gt;Here’s a small code showing Matplotlib visualization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A plot of X against Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is: &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdqq1awz2nrr084x5oef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdqq1awz2nrr084x5oef.png" alt="A plot of X against Y using matplotlib library" width="800" height="629"&gt;&lt;/a&gt;&lt;br&gt;
The visual above is a typical line plot of X against Y created using matplotlib. Apart from drawing the plot, matplotlib also lets you label the X and Y axes and add a title. &lt;/p&gt;

&lt;p&gt;b) Seaborn&lt;br&gt;
This is another visualization library, and it integrates with both Pandas and NumPy. Seaborn has plotting functions that operate directly on arrays and data frames. With Seaborn, you can perform statistical aggregations to create informative plots according to user needs. The data graphics that come with Seaborn include line plots, scatter plots, histograms, and heatmaps. &lt;/p&gt;

&lt;p&gt;Here’s an example of how Seaborn works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A plot of X against Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this code is: &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r3psxprf6izzn9ws395.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r3psxprf6izzn9ws395.png" alt="A plot of X against Y using the seaborn library" width="800" height="633"&gt;&lt;/a&gt;&lt;br&gt;
Seaborn serves the same function as matplotlib and as such, they use almost the same syntax. For instance, in the plot above, the labeling of the axes and the title was done using matplotlib but the plot itself was done using seaborn. &lt;/p&gt;

&lt;p&gt;c) Plotly&lt;br&gt;
If you want to take your visualization game to another level, then Plotly is your tool of choice.  Plotly offers interactive visualizations in addition to having many unique chart types like histograms, sparklines, scatter plots, line charts, box plots, and bar charts. Additional benefits of using Plotly include counter plots which other visualization libraries lack. &lt;br&gt;
Here’s an example of how Plotly works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;
&lt;span class="c1"&gt;#using the inbuilt iris flower dataset
&lt;/span&gt;&lt;span class="n"&gt;flower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;#plotting a bar chart
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;petal_width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;petal_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# showing the plot
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bl3isfluufr06570hyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bl3isfluufr06570hyl.png" alt="A bar plot of petal_width and petal_length" width="800" height="353"&gt;&lt;/a&gt;&lt;br&gt;
Just like matplotlib and seaborn, plotly also produces good plots. In the plot above, plotly did the work from plotting to labelling. &lt;/p&gt;
&lt;h3&gt;
  
  
  Machine Learning Libraries
&lt;/h3&gt;

&lt;p&gt;a) Scikit-Learn (Sklearn) &lt;br&gt;
Sklearn is an important tool for your machine-learning needs, whether you are a beginner or an expert. Built on top of SciPy, NumPy, and Matplotlib, this library is efficient at performing machine learning tasks. Sklearn contains both supervised and unsupervised machine learning algorithms. Some of the techniques this library supports include regression, classification, clustering, support vector machines, naïve Bayes, random forests, nearest neighbors, and decision trees. &lt;br&gt;
Here’s an example of how Scikit-learn works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="c1"&gt;# Loading the inbuilt Iris dataset
&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Creating a dataframe from the dataset
&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# splitting the features from the target variable
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;petal width (cm)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# specifying the feature variable
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;petal width (cm)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# specifying the target variable
# Split the iris dataset into training and testing sets
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiating and fitting the linear regression model to the training data
&lt;/span&gt;&lt;span class="n"&gt;LR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Making predictions on the test set
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculating the coefficient of determination (R^2) to evaluate the model
&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coefficient of determination (R^2):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print the coefficients
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coefficients:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plotting the actual vs. predicted values
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Actual vs. Predicted Petal Width&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Actual Petal Width (cm)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted Petal Width (cm)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
Coefficient of determination (R^2): 0.9407619505985545&lt;br&gt;
Coefficients: [-0.25113971  0.25971605  0.54009078]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow7htf7g57sbglo6yxrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow7htf7g57sbglo6yxrt.png" alt="A Demonstration on the use of linear regression, a machine learning algorithm belonging to Scikit-learn package" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we used linear regression from the Scikit-learn library. The output shows how the features influence the target variable. The coefficient of determination (R^2) of 0.941 means that 94.1% of the variance in petal width is explained by the independent variables (features): sepal length, sepal width, and petal length. For a linear regression model, a higher R-squared value indicates a better fit to the data.&lt;br&gt;
The coefficients give the weight of each independent variable. This model has three independent variables, and their coefficients are -0.25113971, 0.25971605, and 0.54009078 for sepal length, sepal width, and petal length respectively.&lt;/p&gt;
&lt;h2&gt;
  
  
  Popular R data science libraries
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Data wrangling libraries
&lt;/h3&gt;

&lt;p&gt;a) Dplyr&lt;br&gt;
This library is for transforming and manipulating data. With it, you can perform core operations on data frames such as selecting, mutating, summarizing, joining, filtering, and grouping. You can use Dplyr to clean and wrangle data.&lt;/p&gt;

&lt;p&gt;Here’s an example of how Dplyr works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading libraries&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#load the inbuilt cars dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#this is an inbuilt dataset on cars &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;select_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;select_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8fnb7n8mvou658o43p5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8fnb7n8mvou658o43p5.png" alt="An image showing the use of dplyr select function" width="641" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we have used dplyr to select some features from the dataset like hp, mpg and disp. Generally, the code extracts a subset of the relevant data from the dataset so we can focus on specific aspects of the data. Extracting a few features from the dataset allows us to create visualizations with ease and also analyze smaller data. &lt;/p&gt;
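&lt;p&gt;The other Dplyr verbs compose naturally with the pipe operator. Below is a minimal sketch (the column names come from the same built-in mtcars dataset) that filters, mutates, and summarizes in a single chain:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(dplyr)
data(mtcars)
#keep the more powerful cars, add a weight-in-kg column, then summarize mpg by cylinder count
mtcars %&amp;gt;%
  filter(hp &amp;gt; 100) %&amp;gt;%
  mutate(wt_kg = wt * 453.6) %&amp;gt;%  #wt is recorded in units of 1000 lbs
  group_by(cyl) %&amp;gt;%
  summarise(mean_mpg = mean(mpg))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;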

&lt;p&gt;b) Lubridate&lt;br&gt;
If you have date-related variables to transform, then Lubridate is your friend. Without this library, working with date-time data would be frustrating. Lubridate has functions like hour(), month(), minute(), year() and second(). You can use it to accurately calculate durations, intervals, and ages, among other time-related measures. This library will greatly help you wrangle and clean time-related data.&lt;/p&gt;

&lt;p&gt;Here’s an example of how Lubridate works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading the necessary library &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lubridate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#creating a date-time data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as.Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"2024-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-04-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
                   &lt;/span&gt;&lt;span class="s2"&gt;"2024-02-19"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-09"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
                   &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-25"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89e8pk0azjjiho0t4e2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89e8pk0azjjiho0t4e2a.png" alt="Image showing the use of lubridate function" width="655" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code takes date strings and converts them into usable date objects. Each date is then separated into components like year, month and day, which makes further analysis easier. As you can see from the output, extracting the date components gives the data more structure.&lt;/p&gt;
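&lt;p&gt;The durations and intervals mentioned above work in a similar way. Here is a minimal sketch that measures the time between two of the same dates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(lubridate)
#build an interval between two dates
start &amp;lt;- ymd("2024-01-01")
end &amp;lt;- ymd("2024-03-25")
span &amp;lt;- interval(start, end)
span / days(1)       #length of the interval in days
as.duration(span)    #the same span expressed as a duration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;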

&lt;h3&gt;
  
  
  Data visualization libraries
&lt;/h3&gt;

&lt;p&gt;a) Ggplot2&lt;br&gt;
This is the most popular visualization library in R, and it is based on the grammar of graphics. By mapping data attributes to visual elements like colors and lines, ggplot2 creates informative plots. The library also supports themes, facets, layers, and scales, which give you control over the layout and appearance of your plots.&lt;/p&gt;

&lt;p&gt;Here’s an example of how Ggplot2 works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading the library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#loading the dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#creating a scatterplot of mpg and hp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Scatterplot of mpg vs hp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"hp"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakxnwp9s3il243j5mec9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakxnwp9s3il243j5mec9.png" alt="A scatter plot showing the relationship between mpg and hp" width="564" height="348"&gt;&lt;/a&gt;&lt;br&gt;
The code creates a scatter plot using the ggplot2 library. The plot visualizes the relationship between fuel efficiency (mpg) and engine power (hp) for the vehicles in the mtcars dataset.&lt;/p&gt;
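&lt;p&gt;The facets and themes mentioned earlier extend the same plot with extra layers. A small sketch, building on the scatter plot above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ggplot2)
data(mtcars)
#same scatter plot, split into one panel per cylinder count, with a cleaner theme
ggplot(data=mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  theme_minimal()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;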

&lt;p&gt;b) Plotly&lt;br&gt;
Just like in Python, Plotly is for creating interactive visualizations. The library offers many options for visualizing data, from traditional plots to advanced ones such as 3D charts and heat maps. It will come in handy whenever you want an interactive plot.&lt;/p&gt;

&lt;p&gt;Here’s an example of how Plotly works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading the library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plotly&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#loading the data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#plotting a histogram of miles per gallon (mpg)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot_ly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"histogram"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"skyblue"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"black"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Histogram of MPG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MPG"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;yaxis&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Frequency"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruo86t2icsjkp9qh7mvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruo86t2icsjkp9qh7mvo.png" alt="A histogram showing the distribution of the variable miles per gallon" width="789" height="488"&gt;&lt;/a&gt;&lt;br&gt;
The plotly code above creates an interactive histogram displaying the distribution of the mpg (miles per gallon) feature of the mtcars dataset. The histogram offers insight into how many cars share similar fuel efficiency ratings.&lt;/p&gt;
&lt;h3&gt;
  
  
  ML model libraries
&lt;/h3&gt;

&lt;p&gt;a) Caret&lt;br&gt;
This package is a supervised machine-learning library. It is mainly used for classification and regression problems, and its name is short for Classification and Regression Training. Caret has many functions, such as createDataPartition and trainControl, which are used for splitting data and configuring cross-validation respectively. &lt;br&gt;
Here’s an example of how Caret works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading the library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#loading the data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#splitting the data into training and test sets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;createDataPartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Species&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]],]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]],]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#training the decision tree model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Species&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'rpart'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#making predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#model evaluation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Species&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mjp8p1v7su610tr5of9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mjp8p1v7su610tr5of9.png" alt="An image showing the performance of decision tree model of the caret library" width="788" height="784"&gt;&lt;/a&gt;&lt;br&gt;
The code above demonstrates the use of R’s Caret library for machine learning. In the example, we trained a decision tree model to classify the flower species in the iris dataset. The code then evaluates the model on the test data and creates a confusion matrix, which shows how well the model predicted each flower type. From the output above, the model has an accuracy of 93.33%, an indication that it performed fairly well.&lt;/p&gt;
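&lt;p&gt;The trainControl function mentioned earlier plugs into the same train() call. A minimal sketch of 5-fold cross-validation on the same iris data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(caret)
data(iris)
#5-fold cross-validation instead of a single train/test split
ctrl &amp;lt;- trainControl(method = "cv", number = 5)
set.seed(42)
model &amp;lt;- train(Species ~., data=iris, method='rpart', trControl=ctrl)
print(model)  #reports the resampled accuracy across the folds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;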

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you have learned about some of the most popular data science libraries in R and Python. Libraries like Dplyr, Plotly, Ggplot2, Caret, Scikit-learn, Pandas, and NumPy play an important role in helping data scientists explore, manipulate, and build statistical models from data. This list is not exhaustive, and you can find other libraries with additional research. Happy learning ahead!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>python</category>
      <category>r</category>
    </item>
    <item>
      <title>Scraping movie data</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Sat, 11 Nov 2023 10:46:47 +0000</pubDate>
      <link>https://dev.to/mugultum/scraping-movie-data-from-imdb-website-11cf</link>
      <guid>https://dev.to/mugultum/scraping-movie-data-from-imdb-website-11cf</guid>
      <description>&lt;p&gt;We start by importing the libraries that we will need. Requests and BeautifulSoup are the standard libraries for scraping data from websites while the csv library is for writing the scraped data into a csv file.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```{python}&lt;br&gt;
import requests&lt;br&gt;
from bs4 import BeautifulSoup&lt;br&gt;
import csv&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The headers dictionary reduces the chances of the website rejecting your scraping requests, since it makes the request look like it comes from a real browser rather than a bot. You can get the headers by right-clicking the web page, clicking Inspect, and opening the Network tab. Click a request with a status of 200; the Headers panel appears on the right-hand side of the screen, and when you scroll to the bottom you will see the request headers, including the User-Agent.



```{python}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The code below checks if the HTTP request to the IMDb page was successful (status code 200). When it returns a success, it uses BeautifulSoup to parse the HTML content of the web page. Thereafter, the script identifies and iterates through movie list items within an unordered list, extracting details such as rank, title, year, duration, parental advisory, and rating. It writes this information to a CSV file named "movies_data.csv" in a structured format. If the request fails, it prints an error message with the HTTP status code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a session and send the request
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://m.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check if the request was successful (status code 200)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse the HTML content
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Find all list items within the unordered list
&lt;/span&gt;    &lt;span class="n"&gt;movies_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ul&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ipc-metadata-list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Open a CSV file for writing
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;movies_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a CSV writer
&lt;/span&gt;        &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Write the header row
&lt;/span&gt;        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parental Advisory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Iterate through each list item and write to the CSV file
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;movies_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc-94da5b1b-0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ipc-title-link-wrapper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc-c7e5f54-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc-c7e5f54-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;parental_advisory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc-c7e5f54-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ipc-rating-star&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Write a row to the CSV file
&lt;/span&gt;            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parental_advisory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data written to movies_data.csv successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to retrieve the page. Status code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Beginner Data Scientist Roadmap</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Fri, 10 Nov 2023 12:43:10 +0000</pubDate>
      <link>https://dev.to/mugultum/beginner-data-scientist-roadmap-4pld</link>
      <guid>https://dev.to/mugultum/beginner-data-scientist-roadmap-4pld</guid>
      <description>&lt;p&gt;First, we have to understand what data scientists do. A data scientist uses programming languages such as R or Python to uncover insights hidden in data, create data-driven products, and make predictions. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a data scientist?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You will work in a rewarding and intellectually challenging environment. &lt;br&gt;
The demand for data scientists is projected to grow greatly over the next 10 years, with the US Bureau of Labor Statistics predicting that demand will grow by 35% between 2022 and 2032. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qualifications?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are usually no set qualifications, but it is recommended that you possess strong mathematical skills in order to understand and work with complex data. You should also be capable of using statistical software, in addition to being familiar with R or Python. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roadmap&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Understand the role of a data scientist: Take your time to understand what a data scientist does. There are many blogs on the topic, and you can easily find them by googling. &lt;/li&gt;
&lt;li&gt; Explore job requirements: Look at what employers are looking for in a data scientist&lt;/li&gt;
&lt;li&gt; Get comfortable with math and statistics: Math and statistics are essential skills for data scientists&lt;/li&gt;
&lt;li&gt; Learn programming languages: Learn programming languages such as Python, R, and SQL&lt;/li&gt;
&lt;li&gt; Build basic skills: Build basic skills such as problem-solving, database management, and data wrangling&lt;/li&gt;
&lt;li&gt; Get familiar with data analytics tools: Learn how to use tools such as Tableau, Power BI, and Matplotlib&lt;/li&gt;
&lt;li&gt; Learn machine learning: Learn machine learning libraries such as TensorFlow, Keras, and Scikit-learn&lt;/li&gt;
&lt;li&gt; Develop your skillset: Pursue volunteer, open-source, or freelance projects to build your portfolio&lt;/li&gt;
&lt;li&gt; Build your network: Network with other data scientists and attend industry events&lt;/li&gt;
&lt;li&gt;Consider a data science internship: Gain practical experience through internships&lt;/li&gt;
&lt;li&gt;Polish your resume and start applying: Once you have the necessary skills and experience, start applying for data scientist positions.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Web scraping with python-first try</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Fri, 10 Nov 2023 12:41:53 +0000</pubDate>
      <link>https://dev.to/mugultum/web-scraping-with-python-first-try-1ial</link>
      <guid>https://dev.to/mugultum/web-scraping-with-python-first-try-1ial</guid>
      <description>&lt;p&gt;&lt;strong&gt;Web scraping with python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For my first web scraping project, I followed a YouTube tutorial by &lt;a href="https://www.youtube.com/watch?v=QhD015WUMxE"&gt;Tinkernut&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Importing libraries
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we import the basic libraries for scraping the data and writing it into a CSV file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;url_to_scrape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://quotes.toscrape.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url_to_scrape&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;quotes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we specify the URL we will be scraping the data from, along with the classes that identify the data we want. We want to scrape quotes and authors, which live in &lt;strong&gt;&lt;em&gt;span&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;small&lt;/em&gt;&lt;/strong&gt; elements with the &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt; attributes &lt;strong&gt;&lt;em&gt;text&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;author&lt;/em&gt;&lt;/strong&gt;, respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file is opened in write mode, and csv.writer returns a writer object for writing rows to the CSV file.&lt;br&gt;
&lt;/p&gt;
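&lt;p&gt;As a side note, here is a sketch of the same open-then-write pattern using a context manager, which closes the file automatically; passing &lt;em&gt;newline=""&lt;/em&gt; avoids blank rows appearing between records on Windows. The filename and rows below are just placeholders:&lt;/p&gt;

```python
import csv

# Placeholder rows standing in for the scraped data
rows = [["Quotes", "Author"], ["Sample quote", "Sample Author"]]

# The with-block closes the file automatically, even on errors.
# newline="" lets the csv module control line endings itself.
with open("demo_quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Read the file back to confirm what was written
with open("demo_quotes.csv", newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
```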

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quotes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes the headers "Quotes" and "Author" to the CSV file. It then iterates through pairs of quote and author elements, printing each pair to the console. Finally, it writes each quote and author to a new row in the CSV file before closing it. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Roadmap for a beginner data scientist</title>
      <dc:creator>Silvester</dc:creator>
      <pubDate>Mon, 02 Oct 2023 09:32:00 +0000</pubDate>
      <link>https://dev.to/mugultum/roadmap-for-a-beginner-data-scientist-hnh</link>
      <guid>https://dev.to/mugultum/roadmap-for-a-beginner-data-scientist-hnh</guid>
      <description>&lt;p&gt;A data scientist uses programming languages such as R or Python to uncover insights hidden in data, create data-driven products, and make predictions. &lt;/p&gt;

&lt;p&gt;Data science is a rewarding and intellectually challenging field if you are willing to put in your best effort. &lt;br&gt;
The demand for data scientists is projected to grow greatly over the next 10 years, with the US Bureau of Labor Statistics predicting that demand will grow by 35% between 2022 and 2032.  &lt;/p&gt;

&lt;p&gt;There are usually no set qualifications, but it is recommended that you possess strong mathematical skills in order to understand and work with complex data. You should also be capable of using statistical software, in addition to being familiar with R or Python. &lt;br&gt;
&lt;strong&gt;Roadmap&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Understand the role of a data scientist: Take your time to understand what a data scientist does. There are many blogs on the topic, and you can easily find them by googling. &lt;/li&gt;
&lt;li&gt; Explore job requirements: Look at what employers are looking for in a data scientist&lt;/li&gt;
&lt;li&gt; Get comfortable with math and statistics: Math and statistics are essential skills for data scientists&lt;/li&gt;
&lt;li&gt; Learn programming languages: Learn programming languages such as Python, R, and SQL&lt;/li&gt;
&lt;li&gt; Build basic skills: Build basic skills such as problem-solving, database management, and data wrangling&lt;/li&gt;
&lt;li&gt; Get familiar with data analytics tools: Learn how to use tools such as Tableau, Power BI, and Matplotlib&lt;/li&gt;
&lt;li&gt; Learn machine learning: Learn machine learning libraries such as TensorFlow, Keras, and Scikit-learn&lt;/li&gt;
&lt;li&gt; Develop your skillset: Pursue volunteer, open-source, or freelance projects to build your portfolio&lt;/li&gt;
&lt;li&gt; Build your network: Network with other data scientists and attend industry events&lt;/li&gt;
&lt;li&gt;Consider a data science internship: Gain practical experience through internships. There are many unpaid remote internships that you can do to build your experience. You can check sites like &lt;a href="https://www.theforage.com"&gt;Forage&lt;/a&gt;, where there are tons of projects you can do to hone your skills. &lt;/li&gt;
&lt;li&gt;Polish your resume and start applying: Once you have the necessary skills and experience, start applying for data scientist positions.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
