<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erik Marsja</title>
    <description>The latest articles on DEV Community by Erik Marsja (@marsja).</description>
    <link>https://dev.to/marsja</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F117591%2F7a3ec8d8-7847-4172-ac7d-c46b1c3fa8f3.jpg</url>
      <title>DEV Community: Erik Marsja</title>
      <link>https://dev.to/marsja</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marsja"/>
    <language>en</language>
    <item>
      <title>Mastering Data Manipulation: Merge Datasets in R</title>
      <dc:creator>Erik Marsja</dc:creator>
      <pubDate>Wed, 02 Aug 2023 11:36:40 +0000</pubDate>
      <link>https://dev.to/marsja/mastering-data-manipulation-merge-datasets-in-r-34kd</link>
      <guid>https://dev.to/marsja/mastering-data-manipulation-merge-datasets-in-r-34kd</guid>
      <description>&lt;p&gt;Combining datasets is an essential task in data science that empowers analysts to open valuable understandings and drive data-driven decision-making. The process of merging datasets allows analysts to consolidate information from numerous sources into a unified and comprehensive view, providing a deeper understanding of complex relationships and patterns within the data. In this R tutorial, we will get into merging datasets in R, providing you with essential skills to harness the full potential of your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have the necessary knowledge and packages. Basic familiarity with R programming is assumed, including understanding data frames and common &lt;a href="https://online.hbs.edu/blog/post/data-wrangling"&gt;data manipulation&lt;/a&gt; functions in &lt;a href="https://cran.r-project.org/"&gt;R&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tutorial will use R, a powerful open-source programming language and environment for statistical computing and data analysis. Additionally, we will primarily rely on two essential R packages: dplyr and readxl. The dplyr package offers an intuitive grammar for data manipulation, enabling smooth and efficient data transformation, while readxl focuses on reading Excel files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading Excel Files
&lt;/h2&gt;

&lt;p&gt;Here we read two &lt;a href="https://www.marsja.se/r-excel-tutorial-how-to-read-and-write-xlsx-files-in-r/"&gt;.xlsx files in R&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load the required library
library(readxl)

# Read the student data from the .xlsx file
student_data &amp;lt;- read_excel("student_data.xlsx")

# Read the course data from the .xlsx file
course_data &amp;lt;- read_excel("course_data.xlsx")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above code, the "student_data" and "course_data" data frames will contain the data from the respective .xlsx files. You can merge these datasets in R using various methods, explore relationships between students and courses, and conduct insightful data analyses. &lt;/p&gt;

&lt;h2&gt;
  
  
  Merge Datasets in R Example 1: Inner Join
&lt;/h2&gt;

&lt;p&gt;In this example, we will perform an inner join between the "student_data" and "course_data" datasets based on the common CourseID column. The inner join keeps only the students taking a course and excludes those not enrolled in any course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load dplyr
library(dplyr)

# Perform an inner join
inner_merged_data &amp;lt;- inner_join(student_data, 
                                course_data, by = "CourseID")

# Display the merged dataset
print(inner_merged_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above code, the "inner_merged_data" data frame contains only the rows whose CourseID appears in both datasets: each enrolled student together with the matching course information. Students without a matching course, and courses without enrolled students, are dropped. &lt;/p&gt;

&lt;h2&gt;
  
  
  Merge Datasets in R Example 2: Left Join
&lt;/h2&gt;

&lt;p&gt;In this example, we will perform a left join between the "student_data" and "course_data" datasets based on the common CourseID column. The left join keeps all rows from the "student_data" data frame and adds the matching course information from "course_data", including students not enrolled in any course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Perform a left join
left_merged_data &amp;lt;- left_join(student_data,
                              course_data, by = "CourseID")

# Display the merged dataset
print(left_merged_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we use the left_join() function from the dplyr package to merge the "student_data" and "course_data" datasets based on the common "CourseID" column. The resulting "left_merged_data" data frame includes all rows from "student_data", irrespective of course enrollment; for students not enrolled in any course, the corresponding columns from "course_data" contain NA values. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KIziSVjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6yuxr6rgswsjb5oauo6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KIziSVjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6yuxr6rgswsjb5oauo6.jpg" alt="Merged datasets." width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also a function for merging datasets called right_join() in the dplyr package. The right_join() function works like left_join() mirrored: it retains all rows from the right dataset and adds the matching rows from the left dataset based on the common column, filling in NA where no match exists in the left dataset.&lt;/p&gt;
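&lt;p&gt;As a minimal sketch (assuming the same student and course data as above), a right join can be performed like this:&lt;br&gt;
&lt;/p&gt;

```r
# Load dplyr for the join functions
library(dplyr)

# Perform a right join: keep all rows from course_data,
# adding matching student rows where they exist
right_merged_data = right_join(student_data,
                               course_data, by = "CourseID")

# Display the merged dataset
print(right_merged_data)
```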

&lt;h2&gt;
  
  
  Example 3: Filtered Merge using %in%
&lt;/h2&gt;

&lt;p&gt;To merge the datasets using &lt;a href="https://www.marsja.se/how-to-use-in-in-r/"&gt;%in% in R&lt;/a&gt;, we will perform a filtered merge that includes only the students taking the course with CourseID 201 and the corresponding Instructor teaching that course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Filter students taking course 201
students_taking_course_201 &amp;lt;- student_data %&amp;gt;%
  filter(CourseID %in% 201)

# Filter the course with CourseID 201 and its Instructor
course_201_info &amp;lt;- course_data %&amp;gt;%
  filter(CourseID %in% 201)

# Merge the filtered datasets
merged_data &amp;lt;- inner_join(students_taking_course_201, 
                          course_201_info, by = "CourseID")

# Display the merged dataset
print(merged_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we first filter the students taking the course with CourseID 201 from the "student_data" data frame using the %in% operator, and likewise filter the course with CourseID 201 and its Instructor from the "course_data" data frame. We then perform an inner join on the filtered datasets based on the common "CourseID" column, which yields a merged dataset containing only the students taking Course 201 and the corresponding Instructor teaching that course. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JFOwscMU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eolgtew8y2n9db7txvir.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JFOwscMU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eolgtew8y2n9db7txvir.jpg" alt="Filtered and merged datasets." width="800" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we explored how to merge datasets in R using functions from the dplyr package, such as inner_join(), left_join(), and right_join(). We also learned how to use %in% to filter and merge specific data based on common columns. &lt;/p&gt;

</description>
      <category>r</category>
      <category>rstats</category>
      <category>datawrangling</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Wrangling in Python and Pandas: to Process and Prepare Data for Analysis</title>
      <dc:creator>Erik Marsja</dc:creator>
      <pubDate>Thu, 20 Aug 2020 19:40:10 +0000</pubDate>
      <link>https://dev.to/marsja/data-wrangling-in-python-and-pandas-to-process-and-prepare-data-for-analysis-45dk</link>
      <guid>https://dev.to/marsja/data-wrangling-in-python-and-pandas-to-process-and-prepare-data-for-analysis-45dk</guid>
      <description>&lt;p&gt;In this post, I will cover some basic data wrangling techniques that we can do with Python and Pandas. Actually, here I will also introduce the excellent Python package dfply. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is data wrangling and why is it an important technique to know in Python?
&lt;/h2&gt;

&lt;p&gt;Processing data is known as data wrangling, data munging, or data preparation. The purpose of this processing is to format the data so that it can be analyzed. This step is extremely important because the majority of the working hours in a project are usually spent processing data; often, most of the analysis code (e.g., in Python) will be concerned with data munging. It is, therefore, extremely important to learn to do this efficiently and robustly. Selecting the rows and columns of the data are without a doubt two of the most basic tasks, and adding new variables (e.g., columns to a Pandas dataframe) or modifying existing ones are two more essential ones.&lt;br&gt;
Thus, it is worth learning from, for example, &lt;a href="https://www.marsja.se"&gt;R and Python tutorials&lt;/a&gt; and &lt;a href="https://www.fullstackpython.com/pandas.html"&gt;Pandas tutorials&lt;/a&gt;, like this one.&lt;/p&gt;

&lt;p&gt;Python has all the necessary functions to process data. Unfortunately, however, these functions are often difficult to use or result in code that is hard to read. Developers and data scientists in the R community took note of this early on and came up with several libraries for data wrangling. One of my favorites is undoubtedly dplyr, part of the tidyverse; dplyr has perhaps been the most revolutionary of these libraries, as it introduced a completely new way to manipulate data. In dplyr, simplicity is in focus: a small number of simple, easy-to-use functions enables powerful processing of data. The Python community observed that dplyr became very popular, and as a result there are equivalent Python packages. In fact, there are three such packages for Python: dfply, pandas-ply, and dplython. In this post, I will exemplify how powerful, and easy, one of these packages is: &lt;a href="https://github.com/kieferk/dfply"&gt;dfply&lt;/a&gt;. This, I would say, makes Pandas (and dfply) one of the &lt;a href="https://dev.to/marsja/essential-python-libraries-for-data-science-machine-learning-and-statistics-5175"&gt;essential Python packages for data science&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installing Python Packages
&lt;/h2&gt;

&lt;p&gt;In case you need to install Pandas and dfply, here's how to get them installed using pip:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pandas dfply&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Importing Data in Python with Pandas
&lt;/h2&gt;

&lt;p&gt;Now, before we go on and have a look at how to munge data with Python, we need some example data. Here, you will import a dataset (from a .csv file) from a URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Earnings.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Checking the first five rows of the dataset can be done using the head() method:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WA1oEFdG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mnrkdrefsehc5r645yf3.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WA1oEFdG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mnrkdrefsehc5r645yf3.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Data wrangling in Python with Pandas and dfply
&lt;/h2&gt;

&lt;p&gt;In this section, you are going to learn &lt;a href="https://www.marsja.se/renaming-columns-in-pandas-dataframe/"&gt;how to rename columns in Pandas dataframe&lt;/a&gt; with dfply. After this, you are going to learn how to calculate simple descriptive statistics. &lt;/p&gt;
&lt;h3&gt;
  
  
  Changing Column Names in Pandas Dataframe with dfply
&lt;/h3&gt;

&lt;p&gt;Here's how easy it is to change the column names in the dataframe. First, we import dfply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dlfply&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This was done with a from dfply import * statement, which imports all of the functions available in the package so that we don't have to write "dfply." in front of each and every function that we want to use.&lt;/p&gt;

&lt;p&gt;Second, we use piping, and the rename() method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'age'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hm_0juOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6wpd5eecnfkrk2nvmd12.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hm_0juOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6wpd5eecnfkrk2nvmd12.JPG" alt="Pandas dataframe with renamed columns"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note how we used &amp;gt;&amp;gt; to create a chain of operations. The real power of this may not be evident in the simple example code above; however, this method enables us to carry out many data wrangling tasks in a single line of code. &lt;/p&gt;
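&lt;p&gt;For comparison, the same rename can be done in plain Pandas without dfply; a minimal sketch with hypothetical data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical dataframe with the original column names
df_small = pd.DataFrame({"y": [52.0, 61.0], "age": ["g1", "g2"]})

# Plain-Pandas equivalent of the dfply rename pipe:
# rename 'y' to 'Wage' and 'age' to 'Age'
df_small = df_small.rename(columns={"y": "Wage", "age": "Age"})
print(df_small.columns.tolist())
```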

&lt;h3&gt;
  
  
  Summary statistics in Python with dfply (and Pandas)
&lt;/h3&gt;

&lt;p&gt;Here's how easy it is to calculate descriptive statistics (by groups) with dfply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                   &lt;span class="n"&gt;meadian&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                  &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JnLKZAkp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m71rlnv9ffzvrmsjrdkj.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JnLKZAkp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m71rlnv9ffzvrmsjrdkj.JPG" alt="Descriptive Statistics"&gt;&lt;/a&gt;&lt;br&gt;
Again, the code begins with df &amp;gt;&amp;gt; group_by('Age'), which is read as "start with df, then group df by Age". The functions that follow are carried out per Age group (from that column). We then proceed to summarize(mean=X.Wage.mean(), ...), which computes the descriptive statistics of Wage: the mean, the median, and the standard deviation. Note that we use X to tell Python that Wage comes from the same dataframe we started with (df).&lt;/p&gt;
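&lt;p&gt;For comparison, the same grouped summary can be computed in plain Pandas; a minimal sketch with hypothetical data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data mirroring the Age and Wage columns
df_stats = pd.DataFrame({"Age": ["g1", "g1", "g2", "g2"],
                         "Wage": [10.0, 20.0, 30.0, 50.0]})

# Plain-Pandas equivalent of group_by('Age') followed by summarize(...)
stats = df_stats.groupby("Age")["Wage"].agg(["mean", "median", "std"])
print(stats)
```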

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_Um3vVZQMok"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Hope you learned something, please share if you did!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>r</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Essential Python Libraries for Data Science, Machine Learning, and Statistics</title>
      <dc:creator>Erik Marsja</dc:creator>
      <pubDate>Sun, 25 Nov 2018 11:48:57 +0000</pubDate>
      <link>https://dev.to/marsja/essential-python-libraries-for-data-science-machine-learning-and-statistics-5175</link>
      <guid>https://dev.to/marsja/essential-python-libraries-for-data-science-machine-learning-and-statistics-5175</guid>
      <description>&lt;p&gt;In this post I will list some very valuable libraries for people who intend to use Python for data science, Machine Learning, and Statistics. These libraries are very extensive and are developed by a big number of experts around the world and together, the libraries, make Python a very powerful tool for data analysis. &lt;/p&gt;

&lt;p&gt;I really recommend that you install, and use, Anaconda, the scientific Python distribution, which comes with many Python libraries preinstalled. In the example below, we will use one very handy library: Pandas. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6y7c67n7kyuk8udld7o.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6y7c67n7kyuk8udld7o.JPG" alt="How to import and use Pandas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The convention is to load pandas as pd and then we can use the methods and classes very easily. For instance, we can write pd.read_csv(‘datafile.csv’) to load a CSV file to a dataframe object.&lt;/p&gt;
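&lt;p&gt;For instance, a minimal sketch of read_csv() (reading from an in-memory string instead of the hypothetical datafile.csv):&lt;/p&gt;

```python
import io
import pandas as pd

# In-memory CSV text stands in for 'datafile.csv' (hypothetical data)
csv_text = "name,score\nAnna,90\nBo,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)
```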

&lt;h2&gt;
  
  
  Essential Libraries in Python
&lt;/h2&gt;

&lt;p&gt;In this section I will list some of the most essential Python libraries when it comes to data science.&lt;/p&gt;

&lt;h3&gt;
  
  
  NumPy (Numerical Python)
&lt;/h3&gt;

&lt;p&gt;NumPy is an extensive library for data storage and calculations. It contains the data structures, algorithms, and other tools used to handle numerical data in Python. Furthermore, NumPy's array operations are more efficient than looping over Python's built-in lists. NumPy also contains features for loading data into Python, as well as exporting data from Python.&lt;/p&gt;
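&lt;p&gt;A minimal sketch of this vectorized style (the array values are just illustrative):&lt;/p&gt;

```python
import numpy as np

# Vectorized arithmetic on a NumPy array, instead of a Python loop
arr = np.array([1.0, 2.0, 3.0, 4.0])
doubled = arr * 2
print(doubled.sum())
```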

&lt;p&gt;If you are migrating from MATLAB, for instance, you will like NumPy (see &lt;a href="https://docs.scipy.org/doc/numpy-1.15.0/user/numpy-for-matlab-users.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="http://numpy.org" rel="noopener noreferrer"&gt;http://numpy.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pandas
&lt;/h3&gt;

&lt;p&gt;Pandas is the most powerful Python library for data manipulation. It contains a wide range of data import and export functions, as well as functions for indexing and manipulating data. This library is indispensable for those who use Python for data science. The most used data structures in Pandas are the dataframe (a table of columns and rows) and the series (a 1-dimensional array).&lt;/p&gt;

&lt;p&gt;Pandas is extremely effective for reshaping, merging, splitting, aggregating, and selecting (subsetting) data. In fact, the absolute majority of the code in a data science project usually consists of data wrangling: the steps required to prepare data so that analyses can be performed. Having one coherent library for all data wrangling is, of course, advantageous. &lt;br&gt;
Unlike the statistical programming environment known as R, Python has no built-in dataframe type, yet dataframes are central to basically all data analysis. A dataframe is a table of columns and rows. &lt;a href="https://www.marsja.se/pandas-dataframe-read-csv-excel-subset/" rel="noopener noreferrer"&gt;Here's&lt;/a&gt; a very nice Pandas dataframe tutorial I wrote, aimed at the beginner.&lt;/p&gt;
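&lt;p&gt;A small sketch of creating and subsetting a dataframe (the column names and values are hypothetical):&lt;/p&gt;

```python
import pandas as pd

# A dataframe is a table of rows and columns; a series is a single column
people = pd.DataFrame({"name": ["Anna", "Bo", "Cy"],
                       "score": [90, 85, 70]})

# Selecting (subsetting) rows by a condition, then one column
high = people[people["score"] >= 85]
print(high["name"].tolist())
```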

&lt;p&gt;&lt;a href="http://pandas.pydata.org" rel="noopener noreferrer"&gt;http://pandas.pydata.org&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  matplotlib
&lt;/h3&gt;

&lt;p&gt;Matplotlib is used to visualize data. Although matplotlib is quite easy to use and gives you a lot of control over your plots, I would recommend using Seaborn.&lt;/p&gt;
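&lt;p&gt;A minimal matplotlib sketch (the data and file name are hypothetical; the Agg backend is used so no window is required):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

# A minimal line plot of some illustrative data
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 6])
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("line_plot.png")
```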

&lt;p&gt;&lt;a href="http://matplotlib.org" rel="noopener noreferrer"&gt;http://matplotlib.org&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Seaborn
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; is Python package for data visualization that is based on matplotlib. This package gives us a high-level interface for drawing beautiful and informative visualizations. It's possible to draw bar plots, histograms, scatter plots, and many other nice plots.&lt;/p&gt;

&lt;p&gt;Here's an example using NumPy to generate some data and plotting it using Seaborn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="c1"&gt;# Generate some normally distributed data
&lt;/span&gt;&lt;span class="n"&gt;dat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a histogram using seaborn
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxbq7z3zitbicz4dasn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxbq7z3zitbicz4dasn7.png" alt="Histogram created with Seaborn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.marsja.se/how-to-make-a-scatter-plot-in-python-using-seaborn/" rel="noopener noreferrer"&gt;Learn how to create scatter plot in Python using Seaborn&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SciPy (Scientific Python)
&lt;/h3&gt;

&lt;p&gt;SciPy includes features for advanced scientific calculations, such as optimization, integration, linear algebra, signal processing, and statistics.&lt;/p&gt;
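&lt;p&gt;As one small example of what SciPy offers, here is a sketch of an independent-samples t-test with scipy.stats (the data are hypothetical):&lt;/p&gt;

```python
from scipy import stats

# Example: an independent-samples t-test on two hypothetical groups
group_a = [2.1, 2.5, 2.8, 3.0]
group_b = [3.9, 4.2, 4.5, 5.0]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```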

&lt;p&gt;&lt;a href="http://scipy.org" rel="noopener noreferrer"&gt;http://scipy.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  scikit-learn
&lt;/h3&gt;

&lt;p&gt;scikit-learn is a huge library of data analysis features. In scikit-learn, there are classification models (e.g., Support Vector Machines, random forests), regression methods (linear regression, ridge regression, lasso), cluster analysis (e.g., k-means clustering), dimensionality reduction methods (e.g., Principal Component Analysis, feature selection), model tuning and selection (with features like grid search, cross-validation, etc.), and pre-processing of data, among many other things.&lt;br&gt;
&lt;a href="http://scikit-learn.org/" rel="noopener noreferrer"&gt;http://scikit-learn.org/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
