Chris Greening

Posted on Jan 28, 2023

Joining multiple datasets on the same column in R using dplyr and purrr

#datascience #r #tidyverse #beginners

Joining multiple datasets on the same column is a common pattern in data preparation

So let's jump in and explore how we can leverage R and the tidyverse to join an arbitrary number of datasets on a shared column with elegant, readable code!

Installing prerequisite packages
Examining our sample datasets
Using dplyr to manually join two datasets at a time
Understanding the reduce operation
Leveraging purrr to join multiple datasets
Conclusion
Additional resources

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Installing prerequisite packages

In this tutorial we'll be using dplyr and purrr from the popular tidyverse collection of packages

The following line of code will install them on your machine if they aren't already:

install.packages(c("dplyr", "purrr"))

Examining our sample datasets

For the following examples, we'll be using real-world agricultural data sourced via Eurostat containing the number of specific livestock animals (swine, bovine, sheep, and goats) in a country during a given year

For example, here is the goats dataset

> goats
# A tibble: 1,322 × 3
   country  year goats
   <chr>   <dbl> <dbl>
 1 Albania  2014 904
 2 Albania  2015 932
 3 Albania  2016 941.
 4 Albania  2017 933.
 5 Albania  2018 917.
 6 Albania  2019 863.
 7 Albania  2020 774.
 8 Austria  1993  47
 9 Austria  1994  49.8
10 Austria  1995  54.2
# … with 1,312 more rows

Our goal is to join these datasets by country and year into a single livestock.data variable containing all the animals like so:

> livestock.data
# A tibble: 1,322 × 6
   country  year bovine goats swine sheep
   <chr>   <dbl>  <dbl> <dbl> <dbl> <dbl>
 1 Albania  2014   500. 904    172. 1896
 2 Albania  2015   506. 932    177. 1918
 3 Albania  2016   497. 941.   184. 1972.
 4 Albania  2017   475. 933.   180. 1926.
 5 Albania  2018   467. 917.   184. 1864.
 6 Albania  2019   416. 863.   184. 1758.
 7 Albania  2020   363. 774.   158. 1558.
 8 Austria  1993  2334.  47     NA   334
 9 Austria  1994  2329.  49.8 3729.  342.
10 Austria  1995  2326.  54.2 3706.  365.
# … with 1,312 more rows

Using dplyr::full_join to manually join two datasets at a time

Let's start with a naive approach and manually join our datasets one-by-one on the country and year columns

by = c("country", "year")
livestock.data <- dplyr::full_join(bovine, goats, by=by)
livestock.data <- dplyr::full_join(livestock.data, swine, by=by)
livestock.data <- dplyr::full_join(livestock.data, sheep, by=by)

The above code accomplishes the exercise by:

Manually stepping through each animal variable
applying a function that takes two arguments (in this case dplyr::full_join)
and chaining the output of one step (livestock.data) as the input for the next step

While this might work for four datasets, what if we had 100 datasets? 1000 datasets? n datasets?! Suddenly not a great solution!

Let's investigate how we can improve, automate, and scale this

Understanding the reduce operation

The reduce operation is a technique that combines all the elements of an array (i.e. an array containing our individual livestock datasets) into a single value (i.e. the final joined table).

The reduce operation accomplishes this by:

Looping over an array
applying a function that takes two arguments (such as dplyr::full_join)
and chaining the output of one step as the input for the next step

Sound familiar? This is exactly what we just performed manually in the previous section except this time we'll be leveraging R to do it for us!

So let's see in practice how we can apply the reduce operation to elegantly join our livestock.data

Leveraging purrr::reduce to join multiple datasets

purrr is a package that enhances R's functional programming toolkit for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)

In this case, we're going to use purrr::reduce in conjunction with dplyr::full_join to join all of our datasets in one line of concise, readable code

livestock.data <- purrr::reduce(
    list(bovine, goats, swine, sheep),
    function(left, right) {
      dplyr::full_join(left, right, by=c("country", "year"))
    }
)

And that's it! We accomplished this by:

1. Looping over a list of our livestock

list(bovine, goats, swine, sheep)

2. applying dplyr::full_join which takes two arguments

function(left, right) {
  dplyr::full_join(left, right, by=c("country", "year"))
}

3. and chaining the output of one step as the input for the next step

Conclusion

Thus the elegance of the reduce operation comes from its flexibility to handle n datasets (assuming your machine has enough memory 😜)

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Cheers!

Deploying a free tier relational database with Amazon RDS

Chris Greening ・ Mar 28 '22

#aws #beginners #tutorial #database

DEV Community

Joining multiple datasets on the same column in R using dplyr and purrr

Table of Contents

Chris Greening - Software Developer

Installing prerequisite packages

Examining our sample datasets

Using dplyr::full_join to manually join two datasets at a time

Understanding the reduce operation

Leveraging purrr::reduce to join multiple datasets

Conclusion

Chris Greening - Software Developer

Deploying a free tier relational database with Amazon RDS

Chris Greening ・ Mar 28 '22

Connecting to a relational database using SQLAlchemy and Python

Chris Greening ・ Apr 30 '22

Additional resources

Top comments (0)

Read next

Enhancing LLM Performance at Scale with CDN-Based Knowledge Injection

Inside Commercial-Level Video AI: Unveiling Allegro, a Groundbreaking Video Generation Model

Why Every Developer Should Understand OOP: A Complete Breakdown

Tim Berners-Lee : The Man Behind the Web