Product developers’ guide to customize data for AI - Part 3: Merge multiple dataframes

#pandas #deved #dataframes #machinelearning

TLDR

As a step towards data processing, learn how to merge multiple datasets together and analyze the story behind the data.

Outline

Introduction
Before we begin
Merging all data
Email subscriptions
Conclusion

Introduction

Nowadays, it’s common for products to integrate AI into many applications or features. In this guide, we’ll look at how AI is used in email marketing campaigns and start combining our datasets. Finally, we’ll wrap up this series with a look into data preparation so we’re ready to train a machine learning model with our dataset.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffcekywmf0t1qs0ezwry.png)_Collecting data on the latest campaign (Source: [B2C](https://www.business2community.com/email-marketing/the-most-important-types-of-emails-you-need-for-email-marketing-success-02219324))_

Before we begin

By now, you should already be familiar with combining, filtering, and sorting dataframes. If not, read the 2nd part of our introduction series and our data preparation series. We’ll start by introducing these 3 datasets: email_content, user_emails, and user_profiles.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgz95vdxht6ld7pc69b0.png)

Merging all data

After taking a look at these datasets, we’ll look for shared columns that can be used to merge them. We’ll take the dataset with the most matching ids, user_emails, and connect it with both email_content and user_profiles.

Glancing at the dataframe, we see that user_emails has 2 ids: a user_id and an email_id. These correspond to the id in user_profiles and email_content respectively.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xzrxff08r4gz1p5egwq.png)

I’ll start by renaming the columns from id to their matching column name.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fvyelni9hyev8feobve5.png)_user_id and email_id_
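
The renaming step can be sketched as follows. This is a minimal example, assuming both lookup tables expose their key as a column named id; the tiny dataframes and the name column are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the datasets; the real columns may differ.
user_profiles = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Ben"]})
email_content = pd.DataFrame({"id": [10, 11], "category": ["promotional", "transactional"]})

# Rename each id column so it matches the key used in user_emails.
user_profiles = user_profiles.rename(columns={"id": "user_id"})
email_content = email_content.rename(columns={"id": "email_id"})

print(user_profiles.columns.tolist())
print(email_content.columns.tolist())
```

Note that rename returns a new dataframe, so the result is assigned back to the same variable.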

Then we can call merge. I chose a left join here because our main dataset, user_emails, is on the left, so every row in it is kept.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4t2z83idvwwxdclinft.png) ![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2k0on36u2l9md0h7clw.png)_All the columns from all 3 datasets merged into 1_
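
As a rough sketch of the two merges, assuming the key columns have already been renamed to user_id and email_id, the calls might look like this. The sample rows and the name column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the key column names come from the article.
user_emails = pd.DataFrame({
    "user_id": [1, 1, 2],
    "email_id": [10, 11, 10],
    "unsubscribed": [False, True, False],
})
user_profiles = pd.DataFrame({"user_id": [1, 2], "name": ["Ada", "Ben"]})
email_content = pd.DataFrame({
    "email_id": [10, 11],
    "category": ["promotional", "transactional"],
    "theme": ["lifestyle", "food"],
})

# Left joins keep every row of user_emails and attach the matching
# profile and email details to each one.
merged = (
    user_emails
    .merge(user_profiles, on="user_id", how="left")
    .merge(email_content, on="email_id", how="left")
)

print(merged)
```

Because each key is unique in its lookup table here, the merged dataframe has the same number of rows as user_emails.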

And that’s it. We’ve completed preparing our data for training a machine learning model.

Email subscriptions

Next, we’re going to make scrappy inferences based on the data. First, let’s get the raw data by answering questions about our email campaign user subscriptions. Then, we’ll move to creating a model to determine the likelihood of churn.

You’ve got to submit your next marketing report on your leads. The boss wants to know how well you’ve done and how the company has grown, if at all, since its inception.

To do this, you’ll need to grab the data on 3 customer metrics.

How many customers have unsubscribed at least once?

We start off by picking out the data that matters: the user_id and the unsubscribed status. Note that we drop the remaining columns because we care about the users, not the number of emails.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pavmvvag88ojl39cfwva.png)_Filter where the user_id is unique and unsubscribed value is yes._
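
One way to count these users is sketched below. It assumes unsubscribed is stored as a boolean; if the column holds a string such as "yes", the mask would compare against that value instead. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one row per email sent, so a user can appear many times.
merged = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "unsubscribed": [True, True, False, False, True],
})

# Keep only the rows where the user unsubscribed, then count each user once.
unsubscribed_users = merged.loc[merged["unsubscribed"], "user_id"].nunique()

print(unsubscribed_users)
```

Using nunique rather than a plain row count avoids counting a user twice when they unsubscribed from more than one email.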

Which type of subscription service was the least popular?

Once again, we only want the user_id, unsubscribed status, and category.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msfrwz940phi108bxw8e.png)_Filter for the user_id, unsubscribed, and category._

Then we group these by each type of subscription, or “category”, and count.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nts5pzdu7wxjgzmn2rv0.png)_A lot of people don’t like promotional emails!_
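
The filter and group-by steps could look roughly like this. The sketch assumes a boolean unsubscribed column, and the sample rows, including the "newsletter" category, are invented for illustration.

```python
import pandas as pd

# Hypothetical merged data; category values are illustrative only.
merged = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "unsubscribed": [True, True, False, True, False],
    "category": ["promotional", "promotional", "transactional", "newsletter", "promotional"],
})

# Keep only the columns needed for this question.
subset = merged[["user_id", "unsubscribed", "category"]]

# Count unsubscribes per category, with the largest count first.
unsub_by_category = (
    subset[subset["unsubscribed"]]
    .groupby("category")["user_id"]
    .count()
    .sort_values(ascending=False)
)

print(unsub_by_category)
```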

Which topics are our customers not as interested in?

Once again, we only want the user_id, unsubscribed, and theme.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52uapz2gdez8kplvaojx.png)_Filter for the user_id, unsubscribed, and theme._ ![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2skdqmqbwg7t94q4svpp.png)_Customers tend to unsubscribe from lifestyle-related emails._
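
Raw counts can favor themes that were simply sent more often. As a variation that is not part of the original walkthrough, the sketch below computes the share of emails per theme that ended in an unsubscribe. It assumes a boolean unsubscribed column, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical merged data; theme values are illustrative only.
merged = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "unsubscribed": [True, True, False, False, False, True],
    "theme": ["lifestyle", "lifestyle", "lifestyle", "food", "food", "food"],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the unsubscribe rate for each theme.
unsub_rate_by_theme = (
    merged.groupby("theme")["unsubscribed"]
    .mean()
    .sort_values(ascending=False)
)

print(unsub_rate_by_theme)
```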

Conclusion

Based on these metrics, we infer that we’ll want to avoid promotional emails, as well as those about improving lifestyle. Users of the platform seem to be interested in emails about food and those which are purely transactional. In a future series, we’ll revisit this data to create a model that does better than simply picking the best of each category and theme.