Justin Wheeler

Posted on Dec 19, 2020 • Edited on Sep 22, 2022

AWS ML Recommendation Engine

#aws #cloudguruchallenge #machinelearning #serverless

TLDR

acloud.guru released a Machine Learning #CloudGuruChallenge to build a recommendation engine. I completed this task and went one step further to build a Serverless PHP website that uses the data from the ML model to provide movie recommendations.

Check it out for yourself! http://wheelerrecommends.com/

Introduction

On October 19th, 2020, acloud.guru announced another #CloudGuruChallenge. This time the challenge was to build a Netflix-style recommendation engine using Amazon SageMaker.

Originally I didn't want to do a movie-based engine, but I really struggled to find good datasets for other types of data. My first thought was a children's book recommendation engine that I could use to find books for my son. However, I couldn't find anything I could use.

Kesha Williams suggested using the IMDb datasets, which would make it easier to get started right away.

The Process

I jumped into the AWS Console to spin up a SageMaker notebook instance. I quickly realized how little I actually knew about Machine Learning. It would definitely be a great opportunity to learn.

Step 1. Upload the IMDB Data to S3

Before I could use the IMDB datasets I had to get them into some accessible storage. The service I chose was AWS S3.

I accomplished this with a combination of Python and shell commands in a Jupyter notebook. I've used boto3 with Python before, and I've used the AWS CLI before. However, I'm new to Jupyter and I wanted to see if I could mix the two together. Perhaps this is bad practice, but I couldn't find anything on the web to suggest that.

def s3_upload(filename):

    bucket_name = "wheeler-cloud-guru-challenges"
    object_name = f"1020/IMDB/{filename}"

    !aws s3 cp "{filename}" "s3://{bucket_name}/{object_name}"

IMDB-To-S3.ipynb

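Note: The ! in this function tells Jupyter to run the rest of the line as a shell command. Python variables wrapped in {braces} are substituted before the command runs.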
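For comparison, here is a minimal sketch of the same upload done with boto3 alone instead of the shell escape. It assumes the caller passes in a boto3 S3 client; the helper name build_object_name and the example filename in the usage comment are illustrative and not taken from the notebook.

```python
def build_object_name(filename, prefix="1020/IMDB"):
    """Build the S3 key used in the post: <prefix>/<filename>."""
    return f"{prefix}/{filename}"


def s3_upload(s3_client, filename, bucket_name="wheeler-cloud-guru-challenges"):
    """Upload a local file through an S3 client rather than the AWS CLI.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    object_name = build_object_name(filename)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client method
    s3_client.upload_file(filename, bucket_name, object_name)
    return f"s3://{bucket_name}/{object_name}"


# Usage (needs AWS credentials and boto3 installed):
#   import boto3
#   s3_upload(boto3.client("s3"), "title.basics.tsv.gz")
```

Passing the client in as an argument keeps the function easy to test without touching AWS.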

Step 2. Feature Engineering

I had studied feature engineering before, so I was familiar with concepts like one-hot encoding and min/max scaling, although I had never used them in practice. The biggest hurdle I faced was that my data did not fit into memory. This was frustrating because I couldn't comprehend how to mold the data if I couldn't load the data.

At that point I recalled that SQL is a popular tool for feature engineering and decided to play more to my strengths. I used Amazon Athena to query my data stored on S3 with standard SQL.

I thought, "Now we're cooking with fire!" 🔥. Since I am very familiar with SQL it is no surprise that I was able to get the data wrangled in no time.

select 
    res.*,
    contains(res.genres, 'Action') isaction,
    contains(res.genres, 'Adventure') isadventure,
    contains(res.genres, 'Comedy') iscomedy,
    contains(res.genres, 'Crime') iscrime,
    contains(res.genres, 'Documentary') isdocumentary,
    contains(res.genres, 'Drama') isdrama,
    contains(res.genres, 'Horror') ishorror,
    contains(res.genres, 'Mystery') ismystery,
    contains(res.genres, 'Romance') isromance,
    contains(res.genres, 'Thriller') isthriller
from (
select distinct 
    tbasics.tconst id,
    tbasics.primarytitle title,
    tbasics.startyear year,
    split(tbasics.genres, ',') genres,
    tratings.averagerating rating,
    tratings.numvotes votes
from title_basics tbasics
join title_akas takas on tbasics.tconst = takas.titleid
join title_ratings tratings on tbasics.tconst = tratings.tconst
where tbasics.startyear <= year(now()) 
and tbasics.titleType = 'movie' 
and tbasics.isadult = 0
and takas.language = 'en'
and takas.region = 'US'
) res;

data-exploration.sql

Honestly, this was my first use case for Athena, and it delivered perfectly. My Athena bill was literally $0.02. More importantly, it saved me hours of time.
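The contains(...) columns in the query are a form of one-hot encoding: one boolean flag per genre. As a rough illustration of the same idea outside Athena, here is a small pure-Python sketch. The function name one_hot_genres is hypothetical, and the input is assumed to be the raw comma-separated genres string from the IMDb file.

```python
# The ten genres flagged in the Athena query
GENRES = ["Action", "Adventure", "Comedy", "Crime", "Documentary",
          "Drama", "Horror", "Mystery", "Romance", "Thriller"]


def one_hot_genres(genres_csv):
    """Return one boolean flag per known genre, mirroring the SQL columns."""
    present = set(genres_csv.split(","))
    return {f"is{genre.lower()}": genre in present for genre in GENRES}


flags = one_hot_genres("Action,Comedy")
# flags["isaction"] and flags["iscomedy"] are True, every other flag is False
```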

Step 3. Data Exploration

Now I had to visualize the data I prepared in Athena. The tool of choice was Seaborn back in Jupyter Notebooks.

year_df = imdb_df.groupby('year')['id'].nunique()
seaborn.set(style='darkgrid')
seaborn.lineplot(data=year_df)

I graphed the data by year so I could see if I had an even spread of data across the years. I realized that I had more data for recent years, which makes sense, as more movies are typically made each year than in the year before.

imdb_df['rating'] = imdb_df['rating'].apply(lambda x: int(round(x)))
rating_df = imdb_df.groupby('rating')['id'].nunique()
seaborn.set(style='whitegrid')
seaborn.lineplot(data=rating_df)

Next I graphed the data by rating. I wanted to see if my assumption was correct that the greatest number of records would fall in the middle. I was right! This assumption was based on the idea that most movies are average.
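One detail worth knowing when bucketing ratings with int(round(x)): Python's built-in round() rounds halves to the nearest even integer, so 6.5 lands in bucket 6 while 7.5 lands in bucket 8. The sketch below uses made-up ratings to compare that behavior with a round-half-up alternative. It is only an illustration and not part of the original notebook.

```python
import math
from collections import Counter

ratings = [6.5, 7.5, 6.4, 7.2, 5.5]  # made-up sample ratings

# Built-in round() sends halves to the nearest even integer
bankers = Counter(int(round(r)) for r in ratings)

# Adding 0.5 and flooring always rounds halves up
half_up = Counter(math.floor(r + 0.5) for r in ratings)

print(bankers)
print(half_up)
```

With real data the difference only affects ratings ending in .5, but it can shift a visible share of titles between neighboring buckets.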

Step 4. K-Means Modeling

I thought I was ready to dive into modeling now. I took some help from Amazon's SageMaker Examples.

💥 Boom! 💥

I was not ready... I hit some interesting errors. It took me some time before I realized that the KMeans functions expected all the data to be float32. All of it? Yes.

Step 4. Feature Engineering (Round 2)

Initially, I removed all of the non-numeric columns, but then the model could not use them! My first round of predictions was truthfully pretty awful.

If you're curious: Attempt 1.

I still wanted the model to use these fields so I had to convert them to be numeric. I wrote custom functions to map the data depending on type.

I mapped the genres first, based on a static map.
e.g. Action=1, Adventure=2, etc.

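def map_genre_to_numeric(genres):
    # genres arrives as a string such as "[Action, Adventure]"
    genres_array = genres.replace('[', '').replace(']', '').split(', ')
    # genre_map is the static lookup described above (Action=1, Adventure=2, ...)
    return ''.join([str(genre_map[g]) for g in genres_array])

train_df['genres'] = train_df['genres'].map(map_genre_to_numeric)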
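To make the mapping concrete, here is a small worked example. Only Action=1 and Adventure=2 come from the post; the remaining entries of genre_map below are assumed for illustration, as is the bracketed input format.

```python
# Hypothetical static map: only Action=1 and Adventure=2 are stated in the post
genre_map = {"Action": 1, "Adventure": 2, "Comedy": 3, "Crime": 4,
             "Documentary": 5, "Drama": 6, "Horror": 7, "Mystery": 8,
             "Romance": 9, "Thriller": 10}


def map_genre_to_numeric(genres):
    """Concatenate the numeric code of each genre into one digit string."""
    genres_array = genres.replace('[', '').replace(']', '').split(', ')
    return ''.join(str(genre_map[g]) for g in genres_array)


print(map_genre_to_numeric("[Drama]"))
print(map_genre_to_numeric("[Action, Adventure]"))
print(map_genre_to_numeric("[Action, Thriller]"))
```

Because the codes are concatenated rather than added, the resulting number grows with the count of genres, and a genre missing from the map raises a KeyError.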

Next I scaled ratings, the trickiest part for me. It's obvious to humans that a four star product with one review is not nearly the same as a four star product with one million reviews. How do I convey that to the model?

I decided to introduce a check that took a percentage of the real rating based on the number of votes it had. Only if the movie had at least one thousand votes would it keep its actual rating.

def map_rating_to_numeric(rating, votes):

    result = 0
    if votes is None:
        result = 0
    elif votes < 10:
        result = rating * 0.1
    elif votes < 100:
        result = rating * 0.3
    elif votes < 500:
        result = rating * 0.5
    elif votes < 1000:
        result = rating * 0.7
    elif votes >= 1000: 
        result = rating
    else:
        print(f"Unexpected # of Votes: {votes}")

    return result
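As a point of comparison, a smoother way to express the same intuition is a vote-weighted average that pulls low-vote titles toward an overall mean rating instead of toward zero. This is a different technique from the step function above, not what the post used. The parameter names mean_rating and min_votes and their values are assumptions chosen for illustration.

```python
def weighted_rating(rating, votes, mean_rating, min_votes=1000):
    """Blend a title's rating with the overall mean, weighted by vote count.

    With few votes the result stays near mean_rating; with many votes it
    approaches the title's own rating.
    """
    if not votes:  # covers None and 0
        return mean_rating
    weight = votes / (votes + min_votes)
    return weight * rating + (1 - weight) * mean_rating


# A title rated 8.0 with exactly min_votes votes lands halfway to the mean
print(weighted_rating(8.0, 1000, mean_rating=6.0))
```

The trade-off is that this version needs a mean rating computed over the dataset, while the step function only needs the row itself.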

Following that, I mapped each character of the title to its Unicode code point using the built-in ord() function and joined the results into one number. At first I was mapping the entire title, but long movie titles produced numbers so large that they were cast to infinity.

The purpose behind this was so that movie series would be grouped together. e.g. Deadpool 1 and Deadpool 2. I decided I could take a subset of the title to achieve the same result without going out of bounds. I took a substring of the title limited to nine characters. The reason? Nine is my favorite number.

def map_title_to_numeric(title):
    # Only the first nine characters are encoded to keep the number bounded
    return ''.join([str(ord(c)) for c in title[:9]])
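The sketch below shows both effects described above: titles that share their first nine characters encode to the same value, and an unbounded encoding overflows when converted to a float. The helper name encode_title and the sample titles are illustrative. The overflow demo uses a plain Python float (64-bit); a float32 column would overflow much sooner, at roughly 39 digits.

```python
import math


def encode_title(title, limit=9):
    """Join the code points of the first `limit` characters (None = whole title)."""
    return ''.join(str(ord(c)) for c in title[:limit])


first = encode_title("Deadpool 1")
second = encode_title("Deadpool 2")
print(first == second)  # both reduce to the code points of "Deadpool "

# A made-up 150-character title produces 450 digits when encoded in full
long_digits = encode_title("x" * 150, limit=None)
print(len(long_digits), math.isinf(float(long_digits)))
```

One caveat: a title shorter than nine characters yields a shorter digit string, so "Deadpool" on its own would not encode to the same number as "Deadpool 2".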

Finally, I conducted min/max scaling to ensure these columns would be uniform.

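from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler_columns = ['genres', 'rating', 'title', 'year']
# Assign the scaled array directly so values line up by position, not by index
train_df[scaler_columns] = scaler.fit_transform(train_df[scaler_columns])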
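For reference, min/max scaling maps each column onto the range 0 to 1 with x' = (x - min) / (max - min). A minimal pure-Python sketch of that formula for a single column follows; the function name min_max_scale and the sample years are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([1990, 2000, 2020]))
```

After scaling, a column with huge values (such as the encoded title) no longer dominates the distance calculations K-Means relies on.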

Step 5. K-Means Modeling (Round 2)

Now I'm confident! Time to train (for real). I initialized the K-Means model with 1 instance and 9 clusters (you know why).

import sagemaker
from sagemaker import KMeans

sage_session = sagemaker.Session()
sage_outputs = f"s3://{sage_session.default_bucket()}/imdb/"

kmeans = KMeans(role=sagemaker.get_execution_role(),
                instance_count=1,
                instance_type='ml.c4.xlarge',
                output_path=sage_outputs,
                k=9)

Then I trained and deployed the model.

# record_set expects float32 features
train_data = train_df.values.astype('float32')
kmeans.fit(kmeans.record_set(train_data))
model = kmeans.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Once I had the model, I sent the training data through the model to make its predictions. Boy was this fast! ⚡

predictions = model.predict(train_data)

result_df['cluster'] = np.nan
result_df['distance'] = np.nan

for i, p in enumerate(predictions):
    result_df.at[i, 'cluster'] = p.label['closest_cluster'].float32_tensor.values[0]
    result_df.at[i, 'distance'] = p.label['distance_to_cluster'].float32_tensor.values[0]
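To clarify what the two fields in each prediction mean, here is a minimal local sketch of the assignment step of K-Means: pick the nearest centroid and report the distance to it. It assumes Euclidean distance and uses made-up two-dimensional centroids. It is a conceptual illustration, not how the SageMaker endpoint is implemented.

```python
import math


def assign_cluster(point, centroids):
    """Return (index of the nearest centroid, distance to that centroid)."""
    distances = [math.dist(point, c) for c in centroids]
    closest = min(range(len(centroids)), key=distances.__getitem__)
    return closest, distances[closest]


centroids = [(0.0, 0.0), (1.0, 1.0)]  # made-up cluster centers
cluster, distance = assign_cluster((0.1, 0.1), centroids)
print(cluster, distance)
```

A small distance means the movie sits close to the center of its cluster; a large one means it is an outlier within that group.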

Now, this next part is really important! Delete the SageMaker endpoint to stop hourly billing.

sagemaker.Session().delete_endpoint(model.endpoint_name)

My AWS Budgets alert reminded me that I had forgotten this step. Don't be like me: clean up your resources.

From there, I dumped the results to a CSV. We're done! Right? Technically yes, yet I couldn't stop.

Finalized Results: Attempt 2.
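One plausible way to turn a results CSV like this into recommendations is to look up a movie's cluster and return other titles from the same cluster. The sketch below assumes columns named id, title, cluster, and distance, and ranks candidates by their distance to the cluster center. The sample rows are made up, and the post does not say how the website actually ranks results.

```python
import csv
import io

# Made-up sample rows standing in for the exported results CSV
SAMPLE = """id,title,cluster,distance
m1,Movie A,3,0.10
m2,Movie B,3,0.40
m3,Movie C,3,0.25
m4,Movie D,5,0.05
"""


def recommend(rows, movie_id, limit=5):
    """Return titles from the same cluster, closest to the cluster center first."""
    target = next(r for r in rows if r["id"] == movie_id)
    same_cluster = [r for r in rows
                    if r["cluster"] == target["cluster"] and r["id"] != movie_id]
    same_cluster.sort(key=lambda r: float(r["distance"]))
    return [r["title"] for r in same_cluster[:limit]]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(recommend(rows, "m1"))
```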

The Serverless PHP Website

I wanted to build a simple website to make use of the ML model's findings. It was ambitious, but I had recently learned PHP with the help of Pluralsight and chose to put those skills to the test.

Oh wait, I'm a serverless guy, that won't work... or will it? I did some digging and found Bref -- a literal godsend. This software lets you run PHP applications in an AWS Lambda function behind an AWS API Gateway endpoint for serverless websites. 🤯

I thought to myself, "Am I dreaming?"

Website Architecture

Traffic is routed to CloudFront by Route53
Static content is fetched from S3
Dynamic content is fetched from API Gateway
API Gateway interacts with Lambda to generate the content

This was working really well! Honestly, I was shocked. It was a tad confusing how the API Gateway would interact with the client.

The design is actually simpler now than it was before. I learned from the Bref documentation that this has been possible since June with the use of Lambda Layers. This very month, AWS announced Container Image Support for Lambda during its annual re:Invent conference.

Now I'm thinking, "It was meant to be."

Source:
https://aws.amazon.com/blogs/compute/introducing-the-new-serverless-lamp-stack/

In the beginning, I implemented Geo Restrictions on the CloudFront distribution to limit traffic to the United States.

I did this for three main reasons:

Reduce the cost
Protect against attacks from outside the United States
Get hands on with Geo Restrictions

GeoPeeker Results from Blacklisting

That functionality worked fine and was easy to set up. However, I scrapped it when I wanted my foreign friends to share their opinions with me. It's okay though, because I learned how to do it.

Overall, I had a fun time with this challenge. It makes me eager to see what challenge they'll give us next. I'm thankful that acloud.guru decided to extend the deadline on this challenge, or I would not have had time to finish during this busy holiday season.

Here is the website link along with the source code and my LinkedIn page. Please, please, please, let me know what you think. I'd love to take your feedback or even connect on LinkedIn.

Website
https://wheelerrecommends.com/

GitHub
https://github.com/wheelerswebservices/cgc-aws-ml-recommendation-engine

LinkedIn
https://www.linkedin.com/in/wheelerswebservices/