<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: thelogicdudes</title>
    <description>The latest articles on DEV Community by thelogicdudes (@logicdudes).</description>
    <link>https://dev.to/logicdudes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851366%2F5a73d237-efb0-4e12-ad36-3d98de49819d.png</url>
      <title>DEV Community: thelogicdudes</title>
      <link>https://dev.to/logicdudes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/logicdudes"/>
    <language>en</language>
    <item>
      <title>How to Build a Machine Learning Recommendation Engine w/ TensorFlow &amp; HarperDB</title>
      <dc:creator>thelogicdudes</dc:creator>
      <pubDate>Wed, 24 Aug 2022 18:49:00 +0000</pubDate>
      <link>https://dev.to/harperfast/how-to-build-a-machine-learning-recommendation-engine-w-tensorflow-harperdb-51jo</link>
      <guid>https://dev.to/harperfast/how-to-build-a-machine-learning-recommendation-engine-w-tensorflow-harperdb-51jo</guid>
      <description>&lt;h4&gt;
  
  
  This article explains how to create a recommendation system using &lt;a href="https://harperdb.io/" rel="noopener noreferrer"&gt;HarperDB&lt;/a&gt; and &lt;a href="https://harperdb.io/docs/custom-functions/" rel="noopener noreferrer"&gt;Custom Functions&lt;/a&gt;.
&lt;/h4&gt;

&lt;p&gt;With HarperDB we are able to deploy a machine learning model to servers located on the edge of the Internet to provide users with content recommendations. We have two examples to demonstrate how this can be done: a book recommender and a song recommender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2z64h9l9fpfnn0vo8z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2z64h9l9fpfnn0vo8z6.png" alt=" " width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  An Intro to Recommendation Systems
&lt;/h2&gt;

&lt;p&gt;Recommendation engines are arguably the most widely used application of machine learning. We interact with them everywhere, and they have become the driving force for how we explore new content. And while some are less than perfect, even the weakest of them are miles ahead of the methods we previously had for finding new things.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdnu8mwamef9ss04082c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdnu8mwamef9ss04082c.jpg" alt=" " width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take shows and movies, for example. Do you remember the preview channel… staring at it for hours, trying to guess which titles would be worth further exploration? The same was true for audio content, where we were limited to radio stations to find new music. And books were even more difficult. I'd walk along the bookstore aisles picking up random books to read their summaries before finally finding one to take a chance on.&lt;/p&gt;

&lt;p&gt;All of these methods did work, but they limited us. We could only truly explore the content which was made available by the system at hand - such as the preview channel, radio station, and bookstore shelves. But there is so much more content out there, especially today. So instead of having to choose from the limited options presented by the Best Sellers section at the bookstore, how does one find new content that’s based on their individual interests? &lt;/p&gt;

&lt;p&gt;Recommendation Systems!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Recommendation System? How do they work?
&lt;/h2&gt;

&lt;p&gt;Recommendation systems are exactly what they sound like: systems that provide recommendations. You give one an idea of what you like, and it points you toward other things you might like.&lt;/p&gt;

&lt;p&gt;There are many ways to set up such a system. For example, if we're talking about products, a system could be built that finds all of the users who bought the same product as you, and then displays the most common products among that group. This could be expanded by adding more items to the input: instead of a single product, take three products that you bought, group together all of the users who also bought those three items, and then display the most common other products among that group.&lt;/p&gt;

&lt;p&gt;What if we include ratings with the items of interest? For example, if many people are watching one particular movie, does that automatically mean it’s a good movie to recommend? And if not, how would we prevent it from appearing in the recommendations? Once ratings and reviews are included, we could set up a filter that removes any poorly reviewed content, ensuring the group of items we're aggregating always contains positive content. This is a great starting point, and it works for many cases where you have a dataset of items and user interactions with those items.&lt;/p&gt;
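
&lt;p&gt;As a rough illustration of the aggregate-and-filter approach described above, here's a minimal sketch in plain Python. All of the data and names here are made up for demonstration; they aren't part of the HarperDB projects below.&lt;/p&gt;

```python
from collections import Counter

# Toy purchase data: user -> {item: rating out of 5}. Entirely hypothetical.
purchases = {
    'alice': {'book_a': 5, 'book_b': 4, 'book_c': 2},
    'bob':   {'book_a': 4, 'book_b': 5, 'book_d': 5},
    'carol': {'book_a': 5, 'book_d': 4, 'book_e': 1},
    'dave':  {'book_b': 3, 'book_e': 4},
}

def recommend(liked_items, min_rating=3, top_n=2):
    """Recommend items bought by users who share purchases with `liked_items`,
    skipping items the user already has and items that were rated poorly."""
    counts = Counter()
    for user, items in purchases.items():
        # only look at users who bought at least one of the same items
        if not liked_items.intersection(items):
            continue
        for item, rating in items.items():
            if item not in liked_items and rating >= min_rating:
                counts[item] += 1
    return [item for item, _ in counts.most_common(top_n)]

print(recommend({'book_a'}))
```

&lt;p&gt;Even this toy version shows the limitation discussed next: an item can only surface once enough overlapping users have interacted with it.&lt;/p&gt;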

&lt;p&gt;The challenging part arises from the evolving nature of content. How would a song, book, or product that's new get into the output of the recommendation system?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35uv2c28sv26t0rkb1j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35uv2c28sv26t0rkb1j4.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Machine Learning
&lt;/h2&gt;

&lt;p&gt;While we could keep adding manual filters to the content by sorting the ratings and weighting interactions with new items, there would be areas where we fall short, especially when creating binary gates such as a positive/negative rating threshold to determine which items are included in the output.&lt;/p&gt;

&lt;p&gt;This is where machine learning takes over. Using libraries such as &lt;a href="https://www.tensorflow.org/recommenders" rel="noopener noreferrer"&gt;TensorFlow Recommenders&lt;/a&gt; with &lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;Keras&lt;/a&gt; models, it's easy to shape the data in ways that allow the items and users to be viewed and compared from a multidimensional perspective. Qualitative features such as item categories and user profile attributes can be mapped into mathematical representations that can be quantitatively compared with one another, ultimately providing new insights and better recommendations.&lt;/p&gt;
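
&lt;p&gt;To make the embedding idea concrete, here's a small sketch using made-up vectors. In the real pipeline these vectors are learned by &lt;code&gt;tf.keras.layers.Embedding&lt;/code&gt; during training; the point here is only that once a qualitative feature becomes a vector, similarity becomes a number we can rank by.&lt;/p&gt;

```python
import numpy as np

# Hypothetical categories and a hypothetical "learned" embedding table:
# one 3-d row per category. The values are invented for illustration.
categories = ['rock', 'jazz', 'metal', 'classical']
cat_to_idx = {c: i for i, c in enumerate(categories)}
embeddings = np.array([
    [0.9, 0.1, 0.0],   # rock
    [0.1, 0.8, 0.3],   # jazz
    [0.8, 0.0, 0.2],   # metal
    [0.0, 0.7, 0.6],   # classical
])

def cosine(a, b):
    # cosine similarity: 1.0 means identical direction, 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rock = embeddings[cat_to_idx['rock']]
sims = {c: cosine(rock, embeddings[cat_to_idx[c]]) for c in categories}
print(sims)
```

&lt;p&gt;With made-up vectors like these, "rock" comes out closer to "metal" than to "classical", which is exactly the kind of quantitative comparison the embedding layers make possible.&lt;/p&gt;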

&lt;p&gt;Using a machine learning pipeline like this also allows the models to be continuously trained: we take the results of the previous model, measure how successful its recommendations were at driving user interactions, and use that data to create better models. I recently wrote an article about using TensorFlowJS and HarperDB for Machine Learning if you’d like to learn more. &lt;/p&gt;

&lt;h2&gt;
  
  
  HarperDB Recommenders
&lt;/h2&gt;

&lt;p&gt;One of the awesome features of living on the edge is the ability to connect users to new insights with low latency and minimal traffic. With HarperDB Custom Functions, we can deploy a model to multiple nodes in a cluster, allowing the closest and most available node to return the result to the user.&lt;/p&gt;

&lt;p&gt;Let’s look at two example projects, a song recommender and a book recommender, to demonstrate how this looks when serving recommendations to a user based on content they provide.&lt;/p&gt;

&lt;p&gt;To keep the examples reusable and straightforward, these projects cover the serving side of the process, where an already-trained model answers requests. In production we would connect a second piece to this puzzle: tracking user interactions in a central location, training new models, and deploying them back out to the edge with HarperDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  HarperDB Song Recommender
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy97ocu00uvvhvauaf3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy97ocu00uvvhvauaf3v.png" alt=" " width="678" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Github repo: &lt;a href="https://github.com/HarperDB/song-recommender" rel="noopener noreferrer"&gt;HarperDB/song-recommender&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our Song Recommender example, a user can find three of their favorite songs in the UI, and then the recommendation system returns other songs they're likely to enjoy.&lt;/p&gt;

&lt;p&gt;There's a dataset called &lt;a href="http://millionsongdataset.com/" rel="noopener noreferrer"&gt;The Million Song Dataset&lt;/a&gt; that contains very detailed information on over one million songs, ranging from audio analysis to the location of the artist.&lt;/p&gt;

&lt;p&gt;There's a nice subset of that data called &lt;a href="http://millionsongdataset.com/tasteprofile/" rel="noopener noreferrer"&gt;The Echo Nest Taste Profile Subset&lt;/a&gt; which is a list of users, songs, and play counts. Each row in the data contains a userid, songid, and playcount. This is what we used to build the model for this project.&lt;/p&gt;
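
&lt;p&gt;The Taste Profile subset is distributed as tab-separated (user, song, play count) triplets. Here's a rough sketch, using made-up sample rows, of how rows like that can be turned into per-user play lists ordered by play count; the variable name &lt;code&gt;users_songs&lt;/code&gt; matches the training snippet below, though the exact notebook code may differ.&lt;/p&gt;

```python
import io
from collections import defaultdict

# A few invented rows in the (user_id \t song_id \t play_count) triplet format.
raw = io.StringIO(
    "user_a\tSOAAA\t12\n"
    "user_a\tSOBBB\t3\n"
    "user_a\tSOCCC\t7\n"
    "user_b\tSOAAA\t1\n"
    "user_b\tSODDD\t9\n"
)

plays = defaultdict(list)
for line in raw:
    user, song, count = line.rstrip('\n').split('\t')
    plays[user].append((song, int(count)))

# users_songs: each user's songs ordered by play count, most played first
users_songs = {
    user: [song for song, _ in sorted(songs, key=lambda s: s[1], reverse=True)]
    for user, songs in plays.items()
}
print(users_songs['user_a'])
```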

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xf3tyavch84gc1c9qih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xf3tyavch84gc1c9qih.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training the Model
&lt;/h3&gt;

&lt;p&gt;We created a Two-Tower model, which is a design where two latent vectors are created, compared during training to shape the embedding layers, and compared during inference to find the best match.&lt;/p&gt;

&lt;p&gt;Tower One is the Query Tower. This is essentially the input to the equation, which in this case is three songs that the user selects.&lt;/p&gt;

&lt;p&gt;Tower Two is the Candidate Tower. In this case it's the users in the dataset.&lt;/p&gt;

&lt;p&gt;We find the user in the dataset most similar to the person using the UI, and then provide that user's most played songs as recommendations.&lt;/p&gt;
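
&lt;p&gt;The final lookup step is simple enough to sketch with made-up data: once the model names the best-matching dataset user, we return that user's most played songs, minus whatever the UI user already selected. The names and data below are invented for illustration.&lt;/p&gt;

```python
# Invented example data: each dataset user's songs, most played first.
user_top_songs = {
    'user_a': ['song_1', 'song_2', 'song_3', 'song_4'],
    'user_b': ['song_3', 'song_5', 'song_6'],
}

def recommendations(best_match, selected_songs, top_n=3):
    """Return the matched user's most played songs, skipping the input songs."""
    picks = [s for s in user_top_songs[best_match] if s not in selected_songs]
    return picks[:top_n]

print(recommendations('user_a', {'song_2'}))
```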

&lt;p&gt;To look at the training specifics, here's the &lt;a href="https://github.com/HarperDB/song-recommender/blob/main/notebooks/song_recommender.ipynb" rel="noopener noreferrer"&gt;training notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basic steps that are taken are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the most played songs for each user&lt;/li&gt;
&lt;li&gt;Create a query/candidate pair of three of those songs and the user ([songs], user)&lt;/li&gt;
&lt;li&gt;Build two models and set up TFRS metrics for training&lt;/li&gt;
&lt;li&gt;Feed the query/candidate pair into the model&lt;/li&gt;
&lt;li&gt;When training is complete, create a new model that wraps the above two&lt;/li&gt;
&lt;li&gt;Apply the BruteForce method to extract the best candidate match for a given input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Generating the Data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a dictionary of inputs and outputs
dataset = {'songs': [], 'user': []}
for user, songs in users_songs.items():
    # use 5x the number of songs gathered for the user
    for _ in range(len(songs) * 5):
        # randomly select n_songs_in songs from the user's play history
        selected_songs = np.random.choice(songs, n_songs_in)
        # add them to the inputs
        dataset['songs'].append(selected_songs)
        # add the user to the output
        dataset['user'].append(user)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Building the Models&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create the query and candidate models
n_embedding_dimensions = 24

## QUERY
songs_in_in = tf.keras.Input(shape=(n_songs_in,))
songs_in_emb = tf.keras.layers.Embedding(n_songs+1, n_embedding_dimensions)(songs_in_in)
songs_in_emb_avg = tf.keras.layers.AveragePooling1D(pool_size=3)(songs_in_emb)
query = tf.keras.layers.Flatten()(songs_in_emb_avg)
query_model = tf.keras.Model(inputs=songs_in_in, outputs=query)

## CANDIDATE
user_in = tf.keras.Input(shape=(1,))
user_emb = tf.keras.layers.Embedding(n_users+1, n_embedding_dimensions)(user_in)
candidate = tf.keras.layers.Flatten()(user_emb)
candidate_model = tf.keras.Model(inputs=user_in, outputs=candidate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating the TensorFlow Recommenders Task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TFRS TASK SETUP
candidates = dataset.batch(128).map(lambda x: candidate_model(x['user']))
metrics = tfrs.metrics.FactorizedTopK(candidates=candidates)
task = tfrs.tasks.Retrieval(metrics=metrics)


## TFRS MODEL CLASS
class Model(tfrs.Model):
    def __init__(self, query_model, candidate_model):
        super().__init__()
        self._query_model = query_model
        self._candidate_model = candidate_model
        self._task = task

    def compute_loss(self, features, training=False):
        query_embedding = self._query_model(features['songs'])
        candidate_embedding = self._candidate_model(features['user'])
        return self._task(query_embedding, candidate_embedding)

## COMPILE AND TRAIN MODEL
model = Model(query_model, candidate_model)
# load model weights - this is to resume training
# model._query_model.load_weights(weights_dir.format('query'))
# model._candidate_model.load_weights(weights_dir.format('candidate'))

model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
model.fit(dataset.repeat().shuffle(300_000).batch(4096), steps_per_epoch=50, epochs=30, verbose=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying the Model
&lt;/h3&gt;

&lt;p&gt;To deploy this model, the final step in the above notebook is converting it to TensorFlowJS.&lt;/p&gt;

&lt;p&gt;From there, we add it to the Custom Functions directory along with the logic in recommend.js, which converts the user's three favorite songs into tensors that are fed to the model.&lt;/p&gt;

&lt;p&gt;The output of the model is a reference to the most similar user to the one using the UI.&lt;/p&gt;

&lt;p&gt;The top songs for that user are then returned and displayed in the UI as our recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating an input tensor and getting results&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if (!this.model) {
   const modelPath = path.join(__dirname, '../', 'tfjs-model', 'model.json');
   this.model = await tf.loadGraphModel(`file://${modelPath}`);
 }

 const inputTensor = tf.tensor([songIdxs], [1, 3], 'int32')
 const results = this.model.execute(inputTensor)
 const r0 = await results[0].data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  HarperDB Book Recommender
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff63bibtgygyboowbc9nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff63bibtgygyboowbc9nl.png" alt=" " width="637" height="181"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Github repo: &lt;a href="https://github.com/HarperDB/book-recommender" rel="noopener noreferrer"&gt;HarperDB/book-recommender&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our Book Recommender example, a user can find three of their favorite books in the UI, and then the recommendation system returns other books they're likely to enjoy.&lt;/p&gt;

&lt;p&gt;There's a &lt;a href="https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset" rel="noopener noreferrer"&gt;dataset on Kaggle&lt;/a&gt; that includes a list of users, books, and ratings for about 250,000 different titles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvp4q3zmrnu99ag5pezt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvp4q3zmrnu99ag5pezt.png" alt=" " width="800" height="238"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Training the Model
&lt;/h3&gt;

&lt;p&gt;We again created a Two-Tower model, which is a design where two latent vectors are created, compared during training to shape the embedding layers, and compared during inference to find the best match.&lt;/p&gt;

&lt;p&gt;Tower One is the Query Tower. This is essentially the input to the equation, which in this case is three books that the user rated highly (5 or above out of 10).&lt;/p&gt;

&lt;p&gt;Tower Two is the Candidate Tower. In this case it's the users in the dataset.&lt;/p&gt;

&lt;p&gt;We find the user in the dataset most similar to the person using the UI, and then provide that user's highest rated books as recommendations.&lt;/p&gt;

&lt;p&gt;To look at the training specifics, here's the &lt;a href="https://github.com/HarperDB/book-recommender/blob/main/notebooks/book_recommender.ipynb" rel="noopener noreferrer"&gt;training notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basic steps that are taken are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the top rated books for each user&lt;/li&gt;
&lt;li&gt;Create a query/candidate pair of three of those books and the user ([books], user)&lt;/li&gt;
&lt;li&gt;Build two models and set up TFRS metrics for training&lt;/li&gt;
&lt;li&gt;Feed the query/candidate pair into the model&lt;/li&gt;
&lt;li&gt;When training is complete, create a new model that wraps the above two&lt;/li&gt;
&lt;li&gt;Apply the BruteForce method to extract the best candidate match for a given input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Generating the Data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a dictionary of inputs and outputs
dataset = {'isbns': [], 'user': []}
for user_id, isbns in user_isbns.items():
    # use 5x the number of isbns gathered for the user
    # this ensures a larger amount of training data
    for _ in range(len(isbns) * 5):
        # randomly select n_isbns_in from the user's isbns
        selected_isbns = np.random.choice(isbns, n_isbns_in)
        # add them to the inputs
        dataset['isbns'].append(selected_isbns)
        # add the user to the output
        dataset['user'].append(user_idxs[user_id])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Building the Models&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create the query and candidate models
n_embedding_dimensions = 24

## QUERY
isbns_in_in = tf.keras.Input(shape=(n_isbns_in,))
isbns_in_emb = tf.keras.layers.Embedding(n_isbns+1, n_embedding_dimensions)(isbns_in_in)
isbns_in_emb_avg = tf.keras.layers.AveragePooling1D(pool_size=3)(isbns_in_emb)
query = tf.keras.layers.Flatten()(isbns_in_emb_avg)
query_model = tf.keras.Model(inputs=isbns_in_in, outputs=query)

## CANDIDATE
user_in = tf.keras.Input(shape=(1,))
user_emb = tf.keras.layers.Embedding(n_users+1, n_embedding_dimensions)(user_in)
candidate = tf.keras.layers.Flatten()(user_emb)
candidate_model = tf.keras.Model(inputs=user_in, outputs=candidate)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating the TensorFlow Recommenders Task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TFRS TASK SETUP
candidates = dataset.batch(128).map(lambda x: candidate_model(x['user']))
metrics = tfrs.metrics.FactorizedTopK(candidates=candidates)
task = tfrs.tasks.Retrieval(metrics=metrics)


## TFRS MODEL CLASS
class Model(tfrs.Model):
    def __init__(self, query_model, candidate_model):
        super().__init__()
        self._query_model = query_model
        self._candidate_model = candidate_model
        self._task = task

    def compute_loss(self, features, training=False):
        query_embedding = self._query_model(features['isbns'])
        candidate_embedding = self._candidate_model(features['user'])
        return self._task(query_embedding, candidate_embedding)

## COMPILE AND TRAIN MODEL
model = Model(query_model, candidate_model)
# load model weights - this is to resume training
# model._query_model.load_weights(weights_dir.format('query'))
# model._candidate_model.load_weights(weights_dir.format('candidate'))

model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
model.fit(dataset.repeat().shuffle(300_000).batch(4096), steps_per_epoch=50, epochs=30, verbose=1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the BruteForce Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create the index model to lookup the best candidate match for a query
index = tfrs.layers.factorized_top_k.BruteForce(model._query_model)
index.index_from_dataset(
    tf.data.Dataset.zip((
      dataset.map(lambda x: x['user']).batch(100),
      dataset.batch(100).map(lambda x: model._candidate_model(x['user']))
    ))
)
for features in dataset.shuffle(2000).batch(1).take(1):
    print('isbns', features['isbns'])
    scores, users = index(features['isbns'])
    print('recommended users', users)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying the Model
&lt;/h3&gt;

&lt;p&gt;To deploy this model, the final step in the above notebook is converting it to TensorFlowJS.&lt;/p&gt;

&lt;p&gt;From there, we add it to the Custom Functions directory along with the logic in recommend.js, which converts the user's three favorite books into tensors that are fed to the model.&lt;/p&gt;

&lt;p&gt;The output of the model is a reference to the most similar user to the one using the UI.&lt;/p&gt;

&lt;p&gt;The top rated books for that user are then returned and displayed in the UI as our recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;And there you have it, that’s how you can create a recommendation system with HarperDB Custom Functions.&lt;/p&gt;

&lt;p&gt;These machine learning models were pre-trained for these examples, as they do take 12+ hours to reach their most accurate state. In a production environment, there would most likely be a single instance responsible for continuously training the model and distributing it out to the other nodes on the edge.&lt;/p&gt;

&lt;p&gt;Go ahead and launch a HarperDB instance with Custom Functions where you can pull in one of these repos and get recommendations for new songs and books to check out. If you get an interesting result that you enjoy, please let us know!&lt;/p&gt;

&lt;p&gt;Thanks for reading,&lt;/p&gt;

&lt;p&gt;–Kevin&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Using TensorFlowJS &amp; HarperDB for Machine Learning</title>
      <dc:creator>thelogicdudes</dc:creator>
      <pubDate>Tue, 03 May 2022 15:33:00 +0000</pubDate>
      <link>https://dev.to/harperfast/using-tensorflowjs-harperdb-for-machine-learning-1me1</link>
      <guid>https://dev.to/harperfast/using-tensorflowjs-harperdb-for-machine-learning-1me1</guid>
      <description>&lt;h4&gt;
  
  
  Implementing a Dog Breed Classifier Using Stanford Dogs and MobileNet with HarperDB Custom Functions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov1r41aghr7ahbfwvplr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov1r41aghr7ahbfwvplr.png" alt="HarperDB Logo" width="563" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;HarperDB is an easy-to-use database solution that has a simple method of creating endpoints to interact with data, called Custom Functions. These Custom Functions can even be used to implement a machine learning algorithm to classify incoming data. TensorFlowJS is a library released by Google that brings machine learning to JavaScript, so it can run in the browser or on a NodeJS server, as we'll be doing in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd93wubbnamxo8zr3urt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd93wubbnamxo8zr3urt.png" alt="from Stanford Dogs" width="200" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What We're Going To Do
&lt;/h3&gt;

&lt;p&gt;This article will explain how to train and use a &lt;a href="https://www.tensorflow.org/js" rel="noopener noreferrer"&gt;TensorFlowJS&lt;/a&gt; model to classify dog breeds with &lt;a href="https://harperdb.io/docs/custom-functions/" rel="noopener noreferrer"&gt;HarperDB Custom Functions&lt;/a&gt;, using the &lt;a href="http://vision.stanford.edu/aditya86/ImageNetDogs/" rel="noopener noreferrer"&gt;Stanford Dogs dataset&lt;/a&gt; and &lt;a href="https://www.npmjs.com/package/@tensorflow-models/mobilenet" rel="noopener noreferrer"&gt;MobileNetV2&lt;/a&gt; as a base for transfer learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stanford Dogs
&lt;/h3&gt;

&lt;p&gt;There's an awesome dataset that was released by Stanford with over 20,000 images of dogs. The images are grouped into different folders, each folder named for the breed it contains. Annotations for bounding boxes are available as well, but today we'll be focused solely on classifying the breed.&lt;/p&gt;

&lt;h3&gt;
  
  
  MobileNet
&lt;/h3&gt;

&lt;p&gt;There's a SOTA (state of the art) model published by Google called MobileNet, a relatively small model that can classify images into over 1,000 categories. It's built small so it'll run on mobile devices without taking up too many resources. We'll be using version 2 of this model, which is available in the &lt;a href="https://www.npmjs.com/package/@tensorflow-models/mobilenet" rel="noopener noreferrer"&gt;@tensorflow-models/mobilenet&lt;/a&gt; package.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transfer Learning
&lt;/h3&gt;

&lt;p&gt;Transfer learning is the technique of taking a pretrained model and training it to output new data. Like teaching an old dog new tricks! For that we'll be using &lt;a href="https://www.npmjs.com/package/@tensorflow-models/knn-classifier" rel="noopener noreferrer"&gt;@tensorflow-models/knn-classifier&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll be sending an image into MobileNet and getting out the logits, the activations just before the final classification step. Then we'll feed those logits into a KNN classifier, which uses the K-Nearest Neighbors algorithm to associate them with specific dog breeds.&lt;/p&gt;
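
&lt;p&gt;It may help to see what the KNN step actually does: it never retrains MobileNet; it just stores labeled embedding vectors and votes among the nearest neighbors at inference time. The real project does this in JavaScript with @tensorflow-models/knn-classifier; here is a minimal numpy sketch of the same idea, with made-up 4-dimensional vectors standing in for MobileNet's logits.&lt;/p&gt;

```python
import numpy as np
from collections import Counter

# Invented 4-d "logits" for labeled training images. In the real project
# these vectors come out of MobileNet for each Stanford Dogs image.
examples = [
    (np.array([0.9, 0.1, 0.0, 0.0]), 'beagle'),
    (np.array([0.8, 0.2, 0.1, 0.0]), 'beagle'),
    (np.array([0.0, 0.1, 0.9, 0.3]), 'husky'),
    (np.array([0.1, 0.0, 0.8, 0.4]), 'husky'),
]

def classify(embedding, k=3):
    """Vote among the k nearest stored examples (Euclidean distance)."""
    dists = sorted(((np.linalg.norm(embedding - vec), label)
                    for vec, label in examples), key=lambda d: d[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(classify(np.array([0.85, 0.15, 0.05, 0.0])))
```

&lt;p&gt;Because the classifier only stores examples and measures distances, adding a new breed is just a matter of adding labeled vectors, with no retraining of the base model.&lt;/p&gt;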

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If that all sounds complicated, &lt;strong&gt;don't worry&lt;/strong&gt;. This implementation will be quick and easy thanks to HarperDB Custom Functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyg1sy07xr0e9e6zra4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyg1sy07xr0e9e6zra4e.png" alt="Screenshot of HarperDB Studio - Classification Table w/ a Stanford Dog" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prereqs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;A HarperDB Account&lt;/li&gt;
&lt;li&gt;A HarperDB Local Database&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Clone the Repo
&lt;/h3&gt;

&lt;p&gt;Clone &lt;a href="https://github.com/HarperDB/hdb-cf-dogml" rel="noopener noreferrer"&gt;this repo&lt;/a&gt; into your Custom Functions folder&lt;br&gt;
&lt;code&gt;git clone https://github.com/HarperDB/hdb-cf-dogml.git ~/hdb/src/custom_functions/dogml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Restart Custom Functions
&lt;/h3&gt;

&lt;p&gt;Use the link in the HarperDB Studio Functions page (bottom left of the screen) to refresh the projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh4qz2ux5bfntrtb0l9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh4qz2ux5bfntrtb0l9o.png" alt="Screenshot of Custom Functions link" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff64g2ry71hxvrnfsktz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff64g2ry71hxvrnfsktz8.png" alt="Screenshot of Server Restart button" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run /setup
&lt;/h3&gt;

&lt;p&gt;The training data and TensorFlowJS modules need to be installed. This can be done via the &lt;code&gt;/setup&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;Visiting &lt;a href="http://localhost:9926/dogml/setup" rel="noopener noreferrer"&gt;http://localhost:9926/dogml/setup&lt;/a&gt; starts the setup. You can monitor its progress in the logs, either in stdout from the locally running database or in the logs section of the Studio's Status page.&lt;/p&gt;

&lt;p&gt;The expected response when setup starts is &lt;code&gt;{"success": true, "message": "ML Setup Started"}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Setup stores all of the training materials in the &lt;code&gt;$HOME/dogml&lt;/code&gt; directory of the user running the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be sure to wait for the &lt;code&gt;ML Setup Complete&lt;/code&gt; message in the database logs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b5opm185rrx4o06oglw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b5opm185rrx4o06oglw.png" alt="Screenshot of HarperDB Logs" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;
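&lt;p&gt;The reason you watch the logs is that an endpoint like &lt;code&gt;/setup&lt;/code&gt; replies immediately and does the heavy lifting in the background. Here is a minimal sketch of that fire-and-forget pattern; the &lt;code&gt;runSetup&lt;/code&gt; task and log strings are illustrative stand-ins, not the repo's actual code.&lt;/p&gt;

```javascript
// Fire-and-forget handler pattern: reply right away, keep working in the
// background, and report completion through the logs. runSetup and the
// message strings are illustrative stand-ins for the repo's real code.
function makeSetupHandler(runSetup, log) {
  return function handleSetup() {
    runSetup()
      .then(function () { log('ML Setup Complete'); })
      .catch(function (err) { log('ML Setup Failed: ' + err.message); });
    // The caller receives this response before setup actually finishes.
    return { success: true, message: 'ML Setup Started' };
  };
}

// Example usage with a stubbed setup task and an in-memory log:
const messages = [];
const handler = makeSetupHandler(
  function () { return Promise.resolve(); },
  function (msg) { messages.push(msg); }
);
const reply = handler(); // { success: true, message: 'ML Setup Started' }
```

&lt;p&gt;This is why the success response arrives instantly while the &lt;code&gt;ML Setup Complete&lt;/code&gt; message only appears in the logs later.&lt;/p&gt;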

&lt;h2&gt;
  
  
  Activate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run /train
&lt;/h3&gt;

&lt;p&gt;To train the model, visit the &lt;code&gt;/train&lt;/code&gt; endpoint at &lt;a href="http://localhost:9926/dogml/train" rel="noopener noreferrer"&gt;http://localhost:9926/dogml/train&lt;/a&gt;. This begins the model training. You can follow its status in the console logs (as with &lt;code&gt;/setup&lt;/code&gt;) or in the logs table of the schema.&lt;/p&gt;
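&lt;p&gt;What training produces is a network whose final layer emits one score per breed; classifying an image then amounts to a softmax followed by an argmax over those scores. A plain-JavaScript sketch of that last step (the breed names and scores below are invented for illustration):&lt;/p&gt;

```javascript
// Turn raw per-breed scores (logits) into probabilities and pick the winner.
// This mirrors the final step of any softmax classifier; the breeds and
// scores below are made up for illustration.
function softmax(logits) {
  const max = Math.max.apply(null, logits);
  const exps = logits.map(function (x) { return Math.exp(x - max); });
  const sum = exps.reduce(function (a, b) { return a + b; }, 0);
  return exps.map(function (e) { return e / sum; });
}

function classify(logits, labels) {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i !== probs.length; i += 1) {
    if (probs[i] > probs[best]) best = i;
  }
  return { breed: labels[best], confidence: probs[best] };
}

const labels = ['Chihuahua', 'Beagle', 'Samoyed'];
const result = classify([1.2, 3.4, 0.5], labels);
// result.breed is 'Beagle', the highest-scoring of the three
```

&lt;p&gt;In the real project this math runs inside TensorFlowJS; the sketch just shows why the classification table can report both a breed and a confidence value.&lt;/p&gt;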

&lt;h3&gt;
  
  
  Verify Model
&lt;/h3&gt;

&lt;p&gt;Once the logs indicate that the training is complete, you should be able to see the model appear in the models table in the schema.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi14q2dcm3x7h2skaib76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi14q2dcm3x7h2skaib76.png" alt="Screenshot of HarperDB Studio - Models Table" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Classify a Dog Breed!
&lt;/h3&gt;

&lt;p&gt;Navigate to the UI at &lt;a href="http://localhost:9926/dogml/ui" rel="noopener noreferrer"&gt;http://localhost:9926/dogml/ui&lt;/a&gt; and try uploading an image of a dog (any of the images in the &lt;code&gt;$HOME/dogml/training_data/Images&lt;/code&gt; directory will do).&lt;br&gt;
The results should appear in the UI as well as in the classifications table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu97r2mnl5rnt8ggmhi32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu97r2mnl5rnt8ggmhi32.png" alt="Screenshot of HarperDB ML Dog Dashboard" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Deeper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add New Training Data
&lt;/h3&gt;

&lt;p&gt;You can add more training data by adding new images to the &lt;code&gt;$HOME/dogml/training_data/Images&lt;/code&gt; directory: place each image in the folder for its breed, or create a new folder if that breed doesn't have one yet. All images should be JPEGs.&lt;/p&gt;
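&lt;p&gt;The Stanford Dogs folders pair a WordNet synset id with the breed name (e.g. &lt;code&gt;n02085620-Chihuahua&lt;/code&gt;), so a small helper can derive the training label from a folder name. The helper below is illustrative, not the repo's actual code:&lt;/p&gt;

```javascript
// Derive a breed label from a Stanford Dogs style folder name, e.g.
// "n02085620-Chihuahua". Folders you create yourself with a plain breed
// name pass through unchanged. Illustrative only, not the repo's code.
function breedFromFolder(folderName) {
  const dash = folderName.indexOf('-');
  if (dash === -1) return folderName; // a plain breed-named folder you added
  return folderName.slice(dash + 1).replace(/_/g, ' ');
}

// breedFromFolder('n02085620-Chihuahua') returns 'Chihuahua'
// breedFromFolder('n02099601-golden_retriever') returns 'golden retriever'
```

&lt;p&gt;If you add a new breed, naming the folder either way works, as long as every image of that breed lands in the same folder.&lt;/p&gt;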

&lt;h3&gt;
  
  
  Removing Training Data
&lt;/h3&gt;

&lt;p&gt;You can also remove training data from the &lt;code&gt;$HOME/dogml/training_data/Images&lt;/code&gt; directory to better target specific breeds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Update the Model
&lt;/h3&gt;

&lt;p&gt;If you modify the training data and use the &lt;code&gt;/train&lt;/code&gt; endpoint to create a new model, be sure to then call the &lt;code&gt;/update&lt;/code&gt; endpoint at &lt;a href="http://localhost:9926/dogml/update" rel="noopener noreferrer"&gt;http://localhost:9926/dogml/update&lt;/a&gt; to ensure the new model is loaded into the classifier.&lt;/p&gt;
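&lt;p&gt;One way an endpoint like &lt;code&gt;/update&lt;/code&gt; can decide what to load is to pick the newest row in the models table, for example by HarperDB's auto-populated &lt;code&gt;__createdtime__&lt;/code&gt; field. A hypothetical sketch (the row shape is an assumption, not the repo's actual schema):&lt;/p&gt;

```javascript
// Choose the most recently created model row so the classifier loads the
// latest training run. HarperDB auto-populates __createdtime__ on every
// record; the rest of the row shape here is a hypothetical example.
function latestModel(rows) {
  return rows.reduce(function (newest, row) {
    return row.__createdtime__ > newest.__createdtime__ ? row : newest;
  });
}

const models = [
  { id: 'model-a', __createdtime__: 1661300000000 },
  { id: 'model-b', __createdtime__: 1661400000000 },
];
// latestModel(models).id is 'model-b'
```

&lt;p&gt;Whatever selection rule the project uses, the point of &lt;code&gt;/update&lt;/code&gt; is the same: swap the in-memory model for the freshly trained one without restarting anything.&lt;/p&gt;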

&lt;h3&gt;
  
  
  Train w/ GPU
&lt;/h3&gt;

&lt;p&gt;To train the model 200% faster, use the &lt;code&gt;/train_gpu&lt;/code&gt; endpoint at &lt;a href="http://localhost:9926/dogml/train_gpu" rel="noopener noreferrer"&gt;http://localhost:9926/dogml/train_gpu&lt;/a&gt;. This takes advantage of a CUDA-enabled NVIDIA GPU to accelerate the training computations.&lt;/p&gt;

&lt;p&gt;Be sure the necessary drivers and CUDA libraries are installed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.tensorflow.org/install/gpu#linux_setup" rel="noopener noreferrer"&gt;Here's a guide to installing CUDA on Ubuntu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Review
&lt;/h2&gt;

&lt;p&gt;There you have it: you've trained a machine learning model on dog breed data and can now use it to classify images of dogs and determine the breed. To do this, we used a HarperDB Custom Function and TensorFlowJS to train a MobileNet model on the Stanford Dogs dataset.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
