<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Isha Dagar</title>
    <description>The latest articles on DEV Community by Isha Dagar (@ishadagar).</description>
    <link>https://dev.to/ishadagar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F783293%2F463dcd6d-365c-4b46-b058-dbe4e3d3aa9b.jpeg</url>
      <title>DEV Community: Isha Dagar</title>
      <link>https://dev.to/ishadagar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishadagar"/>
    <language>en</language>
    <item>
<title>MNIST Digit Classification</title>
      <dc:creator>Isha Dagar</dc:creator>
      <pubDate>Sat, 29 Jan 2022 07:34:54 +0000</pubDate>
      <link>https://dev.to/ishadagar/mnist-digit-classification-5pi</link>
      <guid>https://dev.to/ishadagar/mnist-digit-classification-5pi</guid>
<description>&lt;p&gt;The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.&lt;/p&gt;

&lt;p&gt;This set has been studied so much that it is often called the &lt;em&gt;"hello world"&lt;/em&gt; of Machine Learning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The idea is to feed the pixel values to the neural network and have it output which digit it thinks the image shows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;1. Load the data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we import the dataset. MNIST consists of unique 28 x 28 images of handwritten digits 0-9. We then unpack the data into a training set and a testing set.&lt;/p&gt;
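&lt;p&gt;As a sketch of this step: the standard split is 60,000 training images and 10,000 test images. The actual Keras call is shown in a comment; the arrays below are random stand-ins so the snippet is self-contained.&lt;/p&gt;

```python
import numpy as np

# With Keras the real call would be:
#   (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Here we fake the same shapes with random data so the sketch runs offline.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(60000, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=60000, dtype=np.uint8)
x_test = rng.integers(0, 256, size=(10000, 28, 28), dtype=np.uint8)
y_test = rng.integers(0, 10, size=10000, dtype=np.uint8)

print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```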

&lt;p&gt;&lt;strong&gt;2. Normalization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pixel values range from 0 to 255, so we scale them to between 0 and 1, which makes it easier for the network to learn.&lt;/p&gt;
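&lt;p&gt;A minimal NumPy sketch of this scaling, using made-up pixel values:&lt;/p&gt;

```python
import numpy as np

# Toy stand-in for MNIST pixel data: uint8 values in the range 0-255.
pixels = np.array([[0, 64, 128], [192, 255, 32]], dtype=np.uint8)

# Scale to the range 0-1 by dividing by the maximum possible pixel value.
scaled = pixels.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())  # 0.0 1.0
```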

&lt;p&gt;&lt;strong&gt;3. Building model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is going to be a Sequential model: a simple feed-forward network. The first layer is the input layer; we flatten the image so the input is a flat vector rather than a multidimensional array. Next come two hidden layers with 128 neurons each and ReLU activation, and the output layer has 10 neurons (one per digit class) with softmax activation.&lt;/p&gt;
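&lt;p&gt;In Keras this is usually written as a Sequential model with a Flatten layer followed by Dense layers. As a framework-free sketch, the same forward pass in plain NumPy (random, untrained weights, made-up input) looks like this:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A fake 28x28 image, flattened into a 784-long vector (the "input layer").
image = rng.random((28, 28))
x = image.reshape(-1)

# Two hidden layers of 128 neurons with ReLU, then a 10-way softmax output.
w1, b1 = 0.01 * rng.standard_normal((784, 128)), np.zeros(128)
w2, b2 = 0.01 * rng.standard_normal((128, 128)), np.zeros(128)
w3, b3 = 0.01 * rng.standard_normal((128, 10)), np.zeros(10)

h1 = relu(x @ w1 + b1)
h2 = relu(h1 @ w2 + b2)
probs = softmax(h2 @ w3 + b3)

print(probs.shape)  # (10,)
```

The softmax output is a probability distribution over the 10 digit classes, which is why the probabilities sum to 1.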

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mifr5x12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kla7890feo5ia3l6m01a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mifr5x12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kla7890feo5ia3l6m01a.png" alt="Image description" width="880" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Training model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we need to "compile" the model. This is where we pass the settings for actually optimizing/training the model we've defined.&lt;br&gt;
Next, we choose our loss metric. Loss is a measure of error: a neural network doesn't actually attempt to maximize accuracy, it attempts to minimize loss. Again, there are many choices, but some form of categorical cross-entropy is a good start for a classification task like this.&lt;/p&gt;
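&lt;p&gt;To illustrate the loss calculation: for a single sample, categorical cross-entropy is the negative log of the probability the network assigned to the correct class (the probabilities below are hypothetical):&lt;/p&gt;

```python
import math

# Hypothetical softmax output for one image (probabilities over 10 digits).
probs = [0.05, 0.05, 0.7, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02]
true_class = 2  # the image is actually a "2"

# Categorical cross-entropy for one sample: -log(p of the correct class).
# Confident, correct predictions give low loss; wrong ones give high loss.
loss = -math.log(probs[true_class])

print(round(loss, 4))  # 0.3567
```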

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J3Ygi8rY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5hdxif9t01thvevstaur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J3Ygi8rY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5hdxif9t01thvevstaur.png" alt="Image description" width="880" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Making predictions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, you can easily save the trained model and use it to make predictions.&lt;/p&gt;
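&lt;p&gt;A small sketch of the prediction step: the predicted digit is the index of the largest softmax probability (hypothetical values below; with Keras you would get the probabilities from &lt;code&gt;model.predict&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

# Suppose the trained network returned these softmax probabilities
# for one test image (hypothetical values).
probs = np.array([0.01, 0.02, 0.01, 0.05, 0.01, 0.02, 0.01, 0.80, 0.05, 0.02])

# The predicted digit is the index of the highest probability.
predicted_digit = int(np.argmax(probs))

print(predicted_digit)  # 7
```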

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2OHtsi-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y975jtzyvm6sq09ua8s8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2OHtsi-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y975jtzyvm6sq09ua8s8.png" alt="Image description" width="880" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wnLj6aDT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m54o2d96mo0csieg1yh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wnLj6aDT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m54o2d96mo0csieg1yh2.png" alt="Image description" width="728" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>programming</category>
    </item>
    <item>
<title>Case Study: How does Netflix use Machine Learning?</title>
      <dc:creator>Isha Dagar</dc:creator>
      <pubDate>Wed, 05 Jan 2022 19:13:00 +0000</pubDate>
      <link>https://dev.to/ishadagar/case-study-how-does-netflix-use-machine-learning-2gh0</link>
      <guid>https://dev.to/ishadagar/case-study-how-does-netflix-use-machine-learning-2gh0</guid>
<description>&lt;p&gt;Netflix uses the &lt;em&gt;Netflix Recommendation Engine&lt;/em&gt; to show users content based on what they watch and like. A deep-learning algorithm learns each user's likes and dislikes, then uses this data to evaluate what content the user may enjoy and recommends it to them.&lt;/p&gt;

&lt;p&gt;Recommender Pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Pre-processing&lt;/li&gt;
&lt;li&gt; Hyperparameter tuning&lt;/li&gt;
&lt;li&gt; Model training and prediction&lt;/li&gt;
&lt;li&gt; Post-processing&lt;/li&gt;
&lt;li&gt; Evaluation&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Data=&amp;gt;Machine Learning model=&amp;gt;Predictions&lt;br&gt;
User Preferences=&amp;gt;Recommender System=&amp;gt;Recommendations&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Collaborative filtering: Similar users like similar things.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Content based filtering: User and item features.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-processing:&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zq-unbOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nb65712g21tswzb55htv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zq-unbOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nb65712g21tswzb55htv.png" alt="Image description" width="554" height="145"&gt;&lt;/a&gt;&lt;br&gt;
Let’s assume that we have a data set that is dense enough to proceed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Normalization:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Optimists = rate everything 4 or 5&lt;br&gt;
Pessimists=rate everything 1 or 2&lt;br&gt;
Need to normalize ratings by accounting for user and item bias.&lt;br&gt;
Mean normalization:&lt;/p&gt;
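&lt;p&gt;A toy NumPy sketch of mean normalization on a made-up user-item matrix (here we subtract each item's mean over its observed ratings):&lt;/p&gt;

```python
import numpy as np

# Toy user-item rating matrix; 0.0 marks a missing rating (made-up data).
ratings = np.array([
    [5.0, 4.0, 0.0],
    [1.0, 0.0, 2.0],
])
mask = ratings != 0

# Subtract each item's mean rating (computed over observed entries only)
# so optimists and pessimists land on a comparable scale.
item_means = ratings.sum(axis=0) / mask.sum(axis=0)
normalized = np.where(mask, ratings - item_means, 0.0)

print(item_means)  # [3. 4. 2.]
```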

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Subtract the bias term b_i from each user’s rating for a given item i.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yuBqrztq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlvhps3hpvih0y6f12ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yuBqrztq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlvhps3hpvih0y6f12ov.png" alt="Image description" width="880" height="220"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick a Model:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Matrix Factorization:&lt;/em&gt;&lt;br&gt;
Factorize the user-item matrix to get 2 latent factor matrices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User-factor matrix&lt;/li&gt;
&lt;li&gt;Item-factor matrix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Missing ratings are predicted from the inner product of these two factor matrices.&lt;br&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VS0XDE4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey15je5ooc1gh7jawtfi.png" alt="Image description" width="492" height="89"&gt;&lt;/p&gt;

&lt;p&gt;Algorithms that perform matrix factorization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alternating Least Squares (ALS)&lt;/li&gt;
&lt;li&gt;Stochastic Gradient Descent (SGD)&lt;/li&gt;
&lt;li&gt;Singular Value Decomposition (SVD)&lt;/li&gt;
&lt;/ol&gt;
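&lt;p&gt;A minimal NumPy sketch of matrix factorization, fitting two latent-factor matrices to a made-up 3x3 rating matrix by gradient descent on the observed entries (an SGD-style approach; production systems use library implementations of ALS or SVD):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item rating matrix; 0.0 marks a missing rating (made-up data).
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 1.0, 5.0],
])
mask = R != 0
k = 2  # number of latent factors

def masked_rmse(pred):
    err = np.where(mask, R - pred, 0.0)
    return float(np.sqrt((err ** 2).sum() / mask.sum()))

# Factorize R into a user-factor matrix U and an item-factor matrix V by
# gradient descent on the observed entries only.
U = rng.random((3, k))
V = rng.random((3, k))
rmse_before = masked_rmse(U @ V.T)
for _ in range(500):
    err = np.where(mask, R - U @ V.T, 0.0)
    U += 0.01 * (err @ V)
    V += 0.01 * (err.T @ U)

# Missing ratings are then predicted from the inner product of the factors.
predictions = U @ V.T
rmse_after = masked_rmse(predictions)
print(round(rmse_before, 2), round(rmse_after, 2))  # the error drops
```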

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pick an Evaluation Metric:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Precision at K:&lt;/em&gt;&lt;br&gt;
This looks at the top K recommendations and calculates what proportion of them were relevant to the user. We will focus on the top 10 recommendations.&lt;/p&gt;
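&lt;p&gt;Precision at K can be sketched in a few lines (the item IDs below are hypothetical):&lt;/p&gt;

```python
# Precision at K: of the top-K items we recommended, what fraction
# did the user actually find relevant?
def precision_at_k(recommended, relevant, k=10):
    top_k = recommended[:k]
    hits = len(set(top_k).intersection(relevant))
    return hits / k

recommended = [101, 205, 33, 47, 88, 12, 9, 64, 71, 50]  # model's top 10
relevant = {205, 47, 9}  # items the user actually engaged with

print(precision_at_k(recommended, relevant))  # 0.3
```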

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternating Least Squares (ALS) hyperparameters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;K (# of factors)&lt;/li&gt;
&lt;li&gt;Lambda (regularization parameter)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Goal: find the hyperparameters that give the best precision at 10.&lt;/p&gt;

&lt;p&gt;Grid Search: iterates over a fixed set of combinations of K and lambda. We run and evaluate the model for each combination and see which one gives us the best precision at 10.&lt;/p&gt;

&lt;p&gt;Random Search: randomly samples values of lambda and K and evaluates the model several times. This approach is less exhaustive than grid search but often more efficient.&lt;/p&gt;

&lt;p&gt;Sequential Model-Based Optimization: a smarter way of tuning your hyperparameters, because it takes the results of previous iterations into consideration when sampling hyperparameters for the current iteration.&lt;br&gt;
You can consider using tools like scikit-optimize, Hyperopt, or MOE (the Metric Optimization Engine, developed by Yelp).&lt;/p&gt;
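&lt;p&gt;A toy sketch of grid search over K and lambda, with a made-up stand-in function for the "train the model and measure precision at 10" step:&lt;/p&gt;

```python
import itertools

# Hypothetical stand-in for "train the model and measure precision at 10"
# for a given number of factors k and regularization strength lam.
def precision_at_10(k, lam):
    return 1.0 / (1.0 + abs(k - 20) + abs(lam - 0.1) * 10)

# Grid search: evaluate every combination and keep the best one.
ks = [5, 10, 20, 40]
lams = [0.01, 0.1, 1.0]
best = max(itertools.product(ks, lams), key=lambda p: precision_at_10(*p))

print(best)  # (20, 0.1)
```

Random search would instead sample (k, lam) pairs at random from the same ranges, evaluating only as many combinations as your budget allows.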

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model Training&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We then train the model with these optimal hyperparameters to get our predicted ratings and use the results to generate our recommendations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Post-processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sort the predicted ratings and take the top N.&lt;br&gt;
Filter out items that a user has already purchased, watched, or interacted with.&lt;br&gt;
Item-item recommendations:&lt;br&gt;
- Use a similarity metric (e.g., cosine similarity)&lt;br&gt;
- “Because you watched Movie X”&lt;/p&gt;
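&lt;p&gt;Cosine similarity between two item vectors is the dot product divided by the product of the vector lengths; a small sketch with hypothetical feature vectors:&lt;/p&gt;

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

movie_x = [1.0, 0.0, 1.0]  # e.g., made-up genre features for "Movie X"
movie_y = [1.0, 1.0, 1.0]

print(round(cosine_similarity(movie_x, movie_y), 4))  # 0.8165
```

Items whose vectors score highest against "Movie X" become the "Because you watched Movie X" row.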

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can do A/B testing or usability testing, where you get actual feedback from real users, that is the best signal that you have a good recommender, but in many cases that’s not possible. So we’re going to have to do offline evaluation.&lt;/p&gt;

&lt;p&gt;In traditional ML we split our dataset in half to create a training set and a validation set, but this doesn’t work for recommender models: a model trained on one user population won’t apply to a separate population in the validation set. So for recommenders we instead mask random interactions in our matrix and use the rest as our training set. We pretend that we don’t know a user’s rating of a movie when we actually do, compare the predicted rating with the actual rating, and that’s our way of calculating precision at 10 (or any metric we want).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---pn5HsAE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26nywtsq1kmu4c6og308.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---pn5HsAE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26nywtsq1kmu4c6og308.png" alt="Image description" width="653" height="518"&gt;&lt;/a&gt;&lt;br&gt;
Precision and recall are very popular metrics for recommender systems and they’re both information retrieval metrics.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>programming</category>
      <category>datascience</category>
    </item>
    <item>
<title>Jupyter Notebook vs Google Colab</title>
      <dc:creator>Isha Dagar</dc:creator>
      <pubDate>Sun, 02 Jan 2022 12:01:17 +0000</pubDate>
      <link>https://dev.to/ishadagar/jupyter-notebook-vs-google-colab-c82</link>
      <guid>https://dev.to/ishadagar/jupyter-notebook-vs-google-colab-c82</guid>
<description>&lt;p&gt;I have worked on both Google Colab and Jupyter Notebook, so I think I can clearly explain the difference between the two and which one you should use.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Jupyter is the open source project on which Colab is based. Colab allows you to use and share Jupyter notebooks with others without having to download, install, or run anything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Google Colab:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Google Colab comes with collaboration baked into the product. As Colab is a web app hosted by Google, you need an Internet connection to access it and run code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It runs in the cloud and executes your program on the server side, so you don’t need to worry about installing packages locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They give you a free virtual machine with about 12 GB of RAM, which is pretty decent, and they also give access to a GPU, which is much faster. You can access free GPUs for a maximum of 12 hours at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The notebooks are saved to your Google Drive account.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Jupyter Notebook:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Jupyter Notebook runs on a local host, so it works offline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You have direct access to your file system, and your scripts can use all the storage on your disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The hardware is your PC's RAM, Disk, CPU and GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You have to install everything by yourself via pip or other package managers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Jupyter notebook = running code, taking notes, and showing multimedia interactivity + a read-eval-print loop (REPL) terminal and kernel with the front-end interfaces.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google Colab = Jupyter notebook + collaboration and additional cloud-based hardware&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Which one is better?&lt;/h2&gt;

&lt;p&gt;Google Colab allows you to share your work with other developers. Not only can you push your work directly to GitHub, you can also share the notebook itself, and others can run the cells and get the same results as you, because the environment is consistent: it is a remote machine, which would not be the case with local development in Jupyter Notebook.&lt;br&gt;
If you’re just playing around or working on personal projects, Jupyter will work fine. If you want to build commercial-grade models and deploy them to production, Colab provides more of the full-lifecycle approach you’d need.&lt;br&gt;
If you are in a non-programming job and don’t want to set up your work computer for Jupyter, you can start working with Google Colab without any installation and share your scripts with non-technical co-workers who wouldn’t be able to install anything themselves.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Top 5 languages for Machine Learning</title>
      <dc:creator>Isha Dagar</dc:creator>
      <pubDate>Thu, 30 Dec 2021 21:19:41 +0000</pubDate>
      <link>https://dev.to/ishadagar/top-5-languages-for-machine-learning-5c57</link>
      <guid>https://dev.to/ishadagar/top-5-languages-for-machine-learning-5c57</guid>
<description>&lt;p&gt;In machine learning, there is no single best language. Each language is good where it fits best, but some programming languages are more suitable for machine learning tasks than others.&lt;/p&gt;

&lt;p&gt;For example, many software engineers use Java for machine learning applications like security and threat detection, whereas others prefer Python for NLP and LSTM problems. Some also prefer R or Python for sentiment analysis tasks. Software engineers with a background in Java development who are transitioning into machine learning sometimes continue to use Java in their machine learning roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Python :&lt;/strong&gt; Python leads all the other languages. More than 60% of machine learning developers use Python and prioritize it for development because Python is easy to learn and versatile.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"While Python has been around for decades, the demand for Python skills in 2022 will continue growing exponentially thanks to its use in the booming industries of data science, machine learning and AI," said Ryan Desmond, co-founder and lead instructor at CodingNomads."In addition, Python is considered one of the easiest, most powerful, and most versatile languages to learn, making it popular amongst companies, developers, and aspiring developers."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Python has many awesome visualization packages and useful core libraries like NumPy, SciPy, pandas, Matplotlib, seaborn, and scikit-learn, which make your work much easier and empower machines to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. R :&lt;/strong&gt; R is an open-source programming platform that includes a wide range of libraries and frameworks. Several big tech companies use the R language to run their businesses and with the increasing demand for machine learning and data science in 2021, it is quite evident that R will be in-demand in 2022 and the upcoming years. It is popular in implementing machine learning tasks like regression, classification, and decision tree formation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Java :&lt;/strong&gt; Java is considered harder to learn than Python but easier than C or C++. If you learn Java, then learning something like Python will be much easier. Java provides many good environments like Weka, KNIME, RapidMiner, and ELKI, which are used to perform machine learning tasks through graphical user interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. JavaScript :&lt;/strong&gt; Used on more than 97% of the world's websites, JavaScript allows you to set up dynamic and interactive content, animated graphics, and other complex features on the web. It's also the most popular language among contributors on GitHub. JavaScript is so popular in ML that high-profile projects like Google’s TensorFlow.js are based on it. If you master JavaScript, you can do everything from full-stack development to machine learning and NLP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. C++ :&lt;/strong&gt; C++ has become a go-to programming language for analysts and researchers. Besides its popularity in the game-development domain, many powerful libraries such as TensorFlow and Torch are implemented in C++. Therefore, C++ and machine learning are indeed a great combination.&lt;/p&gt;

&lt;p&gt;Happy Learning.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
