How to Gather Data for Machine Learning

Unless you’ve been living in a cave these last few months (a cave that somehow carries sufficient WiFi coverage to reach our blog), you’ll doubtless have heard about machine learning. If you’re a developer, chances are you’re intrigued.

The machine learning algorithm, which solves problems without requiring detailed instructions, is one of the most exciting technologies on the planet. Since I began writing this post, Facebook and Microsoft have announced major additions to their own machine learning programmes.

But machine learning is deeply misunderstood – and not just by those who think it’s the dawn of a terrifying rise of the robots. In this article we’re going to demystify the concept, by cutting right to its core: the way we gather data for our machine learning algorithm, following a clearly defined process.

In this comprehensive introduction to machine learning, we’ll give you a clear explanation of the process used to train, and test, machine learning models, and show how you need different data at the various stages. But don’t worry, it won’t be too technical. We’ll leave the math lessons for another day.

A General Approach to Data-Gathering

Machine learning algorithms require huge amounts of data to function. When dealing with millions or even billions of images or records, it’s really hard to pinpoint what exactly makes an algorithm perform badly.

So, when compiling your data, it’s not enough to gather vast reams of information, feed it to your model and expect good results. The process needs to be much more finely tuned.

In general, it’s best to follow a series of iterative stages until you’re satisfied with the outcome. The process should run like this:

  1. Select your data distributions
  2. Split the data into data sets
  3. Train the model

Selecting Data Distributions in Machine Learning

The first step requires us to think about who will be interacting with our model, and the various data it will be handling as a result. This can be best explained with a couple of examples that illustrate what happens when we don’t take this into account.

Imagine you’re building an image recognition model to automatically label furniture items for an online store. To train the model, you collect a bunch of images from various manufacturers’ catalogues, professional shots that share common attributes such as distances and angles.

However, in production, you let users upload their own images from their phones. There’s a good chance these will be low-quality, blurry, badly lit or framed in the sort of unusual angles that a professional photographer wouldn’t use.

The system might perform poorly because the images used for training and production came from two clearly separate distributions.

CAPTION: Here you see the difference between a professional photo, used in training, and a poor-quality image taken by an actual user.

Another example: you need to train a model for an online book recommendation engine, and you assume your user base is evenly distributed across age and gender. Only later does it become apparent that 70% are actually young women. So you end up training the model on data that doesn’t represent the people who will actually be using it.

So when collecting data, it’s important to first define exactly how the system will be applied and make sure that the data we use to train the model is a good representation of the data it will handle when released to the market.
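To make the book-recommendation example concrete, here is a minimal sketch of how we might measure the gap between the data we trained on and the data we see in production. The group names, the sample sizes and the choice of total variation distance as the measure are all assumptions made for illustration; they are not part of any particular toolkit.

```python
from collections import Counter


def category_shares(labels):
    """Return the fraction of records that fall into each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_labels, production_labels):
    """Half the sum of absolute differences between the two share tables.

    0.0 means the distributions are identical, 1.0 means they do not overlap.
    """
    train = category_shares(train_labels)
    production = category_shares(production_labels)
    categories = set(train) | set(production)
    return 0.5 * sum(
        abs(train.get(c, 0.0) - production.get(c, 0.0)) for c in categories
    )


# Hypothetical data: training assumed an even split across four groups,
# while 70% of the real users turn out to be young women.
groups = ["young_woman", "young_man", "older_woman", "older_man"]
train_users = [g for g in groups for _ in range(25)]
production_users = ["young_woman"] * 70 + [g for g in groups[1:] for _ in range(10)]

drift = total_variation_distance(train_users, production_users)
print(f"Distribution gap: {drift:.2f}")
```

A gap close to zero suggests the training data resembles what the model will meet in production. What counts as "too large" is a judgment call for each project.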

Creating Machine Learning Datasets

Ok, now we’re ready to think about breaking up our data. So let’s dive into this with another example (sorry for the case study overload, but it’s the best way to get this point across).

Let’s imagine we were training someone to recognize the difference between a cat and a dog. We’d show them thousands of pictures of cats and dogs, all different types and breeds.

But how would we test them to ensure all those images had sunk in? If we showed them the images they’d already seen, they might be able to recognize them from memory. So we’d need to show them a new set of images, to prove that they could apply their knowledge to new conditions and give the right answer without assistance.

It’s the same principle for machine learning. We don’t want our model to recognize only the images, or records, it’s been trained with; this problem is known as ‘overfitting’ and is unfortunately quite common. Machine learning models often perform brilliantly when they’re asked to recall an item from their training, but less well when taken out of their comfort zone.
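The cat-and-dog idea can be sketched with a deliberately silly toy. Everything here is an assumption made for illustration: the single ‘weight’ feature, the 8 kg threshold and the two hand-written ‘models’ stand in for real data and real training.

```python
def true_label(weight_kg):
    """Toy ground truth (an assumption for this sketch): heavier animals are dogs."""
    return "dog" if weight_kg > 8 else "cat"


train_set = [(w, true_label(w)) for w in (2, 4, 6, 8, 10, 12, 14)]
unseen_set = [(w, true_label(w)) for w in (3, 5, 7, 9, 11, 13)]

# 'Model' 1: memorizes every training example and guesses "cat" otherwise.
memory = dict(train_set)


def memorizer(weight_kg):
    return memory.get(weight_kg, "cat")


# 'Model' 2: has picked up a general rule instead of individual examples.
def rule_model(weight_kg):
    return "dog" if weight_kg > 8 else "cat"


def accuracy(model, examples):
    """Fraction of examples for which the model returns the right label."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


print("memorizer:", accuracy(memorizer, train_set), accuracy(memorizer, unseen_set))
print("rule:     ", accuracy(rule_model, train_set), accuracy(rule_model, unseen_set))
```

The memorizer is perfect on the examples it has seen and right only half the time on new ones, which is exactly the behaviour we only discover by testing on fresh data.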

So we need to create three different datasets when training our machine learning model, for training, validation and testing.

The Training Stage

Naturally, we want the model to be as versatile as possible by the end of training, so it’s important the training set covers a wide range of images and records. But remember we don’t need the model to be 100% accurate by the end of training. We simply need to keep the margin of error to a minimum.

At this point, it’s worth introducing the ‘cost function’, a concept widely used among machine learning developers. The cost function measures the difference between the model’s predictions and the ‘right answer’. The higher its value, the worse the model is performing, although there are other factors to consider, such as prediction speed or memory usage.

(We could write a lot more about the cost function, but we don’t want to take up too much time. There’s a good article on it here).
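As a rough illustration, here is one common cost function, the mean squared error, written out by hand. The choice of this particular function and the sample numbers are assumptions made for the sketch; real projects pick a cost function to suit the problem.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and right answers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = [1.0, 0.0, 1.0, 1.0]        # the 'right answers'
close_model = [0.9, 0.1, 0.8, 1.0]    # hypothetical outputs of a good model
poor_model = [0.2, 0.9, 0.4, 0.3]     # hypothetical outputs of a bad model

close_cost = mean_squared_error(close_model, targets)
poor_cost = mean_squared_error(poor_model, targets)
print(close_cost, poor_cost)
```

The model whose predictions sit closer to the right answers gets the lower cost, which is the number training tries to push down.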

Validation Stage

Once we’re happy with our cost function, and we’re ready to move on from the training, it’s time to start the validation stage. This is a bit like a mock exam, subjecting the model to new and unusual data without any pass-fail pressure.

Using the validation results, we can make any necessary tweaks to the model, or choose between different versions. A model which is 100% accurate at training stage but only 50% at validation is less likely to be chosen than one which is 80% accurate at both stages, as this second option is better able to face unusual circumstances.

Although we don’t need to give the model as much data at the validation stage as it received during training, all the data has to be fresh. If we recycle images the model has been trained with, it defeats the whole object.
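The choice between the two models described above can be sketched in a few lines. The model names and the dictionary layout are hypothetical; the accuracy figures are the ones from the example.

```python
# Hypothetical scores, taken from the example in the text.
candidates = {
    "model_a": {"training": 1.00, "validation": 0.50},
    "model_b": {"training": 0.80, "validation": 0.80},
}


def pick_by_validation(scores_by_model):
    """Return the name of the model with the highest validation accuracy."""
    return max(scores_by_model, key=lambda name: scores_by_model[name]["validation"])


for name, scores in candidates.items():
    # A large gap between the two stages is a warning sign of overfitting.
    gap = scores["training"] - scores["validation"]
    print(f"{name}: validation={scores['validation']:.2f}, gap={gap:.2f}")

chosen = pick_by_validation(candidates)
print("Chosen:", chosen)
```

Here the training score is deliberately ignored when choosing, because a high training score on its own tells us little about how the model handles new data.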

Testing Stage

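Why do we need a third stage, we hear you ask? Isn’t the validation stage enough of a test? Well, every tweak we make in response to validation results nudges the model towards that particular data. If the validation stage is long and rigorous enough, the model may eventually overfit the validation set, in effect learning the answer to every question.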

So we need a third data set, whose goal is to define the model’s performance once and for all. If we get a bad result on this set, we might as well start from scratch.

Again, the test set must be completely fresh, with no repetition from the validation set or the original training set.

There are no specific rules on how to divide up your three machine learning datasets. Unsurprisingly, though, the majority of data is usually used for training – between 80 and 95%. The rest is split equally between the validation and test sets, as you’ll see in the chart below. Ultimately, however, it’s up to each individual team to find their own ratio by trial and error.
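Here is a minimal sketch of such a split using only Python’s standard library. The function name, the 80% default and the file names are assumptions for illustration; many frameworks ship their own splitting helpers.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle records and split them into train, validation and test lists.

    Whatever is left after the training share is divided equally between
    the validation and test sets.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    train_end = int(len(shuffled) * train_fraction)
    validation_end = train_end + (len(shuffled) - train_end) // 2

    train = shuffled[:train_end]
    validation = shuffled[train_end:validation_end]
    test = shuffled[validation_end:]
    return train, validation, test


# Hypothetical file names standing in for real records.
records = [f"image_{i:04d}.jpg" for i in range(1000)]
train, validation, test = split_dataset(records, train_fraction=0.8)
print(len(train), len(validation), len(test))

# No record may appear in more than one set.
assert not set(train) & set(validation)
assert not set(train) & set(test)
assert not set(validation) & set(test)
```

Shuffling before slicing matters when the records arrive in some order (by date, by manufacturer and so on), otherwise the three sets could end up with different distributions.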

Training a Model

Right, now we’ve got our data ready, so let’s get into the training process.

Data collection, and training the model in general, is an iterative process, which means we might need to revisit the decisions we made when gathering the data. The process includes data preprocessing, model training and parameter tuning.

Data Preprocessing

The data being fed into a machine learning model needs to be transformed before it can be used for training.

On one hand, machine learning models expect their inputs in a given format, which is very often different to the format in which you find the data.

On the other hand, what models do during training is learn by minimizing the cost function. In mathematics this is called an ‘optimization problem’, and certain characteristics of the data, such as features measured on very different scales, can affect how fast a computer will find the solution (the maximum or the minimum).

Some examples of data cleaning techniques used during pre-processing include normalization, clipping or binning. The following link explains these concepts.
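For a quick feel of the three techniques, here is a small sketch on a made-up list of ages. The helper names, the clipping limits and the bin edges are all assumptions for illustration, and min-max scaling is only one of several ways to normalize.

```python
from bisect import bisect_right


def clip(values, lower, upper):
    """Force every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def bin_indices(values, edges):
    """Replace each value by the index of the bin it falls into.

    With edges [30, 50]: below 30 -> 0, from 30 up to 49 -> 1, 50 and above -> 2.
    """
    return [bisect_right(edges, v) for v in values]


ages = [18, 22, 35, 41, 67, 250]  # 250 is an obvious data-entry error

clipped = clip(ages, 0, 100)
normalized = min_max_normalize(clipped)
binned = bin_indices(clipped, [30, 50])

print(clipped)
print([round(v, 2) for v in normalized])
print(binned)
```

Clipping comes first here so that the single outlier doesn’t squash all the other values towards zero when we normalize.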

Model Training

How training works is beyond the scope of this article, but you can find out more about it here and also get a practical view of how it is done using the TensorFlow framework in this article.

Understanding it is essential to developing new models and tuning the parameters, but once that’s done, there are frameworks that will abstract all that away from us.

A given framework will provide a measure of error, describing the model’s performance in minimizing the cost function. The developer will need to test the model, adjust its hyper-parameters and continue iterating.

Parameter Tuning

This step, the final stage of the training process, is crucial to any machine learning project. It consists of working out what went wrong and taking educated guesses at which parameters need to be modified.

This can get extremely complicated, but here are two very quick examples of what we might be modifying at this stage.

If we train our model to predict whether it is going to rain tomorrow based on various meteorological parameters, and we find that the model is wrong 80% of the time, we need to identify the causes. One could be that we don’t have enough records. In this case, we would collect more information and train again.

Alternatively, if we saw that the model was accurate enough, but took a surprisingly long time to train, we might think of tuning the learning rate, which is a parameter that controls how quickly the optimizing algorithm advances towards the solution.
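To see the effect of the learning rate in isolation, here is a toy sketch that uses plain gradient descent to minimize f(x) = x², whose minimum is at x = 0. The function, the starting point and the tolerance are assumptions chosen to keep the example tiny; a real model has far more parameters, but the principle is similar.

```python
def steps_to_converge(learning_rate, start=10.0, tolerance=1e-3, max_steps=10_000):
    """Minimize f(x) = x**2 with gradient descent and count the steps taken."""
    x = start
    for step in range(1, max_steps + 1):
        gradient = 2 * x               # derivative of x**2
        x -= learning_rate * gradient  # move against the gradient
        if abs(x) < tolerance:
            return step
    return max_steps  # did not get close enough within the step budget


for rate in (0.001, 0.01, 0.1):
    print(f"learning rate {rate}: {steps_to_converge(rate)} steps")
```

In this toy, a smaller learning rate needs many more steps to reach the same point. The opposite extreme is also a problem: a rate that is too large can overshoot the minimum and never settle, so tuning means searching for a value in between.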

These are only simple examples, and there are plenty more, but we hope you get the point.

Just One More Thing…

Ok, that’s the end of our introduction to machine learning. Hopefully you’ve gained an insight into the basic principles of the data-gathering process that will be useful going forward.

Before we go, however, we feel it’s only right to mention our own product, Bugfender, and how it can assist during the data-gathering process.

Bugfender can really help you gather machine learning data when your model takes large vectors of information that your users generate as they interact with the system. Bugfender’s ‘Tags’ enable you to relate logs under the same category, which you could use to create different data sets or simply help you organise your training samples in a convenient way. We look forward to telling you more later in the series.
