
Aayush Sinha


Simple > Machine Learning Prediction

Selecting the prediction target

You can pull out the target variable from a DataFrame with dot notation. This single column is stored as a Series, which broadly is a DataFrame with a single column.
For instance:

y = df.price

Choosing features

The features are the columns that are fed into the model and later used to make predictions; here, these are the columns that will be used to predict the price. Sometimes you will use all columns except the target as features. Other times you'll be better off with fewer features. By convention, the feature data is called 'X'.
For instance:

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

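Putting the target and features together, here is a minimal sketch (the file name melb_data.csv is an assumption; point it at wherever your copy of the Melbourne housing data lives):

import pandas as pd

# Load the Melbourne housing data (file path is illustrative)
melbourne_data = pd.read_csv('melb_data.csv')

# Prediction target (assuming the price column is named 'Price' in this dataset)
y = melbourne_data.Price

# Features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

# Quick sanity checks on the feature data
print(X.describe())
print(X.head())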

Building your model

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

  1. Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  2. Fit: Capture patterns from provided data. This is the heart of modeling.
  3. Predict: Just what it sounds like.
  4. Evaluate: Determine how accurate the model's predictions are.

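For instance, here is a minimal sketch of the Define and Fit steps with scikit-learn's DecisionTreeRegressor, reusing the X and y defined above:

from sklearn.tree import DecisionTreeRegressor

# Define the model; random_state makes the run reproducible
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model on the features X and target y
melbourne_model.fit(X, y)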

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

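A sketch of what that looks like, reusing the fitted model from above:

print('Making predictions for the following 5 houses:')
print(X.head())
print('The predictions are:')
print(melbourne_model.predict(X.head()))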

Model Validation

You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's first think about how we'd do this.

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:

error = actual − predicted

So, if a house costs $150,000 and you predicted it would cost $100,000, the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

On average, our predictions are off by about X.

MAE = mean(|actual − predicted|)

Once we have a model, here is how we calculate the mean absolute error:

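A sketch of the calculation, assuming the fitted melbourne_model, X, and y from the earlier steps:

from sklearn.metrics import mean_absolute_error

# Predict on the same data the model was trained on (an "in-sample" score)
predicted_home_prices = melbourne_model.predict(X)
print(mean_absolute_error(y, predicted_home_prices))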

The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since models' practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that held-out data to test the model's accuracy on data it hasn't seen before. This data is called validation data.

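Here is a sketch of the same workflow with a held-out validation set, using scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define and fit the model on the training data only
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(train_X, train_y)

# Score the model on data it has never seen
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))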

Wow!
Your mean absolute error for the in-sample data was about 500 dollars. Out-of-sample it is more than 250,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.

There are many ways to improve this model, such as experimenting to find better features or different model types.

Underfitting and Overfitting

Fine-tune your model for better performance.

Experimenting With Different Models

Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?


In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf.

As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only has 1 split, it divides the data into 2 groups.

If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1,024 leaves.
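The doubling is easy to check directly; this tiny sketch just prints how many leaves a fully split tree would have at a few depths:

# Leaves double with each additional level of splits
for depth in [1, 2, 3, 10]:
    print(depth, 2 ** depth)  # 2, 4, 8, 1024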

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in the training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

[Figure: training and validation error as a function of tree depth; the validation (red) curve reaches its low point between the underfitting and overfitting regions]

Example

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

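The utility function could look something like this (the name get_mae and its exact signature are illustrative choices, not anything fixed by scikit-learn):

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Build a tree limited to max_leaf_nodes leaves and score it on validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)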

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

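A sketch of such a loop, reusing the train/validation split and get_mae from above (the candidate values are illustrative):

# Compare MAE across a few candidate tree sizes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))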

Of the options listed, 500 is the optimal number of leaves.

  • Overfitting:

Overfitting happens when a model learns not only the underlying patterns in the data but also the noise or random fluctuations that exist in the dataset.

Imagine you're trying to memorize a list of numbers, including some mistakes. Overfitting is like memorizing not just the real numbers but also the mistakes, which won't be useful for predicting new numbers accurately.

When a model overfits, it performs very well on the training data (the data it was trained on) but doesn't generalize well to new, unseen data. In other words, it's too tailored to the training data and doesn't work well with new data.
  • Underfitting:

Underfitting occurs when a model is too simple to capture the underlying patterns in the data.

Going back to the memorization example, underfitting would be like trying to remember a complex list of numbers with just a few general ideas. You'll likely miss many important details.
An underfit model doesn't perform well on either the training data or new data because it doesn't capture enough of the relevant patterns in the data.

  • Validation Data:

Validation data is a separate set of data that the model hasn't seen during training. It's used to evaluate how well the model generalizes to new, unseen data.
Just like taking a practice test before the real one, validation data helps us assess how well the model will perform in the real world.

By trying out different models and evaluating their performance on the validation data, we can choose the one that performs the best and is most likely to make accurate predictions on new data.
In simple terms, overfitting is like memorizing mistakes along with the right answers, while underfitting is like not studying enough to understand the material.

Validation data helps us pick the model that performs the best on new problems we haven't seen before.
