<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aayush Sinha</title>
    <description>The latest articles on DEV Community by Aayush Sinha (@aayushs7ha).</description>
    <link>https://dev.to/aayushs7ha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1087468%2Fb5253929-79e3-43a2-816a-2409887d9587.jpeg</url>
      <title>DEV Community: Aayush Sinha</title>
      <link>https://dev.to/aayushs7ha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aayushs7ha"/>
    <language>en</language>
    <item>
      <title>Simple &gt; Machine Learning Prediction</title>
      <dc:creator>Aayush Sinha</dc:creator>
      <pubDate>Tue, 13 Feb 2024 09:13:10 +0000</pubDate>
      <link>https://dev.to/aayushs7ha/simple-machine-learning-prediction-2lbj</link>
      <guid>https://dev.to/aayushs7ha/simple-machine-learning-prediction-2lbj</guid>
      <description>&lt;h2&gt;
  
  
  Selecting the prediction target
&lt;/h2&gt;

&lt;p&gt;You can pull out a target variable from a DataFrame with dot notation.&lt;br&gt;
This single column is stored as a Series, which broadly is like a DataFrame with one single column.&lt;br&gt;
For instance: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;y = df.price&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Choosing features
&lt;/h2&gt;

&lt;p&gt;Features are the columns that are fed into a model and later used to make predictions; in our case, the columns that will be used to predict the price. Sometimes you will use all columns except the target as features. Other times you'll be better off with fewer features. By convention, this data is called 'X'.&lt;br&gt;
For instance: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']&lt;br&gt;
X = melbourne_data[melbourne_features]&lt;/p&gt;
&lt;/blockquote&gt;
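
&lt;p&gt;Put together, target and feature selection look like this (a minimal sketch: the DataFrame is built inline here instead of being read from the Melbourne housing CSV, and the prices are made up):&lt;/p&gt;

```python
import pandas as pd

# Stand-in for the Melbourne housing data (column names follow the real
# dataset, including its misspelled 'Lattitude' and 'Longtitude')
melbourne_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Bathroom': [1.0, 2.0, 1.0],
    'Landsize': [156.0, 134.0, 120.0],
    'Lattitude': [-37.80, -37.81, -37.79],
    'Longtitude': [144.99, 144.99, 144.98],
    'Price': [1035000.0, 1465000.0, 850000.0],
})

# Target: dot notation pulls out a single column as a pandas Series
y = melbourne_data.Price

# Features: a list of column names selects a smaller DataFrame, by convention X
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
print(X.shape)  # (3, 5)
```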

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c2npm5a8vkuv0kfh1zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c2npm5a8vkuv0kfh1zt.png" alt="Image description" width="745" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf9472vc4tc96uvxq7wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf9472vc4tc96uvxq7wd.png" alt="Image description" width="504" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building your model
&lt;/h2&gt;

&lt;p&gt;You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.&lt;/p&gt;

&lt;p&gt;The steps to building and using a model are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.&lt;/li&gt;
&lt;li&gt;Fit: Capture patterns from provided data. This is the heart of modeling.&lt;/li&gt;
&lt;li&gt;Predict: Just what it sounds like.&lt;/li&gt;
&lt;li&gt;Evaluate: Determine how accurate the model's predictions are.&lt;/li&gt;
&lt;/ol&gt;
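
&lt;p&gt;The four steps can be sketched with scikit-learn's DecisionTreeRegressor (a minimal example; the tiny inline data below stands in for the real dataset, and the prices are invented):&lt;/p&gt;

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in for the housing features and prices
X = pd.DataFrame({'Rooms': [2, 3, 4, 4],
                  'Landsize': [150.0, 200.0, 250.0, 300.0]})
y = pd.Series([850000.0, 1035000.0, 1465000.0, 1600000.0])

# 1. Define: a decision tree; random_state makes each run reproducible
model = DecisionTreeRegressor(random_state=1)

# 2. Fit: capture patterns from the provided data
model.fit(X, y)

# 3. Predict: here, on the first few rows of the training data
predictions = model.predict(X.head())
print(predictions)
```

&lt;p&gt;Step 4, evaluation, is what the Model Validation section below is about.&lt;/p&gt;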

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jl4yoo4a3c4tc15x5s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jl4yoo4a3c4tc15x5s7.png" alt="Image description" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.&lt;/p&gt;

&lt;p&gt;We now have a fitted model that we can use to make predictions.&lt;/p&gt;

&lt;p&gt;In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mbzokuasepzqwrtawtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mbzokuasepzqwrtawtg.png" alt="Image description" width="652" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Validation
&lt;/h2&gt;

&lt;p&gt;You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is &lt;strong&gt;predictive accuracy&lt;/strong&gt;. In other words, will the model's predictions be close to what actually happens?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.&lt;/p&gt;

&lt;p&gt;There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The prediction error for each house is:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;error=actual−predicted&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, if a house costs $150,000 and you predicted it would cost $100,000 the error is $50,000.&lt;/p&gt;

&lt;p&gt;With the MAE metric, we take the &lt;em&gt;&lt;strong&gt;absolute value of each error&lt;/strong&gt;&lt;/em&gt;. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;On average, our predictions are off by about X.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0johlyltl0xftr05rkfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0johlyltl0xftr05rkfz.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have a model, here is how we calculate the mean absolute error:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftiqyrfst7cs94cg2yt7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftiqyrfst7cs94cg2yt7k.png" alt="Image description" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;
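
&lt;p&gt;The calculation in the screenshot uses scikit-learn's mean_absolute_error; the idea can be sketched with made-up numbers:&lt;/p&gt;

```python
from sklearn.metrics import mean_absolute_error

# Made-up actual prices and model predictions
actual = [150000, 200000, 300000]
predicted = [100000, 210000, 320000]

# MAE: average of the absolute errors |actual - predicted|
mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))  # (50000 + 10000 + 20000) / 3 = 26666.67
```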

&lt;h2&gt;
  
  
  The Problem with "In-Sample" Scores
&lt;/h2&gt;

&lt;p&gt;The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.&lt;/p&gt;

&lt;p&gt;Imagine that, in the large real estate market, door color is unrelated to home price.&lt;/p&gt;

&lt;p&gt;However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.&lt;/p&gt;

&lt;p&gt;Since this pattern was derived from the training data, the model will appear accurate in the training data.&lt;/p&gt;

&lt;p&gt;But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.&lt;/p&gt;

&lt;p&gt;Since models' practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that held-out data to test the model's accuracy on data it hasn't seen before. This data is called validation data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg2j8ecidbhy2nouupjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg2j8ecidbhy2nouupjw.png" alt="Image description" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;
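
&lt;p&gt;The split in the screenshot comes from scikit-learn's train_test_split. A sketch on synthetic data (the feature names and price formula here are invented for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the housing data
rng = np.random.default_rng(0)
X = pd.DataFrame({'Rooms': rng.integers(1, 6, 200),
                  'Landsize': rng.normal(400.0, 100.0, 200)})
y = X['Rooms'] * 250000 + X['Landsize'] * 500 + rng.normal(0.0, 50000.0, 200)

# Hold out validation data that the model never sees while fitting
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# In-sample score (optimistic) vs. out-of-sample score (honest)
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))
print(in_sample_mae, validation_mae)
```

&lt;p&gt;On data like this, the in-sample MAE comes out near zero while the validation MAE is large, which is exactly the gap described below.&lt;/p&gt;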

&lt;p&gt;Wow!&lt;br&gt;
Your mean absolute error for the in-sample data was about 500 dollars. Out-of-sample it is more than 250,000 dollars.&lt;/p&gt;

&lt;p&gt;This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.&lt;/p&gt;

&lt;p&gt;There are many ways to improve this model, such as experimenting to find better features or different model types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Underfitting and Overfitting
&lt;/h2&gt;

&lt;p&gt;Fine-tune your model for better performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimenting With Different Models
&lt;/h3&gt;

&lt;p&gt;Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrgwp1ws8nwe8g8qnd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrgwp1ws8nwe8g8qnd8.png" alt="Image description" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf.&lt;/p&gt;

&lt;p&gt;As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups.&lt;/p&gt;

&lt;p&gt;If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2&lt;sup&gt;10&lt;/sup&gt; groups of houses by the time we get to the 10th level. That's 1024 leaves.&lt;/p&gt;

&lt;p&gt;When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a phenomenon called &lt;strong&gt;overfitting&lt;/strong&gt;, where a model matches the training data almost perfectly, but does poorly in validation and other new data&lt;/em&gt;. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.&lt;/p&gt;

&lt;p&gt;At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). &lt;em&gt;When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called &lt;strong&gt;underfitting&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaz582qk2v24wt7jvdq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaz582qk2v24wt7jvdq0.png" alt="Image description" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.&lt;/p&gt;

&lt;p&gt;We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facjk4jb5nrz0ga868hrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facjk4jb5nrz0ga868hrj.png" alt="Image description" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uu4use2pyt0fb5m4ss4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uu4use2pyt0fb5m4ss4.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;
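
&lt;p&gt;The utility function and loop in the screenshots follow this pattern (a sketch; get_mae matches the tutorial's helper, while the data below is synthetic rather than the Melbourne set):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf budget; return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)

# Synthetic stand-in for the housing features and prices
rng = np.random.default_rng(1)
X = pd.DataFrame({'Rooms': rng.integers(1, 6, 500),
                  'Landsize': rng.normal(400.0, 100.0, 500)})
y = X['Rooms'] * 250000 + X['Landsize'] * 500 + rng.normal(0.0, 50000.0, 500)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Compare candidate leaf counts and keep the one with the lowest MAE
scores = {n: get_mae(n, train_X, val_X, train_y, val_y)
          for n in [5, 50, 500, 5000]}
best_leaf_count = min(scores, key=scores.get)
print(scores, best_leaf_count)
```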

&lt;p&gt;Of the options listed, 500 is the optimal number of leaves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overfitting:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Overfitting happens when a model learns not only the underlying patterns in the data but also the noise or random fluctuations that exist in the dataset.&lt;/p&gt;

&lt;p&gt;Imagine you're trying to memorize a list of numbers, including some mistakes. Overfitting is like memorizing not just the real numbers but also the mistakes, which won't be useful for predicting new numbers accurately.&lt;/p&gt;

&lt;p&gt;When a model overfits, it performs very well on the training data (the data it was trained on) but doesn't generalize well to new, unseen data. In other words, it's too tailored to the training data and doesn't work well with new data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Underfitting:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Underfitting occurs when a model is too simple to capture the underlying patterns in the data.&lt;/p&gt;

&lt;p&gt;Going back to the memorization example, underfitting would be like trying to remember a complex list of numbers with just a few general ideas. You'll likely miss many important details.&lt;br&gt;
An underfit model doesn't perform well on either the training data or new data because it doesn't capture enough of the relevant patterns in the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation Data:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Validation data is a separate set of data that the model hasn't seen during training. It's used to evaluate how well the model generalizes to new, unseen data.&lt;br&gt;
Just like taking a practice test before the real one, validation data helps us assess how well the model will perform in the real world.&lt;/p&gt;

&lt;p&gt;By trying out different models and evaluating their performance on the validation data, we can choose the one that performs the best and is most likely to make accurate predictions on new data.&lt;br&gt;
In simple terms, overfitting is like memorizing mistakes along with the right answers, while underfitting is like not studying enough to understand the material. &lt;/p&gt;

&lt;p&gt;Validation data helps us pick the model that performs the best on new problems we haven't seen before.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Correlation is not Causation!</title>
      <dc:creator>Aayush Sinha</dc:creator>
      <pubDate>Tue, 23 May 2023 10:09:20 +0000</pubDate>
      <link>https://dev.to/aayushs7ha/correlation-is-not-causation-4i8a</link>
      <guid>https://dev.to/aayushs7ha/correlation-is-not-causation-4i8a</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt

# Simulating ice cream sales and sunglasses sales data
np.random.seed(0)
days = 100
temperature = np.random.normal(80, 10, days)  # Simulated temperature data
ice_cream_sales = temperature + np.random.normal(0, 5, days)  # Simulated ice cream sales data
sunglasses_sales = temperature + np.random.normal(0, 8, days)  # Simulated sunglasses sales data

# Calculating correlation coefficient
correlation_coefficient = np.corrcoef(ice_cream_sales, sunglasses_sales)[0, 1]

# Plotting the data
plt.scatter(ice_cream_sales, sunglasses_sales)
plt.xlabel('Ice Cream Sales')
plt.ylabel('Sunglasses Sales')
plt.title(f'Correlation: {correlation_coefficient:.2f}')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/er997pzqrny6n9zhvqyz.JPG" alt="Image description"&gt;&lt;/p&gt;



&lt;p&gt;The phrase "correlation is not causation" is a fundamental principle in the field of statistics and scientific research. It reminds us that just because two variables are observed to be related or to occur together does not necessarily mean that one variable causes the other to happen.&lt;/p&gt;

&lt;p&gt;Correlation refers to a statistical relationship between two or more variables, indicating how they tend to change together. It measures the strength and direction of the relationship, ranging from -1 to 1. A correlation coefficient of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation.&lt;/p&gt;
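
&lt;p&gt;The three cases of the correlation coefficient can be checked with NumPy's corrcoef (a small sketch with hand-picked vectors):&lt;/p&gt;

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect positive correlation: y moves linearly with x
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect negative correlation: y moves linearly against x
r_neg = np.corrcoef(x, -3 * x)[0, 1]

# Zero correlation: y is chosen to have no linear relationship with x
r_none = np.corrcoef(x, np.array([-1.0, 1.0, 1.0, -1.0]))[0, 1]

print(r_pos, r_neg, r_none)  # close to 1.0, -1.0, 0.0
```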

&lt;p&gt;Causation, on the other hand, refers to a cause-and-effect relationship between variables, where changes in one variable directly lead to changes in another variable. Establishing causation requires more than just observing a correlation. It involves rigorous experimentation, controlling for other factors, and demonstrating that changes in one variable lead to predictable and consistent changes in another variable.&lt;/p&gt;

&lt;p&gt;It's important to be cautious when interpreting correlations because there can be various reasons behind the observed relationship. Correlation does not provide evidence of causation because there might be underlying factors, often called confounding variables, that influence both variables simultaneously. Additionally, the correlation could be coincidental or the result of other factors that were not considered.&lt;/p&gt;

&lt;p&gt;To determine causation, researchers often use experimental designs, such as randomized controlled trials, where they manipulate one variable and observe the effect on another variable while controlling for confounding factors. Such experiments allow researchers to make stronger claims about causation.&lt;/p&gt;

&lt;p&gt;In summary, while correlations can be useful for identifying relationships between variables, it is crucial to remember that correlation alone does not establish causation. Additional evidence and rigorous research methods are necessary to determine causal relationships between variables.&lt;/p&gt;

&lt;p&gt;Let's consider a simple example to illustrate the difference between correlation and causation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: Ice cream sales and sunglasses sales&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Suppose we observe a strong positive correlation between ice cream sales and sunglasses sales. That is, on hot sunny days, when ice cream sales increase, so do sunglasses sales. Based on this correlation, we might be tempted to conclude that increased ice cream sales cause increased sunglasses sales. However, this would be an example of mistakenly inferring causation from correlation.&lt;/p&gt;

&lt;p&gt;Explanation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Correlation: The observed correlation suggests that there is a statistical relationship between ice cream sales and sunglasses sales. It indicates that the two variables tend to change together. On hot sunny days, people are more likely to buy both ice cream and sunglasses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Causation: However, correlation alone does not provide evidence of causation. In this example, ice cream sales and sunglasses sales might be correlated due to a common factor, such as weather. Hot sunny weather could be the driving factor behind both increased ice cream sales and increased sunglasses sales. People are more likely to buy ice cream to cool down and enjoy a refreshing treat, and they also need sunglasses to protect their eyes from the bright sunlight. Thus, weather is a confounding variable that influences both variables simultaneously, creating a correlation between them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we were to mistakenly assume causation based on this correlation, we might conclude that selling more ice cream causes an increase in sunglasses sales. However, this conclusion ignores the underlying factor of hot sunny weather, which is the actual cause behind the observed correlation.&lt;/p&gt;

&lt;p&gt;To establish causation, we would need to conduct controlled experiments, such as manipulating ice cream sales while controlling for other factors like weather, and observing the effect on sunglasses sales. Only through such rigorous experimentation can we determine whether there is a causal relationship between the variables.&lt;/p&gt;

&lt;p&gt;Remembering that correlation does not imply causation is crucial for sound reasoning and accurate interpretation of statistical relationships.&lt;/p&gt;

</description>
      <category>correlation</category>
      <category>statistics</category>
      <category>causation</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
