
Apoorva Dave


Regression from scratch  -  Wine quality prediction

In our previous posts, we covered the basics of machine learning and the types of regression. In this article, we will do our first Machine Learning project. It will give you an idea of how to implement regression on different datasets, and it takes about an hour to set up, understand and code. So let’s get started! 😃


The task here is to predict the quality of red wine on a scale of 0–10 given a set of features as inputs. I have solved it as a regression problem using Linear Regression.

The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. You can check the dataset here

Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. And the output variable (based on sensory data) is quality (score between 0 and 10). Below is a screenshot of the top 5 rows of the dataset.

Top 5 rows of Wine Quality dataset

Dependencies

The code is in Python. Other than this, please install the following libraries using pip.

  1. Pandas: pip install pandas
  2. matplotlib: pip install matplotlib
  3. numpy: pip install numpy
  4. scikit-learn: pip install scikit-learn
  5. seaborn: pip install seaborn

And that’s it! You are halfway through 😄. Next, follow the below steps in order to build a linear regression model in no time!

Approach

Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.

import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn import metrics 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

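Read the data into a dataframe using pandas. To check the top 5 rows of the dataset, use df.head(). If you downloaded the file directly from the UCI repository, it is semicolon-separated, so pass sep=';' to read_csv; otherwise all the columns are read as a single column and df['quality'] raises a KeyError.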

df = pd.read_csv('winequality-red.csv')
df.head()
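
As a quick sanity check on the delimiter, here is a small sketch. It does not touch the real file: it reads a two-row, made-up sample from memory (the values and the three-column layout are assumptions for illustration) and shows what pandas returns with and without the separator.

```python
import io
import pandas as pd

# Tiny made-up sample in a semicolon-separated layout (illustration only).
sample = 'alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;6\n'

# The default separator is a comma, so the whole header becomes one column.
wrong = pd.read_csv(io.StringIO(sample))
print(wrong.columns.tolist())

# Passing the separator gives one column per feature.
right = pd.read_csv(io.StringIO(sample), sep=';')
print(right.columns.tolist())
print('quality' in right.columns)
```

If the first print on your own file shows a single long column name, the separator is the likely culprit.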

Find the correlation of each attribute with the target variable using corr().

# there are no categorical variables. each feature is a number. Regression problem. 
# Given the set of values for features, we have to predict the quality of wine. 
# finding correlation of each feature with our target variable - quality
correlations = df.corr()['quality'].drop('quality')
print(correlations)
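
To see what each step of that one-liner does, here is a minimal sketch on a toy dataframe. The four rows and the name toy_correlations are made up for illustration; only the chain of calls mirrors the code above.

```python
import pandas as pd

# Made-up numbers for illustration; the real dataset has 11 feature columns.
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0],
    'volatile acidity': [0.8, 0.6, 0.5, 0.3],
    'quality': [4, 5, 6, 7],
})

corr_matrix = toy.corr()               # square matrix: every column against every column
with_target = corr_matrix['quality']   # one column of it: each variable against quality
# drop quality's correlation with itself, which is always 1
toy_correlations = with_target.drop('quality')
print(toy_correlations)
```

The sign tells you the direction of the relationship and the magnitude tells you how strong the linear association is.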

Correlations between each attribute and target variable — quality

To draw a heatmap and get a detailed diagram of correlation, insert the below code.

sns.heatmap(df.corr())
plt.show()

Heatmap

Define a function get_features() which outputs only those features whose correlation is above a threshold value (passed as an input parameter to function).

def get_features(correlation_threshold):
    # compare on absolute values so strong negative correlations are kept too
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
    return high_correlations
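
To get a feel for how the threshold changes the selection, here is a self-contained sketch. The helper select_features and the three correlation values are hypothetical; unlike get_features, it takes the correlations as an argument instead of reading a global variable, which makes it easier to try out.

```python
import pandas as pd

def select_features(corrs, correlation_threshold):
    """Return the names whose absolute correlation exceeds the threshold."""
    abs_corrs = corrs.abs()
    return abs_corrs[abs_corrs > correlation_threshold].index.tolist()

# Made-up correlation values for illustration only.
example_corrs = pd.Series({
    'alcohol': 0.48,
    'residual sugar': 0.01,
    'volatile acidity': -0.39,
})

print(select_features(example_corrs, 0.05))  # ['alcohol', 'volatile acidity']
print(select_features(example_corrs, 0.40))  # ['alcohol']
```

Note that volatile acidity survives the 0.05 threshold even though its correlation is negative, because the comparison is made on absolute values.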

Create x containing the input features and y containing the quality variable. With a threshold of 0.05, x gets all the features except residual sugar, whose correlation with quality falls below the threshold. Raising the threshold keeps fewer features.

# taking features with correlation more than 0.05 as input x and quality as target variable y 
features = get_features(0.05) 
print(features) 
x = df[features] 
y = df['quality']

Create the training and testing sets using train_test_split. By default, 25% of the data is used for testing and 75% for training. You can check the size of each set using x_train.shape and x_test.shape.

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=3)
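
If you want to verify the split sizes or change the proportion, here is a rough sketch on 100 made-up rows. The names x_demo and y_demo are assumptions for illustration; test_size is a real train_test_split parameter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up rows with two features (illustration only).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 2)), columns=['f1', 'f2'])
y_demo = pd.Series(rng.integers(3, 9, size=100), name='quality')

# With no test_size given, scikit-learn holds out 25% of the rows.
xa, xb, ya, yb = train_test_split(x_demo, y_demo, random_state=3)
print(xa.shape, xb.shape)  # (75, 2) (25, 2)

# test_size makes the proportion explicit, here 20%.
xc, xd, yc, yd = train_test_split(x_demo, y_demo, test_size=0.2, random_state=3)
print(xc.shape, xd.shape)  # (80, 2) (20, 2)
```

Fixing random_state, as the article does, keeps the split identical between runs so the scores stay comparable.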

Once the training and testing sets are created, it is time to build your Linear Regression model. You can simply use the built-in function to create a model and then fit to training data. Once trained, coef_ gives the values of the coefficients for each feature.

# fitting linear regression to training data
regressor = LinearRegression()
regressor.fit(x_train,y_train)
# this gives the coefficients of the 10 features selected above. 

print(regressor.coef_)
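
A bare array of coefficients is hard to read, so here is a small sketch that pairs each one with its feature name and also prints the intercept. The data is synthetic, built from a known rule (y = 2*a - 3*b + 1) so you can see the model recover it; the column names a and b are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where the true relationship is known: y = 2*a - 3*b + 1.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(50, 2)), columns=['a', 'b'])
y_demo = 2 * x_demo['a'] - 3 * x_demo['b'] + 1

model = LinearRegression().fit(x_demo, y_demo)

# coef_ is ordered like the columns of the training frame.
named_coefs = pd.Series(model.coef_, index=x_demo.columns)
print(named_coefs)
print('intercept:', model.intercept_)
```

On the wine data the same idea would be pd.Series(regressor.coef_, index=features).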

To predict the quality of wine with this model, use predict().

train_pred = regressor.predict(x_train)
print(train_pred)
test_pred = regressor.predict(x_test) 
print(test_pred)

Calculate the root mean squared error (RMSE) for the training and testing sets. RMSE is a frequently used measure of the differences between the values predicted by a model and the values actually observed. The RMSE for the training and test sets should be very similar if we have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that we have badly overfit the data.

# calculating rmse (metrics was imported from sklearn above)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print(train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print(test_rmse)
# rounding off the predicted values for test set
predicted_data = np.round(test_pred)
print(predicted_data)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))
# displaying coefficients of each feature
coeffecients = pd.DataFrame(regressor.coef_, features)
coeffecients.columns = ['Coeffecient']
print(coeffecients)
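
If the RMSE formula feels like a black box, here is a minimal sketch that computes it by hand with numpy and compares it with the scikit-learn route. The three observed scores and predictions are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up observed scores and predictions (illustration only).
y_true = np.array([5, 6, 7])
y_pred = np.array([5.0, 5.0, 8.0])

errors = y_pred - y_true                      # 0, -1, 1
rmse_by_hand = np.sqrt(np.mean(errors ** 2))  # square, average, then take the root
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
print(rmse_by_hand, rmse_sklearn)
```

Both values come out the same (about 0.82 here), and because RMSE is in the units of the target, it can be read as a typical error in quality points.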

Coefficients of each feature

These numbers mean that, holding all other features fixed, a 1 unit increase in sulphates is associated with an increase of about 0.8 in the predicted quality of wine.
Likewise, holding all other features fixed, a 1 unit increase in volatile acidity is associated with a decrease of about 0.99 in the predicted quality, and similarly for the other features.
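
You can check this reading of the coefficients directly. The sketch below fits a model on synthetic data (the columns a and b and their values are made up), raises one feature by 1 unit while keeping the other fixed, and compares the change in prediction with the fitted coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data with a little noise (illustration only).
rng = np.random.default_rng(1)
x_demo = pd.DataFrame(rng.normal(size=(60, 2)), columns=['a', 'b'])
y_demo = 0.8 * x_demo['a'] - 1.0 * x_demo['b'] + rng.normal(scale=0.1, size=60)

model = LinearRegression().fit(x_demo, y_demo)

base = pd.DataFrame({'a': [0.5], 'b': [0.3]})
bumped = base.copy()
bumped['a'] += 1.0  # raise 'a' by one unit, keep 'b' fixed

change = model.predict(bumped)[0] - model.predict(base)[0]
print(change, model.coef_[0])
```

The two printed numbers match, which is exactly what "holding all other features fixed" means for a linear model. It describes the model's prediction, not a guaranteed causal effect on the wine.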

Thus, with a few lines of code, we were able to build a Linear Regression model to predict the quality of wine with RMSE scores of 0.65 and 0.63 for the training and testing sets respectively. This is just an idea to help you start with regression. You can play with the threshold value, try other regression models and try feature engineering as well 😍.

To get the entire code, please use this link to my repository. The dataset is also uploaded :) Clone the repository and run the notebook to see the results.

The next articles will be on Classification, with a similar small project on it. Stay tuned for more! Till then, happy learning 😸

Top comments (7)

Eric Alcaraz del Pico

I have a problem:
correlations = df.corr()['quality'].drop('quality')
Keyerror : 'quality'
Some idea?

Apoorva Dave

The dataframe into which you have read csv file should contain the column 'quality'.
correlations = df.corr()['quality'].drop('quality')
Here we are trying to find correlations between column quality and all the other columns other than quality. 'quality' is our target variable.

Eric Alcaraz del Pico • Edited

This my csv

I have this column named 'quality'.
This is my code:

And i am having the problem:

Help pls :,(

Apoorva Dave

I see you have values for the 'quality' column in the dataset, but they are not being read properly. As you can see in the output, rows 2 and 3 are showing .... but values are present in the actual dataset. Can you try printing df['quality'] and see whether there are blank values for it?

Govind Bisht

It is reading all the columns as one column. You need to pass the separator while reading the CSV file.

ex:
df = pd.read_csv('winequality-red.csv', sep=";")

Govind Bisht

Hey Eric,

Please pass the separator value while reading the CSV file.

ex:
df = pd.read_csv('winequality-red.csv', sep=";")

Rogério Soares

congratulation...