Dan for Leading EDJE

Posted on Sep 15, 2020

Machine Learning and Wine Quality: Finding a good wine using multiple classifications

#ai #machinelearning

Wine Tasting

Wine tasting is an esoteric process with many ceremonies and customs. Everything from the shape of the glass to the temperature of the wine can affect how a wine is rated. Wine experts examine color, viscosity, smell, taste and secondary aromas. While machines could examine wines in a similar fashion, it would be extremely expensive and difficult. A more feasible option is to use gas spectrum analysis along with pH and other chemical indicators to break a wine down into 11 variables. Using these variables along with reviews, we can create a model that predicts whether a wine is “good” and shows which of these variables matter most in that prediction.

This project will use Kaggle’s Red Wine Quality dataset to create multiple classification models in an effort to predict if a red wine is “good” or not. The wines in the dataset already have been reviewed and rated from 0 to 10. The following 11 variables were also made available:

Fixed acidity
Volatile acidity
Citric acid
Residual sugar
Chlorides
Free sulfur dioxide
Total sulfur dioxide
Density
pH
Sulphates
Alcohol

We are going to experiment with several classification models to see which one can return the highest accuracy with this dataset. In doing so we will also get a good idea of which variables are most important in determining a “good” wine.

Setup

Import the dataset and the libraries that we will use:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Read the data:

df = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

Examine the data:

print("Rows, columns: " + str(df.shape))
df.head()

You will see that there are a total of 1599 rows and 12 columns. The first five rows show no obvious issues with the data. Let’s check for missing values:

print(df.isna().sum())

Kaggle has provided a nice clean dataset with no missing values.

Visualizing the Variables

Histogram of the quality variable

To check that the quality variable is spread across enough different scores, and that each score has enough wines, we create a histogram:

fig = px.histogram(df,x='quality')
fig.show()

Variable Correlations

In order to visualize the correlations between the variables we will create a correlation matrix. This will enable us to understand the different relationships between the variables and even determine which variables are correlated to good quality wines.

# pyplot provides subplots(); imported here so this snippet runs on its own
import matplotlib.pyplot as plt

corr = df.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))
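If the heatmap is hard to read, the quality column of the correlation matrix can also be ranked directly. The sketch below uses a small synthetic DataFrame called `demo` (an assumption made only so the snippet runs on its own); with the real data you would use `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; swap in the real `df`
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.0, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.2, n)
demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    # Synthetic quality score built to rise with alcohol and fall with acidity
    "quality": 0.8 * alcohol - 2.0 * volatile_acidity + rng.normal(0, 0.5, n),
})

# Correlation of every feature with quality, strongest positive first
quality_corr = demo.corr()["quality"].drop("quality").sort_values(ascending=False)
print(quality_corr)
```

Features near the top of the list move with quality, and features near the bottom move against it.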

Convert to a Classification Problem

Going back to the objective of predicting whether a wine is good, the output variable needed to be binary.

For this problem, I defined a bottle of wine as ‘good quality’ if it had a quality score of 8 or higher, and if it had a score of less than 8, it was deemed ‘bad quality’.

Once I converted the output variable to a binary output, I separated my feature variables (X) and the target variable (y) into separate dataframes.

# Create Classification version of target variable
df['goodquality'] = [1 if x >= 8 else 0 for x in df['quality']]

# Separate feature variables and target variable
X = df.drop(['quality','goodquality'], axis = 1)
y = df['goodquality']

Proportion of Good vs Bad Wines

I wanted to make sure that there was a reasonable number of good quality wines. Based on the results below, it seemed like a fair enough number. In some applications, resampling may be required if the data was extremely imbalanced, but I assumed that it was okay for this purpose.

# See proportion of good vs bad wines
df['goodquality'].value_counts()
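Raw counts can be easier to judge next to proportions. Here is a minimal sketch using `value_counts(normalize=True)`; the scores in `quality` are made-up toy values, not the real dataset.

```python
import pandas as pd

# Hypothetical toy scores; with the real data use df['quality']
quality = pd.Series([5, 6, 5, 7, 8, 6, 5, 4, 6, 8])
good = (quality >= 8).astype(int)

counts = good.value_counts()                # absolute number per class
shares = good.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares)
```

A very small share for class 1 is a hint that accuracy alone may look good even when few good wines are found.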

Preparing Data for Modeling

Normalizing Feature Variables

Next, I prepared the data for modeling. The first thing I did was normalize the data (more precisely, standardize it). This transforms each feature so that its distribution has a mean of 0 and a standard deviation of 1. It's important to do this so that features measured on different scales cover comparable ranges.

For example, imagine a dataset with two input features: height in millimeters and weight in pounds. Because the values of height are much larger due to their unit of measurement, a scale-sensitive model would automatically place greater emphasis on height than on weight, creating a bias.

# Normalize feature variables
from sklearn.preprocessing import StandardScaler
X_features = X
X = StandardScaler().fit_transform(X)
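To make the height and weight example concrete, here is a small sketch with made-up measurements (the numbers are assumptions chosen for illustration). After scaling, both columns have a mean of 0 and a standard deviation of 1, so neither dominates because of its unit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: column 0 is height in mm, column 1 is weight in lb
data = np.array([
    [1500.0, 120.0],
    [1650.0, 150.0],
    [1800.0, 180.0],
    [1950.0, 210.0],
])

scaled = StandardScaler().fit_transform(data)

print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column standard deviation after scaling
```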

Split data

Next, I split the data into a training set and a test set so that I could evaluate my models on data they had not seen and determine their effectiveness.

# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0)
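One caveat: the scaler above was fit on the full dataset before the split, so the test rows influence the scaling. A possible variation (not what this article does) is to split first and put the scaler inside a scikit-learn pipeline, so it is fit on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine features, and `stratify` to keep the class ratio similar in both sets; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the 11 wine features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.9, 0.1], random_state=0
)

# Split first; stratify keeps the share of positives similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

Tree-based models are largely insensitive to feature scaling, so the difference here is likely small, but the pattern carries over to models where scaling matters.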

Modeling

For this project, we will compare five different machine learning models: decision trees, random forests, AdaBoost, Gradient Boosting, and XGBoost. For the purpose of this project, I wanted to compare these models by their accuracy.

Model 1: Decision Tree
Decision trees are a popular model, used in operations research, strategic planning, and machine learning. Each decision point in the tree is called a node, and, generally, the more nodes a tree has, the more closely it fits the training data. The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build but often fall short when it comes to accuracy.

from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

model1 = DecisionTreeClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)

print(classification_report(y_test, y_pred1))
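To see nodes and leaves concretely, a fitted scikit-learn tree reports its own size. This sketch fits a tree on synthetic data (an assumption, used only to keep the snippet self-contained); with the article's model you would inspect `model1` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

n_nodes = tree.tree_.node_count  # all nodes, including leaves
n_leaves = tree.get_n_leaves()   # terminal nodes where a decision is made
depth = tree.get_depth()         # longest path from the root to a leaf

print(n_nodes, n_leaves, depth)
```

An unrestricted tree keeps splitting until its leaves are pure, which is why parameters such as `max_depth` are often used to limit its size.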

Model 2: Random Forest
Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.

For example, suppose four decision trees predict 1, 1, 0, and 1. If we relied on a single tree, say the third one, the prediction would be 0. But if we relied on the mode of all 4 decision trees, the predicted value would be 1. This is the power of random forests.

from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier(random_state=1)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)

print(classification_report(y_test, y_pred2))
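The “majority wins” idea can be sketched in a few lines. The four votes below are hypothetical, chosen so that the third tree disagrees with the others as in the example above.

```python
from statistics import mode

# Hypothetical predictions from four trees for one wine
tree_votes = [1, 1, 0, 1]

single_tree_prediction = tree_votes[2]  # trusting only the third tree
forest_prediction = mode(tree_votes)    # most common vote across all trees

print(single_tree_prediction, forest_prediction)
```

scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities rather than counting hard votes, but the intuition is the same.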

Model 3: AdaBoost
The next three models are boosting algorithms that take weak learners and turn them into strong ones. I don’t want to get sidetracked and explain the differences between the three because it’s quite complicated and intricate. That being said, I’ll leave some resources where you can learn about AdaBoost, Gradient Boosting, and XGBoost.

StatQuest: AdaBoost
StatQuest: Gradient Boost
StatQuest: XGBoost

from sklearn.ensemble import AdaBoostClassifier
model3 = AdaBoostClassifier(random_state=1)
model3.fit(X_train, y_train)
y_pred3 = model3.predict(X_test)

print(classification_report(y_test, y_pred3))

Model 4: Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier
model4 = GradientBoostingClassifier(random_state=1)
model4.fit(X_train, y_train)
y_pred4 = model4.predict(X_test)

print(classification_report(y_test, y_pred4))

Model 5: XGBoost

import xgboost as xgb
model5 = xgb.XGBClassifier(random_state=1)
model5.fit(X_train, y_train)
y_pred5 = model5.predict(X_test)

print(classification_report(y_test, y_pred5))

Comparing the five models, the random forest and XGBoost seem to yield the highest accuracy. However, since XGBoost has a better f1-score for predicting good quality wines (1), XGBoost appears to be the better model.
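Reading five separate reports side by side is error-prone, so it can help to collect accuracy and the class 1 f1-score in one table. The sketch below is an illustration under assumptions: it uses synthetic data and only the four scikit-learn models (XGBoost is left out to avoid the extra dependency), so its numbers say nothing about the wine dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the wine data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        # f1 for the positive class only; 0 if that class is never predicted
        "f1 (class 1)": f1_score(y_te, pred, pos_label=1, zero_division=0),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

With imbalanced classes, the f1 column is usually the more informative of the two, which matches the reasoning used above to pick between the top models.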

Feature Importance

The graphs below show the feature importances from the Random Forest model and the XGBoost model. While they vary slightly, the top 3 features are the same: alcohol, volatile acidity, and sulphates. Below the graphs, the dataset is split into good quality and bad quality wines to compare these variables in more detail.

Random Forest

feat_importances = pd.Series(model2.feature_importances_, index=X_features.columns)
feat_importances.nlargest(25).plot(kind='barh',figsize=(10,10))

XGBoost

feat_importances = pd.Series(model5.feature_importances_, index=X_features.columns)
feat_importances.nlargest(25).plot(kind='barh',figsize=(10,10))

Comparing the Top 4 Features

# Filtering df for only good quality
df_temp = df[df['goodquality']==1]
df_temp.describe()

# Filtering df for only bad quality
df_temp2 = df[df['goodquality']==0]
df_temp2.describe()

By looking into the details, we can see that good quality wines have higher levels of alcohol on average, have a lower volatile acidity on average, higher levels of sulphates on average, and higher levels of residual sugar on average.
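The same comparison can be made in a single table by grouping on the target column instead of scanning two `describe()` outputs. The values in `demo` below are made up for illustration and are not results from the dataset; with the real data you would group `df` by `goodquality`.

```python
import pandas as pd

# Hypothetical rows; swap in the real `df` with its 'goodquality' column
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 12.8, 10.0, 12.6],
    "volatile acidity": [0.70, 0.88, 0.35, 0.66, 0.30],
    "goodquality": [0, 0, 1, 0, 1],
})

# One row per class, one column per feature, holding the class average
group_means = demo.groupby("goodquality").mean()
print(group_means)
```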