MARTINS ADEGBAJU

Enhancing Red Wine Quality Prediction: Leveraging Machine Learning for Multiclass Classification and Data Imbalance Management

Abstract: This study focuses on improving red wine quality forecasting with machine learning, specifically addressing the challenges of multiclass classification and data imbalance. Using a dataset of physicochemical properties and quality ratings of red wines, several supervised learning algorithms were employed to predict wine quality, categorized into three classes: good, middle, and bad. The study highlights the importance of feature selection, model training, and balancing techniques in improving prediction accuracy, and offers insights into the practical applications of predictive analytics in the wine industry.

https://www.kaggle.com/code/adegbaju/enhancing-red-wine-quality-prediction-leveraging/notebook

Introduction: In the intricate world of viticulture and oenology, forecasting the quality of red wine is a pivotal task that significantly influences consumer satisfaction and shapes brand reputation in the competitive wine market. Red wine, characterized by rich hues ranging from intense violet to deep brown (indicative of its age), is produced through a meticulous process of grape selection, fermentation, aging, and bottling. Traditionally, human experts assess wine quality, but this process can be subjective and inconsistent. Because red wine quality is influenced by numerous chemical and sensory attributes, machine learning (ML) offers a potent tool for predicting it with high precision. A significant challenge, however, arises from the inherent class imbalance in wine quality datasets, where some quality classes are substantially underrepresented. This imbalance can skew predictive models, leading to unreliable and biased predictions, particularly for the underrepresented classes.

Addressing this imbalance is crucial for developing robust and accurate predictive models. Various techniques, such as oversampling the minority class, have been proposed and employed in different domains to mitigate the effects of class imbalance.
Earlier research has applied diverse machine learning methods to wine quality forecasting, typically framing it as a regression or binary classification problem. Categorizing wine into multiple quality levels while managing imbalanced data, however, remains underexplored. In the context of red wine quality prediction, such techniques could significantly improve the precision and dependability of predictive models across all quality categories. This study explores and compares the effectiveness of different oversampling techniques alongside various machine learning algorithms for correcting class imbalance in red wine quality prediction. Through a thorough comparative analysis, it seeks to identify the most effective strategies for improving predictive accuracy, contributing valuable insights to predictive analytics in viticulture.

Previous work: Jain et al. (2023) developed machine learning models to predict wine quality from physicochemical properties, with Random Forest and XGBoost showing high accuracy and informative feature importances. Di and Yang (2022) used a one-dimensional convolutional neural network, enhanced with dropout and batch normalization, to predict red wine quality by analysing physicochemical correlations.

Methodology

1. Data Collection:

The study uses a standard red wine quality dataset comprising 12 chemical and sensory attributes, sourced from a well-known public repository (https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009), ensuring its relevance and standardization for predictive modelling.
Correlation Matrix: A correlation matrix is an essential data-analysis tool that reveals the extent to which variables in a dataset are interrelated. It shows how changes in one feature correspond with changes in another, aiding the discovery of potential predictors of outcomes such as red wine quality.

[Figure: correlation matrix heatmap of the red wine dataset features]
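
As a minimal sketch of how such a matrix can be produced with pandas and seaborn (assuming the Kaggle CSV with its standard column names; the notebook's exact plotting code may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the UCI red wine dataset (filename as distributed on Kaggle).
df = pd.read_csv("winequality-red.csv")

# Pairwise Pearson correlations between all 12 attributes.
corr = df.corr()

# A heatmap makes strongly (anti-)correlated feature pairs easy to spot.
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of red wine features")
plt.tight_layout()
plt.show()
```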

2. Data Preprocessing:

2.1. Handling Missing Values: The dataset contains no missing values, so no imputation or exclusion was required.

[Figure: missing-value check output]
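
A one-line check is enough to confirm this (a sketch, reusing the df loaded above):

```python
# Count missing entries per column; every count is expected to be zero.
print(df.isnull().sum())
```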

2.2. Using SelectKBest: SelectKBest will be applied to choose the five features that contribute most to predicting red wine quality, based on each feature's statistical relationship with the outcome variable. Since quality is treated here as a three-class label, f_classif is the natural score function (f_regression would apply if the raw numeric score were the target).

[Figure: SelectKBest feature selection output]
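
A sketch of this step, assuming X holds the 11 physicochemical columns and y the quality target (f_classif is used here to match the class-label framing):

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns="quality")   # 11 physicochemical predictors
y = df["quality"]                # quality target

# Keep the five features with the strongest univariate relationship to y.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()].tolist())
```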

2.3. Feature Scaling: All numerical features will be standardized to a uniform scale, preventing features with larger ranges from dominating those with smaller ones.

[Figure: feature scaling code]
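
A sketch with StandardScaler (in practice the scaler should be fitted on the training split only, to avoid leaking test-set statistics):

```python
from sklearn.preprocessing import StandardScaler

# Standardize each selected feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_selected)
```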

2.4. Encoding Categorical Variables: Categorical variables will be encoded with LabelEncoder to transform them into a machine-readable numerical format. The StandardScaler is then applied to the encoded features so that the machine learning algorithms can interpret them on a common scale.

[Figure: encoding and scaling code]
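
A sketch of deriving and encoding the three quality classes. The bin edges below (<=4 bad, 5–6 middle, >=7 good) are an illustrative assumption, not necessarily the notebook's exact thresholds:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Map the numeric quality score into three ordered categories
# (bin edges are an assumption for illustration).
quality_class = pd.cut(df["quality"], bins=[0, 4, 6, 10],
                       labels=["bad", "middle", "good"])

# LabelEncoder turns the category names into integer codes (0, 1, 2).
y_encoded = LabelEncoder().fit_transform(quality_class)
```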

2.5. Data Splitting: The dataset is divided into training and testing subsets, with 75% of the data allocated for training and the remaining 25% for testing.

[Figure: train/test split code]
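
A sketch of the 75/25 split (stratification is an added assumption that keeps all three classes represented in both subsets):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded,
    test_size=0.25,        # 25% held out for testing
    random_state=42,       # reproducible split
    stratify=y_encoded,    # preserve class proportions (assumption)
)
```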

Selection of Algorithms and Hyperparameter Justification

Logistic Regression: a statistical method for predictive modelling that estimates the probability of an outcome from one or more predictor variables; its multinomial extension handles the three quality classes here. The hyperparameters used are:
max_iter=1000: Increased from the default to ensure convergence, particularly important for more complex or larger datasets where the default setting may be insufficient.
random_state=42: Guarantees that the model’s outputs are reproducible across different runs, essential for scientific validation.
C=1.0: Maintains the default regularization strength, providing a balance that prevents overfitting while allowing sufficient model flexibility.
solver=’lbfgs’: Chosen for its efficiency on smaller datasets and its capability to handle multinomial loss, making it suitable for multiclass classification in wine quality prediction.

Decision Tree Classifier: is a machine learning algorithm that uses a tree-structured series of decisions and possible outcomes to perform classification tasks. It operates by dividing the data into subsets according to feature values, which simplifies understanding and visualizing the decision-making process.
random_state=42: Ensures consistent results across different executions, vital for comparative analysis.
max_depth=None: Allows the tree to expand fully based on the training data, which can capture complex patterns but requires careful monitoring to avoid overfitting.
min_samples_split=2: The minimum number of samples required to split an internal node, set low to enable detailed data segmentation and capture nuances in the dataset.

Random Forest Classifier: is a machine learning model that constructs multiple decision trees during training and predicts the class representing the most frequent outcome among the individual tree predictions. This ensemble method is effective for both classification and regression tasks, offering robustness and accuracy by mitigating the risk of overfitting common in individual decision trees.
random_state=42: Provides reproducibility in model results, which is critical for the validation of experimental outcomes.
n_estimators=100: A balanced default that provides a good compromise between computational demand and model performance, allowing for a robust ensemble of decision trees.

XGBoost Classifier: is a powerful machine learning algorithm that uses gradient boosting framework to optimize decision trees, enhancing performance and speed for classification tasks. It is renowned for its efficiency, scalability, and capability to manage large and complex datasets with great precision.
use_label_encoder=False: Adapts to the latest XGBoost updates, which recommend manual label encoding over automatic to avoid deprecation warnings.
eval_metric=’logloss’: Focuses on minimizing logarithmic loss; for the three-class quality target, the multiclass variant ’mlogloss’ plays the same role, sharpening the model’s ability to distinguish between wine quality classes.
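
A sketch of the four classifiers configured with the hyperparameters listed above (not the notebook's verbatim code; note that use_label_encoder applies to XGBoost 1.x and was removed in 2.x):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(
        max_iter=1000, random_state=42, C=1.0, solver="lbfgs"),
    "Decision Tree": DecisionTreeClassifier(
        random_state=42, max_depth=None, min_samples_split=2),
    "Random Forest": RandomForestClassifier(
        random_state=42, n_estimators=100),
    # 'mlogloss' is the multiclass counterpart of the 'logloss' metric.
    "XGBoost": XGBClassifier(
        use_label_encoder=False, eval_metric="mlogloss", random_state=42),
}

# Train each model and report held-out accuracy on the 25% test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```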

These models were configured to address the specific challenges and characteristics of red wine quality prediction. The bar chart presents the accuracy of six different algorithms: the Random Forest Classifier leads with 0.870, while the Decision Tree Classifier shows the lowest accuracy at 0.812.

[Figure: accuracy bar chart for the compared algorithms]

3. Imbalance Correction:

Oversampling with SMOTE:
The Synthetic Minority Over-Sampling Technique (SMOTE) is a popular and effective approach for tackling class imbalance in machine learning datasets. Class imbalance arises when the instances of one class vastly exceed those of one or more other classes, potentially resulting in biased models. These models typically perform well for the dominant class but struggle with the minority class because they are disproportionately influenced by the larger class. SMOTE addresses this by generating synthetic samples for the minority class rather than duplicating existing samples. It selects a random point from the minority class, calculates the difference between that point and its nearest neighbors, and creates synthetic samples by multiplying this difference by a random value between 0 and 1 and adding it to the original point from the minority class. This process not only augments the data size but also helps in generalizing the decision boundaries, making the model less prone to overfitting to the majority class.

[Figure: class distribution after SMOTE oversampling]
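
A minimal sketch using imbalanced-learn. Resampling is applied to the training split only; oversampling before the split would leak synthetic points into the test set:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Class counts before and after: minority classes are synthesized up
# to match the majority class.
print(Counter(y_train), "->", Counter(y_train_res))
```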

GridSearchCV: Integrating SMOTE with GridSearchCV in the training process enhances model performance by letting the models learn from a more balanced dataset. GridSearchCV identifies the best hyperparameters for a model by methodically exploring combinations of parameter settings, using cross-validation to establish which combination delivers the optimal result. In this study, GridSearchCV was employed alongside SMOTE to fine-tune the parameters of several machine learning algorithms, as outlined below (a pipeline sketch follows the list):

  1. Logistic Regression: Parameters like ‘C’ (regularization strength), ‘solver’ (algorithm to use for optimization), and ‘class_weight’ (weights associated with classes) are crucial. GridSearchCV helped in tuning these parameters under the balanced dataset provided by SMOTE, enhancing the model’s capability to generalize across minority classes.

  2. Decision Tree and Random Forest: These models benefit from tuning parameters such as ‘max_depth’ (the deepest level the tree can reach), ‘min_samples_split’ (the minimum sample count needed to divide an internal node), and ‘criterion’ (the method used to evaluate the quality of a split). SMOTE, combined with GridSearchCV, allowed these tree-based models to avoid overfitting by adequately learning the minority class characteristics.

  3. KNeighbors Classifier: It includes parameters such as ‘n_neighbors’ (the number of neighbors considered), ‘weights’ (the function that assigns weights for prediction), and ‘p’ (the power parameter for the Minkowski metric). Through GridSearchCV, the best parameters were identified that worked well with the balanced dataset created by SMOTE, ensuring that the minority class influences were not overshadowed by the majority class.

  4. XGBoost: This algorithm includes parameters like ‘learning_rate’, ‘max_depth’, ‘n_estimators’, and ‘subsample’. Tuning these parameters with GridSearchCV on a SMOTE-enhanced dataset helped in preventing the model from being too biased towards the majority class while improving overall prediction accuracy across all classes.

Using SMOTE with GridSearchCV across these algorithms ensured that the models were not only tuned to their optimal parameters but were also trained on a dataset that mimicked a real-world scenario where class distribution is not always balanced. This approach significantly improved the robustness, accuracy, and fairness of the models, making them more reliable for predicting wine quality across different quality classes.
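
A sketch of this combination for the Random Forest, using an imbalanced-learn Pipeline so that SMOTE is re-fitted inside each cross-validation fold (the parameter grid below is illustrative, not the notebook's exact search space):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                 # training folds only
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [None, 10, 20],
    "clf__min_samples_split": [2, 5],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```

Putting SMOTE inside the pipeline matters: resampling the full training set before cross-validation would let synthetic points derived from a validation fold contaminate its training folds, inflating scores.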

[Figure: GridSearchCV tuning results]

Results

The best model based on the average scores across the two modelling approaches is the RandomForestClassifier. The Random Forest Classifier consistently showed superior performance across both the original and balanced datasets, achieving the highest accuracy. The use of SMOTE generally improved the performance metrics for all models, highlighting its effectiveness in managing imbalanced data. KNeighbors Classifier and XGBoost also performed well, particularly after data balancing, indicating their robustness to class distribution changes.

[Figure: model performance on the original vs. SMOTE-balanced datasets]

From the selected features, we can rank the most influential qualities according to the best model (Random Forest Classifier); a sketch of how these scores are extracted follows the figure below:

Total Sulphur Dioxide: The most critical predictor of wine quality, total sulphur dioxide safeguards freshness and longevity, carrying the highest importance score of 0.24.
Volatile Acidity: Second in importance at 0.20, volatile acidity strongly affects taste; excessive amounts produce an undesirable vinegar flavour, degrading the overall quality perception.
Alcohol: With an importance score of 0.19, alcohol content shapes the body, texture, and palatability of wine, making it a pivotal factor in the model’s assessment of quality.
Sulphates: Ranking fourth (0.18), sulphates act as preservatives and antioxidants, maintaining stability and preventing spoilage, thereby influencing quality evaluations.
Citric Acid: Though lowest at 0.16, citric acid modulates the wine’s acidity, enhances freshness, and contributes to the flavour profile, thus impacting quality judgments.

[Figure: feature importance scores from the Random Forest model]
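
A sketch of extracting these scores from the tuned model above (feature names are assumed to come from the SelectKBest mask):

```python
# Pull the fitted Random Forest out of the best pipeline.
rf = grid.best_estimator_.named_steps["clf"]

# Pair each selected feature with its importance and sort descending.
feature_names = X.columns[selector.get_support()]
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```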

Discussion, Conclusions, and Future Work: The study confirms that advanced machine learning techniques can effectively predict red wine quality and that addressing data imbalance is crucial for improving model performance in multiclass classification scenarios. Future work could explore deeper feature engineering, the integration of unsupervised learning for anomaly detection in wine batches, and the deployment of models into real-time quality assessment systems in wineries.

Professional, Ethical, and Legal Issues

Accuracy and Reliability: The model must undergo rigorous testing and validation to ensure that it delivers reliable and consistent results across various scenarios, helping to guide winemakers’ decisions effectively.

Transparency and Explainability: It’s important for the model to be understandable to stakeholders, blending well with traditional practices in wine quality assessment and enhancing trust in machine learning solutions.
Continuous Improvement: Ongoing updates and tuning are essential to adapt to new data and changing conditions, ensuring the model remains relevant and effective.
Ethical Considerations:

Bias and Fairness: It is essential to address and reduce any biases in the training data to guarantee that the model’s evaluations are fair and unbiased.
Data Privacy: Adhering to strict data protection standards is essential, even if the initial dataset does not contain sensitive information, to protect future data enhancements.
Impact on Stakeholders: The model should be developed and deployed with an awareness of its potential impacts on all stakeholders, promoting fairness and avoiding harm.

Questions for Further Exploration:

Validation Methods: What specific validation strategies could be employed to assess the accuracy and reliability of the wine quality prediction model?
Explainability Techniques: Which techniques could be used to enhance the transparency and explainability of the model, particularly to stakeholders unfamiliar with machine learning?
Bias Identification: What methodologies can be implemented to detect and correct biases in the dataset effectively?

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J., 2009. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), pp.547–553.

Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, pp.321–357.

Han, H., Wang, W.Y. and Mao, B.H., 2005, August. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878–887). Berlin, Heidelberg: Springer Berlin Heidelberg.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.

Kuhn, M. and Johnson, K., 2013. Applied predictive modeling (Vol. 26, p. 13). New York: Springer.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, pp.2825–2830.

Hastie, T., Tibshirani, R. and Wainwright, M., 2015. Statistical learning with sparsity. Monographs on statistics and applied probability, 143(143), p.8.

Jain, K., Kaushik, K., Gupta, S.K., Mahajan, S. and Kadry, S., 2023. Machine learning-based predictive modelling for the enhancement of wine quality. Scientific Reports, 13(1), p.17042.

Di, S. and Yang, Y., 2022. Prediction of red wine quality using one-dimensional convolutional neural networks. arXiv preprint arXiv:2208.14008.
