Classification and Logistic Regression-Wine Quality

Fadhil Ahmad Hidayat

Summary

This study searches for the factors that affect wine quality, using classification methods such as Support Vector Machines, K-NN, Logistic Regression and Softmax. The models are assessed with the Confusion Matrix, Accuracy, Precision, Specificity, F1 Score, ROC/AUC and Logarithmic Loss, and supported by Cross Validation, K-Fold Cross Validation, Grid Search and SMOTE.

Description

Classification and Logistic Regression-Wine Quality

 

About Dataset

Data Set Information:

The dataset was downloaded from Kaggle.com.

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For details, see [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

The two datasets were combined and a few values were randomly removed.

Attribute Information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

 

The Aim of Analysis

This study searches for the factors that affect wine quality, using classification methods such as Support Vector Machines, K-NN, Logistic Regression and Softmax. The models are assessed with the Confusion Matrix, Accuracy, Precision, Specificity, F1 Score, ROC/AUC and Logarithmic Loss, and supported by Cross Validation, K-Fold Cross Validation, Grid Search and SMOTE.

 

General Information of the Data

Type: The wine type, either red wine or white wine.

Fixed acidity: Fixed acids include tartaric, malic, citric, and succinic acids, which (except for succinic acid) are found in grapes.

Acids are one of the fundamental properties of wine and contribute greatly to its taste. Acidity in food and drink tastes tart and zesty. Tasting acidity is also sometimes confused with alcohol. Wines with higher acidity feel lighter-bodied because they come across as “spritzy”. Reducing acids significantly might lead to wines tasting flat. If you prefer a wine that is richer and rounder, you probably enjoy slightly less acidity.

Volatile acidity: These acids are to be distilled out from the wine before completing the production process. It is primarily constituted of acetic acid, though other acids like lactic, formic and butyric acids might also be present. Excess volatile acids are undesirable and lead to an unpleasant flavour.

Citric acid: This is one of the fixed acids which gives a wine its freshness. Usually most of it is consumed during the fermentation process and sometimes it is added separately to give the wine more freshness.

Residual sugar: This typically refers to the natural sugar from grapes which remains after the fermentation process stops, or is stopped.

Chlorides: Chloride concentration in the wine is influenced by terroir and its highest levels are found in wines coming from countries where irrigation is carried out using salty water or in areas with brackish terrains.

Free sulfur dioxide: This is the part of the sulfur dioxide added to a wine that remains free after the rest binds. Winemakers will always try to get the highest proportion of free sulfur to bind. These compounds are also known as sulfites; too much of them is undesirable and gives a pungent odour.

Total sulfur dioxide: This is the sum total of the bound and the free sulfur dioxide. This is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines, and an excess can even kill good yeast and give off an undesirable odour.

Density: This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol.

pH: Also known as the potential of hydrogen, this is a numeric scale that specifies the acidity or basicity of the wine. Fixed acidity contributes the most towards the pH of wines. As you may know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.

Sulphates: These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine's aroma and flavour.

Alcohol: It's usually measured in % vol or alcohol by volume (ABV).

Quality: Wine experts graded the wine quality between 0 (very bad) and 10 (excellent). The final quality score is the median of at least three evaluations made by the same wine experts.

 

Data Exploration

 

Checking for NULL Values

 

Filling in the Raw Data
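The NULL check and the fill step can be sketched roughly as follows, assuming the data sits in a pandas data frame. The tiny frame and its values are made up for illustration; filling with the column mean follows the choice described in the conclusion of this study.

```python
import pandas as pd

# Hypothetical miniature frame; the real data set has the 12 variables listed above plus 'type'.
df = pd.DataFrame({
    "type": ["red", "white", "white", "red"],
    "pH": [3.2, None, 3.0, 3.4],
    "alcohol": [9.4, 10.0, None, 11.0],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Fill the gaps in every numeric column with that column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```

Dropping the incomplete rows with `df.dropna()` would be the alternative when only a small share of values is missing.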

Wine quality has its highest correlation with alcohol. Its correlations with the other variables, such as citric acid, free_sulfur_dioxide, sulphates and pH, are very low. Quality also has a low negative correlation with density, volatile acidity, chlorides, total_sulfur_dioxide and residual_sugar.
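A correlation ranking of this kind could be produced along these lines. This is only a sketch: the numbers are invented to show the call and are not taken from the data set.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the call.
df = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5, 13.0],
    "density": [0.998, 0.997, 0.996, 0.995, 0.992, 0.991],
    "quality": [4, 5, 5, 6, 7, 8],
})

# Pearson correlation of every column with 'quality', strongest positive first.
corr_with_quality = (
    df.corr()["quality"].drop("quality").sort_values(ascending=False)
)
print(corr_with_quality)
```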

Distribution of Variables

Creating 2 Bins Model of Two Types of Wine Quality Classes

Quality in Different Wine Types

As the chart shows, low-quality wines are the most numerous in the data set, for red wine as well as for white wine. High-quality white and red wines make up only a small part of the data.

Quality & Alcohol Relation

Red and white wines show similar results on the chart. High-quality wines are mostly red wines and have a higher alcohol level.

Quality & Volatile Acidity by Types

The fixed acidity level is low in both wine types, especially in white wine, while red wine reaches higher values, up to 1.70, in the low-quality class. The fixed alcohol level is again higher in red wine than in white wine in the low-quality class. The high-quality class has the highest fixed alcohol level in both wine types.

 

Chlorides Level in Quality Classes

The chloride level is slightly higher in red wine than in white wine.

 

Fixed Acidity & Volatile Acidity & Citric Acid Density in Quality Classes

Residual Sugar Levels by Wine Quality Classes

Sulfur Dioxide Distribution in Wine Quality Classes

There are some extreme values in the low-quality wine class. The total sulfur dioxide level rises in some low-quality wines, while the general distribution stays below a free sulfur dioxide level of about 100.

pH Level in Wine Quality

Density by Wine Quality Classes

Sulphate Values in Wine Quality Classes

More low-quality wines fall between sulphate levels of 0.4 and 0.6. Both quality classes have similar values.

Overview about Outliers

 

LOGISTIC REGRESSION CLASSIFIER

Creating Train / Test Groups with 2 Bins Model

To make all variables numeric, I mapped the wine types to numeric values, using the previous data frame 'df_bins'.
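The binning and the type mapping might look roughly like this. The sketch assumes pandas, uses the 0-5 / 5-10 split named in the conclusion, and picks an arbitrary 0/1 encoding for the wine type; the column name 'quality_class' and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the combined red and white data.
df = pd.DataFrame({
    "type": ["red", "white", "white", "red", "white"],
    "quality": [5, 6, 7, 3, 8],
})

# Two bins: scores up to 5 are 'low', scores above 5 are 'high'.
df_bins = df.copy()
df_bins["quality_class"] = pd.cut(
    df_bins["quality"], bins=[0, 5, 10], labels=["low", "high"], include_lowest=True
)

# Encode the wine type as a number (which type gets 0 or 1 is an assumption).
df_bins["type"] = df_bins["type"].map({"white": 0, "red": 1})
print(df_bins)
```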

LogisticRegression

Performance Measurements

It is worth examining the FP and FN values in a further, deeper study that focuses on the false predictions, in order to improve accuracy and results.

A new data set can be created from the predictions together with the X_test and y_test data; we can then inspect the predicted values in this separate data set.

Accuracy

Error Rate

Precision: Out of all the predicted positive instances, how many were predicted correctly = TP / (TP + FP)

Recall: Out of all the positive classes, how many instances were identified correctly = TP / (TP + FN)

Specificity: TN / (TN + FP)

F1-Score: The F-Measure is computed from Precision and Recall and is sometimes used as a metric. It is the harmonic mean of Precision and Recall = (2 * Recall * Precision) / (Recall + Precision)
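The formulas above can be checked with a few lines of plain Python. The function name and the four counts below are hypothetical and only illustrate the arithmetic; they are not results from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the metrics above from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = (2 * recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for illustration.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```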

ROC/AUC(Area Under Curve)

PRECISION RECALL CURVE

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

Log Loss: a penalty based on the logarithm of the probability that the model assigns to the true class, computed for every observation and averaged over all observations.
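For the two-class case, log loss can be written out in plain Python as a sketch. The helper name, the labels and the probabilities below are made up; the clipping constant is an assumption that only keeps the logarithm finite.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for 0/1 labels and predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities.
loss = binary_log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
print(round(loss, 4))
```

Confident and correct probabilities push the value towards 0, while confident and wrong ones are penalised heavily.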

General Looking at Results

Imbalanced Data

To see how the logistic regression model changes, I would also like to check the effect of resampling the imbalanced data. In the previous steps I binned the quality variable into low and high ranges; this section shows the results obtained with a resampling method.

Splitting the data into two parts at a quality score of four gives imbalanced data.

Resampling Imbalance Data
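The exact resampling routine used here is not shown, so the following is only a sketch of one simple option, random oversampling of the minority class, written in plain Python. It is a stand-in for illustration and not SMOTE, which creates synthetic samples instead of duplicating existing ones. The function name and the sample rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class is equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            new_rows.append(rng.choice(pool))
            new_labels.append(cls)
    return new_rows, new_labels

# Hypothetical one-feature rows: four 'low' wines and two 'high' wines.
rows = [[9.4], [9.8], [10.1], [10.5], [12.8], [13.1]]
labels = ["low", "low", "low", "low", "high", "high"]
X_res, y_res = random_oversample(rows, labels)
print(Counter(y_res))
```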

Cross Validation with 2 Bins Model

We split the y values equally and trained our model. However, to see the distribution of the X values we need the following cross-validation measurement.

 

K-Fold Cross Validation

We split our data into 5 folds and trained the model on them with the KFold method. In the following section, the cross_validate and cross_val_score tools do all of this automatically.
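A manual 5-fold loop of this kind might look as follows with scikit-learn. The data here is synthetic (`make_classification`), standing in for the wine features and the two quality classes, and the `max_iter` value is an assumption chosen only so the solver converges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the 11 wine features and the binary quality class.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, score on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)
```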

Cross Validation Score & Cross Validate

The average accuracy score is calculated from the 10 different accuracy scores produced by the model.

We still get accuracy scores (.96-.97) similar to those from the methods applied previously.

The cross_val_score and cross_validate functions report scores only for the test folds. To obtain the model's predictions, we can also use the cross_val_predict function.
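The three helpers can be compared side by side in a short sketch. Synthetic data again replaces the wine data, and the choice of 10 folds and of the accuracy and recall scorers is an assumption based on the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate

# Synthetic stand-in for the wine features and the binary quality class.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold; the mean summarises them.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean())

# Several metrics at once, returned as a dict of arrays.
results = cross_validate(model, X, y, cv=10, scoring=["accuracy", "recall"])
print(results["test_accuracy"].mean(), results["test_recall"].mean())

# Out-of-fold prediction for every sample.
y_pred = cross_val_predict(model, X, y, cv=10)
print(y_pred[:10])
```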

 

Hyperparameter Tuning

Apart from choosing an appropriate function for our model, choosing suitable parameters is also important for accurate predictions. I will use Grid Search and Random Search for this purpose.

To see which parameters are available, I used the get_params() function.

Grid Search

 

The 10 most successful parameter combinations on a chart.
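A grid search over a logistic regression could be sketched as below. The parameter grid is hypothetical, since the grid actually used in this study is not shown, and synthetic data stands in for the wine data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the wine features and the binary quality class.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
model = LogisticRegression(max_iter=1000)

# List the tunable parameter names.
print(sorted(model.get_params()))

# Hypothetical grid: 5 values of C times 2 class_weight settings = 10 combinations.
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "class_weight": [None, "balanced"]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

# Rank the combinations and keep the ten best.
results = pd.DataFrame(grid.cv_results_)
top10 = results.sort_values("rank_test_score").head(10)
print(top10[["params", "mean_test_score"]])
print(grid.best_params_)
```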

 

RandomizedSearchCV

While Grid Search checks every combination of our parameters, RandomizedSearchCV lets us try only a desired number of parameter combinations.

I will try 10 combinations by setting the 'n_iter' parameter.
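A randomized search limited to 10 sampled combinations might look like this. The search space is an assumption for illustration (a log-uniform range for C from SciPy plus two class_weight settings), and the data is synthetic.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the wine features and the binary quality class.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Hypothetical search space: C is sampled from a continuous range.
param_distributions = {
    "C": loguniform(1e-3, 1e2),
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```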

 

Conclusion

  • At the beginning of this study, I checked the general characteristics of the data set. The data has some NULL values. Although dropping the missing values is an option because their percentage is low, I preferred to fill them with the mean of the data.
  • The data set shows that red wine is rich in high-quality wines, and that quality has a high correlation with alcohol.
  • I also looked at the quality levels for each variable, using suitable charts for a general understanding.
  • In the following sections, I explored 2 different types of models with different bins. Behind this study I created many models in pursuit of better accuracy and recall scores. This study only shows the best model, with good scores and predictions.
  • The first model used 2 bins with all variables, with quality ranges of 0-5 and 5-10. This model gives an accuracy score of 0.74 on the train and test samples.
  • The df_bins3 data frame was split into 3 bins to check accuracy levels. The first set of bins used the ranges 0-5, 5-6 and 6-10. This model gives a score of 0.58.
  • On the other hand, when the bins are arranged as 0-4, 4-7 and 7-10, the score reached 0.93. I continued with this model in the further steps on other performance measurements.
  • A general note: these results are for imbalanced data, so I wanted to see the scores after balancing the data set. For this reason, and to see how the logistic regression model changes, I resampled the imbalanced data. In the previous steps I binned the quality variable into low and high ranges; the resampling section shows the results obtained with a resampling method.
  • However, the combination of very high scores and a negative R square shows that the data set ultimately needs another approach. For further studies, a more suitable data set can be chosen.
  • A quick note on resampling the data: splitting the data into 2 parts with the first bin from 0 to 5 gives a balanced distribution. However, when I split the data with the first bin from 0 to 4, I got imbalanced data.
  • After resampling the data, I needed to convert the y values to array format to obtain the Cross Validation scores.
  • Generally, our measurements and model scores worked well for the aim of the study. This study could also be completed in a shorter way, without repeating similar functions; however, it also aims to use different methods to obtain accurate scores from various sources.
  • I focused on classification methods in this study. However, I agree that other algorithms, such as Random Forest and Boosting algorithms, can be more successful and give better results. I will use these methods in my next kernel.

Fadhil Ahmad Hidayat - Independent Study (Studi Independen) Batch 3

AI Infra - AIBIZ Certification Assignment

Related Course Information
  Category: Data Science / Big Data
  Course: Infrastuktur Kecerdasan Artifisial (SIB AI-INFRA)