Fadhil Ahmad Hidayat
This study searches for the factors that affect wine quality. It uses classification methods (Support Vector Machines, K-NN, Logistic Regression, Softmax), evaluation metrics (Confusion Matrix, Accuracy, Precision, Specificity, F1 Score, ROC/AUC, Logarithmic Loss), validation and tuning techniques (Cross Validation, K-Fold Cross Validation, Grid Search), and SMOTE resampling.
Classification and Logistic Regression-Wine Quality
About Dataset
Data Set Information:
The dataset was downloaded from Kaggle.com
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For details, see the reference [Cortez et al., 2009]. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
The two datasets were combined and a few values were randomly removed.
Attribute Information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
The Aim of Analysis
This study searches for the factors that affect wine quality. It uses classification methods (Support Vector Machines, K-NN, Logistic Regression, Softmax), evaluation metrics (Confusion Matrix, Accuracy, Precision, Specificity, F1 Score, ROC/AUC, Logarithmic Loss), validation and tuning techniques (Cross Validation, K-Fold Cross Validation, Grid Search), and SMOTE resampling.
General Information of the Data
Type: The type of wine, either red or white.
Fixed acidity: Fixed acids include tartaric, malic, citric, and succinic acids, which (except succinic) are found in grapes.
Acids are one of the fundamental properties of wine and contribute greatly to its taste. Acidity in food and drink tastes tart and zesty. Tasting acidity is also sometimes confused with alcohol. Wines with higher acidity feel lighter-bodied because they come across as “spritzy”. Reducing acids significantly might lead to wines tasting flat. If you prefer a wine that is richer and rounder, you probably enjoy slightly less acidity.
Volatile acidity: These acids are to be distilled out from the wine before the production process is completed. Volatile acidity consists primarily of acetic acid, though other acids like lactic, formic and butyric acids might also be present. An excess of volatile acids is undesirable and leads to an unpleasant flavour.
Citric acid: This is one of the fixed acids which gives a wine its freshness. Usually most of it is consumed during the fermentation process and sometimes it is added separately to give the wine more freshness.
Residual sugar: This typically refers to the natural sugar from grapes which remains after the fermentation process stops, or is stopped.
Chlorides: Chloride concentration in the wine is influenced by terroir and its highest levels are found in wines coming from countries where irrigation is carried out using salty water or in areas with brackish terrains.
Free sulfur dioxide: This is the part of the sulfur dioxide that, when added to a wine, is said to be free after the remaining part binds. Winemakers will always try to get the highest proportion of free sulfur to bind. These compounds are also known as sulfites; too much of them is undesirable and gives a pungent odour.
Total sulfur dioxide: This is the sum of the bound and the free sulfur dioxide. It is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines, and an excess can even kill good yeast and give off an undesirable odour.
Density: This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol.
pH: Also known as the potential of hydrogen, this is a numeric scale that specifies the acidity or basicity of the wine. Fixed acidity contributes the most towards the pH of wines. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.
Sulphates: These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine's aroma and flavour.
Alcohol: It's usually measured in % vol or alcohol by volume (ABV).
Quality: Wine experts graded the wine quality between 0 (very bad) and 10 (excellent). The final quality score is the median of at least three evaluations made by these wine experts.
Data Exploration
Checking for NULL Values
Filling of the Row Data
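A minimal sketch of the null check and fill, run on a tiny hypothetical frame rather than the real data. Filling with the column mean is an assumption; the fill strategy actually used is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data, with a few missing values.
df = pd.DataFrame({
    "fixed acidity": [7.4, np.nan, 7.8, 6.3],
    "pH": [3.51, 3.20, np.nan, 3.30],
    "quality": [5, 5, 6, 7],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Fill the gaps in each numeric column with that column's mean (assumed strategy).
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled.isnull().sum().sum())
```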
Wine quality has its highest correlation with alcohol. Its correlations with the other variables, such as citric acid, free_sulfur_dioxide, sulphates and pH, are very low. Quality also has a low negative correlation with density, volatile acidity, chlorides, total_sulfur_dioxide and residual_sugar.
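Correlations like these can be read from a correlation matrix. The sketch below uses synthetic data, invented only to show the call pattern, in which quality is built to rise with alcohol and fall with density. The numbers it prints say nothing about the real wine data.

```python
import numpy as np
import pandas as pd

# Synthetic illustration data (not the wine dataset).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
density = rng.normal(0.995, 0.003, n)
quality = 0.8 * alcohol - 200 * density + rng.normal(0, 0.5, n)

df = pd.DataFrame({"alcohol": alcohol, "density": density, "quality": quality})

# Pearson correlation of every variable with quality, strongest first.
corr_with_quality = (
    df.corr()["quality"].drop("quality").sort_values(ascending=False)
)
print(corr_with_quality)
```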
Distribution of Variables
Creating 2 Bins Model of Two Types of Wine Quality Classes
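One way to build a two-class quality variable is pandas' `pd.cut`. The cut point below (scores up to 6 are "low", 7 and above are "high") and the column name `quality_class` are hypothetical; the bin edges actually used may differ.

```python
import pandas as pd

# Hypothetical quality scores standing in for the real column.
df = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8, 9]})

# Assumed bin edges: (0, 6] -> "low", (6, 10] -> "high".
bins = [0, 6, 10]
labels = ["low", "high"]

df_bins = df.copy()
df_bins["quality_class"] = pd.cut(df_bins["quality"], bins=bins, labels=labels)
print(df_bins["quality_class"].value_counts())
```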
Quality in Different Wine Types
As the chart shows, low quality wines make up the largest share of the data set for both red and white wine. High quality white and red wines account for only a small part of the data.
Quality & Alcohol Relation
Red and white wines have similar results on the chart. High quality wines are mostly red wines and have a higher alcohol level.
Quality & Volatile Acidity by Types
The volatile acidity level is low in both wine types, especially in white wine, while red wine reaches higher values, up to 1.70, in the low quality class. The alcohol level is again higher in red wine than in white wine in the low quality class. The high quality class has the highest alcohol level in both wine types.
Chlorides Level in Quality Classes
The chloride level is a bit higher in red wine than in white wine.
Fixed Acidity & Volatile Acidity & Citric Acid Density in Quality Classes
Residual Sugar Levels by Wine Quality Classes
Sulfur Dioxide Distribution in Wine Quality Classes
There are some extreme values in the low quality wine class. The total sulfur dioxide level rises in some low quality wines, while most of the distribution stays below a free sulfur dioxide level of 100.
pH Level in Wine Quality
Density by Wine Quality Classes
Sulphate Values in Wine Quality Classes
There are more low quality wines at sulphate levels between 0.4 and 0.6. Both quality classes have similar values.
Overview about Outliers
LOGISTIC REGRESSION CLASSIFIER
Creating Train / Test Groups with 2 Bins Model
In order to have all variables as numeric data, I mapped the wine types as follows, using the previous data frame 'df_bins':
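A sketch of such a mapping with `Series.map`. The column names and the particular codes (white = 0, red = 1; low = 0, high = 1) are assumptions for illustration, and the small frame below only stands in for 'df_bins'.

```python
import pandas as pd

# Hypothetical stand-in for the binned data frame.
df_bins = pd.DataFrame({
    "type": ["red", "white", "white"],
    "quality_class": ["low", "low", "high"],
})

# Assumed encoding: replace the text categories with integer codes.
df_bins["type"] = df_bins["type"].map({"white": 0, "red": 1})
df_bins["quality_class"] = df_bins["quality_class"].map({"low": 0, "high": 1})
print(df_bins)
```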
LogisticRegression
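A minimal sketch of the split-and-fit step with scikit-learn. The data comes from `make_classification` as a stand-in for the wine features, and the 80/20 split, stratification and `max_iter=1000` are assumptions rather than the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data standing in for the wine features.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Hold out 20% for testing, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
print(model.score(X_test, y_test))          # test accuracy
```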
Performance Measurements
It is better to check the FP and FN values in a further, deeper study that focuses on the false predictions, in order to improve accuracy and results.
A new data set can be created from the predictions together with the X_test and y_test data; then we can inspect the predicted values in this separate data set.
Accuracy
Error Rate
Precision (out of all the predicted positive instances, how many were predicted correctly) = TP / (TP + FP)
Recall (out of all the actual positive instances, how many were identified correctly) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1-Score: The F-Measure is computed from Precision and Recall and is sometimes used as a metric. It is the harmonic mean of Precision and Recall = (2 * Recall * Precision) / (Recall + Precision)
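The formulas above can be checked by hand from the four confusion matrix counts. The helper name `classification_metrics` and the counts passed to it are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics above from confusion matrix counts.

    Assumes no denominator is zero; a real implementation should guard that.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = (2 * recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Made-up counts for illustration only.
m = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 170 / 200 = 0.85 and precision is 40 / 50 = 0.8.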
ROC/AUC(Area Under Curve)
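A small sketch of how the ROC curve and its area can be computed with scikit-learn. The four labels and scores are a toy example, not model output from this study.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground truth and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 3 of the 4 positive/negative pairs
# are ranked correctly, so the value is 0.75.
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)
```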
PRECISION RECALL CURVE
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
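The precision and recall pairs along the curve can be obtained with scikit-learn's `precision_recall_curve`. The labels and scores below are a toy example chosen for illustration.

```python
from sklearn.metrics import precision_recall_curve

# Toy ground truth and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (precision, recall) pair per threshold, plus a final
# (precision=1, recall=0) point that has no threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

for p, r in zip(precision, recall):
    print(f"precision={p:.3f} recall={r:.3f}")
```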
Log Loss (measures the gap between the ground truth and the predicted probability for every observation and averages those errors over all observations)
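For the two-class case, the averaging described above can be written out directly. The function name `binary_log_loss`, the clipping constant and the sample values are illustrative assumptions.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels.

    y_prob holds the predicted probability of class 1. Probabilities are
    clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and probabilities for illustration.
loss = binary_log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
print(round(loss, 4))
```

Confident correct predictions add little to the loss, while confident wrong ones are penalized heavily.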
General Looking at Results
Imbalanced Data
To see how the logistic regression model changes, I would also like to try resampling the imbalanced data. In the previous steps I binned the quality variable into low and high ranges; this section shows the results obtained with a resampling method.
Splitting the data into two parts starting from four yields imbalanced data.
Resampling Imbalance Data
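As a rough sketch of resampling, the block below randomly oversamples the minority class with `sklearn.utils.resample`. This is a simpler stand-in, not SMOTE: SMOTE (available in the separate imbalanced-learn package) synthesizes new minority samples instead of repeating existing rows. The frame and column names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 8 low quality (0) rows, 2 high quality (1) rows.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.5, 10.0, 9.2, 10.1, 9.9, 9.6, 12.5, 12.8],
    "quality_class": [0] * 8 + [1] * 2,
})

majority = df[df["quality_class"] == 0]
minority = df[df["quality_class"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

df_balanced = pd.concat([majority, minority_up])
print(df_balanced["quality_class"].value_counts())
```

Resampling should be applied to the training part only, so that the test set keeps the original class distribution.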
Cross Validation with 2 Bins Model
We split the y values equally and trained our model. However, to see how the X values are distributed across the splits, we need the following cross validation measurement.
K-Fold Cross Validation
We split our data into 5 folds and trained the model on them with the KFold method. In the following section, the cross_validate and cross_val_score tools do all of this by themselves.
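A minimal sketch of the 5-fold loop with scikit-learn's `KFold`. The data is synthetic, and shuffling with a fixed `random_state` is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for the wine features and quality classes.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, score on the remaining one.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)
print(np.mean(fold_scores))
```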
Cross Validation Score & Cross Validate
The average accuracy score is calculated from 10 different accuracy scores from the model.
We still get similar accuracy scores (.96-.97) to those of the methods applied previously.
The cross_val_score and cross_validate functions report scores only on the held-out test folds. To obtain the model's predictions, we can also use the cross_val_predict function.
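The three helpers mentioned above can be called as in the sketch below. The data is synthetic, `cv=10` mirrors the 10 accuracy scores mentioned in the text, and the chosen scorers are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    cross_val_predict, cross_val_score, cross_validate
)

# Synthetic data standing in for the wine features and quality classes.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=42
)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold; their mean is the average accuracy.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean())

# Several metrics at once, plus fit and score times per fold.
results = cross_validate(model, X, y, cv=10, scoring=["accuracy", "f1"])
print(results["test_f1"].mean())

# Out-of-fold prediction for every sample.
preds = cross_val_predict(model, X, y, cv=10)
print(preds[:10])
```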
Hyperparameter Tuning
Apart from choosing an appropriate function for our model, choosing suitable parameters is also important for accurate predictions. I will use Grid Search and Random Search for this purpose.
To find the available parameters, I used the get_params() function.
Grid Search
The 10 most successful parameter combinations on a chart.
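A sketch of the grid search step with `GridSearchCV`, followed by the ten best-ranked combinations from `cv_results_`. The parameter grid (`C` and `fit_intercept`) and the synthetic data are assumptions; the grid used in this study is not shown.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the wine features and quality classes.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=42
)
model = LogisticRegression(max_iter=1000)

# get_params() lists every parameter that can be tuned.
print(sorted(model.get_params().keys()))

# Hypothetical grid: 5 x 2 = 10 combinations, each tried with 5-fold CV.
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "fit_intercept": [True, False]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

# Rank the combinations by mean test score and keep the best ten.
results = pd.DataFrame(grid.cv_results_)
top10 = results.sort_values("rank_test_score").head(10)[
    ["params", "mean_test_score"]
]
print(grid.best_params_)
print(top10)
```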
RandomizedSearchCV
While Grid Search checks all combinations of our parameters, RandomizedSearchCV lets us try only a desired number of parameter combinations.
I will try 10 combinations by setting the 'n_iter' parameter.
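A sketch of the randomized search with `n_iter=10`. The search space (a log-uniform range for `C` plus `fit_intercept`) and the synthetic data are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the wine features and quality classes.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=42
)
model = LogisticRegression(max_iter=1000)

# Hypothetical search space: C is sampled from a log-uniform range.
param_distributions = {
    "C": loguniform(1e-3, 1e2),
    "fit_intercept": [True, False],
}

# n_iter=10 draws exactly 10 parameter combinations.
search = RandomizedSearchCV(
    model, param_distributions, n_iter=10, cv=5,
    scoring="accuracy", random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```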
Conclusion
Fadhil Ahmad Hidayat - Studi Independen Batch 3
AI Infra - AIBIZ Certification Assignment