Nida Maulida
This portfolio contains examples of case studies using machine learning, comparisons of the models used include Support Vector Regression, Decision Tree Regression, and Random Forest Regression.
A. SVR (Support Vector Regression)
SVR provides the flexibility to determine how much error is acceptable in the model and will find the appropriate line (or hyperplane in higher dimensions) to fit the data. In contrast to the usual Least Square, the objective function of the SVR is to minimize the coefficients — more specifically, the l2-norm coefficient vector — not the squared error.
B. DTR (Decision Tree Regression)
Decision Tree Regression (DTR) builds a regression model in the form of a tree structure. DTR breaks down the dataset into smaller and smaller subsets while at the same time the associated decision tree is developed incrementally. The end result is a tree with decision nodes and leaf nodes. With a given data point, the DTR is run completely through the entire tree by answering True/False questions until it reaches a leaf node. The final prediction is the average of the values of the dependent variable in a given leaf node. Through several iterations, the Tree is able to predict the exact value for the data point.
C. RFR (Random Forest Regression)
A Decision Tree is an algorithm that is easy to understand and interpret and therefore one tree may not be enough for the model to learn its features. On the other hand, Random Forest is also a “Tree” based algorithm that uses the quality features of multiple Decision Trees to make decisions. Therefore, it can be referred to as 'Forest' or 'Forest' of trees and hence called “Random Forest”. The term 'Random' or 'Random' is due to the fact that this algorithm is a forest of 'Decision Trees'. made at random'. The Decision Tree Algorithm has the main disadvantage of causing over-fitting. This problem can be overcome by applying Random Forest Regression (Random Forest Regression) as a substitute for DTR. In addition, the Random Forest algorithm is also very fast and powerful compared to other models. other regressions.
2. Dataset
The downloaded FuelConsumptionCo2.csv dataset contains specific models for fuel consumption ratings and estimates of carbon dioxide emissions for light vehicles sold in Canada.
Dataset Link (https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)
3. Pre-Processing
a. Data Exploration
b. Data Visualization
c. Creation of training and testing datasets
d. Normalizing or scaling the dataset
4. Modeling Data (Machine Learning)
2. Decision Tree Regression
3. Random Forest Regression
5. Model Evaluation
The actual value and the predicted value can be compared to calculate the accuracy of the regression model. Evaluation metrics are very important for model development because they provide knowledge for model improvement.
There are various metrics for model evaluation, for example MSE as an error to determine the accuracy of the built model calculated from the model MSE against the test data:
2. Decision Tree Regression
3. Random Forest Regression
6. Conclusion
From the evaluation results it is stated that the best model that can be used is the decision tree regression model.
Model | MAE | MSE | R2-SCORE |
Support Vector Regression | 23,31 | 947,03 | 0,69 |
Decision Tree Regression | 20,52 | 722,69 | 0,78 |
Random Forest Regression | 20,60 | 724,58 | 0,78 |