Bagus Tri Yulianto Darmawan
This portfolio is the final project in preparation for the international artificial intelligence certification (AIBIZ).
This project discusses the classification of diabetes data with machine learning models.
Diabetes is a chronic disease that occurs when the pancreas can no longer make insulin or the body cannot make good use of the insulin it produces. Machine learning can help us predict diabetes.
We will work through the flow below. Keep in mind that every machine learning case and task is different, so not every task follows a workflow like this one.
Preparation
The first thing to do is download the dataset. You can get it from Kaggle by clicking this link: Pima Indians Diabetes Database | Kaggle. After downloading it, create a folder in your Google Drive and upload the dataset to that folder. Then create a Jupyter notebook (ipynb) file in the same folder. If you don't have Google Colaboratory installed yet, follow these steps:
Click “+ New”
Click More > Connect more apps
Search for “google colab”
Click Install. You can then create the Jupyter notebook file.
Preprocessing
Since we are using Google Colaboratory, you must first connect (mount) your Google Drive to your ipynb file.
The location of your dataset file will be different, but the “/content/drive/My Drive…” part is always the same. Change the rest to match your own file location.
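A minimal sketch of the mounting step and of building the dataset path. The folder name `diabetes-project` and the file name `diabetes.csv` are placeholders, not names taken from this post. The `google.colab` import only works inside Colab, so it is guarded here.

```python
import posixpath

# Prefix that stays the same once Google Drive is mounted.
DRIVE_ROOT = "/content/drive/My Drive"


def dataset_path(relative_path, drive_root=DRIVE_ROOT):
    """Join the fixed Drive prefix with your own folder and file name."""
    return posixpath.join(drive_root, relative_path)


try:
    # This module is only available inside Google Colaboratory.
    from google.colab import drive

    drive.mount("/content/drive")
except ImportError:
    print("Not running in Colab, skipping the Drive mount.")

# Hypothetical folder and file name: replace them with your own location.
csv_path = dataset_path("diabetes-project/diabetes.csv")
print(csv_path)
# Inside Colab you would then load the file with pandas.read_csv(csv_path).
```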
By doing this, you can see the shape (the size) of the dataset. For example, the picture below shows that this dataset has 768 rows and 9 columns.
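As a small illustration of reading the shape, here is a sketch on a toy DataFrame. The rows are illustrative stand-ins, so the printed size is not the 768 by 9 of the real dataset.

```python
import pandas as pd

# Toy stand-in for the diabetes table (illustrative values only).
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# shape is a (rows, columns) tuple.
rows, columns = df.shape
print(f"{rows} rows, {columns} columns")
```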
You can check whether your dataset has null values by typing "df.isnull().sum()". The isnull() function marks every null value, and sum() then counts them per column instead of showing the data itself.
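To make the two steps visible, the sketch below splits the call into its parts on a toy DataFrame with two missing values planted on purpose. The data is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy table with two deliberately missing values.
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, np.nan],
    "Outcome": [1, 0, 1],
})

null_mask = df.isnull()        # True wherever a value is missing
null_counts = null_mask.sum()  # True counts as 1, so this totals per column
print(null_counts)
```

A column that reports 0 has no missing values.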
Based on the picture below, 268 rows are labeled as diabetes and 500 rows as not diabetes.
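Class counts like these are commonly obtained with `value_counts()` on the label column. That this post used the same call is an assumption, and the labels below are toy values rather than the real 500 and 268.

```python
import pandas as pd

# Toy label column: 0 = not diabetes, 1 = diabetes.
outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

# Count how many rows fall into each class.
counts = outcome.value_counts()
print(counts)
```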
In the picture below, we visualize the values of each column.
As you can see, the columns with the lowest correlation with Outcome are Blood Pressure, Skin Thickness, and Insulin, so we remove those columns by typing this code.
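A hedged sketch of how the correlation check and the column removal might look. The column spellings `BloodPressure`, `SkinThickness`, and `Insulin` are assumed to match the Kaggle CSV headers. The values are toy data, so the correlations printed here do not reproduce the ones in this post.

```python
import pandas as pd

# Toy stand-in for the dataset (illustrative values only).
df = pd.DataFrame({
    "Glucose": [90, 100, 150, 160, 95, 170],
    "BloodPressure": [70, 72, 71, 69, 73, 70],
    "SkinThickness": [20, 35, 22, 30, 25, 28],
    "Insulin": [80, 0, 90, 0, 100, 85],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Correlation of every feature with the label, weakest first.
corr_with_outcome = df.corr()["Outcome"].drop("Outcome")
print(corr_with_outcome.abs().sort_values())

# Remove the three columns named in the text.
df_reduced = df.drop(columns=["BloodPressure", "SkinThickness", "Insulin"])
print(df_reduced.columns.tolist())
```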
Splitting Dataset
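One common way to split a dataset is scikit-learn's `train_test_split`. The sketch below runs it on toy arrays. The 80/20 ratio, the `random_state`, and the stratified split are assumptions, not settings confirmed by this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features (20 rows, 2 columns) and balanced toy labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of the rows for testing; stratify keeps the class ratio
# the same in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```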
Implement the Models
For this model, we will not only implement it but also tune its hyper-parameters, which lets the model reach its best performance. Tune the hyper-parameters by typing the code below.
After that, we fit the model to the data by typing the code below.
This will take a bit of time. When the process is complete, a green check mark should appear, and that means our model is trained. All that is left is to test it and see the score. To test the model, you can type this code:
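The tuning, fitting, and testing steps described above can be sketched together with scikit-learn's `GridSearchCV`. This post does not name the first model or its grid, so `KNeighborsClassifier`, the parameter values, and the synthetic data are all stand-ins chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the diabetes features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumed model and grid: neither is named in the post.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")

# Fit: every combination is scored with 5-fold cross-validation, then the
# best one is refit on the whole training set.
search.fit(X_train, y_train)

# Test: predict the held-out rows with the best model found.
y_pred = search.predict(X_test)

print(search.best_params_)
print(search.score(X_test, y_test))
```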
Then we can see the score for the test we just ran.
As you can see in the accuracy row (its value appears under the f1-score column), we have 78% accuracy.
The picture above is the confusion matrix. You read it by row and column: the rows are the actual labels and the columns are the predicted labels. Row 0, column 0 holds 83, so the trained model correctly predicted 83 rows whose actual label is 0 (not diabetes). Row 1, column 0 holds 18, so the model predicted 0 (not diabetes) for 18 rows whose actual label is 1 (diabetes), and so on.
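To show where these numbers come from, here is a small sketch using scikit-learn's `classification_report` and `confusion_matrix`. The two label lists are toy values, so the results differ from the 78% and the 83 and 18 reported above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy actual labels and toy predictions (0 = not diabetes, 1 = diabetes).
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Text report: the accuracy row prints its value under the f1-score column.
print(classification_report(y_test, y_pred))

# The same report as a dictionary, to read the accuracy directly.
report = classification_report(y_test, y_pred, output_dict=True)
accuracy = report["accuracy"]

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
true_negative, false_positive = cm[0]  # actual 0: predicted 0, predicted 1
false_negative, true_positive = cm[1]  # actual 1: predicted 0, predicted 1
print(cm)
```

Here `cm[1][0]` plays the role of the 18 in the text: rows that are actually positive but were predicted as 0.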
For the Naive Bayes model, we will also tune the hyper-parameters, and the process is not much different from the previous model.
Then we fit and test the model.
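A possible version of the Naive Bayes steps, assuming the Gaussian variant (`GaussianNB`), which this post does not confirm. `var_smoothing` is its main tunable parameter. The grid values and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the diabetes features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumed grid: nine var_smoothing values from 1e-9 to 1e-1.
param_grid = {"var_smoothing": np.logspace(-9, -1, 9)}
nb_search = GridSearchCV(GaussianNB(), param_grid, cv=5)

nb_search.fit(X_train, y_train)      # tune and fit
nb_pred = nb_search.predict(X_test)  # test on the held-out rows

print(nb_search.best_params_)
print(nb_search.score(X_test, y_test))
```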
After the process is complete, we look at the score from the test.
As you can see, this time with the Naive Bayes model we get 72% accuracy, with a confusion matrix like the picture below.
For this model, we also tune the hyper-parameters.
After the hyper-parameters have been tuned, we can fit the model and test it.
Then we can see the score and the confusion matrix.
As you can see, for this model we get 72% accuracy, with a confusion matrix like the picture below.