MAHADIKA DAFFA WAHYUDI
**This Data Classification Portfolio is required for DSBIZ Certification**
Background
Smoking is a major cause of disease. It affects not only active smokers but also the people around them, known as passive smokers, who are likely to suffer the same disease effects as active smokers. If even 1% of the human population became passive smokers, the existing number of lung specialists would not be able to handle the resulting cases. This is a problem that must be addressed. With an expert system, users can obtain an initial diagnosis of their symptoms together with a suggested treatment. In this study, the expert system uses the certainty factor method, which expresses how certain a fact is. Calculations are based on the value of an expert's belief in the symptoms of a disease. The resulting expert system is named Diagperosif; it diagnoses a disease from the symptoms entered by the user. The diseases that Diagperosif can diagnose are asthma, bronchitis, COPD, and lung cancer.
THE DEATH RATE DUE TO CIGARETTES INCREASED IN 2018
The World Health Organization (WHO) reports that the death rate due to smoking has reached 30%, equivalent to 17.3 million people. The number of deaths is estimated to keep rising, reaching 23.3 million people by 2030. Smoking increases the risk of cardiovascular disease, which affects many people in low-income countries. In Indonesia, cardiovascular disease reaches 80% and ranks as the deadliest disease.
In 2015, WHO published research showing that more than 3.9 million children aged 10 to 14 were active smokers, and that 239,000 children under the age of 10 had smoked for the first time. In addition, 40 million children under the age of 5 are passive smokers.
WHO also notes that passive smokers face a 20-30% increased risk of lung cancer and a 25-35% increased risk of heart disease.
Premature deaths due to smoking number almost 5.4 million worldwide. If awareness of the dangers of smoking does not grow, it is predicted that 10 million smokers will die by 2025.
BENEFIT
LOAD KAGGLE API
First, we load the kaggle.json API token file. This file is obtained from the Kaggle account and then uploaded to the "content" folder of the Google Colab directory.
Next, we copy the Kaggle API command for the dataset we want to download. Once downloaded, the dataset is stored in the "content" folder as a zip file.
We then extract the zip file. The extraction results are saved to the "tmp" folder.
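The extraction step can be sketched with Python's standard zipfile module. This is a minimal sketch, not the notebook's own code: the helper name extract_archive and the archive file name in the example call are hypothetical, since the actual dataset file name is not given here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target)
        return archive.namelist()


# Example call in Colab (hypothetical archive name):
# extract_archive("/content/dataset.zip", "/tmp")
```

Returning the member names makes it easy to confirm that the expected CSV file is present before reading it.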
LOAD LIBRARY
Before running data preprocessing or any other data science stage, we load the libraries once so that the imports do not have to be repeated. We use NumPy for numerical computation, pandas for manipulating and analyzing data, and Matplotlib for making graphs.
We start by calling the pandas read function to read and view the predic_table.csv file.
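Reading the file could look like the sketch below. To keep the example self-contained it reads a two-row in-memory sample instead of the real file. Only the column names Usia, Jenis_Kelamin and Result come from the text; the row values and the /tmp path in the comment are assumptions.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for predic_table.csv.
sample_csv = io.StringIO(
    "Usia,Jenis_Kelamin,Result\n"
    "Muda,Pria,asthma\n"
    "Tua,Wanita,bronchitis\n"
)

# With the real file this would be something like:
# df = pd.read_csv("/tmp/predic_table.csv")   # path is an assumption
df = pd.read_csv(sample_csv)

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```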
ENCODING CATEGORICAL DATA
We encode the features whose values are unordered categories, such as Usia (age) and Jenis_Kelamin (gender). Because both are categorical, they must be encoded as the numbers 1 and 0, which act only as labels and carry no numerical meaning. During encoding the two original features are removed and replaced by new columns, one for each category they contained.
The resulting data frame has 4 additional features: Muda (young) and Tua (old) from the Usia feature, and Pria (male) and Wanita (female) from the Jenis_Kelamin feature. The rows of these 4 new features contain the values 1 and 0.
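The encoding described above is one-hot encoding. A minimal sketch with pandas follows, assuming get_dummies is an acceptable stand-in for whatever encoder the notebook used; the three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the category values follow the text (Muda/Tua, Pria/Wanita).
df = pd.DataFrame({
    "Usia": ["Muda", "Tua", "Tua"],
    "Jenis_Kelamin": ["Pria", "Wanita", "Pria"],
})

# One-hot encode: the original columns are dropped and replaced by one
# 0/1 column per category. An empty prefix keeps the bare category names.
encoded = pd.get_dummies(
    df,
    columns=["Usia", "Jenis_Kelamin"],
    prefix="",
    prefix_sep="",
    dtype=int,
)

print(encoded)
```

Each row has exactly one 1 among Muda/Tua and one 1 among Pria/Wanita, so no ordering between categories is implied.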
We use df.info() to see the data type of each feature.
The data types show that the features are still of type object. To make data processing easier, we convert them to boolean or float.
We then rename the Result feature to Target, which makes the class column easier to recognize.
After all categorical features have been encoded, we apply a Label Encoder to the Target feature.
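The three steps above (type conversion, renaming, label encoding) can be sketched as follows. The column symptom_a and the class labels are hypothetical; only the names Result and Target come from the text.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: one feature stored as strings (object dtype)
# and a Result column with made-up class labels.
df = pd.DataFrame({
    "symptom_a": ["1", "0", "1"],
    "Result": ["asthma", "bronchitis", "asthma"],
})

df["symptom_a"] = df["symptom_a"].astype(float)   # object -> float
df = df.rename(columns={"Result": "Target"})      # Result -> Target

encoder = LabelEncoder()
df["Target"] = encoder.fit_transform(df["Target"])  # class names -> integers

print(df.dtypes)
# Mapping from class name to integer code (classes are sorted alphabetically).
print({name: code for code, name in enumerate(encoder.classes_)})
```

Keeping the fitted encoder around allows predictions to be mapped back to class names later with encoder.inverse_transform.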
Handling Missing Value
Because the data is entered by humans, some values may be NaN or null. Such entries are called missing values, and they need to be fixed. We use the df.fillna() method to fill in the missing values based on the mean and the nearest row.
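A minimal sketch of both filling strategies is given below. The column names are hypothetical, and reading "nearest row" as a forward fill (copying the value from the row above) is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with gaps.
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],   # numeric: fill with the column mean
    "flag": [1.0, np.nan, 0.0],    # binary: copy the value from the row above
})

df["score"] = df["score"].fillna(df["score"].mean())
df["flag"] = df["flag"].ffill()    # forward fill, one reading of "nearest row"

print(df)
print(df.isna().sum())             # remaining missing values per column
```

Mean filling suits continuous features, while copying a neighbouring value keeps a 0/1 feature binary.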
After handling the missing values, we plot the Target feature as a bar graph with sns.countplot to compare how often each of its values occurs.
For more detail, we can look at the distribution of the values of every feature using histograms. df.hist() displays a histogram for each column of the data frame.
Data Correlation
Data correlation is important because it lets us measure the relationship between the class, or any variable X, and the other variables. We then visualize the correlations between the X and Y variables in the data frame.
The correlation chart shows that some features have a strong positive correlation and others a weak negative correlation. To simplify the model, we remove the strongly positively correlated features and use the features with weak negative correlation as input for classification.
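The selection step can be sketched as below. The data, the 0.9 threshold, and the rule of dropping the second feature of each strongly correlated pair are all assumptions made for illustration; the text does not state the cut-off that was used.

```python
import pandas as pd

# Hypothetical numeric features; "b" is an exact multiple of "a".
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "Target": [0, 1, 0, 1, 0],
})

corr = df.corr()          # Pearson correlation between every pair of columns
print(corr.round(2))

threshold = 0.9           # assumed cut-off for a "strong" positive correlation
features = [col for col in df.columns if col != "Target"]
to_drop = []
for i, first in enumerate(features):
    for second in features[i + 1:]:
        # Keep the first feature of a strongly correlated pair, drop the second.
        if corr.loc[first, second] > threshold and second not in to_drop:
            to_drop.append(second)

reduced = df.drop(columns=to_drop)
print(to_drop)
```

The same corr matrix can be passed to a heatmap function to produce a chart of the correlations.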
“Data balancing is not done because the data is already balanced.”
Data Normalization
Before classifying, it is better to normalize the data. Normalization changes the values of the numeric fields in the data set to a common scale without distorting differences in the ranges of values or losing information. Some algorithms also require normalization to model the data properly. We use the MinMaxScaler method from the Scikit Learn library to process our data.
MinMaxScaler rescales each feature to a predetermined range, here from 0 (the feature's minimum) to 1 (its maximum).
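As a small illustration of the scaling, the sketch below applies MinMaxScaler to a made-up feature matrix whose two columns are on different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix with two columns on different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])

scaler = MinMaxScaler(feature_range=(0, 1))   # (0, 1) is also the default
X_scaled = scaler.fit_transform(X)            # per column: (x - min) / (max - min)

print(X_scaled)
```

Each column is scaled independently, so the smallest value of every feature becomes 0 and the largest becomes 1.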
The next preprocessing step is oversampling, which balances the classes rather than normalizing the values. We need to load SMOTE first.
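To show what SMOTE does conceptually, here is a simplified NumPy sketch of its interpolation step. It is not the library implementation: the function name smote_like_samples and its parameters are hypothetical. In practice one would more likely call the SMOTE class from the imbalanced-learn package and its fit_resample(X, y) method, which we assume is what "load SMOTE" refers to.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority-class
    row and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances inside the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest rows

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))        # random minority row
        other = neighbours[base, rng.integers(k)]   # one of its neighbours
        gap = rng.random()                          # position between the two rows
        synthetic[i] = X_minority[base] + gap * (X_minority[other] - X_minority[base])
    return synthetic


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_rows = smote_like_samples(X_min, n_new=4)
print(new_rows.shape)
```

Because every synthetic row lies on a line segment between two real minority rows, the new samples stay inside the region already covered by that class.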
Model Classification
The last stage is model classification and model evaluation. For classification, the models we use are Naive Bayes and Decision Tree. Scikit Learn is a library that helps programmers implement algorithms and build models.
We first load the library and set the parameters used for the Naive Bayes classification model.
Then we run the training data and test data from the X and Y variables through the model, keeping the best-performing configuration.
Next, we make predictions for the test data of variable X.
To see the results of applying the Naive Bayes model, we produce a classification report and evaluate the model.
Based on this report, the Naive Bayes model achieves a result of 82%. To make the result easier to read, we also visualize it.
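The Naive Bayes steps described above can be sketched end to end as follows. Several choices here are assumptions, not the notebook's confirmed settings: the GaussianNB variant, the 80/20 train/test split, the grid search over var_smoothing as the way of selecting the best model, and the synthetic data, which stands in for the preprocessed dataset. The scores it prints will therefore not match the 82% reported here.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed data (not the Kaggle dataset).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Assumed search space: only the variance-smoothing term of GaussianNB.
param_grid = {"var_smoothing": [1e-9, 1e-7, 1e-5]}
search = GridSearchCV(GaussianNB(), param_grid, cv=5)
search.fit(X_train, y_train)        # fits every candidate and keeps the best

y_pred = search.predict(X_test)     # predictions from the best model
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

The classification report lists precision, recall and F1 per class together with the overall accuracy.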
For the next classification process, we use the Decision Tree model. We first load the library through Scikit Learn.
Next we define the parameters. They differ from those of the previous model because the algorithm is different. For this model we must set parameters such as the maximum depth and the minimum number of samples, as well as the split criterion (gini or entropy).
Then we run the training data and test data from the X and Y variables through the model, keeping the best-performing configuration.
Next, we make predictions for the test data of variable X.
To see the results of applying the Decision Tree model, we produce a classification report and evaluate the model.
Based on this report, the Decision Tree model achieves a result of 83%, slightly higher than the 82% of Naive Bayes, so it is the better of the two models in this case.
To make the result easier to read, we also visualize it.
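The Decision Tree steps can be sketched in the same way. The parameter grid below covers the parameters named in the text (criterion, maximum depth, minimum samples), but the candidate values, the use of GridSearchCV, the 80/20 split and the synthetic data are assumptions, so the printed scores will not match the 83% reported here.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (not the Kaggle dataset).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Assumed candidate values for the parameters named in the text.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)        # fits every combination and keeps the best

y_pred = search.predict(X_test)     # predictions from the best tree
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Limiting the depth and raising the minimum number of samples per split are the usual ways to keep a tree from overfitting a small dataset.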
Source Code: Colab Data Set
Data Set Kaggle : Diagnose lung disease caused by smoking