Diagnose lung disease caused by smoking

MAHADIKA DAFFA WAHYUDI


Summary

This data set contains 30,000 records across various categories. We will use it to determine how many people are affected by lung disease based on their daily activities and congenital diseases.

Description

 

**This Data Classification Portfolio is required for DSBIZ Certification**

Background

Smoking is a major cause of disease. It harms not only active smokers but also the people around them, known as passive smokers, who can suffer many of the same diseases as active smokers. If even 1% of the human population became passive smokers, the existing number of lung specialists would not be able to handle the caseload. This is a problem that must be addressed. With an expert system, users can make an initial diagnosis of their symptoms and learn about treatment. In this study, the expert system uses the certainty factor method, which expresses the degree of certainty of a fact. Calculations are based on the value of an expert's belief in the symptoms of a disease. The resulting expert system is named Diagperosif; it diagnoses a disease based on the symptoms entered by the user. The diseases that Diagperosif can diagnose are asthma, bronchitis, COPD, and lung cancer.

THE DEATH RATE DUE TO CIGARETTES INCREASED IN 2018

The World Health Organization (WHO) reports that the death rate due to smoking has reached 30%, equivalent to 17.3 million people. The number of deaths is estimated to keep rising, reaching 23.3 million people by 2030. Smoking increases the risk of cardiovascular disease, which affects many people in low-income countries. In Indonesia, cardiovascular disease reaches 80% and ranks as the deadliest disease.

In 2015, WHO published research showing that more than 3.9 million children aged 10 to 14 were active smokers. Meanwhile, 239,000 children under the age of 10 had smoked for the first time. In addition, 40 million children under the age of 5 were passive smokers.

WHO also notes that passive smoking increases the risk of lung cancer by 20-30% and the risk of heart disease by 25-35%.

Worldwide, premature deaths due to smoking number almost 5.4 million. If awareness of the dangers of smoking does not grow, it is predicted that by 2025, 10 million smokers will die.

BENEFIT

  1. Accurately predict whether someone has the potential to get lung disease, which is useful for improving the quality of that person's lifestyle.
  2. Assist health workers in predicting whether a person has lung disease by using the decision tree classification method and the naïve Bayes method.
  3. Help people become more aware of health and a good lifestyle.

LOAD KAGGLE API

First, load the kaggle.json API token file. This file is obtained from the Kaggle account and then uploaded to the "content" folder of the Google Colab directory.

Next, copy the Kaggle API command for the data set and use it to download the data. Once downloaded, the file is stored in the "content" folder as a zip archive.

Finally, extract the zip archive. The extraction results are saved to the "tmp" folder.
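The extraction step can be sketched with Python's standard zipfile module. This is a minimal sketch, not the notebook's original code: the helper name extract_zip is hypothetical, and the demo builds a tiny stand-in archive in a temporary directory instead of using the real Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Demo with a stand-in archive (assumption: the real archive comes from Kaggle).
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "dataset.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("predic_table.csv", "a,b\n1,2\n")

extracted = extract_zip(demo_zip, work_dir / "tmp")
print(extracted)
```

In Google Colab, the same helper could be pointed at the downloaded archive in the "content" folder, with "tmp" as the destination.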

LOAD LIBRARY

Before running data preprocessing or any other data science stage, we load the libraries first so that the imports do not have to be written repeatedly. We use NumPy for scientific and mathematical calculations, pandas for manipulating and analyzing data, and Matplotlib for making graphs.

We start by calling the pandas library to read and view the predic_table.csv file.
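Reading the file with pandas might look like the sketch below. The helper name load_table and the three demo columns are assumptions for illustration; the demo writes a two-row stand-in for predic_table.csv so that the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(csv_path):
    """Read a CSV file into a pandas DataFrame."""
    return pd.read_csv(csv_path)


# Stand-in file with hypothetical contents; the real file has 30,000 rows.
demo_dir = Path(tempfile.mkdtemp())
demo_csv = demo_dir / "predic_table.csv"
demo_csv.write_text("Usia,Jenis_Kelamin,Result\nMuda,Pria,Ya\nTua,Wanita,Tidak\n")

df = load_table(demo_csv)
print(df.head())  # first rows of the table
```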

ENCODING CATEGORICAL DATA

We will encode features that hold nominal (unordered) data, such as Usia (age) and Jenis_Kelamin (gender). These two features are categorical, so they must be encoded as the numbers 1 and 0, which act as labels and carry no mathematical meaning. During encoding, the two original features are removed and replaced by new columns, one for each category they contained.


After encoding, the data frame has 4 additional features: Muda and Tua from the Usia feature, and Pria and Wanita from the Jenis_Kelamin feature. These 4 new features contain only the values 1 and 0.
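One way to get this result is one-hot encoding with pandas.get_dummies. The sketch below assumes that approach; the Merokok column and the three sample rows are hypothetical and only show that untouched columns are preserved.

```python
import pandas as pd

# Hypothetical sample rows; only Usia and Jenis_Kelamin come from the text.
df = pd.DataFrame(
    {
        "Usia": ["Muda", "Tua", "Muda"],
        "Jenis_Kelamin": ["Pria", "Wanita", "Wanita"],
        "Merokok": [1, 0, 1],
    }
)

# Replace each categorical column with one 0/1 column per category.
# An empty prefix keeps the category name itself as the new column name.
encoded = pd.get_dummies(
    df,
    columns=["Usia", "Jenis_Kelamin"],
    prefix="",
    prefix_sep="",
    dtype=int,
)
print(encoded)
```

The original Usia and Jenis_Kelamin columns disappear, and the columns Muda, Tua, Pria, and Wanita take their place.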

We use df.info() to see the data type of each feature.

After looking at the data types, we find that the features are still of type object. To make data processing easier, we need to convert them into boolean or float.

We also rename the Result feature to Target, which makes the class column easier to recognize.

After all the categorical features have been encoded, we apply a Label Encoder to the Target.
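The rename and label-encoding steps can be sketched as follows. The label values "Ya" and "Tidak" are assumptions for illustration, since the text does not list the actual contents of the Result column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real Result values are not shown in the text.
df = pd.DataFrame({"Result": ["Ya", "Tidak", "Ya", "Tidak"]})

# Rename the class column so it is easy to recognize.
df = df.rename(columns={"Result": "Target"})

# LabelEncoder maps each distinct label to an integer, in sorted label order.
encoder = LabelEncoder()
df["Target"] = encoder.fit_transform(df["Target"])

print(list(encoder.classes_))
print(df["Target"].tolist())
```

Keeping the fitted encoder around makes it possible to map predictions back to the original labels with encoder.inverse_transform.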

Handling Missing Value

Because the data is entered by humans, some values may be NaN or null. Such entries are called missing values, and we need to fix them. We use the df.fillna() method to fill in the missing values based on the mean and on the nearest row.
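A minimal sketch of both filling strategies is shown below. The column names are hypothetical, and "nearest row" is interpreted here as a forward fill from the previous row, which is an assumption about the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with one missing value each.
df = pd.DataFrame(
    {
        "score": [1.0, np.nan, 3.0],
        "label": ["a", None, "b"],
    }
)

# Numeric column: replace missing values with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Non-numeric column: copy the value from the previous row (forward fill).
df["label"] = df["label"].ffill()

print(df)
print(df.isna().sum().sum())  # number of missing values left
```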

After handling the missing values, we plot the Target feature as a bar graph using sns.countplot to compare the counts of its values.

For more detail, we can view the distribution of the values of every feature using histograms. df.hist() displays a histogram for each column of the data frame.

Data Correlation

Data correlation is important because it measures the relationship between the class, or variable X, and the other variables. To see these relationships, we visualize the correlations between the X and Y variables of the data frame as a correlation chart.

From the correlation chart, we can see that some features have a strong positive correlation and others a weak negative correlation. To simplify the model, we remove the features with a strong positive correlation and use the features with a weak negative correlation as classification input.
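The feature-removal step described above can be sketched as a correlation filter. The helper name drop_strong_positive and the 0.8 threshold are assumptions, as the text does not state the cutoff that was used; the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_strong_positive(df, threshold=0.8):
    """Drop one column from every pair whose correlation exceeds threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic demo: column b is almost a copy of a, column c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.01, size=200),
        "c": rng.normal(size=200),
    }
)

reduced, dropped = drop_strong_positive(demo)
print(dropped)
```

For the chart itself, the matrix returned by df.corr() is what a heatmap would display.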

“Data balancing is not done because the data is already balanced.”

Data Normalization

Before classifying, it is better to normalize the data. Normalization changes the values of the numeric fields in the data set to a common scale without distorting differences in the ranges of values or losing information. Some algorithms also require normalization to model the data properly. We use the MinMaxScaler method from the scikit-learn library.

MinMaxScaler rescales each feature to a predetermined range: the minimum value of a feature maps to 0 and the maximum maps to 1.
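As a small illustration of the scaling, the sketch below applies MinMaxScaler to a made-up array with two features on very different scales. The numbers are illustrative only and do not come from the data set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two illustrative features on very different scales.
X = np.array(
    [
        [1.0, 200.0],
        [2.0, 400.0],
        [3.0, 600.0],
    ]
)

# For each column: (x - column_min) / (column_max - column_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

Both columns end up spanning 0 to 1, so neither dominates a model just because of its units.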

The next preprocessing technique we use is oversampling, which balances the class counts rather than rescaling values. We need to load SMOTE first and then apply it to the data.

Model Classification

The last stage is model classification and model evaluation. For classification, we use the Naive Bayes and Decision Tree models. Scikit-learn is a library that helps programmers write algorithms and build models.

First we load the library and set the parameters used for the Naive Bayes classification model.

Then we split the X and Y variables into training data and test data and fit the best model on the training data.

Next, we make predictions for the test data of variable X.

To see the results of applying the Naive Bayes model, we generate a classification report and evaluate the model.
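The Naive Bayes workflow above can be sketched as follows. GaussianNB, the 80/20 split, and the synthetic data from make_classification are all assumptions, since the text does not say which Naive Bayes variant or split ratio was used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed features X and labels y.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Hold out 20% of the rows for testing, keeping the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training data, then predict the held-out test data.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)

# Precision, recall, and F1 per class, plus overall accuracy.
nb_accuracy = accuracy_score(y_test, nb_pred)
print(classification_report(y_test, nb_pred))
```

The scores on synthetic data say nothing about the 82% reported for the real data set; the block only shows the order of the steps.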

Based on the report, the best result obtained with the Naive Bayes model is 82%. To make things easier, we will visualize the data first:

For the next classification process, we use the Decision Tree model. First we load the library from scikit-learn and set the parameters used by the Decision Tree model.

Next we define the parameters. They differ from those of the previous model because the algorithm is different. For this model, we must set the maximum depth and the minimum number of samples, as well as the split criterion (gini or entropy).

Then we split the X and Y variables into training data and test data and fit the best model on the training data.

Next, we make predictions for the test data of variable X.

To see the results of applying the Decision Tree model, we generate a classification report and evaluate the model.
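One way to choose the Decision Tree parameters and keep the best model is a grid search. Using GridSearchCV is an assumption, as are the candidate values in param_grid and the synthetic data; the text only names the parameters that are tuned.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features X and labels y.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical candidate values for the parameters named in the text.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
}

# Try every combination with 3-fold cross-validation on the training data.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

# The best estimator is refit on all training data; use it on the test data.
best_tree = search.best_estimator_
tree_pred = best_tree.predict(X_test)
tree_accuracy = accuracy_score(y_test, tree_pred)

print(search.best_params_)
print(classification_report(y_test, tree_pred))
```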

“Based on the report above, the best result obtained with the Decision Tree model is 83%.”

To make things easier, we will visualize the data first:

Source Code: Colab Data Set

Data Set Kaggle: Diagnose lung disease caused by smoking

Related Course Information
  Category: Data Science / Big Data
  Course: Produk dan Desain Kecerdasan Artifisial (SIB AI-Hipster)