Reggina Kuswandi
Exploratory Data Analysis (EDA) is an approach to analyzing data, largely through visual techniques. It is used to discover trends and patterns and to check assumptions with the help of statistical summaries and graphical representations.
Data visualization is the process of presenting data as graphs or maps, which makes the trends and patterns in the data much easier to understand. There are many types of visualization.
About Dataset
Heart disease is a group of cardiovascular conditions in which the normal functioning of the heart is impaired. It may be caused by damage to the epicardium, pericardium, myocardium, endocardium, the valvular apparatus of the heart, or the heart vessels.
Heart disease can persist for a long time in a latent form without any clinical symptoms. Along with various tumors, these diseases are today the main cause of premature death in developed countries.
The uninterrupted operation of the circulatory system, which consists of the heart as a muscle pump and a network of blood vessels, is a necessary condition for the normal functioning of the body.
According to the National Heart, Lung and Blood Institute in Framingham (USA), the most important factors in the development of cardiovascular disease in humans are obesity, sedentary lifestyle and smoking.
About Project
In this project I carry out an exploratory data analysis of the 2020 heart disease dataset, which I took from the kaggle.com platform. EDA is an important step in data analysis: it saves time later in the analysis, it exposes data errors such as missing values and duplicates, and it helps us understand the data through visualization. In this project I use descriptive statistics and present the results as tables, diagrams, and graphs.
Preparation
First, search for and download a public dataset from the Kaggle.com platform. The dataset I used is available through this link: https://www.kaggle.com/code/georgyzubkov/heart-disease-exploratory-data-analysis/notebook
Second, open Google Colaboratory in a browser, create a new notebook, and rename it.
Third, upload the dataset downloaded from kaggle.com to the notebook.
We have now saved the dataset and created a notebook file in Google Colaboratory.
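As a rough illustration of the upload step, the CSV can be read into a pandas DataFrame. This is only a sketch: the file name `heart_2020_cleaned.csv`, the helper `load_dataset`, and the column names in the stand-in data are assumptions, and the two rows are made up so that the example runs without the real file.

```python
import io
import pandas as pd

# Assumed file name; use the name of the CSV you uploaded to Colab.
CSV_PATH = "heart_2020_cleaned.csv"

def load_dataset(source):
    """Read the CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)

# In Colab you would call: df = load_dataset(CSV_PATH)
# Here a made-up two-row stand-in keeps the sketch runnable anywhere.
sample_csv = io.StringIO(
    "HeartDisease,BMI,Smoking,SleepTime\n"
    "No,16.6,Yes,5\n"
    "Yes,28.87,No,7\n"
)
df = load_dataset(sample_csv)
print(df.head())
```

In Colab the file can be uploaded through the Files panel on the left, after which the commented call reads it by name.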
Preprocessing
The data sample is very informative: it covers about 319 thousand patients described by 18 criteria.
What features characterize our data sample?
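One way to answer this question is to look at the shape of the table and the type of each column. The sketch below uses a made-up three-row frame with assumed column names in place of the real dataset.

```python
import pandas as pd

# Toy stand-in for the real table; column names and values are assumptions.
df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No"],
    "BMI": [16.6, 28.9, 23.4],
    "SleepTime": [5.0, 7.0, 8.0],
})

n_rows, n_cols = df.shape          # number of patients and number of criteria
print(f"{n_rows} rows, {n_cols} columns")

df.info()                          # column names, non-null counts, dtypes
print(df.dtypes)                   # dtype of each column on its own
```

On the real dataset the same calls report the row and column counts mentioned above.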
Next, we check the dataset for gaps in the data.
There are no missing values!
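A check like this is usually done by counting null cells per column. The sketch below uses a made-up frame that deliberately contains one gap, so that the output shows what a missing value looks like; on the real dataset every count would be zero.

```python
import pandas as pd

# Toy data with one deliberate gap in BMI (values are made up).
df = pd.DataFrame({
    "BMI": [16.6, None, 23.4],
    "Smoking": ["Yes", "No", "No"],
})

missing_per_column = df.isnull().sum()        # null cells in each column
total_missing = int(missing_per_column.sum()) # null cells in the whole table

print(missing_per_column)
print("Total missing values:", total_missing)
```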
Our dataset contains 18078 duplicate rows, so we need to drop them. After dropping the duplicates, 301717 of the original 319795 rows remain, still with 18 columns.
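Counting and dropping duplicates could look like the following sketch. The four-row frame is made up; the point is only that the row count after the drop equals the original count minus the number of duplicates.

```python
import pandas as pd

# Toy data: the third row repeats the first one exactly (values are made up).
df = pd.DataFrame({
    "BMI": [16.6, 28.9, 16.6, 23.4],
    "Smoking": ["Yes", "No", "Yes", "No"],
})

# duplicated() marks every row that is identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Rows before:", len(df), "rows after:", len(deduped))
```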
The numeric variables are BMI, Physical Health, Mental Health, and Sleep Time. The rest are categorical.
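The split into numeric and categorical variables can be read off the column dtypes. In this sketch the column names (`PhysicalHealth`, `MentalHealth`, `SleepTime`, and so on) are assumed spellings and the values are made up.

```python
import pandas as pd

# Toy stand-in; column names are assumed spellings, values are made up.
df = pd.DataFrame({
    "BMI": [16.6, 28.9, 23.4],
    "PhysicalHealth": [3.0, 0.0, 20.0],
    "MentalHealth": [30.0, 0.0, 0.0],
    "SleepTime": [5.0, 7.0, 8.0],
    "Smoking": ["Yes", "No", "No"],
    "Sex": ["Female", "Male", "Female"],
})

# Columns with a numeric dtype versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

# Descriptive statistics (count, mean, std, min, quartiles, max).
print(df[numeric_cols].describe())
```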
Our hypothesis: the numeric variables contain outliers at both the maximum and the minimum end.
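One common way to test such a hypothesis is the 1.5 × IQR rule, which flags values far below the first quartile or far above the third. This is not necessarily the method used in the original notebook; the helper `iqr_outlier_counts`, the column name, and the data below are illustrative assumptions.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values below Q1 - k*IQR and above Q3 + k*IQR for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        counts[col] = {
            "below": int((frame[col] < low).sum()),
            "above": int((frame[col] > high).sum()),
        }
    return counts

# Made-up sleep durations with one very low and one very high value.
df = pd.DataFrame({"SleepTime": [1, 6, 7, 7, 7, 8, 8, 24]})

outliers = iqr_outlier_counts(df, ["SleepTime"])
print(outliers)
```

A box plot of each numeric column shows the same thing graphically: points beyond the whiskers correspond to the flagged values.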
The correlation matrix gives the pairwise correlation between BMI, Physical Health, Mental Health, and Sleep Time, so we can see how each of these variables relates to the others.
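A correlation matrix of this kind can be computed with `DataFrame.corr()`, which uses the Pearson coefficient by default. The values below are made up and chosen so that the result is easy to read: two columns move together perfectly and one moves in the opposite direction. The column names are assumed spellings.

```python
import pandas as pd

# Made-up values; column names are assumed spellings.
df = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0],
    "PhysicalHealth": [0.0, 5.0, 10.0, 15.0],   # rises with BMI
    "MentalHealth": [15.0, 10.0, 5.0, 0.0],     # falls as BMI rises
    "SleepTime": [8.0, 7.0, 7.0, 6.0],
})

# Pairwise Pearson correlation; each value lies between -1 and 1,
# and the diagonal is 1 (each variable with itself).
corr = df.corr()
print(corr.round(2))
```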
Data Visualisation with Descriptive Statistics Techniques
In the heatmap, the strength of each correlation is read from the color of its cell: the darker the color, the lower the correlation, and the lighter the color, the higher the correlation.
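A heatmap like this can be sketched with matplotlib's `imshow`; the original notebook may well have used a different plotting library. The `viridis` colormap is chosen here as an assumption because it matches the reading above (dark for low values, light for high values). The data and column names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend for scripts; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; column names are assumed spellings.
df = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0],
    "PhysicalHealth": [0.0, 5.0, 12.0, 15.0],
    "SleepTime": [8.0, 7.0, 7.0, 6.0],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to the full correlation range so colors are comparable.
image = ax.imshow(corr.to_numpy(), cmap="viridis", vmin=-1, vmax=1)

# Label both axes with the variable names.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

The color bar next to the plot maps each shade back to a correlation value, which avoids guessing from color alone.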
From the data frame we can count how many records report heart disease: 27261 have heart disease and 274456 do not, so roughly 9% of the 301717 records are positive cases.
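Counts like these can be obtained with `value_counts()`. In this sketch the column name `HeartDisease` and its `Yes`/`No` labels are assumptions, and the five rows are made up.

```python
import pandas as pd

# Made-up target column; the name and labels are assumptions.
df = pd.DataFrame({"HeartDisease": ["No", "No", "Yes", "No", "No"]})

counts = df["HeartDisease"].value_counts()                # absolute counts per class
share = df["HeartDisease"].value_counts(normalize=True)   # fraction per class

print(counts)
print(share.round(3))
```

Looking at the fractions as well as the raw counts makes the class imbalance explicit, which matters for any later modeling step.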