Eka Wulan Yunita
Exploratory Data Analysis or EDA is first investigative test process that aims to identify patterns, find anomalies, test hypotheses and check assumptions. EDA is an important process in data analysis because of EDA, users will be able to save more time in the data analysis process, be able to find out some errors in the data such as missing values, outliers, duplications, encodings, noisy data, incomplete data, and etc. The data that will be predicted is taken from kaggle.com which contains BMI, Price, Gender/Sex, Smkoker, and Age that will be analyzed. The steps is taken in this case study are Load the data, Preprocessing Data, Data Visualization, and Correlation.
What is EDA ?
Exploratory Data Analysis or EDA is first investigative test process that aims to identify patterns, find anomalies, test hypotheses and check assumptions. In EDA, users will be greatly assisted in detecting errors from the start, being able to identify outliers, knowing the relationships between data and being able to explore important factors from the data. The EDA process is very useful in the process of statistical analysis. Thus, it is not surprising that professional data practitioners have used EDA very often so that the analysis results become more valid and relevant to business goals. In general, there are several exploratory data analysis techniques, namely by using descriptive statistics, univariate analysis, and multivariate analysis. EDA is an important process in data analysis because of EDA, users will be able to save more time in the data analysis process, be able to find out some errors in the data such as missing values, outliers, duplications, encodings, noisy data, incomplete data, and etc.
This is EDA for predicting Medical insurance prices, which the dataset is obtained from kaggle.com. This dataset contains several data, namely:
This process goes through several steps, namely: Data Load, Data Preprocessing, Data Visualization, and Correlation.
Steps :
1) Load Data
Import all required libraries. There are several libraries that are used, namely:
1. pandas, used to import data, etc.
2. numpy, used to process data
3. Seaborn, used for graphics and statistics.
4. matplotlib, used for data visualization such as plotting graphs.
The data is taken from kaggle.com. Then, the data is uploaded on google colaboratory. then, called by pd.read_csv("")
This is used to determine the size of the array
The describe() function is used to display descriptive statistics from a data frame or series. The output of this function contains a summary of the central tendency and distribution of the dataset. The function of describe() helps us to get an overview of the dataset. The function of Transpose returns a vertical range of cells as a horizontal range, or vice versa.
2) Data Preprocessing
Elements in the region are divided into several columns consisting of each country. This aims to make predictive data more valid.
Because of machines can only understand numbers, so all data will be converted into numbers, such as: Sex/gender, smoker, and children
.
In this insurance cost prediction, children's data is not needed, so we must delete it.
The IsNull() function is used to check whether a column has data or not. and if there is no data or NULL, then given the replacement data.
The GroupBy() function allows us to group data in the same set of items for example in location, product, education level, gender, and other categories. here, it is used to get the total score or average grade by age.
3) Data Visualization
RESULT :
RESULT :
4) Correlation Matrix
The correlation matrix is a matrix that shows the simple correlation between all possible pairs of variables involved in the analysis. Value or all possible pairs of variables involved in the analysis.