BISA AI - AI For Everyone

Vehicle Insurance Claim Fraud Detection

Trianto Haryo Nugroho

Summary

This project compares 12 classification models to find the best performer for detecting fraud in automobile insurance claims. The best model is the AdaBoost Classifier, with 95.90% accuracy.

Description

Vehicle Insurance Claim Fraud Detection (Comparison of 12 Models)

By: Trianto Haryo Nugroho

Data Understanding

Vehicle Insurance Fraud Detection

Vehicle insurance fraud involves conspiring to make false or exaggerated claims for property damage or personal injuries following an accident. Common examples include staged accidents, in which fraudsters deliberately “arrange” for accidents to occur; phantom passengers, in which people who were not even at the scene of the accident claim to have suffered grievous injury; and false personal injury claims, in which injuries are grossly exaggerated.

About this dataset

This dataset contains vehicle details (attributes, model, accident details, etc.) along with policy details (policy type, tenure, etc.). The goal is to detect whether a claim application is fraudulent; the target column is FraudFound_P.

Source

https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection

Dimension

The data contains 15,421 rows and 33 columns.

Google Colab Link:

https://colab.research.google.com/drive/1o85fAtC-82uQbY1ENMWBoYZGoszUjwiE?usp=sharing

Exploratory Data Analysis (EDA)

Import Library

Upload Dataset

Read Dataset

Display 5 Samples

Dataset Information

The dataset consists of 15,421 rows and 33 columns.

Checking Missing Values

There are no missing values in the dataset.

Checking Data Duplicates

There are no duplicate rows in the dataset.
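
The two checks above can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the notebook's own code: apart from FraudFound_P, the column names and all values are assumptions, and one duplicate row is included on purpose so the check has something to find.

```python
import pandas as pd

# Toy stand-in for the claims table (made-up values; only FraudFound_P
# is a column name taken from the dataset description).
df = pd.DataFrame({
    "Age": [34, 45, 29, 34],
    "Deductible": [400, 400, 500, 400],
    "FraudFound_P": [0, 0, 1, 0],
})

# Number of missing (NaN) values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

On the real dataset, both counts would be expected to come out as zero, matching the statements above.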

Checking Shape or Dimension

Checking Label Proportion

A large difference in label proportions in a classification problem can prevent the model from learning well, so that it ends up recognizing only the dominant label.

The label proportions in this dataset differ significantly, so we will balance the dataset later.
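
A label-proportion check of this kind might look like the sketch below. The counts are invented for illustration and are not the dataset's actual class distribution.

```python
import pandas as pd

# Hypothetical labels: 94 legitimate claims (0) and 6 fraudulent ones (1).
labels = pd.Series([0] * 94 + [1] * 6, name="FraudFound_P")

counts = labels.value_counts()                      # absolute count per label
proportions = labels.value_counts(normalize=True)   # share of each label

# How many majority-class rows exist for each minority-class row.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print("imbalance ratio:", round(imbalance_ratio, 2))
```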

Descriptive Statistics

Histogram

Most policyholders are between 30 and 40 years old. The most frequent FraudFound_P value is 0. The most common deductible is around 400. There are no significant differences between driver ratings.

Pair Plot

Correlation

Checking for Multicollinearity

The plot above shows only low correlation between features, so none of them needs to be removed.
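
One common way to turn a correlation matrix into a multicollinearity check is sketched below. The data is synthetic, the feature names are placeholders, and the 0.8 threshold is an assumption; the notebook's own criterion is not stated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Synthetic numeric features: "c" is deliberately almost a copy of "a".
df = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=n),
    "c": 0.99 * a + rng.normal(scale=0.05, size=n),
})

corr = df.corr().abs()

# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)
```

If the list comes back empty, as the text reports for this dataset, no feature has to be dropped.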

Checking Number of Unique Values

Remove Unnecessary Columns

Data Preprocessing

Separating the feature and target columns

Encoding Categorical Columns

We inspect the percentage distribution of each categorical feature; if a single value dominates a feature, that feature is discarded.
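
The rule above could be sketched like this. The column names, the values, and the 90% dominance threshold are all assumptions made for illustration.

```python
import pandas as pd

# Toy categorical data: "WitnessPresent" is dominated by a single value.
df = pd.DataFrame({
    "Sex": ["Male"] * 12 + ["Female"] * 8,
    "WitnessPresent": ["No"] * 19 + ["Yes"],
})

dominance_threshold = 0.90  # assumed cut-off

dominant = {}
for col in df.columns:
    # value_counts sorts descending, so the first entry is the top share.
    top_share = df[col].value_counts(normalize=True).iloc[0]
    if top_share >= dominance_threshold:
        dominant[col] = top_share

# Discard the features in which one value dominates.
reduced = df.drop(columns=list(dominant))

print(dominant)
print(list(reduced.columns))
```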

Extracting Categorical Columns

Printing Unique Values of Each Column

Get Dummies from Categorical Columns

Extracting the Numerical Columns

Combining Numerical and Categorical Columns
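
The encoding steps listed above (extract the categorical columns, create dummies, extract the numerical columns, combine) can be sketched as follows. The table is a toy example, and `drop_first=True` is an assumption rather than a confirmed setting from the notebook.

```python
import pandas as pd

# Toy feature table with two numerical and two categorical columns.
df = pd.DataFrame({
    "Age": [34, 45, 29],
    "Deductible": [400, 400, 500],
    "Sex": ["Male", "Female", "Male"],
    "AccidentArea": ["Urban", "Rural", "Urban"],
})

# Split the columns by type.
categorical_cols = [c for c in df.columns
                    if not pd.api.types.is_numeric_dtype(df[c])]
numerical_cols = [c for c in df.columns if c not in categorical_cols]

# One-hot encode the categorical columns; drop_first removes one redundant
# dummy per feature (an assumed choice).
cat_df = pd.get_dummies(df[categorical_cols], columns=categorical_cols,
                        drop_first=True)
num_df = df[numerical_cols]

# Combine numerical and encoded categorical columns side by side.
X = pd.concat([num_df, cat_df], axis=1)

print(X.columns.tolist())
```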

The data looks good; next, we check for outliers.

Outliers Detection

Outliers are present in some numerical columns, so we will scale the numerical columns later.
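
One standard way to flag outliers in a numerical column is the 1.5 × IQR rule, sketched below on made-up ages. Whether the notebook uses this rule or only inspects box plots is not stated, so treat the method as an assumption.

```python
import pandas as pd

# Made-up ages with one obviously extreme value.
ages = pd.Series([25, 30, 32, 35, 38, 40, 42, 95], name="Age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers.tolist())
```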

Split Dataset

Normalization
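
The split and normalization steps could be sketched with scikit-learn as shown here. The data is synthetic, and the 80/20 stratified split and the use of `StandardScaler` (rather than, say, min-max scaling) are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced binary target (180 vs 20).
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = np.array([0] * 180 + [1] * 20)

# Stratify so both splits keep the same label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no information from the test data leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```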

Balancing the Dataset using oversampling (SMOTE)
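
To show the idea behind SMOTE, the sketch below implements a simplified SMOTE-style oversampler in plain NumPy. Each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. The function name `smote_like_oversample` and its parameters are hypothetical; this is not the notebook's code, and in practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used instead.

```python
import numpy as np

def smote_like_oversample(X, y, minority_label=1, k=3, seed=0):
    """Simplified SMOTE-style oversampling for a binary target.

    Assumes the minority class has more than k samples.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    # Number of synthetic samples needed to match the majority class.
    n_needed = int((y != minority_label).sum() - len(X_min))

    # Pairwise distances among minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours.
    base_idx = rng.integers(0, len(X_min), size=n_needed)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_needed)]

    # Place each synthetic point somewhere on the segment between the two.
    gaps = rng.random((n_needed, 1))
    synthetic = X_min[base_idx] + gaps * (X_min[nb_idx] - X_min[base_idx])

    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal

# Usage on synthetic data: 90 majority rows and 10 minority rows.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = smote_like_oversample(X, y)
print(np.bincount(y_bal))  # class counts after balancing: [90 90]
```

Oversampling of this kind is usually applied to the training split only, so that the test set keeps the original label proportions.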

Reduce Overfitting using Principal Component Analysis (PCA)
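
A PCA step of this kind might look like the following sketch. The data is synthetic (six features generated from two latent factors), and keeping enough components to explain 95% of the variance is an assumed setting, not one taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Six observed features driven by two latent factors plus a little noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_train = latent @ mixing + 0.01 * rng.normal(size=(300, 6))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction (0.95 is assumed).
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)

print(X_train_pca.shape[1], pca.explained_variance_ratio_.sum())
```

The same fitted `pca` object would then be applied to the test split with `pca.transform`, so that both splits are projected onto the same components.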

Modeling

1. Support Vector Classifier

Fit the model

Model Evaluation

2. KNN

Fit the model

Model Evaluation

3. Decision Tree Classifier

Fit the model

Model Evaluation

Hyperparameter tuning

Print best parameters and the best score

Fit the best estimator

Model Evaluation

4. Random Forest Classifier

Fit the model

Model Evaluation

5. AdaBoost Classifier

Fit the model using hyperparameter tuning

Print the best parameters and the best score

Fit the best estimator

Model Evaluation
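
The AdaBoost steps above (tune, print the best parameters and score, fit the best estimator, evaluate) can be sketched with `GridSearchCV`. The synthetic data, the parameter grid, and the 3-fold cross-validation are assumptions; the notebook's actual grid is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the prepared features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid (assumed values).
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}
search = GridSearchCV(
    AdaBoostClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)

# With the default refit=True, best_estimator_ is already refitted
# on the whole training split.
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print("test accuracy:", test_accuracy)
```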

6. Gradient Boosting Classifier

7. Stochastic Gradient Boosting (SGB)

Fit the model and model evaluation

8. XGBoost Classifier

Fit the model

Model Evaluation

Hyperparameter tuning

Model Evaluation after Hyperparameter tuning

9. CatBoost Classifier

Install CatBoost

Fit the model

Model Evaluation

10. Extra Trees Classifier

Fit the model and model evaluation

11. LGBM Classifier

Fit the model and model evaluation

12. Voting Classifier

Fit the model

Model Evaluation

Model Performance Comparison
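
A comparison table of this kind can be built by fitting each model in a loop and collecting its test accuracy. The sketch below uses only four of the twelve model types, default hyperparameters, and synthetic data, so the resulting ranking says nothing about the results reported here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed claims features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    rows.append({"Model": name, "Accuracy": accuracy})

# Sort so the best-performing model appears first.
comparison = (
    pd.DataFrame(rows)
    .sort_values("Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```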

Model Performance Comparison Visualization

Conclusion

Best Model: AdaBoost Classifier

Accuracy: 0.958958 (95.90%)

Related Course Information

Category: Data Science / Big Data
Course: Data Science