HEALTH INSURANCE CROSS SELL PREDICTION

Andri Armaginda Siregar


Summary

     Cross-selling is a strategy of offering customers additional products that complement the products they have already purchased. Cross-sell offers are therefore often framed as recommendations that buyers find hard to refuse.

     In this case, cross-selling is used to encourage health insurance customers to also join the vehicle insurance program offered by the same health insurance company.

Description

PORTFOLIO



OBJECTIVE

     A model that predicts whether a customer would be interested in vehicle insurance helps the company plan its communication strategy, reach out to the customers most likely to respond, and optimise its business model and revenue.

DATA DESCRIPTION

Variable Name           Description

Id                      Unique ID for the customer
Gender                  Gender of the customer
Age                     Age of the customer
Driving_License         0: Customer does not have a driving license (DL)
                        1: Customer already has a DL
Region_Code             Unique code for the region of the customer
Previously_Insured      0: Customer doesn't have vehicle insurance
                        1: Customer already has vehicle insurance
Vehicle_Age             Age of the vehicle
Vehicle_Damage          0: Customer's vehicle was not damaged in the past
                        1: Customer's vehicle was damaged in the past
Annual_Premium          The amount the customer needs to pay as premium in the year
Policy_Sales_Channel    Anonymized code for the channel used to reach the customer, e.g. different agents, over mail, over phone, in person, etc.
Vintage                 Number of days the customer has been associated with the company
Response                0: Customer is not interested
                        1: Customer is interested

Dataset: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction

I.          PREPARE THE PROBLEM

IMPORT LIBRARY & DATASET
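
The original code for this step is not shown on the page. As a minimal sketch, loading the data with pandas could look like the block below. The helper name load_data, the column headers, and the two sample rows are assumptions made for illustration; with the real Kaggle download you would pass the path of the CSV file instead of the in-memory stand-in.

```python
import io

import pandas as pd


def load_data(source):
    """Read the dataset from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# Tiny in-memory stand-in so the sketch runs without the Kaggle download.
# Column names and values are assumptions, not copied from the real file.
sample_csv = io.StringIO(
    "id,Gender,Age,Driving_License,Region_Code,Previously_Insured,"
    "Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,"
    "Vintage,Response\n"
    "1,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1\n"
    "2,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0\n"
)

df = load_data(sample_csv)
print(df.head())
```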

 

II.         EDA (Exploratory Data Analysis)

SUMMARIZE DATA

The dataset has 12 columns and 381109 rows. Next, we check the data types, shape, and null values in the dataset.
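
These checks can be sketched with pandas as follows. The summarize helper and the four sample rows are hypothetical; one missing value is injected on purpose to show what the null check reports, and it says nothing about whether the real data contains nulls.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; the real dataset has 381109 rows and 12 columns.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [44, 29, 51, 23],
    "Annual_Premium": [40454.0, 27496.0, np.nan, 30170.0],  # NaN injected on purpose
    "Response": [1, 0, 0, 0],
})


def summarize(frame):
    """Return the shape, per-column dtypes, and per-column null counts."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "nulls": frame.isnull().sum().to_dict(),
    }


summary = summarize(df)
print(summary["shape"])
print(summary["dtypes"])
print(summary["nulls"])
```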

DESCRIPTIVE STATISTICS

DATA VISUALIZATIONS 

RESPONSE & GENDER

 

AGE VS RESPONSE

            

  • Young people below 30 are mostly not interested in vehicle insurance. Possible reasons are less driving experience, lower maturity, and not yet owning an expensive vehicle.
  • People aged between 30 and 60 are more likely to be interested.
  • The boxplot shows that there are no outliers in the Age data.
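
The age pattern described above can also be checked numerically by grouping customers into age bands and computing the share of positive responses in each band. This is a sketch on made-up rows; the band edges (under 30, 30-60, over 60) are taken from the observations above, while the Age_Group column name is an assumption.

```python
import pandas as pd

# Made-up stand-in rows, not taken from the real dataset.
df = pd.DataFrame({
    "Age": [22, 25, 28, 35, 42, 48, 55, 63, 70],
    "Response": [0, 0, 0, 1, 0, 1, 1, 0, 0],
})

# Bucket ages into under 30, 30-60, and over 60 (right edge inclusive).
bins = [0, 29, 60, 120]
labels = ["under 30", "30-60", "over 60"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of interested customers per band.
rate_by_group = df.groupby("Age_Group", observed=True)["Response"].mean()
print(rate_by_group)
```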

 

DRIVING LICENSE, PREVIOUSLY INSURED & VEHICLE AGE

ANNUAL PREMIUM

  • From the distribution plot we can infer that the Annual_Premium variable is right-skewed.
  • From the boxplot we can observe a lot of outliers in the variable.
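
Both observations can be quantified without a plot. The sketch below computes the sample skewness (a positive value means a longer right tail) and flags outliers with the 1.5 x IQR rule, which is the rule a standard boxplot uses for its whiskers. The premium values and the iqr_outliers helper are made up for illustration.

```python
import pandas as pd

# Made-up premium values with a long right tail.
premium = pd.Series(
    [24000, 25000, 27000, 30000, 31000, 33000, 36000, 40000, 95000, 250000],
    name="Annual_Premium",
    dtype="float64",
)


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


skewness = premium.skew()
outliers = iqr_outliers(premium)
print(f"skewness: {skewness:.2f}")
print(f"number of outliers: {len(outliers)}")
```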

CORRELATION MATRIX

The target variable is barely affected by the Vintage variable, so we can drop this least-correlated variable.
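
A sketch of how the least-correlated feature could be found and dropped is shown below. The data is synthetic, and Vintage is generated as pure noise by construction, so the output only illustrates the mechanics and is not evidence about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: Response depends on Previously_Insured, not on Vintage.
previously_insured = rng.integers(0, 2, size=n)
response = ((previously_insured == 0) & (rng.random(n) < 0.4)).astype(int)
df = pd.DataFrame({
    "Previously_Insured": previously_insured,
    "Vintage": rng.integers(10, 300, size=n),  # independent noise by construction
    "Response": response,
})

# Absolute Pearson correlation of each feature with the target, ascending.
corr_with_target = df.corr()["Response"].drop("Response").abs().sort_values()
least = corr_with_target.index[0]

# Drop the feature with the weakest linear relationship to the target.
reduced = df.drop(columns=[least])
print(corr_with_target)
print("dropped:", least)
```

Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is uninformative.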

 

III.        PREPROCESSING DATA

At the data preprocessing stage we apply label encoding, which converts categorical variables into numeric codes (binary where a variable has two categories) so that they can be used in the analysis. We then check the dataset for duplicate rows; the check finds none.
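
A minimal sketch of both steps with scikit-learn's LabelEncoder and pandas is shown below. It assumes that Gender, Vehicle_Age, and Vehicle_Damage are stored as text in the raw file; the sample rows and category spellings are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in rows; the category spellings are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [44, 29, 51, 23],
    "Vehicle_Age": ["> 2 Years", "< 1 Year", "1-2 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes", "No"],
    "Response": [1, 0, 0, 0],
})

encoded = df.copy()
for col in ["Gender", "Vehicle_Age", "Vehicle_Damage"]:
    # LabelEncoder assigns integer codes to the sorted unique values.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Count rows that are exact copies of an earlier row.
n_duplicates = int(encoded.duplicated().sum())
print(encoded)
print("duplicate rows:", n_duplicates)
```

Note that the codes follow the sorted order of the labels, not the natural order of the categories, so a three-level variable such as Vehicle_Age may need an explicit mapping if the order matters to the model.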

 

FEATURE SELECTION

We can remove less important features from the dataset, such as the Vintage variable, which showed little correlation with the target.

HANDLING IMBALANCED DATA

Class imbalance exists when one class has many more observations than the other classes. In this dataset there is a large gap between the two Response classes, so we use a resampling technique to address it.
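
The text does not say which resampling technique was used. As one possible sketch, the block below applies random oversampling of the minority class with sklearn.utils.resample; this choice, the sample rows, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced sample: eight negatives and two positives.
df = pd.DataFrame({
    "Age": [25, 31, 47, 52, 38, 29, 61, 44, 35, 58],
    "Response": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["Response"] == 0]
minority = df[df["Response"] == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
counts = balanced["Response"].value_counts().to_dict()
print(counts)
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.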

IV.        MODEL SELECTION

  • The problem is a binary classification task (whether or not the customer opts for vehicle insurance).
  • The dataset has more than 300k records.
  • An SVM classifier is not a good fit because its training time grows quickly as the dataset gets larger.
  • Model selection therefore starts with several algorithms: Logistic Regression, Random Forest, and XGBClassifier.
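
The candidate models can be trained and scored through one common interface, as in the sketch below. It uses synthetic data from make_classification in place of the encoded features, and the hyperparameters are arbitrary choices, not the author's settings. XGBClassifier from the xgboost package follows the same fit/predict interface and could be added to the dictionary; it is left out here only to keep the sketch free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the encoded features.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are illustrative assumptions.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```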

1.         LOGISTIC REGRESSION

 


2.         RANDOM FOREST CLASSIFIER

3.         XGBCLASSIFIER

COMPARING THE MODEL

The models for this problem were built in Python using the dataset above. The Random Forest and XGBClassifier models performed better than the Logistic Regression model, so they are the preferred models for this problem.
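
The conclusion compares the models by their ROC curves. A curve closer to the top-left corner corresponds to a larger area under the curve (AUC), so AUC gives a single number for that comparison. The sketch below uses made-up labels and predicted probabilities for two hypothetical models, A and B; none of the numbers come from the actual experiment.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for two hypothetical models.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
scores_b = [0.2, 0.4, 0.6, 0.7, 0.3, 0.5, 0.8, 0.9]

# AUC is the share of (positive, negative) pairs that the model ranks correctly.
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

# Points of the ROC curve for model A: false positive rate vs true positive rate.
fpr_a, tpr_a, thresholds_a = roc_curve(y_true, scores_a)

print(f"model A AUC: {auc_a:.4f}")
print(f"model B AUC: {auc_b:.4f}")
```

With real models, the scores would come from each model's predict_proba output for the positive class on the held-out test set.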

 

CONCLUSION

  • Customers aged between 30 and 60 are more likely to buy insurance.
  • Customers with a driving license have a higher chance of buying insurance.
  • Customers whose vehicle was damaged in the past are more likely to buy insurance.
  • Variables such as Age, Previously_Insured, and Annual_Premium have the strongest effect on the target variable.
  • Comparing the ROC curves, the Random Forest model performs better: its curve lies closer to the top-left corner, which indicates better performance.

Related Course Information
  Category: Data Science / Big Data
  Course: Teknologi Kecerdasan Artifisial (Artificial Intelligence Technology)