BEST EMAIL SPAM DETECTION MACHINE LEARNING MODEL

Bethelsando Gemilang Wahyudi


Summary

Email spam, also known as junk email, consists of unsolicited, unwanted, or irrelevant messages sent by email. Spammers typically send these messages in bulk, hoping to scam recipients out of money or trick them into revealing personal information. Spam emails may contain links to malicious websites or attachments that can harm your computer, so handle them with care. Most email providers use spam filters to protect users from this kind of email. Machine learning can also be used to classify an email as spam or not spam.

Description

  1. Download the dataset from Kaggle: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
  2. I built this model in Google Colaboratory, so I uploaded the dataset to Google Drive.

3. Mount Google Drive in Colab.
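Mounting Drive in Colab is usually a two-line call. The sketch below wraps it in a hypothetical helper, `mount_drive`, so that it also runs harmlessly outside Colab. The mount point `/content/drive` is the conventional choice and is an assumption here, not something stated in this write-up.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive inside Colab; return the mount point, or None elsewhere."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return None
    drive.mount(mount_point)  # prompts for authorization in the notebook
    return mount_point


mounted_at = mount_drive()
print("Drive mounted at:", mounted_at)
```

Outside Colab the helper returns `None`, which makes it easy to fall back to a local copy of the dataset.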

4. Import the libraries needed to process the dataset and load the data into a DataFrame.

5. Preprocess the DataFrame.

Once we know the dataset is clean, we continue to the next step.
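Steps 4 and 5 might look like the sketch below, assuming pandas is the library used. The write-up does not say which checks were run, so counting missing values and duplicate rows is an assumption. The tiny in-memory CSV and its column names (`Email No.`, word counts, `Prediction`) are stand-ins for the real file on Drive.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle CSV. With the real file this would be
# pd.read_csv(<path to the CSV on Drive>) instead of a StringIO buffer.
sample_csv = io.StringIO(
    "Email No.,the,to,free,Prediction\n"
    "Email 1,0,0,1,0\n"
    "Email 2,8,13,0,0\n"
    "Email 3,0,2,5,1\n"
    "Email 4,3,1,0,0\n"
)
df = pd.read_csv(sample_csv)

# Two common cleanliness checks (assumed, not taken from the write-up).
missing_values = int(df.isnull().sum().sum())
duplicate_rows = int(df.duplicated().sum())

print("Shape:", df.shape)
print("Missing values:", missing_values)
print("Duplicate rows:", duplicate_rows)
```

If both counts are zero, no rows need to be dropped or imputed before modelling.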

6. Get summary statistics from the DataFrame and list its columns.
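For this step, pandas offers `describe()` for summary statistics and the `columns` attribute for column names. The small DataFrame below is an illustrative stand-in, not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for the email DataFrame.
df = pd.DataFrame(
    {
        "the": [0, 8, 0, 3],
        "to": [0, 13, 2, 1],
        "Prediction": [0, 0, 1, 0],
    }
)

summary = df.describe()              # count, mean, std, min, quartiles, max
column_names = df.columns.tolist()   # plain Python list of column names

print(summary)
print(column_names)
```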

7. Define the X data (the features) from the DataFrame.

8. Define the y data (the result, or label) from the DataFrame.
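A minimal sketch of steps 7 and 8, assuming the label column is named `Prediction` and the identifier column is named `Email No.`. Both names are assumptions about the CSV layout, and the data is a stand-in.

```python
import pandas as pd

# Illustrative stand-in for the email DataFrame.
df = pd.DataFrame(
    {
        "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
        "the": [0, 8, 0, 3],
        "free": [1, 0, 5, 0],
        "Prediction": [1, 0, 1, 0],
    }
)

# X: every column except the identifier and the label (assumed names).
X = df.drop(columns=["Email No.", "Prediction"])

# y: the label column, here 1 for spam and 0 for not spam.
y = df["Prediction"]

print(X.shape, y.shape)
```

Dropping the identifier column keeps a non-numeric, non-informative field out of the feature matrix.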

  

9. Set up the classification using eleven algorithms.

10. Train each algorithm on the data.
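Steps 9 and 10 can be sketched as a loop over classifier objects that reports accuracy and log loss for each. This is an illustration under stated assumptions, not the author's notebook. It uses synthetic data from `make_classification` in place of the email features, and a 75/25 hold-out split that the write-up does not confirm. It covers only five of the eleven algorithms, all from scikit-learn, to stay short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email features (X) and spam labels (y).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A subset of the eleven algorithms, with default hyperparameters.
classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
    GaussianNB(),
]

results = {}
for clf in classifiers:
    name = clf.__class__.__name__
    clf.fit(X_train, y_train)
    # Accuracy uses hard predictions; log loss uses predicted probabilities.
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    loss = log_loss(y_test, clf.predict_proba(X_test))
    results[name] = (accuracy, loss)
    print("=" * 30)
    print(name)
    print("****Results****")
    print(f"Accuracy: {accuracy:.4%}")
    print(f"Log Loss: {loss}")
```

Collecting the scores in the `results` dictionary makes the later comparison step a simple sort.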

Training produces this output:

==============================
KNeighborsClassifier
****Results****
Accuracy: 87.0070%
Log Loss: 1.379305610485888
==============================
SVC
****Results****
Accuracy: 71.0750%
Log Loss: 0.4836649541561642
==============================
NuSVC
****Results****
Accuracy: 82.6759%
Log Loss: 0.3320301535134764
==============================
DecisionTreeClassifier
****Results****
Accuracy: 93.1168%
Log Loss: 2.3773790403302795
==============================
RandomForestClassifier
****Results****
Accuracy: 97.8345%
Log Loss: 0.16699441191493325
==============================
XGBClassifier
****Results****
Accuracy: 96.5197%
Log Loss: 0.12179309395210326
==============================
AdaBoostClassifier
****Results****
Accuracy: 96.2877%
Log Loss: 0.5251066199866752
==============================
GradientBoostingClassifier
****Results****
Accuracy: 96.7517%
Log Loss: 0.12360515066970783
==============================
GaussianNB
****Results****
Accuracy: 95.2823%
Log Loss: 1.6259912051490315
==============================
LinearDiscriminantAnalysis
****Results****
Accuracy: 72.3125%
Log Loss: 8.361442220781141
/usr/local/lib/python3.8/dist-packages/sklearn/discriminant_analysis.py:878: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")
==============================
QuadraticDiscriminantAnalysis
****Results****
Accuracy: 75.2514%
Log Loss: 8.547879695569543
==============================

 

11. Compare the accuracy of each algorithm to find the best machine learning model.

12. Identify the best algorithm.

13. From this chart we can see that RandomForestClassifier is the best model for this dataset by accuracy (97.8345%).
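The comparison in steps 11 to 13 can be reproduced from the numbers in the training output. The sketch below ranks the eleven classifiers by accuracy and also picks out the lowest log loss. The dictionary values are copied from that output; the variable names are illustrative.

```python
# (accuracy in %, log loss) for each classifier, copied from the training output.
reported = {
    "KNeighborsClassifier": (87.0070, 1.379305610485888),
    "SVC": (71.0750, 0.4836649541561642),
    "NuSVC": (82.6759, 0.3320301535134764),
    "DecisionTreeClassifier": (93.1168, 2.3773790403302795),
    "RandomForestClassifier": (97.8345, 0.16699441191493325),
    "XGBClassifier": (96.5197, 0.12179309395210326),
    "AdaBoostClassifier": (96.2877, 0.5251066199866752),
    "GradientBoostingClassifier": (96.7517, 0.12360515066970783),
    "GaussianNB": (95.2823, 1.6259912051490315),
    "LinearDiscriminantAnalysis": (72.3125, 8.361442220781141),
    "QuadraticDiscriminantAnalysis": (75.2514, 8.547879695569543),
}

# Rank by accuracy, highest first.
ranked = sorted(reported.items(), key=lambda item: item[1][0], reverse=True)
for name, (accuracy, loss) in ranked:
    print(f"{name:<30} {accuracy:8.4f}% {loss:10.4f}")

best_by_accuracy = ranked[0][0]
# Lower log loss means better-calibrated probabilities.
best_by_log_loss = min(reported, key=lambda name: reported[name][1])

print("Best accuracy:", best_by_accuracy)
print("Lowest log loss:", best_by_log_loss)
```

RandomForestClassifier ranks first on accuracy, while XGBClassifier has the lowest log loss, so the choice of "best" depends on which metric matters more for the application.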

Related Course Information
  Category: Artificial Intelligence
  Course: Infrastuktur Kecerdasan Artifisial (SIB AI-INFRA)