Anisah Fadhilah Putri
Fake news detection is one of the most important problems in the digital information age, which is flooded with inaccurate or misleading content. Classification algorithms can be used to label articles or news items as fake or real.
Here I use the Passive Aggressive Classifier (PAC), a machine learning algorithm suited to binary classification problems, where data is separated into two classes. In the context of fake news detection, it can learn the patterns present in news articles and classify each one as fake or real.
To read a dataset from Google Drive, the first step is to mount the drive with drive.mount from the google.colab library, so that the path containing the dataset becomes accessible.
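As a minimal sketch of the mounting step, assuming the notebook runs in Google Colab: the helper name `mount_drive_if_colab` and the mount point `/content/drive` are illustrative choices, not taken from the original notebook. Outside Colab the helper simply returns None, so the snippet stays runnable anywhere.

```python
from typing import Optional


def mount_drive_if_colab(mount_point: str = "/content/drive") -> Optional[str]:
    """Mount Google Drive when running in Colab; return the mount point, else None.

    The function name and default mount point are assumptions for illustration.
    """
    try:
        # google.colab is only importable inside a Google Colab runtime.
        from google.colab import drive
    except ImportError:
        return None
    drive.mount(mount_point)
    return mount_point


mounted_at = mount_drive_if_colab()
print("Drive mounted at:", mounted_at)
```

Once the drive is mounted, files stored on it can be read through ordinary file paths under the mount point.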
This step imports the libraries and modules needed to build the model and perform text analysis with the Passive Aggressive Classifier.
This step loads a dataset stored in a CSV file into a DataFrame object using the pandas library.
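A minimal sketch of loading CSV data into a DataFrame might look like the following. The dataset's file name and columns are not given here, so the snippet reads a tiny in-memory CSV instead; the column names `title`, `text`, and `label` are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV file. In the notebook, the path of
# the file on the mounted drive would be passed to read_csv instead.
# The column names are hypothetical.
csv_text = """title,text,label
Budget approved,The city council approved the budget.,REAL
Aliens run city,Aliens secretly run the council.,FAKE
Study released,Scientists released a climate study.,REAL
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows
```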
This step shows the number of empty values in each column of the dataset. If a column has empty values, their count is displayed next to the column name. This is useful for detecting empty values that need to be handled before further analysis or modeling.
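One common way to get such a per-column count in pandas is `isnull().sum()`; whether the original notebook used exactly this call is an assumption. The toy DataFrame below, with hypothetical column names, has one missing title and one missing text.

```python
import pandas as pd

# Toy data with deliberately missing entries (column names are hypothetical).
df = pd.DataFrame({
    "title": ["Budget approved", "Aliens run city", None],
    "text": ["The council approved the budget.", None, "Scientists released a study."],
    "label": ["REAL", "FAKE", "REAL"],
})

# isnull() marks each empty cell as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```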
This step imports the libraries required for visualization and model evaluation.
Here the dataset is split into two parts: X, which contains the news text, and y, which contains the classification labels. X and y are then used as inputs to train the classification model. The code also prints the number of news stories labeled 'FAKE' and 'REAL' to show the label distribution in the dataset.
The bar graph shows the label counts: 3164 FAKE articles and 3171 REAL articles.
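The feature/label split and the label count can be sketched as follows. The column names `text` and `label` are assumptions, and the five toy rows only illustrate the mechanics, not the real 3164/3171 distribution.

```python
import pandas as pd

# Toy stand-in for the news dataset (column names are hypothetical).
df = pd.DataFrame({
    "text": [
        "The council approved the budget.",
        "Aliens secretly run the council.",
        "Scientists released a climate study.",
        "Miracle cure hidden by doctors.",
        "Moon landing was staged in a studio.",
    ],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "FAKE"],
})

X = df["text"]   # features: the news text
y = df["label"]  # targets: the FAKE/REAL labels

# value_counts() tallies how many rows carry each label.
label_counts = y.value_counts()
print("FAKE:", label_counts["FAKE"])
print("REAL:", label_counts["REAL"])
```

With matplotlib installed, `label_counts.plot(kind="bar")` would draw these counts as a bar chart.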
The code calls the train_test_split function with X as the dataset features and y as the dataset labels. I also pass test_size=0.2 so that 20% of the dataset is held out for testing, and random_state=123 so that the split is the same every time the code is run. The call returns four variables: X_train (training data features), X_test (testing data features), y_train (training data labels), and y_test (testing data labels).
With this step, the dataset has been divided into training and testing data subsets that can be used to train and test the fake news detection model.
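A minimal sketch of this split, using the test_size and random_state values described above on ten made-up samples (the sample texts and labels are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten placeholder samples so that a 20% test split yields exactly two rows.
X = pd.Series([f"news article number {i}" for i in range(10)])
y = pd.Series(["FAKE", "REAL"] * 5)

# test_size=0.2 holds out 20% for testing; random_state=123 makes the
# shuffle reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(len(X_train), len(X_test))
```

Features and labels are shuffled together, so each row in X_train keeps its matching entry in y_train.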
Here I create a TfidfVectorizer object with the parameters stop_words='english', to remove common English words, and max_df=0.6, to ignore words that appear in more than 60% of the documents.
Next, the fit_transform method is applied to the training data X_train to learn its vocabulary and transform it into a TF-IDF matrix. The transform method is then applied to the test data X_test to turn it into a TF-IDF matrix using the same vocabulary learned from the training data.
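The vectorization step can be sketched on a toy corpus as follows. The documents are invented for illustration; the word "news" appears in four of the five training documents (80%) so that the effect of max_df=0.6 is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training documents; "news" occurs in 4 of 5 (80% > 60%).
train_docs = [
    "breaking news aliens secretly control the government",
    "news report city council approves school budget",
    "shocking news miracle cure hidden by doctors",
    "news update scientists publish climate study",
    "local team wins the championship game",
]
test_docs = ["scientists say aliens approve the budget"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.6)

# Learn the vocabulary from the training data only, then reuse it for the test data.
tfidf_train = vectorizer.fit_transform(train_docs)
tfidf_test = vectorizer.transform(test_docs)

print(tfidf_train.shape, tfidf_test.shape)
print("news" in vectorizer.vocabulary_)  # dropped by max_df
print("the" in vectorizer.vocabulary_)   # dropped as an English stop word
```

Fitting only on the training data keeps information from the test set out of the vocabulary, and both matrices end up with the same number of columns.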
Here I create a PassiveAggressiveClassifier object with the parameter max_iter=50, which sets the maximum number of passes over the training data during learning. The model is then trained with the fit method on the training TF-IDF matrix tfidf_train and the labels y_train.
After the model is trained, the predict method is used to make predictions on the test data tfidf_test, and the results are stored in the y_pred variable. Finally, the prediction accuracy is computed with the accuracy_score function from the scikit-learn library by comparing the actual labels y_test with the predicted labels y_pred.
With this step, the model has been trained and tested to perform classification on text data using the Passive Aggressive approach, and the prediction accuracy is printed.
In this initial run, the model reaches an accuracy of 0.94.
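The training and scoring steps can be sketched end to end on a handful of invented headlines. The accuracy printed for this toy data says nothing about the 0.94 reported above; the snippet only shows how the calls fit together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

# Invented headlines and labels, for illustration only.
X_train = [
    "aliens secretly control the government",
    "miracle cure hidden by doctors",
    "moon landing staged in secret studio",
    "city council approves school budget",
    "scientists publish climate study",
    "central bank raises interest rates",
]
y_train = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
X_test = ["secret miracle cure found by aliens", "council publish budget study"]
y_test = ["FAKE", "REAL"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.6)
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_test = vectorizer.transform(X_test)

# max_iter=50 caps the number of passes over the training data.
model = PassiveAggressiveClassifier(max_iter=50)
model.fit(tfidf_train, y_train)

y_pred = model.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f"Accuracy: {score:.2f}")
```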
Here I use the confusion_matrix function with y_test as the actual labels and y_pred as the predicted labels. I also pass labels=['FAKE', 'REAL'] to fix the order of the classes in the confusion matrix. Rows correspond to actual labels and columns to predicted labels, so with this class order and 'REAL' treated as the positive class, the four values read as true negative (TN) and false positive (FP) in the first row, and false negative (FN) and true positive (TP) in the second.
With this step, I can view and print the confusion matrix to evaluate the model's performance in fake news detection.
The confusion matrix plot displays, in each cell, the number of correct or incorrect predictions for that combination of actual and predicted label.
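To make the layout concrete, here is a small sketch with hand-written labels (not the model's real predictions). It uses the same labels=['FAKE', 'REAL'] ordering, so 'REAL', the second label, plays the role of the positive class when the matrix is unpacked.

```python
from sklearn.metrics import confusion_matrix

# Hand-written actual and predicted labels, for illustration only.
y_test = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

# Rows are actual labels, columns are predicted labels, both in the given order.
cm = confusion_matrix(y_test, y_pred, labels=["FAKE", "REAL"])
print(cm)

# Flattening row by row gives TN, FP, FN, TP with 'REAL' as the positive class.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The diagonal cells hold the correct predictions and the off-diagonal cells hold the mistakes.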