Effective Fake News Detection Using Logistic Regression: A High-Accuracy Approach

Fig: Fake News Prediction

Fig: Workflow

Abstract

The proliferation of fake news on digital platforms has become a significant challenge, necessitating the development of automated systems to detect and mitigate its impact. This study presents a Logistic Regression-based approach for fake news detection, leveraging a dataset of news articles. The model achieved a high accuracy of 98.95% on the training data and 97.76% on the test data, demonstrating its effectiveness in distinguishing between genuine and fake news. The simplicity and interpretability of Logistic Regression make it a strong candidate for real-world applications where transparency and quick decision-making are crucial.

1 Introduction

The digital age has led to an unprecedented increase in the dissemination of information. However, this has also given rise to the spread of misinformation or fake news, which can have severe societal impacts. The ability to automatically identify and filter out fake news is critical to maintaining the integrity of information. This paper explores the use of Logistic Regression, a statistical method for binary classification, to address the problem of fake news detection. By analyzing various features of news articles, the model predicts whether an article is likely to be fake or genuine.

2 Related Works

The issue of fake news detection has garnered significant attention in recent years. Researchers have employed various machine learning techniques, including Support Vector Machines (SVM), Naive Bayes, and deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to tackle this problem. While deep learning models often yield high accuracy, they require large amounts of data and computational resources. On the other hand, traditional machine learning models like Logistic Regression, although simpler, have proven to be effective in scenarios where interpretability and lower computational costs are desired.

3 Algorithm

Logistic Regression is a statistical method that models the probability of a binary outcome based on one or more predictor variables. The model assumes a linear relationship between the input features and the log-odds of the outcome. The logistic function, or sigmoid function, is used to map the predicted values to probabilities between 0 and 1.

Mathematically, the logistic function is defined as:

P(y=1∣X) = 1 / (1 + e^(−(β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ)))

Where:

P(y=1∣X) is the probability that the target variable y equals 1 given the input features X.

x₁, x₂, …, xₙ are the input features extracted from a news article.

β₀, β₁, …, βₙ are the coefficients of the model, learned during training.

The model is trained by maximizing the likelihood function, which measures how well the model fits the data. The cost function, known as the log-loss or binary cross-entropy, is minimized during the training process.
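As a minimal sketch (not the study's own code), the sigmoid mapping and the log-loss it is trained against can be written in a few lines of Python with NumPy:

import numpy as np

def sigmoid(z):
    # Map a real-valued score z = β0 + β1*x1 + ... + βn*xn to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-15):
    # Binary cross-entropy; eps keeps the probabilities away from exact 0 or 1
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Toy check: three articles with scores -2.0, 0.5, 3.0 and true labels 0, 1, 1
probs = sigmoid(np.array([-2.0, 0.5, 3.0]))      # approx. [0.119, 0.622, 0.953]
print(log_loss(np.array([0, 1, 1]), probs))      # approx. 0.217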

4 Methodology

The methodology involves several key steps:

Data Collection: The dataset used in this study consists of news articles labeled as fake or genuine. The data is split into training and test sets, with 80% used for training and 20% for testing.
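A split along these lines can be produced with scikit-learn's train_test_split; the file and column names below are assumptions for illustration, since the study does not specify them:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: a 'text' column and a binary 'label' column (0 = genuine, 1 = fake)
news = pd.read_csv('news.csv')
X_train, X_test, y_train, y_test = train_test_split(
    news['text'], news['label'],
    test_size=0.2,                       # the 80/20 split described above
    random_state=42,
    stratify=news['label'])              # preserve the class balance in both sets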

Data Preprocessing: The articles undergo preprocessing, including the removal of stop words, stemming, and vectorization. The vectorization process converts the text into numerical features using methods such as TF-IDF (Term Frequency-Inverse Document Frequency).
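One plausible implementation of this step, assuming NLTK for stop-word removal and Porter stemming (the study names the techniques but not the tooling), continues the sketch above:

import re
from nltk.corpus import stopwords        # requires a one-time nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Keep letters only, lowercase, drop stop words, and stem each remaining token
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train.apply(preprocess))   # fit on training text only
X_test_vec = vectorizer.transform(X_test.apply(preprocess))         # reuse the fitted vocabulary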

Model Training: Logistic Regression is employed to train the model on the processed dataset. The training process involves optimizing the model coefficients to minimize the log-loss function.
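In scikit-learn this step reduces to two lines; max_iter is raised from the default only to guarantee convergence and is not a value reported by the study:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)   # the solver minimizes log-loss internally
model.fit(X_train_vec, y_train)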

Model Evaluation: The trained model is evaluated on the test set to assess its performance. Accuracy, precision, recall, and F1-score are calculated to provide a comprehensive evaluation of the model.
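Continuing the same sketch, all four metrics are available directly from sklearn.metrics:

from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_vec)
print('Test accuracy:', accuracy_score(y_test, y_pred))
# classification_report prints per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred, target_names=['genuine', 'fake']))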

5 Experimental Work

The experiments were conducted using a standard dataset of news articles. The dataset was preprocessed to remove noise and irrelevant features. Logistic Regression was implemented using the scikit-learn library in Python. The model was trained on 80% of the data and tested on the remaining 20%. The results showed that the model performed exceptionally well, with an accuracy of 98.95% on the training data and 97.76% on the test data.
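The same flow can be packaged as a single scikit-learn Pipeline, so that raw text goes in and predictions come out; this is one reasonable setup, not necessarily the study's exact configuration:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),   # vectorize raw article text
    ('clf', LogisticRegression(max_iter=1000)),         # classify as genuine or fake
])
pipeline.fit(X_train, y_train)
print('Train accuracy:', pipeline.score(X_train, y_train))
print('Test accuracy:', pipeline.score(X_test, y_test))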

6 Results

The Logistic Regression model demonstrated high accuracy in predicting fake news. The accuracy on the training set was 98.95%, indicating that the model learned the underlying patterns in the data effectively. The test accuracy was slightly lower at 97.76%; the small gap of about 1.2 percentage points between the two suggests that the model generalizes well to unseen data rather than overfitting. These results support Logistic Regression as a practical tool for fake news detection, especially in scenarios where interpretability and quick decision-making are required.

7 Conclusion

This study presents a Logistic Regression-based approach for fake news detection, achieving high accuracy on both training and test datasets. The results indicate that Logistic Regression is a viable option for real-world fake news detection systems, offering a balance between simplicity, interpretability, and performance. Future work could explore the integration of more complex features and hybrid models to further enhance detection accuracy.


 
