Heart Disease Prediction Using Logistic Regression: A Comprehensive Approach for Early Diagnosis

Abstract

Heart disease remains one of the leading causes of mortality worldwide, emphasizing the importance of early diagnosis and prediction. This study utilizes Logistic Regression, a widely used classification algorithm, to predict the likelihood of heart disease. The model was trained and evaluated using a publicly available dataset, achieving an accuracy of 85.12% on the training data and 81.96% on the test data. This paper presents the methodology, experimental work, and results, highlighting Logistic Regression's capability to provide reliable predictions in healthcare applications.

Introduction

Cardiovascular diseases (CVD) are a significant global health burden, accounting for a large proportion of deaths annually. Predicting heart disease risk at an early stage can lead to timely intervention and treatment, reducing the likelihood of severe outcomes. Traditionally, medical practitioners rely on risk factors such as age, cholesterol levels, and blood pressure to diagnose heart disease. However, machine learning algorithms can provide more accurate and data-driven predictions by analyzing complex interactions between variables. Logistic Regression is a powerful yet interpretable machine learning algorithm suitable for binary classification problems, such as heart disease prediction. This study explores the use of Logistic Regression in predicting heart disease and evaluates its effectiveness using real-world clinical data.

Related Works

Several machine learning algorithms have been applied to predict heart disease, including Decision Trees, Random Forest, Support Vector Machines, and Neural Networks. Logistic Regression, due to its simplicity and effectiveness, has been a popular choice in medical studies. Previous research has demonstrated that Logistic Regression performs well in predicting binary outcomes such as heart disease presence or absence, especially when datasets contain both continuous and categorical variables. Studies like "Heart Disease Prediction Using Logistic Regression and Neural Networks" and "A Review of Machine Learning Techniques for Heart Disease Diagnosis" highlight its efficacy in producing accurate predictions with minimal computational cost. Compared to more complex models like Neural Networks, Logistic Regression offers better interpretability, making it ideal for use in healthcare.

Algorithm

Logistic Regression

Logistic Regression is a statistical method used for binary classification problems. It models the probability that a given input belongs to a particular class (e.g., presence or absence of heart disease). The model calculates a weighted sum of the input features and passes the result through a logistic function to produce a probability between 0 and 1. If the predicted probability exceeds a threshold (typically 0.5), the output is classified as 1 (positive), otherwise 0 (negative). The logistic function is defined as:

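$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

where:

h_θ(x) is the predicted probability that the input belongs to the positive class,

θ represents the weights (coefficients) learned by the model,

x is the input feature vector.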

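The model is trained using maximum likelihood estimation: the weights θ are chosen to maximize the likelihood of the observed labels, which is equivalent to minimizing the binary cross-entropy (log loss) between the predicted probabilities and the actual outcomes.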
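The logistic function, the threshold rule, and the loss used in training can be sketched in a few lines of NumPy. This is a minimal illustration only: the weights, inputs, and labels below are made-up values, not taken from the study, and the first column of X is assumed to be a constant 1 acting as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Probability of the positive class: sigmoid of the weighted sum theta^T x."""
    return sigmoid(X @ theta)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Average negative log-likelihood of the labels under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Illustrative values only (assumptions, not data from the study).
theta = np.array([-1.0, 2.0])            # [intercept, weight of one feature]
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                # first column is the intercept term
y = np.array([0, 1, 1])

probs = predict_proba(theta, X)           # one probability per row of X
labels = (probs > 0.5).astype(int)        # class 1 if probability exceeds 0.5
loss = binary_cross_entropy(y, probs)

print(probs, labels, loss)
```

Maximum likelihood training searches for the theta that makes this loss as small as possible on the training data.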

Methodology

Data Collection: The dataset used for this study is the "Heart Disease" dataset, containing clinical data on patients such as age, sex, cholesterol levels, blood pressure, and more. The target variable indicates whether the patient has heart disease (1) or not (0).

Data Preprocessing: The dataset was cleaned by handling missing values, encoding categorical variables (such as sex and chest pain type), and scaling numerical features like cholesterol and age. The dataset was then split into training (80%) and testing (20%) sets to evaluate model performance.

Model Training: Logistic Regression was used as the classification algorithm. The model was trained on the training data using the sklearn library. The weights were learned by minimizing the binary cross-entropy loss function. Hyperparameters such as regularization (L2 penalty) were tuned using cross-validation to avoid overfitting.

Model Evaluation: The trained Logistic Regression model was evaluated on both the training and testing datasets using accuracy, precision, recall, and F1-score as performance metrics. The accuracy on the training data was 85.12%, while on the test data, it achieved 81.96%.
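The four steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the code used in the study: the data is synthetic, the column names (age, chol, sex, cp, target) are hypothetical stand-ins for the dataset's fields, and the imputation strategies and the grid of C values are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the clinical data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(30, 80, n).astype(float),
    "chol": rng.normal(240, 50, n),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),   # chest pain type, treated as categorical
})
logit = 0.05 * (df["age"] - 55) + 0.01 * (df["chol"] - 240) + 0.5 * df["sex"]
df["target"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
df.loc[rng.choice(n, 10, replace=False), "chol"] = np.nan  # simulate missing values

numeric = ["age", "chol"]
categorical = ["sex", "cp"]

# Preprocessing: impute missing values, scale numeric features, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# LogisticRegression applies an L2 penalty by default.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# 80/20 split, stratified so both sets keep the same class balance.
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tune the regularization parameter C with 5-fold cross-validation on the training set.
search = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

def evaluate(estimator, X_part, y_part):
    """Return accuracy, precision, recall, and F1 for one data split."""
    pred = estimator.predict(X_part)
    return {
        "accuracy": accuracy_score(y_part, pred),
        "precision": precision_score(y_part, pred, zero_division=0),
        "recall": recall_score(y_part, pred, zero_division=0),
        "f1": f1_score(y_part, pred, zero_division=0),
    }

scores = {
    "train": evaluate(search.best_estimator_, X_train, y_train),
    "test": evaluate(search.best_estimator_, X_test, y_test),
}
for split, metrics in scores.items():
    print(split, {k: round(v, 4) for k, v in metrics.items()})
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted only on the training folds during cross-validation, so no information from held-out data leaks into the model.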

Experimental Work

The experimental setup involved the following key steps:

Data Analysis: Exploratory Data Analysis (EDA) was conducted to understand the distribution of the features and their correlation with heart disease. Visualizations such as histograms and correlation matrices were used to explore patterns and relationships.

Model Training and Validation: Logistic Regression was implemented using the LogisticRegression class from the sklearn library. The data was split using the train_test_split function, with 80% used for training and 20% for testing. Cross-validation was performed to tune the regularization parameter C, which in sklearn is the inverse of the regularization strength (smaller values mean stronger regularization).

Feature Importance: The coefficients learned by the Logistic Regression model were analyzed to determine the importance of each feature in predicting heart disease. Factors such as cholesterol level, age, and maximum heart rate were found to be significant predictors.
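One way to inspect coefficients as described in the feature importance step is sketched below. It is an illustration on synthetic data, not the study's analysis: the feature names are hypothetical labels, and the simulated effect sizes are arbitrary. Coefficient magnitudes are only comparable across features when the features are on the same scale, which is why the inputs are standardized first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names attached to synthetic columns.
feature_names = ["age", "chol", "thalach"]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
# Simulated ground truth: strong effect for column 0, weak for column 1, none for column 2.
true_logit = 2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Standardize so that coefficient magnitudes can be compared across features.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefs = clf.coef_[0]
# Rank features by absolute coefficient; exp(coef) is the odds ratio per one
# standard deviation increase in the feature.
ranked = sorted(zip(feature_names, coefs, np.exp(coefs)),
                key=lambda item: abs(item[1]), reverse=True)

for name, coef, odds_ratio in ranked:
    print(f"{name:8s} coef={coef:+.3f} odds_ratio={odds_ratio:.3f}")
```

A positive coefficient raises the predicted odds of heart disease as the feature increases, and a negative one lowers them.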

Results

The Logistic Regression model achieved the following results:

Accuracy on Training Data: 85.12%

Accuracy on Test Data: 81.96%

Additional performance metrics:

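Precision: The model demonstrated good precision, meaning that most of the patients it flagged as having heart disease actually had it (few false positives among its positive predictions).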

Recall: Recall was also high, meaning the model was able to detect most cases of heart disease.

F1-Score: The balance between precision and recall was captured by the F1-Score, further confirming the model's reliability.
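For reference, the three metrics above can be computed directly from confusion-matrix counts. The counts in this sketch are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
# Hypothetical confusion-matrix counts (not results from the study).
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)            # share of all predictions that are correct
precision = tp / (tp + fp)                            # share of predicted positives that are correct
recall = tp / (tp + fn)                               # share of actual positives that are detected
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
false_positive_rate = fp / (fp + tn)                  # share of actual negatives flagged as positive

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fpr={false_positive_rate:.4f}")
```

Note that precision and the false-positive rate use different denominators: precision is measured over predicted positives, while the false-positive rate is measured over actual negatives.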

The model's performance indicates that Logistic Regression is a suitable method for predicting heart disease, with an acceptable trade-off between simplicity and accuracy.

Conclusion

This study demonstrated the effectiveness of Logistic Regression in predicting heart disease using clinical data. The model achieved an accuracy of 85.12% on the training data and 81.96% on the test data, highlighting its capability to provide accurate predictions. The simplicity and interpretability of Logistic Regression make it a valuable tool for healthcare professionals, as it allows them to identify important risk factors and make data-driven decisions. Future work could involve comparing Logistic Regression with more complex models like Support Vector Machines or Neural Networks to further enhance prediction accuracy.

References

· Alizadehsani, R., et al. (2018). "A Review of Machine Learning Techniques for Heart Disease Diagnosis and Prediction." Journal of Medical Systems, 42(7), 1-13.

· Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

· James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.

· Khosravi, A., et al. (2020). "Heart Disease Prediction Using Data Mining Techniques." International Journal of Advanced Computer Science and Applications, 11(6), 23-28.

· Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830.
