Breast Cancer Classification Using Logistic Regression: A Comprehensive Analysis and Performance Evaluation

 

Abstract

Breast cancer classification is a critical task in medical diagnostics, aiding in early detection and treatment planning. This study presents a breast cancer classification model that uses logistic regression to predict malignancy from diagnostic features. The model achieved an accuracy of 94.95% on the training data and 92.98% on the test data. The results show that logistic regression distinguishes benign from malignant cases well, demonstrating its potential as a reliable tool in medical decision-making.

Introduction

Breast cancer remains one of the leading causes of cancer-related deaths worldwide. Early detection and accurate classification of breast cancer can significantly improve patient outcomes and treatment effectiveness. Logistic regression, a statistical method used for binary classification problems, has shown promise in medical diagnostics due to its simplicity and interpretability. This study explores the application of logistic regression in classifying breast cancer cases, assessing its performance, and comparing it to other classification methods.

Related Works

  • "Breast Cancer Diagnosis and Prognosis Using Machine Learning: A Survey" (2019) reviewed various machine learning techniques, including logistic regression, for breast cancer diagnosis and prognosis, highlighting their strengths and limitations.
  • "Application of Logistic Regression in Medical Diagnosis: A Case Study of Breast Cancer" (2020) explored the effectiveness of logistic regression models in medical diagnostics, focusing on breast cancer classification.
  • "Comparative Study of Classification Techniques for Breast Cancer Detection" (2021) compared several classification algorithms, including logistic regression, to evaluate their performance in breast cancer detection.

Algorithm: Logistic Regression

Logistic regression is a statistical model used for binary classification. It estimates the probability that a given input belongs to a certain class using the logistic function.

Key Components:

  • Logistic Function: The logistic function, or sigmoid function, maps any real-valued number into a value between 0 and 1, which can be interpreted as a probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z$ is a linear combination of the input features.

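  • Model Equation: The logistic regression model predicts the probability $P(Y=1 \mid X)$ using:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$

Where $\beta_0$ is the intercept and $\beta_i$ are the coefficients for each feature $X_i$.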

  • Cost Function: The cost function used to train the model is the binary cross-entropy loss, which measures the difference between predicted probabilities and actual outcomes.
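
The three components above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used in the study: the function names (`sigmoid`, `predict_proba`, `binary_cross_entropy`) and the clipping constant `eps` are assumptions introduced here.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, intercept, coefs):
    """P(Y=1 | X=x) for one sample: sigmoid of the linear combination z."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

With all coefficients at zero the model outputs a probability of 0.5 for every sample, and the loss equals ln 2 regardless of the labels, which is a convenient sanity check before training.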

Methodology

  1. Dataset Collection:
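    • The dataset used for this study contains diagnostic features of breast cancer cases (see the Breast Cancer Wisconsin (Diagnostic) Dataset in the references). It is split into a training set, used to fit the model, and a test set, used only for evaluation.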
  2. Data Preprocessing:
    • Data Cleaning: Handled missing values and removed irrelevant features.
    • Feature Scaling: Standardized features to zero mean and unit variance so that features with large numeric ranges do not dominate model training.
  3. Model Training:
    • Logistic Regression Implementation: The logistic regression model was trained on the training dataset using standard optimization techniques to find the best coefficients.
  4. Model Evaluation:
    • Accuracy: Evaluated the model’s performance using accuracy metrics on both training and test datasets.
    • Confusion Matrix: Analyzed true positives, true negatives, false positives, and false negatives to assess model performance.
  5. Performance Metrics:
    • Accuracy: The proportion of correctly classified instances out of the total instances.
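    • Precision and Recall: Precision is the proportion of cases predicted as malignant that are actually malignant, TP / (TP + FP). Recall is the proportion of actual malignant cases that the model detects, TP / (TP + FN).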
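
The scaling and evaluation steps above can be illustrated with a short standard-library sketch. It is not the study's implementation: the helper names are hypothetical, and malignant cases are assumed to be encoded as 1 and benign cases as 0.

```python
import statistics


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, computed from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Constant features have zero spread; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted on the training set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 as the positive (malignant) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, and recall derived from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration only; these are not results from the study.
example_true = [1, 1, 1, 0, 0, 0, 0, 1]
example_pred = [1, 1, 0, 0, 0, 1, 0, 1]
example_metrics = metrics(example_true, example_pred)
```

Fitting the scaler on the training rows and reusing the same means and standard deviations for the test rows keeps information from the test set out of training.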

Experimental Work

  1. Exploratory Data Analysis (EDA):
    • Conducted EDA to understand the dataset's structure, feature distributions, and relationships between variables.
  2. Model Training and Validation:
    • Trained the logistic regression model on the training dataset and validated it using cross-validation techniques to ensure generalizability.
  3. Performance Evaluation:
    • The model's performance was evaluated based on accuracy scores and other relevant metrics to gauge its effectiveness in classifying breast cancer cases.
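
As a rough sketch of the training and cross-validation steps, the code below fits a logistic regression with batch gradient descent on the cross-entropy loss and scores it with k-fold cross-validation. Everything here is an assumption made for illustration: the study does not state its optimizer, learning rate, or number of folds, and the one-feature toy data stands in for the real diagnostic features.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean binary cross-entropy loss."""
    n, n_features = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * n_features
    for _ in range(epochs):
        grad_intercept, grad_coefs = 0.0, [0.0] * n_features
        for row, target in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(coefs, row))
            error = sigmoid(z) - target  # derivative of the loss with respect to z
            grad_intercept += error
            for j, v in enumerate(row):
                grad_coefs[j] += error * v
        intercept -= lr * grad_intercept / n
        coefs = [b - lr * g / n for b, g in zip(coefs, grad_coefs)]
    return intercept, coefs


def predict(X, intercept, coefs, threshold=0.5):
    """Predict class 1 when the estimated probability reaches the threshold."""
    return [
        1 if sigmoid(intercept + sum(b * v for b, v in zip(coefs, row))) >= threshold else 0
        for row in X
    ]


def k_fold_accuracy(X, y, k=5, seed=0):
    """Held-out accuracy for each of k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        intercept, coefs = train_logistic_regression(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        preds = predict([X[i] for i in held_out], intercept, coefs)
        correct = sum(1 for p, i in zip(preds, held_out) if p == y[i])
        scores.append(correct / len(held_out))
    return scores


# Toy, linearly separable data with a single feature (not the study's dataset).
toy_values = [i / 5 for i in range(-12, 13) if abs(i) >= 3]
toy_X = [[v] for v in toy_values]
toy_y = [1 if v > 0 else 0 for v in toy_values]
fold_scores = k_fold_accuracy(toy_X, toy_y, k=5)
```

Averaging the per-fold scores gives a single estimate of how the model might perform on unseen data, and the spread across folds hints at how stable that estimate is.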

Results

  • Training Accuracy: 94.95%
  • Test Accuracy: 92.98%
  • Confusion Matrix Analysis: Provided insights into the model’s strengths and weaknesses in detecting malignant and benign cases.

Conclusion

The logistic regression model classified breast cancer cases with high accuracy on both the training data (94.95%) and the test data (92.98%). The gap of about two percentage points between the two scores suggests that the model generalizes well, with only mild overfitting. Logistic regression, with its interpretability and efficiency, proves to be a valuable tool in medical diagnostics. Future work could explore ensemble methods and other advanced algorithms to further enhance classification performance.

References

  1. Breast Cancer Wisconsin (Diagnostic) Dataset. (2018). UCI Machine Learning Repository.
  2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer.
  3. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
  4. Iglewicz, B., & Hoaglin, D. C. (2003). How to Detect and Handle Outliers. Springer.
