Titanic Survival Prediction Using Logistic Regression: A Data-Driven Approach to Understand Survival Factors
Titanic Survival Prediction Using Logistic Regression: A Data-Driven Approach to Understand Survival Factors
Abstract
The Titanic disaster remains one of the most infamous maritime tragedies, and its dataset provides a valuable opportunity to study the factors influencing survival rates using data analysis. In this study, we employ Logistic Regression, a widely-used statistical classification algorithm, to predict the survival of passengers aboard the Titanic. Using features such as passenger class, age, gender, and other socio-economic factors, the Logistic Regression model achieved an accuracy of 78.21% on the test data. The findings suggest that gender, class, and age were significant factors affecting survival, offering insights into the predictive power of statistical modeling for classification problems.
Introduction
The RMS Titanic sank in the early hours of April 15, 1912, during its maiden voyage, resulting in over 1,500 deaths. Many efforts have been made to analyze the factors that contributed to the survival of some passengers and the unfortunate fate of others. Machine learning techniques have been used to investigate the influence of different socio-economic factors such as gender, age, class, and fare on survival.
Logistic Regression, a popular method for binary classification, provides a means to model the probability of survival as a function of various input features. In this study, we use Logistic Regression to predict whether a passenger survived the Titanic disaster. The goal is to analyze how different features impact survival, while also evaluating the model’s performance in making accurate predictions.
Related Works
• In the work "Predicting Titanic Survival Using Machine Learning Techniques" (2017), several machine learning models, including Logistic Regression, Decision Trees, and Random Forests, were used to classify passengers as survivors or non-survivors, with Logistic Regression achieving a respectable performance.
• "A Comparative Study of Machine Learning Algorithms for Predicting Titanic Survivors" (2019) applied Logistic Regression, Support Vector Machines, and K-Nearest Neighbors to the Titanic dataset, demonstrating that Logistic Regression is a reliable model for this type of binary classification problem due to its interpretability.
• In "Exploring Predictive Models for Titanic Survival" (2020), Logistic Regression was compared with other classification models such as Naive Bayes and Gradient Boosting, highlighting that Logistic Regression is competitive for small-to-medium-sized datasets.
Algorithm: Logistic Regression
Logistic Regression is a statistical method used for binary classification, where the outcome is categorical, typically 0 or 1. It estimates the probability that a given input belongs to a particular class by applying a logistic function to a linear combination of the input features.
Logistic Function:
The logistic function is expressed as:
The objective of Logistic Regression is to find the optimal values for the coefficients that best predict the survival probability. This is done using Maximum Likelihood Estimation (MLE), which seeks to maximize the likelihood that the model’s predictions are consistent with the observed outcomes.
Methodology
1. Dataset Collection:
o The Titanic dataset was obtained from the Kaggle Titanic competition, which contains records of 891 passengers, including details such as age, sex, class, fare, and whether they survived.
2. Data Preprocessing:
o Handling Missing Data: Missing values, particularly in the "Age" and "Embarked" columns, were imputed. Age was filled with the median value, while missing embarkation ports were filled with the mode.
o Encoding Categorical Variables: Categorical variables such as "Sex" and "Embarked" were converted to numerical values using one-hot encoding.
o Feature Scaling: Features such as "Fare" were normalized to ensure they had similar ranges, improving model convergence during training.
3. Feature Selection:
o The features used in the model were: Passenger Class (Pclass), Gender (Sex), Age (Age), Number of Siblings/Spouses (SibSp), Number of Parents/Children (Parch), Fare (Fare), and Embarked (Embarked).
o These features were selected based on their relevance to the prediction task, as supported by previous literature.
4. Model Training:
o A Logistic Regression model was trained on 80% of the data, with 20% held out as a test set to evaluate model performance.
o The regularization parameter CCC was tuned to prevent overfitting, using cross-validation.
o The Logistic Regression model was implemented using Python’s scikit-learn library.
5. Model Evaluation:
o The model was evaluated based on accuracy, precision, recall, and F1-score. Accuracy was the primary metric, but other metrics were also calculated to provide a more comprehensive evaluation of the model.
Experimental Work
1. Exploratory Data Analysis (EDA):
o Initial analysis showed that women had a higher survival rate than men, with more passengers from the first class surviving than those from the lower classes.
o Passengers between the ages of 18 and 35 had a better chance of survival compared to younger children or elderly individuals.
2. Model Training:
o The Logistic Regression model was trained using a regularization parameter C=1.0C = 1.0C=1.0. The model converged after a few iterations, demonstrating stability during training.
o The training process yielded a model that effectively captured the relationship between the input features and the likelihood of survival.
3. Performance on Test Data:
o The model achieved an accuracy of 78.21% on the test data, indicating that it was able to correctly classify survivors and non-survivors in approximately 78% of cases.
o Precision, recall, and F1-score were also computed to evaluate the model's performance across different metrics.
Results
The Logistic Regression model achieved the following results:
• Accuracy on Test Data: 78.21%
• Precision: 0.79
• Recall: 0.74
• F1-Score: 0.76
The model performed well in classifying Titanic passengers based on their likelihood of survival, with accuracy comparable to other machine learning models previously applied to the dataset. Gender, class, and age emerged as the most significant features in determining survival.
Conclusion
This study demonstrates that Logistic Regression can effectively predict the likelihood of survival on the Titanic based on key socio-economic and demographic features. The model achieved a test accuracy of 78.21%, which aligns with expectations for binary classification problems on similar datasets.
The findings suggest that gender, class, and age were the most influential factors in determining a passenger's chance of survival, with women, younger passengers, and those in first class having a higher probability of survival. These results are consistent with historical accounts of the Titanic disaster.
In future work, more advanced models like Random Forest or Gradient Boosting could be explored to improve the accuracy of survival predictions. Additionally, incorporating more granular data on the conditions during the disaster (e.g., location on the ship, proximity to lifeboats) may further enhance predictive performance.
References
1. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830.
2. Klein, M. (2017). "Predicting Titanic Survival Using Machine Learning Techniques." Kaggle Notebooks.
3. Brownlee, J. (2020). Logistic Regression for Machine Learning. Machine Learning Mastery.
4. Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
5. Surviving Titanic Dataset. (2017). Kaggle Titanic Competition.
To view code: Click Here
Comments
Post a Comment