CHAPTER 1
**Introduction**
Divorce is a significant life event that affects numerous individuals and families worldwide. With the rise in divorce rates over the years, studying divorce data can provide valuable insights into the underlying factors and trends impacting marital relationships. In this blog post, we will delve into a divorce dataset to uncover intriguing patterns, explore key variables, and gain a deeper understanding of the dynamics surrounding divorce.
The dataset will be downloaded from the Kaggle data site, and logistic regression will be applied to it to evaluate the accuracy, precision, and F1-score of the results.
AIM
The aim of using logistic regression to predict divorce is to develop a statistical model that can accurately classify individuals or couples as either likely to divorce or not, based on a set of predictor variables.
Objectives:
- Dataset Collection: Gather a dataset that includes relevant variables associated with marriages, such as demographic information, relationship characteristics, and individual traits.
- Data Preprocessing: Clean and preprocess the collected dataset by handling missing values, dealing with outliers, and performing necessary transformations or feature engineering.
- Variable Selection: Identify and select the most significant predictor variables that are likely to influence the outcome variable (divorce) based on domain knowledge or statistical techniques such as feature importance or correlation analysis.
- Model Building: Utilize linear logistic regression to build a predictive model that estimates the probability of divorce based on the selected predictor variables.
- Model Training and Evaluation: Split the dataset into training and testing sets, and train the logistic regression model using the training set. Evaluate the model's performance by applying appropriate evaluation metrics such as accuracy, precision, recall, and F1-score on the testing set.
- Model Interpretation: Analyze the coefficients of the logistic regression model to understand the direction and strength of the relationships between the predictor variables and the likelihood of divorce.
- Prediction and Validation: Use the trained model to predict divorce probabilities for new, unseen data. Validate the model's predictive capabilities by comparing the predicted probabilities with the actual divorce outcomes.
- Model Optimization: Fine-tune the logistic regression model by adjusting hyperparameters, exploring different feature combinations, or employing regularization techniques to enhance its predictive accuracy.
- Communication and Reporting: Summarize the findings and conclusions of the logistic regression analysis in a clear and interpretable manner, highlighting the significant predictors and the model's overall performance in predicting divorce.
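The model-optimization objective above can be sketched with scikit-learn's grid search over the inverse regularization strength `C`. This is a minimal sketch, not the post's actual code: since the Kaggle dataset is not included here, a synthetic binary-classification dataset stands in for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the divorce dataset (binary target).
X, y = make_classification(n_samples=170, n_features=10, random_state=7)

# Tune the inverse regularization strength C with 5-fold cross-validation,
# scoring on F1 as mentioned in the objectives.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring="f1")
grid.fit(X, y)
print(grid.best_params_)
```

Smaller values of `C` apply stronger regularization, which is one way to counter the overfitting discussed later in the post.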
CHAPTER 2
Methodology
Methodology for the divorce dataset using StandardScaler and logistic regression:
i) Dataset Preparation:
Obtain the divorce dataset, ensuring it contains relevant variables for analysis, such as demographic factors, reasons for divorce, and other pertinent information.
Clean the dataset by removing any duplicate or irrelevant entries and handling missing data appropriately, such as through imputation or exclusion.
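The cleaning step above (removing duplicates, imputing missing values) can be sketched with pandas. The small frame below is a hypothetical stand-in for the raw Kaggle CSV, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the raw dataset.
df = pd.DataFrame({"age":     [25, 25, 31, np.nan, 40],
                   "divorce": [0,  0,  1,  1,      0]})

df = df.drop_duplicates()                         # remove duplicate entries
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
print(df.shape)
```

Median imputation is only one option; dropping incomplete rows with `dropna()` is the usual alternative when missingness is rare.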
ii) Variable Selection:
Identify the dependent variable: in this case, a binary variable indicating whether a divorce occurred (e.g., 0 for no divorce, 1 for divorce).
Select relevant independent variables: consider variables such as age, education level, employment status, reasons for divorce, or any other factors hypothesized to influence divorce likelihood.
iii) Data Exploration:
Conduct exploratory data analysis to gain insights into the distribution, relationships, and summary statistics of the variables.
Identify any potential outliers or influential observations that might affect the analysis.
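A common way to flag the potential outliers mentioned above is the 1.5 × IQR rule. The series below is a hypothetical numeric column, used only to illustrate the technique.

```python
import pandas as pd

# Hypothetical numeric column from the dataset.
s = pd.Series([2, 3, 3, 4, 4, 4, 5, 30])

print(s.describe())  # summary statistics for exploratory analysis

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())
```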
iv) Logistic Regression Modelling:
Split the dataset into a training set and a testing/validation set to evaluate the model's performance.
a) Apply logistic regression, a statistical modelling technique suitable for binary dependent variables.
b) Choose the appropriate estimation method (e.g., maximum likelihood estimation) to estimate the model parameters.
c) Fit the logistic regression model using the training set, taking the dependent variable (divorce) and selected independent variables as inputs.
d) Evaluate the significance and interpret the coefficients of the independent variables to understand their impact on divorce likelihood.
e) Assess goodness-of-fit measures, such as the Hosmer-Lemeshow test or AIC/BIC, to evaluate the model's overall fit.
v) Model Evaluation and Validation:
Apply the trained model to the testing/validation set and assess its predictive accuracy and performance metrics (e.g., accuracy, precision, recall, F1 score, ROC-AUC).
Validate the model results through cross-validation techniques (e.g., k-fold cross-validation) to ensure the stability and generalizability of the findings.
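The k-fold cross-validation mentioned above is a one-liner in scikit-learn; again, a synthetic dataset stands in for the real one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the divorce dataset.
X, y = make_classification(n_samples=170, n_features=10, random_state=7)

# 5-fold cross-validation: a low standard deviation across folds
# suggests the accuracy estimate is stable.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```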
vi) Interpretation of Results:
Analyse the coefficients and odds ratios of the independent variables to understand their impact on divorce likelihood.
Identify the significant predictors and their direction of influence (positive or negative) on the probability of divorce.
Consider the practical implications of the results and how they align with existing knowledge or theories in the field.
vii) Limitations and Further Analysis:
Acknowledge any limitations of the analysis, such as data biases, unobserved factors, or potential confounding variables.
Consider further analysis, such as interaction effects, subgroup analysis, or additional statistical techniques (e.g., regularization, model comparison) to enhance the understanding of divorce dynamics.
CHAPTER 3
RESULT
Importing the logistic regression library:
We import the logistic regression model from scikit-learn's linear_model module. To evaluate the model's performance using a confusion matrix, we also need to import the confusion_matrix function from the metrics module.
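The two imports just described look like this:

```python
# Model and evaluation utilities from scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
```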
When utilizing logistic regression for predicting divorce, it is important to partition the available data into training and testing sets. In this scenario, the data will be divided with 80% allocated to the training data and 20% to the testing data.
To ensure consistency and reproducibility in the data splitting process, the random state will be set to 7.
Dividing the data in this manner allows us to effectively train the logistic regression model on a substantial portion of the dataset. The training data, comprising 80% of the dataset, will be used to estimate the model's parameters and capture the underlying patterns and relationships between the input variables and the target variable (divorce prediction).
By setting the random state to 7, we ensure that the data splitting process is reproducible. The random state serves as a seed, ensuring that the same split is obtained each time the code is executed. This consistency aids in replicating and validating the model's performance.
Once the model is trained on the training data, the remaining 20% of the data (the testing data) will be employed to evaluate the model's performance. By assessing the model's predictions against the actual divorce outcomes in the testing data, we can gauge the model's ability to generalize to unseen data.
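The split-and-train procedure described above can be sketched as follows. Since the Kaggle data is not reproduced here, a synthetic dataset stands in for the scaled divorce features; the 80/20 split and `random_state=7` match the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled divorce features.
X, y = make_classification(n_samples=170, n_features=10, random_state=7)

# 80% training data, 20% testing data, seeded for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# Fit on the training set, then score on the held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data
```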
Splitting the data into training and testing sets is vital in logistic regression as it helps assess the model's accuracy, precision, recall, and other performance metrics on independent data. This division enables us to identify potential issues like overfitting or underfitting and make informed decisions about the model's predictive capabilities.
The obtained results using logistic regression were as follows: the accuracy of the model was 0.99264, the precision was 1, and the sensitivity/recall was 0.98. However, these results indicate a high level of overfitting in the model. This overfitting might have occurred because the data was scaled with a StandardScaler, which resulted in an excessively accurate model. Overfitting happens when the model performs exceptionally well on the training data but struggles to generalize to unseen data.
From the result of the confusion matrix, the model correctly predicted 70 instances as "not divorce" and 65 instances as "divorce". The model had one false negative (predicted "not divorce" when it was actually "divorce") and no false positives (predicted "divorce" when it was actually "not divorce").
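The counts just listed (70 true negatives, 65 true positives, 1 false negative, 0 false positives) reproduce the reported metrics exactly. As a hedged check, we can rebuild labels with those counts and recompute the scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Labels reconstructed to match the reported counts:
# 70 TN, 65 TP, 1 FN, 0 FP.
y_true = np.array([0] * 70 + [1] * 66)
y_pred = np.array([0] * 70 + [1] * 65 + [0])

print(confusion_matrix(y_true, y_pred))  # [[70  0], [ 1 65]]
print(accuracy_score(y_true, y_pred))    # 135/136 ≈ 0.9926
print(precision_score(y_true, y_pred))   # 1.0
print(recall_score(y_true, y_pred))      # 65/66 ≈ 0.98
```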
The confusion matrix provides a detailed breakdown of the model's performance, allowing you to evaluate its accuracy and understand the types of errors it makes.