Introduction
Today's blog is going to be different from the ones I normally post. Instead of talking about a subject that I know well, I am going to take you along on my journey of learning about Survival Analysis. This topic is huge and there is a ton I still don't know, but I am going to try to cover examples of the Kaplan-Meier Estimator and the Cox Proportional Hazards Model.
What is Survival Analysis
Survival Analysis, also known as Time-to-Event Analysis, is used to estimate when someone or something will experience an event. For example, in Engineering it is used for reliability analysis and in Economics for Duration Modeling, though it was originally developed for medical research, hence the "survival".
Prep
If you would like to follow along, I downloaded my data from Haberman's Survival Data Set on Kaggle. The libraries and tools I am using are:
import pandas as pd
import numpy as np
import random
from lifelines import KaplanMeierFitter, CoxPHFitter
And finally the steps to import it into a Pandas data frame:
data = pd.read_csv('../Downloads/haberman.csv',names=['age', 'year_of_treatment', 'positive_lymph_nodes', 'survival_5_years'])
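One thing worth checking before fitting anything: according to the dataset description, the status column is coded 1 (survived 5 years or longer) and 2 (died within 5 years), while lifelines expects an event indicator where 1 (or True) means the event was observed and 0 means it was not. Below is a minimal sketch of that recoding in plain Python. The mapping name, the function name, and the sample values are made up for illustration.

```python
# Haberman status codes: 1 = survived 5 years or longer, 2 = died within 5 years.
# Event indicator convention: 1 = event (death) observed, 0 = not observed.
STATUS_TO_EVENT = {1: 0, 2: 1}  # hypothetical helper name


def to_event_indicator(status_codes):
    """Map Haberman status codes to a 0/1 event indicator."""
    return [STATUS_TO_EVENT[code] for code in status_codes]


sample_status = [1, 1, 2, 1, 2]  # made-up values, not rows from the file
print(to_event_indicator(sample_status))  # [0, 0, 1, 0, 1]
```

With a pandas column, the same mapping could be applied with data['survival_5_years'].map(STATUS_TO_EVENT).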
Kaplan-Meier
The Kaplan-Meier estimator is used to estimate the survival function. It works by finding the fraction of subjects who survived past each point in time and then plots those estimates in a visual called the Kaplan-Meier Curve. So in our case the visual will show us the estimated probability of surviving the 5 years after surgery, with age as the time axis.
km = KaplanMeierFitter()
km.fit(data['age'],data['survival_5_years'])
km.plot()
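To make the "fraction of subjects who survived" idea concrete, here is a rough sketch of the same product-limit calculation done by hand in plain Python. It is not how lifelines implements it, just the textbook formula: at each event time, multiply the running survival probability by (1 - deaths / number still at risk). The function name and the toy data are made up.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: how long each subject was followed
    events:    1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        at_risk = sum(1 for d in durations if d >= t)  # still being followed at t
        survival *= 1 - deaths[t] / at_risk            # fraction surviving past t
        curve.append((t, survival))
    return curve


# Made-up toy data: 6 subjects, two of them censored (event = 0)
toy_durations = [2, 3, 3, 5, 6, 8]
toy_events = [1, 1, 0, 1, 0, 1]

for t, s in kaplan_meier(toy_durations, toy_events):
    print(t, round(s, 3))
# 2 0.833
# 3 0.667
# 5 0.444
# 8 0.0
```

Notice that the censored subjects never cause a drop in the curve, but they do count in the at-risk total until they leave the study.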
Another nifty metric you can get from this estimator is the median survival time, the point at which half of the population has experienced the event of interest.
km.median_survival_time_
52.0
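As a sketch of what that number means, the median can be read off a survival curve as the first time where the estimated survival probability is at or below 0.5. The helper below assumes the curve is a list of (time, probability) pairs sorted by time; the function name and the values are made up and are not taken from the fitted model above.

```python
def median_survival_time(curve):
    """Return the first time where survival is at or below 0.5.

    curve: [(time, survival_probability)] sorted by time.
    Returns infinity if the curve never drops to 0.5.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return float("inf")


toy_curve = [(2, 0.9), (4, 0.7), (7, 0.5), (9, 0.2)]  # made-up values
print(median_survival_time(toy_curve))  # 7
```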
From what I understand, this estimator is also very useful for comparing how other features affect your population: you split the data into groups based on a feature and fit a curve for each group. I haven't gone deep enough into the subject yet to understand how these divisions and comparisons are validated.
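One common way to check whether two groups' curves really differ is the log-rank test (lifelines ships one as lifelines.statistics.logrank_test). The sketch below computes the two-group statistic by hand in plain Python, assuming the textbook formula: at each event time, compare the deaths observed in group A with the number expected if both groups shared the same hazard. The function name and the toy groups are made up.

```python
def logrank_statistic(durations_a, events_a, durations_b, events_b):
    """Chi-square statistic (1 degree of freedom) for a two-group log-rank test."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n  # observed minus expected in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected ** 2 / variance


# Made-up toy groups: everyone in A has the event earlier than anyone in B
group_a = ([2, 3, 4, 5], [1, 1, 1, 1])
group_b = ([6, 7, 8, 9], [1, 1, 1, 1])

print(logrank_statistic(*group_a, *group_b))
```

A statistic above roughly 3.84 (the 5% critical value of a chi-square distribution with one degree of freedom) would suggest the two curves differ by more than chance.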
Cox Proportional Hazards Model
The Cox Proportional Hazards Model is another huge tool in the toolbox of Survival Analysis. This model takes several variables into account at the same time and examines the relationship they have to the survival distribution. As I understand it, it is very similar to a multiple regression model, except that it separates the data into small intervals of time, each with at least one event of interest. This lets the model put weights on the different variables to create an accurate estimator.
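To illustrate the "weights" idea, here is a rough sketch of the quantity a Cox model maximizes, the partial log-likelihood, for a single covariate. At each event time it compares the subject who had the event against everyone still at risk. This is a simplified version that assumes no tied event times, and it uses a crude grid search rather than the optimizer a real library would use. All names and data are made up.

```python
import math


def cox_partial_log_likelihood(beta, durations, events, x):
    """Partial log-likelihood for a one-covariate Cox model (no tied event times)."""
    total = 0.0
    for t_i, e_i, x_i in zip(durations, events, x):
        if not e_i:
            continue  # censored subjects only contribute through the risk sets
        # Everyone still under observation at time t_i
        risk_set = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
        total += beta * x_i - math.log(sum(math.exp(beta * x_j) for x_j in risk_set))
    return total


# Made-up toy data: 5 subjects, one censored
toy_durations = [2, 4, 5, 7, 9]
toy_events = [1, 1, 0, 1, 1]
toy_x = [3.0, 1.0, 2.5, 2.0, 0.5]


def toy_ll(beta):
    return cox_partial_log_likelihood(beta, toy_durations, toy_events, toy_x)


# Crude grid search over candidate weights from -3.0 to 3.0
grid = [b / 10 for b in range(-30, 31)]
best_beta = max(grid, key=toy_ll)
print(best_beta, toy_ll(best_beta))
```

A positive weight means larger values of the covariate go with a higher hazard (events happening sooner), and a negative weight means the opposite.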
chm = CoxPHFitter()
chm.fit(data, duration_col='age', event_col='survival_5_years')
chm.plot()
So now with the Cox Model plotted, we can see the coefficients showing how the model weighted the effect that positive_lymph_nodes and year_of_treatment had on the survival rate of these people, as well as the confidence intervals around those estimates. With such large intervals, it looks as though we may need more data, but we can also get more information about the coefficients it placed on these features by printing the summary.
chm.print_summary()
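The summary also reports an exp(coef) column, which is the hazard ratio for a one-unit increase in that feature. As a small worked example of the arithmetic, here is a sketch using a hypothetical coefficient value, not one taken from the fitted model above.

```python
import math


def hazard_ratio(coef):
    """Convert a Cox coefficient into a hazard ratio."""
    return math.exp(coef)


example_coef = 0.05  # hypothetical value, not from the fitted model
print(round(hazard_ratio(example_coef), 3))  # 1.051
```

A hazard ratio of about 1.051 would mean each one-unit increase in the feature goes with roughly a 5% higher hazard, while a ratio of exactly 1 means no effect.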
Final Note
As I said above, this is a new subject for me and my knowledge of it is just beginning, so if anyone is reading this and would like to help me expand on this subject, I would love to get in contact with you!