DEV Community

Timothy Cummins


My Intro into Survival Analysis

Introduction

Today's blog is going to be different from the blogs I normally post. Instead of talking about a subject that I know well, I am going to take you along on my journey of learning about Survival Analysis. This topic is huge and there is a ton I still don't know, but I am going to try to cover examples of the Kaplan-Meier Estimator and the Cox Proportional Hazards Model.

What is Survival Analysis

Survival Analysis, also known as Time-to-Event Analysis, is used to estimate when someone or something will experience an event. For example, in engineering it is used for reliability analysis and in economics for duration modeling, though it was originally developed for medical research, hence the "survival".

Prep

If you would like to follow along, I downloaded my data from Haberman's Survival Data Set on Kaggle. The libraries and tools I am using are:

import pandas as pd
import numpy as np
import random
# CoxPHFitter is needed for the Cox model section later in this post
from lifelines import KaplanMeierFitter, CoxPHFitter

And finally the steps to import it into a Pandas data frame:

data = pd.read_csv('../Downloads/haberman.csv',names=['age', 'year_of_treatment', 'positive_lymph_nodes', 'survival_5_years'])
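One thing worth checking before fitting anything: as far as I can tell from the data set's documentation, the survival status column is coded 1 for patients who survived 5 years or longer and 2 for patients who died within 5 years, while lifelines reads an event column as "1 = event observed, 0 = censored". Here is a minimal sketch of that recoding in plain Python. The function name `to_event_indicator` is hypothetical and the 1/2 coding is an assumption you should verify against your copy of the data:

```python
def to_event_indicator(status):
    """Map the assumed Haberman coding (1 = survived, 2 = died) to 0/1 event flags."""
    mapping = {1: 0, 2: 1}  # assumption: 2 means the patient died within 5 years
    return [mapping[s] for s in status]


# Made-up status values, just to show the mapping
status = [1, 1, 2, 1, 2]
events = to_event_indicator(status)
print(events)
```

On the data frame itself the same idea would be something like `data['survival_5_years'].map({1: 0, 2: 1})`. The results shown in this post use the column as downloaded.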

Kaplan-Meier

The Kaplan-Meier estimator is used to estimate the survival function. It works by finding the fraction of subjects who survived past each point in time, and the resulting estimates are usually shown in a plot called the Kaplan-Meier curve. So in our case the curve will show us the probability of surviving 5 years after the surgery, with the patient's age as the time.

km = KaplanMeierFitter()
km.fit(data['age'],data['survival_5_years'])
km.plot()

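To make the "fraction of subjects who survived" idea concrete, here is a small from-scratch sketch of the product-limit calculation: at every time with an observed event, the running survival estimate is multiplied by (1 - deaths / number still at risk). This is only an illustration in plain Python, not how lifelines implements it. The function name `kaplan_meier` and the toy data are my own assumptions:

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: time until the event or censoring for each subject
    events: 1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    removed = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # share of those at risk who survive past t
            curve.append((t, survival))
        at_risk -= removed[t]  # drop both deaths and censored subjects
    return curve


# Made-up toy data, not taken from the Haberman set
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, events):
    print(t, round(s, 3))
```

With this toy data the curve steps down at times 2, 3 and 5. The censored subjects (event 0) never cause a drop themselves, but they do shrink the group that is still at risk for later times.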

Another nifty metric you can get from this estimator is the median survival time: the time by which half of the population has experienced the event of interest.

km.median_survival_time_

52.0

From what I understand, this estimator can also be very useful for comparing how other features affect your population, by splitting the data into groups based on those features and estimating a curve for each group. I haven't gone deep enough into the subject yet to understand how these divisions and comparisons are validated, though.
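A rough sketch of what such a group comparison could look like: split the rows by a feature, then compute a Kaplan-Meier median for each group. Everything here is hypothetical (the helper `km_median`, the group labels and the rows), and it is written in plain Python so that it runs without the data set:

```python
def km_median(durations, events):
    """Smallest time at which the Kaplan-Meier survival estimate is 0.5 or lower."""
    at_risk = len(durations)
    survival = 1.0
    for t in sorted(set(durations)):
        here = [e for d, e in zip(durations, events) if d == t]
        survival *= 1 - sum(here) / at_risk  # sum(here) = observed events at time t
        if survival <= 0.5:
            return t
        at_risk -= len(here)
    return float("inf")  # the curve never drops to 0.5


# Hypothetical rows: (duration, event, group label)
rows = [
    (4, 1, "no_nodes"), (6, 1, "no_nodes"), (9, 0, "no_nodes"),
    (10, 1, "no_nodes"), (12, 1, "no_nodes"),
    (1, 1, "nodes"), (2, 1, "nodes"), (3, 0, "nodes"),
    (5, 1, "nodes"), (7, 1, "nodes"),
]

medians = {}
for group in sorted({g for _, _, g in rows}):
    durations = [d for d, _, g in rows if g == group]
    events = [e for _, e, g in rows if g == group]
    medians[group] = km_median(durations, events)

print(medians)
```

With lifelines the equivalent would be fitting a separate `KaplanMeierFitter` on each subset of the data frame, for example patients with and without positive lymph nodes, and comparing the resulting curves.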

Cox Proportional Hazards Model

The Cox Proportional Hazards Model is another huge tool in the toolbox of Survival Analysis. This model takes several variables into account at the same time and examines the relationship they have to the survival distribution. As I understand it, it works much like a multiple regression model, except that it splits the data into small intervals of time, each containing at least one event of interest. This lets the model fit weights for the different variables to create an accurate estimator.
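For reference, the model is usually written in the form below, where h_0(t) is a baseline hazard shared by everyone, x_1 to x_p are the variables (here that would be positive_lymph_nodes and year_of_treatment) and the betas are the weights the model fits. This is the standard textbook form rather than anything specific to lifelines:

```latex
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)
```

Read this way, a one-unit increase in x_j multiplies the hazard by exp(beta_j), which is the "proportional" part of the name.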

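chm = CoxPHFitter()
# The remaining columns (year_of_treatment, positive_lymph_nodes) become the covariates
chm.fit(data, duration_col='age', event_col='survival_5_years')
chm.plot()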


So now with the Cox model plotted we can see the coefficients the model assigned to positive_lymph_nodes and year_of_treatment, that is, how it weighted their effect on the survival of these patients, as well as the confidence interval around each coefficient. With such large intervals it looks as though we may need more data, but we can also get more information about the coefficients by printing the summary.

chm.print_summary()

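The `coef` column in that summary is on the log-hazard scale, so exponentiating it gives the hazard ratio (the summary also reports this as `exp(coef)`). Here is a small sketch of how to read one. The numbers are hypothetical and are not the values from my summary, and `hazard_ratio` is just a helper name I made up:

```python
import math


def hazard_ratio(coef, lower, upper):
    """Turn a log-hazard coefficient and its confidence bounds into hazard ratios."""
    return math.exp(coef), math.exp(lower), math.exp(upper)


# Hypothetical numbers for illustration only
coef, lower, upper = 0.05, -0.01, 0.11
hr, hr_low, hr_high = hazard_ratio(coef, lower, upper)
print(f"hazard ratio {hr:.3f} (95% CI {hr_low:.3f} to {hr_high:.3f})")

# An interval that contains 1 means "no effect" cannot be ruled out
print("interval includes 1:", hr_low <= 1 <= hr_high)
```

A hazard ratio above 1 means the feature is associated with the event happening sooner, and below 1 with it happening later. When the interval straddles 1, as in this made-up example, the data cannot tell the two apart.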

Final Note

As I said above, this is a new subject for me and my knowledge of it is just beginning, so if anyone reading this would like to help me expand on it, I would love to get in contact with you!
