The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems that some groups of people were more likely to survive than others.
To answer the question “What sorts of people were more likely to survive?”, I used the passenger data to build a predictive model that estimates the probability of a person surviving the Titanic shipwreck from the input values.
I used the training set to build a regression model. For the training set, the outcome (also known as the “ground truth”) is provided for each passenger. I used the test set to see how well the model performs on unseen data; for the test set, the ground truth is not provided.
pclass: A proxy for socio-economic status (SES):
1st = Upper, 2nd = Middle, 3rd = Lower
sex: Encoded as a dummy variable, 1 for male and 0 for female
sibsp: The number of siblings and spouses aboard. The dataset defines these family relations as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The number of parents and children aboard. The dataset defines these family relations as follows:
Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children traveled only with a nanny, so parch = 0 for them.
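The encoding of sex as a dummy variable can be sketched with pandas as follows. This is a minimal illustration, not the original preprocessing code: the rows are made-up toy values, and it assumes the raw column holds the strings "male" and "female".

```python
import pandas as pd

# Toy rows standing in for the passenger data (illustrative values only).
df = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["male", "female", "female"],
    "sibsp": [1, 0, 0],
    "parch": [0, 0, 2],
})

# Encode sex as a dummy variable: 1 for male, 0 for female.
df["sex"] = df["sex"].map({"male": 1, "female": 0})
```

After this step every column is numeric, which is what a regression model needs as input.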
I used the following techniques for feature selection:
1.) Percent missing values: I dropped variables with a very high percentage of missing values (number of records with missing values / total number of records).
2.) Pairwise correlation: Features that are highly correlated with each other are largely redundant. From each such pair I kept only the feature with the higher correlation with the target, which reduces dimensionality without much loss of information.
3.) Backward elimination: I started with all the variables in the model and repeatedly dropped the least useful one until a predefined stopping criterion was satisfied.
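The first two techniques can be sketched roughly as below. This is a simplified illustration under stated assumptions: the helper names drop_high_missing and drop_correlated, the thresholds 0.5 and 0.8, the cabin_code column, and the toy values are all hypothetical and not taken from the original analysis. Backward elimination is left out because it depends on the model and stopping criterion chosen.

```python
import numpy as np
import pandas as pd


def drop_high_missing(df, threshold=0.5):
    """Drop columns whose share of missing values exceeds `threshold` (assumed cutoff)."""
    missing_share = df.isna().mean()  # missing records / total records, per column
    return df.loc[:, missing_share <= threshold]


def drop_correlated(df, target, threshold=0.8):
    """From each highly correlated feature pair, keep the one more correlated with the target."""
    features = [c for c in df.columns if c != target]
    corr = df.corr().abs()  # assumes all columns are numeric
    dropped = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Drop whichever feature of the pair is less correlated with the target.
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                dropped.add(weaker)
    return df.drop(columns=sorted(dropped))


# Toy data (illustrative values only, not the real passenger records).
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1, 0],
    "pclass": [1, 3, 1, 3, 2, 2],
    "sex": [0, 1, 1, 0, 0, 1],
    "fare": [80.0, 8.0, 70.0, 10.0, 20.0, 22.0],
    "cabin_code": [np.nan, np.nan, 1.0, np.nan, np.nan, np.nan],  # hypothetical, mostly missing
})

selected = drop_correlated(drop_high_missing(toy), target="survived")
```

In this toy example, cabin_code is removed for having too many missing values, and fare is removed because it is strongly correlated with pclass while being less correlated with the target.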
Next, I fitted a Ridge regression model and saved it in a pickle file named “Ridge_Titanic”, as you can see below.
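A minimal sketch of what this step might look like, assuming scikit-learn's Ridge estimator and Python's pickle module. The toy data, the alpha value, and the ".pkl" file extension are assumptions for illustration, not details of the original model.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge

# Toy feature matrix with columns [pclass, sex, sibsp, parch]; illustrative values only.
X = np.array([
    [1, 0, 1, 0],
    [3, 1, 0, 0],
    [2, 0, 0, 2],
    [3, 1, 1, 1],
    [1, 1, 0, 0],
    [2, 0, 1, 1],
])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = survived, 0 = did not survive

model = Ridge(alpha=1.0)  # alpha is an assumed value
model.fit(X, y)

# Save the fitted model to disk (the ".pkl" extension is an assumption).
with open("Ridge_Titanic.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload the model and score a new passenger.
with open("Ridge_Titanic.pkl", "rb") as f:
    loaded = pickle.load(f)

score = loaded.predict(np.array([[1, 0, 0, 0]]))[0]
# Ridge is a linear model, so its raw output is not guaranteed to lie in [0, 1];
# clipping is one simple way to read it as a probability.
probability = float(np.clip(score, 0.0, 1.0))
```

Pickling the fitted estimator lets the web app reuse the model without retraining it on every start.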
The fitted model is loaded into the app file, where routes map each URL to a function whose output displays the probability value on the web page. For example, visiting the URL “http://127.0.0.1:5000/”, which is mapped to one of these functions, renders that function's output in the browser, as shown below.
I built a user interface that takes new inputs and displays the predicted probability of survival when the “Predict” button is clicked, as you can see below.
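The routing and the form could be wired together roughly as follows. This is a sketch assuming the app is written with Flask, which the default address http://127.0.0.1:5000/ suggests. The route name /predict, the inline template, and the field names are hypothetical, and a small model fitted on toy data stands in for the pickled one so the example runs on its own.

```python
import numpy as np
from flask import Flask, render_template_string, request
from sklearn.linear_model import Ridge

app = Flask(__name__)

# Stand-in for the pickled model; the real app would load it with pickle.load instead.
model = Ridge(alpha=1.0).fit(
    np.array([[1, 0, 1, 0], [3, 1, 0, 0], [2, 0, 0, 2], [3, 1, 1, 1]]),
    np.array([1, 0, 1, 0]),
)

FIELDS = ["pclass", "sex", "sibsp", "parch"]  # assumed form field names

PAGE = """
<form action="/predict" method="post">
  pclass <input name="pclass">
  sex (1 = male, 0 = female) <input name="sex">
  sibsp <input name="sibsp">
  parch <input name="parch">
  <button type="submit">Predict</button>
</form>
{% if probability is not none %}
<p>Predicted probability of survival: {{ "%.2f"|format(probability) }}</p>
{% endif %}
"""


@app.route("/")
def home():
    # Visiting the root URL renders the empty input form.
    return render_template_string(PAGE, probability=None)


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted values in a fixed order and score them with the model.
    features = [[float(request.form[name]) for name in FIELDS]]
    score = model.predict(np.array(features))[0]
    probability = float(np.clip(score, 0.0, 1.0))  # keep the displayed value in [0, 1]
    return render_template_string(PAGE, probability=probability)
```

When served with Flask's development server, the form is available at the root URL and submitting it posts the inputs to the prediction route, which re-renders the page with the result.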