NishthaCh

# Titanic Project

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems that some groups of people were more likely to survive than others.

To answer the question “What sorts of people were more likely to survive?”, I used the passenger data to build a predictive model that estimates the probability of a passenger surviving the Titanic shipwreck from a given set of input values.

The data has been split into two groups:
1.) training set (train.csv)
2.) test set (test.csv)

I used the training set to build a regression model. For the training set, the outcome (also known as the “ground truth”) is provided for each passenger. I used the test set to see how well the model performs on unseen data; for the test set, the ground truth for each passenger is not provided.

## Data Dictionary

From the given variables, I selected the following using feature selection:

Variable notations:

• pclass: Ticket class, a proxy for socio-economic status (SES):
1st = Upper; 2nd = Middle; 3rd = Lower

• sex: Encoded as a dummy variable, 1 for male and 0 for female

• sibsp: Number of siblings or spouses aboard. The dataset defines these family relations as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

• parch: Number of parents or children aboard. The dataset defines these family relations as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children traveled only with a nanny, so parch = 0 for them.

• fare: Passenger fare
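The variable list above can be sketched in Python with pandas. This is a minimal sketch, not the project's actual code: the rows are made-up stand-ins for train.csv, the capitalised column names assume the header used in Kaggle's train.csv, and `prepare_features` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows standing in for train.csv (illustrative values only).
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 13.00],
})

# The five selected variables, in the column spelling assumed above.
FEATURES = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]


def prepare_features(df):
    """Keep the selected columns and encode sex as 1 = male, 0 = female."""
    X = df[FEATURES].copy()
    X["Sex"] = (X["Sex"] == "male").astype(int)
    return X


X = prepare_features(raw)
y = raw["Survived"]
print(X)
```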

Some techniques that I used for feature selection are:
1.) Percent missing values: I dropped variables with a very high percentage of missing values (number of records with a missing value / total number of records).

2.) Pairwise correlation: Features are often correlated with each other and are therefore redundant. From each correlated pair, I kept only the feature with the higher correlation with the target, which reduces dimensionality without much loss of information.

3.) Backward elimination: I started with all the variables included in the model and then dropped the least useful variable, repeating until a predefined stopping criterion was satisfied.
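The three techniques above could look roughly like the following in Python. This is a sketch under assumptions, not the code used in the project: the thresholds (`max_missing`, `max_corr`, `tol`), the cross-validated R² stopping rule, the helper names, and the synthetic columns `x1` to `x5` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def drop_mostly_missing(df, max_missing=0.5):
    """1.) Drop columns whose share of missing records exceeds max_missing."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]


def drop_correlated(X, y, max_corr=0.9):
    """2.) From each highly correlated pair, keep the feature that is
    more strongly correlated with the target."""
    target_corr = X.corrwith(y).abs()
    pair_corr = X.corr().abs()
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and pair_corr.loc[a, b] > max_corr:
                keep.remove(a if target_corr[a] < target_corr[b] else b)
    return X[keep]


def cv_score(X, y, cols):
    """Mean 5-fold cross-validated R^2 of a linear model on the given columns."""
    return cross_val_score(LinearRegression(), X[cols], y, cv=5).mean()


def backward_eliminate(X, y, tol=0.01):
    """3.) Start with all features and repeatedly drop the least useful one,
    stopping when any further drop costs more than tol in R^2 (assumed rule)."""
    cols = list(X.columns)
    best = cv_score(X, y, cols)
    while len(cols) > 1:
        # Score every candidate model that leaves exactly one feature out.
        trials = {c: cv_score(X, y, [k for k in cols if k != c]) for c in cols}
        weakest = max(trials, key=trials.get)
        if trials[weakest] < best - tol:
            break
        cols.remove(weakest)
        best = trials[weakest]
    return cols


# Synthetic data: x3 nearly duplicates x1, x4 is mostly missing, x5 is noise.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x5": rng.normal(size=n),
})
data["x3"] = data["x1"] + 0.01 * rng.normal(size=n)
data["x4"] = np.where(np.arange(n) < 10, 1.0, np.nan)
target = 2 * data["x1"] - data["x2"] + 0.1 * rng.normal(size=n)

kept = drop_mostly_missing(data)
reduced = drop_correlated(kept, target)
selected = backward_eliminate(reduced, target)
print(selected)
```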

I then fitted the model using Ridge regression and saved it in a pickle file named “Ridge_Titanic”.
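A minimal sketch of this fit-and-save step, assuming scikit-learn's `Ridge` and the standard `pickle` module. The training rows, the `alpha` value, the temporary directory, and the `survival_probability` helper are illustrative assumptions. Since a Ridge regressor can predict values outside the range 0 to 1, the sketch clips its output before treating it as a probability; the original project may have handled this differently.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge

# Made-up training rows: pclass, sex (1 = male), sibsp, parch, fare.
X_train = np.array([
    [3, 1, 1, 0, 7.25],
    [1, 0, 1, 0, 71.28],
    [3, 0, 0, 0, 7.92],
    [1, 0, 1, 0, 53.10],
    [3, 1, 0, 0, 8.05],
    [2, 1, 0, 0, 13.00],
])
y_train = np.array([0, 1, 1, 1, 0, 0])

model = Ridge(alpha=1.0)  # alpha is an assumed value, not from the project
model.fit(X_train, y_train)

# Save the fitted model; a temporary directory is used here for illustration.
model_path = Path(tempfile.mkdtemp()) / "Ridge_Titanic"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as the app file would do later.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)


def survival_probability(fitted, row):
    """Predict for one passenger and clip the raw output into [0, 1]."""
    raw = fitted.predict(np.asarray([row], dtype=float))[0]
    return float(np.clip(raw, 0.0, 1.0))


p = survival_probability(loaded, [3, 1, 0, 0, 8.05])
print(p)
```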

The fitted model is loaded in the app file, where routes map each URL to a function that displays the probability value on the web page. In other words, visiting “http://127.0.0.1:5000/” calls the function mapped to that URL, and the output of that function is rendered in the browser.
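The route mapping described above might look like this in a Flask app file. It is a sketch, not the project's app: the `/predict` route, the JSON field names, and the `survival_probability` response key are assumptions, and a tiny model is trained inline and round-tripped through `pickle` as a stand-in for loading the saved “Ridge_Titanic” file.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import Ridge

# Stand-in for the saved model: in the real app this would be
# pickle.load(...) on the "Ridge_Titanic" file.
_toy = Ridge(alpha=1.0).fit(
    np.array([
        [3, 1, 1, 0, 7.25],
        [1, 0, 1, 0, 71.28],
        [3, 0, 0, 0, 7.92],
        [2, 1, 0, 0, 13.00],
    ]),
    np.array([0, 1, 1, 0]),
)
model = pickle.loads(pickle.dumps(_toy))

# Assumed JSON field names, in the order the model expects them.
FEATURES = ["pclass", "sex", "sibsp", "parch", "fare"]

app = Flask(__name__)


@app.route("/")
def home():
    # Visiting http://127.0.0.1:5000/ renders this function's return value.
    return "Titanic survival predictor"


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    payload = request.get_json(force=True)
    row = [[float(payload[name]) for name in FEATURES]]
    # Clip because a Ridge regressor can predict outside [0, 1].
    probability = float(np.clip(model.predict(row)[0], 0.0, 1.0))
    return jsonify({"survival_probability": probability})


# To serve locally, call app.run(host="127.0.0.1", port=5000).
```

The routes can be tried without starting a server by using Flask's built-in test client, for example `app.test_client().post("/predict", json={...})` with the five fields above.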

I built a user interface that takes new inputs and displays the predicted probability of survival when the “Predict” button is clicked.

I read the new test data points using JavaScript and sent them to the Flask app through a jQuery Ajax call. The response from the Flask app was then displayed in the user interface.