Data Analysis Flow Based on the Kaggle Competition to Predict Survival Rates from the Titanic

Data Analysis Workflow

Since data analysis tends to follow an established flow, I've compiled a summary of that flow for future reference.

This article is part of the Aidemy Premium curriculum and is made public to meet the completion requirements.

Example Used

The example used in this data analysis is the Titanic survival prediction competition from Kaggle, a well-known competition often cited in various contexts.

Link to Titanic Competition

This example is in tabular format, the most common form in data analysis. It includes irrelevant features and missing data, providing opportunities to learn data preprocessing. It is an ideal case for data analysis, akin to "Hello, world!" in web development or the LED blink example in electronics.

Data Analysis Workflow

The general workflow for data analysis is as follows. According to a Kaggle Grandmaster, creating features in step 4 is the most crucial part, taking up 50-80% of the entire process. By contrast, model selection and parameter tuning follow established practices, so they vary less from person to person.

  1. Data Preparation
  2. Data Inspection
  3. Data Preprocessing
  4. Feature Determination
  5. Validation Data Creation
  6. Model Selection
  7. Model Training
  8. Evaluation of Prediction Results

Let's go through each step.

1. Data Preparation

The first step is to obtain data. Data doesn't magically appear; it has to be gathered from public organizations, companies, or APIs, or collected through web scraping. In the case of Kaggle, the data is provided for you, and the following command reads it:

import pandas as pd

train_df = pd.read_csv('/kaggle/input/titanic/train.csv')

When scraping, it's crucial to respect each site's terms of service (including restrictions on commercial use) and to keep the access frequency low enough not to overload the server (a common guideline is about one request per second).
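As an illustration only, a polite scraping loop might look like the sketch below (the URLs and parsing step are placeholders, and the requests library is assumed to be available):

import time
import requests

# Hypothetical list of pages to fetch; replace with real URLs
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    # ... parse response.text here ...
    time.sleep(1)  # wait about one second between requests to avoid overloading the server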

2. Data Inspection

After obtaining data, the next essential step is to examine it thoroughly. By visualizing the data with libraries such as pandas, matplotlib, and seaborn, you can spot patterns and potential areas of focus for building an accurate model.

Some inspection methods include:

import seaborn as sns

# Display the first 5 rows of data
display(train_df.head())

# Display various statistics for each column
display(train_df.describe())

# Check data imbalance using seaborn
sns.countplot(x='Survived', data=train_df)

Other methods include scatter plots to identify outliers, correlating age with survival rate across age groups, checking for bias in training and test data, and exploring correlations between columns using heatmaps. Examining the correlation between columns is crucial for feature creation.
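For example, a correlation heatmap takes only a few lines (a minimal sketch that assumes the numeric columns of train_df are what we want to compare):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, visualized as a heatmap
corr = train_df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()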

3. Data Preprocessing

After data inspection and analysis, preprocessing is performed if necessary. Common preprocessing steps include handling missing values and dealing with outliers.

Handling Missing Values

Missing values are common in real-world data. Various methods can be employed to handle them:

  • Fill with the mean or median for numerical features.
  • Fill with the most frequent value for categorical features.
  • Use zero or a special value.
  • Categorize data based on other features.
# Fill missing values in 'Age' with the mean
all_df['Age'] = all_df['Age'].fillna(all_df['Age'].mean())
# Fill missing values in 'Ticket' with the mode (mode() returns a Series, so take the first value)
train_df['Ticket'] = train_df['Ticket'].fillna(train_df['Ticket'].mode()[0])

Handling Outliers

Outliers, data points that deviate significantly from the general pattern, can be handled in various ways:

  • Remove outliers.
  • Convert to categorical variables.
  • Use them as-is.

The approach depends on the situation, and there's no one-size-fits-all solution.
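As one possible approach (a sketch, not part of the original code), rows whose 'Fare' falls far outside the interquartile range could be removed:

# Compute the interquartile range (IQR) of 'Fare'
q1 = train_df['Fare'].quantile(0.25)
q3 = train_df['Fare'].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose 'Fare' lies within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
train_df = train_df[(train_df['Fare'] >= lower) & (train_df['Fare'] <= upper)]

Whether removal actually helps should be checked against the model's validation score.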

4. Feature Determination

This step is crucial for creating an accurate model. Feature creation significantly impacts accuracy.

Conversion to Categorical Variables

In some cases, data is segmented into categories. This can involve grouping age into ranges or categorizing locations into regions. There are different methods for converting categorical variables:

  • One-Hot Encoding
  • Label Encoding
# One-Hot Encoding example
# Bin 'Age' into 8 quantile-based categories
train_df['AgeBand'] = pd.qcut(train_df['Age'], 8)
# Apply One-Hot Encoding to the resulting 'AgeBand' column
train_df = pd.get_dummies(train_df, columns=['AgeBand'])
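Label Encoding, the other method listed above, replaces each category with an integer instead of creating new columns. A minimal sketch using scikit-learn (assuming missing values in 'Embarked' have already been filled):

from sklearn.preprocessing import LabelEncoder

# Replace each category in 'Embarked' with an integer label
le = LabelEncoder()
train_df['Embarked'] = le.fit_transform(train_df['Embarked'])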

Creating New Features

Sometimes, creating new features by combining or manipulating existing ones can reveal additional insights. For instance, combining the number of siblings and spouses into "family size" may expose new correlations.
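In the Titanic data, for instance, 'SibSp' (siblings/spouses aboard) and 'Parch' (parents/children aboard) can be combined into a family-size feature, as in the sketch below:

# Family size = siblings/spouses + parents/children + the passenger themselves
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1

# An 'IsAlone' flag is another feature commonly derived from family size
train_df['IsAlone'] = (train_df['FamilySize'] == 1).astype(int)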

Removing Unnecessary Columns

Columns with low correlation, potential negative effects on the model, or no contribution can be removed. Care must be taken not to blindly delete data, as even seemingly irrelevant features might have hidden impacts.
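For example, columns judged unlikely to help the model could be dropped like this (the choice of columns is an illustration, not a rule):

# Drop columns judged not to contribute to the prediction
train_df = train_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])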

This iterative process often involves building and evaluating models to discover the most effective techniques.


5. Creation of Validation Data

Once feature creation is complete, the next step in the preparation process is to create validation data before proceeding to model development.

The dataset is typically divided into two subsets: training data and test data. To predict the test data accurately, a model is trained using the training data. However, if the test data is entirely unknown during training, it becomes challenging to assess the model's performance. To address this, a common approach is to further split the training data into training and validation data subsets, using a technique known as the holdout method, as depicted in the diagram below:

(Diagram of the holdout method. Source: https://di-acc2.com/analytics/ai/6498/)
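In code, the holdout split is typically a single call to scikit-learn's train_test_split (a minimal sketch, assuming 'Survived' is the target column):

from sklearn.model_selection import train_test_split

# Separate the target from the features
X = train_df.drop(columns=['Survived'])
y = train_df['Survived']

# Hold out 20% of the training data as validation data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)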

However, it's essential to note that this method may reduce the amount of data available for training, potentially degrading the model's accuracy. Additionally, there is a risk of overfitting due to data imbalances.

Therefore, a common practice is to use k-fold cross-validation, where the dataset is divided into k subsets, and each subset is used for both training and evaluation. The model's performance is then assessed by taking the average of the results from each fold:

(Diagram of k-fold cross-validation. Source: https://di-acc2.com/analytics/ai/6498/)

Here's an example of how to implement k-fold cross-validation in Python:

from sklearn.model_selection import KFold

cv = KFold(n_splits=10, random_state=0, shuffle=True)

for i, (trn_index, val_index) in enumerate(cv.split(train, target)):
    # Data splitting process
    XXXX
    # Model training process
    XXXX
    # Model evaluation process
    XXXX

# Average the results of 10 evaluations and display
XXXX
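For reference, a fully worked version of the skeleton above might look like this (a sketch that assumes logistic regression as the model, 'train' as the feature DataFrame, and 'target' as the label Series):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, random_state=0, shuffle=True)
scores = []

for i, (trn_index, val_index) in enumerate(cv.split(train, target)):
    # Data splitting process
    X_trn, X_val = train.iloc[trn_index], train.iloc[val_index]
    y_trn, y_val = target.iloc[trn_index], target.iloc[val_index]

    # Model training process
    model = LogisticRegression(max_iter=1000)
    model.fit(X_trn, y_trn)

    # Model evaluation process
    scores.append(accuracy_score(y_val, model.predict(X_val)))

# Average the results of 10 evaluations and display
print(np.mean(scores))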

6. Model Selection

Choosing the right model is crucial, and there is no one-size-fits-all model for every scenario. The selection process involves considering factors such as the data's nature, final evaluation metrics, personal experience, and intuition.

Some representative models include:

  • Logistic Regression
  • Support Vector Machine (SVM)
  • k-Nearest Neighbors
  • Random Forest
  • Multi-Layer Perceptron
  • Gradient Boosting Decision Trees (GBDT)

GBDT has gained popularity recently due to its balanced performance, and it's often recommended to start with GBDT before exploring other models. A book highly praised among Kagglers may also be a factor in this preference for GBDT.

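As a starting point, scikit-learn's GBDT implementation can be used the same way as any other scikit-learn model (a minimal sketch reusing the holdout split from step 5; LightGBM and XGBoost are common, usually faster alternatives):

from sklearn.ensemble import GradientBoostingClassifier

# Train a GBDT model and check its accuracy on the validation data
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))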

Furthermore, the advancement of AutoML (Automated Machine Learning) allows automatic testing of various models and hyperparameters, selecting the best-performing model automatically. Some notable AutoML services and libraries include:

  • Cloud AutoML (Google)
  • Automated ML (Microsoft)
  • AutoAI (IBM)
  • PyCaret (Open-source library)

7. Model Training

Once the model selection is complete, the next step is to train the model using the available training data. Despite the perceived complexity of training, modern libraries offer convenient tools for straightforward implementation. Below is an example of coding for logistic regression training:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a logistic regression model on the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the training data and display the accuracy
y_pred = model.predict(X_train)
print(accuracy_score(y_train, y_pred))

8. Evaluation of Prediction Results

While the provided code displays accuracy, it's important to note that accuracy is just one metric, and there are various evaluation metrics depending on the problem type. For example:

Regression Metrics:

  • RMSE (Root Mean Squared Error)
  • RMSLE (Root Mean Squared Logarithmic Error)
  • MAE (Mean Absolute Error)

Binary Classification Metrics:

  • Confusion Matrix
  • Accuracy and Misclassification Rate
  • Precision and Recall
  • F1 Score and Fβ Score
  • Log Loss
  • AUC (Area Under the ROC Curve)

Multi-Class Classification Metrics:

  • Multi-Class Accuracy
  • Multi-Class Log Loss

These are just a few examples, and many more exist depending on the context.
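As an illustration, several of the binary classification metrics above can be computed with scikit-learn (a sketch assuming y_valid holds the true labels and the trained model from step 7 is available):

from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, log_loss, roc_auc_score)

y_pred = model.predict(X_valid)               # predicted class labels
y_proba = model.predict_proba(X_valid)[:, 1]  # predicted probability of survival

print(confusion_matrix(y_valid, y_pred))
print(precision_score(y_valid, y_pred))
print(recall_score(y_valid, y_pred))
print(f1_score(y_valid, y_pred))
print(log_loss(y_valid, y_proba))
print(roc_auc_score(y_valid, y_proba))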

Conclusion

The article has covered the entire flow of data analysis. The remaining work is to iterate through steps 3 to 8, gradually improving the model's accuracy and other evaluation metrics. While tabular problems like the Titanic example may be less prevalent on Kaggle these days, mastering this format is considered a first goal for anyone getting into data science and machine learning.
