Data Analysis Workflow
Since data analysis tends to follow an established workflow, I've compiled a summary for future reference.
This article is part of the Aidemy Premium curriculum and is made public to meet the completion requirements.
Example Used
The example used in this data analysis is the Titanic survival prediction competition from Kaggle, a well-known competition often cited in various contexts.
This example is in tabular format, the most common form in data analysis. It includes irrelevant features and missing data, providing opportunities to learn data preprocessing. It is an ideal case for data analysis, akin to "Hello, world!" in web development or the LED blink example in electronics.
Data Analysis Workflow
The general workflow for data analysis is as follows. According to a Kaggle Grandmaster, feature creation in step 4 is the most crucial part, taking up 50-80% of the entire process. Conversely, model choice and parameter tuning follow well-established practices, so they vary less from person to person.
- Data Preparation
- Data Inspection
- Data Preprocessing
- Feature Determination
- Validation Data Creation
- Model Selection
- Model Training
- Evaluation of Prediction Results
Let's go through each step.
1. Data Preparation
The first step is to have data. Data doesn't magically appear; it needs to be obtained manually from public organizations, companies, or APIs, or through web scraping. In the case of Kaggle, the data is readily available. The following command reads the data:
import pandas as pd

# Load the Titanic training data provided by Kaggle
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
When scraping, it's crucial to respect terms of service, considering commercial use restrictions and being mindful of access frequency to avoid overloading servers (commonly limited to 1 page per second).
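As a minimal sketch of polite scraping, not from the original article: the URL list and the parse_page function below are placeholders, and the 1-second wait reflects the rule of thumb mentioned above.
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in urls:
    response = requests.get(url)   # fetch one page
    response.raise_for_status()    # stop early on HTTP errors
    parse_page(response.text)      # hypothetical parsing function
    time.sleep(1)                  # wait about 1 second between requests to avoid overloading the server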
2. Data Inspection
After obtaining data, the next essential step is to examine it thoroughly. By visualizing the data with libraries such as pandas, matplotlib, and seaborn, you can identify patterns and potential areas of focus for building an accurate model.
Some inspection methods include:
import seaborn as sns
import matplotlib.pyplot as plt

# Display the first 5 rows of data
display(train_df.head())
# Display summary statistics for each column
display(train_df.describe())
# Check class imbalance in the target using seaborn
sns.countplot(x='Survived', data=train_df)
plt.show()
Other useful checks include scatter plots to spot outliers, comparing survival rates across age groups, checking whether the training and test data are distributed similarly, and exploring correlations between columns with a heatmap. Examining the correlations between columns is especially useful for feature creation.
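As a small sketch of that last check, assuming the train_df loaded above, a correlation heatmap over the numeric columns can be drawn like this:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns only
corr = train_df.corr(numeric_only=True)
# Visualize the correlations as an annotated heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()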
3. Data Preprocessing
After data inspection and analysis, preprocessing is performed if necessary. Common preprocessing steps include handling missing values and dealing with outliers.
Handling Missing Values
Missing values are common in real-world data. Various methods can be employed to handle them:
- Fill with the mean or median for numerical features.
- Fill with the most frequent value for categorical features.
- Use zero or a special value.
- Categorize data based on other features.
# Fill missing values in 'Age' with the mean
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
# Fill missing values in 'Embarked' with the mode
# (mode() returns a Series, so take its first element)
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
Handling Outliers
Outliers, data points that deviate significantly from the general pattern, can be handled in various ways:
- Remove outliers.
- Convert to categorical variables.
- Use them as-is.
The approach depends on the situation, and there's no one-size-fits-all solution.
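As one possible sketch of the removal option: an IQR-based filter on 'Fare'. The 1.5 multiplier is a common rule of thumb, not something the article prescribes.
# Interquartile range of 'Fare'
q1 = train_df['Fare'].quantile(0.25)
q3 = train_df['Fare'].quantile(0.75)
iqr = q3 - q1
# Keep only rows whose 'Fare' lies within 1.5 * IQR of the quartiles
train_df = train_df[train_df['Fare'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]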
4. Feature Determination
This step is crucial for creating an accurate model. Feature creation significantly impacts accuracy.
Conversion to Categorical Variables
In some cases, data is segmented into categories, such as grouping age into ranges or locations into regions. There are several methods for encoding the resulting categorical variables:
- One-Hot Encoding
- Label Encoding
# One-Hot Encoding example
# Bin 'Age' into 8 quantile-based categories
train_df['AgeBand'] = pd.qcut(train_df['Age'], 8)
# Apply One-Hot Encoding to the new categorical column
train_df = pd.get_dummies(train_df, columns=['AgeBand'])
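Label Encoding is not shown above; a minimal sketch using scikit-learn's LabelEncoder, applied to 'Sex' purely as an illustrative choice, would be:
from sklearn.preprocessing import LabelEncoder

# Map each category of 'Sex' to an integer (e.g. female -> 0, male -> 1)
le = LabelEncoder()
train_df['Sex'] = le.fit_transform(train_df['Sex'])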
Creating New Features
Sometimes, creating new features by combining or manipulating existing ones can reveal additional insights. For instance, combining the number of siblings and spouses into "family size" may expose new correlations.
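A minimal sketch of the family-size idea, using the Titanic columns SibSp and Parch (the +1 counts the passenger themselves):
# Family size = siblings/spouses + parents/children + the passenger themselves
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1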
Removing Unnecessary Columns
Columns with low correlation, potential negative effects on the model, or no contribution can be removed. Care must be taken not to blindly delete data, as even seemingly irrelevant features might have hidden impacts.
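As an illustration only (the article does not say which columns to drop), removal could look like this:
# Drop columns judged unlikely to help the model (the column choice here is an assumption for illustration)
train_df = train_df.drop(columns=['PassengerId', 'Name', 'Ticket'])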
This iterative process often involves building and evaluating models to discover the most effective techniques.
5. Creation of Validation Data
Once feature creation is complete, the next step in the preparation process is to create validation data before proceeding to model development.
The dataset is typically divided into two subsets: training data and test data. To predict the test data accurately, a model is trained using the training data. However, if the test data is entirely unknown during training, it becomes challenging to assess the model's performance. To address this, a common approach is to further split the training data into training and validation data subsets, using a technique known as the holdout method, as depicted in the diagram below:
(Diagram of the holdout method. Source: https://di-acc2.com/analytics/ai/6498/)
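A minimal sketch of the holdout method with scikit-learn's train_test_split; the 80/20 ratio is an illustrative assumption:
from sklearn.model_selection import train_test_split

# Separate features and target, then hold out 20% of the training data for validation
X = train_df.drop(columns=['Survived'])
y = train_df['Survived']
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=0)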
However, this method reduces the amount of data available for training, which can hurt the model's accuracy. Additionally, if the single split happens to be imbalanced, the evaluation can be biased and the model may overfit to that particular split.
Therefore, a common practice is to use k-fold cross-validation, where the dataset is divided into k subsets and each subset is used once as validation data while the remaining subsets are used for training. The model's performance is then assessed by averaging the results across the k folds:
(Diagram of k-fold cross-validation. Source: https://di-acc2.com/analytics/ai/6498/)
Here's an example of how to implement k-fold cross-validation in Python, using logistic regression and accuracy purely as illustrative choices:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes a preprocessed train_df with 'Survived' as the target column
X = train_df.drop(columns=['Survived'])
y = train_df['Survived']
cv = KFold(n_splits=10, random_state=0, shuffle=True)
scores = []
for i, (trn_index, val_index) in enumerate(cv.split(X, y)):
    # Data splitting process
    X_trn, X_val = X.iloc[trn_index], X.iloc[val_index]
    y_trn, y_val = y.iloc[trn_index], y.iloc[val_index]
    # Model training process
    model = LogisticRegression(max_iter=1000)
    model.fit(X_trn, y_trn)
    # Model evaluation process
    scores.append(accuracy_score(y_val, model.predict(X_val)))
# Average the results of the 10 evaluations and display
print(sum(scores) / len(scores))
6. Model Selection
Choosing the right model is crucial, and there is no one-size-fits-all model for every scenario. The selection process involves considering factors such as the data's nature, final evaluation metrics, personal experience, and intuition.
Some representative models include:
- Logistic Regression
- Support Vector Machine (SVM)
- k-Nearest Neighbors
- Random Forest
- Multi-Layer Perceptron
- Gradient Boosting Decision Trees (GBDT)
GBDT has gained popularity recently due to its balanced performance, and it's often recommended to start with GBDT before exploring other models. A data analysis book highly praised among Kagglers may also have contributed to this preference for GBDT.
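The article does not prescribe a specific GBDT library (LightGBM and XGBoost are common choices); a minimal sketch with scikit-learn's implementation, reusing the X_trn/X_val split from step 5, might look like this:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Train a gradient boosting model and check accuracy on the held-out validation data
gbdt = GradientBoostingClassifier(random_state=0)
gbdt.fit(X_trn, y_trn)
print(accuracy_score(y_val, gbdt.predict(X_val)))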
Furthermore, the advancement of AutoML (Automated Machine Learning) allows automatic testing of various models and hyperparameters, selecting the best-performing model automatically. Some notable AutoML services and libraries include:
- Cloud AutoML (Google)
- Automated ML (Microsoft)
- AutoAI (IBM)
- PyCaret (Open-source library)
7. Model Training
Once the model selection is complete, the next step is to train the model using the available training data. Despite the perceived complexity of training, modern libraries offer convenient tools for straightforward implementation. Below is an example of coding for logistic regression training:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_train and y_train are the preprocessed features and target from the earlier steps
model = LogisticRegression()
model.fit(X_train, y_train)
# Note: predicting on the training data only gives the training accuracy
y_pred = model.predict(X_train)
print(accuracy_score(y_train, y_pred))
8. Evaluation of Prediction Results
While the provided code displays accuracy, it's important to note that accuracy is just one metric, and there are various evaluation metrics depending on the problem type. For example:
Regression Metrics:
- RMSE (Root Mean Squared Error)
- RMSLE (Root Mean Squared Logarithmic Error)
- MAE (Mean Absolute Error)
Binary Classification Metrics:
- Confusion Matrix
- Accuracy and Misclassification Rate
- Precision and Recall
- F1 Score and Fβ Score
- Log Loss
- AUC (Area Under the ROC Curve)
Multi-Class Classification Metrics:
- Multi-Class Accuracy
- Multi-Class Log Loss
These are just a few examples, and many more exist depending on the context.
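As a brief sketch of a few of the binary classification metrics, assuming a trained model and a validation split X_val, y_val as in step 5:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_val_pred = model.predict(X_val)                 # predicted labels on the validation data
print(confusion_matrix(y_val, y_val_pred))        # counts of true/false positives and negatives
print(precision_score(y_val, y_val_pred))         # precision
print(recall_score(y_val, y_val_pred))            # recall
print(f1_score(y_val, y_val_pred))                # F1 score
# AUC needs predicted probabilities for the positive class
print(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))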
Conclusion
This article has covered the entire process of data analysis. The remaining work is to iterate over steps 3 to 8, gradually refining features and models to improve the evaluation metrics. While problems as clean and simple as the Titanic example are not that common in real Kaggle competitions, mastering this kind of tabular problem is a good first goal for anyone getting into data science and machine learning.