<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Loi2008</title>
    <description>The latest articles on DEV Community by Loi2008 (@loi2008).</description>
    <link>https://dev.to/loi2008</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3407241%2F1cee6cba-91c1-4dd3-a88d-5e681e2f0b95.jpeg</url>
      <title>DEV Community: Loi2008</title>
      <link>https://dev.to/loi2008</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/loi2008"/>
    <language>en</language>
    <item>
      <title>Predicting Survival on the Titanic: A Machine Learning Approach</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Sun, 05 Oct 2025 09:44:09 +0000</pubDate>
      <link>https://dev.to/loi2008/predicting-survival-on-the-titanic-a-machine-learning-approach-24a5</link>
      <guid>https://dev.to/loi2008/predicting-survival-on-the-titanic-a-machine-learning-approach-24a5</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;One of the most tragic maritime catastrophes in history, the 1912 sinking of the RMS Titanic claimed a great number of lives. Beyond the disaster's immense scope, scholars have long been captivated by the intricate interplay of factors that influenced survival rates.&lt;/p&gt;

&lt;p&gt;This article examines the Titanic dataset, a comprehensive collection of passenger data, in order to build a predictive model. By examining characteristics such as age, gender, and passenger class, we aim to identify the primary factors influencing survival and to develop a robust model that forecasts a person's chances of surviving the disaster.&lt;/p&gt;

&lt;p&gt;In addition to illuminating historical trends, this analysis demonstrates how effectively machine learning can derive meaningful predictions from complex real-world data.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Understanding the Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset used (&lt;a href="https://github.com/Loi2008/Data_Science_Assignments/blob/main/tested.csv" rel="noopener noreferrer"&gt;https://github.com/Loi2008/Data_Science_Assignments/blob/main/tested.csv&lt;/a&gt;) contains information about Titanic passengers. The goal is to build a model that predicts the likelihood of survival (the Survived column: 0 = No, 1 = Yes) from the other features. The dataset contains the following columns: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PassengerId:&lt;/strong&gt; Unique identifier for each passenger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survived:&lt;/strong&gt; Survival (0 = No, 1 = Yes) - &lt;em&gt;&lt;strong&gt;Target variable&lt;/strong&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pclass:&lt;/strong&gt; Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; Passenger's name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sex:&lt;/strong&gt; Sex (male/female).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Age:&lt;/strong&gt; Age in years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SibSp:&lt;/strong&gt; Number of siblings/spouses aboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parch:&lt;/strong&gt; Number of parents/children aboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ticket:&lt;/strong&gt; Ticket number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fare:&lt;/strong&gt; Passenger fare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cabin:&lt;/strong&gt; Cabin number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarked:&lt;/strong&gt; Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Building the Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Goal:&lt;/em&gt;&lt;/strong&gt; Predict each passenger's survival likelihood (the 'Survived' column) from the other passenger features. Below are the steps involved, together with the Python code for each step:&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Import the Required Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder 
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b. Suppress Warnings for Cleaner Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  c. Load the Data
&lt;/h3&gt;

&lt;p&gt;Read the CSV file into a pandas DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv (r"tested.csv")
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PassengerId&lt;/th&gt;
&lt;th&gt;Survived&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Ticket&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Cabin&lt;/th&gt;
&lt;th&gt;Embarked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;892&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Kelly, Mr. James&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;34.5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;330911&lt;/td&gt;
&lt;td&gt;7.8292&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;Q&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;893&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wilkes, Mrs. James (Ellen Needs)&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;47.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;363272&lt;/td&gt;
&lt;td&gt;7.0000&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;894&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Myles, Mr. Thomas Francis&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;62.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;240276&lt;/td&gt;
&lt;td&gt;9.6875&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;Q&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;895&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wirz, Mr. Albert&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;315154&lt;/td&gt;
&lt;td&gt;8.6625&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;896&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Hirvonen, Mrs. Alexander (Helga E Lindqvist)&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3101298&lt;/td&gt;
&lt;td&gt;12.2875&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  d. Exploratory Data Analysis (EDA)
&lt;/h3&gt;

&lt;p&gt;Understand the data (data types, missing values and distributions).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
RangeIndex: 418 entries, 0 to 417&lt;br&gt;
Data columns (total 12 columns):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Non-Null Count&lt;/th&gt;
&lt;th&gt;Dtype&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;PassengerId&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Pclass&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Sex&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Age&lt;/td&gt;
&lt;td&gt;332 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;SibSp&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Parch&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Ticket&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Fare&lt;/td&gt;
&lt;td&gt;417 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Cabin&lt;/td&gt;
&lt;td&gt;91 non-null&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Embarked&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;dtypes: float64(2), int64(5), object(5)&lt;br&gt;
memory usage: 39.3+ KB&lt;/p&gt;

&lt;p&gt;Handling missing values: first, count the missing values in each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;missing = df.isnull().sum()
non_null = df.notnull().sum()
total = len(df)

# Build summary DataFrame
summary = pd.DataFrame({
    "Non-Null Count": non_null,
    "Missing Values": missing,
    "Missing %": (missing / total * 100).round(1),
    "Dtype": df.dtypes
})

# Rename index to 'Column'
summary.index.name = "Column"

# Display
summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Non-Null Count&lt;/th&gt;
&lt;th&gt;Missing Values&lt;/th&gt;
&lt;th&gt;Missing %&lt;/th&gt;
&lt;th&gt;Dtype&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PassengerId&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pclass&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sex&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Age&lt;/td&gt;
&lt;td&gt;332&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;20.6%&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SibSp&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parch&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ticket&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fare&lt;/td&gt;
&lt;td&gt;417&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cabin&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;327&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embarked&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Create a copy of the dataset so that later modifications do not affect the original DataFrame (df).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_processed = df.copy()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset's statistical distribution:&lt;/p&gt;
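&lt;p&gt;The snippet that produced this summary is not shown in the article; presumably it was a call to describe(). A minimal sketch on toy values (not the actual dataset):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for df_processed (values are illustrative, not the real data)
df_processed = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0, 27.0, 22.0],
    "Fare": [7.8292, 7.0000, 9.6875, 8.6625, 12.2875],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = df_processed.describe().round(4)
print(summary)
```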

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Survived&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Embarked_C&lt;/th&gt;
&lt;th&gt;Embarked_Q&lt;/th&gt;
&lt;th&gt;Embarked_S&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;count&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;td&gt;418.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mean&lt;/td&gt;
&lt;td&gt;0.3636&lt;/td&gt;
&lt;td&gt;2.2656&lt;/td&gt;
&lt;td&gt;0.6364&lt;/td&gt;
&lt;td&gt;29.5993&lt;/td&gt;
&lt;td&gt;0.4474&lt;/td&gt;
&lt;td&gt;0.3923&lt;/td&gt;
&lt;td&gt;35.5765&lt;/td&gt;
&lt;td&gt;0.2440&lt;/td&gt;
&lt;td&gt;0.1100&lt;/td&gt;
&lt;td&gt;0.6459&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;std&lt;/td&gt;
&lt;td&gt;0.4816&lt;/td&gt;
&lt;td&gt;0.8418&lt;/td&gt;
&lt;td&gt;0.4816&lt;/td&gt;
&lt;td&gt;12.7038&lt;/td&gt;
&lt;td&gt;0.8968&lt;/td&gt;
&lt;td&gt;0.9814&lt;/td&gt;
&lt;td&gt;55.8501&lt;/td&gt;
&lt;td&gt;0.4300&lt;/td&gt;
&lt;td&gt;0.3133&lt;/td&gt;
&lt;td&gt;0.4788&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;min&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.1700&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;23.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;7.8958&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;3.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;27.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;14.4542&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;3.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;35.7500&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;31.4719&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;3.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;76.0000&lt;/td&gt;
&lt;td&gt;8.0000&lt;/td&gt;
&lt;td&gt;9.0000&lt;/td&gt;
&lt;td&gt;512.3292&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;td&gt;1.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Impute the missing values with the median. SimpleImputer is used instead of fillna() because it is more robust and can be reused on new data. The median is preferred over the mean because the data is skewed: in the statistical distribution above, the means of "Age" and "Fare" differ from their medians.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;imputer_age = SimpleImputer(strategy='median')
df_processed['Age'] = imputer_age.fit_transform(df_processed[['Age']])
imputer_fare = SimpleImputer(strategy='median')
df_processed['Fare'] = imputer_fare.fit_transform(df_processed[['Fare']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
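&lt;p&gt;A quick way to confirm what SimpleImputer(strategy='median') does, on toy numbers rather than the article's data:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value; the median of [10, 20, 40] is 20
s = pd.DataFrame({"Age": [10.0, 20.0, np.nan, 40.0]})
imp = SimpleImputer(strategy="median")
filled = imp.fit_transform(s)
print(filled.ravel())  # NaN is replaced by 20.0
```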



&lt;p&gt;Drop the Cabin column: it has a high percentage of missing values (78.2%), and using it directly would require complex feature engineering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_processed = df_processed.drop('Cabin', axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop irrelevant columns.&lt;br&gt;
PassengerId, Name, and Ticket are unique identifiers or free-text fields that do not directly contribute to survival prediction in basic models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_processed = df_processed.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encode categorical columns - 'Sex' and 'Embarked'.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Sex:&lt;/em&gt;&lt;/strong&gt; A binary categorical feature (male/female). LabelEncoder converts male to 1 and female to 0 so that numerical models can use it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Embarked:&lt;/em&gt;&lt;/strong&gt; Has three categories (S, C, Q). OneHotEncoder converts this into separate binary columns - Embarked_C, Embarked_Q, Embarked_S. This prevents the model from assuming an ordinal relationship among the categories.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 'Sex': Use LabelEncoder (binary feature)
le = LabelEncoder()
df_processed['Sex'] = le.fit_transform(df_processed['Sex']) # male=1, female=0. Or male=0, female=1 depending on internal sorting.

# 'Embarked': Use OneHotEncoder
print("Unique Embarked values before OneHotEncoding:", df_processed['Embarked'].unique())
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
embarked_encoded = ohe.fit_transform(df_processed[['Embarked']])
embarked_df = pd.DataFrame(embarked_encoded, columns=ohe.get_feature_names_out(['Embarked']), index=df_processed.index)
df_processed = pd.concat([df_processed.drop('Embarked', axis=1), embarked_df], axis=1)

print("DataFrame after preprocessing. First 5 rows:")
print(df_processed.head())
print("\n")
print("DataFrame Info after preprocessing:")
df_processed.info()
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Survived&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Embarked_C&lt;/th&gt;
&lt;th&gt;Embarked_Q&lt;/th&gt;
&lt;th&gt;Embarked_S&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;34.5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;7.8292&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;47.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;7.0000&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;62.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;9.6875&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;8.6625&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;12.2875&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The final columns&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Non-Null Count&lt;/th&gt;
&lt;th&gt;Dtype&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Pclass&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sex&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Age&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;SibSp&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Parch&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;int64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Fare&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Embarked_C&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Embarked_Q&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Embarked_S&lt;/td&gt;
&lt;td&gt;418 non-null&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  e. Splitting Data
&lt;/h3&gt;

&lt;p&gt;The dataset is divided into training and testing sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Splitting Data into Training and Test Sets")
X = df_processed.drop('Survived', axis=1) 
y = df_processed['Survived']             

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The data is split into training and test sets as illustrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X_train shape: (334, 9)&lt;/li&gt;
&lt;li&gt;X_test shape: (84, 9)&lt;/li&gt;
&lt;li&gt;y_train shape: (334,)&lt;/li&gt;
&lt;li&gt;y_test shape: (84,)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Elaboration on the splitting code&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X = df_processed.drop('Survived', axis=1): Creates the feature matrix X by dropping the target variable.&lt;/li&gt;
&lt;li&gt;y = df_processed['Survived']: Creates the target vector y.&lt;/li&gt;
&lt;li&gt;train_test_split(X, y, test_size=0.2, random_state=42, stratify=y): Splits the data into 80% for training and 20% for testing.&lt;/li&gt;
&lt;li&gt;random_state: Ensures reproducibility of the split.&lt;/li&gt;
&lt;li&gt;stratify=y: Important for classification problems to ensure that the proportion of 'Survived' (0s and 1s) is roughly the same in both training and test sets.&lt;/li&gt;
&lt;/ul&gt;
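&lt;p&gt;The effect of stratify=y can be verified on synthetic labels (not the Titanic data): both splits keep the same class ratio.&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 80/20 class imbalance
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Both splits preserve the 20% positive rate
print(y_tr.mean(), y_te.mean())
```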
&lt;h3&gt;
  
  
  f. Scaling Numerical Features (Columns)
&lt;/h3&gt;

&lt;p&gt;StandardScaler standardizes features by removing the mean and scaling to unit variance. This is important for algorithms that are sensitive to the scale of input features, such as Logistic Regression, because it prevents features with larger values from dominating the learning process and ensures each feature contributes comparably.&lt;br&gt;
&lt;/p&gt;
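&lt;p&gt;As a sanity check of what StandardScaler computes - z = (x - mean) / std, using the population standard deviation - here is a toy example unrelated to the Titanic data:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
z = StandardScaler().fit_transform(x)

# Manual z-score with population std (ddof=0), which is what sklearn uses
manual = (x - x.mean()) / x.std()
print(np.allclose(z, manual))  # True
```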

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("X_train_scaled (first 5 rows):")
print(X_train_scaled[:5])
print("\n")
print("X_test_scaled (first 5 rows):")
print(X_test_scaled[:5])
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
X_train_scaled (first 5 rows)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Embarked_C&lt;/th&gt;
&lt;th&gt;Embarked_Q&lt;/th&gt;
&lt;th&gt;Embarked_S&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-0.21477642&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.50912957&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;0.70551956&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-0.68236136&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.49535877&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;0.70551956&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-0.21477642&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.49615131&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;2.97044263&lt;/td&gt;
&lt;td&gt;-1.41739515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-1.61753124&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;0.67126820&lt;/td&gt;
&lt;td&gt;-0.57539142&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;0.70551956&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;-1.32677450&lt;/td&gt;
&lt;td&gt;-0.21477642&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.49564602&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;2.97044263&lt;/td&gt;
&lt;td&gt;-1.41739515&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;X_test_scaled (first 5 rows)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Embarked_C&lt;/th&gt;
&lt;th&gt;Embarked_Q&lt;/th&gt;
&lt;th&gt;Embarked_S&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;-1.32677450&lt;/td&gt;
&lt;td&gt;-0.21477642&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.49615131&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;2.97044263&lt;/td&gt;
&lt;td&gt;-1.41739515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;-1.49424813&lt;/td&gt;
&lt;td&gt;-1.32677450&lt;/td&gt;
&lt;td&gt;-0.37063806&lt;/td&gt;
&lt;td&gt;0.61116007&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;0.32912293&lt;/td&gt;
&lt;td&gt;1.82692702&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;-1.41739515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.85435834&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-0.21477642&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.49362833&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;0.70551956&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;-0.31994489&lt;/td&gt;
&lt;td&gt;-1.32677450&lt;/td&gt;
&lt;td&gt;-0.76029218&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;0.00567507&lt;/td&gt;
&lt;td&gt;-0.54736724&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;0.70551956&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;-1.49424813&lt;/td&gt;
&lt;td&gt;0.75370758&lt;/td&gt;
&lt;td&gt;-0.37063806&lt;/td&gt;
&lt;td&gt;-0.48043064&lt;/td&gt;
&lt;td&gt;-0.41184087&lt;/td&gt;
&lt;td&gt;-0.18034678&lt;/td&gt;
&lt;td&gt;1.82692702&lt;/td&gt;
&lt;td&gt;-0.33665016&lt;/td&gt;
&lt;td&gt;-1.41739515&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  4. Choosing and Training the Model
&lt;/h2&gt;

&lt;p&gt;Two models were chosen:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Logistic Regression:&lt;/em&gt;&lt;/strong&gt; Chosen for its clear interpretability, providing insights into how each factor linearly influences survival probability, and serving as a strong, efficient baseline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Training Logistic Regression Model...")
log_reg_model = LogisticRegression(random_state=42, solver='liblinear')
log_reg_model.fit(X_train_scaled, y_train)
print("Logistic Regression Model Trained.\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;log_reg_model.fit(X_train_scaled, y_train): Trains the model using the scaled training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Random Forest Classifier:&lt;/em&gt;&lt;/strong&gt; An ensemble method that often achieves higher accuracy by capturing complex, non-linear relationships and interactions without extensive data preprocessing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Training Random Forest Classifier Model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Random Forest is less sensitive to feature scaling, so we can use unscaled X
rf_model.fit(X_train, y_train)
print("Random Forest Classifier Model Trained.\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;rf_model.fit(X_train, y_train): Trains the model. Random Forests are less sensitive to feature scaling, so we can use the unscaled X_train here.&lt;/p&gt;

&lt;p&gt;Together, they allow for a comprehensive analysis, balancing transparency with predictive power to understand Titanic survival.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Evaluating the Models
&lt;/h2&gt;

&lt;p&gt;The predictions on the scaled test set for Logistic Regression are generated (y_pred_log_reg = log_reg_model.predict(X_test_scaled)).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred_log_reg = log_reg_model.predict(X_test_scaled)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {accuracy_log_reg:.4f}")
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log_reg))
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnwzpcjij1nkqq98s3ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnwzpcjij1nkqq98s3ix.png" alt="Trained Logistic Regression Model Output" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predictions on the unscaled test set for Random Forest are generated (y_pred_rf = rf_model.predict(X_test)).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ydkutgaq7l1jltqt63p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ydkutgaq7l1jltqt63p.png" alt="Trained Random Forest Model Output" width="800" height="483"&gt;&lt;/a&gt;&lt;br&gt;
From the output of the two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accuracy_score: Calculates the proportion of correctly classified instances.&lt;/li&gt;
&lt;li&gt;confusion_matrix: Shows the number of true positives, true negatives, false positives, and false negatives.&lt;/li&gt;
&lt;li&gt;classification_report: Provides a detailed summary of precision, recall, f1-score, and support for each class.&lt;/li&gt;
&lt;/ul&gt;
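&lt;p&gt;The three metrics above can be illustrated on a tiny hand-made example (toy labels, not model output):&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Toy labels: 8 passengers, 6 classified correctly
y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 0.75 (6 of 8 correct)
print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(classification_report(y_true, y_pred))
```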
&lt;h2&gt;
  
  
  6. Making Predictions Using Sample Data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using a hypothetical new single passenger, then multiple passengers&lt;/li&gt;
&lt;li&gt;The data is processed through the same steps as the training set:

&lt;ul&gt;
&lt;li&gt;Removing irrelevant columns&lt;/li&gt;
&lt;li&gt;Imputing missing values, if any&lt;/li&gt;
&lt;li&gt;Encoding categorical features using the &lt;em&gt;fitted&lt;/em&gt; encoders&lt;/li&gt;
&lt;li&gt;Ensuring the column order matches X_train; this is crucial (create a DataFrame with the same columns as X_train, then fill it)&lt;/li&gt;
&lt;li&gt;Predicting with Logistic Regression and Random Forest&lt;/li&gt;
&lt;li&gt;Visualizing key relationships&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  a. Capturing the data
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**_Using Single Passenger_**
new_passenger_features_raw = pd.DataFrame({
    'Pclass': [1],
    'Name': ['New, Mrs. Example (Test User)'], # Name added for consistency, but will be dropped
    'Sex': ['female'],
    'Age': [30],
    'SibSp': [0],
    'Parch': [0],
    'Ticket': ['TEST12345'], # Ticket added for consistency, but will be dropped
    'Fare': [50],
    'Cabin': [None], # Cabin added for consistency, but will be dropped
    'Embarked': ['S']
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  b. Dropping irrelevant columns
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passenger_processed = new_passenger_features_raw.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  c. Imputing missing values
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passenger_processed['Age'] = imputer_age.transform(new_passenger_processed[['Age']])
new_passenger_processed['Fare'] = imputer_fare.transform(new_passenger_processed[['Fare']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  d. Encoding categorical features
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passenger_processed['Sex'] = le.transform(new_passenger_processed['Sex'])
embarked_new = ohe.transform(new_passenger_processed[['Embarked']])
embarked_new_df = pd.DataFrame(embarked_new, columns=ohe.get_feature_names_out(['Embarked']))
new_passenger_processed = pd.concat([new_passenger_processed.drop('Embarked', axis=1), embarked_new_df], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  e. Ensure column order matches training data for X_train
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passenger_final = pd.DataFrame(columns=X.columns)
new_passenger_final = pd.concat([new_passenger_final, new_passenger_processed], ignore_index=True)
new_passenger_final = new_passenger_final.fillna(0) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
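&lt;p&gt;The create-then-concat-then-fillna pattern above can also be written in one call with DataFrame.reindex, which adds any missing training columns and zero-fills them. A sketch with toy column names (not the article's actual X.columns):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-ins: the training columns, and a processed row missing two dummy columns
train_columns = ['Pclass', 'Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
row = pd.DataFrame({'Pclass': [1], 'Sex': [0], 'Embarked_S': [1.0]})

# reindex adds the absent training columns and fills them with 0, in column order
aligned = row.reindex(columns=train_columns, fill_value=0)
print(aligned)
```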

&lt;h3&gt;
  
  
  f. Prediction with the models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Logistic Regression using scaled data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passenger_scaled = scaler.transform(new_passenger_final)
log_reg_prediction = log_reg_model.predict(new_passenger_scaled)
print(f"Logistic Regression predicts survival for new passenger: {'Survived' if log_reg_prediction[0] == 1 else 'Not Survived'}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Logistic Regression predicts survival for new passenger: &lt;strong&gt;&lt;em&gt;Survived&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Random Forest using unscaled data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf_prediction = rf_model.predict(new_passenger_final)
print(f"Random Forest predicts survival for new passenger: {'Survived' if rf_prediction[0] == 1 else 'Not Survived'}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Random Forest predicts survival for new passenger: &lt;strong&gt;&lt;em&gt;Survived&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using Multiple Passengers&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  a. Capture multiple passengers' data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passengers_features_raw = pd.DataFrame({
    'Pclass': [1, 2, 3],
    'Name': [
        'New, Mrs. Example (Test User 1)',
        'New, Mr. Sample (Test User 2)',
        'New, Miss Demo (Test User 3)'
    ],
    'Sex': ['female', 'male', 'female'],
    'Age': [30, None, 22],        # Example: missing Age for passenger 2
    'SibSp': [0, 1, 0],
    'Parch': [0, 0, 1],
    'Ticket': ['TEST12345', 'TEST12346', 'TEST12347'],
    'Fare': [50, None, 15],       # Example: missing Fare for passenger 2
    'Cabin': [None, None, None],
    'Embarked': ['S', 'C', 'Q']
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b. Dropping irrelevant columns
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passengers_processed = new_passengers_features_raw.drop(
    ['PassengerId', 'Name', 'Ticket', 'Cabin'],
    axis=1,
    errors='ignore'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  c. Imputing missing values using fitted imputers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passengers_processed['Age'] = pd.Series(
    imputer_age.transform(new_passengers_processed[['Age']]).flatten(),
    index=new_passengers_processed.index
)
new_passengers_processed['Fare'] = pd.Series(
    imputer_fare.transform(new_passengers_processed[['Fare']]).flatten(),
    index=new_passengers_processed.index
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  d. Encoding categorical features
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Label encode 'Sex'
new_passengers_processed['Sex'] = le.transform(new_passengers_processed['Sex'])

#One-hot encode 'Embarked'
embarked_new = ohe.transform(new_passengers_processed[['Embarked']])
embarked_new_df = pd.DataFrame(
    embarked_new, 
    columns=ohe.get_feature_names_out(['Embarked']),
    index=new_passengers_processed.index
)
new_passengers_processed = pd.concat(
    [new_passengers_processed.drop('Embarked', axis=1), embarked_new_df],
    axis=1
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  e. Align columns with training data
&lt;/h3&gt;

&lt;p&gt;Handles cases where new_passengers_processed might be missing a dummy column (e.g., if a new passenger only has 'Embarked_S' and not 'Embarked_C').&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_passengers_final = pd.DataFrame(columns=X.columns)
new_passengers_final = pd.concat([new_passengers_final, new_passenger_processed], ignore_index=True)
new_passengers_final = new_passenger_final.fillna(0)

print(new_passengers_final)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Embarked_C&lt;/th&gt;
&lt;th&gt;Embarked_Q&lt;/th&gt;
&lt;th&gt;Embarked_S&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;50.0000&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;14.4542&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;15.0000&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  f. Scale features and predict survival for multiple passengers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Logistic Regression&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("--- Logistic Regression Predictions ---")
new_passengers_scaled = scaler.transform(new_passengers_final)
log_reg_predictions = log_reg_model.predict(new_passengers_scaled)
log_reg_probabilities = log_reg_model.predict_proba(new_passengers_scaled)[:, 1] # Probability of survival

# Create results_lr using the *index* of new_passengers_final
results_lr = pd.DataFrame({
    'Predicted_Survival_LR': ['Survived' if p == 1 else 'Not Survived' for p in log_reg_predictions],
    'Survival_Probability_LR': log_reg_probabilities.round(2)
}, index=new_passengers_final.index) # Use the index of the processed features for results_lr

print(results_lr)
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Predicted_Survival_LR&lt;/th&gt;
&lt;th&gt;Survival_Probability_LR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Not Survived&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Random Forest&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("--- Random Forest Predictions ---")
# Random Forest uses unscaled data, so use new_passengers_final directly
rf_predictions = rf_model.predict(new_passengers_final)
rf_probabilities = rf_model.predict_proba(new_passengers_final)[:, 1] # Probability of survival

results_rf = pd.DataFrame({
    'Predicted_Survival_RF': ['Survived' if p == 1 else 'Not Survived' for p in rf_predictions],
    'Survival_Probability_RF': rf_probabilities.round(4)
})
print(results_rf)
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Predicted_Survival_RF&lt;/th&gt;
&lt;th&gt;Survival_Probability_RF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Not Survived&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Combined Prediction&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;combined_results = pd.merge(results_lr, results_rf, left_index=True, right_index=True)
print("Combined Predictions (merged on index)")
print(combined_results)
print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Predicted_Survival_LR&lt;/th&gt;
&lt;th&gt;Survival_Probability_LR&lt;/th&gt;
&lt;th&gt;Predicted_Survival_RF&lt;/th&gt;
&lt;th&gt;Survival_Probability_RF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Not Survived&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;Not Survived&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;Survived&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  7. Visualizing Key Relationships
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Survival rate by Sex&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("\n Visualizing Key Relationships - By Sex")
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df_processed.replace({'Sex': {0: 'Female', 1: 'Male'}}))
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plot clearly shows the higher survival rate for females.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foanqvd6qwjyf521vnog4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foanqvd6qwjyf521vnog4.png" alt="Predicted Survival Rate by Sex" width="536" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Survival rate by Pclass&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This plot illustrates how survival rates vary across the passenger classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("\n Visualizing Key Relationships - By PClass")
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df_processed)
plt.title('Survival Rate by Pclass')
plt.ylabel('Survival Rate')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmc7a89r9y3huyap5mt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmc7a89r9y3huyap5mt4.png" alt="Survival Rate by Passenger Class" width="536" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Distribution of Age and its relation to Survival&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The histogram compares the age distribution of survivors vs. non-survivors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10, 6))
sns.histplot(df_processed[df_processed['Survived'] == 0]['Age'], color='red', label='Not Survived', kde=True)
sns.histplot(df_processed[df_processed['Survived'] == 1]['Age'], color='green', label='Survived', kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpryni55tyoxa4onut0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpryni55tyoxa4onut0k.png" alt="Relationship between Age and Survival Rate" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Feature Importance for Random Forest&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Since we chose Random Forest partly for its ability to report feature importance, visualizing or printing rf_model.feature_importances_ adds interpretability to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;importances = rf_model.feature_importances_
feature_names = X.columns
forest_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=forest_importances.values, y=forest_importances.index)
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp22v81lhb5le44wxz2rt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp22v81lhb5le44wxz2rt.png" alt="Feature Importance" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Conclusion and Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
This project effectively leveraged machine learning to forecast the survival of Titanic passengers. The tragic historical event emphasizes the crucial role that socioeconomic and demographic factors play in crisis outcomes. The predictions confirmed that sex and passenger class were by far the most important factors influencing survival. This was evident in both the Logistic Regression and Random Forest models, with women and passengers in higher classes having a significantly higher probability of being saved. Age was also a factor, favoring very young children in particular.&lt;/p&gt;

&lt;p&gt;Complementary insights were obtained by applying the two separate models. As a strong baseline, Logistic Regression provided a clear, intelligible explanation of the linear influence of each characteristic on survival probability. The Random Forest Classifier demonstrated the effectiveness of ensemble approaches for subtle pattern identification, typically achieving greater predictive accuracy thanks to its capacity to capture complicated non-linear relationships and interactions. The results were further supported by the visual analyses, which offered clear graphical evidence of the differences in survival rates.&lt;br&gt;
In conclusion, this project demonstrated machine learning's ability to glean significant, actionable insights from historical data, shedding light on historical trends and providing a potent tool for comprehending intricate real-world phenomena, in addition to producing strong predictive models for the Titanic disaster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
The process of creating and assessing Titanic passenger survival prediction models was covered at length in this article. It began with an overview of the problem and proceeded through the fundamental stages of a machine learning workflow, with emphasis on training, evaluating, and testing the models using sample data.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Atieno, L., Robina, F., &amp;amp; Otieno, M. (2025). &lt;em&gt;Survival Likelihood Model.&lt;/em&gt; (&lt;a href="https://github.com/Loi2008/Data_Science_Assignments/blob/main/Prediction_Model_Titanic_Dataset.ipynb" rel="noopener noreferrer"&gt;https://github.com/Loi2008/Data_Science_Assignments/blob/main/Prediction_Model_Titanic_Dataset.ipynb&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Dawson, E. (1997). &lt;em&gt;The Titanic Disaster: Historical and Social Perspectives.&lt;/em&gt; Journal of Maritime History.
&lt;/li&gt;
&lt;li&gt;Kaggle. (n.d.). &lt;em&gt;Titanic: Machine Learning from Disaster.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Breiman, L. (2001). &lt;em&gt;Random Forests.&lt;/em&gt; Machine Learning, 45(1), 5–32.
&lt;/li&gt;
&lt;li&gt;Hosmer, D. W., &amp;amp; Lemeshow, S. (2000). &lt;em&gt;Applied Logistic Regression.&lt;/em&gt; Wiley Series in Probability and Statistics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Géron, A. (2019). &lt;em&gt;Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.&lt;/em&gt; O’Reilly Media.
&lt;/li&gt;
&lt;li&gt;Murphy, K. P. (2012). &lt;em&gt;Machine Learning: A Probabilistic Perspective.&lt;/em&gt; MIT Press.
&lt;/li&gt;
&lt;li&gt;Bishop, C. M. (2006). &lt;em&gt;Pattern Recognition and Machine Learning.&lt;/em&gt; Springer.
&lt;/li&gt;
&lt;li&gt;Raschka, S., &amp;amp; Mirjalili, V. (2019). &lt;em&gt;Python Machine Learning.&lt;/em&gt; Packt Publishing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blog:&lt;/strong&gt; &lt;a href="https://towardsdatascience.com/top-10-machine-learning-algorithms-you-should-know-89169fb5e7d" rel="noopener noreferrer"&gt;Top 10 Machine Learning Algorithms You Should Know – Towards Data Science&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>📊Unlocking the power of SQL: Subqueries, CTEs, and Stored Procedures Demystified</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:20:31 +0000</pubDate>
      <link>https://dev.to/loi2008/unlocking-the-power-of-sql-subqueries-ctes-and-stored-procedures-demystified-4aci</link>
      <guid>https://dev.to/loi2008/unlocking-the-power-of-sql-subqueries-ctes-and-stored-procedures-demystified-4aci</guid>
      <description>&lt;h2&gt;
  
  
  📝Introduction
&lt;/h2&gt;

&lt;p&gt;In SQL, developers are often faced with situations where they are required to break down complex queries, reuse logic, or encapsulate business rules for repeated use. There are three powerful features that help manage complexity and improve efficiency, and each serves a different purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subqueries - allow quick, inline calculations inside a query.&lt;/li&gt;
&lt;li&gt;Common Table Expressions (CTEs) - improve readability and support recursion within queries.&lt;/li&gt;
&lt;li&gt;Stored procedures - encapsulate reusable, parameterized business logic stored at the database level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding their similarities, differences, and best use cases is essential for writing efficient, maintainable SQL code.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Subquery
&lt;/h4&gt;

&lt;p&gt;A subquery is a query nested inside another query. It can be used in the &lt;strong&gt;SELECT&lt;/strong&gt;, &lt;strong&gt;FROM&lt;/strong&gt;, or &lt;strong&gt;WHERE&lt;/strong&gt; clause to provide intermediate results.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🟢&lt;strong&gt;Best for:&lt;/strong&gt; Quick, one-off filtering or calculations.&lt;br&gt;
🔴&lt;strong&gt;Limitation:&lt;/strong&gt; Cannot be reused across queries and may affect performance if overused.&lt;/p&gt;
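&lt;p&gt;The example above can be run end to end with Python's built-in sqlite3 module (the employees table and its rows are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_name TEXT, salary REAL);
INSERT INTO employees VALUES ('Ann', 90000), ('Ben', 50000), ('Cy', 70000);
""")

# The inner SELECT runs first, producing the average (70000);
# the outer query then filters on that intermediate result
rows = conn.execute("""
    SELECT employee_name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees);
""").fetchall()
print(rows)  # [('Ann', 90000.0)]
```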
&lt;h4&gt;
  
  
  2. Common Table Expression (CTE)
&lt;/h4&gt;

&lt;p&gt;A CTE is a temporary named result set defined with the &lt;strong&gt;WITH&lt;/strong&gt; keyword. It improves readability and supports recursion.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;AvgSalary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_sal&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AvgSalary&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;avg_sal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🟢&lt;strong&gt;Best for:&lt;/strong&gt; Structuring complex queries, improving readability, and handling recursive scenarios like hierarchies.&lt;br&gt;
🔴&lt;strong&gt;Limitation:&lt;/strong&gt; Exists only within the query scope and cannot be parameterized.&lt;/p&gt;
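&lt;p&gt;Since recursion is the CTE's signature capability, here is a runnable sketch using Python's built-in sqlite3. It walks a reporting hierarchy via a hypothetical manager_id column (the table and data are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, employee_name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Alice', NULL),  -- head of the hierarchy
  (2, 'Bob',   1),
  (3, 'Carol', 2),
  (4, 'Dan',   2);
""")

# The anchor member selects the root; the recursive member joins in direct reports
rows = conn.execute("""
    WITH RECURSIVE chain(id, employee_name, depth) AS (
        SELECT id, employee_name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.employee_name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT employee_name, depth FROM chain ORDER BY depth, employee_name;
""").fetchall()
print(rows)  # [('Alice', 0), ('Bob', 1), ('Carol', 2), ('Dan', 2)]
```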
&lt;h4&gt;
  
  
  3. Stored Procedure
&lt;/h4&gt;

&lt;p&gt;A stored procedure is a precompiled set of SQL statements stored in the database. It can accept parameters, perform multiple operations, and encapsulate business logic.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;GetHighEarners&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;minSalary&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;minSalary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;GetHighEarners&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🟢&lt;strong&gt;Best for:&lt;/strong&gt; Reusable routines, parameterized operations, and business logic encapsulation.&lt;br&gt;
🔴&lt;strong&gt;Limitation:&lt;/strong&gt; Requires database-level creation and maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Subquery&lt;/th&gt;
&lt;th&gt;CTE&lt;/th&gt;
&lt;th&gt;Stored Procedure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Within a query&lt;/td&gt;
&lt;td&gt;Within a query&lt;/td&gt;
&lt;td&gt;Stored in the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reusability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Only within same query&lt;/td&gt;
&lt;td&gt;Yes (global)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parameters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supports Recursion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Can Modify Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rarely (SELECT only)&lt;/td&gt;
&lt;td&gt;Rarely (SELECT only)&lt;/td&gt;
&lt;td&gt;Yes (INSERT/UPDATE/DELETE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple inline logic&lt;/td&gt;
&lt;td&gt;Complex query readability&lt;/td&gt;
&lt;td&gt;Reusable business logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where to use each
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Subquery
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In a simple, one-off query.&lt;/li&gt;
&lt;li&gt;When filtering or aggregating values inside a query.&lt;/li&gt;
&lt;li&gt;⚠️ Use correlated subqueries sparingly for performance reasons.&lt;/li&gt;
&lt;/ul&gt;
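&lt;p&gt;The filtering use case can be sketched end-to-end from Python with SQLite. The table, column names, and salary figures below are illustrative, not drawn from any particular schema:&lt;/p&gt;

```python
import sqlite3

# In-memory database with a minimal, illustrative employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "IT", 90000), ("Ben", "IT", 60000), ("Cara", "HR", 50000)],
)

# Scalar subquery: filter against an aggregate computed inside the same query
rows = conn.execute(
    "SELECT name FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()
print(rows)  # [('Ann',)]
```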

&lt;h4&gt;
  
  
  CTE
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;When query logic is complex or needs recursion.&lt;/li&gt;
&lt;li&gt;Makes queries readable and maintainable.&lt;/li&gt;
&lt;li&gt;Ideal when the same subquery is referenced multiple times.&lt;/li&gt;
&lt;/ul&gt;
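&lt;p&gt;Since recursion is the capability that most clearly sets CTEs apart, here is a minimal recursive-CTE sketch, runnable against SQLite from Python. The three-row org chart is invented for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory org chart to demonstrate a recursive CTE (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "CEO", None), (2, "VP", 1), (3, "Engineer", 2)],
)

# WITH RECURSIVE walks the reporting hierarchy from the root downward
query = """
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('CEO', 0), ('VP', 1), ('Engineer', 2)]
```

&lt;p&gt;The same &lt;code&gt;WITH RECURSIVE&lt;/code&gt; shape applies to bill-of-materials and graph-traversal queries.&lt;/p&gt;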

&lt;h4&gt;
  
  
  Stored Procedure
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;When logic needs reusability across multiple queries.&lt;/li&gt;
&lt;li&gt;For data modification, business rules, or repetitive operations.&lt;/li&gt;
&lt;li&gt;When performance benefits from precompiled execution.&lt;/li&gt;
&lt;li&gt;When parameters or multiple operations are required in one call.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In SQL, subqueries, CTEs, and stored procedures are complementary tools. Subqueries are ideal for fast, inline operations; CTEs enable recursion and make difficult queries readable; and stored procedures encapsulate reusable, parameterized business logic for routine tasks. Stored procedures also bridge the gap between database architecture and programming principles by embodying abstraction, reusability, and modularity, much as Python functions do. By choosing the appropriate method for each situation, developers can write scalable, maintainable, and efficient SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  📖 Further Reading &amp;amp; References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Gravell, M. &lt;em&gt;Difference between CTE and subquery on Stack Overflow&lt;/em&gt; – highlights recursive capabilities of CTEs. &lt;a href="https://stackoverflow.com/questions/706972/difference-between-cte-and-subquery" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/706972/difference-between-cte-and-subquery&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LearnSQL.com&lt;/strong&gt; – comprehensive overview of subqueries and CTEs with examples. &lt;a href="https://learnsql.com/blog/cte-vs-subquery" rel="noopener noreferrer"&gt;https://learnsql.com/blog/cte-vs-subquery&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KDnuggets (April 2025):&lt;/strong&gt; &lt;em&gt;SQL CTE vs Subquery: This Debate Ain’t Over Yet&lt;/em&gt; – detailed comparison.  &lt;a href="https://www.kdnuggets.com/sql-cte-vs-subquery-this-debate-aint-over-yet" rel="noopener noreferrer"&gt;https://www.kdnuggets.com/sql-cte-vs-subquery-this-debate-aint-over-yet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wikipedia – Stored Procedure&lt;/strong&gt; – in-depth explanation of stored procedures, use cases, and comparison with functions.  &lt;a href="https://en.wikipedia.org/wiki/Stored_procedure" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Stored_procedure&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL stored procedures guide&lt;/strong&gt; – syntax and use for transaction-aware routines. &lt;a href="https://pysql.tecladocode.com/section08/lectures/04_stored_procedures" rel="noopener noreferrer"&gt;https://pysql.tecladocode.com/section08/lectures/04_stored_procedures&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wikipedia – Correlated Subquery&lt;/strong&gt; – explains execution patterns and performance considerations in correlated subqueries.  &lt;a href="https://en.wikipedia.org/wiki/Correlated_subquery" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Correlated_subquery&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>subqueries</category>
      <category>commontableexpression</category>
      <category>storedprocedures</category>
      <category>python</category>
    </item>
    <item>
      <title>From SQL to Python: Uniting Stored Power with Functional Flexibility</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:10:08 +0000</pubDate>
      <link>https://dev.to/loi2008/from-sql-to-python-uniting-stored-power-with-functional-flexibility-icf</link>
      <guid>https://dev.to/loi2008/from-sql-to-python-uniting-stored-power-with-functional-flexibility-icf</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Databases and programming languages are frequently used in modern software systems to provide effective, scalable, and maintainable solutions. Python functions and SQL stored procedures are essential components of these ecosystems. Python functions encapsulate reusable application logic for computation, integration, and sophisticated processing, whereas stored procedures encapsulate database logic to carry out actions directly within the database engine.&lt;br&gt;
This article explores their similarities, differences, and suitable applications, emphasizing their potential in both individual and combined use. &lt;/p&gt;
&lt;h3&gt;
  
  
  Stored Procedure (SQL)
&lt;/h3&gt;

&lt;p&gt;A stored procedure is a precompiled set of SQL statements (and optional control-of-flow logic) stored in a relational database. It can accept input parameters, perform operations (such as queries, inserts, updates, deletes, or complex business logic), and return results. &lt;/p&gt;
&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Encapsulation of database logic.&lt;/li&gt;
&lt;li&gt;Parameterized execution for dynamic queries.&lt;/li&gt;
&lt;li&gt;Control-of-flow logic (IF, WHILE, CASE).&lt;/li&gt;
&lt;li&gt;Enhanced security via procedure-level permissions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Application
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Generating financial or operational reports&lt;/li&gt;
&lt;li&gt;Running batch updates and ETL jobs&lt;/li&gt;
&lt;li&gt;Enforcing business rules within the database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  SQL Script: Applying Stored Procedures
&lt;/h4&gt;

&lt;p&gt;The example below illustrates the application of a stored procedure in a business scenario, using a sales database. The script retrieves customer orders above a given amount and logs each time the procedure is executed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customers table
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    City VARCHAR(50)
);

-- Orders table
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    OrderAmount DECIMAL(10,2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- ProcedureLog table
CREATE TABLE ProcedureLog (
    Log_ID SERIAL PRIMARY KEY,  
    ProcedureName VARCHAR(100),
    ExecutionTime TIMESTAMP
);

-- Insert sample data
INSERT INTO Customers (CustomerID, CustomerName, City)
VALUES (1, 'Alice Johnson', 'New York'),
       (2, 'Michael Smith', 'Chicago'),
       (3, 'Sarah Lee', 'San Francisco');

INSERT INTO Orders (OrderID, CustomerID, OrderDate, OrderAmount)
VALUES (101, 1, '2025-01-10', 250.00),
       (102, 2, '2025-01-15', 120.00),
       (103, 1, '2025-02-01', 500.00),
       (104, 3, '2025-02-05', 90.00);

-- Stored routine (written as a PL/pgSQL function, since it returns a result set)

CREATE OR REPLACE FUNCTION GetHighValueOrders(min_amount DECIMAL)
RETURNS TABLE (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderDate DATE,
    OrderAmount DECIMAL(10,2)
)
AS $$
BEGIN
    -- Log the execution
    INSERT INTO ProcedureLog (ProcedureName, ExecutionTime)
    VALUES ('GetHighValueOrders', NOW());

    -- Return query
    RETURN QUERY
    SELECT O.OrderID, C.CustomerName, O.OrderDate, O.OrderAmount
    FROM Orders O
    INNER JOIN Customers C ON O.CustomerID = C.CustomerID
    WHERE O.OrderAmount &amp;gt;= min_amount
    ORDER BY O.OrderAmount DESC;
END;
$$ LANGUAGE plpgsql;
-- Executing the function
SELECT * FROM GetHighValueOrders(200);

-- Check logs
SELECT * FROM ProcedureLog;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The SQL code:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates the tables.&lt;/li&gt;
&lt;li&gt;Inserts customers and orders.&lt;/li&gt;
&lt;li&gt;Creates a stored routine that:

&lt;ul&gt;
&lt;li&gt;Logs every execution into ProcedureLog.&lt;/li&gt;
&lt;li&gt;Returns orders where OrderAmount &amp;gt;= min_amount.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Function (Python)
&lt;/h3&gt;

&lt;p&gt;A Python function is a block of reusable code that performs a specific task, takes input arguments (optional), and can return values. Functions in Python support modularity, abstraction, and reusability within applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Can return any Python object (e.g., int, list, dict)&lt;/li&gt;
&lt;li&gt;Support recursion, loops, and error handling with try...except&lt;/li&gt;
&lt;li&gt;Integrate seamlessly with external APIs and libraries&lt;/li&gt;
&lt;li&gt;Enable abstraction and modularity in software design&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Application
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Data preprocessing and cleaning&lt;/li&gt;
&lt;li&gt;Implementing application business rules&lt;/li&gt;
&lt;li&gt;Applying machine learning and analytics&lt;/li&gt;
&lt;li&gt;Integrating with external APIs and services&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Python Script: Applying Python Function
&lt;/h4&gt;

&lt;p&gt;The script mirrors the stored-procedure workflow in pure Python, using in-memory data for reusability and clarity. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Represents the tables as in-memory lists of dictionaries.&lt;/li&gt;
&lt;li&gt;Defines a function that plays the role of the stored procedure.&lt;/li&gt;
&lt;li&gt;Applies functions to filter, transform, and display the data.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

# "Tables" in memory
customers = [
    {"CustomerID": 1, "CustomerName": "Alice Johnson", "City": "New York"},
    {"CustomerID": 2, "CustomerName": "Michael Smith", "City": "Chicago"},
    {"CustomerID": 3, "CustomerName": "Sarah Lee", "City": "San Francisco"}
]

orders = [
    {"OrderID": 101, "CustomerID": 1, "OrderDate": "2025-01-10", "OrderAmount": 250.00},
    {"OrderID": 102, "CustomerID": 2, "OrderDate": "2025-01-15", "OrderAmount": 120.00},
    {"OrderID": 103, "CustomerID": 1, "OrderDate": "2025-02-01", "OrderAmount": 500.00},
    {"OrderID": 104, "CustomerID": 3, "OrderDate": "2025-02-05", "OrderAmount": 90.00}
]

procedure_log = []  # "ProcedureLog table"

# Function to log execution
def log_procedure(name):
    procedure_log.append({
        "ProcedureName": name,
        "ExecutionTime": datetime.now()
    })

# Function to get high-value orders
def get_high_value_orders(min_amount):
    # Log execution
    log_procedure("get_high_value_orders")

    # Filter and join with customers
    result = []
    for order in orders:
        if order["OrderAmount"] &amp;gt;= min_amount:
            customer = next(c for c in customers if c["CustomerID"] == order["CustomerID"])
            result.append({
                "OrderID": order["OrderID"],
                "CustomerName": customer["CustomerName"],
                "OrderDate": order["OrderDate"],
                "OrderAmount": order["OrderAmount"]
            })
    # Sort by amount, descending (equivalent to ORDER BY OrderAmount DESC)
    result.sort(key=lambda x: x["OrderAmount"], reverse=True)
    return result
# ---------------------------
# Application
print("High value orders &amp;gt;= 200:")
for row in get_high_value_orders(200):
    print(row)
print("\nProcedure logs:")
for log in procedure_log:
    print(log)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Similarities
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Encapsulation of Logic
&lt;/h4&gt;

&lt;p&gt;Both stored procedures and Python functions encapsulate logic into reusable units. For example, instead of writing the same SQL query or Python code multiple times, you place it in a procedure/function and call it when needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Parameterization
&lt;/h4&gt;

&lt;p&gt;Both accept input parameters, process them, and return results. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL: EXEC GetCustomerOrders @CustomerID = 5

Python: get_customer_orders(customer_id=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Modularity &amp;amp; Reusability
&lt;/h4&gt;

&lt;p&gt;Both allow modular program design, making systems easier to maintain. Code changes in one procedure/function apply everywhere it is called.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Control Flow Support
&lt;/h4&gt;

&lt;p&gt;Both can include conditional logic (IF, CASE in SQL vs. if/else in Python) and looping constructs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Stored Procedures (SQL)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Functions (Python)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution Context&lt;/td&gt;
&lt;td&gt;Runs inside database engine&lt;/td&gt;
&lt;td&gt;Runs in Python interpreter/application layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Primary Purpose&lt;/td&gt;
&lt;td&gt;Optimizing database operations (queries, transactions)&lt;/td&gt;
&lt;td&gt;Implementing general-purpose logic and algorithms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return Types&lt;/td&gt;
&lt;td&gt;Result sets, output parameters, status codes&lt;/td&gt;
&lt;td&gt;Any Python object (int, list, dict, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language Used&lt;/td&gt;
&lt;td&gt;SQL with procedural extensions (T-SQL, PL/SQL, etc.)&lt;/td&gt;
&lt;td&gt;Python syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Precompiled, reduces network traffic by processing in DB&lt;/td&gt;
&lt;td&gt;Requires fetching data from DB before processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Handling&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TRY...CATCH&lt;/code&gt; blocks&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;try...except&lt;/code&gt; blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statefulness&lt;/td&gt;
&lt;td&gt;Tied to database state (tables, views, transactions)&lt;/td&gt;
&lt;td&gt;Independent, works with in-memory or external data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Suitable Applications
&lt;/h3&gt;

&lt;h5&gt;
  
  
  Stored Procedures (SQL)
&lt;/h5&gt;

&lt;p&gt;Best used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy database operations are needed (aggregation, filtering, batch updates).&lt;/li&gt;
&lt;li&gt;Reduced network overhead is needed (logic executes close to the data).&lt;/li&gt;
&lt;li&gt;Security is critical - permissions can be granted at procedure-level rather than table-level.&lt;/li&gt;
&lt;li&gt;You need performance optimization: pre-compiled execution plans and indexing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Application
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;Generating financial reports directly from the database.&lt;/li&gt;
&lt;li&gt;Performing scheduled batch updates or ETL processes.&lt;/li&gt;
&lt;li&gt;Enforcing business rules within the database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Python Functions
&lt;/h5&gt;

&lt;p&gt;Best used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application-level processing is required (business rules, algorithms, data transformations).&lt;/li&gt;
&lt;li&gt;Data needs to be manipulated in memory beyond SQL capabilities (e.g., machine learning, natural language processing).&lt;/li&gt;
&lt;li&gt;You need integration with external APIs, services, or user interfaces.&lt;/li&gt;
&lt;li&gt;Logic requires flexibility beyond relational operations (graph algorithms, recursive calculations, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Application
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning and preparing datasets for machine learning.&lt;/li&gt;
&lt;li&gt;Implementing application logic in a web service.&lt;/li&gt;
&lt;li&gt;Calling a database stored procedure and further processing results in Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Application of both
&lt;/h5&gt;

&lt;p&gt;In real-world systems, stored procedures and Python functions often complement each other: a stored procedure handles data retrieval and aggregation, while a Python function calls the stored procedure and applies additional business logic.&lt;/p&gt;
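&lt;p&gt;A minimal, self-contained sketch of this division of labor, reusing the in-memory style of the Python script above; the 300.00 "priority" threshold is an invented business rule:&lt;/p&gt;

```python
# Sketch of the combined pattern: a retrieval routine standing in for the
# stored procedure, plus application-level business logic layered in Python.
# The data and the 300.00 "priority" threshold are illustrative.

orders = [
    {"OrderID": 101, "CustomerName": "Alice Johnson", "OrderAmount": 250.00},
    {"OrderID": 103, "CustomerName": "Alice Johnson", "OrderAmount": 500.00},
]

def get_high_value_orders(min_amount):
    # Plays the role of the stored procedure: retrieval and filtering
    return [o for o in orders if o["OrderAmount"] >= min_amount]

def flag_priority(rows, threshold=300.00):
    # Application-level rule applied on top of the retrieved data
    for row in rows:
        row["Priority"] = row["OrderAmount"] >= threshold
    return rows

result = flag_priority(get_high_value_orders(200))
print(result)
```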

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stored Procedures:&lt;/strong&gt; Optimize and secure database operations, reduce network load, enforce business rules within the DB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Functions:&lt;/strong&gt; Provide flexibility, abstraction, and broader application logic capabilities outside the database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both:&lt;/strong&gt; Form a powerful combination — databases handle what they do best (data storage and retrieval), while Python manages application logic and advanced processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  References and Further Reading
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Coronel, C., &amp;amp; Morris, S. (2015). &lt;em&gt;Database Systems: Design, Implementation, &amp;amp; Management&lt;/em&gt;. Cengage Learning.
&lt;/li&gt;
&lt;li&gt;Ramakrishnan, R., &amp;amp; Gehrke, J. (2003). &lt;em&gt;Database Management Systems&lt;/em&gt;. McGraw-Hill.
&lt;/li&gt;
&lt;li&gt;Fowler, M. (2018). &lt;em&gt;Refactoring: Improving the Design of Existing Code&lt;/em&gt;. Addison-Wesley.
&lt;/li&gt;
&lt;li&gt;Van Rossum, G., &amp;amp; Drake, F. L. (2009). &lt;em&gt;The Python Language Reference Manual&lt;/em&gt;. Network Theory Ltd.
&lt;/li&gt;
&lt;li&gt;Microsoft Docs. (2023). &lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/stored-procedures/stored-procedures-database-engine" rel="noopener noreferrer"&gt;Stored Procedures (Database Engine)&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;PostgreSQL Documentation. (2023). &lt;a href="https://www.postgresql.org/docs/current/xfunc.html" rel="noopener noreferrer"&gt;Functions and Stored Procedures&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Python Software Foundation. (2023). &lt;a href="https://docs.python.org/3/tutorial/controlflow.html#defining-functions" rel="noopener noreferrer"&gt;Python Functions&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;IBM Developer. (2021). &lt;a href="https://developer.ibm.com/articles/ba-perfstoredproc/" rel="noopener noreferrer"&gt;Choosing Between Stored Procedures and Application Logic&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Real Python. (2023). &lt;a href="https://realpython.com/defining-your-own-python-function/" rel="noopener noreferrer"&gt;Defining Your Own Python Function&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Stack Overflow Discussions. (Ongoing). &lt;em&gt;Best practices for stored procedures vs. application-level logic&lt;/em&gt;.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>backend</category>
      <category>sql</category>
      <category>python</category>
    </item>
    <item>
      <title>Beyond the Fields: A Data-Driven Look at Kenya’s Agricultural Productivity</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Fri, 12 Sep 2025 13:26:53 +0000</pubDate>
      <link>https://dev.to/loi2008/beyond-the-fields-a-data-driven-look-at-kenyas-agricultural-productivity-4n8n</link>
      <guid>https://dev.to/loi2008/beyond-the-fields-a-data-driven-look-at-kenyas-agricultural-productivity-4n8n</guid>
      <description>&lt;h2&gt;
  
  
  1. Overview
&lt;/h2&gt;

&lt;p&gt;Kenya's economy is based primarily on agriculture, which makes a substantial contribution to employment, food security, and GDP. Farming is the main source of income for more than 70% of rural households. However, productivity and profitability fluctuate greatly due to variations in crop selection, agricultural methods, input usage, and susceptibility to weather.&lt;/p&gt;

&lt;p&gt;The Kenya Crops Dataset contains a wealth of data on farmers' activities across counties. The data provide insight into crop types, planted areas, yields, expenses, income, and management techniques (pest control, irrigation, and fertilizer use).&lt;/p&gt;

&lt;p&gt;This article analyses the data set to identify trends, obstacles, and possibilities that can guide practice and policy in Kenya's agriculture industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Data Overview
&lt;/h2&gt;

&lt;p&gt;The dataset consists of 500 records drawn from various counties in Kenya, each providing detailed information on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crops:&lt;/strong&gt; Crop types grown, such as potatoes, maize, tomatoes, sorghum, coffee, and beans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Land use:&lt;/strong&gt; Planted area in acres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Productivity:&lt;/strong&gt; Yield in kilograms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Economics:&lt;/strong&gt; Market prices, revenues, production costs, and profits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Farming practices&lt;/strong&gt;: Fertilizer type, irrigation use, pest control methods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environmental impact:&lt;/strong&gt; Weather conditions and soil type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data's comprehensiveness makes it feasible to evaluate farming's technical and financial facets, as well as environmental influences.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Key Findings
&lt;/h2&gt;

&lt;h4&gt;
  
  
  3.1 Crop Distribution
&lt;/h4&gt;

&lt;p&gt;The dataset reveals a wide variety of crops grown across counties, with the most popular being coffee, potatoes, and tea (Fig. 1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9s6ndwr4d6je8ofbcny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9s6ndwr4d6je8ofbcny.png" alt="Crop Type Popularity by County" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3.2 Profitability and Yields
&lt;/h4&gt;

&lt;p&gt;Profitability varies widely among crop types. Some farmers record substantial profits, especially those producing high-value crops like rice and sorghum (Fig. 2).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9i07nus47lb1xvcjekf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9i07nus47lb1xvcjekf.png" alt="Revenue and profit generated per crop" width="800" height="516"&gt;&lt;/a&gt;&lt;br&gt;
Yields were highest for rice, while sorghum registered only moderate yields even though its revenue and profit were high (Fig. 2 and Fig. 3).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4oh44lojofjp2ulni4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4oh44lojofjp2ulni4g.png" alt=" " width="763" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3.3 Farming Practices
&lt;/h4&gt;

&lt;p&gt;According to the dataset, the majority of farmers did not use any fertilizer. Among those who did, the preferred options were DAP, manure, and CAN (Fig. 4).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatx8uxevcpjc64jftgfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatx8uxevcpjc64jftgfq.png" alt="Fertilizer Usage by Farmers" width="771" height="543"&gt;&lt;/a&gt;&lt;br&gt;
Irrigation adoption is uneven, with many farmers relying solely on rainfall, which exposes yields to droughts and irregular rains. Pest control is inconsistently applied, with some farmers reporting no control methods at all, leaving crops exposed to pest infestation (Fig. 5).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11wqwind2v9tr2v383x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11wqwind2v9tr2v383x.png" alt="Farmers' usage of irrigation method and pest control" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Missing Data in the Dataset and Effect on Analysis
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Missing Data
&lt;/h4&gt;

&lt;p&gt;One of the issues with the Kenya Crops Dataset is partial or missing data, frequently indicated by "Not Provided" entries. Key variables with missing values include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yield (Kg):&lt;/strong&gt; In certain instances, farmers failed to document or supply real yields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profit (KES):&lt;/strong&gt; Economic analysis is less accurate when profit numbers are missing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fertilizer Used:&lt;/strong&gt; "Not Provided" restricts information about input usage trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pest Control Method:&lt;/strong&gt; Without this information, crop protection measures cannot accurately be evaluated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Effect on Analysis
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decreased Average and Total Accuracy:&lt;/strong&gt; The accuracy of computed averages, totals, and comparisons between crops and counties is reduced when yield and profit figures are missing. For instance, if high-profit records are absent, the average profit per crop can be understated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Biased Results:&lt;/strong&gt; If data are not missing at random (smallholder farmers are less likely to record inputs, for example), results may disproportionately represent farmers with better-organized records, tilting the analysis in favor of larger or better-resourced farms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Challenges in Trend Analysis:&lt;/strong&gt; Trends in modern versus traditional farming methods are difficult to determine from incomplete information on pest management or fertilizer practices, making it harder to connect input use to productivity outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problems with Policy Suggestions:&lt;/strong&gt; Policymakers depend on comprehensive datasets to make focused decisions. Missing values could conceal crops that require assistance or underperforming areas.&lt;/li&gt;
&lt;/ul&gt;
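&lt;p&gt;To make the first effect concrete, a short sketch with made-up profit figures shows how the handling of "Not Provided" entries shifts a computed average:&lt;/p&gt;

```python
from statistics import mean

# Illustrative profit records; "Not Provided" marks missing values
profits = [12000.0, "Not Provided", 8500.0, "Not Provided", 20000.0]

# Dropping missing values keeps only the three observed records
observed = [p for p in profits if isinstance(p, float)]
avg_observed = mean(observed)  # 40500 / 3 = 13500.0

# Naively coding missing values as zero deflates the average
avg_with_zeros = mean(p if isinstance(p, float) else 0.0 for p in profits)

print(avg_observed, avg_with_zeros)  # 13500.0 8100.0
```

&lt;p&gt;Neither figure is "correct"; the gap between them is the uncertainty that missing records introduce into averages, totals, and cross-county comparisons.&lt;/p&gt;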

&lt;h2&gt;
  
  
  Dashboards in Visualizing Findings
&lt;/h2&gt;

&lt;p&gt;Dashboards convert complicated statistics into understandable, actionable insights that complement in-depth reporting. An interactive analysis of this dataset is available in the &lt;a href="https://github.com/Loi2008/Power-BI-/blob/main/Kenya%20Crops%20Production%20Analysis_1.pbix" rel="noopener noreferrer"&gt;Kenya Crops Analysis Dashboard&lt;/a&gt; (Fig. 6).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zqnmxvpjjkjuirnb3q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zqnmxvpjjkjuirnb3q8.png" alt="Kenya Crops Analysis Dashboard" width="800" height="470"&gt;&lt;/a&gt;&lt;br&gt;
While reports such as this provide detailed analysis, dashboards make it easier to visualize findings interactively and monitor trends in real time. A dashboard built on this dataset included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crop Distribution:&lt;/strong&gt; A detailed bar chart containing the distribution of planting area, yield, revenue and profit across counties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact of weather on revenue and profitability:&lt;/strong&gt; column chart comparing revenue and profits per weather impact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual map:&lt;/strong&gt; giving the position of the counties and their performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Metrics (KPIs):&lt;/strong&gt; Cards displaying total planted area, total yield, total production cost, total revenue and total profit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard would enable farmers, researchers, and policymakers to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapidly determine which crops and areas are performing well and poorly.&lt;/li&gt;
&lt;li&gt;Evaluate how well pest management, fertilization, and irrigation are working.&lt;/li&gt;
&lt;li&gt;Monitor how the weather affects crops and earnings.&lt;/li&gt;
&lt;li&gt;Make decisions based on evidence to promote the expansion of agriculture.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;Based on the analysis, the following measures could help enhance agricultural outcomes in Kenya:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase irrigation coverage to lessen susceptibility to fluctuations in rainfall and enhance smallholder irrigation projects.&lt;/li&gt;
&lt;li&gt;Encourage sustainable farming methods that preserve soil health, support organic substitutes, soil conservation, and balanced fertilizer use.&lt;/li&gt;
&lt;li&gt;Enhance pest control to reduce agricultural losses, and teach farmers integrated pest management techniques.&lt;/li&gt;
&lt;li&gt;Digitize agricultural records to minimize missing data and enhance decision-making. This will also enhance data quality and monitoring using digital platforms.&lt;/li&gt;
&lt;li&gt;Targeted assistance for low-yield counties by offering training, extension services, and resources to counties that continuously perform poorly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References and Additional Readings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Food and Agriculture Organization of the United Nations (FAO). (2021). &lt;em&gt;The State of Food and Agriculture&lt;/em&gt;. Rome: FAO.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kenya Ministry of Agriculture. (2020). &lt;em&gt;Agricultural Sector Transformation and Growth Strategy (ASTGS) 2019–2029&lt;/em&gt;. Nairobi: Government of Kenya.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;World Bank. (2019). &lt;em&gt;Agricultural Productivity in Kenya: Trends and Determinants&lt;/em&gt;. Washington, DC: World Bank.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;International Fund for Agricultural Development (IFAD). (2022). &lt;em&gt;Climate-Resilient Agriculture in East Africa&lt;/em&gt;. Rome: IFAD.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>analytics</category>
      <category>data</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Ms Excel and Predictive Data analysis</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Sun, 31 Aug 2025 17:07:31 +0000</pubDate>
      <link>https://dev.to/loi2008/ms-excel-and-predictive-data-analysis-4g3b</link>
      <guid>https://dev.to/loi2008/ms-excel-and-predictive-data-analysis-4g3b</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Microsoft Excel is a cornerstone of business analytics, valued for its familiarity, flexibility, and widespread use across industries. As businesses demand ever more sophisticated insights, however, Excel’s limitations become increasingly pronounced. Below is an exploration of its strengths and limitations in predictive analysis, and of its critical role in enabling data-driven decisions. Despite these limitations, businesses can still leverage Excel to gain insights and make informed decisions by following best practices, understanding its constraints, and maximizing its benefits for predictive analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strengths and Limitations of Excel in Predictive Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ms Excel Strengths in Predictive Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  - Accessibility and Ubiquity
&lt;/h4&gt;

&lt;p&gt;Excel is an accessible tool for predictive analysis since it is readily available and well-known to many people. Because of its established reputation in the business sector, many professionals choose it by default. Its widespread use and user-friendly interface allow for rapid analysis with little training.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Built-In Analytical Tools
&lt;/h4&gt;

&lt;p&gt;For trend-based predictions, Excel offers a number of forecasting tools, including single and double exponential smoothing and linear forecasting. Built-in functions such as forecasting and regression analysis offer a strong basis for predictive analytics. Furthermore, its ability to create bespoke models makes competent financial modeling possible, including scenario simulations, projections, and discounted cash flow assessments.&lt;/p&gt;
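&lt;p&gt;As a minimal sketch of these built-in functions (assuming, purely for illustration, historical values in cells B2:B13, their time periods in A2:A13, and the next period in A14), a forecast can be produced directly in a worksheet cell:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=FORECAST.LINEAR(A14, B2:B13, A2:A13)
=FORECAST.ETS(A14, B2:B13, A2:A13)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first formula fits a linear trend; the second applies exponential smoothing and can also model seasonality.&lt;/p&gt;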

&lt;h4&gt;
  
  
  - Visualization and Layout Capabilities
&lt;/h4&gt;

&lt;p&gt;Excel gives users the ability to visualize data patterns and derive actionable insights through its built-in charts, pivot tables, conditional formatting, and formatting flexibility.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Data Integration &amp;amp; Flexibility
&lt;/h4&gt;

&lt;p&gt;Excel's versatility makes it useful for combining data and enabling rapid, exploratory research. It can import data from various sources, including CSV files, databases, and APIs.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Data Manipulation
&lt;/h4&gt;

&lt;p&gt;Users can prepare data for predictive models thanks to Excel's flexibility in data manipulation and presentation.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Add-ins
&lt;/h4&gt;

&lt;p&gt;Excel's predictive capabilities are extended by add-ins such as the Analysis ToolPak, PI DataLink, and Solver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of Ms Excel in Predictive Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  - Scalability &amp;amp; Performance Constraints
&lt;/h4&gt;

&lt;p&gt;As dataset sizes increase, Excel struggles with large-scale predictive modeling: it may become sluggish, crash, or simply fail to handle the volume.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Limited Forecasting Sophistication
&lt;/h4&gt;

&lt;p&gt;Excel's Forecast function assumes linear or exponential trends. It frequently fails in situations involving non-linear dynamics, complex seasonal patterns, or environments with outliers and volatility.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Error-Prone Manual Handling
&lt;/h4&gt;

&lt;p&gt;Manual entry, formula construction, and inadequate data management frequently introduce errors that can jeopardize analytical integrity.&lt;/p&gt;

&lt;h4&gt;
  
  
  - Version Control and Siloed Collaboration Issues
&lt;/h4&gt;

&lt;p&gt;Spreadsheets frequently proliferate across departments, resulting in multiple, inconsistent versions. This makes it difficult to maintain a single source of truth and hinders collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role of Excel in Data-Driven Business Decisions: Analyzing Sample Jumia Dataset
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;The Jumia dataset contains sample data on products, the pricing of each product, the discounts offered, and product ratings and reviews. The data was cleaned and analyzed using Ms Excel. A dashboard revealing various perspectives on the data enables effective decision-making on product performance (Fig1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0lcqai0fbwq07k1f404.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0lcqai0fbwq07k1f404.png" alt="Jumia Products Analysis Dashboard" width="800" height="716"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Fig1: Jumia Products Analysis Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Findings
&lt;/h3&gt;

&lt;p&gt;A total of 112 products sold on Jumia were analysed. All the products were discounted. Fig2 illustrates the top 10 discounted products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo6ixv8qidhxjbooa3i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo6ixv8qidhxjbooa3i1.png" alt="Top 10 discounted products" width="596" height="353"&gt;&lt;/a&gt; &lt;br&gt;
&lt;strong&gt;Fig2: Top 10 Discounted Products&lt;/strong&gt;&lt;br&gt;
The products had good reviews, with the top 10 products having between 20 and 70 reviews, as illustrated in Fig3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb23eggz7ks8tuhgu2gkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb23eggz7ks8tuhgu2gkw.png" alt="Top 10 products with the highest reviews" width="800" height="492"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Fig3: Top 10 Reviewed Products&lt;/strong&gt;&lt;br&gt;
The findings also revealed a positive relationship between product ratings and the number of reviews (Fig4). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4b3quytw9tocjah8iby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4b3quytw9tocjah8iby.png" alt="Relationship between the products review and the rating" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Fig4: Relationship between Product Reviews and Ratings&lt;/strong&gt;&lt;br&gt;
More than 50% of the products had high discounts, and 43% had excellent ratings (Fig5 &amp;amp; Fig6).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoabwre0fxbcmiva0b2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoabwre0fxbcmiva0b2q.png" alt="Product Discount category" width="800" height="596"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Fig5: Product Discount Category&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm32cla8egrrik8w9l1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm32cla8egrrik8w9l1x.png" alt="Product Rating Category" width="800" height="594"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Fig6: Product Rating Category&lt;/strong&gt;&lt;/p&gt;
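&lt;p&gt;The rating–review relationship shown in Fig4 can also be quantified in Excel with a correlation coefficient (a sketch assuming, hypothetically, ratings in D2:D113 and review counts in E2:E113 for the 112 products):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=CORREL(D2:D113, E2:E113)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A value close to +1 would confirm the positive relationship observed.&lt;/p&gt;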

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, the analysis provides a comprehensive view of Jumia’s sales ecosystem and guides strategic decisions to optimize revenue, customer engagement, and operational efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Jumia Sample DataSet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/advice/3/what-pros-cons-using-excel-data-analysis-skills-data-analy%0Asis" rel="noopener noreferrer"&gt;What are the pros and cons of using Excel for data analysis?&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Global Debt Uncovered - Insights from PostgreSQL Analysis</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Sun, 31 Aug 2025 08:01:58 +0000</pubDate>
      <link>https://dev.to/loi2008/international-debt-analysis-exploring-postgresql-2k5o</link>
      <guid>https://dev.to/loi2008/international-debt-analysis-exploring-postgresql-2k5o</guid>
      <description>&lt;h2&gt;
  
  
  1.    Introduction
&lt;/h2&gt;

&lt;p&gt;This analysis explores a sample international debt dataset using PostgreSQL. The goal is to understand the structure of the data, assess data quality, and generate insights about global debt distribution. The dataset contains information on countries, debt indicators, and debt value. It also includes missing values that must be handled carefully during analysis, if accurate meaning is to be drawn from the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.   Loading the Dataset
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Assuming there is an active PostgreSQL connection.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Steps:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open PostgreSQL in Dbeaver&lt;/li&gt;
&lt;li&gt;Create a schema
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the search path&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Right-click on Tables under your schema&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Import data &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the source file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Map the table to the schema&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confirm &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proceed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open a new script&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confirm your table is in the right schema&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_with_missing_values&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3.    SQL Queries and Findings
&lt;/h2&gt;

&lt;p&gt;The data is analysed using SQL queries. Charts and tables are used for visualization.&lt;/p&gt;
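&lt;p&gt;Since the introduction notes that the dataset contains missing values, a quick audit query helps gauge data quality before analysis (a sketch using the table loaded above and the &lt;code&gt;debt&lt;/code&gt; and &lt;code&gt;country_name&lt;/code&gt; columns it is assumed to contain):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- count rows overall and rows with missing key fields
select count(*) as total_rows,
       count(*) filter (where debt is null) as missing_debt,
       count(*) filter (where country_name is null or country_name = '') as missing_country
from international_debt_with_missing_values;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;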

&lt;h3&gt;
  
  
  3.1  The Total Amount of Debt Owed
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_debt&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The total amount of debt owed is &lt;strong&gt;2,823,893,300,273&lt;/strong&gt; (current US$).&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2  Number of Distinct Countries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;country_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;distinct_country&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Distinct Countries = &lt;strong&gt;125&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3  Distinct Types of Debt Indicators
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;indicator_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indicator_name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;indicator_name&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;indicator_name&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;indicator_code&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;indicator_code&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  Table1: Distinct Debt Indicators
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Serial&lt;/th&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;DT.INT.PRVT.CD&lt;/td&gt;
&lt;td&gt;PPG, private creditors (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;DT.AMT.OFFT.CD&lt;/td&gt;
&lt;td&gt;PPG, official creditors (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;DT.INT.DLXF.CD&lt;/td&gt;
&lt;td&gt;Interest payments on external debt, long-term (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DT.INT.DPNG.CD&lt;/td&gt;
&lt;td&gt;Interest payments on external debt, private nonguaranteed (PNG) (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;DT.DIS.PCBK.CD&lt;/td&gt;
&lt;td&gt;PPG, commercial banks (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;DT.AMT.PBND.CD&lt;/td&gt;
&lt;td&gt;PPG, bonds (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;DT.DIS.MLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, multilateral (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;DT.DIS.PRVT.CD&lt;/td&gt;
&lt;td&gt;PPG, private creditors (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;DT.INT.MLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, multilateral (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;DT.INT.PBND.CD&lt;/td&gt;
&lt;td&gt;PPG, bonds (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;DT.INT.PROP.CD&lt;/td&gt;
&lt;td&gt;PPG, other private creditors (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;DT.DIS.OFFT.CD&lt;/td&gt;
&lt;td&gt;PPG, official creditors (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;DT.AMT.MLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, multilateral (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;DT.INT.OFFT.CD&lt;/td&gt;
&lt;td&gt;PPG, official creditors (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;DT.DIS.PROP.CD&lt;/td&gt;
&lt;td&gt;PPG, other private creditors (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;DT.AMT.PCBK.CD&lt;/td&gt;
&lt;td&gt;PPG, commercial banks (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;DT.DIS.BLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, bilateral (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;DT.AMT.DLXF.CD&lt;/td&gt;
&lt;td&gt;Principal repayments on external debt, long-term (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;DT.AMT.PROP.CD&lt;/td&gt;
&lt;td&gt;PPG, other private creditors (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;DT.AMT.PRVT.CD&lt;/td&gt;
&lt;td&gt;PPG, private creditors (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;DT.AMT.BLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, bilateral (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;DT.INT.PCBK.CD&lt;/td&gt;
&lt;td&gt;PPG, commercial banks (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;DT.INT.BLAT.CD&lt;/td&gt;
&lt;td&gt;PPG, bilateral (INT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;DT.DIS.DLXF.CD&lt;/td&gt;
&lt;td&gt;Disbursements on external debt, long-term (DIS, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;DT.AMT.DPNG.CD&lt;/td&gt;
&lt;td&gt;Principal repayments on external debt, private nonguaranteed (PNG) (AMT, current US$)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3.4  Country with Highest Total Debt, and the Amount
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;country_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_debt&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_analysis&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;country_name&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;country_name&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;country_name&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;total_debt&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The country is China, with a total debt of &lt;strong&gt;283,748,948,518&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5  The Average Debt Across Different Debt Indicators
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;indicator_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_debt&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;international_debt_with_missing_values&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;indicator_name&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;avg_debt&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aeqakdz7l7e3lwxyx4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aeqakdz7l7e3lwxyx4f.png" alt=" " width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Fig1: Average Debt per Indicator Category
&lt;/h5&gt;

&lt;h3&gt;
  
  
  3.6  Country with Highest Principal Repayment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select country_name, SUM(debt) AS total_principal_repayment
from international_debt_with_missing_values
where indicator_name like '%Principal repayment%' and debt &amp;gt; 0
group by country_name
order by total_principal_repayment desc
limit 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The country is China, with a principal repayment amount of &lt;strong&gt;168,611,607,050&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.7  Most Common Debt Indicator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select indicator_name, COUNT(*) AS frequency
from international_debt_with_missing_values
where indicator_name is not null
and indicator_name &amp;lt;&amp;gt; ''
group by indicator_name
order by frequency desc
limit 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most common debt indicator is &lt;strong&gt;PPG, official creditors (AMT, current US$)&lt;/strong&gt;, appearing &lt;strong&gt;124&lt;/strong&gt; times.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.8  Other Key Debt Trends
&lt;/h3&gt;

&lt;h4&gt;
  
  
  3.8.1   Top 5 countries with the most debt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select country_name, sum(debt) as total_debt
from international_debt_with_missing_values
group by country_name
order by total_debt desc
limit 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx32n0wllsru0wr2rk4at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx32n0wllsru0wr2rk4at.png" alt="Average debt per indicator category" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Fig2: Top 5 Countries with the Most Debt
&lt;/h5&gt;

&lt;h4&gt;
  
  
  3.8.2   Five Countries with the Least Debt
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;This excludes countries registering 0 debt&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select country_name, sum(debt) as total_debt
from international_debt_with_missing_values
where debt &amp;gt; 0
group by country_name
order by total_debt asc
limit 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrrc6jqj2oksjpgiakpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrrc6jqj2oksjpgiakpz.png" alt="Five Countries with the least debt" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Fig3: Five Countries with the Lowest Debt
&lt;/h5&gt;

&lt;h4&gt;
  
  
  3.8.3   Number of countries with missing debt values or 0 debt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(distinct country_name) as countries_with_zero_or_missing_debt
from international_debt_with_missing_values
where debt is null or debt = 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The total number of countries with zero or missing debt values is &lt;strong&gt;110&lt;/strong&gt;.&lt;/p&gt;
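&lt;p&gt;Because a zero value and a missing value carry different meanings, the two cases can also be counted separately (same assumed table and column as above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- distinguish records that are genuinely zero from records that are absent
select count(*) filter (where debt is null) as missing_debt_rows,
       count(*) filter (where debt = 0) as zero_debt_rows
from international_debt_with_missing_values;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;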

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, the data suggests a significant dependence on external financing, with repayment pressures concentrated in a few major economies and vulnerable groups. This underlines the importance of careful debt management policies, diversification of financing sources, and sustainable borrowing strategies to reduce long-term risks.&lt;/p&gt;

&lt;h4&gt;
  
  
  References and Further Reading
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;World Bank. (2023). &lt;em&gt;International Debt Statistics&lt;/em&gt;. Washington, DC: World Bank. Available at: &lt;a href="https://databank.worldbank.org/source/international-debt-statistics" rel="noopener noreferrer"&gt;https://databank.worldbank.org/source/international-debt-statistics&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;International Monetary Fund (IMF). (2022). &lt;em&gt;Global Debt Database&lt;/em&gt;. Washington, DC: IMF. Available at: &lt;a href="https://www.imf.org/en/Data" rel="noopener noreferrer"&gt;https://www.imf.org/en/Data&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reinhart, C. M., &amp;amp; Rogoff, K. S. (2010). &lt;em&gt;Growth in a Time of Debt&lt;/em&gt;. American Economic Review, 100(2), 573–578.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FAO &amp;amp; UNCTAD. (2021). &lt;em&gt;Financing Sustainable Development in Developing Countries&lt;/em&gt;. Geneva: United Nations.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL Global Development Group. (2023). &lt;em&gt;PostgreSQL Documentation&lt;/em&gt;. Available at: &lt;a href="https://www.postgresql.org/docs/" rel="noopener noreferrer"&gt;https://www.postgresql.org/docs/&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IBM Developer. (2021). &lt;em&gt;Best Practices in Data Analysis Using SQL&lt;/em&gt;. IBM Developer Portal.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;O’Neil, P., &amp;amp; O’Neil, E. (2014). &lt;em&gt;Database Principles, Programming, and Performance&lt;/em&gt;. Morgan Kaufmann.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DataCamp. (2022). &lt;em&gt;SQL for Data Analysis: Concepts and Practice&lt;/em&gt;. Available at: &lt;a href="https://www.datacamp.com" rel="noopener noreferrer"&gt;https://www.datacamp.com&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>internationaldebt</category>
      <category>exploringpostgressql</category>
      <category>dataanalysis</category>
      <category>exploringdata</category>
    </item>
    <item>
      <title>A Guide for Creating a Linux Server VM on Azure &amp; Installing PostgreSQL</title>
      <dc:creator>Loi2008</dc:creator>
      <pubDate>Sat, 02 Aug 2025 07:42:59 +0000</pubDate>
      <link>https://dev.to/loi2008/a-guide-for-creating-a-linux-server-vm-on-azure-installing-postgresql-1dnh</link>
      <guid>https://dev.to/loi2008/a-guide-for-creating-a-linux-server-vm-on-azure-installing-postgresql-1dnh</guid>
      <description>&lt;h2&gt;
  
  
  This guide covers:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Creating a Linux VM on Azure&lt;/li&gt;
&lt;li&gt;Installing PostgreSQL&lt;/li&gt;
&lt;li&gt;Configuring PostgreSQL for local and remote access&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  ✅ Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Active &lt;strong&gt;Azure subscription&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Access to &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;Azure Portal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSH client (e.g., OpenSSH on Linux/macOS, or WSL/PuTTY on Windows)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧰 Part 1: Create a Linux VM on Azure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Go to Azure Portal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to: &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;https://portal.azure.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;"Virtual Machines"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;"+ Create" &amp;gt; "Azure virtual machine"&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Configure VM Basics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subscription&lt;/strong&gt;: Choose your active subscription (create one first if you do not have one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Group&lt;/strong&gt;: Create/select one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VM Name&lt;/strong&gt;: e.g., &lt;code&gt;linux-postgres-vm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region&lt;/strong&gt;: Closest to your location&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt;: &lt;code&gt;Ubuntu 22.04 LTS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt;: e.g., &lt;code&gt;Standard B1s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: &lt;code&gt;SSH public key&lt;/code&gt; (recommended). Password authentication is also supported, but if you use it, pick a strong password rather than something guessable like &lt;code&gt;1234&lt;/code&gt;.
&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Username&lt;/strong&gt;: e.g., &lt;code&gt;azureuser&lt;/code&gt; (this is the name you will use when connecting over SSH)
&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;SSH public key&lt;/strong&gt;: Paste your &lt;strong&gt;public key&lt;/strong&gt; if using SSH&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can generate one using:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-b&lt;/span&gt; 2048
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ul&gt;
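&lt;p&gt;After generating the key pair, you need the &lt;em&gt;public&lt;/em&gt; half to paste into the portal. Assuming the default output path from the command above, you can print it with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the public key; this is the value to paste into the Azure portal
cat ~/.ssh/id_rsa.pub
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The private key (&lt;code&gt;~/.ssh/id_rsa&lt;/code&gt;) stays on your machine and is never uploaded.&lt;/p&gt;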

&lt;h3&gt;
  
  
  3. Networking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public IP&lt;/strong&gt;: Enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NSG (firewall)&lt;/strong&gt;: Allow SSH (port 22) &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Create the VM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;"Review + create"&lt;/strong&gt; then &lt;strong&gt;"Create"&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔌 Part 2: SSH into the VM
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh azureuser@&amp;lt;your-vm-public-ip- e.g, 192.168.20.139&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🐘 Part 3: Install PostgreSQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Update Packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Install PostgreSQL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;postgresql postgresql-contrib &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Enable and Start the Service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;postgresql
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔐 Part 4: Configure PostgreSQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Switch to postgres User
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Access PostgreSQL Terminal
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Create User and Database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myPostgres&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1234'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;myPostgres&lt;/span&gt; &lt;span class="k"&gt;CREATEDB&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;OWNER&lt;/span&gt; &lt;span class="n"&gt;myPostgres&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;
&lt;span class="n"&gt;exit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
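&lt;p&gt;One PostgreSQL detail worth knowing here: unquoted identifiers are folded to lowercase, so &lt;code&gt;CREATE ROLE myPostgres&lt;/code&gt; actually stores the role as &lt;code&gt;mypostgres&lt;/code&gt;, and that lowercased name is what you must use when connecting. A quick sanity check from inside &lt;code&gt;psql&lt;/code&gt; (a sketch using the names above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns one row: the role was stored under its lowercased name
SELECT rolname FROM pg_roles WHERE rolname = 'mypostgres';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;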






&lt;h2&gt;
  
  
  🌐 Part 5: Allow Remote Access (Optional)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Edit postgresql.conf
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/postgresql/15/main/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_addresses = 'localhost'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_addresses = '*'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Edit pg_hba.conf
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/postgresql/15/main/pg_hba.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add this line (&lt;code&gt;0.0.0.0/0&lt;/code&gt; accepts connections from any address; restrict the CIDR range in production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host    all             all             0.0.0.0/0               md5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
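&lt;p&gt;If you prefer not to edit the files interactively, the two changes above can be scripted. This is a sketch: the &lt;code&gt;15&lt;/code&gt; in the path comes from the commands above and may differ on your system:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;CONF=/etc/postgresql/15/main
# Listen on all interfaces instead of only localhost
sudo sed -i "s/^#\?listen_addresses.*/listen_addresses = '*'/" "$CONF/postgresql.conf"
# Accept password-authenticated connections from any address
echo "host    all             all             0.0.0.0/0               md5" | sudo tee -a "$CONF/pg_hba.conf"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;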



&lt;h3&gt;
  
  
  3. Allow Port 5432 in Azure NSG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to Azure portal &amp;gt; VM &amp;gt; &lt;strong&gt;Networking&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;"Add inbound port rule"&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Port: &lt;code&gt;5432&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Action: Allow&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Restart PostgreSQL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧪 Part 6: Connect Remotely
&lt;/h2&gt;

&lt;p&gt;From your local machine (with the &lt;code&gt;psql&lt;/code&gt; client installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-U&lt;/span&gt; myPostgres &lt;span class="nt"&gt;-d&lt;/span&gt; student &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;192.168.20.139&amp;gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
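&lt;p&gt;&lt;code&gt;psql&lt;/code&gt; and most GUI tools also accept a single connection URI. Using the role, password, and database created earlier (note the lowercased role name, since PostgreSQL folds unquoted identifiers to lowercase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgresql://mypostgres:1234@&amp;lt;your-vm-public-ip&amp;gt;:5432/student
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;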



&lt;p&gt;Or use GUI tools such as &lt;strong&gt;DBeaver&lt;/strong&gt;. &lt;/p&gt;




&lt;p&gt;Your Linux VM is now running PostgreSQL and ready to accept connections.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
