So, here’s the story: I recently worked on a school assignment from Professor Zhuang involving a pretty cool algorithm called the Incremental Association Markov Blanket (IAMB). I don’t have a background in data science or statistics, so this is new territory for me, but I love learning new things. The goal? Use IAMB to select features in a dataset and see how that affects the performance of a machine-learning model.
We’ll go over the basics of the IAMB algorithm and apply it to the Pima Indians Diabetes Dataset, hosted in Jason Brownlee's Datasets repository on GitHub. This dataset records health measurements for women, along with whether each one has diabetes. We’ll use IAMB to figure out which features (like BMI or glucose levels) matter most for predicting diabetes.
What’s the IAMB Algorithm, and Why Use It?
The IAMB algorithm is like a friend who helps you narrow down a list of suspects in a mystery. It’s a feature selection method designed to pick out only the variables that truly matter for predicting your target. That set is called the target’s Markov blanket: the smallest group of variables that, once known, makes every other variable uninformative about the target. In this case, the target is whether someone has diabetes. IAMB works in two phases:
- Forward Phase: Add variables that are still clearly related to the target once the variables already selected are taken into account.
- Backward Phase: Trim out the variables that turn out not to help given the rest of the selection, ensuring only the most crucial ones are left.
In simpler terms, IAMB helps us avoid clutter in our dataset by selecting only the most relevant features. This is especially handy when you want to keep things simple, boost model performance, and speed up training.
Source: Algorithms for Large-Scale Markov Blanket Discovery
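To make the two phases a bit more concrete, here is a minimal sketch of the greedy add-then-trim loop. It is not the paper's reference implementation: the names `iamb_sketch` and `toy_p_value` are made up for illustration, the significance test is passed in as a plain function, and "strongest association" is approximated by "smallest p-value", which is an assumption of this sketch.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: p_value(feature, target, conditioning_set) -> float
PValueFn = Callable[[str, str, List[str]], float]


def iamb_sketch(target: str, features: Iterable[str],
                p_value: PValueFn, alpha: float = 0.05) -> List[str]:
    """Greedy two-phase selection in the spirit of IAMB (illustrative only)."""
    blanket: List[str] = []
    candidates = [f for f in features if f != target]

    # Forward phase: repeatedly admit the candidate with the smallest
    # p-value given the current blanket, as long as it is below alpha.
    while candidates:
        best_p, best = min((p_value(f, target, blanket), f) for f in candidates)
        if best_p >= alpha:
            break  # nothing left passes the significance test
        blanket.append(best)
        candidates.remove(best)

    # Backward phase: drop members that no longer pass the test once
    # the rest of the blanket is accounted for.
    for f in list(blanket):
        rest = [g for g in blanket if g != f]
        if p_value(f, target, rest) >= alpha:
            blanket.remove(f)

    return blanket


def toy_p_value(feature, target, given):
    # Made-up numbers: "b" only looks relevant until "a" is known.
    if feature == "a":
        return 0.001
    if feature == "b":
        return 0.2 if "a" in given else 0.01
    return 0.5  # "c" is never significant


print(iamb_sketch("y", ["a", "b", "c"], toy_p_value))  # prints ['a']
```

In the toy example, "b" would pass on its own, but once "a" is in the blanket it no longer adds anything, so it never gets admitted. That conditioning on what has already been selected is the part that sets this apart from simply ranking features by correlation.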
What’s This Alpha Thing, and Why Does it Matter?
Here’s where alpha comes in. In statistics, alpha (α) is the threshold we set to decide what counts as "statistically significant." Following the professor’s instructions, I used an alpha of 0.05. A p-value is the probability of seeing an association at least as strong as the one in our data if the feature actually had no relationship with the target. So, if a feature’s p-value is less than 0.05, a purely chance association of that size would turn up less than 5% of the time, and we treat the association as statistically significant. Keep in mind that significant does not automatically mean strong: it only means the association is unlikely to be a fluke.
By using this alpha threshold, we’re focusing only on the most meaningful variables, ignoring any that don’t pass our “significance” test. It’s like a filter that keeps the most relevant features and tosses out the noise.
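Here is a small sketch of that filter in action, using only the standard library. The numbers are made up for illustration (they are not rows from the Pima dataset), and the p-value comes from the Fisher z approximation, which is rough for a sample this small. Treat it as a way to see the comparison against alpha, not as a replacement for a proper statistics library.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def approx_p_value(r, n):
    """Two-sided p-value for 'no correlation', via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


ALPHA = 0.05
glucose = [85, 90, 110, 120, 140, 150, 160, 180]  # made-up values
outcome = [0, 0, 0, 1, 0, 1, 1, 1]                # made-up labels

r = pearson_r(glucose, outcome)
p = approx_p_value(r, len(glucose))
print(f"r = {r:.2f}, p = {p:.3f}, keep feature: {p < ALPHA}")
```

The only decision alpha makes is the final comparison `p < ALPHA`. Lowering alpha makes the filter stricter, and raising it lets more features through.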
Getting Hands-On: Using IAMB on the Pima Indians Diabetes Dataset
Here's the setup: the Pima Indians Diabetes Dataset has health features (blood pressure, age, insulin levels, etc.) and our target, Outcome (whether someone has diabetes).
First, we load the data and check it out:
import pandas as pd
# Load and preview the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)
print(data.head())
Implementing IAMB with Alpha = 0.05
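Here’s our updated version of the IAMB algorithm. We’re using p-values from partial correlation tests (via the pingouin library) to decide which features to keep, so only those with p-values less than our alpha (0.05) are selected. Note that this is a simplified take: it walks through the columns once, in order, whereas the original IAMB forward phase repeatedly picks whichever remaining variable is most strongly associated with the target.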
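import pingouin as pg

def association_p_value(data, feature, target, covariates):
    """p-value for feature vs. target, controlling for covariates (if any)."""
    if covariates:
        # partial_corr expects column names as a list, not a set
        result = pg.partial_corr(data=data, x=feature, y=target, covar=sorted(covariates))
    else:
        # Nothing to control for yet, so use a plain correlation test
        result = pg.corr(data[feature], data[target])
    # The result is a one-row DataFrame; read the p-value by position
    return result['p-val'].iloc[0]

def iamb(target, data, alpha=0.05):
    markov_blanket = set()
    # Forward Phase: Add features with a p-value < alpha
    for feature in data.columns:
        if feature != target:
            p_value = association_p_value(data, feature, target, markov_blanket)
            if p_value < alpha:
                markov_blanket.add(feature)
    # Backward Phase: Remove features with p-value > alpha
    for feature in list(markov_blanket):
        reduced_mb = markov_blanket - {feature}
        p_value = association_p_value(data, feature, target, reduced_mb)
        if p_value > alpha:
            markov_blanket.remove(feature)
    return list(markov_blanket)

# Apply the updated IAMB function on the Pima dataset
selected_features = iamb('Outcome', data, alpha=0.05)
print("Selected Features:", selected_features)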
When I ran this, it gave me a refined list of features that IAMB thought were most closely related to diabetes outcomes. This list helps narrow down the variables we need for building our model.
Selected Features: ['BMI', 'DiabetesPedigreeFunction', 'Pregnancies', 'Glucose']
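If you are wondering what "partial correlation" is actually measuring, one common way to think about it is: remove whatever the covariates can linearly explain from both variables, then correlate what is left. The sketch below does that with NumPy on synthetic data. The function name `partial_corr_residuals` is made up for this post, and this is meant as intuition rather than a description of how any particular library computes it.

```python
import numpy as np


def partial_corr_residuals(x, y, covars):
    """Correlation between x and y after linearly regressing out covars."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: an intercept column plus one column per covariate.
    Z = np.column_stack([np.ones(len(x))] + [np.asarray(c, dtype=float) for c in covars])
    # Residuals are what is left of x and y once Z has explained what it can.
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


rng = np.random.default_rng(0)
z = rng.normal(size=2000)        # shared driver (synthetic)
x = z + rng.normal(size=2000)    # x depends on z
y = z + rng.normal(size=2000)    # y depends on z, but not directly on x

plain = float(np.corrcoef(x, y)[0, 1])
partial = partial_corr_residuals(x, y, [z])
print(f"plain correlation:   {plain:.2f}")
print(f"partial correlation: {partial:.2f}")
```

Here x and y look related only because both follow z. Once z is controlled for, the leftover correlation is close to zero, which is the kind of situation where a feature that looks useful on its own stops passing the test.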
Testing the Impact of IAMB-Selected Features on Model Performance
Once we have our selected features, the real test is to compare model performance with all features versus only the IAMB-selected features. For this, I went with a simple Gaussian Naive Bayes model because it’s straightforward and does well with probabilities (which ties in with the whole Bayesian vibe).
Here’s the code to train and test the model:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
# Split data
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model with All Features
model_all = GaussianNB()
model_all.fit(X_train, y_train)
y_pred_all = model_all.predict(X_test)
# Model with IAMB-Selected Features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
model_iamb = GaussianNB()
model_iamb.fit(X_train_selected, y_train)
y_pred_iamb = model_iamb.predict(X_test_selected)
# Evaluate models
# Note: roc_auc_score is given hard class predictions here, not probabilities
results = {
    'Model': ['All Features', 'IAMB-Selected Features'],
    'Accuracy': [accuracy_score(y_test, y_pred_all), accuracy_score(y_test, y_pred_iamb)],
    'F1 Score': [f1_score(y_test, y_pred_all, average='weighted'), f1_score(y_test, y_pred_iamb, average='weighted')],
    'AUC-ROC': [roc_auc_score(y_test, y_pred_all), roc_auc_score(y_test, y_pred_iamb)]
}
results_df = pd.DataFrame(results)
print(results_df)  # display(results_df) also works inside a Jupyter notebook
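One thing worth knowing about the AUC-ROC column: the evaluation above passes hard 0/1 predictions to `roc_auc_score`. AUC is usually computed from scores or probabilities, so that the metric can see how confidently the model ranks each case. Below is a small sketch of the difference. It uses synthetic data from `make_classification` as a stand-in so it runs on its own, so the numbers it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data so the snippet is self-contained.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

model = GaussianNB().fit(X_tr, y_tr)

# AUC from hard labels: every prediction is collapsed to 0 or 1 first.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from probabilities: uses the predicted probability of class 1.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

To try this on the models above, the same idea would be `model_all.predict_proba(X_test)[:, 1]` and `model_iamb.predict_proba(X_test_selected)[:, 1]` in place of the predicted labels.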
Results
Here’s what the comparison looks like:
Using only the IAMB-selected features gave a slight boost in accuracy and other metrics. It’s not a huge jump, and it comes from a single 70/30 train-test split, but getting better performance with fewer features is promising. It also suggests the model is leaning less on “noise” or irrelevant data.
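Since a small difference on a single split can come down to luck, one way to check how stable the comparison is would be cross-validation. The sketch below shows the mechanics with `cross_val_score` on synthetic stand-in data. The `subset` column indices are arbitrary placeholders rather than IAMB output, so only the pattern carries over, not the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data so the snippet is self-contained.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
subset = [0, 1, 2, 3]  # arbitrary placeholder for "selected" columns

# Five train/test rotations instead of a single split.
scores_all = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")
scores_sub = cross_val_score(GaussianNB(), X_demo[:, subset], y_demo, cv=5, scoring="accuracy")

print(f"all features:    {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"feature subset:  {scores_sub.mean():.3f} +/- {scores_sub.std():.3f}")
```

If the gap between the two means is smaller than the fold-to-fold spread, the difference is probably not worth reading much into. For a stricter comparison, the feature selection itself would be rerun on the training part of each fold, so the test rows never influence which features get picked.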
Key Takeaways
- IAMB is great for feature selection: It helps clean up our dataset by focusing only on what really matters for predicting our target.
- Less is often more: Sometimes, fewer features give us better results, as we saw here with a small boost in model accuracy.
- Learning and experimenting is the fun part: Even without a deep background in data science, diving into projects like this opens up new ways to understand data and machine learning.
I hope this gives a friendly intro to IAMB! If you’re curious, give it a shot—it’s a handy tool in the machine learning toolbox, and you might just see some cool improvements in your own projects.