Julie Fisher

Exploring K-NN Data: A Beginner’s Guide to EDA and Feature Selection

TL;DR

The Gender feature is noise and the User ID column is a unique identifier that's unnecessary for model training purposes. These columns get dropped, leaving Age and EstimatedSalary as features to predict Purchased.

Abra-data-dabra: Data and EDA

To be perfectly honest, I considered skipping EDA entirely for this series of posts. I wanted to get straight to the good stuff: overfitting and underfitting, variance, bias, performance metrics, and distance metrics. When it came to the data, I wanted to say "Abracadabra," here's our dataset. That's why curated datasets are so popular, right?

However, at a bare minimum, you need a basic understanding of any dataset you work with, especially in industry or production settings. Any job you land as a data scientist will involve real-world, messy data. You'll probably need a domain expert's help to understand it, and you'll be responsible for transforming the chaos you start with into neatly ordered rows.

So I'll do a very brief exploratory data analysis (EDA) of the data and select the features we'll use. Data wrangling, cleaning, and transformation deserve their own series of posts, but this will get us started.

Smoke, Mirrors, and Mysterious Data Provenance

I was introduced to the Social_Network_Ads dataset in the first week of the first course of the University of Washington's Machine Learning Certificate. I like this dataset for reasons explained below, but having worked with network data, I'm still confused how this data is related to a social network.

The Social_Network_Ads dataset is available from several different users on Kaggle, but the main one seems to be this dataset uploaded in 2017 by user "rakeshrau". The Data Card has no explanation of where the data originated, what its purpose was, who collected or created it, or how it was intended to be used. A Google search only produced more places to download it, with no additional provenance information to be had.

While the dataset's origin is unclear, it serves as a useful example for exploring how to train a K-NN model.

From the Kaggle page we can determine there are five columns with the following properties:

  • User ID: a unique identifier for each individual observation/row
  • Gender: a feature column
  • Age: a feature column
  • EstimatedSalary: a feature column
  • Purchased: the target we're trying to predict

Don't know how I determined those properties? At this point, don't worry about it. Those are the kinds of things I'll cover when discussing data wrangling, cleaning, and transformation. For now, let's keep moving so we can get to the good stuff I talked about earlier.

The Magic of Being Prepared

Before we look at the data, I have some housekeeping recommendations.

1. Keep It In The Code

I programmatically access the Kaggle data for this series using kagglehub.

Why access the data programmatically, you ask? Why not just download it manually from Kaggle and drop it into our project's directory? Clicking the "Download" button is easier, you say?

Personally, I like to keep all aspects of a project in one place. If I have to do manual steps, I forget what they are when I come back to rerun the code or to reference some aspect of the project. Then the fantastic, portfolio-worthy project that I spent hours and hours on becomes totally unusable because it's broken and I can't reproduce the results.

Don't get me wrong, programmatically loading the data isn't risk-free. There's always a chance that the source I'm loading the data from will delete the dataset. However, in the 8 years I've been doing this, there's always been somewhere online that still hosts the dataset. I just update the data location and rerun my project.
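If that risk worries you, here's a minimal sketch of a guard; the local fallback path is hypothetical and assumes you've saved a backup copy yourself:

import kagglehub

try:
    path = kagglehub.dataset_download("rakeshrau/social-network-ads")
except Exception:
    # Fall back to a hypothetical local backup if the Kaggle source disappears
    path = "data/social_network_ads_backup"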

2. Playing Well With Others

I recommend setting up a virtual environment to use for all of the posts in this series. I've created a requirements.txt with all of the packages you'll need and set the versions to the ones I used so they should all play well together.

You can find the requirements file, along with all of the notebooks and code for this series in my repo.

If you don't know how to set up a virtual environment, you can follow the instructions in my post Python Projects With Less Pain: Beginner's Guide to Virtual Environments.

If you don't want to do that, you can simply use this command in a Jupyter Notebook:

!pip install kagglehub

But I don't recommend the ad hoc approach.

* Advanced Concept Introduction *

In a production environment I'd never recommend using Jupyter Notebooks. You'll want to take the time to convert your code and logic to a Python script, or a codebase containing a collection of scripts. The finished product will look much more like a software development project than the notebooks you see in tutorials.
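For a flavor of what that conversion looks like, here's a minimal sketch of this post's loading logic restructured as a script (structure only, not a production pipeline):

import os
import pandas as pd
import kagglehub

def load_data() -> pd.DataFrame:
    # Download the dataset from Kaggle and load it into a DataFrame
    path = kagglehub.dataset_download("rakeshrau/social-network-ads")
    return pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))

def main() -> None:
    df = load_data()
    print(df.head())

if __name__ == "__main__":
    main()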

However, in EDA or exploratory and experimental situations like this, I much prefer Jupyter Notebooks so that I can see my results, look at charts, and have the results easy to reference at a later date.

Eventually, I'd like to write a series on creating a production grade machine learning pipeline project. While I have a bunch of production code, it isn't exactly publicly sharable or extensible to public datasets. If this sounds interesting to you, keep your fingers crossed that I can find time to work up to that post series. In the meantime, just be aware that production grade workflows look radically different.

Behold: The Non-Network Social_Network_Ads Dataset

Now that all the housekeeping stuff is out of the way, let's load the data and take a look.

# import necessary libraries
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import kagglehub
# Download data from Kaggle and load into a DataFrame
path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
df.head()
    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0

Feature Overview: Numbers Into Understanding

Earlier I mentioned that I really like this dataset. The reason is that none of the features is a strong predictor on its own, but a strong relationship appears when you combine Age and EstimatedSalary.

How do we figure this out? First we visualize each feature on its own. We'll do simple counts (also called a frequency distribution) and color by our target value Purchased.

matplotlib tends to be the default visualization library in Python, but it doesn't have a great way to color by a variable out of the box. The seaborn package does, so that's the package we'll use for our exploratory data analysis.

* Pro Tip *: Packages like seaborn and the plotting capabilities natively available within pandas are handy for quick visualizations. If you get into more complex visualizations though, chances are high you'll end up using matplotlib.

Age

In the Age frequency distribution plot we can see that those under ~35 have Purchased==0. It also looks like there might be a positive relationship between Age and Purchased.

In the machine learning context positive and negative relationships refer to how the variables change in relation to each other (see the quick numeric check after this list):

  • Positive relationship: as one variable increases in value, so does the other
    • Example: As the temperature increases, ice cream sales increase
      • Temperature: 50; Ice Cream Sales: $20
      • Temperature: 85; Ice Cream Sales: $35,000
      • Temperature: 105; Ice Cream Sales: $100,000
  • Negative relationship: as one variable increases, the other one decreases
    • Example: As the temperature decreases, winter clothes sales increase
      • Temperature: 85; Winter Clothes Sales: $50
      • Temperature: 50; Winter Clothes Sales: $150,000
      • Temperature: 0; Winter Clothes Sales: $500,000
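To make those definitions concrete, here's a quick toy check using the hypothetical ice cream numbers above (Pearson correlation, which pandas computes by default, is one common way to quantify these relationships):

toy = pd.DataFrame({
    "temperature": [50, 85, 105],
    "ice_cream_sales": [20, 35_000, 100_000],
})
print(toy["temperature"].corr(toy["ice_cream_sales"]))  # positive, close to +1
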
sns.histplot(data=df, x='Age', hue='Purchased', bins=20, multiple='stack')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Frequency Distribution')
plt.show()

Age Frequency Distribution Plot

We can visualize the positive relationship more easily by turning our bars into percentages. With this setting, each bar fills the whole height of the plot, and the color shows the percentage of observations in that bin belonging to each target value.

sns.histplot(data=df, x='Age', hue='Purchased', bins=20, multiple='fill', stat="percent")
plt.xlabel('Age')
plt.ylabel('Percentage')
plt.title('Age Percentage Distribution')
plt.show()

Age Distribution Plot by Percentage

While this relationship is interesting and weakly predictive of our target, we clearly can't separate purchased vs not purchased by Age alone. For the majority of our observations we'd simply be guessing.

EstimatedSalary

From the plot we can see that EstimatedSalary is in about the same boat as Age. There is a decline in Purchased counts from around 40,000 - 60,000, but otherwise at best there's a weak relationship between EstimatedSalary and Purchased.

sns.histplot(data=df, x='EstimatedSalary', bins=20, hue='Purchased', multiple='stack')
plt.xlabel('EstimatedSalary')
plt.ylabel('Frequency')
plt.title('EstimatedSalary Frequency Distribution')
plt.show()

EstimatedSalary Frequency Distribution Plot

Just to confirm our findings, we can look at the distribution formatted as a percentage. This plot gives the same impression as the frequency distribution: there is some kind of weak relationship here. I wouldn't really call it a positive relationship though because of that weird decrease in the middle range.

sns.histplot(data=df, x='EstimatedSalary', bins=20, hue='Purchased', multiple='fill', stat="percent")
plt.xlabel('EstimatedSalary')
plt.ylabel('Percentage')
plt.title('EstimatedSalary Percentage Distribution')
plt.show()

EstimatedSalary Distribution Plot by percentage

It's tempting to try to come up with an explanation for this dip in the 40,000 - 80,000 salary range. It's natural to want to explain patterns like this one, but be cautious: our brains are wired to find meaning even when none exists. The explanation you come up with might make sense to you and your colleagues, but have nothing to do with the real world. If you absolutely have to explain something, always validate your assumptions with data, not intuition.
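For example, a first step toward validating the dip would be computing the purchase rate per salary bin and checking whether it actually drops where we think it does (the bin edges below are an illustrative choice):

bins = [0, 20_000, 40_000, 60_000, 80_000, 100_000, 120_000, 140_000, 160_000]
salary_bins = pd.cut(df["EstimatedSalary"], bins=bins)
print(df.groupby(salary_bins, observed=True)["Purchased"].mean())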

Gender

This feature is a great example of noise.

In a machine learning context, "noise" is a feature that isn't correlated with the target value. After COVID and all the video calls we did, a good analogy for "noise" is the person on the call who didn't mute their mic and then proceeded to do the dishes: all that extra sound made it hard to hear and understand what the presenter was saying. In the same way, features that add no value in predicting the target make it harder to find the patterns that explain it.
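One quick, illustrative way to see noise numerically is to compare a deliberately random column against a real feature. The RandomNoise column below exists only for this demonstration:

import numpy as np

rng = np.random.default_rng(42)
df["RandomNoise"] = rng.normal(size=len(df))
print(df["RandomNoise"].corr(df["Purchased"]))  # near zero: pure noise
print(df["Age"].corr(df["Purchased"]))          # noticeably stronger
df = df.drop(columns=["RandomNoise"])           # remove the demo column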

In the Gender histogram plot we can see that the overall counts for male and female are pretty evenly distributed, and the counts of Purchased for each gender are very similar. There is no way to make an informed prediction about likelihood to purchase using this feature, i.e. this feature is noise and can be dropped.

sns.histplot(data=df, x='Gender', hue='Purchased', multiple='stack')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Frequency Distribution')
plt.show()

Gender Frequency Distribution Plot

The Power of Working Together

I've seen combining two features to better explain a target's behavior in several tutorials. I think Andrew Ng covered it in his old Coursera Machine Learning course (I can't find that class on Coursera anymore; it was probably replaced by this specialization), and obviously my professor in this class has an example, but I generally don't look for combinations of features that become more predictive when combined.

The datasets I work with are messy, real-world data that end up turning into 500+ features, so I rely on mathematical tools to determine whether each feature is individually useful. Now that I've been reminded that combined features can return better results, I might have to do more data analysis and feature engineering on my own data.
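As one example of that kind of per-feature scoring (assuming scikit-learn is installed; this isn't necessarily the exact tooling I use at work), mutual information scores each feature's individual relationship with the target:

from sklearn.feature_selection import mutual_info_classif

X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, scores)))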

Causation and Correlation

No data science blog is complete without at least a quick note on causation and correlation. In machine learning, we look for correlation: are two things related to each other? The positive and negative relationships we discussed earlier are forms of correlation.

However, just because two things are correlated doesn't mean that one of the values causes the other.

In the examples I used to discuss positive and negative relationships, I'd be pretty comfortable saying that temperature does in fact cause changes in ice cream sales and winter clothes sales.

However, there are many, many examples of correlations with no causation involved. For example, as ice cream sales increase, so do shark attacks. Did the increase in ice cream sales cause the increase in shark attacks? Or maybe increased shark attacks caused increased ice cream sales? No, it's probably because both increase during summer, when temperatures are higher and people spend more time outside.

For more examples of correlations that have no causal relationship, including plots, I recommend the Spurious Correlations website.

Age and EstimatedSalary

This is the interesting combination. In the plot we can clearly see that the Purchased values have separated out into clusters or groups. Try drawing a line between the two groups. If you're anything like me, there is a boundary that stands out as the best to get the cleanest groups.

sns.scatterplot(data=df, x="Age", y="EstimatedSalary", hue="Purchased")
plt.title("Age vs Estimated Salary by Purchase Status")
plt.show()

Age and EstimatedSalary Plot

Age and Gender

This plot just shows that a purchase is more common at older ages, but we already knew that from the Age distribution plots. Gender doesn't add any new information.

sns.scatterplot(data=df, x="Age", y="Gender", hue="Purchased")
plt.title("Age vs Gender by Purchase Status")
plt.show()

Age and Gender plot

EstimatedSalary and Gender

This plot also only shows us the decline in purchases in that 40,000 - 80,000 range. Adding the Gender dimension shows that the decrease is more prominent in males at the lower end of the range and more prominent in females at the higher end. But this still doesn't explain much of the Purchased pattern.

sns.scatterplot(data=df, x="EstimatedSalary", y="Gender", hue="Purchased")
plt.title("Estimated Salary vs Gender by Purchase Status")
plt.show()

EstimatedSalary and Gender plot

Age, EstimatedSalary and Gender

Because of the slight difference between genders in the last EstimatedSalary and Gender plot, let's check that Gender doesn't add any value when we visualize all three features.

# Plot code written by CoPilot
from mpl_toolkits.mplot3d import Axes3D

df["Gender_num"] = df["Gender"].map({"Male": 0, "Female": 1})

# Create a color map for Purchased
colors = df["Purchased"].map({0: "red", 1: "green"})  # Purchased is 0/1 in this dataset

# Create the 3D scatter plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection="3d")

ax.scatter(
    df["Age"],
    df["EstimatedSalary"],
    df["Gender_num"],
    c=colors,
    s=50,
    alpha=0.8
)

# Label axes
ax.set_xlabel("Age")
ax.set_ylabel("Estimated Salary")
ax.set_zlabel("Gender (0=Male, 1=Female)")
ax.set_title("3D Plot of Age, Estimated Salary, and Gender Colored by Purchase Status")

# Add legend
for label, color in {"Purchased": "green", "Not Purchased": "red"}.items():
    ax.scatter([], [], [], c=color, label=label)
ax.legend()

plt.show()

Age, EstimatedSalary, and Gender 3D plot

Recap

Of the five columns we started with, our EDA revealed:

  • We can get rid of User ID since it's a unique identifier and unnecessary for training purposes
  • We can get rid of Gender because it is neither predictive on its own, nor predictive in conjunction with the other features
  • Our target values are stored in the Purchased column
  • The useful features are Age and EstimatedSalary

To prepare our data for use with K-NN, our code is:

path = kagglehub.dataset_download("rakeshrau/social-network-ads")
df = pd.read_csv(os.path.join(path, "Social_Network_Ads.csv"))
df = df.drop(columns=["User ID", "Gender"])  # columns= already implies axis=1
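As a quick sanity check, the remaining columns should be exactly our two features and the target:

print(df.columns.tolist())  # ['Age', 'EstimatedSalary', 'Purchased']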

Up Next

Next we'll use the dataset to train a K-NN model.
