<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lindsey</title>
    <description>The latest articles on DEV Community by Lindsey (@lberlin).</description>
    <link>https://dev.to/lberlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F163332%2F0c927ee9-7c73-4858-82f8-235084849ffc.jpg</url>
      <title>DEV Community: Lindsey</title>
      <link>https://dev.to/lberlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lberlin"/>
    <language>en</language>
    <item>
      <title>Balancing the Imbalanced</title>
      <dc:creator>Lindsey</dc:creator>
      <pubDate>Fri, 14 Jun 2019 20:48:36 +0000</pubDate>
      <link>https://dev.to/lberlin/balancing-the-imbalanced-2bgo</link>
      <guid>https://dev.to/lberlin/balancing-the-imbalanced-2bgo</guid>
      <description>&lt;p&gt;As I continue to build up my data science toolkit, I've begun learning about the &lt;a href="https://medium.com/@Mandysidana/machine-learning-types-of-classification-9497bd4f2e14" rel="noopener noreferrer"&gt;types of classification techniques&lt;/a&gt; that are used to solve everyday problems. These tools are really cool! Want to know whether an email you received is spam or not spam? Use a classification technique! Want to know if a new transaction is fraudulent or not? Use a classification technique! Et cetera, et cetera.&lt;/p&gt;

&lt;p&gt;One thing I've seen again and again is the importance of class balance when feeding data into these models. Think about it - you're asking a computer, which has NO idea what you're talking about or how to identify anything in any way other than how you tell it to identify things, to look at something completely new and categorize it. If you feed it 1000 emails, 950 of which are 'not spam' and 50 of which are 'spam,' and ask it to identify which are 'not spam,' it can just label everything as 'not spam' and be 95% correct! Not bad.&lt;/p&gt;

&lt;p&gt;And yet... that doesn't do what you want at all. You want your model to learn the characteristics of 'spam' emails and identify the features that reliably predict 'spam' in general - something the computer is increasingly incentivized not to do as the majority class in your dataset grows larger and your models become more complex.&lt;/p&gt;

&lt;p&gt;So! Time to practice how to balance the classes within your dataset. I'll be giving examples of how to code a few methods I've encountered in Python 3, using &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;, &lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;SciKit Learn&lt;/a&gt;, and a bit of &lt;a href="https://imbalanced-learn.org/en/stable/index.html" rel="noopener noreferrer"&gt;imblearn&lt;/a&gt; to make all of our lives easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/myDXHnYYPxT4od4NEN/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/myDXHnYYPxT4od4NEN/giphy.gif" title="Balancing Cat" alt="alt text" width="728" height="728"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Simple Ways to Balance
&lt;/h3&gt;

&lt;p&gt;Perhaps the simplest way to balance your under-represented category against the rest of your data is to &lt;strong&gt;under-sample&lt;/strong&gt; your majority. To stick with our 950 'not spam' versus 50 'spam' example, we'd simply take a sample of 50 'not spam' and use that sample with our full 'spam' data to train our model on a balanced dataset! Easy-peasy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Using a Pandas dataframe, 'data,' where a column "category" either 
# has the "majority" option or the "minority" option within the column

minority = data[data["category"] == "minority"]
majority = data[data["category"] == "majority"].sample(n=len(minority))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alas, you can probably see some problems with this simple method. For one, we lose a lot of data (900 observations) by going down this route. The key to differentiating between 'spam' and 'not spam' could be hidden within that lost data!&lt;/p&gt;

&lt;p&gt;-&lt;/p&gt;

&lt;p&gt;So, a different (but still simple) way to balance your under-represented category is to &lt;strong&gt;over-sample&lt;/strong&gt; from that minority, with replacement, over and over, until it's the same size as your majority.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Same example Pandas dataframe as before

majority = data[data["category"] == "majority"]
minority = data[data["category"] == "minority"].sample(n=len(majority), replace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This... also has problems. With a case like our 950-50 split, you're likely using each of those 50 'spam' observations 19 times, over and over again, to get even with the 950 observations of 'not spam.' This will very likely result in overfitting - your model becomes so used to the content of your minority, the 'spam' category, that it only works on those particular emails and cannot generalize to recognize 'spam' out in the real world.&lt;/p&gt;

&lt;p&gt;Sure, you can combine these two approaches, both over-sampling your minority and under-sampling your majority, and maybe that will work fine for some of what you do! But for the cases when you need a more nuanced way to balance your data, there are more sophisticated methods.&lt;/p&gt;
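To make the combined idea concrete, here's a minimal sketch with a made-up dataframe and an arbitrary middle-ground target of 200 rows per class (the column names and target size are illustrative assumptions, not from any particular dataset):

```python
import pandas as pd

# Hypothetical dataframe with the article's 950/50 split in "category"
data = pd.DataFrame({
    "category": ["majority"] * 950 + ["minority"] * 50,
    "value": range(1000),
})

# Meet in the middle: under-sample the majority and over-sample the
# minority toward a common target size (200 here, chosen arbitrarily)
target = 200
majority = data[data["category"] == "majority"].sample(n=target, random_state=0)
minority = data[data["category"] == "minority"].sample(n=target, replace=True,
                                                       random_state=0)
balanced = pd.concat([majority, minority])

print(balanced["category"].value_counts())  # 200 of each
```

Passing `random_state` just makes the sampling reproducible; the minority still repeats rows, so this only softens, rather than removes, the overfitting risk described above.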




&lt;h3&gt;
  
  
  A Little More Complicated
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/wHASLyWK4zj5OhRCnv/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/wHASLyWK4zj5OhRCnv/giphy.gif" title="It's Complicated" alt="alt text" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alright, so if we can't simply sample the data we already have, what can we do? One idea is to add &lt;strong&gt;weight&lt;/strong&gt; to our minority category, so our model knows that the frequency with which it encounters each class does not translate into the importance of each class - the less frequent category should be considered more important, even though it's rare! &lt;/p&gt;

&lt;p&gt;SciKit Learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression" rel="noopener noreferrer"&gt;Logistic Regression&lt;/a&gt; model for classifying data has a built-in &lt;code&gt;class_weight&lt;/code&gt; option which lets you tell your model that some classes should count more strongly than others. The easiest way to balance from there is to apply &lt;code&gt;class_weight='balanced'&lt;/code&gt; - the Logistic Regression model will automatically assign each class a weight inversely proportional to its frequency. In our spam example, the model then knows 'spam' and 'not spam' should be balanced, and will weight those 50 examples of 'spam' so they're considered just as important as the 950 examples of 'not spam.'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the logistic regression package from sci-kit learn
from sklearn.linear_model import LogisticRegression

# Start the instance of the Logistic Regression, but balanced
# Default for class_weight is None, which gives all classes a weight of 1
logreg = LogisticRegression(class_weight='balanced') 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what is this actually doing? You're telling your model that all classes should contribute equally when it calculates its loss function. In other words, when the model is deciding which way to best fit the data, you're being really explicit in telling it that it needs to consider the percentage of errors in the minority as just as important as the percentage of errors in the majority. &lt;/p&gt;
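For intuition, the 'balanced' heuristic works out to n_samples / (n_classes * class_count); here's a quick sketch computing it by hand for our 950/50 example (plain NumPy, no scikit-learn required):

```python
import numpy as np

# The running example: 950 'not spam' labels and 50 'spam' labels
y = np.array(["not spam"] * 950 + ["spam"] * 50)

classes, counts = np.unique(y, return_counts=True)

# The 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = len(y) / (len(classes) * counts)

print(dict(zip(classes, weights)))
# 'not spam' gets about 0.53, 'spam' gets 10.0 - nineteen times the weight
```

So each rare 'spam' email contributes as much total weight to the loss as the whole pile of 'not spam' does.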

&lt;p&gt;With our example, we discussed a model that always predicts our emails are 'not spam,' since 950 out of 1000 are 'not spam' and only predicting 'not spam' results in a model that's 95% accurate. But that's 0% accuracy for emails that are actually 'spam,' since it never predicts that an email is 'spam.' By telling our model that the classes should be balanced, our model knows that the accuracy for predicting 'spam' is just as important as the accuracy for predicting 'not spam,' and thus it can't consider an overall 95% accuracy as acceptable.&lt;/p&gt;
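You can check that arithmetic yourself - a tiny sketch of the lazy always-'not spam' model, and what accuracy alone hides:

```python
# The lazy model from the example: it predicts 'not spam' for all 1000 emails
y_true = ["not spam"] * 950 + ["spam"] * 50
y_pred = ["not spam"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for 'spam': of the emails that really are spam, how many did we catch?
caught = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
recall_spam = caught / y_true.count("spam")

print(accuracy)     # 0.95 - looks impressive
print(recall_spam)  # 0.0  - catches no actual spam at all
```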

&lt;p&gt;This works! I can only speak to the Logistic Regression model at the moment, but I know other SciKit Learn modeling algorithms have a way of balancing their classes. This may be enough to get you better results with your data, and if so, that's great. But what if it's not? Can we get more complicated?&lt;/p&gt;




&lt;h3&gt;
  
  
  Of Course We Can Get More Complicated
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/13KAFmL9bJSm4M/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/13KAFmL9bJSm4M/giphy.gif" title="Donald Glover explains that 'It's way more complicated than that'" alt="alt text" width="460" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another idea - what if we could train our model to make synthetic data, that's similar to the data in our 'spam' minority category but is a little bit different, thus avoiding some of the over-fitting that we were worried about before? &lt;/p&gt;

&lt;p&gt;Yes, this is a thing, and no, you don't have to code it from scratch. The Imbalanced Learn library, &lt;a href="https://imbalanced-learn.readthedocs.io/en/stable/index.html" rel="noopener noreferrer"&gt;imblearn&lt;/a&gt;, is full of fun ways to apply more complicated balancing techniques - including under- and over-sampling through clusters! These techniques work by identifying clusters in your dataset. To under-sample, you use those clusters to decide which majority observations to remove, preserving more of the majority's diversity than random under-sampling would. To over-sample, you generate new, synthetic observations within the minority cluster, avoiding some overfitting because the resulting minority data is more diverse. &lt;/p&gt;

&lt;p&gt;Okay, but how in the world does any of that work? Let's dig in.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Synthetic Minority Oversampling Technique&lt;/strong&gt; (SMOTE) is the most common method I've run into to conduct cluster-based over-sampling. SMOTE works by finding all the instances of the minority category within the observations, drawing lines between those instances, and then creating new observations along those lines.&lt;/p&gt;
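The core interpolation step can be sketched in a few lines of NumPy - the two points below are hypothetical neighboring minority ('spam') observations, standing in for real feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical neighboring minority observations in feature space
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 6.0])

# SMOTE places a synthetic point somewhere along the line between them:
# x_new = x_i + gap * (x_neighbor - x_i), with gap drawn uniformly from [0, 1)
gap = rng.random()
x_new = x_i + gap * (x_neighbor - x_i)

print(x_new)  # each coordinate lies between the two originals
```

Repeat that for many minority points and their nearest minority neighbors, and you've filled in the minority region with plausible-looking new observations.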

&lt;p&gt;I found a great explainer of how SMOTE works on &lt;a href="http://rikunert.com/SMOTE_explained" rel="noopener noreferrer"&gt;Rich Data&lt;/a&gt;, although his examples are created in R (aka less helpful for us Python-only people). But the below image shows exactly how those lines are drawn and where the resulting new, synthetic observations are created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6jJ6Aesy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/rikunert/SMOTE_visualisation/master/SMOTE_R_visualisation_3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6jJ6Aesy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/rikunert/SMOTE_visualisation/master/SMOTE_R_visualisation_3.png" title="Visualization of SMOTE Over-Sampling, from http://rikunert.com/SMOTE_explained" alt="alt text" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how do we do this in Python?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the SMOTE package from the imblearn library
from imblearn.over_sampling import SMOTE

# First, look at your initial value counts
print(y.value_counts())

# Start your SMOTE instance
smote = SMOTE()

# Apply SMOTE to your data, some previously defined X and y
X_resampled, y_resampled = smote.fit_resample(X, y) 

# Look at your new, resampled value counts - should be equal!
print(pd.Series(y_resampled).value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, can you guess why this isn't perfect? It's better than a simple random over-sample, but these synthetic samples aren't real data, and they're built entirely from your existing minority - so they can still lead to over-fitting on your original minority category. Another pitfall: if one of your minority observations is an outlier, the lines drawn between that outlier and other minority points will generate synthetic data along them, and those new synthetic points may be outliers too. &lt;/p&gt;

&lt;p&gt;I'll note that SMOTE has &lt;a href="https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#smote-variants" rel="noopener noreferrer"&gt;a bunch of variants&lt;/a&gt; that account for some of the overfitting and outlier problems, but they are increasingly complex. Do your best.&lt;/p&gt;

&lt;p&gt;-&lt;/p&gt;

&lt;p&gt;Another way to create synthetic data to over-sample our minority category is the &lt;strong&gt;Adaptive Synthetic&lt;/strong&gt; approach, ADASYN. ADASYN works similarly to SMOTE, but it concentrates on the minority points that sit closest to the majority cluster - the ones most likely to be confused. It tries to help your model out where 'spam' and 'not spam' are closest together, and where confusion is therefore most likely, by generating more synthetic 'spam' data right there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the ADASYN package from the imblearn library
from imblearn.over_sampling import ADASYN

# Start your ADASYN instance
adasyn = ADASYN()

# Apply ADASYN to your data, some previously defined X and y
X_resampled, y_resampled = adasyn.fit_resample(X, y) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;-&lt;/p&gt;

&lt;p&gt;Let's try the opposite, synthetic under-sampling. &lt;strong&gt;Cluster Centroids&lt;/strong&gt; finds clusters, too, but instead of using those clusters to create new data points, it infers which data points in your majority category are 'central' to each cluster. Your model then uses those centroids (central points) in place of the actual majority instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the ClusterCentroids package from the imblearn library
from imblearn.under_sampling import ClusterCentroids

# Start your ClusterCentroids instance
cc = ClusterCentroids()

# Apply ClusterCentroids to your data, some previously defined X and y
X_cc, y_cc = cc.fit_resample(X, y)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, any under-sampling technique will eliminate some of the data you have, thus reducing the nuance that could be found if you looked at all of your data in your majority category. But this way, at least, those centroids will typically be more representative than a random sample of your majority.&lt;/p&gt;

&lt;p&gt;-&lt;/p&gt;

&lt;p&gt;If your model is having trouble differentiating between your classes, an alternative to ADASYN is to have your model ignore instances of your majority that are close to your minority. Uh, what? So, say you have some instances of 'not spam' that look really similar to 'spam.' You can tell your model to link those similar points, and then drop the majority half of each link, the 'not spam,' thus increasing the space in your data between 'spam' and 'not spam.'&lt;/p&gt;

&lt;p&gt;These are called &lt;strong&gt;Tomek links&lt;/strong&gt;, and I found a great example in a Kaggle page on &lt;a href="https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets" rel="noopener noreferrer"&gt;Resampling Strategies for Imbalanced Datasets&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V_hkY7_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/tomek.png%3Fv%3D2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V_hkY7_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/tomek.png%3Fv%3D2" title="Example of Tomek links from 'Resampling Strategies for Imbalanced Datasets'" alt="alt text" width="798" height="227"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the TomekLinks package from the imblearn library
from imblearn.under_sampling import TomekLinks

# Start your TomekLinks instance
tomek = TomekLinks()

# Apply TomekLinks to your data, some previously defined X and y
X_tl, y_tl = tomek.fit_resample(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Does this also have problems? Of course! You're ignoring the data that's right on the cusp between your majority and minority categories, perhaps where you need to dig into that data the most! But it is an option.&lt;/p&gt;

&lt;p&gt;There are dozens of increasingly complicated ways to balance your classes, as you mix and match and try to get the best set of observations before you build a classification model. See the resources below, and dig into the &lt;a href="https://imbalanced-learn.readthedocs.io/en/stable/user_guide.html" rel="noopener noreferrer"&gt;imblearn documentation&lt;/a&gt;, if you'd like to find plenty of other ways to balance your imbalanced categories!&lt;/p&gt;




&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;There are a lot of considerations to keep in mind when doing any part of data science, and of course balancing your imbalanced classes is no exception. One thing I absolutely want to reiterate is how important it is to do a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" rel="noopener noreferrer"&gt;train-test split&lt;/a&gt; before creating your model, so you reserve a percentage of your data to test your model. &lt;/p&gt;

&lt;p&gt;Create your train-test split BEFORE you balance your classes! Otherwise, especially if you use an over-sampling technique, your 'balanced' classes will have overlap between your training data and your testing data - after all, your over-sampling is basically using data you already have to make more data in your minority class, so your testing data will just be your training data either exactly or slightly modified by SMOTE. &lt;a href="https://beckernick.github.io/oversampling-modeling/" rel="noopener noreferrer"&gt;This tutorial&lt;/a&gt; walks through how that can trip you up in practice in quite a lot of detail.&lt;/p&gt;
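Here's a sketch of the right order of operations in plain NumPy, with random over-sampling standing in for SMOTE so the no-leak property can be checked directly (the 80/20 split and stand-in data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.arange(1000).reshape(-1, 1)   # stand-in features, all values unique
y = np.array([0] * 950 + [1] * 50)   # 0 = 'not spam', 1 = 'spam'

# 1) Split FIRST, reserving 20% as an untouched test set
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:200], idx[200:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Only THEN over-sample the minority, using training rows only,
#    so no resampled copy can leak into the test set
minority_rows = np.flatnonzero(y_train == 1)
majority_rows = np.flatnonzero(y_train == 0)
boost = rng.choice(minority_rows, size=len(majority_rows), replace=True)
X_balanced = np.concatenate([X_train[majority_rows], X_train[boost]])
y_balanced = np.concatenate([y_train[majority_rows], y_train[boost]])

# Every training row came from train_idx, never from test_idx
assert set(X_balanced.ravel()).isdisjoint(set(X_test.ravel()))
```

Had we resampled before splitting, duplicated minority rows would land on both sides of the split and that final assertion would fail.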

&lt;p&gt;In general, the best advice is to look at metrics beyond accuracy. Accuracy is important, but if we only looked at accuracy in our 'spam' or 'not spam' example we'd have a 95% accurate but otherwise completely useless model. Look at recall and precision as well, and try, as always, to find that magical Goldilocks zone that achieves what you want your model to achieve. Run a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html" rel="noopener noreferrer"&gt;confusion matrix&lt;/a&gt; - confusion matrix is friend!&lt;/p&gt;
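A confusion matrix is just a table of (actual, predicted) counts; here's a from-scratch sketch for the lazy majority-only model, so you can see exactly what it exposes:

```python
from collections import Counter

y_true = ["not spam"] * 950 + ["spam"] * 50
y_pred = ["not spam"] * 1000  # the always-'not spam' model again

# Count every (actual, predicted) pair - that's all a confusion matrix is
matrix = Counter(zip(y_true, y_pred))

print(matrix[("not spam", "not spam")])  # 950 true negatives
print(matrix[("spam", "not spam")])      # 50 false negatives: every real spam missed
print(matrix[("spam", "spam")])          # 0 true positives
```

That row of 50 misses is invisible inside "95% accurate," which is exactly why the matrix is friend.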

&lt;p&gt;-&lt;/p&gt;

&lt;p&gt;Soon, I'll edit this post to add an example GitHub repository using actual data, not just spam. In the meantime, any suggestions for more robust ways to balance your datasets? Run into any pitfalls when applying these techniques, or have a technique you find yourself turning to again and again? Let me know!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/RI4LTRjrVJhTskGtrb/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/RI4LTRjrVJhTskGtrb/giphy.gif" title="Spam, Repeated, from Monty Python" alt="alt text" width="400" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Some Resources
&lt;/h4&gt;

&lt;p&gt;I used many of the below to learn more about each of these techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://rikunert.com/SMOTE_explained" rel="noopener noreferrer"&gt;SMOTE Explained for Noobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets" rel="noopener noreferrer"&gt;Resampling Strategies for Imbalanced Datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/" rel="noopener noreferrer"&gt;8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758" rel="noopener noreferrer"&gt;Handling Imbalanced Datasets in Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;Cover image sourced from &lt;a href="https://medium.com/@SistaniSays/constantly-calibrating-on-work-and-life-why-balance-is-bullshit-cea3fbcab8ae" rel="noopener noreferrer"&gt;this Medium post&lt;/a&gt;. SMOTE visualization sourced from &lt;a href="https://raw.githubusercontent.com/rikunert/SMOTE_visualisation/master/SMOTE_R_visualisation_3.png" rel="noopener noreferrer"&gt;Rich Data&lt;/a&gt;. Tomek link visualization sourced from &lt;a href="https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets" rel="noopener noreferrer"&gt;this Kaggle page&lt;/a&gt;. GIFs, as always, from &lt;a href="https://giphy.com/" rel="noopener noreferrer"&gt;GIPHY&lt;/a&gt;&lt;/em&gt;
&lt;/h6&gt;

</description>
      <category>python</category>
      <category>classification</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A GitHub Guide For People Who Don’t Understand GitHub</title>
      <dc:creator>Lindsey</dc:creator>
      <pubDate>Thu, 23 May 2019 18:16:57 +0000</pubDate>
      <link>https://dev.to/lberlin/a-github-guide-for-people-who-don-t-understand-github-n50</link>
      <guid>https://dev.to/lberlin/a-github-guide-for-people-who-don-t-understand-github-n50</guid>
      <description>&lt;p&gt;A few years ago, I had an honestly embarrassing first encounter with GitHub. I didn’t work in any kind of technical field, had never heard of Git before, had never used my command line, and just wanted to host a website. While GitHub was absolutely not the best choice for this, it’s what I bumbled my way into and it eventually worked. &lt;/p&gt;

&lt;p&gt;That “eventually,” however, meant I spent a ridiculous amount of time (think easily a dozen hours - like I said, it was embarrassing) trying to figure out what to make of this GitHub thing, far longer than it took me to actually write out and test the code for my website. It turned GitHub into this mysterious, mythological beast that I’m still vaguely intimidated by, even as I get more comfortable in its presence. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/Fnz5oInGEe5yw/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/Fnz5oInGEe5yw/giphy.gif" alt="alt text" title="The Beast Roars"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those of you out there who don’t understand GitHub, who are intimidated by GitHub, who don’t entirely know why or when or how people actually use GitHub, this guide is for you. I’m no expert, but I’ve been in your shoes, and think we can tame this beast (at least a little) together.&lt;/p&gt;




&lt;h3&gt;
  
  
  Git And GitHub Are Not The Same
&lt;/h3&gt;

&lt;p&gt;One confusion I had early on was that I’d see references to Git and references to GitHub and it didn’t even cross my mind that these were &lt;a href="https://stackoverflow.com/questions/13321556/difference-between-git-and-github"&gt;different beasts&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The short of it: Git is your local version control, a tool you can use to create and keep track of versions of your code so that you can reference or revert back to previous iterations. GitHub is a way of hosting Git repositories (their fancy word for collections of versions and connected files), so that you can share your code and collaborate on it. You can do all of your local version control on your own computer using Git, without ever connecting to the internet. But, GitHub allows you to move those versions of your code to the cloud if you so choose, and has developed some tricks for working with others on the same projects.&lt;/p&gt;

&lt;p&gt;When you first start out and you create a repository (the folder for all of your files you want to keep together and to maybe someday share) you can do that locally with Git or &lt;a href="https://guides.github.com/activities/hello-world/#repository"&gt;online on your GitHub account&lt;/a&gt; - whatever works for you! &lt;/p&gt;

&lt;p&gt;I'll give you two guides to follow when you do this: If you start a repository locally, you need to &lt;a href="https://help.github.com/en/articles/adding-an-existing-project-to-github-using-the-command-line"&gt;tell GitHub&lt;/a&gt; that it exists and give GitHub permission to keep copies of it online. If you start a repository in GitHub, you need to &lt;a href="https://help.github.com/en/articles/which-remote-url-should-i-use#cloning-with-https-urls-recommended"&gt;‘clone’ it&lt;/a&gt; to copy it onto your local machine to work on it. &lt;/p&gt;

&lt;p&gt;How do you know which of the above guides to follow? If you’ve already written your code, I’d go with the first option. If you’re just starting out, I think it’s a little easier to make the repository on GitHub and then ‘clone’ it to work locally.&lt;/p&gt;




&lt;h3&gt;
  
  
  There Are More Steps In Git Than You Think You Need
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/6csVEPEmHWhWg/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/6csVEPEmHWhWg/giphy.gif" alt="alt text" title="Never Ending Steps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, let’s look at the part of Git that’s useful when you’re just starting out with GitHub. Why do I need to ‘add’, then ‘commit’, then ‘push’ my code? When working on smaller projects, all of these fundamental steps can seem redundant, and the terms aren’t necessarily intuitive, so let’s break down each one. &lt;/p&gt;

&lt;p&gt;I think of working with Git as working with a play, so bear with me as I try to make this metaphor work.&lt;/p&gt;

&lt;p&gt;When you ‘add’ something in Git, you’re basically prepping an actor backstage to go out onto the main stage, but the actor is still behind the curtain. You can prepare everything for the stage in one fell swoop, but maybe you don’t want everything to go on stage at once - some of the actors aren’t in costume or ready yet! The intention of ‘add’ is to tell Git which files are actually going to go on stage during this scene, while the other actors aren’t ready just yet and will come into the action when it makes more sense in later scenes.&lt;/p&gt;

&lt;p&gt;When you ‘commit’ something in Git, you’re actually moving those prepared actors (files) onto the stage. This should be done with some kind of introduction, a line spoken as the actor comes onto the stage so the audience can understand why the actor is there - this is conveyed with good ‘commit’ messages!&lt;/p&gt;

&lt;p&gt;Committing something does NOT actually put the stage in front of an audience, however. When you ‘push’ something in Git, you’re moving the scene you’ve written, with all the proper actors on the stage, into the actual play. If you ‘push’ to your GitHub, this means not only adding the scene to the play but also putting that play, with the new scene, in front of an audience! Now, others can watch your scene unfold by visiting your GitHub repository.&lt;/p&gt;

&lt;p&gt;If you don’t know whether your actors are behind the curtain before the scene, or on the stage during the scene, or whether the scene has been added to the play (aka if you don’t know whether you need to ‘add’, or ‘commit’ or ‘push’) you can check your Git ‘status’. In the command line of the most basic terminals, it will tell you exactly what has been added to your repository, what has been committed to your repository, and what the status of your local work is compared to the original you began working with (probably in color, even!). Check out &lt;a href="https://dev.to/areknawo/git-basics-the-only-introduction-you-ll-ever-need-6d2"&gt;this tutorial on Git basics&lt;/a&gt; to go into more detail and to get into the nitty gritty of how to run these Git commands. &lt;/p&gt;
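The add → commit → push rhythm above, sketched in a throwaway repository (this assumes git is installed; the demo user name/email are made up, and the push is commented out since this scratch repo has no GitHub remote):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

git init -q                        # a brand-new local repository
echo "print('hello')" > app.py     # write some code

git status --short                 # '?? app.py' - the actor isn't even backstage yet
git add app.py                     # stage it: prepped behind the curtain
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Add app.py"      # on stage, with a line of introduction

git log --oneline                  # one commit listed
# git push origin main             # would publish to GitHub, if a remote existed
```

Running `git status` between each step shows the file moving from untracked, to staged, to committed.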




&lt;h3&gt;
  
  
  The Purpose Of GitHub Is Collaboration
&lt;/h3&gt;

&lt;p&gt;If you aren’t working on a team, and want to hoard your code and not share it with anyone ever, you might not need GitHub. But it’s much more likely that you’re either going to be working with other people or will want to share your work with others, so getting a little comfortable with GitHub is valuable. &lt;/p&gt;

&lt;p&gt;This focus on collaboration is why you can ‘branch’ your code, which is a fancy way of saying that your code is going in a new direction from here on out. You can ‘merge’ those branches later on, if your main trunk of code and your side-branch end up going in the same direction again. New metaphor, I know, but the tree metaphor was clearly what Git and GitHub had in mind and it’s the easiest way for most to conceptualize this. &lt;/p&gt;

&lt;p&gt;What about pulling, and why is a ‘pull’ different from a ‘pull request’? A ‘pull’ is a Git command to bring in changes that have been made on a different branch. A ‘pull’ command is the combination of a ‘fetch’ command and a ‘merge’ command - you’re going to a remote piece of code (‘fetch’) and then smushing it together with your current piece of code (‘merge’). &lt;/p&gt;

&lt;p&gt;On the other hand, a ‘pull request’ is a GitHub action which allows you to request that a branch (or other connected piece of code) get added into the main trunk of the tree - you’re &lt;a href="https://www.wikihow.com/Graft-a-Tree"&gt;grafting&lt;/a&gt; the pulled code onto the main trunk, to stick with our new tree metaphor. &lt;/p&gt;

&lt;p&gt;All of this collaboration stuff - branches, merges, pulls and pull requests - is designed to make sure that every member of a team can work independently, but can combine their work without overriding each other. This is why merge conflicts are a necessary evil. I won't go into them here, because other people have way more experience in &lt;a href="https://rollout.io/blog/resolve-github-merge-conflicts/"&gt;handling merge conflicts&lt;/a&gt; than I do, but just know that they happen, it's okay, you'll get through this. And remember, GitHub is trying to help, even if it's annoying sometimes.&lt;/p&gt;




&lt;h3&gt;
  
  
  A Tamer Beast
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/skHD14FWM0tTW/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/skHD14FWM0tTW/giphy.gif" alt="alt text" title="The Beast Smiles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This only covers the very basics, since Git and GitHub are vital tools that take time to master. But I feel more comfortable in the presence of GitHub, even if it still surprises me and I still don’t completely trust that it won’t rear up and attack me. I hope you do too! Check out the many links above, as they all dive way deeper into the details and are great resources. &lt;/p&gt;

&lt;p&gt;Was this helpful to conceptualize how and why you use these tools? Any fun ways you think about GitHub (or Git) - any other metaphors? Let me know!&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;Cover image sourced from &lt;a href="https://medium.freecodecamp.org/how-to-get-up-to-3500-github-stars-in-one-week-339102b62a8f"&gt;Diana Neculai with FreeCodeCamp&lt;/a&gt;, the three GIFs are from &lt;a href="https://giphy.com/"&gt;GIPHY&lt;/a&gt;.&lt;/em&gt;
&lt;/h6&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Folium : Powerful Mapping Tool for Absolute Beginners</title>
      <dc:creator>Lindsey</dc:creator>
      <pubDate>Thu, 16 May 2019 18:18:26 +0000</pubDate>
      <link>https://dev.to/lberlin/folium-powerful-mapping-tool-for-absolute-beginners-1m5h</link>
      <guid>https://dev.to/lberlin/folium-powerful-mapping-tool-for-absolute-beginners-1m5h</guid>
      <description>&lt;p&gt;As someone new to data science, just now beginning to grok the possibilities of what I can do with a lot of data and a bit of programming, there are parts that are intensely gratifying and parts that make me feel a bit in over my head.&lt;/p&gt;

&lt;p&gt;There are plenty of times that I set out to solve something, try something, or write something and get stuck. Horribly, inextricably, I-don’t-even-know-where-to-start levels of stuck. This is inevitable, and I’m building up a solid toolbox of resources to use when I find myself in these situations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/TITP4JehwSwquf64CW/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/TITP4JehwSwquf64CW/giphy.gif" alt="Turtle spinning helplessly Gif, Image Source: Giphy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the flip-side of this is pretty awesome. Any time I set out to do something and it actually does what I intended, it’s extremely rewarding. Even if it’s something basic, I love pushing through blockers and getting to a solution that I created - it’s a huge reason I am positioning myself to move into data science and a technical career.&lt;/p&gt;

&lt;p&gt;So! When I wanted to map out a dataset for a project, got frustrated with the clunkiness of the tools I knew, taught myself how to work with a new mapping library, and made a gorgeous map that conveyed exactly what I wanted, I was pretty pleased with myself. Allow me to share Folium with the other python beginners out there, because it’s a forgiving and accessible way to play around with mapping techniques. It also looks slick, is interactive by default, and made me feel like I’d leveled up my data visualization effortlessly. &lt;/p&gt;




&lt;h3&gt;
  
  
  Before You Begin
&lt;/h3&gt;

&lt;p&gt;Folium does have some caveats. The &lt;a href="https://www.kaggle.com/gabriellima/house-sales-in-king-county-usa/data"&gt;dataset&lt;/a&gt; I was working with was one which provided data on homes sold in King County around Seattle. That dataset already had columns for latitude and longitude, which made mapping aspects of this data pretty natural. From everything I’ve gathered from the &lt;a href="https://python-visualization.github.io/folium/"&gt;Folium documentation&lt;/a&gt;, you need lat/long pairs in order to use their map, so other location data will present a data cleaning challenge at the start.  &lt;/p&gt;

&lt;p&gt;Additionally, Folium only works well up to a certain point - on my system, it would not process and map all 21,000 rows of the pandas data frame I tried to pass through it (fair). After playing around with it, about 1000 rows appeared to be the sweet spot for me, and any more than that would not work nearly as well, if at all. In my dataset, I focused on the 1000 most expensive homes as a way to narrow in on a subset of my data (and to answer other questions as part of the project I was working on), so I’ll be showcasing that subset.&lt;/p&gt;
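&lt;p&gt;Narrowing down to that subset is a pandas one-liner. Here’s a sketch with a stand-in DataFrame - the real King County data has about 21,000 rows with 'price', 'lat', and 'long' columns, and I kept the top 1000 rather than the top 3:&lt;/p&gt;

```python
import pandas as pd

# Stand-in for the King County data (the real set is read via pd.read_csv
# and has ~21,000 rows with 'price', 'lat', and 'long' columns).
df = pd.DataFrame({
    "price": [450_000, 1_200_000, 325_000, 2_500_000, 780_000],
    "lat":   [47.51, 47.63, 47.37, 47.62, 47.56],
    "long":  [-122.25, -122.32, -122.21, -122.30, -122.28],
})

# Keep only the N most expensive homes so Folium stays responsive.
top_homes = df.nlargest(3, "price")
```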

&lt;p&gt;So, set yourself up for Folium success by targeting smaller datasets with accessible location data, then get in there and play around!&lt;/p&gt;

&lt;p&gt;Follow along with my code here: &lt;a href="https://github.com/lindseyberlin/Blog_FoliumMaps"&gt;https://github.com/lindseyberlin/Blog_FoliumMaps&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Where Is Everything?
&lt;/h3&gt;

&lt;p&gt;Perhaps there are magical people out there who can look at a lat/long pair and know exactly where it is, or look at two pairs and know how they’re related, but I am no such person. When I was working with this data, I had some initial questions. Were all of these houses within the same few blocks? Were they really spread out? Were there any obvious clusters? Only one way to find out - make those lat/long pairs work for me!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nQT7YVUY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dbx9c4ce1smogk42qd2o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nQT7YVUY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dbx9c4ce1smogk42qd2o.jpeg" alt="Comparison map - matplotlib's basemap vs Folium"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The introductory Folium map I created provided a scatterplot over a map background. The most complex part of the code was the for-loop, which mapped each row as its own dot on the map. The second hardest part was using a mean function on my data set to get an average of the latitude and longitude columns, which is where I centered my map.&lt;/p&gt;

&lt;p&gt;This basic configuration answered those above questions - I could clearly see where each house was sold, and how each house was spaced. Hooray! But my initial success led me to wonder what else I could do with Folium, so I went a little deeper.&lt;/p&gt;




&lt;h3&gt;
  
  
  Adding Layers of Meaning
&lt;/h3&gt;

&lt;p&gt;From the basic map, I took three additional steps to make my map more complex and to make sure it conveyed more meaning than just the locations of the homes in my dataset.&lt;/p&gt;

&lt;p&gt;Note! These are just screenshots - the actual maps I'm creating in python are interactive. Check out that &lt;a href="https://github.com/lindseyberlin/Blog_FoliumMaps"&gt;github repo&lt;/a&gt; I linked above to see what I mean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HOrycbRA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vhpim8869mdc9ec4txhw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HOrycbRA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vhpim8869mdc9ec4txhw.jpeg" alt="First layer: Folium map with pop-up text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, I added pop-up text, which displays the exact latitude and longitude of the house as well as the price at which it was sold. Adding the pop-up text was a bit more complex, but still straightforward - I added the code within the for-loop, so it would create pop-up text specific to each row of data. I formatted the text using .format, but could have also used an f-string. Now my map provided a bit more detail!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EUNgOCnw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d643kihmgzir8go0k48l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EUNgOCnw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d643kihmgzir8go0k48l.jpeg" alt="Second layer: Folium map where the radius size of circles reflects the price"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, I changed the size of each dot to correspond with the price at which each home was sold. This involved changing the radius of each dot based on the price for that row. Easy-peasy!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fTM-U4LJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ciy9v7knghos9vp0hu2w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fTM-U4LJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ciy9v7knghos9vp0hu2w.jpeg" alt="Third layer: Folium map with circle colors reflecting price buckets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last, I changed the color of each dot to correspond with different buckets of price, so that the most expensive homes showed up as a bright, obnoxious pink and the least expensive homes (of the 1000 most expensive homes within the original data set) were a gentle green. This involved integrating if/elif/else statements based on those cost buckets, still within the for-loop, to change the color of each dot. Slightly more complicated than anything else so far, but still straightforward.&lt;/p&gt;




&lt;h3&gt;
  
  
  Other Options
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gpS9ZuX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mkv9jyc6jqqdcifvswzd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gpS9ZuX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mkv9jyc6jqqdcifvswzd.jpeg" alt="Side quest: Folium heat map to reflect concentration of homes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another easy way to examine this kind of data is to add a heat map instead of adding points to the map. This conveys the concentration of data in a different way, useful if you want to explore concentration more than the details of each individual row of your dataset. &lt;/p&gt;

&lt;p&gt;There are &lt;a href="https://github.com/python-visualization/folium"&gt;dozens of other ways&lt;/a&gt; you can map things in Folium, and probably better ways I could’ve answered the initial questions I posed. Play around with it and see what you find! And if you have any useful Folium tips or tricks, or questions that are best answered with a map, please share!&lt;/p&gt;




&lt;h3&gt;
  
  
  Useful Tutorials:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-analysis"&gt;Folium for Maps, Heatmaps and Time Analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.dominodatalab.com/creating-interactive-crime-maps-with-folium/"&gt;Creating Interactive Crime Maps With Folium&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;Cover image from &lt;a href="https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf"&gt;Spatial Visualizations and Analysis in Python with Folium&lt;/a&gt;. Spinning turtle GIF from &lt;a href="https://gph.is/2xQV4nW"&gt;GIPHY&lt;/a&gt;. All other images were screenshots I created using Folium - see my GitHub repository &lt;a href="https://github.com/lindseyberlin/Blog_FoliumMaps"&gt;here&lt;/a&gt;&lt;/em&gt;
&lt;/h6&gt;

</description>
      <category>datascience</category>
      <category>visualization</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Three Personal Traits That Convinced Me To Try Data Science</title>
      <dc:creator>Lindsey</dc:creator>
      <pubDate>Fri, 03 May 2019 03:37:16 +0000</pubDate>
      <link>https://dev.to/lberlin/three-personal-traits-that-convinced-me-to-try-data-science-19nf</link>
      <guid>https://dev.to/lberlin/three-personal-traits-that-convinced-me-to-try-data-science-19nf</guid>
      <description>&lt;p&gt;Jumping feet-first into an immersive Data Science bootcamp is not a decision I made lightly. Sure, there are a lot of reasons for people to move into data science as a field right now - data-related competencies and skills are some of the &lt;a href="https://www.pwc.com/us/en/library/data-science-and-analytics.html"&gt;most sought-after&lt;/a&gt; by employers in the United States - but that didn’t automatically mean that data science would be the right move for me. And would the results be worth the substantial investment of money and time? &lt;/p&gt;

&lt;p&gt;There were a lot of contributing factors and considerations that led me to answer 'yes' and take this plunge, but three of the most relevant are personal traits which, I hope, will ultimately make me into a successful and effective data scientist. &lt;/p&gt;




&lt;h3&gt;
  
  
  Question Everything
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z28XPCB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://img.gifglobe.com/grabs/montypython/MontyPythonAndTheHolyGrail/gif/23jwvs9njyOv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z28XPCB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://img.gifglobe.com/grabs/montypython/MontyPythonAndTheHolyGrail/gif/23jwvs9njyOv.gif" alt="alt text" title="Monty Python and the Holy Grail Question Gif, Image Source: GifGlobe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anyone who has ever worked with me will tell you that I’m quick to question pretty much everything. What is the purpose of what I’m doing? How can what I’m doing be improved? Is what I’m doing having an impact? Et cetera, et cetera, until I’ve driven everyone around me mad if I’m not careful.&lt;/p&gt;

&lt;p&gt;I’ve found this is not a trait that’s valued in plenty of workplaces. Quite often, employers don’t enjoy being questioned about why they do what they do the way that they do it - especially not by someone brand new to the job who has trouble keeping her head down and learning the work without asking so many pesky questions. &lt;/p&gt;

&lt;p&gt;But that’s the whole point of data science! Employers hire data scientists to improve processes and procedures, explore new or better opportunities, and, &lt;a href="https://medium.com/p/45989be6300e#543b"&gt;most importantly&lt;/a&gt;, &lt;a href="https://medium.com/p/ca3e166b7c67#bf26"&gt;ask questions&lt;/a&gt;. Finding a job that encourages and nurtures my natural tendencies and quirks, rather than trying to squash them, is a huge driver of why I’m currently looking to move down another career path.&lt;/p&gt;




&lt;h3&gt;
  
  
  Communicate Complicated Ideas
&lt;/h3&gt;

&lt;p&gt;Have you ever forgotten the name of a person or concept (or movie, album, app, class, et cetera), and then spent several minutes explaining who or what you mean to the person you’re talking to? I am the most guilty of this. For whatever reason, my brain easily associates information but often forgets the proper label - I can hardly ever remember anyone or anything’s name! &lt;/p&gt;

&lt;p&gt;The silver lining is that I’ve become pretty good at explaining concepts, connections and context in order to be understood. Luckily this also manifests in ways that are more flattering than my terrible memory for names. Communication of complicated ideas, while conveying the appropriate connections and context, is &lt;a href="https://medium.com/comet-ml/a-data-scientists-guide-to-communicating-results-c79a5ef3e9f1"&gt;one of the most important skills&lt;/a&gt; a data scientist can have. While communication skills &lt;a href="https://mode.com/blog/how-to-develop-the-five-soft-skills-that-will-make-you-a-great-analyst"&gt;can always be improved&lt;/a&gt;, thinking of myself as a relatively good communicator of ideas was one of the things that led me to think of data science as a promising option.&lt;/p&gt;




&lt;h3&gt;
  
  
  Love Learning New Things
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XCUc0bRF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.quickmeme.com/img/87/876e1b8c8f540c2382df9e56b3fd11fe809b8f9d63cd04e9e3b02d9b7c701a5a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XCUc0bRF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.quickmeme.com/img/87/876e1b8c8f540c2382df9e56b3fd11fe809b8f9d63cd04e9e3b02d9b7c701a5a.jpg" alt="alt text" title="Learn all the things! Image Source: QuickMeme, Meme Source: Hyperbole and a Half"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I love learning new things. I know, that’s a pretty cliche thing to say - doesn’t everyone love learning new things? But I love the process of starting to learn - the discovery, the excitement, the CLICK as new concepts slide into place and change the way you view the world.&lt;/p&gt;

&lt;p&gt;On the other side of that coin, learning new things is exhausting - especially once the newness wears off and the real work of learning begins. I’m absolutely guilty of starting to learn yet another new thing before I really master what I was previously learning (aren’t we all?). The way to shortcut this, I’ve found, is to be constantly working on new projects or tasks that call on you to either learn new things or apply your knowledge in new ways.&lt;/p&gt;

&lt;p&gt;The fact is that I &lt;a href="https://medium.com/p/45989be6300e#8a90"&gt;will never know&lt;/a&gt; all of the possible languages or tools I could use as a data scientist. Chances are, the daily tools I will use five or ten years from now haven’t even been invented or written yet. While this is a daunting prospect, it’s also exciting since it means that I’ve placed myself in a position where I’ll always be challenged to learn new things.&lt;/p&gt;




&lt;p&gt;While these considerations were only part of what led me to dive into the Flatiron School’s immersive &lt;a href="https://flatironschool.com/career-courses/data-science-bootcamp/houston/"&gt;data science program&lt;/a&gt;, they are fundamental parts of me that I hope will translate well into a data science role. Certainly I don’t think these are the only traits a data scientist needs, or that all data scientists need these three traits! And I won't know for a while whether these traits actually set me up for success in a data science role - stay tuned.&lt;/p&gt;

&lt;p&gt;What traits do you have that translate well into a data science-focused position? Would love to hear the thoughts of beginner and established data scientists alike!&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;Cover image sourced from &lt;a href="https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article"&gt;SimpliLearn&lt;/a&gt;. Monty python gif sourced from &lt;a href="https://montypython.gifglobe.com/scene/?id=23jwvs9njyOv"&gt;GifGlobe&lt;/a&gt;. Learn all the things sourced from &lt;a href="http://www.quickmeme.com/img/87/876e1b8c8f540c2382df9e56b3fd11fe809b8f9d63cd04e9e3b02d9b7c701a5a.jpg"&gt;Quick Meme&lt;/a&gt;, with image credit to &lt;a href="http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html"&gt;Hyperbole and a Half&lt;/a&gt;&lt;/em&gt;.
&lt;/h6&gt;

</description>
      <category>datascience</category>
      <category>bootcamp</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
