<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: dwalkup</title>
    <description>The latest articles on DEV Community by dwalkup (@dwalkup).</description>
    <link>https://dev.to/dwalkup</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F281347%2Ff23ea10b-7195-480b-877d-2c4043be6fd8.jpeg</url>
      <title>DEV Community: dwalkup</title>
      <link>https://dev.to/dwalkup</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dwalkup"/>
    <language>en</language>
    <item>
      <title>A Primer On Weight Of Evidence</title>
      <dc:creator>dwalkup</dc:creator>
      <pubDate>Fri, 31 Jan 2020 16:34:23 +0000</pubDate>
      <link>https://dev.to/dwalkup/a-primer-on-weight-of-evidence-719</link>
      <guid>https://dev.to/dwalkup/a-primer-on-weight-of-evidence-719</guid>
      <description>&lt;p&gt;In this post I will provide you with some basic information about weight of evidence and a related concept, information value. I will outline how to calculate weight of evidence and information value. I will also talk about a few guidelines and a caution for using these concepts.&lt;/p&gt;

&lt;p&gt;First of all, what is weight of evidence?&lt;/p&gt;

&lt;p&gt;Weight of evidence (WOE) is a method of encoding a predictor variable to show the relationship it has with a binary target variable. It originated in the credit and finance industries to help separate “good” risks from “bad” risks, with the risk in that case being loan default. It has been in use for more than forty years, and it remains most widely known in the credit, finance, and insurance industries.&lt;/p&gt;

&lt;p&gt;It is calculated as the natural log of the distribution of good customers (those who did not default on their loan) divided by the distribution of bad customers (those who did default). This can also be looked at from the perspective of events (something happening) vs non-events (the same thing not happening). In that case, the calculation is the natural log of the distribution of non-events divided by the distribution of events.&lt;/p&gt;

&lt;p&gt;This is the formula for weight of evidence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwx5tbu1zirlejmsay4i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwx5tbu1zirlejmsay4i3.png" alt="Weight of Evidence equation"&gt;&lt;/a&gt;&lt;br&gt;
or&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8zxe77wnl8uxazgmpcgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8zxe77wnl8uxazgmpcgb.png" alt="Alternate Weight of Evidence equation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;The Steps To Calculate WOE:&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If the predictor variable is continuous&lt;/strong&gt; (for example, a list of people's ages or household incomes)&lt;strong&gt;: split the variable into a number of groups or “bins.”&lt;/strong&gt; This process is known as “binning.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/Tl2AK8HOHj7SU/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/Tl2AK8HOHj7SU/giphy.gif" alt="Harry Potter being sorted"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good rule of thumb is to start with 10 bins. This is especially true if your binning strategy is to keep an equal number of data points in each bin. If that is the case, using 10 bins means each bin will contain 10% of the variable's values.&lt;/p&gt;

&lt;p&gt;The principle behind that rule of thumb is that a given bin should capture at least 5%-10% of the observations. Up to a point, fewer bins do a better job of capturing patterns: as long as no bin holds so many values that the pattern is masked, a smaller number of bins is better. With too many bins, each bin contains too few values to capture the pattern or signal in the data.&lt;/p&gt;

&lt;p&gt;Unfortunately, the only way to determine the ideal number of bins to use is to experiment a bit. There is no quicker or better way that I know of, so we generally start with 10 bins and adjust from there.&lt;/p&gt;

&lt;p&gt;This step is skipped for categorical variables, because they are already effectively binned.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Calculate the number of events &amp;amp; non-events in each bin.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each bin should ideally have at least one event and at least one non-event in it. If there are no events or non-events in a bin, then an approximation will need to be made. A quick-and-dirty approach would be to use the probability of the event or non-event in place of WOE (this is the same as saying WOE = 0).&lt;/p&gt;

&lt;p&gt;Another approach would be to use additive or Laplace smoothing. Let's walk through how to do that.&lt;/p&gt;

&lt;p&gt;This is the formula for the adjusted distribution using smoothing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F94xyzoup4avo2dwvuzr3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F94xyzoup4avo2dwvuzr3.png" alt="Equation for adjusted distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start by identifying whether it's the event or non-event that's missing from a bin. If it's both, then you'll benefit the most from using the quick-and-dirty approach from above.&lt;/li&gt;
&lt;li&gt;Since you're performing this step because you're missing events or non-events from a particular bin, the number of events/non-events will be 1 (0 events or non-events + the smoothing factor of 1).&lt;/li&gt;
&lt;li&gt;Take the total number of data points (or observations) in the bin and add 2. This comes from the smoothing factor of 1 from above, multiplied by the number of categories, which is 2 (event or non-event).&lt;/li&gt;
&lt;li&gt;Divide the number from step 2 by the number from step 3. This is the adjusted distribution of events or non-events that you'll use in step 4 of the WOE calculation. &lt;em&gt;For this event or non-event in this bin, you'll skip step 3 of the WOE calculation.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By using smoothing to calculate an adjusted distribution, the problem of dividing by 0 is avoided.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Calculate the percentage of non-events &amp;amp; events in each bin.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is done by taking the number of events in the bin and dividing it by the total number of events across all bins. Repeat this for non-events: the number of non-events in the bin divided by the total number of non-events. These are the distributions used in the WOE formula.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calculate WOE by taking the natural log of the percentage of non-events in the bin divided by the percentage of events in the bin.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;
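&lt;p&gt;The four steps above can be sketched in Python with pandas. This is only a sketch: the column names (&lt;code&gt;age&lt;/code&gt; as the predictor, &lt;code&gt;default&lt;/code&gt; as the binary target, 1 = event) and the toy data are assumptions for illustration.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy data: a continuous predictor and a binary target (1 = event).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),
    "default": rng.integers(0, 2, size=1000),
})

# Step 1: bin the continuous predictor into 10 equal-frequency bins.
df["age_bin"] = pd.qcut(df["age"], q=10, duplicates="drop")

# Step 2: count events and non-events in each bin.
grouped = df.groupby("age_bin", observed=True)["default"]
events = grouped.sum()
non_events = grouped.count() - events

# Step 3: distribution of events and non-events across all bins.
dist_events = events / events.sum()
dist_non_events = non_events / non_events.sum()

# Step 4: WOE = natural log of (distribution of non-events / distribution of events).
woe = np.log(dist_non_events / dist_events)
print(woe)
```

&lt;p&gt;With equal-frequency binning via &lt;code&gt;qcut&lt;/code&gt;, each bin holds roughly 10% of the rows, matching the rule of thumb above.&lt;/p&gt;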

&lt;h1&gt;Weight Of Evidence Usage&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/L0YV96DdjM52IsPPFw/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/L0YV96DdjM52IsPPFw/giphy.gif" alt="Ron Burgundy and Conan O'Brien"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you know what it is and how to calculate it, why do you want to use it?&lt;/p&gt;

&lt;p&gt;One reason is that WOE allows you to see the relationship between the predictor variables and the target variable. If WOE is a positive number, then the distribution of non-events is higher than the distribution of events for that bin or category; if it's a negative number, the distribution of non-events is lower than the distribution of events. The sign tells you the direction of the relationship, and the size of the WOE value tells you how strongly that bin separates the two outcomes.&lt;/p&gt;

&lt;p&gt;Another reason is that you're planning to use a logistic regression model. WOE is particularly suited to logistic regression because WOE is the log-odds for a given group of values in a category, and logistic regression predicts the log-odds of the target variable. This also means the encoded predictor is already on the model's log-odds scale, so additional scaling is unnecessary.&lt;/p&gt;

&lt;p&gt;Keep in mind, a consideration in using weight of evidence is the potential for target leakage inherent in using the distribution of the target values in a category to encode that category. This can lead to overfitting in your model. One way of dealing with this is to inject some random Gaussian noise into the variable during encoding.&lt;/p&gt;
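&lt;p&gt;One way that noise injection might look in practice is below. The &lt;code&gt;woe_map&lt;/code&gt; lookup table and the noise scale are hypothetical; the right amount of noise is a tuning decision.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical WOE lookup table produced during encoding.
woe_map = {"A": 0.35, "B": -0.10, "C": -0.52}
categories = pd.Series(["A", "B", "B", "C", "A", "C"])

# Encode, then add a little Gaussian noise to blunt target leakage.
encoded = categories.map(woe_map)
noisy = encoded + rng.normal(loc=0.0, scale=0.05, size=len(encoded))
```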

&lt;h1&gt;Related Concept - Information Value&lt;/h1&gt;

&lt;p&gt;You may have been wondering if I forgot about information value, since I haven't mentioned it again since the start of this post. That's because information value is closely related to weight of evidence. In fact, WOE is used in calculating information value!&lt;/p&gt;

&lt;p&gt;Information value is intended to express how much benefit there is in knowing the value of an independent variable for predicting the target variable. Where weight of evidence shows you the relationship between the independent variable and the target variable, information value shows you the strength of that relationship.&lt;/p&gt;

&lt;h1&gt;Calculating Information Value&lt;/h1&gt;

&lt;p&gt;Information value is calculated as the difference of the distributions of non-events and events multiplied by the weight of evidence value, summed over all groups or bins of a predictor variable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9n2tojeimnp2ch4tm12n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9n2tojeimnp2ch4tm12n.png" alt="Information Value equation"&gt;&lt;/a&gt;&lt;br&gt;
or&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo61373wnxx5x12jlbs0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo61373wnxx5x12jlbs0c.png" alt="Alternate Information Value equation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each bin, subtract the percentage of events from the percentage of non-events and multiply the result by the WOE value for the bin.&lt;/li&gt;
&lt;li&gt;Add those results together for each bin in the predictor variable.&lt;/li&gt;
&lt;li&gt;The total is the information value for the predictor variable.&lt;/li&gt;
&lt;/ol&gt;
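&lt;p&gt;Those three steps can be sketched as follows. The per-bin distributions here are made up for illustration; in practice they come out of the WOE calculation above.&lt;/p&gt;

```python
import numpy as np

# Assumed per-bin distributions of non-events and events (each sums to 1).
dist_non_events = np.array([0.30, 0.40, 0.20, 0.10])
dist_events = np.array([0.10, 0.30, 0.35, 0.25])

# WOE per bin, as calculated earlier.
woe = np.log(dist_non_events / dist_events)

# Steps 1-3: per-bin (pct non-events minus pct events) times WOE, summed.
iv = np.sum((dist_non_events - dist_events) * woe)
print(round(iv, 3))  # prints 0.47
```

&lt;p&gt;An IV of about 0.47 would fall in the “strong predictor” band of the interpretation table below.&lt;/p&gt;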

&lt;h1&gt;Information Value Usage&lt;/h1&gt;

&lt;p&gt;According to Siddiqi (2006)[1], the information value statistic can be interpreted according to the table below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IV&lt;/th&gt;
&lt;th&gt;Predictive Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.02&lt;/td&gt;
&lt;td&gt;No predictive value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.02 - 0.1&lt;/td&gt;
&lt;td&gt;Weak predictor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1 - 0.3&lt;/td&gt;
&lt;td&gt;Moderate predictor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.3 - 0.5&lt;/td&gt;
&lt;td&gt;Strong predictor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 0.5&lt;/td&gt;
&lt;td&gt;Suspiciously strong predictor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the case of information value &amp;gt; 0.5, you should double-check your information value calculation.&lt;/p&gt;

&lt;p&gt;Having the information value for each of your predictor variables allows you to rank them accordingly and may assist in feature selection for your model. By using the variables with higher information value ranks, you are able to eliminate lower-ranked variables (assuming there are no variable interactions). That helps you avoid the so-called curse of dimensionality!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/3o6gDSdED1B5wjC2Gc/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/3o6gDSdED1B5wjC2Gc/giphy.gif" alt="Information overload GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all I intend to cover in this post. Thank you for your time, and I hope to catch you in the next one!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
[1] Siddiqi, Naeem (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. SAS Institute, pp 79-83. &lt;/p&gt;

&lt;p&gt;All GIFs sourced from giphy.com&lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>Hey, I'm Doing (Data) Science!</title>
      <dc:creator>dwalkup</dc:creator>
      <pubDate>Fri, 17 Jan 2020 19:34:25 +0000</pubDate>
      <link>https://dev.to/dwalkup/hey-i-m-doing-data-science-36ln</link>
      <guid>https://dev.to/dwalkup/hey-i-m-doing-data-science-36ln</guid>
      <description>&lt;p&gt;When I set out to write this post, I thought I would be talking about the scientific method and how it applies to data science workflow. I had it in mind to align the steps and methods of each for discussion.&lt;/p&gt;

&lt;p&gt;Then I changed my mind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ED-Upfii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3aruz9bk5a16v2gcfy7d.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ED-Upfii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3aruz9bk5a16v2gcfy7d.gif" alt="Hold the phone, Smithers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think the more interesting discussion is about a realization I had — one of the steps of the scientific method isn’t necessarily explicit in data science. It’s still there, but that wasn’t apparent to me at first glance. This was likely due to my previous experience being more technical than scientific.&lt;/p&gt;

&lt;p&gt;Maybe there are more people like me out there and this will resonate with them too.&lt;/p&gt;

&lt;h1&gt;The Scientific Method In Brief&lt;/h1&gt;

&lt;p&gt;Here is a brief refresher on the scientific method, for those who may not have thought about it since they learned it in school. &lt;a href="https://www.merriam-webster.com/dictionary/scientific%20method"&gt;Merriam-Webster’s definition&lt;/a&gt; is “principles and procedures for the systematic pursuit of knowledge involving the recognition and formulation of a problem, the collection of data through observation and experiment, and the formulation and testing of hypotheses.”&lt;/p&gt;

&lt;p&gt;These are the general steps involved in the scientific method:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define a question&lt;/li&gt;
&lt;li&gt;Gather information &amp;amp; resources&lt;/li&gt;
&lt;li&gt;Form an explanatory hypothesis&lt;/li&gt;
&lt;li&gt;Test the hypothesis by performing an experiment and collecting data in a reproducible manner&lt;/li&gt;
&lt;li&gt;Analyze and interpret the data. Draw conclusions that may serve as the starting point for a new hypothesis&lt;/li&gt;
&lt;li&gt;Communicate results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vsFJfYjZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vrv95yl2p5zvq767ta69.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vsFJfYjZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vrv95yl2p5zvq767ta69.gif" alt="Finn from Adventure Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;My Thoughts And Realizations&lt;/h1&gt;

&lt;p&gt;The first couple of steps seem self-explanatory, and are the same between the two processes. I think it’s useful to mention that I would place exploratory data analysis here in step two. It may be obvious to some, but I would also say that asking clarifying questions about the goal of your project belongs there as well.&lt;/p&gt;

&lt;p&gt;Step three is the point that wasn’t immediately obvious to me. I didn't think I was actually forming a hypothesis in the course of what I was doing. In fact, it was a more experienced data scientist who pointed out to me that I was. I just wasn't formally or explicitly stating it. There are a couple of things about this idea which I want to talk about.&lt;/p&gt;

&lt;p&gt;The first thing is simply a definition I heard for what a hypothesis is that made sense to me. It was that a hypothesis can be defined as an educated guess about the relationship between two or more variables. It seems to me that this educated guessing, followed by testing to validate (or invalidate) it, almost defines feature selection, feature engineering and model selection.&lt;/p&gt;

&lt;p&gt;The other thing is that we make a number of assumptions when we start trying to solve a problem. For instance, we make the initial assumption that the data we have gathered is all the data we need. We assume that the features we select have a relationship with the target we’re trying to predict. We assume that the underlying assumptions of the model we select have been met. When it comes down to it, we start with the very basic assumption that we can solve the problem we’re studying. This is not an exhaustive list. In a way, all of these assumptions are hypotheses, or at the least pieces of one.&lt;/p&gt;

&lt;p&gt;It also seems to me that the line between steps four and five tends to be kind of blurry. I hadn’t really articulated this for myself before sitting down to write this, but each iteration of a model is an experiment. Each experiment (should) shed light on one or more of the assumptions (hypotheses) made earlier. Assessment of the results of the model (experiment) often leads to immediate adjustments to the model or features, which is then a new experiment. We may cycle back and forth between these two steps fairly rapidly, especially in the early stages of a project.&lt;/p&gt;

&lt;p&gt;The last paragraph points to another important realization: the scientific method and a general data science workflow are both iterative processes. It seems pretty unlikely for a person to just follow the steps one time through and arrive at an accurate conclusion. Most likely, it will take several experiments, additional data collection, and reforming hypotheses to get there.&lt;/p&gt;

&lt;p&gt;At last, we talk about the final step, communication of your results or findings. There are things about this step that may not be immediately obvious. For example, it’s important to consider the audience you’re communicating your results to in order to present them effectively. A presentation for a room full of board members should look very different than a paper being presented for peer review.&lt;/p&gt;

&lt;p&gt;Another consideration is your call to action. If you are making one, make certain it’s clear and compelling. If you’re not recommending a specific action to be taken, you should try to present your findings in a way that helps to spark ideas about the next steps to be taken. To do otherwise runs the risk of devaluing your work and your conclusions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sxC_2Ey8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cameukov855ddzzoy9yg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sxC_2Ey8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cameukov855ddzzoy9yg.gif" alt="Sheldon from Big Bang Theory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Articulating these thoughts in this post has helped me to realize that I am actually engaging in the scientific process. For me, this lends a bit more gravity to the things I’m learning and practicing. It also spurs me to put a little more thought into my assumptions than I have been up to now.&lt;/p&gt;

&lt;p&gt;Hopefully, this has helped you in some way as well.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GIFs were sourced from giphy.com&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>A Beginner's Look at Exploratory Data Analysis</title>
      <dc:creator>dwalkup</dc:creator>
      <pubDate>Fri, 20 Dec 2019 19:46:21 +0000</pubDate>
      <link>https://dev.to/dwalkup/a-beginner-s-look-at-exploratory-data-analysis-13gp</link>
      <guid>https://dev.to/dwalkup/a-beginner-s-look-at-exploratory-data-analysis-13gp</guid>
      <description>&lt;p&gt;Just sit right back&lt;br&gt;
And you’ll hear a tale&lt;br&gt;
A tale of analysis&lt;br&gt;
That started from a data set&lt;br&gt;
Saved on a laptop’s disk&lt;br&gt;
The author is a student who&lt;br&gt;
The process he has sought&lt;br&gt;
To find the things that can be seen&lt;br&gt;
In the data he will plot, the data he will plot&lt;br&gt;
The going started getting tough&lt;br&gt;
The student knew the cost&lt;br&gt;
If not for the EDA he would do&lt;br&gt;
His insight would be lost, his insight would be lost&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frulnjkm91bd301f8d0zt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frulnjkm91bd301f8d0zt.jpg" alt="tropical island photo"&gt;&lt;/a&gt;&lt;br&gt;
Photo by Pablo García Saldaña on Unsplash&lt;/p&gt;

&lt;p&gt;Now that I’ve had a little fun, it’s time to get a little more serious about the topic of this post, exploratory data analysis (or EDA). Much like the hapless castaways of that famous television show were lost at sea, you'll be lost in a sea of data without performing EDA.&lt;/p&gt;

&lt;p&gt;Before I get too far, however, I think it’s important to provide this &lt;strong&gt;disclaimer&lt;/strong&gt;: at the time of this writing, I am new to the data science world and processes.&lt;/p&gt;

&lt;p&gt;Part of my motivation in writing this post (despite the volume of articles on this subject you will find with a quick Google search) is to help me learn this concept and cement it for myself.&lt;/p&gt;

&lt;p&gt;As such, I’m certain there will be things I’ve missed or overlooked. Feel free to enlighten me in the comments below!&lt;/p&gt;

&lt;h2&gt;What is EDA?&lt;/h2&gt;

&lt;p&gt;I feel that defining EDA will help provide some insight into the actual process itself.&lt;br&gt;
Wikipedia defines EDA as &lt;em&gt;“an approach to analyzing data sets to summarize their main characteristics, often with visual methods.”&lt;/em&gt; &lt;a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis" rel="noopener noreferrer"&gt;Wikipedia Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The National Institute of Standards and Technology says &lt;em&gt;“Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;maximize insight into a data set;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;uncover underlying structure;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;extract important variables;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;detect outliers and anomalies;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;test underlying assumptions;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;develop parsimonious models; and&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;determine optimal factor settings.”&lt;/em&gt;
(Information from &lt;a href="https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm" rel="noopener noreferrer"&gt;this NIST webpage&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My personal working definition is &lt;em&gt;“the art and science of reviewing a set of data, often using visualizations, with an eye toward answering a question or solving a problem.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My definition differs slightly from those above it because, generally speaking, you won’t be doing EDA on a data set without some motivating factor. That motivating factor would generally be a question you want to answer, or a problem you’re trying to solve using that data.&lt;/p&gt;

&lt;p&gt;I also feel that there is some artistry involved in doing EDA well. It’s true that there are many techniques that can be used and a general process that can be followed (the “science” in my definition). However, each question or problem and each set of data is different. Knowing how to apply the techniques and the process to your unique situation is where the “art” comes in.&lt;/p&gt;

&lt;p&gt;There are some other things I’d like to mention that bear keeping in mind when performing EDA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Foremost, focus on understanding the data. It’s easy to get lost in detailed stats and pretty visualizations, but your main goal is to get a feel for the data and how you might use it for your purpose.&lt;/li&gt;
&lt;li&gt;EDA can (and really should) provide insights or guidance regarding data cleaning, feature engineering, and model selection, but is technically separate from those things. Make notes or even changes, but at this stage you shouldn’t get sidetracked too much from gaining an understanding of the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Okay, Now How Do I Do It?&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdo12ttoo0emm48mpgepd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdo12ttoo0emm48mpgepd.jpg" alt="picture of a laptop on a desk"&gt;&lt;/a&gt;&lt;br&gt;
Photo by Glenn Carstens-Peters on Unsplash&lt;/p&gt;

&lt;p&gt;A good way to start is by answering some general questions. The answers may lead to insights, or they may lead to more focused questions. More questions usually mean more answers and (one hopes) better understanding.&lt;/p&gt;

&lt;p&gt;Common questions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do I have any applicable domain knowledge? What do I know that might help me interpret this data?&lt;/li&gt;
&lt;li&gt;How many columns/features are there? What kind of data is in those columns/features?&lt;/li&gt;
&lt;li&gt;Are all of the columns/features useful for answering my question or solving my problem?&lt;/li&gt;
&lt;li&gt;Is there data missing? Duplicate data? How will I deal with those issues?&lt;/li&gt;
&lt;li&gt;How are the values distributed? Are there outliers? How will I deal with them?&lt;/li&gt;
&lt;li&gt;What relationships do I see in the data? Do any of these relationships seem to point toward an answer or solution to my question or problem?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a lot of tools you can use to answer these questions. Popular summary statistics of numerical values include mean or median, minimum and maximum values, and the first and third quartiles. Commonly used visualizations include histograms, box and whisker plots, line charts, and scatter plots. There are many other tools and methods that can be applied as well, and listing or describing them all is beyond the scope of this article.&lt;/p&gt;

&lt;h2&gt;An Example Of EDA In Action&lt;/h2&gt;

&lt;p&gt;In this example, we’ll be looking at the Students Performance in Exams dataset from Kaggle, &lt;a href="https://www.kaggle.com/spscientist/students-performance-in-exams#StudentsPerformance.csv" rel="noopener noreferrer"&gt;located here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this example, I’ll mainly be using the Pandas and Seaborn libraries for Python.&lt;/p&gt;

&lt;h3&gt;Getting A Grip On The Data&lt;/h3&gt;

&lt;h4&gt;First, the General Stuff&lt;/h4&gt;

&lt;p&gt;After reading the CSV file into a Pandas DataFrame, the first thing I did was to look at a handful of records using the “sample” method in Pandas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqwe5xusq3tdz4b3upj3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqwe5xusq3tdz4b3upj3x.png" alt="screenshot of Pandas sample function output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives me an outline of the sort of data contained in this dataset.&lt;/p&gt;

&lt;p&gt;Next, I used the "info" method to have a look at datatypes and the count of non-null entries in each column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnkxieqzdpe54g2slfwxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnkxieqzdpe54g2slfwxj.png" alt="screenshot of Pandas info function output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this, I see that there are 5 categorical data columns and 3 numerical data columns. I also see that there are no null values, although that doesn't preclude placeholder values.&lt;/p&gt;
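&lt;p&gt;For reference, those first-look calls might be sketched like this. The tiny inline frame stands in for the Kaggle CSV (assumed filename shown in the comment) so the snippet is self-contained.&lt;/p&gt;

```python
import pandas as pd

# In the post itself: df = pd.read_csv("StudentsPerformance.csv")
# A tiny stand-in frame with the same kinds of columns:
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 47, 90, 0],
    "reading score": [74, 57, 95, 17],
    "writing score": [70, 44, 93, 10],
})

df.sample(3)   # peek at a few random rows
df.info()      # dtypes and non-null counts per column
df.describe()  # summary stats for the numeric columns
```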

&lt;h4&gt;A Brief Look At The Numbers&lt;/h4&gt;

&lt;p&gt;I wanted a summary of the numerical columns, so I used the "describe" method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F97ruyfnq6wz9u3k2fcv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F97ruyfnq6wz9u3k2fcv8.png" alt="screenshot of Pandas describe function output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main things that stand out to me here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The minimum math test score was 0, which I want to look into. How many 0s are there? Could 0 be a placeholder?&lt;/li&gt;
&lt;li&gt;The standard deviations are fairly large, so the test scores are quite spread out.&lt;/li&gt;
&lt;li&gt;The minimums sit far below the first quartiles, nearly three standard deviations lower, which suggests outliers on the low end.&lt;/li&gt;
&lt;/ul&gt;
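&lt;p&gt;The zero-score question from the first bullet can be checked directly. The scores below are hypothetical; only the technique matters:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical math scores containing a suspicious 0.
scores = pd.Series([0, 47, 57, 69, 72, 90], name="math score")

summary = scores.describe()
print(summary["min"])            # 0.0 -- worth investigating
print(int((scores == 0).sum()))  # how many zeros: placeholder or real?
```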

&lt;h4&gt;Visualizations Are Useful&lt;/h4&gt;

&lt;p&gt;This is a good point to briefly talk about the use of visualizations in EDA. Humans in general are visual creatures, and we're pretty good at pattern recognition. Visualizations take advantage of those broad tendencies by providing us with pictures we can spot patterns in.&lt;/p&gt;

&lt;p&gt;Here's an example of this idea. Taking a quick boxplot of the numerical columns visually shows that there are indeed outliers on the low end of the test scores. Those outliers will need to be dealt with, so I make a note to myself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6mue2c4lzsiux0e10t9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6mue2c4lzsiux0e10t9i.png" alt="screenshot of a boxplot of the numerical columns"&gt;&lt;/a&gt;&lt;/p&gt;
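&lt;p&gt;The points a boxplot whisker flags can also be computed with the standard 1.5 * IQR rule. A minimal sketch on hypothetical scores:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical math scores with one low outlier.
scores = pd.Series([3, 47, 57, 66, 69, 72, 75, 90])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr            # matplotlib's default whisker bound
outliers = scores[scores.lt(lower)]
print(outliers.tolist())          # [3]
```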

&lt;p&gt;Next I looked at the distributions of the test scores, using histograms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxzugy95dbz590lecjy8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxzugy95dbz590lecjy8x.png" alt="screenshot of histograms of the test scores"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These distributions look fairly normal, but they are skewed to the left. Many models perform better when numerical features are close to normally distributed, so while this is probably fine, we may want to do some feature engineering on these later to reduce the skew. I make another note to myself.&lt;/p&gt;
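&lt;p&gt;The skew can be quantified rather than eyeballed. A sketch with made-up scores, where a negative skew coefficient corresponds to the left skew seen in the histograms:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical scores: a low straggler drags the tail to the left.
scores = pd.Series([35, 60, 65, 70, 72, 75, 78, 80])

skewness = scores.skew()
print(skewness)  # negative value confirms a left (negative) skew
```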

&lt;h4&gt;On To The Categorical Data&lt;/h4&gt;

&lt;p&gt;Now that I have a feel for the numerical data, it's time to have a look at the categorical data.&lt;/p&gt;

&lt;p&gt;First off, again taking advantage of visual representations, I did some quick plots of the counts of each value in the categorical columns.&lt;/p&gt;

&lt;p&gt;The first three plots we'll look at are pretty basic, showing gender, lunch type, and test preparation course completion counts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5anqjxgcxknvig045rhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5anqjxgcxknvig045rhz.png" alt="screenshot showing gender counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk5nruxs2va15avk993ly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk5nruxs2va15avk993ly.png" alt="screenshot showing lunch type counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fho43ee48n5h1frrog03t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fho43ee48n5h1frrog03t.png" alt="screenshot showing test preparation counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From these first three plots, we can easily see that these are basically binary values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;male vs female&lt;/li&gt;
&lt;li&gt;standard lunch vs free/reduced cost lunch&lt;/li&gt;
&lt;li&gt;completed vs didn't complete a test preparation course.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;both genders are fairly evenly represented&lt;/li&gt;
&lt;li&gt;less than half of the students receive free/reduced cost lunches&lt;/li&gt;
&lt;li&gt;less than half of the students completed a test preparation course.&lt;/li&gt;
&lt;/ul&gt;
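&lt;p&gt;Counts like these don't require a plot; value_counts gives the same proportions numerically. The lunch values below are invented to illustrate:&lt;/p&gt;

```python
import pandas as pd

# Invented lunch column: 13 standard, 7 free/reduced.
lunch = pd.Series(["standard"] * 13 + ["free/reduced"] * 7)

shares = lunch.value_counts(normalize=True)
print(shares["free/reduced"])  # 0.35 -- under half get free/reduced lunch
```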

&lt;p&gt;There are two more categorical plots to look at, each carrying a bit more information than the first three. The next plot looks at the education level of the students' parents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy4aghic9xog2noqr4ov1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy4aghic9xog2noqr4ov1.png" alt="screenshot showing parental education level counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some quick insights we can get from this plot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;About 18% of the parents in this sample didn't complete high school&lt;/li&gt;
&lt;li&gt;The majority of the parents had at least some college education&lt;/li&gt;
&lt;li&gt;The smallest education group were the parents with master's degrees&lt;/li&gt;
&lt;li&gt;There may be value in combining the college-level categories, so I make a note of that.&lt;/li&gt;
&lt;/ul&gt;
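&lt;p&gt;If I later act on that last note, combining the college-level categories could be as simple as a replacement mapping. The labels here are my assumption of how the categories are spelled:&lt;/p&gt;

```python
import pandas as pd

# Assumed education labels, one row per parent.
education = pd.Series([
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
])

# Collapse the three college-adjacent levels into one bucket.
mapping = {"some college": "college", "associate's degree": "college",
           "bachelor's degree": "college"}
combined = education.replace(mapping)
print(combined.value_counts()["college"])  # 3
```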

&lt;p&gt;The last plot gives us information about the race/ethnicity groups the students belong to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fluzakrnc54fqm3l527o3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fluzakrnc54fqm3l527o3.png" alt="screenshot showing race/ethnicity group counts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a glance, this plot shows us that this category is a bit skewed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over 30% of the students belong to Group C&lt;/li&gt;
&lt;li&gt;Less than 10% of the students belong to Group A&lt;/li&gt;
&lt;li&gt;More than 25% belong to Group D&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that two race/ethnicity groups account for more than half of the students in this data set.&lt;/p&gt;

&lt;h4&gt;Look For Relationships Or Correlations&lt;/h4&gt;

&lt;p&gt;Pearson correlations can only be calculated for numerical data, so this table is pretty small.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffs8qsab2owkhl2ndlm1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffs8qsab2owkhl2ndlm1r.png" alt="screenshot showing Pearson correlations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reading and writing test scores are highly correlated, which I would expect. Math test scores have more correlation with reading test scores than with writing test scores, but it's a small difference.&lt;/p&gt;
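&lt;p&gt;For reference, the correlation table comes from a one-liner. The scores below are made up, but the call is the same:&lt;/p&gt;

```python
import pandas as pd

# Made-up score columns that loosely track one another.
df = pd.DataFrame({
    "math score":    [47, 57, 69, 72, 90],
    "reading score": [57, 61, 90, 72, 95],
    "writing score": [44, 52, 88, 74, 93],
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.loc["reading score", "writing score"])  # strongly positive
```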

&lt;p&gt;We can't calculate correlation for the categorical data, but we can still look for relationships using tools like Seaborn countplots or Pandas cross-tabulations. In the interest of brevity, I'm only going to show a couple of examples, but in a real scenario I would do this for all of the categorical columns.&lt;/p&gt;

&lt;p&gt;The first example shows the distribution of math scores grouped by the parents' education level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F81d9h26zxgnkgg0o33ul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F81d9h26zxgnkgg0o33ul.png" alt="screenshot of math scores grouped by parent education levels"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at this plot, we can see that the proportion of higher math test scores generally grows as the education level of the students' parents increases. This may be a factor to consider when we begin creating a model.&lt;/p&gt;
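&lt;p&gt;A Pandas cross-tabulation can back up that visual impression. In this sketch I bin hypothetical math scores and cross them against assumed education labels:&lt;/p&gt;

```python
import pandas as pd

# Invented rows; labels and scores are assumptions for illustration.
df = pd.DataFrame({
    "parental level of education": ["high school", "high school",
                                    "bachelor's degree", "bachelor's degree"],
    "math score": [55, 62, 71, 88],
})

# Bin the scores, then cross-tabulate against education level.
bins = pd.cut(df["math score"], bins=[0, 60, 80, 100],
              labels=["low", "mid", "high"])
table = pd.crosstab(df["parental level of education"], bins)
print(table)
```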

&lt;p&gt;The second example shows the median of the reading test scores of the standard lunch group vs the free/reduced lunch group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffism8iqpl9t7drtv1l2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffism8iqpl9t7drtv1l2g.png" alt="graph of median reading test scores by lunch group"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the median reading test score of the free/reduced lunch group is lower than that of the standard lunch group. My tentative conclusion is that the economic circumstances of the free/reduced lunch group may negatively affect their ability to study and their performance on tests.&lt;/p&gt;
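&lt;p&gt;The underlying comparison is a simple groupby, sketched here on invented rows:&lt;/p&gt;

```python
import pandas as pd

# Invented rows mirroring the lunch/reading-score comparison.
df = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "reading score": [72, 90, 57, 65],
})

medians = df.groupby("lunch")["reading score"].median()
print(medians.to_dict())  # free/reduced median sits below standard
```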

&lt;h3&gt;Closing Thoughts&lt;/h3&gt;

&lt;p&gt;EDA is an important step in the data modeling process.&lt;/p&gt;

&lt;p&gt;Once you have a general understanding of the data you're working with, you will usually have one or more ideas for features you want to add or remove. After you get started working on those ideas, you will probably find that you have more questions that require more analysis of your data.&lt;/p&gt;

&lt;p&gt;This is when you realize that EDA isn't something you'll only do at the beginning of a project.&lt;/p&gt;

&lt;p&gt;Your first pass was just the beginning!&lt;/p&gt;

&lt;p&gt;In a sense, you're never really done with the EDA for a project until you're done with the project itself.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>I Am Not A Unicorn (Nor A Wizard): or, Why I Chose A Data Science Bootcamp</title>
      <dc:creator>dwalkup</dc:creator>
      <pubDate>Mon, 02 Dec 2019 22:42:18 +0000</pubDate>
      <link>https://dev.to/dwalkup/i-am-not-a-unicorn-nor-a-wizard-or-why-i-chose-a-data-science-bootcamp-29ee</link>
      <guid>https://dev.to/dwalkup/i-am-not-a-unicorn-nor-a-wizard-or-why-i-chose-a-data-science-bootcamp-29ee</guid>
      <description>&lt;p&gt;Even a casual look around the web at what makes a good data scientist will reveal that many feel that a great data scientist is a unicorn: an entity that blends a unique and disparate set of skills and traits to perform what can appear to be magic with really big sets of data. Somewhat to my surprise, it turns out that I have aspirations toward wizardry.&lt;/p&gt;

&lt;h1&gt;How I Got Here&lt;/h1&gt;

&lt;p&gt;I recently moved to Houston, TX from Wasilla, AK. In Alaska, I worked in the telecommunications industry for nearly 20 years, from February 2000 until July 2018. We moved because my wife, who works in the oil and gas industry, took another job that was based in Houston.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/n9B5NNaPw25qg/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/n9B5NNaPw25qg/giphy.gif" alt="Moose Fighting In A Driveway"&gt;&lt;/a&gt;Moose fighting from Giphy.com&lt;br&gt;
In the course of looking for employment in my new city, I came to realize that I was dissatisfied with my job. It wasn’t a bad job. In fact, I had previously enjoyed it and it did feed my need to be constantly learning. I just felt that I needed a change from poring over packet captures and digging through signaling protocols.&lt;/p&gt;

&lt;p&gt;I have always had an interest in programming. Toward the end of my time in Alaska, I had taught myself enough Python to script some job-aid tools to streamline some tasks. I found that I really enjoyed this. It was apparent to others, as well. One day, after I solved a problem in a text-parsing script I had written, one of my co-workers told me “You know, Dave, maybe you should go back to school for that. You seem to really enjoy it.”&lt;/p&gt;

&lt;p&gt;Fast forward to the summer of 2019. I’m contemplating yet more job listings on Indeed and I get to thinking about my co-worker’s comment. I had previously seen advertisements for coding bootcamps but now I started looking at them in earnest. In the course of that investigation, I ran across a data science program.&lt;/p&gt;

&lt;p&gt;Reading about this program excited me far more than the coding program had. Tempering my excitement, I looked into the data science field to ensure it was a good fit for me.&lt;/p&gt;

&lt;h1&gt;What I Determined&lt;/h1&gt;

&lt;p&gt;There are many articles on what makes a great data scientist. None of them align perfectly, though there is some general overlap. Some of the points of overlap I found were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curiosity&lt;/li&gt;
&lt;li&gt;A love of learning&lt;/li&gt;
&lt;li&gt;Analytical thinking, especially with regard to solving problems&lt;/li&gt;
&lt;li&gt;Tenacity or determination&lt;/li&gt;
&lt;li&gt;Diverse technical skills&lt;/li&gt;
&lt;li&gt;Education, especially in statistics and math&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have the curiosity, tenacity, love of learning, and problem solving mindset. I lack the education and some of the technical skills, but I can learn the material and develop those skills. Importantly, I was still interested in the field after my preliminary research, so I decided to move forward.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sAXxIrnv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/53ehot4jzdhsmoqn2v0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sAXxIrnv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/53ehot4jzdhsmoqn2v0m.png" alt="Pretty Unicorn Picture"&gt;&lt;/a&gt;Unicorn from NicePNG.com&lt;br&gt;
I am not a unicorn, nor a wizard, but someday soon, I’m going to make that transformation.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>bootcamp</category>
    </item>
  </channel>
</rss>
