DEV Community: David Kaspar

Let's Look at Images While Talking About Other Images

David Kaspar — Wed, 12 Jun 2019 21:43:18 +0000

A Quick Summary of Microsoft's Common Objects in Context Image Database

It's been said that data scientists spend 60% of their time cleaning data and 19% of their time building data sets. This leaves only 21% of a data scientist's time to allocate toward other tasks, such as:

building models
iteratively improving those models
going to meetings that should have been emails

Many of the most exciting parts about machine learning (and Data Science in general) are stuck behind a lot of fairly boring and mundane tasks.

In 2015, a paper was released that describes "over 70,000 worker hours" spent building an enormous data set of labeled photographs. The goal of this data set was to provide training data for machine learning models to help with "scene understanding". At the time, multiple image data sets already existed (which I'll summarize shortly) that focused on both a high variety of objects and many instances per object type. One major shortcoming that Microsoft wanted to address, however, was that many of those objects and scenes were isolated in the image in such a way that they would almost never be seen in the context of a normal day. Here's an example from the paper.

The 4-cluster of images on the left is similar to what was the "industry standard" at the time for object detection, while the middle cluster shows iconic scenes: a yard, a living room, and a street, taken from the SUN database (Large-scale scene recognition from abbey to zoo). The problem is that many of these images are too idealized, as if they were taken by a professional photographer for a catalog or a real estate listing. What the researchers from Microsoft were more interested in, though, is the non-iconic imagery displayed in the cluster toward the right side. For computer vision to be effective, we need training data that closely matches the types of tasks we want to predict on in the future. Since that data set didn't exist, a few bright minds at Microsoft decided to build it.

Now that you know their motivations, I'll talk about their methods, but first, let's go into a bit more depth on the state of large computer vision datasets at the time this paper was released.

Data Set Name & Link	# of Images	# of Labeled Categories	Year Released	Pros	Cons
ImageNet	14,197,122	1000	2009	Huge number of categories & images	Subjects are mostly isolated, out of normal context
Sun Database	131,067	397	2010	High 'image complexity' with multiple subjects per photo	Fewer overall instances for any one category
PASCAL VOC Dataset	500,000	20	2012	Fairly high instances per category	Low 'image complexity', and relatively few categories
MS COCO	328,000 (with 2,500,000 labeled instances)	91	2015	Good 'image complexity', high # of instances per category	The # of categories could be higher, but overall not a lot of downsides at the time it was released.

Side note: Most of the images in this blog post are copied straight from the academic paper linked above. If you want to know the details of their methodology, I encourage you to read the paper itself. It is well written and fairly short, only about 9 pages of text (including the appendix), with lots of insightful pictures and graphs. Here is their informationally dense comparison of the 4 large image datasets. My favorite graph here is the scatterplot in the lower left. Note the log scale on both X & Y axes.

I hope you enjoyed nerding out over these graphs as much as I did. Now that we're all in a good head-space, let's dive into the immense planning and effort that went into creating MS COCO. For brevity, I will do this in an outline.

How It Was Done

1) Realize in what way the current data sets were lacking, and inspire someone to commit the resources to address those shortages

Low image complexity
Labels are bounding boxes only, not pixel-level shading of each object

2) Find pictures of multiple objects / scenes in an "everyday" context, instead of staged scenes

Flickr contains photos uploaded by amateur photographers with searchable metadata and keywords
Searches for multiple objects at the same time instead of just a single word / topic
- cat wine glass is better than just cat or wine glass

3) Annotate Images by hand: Amazon's Mechanical Turk (AMT) is a platform to crowdsource work, but care must be taken to maintain high standards

A) Label Categories -- Is there a ____________ anywhere in this picture?

repeated 8 times per question with different AMT workers, to increase Recall. The chance that all 8 workers "get it wrong" is very low.

B) Instance Spotting -- There is already a ____________ in this picture, place a marker over every discrete instance of it that you see.

also repeated 8 times per question with different AMT workers.

C) Instance Segmentation -- Shade the photo where each individual ____________ takes place (on average, any single photo has 7.7 instances that need to be shaded)

Design an interface that will allow workers to shade the specific pixels where an object is in the image.
Segmenting 2,500,000 object instances is an extremely time-consuming task, requiring over 22 worker hours per 1,000 segmentations (around 80 sec per instance).
- For this reason, only a single worker will segment each instance, but high training and quality control is required.
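
To put the numbers in this outline in perspective, here is a back-of-the-envelope sketch in Python. The instance count and the hours per 1,000 segmentations come from the outline above. The 50% per-worker miss rate is purely an illustrative assumption (it is not a number from the paper), and so is the assumption that workers make their mistakes independently of each other.

```python
# Back-of-the-envelope arithmetic for the annotation effort described above.

INSTANCES = 2_500_000   # object instances to segment (from the outline)
HOURS_PER_1000 = 22     # worker hours per 1,000 segmentations (from the outline)

seconds_per_instance = HOURS_PER_1000 * 3600 / 1000
segmentation_hours = INSTANCES / 1000 * HOURS_PER_1000

# ASSUMPTION (illustrative only): each worker independently misses a category
# that really is in the picture half of the time.
ASSUMED_MISS_RATE = 0.5
WORKERS = 8
all_miss = ASSUMED_MISS_RATE ** WORKERS

print(f"{seconds_per_instance:.1f} seconds per instance")
print(f"{segmentation_hours:,.0f} worker hours for segmentation alone")
print(f"chance all {WORKERS} workers miss: {all_miss:.4f}")
```

Even with workers that careless, the chance that all 8 miss the same category stays well under 1%. At the quoted rate, segmentation alone takes up most of the "over 70,000 worker hours", which helps explain why only a single worker segments each instance.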

Before we look at the outcomes, I want to remind you of Werlindo's Saturation Phenomenon:

1) Exhibit A is super novel and cool and ppl love it

2) Everyone copies the effect of Exhibit A to the point that it's overused and perceived as tacky

3) Someone who wasn't around / aware of Exhibit A sees it for the first time, and thinks "that looks tacky" not realizing they are looking at something kind of amazing that revolutionized an experience.

I first heard of this with regard to 'Bullet Time' from The Matrix

vs the 'low budget' version.

I bring this up, because even though we've seen a lot of cool image recognition and shading demos over the last few years, to the point of thinking they are normal, none of this would be possible without the rather boring task of building a massive, intricately labeled, natural scene image data set.

Thanks, Microsoft COCO.

I was going to add even more images to this blog showing the output of this project, but instead, you should probably just play around with the dataset yourself. There are even pre-trained models that work "out of the box" to leverage all of this hard work, with almost no added input from you.

Something is growing, and it's growing very fast, but how fast?

David Kaspar — Tue, 30 Apr 2019 16:31:53 +0000

Last week, I talked about engineering new features from columns you already have (I used Haversine's Formula as an example). This week, I'd like to discuss using log-transformations to test whether the underlying pattern in the data is caused by exponential behavior: y=A*B^x or perhaps a power model: y=A*x^n. NOTE: This is not about using log-transformations to assist with normalizing the data; for that, you'll need to look here or here. This topic is near and dear to my heart, because it was how I met my very first tutoring client, which cascaded into my own business over ten years. I was working at a school, and one of my fellow teachers (she specialized in Chemistry & Bio, but often helped with math at the high school level) was doing some tutoring on the side, and she ran into a group project that was focusing on exactly this topic. She wasn't feeling very confident, and knew I wasn't doing anything that night, because we had just talked about it earlier that day. So she called me, asking for help, and she paid me 100% of what she made from the family that hour. Later on, she moved away, and "gifted" me with a referral to work with that same family.

Fast-forward 10 years, I'm no longer a tutor, but I have decided to take my mathematical expertise and pivot into computers, specifically, data science. Let's tie it all together. Imagine we've done some basic analysis on a few columns in pandas, and we're trying to decide on a column-by-column basis, which growth pattern is in play. Let's start with the necessary imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns    # Sidenote: if you haven't tried using the Seaborn library, it's wonderful! I highly recommend that you give it a try.
from scipy import stats
from exp_or_poly import mystery_scatter  # Secret function I wrote, don't cheat and look!
%matplotlib inline

Here is some imaginary data. This function returns a pandas dataframe which we will save as df

df = mystery_scatter(n=500, random=False)
df.describe()

	x	mystery_1	mystery_2	mystery_3	linear
count	500.000000	500.000000	500.000000	5.000000e+02	500.000000
mean	51.595384	31372.948633	30654.424502	1.098368e+07	34111.441777
std	26.416814	25954.487664	22727.902144	8.786530e+06	22607.290150
min	5.019101	2622.068494	5451.400236	9.498159e+04	-7483.112617
25%	29.494955	8308.838700	11995.912686	3.070432e+06	14698.430889
50%	51.622231	24277.069057	23841.307039	9.079029e+06	34205.683607
75%	73.700855	49624.805802	43976.822005	1.776408e+07	52997.927645
max	98.925990	95455.331996	94890.894987	3.056593e+07	75247.568169

Mysteries #1, #2, and #3 all appear quite similar, and of course #4 certainly appears linear. We could just run a bunch of different regression models on columns 1, 2, and 3, and keep track of the R^2 values, then just pick the best. In theory, this should work (and with computing power being so cheap, it's certainly not a bad idea), but I'd like to give us a more precise way to decide, instead of the "shotgun" approach where you just try all possible models. The mathematics for why the following strategy works is rather elegant, as it involves taking complicated equations and turning them into the simple linear form: y=m*x+b. We won't be delving into the algebra here, but you can go here instead. For now, we'll just practice how to make it all happen in Python.

The basic idea is as follows: when looking at data that curves upward as the x variable increases, we can plot two different scatter plots, and test each of them for linearity. Whichever scatterplot is closest to a straight line will tell us which underlying growth pattern is on display. If log(x) vs log(y) becomes linear, then the underlying pattern came from power model growth, but if x vs log(y) gives the more linear scatterplot, then the original data was probably from an exponential model.
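
If you'd rather back up the eyeball test with a number, one option is to compute the correlation coefficient for each transformed scatterplot and see which is closer to 1. Below is a minimal sketch of that idea in plain Python. The helper names pearson_r and linearity_scores are made up for this example, and the data is noise-free and invented for illustration; it is not the mystery data from this post.

```python
from math import log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

def linearity_scores(xs, ys):
    """Return (r for the exponential test, r for the power test).
    Every x and y must be positive, because we take logs of both."""
    log_x = [log(v) for v in xs]
    log_y = [log(v) for v in ys]
    return pearson_r(xs, log_y), pearson_r(log_x, log_y)

# Invented, noise-free data for illustration only.
xs = [float(v) for v in range(5, 101, 5)]
exponential = [5000 * 1.03 ** v for v in xs]
power = [5000 * v ** 1.9 for v in xs]

exp_r, pwr_r = linearity_scores(xs, exponential)
print(f"exponential data: exp test r={exp_r:.4f}, pwr test r={pwr_r:.4f}")

exp_r, pwr_r = linearity_scores(xs, power)
print(f"power data:       exp test r={exp_r:.4f}, pwr test r={pwr_r:.4f}")
```

Whichever test gives the r closer to 1 points to the more likely growth pattern. With noisy real data the two values can be close, so treat this as a companion to the plots rather than a replacement for them.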

First, let's restrict our dataframe to only the curved lines:

curved_data = df.drop('linear', axis=1)

Now, we'll build a function that can display the graphs side-by-side, for comparison.

def linearity_test(df, x=0):
    '''Take in a dataframe, and the index for data along the x-axis. Then, for each column, 
       display a scatterplot of the (x, log(y)) and also (log(x), log(y))'''
    df_x = df.iloc[:, x].copy()
    df_y = df.drop(df.columns[x], axis=1)

    for col in df_y.columns:
        plt.figure(figsize=(18, 9))

        # Is it exponential? Plot x against log(y).
        plt.subplot(1, 2, 1)
        plt.title('Exp Test: (X , logY)')
        # Newer seaborn versions require x and y as keyword arguments.
        sns.regplot(x=df_x, y=np.log(df_y[col]))

        # Is it power? Plot log(x) against log(y).
        plt.subplot(1, 2, 2)
        plt.title('Pwr Test: (logX , logY)')
        sns.regplot(x=np.log(df_x), y=np.log(df_y[col]))

        plt.suptitle(f'Which Model is Best for {col}?')
        # Save before plt.show(); saving afterwards can write out a blank figure.
        plt.savefig(col + '.png')
        plt.show()
        plt.close()
        print('')

linearity_test(curved_data)

Well, mystery_1 seems a bit inconclusive, so we'll come back to that later. mystery_2 certainly seems to have a linear relationship on the Exp Test, but NOT on the Pwr Test, which means the growth pattern for that column was caused by exponential growth. mystery_3 is the opposite: it is very obviously linear for the Pwr Test, but not for the Exp Test. Let's peek under the hood and take a look at the function I used to build this data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def mystery_scatter(n=100, var=200, random=False):

    '''Randomly generate four sets of data: one quadratic, one exponential,
       one power, and one linear.
       Plot all sets of data side-by-side, then return a pandas dataframe'''

    # Use a fixed seed unless the caller asks for fresh random data
    if not random:
        np.random.seed(7182818)

    # Generate some data, not much rhyme or reason to these numbers, just wanted everything
    # to be on the same scale and the curved datasets should be hard to tell apart

    x = np.random.uniform(5, 100, n)
    y1 = 11.45 * x ** 2 - 234.15 * x + 5000 + np.random.normal(0, var, n) * np.sqrt(x)
    y2 = 5000 * 1.03 ** x + np.random.normal(0, var, n) * np.sqrt(x)
    y3 = 5000 * x ** 1.9 + np.random.normal(0, 20 * var, n) * x
    y4 = 856.16 * x - 10107.3 + np.random.normal(0, 6 * var, n)


    # Graph the plots (newer seaborn versions require x and y as keyword arguments)

    plt.figure(figsize=(14, 14))

    plt.subplot(2,2,1)
    sns.scatterplot(x=x, y=y1)
    plt.title('Mystery #1')

    plt.subplot(2,2,2)
    sns.scatterplot(x=x, y=y2)
    plt.title('Mystery #2')

    plt.subplot(2,2,3)
    sns.scatterplot(x=x, y=y3)
    plt.title('Mystery #3')

    plt.subplot(2,2,4)
    sns.scatterplot(x=x, y=y4)
    plt.title('Linear')

    plt.suptitle('Mystery Challenge: \nOne of these is Exponential, one is Power, \nand the other is Quadratic, but which is which?')
    plt.show()
    plt.close()

    df = pd.DataFrame([x, y1, y2, y3, y4]).T
    df.columns = ['x', 'mystery_1', 'mystery_2', 'mystery_3', 'linear']
    df.sort_values('x', inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df

In particular, look at how the data was generated:

x = np.random.uniform(5, 100, n)
y1 = 11.45 * x ** 2 - 234.15 * x + 5000 + np.random.normal(0, var, n) * np.sqrt(x)
y2 = 5000 * 1.03 ** x + np.random.normal(0, var, n) * np.sqrt(x)
y3 = 5000 * x ** 1.9 + np.random.normal(0, 20 * var, n) * x
y4 = 856.16 * x - 10107.3 + np.random.normal(0, 6 * var, n)

So we were right about mystery_2 (which was the same as y2) being caused by exponential growth: it was y2 = 5000*1.03^x.

Meanwhile, mystery_1 was y1 = 11.45*x^2 - 234.15*x + 5000, definitely built from a quadratic equation, which is somewhat related to the Power Model, but not exactly the same. Technically, the log(x) vs log(y) test only forces linearity for Power Models. In order to find the exact solution for this one, we would rule out exponential growth and power growth first, then start applying the "reiterative common differences" strategy to determine the degree of the polynomial, but we won't be getting into that here.

Refocusing again on power growth: it looks like mystery_3 was generated with y3 = 5000*x^1.9 which definitely qualifies as a power model.
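
Once you know which test came out linear, the fitted straight line hands the model's parameters back. For a power model, log(y) = n*log(x) + log(A), so the slope is n and the intercept is log(A). For an exponential model, log(y) = x*log(B) + log(A), so the slope is log(B). Here is a minimal sketch in plain Python, using noise-free versions of y2 and y3 (the noise terms are left out on purpose, and fit_line is a helper written just for this example):

```python
from math import exp, log

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
             / sum((a - mean_x) ** 2 for a in xs))
    return slope, mean_y - slope * mean_x

xs = [float(v) for v in range(5, 101)]

# Power model, y = A * x^n: fit a line through (log x, log y).
power = [5000 * v ** 1.9 for v in xs]
slope, intercept = fit_line([log(v) for v in xs], [log(v) for v in power])
n_est, A_power = slope, exp(intercept)
print(f"power model:       n = {n_est:.3f}, A = {A_power:.1f}")

# Exponential model, y = A * B^x: fit a line through (x, log y).
exponential = [5000 * 1.03 ** v for v in xs]
slope, intercept = fit_line(xs, [log(v) for v in exponential])
B_est, A_exp = exp(slope), exp(intercept)
print(f"exponential model: B = {B_est:.3f}, A = {A_exp:.1f}")
```

On the real, noisy columns the recovered values will only be approximate, and fitting in log space weights the errors differently than fitting the original curve, so take the results as a starting point.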

P.S. If you're wondering what the extra np.random.normal(...) stuff at the back part of each of those lines is doing, it's just adding in some random noise so the scatter plots wouldn't be too "perfect".

Finding my way into Data Science

David Kaspar — Tue, 23 Apr 2019 22:17:36 +0000

My father likes to brag about his three sons, of which I am “the middlest”. When he gets to talking, and the conversation turns to “what the boys are up to?” he likes to describe my life this way:

David is doing great, he got married around ten years ago and runs his own tutoring business out in Seattle, WA. In fact, his “office” is at the dining room tables of families that are less than half a mile from Bill Gates’ house. He has helped students get accepted at Harvard and Yale! He only works about 25 hours a week, but it’s enough to own a house, and that’s even when he has to compete with all that Amazon and Microsoft money.

Well, Dad, I’m leaving that all behind, but I hope to give you a new story to catch the attention of those who vaguely knew me when I was a kid running around the Indiana countryside. About 6 weeks ago, I made the difficult choice to abandon my students part-way through the school year in order to join a 15 week bootcamp studying Data Science with Flatiron School. Upon doing so, I personally delivered a letter to each of the families I was working with at the time, regretfully sharing the news that I wouldn’t be able to continue as their tutor. Upon arriving at one house a week later, the father casually mentioned, “Oh, David, before you go, I wanted to talk with you about something.” As it turns out, after reading that letter, he had arranged for me to get a meeting with someone who has been at an international company for years, basically doing my dream job. I was thrilled! It was more of a meeting than an interview, and helped me understand what exactly I was signing myself up for.

During the first 20 mins of this meeting, we discussed the various ways to break into Data Science. In her experience there are three basic paths, none of which are better than the other, as they each bring their own important background tools necessary for the job. Below, I drew up a summary of those three paths:

On her personal team she mentioned that everyone has at least a master’s degree in something relevant, many have a PhD (I have neither). After briefly summarizing the syllabus for my 15 week program, she mentioned that someone who intensely studies these topics shouldn’t have any trouble finding a job in the Seattle market, but I should temper my expectations of jumping straight into a high level machine learning or AI position. This doesn’t mean I won’t try, but if it takes me a few years to work my way up, that’s completely fine by me. In my old job, there wasn’t much upward mobility anyway, so starting on the bottom rung of the ladder and being challenged by my future colleagues sounds great. Also . . . a computer job would very likely come with good health insurance, which, from the perspective of a sole-proprietor business owner, is just like the Swiss flag (a huge plus).

Before this meeting took place, I hopped on to reddit’s “MachineLearning” forum and asked for help:

I managed to get a face-to-face introduction with someone in the industry. I can't promise to deliver an answer (but I'll try), and I'm not a journalist (just a nerdy guy trying to switch careers), but I feel like I'm in a state of "I know that I don't know" and I'd like your help. What should I ask? If you had the ear of someone who's been working in big data and machine learning for years, what would you ask? (obviously nothing about getting "trade secrets" or anything else she clearly wouldn't answer. These have to be feasible questions.) Thanks!

Thanks to everyone who contributed questions, here is my report:

I'd be interested in hearing what their plans are for ML driven spatial / temporal upscaling in games and how far they expect this to go.

E. G. Is a 720p at 30hz original render going to give us 8k at 120hz in a decade.

She said it's absolutely something that is being worked on by others (not sure if she meant @ her company, or just generally, sorry).

I loved this question, because there are so many great games out there where the graphics have aged poorly, but the premise of the game is incredibly fun.

What is the most exciting area for ml in gaming? GANs? RL? CART?

This didn't come up specifically, however, she did share the premise of a talk she heard at a ML conference within the last year. It's about developing an AI for NPCs, but could be applied toward an AI that tries to beat the game from the player perspective. Essentially, the speaker developed a "risk / reward" function for each object on the screen. Strong monster close by = bad ; strong monster far away = not so bad ; treasure chest = good ; etc. all with number values, then the AI would follow a "path of best outcomes". I imagine a bunch of lines all shooting out from the controlled char to every other object on the screen, with varying degrees of red -> orange -> yellow -> green, and they adjust the "risk / reward" outcome based on movement and actions of the player.

Overall, it was a great meeting for me. I only minorly embarrassed myself three times, and could understand the basic idea of everything we discussed. About half the time was spent discussing how people from various backgrounds end up in the "data science" space, and whether it's feasible for me to make that leap. Kudos to her for quickly figuring out "my level" and pushing it slightly beyond in an interesting way. Every part of this blog is solely my interpretation of her opinion and in no way reflects the official stance of her company.

Engineering Location Features with Haversine's Formula for Prediction Modeling

David Kaspar — Fri, 19 Apr 2019 23:29:11 +0000

Here is the full git repo: Predicting King Co home prices. This blog post is just a snippet of one strategy we used.
Co-written with Natasha Kacoroski

So you'd like to build a prediction model from some data, but some of the columns seem a bit . . . useless? Alternatively, a column of data might accidentally imply meaning where there shouldn't be any. Let's try this out by engineering some columns with location data to help predict the price of a home using a small slice of the King County Housing Data set from Washington State. Let's explore how we can turn difficult columns into something useful instead of just throwing them away or using them incorrectly.

import pandas as pd

df = pd.read_csv('kc_house_data.csv')
df.head().T

	0	1	2	3	4
id	7129300520	6414100192	5631500400	2487200875	1954400510
date	10/13/2014	12/9/2014	2/25/2015	12/9/2014	2/18/2015
price	221900	538000	180000	604000	510000
bedrooms	3	3	2	4	3
bathrooms	1	2.25	1	3	2
sqft_living	1180	2570	770	1960	1680
sqft_lot	5650	7242	10000	5000	8080
floors	1	2	1	1	1
waterfront	NaN	0	0	0	0
view	0	0	0	0	0
condition	3	3	3	5	3
grade	7	7	6	7	8
sqft_above	1180	2170	770	1050	1680
sqft_basement	0.0	400.0	0.0	910.0	0.0
yr_built	1955	1951	1933	1965	1987
yr_renovated	0	1991	NaN	0	0
zipcode	98178	98125	98028	98136	98074
lat	47.5112	47.721	47.7379	47.5208	47.6168
long	-122.257	-122.319	-122.233	-122.393	-122.045
sqft_living15	1340	1690	2720	1360	1800
sqft_lot15	5650	7639	8062	5000	7503

What does each of these columns mean?

from IPython.display import display, Markdown

with open('column_names.md', 'r') as fh:
    content = fh.read()

display(Markdown(content))

COLUMN NAMES AND DESCRIPTIONS FOR KING COUNTY DATA SET

id - unique identifier for a house
date - Date house was sold
price - Price is prediction target
bedrooms - Number of Bedrooms/House
bathrooms - Number of bathrooms/bedrooms
sqft_living - square footage of the home
sqft_lot - square footage of the lot
floors - Total floors (levels) in house
waterfront - House which has a view to a waterfront
view - Has been viewed
condition - How good the condition is ( Overall )
grade - overall grade given to the housing unit, based on King County grading system
sqft_above - square footage of house apart from basement
sqft_basement - square footage of the basement
yr_built - Built Year
yr_renovated - Year when house was renovated
zipcode - zip
lat - Latitude coordinate
long - Longitude coordinate
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

Many of these columns seem like they should have some kind of an influence on the price. In particular, the common saying "the three most important things to think about when buying a home are location, location, location" tells us we should probably care a lot about things like zipcode, latitude, and longitude. Also, just to note, 'sqft_living15' and 'sqft_lot15' are metadata about each house in the database that someone else has already engineered for us. Thanks, King Co. data scientist!

There are plenty of columns we could use, but for this tutorial with Haversine, let's just focus on latitude & longitude. Let's get a visual of what this data looks like.

target = df['price'].copy()
needs_work = df[['long', 'lat']].copy()

import seaborn as sns
import matplotlib.pyplot as plt

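for col in needs_work:
    x = needs_work[col]
    y = target

    plt.figure(figsize=(16,8))
    plt.subplot(1,2,1)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(x, kde=True)
    plt.subplot(1,2,2)

    # Newer seaborn versions require x and y as keyword arguments
    sns.scatterplot(x=x, y=y)
    plt.suptitle(col)
    plt.tight_layout()
    plt.savefig(col + '.png')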

Even though these pairs of graphs look similar, notice the y-axes on these plots are measuring different things. The histogram is simply counting how many observations of the x-value we have in the data, while the scatterplot is showing the price in USD of the house sold at the x-value. It's almost like more houses sold (histogram height) correlates rather strongly with a higher sale price (scatterplot height). The Law of Supply & Demand seems alive and well in King County, WA. We'll discuss how to deal with this data in regard to predicting sale price using Haversine's 3D distance formula.

Before we get into the details of the formula, let's talk about why we should consider augmenting the location data at all. The problem here is that ONLY looking at one-dimensional movement on a map doesn't particularly generate accurate housing price patterns. Another way of saying this is that there is no linear relationship between latitude and price, nor between longitude and price. For example, if there are multiple wealthy neighborhoods that have lower-priced neighborhoods in between them, using just that one dimension of data won't accurately predict the value. In the following image (shamelessly lifted from http://www.peteryu.ca/tutorials/matlab/image_in_3d_surface_plot_with_multiple_colormaps) imagine the Z axis isn't elevation, but rather home price. As you can see in the heatmap below the surface, there isn't a clear linear pattern when you move along JUST the X-direction or just the Y-direction, but proximity to the dark red spot near the back right part of the graph would have a clear pattern for predicting price.
(Note the red spot near the bottom left should actually be blue on the heatmap, as in this example, it represents very low house prices.)
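
To make that point concrete, here is a toy sketch in plain Python. Everything in it is invented for illustration: a small square grid of "houses", a single price hotspot at the center, and a price that drops off in a straight line with distance from that hotspot (flat-map distance here, not Haversine). The pearson_r helper is written just for this example.

```python
from math import hypot, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy city: an 11 x 11 grid of houses with one hotspot at (0, 0).
houses = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]

# ASSUMPTION (illustrative only): price falls linearly with distance to the hotspot.
dist = [hypot(x, y) for x, y in houses]
price = [1_000_000 - 100_000 * d for d in dist]

r_x = pearson_r([x for x, _ in houses], price)
r_y = pearson_r([y for _, y in houses], price)
r_dist = pearson_r(dist, price)

print(f"price vs x alone:    r = {r_x:+.3f}")
print(f"price vs y alone:    r = {r_y:+.3f}")
print(f"price vs distance:   r = {r_dist:+.3f}")
```

In this made-up city, neither coordinate on its own has any linear relationship with price, while the single engineered distance column lines up with it perfectly. Real housing data is far messier, but the motivation is the same.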

When modeling with multivariable linear regression, simply having both latitude and longitude doesn't really help either. Instead, let's think about why home values in some neighborhoods are high vs low. It could simply be the houses in those neighborhoods have a view of the mountains & water, but the only way any one home sells for a lot of money is if someone pays a lot of money for it. I know this seems simplistic, but high home values require high paying jobs, and people with high paying jobs will pay more to be conveniently close to their work (especially if the home has a view of the mountains or water). This leads me to believe a "radius from high-paying job cluster" might be a much better indicator of home value instead of using the two columns of 'longitude' and 'latitude' on their own. Let's build one! In fact, if there are multiple employment clusters, maybe building a single column for each "job hub" would make sense.

If you are willing to accept that we live on a round planet, we can utilize the Haversine formula, which measures arc length along the surface of a sphere. Loosely speaking, it plays the same role on a sphere that the Pythagorean theorem plays on a flat map.

We adapt the python code from: https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653

We choose to pass in two arguments, each in the form [longitude, latitude]. To do this, we first build a new column to serve as the input to the function.

needs_work.head(2)

	long	lat
0	-122.257	47.5112
1	-122.319	47.7210

needs_work['long_lat'] = list(zip(needs_work['long'], needs_work['lat']))
needs_work.head(2)

	long	lat	long_lat
0	-122.257	47.5112	(-122.257, 47.5112)
1	-122.319	47.7210	(-122.319, 47.721000000000004)

Now we define the Haversine function. Alternatively, you can simply import it from https://pypi.org/project/haversine/ but we want to make sure we can control exactly what is going on and it's fun to see trigonometry in action!

from math import radians, cos, sin, asin, sqrt

def haversine(list_long_lat, other=(-122.336283, 47.609395)):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees), in this case the 2nd point is in the 
    Pike Pine Retail Core of Seattle, WA.
    """
    lon1, lat1 = list_long_lat[0], list_long_lat[1]
    lon2, lat2 = other[0], other[1]
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km

We're finally ready to build our useful output column using series.apply() which yields the distance to Seattle.

needs_work['dist_to_seattle'] = needs_work['long_lat'].apply(haversine)

In fact, it would be very easy to build another column for a different "job cluster location" at this same point by passing in the optional argument for haversine() through a different series.apply(function, opt_arg=______). Seattle's "suburb" of Bellevue has actually grown into its own city with skyscrapers and high paying jobs of its own. Let's create the 'dist_to_bellevue' column as well.

needs_work['dist_to_bellevue'] = needs_work['long_lat'].apply(haversine, other=[-122.198985, 47.615577])
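
As a quick sanity check, the same formula can be run on plain tuples without pandas. This sketch re-declares the function so it stands on its own (under the name haversine_km, chosen just for this example) and uses only coordinates that already appear in this post: the first house in the data, plus the Seattle and Bellevue reference points.

```python
from math import radians, cos, sin, asin, sqrt

EARTH_RADIUS_KM = 6371

def haversine_km(point, other):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [point[0], point[1], other[0], other[1]])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

seattle = (-122.336283, 47.609395)    # default reference point used above
bellevue = (-122.198985, 47.615577)   # second "job hub" used above
first_house = (-122.257, 47.5112)     # row 0 of the housing data

print(round(haversine_km(first_house, seattle), 3))   # house to Seattle
print(round(haversine_km(seattle, bellevue), 3))      # hub to hub
print(haversine_km(seattle, seattle))                 # a point to itself
```

The first number should agree with the dist_to_seattle value for row 0 in the output table, and a point's distance to itself should come out as zero. Checks like these are a cheap way to catch a swapped longitude/latitude before trusting the new columns.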

Finally, let's drop the columns we used as the building blocks for these two new columns. (Although instead of using df.drop([columns], axis=1) we actively select the columns to keep.)

engineered_columns = needs_work.loc[:,['dist_to_seattle', 'dist_to_bellevue']]

engineered_columns.head(10)

	dist_to_seattle	dist_to_bellevue
0	12.434278	12.395639
1	12.477217	14.770934
2	16.247460	13.838051
3	10.731122	17.970486
4	21.850148	11.542868
5	25.361129	15.217257
6	33.331870	35.347235
7	22.284717	24.515385
8	10.796605	15.463271
9	35.274167	30.244215

Now you can add these back into your original dataframe and begin improving your model with these new feature columns.