<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Kaspar</title>
    <description>The latest articles on DEV Community by David Kaspar (@upwardtrajectory).</description>
    <link>https://dev.to/upwardtrajectory</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F158478%2Fc03fb799-907f-4d63-9c7a-d09dac1b2ba6.jpeg</url>
      <title>DEV Community: David Kaspar</title>
      <link>https://dev.to/upwardtrajectory</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/upwardtrajectory"/>
    <language>en</language>
    <item>
      <title>Let's Look at Images While Talking About Other Images</title>
      <dc:creator>David Kaspar</dc:creator>
      <pubDate>Wed, 12 Jun 2019 21:43:18 +0000</pubDate>
      <link>https://dev.to/upwardtrajectory/let-s-look-at-images-while-talking-about-other-images-bbg</link>
      <guid>https://dev.to/upwardtrajectory/let-s-look-at-images-while-talking-about-other-images-bbg</guid>
      <description>&lt;h3&gt;
  
  
  A Quick Summary of Microsoft's &lt;em&gt;Common Objects in Context&lt;/em&gt; Image Database
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#615638c36f63"&gt;It's been said&lt;/a&gt; that data scientists spend 60% of their time cleaning data, and 19% of their time building data sets, this leaves only 21% of a data scientist's time to allocate toward other tasks, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;building models&lt;/li&gt;
&lt;li&gt;iteratively improving those models &lt;/li&gt;
&lt;li&gt;going to meetings that should have been emails
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of the most exciting parts about machine learning (and Data Science in general) are stuck behind a lot of fairly boring and mundane tasks.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pzEOpS-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media1.tenor.com/images/2f93a5bd8c8ab67d92cd039ced056d14/tenor.gif%3Fitemid%3D12726853" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pzEOpS-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media1.tenor.com/images/2f93a5bd8c8ab67d92cd039ced056d14/tenor.gif%3Fitemid%3D12726853" alt="Let's talk about that"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 2015, &lt;a href="https://arxiv.org/pdf/1405.0312.pdf"&gt;a paper was released&lt;/a&gt; that describes "over 70,000 worker hours" spent building an enormous data set of labeled photographs. The goal of this data set was to provide training data for machine learning models to help with "scene understanding". At the time, there already existed multiple image data sets (which I'll summarize shortly), that focused on both high variety of objects, as well as many instances per object-type. One major downfall that Microsoft wanted to address, however, is that many of those objects and scenes were isolated in the image in such a way that they would almost never be seen in the context of a normal day. Here's an example from the paper. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3fg7Iwlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/UpwardTrajectory/ms_coco_blog/blob/master/iconic_vs_non.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3fg7Iwlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/UpwardTrajectory/ms_coco_blog/blob/master/iconic_vs_non.png%3Fraw%3Dtrue" alt="iconic images vs non-iconic"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The 4-cluster of images on the left are similar to what was the "industry standard" at the time for object detection, while the middle cluster shows iconic scenes: a yard, a living room, a street taken from &lt;em&gt;SUN database: Large-scale scene recognition from abbey to zoo&lt;/em&gt;. The problem is that many of these images are too idealized, as if they were taken by a professional photographer for a catalog or a real estate listing. What the researchers from Microsoft were more interested in, though, is the non-iconic imagry displayed in the cluster toward the right side. For computer vision to be effective, we need training data that closely matches the types of tasks we want to predict on in the future. Since that data-set didn't exist, a few bright minds at Microsoft decided to build it. &lt;/p&gt;

&lt;p&gt;Now that you know their motivations, I'll talk about their methods, but first, let's go into a bit more depth on the state of large computer vision datasets at the time this paper was released. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Set Name &amp;amp; Link&lt;/th&gt;
&lt;th&gt;# of Images&lt;/th&gt;
&lt;th&gt;# of Labeled Categories&lt;/th&gt;
&lt;th&gt;Year Released&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://www.image-net.org/about-overview"&gt;ImageNet&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,197,122&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;2009&lt;/td&gt;
&lt;td&gt;Huge number of categories &amp;amp; images&lt;/td&gt;
&lt;td&gt;Subjects are mostly isolated, out of normal context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://vision.princeton.edu/projects/2010/SUN/"&gt;Sun Database&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;131,067&lt;/td&gt;
&lt;td&gt;397&lt;/td&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;High 'image complexity' with multiple subjects per photo&lt;/td&gt;
&lt;td&gt;Fewer overall instances for any one category&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://host.robots.ox.ac.uk/pascal/VOC/"&gt;PASCAL VOC Dataset&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Fairly high instances per category&lt;/td&gt;
&lt;td&gt;Low 'image complexity', and relatively few categories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://cocodataset.org/#home"&gt;MS COCO&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2,500,000&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;Good 'image complexity', high # of instances per category&lt;/td&gt;
&lt;td&gt;The # of categories could be higher, but overall not a lot of downsides at the time it was released.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Side note: Most of the images in this blog post are copied straight from the academic paper linked above. If you want to know the details of their methodology, I encourage to read the paper itself. It is well written and fairly short, only about 9 pages of text (including the appendix), and lots of insightful pictures and graphs. Here is their informationally dense comparison of the 4 large image datasets. My favorite graph here is the scatterplot in the lower left. Note the log scale on both X &amp;amp; Y axes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R_HYRi86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/UpwardTrajectory/ms_coco_blog/blob/master/image_db_comparison.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R_HYRi86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/UpwardTrajectory/ms_coco_blog/blob/master/image_db_comparison.png%3Fraw%3Dtrue" alt="comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you enjoyed nerding out over these graphs as much as I did. Now that we're all in a good head-space, let's dive into the immense planning and effort that went into creating MS COCO. For brevity, I will do this in an outline.   &lt;/p&gt;

&lt;h3&gt;
  
  
  How It Was Done
&lt;/h3&gt;

&lt;p&gt;1) Realize in what way the current data sets were lacking, and inspire someone to commit the resources to address those shortages  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low image complexity
&lt;/li&gt;
&lt;li&gt;Labels are bounding boxes only, not pixelated shading
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Find pictures of multiple objects / scenes in an "everyday" context, instead of staged scenes  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.flickr.com/about"&gt;Flickr&lt;/a&gt; contains photos uploaded by amateur photographers with searchable metadata and keywords
&lt;/li&gt;
&lt;li&gt;Searches for multiple objects at the same time instead of just a single word / topic

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;cat wine glass&lt;/em&gt; is better than just &lt;em&gt;cat&lt;/em&gt; or &lt;em&gt;wine glass&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Annotate Images by hand: &lt;a href="https://www.mturk.com/"&gt;Amazon's Mechanical Turk (AMT)&lt;/a&gt; is a platform to crowdsource work, but care must be taken to maintain high standards  &lt;/p&gt;

&lt;p&gt;A) Label Categories -- Is there a &lt;em&gt;____________&lt;/em&gt; anywhere in this picture?  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repeated 8 times per question with different AMT workers, to increase Recall. The chance that all 8 workers "get it wrong" is very low.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;B) Instance Spotting -- There is already a &lt;em&gt;____________&lt;/em&gt; in this picture, place a marker over every discrete instance of it that you see.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;also repeated 8 times per question with different AMT workers.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;C) Instance Segmentation -- Shade the photo where each individual &lt;em&gt;____________&lt;/em&gt; takes place (on average, any single photo has 7.7 instances that need to be shaded)  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design an interface that will allow workers to shade the specific pixels where an object is in the image.
&lt;/li&gt;
&lt;li&gt;Segmenting 2,500,000 object instances is an extremely time consuming task requiring over 22 worker hours per 1,000 segmentations. (around 80 sec per instance)

&lt;ul&gt;
&lt;li&gt;For this reason, only a single worker will segment each instance, but high training and quality control is required.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we look at the outcomes, I want to remind you of &lt;em&gt;Werlindo's Saturation Phenomenon&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;1) &lt;em&gt;Exhibit A&lt;/em&gt; is super novel and cool and ppl love it&lt;br&gt;&lt;br&gt;
2) Everyone copies the effect of &lt;em&gt;Exhibit A&lt;/em&gt; to the point that it's overused and perceived as tacky&lt;br&gt;&lt;br&gt;
3) Someone who wasn't around / aware of &lt;em&gt;Exhibit A&lt;/em&gt; sees it for the first time, and thinks "that looks tacky" not realizing they are looking at something kindof amazing that revolutionized an experience.  &lt;/p&gt;

&lt;p&gt;I first heard of this with regard to 'Bullet Time' from &lt;em&gt;The Matrix&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uLOIKC_F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/http://i.imgur.com/4nvICeP.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uLOIKC_F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/http://i.imgur.com/4nvICeP.gif" alt="Bullet time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;vs the 'low budget' version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NyTv4M7q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media0.giphy.com/media/13ihNcbag4xLt6/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NyTv4M7q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media0.giphy.com/media/13ihNcbag4xLt6/giphy.gif" alt="LEGO bullet time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I bring this up, because even though we've seen a lot of cool image recognition and shading demos over the last few years, to the point of thinking they are normal, none of this would be possible without the rather boring task of building a massive, intricately labeled, natural scene image data set. &lt;/p&gt;

&lt;h3&gt;
  
  
  Thanks, Microsoft COCO.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YgbSJogm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://liuziwei7.github.io/projects/vsreid/demo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YgbSJogm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://liuziwei7.github.io/projects/vsreid/demo.gif" alt="image segmentation gifs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was going to add even more images to this blog showing the output of this project, but instead, you should probably just &lt;a href="http://cocodataset.org/#explore"&gt;play around&lt;/a&gt; with the dataset yourself. There are even &lt;a href="https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md"&gt;pre-trained models&lt;/a&gt; that work "out of the box" to leverage all of this hard work, with almost no added input from you.&lt;/p&gt;

</description>
      <category>flatironschool</category>
      <category>microsoft</category>
      <category>computervision</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Something is growing, and it's growing very fast, but *how* fast?</title>
      <dc:creator>David Kaspar</dc:creator>
      <pubDate>Tue, 30 Apr 2019 16:31:53 +0000</pubDate>
      <link>https://dev.to/upwardtrajectory/something-is-growing-and-it-s-growing-very-fast-but-how-fast-1li6</link>
      <guid>https://dev.to/upwardtrajectory/something-is-growing-and-it-s-growing-very-fast-but-how-fast-1li6</guid>
      <description>&lt;p&gt;Last week, I talked about engineering new features from columns you already have (I used Haversine's Formula as an example). This week, I'd like to discuss using log-transformations to test whether the underlying patterns in the data is caused by exponential behavior: &lt;code&gt;y=A*B^x&lt;/code&gt; or perhaps a power model: &lt;code&gt;y=A*x^n&lt;/code&gt;. NOTE: This is &lt;em&gt;not&lt;/em&gt; about using log-transformations to assist with normalizing the data, for that, you'll need to look &lt;a href="https://dev.to/rokaandy/logarithmic-transformation-in-linear-regression-models-why-when-3a7c"&gt;here&lt;/a&gt; or &lt;a href="https://dev.to/annalara/king-county-house-sales-dataset-log-transformations-and-interpreting-results-19b2"&gt;here&lt;/a&gt;. This topic is near and dear to my heart, because it was how I met my very first tutoring client, which cascaded into my own business over ten years. I was working at a school, and one of my fellow teachers (she specialized in Chemistry &amp;amp; Bio, but often helped with math at the high school level) was doing some tutoring on the side, and she ran into a group project that was focusing on exactly this topic. She wasn't feeling very confident, and knew I wasn't doing anything that night, because we had just talked about it earlier that day. So she called me, asking for help, and she paid me 100% of what she made from the family that hour. Later on, she moved away, and "gifted" me with a referral to work with that same family.  &lt;/p&gt;

&lt;p&gt;Fast-forward 10 years, I'm no longer a tutor, but I have decided to take my mathematical expertise and pivot into computers, specifically, data science. Let's tie it all together. Imagine we've done some basic analysis on a few columns in pandas, and we're trying to decide on a column-by-column basis, which growth pattern is in play. Let's start with the necessary imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;    &lt;span class="c1"&gt;# Sidenote: if you haven't tried using the Seaborn library, it's wonderful! I highly recommend that you give it a try.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;exp_or_poly&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mystery_scatter&lt;/span&gt;  &lt;span class="c1"&gt;# Secret function I wrote, don't cheat and look!
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is some imaginary data. This function returns a pandas dataframe which we will save as &lt;code&gt;df&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mystery_scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_growth.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_growth.png%3Fraw%3Dtrue" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;x&lt;/th&gt;
      &lt;th&gt;mystery_1&lt;/th&gt;
      &lt;th&gt;mystery_2&lt;/th&gt;
      &lt;th&gt;mystery_3&lt;/th&gt;
      &lt;th&gt;linear&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;td&gt;500.000000&lt;/td&gt;
      &lt;td&gt;500.000000&lt;/td&gt;
      &lt;td&gt;500.000000&lt;/td&gt;
      &lt;td&gt;5.000000e+02&lt;/td&gt;
      &lt;td&gt;500.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;51.595384&lt;/td&gt;
      &lt;td&gt;31372.948633&lt;/td&gt;
      &lt;td&gt;30654.424502&lt;/td&gt;
      &lt;td&gt;1.098368e+07&lt;/td&gt;
      &lt;td&gt;34111.441777&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;std&lt;/th&gt;
      &lt;td&gt;26.416814&lt;/td&gt;
      &lt;td&gt;25954.487664&lt;/td&gt;
      &lt;td&gt;22727.902144&lt;/td&gt;
      &lt;td&gt;8.786530e+06&lt;/td&gt;
      &lt;td&gt;22607.290150&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;min&lt;/th&gt;
      &lt;td&gt;5.019101&lt;/td&gt;
      &lt;td&gt;2622.068494&lt;/td&gt;
      &lt;td&gt;5451.400236&lt;/td&gt;
      &lt;td&gt;9.498159e+04&lt;/td&gt;
      &lt;td&gt;-7483.112617&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;25%&lt;/th&gt;
      &lt;td&gt;29.494955&lt;/td&gt;
      &lt;td&gt;8308.838700&lt;/td&gt;
      &lt;td&gt;11995.912686&lt;/td&gt;
      &lt;td&gt;3.070432e+06&lt;/td&gt;
      &lt;td&gt;14698.430889&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;50%&lt;/th&gt;
      &lt;td&gt;51.622231&lt;/td&gt;
      &lt;td&gt;24277.069057&lt;/td&gt;
      &lt;td&gt;23841.307039&lt;/td&gt;
      &lt;td&gt;9.079029e+06&lt;/td&gt;
      &lt;td&gt;34205.683607&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;75%&lt;/th&gt;
      &lt;td&gt;73.700855&lt;/td&gt;
      &lt;td&gt;49624.805802&lt;/td&gt;
      &lt;td&gt;43976.822005&lt;/td&gt;
      &lt;td&gt;1.776408e+07&lt;/td&gt;
      &lt;td&gt;52997.927645&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;max&lt;/th&gt;
      &lt;td&gt;98.925990&lt;/td&gt;
      &lt;td&gt;95455.331996&lt;/td&gt;
      &lt;td&gt;94890.894987&lt;/td&gt;
      &lt;td&gt;3.056593e+07&lt;/td&gt;
      &lt;td&gt;75247.568169&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mysteries #1, #2, and #3 all appear quite similar, and of course #4 certainly appears linear. We could just run a bunch of different regression models on columns 1, 2, and 3, and keep track of the R^2 values, then just pick the best. In theory, this should work (and with computing power being so cheap, it's certainly not a bad idea), but I'd like to give us a more precise way to decide, instead of the "shotgun" approach where you just try all possible models. The mathematics for why the following strategy works is rather elegant, as it involves taking complicated equations and turning them into the simple linear form: &lt;code&gt;y=m*x+b&lt;/code&gt;. We won't be delving into the algebra here, &lt;a href="https://mathbench.umd.edu/modules/misc_scaling/page11.htm" rel="noopener noreferrer"&gt;but you can go here instead&lt;/a&gt;. For now, we'll just practice how to make it all happen in Python.   &lt;/p&gt;

&lt;p&gt;The basic idea is as follows: when looking at data that curves upward as the x variable increases, we can plot two different scatter plots, and test each of them for linearity. Whichever scatterplot is closest to a straight line will tell us which underlying growth pattern is on display. If &lt;code&gt;log(x) vs log(y)&lt;/code&gt; becomes linear, then the underlying pattern came from power model growth, but if &lt;code&gt;x vs log(y)&lt;/code&gt; gives the more linear scatterplot, then the original data was probably from an exponential model. &lt;/p&gt;

&lt;p&gt;First, let's restrict our dataframe to only the curved lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;curved_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we'll build a function that can display the graphs side-by-side, for comparison.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;linearity_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Take in a dataframe, and the index for data along the x-axis. Then, for each column, 
       display a scatterplot of the (x, log(y)) and also (log(x), log(y))&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
    &lt;span class="n"&gt;df_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Is it exponential?
&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Exp Test: (X , logY)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="c1"&gt;# Is it power?
&lt;/span&gt;        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pwr Test: (logX , logY)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Which Model is Best for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;linearity_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curved_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_1.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_1.png%3Fraw%3Dtrue" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_2.png%3Fraw%3Dtrue" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_3.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FUpwardTrajectory%2Fto_log_or_not_to_log%2Fblob%2Fmaster%2Fmystery_3.png%3Fraw%3Dtrue" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, &lt;code&gt;mystery_1&lt;/code&gt; seems a bit inconclusive, we'll come back to that later. &lt;code&gt;mystery_2&lt;/code&gt; certainly seems to have a linear relationship on the Exp Test, but NOT for the Power Test, which means the growth pattern for that column was caused by exponential growth, and &lt;code&gt;mystery_3&lt;/code&gt; is the opposite, it is very obviously linear for the Pwr Test, but not for Exp Test. Let's peek under the hood and take a look at the function I used to build this data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mystery_scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Randomly generate data, one of which is quadratic, another is exponential,
       and a third is linear.
       Plot all sets of data side-by-side, then return a pandas dataframe&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7182818&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Generate some data, not much rhyme or reason to these numbers, just wanted everything 
       to be on the same scale and the two curved datasets should be hard to tell apart&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;11.45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;234.15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.03&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;1.9&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
    &lt;span class="n"&gt;y4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;856.16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;10107.3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Graph the plots&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mystery #1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mystery #2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mystery #3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Linear&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mystery Challenge: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;One of these is Exponential, one is Power, &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;and the other is Quadratic, but which is which?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mystery_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mystery_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mystery_3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;in particular, look at how the data was generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = np.random.uniform(5, 100, n)
y1 = 11.45 * x ** 2 - 234.15 * x + 5000 + np.random.normal(0, var, n) * np.sqrt(x)
y2 = 5000 * 1.03 ** x + np.random.normal(0, var, n) * np.sqrt(x)
y3 = 5000 * x ** 1.9 + np.random.normal(0, 20 * var, n) * x
y4 = 856.16 * x - 10107.3 + np.random.normal(0, 6 * var, n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we were right about &lt;code&gt;mystery_2&lt;/code&gt; (which was the same as y2) being caused by exponential growth, it was &lt;code&gt;y2 = 5000*1.03^x&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;Meanwhile, &lt;code&gt;mystery_1&lt;/code&gt; was &lt;code&gt;y1 = 11.45*x^2 - 234.15*x + 5000&lt;/code&gt;, definitely built from a quadratic equation, which is somewhat related to the Power Model, but not exactly the same. Technically, the &lt;code&gt;log(x) vs log(y)&lt;/code&gt; test only forces linearity for Power Models. In order to find the exact solution for this one, we would rule out exponential growth and power growth first, then start applying the "reiterative common differences" strategy to determine the degree of the polynomial, but we won't be getting in to that here.   &lt;/p&gt;

&lt;p&gt;Refocusing again on power growth: it looks like &lt;code&gt;mystery_3&lt;/code&gt; was generated with &lt;code&gt;y3 = 5000*x^1.9&lt;/code&gt; which definitely qualifies as a power model.&lt;/p&gt;

&lt;p&gt;P.S.  If you're wondering what the extra &lt;code&gt;np.random.normal(...)&lt;/code&gt; stuff at the back part of each of those lines is doing, it's just adding in some random noise so the scatter plots wouldn't be too "perfect".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>flatironschool</category>
      <category>mathematics</category>
      <category>growthmodels</category>
      <category>python</category>
    </item>
    <item>
      <title>Finding my way into Data Science</title>
      <dc:creator>David Kaspar</dc:creator>
      <pubDate>Tue, 23 Apr 2019 22:17:36 +0000</pubDate>
      <link>https://dev.to/upwardtrajectory/finding-my-way-into-data-science-4ljm</link>
      <guid>https://dev.to/upwardtrajectory/finding-my-way-into-data-science-4ljm</guid>
      <description>&lt;p&gt;My father likes to brag about his three sons, of which I am “the middlest”. When he gets to talking, and the conversation turns to “what the boys are up to?” he likes to describe my life this way:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;David is doing great, he got married around ten years ago and runs his own tutoring business out in Seattle, WA. In fact, his “office” is at the dining room tables of families that are less than half a mile from Bill Gates’ house. He has helped students get accepted at Harvard and Yale! He only works about 25 hours a week, but it’s enough to own a house, and that’s even when he has to compete with all that Amazon and Microsoft money.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Well, Dad, I’m leaving that all behind, but I hope to give you a new story to catch the attention of those who vaguely knew me when I was a kid running around the Indiana countryside. About 6 weeks ago, I made the difficult choice to abandon my students part-way through the school year in order to join a 15 week bootcamp studying Data Science with Flatiron School. Upon doing so, I personally delivered a letter to each of the families I was working with at the time, regretfully sharing the news that I wouldn’t be able to continue as their tutor. Upon arriving to one house a week later, the father casually mentioned, “Oh, David, before you go, I wanted to talk with you about something.” As it turns out, after reading that letter, he had arranged for me to get a meeting with someone who has been at an international company for years, basically doing my dream job. I was thrilled! It was more of a meeting than an interview, and helped me understand what exactly I was signing myself up for.&lt;/p&gt;

&lt;p&gt;During the first 20 mins of this meeting, we discussed the various ways to break into Data Science. In her experience there are three basic paths, none of which are better than the other, as they each bring their own important background tools necessary for the job. Below, I drew up a summary of those three paths:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RFKWzi7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/viv154W.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RFKWzi7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/viv154W.jpg" alt="Paths to Data Science"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On her personal team she mentioned that everyone has at least a master’s degree in something relevant, many have a PhD (I have neither). After briefly summarizing the syllabus for my 15 week program, she mentioned that someone who intensely studies these topics shouldn’t have any trouble finding a job in the Seattle market, but I should temper my expectations of jumping straight into a high level machine learning or AI position. This doesn’t mean I won’t try, but if it takes me a few years to work my way up, that’s completely fine by me. In my old job, there wasn’t much upward mobility anyway, so starting on the bottom rung of the ladder and being challenged by my future colleagues sounds great. Also . . . a computer job would very likely come with good health insurance, which, from the perspective of a sole-proprietor business owner, is just like the Swiss flag (a huge plus). &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2xf0XDAo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.telegraph.co.uk/content/dam/business/2016/03/07/flagswiss_trans_NvBQzQNjv4BqM37qcIWR9CtrqmiMdQVx7Ef6Or6a507Effv7rfJzHcs.jpg%3Fimwidth%3D450" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2xf0XDAo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.telegraph.co.uk/content/dam/business/2016/03/07/flagswiss_trans_NvBQzQNjv4BqM37qcIWR9CtrqmiMdQVx7Ef6Or6a507Effv7rfJzHcs.jpg%3Fimwidth%3D450" alt="Swiss flag"&gt;&lt;/a&gt;&lt;br&gt;
Before this meeting took place, I hopped on to reddit’s “MachineLearning” forum and asked for help: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;I managed to get a face-to-face introduction with the someone in the industry. I can't promise to deliver an answer (but I'll try), and I'm not a journalist (just a nerdy guy trying to switch careers), but I feel like I'm in a state of "I know that I don't know" and I'd like your help. What should I ask? If you had the ear of someone who's been working in big data and machine learning for years, what would you ask? (obviously nothing about getting "trade secrets" or anything else she clearly wouldn't answer. These have to be feasible questions.) Thanks!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks to everyone who contributed questions, here is my report:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd be interested in hearing what their plans are for ML driven spatial / temporal upscaling in games and how far they expect this to go.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E. G. Is a 720p at 30hz original render going to give us 8k at 120hz in a decade.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;She said it's absolutely something that is being worked on by others (not sure if she meant @ her company, or just generally, sorry).&lt;/em&gt; &lt;/p&gt;

&lt;p&gt; I loved this question, because there are so many great games out there where the graphics have aged poorly, but the premise of the game is incredibly fun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the most exciting area for ml in gaming? GANs? RL? CART?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This didn't come up specifically, however, she did share the premise of a talk she heard at a ML conference within the last year. It's about developing an AI for NPCs, but could be applied toward an AI that tries to beat the game from the player perspective. Essentially, the speaker developed a "risk / reward" function for each object on the screen. Strong monster close by = bad ; strong monster far away = not so bad ; treasure chest = good ; etc. all with number values, then the AI would follow a "path of best outcomes". I imagine a bunch of lines all shooting out from the controlled char to every other object on the screen, with varying degrees of red -&amp;gt; orange -&amp;gt; yellow -&amp;gt; green, and they adjust the "risk / reward" outcome based on movement and actions of the player.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Overall, it was a great meeting for me. I only minorly embarrassed myself three times, and could understand the basic idea of everything we discussed. About half the time was spent discussing how people from various backgrounds end up in the "data science" space, and whether it's feasible for me to make that leap. Kudos to her for quickly figuring out "my level" and pushing it slightly beyond in an interesting way. Every part of this blog is solely my interpretation of her opinion and in no way reflects the official stance of her company.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Engineering Location Features with Haversine's Formula for Prediction Modeling</title>
      <dc:creator>David Kaspar</dc:creator>
      <pubDate>Fri, 19 Apr 2019 23:29:11 +0000</pubDate>
      <link>https://dev.to/upwardtrajectory/engineering-location-features-with-haversine-s-formula-for-prediction-modeling-23n2</link>
      <guid>https://dev.to/upwardtrajectory/engineering-location-features-with-haversine-s-formula-for-prediction-modeling-23n2</guid>
      <description>&lt;p&gt;Here is the full git repo: &lt;a href="https://github.com/nkacoroski/dsc-1-final-project#module-1-final-project" rel="noopener noreferrer"&gt;Predicting King Co home prices&lt;/a&gt; This blog post is just a snippet of one strategy we used.&lt;br&gt;
&lt;em&gt;Co-written with &lt;a href="https://github.com/nkacoroski" rel="noopener noreferrer"&gt;Natasha Kacoroski&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So you'd like to build a prediction model from some data, but some of the columns seem a bit . . . useless? Alternatively, the column of data might accidentally infer meaning where there shouldn't be any. Let's try this out by engineering some columns with location data to help predict the price of a home using a small slice of the King County Housing Data set from Washington State. Let's explore how we can turn difficult columns into something useful instead of just throwing them away or using them incorrectly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kc_house_data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;th&gt;4&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;td&gt;7129300520&lt;/td&gt;
      &lt;td&gt;6414100192&lt;/td&gt;
      &lt;td&gt;5631500400&lt;/td&gt;
      &lt;td&gt;2487200875&lt;/td&gt;
      &lt;td&gt;1954400510&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;date&lt;/th&gt;
      &lt;td&gt;10/13/2014&lt;/td&gt;
      &lt;td&gt;12/9/2014&lt;/td&gt;
      &lt;td&gt;2/25/2015&lt;/td&gt;
      &lt;td&gt;12/9/2014&lt;/td&gt;
      &lt;td&gt;2/18/2015&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;price&lt;/th&gt;
      &lt;td&gt;221900&lt;/td&gt;
      &lt;td&gt;538000&lt;/td&gt;
      &lt;td&gt;180000&lt;/td&gt;
      &lt;td&gt;604000&lt;/td&gt;
      &lt;td&gt;510000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;bedrooms&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;bathrooms&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2.25&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_living&lt;/th&gt;
      &lt;td&gt;1180&lt;/td&gt;
      &lt;td&gt;2570&lt;/td&gt;
      &lt;td&gt;770&lt;/td&gt;
      &lt;td&gt;1960&lt;/td&gt;
      &lt;td&gt;1680&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_lot&lt;/th&gt;
      &lt;td&gt;5650&lt;/td&gt;
      &lt;td&gt;7242&lt;/td&gt;
      &lt;td&gt;10000&lt;/td&gt;
      &lt;td&gt;5000&lt;/td&gt;
      &lt;td&gt;8080&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;floors&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;waterfront&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;view&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;condition&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;grade&lt;/th&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_above&lt;/th&gt;
      &lt;td&gt;1180&lt;/td&gt;
      &lt;td&gt;2170&lt;/td&gt;
      &lt;td&gt;770&lt;/td&gt;
      &lt;td&gt;1050&lt;/td&gt;
      &lt;td&gt;1680&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_basement&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;400.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;910.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;yr_built&lt;/th&gt;
      &lt;td&gt;1955&lt;/td&gt;
      &lt;td&gt;1951&lt;/td&gt;
      &lt;td&gt;1933&lt;/td&gt;
      &lt;td&gt;1965&lt;/td&gt;
      &lt;td&gt;1987&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;yr_renovated&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1991&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;zipcode&lt;/th&gt;
      &lt;td&gt;98178&lt;/td&gt;
      &lt;td&gt;98125&lt;/td&gt;
      &lt;td&gt;98028&lt;/td&gt;
      &lt;td&gt;98136&lt;/td&gt;
      &lt;td&gt;98074&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;lat&lt;/th&gt;
      &lt;td&gt;47.5112&lt;/td&gt;
      &lt;td&gt;47.721&lt;/td&gt;
      &lt;td&gt;47.7379&lt;/td&gt;
      &lt;td&gt;47.5208&lt;/td&gt;
      &lt;td&gt;47.6168&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;long&lt;/th&gt;
      &lt;td&gt;-122.257&lt;/td&gt;
      &lt;td&gt;-122.319&lt;/td&gt;
      &lt;td&gt;-122.233&lt;/td&gt;
      &lt;td&gt;-122.393&lt;/td&gt;
      &lt;td&gt;-122.045&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_living15&lt;/th&gt;
      &lt;td&gt;1340&lt;/td&gt;
      &lt;td&gt;1690&lt;/td&gt;
      &lt;td&gt;2720&lt;/td&gt;
      &lt;td&gt;1360&lt;/td&gt;
      &lt;td&gt;1800&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;sqft_lot15&lt;/th&gt;
      &lt;td&gt;5650&lt;/td&gt;
      &lt;td&gt;7639&lt;/td&gt;
      &lt;td&gt;8062&lt;/td&gt;
      &lt;td&gt;5000&lt;/td&gt;
      &lt;td&gt;7503&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What do each of these columns mean?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column_names.md&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;COLUMN NAMES AND DESCRIPTIONS FOR KING COUNTRY DATA SET&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;id&lt;/strong&gt; - unique identified for a house&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dateDate&lt;/strong&gt; - house was sold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pricePrice&lt;/strong&gt; -  is prediction target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bedroomsNumber&lt;/strong&gt; -  of Bedrooms/House&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bathroomsNumber&lt;/strong&gt; -  of bathrooms/bedrooms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_livingsquare&lt;/strong&gt; -  footage of the home&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_lotsquare&lt;/strong&gt; -  footage of the lot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;floorsTotal&lt;/strong&gt; -  floors (levels) in house&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;waterfront&lt;/strong&gt; - House which has a view to a waterfront&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;view&lt;/strong&gt; - Has been viewed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;condition&lt;/strong&gt; - How good the condition is ( Overall )&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;grade&lt;/strong&gt; - overall grade given to the housing unit, based on King County grading system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_above&lt;/strong&gt; - square footage of house apart from basement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_basement&lt;/strong&gt; - square footage of the basement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;yr_built&lt;/strong&gt; - Built Year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;yr_renovated&lt;/strong&gt; - Year when house was renovated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;zipcode&lt;/strong&gt; - zip&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lat&lt;/strong&gt; - Latitude coordinate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;long&lt;/strong&gt; - Longitude coordinate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_living15&lt;/strong&gt; - The square footage of interior housing living space for the nearest 15 neighbors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqft_lot15&lt;/strong&gt; - The square footage of the land lots of the nearest 15 neighbors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of these columns seem like they should have some kind of an influence on the price. In particular, the common saying "the three most important things to think about when buying are home are location, location, location" tells us we should probably care a lot about things like zipcode, latitude, and longitude. Also, just to note, 'sqft_living15' and 'sqft_lot15' are metadata about each house in the database that someone else has already engineered for us. Thanks, King Co. data scientist!&lt;/p&gt;

&lt;p&gt;There are plenty of columns we could use, but for this tutorial with Haversine, let's just focus on latitude &amp;amp; longitude. Let's get a visual of what this data looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;needs_work&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;long&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tight_layout&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fnkacoroski%2Fdsc-1-final-project%2Fraw%2Fblog%2Flong.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fnkacoroski%2Fdsc-1-final-project%2Fraw%2Fblog%2Flong.png" alt="png_of_long"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fnkacoroski%2Fdsc-1-final-project%2Fraw%2Fblog%2Flat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fnkacoroski%2Fdsc-1-final-project%2Fraw%2Fblog%2Flat.png" alt="png_of_lat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though these pairs of graphs look similar, notice the y-axes on these plots are measuring different things. The histogram is simply counting how many observations of the x-value we have in the data, while the scatterplot is showing the price in USD of the house sold at the x-value. It's almost like more houses sold (histogram height) correlates rather strongly with a higher sale price (scatterplot height). The Law of Supply &amp;amp; Demand seems alive and well in King County, WA. We'll discuss how to deal with this data in regard to predicting sale price using Haversine's 3D distance formula.&lt;/p&gt;

&lt;p&gt;Before we get into the details of the formula, let's talk about why we should consider augmenting the location data at all. The problem here is that ONLY looking at one-dimensional movement on a map doesn't particularly generate accurate housing price patterns. Another way of saying this is there is no linear relationship between either latitude vs price nor longitude vs price. For example, if there are multiple wealthy neighborhoods that have lower-priced neighborhoods in between them, using just that one dimension of data won't accurately predict the value. In the following image (shamelessly lifted from &lt;a href="http://www.peteryu.ca/tutorials/matlab/image_in_3d_surface_plot_with_multiple_colormaps" rel="noopener noreferrer"&gt;http://www.peteryu.ca/tutorials/matlab/image_in_3d_surface_plot_with_multiple_colormaps&lt;/a&gt;) imagine the Z axis isn't elevation, but rather home price. As you can see in the heatmap below the surface, there isn't a clear linear pattern when you move along JUST the X-direction or just the Y-direction, but proximity to the dark red spot near the back right part of the graph &lt;em&gt;would&lt;/em&gt; have a clear pattern for predicting price.&lt;br&gt;
(Note the red spot near the bottom left should actually be blue on the heatmap, as in this example, it represents very low house prices.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.peteryu.ca%2F_media%2Ftutorials%2Fmatlab%2Fimage_surf_3d_basic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.peteryu.ca%2F_media%2Ftutorials%2Fmatlab%2Fimage_surf_3d_basic.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When modeling with multivariable linear regression, simply having both latitude and longitude doesn't really help either. Instead, let's think about why home values in some neighborhoods are high vs low. It could simply be the houses in those neighborhoods have a view of the mountains &amp;amp; water, but the only way any one home sells for a lot of money is if someone pays a lot of money for it. I know this seems simplistic, but high home values require high paying jobs, and people with high paying jobs will pay more to be conveniently close their work (especially if the home has a view of the mountains or water). This leads me to believe a "radius from high-paying job cluster" might be a much better indicator of home value instead of using the two columns of 'longitude' and 'latitude' on their own. Let's build one! In fact, if there are multiple employment clusters, maybe building a single column for each "job hub" would make sense.&lt;/p&gt;

&lt;p&gt;If you are willing to accept that we live on a round planet, we can utilize the Haversine formula, which measures 3D arc-length on the surface of a sphere. This is really just an adaption of the Pythagorean theorem.&lt;/p&gt;

&lt;p&gt;We adapt the python code from:  &lt;a href="https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We choose to pass in two arguments, each in the form &lt;code&gt;[longitude, latitude]&lt;/code&gt;. To do this, we first build a new column to serve as the input to the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;long&lt;/th&gt;
      &lt;th&gt;lat&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;-122.257&lt;/td&gt;
      &lt;td&gt;47.5112&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;-122.319&lt;/td&gt;
      &lt;td&gt;47.7210&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;long_lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;long&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;long&lt;/th&gt;
      &lt;th&gt;lat&lt;/th&gt;
      &lt;th&gt;long_lat&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;-122.257&lt;/td&gt;
      &lt;td&gt;47.5112&lt;/td&gt;
      &lt;td&gt;(-122.257, 47.5112)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;-122.319&lt;/td&gt;
      &lt;td&gt;47.7210&lt;/td&gt;
      &lt;td&gt;(-122.319, 47.721000000000004)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now we define the Haversine function. Alternatively, you can simply import it from &lt;a href="https://pypi.org/project/haversine/" rel="noopener noreferrer"&gt;https://pypi.org/project/haversine/&lt;/a&gt; but we want to make sure we can control exactly what is going on and it's fun to see trigonometry in action!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;radians&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sqrt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;haversine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_long_lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;122.336283&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;47.609395&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees), in this case the 2nd point is in the 
    Pike Pine Retail Core of Seattle, WA.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lon1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_long_lat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;list_long_lat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;lon2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# convert decimal degrees to radians 
&lt;/span&gt;    &lt;span class="n"&gt;lon1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radians&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lon1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# haversine formula 
&lt;/span&gt;    &lt;span class="n"&gt;dlon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lon2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lon1&lt;/span&gt; 
    &lt;span class="n"&gt;dlat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lat2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lat1&lt;/span&gt; 
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dlat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dlon&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;asin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
    &lt;span class="c1"&gt;# Radius of earth in kilometers is 6371
&lt;/span&gt;    &lt;span class="n"&gt;km&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6371&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;km&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're finally ready to build our useful output column using &lt;code&gt;series.apply()&lt;/code&gt; which yields the distance to Seattle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dist_to_seattle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;long_lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;haversine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In fact, it would very easy to build another column for a different "job cluster location" at this same point by passing in the optional arguments for haversine() doing a different &lt;code&gt;series.apply(function, opt_arg=______)&lt;/code&gt;. Seattle's "suburb" of Bellevue has actually grown into its own city with skyscrapers and high paying jobs of its own. Let's create the &lt;code&gt;'dist_to_bellevue'&lt;/code&gt; column as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dist_to_bellevue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;long_lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;haversine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;122.198985&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;47.615577&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's drop the columns we used as the building blocks for these two new columns. (Although instead of using &lt;code&gt;df.drop([columns], axis=1)&lt;/code&gt; we actively select the columns to keep.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engineered_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needs_work&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dist_to_seattle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dist_to_bellevue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engineered_columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;dist_to_seattle&lt;/th&gt;
      &lt;th&gt;dist_to_bellevue&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;12.434278&lt;/td&gt;
      &lt;td&gt;12.395639&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;12.477217&lt;/td&gt;
      &lt;td&gt;14.770934&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;16.247460&lt;/td&gt;
      &lt;td&gt;13.838051&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;10.731122&lt;/td&gt;
      &lt;td&gt;17.970486&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;21.850148&lt;/td&gt;
      &lt;td&gt;11.542868&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;25.361129&lt;/td&gt;
      &lt;td&gt;15.217257&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;33.331870&lt;/td&gt;
      &lt;td&gt;35.347235&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;22.284717&lt;/td&gt;
      &lt;td&gt;24.515385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;10.796605&lt;/td&gt;
      &lt;td&gt;15.463271&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;35.274167&lt;/td&gt;
      &lt;td&gt;30.244215&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now you can add these back into to your original dataframe and begin improving your model with these new feature columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>linearregression</category>
      <category>machinelearning</category>
      <category>distancecalculation</category>
    </item>
  </channel>
</rss>
