<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Larsen</title>
    <description>The latest articles on DEV Community by Nick Larsen (@nicklarsen).</description>
    <link>https://dev.to/nicklarsen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F19720%2F32deea8c-2ff3-4af2-b3b2-94c030e0dc49.jpg</url>
      <title>DEV Community: Nick Larsen</title>
      <link>https://dev.to/nicklarsen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nicklarsen"/>
    <language>en</language>
    <item>
      <title>How to prepare for your next interview</title>
      <dc:creator>Nick Larsen</dc:creator>
      <pubDate>Fri, 02 Aug 2019 18:10:14 +0000</pubDate>
      <link>https://dev.to/nicklarsen/how-to-prepare-for-your-next-interview-dkc</link>
      <guid>https://dev.to/nicklarsen/how-to-prepare-for-your-next-interview-dkc</guid>
      <description>&lt;p&gt;Your next interview is coming up and you feel excited, nervous and you are definitely experiencing some notable anxiety that you are not used to.  Interviewing &lt;em&gt;is&lt;/em&gt; a huge deal!  You're making a huge life change, you're going to meet a lot of new people, you're going to be focusing on totally new problems and you feel like you're going to have to prove yourself to a bunch of people all over again.  And on top of all that, you're going to have to face some of your demons like coding in front of other people are solving problems on a timer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OAmNcNDH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/hackerrank-codepair-screenshot1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OAmNcNDH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/hackerrank-codepair-screenshot1.png" alt="Online coding interview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://www.hackerrank.com/products/codepair/"&gt;image source&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;Largely this anxiety comes from not really knowing what to expect in the interviews and not wanting to be in this situation for very long.  The latter is a real concern that almost everyone faces, and unfortunately I don't have a solution for it.  What I can do is tell you what to expect, so that you get an offer you'll be happy with a lot sooner.  &lt;/p&gt;

&lt;p&gt;We're going to cover the main topics you are being evaluated on and give some pointers on how to improve your standing in each area by the end of the interview.  First we'll break down the key topics, then we'll explain them in more detail, and at the very end we'll cover the most common question patterns you will almost certainly be asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem solving process
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You are being hired to solve problems.&lt;/strong&gt;  I cannot reiterate this enough.  You are &lt;em&gt;not&lt;/em&gt; here to write code, you are here to &lt;em&gt;solve problems&lt;/em&gt;.  Problems can be product related, code related, people related, process related, whatever, your job is to solve problems.  Writing code is just one method for implementing the solution to a problem.  As you are interviewing, the expectation is that you will be writing code to implement solutions, so that's what they are going to test you on, however, it's your problem solving skills that really make the difference between passing the interview or not.  So let's break down the problem solving process and see how all the parts fit together.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem description
&lt;/h3&gt;

&lt;p&gt;Problems are a description of something that needs to be accomplished.  Problems are things like "we need to order all the items in this array by ascending value".  This is the sorting problem.  There are lots of ways to solve the sorting problem, and you can sort things without a computer as well, for instance alphabetizing medical records in a filing cabinet.  A problem statement is just a description; it does not in any way, shape or form imply a specific solution.&lt;/p&gt;

&lt;p&gt;There are lots of problems, but just to highlight a few common problems that you might run into in your work, it's things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we need more customers buying things on our website&lt;/li&gt;
&lt;li&gt;we need to make it easier to find the answers to your questions&lt;/li&gt;
&lt;li&gt;we need to identify the toxic content on our website&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have ever run into any of these problems in your work, your brain is probably filled with ideas just because you read those problem statements, and that's the point: a problem can have tons of solutions.  It's your job to find a good solution to the problem at hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The approach
&lt;/h3&gt;

&lt;p&gt;An approach is a high-level idea about how to solve a problem.  It does not specify implementation details, it only provides an outline for the key components of a solution.  Coming up with a valid approach is the actual act of solving a problem.  This is why you are not here to code, you are here to solve problems.  You need to come up with valid, working approaches to each problem.  And once you have that, it can be anybody's problem to implement it.&lt;/p&gt;

&lt;p&gt;To continue with the sorting example, an approach is something like "find the lowest value and put it in the first location, then repeat this process on the remaining unsorted elements until all elements are sorted".  This is a totally valid and working approach.  Again, notice it does not specify a particular algorithm, it's just a general idea about how to solve the problem and it can be implemented in many ways.  This general approach is the basis of numerous sorting algorithms which we'll go through in a moment.&lt;/p&gt;
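&lt;p&gt;That approach translates almost directly into selection sort.  Here is a minimal sketch in Python (the language choice is mine; any language works the same way):&lt;/p&gt;

```python
def selection_sort(items):
    """Repeatedly find the lowest remaining value and move it into place."""
    items = list(items)  # work on a copy
    for i in range(len(items)):
        # find the index of the smallest element in the unsorted tail
        lowest = i
        for j in range(i + 1, len(items)):
            if items[lowest] > items[j]:
                lowest = j
        # put it in the first unsorted location
        items[i], items[lowest] = items[lowest], items[i]
    return items

print(selection_sort([5, 2, 4, 1, 3]))  # → [1, 2, 3, 4, 5]
```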

&lt;p&gt;Once you have an approach, &lt;strong&gt;you have solved the problem&lt;/strong&gt;; you have accomplished the primary thing you are being hired to do.  One very common mistake people make here is to "think in code", which essentially means you only consider the tools you are familiar with.  You may have heard the saying "if all you have is a hammer, everything starts looking like a nail".  If you can't think of a way to solve the problem using some tool you know, then all of a sudden you are stuck.  This is a huge mistake.  It's like trying to fix the shower wall in a bathroom when the only tool you know how to use is duct tape, so you start describing how to use duct tape to hold the tiles to the wall.  Instead you should describe the approach as "attach the tiles to the wall"; you have still solved the problem, and you can worry about the implementation details later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9-jDC9PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/duct-tape-fix.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9-jDC9PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/duct-tape-fix.jpg" alt="duct tape wall"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;a href="http://www.startribune.com/top-20-home-inspection-photos-from-2015/364228611/"&gt;image source&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;Instead I strongly urge you to &lt;strong&gt;solve all problems like a human, not like a computer&lt;/strong&gt;.  Think about if you had that filing cabinet full of medical records and you had to sort it alphabetically, what would you do?  Are you going to come up with the most efficient answer on the first try?  Probably not, but you &lt;em&gt;are&lt;/em&gt; going to solve the problem, and once a problem is solved and we have a working solution, we can iterate on it.  If you don't have a working solution, you're busted and you're not getting the job.  Just solve it naively and we'll work through better solutions later on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The technique
&lt;/h3&gt;

&lt;p&gt;Techniques are the actual implementation details of a solution.  This is the point where you start talking about specific algorithms and how you are going to apply them to the data given to you.  As noted above, there are tons of different ways to implement a sort based on finding the smallest value first.  Some of the most common are selection sort, bubble sort and insertion sort, but there are plenty of others as well.  There are lots of techniques to implement an approach.  You are well aware, I'm sure, that these are not fast sorting algorithms.  That's not the point; the point is that you solved the problem.&lt;/p&gt;
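&lt;p&gt;As one example of a different technique for the same approach, here is a bubble sort sketch in Python, built entirely on comparing and swapping adjacent elements:&lt;/p&gt;

```python
def bubble_sort(items):
    """Swap adjacent out-of-order elements until a full pass makes no swaps."""
    items = list(items)
    for n in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(n):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # no swaps means the list is already sorted
            break
    return items

print(bubble_sort([3, 1, 2]))  # → [1, 2, 3]
```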

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TSXsHOIX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/home-alone-plan.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TSXsHOIX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/home-alone-plan.jpg" alt="Home Alone battle plan"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://slate.com/culture/2015/11/home-alone-hit-theaters-25-years-ago-heres-how-they-filmed-its-bonkers-finale.html"&gt;image source&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;As you work through the details of your approach and apply the various techniques you need to solve your problem, you might find that they are slow.  At that point, you have a good indication that you need to modify your approach.  In the case of sorting, the approach of repeatedly finding the lowest value essentially boils down to performing local comparisons, i.e. items that are side by side in the array.  If you instead compare items that are far away from each other, then when you swap those you can eliminate multiple inversions at a time.  This is the basis of another large group of sorting algorithms, namely quicksort and merge sort, but again there are others as well.&lt;/p&gt;
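&lt;p&gt;To make that concrete, here is a minimal merge sort sketch in Python; the merge step compares elements that started far apart, which is where the speedup comes from:&lt;/p&gt;

```python
def merge_sort(items):
    """Split in half, sort each half recursively, then merge the sorted halves."""
    if len(items) >= 2:
        mid = len(items) // 2
        left = merge_sort(items[:mid])
        right = merge_sort(items[mid:])
        merged = []
        while left and right:
            # always take the smaller front element
            if right[0] >= left[0]:
                merged.append(left.pop(0))
            else:
                merged.append(right.pop(0))
        merged.extend(left)
        merged.extend(right)
        return merged
    return list(items)  # zero or one element is already sorted

print(merge_sort([5, 2, 4, 1, 3]))  # → [1, 2, 3, 4, 5]
```

&lt;p&gt;A production version would track indexes instead of calling &lt;code&gt;pop(0)&lt;/code&gt;, but the structure is the same.&lt;/p&gt;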

&lt;p&gt;This sort of back and forth from problem solving to identifying limitations that lead you to other approaches is a big part of what you learn through first-hand experience.  It's okay that you don't come up with the best approach on the first try; again, it's always most important that you just have a working solution.  There is a long philosophical conversation to be had here, but the gist of it is that we might come up with better algorithms for any problem tomorrow, even the ones we think are mostly solved today.  So don't worry about trying to come up with the best solution; a working solution is worth 99% of the credit.  Usually you will only have to worry about performance if the max time or amount of memory your algorithm is allowed to use is specified in the problem description.&lt;/p&gt;

&lt;h3&gt;
  
  
  The code
&lt;/h3&gt;

&lt;p&gt;The code is pretty straightforward, it's the actual text you write that implements your solution.  I'm not going to say much here because there's only one thing we care about really and that's that you write code that works when you run it.  In an interview, don't worry about writing perfectly styled code so much, just focus on getting to a working solution as quickly as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the key skills
&lt;/h2&gt;

&lt;p&gt;Now that you have a good idea of what the problem solving process looks like, let's see how you can highlight the specific skills that companies are looking for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem solving skills
&lt;/h3&gt;

&lt;p&gt;If it isn't clear yet, you are being hired to solve problems.  Focus on that first; verbally announce an approach before you start coding on every single problem.  This gives the interviewer an opportunity to ensure your approach is valid and to make sure that you are working toward that approach during your implementation.  When they are convinced the approach will work, then start coding.&lt;/p&gt;

&lt;p&gt;Another really big point here is that when you announce a valid approach, you are given much more leeway on the coding part of the interview because they know what you are trying to do.  When you fail to announce an approach, you force the interviewer to judge your problem solving skills entirely on the output of your code and whether or not it passes all the test cases.  You can essentially get full credit for problem solving skills just by announcing a valid approach before you start.  Good interviewers will ask you to state your approach before you start coding, but you should make this a habit as you practice for your next interview.&lt;/p&gt;

&lt;p&gt;I also want to stress again: solve the problem like a human during an interview.  It is critical that your interviewer believes your solution will work, so being expressive and easy to understand matters most when explaining your approach.  Cleverness is usually difficult to understand and trickier to implement, so it's generally a good idea to avoid it during an interview.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coding skills
&lt;/h3&gt;

&lt;p&gt;Coding skills are the ability to take a given approach and convert it into working code.  This is &lt;em&gt;not&lt;/em&gt; problem solving, this is highlighting the techniques you are familiar with and your knowledge of your language of choice.  If you are given the option to choose your own language, &lt;em&gt;always&lt;/em&gt; choose the one you are most familiar with and make no exceptions on this point because you will only handicap yourself.  If they tell you what language you have to write in, then it is what it is.&lt;/p&gt;

&lt;p&gt;There's not a lot to say about coding skills other than: can you produce working code in the amount of time you have for the interview, or not?  For most companies, this part of the grading is almost entirely binary; you pass or fail.  Style matters very little, but you should still aim to be expressive and parsable.  Expressive code is easy to understand, and parsable code makes it easy to figure out which lines of code go together, e.g. ensuring that all the statements within a block have the same indentation.  If your code works, the style doesn't count for much, but when it doesn't work, more expressive and parsable code makes it easier for the interviewer to help you find the trivial mistakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging skills
&lt;/h3&gt;

&lt;p&gt;Debugging is the process of locating the source of an error.  It is not problem solving, it is not coding, it is just detective work and once you identify the source of the problem, then you head back to coding or problem solving land.  This skill is incredibly valuable because getting stuck in an interview is an almost certainty, and debugging is how you get unstuck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WT-CqgPJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/manyadventures-01.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WT-CqgPJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/manyadventures-01.jpg" alt="Winnie the Pooh stuck"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://www.dvdizzy.com/manyadventures-bluray.html"&gt;image source&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;There are three primary kinds of getting stuck that require debugging in an interview.  The first is syntax errors and you can identify the source of most of those by looking at the line numbers in the compiler output.  Just go to that line and fix the problem.  &lt;/p&gt;

&lt;p&gt;The second is logic errors, the most common type of blocking error you run into during an interview.  Debugging these errors is also very simple, but no one ever really teaches it, not in school, not in bootcamps and not really on the job either; you just sort of develop a sense for it.  Luckily for you, here's a primer on how to do it effectively.  First, set your expectation.  Each line of code you write has some values going into it and produces some new values.  Go line by line and find the one where the output you expected from the previous line is violated.  Use print statements after each statement to validate that the current state matches your expectations.  One of them &lt;em&gt;will&lt;/em&gt; eventually be different from your expectations, and then you are done debugging.  It really is that easy, but so many people don't do it that it stands out in an interview.  Use this &lt;strong&gt;methodical&lt;/strong&gt; approach to locating the source of logic errors.&lt;/p&gt;
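&lt;p&gt;Here is a tiny, made-up example of that process in Python.  The function is supposed to return 1 + 2 + ... + n, but it has a bug, and the very first print statement exposes it:&lt;/p&gt;

```python
def sum_to(n):
    """Intended to return 1 + 2 + ... + n."""
    total = 0
    for i in range(n):  # the bug lives in this loop
        total += i
        print(f"i={i} total={total}")  # expectation: the first value added is 1
    return total

print(sum_to(3))  # prints 3, not the expected 6
# The first print shows i=0, which violates the expectation that the loop
# starts at 1, so range(n) should have been range(1, n + 1).
```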

&lt;p&gt;If there is no compiler available, then use a minimal example and walk through your code operation by operation, keeping track of all the variables (and I mean &lt;em&gt;all&lt;/em&gt; the variables).  &lt;strong&gt;Do not skip operations!&lt;/strong&gt;  This is the most common mistake I see when people are writing code without a compiler; they read only what they &lt;em&gt;expect&lt;/em&gt; to happen, and don't track what is &lt;em&gt;actually&lt;/em&gt; happening.  This is a very natural thing, and one of the reasons you usually cannot effectively edit your own writing: you tend to read what you intended to write, not what you actually wrote.  Track all the variables, write them down if you have to, and make sure you explicitly say the values out loud going into each statement, especially conditionals like &lt;code&gt;if&lt;/code&gt; statements and &lt;code&gt;for&lt;/code&gt; loop conditions.&lt;/p&gt;

&lt;p&gt;The third kind of error you run into is a timeout error.  This happens when you write inefficient code and it runs too slowly against larger datasets.  This is usually a much more difficult error to fix because it is a clear indication that your approach is not going to work.  If the output is correct but it's taking too long, then printing intermediate results is not going to help.  Instead you need to time each line to find the ones that are taking the longest.  &lt;em&gt;Do not guess&lt;/em&gt; which lines are slow, because unless you are an expert, you will almost certainly be wrong!  Time &lt;em&gt;each&lt;/em&gt; line, then when you identify the slow part, consider other approaches that might be faster.&lt;/p&gt;
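&lt;p&gt;In Python, for example, a few &lt;code&gt;perf_counter&lt;/code&gt; calls between sections are enough to see where the time goes:&lt;/p&gt;

```python
import time

data = list(range(100_000))

start = time.perf_counter()
squares = [x * x for x in data]   # section 1: build the squares
t1 = time.perf_counter()
total = sum(squares)              # section 2: sum them
t2 = time.perf_counter()

print(f"build squares: {(t1 - start) * 1000:.2f} ms")
print(f"sum:           {(t2 - t1) * 1000:.2f} ms")
```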

&lt;p&gt;Debugging skills are critical to your long term success as a developer, but they take time to develop.  Don't worry if this is the part you struggle with the most right now; just focus on making small improvements to your process each time you run into an error.  &lt;/p&gt;

&lt;p&gt;As the interviewee, debugging can be really frustrating, especially in your first few interviews, because you feel like it looks bad that you wrote a bug.  Here's a secret: bugs happen all the time.  In your entire time as a developer you will only write bug-free code a handful of times per year.  And then in a few weeks or months, someone will come along and find the bugs for you.  When you write a bug in an interview, the most important thing is to not panic.  Just shift out of code writing mode and into debugging mode as seamlessly as possible.  Practicing interview questions is more about mastering this transition than learning to write bug-free code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yhqh8JED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/drake-debugging.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yhqh8JED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cultureofdevelopment.com/img/drake-debugging.png" alt="Drake says &amp;quot;don't panic when you see an error&amp;quot;"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Things you will almost certainly be asked in your first few technical interviews
&lt;/h2&gt;

&lt;p&gt;First, you need to practice some technical interview questions before you go into your first few interviews.  There are tons of sites you can do this on, such as &lt;a href="http://leetcode.com/"&gt;LeetCode&lt;/a&gt;, &lt;a href="https://www.hackerrank.com/"&gt;HackerRank&lt;/a&gt;, &lt;a href="https://www.codility.com/"&gt;Codility&lt;/a&gt;, &lt;a href="https://projecteuler.net/"&gt;Project Euler&lt;/a&gt; and tons of others.  The best thing about practicing on these sites is that companies will often use them as the first filter for getting a job there, so being familiar with them can help ease that early anxiety.  In addition, doing some simple coding right before you get into a technical interview will help you warm up your problem solving skills.&lt;/p&gt;

&lt;p&gt;Next is to learn &lt;a href="https://dev.to/blog/how-to-talk-about-yourself-in-an-interview/"&gt;how to talk about yourself in an interview&lt;/a&gt;, which I have written about previously, so just read up on that.&lt;/p&gt;

&lt;p&gt;As for the technical questions you will run into, I can't tell you for sure, but you can do a little homework to help you get an idea.  Here are some really common question types which are the basis of tons of interview questions, just worded differently.&lt;/p&gt;

&lt;p&gt;The first is string manipulation.  Just learn all of the string manipulation functions in your language of choice.  This includes regex.  You don't need to be an expert in these things, but you need a working knowledge of them.  I also highly recommend learning common parsing functions, such as how to convert a string to an integer or a double.  While you probably won't have to parse a CSV file as part of your interview, your function will often be given input in string format.  Being prepared to convert it into a workable format for solving the problem will eliminate a lot of time spent looking up documentation.&lt;/p&gt;
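&lt;p&gt;As a quick illustration in Python (every mainstream language has equivalents), these are the kinds of operations worth having at your fingertips:&lt;/p&gt;

```python
import re

line = "42,hello world, 3.14 "

parts = [p.strip() for p in line.split(",")]  # split on commas and trim whitespace
n = int(parts[0])                             # string to integer
x = float(parts[2])                           # string to double
joined = "-".join(parts)                      # list back to a string
digits = re.findall(r"\d+", line)             # basic regex extraction

print(n, x, joined, digits)
# → 42 3.14 42-hello world-3.14 ['42', '3', '14']
```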

&lt;p&gt;The second really common thing is improving the runtime complexity of problems that can be naively solved using nested loops.  This has been written about time and time again, so just go read &lt;a href="https://dev.to/healeycodes/solving-puzzles-with-high-performance-javascript-3o4k"&gt;this really well written example&lt;/a&gt;.  The gist is that you preprocess one of the lists into a set or map in order to enable fast lookups.&lt;/p&gt;
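&lt;p&gt;The pattern in miniature, using a hypothetical "items common to two lists" problem (Python, with made-up inputs):&lt;/p&gt;

```python
def common_items_slow(a, b):
    # nested scan: the 'x in b' check walks the whole list, so this is O(n * m)
    return [x for x in a if x in b]

def common_items_fast(a, b):
    b_set = set(b)                       # preprocess one list once: O(m)
    return [x for x in a if x in b_set]  # set membership is O(1): O(n + m) overall

print(common_items_fast([1, 2, 3, 4], [3, 4, 5]))  # → [3, 4]
```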

&lt;p&gt;And that's really it.  Just to summarize, understand what skills the people interviewing you are looking for, then learn the common problem types, practice practice practice and don't panic when you write bugs during the interview, just focus on locating the source of the error quickly.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to reach out, you can find my information on &lt;a href="https://mentors.codingcoach.io/"&gt;mentors.codingcoach.io&lt;/a&gt; or post in the comments below or on reddit.&lt;/p&gt;

</description>
      <category>career</category>
      <category>interview</category>
    </item>
    <item>
      <title>How we sped up random forest processing, getting the lay of the land</title>
      <dc:creator>Nick Larsen</dc:creator>
      <pubDate>Thu, 20 Jun 2019 14:44:16 +0000</pubDate>
      <link>https://dev.to/nicklarsen/how-we-sped-up-random-forest-processing-getting-the-lay-of-the-land-517a</link>
      <guid>https://dev.to/nicklarsen/how-we-sped-up-random-forest-processing-getting-the-lay-of-the-land-517a</guid>
      <description>&lt;p&gt;Cross posted from &lt;a href="https://cultureofdevelopment.com/blog/how-we-sped-up-random-forest-processing-getting-the-lay-of-the-land/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I work on the team at &lt;a href="https://stackoverflow.com" rel="noopener noreferrer"&gt;stackoverflow.com&lt;/a&gt; that is responsible for the job ad selection model.  These things:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fso-job-ads.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fso-job-ads.PNG" alt="Stack Overflow Job Ads"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model predicts the likelihood that a user will click on a given job relative to all the other jobs.  At any given time there are a bunch of jobs on the job board that the current visitor can see, so we calculate the expectation for each job, then we perform a weighted random selection over those values.&lt;/p&gt;
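&lt;p&gt;The weighted random selection step itself is only a few lines.  Here is a sketch in Python rather than our production C#, with made-up job names and scores:&lt;/p&gt;

```python
import random

def weighted_pick(jobs, scores):
    """Pick one job with probability proportional to its score."""
    threshold = random.uniform(0, sum(scores))
    running = 0.0
    for job, score in zip(jobs, scores):
        running += score
        if running >= threshold:
            return job
    return jobs[-1]  # guard against floating point rounding at the top end

print(weighted_pick(["job-a", "job-b", "job-c"], [0.7, 0.2, 0.1]))
```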

&lt;p&gt;The problem we have is that the model needs to be updated from time to time to account for changing market conditions.  New technologies come out all the time, and both the mix of technologies being used on Stack Overflow and the kinds of jobs being listed change constantly.&lt;/p&gt;

&lt;p&gt;In this most recent round of updating the model, we decided to go outside of our typical process of using a &lt;a href="https://en.wikipedia.org/wiki/Lasso_(statistics)" rel="noopener noreferrer"&gt;lasso regression&lt;/a&gt; and instead decided to try out numerous kinds of other models.  At the end of the day, the best model based on our selection criteria turned out to be a &lt;a href="https://en.wikipedia.org/wiki/Random_forest" rel="noopener noreferrer"&gt;random forest&lt;/a&gt;.  A random forest is basically a collection of decision trees that all combine to make a prediction.&lt;/p&gt;

&lt;p&gt;You may have heard of decision trees because they are the butt of one of the biggest jokes in all of AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fscooby_doo_ai.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fscooby_doo_ai.jpg" alt="Decision Trees unmasked!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  image source: &lt;a href="https://9gag.com/gag/aeMmnPW/if-advanced-ai-was-a-scooby-doo-villain" rel="noopener noreferrer"&gt;9gag.com&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;But it turns out they are quite powerful after all, especially when you use a lot of them together, like a random forest.  I tend to get a little upset when I see that meme because the interesting part isn't really how they make the predictions, it's in how they are constructed, which is legitimate ML.&lt;/p&gt;

&lt;p&gt;Training a random forest can be quite expensive, however, and we were not able to feed it all of our sample data.  Instead we decided to try a similar model, &lt;a href="https://xgboost.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;XGBoost&lt;/a&gt;, which we could feed much more data in the same amount of time.  It is trained in a totally different manner, i.e. using &lt;a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)" rel="noopener noreferrer"&gt;boosting&lt;/a&gt;; however, on the prediction side it turns out to be almost exactly the same as a random forest.&lt;/p&gt;
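&lt;p&gt;On the prediction side, that sameness amounts to something like this toy Python sketch, where each stand-in tree is just a function of the features and the forest averages their outputs:&lt;/p&gt;

```python
def predict_forest(trees, features):
    """Combine the trees' individual predictions by averaging them."""
    return sum(tree(features) for tree in trees) / len(trees)

# Hypothetical stub "trees"; real ones walk a series of feature comparisons.
trees = [
    lambda f: 0.9 if f["tag_match"] else 0.2,
    lambda f: 0.8 if f["remote"] else 0.3,
    lambda f: 0.6,
]

print(round(predict_forest(trees, {"tag_match": True, "remote": False}), 3))  # → 0.6
```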

&lt;p&gt;So we have a new model, we hope it's going to be good, and now we need to run a test, which means we need to get this thing into an API that can be called from our ad server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try the easiest thing first
&lt;/h3&gt;

&lt;p&gt;One problem we have had here is that most of our models are trained in the &lt;a href="https://www.r-project.org/about.html" rel="noopener noreferrer"&gt;R language&lt;/a&gt;, and then we have to convert those models to run in C#, which is what we use for pretty much all production stuff here at Stack Overflow.&lt;/p&gt;

&lt;p&gt;It has been on our radar for a long time to try to run R in production so we can avoid the step of transferring the model to a new system.  There has been more than one major mistake on account of that transfer, and skipping it would also reduce our time to push out new models once we get them from our data scientist.&lt;/p&gt;

&lt;p&gt;So we tried it.  And then we were like, okay, let's see how fast it is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fxgboost-api-time-r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fxgboost-api-time-r.png" alt="R API times"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  image source: &lt;a href="https://twitter.com/juliasilge" rel="noopener noreferrer"&gt;Julia Silge&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;That's... ummm, not fast.  That's about a 25 millisecond average time to score a &lt;em&gt;single&lt;/em&gt; job.  But wait, we have a lot of jobs on the job board at any time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fnum-jobs-on-the-board.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fnum-jobs-on-the-board.PNG" alt="Number of jobs on Stack Overflow right now"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's how many you can see right now if you live in the United States and you can see how many jobs are available in your area right now by visiting &lt;a href="https://stackoverflow.com/jobs" rel="noopener noreferrer"&gt;stackoverflow.com/jobs&lt;/a&gt;.  And if we multiply that with the time it takes to score one job on average, we're looking at &lt;code&gt;3810 * 25ms = 95,250ms&lt;/code&gt;.  That's a lot of ms to render a job ad.  Too many in fact.  I guess we need to take a look at our overall requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  The requirements
&lt;/h3&gt;

&lt;p&gt;First off, the number of jobs fluctuates from time to time due to various marketing and syndication efforts, as well as the general seasonal fluctuations.  We need to build in some buffer here, so let's say this needs to work for 5,000 jobs, and of course the more the merrier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fad-request-time.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcultureofdevelopment.com%2Fimg%2Fad-request-time.PNG" alt="Job Ad Request Engine times"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at our current ad request times, it looks like the entire job selection process takes about 25 ms on average and roughly 65-70 ms for the 99th percentile.  That's to do all the scoring and choosing of the jobs based on those scores.  I'm not entirely sure how much of that time is the job selection process, but we should give ourselves some buffer here and let's just call it 25 ms to match what we have and we need to keep the 99th percentile under 65 ms.&lt;/p&gt;

&lt;p&gt;This application runs on our web tier, the exact same boxes that Stack Overflow and all the other Stack Exchange sites run on.  These boxes are getting slammed with requests, so pretty much anything in the path of a request needs to be handled in a single-threaded manner.  Testing something that operates in parallel would be a project all of its own and is beyond the scope of this one.&lt;/p&gt;

&lt;p&gt;We also need to be aware of the memory usage of this application.  The request handling needs to allocate as little memory as possible, though there is some leeway here because the boxes themselves already run at about 80% memory capacity.  There are 64 GB of RAM on each of these boxes and we need to keep some headroom, so going to the architecture team to ask for more than a couple of gigabytes on each box is a pretty big ask.  Whatever memory it takes to run a single request, we really need to multiply by 48, because there are 48 cores on each box.  If we want to stay under, say, 3 GB, then we have at most 62.5 MB per request.&lt;/p&gt;
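&lt;p&gt;To restate that back-of-envelope arithmetic (a quick sketch in Python purely for illustration; all the numbers come from the paragraph above):&lt;/p&gt;

```python
# Memory budget per request, using the figures quoted above.
total_ram_gb = 64        # RAM on each web-tier box
headroom_gb = total_ram_gb * 0.2   # boxes already sit at about 80% capacity
extra_budget_gb = 3      # roughly the most we could ask the architecture team for
cores = 48               # worst case: one in-flight request per core
per_request_mb = extra_budget_gb * 1000 / cores
print(per_request_mb)    # 62.5
```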

&lt;p&gt;And lastly, we really only care about making this specific model fast enough to run the test.  It is not necessary to build a general purpose random forest implementation here, just something that can test the model we were given so we can verify if it is an improvement over our old model or not.  It is totally fine to make concessions as necessary to support only this one instance, and if this model turns out to be better than what we currently have, &lt;em&gt;then&lt;/em&gt; we might consider generalizing it to support a wider range of model instances.&lt;/p&gt;

&lt;p&gt;So in short, the requirements for handling a single request are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;must score 5,000 jobs for a user&lt;/li&gt;
&lt;li&gt;must complete in roughly 25 ms, and 99% of requests need to finish in under 65 ms&lt;/li&gt;
&lt;li&gt;must be single-threaded&lt;/li&gt;
&lt;li&gt;must use less than roughly 65 MB total&lt;/li&gt;
&lt;li&gt;all tricks are allowed as long as &lt;em&gt;this one model instance&lt;/em&gt; works correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Let's stick to our bread and butter&lt;/h3&gt;

&lt;p&gt;We're much better at C# here than we are at R for performance-critical applications.  Additionally, the most important thing to do right now is to verify that this new model is an improvement over the old one, so building out a production pipeline that included R would have been a lot of additional work we didn't need to take on for this anyway.&lt;/p&gt;

&lt;p&gt;So we decided to write this in C#, and we went with the most naive solution possible.  Pretty much all the code to run the XGBoost model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;sealed&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DecisionTree&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeNode&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;FeatureIndex&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;TrueBranch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;FalseBranch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;LeafIndex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeNode&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;DecisionTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecisionTreeNode&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;Evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FeatureIndex&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;LeafIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nodeIndex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FeatureIndex&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrueBranch&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FalseBranch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nodeIndex&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;sealed&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RandomForest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;DecisionTree&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;trees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;RandomForest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecisionTree&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;trees&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trees&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;EvaluateProbability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Logit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// this converts the output to a probability&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;Logit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exp&lt;/span&gt;&lt;span class="p"&gt;(-&lt;/span&gt;&lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's basically the whole working part of it.  I left off the parsing code because it's not particularly important, but so you're aware of what the actual output from the XGBoost model in R looks like, here's a snippet:&lt;/p&gt;

&lt;pre&gt;
booster[0]
0:[f0&amp;lt;0.99992311] yes=1,no=2,missing=1,gain=97812.25,cover=218986
1:leaf=-0.199992761,cover=27584.75
2:[f17&amp;lt;0.000367681001] yes=3,no=4,missing=3,gain=10373.0732,cover=191401.25
3:[f2&amp;lt;0.5] yes=5,no=6,missing=5,gain=4121.85938,cover=103511.5
5:[f6&amp;lt;0.00216802233] yes=9,no=10,missing=9,gain=873.340759,cover=50533.25
9:[f732&amp;lt;0.5] yes=17,no=18,missing=17,gain=515.368896,cover=33356.75
17:leaf=-0.00213295687,cover=25503.5
18:leaf=-0.0314288437,cover=7853.25
10:[f732&amp;lt;0.5] yes=19,no=20,missing=19,gain=276.522034,cover=17176.5
19:leaf=0.0253729131,cover=13474
20:leaf=-0.00548130181,cover=3702.5
6:[f734&amp;lt;0.237819463] yes=11,no=12,missing=11,gain=2141.23145,cover=52978.25
11:[f8&amp;lt;0.00104575348] yes=21,no=22,missing=21,gain=566.334961,cover=35689
21:leaf=-0.0620479584,cover=24457.5
22:leaf=-0.0349165387,cover=11231.5
12:[f762&amp;lt;0.308019817] yes=23,no=24,missing=23,gain=483.886719,cover=17289.25
23:leaf=-0.0144120604,cover=16450.5
24:leaf=0.063411735,cover=838.75
4:[f2&amp;lt;0.5] yes=7,no=8,missing=7,gain=2694.23291,cover=87889.75
7:[f27&amp;lt;0.000739371637] yes=13,no=14,missing=13,gain=928.447266,cover=44100.5
13:[f732&amp;lt;0.5] yes=25,no=26,missing=25,gain=285.069702,cover=17082.25
25:leaf=0.032621529,cover=13427.25
26:leaf=0.00112144416,cover=3655
14:[f285&amp;lt;0.000919258455] yes=27,no=28,missing=27,gain=421.745117,cover=27018.25
27:leaf=0.0483669229,cover=20145
28:leaf=0.077062957,cover=6873.25
8:[f734&amp;lt;0.103942066] yes=15,no=16,missing=15,gain=1591.2124,cover=43789.25
15:[f101&amp;lt;0.000240761583] yes=29,no=30,missing=29,gain=608.92157,cover=24192.75
29:leaf=-0.0209285971,cover=14574.75
30:leaf=0.0114876805,cover=9618
16:[f722&amp;lt;0.5] yes=31,no=32,missing=31,gain=601.422363,cover=19596.5
31:leaf=0.0258833747,cover=18429.75
32:leaf=0.099892959,cover=1166.75
booster[1]
0:[f0&amp;lt;0.99992311] yes=1,no=2,missing=1,gain=80168.1719,cover=218645.734
1:leaf=-0.181867003,cover=27310.75
2:[f17&amp;lt;0.000390548841] yes=3,no=4,missing=3,gain=8405.98535,cover=191334.984
3:[f2&amp;lt;0.5] yes=5,no=6,missing=5,gain=3338.54272,cover=103498.18
5:[f5&amp;lt;0.000531250262] yes=9,no=10,missing=9,gain=750.746765,cover=50538.3281
9:[f732&amp;lt;0.5] yes=17,no=18,missing=17,gain=355.950684,cover=32388.959
17:leaf=-0.00289925188,cover=24663.1035
18:leaf=-0.0274959896,cover=7725.85449
10:[f732&amp;lt;0.5] yes=19,no=20,missing=19,gain=279.289886,cover=18149.3711
19:leaf=0.0230532587,cover=14319.1934
20:leaf=-0.00734602008,cover=3830.17749
6:[f734&amp;lt;0.237819463] yes=11,no=12,missing=11,gain=1736.53467,cover=52959.8516
11:[f5&amp;lt;0.000720986514] yes=21,no=22,missing=21,gain=514.774414,cover=35668.0508
21:leaf=-0.0571310222,cover=22939.1699
22:leaf=-0.0320498347,cover=12728.8799
12:[f722&amp;lt;0.5] yes=23,no=24,missing=23,gain=468.159485,cover=17291.8027
23:leaf=-0.0137320729,cover=16250.6074
24:leaf=0.0554070361,cover=1041.19482
4:[f2&amp;lt;0.5] yes=7,no=8,missing=7,gain=2186.99561,cover=87836.8047
7:[f27&amp;lt;0.000739371637] yes=13,no=14,missing=13,gain=755.193359,cover=44065.7109
13:[f668&amp;lt;0.5] yes=25,no=26,missing=25,gain=245.545715,cover=17078.1777
25:leaf=0.0203356724,cover=16096.4053
26:leaf=0.0718296021,cover=981.772095
14:[f58&amp;lt;0.000670915877] yes=27,no=28,missing=27,gain=312.375,cover=26987.5352
27:leaf=0.0368296206,cover=10639.0859
28:leaf=0.0588531196,cover=16348.4482
8:[f734&amp;lt;0.237819463] yes=15,no=16,missing=15,gain=1328.46338,cover=43771.0977
15:[f27&amp;lt;0.000467836275] yes=29,no=30,missing=29,gain=631.899902,cover=28666.8477
29:leaf=-0.0222451035,cover=11761.168
30:leaf=0.00793752726,cover=16905.6797
16:[f722&amp;lt;0.5] yes=31,no=32,missing=31,gain=362.48999,cover=15104.248
31:leaf=0.0279367212,cover=14043.1475
32:leaf=0.0885346606,cover=1061.10071
&lt;/pre&gt;

&lt;p&gt;This snippet shows two trees, delimited by the lines that start with &lt;code&gt;booster[&lt;/code&gt;, and there are 1,000 of these trees in the model.  Due to the way this model was trained, the missing branch, which you would normally take when a feature is missing from the sample, is always the same as the "yes" branch (you can see &lt;code&gt;missing&lt;/code&gt; always matches &lt;code&gt;yes&lt;/code&gt; in the dump).  This is because we replace all missing values with 0, and all features have a positive value in our problem space, so the less-than condition is always true for a replaced value.  The gain and cover values are informational output from training and do not affect evaluation at all.  The first number on each line is the index of that node, which is what the "yes" and "no" parameters refer to.  The features are numbered 0 through about 840 in our data, so when you see &lt;code&gt;fXXX&lt;/code&gt; it means feature number XXX in the input.  The input, as you can see from the &lt;code&gt;Evaluate&lt;/code&gt; function, is just an array of doubles.&lt;/p&gt;
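&lt;p&gt;Since the parsing code is left out above, here is a rough sketch of how one could turn a single booster's dump lines into the flat node layout the &lt;code&gt;DecisionTree&lt;/code&gt; class expects.  This is Python purely for illustration, and the helper names are mine, not from the repo; the real parsing code may differ.&lt;/p&gt;

```python
import operator
import re

LEAF = -1  # matches LeafIndex in the C# code above

def parse_tree(lines):
    """Parse one booster's dump lines into (feature, value, yes, no) tuples."""
    nodes = {}
    for line in lines:
        node_id, rest = line.strip().split(":", 1)
        if rest.startswith("leaf="):
            # e.g. "leaf=-0.199992761,cover=27584.75"
            value = float(rest[5:].split(",")[0])
            nodes[int(node_id)] = (LEAF, value, 0, 0)
        else:
            # e.g. "[f0 ... 0.99992311] yes=1,no=2,missing=1,gain=...,cover=..."
            cond, branches = rest[1:].split("] ", 1)
            feature, value = re.match(r"f(\d+)\D(.+)", cond).groups()
            parts = dict(p.split("=", 1) for p in branches.split(","))
            nodes[int(node_id)] = (int(feature), float(value),
                                   int(parts["yes"]), int(parts["no"]))
    # node ids within a tree are contiguous, so a sorted list doubles as the index
    return [nodes[i] for i in sorted(nodes)]

def evaluate(nodes, features):
    """Walk the tree exactly like the C# Evaluate method."""
    feature, value, yes, no = nodes[0]
    while feature != LEAF:
        # operator.lt is the strict less-than from the dump's split condition
        feature, value, yes, no = nodes[yes if operator.lt(features[feature], value) else no]
    return value
```

A &lt;code&gt;RandomForest&lt;/code&gt; equivalent would then just sum &lt;code&gt;evaluate&lt;/code&gt; over all 1,000 parsed trees and apply the logistic function, as in the C# above.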

&lt;h3&gt;How close are we?&lt;/h3&gt;

&lt;p&gt;Let's try it out real fast and see.  This is just a simple timer wrapped around running this code on 5,000 examples.  One caveat here: make sure you run this in release mode, or your times will be wildly different.&lt;/p&gt;
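&lt;p&gt;The harness is nothing fancier than this shape (sketched in Python for illustration; the actual test in the repo is C#, and &lt;code&gt;evaluate&lt;/code&gt; here stands in for the model call):&lt;/p&gt;

```python
import time

def time_evaluations(evaluate, samples):
    # wall-clock timer wrapped around scoring every sample once
    start = time.perf_counter()
    for sample in samples:
        evaluate(sample)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"Time taken for {len(samples)} evaluations: {elapsed_ms:.4f} ms")
    return elapsed_ms
```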

&lt;blockquote&gt;
&lt;p&gt;[6/20/19 4:02:34 AM] Time taken for 5000 evaluations: 319.5548 ms&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hey, that's not so bad, only a factor of about 12-13 to go.  I should point out that when you're trying to reach a specific performance goal, the actual machine you're going to be running on makes a difference.  My dev box gets an update about every other year and is much more powerful than a lot of our production servers, especially the ones this will be running on.  In fact, if you run this on processors released in the last year or two, you might see times as low as 220 ms, like I did on my laptop, which is only 4 months old.  [obligatory works-on-my-machine badge here]  In an ideal world we would run this test on the machine we care about, but since we're so far from meeting our requirements already, it's not time for that yet.&lt;/p&gt;

&lt;p&gt;But hold up, let's not get ahead of ourselves; let's verify this is actually generating the correct values too.  Keep in mind that we have a bunch of samples here that we used for the timing, and we can now use them to verify it spits out the correct values.  These samples, and the expected outputs, were provided by our data scientist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;RandomForest&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// loaded in test setup&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;[][]&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// loaded in test setup&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;probablity_feature_index&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;840&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;foreach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EvaluateProbability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;probablity_feature_index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;Assert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;InRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1e-06&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1e-06&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Correctness verified."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;[6/20/19 4:02:34 AM] Correctness verified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So \o/ we're on the right track!&lt;/p&gt;

&lt;p&gt;Let's also take a second to recognize that the biggest win we're going to see here is simply moving away from tooling we are not familiar with, i.e. the R API described above, to something we are much better with.  We were able to reduce the time for a single sample evaluation from roughly 25 ms to roughly 0.047 ms, which is nearly three orders of magnitude.  I want to be clear, this is not because &lt;em&gt;R&lt;/em&gt; is slow, it's because &lt;em&gt;we&lt;/em&gt; are not familiar enough with R to write efficient code.  The rest of this entire series is about grinding out as much of that last order of magnitude as possible.&lt;/p&gt;

&lt;h3&gt;What's next?&lt;/h3&gt;

&lt;p&gt;Next time we're going to take a look at some really basic improvements that will net us a fair amount of win.  The first step in designing anything is to get it working correctly and set a baseline for performance.  The next few things we address are very common tips that can show major improvements in your own applications with only minor changes.&lt;/p&gt;

&lt;p&gt;If you are interested in following along and trying this stuff out for yourself, please clone &lt;a href="https://github.com/culture-of-development/random-forest-perf-blog" rel="noopener noreferrer"&gt;the repo&lt;/a&gt; and run the naive tests as shown in the readme.&lt;/p&gt;

</description>
      <category>randomforest</category>
      <category>performance</category>
      <category>dotnet</category>
    </item>
  </channel>
</rss>
