<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tangram</title>
    <description>The latest articles on DEV Community by Tangram (@tangram).</description>
    <link>https://dev.to/tangram</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4918%2F4fd7c3dc-cc65-4d3c-a37c-28bf8beacd26.png</url>
      <title>DEV Community: Tangram</title>
      <link>https://dev.to/tangram</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tangram"/>
    <language>en</language>
    <item>
      <title>Train a Machine Learning Model to Predict the Programming Language in a Code Snippet</title>
      <dc:creator>Isabella Tromba</dc:creator>
      <pubDate>Tue, 15 Feb 2022 19:35:06 +0000</pubDate>
      <link>https://dev.to/tangram/train-a-machine-learning-model-to-predict-the-programming-language-in-a-code-snippet-153d</link>
      <guid>https://dev.to/tangram/train-a-machine-learning-model-to-predict-the-programming-language-in-a-code-snippet-153d</guid>
      <description>&lt;p&gt;We are going to build a web application that has a code editor that automatically predicts the programming language of the code contained in it. This is similar to &lt;a href="https://visualstudiomagazine.com/articles/2021/09/07/vs-code-aug21.aspx"&gt;VSCode's language detection feature&lt;/a&gt; that predicts the programming language and performs automatic syntax highlighting. &lt;/p&gt;

&lt;p&gt;As a programmer, I know that the following code is Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Ruby:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="no"&gt;Hello&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is JavaScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;myFunction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="nx"&gt;hello&lt;/span&gt; &lt;span class="nx"&gt;world&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have a training dataset that we curated called &lt;code&gt;languages.csv&lt;/code&gt;. The CSV file contains two columns: the first contains the code snippet and the second contains the programming language of that snippet. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;code&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;def foo(): print("hello world")&lt;/td&gt;
&lt;td&gt;python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;function myFunction() { console.log("hello world") }&lt;/td&gt;
&lt;td&gt;javascript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;def say_hello(name) return "Hello, " + name end&lt;/td&gt;
&lt;td&gt;ruby&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can train a machine learning model to predict the programming language contained in the code snippet by running the following command: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;tangram train --file languages.csv --target language&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The CSV file &lt;a href="https://docs.google.com/spreadsheets/d/1hUwPLxbexL6BMMrAdmp_a8ARWILr6P4YQ4BJUDWXf7Y/edit?usp=sharing"&gt;&lt;code&gt;languages.csv&lt;/code&gt;&lt;/a&gt; is a small dataset of programming language snippets and their corresponding language labels. You can download the full dataset &lt;a href="https://docs.google.com/spreadsheets/d/1hUwPLxbexL6BMMrAdmp_a8ARWILr6P4YQ4BJUDWXf7Y/edit?usp=sharing"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Under the hood, Tangram takes care of feature engineering, splits the data into training and testing sets, trains a number of linear and gradient boosted decision tree models across a range of hyperparameter settings, and finally evaluates all of the models and writes the best one to the current directory as &lt;code&gt;languages.tangram&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, we can use this file &lt;code&gt;languages.tangram&lt;/code&gt; to make predictions in our apps. &lt;/p&gt;

&lt;p&gt;To make a prediction in JavaScript, all we have to do is import the tangram library, load the model file we just trained, and call the &lt;code&gt;predict&lt;/code&gt; function on the model. &lt;/p&gt;

&lt;p&gt;Here is the code to load the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;tangram&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@tangramdotdev/tangram&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;modelUrl&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./languages.tangram&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Download the model.&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;modelResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modelUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;modelData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;modelResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Load the model.&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;tangram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modelData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can just call the &lt;code&gt;predict&lt;/code&gt; function, passing in the code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;def foo(): print("hello world")&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="c1"&gt;// Make a prediction&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We said we wanted to make this a React component that renders a code editor. Here is the full example code, which uses the &lt;a href="https://ace.c9.io"&gt;Ace code editor&lt;/a&gt;. Every time the code in the editor changes, we call &lt;code&gt;model.predict&lt;/code&gt;, passing in the new code string from the editor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;tangram&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@tangramdotdev/tangram&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;modelUrl&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./languages.tangram&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Download the model.&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;modelResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modelUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;modelData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;modelResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Load the model.&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;tangram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modelData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCode&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setLanguage&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;onChange&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newCode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;setLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newCode&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;setCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newCode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;`Detected language: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AceEditor&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onChange&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;onChange&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;div&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;ReactDOM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Under the Hood&lt;/h2&gt;

&lt;p&gt;With Tangram, we were able to train a model with just a single command on the command line. In the following section, we will learn more about what Tangram is actually doing under the hood. &lt;/p&gt;

&lt;h3&gt;Tokenization&lt;/h3&gt;

&lt;p&gt;The first step in turning the code into features is called tokenization: splitting the code into individual tokens. One strategy for splitting a stream of characters into chunks of characters, called &lt;code&gt;tokens&lt;/code&gt;, is to use whitespace. &lt;/p&gt;

&lt;p&gt;Here is our Python code tokenized using whitespace as the token delimiter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;token 1&lt;/th&gt;
&lt;th&gt;token 2&lt;/th&gt;
&lt;th&gt;token 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;def&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;foo():&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print("hello world")&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't so great, because the string being printed ends up in the same token as the &lt;code&gt;print&lt;/code&gt; function call.&lt;/p&gt;

&lt;p&gt;Another strategy for splitting characters into tokens is to use every non-alphanumeric character as a token boundary. Here is our Python code tokenized using this strategy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;token 1&lt;/th&gt;
&lt;th&gt;token 2&lt;/th&gt;
&lt;th&gt;token 3&lt;/th&gt;
&lt;th&gt;token 4&lt;/th&gt;
&lt;th&gt;token 5&lt;/th&gt;
&lt;th&gt;token 6&lt;/th&gt;
&lt;th&gt;token 7&lt;/th&gt;
&lt;th&gt;token 8&lt;/th&gt;
&lt;th&gt;token 9&lt;/th&gt;
&lt;th&gt;token 10&lt;/th&gt;
&lt;th&gt;token 11&lt;/th&gt;
&lt;th&gt;token 12&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;def&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;foo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hello&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;world&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For code, splitting on punctuation is better because now the &lt;code&gt;print&lt;/code&gt; function name is no longer in the same token as the string we want to print. So our machine learning model can learn that the word &lt;code&gt;print&lt;/code&gt; is associated with the python language. (Of course, the string &lt;code&gt;print&lt;/code&gt; can and will appear in other programming languages as well.)&lt;/p&gt;
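To make the two strategies concrete, here is a minimal sketch in Python (an illustration, not the tokenizer Tangram actually uses): whitespace splitting versus treating every non-alphanumeric character as a token boundary.

```python
import re

code = 'def foo(): print("hello world")'

# Strategy 1: split on whitespace.
whitespace_tokens = code.split()

# Strategy 2: every non-alphanumeric character is both a token boundary
# and a token of its own. The capture group keeps the punctuation.
punctuation_tokens = [t for t in re.split(r"([^A-Za-z0-9])", code) if t.strip()]

print(whitespace_tokens)   # ['def', 'foo():', 'print("hello', 'world")']
print(punctuation_tokens)  # ['def', 'foo', '(', ')', ':', 'print', '(', '"', 'hello', 'world', '"', ')']
```

The second list matches the twelve tokens in the table above.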

&lt;h3&gt;Feature Engineering&lt;/h3&gt;

&lt;p&gt;This is a great first step, but we still don’t have something we can pass to a machine learning model. Machine learning models take numbers (integers and floats) as input, and what we have so far is strings. &lt;/p&gt;

&lt;p&gt;What we can do is turn every token into its own feature. For each token, we ask: does our input code contain this token? If the answer is yes, we assign a feature value of 1. If the answer is no, we assign a feature value of 0. This is called "Bag of Words" encoding, because after tokenization we treat everything as a bag of words, completely ignoring the structure and order in which those words appeared in the original code snippet. &lt;/p&gt;

&lt;p&gt;To illustrate this better, the following two code snippets produce the exact same features:&lt;/p&gt;

&lt;p&gt;Jumbled python code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello)def:world"&lt;/span&gt;
&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regular python code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
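A quick way to check this claim is to tokenize both snippets and compare the resulting bags of words. This is a sketch using the split-on-non-alphanumeric tokenization described above, not Tangram's actual feature engineering code:

```python
import re

def tokenize(code):
    # Treat every non-alphanumeric character as a token and a boundary.
    return [t for t in re.split(r"([^A-Za-z0-9])", code) if t.strip()]

jumbled = '("hello)def:world"\n()print foo'
regular = 'def foo():\n  print("hello world")'

# Bag-of-words ignores order and structure, so the jumbled and regular
# snippets produce exactly the same set of tokens, hence the same features.
print(set(tokenize(jumbled)) == set(tokenize(regular)))  # True
```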



&lt;p&gt;One way to make the machine learning model aware of the structure of the code is through n-grams. Commonly used n-grams are bigrams and trigrams. To make bigrams from our token stream, we combine each pair of adjacent unigrams.&lt;/p&gt;

&lt;p&gt;Unigram token features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;token 1&lt;/th&gt;
&lt;th&gt;token 2&lt;/th&gt;
&lt;th&gt;token 3&lt;/th&gt;
&lt;th&gt;token 4&lt;/th&gt;
&lt;th&gt;token 5&lt;/th&gt;
&lt;th&gt;token 6&lt;/th&gt;
&lt;th&gt;token 7&lt;/th&gt;
&lt;th&gt;token 8&lt;/th&gt;
&lt;th&gt;token 9&lt;/th&gt;
&lt;th&gt;token 10&lt;/th&gt;
&lt;th&gt;token 11&lt;/th&gt;
&lt;th&gt;token 12&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;def&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;foo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hello&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;world&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bigram token features: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;token 1&lt;/th&gt;
&lt;th&gt;token 2&lt;/th&gt;
&lt;th&gt;token 3&lt;/th&gt;
&lt;th&gt;token 4&lt;/th&gt;
&lt;th&gt;token 5&lt;/th&gt;
&lt;th&gt;token 6&lt;/th&gt;
&lt;th&gt;token 7&lt;/th&gt;
&lt;th&gt;token 8&lt;/th&gt;
&lt;th&gt;token 9&lt;/th&gt;
&lt;th&gt;token 10&lt;/th&gt;
&lt;th&gt;token 11&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;def foo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;foo (&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;( )&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;) :&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;: print&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print (&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;( "&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;" hello&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hello world&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;world "&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;" )&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
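Making bigrams is just pairing each token with its right-hand neighbor. A minimal sketch:

```python
def bigrams(tokens):
    # Join each pair of adjacent unigrams into a single bigram token.
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

unigrams = ["def", "foo", "(", ")", ":", "print", "(", '"', "hello", "world", '"', ")"]
print(bigrams(unigrams)[:3])  # ['def foo', 'foo (', '( )']
```

Twelve unigrams yield eleven bigrams, as in the table above.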

&lt;p&gt;You can see that we now have features that capture some of the structure of our code. If you really want machine learning to capture structure, you can use deep learning techniques, but that is out of scope for this tutorial.&lt;/p&gt;

&lt;p&gt;So far, in our bag of words encoding, we are using a binary count method: if the token is present in the string, we assign a feature value of 1, and 0 otherwise. There are other feature weighting strategies we can use. For instance, we can use a counting strategy, where we count the number of times each token appears in the text. We can also use a strategy called &lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf"&gt;tf-idf&lt;/a&gt; that downweights frequently occurring tokens.&lt;/p&gt;
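The three weighting strategies can be sketched as follows. This uses one common simplified tf-idf variant for illustration; Tangram's exact formula may differ.

```python
import math

def binary(tokens, vocab):
    # 1 if the token appears anywhere in the document, else 0.
    return [1 if t in tokens else 0 for t in vocab]

def counts(tokens, vocab):
    # Number of times each vocabulary token appears in the document.
    return [tokens.count(t) for t in vocab]

def tf_idf(tokens, vocab, corpus):
    # Term frequency, downweighted by how many corpus documents contain the token.
    def idf(t):
        df = sum(1 for doc in corpus if t in doc)
        return math.log(len(corpus) / (1 + df))
    return [tokens.count(t) * idf(t) for t in vocab]
```

For example, with a corpus where `print` appears in most snippets but `def` in few, `print` receives a much lower tf-idf weight than `def`.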

&lt;p&gt;By default, Tangram chooses a feature engineering strategy based on the input data. But you can completely configure which strategy you want to use by passing a config file to the command line:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tangram train --file languages.csv --target language --config config.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To learn about all of the options for customizing training, check out the Tangram docs on custom configuration: &lt;a href="https://www.tangram.dev/docs/guides/train_with_custom_configuration"&gt;https://www.tangram.dev/docs/guides/train_with_custom_configuration&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Training a Hyperparameter Grid&lt;/h2&gt;

&lt;p&gt;Finally, Tangram trains a number of machine learning models including linear models and gradient boosted decision trees and chooses the best model based on a hold-out comparison dataset. Since we are training a multiclass classifier, the metric we use to choose the best model is &lt;code&gt;accuracy&lt;/code&gt;. &lt;/p&gt;
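Accuracy here is simply the fraction of held-out examples whose predicted class matches the true label; a minimal sketch:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that exactly match the true labels.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Two of three held-out snippets classified correctly.
print(accuracy(["python", "ruby", "javascript"], ["python", "ruby", "python"]))
```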

&lt;p&gt;And that's it!&lt;/p&gt;

&lt;p&gt;In this tutorial, we showed how to train a machine learning model to predict the programming language of a code snippet and then use that model in a React app to predict the language of the code in a code editor. &lt;/p&gt;

&lt;p&gt;Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;tangram train&lt;/code&gt; to train a model from a CSV file on the command line.&lt;/li&gt;
&lt;li&gt;Make predictions with libraries for &lt;a href="https://hex.pm/packages/tangram"&gt;Elixir&lt;/a&gt;, &lt;a href="https://pkg.go.dev/github.com/tangramdotdev/tangram-go"&gt;Go&lt;/a&gt;, &lt;a href="https://www.npmjs.com/package/@tangramdotdev/tangram"&gt;JavaScript&lt;/a&gt;, &lt;a href="https://packagist.org/packages/tangram/tangram"&gt;PHP&lt;/a&gt;, &lt;a href="https://pypi.org/project/tangram"&gt;Python&lt;/a&gt;, &lt;a href="https://rubygems.org/gems/tangram"&gt;Ruby&lt;/a&gt;, and &lt;a href="//lib.rs/tangram"&gt;Rust&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;tangram app&lt;/code&gt; to learn more about your models and monitor them in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Head over to &lt;a href="https://www.tangram.dev"&gt;https://www.tangram.dev&lt;/a&gt; and give it a try!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>react</category>
      <category>javascript</category>
      <category>node</category>
    </item>
    <item>
      <title>Writing the fastest GBDT library in Rust</title>
      <dc:creator>Isabella Tromba</dc:creator>
      <pubDate>Tue, 11 Jan 2022 19:30:05 +0000</pubDate>
      <link>https://dev.to/tangram/writing-the-fastest-gbdt-libary-in-rust-197k</link>
      <guid>https://dev.to/tangram/writing-the-fastest-gbdt-libary-in-rust-197k</guid>
      <description>&lt;p&gt;In this post, we will go over how we optimized our Gradient Boosted Decision Tree library. This is based on a talk that we gave at RustConf 2021: &lt;a href="https://www.youtube.com/watch?v=D1NAREuicNs" rel="noopener noreferrer"&gt;Writing the Fastest Gradient Boosted Decision Tree Library in Rust&lt;/a&gt;. The code is available on &lt;a href="https://github.com/tangramdotdev/tangram/tree/main/crates/tree" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The content of this post is organized into the following sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are GBDTs?&lt;/li&gt;
&lt;li&gt;Use Rayon to parallelize.&lt;/li&gt;
&lt;li&gt;Use cargo-flamegraph to find bottlenecks.&lt;/li&gt;
&lt;li&gt;Use cargo-asm to identify suboptimal code generation.&lt;/li&gt;
&lt;li&gt;Use intrinsics to optimize for specific CPUs.&lt;/li&gt;
&lt;li&gt;Use unsafe code to eliminate unnecessary bounds checks.&lt;/li&gt;
&lt;li&gt;Use unsafe code to parallelize non-overlapping memory access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What are GBDTs?&lt;/h2&gt;

&lt;p&gt;GBDT stands for Gradient Boosted Decision Tree. GBDTs are a type of machine learning model that performs incredibly well on tabular data, the kind of data you would normally find in a spreadsheet or CSV file.&lt;/p&gt;

&lt;p&gt;To get a feel for how GBDTs work, let’s go through an example of making a prediction with a single decision tree. Let’s say you want to predict the price of a house based on features like the number of bedrooms, bathrooms, and square footage. Here is a table with 3 features, &lt;code&gt;num_bedrooms&lt;/code&gt;, &lt;code&gt;num_bathrooms&lt;/code&gt;, and &lt;code&gt;sqft&lt;/code&gt;. The final column, &lt;code&gt;price&lt;/code&gt;, is what we are trying to predict.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;num_bedrooms&lt;/th&gt;
&lt;th&gt;num_bathrooms&lt;/th&gt;
&lt;th&gt;sqft&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1200&lt;/td&gt;
&lt;td&gt;$300k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;td&gt;2300&lt;/td&gt;
&lt;td&gt;$550k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2200&lt;/td&gt;
&lt;td&gt;$450k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;7000&lt;/td&gt;
&lt;td&gt;$990k&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To make a prediction with a decision tree, you start at the top of the tree, and at each branch you ask how one of the features compares with a threshold. If the value is less than or equal to the threshold, you go to the left child. If the value is greater, you go to the right child. When you reach a leaf, you have the prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rmpcxip4db9g7369l9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rmpcxip4db9g7369l9p.png" alt="Example Decision Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s make an example prediction. We have a house with 3 bedrooms, 3 bathrooms, and 2500 square feet. Let’s see what price our decision tree predicts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;num_bedrooms&lt;/th&gt;
&lt;th&gt;num_bathrooms&lt;/th&gt;
&lt;th&gt;sqft&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2500&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Starting at the top, the number of bedrooms is 3, which is less than or equal to 3, so we go left. The square footage is 2500, which is greater than 2400, so we go right and arrive at the prediction: $512K.&lt;/p&gt;
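The walk down the tree is just a couple of nested comparisons. In this sketch, the $512K leaf matches the figure; the other two leaf values are made up for illustration.

```python
def predict_tree(house):
    # Root split: number of bedrooms, threshold 3 (greater goes right).
    if house["num_bedrooms"] > 3:
        return 700_000  # hypothetical right-subtree leaf
    # Left-child split: square footage, threshold 2400.
    if house["sqft"] > 2400:
        return 512_000  # the leaf from the example
    return 250_000  # hypothetical leaf

print(predict_tree({"num_bedrooms": 3, "num_bathrooms": 3, "sqft": 2500}))  # 512000
```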

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6itoqd91ndlmjj42bvvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6itoqd91ndlmjj42bvvc.png" alt="Example Prediction Single Decision Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single decision tree isn’t very good at making predictions on its own, so we train many trees, one at a time, where each new tree predicts the error in the sum of the outputs of the trees before it. This is called gradient boosting over decision trees!&lt;/p&gt;

&lt;p&gt;Making a prediction with gradient boosted decision trees is easy. We start with a baseline prediction which in the case of regression (predicting a continuous value like the price of a home) is just the average price of houses in our dataset. Then, we run the process we described for getting a prediction out of a single decision tree for each tree and sum up the outputs. In this example, the prediction is $340K.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusk70j5i1g7tqozliml1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusk70j5i1g7tqozliml1.png" alt="Gradient Boosted Decision Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about GBDTs, check out the &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting" rel="noopener noreferrer"&gt;Wikipedia article on gradient boosting&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Rayon to Parallelize
&lt;/h2&gt;

&lt;p&gt;So now that we know a little about GBDTs, let's talk about how we made our code fast. The first thing we did was parallelize it. &lt;a href="https://github.com/rayon-rs/rayon" rel="noopener noreferrer"&gt;Rayon&lt;/a&gt; is a data-parallelism library for Rust that makes converting sequential operations into parallel ones extremely easy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqjlktm8ncvqkn5k5w2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqjlktm8ncvqkn5k5w2a.png" alt="Matrix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process of training trees takes in a matrix of training data which is n_rows by n_features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovc2s3df93fqy3rfgdki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovc2s3df93fqy3rfgdki.png" alt="Matrix with column-wise parallelization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To decide which feature to use at each node, we need to compute a score for each feature. We do this by iterating over each column in the matrix. The following is a sequential iteration over the columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="nf"&gt;.columns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// compute the score of branching on this feature&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can parallelize this code with Rayon. All we have to do is change the call to &lt;code&gt;iter&lt;/code&gt; to &lt;code&gt;par_iter&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="nf"&gt;.columns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.par_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// compute the score of branching on this feature&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rayon will keep a thread pool around and schedule items from your iterator to be processed in parallel. Parallelizing over the features works well when the number of features is larger than the number of logical cores on your computer. When the number of features is smaller than the number of logical cores, parallelizing over the features is less efficient: some cores sit idle, so we are not using all of the compute power available to us. You can see this clearly in the image below. Cores 1 through 4 have work to do because they have features 1 through 4 assigned to them, while cores 5 through 8 sit idle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvsdihtt824l4bzedh15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvsdihtt824l4bzedh15.png" alt="Core utilization with column-wise parallelization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this situation, we can parallelize over chunks of rows instead and make sure we have enough chunks so that each core has some work to do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc25shmhs0dk3p55xqj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc25shmhs0dk3p55xqj9.png" alt="Matrix with row-wise parallelization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each core now has some rows assigned to it, and no core is sitting idle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsrl0p4ri8ygrg37dtlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsrl0p4ri8ygrg37dtlg.png" alt="Core utilization with row-wise parallelization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributing the work across rows is super easy with Rayon as well. We just use the combinator &lt;code&gt;par_chunks&lt;/code&gt;!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="nf"&gt;.rows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.par_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// process the chunk&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are just a couple of the combinators available in Rayon. There are a lot of other high-level combinators that make it easy to express complex parallel computations. Check out &lt;a href="https://github.com/rayon-rs/rayon" rel="noopener noreferrer"&gt;Rayon&lt;/a&gt; on GitHub to learn more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cargo-flamegraph to find bottlenecks
&lt;/h2&gt;

&lt;p&gt;Next, we used &lt;a href="https://github.com/flamegraph-rs/flamegraph" rel="noopener noreferrer"&gt;cargo-flamegraph&lt;/a&gt; to find where most of the time was being spent. &lt;code&gt;cargo-flamegraph&lt;/code&gt; makes it easy to generate flamegraphs and integrates elegantly with cargo. You can install it with &lt;code&gt;cargo install&lt;/code&gt;, then run &lt;code&gt;cargo flamegraph&lt;/code&gt; to run your program and generate a flamegraph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y linux-perf
cargo install flamegraph
cargo flamegraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a simple example with a program that calls two subroutines, &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;bar&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run &lt;code&gt;cargo flamegraph&lt;/code&gt; we get an output that looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqpexb982pcvecfpy0w1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqpexb982pcvecfpy0w1.png" alt="Flamegraph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It contains a lot of extra functions that you have to sort through, but it boils down to something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe2o79c0aody6s3qbtb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe2o79c0aody6s3qbtb1.png" alt="Simplified Flamegraph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The y-axis of the graph is the call stack, and the x-axis is duration. The bottom of the graph shows that the entire duration of the program was spent in the &lt;code&gt;main&lt;/code&gt; function. Above that, you see that the main function’s time is split between calls to &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;bar&lt;/code&gt;: about two thirds of the time was spent in &lt;code&gt;foo&lt;/code&gt; and its subroutines, and about one third in &lt;code&gt;bar&lt;/code&gt; and its subroutines.&lt;/p&gt;

&lt;p&gt;In our code for training decision trees, the flamegraph showed one function where the majority of the time was spent. In this function, we maintain an array of the numbers &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;n&lt;/code&gt; that we call &lt;code&gt;indexes&lt;/code&gt;, and at each iteration of training we rearrange it. Then, we access an array of the same length, called &lt;code&gt;values&lt;/code&gt;, but in the order of the indexes in the &lt;code&gt;indexes&lt;/code&gt; array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// rearrange indexes&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results in accessing each item in the &lt;code&gt;values&lt;/code&gt; array out of order. We will refer back to this function throughout the rest of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cargo-asm to identify suboptimal code generation
&lt;/h2&gt;

&lt;p&gt;From the flamegraph, we knew which function was taking the majority of the time, which we briefly described above. We started by looking at the assembly the compiler generated for it to see if there were any opportunities to make it faster. We did this with &lt;a href="https://github.com/gnzlbg/cargo-asm" rel="noopener noreferrer"&gt;cargo-asm&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;cargo-asm
cargo asm &lt;span class="nt"&gt;--rust&lt;/span&gt; path::to::a::function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Like &lt;code&gt;cargo-flamegraph&lt;/code&gt;, &lt;code&gt;cargo-asm&lt;/code&gt; integrates nicely with cargo: you can install it with &lt;code&gt;cargo install&lt;/code&gt; and run it as a cargo subcommand.&lt;/p&gt;

&lt;p&gt;Here is a simple example with a function that adds two numbers and multiplies the result by two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sum_times_two&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sum_times_two&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;sum_times_two&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run &lt;code&gt;cargo asm&lt;/code&gt;, we get an output that looks like this. It shows the assembly instructions alongside the Rust code that generated them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// pub fn sum_times_two(x: i32, y: i32) -&amp;gt; i32 {&lt;/span&gt;
&lt;span class="c1"&gt;// let sum = x + y;&lt;/span&gt;
&lt;span class="n"&gt;add&lt;/span&gt;        &lt;span class="n"&gt;edi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esi&lt;/span&gt;
&lt;span class="c1"&gt;// let sum_times_two = sum * 2;&lt;/span&gt;
&lt;span class="n"&gt;lea&lt;/span&gt;        &lt;span class="n"&gt;eax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rdi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rdi&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;//}&lt;/span&gt;
&lt;span class="n"&gt;ret&lt;/span&gt;
&lt;span class="c1"&gt;//}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that due to all the optimizations the compiler does, there is often not a perfect correspondence between the Rust code and the assembly. Here is the hot loop from our training code again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we looked at the assembly for this loop, we were surprised to find an &lt;a href="https://www.felixcloutier.com/x86/imul" rel="noopener noreferrer"&gt;&lt;code&gt;imul&lt;/code&gt;&lt;/a&gt; instruction, which is an integer multiplication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;imul     rax, qword, ptr, [r8, +, 16]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What is that doing in our code? We are just indexing into an array of &lt;code&gt;f32&lt;/code&gt;s, which are 4 bytes each, so the compiler should be able to get the address of the ith item by multiplying i by 4. Multiplying by four is the same as shifting i left by two, and a shift is much cheaper than a general integer multiplication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why would the compiler not be able to produce a shift-left instruction? The &lt;code&gt;values&lt;/code&gt; array is a column in a matrix, and a matrix can be stored in either row-major or column-major order. This means that indexing into the column might require multiplying by the number of columns in the matrix, which is unknown at compile time. Since we were storing our matrix in column-major order, though, the column is contiguous, so the multiplication could be eliminated; we just had to convince the compiler of this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.as_slice&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// for index in indexes {&lt;/span&gt;
  &lt;span class="c1"&gt;// let mut value = valuues.get_mut(index);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did this by casting the &lt;code&gt;values&lt;/code&gt; array to a slice. This convinced the compiler that the &lt;code&gt;values&lt;/code&gt; array was contiguous, so it could access items using the shift left instruction, instead of integer multiplication.&lt;/p&gt;
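&lt;p&gt;The arithmetic the compiler exploits is easy to check directly. This standalone snippet, not Tangram code, confirms that the byte offset of the ith &lt;code&gt;f32&lt;/code&gt; in a contiguous slice equals i shifted left by two.&lt;/p&gt;

```rust
// For a contiguous slice of f32s, the byte offset of element i is
// i * size_of::<f32>() = i * 4, which equals i << 2, so the compiler
// can use a shift (or scaled addressing) instead of a general multiply.
fn main() {
    let values: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let slice: &[f32] = values.as_slice();
    for i in 0..slice.len() {
        // multiplying by the element size equals shifting left by two
        assert_eq!(i * std::mem::size_of::<f32>(), i << 2);
    }
    // indexing the contiguous slice reaches the expected element
    println!("{}", slice[5]);
}
```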

&lt;h2&gt;
  
  
  Use intrinsics to optimize for specific CPUs
&lt;/h2&gt;

&lt;p&gt;Next, we used compiler intrinsics to optimize for specific CPUs. Intrinsics are special functions that hint to the compiler to generate specific assembly code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3p0bzmaso7dvo8vhgem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3p0bzmaso7dvo8vhgem.png" alt="Values array out of order access"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember how we noticed that this code results in accessing the &lt;code&gt;values&lt;/code&gt; array out of order? This is really bad for cache performance, because CPUs prefetch memory on the assumption that you will access it roughly in order. If a value isn’t in cache, the CPU has to wait until it is loaded from main memory, making your program slower.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbo9bqiv8bwn2ybuz4o1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbo9bqiv8bwn2ybuz4o1.png" alt="Cache Hierarchy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, we know which values we are going to access a few iterations of the loop in the future, because the order is given by the &lt;code&gt;indexes&lt;/code&gt; array: 10 iterations in the future, we will be accessing &lt;code&gt;values[indexes[current_index + 10]]&lt;/code&gt;. We can hint to x86_64 CPUs to prefetch those values into cache using the &lt;a href="https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm_prefetch.html" rel="noopener noreferrer"&gt;&lt;code&gt;_mm_prefetch&lt;/code&gt;&lt;/a&gt; intrinsic. We experimented with different values of the &lt;code&gt;OFFSET&lt;/code&gt; until we got the best performance. If the &lt;code&gt;OFFSET&lt;/code&gt; is too small, the CPU will still have to wait for the data; if the &lt;code&gt;OFFSET&lt;/code&gt; is too large, data that the CPU needs might be evicted, and by the time the CPU gets to the iteration that needs it, it will no longer be there. The best offset depends on your computer's hardware, so tuning it can be more of an art than a science.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;mm_prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;OFFSET&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="c1"&gt;// do something with value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are interested in cache performance and writing performant code, check out &lt;a href="https://www.youtube.com/watch?v=rX0ItVEVjHc" rel="noopener noreferrer"&gt;Mike Acton's talk on Data-Oriented Design&lt;/a&gt; and &lt;a href="https://media.handmade-seattle.com/practical-data-oriented-design/" rel="noopener noreferrer"&gt;Andrew Kelly's recent talk at Handmade Seattle&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use unsafe code to eliminate unnecessary bounds checks
&lt;/h2&gt;

&lt;p&gt;Next, we used a touch of unsafe to remove some unnecessary bounds checks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// for index in indexes {&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="c1"&gt;//}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the time, the compiler can eliminate bounds checks when looping over values in an array. However, in this code, it has to check that &lt;code&gt;index&lt;/code&gt; is within the bounds of the values array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// for index in indexes {&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_unchecked_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But we said at the beginning that the &lt;code&gt;indexes&lt;/code&gt; array is a permutation of the values 0 to n, so the bounds check is unnecessary. We can remove it by replacing &lt;code&gt;get_mut&lt;/code&gt; with &lt;code&gt;get_unchecked_mut&lt;/code&gt;. We have to use unsafe code here, because Rust provides no way to communicate to the compiler that the values in the &lt;code&gt;indexes&lt;/code&gt; array are always in bounds of the &lt;code&gt;values&lt;/code&gt; array.&lt;/p&gt;
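&lt;p&gt;Here is a self-contained sketch of the pattern with made-up data. The &lt;code&gt;unsafe&lt;/code&gt; block is sound only because &lt;code&gt;indexes&lt;/code&gt; really is a permutation of 0 to n.&lt;/p&gt;

```rust
// Skipping the bounds check: `indexes` is a permutation of 0..n, so every
// lookup into `values` is in bounds by construction.
fn main() {
    let mut values: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
    let indexes: Vec<usize> = vec![2, 0, 3, 1]; // a permutation of 0..n
    for &index in &indexes {
        // SAFETY: `index` comes from a permutation of 0..values.len(),
        // so it is always within bounds.
        let value = unsafe { values.get_unchecked_mut(index) };
        *value *= 2.0;
    }
    println!("{:?}", values);
}
```

&lt;p&gt;This prints [2.0, 4.0, 6.0, 8.0]; each element is visited exactly once, just in a shuffled order.&lt;/p&gt;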

&lt;h2&gt;
  
  
  Use unsafe code to parallelize non-overlapping memory access
&lt;/h2&gt;

&lt;p&gt;Finally, we parallelized the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But is it even possible to parallelize? At first glance, it seems the answer is no, because we are accessing the &lt;code&gt;values&lt;/code&gt; array mutably in the body of the loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="nf"&gt;.par_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.for_each&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we try it, the compiler will give us an error indicating overlapping borrows. However, the &lt;code&gt;indexes&lt;/code&gt; array is a permutation of the values 0 to n, so we know that the access into the &lt;code&gt;values&lt;/code&gt; array is never overlapping.&lt;/p&gt;

&lt;p&gt;We can parallelize our code using unsafe Rust, wrapping a pointer to the values in a struct and unsafely marking it as Send and Sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nf"&gt;ValuesPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ValuesPtr&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ValuesPtr&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ValuesPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="nf"&gt;.par_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.for_each&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="na"&gt;.0&lt;/span&gt;&lt;span class="nf"&gt;.get_unchecked_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, going back to the code we started out with...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.get_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, when we combine the four optimizations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Making sure that the &lt;code&gt;values&lt;/code&gt; array is a contiguous slice.&lt;/li&gt;
&lt;li&gt;Prefetching values so they are in cache.&lt;/li&gt;
&lt;li&gt;Removing bounds checks because we know the indexes are always in bounds.&lt;/li&gt;
&lt;li&gt;Parallelizing over the indexes because we know they never overlap.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;this is the code we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nf"&gt;ValuesPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ValuesPtr&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ValuesPtr&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ValuesPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.as_slice&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="nf"&gt;.par_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.for_each&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;mm_prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;OFFSET&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="na"&gt;.0&lt;/span&gt;&lt;span class="nf"&gt;.get_unchecked_mut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// mutate the value&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
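&lt;p&gt;The unsafe parallel pattern above can be written out as a complete, compilable program. This sketch uses &lt;code&gt;std::thread::scope&lt;/code&gt; from the standard library in place of rayon's &lt;code&gt;par_iter&lt;/code&gt; so it needs no external crates, and it leaves out the x86-specific prefetch intrinsic; &lt;code&gt;ValuesPtr&lt;/code&gt; and &lt;code&gt;parallel_update&lt;/code&gt; are illustrative names, not part of the library.&lt;/p&gt;

```rust
use std::thread;

// Wrap a raw pointer to the values so it can be shared across threads.
struct ValuesPtr(*mut f32);

// SAFETY: `indexes` is a permutation of 0..values.len(), so no two
// threads ever write to the same element.
unsafe impl Send for ValuesPtr {}
unsafe impl Sync for ValuesPtr {}

/// Mutate `values[index]` for every `index`, in parallel.
/// The caller must guarantee `indexes` is a permutation of 0..values.len().
fn parallel_update(values: &mut [f32], indexes: &[usize], num_threads: usize) {
    let ptr = ValuesPtr(values.as_mut_ptr());
    // Split the indexes into one contiguous chunk per thread.
    let chunk_size = ((indexes.len() + num_threads - 1) / num_threads).max(1);
    thread::scope(|s| {
        for part in indexes.chunks(chunk_size) {
            let ptr = &ptr;
            s.spawn(move || {
                for &index in part {
                    // SAFETY: each index is in bounds and is visited
                    // exactly once, so no bounds check and no overlap.
                    unsafe { *ptr.0.add(index) += 1.0 };
                }
            });
        }
    });
}

fn main() {
    let mut values = vec![0.0f32; 8];
    // A permutation of 0..8 (reversed, for simplicity).
    let indexes: Vec<usize> = (0..8).rev().collect();
    parallel_update(&mut values, &indexes, 2);
    assert!(values.iter().all(|&v| v == 1.0));
}
```

&lt;p&gt;The safe equivalent with &lt;code&gt;get_mut&lt;/code&gt; cannot be spawned across threads, because each closure would need a mutable borrow of the whole slice; the pointer wrapper is what moves that proof obligation from the compiler to us.&lt;/p&gt;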



&lt;p&gt;Here are our training-time benchmarks comparing Tangram's gradient boosted decision tree library to &lt;a href="https://lightgbm.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;LightGBM&lt;/a&gt;, &lt;a href="https://xgboost.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;XGBoost&lt;/a&gt;, &lt;a href="https://catboost.ai/" rel="noopener noreferrer"&gt;CatBoost&lt;/a&gt;, and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html" rel="noopener noreferrer"&gt;sklearn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj5zejlprw56hohygd8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj5zejlprw56hohygd8w.png" alt="Training Time Benchmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see all the benchmarks, head over to &lt;a href="https://tangram.dev/benchmarks" rel="noopener noreferrer"&gt;https://tangram.dev/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are interested in reading the code or giving us a star, the project is available on &lt;a href="https://github.com/tangramdotdev/tangram" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>What machine learning can learn from Ruby on Rails</title>
      <dc:creator>Isabella Tromba</dc:creator>
      <pubDate>Mon, 10 Jan 2022 21:47:52 +0000</pubDate>
      <link>https://dev.to/tangram/what-machine-learning-can-learn-from-ruby-on-rails-4epg</link>
      <guid>https://dev.to/tangram/what-machine-learning-can-learn-from-ruby-on-rails-4epg</guid>
      <description>&lt;p&gt;I wrote my first end-to-end functioning web application using Ruby on Rails in &lt;a href="https://stellar.mit.edu/S/course/6/sp13/6.170/index.html"&gt;a class at MIT (6.170)&lt;/a&gt; in 2013. There were things that Rails automatically handled for me that I didn’t even realize were hard to do. Running &lt;code&gt;rails new&lt;/code&gt; just set up a completely functioning application. I never had to consider all of the components I would need to string together. Database migrations, routing, run and deploy scripts, tests, handling static assets, and more worked out of the box and the documentation clearly described how to build every part of my application. In fact, I assumed that writing web applications should always be this easy because I had never tried to write one from scratch. I was the beginner benefiting from my own ignorance that DHH talks about in &lt;a href="https://rubyonrails.org/doctrine/"&gt;The Rails Doctrine&lt;/a&gt;!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But beyond the productivity gains for experts, conventions also lower the barriers of entry for beginners. There are so many conventions in Rails that a beginner doesn’t even need to know about, but can just benefit from in ignorance. It’s possible to create great applications without knowing why everything is the way it is.&lt;/p&gt;

&lt;p&gt;That’s not possible if your framework is merely a thick textbook and your new application a blank piece of paper. It takes immense effort to even figure out where and how to start. Half the battle of getting going is finding a thread to pull.&lt;/p&gt;

&lt;p&gt;- DHH, The Rails Doctrine&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A couple of years later, when I was a machine learning engineer at Slack, getting machine learning into production felt a lot like the situation DHH describes in The Rails Doctrine: the framework as "merely a thick textbook" and my new application as "a blank piece of paper".&lt;/p&gt;

&lt;p&gt;To make things even worse, try googling “how to learn machine learning”. The recommended steps start to look like the curriculum for a PhD in Statistics, Math, and Computer Science.&lt;/p&gt;

&lt;p&gt;The problems don’t end once you have successfully trained a model. You still have to figure out how to get your model into production. The code you wrote in your Jupyter notebook needs to be translated into code that can be deployed. An entirely new job title, “Machine Learning Engineer”, was created just to solve this problem.&lt;/p&gt;

&lt;p&gt;In the Rails Doctrine, there is a section on “Value Integrated Systems”. DHH says that Rails is “A whole system that addresses an entire problem.”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Rails can be used in many contexts, but its first love is the making of integrated systems: Majestic monoliths! A whole system that addresses an entire problem. This means Rails is concerned with everything from the front-end JavaScript needed to make live updates to how the database is migrated from one version to another in production.&lt;/p&gt;

&lt;p&gt;That’s a very broad scope, as we’ve discussed, but no broader than to be realistic to understand for a single person. Rails specifically seeks to equip generalist individuals to make these full systems. Its purpose is not to segregate specialists into small niches and then require whole teams of such in order to build anything of enduring value.&lt;/p&gt;

&lt;p&gt;- DHH, The Rails Doctrine&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One sentence in that section really stood out to me: "Its [Rails'] purpose is not to segregate specialists into small niches and then require whole teams of such in order to build anything of enduring value". Today, this is exactly what companies are doing to get machine learning into production. They are required to assemble a team of specialists including Data Scientists, Machine Learning Engineers, Backend Engineers, and Ops teams.&lt;/p&gt;

&lt;p&gt;It would be great if we had something like Ruby on Rails for machine learning: a single system that provides the tools you need to go from data to a deployed machine learning model. Just as DHH says "Rails specifically seeks to equip generalist individuals to make these full systems", we need tools that equip generalist programmers, like front-end JavaScript engineers or back-end Ruby programmers, to build full machine learning systems.&lt;/p&gt;

&lt;h2&gt;
  Introducing Tangram
&lt;/h2&gt;

&lt;p&gt;Tangram is an all-in-one automated machine learning framework that makes it easy to add machine learning to your applications. Predictions happen directly in your existing applications, so there are no network requests and no need to set up a separate service to serve your models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;tangram train&lt;/code&gt; to train a model from a CSV file on the command line.&lt;/li&gt;
&lt;li&gt;Make predictions with bindings for &lt;a href="https://rubygems.org/gems/tangram"&gt;Ruby&lt;/a&gt;, &lt;a href="https://pypi.org/project/tangram"&gt;Python&lt;/a&gt;, &lt;a href="https://pkg.go.dev/github.com/tangramdotdev/tangram-go"&gt;Golang&lt;/a&gt;, &lt;a href="https://hex.pm/packages/tangram"&gt;Elixir&lt;/a&gt;, &lt;a href="https://www.npmjs.com/package/@tangramdotdev/tangram"&gt;JavaScript&lt;/a&gt;, &lt;a href="https://packagist.org/packages/tangram/tangram"&gt;PHP&lt;/a&gt;, or &lt;a href="https://lib.rs/tangram"&gt;Rust&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;tangram app&lt;/code&gt; to start a web application where you can learn more about your models and monitor them in production.&lt;/li&gt;
&lt;/ul&gt;
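&lt;p&gt;In Ruby, the workflow above looks roughly like this. This is a sketch, not a definitive example: it assumes you have already run &lt;code&gt;tangram train&lt;/code&gt; on a hypothetical &lt;code&gt;heart_disease.csv&lt;/code&gt; with a &lt;code&gt;diagnosis&lt;/code&gt; target column, the input fields are illustrative, and you should check the gem's documentation for the exact method names and CLI flags.&lt;/p&gt;

```ruby
require 'tangram'

# Load the model file produced by a command like:
#   tangram train --file heart_disease.csv --target diagnosis
# The file name and columns here are illustrative.
model = Tangram::Model.from_path('heart_disease.tangram')

# Predict directly in-process: no network request, no model server.
input = {
  age: 63,
  gender: 'male',
  chest_pain: 'typical angina',
}
output = model.predict(input)
```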

&lt;p&gt;You can check out the &lt;a href="https://rubygems.org/gems/tangram"&gt;Tangram Ruby Gem&lt;/a&gt;. We built it using Ruby FFI and the source is available on our &lt;a href="https://github.com/tangramdotdev/tangram/tree/main/languages/ruby"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tangram is a new project and there is a lot of work ahead. We’d love to get your feedback. Check out the project on &lt;a href="https://github.com/tangramdotdev/tangram"&gt;GitHub&lt;/a&gt;, and let us know what you think! If you like what we are working on, &lt;a href="https://github.com/tangramdotdev/tangram"&gt;give us a star&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ruby</category>
      <category>rails</category>
    </item>
  </channel>
</rss>
