Isabella Tromba for Tangram

Train a Machine Learning Model to Predict the Programming Language in a Code Snippet

We are going to build a web application with a code editor that automatically predicts the programming language of the code it contains. This is similar to VSCode's language detection feature, which detects the language and applies syntax highlighting automatically.

As a programmer, I know that the following code is Python:

def foo():
  print("hello world")

This is Ruby:

def say_hello(name)
  return "Hello, " + name
end

And this is JavaScript:

function myFunction() {
  console.log("hello world")
}

We have a training dataset that we curated called languages.csv. The CSV file contains two columns: the first is the code snippet, and the second is the programming language of that snippet.

| code | language |
| --- | --- |
| def foo(): print("hello world") | python |
| function myFunction() { console.log("hello world") } | javascript |
| def say_hello(name) return "Hello, " + name end | ruby |
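On disk, rows in that format might look something like this (hypothetical rows shown with standard CSV quoting; the real dataset is linked below):

code,language
"def foo(): print(""hello world"")",python
"function myFunction() { console.log(""hello world"") }",javascript
"def say_hello(name) return ""Hello, "" + name end",ruby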

We can train a machine learning model to predict the programming language contained in the code snippet by running the following command:

tangram train --file languages.csv --target language

The CSV file languages.csv is a small dataset of code snippets and their corresponding language labels. You can download the full dataset here.

Under the hood, Tangram will take care of feature engineering, split our data into train and test sets, train a number of linear and gradient boosted decision tree models across a range of hyperparameter settings, and finally evaluate all of the models and write the best one to the current directory as languages.tangram.

Now, we can use this file languages.tangram to make predictions in our apps.

To make a prediction in JavaScript, all we have to do is import the tangram library, load the model file we just trained, and call the predict function on the model.

Here is the code to load the model:

import * as tangram from "@tangramdotdev/tangram";
import modelUrl from "./languages.tangram";

// Download the model.
let modelResponse = await fetch(modelUrl);
let modelData = await modelResponse.arrayBuffer();
// Load the model.
let model = new tangram.Model(modelData);

Then, we can just call the predict function, passing in the code snippet:

let code = 'def foo(): print("hello world")'
// Make a prediction
model.predict({ code })
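The object returned by predict carries the predicted label on its className property (the same field the component below reads). A minimal sketch of inspecting it:

let output = model.predict({ code: 'def foo(): print("hello world")' });
// The predicted class, e.g. "python" for this snippet.
console.log(output.className);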

We said we wanted to make this a React component that renders a code editor. Here is the full example code using the Ace code editor. Every time the code in the editor changes, we call model.predict, passing in the new code string.

import React, { useState } from "react";
import ReactDOM from "react-dom";
import AceEditor from "react-ace";
import * as tangram from "@tangramdotdev/tangram";
import modelUrl from "./languages.tangram";

// Download and load the model once, at module load time.
let modelResponse = await fetch(modelUrl);
let modelData = await modelResponse.arrayBuffer();
let model = new tangram.Model(modelData);

function App() {
  let [code, setCode] = useState("");
  let [language, setLanguage] = useState(null);
  let onChange = (newCode) => {
    // Predict the language every time the code in the editor changes.
    setLanguage(model.predict({ code: newCode }).className);
    setCode(newCode);
  };
  return (
    <>
      <p>{`Detected language: ${language}`}</p>
      <AceEditor value={code} mode={language} onChange={onChange} />
    </>
  );
}

let root = document.createElement("div");
document.body.appendChild(root);
ReactDOM.render(<App />, root);

Under the Hood

With Tangram, we were able to train a model with just a single command on the command line. In the following section, we will learn more about what Tangram is actually doing under the hood.

Tokenization

The first step in turning the code into features is called tokenization, where we split the code into individual tokens. One strategy for splitting a stream of characters into chunks, called tokens, is to split on whitespace.

Here is our Python code tokenized using whitespace as the token delimiter:

| token 1 | token 2 | token 3 | token 4 |
| --- | --- | --- | --- |
| def | foo(): | print("hello | world") |

This isn't so great because the string being printed, "hello world", gets split across tokens, and its first half ends up in the same token as the print function.

Another strategy is to use all non-alphanumeric characters as token boundaries. Here is our Python code tokenized using this strategy:

| token 1 | token 2 | token 3 | token 4 | token 5 | token 6 | token 7 | token 8 | token 9 | token 10 | token 11 | token 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| def | foo | ( | ) | : | print | ( | " | hello | world | " | ) |

For code, splitting on punctuation is better because now the print function name is no longer in the same token as the string we want to print, so our machine learning model can learn that the token print is associated with the Python language. (Of course, the string print can and will appear in other programming languages as well.)
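If it helps to see it concretely, here is a minimal JavaScript sketch of both strategies (just an illustration, not Tangram's internal tokenizer):

let code = 'def foo(): print("hello world")';

// Strategy 1: split on whitespace.
let whitespaceTokens = code.split(/\s+/).filter((t) => t.length > 0);
// ["def", "foo():", "print(\"hello", "world\")"]

// Strategy 2: use every non-alphanumeric character as a token boundary,
// keeping the punctuation characters themselves as tokens.
let punctuationTokens = code.match(/[A-Za-z0-9]+|[^A-Za-z0-9\s]/g);
// ["def", "foo", "(", ")", ":", "print", "(", "\"", "hello", "world", "\"", ")"]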

Feature Engineering

This is a great first step, but we still don't have something we can pass to a machine learning model. Remember, machine learning models take numbers (integers and floats) as input, and what we have so far is strings.

What we can do is turn every token into its own feature. For each token, we ask: does our input code contain this token? If the answer is yes, we assign a feature value of 1. If the answer is no, we assign a feature value of 0. This is called "Bag of Words" encoding, because after tokenization we treat everything as a bag of words, completely ignoring the structure and order in which those words appeared in the original code snippet.

To illustrate this better, the following two code snippets produce the exact same features:

Jumbled Python code snippet:

("hello)def:world"
()print foo

Regular Python code snippet:

def foo():
  print("hello world")
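A minimal sketch of this encoding (an illustration, not Tangram's actual implementation): build a vocabulary of known tokens, then mark each one as present (1) or absent (0) in a given snippet. Both snippets above produce the same feature vector:

function tokenize(code) {
  // Same punctuation-based tokenization as above.
  return code.match(/[A-Za-z0-9]+|[^A-Za-z0-9\s]/g) ?? [];
}

function bagOfWords(code, vocabulary) {
  let tokens = new Set(tokenize(code));
  // 1 if the token appears anywhere in the snippet, 0 otherwise.
  return vocabulary.map((token) => (tokens.has(token) ? 1 : 0));
}

let vocabulary = ["def", "foo", "(", ")", ":", "print", '"', "hello", "world"];
let regular = 'def foo():\n  print("hello world")';
let jumbled = '("hello)def:world"\n()print foo';
console.log(bagOfWords(regular, vocabulary)); // [1, 1, 1, 1, 1, 1, 1, 1, 1]
console.log(bagOfWords(jumbled, vocabulary)); // identical: order is ignored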

One way to make the machine learning model aware of the structure of the code is through n-grams. Commonly used n-grams are bigrams and trigrams. To make bigrams from our token stream, we just combine each pair of adjacent unigrams.

Unigram token features:

| token 1 | token 2 | token 3 | token 4 | token 5 | token 6 | token 7 | token 8 | token 9 | token 10 | token 11 | token 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| def | foo | ( | ) | : | print | ( | " | hello | world | " | ) |

Bigram token features:

| token 1 | token 2 | token 3 | token 4 | token 5 | token 6 | token 7 | token 8 | token 9 | token 10 | token 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| def foo | foo ( | ( ) | ) : | : print | print ( | ( " | " hello | hello world | world " | " ) |
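A minimal sketch of producing bigrams from a token list (again, just an illustration of the idea):

function bigrams(tokens) {
  let result = [];
  // Combine each token with the token that follows it.
  for (let i = 0; i < tokens.length - 1; i++) {
    result.push(tokens[i] + " " + tokens[i + 1]);
  }
  return result;
}

let tokens = ["def", "foo", "(", ")", ":", "print", "(", '"', "hello", "world", '"', ")"];
console.log(bigrams(tokens));
// ["def foo", "foo (", "( )", ") :", ": print", "print (", '( "', '" hello', "hello world", 'world "', '" )']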

You can see how we now have features that capture some of the structure of our code. If you really want machine learning to capture structure, you can use deep learning techniques, but that is out of scope for this tutorial.

So far, in our bag of words encoding, we are using a binary count method: if the token is present in the string, we assign a feature value of 1, and 0 otherwise. There are other feature weighting strategies we can use. For instance, we can use a counting strategy where we count the number of times each token appears in the text. We can also use a strategy called tf-idf, which downweights frequently occurring tokens.
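A rough sketch of those count-based alternatives (using the standard definitions; the exact formulas Tangram uses may differ):

// Counting strategy: how many times does each token appear in one snippet?
function tokenCounts(tokens) {
  let counts = new Map();
  for (let token of tokens) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// tf-idf: weight a token's count (term frequency) by how rare the token is
// across all snippets (inverse document frequency), so tokens that appear
// in nearly every snippet contribute less.
function tfidf(token, tokens, allSnippets) {
  let tf = tokenCounts(tokens).get(token) ?? 0;
  let df = allSnippets.filter((snippetTokens) => snippetTokens.includes(token)).length;
  let idf = Math.log(allSnippets.length / (1 + df));
  return tf * idf;
}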

By default, Tangram chooses a feature engineering strategy based on the input data. But you can completely configure which strategy you want to use by passing a config file to the command line:

tangram train --file languages.csv --target language --config config.json

To learn about all of the options for customizing training, check out the Tangram docs on custom configuration: https://www.tangram.dev/docs/guides/train_with_custom_configuration.

Training a Hyperparameter Grid

Finally, Tangram trains a number of machine learning models, including linear models and gradient boosted decision trees, and chooses the best model based on a hold-out comparison dataset. Since we are training a multiclass classifier, the metric we use to choose the best model is accuracy.
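Conceptually, that model selection step looks something like the following sketch (hypothetical names, just to illustrate picking the model with the highest hold-out accuracy):

// Accuracy: the fraction of hold-out examples whose predicted label
// matches the true label.
function accuracy(model, holdout) {
  let correct = holdout.filter(
    (example) => model.predict({ code: example.code }).className === example.language
  ).length;
  return correct / holdout.length;
}

// Train candidates over a grid of model types and hyperparameters,
// then keep whichever one scores best on the hold-out set.
function selectBest(candidates, holdout) {
  return candidates.reduce((best, model) =>
    accuracy(model, holdout) > accuracy(best, holdout) ? model : best
  );
}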

And that's it!

In this tutorial, we showed how to train a machine learning model to predict the programming language contained in a code snippet, and then use that model in a React app to detect the language of the code in a code editor.

Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.

  • Run tangram train to train a model from a CSV file on the command line.
  • Make predictions with libraries for Elixir, Go, JavaScript, PHP, Python, Ruby, and Rust.
  • Run tangram app to learn more about your models and monitor them in production.

Head over to https://www.tangram.dev and give it a try!
