After the first release of ChatGPT (GPT-3.5), many people got into prompt engineering. But typical prompts are just strings of text. Often, they are pretty lengthy strings, since you keep adding more instructions and details to the context:
When you work with multiple prompts, their dull, uniform appearance makes it hard to focus, and outdated fragments are easy to overlook, especially after iterating on them for over an hour.
Wouldn't it be nice to add some highlighting like we do for regular code? This is exactly what we are going to do with our extension! (here's a link to it, and here is the code).
We want an extension that:
- brings theme-agnostic highlighting to prompts
- does the highlighting as close to real time as possible (combining heuristics and an ML model)
- highlights only those strings that we explicitly requested and does not alter any other strings
What can extensions do?
- customize IDE appearance (themes)
- extend UI with specialized panels (Docker, Git, etc.)
- render custom webpages (Markdown, diagrams, etc.)
- add language support (syntax, formatting, intellisense, etc.)
- enable runtime-specific debugging tools (Python, Node.js, etc.)
There are many examples and APIs available for each direction.
There are two main restrictions to consider: extensions cannot access the DOM, and custom style sheets for the VSCode UI are not allowed.
And sure, in case you're wondering, you can run DOOM inside it.
Tokens
Tokens are the smallest units that VSCode recognizes and handles individually:
- keywords (if, else, for)
- variables (myVariable, userName)
- functions (calculateTotal(), getData())
- strings ("Hello, world!", 'text')
- comments (// my comment, /* comment */)
- etc.
They help VSCode understand the code better and offer helpful features like syntax highlighting, autocomplete, and refactoring.
Semantic tokens carry additional information about the code's meaning and structure, allowing VSCode to highlight code in a meaningful way:
- normally, VSCode only knows basic grammar rules
- with semantic tokens, VSCode can treat a name as a variable or a function depending on its context
Here's how it works:
- VSCode asks language-specific extensions to analyze the code and provide semantic tokens.
- Extensions parse the code, figure out roles for tokens (functions, variables, classes, interfaces, types, parameters), and return these tokens with their semantic meaning.
- Once VSCode gets these tokens, it highlights the code. As a bonus, it applies colors from the user's current theme, which makes the highlighting theme-agnostic.
But if the prompts are just words in a string, how do we highlight them?
- We will identify all strings that begin with our unique marker, which will function as a switch
- Then, we will classify each word into 4 classes: actions, objects, subjects, and descriptors
- And create semantic tokens for each word in these classes with settings that will make VSCode highlight them differently.
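To make this concrete, here's roughly what an opted-in prompt could look like (the prompt text is made up; the marker matches the one used later in the extension code):
// Only strings that start with our marker are highlighted
const prompt = `#!promptskeeper
You are a skilled editor. Rewrite the lengthy description and
summarize the documentation for the reviewer.`
// Everything else stays untouched
const regular = 'just a normal string'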
Classifying words using some ML magic
We have a text classification task, and currently, transformer models perform this job the best.
Transformer models are a type of machine learning model that understand the meaning and context of words, sentences, or even images all at once, instead of reading them one by one. They use a mechanism called attention to figure out which parts are most important – like how humans focus on key words when reading. This makes them great for tasks like writing, translating, summarizing, and more.
Since VSCode runs on Electron, the simplest way to use such models in our extension is through the @xenova/transformers library:
- it gives access to these powerful models without ever leaving JavaScript or TypeScript
- models run locally on the user's machine, no extra server needed
- no complex installation needed, it's ready to use out of the box
Using an LLM would be excessive and negatively impact UX, as even smaller versions require significant storage, and their capabilities are overly complex for our simple task. Therefore, we will create a small model fine-tuned for our task based on DistilBERT.
DistilBERT is a smaller and faster version of a larger model called BERT. BERT is a language model developed by Google and is considered an ancestor of modern LLMs. It performs pretty well in tasks such as text classification, sentiment analysis, filling in missing words, etc.
To create our model, we need to:
- Collect the dataset of words for each class
- Fine-tune the original model
- Ensure it does its job
- Convert it to the format recognized by @xenova/transformers
- Quantize (compress) the model
- Upload it to Hugging Face
This process could fill another "How to" article, which we won't cover here. To summarize, here's what I did:
- I used GPT-4o to generate words for each class. I managed to produce a couple of thousand for each, even though it required a few iterations
- The same GPT-4o helped me with the code for fine-tuning the model
- I played with it briefly to ensure it accurately predicted classes. This required several iterations of training the model with different parameters and testing it
- Converting it to the required format only takes a single command to execute (see the command after this list)
- Quantizing the model is also a straightforward task that takes a single function call
- And finally, we simply upload the files to make the model available on Hugging Face.
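For reference, the transformers.js project ships a conversion script that can also quantize the model in the same pass; at the time of writing, the documented invocation looks roughly like this (the model placeholder is yours to fill in):
# run from a clone of the xenova/transformers.js repository
python -m scripts.convert --quantize --model_id <your_model_name_or_path>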
The complete code, training set, and list of used packages related to model creation are here. After completing all the steps, this model essentially enables us to do something like this:
import { pipeline } from "@xenova/transformers";
const classifier = await pipeline("text-classification", "mbalabash/distilbert_subjects_actions_objects_descriptors");
const prediction = await classifier("beautiful");
console.log(prediction)
// OUTPUT:
// [{label: 'DESCRIPTOR', score: 0.9923667311668396}]
The model size is only 66 MB, which is quite suitable for our extension.
Speeding up work
Even though the model should do the job, we can also add a couple of things to improve our extension's UX.
It is possible to add a less compute-intensive way to classify words using regex. This will be our default approach. If the heuristic doesn't produce a match, we will fall back to model-based classification.
I once again asked GPT-4o to provide suffixes for each word class that would likely match the respective words (and would not conflict with each other).
Here is what I got:
// dictionary.js
const suffixes = {
  subjects: /(or|ian|eer|ster|ist|ite|ee)$/i,
  actions: /(ing|ize|ify|ate|ish|ect)$/i,
  objects: /(tion|sion|ment|ity|ism|ness|hood|age|ance|ence|ery|dom|ship)$/i,
  descriptors: /(ful|ous|ible|able|less|ive|ward|wise|ways|fold|most)$/i
}

module.exports = { suffixes }
Then the usage would be something like this:
// classification.js
function classifyWordUsingHeuristics(word) {
if (suffixes.subjects.test(word.toLowerCase())) {
...
We can also implement an in-memory cache to avoid repeating the classification of words we've already processed.
// classification.js
const PREDICTION_CACHE = new Map()
// in both classification functions, we will check whether we have processed this word before
async function classifyWordUsingModel(word, classifier) {
if (PREDICTION_CACHE.has(word)) {
return PREDICTION_CACHE.get(word)
}
...
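For reference, here's one way the rest of that function might look, based on the model output format we saw earlier (the repo's version may differ in the details):
// classification.js (illustrative completion)
async function classifyWordUsingModel(word, classifier) {
  if (PREDICTION_CACHE.has(word)) {
    return PREDICTION_CACHE.get(word)
  }
  // the classifier returns something like [{ label: 'DESCRIPTOR', score: 0.99 }]
  const [prediction] = await classifier(word)
  const result = { category: prediction.label.toLowerCase(), score: prediction.score }
  PREDICTION_CACHE.set(word, result)
  return result
}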
Putting it all together
To kick off the work on our extension, we only need to run a couple of commands and answer a few yes/no questions in the CLI. After that, it is all set. There is a straightforward official guide for this stage.
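For reference, the official guide scaffolds an extension with Yeoman and the VSCode extension generator:
npm install --global yo generator-code
yo code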
We need to fill in specific properties in package.json to tell the world what our extension does and how it does it (see the full list). Think of this as:
- our extension’s resume (name, description, version, main, etc.)
- and its access control list (what it contributes to VSCode)
There are two essential properties to understand, as they also influence how our extension will function:
- activation events: events upon which our extension becomes active
- contribution points: declarations that specify which functionalities within VSCode our extension will extend
VSCode extension lifecycle:
- VSCode starts and scans all installed extensions' package.json files
- It checks each extension's activationEvents to see what conditions should trigger activation
- Extensions stay inactive until one of their activation conditions is met
- Once triggered, VSCode:
  - Loads the extension's main file
  - Runs the activate(context) function
- Extension does its thing
- When the extension is unloaded, deactivate() is called (if it exists)
To minimize the impact on performance, our extension should be activated when the onStartupFinished event is fired. We will not extend any VSCode functionality, so the contributes property stays empty.
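Put together, the relevant part of our package.json could look something like this (the name, version, and engines range are illustrative):
// package.json (illustrative fragment)
{
  "name": "promptskeeper-highlighting",
  "description": "Theme-agnostic highlighting for prompts",
  "version": "0.0.1",
  "main": "./extension.js",
  "engines": { "vscode": "^1.80.0" },
  "activationEvents": ["onStartupFinished"],
  "contributes": {}
}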
Here is what the main file looks like in our case:
// extension.js
const vscode = require('vscode')
// MARKER (the opt-in string, #!promptskeeper) and blockRegex (which finds strings containing it) are defined in the repo and omitted here for brevity

async function activate(context) {
// We define that we want to provide semantic tokens for JavaScript, TypeScript, Python, and Go files
const selector = [
{ language: 'javascript', scheme: 'file' },
{ language: 'typescript', scheme: 'file' },
{ language: 'python', scheme: 'file' },
{ language: 'go', scheme: 'file' } // VSCode's language identifier for Go is 'go'
]
// We define the types of semantic tokens that we will provide
const legend = new vscode.SemanticTokensLegend(
['operator', 'keyword', 'namespace', 'variable', 'number'],
['declaration']
)
const classifier = await initializeModel() // model loading is shown below
const provider = {
_eventEmitter: new vscode.EventEmitter(),
get onDidChangeSemanticTokens() {
return this._eventEmitter.event
},
async provideDocumentSemanticTokens(document) {
const tokensBuilder = new vscode.SemanticTokensBuilder(legend)
const text = document.getText()
const blocks = text.matchAll(blockRegex) // Finds all strings that have the marker (#!promptskeeper)
for (let block of blocks) {
const classifiedWords = await extractClassifiedWords(block[0], classifier)
generateSemanticTokens(classifiedWords, block, document, tokensBuilder)
}
return tokensBuilder.build()
}
}
context.subscriptions.push(
vscode.languages.registerDocumentSemanticTokensProvider(selector, provider, legend)
)
context.subscriptions.push(
vscode.workspace.onDidChangeTextDocument(event => {
// Check if tokens should be recalculated after changing the file
if (
selector.some(
s =>
vscode.languages.match(s, event.document) && event.document.getText().includes(MARKER)
)
) {
// Signal that tokens need to be recalculated
// This will cause VS Code to request new tokens
provider._eventEmitter.fire()
}
})
)
}
function deactivate() {}

module.exports = { activate, deactivate }
Before classifying and highlighting words in prompts, we must initialize the model:
// classification.js
const { join } = require('path')
const { existsSync, mkdirSync } = require('fs')

const contextEnv = process.env // HOME on Unix, USERPROFILE on Windows
const MODEL_ID = 'mbalabash/distilbert_subjects_actions_objects_descriptors'
const MODEL_CACHE_DIR = join(
  contextEnv.HOME || contextEnv.USERPROFILE || '.',
  '.cache',
  'promptskeeper-vscode-highlighting-extension',
  'models'
)

if (!existsSync(MODEL_CACHE_DIR)) {
  mkdirSync(MODEL_CACHE_DIR, { recursive: true })
}
async function getModel(pipeline, progress) {
return await pipeline('text-classification', MODEL_ID, {
cache_dir: MODEL_CACHE_DIR,
local_files_only: false, // download the model if not found in cache
progress_callback: chunk => {
if (chunk.status) {
progress.report({ message: chunk.status })
}
if (chunk.progress) {
progress.report({
message: `Downloading: ${Math.round(chunk.progress)}%`
})
}
}
})
}
async function initializeTransformers() {
process.env.HF_HOME = MODEL_CACHE_DIR
process.env.TRANSFORMERS_CACHE = MODEL_CACHE_DIR
try {
const { pipeline, env } = await import('@xenova/transformers')
env.cacheDir = MODEL_CACHE_DIR
env.allowLocalModels = true
return await vscode.window.withProgress( // show beautiful message in VSCode
{
location: vscode.ProgressLocation.Notification,
title: 'Initializing model...',
cancellable: false
},
async progress => {
progress.report({ message: 'Downloading...' })
return await getModel(pipeline, progress)
}
)
} catch (error) {
console.error('Model initialization error:', error)
vscode.window.showErrorMessage('Failed to load model: ' + error.message)
return null
}
}
Now that our model is in place, we can jump to the function that extracts the words we want to highlight:
// classification.js
async function extractClassifiedWords(prompt, classifier) {
if (!classifier) {
vscode.window.showErrorMessage('Word categorization failed: model not initialized')
return {} // return an empty result so downstream token generation is a no-op
}
const { text, categories } = preparePromptForClassification(prompt)
try {
const words = splitTextIntoWords(text)
const ambiguousWords = []
for (const word of words) {
if (isExceptionWord(word)) {
continue
}
// Try to classify the word using heuristics first
const { confidence, category } = classifyWordUsingHeuristics(word)
if (confidence === 0) {
ambiguousWords.push(word)
} else {
categories[category].push(word)
}
}
// Handle ambiguous words using our model
const BATCH_SIZE = 6
for (let i = 0; i < ambiguousWords.length; i += BATCH_SIZE) {
const batch = ambiguousWords.slice(i, i + BATCH_SIZE)
const predictions = await Promise.all(batch.map(word => classifyWordUsingModel(word, classifier)))
predictions.forEach((prediction, index) => {
const word = batch[index]
if (typeof prediction.category === 'string' && prediction.category.length > 0) {
categories[prediction.category].push(word)
}
})
}
} catch (error) {
vscode.window.showErrorMessage(`Word categorization failed: ${error.message}`)
console.error('Word categorization error:', error)
}
return categories
}
This gives us words grouped into classes:
{
subject: ["word1", "word2"],
action: [...],
object: [...],
descriptor: [...],
marker: ["#!promptskeeper"]
}
With these words, we can find their exact positions in the document and generate semantic tokens:
// highlighter.js
const HIGHLIGHTING_GROUPS = {
marker: {
tokenType: 'operator',
tokenModifiers: ['declaration']
},
subject: {
tokenType: 'variable',
tokenModifiers: ['declaration']
},
action: {
tokenType: 'keyword',
tokenModifiers: ['declaration']
},
object: {
tokenType: 'namespace',
tokenModifiers: ['declaration']
},
descriptor: {
tokenType: 'number',
tokenModifiers: ['declaration']
}
}
function generateSemanticTokens(classifiedWords, block, document, tokensBuilder) {
const prompt = block[0]
const categories = Object.entries(classifiedWords)
for (const [category, words] of categories) {
const matches = prompt.matchAll(getRegexForCategoryWords(category, words))
for (const match of matches) {
// Get the exact word from the match
const keyword =
typeof match[0] === 'string' && match[0].length > (match[1] || '').length
? match[0]
: match[1]
// Get index of word in matched string
const appendix = (match[0] || '').indexOf(keyword)
// Get the index of word in the prompt (source string)
const matchStart =
appendix !== -1 ? block.index + match.index + appendix : block.index + match.index
// Get the exact positions in the document
const startPos = document.positionAt(matchStart)
const endPos = document.positionAt(matchStart + keyword.length)
// Generate semantic token
tokensBuilder.push(
new vscode.Range(
new vscode.Position(startPos.line, startPos.character),
new vscode.Position(endPos.line, endPos.character)
),
// Makes words from the corresponding classes look like their tokenType and tokenModifiers
HIGHLIGHTING_GROUPS[category].tokenType,
HIGHLIGHTING_GROUPS[category].tokenModifiers
)
}
}
}
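One detail worth noting: generateSemanticTokens relies on a getRegexForCategoryWords helper that isn't shown above. A minimal sketch of what it might do, assuming plain word-boundary matching (the repo's version may be more involved):
// highlighter.js (illustrative sketch; category is unused in this simplified version)
function getRegexForCategoryWords(category, words) {
  // Escape regex metacharacters, then match any of the words as whole words
  const escaped = words.map(w => w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
  return new RegExp(`\\b(${escaped.join('|')})\\b`, 'gi')
}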
This is it! We have covered the core logic of our extension. Here is the repository with the full code.
Making sure everything works
There are two ways to test our extension:
- Run it with the debugger (that works perfectly for development purposes)
  - How: press the F5 key
- Build the .vsix package and install it directly (to ensure that everything is ready before publishing it)
  - How: run the vsce package command in the CLI, then open the command palette in VSCode, find the Extensions: Install from VSIX command, and choose the extension's .vsix file
In addition, we can add regular unit tests or even e2e tests (see the official guide).
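For instance, since this project already uses jest (see the package.json scripts below), a small unit test for the suffix heuristic might look like this (it assumes classifyWordUsingHeuristics is exported and returns the { confidence, category } shape used earlier):
// classification.test.js (illustrative)
const { classifyWordUsingHeuristics } = require('./classification')

test('suffix heuristic classifies obvious descriptors', () => {
  const { confidence, category } = classifyWordUsingHeuristics('beautiful')
  expect(confidence).toBeGreaterThan(0) // 'ful' matches the descriptors suffix regex
  expect(category).toBe('descriptor')
})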
It's important to note that because the extension runs in a separate instance of VSCode, some exceptions may be silently swallowed in that instance. This can leave the extension not working properly with no clue as to why. I ran into this a couple of times because of syntax errors that went unnoticed, since I did not use TypeScript for this project.
Sharing it with others
The good part of using plain JavaScript is that I do not need to do any bundling at all. It just works.
But skipping it isn't always the right call, especially since there is a well-detailed official guide on using TypeScript and bundling extensions (with examples for esbuild and webpack).
Having said this, I need to prune my dependencies before packaging our extension (otherwise, all of my devDependencies will also be included in the package):
// package.json
...
"scripts": {
"test": "jest",
"package": "npm prune --production && vsce package",
"publish": "npm prune --production && vsce publish"
},
...
Publishing a VSCode extension is a straightforward process and is very similar to publishing an npm package.
You will need the vsce tool and a personal access token. The whole process, from start to finish, is thoroughly documented in the official Publishing Extensions Guide.
Afterword
As you have seen, the result is far from ideal. It operates word by word and cannot recognize phrasal verbs, compound expressions, etc. But I had a great time learning about VSCode extensions and fine-tuning BERT.
Sometimes, it can be tedious or even painful (like when I was collecting the training set), but software engineering gives those moments of magic when you can turn an idea in your head into something that exists and works (even if it's not perfect).
By the way, you can do much more with @xenova/transformers in the browser. There are plenty of opportunities to experiment with natural language processing, computer vision, audio, and more.
I wish you more joy in your journey as a software engineer!
Thanks for your attention! 👋