G.L Solaria

How to set up a free, self-hosted AI model for use with VS Code

You have probably heard about GitHub Copilot, an AI assistant that helps you code. There are a few AI coding assistants out there, but most cost money to access from an IDE. Did you know you can run self-hosted AI models for free on your own hardware? All you need is a machine with a supported GPU.

Before we begin

We are going to use an ollama Docker image to host AI models that have been pre-trained for coding tasks, and the Continue extension to integrate them with VS Code. If you are running VS Code on the same machine that hosts ollama, you could try CodeGPT instead, but I could not get it to work when ollama was self-hosted on a machine remote from the one running VS Code (at least not without modifying the extension files). There are open issues about this on the CodeGPT GitHub repository, so the problem may have been fixed by now.

System requirements

This guide assumes you have a supported NVIDIA GPU and have installed Ubuntu 22.04 on the machine that will host the ollama Docker image. AMD GPUs are now supported by ollama, but this guide does not cover that type of setup.
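If you are not sure what GPU is in the machine, one quick way to check (assuming the standard pciutils package is installed) is:

lspci | grep -i nvidia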

Installing the NVIDIA CUDA drivers

The NVIDIA CUDA drivers need to be installed so we can get the best response times when chatting with the AI models. On the machine that will host the AI models, run ...

sudo apt install nvidia-cuda-toolkit

Now check the status of the CUDA drivers by running ...

nvidia-smi

You should see a table listing your GPU along with the driver version and the CUDA version it supports.
Next, run a check to ensure the CUDA compiler is available for use ...

nvcc --version

You should see the nvcc release number and the CUDA version it was built against.

Install Docker and enable Docker containers to access the GPU

Follow the instructions to install Docker on Ubuntu.
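As a rough sketch, at the time of writing Docker's convenience script does this in one step (review the script and the official instructions before piping anything to your shell):

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh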

Now we install and configure the NVIDIA Container Toolkit by following these instructions.
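For reference, the apt-based install from NVIDIA's instructions currently looks roughly like this (these commands change over time, so treat them as a sketch and prefer the linked documentation):

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker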

Next, we double-check that Docker containers can access the NVIDIA GPU ...

docker run --runtime=nvidia --gpus all --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Note you should select the NVIDIA Docker image tag that matches your CUDA driver version (as reported by nvidia-smi). Look in the unsupported list if your driver version is older.
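For example, if nvidia-smi reports CUDA Version 12.0, you would run a matching image instead (the exact tag below is an assumption; check the available tags on Docker Hub):

docker run --runtime=nvidia --gpus all --rm nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi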

Run the ollama Docker image

Now we are ready to start hosting some AI models.

First, download and run the ollama Docker image...

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Next, check that ollama is running ...

curl http://localhost:11434

You should see the output "Ollama is running".

We can now download and run any model easily by running ...

docker exec -it ollama ollama run deepseek-coder:6.7b

This version of deepseek-coder is a 6.7 billion parameter model. The model will be automatically downloaded the first time it is used and then run. Once the download finishes, this command should drop you into a chat prompt.
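You can also exercise the model through ollama's HTTP API instead of the interactive prompt. A quick sketch using the /api/generate endpoint (setting "stream": false returns a single JSON response instead of a token stream):

curl http://localhost:11434/api/generate -d '{"model": "deepseek-coder:6.7b", "prompt": "Write a Python function that reverses a string", "stream": false}'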

While it responds to a prompt, use a command like btop to check whether the GPU is actually being used. Also note that if you do not have enough VRAM for the size of model you are using, the model may end up running on the CPU and swapping, which will be far slower.
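For example, you can watch GPU memory and utilisation from another terminal while the model is generating, using the standard watch utility:

watch -n 1 nvidia-smi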

Picking a pre-trained coding model

You may have to play around with this one. The best model will vary, but you can check out the Hugging Face Big Code Models leaderboard for some guidance. You will also need to pick a model that will be responsive on your GPU, and that will depend greatly on your GPU's specs, VRAM in particular.
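As an illustration, you can pull and compare a few candidates (these tags existed in the ollama model library at the time of writing; check the library for current names and sizes):

docker exec -it ollama ollama pull codellama:7b
docker exec -it ollama ollama pull deepseek-coder:1.3b
docker exec -it ollama ollama list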

Setting up VS Code

On the machine you want to run VS Code on, check that you can reach the ollama API (note x.x.x.x is the IP address of the machine hosting the Docker image)...

curl http://x.x.x.x:11434

You should get the output "Ollama is running".

Next, download and install VS Code on your developer machine.

Now install the Continue VS Code extension (search for "Continue" in the Extensions view). Make sure you only install the official Continue extension.

Click cancel if it asks you to sign in to GitHub.

Now configure Continue by opening the command palette (you can select "View" from the menu then "Command Palette" if you don't know the keyboard shortcut) and typing "Continue: Open config.json".

Open the config file and modify it so Continue points at your ollama host. Note again that x.x.x.x is the IP of the machine hosting the ollama Docker container.
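A minimal sketch of the relevant parts is below (the exact schema varies between Continue versions, so treat the field names as assumptions and check the Continue documentation; the model names match the ones pulled earlier):

{
  "models": [
    {
      "title": "DeepSeek Coder 6.7b",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b",
      "apiBase": "http://x.x.x.x:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder:latest",
    "apiBase": "http://x.x.x.x:11434"
  }
}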

Save the file, then click on the Continue icon in the left side-bar, and you should be ready to go. You can toggle tab code completion on and off by clicking the Continue text in the lower-right status bar. Also note that if the model is too slow, you might want to try a smaller model like "deepseek-coder:latest". Refer to the Continue VS Code page for details on how to use the extension.
