DEV Community

Simon Pfeiffer for Codesphere Inc.

Posted on • Originally published at codesphere.com on

From first click to prompt output in 1m38s - Running Llama2 in Codesphere


What started out of pure curiosity (how hard can it be to run your own LLM inside a Codesphere workspace? Will it even run, and if so, will inference on our CPUs be fast enough?) turned into one of the most exciting projects I've worked on in recent weeks.

Unlike OpenAI and its GPT models, Meta has open sourced its suite of Large Language Models, LLaMA and Llama 2, alongside pre-trained chat versions. In independent performance tests these land somewhere between GPT-3.5 and GPT-4, and they actually respond a bit faster than GPT-4.

These models come in different sizes, based on their number of parameters. Inference for LLMs is typically run on GPUs rather than CPUs because the computations are very memory-intensive and GPUs have a clear edge there. But GPU servers are still expensive and not as widely accessible as CPU servers. Codesphere is planning to offer shared (cheaper) and dedicated GPU plans in the near future, but as of today we only offer them to customers who pre-order early access.

So today we are going to test whether we can run the smallest model (7 billion parameters) on a CPU-based server inside Codesphere. Since we know it will be challenging to get smooth responses, we are going with our Pro plan, which provides 8 state-of-the-art vCPUs, 16GB RAM and 100GB of storage.

Getting Llama 2 running on Codesphere is actually very easy thanks to the amazing open source community: the llama.cpp project provides a plain C/C++ inference implementation, and Hugging Face hosts pre-quantized model files for download.

It is actually so easy that I decided to do a timed run: from the first click to the first chat response took 1 minute and 38 seconds. It really blows my mind.

Let's take a look at how it's done inside Codesphere. If you still need to create a Codesphere account, now is as good a time as any. If your machine is strong enough, this tutorial will also work locally (at least on Linux and macOS, with small adjustments).

Step 1: Create a workspace from the Llama.cpp repository

Sign in to your Codesphere account, click the create new workspace button at the top right and paste this into the repo search bar at the top:

https://github.com/ggerganov/llama.cpp

Next you'll want to provide a workspace name, select the Pro plan and hit the start coding button. This plan is $80/month for an always-on production deployment and $8/month in the standby-when-unused deployment mode. We know $80/month seems like a lot, but consider that renting a GPU typically costs more than $1,000/month.

Step 2: Compile the code

Open up a terminal and type:

make

This command compiles the C++ code into Linux executables. The llama.cpp repository contains a Makefile that tells the compiler what to do.
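Before moving on you can check that the build worked. A small sketch, assuming the default Makefile targets (which at the time of writing drop a main and a server binary into the repository root):

```shell
# The build should leave `main` (CLI inference) and `server` (the web UI
# used in optional step 5) in the repo root; warn if either is missing.
for bin in ./main ./server; do
  if [ -x "$bin" ]; then
    echo "$bin built OK"
  else
    echo "$bin missing, re-run make"
  fi
done
```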

Step 3: Download the model

First, type cd models in the terminal to navigate to the folder where llama.cpp expects to find the model binaries. A wide variety of versions is available on Hugging Face; as mentioned, we are picking the 7B-parameter size and opting for the pre-trained chat version of that.

Even for this size there are roughly ten different flavours available; pick the one that suits your use case best. We found this repo to contain good explanations alongside the models: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

In the models directory run this command and replace the model name with the flavour that suits you best:

wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin
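In TheBloke's naming scheme, q4 means 4-bit quantization and K_M a medium mix of k-quant types; smaller quantizations download faster but lose a little quality. Once the download finishes, a quick sanity check helps, since a partial download is the most common failure mode (the q4_K_M file should be roughly 4GB):

```shell
# Print the size of the downloaded model; if the file is much smaller than
# ~4GB the download was likely interrupted and the wget should be re-run.
ls -lh llama-2-7b-chat.ggmlv3.q4_K_M.bin 2>/dev/null \
  || echo "model file not found - is the download finished?"
```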

Step 4: Run your first query

Now we can ask our very own chatbot the first question. Navigate back to the main directory with cd .. and then run this command from the terminal:

make -j && ./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -p "Why are GPUs faster for running inference of LLMs" -n 512

It's going to take a few seconds to load the 4GB model into memory but then you should start to see your Chatbot typing an answer to your query into the terminal.

Once completed, it will also print timings. The initial load can take up to 30 seconds, but subsequent runs take less than a second to start producing a response. The generation speed is not quite as fast as interacting with ChatGPT in the browser, but it still returns around 4 words per second, which is pretty good.
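For a rough sense of where that number comes from: llama.cpp reports its eval speed in milliseconds per token, and a token is roughly three-quarters of an English word, so the conversion is simple arithmetic (the 180ms figure below is a made-up example, not our measured output):

```shell
# Hypothetical example: llama.cpp reported ~180 ms per token during eval.
MS_PER_TOKEN=180
# tokens/sec = 1000 / ms_per_token; words/sec is about tokens/sec * 0.75
awk -v ms="$MS_PER_TOKEN" 'BEGIN { tps = 1000 / ms; printf "%.1f tokens/s = %.1f words/s\n", tps, tps * 0.75 }'
```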

The images show the timing of the initial run vs. subsequent runs.


[Optional] Step 5: Getting the example chatbot web interface running on Codesphere

The llama.cpp repository comes with a simple web interface example. This provides an experience closer to what you might be used to from ChatGPT.

Navigate to the CI pipeline tab and click the Define CI Pipeline button. For the prepare stage, enter make as the command.

And for the run stage, enter this command, making sure the model name points to the binary of the version you downloaded:

./server -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -c 2048 --port 3000 --host 0.0.0.0

We need to set the port to 3000 and the host to 0.0.0.0 in order to expose the frontend in Codesphere.

Now run your prepare stage. It won't do anything if you already ran make in the terminal, but it is needed after workspace restarts.

Next, run your run stage and click the open deployment icon in the top right corner. Now you and anyone you share the URL with can chat with your self-hosted ChatGPT clone 😎
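Beyond the browser UI, the example server also exposes a JSON HTTP endpoint, so you can script against it. A hedged sketch (endpoint name and fields as in llama.cpp's server example at the time of writing; run it from a second workspace terminal, or swap localhost for your deployment URL):

```shell
# Ask the running server for a completion; --max-time keeps curl from
# hanging while the model is still loading into memory.
curl -s --max-time 120 http://localhost:3000/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why are GPUs faster for LLM inference?", "n_predict": 64}' \
  || echo "server not reachable - is the run stage started?"
```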


Let us know what you think about this! Also feel free to reach out to us if you are interested in getting early access to our GPU plans.

Happy Coding!
