Jesse Williams for KitOps

Originally published at jozu.com

How to Tune and Deploy Your First Small Language Model (sLLM)

Large Language Models (LLMs) have taken the machine learning world by storm: in 2024 alone, over 20,000 papers related to LLMs were published. Popularized by the release of ChatGPT in November 2022, LLMs have huge potential to augment and automate business processes. They can answer user queries, summarize information, and generate human-like text (emails, essays, and so on).

Despite their abilities, companies are reluctant to adopt LLMs for two main reasons:

  • Size: It is common for LLMs to have over 80 billion parameters, so deploying them is a resource-intensive task. The challenge is amplified when consumers expect near real-time responses.
  • Generic responses: LLMs are often trained on user-generated data from the internet, which may be incorrect or very generic. Furthermore, fine-tuning these LLMs on application-specific data is resource- and time-consuming because of the large number of parameters.

In this article, we take a look at sLLMs, an alternative to LLMs that solves the size problem mentioned above. We will also walk you through the steps required to quickly deploy your sLLM and interact with it using Dev Mode, a user-friendly interface offered by KitOps.

Small Language Models (sLLM)

Small Language Models (sLLMs) are miniature versions of LLMs, with roughly 10 times fewer parameters than their LLM counterparts. LLMs typically store parameters in precise data types such as 16- or 32-bit floating point numbers; sLLMs, by contrast, are often quantized to 4-bit or 8-bit integer data types, resulting in much smaller models. The other key difference is architectural: whereas many large models combine an encoder and a decoder, sLLMs like Mistral 7B are decoder-only, which keeps the parameter count low, around 7 to 8 billion instead of 80+ billion.
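
To get a feel for why quantization matters, here is a rough, weights-only memory estimate for an 8-billion-parameter model at different precisions (actual memory use is higher once activations and the KV cache are included; the bytes-per-parameter figures are approximations):

# Weights-only estimate: parameters x bytes per parameter
# fp16 ~ 2 bytes/param, q8_0 ~ 1 byte/param, q4_0 ~ 0.5 bytes/param
awk 'BEGIN { p = 8e9;
  printf "fp16 : %.0f GB\n", p * 2.0 / 1e9;
  printf "q8_0 : %.0f GB\n", p * 1.0 / 1e9;
  printf "q4_0 : %.0f GB\n", p * 0.5 / 1e9 }'

A 4-bit 8B model therefore needs roughly 4 GB for its weights, which is why it fits comfortably on a laptop, while a 16-bit 80B model needs on the order of 160 GB.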

sLLMs have three key advantages over LLMs:

  • Lightweight: sLLMs have fewer parameters and require fewer resources to deploy.
  • Domain specificity: Fine-tuning an sLLM is faster; it can easily be trained on new domain-specific data.
  • Faster inference: Fewer parameters mean fewer computations, resulting in faster inference. This makes sLLMs ideal for real-time applications.

How to fine-tune an sLLM

Although most language models perform very well out of the box, it is essential to tune them when you want to use them for specialized tasks. Fine-tuning sLLMs on task-specific datasets helps to improve the performance of the model for that task.

If you search for guides on fine-tuning sLLMs, you will find numerous blog posts on the subject; generally, the models in those posts are trained in Jupyter notebooks. In this tutorial, we simplify things by exposing your tuned model through a web interface with a single command. All of this is made possible by Dev Mode, a feature available with open source KitOps.

Let’s start by installing the kit command-line tool.

Step 1: Install kit
The process of installing kit varies by operating system (OS), but the central idea is the same: download the kit executable and add it to a directory on your PATH so your OS can find it. You can find detailed instructions on the Installing Kit page.
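
As a rough illustration, a manual install on macOS or Linux looks something like the sketch below; the archive name and URL are placeholders, so copy the exact ones for your platform from the Installing Kit page.

# Download and extract the release archive for your OS/CPU (URL is a placeholder)
curl -LO https://example.com/kitops-<os>-<arch>.tar.gz
tar -xzf kitops-<os>-<arch>.tar.gz

# Put the kit binary on your PATH and verify the install
sudo mv kit /usr/local/bin/
kit version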

Once you install kit, you need to log in to a container registry; you can use Docker Hub or the GitHub Container Registry (ghcr.io).

# Login to ghcr.io
kit login ghcr.io -u github_user -p personal_token

# Login to docker
kit login docker.io --password-stdin -u docker_user
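
The --password-stdin flag reads the token from standard input, which keeps it out of your shell history. For example, if your registry token is stored in an environment variable (DOCKER_TOKEN is just an illustrative name):

echo "$DOCKER_TOKEN" | kit login docker.io --password-stdin -u docker_user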

Step 2: Download a base model
To fine-tune a model, you will need a base model. In KitOps, everything is defined in a Kitfile, which is similar to a Dockerfile but tailored to machine learning. You can find more information about the Kitfile format in the KitOps documentation. For now, you can use the Kitfile below to define the Llama 3 8B model.

manifestVersion: "1.0"
model:
  name: llama3-8B-instruct-q4_0
  path: ghcr.io/jozu-ai/llama3:8B-instruct-q4_0
  description: Llama 3 8B model

Before downloading the model, you will need to create the ModelKit using the command below:

kit pack . -t fine-tuning:untuned

Now, you can download/unpack the model:

kit unpack fine-tuning:untuned -d . --overwrite
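
At this point the base model is available locally. A couple of optional sanity checks (assuming the model unpacks as a .gguf file into the current directory):

# List the ModelKits in local storage
kit list

# The unpacked weights are referenced by the fine-tuning command in Step 4
ls -lh ./*.gguf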

Step 3: Create the dataset
You need a dataset to fine-tune the model. Create a text file, training-data.txt, containing examples of your training data. An example dataset looks like this:

<start> Example one.
<start> Example two.

Depending on what you want to use your model for (answering questions related to your internal documents, summarizing long texts, etc.), the dataset will vary. For the demonstration, we will use only two dummy training examples. A smaller dataset will result in a faster and less resource-intensive fine-tuning process.

You can use any relevant text file as a dataset. A sample of the ultrachat_200k dataset is available at this Google Drive link for download and use. If you want to convert another Hugging Face dataset, you can use this Colab notebook. Note that providing the <start> token is not compulsory; if you choose not to use it, do not specify any value for the --sample-start parameter in the fine-tuning step.
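
If you just want to create the toy training file above from the shell, a heredoc is enough:

cat > training-data.txt <<'EOF'
<start> Example one.
<start> Example two.
EOF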

Step 4: Fine-tuning
To ease the fine-tuning process, we will use the llama.cpp library. The GitHub repository provides OS-specific installation instructions. The commands in this guide were tested on macOS, but you should be able to find the equivalent commands for your OS using the repository or the help output.

After installing the library, you will have access to the finetune command, which lets you set hyperparameters and configuration settings. You can find a detailed list of parameters in the llama-finetune help output. Some important parameters are:

  • --model-base : Location of the base model.
  • --lora-out : Location to save the LoRA adapter output.
  • --sample-start : Sets the starting point for samples after the specified pattern. If empty, every token position is used as a sample start.
  • --save-every : Save a checkpoint every N iterations.
  • --epochs : Maximum number of epochs to process.

Now, you can fine-tune the model using the example dataset created in the earlier step. To run the fine-tuning task, use the following command:

llama-finetune --model-base ./llama3-8B-instruct-q4_0.gguf --train-data ./training-data.txt --epochs 1 --sample-start "<start>" --lora-out lora_adapter.gguf
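
Once training finishes, confirm that the adapter file named by --lora-out was written:

# The LoRA adapter produced by the fine-tuning run
ls -lh lora_adapter.gguf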

After the training is complete, update the Kitfile with the new artifacts and dataset.

manifestVersion: "1.0"
package:
  name: llama3 fine-tuned
  version: 3.0.0
  authors: ["Jozu AI"]
model:
  name: llama3-8B-instruct-q4_0
  path: ghcr.io/jozu-ai/llama3:8B-instruct-q4_0
  description: Llama 3 8B model
  parts:
    - path: ./lora_adapter.gguf
      type: lora-adapter
datasets:
  - name: fine-tune-data
    path: ./training-data.txt

Finally, package your model, tag it, and upload it to a container repository.

## Pack
kit pack /lora_finetuning -t fine-tuning:tuned

## Tag 
kit tag fine-tuning:tuned docker.io/bhattbhuwan13/finetuned:latest

## Push 
kit push docker.io/bhattbhuwan13/finetuned:latest

The model is now fine-tuned, and you can interact with the model using Dev Mode.

Step 5: Deployment
Dev Mode, which is currently available only on macOS, allows you to quickly deploy your sLLM through an interactive web interface. From the directory containing your Kitfile, run the command kit dev start. The command prints a local address; open it in your browser to interact with your deployed model.


Once you are finished, run kit dev stop to terminate Dev Mode.
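
Putting the Dev Mode workflow together (the directory path below is just a placeholder for wherever your Kitfile lives):

cd /path/to/your/modelkit    # directory containing the Kitfile
kit dev start                # prints a local address to open in your browser
# ...chat with the model, then shut it down:
kit dev stop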

If you want to deploy the model to a server, unpack the ModelKit on the server, run Dev Mode, and put an Nginx reverse proxy in front of it. By default, Dev Mode picks a random port for the web interface; you can override that behavior with the --port parameter: kit dev start --port 8000. A sample Nginx configuration then looks like this:

server {
    listen 80;
    server_name 3.111.144.91;  # replace with the public IP of your instance
    location / {
        proxy_pass http://127.0.0.1:8000;  # the port passed to kit dev start --port
    }
}
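
For reference, the server-side setup might look like the sketch below, assuming you pushed the tuned ModelKit as in the previous step and have kit and Nginx installed on the server:

# Fetch and unpack the tuned ModelKit on the server
kit pull docker.io/bhattbhuwan13/finetuned:latest
kit unpack docker.io/bhattbhuwan13/finetuned:latest -d ./finetuned --overwrite
cd ./finetuned

# Serve the model on the fixed port that Nginx proxies to
kit dev start --port 8000

With the Nginx configuration above in place, requests to port 80 on the server are forwarded to the Dev Mode interface.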

As you can see, we fine-tuned and deployed an sLLM behind a web interface by writing a little configuration and running a few commands. This abstraction and convenience is the result of adopting KitOps. KitOps standardizes the packaging of models, code, and artifacts while allowing engineers to easily track, reproduce, and deploy machine learning models.

Furthermore, Dev Mode enables engineers to swiftly expose their LLMs via a web interface using a single command. To learn more about KitOps, follow the documentation. If you have any questions, you can also contact the KitOps team through our Discord server.

Top comments (5)

Andreas

Thank you for the very informative article! It would be nice if you could also add some information on the sLLMs which are available, and how to deploy them on machines with limited compute power (e.g., a Raspberry Pi Zero).

Aravind Putrevu

Interesting!

Wayne Trout

Adding this to my project list; I have a use case for it at work!

Jesse Williams

Let me know if you need any support with that; happy to hop on a call to work through some of the common processes. jesse [at] jozu [dot] com

piyush tiwari

Thanks for the information. For now I have mostly been using LangChain with cloud APIs. I tried running Ollama locally, but the best results are often given by the latest models.