Nevo David

for Gitroom

Posted on Apr 25, 2024

I fine-tuned my model on a new programming language. You can do it too! 🚀

#webdev #javascript #beginners #programming

I have been using OpenAI's GPT-4 through ChatGPT for a while now.
I don't have a lot of bad stuff to say about it.
But sometimes, it's not enough.

In Winglang, we wanted to use OpenAI's GPT-4 to answer people's questions based on our documentation.

The obvious options are:

Use an OpenAI Assistant, or any other vector database, for retrieval-augmented generation (RAG). It worked nicely since Wing looks like JS, but the answers still contained many mistakes.
Pass the entire documentation into the context window, which is super expensive.

Soon enough, we realized neither option was going to work.
It was time to host our own LLM.

Your LLM dataset

Before we train our model, we need to create the data it will be trained on. In our case, that is the Winglang documentation. I will do something pretty simple:

Extract all the URLs from the sitemap, send a GET request to each one, and collect the content.
Parse it; we want to convert all the HTML into readable content.
Run it through GPT-4 to convert the content into a CSV of questions and answers to use as the dataset.
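The first two steps can be sketched with nothing but the Python standard library. This is a minimal sketch, not the exact script used for the Wing docs: the function names are hypothetical, it assumes the sitemap is a plain `urlset` (not a sitemap index), and it keeps all visible text rather than isolating the main article body.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap.xml files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap XML document (bytes)."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML string into newline-separated readable text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch(url):
    """Send a GET request and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def collect_docs(sitemap_url):
    """Map each URL in the sitemap to the readable text of its page."""
    docs = {}
    for url in extract_urls(fetch(sitemap_url)):
        docs[url] = html_to_text(fetch(url).decode("utf-8", errors="replace"))
    return docs
```

Calling `collect_docs` with the address of your docs sitemap returns a dictionary of page text that you can then hand to GPT-4 for the question-and-answer conversion.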

Once you finish, save the CSV with a single column named text, where each row contains one question and its answer. We will use it later. It should look something like this:

text
<s>[INST]How to define a variable in Winglang[/INST] let a = 'Hello';</s>
<s>[INST]How to create a new lambda[/INST] bring cloud; let func = new cloud.Function(inflight () => { log('Hello from the cloud!'); });</s>

Save it on your computer in a new folder called data.
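If you already have the questions and answers as pairs, writing the file by hand is error-prone because answers often contain commas and newlines. Here is a minimal sketch that applies the template shown above and lets the `csv` module handle the quoting. The helper names are hypothetical, and the file name `train.csv` is an assumption about what your training tool expects inside the data folder.

```python
import csv
from pathlib import Path


def format_example(question, answer):
    """Wrap one question/answer pair in the [INST] template used in the dataset."""
    return f"<s>[INST]{question.strip()}[/INST] {answer.strip()}</s>"


def write_dataset(pairs, csv_path):
    """Write (question, answer) pairs to a one-column CSV; return the row count."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text"])  # the single column name used later for training
        for question, answer in pairs:
            writer.writerow([format_example(question, answer)])
            rows += 1
    return rows
```

For example, `write_dataset([("How to define a variable in Winglang", "let a = 'Hello';")], "data/train.csv")` creates the data folder if needed and writes one training row.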

Autotrain, your model

My computer is pretty weak, so I decided to go with a smaller model with 7B parameters: mistralai/Mistral-7B-Instruct-v0.2

There are many ways to train a model. We will use Hugging Face AutoTrain, through its CLI, without writing any Python code 🚀

With AutoTrain, you can train on your own computer (my approach here) or pay to train on Hugging Face's servers, which also lets you train larger models.

I have no dedicated GPU on my old MacBook Pro M1 2021. Thank you, Apple 🍎.

Let's install autotrain.

pip install -U autotrain-advanced
autotrain setup > setup_logs.txt

Then, all we need to do is run the autotrain command:

autotrain llm \
--train \
--model "mistralai/Mistral-7B-Instruct-v0.2" \
--project-name "autotrain-wing" \
--data-path data/ \
--text-column text \
--lr "0.0002" \
--batch-size "1" \
--epochs "3" \
--block-size "1024" \
--warmup-ratio "0.1" \
--lora-r "16" \
--lora-alpha "32" \
--lora-dropout "0.05" \
--weight-decay "0.01" \
--gradient-accumulation "4" \
--quantization "int4" \
--mixed-precision "fp16" \
--peft

Once it finishes, you will have a new directory called "autotrain-wing" containing the fine-tuned model 🚀
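A few of the flags in that command interact with each other, and a little arithmetic makes it clearer what they imply. The sketch below is only an approximation (the exact step counting depends on the trainer version), and the dataset size of 1,000 rows is a made-up example, not the size of the Wing dataset.

```python
import math


def training_schedule(num_examples, batch_size=1, gradient_accumulation=4,
                      epochs=3, warmup_ratio=0.1):
    """Approximate the optimizer-step counts implied by the training flags."""
    # Gradients are accumulated over several small batches before each update
    effective_batch = batch_size * gradient_accumulation
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # The learning rate ramps up during the first warmup_ratio of all steps
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lora_scaling(lora_r=16, lora_alpha=32):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return lora_alpha / lora_r


if __name__ == "__main__":
    # Hypothetical dataset of 1,000 rows
    print(training_schedule(1000))
    print(lora_scaling())
```

With a batch size of 1 and gradient accumulation of 4, each weight update effectively sees 4 examples, which is how a small machine can imitate a larger batch.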

Playing with the model

To play with the model, start by running:

pip install transformers torch peft

Once completed, create a new Python file named invoke.py with the following code:

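from transformers import pipeline

# Path to your local model directory (the AutoTrain project folder)
model_path = "./autotrain-wing"

# Load the fine-tuned model and tokenizer as a text-generation pipeline.
# A fine-tuned Mistral model generates text; it is not a classifier.
# If the folder only holds a LoRA adapter (training used --peft), the peft
# package must be installed so the base model can be resolved.
generator = pipeline("text-generation", model=model_path, tokenizer=model_path)

# Prompt in the same [INST] format used in the training data
prompt = "[INST]How to define a variable in Winglang[/INST]"
result = generator(prompt, max_new_tokens=128)
print(result[0]["generated_text"])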

Then run it from the command line:

python invoke.py
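When a model like this is run through a text-generation pipeline, the returned text includes the prompt followed by the continuation by default. Here is a small sketch of two helpers, with hypothetical names, that build a prompt in the dataset's [INST] format and strip it back off the output.

```python
def build_prompt(question):
    """Wrap a question in the [INST] template used in the training data."""
    return f"[INST]{question.strip()}[/INST]"


def extract_answer(generated_text, prompt):
    """Remove the echoed prompt from the generated text, if it is present."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()
```

Passing `return_full_text=False` to the pipeline call should have a similar effect; the helper is just useful when you want to keep the full text around for logging.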

And you are done 🚀

Keep on working on your LLMs

I am still learning about LLMs.
One thing I realized is that it's not so easy to track changes with your models.

You can't really track them with Git: a model can easily grow past 100 GB, and Git doesn't handle files that large nicely.
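To see how far your own model folder is from something Git hosting will accept, you can measure it. The sketch below sums the folder size and lists files above a threshold. The 100 MB default mirrors GitHub's per-file limit; other hosts differ, and the function names are hypothetical.

```python
from pathlib import Path


def total_size(root):
    """Return the combined size in bytes of every file under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def large_files(root, limit_bytes=100 * 1024 * 1024):
    """Return (relative path, size) for files above the limit, largest first."""
    root = Path(root)
    found = [
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size > limit_bytes
    ]
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Running `large_files("./autotrain-wing")` after training shows which weight files would be rejected by a regular Git push.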

A better way to do this is with a tool called KitOps.

I think it will soon be a standard in the world of LLMs, so make sure you star the repository so you can find it later.

Download the latest KitOps release and install it.
Go to the model folder, make sure it contains a Kitfile describing the model (kit pack reads it), and run the command to pack your LLM:
```
kit pack .
```
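You can also tag it for a registry such as Docker Hub while packing, and then push it:
```
kit pack . -t [your registry address]/[your repository name]/mymodelkit:latest
kit push [your registry address]/[your repository name]/mymodelkit:latest
```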
💡 To learn how to use DockerHub check this

⭐️ Star KitOps so you can find it again later ⭐️

I started a new YouTube channel mostly about open-source marketing :)

(Like how to get Stars, Forks and Clients)

If that's something that interests you, feel free to subscribe to it here:
https://www.youtube.com/@nevo-david?sub_confirmation=1

Top comments (29)

Shai Ber • Apr 25 '24

Nice! Is there a way to access the LLM you trained?

Brad Micklea • Apr 25 '24

Very soon KitOps will have a Dev mode command that will make it easy to run the LLM locally and interact with it via prompts or chats as well as experiment with various parameters. You can star our repo or, better yet, join our discord: discord.gg/Tapeh8agYy

Sammy Scolling • Apr 25 '24

Can't wait 🤞

Brad Micklea • May 9 '24

Good news - KitOps dev mode is here! Now you can run and interact with an LLM locally (no internet or GPUs required) with a single Kit CLI command.
dev.to/kitops/kitops-release-v02-i...

Nevo David • Apr 25 '24

Awesome!

Nevo David • Apr 25 '24

I will dm you :)

Brad Micklea • May 9 '24

You can, KitOps dev mode lets you run and interact with an LLM locally (no internet or GPUs required) with a single Kit CLI command.
dev.to/kitops/kitops-release-v02-i...

Nathan Tarbert • Apr 25 '24

Looks like I need to check out KitOps!

Nevo David • Apr 26 '24

You do!

Morgan • Apr 25 '24

I'll check Kitops out!

Andrew • Apr 25 '24

Interesting, almost no Python code :)

Nevo David • Apr 25 '24

A little bit :)

Jesse Williams • Apr 25 '24

Check out our discord, we have a lot coming down the pipeline in the next few weeks. discord.gg/Tapeh8agYy

Nevo David • Apr 25 '24

🚀

Ayush Thakur • Apr 25 '24

Very informative blog, Nevo

Nevo David • Apr 25 '24

Thank you so much!

Mathew • Apr 25 '24

I love it!

Nevo David • Apr 25 '24

🙏🏻

Johny • Apr 25 '24

Awesome!

Nevo David • Apr 25 '24

🚀

Benjamin • Apr 25 '24

Autotrain really simplifies everything.

Nevo David • Apr 25 '24

It is!

mark-friedman • Apr 26 '24

So what were your results? How did your small, but fine-tuned, model do with questions and code generation tasks with your new language compared to your RAG-based approach with GPT-4?

Pablo E. Cabrol • Apr 30 '24 • Edited

Have you tried Gen App Builder from Google? It lets you use a set of data and no code to get natural-language access to your documentation. I'm doing my homework to test it myself. 😊

Matija Sosic • Apr 30 '24

Thanks for sharing - we've actually been thinking about doing something similar for Wasp, our full-stack framework for React & Node.js! (github.com/wasp-lang/wasp)
