I wanted a tool that could automate stuff on my laptop without sending everything to the cloud. The problem: my laptop is ancient. There's only so much a machine with an Intel Pentium CPU, 8GB of RAM, and an HDD can do. Regardless of the constraints, I set sail and got to work on it.
I won't say it's SOTA quality, but it's decent, and it works.
The first problem was very clear: find an SLM (Small Language Model) that is good, even for its size. After searching for a bit, I settled on Qwen2.5-0.5B-Instruct.
Now, for the data generation and how I did it. This part was the most tedious. First, I prompted Qwen2.5-7B-Instruct to generate instructions (like "Open Reddit", "Copy X file from Y to Z", etc.). Then I iterated through each instruction and prompted the model to first generate some paraphrases, and then the corresponding input and output elements. But the data wasn't clean, and I had to regenerate it numerous times. Even after that, when I started fine-tuning the model, it kept overfitting. So I took a closer look at the data and found that even after all the regeneration, it wasn't consistent. For example, where the directory field had Windows file paths, the response still had Linux commands in it. So I decided to go back and regenerate the data from scratch, and then, finally, I got data that was usable for fine-tuning the model.
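As a sketch of the kind of consistency check that would have caught this earlier: flag samples whose directory field looks like a Windows path while the generated plan still contains Linux-style commands. This is my own illustrative checker, not the actual pipeline code; the command lists and field names are assumptions.

```python
import re

# Hypothetical consistency check for generated samples (illustrative only):
# a Windows-style directory paired with Linux commands is a bad sample.
LINUX_CMDS = ("cp ", "mv ", "rm ", "ls ", "mkdir -p")
WINDOWS_CMDS = ("copy ", "move ", "del ", "xcopy ")

def is_windows_path(path: str) -> bool:
    # e.g. "C:\\Users\\me\\file.txt"
    return bool(re.match(r"^[A-Za-z]:\\", path))

def is_consistent(sample: dict) -> bool:
    directory = sample.get("directory", "")
    commands = " ".join(sample.get("cli", []))
    if is_windows_path(directory):
        return not any(cmd in commands for cmd in LINUX_CMDS)
    return not any(cmd in commands for cmd in WINDOWS_CMDS)

# The exact failure mode described above: Windows path, Linux command.
bad = {"directory": "C:\\Users\\me\\docs", "cli": ["cp report.txt C:\\backup"]}
good = {"directory": "C:\\Users\\me\\docs", "cli": ["copy report.txt C:\\backup"]}
```

Running `is_consistent` over every generated sample before training would have surfaced the mixed-OS samples without a manual audit.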
The structure I decided to train my model on looks like this:
Input prompt:
{
"task": <task>,
"directory": <directories>, # Needs to pass full paths right now
"available_hotkeys": <list_of_hotkeys_relevant_for_this_task>,
"iteration_context": <iteration_context> # Usually null only used when the task is repetitive
}
Response:
{
"task_type": <task_type>, # Model predicts task types: atomic | repetitive | clarification
"output": {
"execution_plan": {
"hotkeys": [<hotkey_plan>],
"cli": [<cli_plan>]
}
}
}
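A concrete request/response pair following this schema might look like the snippet below. All field values here are made up for illustration; they are not from the actual training data.

```python
import json

# Illustrative instance of the input/output schema above (values invented).
request = {
    "task": "Copy report.pdf from Documents to Backup",
    "directory": ["/home/me/Documents/report.pdf", "/home/me/Backup"],
    "available_hotkeys": ["ctrl+c", "ctrl+v"],
    "iteration_context": None,  # only set for repetitive tasks
}

response = {
    "task_type": "atomic",  # atomic | repetitive | clarification
    "output": {
        "execution_plan": {
            "hotkeys": [],
            "cli": ["cp /home/me/Documents/report.pdf /home/me/Backup/"],
        }
    },
}

# A quick sanity check on what the model emitted:
assert response["task_type"] in {"atomic", "repetitive", "clarification"}
plan = response["output"]["execution_plan"]
print(json.dumps(plan, indent=2))
```

Because the response is plain JSON, validating it before anything runs is a one-liner, which matters for the safety argument below.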
Now, you might wonder why I chose this structured input-output instead of streaming the output and doing what people usually call tool-calling. But let's be honest, there's a problem with that: what if the model fails midway and starts emitting wrong output? Or worse, destructive output?
You might be aware of a recent case that was trending, where the Meta researcher Summer Yue told her OpenClaw agent, quote: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to", and it started deleting her emails without permission.
Instead, by having the model output its full plan before execution, you get to decide whether to let it run at all. This might be a bad UX choice for some, though not for everyone. There are hypothetical ways to counter that too, at least from my point of view, but discussing them would be off-topic for this post.
So that's why I chose to have the model output a full execution plan before taking any action. Anyway, that's how ACE works. Yes, that's what the tool is called: ACE, the Adaptive Command Executor.
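The execute-only-after-approval flow can be sketched as a simple gate. The function names and callbacks here are hypothetical, not ACE's actual API; the point is that nothing runs until the whole plan has been shown and accepted.

```python
from typing import Callable

def run_plan(plan: dict,
             approve: Callable[[dict], bool],
             run_cli: Callable[[str], None]) -> bool:
    """Show the full plan first; execute only if the user approves."""
    if not approve(plan):          # user said no: nothing runs at all
        return False
    for cmd in plan.get("cli", []):
        run_cli(cmd)               # every command was visible up front
    return True

# Demo with a fake executor that just records what would have run.
executed = []
plan = {"cli": ["mkdir -p /tmp/backup", "cp report.txt /tmp/backup"]}

ran = run_plan(plan, approve=lambda p: True, run_cli=executed.append)
blocked = run_plan(plan, approve=lambda p: False, run_cli=executed.append)
```

In a real UI, `approve` would render the plan and wait for a keypress; the structure stays the same.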
Though calling it Adaptive might be overkill, given that the model is a bit overfit and doesn't correctly emit hotkeys because of some of my beginner mistakes during fine-tuning.
Now for some minor details. Other than Qwen as its main generative model, ACE uses all-MiniLM-L6-v2 to embed the tasks and the hotkeys' descriptions, then uses cosine similarity to retrieve hotkeys relevant to the current task. Recomputing the description embeddings on every run would be inefficient, so ACE caches them in an .npz file and loads them from there. While loading the models, it also checks whether hotkeys.json has changed and recomputes the description embeddings accordingly.
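The retrieval and caching steps could be sketched like this in plain NumPy. Toy 2-D vectors stand in for all-MiniLM-L6-v2 embeddings, and hashing hotkeys.json for cache invalidation is my guess at the mechanism, not necessarily how ACE implements it.

```python
import hashlib
import os
import numpy as np

def file_hash(path: str) -> str:
    """Fingerprint hotkeys.json so the cache can detect edits."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def load_or_embed(descriptions, embed, cache="hotkeys.npz", source="hotkeys.json"):
    """Reuse cached description embeddings unless the source file changed."""
    h = file_hash(source)
    if os.path.exists(cache):
        data = np.load(cache)
        if str(data["hash"]) == h:
            return data["emb"]
    emb = np.stack([embed(d) for d in descriptions])
    np.savez(cache, emb=emb, hash=h)
    return emb

def top_hotkeys(task_vec, emb, k=3):
    """Rank hotkey descriptions by cosine similarity to the task embedding."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    task_n = task_vec / np.linalg.norm(task_vec)
    sims = emb_n @ task_n
    return np.argsort(sims)[::-1][:k]

# Toy demo: three fake "description" vectors, find the closest to the task.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
task = np.array([0.9, 0.1])
best = top_hotkeys(task, emb, k=1)[0]
```

In the real system, `embed` would be a call to the sentence-transformers model rather than a toy function.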
As for the LoRA configuration, I used r = 16, alpha = 32, and dropout = 0.2, and trained for 1750 steps.
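For reference, those hyperparameters expressed as a peft config would look roughly like this. The target modules are not specified in the post, so they are omitted here (peft falls back to model-specific defaults); treat this as a sketch, not the exact training setup.

```python
from peft import LoraConfig

# The LoRA settings mentioned above, as a peft LoraConfig.
# Target modules are intentionally omitted: the post doesn't list them.
lora_config = LoraConfig(
    r=16,             # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor (alpha / r = 2 here)
    lora_dropout=0.2, # dropout applied to the LoRA layers
    task_type="CAUSAL_LM",
)
```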
In future versions, I plan to do something similar to search for the files the user wants to operate on, removing the need to pass full paths.
All of this going back and forth between data generation and fine-tuning took about 6 weeks, plus 2 extra weeks of coding to make the model actually work as an automation system. It might have taken less if I had leaned on AI more, but then I wouldn't have learned as much from facing the errors myself. Through all of this I learned quite a lot. Foremost is fine-tuning, and I got a gist of how RAG systems work while implementing the hotkey retrieval. Beyond that, I learned why data quality is the most important thing, not just in LLM fine-tuning, but in training any kind of model.
Well, that's it. If you want to look at the code or models:
Code: GitHub
Models: HuggingFace