Day 0 (Planning): Building my personal AI assistant that runs locally.

#machinelearning #llm #ai

This is day 0 of me trying to build my own personal assistant that runs locally on my laptop and helps me with random stuffs (like opening apps, searching web, and I’m not even sure what else I want it to do)

Intro:

If you don’t know me, that’s probably because I’m not very well known (other than my friends and family, no one really knows me, but anyways). I’m Prakhar, an aspiring “AI” researcher. The thing is, research doesn’t pay much, so I end up doing random stuffs that probably won’t even help me get a job, but again anyways.

I’ve always wanted to build my own “Jarvis” (from Iron Man) to help me with all kinds of things, but I can’t really make anything physical. I don’t know electronics, and yeah, I’m pretty broke. So instead, here I am trying to build my own AI assistant that can do various things and help me out with random everyday tasks.
(Ill leave my socials at the end if you want to reach out to me or anything)

Choosing LLM:

This is one of the crucial things when creating an assistant. Now I can’t just magically create my own LLM out of nowhere, so I’ve gotta use an existing one.

Now here’s the tricky part. I could’ve used Gemini’s free tier, but the recent limitations made it almost useless for testing, and the moment you start making multiple API calls per minute, you run straight into rate limits (which is kinda annoying).

So I went through Hugging Face and found two parent models: Qwen and LLaMA (I’ve worked with both of them before, and they both have some advantages and disadvantages). Now in future, as I start facing new problems, I’ll switch between these two and maybe, even between the parameter sizes (3B, 7B or 8B). I would be using Quantization 8bit for 3B and 4bit for 7B+ (You can check my laptop specs and the end).

As for fine-tuning the LLM, we won’t be doing so, unless the prompting fails and thats unlikely, but still keep in mind that there is a chance I might just to give it a personality.

Approach and Disclaimer:

Rather than doing everything at once, Ill split the functionalities into smaller tasks and complete them individually before implementing it into the main assistant. Functionality like speech recognition, searching the internet, data retrieval, API calls, etc.

Here is the disclaimer:
This is me just trying to build something that is for me and works in my laptop. So I won’t promise that the assistant I will create work for your computer too. Also, I might not be able to add all the functionality or even stop mid way as I have my final year exams coming up and then ill be busy shifting and stuffs.

Functionalities:

1. Speech Recognition:
One of the main feature is speech recognition. I’ve built it before, so its not completely new to me, but the results were pretty bad. It worked, just not how I wanted. Ill be going through it when I implement it so stay tuned for that.

2. System Control:
What I mean by system control is just opening apps, folders, files, etc. As I am writing this, I realized that it could be more useful to also perform checks like: asking whether a file exists in a certain folder or not. Well lets go over it later.

3. Searching Internet:
This is kinda new to me and I don’t know any thing about creating a script for searching internet for information, other than doing using API. I am looking for something like web scraping or other implements for accurate information or maybe free API alternative (but using API is no thrill)

4. API Calls:
Now before you guys jump at me saying, “You said there is no thrill in using APIs,” what I meant is that there’s no thrill for big tasks. But for small tasks, like getting the weather, location, or other searches like stock prices and stuffs, Ill still be using API calls. If I do find that scraping is useful for a particular task, then Ill Scrape.

5. STM and LTM:
STM (Short Term Memory) and LTM (Long Term Memory) are useful when creating an agent. What I’m thinking of building is a simple assistant that doesn’t need planning or memory (other than recent messages), but lets just keep this here as an idea or placeholder in case I do end up creating it like an agent.

6. Computer Vision:
This is another idea, and honestly, idk if what I’m thinking would work or not. Basically, I’m thinking of creating a script that takes screenshots of laptop, then identifies web pages or software and uses that to navigate through the system. Not only that, but also giving it access to my camera to identify stuff in real world.
For this purpose, I’d probably have to train a YOLO or some other model, so lets go over this when I actually reach that stage.

7. You guys tell me:
I can’t think of anything more than the functionalities mentioned above, so if you guys have any ideas, let me know. If I’m capable enough to create it, I’d love to build it, and even share how I did it.