Apologies in advance for the length; I tend to ramble a bit. I might do a second, summarized version that's less me going on about stuff and more bullet points. Expect some edits to this over time.
Summarized version can be read here:
https://dev.to/ghotet/set-up-your-own-personal-ai-frankenstack-summarized-version-536l
Prologue:
Hey folks. I finally have a moment to sit down and try to lay out the blueprint for setting up your own AI stack, which I dubbed the "Frankenstack" and it seems to have stuck. This stack consists of LLM software, Stable Diffusion (image gen), text-to-speech (but not speech-to-text), and web search for the LLM, all tied together through a unified front end. Now, just to clarify up front, this isn't a tutorial or a how-to guide. I'm just laying out the toolkit and giving any info that comes to mind regarding each piece of software, along with any caveats. For example, I'll list my machine's specs and which LLMs I run to give you a realistic expectation. It is GPU/CPU hungry, to put it mildly, but I'll tackle that a bit more when I get to each component.
Lastly, just to clarify: every single tool I mention here is open source and doesn't require any subscriptions or payment or anything like that. Some of them do have optional paid tools or requirements if you were to try and scale for business use. I'm just a dude in a garage who values privacy and has a passion for AI and computers. Do not mistake me for some sort of industry expert. I'm going to lay out the tools in the order I set them up, and you're free to skip any you aren't interested in. At the end, as a bonus for those interested, I'll do my best to detail how I host it through my own domain for online access from anywhere (handy if you want to use its full power from your phone or something). Finally, there are other options for most if not all of these components, but I have a working stack that does what I need. I did try some alternatives early on when I was experimenting, and either didn't get what I needed out of them or couldn't get them to link up properly.
This is sort of a catch-all for both Windows and Linux, which is part of the reason it isn't a tutorial. All of the software mentioned should be available for both OSes. I use Linux, but prior to switching to the penguin I was already using a few of these on Windows 11. I'll double-check them as I go, and I'll try to remember to put links at the end of each section.
Finally, I just want to mention the front end... up front. It provides some pretty nice options that let you use tools, add memory, and do other things that I would argue are the most exciting parts of having your own stack, but any deep dives on that won't be for a while yet as I'm in the early stages of experimenting with that aspect. For the actual setup, each software component runs a local server, and you're basically just making sure the localhost addresses are set up in OpenWeb UI's admin panel. Alright, let's get this stitched up.
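Since the whole stack is just a handful of local servers, a quick sanity check before you start wiring addresses into OpenWeb UI is to confirm each one is actually listening. Here's a small sketch; the port numbers are assumptions based on each app's usual defaults, so swap in whatever yours actually bind to.

```python
import socket

# Default ports are assumptions -- check what each app actually binds to on your box.
SERVICES = {
    "LM Studio": ("127.0.0.1", 1234),
    "A1111":     ("127.0.0.1", 7860),
    "SearXNG":   ("127.0.0.1", 8888),
}

def is_up(host, port, timeout=1.0):
    """Return True if something is accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        print(f"{name:10s} {host}:{port}  {'up' if is_up(host, port) else 'DOWN'}")
```

If a service shows as down here, there's no point debugging the OpenWeb UI side of the connection yet.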
My specs: Modified Alienware 15 R4 (circa 2018)
- Nvidia GTX 1070 8GB (laptop gpu)
- Nvidia RTX 3060 12GB (AGA external GPU dock)
- Intel i7-8750H CPU @ 2.20GHz
- 32 GB RAM
- All drives are NVMe.
- My stack uses a total of approximately 120GB which includes about 8 or so LLM/SD models total.
LLM:
LM Studio was my choice here. Why? Well, it has a nice front end itself that offers some pretty in-depth options for performance tuning, as well as some experimental features that may help those with lower-end systems get things running decently, such as offloading the KV cache for faster performance. That can affect quality, though; it's very dependent on which models you run. Tons of knobs and dials, and things I have no idea how to use yet. This software also lets you easily run multiple models at once (if your system can handle it), so you could have a personality model, a coder, and a reasoning model all loaded together. Personally, I can only run one at a time with the models I use. LM Studio also has an easy download function where you can search or explore models right in the GUI and download them directly from Hugging Face.
I couldn't possibly detail this entire app here in a snippet. I really do recommend just giving it a go before yelling at me about Ollama in the comments. I'm aware it exists, I have used it, and I wasn't really a fan. Maybe it's just me. You are more than welcome to use it if you really want to. That said, I have only ever used Ollama in CLI mode, so I have no idea what options the GUI version has.
Link: https://lmstudio.ai
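One thing worth knowing about LM Studio's local server is that it speaks an OpenAI-compatible API, which is exactly why it plugs into OpenWeb UI so easily. A minimal sketch of talking to it directly, assuming the default port of 1234 and using a placeholder model identifier (check the exact model key in LM Studio's own UI):

```python
import json
import urllib.request

# 1234 is LM Studio's default server port; change it if you've moved it.
LMSTUDIO_URL = "http://127.0.0.1:1234/v1/chat/completions"

def build_request(prompt, model="openai/gpt-oss-20b", temperature=0.7):
    """Build an OpenAI-style chat payload for LM Studio's local server."""
    return {
        "model": model,  # placeholder -- use the model key shown in LM Studio
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Anything that can speak the OpenAI chat format (OpenWeb UI included) can point at that same URL.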
Models: These are all downloadable directly through LM Studio
OpenAI's GPT-OSS 20B is my personal favourite. It offers reasoning with adjustable low, medium, and high settings. Low takes about 2 seconds; high can take up to about 2 minutes. Even the actual generation speed is a tad slow for me on my setup. I'm also running an external GPU through a dock, which comes with its own bottleneck, so... yeah.
This one is a mixture-of-experts model, so despite being 20B parameters overall, only around 3-4 billion are active per token, making it much lighter on resources when it's actually running. Good job, OpenAI. Lastly, it is trained for tool use.
I use Mythalion 13B for creative writing/story content. It's super fast, decent for chatting, and pretty good at coming up with Stable Diffusion prompts, but I wouldn't ask it for code; it's not that smart. I really only use this one when I want speed over everything else, because GPT-OSS is my go-to generally speaking.
Deepseek-Coder (R1) - I use this one strictly for longer, more complex scripts, as I find anything OpenAI does has a cap of about 200 lines before it completely loses track of everything and just starts creating bugs for you to fix. The downside is that it's by far the slowest model I have.
Vision models: There are a few options here. I haven't tried any yet as it's just not something I really need or would use. I downloaded a 7B model with vision but have yet to test it so if you do need one, same with any model really, just do some research or download them and test them out and see what works best for you.
In terms of models these are just what I use. There's a million of them out there. If you can't run a 13B model due to a smaller GPU like 8GB of vram, just try out some 7B models or smaller. Parameter count isn't always indicative of performance.
Stable Diffusion (Image Gen):
Right now I'm using A1111. Weird name, I know. I use it because it's straightforward on the surface but still has all of the deep settings: LoRA training, img2img, it's all there, and I don't have to mess around with nodes like in ComfyUI. It allows for the use of VAEs and has a bunch of other things I don't know the meanings of. I'll be honest, I don't use this aspect of my stack all that much. I occasionally use it to create cover art for my articles or iterate through character concepts, but that's about it. I don't really have any model suggestions here; I use one called RevAnimated for everything and just haven't played with it all that much.
A1111 does of course have a web UI as well, so you can run it independently if your AI isn't very good at prompting things properly. The front end I use doesn't really have too many options, so it's good to have the web UI bookmarked in case you want more control and better images.
Link (Github): https://github.com/AUTOMATIC1111/stable-diffusion-webui
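For the curious, the way OpenWeb UI (or anything else) drives A1111 is its REST API, which you get by launching the web UI with the `--api` flag. A rough sketch of a txt2img call; the payload fields shown are a subset of what the endpoint accepts, and the defaults here are just illustrative:

```python
import base64
import json
import urllib.request

# Launch A1111 with the --api flag first; 7860 is its default port.
A1111_URL = "http://127.0.0.1:7860"

def txt2img_payload(prompt, steps=25, width=512, height=512):
    """Build a minimal payload for A1111's /sdapi/v1/txt2img endpoint."""
    return {
        "prompt": prompt,
        "negative_prompt": "blurry, low quality",
        "steps": steps,
        "width": width,
        "height": height,
        "cfg_scale": 7,
    }

def generate(prompt, out_path="out.png"):
    """Request an image and write the first result to disk."""
    req = urllib.request.Request(
        A1111_URL + "/sdapi/v1/txt2img",
        data=json.dumps(txt2img_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # images come back as a list of base64-encoded strings
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result["images"][0]))
```

This is also handy for scripting batches outside of any front end.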
Text-To-Speech:
Chatterbox. 100%, use Chatterbox. Everything else I tried, like Tortoise or Coqui, was awful. If you want it to sound like a cursed Google Maps, by all means.
Chatterbox, however, is really damn good. If anyone is familiar with ElevenLabs, it's like that but locally run. It takes a couple of seconds to start speaking once the actual text generates, but it does stream in chunks to speed things up. One of the best parts is that Resemble AI, its creator, has voice cloning software as well as text-prompted voice creation tools on their website. The cloning is a one-off for a new account, so in my case I went and found a good clip of Cortana, cleaned it up, and you can bet your ass my AI sounds exactly like her. All you need is a 10-second voice clip and it does a good job. Otherwise, they have a pay-as-you-go option if you want to make a few voices. To swap the voice model out from the default, you just have to open the right script and change the default "american-female" or whatever. I believe they cover that on their GitHub page, where you'll also find the download for Chatterbox.
Link: https://www.resemble.ai/chatterbox/
Web Search:
SearXNG. This one is actually where I'm the least certain. I set up SearXNG, a self-hosted metasearch engine: it acts sort of like a Google search, except you can have it query multiple search engines and services at once. So in short, you can set it to search Google, DuckDuckGo, Brave, whatever you want, and instead of pulling results only from Google, you take a shotgun approach and run six searches in one. I must have set this one up at 2:00 AM after a few beers or something, because I don't remember much about it. That also means nothing went wrong, and as a result it wasn't very memorable. All I know is my AI can search six engines in one shot and it works pretty well. I'm fairly certain you can set up the AI front end to use any search provider you want if you don't feel like hosting this one yourself.
Additionally, I run Cloudflare WARP when using my stack, so any web searches it does just look like Cloudflare traffic to my ISP. In a world of increasing surveillance, I'm very privacy-first. I might even end up running it through Tor soon.
Link (Github): https://github.com/searxng/searxng
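The way front ends typically hook into SearXNG is its JSON output mode, which you enable in `settings.yml` (the `json` entry under the search formats). A hedged sketch of building and running a query; the port is an assumption based on a bare-metal install, and the engine names are just examples:

```python
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://127.0.0.1:8888"  # port is an assumption; match your instance

def search_url(query, engines="google,duckduckgo,brave"):
    """Build a SearXNG search URL. The json format must be enabled in settings.yml."""
    params = urllib.parse.urlencode({
        "q": query,
        "format": "json",
        "engines": engines,
    })
    return f"{SEARXNG_URL}/search?{params}"

def search(query):
    """Run the query and return the aggregated result list."""
    with urllib.request.urlopen(search_url(query)) as resp:
        return json.load(resp)["results"]
```

Each result in the list comes back with the originating engine attached, which is how one query fans out across all of them.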
Frontend:
OpenWeb UI. I love this. It has anything I could ever need in a day. Some of it is still in an experimental or beta phase, such as the memory function, but it does let you set up knowledge bases and tooling for your AI to pull from anyway. One of the little caveats here: generally speaking, you have to select the OpenAI option when linking your APIs/domains for local hosting. For example, the TTS setting I think had an option for Web API and no custom fields, so you would change it to OpenAI, point it at your localhost, and since the API key field just needs a value, you can type 'ttskey' or whatever in there and you're good to go.
It allows you to rate your LLM's responses. To accomplish a task like generating an image, having your AI do a web search, or running some code, you just select the toggle under your text input field and off you go. Regarding image gen, sometimes it seems to partially reuse its previous response first, so expect to hit the regenerate button here and there. I find this is less of an issue with GPT-OSS than with my other models like Mythalion. On the upside, you can just sit there and keep hitting regenerate instead of re-prompting over and over.
The front end in the admin panel has plenty of options for various things such as setting up multiple models so you can toggle between them (if you can run multiple at once, otherwise you need to swap them in LM Studio anyways). Evaluations, Tools, Documents, Code Execution, Pipelines, Databases, you name it, it's probably in there.
For TTS, the personal user settings have an autoplay voice option you can toggle on; otherwise, you can hit the speaker icon to have a message read out whenever you want, and you can of course replay messages as many times as you like.
My PC is always online, but I believe that to enable full offline functionality for OpenWeb UI (it asks for an email and password by default), you need to change an environment variable, something like OFFLINE_MODE=true, before launch.
Lastly, as far as I know you can customize this software, get rid of all their branding, and do as you please, but if you decide to use it commercially (beyond 50 users, I believe it was) they do have some restrictions or ask you to be on a paid plan.
Link: https://openwebui.com/
Other Notes:
Regarding custom persona prompts like "You are a systems architect who prioritizes accuracy over conversation" or what have you, I would set the base prompt in LM Studio. The admin panel in OpenWeb UI also has a higher-priority prompt field in case you don't have one through the LLM software you're using, and lastly there's a per-user prompt, so if you have multiple users, they can each set their own persona prompts. The higher-priority ones from LM Studio or the admin panel take precedence, and the user prompt gets added on top. Basically, if you want to add some filters, do it via LM Studio first, or if you're using something else, set it in the admin panel for each specific model.
For Linux users:
I set up a launcher script and an alias, so all I have to do to fire up the entire thing is open a terminal and type aistart, and it launches everything sequentially to ensure resource allocation is correct and it all goes to the correct GPU, etc. I also set up an 'aistop' alias to ensure it all shuts down properly. I'm no expert, so my script might not be the best, but I'll upload it to GitHub as a reference. The only caveat with the launcher, and it's probably fixable, I just haven't got around to it, is that when LM Studio launches, it doesn't automatically load the last model I used. That's it; everything else is great. I'm using what I call a "debug" launcher, so it fires off something like five terminals plus the LM Studio GUI so I can monitor everything. I intended to build a clean launcher, but I always have multiple monitors and I feel like a wizard with the terminals up, so to be totally honest I never really had the motivation to do the clean one.
If you are doing that, just bear in mind something I learned the hard way: just because nvidia-smi lists my 3060 as GPU 0 doesn't mean other programs will respect that ordering. In the case of the TTS software, it turned out my 1070 was treated as GPU 0, and oh boy was that a headache. Running the LLM, SD, and TTS all on one 3060? Nope. It just caused my model to fail, and I couldn't figure out why it kept reloading.
Launch/Kill script example:
https://github.com/Ghotet/Frankenstack-Launch-Script
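If you'd rather roll your own, the sequential-start idea boils down to something like the sketch below. Every command and path here is a placeholder (the Chatterbox wrapper script in particular is hypothetical), so point them at your own installs.

```python
import subprocess
import time

# Placeholder commands -- replace with your own install paths and entry points.
SERVICES = [
    ("lmstudio",   ["lms", "server", "start"]),           # LM Studio's CLI, if installed
    ("a1111",      ["bash", "webui.sh", "--api"]),
    ("chatterbox", ["python", "chatterbox_server.py"]),   # hypothetical wrapper script
]

def start_stack(dry_run=False):
    """Launch each service in order, pausing so each one claims its GPU memory first."""
    procs = []
    for name, cmd in SERVICES:
        if dry_run:
            print(f"would start {name}: {' '.join(cmd)}")
            continue
        print(f"starting {name}...")
        procs.append(subprocess.Popen(cmd))
        time.sleep(10)  # crude, but keeps two services from grabbing VRAM at once
    return procs

def stop_stack(procs):
    """Terminate everything in reverse order of startup."""
    for p in reversed(procs):
        p.terminate()
```

Wrapping `start_stack` and `stop_stack` behind shell aliases like aistart/aistop gives you the same one-command experience.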
Why not use Docker?
I don't use it for anything else, and in my attempts to set things up on Linux I had issues with it appending "docker" to the localhost addresses. If the localhost addresses for the LLM and such aren't exact, it won't work. Maybe knowing what I know now I could figure it out, but it was just a hassle at the time and I didn't really see the benefit. I don't like added dependencies; I have enough components on the go, and it was another potential moving part to break. Engineering 101: the more moving parts you have, the more likely something is to break. And it did, immediately. User error? Probably.
You're welcome to try it if you want; it just isn't my thing. I haven't tried it on Windows, but I don't think it has that issue there. I was told it was Linux-specific.
Connecting it to the web:
I'll be honest, this is super long as it is, so this is going to be quick. You need to have/buy a domain, set it up through Cloudflare, and then link the domain via OpenWeb UI. You'll need to run something like a Cloudflare Tunnel or one of the many other tunnelling options out there. It's not super complicated, and ChatGPT can run you through it in a couple of minutes if you want. Personally, I love it because I use my stack on my phone while I'm sitting outside having a coffee. My rig may technically be a laptop to some (desktop replacement at best), but it is far from portable at this point. As long as it's running, though, I can use my stack from any PC or device with a web browser. On Android you can save any web page as its own app, basically, so it feels a bit like having my own AI app. I'd go more in depth, but as I said, this is already super long and I don't recall the specifics of the setup process off hand, unfortunately.
Finally:
A big thanks to the community. I wrote this because my initial article about the pain and triumph of setting this all up did really well, so I saw the interest was there, and down the road I do intend to write a full and proper tutorial for both OSes, as it can be a headache. DO NOT expect everything to just go off without a hitch if you have never set something like this up. Use the resources out there, check documentation if needed, and ask a cloud AI if you get stuck. I was going to set up a Discord to try to help, but there didn't seem to be much interest, and I wouldn't exactly be able to provide quick responses on a whim anyway. Same with the web access: most people seem to be pretty security-focused and just want it to run offline, so I just don't really have the motivation. If I can figure it out, anyone else here can too. I have faith in you.
Best of luck, experiment, have fun with it and just remember, the pain of troubleshooting just makes it feel more rewarding when you finally get it. It may take several hours, but it's worth it once it's done. Personally, I'm looking forward to figuring out how to make good use of all the other features and figuring out some optimisation to get the GPT-OSS to reply a bit quicker. Where there's a will there's a way.
//Ghotet