Set Up Your Own Personal AI Stack: Summarized

Hey folks. I finally have a moment to sit down and lay out the blueprint for setting up your own AI stack. This will be a quick summary, not a tutorial.

This stack consists of:

  • LLM software
  • Stable Diffusion (image generation)
  • Text-to-speech (but not speech-to-text)
  • Web search for the LLM
  • All tied together through a unified front end

Just to clarify upfront: this isn't a tutorial or step-by-step guide. I'm laying out the toolkit, with notes and caveats for each piece of software. For example, I'll list my machine specs and the LLMs I run to give you realistic expectations. This stack is GPU/CPU hungry.

My Specs

  • Modified Alienware 15 R4 (circa 2018)
  • Nvidia GTX 1070 8GB (laptop GPU)
  • Nvidia RTX 3060 12GB (AGA external GPU dock)
  • Intel i7-8750H CPU @ 2.20GHz
  • 32GB RAM
  • All drives are NVMe
  • The stack uses ~120GB of disk space, including ~8 LLM/SD models

LLM

LM Studio was my choice:

  • Offers an in-depth front end with performance tuning and experimental features
  • Allows offloading KV cache for faster performance (quality may vary)
  • Lets you run multiple models simultaneously (if your system can handle it)
  • Easy download of models directly from Hugging Face

I recommend trying it before asking about alternatives like Ollama. I’ve used Ollama in CLI mode, but I wasn’t a fan personally.
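
If you want to script against it, LM Studio can also run a local server that speaks the OpenAI API (port 1234 by default). A minimal sketch, assuming a model is already loaded; the model name is just a placeholder for whatever you're running:

```bash
# Minimal sketch: query LM Studio's OpenAI-compatible local server.
# Assumes the server is running on its default port (1234) and a model
# is already loaded; "gpt-oss-20b" is a placeholder identifier.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Summarize what a KV cache does."}],
    "temperature": 0.7
  }'
```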

Models I use:

  • GPT-OSS 20B – My favorite for reasoning. Adjustable low/medium/high reasoning effort: low takes ~2s, high ~2min. It's a mixture-of-experts model, so only ~3-4B parameters are active at a time, which keeps it lighter on resources. Trained for tool use.
  • Mythalion 13B – Creative writing, fast, decent chat, good for Stable Diffusion prompts. Not for code.
  • Deepseek-Coder (R1) – Strictly for complex scripts. The slowest of the three, but it handles long code reliably.

Vision models:

  • I haven’t used these extensively; if you need vision, try a 7B model and test. Smaller models may be better for limited VRAM.
  • Parameter count isn’t always indicative of performance; adjust based on GPU capacity.

Stable Diffusion (Image Generation)

I use A1111 (the AUTOMATIC1111 Stable Diffusion WebUI):

  • Straightforward GUI with deep settings for LoRA training, img2img, VAE support
  • I mainly use it for cover art or character concepts
  • Default model: RevAnimated
  • ComfyUI is a node-based alternative; I didn't use it
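
If you start A1111 with the --api flag, it exposes a REST API next to the GUI, which is how I'd wire it into the rest of the stack. A rough sketch, assuming the default port; the prompt is just an example:

```bash
# Minimal sketch: generate an image via A1111's REST API.
# Assumes the webui was started with the --api flag on its default port 7860.
curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{"prompt": "album cover art, neon city, rain", "steps": 25}' \
  | jq -r '.images[0]' | base64 -d > cover.png
```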

Text-to-Speech

Chatterbox – 100% recommend:

  • Local alternative to ElevenLabs
  • Streams in chunks for faster playback
  • Supports voice cloning (it's made by Resemble AI): a ~10-second clip is enough for a new voice
  • Swap default voice by editing the relevant script (check GitHub for details)
  • Other options (Tortoise, Coqui) were worse in my experience.
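
One hard-won caveat from my own setup: on a dual-GPU machine, the TTS can end up grabbing the wrong card and starving your LLM. The fix that worked for me was a dedicated venv that can't see the bigger GPU at all. A sketch of the idea; the device index and server script name are placeholders for your own setup:

```bash
# Sketch: run TTS in its own venv, blinded to the primary GPU.
# CUDA_VISIBLE_DEVICES=1 assumes the weaker card is device 1 on your
# system; check with `nvidia-smi`. "tts_server.py" is a placeholder for
# however you launch Chatterbox.
python3 -m venv ~/venvs/tts
source ~/venvs/tts/bin/activate
export CUDA_VISIBLE_DEVICES=1   # hide every GPU except the weaker one
python tts_server.py
```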

Web Search

SearXNG – a self-hosted metasearch engine:

  • Searches multiple engines at once (Google, DuckDuckGo, Brave, etc.)
  • AI can query several sources in one shot
  • I run it through Cloudflare Warp for privacy; Tor is optional
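
To actually hand those results to the LLM, the front end has to know where SearXNG lives. A sketch of the OpenWebUI side; the variable names follow OpenWebUI's SearXNG integration but may differ by version, and the port assumes SearXNG's common default:

```bash
# Sketch: point OpenWebUI's web search at a local SearXNG instance.
# Double-check variable names against your OpenWebUI version's docs.
export ENABLE_RAG_WEB_SEARCH=true
export RAG_WEB_SEARCH_ENGINE=searxng
export SEARXNG_QUERY_URL="http://localhost:8080/search?q=<query>"
```

One gotcha: as far as I know, SearXNG has to allow JSON output (the formats list in its settings.yml) before other tools can consume its results.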

Frontend

OpenWebUI – central control hub:

  • Configure multiple models, knowledge bases, tools
  • Evaluate LLM responses, run pipelines, execute code, manage databases
  • TTS autoplay option in user settings; speaker icon for manual playback
  • Offline mode available (set the OFFLINE_MODE environment variable to true; see the launch sketch after this list)
  • Customize branding freely; commercial use over 50 users may require a paid plan
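
For reference, a pip-based launch might look like the sketch below. OFFLINE_MODE and WEBUI_NAME are settings I believe OpenWebUI reads from the environment, but verify against your version's docs; the branding name is a placeholder:

```bash
# Sketch: launch OpenWebUI with a couple of the settings mentioned above.
# Assumes a pip install; adjust the port to your setup.
export OFFLINE_MODE=true       # skip checks that need internet access
export WEBUI_NAME="Ghotet AI"  # custom branding; placeholder name
open-webui serve --port 8080
```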

Custom prompts/personas:

  • Set base prompt in LM Studio
  • OpenWebUI admin panel allows high-priority prompts
  • Per-user prompts can be layered on top

Linux Launcher Script

  • I created an aistart alias to launch all components sequentially so resources get allocated in the right order (sketched below)
  • LM Studio doesn’t auto-load the last model yet
  • Debug launcher opens multiple terminals for monitoring
  • Important: GPU assignment isn’t always respected automatically; check NVIDIA settings
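
To give a feel for it, here's a stripped-down sketch of an aistart-style launcher. The lms calls are based on LM Studio's CLI tool, and every path, port, and model name is a placeholder for your own setup:

```bash
#!/usr/bin/env bash
# Stripped-down sketch of an "aistart"-style launcher.

# 1. LLM backend first, so it claims the primary GPU before anything else.
lms server start
lms load gpt-oss-20b   # LM Studio doesn't auto-load the last model yet
sleep 10

# 2. Stable Diffusion with its API enabled.
(cd ~/stable-diffusion-webui && ./webui.sh --api) &
sleep 10

# 3. TTS pinned to the weaker GPU (see the venv trick above).
(source ~/venvs/tts/bin/activate && CUDA_VISIBLE_DEVICES=1 python tts_server.py) &
sleep 5

# 4. Front end last, once the backends are up.
open-webui serve --port 8080 &
```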

Why Not Docker?

  • Docker caused localhost address issues on Linux
  • Added dependencies can break the stack; simpler is better
  • Windows may not have this issue

Connecting to the Web

  • Requires a domain and a Cloudflare Tunnel
  • The tunnel forwards traffic to OpenWebUI on your local machine (as sketched below)
  • Lets you access the stack anywhere, including mobile
  • ChatGPT or Cloudflare's documentation can guide you through setup quickly
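
The Cloudflare side roughly looks like this, assuming your domain is already on Cloudflare; the tunnel name and hostname are placeholders, and the port assumes a pip-installed OpenWebUI:

```bash
# Sketch: expose OpenWebUI through a Cloudflare Tunnel.
# "ai-stack" and "chat.example.com" are placeholders.
cloudflared tunnel login                  # authorize against your domain
cloudflared tunnel create ai-stack
cloudflared tunnel route dns ai-stack chat.example.com
# Proxy the public hostname to the local OpenWebUI port and run it:
cloudflared tunnel run --url http://localhost:8080 ai-stack
```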

Final Thoughts

  • DO NOT expect this to run perfectly on first try
  • Troubleshooting is part of the fun, and it's rewarding
  • Experiment, iterate, optimize
  • A full tutorial may come later for both OSes

Best of luck, have fun, and remember: the pain of troubleshooting makes the success sweeter.

// Ghotet

Top comments (9)

Guy

Really enjoyed your “Frankenstack” breakdown; there's something genuinely inspiring about someone in a garage sweating those GPU cycles and privacy ideals to build something future-proof. When I wired up my own orchestration with Claude, I leaned on that same hacker ethic: tight context flow, predictable handoffs, and making sure the tools serve you, not the other way around. Your stack isn't just neat tech tinkering; it's real, usable sovereignty in action. Nice work.

Quick question for you: when you’re juggling image gen, TTS, web search, and LLMs under a single front end, have you felt any orchestration pain? Like components tripping over each other or has everything stayed remarkably smooth?

Jay

Thank you for acknowledging the hacker spirit and the privacy-first approach. That was the main premise when I started down this road.

The TTS was the last piece I added, and it wasn't at all smooth lol. It takes up just enough VRAM to either make my larger LLMs fail or keep the Stable Diffusion model from loading. I thought it was just a case of editing my launcher script to point the TTS at my lower-end GPU, but no matter what I did, everything would try to load on my 3060, causing VRAM shortages.

Eventually I realized the only solution was to make a new virtual environment that can't see my 3060 at all, so it defaults to my 1070 instead and alleviates that memory shortage. It does all run on the 3060 if I use smaller models or GPT-OSS, though, which is pretty impressive. If I try to run Mythalion 13B or Deepseek-Coder, LM Studio will eject the model and fail to load it back in.

Guy

That workaround makes total sense. I’ve had to play the same shell game with GPUs when juggling heavier models. It’s funny how orchestration ends up mattering as much for hardware as it does for the models themselves. You can wire Claude into a codebase with all the right context flow, but if your VRAM isn’t managed like a scarce resource, the whole thing collapses anyway.

What you did with the virtual environment to blind the 3060 is exactly the kind of hacker-first pragmatism I love. It’s not elegant, but it’s reliable, and that’s what makes it valuable. Honestly, that tension between wanting the “big brain” models like Mythalion 13B and keeping the whole stack stable is what makes these DIY builds so interesting. You’re constantly deciding whether to scale down for consistency or squeeze every last drop of GPU for capability.

Are you thinking about automating that orchestration a bit? Like, routing smaller jobs or TTS by default to the weaker GPU while reserving the 3060 for your heavier LLMs? That could save you from having to babysit it every time you switch workloads.

Jay

Yes, I've definitely had to learn how to manage the VRAM. While I was trying to figure out why TTS kept loading on the 3060 even when my shell scripts pointed it at the 1070, I realized that in my effort to keep the number of venvs to a minimum, I had shot myself in the foot a little.

The reason TTS kept ending up on the 3060 and choking the stack was that the venv it was running in was set to the 3060. That took longer than it should have to figure out lol.

My 1070 is basically doing nothing most of the time, so I've started doing exactly what you said and rerouting the lighter tasks over to the 1070.

I'm glad I set this up because it forced me to learn proper resource management, and for the first time I see VRAM as a scarce resource. I'd love to just grab a 24GB GPU, but at the same time I'd be missing out on the opportunity to learn all of this. One of my higher-priority missions now is to finally start building out a server, now that I have a better understanding of what I need and how it all works. I'd love to leave it running 24/7 since I linked it to a web domain, but sometimes I have to shut it down to free up resources for other tasks, since I run it on my primary PC.

Guy

That’s exactly it. You’ve basically turned resource management into part of the orchestration layer, and that’s a skill most people skip straight over when they just rent cloud GPUs. Having to wrangle it locally forces you to think in terms of trade-offs and pipelines instead of raw horsepower. And you’re right, if you’d just dropped cash on a 24GB card you’d have missed that learning curve completely.

The server idea makes sense now that you’ve mapped out what each piece needs. Being able to leave it running 24/7 without starving your main rig will give you a whole different feel for stability too, it stops being a side project you spin up and down and starts behaving like an actual service.

Are you planning to script the routing logic once you’ve got the server, or keep it manual for the sake of control?

Vida Khoshpey

This is great, it's so great that you did such a great job and that I found you. 😂😁💪🏻 Keep going, Ghotet!

Jay

Thank you! It was a very fun and rewarding project. A little frustrating at times, but completely worth it lol.

Jessica Williams

But with the pace AI is evolving, it looks like it will leave humans behind.

Jay

There is a chance. I saw someone liken it to when the calculator came along; I'd argue it's even like when the computer came along. Both of those changed the world, just not as fast. Embracing it and learning how it works is a good place to be :)