Building a Fully Offline AI Stack: A Proposal
Hey folks, this one's going to be short and simple. No fancy formatting or morbid humor this time—just a quick check-in to gauge if there's any interest in a potential write-up.
A Quick Recap: The "Franken-stack" AI Setup
A lot of people seemed to like my previous post about building a fully local AI stack, which I dubbed my Franken-stack. The goal was to replicate the functionality of ChatGPT with a complete set of features, including:
- Chat
- Image Generation
- Voice Interaction
- Web Search
- Connectivity with personal domains/websites
This entire system operates offline, is open-source, and costs you zero dollars. Everything works through a unified front end, so you can access it all through a single interface, just like you would with ChatGPT.
Is It for You? Here's What You Need to Know
While I won't be writing a full tutorial just yet, I wanted to gauge interest in a breakdown of the setup. If you're curious, I'd be happy to list the various software components needed to build this stack. Be warned, though, it's a bit GPU-heavy and can tax your CPU at times, so there are some minimum system requirements to consider.
It should run on Windows, but since I primarily use Linux Mint, there might be small caveats on the Windows side. However, I suspect setting it up on Windows might actually be easier than on Linux.
Key Considerations:
- CPU & GPU Requirements: The system can be pretty demanding on both the CPU and GPU. You’ll need a solid setup to ensure smooth performance, especially if you plan on running multiple features at once (like Image Generation and Web Search combined with the demands of an LLM). There are ways to run it with fewer resources, which I will outline; generally speaking, though, reducing GPU load means running lower-tier LLMs, so there is a performance sacrifice involved.
- Storage Considerations: This setup takes up a significant amount of disk space, so make sure you have enough room before getting started. I will outline how much space you need as a baseline. Docker may be an option; however, due to my Linux use case and some edge-case issues with the API domain logic, I was unable to get it working in Docker containers. As far as I know, this is a Linux-specific issue with how Docker is appended to the domain address by default.
- Cross-platform Support: I’m working toward making it cross-platform, so it will work across both Linux and Windows. However, this will take some time and testing.
What’s Missing?
Currently, the stack is feature-rich but has one limitation: Speech to Text (STT). I’ve implemented Text to Speech (TTS) already, so it can speak to you, but you can’t speak to it yet. However, by the time you have it set up, I’m sure you’ll have the skills and knowledge to add it in yourself; I just don’t have specific software in mind for this piece yet.
Extra Thoughts: Personal Website Integration
I’ve already set up the stack to be hosted through my personal domain, enabling remote access anytime, anywhere. Would anyone be interested in learning how to do this as well? It’s a bit of extra effort, but definitely doable. This part will cost a few dollars, as you would need to purchase a domain. You could also probably link it to a new page on an existing domain if you have one.
A Word of Caution:
There are many tools available to accomplish these tasks, but the stack I’ve built uses very specific tools that work well together. You’re welcome to experiment with alternatives, but I can’t promise I’ll be able to assist you much if you go down a different route.
Future Plans:
- Speech to Text: I aim to integrate this feature soon, but I’m currently busy and don't have an exact timeline.
- Personal Website Integration: I already have the stack linked to my personal domain for online use. If you're interested in this functionality for your own setup, I can walk you through that too.
- More Platforms: Once everything's running smoothly, I plan to support setups for Linux, Windows, and Docker, and will make sure to include Speech-to-Text in the setup when I get around to doing a full tutorial.
Interested? Here’s How to Let Me Know
If you're interested in me laying out the tools, specs, and requirements for setting up your own fully offline, fully featured AI stack (aside from STT), just leave a reaction and drop a comment. If I get enough traction, I'll write up a detailed guide.
Also, if you have any specific questions, feel free to ask them in the comments. I’ll do my best to answer them!
If there’s enough interest, I’ll go ahead and dive deeper into the technical details and start building the full write-up. Thanks for reading!
Side note:
I ask for comments because, with a growing number of followers, I can’t go through each one individually to see who to follow back. A comment lets me know you’re active, that you’re following (thank you!), and helps me engage with the people who make writing these articles worthwhile.
I’m not monetizing this, I’m not advertising anything, and it’s just me—a chaotic solo dev—trying to carve out a space to share experiences with like-minded people. I’m happy to support fellow devs and help genuine insights reach a wider audience, but I can’t do that if I don’t know who’s genuinely contributing and who’s just posting AI-generated clickbait.
I want to help build a real, supportive community here. Personally, I find more value in authentic stories of trial and error than in “top 10” lists you could get from AI that still somehow get 180 reactions. I know opinions differ, but I’d rather highlight those real experiences than watch them get buried under viral but generic content. We all love AI, but let's take a moment to be human, shall we?
//Ghotet
Top comments (36)
This is exactly my kind of jam—especially as someone who values autonomy and privacy in AI tooling. Your plan for a fully offline, self-contained AI stack is not only ambitious but deeply refreshing. The nod to GPU and storage considerations highlights your practical, real-world awareness—solid groundwork for anyone planning to replicate or adapt your setup.
I’d absolutely love a detailed breakdown of the tools and specs, especially your handling of resource constraints and fallback setups. And if you're open to expanding into agent-based architectures, your post could become a gateway for building an AI agent that operates fully offline and is customizable to each user's hardware.
Thanks for doing this—for real, it's the authentic, no-vendor-cloud vibe that sets work like yours apart. Count me in as a keen reader and eager tester!
Haha, I can tell your reply’s got a bit of AI polish—it’s smooth enough it could’ve been drafted by a model. No shade though, I actually filtered this one through AI myself just to keep the tone right. I’m tuned into that wavelength and can usually spot it a mile away. 😉
Really appreciate the support—it means a lot. Quick question I throw to everyone: what’s your setup like in terms of OS and GPU VRAM? Most folks chiming in so far are Windows-based. I’ll be building out support for Windows too, but I may need to spin up a Discord just for that crowd since I’ll have to rebuild/test on that side to offer proper troubleshooting.
On the agentic workflows—if you mean full corporate-style automation, that’s not really my lane. There are lighter-weight, purpose-built options for that. But if you mean letting the AI coordinate across offline tools, that’s actually part of the tool-set I provide. The stack already supports multi-model input and can hook into local utilities, so you could absolutely build an agent layer on top if you wanted. My focus is more on creativity and play than business pipelines—but the architecture leaves the door open for people to extend it however they like.
Either way—authentic or AI-filtered—you’re in the right place. Chaos is my element. If you’re ever serious about building a fully agentic stack though, that drifts into my freelance lane. Not something I’d spin up in a casual dev.to thread, but I’m always open to being contacted on Discord if someone wants to go pro with it.
Check out WhisperX: it does STT and can do word- and letter-level timestamps as well as speaker diarization. There are different model sizes you can use. I wrote a little server that processes a wav file and returns JSON with word timestamps and speaker diarization; it's handy for adding animated subtitles (TikTok style). Another option is forced alignment (like NeMo Forced Alignment, or NFA). This can be more accurate because you give it the audio and the words that were spoken, and then it gives you timestamps and makes sure that the words are accurate and based on your input text.
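The core of that little server is only a handful of calls. Here's a minimal sketch, assuming the whisperx package on a CUDA box; exact function locations can shift between WhisperX versions, so treat the names as approximate and check its README:

```python
import json
import whisperx

device = "cuda"
audio_file = "input.wav"  # example path

# 1) Transcribe with a Whisper model sized to your VRAM
model = whisperx.load_model("small", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2) Align to get word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3) Diarize and attach speaker labels to each word
#    (the pyannote models behind this need a Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# The JSON payload the server returns: segments with word timestamps + speakers
print(json.dumps(result["segments"], indent=2))
```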
That sounds almost perfect. Is it quick and accurate enough for me to just speak naturally to my AI? That's kind of the big thing I need personally.
It doesn't take that long to process your voice. It does depend on how long your utterance is, the size of the model you're using, your hardware, and your pronunciation/enunciation. You should probably just test it out and build a benchmark. I typically just feed TTS output back into STT, but I'll probably be trying this on my own voice, too. You can also run the STT output text through an LLM to attempt corrections for words and phrases that the STT got wrong.
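If you want numbers rather than vibes, the benchmark loop is tiny. A sketch assuming the openai-whisper and jiwer packages; roundtrip.wav is a placeholder for a clip your TTS generated from the reference text:

```python
import time
import jiwer    # word error rate
import whisper  # openai-whisper

# Known transcript of a test clip, e.g. text you fed to your TTS
reference = "the quick brown fox jumps over the lazy dog"
wav_path = "roundtrip.wav"  # placeholder: the TTS rendering of that text

model = whisper.load_model("small")

start = time.perf_counter()
result = model.transcribe(wav_path)
elapsed = time.perf_counter() - start

# Crude normalization so punctuation/case don't inflate the error rate
hypothesis = result["text"].strip().lower().rstrip(".")
print(f"latency: {elapsed:.2f}s")
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```

Run that across model sizes (tiny/base/small) and utterance lengths and you have your benchmark.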
Major props for tackling a full-featured, offline stack. The unified front-end is the holy grail for these kinds of projects.
Please count this as a strong vote of interest. I'd love to see the component list!
I’ll happily count your vote as a strong one! My main question: which OS are you running? As far as I know, everything works on Windows, but beyond the LLM and Stable Diffusion, I haven’t fully tested TTS or domain integration there. I have a dual-boot setup, and my long-term goal is accessibility regardless of platform.
I'm mainly on Linux (Ubuntu/Fedora). I think I could try to patch your solution for Linux; hopefully it will just work.
I run my own on Linux; I'm just looking to support both OSes. But hey, once I have it out there, if you have any suggestions for tweaks or anything, I'd love to hear them! I'm pretty new to Linux (I use Mint).
I'm planning on setting up a Discord for anyone who needs help along the way, and I'd love to have someone like you in there to help. I built it out over months and didn't document it, so it's hard for me to do a proper full tutorial without a clean environment to rebuild it in.
This sounds really interesting. An offline, all-in-one AI stack would solve a lot of privacy and reliability concerns. Curious to see how you balance GPU demand with usability—definitely worth a deeper write-up.
That will come down to personal preference a bit, and to running models that are a good match for your available resources. I'm hoping to drop the rundown later today!
I think that setup is a very impressive effort and a different perspective on using AI capabilities. Your article gives me a good view of the advantages of open-source AI. Thank you again for your perspective.
Love the energy here. A few thoughts from running this kind of thing in anger:
The GPU/CPU tax is real. The “Franken-stack” works great until you try to run TTS + vision + LLM inference all at once and your fans spin like a jet engine. Orchestration matters more than raw horsepower: deciding when to call which model saves you far more than shaving another 5% off GPU load.
Docker pain on Linux tracks with my experience. I usually end up with a hybrid: local bare-metal for hot paths, containers for the ancillary stuff. Keeps API domain weirdness isolated.
Cross-platform is always a tarpit. My rule: pick the platform you’d actually deploy on in prod and stabilise there first. Windows being “easier” only lasts until you try scaling.
STT: Vosk or Whisper small models will get you 80% of the way. They’re not perfect, but fine for orchestration loops.
Biggest takeaway: don’t just think in terms of “feature parity with ChatGPT.” Think in terms of consistency, safety, and orchestration, otherwise you’ll end up with a toddler with a super brain opening every toy box at once.
Would read a full write-up, especially if you document not just what works, but what breaks. That’s where the real value is.
You're absolutely right on all counts, and I can feel your pain. I'm not sure if you read my other article about it, but, well, I basically wrote an entire article about the pain lol. Mine doesn't use Docker at all; I kind of just see it as an added dependency. I'm running two GPUs: an onboard one in my gaming laptop and an external one through the Alienware Graphics Amplifier. My Linux stack runs great, which is why I offered to lay out the tools and models I'm using. Feature parity with ChatGPT was the original motivator, but not the end goal once I got going. If you haven't tried the GPT-OSS 20B model yet, give it a go. It's set up in a way that doesn't hog the GPU like a normal 20B model would; it's even lighter than a 13B model. It's one of the smartest I have used, but I keep the reasoning on low or medium for speed because it is slower than my 13B models. You can always dial it up instead of down if you have the resources to spare.
Unfortunately, I built it out over the span of a few months and didn't document anything, so initially I would just be detailing tools I know play nice together and things to consider when setting them up (such as GPU tax and other caveats). It wouldn't be a full-on tutorial, because that would require rebuilding the entire thing, and I'm just not into that at this time.
Regarding cross-platform: that's a long-term goal, but for now, since I'm running dual boot, I'd be building out my Linux stack again on Windows. That's mostly because it's a clean environment over there, so I can document it and make a tutorial more easily, since I'd know going in that I intend to make one, as opposed to starting 3-4 months after most of it was done.
Out of Whisper and Vosk, which do you recommend? I have about 4GB of VRAM left on the second GPU lol.
Down the road, I'll be building out a server with two proper GPUs to run it all, as opposed to a six-year-old laptop with an eGPU carrying the weight.
Ha, yep, sounds familiar.
I totally get not wanting to rebuild just to document. My trick: don’t rebuild, instrument. Next time you spin a component, log every install step, patch, and config flag to a scratchpad. It’s “living docs” without the overhead. You’ll thank yourself six months later.
On GPUs: running dual cards with an external amp is… heroic. Respect. But yeah, orchestration beats raw metal here too. Partition tasks smartly: route cheap inference (summaries, metadata) to the weaker GPU; push the heavy reasoning or vision to the stronger one. Keeps both busy without bottlenecking.
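To make "partition tasks smartly" concrete, here's a minimal sketch; the task names and device indices are hypothetical examples, and it assumes PyTorch with two visible CUDA devices:

```python
import torch

# Cheap utility work goes to the weaker card, heavy reasoning/vision
# to the stronger one (indices are examples: cuda:0 = 3060, cuda:1 = 1070)
DEVICE_FOR_TASK = {
    "summarize": torch.device("cuda:1"),
    "metadata":  torch.device("cuda:1"),
    "reason":    torch.device("cuda:0"),
    "vision":    torch.device("cuda:0"),
}

def route(task: str) -> torch.device:
    """Pick the device for a task, falling back to CPU if the GPU is absent."""
    device = DEVICE_FOR_TASK.get(task, torch.device("cpu"))
    if device.type == "cuda" and device.index >= torch.cuda.device_count():
        return torch.device("cpu")
    return device

# Usage: pin each model once at load time, e.g.
#   summarizer.to(route("summarize")); big_llm.to(route("reason"))
print(route("summarize"), route("reason"))
```

The win isn't the dict, it's the discipline: every call site asks the router instead of defaulting to the big card.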
I haven’t run GPT-OSS 20B yet, but agreed, size isn’t everything. Orchestration and reasoning depth matter. Dial reasoning like a throttle: low for utility, high only when you need correctness. Think scalpel, not hammer.
On the STT side, if you’ve got 4GB spare VRAM, I’d lean Whisper tiny or small. It’ll still fit, runs smooth, and the accuracy is noticeably better than Vosk. Vosk only really shines when you’re stuck CPU-only, but since you’ve already got GPU cycles to play with, Whisper’s the safer long-term bet. Think of it as spending cycles on consistency now rather than patching accuracy issues later.
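For scale, trying Whisper is only a few lines with the openai-whisper package (the VRAM figures in the comments come from the Whisper README; the file name is an example):

```python
import whisper  # pip install openai-whisper

# Per the Whisper README: "tiny" needs ~1GB of VRAM, "small" ~2GB,
# so either fits in the 4GB you have spare on the second card.
model = whisper.load_model("small", device="cuda")

result = model.transcribe("mic_capture.wav")  # example file name
print(result["text"])
```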
Down the road, when you’ve got that dual-GPU server, the biggest win isn’t raw speed, it’s headroom to orchestrate. Consistency loops, retries, validation pipelines, all the stuff that makes the system trustworthy. That’s when this “pain log” turns into a proper stack.
Typically I document things in whatever random text editor I have open, and I regret not doing that here, because it's going to hurt a little when I need to set it all up again on a new PC lol. I was going to make a full-on build script to automate the entire process, but there's no way I could do that without just rebuilding the stack and documenting as I go.
Reasoning-wise, for GPT-OSS it's nice to have, but the drawback with my setup is that to change it between low, medium, or high, I have to go into LM Studio, eject the model, and load it back in for the change to take effect. I'm debating trying to find a dynamic solution, but realistically, if I'm willing to wait two minutes for it to reason, I'm probably doing some heavy coding and would be better off just running deepseek-coder. The only real snag I run into with GPT-OSS is that even with rolling window or truncate-middle enabled, it will eventually max out the context length and I have to start a new chat. That's more of an LM Studio thing; I'm sure they'll account for it in updates now that the model is out.
I like the idea of STT, and another commenter suggested I try WhisperX, so Whisper in whatever form seems to be the choice here. I'm just worried about processing time. I'm sure I can do some streaming optimization or settings changes to get Chatterbox TTS to start responding with voice once the LLM has generated, say, a paragraph or so, but as it stands I have to wait for the full LLM response to generate before TTS processing begins. Chatterbox does stream it back in chunks, so once it starts it just keeps going with only a small pause at the start, but I have a bit of work to do to make it feel more fluid, roughly along the lines of the sketch below. It's a nice-to-have, though, and I enjoy being able to just ask a question, switch tabs, and in under a minute it's generally going on a lengthy monologue for me lol.
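The shape I have in mind is roughly this; a sketch assuming LM Studio's OpenAI-compatible server on localhost:1234, with speak() as a hypothetical stand-in for handing a finished sentence to Chatterbox:

```python
import re
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible endpoint locally; the key is ignored
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def speak(text: str) -> None:
    """Hypothetical stand-in: queue a finished sentence for Chatterbox TTS."""
    print(f"[TTS] {text}")

buffer = ""
stream = client.chat.completions.create(
    model="local-model",  # whatever model is loaded in LM Studio
    messages=[{"role": "user", "content": "Tell me about local AI stacks."}],
    stream=True,
)
for chunk in stream:
    buffer += chunk.choices[0].delta.content or ""
    # Flush complete sentences to the TTS as soon as they appear
    while (match := re.search(r"(.+?[.!?])\s", buffer)):
        speak(match.group(1))
        buffer = buffer[match.end():]
if buffer.strip():
    speak(buffer.strip())  # whatever trails after the last sentence break
```

That way the first sentence starts playing while the rest of the response is still generating.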
Fortunately, running what I have now, I've already learned the importance of proper orchestration. There is no way I'd be able to run an average 13B model, SD, and TTS at the same time. When I tried, it maxed out the 3060 and would dump the LLM out of LM Studio, forcing me to reload the model; by then the GPU was occupied, so I had to run a smaller one to continue, which is why the TTS runs on the 1070 instead. I'm iffy on using up that last 4GB because I'm usually still watching YouTube, and the card still has to handle my GUI and such on my laptop for now, so it gives me a bit of room to continue normal operation even with the full stack running :). If the STT is only around 1.5GB or so, give or take, I think that would be fine. I'll just have to give it a try and see.
I'd be happy to hear more about it. I have a Windows gaming rig with a 3090 sitting around just waiting for a workload!
I'd be lying if I said I wasn't a little envious lol. My stack uses a 12GB 3060 as the main GPU, so the main limitation would be VRAM. If you don't run the TTS and image gen at the same time, though, you are probably in the clear. I'll provide some LLM variants that might still allow all of it to run at once.
Thank you for the input!
A lot of devs, startups, and privacy-focused users are interested, especially those who want control over their data, want to avoid cloud costs, and want to experiment with custom AI locally.
The privacy aspect is a huge bonus of this setup. And you can still use it online through a website while keeping all your data stored locally.
Update: I intend to have this out within about a week. I think most people are on Windows, so I'm going to make sure all of the components are available on that OS as well. I'm pretty busy the next few days, but I'm hoping to get started on this as soon as possible.
Even just an outline of your choices would be a nice release. I know there are a ton of open-source ChatGPT clones, but I'd love to hear what worked best for you.
Hey, I would love to set this up locally. Looking forward to hearing from you!
It should be up in the next few hours. Just keep in mind it's not a fully detailed tutorial or how-to guide. More like me handing you a set of Lego bricks but you have to put them together yourself :p
Yep, got it.
A fully offline Franken-stack sounds epic.