Abstract
I ran four AI models locally on my PC and configured them to work together without overloading my GPU. The goal of this project was to speed up AI on local machines. The limiting factors are therefore RAM and VRAM, since the models are loaded into the machine's memory. I used Qwen3-4B-Instruct-2507 as the LLM, Qwen3-TTS-12Hz-0.6B-CustomVoice for TTS, Qwen3-ASR-1.7B for STT, and a small sorter model as the practical worker and planner. The planner is the important part, as it is what takes the load off the LLM.
Introduction
I wanted to build my own AI setup and looked at API prices, but as a student I can't spend much on APIs. Running models locally was therefore my best option. I have 16 GB of RAM and an NVIDIA GeForce RTX 3070 with 8 GB of VRAM. Most AI models are designed for high-end hardware, so what was possible? I quickly learned that a 4B-parameter model was the best solution for text generation.
Throughout this post I use the term SRP. This is not an established term in ML or AI; it refers to the retriever, which also has the power to call functions, plan, and fetch information, so it works as a Sorter, Planner, and Retriever.
Hardware Setup
I'm using Windows. Linux was a possibility, but since the goal was to figure out how AI can be used on most PCs, I chose Windows. The code is written in Python.
Processor: AMD Ryzen 5 5600X 6-Core Processor, 3.70 GHz
Installed RAM: 16.0 GB
Storage: Samsung SSD 870 EVO 500 GB (466 GB formatted), Samsung SSD 970 EVO 500 GB (466 GB formatted)
Graphics card: NVIDIA GeForce RTX 3070 (8 GB VRAM)
System type: 64-bit operating system, x64-based processor
Methodology
Models I used:
- Qwen2.5-0.5B-Instruct for the sorter
- Qwen3-4B-Instruct-2507 for the chatbot ("Bob")
- Qwen3-ASR-1.7B for speech-to-text
- Qwen3-TTS-12Hz-0.6B-CustomVoice for text-to-speech
The setup works like this: when the models load, the retriever loads the conversation history, but before handing it to the language model, the retriever produces a summary of every interaction, precise enough that the quality of the model's history/memory doesn't degrade and may actually improve. The retriever extracts what is important from each interaction and removes the noise, which would otherwise make the model more confused.
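As a minimal sketch of this history compression, here is what the flow could look like. This is my own illustration, not the project's actual code: `summarize` is a stub standing in for a call to the sorter model, and the function names are hypothetical.

```python
def summarize(interaction: str, max_words: int = 8) -> str:
    """Stub standing in for the sorter model: keep only the first few
    words. In the real setup this would be a generate() call on the
    0.5B model, asked to compress the interaction."""
    words = interaction.split()
    return " ".join(words[:max_words])

def compress_history(history: list[dict]) -> str:
    """Summarize each past interaction so the prompt handed to the LLM
    stays short and free of noise that could confuse a small model."""
    lines = []
    for turn in history:
        lines.append(f"{turn['role']}: {summarize(turn['text'])}")
    return "\n".join(lines)

history = [
    {"role": "user", "text": "What is the weather like today in Copenhagen?"},
    {"role": "assistant", "text": "I checked online and it is 12 degrees and cloudy."},
]
print(compress_history(history))
```

The compressed history replaces the raw transcript in the LLM's context, which is where the memory savings come from.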
When the user asks the language model a question, the retriever jumps in and plans what to do, similar to thinking models. Since the text-generation model has fewer than 5B parameters, putting the planning into a thinking mode was an option, but it would make the answer extremely slow: the model would need to read the instruction and decide which functions to call, e.g. whether it should search the internet before responding. With the retriever, this planning step took only 0.58 seconds. But why the difference, when both just follow an instruction? For starters, the 4B model is overkill and underkill at the same time: smart enough to choose functions, but too slow at doing it. It will typically generate many tokens, asking itself questions and answering them, like:
"The user wants me to plan which functions to call. I could google it, but for this task that may not be necessary."
This has its perks, such as better quality, but for deciding whether to take an image of my screen, search the internet, or capture a picture from my camera, it is overkill. A 0.5B-parameter model, preferably an instruct model, is much better suited. The important part of making AI faster is to not use overkill models for simple tasks.
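To illustrate the division of labor, here is a hypothetical sketch of the routing step. The action labels (SEARCH, SCREENSHOT, CAMERA, NONE) and the parsing logic are my assumptions, not the project's actual protocol; in the real setup the 0.5B sorter model generates the label.

```python
# Hypothetical set of actions the sorter can choose between.
ACTIONS = {"SEARCH", "SCREENSHOT", "CAMERA", "NONE"}

def parse_plan(sorter_output: str) -> str:
    """Extract the first recognized action label from the sorter's raw
    output; fall back to NONE if the small model rambles."""
    for token in sorter_output.upper().split():
        token = token.strip(".,:;\"'")
        if token in ACTIONS:
            return token
    return "NONE"

# The sorter is prompted to answer with a single label, but small models
# sometimes add extra words, so the parsing stays defensive.
print(parse_plan("SEARCH, the user asks about today's news"))  # SEARCH
print(parse_plan("I think no tool is needed here"))            # NONE
```

Generating one short label with a 0.5B model is a fraction of the cost of letting the 4B model reason its way to the same decision, which is where the 0.58-second planning time comes from.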
For fun, I tried getting GPT-5.3-Codex to generate a txt file with all the libraries I'm using, and it worked.
Core ML / GPU
PyTorch (torch): Main deep-learning framework used to load/generate with models, run inference, compile TTS model, and manage GPU memory (torch.cuda.empty_cache()).
CUDA (NVIDIA GPU runtime): Required by the code paths using device_map="cuda:0" and other GPU acceleration features in PyTorch.
Model Libraries
transformers (Hugging Face): Provides AutoModelForCausalLM, AutoTokenizer, and BLIP-2 classes for loading chat/sorter/image models.
qwen_asr: Qwen speech-to-text model wrapper (Qwen3ASRModel) used for transcription.
qwen_tts: Qwen text-to-speech model wrapper (Qwen3TTSModel) used to synthesize spoken audio.
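Based on the libraries above, a requirements file might look roughly like this. Note that `torch` and `transformers` are real PyPI packages, but the distribution names for the Qwen ASR/TTS wrappers are an assumption on my part; check the model cards for the correct install commands.

```text
torch            # deep-learning framework (install the CUDA build for the RTX 3070)
transformers     # Hugging Face: AutoModelForCausalLM, AutoTokenizer, BLIP-2
qwen-asr         # assumed package name for the Qwen3ASRModel wrapper
qwen-tts         # assumed package name for the Qwen3TTSModel wrapper
```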
Results
These are the average time and resource usage for each model.
**The next graphs are for the LLM and the sorter model working together**
I wanted to create graphs showing what happens if the chatbot gets called instead of the SRP, but I ran out of CUDA memory (VRAM) and the run crashed at cycle 2. I also tried not loading the TTS and STT models, but this did not change the outcome. I therefore conclude that the SRP does in fact help with VRAM and RAM too, not just with speed.
*What was the result?*
Let's start with the limitations. Each model uses a lot of VRAM: the vision model uses almost 8 GB and the retriever almost 2 GB, and I only have 8 GB of VRAM available, so the vision model, the retriever, and the LLM cannot work side by side. If they could, you could ask the retriever (since it is also a language model) to plan how to answer, or to prepare information when part of the user's question is in the prompt text rather than the image. That would let them work together faster, but only one can run at a time. The models can and should still be loaded at the same time; if they aren't, the loading process takes so long that it would be slower than just using one text-generation model to do the retriever's job. Another thing that didn't work well: you can't play heavy games or run other demanding software, since the models take up much of the computer's power.

As for what worked well: the speed. Because the jobs are split across different models so that simple tasks aren't handled by overkill models, responses become very fast. I know Anthropic has made models that can plan ahead whether they need to think for a long time or not, but as those models require a lot of power, this is a way to do it locally.
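The "only one can run at a time" constraint comes down to simple VRAM arithmetic. Below is a sketch using the approximate figures from this post; treat the numbers as rough estimates from my own runs, not official requirements.

```python
def fits_together(budget_gb: float, *model_sizes_gb: float) -> bool:
    """Check whether a set of models can be resident in VRAM at once."""
    return sum(model_sizes_gb) <= budget_gb

BUDGET = 8.0     # RTX 3070 VRAM
VISION = 8.0     # vision model: "almost 8 GB" (rough figure from this post)
RETRIEVER = 2.0  # retriever: "almost 2 GB" (rough figure from this post)

print(fits_together(BUDGET, VISION))             # True: vision alone just fits
print(fits_together(BUDGET, VISION, RETRIEVER))  # False: they must take turns
```

Whenever the sum exceeds the budget, one model has to be swapped out before another runs, which is exactly the serialization described above.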
Discussion
This project demonstrates that AI is accessible on modest hardware for hobbyists and students.
With an RTX 3070 and 16 GB of RAM, which is already common in gaming PCs,
you can run a full AI setup with TTS, STT, and an LLM at the same time, with strategic planning. Using Qwen2.5-0.5B-Instruct, we can take some load off the LLM to reduce hallucinations and improve quality.
As the graph shows, using a 4B model to decide yes/no questions would be overkill. That is why a "division of labor" is the best strategy for hosting an LLM on your private PC without a powerful GPU.
So why go local when you could use an API?
- Privacy: running models locally ensures no data leaves your PC.
- Regulations: commercial use may be subject to regulations like the GDPR.
- Price: API token pricing becomes expensive over time.
Conclusion
The conclusion is that this strategy requires a lot of RAM and VRAM, but it is faster and does not demand as much compute from your graphics card. So if you have enough RAM and VRAM but limited GPU power, this is the way to do it. If you have a really powerful graphics card that can run an LLM fast, the quality will be better with the LLM in charge of everything, but that single-model approach requires much more power. For a local host on a normal PC, this division-of-labor strategy is the answer. If I could change one thing, it would be to make the SRP more powerful. That would require more compute, VRAM, and RAM, but it would improve quality, since the SRP can make mistakes if the functions and capabilities it holds are not described clearly. You can get the code I used for this software on my Patreon here.
References
Qwen/Qwen2.5-0.5B-Instruct
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
url = {https://qwenlm.github.io/blog/qwen2.5/},
author = {Qwen Team},
month = {September},
year = {2024}
}
@article{qwen2,
title={Qwen2 Technical Report},
author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal={arXiv preprint arXiv:2407.10671},
year={2024}
}
Qwen/Qwen3-4B-Instruct-2507
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
Qwen/Qwen3-ASR-1.7B
@article{Qwen3-ASR,
title={Qwen3-ASR Technical Report},
author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.21337},
year={2026}
}
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
@article{Qwen3-TTS,
title={Qwen3-TTS Technical Report},
author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.15621},
year={2026}
}
Anthropic adaptive thinking:
https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking