“Surely my gaming laptop can handle a 7B model, right?”
— Me, before reality hit me like an OOM error.
So here’s the story: I dove into the world of local LLMs thinking my Lenovo Legion 5i gaming laptop (rocking 6GB of VRAM, 32GB of RAM, and a solid Intel i7 CPU) could surely handle running a local LLM, right? I figured, “Let’s try running Mistral-7B locally. Psssh, what could go wrong?”
Everything. Everything went wrong.
The moment I tried, my GPU begged for mercy, RAM usage spiked like it saw a ghost, and the model refused to even load. That's when I realized something important: bigger isn’t always better, especially when it comes to LLMs and local hardware.
But hey, that failure was a blessing in disguise. It led me to the world of tiny but mighty LLMs—specifically TinyLlama—paired with the vLLM inference engine, which is optimized for performance and memory efficiency. This combo actually works on my hardware, and it might work for yours too.
If you're in the same boat—curious about LLMs, want to run them locally, but don't have an RTX 4090 lying around—then this guide is for you.
Let’s get into it. 👇
Why Bother with Local "Tiny" LLMs?
Running AI models locally has a lot of benefits:
- No OpenAI/HuggingFace API costs
- Full control over the system
- Great for testing, hacking, or even building apps offline
- You learn a LOT more doing it this way
But we're also being realistic here—if you're like me and you're not running an RTX 4090, you need something that fits your hardware. That's why I'm using TinyLlama—a small but capable model—and leveraging vLLM, a super-efficient inference engine built for LLMs.
🛠 Step-by-Step: How I Set It Up
1. Install WSL2
First things first, we need WSL2 set up. I didn’t go through the Microsoft Store route—just used the good old command line.
Open PowerShell as Administrator and run:
wsl --install
That installs WSL2 along with the default Ubuntu distro. If you want a specific Ubuntu version (like 20.04), you can install it with:
wsl --install -d Ubuntu-20.04
Reboot if necessary.
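If you want to confirm the install went through, you can list your installed distros and check that they’re running on WSL version 2:
wsl --list --verbose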
2. Launch Ubuntu and Prep Your Environment
Once Ubuntu is installed, launch it via your Start Menu or run wsl from a terminal. From here, it’s just regular Linux commands.
Inside Ubuntu, I did the usual system update and installed Python tools:
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv -y
Next, I set up a Python virtual environment for this project:
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install --upgrade pip
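One thing the commands above don’t cover yet: vLLM itself. Inside the activated virtual environment, a plain pip install is what worked for me (the exact build you get may vary depending on your CUDA/PyTorch setup):
pip install vllm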
3. Run vLLM with TinyLlama
Now we launch the vLLM API server with TinyLlama. This runs a local OpenAI-compatible endpoint:
python -m vllm.entrypoints.openai.api_server \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.7
What these flags mean:
--model: Specifies the model from Hugging Face
--host 0.0.0.0: Exposes the API to your host machine
--port 8000: Runs it on port 8000
--gpu-memory-utilization 0.7: Only uses 70% of your GPU’s memory (you can tweak this)
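Once the server logs say it’s ready, a quick sanity check (assuming the host and port above) is to ask it which models it’s serving:
curl http://localhost:8000/v1/models
If TinyLlama shows up in the JSON response, the server is alive and the model actually loaded.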
Testing It
Once it’s up and running, try hitting the endpoint from your browser or a tool like Postman:
http://localhost:8000/docs
You can use it like the OpenAI API—just pass in your prompt, model name, and get a response.
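For example, here’s roughly what a chat completion request looks like with curl (the prompt is just a placeholder, and the model name must match the one you passed to the server):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Give me a two-sentence summary of what vLLM does."}],
    "max_tokens": 128
  }'
The response comes back in the same JSON shape the OpenAI API uses, so any OpenAI-compatible client library should work if you point its base URL at http://localhost:8000/v1.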
Wrapping Up
So yeah—turns out you can run LLMs locally, just don’t expect your laptop to handle a 7B model without crying. TinyLlama + vLLM is a great combo for tinkering and learning without frying your GPU.
Now that it’s running, I’ll be hacking away at this black-box AI, trying to make it actually follow my prompts—especially when I ask for a specific word count (seriously, why is that so hard?).
More experiments to come. Stay tuned!