The Local AI movement just hit a massive milestone. With the release of Google's Gemma 4, 2-billion parameter models are no longer toys for simple chat. They're multimodal powerhouses purpose-built for advanced reasoning and agentic workflows.
In this guide, we'll break down how to harness the Gemma 4 E2B (Effective 2B) model using vLLM and integrate it with the Hermes Agent for a fully local, multimodal stack.
What is Gemma 4?
Google released Gemma 4 in four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts, and 31B Dense. We're focused on the E2B, the one small enough to fit on consumer hardware.
Key capabilities:
- Multimodal from day one - all models natively process text, images, and video. The E2B and E4B edge models also support audio input for speech recognition.
- Long context - edge models like E2B feature a 128K context window.
- Apache 2.0 licensed - commercially permissive, no strings attached.
Why E2B + vLLM for a local agent stack?
- Instruction tuning - Gemma 4 excels at following system prompts, critical for an agent managing many skills.
- Native tool calling - function calling, structured JSON output, and native system instructions are built in, letting the agent reliably interact with tools and APIs.
- Efficiency - at 2B effective parameters, it leaves plenty of VRAM for the agent's KV cache, keeping response times fast even on an RTX 3060/4060.
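To see what native tool calling buys you, here is a minimal sketch of a tool definition in the OpenAI function-calling format, which vLLM's tool-calling mode accepts. The `get_weather` tool is a hypothetical example, not part of Gemma 4 or Hermes:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# A list of these is sent alongside the chat messages in each request,
# and the model decides when to emit a structured call to one of them.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

print(json.dumps(get_weather_tool, indent=2))
```

Because the schema is plain JSON, the agent can validate the model's arguments against it before actually executing anything.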
Step 1: Install vLLM
You'll need a Hugging Face account and access token, since Gemma 4 is gated: you must accept Google's license on Hugging Face before you can download the weights. Then:
pip install uv
uv venv && source .venv/bin/activate
uv pip install -U vllm
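Because the model is gated, authenticate with your Hugging Face token before the download in the next step. A quick sketch; the token value shown is a placeholder for your own:

```shell
# Either log in once with the Hugging Face CLI (stores the token locally)...
hf auth login
# ...or export your token (placeholder shown) for the current shell session
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```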
Step 2: Serve Gemma 4 E2B
hf download google/gemma-4-e2b-it --local-dir ~/models/gemma-4-e2b-it
vllm serve ~/models/gemma-4-e2b-it \
  --served-model-name google/gemma-4-e2b-it \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.85
Your OpenAI-compatible endpoint is now live at http://localhost:8000/v1.
Step 3: Install Hermes Agent
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
Step 4: Point Hermes at your local model
hermes model
This walks you through selecting a provider. Choose Custom Endpoint and enter:
- Base URL: http://localhost:8000/v1
- Model: google/gemma-4-e2b-it
That's it. You now have a fully local, multimodal agent with 40+ built-in tools: web search, file operations, terminal access, browser automation, and more, with zero cloud dependency.
Have you tried Gemma 4 yet? Drop your opinion in the comments!