Abdul Hakkeem P A
Building a Multimodal Local AI Stack: Gemma 4 E2B, vLLM, and Hermes Agent

The Local AI movement just hit a massive milestone. With the release of Google's Gemma 4, 2-billion parameter models are no longer toys for simple chat. They're multimodal powerhouses purpose-built for advanced reasoning and agentic workflows.

In this guide, we'll break down how to harness the Gemma 4 E2B (Effective 2B) model using vLLM and integrate it with the Hermes Agent for a fully local, multimodal stack.

What is Gemma 4?

Google released Gemma 4 in four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts, and 31B Dense. We're focused on the E2B, the one that fits on consumer hardware.

Key capabilities:

  • Multimodal from day one - all models natively process text, images, and video. The E2B and E4B edge models also support audio input for speech recognition.
  • Long context - edge models like E2B feature a 128K context window.
  • Apache 2.0 licensed - commercially permissive, no strings attached.

Why E2B + vLLM for a local agent stack?

  • Instruction tuning — Gemma 4 excels at following system prompts, critical for an agent managing many skills.
  • Native tool calling — function calling, structured JSON output, and native system instructions are built in, letting the agent reliably interact with tools and APIs.
  • Efficiency — at 2B effective parameters, it leaves plenty of VRAM for the agent's KV cache, keeping response times fast even on an RTX 3060/4060.
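To make the "native tool calling" point concrete, here's a minimal sketch of what that looks like on the wire, using the OpenAI tools schema that vLLM's server speaks. The `get_weather` tool and the `"Kochi"` argument are purely hypothetical, invented for illustration; any real agent (Hermes included) registers its own tools.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema; the name,
# description, and parameters here are illustrative, not part of any real API.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# With --enable-auto-tool-choice, when the model decides to use a tool the
# server returns a structured assistant message like this instead of free text:
example_response_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Kochi"}'},
    }],
}

# The agent parses the arguments string back into a dict before dispatching:
args = json.loads(example_response_message["tool_calls"][0]["function"]["arguments"])
```

The key detail is that `arguments` arrives as a JSON *string*, not a dict, so the agent must parse it before calling the underlying function.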

Step 1: Install vLLM

You'll need a Hugging Face account and an access token, since Gemma 4 requires accepting Google's license on Hugging Face first. Then:

```shell
pip install uv
uv venv && source .venv/bin/activate
uv pip install -U vllm
```

Step 2: Serve Gemma 4 E2B

```shell
hf auth login   # paste your Hugging Face token so the gated download succeeds

hf download google/gemma-4-e2b-it --local-dir ~/models/gemma-4-e2b-it

vllm serve ~/models/gemma-4-e2b-it \
  --served-model-name google/gemma-4-e2b-it \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.85
```

Two details worth noting: vLLM's tool-call parser name is lowercase `hermes`, and `--served-model-name` lets you serve from the local download directory while keeping the familiar `google/gemma-4-e2b-it` model ID for API clients. Capping `--max-model-len` at 32K (well below the model's 128K maximum) keeps the KV cache small enough for 8-12 GB consumer cards; raise it if you have the VRAM.

Your OpenAI-compatible endpoint is now live at http://localhost:8000/v1.

Step 3: Install Hermes Agent

```shell
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
```

(As with any curl-into-bash installer, consider downloading and reviewing the script first.)

Step 4: Point Hermes at your local model

```shell
hermes model
```

This walks you through selecting a provider. Choose Custom Endpoint and enter:

  • Base URL: http://localhost:8000/v1
  • Model: google/gemma-4-e2b-it

That's it. You now have a fully local, multimodal agent with 40+ built-in tools: web search, file operations, terminal access, browser automation, and more, with zero cloud dependency.


Have you tried Gemma 4 yet? Drop your opinion in the comments!
