Hassann

Posted on • Originally published at apidog.com

How to Use Gemma 4 12B for Free: 6 Working Methods in 2026

Gemma 4 12B is open-weights and Apache 2.0 licensed, so “free” means you can download the weights and run them yourself without an API bill or subscription. Your only real cost is the hardware you run it on.

Gemma 4 12B is designed for local and on-device usage. Google’s larger 31B and 26B variants are the ones available for free chat in AI Studio, while the 12B path is about getting a capable model running on your own laptop, workstation, browser demo, or edge device. If you want the model specs first, start with “What is Gemma 4 12B”.

Below are six practical ways to try or run Gemma 4 12B, from a zero-install browser demo to a local OpenAI-compatible API.

Quick summary

| Method | What you get | Best for |
|---|---|---|
| Hugging Face Space | Browser chat, zero install | Trying it quickly |
| Ollama | Local model + OpenAI-compatible API | Developers who want one-command setup |
| LM Studio | Local desktop app with GUI | Running locally without a terminal |
| llama.cpp | Lightweight local API server | Low-overhead and advanced setups |
| Hugging Face Transformers | Python access with full control | Notebooks, experiments, and fine-tuning |
| Google AI Edge | On-device and mobile runtime | Phones and edge hardware |

Method 1: Try it in your browser

The fastest way to test Gemma 4 12B is the official Hugging Face demo Space. You do not need to install anything or configure a GPU.

  1. Open the Gemma 4 12B demo Space
  2. Enter a prompt, or upload an image or audio clip
  3. Review the model response

Use this path for a quick sanity check, especially if you want to test multimodal input before downloading anything locally.

Method 2: Run Gemma 4 12B with Ollama

Ollama is the simplest developer-friendly way to run Gemma 4 12B locally and expose it through an API.

Install Ollama

On macOS or Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the installer from ollama.com and run it.

Pull and run the model

ollama pull gemma4:12b
ollama run gemma4:12b

The pull command downloads the model. By default, Ollama uses a 4-bit Q4_K_M build, around 8GB. The run command starts an interactive chat session.

To exit:

/bye

Call the local API

Ollama exposes an OpenAI-compatible API at:

http://localhost:11434/v1

Send a chat completion request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [
      {
        "role": "user",
        "content": "Explain how transformers work in two sentences."
      }
    ]
  }'

Because the endpoint follows the OpenAI API shape, you can reuse tools and SDKs that support a custom base URL. Point them to:

http://localhost:11434/v1

For an IDE setup pattern, the flow is similar to this DeepSeek V4 in Cursor walkthrough. Use gemma4:12b as the model name.
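
The same request can be made from Python without any SDK. The following is a minimal sketch using only the standard library, assuming Ollama is serving on its default port and that `gemma4:12b` has already been pulled. The helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
MODEL = "gemma4:12b"


def build_chat_request(
    prompt: str, base_url: str = BASE_URL, model: str = MODEL
) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain how transformers work in two sentences."))` should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect before anything touches the network.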

Useful Ollama commands:

ollama list
ollama ps
ollama show gemma4:12b

Method 3: Use LM Studio without a terminal

If you prefer a GUI, LM Studio runs local models on Windows, macOS, and Linux.

  1. Download and install LM Studio
  2. Search for Gemma 4 12B in the model catalog
  3. Choose a quantization that fits your RAM
  4. Download the model
  5. Open the chat tab and start prompting

LM Studio can also run a local OpenAI-compatible server, usually on port 1234. That gives you a local API without writing setup code.

Example base URL:

http://localhost:1234/v1

This is a good option if you want a desktop chat UI and an API endpoint in the same app.
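
The model identifier LM Studio expects in API requests may not match Ollama's `gemma4:12b` tag, so it can help to ask the server which models it offers. This sketch assumes the LM Studio server is running on port 1234 and answers `GET /v1/models` with an OpenAI-style list (`{"data": [{"id": ...}]}`). The function names are illustrative.

```python
import json
import urllib.request

LMSTUDIO_BASE_URL = "http://localhost:1234/v1"  # usual LM Studio server port


def parse_model_ids(body: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style model list response."""
    return [entry["id"] for entry in body.get("data", [])]


def list_models(base_url: str = LMSTUDIO_BASE_URL) -> list[str]:
    """Fetch the IDs of the models the local server currently offers."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_models()` with the server started returns the identifiers to use in the `model` field of chat requests.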

Method 4: Run a lightweight server with llama.cpp

llama.cpp runs GGUF models with minimal overhead and includes an OpenAI-compatible server.

Install it:

# macOS
brew install llama.cpp

# Windows
winget install llama.cpp

Then start a server using the official GGUF build. Browse the ggml-org/gemma-4 collection on Hugging Face for the exact 12B repo name, then pass it to llama-server:

llama-server -hf ggml-org/gemma-4-12B-it-GGUF

The server exposes an OpenAI-compatible API at:

http://localhost:8080/v1

Use this method when you want fewer dependencies, more control, or a lower-overhead runtime.
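
On first launch, `llama-server -hf` has to download and load the model, so the API is not available immediately. A script that starts the server and then sends requests can poll until it is ready. The sketch below assumes the server's `GET /health` endpoint on the default port, which returns HTTP 200 once the model is loaded. The helper names and retry values are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"  # llama-server health endpoint


def server_is_ready(url: str = HEALTH_URL) -> bool:
    """Return True when the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while still loading.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = server_is_ready,
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `attempts` runs out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # pause between attempts, not after the last one
    return False
```

Passing `probe` and `sleep` as parameters keeps the retry loop testable without a running server.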

Method 5: Use Hugging Face Transformers in Python

For notebooks, scripts, custom inference, or fine-tuning workflows, use Transformers directly. If you do not have a local GPU, you can run this in a free Google Colab notebook.

Install the required packages:

pip install transformers torch accelerate torchvision
# add librosa for audio and video input
pip install librosa

Load the instruction-tuned model and generate a response:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=1024)

response = processor.decode(
    outputs[0][input_len:],
    skip_special_tokens=False,
)

print(processor.parse_response(response))

Set enable_thinking=True for step-by-step reasoning mode. Leave it off for faster basic chat.

For image or audio input, structure the message content as a list and include image content before the text prompt and audio content after it. The weights are also available on Kaggle if you prefer that source. More implementation examples are in the developer guide.
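
The ordering rule above (image first, then text, then audio) can be captured in a small helper. This is a sketch only: the `"url"` key for image and audio parts is an assumption based on common Transformers chat-template conventions, the example URLs are placeholders, and `build_user_message` is a hypothetical name. Check the model card for the exact content schema.

```python
from typing import Optional


def build_user_message(
    text: str, image: Optional[str] = None, audio: Optional[str] = None
) -> dict:
    """Order content parts as: image (if any), text, audio (if any)."""
    content = []
    if image is not None:
        content.append({"type": "image", "url": image})  # key name is an assumption
    content.append({"type": "text", "text": text})
    if audio is not None:
        content.append({"type": "audio", "url": audio})  # key name is an assumption
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "Describe both inputs.",
        image="https://example.com/photo.png",  # placeholder URL
        audio="https://example.com/clip.wav",   # placeholder URL
    )
]
```

The resulting `messages` list has the same shape as the one passed to `processor.apply_chat_template` in the text-only example.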

Method 6: Run on-device with Google AI Edge

For phones and edge hardware, use Google’s AI Edge stack. The Google AI Edge Gallery app and LiteRT-LM CLI can run Gemma 4 12B on-device.

To start a local server with LiteRT-LM:

litert-lm import \
  --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm gemma4-12b

litert-lm serve

Use this path for offline assistants, mobile apps, or embedded systems where data should stay on the device.

Test your local Gemma 4 12B API with Apidog

Once Gemma 4 12B is running through Ollama or llama.cpp, you have a local HTTP API. Before wiring it into an app, test the request and response shape in an API client. Apidog works well for this because the local endpoints are OpenAI-compatible.

Set it up like this:

  1. Download Apidog and create a new HTTP project
  2. Add a POST request to:
http://localhost:11434/v1/chat/completions
  3. Set the body type to JSON
  4. Paste a sample payload:
{
  "model": "gemma4:12b",
  "messages": [
    {
      "role": "user",
      "content": "Return a JSON object with two fields: city and country."
    }
  ],
  "stream": false
}
  5. Save the base URL as an environment variable so you can switch between Ollama and llama.cpp:
Ollama:    http://localhost:11434/v1
llama.cpp: http://localhost:8080/v1
  6. Add a response assertion to confirm the model returns valid JSON in the content field
  7. Set "stream": true to verify streaming responses before building UI logic around streamed tokens

This helps catch malformed prompts, wrong field names, and streaming issues before they reach your application code.
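
The two checks from the steps above, valid JSON in the content field and a well-formed stream, can also be expressed in code. The sketch below assumes OpenAI-style response bodies and server-sent event lines of the form `data: {...}` ending with `data: [DONE]`. Both function names are illustrative.

```python
import json


def assert_city_country(response: dict) -> dict:
    """Check that the assistant content is a JSON object with city and country."""
    content = response["choices"][0]["message"]["content"]
    parsed = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise AssertionError("content is valid JSON but not an object")
    missing = {"city", "country"} - parsed.keys()
    if missing:
        raise AssertionError(f"missing fields: {sorted(missing)}")
    return parsed


def collect_stream(lines) -> str:
    """Join the text deltas from OpenAI-style server-sent event lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue  # some chunks carry no choices
        parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)
```

If the model wraps its JSON in extra prose or code fences, `json.loads` will fail, which is exactly the kind of prompt problem worth catching before application code depends on the output.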

If you are comparing API clients, see these roundups of free online API testing tools and the best Postman alternatives. The same approach also carries over to Postman-style API testing workflows.

Which quantization should you choose?

Gemma 4 12B fits different machines depending on how compressed the model is.

| Build | Memory needed | Trade-off |
|---|---|---|
| Full precision | ~16GB | Best quality |
| 8-bit | ~14GB | Near-full quality |
| 4-bit (Q4_K_M) | ~8GB | Slight quality drop, runs widely |

Ollama defaults to the 4-bit build, which is why it can run on an 8GB GPU or a 16GB unified-memory Mac. If you have extra memory, try an 8-bit build for better quality.
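
As a rough illustration of that choice, the sketch below picks the highest-quality build that fits a memory budget. The figures are copied from the table above and are approximate. The `headroom_gb` parameter is an assumption added here to reserve memory for the operating system, other apps, and the context cache, and `pick_build` is a hypothetical helper.

```python
# (label, approximate memory in GB) from the table above, best quality first.
BUILDS = [
    ("Full precision", 16),
    ("8-bit", 14),
    ("4-bit (Q4_K_M)", 8),
]


def pick_build(available_gb: float, headroom_gb: float = 0.0):
    """Return the highest-quality build that fits the budget, or None."""
    budget = available_gb - headroom_gb
    for label, needed_gb in BUILDS:
        if needed_gb <= budget:
            return label
    return None
```

For example, on a 16GB unified-memory Mac where the system shares that memory, `pick_build(16, headroom_gb=6)` selects the 4-bit build. The 6GB headroom is only an illustrative value.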

Which free method should you use?

Use this quick decision guide:

  • Just testing the model? Use the Hugging Face Space demo.
  • Building software? Use Ollama for the fastest local API setup.
  • Avoiding the terminal? Use LM Studio.
  • Optimizing for low overhead? Use llama.cpp.
  • Working in notebooks or fine-tuning? Use Hugging Face Transformers.
  • Targeting phones or edge devices? Use Google AI Edge.

Most developers will be productive with Ollama for daily local inference and Transformers for deeper Python workflows.

Tips for running Gemma 4 12B locally

  • Match the quantization to your RAM. If the model swaps to disk, inference will slow down heavily.
  • Start with 4-bit. It is the safest default for laptops and consumer hardware.
  • Use thinking mode selectively. Set enable_thinking=True for math or multi-step reasoning, but leave it off for fast chat.
  • Watch the context window. A 256K context is large, but long transcripts and codebases can still fill it.
  • Validate requests in Apidog first. Confirm the JSON shape before your app depends on it.
  • Reuse the same local pattern for other models. Similar workflows apply to Qwen 3.7, MiniMax M3, and Claude Opus 4.8 access paths.

FAQ

Is Gemma 4 12B really free?

Yes. The weights are released under Apache 2.0, so you can download and run the model, including commercially. You pay only for the hardware or cloud infrastructure you use.

Do I need a GPU?

No, but a GPU helps. The 4-bit build can run on an 8GB GPU or a 16GB unified-memory Mac. CPU-only inference works, but it is slower.

Can I use Gemma 4 12B in Google AI Studio?

Not currently. AI Studio hosts the 31B and 26B models for free browser chat. Gemma 4 12B is intended for local and on-device usage.

Does the local API require an API key?

No. Ollama and llama.cpp serve the model on localhost without an API key. If a client requires a key field, use any placeholder value; the local server ignores it.

Can I use it from existing OpenAI-compatible code?

Yes. Ollama and llama.cpp expose OpenAI-compatible endpoints.

Use these base URLs:

Ollama:    http://localhost:11434/v1
llama.cpp: http://localhost:8080/v1

Then keep the same request structure and change the model name to gemma4:12b.
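
One way to keep client code backend-agnostic is to look up the base URL by name and always send a placeholder key. This is a minimal standard-library sketch. The `BASE_URLS` mapping, the `build_request` helper, and the `"not-needed"` key value are illustrative choices, not required names.

```python
import json
import urllib.request

BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def build_request(
    backend: str,
    prompt: str,
    model: str = "gemma4:12b",
    api_key: str = "not-needed",  # placeholder for clients that expect a key
) -> urllib.request.Request:
    """Build the same chat request for whichever local backend is selected."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URLS[backend]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Switching backends then means changing one string, for example `build_request("llama.cpp", "Hello")`, while the payload stays identical.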

How do I run image and audio features?

Use Transformers, LM Studio, or Google AI Edge apps that support multimodal input. Add image content before the text prompt and audio content after it.

Which is faster: Ollama or llama.cpp?

They use the same underlying engine. llama.cpp has less overhead and more tuning flags, while Ollama is easier to install and operate. For most local development workflows, the practical difference is small.
