Hassann

Posted on Jun 4 • Originally published at apidog.com

How to Use Gemma 4 12B for Free: 6 Working Methods in 2026

Gemma 4 12B is open-weights and Apache 2.0 licensed, so “free” means you can download the weights and run them yourself without an API bill or subscription. Your only real cost is the hardware you run it on.


Gemma 4 12B is designed for local and on-device use. Google’s larger 31B and 26B variants are the ones available for free chat in AI Studio; the 12B path is about getting a capable model running on your own laptop, workstation, browser demo, or edge device. If you want the model specs first, start with “What is Gemma 4 12B”.

Below are six practical ways to try or run Gemma 4 12B, from a zero-install browser demo to a local OpenAI-compatible API.

Quick summary

Method	What you get	Best for
Hugging Face Space	Browser chat, zero install	Trying it quickly
Ollama	Local model + OpenAI-compatible API	Developers who want one-command setup
LM Studio	Local desktop app with GUI	Running locally without a terminal
llama.cpp	Lightweight local API server	Low-overhead and advanced setups
Hugging Face Transformers	Python access with full control	Notebooks, experiments, and fine-tuning
Google AI Edge	On-device and mobile runtime	Phones and edge hardware

Method 1: Try it in your browser

The fastest way to test Gemma 4 12B is the official Hugging Face demo Space. You do not need to install anything or configure a GPU.

Open the Gemma 4 12B demo Space
Enter a prompt, or upload an image or audio clip
Review the model response

Use this path for a quick sanity check, especially if you want to test multimodal input before downloading anything locally.

Method 2: Run Gemma 4 12B with Ollama

Ollama is the simplest developer-friendly way to run Gemma 4 12B locally and expose it through an API.

Install Ollama

On macOS or Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the installer from ollama.com and run it.

Pull and run the model

ollama pull gemma4:12b
ollama run gemma4:12b

The pull command downloads the model. By default, Ollama uses a 4-bit Q4_K_M build, around 8GB. The run command starts an interactive chat session.

To exit:

/bye

Call the local API

Ollama exposes an OpenAI-compatible API at:

http://localhost:11434/v1

Send a chat completion request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [
      {
        "role": "user",
        "content": "Explain how transformers work in two sentences."
      }
    ]
  }'

Because the endpoint follows the OpenAI API shape, you can reuse tools and SDKs that support a custom base URL. Point them to:

http://localhost:11434/v1
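The same request can be sent from Python without any SDK. The following is a minimal sketch using only the standard library; it assumes Ollama is running locally with gemma4:12b pulled, and the helper names build_chat_request and chat are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Base URL and model tag as quoted above for Ollama.
BASE_URL = "http://localhost:11434/v1"
MODEL = "gemma4:12b"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the local server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Explain how transformers work in two sentences.") should return the model's reply as a string. Building the request separately from sending it makes the payload easy to inspect before any network call happens.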

For an IDE setup pattern, the flow is similar to this DeepSeek V4 in Cursor walkthrough. Use gemma4:12b as the model name.

Useful Ollama commands:

ollama list
ollama ps
ollama show gemma4:12b

Method 3: Use LM Studio without a terminal

If you prefer a GUI, LM Studio runs local models on Windows, macOS, and Linux.

Download and install LM Studio
Search for Gemma 4 12B in the model catalog
Choose a quantization that fits your RAM
Download the model
Open the chat tab and start prompting

LM Studio can also run a local OpenAI-compatible server, usually on port 1234. That gives you a local API without writing setup code.

Example base URL:

http://localhost:1234/v1

This is a good option if you want a desktop chat UI and an API endpoint in the same app.
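The model identifier that LM Studio expects depends on which build you downloaded, so it can help to ask the server what it has loaded. The sketch below assumes the local server implements the OpenAI-style GET /models listing (a JSON object with a "data" array of entries that each carry an "id"); the function names are illustrative.

```python
import json
import urllib.request

# Example base URL from above; the port may differ in your LM Studio settings.
LM_STUDIO_BASE_URL = "http://localhost:1234/v1"


def model_ids(models_response: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style /models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_local_models(base_url: str = LM_STUDIO_BASE_URL,
                      timeout: float = 10.0) -> list[str]:
    """Query the local server and return the IDs of the models it exposes."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Use one of the returned IDs as the "model" value in your chat completion requests.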

Method 4: Run a lightweight server with llama.cpp

llama.cpp runs GGUF models with minimal overhead and includes an OpenAI-compatible server.

Install it:

# macOS or Linux (Homebrew)
brew install llama.cpp

# Windows
winget install llama.cpp

Then start a server from the official GGUF build. Check the ggml-org/gemma-4 collection on Hugging Face for the exact 12B repo name and pass it to llama-server with -hf, for example:

llama-server -hf ggml-org/gemma-4-12B-it-GGUF

The server exposes an OpenAI-compatible API at:

http://localhost:8080/v1

Use this method when you want fewer dependencies, more control, or a lower-overhead runtime.

Method 5: Use Hugging Face Transformers in Python

For notebooks, scripts, custom inference, or fine-tuning workflows, use Transformers directly. If you do not have a local GPU, you can run this in a free Google Colab notebook.

Install the required packages:

pip install transformers torch accelerate torchvision
# add librosa for audio and video input
pip install librosa

Load the instruction-tuned model and generate a response:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=1024)

response = processor.decode(
    outputs[0][input_len:],
    skip_special_tokens=False,
)

print(processor.parse_response(response))

Set enable_thinking=True in apply_chat_template for step-by-step reasoning mode. Leave it off for faster basic chat.

For image or audio input, structure the message content as a list and include image content before the text prompt and audio content after it. The weights are also available on Kaggle if you prefer that source. More implementation examples are in the developer guide.
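The ordering rule above (image first, then text, then audio) can be sketched as a small helper that builds the user message. The dictionary keys used here ("type", "url", "text", "audio") are assumptions modeled on the chat-template format that Transformers multimodal processors commonly accept; confirm the exact keys against the model card before relying on them. The helper name is hypothetical.

```python
from typing import Optional


def build_multimodal_message(text: str,
                             image_url: Optional[str] = None,
                             audio_url: Optional[str] = None) -> dict:
    """Build a user message with image before the text and audio after it."""
    content = []
    if image_url is not None:
        # Assumed key names; check the model card for the exact schema.
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": text})
    if audio_url is not None:
        content.append({"type": "audio", "audio": audio_url})
    return {"role": "user", "content": content}
```

The returned dictionary can replace the plain-text user entry in the messages list passed to apply_chat_template.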

Method 6: Run on-device with Google AI Edge

For phones and edge hardware, use Google’s AI Edge stack. The Google AI Edge Gallery app and LiteRT-LM CLI can run Gemma 4 12B on-device.

To start a local server with LiteRT-LM:

litert-lm import \
  --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm gemma4-12b

litert-lm serve

Use this path for offline assistants, mobile apps, or embedded systems where data should stay on the device.

Test your local Gemma 4 12B API with Apidog

Once Gemma 4 12B is running through Ollama or llama.cpp, you have a local HTTP API. Before wiring it into an app, test the request and response shape in an API client. Apidog works well for this because the local endpoints are OpenAI-compatible.

Set it up like this:

Download Apidog and create a new HTTP project
Add a POST request to:

http://localhost:11434/v1/chat/completions

Set the body type to JSON
Paste a sample payload:

{
  "model": "gemma4:12b",
  "messages": [
    {
      "role": "user",
      "content": "Return a JSON object with two fields: city and country."
    }
  ],
  "stream": false
}

Save the base URL as an environment variable so you can switch between Ollama and llama.cpp:

Ollama:    http://localhost:11434/v1
llama.cpp: http://localhost:8080/v1

Add a response assertion to confirm the model returns valid JSON in the content field
Set "stream": true to verify streaming responses before building UI logic around streamed tokens

This helps catch malformed prompts, wrong field names, and streaming issues before they reach your application code.
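The same two checks can also be expressed in code once you move past the API client. This is a rough sketch, assuming the server returns OpenAI-style responses and streams server-sent events as "data: {...}" lines ending with "data: [DONE]"; both function names are illustrative.

```python
import json


def content_as_json(response: dict):
    """Parse the assistant's content field as JSON.

    Raises json.JSONDecodeError if the model wrapped the JSON in prose
    or Markdown, which is exactly what the assertion should catch.
    """
    content = response["choices"][0]["message"]["content"]
    return json.loads(content)


def collect_stream_text(lines) -> str:
    """Join the text deltas from an iterable of streamed SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

Running collect_stream_text over the lines of a streamed response should reproduce the same text a non-streaming request returns in the content field.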

If you are comparing API clients, see these roundups of free online API testing tools and the best Postman alternatives. The same testing workflow also applies if you test APIs with Postman.

Which quantization should you choose?

Gemma 4 12B fits different machines depending on how compressed the model is.

Build	Memory needed	Trade-off
Full precision	~16GB	Best quality
8-bit	~14GB	Near-full quality
4-bit (Q4_K_M)	~8GB	Slight quality drop, runs widely

Ollama defaults to the 4-bit build, which is why it can run on an 8GB GPU or a 16GB unified-memory Mac. If you have extra memory, try an 8-bit build for better quality.
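As a rough illustration, the choice can be written as a lookup over the table above: pick the highest-quality build whose approximate memory need fits what you have. The figures are copied from the table and are approximations; the headroom_gb parameter is a hypothetical allowance for the operating system and context cache, which matters most on unified-memory machines.

```python
# Approximate memory needs in GB, copied from the table above,
# ordered from best quality to most compressed.
BUILDS = [
    ("Full precision", 16),
    ("8-bit", 14),
    ("4-bit (Q4_K_M)", 8),
]


def pick_build(available_gb: float, headroom_gb: float = 0.0):
    """Return the best-quality build that fits, or None if none fits."""
    usable = available_gb - headroom_gb
    for name, needed_gb in BUILDS:
        if needed_gb <= usable:
            return name
    return None


print(pick_build(8))                   # dedicated 8GB GPU
print(pick_build(16, headroom_gb=4))   # 16GB unified memory, 4GB reserved
```

Both calls select the 4-bit build, which matches the guidance above for an 8GB GPU and a 16GB unified-memory Mac.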

Which free method should you use?

Use this quick decision guide:

Just testing the model? Use the Hugging Face Space demo.
Building software? Use Ollama for the fastest local API setup.
Avoiding the terminal? Use LM Studio.
Optimizing for low overhead? Use llama.cpp.
Working in notebooks or fine-tuning? Use Hugging Face Transformers.
Targeting phones or edge devices? Use Google AI Edge.

Most developers will be productive with Ollama for daily local inference and Transformers for deeper Python workflows.

Tips for running Gemma 4 12B locally

Match the quantization to your RAM. If the model swaps to disk, inference will slow down heavily.
Start with 4-bit. It is the safest default for laptops and consumer hardware.
Use thinking mode selectively. Set enable_thinking=True for math or multi-step reasoning, but leave it off for fast chat.
Watch the context window. A 256K context is large, but long transcripts and codebases can still fill it.
Validate requests in Apidog first. Confirm the JSON shape before your app depends on it.
Reuse the same local pattern for other models. Similar workflows apply to Qwen 3.7, MiniMax M3, and Claude Opus 4.8 access paths.

FAQ

Is Gemma 4 12B really free?

Yes. It is Apache 2.0 open-weights, so you can download and run it, including commercially. You pay only for the hardware or cloud infrastructure you use.

Do I need a GPU?

No, but a GPU helps. The 4-bit build can run on an 8GB GPU or a 16GB unified-memory Mac. CPU-only inference works, but it is slower.

Can I use Gemma 4 12B in Google AI Studio?

Not currently. AI Studio hosts the 31B and 26B models for free browser chat. Gemma 4 12B is intended for local and on-device usage.

Does the local API require an API key?

No. Ollama and llama.cpp serve the model on localhost without an API key. If a client requires a key field, use any placeholder value; the local server ignores it.

Can I use it from existing OpenAI-compatible code?

Yes. Ollama and llama.cpp expose OpenAI-compatible endpoints.

Use these base URLs:

Ollama:    http://localhost:11434/v1
llama.cpp: http://localhost:8080/v1

Then keep the same request structure and set the model name to gemma4:12b, the tag Ollama uses for this model.
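One way to keep existing code portable is to resolve the connection settings in a single place. The sketch below is an assumption-laden example: the environment variable name LOCAL_LLM_BASE_URL and the helper client_settings are hypothetical, and the API key is a placeholder because the local servers do not check it.

```python
import os

# Base URLs listed above for each local server.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def client_settings(backend: str = "ollama") -> dict:
    """Return base URL, placeholder key, and model name for a local backend.

    LOCAL_LLM_BASE_URL (a hypothetical variable name) overrides the default
    base URL so the backend can be switched without code changes.
    """
    base_url = os.environ.get("LOCAL_LLM_BASE_URL", BASE_URLS[backend])
    return {
        "base_url": base_url,
        "api_key": "not-needed",  # placeholder; ignored by the local server
        "model": "gemma4:12b",
    }
```

Any client that accepts a custom base URL can consume these values; with the OpenAI Python SDK, for example, base_url and api_key are passed to the OpenAI(...) constructor.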

How do I run image and audio features?

Use Transformers, LM Studio, or Google AI Edge apps that support multimodal input. Add image content before the text prompt and audio content after it.

Which is faster: Ollama or llama.cpp?

They use the same underlying engine. llama.cpp has less overhead and more tuning flags, while Ollama is easier to install and operate. For most local development workflows, the practical difference is small.
