Ismaili Simba

Running Gemma4 for Free on HuggingFace

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

For many developers from less privileged backgrounds, even something as cost-efficient as an $8 monthly plan to run your own state-of-the-art LLM can be out of reach. As such, we are very grateful for perpetual free-tier services such as GitHub, HuggingFace Spaces, MongoDB, Cloudflare R2, and many more.

Having used free-tier HuggingFace Spaces before, I found myself wondering, when Gemma 4 launched, whether it would be possible to run any of the Gemma 4 models fully and only on the free tier. If so, this would make Gemma 4 accessible to many more developers and technical users who want to experiment with it but, for one reason or another, cannot afford services like Cloud Run, where you could typically deploy such models and get excellent performance.

To test this train of thought, I started by running a Deep Research prompt using the Thinking model in the Gemini web app. You can see the full report here. I used the prompt, "Please research the new Gemma4 models and generate a guide for how I could deploy one to HuggingFace spaces (free tier only)". I then exported the full report as a PDF and started another chat using the Pro model with the prompt, "Please review this research report and generate a detailed plan I can prompt you on step by step to deploy the appropriate Gemma4 model (pick one for me) to a free-tier HuggingFace space to act as a web API that supports chatting (streaming and full responses), image and audio generation".

Immediately, Gemini told me to check my expectations: free-tier Spaces on HuggingFace only give you 2 vCPUs and 8GB of RAM, which is a very constrained environment for running an LLM. Because of that, while we could probably pull it off, we would only be able to use the model for text generation, not image or audio. Well, beggars can't be choosers, so I told Gemini to proceed.

The first step was to create a Dockerfile and a main.py file to download, install, and run the model, serving it with FastAPI. After some minor debugging to get the right download URL, we hit a wall: llama-cpp-python did not yet support Gemma 4. After going in circles for a while, I asked Gemini whether switching to Ollama would solve the issue. It said yes! And after a few more rounds of debugging, we had a working deployment of Gemma 4 running fully on a free-tier HuggingFace Space. No credit card required!
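
For the curious, here's a minimal sketch of what that first attempt looked like in spirit, not a reproduction of the actual main.py (it assumes llama-cpp-python's standard Llama API and a model.gguf downloaded at build time, much like the Dockerfile below does). The Llama() load step is exactly where an unsupported architecture blows up:

main.py (sketch)

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the GGUF once at startup; this call fails if llama.cpp
# doesn't recognize the model's architecture yet
llm = Llama(model_path="model.gguf", n_ctx=2048, n_threads=2)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/api/generate")
def generate(req: GenerateRequest):
    out = llm(req.prompt, max_tokens=256)
    return {"response": out["choices"][0]["text"]}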

Do temper your excitement, though, as there are serious limitations. The model takes about 3 to 4 minutes to process a prompt and start responding. But given that it's running on absolutely free infrastructure, I think it's worth it. You can check out the deployment here. You can also try prompting it using the link: https://ismizo-gemma4.hf.space/api/generate . See the curl example below.

curl -X POST https://ismizo-gemma4.hf.space/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "prompt": "Explain the concept of a Dyson Sphere in three short sentences.",
    "stream": true
  }'

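Since the Space exposes Ollama's API directly, you can also consume the stream from Python. Here's a small client sketch (Ollama's /api/generate endpoint streams newline-delimited JSON objects, each carrying a response fragment; the generous timeout accounts for the slow first response):

import json
import requests

url = "https://ismizo-gemma4.hf.space/api/generate"
payload = {
    "model": "gemma4",
    "prompt": "Explain the concept of a Dyson Sphere in three short sentences.",
    "stream": True,
}

# Ollama streams one JSON object per line; print tokens as they arrive
with requests.post(url, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)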

To try this out on your own free-tier HuggingFace Space, you'll need the Dockerfile and entrypoint.sh files shown below. Just deploy a new free-tier Space using the blank Docker option, add your HuggingFace token as a secret named HF_TOKEN (the Dockerfile uses it to download the model), then upload the two files. After some time, you'll have your own instance of Gemma 4 running on HuggingFace.

If you can afford $8 per month and want more performance, I recommend Cloud Run instead; you can follow this article on how to deploy any LLM you want from Ollama to Cloud Run.

I hope this is useful to at least someone.

Good luck and enjoy!

Dockerfile

FROM python:3.10-slim

# 1. Install only what we need for the model download
RUN apt-get update && apt-get install -y \
    curl wget \
    && rm -rf /var/lib/apt/lists/*

# 2. THE BULLETPROOF FIX: Copy the binary directly from the official Ollama image
COPY --from=ollama/ollama:latest /usr/bin/ollama /usr/bin/ollama

WORKDIR /app

# 3. Download the Gemma 4 GGUF
RUN --mount=type=secret,id=HF_TOKEN,mode=0444 \
    wget --header="Authorization: Bearer $(cat /run/secrets/HF_TOKEN)" \
    https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_K_M.gguf \
    -O model.gguf

# 4. Create the Modelfile with CPU Performance Parameters
RUN printf "FROM ./model.gguf\nPARAMETER num_ctx 2048\nPARAMETER num_thread 2\nPARAMETER num_batch 256\nPARAMETER num_keep 500" > Modelfile

# 5. Configure for Hugging Face Spaces (Port 7860)
EXPOSE 7860
ENV OLLAMA_HOST=0.0.0.0:7860
ENV OLLAMA_KEEP_ALIVE=-1

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]
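
For reference, the printf in step 4 expands to the following Modelfile. num_thread 2 matches the Space's two vCPUs, and num_ctx 2048 caps the context window to keep memory usage manageable:

FROM ./model.gguf
PARAMETER num_ctx 2048
PARAMETER num_thread 2
PARAMETER num_batch 256
PARAMETER num_keep 500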

entrypoint.sh

#!/bin/bash
# Start Ollama server in the background
ollama serve &

# Wait until the server is responsive on port 7860
until curl -s localhost:7860 > /dev/null; do
  echo "Waiting for Ollama server..."
  sleep 2
done

# Create the model using the local GGUF
ollama create gemma4 -f Modelfile
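
# Optional (an assumption, not part of the original deployment):
# Ollama loads a model into memory when it receives an empty prompt,
# so a warm-up request here would spare the first user the load time.
# curl -s localhost:7860/api/generate -d '{"model": "gemma4"}' > /dev/null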

# Keep the script running
wait
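
Finally, if you'd rather script the deployment than click through the HuggingFace UI, a sketch along these lines should work using the huggingface_hub library (the repo name and token placeholders are examples; the HF_TOKEN secret is needed because the Dockerfile reads it as a build secret to download the model):

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # your HuggingFace write token

# Create a blank Docker Space (free-tier CPU hardware is the default)
repo_id = "your-username/gemma4"  # example name
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="docker", exist_ok=True)

# The Dockerfile reads this secret during the build to fetch the GGUF
api.add_space_secret(repo_id=repo_id, key="HF_TOKEN", value="hf_...")

# Upload the two files from this post; the Space rebuilds automatically
for filename in ["Dockerfile", "entrypoint.sh"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id=repo_id,
        repo_type="space",
    )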
