Ayush kumar for NodeShift

Posted on • Originally published at nodeshift.cloud

How to Install & Run EmbeddingGemma-300m Locally?

EmbeddingGemma-300M is Google DeepMind’s lightweight, multilingual (100+ languages) embedding model built on Gemma 3/T5Gemma foundations. It outputs 768-dim vectors (with Matryoshka down-projections to 512/256/128) optimized for retrieval, classification, clustering, semantic similarity, QA, and code retrieval. It’s designed for low-resource / on-device use, loads via SentenceTransformers, and does not support float16—use FP32 or bfloat16.
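As a quick illustration of the Matryoshka output sizes and the FP32/bfloat16 requirement, here is a minimal sketch of loading the model with SentenceTransformers. The truncate_dim and model_kwargs arguments are standard SentenceTransformers options; the specific values shown (256, bfloat16) are example choices, not requirements.

from sentence_transformers import SentenceTransformer
import torch

# Load EmbeddingGemma-300M in bfloat16 (float16 is not supported by this model)
# and truncate embeddings to a 256-dim Matryoshka slice.
model = SentenceTransformer(
    "google/embeddinggemma-300m",
    truncate_dim=256,                              # 768, 512, 256, or 128
    model_kwargs={"torch_dtype": torch.bfloat16},  # or omit to stay in FP32
)

emb = model.encode("Matryoshka embeddings keep the most important dimensions first.")
print(emb.shape)  # (256,)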

Evaluation

Benchmark Results

The model was evaluated against a broad collection of datasets and metrics covering different aspects of text understanding.

The model card provides benchmark tables for both the Full Precision checkpoint and the QAT (quantization-aware trained) checkpoints, along with a GPU/CPU configuration table.

Note: QAT models are evaluated after quantization.

Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).

Prompt Instructions

Use task-specific prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.
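If you work through SentenceTransformers, you can list the prompts bundled with the checkpoint instead of hard-coding them. The sketch below uses the library's prompts attribute and the prompt_name encoding argument; the actual prompt names and strings come from the model's configuration, not from this guide.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# SentenceTransformer exposes prompts shipped in the model config as a dict
# of {prompt_name: prompt_string}. Print them to see what the checkpoint provides.
for name, prompt in model.prompts.items():
    print(f"{name!r}: {prompt!r}")

# A prompt can then be applied by name when encoding, for example:
# model.encode("some text", prompt_name=<one of the names printed above>)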

Resources

Link: https://huggingface.co/google/embeddinggemma-300m

Step-by-Step Process to Install & Run EmbeddingGemma-300m Locally

For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.

Step 1: Sign Up and Set Up a NodeShift Cloud Account

Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.

Follow the account setup process and provide the necessary details and information.

Step 2: Create a GPU Node (Virtual Machine)

GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.


Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.

Step 3: Select a Model, Region, and Storage

In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.


We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.

Step 4: Select Authentication Method

There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.

Step 5: Choose an Image

In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running EmbeddingGemma-300m, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.

We chose the following image:

nvidia/cuda:12.1.1-devel-ubuntu22.04


This image is essential because it includes:

  • Full CUDA toolkit (including nvcc)
  • Proper support for building and running GPU-based applications like EmbeddingGemma-300m
  • Compatibility with CUDA 12.1.1 required by certain model operations

Launch Mode

We selected:

Interactive shell server


This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like EmbeddingGemma-300m.

Docker Repository Authentication

We left all fields empty here.

Since the Docker image is publicly available on Docker Hub, no login credentials are required.

Identification

Template Name:

nvidia/cuda:12.1.1-devel-ubuntu22.04


CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.

This setup ensures that EmbeddingGemma-300m runs in a GPU-enabled environment with proper CUDA access and high compute performance.

After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.

Step 6: Virtual Machine Successfully Deployed

You will get visual confirmation that your node is up and running.

Step 7: Connect to GPUs using SSH

NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.

Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.

Now open your terminal and paste the proxy SSH IP or direct SSH IP.

Next, If you want to check the GPU details, run the command below:

nvidia-smi


Step 8: Verify Python Version & Install pip (if not present)

Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.

Step 8.1: Check Python Version

Run the following command to verify Python 3.10 is installed:

python3 --version


You should see output like:

Python 3.10.12


Step 8.2: Install pip (if not already installed)

Even if Python is installed, pip might not be available.

Check if pip exists:

pip3 --version


If you get an error like command not found, then install pip manually.

Install pip via get-pip.py:

curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py


This will download and install pip into your system.

You may see a warning about running as root — that’s okay for now.

After installation, verify:

pip3 --version


Expected output:

pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)


Now pip is ready to install packages like transformers, torch, etc.

Step 9: Create and Activate a Python 3.10 Virtual Environment

Run the following commands to create and activate a Python 3.10 virtual environment:

apt update && apt install -y python3.10-venv git wget
python3.10 -m venv gemma
source gemma/bin/activate


Step 10: Install Dependencies

Run the following command to install dependencies:

pip install -U sentence-transformers faiss-cpu

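Once the installation finishes, you can optionally confirm that PyTorch (pulled in as a dependency of sentence-transformers) can see the GPU. This quick check is an optional addition to the original steps.

# Optional sanity check: save as check_gpu.py and run `python3 check_gpu.py`
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))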

Step 11: Install Hugging Face Hub

Run the following command to install huggingface_hub:

pip install -U huggingface_hub


Step 12: Log in to Hugging Face (CLI)

Run the following command to log in to Hugging Face:

huggingface-cli login


When prompted, paste your HF token (from https://huggingface.co/settings/tokens).

For “Add token as git credential? (Y/n)”:

  • Y if you plan to git clone models/repos.
  • n if you only use huggingface_hub downloads.

You should see: “Token is valid… saved to /root/.cache/huggingface/stored_tokens”.

The red line “Cannot authenticate through git-credential…” just means no Git credential helper is set. It’s safe to ignore.
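If you prefer a non-interactive login (for scripts or automation), the same token can be supplied programmatically via huggingface_hub. This is an optional sketch that assumes you have exported the token as an HF_TOKEN environment variable beforehand.

# Optional, non-interactive alternative to `huggingface-cli login`.
# Assumes the token was exported first, e.g.: export HF_TOKEN=hf_xxx
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])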

Step 13: Connect to Your GPU VM with a Code Editor

Before you start running scripts against the EmbeddingGemma-300m model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.

  • You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
  • In this example, we’re using the Cursor code editor.
  • Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.

Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.

Step 14: Create app.py and Add the Following Code

Create the file
From your VM terminal:

nano app.py


Or, in your code editor (VS Code or Cursor), click New File and name it app.py.

Paste this code:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the EmbeddingGemma-300M model (Google’s open embedding model)
model = SentenceTransformer("google/embeddinggemma-300m")  # auto device (CPU/GPU)

# A sample query
query = "Which planet is known as the Red Planet?"

# A small list of candidate documents
docs = [
    "Venus is often called Earth's twin.",
    "Mars, with its reddish hue, is the Red Planet.",
    "Jupiter is the largest planet.",
    "Saturn has iconic rings."
]

# Encode the query → vector representation optimized for search
q = model.encode_query(query)

# Encode the documents → vector representations optimized for retrieval
D = model.encode_document(docs)

# Compute similarity between the query vector and each document vector
scores = model.similarity(q, D).squeeze().tolist()

# Pair each score with its document and sort (highest similarity first)
ranked = sorted(zip(scores, docs), reverse=True)

# Print top 3 results
print(ranked[:3])


What this file does (detailed)

Imports:

  • SentenceTransformer loads the EmbeddingGemma-300M model.
  • numpy is imported for vector math (it isn’t strictly needed in this minimal demo).

Model load:

  • Loads the Google EmbeddingGemma-300M embedding model, which converts text into vectors (embeddings).

Query + documents:

  • Defines one query ("Which planet is known as the Red Planet?") and a small set of candidate sentences (our mini “document corpus”).

Encoding:

  • model.encode_query(query) → creates a vector representation of the query.
  • model.encode_document(docs) → creates vector representations of the candidate docs.
  • Using separate methods ensures query/document embeddings are tuned for retrieval.

Similarity:

  • model.similarity(q, D) computes how close each doc is to the query in vector space.

Ranking:

  • Sorts docs by similarity score (highest first). The result shows which document best answers the query.

Output:

  • Prints the top 3 results. You should see “Mars…” ranked highest, since it matches the Red Planet question.

In short:
app.py is a minimal semantic search demo using EmbeddingGemma. It shows how to encode queries & docs, compute similarity, and rank results — the basic workflow behind search engines, chatbots, and RAG systems.

Step 15: Run the Script

Run the script with the following command:

python3 app.py


This will download the model on first run and print the ranked results in the terminal.

Step 16: Create build_index.py and Add the Following Code

Create the file

nano build_index.py


Or in VS Code → New File → name it build_index.py.

Paste the full code below:

import os, json, argparse, numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
import faiss

def read_corpus(folder):
    paths = []
    texts = []
    for p in Path(folder).rglob("*"):
        if p.suffix.lower() in {".txt", ".md"} and p.stat().st_size > 0:
            paths.append(str(p))
            texts.append(p.read_text(encoding="utf-8", errors="ignore"))
    return paths, texts

def mrl_truncate_and_norm(X, k):
    X = X[:, :k]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X.astype("float32")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--data_dir", required=True, help="Folder with .txt/.md")
    ap.add_argument("--dim", type=int, default=768, choices=[768,512,256,128])
    ap.add_argument("--out_dir", default="index")
    args = ap.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    print("Loading model…")
    model = SentenceTransformer("google/embeddinggemma-300m")  # fp32/bf16 only

    print("Reading corpus…")
    paths, texts = read_corpus(args.data_dir)
    assert texts, "No .txt/.md files found"

    print(f"Encoding {len(texts)} docs…")
    D = model.encode_document(texts, batch_size=64, convert_to_numpy=True)
    # L2-normalize (cosine sim via inner product)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    if args.dim < 768:
        print(f"Applying Matryoshka truncation to {args.dim}…")
        D = mrl_truncate_and_norm(D, args.dim)

    index = faiss.IndexFlatIP(D.shape[1])
    index.add(D)

    faiss.write_index(index, f"{args.out_dir}/faiss_{args.dim}.index")
    np.save(f"{args.out_dir}/embeddings_{args.dim}.npy", D)
    with open(f"{args.out_dir}/mapping.json", "w") as f:
        json.dump(paths, f, indent=2)

    print(f"Saved index to {args.out_dir} (dim={args.dim}, N={len(texts)})")

if __name__ == "__main__":
    main()


What this script does

read_corpus(folder):
Reads all .txt and .md files in the given folder. Returns two lists:

  • paths → file paths
  • texts → file contents

mrl_truncate_and_norm(X, k):
Implements Matryoshka Representation Learning.

  • Takes embeddings of size 768.
  • Truncates to smaller dimension (512, 256, or 128).
  • Re-normalizes them for cosine similarity search.

main():
Parse arguments:

  • --data_dir → where your text files are.
  • --dim → embedding size (default 768).
  • --out_dir → where to save the index (default index/).

Then:

  • Load the EmbeddingGemma-300M model.
  • Read all docs from your folder.
  • Encode them with model.encode_document().
  • Normalize vectors.
  • Optionally shrink with MRL.
  • Create a FAISS index (cosine similarity via IndexFlatIP).

Save:

  • faiss_<dim>.index → the FAISS index file (e.g., faiss_768.index).
  • embeddings_<dim>.npy → NumPy array of embeddings.
  • mapping.json → JSON file mapping index rows to document file paths.

How to run it

Create some docs (if you don’t have any yet):

mkdir docs
echo "Mars is the Red Planet." > docs/mars.txt
echo "Venus is Earth's twin." > docs/venus.txt
echo "Jupiter is the largest planet." > docs/jupiter.txt


Run the script:

python3 build_index.py --data_dir ./docs


This will:

  • Read your .txt files in docs/
  • Encode them with EmbeddingGemma-300M
  • Save an index under ./index/

Output example:

Loading model…
Reading corpus…
Encoding 3 docs…
Saved index to index (dim=768, N=3)


What you get after running

Inside the index/ folder:

  • faiss_768.index → FAISS index file
  • embeddings_768.npy → stored embeddings
  • mapping.json → JSON list of the indexed file paths

In short: build_index.py prepares your text files into a searchable embedding index using EmbeddingGemma + FAISS.
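The build script stops at index creation, so here is a hedged sketch of a companion search script (call it search.py; the file name and argument handling are assumptions) that loads the saved index and mapping, encodes a query with the same model, and prints the closest files. It assumes the defaults used above (dim=768, output folder index/).

import json, sys
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

INDEX_PATH = "index/faiss_768.index"
MAPPING_PATH = "index/mapping.json"

def main():
    query = sys.argv[1] if len(sys.argv) > 1 else "Which planet is known as the Red Planet?"

    index = faiss.read_index(INDEX_PATH)
    with open(MAPPING_PATH) as f:
        paths = json.load(f)

    model = SentenceTransformer("google/embeddinggemma-300m")

    # Encode and L2-normalize the query so inner product equals cosine similarity,
    # matching how the documents were indexed.
    q = model.encode_query(query)
    q = (q / np.linalg.norm(q)).astype("float32").reshape(1, -1)

    k = min(3, index.ntotal)
    scores, ids = index.search(q, k)
    for score, i in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {paths[i]}")

if __name__ == "__main__":
    main()

Run it after build_index.py has finished, for example: python3 search.py "Which planet is known as the Red Planet?"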

Conclusion

EmbeddingGemma-300M is a powerful yet lightweight open embedding model from Google DeepMind, designed for retrieval, semantic similarity, classification, clustering, and more — all while being efficient enough to run on laptops, desktops, or modest GPUs. In this guide, we walked through setting up a NodeShift GPU VM, installing dependencies, and building two core scripts:

  • app.py for a quick semantic search demo using queries and documents.
  • build_index.py for preparing and indexing your own text corpus with FAISS, ready for scalable search.

With these steps, you now have everything you need to integrate EmbeddingGemma into search pipelines, recommendation systems, or retrieval-augmented applications. Whether on-device or in the cloud, EmbeddingGemma-300M provides a practical and cost-effective foundation for embedding-based workflows.
