Remigiusz Samborski for Google Cloud

Originally published at Medium

Serverless AI: Qwen3 Embeddings with Cloud Run

In this blog post I’ll walk you through deploying the Qwen3 Embedding model to Cloud Run with GPUs for enhanced performance.

You will learn how to:

  • Containerize the embedding model with Docker and Ollama
  • Deploy the embedding model to Cloud Run with GPUs
  • Test the deployed model from a local machine

Before we jump into the code, a couple of words about the key components of the solution.

Qwen3 Embedding Model

The Qwen3 Embedding series is a set of open-source models for text embedding and reranking, built on the Qwen3 Large Language Model (LLM) family. It's designed for retrieval-augmented generation (RAG), a technique that enhances the output of large language models by retrieving relevant information from a knowledge base, and other tasks requiring semantic search. You can learn more about embeddings in this video.
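
To make this concrete, here is a minimal sketch of the idea behind semantic search: texts become vectors, and similar meanings land close together. It assumes a local Ollama instance that has already pulled the same model we deploy later in this post:

import math

from ollama import Client

# Assumes a local Ollama with the model pulled:
#   ollama pull dengcao/Qwen3-Embedding-4B:Q4_K_M
client = Client(host="http://localhost:11434")
MODEL = "dengcao/Qwen3-Embedding-4B:Q4_K_M"

def embed(text: str) -> list[float]:
    return client.embed(model=MODEL, input=text)["embeddings"][0]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: values close to 1.0 mean very similar meaning
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = embed("Where is Poland located?")
doc = embed("Poland is a country located in Central Europe.")
print(cosine(query, doc))  # higher score = more semantically similar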

Open embedding models such as Qwen3 are the ideal choice when you need greater control, specialization, and security than proprietary, "black-box" APIs can offer. They are particularly well-suited for the following use cases:

  • Fine-Tuning for Niche Domains📻: by fine-tuning them on specialized data (e.g., legal contracts, medical research, internal company wikis) they can provide more accurate results for semantic search and RAG than a general-purpose model.
  • Data Privacy & Security🔒: open models can be self-hosted or deployed to cloud resources managed by your organization. This ensures compliance with regulations like GDPR and prevents data from ever leaving your control.
  • Cost-Effectiveness at Scale💰: for high-volume tasks, running an optimized open model can be cheaper than paying per-API-call fees to a proprietary service provider.
  • Offline & Edge Deployment🛜: open models can run locally and are perfect for applications that must function without an internet connection, such as on-device search in mobile apps or analysis on remote IoT devices.

I chose the Qwen3-Embedding-4B model due to its growing popularity and suitable size for the Cloud Run environment. However, you can experiment with different sizes (0.6B, 4B, and 8B) depending on your specific use case.

Cloud Run

Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Functions) and a more complex GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.

The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren't paying for any resources. When traffic picks up, it quickly scales up to handle the load. This makes it perfect for stateless models that need to be highly available and cost-effective.
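
Keep in mind that with scale-to-zero, the first request after an idle period pays a cold-start penalty (the container starts and the model loads onto the GPU). If that matters for your workload, you can keep a warm instance around; a sketch, using the service name we deploy later in this post:

# Trade some cost for latency by always keeping one instance warm
gcloud run services update ollama-qwen3-embeddings --min-instances=1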

Deployment

But enough with the intros, let's get our hands dirty with some code 🧑‍💻

Below are step-by-step instructions on how to get the Qwen3 Embedding model up and running.

Prepare the environment

First we need to configure the gcloud CLI environment.

Note: if you don’t have gcloud CLI installed please follow instructions available here.

  • Step 1 - Set your default project:
gcloud config set project PROJECT_ID
  • Step 2 - Configure Google Cloud CLI to use the europe-west1 region for Cloud Run commands:
gcloud config set run/region europe-west1

Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region, please refer to this page.
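
You can double-check both settings before moving on:

gcloud config get-value project
gcloud config get-value run/region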

Containerize

We will use Docker and Ollama to run the Qwen3 Embedding model. Create a file named Dockerfile and put the following code inside it:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=dengcao/Qwen3-Embedding-4B:Q4_K_M
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
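
Optionally, you can sanity-check the image on your local machine before deploying (a quick smoke test; without a local GPU, Ollama falls back to CPU, so expect slower responses):

docker build -t qwen3-embeddings .
docker run --rm -p 8080:8080 qwen3-embeddings

# In another terminal:
curl http://localhost:8080/api/embed -d '{
  "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
  "input": "Sample text"
}'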

Build and deploy

Next, it’s time to leverage the power of Cloud Run’s source deployments. With a single command you can:

  • Build the container image from source (note the --source parameter in the command below)
  • Upload the container image to Artifact Registry
  • Deploy the container to Cloud Run with GPUs enabled (note --gpu and --gpu-type options)
  • Redirect all traffic to the new deployment

To do all the above, you just need to run:

gcloud run deploy ollama-qwen3-embeddings \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-qwen3-embeddings

Note the following important flags in this command:

  • --concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
  • --gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
  • --max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU quota.
  • --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication (see the example after this list).
  • --no-cpu-throttling is required when enabling GPUs.
  • --no-gpu-zonal-redundancy disables GPU zonal redundancy; choose this setting based on your zonal failover requirements and available quota.
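
For example, to let a specific caller invoke the private service, grant its identity the Cloud Run Invoker role (a sketch; the service account below is a placeholder for your caller's identity):

gcloud run services add-iam-policy-binding ollama-qwen3-embeddings \
  --member=serviceAccount:CALLER_SA@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/run.invoker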

Test the deployment

Now that you have successfully deployed the service, you can send requests to it. However, if you send a request directly, Cloud Run will respond with HTTP 401 Unauthorized. This is intentional, because we want our model to be called from other services, such as a RAG application, and not accessible by everyone on the Internet.

The easiest way to test the deployment from a local machine is to spin up the Cloud Run developer proxy by executing:

gcloud run services proxy ollama-qwen3-embeddings --port=9090

Now in a second terminal window run:

curl http://localhost:9090/api/embed -d '{
  "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
  "input": "Sample text"
}'

You should see a response similar to this:

[Image: Qwen3 Embedding model response from Cloud Run]
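
For reference, the body of the response is a JSON object shaped roughly like this (values abbreviated; exact fields may vary between Ollama versions):

{
  "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
  "embeddings": [[0.0123, -0.0456, ...]],
  "total_duration": ...,
  "load_duration": ...,
  "prompt_eval_count": ...
}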

You can also call the endpoint from a Python client. Example:

from ollama import Client

# Connect through the Cloud Run developer proxy started above
client = Client(host="http://localhost:9090")

response = client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input="Sample text")
print(response)
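
Once you move past local testing, callers can also reach the service directly with an IAM identity token instead of the proxy. A sketch, where SERVICE_URL stands for the URL printed at the end of the deploy command:

curl https://SERVICE_URL/api/embed \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -d '{
    "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
    "input": "Sample text"
  }'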

Congratulations 🎉 Your Cloud Run deployment is up and running!

RAG Example

You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:

Step 1 - Generate Embeddings

Install necessary dependencies:

pip install ollama chromadb

Create a file named example.py with the following content:

import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in an in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

Step 2 - Retrieve

Next, the following code will search the vector database for the most relevant document (add it to your example.py):

# An example prompt
prompt = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=prompt)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]
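
For the example prompt, the retrieved document should be the sentence about Polish being a West Slavic language; a quick print confirms it:

print(data)
# Expected: "Poland's official language is Polish, which is a West Slavic language."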

Step 3 - Generate final answer

In the generation step we will use a locally installed Qwen3:0.6b model.

Note: we use Qwen3 in the generation step, but any other model could work here (e.g., Gemini, Gemma, Llama). Nevertheless, it’s critical to use the same embedding model in step 1 (Generate Embeddings) and step 2 (Retrieve).

You can install the Qwen3:0.6b model by running the following command:

ollama pull qwen3:0.6b

Now we’re ready to combine the user’s prompt with the search results to generate the final answer (add to example.py):

# Final step - generate a response combining the prompt and the data we retrieved in step 2
# Note: this module-level call uses the local Ollama instance (default
# localhost:11434) where qwen3:0.6b was pulled, not the Cloud Run proxy
output = ollama.generate(
    model="qwen3:0.6b",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}",
)

print(output["response"])

Run the code by executing:

python example.py

You should see an answer similar to the one below:

<think>
Okay, the user is asking what Poland's official language is, and they provided the information that Poland's official language is Polish, which is a West Slavic language. Let me make sure I understand this correctly.

First, I need to confirm if that's the correct information. I know that Poland is a country in Eastern Europe, and its official language is Polish. But wait, what's the source of this information? The user hasn't provided any other data, so I should stick strictly to the given information.

I should state that Poland's official language is Polish, and that it's a West Slavic language. I need to present this clearly and concisely. Maybe mention that it's the official language to emphasize its significance. Also, check if there's any other detail that needs to be included, but since the user provided only this, I can proceed.
</think>

Poland's official language is **Polish**. This language is a **West Slavic language**.

Well done! You have just created and run your first RAG application with the Qwen3 Embedding model under the hood.

Summary

At this point you have a Cloud Run service running the Qwen3 Embedding model. You can use it to generate embeddings for semantic search or a RAG application.

Stay tuned for more content around leveraging Qwen3 Embedding in your applications.

Thanks for reading

I hope this article inspired you to experiment with open embedding models on Cloud Run. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.

I'm always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Top comments (3)

Juan Perez prueba

Great tutorial, great explaining though.

Roshan Sharma

Awesome setup
Curious, are there any additional optimizations or features you’re planning to add to this serverless embedding pipeline, like caching, batching, or monitoring improvements?

Remigiusz Samborski Google Cloud

Thanks for the suggestions. I will be exploring other topics and will add those to my backlog.