Gleb Otochkin

AlloyDB Omni and local models on GKE

AlloyDB and Vertex AI are great cloud services providing tons of capabilities and options to serve as the main backend for development. But what if you need something different? Maybe something more local, packaged as a compact self-contained deployment where all communication between the different parts of the application stays as closed as possible? Or a deployment in an environment where normal access to the service endpoints is unavailable? Can we do it and still use all the good stuff from AlloyDB, such as AI integration and improved vector search? Yes we can, and in this blog I will show how to deploy a local AI model and AlloyDB Omni to the same Kubernetes cluster and make them work together.

Deploying AlloyDB Omni

For my deployment I am using Google GKE, and we start by creating a standard cluster. For most of the actions I am using Google Cloud Shell and the standard utilities that come with it, but you can of course use your own preferred environment. Here is the command to create a cluster.

export PROJECT_ID=$(gcloud config get project)
export LOCATION=us-central1
export CLUSTER_NAME=alloydb-ai-gke
gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${LOCATION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --release-channel=rapid \
  --machine-type=e2-standard-8 \
  --num-nodes=1

As soon as the cluster is deployed we can continue preparing it for AlloyDB Omni. You can read about all the requirements and the installation procedure in much more detail in the documentation.

One of the requirements is to install the cert-manager service. Most of the actions on the cluster are done using native Kubernetes utilities like kubectl and helm, and to use those tools we need cluster credentials. In GKE we get them with a gcloud command.

gcloud container clusters get-credentials ${CLUSTER_NAME} --region=${LOCATION}

Then we can install the cert-manager service on our cluster.

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml
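
Before moving on, it is worth making sure cert-manager is actually up. A minimal check (the cert-manager namespace is the one created by the upstream manifest):

kubectl -n cert-manager rollout status deployment/cert-manager
kubectl get pods -n cert-manager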

Now we need to get the helm package for the latest AlloyDB Omni Kubernetes operator.

export GCS_BUCKET=alloydb-omni-operator
export HELM_PATH=$(gcloud storage cat gs://$GCS_BUCKET/latest)
export OPERATOR_VERSION="${HELM_PATH%%/*}"
gcloud storage cp gs://$GCS_BUCKET/$HELM_PATH ./ --recursive
helm install alloydbomni-operator alloydbomni-operator-${OPERATOR_VERSION}.tgz \
--create-namespace \
--namespace alloydb-omni-system \
--atomic \
--timeout 5m
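
Once helm reports success, a quick check confirms the operator is running in the namespace we created:

helm list --namespace alloydb-omni-system
kubectl get pods --namespace alloydb-omni-system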

When the AlloyDB Omni operator is installed, we can follow up with the deployment of our database cluster. We need to deploy it with the googleMLExtension parameter enabled to be able to work with AI models. I also prefer to enable an internal load balancer for the database deployment: it creates an internal IP in the project VPC, so I can use a small VM with the psql client installed to work with the databases, load data and so on. You can find more information about the load balancer in the documentation. Here is my manifest to deploy an AlloyDB Omni cluster with the name my-omni.

apiVersion: v1
kind: Secret
metadata:
  name: db-pw-my-omni
type: Opaque
data:
  my-omni: "VmVyeVN0cm9uZ1Bhc3N3b3Jk"
---
apiVersion: alloydbomni.dbadmin.goog/v1
kind: DBCluster
metadata:
  name: my-omni
spec:
  databaseVersion: "15.7.0"
  primarySpec:
    adminUser:
      passwordRef:
        name: db-pw-my-omni
    features:
      googleMLExtension:
        enabled: true
    resources:
      cpu: 1
      memory: 8Gi
      disks:
      - name: DataDisk
        size: 20Gi
        storageClass: standard
    dbLoadBalancerOptions:
      annotations:
        networking.gke.io/load-balancer-type: "internal"
  allowExternalIncomingTraffic: true

Save it as my-omni.yaml and then apply the configuration to the cluster.

kubectl apply -f my-omni.yaml
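
The database cluster takes a few minutes to come up. You can watch the DBCluster resource and its pods while it is being provisioned (the resource kind and API group come from the manifest above):

kubectl get dbclusters.alloydbomni.dbadmin.goog
kubectl get pods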

By the way, have you noticed the value I’ve used for my password in the secret? It accepts values encoded in base64, and you can encode them using standard Linux utilities. Here is an example where I encode the password “VeryStrongPassword” to base64.

echo -n "VeryStrongPassword" | base64

But speaking about Kubernetes secrets and passwords, I would rather use a more secure solution to store them. In GKE I prefer to use Google Cloud Secret Manager. You can read in detail how to implement it in the documentation. It works really well, and it also helps to integrate AlloyDB Omni with AI models that require authorization such as tokens or keys.

When the database cluster and internal load balancer are deployed we should see the external service for our Omni instance.

kubectl get service

In the output we should see a service of “LoadBalancer” type with an external IP. We can use that IP to connect to our instance from a VM in the same VPC.

DB_CLUSTER_NAME=my-omni
export INSTANCE_IP=$(kubectl get service al-${DB_CLUSTER_NAME}-rw-elb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $INSTANCE_IP

Knowing your load balancer IP, you can export it as a variable (useful for automation) or put it directly in the command.

export INSTANCE_IP=10.128.15.195
psql "host=${INSTANCE_IP} user=postgres"
# or simply
psql "host=10.128.15.195 user=postgres"

Deploying a Model

Now we need to deploy a local model to the same Kubernetes cluster. So far we have only one default pool (the compute nodes for your apps) with e2-standard-8 nodes. It is enough for our AlloyDB Omni but not ideal for inference. To run a model we need a node with a graphics accelerator. For the test I’ve created a pool with an NVIDIA L4 accelerator. Here is the command.

export PROJECT_ID=$(gcloud config get project)
export LOCATION=us-central1
export CLUSTER_NAME=alloydb-ai-gke
gcloud container node-pools create gpupool \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --project=${PROJECT_ID} \
  --location=${LOCATION} \
  --node-locations=${LOCATION}-a \
  --cluster=${CLUSTER_NAME} \
  --machine-type=g2-standard-8 \
  --num-nodes=1
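
When the pool is ready, the new node should appear with the accelerator label we will use later in the model deployment's nodeSelector:

kubectl get nodes --selector cloud.google.com/gke-accelerator=nvidia-l4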

Keep in mind the project quotas when you create the pools. Not all accelerator types are available by default, and that may dictate the way you deploy the model.
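
A rough way to check the regional GPU quota before creating the pool (assuming the L4 quota metric is named NVIDIA_L4_GPUS; the metric name differs for other accelerator types):

gcloud compute regions describe ${LOCATION} | grep -B1 -A1 NVIDIA_L4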

I was using Hugging Face to deploy the BGE Base v1.5 embedding model. Hugging Face provides full instructions and a deployment package to be used with GKE.

We need the deployment manifest, and we can get it from the Hugging Face GitHub repository.

git clone https://github.com/huggingface/Google-Cloud-Containers

If you plan to reuse the model it makes sense to use a Google Cloud Storage (GCS) bucket to keep it between deployments, but in my case I am only testing and skipping the bucket part. The GCS option is also included in the downloaded package.

For deployment without GCS we need to review and modify the Google-Cloud-Containers/examples/gke/tei-deployment/gpu-config/deployment.yaml file, replacing the cloud.google.com/gke-accelerator value with our nvidia-l4. We also need to define limits for the resources we request, otherwise we can get an error.

vi Google-Cloud-Containers/examples/gke/tei-deployment/gpu-config/deployment.yaml

Here is the corrected manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tei-server
  template:
    metadata:
      labels:
        app: tei-server
        hf.co/model: Snowflake--snowflake-arctic-embed-m
        hf.co/task: text-embeddings
    spec:
      containers:
        - name: tei-container
          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          env:
            - name: MODEL_ID
              value: Snowflake/snowflake-arctic-embed-m
            - name: NUM_SHARD
              value: "1"
            - name: PORT
              value: "8080"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: data
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          emptyDir: {}
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4

Then we can follow up by creating the namespace and the service account, and deploying the rest.

export NAMESPACE=hf-gke-namespace
export SERVICE_ACCOUNT=hf-gke-service-account
kubectl create namespace $NAMESPACE
kubectl create serviceaccount $SERVICE_ACCOUNT --namespace $NAMESPACE
kubectl apply -f Google-Cloud-Containers/examples/gke/tei-deployment/gpu-config

If we have a look at the created service, we can see that by default it has only a cluster IP, which means it is available only inside the cluster. Nobody outside the cluster has access to the model.

gleb@cloudshell:~/blog (test)$ kubectl get service tei-service
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
tei-service   ClusterIP   34.118.225.12   <none>        8080/TCP   12m
gleb@cloudshell:~/blog (test)$

The service is available for requests at the endpoint URL http://34.118.225.12:8080/embed for embeddings generation.
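
Before wiring the model into the database, you can verify it responds by calling the endpoint from a temporary pod inside the cluster. This is a quick sketch: the curlimages/curl image is an arbitrary choice, and the JSON payload matches what our input transform function will send later.

kubectl run tei-test -i --rm --restart=Never --image=curlimages/curl --command -- \
  curl -s http://34.118.225.12:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is AlloyDB Omni?"}'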

Register Model in AlloyDB Omni

Everything is ready to register the deployed model in AlloyDB Omni. We start by creating a demo database. In a psql session (remember our jump box VM?) connect as the postgres user and run:

create database demo;

Let’s connect to the new “demo” database:

psql "host=10.128.15.195 user=postgres dbname=demo"

There we can register our new model using the google_ml procedures. Before registering an embedding model we need to create transform functions, which are responsible for converting the input and output to the formats the model endpoint expects. Here are the functions I’ve prepared for our model.

-- Input Transform Function corresponding to the custom model endpoint
CREATE OR REPLACE FUNCTION tei_text_input_transform(model_id VARCHAR(100), input_text TEXT)
RETURNS JSON
LANGUAGE plpgsql
AS $$
DECLARE
  transformed_input JSON;
BEGIN
  SELECT json_build_object('inputs', input_text, 'truncate', true)::JSON INTO transformed_input;
  RETURN transformed_input;
END;
$$;

-- Output Transform Function corresponding to the custom model endpoint
CREATE OR REPLACE FUNCTION tei_text_output_transform(model_id VARCHAR(100), response_json JSON)
RETURNS REAL[]
LANGUAGE plpgsql
AS $$
DECLARE
  transformed_output REAL[];
BEGIN
  SELECT ARRAY(SELECT json_array_elements_text(response_json->0)) INTO transformed_output;
  RETURN transformed_output;
END;
$$;
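
You can sanity check the input transform right away in the psql session; it should return the JSON payload our model endpoint expects (the model_id argument is only there to satisfy the signature):

SELECT tei_text_input_transform('bge-base-1.5', 'What is AlloyDB Omni?');
-- {"inputs" : "What is AlloyDB Omni?", "truncate" : true}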

Then we register the new model with the name bge-base-1.5. I used the HTTP endpoint described earlier with the cluster service IP, and our transform functions.

CALL
  google_ml.create_model(
    model_id => 'bge-base-1.5',
    model_request_url => 'http://34.118.225.12:8080/embed',
    model_provider => 'custom',
    model_type => 'text_embedding',
    model_in_transform_fn => 'tei_text_input_transform',
    model_out_transform_fn => 'tei_text_output_transform');
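
After the call you can verify the registration from the same session. In recent versions of the google_ml_integration extension the registered models are exposed through the google_ml.model_info_view view (treat the view name as an assumption if your version differs):

SELECT * FROM google_ml.model_info_view WHERE model_id = 'bge-base-1.5';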

Tests

Let’s test it and see how many dimensions a generated vector has. Here is the output:

demo=# select array_dims(google_ml.embedding('bge-base-1.5','What is AlloyDB Omni?'));
 array_dims 
------------
 [1:768]
(1 row)

demo=# 

Great! It works and shows that our embedding function returns a real array with 768 dimensions.

I used a small dataset from one of the embeddings codelabs I created some time ago to generate embeddings and run a query.
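
Populating the embeddings looked roughly like this (a sketch only: it assumes the cymbal_embedding table stores a uniq_id and a vector column named embedding, matching the schema used in the query below):

INSERT INTO cymbal_embedding (uniq_id, embedding)
SELECT
        cp.uniq_id,
        google_ml.embedding('bge-base-1.5', cp.product_description)::vector
FROM cymbal_products cp;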

demo=# \timing
Timing is on.
demo=# SELECT
        cp.product_name,
        left(cp.product_description,80) as description,
        cp.sale_price,
        cs.zip_code,
        (ce.embedding <=> google_ml.embedding('bge-base-1.5','What kind of fruit trees grow well here?')::vector) as distance
FROM
        cymbal_products cp
JOIN cymbal_embedding ce on
        ce.uniq_id=cp.uniq_id
JOIN cymbal_inventory ci on
        ci.uniq_id=cp.uniq_id
JOIN cymbal_stores cs on
        cs.store_id=ci.store_id
        AND ci.inventory>0
        AND cs.store_id = 1583
ORDER BY
        distance ASC
LIMIT 10;
     product_name | description | sale_price | zip_code | distance       
-----------------------+----------------------------------------------------------------------------------+------------+----------+---------------------
 California Sycamore | This is a beautiful sycamore tree that can grow to be over 100 feet tall. It is | 300.00 | 93230 | 0.22753925487632942
 Toyon | This is a beautiful toyon tree that can grow to be over 20 feet tall. It is an e | 10.00 | 93230 | 0.23497374266229387
 California Peppertree | This is a beautiful peppertree that can grow to be over 30 feet tall. It is an e | 25.00 | 93230 | 0.24215884459965364
 California Redwood | This is a beautiful redwood tree that can grow to be over 300 feet tall. It is a | 1000.00 | 93230 | 0.24564130578287147
 Cherry Tree | This is a beautiful cherry tree that will produce delicious cherries. It is an d | 75.00 | 93230 | 0.24846117929767153
 Fremont Cottonwood | This is a beautiful cottonwood tree that can grow to be over 100 feet tall. It i | 200.00 | 93230 | 0.2533482837690365
 Madrone | This is a beautiful madrona tree that can grow to be over 80 feet tall. It is an | 50.00 | 93230 | 0.25755536556243364
 Secateurs | These secateurs are perfect for pruning small branches and vines. | 15.00 | 93230 | 0.26093776589260964
 Sprinkler | This sprinkler is perfect for watering a large area of your garden. | 30.00 | 93230 | 0.26263969504592044
 Plant Pot | This is a stylish plant pot that will add a touch of elegance to your garden. | 20.00 | 93230 | 0.2639707045520192
(10 rows)

Time: 25.900 ms
demo=# 

The response time was about 25 ms on average and relatively stable. The recall quality was also quite decent, returning a good selection of trees from the inventory.

You can try deploying AlloyDB Omni along with different AI models right now, in GKE or in your local Kubernetes environment. The great thing about AlloyDB Omni is that it can be deployed anywhere you can run containers.

In the next post I will compare performance and recall with another model and with full-text search. Stay tuned.

