DEV Community: Olivier Bourgeois

Seamless scaling with VPA In-place Pod Resize on GKE

Olivier Bourgeois — Thu, 04 Jun 2026 18:19:03 +0000

Right-sizing Kubernetes workloads is a common platform engineering challenge. Set your requests too high, and you burn cloud budgets on idle capacity; set your limits too low, and your applications face throttling or dreaded OOMKills.

For years, the Vertical Pod Autoscaler (VPA) has been the standard answer to this problem, automatically adjusting CPU and memory requirements based on actual usage. However, this method of scaling came with a significant catch that prevented widespread adoption for critical workloads: applying new resource parameters required evicting and restarting the pod.

This disruption was often unacceptable for stateful applications, long-running connections, or latency-sensitive services.

Introducing In-place Pod Resize (IPPR) on GKE

In-place Pod Resize (IPPR) changes the game by allowing Kubernetes to modify resource requests and limits on live, running containers directly through the underlying container runtime, without triggering a restart.

By combining the intelligence of VPA with the non-disruptive nature of IPPR, GKE users finally have a viable path to dynamic, seamless, and automated right-sizing.

Note: As of writing, VPA IPPR is in Preview on GKE. While it is a massive step forward, I recommend evaluating it in staging environments before rolling it out to production workloads.

Getting started with IPPR

To use In-place Pod Resize, you need a GKE cluster running version 1.34.0-gke.2201000 or later.

GKE Autopilot: VPA is enabled by default.
GKE Standard: Requires the Vertical Pod Autoscaling feature to be enabled.

1. Enable the feature

If you aren't using Autopilot, ensure your cluster is created or updated with the necessary feature flags:

gcloud container clusters create CLUSTER_NAME \
  --project=PROJECT_ID \
  --location=us-east1 \
  --release-channel=rapid \
  --enable-vertical-pod-autoscaling

2. Define your VPA object

Create a VerticalPodAutoscaler resource targeting your Deployment or StatefulSet. The crucial element here is setting spec.updatePolicy.updateMode to InPlaceOrRecreate.

apiVersion: "autoscaling.k8s.io/v1"
kind: "VerticalPodAutoscaler"
metadata:
  name: "my-vpa"
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: "Deployment"
    name: "my-deployment"
  updatePolicy:
    updateMode: "InPlaceOrRecreate"

3. Watch it scale

Apply the resource to your cluster and monitor your application under load. Instead of watching Pods terminate and recreate, you can watch the resources modify live using kubectl describe.

kubectl describe pod POD_NAME

Look for the AllocatedResources field or check the events section. You will see the requests change in real-time to match the VPA recommendations, while the Restart Count remains exactly the same.

The "Or Recreate" Fallback: Keep in mind that physics still apply. If VPA recommends a resource size that exceeds the remaining capacity of the Node your Pod is currently running on, an in-place resize is impossible. In this scenario, VPA will fall back to evicting and recreating the Pod so it can be scheduled onto a larger or emptier Node.

Ready to dive deeper?

While this introduction covers the basics of IPPR, right-sizing is just one part of a robust scaling strategy. Implementing VPA often goes hand-in-hand with horizontal scaling and cluster autoscaling. Check out the guide to master scaling on GKE: Run full-stack workloads at scale on GKE.

Surviving the eviction: How to build interrupt-resilient AI workloads on GKE

Olivier Bourgeois — Tue, 02 Jun 2026 20:02:20 +0000

You did everything right. You containerized your massive model training job, deployed it to Google Kubernetes Engine (GKE), and cleverly routed it to a Spot VM node pool to save up to 90% on compute costs.

Everything is humming along perfectly for 38 hours. Then, a priority on-demand customer needs capacity, Google Cloud reclaims your underlying Spot VM, and your node vanishes.

Whether you are using preemptible Spot VMs to save money, or leveraging the Dynamic Workload Scheduler (DWS) to queue for scarce GPUs, you are building on top of ephemeral compute. The hardware will eventually be taken away. To successfully run critical AI workloads on un-committed capacity, your application architecture must assume failure is a given.

Here is a practical guide to building interruptible workloads on GKE.

1. Trap the warning

When Google Cloud reclaims a Spot VM, it doesn't just pull the power cord immediately. It sends an ACPI signal to the underlying node to begin a power off cycle. Kubernetes intercepts this and translates it into a SIGTERM signal sent directly to your running containers.

You have a grace period (up to 15 seconds for non-system pods) between that SIGTERM and the fatal SIGKILL.

Your application must explicitly listen for this signal. When caught, your code should immediately stop accepting new batches, finish its current loop, flush any in-memory data to disk, and exit with a 0 (success) status.

Here is a simple example on how to catch this signal in Python:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("Received SIGTERM. Initiating graceful shutdown...")
    # 1. Stop processing new data
    # 2. Flush memory to persistent storage
    # 3. Save final checkpoint
    print("State saved. Exiting cleanly.")
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Your main training loop
print("Starting training loop...")
while True:
    # Train model...
    time.sleep(1)

2. Externalize your checkpoints

If your container dies, everything inside its local filesystem dies with it. To survive an interruption, you must periodically save your progress (model weights, optimizer states, epoch counters, etc.) to an external storage location.

Cloud Storage (GCS) is a common solution for this on Google Cloud.

Save frequently: Decide on a checkpointing interval that balances the cost of lost work against the overhead of writing to storage. Saving every epoch or every few thousand steps is common, but this can vary based on your needs.
Keep it local: Ensure your GCS buckets are in the same region as your GKE cluster (e.g., us-central1) to minimize latency and avoid outbound data transfer fees.
Resume, don't restart: The first thing your container's startup script should do is to check for that GCS bucket. If a checkpoint exists in the bucket, load it and resume from that exact step.

3. Design for Idempotency

"Idempotency" is a fancy way of saying that doing something twice yields the same result as doing it once.

Imagine a batch inference job that reads an image, processes it, and writes the result to a database. If your pod is preempted milliseconds after writing to the database but before it can mark the task as complete, the rescheduled pod will likely process that image again.

If your database blindly inserts new rows, you now have unintentional, duplicate data.

To build an idempotent pipeline:

Use UPSERT (update or insert) operations in your database based on a unique identifier (like an image ID).
Check if a record already exists before spending expensive GPU cycles processing it.

4. Decouple work queues for batch processing

If you are running a massive batch processing or inference job across thousands of files, do not write a monolithic Python script that iterates through a static CSV list. If the node dies at row 5,000, managing the state of where to restart is a nightmare.

Instead, decouple the workload:

Publish the work: Break your dataset down into discrete messages and push them into a message broker like Pub/Sub.
Pull the work: Have your Spot VM worker pods pull messages off the queue one by one or as a small chunk (e.g. 10 at a time).
Acknowledge completion: Only send an "ACK" (acknowledgment) back to Pub/Sub once the result is safely stored.

If a Spot node is preempted mid-inference, the worker dies before sending the ACK. After a brief timeout, Pub/Sub will automatically make that specific message available again. Another surviving worker pod will pick it up seamlessly. No data lost, no manual intervention required.

Key takeaways

Running on ephemeral compute like Spot VMs isn't just an infrastructure choice; it is a design choice. By handling termination signals, checkpointing aggressively to GCS, ensuring idempotent operations, and decoupling your queues, you can unlock massive cost savings and tap into scarce GPU pools without sacrificing reliability.

Strategies for running AI workloads on GKE without committed quota

Olivier Bourgeois — Mon, 01 Jun 2026 18:54:52 +0000

You’ve built your model, your training code is containerized, and you’re ready to scale up on Google Kubernetes Engine (GKE). You go to provision your nvidia-h100-80gb node pool and... QUOTA_EXCEEDED.

It’s one of the most common (and frustrating) roadblocks in modern AI development. High-end accelerators like H100s, A100s, and TPUs are in massive demand, and securing permanent, on-demand quota for them can be difficult. But a lack of on-demand quota doesn't mean you're out of options.

GKE provides two powerful, cost-effective strategies for acquiring these scarce resources when you can't get standard, on-demand instances: Spot VMs and the Dynamic Workload Scheduler (DWS).

Let's break down what they are, when to use each, and how to implement them.

Strategy 1: Spot VMs

Spot VMs are Google Cloud's excess compute capacity sold at a massive discount, up to 90% off the price of standard on-demand VMs. They are perfect for workloads that can be interrupted.

The catch is that Spot VMs have no availability guarantee. Google Cloud can "preempt" (i.e., terminate) them at any time if that capacity is needed for on-demand customers. GKE gets a 30-second warning before the node is terminated. Kubernetes uses this window to gracefully shut down your application (giving non-system pods up to 15 seconds to wrap up) before the node vanishes.

When to use Spot VMs for accelerators

Spot VMs are ideal for workloads that are:

Fault-tolerant and stateless: Your application can handle a node vanishing and having its pods rescheduled elsewhere.
Batch processing: Jobs that can be easily restarted or have checkpointing built-in.
CI/CD pipelines: Running tests or builds that don't need 100% uptime.

How to use Spot VMs in GKE

You can easily add a Spot VM node pool to your GKE Standard cluster. The key is to use Spot VMs for your workers, not your critical system pods.

Create a dedicated Spot VM node pool:
When creating a node pool, simply add the --spot flag and apply a taint so standard pods don't accidentally schedule there.

gcloud container node-pools create spot-gpu-pool \
  --cluster=my-cluster \
  --region=northamerica-northeast2 \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --spot \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule

Add the toleration to your workload's YAML:

# You want to "tolerate" that taint only on the specific workloads you want to run there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-batch-job
spec:
  template:
    # ... other specs
    spec:
      tolerations:
      - key: "cloud.google.com/gke-spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

This architecture ensures your critical components stay on reliable on-demand nodes, while your interruptible training jobs run on the preemptible Spot nodes. (Note: If you are using GKE Autopilot, you simply request a Spot class in your pod spec and GKE handles the taints and nodes automatically!)

Strategy 2: Dynamic Workload Scheduler (DWS) with flex-start

What if your job can't be interrupted? Many large-scale training jobs can take days. While they might have checkpointing, restarting from scratch every few hours due to Spot preemptions is inefficient and costly.

This is where Dynamic Workload Scheduler (DWS) comes in.

DWS is a feature designed specifically for acquiring large amounts of scarce resources (like GPUs and TPUs) for batch workloads. It changes the request from "Give me this GPU right now" to "Give me this GPU when it becomes available."

The catch here is that your job doesn't start immediately. It enters a queue and might wait for minutes, hours, or even days for the resources to be provisioned.

There are a few massive upsides:

It's a "get-in-line" system: Instead of you writing a script to retry the gcloud command every 5 minutes, DWS queues your request and provisions the nodes automatically when capacity is found.
No preemptions: Once your DWS nodes are provisioned, they are yours for the entire duration of your job (up to seven days). They are not Spot VMs and will not be preempted.
Cost savings: DWS workloads are also offered at a significant discount (up to 53% for L4 GPUs) compared to on-demand instances.

When to use DWS

DWS is perfect for:

Model training or reinforcement learning (RL): Jobs that need to run uninterrupted for many hours or days.
Batch inference: Running a large inference job on a massive dataset.
Any workload that is not time-sensitive to start, but is sensitive to interruptions.

How to use DWS with flex-start

The flex-start mode in DWS is what enables this "wait-in-queue" behavior. If you are using a GKE Autopilot cluster (or a Standard cluster with Node Auto-provisioning enabled), implementing this is incredibly simple.

You do not need to create complex custom resources; you simply signal your intent via a nodeSelector in your standard Kubernetes Job object.

Request flex-start in your Job:
In your Job.yaml, add the cloud.google.com/gke-flex-start: "true" node selector alongside your accelerator request.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-training-job
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      containers:
      - name: my-trainer
        image: "gcr.io/my-project/my-training-image"
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

When you apply this Job, GKE sees the flex-start selector. It puts the Job's Pods into a Pending state until the DWS queueing system can provision the requested A100 node. Once the node is ready, the Pod is scheduled, your job runs to completion without interruption, and the node is automatically deprovisioned.

Which strategy should you choose?

Here's a simple cheat sheet:

Feature	Spot VMs	DWS with flex-start
Best for	Fault-tolerant, interruptible workloads	Long-running, uninterruptible batch jobs
Primary trade-off	Starts fast, can be preempted at any time	Can wait hours/days to start
Cost savings	Up to 90%	Up to 50%
GKE mode	Standard or Autopilot	Standard or Autopilot
Implementation	--spot flag in a Node Pool	cloud.google.com/gke-flex-start nodeSelector

By mastering both Spot VMs and the Dynamic Workload Scheduler, you can build a resilient and cost-effective AI platform on GKE, even when on-demand accelerator quota seems impossible to find.

Hands-on with Gemma 3 on Google Cloud

Olivier Bourgeois — Fri, 05 Dec 2025 16:31:49 +0000

The landscape of generative AI is shifting. While proprietary APIs are powerful, there is a growing demand for open models—models where the architecture and weights are publicly available. This shift puts control back in the hands of developers, offering transparency, data privacy, and the ability to fine-tune for specific use cases.

To help you navigate this landscape, we are releasing two new hands-on labs featuring Gemma 3, Google’s latest family of lightweight, state-of-the-art open models.

Why Gemma?

Built from the same research and technology as Gemini, Gemma models are designed for responsible AI development. Gemma 3 is particularly exciting because it offers multimodal capabilities (text and image) and fits efficiently on smaller hardware footprints while delivering massive performance.

But running a model on your laptop is very different from running it in production. You need scale, reliability, and hardware acceleration (GPUs). The question is: Where should you deploy?

We have prepared two different paths for you, depending on your infrastructure needs: Cloud Run or Google Kubernetes Engine (GKE).

Path 1: The Serverless Approach (Cloud Run)

Best for: Developers who want an API up and running instantly without managing infrastructure, scaling to zero when not in use.

If your priority is simplicity and cost-efficiency for stateless workloads, Cloud Run is your answer. It abstracts away the server management entirely. With the recent addition of GPU support on Cloud Run, you can now serve modern LLMs without provisioning a cluster.

Start the lab!

Lab: Serving Gemma 3 with vLLM on Cloud Run

Objectives:

Containerize vLLM (a high-throughput serving engine).
Deploy Gemma 3 to Cloud Run.
Leverage GPU acceleration for fast inference.
Expose an OpenAI-compatible API endpoint.

Path 2: The Platform Approach (GKE)

Best for: Engineering teams building complex AI platforms, requiring high throughput, custom orchestration, or integration with a broader microservices ecosystem.

When your application graduates from a prototype to a high-traffic production system, you need the control of Kubernetes. GKE Autopilot gives you that power while still handling the heavy lifting of node management. This path creates a seamless journey from local testing to cloud production.

Start the lab!

Lab: Deploying Open Models on GKE

In this lab, you will learn how to:

Prototype locally using Ollama.
Containerize your setup and transition to GKE Autopilot.
Deploy a scalable inference service using standard Kubernetes manifests.
Manage resources effectively for production workloads.

Which Path Will You Choose?

Whether you are looking for the serverless simplicity of Cloud Run or the robust orchestration of GKE, Google Cloud provides the tools to take Gemma 3 from a concept to a deployed application.

Dive into the labs today and start building:

Share your progress and connect with others on the journey using the hashtag #ProductionReadyAI. Happy learning!

These labs are part of the Open Models module in our official Production-Ready AI with Google Cloud program. Explore the full curriculum for more content that will help you bridge the gap from a promising prototype to a production-grade AI application.

Observability in Action: A Google Cloud Next demo

Olivier Bourgeois — Mon, 05 May 2025 18:17:47 +0000

It was only a few weeks ago that over 32,000 cloud practitioners from all over the world came together in Las Vegas to attend Google Cloud Next 2025. Beyond the keynotes, the workshops, and the multiple jam-packed tracks of talks and sessions, an entire expo hall offered attendees the opportunity to observe or play around with more than 500 live demos. Let’s check out one of these demos!

Overview of the demo

The main goals of the Observability in Action demo were twofold. We wanted to showcase various ways of interacting with metrics and logs. And we wanted to give attendees a little bit of an interactive experience. For the interactive part of the demo, we utilized various oversized physical buttons and pedals that could be used to select answers or confirm inputs.

The flow of the demo was as followed:

We ask the attendee to type in a prompt that they wanted sent to an AI model.
The prompt was sent in the background to three different models: Gemma 3 on Cloud Run, Gemini 2.0 Flash on Vertex AI, and Gemini 2.0 Flash-Lite on Vertex AI. This generated logs and metrics.
The attendee was then given a short quiz about these three models. Each quiz input also generated logs and metrics.
At the end of the quiz, we give the attendee a rundown of their answers, and then flip over to the Google Cloud Console.
In Cloud Monitoring, we showcase the various native metrics that Cloud Run offers, custom metrics implemented using OpenTelemetry, as well as the Cloud Trace functionality.
Finally, we turn to BigQuery to showcase how we can mirror logs to a database for further analysis using Jupyter Notebooks.

Architecture

While the demo frontend runs locally, the backend is deployed as a Cloud Run instance. This instance is then talking to Gemini through the Vertex AI SDK and to Gemma through its own Cloud Run instance. The persistent state of the demo resides in a Firestore database. All Cloud Run logs are mirrored to BigQuery using a simple sink.

Visualizing metrics using Cloud Monitoring

Cloud Monitoring provides visibility into the performance and health of your cloud applications and infrastructure. It collects metrics, events, and metadata from Google Cloud services and other sources, allowing you to visualize this data on dashboards and create alerts for critical issues. This is useful for proactively identifying and resolving problems, optimizing resource utilization, improving uptime, and understanding system behavior, ultimately leading to more reliable and cost-effective applications.

For services like Cloud Run which we’re using for the backend of this demo, Cloud Monitoring automatically collects a wide array of native metrics without any setup needed. This includes data points such as request latency, count, container CPU and memory usage, and instance counts. This out-of-the-box integration means developers get immediate insights into their serverless application's performance and resource consumption, simplifying troubleshooting and optimization efforts.

Cloud Trace is a distributed tracing system within Google Cloud that helps you understand request latency across your application and its services. It tracks how long different parts of your application take to process requests, visualizing the entire request flow. This is particularly valuable for identifying performance bottlenecks in microservices architectures by showing where time is spent during a request's lifecycle.

Here’s a real life example: In this demo we send a prompt to multiple models. We were sure we implemented concurrency correctly (so the calls to the three different models should’ve happened in parallel) yet the latency seems significantly higher than expected. When we dug into the trace of a call, we quickly realized that we were accidentally making those calls sequentially! These traces were made available to us via an OpenTelemetry instrumentation we added to our code.

Interact with your logs with BigQuery

BigQuery is a serverless enterprise data warehouse that enables super-fast SQL queries on large datasets without infrastructure management. It's built for scalable analytics, supports diverse data types, and integrates machine learning, offering a powerful platform for insights from real-time and historical data.

With a simple sink, you can directly stream logs from Cloud Logging into BigQuery, transforming it into a powerful, long-term log analytics platform. This allows you to run complex SQL queries across extensive historical log data, which is invaluable for in-depth security audits, compliance, and identifying subtle operational trends.

Connecting BigQuery to Jupyter Notebooks further enhances log analysis capabilities. This empowers users to leverage Python and data science libraries for advanced data exploration, custom visualizations, and machine learning on log data, facilitating deeper insights and shareable, interactive analysis beyond standard logging tools.

For this demo, we built a Jupyter Notebook that did analysis on the various interactive quiz events, cross-referenced answers with an external Firestore database, and built tables and charts of the resulting data.

Try it out!

Want to try this demo from home? The source code is available on GitHub.

Want to learn more about observability on Google Cloud? Check out these resources:

Streamline your LangChain deployments with LangServe

Olivier Bourgeois — Fri, 28 Feb 2025 18:31:31 +0000

Throughout this LangChain series, we've explored the power and flexibility of LangChain, from deploying it on Google Kubernetes Engine (GKE) with Gemini to running open models like Gemma. Now, let's introduce an interesting complement to help us deploy LangChain-powered applications as a REST API: LangServe.

What is LangServe?

LangServe is a helpful tool designed to simplify the deployment of LangChain applications as REST APIs. Instead of having to manually take care of the REST logic for your LLM deployment (like exposing endpoints or serving API documentation) we can get LangServe to do that for us. It's built by the same team behind LangChain, ensuring seamless integration and a developer-friendly experience.

Why use LangServe?

In the previous parts of this LangChain series, we've seen how to deploy a LangChain-powered application and how to talk to it. Isn't that enough? Well, LangServe offers several key advantages:

Rapid deployment: LangServe drastically reduces the amount of boilerplate code needed to expose your LangChain applications as APIs.
Automatic API documentation: LangServe automatically generates interactive API documentation for your deployed chains, making it easy for others (or your future self, if you're like me) to understand and use your services.
Built-in playground: LangServe provides a simple web playground for interacting with your deployed LangChain applications directly from your browser. This is incredibly helpful for testing and debugging.
Standardized interface: LangServe helps you create consistent, well-structured APIs for your LangChain applications, making them easier to integrate with other services and front-end applications.
Simplified client interaction: LangServe comes with a corresponding client library that simplifies calling your deployed chains from other Python or JavaScript applications.

How does LangServe work?

LangServe leverages the power of FastAPI and pydantic to create a robust and efficient serving layer for your LangChain applications. It essentially wraps your LangChain chains or agents, turning them into FastAPI endpoints.

Let's look at an example and see how that all comes together.

Building a LangServe application

Let's say you have the following LangChain application that uses Gemini:

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm

Here's how you would adapt it for LangServe, which you can save as app.py:

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langserve import add_routes
import uvicorn
from fastapi import FastAPI

app = FastAPI(
  title="LangChain Server",
  version="1.0",
  description="A simple API server using LangChain's Runnable interfaces",
)

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm | StrOutputParser()

add_routes(
    app,
    chain,
    path="/my-chain",
)

if __name__ == "__main__":
    uvicorn.run(app, host="localhost", port=8000)

Then, create a requirements.txt file with our dependencies:

langserve
langchain-google-genai
uvicorn
fastapi
sse_starlette

And that's it! With these simple changes, your chain is now ready to be served. You can install dependencies and run this application using the following commands. Make sure to replace the your_google_api_key string with your Gemini API key.

export GOOGLE_API_KEY="your_google_api_key"
pip install -r requirements.txt
python app.py

This will start a server, by default on port 8000.

Interacting with your LangServe application

Once your server is running, you can interact with it in several ways:

Through the automatically generated API docs: Navigate to http://localhost:8000/docs in your browser to see the interactive API documentation.
Using the built-in playground: Go to http://localhost:8000/my-chain/playground/ to try out your chain directly in a simple web interface.
Using the LangServe client: You can use the provided client library to interact with your API programmatically from other Python or JavaScript applications. Here's a simple Python example:

from langserve import RemoteRunnable

remote_chain = RemoteRunnable("http://localhost:8000/my-chain")
response = remote_chain.invoke({"input": "Tell me about Google Cloud Platform"})
print(response)

Containerizing our application

You can also easily containerize your LangServe application to deploy on a platform like GKE, just like we did with our previous examples.

First, create a Dockerfile to define how to assemble our image:

# Use an official Python runtime as a parent image
FROM python:3-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run app.py when the container launches
CMD [ "python", "app.py" ]

Finally, build the container image and push it to Artifact Registry. Don't forget to replace PROJECT_ID with your Google Cloud project ID.

# Authenticate with Google Cloud
gcloud auth login

# Create the repository
gcloud artifacts repositories create images \
  --repository-format=docker \
  --location=us

# Configure authentication to the desired repository
gcloud auth configure-docker us-docker.pkg.dev/PROJECT_ID/images

# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .

# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1

After a handful of seconds, your container image should now be stored in your Artifact Registry repository.

Now, let's deploy this image to our GKE cluster. You can create a GKE cluster through the Google Cloud Console or using the gcloud command-line tool, again taking care of replacing PROJECT_ID:

gcloud container clusters create-auto langchain-cluster \
  --project=PROJECT_ID \
  --region=us-central1

Once your cluster is up and running, create a YAML file with your Kubernetes deployment and service manifests. Let's call it deployment.yaml, replacing PROJECT_ID as well as YOUR_GOOGLE_API_KEY with your Gemini API key:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3 # Scale as needed
  selector: # Add selector here
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
        env:
        - name: GOOGLE_API_KEY
          value: YOUR_GOOGLE_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer # Exposes the service externally

Apply the manifest to your cluster:

# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1

# Deploy the manifest
kubectl apply -f deployment.yaml

This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.

Conclusion

LangServe bridges the gap between development and production, making it easier than ever to share your AI applications with the world. By providing a simple, standardized way to serve your chains as APIs, LangServe unlocks a whole new level of accessibility and usability for your LangChain projects. Whether you're building internal tools or public-facing applications, LangServe streamlines the process, letting you focus on crafting impactful applications with LangChain.

Next Steps:

Dive into the LangServe documentation for a more in-depth look at its features and capabilities.
Experiment with deploying a LangServe application to GKE using the containerization techniques we've covered.
Explore the LangServe client library to see how you can easily integrate your deployed chains with other applications.

With this post, we conclude our journey through the world of LangChain, from its core concepts to advanced deployment strategies with GKE, open models, and now, streamlined serving with LangServe. I hope this series has empowered you to build and deploy your own amazing AI-powered applications!

Two years with Obsidian: How notes changed the way I store information

Olivier Bourgeois — Mon, 10 Feb 2025 15:38:21 +0000

I've been storing and keeping track of information in various ways for a long time. First using physical notes, then simple digital text files, and finally I jumped from app to app as I encountered issues that irritated me that I had no control over.

Near the tail end of 2022, I came across what I thought might be the answer to all my woes: a note-taking app built by a small team, with the name of Obsidian.

What's in an Obsidian?

Obsidian is a multi-platform note-taking and writing app. Simple enough. But aren't there plenty of those around? Yes absolutely, but they each have downsides that I wasn't able to settle with long-term. With Google Cloud it was the difficult of linking between notes (this has improved since but it's still not quite what I want). With Evernote my notes were in a proprietary format and stuck in the cloud. Notion also had the cloud-first problem and stored in an awkward non-standard Markdown format. And the list goes on.

Here's what Obsidian provides that sold it to me:

Local-first providing tangible .md files in a local directory structure I can interact with.
Portable notes in standard Markdown format allowing me to easily migrate to other platforms.
Links between notes gives the ability to quickly move to related notes using wiki-style links.
YAML frontmatter rendering key-value pairs of metadata for each note in a beautiful way.
Graphs and canvases allowing me to easily visualize notes and their connections.
Mobile friendly with support for all of the same features that the desktop version offers.
Native sync providing end-to-end encrypted sync and version control for a modest monthly fee.
Extensible with a broad catalog of community-created plugins and themes.

Why store information?

The way people interact with pieces of information is very personal and differ from person to person, but these are the main reasons I've been maintaining a repository of notes over the past decade or so:

Noting down information that I know I don't need in the short-term frees cognitive space to think about and remember other things.
Counter-intuitively, noting down information that I do need in the short-term helps me remember them better. The simple act of writing down reminders engraves them in my short-term memory.
Doing research in notes prevents me from doing duplicative research the next year or the next decade. At the very least, it gives me a foundation to work with instead of starting from scratch multiple times.
It's a sort of knowledge insurance for the future. I intend to live for at least a handful more decades, and that's plenty of time to forget things (either because of an illness, or simply because it's been so long).
If I were to author an auto-biography in my later years, all of the material would already be there, waiting to be pieced together in a coherent story.

So with that said, what are the kinds of notes that I have in Obsidian? Glad you asked! Here's a non-exhaustive list (in no particular order) of different note categories, with some examples for each:

Journaling & retrospectives ("Year 2024", "2025-02-09", ...)
Brainstorming ("3D printing ideas", "Photography ideas", ...)
Knowledge ("Interesting urbanism studies", "How to count things in Japanese", ...)
Personal ("5 years life plan & goals", "History of addresses lived at", ...)
Career ("Onboarding to a new job", "Employment history", ...)
Finances ("Tax return forms to expect", "TFSA contribution table", ...)
Health ("Eye exam & prescription history", "Family health history", ...)
Trips & events ("Pre-travel checklist", "List of flights taken", ...)
Media consumed & backlog ("Books I've read", "Christmas films I want to watch", ...)

Templates to reduce repetition

The first community-built plugin that I ended up trying out was QuickAdd. This plugin allows you to create custom commands in the command palette configured to duplicate a specific template note. This means that you could create for example a note called "New trip template" and configure a command called "Add new trip" which would duplicate that particular note and open it for you to fill out as desired.

In my Obsidian vault I've set-up many of these templates which both saves me a lot of time, and ensures consistency between notes of the same category / type. When I open the command palette and search for "QuickAdd", they all show up:

Let's say I'm going on a trip soon. I select the Add new trip command, enter a name ("Trip to the Land of OOO") and a note is automatically created, stored at the expected location, with the relevant template (both the YAML metadata and the Markdown note itself) ready for me to fill out!

Since templates mean I get to create a lot of notes really easily, I wanted to prevent an potential issue where my directories would be full of notes of all kinds mixed together. To solve this, I have the templating plugin set-up to place the notes in a relevant _items/ directory within the root-level category directory. This allows me to easily find the non-templated notes (in this case, something like "Packing list").

Scripting to leverage external metadata

One of the advantages of using a local-first notes app with an open portable format is that I can easily interact with the notes outside of the note taking app itself. This means that I can, among other things, build custom scripts or pipelines that can create or modify notes.

I currently do this for three types of notes:

Batch-converting Google Contacts metadata to people notes.
Generating concert notes from a setlist.fm URL which then auto-fills metadata like venue, tour name, and setlist.
Injecting metadata into media notes using public APIs like IGDB to auto-fill metadata like release date, synopsis, rating, and more.

Querying notes to render tables

Something that I missed after having used Notion for a few years was the ability to create rendered tables out of notes with custom columns, filters, and sorting. Obsidian doesn't have that built-in (though it is on the roadmap), but there is a community-built plugin called Dataview that offers most of what I was looking for.

Dataview works by parsing code blocks starting with

```dataview containing what they call Dataview Query Language (it's essentially SQL) and renders them based on that query. The query contains statements allowing you to do parsing, filtering, sorting, and grouping. It even has some limited support for expressions and function-calling.

I currently use Dataview for rendering tables of my media backlog, trips, and events.

Below you'll find an example of a Dataview table note I created and how it renders. The query essentially translates to: build a table with three columns (title, year, rating) made up of all notes of category "films" (excluding the template note), and sort by IMDb rating.



table without id
    string("[[" + file.path + "|" + title + "]]") as Title,
    year as Released,
    apiRating as "IMDb"
where
    contains(category,[[Films]]) and
    !contains(file.name,"template")
sort apirating desc

Journaling to clear my mind

I have a confession to make. Before 2024 I'd never try journaling. I decided to give it a try early last year and it's been useful so far! It helps me remember what I do on a day-to-day basis, track illnesses like the flu, and put nagging thoughts in order. On that first point, it's already helping me quickly answer questions like "when was the last time I chatted with so-and-so, and what did we talk about?" (the search and the backlink functionalities of Obsidian doing the heavy-lifting).

Since I was planning to do journaling every day, I wanted to make the process as streamlined and easy as possible for me, as to remove any cognitive friction that would push me towards skipping a day (or ten). This is the workflow I ended up building:

A template for the daily notes (_meta/templates/Daily template).
The built-in Daily notes plugin to manage and format daily notes (YYYY/MM-MMMM/YYYY-MM-DD-dddd)
The Calendar plugin to add a calendar in the sidebar that links to the relevant daily notes.

Takeaways

And now, two years with Obsidian, here are my takeaways:

The effort of migrating to yet another app was daunting, but now that all my notes are in an open format, I'm much less concerned about any hypothetical migration in the future (if this app were to cease development, for example).
It's not necessary to come up with all the notes you'll ever need right away. Managing personal notes is a marathon, not a sprint. In fact, it's better to wait until the time that you need a particular note created to create it (instead of trying to proactively come up with future use-cases that haven't come to pass yet).
Reinventing the wheel is not always the best use of time. It's worth looking if someone else built a similar pipeline, plugin, or system that gets you closer to what you want to achieve.
Keeping up with challenges (like journaling) is much easier if you reduce the friction necessary to complete these challenges. Make it so easy that skipping a day would sound silly.

I even have a small backlog of improvement ideas for the future:

Play around with different themes and styles (I'm still using the default theme).
Build a sort of "CRM" (using that term very loosely) to help me maintain relationships better.
Look into plugins to do task management and habit tracking.
Build a small script that could pull weather data into my daily notes.
Write periodic year and quarter retrospective notes (highlights, trips taken, people hung out with, etc.)

If you use Obsidian, or if you are thinking of giving it a try, I would love to hear how you approach note-taking!

Leverage open models like Gemma 2 on GKE with LangChain

Olivier Bourgeois — Thu, 06 Feb 2025 19:13:57 +0000

In my previous posts, we explored how LangChain simplifies AI application development and how to deploy Gemini-powered LangChain applications on GKE. Now, let's take a look at a slightly different approach: running your own instance of Gemma, Google's open large language model, directly within your GKE cluster and integrating it with LangChain.

Why choose Gemma on GKE?

While using an LLM endpoint like Gemini is convenient, running an open model like Gemma 2 on your GKE cluster can offer several advantages:

Control: You have complete control over the model, its resources, and its scaling. This is particularly important for applications with strict performance or security requirements.
Customization: You can fine-tune the model on your own datasets to optimize it for specific tasks or domains.
Cost optimization: For high-volume usage, running your own instance can potentially be more cost-effective than using the API.
Data locality: Keep your data and model within your controlled environment, which can be crucial for compliance and privacy.
Experimentation: You can experiment with the latest research and techniques without being limited by the API's features.

Deploying Gemma on GKE

Deploying Gemma on GKE involves several steps, from setting up your GKE cluster to configuring LangChain to use your Gemma instance as its LLM.

Set up credentials

To be able to use the Gemma 2 model, you first need a Hugging Face account. Start by creating one if you don't already have one, and create a token key with read permissions from your settings page. Make sure to note down the token value, which we'll need in a bit.

Then, go to the model consent page to accept the terms and conditions of using the Gemma 2 model. Once that is done, we're ready to deploy our open model.

Set up your GKE Cluster

If you don't already have a GKE cluster, you can create one through the Google Cloud Console or using the gcloud command-line tool. Make sure to choose a machine type with sufficient resources to run Gemma, such as the g2-standard family which includes an attached NVIDIA L4 GPU. To simplify this, we can simply create a GKE Autopilot cluster.

gcloud container clusters create-auto langchain-cluster \
  --project=PROJECT_ID \
  --region=us-central1

Deploy a Gemma 2 instance

For this example we'll be deploying an instruction-tuned instance of Gemma 2 using a vLLM image. The following manifest describes a deployment and corresponding service for the gemma-2-2b-it model. Replace HUGGINGFACE_TOKEN with the token you generated earlier.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250114_0916_RC00_maas
        resources:
          requests:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
        args:
        - python
        - -m
        - vllm.entrypoints.api_server
        - --host=0.0.0.0
        - --port=8000
        - --model=google/gemma-2-2b-it
        - --tensor-parallel-size=1
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --enable-chunked-prefill
        - --disable-log-stats
        env:
        - name: MODEL_ID
          value: google/gemma-2-2b-it
        - name: DEPLOY_SOURCE
          value: "UI_NATIVE_MODEL"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: HUGGINGFACE_TOKEN

Save this to a file called gemma-2-deployment.yaml, then deploy it to your cluster:

kubectl apply -f gemma-2-deployment.yaml

Deploying LangChain on GKE

Now that we have our GKE cluster and Gemma deployed, we need to create our LangChain application and deploy it. If you've followed my previous post, you'll notice that these steps are very similar. The main differences are that we're pointing LangChain to Gemma instead of Gemini, and that our LangChain application uses a custom LLM class to ingest our local instance of Gemma.

Containerize your LangChain application

First, we need to package our LangChain application into a Docker container. This involves creating a Dockerfile that specifies the environment and dependencies for our application. Here is a Python application using LangChain and Gemma, which we'll save as app.py:

from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import ChatPromptTemplate
from typing import Any, Optional
from flask import Flask, request
import requests

class VLLMServerLLM(LLM):
    vllm_url: str
    model: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 2048

    @property
    def _llm_type(self) -> str:
        return "vllm_server"

    def _call(
        self,
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        headers = {"Content-Type": "application/json"}
        payload = {
            "prompt": prompt,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            **kwargs
        }

        if self.model:
          payload["model"] = self.model

        try:
            response = requests.post(self.vllm_url, headers=headers, json=payload, timeout=120)
            response.raise_for_status()
            json_response = response.json()

            if isinstance(json_response, dict) and "predictions" in json_response:
              text = json_response["predictions"][0]
            else:
              raise ValueError(f"Unexpected response format from vLLM server: {json_response}")

            return text

        except requests.exceptions.RequestException as e:
            raise ValueError(f"Error communicating with vLLM server: {e}")
        except (KeyError, TypeError) as e:
          raise ValueError(f"Error parsing vLLM server response: {e}. Response was: {json_response}")

llm = VLLMServerLLM(vllm_url="http://llm-service:8000/generate", temperature=0.7, max_tokens=512)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm

def create_app():
    app = Flask(__name__)

    @app.route("/ask", methods=['POST'])
    def talkToGemini():
        user_input = request.json['input']
        response = chain.invoke({"input": user_input})
        return response

    return app

if __name__ == "__main__":
    app = create_app()
    app.run(host='0.0.0.0', port=80)

Then, create a Dockerfile to define how to assemble our image:

# Use an official Python runtime as a parent image
FROM python:3-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run app.py when the container launches
CMD [ "python", "app.py" ]

For our dependencies, create the requirements.txt file containing LangChain and a web framework, Flask:

langchain
flask

Finally, build the container image and push it to Artifact Registry. Don't forget to replace PROJECT_ID with your Google Cloud project ID.

# Authenticate with Google Cloud
gcloud auth login

# Create the repository
gcloud artifacts repositories create images \
  --repository-format=docker \
  --location=us

# Configure authentication to the desired repository
gcloud auth configure-docker us-docker.pkg.dev/PROJECT_ID/images

# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .

# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1

After a handful of seconds, your container image should now be stored in your Artifact Registry repository.

Deploy to GKE

Create a YAML file with your Kubernetes deployment and service manifests. Let's call it deployment.yaml, replacing PROJECT_ID.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3 # Scale as needed
  selector: # Add selector here
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer # Exposes the service externally

Apply the manifest to your cluster:

# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1

# Deploy the manifest
kubectl apply -f deployment.yaml

This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.

Interact with your deployed application

Once the service is deployed, you can get the external IP address of your application using:

export EXTERNAL_IP=`kubectl get service/langchain-service \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`

You can now send requests to your LangChain application running on GKE. For example:

curl -X POST -H "Content-Type: application/json" \
  -d '{"input": "Tell me a fun fact about hummingbirds"}' \
  http://$EXTERNAL_IP/ask

Considerations and enhancements

Scaling: You can scale your Gemma deployment independently of your LangChain application based on the load generated by the model.
Monitoring: Use Cloud Monitoring and Cloud Logging to track the performance of both Gemma and your LangChain application. Look for error rates, latency, and resource utilization.
Fine-tuning: Consider fine-tuning Gemma on your own dataset to improve its performance on your specific use case.
Security: Implement appropriate security measures, such as network policies and authentication, to protect your Gemma instance.

Conclusion

Deploying Gemma on GKE and integrating it with LangChain provides a powerful and flexible way to build AI-powered applications. You gain fine-grained control over your model and infrastructure while still leveraging the developer-friendly features of LangChain. This approach allows you to tailor your setup to your specific needs, whether it's optimizing for performance, cost, or control.

Next steps:

Explore the Gemma documentation for more details on the model and its capabilities.
Check out the LangChain documentation for advanced use cases and integrations.
Dive deeper into GKE documentation for running production workloads.

In the next post, we will take a look at how to streamline LangChain deployments using LangServe.

Deploy Gemini-powered LangChain applications on GKE

Olivier Bourgeois — Tue, 28 Jan 2025 19:38:15 +0000

In my previous post, we explored how LangChain simplifies the development of AI-powered applications. We saw how its modularity, flexibility, and extensibility make it a powerful tool for working with large language models (LLMs) like Gemini. Now, let's take it a step further and see how we can deploy and scale our LangChain applications using the robust infrastructure of Google Kubernetes Engine (GKE) and the power of Gemini!

Why GKE for LangChain?

You might be wondering, "Why bother with Kubernetes? Isn't it complex?" While Kubernetes does have a learning curve (trust me, I've been through that!) GKE simplifies its management significantly by handling the heavy lifting for you so you can focus on your application.

Here's why GKE is an excellent choice for deploying LangChain applications:

Scalability: GKE allows you to easily scale your application up or down based on demand. This is crucial for handling fluctuating traffic to your AI-powered features. Imagine your chatbot suddenly going viral - GKE ensures it doesn't crash under the load.
Reliability: With GKE, your application runs on a cluster of machines, providing high availability and fault tolerance. If one machine fails, your application keeps running seamlessly.
Resource efficiency: GKE optimizes resource utilization, ensuring your application uses only what it needs. This can lead to cost savings, especially when dealing with resource-intensive LLMs.
Seamless integration with Google Cloud: GKE integrates smoothly with other Google Cloud services like Cloud Storage, Cloud SQL, and, importantly, Vertex AI, where Gemini and other LLMs are hosted.
Versioning and rollbacks: GKE allows you to easily manage different versions of your application, making updates and rollbacks a breeze. This is incredibly useful when experimenting with different prompts or model parameters.

But that's enough talking, let's build something!

Deploying LangChain on GKE

Let's walk through an example of deploying a simple LangChain application that uses Gemini on GKE. We'll build a basic service, similar to the example from the previous post, but this time, it will be packaged as a containerized application ready for deployment.

Containerize your LangChain application

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from flask import Flask, request

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm

def create_app():
    app = Flask(__name__)

    @app.route("/ask", methods=['POST'])
    def talkToGemini():
        user_input = request.json['input']
        response = chain.invoke({"input": user_input})
        return response.content

    return app

if __name__ == "__main__":
    app = create_app()
    app.run(host='0.0.0.0', port=80)

Then, create a Dockerfile to define how to assemble our image:

# Use an official Python runtime as a parent image
FROM python:3-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run app.py when the container launches
CMD [ "python", "app.py" ]

For our dependencies, create the requirements.txt file containing LangChain and a web framework, Flask:

langchain
langchain-google-genai
flask

Finally, build the container image and push it to Artifact Registry. Don't forget to replace PROJECT_ID with your Google Cloud project ID.

# Authenticate with Google Cloud
gcloud auth login

# Create the repository
gcloud artifacts repositories create images \
  --repository-format=docker \
  --location=us

# Configure authentication to the desired repository
gcloud auth configure-docker us-docker.pkg.dev/PROJECT_ID/images

# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .

# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1

After a handful of seconds, your container image should now be stored in your Artifact Registry repository.

Deploy to GKE

Now, let's deploy this image to our GKE cluster. You can create a GKE cluster through the Google Cloud Console or using the gcloud command-line tool, again taking care of replacing PROJECT_ID:

gcloud container clusters create-auto langchain-cluster \
  --project=PROJECT_ID \
  --region=us-central1

apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3 # Scale as needed
  selector: # Add selector here
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
        env:
        - name: GOOGLE_API_KEY
          value: YOUR_GOOGLE_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer # Exposes the service externally

Apply the manifest to your cluster:

# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1

# Deploy the manifest
kubectl apply -f deployment.yaml

This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.

Interact with your deployed application

Once the service is deployed, you can get the external IP address of your application using:

export EXTERNAL_IP=`kubectl get service/langchain-service \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`

You can now send requests to your LangChain application running on GKE. For example:

curl -X POST -H "Content-Type: application/json" \
  -d '{"input": "Tell me a fun fact about hummingbirds"}' \
  http://$EXTERNAL_IP/ask

Taking it further

This is just a basic example, but you can expand on it in many ways:

Integrate with other Google Cloud services: Use Cloud SQL to store conversation history, or Cloud Storage to load documents for your chatbot to reference.
Implement more complex LangChain flows: Build sophisticated applications with chains, agents, and memory, all running reliably on GKE.
Set up CI/CD: Automate the build and deployment process using tools like Cloud Build and Cloud Deploy.
Monitor and optimize: Use Cloud Monitoring and Cloud Logging to track the performance and health of your application.

Continue your journey

Deploying LangChain applications on GKE with Gemini unlocks a new level of scalability, reliability, and efficiency. You can now build and run powerful AI-powered applications that can handle real-world demands. By combining the developer-friendly nature of LangChain, the power of Gemini, and the robustness of GKE, you have all the tools you need to create truly impressive and impactful applications.

Next steps:

Dive deeper into the GKE documentation.
Explore the Vertex AI documentation for more advanced LLM management and deployment options.
Check out the LangChain documentation for more complex use cases and examples.

In a future post, I will look into using an open model called Gemma!

Simplify development of AI-powered applications with LangChain

Olivier Bourgeois — Tue, 01 Oct 2024 15:10:38 +0000

Large language models (LLMs) like Gemini can generate human-quality text, translate languages, and answer questions in an informative way. But writing applications that use these LLMs effectively can be tricky, and models all have their own distinct APIs and supported features. That’s where LangChain comes in.

What is LangChain?

LangChain is an open source framework designed to help developers build applications that use LLMs. It provides a standardized interface and set of tools for interacting with a variety of different LLMs, making it easier to incorporate them into your applications. Think of it like a universal adapter that lets you plug in any LLM and start using it with a consistent set of commands. This simplifies development by abstracting away the complexities of individual LLM APIs and allowing you to focus on building your application logic.

With LangChain, you can:

Use different language models by easily switching between multiple models without rewriting your application logic.
Connect to various data sources such as documents and databases to provide context and grounding to the LLM responses.
Create complex flows by chaining together multiple pre-built components.
Engage in dynamic conversations by building chatbots that can remember past interactions and user preferences.
Access and manage external knowledge by integrating with APIs and other sources of real-time information.

Why would you use LangChain?

Imagine you're building a chatbot that needs to answer questions based on your company's internal documents. You have to write custom code to load those documents, format them for the LLM, send the API request, parse the response, and potentially even handle errors. Now, imagine needing to do this across multiple projects with different LLMs and data sources. That's a lot of repetitive and complex code!

LangChain simplifies the development process of LLM-powered applications by abstracting away shared concepts between models, by providing:

Modularity by breaking down your application into reusable components.
Flexibility by being able to easily swap out models and components.
Extensibility by allowing you to customize and extend the framework based on your needs.

This allows you to focus on your application logic instead of reinventing the wheel.

How do you use LangChain?

Getting started is simple! LangChain is available as a package for both Python and JavaScript, and offers extensive documentation and resources. In addition, the LangChain developer community is vast and lots of bindings have been created for other languages, such as LangChain4j for Java. To see which LLM models (and related features) are supported by LangChain, you can take a look at the official tables for LLM models and for chat models.

Let’s take a look at how quickly you can get a LangChain application running in Python. In this example we’ll use the latest Gemini Pro model, but the steps are similar for any model you choose.

First, you install the required packages. In our case, the core LangChain package as well as the LangChain Google AI package.

pip install langchain langchain-google-genai

Then set your Gemini API key, which you can generate following these instructions.

export GOOGLE_API_KEY=your-api-key

And with only a few lines of code, you have a working Q&A application powered by both templating and chaining features!

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that creates poems in {input_language} containing {line_count} lines about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm

response = chain.invoke(
    {
        "input_language": "French",
        "line_count": "4",
        "input": "Google Cloud",
    }
)

print (response.content)

Try it out on Google Cloud!

Google Cloud is a great platform for developing and running enterprise-ready LangChain applications. With powerful compute resources, seamless integration with other Google Cloud services, and an extensive collection of pre-hosted LLMs to choose from in Model Garden on Vertex AI, you have everything you need to build and deploy your AI-powered applications.

Explore the following resources to get started:

In a later post, I will take a look at how you can use LangChain to connect to a local Gemma instance, all running in a Google Kubernetes Engine (GKE) cluster.