DEV Community: Remigiusz Samborski

Building an agentic PR reviewer with Antigravity SDK

Remigiusz Samborski — Thu, 18 Jun 2026 19:21:08 +0000

As announced in this blog post on June 18, 2026, Gemini CLI and Gemini Code Assist IDE extensions will stop serving requests for Google AI Pro and Ultra, as well as those using it free of charge using Gemini Code Assist for individuals. Google is unifying its AI terminal tools by transitioning the community-focused Gemini CLI into Antigravity CLI, a new agent-first platform built for complex, multi-agent workflows.

With this transition timeline in place, development teams relying on Gemini CLI for repository management and automated tasks must establish a migration path. In this post, I will show you how to transition seamlessly by building an automated "first-pass" pull request reviewer using the Google Antigravity SDK and the run-agy-sdk composite GitHub Action.

The orchestration tax

The approach I am proposing also solves another pressing issue for modern engineering teams: cognitive overload. As Addy Osmani recently pointed out, there is an orchestration tax to using AI for coding. The time developers save generating code is often pushed onto reviewers as large, complex PRs, causing context switching and cognitive fatigue.

By offloading the tedious "first pass" search to an Antigravity agent, human reviewers can mitigate this tax and focus on high-level architecture and safeguarding quality.

Why we need automated agentic code reviews

AI-generated code can be deceptively good. It is often clean, well-documented, and syntactically correct. This makes it harder for human reviewers to spot subtle logical bugs or security vulnerabilities that might not be immediately obvious.

In a large codebase, manually verifying every change is simply not feasible. This is why we need autonomous agents that can step into the codebase and analyze it from a fresh perspective.

But if a developer used an LLM to generate the code, how can we trust another AI to find the bugs? The answer lies in the agent architecture and context separation.

Developers might write code using any tool — whether it's CLI, an IDE extension, or various models like Gemini 3.5 Flash or Gemini 3.1 Pro. The reviewer, however, is a managed Antigravity Agent running via a separate SDK integration. This agent has a specialized, low-freedom persona and strict system instructions that force it to act as an adversarial code auditor rather than a developer. Furthermore, it operates in an isolated environment. Because it has a different system prompt, safety guardrails, and context boundaries, the agent reviews the changes with a completely fresh perspective, catching logical bugs and vulnerabilities that the original generator might miss.

To demonstrate it in practice I created an agentic review pipeline, which:

Leverages a managed Antigravity Agent configured via the SDK to review the code. The agent uses advanced reasoning to explore files and verify logic under strict guidelines.
Runs reviews inside isolated workspaces or sandboxes with custom policies to prevent shell or arbitrary code execution risks.
Enables the agent to use the GitHub MCP server to interact directly with the environment to write pull request comments and reviews.
Avoids using the synchronize trigger in pull request workflows to prevent redundant review runs and endless loops. Instead, runs reviews on opened and reopened events, and triggers subsequent passes manually by posting a @agy /review comment on the PR.

You can find the code at run-agy-sdk.

What is run-agy-sdk?

The run-agy-sdk is a composite GitHub Action that runs the Google Antigravity SDK (google-antigravity) directly on the GitHub Actions host runner.

Why run on the host instead of a container?

By running directly on the host, the Antigravity SDK has access to the host's Docker daemon. This allows the SDK to spawn Docker-based MCP servers (like the GitHub MCP server) to read files, run tests, and post reviews.

Sub-containers should ideally run with restricted network access and read-only filesystems where possible to prevent an LLM from being tricked into executing arbitrary destructive commands. The limited set of permissions is handled in the GitHub Action configuration (see here). Whereas the Antigravity agent has a limited number of tools it can use from GitHub MCP (see here).

Moreover the workflow is explicitly protected from running automatically on forks, preventing unauthorized code execution. The automated review job will only run if the pull request originates from the same repository (see here). On-demand reviews triggered by commenting @agy /review are restricted so that they can only be initiated by maintainers (see here).

Demonstration walkthrough

The demo below shows the action triggered by a new PR:

Implementation: How to install the action in your repo

Let's walk through the setup process step-by-step.

Step 1: Add your API key to GitHub secrets

The action requires a Google Gemini or Antigravity API key to authenticate language model interactions.

Generate your API key.
Navigate to your target GitHub repository and go to Settings > Secrets and variables > Actions.
Create a new Repository Secret named ANTIGRAVITY_API_KEY and paste your API key as the value.

Step 2: Configure the GitHub Actions workflow

Add a new file in your repository at .github/workflows/antigravity-review.yml and add the following configuration:

name: '🔎 Antigravity PR Review'

on:
  pull_request:
    types: [opened, reopened]
  workflow_dispatch:

concurrency:
  group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}'
  cancel-in-progress: true

jobs:
  antigravity-review:
    runs-on: 'ubuntu-latest'
    timeout-minutes: 20

    permissions:
      contents: 'read'
      pull-requests: 'write'
      issues: 'write'

    steps:
      - name: 'Checkout Repository'
        uses: 'actions/checkout@v6'
        with:
          persist-credentials: false

      - name: 'Run Antigravity PR Review'
        uses: 'rsamborski/run-agy-sdk@main'
        id: 'agy_pr_review'
        with:
          api-key: '${{ secrets.ANTIGRAVITY_API_KEY }}'
          github-token: '${{ secrets.GITHUB_TOKEN }}'
          mode: 'review'
          prompt: '/antigravity-review'
          trust-workspace: 'true'
          sandbox-profile: 'true'

Pro Tip: Pin the action version to a specific commit SHA (e.g., rsamborski/run-agy-sdk@<commit-sha>) rather than using @main. This prevents unexpected breaks from upstream updates.

While you can reference run-agy-sdk directly in your workflows, its real power lies in using it as a blueprint. I encourage you to fork the repository and use it as a template to build your own custom, agentic GitHub Actions. By modifying the safety policies, custom tools, or prompts in run_agent.py, you can tailor the agent's review behavior to your team's specific codebase, style guidelines, and compliance rules.

For a full workflow template supporting both automated PR reviews and comment-triggered reviews, refer to the workflows folder in the repository.

Conclusions

Automating code reviews is a necessity as AI-generated code volumes increase. By using run-agy-sdk, you can run the Antigravity SDK to review PRs automatically and shift more of the burden of code quality assurance away from human reviewers.

Access the full source code in the GitHub Repository.
Read the documentation to customize the prompts and mode.
Feel free to fork the repository and build your own automation.

Acknowledgments

This project was inspired by the run-gemini-cli action, while shifting to the recently released Antigravity SDK. It is a personal sample implementation of how to run the Antigravity SDK in a GitHub Action, and is not an officially supported Google product.

Let’s connect!

I’d love to hear how you’re using Antigravity for your agentic workflows. Are you building automated code review loops or keeping a tighter leash on your agents?

Connect with me on LinkedIn
Follow me on X
Catch me on Bluesky

How I built a supersonic AI riddling duel in under 20 Minutes (with zero manual coding)

Remigiusz Samborski — Thu, 28 May 2026 14:54:55 +0000

Can an AI developer agent build complex full-stack logic without constant hand-holding?

To test this, I set up the #NapkinChallenge: take a rough architectural sketch on a paper napkin and turn it into a fully functional web application in under 20 minutes.

The goal: Build "Blaine the Mono's Riddling Competition" — a game where you duel a psychotic, sentient supersonic monorail from Stephen King's The Dark Tower series. If you trip him up with three logic-defying riddles, you save the passengers. If his neural net solves them, you crash at Mach 4.

The result? The AI built the entire application with zero manual lines of code written by me.

Here is how the architecture works, how the backend uses the new @google/genai SDK, and how you can run the challenge yourself.

The Architecture (from sketch to code)

The napkin sketch outlined a straightforward but highly interactive setup:

Frontend: A UI built in Next.js (App Router).
Backend: A Next.js API route that forwards user riddles to the Gemini API.
The Engine: The gemini-3-flash-preview model, prompted to act as Blaine the Mono. It must analyze the riddle and return structured JSON containing its solution, status, and theatrical psychotic commentary.

Key Takeaways

Zero-Code Feasibility: Developer agents like Antigravity can now understand visual layouts, bootstrap templates, design customized styling systems, and integrate real-world API clients with robust error boundaries.
Styling is a First-Class Citizen: The agent did not just generate functional code; it constructed a bespoke styling sheet (src/app/globals.css) incorporating Orbitron typography, custom CSS scrollbars, CRT overlays, and reactive screen shakes using CSS keyframes.
Time to Gameplay: The entire application was built in under 20 minutes, including the time it took to generate the code, test it, and get it running.

Video

See the full video of the process:

Join the #NapkinChallenge 📝

Now it is your turn. Can you match the speed of autonomous generation?

Sketch your app concept, system architecture, or database model on a physical paper napkin.
Upload the photo to Antigravity, Gemini, or Google AI Studio.
Prompt your agent to build the code from the image alone.
Share a video or screenshot of your live app using #NapkinChallenge!

Let's see what you can build!

Building "Sweets Vault" - a multimodal Gemini Agent with physical hardware integration

Remigiusz Samborski — Fri, 15 May 2026 12:43:35 +0000

Motivating seven-year-olds to complete their daily reading and handwriting practice is a classic parenting challenge. Traditional rewards work for a while, but they lack interactivity and require constant manual verification.

As a developer, I like to solve such challenges with automation. After putting some thought into it, I came up with the Sweets Vault idea: an interactive agent powered by Google's Agent Development Kit (ADK) and the Gemini API. The system acts as a cheerful guardian that talks to children, visually inspects their workbooks via uploaded images, tests their reading comprehension, and triggers a hardware lock to open a drawer full of sweets upon successful completion.

In this guide, I will walk you through the architecture and implementation of this solution. You will learn how to:

Structure a multimodal agent using the Agent Development Kit (ADK).
Implement visual and verbal verification using Gemini's multimodal capabilities.
Manage state across multiple conversation turns and tools.
Connect agent tool calls to physical hardware interfaces.
Develop and run locally to access the physical hardware.

If you’d like to jump directly to the code visit the GitHub repository. All the code is available there for your exploration.

System architecture overview

The diagram below presents the high level architecture of the solution:

The core components include:

Gemini API: Handles reasoning, multimodal homework validation and tool calls.
ADK Agent & Tools: Encapsulates the system instructions, state management, and callable Python functions.
Hardware Interface: Translates tool execution into physical actions (unlocking specific drawer IDs).

The system is designed in such a way that the Agent runs on a local machine (I am using a mini PC with Ubuntu installed) to allow for direct hardware access:

Magnetic drawers controlled via FT232H USB to GPIO converter
LED Matrix controlled via REST API running on a Raspberry Pi

Initially, I planned to control the LED Matrix using a second FT232H controller, but due to lack of library support, I ended up using an intermediary Raspberry Pi. This approach has its benefits, for example the LED Matrix can be located anywhere at home within the Wifi range 😀

Root agent logic

To kick-start the agent development, I leveraged the agent-starter-pack templates. It provides a production-ready foundation with FastAPI, frontend UI integration, and built-in observability.

The heart of the Sweets Vault is located in agent/app/agent.py. I start by configuring the environment and initializing Gemini Enterprise Agent Platform (former Vertex AI). I also define the specific tasks required for our users (Mary and James):

load_dotenv()
project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
os.environ["GOOGLE_CLOUD_LOCATION"] = location
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"

# Initialize Vertex AI
vertexai.init(project=project_id, location=location)

As a native Polish speaker I want to have the ability for the Agent to work both in Polish (for the sake of my kids) and English (for demo purposes). This is handled by the AGENT_LANGUAGE variable:

AGENT_LANGUAGE = os.getenv("AGENT_LANGUAGE", "en")

The actual agent (root_agent) is created at the bottom of the same file:

root_agent = Agent(
    name="root_agent",
    model=Gemini(
        model="gemini-2.5-flash",
        retry_options=types.HttpRetryOptions(attempts=3),
    ),
    instruction=load_prompt_from_file(f"sweet-vault-agent-{AGENT_LANGUAGE}.txt"),
    tools=[get_progress, complete_task, unlock_drawer],
)

Note: The prompt is language specific and pulled from a file with a language suffix (en or pl).

Handling state

A common failure mode in conversational AI is lost context or hallucinated task completion. To prevent this, we implement strict state management using ToolContext.

Instead of relying on the model's memory, the agent reads and writes explicit completion flags to its session state:

def _get_task_status(user_key: str, task_id: str, tool_context: ToolContext) -> bool:
    """Retrieves the completion status for a specific task from the flat state."""
    state_key = f"user_tasks_{user_key}_{task_id}"
    return tool_context.state.get(state_key, False)


def _set_task_status(user_key: str, task_id: str, is_done: bool, tool_context: ToolContext):
    """Saves the completion status for a specific task and ensures all user/task 
    combinations are explicitly represented in the flat tool_context.state.
    """
    # First, update the specific target task in the current tool state
    target_key = f"user_tasks_{user_key}_{task_id}"
    tool_context.state[target_key] = is_done

    # Now, ensure every possible combination for all known users exists in the flat state.
    all_sync_updates = {}
    for name in user_names:
        u_key = name.lower()
        for t_id in TASKS_CONFIG:
            key = f"user_tasks_{u_key}_{t_id}"
            # If the key isn't already in the current state, default it to False.
            # Otherwise, keep its existing value.
            all_sync_updates[key] = tool_context.state.get(key, False)

    # Apply all values back to the flat state
    tool_context.state.update(all_sync_updates)
    logging.info(f"Synchronized all task state values. Updated {target_key} to {is_done}")

Key learning: When building the system I tried using session state elements as a nested dictionary, but unfortunately at the time of writing this is not supported. The workaround was to use a flat structure with keys including both the user_key and task_id, which works well for my use case. However, this pattern might not scale well for a complex system with many users and tasks, in which case serialization or an external DB could be a better option.

Agent tools

I provided the agent with three specific tools: checking progress, marking tasks complete, and unlocking the drawer.

Checking progress

The get_progress function retrieves and formats a checklist of a specific user's tasks, indicating whether each task is marked as completed or pending based on the application's current session state.

def get_progress(user_name: str, tool_context: ToolContext) -> str:
    """Check the progress of tasks for a specific user."""
    user_key = user_name.lower()

    status_msg = f"Progress for {user_name}:\n"
    for task_id, desc in TASKS_CONFIG.items():
        is_done = _get_task_status(user_key, task_id, tool_context)
        state_str = "✅ DONE" if is_done else "❌ PENDING"
        status_msg += f"- [{task_id}] {desc}: {state_str}\n"

    return status_msg

Marking task as complete

The complete_task tool acts as a gatekeeper. It checks if all tasks are finished before informing the model that it is authorized to unlock the drawer:

def complete_task(user_name: str, task_id: str, tool_context: ToolContext) -> str:
    """Mark a task as completed for a user."""
    user_key = user_name.lower()

    # Mark task as complete
    if task_id in TASKS_CONFIG:
        _set_task_status(user_key, task_id, True, tool_context)
    else:
        return f"Error: Task ID '{task_id}' not found."

    # Check if ALL tasks are complete
    all_complete = True
    remaining = []
    for t_id in TASKS_CONFIG:
        if not _get_task_status(user_key, t_id, tool_context):
            all_complete = False
            remaining.append(t_id)

    if all_complete:
        return (
            f"SUCCESS: All tasks completed for {user_name}! "
            "You may now unlock the drawer."
        )

    # If not all complete, show progress
    return (
        f"Task {task_id} marked as DONE. "
        f"Remaining tasks: {', '.join(remaining)}."
    )

Notice how descriptive the returned values are. They are written this way intentionally to give the Agent enough information to handle communication with the user, provide feedback and motivate them to complete the remaining tasks.

Integrating physical hardware

When the model receives the success confirmation, it calls the unlock_drawer tool. This interfaces directly with our hardware relay logic to update the LED display and pop open the assigned drawer:

# Initialize the HW interface and lock the drawers
user_names = ["Maria", "Jan"] if AGENT_LANGUAGE == "pl" else ["Mary", "James"]
hw_interface = HardwareInterface(user_names)

def unlock_drawer(id: int, user_name: str) -> str:
    """Unlock a drawer by its ID."""
    if id in [0, 1]:
        hw_interface.unlock_drawer(id)
        return f"Drawer {id} unlocked for {user_name}"

    return "Drawer not found"

The HardwareInterface (defined in agent/app/app_utils/hw_interface.py) actively communicates with the LED Matrix API on the Raspberry Pi to display whether each drawer is currently locked or unlocked.

While the code to control the physical drawer magnets is fully functional and tested (located in drawers.py), it is not yet integrated into the main HardwareInterface. This integration is simply on hold until the magnets are physically mounted to the drawer box.

Agent prompts

Tools alone are not enough; the model requires precise instructions on how to verify the work. In agent/app/prompts I defined a strict multi-step verification protocol both in English and Polish. Here is the English prompt:

You are a friendly, cheerful, and helpful AI assistant, the guardian of the "Sweets Vault." Your task is to verify tasks performed by children in order to grant a sweet reward.

### MAIN RULES:
1. **LANGUAGE**: You speak ONLY AND EXCLUSIVELY IN ENGLISH.
2. **USERS**:
   - **Mary** (girl, 7 years old) -> Assigned drawer ID: **0**
   - **James** (boy, 7 years old) -> Assigned drawer ID: **1**
   - **Parent** (man, 42 years old) -> May test the system by saying, for example, "I'm pretending to be Mary." Treat him exactly like the child he is claiming to be.
3. **PERSONALITY**: You are enthusiastic, warm, and supportive. Use exclamation marks and a joyful tone.

### TASK VERIFICATION PROCESS:
1. **STATE IDENTIFICATION**: When a child starts a conversation, ALWAYS first use the `get_progress(user_name)` tool to check what needs to be done.
2. **REPORTING**: The child reports completing a task (A or B).
3. **VERIFICATION**: Conduct a rigorous verification (camera/questions) as described below.
4. **CREDITING**: If verification is successful, use the `complete_task(user_name, task_id)` tool.
   - Read the tool's response carefully!
   - ONLY IF the response is "SUCCESS: All tasks completed...", then use `unlock_drawer`.
   - If the response shows "Remaining tasks," inform the child what they still need to do.

**Task A: Reading a page of a book**
*   **Verification 1**: Ask the child to show the read page to the camera. Confirm that you see it. Don't expose any details that can help answer the question in the next step (i.e. avoid sharing details of what exactly you can see).
*   **Verification 2**: Ask a simple follow-up question about the read text. The child must answer it.
*   **Task ID**: "A"

**Task B: Calligraphy (writing letters in workbooks)**
*   **Verification 1**: Ask to show the completed page in the workbooks to the camera. 
*   **Verification 2**: Confirm that the task has been performed. Make sure the picture contains hand-written letters (usually with a pencil).
*   If the page only contains examples, ask the child to complete missing parts.
*   **Task ID**: "B"

### SUCCESS AND REWARD:
IF the `complete_task` tool returns "SUCCESS", run `unlock_drawer(id)`.
Then **CELEBRATE!** Use phrases like: "Yippee!", "Hurray!", "Bravo!", "You're a champion!", "The sweets are yours!". Make some "noise."

### FAILURE:
If verification fails (e.g., the child doesn't show the page or answers incorrectly), gently and encouragingly ask for improvement or a retry. Do not open the drawer.

This prompt structure ensures the agent does its due diligence, preventing kids from simply holding up a blank page or skipping the reading comprehension check.

Demo

You can see a demonstration of the working system in the video below:

Conclusion

By combining the Gemini API, the Agent Development Kit, and a simple hardware relay, you can build highly interactive, physically grounded AI Agents. The Sweets Vault demonstrates how multimodal verification and structured tool calling solve practical, real-world problems with a dose of fun.

Explore more at:

Future plans

Current implementation uses Gemini Flash which guarantees high performance, multimodality and tool calling capabilities. Nevertheless it requires text input and provides only text as output. In the near future I plan to experiment with Gemini Live API which enables voice, video and text as input and conversational audio as output.

I am also going to finish the physical locks part with electro magnets. Stay tuned for updates.

Thanks for reading

Thank you for reading. I hope this blog inspires you to bring your own creative AI and hardware projects to life. If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.

I am always eager to connect with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky. Your feedback is incredibly valuable, so please do not hesitate to leave a comment with your thoughts, questions, or your own experiences building multimodal agents!

How I used Gemini CLI to orchestrate a complex RAG migration

Remigiusz Samborski — Tue, 28 Apr 2026 13:02:03 +0000

Building a complex, multi-phase cloud project like a RAG migration is as much about orchestration as it is about code. You have to manage infrastructure (Terraform), backend services (Python), frontend UI (Next.js), data pipelines (BigQuery/AlloyDB), and documentation - all while maintaining a consistent technical strategy.

Standard IDE completions are great for snippets, but they lack the system-level context needed for this kind of engineering. To build this reference architecture, I didn't just use an AI to write code. I used an AI to orchestrate the entire project.

In this final post (see previous part 1 and part 2), I'll share a behind-the-scenes look at using Gemini CLI with the Conductor extension to orchestrate this migration.

In this post, you will learn:

How to leverage terminal-first AI assistants for system-level engineering
How to implement spec-driven development with the Conductor extension
How to use AI-driven Test-Driven Development (TDD) for reliable code generation
How to collaborate with AI agents using the "Human-in-the-Loop" model

Before we dive into the workflow, let's briefly discuss why orchestration is the next logical step for AI-assisted development.

The Developer Experience

Let's walk through my development process step-by-step. The entire specification, plan, and implementation logic is available in the conductor directory of the rag-migration repository.

Spec-driven development with Conductor

Central to my workflow, is the Conductor extension. It's built on the principle of spec-driven development. Instead of jumping straight into code, we define the "source of truth" in Markdown files.

Product Definition (product.md): What are we building?
Tech Stack (tech-stack.md): What tools are we using?
Tracks Registry (tracks.md): What are the major milestones?
Implementation Plans (plan.md for each of the tracks): What are the step-by-step tasks?
Workflow (workflow.md): How are we building the solution?

By having these documents in the codebase, the AI agent (Gemini CLI) always has the high-level context it needs to make smart decisions. It's also a good practice to share those with your team so everyone (including AI agents) is on the same page about the project's direction.

Conductor initialization

The first step for the project initialization is to create product definition and tech stack files. This is handled by running:

/conductor:setup

Gemini CLI will ask you a series of questions to help you define your project, including:

What is the name of your product?
Who are the primary users?
What is the tech stack you are using?
What are the major features you want to implement?
What is the workflow you want to use?

It will then create the initial project structure in the conductor directory, including the product.md and tech-stack.md files.

The lifecycle of a track

Each major feature in this project was implemented as a "Track". A typical track lifecycle consists of:

Track Initialization (/conductor:newTrack):
- The agent creates a spec.md file that describes the goals of the track
- The agent maps the existing codebase and validates assumptions
- The agent creates a plan.md file that describes the steps needed to achieve the goals
Track Execution (/conductor:implement):
- The agent iterates through tasks using a Plan -> Act -> Validate cycle
Track Completion:
- The agent verifies the changes made during the track
- The agent ask for user feedback on the implementation
Track Archivization:
- Once a track is completed, Gemini CLI archives the track in the conductor/archive directory

For example, when I started the initial embeddings track, I initialized it with:

/conductor:newTrack

Gemini CLI researches the codebase, asks clarifying questions and creates a spec.md and plan.md files. Only after I review and approve them, the actual implementation starts.

Terraform for Infrastructure as Code

My product.md file instructs Gemini CLI to write Terraform code for all the resources created during the project. This works really well as all the resources are consistently managed by source code and it's easy to spin up a new environment when needed.

You can see all the Terraform files and infrastructure scripts used in the first track in the infra directory.

Moreover, in the course of the project creation I instructed Gemini CLI to always run terraform plan before terraform apply. Keeping this information in the workflow.md file ensures that such an approach is applied to all tracks.

TDD with an AI agent

One of the most powerful aspects of this workflow is AI-driven Test-Driven Development (TDD). I didn't just ask the agent to "write the code". It followed a strict protocol:

Write Failing Tests: The agent defines the expected behavior in a new test file
Red Phase: It runs the tests and confirms they fail
Green Phase: It writes the minimum code needed to pass the tests
Refactor: It refactors the implementation code and the test code to improve clarity, remove duplication, and enhance performance without changing the external behavior.
Verify Coverage: It verifies that the test coverage meets the project requirements (target: >80% coverage for new code).
Commit Code Changes: The agent commits code changes related to the task.

This ensures that the AI-generated code isn't just "syntactically correct" but functionally verified against my requirements. This workflow is described in the workflow.md file.

Checkpoints and quality gates

At the end of every phase, Gemini CLI runs a "Checkpoint" protocol. This includes:

Automated Verification: Running the full test suite.
Manual Verification: Providing the user with step-by-step instructions to verify the changes.
Auditable Records: Attaching a verification report to the git commit using git notes and update plan.md with the new commit hash.

Conductor commits demonstrating the checkpoint protocol.

Effective Human-in-the-Loop

To achieve an effective AI agent-human development synergy I heavily depended on following solutions:

Gemini CLI in a sandbox with Yolo mode enabled - see my past article for more about it.
Custom sandbox notifier script that runs in another terminal.

This approach provided safe guardrails and allowed me to jump into work on other projects while the AI was working on this one. I was always able to jump back quickly thanks to timely notifications. Moreover the checkpointing mechanism of Conductor allowed me to always have a possibility to revert unnecessary changes or to restart from a known working state.

I also used Antigravity to polish the generated code and the documentation. It was particularly helpful for minor tweaks or refactoring of the code that was generated by Gemini CLI.

Token usage

Throughout the project I used several models (Gemini 3 Pro, Gemini 3 Flash and Gemini 2.5 Flash Lite). The total token consumption was:

Input tokens: ~19M
Cached input tokens: ~66M
Output tokens: ~400k

Notice the high number of cached input tokens, which significantly impacts the spend. The total Vertex AI token cost was around $30. Not bad for several days of AI assisted work.

See the pricing page for more details and please mind that your mileage may vary.

Summary

Software engineering is evolving from writing code to orchestrating agentic workflows. By using tools like Gemini CLI and frameworks like Conductor, you can scale your impact as an architect while ensuring consistent, high-quality implementation.

Ready to build your own AI-assisted development projects?

Thanks for reading

If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. You can also share it with your friends on socials.

I'm always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

Migrating vector embeddings in production without downtime

Remigiusz Samborski — Tue, 21 Apr 2026 15:41:50 +0000

In the fast-moving world of AI, models evolve rapidly. What was state-of-the-art six months ago is now being surpassed by newer models. For a RAG system, this presents a significant challenge: vector embeddings are tied to the specific model that generated them.

If you want to upgrade your model, you can’t just start using the new one. Existing vectors in your database are incompatible with queries from the new model. A "naive" migration-shutting down the site, re-embedding everything, and restarting-means hours of potential downtime.

In this post, I'll show you how to execute a zero-downtime migration strategy using dual-column schemas and background processing.

If you haven't read the previous post, I recommend starting there to understand the basics of building a RAG pipeline with BigQuery, Cloud Run Jobs, Vertex AI, and AlloyDB for PostgreSQL.

In this post we will start off with a running system built in the previous post, and I will show you how to:

Implement the Shadow Deployment pattern with dual-column schemas
Execute background backfilling using Cloud Run Jobs
Safely switch application logic without impacting search functionality
Ensure data consistency and handle migration failures

Before we dive into the code, let's briefly discuss the concept of shadow deployment and how it supports the RAG application migration process.

Shadow deployment with dual columns

A robust way to migrate embeddings is to use a Shadow Deployment pattern. Instead of replacing the existing vectors, you store the new vectors alongside them in a separate column. The migration process boils down to following major steps:

Add a new column: We update our AlloyDB table to include embedding_v2.
Backfill in the background: We run a migration job to populate embedding_v2 for all existing rows.
Switch: Once every row has a new vector, we update the application code to use the new model and the new column.

This strategy ensures that your live search functionality, which still uses the original embedding column, remains fully operational during the entire migration process.

Implementation

Let's walk through the migration process step-by-step. All the code for this migration is available in the 03-migration folder of the RAG Migration Repository.

Step 1: Schema evolution

First, we prepare the database. Using a simple SQL query, we add the new vector column. Because we are targeting an existing database, we connect via the AlloyDB Auth Proxy and use psql to execute the query.

# Ensure your AlloyDB Auth proxy is running in another terminal window by running
# ./alloydb-auth-proxy projects/<PROJECT_ID>/locations/<LOCATION>/clusters/<CLUSTER>/instances/<INSTANCE> --port <PORT> --auto-iam-authn --public-ip

# Navigate to the migration directory
cd 03-migration

# Apply the schema change
psql -h 127.0.0.1 -p <PORT> -U postgres -d <DATABASE_NAME> -f 001_add_embedding_v2.sql

The content of 001_add_embedding_v2.sql is straightforward:

ALTER TABLE products ADD COLUMN IF NOT EXISTS embedding_v2 VECTOR(768);

Since AlloyDB handles schema changes gracefully, this operation is near-instantaneous and doesn’t lock the table for reads. Your live API is completely unaffected.

Note: In production you may want to run this query via your CI/CD pipeline.

Step 2: Configure the migration environment

We reuse the parallelization framework we built in the previous post, but this time we configure the environment for the new model. The project uses uv for dependency management:

# Sync local dependencies (run it in 03-migration folder)
uv sync

# Set required environment variables
export GOOGLE_CLOUD_PROJECT="YOUR_PROJECT_ID"
export DB_PASSWORD="YOUR_ALLOYDB_PASSWORD"
export GEMINI_EMBEDDING_MODEL="gemini-embedding-001"
export GEMINI_EMBEDDING_DIMENSION=768
export BATCH_SIZE=1000

Step 3: Background backfilling worker

The migration worker (03-migration/main.py) specifically targets rows where the new column is still empty. This makes the migration process idempotent and resumable - if a task fails, you can just run it again.

# snippet from 03-migration/main.py
# Fetch products where embedding_v2 is null, respecting offset
fetch_stmt = text("""
    SELECT id, name, category, brand FROM products 
    WHERE embedding_v2 IS NULL
    ORDER BY id
    LIMIT :batch_size OFFSET :offset
""")

We deploy this worker as a Cloud Run Job. A convenient deploy script is provided in the repository which builds the Docker image and configures the job on GCP.

./infra/scripts/deploy_migration.sh

Step 4: Orchestrating the migration

Instead of manually calculating the number of tasks to run, we use a Python orchestrator (03-migration/orchestrator.py) to query the database, calculate the remaining work, and dynamically scale the Cloud Run Job.

The orchestrator counts the number of unmigrated rows:

# snippet from orchestrator.py logic
count_stmt = text("SELECT COUNT(*) FROM products WHERE embedding_v2 IS NULL")
unmigrated_count = session.execute(count_stmt).scalar()
total_tasks = math.ceil(unmigrated_count / batch_size)

Then, it triggers the Cloud Run Job via the Google Cloud SDK, passing the exact number of tasks required:

# Run the orchestrator to kick off the migration
uv run orchestrator.py

The job runs in the background, consuming rows and generating new embeddings without competing for critical resources with our live search API.

Step 5: Safely changing the query

Once the orchestrator reports that 100% of rows have embedding_v2 populated, we are ready for the switch. This happens entirely at the application layer (02-ui).

The search API code is updated to:

Use the gemini-embedding-001 model to embed the user's search query.
Query the embedding_v2 column in AlloyDB instead of embedding.

Congratulations 🎉 You have successfully migrated your entire vector database with zero downtime!

Production best practices: evals and feature flags

While a direct code swap works for a simple demonstration, in a real-world production environment, you should avoid an abrupt 100% cutover. Instead, you should leverage the fact that both vector representations exist simultaneously in your database to roll out safely:

Evaluation pipeline: Before exposing the new model to customers, build an eval pipeline. Take a golden dataset of your most common or critical search queries and run them against both the old (embedding) and new (embedding_v2) columns. Compare the relevance of the retrieved results to ensure the new model actually improves the search experience.
Feature flags for traffic routing: Wrap the application-layer switch in a feature flag. Start by routing a small percentage of your traffic (e.g., 5% or 10%) to the new embedding_v2 logic. Monitor your application metrics, click-through rates, and error logs.

Because the migration happened in the background, this dual-state makes it trivial to run A/B tests or instantly rollback by toggling the feature flag if the new model introduces unexpected regressions. Once you're fully ramped up to 100% and verified the new performance, the old embedding column can be safely dropped in a future database cleanup.

See it in action

The semantic search UI seamlessly returns results using the new gemini-embedding-001 model without any disruption to the user experience.

Summary

AI infrastructure is about more than just the initial build; it’s about designing for evolution. By using shadow deployments, you ensure your RAG system can always stay at the cutting edge of model performance without sacrificing availability.

Ready to take it further?

In my next post, we’ll look at the Developer Experience - how I used Gemini CLI and the Conductor extension to build and manage this entire multi-phase project.

Thanks for reading

I'm always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

Building a Scalable RAG Backend with Cloud Run Jobs and AlloyDB

Remigiusz Samborski — Wed, 15 Apr 2026 08:26:53 +0000

Building a Retrieval-Augmented Generation (RAG) sounds easy with all the available tutorials. You take a few hundred products, run them through an embedding model, and store them in a vector database. It works beautifully on your machine or staging environment.

The friction starts at production scale. When your dataset jumps from a few hundred to millions of products, that simple Python loop you wrote to generate embeddings hits a wall. Between network latency and hitting API rate limits every few seconds, what was a five-minute task quickly spirals into a multi-hour ordeal that blocks your entire pipeline.

Scaling effectively means moving past sequential processing. In this post, we’ll explore how to build an industrial-strength RAG backend using BigQuery, Cloud Run Jobs, Vertex AI, and AlloyDB for PostgreSQL.

You will learn how to:

Provision infrastructure with Terraform
Parallelize embedding generation using Cloud Run Jobs
Use the google-genai SDK for Vertex AI text-embedding-005 model
Store and query vectors in AlloyDB for PostgreSQL using pgvector

Note: I decided to use AlloyDB in this example, but any other PostgreSQL database with pgvector extension could work too, for example you may consider leveraging Cloud SQL for PostgreSQL.

Before we dive into the code, let's briefly discuss the core components that power this serverless AI solution.

The Industrial-Strength Architecture

Our pipeline is designed for massive scale and serverless efficiency. We leverage the following Google Cloud services:

BigQuery: Our source of truth, containing millions of product records.
Cloud Run Jobs: A serverless compute platform that allows us to run hundreds of parallel tasks.
Vertex AI (text-embedding-005): The latest state-of-the-art embedding model from Google.
AlloyDB for PostgreSQL: An enterprise-grade database with built-in pgvector support for high-performance vector search.

The diagram below illustrates the high-level architecture of our RAG pipeline:

Implementation

Let's walk through the setup and execution process step-by-step. All the code for this project is available in the RAG Migration Repository.

Prepare the environment

First, let's configure the gcloud CLI, clone the repository and create a virtual environment with dependencies.

Step 1 - set your default project:

gcloud config set project YOUR_PROJECT_ID

Step 2 - configure the default region for Cloud Run:

gcloud config set run/region europe-central2

Step 3 - clone the code repository

git clone https://github.com/rsamborski/rag-migration.git
cd rag-migration/01-generation

Step 4 - create a virtual environment and install dependencies

uv init
uv sync

Infrastructure with Terraform

We use Terraform to provision the AlloyDB cluster, the Artifact Registry, and the Cloud Run Job. Navigate to 01-generation/infra/terraform and apply the configuration:

terraform init
terraform plan -var="project_id=YOUR_PROJECT_ID" -var="db_password=YOUR_SECURE_PASSWORD" -out tfplan
terraform apply tfplan

Note: The -out tfplan flag saves the plan to a file named tfplan, and terraform apply tfplan applies that specific plan. This is a best practice for ensuring that the plan and apply operations are consistent.

Connecting to AlloyDB

To interact with AlloyDB, the application needs to establish a secure connection. Depending on where you are running the code, the approach differs:

Local Development: For running scripts or testing queries from your local machine, use the AlloyDB Auth Proxy. It provides secure access to your instance without authorizing your local IP to the AlloyDB instance.
Cloud Run Jobs: When running in Cloud Run, the job connects to the AlloyDB instance over the private network (VPC). For this setup, we pass the database password via an environment variable to the Cloud Run Job configuration.

Note: For production workloads, it is highly recommended to use Google Cloud Secret Manager to handle sensitive data like database passwords, rather than passing them as plain text environment variables.

Embedding logic

The worker script (01-generation/main.py) is designed to run as an individual task within a Cloud Run Job. It uses the CLOUD_RUN_TASK_INDEX environment variable to calculate its specific shard of data.

# Cloud Run Job environment variables
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
batch_size = int(os.environ.get("BATCH_SIZE", 100))

# Calculate offset
offset = task_index * batch_size

The embedding generation logic (01-generation/src/embedder.py) uses the google-genai SDK:

import os
from google import genai
from google.genai.types import EmbedContentConfig

def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Generates embeddings for a list of texts using the text-embedding-005 model.
    Uses the new google-genai SDK to avoid deprecation warnings.
    """
    if not texts:
        return []

    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "rsamborski-rag")
    location = os.environ.get("GOOGLE_CLOUD_REGION", "europe-central2")

    # Initialize the Gen AI client for Vertex AI
    client = genai.Client(vertexai=True, project=project_id, location=location)

    # The dimensionality of the output embeddings for text-embedding-005.
    dimensionality = 768 
    task = "RETRIEVAL_DOCUMENT" # standard task for documents in RAG

    response = client.models.embed_content(
        model="text-embedding-005",
        contents=texts,
        config=EmbedContentConfig(
            task_type=task,
            output_dimensionality=dimensionality
        )
    )

    return [embedding.values for embedding in response.embeddings]

Build and deploy

We containerize the application using the provided Dockerfile and deploy it as a Cloud Run Job. The deploy.sh script automates this process, you can run it by executing:

./infra/scripts/deploy.sh

Once finished you should see:

---------------------------------------------------------
✅ Deployment Finished
---------------------------------------------------------

Run and monitor

Now you can start the orchestrator by running:

uv run orchestrator.py

The orchestrator provides real-time feedback on the job status, which you can also monitor in the Google Cloud Console.

Congratulations 🎉 You have successfully built and run a parallelized embedding pipeline!

For production environment I recommend to create a ScaNN index to improve the speed of your queries. Please refer to the linked documentation to learn more about it.

Testing with the Semantic Search UI

To see the embeddings in action, you can spin up the Next.js semantic search UI locally.

Run the UI

Navigate to the UI directory and configure the environment:

cd ../02-ui
cp .env.template .env

Edit the .env file to include your Google Cloud PROJECT_ID and the AlloyDB DB_PASSWORD you used during the Terraform deployment. Set DB_HOST=127.0.0.1 to route queries through the AlloyDB Auth Proxy.

Install dependencies:

npm install

Start the AlloyDB Auth Proxy (in a separate terminal window):

# Make sure you have downloaded the alloydb-auth-proxy binary
./alloydb-auth-proxy projects/YOUR_PROJECT_ID/locations/europe-central2/clusters/rag-migration-cluster/instances/rag-migration-instance

Start the development server:

npm run dev

Navigate to http://localhost:3000 to interact with the search portal. You can now run natural language queries directly against your product catalog!

See it in action

Watch as natural language queries return highly relevant results mapped via the text-embedding-005 model in real-time.

Summary

You now have a scalable, serverless foundation for your RAG system. By using Cloud Run Jobs, you've transformed a bottleneck into a highly parallelized process capable of handling millions of records.

Ready to take it further?

Check out the full source code on GitHub.
Learn more about Cloud Run Jobs.
Learn more about AlloyDB and pgvector.
Learn how to create a ScaNN index for your embeddings.
Learn more about Embeddings APIs on VertexAI.

In the next post, we’ll dive into Zero-Downtime Embedding Migration - how to upgrade your vector models without taking your search offline.

Thanks for reading

I'm always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

Secure Gemini CLI for Cloud development

Remigiusz Samborski — Fri, 13 Mar 2026 06:01:17 +0000

AI agents are a double-edged sword. You hear horror stories of autonomous tools deleting production databases or purging entire email inboxes. These risks often lead users to require manual confirmation for every agent operation. This approach keeps you in control but limits the agent's autonomy. You will soon find yourself hand-holding the agent and hindering its true capabilities. You need a way to let the agent run in "yolo mode" without risking your system.

In this blog you will learn how to secure your Gemini CLI in a way that will allow you to run it in an isolated environment with limited GitHub and Google Cloud access while not worrying that it will do too much damage if things go wrong. We will follow the least privilege pattern to make sure Gemini CLI has all necessary permissions to build your project, but at the same time can’t access systems it shouldn’t touch.

The Sandbox premise

The solution consists of following components:

Using GitHub fine-grained personal access tokens - limits source control risks.
Google Cloud service account - limits cloud risks.
Docker - limits local system risks.
Session limits - avoid surprises with the number of used tokens (especially important when running in --yolo mode).

Following this approach will protect you from the 'helpful agent curse' - it’s a situation when the agent tries very hard to achieve a task by finding ways around blockers. Examples include: granting itself more permissions, copying files to the current folder to edit them, and many more.

GitHub fine-grained personal access tokens

First let’s limit agent’s GitHub exposure by leveraging the fine grained tokens:

Navigate to GitHub Settings > Developer Settings > Personal access tokens > Fine-grained tokens.
Click Generate a new token.
Provide a descriptive name for your token and consider using expiration date to force rotations on a regular basis.
Restrict Repository access to the specific target repo you are working on.
Grant Read and Write permissions for Contents.
Save the token locally by running export GITHUB_TOKEN="github_pat_...".

Google Cloud Service Account

Create an isolated Service Account (SA) with minimal permissions. This prevents the agent from accessing protected resources and other projects.

Run these commands after updating YOUR_PROJECT_ID and roles below:

# Set your project ID
export CLOUDSDK_CORE_PROJECT="YOUR_PROJECT_ID"
gcloud config set project $CLOUDSDK_CORE_PROJECT

# Create the Service Account
gcloud iam service-accounts create gemini-cli-sa \
    --description="Isolated account for Gemini CLI"

# Grant minimal roles (adjust roles as needed)
gcloud projects add-iam-policy-binding $CLOUDSDK_CORE_PROJECT \
   --member="serviceAccount:gemini-cli-sa@$CLOUDSDK_CORE_PROJECT.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"

# Generate the JSON key file
gcloud iam service-accounts keys create sa-key.json \
    --iam-account=gemini-cli-sa@$CLOUDSDK_CORE_PROJECT.iam.gserviceaccount.com

Hint: you can use the IAM roles and permissions index page to easily find the roles to grant.

A good practice is to use a dedicated project for each of your AI coding initiatives. This way you can run several agents in parallel. They will build different solutions without worrying about stepping on each other's toes.

Custom Docker Build

The Gemini CLI uses a sandbox image to isolate the execution environment. You must customize this image to install gcloud, terraform, vim and set git configuration.

Prepare the Dockerfile

Create a .gemini directory in your project, and inside it, create a sandbox.Dockerfile. Using this specific file name allows Gemini CLI to automatically detect and build your custom sandbox profile if you’re running it from source and you can also use it to build the image manually if you’re running a binary installation.

Paste this content in the .gemini/sandbox.Dockerfile:

# Start from the official Gemini CLI sandbox image with proper version
ARG GEMINI_CLI_VERSION 0.33.0
FROM us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox:${GEMINI_CLI_VERSION}

# Switch to root to install system dependencies (gcloud)
USER root

# Install Google Cloud SDK, Git, and prerequisites
RUN apt-get update && apt-get install -y curl apt-transport-https ca-certificates gnupg git && \
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update && apt-get install -y google-cloud-cli

# Install Terraform
RUN apt-get update && apt-get install -y wget lsb-release && \
    wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null && \
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list && \
    apt-get update && apt-get install -y terraform

# Install vim
RUN apt-get install -y vim

# Switch back to the non-root user (the official sandbox image uses 'node' as the default user)
USER node
WORKDIR /workspace

# Configure Git to use the injected GitHub PAT at runtime
RUN git config --global credential.helper '!f() { echo "username=x-access-token"; echo "password=$GITHUB_TOKEN"; }; f'

Prepare docker for building images (optional MacOS step)

If you haven’t built any Docker images before then run following commands to prepare your environment with brew:

# Install dependencies
brew install docker colima docker-buildx

# Configure docker-buildx
mkdir -p ~/.docker/cli-plugins
ln -sfn $(brew --prefix)/opt/docker-buildx/bin/docker-buildx ~/.docker/cli-plugins/docker-buildx

# Start colima service
brew services start colima

# Update DOCKER_HOST (you might want to add this line to .bash_profile):
export DOCKER_HOST="unix://${HOME}/.colima/default/docker.sock"

Build the image (binary installation)

If you installed Gemini CLI with npm, brew or any other binary method then you will need to manually build the Docker image and tag it as a default one that Gemini CLI is looking for:

# Get the base name the CLI looks for
export IMAGE_BASE_NAME="us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox"

# Get your currently installed Gemini CLI version (e.g., 0.33.0)
export GEMINI_CLI_VERSION=$(gemini --version)

# Combine them
export IMAGE_NAME="${IMAGE_BASE_NAME}:${GEMINI_CLI_VERSION}"

# Build your custom sandbox image
docker build \
  --build-arg GEMINI_CLI_VERSION=$GEMINI_CLI_VERSION \
  -t "${IMAGE_NAME}" \
  -f .gemini/sandbox.Dockerfile .

Important: this image will be tagged with the exact version of the Gemini CLI you use. This means it needs to be rebuilt every time you update the CLI. I keep the above code in a shell script to run it after every update.

Build the image (source installation)

If you’re running your Gemini CLI from source as explained here. You can trigger the image build automatically each time you start gemini.

First update the top part of your sandbox.Dockerfile by substituting the FROM line with following:

# Start from the official Gemini CLI sandbox image (source installation)
FROM gemini-cli-sandbox

Start Gemini CLI in the sandbox mode

First set couple very important environment variables:

# Export the necessary environment variables
export GITHUB_TOKEN="github_pat_..."
export GEMINI_API_KEY="your-api-key"
export CLOUDSDK_CORE_PROJECT="YOUR_PROJECT_ID"
export GEMINI_SANDBOX=docker

# We keep the ENV variables for our dynamic credentials
export SANDBOX_FLAGS="\
-e GITHUB_TOKEN=${GITHUB_TOKEN} \
-e CLOUDSDK_CORE_PROJECT=${CLOUDSDK_CORE_PROJECT} \
-e CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE=$(pwd)/sa-key.json"

Note: You can put the above variables in a shell script to speed up starts in the future. Just make sure to update your .gitignore to keep it and sa-key.json from getting added to your repository.

Now you can start your Gemini CLI with following command:

# For binary installation
gemini

# For source installation
BUILD_SANDBOX=1 gemini

Session limits

To avoid surprises with the number of tokens that Gemini CLI uses in your session, you can use the Max Session Turns in /settings or your ~/.gemini/settings.json:

Congratulations

Congratulations 🚀You’re ready to validate your setup.

Validation and "Ultimate Tests"

Once the environment is launched within a sandbox we should verify the security boundaries.

First let’s run the /about command to see if we’re running within a sandbox. You should see something like this:

Now let’s try to break out from our new sandbox.

GitHub privilege escalation

Try asking Gemini CLI to access a private repo it shouldn’t have access. Example prompt:

Clone https://github.com/USER_NAME/PRIVATE_REPOSITORY to a new folder

You should see how Gemini CLI really tries and struggles to get access. Mine got really creative at trying to access the repo with git, gh, curl and even tried to reuse the GITHUB_TOKEN manually. All these tries failed and this error was displayed:

Google Cloud privilege escalation

Ask the agent to list all compute instances:

List all my compute instances

It should fail due to missing permissions on your restricted Service Account. Gemini CLI tries really hard and executes couple different commands including reauthentication, but it fails at the end:

Local privilege escalation

Finally let’s try to access a file from another project folder by prompting:

There are other projects in the folder above the current one. List them and let me know if there is anything that is interesting from hacker's perspective.

I am starting to feel sorry for the poor agent 😉 Once again it can’t complete its task:

Conclusions

Now that you have validated your sandbox setup you should feel much more confident to run gemini --yolo and streamline your work as Gemini CLI delivers your code without hand-holding and pesky Can I execute this command? Prompts.

I am looking forward to all the creative ideas you’ll bring to life!

What's next?

If you find this setup useful here are some additional steps to consider:

Try out Gemini CLI Conductor Extension - it’s very powerful and can significantly help you run autonomous agents effectively. Here is a deep dive into some of the advantages.
Read my Antigravity the Ralph Wiggum style which covers sandboxing for Antigravity.
Add emojis to this post to help others find it.
Share this post with your friends on socials.
Connect with me via LinkedIn, X or Bluesky.

Thanks for reading.

Agent Factory Recap: Antigravity and Nano Banana Pro with Remik

Remigiusz Samborski — Mon, 02 Feb 2026 12:18:03 +0000

In Episode #17 of the Agent Factory podcast, we step away from the purely theoretical and get our hands dirty with the latest developer tools from Google. Together with Vlad we take a deep dive into Antigravity and Nano Banana Pro, demonstrating how to build AI agents that bridge the gap between code generation and high-fidelity media.

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

Introductions

Antigravity - a new agentic mission control

Timestamp: 02:02

Antigravity is Google’s new agent-first development application. It is designed as a multi-window IDE that uses Gemini 3 under the hood to manage complex, asynchronous coding tasks. Unlike a standard text editor, Antigravity features a dedicated Agent Manager view where developers can interact with agents through planning modes, artifact reviews, and built-in browsers for live UI testing.

Nano Banana Pro and why it’s so special

Timestamp: 03:47

The model behind our slide generator is Nano Banana Pro (the technical name is Gemini 3 Pro Image model). What sets Nano Banana Pro apart is its ability to "think" before it creates. It uses Google Search grounding to retrieve real-time data, like current weather or live stock charts, and integrates that information into the generated image.

The Factory Floor

Agent architecture

Timestamp: 06:41

We dove into the architecture for building the slide generator agent using a combination of the Agent Development Kit (ADK), Antigravity, and Nano Banana Pro. Moreover we explained how these components interconnect to support the flow from a user prompt to a high-fidelity presentation by leveraging a Model Context Protocol (MCP) server.

Agent Starter Pack

Timestamp: 10:04

We started by introducing Agent Starter Pack and using it to spin up a new ADK project.

Vibe Coding a Slide Generator Agent

Timestamp: 14:56

Using Antigravity’s "Always Review" mode, we submitted a prompt to build a completely new slides agent which will use a MCP server to create detailed images.

We highlighted how Antigravity handles tasks and implementation plan reviews, allowing the developer to modify the agent's proposed steps before it executes them. We also dove into automated testing with Antigravity’s web browser plugin functionality.

Results

Timestamp: 32:25

The slides generator agent successfully called the MCP tools to generate a series of images stored in a Cloud Storage bucket, presenting the final links directly in the UI. Antigravity reviewed the results and to our surprise immediately suggested additional improvements.

Wrap up section

We are moving into an era where building an agent isn't just about the LLM; it's about the orchestration of tools, planning, and high-fidelity output. Antigravity is our agent manager and Nano Banana Pro acts as an asset creator.

We can’t wait to see what our viewers create next with these powerful tools!

Resources & links

Antigravity → https://goo.gle/3XMvGwf
Nano Banana Pro → https://goo.gle/3KMmKUG
Agent starter pack → https://goo.gle/4oLfypT
Agent Development Kit (Python) → https://goo.gle/4p0Pszm
GitHub with demo code → https://goo.gle/48N83IN

Connect with us

Remik → LinkedIn, X, GitHub
Vlad → LinkedIn, X, GitHub

Antigravity the Ralph Wiggum style

Remigiusz Samborski — Mon, 02 Feb 2026 12:05:20 +0000

The Ralph Wiggum trend has been surfacing across social platforms lately. If you're tracking current tech developments, it’s hard to miss. Named after a persistent and slightly confused second-grader, the Wiggum Loop agentic development boils down to: Don't stop until the job is done.

In traditional AI coding, the agent performs a task, stops, and waits for you to approve its next step or request changes. In a Wiggum Loop, you give the agent a mission and success criteria (like passing tests), and it keeps looping, fixing its own bugs and refactoring - until it hits the green light.

The recent excitement around the Wiggum Loop agentic development highlights a powerful shift: achieving autonomous, self-correcting development. I've been leveraging a similar approach effectively with Antigravity for some time already. In this post, I’ll share my strategy, enabling you to implement true unsupervised development yourself.

Going "Full Wiggum"

To achieve true unsupervised development, we need to move away from the review-driven defaults and let the agent take the wheel. Antigravity is uniquely built for this because it's an agent-first environment capable of acting in both the terminal and the browser.

To mirror the “Bash loop" persistence of the Ralph Wiggum plugin, configure your Antigravity settings as follows:

Mode: Select Agent-driven development. This shifts the agent from a "wait for instructions" assistant to a "goal-oriented" architect.
Terminal execution policy: Set to Always Proceed. This allows the agent to run npm test, uv run pytests, and other commands without constantly pausing for approval.
Review policy: Set to Always Proceed. This tells the agent that its implementation plans are pre-approved.
JavaScript execution policy: Set to Always Proceed. This is essential for agents that need to run scripts or interact with browser environments to verify their work.

WARNING: THE SANDBOX IS NOT OPTIONAL. Running an agent in "Always Proceed" mode is like giving Bart Simpson a slingshot in front of a mirror store. Only do this in a sandbox environment.

Here is a great article from my colleague which shows a step-by-step guide to setting such an environment up and running on Cloud Workstation.

Example

To see this in practice, I ran the following prompt against Antigravity:

Build a REST API for todos in NodeJS.

When complete:
- All CRUD endpoints are working
- Input validation is in place
- Tests are passing (coverage > 80%)
- README with API docs exists

The screencast below shows how Antigravity handled the task without my interruptions (I spent this time on other tasks rather than handholding the agent):

How does this work?

Antigravity isn't just looping in a vacuum. Because it has native hooks into Gemini 3 Pro, it utilizes a massive context window that remembers exactly why a previous command failed.

It kicks things off by drafting up an implementation plan and a task list. In the video, you can watch it tick through these items in real time. It doesn't just plan, though - it actually touches the terminal to initialize the npm project and run tests.

The loop only closes once every requirement is met and the test suite hits green. It then provides a handy walkthrough so you can easily understand the architecture it just spun up.

This approach turns development from writing code into verifying outcomes.

From vibe-coding to vibe-building

The Ralph Wiggum trend isn't about cutting corners; it's about embracing sheer, stubborn persistence through automation. By letting Antigravity operate autonomously, you transition from a coder to an architect and team lead. You define the standards and environment, while agents manage the iterative grind of writing, testing, and debugging cycles that typically consume a developer's valuable time.

Are you brave enough to let the agent "Always Proceed"? Visit Antigravity’s download page to start experimenting yourself.

Other resources

Let’s Connect!

I’d love to hear how you’re using Antigravity for your agentic workflows. Are you building Wiggum loops or keeping a tighter leash on your agents?

Connect on LinkedIn
Follow me on X
Catch me on Bluesky

Serverless AI: EmbeddingGemma with Cloud Run

Remigiusz Samborski — Thu, 25 Sep 2025 09:29:59 +0000

Building on the previous blog post about running Qwen3 Embedding models on Cloud Run, this article focuses on the recently released EmbeddingGemma model from the Gemma family. Discover how to leverage the same powerful serverless techniques to deploy this model on Google Cloud's serverless platform.

You will learn how to:

Containerize the embedding model with Docker and Ollama
Deploy the embedding model to Cloud Run with GPUs
Test the deployed model from a local machine

Before we dive into the code, let's briefly discuss the core components that power this serverless AI solution.

EmbeddingGemma Model

According to the EmbeddingGemma model card:

“EmbeddingGemma is a 308M parameter multilingual text embedding model based on Gemma 3. It is optimized for use in everyday devices, such as phones, laptops, and tablets. The model produces numerical representations of text to be used for downstream tasks like information retrieval, semantic similarity search, classification, and clustering.”

Its optimization for efficiency makes EmbeddingGemma an ideal candidate for serverless deployment on Cloud Run, ensuring high performance and cost-effectiveness for your AI applications.

Cloud Run

Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Run Functions) and a more customizable GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.

The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren't paying for any resources. When traffic picks up, it quickly scales up to handle the load. This makes it perfect for stateless models that need to be highly available and cost-effective.

Deployment

Let's walk through the deployment process step-by-step.

Prepare the environment

First lets configure the gcloud CLI environment.

Note: if you do not have gcloud CLI installed please follow instructions available here.

Step 1 - Set your default project:

gcloud config set project PROJECT_ID

Step 2 - Configure Google Cloud CLI to use the _europe-west1 _region for Cloud Run commands:

gcloud config set run/region europe-west1

Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region please refer to this page.

Containerize

Now we will use Docker and Ollama to run the EmbeddingGemma model. Create a file named Dockerfile containing:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=embeddinggemma:latest
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build and Deploy

We will now use Cloud Run's source deployments. This allows you to achieve the following with one command:

First, compile the container image from the provided source.
Next, upload the resulting container image to an Artifact Registry.
Then, deploy the container to Cloud Run, ensuring that GPU support is enabled using the --gpu and --gpu-type parameters.
Finally, redirect all incoming traffic to this newly deployed version.

You just need to run:

gcloud run deploy embedding-gemma \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-embedding-gemma

Note the following important flags in this command:

--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU quota.
--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication.
--no-cpu-throttling is required for enabling GPU.
--no-gpu-zonal-redundancy set zonal redundancy options depending on your zonal failover requirements and available quota.

Test the deployment

Upon successful deployment of the service, you can initiate requests. However, direct api calls will result in an HTTP 401 Unauthorized response from Cloud Run.

This behaviour follows Google’s “secure by default” approach. The model is intended for calls from other services, such as a RAG application, and therefore is not open for public access.

To support local testing of your deployment, the simplest approach is to launch the Cloud Run developer proxy using the following command:

gcloud run services proxy embedding-gemma --port=9090

Afterwards, in a second terminal window, run:

curl http://localhost:9090/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Sample text"
}'

The response will look similar to this:

You can also use Python to call the endpoint. Example:

from ollama import Client

client = Client(host="http://localhost:9090")

response = client.embed(model="embeddinggemma", input="Sample text")
print(response)

Congratulations 🎉 The Cloud Run deployment is up and running!

RAG Example

You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:

Step 1 - Generate Embeddings

Start with required dependencies:

pip install ollama chromadb

Create an example.py file containing:

import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in a in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="embeddinggemma", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

Step 2 - Retrieve

Next, with the following code you can search the vector database for the most relevant document (add it to your example.py):

# An example question
question = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="embeddinggemma", input=question)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]

Step 3 - Generate Final Answer

In this final step step we will use a locally installed Gemma3.

_Note: We use Gemma3 in the generation step, but any other model could work here (e.g., Gemini, Qwen3, Llama, etc.). Nevertheless, it is critical to use the same embeddings model in Step 1 (Generate Embeddings) and Step 2 (Retrieve). _

To install the Gemma3:latest model run:

ollama pull gemma3

Now can combine user’s prompt with search results and generate the final answer (add this code to example.py):

# Final step - generate a response combining the prompt and data we retrieved in step 2
prompt = f"Using this data: {data}. Respond to this prompt: {question}"

output = ollama.generate(
    model="gemma3",
    prompt=prompt,
)

print(f"Prompt: {prompt}")
print(output["response"])

Run the code:

python example.py

The answer should look similar to the one below:

Prompt: Using this data: Poland's official language is Polish, which is a West Slavic language.. Respond to this prompt: What is Poland's official language?
Poland's official language is Polish. It's a West Slavic language.

You have successfully created and run your first RAG application using the EmbeddingGemma model.

Summary

At this point, you have successfully established a Cloud Run service running the EmbeddingGemma model, ready to generate embeddings for semantic search or RAG applications.

This method also allows you to deploy and compare multiple embedding models on Cloud Run (e.g. Qwen3 Embedding or other Ollama-supported models), enabling you to find the best fit for your specific use case without major code changes.

Ready to build your own serverless AI applications?

Start building on Cloud Run today and explore its full potential!
If you’re interested in learning more about RAG evaluation, this article is a good starting point.

Thanks for reading

If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.

I'm always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Serverless AI: Qwen3 Embeddings with Cloud Run

Remigiusz Samborski — Wed, 20 Aug 2025 10:31:23 +0000

In this blog post I’ll show you the process of deploying the Qwen3 Embedding model to Cloud Run with GPUs for enhanced performance.

You will learn how to:

Containerize the embedding model with Docker and Ollama
Deploy the embedding model to Cloud Run with GPUs
Test the deployed model from a local machine

Before we jump into the code a couple words about key components of the solution.

Qwen3 Embedding Model

The Qwen3 Embedding series is a set of open-source models for text embedding and reranking, built on the Qwen3 Large Language Model (LLM) family. It's designed for retrieval-augmented generation (RAG), a technique that enhances the output of large language models by retrieving relevant information from a knowledge base, and other tasks requiring semantic search. You can learn more about embeddings in this video.

Open embedding models such as Qwen3 are the ideal choice when you need greater control, specialization, and security than proprietary, "black-box" APIs can offer. They are particularly well-suited for the following use cases:

Fine-Tuning for Niche Domains📻: by fine-tuning them on specialized data (e.g., legal contracts, medical research, internal company wikis) they can provide more accurate results for semantic search and RAG than a general-purpose model.
Data Privacy & Security🔒: open models can be self-hosted or deployed to cloud resources managed by your organization. This ensures compliance with regulations like GDPR and prevents data from ever leaving your control.
Cost-Effectiveness at Scale💰: for high-volume tasks, running an optimized open model can be cheaper than paying per-API-call fees to a proprietary service provider.
Offline & Edge Deployment🛜: open models can run locally and are perfect for applications that must function without an internet connection, such as on-device search in mobile apps or analysis on remote IoT devices.

I chose the Qwen3-Embedding-4B model due to its growing popularity and suitable size for the Cloud Run environment. However, you can experiment with different sizes (0.6B, 4B, and 8B) depending on your specific use case.

Cloud Run

Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Functions) and a more complex GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.

Deployment

But enough with the intros, let's get our hands dirty with some code 🧑‍💻

Below is a step by step instruction on how to get the Qwen3 Embedding model up and running.

Prepare the environment

First we need to configure the gcloud CLI environment.

Note: if you don’t have gcloud CLI installed please follow instructions available here.

Step 1 - Set your default project:

gcloud config set project PROJECT_ID

Step 2 - Configure Google Cloud CLI to use the _europe-west1 _region for Cloud Run commands:

gcloud config set run/region europe-west1

Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region please refer to this page.

Containerize

We will use Docker and Ollama to run the Qwen3 Embedding model. Create a file named Dockerfile and put the following code inside it:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=dengcao/Qwen3-Embedding-4B:Q4_K_M
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build and deploy

Next it’s time to leverage the power of Cloud Run’s source deployments. With a single command you can:

Build the container image from source (note the –source parameter in the command below)
Upload the container image to an Artifact Registry
Deploy the container to Cloud Run with GPUs enabled (note --gpu and --gpu-type options)
Redirect all traffic to the new deployment

To do all the above, you just need to run:

gcloud run deploy ollama-qwen3-embeddings \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-qwen3-embeddings

Note the following important flags in this command:

--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU quota.
--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication.
--no-cpu-throttling is required for enabling GPU.
--no-gpu-zonal-redundancy set zonal redundancy options depending on your zonal failover requirements and available quota.

Test the deployment

Now that you have successfully deployed the service, you can send requests to it. However, if you send a request directly, Cloud Run will respond with HTTP 401 Unauthorized. This is intentional, because we want our model to be called from other services, such as a RAG application, and not accessible by everyone on the Internet.

The easiest way to test the deployment from a local machine is to spin up the Cloud Run developer proxy by executing:

gcloud run services proxy ollama-qwen3-embeddings --port=9090

Now in a second terminal window run:

curl http://localhost:9090/api/embed -d '{
  "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
  "input": "Sample text"
}'

You should see a response similar to this:

You can also call the endpoint from a Python client. Example:

from ollama import Client

client = Client(host="http://localhost:9090")

response = client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input="Sample text")
print(response)

Congratulations 🎉 Your Cloud Run deployment is up and running!

RAG Example

You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:

Step 1 - Generate Embeddings

Install necessary dependencies:

pip install ollama chromadb

Create an example.py with the following content:

import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in a in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

Step 2 - Retrieve

Next, the following code will search the vector database for the most relevant document (add it to your example.py):

# An example prompt
prompt = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=prompt)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]

Step 3 - Generate final answer

In the generation step we will use a locally installed Qwen3:0.6b.

Note: we use Qwen3 in generation step, but any other model could work here (i.e. Gemini, Gemma, Llama, etc.). Nevertheless it’s critical to use the same embeddings model in step 1 (Generate Embeddings) and step 2 (Retrieve).

You can install the Qwen3:0.6b model by running the following command:

ollama pull qwen3:0.6b

Now we’re ready to combine user’s prompt with search results to generate the final answer (add to example.py):

# Final step - generate a response combining the prompt and data we retrieved in step 2
output = ollama.generate(
    model="qwen3:0.6b",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}",
)

print(output["response"])

Run the code by executing:

python example.py

You should see an answer similar to the one below:

<think>
Okay, the user is asking what Poland's official language is, and they provided the information that Poland's official language is Polish, which is a West Slavic language. Let me make sure I understand this correctly.

First, I need to confirm if that's the correct information. I know that Poland is a country in Eastern Europe, and its official language is Polish. But wait, what's the source of this information? The user hasn't provided any other data, so I should stick strictly to the given information.

I should state that Poland's official language is Polish, and that it's a West Slavic language. I need to present this clearly and concisely. Maybe mention that it's the official language to emphasize its significance. Also, check if there's any other detail that needs to be included, but since the user provided only this, I can proceed.
</think>

Poland's official language is **Polish**. This language is a **West Slavic language**.

Well done! You have just created and run your first RAG application with Qwen3 Embedding model under the hood.

Summary

At this point you have established a Cloud Run service running Qwen3 Embedding model. You can use it to generate embeddings for a semantic search or a RAG application.

Stay tuned for more content around leveraging Qwen3 Embedding in your applications.

Thanks for reading

I hope this article inspired you to experiment with open embedding models on Cloud Run. If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.

I'm always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Polish Large Language Model na Google Cloud (video)

Remigiusz Samborski — Tue, 25 Mar 2025 09:20:00 +0000

Kontynuuję moją przygodę z polskim dużym modelem językowym (PLLuM) na Google Cloud! Tym razem oddaję w Wasze ręce nowy film, który krok po kroku pokaże Wam, jak przygotować i uruchomić ten model na Vertex AI.

Zapraszam do oglądania i komentowania: