<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcelo Costa</title>
    <description>The latest articles on DEV Community by Marcelo Costa (@mesmacosta).</description>
    <link>https://dev.to/mesmacosta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F306400%2F552cc732-1ff0-4577-ae76-7ba29d72b33d.png</url>
      <title>DEV Community: Marcelo Costa</title>
      <link>https://dev.to/mesmacosta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mesmacosta"/>
    <language>en</language>
    <item>
      <title>Taking Action on your GCP bill: Automating BigQuery Storage Cleanup</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 14 Mar 2026 16:21:53 +0000</pubDate>
      <link>https://dev.to/gde/taking-action-on-your-gcp-bill-automating-bigquery-storage-cleanup-4nma</link>
      <guid>https://dev.to/gde/taking-action-on-your-gcp-bill-automating-bigquery-storage-cleanup-4nma</guid>
      <description>&lt;p&gt;In my last post, we explored how to decode GCP Billing with Antigravity and BigQuery MCP to turn an opaque GCP billing export into a granular, custom FinOps CLI. We successfully moved from scratching our heads over cost spikes to having a clear, actionable dashboard right in the terminal.&lt;/p&gt;

&lt;p&gt;But observation is only half the FinOps battle. Once you identify the cost drivers, you need a safe, repeatable way to remediate them.&lt;/p&gt;

&lt;p&gt;Working deeply in the BigQuery ecosystem every day, I frequently see storage costs silently accumulate from staging environments, daily snapshot dumps, or temporary processing tables.&lt;/p&gt;

&lt;p&gt;When it comes to cleaning these up, you often don't want to completely &lt;code&gt;DROP&lt;/code&gt; the tables. Dropping a table means destroying its schema, field descriptions, metadata, and carefully crafted IAM policies. Often, you just want to zero out the storage bytes while keeping the structure intact for the next pipeline run.&lt;/p&gt;

&lt;p&gt;The solution? &lt;code&gt;TRUNCATE TABLE&lt;/code&gt;.&lt;/p&gt;
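&lt;p&gt;At its core, the remediation is one statement per table. A minimal sketch of that building block (the table ID below is a placeholder, not a real project):&lt;/p&gt;

```shell
#!/bin/bash
# Build the TRUNCATE statement for a fully qualified table ID.
# "my-project.staging.stg_events" is a made-up placeholder.
build_truncate_sql() {
  printf 'TRUNCATE TABLE `%s`' "$1"
}

SQL=$(build_truncate_sql "my-project.staging.stg_events")
echo "$SQL"
# In practice this is executed with:
#   bq query --use_legacy_sql=false "$SQL"
```

&lt;p&gt;Unlike &lt;code&gt;DROP TABLE&lt;/code&gt;, this leaves the schema, field descriptions, and IAM policies untouched.&lt;/p&gt;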

&lt;h2&gt;
  
  
  The Ideal State: Dataset Expiration Rules
&lt;/h2&gt;

&lt;p&gt;In a perfect world, the best way to handle these temporary processing tables is to isolate them in a dedicated dataset and configure a &lt;strong&gt;Default Table Expiration&lt;/strong&gt;. By setting this rule at the dataset level, BigQuery automatically drops any table created within it after a specified number of days, with zero maintenance required.&lt;/p&gt;
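&lt;p&gt;For reference, setting that rule is a one-liner. A sketch, assuming a throwaway dataset (the project and dataset names are placeholders); note that &lt;code&gt;bq update --default_table_expiration&lt;/code&gt; takes seconds, not days:&lt;/p&gt;

```shell
#!/bin/bash
# Convert the desired retention into seconds, since
# --default_table_expiration expects seconds.
DAYS=30
EXPIRATION_SECONDS=$((DAYS * 24 * 60 * 60))
echo "$EXPIRATION_SECONDS"

# Printed rather than executed here, since running it needs real
# credentials; "my-project" and "temp_staging" are placeholders.
echo "bq update --default_table_expiration ${EXPIRATION_SECONDS} my-project:temp_staging"
```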

&lt;p&gt;Unfortunately, in the real world, that’s not always possible. &lt;/p&gt;

&lt;p&gt;Data architectures get messy. Staging tables often end up living alongside long-term reference data where a blanket expiration rule would cause chaos. Or, you might need to keep a specific temp table around for an unpredictable amount of time to debug a broken pipeline. When blunt-force dataset rules are too risky or simply not an option due to legacy architecture, you need a more surgical approach.&lt;/p&gt;

&lt;p&gt;Here is how I leveraged Antigravity's agentic workflow to build a reusable bash script to automate this targeted cleanup safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Truncation Tool: Moving from Analysis to Action
&lt;/h2&gt;

&lt;p&gt;When building scripts that perform destructive actions across dozens or hundreds of tables, safety and precision are key. Just like in the exploration phase, having an AI agent that can test raw commands against your actual BigQuery environment via MCP eliminates the usual trial-and-error of writing bash utilities.&lt;/p&gt;

&lt;p&gt;Here is the script to solve this (use at your discretion):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# =============================================================================&lt;/span&gt;
&lt;span class="c"&gt;# BigQuery Table Truncation Script&lt;/span&gt;
&lt;span class="c"&gt;# =============================================================================&lt;/span&gt;
&lt;span class="c"&gt;# Safely truncates tables in a BigQuery dataset based on a prefix.&lt;/span&gt;
&lt;span class="c"&gt;# Defaults to DRY RUN mode.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Usage: ./truncate_tables.sh --project ID --dataset NAME [--prefix PREFIX] [--execute]&lt;/span&gt;
&lt;span class="c"&gt;# ./truncate_tables.sh --project [my-project]--dataset [mydataset] --prefix PREFIX&lt;/span&gt;
&lt;span class="c"&gt;# =============================================================================&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Defaults&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nv"&gt;TABLE_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nv"&gt;DRY_RUN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Colors&lt;/span&gt;
&lt;span class="nv"&gt;RED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[0;31m'&lt;/span&gt;
&lt;span class="nv"&gt;GREEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[0;32m'&lt;/span&gt;
&lt;span class="nv"&gt;YELLOW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[1;33m'&lt;/span&gt;
&lt;span class="nv"&gt;BLUE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[0;34m'&lt;/span&gt;
&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[0m'&lt;/span&gt; &lt;span class="c"&gt;# No Color&lt;/span&gt;
&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'\033[1m'&lt;/span&gt;

&lt;span class="c"&gt;# Parse arguments&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nv"&gt;$# &lt;/span&gt;&lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    case&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nb"&gt;shift &lt;/span&gt;2
            &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="nt"&gt;--dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nb"&gt;shift &lt;/span&gt;2
            &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;TABLE_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nb"&gt;shift &lt;/span&gt;2
            &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="nt"&gt;--execute&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;DRY_RUN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
            shift&lt;/span&gt;
            &lt;span class="p"&gt;;;&lt;/span&gt;
        &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Unknown option: &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; --project ID --dataset NAME [--prefix PREFIX] [--execute]"&lt;/span&gt;
            &lt;span class="nb"&gt;exit &lt;/span&gt;1
            &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="k"&gt;esac&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Validation&lt;/span&gt;
&lt;span class="c"&gt;# Validation&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATASET_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Error: Missing required arguments.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; --project ID --dataset NAME [--prefix PREFIX] [--execute]"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;print_header&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BLUE&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;═══════════════════════════════════════════════════════════════&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BLUE&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BLUE&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;═══════════════════════════════════════════════════════════════&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

print_header &lt;span class="s2"&gt;"🗑️  BigQuery Table Truncation Tool"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DRY_RUN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YELLOW&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;[DRY RUN MODE]&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; No data will be deleted."&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Use --execute to perform the actual truncation."&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RED&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;[EXECUTION MODE]&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Tables WILL be truncated."&lt;/span&gt;
&lt;span class="k"&gt;fi
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TABLE_PREFIX&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fetching ALL tables in &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
    &lt;span class="nv"&gt;TABLES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;bq &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;--project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATASET_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{if(NR&amp;gt;2) print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fetching tables matching prefix '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TABLE_PREFIX&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;' in &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
    &lt;span class="nv"&gt;TABLES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;bq &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;--project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATASET_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TABLE_PREFIX&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi

if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TABLES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TABLE_PREFIX&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
         &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No tables found in dataset '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
    &lt;span class="k"&gt;else
         &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No tables found matching prefix '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TABLE_PREFIX&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
    &lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Found the following tables:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;for &lt;/span&gt;table &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$TABLES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  - &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;table&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="o"&gt;((&lt;/span&gt;COUNT++&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Total tables to truncate: &lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DRY_RUN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GREEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Dry run complete. To truncate these tables, run:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TABLE_PREFIX&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"./scripts/truncate_tables.sh --project &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt; --dataset &lt;/span&gt;&lt;span class="nv"&gt;$DATASET_NAME&lt;/span&gt;&lt;span class="s2"&gt; --execute"&lt;/span&gt;
    &lt;span class="k"&gt;else
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"./scripts/truncate_tables.sh --project &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt; --dataset &lt;/span&gt;&lt;span class="nv"&gt;$DATASET_NAME&lt;/span&gt;&lt;span class="s2"&gt; --prefix &lt;/span&gt;&lt;span class="nv"&gt;$TABLE_PREFIX&lt;/span&gt;&lt;span class="s2"&gt; --execute"&lt;/span&gt;
    &lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Confirmation prompt for Execution Mode&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RED&lt;/span&gt;&lt;span class="k"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;BOLD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;WARNING: You are about to TRUNCATE (delete all data from) the &lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt; tables listed above.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Are you absolutely sure? Type 'CONFIRM' to proceed: "&lt;/span&gt; CONFIRMATION

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONFIRMATION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"CONFIRM"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Operation cancelled."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Starting truncation..."&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;table &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$TABLES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;FULL_TABLE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATASET_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;table&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"Truncating &lt;/span&gt;&lt;span class="nv"&gt;$FULL_TABLE_ID&lt;/span&gt;&lt;span class="s2"&gt; ... "&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;bq query &lt;span class="nt"&gt;--use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt; &lt;span class="s2"&gt;"TRUNCATE TABLE &lt;/span&gt;&lt;span class="se"&gt;\`&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FULL_TABLE_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\`&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GREEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;DONE&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;FAILED&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;fi
done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GREEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;All operations completed.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How It Works (and Why It’s Built This Way)
&lt;/h2&gt;

&lt;p&gt;Writing this script involved piecing together the &lt;code&gt;bq&lt;/code&gt; command-line tool, string manipulation, and standard shell logic. Here are the core design decisions that make it robust:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defaulting to "Dry Run"
The most dangerous scripts are the ones that execute destructive actions by default. This script requires an explicit &lt;code&gt;--execute&lt;/code&gt; flag. If you run a command like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./truncate_tables.sh &lt;span class="nt"&gt;--project&lt;/span&gt; my-project &lt;span class="nt"&gt;--dataset&lt;/span&gt; mDWH_pre &lt;span class="nt"&gt;--prefix&lt;/span&gt; stg_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will simply output a neatly formatted list of the tables it would have truncated, giving you complete visibility before pulling the trigger.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Prefix Targeting via awk and grep&lt;br&gt;
Piping &lt;code&gt;bq ls&lt;/code&gt; output into &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt; can be time-consuming to get right because of how the bq CLI formats its tables. Because Antigravity could validate these commands live via MCP, it quickly nailed the exact regex and column isolation needed to cleanly extract just the table names, whether you are targeting the entire dataset or just a specific prefix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The "Human in the Loop" Failsafe&lt;br&gt;
Even with the &lt;code&gt;--execute&lt;/code&gt; flag, truncating data is a one-way door. To prevent accidental executions from simply up-arrowing in the terminal and hitting enter too quickly, the script implements a hard pause:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Are you absolutely sure? Type 'CONFIRM' to proceed: "&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
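&lt;p&gt;The prefix-filtering pipeline from point 2 can be exercised locally, without touching a real project, by standing in for &lt;code&gt;bq ls&lt;/code&gt; output (the table names below are made up):&lt;/p&gt;

```shell
#!/bin/bash
# Fake `bq ls` output: a header row, a separator row, then one row
# per table, with the table name in the first column.
bq_ls_output() {
  printf '%s\n' \
    '    tableId      Type  ' \
    ' -------------- ------- ' \
    '  stg_orders     TABLE ' \
    '  stg_users      TABLE ' \
    '  dim_customers  TABLE '
}

TABLE_PREFIX="stg_"
# Same filter the script uses: keep rows matching the prefix,
# then isolate the first column (the table name).
TABLES=$(bq_ls_output | grep -E "\b${TABLE_PREFIX}" | awk '{print $1}')
echo "$TABLES"   # stg_orders and stg_users; dim_customers is skipped
```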



&lt;h2&gt;
  
  
  The FinOps Payoff
&lt;/h2&gt;

&lt;p&gt;By combining the analytical script from part one with this targeted remediation script, you close the loop on cloud waste. You can identify the exact dataset driving your BigQuery storage costs, and within seconds, safely truncate hundreds of obsolete staging tables while preserving your carefully constructed data warehouse schema.&lt;/p&gt;

&lt;p&gt;With AI tools like Antigravity providing live context into your environment, creating these bespoke, highly effective utility scripts takes minutes instead of hours. The barrier to maintaining a lean cloud environment has never been lower.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Profiling Memory Leaks in Rust: A Tale of Unexpected Challenges</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Mon, 20 Jan 2025 23:49:22 +0000</pubDate>
      <link>https://dev.to/mesmacosta/profiling-memory-leaks-in-rust-a-tale-of-unexpected-challenges-1p3b</link>
      <guid>https://dev.to/mesmacosta/profiling-memory-leaks-in-rust-a-tale-of-unexpected-challenges-1p3b</guid>
      <description>&lt;p&gt;Rust, with its strong ownership and borrowing system, is well known for its ability to prevent many common programming errors, including memory leaks. However, even Rust isn't immune to these issues under specific circumstances. This blog post serves as a reminder to my past self, who had to identify and resolve a memory leak in a Rust application, and a cautionary tale for my future self, emphasizing the importance of proactive profiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unlikely Culprit: A Rust Memory Leak
&lt;/h2&gt;

&lt;p&gt;Imagine a Rust application deployed to Google Cloud Run. It has been running smoothly for weeks. However, over time, the memory usage gradually increases, leading to eventual crashes due to insufficient memory. In this chart we can see how each day the memory would hit its peak and then reset due to a crash:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti0v98cf38uqpur7cmh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti0v98cf38uqpur7cmh6.png" alt="cloud run metrics" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While Rust's ownership system prevents many common memory errors, certain scenarios can still lead to leaks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reference Cycles&lt;/strong&gt;: Circular references between objects can create a situation where objects hold onto each other, preventing them from being deallocated. This is similar to how memory leaks occur in languages with garbage collection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unintentional Rc or Arc Cycles&lt;/strong&gt;: Using Rc (reference counting) or Arc (atomic reference counting) can introduce cycles if not managed carefully. If objects have strong references to each other through these types, they can keep each other alive indefinitely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Variables with Interior Mutability&lt;/strong&gt;: Global variables with interior mutability (RefCell, Mutex, etc.) can leak memory if the mutable references are not properly managed. If a reference is held indefinitely, the data it points to will also remain in memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgotten drop Implementations&lt;/strong&gt;: If a type owns resources that need explicit deallocation (e.g., file handles, network connections), forgetting to implement the drop trait can lead to resource leaks, which can manifest as memory leaks.&lt;/li&gt;
&lt;/ul&gt;
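&lt;p&gt;The first bullet draws an analogy to garbage-collected languages, and the mechanism is easy to demonstrate there. As a minimal illustration (in Python, purely for the analogy; in Rust, picture two &lt;code&gt;Rc&lt;/code&gt; values holding strong references to each other), reference counting alone can never free a cycle:&lt;/p&gt;

```python
import gc

class Node:
    def __init__(self):
        self.other = None

# Build a two-object cycle: a points to b, b points back to a.
a, b = Node(), Node()
a.other = b
b.other = a

# Drop the only external references. Pure reference counting
# (which is how Rc/Arc work) can never free these objects,
# because each one still holds a strong reference to the other.
del a, b

# Python only recovers the memory because it ships a separate
# cycle collector; Rust has no such collector, so an Rc cycle
# leaks until the process exits.
collected = gc.collect()
print(collected)
```

&lt;p&gt;In Rust, the usual fix is to make one direction of the relationship a &lt;code&gt;Weak&lt;/code&gt; reference, so the cycle never forms.&lt;/p&gt;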
&lt;h2&gt;
  
  
  The Challenge of Troubleshooting Memory Leaks
&lt;/h2&gt;

&lt;p&gt;Pinpointing the root cause of a memory leak can be a challenging task, even for experienced developers. Many programmers avoid diving deep into memory profiling because narrowing down the problem is a time-consuming process of elimination, akin to diagnosing a rare medical condition: you formulate hypotheses, test them, and discard them one by one until the culprit has nowhere left to hide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcsyqx6n0oi35sjeb9m6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcsyqx6n0oi35sjeb9m6.jpg" alt="house" width="272" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case, given the critical nature of our service, we needed to act quickly. Within minutes of identifying the memory leak, we implemented a temporary workaround: a GitHub Actions workflow was set up to automatically restart our Cloud Run service every two hours.&lt;/p&gt;

&lt;p&gt;Basically, we just forced a redeploy pointing to the latest image, using GitHub Actions' cron functionality. Here's a sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Redeploy every 2 hours

on:
  schedule:
    - cron: '0 */2 * * *' # Runs every 2 hours
env:
  ...

jobs:
  init:
    ...
  tenant-deploys:
    needs: [ init ]
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [ tenant-1, tenant-2, tenant-3 ]
    steps:
      ...
      - name: Deploy on Cloud Run
        uses: google-github-actions/deploy-cloudrun@v1
        with:
          service: ${{ matrix.service }}
          image: ${{needs.init.outputs.image_name}}:latest
          region: ${{ env.REGION }}
          gcloud_component: beta
          env_vars: |
            ENV=${{ needs.init.outputs.env }}
            COMMIT_ID=${{ env.COMMIT_ID }}
            RUST_BACKTRACE=full
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was enough to prevent any downtime. Now, back to the drawing board:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfaamsnw2ighp8lht4rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfaamsnw2ighp8lht4rx.png" alt="house drawing board" width="425" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspiration from the Rust Community
&lt;/h2&gt;

&lt;p&gt;I came across this great reference from the community: &lt;a href="https://nnethercote.github.io/perf-book/profiling.html" rel="noopener noreferrer"&gt;The Rust Performance Book&lt;/a&gt;. I started testing the options from its list until I got to Instruments:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7ipw88dv70go3ipvlxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7ipw88dv70go3ipvlxx.png" alt="perf-book" width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That led me to these two videos:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=JRMOIE_wAFk&amp;amp;t=974s" rel="noopener noreferrer"&gt;Profiling Code in Rust by Vitaly Bragilevsky&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=P3dXH61Kr5U" rel="noopener noreferrer"&gt;Profiling DataFusion with Instruments&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I had used different memory profiling tools for other languages in the past; given those recommendations, I decided to explore Instruments' capabilities for profiling my Rust application.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Unexpected Source of the Leak
&lt;/h2&gt;

&lt;p&gt;After looking at the Instruments profiling report:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmiod62gn2199e95xmcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmiod62gn2199e95xmcr.png" alt="report" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was able to narrow it down to the allocation of a few PyO3 objects. The leak was triggered by a complex interaction between Rust and Python that was specific to our application code: the Rust code, calling into Python, was holding onto PyO3 objects that were needed during execution but never released afterwards. Circling back to the beginning of the post, it was a bit like the &lt;strong&gt;Forgotten drop Implementations&lt;/strong&gt; scenario.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A quick tip: if you try to build your Rust binary and use it in Instruments, you may get this error:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnppzg7l9ogcfrulsj40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnppzg7l9ogcfrulsj40.png" alt="Sign Error" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to build the binary with debugging symbols:&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[profile.release]
debug = true
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;and sign the binary as:&lt;br&gt;
&lt;a href="https://forums.developer.apple.com/forums/thread/681687?answerId=734339022#734339022" rel="noopener noreferrer"&gt;https://forums.developer.apple.com/forums/thread/681687?answerId=734339022#734339022&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Again, this is a bit specific to our implementation, since we were loading custom objects into memory: we added a cleanup method that we call after running the Python code. A simple one-liner did it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;py03_module.call_method0(“cleanup”)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rerunning Instruments with the fix in place, the memory stayed well behaved:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4vjchnp5nfuh9qnkqr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4vjchnp5nfuh9qnkqr5.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging Instruments
&lt;/h2&gt;

&lt;p&gt;Instruments proved to be an invaluable tool in identifying the memory leak, one I'd certainly recommend and use again! By analyzing the memory allocation patterns, I was able to pinpoint the exact line of Rust code responsible for the issue. Once the culprit was identified, fixing the memory leak was relatively straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pragmatism over perfection&lt;/strong&gt;: Sometimes, a temporary workaround is the most practical approach. In our case, implementing a quick fix freed us to thoroughly investigate the memory leak without impacting users. This allowed us to dedicate the necessary time to find a permanent solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool up&lt;/strong&gt;: Familiarize yourself with a great memory profiler. When you encounter a memory leak, having the right tools can significantly speed up the debugging process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace the challenge&lt;/strong&gt;: While frustrating at times, hunting down memory leaks can teach you a lot about how the language works. The satisfaction of identifying and resolving the issue is a reward in itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By sharing this experience, I hope to encourage other Rust developers to embrace profiling as a best practice and to watch out for unexpected memory leaks. Happy profiling!&lt;/p&gt;

</description>
      <category>rust</category>
      <category>cloud</category>
      <category>programming</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Customizing Retry Predicates in Google Cloud Python Libraries</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 16 Nov 2024 12:27:16 +0000</pubDate>
      <link>https://dev.to/mesmacosta/customizing-retry-predicates-in-google-cloud-python-libraries-feb</link>
      <guid>https://dev.to/mesmacosta/customizing-retry-predicates-in-google-cloud-python-libraries-feb</guid>
      <description>&lt;p&gt;Google Cloud's Python libraries are designed for resilience. They add strong retry mechanisms to handle transient errors effectively. However, there may be situations where the default retry behavior isn't suitable. For example, you might encounter certain errors that should not trigger a retry, or you may require more control over the retry logic.&lt;/p&gt;

&lt;p&gt;This blog post explores how Google Cloud's Python libraries interact with custom retry predicates, allowing you to customize the retry behavior to better meet your specific requirements.&lt;/p&gt;

&lt;p&gt;In this blog post, I want to highlight a specific example related to using service account impersonation within Google Cloud libraries. In an architecture I designed and am currently working on, we isolate user environments into separate Google Cloud projects. We noticed that some of our services were experiencing degraded performance in certain user flows. After investigating, we traced the issue back to the default retry behavior of the libraries mentioned earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Default Retry Mechanism
&lt;/h2&gt;

&lt;p&gt;Before we go into customization, it's important to understand the default retry behavior of Google Cloud Python libraries. These libraries typically have an exponential backoff strategy with added jitter for retries. This means that when a transient error occurs, the library will retry the operation after a brief delay, with the delay increasing exponentially after each subsequent attempt. The inclusion of jitter introduces randomness to the delay, which helps prevent synchronization of retries across multiple clients.&lt;/p&gt;
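&lt;p&gt;To make that concrete, here is a minimal sketch of exponential backoff with jitter in plain Python. The constants are illustrative only; the actual defaults live in &lt;code&gt;google.api_core.retry&lt;/code&gt;:&lt;/p&gt;

```python
import random

def backoff_delays(initial=1.0, multiplier=2.0, maximum=60.0, attempts=5):
    """Yield the delay to sleep before each retry attempt:
    exponential growth, capped at a maximum, with random jitter
    so that many clients don't retry in lockstep."""
    delay = initial
    for _ in range(attempts):
        # "Full jitter": sleep a random fraction of the current delay.
        yield random.uniform(0, delay)
        delay = min(delay * multiplier, maximum)

for d in backoff_delays():
    print(round(d, 2))
```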

&lt;p&gt;While this strategy is effective in many situations, it may not be ideal for every scenario. For example, if you're using service account impersonation and encounter an authentication error, attempting to retry the operation may not be helpful. In such cases, the underlying authentication issue likely needs to be resolved before a retry can succeed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Custom Retry Predicates
&lt;/h2&gt;

&lt;p&gt;In Google Cloud libraries, custom retry predicates enable you to specify the precise conditions under which a retry attempt should be made. You can create a function that accepts an exception as input and returns True if the operation should be retried, and False if it should not.&lt;/p&gt;

&lt;p&gt;For example, here’s a custom retry predicate that prevents retries for certain authentication errors that occur during service account impersonation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.api_core.exceptions import GoogleAPICallError
from google.api_core.retry import Retry, if_transient_error

def custom_retry_predicate(exception: Exception) -&amp;gt; bool:
    if if_transient_error(exception): # exceptions which should be retried
        if isinstance(exception, GoogleAPICallError):
            if "Unable to acquire impersonated credentials" in exception.message: # look for specific impersonation error
                return False
        return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This predicate checks if the exception is a &lt;code&gt;GoogleAPICallError&lt;/code&gt; and specifically looks for the message "Unable to acquire impersonated credentials". If this condition is met, it returns False, preventing a retry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Custom Predicates with Google Cloud Libraries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Firestore&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import firestore

# ... your Firestore setup ...

retry = Retry(predicate=custom_retry_predicate, timeout=10)

# example of an arbitrary firestore api call, works with all
stream = collection.stream(retry=retry)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

# ... your BigQuery setup ...

retry = Retry(predicate=custom_retry_predicate, timeout=10)

# example of an arbitrary bigquery api call, works with all
bq_query_job = client.get_job(job_id, retry=retry)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In both examples, we create a &lt;code&gt;Retry&lt;/code&gt; object with our custom predicate and a timeout value. This &lt;code&gt;Retry&lt;/code&gt; object is then passed as an argument to the respective API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Custom Retry Predicates
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained control&lt;/strong&gt;: Define retry conditions based on specific exceptions or error messages with precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved efficiency&lt;/strong&gt;: Avoid unnecessary retries for non-transient errors, thus saving resources and time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced application stability&lt;/strong&gt;: Handle specific errors gracefully to prevent cascading failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Custom retry predicates offer an effective way to enhance the resilience of your Google Cloud applications. By customizing the retry behavior to suit your specific requirements, you can ensure that your applications are robust, efficient, and scalable. Take charge of your error handling and master the retry process!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>BigQuery's New JSON Functions: Struct vs. JSON - Choosing the Right Structure</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sun, 25 Aug 2024 18:01:45 +0000</pubDate>
      <link>https://dev.to/mesmacosta/bigquerys-new-json-functions-struct-vs-json-choosing-the-right-structure-2b23</link>
      <guid>https://dev.to/mesmacosta/bigquerys-new-json-functions-struct-vs-json-choosing-the-right-structure-2b23</guid>
      <description>&lt;p&gt;BigQuery recently expanded its capabilities with new JSON helper functions, as seen on their &lt;a href="https://cloud.google.com/bigquery/docs/release-notes" rel="noopener noreferrer"&gt;release notes&lt;/a&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjva42hq6o3r0ts5n8g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjva42hq6o3r0ts5n8g5.png" alt="BigQuery new JSON functions" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Combined with enhancements to &lt;a href="https://cloud.google.com/blog/products/data-analytics/moving-to-log-analytics-for-bigquery-export-users" rel="noopener noreferrer"&gt;log analytics&lt;/a&gt; (which utilizes JSON columns) and the power of &lt;a href="https://cloud.google.com/blog/products/data-analytics/pinpoint-unique-elements-with-bigquery-search-features" rel="noopener noreferrer"&gt;search functions&lt;/a&gt; across JSON data:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqrpkdlcfjwypyg9dl50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqrpkdlcfjwypyg9dl50.png" alt=" " width="800" height="266"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud8tddi1tzvcqs2hf6lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud8tddi1tzvcqs2hf6lt.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's an exciting time to use BigQuery and get the most out of these data types. So let's dive into when to use STRUCT vs. JSON columns in BigQuery, considering their strengths and potential trade-offs.&lt;/p&gt;
&lt;h2&gt;
  
  
  STRUCT
&lt;/h2&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers (
  customer_id INT64,
  customer_name STRING,
  address STRUCT&amp;lt;
    street STRING,
    city STRING,
    state STRING,
    zip_code STRING
  &amp;gt;,
  contact STRUCT&amp;lt;
    email STRING,
    phone STRING
  &amp;gt;
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Enforcement:&lt;/strong&gt; Enforces a clear structure, ensuring data consistency and integrity.&lt;br&gt;
No need to run a &lt;code&gt;JSON_KEYS&lt;/code&gt;-like function just to discover fields. When your data environment grows past a few tables to hundreds, that certainly makes a big difference!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query Performance &amp;amp; Cost Savings&lt;/strong&gt;: Optimized for querying specific nested attributes, leading to potentially faster performance and lower costs for well-structured data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an illustrative example, referencing a BigQuery public dataset, we can see how querying different STRUCT fields impacts the number of bytes processed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjgzif4g83v3qi9ddjpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjgzif4g83v3qi9ddjpk.png" alt=" " width="650" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Querying the &lt;code&gt;ci&lt;/code&gt; field processes significantly fewer bytes compared to querying the &lt;code&gt;system&lt;/code&gt; field, demonstrating the potential cost savings when targeting specific Struct attributes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ci&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ms0ejvh5l8qin9jo4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ms0ejvh5l8qin9jo4e.png" alt=" " width="800" height="53"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;system&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4fxrzz004yseqoxam2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4fxrzz004yseqoxam2y.png" alt=" " width="800" height="50"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Simple syntax with dot notation for accessing nested fields, making queries more readable.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  customer_name, 
  address.city, 
  contact.email 
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  JSON
&lt;/h2&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE products (
  product_id INT64,
  product_name STRING,
  details JSON
);

SELECT 
  product_name, 
  JSON_EXTRACT_SCALAR(details, '$.color') AS color,
  JSON_VALUE(details, '$.price') AS price
FROM products;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Exchange:&lt;/strong&gt; A widely used format for seamless integration with external systems and APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: Handles dynamic or evolving data structures without schema changes - perfect for unpredictable or unstructured data.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"With great flexibility comes great responsibility"&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggf62gm6bxinb8w3b2kj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggf62gm6bxinb8w3b2kj.jpeg" alt=" " width="602" height="402"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The challenge with JSON is handling varying key values. Upstream validation using frameworks like &lt;a href="https://andrew-jones.com/categories/data-contracts/" rel="noopener noreferrer"&gt;data contracts&lt;/a&gt; or other techniques can help enforce consistency, but if that level of rigor is needed, Structs might be a better fit.&lt;/p&gt;
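&lt;p&gt;Even without a full data-contract framework, a lightweight upstream check can catch drifting keys before rows land in a JSON column. A minimal standard-library sketch (the required fields here are illustrative):&lt;/p&gt;

```python
import json

# Keys and types the downstream queries depend on (illustrative).
REQUIRED = {"color": str, "price": (int, float)}

def validate_payload(raw):
    """Parse a JSON document and flag any key that is missing or has
    an unexpected type, so bad rows are rejected before loading."""
    doc = json.loads(raw)
    errors = []
    for key, expected in REQUIRED.items():
        if key not in doc:
            errors.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected):
            errors.append(f"unexpected type for {key}: {type(doc[key]).__name__}")
    return doc, errors

# A price that arrives as a string would silently break numeric queries:
doc, errors = validate_payload('{"color": "red", "price": "9.99"}')
print(errors)
```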

&lt;p&gt;For genuine JSON needs, new functions like &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_keys" rel="noopener noreferrer"&gt;JSON_KEYS&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#JSONPath_mode" rel="noopener noreferrer"&gt;JSONPath_mode&lt;/a&gt; provide powerful tools for querying and managing your data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjncwyr8vvmcp63sldl5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjncwyr8vvmcp63sldl5o.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Structure
&lt;/h2&gt;

&lt;p&gt;The ideal choice between STRUCT and JSON hinges on your specific data characteristics and priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STRUCT&lt;/strong&gt;: When you require strict schema enforcement, predictable query performance, and ease of use with nested data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt;: When you need to accommodate flexible or evolving data structures and prioritize seamless data exchange.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whichever path you choose, BigQuery has you covered! The latest enhancements provide greater control and flexibility in managing both structured and semi-structured data.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>googlecloud</category>
      <category>database</category>
    </item>
    <item>
      <title>Working with Files in Cloud Run Jobs: Introducing GCS Fuse</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sun, 21 Jul 2024 13:42:18 +0000</pubDate>
      <link>https://dev.to/mesmacosta/working-with-files-in-cloud-run-jobs-introducing-gcs-fuse-iob</link>
      <guid>https://dev.to/mesmacosta/working-with-files-in-cloud-run-jobs-introducing-gcs-fuse-iob</guid>
      <description>&lt;p&gt;When it comes to processing files within your Cloud Run Jobs, having a familiar filesystem interface can make things a whole lot easier.  That's where GCS Fuse comes in! It bridges the gap between Google Cloud Storage (GCS) and your Cloud Run Job's environment, allowing you to mount GCS buckets as if they were local directories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GCS Fuse?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplified File Access&lt;/strong&gt;: Read, write, and list files using standard commands and libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: GCS Fuse caches frequently accessed files, making subsequent reads faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: Integrate with your existing file-based workflows and tools effortlessly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cloud Storage Volume Mounts
&lt;/h2&gt;

&lt;p&gt;It used to be a bit of a hassle to set up GCS Fuse in either Cloud Run or Cloud Run jobs: you had to install it manually in a Docker container and start it yourself, as you can see in Google's &lt;a href="https://github.com/GoogleCloudPlatform/python-docs-samples/blob/fe75ea9941bdf30d3ad4cb5d3266c0d64abac1af/run/filesystem/gcsfuse.Dockerfile" rel="noopener noreferrer"&gt;samples repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When Google announced managed support for it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0njp7ew6qjooot0012aq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0njp7ew6qjooot0012aq.png" alt="Release Notes" width="800" height="242"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;It was great news! It made my job, and the job of many folks who leverage "serverless" solutions across different parts of their architectures, much easier!&lt;/p&gt;

&lt;p&gt;Now, what are those cloud storage volume mounts, you may ask?&lt;/p&gt;

&lt;p&gt;The managed version of GCS Fuse leverages a Cloud Run feature called &lt;strong&gt;Cloud Storage volume mounts&lt;/strong&gt;. Essentially, this allows you to specify a GCS bucket in your Cloud Run Job's configuration, and the job will have direct access to the files within that bucket.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;All you need to do is include a volumes section defining the mount point and the GCS bucket you want to access (see the &lt;a href="https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts" rel="noopener noreferrer"&gt;docs&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5cboncnqlmt10fd5cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5cboncnqlmt10fd5cr.png" alt="Yaml File" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    container = run_v2.Container()
    container.volume_mounts = [
        run_v2.VolumeMount(
            name=volume_name,
            mount_path=my_local_dir_path,
        ),
    ]

    job = run_v2.Job()
    job.template.template.volumes = [
        run_v2.Volume(
            name=volume_name,
            gcs=run_v2.GCSVolumeSource(
                bucket=my_bucket_path,
            ),
        ),
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty of it is that, to use any files that live inside the bucket, you abstract away all the GCS code and only need to deal with local files.&lt;/p&gt;

&lt;p&gt;Really simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f = open(f"{my_local_dir_path}/sample-logfile.txt", "a")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, the GCS Fuse mount performs all the necessary list, read, and write operations against the bucket for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tips and Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Keep in mind that GCS Fuse uses caching, so changes you make to files in the mounted directory might not immediately propagate back to GCS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency&lt;/strong&gt;: For multi-worker jobs, be aware of potential concurrency issues if multiple workers try to modify the same file simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Locking&lt;/strong&gt;: GCS Fuse doesn't provide file locking, so consider how your job handles concurrent writes.&lt;/li&gt;
&lt;/ul&gt;
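Given the concurrency and file-locking caveats above, one simple mitigation for multi-worker jobs is to give each task its own file. The helper below is an illustrative sketch (the function name is mine, not part of any SDK); `CLOUD_RUN_TASK_INDEX` is the environment variable Cloud Run Jobs sets on each task:

```python
import os

def task_output_path(mount_dir: str) -> str:
    """Build a per-task file path so parallel Cloud Run tasks never
    write to the same file in the mounted bucket."""
    # CLOUD_RUN_TASK_INDEX is set automatically by Cloud Run Jobs;
    # default to "0" for local runs.
    task_index = os.environ.get("CLOUD_RUN_TASK_INDEX", "0")
    return os.path.join(mount_dir, f"logfile-task-{task_index}.txt")
```

With one file per task index, concurrent writers never touch the same GCS object, sidestepping the lack of file locking entirely.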

&lt;p&gt;&lt;strong&gt;That's it&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;GCS Fuse, and now Cloud Storage volume mounts, provide a powerful way to handle file operations in your Cloud Run Jobs. I use this feature extensively in production; make sure you dive into the official documentation for more details and start leveraging it to enhance your cloud-based workflows.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>database</category>
      <category>cloudstorage</category>
    </item>
    <item>
      <title>How to programmatically backup your Firestore database with simple steps</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sun, 28 Apr 2024 19:40:33 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-programmatically-backup-your-firestore-database-with-simple-steps-1k9o</link>
      <guid>https://dev.to/mesmacosta/how-to-programmatically-backup-your-firestore-database-with-simple-steps-1k9o</guid>
      <description>&lt;p&gt;&lt;strong&gt;Why this post&lt;/strong&gt;? Recently, Google Cloud announced in preview a way to &lt;a href="https://cloud.google.com/firestore/docs/backups" rel="noopener noreferrer"&gt;automatically setup and schedule&lt;/a&gt; your Firestore backups. Prior to the announcement, the &lt;a href="https://cloud.google.com/firestore/docs/solutions/schedule-export" rel="noopener noreferrer"&gt;recommended approach&lt;/a&gt; required multiple serverless components, such as Cloud Functions and Cloud Scheduler.&lt;/p&gt;

&lt;p&gt;At the time this post was written, there was no public documentation on how to call the Google Cloud APIs behind the aforementioned feature, though it could be done &lt;a href="https://cloud.google.com/firestore/docs/backups#create_a_daily_backup_schedule" rel="noopener noreferrer"&gt;using gcloud&lt;/a&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8lcjvpocoxuvfny3npj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8lcjvpocoxuvfny3npj.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  How to do it programmatically with Python
&lt;/h1&gt;

&lt;p&gt;Many users are not aware that the newest API operations or features are sometimes not immediately available in the Google SDKs; for those cases, there is something called the &lt;a href="https://developers.google.com/apis-explorer" rel="noopener noreferrer"&gt;discovery API client&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;In summary, the Google API Discovery service simplifies the process of working with Google APIs by providing structured and standardized documentation, which under the hood is utilized by their client libraries:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cbm2h3v8hec2lf7xcy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cbm2h3v8hec2lf7xcy6.png" alt=" " width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, it's a document that tells machines how to interact with their APIs, which can also be helpful as documentation. I recommend always using Google's per-service SDKs first, and falling back to the discovery client when an operation is unavailable in the SDK or when you want more detail on a service's available operations and models.&lt;/p&gt;
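The discovery document URL follows the same pattern for every Google API, so a small helper (illustrative, not part of any SDK) is enough to target any service:

```python
def discovery_url(service: str, version: str) -> str:
    """Build the standard discovery-document URL for a Google API."""
    return f"https://{service}.googleapis.com/$discovery/rest?version={version}"

# The Firestore v1 discovery document used later in this post:
firestore_url = discovery_url("firestore", "v1")
```

Swapping in `"run"` or `"bigquery"` with the right version gives you the discovery document for those services as well.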

&lt;p&gt;&lt;strong&gt;Then how to use it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, start by installing the &lt;a href="https://pypi.org/project/google-api-python-client/" rel="noopener noreferrer"&gt;google-api-python-client&lt;/a&gt; PyPI package.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkie8ibne0qvkne3ck07b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkie8ibne0qvkne3ck07b.png" alt=" " width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, after looking at the discovery JSON that you can get at this &lt;a href="https://developers.google.com/apis-explorer/" rel="noopener noreferrer"&gt;link&lt;/a&gt; and finding the right service and operation you need to call, you build the service object:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcdj25uc59o4pr2h0lqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcdj25uc59o4pr2h0lqg.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, by inspecting what the &lt;code&gt;gcloud&lt;/code&gt; command was doing, I got to the service I needed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuts4o45tnvrebn8mqit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuts4o45tnvrebn8mqit.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full code sample is here; I hope it helps!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import googleapiclient.discovery

# change to your project and db ids
project_id = "MY_PROJECT_ID"
database_id = "MY_FIRESTORE_DB_ID"

api_service_name = "firestore"
api_version = "v1"
discovery_url = f"https://{api_service_name}.googleapis.com/$discovery/rest?version={api_version}"
service = googleapiclient.discovery.build(
    api_service_name, api_version, discoveryServiceUrl=discovery_url
)
created_backup = (
    service.projects()
    .databases()
    .backupSchedules()
    .create(
        parent=f"projects/{project_id}/databases/{database_id}",
        body={
            "retention": "604800s",
            "dailyRecurrence": {},
        },
    )
    .execute()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I chose 604800s, the equivalent of 7 days, for retention, and &lt;code&gt;dailyRecurrence&lt;/code&gt;, which doesn't require any payload attributes, for daily backups. If you are looking to schedule it weekly instead, change &lt;code&gt;dailyRecurrence&lt;/code&gt; to something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"weeklyRecurrence": {
  # day of week enum
  "day": "MONDAY"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
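To make the two recurrence options and the retention arithmetic concrete, here is a small illustrative helper (the function is my own sketch; the field names match the Firestore v1 request body used above):

```python
def backup_schedule_body(retention_days, weekly_day=None):
    """Build the backupSchedules.create request body.

    Daily backups take an empty dailyRecurrence object; weekly backups
    take a day-of-week enum value such as "MONDAY".
    """
    # Retention is expressed in seconds, as a string with an "s" suffix
    body = {"retention": f"{retention_days * 24 * 60 * 60}s"}
    if weekly_day:
        body["weeklyRecurrence"] = {"day": weekly_day}
    else:
        body["dailyRecurrence"] = {}
    return body

daily = backup_schedule_body(7)            # {"retention": "604800s", "dailyRecurrence": {}}
weekly = backup_schedule_body(7, "MONDAY")
```

Either dict can then be passed as the `body` argument of the `create` call shown earlier.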



</description>
      <category>firestore</category>
      <category>googlecloud</category>
      <category>database</category>
      <category>python</category>
    </item>
    <item>
      <title>How to combine BigQuery with DuckDB</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 27 Apr 2024 14:01:28 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-combine-bigquery-with-duckdb-55gk</link>
      <guid>https://dev.to/mesmacosta/how-to-combine-bigquery-with-duckdb-55gk</guid>
      <description>&lt;p&gt;This blog post will discuss the benefits of integrating Google BigQuery, a leading data warehouse solution, with DuckDB, an embedded analytical database. This powerful combination can enhance your data analysis processes by offering the best of both worlds: BigQuery's massive scalability and DuckDB's agility for quick and on-the-fly queries.&lt;/p&gt;

&lt;p&gt;Before we start, here is a quick summary of the key features for each:  &lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serverless Architecture&lt;/strong&gt;: BigQuery manages infrastructure automatically, scaling to meet query demands without manual resource provisioning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage and Computation Separation&lt;/strong&gt;: Storage scales independently of compute, which reduces costs and lets each be optimized separately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Analytics&lt;/strong&gt;: Supports real-time analysis with the capability to stream and query data almost instantaneously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine Learning Integration&lt;/strong&gt;: BigQuery ML offers machine learning capabilities inside the database, allowing SQL practitioners to build and deploy models using SQL commands.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  DuckDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Process Database&lt;/strong&gt;: Designed for embedded processes, it is ideal for applications and analytics tools requiring a built-in database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple Integration&lt;/strong&gt;: Easy to set up and embed directly in your application, with no separate server to run.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's review two easy options for bringing your BigQuery data to DuckDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Export Data From BigQuery to DuckDB
&lt;/h2&gt;

&lt;p&gt;Export the data to Cloud Storage, then download it manually or with &lt;a href="https://cloud.google.com/storage/docs/gsutil" rel="noopener noreferrer"&gt;gsutil&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPORT DATA
  OPTIONS (
    uri = 'gs://bq_export_demo/export/*.parquet',
    format = 'PARQUET',
    overwrite = true)
AS (
  SELECT ssn, user_name
  FROM `demo-project.bq_dataset_0024.org_extend_rich_schemas_2890`
  ORDER BY user_name
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the &lt;a href="https://duckdb.org/docs/guides/network_cloud_storage/gcs_import.html" rel="noopener noreferrer"&gt;cloud storage import&lt;/a&gt; feature from DuckDB is also possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery Client Library
&lt;/h2&gt;

&lt;p&gt;Make sure your environment has the following libraries installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install duckdb
pip install pyarrow
pip install google-cloud-bigquery
pip install google-cloud-bigquery-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, an efficient way of querying the data is to use the BigQuery Storage client and its underlying abstractions that map the rows to PyArrow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import duckdb
from google.cloud import bigquery

bqclient = bigquery.Client()
table = bigquery.TableReference.from_string(
    "demo-project.bq_dataset_0024.org_extend_rich_schemas_2890"
)
rows = bqclient.list_rows(table)
# Download via the BigQuery Storage API into a PyArrow table
org_extend_rich_schemas_2890 = rows.to_arrow(create_bqstorage_client=True)
cursor = duckdb.connect()
# DuckDB's replacement scan lets SQL reference the local Arrow table by name
print(cursor.execute('SELECT * FROM org_extend_rich_schemas_2890').fetchall())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Push Data from DuckDB to BigQuery
&lt;/h2&gt;

&lt;p&gt;DuckDB has the advantage of allowing you to run everything on your local machine without having to worry about costs. However, it is important to keep in mind that if you are dealing with sensitive or customer-related data, you should take appropriate security measures to protect it.&lt;/p&gt;
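Since the example tables in this post contain an &lt;code&gt;ssn&lt;/code&gt; column, one simple precaution is to redact sensitive fields before they ever reach your local machine. This is a minimal, illustrative sketch in plain Python (in practice you might mask the columns in SQL before exporting):

```python
def redact(rows, sensitive_fields=("ssn",)):
    """Return a copy of the rows with sensitive fields masked out."""
    return [
        {key: ("***" if key in sensitive_fields else value) for key, value in row.items()}
        for row in rows
    ]

rows = [{"ssn": "123-45-6789", "user_name": "alice"}]
safe_rows = redact(rows)  # [{"ssn": "***", "user_name": "alice"}]
```

Anything you then load into DuckDB locally carries only the masked values.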

&lt;h4&gt;
  
  
  DuckDB: Transform Data and Export to Parquet
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Load the Parquet file
CREATE OR REPLACE TABLE original_data AS
SELECT *
FROM read_parquet('/path/bq_export_demo/export/*.parquet');

-- Perform transformations
CREATE OR REPLACE TABLE transformed_data AS
SELECT
    column1,
    column2,
    column3 + 10 AS new_column3,
    UPPER(column4) AS new_column4
FROM original_data;

-- Export the transformed data to a new Parquet file
COPY transformed_data
TO '/path/to/output_file.parquet' (FORMAT 'parquet');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have your transformed Parquet file, upload it to a Cloud Storage bucket and load it into BigQuery using a load job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bq load --source_format=PARQUET --autodetect \
mydataset.new_table \
'gs://your_bucket/path/to/output_file.parquet'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! Combining both is certainly something I have in my data toolkit, and it helps me with my day-to-day work.&lt;/p&gt;

&lt;p&gt;Having said that, here are some final caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t Overload DuckDB with Big Data Tasks&lt;/strong&gt;:&lt;br&gt;
DuckDB is not designed to handle data of the same scale as BigQuery. Avoid using DuckDB for large datasets better suited to BigQuery’s infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t Neglect the Cost Implications&lt;/strong&gt;:&lt;br&gt;
Be mindful of the costs associated with data storage and transfer, especially when moving large amounts of data between BigQuery and DuckDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t Forget to Scale Appropriately&lt;/strong&gt;:&lt;br&gt;
As your data grows or your analytical needs change, revisit your use of BigQuery and DuckDB. Scalability is a crucial concern, and what works at one scale may not work well at another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t Overlook Security&lt;/strong&gt;:&lt;br&gt;
Moving sensitive data from a secure production warehouse to your local environment, or to any environment where DuckDB is used as an embedded database, can raise security concerns. Therefore, it is essential to handle sensitive data with care.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this helps!&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>googlecloud</category>
      <category>duckdb</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Sample code on Service-to-Service Authentication in Google Cloud Run for Production and Local environments</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 13 Apr 2024 13:48:12 +0000</pubDate>
      <link>https://dev.to/mesmacosta/sample-code-on-service-to-service-authentication-in-google-cloud-run-for-production-and-local-environments-ehm</link>
      <guid>https://dev.to/mesmacosta/sample-code-on-service-to-service-authentication-in-google-cloud-run-for-production-and-local-environments-ehm</guid>
      <description>&lt;p&gt;When using Google Cloud Run, securing communications between services is crucial. If your system architecture utilizes multiple services, it's likely that these services will need to communicate with each other either synchronously or asynchronously. Some of these services may be private and require authentication credentials for access.&lt;/p&gt;

&lt;p&gt;It's often hard to find sample code that covers both production and local environments with a good developer experience. The goal of this blog post is to provide sample code for both scenarios in Python and Node/JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Javascript
&lt;/h2&gt;

&lt;p&gt;set up libraries and functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { execSync } from "child_process";
import { GoogleAuth } from "google-auth-library";

function exec(command: string): string {
  return execSync(command).toString().trim();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;get id token for local env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function getLocalIdToken(): string {
  return exec("gcloud auth print-identity-token");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;get id token for production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function getProductionIdToken(url: string) {
  const auth = new GoogleAuth();
  const targetAudience = `https://${url}`;
  const client = await auth.getIdTokenClient(targetAudience);
  return await client.idTokenProvider.fetchIdToken(targetAudience);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;suggested approach to use an env variable to switch it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const idToken = process.env.NODE_ENV === "production"
  ? await getProductionIdToken(url)
  : getLocalIdToken();
// add your additional logic here that uses the idToken in the REST or gRPC call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;

&lt;p&gt;set up libraries and functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import google.auth.transport.requests
import google.oauth2.id_token
from google import auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;get id token for local env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_local_id_token() -&amp;gt; str:
    creds, _ = auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )
    request = google.auth.transport.requests.Request()
    creds.refresh(request)
    return creds.id_token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;get id token for production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_production_id_token(url: str) -&amp;gt; str:
    auth_request = google.auth.transport.requests.Request()
    audience = f"https://{url}"
    return google.oauth2.id_token.fetch_id_token(auth_request, audience=audience)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;suggested approach to use an env variable to switch it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_id_token(url: str, env: str) -&amp;gt; str:
    if env == "production":
        return get_production_id_token(url)

    return get_local_id_token()

# add your additional logic here that uses the id token in the REST or gRPC call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the time of writing this blog post, it was not yet possible to use the exact same code for both strategies. Therefore, I recommend switching the presented logic using an environment or configuration variable. I hope this helps!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>cloudrun</category>
      <category>javascript</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Fix Cloud Run Jobs Logging</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Fri, 29 Dec 2023 14:49:21 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-fix-cloud-run-jobs-logging-1ie</link>
      <guid>https://dev.to/mesmacosta/how-to-fix-cloud-run-jobs-logging-1ie</guid>
      <description>&lt;p&gt;Google Cloud Run is a great product and became even better after allowing you to use it to run background Jobs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzk8kyx0nzx2jdp3o8h8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzk8kyx0nzx2jdp3o8h8.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently it became even more flexible, allowing users to override a bunch of execution args:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F750z1ckxoo4kqrbybz4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F750z1ckxoo4kqrbybz4m.png" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But at the time this blog post was written, it lacked some developer tooling, for instance automatically showing logs in &lt;a href="https://cloud.google.com/logging?hl=en" rel="noopener noreferrer"&gt;Google Cloud Logging&lt;/a&gt; under the right resource type.&lt;/p&gt;

&lt;p&gt;If you set up your application and instrument it to run in Google Cloud Run, the logs are automatically tagged with the &lt;code&gt;gce_instance&lt;/code&gt; resource type. Then, if you go to the Google Cloud Console and open the Logs tab of a Cloud Run Job, you see nothing... because the console expects the logs under the &lt;code&gt;cloud_run_job&lt;/code&gt; resource type.&lt;/p&gt;

&lt;p&gt;This most likely happens because the feature is fairly new, so I'd imagine it will be fixed; in the meantime, here's some Python sample code that fixes this behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

import google.cloud.logging
from google.cloud.logging.handlers import CloudLoggingHandler
from google.cloud.logging_v2.handlers import setup_logging
from google.cloud.logging_v2.resource import Resource
from google.cloud.logging_v2.handlers._monitored_resources import retrieve_metadata_server, _REGION_ID, _PROJECT_NAME

client = google.cloud.logging.Client()

cloud_run_job = os.environ.get("CLOUD_RUN_JOB")
if cloud_run_job:
    region = retrieve_metadata_server(_REGION_ID)
    project = retrieve_metadata_server(_PROJECT_NAME)

    # build a manual resource object
    cr_job_resource = Resource(
        type="cloud_run_job",
        labels={
            "job_name": cloud_run_job,
            "location": region.split("/")[-1] if region else "",
            "project_id": project,
        },
    )
    labels = {"run.googleapis.com/execution_name": os.environ.get("CLOUD_RUN_EXECUTION")}
    handler = CloudLoggingHandler(client, resource=cr_job_resource, labels=labels)
    setup_logging(handler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope it helps!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>serverless</category>
      <category>programming</category>
      <category>logging</category>
    </item>
    <item>
      <title>How to use BigQuery Query Caching with Dynamic Wildcard Tables</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Fri, 29 Dec 2023 14:21:11 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-use-bigquery-query-caching-with-dynamic-wildcard-tables-4mhc</link>
      <guid>https://dev.to/mesmacosta/how-to-use-bigquery-query-caching-with-dynamic-wildcard-tables-4mhc</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Problem: Caching does not work with wildcard tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From BigQuery &lt;a href="https://cloud.google.com/bigquery/docs/querying-wildcard-tables" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyftawqstn59z5o16xllp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyftawqstn59z5o16xllp.png" alt="Wildcard Limitation" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have tables named &lt;code&gt;my_data_2023_*&lt;/code&gt;, where the asterisk represents various months, and you want to analyze data across all of them. Because BigQuery can't know whether new tables matching the wildcard have been created, it invalidates any available cache and always runs a fresh query, so the cache is never used.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just for reference, using date-sharded tables is not a good practice:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp6ubodygndz0ghvlf3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp6ubodygndz0ghvlf3h.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;br&gt;
Recently I faced a scenario where tables were dynamically created based on a business-domain field; the date example is only for illustration purposes. If you are using date-sharded tables, the better solution is to migrate them to BigQuery partitions instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Union THEM ALL!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enter the BigQuery Information Schema:&lt;/p&gt;

&lt;p&gt;The BigQuery &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; views are read-only, system-defined views that provide metadata information about your BigQuery objects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgs7bbuuapa7ofyhuxgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgs7bbuuapa7ofyhuxgk.png" alt=" " width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the &lt;code&gt;tables&lt;/code&gt; view to dynamically generate a list of all tables matching our pattern (e.g., &lt;code&gt;my_data_2023_*&lt;/code&gt;). Then, we leverage &lt;code&gt;UNION ALL&lt;/code&gt; to combine individual queries for each identified table.&lt;/p&gt;

&lt;p&gt;Here's a sample using Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()

# Specify the dataset and wildcard pattern
dataset_id = "your-project.your_dataset"
wildcard_pattern = "my_data_2023_"

# Query the INFORMATION_SCHEMA to get matching table names
query = f"""
    SELECT table_name
    FROM `{dataset_id}.INFORMATION_SCHEMA.TABLES`
    WHERE table_name LIKE '{wildcard_pattern}%'
"""
rows = list(client.query(query))

if rows:
    # Combine the per-table queries with UNION ALL
    view_query = __create_sql(rows[0]["table_name"])
    for row in rows[1:]:
        view_query = f"""
        {view_query}
        UNION ALL
        {__create_sql(row["table_name"])}
        """
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I omitted the &lt;code&gt;__create_sql&lt;/code&gt; function, which simply builds the (potentially complex) SQL for a given table name. With the generated SQL, you can then create a BigQuery view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;view = bigquery.Table(table_ref)
view.view_query = view_query
client.create_table(view, exists_ok=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope that helps, cheers!&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>googlecloud</category>
      <category>bigquery</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Impersonate a Service Account Using Bigquery Client Library</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 30 Sep 2023 19:01:47 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-impersonate-a-service-account-using-bigquery-client-library-3bmc</link>
      <guid>https://dev.to/mesmacosta/how-to-impersonate-a-service-account-using-bigquery-client-library-3bmc</guid>
      <description>&lt;p&gt;If you are not familiar with Service Accounts in Google Cloud, here's a short text explaining it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A service account is a special kind of account typically used by an application or compute workload, such as a Compute Engine instance, rather than a person. A service account is identified by its email address, which is unique to the account.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The most common way to make an application act like a service account is by connecting the service account to the resource where the application is running. For instance, you can link a service account to a Compute Engine instance so that the applications running on that instance can act as the service account. After that, you can give the service account special permissions (IAM roles) so that it, and the applications on the instance, can use Google Cloud resources.&lt;/p&gt;

&lt;p&gt;In some scenarios, such as multi-tenant deployments where you need stricter permission controls for each organisation or customer, it may make sense to tailor down the permissions. There are multiple ways of dealing with this, but recently, upon facing that scenario, I used a Google Cloud feature called Service Account impersonation to isolate each organisation's resource access controls.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When an authenticated principal, such as a user or another service account, authenticates as a service account to gain the service account's permissions, it's called impersonating the service account. Impersonating a service account lets an authenticated principal access whatever the service account can access. Only authenticated principals with the appropriate permissions can impersonate service accounts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's also quite a nice feature, since it allows you to use a short-lived token flow, as described in this part of the Google Cloud documentation:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8re3fq8ske5drzsj70lh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8re3fq8ske5drzsj70lh.png" alt="Google docs" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a quite common scenario if you don't want your engineering team downloading service account keys and potentially exposing those credentials. See &lt;a href="https://cloud.google.com/iam/docs/service-account-impersonation" rel="noopener noreferrer"&gt;Service account impersonation&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use it within BigQuery Client Library
&lt;/h2&gt;

&lt;p&gt;There are several ways of doing Service Account impersonation and many samples out there, but at the time this post was written I couldn't find sample code showing how to do it with the BigQuery client library. So, after some digging and testing, here is a working version:&lt;/p&gt;

&lt;p&gt;Packages used:&lt;br&gt;
&lt;code&gt;pip install google-cloud-bigquery&lt;/code&gt;&lt;br&gt;
&lt;code&gt;pip install google-auth&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google import auth
from google.auth import impersonated_credentials
from google.cloud import bigquery


# Set scopes; the broad cloud-platform scope is usually enough, since the actual
# permissions are set at the Service Account level.
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

source_credentials, project = auth.default()
creds = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal="[MY_SERVICE_ACCOUNT_ID]@[MYGCP_PROJECT_ID].iam.gserviceaccount.com",
    target_scopes=target_scopes,
)
# Set the location to your BigQuery region (e.g. "US" or "europe-west1").
client = bigquery.Client(credentials=creds, project=project, location="US")

# Then run any additional commands with the impersonated credentials, e.g.:
# client.query(...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
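&lt;p&gt;One prerequisite worth noting: for impersonation to work, the calling identity needs the &lt;code&gt;roles/iam.serviceAccountTokenCreator&lt;/code&gt; role on the target service account. A minimal sketch of granting it (the project, account, and user names below are placeholders):&lt;/p&gt;

```shell
# Allow a user (or a source service account) to mint short-lived tokens
# for the target service account. All names here are placeholders.
gcloud iam service-accounts add-iam-policy-binding \
  target-sa@my-gcp-project.iam.gserviceaccount.com \
  --member="user:dev@example.com" \
  --role="roles/iam.serviceAccountTokenCreator"
```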



&lt;p&gt;Hope this helps!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>bigquery</category>
      <category>cloudcomputing</category>
      <category>security</category>
    </item>
    <item>
      <title>How to create an SLO for Cloud Run programmatically</title>
      <dc:creator>Marcelo Costa</dc:creator>
      <pubDate>Sat, 19 Aug 2023 22:16:18 +0000</pubDate>
      <link>https://dev.to/mesmacosta/how-to-create-a-slo-for-cloud-run-programatically-hp</link>
      <guid>https://dev.to/mesmacosta/how-to-create-a-slo-for-cloud-run-programatically-hp</guid>
      <description>&lt;p&gt;The goal of this post is not to explain what Cloud Run or an SLO is, but to provide sample code showing how to set one up programmatically using Google APIs.&lt;/p&gt;

&lt;p&gt;If you want more context around SLOs and general SRE concepts, I recommend taking a look at the free &lt;a href="https://sre.google/workbook/foreword-I/" rel="noopener noreferrer"&gt;Google SRE book&lt;/a&gt; and, more specifically, the &lt;a href="https://sre.google/workbook/implementing-slos/" rel="noopener noreferrer"&gt;SRE chapter&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why this post?&lt;/strong&gt; Recently I faced a scenario where I needed to dynamically create Cloud Run services. It's pretty straightforward to create an SLO through the Cloud Run UI with a few simple steps, but if you are creating services with the &lt;a href="https://cloud.google.com/python/docs/reference/run/latest" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; or any other programming language's SDK, SLO operations are not available.&lt;/p&gt;

&lt;p&gt;At the time this post was written there was no public documentation around using SLOs with Cloud Run through the APIs, so I wanted to share how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to create the SLO
&lt;/h2&gt;

&lt;p&gt;Many users are not aware that the newest API operations or features are sometimes not immediately available in Google's SDKs, but there is something called the &lt;a href="https://developers.google.com/apis-explorer/" rel="noopener noreferrer"&gt;discovery API client&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;In summary, the Google API Discovery service simplifies the process of working with Google APIs by providing structured and standardised documentation, which under the hood is utilised by their own client libraries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx92m773xk4tjint9j1xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx92m773xk4tjint9j1xs.png" alt=" " width="800" height="192"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;p&gt;Basically, it's a document that tells machines how to interact with Google's APIs, which can also be useful as documentation. I recommend always using each Google service's SDK first, and relying on the discovery client only when the operation is not available in the SDK, or when you want more details on what is available for that service and its models.&lt;/p&gt;

&lt;p&gt;So how do you use it?&lt;/p&gt;

&lt;p&gt;First you start by installing the &lt;a href="https://pypi.org/project/google-api-python-client/" rel="noopener noreferrer"&gt;google-api-python-client&lt;/a&gt; PyPI package.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkie8ibne0qvkne3ck07b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkie8ibne0qvkne3ck07b.png" alt=" " width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, after looking at the discovery JSON that you can get at this &lt;a href="https://developers.google.com/apis-explorer/" rel="noopener noreferrer"&gt;link&lt;/a&gt; and finding the right service and operation to call, you build the service object:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcdj25uc59o4pr2h0lqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcdj25uc59o4pr2h0lqg.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;
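&lt;p&gt;The discovery document for any Google API lives at a predictable URL, which is what gets passed to &lt;code&gt;discoveryServiceUrl&lt;/code&gt; in the full sample further down. As a minimal sketch (the helper name is mine):&lt;/p&gt;

```python
# Build the discovery-document URL that googleapiclient.discovery.build
# accepts via its discoveryServiceUrl parameter.
def discovery_url(api: str, version: str) -> str:
    return f"https://{api}.googleapis.com/$discovery/rest?version={version}"

print(discovery_url("monitoring", "v3"))
# https://monitoring.googleapis.com/$discovery/rest?version=v3
```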

&lt;p&gt;By inspecting what the Cloud Run UI was doing, I found the monitoring service and saw that I basically needed three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, make sure you have created your Cloud Run service and copy its name.&lt;/li&gt;
&lt;li&gt;Call the service create operation of the Monitoring API with your Cloud Run service name.&lt;/li&gt;
&lt;li&gt;Call the create_service_level_objective API for each SLO, using the service name generated in step 2, not step 1.&lt;/li&gt;
&lt;/ol&gt;
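&lt;p&gt;The subtle part is step 3: the Monitoring API's create call returns a full resource name like &lt;code&gt;projects/PROJECT/services/SERVICE_ID&lt;/code&gt;, and the SLO calls need that trailing ID rather than the original Cloud Run service name. A minimal sketch of the extraction used in the full sample below (the helper name is mine):&lt;/p&gt;

```python
# Extract the Monitoring service ID from the resource name returned by
# services().create(); the SLO parent is then built from this ID.
def service_id_from_name(resource_name: str) -> str:
    return resource_name.split("/")[-1]

print(service_id_from_name("projects/my-project/services/abc123"))
# abc123
```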

&lt;p&gt;I ended up creating two SLOs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An SLO for latency using a calendar-day config&lt;/li&gt;
&lt;li&gt;An SLO for availability using a rolling-day config&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full code sample is below; hope it helps!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
import os

from google.cloud import monitoring_v3
import googleapiclient.discovery
from googleapiclient import errors


logger = logging.getLogger(__name__)


def run(project_id: str, location: str, service_name: str) -&amp;gt; None:
    try:
        monitoring_client = monitoring_v3.ServiceMonitoringServiceClient()

        api_service_name = 'monitoring'
        api_version = 'v3'
        # https://developers.google.com/apis-explorer/
        discovery_url = f'https://{api_service_name}.googleapis.com/$discovery/rest?version={api_version}'
        service = googleapiclient.discovery.build(api_service_name, api_version, discoveryServiceUrl=discovery_url)

        body = {
                "displayName": service_name,
                "cloudRun": {
                    "serviceName": service_name,
                    "location": location
                }
            }

        created_service = service.services().create(parent=f'projects/{project_id}', body=body).execute()

        if created_service:
            service_id = created_service['name'].split("/")[-1]
            slo_configuration = monitoring_v3.ServiceLevelObjective()
            slo_configuration.display_name = '90% - Latency - Calendar day'
            slo_configuration.goal = 0.9

            request = monitoring_v3.CreateServiceLevelObjectiveRequest()
            slo_configuration.calendar_period = "DAY"
            sli_configuration = monitoring_v3.ServiceLevelIndicator()
            sli_configuration.basic_sli = {
                "latency": {
                    "threshold": "1200s"
                }
            }
            slo_configuration.service_level_indicator = sli_configuration
            request.service_level_objective = slo_configuration
            service_name_for_slo = f'projects/{project_id}/services/{service_id}'
            request.parent = service_name_for_slo
            monitoring_client.create_service_level_objective(request)

            slo_configuration = monitoring_v3.ServiceLevelObjective()
            slo_configuration.display_name = '90% - Availability - Rolling day'
            slo_configuration.goal = 0.9

            request = monitoring_v3.CreateServiceLevelObjectiveRequest()
            slo_configuration.rolling_period = "86400s"
            sli_configuration = monitoring_v3.ServiceLevelIndicator()
            sli_configuration.basic_sli = {
                "availability": {}
            }
            slo_configuration.service_level_indicator = sli_configuration
            request.service_level_objective = slo_configuration
            service_name_for_slo = f'projects/{project_id}/services/{service_id}'
            request.parent = service_name_for_slo
            monitoring_client.create_service_level_objective(request)
    except errors.HttpError:
        logger.info("Monitoring SLOs already created, skipping")


if __name__ == '__main__':
    project_id = os.getenv('PROJECT_ID')
    location = os.getenv('LOCATION')
    service_name = os.getenv('CLOUD_RUN_SERVICE_NAME')
    run(project_id=project_id,
        location=location,
        service_name=service_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>googlecloud</category>
      <category>monitoring</category>
      <category>slo</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
