<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: matthieucham</title>
    <description>The latest articles on DEV Community by matthieucham (@matthieucham).</description>
    <link>https://dev.to/matthieucham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F560117%2Fd229a777-2969-47d8-a2c5-07fcff76bb8f.jpg</url>
      <title>DEV Community: matthieucham</title>
      <link>https://dev.to/matthieucham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/matthieucham"/>
    <language>en</language>
    <item>
      <title>How to pass an Array of Structs in Bigquery's parameterized queries</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Tue, 15 Oct 2024 06:57:39 +0000</pubDate>
      <link>https://dev.to/stack-labs/how-to-pass-an-array-of-structs-in-bigquerys-parameterized-queries-39nm</link>
      <guid>https://dev.to/stack-labs/how-to-pass-an-array-of-structs-in-bigquerys-parameterized-queries-39nm</guid>
<description>&lt;p&gt;In Google's BigQuery, &lt;a href="https://cloud.google.com/bigquery/docs/parameterized-queries" rel="noopener noreferrer"&gt;SQL queries can be parameterized&lt;/a&gt;. If you're not familiar with this concept, it basically means that you can write SQL queries as parameterized templates like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;mydataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mytable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columnA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columnB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;valueA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;valueB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And pass the values separately. This has numerous benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query is more readable than when it's built by string concatenation&lt;/li&gt;
&lt;li&gt;The code is more robust and production-ready&lt;/li&gt;
&lt;li&gt;It's a great protection against SQL injection attacks &lt;a href="https://xkcd.com/327" rel="noopener noreferrer"&gt;(mandatory XKCD)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Passing query parameters from a Python script appears straightforward... at first sight. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.bigquery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalarQueryParameter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ArrayQueryParameter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;StructQueryParameter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
INSERT INTO mydataset.mytable(columnA, columnB)
    VALUES (@valueA, @valueB)
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;ScalarQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valueA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nc"&gt;ScalarQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valueB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example above inserts simple ("Scalar") values into columns A and B. But you can also pass more complex parameters, as sketched below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrays (ArrayQueryParameter)&lt;/li&gt;
&lt;li&gt;Structs (StructQueryParameter)&lt;/li&gt;
&lt;/ul&gt;
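
&lt;p&gt;For instance, here is a minimal sketch of both kinds (the parameter names and values are purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# An array of scalars: every element shares the declared type
ArrayQueryParameter("tags", "STRING", ["blue", "green"])

# A struct: each field is itself a query parameter
StructQueryParameter(
    "country",
    ScalarQueryParameter("name", "STRING", "France"),
    ScalarQueryParameter("capital_city", "STRING", "Paris"),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;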

&lt;p&gt;Problems arise when you want to insert arrays of structs: there are many gotchas, almost no documentation, and very few resources on the subject. The goal of this article is to fill this gap.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to persist an array of structs in BigQuery using parameterized queries
&lt;/h1&gt;

&lt;p&gt;Let's define the following object that we want to store in our destination table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;capital_city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Continent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;countries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;by invoking this parameterized query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;continents&lt;/span&gt; &lt;span class="n"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;countries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nd"&gt;@countries&lt;/span&gt; &lt;span class="n"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Oceania&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A first try, following the &lt;a href="https://cloud.google.com/bigquery/docs/samples/bigquery-query-params-arrays" rel="noopener noreferrer"&gt;shallow documentation&lt;/a&gt;, would be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;ArrayQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RECORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New Zealand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capital_city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wellington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fiji&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capital_city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Suva&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which would fail miserably&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AttributeError: 'dict' object has no attribute 'to_api_repr'&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Gotcha n°1: ArrayQueryParameter's values must be instances of StructQueryParameter
&lt;/h2&gt;

&lt;p&gt;It turns out that the third argument of the constructor - &lt;code&gt;values&lt;/code&gt; - must be a collection of &lt;code&gt;StructQueryParameter&lt;/code&gt; instances, not the raw values themselves. So let's build them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;ArrayQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RECORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;StructQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;ScalarQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nc"&gt;ScalarQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capital_city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capital_city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;countries&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time it works... Until you try to set an empty array&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;ArrayQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RECORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;ValueError: Missing detailed struct item type info for an empty array, please provide a StructQueryParameterType instance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Gotcha n°2: Provide the full structure type as second argument
&lt;/h2&gt;

&lt;p&gt;The error message is pretty clear: "RECORD" is not enough for BigQuery to know what to do with your empty array. It needs the fully detailed structure. So be it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;ArrayQueryParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;StructQueryParameterType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;ScalarQueryParameterType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;ScalarQueryParameterType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capital_city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Notice how the order of the arguments of the &lt;code&gt;...ParameterType&lt;/code&gt; constructor is the reverse of the &lt;code&gt;...Parameter&lt;/code&gt; constructor's. Just another trap on the road...)&lt;/p&gt;

&lt;p&gt;And now it works for empty arrays too, yay!&lt;/p&gt;

&lt;p&gt;One last gotcha to be aware of: &lt;strong&gt;every subfield of a StructQueryParameterType must have a name&lt;/strong&gt;, even though the second parameter (&lt;code&gt;name&lt;/code&gt;) is optional in the constructor. It's actually mandatory for subfields, otherwise you'll get a new kind of error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Empty struct field name&lt;/p&gt;
&lt;/blockquote&gt;
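
&lt;p&gt;To recap, here is a minimal end-to-end sketch assembling the three gotchas. It reuses the names from the examples above and is meant as an illustration rather than production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud.bigquery import (
    ArrayQueryParameter,
    Client,
    QueryJobConfig,
    ScalarQueryParameter,
    ScalarQueryParameterType,
    StructQueryParameter,
    StructQueryParameterType,
)

# Gotcha n°3: every subfield of the struct type must be named
COUNTRY_TYPE = StructQueryParameterType(
    ScalarQueryParameterType("STRING", "name"),
    ScalarQueryParameterType("STRING", "capital_city"),
)


def countries_param(countries):
    """Build the array parameter for a list of Country objects."""
    if not countries:
        # Gotcha n°2: an empty array needs the fully detailed struct type
        return ArrayQueryParameter("countries", COUNTRY_TYPE, [])
    # Gotcha n°1: values must be StructQueryParameter instances, not dicts
    return ArrayQueryParameter(
        "countries",
        "RECORD",
        [
            StructQueryParameter(
                "countries",
                ScalarQueryParameter("name", "STRING", ct.name),
                ScalarQueryParameter("capital_city", "STRING", ct.capital_city),
            )
            for ct in countries
        ],
    )


client = Client()
# "countries" is the list[Country] to store, e.g. a Continent's countries
client.query(
    'UPDATE continents SET countries=@countries WHERE name="Oceania"',
    job_config=QueryJobConfig(query_parameters=[countries_param(countries)]),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;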

&lt;p&gt;I think that's all we need to know to use arrays of records in query parameters. I hope this helps!&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Photo by &lt;a href="https://unsplash.com/fr/@dnevozhai?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Denys Nevozhai&lt;/a&gt; on &lt;a href="https://unsplash.com/fr/photos/photographie-en-contre-plongee-de-linterieur-du-batiment-JsdvKIcvAGo?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>python</category>
      <category>googlecloud</category>
      <category>sql</category>
    </item>
    <item>
      <title>Automatically Update BigQuery View Schema Changes</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Tue, 30 Jul 2024 15:29:59 +0000</pubDate>
      <link>https://dev.to/stack-labs/automatically-update-bigquery-view-schema-changes-27j8</link>
      <guid>https://dev.to/stack-labs/automatically-update-bigquery-view-schema-changes-27j8</guid>
<description>&lt;p&gt;SQL views are virtual tables that simplify data access and improve security. They offer tailored perspectives on the data while protecting sensitive information, and data analysts widely use them to streamline modeling.&lt;/p&gt;

&lt;p&gt;As such, views are a crucial feature of Google Cloud's fully managed data warehouse, BigQuery. However, they have certain &lt;a href="https://cloud.google.com/bigquery/docs/views-intro" rel="noopener noreferrer"&gt;limitations&lt;/a&gt;. One of these limitations can be particularly troublesome for data analysts and end-users:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The schemas of the underlying tables are stored with the view when the view is created. If columns are added, deleted, or modified after the view is created, the view isn't automatically updated and the reported schema will remain inaccurate until the view SQL definition is changed or the view is recreated. Even though the reported schema may be inaccurate, all submitted queries produce accurate results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To see this limitation in action, create a source table with two columns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`demo_devto.source_table`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"b"&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a view above it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="nv"&gt;`demo_devto.expo_view`&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nv"&gt;`demo_devto.source_table`&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, the schema of the view shows the two columns A and B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6nor5lksypwnjjpn9i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6nor5lksypwnjjpn9i1.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now add a column to the source table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`demo_devto.source_table`&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new column is reflected in the source table's schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6vykfljuqqfjqs549ku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6vykfljuqqfjqs549ku.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But not in the view's schema&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n68ga6nquqf50vw5vsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n68ga6nquqf50vw5vsw.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Still, querying the view correctly returns the 3 columns&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qg9ojd430hk2tdj294e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qg9ojd430hk2tdj294e.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article outlines a method to circumvent this limitation and maintain the view's schema in alignment with the underlying table's schema as closely as possible.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A fully serverless event-driven architecture to synchronize schemas
&lt;/h2&gt;

&lt;p&gt;This solution makes use of a log sink to capture audit logs from BigQuery, a PubSub topic where relevant log entries are directed, a PubSub subscription, and a Cloud Run service to process them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq51bmh4xc7w8obrea9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq51bmh4xc7w8obrea9d.png" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's review each step and dive into the details&lt;/p&gt;

&lt;h3&gt;
  
  
  1. BigQuery audit logs
&lt;/h3&gt;

&lt;p&gt;All Google Cloud services generate logs which are viewable in Cloud Logging. BigQuery is no exception, and its audit logs offer all the information we need. See their structure &lt;a href="https://cloud.google.com/bigquery/docs/reference/auditlogs" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cloud Logging log sink
&lt;/h3&gt;

&lt;p&gt;A log sink is a location where logs are collected and stored. Google Cloud Logging log sinks collect logs within a scope: project, folder, or organization. So to capture table update logs for a whole organization, a log sink at the organization level is needed. To monitor a single project, a sink at the project level is enough.&lt;/p&gt;

&lt;p&gt;A log sink must declare a filter. This is very important to limit costs - which depend on the volume of captured logs - and to process relevant events only. Here we are using the following filter to capture events about schema changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource.type="bigquery_resource"
AND protoPayload.serviceName="bigquery.googleapis.com"
AND protoPayload.methodName="tableservice.update"
AND protoPayload.authenticationInfo.principalEmail !~ &amp;lt;regex identifying the service account used by the cloud run service who process logs&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The filter on principalEmail excludes the updates to exposition views made by the Cloud Run service itself, since our focus lies solely on source table update events.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, we need to give the sink a destination, where the log entries that pass the filter are routed. &lt;a href="https://cloud.google.com/logging/docs/export/aggregated_sinks#supported-destinations" rel="noopener noreferrer"&gt;Several kinds of destinations are possible.&lt;/a&gt; Because our architecture is event-driven, the selected destination is a PubSub topic. Each log entry is then encoded as JSON.&lt;/p&gt;

&lt;p&gt;Here is how to provision such a sink with Terraform, at project level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_logging_project_sink"&lt;/span&gt; &lt;span class="s2"&gt;"demo"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-project"&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"logsink-demo"&lt;/span&gt;
  &lt;span class="nx"&gt;destination&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"pubsub.googleapis.com/${google_pubsub_topic.demo.id}"&lt;/span&gt;
  &lt;span class="nx"&gt;filter&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOT&lt;/span&gt;&lt;span class="sh"&gt;
    resource.type="bigquery_resource"
    AND protoPayload.serviceName="bigquery.googleapis.com"
    AND protoPayload.methodName="tableservice.update"
    AND protoPayload.authenticationInfo.principalEmail !~ "^sa-demo@myproject.iam.gserviceaccount.com$"
&lt;/span&gt;&lt;span class="no"&gt;  EOT
&lt;/span&gt;  &lt;span class="nx"&gt;unique_writer_identity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_topic_iam_member"&lt;/span&gt; &lt;span class="s2"&gt;"demo"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"roles/pubsub.publisher"&lt;/span&gt;
  &lt;span class="nx"&gt;member&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_logging_project_sink&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;writer_identity&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. PubSub topic and subscription
&lt;/h3&gt;

&lt;p&gt;The PubSub topic is the destination of the log events that pass the log sink filter.&lt;/p&gt;

&lt;p&gt;To consume these events, a subscription in push mode sends them to an HTTPS endpoint.&lt;/p&gt;

&lt;p&gt;Here is an example of how these resources can be provisioned with Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_topic"&lt;/span&gt; &lt;span class="s2"&gt;"demo"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"topic-demo"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_subscription"&lt;/span&gt; &lt;span class="s2"&gt;"demo"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sub-demo"&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;ack_deadline_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;

  &lt;span class="nx"&gt;push_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;push_endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;URL&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;cloud&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;oidc_token&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;service_account_email&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_service_account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;expiration_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4, 5 and 6: Events processing
&lt;/h3&gt;

&lt;p&gt;The processing of log events is performed by a Cloud Run service in this system, but it could also be done by a Cloud Function, for example.&lt;/p&gt;
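
&lt;p&gt;Before any decoding, the service has to unwrap the PubSub push envelope. Here is a minimal sketch of the HTTP handler, assuming Flask (the framework and route are illustrative choices, not imposed by the architecture):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def receive_event():
    envelope = request.get_json()
    # The PubsubMessage itself, with its base64-encoded "data" field
    message = envelope["message"]
    # ... decode and process `message` as shown below ...
    return ("", 204)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;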

&lt;p&gt;In Python, the decoding of incoming events can be done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;bq_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By parsing the bq_log object, we can retrieve the updated table id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.bigquery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TableReference&lt;/span&gt;

&lt;span class="n"&gt;RESOURCENAME_PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^projects/(?P&amp;lt;project&amp;gt;[^/]+)/datasets/(?P&amp;lt;dataset&amp;gt;[^/]+)/tables/(?P&amp;lt;table&amp;gt;[^/]+)$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resource_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protoPayload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RESOURCENAME_PATTERN&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;TableReference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_api_repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to identify the views which rely on this source table. Here, associations between source tables and exposition views are registered in a Firestore database, but other designs are possible. For example, you could query the INFORMATION_SCHEMA.VIEWS metadata views and identify the affected views by parsing the content of the VIEW_DEFINITION column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VIEW_DEFINITION&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`demo_devto.INFORMATION_SCHEMA.VIEWS`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
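
&lt;p&gt;A rough sketch of that alternative (the helper name is hypothetical, and the substring match on VIEW_DEFINITION is deliberately naive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def find_affected_views(client, dataset, table_ref):
    """List the views of a dataset whose definition mentions the updated table."""
    rows = client.query(
        f"SELECT TABLE_NAME, VIEW_DEFINITION "
        f"FROM `{dataset}.INFORMATION_SCHEMA.VIEWS`"
    ).result()
    return [
        row.TABLE_NAME
        for row in rows
        if table_ref.table_id in row.VIEW_DEFINITION
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;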



&lt;p&gt;Finally, synchronize all affected views. BigQuery views seem not to support updating the "schema" field via the &lt;code&gt;update_table()&lt;/code&gt; method when columns are added. The recommended way is therefore to re-create the views with &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_view_statement" rel="noopener noreferrer"&gt;SQL DDL statements&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
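
&lt;p&gt;In Python, re-creating a view from its own stored definition is enough to refresh the reported schema. A minimal sketch (the helper name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def refresh_view_schema(client, view_id):
    """Re-create a view in place so BigQuery refreshes its stored schema."""
    view = client.get_table(view_id)  # e.g. "my-project.demo_devto.expo_view"
    ddl = f"CREATE OR REPLACE VIEW `{view_id}` AS {view.view_query}"
    client.query(ddl).result()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;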



&lt;p&gt;With all steps pieced together, any schema update on a source table automatically triggers the re-creation of the exposition views, keeping the schemas synchronized after a short delay!&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Cover picture by &lt;a href="https://unsplash.com/fr/@migueldelmar?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Miguel Delmar&lt;/a&gt; on &lt;a href="https://unsplash.com/fr/photos/plan-deau-bleu-W4qWlYbYqI4?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>eventdriven</category>
      <category>googlecloud</category>
      <category>dataengineering</category>
    </item>
    <item>
<title>What exactly is exactly-once delivery?</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Fri, 10 Feb 2023 08:15:50 +0000</pubDate>
      <link>https://dev.to/stack-labs/pubsub-exactly-once-delivery--1dd3</link>
      <guid>https://dev.to/stack-labs/pubsub-exactly-once-delivery--1dd3</guid>
<description>&lt;p&gt;Google Cloud PubSub is a serverless implementation of a publish-subscribe messaging service. It's built around the concepts of topics (where messages are published to) and subscriptions (where messages are consumed from).&lt;/p&gt;

&lt;p&gt;There are 3 types of subscriptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt;: message consumption is initiated by PubSub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt;: message consumption is initiated by the subscriber&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery&lt;/strong&gt;: special mode where the subscriber is a BigQuery agent which stores messages in a BigQuery table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PubSub now offers an &lt;strong&gt;Exactly-once delivery&lt;/strong&gt; option for &lt;strong&gt;Pull&lt;/strong&gt; subscriptions; this option has been GA since December 2022 &lt;a href="https://cloud.google.com/pubsub/docs/exactly-once-delivery" rel="noopener noreferrer"&gt;(doc)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's see what that means in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a delivery?
&lt;/h2&gt;

&lt;p&gt;In PubSub context, a delivery is a process which encompasses the following items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sending a message to a consumer&lt;/li&gt;
&lt;li&gt;receiving an acknowledgement (ack) from the consumer before the ack delay of the sent message times out.&lt;/li&gt;
&lt;li&gt;Alternatively, the consumer can send a "nack" (negative acknowledgment) instead of an ack. It tells the sender that the message could not be processed and must be sent again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When the ack of the message is received, the message is considered delivered by PubSub.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the delivery process in Pub/Sub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4nauaxo0qiswtdikvef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4nauaxo0qiswtdikvef.png" alt="Image description" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The publisher sends the message to the topic&lt;/li&gt;
&lt;li&gt;The subscriber pulls the message from the subscription. The message ack delay starts here&lt;/li&gt;
&lt;li&gt;When the processing is done, the subscriber acknowledges the message&lt;/li&gt;
&lt;li&gt;The message is marked as delivered.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any failure occurring during this flow - a networking issue, a VM crash - can potentially lead to new delivery attempts, resulting in duplicate outputs if multiple attempts eventually succeed for the same message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exactly-once delivery
&lt;/h2&gt;

&lt;p&gt;The usual guarantee offered by PubSub is &lt;strong&gt;at least once delivery&lt;/strong&gt;. It means that in case of such a failure, PubSub will attempt to deliver the message again, until it's successfully acked (or the subscription retention limit is reached).&lt;/p&gt;

&lt;p&gt;The exactly-once delivery option ensures that PubSub will not resend messages&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;while the ack (or nack) is not received and the delay has not expired&lt;/li&gt;
&lt;li&gt;once the ack is received&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guarantee is made possible by the use of &lt;strong&gt;persistent storage&lt;/strong&gt; by PubSub agents: contrary to the default mode, where message statuses are stored in transient memory, Exactly-once uses a regional persistent storage service. Hence, this guarantee is enforced at the regional level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc0886m5ped0gskmoa0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc0886m5ped0gskmoa0q.png" alt="Image description" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
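
&lt;p&gt;On the subscriber side, the Python client library exposes this mode through &lt;code&gt;ack_with_response()&lt;/code&gt;, which returns a future confirming whether the ack was accepted. A minimal sketch (project and subscription names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber import exceptions as sub_exceptions

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")


def callback(message):
    # ... process the message here ...
    try:
        # Only meaningful on subscriptions with exactly-once delivery enabled
        message.ack_with_response().result()
    except sub_exceptions.AcknowledgeError:
        # The ack was not accepted (e.g. expired deadline):
        # expect the message to be redelivered
        pass


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;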

&lt;h2&gt;
  
  
  Consequences
&lt;/h2&gt;

&lt;p&gt;The Exactly-once mode ensures that, under the conditions detailed above, the message will only be delivered once. However, &lt;strong&gt;this doesn't mean that no message will ever be sent multiple times&lt;/strong&gt;. How come?&lt;/p&gt;

&lt;p&gt;Indeed, one has to distinguish between duplicates and legitimate redeliveries. For example, if the consumer takes so long to process a message that the ack delay expires, or if the process crashes and no ack is sent back to PubSub, then PubSub has no way to know that the message was already processed. The message will thus be sent again in response to the next Pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As a consequence, if "exactly-once" message processing is of primary importance for you and you absolutely want to avoid any duplicate, you have to take extra care at every layer of your system&lt;/strong&gt;. You need to pick ack delays according to the maximum time the processing can take (including retries and such). You also need to &lt;a href="https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow" rel="noopener noreferrer"&gt;mitigate all risks of duplicates at each stage of your workflow&lt;/a&gt;. As you can see, &lt;strong&gt;Exactly-once solves the central step of the pipeline below&lt;/strong&gt;, not the others, which still have to be taken care of:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmaevdh6apqyhob3guih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmaevdh6apqyhob3guih.png" alt="Image description" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;
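
&lt;p&gt;For instance, a common consumer-side mitigation is to make the processing idempotent with a deduplication key. A hedged sketch, where the store and the business logic are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;processed_ids = set()  # in real life: a persistent store (Firestore, Redis...)


def handle(message):
    if message.message_id in processed_ids:
        message.ack()  # already processed: just ack the redelivery
        return
    process(message.data)  # your business logic (assumed to exist)
    processed_ids.add(message.message_id)
    message.ack()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;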

&lt;p&gt;PubSub's Exactly-once option is definitely a good step forward in helping to reach this goal, but it doesn't solve everything. End-to-end exactly-once delivery remains a challenge for any event-based system!&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Cover picture by &lt;a href="https://unsplash.com/@brettjordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Brett Jordan&lt;/a&gt; on &lt;a href="https://unsplash.com/fr/photos/phUtWl8RyFE?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>pubsub</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to mock a decorator in Python</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Thu, 26 Jan 2023 15:58:31 +0000</pubDate>
      <link>https://dev.to/stack-labs/how-to-mock-a-decorator-in-python-55jc</link>
      <guid>https://dev.to/stack-labs/how-to-mock-a-decorator-in-python-55jc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Programming is the art of adding bugs to an empty text file.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mastering the &lt;code&gt;unittest.mock&lt;/code&gt; capabilities of Python to prevent these bugs is another art of its own. And one of the trickiest moves of them all is mocking a decorator so that it doesn't get in the way of the function you want to test.&lt;/p&gt;

&lt;p&gt;Let's consider the following function to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;### module api.service
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;utils.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;decorator_in_the_way&lt;/span&gt;

&lt;span class="nd"&gt;@decorator_in_the_way&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;function_to_be_tested&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# do something
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem here is to write a unit test for &lt;code&gt;function_to_be_tested&lt;/code&gt; without invoking the decorator (which would make your test fail).&lt;/p&gt;

&lt;p&gt;The catch: it's not possible to &lt;code&gt;@patch&lt;/code&gt; the decorator above the test function, because it's too late: in Python, functions are decorated at module-loading time. So this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;### file service_test.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;function_to_be_tested&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mock_decorator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;


&lt;span class="nd"&gt;@patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.service.decorator_in_the_way&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mock_decorator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_function_to_be_tested&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function_to_be_tested&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;... simply does not work. The patched decorator will be ignored, as the function to be tested has already been decorated with the original one.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a workaround. The idea is to patch the decorator &lt;strong&gt;before&lt;/strong&gt; the module to test is loaded. Of course the requirement is that the decorator and the function to test &lt;strong&gt;don't belong to the same module&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following example will work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;### file service_test.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wraps&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mock_decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decorate by doing nothing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorated_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorated_function&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;

&lt;span class="c1"&gt;# PATCH THE DECORATOR HERE
&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utils.decorators.decorator_in_the_way&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mock_decorator&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# THEN LOAD THE SERVICE MODULE
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;function_to_be_tested&lt;/span&gt;

&lt;span class="c1"&gt;# ALL TESTS OF THE TEST SESSION WILL USE THE PATCHED DECORATOR 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_function_to_be_tested&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function_to_be_tested&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# uses the mock decorator
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
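
&lt;p&gt;One caveat: the &lt;code&gt;patch(...).start()&lt;/code&gt; call is never stopped, so the patched attribute stays in place for the whole test session and can leak into other test modules. If needed, stop it in a module-level teardown (a minimal sketch, assuming pytest conventions; note that modules already imported keep the no-op decoration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from unittest import mock


def teardown_module():
    # stop every patcher started with .start() in this module
    mock.patch.stopall()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;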



&lt;p&gt;Hope this helps!&lt;/p&gt;

&lt;p&gt;I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;join an enthusiastic Data Engineering or Cloud Developer team&lt;/a&gt;, please contact us.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>13 tricks for the new Bigquery Storage Write API in Python</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Wed, 16 Nov 2022 09:35:56 +0000</pubDate>
      <link>https://dev.to/stack-labs/13-tricks-for-the-new-bigquery-storage-write-api-in-python-296e</link>
      <guid>https://dev.to/stack-labs/13-tricks-for-the-new-bigquery-storage-write-api-in-python-296e</guid>
      <description>&lt;p&gt;In order to stream data into a BigQuery table programmatically, Google is promoting a new API: The &lt;a href="https://cloud.google.com/bigquery/docs/write-api" rel="noopener noreferrer"&gt;Storage Write API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, the usual API and its infamous &lt;strong&gt;tabledata.insertAll&lt;/strong&gt; method are now called the "legacy streaming API", which does not sound very appealing when starting a new project.&lt;/p&gt;

&lt;p&gt;Indeed, as stated in the &lt;a href="https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery" rel="noopener noreferrer"&gt;official Google Cloud documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Moreover, the new API is advertised with a lower price and new features such as the possibility of &lt;em&gt;exactly-once delivery&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Exciting, isn't it?&lt;/p&gt;

&lt;p&gt;Well, it is, but the Python client wrapping the new API is very bare-metal and its usage does not feel pythonic at all. As a consequence, integrating this API is much more difficult than it usually is with other Google Cloud clients, which is quite surprising. Having recently completed the integration of this new product, I can speak from experience: I faced an unusually high number of challenges integrating the Storage Write API into a Python application for the most common use case: writing data rows into a BigQuery table.&lt;/p&gt;

&lt;p&gt;This article aims to list these issues and help future developers to overcome them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Describe the target schema with Protobuf
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/protocol-buffers" rel="noopener noreferrer"&gt;Protocol Buffers aka Protobuf&lt;/a&gt; is the portable exchange format widely used amongst Google Cloud API. It is usually hidden in the implementation when client libraries are used. With the new streaming API however, you will have to dive into it.&lt;/p&gt;

&lt;p&gt;Protobuf relies on a schema descriptor. It describes how exchanged data are structured. The descriptor is written as a &lt;code&gt;.proto&lt;/code&gt; text file where all fields, their type and their cardinality are listed.&lt;/p&gt;

&lt;p&gt;The first thing to do when integrating the Storage Write API is to write the proto file. &lt;strong&gt;The message description must match the schema of the target BigQuery table&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;same field names (case insensitive)&lt;/li&gt;
&lt;li&gt;same structure (nested types are supported)&lt;/li&gt;
&lt;li&gt;compatible field types (see the type mapping table &lt;a href="https://cloud.google.com/bigquery/docs/write-api#data_type_conversions" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trick #1: use proto2 syntax
&lt;/h3&gt;

&lt;p&gt;Protobuf now exists in two flavours: proto2 and proto3, the newer one.&lt;br&gt;
proto2 works well with BigQuery, whereas there are still some issues with proto3, which is fairly recent. Moreover, all examples provided by GCP currently use proto2. So for now I recommend sticking to proto2.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trick #2: all your fields are optional
&lt;/h3&gt;

&lt;p&gt;In proto2 syntax you can declare a field as optional or required. This possibility was removed from proto3 (optional is implicit). In proto2, Google now recommends declaring all your fields as optional, even if they are REQUIRED in the BigQuery schema. However, you will still see some &lt;code&gt;required&lt;/code&gt; fields in GCP examples like &lt;a href="https://github.com/googleapis/python-bigquery-storage/blob/main/samples/snippets/customer_record.proto" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trick #3: auto-generate the proto file
&lt;/h3&gt;

&lt;p&gt;Writing a &lt;code&gt;.proto&lt;/code&gt; descriptor can be very tedious if the target schema has many columns with deeply nested structures: don't forget that the descriptor has to match the target schema &lt;strong&gt;exactly&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;You can ease the pain by autogenerating some of the proto file from the BigQuery schema. First, download the target schema with &lt;code&gt;bq&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bq show &lt;span class="nt"&gt;--schema&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prettyjson dataset_name:project_name:target_table_name &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; schema_target_table.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use some scripting to convert the downloaded schema file into a proto. &lt;a href="https://github.com/matthiasa4/GCPUtils/tree/master/GenerateProtoBufferFile" rel="noopener noreferrer"&gt;ParseToProtoBuffer.py&lt;/a&gt;, courtesy of &lt;a href="https://github.com/matthiasa4" rel="noopener noreferrer"&gt;matthiasa4&lt;/a&gt;, is useful for inspiration, as is the sketch below.&lt;/p&gt;
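
&lt;p&gt;Here is a minimal sketch of such a script. It only handles flat schemas with a few scalar types (nested RECORD fields are left out), and the type mapping is an assumption to adjust to your own tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Assumed mapping from BigQuery scalar types to proto2 types;
# see the official type conversion table for the full list.
TYPE_MAP = {
    "STRING": "string",
    "INTEGER": "int64",
    "TIMESTAMP": "int64",  # epoch in microseconds, see trick #4
    "FLOAT": "double",
    "BOOLEAN": "bool",
    "BYTES": "bytes",
}

with open("schema_target_table.json") as schema_file:
    schema = json.load(schema_file)

proto_lines = ['syntax = "proto2";', "", "message Mymessage {"]
for index, field in enumerate(schema, start=1):
    label = "repeated" if field.get("mode") == "REPEATED" else "optional"
    proto_lines.append(
        f'  {label} {TYPE_MAP[field["type"]]} {field["name"].lower()} = {index};'
    )
proto_lines.append("}")
print("\n".join(proto_lines))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;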

&lt;h3&gt;
  
  
  Trick #4: BigQuery TIMESTAMPs are protobuf int64
&lt;/h3&gt;

&lt;p&gt;Even though protobuf provides a timestamp data type, the best way to send a timestamp value to BigQuery is to declare an &lt;code&gt;int64&lt;/code&gt; field.&lt;br&gt;
Set the field value to the epoch timestamp &lt;strong&gt;in microseconds&lt;/strong&gt; and it will be automatically converted into a BigQuery TIMESTAMP.&lt;/p&gt;
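
&lt;p&gt;For instance, assuming a message with an &lt;code&gt;int64&lt;/code&gt; field named &lt;code&gt;created_at&lt;/code&gt; (a hypothetical name) mapped to a TIMESTAMP column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime

# mymsg is an instance of your generated message class
now = datetime.datetime.now(datetime.timezone.utc)
# epoch in MICROseconds, not milliseconds nor seconds
mymsg.created_at = int(now.timestamp() * 1_000_000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;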
&lt;h2&gt;
  
  
  Generate Python Protobuf objects
&lt;/h2&gt;

&lt;p&gt;The next step is to generate Python code from the &lt;code&gt;.proto&lt;/code&gt; file.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trick #5: install or upgrade protoc
&lt;/h3&gt;

&lt;p&gt;Ensure you have installed the latest version of &lt;a href="https://grpc.io/docs/protoc-installation/" rel="noopener noreferrer"&gt;protoc&lt;/a&gt;, the Protobuf compiler.&lt;/p&gt;

&lt;p&gt;The aim of this tool is to generate code artefacts from proto files, in several target languages. Of course we pick Python. Invoke protoc like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;protoc &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--python_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;  schema_target_table.proto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The outcome is a file named &lt;code&gt;schema_target_table_pb2.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The content of the generated file is surprising: it appears to be lacking a lot of definitions! The reason is that the missing parts are dynamically inserted at runtime by the Protobuf Python library. As a consequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your IDE will be mad at you&lt;/li&gt;
&lt;li&gt;Pylint will insult you&lt;/li&gt;
&lt;li&gt;you have to take a guess about the missing definition names&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Come on Google, are you serious?&lt;/p&gt;

&lt;h3&gt;
  
  
  Trick #6: make _pb2.py files pass Pylint
&lt;/h3&gt;

&lt;p&gt;Simply put the following line on top of each _pb2 file, and Pylint will leave you alone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pylint: skip-file
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trick #7: Import the missing classes
&lt;/h3&gt;

&lt;p&gt;The generated classes have the following format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same name as the message&lt;/li&gt;
&lt;li&gt;In case of a nested type, it will be accessible as a class variable of the parent type / class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's illustrate. The following proto file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="na"&gt;syntax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"proto2"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Mymessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;optional&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;myfield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Mysubmessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;optional&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;mysubfield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;optional&lt;/span&gt; &lt;span class="n"&gt;Mysubmessage&lt;/span&gt; &lt;span class="na"&gt;mycomplexfield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;repeated&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will generate Python classes which can be imported like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.schema_target_table_pb2.py&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Mymessage&lt;/span&gt;


&lt;span class="n"&gt;submessage_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mymessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Mysubmessage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Needless to say, your IDE will turn red because of these imports. You will have to tame Pylint too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pylint: disable=no-name-in-module
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set Protobuf object fields
&lt;/h2&gt;

&lt;p&gt;Filling proto fields up is very counterintuitive. It's a good thing that Google provides &lt;a href="https://developers.google.com/protocol-buffers/docs/pythontutorial" rel="noopener noreferrer"&gt;exhaustive documentation&lt;/a&gt; about it.&lt;/p&gt;

&lt;p&gt;Here is a straight-to-the-point TL;DR:&lt;/p&gt;

&lt;h3&gt;
  
  
  Trick #8: Simple (scalar) type fields can be directly assigned
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mymsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mymessage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mymsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;myfield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trick #9: Use CopyFrom for nested type fields
&lt;/h3&gt;

&lt;p&gt;Yes, &lt;code&gt;CopyFrom()&lt;/code&gt;, a Python method name in CamelCase starting with an uppercase letter. Come on, Google!&lt;/p&gt;

&lt;p&gt;Anyway, you cannot assign a complex field directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mymsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mymessage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mymsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;myfield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;mysubmsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mymessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Mysubmessage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mymsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mycomplexfield&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CopyFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mysubmsg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
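
&lt;p&gt;Alternatively, you can write through the nested field directly: accessing &lt;code&gt;mymsg.mycomplexfield&lt;/code&gt; lazily creates the submessage, so this also works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;mymsg = Mymessage()
# no direct assignment of the submessage, but setting its fields is allowed
mymsg.mycomplexfield.mysubfield = "tata"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;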



&lt;h3&gt;
  
  
  Trick #10: Use append for repeated fields
&lt;/h3&gt;

&lt;p&gt;You mustn't instantiate an empty list. Append to the field as if it already existed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mymsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mymessage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mymsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
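
&lt;p&gt;If you have several values at once, repeated fields also support &lt;code&gt;extend&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;mymsg.collection.extend(["titi", "tata"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;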



&lt;h2&gt;
  
  
  Store Protobuf object into Bigquery
&lt;/h2&gt;

&lt;p&gt;The next step is to store the Protobuf objects into BigQuery. Here again, there are some tricks to know:&lt;/p&gt;

&lt;h3&gt;
  
  
  Trick #11: Be a dataEditor
&lt;/h3&gt;

&lt;p&gt;The user or service account performing the storage must have &lt;code&gt;bigquery.tables.updateData&lt;/code&gt; permission on the target table.&lt;/p&gt;

&lt;p&gt;You get this permission from the &lt;code&gt;bigquery.dataEditor&lt;/code&gt; role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trick #12: Don't set a package name in the proto file
&lt;/h3&gt;

&lt;p&gt;In many proto file samples a &lt;code&gt;package&lt;/code&gt; directive is set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package foo.bar;

message Mymessage{...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is to avoid name clashes. But packages are not really useful in Python (generated classes are identified by their file path), and moreover &lt;strong&gt;package names are not supported by BigQuery in nested message types when storing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So, just don't set a package.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trick #13: abstract the storage in a manager.
&lt;/h3&gt;

&lt;p&gt;The Protobuf object is ready to be inserted at last! Adapt the &lt;a href="https://github.com/googleapis/python-bigquery-storage/blob/main/samples/snippets/append_rows_proto2.py" rel="noopener noreferrer"&gt;snippet given by Google&lt;/a&gt; to your own code to perform the storage. As you can see, it's not really a one-liner: more than 20 lines are necessary just to set up the destination stream. Besides, each append operation requires an AppendRowsRequest to be created, which is tedious too.&lt;/p&gt;

&lt;p&gt;It's a good idea to wrap all these tasks in a practical Manager class for your application to use. Here is an example implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Wrapper around BigQuery call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery_storage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.bigquery_storage_v1&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bqstorage_exceptions&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.bigquery_storage_v1&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.protobuf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;descriptor_pb2&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.protobuf.descriptor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Descriptor&lt;/span&gt;



&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DefaultStreamManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# pragma: no cover
&lt;/span&gt;    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Manage access to the _default stream write streams.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;table_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;message_protobuf_descriptor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Descriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bigquery_storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BigQueryWriteClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Init.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/_default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_protobuf_descriptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message_protobuf_descriptor&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Init the underlying stream manager.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a template with fields needed for the first request.
&lt;/span&gt;        &lt;span class="n"&gt;request_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AppendRowsRequest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# The initial request must contain the stream name.
&lt;/span&gt;        &lt;span class="n"&gt;request_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;
        &lt;span class="c1"&gt;# So that BigQuery knows how to parse the serialized_rows, generate a
&lt;/span&gt;        &lt;span class="c1"&gt;# protocol buffer representation of our message descriptor.
&lt;/span&gt;        &lt;span class="n"&gt;proto_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProtoSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;proto_descriptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;descriptor_pb2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DescriptorProto&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# pylint: disable=no-member
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_protobuf_descriptor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CopyToProto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proto_descriptor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;proto_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proto_descriptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proto_descriptor&lt;/span&gt;
        &lt;span class="n"&gt;proto_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendRowsRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProtoData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;proto_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writer_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proto_schema&lt;/span&gt;
        &lt;span class="n"&gt;request_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proto_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proto_data&lt;/span&gt;
        &lt;span class="c1"&gt;# Create an AppendRowsStream using the request template created above.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AppendRowsStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_template&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_appendrowsrequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendRowsRequest&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendRowsFuture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send request to the stream manager. Init the stream manager if needed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;bqstorage_exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StreamClosedError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# the stream needs to be reinitialized
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="c1"&gt;# Use as a context manager
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DefaultStreamManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Enter the context manager. Return the stream name.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Exit the context manager : close the stream.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Shutdown background threads and close the streaming connection.
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append_rows_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BigqueryWriteManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Encapsulation for bigquery client.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bigquery_storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BigQueryWriteClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pb2_descriptor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Descriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# pragma: no cover
&lt;/span&gt;        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a BigQueryManager.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_id&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pb2_descriptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pb2_descriptor&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pb_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write data rows.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DefaultStreamManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pb2_descriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_storage_write_client&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;target_stream_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;proto_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProtoRows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c1"&gt;# Create a batch of row data by appending proto2 serialized bytes to the
&lt;/span&gt;            &lt;span class="c1"&gt;# serialized_rows repeated field.
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pb_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;proto_rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serialized_rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SerializeToString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="c1"&gt;# Create an append row request containing the rows
&lt;/span&gt;            &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AppendRowsRequest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;proto_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendRowsRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProtoData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;proto_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proto_rows&lt;/span&gt;
            &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proto_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proto_data&lt;/span&gt;

            &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_stream_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_appendrowsrequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Wait for the append row requests to finish.
&lt;/span&gt;            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
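
&lt;p&gt;A hypothetical usage, with placeholder project, dataset and table ids, and the &lt;code&gt;Mymessage&lt;/code&gt; class generated earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery_storage
from schema_target_table_pb2 import Mymessage  # pylint: disable=no-name-in-module

manager = BigqueryWriteManager(
    project_id="my-project",
    dataset_id="mydataset",
    table_id="mytable",
    bigquery_storage_write_client=bigquery_storage.BigQueryWriteClient(),
    pb2_descriptor=Mymessage.DESCRIPTOR,
)

row = Mymessage()
row.myfield = "toto"
manager.write_rows([row])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;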



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This API is promising, but so much more difficult to integrate in a Python app than usual! Hopefully, Google will publish a more high-level client library in the future.&lt;/p&gt;

&lt;p&gt;If that doesn't happen, I hope I have at least spared you some headaches with this API.&lt;/p&gt;

&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@emilybernal?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Emily Bernal&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/thirteen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>googlecloud</category>
      <category>programming</category>
      <category>api</category>
    </item>
    <item>
      <title>BigQuery transactions over multiple queries, with sessions</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Mon, 09 May 2022 15:05:01 +0000</pubDate>
      <link>https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5</link>
      <guid>https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5</guid>
<description>&lt;p&gt;BigQuery has supported transactions since last year (presented at Google Cloud Next '21): it is now possible to perform mutating operations over one or several tables and then commit or roll back the result atomically, by wrapping the script between&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COMMIT&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ROLLBACK&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Easy enough! It's all explained &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions" rel="noopener noreferrer"&gt;in the official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Yet transactions come with a limitation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A transaction cannot span multiple scripts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this is not an issue most of the time, it can be a problem when the scripts enclosed in the transaction become too complex or have too many query parameters, or break any other &lt;a href="https://cloud.google.com/bigquery/quotas#jobs" rel="noopener noreferrer"&gt;quota of BigQuery jobs&lt;/a&gt;. This can happen when query scripts are auto-generated from a request payload, for example.&lt;/p&gt;

&lt;p&gt;There is a way around it, with BigQuery sessions. Let's see how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery Sessions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery/docs/sessions-intro" rel="noopener noreferrer"&gt;Sessions&lt;/a&gt; are a way to link jobs and persist transient data, like temporary tables, between them.&lt;/p&gt;

&lt;p&gt;One common use case for sessions is exactly what we want:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create multi-statement transactions over multiple queries. Within a session, you can begin a transaction, make changes, and view the temporary result before deciding to commit or rollback. You can do this over several queries in the session. If you do not use a session, a multi-statement transaction needs to be completed in a single query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The idea is to stack the transaction queries inside the same session, beginning with &lt;code&gt;BEGIN TRANSACTION;&lt;/code&gt; and ending with &lt;code&gt;COMMIT TRANSACTION;&lt;/code&gt;.&lt;br&gt;
In between, you can put as many queries as necessary and the whole session will behave atomically.&lt;/p&gt;

&lt;p&gt;A session is closed automatically after 24 hours of inactivity. However, when mixed with transactions, it can happen that the targeted table gets "locked" in the session and becomes unusable until the session ends. That's why I recommend forcing the session to end at the end of the script, by invoking the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;BQ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ABORT_SESSION&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python implementation
&lt;/h2&gt;

&lt;p&gt;We are dealing with a session that we need to open and then always close at the end of the processing: a context manager is the natural fit for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;ContextManager wrapping a bigquery session.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BigquerySession&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;ContextManager wrapping a bigquerySession.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bqclient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bqlocation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Construct instance.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bigquery_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bqclient&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bqlocation&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initiate a Bigquery session and return the session_id.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT 1;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# a query can't fail
&lt;/span&gt;            &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# wait job completion
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_id&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Abort the opened session.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# abort the session in any case to have a clean state at the end
&lt;/span&gt;            &lt;span class="c1"&gt;# (sometimes in case of script failure, the table is locked in
&lt;/span&gt;            &lt;span class="c1"&gt;# the session)
&lt;/span&gt;            &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL BQ.ABORT_SESSION();&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;create_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;connection_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_id&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then becomes really easy to use this context manager to stack jobs into a single session, and thus to create a multi-statement, multi-script BigQuery transaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;BigquerySession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BIGQUERY_LOCATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# open transaction
&lt;/span&gt;    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BEGIN TRANSACTION;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;create_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;connection_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY_LOCATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# stack queries
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;queryscript&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;queryscript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;create_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;connection_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY_LOCATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# end transaction
&lt;/span&gt;    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMMIT TRANSACTION;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;create_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;connection_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY_LOCATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how all jobs are run with the same session_id (i.e. within the same session) and in the same location (this is a requirement for sessions).&lt;/p&gt;
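
&lt;p&gt;Since every statement in the session shares the same boilerplate job configuration, it can be factored out into a small helper. Here is a minimal sketch, assuming the &lt;code&gt;BigquerySession&lt;/code&gt; context manager defined above (the &lt;code&gt;_run_in_session&lt;/code&gt; helper is illustrative, not part of the client library):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

bigquery_client = bigquery.Client()
BIGQUERY_LOCATION = "EU"  # illustrative; all jobs of a session share one location


def _run_in_session(client, query, session_id, location):
    """Run one statement inside an existing BigQuery session and wait for it."""
    return client.query(
        query,
        job_config=bigquery.QueryJobConfig(
            create_session=False,
            connection_properties=[
                bigquery.query.ConnectionProperty(key="session_id", value=session_id)
            ],
        ),
        location=location,
    ).result()


with BigquerySession(bigquery_client, BIGQUERY_LOCATION) as session_id:
    _run_in_session(bigquery_client, "BEGIN TRANSACTION;", session_id, BIGQUERY_LOCATION)
    for queryscript in scripts:  # the SQL scripts to stack in the transaction
        _run_in_session(bigquery_client, queryscript, session_id, BIGQUERY_LOCATION)
    _run_in_session(bigquery_client, "COMMIT TRANSACTION;", session_id, BIGQUERY_LOCATION)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;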

&lt;p&gt;Hope this helps!&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Photo by Caroline Selfors on Unsplash&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>python</category>
      <category>database</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>How to overcome Cloud Run's 32MB request limit</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Wed, 06 Apr 2022 10:27:36 +0000</pubDate>
      <link>https://dev.to/stack-labs/how-to-overcome-cloud-runs-32mb-request-limit-190j</link>
      <guid>https://dev.to/stack-labs/how-to-overcome-cloud-runs-32mb-request-limit-190j</guid>
      <description>&lt;p&gt;&lt;a href="https://cloud.google.com/run" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; is an awesome serverless product provided by Google Cloud which is often a perfect fit to run containerized web services. It offers many advantages such as autoscaling, rolling updates, autorestart, scale to 0 to name just a few. All of it without the hassle of provisioning and managing any cluster !&lt;/p&gt;

&lt;p&gt;You would definitely pick this product to host, say, a Python Flask Rest API with the following design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglk53v5nv9xlmox1ewl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglk53v5nv9xlmox1ewl5.png" alt="Image description" width="800" height="140"&gt;&lt;/a&gt;&lt;br&gt;
1- Upload a data file by HTTP POST to a REST endpoint&lt;br&gt;
2- Process the file&lt;br&gt;
3- Insert the data into BigQuery using the client lib&lt;/p&gt;

&lt;p&gt;Which is perfectly fine... unless you want to be able to handle data files bigger than 32 MB!&lt;/p&gt;

&lt;p&gt;Indeed, Cloud Run won't let you upload such a big file. Instead, you'll get an error message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;413: Request entity too large&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Congratulations, you've just hit the hard &lt;a href="https://cloud.google.com/run/quotas#cloud_run_limits" rel="noopener noreferrer"&gt;size limit of Cloud Run inbound requests&lt;/a&gt;.&lt;br&gt;
But don't worry: you can keep using Cloud Run for your service if you apply the improved design below.&lt;/p&gt;
&lt;h2&gt;
  
  
  Improved design, with Cloud Storage, Signed Url and PubSub notifications
&lt;/h2&gt;

&lt;p&gt;To work around the limitation, you can design a solution based upon &lt;a href="https://cloud.google.com/storage/docs/access-control/signed-urls" rel="noopener noreferrer"&gt;Cloud Storage signed urls&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4l2lg839dxyviad8g2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj4l2lg839dxyviad8g2.png" alt="Image description" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, the file is not directly uploaded to the REST endpoint, but uploaded to cloud storage instead, thus bypassing the 32 MB limitation.&lt;/p&gt;

&lt;p&gt;The downside of this process is that the client has to make two requests instead of one. Hence, the whole new sequence goes like this:&lt;/p&gt;

&lt;p&gt;1- the client requests a signed url to upload to&lt;br&gt;
2- the webservice, using the Cloud Storage client, generates a signed url and returns it to the client&lt;br&gt;
3- the client uploads the file to the Cloud Storage bucket directly (HTTP PUT to the signed url)&lt;br&gt;
4- at the end of the file upload, the notification &lt;code&gt;OBJECT_FINALIZE&lt;/code&gt; is sent to PubSub&lt;br&gt;
5- the notification is then pushed back to the webservice on Cloud Run through a subscription&lt;br&gt;
6- the webservice reacts to the notification by downloading the file&lt;br&gt;
7- the webservice can then process the file, in the exact same way it did it in the original design&lt;br&gt;
8- likewise, data are inserted into BigQuery&lt;/p&gt;

&lt;p&gt;This design is entirely serverless and scales neatly, without any single point of failure. Now, let's see in more detail how to implement it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Make a signed url from Cloud Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gotcha!&lt;/strong&gt; The Cloud Run service is required to have the role &lt;code&gt;roles/iam.serviceAccountTokenCreator&lt;/code&gt; in order to be able to generate a signed url. This is not really documented, and if you don't grant it, you get an HTTP 403 error without much more information.&lt;/p&gt;

&lt;p&gt;This Python code, courtesy of &lt;a href="https://blog.bavard.ai/how-to-generate-signed-urls-using-python-in-google-cloud-run-835ddad5366" rel="noopener noreferrer"&gt;this blog post by Evan Peterson&lt;/a&gt;, shows how to produce signed urls with the Cloud Run webservice's default service account, without requiring the private key file locally (which is a big no-no, for security reasons!)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_signed_upload_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/octet-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Compute a GCS signed upload URL without needing a private key file.
    Can only be called when a service account is used as the application
    default credentials, and when that service account has the proper IAM
    roles, like `roles/storage.objectCreator` for the bucket, and
    `roles/iam.serviceAccountTokenCreator`.
    Source: https://stackoverflow.com/a/64245028

    Parameters
    ----------
    bucket : str
        Name of the GCS bucket the signed URL will reference.
    blob : str
        Name of the GCS blob (in `bucket`) the signed URL will reference.
    exp : timedelta, optional
        Time from now when the signed url will expire.
    content_type : str, optional
        The required mime type of the data that is uploaded to the generated
        signed url.
    min_size : int, optional
        The minimum size the uploaded file can be, in bytes (inclusive).
        If the file is smaller than this, GCS will return a 400 code on upload.
    max_size : int, optional
        The maximum size the uploaded file can be, in bytes (inclusive).
        If the file is larger than this, GCS will return a 400 code on upload.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Perform a refresh request to populate the access token of the
&lt;/span&gt;        &lt;span class="c1"&gt;# current credentials.
&lt;/span&gt;        &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_signed_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expiration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;service_account_email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_account_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-Content-Length-Range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
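
&lt;p&gt;To wire this into the webservice (step 2 of the improved sequence), a minimal Flask sketch could look like the following; the &lt;code&gt;/upload-url&lt;/code&gt; route and the size limit are illustrative, and the bucket name matches the Terraform below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import uuid

from flask import Flask, jsonify

app = Flask(__name__)
UPLOAD_BUCKET = "upload-big-files"


@app.route("/upload-url", methods=["GET"])
def get_upload_url():
    # a unique blob name, so that concurrent clients never collide
    blob_name = f"incoming/{uuid.uuid4()}"
    url = make_signed_upload_url(
        UPLOAD_BUCKET,
        blob_name,
        content_type="application/octet-stream",
        max_size=int(5e8),  # accept files up to ~500 MB
    )
    return jsonify({"upload_url": url, "blob": blob_name})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;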



&lt;h3&gt;
  
  
  Terraform
&lt;/h3&gt;

&lt;p&gt;There is no robust way to do cloud without infrastructure as code, and &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; is the perfect tool to manage your Cloud resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom2n7915d1enl0751fmx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom2n7915d1enl0751fmx.jpeg" alt="Image description" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the Terraform fragments for deploying this design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Resources to handle big data files (&amp;gt;32 Mb)&lt;/span&gt;
&lt;span class="c1"&gt;# These files are uploaded to a special bucket with notifications&lt;/span&gt;

&lt;span class="k"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"google-beta"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;GCP&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"google_project"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_storage_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_bucket"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;GCP&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"upload-big-files"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EU"&lt;/span&gt;

  &lt;span class="nx"&gt;cors&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;origin&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;response_header&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"Access-Control-Allow-Origin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"X-Goog-Content-Length-Range"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;max_age_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_service_account"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sa-webservice"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_storage_bucket_iam_member"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_admin"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_storage_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;
  &lt;span class="nx"&gt;member&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;google_service_account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# required to generate a signed url&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_service_account_iam_member"&lt;/span&gt; &lt;span class="s2"&gt;"tokencreator"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;service_account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_service_account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"roles/iam.serviceAccountTokenCreator"&lt;/span&gt;
  &lt;span class="nx"&gt;member&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;google_service_account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# upload topic for notifications&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_topic"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_topic"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"topic-bigframes"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# upload deadletter topic for failed notifications&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_topic"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_topic_deadletter"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"topic-bigframesdeadletter"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# add frame upload notifications on the bucket&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_storage_notification"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_notification"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_storage_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;payload_format&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"JSON_API_V1"&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;event_types&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"OBJECT_FINALIZE"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;google_pubsub_topic_iam_binding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_binding&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# required for storage notifications&lt;/span&gt;
&lt;span class="c1"&gt;# seriously, Google, this should be by default !&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_topic_iam_binding"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_binding"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"roles/pubsub.publisher"&lt;/span&gt;
  &lt;span class="nx"&gt;members&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:service-&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;google_project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@gs-project-accounts.iam.gserviceaccount.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# frame upload main sub&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_subscription"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_sub"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sub-bigframes"&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;push_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;push_endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;URL&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="nx"&gt;pushed&lt;/span&gt; &lt;span class="nx"&gt;notification&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;POST-ed&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;dead_letter_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;dead_letter_topic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_topic_deadletter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# frame upload deadletter subscription&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_pubsub_subscription"&lt;/span&gt; &lt;span class="s2"&gt;"bigframes_sub_deadletter"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;provider&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google-beta&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sub-bigframesdeadletter"&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_pubsub_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigframes_topic_deadletter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;ack_deadline_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;

  &lt;span class="nx"&gt;push_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;push_endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;URL&lt;/span&gt; &lt;span class="nx"&gt;where&lt;/span&gt; &lt;span class="nx"&gt;pushed&lt;/span&gt; &lt;span class="nx"&gt;notification&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;POST-ed&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just &lt;code&gt;terraform apply&lt;/code&gt; it!&lt;/p&gt;
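
&lt;p&gt;Note that the &lt;code&gt;push_endpoint&lt;/code&gt; placeholders above point at the Cloud Run webservice. On that side, the notification handler (steps 5 to 7 of the sequence) can be sketched as follows, assuming Flask; the route and the processing are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64
import json

from flask import Flask, request
from google.cloud.storage import Client

app = Flask(__name__)
storage_client = Client()


@app.route("/notifications", methods=["POST"])
def handle_upload_notification():
    # PubSub push requests wrap the storage notification in a base64 payload
    envelope = request.get_json()
    payload = json.loads(base64.b64decode(envelope["message"]["data"]))
    # the JSON_API_V1 payload carries the bucket and object names
    blob = storage_client.bucket(payload["bucket"]).blob(payload["name"])
    data = blob.download_as_bytes()
    # ... process `data` and insert into BigQuery, as in the original design ...
    return "", 204  # ack the message so that PubSub does not retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;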

&lt;h3&gt;
  
  
  How to upload
&lt;/h3&gt;

&lt;p&gt;One final gotcha: to upload to Cloud Storage with the signed url, you must set an additional header in the &lt;code&gt;PUT&lt;/code&gt; request:&lt;br&gt;
&lt;code&gt;X-Goog-Content-Length-Range: &amp;lt;min size&amp;gt;,&amp;lt;max size&amp;gt;&lt;/code&gt;&lt;br&gt;
where &lt;code&gt;min size&lt;/code&gt; and &lt;code&gt;max size&lt;/code&gt; match &lt;code&gt;min_size&lt;/code&gt; and &lt;code&gt;max_size&lt;/code&gt; of the &lt;code&gt;make_signed_upload_url()&lt;/code&gt; method above.&lt;/p&gt;
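
&lt;p&gt;Client side, the whole sequence then boils down to two HTTP calls. A minimal sketch with the &lt;code&gt;requests&lt;/code&gt; library, assuming the illustrative &lt;code&gt;/upload-url&lt;/code&gt; endpoint sketched earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

SERVICE_URL = "https://my-service-xyz.a.run.app"  # illustrative Cloud Run URL

# 1- request a signed url to upload to
resp = requests.get(f"{SERVICE_URL}/upload-url")
resp.raise_for_status()
upload_url = resp.json()["upload_url"]

# 3- PUT the file directly to Cloud Storage; the size-range header and the
# content type must match the values used when signing the url
with open("big_data_file.csv", "rb") as f:
    upload = requests.put(
        upload_url,
        data=f,
        headers={
            "Content-Type": "application/octet-stream",
            "X-Goog-Content-Length-Range": f"1,{int(5e8)}",
        },
    )
upload.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;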

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Have you experienced this design? How would you improve it? Please let me know in the comments.&lt;/p&gt;

&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Design schemas made with &lt;a href="https://excalidraw.com/" rel="noopener noreferrer"&gt;Excalidraw&lt;/a&gt; and the GCP Icons library by &lt;a class="mentioned-user" href="https://dev.to/clementbosc"&gt;@clementbosc&lt;/a&gt; &lt;br&gt;
Cover photo by &lt;a href="https://unsplash.com/@joel_herzog?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;joel herzog&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/overcome?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>terraform</category>
      <category>python</category>
    </item>
    <item>
      <title>Brace yourself: Using BigQuery as an operational backend</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Tue, 11 Jan 2022 10:38:34 +0000</pubDate>
      <link>https://dev.to/stack-labs/brace-yourself-using-bigquery-as-an-operational-backend-3e1i</link>
      <guid>https://dev.to/stack-labs/brace-yourself-using-bigquery-as-an-operational-backend-3e1i</guid>
      <description>&lt;h2&gt;
  
  
BigQuery as a backend? WTF?
&lt;/h2&gt;

&lt;p&gt;BigQuery has become a major player in the field of Data analytics solutions. It provides an &lt;a href="https://cloud.google.com/bigquery/docs/introduction" rel="noopener noreferrer"&gt;ever-growing list of powerful features&lt;/a&gt; in an easy, performant and cost-effective way. However, BigQuery is definitely &lt;a href="https://techdifferences.com/difference-between-oltp-and-olap.html" rel="noopener noreferrer"&gt;OLAP, while the sensible option for an application backend is OLTP.&lt;/a&gt;&lt;br&gt;
Therefore, using BigQuery as a backend may seem weird, and... it is indeed! Just to scratch the surface of the many problems that would arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BigQuery is not optimized for writing, but for performing complex queries.&lt;/li&gt;
&lt;li&gt;Single-line inserts are discouraged.&lt;/li&gt;
&lt;li&gt;BigQuery does not enforce primary keys, foreign keys or constraints&lt;/li&gt;
&lt;li&gt;BigQuery does not perform well with normalized schemas; on the contrary, it encourages denormalization&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So, if you think about BigQuery to be the storage backend of a full-blown application, you should definitely think again.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There can be some situation though, where this design has some interest...&lt;/p&gt;

&lt;p&gt;Think of a huge analytics data platform revolving around BigQuery. Now imagine a corner case of the platform where some configuration has to be stored in a relational fashion, and this configuration data has an impact on how data is accessed. In order to update the configuration, an API is exposed. There are now 2 options to store the configuration values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The normal way: set up a transactional database like PostgreSQL or MySQL, maybe through &lt;a href="https://cloud.google.com/sql" rel="noopener noreferrer"&gt;CloudSQL since we're dealing with the Google Cloud Platform&lt;/a&gt;. Use it as the application's storage backend. Set up redundancy and a backup strategy. Sort out IAM permissions. Query configuration data from BigQuery via &lt;a href="https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries" rel="noopener noreferrer"&gt;federated queries&lt;/a&gt;. All of this will of course cost you some extra dollars.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The hacky way: store the configuration in a BigQuery dataset somewhere, and benefit from the near-free hosting and redundancy provided by this serverless database. Use the BigQuery client API to integrate with the application. Query it like any other dataset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoq0u45cow960p02tch5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoq0u45cow960p02tch5.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Don't try this at home! This stunt is being performed by professionals&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
Enforcing uniqueness constraints on a BigQuery table
&lt;/h2&gt;

&lt;p&gt;All warnings having been given, let's proceed with the implementation. It's not really difficult because Google provides BigQuery client libraries for many languages. It is also possible to use the &lt;a href="https://cloud.google.com/bigquery/docs/reference/rest" rel="noopener noreferrer"&gt;REST Api&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The problem is, even if you agreed to cut corners by not provisioning a proper OLTP database, you may still need to enforce some constraints.&lt;/p&gt;

&lt;p&gt;Imagine that your configuration table consists of the following schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Id (STRING) | Attribute (STRING) | Value (STRING) |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you want to enforce that &lt;code&gt;Id&lt;/code&gt; values are unique. Normally, this would be a simple &lt;code&gt;UNIQUE(Id)&lt;/code&gt; statement in OLTP databases. But such a statement doesn't exist in BigQuery.&lt;/p&gt;

&lt;p&gt;Luckily, there is a new feature of BigQuery to the rescue: &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions" rel="noopener noreferrer"&gt;Transactions&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BigQuery Transactions
&lt;/h3&gt;

&lt;p&gt;The multi-statement transactions feature is Pre-GA at the time of writing. It enables wrapping standard SQL scripts into atomic transactions.&lt;/p&gt;

&lt;p&gt;So, to compensate for the absence of a traditional UNIQUE constraint, we can implement the following sequence when saving or modifying an entry in the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1) Open a transaction&lt;/li&gt;
&lt;li&gt;2) Search if the Id to save already exists, raise an error if found&lt;/li&gt;
&lt;li&gt;3) Insert the new entry&lt;/li&gt;
&lt;li&gt;4) Commit the transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an implementation of this script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1
BEGIN TRANSACTION;
-- 2
SELECT * FROM (
    SELECT COUNT(1) AS conflict
    FROM `configds.configtable`
    WHERE Id=@input_id
) WHERE
IF (conflict=0, TRUE, ERROR("Id already exists));
-- 3
INSERT INTO `configds.configtable`
VALUES(@input_id,@input_attribute,@input_value);
-- 4
COMMIT TRANSACTION;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;At step 2, &lt;code&gt;ERROR()&lt;/code&gt; will automatically roll back the transaction, so that step 3 will not occur. Named parameters are used here to protect against SQL injection.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Call this script from the application backend whenever a config entry is saved, and it will protect the table against concurrent inserts from the application's clients.&lt;/p&gt;
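
&lt;p&gt;For instance, from Python, the whole script can be submitted as a single job with named query parameters. A minimal sketch (the table comes from the schema above; the values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()

INSERT_SCRIPT = """
BEGIN TRANSACTION;
SELECT * FROM (
    SELECT COUNT(1) AS conflict
    FROM `configds.configtable`
    WHERE Id=@input_id
) WHERE
IF (conflict=0, TRUE, ERROR("Id already exists"));
INSERT INTO `configds.configtable`
VALUES(@input_id, @input_attribute, @input_value);
COMMIT TRANSACTION;
"""

job = client.query(
    INSERT_SCRIPT,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("input_id", "STRING", "config-001"),
            bigquery.ScalarQueryParameter("input_attribute", "STRING", "retention"),
            bigquery.ScalarQueryParameter("input_value", "STRING", "30 days"),
        ]
    ),
)
job.result()  # raises if the Id already existed and the transaction rolled back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;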

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, this way of implementing constraints is not to be generalized. Here are the most prominent limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transactions are only Pre-GA at the moment&lt;/li&gt;
&lt;li&gt;Uniqueness is enforced at the application level, but not at the database level. Nothing prevents another BigQuery client, like the BigQuery console itself, from inserting rows regardless of the uniqueness of Id. Only the application is safe&lt;/li&gt;
&lt;li&gt;Performance is very poor: it takes several seconds to run the script.&lt;/li&gt;
&lt;li&gt;Not supported by ORMs: you have to write plain SQL queries and be careful about SQL injection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The new BigQuery Multi-statement Transactions feature enables the usage of BigQuery as a somewhat-workable application backend, which can come in handy if used with high caution. Still, carefully consider the trade-offs vs a traditional OLTP database, and be prepared to defend your choices if you follow this path!&lt;/p&gt;

&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;




&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@domithenemeth?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Domi Nemeth&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/dangerous-jobs?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Orchestrate Dataflow pipelines easily with GCP Workflows</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Thu, 11 Mar 2021 09:15:20 +0000</pubDate>
      <link>https://dev.to/stack-labs/orchestrate-dataflow-pipelines-easily-with-gcp-workflows-1i8k</link>
      <guid>https://dev.to/stack-labs/orchestrate-dataflow-pipelines-easily-with-gcp-workflows-1i8k</guid>
      <description>&lt;p&gt;Dataflow pipelines rarely are on their own. Most of the time, they are part of a more global process. For example : one pipeline collects events from the source into BigTable, then a second pipeline computes aggregated data from BigTable and store them into BigQuery.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxifla0irjdcqriubteps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxifla0irjdcqriubteps.png" alt="Alt Text" width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, each pipeline could be scheduled independently with Cloud Scheduler.&lt;br&gt;
But if these pipelines need to be linked somehow, such as launching the second pipeline when the first is done, then &lt;strong&gt;orchestration&lt;/strong&gt; is required.&lt;/p&gt;

&lt;p&gt;Until recently, GCP had one tool in the box for this kind of purpose: &lt;a href="https://cloud.google.com/composer" rel="noopener noreferrer"&gt;Cloud Composer&lt;/a&gt;, a (slightly) managed Apache Airflow. Despite its rich and numerous functionalities and its broad community, this service had several caveats for the kind of simple orchestration I was after:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;it's not fully integrated: you need to manage costly resources such as a GKE cluster and a Cloud SQL instance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it pushes Python in your codebase, there is no other choice&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;any change in the setup (like, adding an environment variable) is painfully slow to propagate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the wide variety of operators in the ecosystem can lead to a poor separation of concerns between orchestration and business processes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I won't even talk about the Airflow UI... (I've heard that some people like it)&lt;/p&gt;

&lt;p&gt;Because of these issues, orchestrating with Composer is overly difficult. Yet, as is often the case with the GCP platform, if you face too many difficulties when doing something that should be simple enough, you're probably not doing it right. This proved true once again: Cloud Composer wasn't the right product for my need...&lt;/p&gt;
&lt;h2&gt;
  
  
Enter GCP Workflows!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/workflows" rel="noopener noreferrer"&gt;Workflows&lt;/a&gt; is a new service : it has been promoted out of bêta very recently. And luckily, it already offers most of the needed functionality to do the orchestration of GCP services' jobs, and doing it simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;it &lt;strong&gt;is&lt;/strong&gt; fully managed and serverless, which means you don't pay when you don't use it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it does only one job and does it well: orchestrating HTTP calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all is configured in YAML files, &lt;a href="https://cloud.google.com/workflows/docs/reference/syntax" rel="noopener noreferrer"&gt;whose syntax is short and easy to learn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the UI is neatly integrated and feels more "part of GCP" than Composer (although there are still quite a few display bugs at the moment)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this new product it becomes really easy to write a Workflow which chains multiple Dataflow jobs like in the diagram above. &lt;/p&gt;
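
&lt;p&gt;Once such a workflow is deployed, its executions can be scheduled with Cloud Scheduler or triggered programmatically, for instance from Python. A minimal sketch, assuming the google-cloud-workflows client library (project, region and workflow names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud.workflows import executions_v1

execution_client = executions_v1.ExecutionsClient()
# placeholder project / region / workflow names
parent = execution_client.workflow_path("my-project", "europe-west1", "dataflow-chain")
execution = execution_client.create_execution(request={"parent": parent})
print(f"Started execution: {execution.name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;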
&lt;h2&gt;
  
  
  A sample workflow for Dataflow jobs
&lt;/h2&gt;

&lt;p&gt;Workflow files are written in YAML. The syntax is simple and straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;assign&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;europe-west1"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myTopic"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;firstPipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LaunchDataflow&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${project}&lt;/span&gt;
          &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${region}&lt;/span&gt;
          &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first"&lt;/span&gt;
        &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;firstJobId&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;waitFirstDone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataflowWaitUntilStatus&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${project}&lt;/span&gt;
          &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${region}&lt;/span&gt;
          &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${firstJobId}&lt;/span&gt;
          &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_DONE"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secondPipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LaunchDataflow&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${project}&lt;/span&gt;
          &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${region}&lt;/span&gt;
          &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;second"&lt;/span&gt;
        &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secondJobId&lt;/span&gt;    
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;waitSecondDone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataflowWaitUntilStatus&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${project}&lt;/span&gt;
          &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${region}&lt;/span&gt;
          &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${secondJobId}&lt;/span&gt;
          &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_DONE"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;googleapis.pubsub.v1.projects.topics.publish&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"projects/" + project + "/topics/" + topic}&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${base64.encode(text.encode("{\"message\":\"workflow done\"}"))}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break it down. The sample workflow has the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;init&lt;/strong&gt;: preprocessing stage, where workflow variables are initialized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;firstPipeline&lt;/strong&gt;: Launch the first dataflow job&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;waitFirstDone&lt;/strong&gt;: Wait until the first dataflow job is completed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;secondPipeline&lt;/strong&gt;: Launch the second dataflow job&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;waitSecondDone&lt;/strong&gt;: Wait until the second dataflow job is completed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;publish&lt;/strong&gt;: Push a sample PubSub notification at the end of the workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you have noticed, &lt;strong&gt;firstPipeline&lt;/strong&gt; and &lt;strong&gt;secondPipeline&lt;/strong&gt; call a custom routine, a &lt;em&gt;subworkflow&lt;/em&gt;, which is defined in the same file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;LaunchDataflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;launch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http.post&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"https://dataflow.googleapis.com/v1b3/projects/"+project+"/locations/"+region+"/flexTemplates:launch"}&lt;/span&gt;
          &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OAuth2&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;launchParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;jobName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"workflow-" + template }&lt;/span&gt;
              &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;numWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
                &lt;span class="na"&gt;maxWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
              &lt;span class="na"&gt;containerSpecGcsPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${template}&lt;/span&gt;
        &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataflowResponse&lt;/span&gt;
        &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobCreated&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;jobCreated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;return&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${dataflowResponse.body.job.id}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This subworkflow calls the &lt;a href="https://cloud.google.com/dataflow/docs/reference/rest" rel="noopener noreferrer"&gt;Dataflow REST API&lt;/a&gt; to launch a job (here, a flex template). With Workflows you can easily call any Google Cloud service's API or any external HTTP endpoint.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;strong&gt;waitFirstDone&lt;/strong&gt; and &lt;strong&gt;waitSecondDone&lt;/strong&gt; call another subworkflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;DataflowWaitUntilStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;jobId&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;assign&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;currentStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;failureStatuses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_FAILED"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_CANCELLED"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_UPDATED"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_STATE_DRAINED"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;check_condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;switch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${currentStatus in failureStatuses}&lt;/span&gt;
            &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exit_fail&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${currentStatus != status}&lt;/span&gt;
            &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iterate&lt;/span&gt;
        &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exit_success&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;iterate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;sleep30s&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sys.sleep&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;getJob&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http.get&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"https://dataflow.googleapis.com/v1b3/projects/"+project+"/locations/"+region+"/jobs/"+jobId}&lt;/span&gt;
                &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OAuth2&lt;/span&gt;
              &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;getJobResponse&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;getStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;assign&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;currentStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${getJobResponse.body.currentState}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;log&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sys.log&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"Current job status="+currentStatus}&lt;/span&gt;
                &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO"&lt;/span&gt;
        &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check_condition&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;exit_success&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;return&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${currentStatus}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;exit_fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;raise&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${"Job in unexpected terminal status "+currentStatus}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This subworkflow also calls the Dataflow REST API, this time in a polling loop until the job reaches a terminal status. If the job ends in an unexpected state, an exception is raised and the workflow stops, marked as failed. Otherwise, it proceeds to the next stage.&lt;/p&gt;

&lt;p&gt;Finally, deploy this workflow, for example via the UI or with gcloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#! /bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;localDir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;WORKFLOW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sample"&lt;/span&gt;
&lt;span class="nv"&gt;DESCRIPTION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Sample workflow"&lt;/span&gt;
&lt;span class="nv"&gt;SOURCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sample.yaml"&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-gcp-project"&lt;/span&gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"europe-west4"&lt;/span&gt;
&lt;span class="nv"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sa-workflows@my-gcp-project.iam.gserviceaccount.com"&lt;/span&gt;

gcloud beta workflows deploy &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WORKFLOW&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--service-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;localDir&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DESCRIPTION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Breaking news:&lt;/em&gt; &lt;a href="https://github.com/hashicorp/terraform-provider-google/blob/master/CHANGELOG.md#3590-march-08-2021" rel="noopener noreferrer"&gt;Workflows resources are now available in Terraform&lt;/a&gt;, for you IaC freaks!&lt;/p&gt;

&lt;p&gt;Once deployed, the workflow can be launched, for example from Cloud Scheduler, by POSTing to this endpoint: &lt;code&gt;https://workflowexecutions.googleapis.com/v1/projects/${PROJECT}/locations/${REGION}/workflows/${WORKFLOW}/executions&lt;/code&gt;&lt;/p&gt;
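
&lt;p&gt;For illustration, here is a minimal Java sketch of such a trigger. It is a sketch only: the project, region and workflow names are the sample values from the deploy script above, and the access token is assumed to come from &lt;code&gt;gcloud auth print-access-token&lt;/code&gt; or a service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: trigger an execution of the deployed workflow via the REST API.
// Pass a valid OAuth2 access token as the first argument, e.g. the output
// of `gcloud auth print-access-token`.
public class TriggerWorkflow {
    public static void main(String[] args) throws Exception {
        String project = "my-gcp-project";   // sample values from the deploy script
        String region = "europe-west4";
        String workflow = "sample";
        String url = "https://workflowexecutions.googleapis.com/v1/projects/" + project
                + "/locations/" + region + "/workflows/" + workflow + "/executions";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Bearer " + args[0])
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{}"))  // empty execution body
                .build();
        HttpResponse&amp;lt;String&amp;gt; response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;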

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Thanks to Workflows, with just a relatively small YAML file we were able to chain two Dataflow jobs the easy way: serverlessly.&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’m Matthieu, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>workflows</category>
      <category>dataflow</category>
      <category>composer</category>
    </item>
    <item>
      <title>Tricky Dataflow ep.2 : Import documents from MongoDB views</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Wed, 17 Feb 2021 16:46:58 +0000</pubDate>
      <link>https://dev.to/stack-labs/tricky-dataflow-ep-2-import-documents-from-mongodb-views-lpf</link>
      <guid>https://dev.to/stack-labs/tricky-dataflow-ep-2-import-documents-from-mongodb-views-lpf</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the second episode of my Tricky Dataflow series, in which I present some of the trickiest issues I faced while implementing pipelines with Google Cloud Dataflow, and how I overcame them.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/stack-labs/tricky-dataflow-ep-1-auto-create-bigquery-tables-in-pipelines-n2k"&gt;The last episode dealt with some BigQuery issues&lt;/a&gt;. This time, let's talk about a completely different flavour of database: MongoDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MongoDB is now fairly widespread in the DB world, and arguably the best-known NoSQL database on the market. So, as one would expect, the Beam SDK behind Dataflow has a &lt;a href="https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/mongodb/MongoDbIO.html" rel="noopener noreferrer"&gt;MongoDB connector ready to ease the usage of MongoDB as a data source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It offers the ability to read from and write to MongoDB collections, so I (the naïve me, not so familiar with MongoDB at the time) thought that was all I needed to implement this simple kind of pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff17vt7aflctpjqabxwzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff17vt7aflctpjqabxwzf.png" alt="Alt Text" width="661" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But of course - otherwise there would be no point writing a blog post - things did not run as smoothly as I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  So you want to query a view, huh?
&lt;/h2&gt;

&lt;p&gt;In the first version of the pipeline, which I did as a warmup, I read documents directly from the collection with &lt;code&gt;MongoDbIO.read().withUri(...).withDatabase(...).withCollection(...)&lt;/code&gt; and faced no real issue. There was one subtle point though, whose importance I did not realize at the time:&lt;br&gt;
Because the source MongoDB instance was hosted on Atlas, &lt;a href="https://jira.mongodb.org/browse/SERVER-27344" rel="noopener noreferrer"&gt;MongoDbIO was not allowed to run the default splitVector() command&lt;/a&gt;, and therefore &lt;a href="https://docs.mongodb.com/manual/reference/operator/aggregation/bucketAuto/" rel="noopener noreferrer"&gt;it was mandatory to add the &lt;code&gt;withBucketAuto(true)&lt;/code&gt; clause to download the collection&lt;/a&gt;.&lt;/p&gt;
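
&lt;p&gt;For illustration, a minimal sketch of that warmup read (the URI, database and collection names are placeholders, not the actual pipeline code; imports come from &lt;code&gt;org.apache.beam.sdk.io.mongodb&lt;/code&gt; and &lt;code&gt;org.bson&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;// Warmup version: read a collection directly. On Atlas, withBucketAuto(true)
// stands in for the forbidden splitVector() command.
PCollection&amp;lt;Document&amp;gt; docs = pipeline.apply("Read from MongoDB",
    MongoDbIO.read()
        .withUri("mongodb+srv://...")       // placeholder connection string
        .withDatabase("mydatabase")         // placeholder
        .withCollection("mycollection")     // placeholder
        .withBucketAuto(true));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;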

&lt;p&gt;I was not expecting the difficulties that came when I naïvely tried to use the view name in place of the collection:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[WARNING] org.apache.beam.sdk.Pipeline$PipelineExecutionException: com.mongodb.MongoCommandException: Command failed with error 166 (CommandNotSupportedOnView): 'Namespace [myview] is a view, not a collection' on server [***]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So apparently MongoDB knows about my view and understands that I'd like to query it, but no, it won't let me retrieve documents from it. It turns out there was no simple way to just get documents from the view. There surely is a good explanation for this, but I couldn't find it. &lt;strong&gt;So frustrating...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngui72kk0dc5bhvve468.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngui72kk0dc5bhvve468.jpg" alt="Alt Text" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;You know that feeling... (Photo Wikipedia / Nlan86)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Actually, a view in MongoDB is not as straightforward as a regular view in the SQL world: a MongoDB view is the result of collection documents processed by an &lt;strong&gt;aggregation pipeline&lt;/strong&gt;. And MongoDbIO is able to perform aggregation queries on the collection it reads, thanks to an &lt;a href="https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/mongodb/AggregationQuery.html" rel="noopener noreferrer"&gt;AggregationQuery&lt;/a&gt; that may be passed to &lt;code&gt;.withQueryFn()&lt;/code&gt;. The solution started to take shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read from the collection&lt;/li&gt;
&lt;li&gt;retrieve the aggregation definition from the view options&lt;/li&gt;
&lt;li&gt;pass the aggregation pipeline to &lt;code&gt;withQueryFn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MongoDB will process the documents through the provided pipeline, yielding the same documents as the view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let's follow the plan!&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Retrieve the view's aggregation pipeline
&lt;/h3&gt;

&lt;p&gt;To get the pipeline, we need to use the mongo-java-driver directly and list the collection infos with it. It's pretty verbose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;BsonDocument&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;retrieveViewPipeline&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Options&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isNullOrEmpty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getView&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No view in options"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MongoClientOptions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;optionsBuilder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MongoClientOptions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;optionsBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;maxConnectionIdleTime&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;MongoClient&lt;/span&gt; &lt;span class="n"&gt;mongoClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MongoClientURI&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mongodb+srv://"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMongoDBUri&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;optionsBuilder&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;viewPipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt; &lt;span class="n"&gt;collecInfosDoc&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mongoClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDatabase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDatabase&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;listCollections&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collecInfosDoc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;equalsIgnoreCase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getView&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;viewPipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collecInfosDoc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"options"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pipeline"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;checkArgument&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;viewPipeline&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s view not found"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getView&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;viewPipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toBsonDocument&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BsonDocument&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDefaultCodecRegistry&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pass the pipeline to MongoDbIO
&lt;/h3&gt;

&lt;p&gt;As mentioned, MongoDbIO has a method to handle aggregations: &lt;code&gt;withQueryFn&lt;/code&gt;. However, this method actually &lt;a href="https://github.com/apache/beam/blob/master/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/AggregationQuery.java" rel="noopener noreferrer"&gt;has a little bug in the current version (2.27)&lt;/a&gt; when the pipeline has multiple steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw03tuoojhg36kqlqvavb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw03tuoojhg36kqlqvavb.png" alt="Alt Text" width="800" height="315"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Line 71: Harsh time for the last stage of the pipeline :( (screenshot from Github)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, there is a simple workaround for this: just append a useless item to the pipeline list, which will be replaced by the &lt;code&gt;bucket()&lt;/code&gt; stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;viewPipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;viewPipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BsonDocument&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There you go, with the source connector configured like this, you can now retrieve the view documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;PCollectionTuple&lt;/span&gt; &lt;span class="n"&gt;mongoDocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Read from MongoDB"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;MongoDbIO&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUri&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mongodb+srv://"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMongoDBUri&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;         
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDatabase&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;                        
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCollection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCollection&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withBucketAuto&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withQueryFn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;AggregationQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMongoDbPipeline&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;viewPipeline&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  But wait! Does it work on HUGE collections?
&lt;/h2&gt;

&lt;p&gt;Finally! You can now retrieve documents from your testing dataset, and you feel ready to try your shiny new pipeline on your real, huge MongoDB view. And then...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;com.mongodb.MongoCommandException:&lt;br&gt;
Command failed with error 16819 (Location16819): ‘Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.’&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;... it turns out you're not finished yet. At least the error message is pretty clear: while processing the aggregation pipeline on the MongoDB instance, the memory (RAM) limit was exceeded. Sadly this limit is not configurable. The only workaround is to allow MongoDB to spill to disk, which you force by setting the parameter &lt;code&gt;allowDiskUse: true&lt;/code&gt; alongside the aggregation pipeline.&lt;br&gt;
This parameter is easily accessible through the mongo-java-driver thanks to &lt;code&gt;AggregateIterable.allowDiskUse()&lt;/code&gt;. The problem is that, sadly, this method is not exposed in MongoDbIO yet. &lt;a href="https://issues.apache.org/jira/browse/BEAM-7256" rel="noopener noreferrer"&gt;There is a feature request for it&lt;/a&gt; but it's not on the roadmap at the moment.&lt;/p&gt;

&lt;p&gt;Unfortunately, &lt;code&gt;allowDiskUse()&lt;/code&gt; is needed in two places in the MongoDB Beam connector, and it's not possible to override them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;code&gt;MongoDbIO.buildAutoBuckets&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;AggregateIterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mongoCollection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregates&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;allowDiskUse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt; &lt;code&gt;AggregationQuery.apply()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongoDbPipeline&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;allowDiskUse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;iterator&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So for now, the only way to edit these classes is to fork or duplicate them. Not perfect, but at least you can do some cleanup in your pipeline dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;    &lt;span class="c"&gt;&amp;lt;!-- MongoDB connector --&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- Because of limitations, a fork of this lib is used --&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!--&amp;lt;dependency&amp;gt;
      &amp;lt;groupId&amp;gt;org.apache.beam&amp;lt;/groupId&amp;gt;
      &amp;lt;artifactId&amp;gt;beam-sdks-java-io-mongodb&amp;lt;/artifactId&amp;gt;
      &amp;lt;version&amp;gt;${beam.version}&amp;lt;/version&amp;gt;
    &amp;lt;/dependency&amp;gt;--&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- The fork needs the Mongo-java driver --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.mongodb&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;mongo-java-driver&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;3.12.7&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;All you need is mongo-java-driver&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This long story has a happy ending: thanks to &lt;code&gt;allowDiskUse&lt;/code&gt; and disk spilling, your custom MongoDbIO connector can now query MongoDB views of any size!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's it for this second episode. Stay tuned for the next one, in which I'll present GCP Workflows, a convenient way to orchestrate your Dataflow pipelines.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>dataflow</category>
      <category>mongodb</category>
      <category>java</category>
    </item>
    <item>
      <title>Tricky Dataflow ep.1 : Auto create BigQuery tables in pipelines</title>
      <dc:creator>matthieucham</dc:creator>
      <pubDate>Wed, 03 Feb 2021 09:35:30 +0000</pubDate>
      <link>https://dev.to/stack-labs/tricky-dataflow-ep-1-auto-create-bigquery-tables-in-pipelines-n2k</link>
      <guid>https://dev.to/stack-labs/tricky-dataflow-ep-1-auto-create-bigquery-tables-in-pipelines-n2k</guid>
      <description>&lt;p&gt;&lt;em&gt;GCP's Dataflow is a really powerful weapon when you need to manipulate massive amounts of data in a highly parallel and flexible fashion. Dataflow pipelines surely are the number one asset in every GCP data engineer toolbox.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However, learning to use Apache Beam, which is the open source framework behind Dataflow, is no bed of roses: &lt;a href="https://beam.apache.org/documentation/" rel="noopener noreferrer"&gt;The official documentation is sparse&lt;/a&gt;, &lt;a href="https://github.com/GoogleCloudPlatform/DataflowTemplates" rel="noopener noreferrer"&gt;GCP-provided templates don't work out-of-the-box&lt;/a&gt;, and &lt;a href="https://beam.apache.org/releases/javadoc/2.27.0/" rel="noopener noreferrer"&gt;the Javadoc is, well, a javadoc&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this series, I would like to present you some of the trickiest issues Dataflow and Beam had in store for me, and how I overcame them.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's start with a bit of BigQueryIO frustration...&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to write data to dynamically generated BigQuery tables?
&lt;/h2&gt;

&lt;p&gt;Beam provides the ability to load data into BigQuery using dynamic destinations, where the target table spec is derived dynamically from incoming elements. We would like to use this feature to achieve the following design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxor3c3rql3jkuwn64qgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxor3c3rql3jkuwn64qgk.png" alt="Alt Text" width="751" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Events coming from several Kafka topics are handled by a single Dataflow pipeline then serialized to several BigQuery tables: events from topic A go to table A, and so on...&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Luckily, there is a solution to this exact problem offered in &lt;a href="https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html" rel="noopener noreferrer"&gt;Beam Javadoc&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A common use case is to dynamically generate BigQuery table names based on the current value. To support this, BigQueryIO.Write.to(SerializableFunction) accepts a function mapping the current element to a tablespec. For example, here's code that outputs quotes of different stocks to different tables:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;PCollection&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Quote&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;quotes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...;&lt;/span&gt;

 &lt;span class="n"&gt;quotes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BigQueryIO&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
         &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
         &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withFormatFunction&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quote&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TableRow&lt;/span&gt;&lt;span class="o"&gt;()...)&lt;/span&gt;
         &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;ValueInSingleWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Quote&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
             &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSymbol&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;TableDestination&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                 &lt;span class="s"&gt;"my-project:my_dataset.quotes_"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Table spec&lt;/span&gt;
                 &lt;span class="s"&gt;"Quotes of stock "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="c1"&gt;// Table description&lt;/span&gt;
               &lt;span class="o"&gt;);&lt;/span&gt;
           &lt;span class="o"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, when we implemented this stage in our pipeline, with a CreateDisposition of &lt;code&gt;BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED&lt;/code&gt;, we soon found out that &lt;strong&gt;only the first dynamic table of the pipeline output had been created&lt;/strong&gt;. The other tables were missing, and as a result the data loading failed. &lt;a href="https://issues.apache.org/jira/browse/BEAM-3772" rel="noopener noreferrer"&gt;This is a known bug in Beam, which has been around for more than 2 years&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Facing this issue, we had no other choice than to develop a workaround.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom table-creation stage
&lt;/h2&gt;

&lt;p&gt;Apache Beam is an extensible framework, and as such it is possible to develop a new transformation stage, aka a PTransform. Therefore, we developed a PTransform whose single goal is to check whether the target table of each incoming element exists, and to create it if not.&lt;/p&gt;

&lt;p&gt;For this we have to bypass BigQueryIO and use the BigQuery Java client directly. The client should be instantiated during the &lt;em&gt;setup&lt;/em&gt; or &lt;em&gt;startBundle&lt;/em&gt; phase of the PTransform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.cloud.bigquery.*&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@Setup&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;bigquery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BigQueryOptions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDefaultInstance&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getService&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the client, we can easily call API methods to create tables. But not so fast: we can't check the target table for every single element; that would not be sustainable. Instead, we have to group elements into batches and check once per batch. Within a streaming pipeline, grouping means windowing. A simple strategy with fixed windows is enough for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;into&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FixedWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;standardSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;))))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Fixed windows of 15 seconds' width&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then we just have to inspect the content of each window and group its elements by target table name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Compute target table name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;WithKeys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GetTargetTableName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;outputTableSpec&lt;/span&gt;&lt;span class="o"&gt;))).&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;GroupByKey&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it's possible to check the table for each group. But there is no need to check the same table name repeatedly every 15 seconds: let's add some caching, for example with Guava Cache. That way we minimize costly API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Create target table if needed"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ParDo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CreateIfNeeded&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the details of &lt;code&gt;CreateIfNeeded&lt;/code&gt; and the rest of the implementation, check out &lt;a href="https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1" rel="noopener noreferrer"&gt;this Gist&lt;/a&gt;&lt;/p&gt;
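
&lt;p&gt;For illustration only, here is a hypothetical sketch of what such a caching DoFn could look like. The dataset name and schema are made up, and the Gist above has the real implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.*;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical sketch of a table-creating DoFn memoized with a Guava cache.
class CreateIfNeeded extends DoFn&amp;lt;KV&amp;lt;String, Iterable&amp;lt;TableRow&amp;gt;&amp;gt;, KV&amp;lt;String, Iterable&amp;lt;TableRow&amp;gt;&amp;gt;&amp;gt; {

    private transient BigQuery bigquery;
    private transient Cache&amp;lt;String, Boolean&amp;gt; checkedTables;

    @Setup
    public void setup() {
        bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Remember already-checked tables so we don't hit the API every 15 seconds
        checkedTables = CacheBuilder.newBuilder()
                .expireAfterWrite(1, TimeUnit.HOURS)
                .build();
    }

    @ProcessElement
    public void processElement(ProcessContext context) throws Exception {
        String tableName = context.element().getKey();
        checkedTables.get(tableName, () -&amp;gt; {
            TableId tableId = TableId.of("my_dataset", tableName); // hypothetical dataset
            if (bigquery.getTable(tableId) == null) {
                // Hypothetical one-column schema; the real one depends on the events
                Schema schema = Schema.of(Field.of("payload", StandardSQLTypeName.STRING));
                bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
            }
            return Boolean.TRUE;
        });
        context.output(context.element());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;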

&lt;p&gt;Finally, we have the satisfaction of watching our nice stage deployed, just before BigQueryIO.Write, which can now safely load data into sure-to-exist tables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foa64fpi316ja53kcdjo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foa64fpi316ja53kcdjo2.png" alt="Alt Text" width="506" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See you soon for the next tricky Dataflow trick!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>dataflow</category>
      <category>bigquery</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
