
Koji Matsumoto

Starting small Airbyte on GCP

Now that I have started operating Airbyte in production, I would like to share some of the measures I devised for stable operation.

Current status of Airbyte

Airbyte is an OSS ETL system that can be deployed on various platforms with docker-compose or Kubernetes. Airbyte Cloud is also available as a fully managed service (https://airbyte.io/pricing). However, it is currently offered only in the US, with expansion to other countries planned for 2022. Since I cannot devote much time to building and operating system infrastructure where I work, using SaaS would be a reasonable option if it were possible. Unfortunately, as a user in Japan, I cannot use Airbyte Cloud at the moment.

Why I adopted Airbyte

I needed ETL for Zendesk-related services, but Stitch, Fivetran, and the like did not have a connector for the particular Zendesk service I needed. That narrowed my choices down to either developing the data extraction against the REST API myself or self-hosting Airbyte, which already provides a suitable connector.

If I were to develop the data extraction myself, it would be better to build it as a Singer tap or an Embulk input plugin for the sake of reusability, but I could not allocate the development time for that. What's more, it would be difficult to keep maintaining it as the Zendesk API changes.

With Airbyte, the connector is already available. The connector code itself is managed in a monorepo within the Airbyte repository. Not only unit tests but also a consistent set of integration tests are provided for every connector, and quality is maintained by CI. Furthermore, bug fixes and feature additions are released promptly under review by Airbyte's core developers.

Maintaining the ongoing quality of the connector for each ETL target service is important. I decided to adopt Airbyte with a view to eventually migrating the services I currently ETL with Stitch as well.

Since Airbyte is still in alpha, I began production operation with a small start, verifying its behavior as I introduced it.

System structure

As a premise, the system is meant to be integrated into a data infrastructure centered on BigQuery, so I built it with GCP services. As a small start, I planned to bring the ETL of one Zendesk service into production and gradually add ETL for other services while confirming stable operation. The Airbyte repository describes how to deploy with docker-compose on GCE, so I decided to deploy that way and operate it for a while. There is also a Kubernetes deployment method, but GKE incurs not only compute costs but also operational overhead, so I decided against it this time.

Ops

System monitoring and configuration backups are indispensable for operating a production system, so of course I implemented both in this environment. These two mechanisms are what I most want to share in this article.

System monitoring

Health monitoring is essential in a production system, but implementing it in a self-hosted environment can take a lot of time. Even with a small start, you need a solid structure. The most important question for the Airbyte server is whether it is in a state where sync processing can be performed; I defined that as "healthy" and made it the monitoring target. To grasp the system state in more detail, you would also monitor the health of the server instance, the health of the service processes, network reachability, and so on, but since this is a small start, I decided to treat "can a sync be performed?" as the single most important criterion.

To check whether sync processing can be performed, I check whether a Connection configured in Airbyte can be executed. I therefore prepared a Connection that completes without syncing anything and execute it periodically. If it were executed by the Airbyte scheduler, a failure would not be noticed from the outside, so the Connection has to be triggered externally and its success or failure checked. I implemented a Cloud Function that triggers this no-op Connection via Airbyte's REST API, scheduled periodically by Cloud Scheduler. If the Connection fails to execute, Sentry is notified. For security, the Cloud Function accesses the GCE instance's internal IP address in the VPC through a Serverless VPC Access connector. Below is the system overview.

[System overview diagram]
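The monitoring function can be sketched roughly as below. This is a hypothetical sketch, not my actual code: the endpoint paths `/api/v1/connections/sync` and `/api/v1/jobs/get` come from the Airbyte OSS API of this era (adjust for your version), and the IP address and connection ID are placeholders.

```python
# Hypothetical sketch of the health-check Cloud Function: trigger the no-op
# Connection over Airbyte's REST API and report failure so Sentry picks it up.
import json
import time
import urllib.request

AIRBYTE_URL = "http://10.0.0.2:8000"  # GCE internal IP via the VPC connector (placeholder)
HEALTHCHECK_CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

def _post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the Airbyte Config API and parse the response."""
    req = urllib.request.Request(
        AIRBYTE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as res:
        return json.load(res)

def is_job_finished(status: str) -> bool:
    """Terminal states of an Airbyte job."""
    return status in ("succeeded", "failed", "cancelled")

def check_airbyte_health(request=None):
    """Cloud Functions entry point: trigger the no-op Connection, wait for the result."""
    job = _post("/api/v1/connections/sync",
                {"connectionId": HEALTHCHECK_CONNECTION_ID})["job"]
    while not is_job_finished(job["status"]):
        time.sleep(10)
        job = _post("/api/v1/jobs/get", {"id": job["id"]})["job"]
    if job["status"] != "succeeded":
        # Raising an exception surfaces the failure to Sentry via its
        # Cloud Functions integration.
        raise RuntimeError(f"health-check sync failed: job {job['id']}")
    return "ok"
```

In the real setup the function is invoked by Cloud Scheduler on a fixed interval, so a failed (or unreachable) sync shows up as an error event rather than silent absence of activity.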

The Connection that syncs nothing consists of the connectors source-none and destination-none. Since these are used only internally, they were developed in-house and are built locally as Docker images.
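The no-op connectors themselves are not published, so purely as an illustration, here is a minimal sketch of what a "source-none" could look like, following the four commands of the Airbyte connector protocol (spec / check / discover / read). The message shapes are simplified and this is my own sketch, not the actual in-house code.

```python
# Minimal sketch of a "source-none" Airbyte connector: it advertises one empty
# stream and emits no records, so a sync completes without moving any data.
import json
import sys

def spec() -> dict:
    # Empty connection specification: the connector takes no configuration.
    return {"type": "SPEC", "spec": {
        "connectionSpecification": {"type": "object", "properties": {}}}}

def check() -> dict:
    # There is nothing to connect to, so the connection check always succeeds.
    return {"type": "CONNECTION_STATUS",
            "connectionStatus": {"status": "SUCCEEDED"}}

def discover() -> dict:
    # Advertise a single empty stream so a Connection can be configured.
    return {"type": "CATALOG", "catalog": {"streams": [
        {"name": "none",
         "json_schema": {"type": "object", "properties": {}},
         "supported_sync_modes": ["full_refresh"]}]}}

def read() -> list:
    return []  # sync nothing: no RECORD messages

def main(argv: list) -> None:
    commands = {"spec": spec, "check": check, "discover": discover}
    cmd = argv[0] if argv else ""
    if cmd in commands:
        print(json.dumps(commands[cmd]()))
    elif cmd == "read":
        for msg in read():
            print(json.dumps(msg))
    else:
        sys.exit(f"unknown command: {cmd}")

if __name__ == "__main__":
    main(sys.argv[1:])
```

Packaged in a Docker image whose entrypoint runs this script, such a connector behaves like any other Airbyte source as far as the scheduler is concerned, which is exactly what the health check needs.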

Sync failure notification in Slack

I have also set up failure notifications to Slack so that I notice immediately if a normal (non-monitoring) sync fails. Airbyte has a built-in Slack notification feature, but the notification text cannot be customized yet. Since I would like to @-mention people in Slack, I hope customizing the notification text becomes possible.

Config backup

I also want to avoid losing the Airbyte config due to a problem with the server instance, so I implemented regular automatic backups of the config. First, I had to decide where to back it up to. Since we practice IaC with GitHub in my environment, server settings are basically managed in a repository. However, the Airbyte config contains credentials for accessing APIs and databases in plain form, so it cannot be kept in the repository. I decided to save the config to an access-controlled GCS bucket. Here too, I used Cloud Scheduler and Cloud Functions for the periodic automatic backups.

[Config backup system diagram]
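The backup function can be sketched along these lines. This is a hypothetical sketch under two assumptions: the alpha-era Airbyte export endpoint `/api/v1/deployment/export`, which returns a gzipped archive of the full configuration, and the `google-cloud-storage` client library; the URL and bucket name are placeholders.

```python
# Hypothetical sketch of the scheduled config-backup Cloud Function:
# pull the config archive from Airbyte and store it in a GCS bucket.
import datetime
import urllib.request

AIRBYTE_URL = "http://10.0.0.2:8000"   # GCE internal IP (placeholder)
BUCKET = "my-airbyte-config-backup"    # access-controlled GCS bucket (placeholder)

def backup_object_name(now: datetime.datetime) -> str:
    """Timestamped object path so older backups are retained, not overwritten."""
    return now.strftime("airbyte/%Y/%m/%d/config-%H%M%S.tar.gz")

def backup_airbyte_config(event=None, context=None):
    """Cloud Functions entry point, triggered periodically by Cloud Scheduler."""
    # Third-party dependency; imported lazily so the module loads without it.
    from google.cloud import storage

    req = urllib.request.Request(AIRBYTE_URL + "/api/v1/deployment/export",
                                 method="POST")
    with urllib.request.urlopen(req, timeout=120) as res:
        archive = res.read()

    name = backup_object_name(datetime.datetime.utcnow())
    storage.Client().bucket(BUCKET).blob(name).upload_from_string(
        archive, content_type="application/gzip")
    return name
```

Because the bucket is access-controlled, the credentials embedded in the exported config stay out of the Git repository while remaining restorable.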

Last words

The system is currently operating stably with the above configuration, but there are some challenges to address in the future.

In this configuration, to control access to Airbyte's WebUI, it is necessary to set up an SSH tunnel to the GCE instance. That is fine for now since I rarely access it, but eventually I want to control access per user account. In that case, I think access should go through Identity-Aware Proxy on GCP.
Also, if the number of ETL targets grows, I would like to consider migrating to Kubernetes so the system can scale easily. That said, I hope Airbyte Cloud becomes available in Japan before then.
