DEV Community: John Lafleur

Why ETL Needs Open Source to Address the Long Tail of Integrations

John Lafleur — Fri, 02 Jul 2021 03:33:08 +0000

Over the last year, our team has interviewed more than 200 companies about their data integration use cases. What we discovered is that data integration in 2021 is still a mess.

The Unscalable Current Situation

At least 80 of the 200 interviews were with users of existing ETL technology, such as Fivetran, StitchData and Matillion. We found that every one of them was also building and maintaining their own connectors even though they were using an ETL solution (or an ELT one - for simplicity, I will just use the term ETL). Why?

We found two reasons:

Incomplete coverage for connectors
Significant friction around database replication

Inability to cover all connector needs

Many users’ ETL solution didn’t support the connector they wanted, or supported it but not in the way they needed.

An example for context: Fivetran has been in existence for eight years and supports 150 connectors. Yet, in just two sectors -- martech and adtech -- there are over 10,000 potential connectors.

The hardest part of ETL is not building the connectors; it is maintaining them. That is costly, and any closed-source solution is constrained by ROI (return on investment) considerations. As a result, ETL suppliers focus on the most popular integrations, yet companies use more and more tools every month and the long tail of connectors goes ignored.

So even with ETL tools, data teams still end up investing huge amounts of money and time building and maintaining in-house connectors.

Inability to address the database replication use case

Most companies store data in databases. Our interviews uncovered two significant issues with database connectors provided by existing ETL.

Volume-based pricing: Databases are huge and serve growing amounts of data. A database with millions of rows, with the goal of serving hundreds of millions of rows, is a common sight. The issue with current ETL solutions is their volume-based pricing. It’s easy for an employee to replicate a multi-million row database with a click. And that simple click could cost a few thousand dollars!
Data privacy: With today’s concerns over privacy and security, companies place an increasing importance on control of their data. The architecture of existing ETL solutions often ends up pulling data out of a company’s private cloud. The closed-source offerings prevent companies from closely inspecting the underlying ETL code and systems. The reduced visibility means less trust.

Both of these points explain why companies end up building additional internal database replication pipelines.

Inability to scale with data

The two points mentioned above about volume-based pricing and data privacy also apply as companies scale. It becomes less expensive for companies to have an internal team of data engineers to build the very same pipelines maintained in ETL solutions.

Why Open-Source Is the Only Way Forward

Open-source addresses many of the points raised above. Here is what open source gives us.

The right to customize: Having access to and being able to edit the code to your needs is a privilege open-source brings. For instance, what if the Salesforce connector is missing some data you need? With open source, such a change is as easy as submitting a code change. No more long threads on support tickets!
Addressing the long tail of connectors: You no longer need to convince a proprietary ETL provider that a connector you need is worth building. If you need a connector faster than a platform will develop it, you can build it yourself and maintain it with the help of a large user community.
Broader Integrations with data tools and workflows: Because an open source product must support a wide variety of stacks and workflows for orchestration, deployment, hosting, etc., you are more likely to find out-of-the-box support for your data stack and workflow (UI-based, API-based, CLI-based, etc.) with an open source community. Some of them, like Airbyte’s open source Airflow operator, are contributed by the community. To be fair, you can theoretically do that with a closed-source approach, but you’d likely need to build a lot of the tooling from scratch.
Debugging autonomy: If you experience any connector issues, you won’t need to wait for a customer support team to get back to you or for your fix to be at the top of the priorities of a third-party company. You can fix the issue yourself.
Out-of-the-box security and privacy compliance. If the open-source project is open enough (MIT, Apache 2.0, etc.), any team can directly address their integration needs by deploying the open-source code in their infrastructure.

The Necessity of a Connector Development Kit

However, open source itself is not enough to solve the data integration problem. This is because the barrier to entry for creating a robust and full-featured connector is too high.

Consider for example a script that pulls data from a REST API.

Conceptually, this is a simple SELECT * FROM entity query over some data living in a database, potentially with a WHERE clause to filter by some criteria. But anyone who has written a script or connector to continuously and reliably perform this task knows it’s a bit more complicated than that.

First, there is authentication, which can be as simple as a username/password or as complicated as implementing a whole OAuth flow (and securely storing and managing these credentials).

We also need to maintain state between runs of the script so we don’t keep rereading the same data over and over.

Afterwards, we’ll need to handle rate limiting and retry intermittent errors, making sure not to confuse them with real errors that can’t be retried.

We’ll then want to transform data into a format suitable for downstream consumers, all while performing enough logging to fix problems when things inevitably break.

Oh, and all this needs to be well tested, easily deployable... and done yesterday, of course.
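To make two of these concerns concrete, here is a minimal sketch of state kept between runs and retries limited to transient errors. It is only an illustration under stated assumptions: fetch_page, RetryableError, and the JSON state file are hypothetical names invented for this example, not part of any real API, and authentication, logging, and transformation are left out.

```python
import json
import time
from pathlib import Path


class RetryableError(Exception):
    """Hypothetical marker for transient failures (e.g. rate limiting)."""


def fetch_with_retry(fetch_page, cursor, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page(cursor), retrying only transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(cursor)
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the transient error
            sleep(base_delay * 2 ** (attempt - 1))
        # Any other exception is a "real" error and propagates immediately.


def load_state(path):
    """Read the saved cursor, or start from scratch on the first run."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {"cursor": None}


def save_state(path, state):
    Path(path).write_text(json.dumps(state))


def sync(fetch_page, state_path, emit, sleep=time.sleep):
    """Emit every record newer than the saved cursor and persist progress page by page.

    fetch_page(cursor) is assumed to return (records, cursor_after_those_records)
    and an empty list of records once there is nothing new.
    """
    cursor = load_state(state_path)["cursor"]
    while True:
        records, cursor_after = fetch_with_retry(fetch_page, cursor, sleep=sleep)
        if not records:
            return cursor
        for record in records:
            emit(record)
        cursor = cursor_after
        save_state(state_path, {"cursor": cursor})
```

Running sync a second time with the same state file reads nothing that was already emitted, which is the whole point of keeping state between runs.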

All in all, it currently takes a few full days to build a new REST API source connector. This barrier to entry not only means fewer connectors created by the community, but can often mean lower quality connectors.

However, we believe that 80% of this hardship is incidental and can mostly be automated away. Reducing implementation time would significantly help the community contribute and address the long tail of connectors. If this automation is done in a smart way, we might also be able to improve standardization and thus maintenance across all connectors contributed.

What a Connector Development Kit Looks Like

Let’s look again at the work involved in building a connector, but from a different perspective.

Incidental complexity

Setting up the package structure
Packaging the connector in a Docker container and setting up the release pipeline

Lots of repeated logic

Reinventing the same design patterns and code structure for every connector type (REST APIs, Databases, Warehouses, Lakes, etc.)
Writing the same helpers for transforming data into a standard format, implementing incremental syncs, logging, input validation, etc.
Testing that the connector is correctly adhering to the protocol. Testing happy flows and edge cases

You can see that a lot can be automated away, and you’ll be happy to know that Airbyte has made available an open source Connector Development Kit (CDK) to do all this.

We believe in the end, the way to build thousands of high-quality connectors is to think in onion layers. To make a parallel with the pet/cattle concept that is well known in DevOps/Infrastructure, a connector is cattle code, and you want to spend as little time on it as possible. This will accelerate productivity tremendously.

Abstractions as onion layers

Maximizing high-leverage work leads you to build your architecture with an onion-esque structure:

The center defines the lowest level of the API. Implementing a connector at that level requires a lot of engineering time. But it is your escape hatch for very complex connectors where you need a lot of control.

Then, you build new layers of abstraction that help tackle families of connectors very quickly. For example, sources have a particular interface, and destinations have a different kind of interface.

Then, for sources you have different kinds like HTTP-API based connectors and Databases. HTTP connectors might be split into REST, GraphQL, and SOAP, whereas Databases might split into relational, NoSQL, and graph databases. Destinations might split into Warehouses, Datalakes, and APIs (for reverse ELT).
The CDK is the framework for those abstractions!
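
As a rough illustration of the onion idea, the layers can be pictured as a small class hierarchy. This is a sketch only: the class and method names below (Source, HttpApiSource, InMemoryUsersSource, fetch_page, and so on) are made up for this example and are not the CDK's actual API, and the "API" is an in-memory dictionary so the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, Optional

Record = Dict[str, Any]


class Source(ABC):
    """Center of the onion: the lowest-level contract every source satisfies."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class HttpApiSource(Source):
    """Next layer: handles pagination once for the whole family of HTTP API sources."""

    @abstractmethod
    def fetch_page(self, page_token: Optional[str]) -> Dict[str, Any]:
        ...

    @abstractmethod
    def parse(self, response: Dict[str, Any]) -> Iterable[Record]:
        ...

    @abstractmethod
    def next_page_token(self, response: Dict[str, Any]) -> Optional[str]:
        ...

    def read(self) -> Iterator[Record]:
        token = None
        while True:
            response = self.fetch_page(token)
            yield from self.parse(response)
            token = self.next_page_token(response)
            if token is None:
                return


class InMemoryUsersSource(HttpApiSource):
    """Outer layer: a concrete connector only states what is specific to it."""

    # Stand-in for a paginated API, keyed by page token.
    PAGES = {
        None: {"users": [{"id": 1}, {"id": 2}], "next": "p2"},
        "p2": {"users": [{"id": 3}], "next": None},
    }

    def fetch_page(self, page_token):
        return self.PAGES[page_token]

    def parse(self, response):
        return response["users"]

    def next_page_token(self, response):
        return response["next"]


records = list(InMemoryUsersSource().read())  # three records, ids 1, 2, 3
```

The concrete connector is three one-line methods; a very unusual source could still subclass Source directly and implement read by hand, which is the escape hatch at the center.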

What Is Already Available

Airbyte’s CDK is still in its early days, so expect lots of improvements to come over time. Today, the framework ships with the following features:

A Python framework for writing source connectors
A generic implementation for rapidly developing connectors for HTTP APIs
A test suite to test compliance with the Airbyte Protocol and happy code paths
A code generator to bootstrap development and package your connector

In the end, the CDK enables building robust, full-featured connectors within 2 hours versus 2 days previously.

The Airbyte team has been using the framework internally to develop connectors, and it is the culmination of our experience developing more than 70 connectors (our goal is 200 by end of the year with help from the user community!). Everything we learn from our own experience, along with that of the user community, goes into improving the CDK.

Conclusion - The Future Ahead

Wouldn’t it be great to bring the time needed to build a new connector down to 10 minutes, and to extend to more and more families of possible integrations? How’s that for a moonshot!

If we manage to do that together with our user community, then at long last the long tail of integrations will be addressed in no time! Not to mention that data integration pipelines will be commoditized through open-source.

If you would like to get involved, we hope you’ll join our Slack community - the most active one around data integration - as we connect to the future of open source for the benefit of all!

How “User Success” Helps Us Become the Most Active Slack Community

John Lafleur — Tue, 27 Apr 2021 06:11:56 +0000

Today, we’re celebrating three important milestones for Airbyte. Within just 7 months of the release of our very first product (MVP) - which had only 6 connectors - we became the most active Slack community of data professionals around data integration. This is our first milestone.

As you might already know, we are a transparent company. Every month or so, we publish information on our project and company that would be confidential in other companies, such as:

The slides we used to raise our seed round with Accel
Our company handbook with even our strategy and business model

Today, we want to tell you more about our Slack community, our focus on user success and what it means for our community, and two other not yet announced milestones.

The Most Active Slack Community on Data Integration

This weekend, we reached the milestone of 1,000 Slack members, and at the same time became the most active community.

Within 7 months, we grew from 5 people (our original team) to 1,020 members as of 04/26/21. Out of those 1,020 members, 450 are active weekly, and this resulted in 115k messages exchanged with the community. Yes, roughly 45% of our Slack community is active on a weekly basis, which is a great starting point.

The last time we checked, Singer’s Slack community had 40k messages after 4 years, and Meltano had 33k within 2 years. With Airbyte reaching 115k messages in 7 months, who knows how many we’ll have in 2 or 4 years?!

Defining “User Success”

Airbyte’s community is worldwide. About 35% of our users come from the US, but the remaining majority is spread across the globe. That’s why we decided to build a remote-first team with people in France, the United Kingdom, India, Singapore, New Caledonia (near Australia), and the US to cover all timezones. The goal is to be the best at what we call “user success.”

What is user success? You're probably familiar with customer success, which is well known in the SaaS world. In customer success, your goal is to make your customers successful with your product. However, when you are an open-source tool, you are first focusing on becoming the industry standard, and therefore, you’re focusing on the users of your open-source project.

Within Airbyte, we define “user success” as our team’s focus to help our users be successful in whatever project they want to build around data, whether it be with Airbyte or another tool. We believe the best way to build trust with our community is by aligning our goals and incentives with theirs; we want them to know we have their back and always will.

Measuring User Success

We’re measuring two things:

Time to first response
Time to resolution

Time to first response is the time elapsed between a user request on our Slack and the first response from a team member or community member.

Time to resolution is the time elapsed between the first user request and when the thread is marked with a ✅ emoji. That is how we notify the rest of the team that this request has been fully addressed.

For the moment, we have an average of 2 hours 30 minutes for the time to first response, and our time to resolution is about 3 hours 30 minutes.
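
For readers who want to track the same two metrics, here is a minimal sketch of the computation. The Thread record and its field names are assumptions made for this example (they are not a Slack API object), and the sample threads are made-up values, not our real data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Thread:
    asked_at: datetime                  # first user request
    first_reply_at: Optional[datetime]  # first response from a team or community member
    resolved_at: Optional[datetime]     # when the thread was marked as resolved


def average_delay(threads, end_field):
    """Mean delay from the request to `end_field`, skipping threads where it is unset."""
    delays = [
        (getattr(t, end_field) - t.asked_at).total_seconds()
        for t in threads
        if getattr(t, end_field) is not None
    ]
    return timedelta(seconds=mean(delays)) if delays else None


# Made-up sample: two answered threads and one still waiting for a reply.
t0 = datetime(2021, 4, 26, 9, 0)
threads = [
    Thread(t0, t0 + timedelta(hours=2), t0 + timedelta(hours=3)),
    Thread(t0, t0 + timedelta(hours=3), t0 + timedelta(hours=4)),
    Thread(t0, None, None),
]
time_to_first_response = average_delay(threads, "first_reply_at")  # 2:30:00
time_to_resolution = average_delay(threads, "resolved_at")         # 3:30:00
```

Note that unanswered or unresolved threads are simply skipped here, so both averages look better than reality if many threads are left open.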

We were also thinking about tracking:

resolution rate, i.e., the percentage of threads that have been marked with a ✅ emoji, but the data was too skewed by us sometimes forgetting to mark the thread as resolved.
feature-coverage rate, i.e., the % of users we interact with for whom we have met all their feature and connector needs

Some Examples of User Success Processes

So, what specific actions do we take in terms of user success?

Well, for instance, we personally welcome every new Slack member with a personalized message. It takes a bit of time but it is definitely worth it, as it enables us to understand their use cases and needs.

You can note all of this information in your CRM or community tool (we are big fans of Orbit) so that when you release a new feature, you can notify those who expressed any interest in that feature. That’s exactly what we do with connectors. Every time we support a new connector, we’ll reach out to all users who mentioned any interest in that connector.

Any interaction with the user is an opportunity to get information on how we can provide more value at a later date.

In the end, as for customer success, you want your users to be more and more successful with your open-source tool, so they become your next advocates.

What Kind of Role in User Success?

Airbyte is an open-source data integration platform, so it’s targeting data engineers, analysts and scientists. The only way to help them become successful is by helping them solve their technical issues. So the role that makes sense is a User Success Engineer.

And, as a matter of fact, we just hired for the role at Airbyte. Here is the description of the role:

Your goal as a User Success Engineer is to make our users successful when deploying or contributing to Airbyte.

The main responsibilities of the role will be:

Help users troubleshoot issues they have when deploying or contributing to Airbyte.
Write documentation and make (or suggest) code changes to resolve recurring issues.
Triage bugs to the correct team (or fix the issue yourself).

Airbyte’s open-source community has been growing very quickly, and one component of our success is the love of our community. This role is instrumental to scaling the support to our users, and includes finding ways to reduce the overall cost of user support through better documentation and new processes.

An excellent candidate will become an expert in the Airbyte system. They will determine which information needs to be shared with the engineering team so that the team has a deep understanding of existing pain points. They will also filter out information that they can resolve themselves through code fixes, documentation, or by working with the users. This will allow the engineering team to be laser focused on the product goals while maintaining intense user empathy. The role is at the heart of our values of leveraging our time and abilities.

An ideal candidate can start out as an individual contributor but can grow this operation into a team as the company scales.

---------

We hope this gives you some insight on how we think about user success at Airbyte and its community. So how does this translate in terms of measurable goals?

Our Next Milestone Is 1,000 Weekly Active Slack Members

The actual metric you want to track is the activity level of your community. Having a non-engaged community is a waste of time for everybody. So one would think that we should define our next goal in terms of messages exchanged in the community. Why not aim for 1M messages?

The issue with that approach is that messages are not synonymous with value brought to your users. If it takes you half the messages to get your point across and solve your users’ issues, you should definitely go this way. Number of messages is not the right proxy, and never was in our case.

The right approach is to track whether your community keeps being engaged, and that is, simply, weekly active members. That’s why our next milestone is not signups, or messages exchanged, but 1,000 weekly active Slack members.
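
As a small sketch of the difference between the two proxies, the function below counts distinct members who posted in each ISO week rather than counting messages. The input shape, a list of (member_id, date) pairs, is an assumption for this example, and the sample values are made up.

```python
from collections import defaultdict
from datetime import date


def weekly_active_members(messages):
    """Map each ISO (year, week) to the number of distinct members who posted."""
    members_by_week = defaultdict(set)
    for member_id, day in messages:
        iso = day.isocalendar()
        members_by_week[(iso[0], iso[1])].add(member_id)
    return {week: len(members) for week, members in sorted(members_by_week.items())}


# Made-up sample: three messages in the first week, but only two active members.
messages = [
    ("ana", date(2021, 4, 19)),
    ("ana", date(2021, 4, 20)),
    ("ben", date(2021, 4, 21)),
    ("ana", date(2021, 4, 26)),
]
active = weekly_active_members(messages)  # {(2021, 16): 2, (2021, 17): 1}
```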

How to Achieve the Next Milestone

This is where we want to announce two new milestones.

1. Our First Developer Advocate Hire

Abhi Vaidyanatha joined us on 04/26. As our senior developer advocate, he will work on constantly improving our developer experience and engagement. This includes documentation, tutorials, and, therefore, insightful content for our Slack communities.

Maybe we’ll do AMAs there - anything becomes possible when you have someone with the energy of Abhi!

2. Our First User Success Engineer Hire

If, by any chance, we coined the term “user success engineer,” feel free to reuse the term, as it should be open-sourced (MIT) like the rest of Airbyte 😉.

Our first user success engineer should be joining us in the next few weeks. This person will help us drive the time to first response and resolution down so you’ll have the best support experience with Airbyte in the whole ETL/ELT industry - while just using the open-source edition!

---

You will see that the Airbyte team will be growing fast in the next few weeks. And we also have big plans for the Slack community, but we won’t reveal everything just yet as we want to keep some surprises for you!

In case you didn’t join, here’s our Slack community, and you can also contribute to our GitHub repository. Either way - whether you’re already a member or planning to join - we hope to hear from you soon!

And yes, Airbyte is also about to become the GitHub repo with the most stars around data integration, too!

How We Performed on Our Q1 OKRs, and The Goals for Q2

John Lafleur — Wed, 14 Apr 2021 10:07:05 +0000

In January, we shared how we were thinking about OKRs, along with our OKRs for Q1 2021. So we wanted to give some updates about them, and how they have evolved for the 2nd quarter.

Our focus for 2021 is to become the open-source standard for replicating data. This entails three overarching goals:

Making Airbyte just work whatever your data infrastructure, volume and connector needs.
Building the largest developer community for data integration. We envision that most connectors will be built and maintained by the community eventually, because we will have made that so simple with our low-code framework.
Making Airbyte so easy to use in a production context that Airbyte becomes the new standard for data teams to replicate data.

Let’s see how this translates itself into our first two quarterly OKRs.

How We Performed on Airbyte’s OKRs for Q1 2021

1. O: Growing Community Love

What is community love? We’re still big fans of Orbit’s definition for it. Love is a member's level of engagement and investment in the community. Someone with high love is highly active and plays key roles in the community, like contributing, moderating, and organizing.

Let’s first look at GitHub Stars

In this chart, we’re comparing Airbyte with other famous open-source projects around data integration: DBT and RudderStack. Our growth rate (Airbyte in red) is a huge validation that we’re not the only ones to believe that data integration will be solved with an open-source and community approach.

GitHub stars are good awareness metrics, but they don’t mean that you actually have community adoption or contribution. We need to look at other metrics for that:

Overall, we outperformed our Q1 OKRs for community love, even though we set aggressive goals. This is still the very beginning of our journey, but this was extremely encouraging for all the team. We strongly believe we can commoditize data integration through our growing community.

2. O: Growing Production Usage

We call “activated users” users who have deployed Airbyte, connected a source, a destination and synced data successfully from this source to this destination.

We call “prod users” users who have been syncing data more than 5 times in the past week and 5 times in the week before.
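
Expressed as code, the two definitions might look like the sketch below. It assumes each user is represented by a list of timestamps of successful syncs, and it reads the prod definition as "more than 5 syncs in each of the last two weeks"; both the data shape and that reading are assumptions of this example.

```python
from datetime import datetime, timedelta


def is_activated(successful_syncs):
    """Activated: at least one successful sync from a source to a destination."""
    return len(successful_syncs) > 0


def is_prod(successful_syncs, now, threshold=5):
    """Prod: more than `threshold` syncs in the past week and in the week before."""
    week = timedelta(days=7)
    past_week = sum(1 for ts in successful_syncs if now - week < ts <= now)
    week_before = sum(1 for ts in successful_syncs if now - 2 * week < ts <= now - week)
    return past_week > threshold and week_before > threshold


now = datetime(2021, 3, 31)
daily_user = [now - timedelta(days=k) for k in range(14)]  # one sync a day for two weeks
trial_user = [now - timedelta(days=k) for k in range(3)]   # three syncs this week only

daily_is_prod = is_prod(daily_user, now)   # True: 7 syncs in each week
trial_is_prod = is_prod(trial_user, now)   # False, although the user is activated
```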

Here’s a chart showing the evolution of activated users and prod users during Q1.

We don’t publish the number of prod users we have yet, but you can see that the conversion from activated to prod users is growing with time, which is what we want to see.

But, is the usage of Airbyte growing among prod users?

If we had to follow only one graph, it would be this one. It accounts for both prod user growth and usage growth within prod users.

Here’s the usage growth in terms of sync per prod user:

Overall, this was exactly what we wanted to see. Teams start by testing Airbyte for a few days or weeks, before expanding their usage to other connectors.

3. O: Becoming a Reliable Standard

Airbyte can only become the new standard if connectors are reliable. You could consider that a “sanity” metric, in the sense that it is not related to growth, but it is actually where almost all of the engineering work goes. The more users use Airbyte, the more edge cases connectors get exposed to. It is a thousand-paper-cut problem, where every user comes with their needs in terms of usage, data and volume. The more users we have, the less reliable connectors can appear, and we have to seize these opportunities to strengthen them.

The metrics we’re looking at in this case are the percent of failures at sync attempts:

We launched on HackerNews on January 26th. That’s when we gained a lot more users at once and got exposed to a lot more use cases. During the whole month of February, we worked on strengthening our connectors, and you can see in this chart how it paid off. Our KR was 5% of failures by the end of the quarter, and this is something that we will keep working on.
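
For clarity, the metric itself is a simple ratio. The sketch below assumes each sync attempt is recorded with a status string such as "succeeded" or "failed"; those labels and the sample values are illustrative, not our actual data model.

```python
def failure_rate(attempt_statuses):
    """Percentage of sync attempts whose status is "failed"."""
    if not attempt_statuses:
        return 0.0
    failed = sum(1 for status in attempt_statuses if status == "failed")
    return 100.0 * failed / len(attempt_statuses)


# Made-up sample: 1 failed attempt out of 20.
sample_rate = failure_rate(["succeeded"] * 19 + ["failed"])  # 5.0
```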

Some other metrics we wanted to track:

KR: Response time to any message on Slack or GitHub - our goal was to reach <30 min by end of Q1 2021.
KR: Time to high bug resolution - our goal was to reach 1.5 days by the end of Q1 2021.

In the end, we couldn’t really measure those 2 metrics. But the overall response time to any message on Slack was about 1-2 hours.

4. O: Building the Dream Team

We strongly believe in talent density, and that it’s better to have one stellar colleague than 5 average ones.

KR: 2 A+ engineers => 3 engineers will be joining us in the next few weeks.
KR: 1 senior developer advocate => Abhi will be joining us soon!

Our Q1 Milestones

Now that we have seen how we performed on our OKRs, how did we perform on the milestones?

Community efforts

January: Hard launch on HackerNews
Building tutorials to improve the developer experience (DX) in building their own connectors, or editing pre-built ones => this is still a work in progress.

Product engineering efforts

One thing we didn’t anticipate is the toll providing great support would take on our engineering velocity. Even though we had great output, we were not able to deliver on all the milestones we had intended.

For our core platform:

Integration in data stack with DBT and Airflow => delivered, although we still have a lot on DBT’s front!
Core upgrade strategy => delivered!

For our connectors:

Strengthen our connectors so all our connectors are A+ => we started certifying our connectors against a set of best practices, and you can now see the health status of our connectors.
Schemas migration management => reprioritized
Seamless OAuth support => reprioritized
More high-level abstractions to build connectors more easily => ongoing effort!
An MVP for CDC (Change Data Capture) => delivered!
Connector upgrade strategy => delivered!
A public dashboard showing the stability (failure rate) of all our connectors => delivered!

Our New Q2 OKRs

So what about the next quarter? Doing OKRs is actually a great learning opportunity enabling us to make better estimates every time. This time, we know how much engineering time a great support experience takes. So we can plan accordingly.

For Q2, we kept the same objectives but changed some KRs that we’ve put in bold.

O: Growing Community Love

KR: Active Slack users (Q1/21: 350, Q2/21: 600)
KR: GitHub stars (Q1/21: 2k, Q2/21: 4k)
KR: Issue contributors from start (Q1/21: 125, Q2/21: 250)
KR: PR contributors from start (Q1/21: 25, Q2/21: 50)
KR: Connector Contributors (Q1/21: 10, Q2/21: 30)

O: Growing Prod Usage

KR: Prod users
KR: Active connections per prod user
KR: # connectors (Q1/21: 56, Q2/21: 90)

O: Becoming a Reliable Standard

KR: % failure at attempts
KR: average throughput of connectors
KR: support replicating large databases in X minutes

O: Building the Dream Team

KR: 2 A+ engineers
KR: 1 dev evangelist (to be confirmed) + 1 operations manager

Our Next Q2 Milestones

How does this translate into milestones?

Make Airbyte the easiest way to create line-of-business connectors with our low-code solution for creating connectors quickly and more reliably.
Support custom DBT models.
CDC for all major database sources.
Mature handling of (large) production data sets.
Production-grade single node support (across platforms): creating solid AMIs, systemctl, etc., with less setup.
First-class support on K8s.
OAuth support for connector authentication.
"Automatic" Schema change handling.
Support for data lake use cases.

So... a lot of engineering milestones! And they can be accomplished as we grow our engineering team.

Let’s see how we perform in 3 months!

How to Visualize the Time Spent by Your Team in Zoom Calls

John Lafleur — Mon, 05 Apr 2021 07:15:06 +0000

In this article, we will show you how you can understand how much your team leverages Zoom, or spends time in meetings, in a couple of minutes. We will be using Airbyte (an open-source data integration platform) and Tableau (a business intelligence and analytics software) for this tutorial.

Here is what we will cover:

Step 1: Setting up data replication from Zoom to a PostgreSQL database using the Airbyte Zoom connector
Step 2: Connecting the PostgreSQL database to Tableau
Step 3: Creating charts in Tableau with Zoom data

We will produce the following charts in Tableau:

Evolution of the number of meetings per week in a team
Evolution of the number of hours a team spends in meetings per week
Listing of team members with the number of meetings per week and number of hours spent in meetings, ranked
Evolution of the number of webinars per week in a team
Evolution of the number of hours a team spends in webinars per week
Evolution of the number of participants for all webinars in a team per week
Listing of team members with the number of webinars per week and number of hours spent in webinars, ranked
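
Before any tooling is involved, it may help to see what the first two charts boil down to: a per-week count and a per-week sum of durations. The sketch below is only an illustration of that aggregation. It assumes each meeting record carries a start_time in ISO 8601 UTC format and a duration in minutes; those field names and the sample rows are assumptions of this example, so check them against the tables Airbyte actually creates.

```python
from collections import defaultdict
from datetime import datetime


def weekly_meeting_stats(meetings):
    """Return {(iso_year, iso_week): (meeting_count, total_hours)}."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for meeting in meetings:
        start = datetime.strptime(meeting["start_time"], "%Y-%m-%dT%H:%M:%SZ")
        iso = start.isocalendar()
        key = (iso[0], iso[1])
        counts[key] += 1
        minutes[key] += meeting["duration"]
    return {key: (counts[key], minutes[key] / 60) for key in sorted(counts)}


# Made-up sample rows.
meetings = [
    {"start_time": "2021-03-29T09:00:00Z", "duration": 30},
    {"start_time": "2021-03-30T14:00:00Z", "duration": 60},
    {"start_time": "2021-04-05T09:00:00Z", "duration": 45},
]
stats = weekly_meeting_stats(meetings)
# {(2021, 13): (2, 1.5), (2021, 14): (1, 0.75)}
```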

Let’s get started by replicating Zoom data using Airbyte.

Step 1: Replicating Zoom data to PostgreSQL

Launching Airbyte

In order to replicate Zoom data, we will need to use Airbyte’s Zoom connector. To do this, you need to start off Airbyte’s web app by opening up your terminal and navigating to Airbyte and running:

docker-compose up

You can find more details about this in the Getting Started tutorial.

This will start up Airbyte on localhost:8000; open that address in your browser to access the Airbyte dashboard.

In the top right corner of the Airbyte dashboard, click on the + new source button to add a new Airbyte source. In the screen to set up the new source, enter the source name (we will use airbyte-zoom) and select Zoom as source type.

Choosing Zoom as source type will cause Airbyte to display the configuration parameters needed to set up the Zoom source.

The Zoom connector for Airbyte requires you to provide it with a Zoom JWT token. Let’s take a detour and look at how to obtain one from Zoom.

Obtaining a Zoom JWT Token

To obtain a Zoom JWT Token, log in to your Zoom account and go to the Zoom Marketplace. If this is your first time in the marketplace, you will need to agree to Zoom’s marketplace terms of use.

Once you are in, you need to click on the Develop dropdown and then click on Build App.

Clicking on Build App for the first time will display a modal for you to accept Zoom’s API license and terms of use. Do accept if you agree and you will be presented with the below screen.

Select JWT as the app you want to build and click on the Create button on the card. You will be presented with a modal to enter the app name; type in airbyte-zoom.

Next, click on the Create button on the modal.

You will then be taken to the App Information page of the app you just created. Fill in the required information (at the very least).

After filling in the needed information, click on the Continue button. You will be taken to the App Credentials page. Here, click on the View JWT Token dropdown.

There you can set the expiration time of the token (we will leave the default 90 minutes), and then you click on the Copy button of the JWT Token.
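
Because the token expires, it can be handy to check when it stops being valid before pasting it into Airbyte. The sketch below relies only on the generic JWT format (three base64url segments, with the standard exp claim in the payload); it does not verify the signature, and it assumes the token you copied carries an exp claim. Treat the token as a secret and only decode it locally.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_claims(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64 padding that JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def jwt_expiry(token):
    """Return the expiration time from the `exp` claim in UTC, or None if absent."""
    exp = jwt_claims(token).get("exp")
    return None if exp is None else datetime.fromtimestamp(exp, tz=timezone.utc)


def is_expired(token, now=None):
    """True if the token has an `exp` claim that is already in the past."""
    expiry = jwt_expiry(token)
    now = now or datetime.now(timezone.utc)
    return expiry is not None and expiry <= now
```

If a sync starts failing with an authentication error later on, an expired token is one of the first things worth ruling out.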

After copying it, click on the Continue button.

You will be taken to a screen to activate Event Subscriptions. Just leave it as is, as we won’t be needing Webhooks. Click on Continue, and your app should be marked as activated.

Connecting Zoom on Airbyte

So let’s go back to the Airbyte web UI and provide it with the JWT token we copied from our Zoom app.

Now click on the Set up source button. You will see the below success message when the connection is made successfully.

And you will be taken to the page to add your destination.

Connecting PostgreSQL on Airbyte

For our destination, we will be using a PostgreSQL database, since Tableau supports PostgreSQL as a data source. Click on the add destination button, and then in the drop down click on + add a new destination. In the page that presents itself, add the destination name and choose the Postgres destination.

To supply Airbyte with the PostgreSQL configuration parameters needed to make a PostgreSQL destination, we will spin up a PostgreSQL container with Docker using the following command in our terminal.

docker run --rm --name airbyte-zoom-db -e POSTGRES_PASSWORD=password -v airbyte_zoom_data:/var/lib/postgresql/data -p 2000:5432 -d postgres

This will spin up a Docker container and persist the data we will be replicating in the PostgreSQL database in a Docker volume airbyte_zoom_data.

Now, let’s supply the above credentials to the Airbyte UI requiring those credentials.
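
For reference, here is how the values in the docker run command map to connection settings, written as a small sketch. The dictionary keys are illustrative labels rather than the exact field names of the Airbyte form. The username and database are the official postgres image's defaults when POSTGRES_USER and POSTGRES_DB are not set, and the public schema is the one queried later in this tutorial. Depending on how Docker networking is set up on your machine, the host may need to be something other than localhost (Docker Desktop, for example, exposes the host machine as host.docker.internal).

```python
from urllib.parse import quote


def postgres_destination_config(host="localhost"):
    """Connection settings implied by the docker run command above."""
    return {
        "host": host,
        "port": 2000,            # host side of -p 2000:5432
        "username": "postgres",  # image default when POSTGRES_USER is not set
        "password": "password",  # POSTGRES_PASSWORD in the command
        "database": "postgres",  # image default when POSTGRES_DB is not set
        "schema": "public",
    }


def to_dsn(config):
    """Build a PostgreSQL connection URI from the settings (handy for other clients)."""
    return "postgresql://{user}:{password}@{host}:{port}/{database}".format(
        user=quote(config["username"], safe=""),
        password=quote(config["password"], safe=""),
        host=config["host"],
        port=config["port"],
        database=config["database"],
    )


dsn = to_dsn(postgres_destination_config())
# postgresql://postgres:password@localhost:2000/postgres
```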

Then click on the Set up destination button.

After the connection has been made to your PostgreSQL database successfully, Airbyte will generate the schema of the data to be replicated in your database from the Zoom source.

Leave all the fields checked.

Select a Sync frequency of manual and then click on Set up connection.

After successfully making the connection, you will see your PostgreSQL destination. Click on the Launch button to start the data replication.

Then click on the airbyte-zoom-destination to see the Sync page.

Syncing should take a few minutes or longer depending on the size of the data being replicated. Once Airbyte is done replicating the data, you will get a succeeded status.

Then, you can run the following SQL command on the PostgreSQL container to confirm that the sync was done successfully.

docker exec airbyte-zoom-db psql -U postgres -c "SELECT * FROM public.users;"

Now that we have our Zoom data replicated successfully via Airbyte, let’s move on and set up Tableau to make the various visualizations and analytics we want.

Step 2: Connect the PostgreSQL database to Tableau

Tableau helps people and organizations to get answers from their data. It’s a visual analytic platform that makes it easy to explore and manage data.

To get started with Tableau, you can opt in for a free trial period by providing your email and clicking the DOWNLOAD FREE TRIAL button to download the Tableau desktop app. The download should automatically detect your machine type (Windows/Mac).

Go ahead and install Tableau on your machine. After the installation is complete, you will need to fill in some more details to activate your free trial.

Once your activation is successful, you will see your Tableau dashboard.

On the sidebar menu under the To a Server section, click on the More… menu. You will see a list of datasource connectors you can connect Tableau with.

Select PostgreSQL and you will be presented with a connection credentials modal.

Fill in the same details of the PostgreSQL database we used as the destination in Airbyte.

Next, click on the Sign In button. If the connection was made successfully, you will see the Tableau dashboard for the database you just connected.

Note: If you are having trouble connecting PostgreSQL with Tableau, it may be because the PostgreSQL driver that ships with Tableau does not work with newer versions of PostgreSQL. You can download the JDBC driver for PostgreSQL here and follow the setup instructions.

Now that we have replicated our Zoom data into a PostgreSQL database using Airbyte’s Zoom connector, and connected Tableau with our PostgreSQL database containing our Zoom data, let’s proceed to creating the charts we need to visualize the time spent by a team in Zoom calls.

Step 3: Create the charts on Tableau with the Zoom data

Evolution of the number of meetings per week in a team

To create this chart, we will need to use the count of the meetings and the createdAt field of the meetings table. Currently, we haven’t selected a table to work on in Tableau. So you will see a prompt to Drag tables here.

Drag the meetings table from the sidebar onto the space with the prompt.

Now that we have the meetings table, we can start building out the chart by clicking on Sheet 1 at the bottom left of Tableau.

As stated earlier, we need Created At, but currently it’s a String data type. Let’s change that by converting it to a date and time. Right click on Created At, then select Change Data Type and choose Date & Time. And that’s it! That field is now of type Date & Time.

Next, drag Created At to Columns.

Currently, we get the Created At in YEAR, but per our requirement we want them in Weeks, so right click on the YEAR(Created At) and choose Week Number.

Tableau should now look like this:

Now, to finish up, we need to add the meetings(Count) measure, which Tableau has already calculated for us, to the Rows section. So drag meetings(Count) onto Rows to complete the chart.
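For readers who want to sanity-check the chart against the raw data, the aggregation behind it can be sketched in a few lines of Python. This is only an approximation: it assumes the created-at values are ISO-8601 UTC strings, the sample values are made up, and it uses ISO week numbers, which may not match Tableau's week numbering exactly.

```python
from collections import Counter
from datetime import datetime


def meetings_per_week(created_at_values):
    """Count meetings per (ISO year, ISO week) from strings like 2021-03-01T09:00:00Z."""
    counts = Counter()
    for raw in created_at_values:
        created = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
        iso = created.isocalendar()
        counts[(iso[0], iso[1])] += 1  # key on (year, week number)
    return dict(sorted(counts.items()))


# Made-up sample values, for illustration only.
sample = ["2021-03-01T09:00:00Z", "2021-03-03T15:30:00Z", "2021-03-08T11:00:00Z"]
print(meetings_per_week(sample))
```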

And now we are done with the very first chart. Let's save the sheet and create a new Dashboard that we will add this sheet to as well as the others we will be creating.

Currently the sheet shows Sheet 1; right click on Sheet 1 at the bottom left and rename it to Weekly Meetings.

To create our Dashboard, we can right click on the sheet we just renamed and choose new Dashboard. Rename the Dashboard to Zoom Dashboard and drag the sheet into it to have something like this:

Now that we have this first chart out of the way, we just need to replicate most of the process we used for this one to create the other charts. Because the steps are so similar, we will mostly be showing the finished screenshots of the charts except when we need to conform to the chart requirements.

Evolution of the number of hours a team spends in meetings per week

For this chart, we need the sum of the duration spent in weekly meetings. We already have a Duration field, which is currently displaying durations in minutes. We can derive a calculated field off this field since we want the duration in hours (we just need to divide the duration field by 60).

To do this, right click on the Duration field, select Create, and then click on Calculated Field. Change the name to Duration in Hours, and enter [Duration]/60 as the calculation. Click OK to create the field.

So now we can drag the Duration in Hours and Created At fields onto your sheet like so:

Note: We are adding a filter on the Duration to filter out null values. You can do this by right clicking on the SUM(Duration) pill and clicking filter, then make sure the include null values checkbox is unchecked.
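The calculated field and the null filter together amount to a very small computation. The Python sketch below shows it on made-up values: missing durations are skipped, the rest are summed in minutes, and the total is divided by 60.

```python
def total_hours(durations_in_minutes):
    """Sum durations given in minutes, skipping missing values, and return hours."""
    present = [d for d in durations_in_minutes if d is not None]
    return sum(present) / 60


# Made-up sample: 30 + 45 + 60 = 135 minutes, i.e. 2.25 hours.
print(total_hours([30, 45, None, 60]))
```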

Evolution of the number of participants for all meetings per week

For this chart, we will need a calculated field called # of meetings attended. It counts the rows matching a particular user's email in the report_meeting_participants table, and we will plot it against the Created At field of the meetings table. To get this done, right click on the User Email field, select Create, and click on Calculated Field. Enter # of meetings attended as the title of the field. Next, enter the formula below:

COUNT(IF [User Email] == [User Email] THEN [Id (Report Meeting Participants)] END)

Then click on Apply. Finally, drag the Created At field (make sure it is set to Week Number) and the calculated field you just created to match the below screenshot:
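To make the intent of the calculated field concrete, here is a rough Python equivalent of what we understand it to compute when broken down by user: the number of participant rows per email, ignoring rows that have no email or no id. The column names `id` and `user_email` and the sample rows are hypothetical.

```python
from collections import Counter


def meetings_attended(participant_rows):
    """Count participant rows per user email, ignoring rows with no email or no id."""
    counts = Counter()
    for row in participant_rows:
        if row.get("user_email") and row.get("id") is not None:
            counts[row["user_email"]] += 1
    return dict(counts)


# Hypothetical rows shaped like report_meeting_participants.
rows = [
    {"id": 1, "user_email": "ana@example.com"},
    {"id": 2, "user_email": "ben@example.com"},
    {"id": 3, "user_email": "ana@example.com"},
    {"id": 4, "user_email": None},
]
print(meetings_attended(rows))
```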

Listing of team members with the number of meetings per week and number of hours spent in meetings, ranked.

To get this chart, we need to create a relationship between the meetings table and the report_meeting_participants table. You can do this by dragging the report_meeting_participants table in as a source alongside the meetings table and relate both via the meeting id. Then you will be able to create a new worksheet that looks like this:

Note: To achieve the ranking, we simply use the sort menu icon on the top menu bar.

Evolution of the number of webinars per week in a team

The rest of the charts will be needing the webinars and report_webinar_participants tables. Similar to the evolution of the number of meetings per week in a team, we will be plotting the Count of webinars against the Created At property.

Evolution of the number of hours a team spends in webinars per week

For this chart, as with its meetings counterpart, we will derive a calculated field from the Duration field to get the Webinar Duration in Hours, and then plot Created At against the Sum of Webinar Duration in Hours, as shown in the screenshot below. Note: Make sure you create a new sheet for each of these graphs.

Evolution of the number of participants for all webinars per week

This calculation is the same as the evolution of the number of participants for all meetings per week, but instead of using the meetings and report_meeting_participants tables, we will use the webinars and report_webinar_participants tables.

Also, the formula will now be:

COUNT(IF [User Email] == [User Email] THEN [Id (Report Webinar Participants)] END)

Below is the chart:

Listing of team members with the number of webinars per week and number of hours spent in webinars, ranked

Below is the chart with these specs.

Conclusion

In this article, we saw how to use Airbyte to get data from the Zoom API into a PostgreSQL database, and then use that data to create chart visualizations in Tableau.

You can leverage Airbyte and Tableau to produce graphs on any collaboration tool. We just used Zoom to illustrate how it can be done. Hope this is helpful!

Our Truth for 2021: Airbyte Just Works

John Lafleur — Sun, 04 Apr 2021 22:37:21 +0000

We try to limit our discussions with VCs, as they can easily become a distraction. As a startup, focus is what will differentiate between success and failure. But sometimes, we can’t refuse an introduction and a discussion, as some investors have a lot of insights on your industry.

Recently, we had one discussion with a top-tier VC general partner. In addition to a lot of feedback and insights, one question in particular he asked really struck me: “What is your truth for 2021?”

In this article, we will explain what he means by truth, and what our immediate answer was for Airbyte.

What is a truth?

A truth is what you absolutely need to achieve for your company to be on the path to success. It is the one thing you need to strive for, and it should determine your priorities, strategy, initiatives, recruiting plan, etc.

A truth helps put every consideration in perspective. Any time you have a decision to make, you can ask yourself whether it brings you closer to that truth. For anything that doesn’t get you closer to it, you should question whether to do it at all.

It is by having this singular goal in mind that you will give yourself the highest chance to get there.

Is a truth a SMART goal in the end?

I’m sure you have heard about “SMART” goals. SMART stands for Specific, Measurable, Achievable, Relevant, Time-based.

It’s true that your truth needs to be specific. It cannot just be “My company is successful.” You need to define exactly what success means to you as a company.

Your truth should also be achievable and relevant, and it is by definition time-based, as it’s for your current year (or another period of your choice).

But the difference lies in the fact that your truth should be aspirational above being measurable. It should be very easy to express, just a few words, very memorable.

When we were asked this question, we hadn’t thought about it this way, but Michel - my co-founder - and I knew the answer instantaneously.

Our truth for 2021: “Airbyte just works”

What came to mind is that by the end of 2021, we envision that Airbyte just works. This is the feeling we want all our users to have.

This includes the reliability of the platform and all its connectors, whatever your infrastructure and the volume of data you need to replicate. But it also means being agnostic to whatever connector needs you have and whatever data stack you opted for. Airbyte just works.

Let’s go into more detail.

Whatever your data infrastructure

This year, we will be focusing on integrating with the rest of the data stack, whether for orchestration (Airflow, Dagster, Prefect, etc.), data quality (Great Expectations), or cloud providers (GCP, AWS, Azure…), and at whatever scale, which implies we must support multi-node deployments. Until now, we’ve been focusing on single-node setups.

Whatever your data volume

We are constantly improving our connectors, and are even certifying them against a set of best practices that we will keep adding to. Data integration pipelines are a thousand-paper-cut problem. Each new user brings some new use cases that may or may not be supported yet. We will continuously grow the team in charge of building new connectors and strengthening existing ones. At the end of the year, we hope we will be able to support TB-level replication.

Whatever your connector needs

We want to support at least 200 connectors by the end of 2021. And this will only be the beginning. We’re working on a low-code framework to make it easier to build and maintain connectors. 200 is obviously not enough to cover all connector needs, but hopefully, we will be at a point where the developer experience to build new connectors is so easy that the number of connectors won’t be perceived as limiting to address any use cases.

On that matter, we will also be working to support Kafka, Spark and webhooks.

This is our truth for 2021. By the end of the year, whatever your use case, you will be able to set up Airbyte and start fulfilling your data integration needs in a matter of hours. We believe this is the only way to commoditize data integration.

How you can use the truth framework elsewhere

A last note for this article. You can use the truth framework in other contexts.

For instance, we see a lot of entrepreneurs making decisions based on the amount of equity they hope to keep and the valuation of the company they hope to reach. However, they fail to remember that startups are either a 0 (you failed to exit and you died), or a 1 (you exited, IPO’d or are profitable). Any consideration of equity and valuation should actually be multiplied by this 0 or 1.

So you need to consider whether the decisions you make bring you closer to the 1. If you keep focusing on the 1, you will see that, in the long term, they were the right decisions to make, since having a bit more equity is not what matters if you end up building a successful company.

What is your truth? Are any of the decisions you make taking you closer to it?

How To Build a Slack Activity Dashboard With Open Source

John Lafleur — Wed, 03 Mar 2021 02:45:08 +0000

Build a Slack Activity Dashboard

This article will show how to use Airbyte, an open-source data integration platform, and Apache Superset, an open-source data exploration platform, to build a Slack activity dashboard showing:

Total number of members of a Slack workspace
The evolution of the number of Slack workspace members
Evolution of weekly messages
Evolution of weekly threads created
Evolution of messages per channel
Members per time zone

Before we get started, let’s take a high-level look at how we are going to achieve creating a Slack dashboard using Airbyte and Apache Superset.

We will use Airbyte’s Slack connector to get the data off a Slack workspace (we will be using Airbyte’s own Slack workspace for this tutorial).
We will save the data onto a PostgreSQL database.
Finally, using Apache Superset, we will implement the various metrics we care about.

Got it? Now let’s get started.

1. Replicating Data from Slack to Postgres with Airbyte

a. Deploying Airbyte

There are several easy ways to deploy Airbyte, as listed here. For this tutorial, I will just use the Docker Compose method from my workstation:

# In your workstation terminal
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

The above command will make the Airbyte app available on localhost:8000. Visit the URL on your favorite browser, and you should see Airbyte’s dashboard (if this is your first time, you will be prompted to enter your email to get started).

If you haven’t set Docker up, follow the instructions here to set it up on your machine.

b. Setting Up Airbyte’s Slack Source Connector

Airbyte’s Slack connector will give us access to the data. So, we are going to kick things off by setting this connector to be our data source in Airbyte’s web app. I am assuming you already have Airbyte and Docker set up on your local machine. We will be using Docker to create our PostgreSQL database container later on.

Now, let’s proceed. If you already went through the onboarding, click on the “new source” button at the top right of the Sources section. If you're going through the onboarding, then follow the instructions.

You will be requested to enter a name for the source you are about to create. You can call it “slack-source”. Then, in the Source Type combo box, look for “Slack,” and then select it. Airbyte will then present the configuration fields needed for the Slack connector. So you should be seeing something like this on the Airbyte App:

The first thing you will notice is that this connector requires a Slack token. So, we have to obtain one. If you are not a workspace admin, you will need to ask for permission.

Let’s walk through how we would get the Slack token we need.

Assuming you are a workspace admin, open the Slack workspace and navigate to [Workspace Name] > Administration > Customize [Workspace Name]. In our case, it will be Airbyte > Administration > Customize Airbyte (as shown below):

In the new page that opens up in your browser, you will then need to navigate to Configure apps.

In the new window that opens up, click on Build in the top right corner.

Click on the Create an App button.

In the modal form that follows, give your app a name - you can name it airbyte_superset, then select your workspace from the Development Slack Workspace.

Next, click on the Create App button. You will then be presented with a screen where we are going to set permissions for our airbyte_superset app, by clicking on the Permissions button on this page.

In the next screen, navigate to the scope section. Then, click on the Add an OAuth Scope button. This will allow you to add permission scopes for your app. At a minimum, your app should have the following permission scopes:

Then, we are going to add our created app to the workspace by clicking the Install to Workspace button.

Slack will prompt you that your app is requesting permission to access your workspace of choice. Click Allow.

After the app has been successfully installed, you will be taken to Slack’s dashboard, where you will see the Bot User OAuth Access Token.

This is the token you will provide on the Airbyte page where we left off to obtain it. So make sure to copy it and keep it in a safe place.

Now that we have our Slack token, let’s go back to the Airbyte page where we left off and add the token there.

We will also need to provide Airbyte with start_date. This is the date from which we want Airbyte to start replicating data from the Slack API, and we define that in the format: YYYY-MM-DDT00:00:00Z.
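If you generate this value from a script rather than typing it by hand, a small helper can produce and check the format. This is a minimal Python sketch; the function names are our own and are not part of Airbyte.

```python
from datetime import date, datetime

# The format described above: a calendar date followed by a fixed midnight UTC time.
START_DATE_FORMAT = "%Y-%m-%dT00:00:00Z"


def to_start_date(day: date) -> str:
    """Format a calendar date as a start_date string."""
    return day.strftime(START_DATE_FORMAT)


def is_valid_start_date(value: str) -> bool:
    """Return True if the string parses with the expected format."""
    try:
        datetime.strptime(value, START_DATE_FORMAT)
    except ValueError:
        return False
    return True


print(to_start_date(date(2020, 9, 1)))
print(is_valid_start_date("2020-09-01"))
```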

We will specify ours as 2020-09-01T00:00:00Z. We will also tell Airbyte to exclude archived channels and not include private channels, and also to join public channels, so the latter part of the form should look like this:

Finally, click on the Set up source button for Airbyte to set the Slack source up.

If the source was set up correctly, you will be taken to the destination section of Airbyte’s dashboard, where you will tell Airbyte where to store the replicated data.

c. Setting Up Airbyte’s Postgres Destination Connector

For our use case, we will be using PostgreSQL as the destination.

Click the add destination button in the top right corner, then click on add a new destination.

In the next screen, Airbyte will validate the source, and then present you with a form to give your destination a name. We’ll call this destination slack-destination. Then, we will select the Postgres destination type. Your screen should look like this now:

Great! We have a form to enter Postgres connection credentials, but we haven’t set up a Postgres database. Let’s do that!

Since we already have Docker installed, we can spin up a Postgres container with the following command in our terminal:

docker run --rm --name slack-db -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres

(Note that the Docker compose file for Superset ships with a Postgres database, as you can see here).

The above command will do the following:

create a Postgres container with the name slack-db,
set the password to password,
expose the container’s port 5432 as our machine’s port 2000,
create a database and a user, both called postgres.

With this, we can go back to the Airbyte screen and supply the information needed. Your form should look like this:

Then click on the Set up destination button.

d. Setting Up the Replication

You should now see the following screen:

Airbyte will then fetch the schema for the data coming from the Slack API for your workspace. You should leave all boxes checked and then choose the sync frequency - this is the interval in which Airbyte will sync the data coming from your workspace. Let’s set the sync interval to every 24 hours.

Then click on the Set up connection button.

Airbyte will now take you to the destination dashboard, where you will see the destination you just set up. Click on it to see more details about this destination.

You will see Airbyte running the very first sync. Depending on the size of the data Airbyte is replicating, it might take a while before syncing is complete.

When it’s done, you will see the Running status change to Succeeded, and the size of the data Airbyte replicated as well as the number of records being stored on the Postgres database.

To test if the sync worked, run the following in your terminal:

docker exec slack-db psql -U postgres -c "SELECT * FROM public.users;"

This should output the rows in the users table.

To get the row count of the users table as well, you can also run:

docker exec slack-db psql -U postgres -c "SELECT count(*) FROM public.users;"
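If you would rather run this check from a script, the same command can be assembled and run with Python's standard library. This is a sketch that assumes Docker is installed and that the container is named slack-db as above. The helper names are our own.

```python
import shlex
import subprocess


def psql_command(container: str, query: str, user: str = "postgres") -> list:
    """Build the `docker exec ... psql` argument list for a single query."""
    return ["docker", "exec", container, "psql", "-U", user, "-c", query]


def run_query(container: str, query: str) -> str:
    """Run the query inside the container and return psql's output (needs Docker)."""
    result = subprocess.run(
        psql_command(container, query),
        capture_output=True,
        text=True,
        check=True,  # raise if psql or docker exits with an error
    )
    return result.stdout


# Print the command without running it, so it can be compared with the one above.
print(shlex.join(psql_command("slack-db", "SELECT count(*) FROM public.users;")))
```

Calling `run_query("slack-db", "SELECT count(*) FROM public.users;")` on a machine where the container is running returns the same output as the terminal command.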

Now that we have the data from the Slack workspace in our Postgres destination, we will head on to creating the Slack dashboard with Apache Superset.

2. Setting Up Apache Superset for the Dashboards

a. Installing Apache Superset

Apache Superset, or simply Superset, is a modern data exploration and visualization platform. To get started using it, we will be cloning the Superset repo. Navigate to a destination in your terminal where you want to clone the Superset repo to and run:

git clone https://github.com/apache/superset.git

It’s recommended to check out the latest branch of Superset, so run:

cd superset

And then run:

git checkout latest

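Superset needs you to install and build its frontend dependencies and assets, which live in the superset-frontend directory of the repo. So, we will start by moving into that directory and installing the frontend dependencies:

cd superset-frontend
npm install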

Note: The above command assumes you have both Node and NPM installed on your machine.

Finally, for the frontend, we will build the assets by running:

npm run build

After that, go back up one directory into the Superset directory by running:

cd ..

Then run:

docker-compose up

This will download the Docker images Superset needs and build containers and start services Superset needs to run locally on your machine.

Once that’s done, you should be able to access Superset on your browser by visiting http://localhost:8088, and you should be presented with the Superset login screen.

Enter username: admin and Password: admin to be taken to your Superset dashboard.

Great! You’ve got Superset set up. Now let’s tell Superset about our Postgres Database holding the Slack data from Airbyte.

b. Setting Up a Postgres Database in Superset

To do this, on the top menu in your Superset dashboard, hover on the Data dropdown and click on Databases.

In the page that opens up, click on the + Database button in the top right corner.

Then, you will be presented with a modal to add your Database Name and the connection URI.

Let’s call our Database slack_db, and then add the following URI as the connection URI:

postgresql://postgres:password@docker.for.mac.localhost:2000/postgres

If you are on a Windows Machine, yours will be:

postgresql://postgres:password@docker.for.win.localhost:2000/postgres

Note: We are using docker.for.[mac|win].localhost in order to access the localhost of your machine, because using just localhost will point to the Docker container network and not your machine’s network.
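To see how the pieces of that URI fit together, here is a short Python sketch that assembles it from the values used earlier and splits it back into its parts. The host aliases are the two given above; the function names and the `mac`/`win` keys are our own, and other platforms are deliberately not covered.

```python
from urllib.parse import urlsplit

# Host aliases taken from the URIs above; other platforms are not covered here.
DOCKER_HOST_ALIASES = {
    "mac": "docker.for.mac.localhost",
    "win": "docker.for.win.localhost",
}


def superset_uri(platform_key, user="postgres", password="password",
                 port=2000, database="postgres"):
    """Assemble the connection URI for the Postgres container started earlier."""
    host = DOCKER_HOST_ALIASES[platform_key]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


def uri_parts(uri):
    """Split a connection URI into the fields it encodes."""
    parts = urlsplit(uri)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


print(uri_parts(superset_uri("mac")))
```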

Your Superset UI should look like this:

We will need to enable some settings on this connection. Click on the SQL LAB SETTINGS and check the following boxes:

Afterwards, click on the ADD button, and you will see your database on the data page of Superset.

c. Importing our dataset

Now that you’ve added the database, you will need to hover over the data menu again; now click on Datasets.

Then, you will be taken to the datasets page:

We want to only see the datasets that are in our slack_db database, so in the Database that is currently showing All, select slack_db and you will see that we don’t have any datasets at the moment.

You can fix this by clicking on the + DATASET button and adding the following datasets.

Note: Make sure you select the public schema under the Schema dropdown.

Now that we have set up Superset and given it our Slack data, let’s proceed to creating the visualizations we need.

Still remember them? Here they are again:

Total number of members of a Slack workspace
The evolution of the number of Slack workspace members
Evolution of weekly messages
Evolution of weekly threads created
Evolution of messages per channel
Members per time zone

3. Creating Our Dashboards with Superset

a. Total number of members of a Slack workspace

To get this, we will first click on the users dataset of our slack_db on the Superset dashboard.

Next, change untitled at the top to Number of Members.

Now change the Visualization Type to Big Number, remove the Time Range filter, and add a Subheader named “Slack Members.” So your UI should look like this:

Then, click on the RUN QUERY button, and you should now see the total number of members.

Pretty cool, right? Now let’s save this chart by clicking on the SAVE button.

Then, in the ADD TO DASHBOARD section, type in “Slack Dashboard”, click on the “Create Slack Dashboard” button, and then click the Save button.

Great! We have successfully created our first Chart, and we also created the Dashboard. Subsequently, we will be following this flow to add the other charts to the created Slack Dashboard.

b. Casting the ts column

Before we proceed with the rest of the charts for our dashboard, if you inspect the ts column on either the messages table or the threads table, you will see it’s of the type VARCHAR. We can’t really use this for our charts, so we have to cast both the messages and threads’ ts column as TIMESTAMP. Then, we can create our charts from the results of those queries. Let’s do this.

First, navigate to the Data menu, and click on the Datasets link. In the list of datasets, click the Edit button for the messages table.

You’re now in the Edit Dataset view. Click the Lock button to enable editing of the dataset. Then, navigate to the Columns tab, expand the ts dropdown, and then tick the Is Temporal box.

Persist the changes by clicking the Save button.

c. The evolution of the number of Slack workspace members

In the exploration page, let’s first get the chart showing the evolution of the number of Slack members. To do this, make your settings on this page match the screenshot below:

Save this chart onto the Slack Dashboard.

d. Evolution of weekly messages posted

Now, we will look at the evolution of weekly messages posted. Let’s configure the chart settings on the same page as the previous one.

Remember, your visualization will differ based on the data you have.

e. Evolution of weekly threads created

Now, we are finished with creating the message chart. Let's go over to the thread chart. You will recall that we will need to cast the ts column as stated earlier. So, do that and get to the exploration page, and make it match the screenshot below to achieve the required visualization:

f. Evolution of messages per channel

For this visualization, we will need a more complex SQL query. Here’s the query we used (as you can see in the screenshot below):

SELECT CAST(m.ts as TIMESTAMP), c.name, m.text
FROM public.messages m
INNER JOIN public.channels c
ON m.channel_id = c.id
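If you want to try the join logic before pointing it at your own data, the Python sketch below reproduces it against an in-memory SQLite database and counts messages per channel. SQLite stands in for Postgres here, and the tables contain only made-up rows with the columns the query needs.

```python
import sqlite3


def messages_per_channel(conn):
    """Join messages to channels on the channel id and count messages per channel."""
    query = """
        SELECT c.name, COUNT(*) AS message_count
        FROM messages m
        INNER JOIN channels c ON m.channel_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


# In-memory stand-in for the Postgres tables, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE channels (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE messages (ts TEXT, channel_id TEXT, text TEXT);
    INSERT INTO channels VALUES ('C1', 'general'), ('C2', 'random');
    INSERT INTO messages VALUES
        ('2021-03-01 09:00:00', 'C1', 'hello'),
        ('2021-03-01 09:05:00', 'C1', 'hi'),
        ('2021-03-02 10:00:00', 'C2', 'lunch?');
""")
print(messages_per_channel(conn))
```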

Next, click on EXPLORE to be taken to the exploration page; make it match the screenshot below:

Save this chart to the dashboard.

g. Members per time zone

Finally, we will be visualizing members per time zone. To do this, instead of casting in the SQL lab as we’ve previously done, we will explore another method to achieve casting by using Superset’s Virtual calculated column feature. This feature allows us to write SQL queries that customize the appearance and behavior of a specific column.

For our use case, we will need the updated column of the users table to be a TIMESTAMP, in order to perform the visualization we need for Members per time zone. Let’s start by clicking the edit icon on the users table in Superset.

You will be presented with a modal like so:

Click on the CALCULATED COLUMNS tab:

Then, click on the + ADD ITEM button, and make your settings match the screenshot below.

Then, go to the exploration page and make it match the settings below:

Now save this last chart, and head over to your Slack Dashboard. It should look like this:

Of course, you can edit how the dashboard looks to fit what you want on it.

Conclusion

In this article, we looked at using Airbyte’s Slack connector to get the data from a Slack workspace into a Postgres database, and then used Apache Superset to craft a dashboard of visualizations. If you have any questions about Airbyte, don’t hesitate to ask questions on our Slack! If you have questions about Superset, you can join the Superset Community Slack!

How Open-source Can Disrupt Build vs. Buy Considerations

John Lafleur — Fri, 22 Jan 2021 02:29:16 +0000

When you’re selling or considering purchasing a B2B tool, you need to understand the build vs. buy argument. What are the pros and cons of building the tool internally vs. buying the tool from a third-party vendor? This is especially true in big companies where you have the resources to build the said tools. Early-stage startups will generally opt for the faster route, going with self-served B2B tools -- unless the pricing is prohibitive.

But something we don’t often think about is how open-source just messes the whole thing up. The build is completely redefined. You now need to compare the B2B tool with two build options: building without the open-source tool, and building with it, which most often lowers the barrier significantly.

In this article, we’ll take the example of the ETL/ELT industry. We know it best, as we’re building Airbyte, the open-source ELT alternative. Let’s see how open-source for ETL/ELT with Airbyte is also flipping the previous build vs. buy balance on its head.

We’ve produced an infographic to illustrate that point. You will see that without taking Airbyte into consideration, the build vs. buy comparison came out in favor of buying Fivetran rather than building connectors yourself. But now, with Airbyte, you can either just use the open-sourced connectors and start replicating data in minutes for free, or even build new connectors (if ever Airbyte doesn’t support them) in a matter of days (vs. months before), with maintenance being crowdsourced throughout the Airbyte community.

The Infographic

The infographic shows:

in white, the original “build” scenario;
in blue, the original "buy" scenario with cloud-based Fivetran;
in purple, the new "build" scenario with 2 options: “build non-supported connector with Airbyte” in light purple, and “use prebuilt connectors from Airbyte” in dark purple

Let’s just say it: the playing field has changed!

The Explanation

Some context: the average business today uses well over 100 software apps, many of which contain valuable insights about an organization’s operations. Your company is likely on the way to using just as many apps, if not more, and you’ll need a solution to integrate all of the data your apps produce.

Time & Effort

Building your own pipeline by yourself is a significant time commitment. It can take between 3-6 months to set up a basic pipeline. Furthermore, beyond the time commitment, there is some inherent complexity in building a reliable, high-performance ELT pipeline. You need to:

Obtain developer access to the data source
Explore the data
Design the schema/data models
Set up a connector framework
Test the connector and validate the data
Set up orchestration, configuration validation, state management, normalization, schema migration, monitoring, etc.
Maintain the connector for every schema change that happens every few weeks. This part is very cumbersome, as it requires an increasing number of data engineers to manage your connectors.

In contrast, an off-the-shelf solution such as Fivetran can be set up in a matter of minutes with prebuilt connectors. Airbyte also takes literally 30 seconds to deploy, and you can start replicating data within 2 minutes.

The big difference between both options in terms of time and effort is that all the Fivetran customers we talked to also had to build and maintain connectors on the side, as the connectors they needed were either not supported in the way they needed or not supported at all by Fivetran.

That’s where the option to build with Airbyte comes in. For connectors not supported by Airbyte, it is a matter of hours to build connectors. Indeed, Airbyte already took care of having a UI, monitoring, scheduling, orchestration, integration with your data stack, automatic schema changes, etc. There is a very high chance we support your destination. So in the end, it’s only the EL part of the source connector you have to build, and Airbyte is providing some abstractions to make that easier.

Regarding maintenance, the goal of Airbyte is to crowdsource throughout the community. When a connector fails because of significant API changes, it will notify the connectors’ users. As soon as the fix is made available by the Airbyte team or a community member, Airbyte will propagate the fix to all the users. The hope is that this approach will provide a better SLA than closed-source solutions such as Fivetran, not to mention the fact that you won’t have to maintain the connector yourself.

People & Money

From what we’ve seen, a typical company requires the equivalent of at least two or three full-time data engineers to build and maintain a data pipeline. The total cost of three full-time engineers can reach the high six figures (including benefits). So that’s a lot!

Fivetran’s fees for a typical mid-sized company with five connectors are about $50,000. But you’ll have to add to that the cost of all the connectors you need to build and maintain yourself.

In contrast, Airbyte’s connectors are open-sourced, so you can use them for free. You also don’t need to pay for the egress to Fivetran’s infrastructure. It is possible that you might need a little bit of engineering time to operate Airbyte. If you need to build some of the connectors yourself, you will have to pay for the time spent by the data engineering team on building and maintaining them, but that would still be way less than if you had to do everything yourself.

Opportunity Costs

The actual value brought by your data team is through analysis and modeling. All the data integration, cleaning and transformation work is important because it enables the analysis and modeling. So the more time your team can spend on value-producing tasks, the better for the business.
So opportunity costs as depicted in the illustration are very important to consider. Plus, ask any data team -- they will much prefer doing analysis or modeling tasks, rather than pipelining! So you will have better talent retention this way.

Now you can see how open-source can flip the previous build vs. buy balance on its head. Before Airbyte, Fivetran was an easy sell. Now, it seems the contrary. Leveraging Airbyte’s open-source technology to build your own data infrastructure seems the obvious choice.

There is one last thing to consider when choosing which direction to take: the future.

Future Growth of Your Company

As your company grows, you will add data sources to the pool. The complexity and effort of building and maintaining a data pipeline for a huge number of data sources can quickly escalate beyond your data engineering team’s ability to handle it.

You might consider taking a chance on Fivetran’s ability to cover all or most of your connector needs, so that your team doesn’t need to build and maintain a continually increasing number of connectors (that would defeat the purpose). But be mindful that Fivetran will always weigh the ROI of maintaining connectors on the long tail; they won’t maintain connectors that don’t bring enough revenue to offset the maintenance costs.

On the other hand, Airbyte will continue to grow the number of prebuilt community-maintained connectors, and can even take a large portion of the maintenance costs off your hands.
When making a decision, consider how your company will evolve. And you can be sure that a great data infrastructure that grows with you will be a competitive advantage.

How to Build Thousands of Connectors

John Lafleur — Wed, 04 Nov 2020 04:59:37 +0000

We’re building an open-source data integration platform at Airbyte. We launched our MVP about a month ago. We were thrilled by the amount of feedback and support we got from the community. We even got our first big pull request from a contributor this week (2,000+ lines of code). But during this full month, we didn’t release any new connectors. You might wonder why we didn’t build on that momentum. If people were excited with our MVP even though it had only 6 connectors, you might think we should have ramped up on the number of connectors as fast as possible. We didn’t do that for two very important and differentiating reasons.

First, we were defining exactly what the best data protocol would be if we wanted to solve data integration once and for all, and this for all companies. You can learn more about our specification here. Even though it’s not final yet, you will have a glimpse of our vision for the future.

Second, and just as important, we were building a real manufacturing plant for data integration connectors. See, our team led data integration at LiveRamp, which has more than 1,000 data ingestion connectors and 1,000+ distribution connectors. So we have the experience of abstracting what can be abstracted and simplifying the manufacturing of new integration (very often without code). We haven’t fully built our manufacturing plant, but engineers can already add one new connector every day.

This article describes how we built this connector manufacturing plant.

What you need to think about when building a large number of connectors

When building a large catalog of connectors, there are several things that you need to think through.

Initial build

This is when you start from a blank page. This step usually requires a little bit of planning since it involves communication with external teams/companies.
The initial build step involves:

Access to the source/destination documentation
Access to test accounts, test infrastructure, etc.
Using golden path encoding good practices
Using the best language for the task: today, we support both Java and Python, but anyone can add their own language
Creating documentation
Defining the necessary inputs

Tests

Tests are essential to make sure that any code or protocol change won’t affect the connectors. They need to run before every merge.

They also ensure that the connector behaves as you expect. For that you need to run your connector against the actual production service. For example, if you’re working on the Salesforce connector, you must make sure that Salesforce actually behaves the way you expect. It is not unusual that an API or service documentation doesn’t fully reflect the reality.

We currently have the foundation of our test framework; it allows developers to focus solely on providing inputs and outputs, and the rest is taken care of by the framework.

These tests give us 90% certainty that the connector is fully functional. If there are edge cases, it is always possible to add more custom tests.
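As an illustration of the "provide inputs and outputs" idea, here is a minimal Python sketch of a data-driven test helper. It is not our actual test framework; the function names and the fake connector are hypothetical, and a real test would run against the live service rather than a stand-in.

```python
from typing import Callable, Iterable

Record = dict
Connector = Callable[[dict], Iterable[Record]]


def run_connector_case(
    connector: Connector, config: dict, expected: list[Record]
) -> list[str]:
    """Run one input/output case and return human-readable failures."""
    actual = list(connector(config))
    failures = []
    if len(actual) != len(expected):
        failures.append(f"expected {len(expected)} records, got {len(actual)}")
    for i, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            failures.append(f"record {i}: expected {want}, got {got}")
    return failures


def fake_users_connector(config: dict) -> Iterable[Record]:
    """Stand-in for a real connector (hypothetical)."""
    for user_id in range(1, config["limit"] + 1):
        yield {"id": user_id}
```

The connector author only writes the config and the expected records; the comparison and reporting logic lives in one place and is shared by every connector.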

Liveliness & Change detection

It is essential to ensure that the source or destination continues to behave as it was encoded during the initial build phase and to ensure that the source or destination is still alive for monitoring purposes.

These verifications must be run at a cadence, and any failure needs to be investigated and fixed, leading to the maintenance phase.
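One simple way to picture these checks is a liveness probe plus a fingerprint of the shape of a sample record. The Python sketch below is only an assumption about how such a check could look (all names are hypothetical), not a description of what Airbyte runs.

```python
import hashlib
import json
from typing import Callable


def schema_fingerprint(record: dict) -> str:
    """Hash the field names and value types of a sample record."""
    shape = sorted((key, type(value).__name__) for key, value in record.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()


def detect_change(baseline: str, sample: dict) -> bool:
    """True if the sample no longer matches the shape recorded at build time."""
    return schema_fingerprint(sample) != baseline


def is_alive(fetch_sample: Callable[[], dict]) -> bool:
    """True if the source answers at all; any exception counts as down."""
    try:
        fetch_sample()
        return True
    except Exception:
        return False
```

Run on a schedule, a `False` from `is_alive` or a `True` from `detect_change` is the signal that sends the connector into the maintenance phase.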

Maintenance

We need to define how we are going to update the connector, push changes and propagate the changes to all the running instances of Airbyte.

The art of building connectors is thinking in onion layers

Segmenting cattle code

To make a parallel with the pet/cattle concept that is well known in DevOps/Infrastructure, a connector is cattle code, and you want to spend as little time on it as possible. Anything you can do to prevent yourself from doing work in the future, you need to do. This will accelerate your production tremendously.

Abstractions as onion layers

Maximizing high-leverage work leads you to build your architecture with an onion-esque structure:

The center defines the lowest level of the API. Implementing a connector at that level requires a lot of engineering time. But, it is your escape hatch for very complex connectors where you need a lot of control.

Then, you build new layers of abstraction that help tackle families of connectors very quickly.

Today, we’ve built one of these abstractions to support existing Singer integration. Building an integration leveraging Singer takes us less than 3 hours, and our goal is to bring it down to less than 10 minutes.

We have the same ambition for every other family of sources and destinations.
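To make the onion image concrete, here is a minimal Python sketch. The class names are hypothetical and the "family" shown (page-by-page sources) is just an example we picked for illustration: the inner class is the low-level escape hatch, and the outer layer absorbs the work that every connector in the family would otherwise repeat.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Source(ABC):
    """Center of the onion: full control, most work per connector."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        ...


class PaginatedSource(Source):
    """Outer layer: owns the paging loop for a whole family of sources."""

    @abstractmethod
    def fetch_page(self, page: int) -> list[dict]:
        ...

    def read(self) -> Iterator[dict]:
        page = 0
        while True:
            rows = self.fetch_page(page)
            if not rows:  # an empty page ends the sync
                return
            yield from rows
            page += 1


class FakeUsersSource(PaginatedSource):
    """A new connector in this family only describes how to get one page."""

    PAGES = [[{"id": 1}, {"id": 2}], [{"id": 3}]]

    def fetch_page(self, page: int) -> list[dict]:
        return self.PAGES[page] if page < len(self.PAGES) else []
```

Here `list(FakeUsersSource().read())` yields the three records across both pages, while a connector that doesn't fit the family can still subclass `Source` directly.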

As we continue to improve our manufacturing plant for connectors, we will build tools that will allow us to handle 95% of integrations with no or very little code.

This is how we are going to address the long tail of integrations and how we’re going to make integrations a commodity.

What Airbyte has built up to now

We’ve built the following:

The center of the onion
The golden path in Java & Python to build new connectors
The first version of the integration test framework
Connectors: 10 sources with a rate of 1 new source per day, and 4 destinations
A layer to quickly support Singer integrations

What our ambitions are with this connector manufacturing plant

We want to reach a rate of 5 connectors per day and accelerate even beyond that.

We also want to provide the community with more tools to build and contribute their own connectors. Ideally, 95% of connectors can be added to Airbyte with no code.

We hope this gives you a better understanding of what we’ve been up to and what our real ambitions are. If you see any ways to improve this architecture, we’re all ears. Don’t hesitate to join our Slack to discuss any questions or suggestions with the team.

Why the Future of ETL Is Not ELT, But EL(T)

John Lafleur — Wed, 04 Nov 2020 04:51:39 +0000

How we store and manage data has completely changed over the last decade. We moved from an ETL world to an ELT world, with companies like Fivetran pushing the trend. However, we don’t think it is going to stop there; ELT is a transition in our mind towards EL(T) (with EL decoupled from T). And to understand this, we need to discern the underlying reasons for this trend, as they might show what’s in store for the future.

This is what we will be doing in this article. I’m the co-founder of Airbyte, the new upcoming open-source standard for data integrations.

What are the problems with ETL?

Historically, the data pipeline process consisted of extracting, transforming, and loading data into a warehouse or a data lake. There are serious disadvantages to this sequence.

Inflexibility

ETL is inherently rigid. It forces data analysts to know beforehand every way they are going to use the data, every report they are going to produce. Any change they make can be costly. It can potentially affect data consumers downstream of the initial extraction.

Lack of visibility

Every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse, only what was kept during the transformation phase. This is risky, as conclusions might be drawn based on data that hasn’t been properly sliced.

Lack of Autonomy for Analysts

Last but not least, building an ETL-based data pipeline is often beyond the technical capabilities of analysts. It typically requires the close involvement of engineering talent, along with additional code to extract and transform each source of data.

The alternative to a complex engineering project is to conduct analyses and build reports on an ad hoc, time-intensive, and ultimately unsustainable basis.

What changed and why ELT is way better

Cloud-based Computation and Storage of Data

The ETL approach was once necessary because of the high costs of on-premises computation and storage. With the rapid growth of cloud-based data warehouses such as Snowflake, and the plummeting cost of cloud-based computation and storage, there is little reason to continue doing transformation before loading at the final destination. Indeed, flipping the two enables analysts to do a better job in an autonomous way.

ELT Supports Agile Decision-Making for Analysts

When analysts can load data before transforming it, they don’t have to decide in advance exactly which insights they want to generate, or which exact schema they will need.

Instead, the underlying source data is directly replicated to a data warehouse, comprising a “single source of truth.” Analysts can then perform transformations on the data as needed. Analysts will always be able to go back to the original data and won’t suffer from transformations that might have compromised the integrity of the data, giving them a free hand. This makes the business intelligence process incomparably more flexible and safe.
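Here is a small Python sketch of that idea, using the standard library's `sqlite3` as a stand-in for a cloud warehouse. The table, view, and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: replicate the source rows as-is, keeping every column.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, coupon TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "SPRING"), (2, 800, None)],
)

# Transform: done afterwards, inside the warehouse, as a view over raw data.
conn.execute(
    "CREATE VIEW order_totals AS "
    "SELECT id, amount_cents / 100.0 AS amount FROM raw_orders"
)
totals = conn.execute("SELECT id, amount FROM order_totals ORDER BY id").fetchall()

# A question nobody planned for is still answerable from the raw table.
coupon_orders = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE coupon IS NOT NULL"
).fetchone()[0]
```

In an ETL setup that only loaded `id` and `amount`, the coupon question would require changing and re-running the pipeline; here it is just another query.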

ELT Promotes Data Literacy Across the Whole Company

When used in combination with cloud-based business intelligence tools such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of analytics across organizations. Business intelligence dashboards become accessible even to relatively non-technical users.

We’re big fans of ELT at Airbyte, too. But ELT is not completely solving the data integration problem and has problems of its own. We think EL needs to be completely decoupled from T.

What’s changing now and why EL(T) is the future

Merging of Data Lakes and Warehouses

There was a great analysis by Andreessen Horowitz about how data infrastructures are evolving. Here is the architecture diagram of the modern data infrastructure they came up with after a lot of interviews with industry leaders.

Data infrastructure serves two purposes at a high level:

Helps business leaders make better decisions through the use of data - analytic use cases
Builds data intelligence into customer-facing applications, including via machine learning - operational use cases

Two parallel ecosystems have grown up around these broad use cases.

The data warehouse forms the foundation of the analytics ecosystem. Most warehouses store data in a structured format. They are designed to generate insights from core business metrics, usually with SQL (although Python is growing in popularity).

The data lake is the backbone of the operational ecosystem. By storing data in raw form, it delivers the flexibility, scale, and performance required for applications and more advanced data processing needs. Data lakes operate on a wide range of languages including Java/Scala, Python, R, and SQL.

What’s really interesting is that modern data warehouses and data lakes are starting to resemble one another – both offering commodity storage, native horizontal scaling, semi-structured data types, ACID transactions, interactive SQL queries, and so on.

So you might be wondering if data warehouses and data lakes are on a path toward convergence. Will they become interchangeable in a stack? Will data warehouses also be used for the operational use case?

EL(T) Supports Both Use Cases: Analytics and Operational ML

EL, in contrast to ELT, completely decouples the Extract-Load part from any optional transformation that may occur.
The operational use cases are all unique in the way incoming data is leveraged. Some might use a unique transformation process; some might not even use any transformation.

In regards to the analytics case, analysts will need to get the incoming data normalized for their own needs at some point. But decoupling EL from T would let them choose whichever normalization tool they want. DBT has been gaining a lot of traction lately among data engineering and data science teams. It has become the open-source standard for transformation. Even Fivetran integrates with them to let teams use DBT if they’re used to it.
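The decoupling can be pictured with a minimal Python sketch. The function names are hypothetical and plain lists stand in for a source and a destination: the EL step never touches the records, and any transformation is a separate, optional step run later with whatever tool the team prefers.

```python
from typing import Callable, Iterable

Record = dict


def run_el(extract: Callable[[], Iterable[Record]], raw_store: list[Record]) -> None:
    """Extract-Load only: copy records to the destination untouched."""
    raw_store.extend(extract())


def apply_transform(
    raw_store: list[Record], transform: Callable[[Record], Record]
) -> list[Record]:
    """Optional T step, run separately; works on copies so raw data survives."""
    return [transform(dict(record)) for record in raw_store]


def lowercase_email(record: Record) -> Record:
    """One team's normalization; another team could plug in something else."""
    record["email"] = record["email"].lower()
    return record
```

An operational consumer can read `raw_store` directly, while an analytics team calls `apply_transform(raw_store, lowercase_email)` (or hands the raw tables to DBT) without the EL step knowing or caring.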

EL Scales Faster and Leverages the Whole Ecosystem

Transformation is where all the edge cases lie. For every specific need within any company, there is a schema normalization unique to it, for each and every one of the tools.

Decoupling EL from T enables the industry to start covering the long tail of connectors. At Airbyte, we’re building a “connector manufacturing plant” so we can get to 1,000 pre-built connectors in a matter of months.

Furthermore, as mentioned above, it would help teams leverage the whole ecosystem in an easier way. You start to see an open-source standard for every need. In a sense, the future data architecture might look like this:

In the end, extract and load will be decoupled from transformation. Do you agree with us? If so, you might be interested to have a look at what Airbyte does.

The State of Open-Source Data Integration and ETL

John Lafleur — Sun, 18 Oct 2020 23:34:49 +0000

Open-source data integration is not new. It started 16 years ago with Talend. But since then, the whole industry has changed. The likes of Snowflake, BigQuery, and Redshift have changed how data is being hosted, managed, and accessed, while making it easier and a lot cheaper. But the data integration industry has evolved as well.

On one hand, new open-source projects emerged, such as Singer.io in 2017. This enabled more data integration connectors to become accessible to more teams, even though it still required a significant amount of manual work.

On the other hand, data integration was made accessible to more teams (analysts, scientists, business intelligence teams). Indeed, companies like Fivetran benefited from Snowflake’s rise, empowering non-engineering teams to set up and manage their data integration connectors by themselves, so they can use and work on the data in an autonomous way.

But even with this progress, a large majority of teams still build their own connectors in-house. The build vs. buy leans strongly on the build. That’s why we think it’s time to have a fresh new look at the landscape of the open-source technologies around data integration.

However, the idea for this article came from an awesome debate on DBT’s Slack last week. The discussion centered around two things:

The state of open-source alternatives to Fivetran, and
Whether an open-source (OSS) approach is more relevant than a commercial software approach in addressing the data integration problem.

Even Fivetran’s CEO was involved in the debate.

We already synthesized the second point in a previous article. In this article, we want to analyze the first point: the landscape of open-source data integration technologies.

TL;DR

Here is a table summarizing our analysis.

In orange is what we’re building at Airbyte over the next few weeks.

To better understand this table, we invite you to read below the details of our analysis on the landscape.

Data integration open-source projects

Singer

Singer was launched in 2017, and was until now the most popular open-source project. It was initiated by StitchData, which was founded in 2016. Over the years, Singer grew to support 96 taps and targets.

Increasingly outdated connectors: Talend (acquirer of StitchData) seems to have stopped investing in maintaining Singer’s community and connectors. As most connectors see schema changes several times a year, more and more of Singer’s taps and targets are not actively maintained and are becoming outdated.
Absence of standardization: each connector is its own open-source project. So you never know the quality of a tap or target until you have actually used it. There is no guarantee whatsoever about what you’ll get.
Singer’s connectors are standalone binaries: you still need to build everything around to make them work.
No full commitment to open sourcing all connectors, as some connectors are only offered by StitchData under a paid plan.

In the end, a lot of teams will use StitchData for the connectors that work well, and will build their own integration connectors if they don’t work out of the box. Editing a Singer connector is not easier than building and maintaining the connector yourself. This defeats the purpose of open source.

Airbyte

Airbyte was born in July 2020, so it is still new. It was born out of frustration with Singer and other open-source projects. It was built by a team of data integration veterans from LiveRamp, who individually built and maintained more than 1,000 integrations, so 8 times more than Singer. Their ambition is to support 50+ connectors by the end of 2020, so in just 5 months since the inception of the project.

Airbyte’s mission is to commoditize data integration, and we have made several significant choices towards this goal:

Airbyte’s connectors are usable out of the box through a UI and API, with monitoring, scheduling and orchestration. Airbyte was built on the premise that a user, whatever their background, should be able to move data in 2 minutes. Data engineers might want to use raw data and their own transformation processes, or to use Airbyte’s API to include data integration in their workflows. On the other hand, analysts and data scientists might want to use normalized consolidated data in their database or data warehouses. Airbyte supports all these use cases.
One platform, one project with standards: This will help consolidate the developments behind one single project, some standardization and specific data protocol that can benefit all teams and specific cases.
Connectors can be built in the language of your choice, as Airbyte runs them as Docker containers.
Decoupling of the whole platform to let teams use whatever part of Airbyte they want based on their needs and their existing stack (orchestration with Airflow, Kubernetes, or Airbyte, transformation with DBT or again Airbyte, etc.). Teams can use Airbyte’s orchestrator or not, their normalization or not; everything becomes possible.
A full commitment to the open-source MIT project with the promise not to hide some connectors behind paid walls.

The number of connectors supported by Airbyte and its community is growing fast. Their team anticipates that it will outgrow Singer’s by early 2021. Note that Airbyte’s data protocol is compatible with Singer’s. So it is easy to migrate a Singer tap onto Airbyte, too.

PipelineWise

PipelineWise is an open-source project by TransferWise that was built with the primary goal of serving their own needs. They support 21 connectors, and add new ones based on the needs of the parent company. There is no business model attached to the project, and no apparent interest from the company in growing the community.

As close to the original format as possible: PipelineWise aims to reproduce the data from the source to an Analytics-Data-Store in as close to the original format as possible. Some minor load time transformations are supported, but complex mapping and joins have to be done in the Analytics-Data-Store to extract meaning.
Managed Schema Changes: When source data changes, PipelineWise detects the change and alters the schema in your Analytics-Data-Store automatically.
YAML based configuration: Data pipelines are defined as YAML files, ensuring that the entire configuration is kept under version control.
Lightweight: No daemons or database setup are required.
Compatible with Singer’s data protocols: PipelineWise is using Singer.io compatible taps and target connectors. New connectors can be added to PipelineWise with relatively small effort.

Meltano

Meltano is an orchestrator dedicated to data integration, built by GitLab on top of Singer’s taps and targets. Since 2019, they have been iterating on several approaches. They now have one maintainer for this project, which is CLI-first. After one year, they now support 19 connectors.

Built on top of Singer’s taps and targets: Meltano has the same limitations as Singer’s in regards to its data protocol.
CLI-first approach: Meltano was primarily built with a command line interface in mind. In that sense, they seem to target engineers with a preference for that interface.
A new UI: Meltano has recently built a new UI to try to appeal to a larger audience.
Integration with DBT for transformation: Meltano offers some deep integration with DBT, and therefore lets data engineering teams handle transformation any way they want.
Integration with Airflow for orchestration: You can either use Meltano alone for orchestration or with Airflow; Meltano works both ways.

Related noteworthy open-source projects

Here are some other open-source projects that you might have heard of, as they’re often used by data engineering teams. We thought they deserved to be mentioned.

Apache Airflow

We see a lot of teams building their own data integration connectors using Airflow for the orchestration and scheduling. Airflow wasn’t built with data integration in mind. But a lot of teams use it to build workflows. Airbyte is the only open-source project to offer an API so teams can include data integration jobs in their workflows.

DBT

DBT is the most widely used data transformation open-source project. You need to be proficient in SQL to use it properly, but a lot of data engineering / integration teams use it to normalize raw data coming into the warehouses or databases.

Both Airbyte and Meltano are compatible with DBT. Airbyte will offer teams the ability to choose between raw or normalized data for each connection they need, which addresses the needs of both data engineering and data analyst teams. Meltano doesn’t provide normalized schemas, and relies solely on DBT for that.

Apache Camel

Apache Camel is an open-source rule-based routing and mediation engine. That means you get smart completion of routing rules in your IDE whether in your Java, Scala, or XML editor. It uses URIs to enable easier integration with all kinds of transport and messaging models including HTTP, ActiveMQ, JMS, JBI, SCA, MINA and CXF, together with working with pluggable Data Format options.

Streamsets

Streamsets is a data-ops platform that includes a low-level open-source data collection engine named DataCollector. This open-source project is not supported by any community, and is mostly used by the company to assure their enterprise clients that they will still have access to the code whatever happens.

Let us know if we missed any open-source projects or any valuable information on the listed ones. We will try to keep this list up to date and precise.

Airbyte vs. Singer: Why Airbyte Is Not Built on Top of Singer

John Lafleur — Tue, 13 Oct 2020 03:40:10 +0000

We’ve been asked if Airbyte was being built on top of Singer. Even though we at Airbyte loved the initial mission they had, that won’t be the case. Airbyte’s data protocol will be compatible with Singer’s, so that you can easily integrate and use Singer’s taps, but our protocol will differ in many ways from theirs.

Let’s first go over the reasons why we don’t build on top of Singer, in contrast with other open-source projects (such as Meltano), and then let’s see how different Airbyte is.

Why Airbyte is not built on top of Singer

First, a little history on Singer.io. It was the first open-source project with the mission to address the data integration problem. It was introduced by the company StitchData (which was acquired by Talend in 2018) as a way to offer extensibility to the connectors they had pre-built. Your company could build their own taps (source connectors). Singer now counts about 150-200 connectors, on par with the closed-source Fivetran.

So what is the issue with Singer? Several things:

1. Absence of standardization

There is an absence of standardization and enforcement of protocol. Developers just add whatever they want in their implementation and messages. Contributors only address their own use cases and needs, and don’t build the connector with the mindset to address most use cases that the community might need. So, you never know the quality of a tap or target until you have actually used it. There is no guarantee whatsoever about what you’ll get.

2. No real ownership

There is no real ownership or direction for the project anymore. Indeed, StitchData, over the years, became less and less involved with maintaining the open-source project. And the difficulty with data integration is that applications and APIs change schemas every few months. So a lot of the connectors became outdated, as they were not maintained anymore. In fact, it is not unusual to see connectors with years old PRs that aren’t merged.

In the end, you have a set of connectors with varying quality. In general, the more used a connector is, the more maintained it is. So there is still some value in being compatible with Singer, but building on top of them and being limited by them would not be smart.

How Airbyte’s choices are different from Singer’s

1. Airbyte’s connectors are not standalone binaries

Singer’s connectors are standalone binaries: you still need to build everything around them to use them.

With Airbyte, we want to help take them to the next level with a platform that can orchestrate and make them usable out of the box (through our UI or API).

2. One platform, one project with standards

In contrast to Singer, Airbyte has one single repo for the whole project and all the connectors. This will help consolidate the developments, and it will unify the community behind one single project and one vision. This means that the project and community can be opinionated about what a connector should be, and how it should be built.

3. Connectors can be built in the language of your choice

Airbyte runs connectors as Docker containers. So connectors can be built in whichever language you want. The overall platform is built using Java, but any team can build their own connector in Python, Go, JavaScript, etc. The easier we make it for potential contributors to contribute, the more active the community will be in building and maintaining connectors.
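The reason the language doesn't matter is that, from the platform's point of view, a containerized connector is just a program that writes messages to its standard output. Here is a minimal Python sketch of that idea. The message shape and the stream name are illustrative assumptions, not Airbyte's actual specification.

```python
import json
import sys
from typing import Iterable, TextIO


def read_records() -> Iterable[dict]:
    """Stand-in for calls to a real API or database (hypothetical data)."""
    yield {"id": 1, "name": "Ada"}
    yield {"id": 2, "name": "Grace"}


def main(out: TextIO = sys.stdout) -> None:
    """Emit one JSON message per line; any language can do the same."""
    for record in read_records():
        message = {"type": "RECORD", "stream": "users", "data": record}
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    main()
```

Package a script like this in a Docker image and whatever runs the container only needs to read lines of JSON, whether the connector was written in Python, Go, or anything else.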

4. Decoupling of Extract-Load from Transformation

A normalization stands for an opinionated view of how one should use the data. By separating extract-load and transformation, Airbyte enables:

Engineers, who want to transform the data themselves with their processes, to do that.
Engineers / Analysts / Data scientists / Teams, who want to use the normalized data right out of the box, if the normalization is in line with how they want to use the data.

It also enables Airbyte to more easily cover the long tail of integrations. Some connectors might not have some normalization, and that can come separately. The community will also be able to contribute their own normalization, so data users can choose which one suits them best.

5. A UI and API to address every team’s needs

Airbyte was built on the premise that a user, whatever their background, should be able to move data in 2 minutes. To do that, we needed to build the UI and also an API. StitchData does bring the UI, but it doesn’t come with Singer. Here, Airbyte’s UI and API are both open sourced.

Check out our tutorial on how to move data between Postgres DB in just a few minutes.

6. A full commitment to the open-source MIT project with no premium connectors

Singer was born after the company StitchData as a way to expand the number of connectors, thanks to the community. StitchData was just selling their product along with it, and was keeping some connectors away from the community as an upgrade lever.

Airbyte’s core product is the open-source connectors, and the team is fully committed to expanding and maintaining the connectors. Our business model doesn’t depend on any premium connectors. Our vision is to become the open-source standard for data integrations, and then to build Enterprise-targeted features (security and privacy compliance, user and role management, support, SLAs, etc.).

How Airbyte is helping Singer users achieve their goals

If you’re using some Singer taps, we’ve got you covered. Our data protocol is not built upon Singer’s, but it is compatible with it.

This means we will be able to support Singer’s taps (we will be very selective and focus only on the highest-quality, well-maintained ones), and that you can add your own to Airbyte.

Note that we will keep very high quality standards for our connectors, though, so we don’t guarantee that we will promote or support Singer’s low-quality connectors.
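
As a rough sketch of what protocol compatibility can look like, the snippet below wraps a Singer RECORD message (which carries "type", "stream" and "record" fields) into another envelope. The target envelope and the wrap_singer_message helper are assumptions made for illustration; they are not taken from Airbyte's code.

```python
import json
import time


def wrap_singer_message(line, now_ms=None):
    """Convert one Singer RECORD line into a hypothetical platform envelope.

    Returns None for other Singer message types (SCHEMA, STATE), which a real
    adapter would need to handle separately.
    """
    message = json.loads(line)
    if message.get("type") != "RECORD":
        return None
    return {
        "type": "RECORD",
        "record": {
            "stream": message["stream"],
            "data": message["record"],
            # Timestamp in milliseconds; injectable so the output is reproducible.
            "emitted_at": now_ms if now_ms is not None else int(time.time() * 1000),
        },
    }


singer_line = '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada"}}'
wrapped = wrap_singer_message(singer_line, now_ms=0)
```

An adapter like this leaves the tap itself unchanged, which is what makes it possible to reuse existing taps selectively.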

We hope this article clarifies how Airbyte is different from Singer (or Meltano, which is built on top of Singer). Airbyte's ambitions go beyond what Singer’s protocol can offer.

Solving Data Integration: The Pros and Cons of Open Source and Commercial Software

John Lafleur — Sun, 11 Oct 2020 23:49:17 +0000

There was an awesome debate on DBT’s Slack last week, mainly about two things:

The state of open-source alternatives to Fivetran
Whether an open-source (OSS) approach is more relevant than a commercial software approach in addressing the data integration problem.

If you’re already on DBT’s Slack, here is the thread’s URL. Even Fivetran’s CEO was involved in the debate.

In this article, we want to discuss the second point and go over the different points mentioned by each party. The first point will come in another article. It’s more relevant to discuss whether an OSS approach makes sense before drilling down into the different alternatives.

We’ll go over the main challenges that companies face and see which approach fits best. We’ll call “commercial companies” the ones with a commercial software product, and “OSS companies” the ones with an open-source approach.

TL;DR

Here is a table summarizing the points mentioned in the debate.

To better understand this table, we invite you to read the list of challenges each approach faces below.

1. Having a large number of high-quality well-maintained pre-built connectors

This might be the most challenging part for the open-source approach, but there are actually choices that can make an OSS approach even stronger than a commercial one on that matter.

Commercial approach

In this case, a company supports a limited number of connectors (the most used ones) and actively maintains them. They know when there is a schema change and when they need to update the connector, and can be pretty responsive on that, if the organization is responsive and scales well.

However, the more connectors, the more difficult it is for a commercial company to keep the same level of maintenance across all connectors. In an ideal world, the organization will grow linearly with the number of connectors. But most often, there are inefficient processes, so every organization will reach a limit. The more efficient they are, the higher the limit.

Open-source approach

When you look at Singer.io, you would think that an OSS approach is at a serious disadvantage here. But Singer.io has actually made several poor choices:

There is an absence of standardization / enforcement of protocol, and developers just add whatever they want in their implementation and messages.
Integrations live by themselves in their own repos. So each connector can be seen as an individual open-source project. This does not help when building a community.
There is no real ownership, nor direction for the project.

So can we actually achieve a large number of high-quality, well-maintained connectors with an open-source approach? The answer is probably yes, but only if the following conditions are met:

One single repo for the whole project and all connectors. This will help consolidate the developments, and it will unify the community behind one single project and one vision. This means that the project and community can be opinionated about what a connector should be, and how it should be built.
Detailed logs for all users, with the ability to easily surface the issue on the GitHub project to the whole community.
A separation between extract-load and transformation. Users would be able to use the connector without any normalization, so they use their own transformation processes. By doing so, we can more easily identify where failures happen, and how to solve them. Plus, it would make it easier to support a lot more integrations faster.

A last note here: when users run a connector in production and see it fail, they will be able to fix it without waiting for an external support team (as they would with a commercial company), and the whole community will benefit from the fix.

2. Addressing unique corner cases and unique needs

Every company is opinionated about the way it wants to use a connector; each has specific needs.

Commercial approach

Commercial companies have an ROI-driven relationship with connectors. Any additional work on an edge case needs to deliver some ROI, or the company will have a hard time scaling. That also means that a commercial company will most likely have very high-quality connectors for the most used integrations. But it can’t afford to keep that level of quality across all integrations, and thus won’t address their corner cases. With finite resources, it will always be a trade-off. Commercial companies can’t address the long tail of integrations and can’t meet all unique needs.

Furthermore, if a client’s needs are not met, the client won’t be able to customize the existing connector. That is the limitation of this approach.

Open-source approach

It is actually not obvious that an OSS approach would easily address all unique corner cases. Contributors build the connectors with their use cases and needs in mind; they won’t be building a high quality connector to address most companies’ needs.

However, there are ways to avoid that:

Setting standards and enforcing protocols, with one single platform / project, as mentioned in the first point.
Decoupling extract-load (EL) and normalization. A lot of the customization can be done directly in normalization, but a lot of the EL is common to everyone. In that case, everybody can benefit from the EL part and use their own transformation tools to address their own needs.
Enabling contributors to build their connectors using the language of their choice by running connectors as Docker containers.

Being open source lets any engineer adapt any pre-built connectors to their needs, so every company will be able to address their own needs. The question is more about how much effort would be needed for each of their cases.

3. Pricing model indexed on usage

All existing commercial products have their pricing indexed in some way on the volume of data transferred and the number of connectors used. That means that teams cannot use those services without always thinking about costs.

On the other hand, open-source products are self-hosted, so they base their pricing on the features used and not on the volume of data. Once the product is adopted, teams do not have to worry about costs and can use all connectors freely.

4. Debugging autonomy

This is an easy one. When dealing with a commercial product, if you face an issue, the only thing you can do is contact their support, and hope that your issue will be prioritized in the list of issues the company’s engineering team must resolve. You won’t be able to know when the issue will be fixed. It can take hours at best, or months.

With open source, you don’t need to wait on anybody. You can fix the integration yourself. Having this ability is important if the integration is vital to your business.

5. Getting the internal resources

By resources, we mean either budget or the engineering time to work on the integrations. At Airbyte, we don’t like the word resources when talking about people. But in this context, it is the only common one we could think of.

The truth is that in a lot of enterprises, it can be easier to justify hiring a new employee than getting even a modest budget approved for an external vendor. Depending on the open-source project, most or all of the job could be already done. That’s one of the reasons why most companies use open-source technologies rather than external vendors.

6. Easy to use by any team (analysts, finance teams, etc.)

Once consolidated, the data is used by different teams: business intelligence / analysts, data scientists, finance teams, customer success, product or marketing team, etc. That is the beauty of products like Fivetran: anybody can use them and anyone can add more data.

One would think that ease of use is exclusive to commercial products, but that is actually not true. For instance, GitLab and many other OSS projects have UIs that are usable by a non-technical audience. Airbyte has one, and it is one of its primary features.

7. Security and privacy compliance

We’re talking about data, and most probably this involves customer data. All companies are subject to privacy compliance laws, such as GDPR, CCPA, HIPAA, etc. As a matter of fact, once a company grows beyond a certain size (about 100 employees), all external products need to go through a security compliance process that can take several months.

With open source, teams just use the technology directly without such processes. The adoption is frictionless, and the engineering need can be met overnight.

8. Moving data between internal databases

Commercial products mostly sit in the cloud. So if you have to replicate data from one internal database to another, it makes no sense to have the data move through the external vendor. In addition to the vendor’s costs, you incur egress costs and raise privacy concerns.

Open-source products are mostly self-hosted, so they don’t have this problem.

Is open source hard without a direct revenue channel?

One of the comments in the thread was that “open source is hard without a direct revenue channel to support it.” That is actually true for any company. If you have a hard time generating revenue, you can’t scale the team or your effort.

According to Bessemer, open-source companies grow their revenue faster than many cloud leaders.

Some other interesting quotes / points

“The general argument for open-source software that applies to all software: Having access to the source code means users have freedom--the freedom to change the software how they see fit and fix issues.”

“If you spent months or years using, say, Fivetran, and all of a sudden Fivetran goes out of business or significantly changes its business, you could run the software yourself, even if just for a period of time. I’ve been there when it’s happened with other software; it’s not pretty, and sometimes you only have days or weeks of heads up.”

“If it’s open source, I can run it without Fivetran, so for special circumstances, like compliance or high volume, I don’t have to deal with legal or cost limitations.”

“I can see the code running my integration; I can more quickly debug issues. I can also easily create integrations anyone can choose to use. Tools like Fivetran and Stitch only work a percentage of the time; open source allows me to fill in that gap without creating one-off scripts (or Fivetran lambda functions) or going back and forth for weeks or months with support on a bug that I could fix in a few hours.”

Conclusion

If the open-source project is well structured and well thought through, there will be no points where a commercial approach can be better. The issue is that, until now, there were no good open-source projects that could alleviate the dangers of this approach.

If you like how we think an open-source project can address all the possible concerns by building a single platform with a single repo, a defined data protocol, a ready-to-use UI, and decoupling of EL and normalization, you should definitely have a look at what we’re building at Airbyte. We want to become the open-source standard for data integrations.

Let us know if you think we missed anything. Our goal is to see things from all perspectives and to keep this article up to date.