DEV Community: Airbyte

Why ETL Needs Open Source to Address the Long Tail of Integrations

John Lafleur — Fri, 02 Jul 2021 03:33:08 +0000

Over the last year, our team has interviewed more than 200 companies about their data integration use cases. What we discovered is that data integration in 2021 is still a mess.

The Unscalable Current Situation

At least 80 of the 200 interviews were with users of existing ETL technology, such as Fivetran, StitchData and Matillion. We found that every one of them was also building and maintaining their own connectors even though they were using an ETL solution (or an ELT one - for simplicity, I will just use the term ETL). Why?

We found two reasons:

Incomplete coverage for connectors
Significant friction around database replication

Inability to cover all connector needs

Many users’ ETL solutions didn’t support the connector they wanted, or supported it but not in the way they needed.

An example for context: Fivetran has been in existence for eight years and supports 150 connectors. Yet, in just two sectors -- martech and adtech -- there are over 10,000 potential connectors.

The hardest part of ETL is not building the connectors; it is maintaining them. Maintenance is costly, and any closed-source solution is constrained by ROI (return on investment) considerations. As a result, ETL suppliers focus on the most popular integrations, yet companies use more and more tools every month, and the long tail of connectors goes ignored.

So even with ETL tools, data teams still end up investing huge amounts of money and time building and maintaining in-house connectors.

Inability to address the database replication use case

Most companies store data in databases. Our interviews uncovered two significant issues with database connectors provided by existing ETL.

Volume-based pricing: Databases are huge and serve growing amounts of data. A database with millions of rows, with the goal of serving hundreds of millions of rows, is a common sight. The issue with current ETL solutions is their volume-based pricing. It’s easy for an employee to replicate a multi-million row database with a click. And that simple click could cost a few thousand dollars!
Data privacy: With today’s concerns over privacy and security, companies place increasing importance on control of their data. The architecture of existing ETL solutions often ends up pulling data out of a company’s private cloud. The closed-source offerings prevent companies from closely inspecting the underlying ETL code and systems. Reduced visibility means less trust.

Both of these points explain why companies end up building additional internal database replication pipelines.

Inability to scale with data

The two points mentioned above about volume-based pricing and data privacy also apply as companies scale. It becomes less expensive for companies to have an internal team of data engineers to build the very same pipelines maintained in ETL solutions.

Why Open-Source Is the Only Way Forward

Open-source addresses many of the points raised above. Here is what open source gives us.

The right to customize: Having access to and being able to edit the code to your needs is a privilege open-source brings. For instance, what if the Salesforce connector is missing some data you need? With open source, such a change is as easy as submitting a code change. No more long threads on support tickets!
Addressing the long tail of connectors: You no longer need to convince a proprietary ETL provider that a connector you need is worth building. If you need a connector faster than a platform will develop it, you can build it yourself and maintain it with the help of a large user community.
Broader Integrations with data tools and workflows: Because an open source product must support a wide variety of stacks and workflows for orchestration, deployment, hosting, etc., you are more likely to find out-of-the-box support for your data stack and workflow (UI-based, API-based, CLI-based, etc.) with an open source community. Some of them, like Airbyte’s open source Airflow operator, are contributed by the community. To be fair, you can theoretically do that with a closed-source approach, but you’d likely need to build a lot of the tooling from scratch.
Debugging autonomy: If you experience any connector issues, you won’t need to wait for a customer support team to get back to you or for your fix to be at the top of the priorities of a third-party company. You can fix the issue yourself.
Out-of-the-box security and privacy compliance: If the open-source project is open enough (MIT, Apache 2.0, etc.), any team can directly address their integration needs by deploying the open-source code in their infrastructure.

The Necessity of a Connector Development Kit

However, open source itself is not enough to solve the data integration problem. This is because the barrier to entry for creating a robust and full-featured connector is too high.

Consider for example a script that pulls data from a REST API.

Conceptually this is a simple SELECT * FROM entity query over some data living in a database, potentially with a WHERE clause to filter by some criteria. But anyone who has written a script or connector to continuously and reliably perform this task knows it’s a bit more complicated than that.
First, there is authentication, which can be as simple as a username/password or as complicated as implementing a whole OAuth flow (and securely storing and managing these credentials).

We also need to maintain state between runs of the script so we don’t keep rereading the same data over and over.

Afterwards, we’ll need to handle rate limiting and retry intermittent errors, making sure not to confuse them with real errors that can’t be retried.

We’ll then want to transform data into a format suitable for downstream consumers, all while performing enough logging to fix problems when things inevitably break.

Oh, and all this needs to be well tested, easily deployable... and done yesterday, of course.
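
To make two of the chores above concrete, here is a minimal Python sketch of keeping state between runs and retrying only intermittent errors. Every name in it (`TransientError`, `with_retries`, `sync`, the `updated_at` cursor field) is hypothetical and chosen for illustration; it is not taken from any existing connector. The fetch function is passed in so the sketch runs without a network.

```python
import time
from typing import Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (for example a rate limit)."""


def with_retries(call: Callable[[], List[dict]],
                 max_attempts: int = 4,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Retry transient failures with exponential backoff; other errors propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("max_attempts must be at least 1")


def sync(fetch_page: Callable[[Optional[str]], List[dict]],
         state: Dict[str, str],
         sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Read records newer than the saved cursor, then advance the cursor."""
    cursor = state.get("updated_at")  # None on the very first run
    records = with_retries(lambda: fetch_page(cursor), sleep=sleep)
    if records:
        # Assumes every record carries a sortable "updated_at" value.
        state["updated_at"] = max(r["updated_at"] for r in records)
    return records
```

Even this toy version leaves out authentication, logging, and output formatting, which is the point: the boilerplate quickly outweighs the part that is specific to one API.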

All in all, it currently takes a few full days to build a new REST API source connector. This barrier to entry not only means fewer connectors created by the community, but can often mean lower quality connectors.

However, we believe that 80% of this hardship is incidental and can mostly be automated away. Reducing implementation time would significantly help the community contribute and address the long tail of connectors. If this automation is done in a smart way, we might also be able to improve standardization and thus maintenance across all connectors contributed.

What a Connector Development Kit Looks Like

Let’s look again at the work involved in building a connector, but from a different perspective.

Incidental complexity

Setting up the package structure
Packaging the connector in a Docker container and setting up the release pipeline

Lots of repeated logic

Reinventing the same design patterns and code structure for every connector type (REST APIs, Databases, Warehouses, Lakes, etc.)
Writing the same helpers for transforming data into a standard format, implementing incremental syncs, logging, input validation, etc.
Testing that the connector is correctly adhering to the protocol. Testing happy flows and edge cases

You can see that a lot can be automated away, and you’ll be happy to know that Airbyte has made available an open source Connector Development Kit (CDK) to do all this.

We believe in the end, the way to build thousands of high-quality connectors is to think in onion layers. To make a parallel with the pet/cattle concept that is well known in DevOps/Infrastructure, a connector is cattle code, and you want to spend as little time on it as possible. This will accelerate productivity tremendously.

Abstractions as onion layers

Maximizing high-leverage work leads you to build your architecture with an onion-esque structure:

The center defines the lowest level of the API. Implementing a connector at that level requires a lot of engineering time. But it is your escape hatch for very complex connectors where you need a lot of control.

Then, you build new layers of abstraction that help tackle families of connectors very quickly. For example, sources have a particular interface, and destinations have a different kind of interface.

Then, for sources you have different kinds like HTTP-API based connectors and Databases. HTTP connectors might be split into REST, GraphQL, and SOAP, whereas Databases might split into relational, NoSQL, and graph databases. Destinations might split into Warehouses, Datalakes, and APIs (for reverse ELT).
The CDK is the framework for those abstractions!
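
As a rough illustration of the onion idea, the Python sketch below stacks three layers: a bare `Source` interface, an `HttpSource` layer that owns the pagination loop, and a concrete stream that only declares what is specific to it. These class and method names are invented for this example and are not the CDK's actual API; the HTTP function is injected so the code runs without a network.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Optional


class Source(ABC):
    """Innermost layer: full control, everything written by hand."""

    @abstractmethod
    def read(self) -> Iterable[dict]: ...


class HttpSource(Source):
    """Middle layer: owns the request and pagination loop shared by HTTP sources."""

    def __init__(self, get: Callable[[str, Dict[str, str]], dict]):
        self._get = get  # hypothetical injected function: (path, params) -> JSON dict

    @abstractmethod
    def path(self) -> str: ...

    @abstractmethod
    def parse(self, response: dict) -> List[dict]: ...

    def next_page(self, response: dict) -> Optional[Dict[str, str]]:
        return None  # single page unless a subclass says otherwise

    def read(self) -> Iterable[dict]:
        params: Optional[Dict[str, str]] = {}
        while params is not None:
            response = self._get(self.path(), params)
            yield from self.parse(response)
            params = self.next_page(response)


class CustomersStream(HttpSource):
    """Outer layer: a concrete connector is a few declarative lines."""

    def path(self) -> str:
        return "/customers"

    def parse(self, response: dict) -> List[dict]:
        return response["data"]

    def next_page(self, response: dict) -> Optional[Dict[str, str]]:
        token = response.get("next")
        return {"page": token} if token else None
```

A very unusual connector can still subclass `Source` directly and write its own `read`, which is the escape hatch described above.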

What Is Already Available

Airbyte’s CDK is still in its early days, so expect lots of improvements to come over time. Today, the framework ships with the following features:

A Python framework for writing source connectors
A generic implementation for rapidly developing connectors for HTTP APIs
A test suite to test compliance with the Airbyte Protocol and happy code paths
A code generator to bootstrap development and package your connector

In the end, the CDK enables building robust, full-featured connectors within 2 hours versus 2 days previously.

The Airbyte team has been using the framework internally to develop connectors, and it is the culmination of our experience developing more than 70 connectors (our goal is 200 by the end of the year with help from the user community!). Everything we learn from our own experience, along with that of the user community, goes into improving the CDK.

Conclusion - The Future Ahead

Wouldn’t it be great to bring the time needed to build a new connector down to 10 minutes, and to extend to more and more families of possible integrations? How’s that for a moonshot!

If we manage to do that together with our user community, then at long last the long tail of integrations will be addressed in no time! Not to mention that data integration pipelines will be commoditized through open-source.

If you would like to get involved, we hope you’ll join our Slack community - the most active one around data integration - as we connect to the future of open source for the benefit of all!

How “User Success” Helps Us Become the Most Active Slack Community

John Lafleur — Tue, 27 Apr 2021 06:11:56 +0000

Today, we’re celebrating three important milestones for Airbyte. Within just 7 months of the release of our very first product (MVP) - which had only 6 connectors - we became the most active Slack community of data professionals around data integration. This is our first milestone.

As you might already know, we are a transparent company. Every month or so, we publish information on our project and company that would be confidential in other companies, such as:

The slides we used to raise our seed round with Accel
Our company handbook with even our strategy and business model

Today, we want to tell you more about our Slack community, our focus on user success and what it means for our community, and two other not yet announced milestones.

The Most Active Slack Community on Data Integration

This weekend, we reached the milestone of 1,000 Slack members, and at the same time became the most active community.

Within 7 months, we grew from 5 people (our original team) to 1,020 members as of 04/26/21. Out of those members, 450 are active weekly, and this resulted in 115k messages exchanged with the community. Yes, roughly 45% of our Slack community is active on a weekly basis, which is a great starting point.

The last time we checked, Singer’s Slack community had 40k messages after 4 years, and Meltano had 33k within 2 years. With Airbyte reaching 115k messages in 7 months, who knows how many we’ll have in 2 or 4 years?!

Defining “User Success”

Airbyte’s community is worldwide. About 35% of our users come from the US, but the remaining majority is spread across the globe. That’s why we decided to build a remote-first team with people in France, the United Kingdom, India, Singapore, New Caledonia (near Australia), and the US to cover all timezones. The goal is to be the best at what we call “user success.”

What is user success? You're probably familiar with customer success, which is well known in the SaaS world. In customer success, your goal is to make your customers successful with your product. However, when you are an open-source tool, you are first focusing on becoming the industry standard, and therefore, you’re focusing on the users of your open-source project.

Within Airbyte, we define “user success” as our team’s focus on helping our users be successful in whatever project they want to build around data, whether it be with Airbyte or another tool. We believe the best way to build trust with our community is by aligning our goals and incentives with theirs; we want them to know we have their back and always will.

Measuring User Success

We’re measuring two things:

Time to first response
Time to resolution

Time to first response is the time elapsed between a user request on our Slack and the first response from a team member or community member.

Time to resolution is the time elapsed between the first user request and when the thread is marked with a ✅ emoji. That is how we notify the rest of the team that this request has been fully addressed.

For the moment, we have an average of 2 hours 30 minutes for the time to first response, and our time to resolution is about 3 hours 30 minutes.
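
For readers who want to track the same two numbers, here is a small Python sketch of how they could be computed from thread timestamps. The `Thread` record and its field names are assumptions made for this example; in practice the timestamps would come from your Slack export or community tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Thread:
    asked_at: datetime                          # first user request
    first_reply_at: Optional[datetime] = None   # first answer from anyone
    resolved_at: Optional[datetime] = None      # when the ✅ emoji was added


def average_delay(threads: List[Thread],
                  end: Callable[[Thread], Optional[datetime]]) -> Optional[timedelta]:
    """Average time from the request to the chosen event, skipping threads without it."""
    delays = [
        (end(t) - t.asked_at).total_seconds()
        for t in threads
        if end(t) is not None
    ]
    return timedelta(seconds=mean(delays)) if delays else None
```

Time to first response is then `average_delay(threads, lambda t: t.first_reply_at)` and time to resolution is `average_delay(threads, lambda t: t.resolved_at)`. Note that threads never marked as resolved are simply left out of the second average.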

We were also thinking about tracking:

resolution rate, i.e., the percentage of threads that have been marked with a ✅ emoji, but the data was too skewed by us sometimes forgetting to mark the thread as resolved.
feature-coverage rate, i.e., the % of users we interact with for whom we have met all their feature and connector needs

Some Examples of User Success Processes

So, what specific actions do we take in terms of user success?

Well, for instance, we personally welcome every new Slack member with a personalized message. It takes a bit of time but it is definitely worth it, as it enables us to understand their use cases and needs.

You can note all of this information in your CRM or community tool (we are big fans of Orbit) so that when you release a new feature, you can notify those who expressed any interest in that feature. That’s exactly what we do with connectors. Every time we support a new connector, we’ll reach out to all users who mentioned any interest in that connector.

Any interaction with the user is an opportunity to get information on how we can provide more value at a later date.

In the end, as with customer success, you want your users to be more and more successful with your open-source tool, so they become your next advocates.

What Kind of Role in User Success?

Airbyte is an open-source data integration platform, so it’s targeting data engineers, analysts and scientists. The only way to help them become successful is by helping them solve their technical issues. So the role that makes sense is a User Success Engineer.

And, as a matter of fact, we just hired for the role at Airbyte. Here is the description of the role:

Your goal as a User Success Engineer is to make our users successful when deploying or contributing to Airbyte.

The main responsibilities of the role will be:

Help users troubleshoot issues they have when deploying or contributing to Airbyte.
Write documentation and make (or suggest) code changes to resolve recurring issues.
Triage bugs to the correct team (or fix the issue yourself).

Airbyte’s open-source community has been growing very quickly, and one component of our success is the love of our community. This role is instrumental to scaling the support to our users, and includes finding ways to reduce the overall cost of user support through better documentation and new processes.

An excellent candidate will become an expert in the Airbyte system. They will determine which information needs to be shared with the engineering team so that the team has a deep understanding of existing pain points. They will also filter out information that they can resolve themselves through code fixes, documentation, or by working with the users. This will allow the engineering team to be laser focused on the product goals while maintaining intense user empathy. The role is at the heart of our values of leveraging our time and abilities.

An ideal candidate can start out as an individual contributor but can grow this operation into a team as the company scales.

---------

We hope this gives you some insight on how we think about user success at Airbyte and its community. So how does this translate in terms of measurable goals?

Our Next Milestone Is 1,000 Weekly Active Slack Members

The actual metric you want to track is the activity level of your community. Having a non-engaged community is a waste of time for everybody. So one would think that we should define our next goal in terms of messages exchanged in the community. Why not aim for 1M messages?

The issue with that approach is that messages are not synonymous with value brought to your users. If it takes you half the messages to get your point across and solve your users’ issues, you should definitely go this way. Number of messages is not the right proxy, and never was in our case.

The right approach is to track whether your community keeps being engaged, and that is, simply, weekly active members. That’s why our next milestone is not signups, or messages exchanged, but 1,000 weekly active Slack members.

How to Achieve the Next Milestone

This is where we want to announce two new milestones.

1. Our First Developer Advocate Hire

Abhi Vaidyanatha joined us on 04/26. As our senior developer advocate, he will work on constantly improving our developer experience and engagement. This includes documentation, tutorials, and, therefore, insightful content for our Slack communities.

Maybe we’ll do AMAs there - anything becomes possible when you have someone with the energy of Abhi!

2. Our First User Success Engineer Hire

If, by any chance, we coined the term “user success engineer,” feel free to reuse the term, as it should be open-sourced (MIT) like the rest of Airbyte 😉.

Our first user success engineer should be joining us in the next few weeks. This person will help us drive the time to first response and resolution down so you’ll have the best support experience with Airbyte in the whole ETL/ELT industry - while just using the open-source edition!

---

You will see that the Airbyte team will be growing fast in the next few weeks. And we also have big plans for the Slack community, but we won’t reveal everything just yet as we want to keep some surprises for you!

In case you haven’t joined yet, here’s our Slack community, and you can also contribute to our GitHub repository. Either way - whether you’re already a member or planning to join - we hope to hear from you soon!

And yes, Airbyte is also about to become the GitHub repo with the most stars around data integration, too!

How We Performed on Our Q1 OKRs, and The Goals for Q2

John Lafleur — Wed, 14 Apr 2021 10:07:05 +0000

In January, we shared how we were thinking about OKRs, along with our OKRs for Q1 2021. So we wanted to give some updates about them, and how they have evolved for the 2nd quarter.

Our focus for 2021 is to become the open-source standard for replicating data. This entails three overarching goals:

Making Airbyte just work whatever your data infrastructure, volume and connector needs.
Building the largest developer community for data integration. We envision that most connectors will be built and maintained by the community eventually, because we will have made that so simple with our low-code framework.
Making Airbyte so easy to use in a production context that Airbyte becomes the new standard for data teams to replicate data.

Let’s see how this translates itself into our first two quarterly OKRs.

How We Performed on Airbyte’s OKRs for Q1 2021

1. O: Growing Community Love

What is community love? We’re still big fans of Orbit’s definition for it. Love is a member's level of engagement and investment in the community. Someone with high love is highly active and plays key roles in the community, like contributing, moderating, and organizing.

Let’s first look at GitHub Stars

In this chart, we’re comparing Airbyte with other famous open-source projects around data integration: DBT and RudderStack. Our growth rate (Airbyte in red) is a huge validation that we’re not the only ones to believe that data integration will be solved with an open-source and community approach.

GitHub stars are good awareness metrics, but they don’t mean that you actually have community adoption or contribution. We need to look at other metrics for that:

Overall, we outperformed our Q1 OKRs for community love, even though we set aggressive goals. This is still the very beginning of our journey, but this was extremely encouraging for all the team. We strongly believe we can commoditize data integration through our growing community.

2. O: Growing Production Usage

We call “activated users” users who have deployed Airbyte, connected a source, a destination and synced data successfully from this source to this destination.

We call “prod users” users who have been syncing data more than 5 times in the past week and 5 times in the week before.
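
The “prod user” definition can be written down as a short Python check. This is only a sketch of the rule as stated above: it assumes “more than 5” applies to each of the two weeks, and that a “week” is a rolling 7-day window ending at `now`. The function name and parameters are made up for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_prod_user(sync_times: Iterable[datetime],
                 now: datetime,
                 min_syncs: int = 5) -> bool:
    """True if the user synced more than `min_syncs` times in each of the last two weeks."""
    week = timedelta(days=7)
    times = list(sync_times)
    # Past week: (now - 7 days, now]
    past_week = sum(1 for t in times if now - week < t <= now)
    # Week before: (now - 14 days, now - 7 days]
    week_before = sum(1 for t in times if now - 2 * week < t <= now - week)
    return past_week > min_syncs and week_before > min_syncs
```

Requiring activity in two consecutive weeks filters out users who ran a burst of test syncs once and never came back.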

Here’s a chart showing the evolution of activated users and prod users during Q1.

We don’t publish the number of prod users we have yet, but you can see that the conversion from activated to prod users is growing with time, which is what we want to see.

But, is the usage of Airbyte growing among prod users?

If we had to follow only one graph, it would be this one. It accounts for both prod user growth and usage growth within prod users.

Here’s the usage growth in terms of sync per prod user:

Overall, this was exactly what we wanted to see. Teams start by testing Airbyte for a few days or weeks, before expanding their usage to other connectors.

3. O: Becoming a Reliable Standard

Airbyte can only become the new standard if connectors are reliable. You could consider that a “sanity” metric, in the sense that it is not related to growth, but it is actually where almost all of the engineering work goes. The more users use Airbyte, the more edge cases connectors get exposed to. It is a thousand-paper-cut problem, where every user comes with their own needs in terms of usage, data and volume. The more users we have, the less reliable connectors can appear, and we have to seize these opportunities to strengthen them.

The metric we’re looking at in this case is the percentage of sync attempts that fail:

We launched on HackerNews on January 26th. That’s when we gained a lot more users at once and got exposed to a lot more use cases. During the whole month of February, we worked on strengthening our connectors, and you can see in this chart how it paid off. Our KR was 5% of failures by the end of the quarter, and this is something that we will keep working on.

Some other metrics we wanted to track:

KR: Response time to any message on Slack or GitHub - our goal was to reach <30 min by the end of Q1 2021.
KR: Time to resolution for high-priority bugs - our goal was to reach 1.5 days by the end of Q1 2021.

In the end, we couldn’t really measure those 2 metrics. But the overall response time to any message on Slack was about 1-2 hours.

4. O: Building the Dream Team

We strongly believe in talent density, and that it’s better to have one stellar colleague than 5 average ones.

KR: 2 A+ engineers => 3 engineers will be joining us in the next few weeks.
KR: 1 senior developer advocate => Abhi will be joining us soon!

Our Q1 Milestones

Now that we have seen how we performed on our OKRs, how did we perform on the milestones?

Community efforts

January: Hard launch on HackerNews
Building tutorials to improve the developer experience (DX) in building their own connectors, or editing pre-built ones => this is still a work in progress.

Product engineering efforts

One thing we didn’t anticipate is the toll providing great support would take on our engineering velocity. Even though we had great output, we were not able to deliver on all the milestones we had intended.

For our core platform:

Integration in data stack with DBT and Airflow => delivered, although we still have a lot on DBT’s front!
Core upgrade strategy => delivered!

For our connectors:

Strengthen our connectors so all our connectors are A+ => we started certifying our connectors against a set of best practices, and you can now see the health status of our connectors.
Schemas migration management => reprioritized
Seamless OAuth support => reprioritized
More high-level abstractions to build connectors more easily => ongoing effort!
An MVP for CDC (Change Data Capture) => delivered!
Connector upgrade strategy => delivered!
A public dashboard showing the stability (failure rate) of all our connectors => delivered!

Our New Q2 OKRs

So what about the next quarter? Doing OKRs is actually a great learning opportunity, enabling us to make better estimates every time. This time, we know how much engineering time a great support experience takes, so we can plan accordingly.

For Q2, we kept the same objectives but changed some KRs that we’ve put in bold.

O: Growing Community Love

KR: Active Slack users (Q1/21: 350, Q2/21: 600)
KR: GitHub stars (Q1/21: 2k, Q2/21: 4k)
KR: Issue contributors from start (Q1/21: 125, Q2/21: 250)
KR: PR contributors from start (Q1/21: 25, Q2/21: 50)
KR: Connector Contributors (Q1/21: 10, Q2/21: 30)

O: Growing Prod Usage

KR: Prod users
KR: Active connections per prod user
KR: # connectors (Q1/21: 56, Q2/21: 90)

O: Becoming a Reliable Standard

KR: % failure at attempts
KR: average throughput of connectors
KR: support replicating large databases in X minutes

O: Building the Dream Team

KR: 2 A+ engineers
KR: 1 dev evangelist (to be confirmed) + 1 operations manager

Our Next Q2 Milestones

How does this translate into milestones?

Make Airbyte the easiest way to create line-of-business connectors with our low-code solution for creating connectors quickly and more reliably.
Support custom DBT models.
CDC for all major database sources.
Mature handling of (large) production data sets.
Production-grade single node support (across platforms): creating solid AMIs, systemctl, etc., with less setup.
First-class support on K8s.
OAuth support for connector authentication.
"Automatic" Schema change handling.
Support for data lake use cases.

So... a lot of engineering milestones! And they can be accomplished as we grow our engineering team.

Let’s see how we perform in 3 months!

How to Visualize the Time Spent by Your Team in Zoom Calls

John Lafleur — Mon, 05 Apr 2021 07:15:06 +0000

In this article, we will show you how you can understand how much your team leverages Zoom, or spends time in meetings, in a couple of minutes. We will be using Airbyte (an open-source data integration platform) and Tableau (a business intelligence and analytics software) for this tutorial.

Here is what we will cover:

Step 1: Setting up data replication from Zoom to a PostgreSQL database using the Airbyte Zoom connector
Step 2: Connecting the PostgreSQL database to Tableau
Step 3: Creating charts in Tableau with Zoom data

We will produce the following charts in Tableau:

Evolution of the number of meetings per week in a team
Evolution of the number of hours a team spends in meetings per week
Listing of team members with the number of meetings per week and number of hours spent in meetings, ranked
Evolution of the number of webinars per week in a team
Evolution of the number of hours a team spends in webinars per week
Evolution of the number of participants for all webinars in a team per week
Listing of team members with the number of webinars per week and number of hours spent in webinars, ranked
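
Before building these in Tableau, it can help to see what the first two charts compute. The Python sketch below groups meeting rows by ISO week and returns the number of meetings and the hours spent. The field names `start_time` (an ISO 8601 timestamp) and `duration` (minutes) are assumptions for illustration; check them against the schema Airbyte generates for your Zoom source.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def weekly_totals(meetings: Iterable[dict]) -> Dict[Tuple[int, int], Tuple[int, float]]:
    """Map (ISO year, ISO week) to (number of meetings, hours spent in meetings)."""
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    minutes: Dict[Tuple[int, int], int] = defaultdict(int)
    for m in meetings:
        # Assumed fields: "start_time" as ISO 8601 text, "duration" in minutes.
        year, week, _ = datetime.fromisoformat(m["start_time"]).isocalendar()
        counts[(year, week)] += 1
        minutes[(year, week)] += m["duration"]
    return {key: (counts[key], minutes[key] / 60) for key in sorted(counts)}
```

The webinar charts follow the same pattern with webinar rows as input.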

Let’s get started by replicating Zoom data using Airbyte.

Step 1: Replicating Zoom data to PostgreSQL

Launching Airbyte

In order to replicate Zoom data, we will need to use Airbyte’s Zoom connector. To do this, start Airbyte’s web app by opening your terminal, navigating to your Airbyte directory, and running:

docker-compose up

You can find more details about this in the Getting Started tutorial.

This will start up Airbyte on localhost:8000; open that address in your browser to access the Airbyte dashboard.

In the top right corner of the Airbyte dashboard, click on the + new source button to add a new Airbyte source. In the screen to set up the new source, enter the source name (we will use airbyte-zoom) and select Zoom as source type.

Choosing Zoom as source type will cause Airbyte to display the configuration parameters needed to set up the Zoom source.

The Zoom connector for Airbyte requires you to provide it with a Zoom JWT token. Let’s take a detour and look at how to obtain one from Zoom.

Obtaining a Zoom JWT Token

To obtain a Zoom JWT Token, log in to your Zoom account and go to the Zoom Marketplace. If this is your first time in the marketplace, you will need to agree to Zoom’s marketplace terms of use.

Once you are in, you need to click on the Develop dropdown and then click on Build App.

Clicking on Build App for the first time will display a modal asking you to accept Zoom’s API license and terms of use. Accept if you agree, and you will be presented with the screen below.

Select JWT as the app you want to build and click on the Create button on the card. You will be presented with a modal to enter the app name; type in airbyte-zoom.

Next, click on the Create button on the modal.

You will then be taken to the App Information page of the app you just created. Fill in the required information (at the very least).

After filling in the needed information, click on the Continue button. You will be taken to the App Credentials page. Here, click on the View JWT Token dropdown.

There you can set the expiration time of the token (we will leave the default 90 minutes), and then you click on the Copy button of the JWT Token.
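
Because the token expires, a sync that runs later than the chosen expiration time will fail to authenticate. The Python sketch below reads the expiration from a token so you can check it before syncing. It assumes the token carries the standard `exp` claim defined for JWTs (Unix seconds); it only decodes the payload and does not verify the signature. The function names are made up for this example.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (Unix seconds) from a JWT payload, without verifying it."""
    payload_b64 = token.split(".")[1]  # a JWT is header.payload.signature
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return int(payload["exp"])


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token's expiration time has already passed."""
    return jwt_expiry(token) <= (time.time() if now is None else now)
```

If `is_expired(token)` returns true, generate a fresh token on the App Credentials page before setting up or re-running the source.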

After copying it, click on the Continue button.

You will be taken to a screen to activate Event Subscriptions. Just leave it as is, as we won’t be needing Webhooks. Click on Continue, and your app should be marked as activated.

Connecting Zoom on Airbyte

So let’s go back to the Airbyte web UI and provide it with the JWT token we copied from our Zoom app.

Now click on the Set up source button. You will see the below success message when the connection is made successfully.

And you will be taken to the page to add your destination.

Connecting PostgreSQL on Airbyte

For our destination, we will be using a PostgreSQL database, since Tableau supports PostgreSQL as a data source. Click on the add destination button, and then in the dropdown click on + add a new destination. In the page that presents itself, add the destination name and choose the Postgres destination.

To supply Airbyte with the PostgreSQL configuration parameters needed to make a PostgreSQL destination, we will spin up a PostgreSQL container with Docker using the following command in our terminal.

docker run --rm --name airbyte-zoom-db -e POSTGRES_PASSWORD=password -v airbyte_zoom_data:/var/lib/postgresql/data -p 2000:5432 -d postgres

This will spin up a Docker container and persist the data we will be replicating in the PostgreSQL database in a Docker volume named airbyte_zoom_data.
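
The connection settings follow from that command: port 2000 comes from the -p flag, the password from POSTGRES_PASSWORD, and the user and database both default to postgres in the official image when not set otherwise. The Python sketch below assembles them into a connection URL. The host value localhost is an assumption that holds when connecting from the same machine; a client running inside another container may need a different host name. The function name is hypothetical.

```python
from urllib.parse import quote


def postgres_url(host: str = "localhost",
                 port: int = 2000,
                 user: str = "postgres",
                 password: str = "password",
                 database: str = "postgres") -> str:
    """Build a PostgreSQL connection URL from the settings used in this tutorial."""
    # quote() escapes characters that would otherwise break the URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )
```

The same five values are what the destination form asks for, just as separate fields instead of one URL.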

Now, let’s enter the credentials from the command above into the Airbyte destination form.

Then click on the Set up destination button.

After the connection has been made to your PostgreSQL database successfully, Airbyte will generate the schema of the data to be replicated in your database from the Zoom source.

Leave all the fields checked.

Select a Sync frequency of manual and then click on Set up connection.

After successfully making the connection, you will see your PostgreSQL destination. Click on the Launch button to start the data replication.

Then click on the airbyte-zoom-destination to see the Sync page.

Syncing should take a few minutes or longer depending on the size of the data being replicated. Once Airbyte is done replicating the data, you will get a succeeded status.

Then, you can run the following SQL command on the PostgreSQL container to confirm that the sync was done successfully.

docker exec airbyte-zoom-db psql -U postgres -c "SELECT * FROM public.users;"

Now that we have our Zoom data replicated successfully via Airbyte, let’s move on and set up Tableau to make the various visualizations and analytics we want.

Step 2: Connect the PostgreSQL database to Tableau

Tableau helps people and organizations to get answers from their data. It’s a visual analytic platform that makes it easy to explore and manage data.

To get started with Tableau, you can opt in for a free trial period by providing your email and clicking the DOWNLOAD FREE TRIAL button to download the Tableau desktop app. The download should automatically detect your machine type (Windows/Mac).

Go ahead and install Tableau on your machine. After the installation is complete, you will need to fill in some more details to activate your free trial.

Once your activation is successful, you will see your Tableau dashboard.

On the sidebar menu under the To a Server section, click on the More… menu. You will see a list of datasource connectors you can connect Tableau with.

Select PostgreSQL and you will be presented with a connection credentials modal.

Fill in the same details of the PostgreSQL database we used as the destination in Airbyte.

Next, click on the Sign In button. If the connection was made successfully, you will see the Tableau dashboard for the database you just connected.

Note: If you are having trouble connecting PostgreSQL with Tableau, it may be because the PostgreSQL driver that ships with Tableau does not work with newer versions of PostgreSQL. You can download the JDBC driver for PostgreSQL here and follow the setup instructions.

Now that we have replicated our Zoom data into a PostgreSQL database using Airbyte’s Zoom connector, and connected Tableau with our PostgreSQL database containing our Zoom data, let’s proceed to creating the charts we need to visualize the time spent by a team in Zoom calls.

Step 3: Create the charts on Tableau with the Zoom data

Evolution of the number of meetings per week in a team

To create this chart, we will need to use the count of the meetings and the createdAt field of the meetings table. Currently, we haven’t selected a table to work on in Tableau. So you will see a prompt to Drag tables here.

Drag the meetings table from the sidebar onto the space with the prompt.

Now that we have the meetings table, we can start building out the chart by clicking on Sheet 1 at the bottom left of Tableau.

As stated earlier, we need Created At, but currently it’s a String data type. Let’s change that by converting it to a date and time. So right click on Created At, then select Change Data Type and choose Date & Time. And that’s it! That field is now of type Date & Time.

Next, drag Created At to Columns.

Currently, we get the Created At in YEAR, but per our requirement we want them in Weeks, so right click on the YEAR(Created At) and choose Week Number.

Tableau should now look like this:

Now, to finish up, we need to add the meetings(Count) measure, which Tableau already calculated for us, to the Rows section. So drag meetings(Count) onto the Rows section to complete the chart.

And now we are done with the very first chart. Let's save the sheet and create a new Dashboard that we will add this sheet to as well as the others we will be creating.

Currently the sheet shows Sheet 1; right click on Sheet 1 at the bottom left and rename it to Weekly Meetings.

To create our Dashboard, we can right click on the sheet we just renamed and choose new Dashboard. Rename the Dashboard to Zoom Dashboard and drag the sheet into it to have something like this:

Now that we have this first chart out of the way, we just need to replicate most of the process we used for this one to create the other charts. Because the steps are so similar, we will mostly be showing the finished screenshots of the charts except when we need to conform to the chart requirements.

Evolution of the number of hours a team spends in meetings per week

For this chart, we need the sum of the duration spent in weekly meetings. We already have a Duration field, which is currently displaying durations in minutes. We can derive a calculated field off this field since we want the duration in hours (we just need to divide the duration field by 60).

To do this, right click on the Duration field and select create, then click on calculatedField. Change the name to Duration in Hours, and then the calculation should be [Duration]/60. Click ok to create the field.
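
As a quick sanity check of the arithmetic behind this calculated field, here is a minimal shell sketch. The function name minutes_to_hours and the sample durations are made up for illustration; only the division by 60 comes from the steps above.

```shell
# Convert a duration in minutes to hours, mirroring the [Duration]/60 calculated field.
minutes_to_hours() {
  awk -v m="$1" 'BEGIN { printf "%.2f\n", m / 60 }'
}

minutes_to_hours 90   # a hypothetical 90-minute meeting, prints 1.50
minutes_to_hours 45   # a hypothetical 45-minute meeting, prints 0.75
```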

So now we can drag the Duration in Hours and Created At fields onto your sheet like so:

Note: We are adding a filter on the Duration to filter out null values. You can do this by right clicking on the SUM(Duration) pill and clicking filter, then make sure the include null values checkbox is unchecked.

Evolution of the number of participants for all meetings per week

For this chart, we will need to have a calculated field called # of meetings attended, which will be an aggregate of the counts of rows matching a particular user's email in the report_meeting_participants table plotted against the Created At field of the meetings table. To get this done, right click on the User Email field. Select create and click on calculatedField, then enter the title of the field as # of meetings attended. Next, enter the below formula:

COUNT(IF [User Email] == [User Email] THEN [Id (Report Meeting Participants)] END)

Then click on apply. Finally, drag the Created At fields (make sure it’s on the Weekly number) and the calculated field you just created to match the below screenshot:
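
Read literally, the formula above counts the participant rows whose User Email and Id are both present, since a missing email makes the comparison fail and COUNT skips empty results. The sketch below imitates that behavior on a few made-up rows. The sample data and the function name count_attended are hypothetical, and an empty field stands in for a NULL value.

```shell
# Count rows where both the id (field 1) and the email (field 2) are non-empty.
# Input format (hypothetical): id,user_email
count_attended() {
  awk -F, '$1 != "" && $2 != "" { n++ } END { print n + 0 }'
}

# Three sample participant rows, one of them without an email; prints 2.
printf '%s\n' '1,ann@example.com' '2,' '3,bob@example.com' | count_attended
```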

Listing of team members with the number of meetings per week and number of hours spent in meetings, ranked.

To get this chart, we need to create a relationship between the meetings table and the report_meeting_participants table. You can do this by dragging the report_meeting_participants table in as a source alongside the meetings table and relate both via the meeting id. Then you will be able to create a new worksheet that looks like this:

Note: To achieve the ranking, we simply use the sort menu icon on the top menu bar.

Evolution of the number of webinars per week in a team

The rest of the charts will be needing the webinars and report_webinar_participants tables. Similar to the evolution of the number of meetings per week in a team, we will be plotting the Count of webinars against the Created At property.

Evolution of the number of hours a team spends in webinars per week

For this chart, as with its meetings counterpart, we will derive a calculated field from the Duration field to get the Webinar Duration in Hours, and then plot Created At against the Sum of Webinar Duration in Hours, as shown in the screenshot below. Note: Make sure you create a new sheet for each of these graphs.

Evolution of the number of participants for all webinars per week

This calculation is the same as the evolution of the number of participants for all meetings per week, but instead of using the meetings and report_meeting_participants tables, we will use the webinars and report_webinar_participants tables.

Also, the formula will now be:

COUNT(IF [User Email] == [User Email] THEN [Id (Report Webinar Participants)] END)

Below is the chart:

Listing of team members with the number of webinars per week and number of hours spent in webinars, ranked

Below is the chart with these specs.

Conclusion

In this article, we saw how to use Airbyte to get data from the Zoom API into a PostgreSQL database, and then use that data to create some chart visualizations in Tableau.

You can leverage Airbyte and Tableau to produce graphs on any collaboration tool. We just used Zoom to illustrate how it can be done. Hope this is helpful!

Our Truth for 2021: Airbyte Just Works

John Lafleur — Sun, 04 Apr 2021 22:37:21 +0000

We try to limit our discussions with VCs, as they can easily become a distraction. As a startup, focus is what will differentiate between success and failure. But sometimes, we can’t refuse an introduction and a discussion, as some investors have a lot of insights on your industry.

Recently, we had one discussion with a top-tier VC general partner. In addition to a lot of feedback and insights, one question in particular he asked really struck me: “What is your truth for 2021?”

In this article, we will explain what he means by truth, and what our immediate answer was for Airbyte.

What is a truth?

A truth is what you absolutely need to achieve for your company to be on the path to success. It is the one thing you need to strive for, and that should determine your priorities, strategy, initiatives, recruiting plan, etc.

A truth helps put every consideration in perspective. Any time you have a decision to make, you can ask yourself whether it brings you closer to that truth. For anything that doesn’t get you closer to it, you should ponder whether you should actually do it.

It is by having this singular goal in mind that you will give yourself the highest chance to get there.

Is a truth a SMART goal in the end?

I’m sure you have heard about “SMART” goals. SMART stands for Specific, Measurable, Achievable, Relevant, Time-based.

It’s true that your truth needs to be specific. It cannot just be “My company is successful.” You need to define exactly what success means to you as a company.

Your truth should also be achievable and relevant, and it is by definition time-based, as it’s for your current year (or another period of your choice).

But the difference lies in the fact that your truth should be aspirational above being measurable. It should be very easy to express, just a few words, very memorable.

When we were asked this question, we hadn’t thought about it this way, but Michel - my co-founder - and I knew the answer instantaneously.

Our truth for 2021: “Airbyte just works”

What came to our mind is that, by the end of 2021, we envision that Airbyte just works. This is the feeling we want all our users to have.

This includes reliability of the platform and all its connectors, whatever your infrastructure and the volume of data you need to replicate. But it also includes agnosticity for whatever connector needs you have, whatever data stack you opted for. Airbyte just works.

Let’s go into more detail.

Whatever your data infrastructure

This year, we will be focusing on integrating with the rest of the data stack, whether for orchestration (Airflow, Dagster, Prefect, etc.), data quality (Great Expectations), or cloud providers (GCP, AWS, Azure…), and at whatever scale, which implies we must support multi-node. Until now, we’ve been focusing on single-node setups.

Whatever your data volume

We are constantly improving our connectors, and are even certifying them against a set of best practices that we will keep adding to. Data integration pipelines are a thousand-paper-cut problem. Each new user brings some new use cases that may or may not be supported yet. We will continuously grow the team in charge of building new connectors and strengthening existing ones. At the end of the year, we hope we will be able to support TB-level replication.

Whatever your connector needs

We want to support at least 200 connectors by the end of 2021. And this will only be the beginning. We’re working on a low-code framework to make it easier to build and maintain connectors. 200 is obviously not enough to cover all connector needs, but hopefully, we will be at a point where the developer experience to build new connectors is so easy that the number of connectors won’t be perceived as limiting to address any use cases.

On that matter, we will also be working to support Kafka, Spark and webhooks.

This is our truth for 2021. By the end of the year, whatever your use case, you will be able to set up Airbyte and start fulfilling your data integration needs in a matter of hours. We believe this is the only way to commoditize data integration.

How you can use the truth framework elsewhere

A last note for this article. You can use the truth framework in other contexts.

For instance, we see a lot of entrepreneurs making decisions based on the amount of equity they hope to keep and the valuation of the company they hope to reach. However, they fail to remember that startups are either a 0 (you failed to exit and you died), or a 1 (you exited, IPO’d or are profitable). Any consideration of equity and valuation should actually be multiplied by this 0 or 1.

So as such, you need to consider if the decisions you make bring you closer to the 1. If you keep focusing on the 1 you will see that, in the long term, they were the right decisions to make, as having a bit more equity is not important if you end up building a successful company.

What is your truth? Are any of the decisions you make taking you closer to it?

How To Build a Slack Activity Dashboard With Open Source

John Lafleur — Wed, 03 Mar 2021 02:45:08 +0000

Build a Slack Activity Dashboard

This article will show how to use Airbyte, an open-source data integration platform, and Apache Superset, an open-source data exploration platform, to build a Slack activity dashboard showing:

Total number of members of a Slack workspace
The evolution of the number of Slack workspace members
Evolution of weekly messages
Evolution of weekly threads created
Evolution of messages per channel
Members per time zone

Before we get started, let’s take a high-level look at how we are going to achieve creating a Slack dashboard using Airbyte and Apache Superset.

We will use Airbyte’s Slack connector to get the data off a Slack workspace (we will be using Airbyte’s own Slack workspace for this tutorial).
We will save the data onto a PostgreSQL database.
Finally, using Apache Superset, we will implement the various metrics we care about.

Got it? Now let’s get started.

1. Replicating Data from Slack to Postgres with Airbyte

a. Deploying Airbyte

There are several easy ways to deploy Airbyte, as listed here. For this tutorial, I will just use the Docker Compose method from my workstation:

# In your workstation terminal
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

The above command will make the Airbyte app available on localhost:8000. Visit the URL on your favorite browser, and you should see Airbyte’s dashboard (if this is your first time, you will be prompted to enter your email to get started).

If you haven’t set Docker up, follow the instructions here to set it up on your machine.

b. Setting Up Airbyte’s Slack Source Connector

Airbyte’s Slack connector will give us access to the data. So, we are going to kick things off by setting this connector to be our data source in Airbyte’s web app. I am assuming you already have Airbyte and Docker set up on your local machine. We will be using Docker to create our PostgreSQL database container later on.

Now, let’s proceed. If you already went through the onboarding, click on the “new source” button at the top right of the Sources section. If you're going through the onboarding, then follow the instructions.

You will be requested to enter a name for the source you are about to create. You can call it “slack-source”. Then, in the Source Type combo box, look for “Slack,” and then select it. Airbyte will then present the configuration fields needed for the Slack connector. So you should be seeing something like this on the Airbyte App:

The first thing you will notice is that this connector requires a Slack token. So, we have to obtain one. If you are not a workspace admin, you will need to ask for permission.

Let’s walk through how we would get the Slack token we need.

Assuming you are a workspace admin, open the Slack workspace and navigate to [Workspace Name] > Administration > Customize [Workspace Name]. In our case, it will be Airbyte > Administration > Customize Airbyte (as shown below):

In the new page that opens up in your browser, you will then need to navigate to Configure apps.

In the new window that opens up, click on Build in the top right corner.

Click on the Create an App button.

In the modal form that follows, give your app a name - you can name it airbyte_superset, then select your workspace from the Development Slack Workspace.

Next, click on the Create App button. You will then be presented with a screen where we are going to set permissions for our airbyte_superset app, by clicking on the Permissions button on this page.

In the next screen, navigate to the scope section. Then, click on the Add an OAuth Scope button. This will allow you to add permission scopes for your app. At a minimum, your app should have the following permission scopes:

Then, we are going to add our created app to the workspace by clicking the Install to Workspace button.

Slack will prompt you that your app is requesting permission to access your workspace of choice. Click Allow.

After the app has been successfully installed, you will be navigated to Slack’s dashboard, where you will see the Bot User OAuth Access Token.

This is the token you will provide back on the Airbyte page, where we left off to obtain this token. So make sure to copy it and keep it in a safe place.

Now that we are done with obtaining a Slack token, let’s go back to the Airbyte page where we left off and add the token in there.

We will also need to provide Airbyte with start_date. This is the date from which we want Airbyte to start replicating data from the Slack API, and we define that in the format: YYYY-MM-DDT00:00:00Z.

We will specify ours as 2020-09-01T00:00:00Z. We will also tell Airbyte to exclude archived channels and not include private channels, and also to join public channels, so the latter part of the form should look like this:
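
If you want to double-check the start_date before pasting it into the form, a small shell check like the one below can help. This is only a sketch: the function name is_valid_start_date is made up, and it verifies the shape YYYY-MM-DDT00:00:00Z rather than whether the date actually exists on the calendar.

```shell
# Return success when the argument looks like YYYY-MM-DDT00:00:00Z.
# This checks the shape only, not whether the date is a real calendar date.
is_valid_start_date() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T00:00:00Z$'
}

if is_valid_start_date "2020-09-01T00:00:00Z"; then
  echo "start_date looks valid"
fi
```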

Finally, click on the Set up source button for Airbyte to set the Slack source up.

If the source was set up correctly, you will be taken to the destination section of Airbyte’s dashboard, where you will tell Airbyte where to store the replicated data.

c. Setting Up Airbyte’s Postgres Destination Connector

For our use case, we will be using PostgreSQL as the destination.

Click the add destination button in the top right corner, then click on add a new destination.

In the next screen, Airbyte will validate the source, and then present you with a form to give your destination a name. We’ll call this destination slack-destination. Then, we will select the Postgres destination type. Your screen should look like this now:

Great! We have a form to enter Postgres connection credentials, but we haven’t set up a Postgres database. Let’s do that!

Since we already have Docker installed, we can spin up a Postgres container with the following command in our terminal:

docker run --rm --name slack-db -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres

(Note that the Docker compose file for Superset ships with a Postgres database, as you can see here).

The above command will do the following:

create a Postgres container with the name slack-db,
set the password to password,
expose the container’s port 5432 as our machine’s port 2000,
create a database and a user, both called postgres.
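
The container can take a few seconds before it accepts connections. One way to wait for it, sketched below, is a small retry helper. The function name retry is made up for this example, and the commented usage assumes the pg_isready utility that ships with the official postgres image.

```shell
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (requires the slack-db container started by the command above):
# retry 30 1 docker exec slack-db pg_isready -U postgres
```

Keeping the retry logic separate from the Docker command means the same helper can wrap any other readiness check.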

With this, we can go back to the Airbyte screen and supply the information needed. Your form should look like this:

Then click on the Set up destination button.

d. Setting Up the Replication

You should now see the following screen:

Airbyte will then fetch the schema for the data coming from the Slack API for your workspace. You should leave all boxes checked and then choose the sync frequency - this is the interval in which Airbyte will sync the data coming from your workspace. Let’s set the sync interval to every 24 hours.

Then click on the Set up connection button.

Airbyte will now take you to the destination dashboard, where you will see the destination you just set up. Click on it to see more details about this destination.

You will see Airbyte running the very first sync. Depending on the size of the data Airbyte is replicating, it might take a while before syncing is complete.

When it’s done, you will see the Running status change to Succeeded, and the size of the data Airbyte replicated as well as the number of records being stored on the Postgres database.

To test if the sync worked, run the following in your terminal:

docker exec slack-db psql -U postgres -c "SELECT * FROM public.users;"

This should output the rows in the users table.

To get the count of the users table as well, you can also run:

docker exec slack-db psql -U postgres -c "SELECT count(*) FROM public.users;"

Now that we have the data from the Slack workspace in our Postgres destination, we will head on to creating the Slack dashboard with Apache Superset.

2. Setting Up Apache Superset for the Dashboards

a. Installing Apache Superset

Apache Superset, or simply Superset, is a modern data exploration and visualization platform. To get started using it, we will be cloning the Superset repo. Navigate to a destination in your terminal where you want to clone the Superset repo to and run:

git clone https://github.com/apache/superset.git

It’s recommended to check out the latest branch of Superset, so run:

cd superset

And then run:

git checkout latest

Superset needs you to install and build its frontend dependencies and assets. So, we will start by moving into the superset-frontend directory and installing the frontend dependencies:

cd superset-frontend
npm install

Note: The above command assumes you have both Node and NPM installed on your machine.

Finally, for the frontend, we will build the assets by running:

npm run build

After that, go back up one directory into the Superset directory by running:

cd ..

Then run:

docker-compose up

This will download the Docker images Superset needs and build containers and start services Superset needs to run locally on your machine.

Once that’s done, you should be able to access Superset on your browser by visiting http://localhost:8088, and you should be presented with the Superset login screen.

Enter username: admin and Password: admin to be taken to your Superset dashboard.

Great! You’ve got Superset set up. Now let’s tell Superset about our Postgres Database holding the Slack data from Airbyte.

b. Setting Up a Postgres Database in Superset

To do this, on the top menu in your Superset dashboard, hover on the Data dropdown and click on Databases.

In the page that opens up, click on the + Database button in the top right corner.

Then, you will be presented with a modal to add your Database Name and the connection URI.

Let’s call our Database slack_db, and then add the following URI as the connection URI:

postgresql://postgres:password@docker.for.mac.localhost:2000/postgres

If you are on a Windows Machine, yours will be:

postgresql://postgres:password@docker.for.win.localhost:2000/postgres

Note: We are using docker.for.[mac|win].localhost in order to access the localhost of your machine, because using just localhost will point to the Docker container network and not your machine’s network.
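
To make the parts of the connection URI explicit, here is a minimal sketch that assembles it from its components. The function name pg_uri is made up for illustration, and it does no URL-encoding, so it assumes the password contains no special characters.

```shell
# Build a PostgreSQL connection URI: postgresql://user:password@host:port/database
# Assumption: no part needs URL-encoding (true for the values used in this tutorial).
pg_uri() {
  printf 'postgresql://%s:%s@%s:%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Values from the slack-db container: user, password, host, mapped port, database.
pg_uri postgres password docker.for.mac.localhost 2000 postgres
```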

Your Superset UI should look like this:

We will need to enable some settings on this connection. Click on the SQL LAB SETTINGS and check the following boxes:

Afterwards, click on the ADD button, and you will see your database on the data page of Superset.

c. Importing our dataset

Now that you’ve added the database, you will need to hover over the data menu again; now click on Datasets.

Then, you will be taken to the datasets page:

We want to only see the datasets that are in our slack_db database, so in the Database that is currently showing All, select slack_db and you will see that we don’t have any datasets at the moment.

You can fix this by clicking on the + DATASET button and adding the following datasets.

Note: Make sure you select the public schema under the Schema dropdown.

Now that we have set up Superset and given it our Slack data, let’s proceed to creating the visualizations we need.

Still remember them? Here they are again:

Total number of members of a Slack workspace
The evolution of the number of Slack workspace members
Evolution of weekly messages
Evolution of weekly threads created
Evolution of messages per channel
Members per time zone

3. Creating Our Dashboards with Superset

a. Total number of members of a Slack workspace

To get this, we will first click on the users dataset of our slack_db on the Superset dashboard.

Next, change untitled at the top to Number of Members.

Now change the Visualization Type to Big Number, remove the Time Range filter, and add a Subheader named “Slack Members.” So your UI should look like this:

Then, click on the RUN QUERY button, and you should now see the total number of members.

Pretty cool, right? Now let’s save this chart by clicking on the SAVE button.

Then, in the ADD TO DASHBOARD section, type in “Slack Dashboard”, click on the “Create Slack Dashboard” button, and then click the Save button.

Great! We have successfully created our first Chart, and we also created the Dashboard. Subsequently, we will be following this flow to add the other charts to the created Slack Dashboard.

b. Casting the ts column

Before we proceed with the rest of the charts for our dashboard, if you inspect the ts column on either the messages table or the threads table, you will see it’s of the type VARCHAR. We can’t really use this for our charts, so we have to cast the ts column of both the messages and threads tables as TIMESTAMP. Then, we can create our charts from the results of those queries. Let’s do this.

First, navigate to the Data menu, and click on the Datasets link. In the list of datasets, click the Edit button for the messages table.

You’re now in the Edit Dataset view. Click the Lock button to enable editing of the dataset. Then, navigate to the Columns tab, expand the ts dropdown, and then tick the Is Temporal box.

Persist the changes by clicking the Save button.

c. The evolution of the number of Slack workspace members

In the exploration page, let’s first get the chart showing the evolution of the number of Slack members. To do this, make your settings on this page match the screenshot below:

Save this chart onto the Slack Dashboard.

d. Evolution of weekly messages posted

Now, we will look at the evolution of weekly messages posted. Let’s configure the chart settings on the same page as the previous one.

Remember, your visualization will differ based on the data you have.

e. Evolution of weekly threads created

Now, we are finished with creating the message chart. Let's go over to the thread chart. You will recall that we will need to cast the ts column as stated earlier. So, do that and get to the exploration page, and make it match the screenshot below to achieve the required visualization:

f. Evolution of messages per channel

For this visualization, we will need a more complex SQL query. Here’s the query we used (as you can see in the screenshot below):

SELECT CAST(m.ts as TIMESTAMP), c.name, m.text
FROM public.messages m
INNER JOIN public.channels c
ON m.channel_id = c.id

Next, click on EXPLORE to be taken to the exploration page; make it match the screenshot below:

Save this chart to the dashboard.

g. Members per time zone

Finally, we will be visualizing members per time zone. To do this, instead of casting in the SQL lab as we’ve previously done, we will explore another method to achieve casting by using Superset’s Virtual calculated column feature. This feature allows us to write SQL queries that customize the appearance and behavior of a specific column.

For our use case, we will need the updated column of the users table to be a TIMESTAMP, in order to perform the visualization we need for Members per time zone. Let’s start by clicking the edit icon on the users table in Superset.

You will be presented with a modal like so:

Click on the CALCULATED COLUMNS tab:

Then, click on the + ADD ITEM button, and make your settings match the screenshot below.

Then, go to the exploration page and make it match the settings below:

Now save this last chart, and head over to your Slack Dashboard. It should look like this:

Of course, you can edit how the dashboard looks to fit what you want on it.

Conclusion

In this article, we looked at using Airbyte’s Slack connector to get the data from a Slack workspace into a Postgres database, and then used Apache Superset to craft a dashboard of visualizations. If you have any questions about Airbyte, don’t hesitate to ask questions on our Slack! If you have questions about Superset, you can join the Superset Community Slack!

How to Save and Search Your Slack History on a Free Slack Plan

Charles — Wed, 24 Feb 2021 18:01:24 +0000

The Slack free tier saves only the last 10K messages. For social Slack instances, it may be impractical to upgrade to a paid plan to retain these messages. Similarly, for an open-source project like Airbyte where we interact with our community through a public Slack instance, the cost of paying for a seat for every Slack member is prohibitive.

However, searching through old messages can be really helpful. Losing that history feels like some advanced form of memory loss. What was that joke about Java 8 Streams? This contributor question sounds familiar—haven't we seen it before? But you just can't remember!

This tutorial will show you how you can, for free, use Airbyte to save these messages (even after Slack removes access to them). It will also provide you a convenient way to search through them.

Specifically, we will export messages from your Slack instance into an open-source search engine called MeiliSearch. We will be focusing on getting this setup running from your local workstation. We will mention at the end how you can set up a more productionized version of this pipeline.

We want to make this process easy, so while we will link to some external documentation for further exploration, we will provide all the instructions you need here to get this up and running.

1. Set Up MeiliSearch

First, let's get MeiliSearch running on our workstation. MeiliSearch has extensive docs for getting started. For this tutorial, however, we will give you all the instructions you need to set up MeiliSearch using Docker.

docker run -it --rm \
  -p 7700:7700 \
  -v $(pwd)/data.ms:/data.ms \
  getmeili/meilisearch

That's it!
MeiliSearch stores data in $(pwd)/data.ms, so if you prefer to store it somewhere else, just adjust this path.

2. How To Replicate Your Slack Messages to MeiliSearch

a. Set Up Airbyte

Make sure you have Docker and Docker Compose installed. If you haven’t set Docker up, follow the instructions here to set it up on your machine. Then, run the following commands:

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

If you run into any problems, feel free to check out our more extensive getting started for more help.

Once you see an Airbyte banner, the UI is ready to go at http://localhost:8000/. Once you have set your user preferences, you will be brought to a page that asks you to set up a source. In the next step, we'll go over how to do that.

b. Set Up Airbyte’s Slack Source Connector

In the Airbyte UI, select Slack from the dropdown. We provide step-by-step instructions for setting up the Slack source in Airbyte here. These will walk you through how to complete the form on this page.

By the end of these instructions, you should have created a Slack source in the Airbyte UI. For now, just add your Slack app to a single public channel (you can add it to more channels later). Only messages from that channel will be replicated.

The Airbyte app will now prompt you to set up a destination. Next, we will walk through how to set up MeiliSearch.

c. Set Up Airbyte’s MeiliSearch Destination Connector

Head back to the Airbyte UI. It should still be prompting you to set up a destination. Select "MeiliSearch" from the dropdown. For the host field, set: http://localhost:7700. The api_key can be left blank.

d. Set Up the Replication

On the next page, you will be asked to select which streams of data you'd like to replicate. We recommend unchecking "files" and "remote files" since you won't really be able to search them easily in this search engine.

For frequency, we recommend every 24 hours.

3. Search MeiliSearch

After the connection has been saved, Airbyte should start replicating the data immediately. When it completes you should see the following:

When the sync is done, you can sanity check that this is all working by making a search request to MeiliSearch. Replication can take several minutes depending on the size of your Slack instance.

curl 'http://localhost:7700/indexes/messages/search' --data '{ "q": "<search-term>" }'

For example, one of the messages that I replicated reads "welcome to airbyte".

curl 'http://localhost:7700/indexes/messages/search' --data '{ "q": "welcome to" }'
# => {"hits":[{"_ab_pk":"7ff9a858_6959_45e7_ad6b_16f9e0e91098","channel_id":"C01M2UUP87P","client_msg_id":"77022f01-3846-4b9d-a6d3-120a26b2c2ac","type":"message","text":"welcome to airbyte.","user":"U01AS8LGX41","ts":"2021-02-05T17:26:01.000000Z","team":"T01AB4DDR2N","blocks":[{"type":"rich_text"}],"file_ids":[],"thread_ts":"1612545961.000800"}],"offset":0,"limit":20,"nbHits":2,"exhaustiveNbHits":false,"processingTimeMs":21,"query":"test-72"}
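
If you would rather run the same search from a script, the request can be built with Python's standard library. This is a minimal sketch, not an official client: the function names are ours, it assumes the local MeiliSearch instance from step 1 and the `messages` index used above, and the explicit `Content-Type` header is our addition (some MeiliSearch versions expect it).

```python
import json
import urllib.request


def build_search_request(query, host="http://localhost:7700", index="messages"):
    """Build the same POST request as the curl command above."""
    body = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/indexes/{index}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **kwargs):
    """Send the request and return the list of matching messages."""
    with urllib.request.urlopen(build_search_request(query, **kwargs)) as resp:
        return json.loads(resp.read())["hits"]
```

With MeiliSearch running, `search("welcome to")` should return the same hits that the curl command prints.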

4. Search via a UI

Making curl requests to search your Slack history is a little clunky, so we have modified the example UI that MeiliSearch provides in their docs to search through the Slack results.
Download (or copy and paste) this html file to your workstation. Then, open it using a browser. You should now be able to write search terms in the search bar and get results instantly!

5. "Productionizing" Saving Slack History

You can find instructions for how to host Airbyte on various cloud platforms here.
Documentation on how to host MeiliSearch on cloud platforms can be found here.
If you want to use the UI mentioned in the section above, we recommend statically hosting it on S3, GCS, or equivalent.

How Open-source Can Disrupt Build vs. Buy Considerations

John Lafleur — Fri, 22 Jan 2021 02:29:16 +0000

When you’re selling or considering purchasing a B2B tool, you need to understand the build vs. buy argument. What are the pros and cons of building the tool internally vs. buying the tool from a third-party vendor? This is especially true in big companies where you have the resources to build the said tools. Early-stage startups will generally opt for the faster route, going with self-served B2B tools -- unless the pricing is prohibitive.

But something we don’t often think about is how open-source just messes the whole thing up. The build is completely redefined. You now need to compare the B2B tool with two build options: building from scratch, and building on top of an open-source tool, which most often lowers the barrier significantly.
In this article, we’ll take the example of the ETL/ELT industry. We know it best, as we’re building Airbyte, the open-source ELT alternative. Let’s see how open-source for ETL / ELT with Airbyte is also flipping the previous Build vs. Buy balance on its head.

We’ve produced an infographic to illustrate that point. You will see that without taking Airbyte into consideration, buying Fivetran compared pretty favorably with building connectors yourself. But now, with Airbyte, you can either just use the open-sourced connectors and start replicating data in minutes for free, or build new connectors (if Airbyte doesn’t support them yet) in a matter of days (vs. months before), with maintenance being crowdsourced throughout the Airbyte community.

The Infographic

The infographic shows:

in white, the original “build” scenario;
in blue, the original "buy" scenario with cloud-based Fivetran;
in purple, the new "build" scenario with 2 options: “build non-supported connector with Airbyte” in light purple, and “use prebuilt connectors from Airbyte” in dark purple

Let’s just say it: the playing field has changed!

The Explanation

Some context: the average business today uses well over 100 software apps, many of which contain valuable insights about an organization’s operations. Your company is likely on the way to using just as many apps, if not more, and you’ll need a solution to integrate all of the data your apps produce.

Time & Effort

Building your own pipeline is a significant time commitment. It can take 3 to 6 months to set up a basic pipeline. Furthermore, beyond the time commitment, there is some inherent complexity in building a reliable, high-performance ELT pipeline. You need to:

1. Obtain developer access to the data source
2. Explore the data
3. Design the schema/data models
4. Set up a connector framework
5. Test the connector and validate the data
6. Set up orchestration, configuration validation, state management, normalization, schema migration, monitoring, etc.
7. Maintain the connector for every schema change that happens every few weeks. This part is very cumbersome, as it requires an increasing number of data engineers to manage your connectors.

In contrast, an off-the-shelf solution such as Fivetran can be set up in a matter of minutes with prebuilt connectors. Airbyte also takes literally 30 seconds to deploy, and you can start replicating data within 2 minutes.

The big difference between both options in terms of time and effort is that all the Fivetran customers we talked to also had to build and maintain connectors on the side, as the connectors they needed were either not supported in the way they needed or not supported at all by Fivetran.

That’s where the option to build with Airbyte comes in. For connectors not supported by Airbyte, it is a matter of hours to build connectors. Indeed, Airbyte already takes care of the UI, monitoring, scheduling, orchestration, integration with your data stack, automatic schema changes, etc. There is a very high chance we support your destination. So in the end, it’s only the EL part of the source connector you have to build, and Airbyte provides some abstractions to make that easier.

Regarding maintenance, the goal of Airbyte is to crowdsource it throughout the community. When a connector fails because of significant API changes, Airbyte will notify that connector’s users. As soon as a fix is made available by the Airbyte team or a community member, Airbyte will propagate it to all the users. The hope is that this approach will provide a better SLA than closed-source solutions such as Fivetran, not to mention the fact that you won’t have to maintain the connector yourself.

People & Money

From what we’ve seen, a typical company requires the equivalent of at least two or three full-time data engineers to build and maintain a data pipeline. The total cost of three full-time engineers can reach the high six figures (including benefits). So that’s a lot!

Fivetran’s fees for a typical mid-sized company with five connectors are about $50,000. But you’ll have to add to that the cost of all the connectors you need to build and maintain by yourself.

In contrast, Airbyte’s connectors are open-sourced, so you can use them for free. You also don’t need to pay for the egress to Fivetran’s infrastructure. It is possible that you might need a little bit of engineering time to operate Airbyte. If you need to build some of the connectors yourself, you will have to pay for the time spent by the data engineering team on building and maintaining them, but that would still be way less than if you had to do everything yourself.

Opportunity Costs

The actual value brought by your data team is through analysis and modeling. Data integration, cleaning and transformation are important because they enable the analysis and modeling. So the more time your team can spend on value-producing tasks, the better for the business.
So opportunity costs as depicted in the illustration are very important to consider. Plus, ask any data team -- they will much prefer doing analysis or modeling tasks, rather than pipelining! So you will have better talent retention this way.

Now you can see how open-source can flip the previous build vs. buy balance on its head. Before Airbyte, Fivetran was an easy sell. Now, the opposite seems true. Leveraging Airbyte’s open-source technology to build your own data infrastructure seems the obvious choice.

There is one last thing to consider when choosing which direction to take: the future.

Future Growth of Your Company

As your company grows, you will add data sources to the pool. The complexity and effort of building and maintaining a data pipeline for a huge number of data sources can quickly escalate beyond your data engineering team’s ability to handle it.

You might consider taking a chance on Fivetran’s ability to cover all or most of your connector needs, so that your team doesn’t need to build and maintain a continually increasing number of connectors (that would defeat the purpose). But be mindful that Fivetran will always have an ROI consideration when maintaining connectors on the long tail; they won’t maintain connectors that don’t bring enough revenue to offset the maintenance costs.

On the other hand, Airbyte will continue to grow the number of prebuilt community-maintained connectors, and can even take a large portion of the maintenance costs off your hands.
When making a decision, consider how your company will evolve. And you can be sure that a great data infrastructure that grows with you will be a competitive advantage.

How We Leveraged Singer for Our MVP

Charles — Mon, 30 Nov 2020 20:28:00 +0000

One of the (many) hard things about doing a startup is figuring out what that MVP should be. You are trading off between presenting something that is “good” enough that it gets people excited to use (or invest in) you and getting something done fast. In this article, we explore how we wrestled with this trade-off. Specifically, we explore our decisions around how to use Singer to bootstrap our MVP. It is something we get tons of questions about, and it was hard for us to figure out ourselves!

When we set out to create an MVP for our data integration project, we began with this prompt:

Create an OSS data integration project that includes all of Singer’s major features. In addition, it should have a UI that can be used by non-technical users and has production-grade job scheduling and tracking.
Do it in a month.
Use Singer to bootstrap it.

We knew from the start that in the long run, we did not want Singer to be core to the working of our platform. In the short term, however, we wanted to be able to bootstrap our integration ecosystem off of Singer’s existing taps and targets. So should we make Singer part of our core platform in the beginning to bootstrap? And if so, at what cost?

This picture shows the spectrum of options we considered, from wrapping a UI around Singer and relying entirely on it as our backend to shooting for our original goal of Singer as a peripheral.

1. Thin UI wrapper around Singer

This felt like the “startup-y” option. We could throw Singer, a database, and a UI in a Docker container and have “something” up and running in, perhaps, days. We never tried to go with this approach because we were able to see some really big trade-offs.

Pros

Just a few days in terms of amount of work needed
No new code for each integration, just use Singer’s.

Cons

Pretty much all throw-away code after the initial release.
Because Singer taps / targets don’t declare their configurations (more on this later), there would be no way in the UI to tell the user what values they needed to provide in order to configure a source. We would only be able to accept a big JSON blob.

While we were going for an MVP, we did not think we would be able to get anyone interested in the first iteration. We also knew that subsequent iterations would be painful, since we would be effectively starting from scratch because the initial iteration was not a sturdy building block. We skipped this approach.

2. Airbyte integration configurations

Given that we wanted to provide a UI experience that was accessible to non-data engineers, our next step was to figure out how we could make it easy to configure integrations in the UI. This meant we had to build our own configuration abstraction for integrations, because this is something that Singer does not provide (we go into more depth on this feature in the first article in this series).

This abstraction was basically a way for each integration to declare what information it needed in order to be configured. For example, a Postgres source might need a hostname, port, etc. This layer made it possible for the UI to display user-friendly forms for setting up integrations. With this approach, we could still rely on Singer as the “backend” for the platform, but we could provide a better configuration experience for the user.

In order to implement this layer, we created a standardized way to declare information about an integration and how to configure it in a JsonSchema object. When someone selects an integration in the UI, it will render a form based on that JsonSchema. The user would then provide the needed information and pass it directly to the backend.
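
To make this concrete, here is a minimal sketch of what such a declaration and the form derived from it could look like. The spec below is hypothetical, not Airbyte's actual Postgres spec; `secret` is a made-up annotation rather than a standard JSON Schema keyword, and `form_fields` is an illustrative helper of ours.

```python
# Hypothetical configuration spec for a Postgres source, written as JSON Schema.
# Field names and the "secret" annotation are illustrative assumptions.
POSTGRES_SPEC = {
    "type": "object",
    "required": ["host", "port", "database", "username"],
    "properties": {
        "host": {"type": "string"},
        "port": {"type": "integer", "default": 5432},
        "database": {"type": "string"},
        "username": {"type": "string"},
        "password": {"type": "string", "secret": True},
    },
}


def form_fields(spec):
    """Turn a JSON Schema object into the list of inputs a UI would render."""
    required = set(spec.get("required", []))
    fields = []
    for name, prop in spec["properties"].items():
        if prop.get("secret"):
            kind = "password"
        elif prop["type"] == "integer":
            kind = "number"
        else:
            kind = "text"
        fields.append({
            "name": name,
            "input": kind,
            "required": name in required,
            "default": prop.get("default"),
        })
    return fields
```

The point of the design is that the UI code never needs to know about Postgres: adding an integration means adding a spec, not a new form.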

This is ultimately where we started out. And everything was good for about a week…

3. Dockerize Singer integrations

Up until this point, the only thing we had to do per integration was write a JsonSchema object that declared the configuration inputs for an integration. But what if we want the form in the UI to display different fields than those that Singer taps / targets consume?

The first case we ran into was in the Postgres Singer tap. That tap takes in a field called “filter_dbs.” This attribute restricts which databases the tap scans when being run in “discover” mode. The tap also takes in a field called “database,” which is the name of the database from which data will be replicated. In our use case, we wanted “filter_dbs” to be populated with only a single entry, the value that the user had provided for “database.”

In order to hide filter_dbs from the UI, but still populate it behind the scenes, we were going to need to write some special code that executed only when the Postgres Tap ran. But where was that code going to run? The abstraction we had was that our core platform just assumed that all integration-specific code was bundled in the Singer Tap. So we were either going to need to insert this integration-specific code into our core platform or restructure our abstraction so that we could run custom integration code that was not packaged as part of Singer.
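
The integration-specific code in question is tiny. A sketch of it, assuming the tap accepts the single database name as the value of `filter_dbs` (check the tap's source for the exact format it expects); the function name is ours:

```python
def to_tap_config(user_config):
    """Translate the config collected in the UI into what the tap consumes.

    Assumption: `filter_dbs` can be set to the single database name as-is.
    """
    tap_config = dict(user_config)  # copy so the user's input is not mutated
    tap_config["filter_dbs"] = user_config["database"]
    return tap_config
```

The hard question was never how to write these few lines, but where code like this should live.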

Again, we already had a rough idea of what we wanted this to look like in the long term. We imagined each integration running entirely in its own Docker container. Airbyte would handle passing messages from the container running the source to the container running the destination. We had hoped we could get to MVP without it, but ultimately, when we hit this issue, it tipped us over the edge. So we traded some time to figure out how to package Singer taps and targets into Docker containers that made it easy for us to mediate all of the interactions between the core platform and the integration running in the container.

4. Use the Airbyte protocol instead of the Singer protocol

Now fast forward another couple weeks: we are on the night before we plan to do our first public launch, and nothing is working. We have 3 sources and 3 destinations, and not one of them can work with all of the others.

The issue was two-fold:

We ran into inconsistencies in the Singer protocol that made it hard to treat all Singer Taps and Targets the same way programmatically.
In falling back on Singer to handle our “backend,” there were implementation details in the way Singer worked that were incompatible with the product we wanted to build.

We won’t spend a ton of time discussing these issues, because we’ve already written about them here. So let’s just say we hit a point where we realized that we either needed to become the world’s foremost experts on the Singer protocol or focus on defining our own protocol. Since the latter already aligned with our long-term vision, we went in that direction.

Ultimately, we tore out our hair and got through that night, and then for our next release we introduced our own protocol. Even at our early stage, this was an expensive endeavour. It took one-ish engineers over a week to migrate us from the Singer protocol to our own (this felt like eons to us!).

Did we do it right?

Obviously, this question is impossible to answer. After reading this article, you might have come to the conclusion that we should have built the first version of our product with Singer at the periphery of our system. And had we done that, we could have skipped the iteration of moving Singer from within our core system to the outskirts. I wouldn’t begrudge you that conclusion!

Had we taken that approach, however, we would have delayed our initial release by an additional month (double time to MVP!). Getting something out early was valuable, because it gave us early feedback that what we were building was interesting to people. We made trade-offs to move fast, but still work from a base that we could iterate on quickly--pretty much the classic trade-off you think about when trying to launch an MVP. And, ultimately, we can’t draw any hard and fast rules other than to use your own judgment!

The unexpected insight that we came away with, however, was that this approach allowed us to learn a lot from Singer. Even having Singer be part of the core system for just a few weeks, we got a really good understanding of why they had solved certain issues the way they did.

For example, when we first encountered the Singer Catalog, the use of a breadcrumb system to map metadata onto a schema felt unintuitive and needlessly complicated. The metadata and the schema were in the same parent object, so why did we need this complex system of having the metadata fields index into the schema? Couldn’t they be combined? After using it closely for a few weeks, we understood the complexities that come with configuring special behavior at a field level for deeply nested schemas. Had we gone our own way from the start, we would have learned this lesson much later (and the later we learned it, the harder it would have been to remedy).
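
For readers who have not seen the breadcrumb system, here is a trimmed-down illustration. The stream and field names are made up, and the two helpers are ours; only the overall shape (a schema plus a list of metadata entries, each carrying a `breadcrumb` path) follows Singer's catalog format.

```python
# The schema and the metadata live side by side; each metadata entry points
# into the schema through a "breadcrumb" path. An empty breadcrumb refers to
# the stream itself.
STREAM = {
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "address": {
                "type": "object",
                "properties": {"zip": {"type": "string"}},
            },
        },
    },
    "metadata": [
        {"breadcrumb": [], "metadata": {"selected": True}},
        {"breadcrumb": ["properties", "id"], "metadata": {"inclusion": "automatic"}},
        {"breadcrumb": ["properties", "address", "properties", "zip"],
         "metadata": {"selected": False}},
    ],
}


def resolve(schema, breadcrumb):
    """Follow a breadcrumb path into the schema and return the node it names."""
    node = schema
    for key in breadcrumb:
        node = node[key]
    return node


def metadata_by_path(stream):
    """Index metadata by breadcrumb so field-level settings are easy to look up."""
    return {tuple(m["breadcrumb"]): m["metadata"] for m in stream["metadata"]}
```

Keeping the metadata outside the schema means a deeply nested field such as `address.zip` can be configured without rewriting the schema itself.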

Building on top of Singer in the beginning forced us into a Chesterton’s Fence situation. Each time we wanted to do something a certain way, because we thought Singer’s approach didn’t make sense, we were forced to fully understand why Singer had done things the way it did. By doing so, we avoided mistakes we would otherwise have made. We also were able to make decisions different from Singer’s while still benefiting from its experience. All in all, we feel we made the right choice. What do you think?

Why You Should NOT Build Your Data Pipeline on Top of Singer

Charles — Mon, 30 Nov 2020 20:26:59 +0000

Singer.io is an open-source CLI tool that makes it easy to pipe data from one tool to another. At Airbyte, we spent time determining if we could leverage Singer to programmatically send data from any of their supported data sources (taps) to any of their supported data destinations (targets).

For the sake of this article, let’s say we are trying to build a tool that can do the following:

Run any Singer tap or target
Provide a UI for configuring and running those taps and targets
Count the number of records synced in each run

In the context of these goals, being able to use Singer programmatically means writing a program that can, for any integration:

provide a UI with instructions on what information a user needs to input in order to configure that integration (e.g., host, password, etc).
take those user-provided values and execute each integration.

We know that the described requirements are not the use case that Singer sets out to solve, but nonetheless, we wanted to see if we could leverage Singer to bootstrap building out this case. Sure enough, we ran into some “gotchas” along the way. These gotchas illustrate some of the core primitives that a programmatic data integration tool requires.

Integrations do not declare their configurations

The Singer protocol does not specify how an integration should define what inputs it requires. This means that, in order to use most Singer taps, you need to scour the entire implementation to figure out what properties it uses; depending on the complexity of the integration, this can be pretty painful.

Some integrations help out by specifying what the configuration should look like in a readme or in a sample config. Even these lead to headaches. They often just list the fields that need to be passed in but do not explain what they mean, what their format is, or how to find them (good luck trying to find all the information you need to configure your Google Ads integration!). In other cases, they only list a subset, and then you have to discover the rest by reading the integration (e.g., tap-salesforce doesn’t mention is_sandbox in the docs. UPDATE: someone has now added this field in the readme with this PR).

These taps are great; we have happily used all of them, but because they do not specify what is required to configure them, they can’t be used programmatically. Specifically, our program needs to know that the Postgres tap requires the fields hostname and port. Without this specification, the program cannot figure out how to build a valid configuration for an integration. This configuration is expensive to shim, because it requires engineering work for every single integration!
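
As an illustration of what a machine-readable declaration would buy us, here is a minimal sketch. The declaration format and the `config_errors` helper are hypothetical; Singer taps do not ship anything like this, which is exactly the gap described above.

```python
# Hypothetical declaration a Postgres tap could ship alongside its code.
POSTGRES_REQUIRED = {"hostname": str, "port": int}


def config_errors(declared, config):
    """Return human-readable problems with `config`, or an empty list if valid."""
    errors = []
    for field, expected_type in declared.items():
        if field not in config:
            errors.append(f"missing field: {field}")
        elif not isinstance(config[field], expected_type):
            errors.append(f"{field} should be of type {expected_type.__name__}")
    return errors
```

With a declaration like this, a program can reject a bad configuration before running the tap, and the same code works for every integration.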

No way to tell which Singer feature is compatible with which integration

Singer has excellent documentation around its core protocol. It also does a nice job defining the suite of special metadata that it supports. When you start actually using Singer, however, mapping these primitives onto your integrations is difficult. For example, “replication-method” sets whether all the data from the source should be replicated (“full_table”) or just the new or updated data (“incremental”). What is unclear is which taps actually support “incremental” or “full_table” or both.

Taps do not advertise, in a way that is programmatically consumable, which of these replication methods they support. Some of them mention it in their documentation, but ultimately that’s insufficient for the type of tool we want to build. So what happens when you request “incremental” from a source that only supports “full_table”? The behavior is undefined. Some taps will throw an error, some will just do a full refresh. Either way, from the point of view of the UI-based tool that we are trying to build, this isn’t really usable.

The problem only gets hairier for some of the more niche metadata (e.g., “view-key-properties”). You either need to read the source or just try it out and see if the configuration works. This problem is adjacent to the configuration problem described in the previous section, and, similarly, requires a shim for every integration.
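
A sketch of what advertising this programmatically could look like. Everything here is an assumption for illustration: Singer has no such capability declaration, and the stream names, the table, and the exception class are made up.

```python
# Hypothetical capability declaration, keyed by stream name.
SUPPORTED_METHODS = {
    "users": {"full_table", "incremental"},
    "audit_log": {"full_table"},
}


class UnsupportedReplicationMethod(ValueError):
    """Raised when a stream is asked for a method it does not declare."""


def check_replication_method(stream, requested, supported=SUPPORTED_METHODS):
    """Fail before the sync starts instead of relying on undefined behavior."""
    methods = supported.get(stream, set())
    if requested not in methods:
        raise UnsupportedReplicationMethod(
            f"{stream!r} supports {sorted(methods)}, not {requested!r}"
        )
    return requested
```

A UI could use the same table to offer only the options that each stream actually supports.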

Singer’s own secret menu

If you’re from the West coast, you might be familiar with how In-N-Out Burger popularized the “secret” menu in fast food chains. While charming at a drive thru, secret menus can ruin your data integration.

The Singer protocol has some of its own secret menu items. For example, we were parsing each message that a tap output into JSON using the declared schema in the Singer docs. We were trying to understand really well what messages were being sent between taps and targets, so we would fail loudly if anything was sent that did not match the documented message types. Then we started getting errors on “ActivateVersionMessage.” After spelunking in the source code for a bit, we found that this message type has existed in Singer as an experimental feature since 2017. A handful of the official Singer taps use it, but there’s no guidance on what you’re supposed to do with it (I suspect it is a feature used internally at Stitch--the paid, managed solution from the creators of Singer). If you’re building something programmatic on top of Singer, your choice is to just filter it out or let it pass and hope that stuff…just works, I guess?

Handling this one case is not the end of the world, but it leaves you feeling uncertain what else is lurking in the protocol that might not play well with your system.
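
For reference, this is roughly what the "filter it out" option looks like, combined with the record counting from our goals above. It is a sketch under assumptions: the function is ours, and it treats `ACTIVATE_VERSION` as the type string used by that experimental message.

```python
import json

DOCUMENTED_TYPES = {"RECORD", "SCHEMA", "STATE"}
# Seen in the wild but not in the documented protocol; skipped on purpose.
TOLERATED_TYPES = {"ACTIVATE_VERSION"}


def count_records(lines):
    """Count RECORD messages per stream, failing loudly on unknown types."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        message = json.loads(line)
        kind = message.get("type")
        if kind in TOLERATED_TYPES:
            continue
        if kind not in DOCUMENTED_TYPES:
            raise ValueError(f"unexpected message type: {kind!r}")
        if kind == "RECORD":
            stream = message["stream"]
            counts[stream] = counts.get(stream, 0) + 1
    return counts
```

The allowlist keeps the loud failure for anything truly unknown, while making the decision to ignore a given message type explicit and reviewable.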

Conclusion

So to answer our original question, can we reasonably stretch Singer to meet our product requirements? The answer is no. Doing so would require writing custom shims for every single Singer tap and target. Since the goal with data integrations is always to scale to more integrations, having to do any work on them per integration is very expensive.

The Singer protocol is underspecified for this use case. This realization makes sense, because ultimately this is not the use case that the protocol is trying to solve. Achieving these requirements depends on integrations declaring much more information about how they are configured and which features they support. We are tackling this problem at Airbyte, so if you are looking for an OSS solution that makes it easy to move your data into a warehouse, instead of trying to roll your own on top of Singer, come check us out!

This article is meant to be the first in a pair of articles. The second will explore the engineering journey that we took to figure out where Singer should fit into our system.

How to Build Thousands of Connectors

John Lafleur — Wed, 04 Nov 2020 04:59:37 +0000

We’re building an open-source data integration platform at Airbyte. We launched our MVP about a month ago. We were thrilled by the amount of feedback and support we got from the community. We even got our first big pull request from a contributor this week (2,000+ lines of code). But during this full month, we didn’t release any new connectors. You might wonder why we didn’t build on that momentum. If people were excited with our MVP even though it had only 6 connectors, you might think we should have ramped up on the number of connectors as fast as possible. We didn’t do that for two very important and differentiating reasons.

First, we were defining exactly what the best data protocol would be if we wanted to solve data integration once and for all, for all companies. You can learn more about our specification here. Even though it’s not final yet, it will give you a glimpse of our vision for the future.

Second, and just as important, we were building a real manufacturing plant for data integration connectors. See, our team led data integration at LiveRamp, which has more than 1,000 data ingestion connectors and 1,000+ distribution connectors. So we have the experience of abstracting what can be abstracted and simplifying the manufacturing of new integrations (very often without code). We haven’t fully built our manufacturing plant, but engineers can already add one new connector every day.

This article describes how we built this connector manufacturing plant.

What you need to think about when building a large number of connectors

When building a large catalog of connectors, there are several things that you need to think through.

Initial build

This is when you start from a blank page. This step usually requires a little bit of planning since it involves communication with external teams/companies.
The initial build step involves:

Access to the source/destination documentation
Access to test accounts, test infrastructure, etc.
Using a golden path that encodes good practices
Using the best language for the task: today, we support both Java and Python, but anyone can add their own language
Creating documentation
Defining the necessary inputs

Tests

Tests are essential to make sure that any code or protocol change won’t affect the connectors. They need to run before every merge.

They also ensure that the connector behaves as you expect. For that you need to run your connector against the actual production service. For example, if you’re working on the Salesforce connector, you must make sure that Salesforce actually behaves the way you expect. It is not unusual that an API or service documentation doesn’t fully reflect the reality.

We currently have the foundation of our test framework; it allows developers to focus solely on providing inputs and outputs, and the rest is taken care of by the framework.

These tests give us 90% certainty that the connector is fully functional. If there are edge cases, it is always possible to add more custom tests.
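
To give a feel for the "inputs and outputs" idea, here is a toy harness. It is not our actual framework: `run_cases`, the shape of a case, and the fake connector are all hypothetical stand-ins.

```python
def run_cases(read_fn, cases):
    """Run a connector's read function against declared inputs and outputs.

    Returns the names of the cases whose records did not match.
    """
    failures = []
    for case in cases:
        actual = list(read_fn(case["config"]))
        if actual != case["expected_records"]:
            failures.append(case["name"])
    return failures


def fake_read(config):
    """Toy connector used only to exercise the harness."""
    for i in range(config["rows"]):
        yield {"id": i}
```

The connector developer only writes the cases; how they are run, reported and wired into the merge checks is the framework's job.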

Liveliness & Change detection

It is essential to ensure that the source or destination continues to behave as it was encoded during the initial build phase and to ensure that the source or destination is still alive for monitoring purposes.

These verifications must be run at a cadence, and any failure needs to be investigated and fixed, leading to the maintenance phase.
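
One simple form of change detection is comparing the fields encoded at build time with what the source returns today. A minimal sketch, with a hypothetical helper name and made-up fields:

```python
def detect_schema_drift(expected_fields, sample_record):
    """Compare the fields encoded at build time with a freshly fetched record."""
    expected = set(expected_fields)
    observed = set(sample_record)
    return {
        "missing": sorted(expected - observed),
        "unexpected": sorted(observed - expected),
    }
```

Run on a schedule, a non-empty result is the signal that moves a connector into the maintenance phase.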

Maintenance

We need to define how we are going to update the connector, push changes and propagate the changes to all the running instances of Airbyte.

The art of building connectors is thinking in onion layers

Segmenting cattle code

To make a parallel with the pet/cattle concept that is well known in DevOps/Infrastructure, a connector is cattle code, and you want to spend as little time on it as possible. Anything you can do to prevent yourself from doing work in the future, you need to do. This will accelerate your production tremendously.

Abstractions as onion layers

Maximizing high-leverage work leads you to build your architecture with an onion-esque structure:

The center defines the lowest level of the API. Implementing a connector at that level requires a lot of engineering time. But, it is your escape hatch for very complex connectors where you need a lot of control.

Then, you build new layers of abstraction that help tackle families of connectors very quickly.
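
In code, the layering could look like the sketch below. The class names and method signatures are illustrative assumptions, not Airbyte's actual interfaces; the page-based API family is just one example of a family of connectors.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Center of the onion: full control, but everything is written by hand."""

    @abstractmethod
    def read(self, config):
        """Yield records as dicts."""


class PaginatedApiSource(Source):
    """An outer layer for one family of connectors: page-based APIs.

    Subclasses only describe how to fetch one page; iteration is shared.
    """

    @abstractmethod
    def fetch_page(self, config, page):
        """Return the records on `page`, or an empty list when done."""

    def read(self, config):
        page = 0
        while True:
            records = self.fetch_page(config, page)
            if not records:
                return
            yield from records
            page += 1


class InMemoryApi(PaginatedApiSource):
    """Toy connector standing in for a real API client."""

    def fetch_page(self, config, page):
        size = config["page_size"]
        return config["data"][page * size:(page + 1) * size]
```

A connector in the outer layer is a handful of lines, while a connector that does not fit the family can still implement `Source` directly.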

Today, we’ve built one of these abstractions to support existing Singer integrations. Building an integration leveraging Singer takes us less than 3 hours, and our goal is to bring it down to less than 10 minutes.

We have the same ambition for every other family of sources and destinations.

As we continue to improve our manufacturing plant for connectors, we will build tools that will allow us to handle 95% of integrations with no or very little code.

This is how we are going to address the long tail of integrations and how we’re going to make integrations a commodity.

What Airbyte has built up to now

We’ve built the following:

The center of the onion
The golden path in Java & Python to build new connectors
The first version of the integration test framework
Connectors: 10 sources with a rate of 1 new source per day, and 4 destinations
A layer to quickly support Singer integrations

What our ambitions are with this connector manufacturing plant

We want to reach a rate of 5 connectors per day and accelerate even beyond that.

We also want to provide the community with more tools to build and contribute their own connectors. Ideally, 95% of connectors can be added to Airbyte with no code.

We hope this gives you a better understanding of what we’ve been up to and what our real ambitions are. If you see any ways to improve this architecture, we’re all ears. Don’t hesitate to join our Slack to discuss any questions or suggestions with the team.

Why the Future of ETL Is Not ELT, But EL(T)

John Lafleur — Wed, 04 Nov 2020 04:51:39 +0000

How we store and manage data has completely changed over the last decade. We moved from an ETL world to an ELT world, with companies like Fivetran pushing the trend. However, we don’t think it is going to stop there. In our view, ELT is a transition towards EL(T), with EL decoupled from T. To understand why, we need to look at the underlying reasons for this trend, as they might show what’s in store for the future.

This is what we will be doing in this article. I’m the co-founder of Airbyte, the upcoming open-source standard for data integrations.

What are the problems with ETL?

Historically, the data pipeline process consisted of extracting, transforming, and loading data into a warehouse or a data lake. There are serious disadvantages to this sequence.

Inflexibility

ETL is inherently rigid. It forces data analysts to know beforehand every way they are going to use the data, every report they are going to produce. Any change they make can be costly. It can potentially affect data consumers downstream of the initial extraction.

Lack of visibility

Every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse, only what was kept during the transformation phase. This is risky, as conclusions might be drawn from data that hasn’t been properly sliced.

Lack of Autonomy for Analysts

Last but not least, building an ETL-based data pipeline is often beyond the technical capabilities of analysts. It typically requires the close involvement of engineering talent, along with additional code to extract and transform each source of data.

The alternative to a complex engineering project is to conduct analyses and build reports on an ad hoc, time-intensive, and ultimately unsustainable basis.

What changed and why ELT is way better

Cloud-based Computation and Storage of Data

The ETL approach was once necessary because of the high costs of on-premises computation and storage. With the rapid growth of cloud-based data warehouses such as Snowflake, and the plummeting cost of cloud-based computation and storage, there is little reason to continue doing transformation before loading at the final destination. Indeed, flipping the two enables analysts to do a better job in an autonomous way.

ELT Supports Agile Decision-Making for Analysts

When analysts can load data before transforming it, they don’t have to decide in advance exactly which insights they want to generate or which schema they will need.

Instead, the underlying source data is replicated directly to a data warehouse, which becomes a “single source of truth.” Analysts can then perform transformations on the data as needed. They can always go back to the original data, so they won’t suffer from transformations that might have compromised its integrity. This makes the business intelligence process far more flexible and safe.
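
To make the load-then-transform idea concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a cloud warehouse. The table name, the view name, and the sample rows are all hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical source rows: (id, customer, amount_cents, status).
RAW_ORDERS = [
    (1, "acme", 1250, "paid"),
    (2, "acme", 800, "refunded"),
    (3, "globex", 4300, "paid"),
]

# In-memory SQLite stands in for the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")

# Load: replicate the source rows as-is, with no filtering or reshaping.
conn.execute(
    "CREATE TABLE raw_orders "
    "(id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)

# Transform: defined afterwards, as a view on top of the raw table.
conn.execute(
    """
    CREATE VIEW paid_revenue AS
    SELECT customer, SUM(amount_cents) AS revenue_cents
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    """
)
paid = dict(conn.execute("SELECT customer, revenue_cents FROM paid_revenue"))

# A question nobody planned for can still be answered from the raw rows.
refunds = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE status = 'refunded'"
).fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Had the pipeline filtered out refunded orders before loading, as an ETL job might, the second question could not be answered without re-extracting from the source.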

ELT Promotes Data Literacy Across the Whole Company

When used in combination with cloud-based business intelligence tools such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of analytics across organizations. Business intelligence dashboards become accessible even to relatively non-technical users.

We’re big fans of ELT at Airbyte, too. But ELT does not completely solve the data integration problem, and it has problems of its own. We think EL needs to be completely decoupled from T.

What’s changing now and why EL(T) is the future

Merging of Data Lakes and Warehouses

There was a great analysis by Andreessen Horowitz about how data infrastructures are evolving. Here is the architecture diagram of the modern data infrastructure they came up with after a lot of interviews with industry leaders.

Data infrastructure serves two purposes at a high level:

Helps business leaders make better decisions through the use of data - analytic use cases
Builds data intelligence into customer-facing applications, including via machine learning - operational use cases

Two parallel ecosystems have grown up around these broad use cases.

The data warehouse forms the foundation of the analytics ecosystem. Most warehouses store data in a structured format. They are designed to generate insights from core business metrics, usually with SQL (although Python is growing in popularity).

The data lake is the backbone of the operational ecosystem. By storing data in raw form, it delivers the flexibility, scale, and performance required for applications and more advanced data processing needs. Data lakes operate on a wide range of languages including Java/Scala, Python, R, and SQL.

What’s really interesting is that modern data warehouses and data lakes are starting to resemble one another – both offering commodity storage, native horizontal scaling, semi-structured data types, ACID transactions, interactive SQL queries, and so on.

So you might be wondering if data warehouses and data lakes are on a path toward convergence. Will they become interchangeable in a stack? Will data warehouses also be used for the operational use case?

EL(T) Supports Both Use Cases: Analytics and Operational ML

EL, in contrast to ELT, completely decouples the Extract-Load part from any optional transformation that may occur.
The operational use cases are all unique in the way incoming data is leveraged. Some might use a unique transformation process; some might not even use any transformation.
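
As a rough illustration of this decoupling, the Python sketch below runs extract and load unconditionally and treats transformation as an optional, pluggable step. Every name in it (`extract`, `load`, `run_sync`, `normalize_emails`) is hypothetical and is not Airbyte's API. Plain lists stand in for a data lake and a warehouse.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Record = dict


def extract() -> Iterator[Record]:
    """Hypothetical source that yields raw records."""
    yield {"email": "A@Example.com", "plan": "pro"}
    yield {"email": "b@example.com", "plan": "free"}


def load(records: Iterable[Record], destination: List[Record]) -> int:
    """Append records unchanged and return how many were loaded."""
    count = 0
    for record in records:
        destination.append(dict(record))
        count += 1
    return count


def run_sync(
    source: Callable[[], Iterable[Record]],
    destination: List[Record],
    transform: Optional[Callable[[List[Record]], List[Record]]] = None,
) -> List[Record]:
    """Run EL, then apply T only if the caller supplied one."""
    load(source(), destination)  # EL never depends on T
    if transform is None:
        return destination
    return transform(destination)


def normalize_emails(rows: List[Record]) -> List[Record]:
    """Example transformation: returns new rows, leaves the raw ones intact."""
    return [{**row, "email": row["email"].lower()} for row in rows]


# Operational use case: raw data, no transformation at all.
lake: List[Record] = []
operational = run_sync(extract, lake)

# Analytics use case: same EL, plus a transformation chosen by the analyst.
warehouse: List[Record] = []
analytics = run_sync(extract, warehouse, transform=normalize_emails)
```

Because `transform` is just a parameter, swapping in a different normalization tool does not touch the extract and load code.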

With regard to the analytics case, analysts will need to get the incoming data normalized for their own needs at some point. But decoupling EL from T would let them choose whichever normalization tool they want. DBT has been gaining a lot of traction lately among data engineering and data science teams, and it has become the open-source standard for transformation. Even Fivetran integrates with it, so teams that are used to DBT can keep using it.

EL Scales Faster and Leverages the Whole Ecosystem

Transformation is where all the edge cases lie. For every specific need within any company, there is a schema normalization unique to it, for each and every one of the tools.

Decoupling EL from T enables the industry to start covering the long tail of connectors. At Airbyte, we’re building a “connector manufacturing plant” so we can get to 1,000 pre-built connectors in a matter of months.

Furthermore, as mentioned above, it would help teams leverage the whole ecosystem in an easier way. You start to see an open-source standard for every need. In a sense, the future data architecture might look like this:

In the end, extract and load will be decoupled from transformation. Do you agree with us? If so, you might be interested to have a look at what Airbyte does.