<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Charles</title>
    <description>The latest articles on DEV Community by Charles (@cgardens).</description>
    <link>https://dev.to/cgardens</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F474771%2Fad8146b0-665c-4508-b89c-648e921dcf5e.jpeg</url>
      <title>DEV Community: Charles</title>
      <link>https://dev.to/cgardens</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cgardens"/>
    <language>en</language>
    <item>
      <title>How to Save and Search Your Slack History on a Free Slack Plan</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Wed, 24 Feb 2021 18:01:24 +0000</pubDate>
      <link>https://dev.to/airbytehq/how-to-save-and-search-your-slack-history-on-a-free-slack-plan-m3d</link>
      <guid>https://dev.to/airbytehq/how-to-save-and-search-your-slack-history-on-a-free-slack-plan-m3d</guid>
      <description>&lt;p&gt;The &lt;a href="https://slack.com/intl/en-nc/pricing/paid-vs-free"&gt;Slack free tier&lt;/a&gt; saves only the last 10K messages. For social Slack instances, it may be impractical to upgrade to a paid plan to retain these messages. Similarly, for an open-source project like &lt;a href="https://airbyte.io/"&gt;Airbyte&lt;/a&gt; where we interact with our community through a public Slack instance, the cost of paying for a seat for every Slack member is prohibitive.&lt;/p&gt;

&lt;p&gt;However, searching through old messages can be really helpful. Losing that history feels like some advanced form of memory loss. What was that joke about Java 8 Streams? This contributor question sounds familiar—haven't we seen it before? But you just can't remember!&lt;/p&gt;

&lt;p&gt;This tutorial will show you how you can, for free, use Airbyte to save these messages (even after Slack removes access to them). It will also provide you with a convenient way to search through them.&lt;/p&gt;

&lt;p&gt;Specifically, we will export messages from your Slack instance into an open-source search engine called &lt;a href="https://github.com/meilisearch/meilisearch"&gt;MeiliSearch&lt;/a&gt;. We will focus on getting this setup running from your local workstation. At the end, we will mention how you can set up a more productionized version of this pipeline.&lt;/p&gt;

&lt;p&gt;We want to make this process easy, so while we will link to some external documentation for further exploration, we will provide all the instructions you need here to get this up and running.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Set Up MeiliSearch
&lt;/h1&gt;

&lt;p&gt;First, let's get MeiliSearch running on our workstation. MeiliSearch has extensive docs for &lt;a href="https://docs.meilisearch.com/reference/features/installation.html#download-and-launch"&gt;getting started&lt;/a&gt;. For this tutorial, however, we will give you all the instructions you need to set up MeiliSearch using Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 7700:7700 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data.ms:/data.ms &lt;span class="se"&gt;\&lt;/span&gt;
  getmeili/meilisearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it!&lt;br&gt;
MeiliSearch stores data in $(pwd)/data.ms, so if you prefer to store it somewhere else, just adjust this path.&lt;/p&gt;
&lt;h1&gt;
  
  
  2. How To Replicate Your Slack Messages to MeiliSearch
&lt;/h1&gt;
&lt;h2&gt;
  
  
  a. Set Up Airbyte
&lt;/h2&gt;

&lt;p&gt;Make sure you have Docker and Docker Compose installed. If you haven’t set Docker up, follow the &lt;a href="https://docs.docker.com/desktop/"&gt;instructions here&lt;/a&gt; to set it up on your machine. Then, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/airbytehq/airbyte.git
&lt;span class="nb"&gt;cd &lt;/span&gt;airbyte
docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run into any problems, check out our more detailed &lt;a href="https://docs.airbyte.io/getting-started"&gt;getting started&lt;/a&gt; guide for help.&lt;/p&gt;

&lt;p&gt;Once you see an Airbyte banner, the UI is ready to go at &lt;a href="http://localhost:8000/"&gt;http://localhost:8000/&lt;/a&gt;. Once you have set your user preferences, you will be brought to a page that asks you to set up a source. In the next step, we'll go over how to do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  b. Set Up Airbyte’s Slack Source Connector
&lt;/h2&gt;

&lt;p&gt;In the Airbyte UI, select Slack from the dropdown. We provide step-by-step instructions for setting up the Slack source in Airbyte &lt;a href="https://docs.airbyte.io/integrations/sources/slack#setup-guide"&gt;here&lt;/a&gt;. These will walk you through how to complete the form on this page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yc7LehAC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv03xpwjegn8s98d3tne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yc7LehAC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv03xpwjegn8s98d3tne.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the end of these instructions, you should have created a Slack source in the Airbyte UI. For now, just add your Slack app to a single public channel (you can add it to more channels later). Only messages from that channel will be replicated.&lt;/p&gt;

&lt;p&gt;The Airbyte app will now prompt you to set up a destination. Next, we will walk through how to set up MeiliSearch.&lt;/p&gt;

&lt;h2&gt;
  
  
  c. Set Up Airbyte’s MeiliSearch Destination Connector
&lt;/h2&gt;

&lt;p&gt;Head back to the Airbyte UI. It should still be prompting you to set up a destination. Select "MeiliSearch" from the dropdown. For the host field, set: &lt;a href="http://localhost:7700"&gt;http://localhost:7700&lt;/a&gt;. The api_key can be left blank.&lt;/p&gt;

&lt;h2&gt;
  
  
  d. Set Up the Replication
&lt;/h2&gt;

&lt;p&gt;On the next page, you will be asked to select which streams of data you'd like to replicate. We recommend unchecking "files" and "remote files", since you won't be able to search them usefully in this search engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WJMd_GPQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oax8vn55frbs3nv4ocrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WJMd_GPQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oax8vn55frbs3nv4ocrf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For frequency, we recommend every 24 hours.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Search MeiliSearch
&lt;/h1&gt;

&lt;p&gt;After the connection has been saved, Airbyte should start replicating the data immediately. When it completes, you should see the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_14yw0bd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vyrree2xstb7hprjtsul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_14yw0bd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vyrree2xstb7hprjtsul.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Replication can take several minutes depending on the size of your Slack instance. When the sync is done, you can sanity-check that everything is working by making a search request to MeiliSearch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s1"&gt;'http://localhost:7700/indexes/messages/search'&lt;/span&gt; &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{ "q": "&amp;lt;search-term&amp;gt;" }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, one of the messages I replicated contains the text "welcome to airbyte".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s1"&gt;'http://localhost:7700/indexes/messages/search'&lt;/span&gt; &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{ "q": "welcome to" }'&lt;/span&gt;
&lt;span class="c"&gt;# =&amp;gt; {"hits":[{"_ab_pk":"7ff9a858_6959_45e7_ad6b_16f9e0e91098","channel_id":"C01M2UUP87P","client_msg_id":"77022f01-3846-4b9d-a6d3-120a26b2c2ac","type":"message","text":"welcome to airbyte.","user":"U01AS8LGX41","ts":"2021-02-05T17:26:01.000000Z","team":"T01AB4DDR2N","blocks":[{"type":"rich_text"}],"file_ids":[],"thread_ts":"1612545961.000800"}],"offset":0,"limit":20,"nbHits":2,"exhaustiveNbHits":false,"processingTimeMs":21,"query":"test-72"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
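&lt;p&gt;If you'd rather query MeiliSearch from a script than with curl, you can build the same search request with Python's standard library. This is just a sketch against the local instance from step 1 and the "messages" index the sync creates; sending the request is left to urllib.&lt;/p&gt;

```python
import json
from urllib import request

MEILI_URL = "http://localhost:7700"  # the local instance from step 1


def build_search_request(index: str, query: str) -> request.Request:
    """Build a POST request for MeiliSearch's search endpoint."""
    body = json.dumps({"q": query}).encode("utf-8")
    return request.Request(
        f"{MEILI_URL}/indexes/{index}/search",
        data=body,
        headers={"Content-Type": "application/json"},
    )


req = build_search_request("messages", "welcome to")
print(req.get_full_url())  # http://localhost:7700/indexes/messages/search
print(req.data.decode())   # {"q": "welcome to"}
```

&lt;p&gt;Passing the request to urllib.request.urlopen returns the same JSON body as the curl call above, with the matching messages under "hits".&lt;/p&gt;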



&lt;h1&gt;
  
  
  4. Search via a UI
&lt;/h1&gt;

&lt;p&gt;Making curl requests to search your Slack history is a little clunky, so we have modified the example UI that MeiliSearch provides in &lt;a href="https://docs.meilisearch.com/learn/tutorials/getting_started.html#integrate-with-your-project"&gt;their docs&lt;/a&gt; to search through the Slack results.&lt;br&gt;
Download (or copy and paste) this &lt;a href="https://github.com/airbytehq/airbyte/blob/master/docs/tutorials/slack-history/index.html"&gt;html file&lt;/a&gt; to your workstation. Then, open it using a browser. You should now be able to write search terms in the search bar and get results instantly!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kT12htXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qavlp1zkkjqew3za1qf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kT12htXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qavlp1zkkjqew3za1qf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  5. "Productionizing" Saving Slack History
&lt;/h1&gt;

&lt;p&gt;You can find instructions for how to host Airbyte on various cloud platforms &lt;a href="https://docs.airbyte.io/deploying-airbyte"&gt;here&lt;/a&gt;.&lt;br&gt;
Documentation on how to host MeiliSearch on cloud platforms can be found &lt;a href="https://docs.meilisearch.com/create/how_to/running_production.html#a-quick-introduction"&gt;here&lt;/a&gt;.&lt;br&gt;
If you want to use the UI mentioned in the section above, we recommend statically hosting it on S3, GCS, or equivalent.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How We Leveraged Singer for Our MVP</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Mon, 30 Nov 2020 20:28:00 +0000</pubDate>
      <link>https://dev.to/airbytehq/how-we-leveraged-singer-for-our-mvp-4jbj</link>
      <guid>https://dev.to/airbytehq/how-we-leveraged-singer-for-our-mvp-4jbj</guid>
      <description>&lt;p&gt;One of the (many) hard things about doing a startup is figuring out what that MVP should be. You are trading off between presenting something that is “good” enough that it gets people excited to use (or invest in) you and getting something done fast. In this article, we explore how we wrestled with this trade-off. Specifically, we explore our decisions around how to use Singer to bootstrap our MVP. It is something we get tons of questions about, and it was hard for us to figure out ourselves!&lt;/p&gt;

&lt;p&gt;When we set out to create an MVP for our data integration project, we began with this prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an OSS data integration project that includes all of Singer’s major features. In addition, it should have a UI that can be used by non-technical users and has production-grade job scheduling and tracking. &lt;/li&gt;
&lt;li&gt;Do it in a month. &lt;/li&gt;
&lt;li&gt;Use Singer to bootstrap it. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We knew from the start that in the long run, we did not want Singer to be core to the working of our platform. In the short term, however, we wanted to be able to bootstrap our integration ecosystem off of Singer’s existing taps and targets. So should we make Singer part of our core platform in the beginning to bootstrap? And if so, at what cost?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d5FT7zrG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/faal0jbi5wceqxnjphqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d5FT7zrG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/faal0jbi5wceqxnjphqp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This picture shows the spectrum of options we considered, from wrapping a UI around Singer and relying entirely on it as our backend to shooting for our original goal of Singer as a peripheral.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Thin UI wrapper around Singer
&lt;/h1&gt;

&lt;p&gt;This felt like the “startup-y” option. We could throw Singer, a database, and a UI in a Docker container and have “something” up and running in, perhaps, days. We never tried this approach, though, because we could already see some really big trade-offs.&lt;/p&gt;

&lt;p&gt;Pros &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just a few days of work needed&lt;/li&gt;
&lt;li&gt;No new code for each integration, just use Singer’s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pretty much all throw-away code after the initial release.&lt;/li&gt;
&lt;li&gt;Because Singer taps / targets don’t declare their configurations (more on this later), there would be no way in the UI to tell the user what values they needed to provide in order to configure a source. We would only be able to accept a big JSON blob.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KKC9h1qR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pw534dcxj15g70vos207.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KKC9h1qR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pw534dcxj15g70vos207.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While we were going for an MVP, we did not think we would be able to get anyone interested in the first iteration. We also knew that subsequent iterations would be painful, since we would be effectively starting from scratch because the initial iteration was not a sturdy building block. We skipped this approach.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Airbyte integration configurations
&lt;/h1&gt;

&lt;p&gt;Given that we wanted to provide a UI experience that was accessible to non-data engineers, our next step was to figure out how we could make it easy to configure integrations in the UI. This meant we had to build our own configuration abstraction for integrations, because this is something that Singer does not provide (we go into more depth on this feature in the &lt;a href="https://airbyte.io/articles/data-engineering-thoughts/why-you-should-not-build-your-data-pipeline-on-top-of-singer/"&gt;first article&lt;/a&gt; in this series). &lt;/p&gt;

&lt;p&gt;This abstraction was basically a way for each integration to declare what information it needed in order to be configured. For example, a Postgres source might need a hostname, port, etc. This layer made it possible for the UI to display user-friendly forms for setting up integrations. With this approach, we could still rely on Singer as the “backend” for the platform, but we could provide a better configuration experience for the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--87zlSTzv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x07j6o1cdjy4mjuvq8fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--87zlSTzv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x07j6o1cdjy4mjuvq8fd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to implement this layer, we created a standardized way to declare information about an integration and how to configure it in a JsonSchema object. When someone selects an integration in the UI, it renders a form based on that JsonSchema. The user then provides the needed information, which is passed directly to the backend.&lt;/p&gt;
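&lt;p&gt;As a rough sketch of the idea (the field names and the airbyte_secret flag here are illustrative, not Airbyte's actual Postgres spec), an integration's declaration and the form derivation could look like this:&lt;/p&gt;

```python
# Hypothetical spec for a Postgres source. The field names are illustrative,
# not Airbyte's actual connector spec.
POSTGRES_SPEC = {
    "type": "object",
    "required": ["host", "port", "database", "username"],
    "properties": {
        "host": {"type": "string", "description": "Hostname of the database"},
        "port": {"type": "integer", "default": 5432},
        "database": {"type": "string"},
        "username": {"type": "string"},
        "password": {"type": "string", "airbyte_secret": True},
    },
}


def form_fields(spec: dict) -> list:
    """Derive the form fields a UI should render from a JsonSchema spec."""
    required = set(spec.get("required", []))
    return [
        (name, prop.get("type"), name in required)
        for name, prop in spec["properties"].items()
    ]


for name, field_type, is_required in form_fields(POSTGRES_SPEC):
    print(name, field_type, "required" if is_required else "optional")
```

&lt;p&gt;Because the declaration is plain JsonSchema, the backend can reuse it as-is to validate the config the UI submits.&lt;/p&gt;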

&lt;p&gt;This is ultimately where we started out. And everything was good for about a week…&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Dockerize Singer integrations
&lt;/h1&gt;

&lt;p&gt;Up until this point, the only thing we had to do per integration was write a JsonSchema object that declared the configuration inputs for an integration. But what if we want the form in the UI to display different fields than those that Singer taps / targets consume?&lt;/p&gt;

&lt;p&gt;The first case we ran into was in the Postgres Singer tap. That tap takes in a field called “filter_dbs”, which restricts which databases the tap scans when run in “discover” mode. The tap also takes in a field called ”database,” which is the name of the database from which data will be replicated. In our use case, we wanted “filter_dbs” to be populated with only a single entry: the value that the user had provided for “database.”&lt;/p&gt;

&lt;p&gt;In order to hide filter_dbs from the UI, but still populate it behind the scenes, we were going to need to write some special code that executed only when the Postgres Tap ran. But where was that code going to run? The abstraction we had was that our core platform just assumed that all integration-specific code was bundled in the Singer Tap. So we were either going to need to insert this integration-specific code into our core platform or restructure our abstraction so that we could run custom integration code that was not packaged as part of Singer.&lt;/p&gt;
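&lt;p&gt;The shim itself is tiny; the problem was where to run it. A minimal sketch, assuming filter_dbs accepts a single database name (the function name is ours, not Singer's):&lt;/p&gt;

```python
def shim_postgres_config(ui_config: dict) -> dict:
    """Hypothetical shim: derive the tap's config from the config collected
    by the UI, keeping filter_dbs out of the user-facing form."""
    tap_config = dict(ui_config)
    # Restrict discovery to the one database the user chose to replicate.
    tap_config["filter_dbs"] = ui_config["database"]
    return tap_config


print(shim_postgres_config({"host": "localhost", "database": "mydb"}))
```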

&lt;p&gt;Again, we already had a rough idea of what we wanted this to look like in the long term. We imagined each integration running entirely in its own Docker container. Airbyte would handle passing messages from the container running the source to the container running the destination. We had hoped we could get to MVP without it, but ultimately, when we hit this issue, it tipped us over the edge. So we traded some time to figure out how to package Singer taps and targets into Docker containers that made it easy for us to mediate all of the interactions between the core platform and the integration running in the container.&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Use the Airbyte protocol instead of the Singer protocol
&lt;/h1&gt;

&lt;p&gt;Now fast forward another couple weeks: we are on the night before we plan to do our first public launch, and nothing is working. We have 3 sources and 3 destinations, and not one of them can work with all of the others. &lt;/p&gt;

&lt;p&gt;The issue was two-fold: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We ran into &lt;a href="https://airbyte.io/articles/data-engineering-thoughts/why-you-should-not-build-your-data-pipeline-on-top-of-singer/"&gt;inconsistencies in the Singer protocol&lt;/a&gt; that made it hard to treat all Singer Taps and Targets the same way programmatically.&lt;/li&gt;
&lt;li&gt;In falling back on Singer to handle our “backend,” there were implementation details in the way Singer worked that were incompatible with the product we wanted to build.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We won’t spend a ton of time discussing these issues, because we’ve already written about them &lt;a href="https://airbyte.io/articles/data-engineering-thoughts/why-you-should-not-build-your-data-pipeline-on-top-of-singer/"&gt;here&lt;/a&gt;. So let’s just say we hit a point where we realized that we either needed to become the world’s foremost experts on the Singer protocol or focus on defining our own protocol. Since the latter already aligned with our long-term vision, we went in that direction. &lt;/p&gt;

&lt;p&gt;Ultimately, we tore out our hair and got through that night, and then for our next release we introduced our own &lt;a href="https://docs.airbyte.io/architecture/airbyte-specification"&gt;protocol&lt;/a&gt;. Even at our early stage, this was an expensive endeavour. It took one-ish engineers over a week to migrate us from the Singer protocol to our own (this felt like eons to us!).&lt;/p&gt;

&lt;h1&gt;
  
  
  Did we do it right?
&lt;/h1&gt;

&lt;p&gt;Obviously, this question is impossible to answer. After reading this article, you might have come to the conclusion that we should have built the first version of our product with Singer at the periphery of our system. And had we done that, we could have skipped the iteration of moving Singer from within our core system to the outskirts. I wouldn’t begrudge you that conclusion!&lt;/p&gt;

&lt;p&gt;Had we taken that approach, however, we would have delayed our initial release by an additional month (double the time to MVP!). Getting something out early was valuable, because it gave us early feedback that what we were building was interesting to people. We made trade-offs to move fast but still work from a base that we could iterate on quickly: pretty much the classic trade-off you think about when trying to launch an MVP. And, ultimately, we can’t draw any hard and fast rules other than to use your own judgment!&lt;/p&gt;

&lt;p&gt;The unexpected insight that we came away with, however, was that this approach allows us to learn a lot from Singer. Even having Singer be part of the core system for just a few weeks, we got a really good understanding of why they had solved certain issues the way they did. &lt;/p&gt;

&lt;p&gt;For example, when we first encountered the Singer Catalog, the use of a breadcrumb system to map metadata onto a schema felt unintuitive and needlessly complicated. The metadata and the schema were in the same parent object, so why did we need this complex system of having the metadata fields index into the schema? Couldn’t they be combined? After using it closely for a few weeks, we understood the complexities that come with configuring special behavior at a field level for deeply nested schemas. Had we gone our own way from the start, we would have learned this lesson much later (and the later we learned it, the harder it would have been to remedy). &lt;/p&gt;
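&lt;p&gt;To make the breadcrumb system concrete, here is a pared-down catalog stream in the shape Singer uses, with a helper (ours, for illustration) that resolves a breadcrumb to its metadata:&lt;/p&gt;

```python
# A pared-down Singer catalog stream: metadata rows point at schema nodes
# via "breadcrumb" paths instead of being inlined in the schema itself.
stream = {
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "address": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    },
    "metadata": [
        {"breadcrumb": [], "metadata": {"selected": True}},
        {"breadcrumb": ["properties", "address", "properties", "city"],
         "metadata": {"inclusion": "available"}},
    ],
}


def metadata_for(stream: dict, breadcrumb: list) -> dict:
    """Look up the metadata attached to a schema node by its breadcrumb."""
    for entry in stream["metadata"]:
        if entry["breadcrumb"] == breadcrumb:
            return entry["metadata"]
    return {}


print(metadata_for(stream, ["properties", "address", "properties", "city"]))
# {'inclusion': 'available'}
```

&lt;p&gt;The indirection pays off once schemas nest deeply: field-level behavior can be configured without rewriting the schema object itself.&lt;/p&gt;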

&lt;p&gt;Building on top of Singer in the beginning forced us into a &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence"&gt;Chesterton’s Fence&lt;/a&gt; situation. Each time we wanted to do something a certain way, because we thought Singer’s approach didn’t make sense, we were forced to fully understand why Singer had done things the way it did. By doing so, we avoided mistakes we would otherwise have made. We also were able to make decisions different from Singer’s while still benefiting from its experience. All in all, we feel we made the right choice. What do you think?&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why You Should NOT Build Your Data Pipeline on Top of Singer</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Mon, 30 Nov 2020 20:26:59 +0000</pubDate>
      <link>https://dev.to/airbytehq/why-you-should-not-build-your-data-pipeline-on-top-of-singer-3ei9</link>
      <guid>https://dev.to/airbytehq/why-you-should-not-build-your-data-pipeline-on-top-of-singer-3ei9</guid>
      <description>&lt;p&gt;&lt;a href="//singer.io"&gt;Singer.io&lt;/a&gt; is an open-source CLI tool that makes it easy to pipe data from one tool to another. At &lt;a href="https://airbyte.io"&gt;Airbyte&lt;/a&gt;, we spent time determining if we could leverage Singer to programmatically send data from any of their supported data sources (taps) to any of their supported data destinations (targets).&lt;/p&gt;

&lt;p&gt;For the sake of this article, let’s say we are trying to build a tool that can do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run any Singer tap or target &lt;/li&gt;
&lt;li&gt;Provide a UI for configuring and running those taps and targets&lt;/li&gt;
&lt;li&gt;Count the number of records synced in each run&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the context of these goals, being able to use Singer programmatically means writing a program that can, for any integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provide a UI with instructions on what information a user needs to input in order to configure that integration (e.g., host, password, etc).&lt;/li&gt;
&lt;li&gt;take those user-provided values and execute each integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We know that the described requirements are not the use case that Singer sets out to solve, but nonetheless, we wanted to see if we could leverage Singer to bootstrap building out this case. Sure enough, we ran into some “gotchas” along the way. These gotchas illustrate some of the core primitives that a programmatic data integration tool requires.&lt;/p&gt;

&lt;h1&gt;
  
  
  Integrations do not declare their configurations
&lt;/h1&gt;

&lt;p&gt;The Singer protocol does not &lt;a href="https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md#config"&gt;specify how an integration should define&lt;/a&gt; what inputs it requires. This means that, in order to use most Singer taps, you need to scour the entire implementation to figure out what properties it uses; depending on the complexity of the integration, this can be pretty painful.&lt;/p&gt;

&lt;p&gt;Some integrations help out by specifying what the configuration should look like in a &lt;a href="https://github.com/singer-io/tap-stripe"&gt;readme&lt;/a&gt; or in a &lt;a href="https://github.com/singer-io/tap-hubspot/blob/master/config.sample.json"&gt;sample config&lt;/a&gt;. Even these lead to headaches. They often just list the fields that need to be passed in but do not explain what they mean, what their format is, or how to find them (good luck trying to find all the information you need to configure your Google Ads integration!). In other cases, they only list a subset, and then you have to discover the rest by reading the integration (e.g., &lt;a href="https://github.com/singer-io/tap-salesforce"&gt;tap-salesforce&lt;/a&gt; doesn’t mention is_sandbox in the docs. UPDATE: someone has since added this field to the readme in this &lt;a href="https://github.com/singer-io/tap-salesforce/commit/d21bbea93471c485c4adddfdfb9ffb3e157cc45e"&gt;PR&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;These taps are great; we have happily used all of them. But because they do not specify what is required to configure them, they can’t be used programmatically. Specifically, our program needs to know that the Postgres tap requires the fields hostname and port. Without this specification, the program cannot figure out how to build a valid configuration for an integration. This is expensive to shim, because it requires engineering work for every single integration!&lt;/p&gt;

&lt;h1&gt;
  
  
  No way to tell which Singer feature is compatible with which integration
&lt;/h1&gt;

&lt;p&gt;Singer has excellent &lt;a href="https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md#singer-specification"&gt;documentation&lt;/a&gt; around its core protocol. It also does a nice job defining the suite of special metadata that it supports. When you start actually using Singer, however, mapping these primitives onto your integrations is difficult. For example, “replication-method” sets whether all the data from the source should be replicated (“full_table”) or just the new or updated data (“incremental”). What is unclear is which taps actually support “incremental” or “full_table” or both. &lt;/p&gt;

&lt;p&gt;Taps do not advertise, in a way that is programmatically consumable, which of these replication methods they support. Some of them mention it in their documentation, but ultimately that’s insufficient for the type of tool we want to build. So what happens when you request “incremental” from a source that only supports “full_table”? The behavior is undefined. Some taps will throw an error, some will just do a full refresh. Either way, from the point of view of the UI-based tool that we are trying to build, this isn’t really usable.&lt;/p&gt;

&lt;p&gt;The problem only gets hairier for some of the more niche metadata as well (e.g., “view-key-properties”). You either need to read the source or just try it out and see if the configuration works. This problem is adjacent to the configuration problem described in the previous section, and, similarly, requires a shim for every integration.&lt;/p&gt;
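&lt;p&gt;The defensive check we wanted to write would look something like this sketch; note that it only helps when a tap actually declared its replication methods in its metadata, which many don't:&lt;/p&gt;

```python
def declared_replication_methods(stream_metadata: list):
    """Return the set of replication methods a tap declared for a stream,
    or None when the tap declared nothing (the undefined case above)."""
    top_level = next(
        (m["metadata"] for m in stream_metadata if m["breadcrumb"] == []),
        {},
    )
    forced = top_level.get("forced-replication-method")
    if forced:
        return {forced}
    declared = top_level.get("replication-method")
    return {declared} if declared else None


# A tap that only supports full refreshes and says so in its metadata:
meta = [{"breadcrumb": [], "metadata": {"forced-replication-method": "FULL_TABLE"}}]
print(declared_replication_methods(meta))  # {'FULL_TABLE'}

# A tap that declared nothing: requesting "incremental" is a gamble.
print(declared_replication_methods([]))   # None
```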

&lt;h1&gt;
  
  
  Singer’s own secret menu
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PjdV1oVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/l1hnjztiiwz9qadrsfus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PjdV1oVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/l1hnjztiiwz9qadrsfus.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re from the West coast, you might be familiar with how In-N-Out Burger &lt;a href="https://www.eater.com/2015/4/13/8382523/secret-menus-in-n-out-fast-food-burger-animal-style"&gt;popularized the “secret” menu in fast food chains&lt;/a&gt;. While charming at a drive thru, secret menus can ruin your data integration.&lt;/p&gt;

&lt;p&gt;The Singer protocol has some of its own secret menu items. For example, we were parsing each message that a tap emitted into JSON, validating it against the message schema declared in the Singer docs. We wanted to understand exactly what messages were being sent between taps and targets, so we would fail loudly if anything was sent that did not match the documented message types. Then we started getting errors on “ActivateVersionMessage.” After spelunking in the source code for a bit, we found that this message type has existed in Singer as an experimental feature since 2017. A handful of the official Singer taps use it, but there’s no guidance on what you’re supposed to do with it (I suspect it is a feature used internally at Stitch, the paid, managed solution from the creators of Singer). If you’re building something programmatic on top of Singer, your choice is to filter it out or let it pass and hope that stuff…just works, I guess?&lt;/p&gt;
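&lt;p&gt;Our "fail loudly" parsing amounted to something like this sketch (the documented message types are RECORD, SCHEMA, and STATE; the experimental message shows up with type ACTIVATE_VERSION):&lt;/p&gt;

```python
import json

DOCUMENTED_TYPES = {"RECORD", "SCHEMA", "STATE"}


def parse_tap_line(line: str) -> dict:
    """Parse one line of tap output, failing loudly on undocumented types."""
    message = json.loads(line)
    if message.get("type") not in DOCUMENTED_TYPES:
        raise ValueError(
            f"undocumented Singer message type: {message.get('type')}"
        )
    return message


parse_tap_line('{"type": "RECORD", "stream": "users", "record": {"id": 1}}')
try:
    parse_tap_line('{"type": "ACTIVATE_VERSION", "stream": "users", "version": 1}')
except ValueError as err:
    print(err)  # undocumented Singer message type: ACTIVATE_VERSION
```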

&lt;p&gt;Handling this one case is not the end of the world, but it leaves you feeling uncertain what else is lurking in the protocol that might not play well with your system.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;So to answer our original question: can we reasonably stretch Singer to meet our product requirements? The answer is no. Doing so would require writing custom shims for every single Singer tap and target. Since the goal with data integrations is always to scale to more integrations, having to do any work per integration is very expensive.&lt;/p&gt;

&lt;p&gt;The Singer protocol is underspecified for this use case. That makes sense, because this is not the use case the protocol set out to solve. Achieving these requirements depends on integrations declaring much more information about how they are configured and which features they support. We are tackling this problem at Airbyte, so if you are looking for an OSS solution that makes it easy to move your data into a warehouse, instead of trying to roll your own on top of Singer, come check us out!&lt;/p&gt;

&lt;p&gt;This article is meant to be the first in a pair of articles. The &lt;a href="https://airbyte.io/articles/data-engineering-thoughts/how-we-leveraged-singer-for-our-mvp/"&gt;second&lt;/a&gt; will explore the engineering journey that we took to figure out where Singer should fit into our system.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
