<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kern AI</title>
    <description>The latest articles on DEV Community by Kern AI (@meetkern).</description>
    <link>https://dev.to/meetkern</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5730%2Fb60d7fd3-8758-4d92-9da2-491b83ae3cd6.png</url>
      <title>DEV Community: Kern AI</title>
      <link>https://dev.to/meetkern</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/meetkern"/>
    <language>en</language>
    <item>
      <title>Twitter Issues are a mess!!</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Sat, 01 Apr 2023 12:50:49 +0000</pubDate>
      <link>https://dev.to/meetkern/twitter-issues-are-a-mess-37ea</link>
      <guid>https://dev.to/meetkern/twitter-issues-are-a-mess-37ea</guid>
      <description>&lt;p&gt;Ok, you all most likely heard it. Twitter went open-source. That's amazing. Curious as I am, I wanted to dive into their &lt;a href="https://github.com/twitter/the-algorithm" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When looking into their issues list, I was laughing out loud. Check this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffauqrbxiavcojqyyp27n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffauqrbxiavcojqyyp27n.png" alt="Funny issues"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub users are making fun of the whole release, turning the issues list into a jokes section.&lt;/p&gt;

&lt;p&gt;As an engineer on Twitter's dev team, however, I would be really annoyed. Differentiating between issues from trolls and non-trolls is now a new todo on their list. So let's try to help them. I'm going to show a first, very simple version of a classifier for identifying troll issues in the Twitter repo. Of course, I'm sharing the work on GitHub as well. Here's the &lt;a href="https://github.com/code-kern-ai/twitter-issues-classifier" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting the data
&lt;/h2&gt;

&lt;p&gt;I've scraped the issues with a simple Python script, which I've also shared in the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;PAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add-your-PAT-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# see https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
&lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the-algorithm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/repos/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;all_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to retrieve issues (status code &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="n"&gt;issues_reduced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;issue_reduced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_laugh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laugh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_hooray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hooray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_confused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_heart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_rocket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rocket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_eyes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eyes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;issues_reduced&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issue_reduced&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter-issues.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues_reduced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; issues and saved to twitter-issues.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, these days I didn't write the code for this myself. ChatGPT did, but you probably already guessed that.&lt;/p&gt;

&lt;p&gt;I decided to reduce the downloaded data a bit, because much of the content didn't seem relevant to me. Instead, I wanted just the URL of the issue, the title and body, and some potentially interesting metadata in the form of the reactions.&lt;/p&gt;

&lt;p&gt;An example of this looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adding Documentation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"html_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/twitter/the-algorithm/pull/838"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_laugh"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_hooray"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_confused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_heart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_rocket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_eyes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the classifier
&lt;/h2&gt;

&lt;p&gt;With the data downloaded, I started &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;refinery&lt;/a&gt; on my local machine. With refinery, I can label a bit of data and build some heuristics to quickly test whether my idea works. It's open source under Apache 2.0, so you can just grab it and follow along.&lt;/p&gt;

&lt;p&gt;Simply upload the &lt;code&gt;twitter-issues.json&lt;/code&gt; file we just created:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qo16d8ceqz608tjrw77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qo16d8ceqz608tjrw77.png" alt="Upload data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; attributes, I added two &lt;code&gt;distilbert-base-uncased&lt;/code&gt; embeddings directly from Hugging Face.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a1dgvv8qms3p9ur35xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a1dgvv8qms3p9ur35xo.png" alt="Project settings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, I set up three labeling tasks, of which for now only the &lt;code&gt;Seriousness&lt;/code&gt; task is relevant.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnox30jrtt0hjxwbswr48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnox30jrtt0hjxwbswr48.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diving into the data, I labeled a few examples to see what the data looks like and to get some reference labels for the automations I want to build.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jd82pdkx36t0fma4kk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jd82pdkx36t0fma4kk3.png" alt="Labeling data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I realized that quite often, people are searching for jobs in issues. So I started building my first heuristic for this, in which I use a lookup list that I created to search for occurrences of job terms. I'm later going to combine this via &lt;a href="https://www.youtube.com/watch?v=8TusRTqp9uQ&amp;amp;ab_channel=KernAI" rel="noopener noreferrer"&gt;weak supervision&lt;/a&gt; with other heuristics to power my classifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ckwz2bnmvufgpc3tffd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ckwz2bnmvufgpc3tffd.png" alt="Job search heuristic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For reference, this is what the lookup list looks like. Terms are automatically added while labeling spans (which is also why I had three labeling tasks: one for classification and two for span labeling), but I could also have uploaded a CSV file of terms.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujx1p9nl1tmxla6t87it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujx1p9nl1tmxla6t87it.png" alt="Lookup list job terms"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I had already labeled a bit of data, I created a few active learners:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33rx85uw1tbmha1tpv62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33rx85uw1tbmha1tpv62.png" alt="Active Learner"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With weak supervision, I can easily combine this active learner with my previous job search classifier without having to worry about conflicts, overlaps and the like.&lt;/p&gt;
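&lt;p&gt;Under the hood, weak supervision resolves such conflicts by weighing the heuristics' votes against each other. A strongly simplified sketch of that merge step (refinery's actual algorithm also factors in each heuristic's measured precision and coverage):&lt;/p&gt;

```python
def weak_supervision(votes):
    """Merge (label, confidence) votes; a None label means the heuristic abstained."""
    scores = {}
    for label, confidence in votes:
        if label is not None:
            scores[label] = scores.get(label, 0.0) + confidence
    # no heuristic fired -> no weakly supervised label for this record
    return max(scores, key=scores.get) if scores else None
```

&lt;p&gt;For example, &lt;code&gt;weak_supervision([("troll", 0.9), ("serious", 0.4), (None, 0.0)])&lt;/code&gt; resolves the conflict in favor of "troll".&lt;/p&gt;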

&lt;p&gt;I also noticed a couple of issues containing just a link to play chess online:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3fqgvwamhgzs0n78k4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3fqgvwamhgzs0n78k4v.png" alt="Play chess"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I added a heuristic for detecting links via spaCy.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffrrs2zfjqli9j90o1wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffrrs2zfjqli9j90o1wf.png" alt="Title is link"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, I also wanted to create a GPT-based classifier, since this is publicly available data. However, GPT seems to be down while I'm initially building this :(&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfu1tuoe68be0rbj4p5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfu1tuoe68be0rbj4p5t.png" alt="GPT-down"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After about 20 minutes of labeling and working with the data, this is what my heuristics tab looked like:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" alt="All heuristics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So there are mainly active learners, some lookup lists and regular-expression-like heuristics. I will add GPT in the comments section as soon as I can access it again :)&lt;/p&gt;

&lt;p&gt;Now, I weakly supervised the results:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktxp9dfxreq3zl9yf7vb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktxp9dfxreq3zl9yf7vb.png" alt="Distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the automation already nicely fits the distribution of trolls vs. non-trolls.&lt;/p&gt;

&lt;p&gt;I also noticed a strong difference in confidence:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr214w76egjq997uumjun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr214w76egjq997uumjun.png" alt="Confidence distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I headed over to the data browser and set a confidence filter so that I only see the records with above 80% confidence.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24qgjr6ud74zlejqabd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24qgjr6ud74zlejqabd9.png" alt="Data browser"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that in here, we could also filter by single heuristic hits, e.g. to find records where different heuristics vote for different labels:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpkcvwqi362oucgzhvz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpkcvwqi362oucgzhvz2.png" alt="Heuristics filtering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the dashboard, I now filter for the high-confidence records and see that our classifier is performing quite well already (note, this isn't even using GPT yet!):&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ohc6kwjjmuchm3qyoq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ohc6kwjjmuchm3qyoq0.png" alt="Confusion matrix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;I exported the project snapshot and labeled examples into the &lt;a href="https://github.com/code-kern-ai/twitter-issues-classifier" rel="noopener noreferrer"&gt;public repository&lt;/a&gt; (&lt;code&gt;twitter_default_all.json.zip&lt;/code&gt;), so you can play with the bit of labeled data yourself. I'll continue on this topic over the next few days, and we'll add a YouTube video to this article for a version 2 of the classifier. There are certainly further attributes we can look into, such as taking the length of the body into account (I already saw that shorter bodies are typically troll-like).&lt;/p&gt;
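&lt;p&gt;A body-length heuristic for version 2 could be as simple as the following sketch (the 30-character cutoff is an arbitrary assumption that would need tuning against the labeled data):&lt;/p&gt;

```python
def short_body(record):
    """Vote 'troll' for empty or very short issue bodies, else abstain."""
    body = (record.get("body") or "").strip()
    if len(body) < 30:
        return "troll"
    return None
```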

&lt;p&gt;Also, keep in mind that this is an excellent way to benchmark how much power GPT can add for your use case. Simply add it as a heuristic, try a few different prompts, and play with including or excluding it in the weak supervision procedure. For instance, here, I excluded GPT:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" alt="All heuristics"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I'm really thrilled about Twitter open-sourcing their algorithm, and I'm sure it will add a lot of benefits. What you can already tell is that, due to the nature of Twitter's community, issues are often written by trolls. So detecting them will be important for Twitter's dev team. Maybe this post can help with that :)&lt;/p&gt;

</description>
      <category>twitter</category>
      <category>github</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Alleviate the pain of manual labeling and deploying models with weak supervision</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 14 Mar 2023 11:24:10 +0000</pubDate>
      <link>https://dev.to/meetkern/alleviate-the-pain-of-manual-labeling-and-deploying-models-with-weak-supervision-3051</link>
      <guid>https://dev.to/meetkern/alleviate-the-pain-of-manual-labeling-and-deploying-models-with-weak-supervision-3051</guid>
      <description>&lt;p&gt;Natural Language Processing (NLP) has seen significant improvements in recent years due to the advent of neural network models that have achieved state-of-the-art performance on a range of tasks like sentiment analysis, named entity recognition, and machine translation. However, these models require vast amounts of labeled data, which can be prohibitively expensive and time-consuming to obtain. With &lt;a href="https://www.kern.ai/"&gt;refinery and gates&lt;/a&gt;, we enable users to leverage the power of weak supervision in their production environment to mitigate the need for expensive manual labeling and costly model deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use weak supervision?
&lt;/h2&gt;

&lt;p&gt;Weak supervision is a promising alternative to traditional supervised learning that can help alleviate the need for vast amounts of labeled data. Weak supervision uses heuristics or rules to create noisy labels for the data. Noisy means that the labels are largely correct, but can contain more errors than labels obtained from manual labeling. These noisy labels can then be used to train a model that can generalize to new, unseen data. Alternatively, we can optimize the weak supervision itself and use it directly in production! This approach is especially useful when there is no large labeled dataset available for a particular task or domain.&lt;/p&gt;

&lt;p&gt;In the context of NLP, weak supervision can be used to annotate text data, such as social media posts or online reviews, with labels that are less precise than those obtained through manual annotation. For example, a weak supervision approach to sentiment analysis might use a set of rules in the form of programmatic labeling functions to label all tweets containing positive emojis as "positive," and all tweets that use negative emojis as "negative." These noisy labels can then be used to train a model that can classify new tweets into positive or negative categories.&lt;/p&gt;
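&lt;p&gt;A minimal sketch of such an emoji rule (the emoji sets, the label names, and the &lt;code&gt;record&lt;/code&gt; structure here are made-up assumptions for illustration; real labeling functions would be tuned to your data):&lt;/p&gt;

```python
# Toy labeling function for the emoji rule described above.
# Emoji sets and label names are illustrative assumptions.
POSITIVE_EMOJIS = {"😀", "😊", "🎉", "👍"}
NEGATIVE_EMOJIS = {"😠", "😡", "👎"}

def emoji_sentiment(record: dict):
    """Return a noisy label, or None to abstain."""
    text = record["text"]
    if any(e in text for e in POSITIVE_EMOJIS):
        return "positive"
    if any(e in text for e in NEGATIVE_EMOJIS):
        return "negative"
    return None  # no emoji evidence: abstain
```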

&lt;p&gt;Despite the challenges, weak supervision has gained popularity in recent years due to its ability to quickly and easily generate labeled data. It can also be used in combination with traditional supervised learning to improve model performance further. For instance, a weak supervision approach can be used to generate initial labels for a dataset, which can then be refined by human annotators in a process called "bootstrapping." This iterative process of weak supervision and human annotation can help reduce the amount of manual labeling required and improve the quality of the final dataset.&lt;/p&gt;

&lt;p&gt;Another advantage of weak supervision is that it can be used to label rare or unusual events that are difficult to label using traditional supervised learning approaches. For example, it can be challenging to obtain labeled data for rare diseases or rare events in social media. However, a weak supervision approach can use domain-specific knowledge to generate labels for these rare events, enabling models to learn from limited data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to deal with noisy labels
&lt;/h2&gt;

&lt;p&gt;While weak supervision can be a powerful tool for training models with limited labeled data, it comes with its own set of challenges. Cheap labels are often noisy. Noisy labels can introduce errors and biases into the training data, and it can be difficult to quantify the quality of the labels generated by the weak supervision process. However, there are reliable methods that allow us to easily spot errors in labels quickly.&lt;/p&gt;

&lt;p&gt;One of those methods is called confident learning. Confident learning is a technique used to identify and correct errors in noisy labels, so this is perfect for weakly supervised data! The approach involves training a model on the noisy labels generated through weak supervision and then using the model's predictions to estimate the confidence of each label. Labels with low confidence can then be flagged as potentially incorrect and subjected to further scrutiny or correction. This method can help improve the quality of the final dataset and reduce the impact of errors introduced by weak supervision without the need to check all the labels manually.&lt;/p&gt;
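&lt;p&gt;A simplified sketch of this idea (not a full confident-learning implementation such as the cleanlab library): score every record with out-of-fold predicted probabilities and flag those whose assigned noisy label receives low confidence. The classifier choice and threshold below are assumptions for illustration.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, noisy_labels, threshold=0.3):
    """Return indices of records whose noisy label looks unreliable."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities: each record is scored by a model that
    # never saw its (possibly wrong) label during training.
    proba = cross_val_predict(model, X, noisy_labels, cv=3, method="predict_proba")
    # Column order of predict_proba follows the sorted unique labels.
    classes = sorted(set(noisy_labels))
    label_idx = np.array([classes.index(y) for y in noisy_labels])
    confidence = proba[np.arange(len(noisy_labels)), label_idx]
    return np.where(confidence < threshold)[0]  # candidates for manual review
```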

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V1qOfu0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tk08k1jxlqpzl9vz87vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V1qOfu0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tk08k1jxlqpzl9vz87vz.png" alt="Image description" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Heuristics - sources for cheap labels
&lt;/h2&gt;

&lt;p&gt;The classic approach to getting labels would of course be to manually annotate the whole dataset. But that would be time-consuming, tedious, slow, and very expensive. Instead, we can label only part of our data by hand and find cheaper methods to obtain labels for the rest. These cheap label sources will most likely be noisy. But if we can get labels from many different sources, then the final, weakly supervised label will be greater than the sum of its parts. In other words: it’s going to be accurate. In refinery, we call these sources heuristics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Labeling functions
&lt;/h3&gt;

&lt;p&gt;Labeling functions allow us to programmatically incorporate domain knowledge into the labeling process, similar to the expert systems that are still in place in many companies. We can do this with programming languages like Python. A domain expert has a mental model of how and why they would label things in a certain way. In the field of natural language processing, this could be certain words, the structure of sentences, or the author of a text. All of these can be expressed programmatically. Let’s imagine that we want to build a classifier to detect clickbait. We would quickly notice that a lot of clickbait starts with a number and that clickbait often addresses the reader directly. We could incorporate this with just a few lines of code: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2MwZ4ZBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2n2vym0akx2iomuweewy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2MwZ4ZBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2n2vym0akx2iomuweewy.png" alt="Image description" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These labeling functions won’t be perfect, but as long as they are better than guessing, we’ve already gained something. In the iterative process of getting labels, these functions can also be improved and debugged later on as we gain more insights into our data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Active learner
&lt;/h3&gt;

&lt;p&gt;Active learning is a technique used to select the most informative examples from a large unlabeled dataset to be labeled by a human annotator. The goal is to select examples that will provide the most value in improving the model's performance while minimizing the number of examples that need to be labeled. This approach can be especially useful when labeled data is scarce or expensive to obtain. It involves iteratively training a model on a small subset of labeled data, selecting the most informative examples to label, and retraining the model on the expanded labeled dataset. This process can be repeated until the model's performance reaches a desired level or the budget for labeling is exhausted.&lt;/p&gt;

&lt;p&gt;This is especially powerful if we use it in combination with pre-trained transformer models to embed our text data. These models handle all of the heavy lifting during the active learning part, so we can quickly get accurate results even if we don’t have tons of data available. &lt;/p&gt;
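&lt;p&gt;A minimal sketch of one uncertainty-sampling iteration (the classifier and the feature matrix here stand in for the embedding-based active learners; all names are illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_uncertain(model, X_unlabeled, k=5):
    """Pick the k records the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)
    top_proba = proba.max(axis=1)  # low top probability = high uncertainty
    return np.argsort(top_proba)[:k]

def active_learning_step(X_labeled, y_labeled, X_unlabeled, k=5):
    """Fit on the labeled pool, then pick which rows to label next."""
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    return model, query_most_uncertain(model, X_unlabeled, k)
```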

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h69SzD0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wlywm6og57hpe3vsh1co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h69SzD0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wlywm6og57hpe3vsh1co.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero-shot classification
&lt;/h3&gt;

&lt;p&gt;Zero-shot classification allows us to label data without having to explicitly train a model on that particular task or domain. Instead, we can use a pre-trained language model, such as BERT or GPT-3, that has been trained on a large corpus of text to generate labels for new, unseen data.&lt;/p&gt;

&lt;p&gt;To use zero-shot classification, we need to provide the language model with a set of labels or categories that we want to use for classification. The language model can then generate a score or probability for each label, indicating how likely the input text belongs to that label. This approach can be especially useful when we have a small labeled dataset or no labeled data at all for a particular task or domain.&lt;/p&gt;
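&lt;p&gt;Schematically, the scoring step looks like the sketch below. In a real setup the scores would come from a pre-trained model (for example a natural language inference model behind a zero-shot pipeline); here &lt;code&gt;entailment_score&lt;/code&gt; is a deliberately crude keyword stand-in so the flow is visible without any model download.&lt;/p&gt;

```python
import math

# Stand-in scorer: a real zero-shot setup would ask a pre-trained NLI model
# how strongly the text entails "This text is about <label>".
HINTS = {
    "politics": ["election", "senate", "minister"],
    "sports": ["match", "goal", "season"],
    "entertainment": ["film", "album", "celebrity"],
}

def entailment_score(text, label):
    return float(sum(word in text.lower() for word in HINTS.get(label, [])))

def zero_shot(text, labels):
    """Turn per-label scores into a probability distribution (softmax)."""
    exps = [math.exp(entailment_score(text, label)) for label in labels]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}
```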

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dIZE9-AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lf7scinh9d62ylzy1mii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dIZE9-AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lf7scinh9d62ylzy1mii.png" alt="Image description" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, let's say we want to classify news articles into different topics, such as politics, sports, and entertainment. We can use a pre-trained language model to generate probabilities for each label. Crucially, we don't provide any training examples for the categories: we only give the model the label names, and it relies on the patterns and features it learned during pre-training. Then, when we give the language model a new, unseen news article, it can generate probabilities for each label, indicating how likely the article belongs to each category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining heuristics for weak supervision in refinery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---WoXRbxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f06q70xik0aezjtlttbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---WoXRbxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f06q70xik0aezjtlttbw.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve now learned why weak supervision is so powerful. Let’s take a closer look at how we use all the possible heuristics for weak supervision in refinery. &lt;/p&gt;

&lt;p&gt;In refinery, labels from different sources are combined into a single, weakly supervised label. The label sources include manual labels, labels from labeling functions, active learner models, as well as zero-shot labels. Let’s take a closer look at an example project, which contains news headlines. &lt;/p&gt;

&lt;h3&gt;
  
  
  A closer look under the hood
&lt;/h3&gt;

&lt;p&gt;Here are the steps for obtaining the weakly supervised label. We build a DataFrame out of the source vectors of our heuristics, which contain the label &amp;lt;&amp;gt; record mappings as well as the confidence values for all of our label sources. Afterward, the predictions are calculated for all of the label sources by multiplying the precision with the confidence of each label source. These values then all get added to an ensemble voting system, which integrates all the relevant data from the noisy label matrix into the singular weakly supervised label.&lt;/p&gt;

&lt;p&gt;The ensemble voting system works by retrieving the confidence values from all the label sources, aggregating them, and selecting the label with the highest aggregated confidence. The final confidence score is then calculated by subtracting the combined weight of the competing votes from the winning label's confidence score and passing the result through a sigmoid function. The resulting weakly supervised label and the final confidence value are then added to our record in refinery.&lt;/p&gt;
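&lt;p&gt;A deliberately compressed sketch of this voting scheme (not refinery's actual code; the vote weights are assumed to already be the precision-times-confidence values described above):&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weak_supervision_vote(votes):
    """votes: (label, weight) pairs, weight = precision * confidence.

    Returns the winning label and a squashed confidence score:
    sigmoid(winning total minus the combined weight of competing votes).
    """
    totals = {}
    for label, weight in votes:
        totals[label] = totals.get(label, 0.0) + weight
    best_label = max(totals, key=totals.get)
    rest = sum(totals.values()) - totals[best_label]
    return best_label, sigmoid(totals[best_label] - rest)
```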

&lt;h2&gt;
  
  
  Using weak supervision in production with gates
&lt;/h2&gt;

&lt;p&gt;The great thing about weak supervision is that we can use it directly in production as well. We created custom labeling functions, already have machine learning models in our active learners, and include state-of-the-art transformer models to obtain labels. With weak supervision, we set up a powerful environment to label data efficiently. If this approach works well on unlabeled data in our dataset, chances are that it also works well with new incoming data in a production environment. Even better: we can enrich our project with new incoming data and react to changes quickly. We can monitor the quality of our weak supervision system via the confidence scores and create or change labeling functions or label some new data manually when needed. This ensures high accuracy but also low cost with regard to maintaining AI in a production environment, and removes the stress of monitoring and re-deploying machine learning models regularly. &lt;/p&gt;

&lt;p&gt;That’s why we created gates, which allows you to use information from a weak supervision environment anywhere through an API. Our data-centric IDE for NLP allows you to use all the powerful weak supervision techniques to quickly label data. And through gates, you can access all your heuristics and active learners with a snap of your fingers. No need to put models into production; you can access everything right away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bk_fsJmJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfm7f3ywrv0mjyhm82zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bk_fsJmJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfm7f3ywrv0mjyhm82zd.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, weak supervision is an amazing technique for NLP that can help alleviate the need for vast amounts of labeled data. It can be used to label data quickly and easily, especially for rare or unusual events, and can be combined with traditional supervised learning to improve model performance further. While it comes with its own set of challenges, recent research has shown that combining weak supervision with other techniques can help mitigate these challenges and improve model performance.&lt;/p&gt;

</description>
      <category>weaksupervision</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How we used AI to automate stock sentiment classification</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 21 Feb 2023 14:40:23 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-used-ai-to-automate-stock-sentiment-classification-45ph</link>
      <guid>https://dev.to/meetkern/how-we-used-ai-to-automate-stock-sentiment-classification-45ph</guid>
      <description>&lt;p&gt;This article is meant to accompany this video: &lt;a href="https://www.youtube.com/watch?v=yeML0vX0yLw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=yeML0vX0yLw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we would like to provide you with a step-by-step tutorial in which we build a Slack bot that sends us a daily message on the sentiment of the news about our stocks. To do this, we need a tool that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatically fetch data from news sources&lt;/li&gt;
&lt;li&gt;call our ML model's API to retrieve predictions&lt;/li&gt;
&lt;li&gt;send out our enriched data to Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will build the web scraper in Kern AI workflow, label our news articles in refinery, and then enrich the data with gates. After that, we will use workflow again to send out the predictions and the enriched data via a webhook to Slack. If you'd like to follow along or explore these tools on your own, you can join our waitlist here:  &lt;a href="https://www.kern.ai/" rel="noopener noreferrer"&gt;https://www.kern.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's dive into the project! &lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping data in workflow
&lt;/h2&gt;

&lt;p&gt;To get started, we first need data. Sure, there are some publicly available datasets for stock news. But we are interested in building a sentiment classifier for specific companies only, and ideally we want recent news articles rather than old, irrelevant ones. &lt;/p&gt;

&lt;p&gt;We start our project in workflow. Here we can add a Python node, with which we can execute custom Python code. In our case, we use it to scrape some news articles. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk1erikq1nivpm70bgeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk1erikq1nivpm70bgeh.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to access news articles. We decided to use the Bing News API because it offers up to 1000 free searches per month and is fairly reliable. But of course, you can do this part however you like! &lt;/p&gt;

&lt;p&gt;To do this, we use a &lt;code&gt;Python yield&lt;/code&gt; node, which takes in one input (the scraping results) but can return multiple outputs (in this case, one record per found article):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

    &lt;span class="n"&gt;search_term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# You can make this a list and iterate over it so search multiple companies! 
&lt;/span&gt;
    &lt;span class="n"&gt;subscription_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;YOUR_AZURE_COGNITIVE_KEY&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.bing.microsoft.com/v7.0/news/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ocp-Apim-Subscription-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subscription_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textDecorations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textFormat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mkt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;part_of_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part_of_response&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Scraper the collected urls
&lt;/span&gt;    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; 
            &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scraped_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;scraped_text_joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraped_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraped_text_joined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
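&lt;p&gt;The final loop above turns the column-oriented record dictionary (one list per field) into one dictionary per article. The same transpose can also be written with zip instead of positional indexing; here is a standalone sketch with made-up field names:&lt;/p&gt;

```python
def rows(record: dict):
    # Transpose a dict of equal-length lists ("columns") into one
    # dict per entry ("rows"), pairing each value with its field name.
    keys = list(record)
    for values in zip(*record.values()):
        yield dict(zip(keys, values))

# Illustrative data only - the real record holds the scraped fields.
record = {
    "name": ["Article A", "Article B"],
    "url": ["https://example.com/a", "https://example.com/b"],
}

print(list(rows(record)))
```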



&lt;p&gt;After that, we can store our data in a shared store. There are two store nodes: a "Shared Store send" node, which writes data into a store, and a "Shared Store read" node, which lets you access stored data and feed it into other nodes. &lt;/p&gt;

&lt;p&gt;We can create a Shared Store in the store section of Workflow. In the store section, you'll also find many other cool stores, such as spreadsheets or LLMs from OpenAI!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zennx52vtawd4if285j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zennx52vtawd4if285j.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply click on "add store" and give it a fitting name. Afterward, you'll be able to add the created store in a node in workflow. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcx0od1s0fhuddwxcz0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcx0od1s0fhuddwxcz0i.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F825kew9d3e3fuc88cdi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F825kew9d3e3fuc88cdi4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we've scraped some data, we can move on to label and process it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Enriching new incoming data with gates
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dvhbm3hs1uy898c71ka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dvhbm3hs1uy898c71ka.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we've run our web scraper and collected some data, we can sync a shared store with refinery. This will load all of our scraped data into a refinery project. Once we run the scraper again, new records will be loaded into the refinery project automatically. &lt;/p&gt;

&lt;p&gt;Refinery is our data-centric IDE for text data, and we can use it to label and process our articles quickly and easily. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c2j4vzmpqy1td8r3p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c2j4vzmpqy1td8r3p1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, we can create heuristics or something called an active learner to speed up and semi-automate the labeling process. Click &lt;a href="https://docs.kern.ai/refinery/quickstart" rel="noopener noreferrer"&gt;here&lt;/a&gt; for a quickstart tutorial on how to label and process data with refinery. &lt;/p&gt;
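&lt;p&gt;To give a feel for what a heuristic looks like: a labeling heuristic is essentially a small Python function that receives a record and returns a label (or nothing). The label name and keyword list below are made up for illustration, and the record is treated as a plain dict:&lt;/p&gt;

```python
# Hypothetical bullish phrases - in a real project you would derive
# these from the data together with domain experts.
POSITIVE_MARKERS = {"beats expectations", "record revenue", "strong growth"}

def positive_market_language(record):
    # Label an article "rather positive" if its text contains typically
    # bullish phrasing; implicitly return None (no label) otherwise.
    text = record["text"].lower()
    if any(marker in text for marker in POSITIVE_MARKERS):
        return "rather positive"
```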

&lt;p&gt;Once the results of the project are satisfactory, all heuristics and ML models of a refinery project can be accessed via an API through our second new tool, called gates. &lt;/p&gt;

&lt;p&gt;Before we can access a refinery project, we first have to go to gates, open our project there, and start our model and/or heuristic in the configuration. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yq43gzhyaibwt78m7x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yq43gzhyaibwt78m7x1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we've done so, we will be able to select the model of the running gate in our gates AI node in workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k5cvdoy9zmt7e0oyero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k5cvdoy9zmt7e0oyero.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gates is integrated directly into workflow, so we don't need an API token for this. Of course, the gates API is also usable outside of workflow, which does require an API token, but we will cover that in another blog article. &lt;/p&gt;

&lt;p&gt;After we've passed the data through gates, we get a dictionary as a response containing the predictions and confidence values for each of our active learners and heuristics. All the input values are returned as well, so if we are only interested in the results, we have to do a little bit of filtering. The Python code below takes the response from gates and returns only the prediction and the topic. You can use a normal Python node for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, we store the filtered and enriched results in a separate store!&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregate the sentiments
&lt;/h2&gt;

&lt;p&gt;Now we have all of our news articles enriched with sentiment predictions. The only thing that's left is to aggregate the predictions and send them out. You could do this via email, or you could send the results to a Google Sheet. In our example, we are going to use a webhook to send the aggregated results to a dedicated Slack channel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fija381qvzejp5zz9d5r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fija381qvzejp5zz9d5r4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example, we simply count the number of positive, neutral, and negative articles, but you could also send out the confidence values or text snippets of the articles. To do this, we use a Python aggregate node, which takes in multiple records but sends out only one output. Here's the code for this node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

    &lt;span class="n"&gt;positive_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;neutral_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;negative_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rather positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;positive_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neutral_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rather negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;negative_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beep boop. This is the daily stock sentiment bot. There were &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;positive_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; positive, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;neutral_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; neutral and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;negative_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; news about Apple today!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then create a webhook store, add the webhook URL of our Slack channel, and add the node to our workflow. Afterward, we can run the workflow and it should send us a Slack message! &lt;/p&gt;
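&lt;p&gt;Under the hood, a Slack incoming webhook simply accepts an HTTP POST with a JSON body containing a "text" field. If you ever want to send the message yourself instead of through the webhook store, a minimal sketch looks like this (the URL below is a placeholder; use the one Slack generates for your channel):&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder URL - replace with your channel's incoming webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_payload(message: str) -> bytes:
    # Slack incoming webhooks expect a JSON body with a "text" field.
    return json.dumps({"text": message}).encode("utf-8")

def send_to_slack(message: str) -> None:
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```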

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrk20s4n9o4tl13dkb4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrk20s4n9o4tl13dkb4e.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This simple use case only scratches the surface of what you can do with the Kern AI platform, and you have a lot of freedom to customize the project and workflow to your needs! &lt;/p&gt;

&lt;p&gt;If you have any questions or feedback, feel free to leave it in the comments section below!&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Why and how we started Kern AI (our seed funding announcement)</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Thu, 16 Feb 2023 08:17:15 +0000</pubDate>
      <link>https://dev.to/meetkern/why-and-how-we-started-kern-ai-our-seed-funding-announcement-2lp4</link>
      <guid>https://dev.to/meetkern/why-and-how-we-started-kern-ai-our-seed-funding-announcement-2lp4</guid>
      <description>&lt;p&gt;Our co-founders Henrik and Johannes first met in January during a seminar at the Hasso Plattner Institute in Potsdam. It was a one-week seminar, which went from early in the morning until late at night. During that time, Johannes had just started an AI consultancy and was about to land the big first project. He was euphoric - soon, he would be able to implement a large neural network to process image data on a large scale in a real-world project.&lt;/p&gt;

&lt;p&gt;After an in-depth discussion with Henrik, he realized he wasn’t prepared for it. It was a project deeply rooted in Physics. The funny thing is, Johannes failed almost every Physics exam in school. He had great knowledge of the latest AI frameworks and architectures and was able to write decent ETL pipelines, but he had zero knowledge about &lt;em&gt;what&lt;/em&gt; he was meant to build.&lt;/p&gt;

&lt;p&gt;Henrik, who studied physics before he started his master's degree, offered to help. They decided to implement the project together, and after they finished it successfully (which, looking back, is a bit of a miracle), they realized two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Henrik and Johannes would make an awesome founder duo&lt;/li&gt;
&lt;li&gt;to implement a successful project, AI alone isn't enough. It requires both engineers and business users (you might argue, "that is something you can read on Forbes". They had read that before, but it was something completely different to realize it hands-on in a project).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During 2020, both continued to implement projects but realized that they’d like to start building their own software. By the end of 2020, they decided to turn their consultancy into a software startup. This is how Kern AI was born.&lt;/p&gt;

&lt;p&gt;"We know a lot about AI and have built great projects that created value for clients, but we certainly are missing lots of domain knowledge. Why not build a No-Code AI tool, and let the end user implement the AI?", share Johannes and Henrik about their thinking process.&lt;/p&gt;

&lt;p&gt;Together, they built the first mockup in November '20, signed an agreement with a client by December '20, and developed the MVP in January '21. It was about to go into production, and Henrik and Johannes were about to witness their first SaaS client succeed, right? ... Wrong ...&lt;/p&gt;

&lt;h2&gt;
  
  
  Our first (failed) product: onetask
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3dx8tsctjq1wr26c97d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3dx8tsctjq1wr26c97d.png" alt="First product"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We called the SaaS onetask (you had to do one task to build the AI). By labeling data, a model was trained in the background, which you could then call in a small playground or via an API.&lt;/p&gt;

&lt;p&gt;In February '21, both received the first feedback from the client and were shocked: the AI was just as good as random guessing. It learned &lt;em&gt;nothing.&lt;/em&gt; In addition, the people that labeled the data felt insecure about "building an AI". Henrik and Johannes figured out two new things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI was fed with training data that, as a data scientist, you wouldn't consider &lt;em&gt;training data&lt;/em&gt;. It simply wasn't good enough (they had done plenty of projects beforehand and faced bad data before, but since they had always been able to fix data issues with their technical knowledge, they didn't realize how big this obstacle would be).&lt;/li&gt;
&lt;li&gt;Being involved in building AI doesn't mean &lt;em&gt;building the AI&lt;/em&gt;. The users felt insecure. But doesn't No-Code always win? Well, most No-Code applications produce deterministic results. Connecting your Webflow form to Hubspot via Zapier means that a new inbound lead is always sent to the CRM. But building AI means building statistical applications that produce probabilistic results. It's a new level of complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As both were trying to figure out with the client how they could improve the AI, one developer from the client’s side asked Henrik why they didn't automatically label the data via some rules and then let the users label only parts of the data. This simple statement was core to our product pivot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Give superpowers to technical users. Our main user shifted from a non-technical user to a developer. Understand what they require to build AI, and help them build that.&lt;/li&gt;
&lt;li&gt;Optimize the collaboration with end users (or generally where the technical user requires help), but have clear separations of responsibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that time (the full team was still enrolled at university), Johannes heard about data-centric AI in research, a concept in which developers focus on building the training data of an AI system in collaboration with domain experts. "Jackpot, that's it!" - they looked for another early client, pitched the concept to their data science team (i.e., again, they went to their end users first), and outlined a project.&lt;/p&gt;

&lt;p&gt;In May '21, we had the next MVP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Early signs of the right direction
&lt;/h2&gt;

&lt;p&gt;We saw that the data science team of our client had, as their initial training data, an Excel spreadsheet that had been partially labelled years ago. Think of column A containing the raw content that should be predicted and column B (partially) containing what the model should predict. No documentation at all. Yikes.&lt;/p&gt;

&lt;p&gt;Because of this, in the following project, our goals were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To give data scientists more control in building the AI&lt;/li&gt;
&lt;li&gt;To let domain experts collaborate actively (as we knew that this is crucial from day one)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our MVP gave the data scientists a toolkit for labelling automation, initially to fill in missing manual labels. To set up the automation, we asked the domain experts to label some data with us in a Zoom session and to speak out loud about what they were thinking as they labelled the data.&lt;/p&gt;

&lt;p&gt;Turns out this 2-hour session was worth a ton. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data scientists learned more about the data itself. Of course, they weren't completely new to the field, but no domain expert had ever said out loud what they were thinking about a record.&lt;/li&gt;
&lt;li&gt;In the call, we turned the thoughts into code (think little Python snippets), and ran our software to combine the heuristics with some active learning (i.e., Machine Learning on the data labelled in the session). Seeing how the labelling was turning more and more into automation, the domain experts were excited at the end of the call, feeling they were an active and integral part of the process.&lt;/li&gt;
&lt;li&gt;Lastly, the data scientists had a much better foundation to build models. Their training data now contained &lt;em&gt;more&lt;/em&gt; labels, and it contained &lt;em&gt;better&lt;/em&gt; labels (we found tons of mislabeled data in that process).&lt;/li&gt;
&lt;li&gt;Furthermore, the data was documented via automation and was more and more becoming part of an actual software artifact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr5117xgjc1c0i46uiv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr5117xgjc1c0i46uiv9.png" alt="Screenshot onetask"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, the data science team built a new model on top of the iterated training data, raising the F1-score from 72% to 80%. In non-technical terms, this means that you can trust your model much more.&lt;/p&gt;

&lt;p&gt;We found that we were heading in the right direction. Our next question was: "What do we need to build &lt;em&gt;precisely&lt;/em&gt;, and how can we best ship this to developers?"&lt;/p&gt;

&lt;p&gt;To answer the first question better than anyone else, we realized in early 2022 that we must win the hearts of developers. And this - for many good reasons - typically means via open-source.&lt;/p&gt;

&lt;h2&gt;
  
  
  We went open-source - version 1.0 of “Kern AI refinery”
&lt;/h2&gt;

&lt;p&gt;Fast forward to July ‘22 (after many further product iterations and a full redesign), we open-sourced our product under a new name: Kern AI &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;refinery&lt;/a&gt; (the origin of the name is very simple: we want to improve, i.e., refine, the foundation for building models).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11vcl46p1bk0oh1i0li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11vcl46p1bk0oh1i0li.png" alt="Screenshot refinery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decided to fully focus on natural language processing (NLP), as we both saw &lt;strong&gt;refinery&lt;/strong&gt; performing exceptionally well in NLP use cases in the past, and as we got incredibly excited about what the future of NLP might bring (this was before ChatGPT btw).&lt;/p&gt;

&lt;p&gt;On our launch day, we were trending on Hacker News, quickly gaining interest from developers all over the world. From the feedback we got, we saw that &lt;strong&gt;refinery&lt;/strong&gt; was moving exactly in the direction we hoped it would:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3cx6go4n0mvvs6m3mly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3cx6go4n0mvvs6m3mly.png" alt="Wall of Love"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shortly after the release, we had more than 1,000 stars on GitHub (i.e., GitHub users expressing that they like the project), hundreds of thousands of views on the repository, and dozens of people telling us about the use cases they implemented via refinery. We were thrilled and started digging deeper.&lt;/p&gt;

&lt;p&gt;This leads us to today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Announcing our seed funding, co-led by Seedcamp and Faber with participation from xdeck, another.vc and Hasso Plattner Seed Fund
&lt;/h2&gt;

&lt;p&gt;We are happy to announce that Seedcamp and Faber co-led our seed funding of €2.7m.&lt;/p&gt;

&lt;p&gt;Our investors share our vision of bringing data-centric NLP into action and trust us in building Kern AI by focusing on the end users first. We’re thrilled to receive their support and backing and now aim to continue expanding our platform.&lt;/p&gt;

&lt;p&gt;In that spirit, today we announce the release of our &lt;a href="https://www.youtube.com/watch?v=7VXqimJvzdU" rel="noopener noreferrer"&gt;data-centric NLP platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is the result of our insights and efforts since we started Kern AI. What makes it stand out?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It puts users in their roles, while also sparking collaboration and creativity.&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://github.com/code-kern-ai/bricks" rel="noopener noreferrer"&gt;bricks&lt;/a&gt;&lt;/strong&gt; (our content library) is connected with &lt;strong&gt;refinery&lt;/strong&gt; (database + application logic), such that developers can turn an idea into implementation within literally seconds. Why? Because that way, devs &lt;em&gt;and&lt;/em&gt; domain experts can validate ideas immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is capable of doing the sprint &lt;em&gt;and&lt;/em&gt; the marathon.&lt;/strong&gt; Prototype an idea within an afternoon and automatically have the setup to grow your use cases over time. Just like regular software.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can use it both for batch data &lt;em&gt;and&lt;/em&gt; real-time streams.&lt;/strong&gt; Start by uploading an Excel spreadsheet into refinery, and over time grow your database via native integrations or by setting up your own data stream via our commercial API (gates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is flexible.&lt;/strong&gt; Are you using crowd labelling to annotate your training data? No problem, you can integrate crowd labelling into refinery. Do you already have a set of tools? That works too; refinery even comes with native integrations to tools like Label Studio.
The more familiar you get with the platform, the more use cases you will see. That’s what gets us excited: sparking creativity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It can power your own NLP product as the database.&lt;/strong&gt; Or you can use it as the NLP API. Or you can even cover a full end-to-end workflow on it. Use cases range from building sophisticated applications up to implementing a small internal natural language-driven workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our team is genuinely excited about what comes next. We believe that NLP is just about to get started, and it will disrupt almost anything touched by technology. And we’re confident that our work will contribute to it.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>news</category>
    </item>
    <item>
      <title>GPT and BERT: A Comparison of Transformer Architectures</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 09 Feb 2023 13:37:07 +0000</pubDate>
      <link>https://dev.to/meetkern/gpt-and-bert-a-comparison-of-transformer-architectures-2k46</link>
      <guid>https://dev.to/meetkern/gpt-and-bert-a-comparison-of-transformer-architectures-2k46</guid>
      <description>&lt;p&gt;Transformer models such as GPT and BERT have taken the world of machine learning by storm. While the general structures of both models are similar, there are some key differences. Let’s take a look. &lt;/p&gt;

&lt;h2&gt;
  
  
  The original Transformer architecture
&lt;/h2&gt;

&lt;p&gt;The first transformer was presented in the famous paper &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt; by Vaswani et al. It was intended for machine translation and used an encoder-decoder architecture that didn't rely on mechanisms like recurrence. Instead, the transformer relied on something called attention. In a nutshell, attention is like a communication layer placed on top of the tokens in a text, which allows the model to learn the contextual connections between words in a sentence. &lt;/p&gt;

&lt;p&gt;From this original transformer paper, different models emerged, some of which you might already know. If you spent a little of your time exploring transformers already, you've probably come across this image, outlining the architecture of the first transformer model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qzl5vs8t811nphbr8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qzl5vs8t811nphbr8c.png" alt="Image description" width="432" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The approach of using an encoder and a decoder is nothing new. It means that you train two neural networks, using one for encoding and one for decoding. This is not limited to transformers; we can use the encoder-decoder architecture with other types of neural networks, like LSTMs (Long Short-Term Memory networks). It is especially useful when we want to convert an input into something else, like a sentence from one language into another, or an image into a text description.&lt;/p&gt;

&lt;p&gt;The crux of the transformer is the use of (self-)attention. Mechanisms like recurrence are dropped completely, hence the title of the original paper: "Attention Is All You Need"!&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT vs BERT: What’s The Difference?
&lt;/h2&gt;

&lt;p&gt;The original transformer paper sprouted lots of really cool models, such as the all-mighty GPT or BERT.&lt;/p&gt;

&lt;p&gt;GPT stands for Generative Pre-trained Transformer, and it was developed by OpenAI to generate human-like text from given inputs. It uses a language model that is pre-trained on large datasets of text to generate realistic outputs based on user prompts. One advantage GPT has over other deep learning models is its ability to generate long sequences of text without sacrificing accuracy or coherence. In addition, it can be used for a variety of tasks, including translation and summarization. &lt;/p&gt;

&lt;p&gt;BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by the Google AI Language team and open-sourced in 2018. Unlike GPT, which processes input only from left to right, the way humans read words, BERT processes input both left to right and right to left in order to better understand the context of a given text. BERT has also been shown to outperform traditional NLP models such as LSTMs on various tasks related to natural language understanding. &lt;/p&gt;

&lt;p&gt;There is, however, an extra difference in how BERT and GPT are trained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;BERT is a transformer encoder: for each position in the input, the output at the same position corresponds to the same token (or to the [MASK] token for masked positions), i.e., the input and output positions of each token line up. Encoder-only models like BERT generate all of their outputs at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GPT is an autoregressive transformer decoder, which means that each token is predicted conditioned on the previous tokens. No encoder is needed, because the previous tokens are fed back to the decoder itself. This makes these models really good at tasks like language generation, but less suited to classification. They can be trained on large unlabeled text corpora from books or web articles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
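&lt;p&gt;The directionality difference can be illustrated with attention masks. Below is a toy NumPy sketch (not actual code from either model): a bidirectional mask lets every token attend to every position, while a causal mask lets each token see only itself and its predecessors.&lt;/p&gt;

```python
import numpy as np

seq_len = 4  # a toy sequence of 4 tokens

# BERT-style (bidirectional): every token may attend to every position.
bidirectional_mask = np.ones((seq_len, seq_len))

# GPT-style (causal): token i may only attend to positions 0..i,
# so future tokens are hidden when predicting the next one.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

print(causal_mask[2])  # token 2 sees tokens 0-2 but not token 3: [1. 1. 1. 0.]
```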

&lt;p&gt;The special thing about transformer models is the attention mechanism, which allows these models to understand the context of words more deeply.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does attention work?
&lt;/h2&gt;

&lt;p&gt;The self-attention mechanism is a key component of transformer models, and it has revolutionized the way natural language processing (NLP) tasks are performed. Self-attention allows for the model to attend to different parts of an input sequence in parallel, allowing it to capture complex relationships between words or sentences without relying on recurrence or convolutional layers. This makes transformer models more efficient than traditional recurrent neural networks while still being able to achieve superior results in many NLP tasks. In essence, self-attention enables transformers to encode global context into representations that can be used by downstream tasks such as text classification and question answering.&lt;/p&gt;

&lt;p&gt;Let's take a look at how this works. Imagine that we have a text &lt;em&gt;x&lt;/em&gt;, which we convert from raw text into vectors using an embedding algorithm. To apply attention, we map a query (&lt;em&gt;q&lt;/em&gt;) and a set of key-value pairs (&lt;em&gt;k&lt;/em&gt;, &lt;em&gt;v&lt;/em&gt;), all of which are vectors, to an output. The result &lt;em&gt;z&lt;/em&gt; is called the attention head and is then passed through a simple feed-forward neural network. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu6mtxdgeyxjtyepkmyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu6mtxdgeyxjtyepkmyy.png" alt="Image description" width="112" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this sounds confusing to you, here is a visualization that highlights the connections built by the attention mechanism: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefybju109l9xkiytczmk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefybju109l9xkiytczmk.gif" alt="Image description" width="493" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can explore this yourself in this super cool Tensor2Tensor Notebook &lt;a href="https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In conclusion, while both GPT and BERT are examples of transformer architectures that have been influencing the field of natural language processing in recent years, they have different strengths and weaknesses that make them suitable for different types of tasks. GPT excels at generating long sequences of text with high accuracy, whereas BERT focuses more on understanding the context within given texts in order to perform more sophisticated tasks such as question answering or sentiment analysis. Data scientists, developers, and machine learning engineers should decide which architecture best fits their needs before embarking on any NLP project using either model. Ultimately, both GPT and BERT are powerful tools that offer unique advantages depending on the task at hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get refinery today
&lt;/h2&gt;

&lt;p&gt;Download refinery, our data-centric IDE for NLP. In our tool, you can use state-of-the-art transformer models to process and label your data. &lt;/p&gt;

&lt;p&gt;Get it for free here: &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Further articles: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NanoGPT by Andrej Karpathy &lt;a href="https://github.com/karpathy/nanoGPT/blob/master/train.py" rel="noopener noreferrer"&gt;https://github.com/karpathy/nanoGPT/blob/master/train.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BERT model explained &lt;a href="https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encoder-decoder in LSTM neural nets &lt;a href="https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/" rel="noopener noreferrer"&gt;https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The illustrated transformer by Jay Alammar &lt;a href="http://jalammar.github.io/illustrated-transformer/" rel="noopener noreferrer"&gt;http://jalammar.github.io/illustrated-transformer/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>howto</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Hyperparameter Optimisation for Machine Learning - Bayesian Optimisation</title>
      <dc:creator>Divyanshu Katiyar</dc:creator>
      <pubDate>Wed, 01 Feb 2023 16:50:29 +0000</pubDate>
      <link>https://dev.to/meetkern/hyperparameter-optimisation-for-machine-learning-bayesian-optimisation-2nd1</link>
      <guid>https://dev.to/meetkern/hyperparameter-optimisation-for-machine-learning-bayesian-optimisation-2nd1</guid>
      <description>&lt;p&gt;When building a machine-learning or deep learning model, have you ever ran into the dilemma about setting up certain parameters which directly affect your model, like how many layers should you stack? Or how many units of filters should be there for each layer? What activation function should you use? These architecture-level parameters are called as &lt;em&gt;hyperparameters&lt;/em&gt; which play an important role in deep learning to set out the best configuration so as to produce the best performance of the model.&lt;br&gt;
In this blog we will cover some of the concepts describing how bayesian optimisation works and how fast it is compared to random search and grid search hyperparameter optimisation methods.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Algorithms today serve as proxies for decision-making processes traditionally performed by humans. As we automate these processes, we must keep asking whether our algorithms are behaving as intended, and if not, how far they are from state-of-the-art standards. Sometimes even the best-trained model falls short because of inaccuracies on the validation data. Hyperparameter optimisation is the process of tuning the parameters of a machine learning model to improve its performance. Hyperparameters are not learned from the data but set manually by the user; for example, if the model in question is a neural network (NN), the learning rate is one such hyperparameter. More complex machine learning models usually have more parameters to fine-tune. The goal is to find the set of hyperparameters that results in the best performance of the model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Grid Search
&lt;/h2&gt;

&lt;p&gt;One of the most common methods for this kind of problem is &lt;strong&gt;grid search&lt;/strong&gt;, where a pre-defined set of values is used to train and evaluate the model. The process is to train the model over and over again for every combination of hyperparameter values and choose the combination that delivers the best performance. A consequence of this is that it can be extremely time-consuming when the number of hyperparameters and values is large. For example, suppose we have 3 hyperparameters, each of which takes a list of 3 values. The number of combinations to process would be 3 x 3 x 3 = 27. More generally, if each of 'n' hyperparameters takes one of 'm' candidate values, then the number of combinations would be &lt;br&gt;


&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;mn
m^n
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, which quickly becomes a problem as 'm' and 'n' grow. We need a more efficient way to reduce the time complexity of this fine-tuning.&lt;/p&gt;
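&lt;p&gt;The combinatorial blow-up is easy to demonstrate in Python. The hyperparameter names and values below are made up for illustration:&lt;/p&gt;

```python
from itertools import product

# A hypothetical grid: 3 hyperparameters, 3 candidate values each.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [2, 4, 8],
    "dropout": [0.1, 0.3, 0.5],
}

# Grid search enumerates every combination and trains one model per entry.
combinations = list(product(*grid.values()))
print(len(combinations))  # 3 * 3 * 3 = 27 models to train and evaluate

# One more 3-valued hyperparameter would already mean 3**4 = 81 runs.
```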
&lt;h2&gt;
  
  
  Random Search
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Random search&lt;/strong&gt; is another method, where random values are chosen from a pre-defined range for each hyperparameter. The model is trained and evaluated for each set of random values. It runs much faster than the grid search algorithm since it uses random sampling; however, it can still be very time-consuming and might not find the best set of hyperparameters. Let us assume a hypervolume (a measure of the size of the feasible region in a multi-dimensional space) 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;vϵ
v_{\epsilon}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ϵ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 where a function takes values within 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;1−ϵ
1 - \epsilon
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ϵ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 of its maximum. Random search will then sample this hypervolume with probability &lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(ϵ)=vϵV
P(\epsilon) = \frac{v_{\epsilon}}{V}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ϵ&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ϵ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
where 'V' is the total search space volume. If the total search space is given as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Rd
R^d
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and the hypervolume spans with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;rd
r^d
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 ('d' being the input dimension), random search would have to process a number of samples on the order of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(rR)−d
\left(\frac{r}{R} \right)^{-d}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size1"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;R&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size1"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. If r &amp;lt;&amp;lt; R then this becomes really expensive! 
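&lt;p&gt;A quick Monte Carlo sketch (with made-up numbers) shows how fast this hit probability shrinks as the input dimension 'd' grows:&lt;/p&gt;

```python
import random

def hit_probability(r, R, d, trials=100_000, seed=42):
    """Estimate the chance that a uniform random sample in [0, R]^d
    lands inside the good hypercube [0, r]^d, i.e. (r/R)^d."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.uniform(0, R) <= r for _ in range(d))
        for _ in range(trials)
    )
    return hits / trials

# With r/R = 0.5 the probability halves with every extra dimension,
# so the expected number of samples needed doubles each time.
for d in (1, 2, 5, 10):
    print(d, hit_probability(r=0.5, R=1.0, d=d))
```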
&lt;h2&gt;
  
  
  Bayesian Optimisation
&lt;/h2&gt;

&lt;p&gt;To mitigate the aforementioned problem, there is yet another method, called &lt;strong&gt;Bayesian optimisation&lt;/strong&gt;. It is an alternative that works faster and more efficiently than the grid search and random search algorithms. It uses Bayesian inference to model the unknown function that maps hyperparameter values to the performance of the model. The major advantage is that it uses information from past iterations to inform the next set of iterations. If you recall Bayes' theorem, the conditional probability of an event A, given that event B has already occurred, is given by:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(A∣B)=P(B∣A)∗P(A)P(B)
P(A|B) = \frac{P(B|A)*P(A)}{P(B)}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
We can further simplify this by dropping the normalising value P(B) (strictly speaking, the result is then proportional, rather than equal, to the posterior):&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(A∣B)=P(B∣A)∗P(A)
P(A|B) = P(B|A)*P(A)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
What we have calculated here is known as the posterior probability, and it is calculated using the reverse conditional probability (P(B|A)), also called the &lt;em&gt;likelihood&lt;/em&gt;, and the prior probability (P(A)). Suppose we have some sample values 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;x=x1,x2,.....,xn
x = {x_1, x_2, ....., x_n}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;.....&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 that we evaluate using an objective function f(x), creating a dataset D from the samples and the values the function returns on them. The likelihood in this case is defined as the probability of observing the data given the function, P(D|f). By maximising this likelihood, we can identify the most promising hyperparameters from the sample.
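&lt;p&gt;To make this concrete, here is a minimal sketch in plain Python (the objective function and search range are hypothetical stand-ins, not part of any library): we draw samples x, evaluate them with f(x), and collect the pairs into a dataset D, from which the best hyperparameter so far can be read off.&lt;/p&gt;

```python
import random

def objective(x):
    # hypothetical objective, e.g. validation accuracy as a function of one hyperparameter
    return -(x - 3.0) ** 2 + 9.0

random.seed(42)
samples = [random.uniform(0.0, 6.0) for _ in range(10)]  # x = {x_1, ..., x_n}
dataset = [(x, objective(x)) for x in samples]           # D = {(x_i, f(x_i))}

# the sample with the highest observed value is the current best guess
best_x, best_f = max(dataset, key=lambda pair: pair[1])
```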
&lt;h2&gt;
  
  
  Applications in NLP and bricks
&lt;/h2&gt;

&lt;p&gt;In NLP, hyperparameter optimisation can be used to tune the hyperparameters of a word embedding model, such as word2vec, GloVe, etc. We can train the embedding models with different sets of hyperparameters and evaluate their performance on basic NLP tasks like text classification, named entity recognition, sentiment analysis, etc. It will soon be available as an active learner, alongside the other active learners like random search and grid search, in &lt;a href="https://bricks.kern.ai/" rel="noopener noreferrer"&gt;bricks&lt;/a&gt;. Since it is an active learner, there is no live runtime environment for it in &lt;em&gt;bricks&lt;/em&gt;. Instead, one has to copy the code from the code snippet and paste it into &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;refinery&lt;/a&gt;. In &lt;em&gt;refinery&lt;/em&gt;, we use sentence embeddings (one vector per sentence) to train the ML models; these can be obtained from large language models such as BERT, or from simpler methods like TF-IDF/BoW. Word2Vec/GloVe produce word embeddings (one vector per word), which are more useful if you are interested in learning about the relationships between words and phrases. You can find more information about active transfer learning and its applications in refinery in &lt;a href="https://dev.to/meetkern/active-learning-with-transformer-based-machine-learning-models-536"&gt;this&lt;/a&gt; blog.&lt;/p&gt;

&lt;p&gt;Here is an example which shows an implementation of Bayesian optimisation for a text classification task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization

def bayesian_optimization(sentences, labels, n_iter=10, cv=5, random_state=42):
    """
    Perform Bayesian optimization to tune the hyperparameters of a word2vec model for text classification.
    """

    def cross_validation_score(dim, window, negative):
        """
        Function to evaluate the performance of the word2vec model on the text classification task
        """
        model = Word2Vec(sentences, vector_size=int(dim), window=int(window), negative=int(negative), min_count=1)
        # represent each sentence by the average of its word vectors (gensim 4.x API),
        # so that X has one row per sentence and lines up with `labels`
        X = [
            [sum(dims) / len(dims) for dims in zip(*(model.wv[w] for w in sentence))]
            for sentence in sentences
        ]
        clf = RandomForestClassifier(random_state=random_state)
        score = cross_val_score(clf, X, labels, cv=cv).mean()
        return score

    optimizer = BayesianOptimization(
        f=cross_validation_score,
        pbounds={
            "dim": (50, 150),
            "window": (2, 6),
            "negative": (5, 15)  # (min, max) bounds for each hyperparameter
        },
        random_state=random_state
    )
    optimizer.maximize(init_points=5, n_iter=n_iter)  # maximize the cross-validation score
    best_params = optimizer.max["params"]
    best_params["dim"] = int(best_params["dim"])
    best_params["window"] = int(best_params["window"])
    best_params["negative"] = int(best_params["negative"])
    return best_params
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To conclude, grid search is great if you already know which hyperparameters work well and their sample space is not too big. Random search works well if you have no prior knowledge of the hyperparameters to fine-tune; avoid it on very large sample spaces, for the reasons explained above. Bayesian optimisation is a powerful method that is often more efficient and effective than both random search and grid search. In comparison, it is almost 10 times faster than the grid search method. It serves many crucial use cases in the domain of natural language processing, and with its flexibility in the choice of acquisition function, it can be tailored to a wide range of problems.&lt;/p&gt;
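&lt;p&gt;The cost difference is easy to see in a small sketch (plain Python; the search space below is hypothetical): grid search must evaluate every combination, so its cost multiplies with each new hyperparameter, while random search evaluates a fixed budget of samples.&lt;/p&gt;

```python
import itertools
import random

# hypothetical search space for three hyperparameters
space = {
    "dim": [50, 100, 150],
    "window": [2, 4, 6],
    "negative": [5, 10, 15],
}

# grid search: every combination, 3 * 3 * 3 = 27 evaluations
grid_trials = list(itertools.product(*space.values()))

# random search: a fixed budget, here 10 evaluations
random.seed(0)
random_trials = [{name: random.choice(values) for name, values in space.items()}
                 for _ in range(10)]
```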

&lt;p&gt;Citations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://proceedings.neurips.cc/paper/2018/hash/cc70903297fe1e25537ae50aea186306-Abstract.html" rel="noopener noreferrer"&gt;An active approach to bayesian optimisation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearningmastery.com/what-is-bayesian-optimization/#:~:text=with%20sample%20code" rel="noopener noreferrer"&gt;Bayes probability and likelihood&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://radimrehurek.com/gensim/models/word2vec.html" rel="noopener noreferrer"&gt;gensim python package&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>css</category>
      <category>frontend</category>
      <category>discuss</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Active Learning with Transformer-Based Machine Learning Models</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 19 Jan 2023 14:09:33 +0000</pubDate>
      <link>https://dev.to/meetkern/active-learning-with-transformer-based-machine-learning-models-536</link>
      <guid>https://dev.to/meetkern/active-learning-with-transformer-based-machine-learning-models-536</guid>
      <description>&lt;p&gt;The combination of active learning and transformer-based machine learning models provides a powerful tool for efficiently training deep learning models. By leveraging active learning, data scientists are able to reduce the amount of labeled data required to train a model while still achieving high accuracy. This post will explore how transformer-based machine learning models can be used in an active learning setting, as well as which models are best suited for this task. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Active Learning?
&lt;/h2&gt;

&lt;p&gt;Active learning is an iterative process that uses feedback from previously acquired labels to inform the selection of new data points to label. It works by continuously selecting the most informative unlabeled data points, i.e. those with the greatest potential to improve the model’s performance once labeled and incorporated into training. This iterative process creates an efficient workflow that allows you to quickly get high-quality models with minimal effort. With each iteration, the performance increases, allowing you to observe the improvement of the machine learning model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlr61ok4whqcic0bw0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlr61ok4whqcic0bw0q.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt; &lt;em&gt;Source: &lt;a href="https://huggingface.co/blog/autonlp-prodigy" rel="noopener noreferrer"&gt;Active Learning with AutoNLP and Prodigy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For example, an &lt;a href="https://towardsdatascience.com/transformers-meet-active-learning-less-data-better-performance-4cf931517ff6" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; on the MRPC dataset with the bert-base-uncased transformer model found that 21% fewer labeled examples were needed with the active learning approach than with a fully labeled dataset from the start. &lt;/p&gt;
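&lt;p&gt;The selection loop described above can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: the "model" just returns a score in [0, 1], and the labeling step plays the role of a human annotator. The point is the loop structure: label the most uncertain example, then repeat.&lt;/p&gt;

```python
import random

random.seed(7)
unlabeled = [random.random() for _ in range(20)]  # pool of unlabeled examples
labeled = []                                      # (example, label) pairs

def model_confidence(x):
    # hypothetical model score in [0, 1]; 0.5 means maximally uncertain
    return x

for _ in range(5):  # five active learning iterations
    # select the example whose prediction is closest to 0.5 (most informative)
    most_uncertain = min(unlabeled, key=lambda x: abs(model_confidence(x) - 0.5))
    unlabeled.remove(most_uncertain)
    label = 1 if most_uncertain > 0.5 else 0      # stand-in for the human annotator
    labeled.append((most_uncertain, label))
    # in a real workflow, the model would be retrained here on `labeled`
```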

&lt;h2&gt;
  
  
  Transformer-Based Machine Learning Models for Active Learning
&lt;/h2&gt;

&lt;p&gt;Transformer-based machine learning models such as &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/bert-base-uncased" rel="noopener noreferrer"&gt;BERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/gpt2" rel="noopener noreferrer"&gt;GPT-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/xlnet-base-cased" rel="noopener noreferrer"&gt;XLNet &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are well suited for active learning due to their ability to capture context information in text data. These models have been shown to achieve state-of-the-art results on many natural language processing tasks such as question answering, sentiment analysis, and document classification. By utilizing these types of models in an active learning setting, you can quickly identify the most important samples that need labeling and use them to effectively train your model. Additionally, these models are very easy to deploy on cloud platforms like AWS or Azure, making it even more convenient to use them in an active learning environment.  &lt;/p&gt;

&lt;h2&gt;
  
  
  How we approach active learning in Kern AI refinery
&lt;/h2&gt;

&lt;p&gt;In refinery, we use SOTA transformer models from Huggingface to create embeddings from text datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fent7g0qe6e848xw4qkrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fent7g0qe6e848xw4qkrc.png" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is usually done at the start of a new project because having the embedding for all of our text data allows us to quickly find similar records by calculating the cosine similarity of each embedded text. This can drastically increase the labeling speed.&lt;/p&gt;
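&lt;p&gt;As a minimal sketch of that similarity search (plain Python; the three-dimensional embeddings below are hypothetical, real sentence embeddings have hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical embeddings for a query record and two candidate records
emb_query = [0.9, 0.1, 0.3]
emb_records = {
    "record_a": [0.8, 0.2, 0.4],
    "record_b": [-0.5, 0.9, 0.1],
}

# rank candidate records by similarity to the query
ranked = sorted(emb_records,
                key=lambda r: cosine_similarity(emb_query, emb_records[r]),
                reverse=True)
```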

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41iq2wccy1ky5vthveu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41iq2wccy1ky5vthveu2.png" alt="Image description" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After some labeling of the data is done, we are able to use these text embeddings to train simple machine learning algorithms, such as a Logistic Regression or a Decision Tree. We do not use these embeddings to train a transformer-based model again, because the embeddings are of such a high quality that even simple models provide high-accuracy results. While you save time and money through the active learning approach, you also save a lot of computational workload down the road.&lt;/p&gt;
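&lt;p&gt;To illustrate why simple models suffice on good embeddings, here is a sketch using an even simpler model than a Logistic Regression, a nearest-centroid classifier, in plain Python (the two-dimensional embeddings and labels are hypothetical):&lt;/p&gt;

```python
# hypothetical sentence embeddings and their labels
embeddings = {
    "great product, love it":   [0.9, 0.1],
    "works well, recommended":  [0.8, 0.2],
    "terrible, waste of money": [0.1, 0.9],
    "broke after one day":      [0.2, 0.8],
}
labels = {
    "great product, love it":   "positive",
    "works well, recommended":  "positive",
    "terrible, waste of money": "negative",
    "broke after one day":      "negative",
}

def centroid(vectors):
    # component-wise mean of a list of equal-length vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

centroids = {cls: centroid([embeddings[t] for t in embeddings if labels[t] == cls])
             for cls in set(labels.values())}

def classify(vector):
    # assign the class whose centroid is closest (squared Euclidean distance)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: dist(vector, centroids[cls]))
```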

&lt;p&gt;In conclusion, transformer-based machine learning models provide a powerful tool for efficiently training deep learning models using active learning techniques. By leveraging their ability to capture contextual information from text data, you can quickly identify which samples should be labeled next in order to effectively train your model with minimal effort and cost. Furthermore, these types of models are highly scalable and easy to deploy on cloud platforms making them ideal for use in an active learning setting. With all these advantages combined together, it’s no wonder why transformer-based machine learning models are becoming increasingly popular among developers and data scientists alike.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>The theory behind Image Captioning</title>
      <dc:creator>Divyanshu Katiyar</dc:creator>
      <pubDate>Mon, 09 Jan 2023 10:17:38 +0000</pubDate>
      <link>https://dev.to/meetkern/the-theory-behind-image-captioning-41j2</link>
      <guid>https://dev.to/meetkern/the-theory-behind-image-captioning-41j2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;One of the most challenging tasks in artificial intelligence is automatically describing the content of an image. This requires the knowledge of both computer vision using artificial neural networks, and natural language processing. This can have great impact in many different domains - be it to make it easier for visually impaired people of the community to understand the contents of the images on the web, or for tedious tasks like data labelling where data is in the form of images. In this article, we will walk through the basic concepts that are needed in order to create your own image captioning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Textual description of an image
&lt;/h2&gt;

&lt;p&gt;In principle, converting an image into text is a significantly hard task. The description should not only contain the objects highlighted in the image but also the context of the image. On top of that the output has to be expressed in a natural language like English, German, etc., so a language model is also needed to complete the picture. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PGRGwXxe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87fbxky599msv26xihjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PGRGwXxe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87fbxky599msv26xihjh.png" alt="Hikers" width="880" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image we see people on a vacation hiking on the foothills of a mountain range. Let us say that we want to generate a text describing this image. The image is used as input &lt;strong&gt;I&lt;/strong&gt; and is fed to the model (called the Show and Tell model, developed by Google) which is trained to maximise the likelihood &lt;em&gt;p&lt;/em&gt;( S | I ) of producing a sequence of words S = {S₁, S₂, ...., Sₙ}, where each word Sₖ comes from a given dictionary which describes the image accurately. &lt;br&gt;
In order to process the input data, we use &lt;strong&gt;Convolutional Neural Networks&lt;/strong&gt; (CNNs) as "encoders", and the output of the CNN is fed to a type of recurrent neural network called a &lt;strong&gt;Long Short-Term Memory&lt;/strong&gt; (LSTM) network, which is responsible for generating natural language outputs. Before describing the model, let us briefly look into CNNs and LSTMs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Convolutional neural networks
&lt;/h3&gt;

&lt;p&gt;CNN is a type of neural network which is used mainly for image classification and recognition. As the name suggests, it uses a mathematical operation called convolution to process the data. The CNN consists of an input layer, single or multiple hidden layers, and an output layer. The middle layers are called hidden because their inputs and outputs are masked by the activation function. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eniPUDxt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s7r7vsmkz6c8arjyeaz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eniPUDxt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s7r7vsmkz6c8arjyeaz5.png" alt="sketch of CNN" width="880" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Convolutions operate over 3D tensors called &lt;em&gt;feature maps&lt;/em&gt; with two spatial axes and a channel axis. The hidden layers (convolutional layers) are made up of multiple convolutions that scan the input data and apply &lt;em&gt;filters&lt;/em&gt; to extract output features. Each output feature is also a 3D tensor, which is passed through a non-linear activation function in order to induce non-linearity. &lt;br&gt;
The output of the convolutional layers is passed through a pooling layer, which aggressively downsamples the feature maps and reduces computational complexity. Eventually, the output of the pooling layer is passed through a fully connected dense layer, which computes the final prediction.&lt;br&gt;
Below is an example of how to instantiate a convolutional neural network in python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers

model = models.Sequential()
model.add(layers.Conv2D(32, (3,3), activation='relu', input_shape=(120, 120, 10)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss="binary_crossentropy",
              optimizer=optimizers.RMSprop(learning_rate=1e-3), 
              metrics=['acc'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here we have assumed that the user already has the pre-processed input data. Let us assume that this data is split into training and test sets. In the next step, we can fit the model and save it.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fitting = model.fit_generator(
            training_data,
            steps_per_epoch=80,
            epochs=25,
            validation_data=validation_data,
            validation_steps=70
          )

model.save("output_data.h5")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Long-short term memory
&lt;/h3&gt;

&lt;p&gt;LSTM networks are a type of recurrent neural network that is well suited for modelling long-term dependencies in data. They are called "long-short term" because they can remember information for long periods of time, but they can also forget information that is no longer relevant. &lt;br&gt;
RNNs arose as a response to the shortcomings of feedforward neural networks.&lt;br&gt;
&lt;strong&gt;What are the problems with feedforward neural networks?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not designed for sequences and time series&lt;/li&gt;
&lt;li&gt;Do not model memory - in the sense that they do not retain information from previous data points when processing new data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RNNs solve this issue conveniently. The recursive formula defined as:&lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;St=Fw(St−1,Xt)
S_t = F_w(S_{t-1}, X_t)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;F&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
states that the new state at time 't' is a function of the old state at 't-1' and input at time 't'. This makes the RNNs different from other neural nets (NNs) since NNs learn from backpropagation and RNNs learn from backpropagation through time! &lt;br&gt;&lt;br&gt;
The output from this network is now used to calculate the loss.&lt;/p&gt;
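&lt;p&gt;The recurrence can be sketched in a few lines of plain Python (the transition function and its weights below are hypothetical scalars, chosen only to show the state being carried forward step by step):&lt;/p&gt;

```python
import math

def f_w(prev_state, x, w_state=0.5, w_input=1.0):
    # hypothetical transition function F_w with fixed scalar weights
    return math.tanh(w_state * prev_state + w_input * x)

inputs = [0.1, 0.4, -0.2, 0.3]  # X_1 .. X_4
state = 0.0                     # initial state S_0
states = []
for x in inputs:                # S_t = F_w(S_{t-1}, X_t)
    state = f_w(state, x)
    states.append(state)
```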

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YmCvGGLg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ybyg7gr5pa4kwr5v25r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YmCvGGLg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ybyg7gr5pa4kwr5v25r.png" alt="recurrent NN loss" width="880" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the image shown above, we describe a recurrent neural network which is run for, say, 100 time steps. Our aim is to compute the loss. Let us assume that at each state, our gradient is 0.01. As we go back 100 time steps, the update to our weights is&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Δw=(0.01)100≈0
\Delta w = (0.01)^{100} \approx 0
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;Δ&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0.01&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;100&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
which is negligible. Thus, the neural network won't learn at all! This is known as the &lt;strong&gt;vanishing gradient problem&lt;/strong&gt;. In order to solve this, we need to add some extra interactions to the RNN. This gives rise to the &lt;strong&gt;Long Short-Term Memory&lt;/strong&gt;.&lt;br&gt;
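&lt;p&gt;The arithmetic behind the vanishing gradient is easy to reproduce in plain Python, using the same numbers as above:&lt;/p&gt;

```python
update = 1.0
for _ in range(100):   # backpropagate through 100 time steps
    update *= 0.01     # each step multiplies in a gradient of 0.01
# update is now on the order of 1e-200, far too small to change the weights
```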
LSTM, like any other NN, consists of three main components: input layer, single or multiple hidden layers, and the output layer. What makes it different are the operations happening in the hidden layers. The hidden layer consists of three gates - &lt;em&gt;input gate&lt;/em&gt;, &lt;em&gt;forget gate&lt;/em&gt; and &lt;em&gt;output gate&lt;/em&gt; - and one cell state. The memory cells are responsible for storing and manipulating the information over time. Each memory cell is connected to a gate which decides what information stays and what information is forgotten. Gosh these machines are getting smart!&lt;br&gt;
To describe the functionality of the gates mathematically, we can look at this expression:&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;gt=σ(WgSt−1+WgXt)
g_t = \sigma (W_g S_{t-1} + W_g X_t)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;g&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;g&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mbin 
mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;g&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
where 'g' represents one of the gates: input (i), forget (f), or output (o); 'W' denotes the corresponding weight matrix for that gate, 'S' the state at time step 't-1', and 'X' the input.
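&lt;p&gt;Plugging hypothetical numbers into the gate equation makes the mechanics clear: the sigmoid squashes the weighted sum into (0, 1), so the gate acts as a soft switch (a sketch in plain Python; the scalar weights and values are made up):&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_g = 0.8      # gate weight W_g
s_prev = 0.3   # previous state S_{t-1}
x_t = 0.5      # current input X_t

# g_t = sigma(W_g * S_{t-1} + W_g * X_t)
g_t = sigmoid(w_g * s_prev + w_g * x_t)
# values near 1 let information through, values near 0 forget it
```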

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NH8GVljt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6u19iejiyh1e61l9en90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NH8GVljt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6u19iejiyh1e61l9en90.png" alt="LSTM RNN" width="880" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above image shows the functionality of the LSTM network. The important part is that the network can decide what information to discard and what to keep. This resolves the vanishing gradient problem that we face in a normal RNN. &lt;br&gt;
Here is a simple implementation of an LSTM network in &lt;code&gt;keras&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tensorflow.keras.layers import LSTM
from tensorflow.keras import models

model = models.Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

model_fitting = model.fit(input_train, y_train,
                    epochs=30,
                    batch_size=128,
                    validation_split=0.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Back to the textual description
&lt;/h2&gt;

&lt;p&gt;Now that we have gone through the concepts of the tools required, we can see that by using a CNN to process the image input and an LSTM to generate the natural language output, we can build rather accurate models for image captioning. For that, use a CNN as an encoder: pre-train it for image classification and feed its last hidden layer as input to the RNN "decoder" that generates the sentence. &lt;br&gt;
The model is trained to maximize the likelihood of generating the correct description given the image, which is expressed as:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;θ∗=argmax∑I,Slogp(S∣I;θ)
\theta^* = arg max \sum_{I,S} log p(S|I;\theta)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mbin mtight"&gt;∗&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;g&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;I&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;S&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;g&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;I&lt;/span&gt;&lt;span class="mpunct"&gt;;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;

&lt;p&gt;where θ are the parameters of the model, I is the input image, and S is the correct caption. The loss is the negative log likelihood of the correct word at each step:&lt;br&gt;&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L(I,S)=−∑t=1Nlogpt(St)
L(I,S) = - \sum_{t=1}^N log p_t(S_t)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;I&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;g&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t 
vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
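&lt;p&gt;Spelled out, this loss is simply the negative sum of the log-probabilities the decoder assigns to each correct word. A minimal NumPy sketch (the per-step probabilities are made-up illustrative values):&lt;/p&gt;

```python
import numpy as np

def caption_loss(correct_word_probs):
    """Negative log likelihood of the correct word at each time step."""
    probs = np.asarray(correct_word_probs)
    return -np.sum(np.log(probs))

# Made-up per-step probabilities p_t(S_t) for a 4-word caption;
# the loss is 0 only if every probability is exactly 1.0
loss = caption_loss([0.9, 0.7, 0.8, 0.6])
```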



&lt;p&gt;Once training has minimized the loss (and thereby maximized the likelihood), we keep the model from the epoch where the validation loss is at its minimum. And tada! We have our model ready. All you have to do is feed in an image, and the expected output is a sentence describing that image.&lt;/p&gt;
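&lt;p&gt;Selecting that epoch can be done directly from the training history. A small sketch, assuming per-epoch validation losses such as those recorded in &lt;code&gt;model_fitting.history['val_loss']&lt;/code&gt; (the loss values here are made up):&lt;/p&gt;

```python
def best_epoch(val_losses):
    """Return the 1-based index of the epoch with the lowest validation loss."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best + 1

# Made-up validation losses for five epochs
best_epoch([0.62, 0.48, 0.41, 0.44, 0.52])  # returns 3
```

&lt;p&gt;In Keras, this is typically automated with the &lt;code&gt;ModelCheckpoint&lt;/code&gt; callback, using &lt;code&gt;monitor='val_loss'&lt;/code&gt; and &lt;code&gt;save_best_only=True&lt;/code&gt;.&lt;/p&gt;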

&lt;p&gt;This article is more about the in-depth knowledge of the tools used to build this use case. Once you are proficient enough, you can create your own use case and build your own models for it. &lt;/p&gt;

&lt;p&gt;Citations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html"&gt;A neural image caption generator.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=b61DPVFX03I&amp;amp;ab_channel=IBMTechnology"&gt;What is LSTM? - IBM Technology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://books.google.de/books?hl=en&amp;amp;lr=&amp;amp;id=mjVKEAAAQBAJ&amp;amp;oi=fnd&amp;amp;pg=PR9&amp;amp;dq=deep+learning+with+python&amp;amp;ots=AfeWwBVAYb&amp;amp;sig=NSEaaIUrBiP7DdFEKVETo8R2aEg&amp;amp;redir_esc=y#v=onepage&amp;amp;q=deep%20learning%20with%20python&amp;amp;f=false"&gt;Deep Learning using python.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
      <category>programming</category>
      <category>community</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Drastically decrease the size of your Docker application</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 03 Jan 2023 17:15:15 +0000</pubDate>
      <link>https://dev.to/meetkern/drastically-decrease-the-size-of-you-docker-application-jfa</link>
      <guid>https://dev.to/meetkern/drastically-decrease-the-size-of-you-docker-application-jfa</guid>
<description>&lt;p&gt;Containers are amazing for building applications because they allow you to pack up a program together with all its dependencies and execute it wherever you like. That is why our application consists of 20+ individual containers, forming our data-centric IDE for NLP, which you can check out here: &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't know what Docker or a container is, here's a short rundown: Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.&lt;/p&gt;

&lt;p&gt;Using Docker, you can run many containers simultaneously on a single host. This can be useful for a variety of purposes, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolating different applications from each other, so they don't interfere with each other.&lt;/li&gt;
&lt;li&gt;Testing new software in a contained environment, without the need to set up a new machine or install any dependencies.&lt;/li&gt;
&lt;li&gt;Running multiple versions of the same software on the same machine, without having to worry about version conflicts.&lt;/li&gt;
&lt;li&gt;Packaging and distributing applications in a consistent and easy-to-use way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Docker allows developers to easily create, deploy, and run applications in a containerized environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem of size
&lt;/h2&gt;

&lt;p&gt;One problem of Docker containers is that they can get quite large. Because the container, well, contains everything that the program needs to run, the total size of a single container can quickly get to a couple of gigabytes. &lt;/p&gt;

&lt;p&gt;Version 1.4 of our application took up about 10.96 GB of disk space. While that's not absolutely enormous for a modern application, we saw a lot of potential to improve usability by decreasing the total size. Smaller is always better, especially considering that not all of our users have fast internet connections, and almost 11 GB can take quite some time to download. &lt;/p&gt;

&lt;p&gt;Ultimately, we managed to cut the required disk space by almost 50%, down to 5.2 GB. How did we manage to do this? &lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing smaller parent images
&lt;/h2&gt;

&lt;p&gt;First, let's take a look at parent images for Docker containers. In Docker, a parent image is the image from which a new image is built. When you create a new Docker image, you are usually creating it based on an existing image, which serves as the parent image for the new image.&lt;/p&gt;

&lt;p&gt;For example, let's say you want to create a new Docker image for a web application. You might start by using an existing image such as &lt;code&gt;ubuntu:18.04&lt;/code&gt; as the base, or parent, image. You would then add your application code and any necessary dependencies to the image, creating a new child image.&lt;/p&gt;
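&lt;p&gt;In a Dockerfile, the parent image is simply whatever follows the &lt;code&gt;FROM&lt;/code&gt; instruction. A minimal sketch (the file and command names are illustrative):&lt;/p&gt;

```dockerfile
# ubuntu:18.04 is the parent image; everything below builds the child image
FROM ubuntu:18.04
WORKDIR /app
COPY . /app
RUN apt-get update
CMD ["./start.sh"]
```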

&lt;p&gt;The parent image provides a foundation for the child image, and all of the files and settings in the parent image are inherited by the child image. This allows you to create new images that are based on a known, stable foundation, and ensures that your new images have all of the necessary dependencies and configurations.&lt;/p&gt;

&lt;p&gt;The new child image can then be used to build your container and run your application. &lt;/p&gt;

&lt;p&gt;There are many parent images you could choose from. You can check them out at &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;https://hub.docker.com/&lt;/a&gt;. Most of our containers used the &lt;code&gt;python:3.9&lt;/code&gt; parent image. This image comes with a full Python installation built on top of Debian Linux. Technically, this is just fine for what we do. The thing is, the image alone is 865 MB large, at least for the amd64 architecture. &lt;/p&gt;

&lt;p&gt;Maybe something smaller would do the job just as well. The &lt;code&gt;python:3.9-alpine&lt;/code&gt; image, for example, is built on Alpine Linux, a super tiny Linux distribution. The image &lt;code&gt;python:3.9-slim&lt;/code&gt; is also substantially smaller.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f2juzmn3aadczxhi6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f2juzmn3aadczxhi6u.png" alt="Image description" width="734" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then tried out the smaller parent images for all of our child images to see if they still run. For some images we had to stay with the normal &lt;code&gt;python:3.9&lt;/code&gt; image, but the majority of images are just running normally with &lt;code&gt;python:3.9-alpine&lt;/code&gt; or &lt;code&gt;python:3.9-slim&lt;/code&gt;. This reduced the total size of the application quite a lot!&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared layers
&lt;/h2&gt;

&lt;p&gt;Another thing we optimized was the use of shared layers. Docker images consist of multiple layers, which can be shared between different images. These shared layers have to be downloaded and stored on disk only once, so increasing the usage of shared layers reduces both download time and disk consumption. Following this approach, we created custom Docker parent images that come with the Python dependencies needed by the refinery services preinstalled.&lt;/p&gt;
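&lt;p&gt;The idea can be sketched as two Dockerfiles: a custom parent image that preinstalls the common Python dependencies, and service images built on top of it that only add their own code (the image and file names are illustrative, not the actual refinery images):&lt;/p&gt;

```dockerfile
# Custom parent image: built once, its layers are downloaded and stored only once
FROM python:3.9-slim
COPY requirements-common.txt .
RUN pip install --no-cache-dir -r requirements-common.txt
```

&lt;p&gt;Each service then reuses those shared layers:&lt;/p&gt;

```dockerfile
# Service image: inherits the shared layers, adds only service-specific code
FROM our-registry/common-parent:latest
WORKDIR /app
COPY . /app
CMD ["python", "app.py"]
```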

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmlik5jyx9kczluzc15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmlik5jyx9kczluzc15.png" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above you can see a comparison of the image sizes before and after. In the size column the effect of the choice of the smaller parent images is visible. The effect of the shared layers is shown in the shared and unique size columns.&lt;/p&gt;

&lt;p&gt;Those are some tricks we used to decrease the needed disk space for our application. If you found this article useful please leave a like or follow our author. If you have great tips on how to reduce the size of an application that uses containers, please leave them in the comments below!&lt;/p&gt;

</description>
      <category>emptystring</category>
    </item>
    <item>
      <title>How we managed to build our open-source content library crazy fast</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 15 Dec 2022 16:57:07 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-managed-to-build-our-open-source-content-library-crazy-fast-3me</link>
      <guid>https://dev.to/meetkern/how-we-managed-to-build-our-open-source-content-library-crazy-fast-3me</guid>
<description>&lt;p&gt;Our newest project here at Kern AI offers you a fantastic library of modules called bricks, with which you can enrich your NLP text data. Our content library seamlessly integrates into our main tool, Kern AI refinery, but it also exposes the source code of every module, giving you maximum control. All modules can also be tested by calling an endpoint via an API. &lt;/p&gt;

&lt;p&gt;We managed to build this incredible tool in less than two months, thanks in large part to the amazing team at Kern, but also to the stunning capabilities of DigitalOcean's App Platform. We also sped up development by automating large parts of our content management system. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;You can try out bricks here!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's first talk about the general structure of bricks and then dive into a little bit more detail! &lt;/p&gt;

&lt;h2&gt;
  
  
  How bricks is structured
&lt;/h2&gt;

&lt;p&gt;Bricks is built using four components: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend using next.js built with Tailwind UI&lt;/li&gt;
&lt;li&gt;Backend with Strapi and a managed PostgreSQL database&lt;/li&gt;
&lt;li&gt;Service to serve the live endpoints on bricks&lt;/li&gt;
&lt;li&gt;A separate search module to easily find modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Frontend
&lt;/h3&gt;

&lt;p&gt;The overall design of bricks should match the one used in refinery. The bricks UI is created using React and deployed via Next.js. We also used Tailwind for the UI elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend
&lt;/h3&gt;

&lt;p&gt;For the backend, we use Strapi, which is an awesome open-source content management system. Strapi is connected to a PostgreSQL database to store all the content that is displayed on bricks. The frontend connects to the backend via an API to then display all the content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7k44vtg4vox4im0eaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7k44vtg4vox4im0eaf.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Managing content with Strapi itself is super easy, but to make things even easier for us, we wrote an automation script that fetches new modules created for bricks and automatically adds them to Strapi. That's why the source code of a bricks module needs to be in a specific format in order to be added to Strapi.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live endpoints
&lt;/h3&gt;

&lt;p&gt;Every module can be tested right from bricks itself. On the right side of every module, you'll see a window that allows you to try out the module without the need to install anything. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29daj47dx1sqwj5wj57i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29daj47dx1sqwj5wj57i.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Providing this was very important for us, as we want users to find out what exactly they get with every module and test the module with some of their own data. The default input is usually some text or sometimes some additional parameters for the endpoint. &lt;/p&gt;

&lt;h3&gt;
  
  
  Search module
&lt;/h3&gt;

&lt;p&gt;To quickly find the right modules, we also built a custom search module. The search module uses a small transformer model to embed the names of all modules, which can then be searched very quickly. &lt;/p&gt;
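&lt;p&gt;Once the module names are embedded, search reduces to a nearest-neighbour lookup over those vectors. A minimal NumPy sketch with made-up 3-dimensional embeddings (real transformer embeddings have hundreds of dimensions):&lt;/p&gt;

```python
import numpy as np

def search(query_vec, module_vecs, module_names):
    """Rank module names by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = module_vecs / np.linalg.norm(module_vecs, axis=1, keepdims=True)
    scores = m @ q  # cosine similarity of each module to the query
    order = np.argsort(-scores)
    return [module_names[i] for i in order]

names = ["language_detection", "sentence_complexity", "name_extraction"]
vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]])
search(np.array([0.2, 0.8, 0.1]), vecs, names)  # ranks "sentence_complexity" first
```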

&lt;p&gt;Let's now take a closer look at the technologies we used to quickly get bricks live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging DigitalOcean's App Platform
&lt;/h2&gt;

&lt;p&gt;The App Platform is a convenient and cheap way to deploy your web apps. Instead of deploying an app on a virtual machine that you'll have to manage yourself, the app will run in a Docker container. That way you don't have to think about the underlying infrastructure and also get the benefit of easy scalability. It's also a bit cheaper than hosting your app on a single VM. In the case of DigitalOcean, you also get the option to auto-deploy from a GitHub repository, which is super handy.&lt;/p&gt;

&lt;p&gt;There are many cloud platforms out there offering such a service, but for our purposes, we chose to use DigitalOcean. This post is not sponsored by them, we just like their service a lot. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.strapi.io/developer-docs/latest/setup-deployment-guides/deployment/hosting-guides/digitalocean-app-platform.html#add-environment-variables" rel="noopener noreferrer"&gt;To get started with bricks, we used this tutorial on how to deploy Strapi to DigitalOcean&lt;/a&gt;. We highly recommend you to check it out as well if you would like to use Strapi on DigitalOcean, as it was really helpful to get us started with the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-deploy from a GitHub repository
&lt;/h2&gt;

&lt;p&gt;To deploy on DigitalOcean, you can simply attach a GitHub repository, from which the app will automatically get deployed. In our case, we use the auto-deploy function for our endpoint service, so that new modules added to bricks will automatically get integrated. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaa98o9xcyk4dw51f9td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaa98o9xcyk4dw51f9td.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But before we can do that, we first need to deploy our backend and frontend components. To keep things clear, we deploy them separately and also keep the backend and frontend in separate repositories. DigitalOcean also allows you to connect your app to a managed database, which is super convenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up a managed database
&lt;/h2&gt;

&lt;p&gt;Before we can deploy the backend, we need a managed PostgreSQL database first. DigitalOcean offers many different database types, but PostgreSQL should be just fine for our needs. When deploying Strapi on DigitalOcean, you can also choose a cheaper dev database for your app. However, we had a lot of trouble getting that dev database to run, so we instead directly went for the managed database that is meant to be used in production anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an App on DigitalOcean
&lt;/h2&gt;

&lt;p&gt;Next up, we are going to create our first App on DigitalOcean. The app will host the Strapi backend of the site and will be connected to the managed PostgreSQL database we created in the previous step. Deploying the backend is fairly easy, you simply select the GitHub repo and the fitting directory you want to deploy, and DigitalOcean will handle all the rest for you. You can also opt-in for auto-deploy, and your app will be redeployed whenever there is a new change to your repository. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a second App for the frontend
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ozbi7o5e24gsoimhcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ozbi7o5e24gsoimhcf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it is technically possible to host the backend and the frontend on the same app, we chose not to do that. Setting up the frontend was much easier in a separate app, and apps are very cheap in general, so we would only have saved a few dollars by deploying to the same app. We thought it would not be worth the hassle. The frontend gets all the information from the backend via a simple API call, so the frontend and backend don't need to be connected in any other way. &lt;/p&gt;

&lt;p&gt;Building the second app for the frontend is essentially the same procedure as for the backend. You simply select the repository and the directory and let DigitalOcean do the work for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the endpoint app
&lt;/h2&gt;

&lt;p&gt;Once backend and frontend are up and running, we need to deploy the service that is running our endpoints. Otherwise, a user would still be able to access bricks and check out the modules, but they couldn't directly try them out on the site itself. &lt;/p&gt;

&lt;p&gt;The procedure is the same as before: connect your GitHub repository and deploy a containerized application via DigitalOcean. The endpoint service uses FastAPI to deliver the results of each endpoint to bricks. So far, a single service is enough to serve all the 50+ endpoints available on bricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using bricks to quickly enrich datasets for NLP
&lt;/h2&gt;

&lt;p&gt;We hope that you liked this insight into the structure behind bricks. You can &lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;try out bricks here&lt;/a&gt; to inspect the result for yourself. &lt;/p&gt;

&lt;p&gt;If you have any questions or feedback you would like to share, feel free to post it in the comments down below. Have fun using bricks! &lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Introducing bricks, an open-source content-library for NLP</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 08 Dec 2022 14:50:54 +0000</pubDate>
      <link>https://dev.to/meetkern/introducing-bricks-an-open-source-content-library-for-nlp-2pg5</link>
      <guid>https://dev.to/meetkern/introducing-bricks-an-open-source-content-library-for-nlp-2pg5</guid>
      <description>&lt;p&gt;This week we launched bricks, an open-source library which provides enrichments for your natural language processing projects. Our main goal with bricks is to shorten the amount of time that you need from idea to implementation. Bricks also seamlessly integrates into our main tool, the &lt;a href="https://www.kern.ai/" rel="noopener noreferrer"&gt;Kern AI refinery&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Let's take a closer look at the structure of bricks and how to use it. You'll find bricks here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;https://bricks.kern.ai/home&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Structure of a brick module
&lt;/h2&gt;

&lt;p&gt;In each module of bricks, you will find the source code for the function. You can directly use a bricks module in refinery, either by directly copying the source code or via the bricks integration that will be available in the next release of refinery 1.7. Of course, this code could also be used outside of refinery. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fove9b1tcqejzn6avwflw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fove9b1tcqejzn6avwflw.png" alt="Image description" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the right-hand side, you can directly try out the module over a live endpoint that we've deployed. You can try out the module with the example input that is already provided, or you can type something yourself and try it out! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18vibwemyo4hpyqtd06c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18vibwemyo4hpyqtd06c.png" alt="Image description" width="602" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of bricks modules
&lt;/h2&gt;

&lt;p&gt;Currently, there are three main types of modules in bricks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Classifiers:
&lt;/h3&gt;

&lt;p&gt;As the name suggests, these modules can be used to classify something. Need to find out the language of your text or get the complexity of it? You'll find what you need in the classifiers! &lt;/p&gt;

&lt;h3&gt;
  
  
  Extractors:
&lt;/h3&gt;

&lt;p&gt;The extractors are really useful if you would like to pull certain information or entities from your text. Most bricks modules can currently be found here; you'll find modules to extract metrics, times, names, addresses, and many more useful things! We've built all of these modules in a way that they can instantly be used for labeling functions in refinery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generators:
&lt;/h3&gt;

&lt;p&gt;This type of module generates some new form of output, such as a translation or a cleaned or corrected version of a text. In the generators, you will also find two premium functions, for which you'll need an API key of an external provider to use them, in this case for language translation. However, it's also very important to us to always provide similar modules that don't need an API key. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using a bricks module in refinery
&lt;/h2&gt;

&lt;p&gt;Let's say that we have a dataset with news articles, and we want to categorize them by their complexity. We then go to the &lt;code&gt;sentence complexity&lt;/code&gt; module in bricks and copy all the source code.&lt;/p&gt;
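&lt;p&gt;To give a feel for what such a module computes, here is a tiny, purely illustrative complexity heuristic based on average sentence length; the actual &lt;code&gt;sentence complexity&lt;/code&gt; module in bricks is more sophisticated, so treat this only as a sketch:&lt;/p&gt;

```python
def sentence_complexity(text):
    """Toy heuristic: label a text by its average number of words per sentence."""
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    if avg_len > 20:
        return "complex"
    if avg_len > 10:
        return "medium"
    return "simple"

sentence_complexity("The cat sat on the mat.")  # returns "simple"
```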

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi6kba2gr2j4oll1h4c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi6kba2gr2j4oll1h4c8.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then go back to our project in refinery and create a new attribute calculation, which we can do on the settings page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyshgrwb8p259n1748lk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyshgrwb8p259n1748lk.png" alt="Image description" width="788" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then paste in the code and enter the name of our attribute, in our case &lt;code&gt;headlines&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs9osg8fupdgpd50nnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs9osg8fupdgpd50nnn.png" alt="Image description" width="800" height="390"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;As a result, we'll then get the sentence complexity of each of our headlines that we have in our dataset. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ejar2tk2i7pzffiy32e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ejar2tk2i7pzffiy32e.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of this takes less than a minute to implement. &lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to bricks
&lt;/h2&gt;

&lt;p&gt;Like all projects at Kern, bricks is open-source, meaning that you get full access to the source code. You can also contribute to bricks if you've built something that you would like to share and that you think would be useful to others. Should you have a great idea or implementation, feel free to open an issue on our GitHub page. &lt;a href="https://github.com/code-kern-ai/bricks" rel="noopener noreferrer"&gt;You can find the bricks GitHub page here.&lt;/a&gt; There you'll also find a detailed explanation of how to contribute to bricks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=S0cPZ5fqsNo" rel="noopener noreferrer"&gt;We've also made a tutorial on YouTube&lt;/a&gt; in which our DevRel guy Div shows you all the neccessary steps to contribute. &lt;/p&gt;

&lt;p&gt;You may also join our Discord community, where you can ask questions and discuss things with the wonderful Kern community. Join us here: &lt;a href="https://discord.gg/WAnAgQEv" rel="noopener noreferrer"&gt;https://discord.gg/WAnAgQEv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>portfolio</category>
    </item>
    <item>
      <title>How we automated license checking for our Python &amp; JS dependencies</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Fri, 28 Oct 2022 10:06:03 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-automated-license-checking-for-our-python-js-dependencies-5900</link>
      <guid>https://dev.to/meetkern/how-we-automated-license-checking-for-our-python-js-dependencies-5900</guid>
<description>&lt;p&gt;There are many popular license types for open-source software out there, such as the MIT, BSD or Apache Software License. When building software privately, these license types are a minor concern. However, things get much more complicated when building a commercial product, even an open-source one. For us as a company, that meant a lot of uncertainty about how to handle these licenses.&lt;/p&gt;

&lt;p&gt;In a nutshell, when using a dependency, you'll need to ensure that the dependency allows for commercial use. That's not a problem with the majority of the licenses, but there are some lesser-known ones that could cause some trouble. &lt;/p&gt;

&lt;p&gt;For our tool, the Kern AI refinery, we use dozens of different libraries. Checking all the dependencies manually for all the repositories would be an extremely tedious task, to say the least. So, our machine learning engineer Felix thought to himself "why don't I automate this then?". And that's exactly what he did! &lt;/p&gt;

&lt;h2&gt;
  
  
  Checking Python licenses with LicenseCheck
&lt;/h2&gt;

&lt;p&gt;We have a lot of Python dependencies, so checking these licenses was our biggest priority. When it comes to checking licenses of Python dependencies, we've found a really cool tool called &lt;a href="https://github.com/FHPythonUtils/LicenseCheck" rel="noopener noreferrer"&gt;LicenseCheck&lt;/a&gt;, which can check the requirements.txt file of a GitHub repository and find the licenses for all the dependencies listed inside the file. LicenseCheck can simply be installed via pip and can then be used to print out all the licenses. This already helps a lot, but when you have 50+ repositories, it's still a lot of manual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Python script
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjyuj4zgofy9tv364ik5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjyuj4zgofy9tv364ik5.png" alt="Image description"&gt;&lt;/a&gt;&lt;em&gt;Code snipped from the script&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To check &lt;strong&gt;all&lt;/strong&gt; of our repositories, our ML engineer Felix built an amazing Python script that completely automates license checking for our Python dependencies. You can find the whole script &lt;a href="https://github.com/code-kern-ai/util-scripts" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you are interested in using it! &lt;/p&gt;

&lt;p&gt;How does the script work? In a nutshell, you paste in the repositories you want to check by putting the name and URL of each repo inside a dictionary. Feel free to add as many repos as you like. The script then loops over all the repositories and checks each repo's requirements.txt file. &lt;/p&gt;
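
&lt;p&gt;The full script is in the linked repository; the core idea can be sketched like this (the repo URL and helper names below are placeholders, not the actual script):&lt;/p&gt;

```python
# Simplified sketch of the license-checking loop. The real script lives in
# code-kern-ai/util-scripts; the names and the URL below are placeholders.
import re
from urllib.request import urlopen

REPOS = {
    "refinery": "https://raw.githubusercontent.com/your-org/refinery/main/requirements.txt",
}

def parse_requirements(text: str) -> list:
    """Extract bare package names from a requirements.txt body."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # keep only the package name, dropping version specifiers
        match = re.match(r"[A-Za-z0-9._-]+", line)
        if match:
            names.append(match.group(0))
    return names

def collect_dependencies(repos: dict) -> dict:
    """Fetch each repo's requirements.txt and map repo name to its packages."""
    deps = {}
    for name, url in repos.items():
        body = urlopen(url).read().decode("utf-8")
        deps[name] = parse_requirements(body)
    return deps
```

&lt;p&gt;The license lookup for each collected dependency would then be delegated to LicenseCheck.&lt;/p&gt;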

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5s12yu11ogmqv29434s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5s12yu11ogmqv29434s.png" alt="Image description"&gt;&lt;/a&gt;&lt;em&gt;Using the script, you can simply check all the licenses in your repository&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Checking the results
&lt;/h2&gt;

&lt;p&gt;Finally, the script then saves all the results into a handy Excel spreadsheet, in which you'll get a list with all your dependencies and the corresponding license. &lt;/p&gt;
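
&lt;p&gt;The export step can be sketched as follows. The actual script writes an Excel file; this dependency-free stand-in uses Python's built-in csv module, and the column names are illustrative:&lt;/p&gt;

```python
# Sketch of exporting the results as a spreadsheet-style file.
# The actual script writes Excel; this stand-in uses the stdlib csv module.
import csv
import io

def write_license_report(rows, fileobj):
    """rows: list of (repository, package, license) tuples."""
    writer = csv.writer(fileobj)
    writer.writerow(["repository", "package", "license"])
    writer.writerows(rows)

# demo: write one row into an in-memory buffer
buf = io.StringIO()
write_license_report([("refinery", "pandas", "BSD-3-Clause")], buf)
```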

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm5sklzojsrglejo8em4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm5sklzojsrglejo8em4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By running this script once, we got the licenses of 114 dependencies in a single list. We'll likely add even more dependencies in the future, but with this tool we can re-check them with very little effort. &lt;/p&gt;

&lt;h2&gt;
  
  
  Finding licenses for JavaScript dependencies
&lt;/h2&gt;

&lt;p&gt;Python is not the only programming language that we use. Our application is also built with JavaScript, mainly for the UI and the dashboards for our admins. Sadly, the LicenseCheck tool doesn't work for JavaScript or any language other than Python. &lt;/p&gt;

&lt;p&gt;As an alternative, we've found &lt;a href="https://github.com/pivotal/LicenseFinder" rel="noopener noreferrer"&gt;LicenseFinder&lt;/a&gt;, an awesome open-source tool for checking JavaScript dependencies. It inspects the package.json file of a repository and tells you which licenses are used. You can also define a list of permitted licenses, and LicenseFinder will check whether your dependencies fall within it. It works very similarly to LicenseCheck for Python. &lt;/p&gt;

&lt;p&gt;We hope you'll find this helpful. Let us know in the comments if you've found similar tools for other programming languages so that other people can discover them as well. &lt;/p&gt;

&lt;p&gt;Make sure to check out our GitHub page to find out more about our open-source, data-centric IDE which we are building! &lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>python</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
