<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonard Püttmann</title>
    <description>The latest articles on DEV Community by Leonard Püttmann (@leonardpuettmann).</description>
    <link>https://dev.to/leonardpuettmann</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F874692%2Fafa186b4-fa9e-40b8-b935-43352b0def06.jpg</url>
      <title>DEV Community: Leonard Püttmann</title>
      <link>https://dev.to/leonardpuettmann</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leonardpuettmann"/>
    <language>en</language>
    <item>
      <title>How to deploy ML models on Azure Kubernetes Service (AKS)</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 11 Apr 2023 19:59:21 +0000</pubDate>
      <link>https://dev.to/leonardpuettmann/how-to-deploy-ml-models-on-azure-kubernetes-service-aks-30o0</link>
      <guid>https://dev.to/leonardpuettmann/how-to-deploy-ml-models-on-azure-kubernetes-service-aks-30o0</guid>
      <description>&lt;p&gt;In this article, I am going to provide a step by step tutorial on how to deploy a machine learning model on Azure Kubernetes Service (AKS). I will also outline when you should and shouldn’t use AKS. Let’s go! &lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Kubernetes Service in a nutshell
&lt;/h2&gt;

&lt;p&gt;Azure Kubernetes Service (AKS) is a fully managed Kubernetes container orchestration service that simplifies the deployment and management of containerized applications. With AKS, you can quickly deploy and scale containerized apps without worrying about the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;AKS is a handy platform for running, managing, and orchestrating containerized applications. It allows developers to focus on building and deploying their applications rather than worrying about infrastructure-related tasks. AKS supports a wide range of container technologies and tools, which makes it versatile and easy to use. Whether you are building a new application from scratch or migrating an existing one to the cloud, AKS provides a seamless experience for deploying and scaling your application. &lt;/p&gt;

&lt;h3&gt;
  
  
  When to use AKS
&lt;/h3&gt;

&lt;p&gt;AKS is meant for production-level workloads. One of the big upsides of AKS is its scaling capabilities. If you are processing a lot of data or want to serve large machine learning models to lots of people, AKS could be an option for you. In terms of deployment options, it falls under the “managed” category, because you still have to create and manage the AKS cluster yourself, even if Kubernetes does a lot of the work for you. Azure ML also provides other deployment options which are entirely managed by Azure, so before you deploy on AKS, you may want to take a look at the other deployment methods that Azure offers. &lt;/p&gt;

&lt;p&gt;In a nutshell, you should use AKS if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have heavy workloads&lt;/li&gt;
&lt;li&gt;Scalability is important to you&lt;/li&gt;
&lt;li&gt;You want or need to manage your compute resources yourself&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project requirements
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at the steps to deploy on AKS and the requirements for the project. To follow along, you should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An active Azure subscription.&lt;/li&gt;
&lt;li&gt;An Azure ML workspace.&lt;/li&gt;
&lt;li&gt;Azure CLI installed.&lt;/li&gt;
&lt;li&gt;Python 3.9 or higher and the Azure ML Python SDK v2 installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this tutorial, I am going to use the Azure Portal as well as the Azure CLI to create and configure the Kubernetes cluster. Then, the Azure ML Python SDK v2 will be used to actually connect the compute to Azure ML in order to deploy a model. &lt;/p&gt;

&lt;p&gt;For the actual model deployment we need: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A machine learning model as a &lt;code&gt;.pkl&lt;/code&gt; file (or equivalent).&lt;/li&gt;
&lt;li&gt;A conda env file as a &lt;code&gt;.yaml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;scoring.py&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;
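&lt;p&gt;For reference, a conda env file for a scikit-learn model could look roughly like this (a sketch; pin the versions your model was actually trained with): &lt;/p&gt;

```yaml
# conda.yml -- minimal environment sketch
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
      - scikit-learn
      - joblib
      - azureml-inference-server-http
```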

&lt;p&gt;If you need a reference on how these files should look, you can get a dummy model, env and scoring script &lt;a href="https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/online/model-1" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Optionally, you can also check out my GitHub for the code used to deploy via the Python SDK v2. &lt;/p&gt;

&lt;h2&gt;
  
  
  Project outline
&lt;/h2&gt;

&lt;p&gt;Next, let’s take a look at the required steps for this project. &lt;/p&gt;

&lt;p&gt;First, we are going to create a new AKS in the Azure Portal. Optionally, this can also be done with the Azure CLI. &lt;/p&gt;

&lt;p&gt;After the AKS cluster is created, we need to configure access to and from the cluster in order to install the Azure ML extension on AKS and access the cluster. &lt;/p&gt;

&lt;p&gt;Then the Azure ML extension is installed on the AKS cluster via the CLI. If you are using an Azure Arc AKS cluster, this can also be done via the Azure Portal. &lt;/p&gt;

&lt;p&gt;Once the extension is installed, the AKS cluster can be attached to an Azure ML workspace to train or deploy ML models. &lt;/p&gt;

&lt;p&gt;The cluster can then be used to deploy a machine learning model using the model and conda env.&lt;/p&gt;

&lt;p&gt;Here's my attempt to visualize this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjeazrrxq8b2530ckcqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjeazrrxq8b2530ckcqx.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an AKS cluster
&lt;/h2&gt;

&lt;p&gt;Let’s provision the AKS cluster. You can change all these settings, like availability, pricing and so on, so that they suit your needs. When it comes to the node size, I would advise opting for a node that has at least 8 GB of RAM. With less powerful nodes, I often faced the problem that the Azure ML k8s extension didn’t install because the memory maxed out during installation. Here is my configuration: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8th0pv5q59mz3dpm1qw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8th0pv5q59mz3dpm1qw6.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Providing access to the cluster
&lt;/h2&gt;

&lt;p&gt;After the AKS cluster is deployed, we need to configure the service principal so that the cluster is able to install the Azure ML k8s extension and get attached to the Azure ML workspace. This is the part where I struggled a bit and which showed me that I need to brush up on my skills regarding Azure AD and access policies. &lt;/p&gt;

&lt;p&gt;Some of the following steps might be overkill. There are probably other approaches that work, too. So feel free to let me know what I could have done better here! &lt;/p&gt;

&lt;p&gt;Anyway: in Azure Active Directory, we create a new group called AKS-group and add our user and our Azure ML environment to that group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eo50fq5dx29lzkpr14n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eo50fq5dx29lzkpr14n.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, in the AKS cluster, we need to set the authentication method to &lt;code&gt;Azure AD authentication with Kubernetes RBAC&lt;/code&gt;. We then add the AKS cluster to the previously created group. To be able to attach the cluster to Azure ML later, make sure to activate &lt;code&gt;Kubernetes local accounts&lt;/code&gt;. Otherwise, the attachment will fail. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40072n4dw200cxry878l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40072n4dw200cxry878l.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, head to &lt;code&gt;Access control (IAM)&lt;/code&gt; and grant access to and from the AKS cluster. I chose to grant admin rights for a bit of peace of mind here, but you can go for any access level that is sufficient to allow the installation of extensions and the attachment to Azure ML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the Azure ML k8s extension
&lt;/h2&gt;

&lt;p&gt;Now that the access is set, it’s time to actually install the Azure ML extension so that the cluster can be used in Azure ML. As of writing this in April 2023, this can be done via the Azure Portal for Azure Arc clusters or via the CLI for typical AKS clusters. Having the latter, I used the following command to install the extension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctgab1hi4sgbtva8x9dj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctgab1hi4sgbtva8x9dj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;az k8s-extension create --name Aml-extension --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters --cluster-name whale --resource-group MlGroup --scope cluster&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Quite a long command. Let’s take a look at what’s happening here: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;--name&lt;/code&gt; is the name you would like to give to the extension. Choose any name you like for this. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;--extension-type&lt;/code&gt; is the actual extension we would like to install. In this case, we need the &lt;code&gt;Microsoft.AzureML.Kubernetes&lt;/code&gt; extension. &lt;/p&gt;

&lt;p&gt;Depending on whether you want to use the cluster for training or inference, you need to set &lt;code&gt;enableTraining=True&lt;/code&gt; and/or &lt;code&gt;enableInference=True&lt;/code&gt; under the &lt;code&gt;--config&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;Point to the cluster on which you would like to install the extension with the &lt;code&gt;--cluster-name&lt;/code&gt; and &lt;code&gt;--resource-group&lt;/code&gt; flags.&lt;/p&gt;

&lt;p&gt;For secure deployments, you should configure SSL and set &lt;code&gt;allowInsecureConnections&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt; in the &lt;code&gt;--config&lt;/code&gt; settings.&lt;/p&gt;

&lt;p&gt;That’s it. The installation should take a couple of minutes. If the installation of the extension on your AKS cluster takes too long (15 minutes or more), the node size is probably too small. If you get an authentication error, you’ll need to revisit the access rights under &lt;code&gt;Access control (IAM)&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to the Azure ML workspace
&lt;/h2&gt;

&lt;p&gt;After the extension is installed, we can head over to our Azure ML workspace. In the compute section you can find, right beside the compute instances and compute clusters, a tab for Kubernetes clusters. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl7egdkgvexfhixiijow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl7egdkgvexfhixiijow.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To attach our cluster, simply click on “New” &amp;gt; “Kubernetes”. You should then be able to select the previously created AKS cluster and give this compute a name (I usually use the same name as the AKS cluster). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtczitlq13v1nh0nwloi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtczitlq13v1nh0nwloi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hit attach and after a couple of seconds your AKS cluster should be usable via Azure ML. Hurray! &lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the machine learning model
&lt;/h2&gt;

&lt;p&gt;For the next step, we are going to use some code with the Azure ML Python SDK v2. Before deploying, an endpoint is needed. This can be set up like this: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# deploy the model to AKS
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.ml.entities&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubernetesOnlineEndpoint&lt;/span&gt;

&lt;span class="n"&gt;online_endpoint_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8s-endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%m%d%H%M%f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# create an online endpoint
&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KubernetesOnlineEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;online_endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;this is a sample k8s endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_deplyoment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# then create the endpoint
&lt;/span&gt;&lt;span class="n"&gt;ml_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin_create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the endpoint is created, we can deploy a machine learning model. To do that, we provide the path to a machine learning model (in this case a .pkl file) as well as an environment, which requires a conda.yml and a base image, which you can get from Microsoft. For inference, we also need a Python script to initialize the model.&lt;/p&gt;

&lt;p&gt;Note that we don’t need to dedicate a whole node of our AKS cluster to a single model. For this dummy model, 0.1 CPUs and 0.5 GB of RAM are enough. Set this to a size suitable for your model. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.ml.entities&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubernetesOnlineDeployment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CodeConfiguration&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.ml.entities._deployment.resource_requirements_settings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ResourceRequirementsSettings&lt;/span&gt;

&lt;span class="c1"&gt;# configure the deployment
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.\model\model\sklearn_regression_model.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;conda_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.\model\environment\conda.yml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cpu-inference:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;blue_deployment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KubernetesOnlineDeployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;online_endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code_configuration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CodeConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.\model\onlinescoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ResourceRequirementsSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ResourceSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.5Gi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, it’s time to deploy: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;ml_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin_create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blue_deployment&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You should then see the AKS as a compute option. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tr92k283gbwcwz0v3q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tr92k283gbwcwz0v3q5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe you are wondering why we call this blue_deployment. This is because of the so-called blue-green deployment strategy, which allows for zero downtime during the deployment and update of the model later on. When a new version of the machine learning model is ready for deployment, it is deployed to the inactive (green) environment, which receives 0% of the traffic. Once the new version has been successfully deployed and tested, the traffic is switched from the active environment to the newly deployed one, making it the new active environment. You can read more on this &lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-safely-rollout-online-endpoints?view=azureml-api-2&amp;amp;tabs=azure-cli" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;The nice thing about Azure ML is that it allows you to manage and monitor the deployments with ease and provides a lot of great tools to keep track of your deployed models. &lt;/p&gt;

&lt;p&gt;In this article we went through the steps needed to deploy a machine learning model on Azure Kubernetes Service. If you have any questions or feedback, feel free to leave them in the comments or hit me up on &lt;a href="https://www.linkedin.com/in/leonard-p%C3%BCttmann-4648231a9/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;! &lt;/p&gt;

&lt;p&gt;Have fun deploying and thank you for reading. 🙂&lt;/p&gt;

</description>
      <category>azure</category>
      <category>kubernetes</category>
      <category>machinelearning</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Alleviate the pain of manual labeling and deploying models with weak supervision</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 14 Mar 2023 11:24:10 +0000</pubDate>
      <link>https://dev.to/meetkern/alleviate-the-pain-of-manual-labeling-and-deploying-models-with-weak-supervision-3051</link>
      <guid>https://dev.to/meetkern/alleviate-the-pain-of-manual-labeling-and-deploying-models-with-weak-supervision-3051</guid>
      <description>&lt;p&gt;Natural Language Processing (NLP) has seen significant improvements in recent years due to the advent of neural network models that have achieved state-of-the-art performance on a range of tasks like sentiment analysis, named entity recognition, and machine translation. However, these models require vast amounts of labeled data, which can be prohibitively expensive and time-consuming to obtain. With &lt;a href="https://www.kern.ai/"&gt;refinery and gates&lt;/a&gt;, we enable users to leverage the power of weak supervision in their production environment to mitigate the need for expensive manual labeling and costly model deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use weak supervision?
&lt;/h2&gt;

&lt;p&gt;Weak supervision is a promising alternative to traditional supervised learning that can help alleviate the need for vast amounts of labeled data. Weak supervision uses heuristics or rules to create noisy labels for the data. Noisy means that the labels are largely correct, but can contain more errors than labels obtained from manual labeling. These noisy labels can then be used to train a model that can generalize to new, unseen data. Or we can also optimize the weak supervision to use it directly for production! This approach is especially useful when there is no large labeled dataset available for a particular task or domain.&lt;/p&gt;

&lt;p&gt;In the context of NLP, weak supervision can be used to annotate text data, such as social media posts or online reviews, with labels that are less precise than those obtained through manual annotation. For example, a weak supervision approach to sentiment analysis might use a set of rules in the form of programmatic labeling functions to label all tweets containing positive emojis as "positive," and all tweets that use negative emojis as "negative." These noisy labels can then be used to train a model that can classify new tweets into positive or negative categories.&lt;/p&gt;
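&lt;p&gt;The core idea of combining several noisy label sources can be sketched as a simple majority vote in which heuristics are allowed to abstain: &lt;/p&gt;

```python
from collections import Counter

def majority_vote(votes):
    """Combine noisy labels from several heuristics for one example;
    None means a heuristic abstained."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None  # every heuristic abstained
    return counts.most_common(1)[0][0]

# e.g. four emoji/keyword heuristics voting on one tweet
label = majority_vote(["positive", "positive", None, "negative"])
```

&lt;p&gt;Real weak supervision systems go a step further and weight each source by its estimated accuracy instead of counting votes equally. &lt;/p&gt;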

&lt;p&gt;Despite the challenges, weak supervision has gained popularity in recent years due to its ability to quickly and easily generate labeled data. It can also be used in combination with traditional supervised learning to improve model performance further. For instance, a weak supervision approach can be used to generate initial labels for a dataset, which can then be refined by human annotators in a process called "bootstrapping." This iterative process of weak supervision and human annotation can help reduce the amount of manual labeling required and improve the quality of the final dataset.&lt;/p&gt;

&lt;p&gt;Another advantage of weak supervision is that it can be used to label rare or unusual events that are difficult to label using traditional supervised learning approaches. For example, it can be challenging to obtain labeled data for rare diseases or rare events in social media. However, a weak supervision approach can use domain-specific knowledge to generate labels for these rare events, enabling models to learn from limited data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to deal with noisy labels
&lt;/h2&gt;

&lt;p&gt;While weak supervision can be a powerful tool for training models with limited labeled data, it comes with its own set of challenges. Cheap labels are often noisy. Noisy labels can introduce errors and biases into the training data, and it can be difficult to quantify the quality of the labels generated by the weak supervision process. However, there are reliable methods that allow us to easily spot errors in labels quickly.&lt;/p&gt;

&lt;p&gt;One of those methods is called confident learning. Confident learning is a technique used to identify and correct errors in noisy labels, so this is perfect for weakly supervised data! The approach involves training a model on the noisy labels generated through weak supervision and then using the model's predictions to estimate the confidence of each label. Labels with low confidence can then be flagged as potentially incorrect and subjected to further scrutiny or correction. This method can help improve the quality of the final dataset and reduce the impact of errors introduced by weak supervision without the need to check all the labels manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V1qOfu0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tk08k1jxlqpzl9vz87vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V1qOfu0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tk08k1jxlqpzl9vz87vz.png" alt="Image description" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Heuristics - sources for cheap labels
&lt;/h2&gt;

&lt;p&gt;The classic approach to get labels would of course be to manually annotate the whole dataset. But that would be time-consuming, tedious, slow, and very expensive. Instead, we can only partly label our data by hand and try to find cheaper methods to obtain labels for the rest of our data. These cheap label sources will most likely be noisy. But if we can get labels from many different sources, then the final, weakly supervised label will be greater than the sum of its parts. In other words: it’s going to be accurate. In refinery, we call these sources heuristics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Labeling functions
&lt;/h3&gt;

&lt;p&gt;Labeling functions allow us to programmatically incorporate domain knowledge into the labeling process, similar to the expert systems that are still in place in many companies. We can do this with programming languages like Python. A domain expert has a mental model of how and why they would label things in a certain way. In the field of natural language processing, this could be certain words, the structure of sentences, or the author of a text. All of these can be expressed programmatically. Let’s imagine that we want to build a classifier to detect clickbait. We would quickly notice that a lot of clickbait starts with a number and that clickbait often addresses the reader directly. We could incorporate this with just a few lines of code: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2MwZ4ZBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2n2vym0akx2iomuweewy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2MwZ4ZBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2n2vym0akx2iomuweewy.png" alt="Image description" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These labeling functions won’t be perfect, but as long as they are better than guessing, we’ve already gained something. In the iterative process of getting labels, these functions can also be improved and debugged later on as we gain more insights into our data. &lt;/p&gt;
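&lt;p&gt;As a rough sketch, the two labeling functions described above could look like this in plain Python (the label name and the &lt;code&gt;headline&lt;/code&gt; field are illustrative, not refinery's actual schema):&lt;/p&gt;

```python
import re

CLICKBAIT, NO_LABEL = "clickbait", None  # label name is illustrative

def lf_starts_with_number(record):
    # A lot of clickbait starts with a number: "7 tricks to ..."
    if re.match(r"^\d+\s", record["headline"]):
        return CLICKBAIT
    return NO_LABEL

def lf_addresses_reader(record):
    # Clickbait often addresses the reader directly.
    if re.search(r"\b(you|your)\b", record["headline"].lower()):
        return CLICKBAIT
    return NO_LABEL
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; lets a function abstain on records it has no opinion about, which is exactly what makes such noisy heuristics safe to combine.&lt;/p&gt;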

&lt;h3&gt;
  
  
  Active learner
&lt;/h3&gt;

&lt;p&gt;Active learning is a technique used to select the most informative examples from a large unlabeled dataset to be labeled by a human annotator. The goal is to select examples that will provide the most value in improving the model's performance while minimizing the number of examples that need to be labeled. This approach can be especially useful when labeled data is scarce or expensive to obtain. It involves iteratively training a model on a small subset of labeled data, selecting the most informative examples to label, and retraining the model on the expanded labeled dataset. This process can be repeated until the model's performance reaches a desired level or the budget for labeling is exhausted.&lt;/p&gt;

&lt;p&gt;This is especially powerful in combination with pre-trained transformer models to embed our text data. These models handle all of the heavy lifting during the active learning part, so we can quickly get accurate results, even if we don’t have tons of data available. &lt;/p&gt;
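&lt;p&gt;To make the selection step concrete, here is a minimal sketch of least-confidence sampling; the scoring function is a made-up stand-in, and a real setup would train a model on the embedded texts:&lt;/p&gt;

```python
def predict_proba(text):
    # Stand-in for a model trained on a small labeled subset; a real
    # setup would use e.g. logistic regression on transformer
    # embeddings. Here longer headlines simply score higher.
    return min(len(text) / 40, 1.0)

def most_uncertain(unlabeled, k=2):
    # Probabilities closest to 0.5 are the most informative
    # candidates to hand to a human annotator next.
    return sorted(unlabeled, key=lambda t: abs(predict_proba(t) - 0.5))[:k]

queue = most_uncertain(["short", "a headline of twenty", "x" * 40], k=1)
```

&lt;p&gt;The returned records are labeled by hand, the model is retrained, and the loop repeats until the model is good enough or the labeling budget runs out.&lt;/p&gt;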

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h69SzD0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wlywm6og57hpe3vsh1co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h69SzD0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wlywm6og57hpe3vsh1co.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero-shot classification
&lt;/h3&gt;

&lt;p&gt;Zero-shot classification allows us to label data without having to explicitly train a model on that particular task or domain. Instead, we can use a pre-trained language model, such as BERT or GPT-3, that has been trained on a large corpus of text to generate labels for new, unseen data.&lt;/p&gt;

&lt;p&gt;To use zero-shot classification, we need to provide the language model with a set of labels or categories that we want to use for classification. The language model can then generate a score or probability for each label, indicating how likely the input text belongs to that label. This approach can be especially useful when we have a small labeled dataset or no labeled data at all for a particular task or domain.&lt;/p&gt;
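&lt;p&gt;What we do with those scores is straightforward: normalize them into probabilities and pick the most likely label. The raw scores below are made-up stand-ins for what an NLI-based zero-shot model (for example, a BART model fine-tuned on MNLI) would emit:&lt;/p&gt;

```python
import math

def to_label_probs(raw_scores):
    # Softmax: turn raw per-label scores into probabilities summing to 1.
    z = sum(math.exp(s) for s in raw_scores.values())
    return {label: math.exp(s) / z for label, s in raw_scores.items()}

# Made-up raw scores for one input text:
raw = {"politics": 0.3, "sports": 2.1, "entertainment": -0.5}
probs = to_label_probs(raw)
best = max(probs, key=probs.get)  # "sports"
```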

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dIZE9-AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lf7scinh9d62ylzy1mii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dIZE9-AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lf7scinh9d62ylzy1mii.png" alt="Image description" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, let's say we want to classify news articles into different topics, such as politics, sports, and entertainment. We can use a pre-trained language model to generate probabilities for each label. Because the model has already picked up the patterns and features that are characteristic of each category during pre-training, we only need to provide it with the label names. Then, when we give the language model a new, unseen news article, it generates probabilities for each label, indicating how likely the article belongs to each category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining heuristics for weak supervision in refinery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---WoXRbxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f06q70xik0aezjtlttbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---WoXRbxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f06q70xik0aezjtlttbw.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have now seen why weak supervision is so powerful. Let’s take a closer look at how we use all of these heuristics for weak supervision in refinery. &lt;/p&gt;

&lt;p&gt;In refinery, labels from different sources are combined into a single, weakly supervised label. The label sources include manual labels, labeling functions, active learning models, and zero-shot models. Let’s take a closer look at an example project, which contains news headlines. &lt;/p&gt;

&lt;h3&gt;
  
  
  A closer look under the hood
&lt;/h3&gt;

&lt;p&gt;Here are the steps by which we obtain the weakly supervised label. We build a DataFrame out of the source vectors of our heuristics, which contain the label &amp;lt;&amp;gt; record mappings as well as the confidence values for all of our label sources. Afterward, the predictions are calculated for all of the label sources by multiplying the precision with the confidence of each label source. These values are then fed into an ensemble voting system, which integrates all the relevant data from the noisy label matrix into the single weakly supervised label.&lt;/p&gt;

&lt;p&gt;The ensemble voting system works by retrieving the confidence values from all the label sources, aggregating them per label, and selecting the label with the highest aggregated confidence. The final confidence score is then calculated by subtracting the sum of the remaining votes from the highest aggregated confidence and passing the result through a sigmoid function. The resulting weakly supervised label and the final confidence value are then added to our record in refinery.&lt;/p&gt;
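&lt;p&gt;A minimal sketch of this aggregation, as a simplified reading of the steps above rather than refinery's actual implementation (the sources, precisions, and confidences are made up):&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def weakly_supervise(votes):
    # votes: one (label, source_precision, confidence) triple per source.
    # Each source's vote is weighted by its precision times its confidence.
    totals = {}
    for label, precision, confidence in votes:
        totals[label] = totals.get(label, 0.0) + precision * confidence
    winner = max(totals, key=totals.get)
    rest = sum(v for l, v in totals.items() if l != winner)
    # Margin between the winning label and all others, squashed to (0, 1).
    return winner, sigmoid(totals[winner] - rest)

label, conf = weakly_supervise([
    ("clickbait", 0.8, 0.9),     # labeling function
    ("clickbait", 0.7, 0.6),     # active learner
    ("no clickbait", 0.5, 0.4),  # zero-shot model
])
```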

&lt;h2&gt;
  
  
  Using weak supervision in production with gates
&lt;/h2&gt;

&lt;p&gt;The great thing about weak supervision is that we can use it directly in production as well. We have created custom labeling functions, already have machine learning models in our active learners, and include state-of-the-art transformer models to obtain labels. With weak supervision, we set up a powerful environment to label data efficiently and with ease. If this approach works well on the unlabeled data in our dataset, chances are it will also work well on new incoming data in a production environment. Even better: we can enrich our project with new incoming data and react to changes quickly. We can monitor the quality of our weak supervision system via the confidence scores, and create or change labeling functions or label some new data manually when needed. This ensures high accuracy at low cost when maintaining AI in production, and removes the stress of monitoring and re-deploying machine learning models on a regular basis. &lt;/p&gt;

&lt;p&gt;That’s why we created gates, which lets you use information from a weak supervision environment anywhere through an API. Our data-centric IDE for NLP allows you to use all these powerful weak supervision techniques to label data quickly. And through gates, you can access all your heuristics and active learners at the snap of a finger. There is no need to put models into production; you can access everything right away.&lt;/p&gt;
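&lt;p&gt;To give a rough idea of what calling such an API looks like, here is a hedged sketch; the endpoint URL, payload shape, and field names are assumptions for illustration, not the documented gates schema:&lt;/p&gt;

```python
import json

def build_gates_request(project_id, headline):
    # Hypothetical payload shape -- consult the gates documentation
    # for the real schema; all key names here are assumptions.
    return json.dumps({"project_id": project_id,
                       "record": {"headline": headline}})

body = build_gates_request("my-project", "10 stocks you need to buy now")
# Sending it would then look roughly like (URL is an assumption):
# requests.post("https://gates.kern.ai/predict", data=body,
#               headers={"Authorization": "Bearer YOUR_API_TOKEN"})
```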

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bk_fsJmJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfm7f3ywrv0mjyhm82zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bk_fsJmJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfm7f3ywrv0mjyhm82zd.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, weak supervision is an amazing technique for NLP that can help alleviate the need for vast amounts of labeled data. It can be used to label data quickly and easily, especially for rare or unusual events, and can be combined with traditional supervised learning to improve model performance further. While it comes with its own set of challenges, recent research has shown that combining weak supervision with other techniques can help mitigate these challenges and improve model performance.&lt;/p&gt;

</description>
      <category>weaksupervision</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How we used AI to automate stock sentiment classification</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 21 Feb 2023 14:40:23 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-used-ai-to-automate-stock-sentiment-classification-45ph</link>
      <guid>https://dev.to/meetkern/how-we-used-ai-to-automate-stock-sentiment-classification-45ph</guid>
      <description>&lt;p&gt;This article is meant to accompany this video: &lt;a href="https://www.youtube.com/watch?v=yeML0vX0yLw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=yeML0vX0yLw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we would like to provide a step-by-step tutorial in which we build a Slack bot that sends us a daily message about the sentiment of the news on our stocks. To do this, we need a tool that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatically fetch data from news sources&lt;/li&gt;
&lt;li&gt;call our ML model via an API to get predictions&lt;/li&gt;
&lt;li&gt;send out our enriched data to Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will build the web scraper in Kern AI workflow, label our news articles in refinery, and then enrich the data with gates. After that, we will use workflow again to send out the predictions and the enriched data via a webhook to Slack. If you'd like to follow along or explore these tools on your own, you can join our waitlist here:  &lt;a href="https://www.kern.ai/" rel="noopener noreferrer"&gt;https://www.kern.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's dive into the project! &lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping data in workflow
&lt;/h2&gt;

&lt;p&gt;To get started, we first need data. Sure, there are some publicly available datasets of stock news. But we are interested in building a sentiment classifier for specific companies only and, ideally, we want news articles that are recent rather than old and irrelevant. &lt;/p&gt;

&lt;p&gt;We start our project in workflow. Here we can add a Python node, with which we can execute custom Python code. In our case, we use it to scrape some news articles. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk1erikq1nivpm70bgeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk1erikq1nivpm70bgeh.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to access news articles. We decided to use the Bing News API because it offers up to 1000 free searches per month and is fairly reliable. But of course, you can do this part however you like! &lt;/p&gt;

&lt;p&gt;To do this, we use a &lt;code&gt;Python yield&lt;/code&gt; node, which takes in one input (the scraping results) but can return multiple outputs (in this case, one record per found article):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

    &lt;span class="n"&gt;search_term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# You can make this a list and iterate over it so search multiple companies! 
&lt;/span&gt;
    &lt;span class="n"&gt;subscription_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;YOUR_AZURE_COGNITIVE_KEY&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.bing.microsoft.com/v7.0/news/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ocp-Apim-Subscription-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subscription_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textDecorations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textFormat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mkt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;part_of_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part_of_response&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Scraper the collected urls
&lt;/span&gt;    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; 
            &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scraped_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;scraped_text_joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraped_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraped_text_joined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datePublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we can store our data in a shared store. There are two store nodes: a "Shared Store send" node to write data into a store, and a "Shared Store read" node from which you can access stored data and feed it into other nodes. &lt;/p&gt;

&lt;p&gt;We can create a Shared Store in the store section of Workflow. In the store section, you'll also find many other cool stores, such as spreadsheets or LLMs from OpenAI!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zennx52vtawd4if285j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zennx52vtawd4if285j.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply click on "add store" and give it a fitting name. Afterward, you'll be able to add the created store to a node in workflow. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcx0od1s0fhuddwxcz0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcx0od1s0fhuddwxcz0i.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F825kew9d3e3fuc88cdi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F825kew9d3e3fuc88cdi4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we've scraped some data, we can move on to label and process it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Enriching new incoming data with gates
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dvhbm3hs1uy898c71ka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dvhbm3hs1uy898c71ka.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we've run our web scraper and collected some data, we can sync a shared store with refinery. This will load all of our scraped data into a refinery project. Once we run the scraper again, new records will be loaded into the refinery project automatically. &lt;/p&gt;

&lt;p&gt;Refinery is our data-centric IDE for text data, and we can use it to label and process our articles quickly and easily. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c2j4vzmpqy1td8r3p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c2j4vzmpqy1td8r3p1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, we can create heuristics or something called an active learner to speed up and semi-automate the labeling process. Click &lt;a href="https://docs.kern.ai/refinery/quickstart" rel="noopener noreferrer"&gt;here&lt;/a&gt; for a quickstart tutorial on how to label and process data with refinery. &lt;/p&gt;

&lt;p&gt;Once the results of the project are satisfactory, all heuristics and ML models of a refinery project can be accessed via an API through our second new tool, called gates. &lt;/p&gt;

&lt;p&gt;Before we can access a refinery project, we have to go to gates first, open our project there, and start our model and/or heuristic in the configuration. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yq43gzhyaibwt78m7x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yq43gzhyaibwt78m7x1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we've done so, we will be able to select the model of the running gate in our gates AI node in workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k5cvdoy9zmt7e0oyero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k5cvdoy9zmt7e0oyero.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gates is integrated directly into workflow, so we don't need an API token for this. Of course, the gates API is also usable outside of workflow, which does require an API token; we will cover that in another blog article. &lt;/p&gt;

&lt;p&gt;After we've passed the data through gates, we get a dictionary as a response containing the predictions and confidence values for each of our active learners and heuristics. All of the input values are returned as well, so if we are only interested in the results, we have to do a little bit of filtering. The Python code below takes in the response from gates and returns only the prediction and the topic. You can use a normal Python node for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, we store the filtered and enriched results in a separate store!&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregate the sentiments
&lt;/h2&gt;

&lt;p&gt;Now we have all of our news articles enriched with sentiment predictions. The only thing that's left is to aggregate the predictions and send them out. You could do this via email, or you could send the results to a Google Sheet. In our example, we are going to use a webhook to send the aggregated results to a dedicated Slack channel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fija381qvzejp5zz9d5r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fija381qvzejp5zz9d5r4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example we simply count the number of positive, neutral, and negative articles, but you could also send out the confidence scores or text snippets of the articles. To do this, we use a Python aggregate node, which takes in multiple records but sends out only one output. Here's the code for this node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

    &lt;span class="n"&gt;positive_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;neutral_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;negative_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rather positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;positive_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neutral_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rather negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;negative_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beep boop. This is the daily stock sentiment bot. There were &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;positive_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; positive, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;neutral_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; neutral and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;negative_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; news about Apple today!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then create a webhook store, add the URL of our Slack channel, and add the node to our workflow. Afterward, we can run the workflow and it should send a Slack message to us! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrk20s4n9o4tl13dkb4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrk20s4n9o4tl13dkb4e.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This simple use case only scratches the surface of what you can do with the Kern AI platform, and you have a lot of freedom to customize the project and workflow to your needs! &lt;/p&gt;

&lt;p&gt;If you have any questions or feedback, feel free to leave it in the comment section down below!&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>workflow</category>
    </item>
    <item>
      <title>GPT and BERT: A Comparison of Transformer Architectures</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 09 Feb 2023 13:37:07 +0000</pubDate>
      <link>https://dev.to/meetkern/gpt-and-bert-a-comparison-of-transformer-architectures-2k46</link>
      <guid>https://dev.to/meetkern/gpt-and-bert-a-comparison-of-transformer-architectures-2k46</guid>
      <description>&lt;p&gt;Transformer models such as GPT and BERT have taken the world of machine learning by storm. While the general structures of both models are similar, there are some key differences. Let’s take a look. &lt;/p&gt;

&lt;h2&gt;
  
  
  The original Transformer architecture
&lt;/h2&gt;

&lt;p&gt;The first transformer was presented in the famous paper &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt; by Vaswani et al. The transformer was intended for machine translation and used an encoder-decoder architecture that didn't rely on things like recurrence. Instead, it focused on something called attention. In a nutshell, attention is like a communication layer placed on top of the tokens in a text. This allows the model to learn the contextual connections between words in a sentence. &lt;/p&gt;

&lt;p&gt;From this original transformer paper, different models emerged, some of which you might already know. If you've spent some time exploring transformers already, you've probably come across this image, outlining the architecture of the first transformer model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qzl5vs8t811nphbr8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qzl5vs8t811nphbr8c.png" alt="Image description" width="432" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The approach of using an encoder and a decoder is nothing new: you train two neural networks and use one for encoding and one for decoding. This is not limited to transformers, as we can use the encoder-decoder architecture with other types of neural networks, such as LSTMs (Long Short-Term Memory networks). It is especially useful when we would like to convert an input into something else, like a sentence from one language into another, or an image into a text description.&lt;/p&gt;

&lt;p&gt;The crux of the transformer is the use of (self-)attention. Things like recurrence are dropped completely, hence the title of the original paper, "Attention Is All You Need"!&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT vs BERT: What’s The Difference?
&lt;/h2&gt;

&lt;p&gt;The original transformer paper spawned lots of really cool models, such as the almighty GPT or BERT.&lt;/p&gt;

&lt;p&gt;GPT stands for Generative Pre-trained Transformer, and it was developed by OpenAI to generate human-like text from given inputs. It uses a language model that is pre-trained on large datasets of text to generate realistic outputs based on user prompts. One advantage GPT has over other deep learning models is its ability to generate long sequences of text without sacrificing accuracy or coherence. In addition, it can be used for a variety of tasks, including translation and summarization. &lt;/p&gt;

&lt;p&gt;BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by the Google AI Language team and open-sourced in 2018. Unlike GPT, which processes input only from left to right, the way humans read words, BERT attends to the context on both the left and the right of each token in order to better understand the meaning of a given text. Furthermore, BERT has been shown to outperform traditional NLP models such as LSTMs on various tasks related to natural language understanding. &lt;/p&gt;

&lt;p&gt;There is, however, an extra difference in how BERT and GPT are trained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;BERT is a Transformer encoder: for each position in the input, the output at the same position is the same token (or the [MASK] token for masked tokens). In other words, the input and output positions of each token align. Models with only an encoder stack, like BERT, generate all of their outputs at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GPT is an autoregressive transformer decoder, which means that each token is predicted conditioned on the previous tokens. We don't need an encoder, because the previous tokens are consumed by the decoder itself. This makes these models really good at tasks like language generation, but less suited to classification. These models can be trained on large unlabeled text corpora from books or web articles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
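&lt;p&gt;To make this difference concrete, here is a toy sketch (not a real language model, just hard-coded lookups) contrasting the two styles: the GPT-style function generates one token at a time, feeding each prediction back in as context, while the BERT-style function fills all masked positions in a single pass:&lt;/p&gt;

```python
# Toy next-token table standing in for a trained GPT-style model
NEXT_TOKEN = {
    (): "the",
    ("the",): "cat",
    ("the", "cat"): "sat",
}

def gpt_style_generate(steps: int) -> list:
    # Autoregressive decoding: each new token is conditioned
    # on all previously generated tokens
    tokens = []
    for _ in range(steps):
        tokens.append(NEXT_TOKEN[tuple(tokens)])
    return tokens

def bert_style_fill(tokens: list) -> list:
    # Encoder-style: every position is predicted at once; unmasked
    # tokens are simply reproduced at their own position
    fillers = {1: "cat"}  # pretend prediction for the masked slot
    return [fillers[i] if tok == "[MASK]" else tok
            for i, tok in enumerate(tokens)]
```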

&lt;p&gt;The special thing about transformer models is the attention mechanism, which allows these models to understand the context of words more deeply.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does attention work?
&lt;/h2&gt;

&lt;p&gt;The self-attention mechanism is a key component of transformer models, and it has revolutionized the way natural language processing (NLP) tasks are performed. Self-attention allows for the model to attend to different parts of an input sequence in parallel, allowing it to capture complex relationships between words or sentences without relying on recurrence or convolutional layers. This makes transformer models more efficient than traditional recurrent neural networks while still being able to achieve superior results in many NLP tasks. In essence, self-attention enables transformers to encode global context into representations that can be used by downstream tasks such as text classification and question answering.&lt;/p&gt;

&lt;p&gt;Let's take a look at how this works. Imagine that we have a text &lt;em&gt;x&lt;/em&gt;, which we convert from raw text into vectors using an embedding algorithm. To apply attention, we derive a query (&lt;em&gt;q&lt;/em&gt;) as well as a set of key-value pairs (&lt;em&gt;k&lt;/em&gt;, &lt;em&gt;v&lt;/em&gt;) from our input &lt;em&gt;x&lt;/em&gt;. &lt;em&gt;q&lt;/em&gt;, &lt;em&gt;k&lt;/em&gt;, and &lt;em&gt;v&lt;/em&gt; are all vectors. The result &lt;em&gt;z&lt;/em&gt; is called the attention head and is then sent through a simple feed-forward neural network. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu6mtxdgeyxjtyepkmyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu6mtxdgeyxjtyepkmyy.png" alt="Image description" width="112" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this sounds confusing to you, here is a visualization that highlights the connections built by the attention mechanism: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefybju109l9xkiytczmk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefybju109l9xkiytczmk.gif" alt="Image description" width="493" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can explore this yourself in this super cool Tensor2Tensor Notebook &lt;a href="https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In conclusion, while both GPT and BERT are examples of transformer architectures that have been influencing the field of natural language processing in recent years, they have different strengths and weaknesses that make them suitable for different types of tasks. GPT excels at generating long sequences of text with high accuracy, whereas BERT focuses more on understanding the context within given texts in order to perform more sophisticated tasks such as question answering or sentiment analysis. Data scientists, developers, and machine learning engineers should decide which architecture best fits their needs before embarking on any NLP project using either model. Ultimately, both GPT and BERT are powerful tools that offer unique advantages depending on the task at hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get refinery today
&lt;/h2&gt;

&lt;p&gt;Download refinery, our data-centric IDE for NLP. In our tool, you can use state-of-the-art transformer models to process and label your data. &lt;/p&gt;

&lt;p&gt;Get it for free here: &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Further articles: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NanoGPT by Andrej Karpathy &lt;a href="https://github.com/karpathy/nanoGPT/blob/master/train.py" rel="noopener noreferrer"&gt;https://github.com/karpathy/nanoGPT/blob/master/train.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BERT model explained &lt;a href="https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encoder-decoder in LSTM neural nets &lt;a href="https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/" rel="noopener noreferrer"&gt;https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The illustrated transformer by Jay Alammar &lt;a href="http://jalammar.github.io/illustrated-transformer/" rel="noopener noreferrer"&gt;http://jalammar.github.io/illustrated-transformer/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>howto</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Active Learning with Transformer-Based Machine Learning Models</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 19 Jan 2023 14:09:33 +0000</pubDate>
      <link>https://dev.to/meetkern/active-learning-with-transformer-based-machine-learning-models-536</link>
      <guid>https://dev.to/meetkern/active-learning-with-transformer-based-machine-learning-models-536</guid>
      <description>&lt;p&gt;The combination of active learning and transformer-based machine learning models provides a powerful tool for efficiently training deep learning models. By leveraging active learning, data scientists are able to reduce the amount of labeled data required to train a model while still achieving high accuracy. This post will explore how transformer-based machine learning models can be used in an active learning setting, as well as which models are best suited for this task. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Active Learning?
&lt;/h2&gt;

&lt;p&gt;Active learning is an iterative process that uses feedback from previously acquired labels to inform the selection of new data points to label. It works by continuously selecting the most informative unlabeled data points, those with the greatest potential to improve the model's performance once labeled and incorporated into training. This iterative process creates an efficient workflow that allows you to quickly get high-quality models with minimal effort. With each iteration, the performance increases, allowing you to observe the improvement of the machine learning model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlr61ok4whqcic0bw0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlr61ok4whqcic0bw0q.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt; &lt;em&gt;Source: &lt;a href="https://huggingface.co/blog/autonlp-prodigy" rel="noopener noreferrer"&gt;Active Learning with AutoNLP and Prodigy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For example, an &lt;a href="https://towardsdatascience.com/transformers-meet-active-learning-less-data-better-performance-4cf931517ff6" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; on the MRPC dataset with the bert-base-uncased transformer model found that 21% fewer labeled examples were needed with the active learning approach than with a fully labeled dataset from the start. &lt;/p&gt;
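&lt;p&gt;The selection step at the heart of this loop can be sketched in a few lines. The example below uses uncertainty sampling, one common strategy, in which the records whose predicted probability is closest to 0.5 are labeled next (the scoring function is a simplified stand-in, not any tool's actual implementation):&lt;/p&gt;

```python
def uncertainty(prob_positive: float) -> float:
    # 0.5 means the model is maximally unsure; map that to a
    # score of 1.0 and confident predictions toward 0.0
    return 1.0 - abs(prob_positive - 0.5) * 2.0

def select_to_label(unlabeled: dict, batch_size: int) -> list:
    # `unlabeled` maps record ids to the model's predicted
    # probability of the positive class; the most uncertain
    # records are sent to the annotator next
    ranked = sorted(unlabeled,
                    key=lambda rid: uncertainty(unlabeled[rid]),
                    reverse=True)
    return ranked[:batch_size]
```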

&lt;h2&gt;
  
  
  Transformer-Based Machine Learning Models for Active Learning
&lt;/h2&gt;

&lt;p&gt;Transformer-based machine learning models such as &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/bert-base-uncased" rel="noopener noreferrer"&gt;BERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/gpt2" rel="noopener noreferrer"&gt;GPT-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/xlnet-base-cased" rel="noopener noreferrer"&gt;XLNet &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are well suited for active learning due to their ability to capture context information in text data. These models have been shown to achieve state-of-the-art results on many natural language processing tasks such as question answering, sentiment analysis, and document classification. By utilizing these types of models in an active learning setting, you can quickly identify the most important samples that need labeling and use them to effectively train your model. Additionally, these models are very easy to deploy on cloud platforms like AWS or Azure, making it even more convenient to use them in an active learning environment.  &lt;/p&gt;

&lt;h2&gt;
  
  
  How we approach active learning in Kern AI refinery
&lt;/h2&gt;

&lt;p&gt;In refinery, we use SOTA transformer models from Hugging Face to create embeddings from text datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fent7g0qe6e848xw4qkrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fent7g0qe6e848xw4qkrc.png" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is usually done at the start of a new project, because having embeddings for all of our text data allows us to quickly find similar records by calculating the cosine similarity between embedded texts. This can drastically increase the labeling speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41iq2wccy1ky5vthveu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41iq2wccy1ky5vthveu2.png" alt="Image description" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After some labeling of the data is done, we are able to use these text embeddings to train simple machine learning algorithms, such as a Logistic Regression or a Decision Tree. We do not use these embeddings to train a transformer-based model again, because the embeddings are of such a high quality that even simple models provide high-accuracy results. While you save time and money through the active learning approach, you also save a lot of computational workload down the road.&lt;/p&gt;
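&lt;p&gt;To illustrate the similarity search mentioned above, here is a minimal Python sketch: each record's embedding is just a list of floats, and the record whose embedding has the highest cosine similarity to the query is considered the most similar (real embeddings have hundreds of dimensions, of course):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 means same direction, 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, embeddings: dict) -> str:
    # Return the id of the stored record closest to the query
    return max(embeddings,
               key=lambda rid: cosine_similarity(query_vec, embeddings[rid]))
```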

&lt;p&gt;In conclusion, transformer-based machine learning models provide a powerful tool for efficiently training deep learning models using active learning techniques. By leveraging their ability to capture contextual information from text data, you can quickly identify which samples should be labeled next in order to effectively train your model with minimal effort and cost. Furthermore, these types of models are highly scalable and easy to deploy on cloud platforms making them ideal for use in an active learning setting. With all these advantages combined together, it’s no wonder why transformer-based machine learning models are becoming increasingly popular among developers and data scientists alike.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>Drastically decrease the size of your Docker application</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Tue, 03 Jan 2023 17:15:15 +0000</pubDate>
      <link>https://dev.to/meetkern/drastically-decrease-the-size-of-you-docker-application-jfa</link>
      <guid>https://dev.to/meetkern/drastically-decrease-the-size-of-you-docker-application-jfa</guid>
      <description>&lt;p&gt;Containers are amazing for building applications. Because they allow you to pack up a program together with all it's dependencies and execute it wherever you like. That is why our application consists of 20+ individual containers, forming our data-centric IDE for NLP, which you can check out here: &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't know what Docker or a container is, here's a short rundown: Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.&lt;/p&gt;

&lt;p&gt;Using Docker, you can run many containers simultaneously on a single host. This can be useful for a variety of purposes, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolating different applications from each other, so they don't interfere with each other.&lt;/li&gt;
&lt;li&gt;Testing new software in a contained environment, without the need to set up a new machine or install any dependencies.&lt;/li&gt;
&lt;li&gt;Running multiple versions of the same software on the same machine, without having to worry about version conflicts.&lt;/li&gt;
&lt;li&gt;Packaging and distributing applications in a consistent and easy-to-use way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Docker allows developers to easily create, deploy, and run applications in a containerized environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem of size
&lt;/h2&gt;

&lt;p&gt;One problem of Docker containers is that they can get quite large. Because the container, well, contains everything that the program needs to run, the total size of a single container can quickly get to a couple of gigabytes. &lt;/p&gt;

&lt;p&gt;Version 1.4 of our application took up about 10.96 GB of disk space. While that's not absolutely enormous for a modern application, we saw a lot of potential to improve usability by decreasing the total size. In the end, smaller is always better, especially keeping in mind that not all of our users have a fast internet connection, and almost 11 GB can take quite some time to download. &lt;/p&gt;

&lt;p&gt;In the end, we managed to cut the needed disk space by more than half, down to 5.2 GB. How did we manage to do this? &lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing smaller parent images
&lt;/h2&gt;

&lt;p&gt;First, let's take a look at parent images for Docker containers. In Docker, a parent image is the image from which a new image is built. When you create a new Docker image, you are usually creating it based on an existing image, which serves as the parent image for the new image.&lt;/p&gt;

&lt;p&gt;For example, let's say you want to create a new Docker image for a web application. You might start by using an existing image such as &lt;code&gt;ubuntu:18.04&lt;/code&gt; as the base, or parent, image. You would then add your application code and any necessary dependencies to the image, creating a new child image.&lt;/p&gt;

&lt;p&gt;The parent image provides a foundation for the child image, and all of the files and settings in the parent image are inherited by the child image. This allows you to create new images that are based on a known, stable foundation, and ensures that your new images have all of the necessary dependencies and configurations.&lt;/p&gt;

&lt;p&gt;The new child image can then be used to build your container and run your application. &lt;/p&gt;

&lt;p&gt;There are many parent images you could choose from. You can check them out at &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;https://hub.docker.com/&lt;/a&gt;. Most of our containers used the &lt;code&gt;python:3.9&lt;/code&gt; parent image. This image comes with a full Python installation built on top of Linux. Technically, this is just fine for what we do. The thing is, the image alone is 865 MB in size, at least for the amd64 architecture. &lt;/p&gt;

&lt;p&gt;Maybe something smaller would do the job just as well. The &lt;code&gt;python:3.9-alpine&lt;/code&gt; image, for example, is built on Alpine Linux, a super tiny Linux distribution. The image &lt;code&gt;python:3.9-slim&lt;/code&gt; is also substantially smaller.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f2juzmn3aadczxhi6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f2juzmn3aadczxhi6u.png" alt="Image description" width="734" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then tried out the smaller parent images for all of our child images to see if they still run. For some images we had to stay with the normal &lt;code&gt;python:3.9&lt;/code&gt; image, but the majority run just fine with &lt;code&gt;python:3.9-alpine&lt;/code&gt; or &lt;code&gt;python:3.9-slim&lt;/code&gt;. This reduced the total size of the application quite a lot!&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared layers
&lt;/h2&gt;

&lt;p&gt;Another thing we optimized was the use of shared layers. Docker images consist of multiple layers, which can be shared between different images. These shared layers have to be downloaded and stored on disk only once. Therefore, increasing the usage of shared layers reduces both download time and disk consumption. Following this approach, we created custom Docker parent images that come with the Python dependencies needed by the refinery services already preinstalled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmlik5jyx9kczluzc15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmlik5jyx9kczluzc15.png" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above you can see a comparison of the image sizes before and after. In the size column the effect of the choice of the smaller parent images is visible. The effect of the shared layers is shown in the shared and unique size columns.&lt;/p&gt;

&lt;p&gt;Those are some of the tricks we used to decrease the disk space needed for our application. If you found this article useful, please leave a like or follow the author. If you have great tips on how to reduce the size of a containerized application, please leave them in the comments below!&lt;/p&gt;

</description>
      <category>emptystring</category>
    </item>
    <item>
      <title>How we managed to build our open-source content library crazy fast</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 15 Dec 2022 16:57:07 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-managed-to-build-our-open-source-content-library-crazy-fast-3me</link>
      <guid>https://dev.to/meetkern/how-we-managed-to-build-our-open-source-content-library-crazy-fast-3me</guid>
      <description>&lt;p&gt;Our newest project here at Kern AI offers you a fantastic library of modules called bricks, with which you can enrich your NLP text data. Our content library seamlessly integrates into our main tool, Kern AI refinery. But it also provides the source code for all the modules, providing you with maximum control. All modules can also be tested by calling an endpoint via an API. &lt;/p&gt;

&lt;p&gt;We managed to build this incredible tool in just under two months, thanks in large part to the amazing team at Kern, but also in part to the stunning capabilities of DigitalOcean's App Platform. We also managed to speed up development by automating large parts of our content management system. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;You can try out bricks here!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's first talk about the general structure of bricks and then dive into a little bit more detail! &lt;/p&gt;

&lt;h2&gt;
  
  
  How bricks is structured
&lt;/h2&gt;

&lt;p&gt;Bricks is built using four components: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend using next.js built with Tailwind UI&lt;/li&gt;
&lt;li&gt;Backend with Strapi and a managed PostgreSQL database&lt;/li&gt;
&lt;li&gt;Service to serve the live endpoints on bricks&lt;/li&gt;
&lt;li&gt;A separate search module to easily find modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Frontend
&lt;/h3&gt;

&lt;p&gt;The overall design of bricks should match the one used in refinery. The bricks UI is created using React and deployed via Next.js. We also used Tailwind for the UI elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend
&lt;/h3&gt;

&lt;p&gt;For the backend, we use Strapi, which is an awesome open-source content management system. Strapi is connected to a PostgreSQL database to store all the content that is displayed on bricks. The frontend connects to the backend via an API to then display all the content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7k44vtg4vox4im0eaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7k44vtg4vox4im0eaf.png" alt="Image description"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Managing content with Strapi itself is super easy, but to make things even easier for us, we wrote an automation script that is able to fetch new modules created for bricks and automatically add them to Strapi. That's why the source code of a bricks module needs to be in a specific format in order to be added to Strapi.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live endpoints
&lt;/h3&gt;

&lt;p&gt;Every module can be tested right from bricks itself. On the right side of every module, you'll see a window that allows you to try out the module without the need to install anything. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29daj47dx1sqwj5wj57i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29daj47dx1sqwj5wj57i.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Providing this was very important for us, as we want users to find out what exactly they get with every module and test the module with some of their own data. The default input is usually some text or sometimes some additional parameters for the endpoint. &lt;/p&gt;

&lt;h3&gt;
  
  
  Search module
&lt;/h3&gt;

&lt;p&gt;To quickly find the right modules, we also built a custom search module. It uses a small transformer model to embed the names of all modules, so they can be searched very quickly. &lt;/p&gt;
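&lt;p&gt;Stripped down to its core, such a search is just a nearest-neighbor lookup over the name embeddings. A minimal sketch, assuming hand-picked toy vectors in place of the transformer's output:&lt;/p&gt;

```python
import numpy as np

module_names = ["language detection", "sentence complexity", "email extraction"]

# Hypothetical name embeddings; in bricks a small transformer produces these
name_vectors = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.9, 0.1],
    [0.1, 0.2, 0.9],
])
name_vectors = name_vectors / np.linalg.norm(name_vectors, axis=1, keepdims=True)

def search(query_vector):
    q = query_vector / np.linalg.norm(query_vector)
    # A single matrix-vector product scores every module name at once
    scores = name_vectors @ q
    return module_names[int(np.argmax(scores))]

print(search(np.array([0.3, 0.8, 0.2])))  # -> "sentence complexity"
```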

&lt;p&gt;Let's now take a closer look at the technologies we used to quickly get bricks live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging DigitalOcean's App Platform
&lt;/h2&gt;

&lt;p&gt;The App Platform is a convenient and cheap way to deploy your web apps. Instead of deploying an app on a virtual machine that you'll have to manage yourself, the app will run in a Docker container. That way you don't have to think about the underlying infrastructure and also get the benefit of easy scalability. It's also a bit cheaper than hosting your app on a single VM. In the case of DigitalOcean, you also get the option to auto-deploy from a GitHub repository, which is super handy.&lt;/p&gt;

&lt;p&gt;There are many cloud platforms out there offering such a service, but for our purposes, we chose to use DigitalOcean. This post is not sponsored by them, we just like their service a lot. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.strapi.io/developer-docs/latest/setup-deployment-guides/deployment/hosting-guides/digitalocean-app-platform.html#add-environment-variables" rel="noopener noreferrer"&gt;To get started with bricks, we used this tutorial on how to deploy Strapi to DigitalOcean&lt;/a&gt;. We highly recommend you to check it out as well if you would like to use Strapi on DigitalOcean, as it was really helpful to get us started with the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-deploy from a GitHub repository
&lt;/h2&gt;

&lt;p&gt;To deploy on DigitalOcean, you can simply attach a GitHub repository, from which the app will automatically get deployed. In our case, we use the auto-deploy function for our endpoint service, so that new modules added to bricks will automatically get integrated. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaa98o9xcyk4dw51f9td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaa98o9xcyk4dw51f9td.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But before we can do that, we first need to deploy our backend and frontend components. To keep things clear, we deploy them separately and also store the backend and frontend in separate repositories. DigitalOcean also allows you to connect your app to a managed database, which is super convenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up a managed database
&lt;/h2&gt;

&lt;p&gt;Before we can deploy the backend, we need a managed PostgreSQL database first. DigitalOcean offers many different database types, but PostgreSQL should be just fine for our needs. When deploying Strapi on DigitalOcean, you can also choose a cheaper dev database for your app. However, we had a lot of trouble getting that dev database to run, so we instead directly went for the managed database that is meant to be used in production anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an App on DigitalOcean
&lt;/h2&gt;

&lt;p&gt;Next up, we are going to create our first App on DigitalOcean. The app will host the Strapi backend of the site and will be connected to the managed PostgreSQL database we created in the previous step. Deploying the backend is fairly easy: you simply select the GitHub repo and the fitting directory you want to deploy, and DigitalOcean will handle all the rest for you. You can also opt in to auto-deploy, and your app will be redeployed whenever there is a new change to your repository. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a second App for the frontend
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ozbi7o5e24gsoimhcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ozbi7o5e24gsoimhcf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it is technically possible to host the backend and the frontend on the same app, we chose not to do that. Setting up the frontend was much easier in a separate app, and apps are very cheap in general, so we would only have saved a few dollars by deploying both on the same app. We thought it would not be worth the hassle. The frontend gets all the information from the backend via a simple API call, so the frontend and backend don't need to be connected in any other way either. &lt;/p&gt;

&lt;p&gt;Building the second app for the frontend is essentially the same procedure as for the backend. You simply select the repository and the directory and let DigitalOcean do the work for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the endpoint app
&lt;/h2&gt;

&lt;p&gt;Once backend and frontend are up and running, we need to deploy the service that is running our endpoints. Otherwise, a user would still be able to access bricks and check out the modules, but they couldn't directly try them out on the site itself. &lt;/p&gt;

&lt;p&gt;The procedure is the same as before: connect your GitHub repository and deploy a containerized application via DigitalOcean. The endpoint service uses FastAPI to deliver the results of each endpoint to bricks. So far, a single service is enough to serve all 50+ endpoints available on bricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using bricks to quickly enrich datasets for NLP
&lt;/h2&gt;

&lt;p&gt;We hope that you liked this insight into the structure behind bricks. You can &lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;try out bricks here&lt;/a&gt; to inspect the result for yourself. &lt;/p&gt;

&lt;p&gt;If you have any questions or feedback you would like to share, feel free to post it in the comments down below. Have fun using bricks! &lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Introducing bricks, an open-source content-library for NLP</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 08 Dec 2022 14:50:54 +0000</pubDate>
      <link>https://dev.to/meetkern/introducing-bricks-an-open-source-content-library-for-nlp-2pg5</link>
      <guid>https://dev.to/meetkern/introducing-bricks-an-open-source-content-library-for-nlp-2pg5</guid>
      <description>&lt;p&gt;This week we launched bricks, an open-source library which provides enrichments for your natural language processing projects. Our main goal with bricks is to shorten the amount of time that you need from idea to implementation. Bricks also seamlessly integrates into our main tool, the &lt;a href="https://www.kern.ai/" rel="noopener noreferrer"&gt;Kern AI refinery&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Let's take a closer look at the structure of bricks and how to use it. You'll find bricks here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bricks.kern.ai/home" rel="noopener noreferrer"&gt;https://bricks.kern.ai/home&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Structure of a brick module
&lt;/h2&gt;

&lt;p&gt;In each module of bricks, you will find the source code for the function. You can use a bricks module in refinery either by copying the source code directly or via the bricks integration that will be available in the next release, refinery 1.7. Of course, this code could also be used outside of refinery. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fove9b1tcqejzn6avwflw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fove9b1tcqejzn6avwflw.png" alt="Image description" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the right hand side, you can directly try out the module over a live endpoint that we've deployed. You can use the example input that is already provided, or you can type something yourself and try it out! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18vibwemyo4hpyqtd06c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18vibwemyo4hpyqtd06c.png" alt="Image description" width="602" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of bricks modules
&lt;/h2&gt;

&lt;p&gt;Currently, there are three main types of modules in bricks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Classifiers:
&lt;/h3&gt;

&lt;p&gt;As the name suggests, these modules can be used to classify something. Need to find out the language of your text or get the complexity of it? You'll find what you need in the classifiers! &lt;/p&gt;

&lt;h3&gt;
  
  
  Extractors:
&lt;/h3&gt;

&lt;p&gt;The extractors are really useful if you would like to pull certain information or entities from your text. Most bricks modules can currently be found here; you'll find modules to extract metrics, times, names, addresses, and many more useful things! We've built all of these modules in a way that they can instantly be used for labeling functions in refinery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generators:
&lt;/h3&gt;

&lt;p&gt;This type of module generates some new form of output, such as a translation or a cleaned or corrected version of a text. In the generators, you will also find two premium functions, for which you'll need an API key from an external provider, in this case for language translation. However, it's also very important to us to always provide similar modules that don't need an API key. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using a bricks module in refinery
&lt;/h2&gt;

&lt;p&gt;Let's say that we have a dataset with news articles, and we want to categorize them by their complexity. We then go to the &lt;code&gt;sentence complexity&lt;/code&gt; module in bricks and copy all the source code.&lt;/p&gt;
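&lt;p&gt;To give a feel for the shape of such a module: an attribute calculation is essentially a Python function that reads one attribute of a record and returns a new value. The snippet below is a simplified, hypothetical stand-in; the actual &lt;code&gt;sentence complexity&lt;/code&gt; module on bricks uses a proper readability score.&lt;/p&gt;

```python
def sentence_complexity(record):
    # Simplified stand-in: average words per sentence as a crude complexity proxy
    text = record["headlines"]
    words_per_sentence = [len(s.split()) for s in text.split(".") if s.strip()]
    average = sum(words_per_sentence) / len(words_per_sentence)
    return "complex" if average > 15 else "simple"

print(sentence_complexity({"headlines": "Stocks rally as markets open higher."}))
# -> "simple"
```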

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi6kba2gr2j4oll1h4c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi6kba2gr2j4oll1h4c8.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then go back to our project in refinery and create a new attribute calculation, which we can do on the settings page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyshgrwb8p259n1748lk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyshgrwb8p259n1748lk.png" alt="Image description" width="788" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then paste in the code and put in the name of our attribute, in our case the &lt;code&gt;headlines&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs9osg8fupdgpd50nnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrs9osg8fupdgpd50nnn.png" alt="Image description" width="800" height="390"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;As a result, we'll then get the sentence complexity of each of our headlines that we have in our dataset. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ejar2tk2i7pzffiy32e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ejar2tk2i7pzffiy32e.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of this takes less than a minute to implement. &lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to bricks
&lt;/h2&gt;

&lt;p&gt;Like all projects at Kern, bricks is open-source, meaning that you get access to the source code. You can also contribute to bricks if you built something that you would like to share and that you think would be useful to others. Should you have a great idea or implementation, feel free to just open an issue on our GitHub page. &lt;a href="https://github.com/code-kern-ai/bricks" rel="noopener noreferrer"&gt;You can check out the bricks GitHub page here.&lt;/a&gt; There, you'll also find a detailed explanation of how to contribute to bricks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=S0cPZ5fqsNo" rel="noopener noreferrer"&gt;We've also made a tutorial on YouTube&lt;/a&gt; in which our DevRel guy Div shows you all the neccessary steps to contribute. &lt;/p&gt;

&lt;p&gt;You may also join our Discord community, where you can ask questions and discuss things with the wonderful Kern community. Join us here: &lt;a href="https://discord.gg/WAnAgQEv" rel="noopener noreferrer"&gt;https://discord.gg/WAnAgQEv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>portfolio</category>
    </item>
    <item>
      <title>Host your ML model as a serverless function on Azure</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Fri, 11 Nov 2022 15:03:49 +0000</pubDate>
      <link>https://dev.to/leonardpuettmann/deploying-a-machine-learning-model-as-a-serverless-function-on-azure-1d22</link>
      <guid>https://dev.to/leonardpuettmann/deploying-a-machine-learning-model-as-a-serverless-function-on-azure-1d22</guid>
      <description>&lt;p&gt;Building machine learning models is fun. But to deliver real business value, machine learning models need to be put into production. Often, that's easier said than done. &lt;/p&gt;

&lt;p&gt;The logical solution is to make the model accessible via an endpoint that is hosted in the cloud. One option for this is to host the machine learning model on a virtual machine in the cloud. With this, however, comes the need to manage the VM. This can quickly become a hassle, especially when you need to serve a lot of endpoints.&lt;/p&gt;

&lt;p&gt;One interesting alternative is to run your machine learning model as a serverless function in the cloud. While I think that the name &lt;em&gt;serverless&lt;/em&gt; is not quite fitting, because a server still runs the code, the concept itself is amazing. Instead of managing hardware resources yourself, you just provide the code you would like to run and the provider handles all the rest. Serverless functions are also dirt cheap. The code of the serverless function is usually packed up inside a Docker image, which is started up every time you call your model. This means that you don't pay for a resource 24/7, you only pay as you go. &lt;/p&gt;

&lt;p&gt;For today's article, we will be using serverless functions on Microsoft's Azure Cloud. The first million executions for cloud functions there are free. So if your needs for hosting an ML model are manageable, you should be able to provide your model without paying even a single cent. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a nutshell, serverless functions are great if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model is small and efficient.&lt;/li&gt;
&lt;li&gt;You need the model only occasionally or at regular intervals.&lt;/li&gt;
&lt;li&gt;Response times of a couple of seconds are alright for you.&lt;/li&gt;
&lt;li&gt;You don't want to pay a lot of money (or you don't have any).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploying serverless functions on Azure is super easy. In this article, we are going to deploy a decision tree model, which was trained on data from a wind turbine to predict energy production. You'll need at least a basic understanding of the concepts of cloud computing as well as some knowledge of Python. However, if you have any questions along the way, feel free to ask anything in the comments. To follow along, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An active subscription in Azure&lt;/li&gt;
&lt;li&gt;VS Code as well as some Azure extensions&lt;/li&gt;
&lt;li&gt;Python 3.8 or higher&lt;/li&gt;
&lt;li&gt;Optional: &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-run-local?tabs=v4%2Cwindows%2Ccsharp%2Cportal%2Cbash" rel="noopener noreferrer"&gt;Azure function core tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we start building a serverless function app, we'll first take a look at our machine learning model as well as the data we trained the model with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/LeonardPuettmann/wind-forecasting-serverless" rel="noopener noreferrer"&gt;The whole code + data of the project is accessible here.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Predicting energy production of a wind turbine
&lt;/h2&gt;

&lt;p&gt;The machine learning model meant for deployment is a simple decision tree that was trained on two and a half years of data from a wind turbine. The goal of the model is to predict the energy production of the wind turbine given information such as the ambient temperature, technical conditions of the turbine and, of course, the current wind speed. The dataset is really interesting and fun, and I encourage you to dive deeper into it. When looking at the energy produced by the wind turbine, we can see that between July and September the energy output seems to be highest, often at maximum output. After that, the energy production is much lower.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpb1jr5p9ugx4pgu0awn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpb1jr5p9ugx4pgu0awn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To predict the energy production of the wind turbine, we are going to use a decision tree. While a decision tree will most likely yield worse results than, say, a random forest, an XGBoost regressor or a neural net, a simple model such as a decision tree also has upsides. For one, it's very fast and lightweight, making it ideal to run as a serverless function. It's also very interpretable, making it easy to understand why exactly the model came to its prediction. And when the underlying data itself is very good, decision trees can be very powerful. This is why this model, together with a logistic regression, is my go-to model when first tackling any problem. &lt;/p&gt;

&lt;p&gt;That being said, we probably won't see stunning results from the model. Before training the model, I shifted the data so that the model predicts the energy production for the next 10 minutes. The further a prediction lies in the future, the more inaccurate I expect the model to become, but for demonstration purposes, this should be fine. &lt;/p&gt;
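&lt;p&gt;The shift-and-train step can be sketched as follows. The column names, the hyperparameters and the synthetic stand-in data are all assumptions here, not the exact training setup:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
import joblib
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 10-minute turbine readings
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "WindSpeed": rng.uniform(0.0, 15.0, 500),
    "AmbientTemperatue": rng.uniform(20.0, 40.0, 500),
})
df["ActivePower"] = 0.5 * df["WindSpeed"] ** 3 + rng.normal(0.0, 5.0, 500)

# Shift the target one row (= 10 minutes) into the future
df["target"] = df["ActivePower"].shift(-1)
df = df.dropna()

X = df[["WindSpeed", "AmbientTemperatue", "ActivePower"]]
y = df["target"]

dtr = DecisionTreeRegressor(max_depth=8, random_state=42)
dtr.fit(X, y)

# Persist the model so it can be packaged with the function later
joblib.dump(dtr, "windpred_model.pkl")
```

The dumped `windpred_model.pkl` is the file the serverless function loads at startup.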

&lt;h2&gt;
  
  
  Creating an initial function in the Azure portal
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl9hu9eccw6limhshtib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl9hu9eccw6limhshtib.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Time to get started with the creation of the serverless function. Go to the Azure portal and click on &lt;em&gt;function app&lt;/em&gt; to create a new function app. Choose to publish the function as code and select Python 3.9 as the runtime stack and version. I am going to deploy the function in North Europe, as it is the region closest to me, but feel free to choose whichever location is closest to you. As for the plan type, select the consumption (serverless) plan. After that, hit review + create to create the function app. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev966oba3utup5oe3ini.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev966oba3utup5oe3ini.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it is technically possible to write and deploy code directly from the Azure portal, it's much easier to do that from VS Code. Let's jump into VS Code!&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing VS Code extensions
&lt;/h2&gt;

&lt;p&gt;Deploying from VS Code has some benefits over writing the code directly in Azure. First, we can debug and try out our function locally, which I think is great. Second, when deploying from VS Code, you have more options to customize your deployment. For example, we can upload a machine learning model as a .pkl file packaged with our function, which is exactly what we are interested in. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F546cp16db52qcay1rsa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F546cp16db52qcay1rsa2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go into the extensions section of VS Code and install the Azure Tools as well as the Azure Functions extension. To test functions locally, you also need to install the &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-run-local?tabs=v4%2Cwindows%2Ccsharp%2Cportal%2Cbash" rel="noopener noreferrer"&gt;Azure Functions Core Tools&lt;/a&gt;. This last step is optional, but I highly recommend testing your functions locally before deploying them. Make sure to log in to your Azure account as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initializing an empty function in VS Code
&lt;/h2&gt;

&lt;p&gt;Once all the extensions are installed, open the directory you want to work in, hit F1 and search for "Azure Functions: create new function...". We've already created a function app in Azure, but this will only serve as the shell, to which we will later deploy our actual function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjvk5pgq0q48zdjisp90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjvk5pgq0q48zdjisp90.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first step, you can select the programming language. I'm going to use Python, but many other languages are available, such as Java, C# or JavaScript. After that, you can select an interpreter to create a virtual environment in which your function runs locally. &lt;/p&gt;

&lt;p&gt;You can also select a template for your function, which determines the way that your function gets triggered. For example, you can create a time trigger to trigger your function at a given interval or time of day. A function might also get triggered whenever there is a new entry in data storage. I am going to use the HTTP Trigger, which triggers the function every time it receives a POST request. &lt;/p&gt;

&lt;p&gt;After that, you can set the name of the function as well as its authorization level. If you need to access other tools from Azure, I recommend setting the authorization level to "Admin". &lt;/p&gt;
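&lt;p&gt;These choices end up in the generated &lt;code&gt;function.json&lt;/code&gt; next to your function code. For an HTTP-triggered Python function like ours, the file typically looks roughly like this (a sketch, not your exact generated file):&lt;/p&gt;

```json
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "authLevel": "admin",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "methods": ["get", "post"]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    }
  ]
}
```

The `name: "req"` binding is what arrives as the `req` parameter of `main`, and the `http` output binding returns whatever `main` returns.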

&lt;p&gt;Everything is set now, and all the needed components will automatically get created for us in the directory we initialized the function. Time to write some code! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q2dp3zta0zjvufsstj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q2dp3zta0zjvufsstj3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
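&lt;p&gt;One of the generated files is &lt;code&gt;requirements.txt&lt;/code&gt;, which tells Azure which packages to install for the function. For the function in this tutorial, it should list something like the following (versions are omitted here; pinning exact versions is a good idea in practice). Note that scikit-learn must be listed even though it is never imported directly, because joblib needs it to unpickle the model:&lt;/p&gt;

```text
azure-functions
pandas
scikit-learn
joblib
```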

&lt;h2&gt;
  
  
  Writing our function
&lt;/h2&gt;

&lt;p&gt;Inside your directory, you should find a folder with the name that you gave your function. In there, you'll find a file called &lt;code&gt;__init__.py&lt;/code&gt;. All the code that we want to run in the serverless function goes into this file. It also contains a Python function called &lt;code&gt;main&lt;/code&gt;, which is crucial for the serverless function. You may add more Python functions, but the &lt;code&gt;main&lt;/code&gt; function must always be present!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;azure.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;

&lt;span class="c1"&gt;# Load the decision tree model
&lt;/span&gt;&lt;span class="n"&gt;dtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;windpred_model.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse the received data as a JSON file
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# If the data is not empty, convert to a pandas DataFrame 
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create new prediction for every entry in the df
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dtr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Store results in a dict
&lt;/span&gt;            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;energy_production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="c1"&gt;# Append results to a list 
&lt;/span&gt;            &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Return the predictions as a JSON file
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If no JSON data is recieved, print error response
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please pass a properly formatted JSON object to the API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
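&lt;p&gt;Note that &lt;code&gt;main&lt;/code&gt; parses the body twice: &lt;code&gt;req.get_json()&lt;/code&gt; returns a JSON string, which &lt;code&gt;json.loads&lt;/code&gt; then turns into a dictionary. The payload therefore has to be a JSON-encoded string. A quick round-trip sketch of the expected shape (the column values here are illustrative):&lt;/p&gt;

```python
import json
import pandas as pd

# Build a payload the way the client later will: DataFrame -> JSON string,
# then JSON-encode that string once more for the request body
df = pd.DataFrame({"WindSpeed": [4.49, 3.69], "AmbientTemperatue": [38.0, 29.4]})
payload = json.dumps(df.to_json())

# Mirror what main() does on arrival: get_json() yields the inner string,
# and json.loads() turns it into a dictionary of columns
inner = json.loads(json.loads(payload))
restored = pd.DataFrame(inner)
print(restored)
```

If you send the data without the extra `json.dumps`, the inner `json.loads` in `main` will fail, so keep the client and the function consistent.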



&lt;h2&gt;
  
  
  Deploying the function
&lt;/h2&gt;

&lt;p&gt;Now we have everything in place for our serverless function and we are ready to deploy. Click on the Azure extension and expand your subscription in the resources section. Under &lt;em&gt;Function App&lt;/em&gt;, you'll find the function app that we previously created in the Azure portal. Right-click on it and select "Deploy to Function App...". This will push the function we configured in VS Code to Azure. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cwhgixwg6lolhlmv778.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cwhgixwg6lolhlmv778.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying out our serverless function
&lt;/h2&gt;

&lt;p&gt;After a couple of minutes, the function app should be deployed and ready to use. Because our function is triggered by HTTP requests, we can test out our app with the help of the Python requests library. &lt;/p&gt;

&lt;p&gt;To send the data, we need it in JSON format. Luckily, pandas can save a DataFrame to JSON by calling &lt;code&gt;.to_json()&lt;/code&gt; on it. I have taken a subset of the test data, which looks like this as JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"AmbientTemperatue"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;38.039763&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;29.4031836&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;33.7847183784&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;33.6842655556&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;25.8910933&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"BearingShaftTemperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;44.663637&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;40.0134959&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;47.901935875&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;40.8214605&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;42.1678136&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span 
class="nl"&gt;"Blade1PitchAngle"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.7368925375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.7368925375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.7368925375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;34.3081334429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.7368925375&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"Blade2PitchAngle"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;32.3317821077&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"Blade3PitchAngle"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span 
class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;32.3317821077&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.6993571429&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"GearboxBearingTemperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;65.5114258&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;61.7971398&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;77.119133&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;49.5740933&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;69.9417784&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"GearboxOilTemperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span 
class="mf"&gt;59.7925489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;56.3766701&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;64.204399375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;54.2150616667&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;58.2121131&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"GeneratorRPM"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1053.90176&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1030.01957&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1751.7155625&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;115.3844747778&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1433.95605&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"GeneratorWinding1Temperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;66.001735&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span 
class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;56.8643122&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;113.29087075&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;57.8011734444&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;70.1916634&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"GeneratorWinding2Temperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;65.1331282&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;56.0960214&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;112.59216325&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;57.2828797778&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;69.546118&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"HubTemperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.996185&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span 
class="mf"&gt;36.0191189&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;42.996094375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;39.5969365&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;34.0113608&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"MainBoxTemperature"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;49.83125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;39.95625&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;41.89842625&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;43.1738124&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;36.2125&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"NacellePosition"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;135.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;173.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span 
class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;60.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;183.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;172.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"ReactivePower"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;49.440134&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;18.04336296&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;-9.3880610405&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;-10.1450863947&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.1142242497&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"RotorRPM"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;9.4539536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;9.2260374&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;15.708135375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span 
class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0182516089&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;12.841184&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"TurbineStatus"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"WindDirection"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;135.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;173.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;60.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;183.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span 
class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;172.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"WindSpeed"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"1553427000000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;4.49369983&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1547726400000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.690674815&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1524628800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.5373755703&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1558305600000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.7541209632&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"1545313800000"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;7.19549971&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, there is a lot of information packed into the JSON file that we pass to the model. We can load the JSON data and call our function like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load the JSON data
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Send request to our serverless function
&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL OF MODEL GOES HERE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3th11bx1s3ht12yd086r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3th11bx1s3ht12yd086r.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can get the URL of the function in the Azure portal in the &lt;em&gt;code + test&lt;/em&gt; section of your function. The output is the predicted energy output for the next 10 minutes. &lt;/p&gt;

&lt;p&gt;I hope you found this small article about deploying machine learning models as serverless functions useful. If you have any thoughts or questions you would like to share, let me know in the comments! :-)&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>serverless</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How we automated license checking for our Python &amp; JS dependencies</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Fri, 28 Oct 2022 10:06:03 +0000</pubDate>
      <link>https://dev.to/meetkern/how-we-automated-license-checking-for-our-python-js-dependencies-5900</link>
      <guid>https://dev.to/meetkern/how-we-automated-license-checking-for-our-python-js-dependencies-5900</guid>
<description>&lt;p&gt;There are many popular license types for open-source software out there, such as the MIT, BSD or Apache Software License. When building software privately, these licenses are rarely a concern. However, things get much more complicated when building a commercial product, even if it's open-source. For us as a company, that meant a lot of uncertainty about how to handle these licenses.&lt;/p&gt;

&lt;p&gt;In a nutshell, when using a dependency, you'll need to ensure that the dependency allows for commercial use. That's not a problem with the majority of the licenses, but there are some lesser-known ones that could cause some trouble. &lt;/p&gt;

&lt;p&gt;For our tool, the Kern AI refinery, we use dozens of different libraries. Checking all the dependencies manually for all the repositories would be an extremely tedious task, to say the least. So, our machine learning engineer Felix thought to himself "why don't I automate this then?". And that's exactly what he did! &lt;/p&gt;

&lt;h2&gt;
  
  
  Checking Python licenses with LicenseCheck
&lt;/h2&gt;

&lt;p&gt;We have a lot of Python dependencies, so checking these licenses was our biggest priority. When it comes to checking licenses of Python dependencies, we've found a really cool tool called &lt;a href="https://github.com/FHPythonUtils/LicenseCheck" rel="noopener noreferrer"&gt;LicenseCheck&lt;/a&gt;, which can check the requirements.txt file of a GitHub repository and find the licenses for all the dependencies listed inside the file. LicenseCheck can simply be installed via pip and can then be used to print out all the licenses. This already helps a lot, but when you have 50+ repositories, it's still a lot of manual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Python script
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjyuj4zgofy9tv364ik5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjyuj4zgofy9tv364ik5.png" alt="Image description"&gt;&lt;/a&gt;&lt;em&gt;Code snipped from the script&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To check &lt;strong&gt;all&lt;/strong&gt; of our repositories, our ML engineer Felix built an amazing Python script that fully automates the license checking of our Python dependencies. You can find the whole script &lt;a href="https://github.com/code-kern-ai/util-scripts" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you are interested in using it! &lt;/p&gt;

&lt;p&gt;How does the script work? In a nutshell, you simply paste in the repositories you want to check by putting the name and the URL of each repo inside a dictionary. Add as many repos as you like. The script then loops over all the repositories and checks the requirements.txt file of each one. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5s12yu11ogmqv29434s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5s12yu11ogmqv29434s.png" alt="Image description"&gt;&lt;/a&gt;&lt;em&gt;Using the script, you can simply check all the licenses in your repository&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Checking the results
&lt;/h1&gt;

&lt;p&gt;Finally, the script then saves all the results into a handy Excel spreadsheet, in which you'll get a list with all your dependencies and the corresponding license. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm5sklzojsrglejo8em4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm5sklzojsrglejo8em4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running this script gave us the licenses of all 114 of our dependencies in a single list. We might add even more dependencies in the future, but with this tool, we can re-check them with very little effort. &lt;/p&gt;

&lt;h2&gt;
  
  
  Finding licenses for JavaScript dependencies
&lt;/h2&gt;

&lt;p&gt;Python is not the only programming language that we use. Our application is also built with JavaScript, mainly for the UI and the dashboards for our admins. Sadly, the LicenseCheck tool doesn't work for JavaScript or any language other than Python. &lt;/p&gt;

&lt;p&gt;As an alternative, we've found &lt;a href="https://github.com/pivotal/LicenseFinder" rel="noopener noreferrer"&gt;LicenseFinder&lt;/a&gt;, an awesome open-source tool for checking JavaScript dependencies. The tool checks the package.json file of a repository and tells you which licenses are used. You can also create a list of permitted licenses, and LicenseFinder will check whether your dependencies' licenses are in that list. It works much like LicenseCheck does for Python. &lt;/p&gt;
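&lt;p&gt;The permitted-list idea itself fits in a few lines. In this toy Python sketch, the dependency/license pairs and the allow-list are invented for illustration and don't reflect LicenseFinder's actual output format:&lt;/p&gt;

```python
# Toy allow-list check: flag any dependency whose license is not permitted.
# The license names and dependency entries below are made up.
PERMITTED = {"MIT", "BSD-3-Clause", "Apache-2.0", "ISC"}

found = {
    "left-pad": "MIT",
    "some-widget": "GPL-3.0",  # hypothetical example of a license to flag
}

violations = {dep: lic for dep, lic in found.items() if lic not in PERMITTED}
print(violations)  # -> {'some-widget': 'GPL-3.0'}
```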

&lt;p&gt;We hope that you'll find this helpful. Let us know in the comments if you've found similar tools for other programming languages so that others can see them too. &lt;/p&gt;

&lt;p&gt;Make sure to check out our GitHub page to find out more about our open-source, data-centric IDE which we are building! &lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Data-centric AI for NLP is here, and it's here to stay!</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Thu, 20 Oct 2022 18:39:52 +0000</pubDate>
      <link>https://dev.to/meetkern/data-centric-ai-for-nlp-is-here-and-its-here-to-stay-5ffb</link>
      <guid>https://dev.to/meetkern/data-centric-ai-for-nlp-is-here-and-its-here-to-stay-5ffb</guid>
<description>&lt;p&gt;When a machine learning model performs poorly, many teams intuitively try to improve the model and the underlying code - let’s say switching from a logistic regression to a neural network. While this can be helpful, it isn’t the only approach you can take to implement your use case. Taking a data-centric approach and improving the underlying data itself is often a more efficient way to increase the performance of your models. In this article, we want to show you how that can be done - for instance, but not limited to - using our open-source tool &lt;a href="https://github.com/code-kern-ai/refinery"&gt;refinery&lt;/a&gt;. Let’s demystify data-centric AI!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U6HFiPhJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dx6jqymdriwqkhurnrnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U6HFiPhJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dx6jqymdriwqkhurnrnx.png" alt="Image description" width="772" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting from scratch
&lt;/h2&gt;

&lt;p&gt;In our case, let's imagine that we have some unprocessed text data, with which we would like to build a classifier. For instance, a model to differentiate between topics of a text.&lt;/p&gt;

&lt;p&gt;The first step is to carefully look at the data we have. Can we already spot repeating patterns in the data, such as regular expressions or keywords? How is the data structured, i.e. are there short and long paragraphs and such? Does the data capture the information I need to achieve my goals, or do I need to preprocess it in some other way first? These are not easy questions, but answering them (at least to some extent) early on will ensure the success of your projects later down the road. &lt;/p&gt;

&lt;p&gt;Generally, you can do this by labeling a few of the examples. You will require some manually labeled examples in any case. As you understand patterns and gather some reference data, you will get (imperfect, maybe noisy) ideas for automation patterns. Let’s dive into what they can look like. &lt;/p&gt;

&lt;p&gt;One way to create signals for automated labels is labeling functions. These labeling functions allow us to programmatically express and label patterns found in the data, even if these patterns are a bit noisy. We’ll look into this a bit later, don’t worry about that now. For example, we can write a Python function to assign a label once certain words are found in a data point.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4G2rOm_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gaz9zb780r2df50woq23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4G2rOm_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gaz9zb780r2df50woq23.png" alt="Image description" width="802" height="354"&gt;&lt;/a&gt;&lt;em&gt;A simple labeling function in refinery&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also use an active learner for labeling text data. The active learner is a machine learning model that is trained on a subset of the data which has already been labeled manually. Because the data has been processed into high-quality embeddings by SOTA transformer models (e.g. distilbert-base-uncased), we can use simple models such as a logistic regression as the active learner. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dp9rabci--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6svhtr8wfm6yz4i4y97o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dp9rabci--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6svhtr8wfm6yz4i4y97o.png" alt="Image description" width="880" height="384"&gt;&lt;/a&gt;&lt;em&gt;Code for the active learner&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The active learner then automatically applies labels to the unlabeled data. The more data we label manually, the more accurate the active learner becomes, creating a positive feedback loop.&lt;/p&gt;
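&lt;p&gt;To make the mechanics concrete, here is a minimal sketch of such an active learner using scikit-learn. Random vectors stand in for the real transformer embeddings, so this is an illustration of the idea, not refinery's actual code:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the active-learner idea: train a simple classifier on the
# embeddings of manually labeled records, then let it label the rest.
# Random vectors stand in for the real transformer embeddings.
rng = np.random.default_rng(0)
labeled_embeddings = rng.normal(size=(20, 768))    # 20 manually labeled records
manual_labels = np.array([0, 1] * 10)              # their manual class labels
unlabeled_embeddings = rng.normal(size=(5, 768))   # records still unlabeled

active_learner = LogisticRegression(max_iter=1000)
active_learner.fit(labeled_embeddings, manual_labels)

predicted = active_learner.predict(unlabeled_embeddings)
confidence = active_learner.predict_proba(unlabeled_embeddings).max(axis=1)
```

&lt;p&gt;As more records are labeled by hand, the classifier is simply retrained, which is what creates the positive feedback loop described above.&lt;/p&gt;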

&lt;p&gt;We could even think of more ideas, e.g. integrating APIs or crowd labeling, but for now, let’s just think of these two examples. We’re also currently building a really cool content library, which will help you to come up with the best ideas for your automation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UwMdh0aa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rmvna0qzskge6x9borb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UwMdh0aa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rmvna0qzskge6x9borb9.png" alt="Image description" width="607" height="487"&gt;&lt;/a&gt;&lt;em&gt;Results from an active learner&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The labels from our labeling functions and the active learner can then be used for weak supervision, which takes all the labels and aggregates them into a weak supervision label. Think of weak supervision as a framework for really simple integration and denoising of noisy labels.&lt;/p&gt;
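&lt;p&gt;As a toy stand-in for the real denoising logic, the core idea can be sketched as a simple vote over the (possibly abstaining) heuristics:&lt;/p&gt;

```python
from collections import defaultdict

# Toy weak supervision: aggregate the votes of several noisy heuristics
# into one label per record, with a naive confidence score. This is a
# simplified stand-in for refinery's actual denoising.
def weak_supervise(votes):
    """votes: list of labels from the heuristics (None = abstained)."""
    counts = defaultdict(int)
    for vote in votes:
        if vote is not None:
            counts[vote] += 1
    if not counts:
        return None, 0.0  # every heuristic abstained
    label, hits = max(counts.items(), key=lambda kv: kv[1])
    return label, hits / sum(counts.values())

print(weak_supervise(["sports", "sports", None, "politics"]))  # -> ('sports', 0.6666666666666666)
```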

&lt;h2&gt;
  
  
  Improving the labeled data quality
&lt;/h2&gt;

&lt;p&gt;Your data will rarely be perfect from the get-go. Usually, the data will be messy. That means that it's important to continuously improve the existing training data. How can we do this? The output of the weak supervision also gives us a confidence score for each assigned label, which is super helpful!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FEwa0_0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30yptemsdn0499d2aj0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FEwa0_0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30yptemsdn0499d2aj0k.png" alt="Image description" width="880" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then look at all the labels with particularly low confidence, create a new data slice and improve on that specific part of our data. We can manually label some more data out of that low-confidence data slice and write more labeling functions (or ask someone from our team to do that for us, as each slice is tagged with a URL), which then further improves the active learner as well. This not only allows us to improve the labels of our data, but also lets us spot noisy labels that differ from the ground truth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--48umxnte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2o1inixa0dzwsmlsghcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--48umxnte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2o1inixa0dzwsmlsghcl.png" alt="Image description" width="880" height="342"&gt;&lt;/a&gt;&lt;em&gt;Confidence distribution of our labels&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also compare the manual labels with the labels from the weak supervision. The data browser makes it very easy to spot differences. Again, this shows how data-centric AI is not only about scaling your labeling. It really is about adding metadata to your records that help you build large-scale, but especially high-quality training datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NwfEr7jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xbojx0c1ux8w1c6m2uk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NwfEr7jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xbojx0c1ux8w1c6m2uk9.png" alt="Image description" width="880" height="444"&gt;&lt;/a&gt;&lt;em&gt;The data browser in refinery&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the problem at hand easier
&lt;/h2&gt;

&lt;p&gt;There are also further steps we can take towards the goal of data-centric AI. For example, in the domain of NLP, we can further improve the embeddings we use. Let's have a quick refresher on what embeddings are.&lt;/p&gt;

&lt;p&gt;To work with text data for NLP, we can embed sentences into a vector space. The words are represented as numeric values in this vector space. Positioning words in this space ensures that the underlying information and meaning of the words are kept intact while also enabling an algorithm to process the texts. &lt;/p&gt;
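&lt;p&gt;As a toy illustration, here are hand-made three-dimensional vectors compared by cosine similarity. Real embeddings come from transformer models and have far more dimensions, but the principle is the same: sentences with similar meaning end up close together in the space.&lt;/p&gt;

```python
import math

# Toy embeddings: sentences as hand-made 3-dimensional vectors,
# compared by cosine similarity. Real embeddings are produced by
# transformer models and have hundreds of dimensions.
embeddings = {
    "the cat sat": [0.9, 0.1, 0.0],
    "a cat slept": [0.8, 0.2, 0.1],
    "stocks fell": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# The two cat sentences should be far more similar than cat vs. stocks.
print(cosine(embeddings["the cat sat"], embeddings["a cat slept"]))
print(cosine(embeddings["the cat sat"], embeddings["stocks fell"]))
```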

&lt;p&gt;State-of-the-art embeddings are created using modern transformer models. These embeddings are very rich in information, but can also be super complex, often having hundreds of dimensions. Fine-tuning these embeddings often leads to huge improvements.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UGtMuWb2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ua3jbx1labw0m7mxm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UGtMuWb2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ua3jbx1labw0m7mxm4.png" alt="Image description" width="880" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of using more and more complex models, we can do something else. An alternative approach is to improve the data at hand in a way that preserves the information within it, so that we don't need super complex models like Transformer or LSTM neural nets in our downstream tasks to make use of the data. By improving the vector space itself, even simple models such as logistic regressions or decision trees can give tremendous results!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why data-centric AI is here to stay
&lt;/h2&gt;

&lt;p&gt;Now, if you’re thinking: “Wait, but isn’t this also changing the model parameters, and thus again model-centric AI?”. Well, you’re not wrong - but you’re effectively spending the biggest chunk of your time on improving the data, and thus improving the model. That is what data-centric AI is all about.&lt;/p&gt;

&lt;p&gt;We just explored some of the upsides of data-centric AI. Weak supervision provides us with an interface to easily integrate heuristics, such as labeling functions and active learning, to (semi-)automate the labeling process. Further, it helps us to enrich our data with confidence scores or metadata for slicing our records, such that we can easily use weakly supervised labels to manage our data quality. Last but not least, similarity learning can be used to simplify the underlying problem itself, which is much easier than increasing the complexity of the model used.&lt;/p&gt;

&lt;p&gt;Enriching data and treating it like a software artefact enables us to build better and more robust machine learning models. This is why we are confident that data-centric AI is here to stay. &lt;/p&gt;

&lt;p&gt;If you think so too, please make sure to check out our GitHub repository with the open-source &lt;a href="https://github.com/code-kern-ai/refinery"&gt;refinery&lt;/a&gt;. We’re sure it will be super helpful for you, too :)&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Live session: A first peek into the Kern AI refinery.</title>
      <dc:creator>Leonard Püttmann</dc:creator>
      <pubDate>Fri, 07 Oct 2022 10:31:56 +0000</pubDate>
      <link>https://dev.to/meetkern/live-session-a-first-peek-into-the-kern-ai-refinery-elg</link>
      <guid>https://dev.to/meetkern/live-session-a-first-peek-into-the-kern-ai-refinery-elg</guid>
      <description>&lt;p&gt;We are going live on Twitch! Have you ever thought about integrating refinery into your NLP projects but did not know where to start? On October 13th, Leonard and Moritz will give you a practical overview of the application with an easy-to-understand dataset. &lt;/p&gt;

&lt;p&gt;When? (CEST timezone)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6pm - 7pm: Livestream on twitch.tv/MeetKern&lt;/li&gt;
&lt;li&gt;7pm - 7.30pm: Live Q&amp;amp;A on our Community Discord server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the livestream you will learn:&lt;br&gt;
👉 How to install refinery&lt;br&gt;
👉 How to set up a project&lt;br&gt;
👉 How to use all the features of refinery&lt;/p&gt;

&lt;p&gt;Our CTO Jens will accompany the livestream and Q&amp;amp;A so that you can get all these burning questions off your chest.&lt;br&gt;
We very much look forward to this, as we greatly value your feedback and engagement. Refinery was built to make your life easier and we think that sessions like this will help you get the hang of this fantastic application.&lt;/p&gt;

&lt;p&gt;Sign up for the event now! See you next Thursday 🥳&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>livesession</category>
    </item>
  </channel>
</rss>
