<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Liubov Zatolokina</title>
    <description>The latest articles on DEV Community by Liubov Zatolokina (@liubovzatolokina2022).</description>
    <link>https://dev.to/liubovzatolokina2022</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F829104%2F01a3025b-9b64-4f10-af82-81b41c9567bd.png</url>
      <title>DEV Community: Liubov Zatolokina</title>
      <link>https://dev.to/liubovzatolokina2022</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/liubovzatolokina2022"/>
    <language>en</language>
    <item>
      <title>How to Integrate GPT models: Testing 6 ChatGPT Use Cases</title>
      <dc:creator>Liubov Zatolokina</dc:creator>
      <pubDate>Mon, 13 Mar 2023 14:01:30 +0000</pubDate>
      <link>https://dev.to/liubovzatolokina2022/how-to-integrate-gpt-models-testing-6-chatgpt-use-cases-3g9a</link>
      <guid>https://dev.to/liubovzatolokina2022/how-to-integrate-gpt-models-testing-6-chatgpt-use-cases-3g9a</guid>
      <description>&lt;p&gt;GPT models weren’t widely popular until November 2022, once OpenAI built a successor of their language models called &lt;a href="https://openai.com/blog/chatgpt"&gt;ChatGPT&lt;/a&gt;. As for March 2023 ChatGPT is now &lt;a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis"&gt;officially released&lt;/a&gt; with an API and SDK to integrate the model into your products. While there are a plethora of applications on the surface, ChatGPT offers a wide range of improvements for different types of technical and business processes. So here we’ll put to the test the most viable, and see where it gets us. &lt;/p&gt;

&lt;p&gt;I’ll share my view of how we can benefit from GPT models at different levels of automation, while providing a concise answer to the question, “What do I need to implement a GPT model in my product?” &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note: all the tests were made during the pre-release stage.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ChatGPT and What are GPT Models in General?
&lt;/h2&gt;

&lt;p&gt;If you already know what ChatGPT and other GPT models are, you can skip this part. Here, I’ll define ChatGPT from different angles and show how to use its capabilities to dig for data and provide us with the necessary information. The easiest way to do this is by simply asking the model, and here is how ChatGPT identifies itself:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TUp1BGnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bi7wtita4xowme5z9ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TUp1BGnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bi7wtita4xowme5z9ct.png" alt="ChatGPT defines itself" width="805" height="283"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT defines itself&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT is the latest GPT model, developed on the basis of GPT-3.5. It can interact with the user in the form of a dialogue and answer almost any question based on events and information up until 2021. GPT models are highly capable of solving a range of Natural Language Processing (NLP) tasks, with the possibility of customization, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keywords extraction&lt;/li&gt;
&lt;li&gt;Text translation&lt;/li&gt;
&lt;li&gt;Question answering&lt;/li&gt;
&lt;li&gt;Text generation&lt;/li&gt;
&lt;li&gt;Text summarization&lt;/li&gt;
&lt;li&gt;Spelling correction &lt;/li&gt;
&lt;li&gt;Text style transferring &lt;/li&gt;
&lt;li&gt;Named entity recognition&lt;/li&gt;
&lt;/ul&gt;
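&lt;p&gt;Each of the tasks listed above boils down to wrapping the user’s text in a task-specific instruction. A minimal sketch of that idea; the templates here are illustrative wording, not official OpenAI prompts:&lt;/p&gt;

```python
# Illustrative prompt templates for the NLP tasks listed above.
# {text} is filled with the user's input at request time.
PROMPT_TEMPLATES = {
    "keyword extraction": "Extract the main keywords from this text: {text}",
    "translation": "Translate this text into French: {text}",
    "question answering": "Answer this question: {text}",
    "summarization": "Summarize this text in two sentences: {text}",
    "spelling correction": "Correct the spelling mistakes in this text: {text}",
    "style transfer": "Rewrite this text in a formal style: {text}",
    "named entity recognition": "List the people, places, and organizations mentioned in: {text}",
}

def make_prompt(task, text):
    """Fill the template for one of the NLP tasks with the user's input."""
    return PROMPT_TEMPLATES[task].format(text=text)

print(make_prompt("summarization", "GPT models solve many NLP tasks."))
```

&lt;p&gt;The same model then handles every task; only the surrounding instruction changes.&lt;/p&gt;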

&lt;h2&gt;
  
  
  How to Use ChatGPT in Your App or Website
&lt;/h2&gt;

&lt;p&gt;As of March 2023, ChatGPT is available via API, and some companies like Snapchat are already using it. So it makes sense to discuss how ChatGPT can be integrated into your application from a technical standpoint. There are two main options here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;THE TASK CAN BE SOLVED BY THE EXISTING MODEL AS-IS. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, the model operates as a smart catalog, helping the user browse through content and make choices more quickly using the GPT response. However, customization and personalization are not available. &lt;/p&gt;
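&lt;p&gt;For this first option, integration mostly means calling the hosted model over HTTP. A minimal sketch of the JSON body the Chat Completions endpoint (https://api.openai.com/v1/chat/completions) expects as of the March 2023 release; actually sending it requires an API key and an HTTP client, which are omitted here:&lt;/p&gt;

```python
import json

def build_chat_request(user_message, system_prompt="You are a helpful assistant."):
    """Build the JSON body for a single-turn call to the hosted model."""
    return {
        "model": "gpt-3.5-turbo",  # model name from the March 2023 API release
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("Recommend a catalog category for winter jackets.")
print(json.dumps(payload, indent=2))
```

&lt;p&gt;The system message is where you constrain the model to behave like your catalog assistant; everything else stays on OpenAI’s side.&lt;/p&gt;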

&lt;ul&gt;
&lt;li&gt;FINE-TUNING THE MODEL WITH YOUR OWN DATA. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Such models can be useful for creating a personalized chatbot, a recommender system based on user actions and desires, and other use cases. Using your own data, the GPT model can be trained to introduce an element of personalization. The amount of data needed to fine-tune a GPT model depends on various factors, such as the size and complexity of the data set and the desired performance of the model. Typically, at least a few hundred examples are required to effectively fine-tune a GPT model. However, in complex domains, it may take tens of thousands or even millions of examples to achieve optimal model performance.&lt;/p&gt;

&lt;p&gt;To understand whether you’ll need to fine-tune the model, I recommend running through these 5 steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Business understanding&lt;/strong&gt; that involves learning the business case, what task is delegated to the model, what data it should use, and the personalization features that are expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data gathering&lt;/strong&gt;. This step can be skipped if personalization isn’t needed at the early stages. But if you plan to use the GPT model in a highly specific domain, the model will fail to operate correctly by default, which makes this stage necessary. For example, to recommend clothes to the user based on what is available in the store, you need to collect information about the store’s products and provide basic instructions for recommendations (for example, jeans with article YYYY should be worn with a T-shirt with article XXXX).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data annotation/Data labeling&lt;/strong&gt;. This stage includes creating a prompt for each sample from the data gathering stage. For example: jeans with article YYYY should be worn with a T-shirt with article XXXX = “What should I wear with a T-shirt with article XXXX?”, with the expected answer: “Jeans with article YYYY”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preparation&lt;/strong&gt;. This implies converting the data into a suitable format, e.g. {"prompt": "", "completion": ""}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modeling&lt;/strong&gt;. Finally, applying those changes and implementing the model in production.&lt;/li&gt;
&lt;/ol&gt;
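&lt;p&gt;The data preparation step can be sketched as follows. OpenAI’s fine-tuning format is JSONL, one prompt/completion object per line; the clothing samples and article numbers are made-up placeholders continuing the example above:&lt;/p&gt;

```python
import json

# Made-up labeled samples from the gathering and annotation steps.
samples = [
    {"prompt": "What should I wear with a T-shirt with article XXXX?",
     "completion": "Jeans with article YYYY."},
    {"prompt": "What goes well with jeans with article YYYY?",
     "completion": "A T-shirt with article XXXX."},
]

def to_jsonl(records):
    """Serialize records to JSONL: one prompt/completion object per line."""
    return "\n".join(json.dumps(record) for record in records)

print(to_jsonl(samples))
```

&lt;p&gt;The resulting file is what the fine-tuning job consumes in the modeling step.&lt;/p&gt;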

&lt;p&gt;Keep in mind that a GPT model is not a complete product that covers all domain areas out of the box. You’ll also need an application around it and management tools on the backend. The output of a GPT model also has to be processed with NLP methods in order to extract meaningful information: the model returns an answer as text in multiple sentences, like a list, while the following parts of the pipeline usually need it condensed into one or a few words to be used as a filter or keywords. So the GPT model will also require other models to operate.&lt;/p&gt;
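&lt;p&gt;As a rough illustration of that post-processing, here is a naive sketch that condenses a bulleted model answer into keywords for the next pipeline stage. The parsing rules are an assumption; real answers vary in shape:&lt;/p&gt;

```python
import re

def extract_keywords(model_output):
    """Naive post-processing: pull items from a dash- or star-bulleted answer,
    falling back to comma-separated values on a single line."""
    bullets = re.findall(r"^\s*[-*]\s*(.+)$", model_output, flags=re.MULTILINE)
    if bullets:
        return [b.strip().rstrip(".") for b in bullets]
    return [part.strip() for part in model_output.split(",") if part.strip()]

answer = "Recommended categories:\n- jeans\n- T-shirts\n- sneakers"
print(extract_keywords(answer))  # ['jeans', 'T-shirts', 'sneakers']
```

&lt;p&gt;In practice this layer is where a dedicated NLP model, rather than a regex, would normalize the free-text answer.&lt;/p&gt;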

&lt;p&gt;Knowing that, let’s discuss GPT models as a complementary tool suitable for different business and technical needs. &lt;/p&gt;

&lt;h2&gt;
  
  
  ChatGPT Business Use Cases
&lt;/h2&gt;

&lt;p&gt;ChatGPT is much more capable than its predecessors since it was trained with the reinforcement learning from human feedback (RLHF) technique. This gives it a distinctly higher level of performance compared to the no less popular GPT-3 model. Because of this, ChatGPT is now being applied to a range of business and technical tasks using its NLP capabilities. Let us ask what ChatGPT thinks of its business use cases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dE8GwC6h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ri1baj6q9igfu0d99yjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dE8GwC6h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ri1baj6q9igfu0d99yjx.png" alt="ChatGPT business use cases" width="872" height="515"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT business use cases&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here we can see the model providing an answer listing some of the most common use cases. Since the model relies on data limited to 2021, this isn’t even a full list, and there are tons of low-level tasks we can delegate to ChatGPT. We can also try to narrow down our focus and extract some value in a single domain area:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J7hTPHCn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssy4wbf7y3bpxd4fuxha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J7hTPHCn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssy4wbf7y3bpxd4fuxha.png" alt="ChatGPT gives advice on how to develop a product strategy" width="880" height="714"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT gives advice on how to develop a product strategy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Based on these examples, we can make a bunch of conclusions concerning three of the most promising business use cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CUSTOMER SUPPORT &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ChatGPT is a highly capable conversational bot that can be customized with your own data. As it’s able to understand a question correctly even with broken grammatical structure or missing syntactical parts, it can be considered a ready-made solution for customer support systems, namely chatbots.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LEAD GENERATION &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following up on the previous point, the same capabilities can be used in lead generation activities since, in most cases, ChatGPT’s writing doesn’t differ much from what people would write. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DATA ANALYTICS &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At some point, GPT models may become an alternative to modern business intelligence and data analytics solutions, since they can operate on huge databases of historical data and provide human workers with valuable insights. &lt;/p&gt;

&lt;p&gt;However, when it comes to real-time data analysis that requires constant updates and the most recent information available, GPT models appear to be clumsy. The process of training with the RLHF technique is complex, and beyond that, it requires a machine learning pipeline with automated retraining and redeployment of the models. Which is kind of rocket science… for now. &lt;/p&gt;

&lt;p&gt;So applications that rely heavily on real-time data updates are not really suitable for GPT models yet. What else can we do with it? &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Copywriting with ChatGPT
&lt;/h3&gt;

&lt;p&gt;One of the most popular claims about language models and neural networks is that they can take the job of copywriters, or at least boost the speed of writing to a large extent. Let’s see whether that’s true today. We’ll use two types of requests to check how the model handles article writing on a software engineering topic. The first type of request simply states the general idea of the text, while the second specifies the things I want the model to describe and provide an expert opinion or arguments for. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request&lt;/strong&gt;: &lt;em&gt;“Write me an article about different PHP frameworks.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT output&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3IOR6WME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5vtknxkct9apke0l3w0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3IOR6WME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5vtknxkct9apke0l3w0.png" alt="ChatGPT article on PHP frameworks topic" width="758" height="886"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT article on PHP frameworks topic&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request&lt;/strong&gt;: &lt;em&gt;“Write me an article about the difference between Laravel and Symfony frameworks, and what PHP framework should I choose for an Ecommerce application.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT output&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8V_Xgo-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/28nssi8sx37l94809dye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8V_Xgo-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/28nssi8sx37l94809dye.png" alt="ChatGPT article comparing PHP frameworks" width="758" height="886"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT article comparing PHP frameworks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In both cases, the model handles the topic pretty well from an informational standpoint. Grammatically and stylistically, the texts are also fine. However, when it comes to the level of creativity, ease of reading, reasoning, and justification of a certain point, they look very standard and unconvincing. &lt;/p&gt;

&lt;p&gt;While ChatGPT certainly can generate some blueprints to simplify your marketing efforts, it poses one problem for human proofreaders. As the model parses the data by itself, it is difficult to find the source of information and verify the model’s claims. So to be used for marketing purposes, ChatGPT requires fine-tuning and additional features like AI explainability to be effective and useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. General Programming with ChatGPT
&lt;/h3&gt;

&lt;p&gt;If GPT models can speak a human language, nothing stops them from using programming languages to write applications. Our PHP/DevOps group leader, Anton Logvinenko, put this to the test to see what results can be achieved by using the model to write PHP code for various functions. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request&lt;/strong&gt;: &lt;em&gt;“Write a PHP code on Laravel Framework which will display the list of users”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dyWLndbq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nhg4rp5ydb2m2jcnkk0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dyWLndbq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nhg4rp5ydb2m2jcnkk0m.png" alt="User controller" width="880" height="255"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;User controller&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qdhEoAbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2godg864c5dz84k49f0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qdhEoAbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2godg864c5dz84k49f0i.png" alt="View" width="880" height="323"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;View&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"In most cases, this code will definitely work. In this specific case, it lacks the database structure, as well as which fields does each user have, like name and surname. So you should keep in mind that currently ChatGPT doesn’t provide you with a finished code that can be used off the shelf.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The use case I can see for this is supporting junior-level developers in learning PHP or its frameworks and speeding up the process a bit. While the code itself is syntactically correct, there are a lot of things to improve."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anton Logvinenko&lt;/strong&gt;&lt;br&gt;
PHP/DevOps Group Leader&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request&lt;/strong&gt;: &lt;em&gt;“Write a PHP code which implements REST API with CRUD endpoints for article resource”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NwZf73OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a488k7oxx8x3xqetql8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NwZf73OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a488k7oxx8x3xqetql8.png" alt="Rest API" width="646" height="975"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Rest API&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Again, looking at what the models suggest, the code looks like a beginner level effort in PHP, but it will most likely work if you set up the database, configure access and so on. Although this code is written on the level of an intern PHP developer, it doesn’t use any of the frameworks available, and it is not optimized nor can you easily scale it to the whole application. Additionally, there are potential security threats in the form of SQL injections which ChatGPT leaves places for, which is not good in any case."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anton Logvinenko&lt;/strong&gt;&lt;br&gt;
PHP/DevOps Group Leader&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hXfDywum--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/azxpx48ce98v8xw605hz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hXfDywum--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/azxpx48ce98v8xw605hz.png" alt="Tree task request, first attempt" width="628" height="958"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tree task request, first attempt&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SFxU3uAV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s62i6fblytd6f3ny0xmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SFxU3uAV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s62i6fblytd6f3ny0xmz.png" alt="Tree task response, first attempt " width="880" height="848"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tree task response, first attempt&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;"In the first attempt of this example, I wanted ChatGPT to generate the tree of items based on the provided example. However, the output turned out to be a garbage code, and I figured out that I didn’t specify that it’s a tree in the prompt. So at the second attempt I’ve added an explanation of what I need as the output."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anton Logvinenko&lt;/strong&gt;&lt;br&gt;
PHP/DevOps Group Leader&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c1G_RryW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9t8xp5duvi9n5t2m8j6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c1G_RryW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9t8xp5duvi9n5t2m8j6.png" alt="Tree task request, second attempt" width="620" height="964"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tree task request, second attempt&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e-m6AuMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fyz92bgu7w34c3ek6icf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e-m6AuMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fyz92bgu7w34c3ek6icf.png" alt="Tree task response, second attempt" width="880" height="870"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tree task response, second attempt&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"In the second attempt, ChatGPT completed the task and the tree looks fine except for the code quality. What can we conclude from this little experiment? Obviously, ChatGPT isn’t able to replace software engineers yet because the approach and the quality of code writing is still very shallow. But at the end of the day, this thing will definitely see improvements over time. For now, it’s too early to speak of ChatGPT as an automated code writer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anton Logvinenko&lt;/strong&gt;&lt;br&gt;
PHP/DevOps Group Leader&lt;/p&gt;

&lt;p&gt;If you are considering options to partially automate coding workflows and support your developers, we would recommend using &lt;a href="https://github.com/features/copilot"&gt;Copilot&lt;/a&gt;. Copilot uses OpenAI Codex to suggest code snippets and functions in real time, and it has proved to be much more usable by middle-level and senior software engineers, with much better code quality. Compared with just Googling what’s needed to finish the code, Copilot enables much faster workflows, so it comes as a better alternative to what ChatGPT can suggest at the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Building AI pipelines with the help of GPT Models
&lt;/h3&gt;

&lt;p&gt;Given that GPT models can complete the majority of NLP tasks, we can think of them as an alternative to training your own language models. Following this tradition, I’ll provide the basics by asking ChatGPT why it can be a better option than training a custom NLP model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F0jp5sj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e0eak0bt4qkqvfzprwu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F0jp5sj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e0eak0bt4qkqvfzprwu6.png" alt="Why is ChatGPT better than training your own NLP model?" width="825" height="646"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Why is ChatGPT better than training your own NLP model?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is important to realize the limitations of this approach. Since GPT models do not have access to real-time data, they can only help you kickstart a service without an initial data-gathering phase. Over time, you’ll be able to collect live data from the model’s interactions with users, but this doesn’t serve as a hotfix for all specific NLP tasks. &lt;/p&gt;

&lt;p&gt;Instead, we can think of other application areas in data science where GPT models can be used.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Processing Table Data with ChatGPT
&lt;/h3&gt;

&lt;p&gt;In order to build a classification, regression, or forecasting model, we need to run through exploratory data analysis (EDA) steps. EDA allows us to find the relationships between different features in the data and to surface the insights and outliers that explain this or that anomalous behavior. So the majority of data science projects require performing 4 steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RaZ8Ln3q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjr32elnanmgfhvftsjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RaZ8Ln3q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjr32elnanmgfhvftsjw.png" alt="EDA Pipeline" width="880" height="236"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;EDA Pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the help of a GPT model, the above pipeline can be stripped down to a single well-formulated request. For instance, let’s use a request about Amazon’s quarterly sales numbers and ask ChatGPT to extract insights from this information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aPIehhst--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/va8pw0pjy1pl2v7f6qu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aPIehhst--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/va8pw0pjy1pl2v7f6qu2.png" alt="Amazon sales ChatGPT response" width="845" height="508"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Amazon sales ChatGPT response&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And here, we can see what insights and conclusions ChatGPT came up with, based on the previously presented data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;"Insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon’s net sales increased by 38% YoY in Q4&lt;/li&gt;
&lt;li&gt;In Q1, net sales increased by 26% YoY&lt;/li&gt;
&lt;li&gt;In Q2, net sales increased by 40% YoY&lt;/li&gt;
&lt;li&gt;In Q3, net sales increased by 37% YoY&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Conclusions (reasons):&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;- It’s important to note that the COVID-19 pandemic had a significant impact on Amazon’s sales, as more people turned to online shopping while physical stores were closed. Additionally, Amazon’s AWS and Advertising segments also had significant growth in 2020."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The output of the model can vary in quality, so it does require supervision or fine-tuning of the initial prompt. However, such capabilities can easily be applied to make the EDA stage easier and potentially automate data science pipeline tasks. &lt;/p&gt;
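&lt;p&gt;Once the figures are parsed out of the model’s answer (or pulled from the source table directly), the YoY insight itself is a one-line computation. A sketch with hypothetical quarterly figures, chosen only so that the growth rates come out close to the ones in the response above:&lt;/p&gt;

```python
# Hypothetical quarterly net sales (in $B) for two consecutive years.
sales = {
    "Q1": {"prev": 59.7, "curr": 75.2},
    "Q2": {"prev": 63.4, "curr": 88.8},
    "Q3": {"prev": 70.0, "curr": 95.9},
    "Q4": {"prev": 87.4, "curr": 120.6},
}

def yoy_growth(quarters):
    """Return year-over-year growth per quarter, rounded to whole percent."""
    return {q: round((y["curr"] - y["prev"]) / y["prev"] * 100)
            for q, y in quarters.items()}

print(yoy_growth(sales))  # {'Q1': 26, 'Q2': 40, 'Q3': 37, 'Q4': 38}
```

&lt;p&gt;The hard part ChatGPT adds on top of this arithmetic is the narrative explanation of why the numbers moved.&lt;/p&gt;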

&lt;h3&gt;
  
  
  5. ChatGPT for recommendation systems
&lt;/h3&gt;

&lt;p&gt;Recommendation systems, or recommender engines, have become as common as search engines on various platforms and within various applications. Recommendations perform one very important task: understanding each customer’s preferences in order to suggest relevant content and keep the customer on the platform as long as possible. &lt;/p&gt;

&lt;p&gt;But despite the competitive advantages recommender systems can provide, building one requires gathering data, experimenting, and training a working model, which appears to be a major reason why not everyone can afford recommendations. To start recommending, we generally need to collect data about our content first, and then methodically track which content categories our customers gravitate towards. &lt;/p&gt;

&lt;p&gt;In this situation, we can use ChatGPT as an alternative to recommendation engines, because it already has tons of data inside it. While it’s not fine-tuned for your content and doesn’t know your users’ search history, it can provide fairly specific recommendations once we use specific enough prompts. &lt;/p&gt;

&lt;p&gt;For instance, we used this type of &lt;strong&gt;request&lt;/strong&gt;: &lt;em&gt;“I want to have a journey. I have a 2-year-old kid and like to spend time near the sea. Where should I go?”.&lt;/em&gt; Here is the output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;:&lt;em&gt;“Taking a trip with a 2-year-old and spending time near the sea can be a great idea. Here are a few suggestions for destinations that might be a good fit for you and your family:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The beaches of California: The beaches of California offer a great combination of sun, sand, and sea. Some popular destinations include Santa Monica, Huntington Beach, and Newport Beach.&lt;/li&gt;
&lt;li&gt;The beaches of Florida: Florida is home to many beautiful beaches, including Miami Beach, Siesta Key, and Clearwater Beach.&lt;/li&gt;
&lt;li&gt;The beaches of Hawaii: Hawaii’s beaches are famous for their clear waters and soft, white sand. Some popular destinations include Waikiki Beach, Kaanapali Beach, and Poipu Beach.&lt;/li&gt;
&lt;li&gt;The beaches of the Caribbean: The Caribbean is home to many beautiful beaches, including Seven Mile Beach in Jamaica, Eagle Beach in Aruba, and Grace Bay in Turks and Caicos.&lt;/li&gt;
&lt;li&gt;The beaches of the Mediterranean: Mediterranean countries like Spain, Greece, and Italy are home to many beautiful beaches. Some popular destinations include Playa de Muro in Mallorca, Mykonos in Greece, and Positano in Italy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;It’s worth noting that these destinations also offer a variety of other activities and attractions for both adults and children, such as swimming, sunbathing, water sports, playgrounds and theme parks. It’s also important to consider other factors such as weather, safety, and accessibility before making a decision.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we specify our request by adding details, here’s how the answer may look: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request&lt;/strong&gt;: &lt;em&gt;“I want to have a journey. I have a 2-year-old kid and like to spend time near the sea, I have 3000$. Where should I go?”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT response&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Taking a trip with a 2-year-old and spending time near the sea can be a great idea. It’s important to consider your budget when planning your trip. Here are a few suggestions for destinations that might be a good fit for you and your family within your budget:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The beaches of Mexico: Mexico offers a great combination of sun, sand, and sea. Some popular destinations include Cancun, Playa del Carmen, and Puerto Vallarta.&lt;/li&gt;
&lt;li&gt;The beaches of Central America: Countries like Costa Rica, Panama, and Nicaragua have beautiful beaches, great weather and are budget-friendly.&lt;/li&gt;
&lt;li&gt;The beaches of Southeast Asia: Countries like Thailand, Vietnam, and the Philippines have beautiful beaches, great weather and are budget-friendly.&lt;/li&gt;
&lt;li&gt;The beaches of the Caribbean: Some Caribbean destinations such as the Dominican Republic, Jamaica, and Cuba are budget-friendly and have beautiful beaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;It’s worth noting that these destinations also offer a variety of other activities and attractions for both adults and children, such as swimming, sunbathing, water sports, playgrounds and theme parks. Also, when traveling with a 2-year-old, it’s important to consider accessibility, safety, and the availability of good quality accommodation and services, such as good hospitals and childcare facilities.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the output isn’t strictly framed by the parameters we provided, as the model tends to go beyond the initial prompt. Yet even these raw results can outperform searching for content through dozens of filters, because filtering requires a human to pick exactly the right words to get satisfying results. That’s the major advantage of NLP-based recommendations over standard approaches.&lt;/p&gt;
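&lt;p&gt;A minimal sketch of how such a recommendation request could be wired up through the ChatGPT API. The &lt;code&gt;build_travel_prompt&lt;/code&gt; helper and its parameter names are illustrative assumptions, not part of any SDK; only the commented-out call reflects OpenAI’s chat endpoint:&lt;/p&gt;

```python
# Sketch: folding structured user filters into a free-text prompt for the
# ChatGPT API. The helper below is a hypothetical example, not library code.

def build_travel_prompt(budget_usd: int, kid_age: int, preference: str) -> list:
    """Turn structured parameters into a single natural-language request."""
    content = (
        f"I want to have a journey. I have a {kid_age}-year-old kid, "
        f"like to spend time {preference}, and my budget is {budget_usd}$. "
        "Where should I go?"
    )
    return [
        {"role": "system", "content": "You are a travel recommendation assistant."},
        {"role": "user", "content": content},
    ]

messages = build_travel_prompt(3000, 2, "near the sea")

# Sending the request requires the openai package and an API key:
# import openai
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
# print(response["choices"][0]["message"]["content"])
```

&lt;p&gt;Because the user’s filters are folded into plain language, the model fills in whatever the structured form left unsaid, which is exactly the behavior observed above.&lt;/p&gt;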

&lt;h3&gt;
  
  
  6. Using ChatGPT for Computer Vision Tasks
&lt;/h3&gt;

&lt;p&gt;Since GPT models are trained on text data, it’s obvious we can’t replace computer vision models this way. However, ChatGPT can simplify the implementation of certain parts of the computer vision pipeline such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Color detection for known images&lt;/li&gt;
&lt;li&gt;Image similarity, e.g. finding logos similar to McDonald’s by color or typography&lt;/li&gt;
&lt;li&gt;Object detection, for instance identifying the object depicted on photos, graphic materials, brand symbols, etc.&lt;/li&gt;
&lt;li&gt;Optical character recognition (OCR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The information provided by ChatGPT can be useful when analyzing and preparing data for a computer vision model in different areas. For instance, in OCR tasks, the model can be used to extract text from an image, as in ID card recognition. This is a separate step in preparing data for training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kAEy5sEV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o18ogukb798zh13nniyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kAEy5sEV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o18ogukb798zh13nniyj.png" alt="Polish ID" width="880" height="555"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Polish ID&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since ID cards have their own format and different fields in each country, the task of recognizing ID cards can be very resource intensive. With GPT models, we can reduce the time needed for extracting textual data from the images, since ChatGPT can take some part of a pipeline on its own.&lt;/p&gt;
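&lt;p&gt;One way a GPT model could take over part of that pipeline is post-processing raw OCR output into structured fields. A hedged sketch, where the field list, prompt wording, and sample OCR string are all illustrative assumptions rather than a fixed schema:&lt;/p&gt;

```python
# Sketch: asking a GPT model to structure raw OCR text from an ID card.
# FIELDS and the sample text are hypothetical, for illustration only.

FIELDS = ["surname", "given_names", "date_of_birth", "document_number"]

def build_extraction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in an instruction to return structured JSON."""
    return (
        "Extract the following fields from this ID card OCR text and "
        f"return them as JSON with keys {FIELDS}:\n\n{ocr_text}"
    )

prompt = build_extraction_prompt("KOWALSKA ANNA 03.05.1990 ZZC003483")

# The prompt would then be sent to the chat endpoint, e.g.:
# openai.ChatCompletion.create(model="gpt-3.5-turbo",
#                              messages=[{"role": "user", "content": prompt}])
# and the JSON reply parsed with json.loads().
```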

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HBtSNVeQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bogjvmkha1u19uhmi4if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HBtSNVeQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bogjvmkha1u19uhmi4if.png" alt="Text fields of polish ID" width="821" height="540"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Text fields of polish ID&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of Prompt Engineering and the Future of GPT Models
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is the discipline of crafting the input text that’s passed to a GPT model, much like writing a technical task. The output will always depend on how you explain what you need, so it helps to understand the best way to state each aspect of your request so that the model doesn’t produce some Frankenstein for you. There are several basic principles of prompt engineering for GPT models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be specific&lt;/strong&gt;: the more specific the prompt, the more relevant the generated text will be to the task at hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define the context&lt;/strong&gt;: the more context the model has, the clearer the task for the GPT model is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Match the language and terminology&lt;/strong&gt;: set the prompt in the same language and the same domain terminology in which you want to receive the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find a compromise between prompt length and specificity&lt;/strong&gt;: longer and more specific prompts can lead to more accurate and relevant answers, but also complicate model management. In some cases, complex and long prompts can trigger the model to break and write an unfinished response, which we have experienced several times during our testing. &lt;/p&gt;
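&lt;p&gt;As a rough illustration of that compromise, a minimal sketch that sanity-checks prompt length before sending. The roughly-4-characters-per-token heuristic and the budget value are assumptions for illustration, not official limits; real projects would count tokens with the model’s own tokenizer:&lt;/p&gt;

```python
# Sketch: a rough guard against over-long prompts. The ~4 chars/token
# figure is a common rule of thumb for English text, not an exact count.

MAX_PROMPT_TOKENS = 2048  # illustrative budget, not an official limit

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly one token per 4 characters."""
    return max(1, len(text) // 4)

def check_prompt(text: str) -> str:
    """Reject prompts that are likely to exceed the token budget."""
    tokens = estimate_tokens(text)
    if tokens > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt too long: roughly {tokens} tokens")
    return text

check_prompt("Summarize this article in three bullet points.")  # fits easily
```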

&lt;p&gt;All the previous examples were made using ChatGPT at a pre-release stage, and it already shows much better results than its predecessors. However, OpenAI announced the GPT-4 model in the first quarter of 2023, and it reportedly uses 500x more data samples than GPT-3. The situation is very dynamic, so it makes sense to keep track of release dates: you might prefer future OpenAI models for your product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V58VhSbR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3fyfss9ffghpneg0q2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V58VhSbR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3fyfss9ffghpneg0q2t.png" alt="The comparison between training data using for GPT 3 and GPT 4 models" width="651" height="494"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The comparison between training data using for GPT 3 and GPT 4 models&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The main advantage of GPT models is the ability to find ways to solve a problem without an initial large volume of data. Obviously, such models cannot replace the desired functionality completely. However, in the first iterations of the product, the use of such models helps to start development, providing users with a service while collecting custom data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Build an Effective Speech Recognition System</title>
      <dc:creator>Liubov Zatolokina</dc:creator>
      <pubDate>Mon, 14 Nov 2022 08:55:58 +0000</pubDate>
      <link>https://dev.to/liubovzatolokina2022/how-to-build-an-effective-speech-recognition-system-3be2</link>
      <guid>https://dev.to/liubovzatolokina2022/how-to-build-an-effective-speech-recognition-system-3be2</guid>
      <description>&lt;p&gt;Modern voice applications use AI algorithms to recognize different sounds, including human voice and speech. In technical terms, most of the voice apps perform either voice recognition or speech recognition. And while there is no big difference between the architecture and AI models that perform voice/speech recognition, they actually relate to different business tasks. So first of all, let us elaborate on the difference between them.&lt;/p&gt;

&lt;h3&gt;
  
  
  VOICE RECOGNITION VS SPEECH RECOGNITION
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Voice recognition&lt;/strong&gt; is the ability to single out specific voices from other sounds, and identify the owner’s tone to implement security features like voice biometrics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech recognition&lt;/strong&gt; is mostly responsible for extracting meaningful information from the audio, recognizing the words said, and the context they are placed in. With this we can create systems like chatbots and virtual assistants for automated communication and precise understanding of voice commands. &lt;/p&gt;

&lt;p&gt;Both terms can often be used interchangeably, because there is not much technical difference between the algorithms that perform these functions. Although, depending on what you need, the pipeline for voice or speech recognition may be different in terms of processing steps.&lt;/p&gt;

&lt;p&gt;In this post, we’ll focus on the general approach for speech recognition applications, and elaborate on some of the architectural principles we can apply to cover all of the possible functional requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do speech recognition applications work?
&lt;/h2&gt;

&lt;p&gt;Speech recognition covers the large sphere of business applications ranging from voice-driven user interfaces to virtual assistants like Alexa or Siri. Any speech recognition solution is based on the &lt;strong&gt;Automatic Speech Recognition (ASR)&lt;/strong&gt; technology that extracts words and grammatical constructions from the audio, to process it and provide some type of system response. &lt;/p&gt;

&lt;h2&gt;
  
  
  WHICH TYPE OF AI IS USED IN SPEECH RECOGNITION?
&lt;/h2&gt;

&lt;p&gt;Speech recognition models can react to speech directly as an activation signal for any type of action. But since we’re speaking about speech recognition, it is important to note that AI doesn’t extract meaningful information directly from the audio, because it contains many extraneous sounds. This is why speech-to-text conversion is an obligatory component before applying Natural Language Processing, or NLP. &lt;/p&gt;

&lt;p&gt;So the top-level scope of a speech recognition application can be represented as follows: the user’s speech provides input to the AI algorithm, which helps to find the appropriate answer for the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VdsudqXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xr4s7qho4pegsqi9nle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VdsudqXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xr4s7qho4pegsqi9nle.png" alt="High-level representation of an automatic speech recognition application" width="880" height="141"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High-level representation of an automatic speech recognition application&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, it is important to note that the model that converts speech to text for further processing is the most obvious component of the entire AI pipeline. Besides the conversion model, there will be numerous components that ensure proper system performance.&lt;/p&gt;

&lt;p&gt;So approaching the speech recognition system development, first you must decide on the scope of the desired application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What will the application do?&lt;/li&gt;
&lt;li&gt;Who will be the end users?&lt;/li&gt;
&lt;li&gt;What environmental conditions will it be used in?&lt;/li&gt;
&lt;li&gt;What are the features of the domain area?&lt;/li&gt;
&lt;li&gt;How will it scale in the future?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  WHAT IS IMPORTANT FOR SPEECH RECOGNITION TECHNOLOGY?
&lt;/h3&gt;

&lt;p&gt;When starting speech recognition system development, there are a number of basic audio properties we need to consider from the start: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio file format (mp3, wav, flac etc.)&lt;/li&gt;
&lt;li&gt;Number of channels (stereo or mono)&lt;/li&gt;
&lt;li&gt;Sample rate value (8kHz, 16kHz, etc.)&lt;/li&gt;
&lt;li&gt;Bitrate (32 kbit/s, 128 kbit/s, etc.)&lt;/li&gt;
&lt;li&gt;Duration of the audio clips.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most important ones are the audio file format and sample rate, so let’s cover them in detail. Input devices record audio in different file formats: most often audio is saved as lossy MP3, but there are also lossless formats like WAV or FLAC. Whenever we record a sound wave, we digitize it by sampling at discrete intervals. This is what’s called the sample rate, where each sample is the amplitude of the waveform at a particular point in time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NIGg4o70--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufm592ofsmyhi09y8lnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NIGg4o70--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufm592ofsmyhi09y8lnl.png" alt="Audio signal representation" width="400" height="266"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Audio signal representation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some models are tolerant to format changes and sample rate variety, while others can intake only a fixed number of formats. In order to minimize this kind of inconsistency, we can use various built-in methods for working with audio in each programming language. For example, if we are talking about the Python language, then various operations such as reading, transforming, and recording audio can be performed using the libraries like &lt;a href="https://librosa.org/doc/latest/index.html"&gt;Librosa&lt;/a&gt;, &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html"&gt;scipy.io.wavfile&lt;/a&gt; and others.&lt;/p&gt;
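&lt;p&gt;As a minimal sketch of such checks, here is how the properties listed above (channels, sample rate, duration) can be inspected with Python’s built-in &lt;code&gt;wave&lt;/code&gt; module; Librosa offers the same checks plus resampling and format conversion:&lt;/p&gt;

```python
# Sketch: write a test WAV file with the stdlib wave module, then read its
# properties back, the same checks you'd run before feeding audio to ASR.

import math
import struct
import wave

# Generate one second of a 440 Hz mono tone at 16 kHz, 16-bit.
sample_rate = 16000
frames = b"".join(
    struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / sample_rate)))
    for n in range(sample_rate)
)
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(sample_rate)
    f.writeframes(frames)

# Inspect the properties before passing the file to a model.
with wave.open("tone.wav", "rb") as f:
    channels = f.getnchannels()
    rate = f.getframerate()
    duration = f.getnframes() / rate

print(channels, rate, duration)  # 1 16000 1.0
```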

&lt;p&gt;Once we grasp the specifics of audio processing, we get a more solid understanding of what data we’ll need and how much effort it will take to process it. At this stage, consulting a data science team experienced in ASR and NLP is highly recommended, since gathering the wrong data or setting unrealistic objectives are the biggest risks at the beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Speech Recognition process and components
&lt;/h2&gt;

&lt;p&gt;Automatic speech recognition, speech-to-text, and NLP are some of the most obvious modules in the whole voice-based pipeline. But they cover a very basic range of requirements. So now let’s look at the common requirements for speech recognition, to understand what else we might include in our pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application has to work in background mode, so it has to separate the user’s speech from other sounds. For this feature, we’ll need voice activity detection methods, which will transfer only those frames that contain the target voice.&lt;/li&gt;
&lt;li&gt;The application is meant to be used in crowded places, which means there will be other voices and surrounding noise. Background noise suppression models are preferable here, especially neural networks which can remove both low-frequency noise, and high frequency loud sounds like human voices.&lt;/li&gt;
&lt;li&gt;In cases where there will be several people talking, like in the case of a call center, we also want to apply speaker diarization methods to divide the input voice stream into several speakers, finding the required one.&lt;/li&gt;
&lt;li&gt;The application must display the result of speech recognition to the user. It should take into account that speech2text (ASR) models may return text without punctuation marks, or with grammatical mistakes. In this case, it is advisable to apply spelling correction models, which minimize the likelihood that the user will see an unreadable block of text in front of them.&lt;/li&gt;
&lt;li&gt;The application will be used in a domain area, where professional terms and abbreviations are used. In such cases, there is a risk that speech2text models will not be able to correctly cope with this task and then training of a custom speech2text model will be required. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this way, we can derive the following pipeline design which will include multiple modules just to fetch the correct data and process it.&lt;/p&gt;
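&lt;p&gt;The modular design above can be sketched as a chain of interchangeable stages, so optional blocks (VAD, noise suppression, diarization, spell correction) can be inserted or dropped per project. All stage bodies below are stubs, standing in for real models:&lt;/p&gt;

```python
# Sketch: a pipeline as a list of composable stage functions. Every stage
# here is a stub; in a real system each would wrap a trained model.

from typing import Callable, List

def voice_activity_detection(audio: list) -> list:
    """Stub VAD: keep only frames that carry signal."""
    return [frame for frame in audio if frame != 0]

def asr(audio: list) -> str:
    """Stub for a speech2text model."""
    return "recognized text"

def intent_classification(text: str) -> str:
    """Stub for the NLP block that picks a response."""
    return "intent:demo"

def run_pipeline(audio: list, stages: List[Callable]) -> object:
    """Pass the input through each stage in order."""
    result = audio
    for stage in stages:
        result = stage(result)
    return result

output = run_pipeline([0, 1, 2, 0], [voice_activity_detection, asr, intent_classification])
print(output)  # intent:demo
```

&lt;p&gt;Keeping stages as plain functions makes it cheap to experiment with removing a block, which matters for the findings in the next sections.&lt;/p&gt;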

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cguawIvj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03lyucwwt8q7h48jfxzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cguawIvj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03lyucwwt8q7h48jfxzf.png" alt="Automatic Speech Recognition (ASR) pipeline" width="880" height="648"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Automatic Speech Recognition (ASR) pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Throughout the AI pipeline, there are blocks that are used by default: ASR and NLP methods (for example, intent classification models). Essentially, the AI algorithm takes sound as an input, converts it to speech using ASR models, and chooses a response for the user using a pre-trained NLP model. However, for a qualitative result, such stages as pre-processing and post-processing are necessary. &lt;/p&gt;

&lt;p&gt;Now we’ll move to advanced architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our 4 recommendations for improving quality of ASR
&lt;/h2&gt;

&lt;p&gt;To optimize development planning and mitigate risks before you get into trouble, it is better to know about the existing problems within the standard approaches in advance. MobiDev ran an extensive test of the standard pipeline, so in this section we’ll share some of the insights that need to be considered.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PAY ATTENTION TO THE SAMPLE RATE&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As we’ve mentioned before, audio has characteristics such as sample rate, number of channels, etc. These can significantly affect the result of voice recognition and the overall operation of the ASR model. In order to get the best possible results, we should consider that most pre-trained models were trained on datasets with a 16kHz sample rate and only one channel, or in other words, mono audio. &lt;/p&gt;

&lt;p&gt;This brings with it some constraints on what data we can take for processing, and adds requirements to the data preparation stages.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;NORMALIZE RECORDING VOLUME&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Obviously, ASR methods are sensitive to audio containing a lot of extraneous noise, and suffer when trying to recognize atypical accents. But what’s more important, speech recognition results strongly depend on the sound volume. Recordings can often be inconsistent in volume due to the distance from the microphone, noise suppression effects, and natural volume fluctuations in speech. In order to avoid such inaccuracies, we can use the &lt;a href="https://github.com/csteinmetz1/pyloudnorm"&gt;Pyloudnorm&lt;/a&gt; library for Python, which helps determine the sound volume range and amplify the sound without distortion. This method is very similar to audio compression but introduces fewer artifacts, improving the overall quality of the model’s predictions.&lt;/p&gt;
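&lt;p&gt;To make the idea concrete, here is a pure-Python sketch of simple peak normalization. Pyloudnorm goes further and normalizes to a target perceived loudness (LUFS) rather than just the peak amplitude, so treat this only as an illustration of the principle:&lt;/p&gt;

```python
# Sketch: peak normalization, scale all samples so the loudest one hits a
# chosen target. Samples are floats in the -1..1 range.

def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the maximum absolute value equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

quiet = [0.0, 0.1, -0.3, 0.2]
loud = peak_normalize(quiet)
print(max(abs(s) for s in loud))  # 0.9
```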

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aKATaugZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvq76k6zq9gzy8bbplzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aKATaugZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvq76k6zq9gzy8bbplzz.png" alt="Nvidia Quarznet 15×5 speech recognition results with and without volume normalization" width="880" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Nvidia Quarznet 15×5 speech recognition results with and without volume normalization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here you can see an example of voice recognition without volume normalization, and also with it. In the first case, the model struggled to recognize a simple word, but after the volume was normalized, the results improved.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;IMPROVE RECOGNITION OF SHORT WORDS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The majority of ASR models were trained on datasets that contain texts with proper semantic relations between each sentence. This brings us to another problem with recognizing short phrases taken out of context. Below is a comparison of the performance of the &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5"&gt;ASR model&lt;/a&gt; on short words taken out of context and on a full sentence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---0ApTY96--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv2jey2eqkse0a8fb4tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---0ApTY96--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv2jey2eqkse0a8fb4tv.png" alt="The result of recognizing short words in and out of context" width="880" height="301"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The result of recognizing short words in and out of context&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In order to overcome this problem, it is necessary to think about the use of any preprocessing methods that allow the model to understand in which particular area a person wants to receive information more accurately.&lt;/p&gt;

&lt;p&gt;Additionally, ASR models can generate non-existing words and other specific mistakes during speech-to-text conversion. Spell correction methods may simply fail in the best case, correct the word to one that is close to the right choice, or even change it to a completely wrong one. This problem also applies to very short words taken out of context, and it should be foreseen in advance. &lt;/p&gt;
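&lt;p&gt;That failure mode is easy to reproduce. A minimal sketch using the standard-library &lt;code&gt;difflib&lt;/code&gt; to match ASR output against a domain vocabulary (the vocabulary and cutoff here are illustrative): the same matcher that fixes a garbled word can just as easily replace a valid word with a wrong neighbor:&lt;/p&gt;

```python
# Sketch: naive spell correction of ASR output against a domain vocabulary.
# The vocabulary and cutoff are illustrative choices, not tuned values.

import difflib

DOMAIN_VOCAB = ["suppression", "recognition", "diarization", "sample"]

def correct_word(word: str) -> str:
    """Replace a word with its closest vocabulary match, if any."""
    matches = difflib.get_close_matches(word, DOMAIN_VOCAB, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(correct_word("recogniton"))  # recognition  (helpful fix)
print(correct_word("simple"))      # sample       (a valid word got replaced)
```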

&lt;ol start="4"&gt;
&lt;li&gt;USE NOISE SUPPRESSION METHODS ONLY WHEN NEEDED&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Background noise suppression methods can greatly help separate a user’s speech from the surrounding sounds. However, when loud noise is present, noise suppression itself can cause problems, such as incorrect operation of the ASR model. &lt;/p&gt;

&lt;p&gt;Human speech tends to change in volume depending on the part of the sentence. For example, we naturally lower our voice at the end of a sentence, which leads to the voice blending with other sounds and being drowned out by the noise suppression. As a result, the ASR model fails to recognize part of the message. Below you can see an example of noise suppression affecting only a part of a user’s speech.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sJf6RKO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgid5ep8m73mp6mmicp5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sJf6RKO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgid5ep8m73mp6mmicp5.jpg" alt="Noise suppression effect on the speech recognition" width="880" height="1122"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Noise suppression effect on the speech recognition&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is also worth considering that as a result of applying Background Noise Suppression models, the original voice is distorted, which adversely affects the operation of the ASR model. Therefore, you should not apply Background Noise Suppression without a specific need for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get the enhanced ASR system
&lt;/h2&gt;

&lt;p&gt;Based on the mentioned points, the initial pipeline can bring more trouble with it than actual performance benefits. This is because some of the components that seem logical and obligatory may interrupt the work of other essential components. In other cases, there is a strict need to add layers of preprocessing before the actual AI model can interact with data. We can therefore come up with the following enhanced ASR system architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GK9OaDQb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wkrtp5kqsk5pevdh9y2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GK9OaDQb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wkrtp5kqsk5pevdh9y2b.png" alt="Enhanced automatic speech recognition system pipeline" width="880" height="648"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Enhanced automatic speech recognition system pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Based on the above points, the noise suppression and spelling correction modules were removed. Instead, to remove noise and get rid of errors in the recognized text, the ASR model has to be fine-tuned on real data that fully reflects the actual environmental conditions and the features of the domain area.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>asr</category>
    </item>
    <item>
      <title>Human Pose Estimation Technology Capabilities and Use Cases in 2022</title>
      <dc:creator>Liubov Zatolokina</dc:creator>
      <pubDate>Tue, 26 Apr 2022 12:12:26 +0000</pubDate>
      <link>https://dev.to/liubovzatolokina2022/human-pose-estimation-technology-capabilities-and-use-cases-in-2022-8i6</link>
      <guid>https://dev.to/liubovzatolokina2022/human-pose-estimation-technology-capabilities-and-use-cases-in-2022-8i6</guid>
      <description>&lt;p&gt;Human Pose Estimation is a computer vision-based technology that identifies and classifies specific points on the human body. These points represent our limbs and joints to calculate the angle of flexion, and estimate, well, human pose. &lt;/p&gt;

&lt;p&gt;While it sounds awkward, knowing the right angle of a joint in a specific exercise is the basis of work for physiotherapists, fitness trainers, and artists. Implementing such capabilities for a machine results in surprisingly useful applications in different fields. &lt;/p&gt;

&lt;p&gt;In this article we’ll explore human pose estimation in depth. We’ll figure out its principle of work and capabilities to understand suitable business cases. Also, we’ll analyze different approaches to Human Pose Estimation as a machine learning technology, and try to define the applications for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Human Pose Estimation?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Human Pose Estimation (HPE)&lt;/strong&gt; is a task in computer vision that focuses on identifying the position of a human body in a specific scene. Most of the HPE methods are based on recording an RGB image with the optical sensor to detect body parts and the overall pose. This can be used in conjunction with other computer vision technologies for fitness and rehabilitation, augmented reality applications, and surveillance. &lt;/p&gt;

&lt;p&gt;The essence of the technology lies in detecting points of interest on the limbs, joints, and even face of a human. These key points are used to produce a 2D or 3D representation of a human body model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O1c7Gcly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4uos3vglxtf75xyz6eci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O1c7Gcly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4uos3vglxtf75xyz6eci.png" alt="2D representation of a Albert Einstein body pose" width="700" height="492"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://towardsdatascience.com/realtime-multiple-person-2d-pose-estimation-using-tensorflow2-x-93e4c156d45f"&gt;Image source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These models are basically a map of body joints we track during the movement. This is done for a computer not only to find the difference between a person just sitting and squatting, but also to calculate the angle of flexion in a specific joint, and tell if the movement is performed correctly. &lt;/p&gt;

&lt;p&gt;There are three common types of human body models: skeleton-based, contour-based, and volume-based. The skeleton-based model is the most used one in human pose estimation because of its flexibility: it consists of a set of joints like ankles, knees, shoulders, elbows, and wrists, plus limb orientations, comprising the skeletal structure of a human body.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_DI7uvfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ujzv61xlbcxu2g19a8xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_DI7uvfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ujzv61xlbcxu2g19a8xa.png" alt="Body models in human pose estimation" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2006.01423.pdf"&gt;Image source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A skeleton-based model is used for 2D as well as 3D representation, but generally, 2D and 3D methods are used in conjunction. 3D human pose estimation grants better accuracy to the application’s measurements, since it considers the depth coordinate and feeds those values into the calculation. Depth is important for the majority of movements, because the human body doesn’t move in only two dimensions.  &lt;/p&gt;

&lt;p&gt;So now let’s find out how 3D human pose estimation works from a technical perspective, and find out the current capabilities of such systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  How 3D Human Pose Estimation Works
&lt;/h2&gt;

&lt;p&gt;The overall flow of a body pose estimation system starts with capturing the initial data and uploading it for the system to process. As we’re dealing with motion detection, we need to analyze a sequence of images rather than a still photo, since we need to extract how key points change during the movement pattern. &lt;/p&gt;

&lt;p&gt;Once the image is uploaded, the HPE system will detect and track the required key points for analysis. In a nutshell, different software modules are responsible for tracking 2D keypoints, creating a body representation, and converting it into a 3D space. So generally, when we speak about creating a body pose estimation model, we mean implementing two different modules for 2D and 3D planes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zj-Kz04B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84n9dzxderhgatdcmmle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zj-Kz04B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84n9dzxderhgatdcmmle.png" alt="The difference between 2D and 3D pose estimation reconstructions" width="880" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=W73Mc0Gil9A"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So for the majority of human pose estimation tasks, the flow will be broken into two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detecting and extracting 2D key points from the sequence of images. This entails using horizontal and vertical coordinates that build up a skeleton structure.&lt;/li&gt;
&lt;li&gt;Converting 2D key points into 3D adding the depth dimension. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;During this process, the application will make the required calculations to perform pose estimation.&lt;/p&gt;
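&lt;p&gt;The two-stage flow above can be sketched in Python. Everything here is a hypothetical stand-in: a real system would run a trained 2D detector in stage one and a learned lifting network (such as VideoPose3D) in stage two, but the data shapes flowing between the stages look like this:&lt;/p&gt;

```python
import numpy as np

# Stage 1 (stub): a trained 2D detector would return (x, y) pixel
# coordinates for each joint in each frame. Here we fake it with
# random keypoints for a 17-joint, COCO-style skeleton.
def detect_2d_keypoints(frames: np.ndarray, num_joints: int = 17) -> np.ndarray:
    rng = np.random.default_rng(0)
    return rng.uniform(0, 1, size=(len(frames), num_joints, 2))

# Stage 2 (stub): a learned lifting model would predict the depth of
# each joint from the 2D sequence. Here we just append a placeholder
# z coordinate to show where the third dimension enters the flow.
def lift_to_3d(keypoints_2d: np.ndarray) -> np.ndarray:
    z = np.zeros(keypoints_2d.shape[:-1] + (1,))
    return np.concatenate([keypoints_2d, z], axis=-1)

frames = np.zeros((30, 64, 64, 3))  # a 30-frame clip
kp2d = detect_2d_keypoints(frames)  # shape (30, 17, 2)
kp3d = lift_to_3d(kp2d)             # shape (30, 17, 3)
```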

&lt;p&gt;Estimating human pose during the exercise is just one example in the fitness industry. Some models can also detect keypoints on the human face and track head position, which can be applied for entertainment applications like Snapchat masks. But we’ll discuss the use cases of HPE later in the article. &lt;/p&gt;

&lt;p&gt;You can check this &lt;a href="https://platform.mobi-dev.com/demo/2/"&gt;demo&lt;/a&gt; to see how it works in a nutshell: just upload a short video performing some movement and wait for the processing time to see the pose analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  3D POSE ESTIMATION PERFORMANCE AND ACCURACY
&lt;/h3&gt;

&lt;p&gt;Depending on the chosen algorithm, the HPE system will provide different performance and accuracy results. Let’s see how they correlate in terms of our experiment with two of the most popular human pose estimation models, &lt;a href="https://github.com/facebookresearch/VideoPose3D"&gt;VideoPose3D&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2006.10204"&gt;BlazePose&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We’ve tested the BlazePose and VideoPose3D models on the same hardware, using a 5-second video with 2160*3840 dimensions at 60 frames per second. VideoPose3D took a total of 8 minutes to process the video and produced a good accuracy result. In contrast, BlazePose reached a processing speed of 3-4 frames per second, which allows its use in real-time applications. But the accuracy results shown below fall short of the objectives of any HPE task.&lt;/p&gt;
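&lt;p&gt;As a quick sanity check on these numbers, the 5-second clip at 60 FPS contains 300 frames, so 8 minutes of processing corresponds to an effective throughput well below one frame per second:&lt;/p&gt;

```python
# Rough throughput math for the experiment above: a 5-second clip at
# 60 FPS contains 300 frames; processing it in 8 minutes gives
# VideoPose3D an effective rate well under one frame per second.
total_frames = 5 * 60                          # 300 frames in the clip
videopose3d_fps = total_frames / (8 * 60)      # 8 minutes of processing
print(round(videopose3d_fps, 3))               # 0.625 frames per second
```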

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--17HJB4XY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzhamh12l95zaqvbxheo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--17HJB4XY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzhamh12l95zaqvbxheo.gif" alt="VideoPose3D and BlazePose processing results" width="518" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=ywigye4EFZg"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The processing time depends on the movement complexity, video and lighting quality, and the 2D pose detector module. Even though BlazePose and VideoPose3D have different 2D detectors, this stage turns out to be the performance bottleneck in both cases.&lt;/p&gt;

&lt;p&gt;One possible way to optimize HPE performance is to accelerate 2D keypoint detection. Existing 2D detectors can be modified or extended with post-processing stages to improve overall accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-time 3D human pose estimation
&lt;/h2&gt;

&lt;p&gt;Whether we deal with a fitness app, a rehabilitation app, face masks, or surveillance, real-time processing is essential. Of course, the performance of the model will depend on the chosen algorithm and hardware, but the majority of existing open-source models have quite long response times; when they don’t, accuracy suffers. So is it possible to improve existing 3D human pose estimation models to achieve acceptable accuracy with real-time processing?&lt;/p&gt;

&lt;p&gt;While models like BlazePose are able to provide real-time processing, their tracking accuracy is not suitable for commercial use or complex tasks. In our experiment, we tested the 2D component of BlazePose combined with a modified 3D-pose-baseline model, implemented in Python. &lt;/p&gt;

&lt;p&gt;In terms of speed, our model achieves about 46 FPS on the above-mentioned hardware without video rendering, where the 2D pose detection model produces keypoints at about 50 FPS. In comparison, the modified 3D baseline model can produce keypoints at about 780 FPS. A detailed breakdown of the processing time of our approach is presented below.&lt;/p&gt;
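&lt;p&gt;Assuming the 2D detector and the 3D lifting model run sequentially on each frame, the overall pipeline throughput is the reciprocal of the summed per-frame times, which lines up with the measured result:&lt;/p&gt;

```python
# If the 2D detector and the 3D lifting model run back to back on each
# frame, pipeline FPS is the reciprocal of the summed per-frame times.
fps_2d = 50    # 2D pose detector throughput
fps_3d = 780   # modified 3D baseline throughput
pipeline_fps = 1 / (1 / fps_2d + 1 / fps_3d)
print(round(pipeline_fps, 1))  # ~47, close to the measured ~46 FPS
```

&lt;p&gt;This also shows why the 2D detector dominates: the 3D stage adds only about 1.3 ms per frame on top of the detector’s 20 ms.&lt;/p&gt;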

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uUZJy1xK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0x0w1tfgrcwb8hwtlc6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uUZJy1xK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0x0w1tfgrcwb8hwtlc6m.png" alt="BlazePose 2D + 3D-pose-baseline performance in percent" width="807" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this approach doesn’t guarantee reliability in complex scenarios with dim lighting or unusual poses, standard videos can be processed in real time. Generally, though, the accuracy of model predictions will depend on the training data and the chosen architecture. With the true capabilities of human pose estimation in mind, we can now analyze some common business applications and general use cases for this technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human pose estimation use cases
&lt;/h2&gt;

&lt;p&gt;HPE can be considered a fairly mature technology, since groundwork already exists in application areas like fitness, rehabilitation, augmented reality, animation, gaming, robotics, and even surveillance. So now let’s talk about the existing use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI FITNESS AND SELF-COACHING
&lt;/h3&gt;

&lt;p&gt;Fitness applications and AI-driven coaches are some of the most obvious use cases for body pose estimation. A model implemented in a phone app can use the hardware camera as a sensor to record someone doing an exercise and perform the analysis. &lt;/p&gt;

&lt;p&gt;By tracking the movement of the human body, an exercise can be split into phases of eccentric and concentric movement to analyze different angles of flexion and overall posture. This is done by tracking the keypoints and providing analytics in the form of hints or graphic analysis, either in real time or after some delay, giving the user feedback on their major movement patterns and body mechanics. &lt;/p&gt;
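&lt;p&gt;As an illustration of the kind of analysis involved, the flexion angle at a joint can be computed from three tracked keypoints (hypothetical hip, knee, and ankle coordinates here):&lt;/p&gt;

```python
import math

# Flexion angle at a joint (e.g. the knee) from three tracked keypoints:
# hip -> knee -> ankle. Each keypoint is an (x, y) pair from a 2D detector.
def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by segments b->a and b->c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0])
        - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360 - ang if ang > 180 else ang

hip, knee, ankle = (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)
print(joint_angle(hip, knee, ankle))  # 90.0 for this right-angle position
```

&lt;p&gt;Comparing such angles per phase of the movement against reference values is one simple way to generate the hints mentioned above.&lt;/p&gt;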

&lt;h3&gt;
  
  
  REHABILITATION AND PHYSIOTHERAPY
&lt;/h3&gt;

&lt;p&gt;The physiotherapy industry is another human activity tracking use case with similar rules of application. In the era of telemedicine, in-home consultations have become much more flexible and diverse, and AI technologies have enabled more complex forms of treatment to be delivered online. &lt;/p&gt;

&lt;p&gt;The analysis of rehab activities applies similar concepts to fitness applications, except for stricter accuracy requirements. Since we’re dealing with recovery from injury, these applications fall into the healthcare category, which means they have to meet healthcare industry standards and the data protection laws of the relevant country. &lt;/p&gt;

&lt;h3&gt;
  
  
  AUGMENTED REALITY
&lt;/h3&gt;

&lt;p&gt;Augmented reality applications like virtual fitting rooms can benefit from human pose estimation as one of the most advanced methods of detecting and recognizing the position of a human body in space. This can be used in ecommerce, where shoppers can’t try on clothes before buying. &lt;/p&gt;

&lt;p&gt;Human pose estimation can be applied to track key points on the human body and pass this data to the augmented reality engine that fits clothes on the user. This can be applied to any body part and type of clothing, or even face masks. We’ve described our experience of using human pose estimation for virtual fitting rooms in a dedicated article.&lt;/p&gt;

&lt;h3&gt;
  
  
  ANIMATION AND GAMING
&lt;/h3&gt;

&lt;p&gt;Game development is a tough industry with a lot of complex tasks that require knowledge of human body mechanics. Body pose estimation is widely used in animation of game characters to simplify this process by transferring tracked key points in a certain position to the animated model. &lt;/p&gt;

&lt;p&gt;This process resembles the motion tracking technology used in video production, but doesn’t require a large number of sensors placed on the model. Instead, we can use multiple cameras to detect the motion pattern and recognize it automatically. The captured data can then be transformed and transferred to the actual 3D model in the game engine. &lt;/p&gt;

&lt;h3&gt;
  
  
  SURVEILLANCE AND HUMAN ACTIVITY ANALYSIS
&lt;/h3&gt;

&lt;p&gt;Some surveillance cases don’t require spotting a crime in a crowd of people. Instead, cameras can be used to automate everyday processes like shopping at a grocery store. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://mobidev.biz/blog/how-to-implement-ai-self-checkout-in-retail-if-you-are-not-amazon"&gt;Cashierless store&lt;/a&gt; systems like Amazon GO, for example, apply human pose estimation to understand whether a person took some item from a shelf. HPE is used in combination with other computer vision technologies, which allows &lt;a href="https://www.youtube.com/watch?v=Lu4szyPjIGY&amp;amp;t=699s"&gt;Amazon&lt;/a&gt; to automate the process of checkout in their stores using a network of camera sensors, IoT devices, and &lt;/p&gt;

&lt;p&gt;Human pose estimation is responsible for the part of the process where the actual area of contact with the product is not visible to the camera. So here, the HPE model analyzes the position of customers’ hands and heads to understand if they took the product from the shelf, or left it in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to train a human pose estimation model?
&lt;/h2&gt;

&lt;p&gt;Human pose estimation is a machine learning technology, which means you’ll need data to train it. Human pose estimation handles the quite difficult tasks of detecting and recognizing multiple objects in the frame, so neural networks are used as its engine. Training a neural network requires enormous amounts of data, so the most practical approach is to use available datasets like the following ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://humaneva.is.tue.mpg.de/"&gt;HumanEva&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cocodataset.org/#download"&gt;Coco&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://human-pose.mpi-inf.mpg.de/"&gt;MPI Human Pose&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://vision.imar.ro/human3.6m/description.php"&gt;Human3.6M&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The majority of these datasets are suitable for fitness and rehab applications with human pose estimation. But this doesn’t guarantee high accuracy in terms of more unusual movements or specific tasks like surveillance or multi-person pose estimation. &lt;/p&gt;

&lt;p&gt;For the rest of the cases, data collection is inevitable since a neural network will require quality samples to provide accurate object detection and tracking. Here, experienced data science and machine learning teams can be helpful, since they can provide consultancy on how to gather data, and handle the actual development of the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to avoid training human pose estimation from scratch?
&lt;/h2&gt;

&lt;p&gt;Human pose estimation models are appearing rapidly, as the field is active and progressing. This gives us options in terms of pretrained models tailored for different tasks. To analyze existing approaches and models, we used Human3.6M as an evaluation dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WISKwPS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcmpky0e6cul6vm1939t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WISKwPS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcmpky0e6cul6vm1939t.png" alt="Evaluation of open source HPE model performance using Human3.6M dataset" width="799" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The evaluation metric is &lt;strong&gt;MPJPE&lt;/strong&gt; (Mean Per Joint Position Error), which measures the distance between predicted and ground-truth joint positions, averaged over all joints and expressed in millimeters. In other words, this metric shows how accurately each specific model detects joints over time. The graph represents the analysis of several open-source models trained for human pose estimation tasks. &lt;/p&gt;
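&lt;p&gt;For reference, MPJPE is straightforward to compute from predicted and ground-truth joint coordinates; a minimal sketch with synthetic data:&lt;/p&gt;

```python
import numpy as np

# MPJPE: mean Euclidean distance between predicted and ground-truth
# 3D joint positions, averaged over joints and frames, in millimeters.
def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    # pred, gt: (frames, joints, 3) arrays of joint coordinates in mm
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])  # every joint off by exactly 5 mm
print(mpjpe(pred, gt))  # 5.0
```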

&lt;p&gt;From our experiment with such models, we can conclude that some of them can be modified to achieve real-time processing with comparably high FPS. The performance of the model for the most part depends on its 2D detector module, which enables us to implement a high-performance model for most business cases, including mobile applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hpe</category>
      <category>augmentedreality</category>
    </item>
    <item>
      <title>How to Implement AI Self Checkout in Retail if You Are Not Amazon</title>
      <dc:creator>Liubov Zatolokina</dc:creator>
      <pubDate>Fri, 11 Mar 2022 12:12:25 +0000</pubDate>
      <link>https://dev.to/liubovzatolokina2022/how-to-implement-ai-self-checkout-in-retail-if-you-are-not-amazon-3ki7</link>
      <guid>https://dev.to/liubovzatolokina2022/how-to-implement-ai-self-checkout-in-retail-if-you-are-not-amazon-3ki7</guid>
      <description>&lt;p&gt;Online retail has one key advantage — customer experience. No queues, no delays, and little movement to make a purchase. According to a research from Forrester, 72% of U.S. retail sales will still occur in bricks-and-mortar stores because people want to interact with a product before buying, or simply don’t want to wait for delivery.&lt;/p&gt;

&lt;p&gt;The idea of checkout-free shopping has crystallized in venues from Amazon Go, Tesco, Walmart, and many more. Fully-automated checkout with computer vision is a successful example of retail automation. But few store owners want to build a whole new outlet to run their business offline, as it requires an integrated software infrastructure and imposes the development and financial challenges we will discuss today.&lt;/p&gt;

&lt;p&gt;In this article, we’ll analyze how any brick and mortar store can be automated with computer vision systems. Here we’ll look at how it works, what the options for checkout automation are, and what challenges are out there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/D6Tu_L3chLE"&gt;Image credit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Computer vision checkout automation for brick and mortar retail
&lt;/h2&gt;

&lt;p&gt;The majority of in-store operations like shelf management, checkout, or product weighing require human supervision. Human productivity is basically a performance marker for the retailer, and it often becomes a bottleneck, as well as a source of customer frustration.&lt;/p&gt;

&lt;p&gt;Namely, checkout queues are the pain point both for customers and retailers. But it’s not only the queues, since actual human effort costs money. So how does computer vision apply to these operations?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer vision (CV)&lt;/strong&gt; is a technology under the umbrella of artificial intelligence that enables machines to extract meaningful information from images. At its core, computer vision aims to mimic human sight. Analogous to an eye, CV relies on camera sensors that capture the environment. In turn, an underlying neural network, its brain, recognizes objects, their position in the frame, or other specific properties (such as telling a Pepsi can from a Dr Pepper can).&lt;/p&gt;

&lt;p&gt;That’s our foundation for understanding how computer vision fits brick and mortar retail tasks: it can recognize products situated in the frame, whether they are placed on shelves or carried by customers. This allows us to eliminate barcode scanning, cash register operation, and self-checkout machines.&lt;/p&gt;

&lt;p&gt;Although implementations of computer vision significantly differ by complexity and budgeting, there are two common scenarios of how it can be used for retail automation. So first let’s look at how full store automation can be built.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-powered autonomous checkout: full store automation
&lt;/h2&gt;

&lt;p&gt;Autonomous checkout goes by different names: “cashierless”, “grab-and-go”, “checkout-free”, etc. In the shopping experience of Amazon, Tesco, and even Walmart, such stores track products while you shop and charge for them when you walk out. Sounds simple, and that’s how it works in a basic scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shopping session start.&lt;/strong&gt; Shops like Amazon use turnstiles to initiate shopping via scanning a QR code. At this point, the system matches the Amazon profile and digital wallet with the actual person entering the store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Person detection.&lt;/strong&gt; This is the recognition and tracking of people and objects done via computer vision cameras. Simply, cameras remember who the person is, and once they take a product from the shelf, the system places it into a virtual shopping cart. Some shops use hundreds of cameras to view from different angles and cover all the store zones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product recognition.&lt;/strong&gt; Once the person grabs something from the shelf, and takes it with them, cameras capture this action. Matching the product image on video with the actual product in the retailer’s database, the store places an item into a virtual shopping cart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkout.&lt;/strong&gt; Once the product list is complete, the person can just walk out. When the person leaves the zone covered by cameras, computer vision considers this the end of the shopping session. This triggers the system to calculate the total sum and charge the customer’s digital wallet.&lt;/p&gt;

&lt;p&gt;From the customer’s standpoint, such a system offers a shopping experience similar to online stores, except you don’t need to check out: enter, find what you want, grab it, and leave. However, to provide customers with full autonomy and cover all the edge cases, we’ll need to solve a large number of technical problems. So what’s so complex about autonomous checkout?&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenges of AI-powered autonomous stores
&lt;/h2&gt;

&lt;p&gt;Customer behavior can be unpredictable, as we are going to automate checkout for dozens of people who examine and buy thousands of products at the same time. This imposes a number of challenges for computer vision:&lt;/p&gt;

&lt;h3&gt;
  
  
  CONTINUOUS PERSON TRACKING
&lt;/h3&gt;

&lt;p&gt;As the customer enters the store, the system should be able to continuously track them along their shopping route. We need to know that it’s the same person who took this or that item in different parts of the store. In a crowded store, continuous tracking can be difficult. Since face recognition is often not allowed, the model has to recognize people by their appearance. So what happens if somebody takes off their coat, or carries a child on their shoulders?&lt;/p&gt;

&lt;p&gt;To enable continuous tracking, we’ll need 100% camera coverage so the system can detect people passing from zone to zone. When placing cameras at different angles, we also need the sensors to communicate their precise location, so we can use this data to track objects more accurately.&lt;/p&gt;
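&lt;p&gt;To make the tracking problem concrete, here is a minimal sketch of frame-to-frame person association by bounding-box overlap (IoU). Production trackers add appearance features and motion models, but the core matching step looks like this:&lt;/p&gt;

```python
# Minimal sketch of associating existing person tracks with the
# detections in a new frame, using bounding-box IoU as the similarity.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, threshold=0.3):
    """Greedily match each track's last box to the best-overlapping detection."""
    matches = {}
    for tid, tbox in tracks.items():
        best = max(detections, key=lambda d: iou(tbox, d), default=None)
        if best is not None and iou(tbox, best) >= threshold:
            matches[tid] = best
    return matches

tracks = {1: (0, 0, 10, 10)}                     # person 1's box last frame
detections = [(1, 1, 11, 11), (50, 50, 60, 60)]  # boxes in the new frame
print(associate(tracks, detections))  # track 1 follows the overlapping box
```

&lt;p&gt;This is exactly where appearance changes (a removed coat, a child on the shoulders) break naive matching, which is why commercial systems fuse multiple cameras and extra cues.&lt;/p&gt;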

&lt;h3&gt;
  
  
  THE “WHO TOOK WHAT?” PROBLEM
&lt;/h3&gt;

&lt;p&gt;Then, we have to remember there are also products, right? And customers’ shopping process is not linear. They move items, smell them, put them back, and go to another shelf. Especially when there are multiple people at one shelf, it becomes difficult for a model to recognize who took what, and if they actually took the product to buy.&lt;/p&gt;

&lt;p&gt;Amazon, for example, &lt;a href="https://www.youtube.com/watch?v=Lu4szyPjIGY&amp;amp;t=699s"&gt;solved this problem&lt;/a&gt; by implementing &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach"&gt;human pose estimation&lt;/a&gt; and human activity analysis. Basically, that’s another layer of artificial intelligence coupled with computer vision: it measures a person’s position and movement to predict what they grab, and whether the product was taken to be purchased.&lt;/p&gt;

&lt;p&gt;This solves the problem of multiple customers at one shelf, and helps determine who took a specific product even if the camera was blocked by somebody.&lt;/p&gt;

&lt;h3&gt;
  
  
  IDENTIFYING SIMILAR PRODUCTS
&lt;/h3&gt;

&lt;p&gt;Concerning products, we’ll also need to deal with similar packages. Some products have only minor differences in their look, which makes it harder for the model to capture all the details, especially if there is obstruction in the frame or the object is moving fast. We can address this issue by training the model to spot small details, and by using cameras with higher resolution and frame rate.&lt;/p&gt;

&lt;p&gt;While it looks beneficial to use autonomous checkout, the complexity of such a system can be onerous. For a tech-first company, this is not a problem. But for the usual retailer, the burden brought by artificial intelligence lowers the value of such automation. That’s why partial store automation with computer vision can be more suitable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart vending machines: partial store automation
&lt;/h2&gt;

&lt;p&gt;Vending machines can be placed in-store or moved to other indoor and outdoor locations, and they can be an elegant solution to the problem of tracking the whole store. They can take the form of shelves with glass doors or regular fridges that use computer vision cameras to operate the purchase process. By installing a QR code scanner, we can confine the checkout procedure to a single fridge. The idea is quite simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shopping session start.&lt;/strong&gt; The session starts once a person approaches the fridge and opens it. For a closed-door fridge, this can be done by scanning a QR code with a mobile app. In the case of a regular shelf, cameras can track what’s grabbed from it to initiate the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a virtual shopping cart.&lt;/strong&gt; As the person scans the QR code, it’s a signal for a system to create a shopping cart for this specific user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product recognition.&lt;/strong&gt; The cameras might be installed inside or outside of the vending machine. The internal cameras should be able to track the taken/put back products. External cameras might track manipulations within an open fridge, just like with a regular shelf. Both types of cameras capture the products and put them into a shopping cart.&lt;/p&gt;

&lt;p&gt;As the person might examine multiple items and move from side to side, CV cameras can also track the person in the frame. This will help us verify that it’s a single person making a purchase, and not another one standing nearby.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verifying products.&lt;/strong&gt; When the product is taken, the system sends this data to compare the image of the product with the one in the database and extract the price. Additionally, we can update availability automatically in our inventory management system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Editing product list.&lt;/strong&gt; Once the products are taken, they will be sent to the user’s shopping cart available on their smartphone, or tablet on the fridge. Here, the customer can modify items, and proceed to the payment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkout.&lt;/strong&gt; In case of a mobile application and QR code scanning, closing the fridge might be a trigger point to complete a purchase and charge a sum from a digital wallet. But, there might also be a POS terminal installed to allow credit card payment. At this point, the purchase is done, and the person can leave the store.&lt;/p&gt;
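&lt;p&gt;The session flow above can be modeled as a simple state object. This is a toy sketch with a hypothetical product catalog; a real system would drive these transitions from QR scans and camera events:&lt;/p&gt;

```python
# Toy model of the smart-fridge session flow: start on QR scan, mutate
# the virtual cart on recognition events, charge on door close.
CATALOG = {"cola": 1.50, "sandwich": 3.20}  # hypothetical SKUs and prices

class FridgeSession:
    def __init__(self, user_id):
        self.user_id = user_id        # resolved from the scanned QR code
        self.cart = {}                # virtual shopping cart: sku -> count

    def product_taken(self, sku):     # fired by the recognition model
        self.cart[sku] = self.cart.get(sku, 0) + 1

    def product_returned(self, sku):  # the customer put the item back
        if self.cart.get(sku, 0) > 0:
            self.cart[sku] -= 1

    def checkout(self):               # fired when the fridge door closes
        return sum(CATALOG[sku] * n for sku, n in self.cart.items())

s = FridgeSession("user-42")
s.product_taken("cola")
s.product_taken("sandwich")
s.product_returned("cola")
print(s.checkout())  # 3.2 — only the sandwich is charged
```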

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BDmQ8hTP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x00y64724c1g7x10ra6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BDmQ8hTP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x00y64724c1g7x10ra6w.png" alt="Smart vending machine AI self checkout" width="880" height="1173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it looks like a relatively weak alternative to an autonomous checkout system, vending machines can easily be scaled to automate the whole store, which makes little difference in terms of customer experience but requires less engineering effort and budget.&lt;/p&gt;

&lt;p&gt;The same concept of modular automation can be applied to numerous other cases. Beyond supermarkets and grocery stores, computer-vision kiosks can also be installed in food service venues or coffee shops.&lt;/p&gt;

&lt;h3&gt;
  
  
  CHECKOUT FREE FOOD SERVICE
&lt;/h3&gt;

&lt;p&gt;Restaurants, cafes, and canteens often use a buffet serving system like a sideboard with portioned dishes customers can choose from. Customers place dishes on trays, then need to check out their order, which can potentially be handled by a computer vision kiosk.&lt;/p&gt;

&lt;p&gt;A machine learning model sitting on the backend can be trained to recognize dishes and other products placed on the tray to launch the checkout process. This idea can be implemented as a checkout kiosk where a set of cameras will scan the order. The actual payment can be completed via a usual POS terminal, or using a mobile application and a digital wallet.&lt;/p&gt;

&lt;p&gt;The concept of cashierless operations can be taken to extremes, as with Starbucks. Using Amazon’s system, Starbucks became a first-of-its-kind grab &amp;amp; go coffee shop: customers can place an order via a mobile application and pick up their coffee without any checkout, similar to Amazon GO. However, handling computer vision projects requires subject matter knowledge, specifically data science and machine learning expertise.&lt;/p&gt;

&lt;p&gt;So now let’s talk a bit about what you should know to approach computer vision-based checkout automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to approach AI-based checkout?
&lt;/h2&gt;

&lt;p&gt;Based on our experience, let’s examine the steps it takes to create a computer vision system for automation in retail. We’ll focus on the smart fridge case as the most approachable and versatile one.&lt;/p&gt;

&lt;h3&gt;
  
  
  GATHERING REQUIREMENTS
&lt;/h3&gt;

&lt;p&gt;First of all we need to understand our business case in detail:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preferred automation method.&lt;/strong&gt; Choosing smart fridges or other types of dispenser machines requires fewer sweeping modifications to the store while maintaining a scalable approach. Full store automation will mostly require changes to the venue layout, plus additional hardware like turnstiles, which can be a drawback for the majority of store owners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store size.&lt;/strong&gt; Vending machines can be installed in basically any number, to cover all of the store’s inventory and product diversity. So the store size will determine how many vending machines you’ll need, and what the store layout will look like if smart fridges hold part of the products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantity of products for recognition.&lt;/strong&gt; As with any other machine learning project, a computer vision system requires training before it can recognize anything. A single fridge might contain 20 to 50 different products, so we should consider those numbers, as they determine how long the training phase will take.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Existing infrastructure.&lt;/strong&gt; In most cases, physical stores don’t have enough integration between inventory management, point of sale, and accounting. However, a computer vision system will require access to store data to automate sales updates and product availability, so examining your existing infrastructure is another part of understanding the project requirements.&lt;/p&gt;

&lt;p&gt;So let’s say a single fridge can contain 35 items and we’ll focus on those numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  DATA COLLECTION
&lt;/h3&gt;

&lt;p&gt;Computer vision is an artificial intelligence technology, which means we need data so it can recognize objects. The data is used to train the model to identify different products in the frame, as well as to identify people and what they grab.&lt;/p&gt;

&lt;p&gt;The optimal way to collect data for object recognition is basically to record each product on video from different angles and in different lighting conditions. It is important to have these videos categorized by product, so the labeling (what product is in the frame) can be done automatically. The general recommendation for gathering the data is that it should be as close as possible to how it will look for real users.&lt;/p&gt;

&lt;p&gt;Once we implement a working model to automate checkout, we’ll need 60 frames per second. This is required to guarantee fast operation of the model. The higher the frame rate, the smoother the image is, and the more detail we can extract from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  MODEL TRAINING
&lt;/h3&gt;

&lt;p&gt;The next step is training. Once we collect all the video recordings, a machine learning expert will prepare them for model training. This process can be split into two tasks.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Preparing data&lt;/em&gt; means we need to split all the video frames into separate images, and label the products we need to detect. Put simply, we extract 60 photos out of a minute long video, and draw bounding boxes around our target objects.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Choosing an algorithm.&lt;/em&gt; An algorithm is a mathematical model that learns patterns from the given data to make predictions. For tasks like object recognition, there are existing working algorithms that can be applied for building a model. So our task here is to choose a suitable one, and feed it with our data.&lt;/li&gt;
&lt;/ol&gt;
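&lt;p&gt;The data-preparation step boils down to turning per-product frame folders into labeled records. A minimal sketch with hypothetical file names, relying on the folder-per-product convention mentioned above:&lt;/p&gt;

```python
# Minimal sketch of the data-preparation step: videos are assumed to be
# pre-split into frame files, and the folder name serves as the label,
# so annotation records can be generated automatically. Paths and the
# record format are hypothetical.
def build_annotations(frames_by_product):
    """frames_by_product: {product_name: [frame_filename, ...]}"""
    records = []
    for product, frames in frames_by_product.items():
        for path in frames:
            records.append({"image": path, "label": product})
    return records

frames = {"cola": ["cola/0001.jpg", "cola/0002.jpg"],
          "chips": ["chips/0001.jpg"]}
print(len(build_annotations(frames)))  # 3 labeled records
```

&lt;p&gt;Bounding boxes would then be drawn on top of these records by an annotator or a pretrained detector before the chosen algorithm is trained.&lt;/p&gt;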

&lt;p&gt;The process of training may take several weeks as we iterate to reach decent accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  MODEL RETRAINING
&lt;/h3&gt;

&lt;p&gt;If any products are added or swapped in the process, the model needs to be retrained, because prediction results will differ depending on the data input. This means that each time a store obtains new items for sale and places them into a computer vision fridge, we’ll need to launch a new training phase for the model to learn the new items.&lt;/p&gt;

&lt;p&gt;Given that, we’ll need retraining to recognize, say, Pringles cans in the image if there weren’t any Pringles before. However, this becomes easier once we have cameras in the fridge, because we can use live recordings to make annotations and launch training again.&lt;/p&gt;

&lt;h3&gt;
  
  
  REQUIRED INFRASTRUCTURE
&lt;/h3&gt;

&lt;p&gt;The existing infrastructure in the store is usually represented by a server that processes inventory updates, and records sales volume via POS terminals. To implement a machine learning model, we’ll need to add several components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cameras&lt;/strong&gt; to record and pass the visual data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video processing unit.&lt;/strong&gt; This can be a video card or a single board computer like the Nvidia Jetson that includes a GPU optimized for computer vision needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QR code.&lt;/strong&gt; This sticker is placed on a turnstile or a fridge; the user scans it to identify themselves and launch the shopping process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model server.&lt;/strong&gt; As we’re talking about real-time video processing, implementing a hardware server at the store will guarantee more stable results. Basically, as a person grabs something from a fridge, the system’s reaction should be imperceptible, so the hardware components must respond fast enough.&lt;/li&gt;
&lt;/ol&gt;
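&lt;p&gt;The data flow between these components boils down to comparing what the detector sees before and after a shopping session. Below is a toy sketch of that logic; the detector is a stub standing in for a real model running on the video processing unit, and all names are illustrative:&lt;/p&gt;

```python
from collections import Counter

def detect_products(frame):
    # Stub standing in for a real detector (e.g., a model running on
    # the Jetson GPU); a real implementation would infer labels from
    # the frame's pixels instead of reading them from a dict.
    return frame["visible_items"]

def checkout_session(frame_before, frame_after):
    """Bill the customer for items that disappeared from the shelf
    between the start and the end of the session."""
    before = Counter(detect_products(frame_before))
    after = Counter(detect_products(frame_after))
    return dict(before - after)  # items taken, with quantities
```

&lt;p&gt;For example, if two cola cans and a bag of chips were visible at the start and only one cola remains at the end, the session bill is one cola and one bag of chips.&lt;/p&gt;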

&lt;blockquote&gt;
&lt;p&gt;All of those components must be interconnected, as data has to flow between each unit. As for the cameras, we also want to make sure the store has a stable, high-bandwidth connection. Since the cameras stream live data in real time, there must be no delay for the model to function properly. On the other hand, the customer expects a fast reaction from the vending machine, which depends on how quickly the model receives and processes the data.&lt;br&gt;
&lt;a href="https://mobidev.biz/our-team/daniil-liadov"&gt;Daniil Liadov&lt;/a&gt;&lt;br&gt;
Python engineer at &lt;a href="https://mobidev.biz"&gt;MobiDev&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  PRIVACY CONCERNS
&lt;/h3&gt;

&lt;p&gt;Among other questions that might concern both retailers and customers is privacy. Since computer vision is designed to detect and track objects on video, recording and storing such data may violate the privacy laws in some countries.&lt;/p&gt;

&lt;p&gt;That said, in the US it’s generally legal to use surveillance cameras in stores. As long as customers are tracked with random IDs solely for the checkout task, no additional technologies like face recognition are required. And even if the camera captures a person’s face, it can be &lt;a href="https://mobidev.biz/blog/background-removal-and-blur-in-a-real-time-video"&gt;blurred using AI&lt;/a&gt; to preserve confidentiality.&lt;/p&gt;
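&lt;p&gt;The blurring step itself is simple once a face detector has returned a bounding box. Below is a minimal, dependency-free sketch of box-blurring one region of a grayscale image; a real pipeline would use a CV library and a proper face detector, and the function name is illustrative:&lt;/p&gt;

```python
def blur_region(image, x, y, w, h, k=3):
    """Box-blur a w-by-h rectangle at (x, y) in a 2-D grayscale image
    (a list of lists of ints). In a real pipeline, a face detector
    would supply the bounding box and this blur would anonymize it."""
    out = [row[:] for row in image]
    for i in range(y, y + h):
        for j in range(x, x + w):
            # Average over a k-by-k neighbourhood, clamped to the region
            vals = [image[a][b]
                    for a in range(max(y, i - k // 2), min(y + h, i + k // 2 + 1))
                    for b in range(max(x, j - k // 2), min(x + w, j + k // 2 + 1))]
            out[i][j] = sum(vals) // len(vals)
    return out
```

&lt;p&gt;Pixels outside the rectangle are left untouched, so only the detected face is anonymized while the rest of the frame stays usable for product recognition.&lt;/p&gt;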

&lt;h2&gt;
  
  
  Is AI self-checkout for every retailer?
&lt;/h2&gt;

&lt;p&gt;As with all AI systems, autonomous checkout may seem pricey and bulky to implement. Customers, however, keep gravitating toward more convenient checkout methods: a 2021 Retail Customer Experience report notes that &lt;a href="https://www.retailcustomerexperience.com/blogs/the-state-of-self-service-checkout-6-key-takeaways/"&gt;60% of consumers&lt;/a&gt; would choose self-checkout over interacting with a cashier.&lt;/p&gt;

&lt;p&gt;That being said, vending machines might be an affordable option for the retail industry, as they bring a lot of benefits at a reasonable cost. Additionally, such systems can be customized to serve the specific needs of a given retailer thanks to the flexibility of machine learning models. Basically, any type of product can be recognized with proper training. So convenience stores aren’t the only ones that can benefit from computer vision applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>retail</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
