Hello everyone, this is Avi from Voximplant.
In this article, I’m going to discuss the history of machine learning and natural language processing, explain what AutoML is, and show how Voximplant made this technology available to everyone with its AutoML component.
A brief history of NLP technology
Let’s start with a short historical overview of natural language processing: how it evolved and why we at Voximplant decided that now is the best time to enter the competitive market with our AutoML technology.
The history of NLP can be divided into three stages. We’re going to take an in-depth look at each of them below:
Stage one
The first stage was applying classic machine learning methods to natural language processing. This approach relied on vectorization: text had to be transformed into a numeric representation that ML models could accept as input, in practice a sparse vector where each word gets its own dimension.
The main disadvantage of this approach was that the words “good” and “perfect” ended up as far apart as “good” and “tree.” Paraphrase a sentence with synonyms and you would get a completely different vector. In addition, the vectors themselves were huge (one dimension per vocabulary word), which made them nearly impossible to feed into a neural network architecture. Still, this was only the first stage of NLP.
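To make this concrete, here is a tiny sketch (using scikit-learn purely for illustration, not as the tooling of that era): in a bag-of-words space, “good” is exactly as far from “perfect” as it is from “tree.”

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three one-word "documents": with bag-of-words, each word gets its own axis,
# so every pair of distinct words is equally dissimilar.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(["good", "perfect", "tree"])

print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]] -- "good" vs "perfect"
print(cosine_similarity(vectors[0], vectors[2]))  # [[0.]] -- "good" vs "tree"

# With a realistic vocabulary, each vector would have tens of thousands of dimensions.
```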
Stage two
The second stage in the development of NLP was word embeddings: a technique that gives each word a compact vector representation in which similar words have similar vectors. Each word is represented by a real-valued vector with tens or hundreds of dimensions.
Such representations were much easier for neural networks to process, and NLP solutions became noticeably smarter. Models could learn basic language structure and semantics from huge unlabeled corpora, and the result could then be trained further for a specific task.
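As a toy illustration (the three-dimensional vectors below are invented for the example; real embeddings are learned from data and have far more dimensions), cosine similarity over embeddings captures the fact that “good” is close to “perfect” but not to “tree”:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings, invented purely for illustration.
embeddings = {
    "good":    np.array([0.9, 0.1, 0.0]),
    "perfect": np.array([0.8, 0.2, 0.1]),
    "tree":    np.array([0.0, 0.1, 0.9]),
}

print(cosine(embeddings["good"], embeddings["perfect"]))  # ~0.98, very similar
print(cosine(embeddings["good"], embeddings["tree"]))     # ~0.01, unrelated
```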
Training for a new task became much faster because basic language knowledge was already baked into the model, and the amount of labeled data needed for each task dropped significantly. For the NLP industry, this created a new division of labor: research centers trained foundational models on huge arrays of data, and other companies used them as starting points for more specific tasks.
Stage three
The third stage is tied to the creation of the transformer architecture. Transformers don’t suffer from the limitations of recurrent networks and are far more parallelizable, which means far more data can be processed at once and the models themselves can grow much bigger.
In 2018, Google released BERT, a transformer that became the de facto template for solving complex NLP tasks. With some modifications, this model came to dominate NLP. Nowadays, the standard way to solve an NLP problem is to take a pre-trained BERT and fine-tune it for the specific task.
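A minimal sketch of that workflow with the Hugging Face transformers library (the checkpoint, labels, and training settings are placeholders, not a production configuration):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pre-trained BERT checkpoint and attach a fresh classification head.
model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tiny illustrative dataset: user phrases paired with intent label ids.
texts = ["book a table for two", "cancel my reservation", "what time do you open"]
labels = [0, 1, 2]
encodings = tokenizer(texts, truncation=True, padding=True)

class IntentDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item

# Fine-tune the pre-trained model on the task-specific data.
args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=IntentDataset()).train()
```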
What is AutoML?
Despite the decent number of ML solutions offered by large companies such as Nvidia, Amazon, and Google, many retail companies still can’t use them. Why is that? Even setting aside the difficulties of working with data, you need a team of highly qualified specialists to support these solutions, and that can be quite expensive.
For companies without strong in-house technical expertise, it’s difficult to tell whether the effort is worth it and whether an ML team will pay off in the long run. On top of that, pilot projects are expensive and take time to deploy. For all these reasons, many companies feel discouraged from using this technology.
This is where AutoML comes to the rescue. AutoML is an attempt to automate the work of an ML team. Ideally, it lets people without a machine learning background create ML-based solutions and put them into production: a company uploads its data into the system and gets back a ready-to-use, scalable service integrated into its infrastructure. The company only needs to work with data and analytics; all the engineering stays under the hood. Sounds interesting, right?
Of course, there are already quite a few players on the market, such as Google Cloud, AWS, and h2o.ai. Their products solve some problems in the NLP sphere, but the barrier to entry is relatively high, and they are aimed at software engineers rather than domain specialists. They feel like building blocks rather than complete, out-of-the-box solutions.
Why Voximplant decided to create an AutoML engine
Voximplant is a cloud platform for creating different communication services, such as chats and chatbots, calls and conference services, contact centers, and IVRs. We also provide telephony and integrations with many speech recognition and synthesis vendors. One thing missing from our toolkit, however, was a state-of-the-art NLU engine.
Here’s how we went about it. First, we integrated a third-party solution into our system, but in practice it wasn’t flexible enough for real-world scenarios and couldn’t use the full power of our platform. That’s why we decided to build our own.
How we created an AutoML solution in Voximplant
There are two types of conversational bots: chit-chat bots and goal-oriented ones. Chit-chat bots, like Replika, are designed purely for conversation: their only job is to answer the user with the phrase that makes the most sense. Goal-oriented bots are designed to figure out what action the user wants to perform and then perform it; examples are Siri, Alexa, and some smart IVRs in contact centers. Since we are building a B2B product, we’re more interested in goal-oriented bots.
So, what components did we need to create such a bot? Let’s take a look at the typical flow. A user says a phrase, which goes through a speech recognition module that converts voice into text. From this text, we first want to extract the user’s intent, that is, the normalized meaning of the phrase.
When creating a bot, we define a list of possible intents so the bot can match the user’s phrase to one of them. We also want to extract all the useful data from the user’s speech: for example, if a user wants to book a table at a restaurant, we need to extract the exact time of the reservation. All of these tasks boil down to converting natural language into a machine-readable form.
All the extracted information is then passed into the dialog state manager, the component that decides how to respond to the normalized request. There are many approaches to this task, both ML-based and scripted, but in general we view this component as a script written for each use case. After that, the only thing left to do is synthesize speech to answer the user.
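To make the flow concrete, here is a schematic sketch; the type and function names are ours, purely for illustration, not part of the Voximplant API:

```python
from dataclasses import dataclass, field

@dataclass
class NluResult:
    """Normalized, machine-readable view of one user utterance."""
    intent: str                                   # e.g. "book_table"
    confidence: float                             # classifier confidence
    entities: dict = field(default_factory=dict)  # extracted and normalized data

def dialog_state_manager(result: NluResult) -> str:
    """Toy scripted state manager: decides how to respond to the normalized request."""
    if result.intent == "book_table":
        when = result.entities.get("datetime")
        if when is None:
            return "Sure! What time would you like to book a table for?"
        return f"Booking a table for {when}. Anything else?"
    return "Sorry, I didn't catch that. Could you rephrase?"

# "I'd like a table at 7 pm" -> speech recognition -> NLU -> normalized result below
reply = dialog_state_manager(
    NluResult(intent="book_table", confidence=0.93, entities={"datetime": "19:00"})
)
print(reply)  # the text we would pass to speech synthesis
```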
Hence, we need the following components:
- Intent classifier, which learns on user input
- Component for extracting and normalizing user data
- Dialog state manager to make decisions
- API for managing all of the above
The intent classifier needs a front end that accepts the user’s data and stores it in the back end. From there, we need to kick off training of the user’s model, store the resulting artifact somewhere, and send it to the inference server that processes requests in real time.
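Roughly, the entry point of such a pipeline could look like the sketch below (the endpoint path, payload shape, and the train_and_register helper are hypothetical, not our actual API):

```python
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class TrainingExample(BaseModel):
    text: str    # raw user phrase
    intent: str  # label assigned by the customer

class TrainingRequest(BaseModel):
    examples: list[TrainingExample]

def train_and_register(application_id: str, examples: list[TrainingExample]) -> None:
    # Placeholder: fine-tune the classifier on the uploaded examples,
    # store the resulting artifact, and hand it to the inference server.
    ...

@app.post("/applications/{application_id}/train")
def start_training(application_id: str, request: TrainingRequest,
                   background_tasks: BackgroundTasks):
    # Store the data and kick off training asynchronously; the client polls for status.
    background_tasks.add_task(train_and_register, application_id, request.examples)
    return {"status": "training_started"}
```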
While it doesn’t look too complicated, the system has several limitations:
- The service should be responsive and not have any delays during a phone call
- The service quality should be comparable with SoTA
- The service should be resilient to the influx of users
- The service should have a competitive price
Just as important, we had to launch the project with a small team of four people and only six months to get to an MVP. We also wanted to start receiving feedback from the market as soon as possible to understand whether our product really solved customers’ problems.
Therefore, we decided to launch our AutoML service in several smaller iterations.
Iteration 1
In the first iteration, we didn’t focus on scaling problems; we wanted to build a quick prototype and get feedback from the market. We decided to build a base project that we could enhance with different technologies later, starting with the intent classifier. With the intent classifier alone, we could already cover a large number of tasks, such as automatic comment classification, NPS surveys, and smart IVRs.
Next, we automated the training process for RoBERTa Large, since this approach gives the highest quality, so that a project team assembling solutions could use it directly. We deliberately set the cost of the inference service aside at this stage: we didn’t want to get bogged down in detail on the one hand, or scare customers off with poor quality on the other.
We used Bootstrap to make a simple control panel for data management, built a FastAPI backend that runs the training jobs, and used the MLflow Model Registry to store the results. We also used MLflow to store metrics for debugging. For the inference server, we deployed several hardware nodes with GPUs and Nvidia Triton installed.
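For example, a training job can finish by logging its metrics and pushing the fine-tuned model into the registry, roughly like this (the experiment and model names are placeholders):

```python
import mlflow
import mlflow.pytorch

def register_trained_model(model, application_id: str, cv_f1: float) -> None:
    """Log metrics for debugging and store the artifact in the MLflow Model Registry."""
    mlflow.set_experiment(f"intent-classifier-{application_id}")
    with mlflow.start_run():
        # Metrics from cross-validation, kept for debugging and quality monitoring.
        mlflow.log_metric("cv_f1", cv_f1)
        # Versioned model artifact; the inference side picks up new versions from here.
        mlflow.pytorch.log_model(
            model,
            artifact_path="model",
            registered_model_name=f"intent-classifier-{application_id}",
        )
```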
MLflow covered our telemetry needs out of the box, along with UI tools for analyzing it. For each user we ran both cross-validation and a final training pass on the full dataset. As a result, we could identify situations where our hyperparameter-fitting heuristics failed and adjust the formula manually to avoid problems.
MLflow also helped us solve the problem of delivering models to the inference server: its Model Registry makes it possible to track metrics and store model artifacts in a versioned way.
The Nvidia Triton inference server pulled the models into memory. The only task left for us was to write a small Python service that checked the model registry for new models and pushed them to Triton.
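A sketch of such a poller is below. It assumes Triton runs in explicit model-control mode with its model repository mounted at /models; the MLflow and Triton client calls are real APIs, but the wiring, names, and paths are illustrative:

```python
import time

import mlflow
from mlflow.tracking import MlflowClient
import tritonclient.http as triton_http

mlflow_client = MlflowClient()
triton = triton_http.InferenceServerClient(url="localhost:8000")
deployed_versions: dict[str, str] = {}  # model name -> last version sent to Triton

while True:
    for registered in mlflow_client.search_registered_models():
        if not registered.latest_versions:
            continue
        name = registered.name
        latest = max(registered.latest_versions, key=lambda v: int(v.version))
        if deployed_versions.get(name) == latest.version:
            continue  # nothing new for this model
        # Download the artifact into Triton's model repository
        # (the Triton-specific directory layout and config are omitted here).
        mlflow.artifacts.download_artifacts(
            artifact_uri=latest.source, dst_path=f"/models/{name}/{latest.version}"
        )
        # Ask Triton to (re)load the model; requires explicit model-control mode.
        triton.load_model(name)
        deployed_versions[name] = latest.version
    time.sleep(30)
```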
As a result, in just a few months, we got a ready-to-use product that could be deployed fast and solve customers' cases without any intervention from ML engineers.
But although our solution had decent quality and low latency, it didn’t scale at all. The main problem was that each client model required at least 2 GB of GPU memory, even if the client had only built a “hello world.” With a large influx of new clients, our solution would not withstand the load.
Iteration 2
The goal of the second iteration was to make our solution scalable. To achieve this, we needed an efficient way to share GPU resources among our clients.
What are the possible solutions? We could autoscale GPU nodes, but that has drawbacks: not only is it expensive, it also means high latency under heavy load, and hardware utilization stays very low.
Besides autoscaling, we could add smart rotation logic for the models on our GPU nodes: load them on demand and unload them when not in use. When working with voice, however, you have to keep in mind that Nvidia Triton loads models in seconds, not milliseconds, which would introduce a long delay and make the conversation feel unnatural. It also doesn’t solve the cost problem. Another option was switching to previous-generation models and moving from GPU to CPU, but the quality we got did not satisfy us. So, we decided to dig deeper.
The key idea is to stop allocating GPU memory exclusively for each client. It would be much better if some parts of the ML model were shared by everyone. That simplifies scaling a lot, because then we scale GPU resources with the overall load rather than with the number of client applications (most of which are hello worlds).
One curious fact our team had learned from experience with ML is that the upper layers of transformers change a lot during fine-tuning, while the lower layers hardly change at all. This makes sense: the lower layers are responsible for basic language rules and semantics, and the upper layers for high-level, task-specific representations.
In fact, you can freeze the lower layers while fine-tuning a model for a particular case without losing quality. The embedding and subword layers alone account for about half of the whole model, and if we also freeze the first N encoder layers, up to 90% of the entire model can be left unchanged.
This solves our problem: 95% of the model stays frozen and is only touched when we add new features or custom models, while the last few layers are fine-tuned on each user’s data.
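In Hugging Face terms, the idea looks roughly like this (the checkpoint and the number of frozen layers are illustrative, and the snippet is a sketch rather than our training code):

```python
from transformers import RobertaModel

N_FROZEN_LAYERS = 22  # freeze everything below this encoder layer

backbone = RobertaModel.from_pretrained("roberta-large")  # 24 encoder layers

# Freeze the token/positional embeddings (a large share of the parameters)...
for param in backbone.embeddings.parameters():
    param.requires_grad = False

# ...and the first N encoder layers, which barely change during fine-tuning anyway.
for layer in backbone.encoder.layer[:N_FROZEN_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable share: {trainable / total:.1%}")  # only the top layers remain trainable
```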
With this in place, we solve the scaling problem as follows: we keep a pool of GPU machines that we scale as the number of requests to the system grows (which is quite easy), and we run inference for the user-specific part on CPUs, rotating models in RAM. A user model can be deployed on a CPU in less than half a second.
The final design looks like this: when a user request arrives, it goes to a GPU machine running the backbone, which vectorizes it with the transformer; the embeddings from the 22nd layer of RoBERTa are then transferred to a CPU machine, where the user’s model is either already cached in RAM or, if not, loaded from the local disk cache in another 50-100 milliseconds. The system is effectively stateless: we don’t need to track which pods hold which user models, since a request can be sent to any pod and it will work correctly. Routing a request to a pod that already has the model cached in RAM is preferable, but that can easily be handled at the infrastructure level without changing the application logic.
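Here is a simplified sketch of that split. The layer index matches the description above, but the cache, the per-user head, and the single-process layout are illustrative; the real service splits this across GPU and CPU pods and adds batching and error handling:

```python
import torch
from transformers import AutoTokenizer, RobertaModel

BACKBONE_LAYER = 22  # hand over hidden states from this RoBERTa layer

# --- GPU side: shared, frozen backbone --------------------------------------
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
backbone = RobertaModel.from_pretrained("roberta-large").eval()  # .cuda() in production

@torch.no_grad()
def vectorize(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = backbone(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index 22 is the 22nd encoder layer.
    return outputs.hidden_states[BACKBONE_LAYER]

# --- CPU side: small per-user heads, cached in RAM with a disk fallback -----
_head_cache: dict[str, torch.nn.Module] = {}

def get_user_head(user_id: str) -> torch.nn.Module:
    if user_id not in _head_cache:
        # Loading a small head from local disk takes on the order of 50-100 ms.
        _head_cache[user_id] = torch.load(f"/cache/{user_id}_head.pt", weights_only=False)
    return _head_cache[user_id]

def classify(user_id: str, text: str) -> int:
    hidden = vectorize(text)                 # in production: an RPC to a GPU pod
    pooled = hidden.mean(dim=1)              # simple mean pooling over tokens
    logits = get_user_head(user_id)(pooled)  # tiny user-specific layers, on CPU
    return int(logits.argmax(dim=-1))        # predicted intent id
```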
Thus, we solved the main problem: scaling without sacrificing quality. Thanks to this, we didn’t have to worry about peak loads on the system, and in addition, we got a very clear billing model, calculated by the price of a request.
Iteration 3
The final touches needed to make the system feel complete were hyperparameter selection and multi-language support. That’s what we turned our attention to in the third iteration.
Keeping in mind that our system is designed primarily for people who are not ML engineers and have no formal training in the field, we exposed only one hyperparameter: the number of training epochs. There were two ways to handle it. The first was two-step training: hold out part of the data to find the best number of epochs, then retrain on the full dataset. The problems with this approach are that it complicates training and that we can’t control the dataset size; it was hard to predict how such a scheme would behave in a few-shot setting, but most likely not well. So we went with the second way: researching how the optimal number of training epochs depends on the dataset’s parameters.
This is where the MASSIVE multilingual dataset (released by Amazon) came to our aid. From this multilingual classification dataset we carved out thousands of sub-datasets of different sizes and data distributions and studied how the optimal number of training epochs relates to the dataset parameters: the number of examples, the number of classes, and the mean and variance of the distribution of examples across classes. To our surprise, on different slices of this multi-dimensional space the dependence mapped onto a two-dimensional one with fairly good quality, and in each case we got almost the same inverse dependence.
After that, we only needed ordinary regression to derive a general formula from the collected data (of course, clamping it to a minimum and maximum number of epochs to get rid of artifacts at the extremes).
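A sketch of that final step is below. The data points are made up for illustration, and the inverse functional form is an assumption standing in for the dependence described above; only the overall recipe (fit a regression, then clamp) reflects what we did:

```python
import numpy as np

# Each row: parameters of one illustrative sub-dataset and the best epoch count
# found for it. All numbers here are invented for the example.
n_examples  = np.array([50, 120, 300, 800, 2000, 5000])
n_classes   = np.array([3, 5, 8, 10, 15, 20])
best_epochs = np.array([60, 40, 25, 15, 9, 6])

# Assume an inverse-like dependence: epochs ~ a + b / (examples per class).
examples_per_class = n_examples / n_classes
X = np.column_stack([np.ones_like(examples_per_class), 1.0 / examples_per_class])
coef, *_ = np.linalg.lstsq(X, best_epochs, rcond=None)

def predict_epochs(num_examples: int, num_classes: int,
                   min_epochs: int = 3, max_epochs: int = 80) -> int:
    per_class = num_examples / num_classes
    raw = coef[0] + coef[1] / per_class
    # Clamp to sane bounds to avoid artifacts at the extremes.
    return int(np.clip(round(raw), min_epochs, max_epochs))

print(predict_epochs(num_examples=200, num_classes=10))
```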
The last problem to solve was multi-language support. Given the geography Voximplant operates in, we needed to support Russian, English, Spanish, and Portuguese. The question was whether to use a single multilingual model as the backbone for all languages or a separate model per language. Luckily for us, the team at Meta had already researched this problem and shown how well the multilingual XLM-RoBERTa performs. We kept a separate Russian backbone, since it was already noticeably better than XLM-RoBERTa (we had a huge amount of raw Russian data to thank for that); for the other languages, we took XLM-RoBERTa as the backbone.
Outcome
In the end, after about nine months of hard work, we had built a full-fledged tool for creating flexible communication scenarios based on SoTA NLU and serverless JavaScript, deeply integrated with Voximplant features such as chats and telephony.