Avicus Delacroix

Let's talk to ChatGPT with your voice

Hello! This is Avi from Voximplant.

Last time, I talked about AutoML solutions and how we created a virtual assistant called Avatar at Voximplant.

Since then, I have received several messages about artificial intelligence in general and virtual assistants in particular. I've also noticed that lately, people have become obsessed with ChatGPT.

I don't think I need to explain what ChatGPT is. But for those who haven't heard of it: it's a conversational bot based on machine learning, like Voximplant Avatar, but of a different kind. Avatar is a goal-oriented bot designed to help companies assist their customers, while ChatGPT is a chit-chat bot that learns from users and holds a realistic conversation.

At Voximplant, we provide not only our virtual assistant but also a lot of other communication services, such as calls, conferences, contact centers, interactive voice menus for hotlines, and more. And, of course, all of that comes with speech synthesis and recognition. So, I decided to integrate ChatGPT into Voximplant, give it a voice, and teach it to recognize human speech.

In this article, I'll show you how to build an application that lets you talk to ChatGPT with your voice.

Create the application

First, you need to register an OpenAI account and obtain an API key. You can do this easily at https://platform.openai.com/.

After you get your API key, you need to create an application at Voximplant. This is also an easy task. Read the guide on how to create an app in the Voximplant documentation.

In the Voximplant application, you need to create a scenario that will contain all the code. If you want to call ChatGPT from your phone and talk to it, buy a real or a test phone number and bind it to your application. And, of course, you will need a routing rule to launch the scenario (leave all the routing rule settings at their defaults).

Write the scenario

In the scenario, we need to connect to ChatGPT via an API request, use speech synthesis to pronounce ChatGPT's replies, and use the speech recognition module to convert your speech to text and send it back to ChatGPT as an utterance.

Let's start with the required modules and the API request.

All we need to require is Voximplant's ASR module. Then, we prepare the API URL and the API key obtained in the very first step. For OpenAI, we also need an array of messages that stores the conversation history.

I will make an API request to ChatGPT each time the ASR module recognizes an utterance. Let's see what it looks like.

require(Modules.ASR)

const openaiURL = 'https://api.openai.com/v1/chat/completions'
const openaiApiKey = 'your_api_key'
const messages = []

async function requestCompletion() {
   return Net.httpRequestAsync(openaiURL, {
       headers: [
           "Content-Type: application/json",
           "Authorization: Bearer " + openaiApiKey
       ],
       method: 'POST',
       postData: JSON.stringify({
           "model": "gpt-3.5-turbo",
           "messages": messages
           // you can configure the length of the answer
           // by sending the max_tokens parameter, e.g.:
           // "max_tokens": 150
       })
   })
}

As you can see, I commented out the max_tokens parameter in the request to ChatGPT. By default, ChatGPT responds with very long messages, which are fine for instant messaging but can be quite annoying in spoken communication. With this parameter, you can limit the length of incoming messages (utterances) from ChatGPT.
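
For instance, the request body with the limit enabled would look like this (150 is an arbitrary example value, tune it to your taste):

postData: JSON.stringify({
    "model": "gpt-3.5-turbo",
    "messages": messages,
    // cap the reply length so spoken answers stay short
    "max_tokens": 150
})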

After that, we need to process the inbound call in the scenario. Let's create a couple of necessary variables and put our logic into the CallAlerting event handler.

We need to create an automatic speech recognition (ASR) instance and process its events. At this step, you can choose the ASR provider: Voximplant offers a huge list of them, from multiple vendors and for multiple languages, so you are free to choose any.

You also need to choose a voice for your robot. As with ASR providers, we have a huge list of voices, so you can try out several to find your favorite.
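
For example, if you want a German-speaking bot, pick a matching ASR profile and voice. A minimal sketch, assuming the German identifiers follow the same naming pattern as the English ones used below (check the ASRProfileList and VoiceList references for the exact names):

// hypothetical German setup; verify the identifiers in the Voximplant docs
const defaultVoice = VoiceList.Google.de_DE_Neural2_B
const asr = VoxEngine.createASR({
    profile: ASRProfileList.Google.de_DE,
    singleUtterance: true
})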

After that, we need to process the ASR result each time you stop talking. Let's push your utterance to the messages array and then call the requestCompletion() function from the previous step to send the request to OpenAI.

If the API request succeeds, we pronounce ChatGPT's reply via a TTS player and push it to the messages array. If the request fails, we simply ask the caller to repeat.

Let's see how it works.

let call, player, asr;
const defaultVoice = VoiceList.Google.en_US_Neural2_C;

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
   call = e.call
   asr = VoxEngine.createASR({
       profile: ASRProfileList.Google.en_US,
       singleUtterance: true
   })
   asr.addEventListener(ASREvents.Result, async (e) => {
       messages.push({ "role": "user", "content": e.text })
        const res = await requestCompletion()

       if (res.code == 200) {
           let jsData = JSON.parse(res.text)
           player = VoxEngine.createTTSPlayer(jsData.choices[0].message.content,
               {
                   language: defaultVoice,
                   progressivePlayback: true
               })
            player.sendMediaTo(call)
            // the marker fires 300 ms before playback ends, so ASR resumes slightly early
            player.addMarker(-300)
           messages.push({ role: "assistant", content: jsData.choices[0].message.content })
       } else {
           Logger.write(res.code + " : " + res.text)
           player = VoxEngine.createTTSPlayer('Sorry, something went wrong, can you repeat please?',
               {
                   language: defaultVoice,
                   progressivePlayback: true
               })
           player.sendMediaTo(call)
           player.addMarker(-300)
       }
       player.addEventListener(PlayerEvents.PlaybackMarkerReached, (ev) => {
           player.removeEventListener(PlayerEvents.PlaybackMarkerReached)
           call.sendMediaTo(asr)
       })
   })
})

As you may have noticed, I added a Logger.write() call to write the error message from ChatGPT to the application logs for later analysis. You can add more logging to understand, for example, how long the API request takes.

Let’s extend the ASREvents.Result event processing with logging.

After that, we process the call's Connected and Disconnected events. When the call is established, we pronounce a greeting and send the call's audio to the speech recognition module; when the call is disconnected, we terminate the scenario.

After these preparations are done, we can answer the inbound call. The complete event handler will look like this:

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
   call = e.call
   asr = VoxEngine.createASR({
       profile: ASRProfileList.Google.en_US,
       singleUtterance: true
   })
   asr.addEventListener(ASREvents.Result, async (e) => {
       messages.push({ "role": "user", "content": e.text })
       Logger.write("Sending data to the OpenAI endpoint")
       let ts1 = Date.now();
        const res = await requestCompletion()
       let ts2 = Date.now();
       Logger.write("Request complete in " + (ts2 - ts1) + " ms")

       if (res.code == 200) {
           let jsData = JSON.parse(res.text)
           player = VoxEngine.createTTSPlayer(jsData.choices[0].message.content,
               {
                   language: defaultVoice,
                   progressivePlayback: true
               })
           player.sendMediaTo(call)
           player.addMarker(-300)
           messages.push({ role: "assistant", content: jsData.choices[0].message.content })
       } else {
           Logger.write(res.code + " : " + res.text)
           player = VoxEngine.createTTSPlayer('Sorry, something went wrong, can you repeat please?',
               {
                   language: defaultVoice,
                   progressivePlayback: true
               })
           player.sendMediaTo(call)
           player.addMarker(-300)
       }
       player.addEventListener(PlayerEvents.PlaybackMarkerReached, (ev) => {
           player.removeEventListener(PlayerEvents.PlaybackMarkerReached)
           call.sendMediaTo(asr)
       })
   })
   call.addEventListener(CallEvents.Connected, (e) => {
       player = VoxEngine.createTTSPlayer('Hi, ChatGPT bot is at your service, how may I help you?',
           {
               language: defaultVoice
           })
       player.sendMediaTo(call)
       player.addMarker(-300)
       player.addEventListener(PlayerEvents.PlaybackMarkerReached, (ev) => {           
           player.removeEventListener(PlayerEvents.PlaybackMarkerReached)
           call.sendMediaTo(asr)
       })
   })
   call.addEventListener(CallEvents.Disconnected, (e) => {
       VoxEngine.terminate()
   })
   call.answer()
})

Now, let's test our application. Call the phone number you bound to your application, wait for the greeting, and start talking with your robot. Alternatively, use the debug softphone in the Voximplant control panel to call into the scenario.

You can find the complete scenario with comments in the corresponding article in the Voximplant documentation. It has been tested and is ready to use. Feel free to modify it as you wish: process more events, change voices, adjust the utterance length, and much more.

I hope you enjoyed this article. Until next time!
