For most companies, the story usually goes like this.
- A customer calls in to complain, praise, or ask for assistance.
- The call is recorded for further training or evaluation.
- The recording is typically picked at random, listened to by someone, and reviewed with the customer service representative.
This process can take anywhere from an hour to a week after a customer hangs up. During this time, a lot can go wrong. Compliance issues and poor service could leave you with some unhappy customers.
I’ll show you how to work smarter, not harder, and identify problems as soon as they occur. What most developers don’t realize is that the intricate pieces are already pre-built inside the Google Cloud Platform.
There are three essential items you will want to look for when evaluating a call.
- Identity — Separate the individuals on the call distinctly.
- Sentiment — Are these individuals generally positive or negative in the interaction?
- Trigger Words — Were any words or phrases said that warrant further review?
Let’s complicate this a bit and evaluate single-channel audio phone calls. This means we are dealing not only with phone-quality audio, but also with audio where both callers are co-mingled in a single channel, which makes it much harder to distinguish who is talking and when.
A Google Cloud Function is the easiest way to trigger code execution at scale when a file is uploaded to Cloud Storage. Setting up a Cloud Function for this purpose is straightforward.
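As a minimal sketch, deploying the function covered in this article might look like the command below. The function name, bucket name, and runtime are assumptions; substitute your own.

gcloud functions deploy transcribe_audio \
  --runtime python37 \
  --trigger-resource gcs-bucket \
  --trigger-event google.storage.object.finalize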
Let’s first start with the requirements.txt file and imports.
Requirements.txt
google-cloud-speech==1.3.2
google-cloud-storage==1.27.0
pathlib
requests
Imports
In this example, I will be using diarization to distinguish and separate the audio between the two callers. Diarization is:
The process of partitioning an input audio stream into homogeneous segments according to the speaker identity
This process requires the Cloud Speech beta module speech_v1p1beta1.
import os
import requests
import json
import sys
import time
import uuid
from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1 import enums
from google.cloud import storage
Identifying the created file
As the Cloud Function is triggered by a google.storage.object.finalize event inside GCS, a dictionary with data specific to this type of event is sent.
Grabbing the file name is as easy as pulling out file['name'] from the [dictionary](https://cloud.google.com/functions/docs/calling/storage). Knowing all this information, we can build out a gs:// URI that can be used for various Google AI services.
BucketName = 'gcs-bucket'

def transcribe_audio(event, context):
    # Triggered by google.storage.object.finalize; "event" holds the GCS object metadata
    file = event
    now = time.time()
    FileName = file['name']
    storage_uri = 'gs://' + BucketName + '/' + FileName
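For reference, here is a trimmed-down sketch of what that event dictionary looks like. The values are illustrative, but the field names come from the GCS object metadata.

# Illustrative sketch of the (trimmed) event dictionary the function receives
event = {
    "bucket": "gcs-bucket",
    "name": "recordings/call-1234.mp3",
    "contentType": "audio/mpeg",
    "size": "524288",
    "timeCreated": "2020-04-01T12:00:00.000Z",
}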
Transcribing the Audio
Before transcribing the audio, I first want to make sure it is an actual audio file. In this example, I am only going to deal with mp3 audio. There are a tremendous number of options to choose from, and I will highlight a few. First, the hertz rate is essential and, more often than not, is 8000 for phone audio recordings. Second, because this is a phone call, Google has a dedicated machine learning model for phone call audio that produces a better transcription overall. Finally, for proper configuration, make sure to enable diarization and set the appropriate number of speakers on the call. If required, adjust your speech contexts to pick out specific proper nouns, business names, or phrases that can show up in conversation.
    # Let's process only mp3 files
    if storage_uri[-4:] == ".mp3":
        client = speech_v1p1beta1.SpeechClient()

        # Sample rate in hertz of the audio data sent
        sample_rate_hertz = 8000
        # The language of the supplied audio
        language_code = "en-US"
        # Use the model tuned specifically for phone call audio
        model = "phone_call"
        # Encoding of audio data sent. This sample sets this explicitly.
        # This field is optional for FLAC and WAV audio formats.
        encoding = enums.RecognitionConfig.AudioEncoding.MP3
        config = {
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
            "encoding": encoding,
            "model": model,
            "use_enhanced": True,
            "enable_automatic_punctuation": True,
            "enable_speaker_diarization": True,
            "diarization_speaker_count": 2,
            "speech_contexts": [{
                "phrases": ["Thank you for calling ABC",
                            "Thank you for contacting ABC",
                            "Welcome to ABC",
                            "ABC customer service",
                            "Thank you for calling ABC customer support."]
            }]
        }
        audio = {"uri": storage_uri}
        operation = client.long_running_recognize(config, audio)
        # Block until the asynchronous recognition finishes
        response = operation.result()

        transcriptw = ""
        sendtrans = False
        keyword = "Empty Audio"
        speaker = ""
        for result in response.results:
            words_info = result.alternatives[0].words
            for word_info in words_info:
                # A speaker_tag of 0 means the word has not been attributed to a speaker
                if str(word_info.speaker_tag) != "0":
                    if str(word_info.speaker_tag) != str(speaker):
                        # The speaker changed, so start a new labeled segment
                        speaker = str(word_info.speaker_tag)
                        transcriptw = transcriptw + "\n-------\n*Speaker " + speaker + ":* " + word_info.word
                    else:
                        # Same speaker, so append the word to the current segment
                        transcriptw = transcriptw + " " + word_info.word
        print(transcriptw)

        if transcriptw.strip() == "":
            transcriptw = "*No Sound*"
            sendtrans = True
        else:
            trigger_words = ["bitcoin", "payment", "invoice", "bill", "utilities", "utility",
                             "electricity", "credit card", "package", "testing", "kits",
                             "financial", "supplies", "mask", "symptoms", "isolate", "oxygen",
                             "ventilator", "social security", "government", "internal revenue",
                             "covid", "world health", "national institute", "virus", "corona",
                             "quarantine", "stimulus", "relief", "cdc", "disease", "pandemic",
                             "epidemic", "sickness"]
            # Flag the transcript on the first trigger word that appears
            for i in trigger_words:
                if i.lower() in transcriptw.lower():
                    keyword = i.lower()
                    sendtrans = True
                    break

        if sendtrans:
            print(f"Sending to Slack: {file['name']}.")
            filename = file['name']
            send_slack(transcriptw.strip(), filename, keyword)
For longer audio, such as entire phone conversations, the best practice is to use the client.long_running_recognize(config, audio) method, which performs asynchronous speech recognition.
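Calling operation.result() blocks until the transcription finishes. As a minimal sketch, assuming the same operation object from above, you could instead poll the operation and do other work in between:

# Sketch: poll the long-running operation instead of blocking on result()
while not operation.done():
    time.sleep(5)
response = operation.result()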
After transcribing, I check the transcript for any keyword triggers and, if any match, send the transcription to Slack for immediate notification.
Below is the Slack function.
def send_slack(transcript, filename, keyword):
    try:
        response = requests.post(
            url="https://hooks.slack.com/services/ABCDEFG/123456/ABC123",
            headers={
                "Content-Type": "application/json",
            },
            data=json.dumps({
                # Include the matched trigger word so reviewers know why the call was flagged
                "text": "*Audio:* https://storage.cloud.google.com/" + BucketName + "/" + filename
                        + "\n*Keyword:* " + keyword
                        + "\n*Transcription:*\n" + transcript
            })
        )
        print('Response HTTP Status Code: {status_code}'.format(
            status_code=response.status_code))
        print('Response HTTP Response Body: {content}'.format(
            content=response.content))
    except requests.exceptions.RequestException:
        print('HTTP Request failed')
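To sanity-check the pipeline locally, outside of Cloud Functions, you can invoke the entry point with a hand-built event dictionary. The file name below is a placeholder; only the 'name' key is used by the code above.

# Simulate a google.storage.object.finalize event for local testing
fake_event = {"name": "recordings/sample-call.mp3"}
transcribe_audio(fake_event, None)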
An open-source and simplified example of the above code is available in one of Ytel’s public GitLab repositories.
When the Covid-19 outbreak started, telecom companies quickly needed to identify and report certain types of scam-oriented communications.