<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew R. Freed</title>
    <description>The latest articles on DEV Community by Andrew R. Freed (@andrewrfreed).</description>
    <link>https://dev.to/andrewrfreed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F593818%2F40113e63-ab85-42c5-b280-b52dab0efbf9.jpeg</url>
      <title>DEV Community: Andrew R. Freed</title>
      <link>https://dev.to/andrewrfreed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrewrfreed"/>
    <language>en</language>
    <item>
      <title>How to Improve your Assistant</title>
      <dc:creator>Andrew R. Freed</dc:creator>
      <pubDate>Thu, 07 Oct 2021 20:37:07 +0000</pubDate>
      <link>https://dev.to/andrewrfreed/how-to-improve-your-assistant-585o</link>
      <guid>https://dev.to/andrewrfreed/how-to-improve-your-assistant-585o</guid>
      <description>&lt;p&gt;Take 40% off &lt;a href="https://www.manning.com/books/conversational-ai"&gt;Conversational AI&lt;/a&gt; by entering &lt;strong&gt;fccfreed&lt;/strong&gt; into the discount code box at checkout at &lt;a href="https://www.manning.com/"&gt;manning.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Imagine that you have been contracted to diagnose problems and offer solutions for a virtual assistant belonging to a company called FICTITIOUS INC.&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC has deployed their virtual assistant to production but isn’t achieving the success metrics they outlined for the solution.  The virtual assistant was supposed to reduce the burden on other customer service channels, but these channels haven’t seen a significant reduction in user activity.  FICTITIOUS INC knows how to troubleshoot their traditional applications but doesn’t know where to start troubleshooting their virtual assistant.&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC needs to quickly drill down into WHY their assistant isn’t performing well.  They need to find out if their conversational flow doesn’t work for users, or if the intent mapping they have done doesn’t work, or if there’s some other core problem with their assistant.&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC is in good company.  Deploying a virtual assistant to production isn’t the end – it’s only the beginning!  Figure 1 demonstrates the continuous improvement cycle in a virtual assistant’s lifecycle.  Continuous improvement is broadly applicable in software projects, and it’s just as applicable to virtual assistants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kR057Q8z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn7jd7ns95sunbc3h59z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kR057Q8z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn7jd7ns95sunbc3h59z.png" alt="Continuous Improvement cycle of Gather Data, Train, Test, Improve"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1 Improvement is part of a continuous cycle in the life of a virtual assistant.  This cycle continues even after an assistant is deployed to production!&lt;/p&gt;

&lt;p&gt;This cycle doesn’t stop for FICTITIOUS INC when they deploy their assistant.  The first improvement cycle after deploying to production is the most informative.  This is where FICTITIOUS INC learns which of their assumptions were correct and which ones need to be revisited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying a virtual assistant to production isn’t the end – it’s only the beginning!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;In this article, we learn how FICTITIOUS INC can identify where their assistant needs the most improvement.  FICTITIOUS INC has chosen “successful containment” as their key success metric, and we use that metric to drive our investigation.  Containment for virtual assistants is the percentage of conversations handled entirely by the virtual assistant.  (A conversation that isn’t escalated to a human is “contained”.)  FICTITIOUS INC’s “successful containment” tightens this definition: only conversations that complete the entire process flow are “successfully contained”.&lt;/p&gt;
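&lt;p&gt;As a rough sketch, both metrics can be computed from a conversation log.  The &lt;code&gt;escalated&lt;/code&gt; and &lt;code&gt;completed_flow&lt;/code&gt; field names below are hypothetical - real log schemas vary by platform:&lt;/p&gt;

```python
# A minimal sketch of containment vs. successful containment.
# Field names are hypothetical; real log schemas vary by platform.
conversations = [
    {"escalated": False, "completed_flow": True},   # answered by the assistant
    {"escalated": False, "completed_flow": False},  # user quit before finishing
    {"escalated": True,  "completed_flow": False},  # handed off to a human
]

contained = [c for c in conversations if not c["escalated"]]
successful = [c for c in contained if c["completed_flow"]]

containment = len(contained) / len(conversations)              # 2 of 3 conversations
successful_containment = len(successful) / len(conversations)  # only 1 of 3
print(containment, successful_containment)
```

&lt;p&gt;Note how the conversation the user abandoned counts toward containment but not toward successful containment - exactly the gap FICTITIOUS INC cares about.&lt;/p&gt;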

&lt;p&gt;With successful containment in mind, we use a data-driven approach to evaluate their virtual assistant, including the dialog flows and intent identification.  We conduct a single evaluation of FICTITIOUS INC’s virtual assistant here, though FICTITIOUS INC will need to evaluate it many times over its lifetime.  Let’s start by looking for the first improvement FICTITIOUS INC needs to make.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Change is the only constant in life." Heraclitus, Greek philosopher&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Using a success metric to determine where to start improvements&lt;/h1&gt;

&lt;p&gt;Analyzing a virtual assistant can feel like a daunting process.  Many different types of analyses exist.  Where should FICTITIOUS INC begin their analysis?  Analysis should center on a success metric.  This success metric forms a guiding principle for all analysis and improvement.  Any potential analysis or improvement work should be prioritized based on how it impacts a success metric.&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC’s chosen success metric is “successful containment”.  “Successful containment” aligns with their users’ needs better than “containment”.  If a user quits a conversation before getting an answer, that conversation is contained, but FICTITIOUS INC doesn’t consider the conversation a success.  Table 1 contrasts containment and successful containment.&lt;/p&gt;

&lt;p&gt;Table 1 Sample scenarios and how they are measured.  FICTITIOUS INC uses "successful containment".&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZpOxfA3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz3p14kawgdefkvj3c10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZpOxfA3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz3p14kawgdefkvj3c10.png" alt="Table 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC uses three data points to start the analysis of their assistant: overall successful containment, volume by intent, and successful containment by intent.  These data points enable per-intent analysis and reveal which intents have the largest impact on overall successful containment.&lt;/p&gt;

&lt;p&gt;To simplify the analysis, we only consider five of FICTITIOUS INC’s intents.  These intents and their associated metrics are shown in Table 2.  Based on this table, which intent would you explore first?&lt;/p&gt;

&lt;p&gt;Table 2 FICTITIOUS INC's metrics for conversation volume and successful containment, broken down per intent.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kRX1TZcP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o6hlxedxjylhl8jdylw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kRX1TZcP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o6hlxedxjylhl8jdylw4.png" alt="Table 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#appointments&lt;/code&gt; has the lowest successful containment at thirty percent, but it’s a low-volume intent.  &lt;code&gt;#reset_password&lt;/code&gt;, meanwhile, is the largest source of uncontained conversations, comprising two-thirds of the total.  If FICTITIOUS INC can fix what’s wrong in those two intents, their virtual assistant will have much higher containment and be more successful.  Because &lt;code&gt;#reset_password&lt;/code&gt; has the biggest problem, FICTITIOUS INC should start there. &lt;/p&gt;
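&lt;p&gt;This prioritization can be sketched in a few lines: multiply each intent’s volume by its failure rate and rank by the result.  The figures below are hypothetical, in the spirit of Table 2:&lt;/p&gt;

```python
# Hypothetical per-intent figures in the spirit of Table 2.
intents = {
    "#reset_password": {"volume": 200, "success_rate": 0.67},
    "#appointments":   {"volume": 40,  "success_rate": 0.30},
    "#store_hours":    {"volume": 120, "success_rate": 0.90},
}

def unsuccessful(stats):
    # Conversations this intent failed to successfully contain.
    return stats["volume"] * (1 - stats["success_rate"])

# Rank intents by how many failed conversations each contributes overall.
ranked = sorted(intents, key=lambda name: unsuccessful(intents[name]), reverse=True)
print(ranked)
```

&lt;p&gt;Even though &lt;code&gt;#appointments&lt;/code&gt; has the worst rate, the ranking puts the high-volume &lt;code&gt;#reset_password&lt;/code&gt; first, because absolute failure counts are what move the overall metric.&lt;/p&gt;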

&lt;h2&gt;Improving the first flow to fix containment problems&lt;/h2&gt;

&lt;p&gt;Solving problems is easier when you know what the specific problems are.  FICTITIOUS INC has identified the &lt;code&gt;#reset_password&lt;/code&gt; flow as the biggest source of uncontained conversations.  This is the most complex of FICTITIOUS INC’s process flows, and that’s probably not a coincidence.  Let’s reacquaint ourselves with FICTITIOUS INC’s &lt;code&gt;#reset_password&lt;/code&gt; flow in Figure 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5t_6KiLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vs0iiitxo52bdsoucqif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5t_6KiLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vs0iiitxo52bdsoucqif.png" alt="Password reset dialog flow"&gt;&lt;/a&gt;&lt;br&gt;
Figure 2 FICTITIOUS INC’s reset password conversational flow.  Any conversation that visits P00 is counted as a password reset conversation.  Only conversations that include P08 are successfully contained.&lt;/p&gt;

&lt;p&gt;A password reset conversation always starts with dialog P00 and P01.  After that, the password reset flow has only one path to success.  The successful path includes dialog nodes P00, P01, P03, P05, and P08.  These nodes form a conversion funnel which is shown in Figure 3.  Every conversation that includes P03 must necessarily include P01, but some conversations that include P01 don’t include P03.  A conversation that includes P01 but not P03 has “drop-off” at P01.  By measuring the drop-off between P01, P03, P05, and P08, FICTITIOUS INC can narrow in on why password reset conversations fail to complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xli-dnb5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5s52cm9mnosc5eftpmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xli-dnb5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5s52cm9mnosc5eftpmu.png" alt="Conversion funnel for password reset process"&gt;&lt;/a&gt;&lt;br&gt;
Figure 3 A successful password reset flow, visualized as a funnel.  A conversation that completes each step in this funnel is successfully contained.&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC can analyze their password reset process flow via a conversion funnel by analyzing their virtual assistant logs and counting how many times each dialog node is invoked.  Then, they can compute the drop-off in each step of the funnel.  This high-level analysis illuminates what parts of the process flow require further analysis.  The parts causing the most drop-off should be improved first.&lt;/p&gt;
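&lt;p&gt;A minimal sketch of that computation follows.  The node counts are illustrative, chosen to match the drop-off rates discussed in this article:&lt;/p&gt;

```python
# Dialog-node visit counts tallied from the assistant's logs.
# The counts are illustrative, not taken from a real deployment.
funnel = [("P01", 195), ("P03", 130), ("P05", 120), ("P08", 75)]

# Drop-off at each step: conversations that reached this node but not the next.
for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
    dropped = count - next_count
    print(f"{step} to {next_step}: {dropped} dropped ({dropped / count:.0%})")
```

&lt;p&gt;Because every successful conversation must pass through each node in order, each count is necessarily no larger than the one before it, and the per-step differences are exactly the funnel’s drop-offs.&lt;/p&gt;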

&lt;blockquote&gt;
&lt;p&gt;How can you run log analysis in your specific virtual assistant platform?&lt;br&gt;
You can perform the analyses in this section multiple ways; the specific steps vary by virtual assistant platform.  For instance, your virtual assistant platform may make it easy to find conversations that include one dialog node but not another.&lt;br&gt;
The techniques in this article are purposely generic, even if somewhat inefficient.  If your virtual assistant platform doesn’t include analytic capabilities, you can build the analyses described in this section yourself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FICTITIOUS INC’s password reset conversion funnel metrics can be found in Table 3.&lt;/p&gt;

&lt;p&gt;Table 3 Conversion funnel for FICTITIOUS INC's password reset dialog flow.  This analysis shows a steep drop-off after asking for the user ID and the security question.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OD9cpUdj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zeaebfmhbv74nmtvtlrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OD9cpUdj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zeaebfmhbv74nmtvtlrn.png" alt="Table 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conversion funnel tells FICTITIOUS INC that one-third of password reset conversations included the question “What is your user ID?” but not the question “What is your date of birth?”.  The “What is your user ID?” question has a thirty-three percent drop-off rate.  It’s also the largest source of drop-offs, causing containment failure in sixty-five conversations.  The entire conversion funnel can be visualized as in Figure 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--93ScGVSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcejjxwqtyuyznowboco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--93ScGVSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcejjxwqtyuyznowboco.png" alt="Conversion funnel with number of people in each part of reset password process"&gt;&lt;/a&gt;&lt;br&gt;
Figure 4 FICTITIOUS INC's password reset flow conversion funnel annotated with the number of conversations containing each step of the dialog.  P01 and P05 cause most of the total drop-off.&lt;/p&gt;

&lt;p&gt;The severe drop-offs between P01 and P03, as well as between P05 and P08, are both detrimental to FICTITIOUS INC’s successful containment metric.  The P05 to P08 drop-off is more severe in relative terms (38% vs 33%), but the P01 to P03 drop-off affects more conversations in total.  FICTITIOUS INC should first focus on the P01 to P03 drop-off.&lt;/p&gt;

&lt;h3&gt;Analyzing the first source of drop-off in the first intent&lt;/h3&gt;

&lt;p&gt;The first detailed analysis for the P01 to P03 drop-off is to find out what users are saying to the assistant between P01 and P03.  Depending on their virtual assistant platform, FICTITIOUS INC can query for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What users say immediately after P01&lt;/li&gt;
&lt;li&gt;What users say immediately before P03&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This query tells FICTITIOUS INC what users are saying in response to the “What is your User ID?” question.  FICTITIOUS INC can inspect a small sample of these responses, perhaps ten or twenty, in their initial investigation.  The query results are shown in Table 4.  All valid FICTITIOUS INC user IDs follow the same format: four to twelve alphabetic characters followed by one to three numeric characters.  Any other user ID string is invalid.  Before reading ahead, see if you can classify the response patterns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;afreed1&lt;/li&gt;
&lt;li&gt;don't know, that’s why I called&lt;/li&gt;
&lt;li&gt;ebrown5&lt;/li&gt;
&lt;li&gt;fjones8&lt;/li&gt;
&lt;li&gt;hgarcia3&lt;/li&gt;
&lt;li&gt;I don't know it&lt;/li&gt;
&lt;li&gt;I'm not sure&lt;/li&gt;
&lt;li&gt;jdoe3&lt;/li&gt;
&lt;li&gt;mhill14&lt;/li&gt;
&lt;li&gt;nmiller&lt;/li&gt;
&lt;li&gt;no idea&lt;/li&gt;
&lt;li&gt;pjohnson4&lt;/li&gt;
&lt;li&gt;pdavis18&lt;/li&gt;
&lt;li&gt;tsmith&lt;/li&gt;
&lt;li&gt;vwilliams4&lt;/li&gt;
&lt;/ul&gt;
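&lt;p&gt;One way to classify these responses programmatically is to test each one against the valid user ID format and a few “don’t know” phrasings.  A sketch (the phrase list is ad hoc, not exhaustive):&lt;/p&gt;

```python
import re

# Valid FICTITIOUS INC user IDs: four to twelve letters followed by one to three digits.
USER_ID = re.compile(r"[A-Za-z]{4,12}[0-9]{1,3}")

# Ad-hoc phrasings that signal the user doesn't know their ID.
DONT_KNOW_PHRASES = ("know", "not sure", "no idea")

def classify(text):
    if USER_ID.fullmatch(text):
        return "valid user ID"
    if any(phrase in text.lower() for phrase in DONT_KNOW_PHRASES):
        return "does not know their ID"
    return "invalid user ID"

for response in ["afreed1", "I'm not sure", "tsmith", "no idea"]:
    print(f"{response!r}: {classify(response)}")
```

&lt;p&gt;Responses like “tsmith” and “nmiller” fail the format check because they lack the trailing digits, which is exactly the kind of pattern the manual inspection surfaces.&lt;/p&gt;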

&lt;p&gt;The analysis of these responses is shown in Figure 5.  The analysis surfaces several patterns in the response utterances.  The expected response to P01 is a valid user ID consisting of four to twelve letters followed by one to three numbers, but users don’t always provide one!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pngTezNs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cpbkez863euwz88da070.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pngTezNs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cpbkez863euwz88da070.png" alt="Analysis of P01 responses"&gt;&lt;/a&gt;&lt;br&gt;
Figure 5 Patterns in user utterances given in response to FICTITIOUS INC's question P01: "What is your User ID?"&lt;/p&gt;

&lt;p&gt;FICTITIOUS INC can transform these patterns into actionable insights that improve successful containment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Insight #1: Many users don’t know their user ID.&lt;/strong&gt;  FICTITIOUS INC could build an intent for &lt;code&gt;#i_dont_know&lt;/code&gt;.  When a user is asked for their ID and responds with &lt;code&gt;#i_dont_know&lt;/code&gt;, the assistant could provide the user with instructions on how to find their user ID.  Or the assistant could be programmed to validate the user another way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight #2: Many users provide an invalid user ID.&lt;/strong&gt;  They may not know the required format, or they may simply have mistyped their ID.  These users could be given another chance to enter their ID, along with guidance on what a valid user ID looks like. &lt;/li&gt;
&lt;/ul&gt;
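&lt;p&gt;Both insights could be combined into a single routing function at the user ID prompt.  The sketch below is hypothetical - the intent name and return values stand in for whatever dialog nodes FICTITIOUS INC wires up:&lt;/p&gt;

```python
import re

# Valid user IDs: four to twelve letters followed by one to three digits.
USER_ID = re.compile(r"[A-Za-z]{4,12}[0-9]{1,3}")

def handle_user_id_response(intent, text):
    # Insight #1: route "I don't know" answers to help instead of letting the flow fail.
    if intent == "#i_dont_know":
        return "help_find_user_id"
    # Insight #2: reprompt with format guidance when the ID is invalid.
    if not USER_ID.fullmatch(text):
        return "reprompt_with_format_hint"
    return "proceed_to_P03"

print(handle_user_id_response("#i_dont_know", "no idea"))
print(handle_user_id_response(None, "tsmith"))
print(handle_user_id_response(None, "afreed1"))
```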

&lt;p&gt;That’s all for this article. If you want to learn more about the book, check it out on Manning’s liveBook platform &lt;a href="https://livebook.manning.com/book/conversational-ai?origin=product-look-inside"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>improvement</category>
      <category>conversationalai</category>
    </item>
    <item>
      <title>The Command Interpreter Pattern</title>
      <dc:creator>Andrew R. Freed</dc:creator>
      <pubDate>Fri, 14 May 2021 20:28:35 +0000</pubDate>
      <link>https://dev.to/andrewrfreed/the-command-interpreter-pattern-23oi</link>
      <guid>https://dev.to/andrewrfreed/the-command-interpreter-pattern-23oi</guid>
      <description>&lt;h1&gt;
  
  
  An Introduction to Virtual Assistants
&lt;/h1&gt;

&lt;p&gt;(From &lt;a href="https://www.manning.com/books/creating-virtual-assistants?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20&amp;amp;utm_content=devto"&gt;Creating Virtual Assistants&lt;/a&gt; by Andrew Freed)&lt;/p&gt;

&lt;p&gt;This article gives you an introduction to creating useful virtual assistants with the Command Interpreter pattern.&lt;/p&gt;

&lt;p&gt;(Take 40% off &lt;em&gt;&lt;a href="https://www.manning.com/books/creating-virtual-assistants?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20&amp;amp;utm_content=devto"&gt;Creating Virtual Assistants&lt;/a&gt;&lt;/em&gt; by entering &lt;strong&gt;fccfreed&lt;/strong&gt; into the discount code box at checkout at &lt;a href="https://www.manning.com/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20&amp;amp;utm_content=devto"&gt;manning.com&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;A common use of virtual assistant technology is the Command Interpreter pattern.  This pattern is prevalent in "smart" devices: your phone, your TV remote, your appliances.  The command interpreter deals with a limited vocabulary - enough to invoke the commands it supports - and the interaction is finished as soon as the information needed to execute the task is collected.  A smart television remote is tuned to recognize words like “channel” and “volume” but won’t recognize a command like “Set air temperature to 76 degrees”.  Table 1 shows example command interpreters and commands that are in their vocabulary.&lt;/p&gt;

&lt;p&gt;Table 1 Example command interpreters and supported commands&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command Interpreter Type&lt;/th&gt;
&lt;th&gt;Example command&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Smart phone&lt;/td&gt;
&lt;td&gt;“Set an alarm for 9PM”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart phone&lt;/td&gt;
&lt;td&gt;“Call Mom”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart phone&lt;/td&gt;
&lt;td&gt;“Open Facebook”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice-powered television remote&lt;/td&gt;
&lt;td&gt;“Increase volume”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice-powered television remote&lt;/td&gt;
&lt;td&gt;“Turn to channel 3”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart home controller&lt;/td&gt;
&lt;td&gt;“Turn on the light”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart home controller&lt;/td&gt;
&lt;td&gt;“Set air temperature to 76 degrees”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My favorite command on my phone is the "set alarm" command which has two parameter slots: the alarm time and the alarm reason.  On my phone the alarm time is a required parameter and the alarm reason is optional.  In the first example below, I've mapped out the simplest "set alarm" invocation: "Set an alarm".  The interpreter knows that setting an alarm without an alarm time is impossible and it prompts me for the alarm time.&lt;/p&gt;

&lt;p&gt;The command interpreters you’re most familiar with require activation either via a physical button or a "wake word".  Several wake words you may be familiar with include "Hey Siri", "Ok Google", and "Alexa".  Devices that wake via a button don't start listening until you press that button.  Devices which are always listening don't jump into action until you use a wake word.&lt;/p&gt;

&lt;p&gt;The diagram below demonstrates how several command interpreters parse your input into an actionable command.  Try experimenting with your devices to see how many different ways you can execute a command, either by changing the words ("create alarm" vs "set an alarm") or changing the order of the information given.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iauoQrgE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/htofrzvkryauvc06qiz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iauoQrgE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/htofrzvkryauvc06qiz0.png" alt="Diagram of commands and command parameters used by the Command Interpreter"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1 Exploring the parts of a command&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3nB2RAia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pgymh6sjh1pdq2vp5h9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3nB2RAia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pgymh6sjh1pdq2vp5h9y.png" alt="Alternate diagram of commands and command parameters used by the Command Interpreter"&gt;&lt;/a&gt;&lt;br&gt;
Figure 2 Alternate terminology for the command interpreter pattern&lt;/p&gt;

&lt;p&gt;The command interpreter’s job is to identify a command (ex: “set alarm”) and fill all of the parameter "slots" (ex: time of alarm) for that command.  When the command interpreter is invoked, it asks as many follow-up questions as needed to fill all of the required slots - no more and no fewer.  This is similar to how virtual assistants work, except that command interpreters only service a single request - the entire conversation is finished when the required parameters are gathered and the initial command is executed.&lt;/p&gt;
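&lt;p&gt;The slot-filling loop can be sketched in a few lines.  The slot names and prompt below are made up for illustration:&lt;/p&gt;

```python
# Required slots map to the follow-up question that fills them.
# Optional slots (like the alarm reason) are never prompted for.
REQUIRED = {"time": "What time should the alarm be set for?"}

def fill_slots(slots, ask):
    # Ask a follow-up question for each required slot still missing - no more, no fewer.
    for name, prompt in REQUIRED.items():
        if slots.get(name) is None:
            slots[name] = ask(prompt)
    return slots

# Simulate a user who answers the single follow-up question with "9PM".
answers = iter(["9PM"])
filled = fill_slots({"time": None, "reason": None}, lambda prompt: next(answers))
print(filled)
```

&lt;p&gt;If the user had supplied the time up front, the loop would find nothing missing and ask no questions at all - matching the behavior in Figure 4.&lt;/p&gt;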

&lt;p&gt;&lt;strong&gt;Do virtual assistants always have “commands”?&lt;/strong&gt;&lt;br&gt;
Terminology differs.  In this article I refer to intents and entities; nearly every virtual assistant platform uses the “intent” terminology.  Some platforms use “action” and “intent” interchangeably. In the case of command interpreters, commands are intents and parameters are entities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a3PF8Sn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwmtnhrcn49vmwzp1u48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a3PF8Sn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwmtnhrcn49vmwzp1u48.png" alt="Setting an alarm without specifying the alarm time"&gt;&lt;/a&gt;&lt;br&gt;
Figure 3 A command can require follow-up questions from the command interpreter&lt;/p&gt;

&lt;p&gt;Command interpreters can also be coded to receive multiple parameters at once as long as there’s a way to distinguish them.  In the next example I provide all of the alarm parameters in a single statement: "Set an alarm for 9PM to write Chapter 1."  The phone gladly obliges this request!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MEU00U1Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/17qap3n9jq2mhha6il7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MEU00U1Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/17qap3n9jq2mhha6il7z.png" alt="Setting an alarm by providing all parameters at once"&gt;&lt;/a&gt;&lt;br&gt;
Figure 4 When all required parameters are passed to the command interpreter, it completes the command and ends the conversation.&lt;/p&gt;

&lt;p&gt;When developing a command interpreter, you need to separate the parameters from the command.  In the case of setting an alarm this isn’t too difficult: the command is something like "set an alarm", the alarm time is text that parses to a time, and any remaining text (minus filler words like "for", "to", "at", etc.) is the reason.&lt;/p&gt;
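&lt;p&gt;A naive sketch of that separation - the regular expressions here are illustrative, not how any production interpreter actually parses commands:&lt;/p&gt;

```python
import re

# Filler words stripped from the reason, per the description above.
FILLER_WORDS = {"for", "to", "at"}

def parse_set_alarm(utterance):
    # The command: a "set/create alarm" phrasing at the start of the utterance.
    command = re.match(r"(set|create) (an? )?alarm\s*", utterance, re.IGNORECASE)
    if not command:
        return None
    time_slot, reason_words = None, []
    for word in utterance[command.end():].split():
        # The alarm time: text that parses to a time (naively, "9PM"-style tokens).
        if time_slot is None and re.fullmatch(r"\d{1,2}(:\d{2})?(am|pm)", word, re.IGNORECASE):
            time_slot = word
        # Any remaining text, minus filler words, is the reason.
        elif word.lower() not in FILLER_WORDS:
            reason_words.append(word)
    return {"command": "set_alarm", "time": time_slot, "reason": " ".join(reason_words) or None}

print(parse_set_alarm("Set an alarm for 9PM to write Chapter 1"))
```

&lt;p&gt;For “Set an alarm” alone, the parser returns an empty time slot, which is the interpreter’s cue to ask the follow-up question from Figure 3.&lt;/p&gt;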

&lt;p&gt;If you want to learn more, you can check out the book on Manning’s browser-based liveBook platform &lt;a href="https://livebook.manning.com/book/creating-virtual-assistants?origin=product-look-inside&amp;amp;utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20&amp;amp;utm_content=devto"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chatbot</category>
      <category>ai</category>
      <category>entity</category>
    </item>
    <item>
      <title>Comparing Voice and Text Chat Experiences</title>
      <dc:creator>Andrew R. Freed</dc:creator>
      <pubDate>Wed, 10 Mar 2021 14:41:59 +0000</pubDate>
      <link>https://dev.to/andrewrfreed/comparing-voice-and-text-chat-experiences-3453</link>
      <guid>https://dev.to/andrewrfreed/comparing-voice-and-text-chat-experiences-3453</guid>
      <description>&lt;p&gt;&lt;em&gt;Excerpted from &lt;a href="https://www.manning.com/books/creating-virtual-assistants?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20"&gt;Creating Virtual Assistants&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The choice of a channel has far-reaching implications for how your virtual assistant works.  Each channel has pros and cons, and you will likely need to customize your virtual assistant to exploit the benefits of a channel while avoiding the pitfalls. The specific channel you already use—voice or web—can also influence what you go after first. Additionally, some business processes may adapt more readily to either voice or web. For example, you can give driving directions with a map in a web channel – how would you give directions over voice?&lt;/p&gt;

&lt;p&gt;Today's consumers are used to getting the information they want, when they want it, in the mode that they most prefer, as shown in figure 1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u5WzsmpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gsh56zgzz34oa4mdvzbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u5WzsmpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gsh56zgzz34oa4mdvzbj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 1&lt;/strong&gt; Consumers are used to picking the channel that works best for them&lt;/p&gt;

&lt;p&gt;Which channel has the closest affinity to most of your users? If your users are on the phone all day then they will be more likely to prefer a voice channel. If your users primarily interact with you through your website, app, or email they will be more likely to prefer a web chat option. Phones are ubiquitous and nearly everyone knows how to make phone calls—whether or not they like using the phone is a different story! Similarly, instructing users to "go to our website" may be a welcome relief or a daunting challenge depending on your user base. Of course—there is nothing to stop you from supporting both channels—just time and money!&lt;/p&gt;

&lt;p&gt;With all else being equal, it is easier to start with web chat rather than voice. Voice introduces two additional AI services (speech-to-text and text-to-speech conversion), which require additional development and training work—though training a speech engine is continuing to get easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt; Comparison between web and voice channels&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Web Benefits&lt;/th&gt;
&lt;th&gt;Voice Benefits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technical implementation has fewer moving parts&lt;/td&gt;
&lt;td&gt;Almost everyone has a phone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do not have to train speech recognition&lt;/td&gt;
&lt;td&gt;Friendlier to non-tech-savvy users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Easily deployed on websites and mobile apps&lt;/td&gt;
&lt;td&gt;Easy to transfer users in and out of virtual assistant using public telephony switching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;How users receive information in voice and web&lt;/h2&gt;

&lt;p&gt;The first difference between voice and web is the way the user receives information from the solution. In a web solution, the user will have the full chat transcript on their screen—a complete record of everything they said and what you said. While many users don't like scrolling, they have the option to scroll back to view the entire conversation at any time and re-read any part they would like. They can print their screen or copy/paste parts or the entire transcript at any time. A lengthy response can be skimmed or read in full at the user's discretion. The user may multi-task while chatting with little impact on the broader solution.  A web assistant can return rich responses including not just text but images, buttons, and more.  Figure 2 shows some of the differences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IlZlu4Z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46cvclchkwip8720rrfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IlZlu4Z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46cvclchkwip8720rrfm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 2&lt;/strong&gt; The web channel allows rich responses like images. Voice solutions should be prepared to repeat important information. Map photo by &lt;a href="https://unsplash.com/@waldemarbrandt67w?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Waldemar Brandt&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/city-map?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Conversely, in a voice channel, the user's interaction is only what they hear. If the user misses a key piece of information, they do not have a way of getting it again, unless you code a "repeat" question or functionality into your assistant. Figure 2 shows how web and voice channels can best handle a question like “where’s the nearest store?”&lt;/p&gt;
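&lt;p&gt;If you do add a "repeat" capability, the mechanics are simple: remember the last response and replay it on request. Here is a minimal sketch; the dialog loop, intents, and answers are hypothetical stand-ins, not any specific platform's API.&lt;/p&gt;

```python
# Minimal sketch of a "repeat" capability for a voice assistant.
# The intent names and canned answers below are illustrative only.

def respond(state, intent):
    """Return a response, replaying the last answer on a 'repeat' intent."""
    if intent == "repeat" and state.get("last_response"):
        return state["last_response"]
    # Hypothetical intent-to-answer lookup for everything else.
    answers = {
        "nearest_store": "Our nearest store is at 100 Main Street.",
        "hours": "We are open 9 AM to 5 PM, Monday through Friday.",
    }
    response = answers.get(intent, "Sorry, I didn't understand that.")
    state["last_response"] = response  # remember it so "repeat" works
    return response

state = {}
first = respond(state, "nearest_store")
again = respond(state, "repeat")  # replays the store address
```

&lt;p&gt;Storing the last response per conversation is cheap, and it saves voice users from a dead end when they miss something.&lt;/p&gt;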

&lt;p&gt;Further, a long verbal readout can be very frustrating for a user: they may need a pencil and paper to take notes, they may have to wait a long time to get what they want, and they probably have to be very quiet for fear of confusing the speech engine (I practically hold my breath when talking with some automated voice systems). Also, directly sending rich media responses like images is impossible over voice, though you may be able to leverage side channels like SMS or email to send information for later review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nV0gMvWr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zizojbvs17vtzmy74qp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nV0gMvWr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zizojbvs17vtzmy74qp5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 3&lt;/strong&gt; Different channels have different user experiences&lt;/p&gt;

&lt;p&gt;You must be aware of the cost to the user when you have a long message.  As shown in Figure 3, a web user can skim long messages, but a voice user cannot. Beware the temptation to cram every last piece of information into a message, especially in voice. The average adult reading speed is around 200 words per minute and speaking speed is around 160 words per minute, though automated speech systems can be tuned to speak more quickly.&lt;/p&gt;

&lt;p&gt;Consider a hypothetical greeting:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Thank you for calling the So-and-So automated voice hotline.  We appreciate your call and look forward to serving you.  This call may be recorded for quality assurance and training purposes. If you know the extension of the party you are calling you can dial it at any time. Please listen carefully as our menu options have recently changed.  For appointments press 1."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I timed myself reading this 62-word message. It takes 20 seconds of audio to get to the first useful piece of information! (Hopefully you wanted appointments!)  Perhaps the lawyers insisted—but look at how much "junk" is in that message from the user's point of view. Figure 4 breaks down the greeting. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q8xpZhvA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rnif8xk4am71ywdlwfyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q8xpZhvA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rnif8xk4am71ywdlwfyi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 4&lt;/strong&gt; User's thought progression through a long greeting&lt;/p&gt;

&lt;p&gt;Contrast that with the following greeting:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Thanks for calling So-and-So. Calls are recorded. How can I help you?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This new message reaches its first useful information in about 4 seconds while still covering the basics of greeting, notification, and intent gathering. You only get one chance to make a great first impression—don't waste your users' time on your greeting!&lt;/p&gt;
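&lt;p&gt;You can estimate "time to value" with simple arithmetic from the roughly 160 words-per-minute speaking rate mentioned above. A rough sketch (word counts are for the two example greetings in this article; real text-to-speech rates vary and can be tuned):&lt;/p&gt;

```python
# Back-of-the-envelope readout timing at ~160 words per minute.

def readout_seconds(word_count, words_per_minute=160):
    """Estimate how long a message takes to speak aloud, in seconds."""
    return word_count / words_per_minute * 60

long_greeting = readout_seconds(62)   # the 62-word verbose greeting: ~23 seconds
short_greeting = readout_seconds(12)  # the 12-word concise greeting: ~4.5 seconds
```

&lt;p&gt;Counting the seconds before the first useful word is a quick, objective way to compare candidate greetings.&lt;/p&gt;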

&lt;p&gt;Take to heart the following quote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I have only made this letter longer because I have not had the time to make it shorter."&lt;br&gt;
Blaise Pascal, The Provincial Letters (Letter 16, 1657)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It takes work to be concise, but your users will appreciate you for it!&lt;/p&gt;

&lt;h2&gt;How the assistant receives information in voice and web&lt;/h2&gt;

&lt;p&gt;Another key difference between voice and web is how you receive input from the user. In a web channel, you can be sure of receiving exactly what was on the user's screen. You may provide a pop-up form to collect one or more pieces of information at once (first and last name, full address). The user may have clicked a button, and you will know exactly what they clicked. The user may have misspelled one or more words, but virtual assistants are increasingly resilient to misspellings and typos, as demonstrated in Figure 5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tPFF4PMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jfzuazwt93uuujofdq14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tPFF4PMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jfzuazwt93uuujofdq14.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 5&lt;/strong&gt; Most major virtual assistant platforms are resilient against misspellings&lt;/p&gt;

&lt;p&gt;In a voice channel you will receive a textual transcription of what the speech engine interpreted. Anyone who has used voice dictation has seen words get missed. The assistant can be adaptive to some mis-transcriptions (just like it can be adaptive to misspellings in chat) when the words are not contextually important. Figure 6 shows a voice assistant adapting to a pair of mis-transcriptions: “wear” for “where” and “a” for “the”.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LU0gJYfF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mulzu4kpzzo1jym1tf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LU0gJYfF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mulzu4kpzzo1jym1tf0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 6&lt;/strong&gt; Voice assistants can adapt to mis-transcriptions in general utterances as long as the key contextual phrases are preserved, like "nearest store"&lt;/p&gt;

&lt;p&gt;Aside from simple mis-transcriptions, another class of inputs gives speech engines trouble—any input that is hard for humans will be hard for speech engines as well.&lt;/p&gt;

&lt;p&gt;Proper names and addresses are both notoriously difficult for speech engines in both recognition (speech to text) and synthesis (text to speech). When I'm talking to a person on the phone and they ask for my last name, I say "Freed. F-R-E-E-D" since many people hear "Free" or try the old German spelling "Fried". Common names are not as common as you might think: you can probably rattle off a couple of "uncommon" names from your own personal network without much effort. Speech engines work best with a constrained vocabulary, even if that vocabulary is "the English language", and most names are considered out-of-vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sidebar: What’s a vocabulary?&lt;br&gt;
Speech to text providers refer to a “vocabulary”: simply the list of words a speech model is trained to recognize. A generalized English model may have a vocabulary that includes the most common words in the English language.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your virtual assistant will probably need to deal with uncommon words and jargon. The name of your company, or the products your company offers, may not be included in that vocabulary of words. If so, you will need to train a speech model to recognize them. &lt;/p&gt;
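&lt;p&gt;Before training, it helps to identify which of your domain terms are out-of-vocabulary. This sketch checks a term list against a base vocabulary; the vocabulary and product names here are tiny made-up stand-ins, and a real speech service's vocabulary would be far larger and queried through that service's own interface.&lt;/p&gt;

```python
# Find domain terms a generic speech model likely won't recognize.
# base_vocabulary and domain_terms are illustrative stand-ins only.

base_vocabulary = {"where", "is", "the", "nearest", "store", "my", "order"}

domain_terms = ["store", "FictiPay", "FictiCard", "order"]

# Terms missing from the base vocabulary are candidates for custom training.
oov_terms = [t for t in domain_terms if t.lower() not in base_vocabulary]
```

&lt;p&gt;Running a check like this over your product catalog and common utterances gives you the custom word list to feed into speech training.&lt;/p&gt;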

&lt;p&gt;Addresses are harder than names. I found a random street name on a map: "Westmoreland Drive." If you heard it, would you transcribe "Westmoreland Drive" or "W. Moreland Drive" or "West Moorland Drive"? Figure 7 shows the challenge of mapping similar phonetics to words that sound alike.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJyYer0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iznf4k1qxu62wovnnr1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJyYer0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iznf4k1qxu62wovnnr1c.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 7&lt;/strong&gt; Transcription challenges on unusual terms&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sidebar: On spelling out words and the difficulty of names and addresses&lt;br&gt;
Spelling out a difficult name can sometimes be helpful for humans, but it does not help machines much. Letters are easily confused with each other: B/C/D/E/G/P/T all sound similar without context. Humans may require several repetitions to correctly interpret a proper name, even spelled out.  There is rich literature on the difficulty of names and addresses. One such article is "The Difficulties with Names: Overcoming Barriers to Personal Voice Services" by Dr. Murray Spiegel (2003) &lt;a href="http://web.media.mit.edu/%7Egeek/TheDifficultiesWithNames.htm"&gt;http://web.media.mit.edu/~geek/TheDifficultiesWithNames.htm&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The difficulty in receiving certain inputs from users affects the way you build a dialog structure, perhaps most significantly in authenticating users. In a web chat you can collect any information you need verbatim from the user. You may in fact authenticate on your regular website and pass an authenticated session to the web chat. In a voice channel, you need to be more restrictive in what you accept. During authentication, a single transcription error in the utterance will fail validation, just as if the user had mistyped their password, as shown in Figure 8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r6Xij9zE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybcqzlb79vhko2kvxf7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r6Xij9zE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybcqzlb79vhko2kvxf7j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 8&lt;/strong&gt; For most data inputs, a single transcription mistake prevents the conversation from proceeding.  Voice systems need to take this into account by using re-prompts or alternate ways to receive difficult inputs.&lt;/p&gt;

&lt;p&gt;In the best cases, speech technology has a 5% error rate, and in challenging cases like names and addresses the error rate can be much higher (some voice projects report a 60-70% error rate on addresses). With alphanumeric identifiers, the entire sequence needs to be transcribed correctly, as shown in Figure 8. A 5% error rate may apply to each character, so errors compound: at 95% per-character accuracy, a six-character sequence transcribes fully correctly only about 74% of the time, and a twelve-character sequence only about 54%. For this reason, a six-digit ID is much more likely to transcribe accurately than a twelve-digit ID.&lt;/p&gt;
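&lt;p&gt;The compounding effect is easy to compute: if each character transcribes independently at 95% accuracy, whole-sequence accuracy is 0.95 raised to the sequence length. A quick sketch under that independence assumption:&lt;/p&gt;

```python
# Whole-sequence transcription accuracy under a per-character error rate,
# assuming errors are independent (a simplification; real errors cluster).

def sequence_accuracy(length, char_accuracy=0.95):
    """Probability that every character in the sequence transcribes correctly."""
    return char_accuracy ** length

six_digit = sequence_accuracy(6)      # ~0.74
twelve_digit = sequence_accuracy(12)  # ~0.54
```

&lt;p&gt;This is why shortening an identifier, or breaking it into confirmable chunks, pays off so quickly in voice channels.&lt;/p&gt;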

&lt;p&gt;For accurate transcriptions, constrained inputs like numeric sequences and dates work best. If you encounter a transcription error, you can always prompt the user to provide the information again. Keep in mind that you will want to limit the number of re-prompts. You may implement a “three strikes rule” – if three consecutive transcriptions fail then you direct the user to alternate forms of help that will serve them better.&lt;/p&gt;
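&lt;p&gt;A "three strikes" loop is straightforward to sketch. The transcription and validation functions below are stand-ins for whatever your voice platform actually provides:&lt;/p&gt;

```python
# Sketch of a "three strikes rule" re-prompt loop.

def collect_input(transcribe, is_valid, max_attempts=3):
    """Re-prompt up to max_attempts times; return None to signal a hand-off."""
    for _ in range(max_attempts):
        value = transcribe()  # stand-in for a platform speech-to-text call
        if is_valid(value):
            return value
    return None  # three failures: route the caller to alternate help

# Example: expect a six-digit account number; the first transcription is bad.
attempts = iter(["12E456", "123456"])
result = collect_input(lambda: next(attempts),
                       lambda v: v.isdigit() and len(v) == 6)
```

&lt;p&gt;Returning a sentinel like &lt;code&gt;None&lt;/code&gt; keeps the hand-off decision in one place instead of scattering it through the dialog.&lt;/p&gt;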

&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt; Summary of data types by how well speech engines can transcribe them&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data types that transcribe well&lt;/th&gt;
&lt;th&gt;Data types that do not transcribe well&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Numeric identifiers (e.g. Social Security number)&lt;/td&gt;
&lt;td&gt;Proper names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dates&lt;/td&gt;
&lt;td&gt;Addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short alphanumeric sequences (e.g. “ABC123”)&lt;/td&gt;
&lt;td&gt;Long alphanumeric sequences (e.g. "ABCDEFGHI123456")&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Voice authentication can make use of an alternate channel such as SMS. You can text a one-time code to the number on file for the user and authenticate with that code instead of collecting difficult information over voice. If you absolutely must authenticate over the speech channel with an input that is hard for speech engines, be prepared to work hard in both speech training and orchestration-layer post-processing logic, and have a good call hand-off strategy for when recognition fails.&lt;/p&gt;
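&lt;p&gt;The one-time-code flow itself needs very little code. Here is a sketch using only the Python standard library; the SMS send step is deliberately omitted since it depends on your gateway. Note that a short numeric code is exactly the kind of constrained input that transcribes well over voice.&lt;/p&gt;

```python
# One-time-code generation and verification for SMS-assisted voice auth.

import hmac
import secrets

def generate_code(digits=6):
    """Generate a numeric one-time code, zero-padded to a fixed length."""
    return f"{secrets.randbelow(10 ** digits):0{digits}d}"

def verify_code(expected, supplied):
    """Constant-time comparison to avoid leaking information via timing."""
    return hmac.compare_digest(expected, supplied)

code = generate_code()  # send this to the number on file via your SMS gateway
# Later, compare what the caller reads back against the code you sent.
```

&lt;p&gt;In production you would also expire the code after a few minutes and limit verification attempts, for the same "three strikes" reasons discussed above.&lt;/p&gt;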

&lt;p&gt;That’s all for this article. If you want to learn more about the book, check it out on Manning’s browser-based liveBook reader &lt;a href="https://livebook.manning.com/book/creating-virtual-assistants?origin=product-look-inside&amp;amp;utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Take 35% off &lt;a href="https://www.manning.com/books/creating-virtual-assistants?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20"&gt;Creating Virtual Assistants&lt;/a&gt; by entering &lt;strong&gt;devtofreed&lt;/strong&gt; into the discount code box at checkout at &lt;a href="https://www.manning.com/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=book_freed_creating_12_27_20"&gt;manning.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chatbot</category>
      <category>ai</category>
      <category>voice</category>
      <category>ux</category>
    </item>
  </channel>
</rss>
