DEV Community

Discussion on: Why you have not added voice functionalities to your app?

Ottomatias Peura

Great comment, thanks Joel!

To recap your comment, I'd say that you gave four different reasons:
1) privacy issues on the end user's side
2) implementation costs
3) speech recognition accuracy
4) the benefits of a voice user interface are not big enough

So, starting with privacy. It's clearly an issue and something that app developers need to think about. And when it comes to machine learning, it's not a trivial issue, as developing the models requires at least some way for a human to a) validate the results and b) correct them if needed, so that the model can be improved.

Our approach to privacy is based on being as open as possible about how the data is used and who can access it, and on also supporting a "private mode" where voice data is never heard by anyone. But as I said, it is a valid question and something that app developers should really think about. Not only with voice, of course.

Then about implementation costs. If we skip the problems of ASR accuracy and whatnot, implementing voice user interfaces is actually pretty simple with modern tools (such as Speechly ;) ). We have a good client for React, for example, and the extra work is pretty much:
a) configuring the model by providing annotated sample utterances such as

*turn_off Turn off the [radio](device)
*set_brightness Set brightness of [living room](room) lights to [1..100](brightness)

b) streaming audio from the application to our servers to receive the actual transcript, but also the intent and entities. So if the user said something like "Set brightness in living room to 25", the API would return the intent `set_brightness` and the entities `living room` and `25`. By providing a few more examples, our system should be able to generalize these to also support other similar ways of expressing the same things. So, for example, if the user said "I want the kitchen brightness to be 56", it would still work. So implementation costs do not need to be that huge!
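To make the annotation format in step a) concrete, here is a minimal Python sketch that reads one annotated sample utterance and pulls out the intent and the entities. It assumes only the syntax visible in the two examples (`*intent` followed by text with `[value](entity)` markers). The function name and the shape of the returned dictionary are hypothetical, chosen for illustration; this is not Speechly's parser or its API response format.

```python
import re

# Matches one "[value](entity_name)" annotation, as in the sample utterances.
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated_utterance(line: str) -> dict:
    """Split '*intent some text [value](entity)' into intent, text, and entities.

    Hypothetical helper for illustration only; not part of any Speechly library.
    """
    line = line.strip()
    if not line.startswith("*"):
        raise ValueError("expected the line to start with '*intent'")
    # Everything between '*' and the first space is the intent name.
    intent, _, body = line[1:].partition(" ")
    entities = [
        {"type": name, "value": value}
        for value, name in ENTITY_PATTERN.findall(body)
    ]
    # Replace each annotation with its bare value to recover the plain text.
    text = ENTITY_PATTERN.sub(r"\1", body)
    return {"intent": intent, "text": text, "entities": entities}


example = parse_annotated_utterance("*turn_off Turn off the [radio](device)")
```

Here `example["intent"]` is `"turn_off"` and `example["entities"]` holds a single `device` entity with the value `radio`, which is roughly the kind of intent-plus-entities result described in step b).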

When it comes to speech recognition accuracy, it's a hard problem. There is a range of accents and different voices and, just like us humans, sometimes the system doesn't hear it quite right. We are solving this issue at Speechly (and similar approaches are used by others, too) a bit like humans do: if you know the context and hear at least some of the words right, you can probably guess the rest.

This requires some natural language understanding on top of the actual speech recognition part. Let's say the baseline ASR (automatic speech recognition) hears something like "Turd ofter flights". If you've provided "Turn off the lights" as one of the example utterances, it's pretty easy to guess what was really said.
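The "guess from the examples" idea can be illustrated with plain string similarity. The sketch below uses `difflib.get_close_matches` from Python's standard library to pick the configured example utterance closest to a noisy transcript. This is a stand-in technique for illustration, not how Speechly's models work, and the example list, function name, and the 0.6 cutoff are assumptions.

```python
import difflib

# Hypothetical example utterances, standing in for a configured model.
EXAMPLE_UTTERANCES = [
    "turn off the lights",
    "set brightness to fifty",
    "play some music",
]


def guess_utterance(asr_text, examples, cutoff=0.6):
    """Return the example most similar to the noisy ASR text, or None.

    Similarity is character-based (difflib's SequenceMatcher ratio), so this
    only catches mishearings that still look like the intended sentence.
    """
    matches = difflib.get_close_matches(
        asr_text.lower(), examples, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


guess = guess_utterance("Turd ofter flights", EXAMPLE_UTTERANCES)
```

With this example list, `guess` comes out as `"turn off the lights"`. Character similarity alone struggles to separate near-identical commands such as "turn on the lights" and "turn off the lights", which is one reason context matters as much as the words that were heard.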

Of course, not all the results are correct, but I'd say that with most current speech recognition systems you can achieve a level of confidence that makes building real-life applications feasible.

And then the benefits: I don't think voice will ever be a replacement for touch and vision. Speech is seldom the fastest means of transmitting information, because we humans are just not very good at expressing complex ideas in short sentences. If you want to create a spreadsheet, for example, a keyboard and mouse are probably the best UI for most tasks.

But let's say you want to send the finished spreadsheet to your boss. It's pretty easy to express that intent by voice. With a keyboard and mouse, on the other hand, you'll need to switch between apps, copy-paste links and whatnot. That's a task that's a lot easier to do with voice.

The same goes for almost every application. Every application has subtasks and use cases where using voice would make a lot of sense, but you should not replace the current UI with a voice UI. Rather, add voice functionalities to improve the current UI and make use of voice wherever it's best suited.

There are also applications that really benefit from a voice UI. For example, we built a grocery shopping application with a voice UI. It's a lot faster to say something like "2 liters of milk, one bag of crisps, six-pack of Heineken and a loaf of bread" than to do 4 different searches and click ADD next to the correct products.

Sorry for the humongous answer!

JoelBonetR 🥇 • Edited

Hahaha no problem, that's fine, I'll probably write another bible here.

I'm looking at it from my business's point of view rather than in a generic way, to be as skeptical as a "client" could be, thinking of our customers.

The use case you provided, sending a spreadsheet to your boss, could be perfectly automated with a "send to" button where you just add an instant search linked to your contacts list, for example (every automation can be achieved in multiple ways).

That's why I said that I can see there's a specific market share for these features, but it's not valid for every app. I mean, Gboard, SwiftKey and other smartphone keyboard apps included this feature a long, long time ago, and I've never seen (heard) a single person using it to write WhatsApp messages or emails.
That's, I think, due to concerns about other people knowing what you are doing.
I mean, no one wants others to hear what they're saying to their family or friends or whatever. Also, you may not care about others knowing that you are currently buying something on the internet, but I think no one wants others to hear the entire list of what they are about to buy.

Of course, we are not talking about replacing one way of interacting with another; we are talking about combining both, and that's great if it matches your app and could benefit users. Adding it without a reason could add cognitive load for your potential customers and be counterproductive.

Now I'm talking as a customer/user:

I like the voice features of Google Assistant while driving (call someone, send [TEXT] to [CONTACT] using WhatsApp, or search for places on Google Maps), and I like voice features for turning the lights on/off.

I don't want or need voice control to buy something on Amazon, and I'm not going to use it. I can see that using voice control through Alexa could be fine, but I may want to read some reviews, compare prices and product properties, and later choose the delivery address and payment method visually.

As a developer again:
I think it's not about adding these functionalities to your app in general; it should be about adding these features for specific use cases.
I may want to add voice recognition for customers who ask a question through the contact chat, and answer them by voice too, for example. But I wouldn't add it for adding products to the cart: I want to push related products when the user clicks the "add to cart" button, and if I let users add products to the cart by voice, they don't need to pay attention to the screen, so the marketing push will be useless for those customers.