Dominika Zając (she/her) 🇵🇱

I love talking to you, webpage!

A short story about why and how I added voice navigation to my side project in less than half an hour.

Siri, Google Assistant, Cortana - it’s really hard to find a person who has never heard of these voice assistants. Talking to maps while driving, changing songs on speakers via a voice command, turning off the lights by speaking while lying in bed - all these activities are completely normal in 2021. But what if we could go a step further? Use voice navigation not only in specific apps but everywhere - surf the web with our voice? Some time ago I discovered an experimental technology called Web Speech API, thanks to which this may be possible in the future. Is it ready for production purposes now? Unfortunately, no. But do I believe it may be a game-changer for web development? Definitely! In this article, I will describe how - thanks to Web Speech API - I implemented simple voice navigation in my side project in less than half an hour. And why I keep my fingers crossed for that technology. Sounds interesting? Keep reading!

Everything started in the kitchen…

I have to admit the whole story started in the kitchen. I was preparing dinner and, between cutting carrots and frying meat, I realized that I was washing my hands once again just not to dirty the touch screen of the kitchen robot (where I check the next steps of the recipe). And how much easier it would be if I could just say “next step” or “start mixing”. Maybe I would even start to like cooking then? Later on, my thoughts went in a direction more connected with my professional life - how do speech recognition tools work nowadays? I remembered some really interesting exercises from my studies about building voice-based solutions, but all of them were either really simple or based on expensive databases - mostly trained only for narrow purposes. But that was some years ago - something had to change! Here my research (and timer) started. After a few minutes with Google, I found a technology called Web Speech API and decided to use it in my side project.

So, what exactly is the Web Speech API?

Web Speech API is an experimental technology that moves the responsibility for text-to-speech and speech recognition from web applications to the browser. Developers using that solution only need to provide the correct input and handle the output properly to incorporate voice-based features into their products. Why is it so awesome (at least for me)? You - as a developer - don’t have to collect and clean data, train your models, or buy expensive databases. Also, the model is trained for a given user not only on your page but on all the pages they visit, so it can learn faster and provide better accuracy. The API allows web pages to control activation and timing, and to handle results and alternatives - so you still have quite good control over your solution. You can read more about Web Speech API in the Draft Community Group Report or on MDN Web Docs.
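To get a feeling for what the browser actually exposes, here is a minimal sketch of the raw API (no frameworks), based on the interfaces described on MDN - note that Chrome still ships the recognition constructor behind a webkit prefix:

// Speech recognition: the browser listens and hands back a transcript
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.onresult = (event) => {
  const { transcript, confidence } = event.results[0][0];
  console.log(`Heard "${transcript}" (confidence: ${confidence})`);
};
recognition.start(); // triggers the microphone permission prompt

// Text-to-speech: the browser reads the given text out loud
window.speechSynthesis.speak(new SpeechSynthesisUtterance('Hello, webpage!'));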

Screenshot of page with Web Speech API documentation

OK, we have another tool. But…

Why should I even care about speech recognition on the web?

Pink question mark on dark background
Photo by Emily Morter on Unsplash

We are used to navigating computers via mouse and keyboard. But let’s be honest - it’s not the most optimal way. To use them efficiently we need both hands and we need to be close to the computer. Typing is also a special skill - we had to learn it, and it may still be difficult for older people or kids. Not to mention people with motor disabilities or other limitations (and it’s not only about permanent disabilities and illnesses like Parkinson’s - it may also affect you when you break your arm or simply hold a baby in your hands). Next, a small but still painful argument for me - have you ever had a problem with a discharged Magic Mouse? Yeah, I really hate it (who invented a mouse which cannot be used while charging?). All these limitations make me believe voice-based solutions may be a super interesting direction for web development in the future. What’s more, it’s also simply trendy! According to research, 27% of the global online population is using voice search on mobile. And this number is still growing. In addition, personally, I can’t wait for presentations and speeches without hearing “next slide, please” over and over.

Unfortunately, there are some disadvantages too (yet?)

Even if I believe Web Speech API is a great solution, there are many problems I have to mention here. First of all - security and privacy. How can we trust that browsers are listening only when we want them to? And that our voice is not overused by them? Is my voice recorded? Can malicious webpages steal my voice or trick me into thinking that recording has stopped while in reality it is still listening? Should we pronounce our passwords out loud? So many questions without answers… We have to be prepared for completely new challenges connected with security and hacker attacks. We have to remember it’s an experimental and new technology, so it will probably take some time before global standards and best practices are developed. What’s more, global standards are needed not only for development purposes but also for usability. We all know that a spinner means loading and a button with a cross icon closes a modal. We learned that 3 parallel horizontal lines mean a menu and that clicking on the bell will show some notifications. But which word should we use to open a modal? “Show” / “Display” / “Open”?

In my native language, I can find many more than 3 proposals… How do we handle internationalization and grammar differences between languages? What about offline mode (currently, Chrome uses server-side recognition, so a network connection is required)? How do we guide users on which actions are possible via voice on a page? Show them a tutorial on the first visit? A list of possible “next steps” while navigating the page? Or maybe documentation alone should be enough? Don’t forget about poor browser compatibility (currently only Chrome fully supports this API). The list of questions to ask is of course much, much longer - we really need time, and defined global standards and best practices, to address all of them.

Talk is cheap. Show me the code!

After all that introduction, time for the most interesting part - actual code and a demo! As I’m using React in my side project, I decided to use the react-speech-recognition npm package - a great wrapper over Web Speech API providing an easy-to-use hook and methods. It’s enough to call:
npm install --save react-speech-recognition

from your terminal to add the package to your project. Later on, you have to add the import:
import SpeechRecognition, { useSpeechRecognition } from 'react-speech-recognition'

and use the hook in your code (an example taken from the package’s official documentation):

import React from 'react';
import SpeechRecognition, { useSpeechRecognition } from 'react-speech-recognition';
const Dictaphone = () => {
  const {
    transcript,
    listening,
    resetTranscript,
    browserSupportsSpeechRecognition
  } = useSpeechRecognition();
  if (!browserSupportsSpeechRecognition) {
    return <span>Browser doesn't support speech recognition.</span>;
  }

  return (
    <div>
      <p>Microphone: {listening ? 'on' : 'off'}</p>
      <button onClick={SpeechRecognition.startListening}>Start</button>
      <button onClick={SpeechRecognition.stopListening}>Stop</button>
      <button onClick={resetTranscript}>Reset</button>
      <p>{transcript}</p>
    </div>
  );
};
export default Dictaphone;

How do you support your custom actions? You just need to provide a list of commands and corresponding callbacks - like in the example below:

const commands = [
  {
    command: ['cancel', 'close'],
    callback: () => cancelModal()
  },
  {
    command: ['reload', 'refresh'],
    callback: () => reload()
  },
  {
    // ":city" captures the word spoken after "go to" and passes it to the callback
    command: ['go to :city'],
    callback: (city) => setCity(city)
  },
  {
    command: 'clear',
    callback: ({ resetTranscript }) => resetTranscript()
  }
];
const { transcript, browserSupportsSpeechRecognition } = useSpeechRecognition({ commands });

Take a look at the third command - I used a variable city there; the word detected after the go to phrase will be sent to the callback as a parameter, so I can use it in my function. Commands also support multi-word matches, optional words, and a custom threshold for how similar the speech has to be to the command to fire the callback. You can read more about the options in the react-speech-recognition documentation.
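For example, if you want a command to fire even when the transcript isn’t an exact match, each command can opt into fuzzy matching. The options below follow the package documentation, but treat the exact names as something to verify against the version you install (openSettings and search are hypothetical helpers):

const commands = [
  {
    // fires for phrases similar enough to "open the settings",
    // e.g. "open settings" or a slightly misheard transcript
    command: 'open the settings',
    callback: () => openSettings(),
    isFuzzyMatch: true,
    fuzzyMatchingThreshold: 0.8 // 0-1, higher means a stricter match is required
  },
  {
    // "*" captures everything spoken after the phrase
    command: 'search for *',
    callback: (query) => search(query)
  }
];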

Of course, in reality, it’s a little more complicated. You should always remember to check if the user’s browser supports Web Speech API, to provide a way to start and stop listening, to handle network problems or a lack of permissions, to check translations (if your app supports them), etc.
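A minimal sketch of how those checks might look with the same package is shown below - isMicrophoneAvailable is exposed by newer versions of the hook, and the startListening options are taken from the package documentation, so double-check both against the version you use:

import React from 'react';
import SpeechRecognition, { useSpeechRecognition } from 'react-speech-recognition';

const VoiceNavigation = ({ commands }) => {
  const {
    transcript,
    listening,
    browserSupportsSpeechRecognition,
    isMicrophoneAvailable
  } = useSpeechRecognition({ commands });

  if (!browserSupportsSpeechRecognition) {
    // e.g. an older browser, or one without Web Speech API support
    return <p>Sorry, voice navigation is not available in your browser.</p>;
  }

  if (!isMicrophoneAvailable) {
    // the user blocked (or never granted) microphone permission
    return <p>Please allow microphone access to use voice navigation.</p>;
  }

  return (
    <div>
      <p>Microphone: {listening ? 'on' : 'off'}</p>
      {/* keep listening across pauses and set the recognition language */}
      <button onClick={() => SpeechRecognition.startListening({ continuous: true, language: 'en-US' })}>
        Start
      </button>
      <button onClick={SpeechRecognition.stopListening}>Stop</button>
      <p>{transcript}</p>
    </div>
  );
};

export default VoiceNavigation;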
Let’s see the final effect of my project. It’s a super simple web app displaying the current air quality index in a given city, with a message saying whether it is good, unhealthy, or dangerous. I love this example as it’s super easy to implement (one fetch to the WAQI API https://waqi.info/en/ and some simple components) but still very useful - especially in the winter, when the city where I live struggles a lot with smog. It’s also a good base for my private R&D - more complicated than the typical examples in articles/tutorials but still simple enough to extend the code easily. So, how does voice navigation work in my project? Take a look at the video below:

Quite nice, don’t you think? And everything was done in less than half an hour (including research). Maybe it’s not (yet!) ready for production purposes, but it’s really nice to play with in your free time! I can strongly recommend it!
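If you are curious how the go to :city command could plug into the air-quality fetch in an app like mine, here is a rough sketch. The WAQI endpoint shape and the token parameter are my assumptions based on the public WAQI docs (use your own API token), and the component only wires up the command - starting the listening is left out for brevity:

import React, { useState } from 'react';
import { useSpeechRecognition } from 'react-speech-recognition';

// Assumed endpoint shape - check the WAQI documentation before relying on it
const fetchAirQuality = async (city) => {
  const response = await fetch(`https://api.waqi.info/feed/${city}/?token=YOUR_TOKEN`);
  const json = await response.json();
  return json.data.aqi; // air quality index for the requested city
};

const AirQuality = () => {
  const [aqi, setAqi] = useState(null);

  const commands = [
    {
      // "go to Krakow" -> fetch and display the AQI for Krakow
      command: ['go to :city'],
      callback: (city) => fetchAirQuality(city).then(setAqi)
    }
  ];

  useSpeechRecognition({ commands });

  return <p>{aqi === null ? 'Say "go to <city name>"…' : `Current AQI: ${aqi}`}</p>;
};

export default AirQuality;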

Summary

In my personal opinion, voice-based solutions will play a very important role in the future of the web. And the Web Speech API can have a huge impact on their success. Even if the technology is not production-ready yet, it’s still an interesting area to research, play with, and test. Who knows? Maybe the future is closer than we think?
