Speech-to-text and Text-to-speech with Android

Rishabh Tatiraju

Have you ever wondered how Google's speech search works, or thought of building an ebook narration app? At first glance it might seem like a complex piece of technology. While it is complicated to implement on your own, Android (via Google Services) thankfully has built-in speech-to-text and text-to-speech APIs that make it extremely easy to set up these features.

How does this work?

For speech-to-text, Android provides an Intent-based API that launches Google's Speech Recognition service and returns the text result to you. There is a catch though: the device needs the Google Search app for the service to work.

The Text-to-speech API, unlike Speech Recognition, is available without Google Services, and can be found in the android.speech.tts package.

Source code

You can find the source of this tutorial on GitHub.

Let's develop!

Fire up Android Studio and create a project with a Blank Activity.

User interface

The user interface is going to be simple: a LinearLayout as the root view group, inside which there will be a Button that launches the Speech Recognition API, an EditText that shows the Speech Recognition output and also serves as input to the Text-to-speech functionality, and another Button to trigger Text-to-speech output.

The resultant XML file is as follows:

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:gravity="center"
    android:orientation="vertical"
    android:padding="24dp"
    tools:context=".MainActivity">

    <Button
        android:id="@+id/btn_stt"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Speak" />

    <EditText
        android:id="@+id/et_text_input"
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:layout_marginTop="24dp"
        android:layout_marginBottom="24dp"
        android:layout_weight="1"
        android:gravity="center"
        android:hint="Text from STT or for TTS goes here." />

    <Button
        android:id="@+id/btn_tts"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="Listen" />

</LinearLayout>

Setting up speech recognition

The Speech Recognition API comes bundled with the Google Search app, and can be launched using an Intent. The result of this Intent holds the recognized text, which can be extracted from the result intent in onActivityResult.

All the code beyond here is in Kotlin.

Firstly, let's define our request code constant.

    companion object {
        private const val REQUEST_CODE_STT = 1
    }

Then, we'll attach an OnClickListener to our button, in which we will construct and launch the Speech Recognition Intent.

    btn_stt.setOnClickListener {
        // Get the Intent action
        val sttIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
        // Language model defines the purpose, there are special models for other use cases, like search.
        sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
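        // EXTRA_LANGUAGE expects an IETF BCP 47 language tag string such as "en-US", not a Locale object.
        // Locale.toLanguageTag() needs API 21+; on older devices pass a literal tag instead.
        sttIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().toLanguageTag())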
        // Text that shows up on the Speech input prompt.
        sttIntent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak now!")
        try {
            // Start the intent for a result, and pass in our request code.
            startActivityForResult(sttIntent, REQUEST_CODE_STT)
        } catch (e: ActivityNotFoundException) {
            // Handling error when the service is not available.
            e.printStackTrace()
            Toast.makeText(this, "Your device does not support STT.", Toast.LENGTH_LONG).show()
        }
    }
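Besides catching ActivityNotFoundException, it is also possible to check up front whether anything on the device can handle the Intent, for example to disable the button. The sketch below is a minimal illustration; the helper name isSpeechRecognitionAvailable is hypothetical and not part of the tutorial's source.

```kotlin
import android.content.Context
import android.content.Intent
import android.speech.RecognizerIntent

// Hypothetical helper (an assumption, not from the tutorial's source):
// returns true when some installed activity can handle the
// speech recognition Intent used above.
fun isSpeechRecognitionAvailable(context: Context): Boolean {
    val sttIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
    // resolveActivity returns null when no matching activity is visible to the app.
    return sttIntent.resolveActivity(context.packageManager) != null
}
```

On Android 11 and later, package visibility rules may make this check return false unless the manifest declares a matching queries entry, so it is probably best treated as a hint, with the try/catch kept in place.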

The above code will launch the Speech Recognition API. But how do we get the result? We'll override the activity's onActivityResult and get the recognized text.

    override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
        super.onActivityResult(requestCode, resultCode, data)
        when (requestCode) {
            // Handle the result for our request code.
            REQUEST_CODE_STT -> {
                // Safety checks to ensure data is available.
                if (resultCode == Activity.RESULT_OK && data != null) {
                    // Retrieve the result array.
                    val result = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
                    // Ensure result array is not null or empty to avoid errors.
                    if (!result.isNullOrEmpty()) {
                        // Recognized text is in the first position.
                        val recognizedText = result[0]
                        // Do what you want with the recognized text.
                        et_text_input.setText(recognizedText)
                    }
                }
            }
        }
    }
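The recognizer may return several candidate transcriptions, and the result Intent can optionally carry a parallel float array under RecognizerIntent.EXTRA_CONFIDENCE_SCORES. As a rough sketch, a helper like the hypothetical pickBestResult below (not part of the tutorial's source) could choose the highest-scoring candidate and fall back to the first entry when scores are missing.

```kotlin
// Hypothetical helper (an assumption, not from the tutorial's source).
// results: the list read from RecognizerIntent.EXTRA_RESULTS
// scores:  the optional array read from RecognizerIntent.EXTRA_CONFIDENCE_SCORES
fun pickBestResult(results: List<String>?, scores: FloatArray?): String? {
    // Nothing recognized at all.
    if (results.isNullOrEmpty()) return null
    // Without usable scores, fall back to the first entry, as the tutorial does.
    if (scores == null || scores.size != results.size) return results[0]
    // Pick the index with the highest confidence; ties resolve to the earliest entry.
    val bestIndex = scores.indices.maxByOrNull { scores[it] } ?: 0
    return results[bestIndex]
}
```

Inside onActivityResult, the scores could be read with data.getFloatArrayExtra(RecognizerIntent.EXTRA_CONFIDENCE_SCORES). The results are usually already ordered by confidence, so in practice this often returns the same text as result[0].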

At this point, if you run your code, you will be able to use Speech Recognition.

Setting up Text-to-speech

Unlike the Speech Recognition API, Text-to-speech has its own class and doesn't run on Intents. We'll start off by creating a TextToSpeech object. The TextToSpeech class constructor expects a Context and an OnInitListener.

    private val textToSpeechEngine: TextToSpeech by lazy {
        // Pass in context and the listener.
        TextToSpeech(this,
            TextToSpeech.OnInitListener { status ->
                // Set our locale only if initialization succeeded.
                if (status == TextToSpeech.SUCCESS) {
                    textToSpeechEngine.language = Locale.UK
                }
            })
    }
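Setting the language can itself fail, for instance when the voice data for the requested locale is not installed. TextToSpeech.setLanguage returns a status code that can be checked. The extension function below is a hypothetical sketch of such a check, not part of the tutorial's source.

```kotlin
import android.speech.tts.TextToSpeech
import java.util.Locale

// Hypothetical extension (an assumption, not from the tutorial's source):
// tries to set the locale and reports whether the engine accepted it.
fun TextToSpeech.trySetLanguage(locale: Locale): Boolean {
    val result = setLanguage(locale)
    // LANG_MISSING_DATA and LANG_NOT_SUPPORTED are the two failure codes.
    return result != TextToSpeech.LANG_MISSING_DATA &&
        result != TextToSpeech.LANG_NOT_SUPPORTED
}
```

Inside the OnInitListener this could replace the direct assignment, for example by calling textToSpeechEngine.trySetLanguage(Locale.UK) and showing a Toast when it returns false.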

Then, we'll set an OnClickListener to our TTS button and call the text-to-speech API on our input text.

btn_tts.setOnClickListener {
    // Get the text to be converted to speech from our EditText.
    val text = et_text_input.text.toString().trim()
    // Speak only if the user has entered some text.
    if (text.isNotEmpty()) {
        // Lollipop and above requires an additional ID to be passed.
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP) {
            // Call Lollipop+ function
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null, "tts1")
        } else {
            // Call Legacy function
            textToSpeechEngine.speak(text, TextToSpeech.QUEUE_FLUSH, null)
        }
    } else {
        Toast.makeText(this, "Text cannot be empty", Toast.LENGTH_LONG).show()
    }
}
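The extra ID passed to speak on Lollipop and above ("tts1" here) identifies the utterance, and it is handed back in progress callbacks. The following sketch shows one way to observe those callbacks with UtteranceProgressListener; the function name attachProgressLogging and the log tag are illustrative assumptions.

```kotlin
import android.speech.tts.TextToSpeech
import android.speech.tts.UtteranceProgressListener
import android.util.Log

// Hypothetical log tag, used only for this illustration.
private const val TTS_LOG_TAG = "TtsProgress"

// Hypothetical helper (an assumption, not from the tutorial's source):
// logs when an utterance starts, finishes, or fails.
fun attachProgressLogging(engine: TextToSpeech) {
    engine.setOnUtteranceProgressListener(object : UtteranceProgressListener() {
        override fun onStart(utteranceId: String?) {
            Log.d(TTS_LOG_TAG, "Started speaking utterance $utteranceId")
        }

        override fun onDone(utteranceId: String?) {
            Log.d(TTS_LOG_TAG, "Finished speaking utterance $utteranceId")
        }

        @Deprecated("Deprecated in the Android SDK, but still abstract, so it must be overridden")
        override fun onError(utteranceId: String?) {
            Log.e(TTS_LOG_TAG, "Error while speaking utterance $utteranceId")
        }
    })
}
```

These callbacks are not guaranteed to arrive on the main thread, so any UI update made from them would need to be posted back to it, for example with runOnUiThread.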

As a safety measure and to prevent memory leaks, we must override the onPause and onDestroy methods and appropriately stop or shut down the TextToSpeech object.

override fun onPause() {
    textToSpeechEngine.stop()
    super.onPause()
}

override fun onDestroy() {
    textToSpeechEngine.shutdown()
    super.onDestroy()
}

And that's it. Give it a try!

Closing Thoughts

With the standard APIs, Speech Recognition (or Speech-to-text) and Text-to-speech in Android are extremely easy to implement. While this might suffice for most use cases, some advanced use cases would require more sophisticated third-party APIs or a custom implementation in your backend. We'll probably cover that sometime later.

Until then, keep coding, and as always do let me know if you have any questions in the comments section!

Posted on May 5 by:

Rishabh Tatiraju

@rtficial

Native Android and Python Developer, Tech Writer and an ML and CV enthusiast.
