How a Programmer Used 300 Lines of Code to Help His Grandma Shop Online with Voice Input

Vivi-clevercoder
・9 min read

"John, why is the writing pad missing again?"

John, a programmer at Huawei, has a grandma who loves novelty, and lately she's been obsessed with online shopping. Familiarizing herself with the major shopping apps and their functions proved to be a piece of cake, and she expected her online shopping experience to be effortless. Unfortunately, searching for products kept tripping her up.

John's grandma tended to use handwriting input. When using it, she would often make mistakes, like switching to another input method she found unfamiliar, or tapping on undesired characters or signs.
Shopping apps are not alone in this: most mobile apps feature interface designs oriented toward younger users, so it's no wonder that elderly users often struggle to figure out how to use them.
John patiently helped his grandma search for products with handwriting input several times. But then, he decided to use his skills as a veteran coder to give his grandma the best possible online shopping experience. More specifically, instead of helping her adjust to the available input method, he was determined to create an input method that would conform to her usage habits.
Since his grandma tended to err during manual input, John developed an input method that converts speech into text. Grandma was enthusiastic about the new method, because it is remarkably easy to use. All she has to do is to tap on the recording button and say the product's name. The input method then recognizes what she has said, and converts her speech into text.

Real-time speech recognition and speech to text are ideal for a broad range of apps, including:

  1. Game apps (online): Real-time speech recognition comes to users' aid when they team up with others. It frees up users' hands for controlling the action, sparing them from having to type to communicate with their partners. It can also free users from any potential embarrassment related to voice chatting during gaming.

  2. Work apps: Speech to text can play a vital role during long conferences, where typing to keep meeting minutes can be tedious and inefficient, with key details being missed. Using speech to text is much more efficient: during a conference, users can use this service to convert audio content into text; after the conference, they can simply retouch the text to make it more logical.

  3. Learning apps: Speech to text can offer users an enhanced learning experience. Without the service, users often have to pause audio materials to take notes, resulting in a fragmented learning process. With speech to text, users can concentrate on listening intently to the material while it is being played, and rely on the service to convert the audio content into text. They can then review the text after finishing the entire course, to ensure that they've mastered the content.

How to Implement

Two services in HUAWEI ML Kit, automatic speech recognition (ASR) and audio file transcription, make it easy to implement the above functions.

ASR can recognize speech of up to 60s, and convert the input speech into text in real time, with recognition accuracy of over 95%. It currently supports Mandarin Chinese (including Chinese-English bilingual speech), English, French, German, Spanish, Italian, and Arabic.
 Real-time result output
 Available options: with and without speech pickup UI
 Endpoint detection: Start and end points can be accurately located.
 Silence detection: No voice packet is sent for silent portions.
 Intelligent conversion to digital formats: For example, the year 2021 is recognized from voice input.
Audio file transcription can convert an audio file of up to five hours into text with punctuation, and automatically segment the text for greater clarity. In addition, this service can generate text with timestamps, facilitating further function development.

In this version, both Chinese and English are supported.
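
As one illustration of the "further function development" that timestamps make possible, the sketch below turns timestamped sentences into SRT subtitle text. It is plain Java, not ML Kit code: the `Sentence` class is a hypothetical stand-in for a transcription segment, and it assumes that start and end times are expressed in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SubtitleSketch {
    /** Hypothetical stand-in for one timestamped sentence of a transcription result. */
    static final class Sentence {
        final String text;
        final long startMs; // Assumed to be in milliseconds.
        final long endMs;   // Assumed to be in milliseconds.

        Sentence(String text, long startMs, long endMs) {
            this.text = text;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    /** Formats milliseconds as HH:MM:SS,mmm, the timestamp layout used by SRT subtitles. */
    static String formatTimestamp(long ms) {
        long hours = ms / 3_600_000;
        long minutes = (ms / 60_000) % 60;
        long seconds = (ms / 1_000) % 60;
        long millis = ms % 1_000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, seconds, millis);
    }

    /** Builds one numbered SRT cue per sentence, each followed by a blank line. */
    static String toSrt(List<Sentence> sentences) {
        StringBuilder sb = new StringBuilder();
        int index = 1;
        for (Sentence s : sentences) {
            sb.append(index++).append('\n')
              .append(formatTimestamp(s.startMs)).append(" --> ")
              .append(formatTimestamp(s.endMs)).append('\n')
              .append(s.text).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Sentence> sentences = new ArrayList<>();
        sentences.add(new Sentence("Welcome to the course.", 0, 2_500));
        sentences.add(new Sentence("Let's get started.", 2_500, 4_200));
        System.out.print(toSrt(sentences));
    }
}
```

In an app, the list would be filled from the sentence-level segments of the transcription result rather than from hard-coded values.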


Development Procedures

  1. Preparations

(1) Configure the Huawei Maven repository address, and put the agconnect-services.json file under the app directory.
Open the build.gradle file in the root directory of your Android Studio project.
Add the AppGallery Connect plugin and the Maven repository.
 Go to allprojects > repositories and configure the Maven repository address for the HMS Core SDK.
 Go to buildscript > repositories and configure the Maven repository address for the HMS Core SDK.
 If the agconnect-services.json file has been added to the app, go to buildscript > dependencies and add the AppGallery Connect plugin configuration.

buildscript {
    repositories {
        google()
        jcenter()
        maven { url 'https://developer.huawei.com/repo/' }
    }
    dependencies {
        classpath 'com.android.tools.build:gradle:3.5.4'
        classpath 'com.huawei.agconnect:agcp:1.4.1.300'
        // NOTE: Do not place your app dependencies here; they belong
        // in the individual module build.gradle files.
    }
}

allprojects {
    repositories {
        google()
        jcenter()
        maven { url 'https://developer.huawei.com/repo/' }
    }
}
Set the app authentication information. For details, see Notes on Using Cloud Authentication Information.
(2) Add the build dependencies for the HMS Core SDK.
dependencies {
    //The audio file transcription SDK.
    implementation 'com.huawei.hms:ml-computer-voice-aft:2.2.0.300'
    // The ASR SDK.
    implementation 'com.huawei.hms:ml-computer-voice-asr:2.2.0.300'
    // Plugin of ASR.
    implementation 'com.huawei.hms:ml-computer-voice-asr-plugin:2.2.0.300'
    ...
}
apply plugin: 'com.huawei.agconnect'  // AppGallery Connect plugin.

(3) Configure the signing certificate in the build.gradle file under the app directory.

signingConfigs {
    release {
        storeFile file("xxx.jks")
        keyAlias "xxx"
        keyPassword "xxxxxx"
        storePassword "xxxxxx"
        v1SigningEnabled true
        v2SigningEnabled true
    }

}

buildTypes {
    release {
        minifyEnabled false
        proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
    }

    debug {
        signingConfig signingConfigs.release
        debuggable true
    }
}

(4) Add permissions in the AndroidManifest.xml file.

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE" />
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
<uses-permission android:name="android.permission.ACCESS_WIFI_STATE" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />

<application
    android:requestLegacyExternalStorage="true"
  ...
</application>
  2. Integrating the ASR Service

(1) Dynamically apply for the permissions.

if (ActivityCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO) != PackageManager.PERMISSION_GRANTED) {
    requestAudioPermission();
}

private void requestAudioPermission() {
    final String[] permissions = new String[]{Manifest.permission.RECORD_AUDIO};
    if (ActivityCompat.shouldShowRequestPermissionRationale(this, Manifest.permission.RECORD_AUDIO)) {
        // The user denied the permission before: explain here why the app needs the microphone.
    }
    // Request the recording permission in both cases.
    ActivityCompat.requestPermissions(this, permissions, Constants.AUDIO_PERMISSION_CODE);
}
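
The snippet above only sends the request; the user's answer arrives later in the activity's `onRequestPermissionsResult` callback. A minimal sketch of the check that callback could perform is shown below. It is plain Java so that it runs outside Android: the `PERMISSION_GRANTED` constant mirrors the value of `PackageManager.PERMISSION_GRANTED`, and the class and method names are hypothetical.

```java
final class PermissionResultSketch {
    // Mirrors android.content.pm.PackageManager.PERMISSION_GRANTED, whose value is 0.
    static final int PERMISSION_GRANTED = 0;

    /** Returns true only if at least one result was delivered and every result is "granted". */
    static boolean allGranted(int[] grantResults) {
        if (grantResults == null || grantResults.length == 0) {
            // Android delivers an empty array when the request is cancelled.
            return false;
        }
        for (int result : grantResults) {
            if (result != PERMISSION_GRANTED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allGranted(new int[]{0}));     // Granted.
        System.out.println(allGranted(new int[]{0, -1})); // One permission denied.
    }
}
```

Inside `onRequestPermissionsResult`, the app would call such a helper when `requestCode` equals `Constants.AUDIO_PERMISSION_CODE`, and start speech pickup only when it returns true.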

(2) Create an Intent to set parameters.

// Set authentication information for your app.
MLApplication.getInstance().setApiKey(AGConnectServicesConfig.fromContext(this).getString("client/api_key"));
// Use Intent for recognition parameter settings.
Intent intentPlugin = new Intent(this, MLAsrCaptureActivity.class)
        // Set the language that can be recognized to English. If this parameter is not set, English is recognized by default. Example: "zh-CN": Chinese; "en-US": English.
        .putExtra(MLAsrCaptureConstants.LANGUAGE, MLAsrConstants.LAN_EN_US)
        // Set whether to display the recognition result on the speech pickup UI.
        .putExtra(MLAsrCaptureConstants.FEATURE, MLAsrCaptureConstants.FEATURE_WORDFLUX);
// The request code must be an int; the same value is checked in onActivityResult.
startActivityForResult(intentPlugin, 1);

(3) Override the onActivityResult method to process the result returned by ASR.

@Override
protected void onActivityResult(int requestCode, int resultCode, @Nullable Intent data) {
    super.onActivityResult(requestCode, resultCode, data);
    String text = "";
    if (null == data) {
        addTagItem("Intent data is null.", true);
    }
    // 1 is the int request code passed to startActivityForResult.
    if (requestCode == 1) {
        if (data == null) {
            return;
        }
        Bundle bundle = data.getExtras();
        if (bundle == null) {
            return;
        }
        switch (resultCode) {
            case MLAsrCaptureConstants.ASR_SUCCESS:
                // Obtain the text information recognized from speech.
                if (bundle.containsKey(MLAsrCaptureConstants.ASR_RESULT)) {
                    text = bundle.getString(MLAsrCaptureConstants.ASR_RESULT);
                }
                if (text == null || "".equals(text)) {
                    text = "Result is null.";
                    Log.e(TAG, text);
                } else {
                    // Display the recognition result in the search box.
                    searchEdit.setText(text);
                    goSearch(text, true);
                }
                break;
            // MLAsrCaptureConstants.ASR_FAILURE: Recognition fails.
            case MLAsrCaptureConstants.ASR_FAILURE:
                // Check whether an error code is contained.
                if (bundle.containsKey(MLAsrCaptureConstants.ASR_ERROR_CODE)) {
                    text = text + bundle.getInt(MLAsrCaptureConstants.ASR_ERROR_CODE);
                    // Troubleshoot based on the error code.
                }
                // Check whether error information is contained.
                if (bundle.containsKey(MLAsrCaptureConstants.ASR_ERROR_MESSAGE)) {
                    String errorMsg = bundle.getString(MLAsrCaptureConstants.ASR_ERROR_MESSAGE);
                    // Troubleshoot based on the error information.
                    if (errorMsg != null && !"".equals(errorMsg)) {
                        text = "[" + text + "]" + errorMsg;
                    }
                }
                // Check whether a sub-error code is contained.
                if (bundle.containsKey(MLAsrCaptureConstants.ASR_SUB_ERROR_CODE)) {
                    int subErrorCode = bundle.getInt(MLAsrCaptureConstants.ASR_SUB_ERROR_CODE);
                    // Troubleshoot based on the sub-error code.
                    text = "[" + text + "]" + subErrorCode;
                }
                Log.e(TAG, text);
                break;
            default:
                break;
        }
    }
}
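
The failure branch above concatenates the error code, message, and sub-error code into nested brackets, which can be hard to read in a log. A possible alternative is sketched below as a plain Java helper. The class and method names are hypothetical, and the values in `main` are made-up placeholders, not real ML Kit error codes.

```java
import java.util.StringJoiner;

final class AsrErrorSketch {
    /** Builds a single readable line from whichever error details are present. */
    static String describeFailure(Integer errorCode, String errorMessage, Integer subErrorCode) {
        StringJoiner joiner = new StringJoiner(", ", "ASR failed: ", "");
        // Returned as-is when none of the three details is available.
        joiner.setEmptyValue("ASR failed: no error details");
        if (errorCode != null) {
            joiner.add("errorCode=" + errorCode);
        }
        if (subErrorCode != null) {
            joiner.add("subErrorCode=" + subErrorCode);
        }
        if (errorMessage != null && !errorMessage.isEmpty()) {
            joiner.add("message=" + errorMessage);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(describeFailure(1, "network unavailable", 2));
        System.out.println(describeFailure(null, null, null));
    }
}
```

In the activity, each argument would come from the bundle only when `containsKey` reports that the corresponding key is present, with `null` passed otherwise.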

3. Integrating the Audio File Transcription Service

(1) Dynamically apply for the permissions.

private static final int REQUEST_EXTERNAL_STORAGE = 1;
private static final String[] PERMISSIONS_STORAGE = {
        Manifest.permission.READ_EXTERNAL_STORAGE,
        Manifest.permission.WRITE_EXTERNAL_STORAGE };
public static void verifyStoragePermissions(Activity activity) {
    // Check if the write permission has been granted.
    int permission = ActivityCompat.checkSelfPermission(activity,
            Manifest.permission.WRITE_EXTERNAL_STORAGE);
    if (permission != PackageManager.PERMISSION_GRANTED) {
        // The permission has not been granted. Prompt the user to grant it.
        ActivityCompat.requestPermissions(activity, PERMISSIONS_STORAGE,
                REQUEST_EXTERNAL_STORAGE);
    }
}

(2) Create and initialize an audio transcription engine, and create an audio file transcription configurator.

// Set the API key.
MLApplication.getInstance().setApiKey(AGConnectServicesConfig.fromContext(getApplication()).getString("client/api_key"));
MLRemoteAftSetting setting = new MLRemoteAftSetting.Factory()
        // Set the transcription language code, complying with the BCP 47 standard. Currently, Mandarin Chinese and English are supported.
        .setLanguageCode("zh")
        // Set whether to automatically add punctuations to the converted text. The default value is false.
        .enablePunctuation(true)
        // Set whether to generate the text transcription result of each audio segment and the corresponding audio time shift. The default value is false. (This parameter needs to be set only when the audio duration is less than 1 minute.)
        .enableWordTimeOffset(true)
        // Set whether to output the time shift of a sentence in the audio file. The default value is false.
        .enableSentenceTimeOffset(true)
        .create();

// Create an audio transcription engine.
MLRemoteAftEngine engine = MLRemoteAftEngine.getInstance();
engine.init(this);
// Pass the listener callback to the audio transcription engine created beforehand.
engine.setAftListener(aftListener);
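
Because the language code has to comply with BCP 47, it can be worth checking a tag before passing it to `setLanguageCode`, for example when the value comes from user settings. The sketch below uses only the standard `java.util.Locale.Builder` class. It verifies that a tag is well-formed, not that the transcription service supports it, and the class and method names are hypothetical.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

final class LanguageTagSketch {
    /** Returns true if the tag is a non-empty, well-formed BCP 47 language tag. */
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            return false;
        }
        try {
            // setLanguageTag throws IllformedLocaleException for ill-formed tags.
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("zh"));        // Well-formed.
        System.out.println(isWellFormed("en-US"));     // Well-formed.
        System.out.println(isWellFormed("not a tag")); // Ill-formed.
    }
}
```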

(3) Create a listener callback to process the audio file transcription result.

 Transcription of short audio files with a duration of 1 minute or shorter:

private MLRemoteAftListener aftListener = new MLRemoteAftListener() {
    @Override
    public void onResult(String taskId, MLRemoteAftResult result, Object ext) {
        // Obtain the transcription result notification.
        if (result.isComplete()) {
            // Process the transcription result.
        }
    }
    @Override
    public void onError(String taskId, int errorCode, String message) {
        // Callback upon a transcription error.
    }
    @Override
    public void onInitComplete(String taskId, Object ext) {
        // Reserved.
    }
    @Override
    public void onUploadProgress(String taskId, double progress, Object ext) {
        // Reserved.
    }
    @Override
    public void onEvent(String taskId, int eventId, Object ext) {
        // Reserved.
    }
};

 Transcription of audio files with a duration longer than 1 minute:

private MLRemoteAftListener asrListener = new MLRemoteAftListener() {
    @Override
    public void onInitComplete(String taskId, Object ext) {
        Log.e(TAG, "MLAsrCallBack onInitComplete");
        // The long audio file is initialized and the transcription starts.
        start(taskId);
    }
    @Override
    public void onUploadProgress(String taskId, double progress, Object ext) {
        Log.e(TAG, " MLAsrCallBack onUploadProgress");
    }
    @Override
    public void onEvent(String taskId, int eventId, Object ext) {
        // Used for the long audio file.
        Log.e(TAG, "MLAsrCallBack onEvent" + eventId);
        if (MLAftEvents.UPLOADED_EVENT == eventId) { // The file is uploaded successfully.
            // Obtain the transcription result.
            startQueryResult(taskId);
        }
    }
    @Override
    public void onResult(String taskId, MLRemoteAftResult result, Object ext) {
        Log.e(TAG, "MLAsrCallBack onResult taskId is :" + taskId + " ");
        if (result != null) {
            Log.e(TAG, "MLAsrCallBack onResult isComplete: " + result.isComplete());
            if (result.isComplete()) {
                TimerTask timerTask = timerTaskMap.get(taskId);
                if (null != timerTask) {
                    timerTask.cancel();
                    timerTaskMap.remove(taskId);
                }
                if (result.getText() != null) {
                    Log.e(TAG, taskId + " MLAsrCallBack onResult result is : " + result.getText());
                    tvText.setText(result.getText());
                }
                List<MLRemoteAftResult.Segment> words = result.getWords();
                if (words != null && words.size() != 0) {
                    for (MLRemoteAftResult.Segment word : words) {
                        Log.e(TAG, "MLAsrCallBack word  text is : " + word.getText() + ", startTime is : " + word.getStartTime() + ". endTime is : " + word.getEndTime());
                    }
                }
                List<MLRemoteAftResult.Segment> sentences = result.getSentences();
                if (sentences != null && sentences.size() != 0) {
                    for (MLRemoteAftResult.Segment sentence : sentences) {
                        Log.e(TAG, "MLAsrCallBack sentence  text is : " + sentence.getText() + ", startTime is : " + sentence.getStartTime() + ". endTime is : " + sentence.getEndTime());
                    }
                }
            }
        }
    }
    @Override
    public void onError(String taskId, int errorCode, String message) {
        Log.i(TAG, "MLAsrCallBack onError: " + message + ", errorCode: " + errorCode);
        switch (errorCode) {
            case MLAftErrors.ERR_AUDIO_FILE_NOTSUPPORTED:
                break;
        }
    }
};
// Upload a transcription task.
private void start(String taskId) {
    Log.e(TAG, "start");
    engine.setAftListener(asrListener);
    engine.startTask(taskId);
}
// Obtain the transcription result.
private Map<String, TimerTask> timerTaskMap = new HashMap<>();
private void startQueryResult(final String taskId) {
    Timer mTimer = new Timer();
    TimerTask mTimerTask = new TimerTask() {
        @Override
        public void run() {
            getResult(taskId);
        }
    };
    // Query the long audio file transcription result every 10s, starting 5s after scheduling.
    mTimer.schedule(mTimerTask, 5000, 10000);
    // Clear timerTaskMap before destroying the UI.
    timerTaskMap.put(taskId, mTimerTask);
}
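
The comment about clearing `timerTaskMap` before the UI is destroyed is easy to overlook. One way to keep the start, stop, and cleanup logic in one place is sketched below in plain Java. This is an assumption about how the bookkeeping could be organized, not part of the ML Kit API; the class and method names are hypothetical, and it shares a single `Timer` across tasks instead of creating one per task.

```java
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

final class PollingRegistrySketch {
    // A single daemon timer serves all polling tasks.
    private final Timer timer = new Timer("aft-polling", true);
    private final Map<String, TimerTask> tasks = new ConcurrentHashMap<>();

    /** Runs the query repeatedly for the given task ID, replacing any earlier schedule for it. */
    void startPolling(String taskId, final Runnable query, long delayMs, long periodMs) {
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                query.run();
            }
        };
        TimerTask previous = tasks.put(taskId, task);
        if (previous != null) {
            previous.cancel();
        }
        timer.schedule(task, delayMs, periodMs);
    }

    /** Stops polling for one task; returns false if the task ID was not registered. */
    boolean stopPolling(String taskId) {
        TimerTask task = tasks.remove(taskId);
        if (task == null) {
            return false;
        }
        task.cancel();
        return true;
    }

    /** Cancels everything; the registry cannot be reused afterwards. */
    void stopAll() {
        for (TimerTask task : tasks.values()) {
            task.cancel();
        }
        tasks.clear();
        timer.cancel();
    }

    int activeCount() {
        return tasks.size();
    }

    public static void main(String[] args) throws InterruptedException {
        PollingRegistrySketch registry = new PollingRegistrySketch();
        registry.startPolling("demo-task", () -> System.out.println("querying result..."), 100, 200);
        Thread.sleep(600);
        registry.stopAll();
    }
}
```

With this arrangement, the listener would call `stopPolling(taskId)` once `result.isComplete()` is true, and the activity would call `stopAll()` from `onDestroy()`.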

(4) Obtain an audio file and upload it to the audio transcription engine.

// Obtain the URI of an audio file.
Uri uri = getFileUri();
// Obtain the audio duration.
Long audioTime = getAudioFileTimeFromUri(uri);
// Check whether the duration is shorter than 60s (60,000 ms).
if (audioTime < 60000) {
    // uri indicates audio resources read from the local storage or recorder. Only local audio files with a duration not longer than 1 minute are supported.
    this.taskId = this.engine.shortRecognize(uri, this.setting);
    Log.i(TAG, "Short audio transcription.");
} else {
    // longRecognize is an API used to convert audio files with a duration from 1 minute to 5 hours.
    this.taskId = this.engine.longRecognize(uri, this.setting);
    Log.i(TAG, "Long audio transcription.");
}

private Long getAudioFileTimeFromUri(Uri uri) {
    Long time = null;
    Cursor cursor = this.getContentResolver()
            .query(uri, null, null, null, null);
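    if (cursor != null) {
        // Read the duration (in milliseconds) from the media store record, then close the cursor.
        cursor.moveToFirst();
        time = cursor.getLong(cursor.getColumnIndexOrThrow(MediaStore.Audio.Media.DURATION));
        cursor.close();
    } else {
        // Fall back to MediaPlayer when the URI cannot be queried.
        MediaPlayer mediaPlayer = new MediaPlayer();
        try {
            mediaPlayer.setDataSource(String.valueOf(uri));
            mediaPlayer.prepare();
        } catch (IOException e) {
            Log.e(TAG, "Failed to read the file time.");
        }
        time = Long.valueOf(mediaPlayer.getDuration());
        // Release the player so that it does not leak native resources.
        mediaPlayer.release();
    }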
    return time;
}
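
The duration check above can also be written as a small pure function, which makes the limits explicit and lets an unreadable duration or an over-long file be rejected before anything is uploaded. The sketch below is plain Java with hypothetical names. It follows the snippet's `<` comparison for the 60s boundary and takes the 5-hour upper limit from the service description.

```java
final class TranscriptionModeSketch {
    enum Mode { SHORT, LONG, UNSUPPORTED }

    static final long SHORT_LIMIT_MS = 60_000L;              // 1 minute.
    static final long LONG_LIMIT_MS = 5L * 60 * 60 * 1_000;  // 5 hours.

    /** Picks the transcription mode for a duration given in milliseconds. */
    static Mode chooseMode(Long durationMs) {
        if (durationMs == null || durationMs <= 0) {
            // The duration could not be read.
            return Mode.UNSUPPORTED;
        }
        if (durationMs < SHORT_LIMIT_MS) {
            return Mode.SHORT;
        }
        if (durationMs <= LONG_LIMIT_MS) {
            return Mode.LONG;
        }
        // Longer than the 5-hour limit stated for audio file transcription.
        return Mode.UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(30_000L));    // SHORT
        System.out.println(chooseMode(600_000L));   // LONG
        System.out.println(chooseMode(null));       // UNSUPPORTED
    }
}
```

An app could then call `shortRecognize` for `SHORT`, `longRecognize` for `LONG`, and show a message to the user for `UNSUPPORTED`.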

To learn more, visit the following links:
Documentation on the HUAWEI Developers website
https://developer.huawei.com/consumer/en/hms/huawei-MapKit

HUAWEI Developers official website

Development Guide

Reddit to join developer discussions

GitHub or Gitee to download the demo and sample code

Stack Overflow to solve integration problems
