HarmonyOS Native Intelligence: Speech Recognition Practice

kouwei qing
Background

Many business scenarios in our company rely on speech recognition. At the time, our speech team had developed an in-house speech recognition model, using a solution in which a cloud-based model interacts with an edge-side SDK: the edge side handles speech capture, VAD (Voice Activity Detection), and Opus encoding, and streams the audio to the cloud in real time, which then returns the recognition results. While adapting to HarmonyOS, we discovered that HarmonyOS Native Intelligence provides a local speech recognition SDK, which prompted us to encapsulate its capabilities.

Scenario Introduction

Native speech recognition supports two modes:

  • Short speech mode (≤60s)
  • Long speech mode (≤8h)

API Interface Introduction

1. Engine Initialization

speechRecognizer.createEngine

import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Set engine creation parameters
let extraParam: Record<string, Object> = { "locate": "CN", "recognizerMode": "short" };
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Create the engine; the instance is returned via callback
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the created engine instance
    asrEngine = speechRecognitionEngine;
  } else {
    // e.g., error code 1002200008: the engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});

The main work is constructing speechRecognizer.CreateEngineParams:

  • language: Language
  • online: Mode (1 for offline; currently only offline engine is supported)
  • extraParams: Regional information, etc.
    • locate: Regional info (optional, defaults to "CN"; currently only "CN" is supported)
    • recognizerMode: Recognition mode ("short" for short speech, "long" for long speech)
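For the long speech scenario (≤8h) mentioned earlier, the same call applies with recognizerMode switched to "long". A minimal sketch, reusing the imports above:

// Sketch: engine parameters for long speech mode; otherwise identical to the short-mode example
let longModeParams: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1, // Offline engine (the only mode currently supported)
  extraParams: { "locate": "CN", "recognizerMode": "long" }
};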

Error information in callbacks:

  1. Error code 1002200001: Engine creation failed due to unsupported language, mode, initialization timeout, or missing resources.
  2. Error code 1002200006: Engine is busy (typically triggered when multiple apps call the speech recognition engine simultaneously).
  3. Error code 1002200008: Engine is being destroyed.

2. Set RecognitionListener Callback

The callback handles events during recognition. The most important is onResult, which delivers the recognized content; each session is identified by a unique sessionId:

// Create the callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Invoked when recognition starts successfully
  onStart(sessionId: string, eventMessage: string) {

  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {

  },
  // Recognition result callback (delivers both intermediate and final results)
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
    // result.result holds the recognized text; result.isLast marks the session's
    // last result (field names per the SpeechRecognitionResult type)
    console.info(`session ${sessionId}: ${result.result}`);
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {

  },
  // Error callback (error codes arrive here, e.g., 1002200006: engine is busy)
  onError(sessionId: string, errorCode: number, errorMessage: string) {

  }
}
// Register the callback
asrEngine.setListener(setListener);

3. Start Recognition

let sessionId: string = '123456'; // Unique session ID, generated by the caller (illustrative value)
let audioParam: speechRecognizer.AudioInfo = { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 };
let extraParam: Record<string, Object> = { "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000 };
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Invoke the start-recognition method
asrEngine.startListening(recognizerParams);

Main parameters for starting recognition:

  • sessionId: Session ID (must correspond to the sessionId in onResult callbacks)
  • audioInfo: Audio configuration (optional)
    • audioType: Currently only PCM is supported (decode MP3 files before passing them to the engine)
    • sampleRate: Audio sampling rate (currently only 16000 is supported)
    • sampleBit: Sampling bit depth (currently only 16-bit is supported)
    • soundChannel: Audio channels (currently only mono/1 channel is supported)
    • extraParams: Audio compression rate (defaults to 0 for PCM)
  • extraParams: Additional configuration
    • recognitionMode: Real-time speech recognition mode (defaults to 1 if unspecified; see the sketch after this list)
      • 0: Real-time recording recognition (requires ohos.permission.MICROPHONE; call finish to stop)
      • 1: Real-time audio-to-text (call writeAudio to pass the audio stream)
    • vadBegin: VAD (Voice Activity Detection) front-end point (range: [500, 10000] ms; default 10000 ms)
    • vadEnd: VAD back-end point (range: [500, 10000] ms; default 800 ms)
    • maxAudioDuration: Maximum supported audio duration
      • Short speech mode: [20000, 60000] ms (default 20000 ms)
      • Long speech mode: [20000, 8*60*60*1000] ms

VAD primarily detects speech activity and skips silent segments.
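For mode 0, where the engine records from the microphone itself, starting a session might look like the following sketch (it assumes ohos.permission.MICROPHONE has been granted and that asrEngine was created as above; the session ID is illustrative):

// Start a mode-0 session: the engine captures microphone audio by itself
let micSessionId: string = 'mic-session-001'; // illustrative session ID
asrEngine.startListening({
  sessionId: micSessionId,
  audioInfo: { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 },
  extraParams: { "recognitionMode": 0, "maxAudioDuration": 60000 }
});
// ... later, stop recording; remaining results arrive via onResult/onComplete
asrEngine.finish(micSessionId);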

4. Pass Audio Stream

asrEngine.writeAudio(sessionId, uint8Array);

Write audio data to the engine (can be from a microphone or audio file).

Note: Audio stream length must be 640 or 1280 bytes.
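Because the engine only accepts 640- or 1280-byte writes, audio read from a file typically has to be sliced first. A minimal sketch (pcmData is a placeholder for real 16 kHz / 16-bit / mono PCM data):

// Slice a PCM buffer into 1280-byte chunks for writeAudio
const CHUNK_SIZE = 1280;
let pcmData: Uint8Array = new Uint8Array(0); // placeholder: fill with real PCM data
for (let offset = 0; offset + CHUNK_SIZE <= pcmData.length; offset += CHUNK_SIZE) {
  asrEngine.writeAudio(sessionId, pcmData.subarray(offset, offset + CHUNK_SIZE));
}
// Any trailing partial chunk must be buffered or padded to a valid size before writing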

5. Other Interfaces

  1. listLanguages: Query supported languages
  2. finish: End recognition
  3. cancel: Cancel recognition
  4. shutdown: Release engine resources
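A short sketch of the typical teardown sequence (finish to end a session, cancel as the abort alternative, shutdown to release the engine):

// End a session and release the engine
asrEngine.finish(sessionId);  // End recognition normally; remaining results arrive via onResult/onComplete
// asrEngine.cancel(sessionId);  // Alternatively, abort the session without waiting for results
asrEngine.shutdown();  // Release engine resources once recognition is no longer needed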

Best Practices

For real-time recognition, read audio from the microphone and pass it to asrEngine, then handle results in the onResult callback.

Configure audio capture parameters and create an AudioCapturer instance:

import { audio } from '@kit.AudioKit';

let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channels
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sample format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding type
};

let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};

let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};

let audioCapturer: audio.AudioCapturer;

audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}.`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    audioCapturer = data;
  }
});

Note: Sampling rate, channels, and bit depth must match ASR engine requirements (16k, mono, 16-bit).

Next, subscribe to audio data read events:

let readDataCallback = (buffer: ArrayBuffer) => {
  // Forward each captured buffer to the ASR engine
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
};
audioCapturer.on('readData', readDataCallback);

Note: Buffer size must be 640 or 1280 bytes (ASR engine restriction).
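Finally, readData events only fire once the capturer is started, and the capturer should be stopped when the session ends. A minimal sketch, assuming the objects created above:

import { BusinessError } from '@kit.BasicServicesKit';

// Start capturing; 'readData' callbacks begin firing once start succeeds
audioCapturer.start((err: BusinessError) => {
  if (err) {
    console.error(`Capturer start failed, code is ${err.code}, message is ${err.message}.`);
  } else {
    console.info('Capturer started.');
  }
});

// ... when the session ends:
audioCapturer.stop();         // Stop capturing audio
asrEngine.finish(sessionId);  // Tell the engine no more audio is coming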

Summary

This article introduced HarmonyOS's official speech recognition capability, detailed the ASR engine's interfaces, and demonstrated real-time microphone speech recognition by capturing audio data and handling the recognized results.
