HarmonyOS Native Intelligence: Speech Recognition Practice

kouwei qing
Background

Many business scenarios in our company rely on speech recognition. At the time, our speech team had developed an in-house speech recognition model, using a solution in which a cloud-based model interacts with an edge-side SDK: the edge side handles speech capture, VAD (Voice Activity Detection), and Opus encoding, and streams the audio to the cloud in real time, which then returns the recognition results. While adapting to HarmonyOS, we discovered that HarmonyOS Native Intelligence provides a local speech recognition SDK, which prompted us to encapsulate its capabilities.

Scenario Introduction

Native speech recognition supports two modes:

  • Short speech mode (≤60s)
  • Long speech mode (≤8h)

API Interface Introduction

1. Engine Initialization

speechRecognizer.createEngine

import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Set engine creation parameters
let extraParam: Record<string, Object> = { "locate": "CN", "recognizerMode": "short" };
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Create the engine; the instance is returned via callback
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the created engine instance
    asrEngine = speechRecognitionEngine;
  } else {
    // e.g., error code 1002200008: the engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});

The main work is constructing speechRecognizer.CreateEngineParams:

  • language: Language
  • online: Mode (1 for offline; currently only offline engine is supported)
  • extraParams: Regional information, etc.
    • locate: Regional info (optional, defaults to "CN"; currently only "CN" is supported)
    • recognizerMode: Recognition mode ("short" for short speech, "long" for long speech)
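For the long speech scenario (≤8h) mentioned earlier, the same call applies with recognizerMode switched to "long". A minimal sketch, reusing the imports above:

// Sketch: engine parameters for long speech mode; otherwise identical to the short-mode example
let longModeParams: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1, // Offline engine (the only mode currently supported)
  extraParams: { "locate": "CN", "recognizerMode": "long" }
};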

Error information in callbacks:

  1. Error code 1002200001: Engine creation failed due to unsupported language, mode, initialization timeout, or missing resources.
  2. Error code 1002200006: Engine is busy (typically triggered when multiple apps call the speech recognition engine simultaneously).
  3. Error code 1002200008: Engine is being destroyed.

2. Set RecognitionListener Callback

The callback handles events during recognition. The most important is onResult, which delivers the recognized content; each session is identified by a unique sessionId:

// Create the callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Invoked when recognition starts successfully
  onStart(sessionId: string, eventMessage: string) {

  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {

  },
  // Recognition result callback (delivers both intermediate and final results)
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
    // result.result holds the recognized text; result.isLast marks the session's
    // last result (field names per the SpeechRecognitionResult type)
    console.info(`session ${sessionId}: ${result.result}`);
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {

  },
  // Error callback (error codes arrive here, e.g., 1002200006: engine is busy)
  onError(sessionId: string, errorCode: number, errorMessage: string) {

  }
}
// Register the callback
asrEngine.setListener(setListener);

3. Start Recognition

let sessionId: string = '123456'; // Unique session ID, generated by the caller (illustrative value)
let audioParam: speechRecognizer.AudioInfo = { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 };
let extraParam: Record<string, Object> = { "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000 };
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Invoke the start-recognition method
asrEngine.startListening(recognizerParams);

Main parameters for starting recognition:

  • sessionId: Session ID (must correspond to the sessionId in onResult callbacks)
  • audioInfo: Audio configuration (optional)
    • audioType: Currently only PCM is supported (decode MP3 files before passing them to the engine)
    • sampleRate: Audio sampling rate (currently only 16000 is supported)
    • sampleBit: Sampling bit depth (currently only 16-bit is supported)
    • soundChannel: Audio channels (currently only mono/1 channel is supported)
    • extraParams: Audio compression rate (defaults to 0 for PCM)
  • extraParams: Additional configuration
    • recognitionMode: Real-time speech recognition mode (defaults to 1 if unspecified; see the sketch after this list)
      • 0: Real-time recording recognition (requires ohos.permission.MICROPHONE; call finish to stop)
      • 1: Real-time audio-to-text (call writeAudio to pass the audio stream)
    • vadBegin: VAD (Voice Activity Detection) front-end point (range: [500, 10000] ms; default 10000 ms)
    • vadEnd: VAD back-end point (range: [500, 10000] ms; default 800 ms)
    • maxAudioDuration: Maximum supported audio duration
      • Short speech mode: [20000, 60000] ms (default 20000 ms)
      • Long speech mode: [20000, 8*60*60*1000] ms

VAD primarily detects speech activity and skips silent segments.
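For mode 0, where the engine records from the microphone itself, starting a session might look like the following sketch (it assumes ohos.permission.MICROPHONE has been granted and that asrEngine was created as above; the session ID is illustrative):

// Start a mode-0 session: the engine captures microphone audio by itself
let micSessionId: string = 'mic-session-001'; // illustrative session ID
asrEngine.startListening({
  sessionId: micSessionId,
  audioInfo: { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 },
  extraParams: { "recognitionMode": 0, "maxAudioDuration": 60000 }
});
// ... later, stop recording; remaining results arrive via onResult/onComplete
asrEngine.finish(micSessionId);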

4. Pass Audio Stream

asrEngine.writeAudio(sessionId, uint8Array);

Write audio data to the engine (can be from a microphone or audio file).

Note: Audio stream length must be 640 or 1280 bytes.
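Because the engine only accepts 640- or 1280-byte writes, audio read from a file typically has to be sliced first. A minimal sketch (pcmData is a placeholder for real 16 kHz / 16-bit / mono PCM data):

// Slice a PCM buffer into 1280-byte chunks for writeAudio
const CHUNK_SIZE = 1280;
let pcmData: Uint8Array = new Uint8Array(0); // placeholder: fill with real PCM data
for (let offset = 0; offset + CHUNK_SIZE <= pcmData.length; offset += CHUNK_SIZE) {
  asrEngine.writeAudio(sessionId, pcmData.subarray(offset, offset + CHUNK_SIZE));
}
// Any trailing partial chunk must be buffered or padded to a valid size before writing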

5. Other Interfaces

  1. listLanguages: Query supported languages
  2. finish: End recognition
  3. cancel: Cancel recognition
  4. shutdown: Release engine resources
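A short sketch of the typical teardown sequence (finish to end a session, cancel as the abort alternative, shutdown to release the engine):

// End a session and release the engine
asrEngine.finish(sessionId);  // End recognition normally; remaining results arrive via onResult/onComplete
// asrEngine.cancel(sessionId);  // Alternatively, abort the session without waiting for results
asrEngine.shutdown();  // Release engine resources once recognition is no longer needed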

Best Practices

For real-time recognition, read audio from the microphone and pass it to asrEngine, then handle results in the onResult callback.

Configure audio capture parameters and create an AudioCapturer instance:

import { audio } from '@kit.AudioKit';

let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channels
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sample format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding type
};

let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};

let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};

let audioCapturer: audio.AudioCapturer;

audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}.`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    audioCapturer = data;
  }
});

Note: Sampling rate, channels, and bit depth must match ASR engine requirements (16k, mono, 16-bit).

Next, subscribe to audio data read events:

let readDataCallback = (buffer: ArrayBuffer) => {
  // Forward each captured buffer to the ASR engine
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
};
audioCapturer.on('readData', readDataCallback);

Note: Buffer size must be 640 or 1280 bytes (ASR engine restriction).
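Finally, readData events only fire once the capturer is started, and the capturer should be stopped when the session ends. A minimal sketch, assuming the objects created above:

import { BusinessError } from '@kit.BasicServicesKit';

// Start capturing; 'readData' callbacks begin firing once start succeeds
audioCapturer.start((err: BusinessError) => {
  if (err) {
    console.error(`Capturer start failed, code is ${err.code}, message is ${err.message}.`);
  } else {
    console.info('Capturer started.');
  }
});

// ... when the session ends:
audioCapturer.stop();         // Stop capturing audio
asrEngine.finish(sessionId);  // Tell the engine no more audio is coming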

Summary

This article introduced HarmonyOS's official speech recognition capability, detailed the ASR engine's interfaces, and demonstrated real-time microphone speech recognition by capturing audio data and handling the recognized results.
