HarmonyOS Native Intelligence: Speech Recognition Practice
Background
Many business scenarios in our company rely on speech recognition. Our speech team originally built a self-developed speech recognition model around a cloud-side model working with an edge-side SDK: the edge side handles audio capture, VAD (Voice Activity Detection), Opus encoding, and real-time streaming to the cloud, which returns the recognition results. While adapting to HarmonyOS, we discovered that HarmonyOS Native Intelligence provides a local speech recognition SDK, so we wrapped its capabilities as described below.
Scenario Introduction
Native speech recognition supports two modes:
- Short speech mode (≤60s)
- Long speech mode (≤8h)
API Interface Introduction
1. Engine Initialization
speechRecognizer.createEngine
import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Create the engine and return via callback
// Set engine creation parameters
let extraParam: Record<string, Object> = {"locate": "CN", "recognizerMode": "short"};
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Invoke createEngine method
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the created engine instance
    asrEngine = speechRecognitionEngine;
  } else {
    // Error code 1002200008 when unable to create engine: Engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});
The main work is constructing speechRecognizer.CreateEngineParams:
- language: Language (e.g., 'zh-CN')
- online: Mode (1 for offline; currently only the offline engine is supported)
- extraParams: Regional information and engine options
  - locate: Regional info (optional; defaults to "CN"; currently only "CN" is supported)
  - recognizerMode: Recognition mode ("short" for short speech, "long" for long speech)

Error information in callbacks:
- Error code 1002200001: Engine creation failed due to an unsupported language or mode, initialization timeout, or missing resources.
- Error code 1002200006: Engine is busy (typically triggered when multiple apps call the speech recognition engine simultaneously).
- Error code 1002200008: Engine is being destroyed.
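For async/await call sites, the callback-style createEngine can be wrapped in a Promise. Below is a minimal sketch; the helper name createAsrEngine is our own, not part of the SDK:
// Hypothetical helper: wraps the callback-style createEngine in a Promise
function createAsrEngine(params: speechRecognizer.CreateEngineParams): Promise<speechRecognizer.SpeechRecognitionEngine> {
  return new Promise((resolve, reject) => {
    speechRecognizer.createEngine(params, (err: BusinessError, engine: speechRecognizer.SpeechRecognitionEngine) => {
      if (err) {
        // e.g., 1002200001 / 1002200006 / 1002200008, see the error codes above
        reject(err);
      } else {
        resolve(engine);
      }
    });
  });
}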
2. Set RecognitionListener Callback
The callback handles events during recognition. The most important is onResult, which delivers the recognized content; each recognition session is identified by a unique sessionId:
// Create callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Callback when recognition starts successfully
  onStart(sessionId: string, eventMessage: string) {
  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {
  },
  // Recognition result callback (includes intermediate and final results)
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {
  },
  // Error callback (error codes returned here, e.g., 1002200006: Engine is busy)
  onError(sessionId: string, errorCode: number, errorMessage: string) {
  }
}
// Set callback
asrEngine.setListener(setListener);
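For reference, a minimal onResult implementation might look like the sketch below. It assumes the SpeechRecognitionResult fields result (recognized text), isFinal, and isLast as exposed by the SDK; it slots into the listener object above:
onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
  // result.result carries the recognized text; intermediate results keep refining it
  console.info(`session ${sessionId}: ${result.result}`);
  if (result.isFinal) {
    // Final result for the current utterance
  }
  if (result.isLast) {
    // Last result of the session; safe to finish and release resources
  }
}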
3. Start Recognition
let sessionId: string = '123456'; // Custom session ID; must match the sessionId used in the listener callbacks
let audioParam: speechRecognizer.AudioInfo = {audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16};
let extraParam: Record<string, Object> = {"vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000};
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Invoke start recognition method
asrEngine.startListening(recognizerParams);
Main parameters for starting recognition:
- sessionId: Session ID (must match the sessionId in onResult callbacks)
- audioInfo: Audio configuration (optional)
  - audioType: Currently only 'pcm' is supported (decode MP3 files before passing audio to the engine)
  - sampleRate: Audio sampling rate (currently only 16000 is supported)
  - sampleBit: Sampling bit depth (currently only 16-bit is supported)
  - soundChannel: Audio channels (currently only mono/1 channel is supported)
  - extraParams: Audio compression rate (defaults to 0 for PCM)
- extraParams: Additional configuration
  - recognitionMode: Real-time speech recognition mode (defaults to 1 if unspecified; see the sketch after this list)
    - 0: Real-time recording recognition (requires ohos.permission.MICROPHONE; call finish to stop)
    - 1: Real-time audio-to-text (call writeAudio to pass the audio stream)
  - vadBegin: VAD (Voice Activity Detection) front-end point (range: [500, 10000] ms; default 10000 ms)
  - vadEnd: VAD back-end point (range: [500, 10000] ms; default 800 ms)
  - maxAudioDuration: Maximum supported audio duration
    - Short speech mode: [20000, 60000] ms (default 20000 ms)
    - Long speech mode: [20000, 8*60*60*1000] ms

VAD primarily detects speech activity and skips silent segments.
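As referenced in the list above, here is a sketch of starting recognition in mode 0, where the engine records from the microphone itself (requires ohos.permission.MICROPHONE); the parameter values are illustrative:
// Sketch: real-time recording recognition (recognitionMode 0)
let micParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 },
  extraParams: { "recognitionMode": 0, "maxAudioDuration": 40000 }
};
asrEngine.startListening(micParams);
// In mode 0 there is no writeAudio; call finish(sessionId) to stop recording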
4. Pass Audio Stream
asrEngine.writeAudio(sessionId, uint8Array);
Write audio data to the engine (can be from a microphone or audio file).
Note: Audio stream length must be 640 or 1280 bytes.  
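Because of this restriction, audio that arrives in arbitrary-sized buffers has to be re-chunked before writing. A minimal sketch; the helper writeAudioInChunks is ours, not part of the SDK:
// Hypothetical helper: split a buffer into fixed 1280-byte chunks for writeAudio
function writeAudioInChunks(sessionId: string, data: Uint8Array, chunkSize: number = 1280): void {
  for (let offset = 0; offset + chunkSize <= data.length; offset += chunkSize) {
    asrEngine.writeAudio(sessionId, data.subarray(offset, offset + chunkSize));
  }
  // Trailing bytes shorter than chunkSize should be buffered and merged
  // with the next batch instead of being written directly.
}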
5. Other Interfaces
- listLanguages: Query supported languages
- finish: End recognition (see the lifecycle sketch below)
- cancel: Cancel recognition
- shutdown: Release engine resources
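A sketch of the typical lifecycle around these calls:
// End the session normally after the last writeAudio; remaining results still arrive via onResult/onComplete
asrEngine.finish(sessionId);
// Or abort a session midway
asrEngine.cancel(sessionId);
// Release engine resources once recognition is no longer needed
asrEngine.shutdown();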
Best Practices
For real-time recognition, read audio from the microphone and pass it to asrEngine, then handle results in the onResult callback.  
Configure audio capture parameters and create an AudioCapturer instance:
import { audio } from '@kit.AudioKit';
let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channels
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sample format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding type
};
let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};
let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};
let audioCapturer: audio.AudioCapturer;
audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    audioCapturer = data;
  }
});
Note: Sampling rate, channels, and bit depth must match ASR engine requirements (16k, mono, 16-bit).
Next, subscribe to audio data read events:
import { BusinessError } from '@kit.BasicServicesKit';

let readDataCallback = (buffer: ArrayBuffer) => {
  // Forward each captured audio buffer to the ASR engine
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
}
audioCapturer.on('readData', readDataCallback);
Note: Buffer size must be 640 or 1280 bytes (ASR engine restriction).
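Finally, start the capturer so that 'readData' callbacks begin firing, and stop it when the session ends. A sketch of the flow:
audioCapturer.start((err: BusinessError) => {
  if (err) {
    console.error(`Capturer start failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Capturer start success.');
  }
});
// ...when the user stops speaking or leaves the page:
audioCapturer.stop((err: BusinessError) => {
  if (!err) {
    asrEngine.finish(sessionId); // No more audio is coming; flush remaining results
  }
});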
Summary
This article introduced the official HarmonyOS speech recognition capability, detailed the ASR engine interfaces, and demonstrated real-time microphone speech recognition by capturing audio data and processing the results.