HarmonyOS Native Intelligence: Speech Recognition Practice
Background
Many business scenarios in our company rely on speech recognition. At the time, our speech team had built an in-house speech recognition model using a cloud-model-plus-device-SDK architecture: the device side handles audio capture, VAD (Voice Activity Detection), Opus encoding, and real-time streaming to the cloud, which returns the recognition results. While adapting to HarmonyOS, we found that HarmonyOS Native Intelligence provides a local speech recognition SDK, so we wrapped its capabilities.
Scenario Introduction
Native speech recognition supports two modes:
- Short speech mode (≤60s)
- Long speech mode (≤8h)
API Interface Introduction
1. Engine Initialization
speechRecognizer.createEngine
import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Create the engine and return it via callback
// Set engine creation parameters
let extraParam: Record<string, Object> = { "locate": "CN", "recognizerMode": "short" };
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Invoke the createEngine method
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the created engine instance
    asrEngine = speechRecognitionEngine;
  } else {
    // e.g., error code 1002200008 when the engine cannot be created: engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});
You mainly need to construct speechRecognizer.CreateEngineParams:
- language: Language
- online: Mode (1 for offline; currently only the offline engine is supported)
- extraParams: Regional information, etc.
  - locate: Regional info (optional, defaults to "CN"; currently only "CN" is supported)
  - recognizerMode: Recognition mode ("short" for short speech, "long" for long speech)
Error information in callbacks:
- Error code 1002200001: Engine creation failed due to unsupported language, mode, initialization timeout, or missing resources.
- Error code 1002200006: Engine is busy (typically triggered when multiple apps call the speech recognition engine simultaneously).
- Error code 1002200008: Engine is being destroyed.
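As a rough illustration, the error branch of the createEngine callback above could separate the transient codes (busy, being destroyed) from the permanent one. The helper below is a sketch of our own, not an SDK API:
// Sketch: classifying the documented createEngine error codes (helper name is our own).
function handleCreateEngineError(err: BusinessError): void {
  if (err.code === 1002200006 || err.code === 1002200008) {
    // Engine busy or being destroyed: transient states, a delayed retry usually succeeds
    console.warn(`ASR engine temporarily unavailable (${err.code}), retry later.`);
  } else if (err.code === 1002200001) {
    // Unsupported language/mode, initialization timeout, or missing resources: retrying will not help
    console.error(`ASR engine creation failed: ${err.message}`);
  } else {
    console.error(`Unexpected ASR error ${err.code}: ${err.message}`);
  }
}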
2. Set RecognitionListener Callback
The callback handles events during recognition. The most important is onResult, which delivers the recognized content; each recognition session is identified by a unique sessionId:
// Create callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Called when recognition starts successfully
  onStart(sessionId: string, eventMessage: string) {
  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {
  },
  // Recognition result callback (includes intermediate and final results)
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {
  },
  // Error callback (error codes are returned here, e.g., 1002200006: engine is busy)
  onError(sessionId: string, errorCode: number, errorMessage: string) {
  }
}
// Set the callback
asrEngine.setListener(setListener);
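The snippet leaves onResult empty; as a sketch, a handler could read the recognized text like the following. The field names (result, isFinal, isLast) are those documented for SpeechRecognitionResult; double-check them against your SDK version.
// Sketch: a possible onResult body; field names per the Core Speech Kit result type.
function handleAsrResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult): void {
  // result.result carries the recognized text of the current utterance
  console.info(`[${sessionId}] text: ${result.result}`);
  if (result.isFinal) {
    console.info(`[${sessionId}] final result for the current utterance`);
  }
  if (result.isLast) {
    console.info(`[${sessionId}] last result of the session`);
  }
}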
3. Start Recognition
let sessionId: string = '123456'; // App-defined session ID; must match the sessionId received in the callbacks
let audioParam: speechRecognizer.AudioInfo = { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 };
let extraParam: Record<string, Object> = { "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000 };
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Invoke the start-recognition method
asrEngine.startListening(recognizerParams);
Main parameters for starting recognition:
- sessionId: Session ID (must correspond to the sessionId in onResult callbacks)
- audioInfo: Audio configuration (optional)
  - audioType: Currently only PCM is supported (decode MP3 files before passing audio to the engine)
  - sampleRate: Audio sampling rate (currently only 16000 is supported)
  - sampleBit: Sampling bit depth (currently only 16-bit is supported)
  - soundChannel: Audio channels (currently only mono/1 channel is supported)
  - extraParams: Audio compression rate (defaults to 0 for PCM)
- extraParams: Additional configuration
  - recognitionMode: Real-time speech recognition mode (defaults to 1 if unspecified; see the sketch below)
    - 0: Real-time recording recognition (requires ohos.permission.MICROPHONE; call finish to stop)
    - 1: Real-time audio-to-text (call writeAudio to pass the audio stream)
  - vadBegin: VAD (Voice Activity Detection) front-end point (range: [500, 10000] ms; default 10000 ms)
  - vadEnd: VAD back-end point (range: [500, 10000] ms; default 800 ms)
  - maxAudioDuration: Maximum supported audio duration
    - Short speech mode: [20000, 60000] ms (default 20000 ms)
    - Long speech mode: [20000, 8*60*60*1000] ms
VAD primarily detects speech activity and skips silent segments.
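For comparison with the writeAudio flow used in the best-practices section, the sketch below starts real-time microphone recognition (recognitionMode 0); the session ID is an arbitrary app-defined value, and the app must already hold ohos.permission.MICROPHONE.
// Sketch: real-time microphone recognition (recognitionMode 0), assuming asrEngine from above.
let micSessionId: string = 'mic-session-1'; // app-defined session ID
let micParams: speechRecognizer.StartParams = {
  sessionId: micSessionId,
  audioInfo: { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 },
  extraParams: { "recognitionMode": 0, "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 60000 }
};
asrEngine.startListening(micParams);
// Later, when the user stops speaking, end the session and wait for onResult/onComplete
asrEngine.finish(micSessionId);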
4. Pass Audio Stream
asrEngine.writeAudio(sessionId, uint8Array);
Write audio data to the engine (can be from a microphone or audio file).
Note: Audio stream length must be 640 or 1280 bytes.
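Because of that restriction, larger buffers have to be split before they are written. The helper below is a minimal sketch of our own (a real implementation should buffer or pad the trailing partial chunk instead of dropping it):
// Sketch: split a PCM buffer into 1280-byte chunks before writing to the engine (helper name is our own).
function writeAudioChunked(engine: speechRecognizer.SpeechRecognitionEngine,
                           sessionId: string, data: Uint8Array): void {
  const chunkSize = 1280; // the engine accepts writes of 640 or 1280 bytes
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const chunk = data.subarray(offset, offset + chunkSize);
    if (chunk.length === chunkSize) {
      engine.writeAudio(sessionId, chunk);
    }
    // An undersized tail is skipped here; keep it and merge it with the next buffer in real code.
  }
}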
5. Other Interfaces
- listLanguages: Query supported languages
- finish: End recognition
- cancel: Cancel recognition
- shutdown: Release engine resources
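As a rough sketch of how a session typically ends (call signatures per the Core Speech Kit reference; verify against your SDK version):
// Sketch: ending a session and releasing the engine, assuming asrEngine and sessionId from above.
asrEngine.finish(sessionId);   // stop feeding audio and request the final result
// asrEngine.cancel(sessionId); // or abandon the session without waiting for a result
asrEngine.shutdown();          // release engine resources once recognition is no longer needed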
Best Practices
For real-time recognition, read audio from the microphone, feed it to asrEngine, and handle the results in the onResult callback.
First, configure the audio capture parameters and create an AudioCapturer instance:
import { audio } from '@kit.AudioKit';
let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channels
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sample format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding type
};
let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};
let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};
// Declare the capturer in the outer scope so the later snippets can reference it
let audioCapturer: audio.AudioCapturer | undefined;
audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    audioCapturer = data;
  }
});
Note: Sampling rate, channels, and bit depth must match ASR engine requirements (16k, mono, 16-bit).
Next, subscribe to audio data read events:
let readDataCallback = (buffer: ArrayBuffer) => {
  // Forward each captured buffer to the ASR engine as a Uint8Array
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
};
audioCapturer?.on('readData', readDataCallback);
Note: Buffer size must be 640 or 1280 bytes (ASR engine restriction).
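If the system delivers readData buffers of a different size, re-chunk them with a helper like writeAudioChunked above. Finally, capture has to be started, and stopped together with the engine when the user finishes; a rough sketch, assuming the variables from the previous snippets:
// Sketch: start capturing so audio flows into readData, then stop and clean up later.
audioCapturer?.start((err) => {
  if (err) {
    console.error(`Capturer start failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Capturer started; audio is now streamed to the ASR engine.');
  }
});

// When the user finishes speaking:
// audioCapturer?.stop(() => {});  // stop capturing
// asrEngine.finish(sessionId);    // request the final recognition result
// asrEngine.shutdown();           // release the engine when it is no longer needed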
Summary
This article introduces the official HarmonyOS speech recognition capability, details the ASR engine interfaces, and demonstrates real-time microphone speech recognition by capturing audio data and handling the recognition results.