Hongmeng Next speech synthesis technology: intelligent conversion from text to sound

Hongmeng Next's speech synthesis technology delivers natural voice output through a lightweight architecture. This article analyzes the core capabilities of Core Speech Kit, walks through practical examples and optimization strategies, and helps developers build an immersive voice interaction experience.

1. Technical principles and core capabilities

(I) Breaking down the synthesis pipeline

  1. Text preprocessing: word segmentation → part-of-speech tagging → prosody analysis (e.g., identifying the stress on "really" in "the weather is really nice today")
  2. Acoustic model: generates a mel spectrogram, based on the Tacotron2 architecture
  3. Vocoder synthesis: WaveRNN converts the spectrogram into a speech waveform (the three stages are sketched below)
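To make the data flow concrete, here is a minimal TypeScript sketch of the three stages. The interface and function names (ProsodyPlan, AcousticModel, Vocoder, synthesize) are illustrative only and are not part of Core Speech Kit:

// Illustrative types only -- these interfaces are not part of Core Speech Kit.

// Stage 1: text preprocessing turns raw text into a phoneme/prosody plan
interface ProsodyPlan {
  phonemes: string[];  // segmented and phonetized input
  stresses: number[];  // per-phoneme stress levels from prosody analysis
}

// Stage 2: the acoustic model (e.g., Tacotron2) predicts a mel spectrogram
interface AcousticModel {
  predictMelSpectrogram(plan: ProsodyPlan): Float32Array[]; // frames x mel bins
}

// Stage 3: the vocoder (e.g., WaveRNN) renders the spectrogram as PCM audio
interface Vocoder {
  synthesizeWaveform(melFrames: Float32Array[]): Float32Array; // PCM samples
}

function synthesize(frontend: (text: string) => ProsodyPlan,
                    acoustic: AcousticModel,
                    vocoder: Vocoder,
                    text: string): Float32Array {
  const plan = frontend(text);                      // text preprocessing
  const mel = acoustic.predictMelSpectrogram(plan); // acoustic modeling
  return vocoder.synthesizeWaveform(mel);           // vocoding
}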

(II) Hongmeng's distinctive capabilities

| Functional module | Technical highlights | Application scenarios |
| --- | --- | --- |
| Multilingual support | One-click switching among 10+ languages, including Chinese/English/Japanese | Globalized intelligent assistants |
| Emotional voice | Supports 6 emotional modes such as happy/sad/serious | Expressive audiobook narration |
| Lightweight model | On-device model is only 4.8 MB and runs on devices with under 1 GB of memory | Smartwatches and smart-home devices |
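As a rough illustration of the first two rows, the sketch below creates an English engine and requests an emotional mode; the emotion parameter name is an assumption for illustration and should be checked against the Core Speech Kit documentation:

import { TextToSpeechEngine } from '@ohos.speech.core';

// Hypothetical sketch: switching language by recreating the lightweight engine.
// The 'emotion' field is an assumed parameter name, not confirmed API.
async function createEnglishHappyVoice() {
  const engine = await TextToSpeechEngine.create({
    modelType: 'LIGHT_WEIGHT',
    language: 'en-US'   // one of the 10+ supported languages
  });
  engine.setParameter({
    volume: 0.8,
    emotion: 'happy'    // assumed name for the emotional-mode parameter
  });
  engine.speak('Hello from the lightweight English voice.');
  return engine;
}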

2. Core Speech Kit in practice

(I) Core API calls

import { TextToSpeechEngine } from '@ohos.speech.core';

async function ttsDemo() {
  // 1. Create a lightweight engine (automatically selects the model adapted to the device)
  const engine = await TextToSpeechEngine.create({
    modelType: 'LIGHT_WEIGHT', // lightweight mode
    language: 'zh-CN'          // Mandarin Chinese
  });

  // 2. Set voice parameters
  engine.setParameter({
    pitch: 1.2,  // raise the pitch by 20%
    speed: 0.9,  // slow the speaking rate by 10%
    volume: 0.8  // 80% volume
  });

  // 3. Synthesize speech (SSML markup is supported)
  const ssmlText = '<prosody rate="slow">Welcome to Hongmeng speech synthesis technology</prosody>';
  engine.speak(ssmlText);

  // 4. Streaming synthesis (suited to long text)
  const stream = engine.createStream();
  stream.write('first paragraph text');
  setTimeout(() => stream.write('second paragraph text'), 1000);
}
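The single <prosody> line in the demo above only scratches the surface of SSML. Here is a slightly fuller example, assuming the engine accepts the common <speak>, <prosody>, and <break> tags (verify tag support in the Core Speech Kit documentation):

// Assumed: the engine accepts standard SSML <speak>, <prosody>, and <break> tags;
// verify exact tag support against the Core Speech Kit documentation.
const richSsml =
  '<speak>' +
  '<prosody rate="slow" pitch="+10%">Welcome aboard.</prosody>' +
  '<break time="500ms"/>' +
  '<prosody volume="loud">Please fasten your seat belt.</prosody>' +
  '</speak>';
engine.speak(richSsml); // same engine as in the demo above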

(II) Lightweight optimization

  1. Model compression: knowledge distillation cuts the Tacotron2 parameter count by 60%
  2. Dynamic inference: precision switches automatically with device memory (FP16 on phones, INT8 on IoT devices)
  3. Caching policy: repeated text is served straight from the audio cache, avoiding redundant synthesis (see the sketch after this list)
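A minimal sketch of the caching policy in point 3, assuming a hypothetical synthesizeToBuffer method that stands in for whatever offline-synthesis call the engine actually exposes:

import { TextToSpeechEngine } from '@ohos.speech.core';

// Sketch of an audio cache keyed by text; 'synthesizeToBuffer' is an assumed
// method name, not confirmed Core Speech Kit API.
const audioCache = new Map<string, ArrayBuffer>();

async function speakCached(engine: TextToSpeechEngine, text: string): Promise<ArrayBuffer> {
  const cached = audioCache.get(text);
  if (cached !== undefined) {
    return cached; // repeated text: serve straight from the cache
  }
  const audio = await engine.synthesizeToBuffer(text); // assumed API
  audioCache.set(text, audio); // remember the result for next time
  return audio;
}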

3. Scenario optimization and future trends

(I) Typical scenario optimization

Pain point in the smart-car scenario: in-cabin noise makes the synthesized voice hard to hear
Solution:

  1. Ambient noise detection → dynamic adjustment of synthesis volume (a continuous mapping is sketched after this list)
// Automatically raise the volume when ambient noise exceeds 60 dB
if (noiseLevel > 60) {
  engine.setVolume(1.2); // raise the volume by 20%
}
  2. Multi-microphone array noise reduction working in tandem with speech synthesis
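Rather than the single 60 dB threshold above, the volume can also track the measured noise level continuously. A sketch under assumed breakpoints (the dB values and gains below are illustrative, not from the article):

// Illustrative mapping from ambient noise (dB) to a volume multiplier.
// Breakpoints and gains are assumptions, not values from the article.
function volumeForNoise(noiseLevelDb: number): number {
  if (noiseLevelDb < 40) return 0.8; // quiet cabin: default volume
  if (noiseLevelDb < 60) return 1.0; // moderate noise: full nominal volume
  return Math.min(1.2, 1.0 + (noiseLevelDb - 60) * 0.01); // cap at +20%
}

// usage: engine.setVolume(volumeForNoise(currentNoiseDb));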

(II) Directions of technology evolution

  1. Device-cloud collaboration: the local model handles everyday conversation, while the cloud model generates complex emotional speech (routing sketched below)
  2. Personalized voice: generate a dedicated voice model from a 30-second voice sample
  3. Lip synchronization: combine with AR Engine to sync a virtual assistant's mouth movements with its speech in real time
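A sketch of the device-cloud split in point 1, assuming hypothetical localEngine and cloudEngine objects that share a speak method; the 200-character cutoff is also an assumption for illustration:

// Hypothetical routing between a local and a cloud TTS engine.
// localEngine/cloudEngine and the 200-character cutoff are assumptions.
interface SpeakingEngine { speak(text: string): void; }

function routeSpeech(localEngine: SpeakingEngine,
                     cloudEngine: SpeakingEngine,
                     text: string, wantsEmotion: boolean) {
  // Short everyday utterances stay on-device; long or emotional text goes to the cloud.
  if (wantsEmotion || text.length > 200) {
    cloudEngine.speak(text);
  } else {
    localEngine.speak(text);
  }
}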

Summary: Three Principles of Speech Synthesis

  1. Lightweight first: adapt model size dynamically to device capability
  2. Naturalness is the core: the accuracy of prosody analysis sets the ceiling of the user experience
  3. Scenario customization: in-vehicle, home, and other scenarios need targeted parameter tuning
