GLM-TTS Complete Guide 2025: Revolutionary Zero-Shot Voice Cloning with Reinforcement Learning

🎯 Core Highlights (TL;DR)

  • Open-Source Excellence: GLM-TTS achieves the lowest Character Error Rate (0.89) among open-source TTS models while maintaining high speaker similarity
  • Zero-Shot Capability: Clone any voice with just 3-10 seconds of audio prompt without fine-tuning
  • RL-Enhanced Emotions: Multi-reward reinforcement learning framework delivers more natural and expressive speech compared to traditional TTS systems
  • Production-Ready: Supports streaming inference, bilingual processing (Chinese/English), and phoneme-level pronunciation control
  • Active Development: Released December 11, 2025, with ongoing updates including 2D Vocos vocoder and RL-optimized weights

Table of Contents

  1. What is GLM-TTS?
  2. Key Features and Capabilities
  3. System Architecture Explained
  4. How Does Reinforcement Learning Improve TTS?
  5. Performance Benchmarks
  6. Installation and Quick Start
  7. Use Cases and Applications
  8. Comparison with Other TTS Models
  9. Common Issues and Solutions
  10. FAQ

What is GLM-TTS?

GLM-TTS (General Language Model Text-to-Speech) is a cutting-edge, open-source text-to-speech synthesis system developed by Zhipu AI's CogAudio Group. Released in December 2025, it represents a significant advancement in voice cloning technology by combining large language models with reinforcement learning optimization.

Core Innovation

Unlike traditional TTS systems that struggle with emotional expressiveness, GLM-TTS introduces a Multi-Reward Reinforcement Learning framework that evaluates generated speech across multiple dimensions:

  • Sound quality and naturalness
  • Speaker similarity
  • Emotional expression
  • Pronunciation accuracy (Character Error Rate)
  • Prosody and rhythm

💡 Key Advantage
GLM-TTS achieves a Character Error Rate of 0.89 with RL optimization - the best among open-source models and competitive with commercial systems like MiniMax (0.83 CER).

Key Features and Capabilities

1. Zero-Shot Voice Cloning

What it means: Clone any speaker's voice without training or fine-tuning

Requirements:

  • 3-10 seconds of prompt audio
  • No speaker-specific model training needed
  • Works with any voice sample

Technical approach (a minimal sketch follows the list):

  • Extracts speaker embeddings using the CAM++ model (campplus.onnx)
  • Conditions the generation process on these embeddings
  • Maintains voice characteristics across different text inputs
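
To make the conditioning step concrete, here is a minimal sketch of extracting a speaker embedding from a prompt clip with the campplus.onnx model that ships in frontend/. The feature pipeline (16 kHz mono, 80-dim Kaldi fbank, mean normalization) and the function name are assumptions for illustration, not the repository's exact preprocessing.

```python
# Hedged sketch: speaker-embedding extraction with the CAM++ ONNX model.
# Feature settings (16 kHz, 80-dim fbank, mean normalization) are assumptions.
import onnxruntime as ort
import torch
import torchaudio

def extract_speaker_embedding(prompt_wav: str,
                              onnx_path: str = "frontend/campplus.onnx") -> torch.Tensor:
    waveform, sr = torchaudio.load(prompt_wav)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    # Kaldi-style log-mel filterbank features as the embedding model's input
    feats = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80,
                                              sample_frequency=16000)
    feats = feats - feats.mean(dim=0, keepdim=True)
    session = ort.InferenceSession(onnx_path)
    input_name = session.get_inputs()[0].name
    embedding = session.run(None, {input_name: feats.unsqueeze(0).numpy()})[0]
    return torch.from_numpy(embedding)  # used as the conditioning signal
```

The resulting embedding is what ties the generated speech to the prompt speaker's voice, regardless of the input text.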

2. RL-Enhanced Emotion Control

The system uses the GRPO (Group Relative Policy Optimization) algorithm with multiple reward functions, summarized below:

| Reward Type | Purpose | Impact |
| --- | --- | --- |
| Similarity | Match speaker characteristics | High speaker fidelity |
| CER (Character Error Rate) | Pronunciation accuracy | Reduced from 1.03 to 0.89 |
| Emotion | Natural emotional expression | More expressive speech |
| Laughter | Appropriate laugh insertion | Enhanced naturalness |

3. Phoneme-Level Control (Phoneme-in)

Problem solved: pronunciation ambiguity in polyphones (characters with multiple readings) and rare characters

Example: The Chinese character "行" can be pronounced as xíng or háng depending on context

Solution: Hybrid Phoneme + Text input mechanism

Workflow:
1. Global G2P (Grapheme-to-Phoneme) conversion
2. Dynamic dictionary lookup for polyphones
3. Targeted phoneme replacement
4. Hybrid input generation (a minimal code sketch follows)
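
To illustrate the hybrid text + phoneme idea, here is a toy sketch of steps 2-4. The {word|pinyin} annotation format and the tiny dictionary are hypothetical; the project's real replacement rules live in configs/custom_replace.jsonl and utils/glm_g2p.py.

```python
# Toy illustration of targeted phoneme replacement for polyphones.
# The "{word|pinyin}" markup and the dictionary contents are assumptions.
POLYPHONE_DICT = {
    "银行": "yin2 hang2",    # "bank": force háng, not xíng
    "行长": "hang2 zhang3",  # "bank manager"
}

def to_hybrid_input(text: str) -> str:
    for word, pinyin in POLYPHONE_DICT.items():
        text = text.replace(word, "{" + word + "|" + pinyin + "}")
    return text

print(to_hybrid_input("他去银行办事"))
# -> 他去{银行|yin2 hang2}办事
```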

⚠️ Use Case Specificity
Phoneme-level control is particularly valuable for:

  • Educational content and assessments
  • Audiobook production
  • Language learning applications
  • Technical documentation with specialized terminology

4. Streaming Inference Support

  • Real-time audio generation
  • Suitable for interactive applications
  • Low-latency processing
  • Ideal for conversational AI and virtual assistants

5. Bilingual Support

  • Primary: Chinese language
  • Secondary: English
  • Mixed text processing capability
  • Text normalization for both languages

System Architecture Explained

GLM-TTS employs a sophisticated two-stage architecture:

Stage 1: LLM-Based Token Generation

Model: Llama-based architecture
Input: Text (with optional phoneme annotations)
Output: Speech token sequences
Modes supported:

  • Pretrained (PRETRAIN)
  • Fine-tuning (SFT)
  • LoRA (Low-Rank Adaptation)

Stage 2: Flow Matching for Waveform Synthesis

Components:

  1. DiT (Diffusion Transformer): Converts tokens to mel-spectrograms
  2. Vocoder: Generates final audio waveforms
    • Vocos vocoder (current)
    • 2D Vocos vocoder (coming soon)
    • Hift vocoder (alternative)

Architecture Visualization

```text
Text Input → Frontend Processing → LLM (Token Generation)
    ↓
Speech Tokens → Flow Matching Model → Mel-Spectrogram
    ↓
Vocoder → Audio Waveform Output

[Parallel Path]
Prompt Audio → Speaker Embedding Extraction → Conditioning Signal
```
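
The same flow can be written as a deliberately simplified call sequence. Everything below is a stubbed sketch of the data flow only; the real implementations live in llm/glmtts.py, flow/, and the vocoder code, and none of these class or method names are the repository's actual API.

```python
# Stubbed sketch of the two-stage pipeline: LLM -> speech tokens -> flow
# matching -> mel-spectrogram -> vocoder -> waveform. Shapes are illustrative.
import numpy as np

class StubLLM:
    def generate(self, text, spk_emb):
        # Stage 1: autoregressive token generation conditioned on the speaker
        return np.zeros(100, dtype=np.int64)

class StubFlowMatching:
    def tokens_to_mel(self, tokens, spk_emb):
        # Stage 2a: DiT / flow matching converts tokens into mel frames
        return np.zeros((len(tokens) * 2, 80), dtype=np.float32)

class StubVocoder:
    def mel_to_wav(self, mel):
        # Stage 2b: Vocos (or HiFT) vocoder renders the final waveform
        return np.zeros(mel.shape[0] * 256, dtype=np.float32)

def synthesize(text, spk_emb):
    tokens = StubLLM().generate(text, spk_emb)
    mel = StubFlowMatching().tokens_to_mel(tokens, spk_emb)
    return StubVocoder().mel_to_wav(mel)

print(len(synthesize("你好，世界", spk_emb=None)))  # placeholder sample count
```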

📊 Technical Specifications

  • VRAM requirement: ~8GB for inference
  • Supported Python versions: 3.10 - 3.12
  • Model size: Multiple components totaling several GB
  • Inference speed: Supports real-time streaming

How Does Reinforcement Learning Improve TTS?

Traditional TTS systems often produce flat, emotionless speech. GLM-TTS addresses this through a multi-reward RL framework:

The GRPO Training Process

  1. Generation Phase

    • Model generates multiple speech candidates for the same text
    • Each candidate is synthesized through the full pipeline
  2. Reward Computation

    • Distributed reward server evaluates each candidate
    • Multiple reward functions run in parallel
    • Token-level rewards provide fine-grained feedback
  3. Policy Optimization

    • GRPO algorithm compares candidates within each group
    • Updates LLM policy to favor higher-reward generations
    • Balances multiple objectives simultaneously (see the sketch below)
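
A minimal sketch of the group-relative idea, assuming a simple weighted combination of the reward dimensions listed earlier. The weights, reward fields, and function names are illustrative; the project's actual implementations are in grpo/reward_func.py, grpo/reward_server.py, and grpo/grpo_utils.py.

```python
# Sketch of GRPO-style advantages over a group of candidates for one text.
# Reward weights and fields are assumptions for illustration only.
import numpy as np

def combined_reward(candidate: dict) -> float:
    return (0.4 * candidate["similarity"]      # speaker similarity (higher is better)
            + 0.4 * (1.0 - candidate["cer"])   # pronunciation accuracy (lower CER is better)
            + 0.2 * candidate["emotion"])      # emotional expressiveness score

def group_relative_advantages(candidates: list[dict]) -> np.ndarray:
    # Standardize rewards within the group so the policy update favors
    # candidates that beat the group average for the same input text.
    rewards = np.array([combined_reward(c) for c in candidates])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

group = [
    {"similarity": 0.76, "cer": 0.010, "emotion": 0.6},
    {"similarity": 0.78, "cer": 0.009, "emotion": 0.7},
    {"similarity": 0.74, "cer": 0.015, "emotion": 0.5},
]
print(group_relative_advantages(group))  # positive = better than group average
```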

Measurable Improvements

| Metric | Base Model | RL-Optimized | Improvement |
| --- | --- | --- | --- |
| CER | 1.03 | 0.89 | 13.6% reduction |
| Similarity | 76.1 | 76.4 | +0.3 points |
| Expressiveness | Baseline | Enhanced | Qualitative |

✅ Best Practice
The RL-optimized model (GLM-TTS_RL) is recommended for production use when emotional expressiveness is critical, while the base model may be sufficient for straightforward narration tasks.

Performance Benchmarks

Seed-TTS-Eval Chinese Test Set Results

Evaluated without phoneme flag to maintain consistency with original benchmarks:

| Model | CER ↓ | SIM ↑ | Open Source | Notes |
| --- | --- | --- | --- | --- |
| GLM-TTS_RL | 0.89 | 76.4 | ✅ Yes | Best open-source CER |
| VoxCPM | 0.93 | 77.2 | ✅ Yes | Strong similarity |
| GLM-TTS Base | 1.03 | 76.1 | ✅ Yes | Pre-RL baseline |
| IndexTTS2 | 1.03 | 76.5 | ✅ Yes | Comparable CER |
| DiTAR | 1.02 | 75.3 | ❌ No | Closed source |
| CosyVoice3 | 1.12 | 78.1 | ❌ No | Higher similarity |
| Seed-TTS | 1.12 | 79.6 | ❌ No | Best similarity |
| MiniMax | 0.83 | 78.3 | ❌ No | Best overall CER |
| F5-TTS | 1.53 | 76.0 | ✅ Yes | Open alternative |
| CosyVoice2 | 1.38 | 75.7 | ✅ Yes | Open alternative |

Key Findings

  • GLM-TTS_RL leads all open-source models in pronunciation accuracy (CER)
  • Only 0.06 points behind the best commercial model (MiniMax)
  • Maintains competitive speaker similarity scores
  • Significantly outperforms other open-source alternatives

Installation and Quick Start

Prerequisites

  • Python 3.10, 3.11, or 3.12
  • ~8GB VRAM for inference
  • Git and pip installed
  • CUDA-compatible GPU recommended (CPU inference possible but slower)

Step 1: Clone Repository

```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
```

Step 2: Install Dependencies

```bash
pip install -r requirements.txt
```

⚠️ Common Installation Issue
Users on Linux may encounter problems with WeTextProcessing/cython/pynini.

Solution:

```bash
# Comment out WeTextProcessing in requirements.txt, then:
pip install -r requirements.txt
pip install WeTextProcessing
pip install soxr
```

Step 3: Download Pre-trained Models

Option A: HuggingFace

```bash
mkdir -p ckpt
pip install -U huggingface_hub
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt
```

Option B: ModelScope (China)

```bash
mkdir -p ckpt
pip install -U modelscope
modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt
```

Step 4: Run Inference

Command Line:

```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add the --phoneme flag for phoneme-level control
```

Interactive Web Interface:

```bash
python tools/gradio_app.py
```

Step 5 (Optional): Install RL Components

For training or advanced features:

```bash
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation
# Download wavlm_large_finetune.pth to grpo/ckpt/
```

Use Cases and Applications

1. Content Creation

  • Audiobook production: Phoneme control for accurate pronunciation
  • Podcast generation: Natural, expressive narration
  • Video voiceovers: Quick voice cloning for character consistency

2. Educational Technology

  • Language learning: Accurate pronunciation modeling
  • E-learning platforms: Engaging, emotional narration
  • Assessment tools: Pronunciation evaluation reference

3. Accessibility

  • Screen readers: More natural voice output
  • Assistive communication: Personalized voice synthesis
  • Text-to-speech for visually impaired users

4. Entertainment

  • Game character voices: Zero-shot voice cloning for NPCs
  • Virtual influencers: Consistent voice identity
  • Interactive storytelling: Emotional voice adaptation

5. Enterprise Applications

  • Customer service bots: Natural conversation flow
  • IVR systems: Professional voice synthesis
  • Internal training materials: Consistent narration

Comparison with Other TTS Models

GLM-TTS vs. CosyVoice2

| Aspect | GLM-TTS | CosyVoice2 |
| --- | --- | --- |
| CER | 0.89 (RL) / 1.03 (base) | 1.38 |
| Architecture | LLM + Flow | Different approach |
| RL Optimization | ✅ Yes | ❌ No |
| Open Source | ✅ Full | ✅ Full |
| Phoneme Control | ✅ Hybrid input | Limited |

GLM-TTS vs. F5-TTS

| Aspect | GLM-TTS | F5-TTS |
| --- | --- | --- |
| CER | 0.89 | 1.53 |
| Memory Usage | ~8GB VRAM | Lower |
| Emotional Expression | RL-enhanced | Standard |
| Streaming | ✅ Yes | ✅ Yes |
| Language Support | CN/EN | Varies |

GLM-TTS vs. Commercial Models (Seed-TTS, MiniMax)

Advantages of GLM-TTS:

  • ✅ Fully open-source
  • ✅ Self-hostable
  • ✅ No API costs
  • ✅ Privacy control
  • ✅ Customizable

Advantages of Commercial Models:

  • Slightly better CER (MiniMax: 0.83 vs GLM-TTS: 0.89)
  • Higher similarity scores (Seed-TTS: 79.6 vs GLM-TTS: 76.4)
  • Managed infrastructure
  • No local hardware requirements

💡 Decision Framework
Choose GLM-TTS if you need:

  • Full control over the model
  • Privacy for sensitive content
  • Cost savings at scale
  • Customization capabilities

Choose commercial models if you need:

  • Absolute best quality
  • Zero infrastructure management
  • Immediate deployment

Common Issues and Solutions

Issue 1: Installation Failures on Linux

Symptom: Errors with WeTextProcessing, cython, or pynini during pip install -r requirements.txt

Solution:

```bash
# Edit requirements.txt to comment out WeTextProcessing
pip install -r requirements.txt
pip install WeTextProcessing
pip install soxr
```

Confirmed working on: Linux/WSL with conda Python 3.12

Issue 2: Online Demo Returns 404

Symptom: The link to audio.z.ai demo is not accessible

Status: Demo infrastructure not yet deployed (as of December 11, 2025)

Workaround: Use local Gradio interface:

```bash
python tools/gradio_app.py
```

Issue 3: Contractions Expanded in Output

Symptom: "I'm" becomes "I am", "don't" becomes "do not" in generated audio

Cause: Model trained to expand contractions for clarity

Workaround:

  • Pre-process text to expand contractions yourself (see the helper sketch below)
  • Or accept this behavior as designed (similar to Star Trek's Data character)
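
If you go the pre-processing route, a tiny and deliberately incomplete helper like the one below keeps your script aligned with the audio; the mapping is an illustrative subset, not a full contraction list.

```python
# Hedged helper: expand a few common English contractions before synthesis so
# the written script matches what the model will say. Extend the map as needed.
CONTRACTIONS = {
    "I'm": "I am", "don't": "do not", "can't": "cannot",
    "it's": "it is", "won't": "will not", "we're": "we are",
}

def expand_contractions(text: str) -> str:
    for short, long in CONTRACTIONS.items():
        text = text.replace(short, long)
    return text

print(expand_contractions("I'm sure it's fine, don't worry."))
# -> I am sure it is fine, do not worry.
```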

Issue 4: Chinese Accent in English Output

Symptom: English speech has noticeable Chinese accent

Cause: Model primarily trained on Chinese data with English as secondary language

Expected behavior: the accent is comparable to that of a native Chinese speaker who has lived in an English-speaking country for a few years

Mitigation:

  • Use English-native prompt audio
  • Consider fine-tuning on English-heavy datasets
  • Or use specialized English TTS models for accent-critical applications

Issue 5: Special Characters Cause Output Issues

Symptom: A single underscore (_) or other special characters can cause the rest of the output to go haywire

Cause: Frontend text processing limitations

Solution:

  • Pre-process text to remove or replace special characters (a small sanitizer sketch follows this list)
  • Use text normalization utilities in cosyvoice/cli/frontend.py
  • Report specific cases to the GitHub repository
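
A minimal pre-processing pass along these lines can help; the character set it strips is an assumption based on the reported underscore issue, so adjust it to your own content.

```python
# Hedged text sanitizer for Issue 5: drop characters reported to derail output.
import re

def sanitize_for_tts(text: str) -> str:
    text = text.replace("_", " ")                     # underscores reportedly break output
    text = re.sub(r"[*#`~^<>\[\]{}|\\]", " ", text)   # strip markdown/HTML-ish symbols
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace

print(sanitize_for_tts("GLM_TTS **sounds** ~great~"))
# -> GLM TTS sounds great
```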

Issue 6: High VRAM Usage

Symptom: ~8GB VRAM required, limiting accessibility

Context: This is expected for the full model pipeline

Alternatives for lower VRAM:

  • Use quantized models (when available)
  • Consider lighter alternatives like Kokoro or F5-TTS
  • Use CPU inference (slower but possible)

Issue 7: Suspicious File Warning on HuggingFace

Symptom: "This model has 1 file scanned as suspicious" - pickle imports detected on generator_jit.ckpt

Explanation: PyTorch pickle files can contain arbitrary code

Status: Team needs to convert pickles to safetensors format

Risk mitigation:

  • Download from official sources only
  • Review code before running
  • Use in isolated environments
  • Wait for safetensors conversion

Troubleshooting: No Streaming Code in Repository

Question from community: "It says it can be used for realtime streaming. I don't see code for that in the repo. Anyone know how to do that?"

Current status:

  • Streaming capability mentioned in documentation
  • Implementation details in flow/flow.py (Streaming Flow model)
  • Specific streaming inference examples not yet provided

Recommendation:

  • Check flow/flow.py for streaming implementation
  • Monitor GitHub issues for community solutions
  • Consider contributing streaming examples to the project

🤔 Frequently Asked Questions (FAQ)

Q: What languages does GLM-TTS support?

A: GLM-TTS primarily supports Chinese with secondary support for English. It can handle mixed Chinese-English text. For other languages, the model does not have native support, though some users have experimented with phoneme input using espeak-ng to output IPA (International Phonetic Alphabet). However, the tokenizer is optimized for Pinyin (Chinese phonemes), so results for other languages may be unpredictable.

Q: How much VRAM do I need to run GLM-TTS?

A: Approximately 8GB VRAM is required for inference with the full model pipeline. This includes:

  • LLM for token generation
  • Flow model for mel-spectrogram conversion
  • Vocoder for waveform synthesis

For lower VRAM systems, consider using CPU inference (slower) or waiting for quantized model releases.
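
As a quick sanity check before downloading several gigabytes of weights, you can verify your GPU's capacity (assuming a PyTorch + CUDA setup):

```python
# Check available GPU memory against the ~8GB guideline for the full pipeline.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA GPU detected; CPU inference is possible but much slower.")
```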

Q: Can I fine-tune GLM-TTS for a specific voice or language?

A: Yes, the model supports multiple training modes:

  • LoRA (Low-Rank Adaptation): Efficient fine-tuning for specific voices
  • SFT (Supervised Fine-Tuning): Full model fine-tuning
  • Pretrained mode: Use as-is without fine-tuning

Configuration files are provided in the configs/ directory. However, detailed fine-tuning tutorials are not yet available in the documentation.

Q: How does GLM-TTS compare to ElevenLabs?

A: Quality: ElevenLabs still leads in overall naturalness and emotional range, but GLM-TTS is competitive, especially with RL optimization.

Language support: ElevenLabs supports 29+ languages, while GLM-TTS focuses on Chinese and English.

Cost: GLM-TTS is free and open-source; ElevenLabs is a paid service.

Privacy: GLM-TTS can be self-hosted for complete data control.

Customization: GLM-TTS offers full model access for customization.

Q: What's the difference between GLM-TTS and GLM-TTS_RL?

A:

  • GLM-TTS (Base): The pre-trained model without reinforcement learning optimization

    • CER: 1.03
    • Similarity: 76.1
    • Standard emotional expressiveness
  • GLM-TTS_RL: The same model after multi-reward RL optimization

    • CER: 0.89 (13.6% improvement)
    • Similarity: 76.4
    • Enhanced emotional expressiveness and prosody

Recommendation: Use GLM-TTS_RL for production applications where quality is critical.

Q: Is GLM-TTS suitable for real-time applications?

A: Yes, GLM-TTS supports streaming inference, making it suitable for:

  • Interactive voice assistants
  • Real-time conversation systems
  • Live narration applications

However, actual latency depends on hardware capabilities. With adequate GPU resources, real-time performance is achievable.

Q: How do I control pronunciation of specific words?

A: Use the Phoneme-in mechanism:

  1. Enable phoneme mode: --phoneme flag
  2. Use hybrid input format: mix text with phoneme annotations
  3. Configure custom pronunciation in configs/custom_replace.jsonl
  4. The system will use your specified phonemes for marked words while processing the rest normally

This is particularly useful for:

  • Polyphones (words with multiple pronunciations)
  • Rare characters
  • Technical terminology
  • Proper nouns

Q: Can I use GLM-TTS commercially?

A: The model is open-source and released on GitHub and HuggingFace. Check the repository's LICENSE file for specific terms. Generally, open-source models allow commercial use, but:

  • Verify the license terms
  • Note that prompt audio examples in the repository are marked "for research use only"
  • Ensure your use case complies with any restrictions

Q: What's coming next for GLM-TTS?

A: According to the project roadmap:

  • 2D Vocos vocoder update (in progress)
  • RL-optimized model weights (coming soon)
  • Potential for additional language support
  • Community contributions for streaming examples
  • Improved documentation and tutorials

Q: How can I contribute to the project?

A: The project welcomes contributions:

  • Report issues on GitHub
  • Submit pull requests for bug fixes or features
  • Share your use cases and results
  • Contribute to documentation
  • Help with language support expansion

Repository: https://github.com/zai-org/GLM-TTS

Community Reception and Feedback

Positive Reactions

From the Reddit discussion on r/LocalLLaMA:

"How many models are you guys gonna release! This is insane in a good way!" - Community excitement about ZAI's rapid release pace

"Kudos GLM team, keep it up guys." - Appreciation for open-source contributions

Concerns and Requests

  • Language support: Multiple users requested support beyond Chinese and English
  • Installation complexity: Several users spent hours troubleshooting dependencies
  • Documentation gaps: Lack of clear examples and demos initially
  • Model abandonment fears: Community hopes the project remains actively maintained, citing other abandoned TTS projects

Comparison with Other Models

Community members actively discussed GLM-TTS in context of:

  • Qwen2.5-Omni: Another multimodal model with TTS capabilities
  • Chatterbox: Praised for multilingual support
  • VoxCPM: Noted for LoRA fine-tuning capabilities
  • Kokoro and F5-TTS: Compared for memory efficiency

Best Practices for Using GLM-TTS

1. Prompt Audio Selection

✅ Do:

  • Use clean, high-quality audio (16kHz or higher)
  • Choose 3-10 seconds of clear speech
  • Select audio with consistent volume
  • Prefer single-speaker recordings

❌ Don't:

  • Use audio with background noise
  • Use multi-speaker recordings
  • Use music or non-speech audio
  • Use heavily compressed audio
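
A small helper (a sketch, assuming torchaudio is installed) can flag prompt clips that violate the guidelines above before you spend time on synthesis:

```python
# Hedged prompt-audio checker: duration, sample rate, and channel count.
import torchaudio

def check_prompt(path: str) -> None:
    info = torchaudio.info(path)
    duration = info.num_frames / info.sample_rate
    if not 3.0 <= duration <= 10.0:
        print(f"Warning: prompt is {duration:.1f}s; 3-10 seconds is recommended.")
    if info.sample_rate < 16000:
        print(f"Warning: sample rate is {info.sample_rate} Hz; 16 kHz or higher is recommended.")
    if info.num_channels > 1:
        print("Warning: prompt is not mono; consider downmixing to a single speaker channel.")

check_prompt("prompt.wav")  # replace with the path to your own clip
```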

2. Text Preparation

✅ Do:

  • Normalize text (remove special characters)
  • Use proper punctuation for prosody
  • Expand abbreviations
  • Use phoneme annotations for ambiguous words

❌ Don't:

  • Include markdown or HTML formatting
  • Use excessive special characters
  • Expect contractions to be preserved (the model expands them to full forms)
  • Mix too many languages in one sentence

3. Performance Optimization

  • Use caching: Enable --use_cache flag to avoid reprocessing
  • Batch processing: Process multiple texts together when possible
  • GPU selection: Use CUDA-compatible GPU for best performance
  • Model selection: Use base model for simple narration, RL model for expressive content

4. Quality Assurance

  • Listen to outputs: Always review generated audio
  • Test edge cases: Verify pronunciation of numbers, dates, abbreviations
  • Compare speakers: Test with different prompt audio to find best match
  • Iterate on text: Adjust punctuation and phrasing for better prosody

Technical Deep Dive: Project Structure

Understanding the codebase organization:

```text
GLM-TTS/
├── glmtts_inference.py          # Main entry point
├── configs/                     # Configuration files
│   ├── spk_prompt_dict.yaml     # Speaker prompts
│   ├── G2P_*.json               # Phoneme conversion
│   └── custom_replace.jsonl     # Custom rules
├── llm/
│   └── glmtts.py                # LLM implementation
├── flow/
│   ├── dit.py                   # Diffusion Transformer
│   ├── flow.py                  # Streaming Flow model
│   └── modules.py               # Flow components
├── grpo/                        # Reinforcement Learning
│   ├── grpo_utils.py            # GRPO algorithm
│   ├── reward_func.py           # Reward functions
│   ├── reward_server.py         # Distributed rewards
│   └── train_ds_grpo.py         # Training script
├── cosyvoice/
│   └── cli/frontend.py          # Text/audio preprocessing
├── frontend/
│   ├── campplus.onnx            # Speaker embedding
│   └── cosyvoice_frontend.yaml  # Frontend config
└── tools/
    ├── gradio_app.py            # Web interface
    └── ffmpeg_speech_control.py # Audio processing
```

Key Components to Explore

  1. For inference customization: glmtts_inference.py
  2. For phoneme control: utils/glm_g2p.py and configs/G2P_*.json
  3. For RL training: grpo/train_ds_grpo.py
  4. For frontend modifications: cosyvoice/cli/frontend.py
  5. For streaming: flow/flow.py

Conclusion and Recommendations

Key Takeaways

  1. GLM-TTS sets a new standard for open-source TTS with its 0.89 CER, outperforming all other open-source alternatives
  2. Reinforcement learning makes a measurable difference in both quality metrics and emotional expressiveness
  3. Zero-shot voice cloning works effectively with just 3-10 seconds of prompt audio
  4. The project is actively developed with a clear roadmap and responsive community

Who Should Use GLM-TTS?

Ideal for:

  • Developers building voice applications in Chinese or English
  • Content creators needing high-quality voice synthesis
  • Researchers exploring TTS and RL techniques
  • Organizations requiring self-hosted, privacy-preserving TTS
  • Projects where pronunciation accuracy is critical

Consider alternatives if:

  • You need support for languages beyond Chinese/English
  • You have very limited VRAM (<8GB)
  • You need the absolute highest quality (consider commercial options)
  • You want a more mature, extensively documented solution

Next Steps

  1. Try the demo: Install locally and test with your use case
  2. Join the community: Follow the GitHub repository for updates
  3. Experiment with RL model: Compare base vs. RL-optimized versions
  4. Explore phoneme control: Test pronunciation accuracy for your domain
  5. Contribute back: Share your findings, report issues, or submit improvements

Resources

Citation

If you use GLM-TTS in your research or projects, please cite:

```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}
```

Last Updated: December 11, 2025

Model Version: GLM-TTS v1.0 (Base and RL-optimized)

Status: Active development with upcoming 2D Vocos vocoder update

💡 Stay Updated
Star the GitHub repository to receive notifications about new releases, including the upcoming RL-optimized weights and 2D Vocos vocoder improvements.

