OpenAI’s Whisper is well known for its robust multilingual transcription and English-targeted translation. But what if we want to directly translate Japanese speech into Chinese? In this project, I adapted Whisper’s tiny and base models to perform Japanese-to-Chinese speech translation — a task Whisper doesn’t support out of the box.
🎯 Motivation
Japanese media like anime, drama, and films are hugely popular among Chinese-speaking audiences. However, most existing translation pipelines either route through English or require large GPU resources.
I wanted to explore a low-resource solution that:
- Translates directly from Japanese to Chinese
- Can run on CPU-only or edge devices
📁 Dataset: ScreenTalk-JA2ZH
To fine-tune Whisper, I created a domain-specific dataset of Japanese audiovisual content with aligned Chinese subtitles.
- 🎬 Domains: Japanese films, TV dramas, anime
- ⏱️ Size: 582h train / 73h val / 73h test
- 📎 Format: 16kHz mono WAV + Simplified Chinese subtitles
- ✅ Sentence-level alignment, cleaned and manually verified
🔗 A smaller version is publicly available:
👉 ScreenTalk-JA2ZH-XS on Hugging Face
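If you want to experiment with the XS subset, a minimal loading sketch might look like the one below. The repo id is a placeholder — check the Hub page for the exact path.

```python
def load_screentalk_xs(repo_id="your-org/ScreenTalk-JA2ZH-XS"):
    """Load the public XS subset from the Hugging Face Hub.

    repo_id is a placeholder; substitute the actual repo path from the Hub.
    """
    from datasets import load_dataset, Audio

    ds = load_dataset(repo_id)
    # Decode audio at the 16 kHz mono rate Whisper expects (matches the
    # dataset format described above).
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    return ds
```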
🛠️ Fine-Tuning Setup
| Hyperparameter | Value |
|---|---|
| Epochs | 20 |
| Learning rate | 3e-4 |
| Precision | fp16 |
| Batch size (tiny / base) | 96 / 64 |
| Eval strategy | Step-based |
| Early stopping | Patience = 5 |
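In code, these settings correspond roughly to the keyword arguments one would pass to transformers' `Seq2SeqTrainingArguments` — a sketch, not the exact training script; `output_dir` and the best-model metric are assumptions, and older transformers versions spell `eval_strategy` as `evaluation_strategy`.

```python
# Hyperparameters from the table above as Seq2SeqTrainingArguments kwargs.
TINY_TRAIN_ARGS = {
    "output_dir": "whisper-tiny-ja2zh",  # placeholder
    "num_train_epochs": 20,
    "learning_rate": 3e-4,
    "fp16": True,
    "per_device_train_batch_size": 96,   # 64 for the base model
    "eval_strategy": "steps",            # step-based evaluation
    "load_best_model_at_end": True,      # needed for early stopping
    "metric_for_best_model": "bleu",     # assumption: select on BLEU
}

def build_training_args(**overrides):
    """Build Seq2SeqTrainingArguments from the dict above.

    Pair with EarlyStoppingCallback(early_stopping_patience=5) when
    constructing the Seq2SeqTrainer.
    """
    from transformers import Seq2SeqTrainingArguments
    return Seq2SeqTrainingArguments(**{**TINY_TRAIN_ARGS, **overrides})
```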
I fine-tuned both Whisper tiny and Whisper base with the same training pipeline.
📈 Results
🔸 Whisper Tiny
- ✅ Lightweight, fast
- ❌ BLEU ≈ 0.60
- ❌ Prone to overfitting and semantic drift in long/complex speech
🔹 Whisper Base
- ✅ BLEU = 0.7179
- ✅ Stronger generalization and fluency
- ✅ Suitable for CPU deployment (edge ready)
👉 BLEU steadily improved even as token-level validation loss increased — a reminder that loss is not always a good proxy for translation quality.
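For monitoring BLEU during evaluation, one option is sacrebleu with its Chinese tokenizer — an assumption on my part, chosen because word-level BLEU is unreliable on unsegmented Chinese text:

```python
def bleu_zh(hypotheses, references):
    """Corpus BLEU for Chinese output using sacrebleu's 'zh' tokenizer.

    A sketch: plug this into the Trainer's compute_metrics after decoding
    predictions and labels back to text with the Whisper tokenizer.
    """
    import sacrebleu
    return sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh").score
```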
🤔 Key Takeaways
- Whisper can be adapted for non-English language pairs
- Domain-specific data (like anime or TV) greatly improves model performance
- Model capacity matters: Tiny is efficient but not enough for expressive, noisy domains
- BLEU is limited — future work should include COMET, chrF, or human evals
🔮 What’s Next?
- Fine-tune larger Whisper models (medium, large)
- Try LoRA or other parameter-efficient tuning techniques
- Expand dataset to cover conversational, technical, and news speech
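For the LoRA direction, a minimal setup with the peft library might look like this — rank, alpha, and target modules are illustrative guesses, not tuned values:

```python
def wrap_with_lora(model, rank=8):
    """Attach LoRA adapters to a Whisper model's attention projections.

    A sketch using peft; module names and hyperparameters are assumptions.
    """
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["q_proj", "v_proj"],  # Whisper attention projections
        lora_dropout=0.05,
    )
    return get_peft_model(model, config)
```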
🚀 Try It Out
🧠 Models available on Hugging Face:
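A minimal inference sketch is below; the model id is a placeholder for the actual checkpoint on the Hub. Since Whisper's built-in "translate" task targets English only, the fine-tuned model is steered toward Chinese output via the decoder's language token:

```python
def translate_ja2zh(audio_path, model_id="your-username/whisper-base-ja2zh"):
    """Run a fine-tuned Whisper checkpoint for JA -> ZH speech translation.

    A sketch: model_id is a placeholder; substitute the real repo path.
    """
    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    import librosa

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # 16 kHz mono, matching the training data format.
    speech, _ = librosa.load(audio_path, sr=16_000)
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

    # Force Chinese decoder tokens instead of Whisper's English-only
    # "translate" task.
    forced_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
    generated = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```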
Thanks to the open-source Whisper community and everyone working to break language barriers with AI.
👉 Follow me for more multilingual AI experiments!