冯键(FENG JIAN)
Fine-Tuning Whisper for Japanese-to-Chinese Speech Translation — A Lightweight Approach

OpenAI’s Whisper is well known for its robust multilingual transcription and English-targeted translation. But what if we want to directly translate Japanese speech into Chinese? In this project, I adapted Whisper’s tiny and base models to perform Japanese-to-Chinese speech translation — a task Whisper doesn’t support out of the box.
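
Mechanically, the adaptation hinges on Whisper's decoder prompt: the model picks its output language from special <|language|><|task|> tokens. Below is a minimal sketch of that idea using the Hugging Face transformers API; it reflects the standard Whisper fine-tuning recipe, not necessarily the exact code used in this project.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load a pretrained checkpoint (whisper-base shown; tiny works the same way).
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Whisper chooses its output language from <|language|><|task|> prompt tokens.
# Forcing the Chinese language token while feeding Japanese audio is the
# behavior that fine-tuning then teaches the model to honor.
forced_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
model.generation_config.forced_decoder_ids = forced_ids
```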

🎯 Motivation

Japanese media like anime, drama, and films are hugely popular among Chinese-speaking audiences. However, most existing translation pipelines either route through English or require large GPU resources.

I wanted to explore a low-resource solution that:

  • Translates directly from Japanese to Chinese
  • Can run on CPU-only or edge devices

📁 Dataset: ScreenTalk-JA2ZH

To fine-tune Whisper, I created a domain-specific dataset of Japanese audiovisual content with aligned Chinese subtitles.

  • 🎬 Domains: Japanese films, TV dramas, anime
  • ⏱️ Size: 582h train / 73h val / 73h test
  • 📎 Format: 16kHz mono WAV + Simplified Chinese subtitles
  • ✅ Sentence-level alignment, cleaned and manually verified

🔗 A smaller version is publicly available:

👉 ScreenTalk-JA2ZH-XS on Hugging Face
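
To load the public XS split, something like the following should work with the datasets library. The repo id and column names below are placeholders; check the dataset card for the real ones.

```python
from datasets import load_dataset, Audio

# Placeholder repo id; use the actual path from the Hugging Face link above.
ds = load_dataset("your-org/ScreenTalk-JA2ZH-XS")

# Decode audio at 16 kHz mono to match Whisper's expected input.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds["train"][0]                 # column names are assumptions
print(sample["audio"]["array"].shape)
print(sample["translation"])            # aligned Simplified Chinese subtitle
```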

🛠️ Fine-Tuning Setup

| Hyperparameter          | Value        |
| ----------------------- | ------------ |
| Epochs                  | 20           |
| Learning rate           | 3e-4         |
| Precision               | fp16         |
| Batch size (tiny/base)  | 96 / 64      |
| Eval strategy           | Step-based   |
| Early stopping          | Patience = 5 |

I fine-tuned both Whisper tiny and Whisper base using the same training pipeline.
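
As a concrete reference, here is how the table above might map onto Hugging Face Seq2SeqTrainingArguments. Only the tabled values come from the post; the eval/save cadence and the metric wiring are my assumptions.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Mirrors the hyperparameter table; batch size 64 is the whisper-base
# setting (96 for tiny). Eval cadence and metric choice are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-base-ja2zh",
    num_train_epochs=20,
    learning_rate=3e-4,
    fp16=True,
    per_device_train_batch_size=64,
    evaluation_strategy="steps",   # step-based evaluation
    eval_steps=500,                # assumed cadence
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,   # required for early stopping
    metric_for_best_model="bleu",
    greater_is_better=True,
    predict_with_generate=True,    # decode full hypotheses during eval
)

# Patience = 5: stop after five evaluations without improvement.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```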

📈 Results

🔸 Whisper Tiny

  • ✅ Lightweight, fast
  • ❌ BLEU ≈ 0.60
  • ❌ Prone to overfitting and semantic drift in long/complex speech

🔹 Whisper Base

  • ✅ BLEU = 0.7179
  • ✅ Stronger generalization and fluency
  • ✅ Suitable for CPU deployment (edge ready)

👉 BLEU scores steadily improved even when token-level loss increased — highlighting that loss is not always a good proxy for translation quality.
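
For anyone reproducing this, one common way to track BLEU during evaluation is a compute_metrics hook that decodes full hypotheses, sketched below with the evaluate library and sacrebleu's Chinese tokenizer. The processor variable is assumed from the setup above, and note that sacrebleu reports BLEU on a 0-100 scale rather than the 0-1 figures quoted here.

```python
import evaluate

# sacrebleu with tokenize="zh" applies Chinese-aware tokenization,
# which matters when scoring Chinese output.
bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    pred_ids, label_ids = eval_pred
    # -100 marks padding in the labels; restore it before decoding.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    preds = processor.batch_decode(pred_ids, skip_special_tokens=True)
    refs = processor.batch_decode(label_ids, skip_special_tokens=True)
    score = bleu.compute(predictions=preds,
                         references=[[r] for r in refs],
                         tokenize="zh")
    return {"bleu": score["score"]}
```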

🤔 Key Takeaways

  • Whisper can be adapted for non-English language pairs
  • Domain-specific data (like anime or TV) greatly improves model performance
  • Model capacity matters: Tiny is efficient but not enough for expressive, noisy domains
  • BLEU is limited — future work should include COMET, chrF, or human evals

🔮 What’s Next?

  • Fine-tune larger Whisper models (medium, large)
  • Try LoRA or other parameter-efficient tuning techniques
  • Expand dataset to cover conversational, technical, and news speech

🚀 Try It Out

🧠 Models available on Hugging Face:
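
A minimal CPU inference sketch with the transformers pipeline follows; the checkpoint id is a placeholder, so substitute the actual model path from the links above.

```python
from transformers import pipeline

# Placeholder checkpoint id; substitute the actual model path.
translator = pipeline(
    "automatic-speech-recognition",
    model="your-org/whisper-base-ja2zh",
    device=-1,  # -1 = CPU, matching the edge-deployment goal
)

print(translator("japanese_clip.wav")["text"])  # Simplified Chinese output
```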


Thanks to the open-source Whisper community and everyone working to break language barriers with AI.

👉 Follow me for more multilingual AI experiments!
