LiveTR: A Real-Time English-to-Japanese Audio Translation App for Videos

#windows

What is LiveTR?

LiveTR is a Windows application that recognizes English audio from videos playing on your PC in real time and translates it into Japanese. The translations are displayed as subtitles in an on-screen overlay, and the app can also read them aloud in Japanese.

Whether it is YouTube, Twitch, or a local video file, the source does not matter as long as there is English audio playing.

Key Features

Real-time Speech Recognition: Transcribes English audio on the fly using faster-whisper
Japanese Translation: Supports online translation services (Google Cloud, DeepL, Azure, Amazon)
Subtitle Overlay: Displays translation results in a transparent window. Position and size are adjustable, and mouse clicks pass through the window
Japanese Text-to-Speech: Reads translation results aloud using AivisSpeech. The voice is chosen automatically to match the speaker (a male-sounding voice for men, a female-sounding voice for women)
Automatic Ducking: Automatically lowers the video volume while the translated speech is playing, so the Japanese audio is easier to hear
Process-based Audio Capture: Captures audio only from the application you specify. Because the app's own synthesized speech is never re-captured, this prevents feedback loops
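
To make the ducking idea concrete, here is a minimal sketch of a gain calculation that fades the video volume down while synthesized speech is playing and back up afterwards. It is not LiveTR's actual code: the function name `duck_gain`, the ducked level of 0.3, and the 0.2-second fade are assumptions chosen for illustration.

```python
def duck_gain(tts_active: bool, current: float, dt: float,
              ducked: float = 0.3, fade_s: float = 0.2) -> float:
    """Move the video gain one time step toward its target level.

    tts_active: True while synthesized speech is playing.
    current:    gain currently applied to the video audio (0.0 to 1.0).
    dt:         seconds elapsed since the previous call.
    ducked:     assumed gain while speech plays (hypothetical default).
    fade_s:     assumed time for a full fade between 1.0 and `ducked`.
    """
    target = ducked if tts_active else 1.0
    if fade_s <= 0:
        return target  # no fade requested: jump straight to the target
    # Largest change allowed in this step, so a full fade takes fade_s seconds.
    max_step = (1.0 - ducked) * dt / fade_s
    delta = target - current
    if abs(delta) <= max_step:
        return target
    return current + max_step if delta > 0 else current - max_step
```

Calling this once per audio callback gives a short linear fade instead of an abrupt volume jump, which tends to sound less jarring.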

How to Use

Launch the application
Select the process you want to capture audio from
Click "Start" to begin speech recognition, translation, subtitle display, and text-to-speech

System Requirements

Windows 10 / 11 (64-bit)
NVIDIA GPU (compatible with CUDA 12.x)
16GB RAM or more recommended

A GPU is required. Because the speech recognition model runs in real time, reasonably capable hardware is necessary.

Development Story

I built this entirely using Claude Code. The development period was about 4 days.

My previous project, OLTranslator, translates text on the screen, so I started this one thinking, "Why not do the same for audio?" OLTranslator took me two weeks using Copilot, so this was significantly faster by comparison. Part of that is simply that I have become more accustomed to AI coding, but two other factors mattered a lot: I could keep the project guidelines in CLAUDE.md, and the cycle of design, instruction, and review flows naturally with Claude Code.

Points I Focused On

I paid close attention to how the text picked up by speech recognition is segmented before it is sent for translation. Where a sentence is split, and how fragments are rejoined when the recognizer cuts a sentence midway, directly affects translation accuracy.
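
As a rough illustration of this kind of segmentation, the sketch below buffers recognized fragments and releases them for translation only once they end in sentence-final punctuation or the buffer grows too long. It is a simplified assumption of how such logic could look, not LiveTR's actual implementation; the class name `SentenceBuffer` and the 40-word limit are hypothetical.

```python
import re

# A fragment counts as "finished" when the joined text ends in . ! or ?
_SENTENCE_END = re.compile(r"[.!?]$")


class SentenceBuffer:
    """Join recognizer fragments until they form a translatable sentence."""

    def __init__(self, max_words: int = 40):
        self.max_words = max_words  # assumed safety limit for run-on speech
        self._parts = []

    def push(self, fragment: str) -> list:
        """Add one recognized fragment; return sentences ready to translate."""
        fragment = fragment.strip()
        if not fragment:
            return []
        self._parts.append(fragment)
        text = " ".join(self._parts)
        # Release on sentence-final punctuation, or when the buffer is too long.
        if _SENTENCE_END.search(text) or len(text.split()) >= self.max_words:
            self._parts.clear()
            return [text]
        return []

    def flush(self) -> list:
        """Return whatever is left, e.g. when capture stops."""
        text = " ".join(self._parts)
        self._parts.clear()
        return [text] if text else []
```

For example, pushing "So the idea is" returns nothing, and pushing "to translate audio." then returns the joined sentence. A punctuation check this simple will split on abbreviations such as "Mr.", so a real implementation would need more careful rules.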

I also focused on identifying the speaker's gender. I wanted the text-to-speech voice to match the speaker, so I built the logic with Claude Code by referring to academic papers and patents. AivisSpeech offers multiple speaker models, so the app reads with a male-sounding voice for men and a female-sounding voice for women.

However, this gender identification was quite tricky. If it is judged solely by pitch, a man can be identified as a woman when his pitch rises during exciting moments, such as in F1 commentary. I wrote in detail about how I solved this issue in "Logic Impossible on My Own: Building with Claude and Research Papers".
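
The failure mode can be shown with a small sketch. This is not the method described in the linked article; it only demonstrates why a per-frame decision on fundamental frequency (F0) is fragile. The 165 Hz threshold, the function names, and the sample F0 values are all assumptions made up for the example.

```python
from statistics import median

THRESHOLD_HZ = 165.0  # assumed boundary between typical male and female F0


def label_frame(f0_hz: float) -> str:
    """Classify a single pitch value against the assumed threshold."""
    return "female" if f0_hz >= THRESHOLD_HZ else "male"


def label_utterance(f0_track: list) -> str:
    """Classify a whole utterance by its median pitch instead of per frame."""
    return label_frame(median(f0_track))


# Hypothetical F0 track (Hz): a male commentator who gets excited briefly.
track = [110, 115, 120, 118, 210, 230, 220, 125, 117, 112]

per_frame = [label_frame(f0) for f0 in track]
print(per_frame.count("female"), "of", len(track), "frames labeled female")
print("median-based label:", label_utterance(track))
```

Here the three excited frames are labeled female when judged one at a time, while the median over the utterance stays in the male range. A longer excited stretch would still shift the median, which is consistent with the point above that pitch alone is not a reliable signal.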

Download

LiveTR — Real-time Audio Translation App (BOOTH)

OLTranslator — A translation app for on-screen text. OLTranslator handles "text," while LiveTR handles "audio." Although they both perform translation, the approaches are completely different.
Logic Impossible on My Own: Building with Claude and Research Papers — A technical deep-dive into the gender identification logic.