J3ffJessie
Torc Bot 3: Torclation Services

Why build it?

*[Image: animated gathering of people in a community]*

Within the Torc community, we have a global user base with a very large contingent in LATAM. Many of our events have a dedicated Español version of the stream, and some events are held exclusively for Spanish-speaking members. For those of us in the community who aren't fortunate enough to speak Spanish, that means missing out on those events. About a month ago, during one of our dual events, it came up that users would love to participate in the Spanish-speaking event if they could get a translation of what was being said, so they could follow along and take part in the chat. I made this a priority immediately because I want the community to have access to everything possible.

Where to start

*[Image: developer looking lost with both hands up, as if asking "what now?"]*

I honestly had no idea where to start with this feature. I went to the Discord documentation to check what exactly was available, and did some searching online to familiarize myself with how Discord audio works. It turns out Discord sends audio in the Opus codec format, so I started looking at options for working with Opus and what I could do with it. I am no audiophile, so when I started reading about audio formats, encoding, decoding, and all that, I was admittedly in over my head.

A few searches later, looking at existing Discord bots that work with audio, I found documentation mentioning the OpusScript decoder, which can decode the audio into PCM format so it can be sent off to an AI model for translation. I started by adding simple commands to start/stop the translation functionality and verified that everything fired properly without creating new issues.
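As a rough illustration of the start/stop entry point, here is what a /translate command definition could look like as raw Discord application-command JSON (type 1 is CHAT_INPUT, option type 1 is SUB_COMMAND). This is a hedged sketch, not the bot's actual registration code; the descriptions are assumptions.

```javascript
// Hypothetical /translate command definition in Discord's raw
// application-command JSON shape — a sketch, not the bot's real code.
const translateCommand = {
  name: "translate",
  description: "Live-translate a voice channel into English captions",
  type: 1, // CHAT_INPUT
  options: [
    {
      type: 1, // SUB_COMMAND
      name: "start",
      description: "Join your voice channel and begin translating",
    },
    {
      type: 1, // SUB_COMMAND
      name: "stop",
      description: "Leave the voice channel and end the session",
    },
  ],
};
```

A payload in this shape can be registered via the Discord REST API (or built with discord.js's SlashCommandBuilder), after which the bot routes the `start`/`stop` subcommands to its session logic.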

The Flow

For everything to work properly, when a user runs the command /translate start within a voice channel, the bot joins the call and starts listening. The flow follows this pattern:

Discord voice channel
        │
        ▼
  voiceService.js        — Opus audio capture + per-frame decoding
        │
        ▼
transcriptionService.js  — PCM → WAV → Whisper (Groq API)
        │
        ▼
 translationService.js   — Translated English text (Groq API / LLaMA)
        │
        ▼
  streamingService.js    — WebSocket broadcast
        │
        ▼
   captions.html         — Live captions displayed in browser

The details

I shared on X that this was the most interesting feature I have added to the bot so far. What I really meant was that it was also the most difficult and draining one. So let's get into the details and explain what is happening behind the scenes.


sessionService

When the user runs the /translate start command, the bot first checks that the user is in a voice channel, and if they aren't, it tells them to join one. This error handling was added to avoid accidental triggers in non-voice channels, whether from attempting other commands or just accidental usage. A session is then created via the sessionService, which generates a unique random token tied to the Discord community server (guild ID) and stores a set of connected WebSocket clients.

Once this is done, it calls voiceService to have the bot join the voice channel as a listener, which in turn calls the captureAudio function that starts capturing the spoken audio.

async start(guild, channel, guildId) {
    const connection = joinVoiceChannel({
      channelId: channel.id,
      guildId: guild.id,
      adapterCreator: guild.voiceAdapterCreator,
      selfDeaf: false,
      selfMute: true,
    });

    this.connections.set(guildId, connection);

    const receiver = connection.receiver;

    receiver.speaking.on("start", (userId) => {
      // Brief delay avoids the corrupted first Opus frame Discord sends
      // when a user's encoder initializes.
      setTimeout(() => {
        if (!this.connections.has(guildId)) return;
        this.captureAudio(receiver, userId, guildId).catch(() => {});
      }, 100);
    });
  }

One key piece of this function is the setTimeout that delays audio capture. One of the most infuriating things I kept running into was corrupt Opus frames. Originally I was using the Opus dependency, and because I was on a newer version of Node and a newer version of the discord.js package, there were issues where a corrupt frame would stop the audio capture completely, which meant users saw nothing on the captions page. Obviously that is unacceptable: the bot needs to handle corrupted frames without ending the service.

Some searching and AI assistance later, I found that opusscript works best with Node 20, so I made the switch. I still had corrupt frame issues, but only right at the beginning of audio capture, so I wrote up a timeout with a short delay and no longer received a corrupt frame immediately on capture. Using OpusScript, I am able to chunk the audio together in small batches and maintain consistent capturing while sending the audio off to be translated, without losing much of the captured audio at all.

  async captureAudio(receiver, userId, guildId) {
    if (this.activeCaptures.get(userId)) return;
    this.activeCaptures.set(userId, true);

    const opusStream = receiver.subscribe(userId, {
      end: {
        behavior: EndBehaviorType.AfterSilence,
        duration: 300,
      },
    });

    let audioBytes = 0;
    let hitMaxDuration = false;

    const decoder = new OpusScript(48000, 2, OpusScript.Application.AUDIO);
    const pcmChunks = [];

    try {
      await new Promise((resolve, reject) => {
        let settled = false;

        const settle = (err) => {
          if (settled) return;
          settled = true;
          if (err) reject(err);
          else resolve();
        };

        const maxTimer = setTimeout(() => {
          hitMaxDuration = true;
          opusStream.destroy();
          settle();
        }, MAX_CAPTURE_MS);

        opusStream.on("data", (packet) => {
          try {
            const pcm = decoder.decode(packet);
            pcmChunks.push(pcm);
            audioBytes += pcm.length;
          } catch {
            // Corrupted Opus frame — skip this frame, keep going
          }
        });

        opusStream.on("end",   () => { clearTimeout(maxTimer); settle(); });
        opusStream.on("close", () => { clearTimeout(maxTimer); settle(); });
        opusStream.on("error", (err) => { clearTimeout(maxTimer); settle(err); });
      });
    } catch (err) {
      console.error(`Capture error for user ${userId}:`, err?.message);
    } finally {
      decoder.delete();

      // Release the lock immediately — the next utterance can start capturing
      // without waiting for the Whisper + translation API calls to finish.
      this.activeCaptures.delete(userId);

      // User is still speaking — re-subscribe right away before yielding to
      // the event loop so we don't miss audio between chunks.
      if (hitMaxDuration && this.connections.has(guildId)) {
        this.captureAudio(receiver, userId, guildId).catch(() => {});
      }

      try { opusStream.destroy(); } catch {}
    }

    // Hand off to background processing — does not block the next capture.
    const minBytes = 48000 * 2 * 2 * 0.3;
    if (audioBytes >= minBytes && pcmChunks.length > 0) {
      this.processAudio(pcmChunks, userId, guildId).catch((err) => {
        console.error(`Process error for user ${userId}:`, err?.message);
      });
    }
  }

We have audio, now what?

*[Image: person sitting down, listening to audio on headphones]*

OpusScript takes the audio from Discord and decodes it into a temporary PCM file, which is then converted into a WAV file. That file is transcribed by Whisper, and the transcript is translated by Groq's llama-3.1-8b-instant model. In plain terms, we are taking speech, converting it to text, translating the text, and passing it to the UI for consumption. The intricate part was figuring out the chunking so that as we capture audio and begin sending it off for translation, we immediately catch the next spoken words and don't miss anything. It isn't perfect by any means, but given the limitations of server response times, it does about as well as it can at the moment without hosting the translation locally to cut out some of the latency.
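The PCM-to-WAV step is lighter than it sounds: a WAV file is just the raw PCM data with a 44-byte RIFF header prepended. As a sketch of what the `wav` package's FileWriter is doing under the hood (assuming 48 kHz stereo 16-bit, matching the capture settings above):

```javascript
// Build a 44-byte RIFF/WAVE header for raw PCM — a minimal illustration of
// what the `wav` package's FileWriter handles, assuming 48 kHz / 2 ch / 16-bit.
function wavHeader(pcmByteLength, { sampleRate = 48000, channels = 2, bitDepth = 16 } = {}) {
  const byteRate = sampleRate * channels * (bitDepth / 8);
  const blockAlign = channels * (bitDepth / 8);
  const header = Buffer.alloc(44);

  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmByteLength, 4); // total file size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);  // fmt chunk size
  header.writeUInt16LE(1, 20);   // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcmByteLength, 40);
  return header;
}
```

Whisper needs this header to know the sample rate and channel layout of the bytes that follow; the actual service code below delegates the same job to the `wav` package.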

 async convertPcmToWav(pcmFile) {
    const wavFile = pcmFile.replace('.pcm', '.wav');

    return new Promise((resolve, reject) => {
      const reader = fs.createReadStream(pcmFile);
      const writer = new wav.FileWriter(wavFile, {
        channels: 2,
        sampleRate: 48000,
        bitDepth: 16,
      });

      reader.pipe(writer);
      writer.on('finish', () => resolve(wavFile));
      writer.on('error', reject);
    });
  }

  async transcribe(filePath) {
    if (!filePath) throw new Error('Invalid file path');
    return await this.groq.audio.transcriptions.create({
      file: fs.createReadStream(filePath),
      model: 'whisper-large-v3-turbo',
    });
  }
  async translate(text) {
    if (!text || !text.trim()) return '';

    const response = await this.groq.chat.completions.create({
      model: 'llama-3.1-8b-instant',
      temperature: 0, // 🔥 important for consistency
      messages: [
        {
          role: 'system',
          content:
            'You are a translation engine. Translate ALL input text to English. Return ONLY the translated text. Do not explain. Do not add commentary.',
        },
        {
          role: 'user',
          content: text,
        },
      ],
    });

    return response.choices[0].message.content.trim();
  }
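Tying these pieces together, the background processAudio hand-off roughly follows this shape. This is a sketch with injected dependencies rather than the bot's real method (which writes the temp PCM file, converts it to WAV, and calls the services shown above), so the pipeline is visible on its own:

```javascript
// Hedged sketch of the processAudio pipeline: PCM chunks in, translated
// caption out. transcribe/translate/broadcast stand in for the real services.
async function processChunks(pcmChunks, { transcribe, translate, broadcast }) {
  const pcm = Buffer.concat(pcmChunks);        // stitch decoded frames together
  const transcription = await transcribe(pcm); // PCM → WAV → Whisper in the real bot
  const text = transcription?.text?.trim();
  if (!text) return null;                      // empty transcription: nothing to show

  const translated = await translate(text);    // Spanish (or anything) → English
  broadcast(translated);                       // push to captions.html over WebSocket
  return translated;
}
```

Because captureAudio fires this off without awaiting it, the next utterance can start being captured while the previous one is still in flight to Groq.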

Translation meets the UI

Early in the process, when the user summons the bot to do translation, a captions URL is generated. This URL points the user's browser to a simple HTML page that receives the translated text and displays it.

*[Image: screen capture of the captions page where translations are displayed to the user]*

This allows the user to have the Discord stream pulled up alongside the translation in the browser, so they can see both at the same time, respond in the chat, and participate. The initial design of the captions page is as simple as possible, and I plan to throw a fresh coat of paint on it in the near future to make it more visually appealing, but for the initial release my major focus was on performance and getting things working well enough to be useful. I believe I am as close to live translation as I can get without hosting the models locally and cutting out server calls.
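On the server side, the streamingService's broadcast step can be pictured like this. It is a sketch under my own assumptions about the session shape (a Set of ws-style clients with `readyState`/`send`) and the caption payload fields, not the bot's actual code:

```javascript
// Hypothetical streamingService broadcast: send one caption to every live
// WebSocket client of a session, pruning connections that have closed.
const OPEN = 1; // WebSocket.OPEN in the ws package and browsers

function broadcastCaption(session, userId, originalText, translatedText) {
  const payload = JSON.stringify({
    type: "caption",
    userId,
    original: originalText,
    translated: translatedText,
    at: Date.now(),
  });

  for (const client of session.clients) {
    if (client.readyState === OPEN) {
      client.send(payload);
    } else {
      session.clients.delete(client); // drop dead connections as we go
    }
  }
  return payload;
}
```

On the captions.html side, a few lines of `new WebSocket(...)` plus an `onmessage` handler that appends `JSON.parse(event.data).translated` to the page are enough to render the live captions.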

Iteration....? Future Additions

Future adjustments will be made as feedback comes in and as more live sessions produce results. That will mean prompt changes and timing changes: seeing whether adjusting the amount of audio captured before it is sent for translation helps, and whether smaller chunks work better than longer stretches of audio for avoiding split sentences in the translations.

Overall, the learning on this was great, and the end result, although not fully appreciated yet, is that I have created something for the community at large that other Discord servers can adopt if they choose, offering this ability to their users. I am thoroughly enjoying doing things that help the community and make the Torc Discord an inclusive place, where we support people from all areas and try to make it possible for everyone to join and participate as much as possible.

Plans are in progress to add multi-server support so that the bot can run independently in multiple servers, with server admins configuring which features are enabled. This is the next big push for the bot, as I want it available for other communities that may need translation for their events.

Disclaimer: Images of people are generated using ChatGPT; the code snippets and webpages are not. Writing has been done with the assistance of AI in an attempt to keep the technical descriptions readable.
