DEV Community

Alair Joao Tavares

Posted on • Originally published at alair.com.br

Building a Scalable Multilingual Audio Generation Pipeline with Gemini TTS and Asynchronous Processing

In today's global digital landscape, content accessibility is paramount. One of the most engaging ways to enhance written articles is by providing audio versions, catering to users who prefer listening over reading or those who are multitasking. But what happens when you want to offer this experience in multiple languages for long-form content? A simple, synchronous API call to a Text-to-Speech (TTS) service will quickly lead to request timeouts, a frustrated user, and a brittle system.

This article dives deep into the architecture and implementation of a robust, scalable pipeline for generating multilingual audio tracks for your articles. We'll tackle the problem of long-running tasks head-on by leveraging asynchronous job processing. We'll integrate Google's powerful Gemini TTS for high-quality voice generation, manage our media files effectively within a Dockerized environment, and design a rich frontend experience that elegantly presents dual-language audio options to the reader. Let's build a system that's not only powerful but also resilient and user-friendly.

1. Architectural Blueprint: Decoupling for Scalability

The key to handling long-running processes like audio generation is to decouple the initial request from the actual work. A synchronous approach, where the user's browser waits for the entire audio file to be generated and saved, is a recipe for disaster. Our architecture will instead embrace asynchronicity.

Here’s a high-level overview of the components:

  1. Frontend (React/TypeScript): The user interface where an author or administrator clicks a button to "Generate Audio" for an article in a specific language.
  2. Backend API (Node.js/Express): A lightweight web server that exposes endpoints to initiate and check the status of jobs. It does not perform the TTS generation itself.
  3. Task Queue (BullMQ/Redis): A message broker that acts as a middleman. The API server adds a "job" to the queue, which is a message describing the work to be done (e.g., articleId: 123, language: 'es').
  4. Asynchronous Worker: A separate background process that is constantly listening to the task queue. When a new job appears, the worker picks it up and performs the heavy lifting: calling the Gemini TTS API, processing the audio, and saving the file.
  5. TTS Service (Gemini TTS): The external API we call to convert text into lifelike speech.
  6. Shared Storage (Docker Volume): A persistent storage location, accessible by both the worker (to write files) and the backend API (to serve them), ensuring data is not lost when containers restart.

This decoupled design provides several advantages:

  • Responsiveness: The user gets an immediate response from the API, confirming the job has been scheduled.
  • Reliability: If the worker fails mid-task, the job can be automatically retried by the queue without the user needing to resubmit.
  • Scalability: We can scale the number of worker processes independently of the API server to handle higher loads of audio generation tasks.
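
This enqueue-then-process split can be sketched with nothing but an in-memory array before any real infrastructure is involved. The sketch below is illustrative only: in production the queue lives in Redis (so the API and worker processes can both see it), and BullMQ supplies the worker loop, persistence, and retries (via job options such as attempts and backoff). All names here are hypothetical.

```typescript
// Minimal sketch of the decoupling: the "API" enqueues and returns
// immediately; the "worker" drains jobs on its own schedule.
type SketchJob = { id: number; articleId: string; language: string };

const queue: SketchJob[] = [];
let nextId = 1;

// API side: record the job and hand back an id right away.
function enqueue(articleId: string, language: string): number {
  const job: SketchJob = { id: nextId++, articleId, language };
  queue.push(job);
  return job.id;
}

// Worker side: pull jobs off the queue and process them one at a time.
async function drain(process: (job: SketchJob) => Promise<void>): Promise<void> {
  while (queue.length > 0) {
    await process(queue.shift()!);
  }
}
```

The user-facing win is that `enqueue` returns immediately no matter how long the processing inside `drain` takes.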

2. The Asynchronous Core: Offloading with a Task Queue

The heart of our system is the task queue. We'll use BullMQ, a popular and robust queueing system for Node.js built on top of Redis. The flow begins when the backend API receives a request to generate audio.

API Endpoint: Enqueuing the Job

Instead of blocking, the API endpoint's only job is to validate the request and add a new task to the queue. It then immediately returns a 202 Accepted status, along with a jobId that the frontend can use to poll for updates.

Here's what that looks like in a TypeScript/Express application:

// file: src/routes/audio.ts
import { Router } from 'express';
import { Queue } from 'bullmq';

// Connect to the same Redis instance as the worker
const audioGenerationQueue = new Queue('audio-generation', {
  connection: {
    host: 'redis',
    port: 6379,
  },
});

const router = Router();

interface GenerateAudioBody {
  articleId: string;
  language: 'en-US' | 'es-ES';
  voice: string;
}

router.post('/generate-audio', async (req, res) => {
  try {
    const { articleId, language, voice } = req.body as GenerateAudioBody;

    // Basic validation
    if (!articleId || !language) {
      return res.status(400).json({ error: 'articleId and language are required' });
    }

    // Add the job to the queue
    const job = await audioGenerationQueue.add('generate-article-audio', {
      articleId,
      language,
      voice: voice || 'default-voice-for-language',
    });

    // Respond immediately with the job ID
    return res.status(202).json({ 
      message: 'Audio generation has been queued.',
      jobId: job.id 
    });

  } catch (error) {
    console.error('Failed to queue audio generation job:', error);
    return res.status(500).json({ error: 'Internal server error' });
  }
});

// Endpoint to check job status
router.get('/job-status/:jobId', async (req, res) => {
  const { jobId } = req.params;
  const job = await audioGenerationQueue.getJob(jobId);

  if (!job) {
    return res.status(404).json({ error: 'Job not found' });
  }

  const state = await job.getState();
  const progress = job.progress;
  const returnValue = job.returnvalue;

  res.json({ 
    jobId,
    state, // e.g., 'waiting', 'active', 'completed', 'failed'
    progress,
    result: returnValue // Contains the file URL on completion
  });
});

export default router;

The Worker: Doing the Heavy Lifting

The worker is a separate Node.js process. It connects to the same Redis instance and listens for jobs on the audio-generation queue.

// file: src/worker.ts
import { Worker } from 'bullmq';
import { generateAudioFromArticle } from './services/tts-service'; // We'll define this next

console.log('Audio generation worker started...');

const worker = new Worker('audio-generation', async (job) => {
  const { articleId, language, voice } = job.data;
  console.log(`Processing job ${job.id} for article ${articleId} in ${language}`);

  try {
    // This is where the long-running task happens
    // It fetches article text, calls Gemini, saves the file, updates the DB
    const result = await generateAudioFromArticle(articleId, language, voice, (progress) => {
        // Update job progress (0-100)
        job.updateProgress(progress);
    });

    console.log(`Job ${job.id} completed successfully.`);
    // The return value is stored and can be retrieved via the status endpoint
    return result;
  } catch (error) {
    console.error(`Job ${job.id} failed:`, error);
    // Throwing an error will cause the job to be marked as 'failed'
    throw error;
  }
}, {
  connection: {
    host: 'redis',
    port: 6379,
  },
  // Allow more concurrent jobs if your resources permit
  concurrency: 5,
});

worker.on('failed', (job, err) => {
  // `job` can be undefined when the failure is not tied to a specific job
  console.error(`Job ${job?.id} failed with error: ${err.message}`);
});

3. Integrating Gemini for High-Quality TTS

With our asynchronous foundation in place, we can now focus on the core logic: converting text to speech. Google's Gemini models offer exceptionally high-quality, natural-sounding voices.

To handle long articles, we can't send the entire text in one API call; most TTS services enforce a per-request character limit. The solution is to split the article text into smaller, manageable chunks (e.g., by paragraphs or sentences), generate audio for each chunk, and then concatenate the chunks into a single file. Concatenating raw MP3 buffers generally plays back fine, though the file's duration metadata can be slightly off; if exact timings matter, re-encode the joined file with a tool such as ffmpeg.

Here’s a simplified service function that demonstrates this process:

// file: src/services/tts-service.ts
import { GoogleAuth } from 'google-auth-library';
import fetch from 'node-fetch';
import * as fs from 'fs/promises';
import { ArticleRepository } from '../repositories/article-repository';

// Google Cloud Text-to-Speech REST endpoint used for synthesis
const TTS_API_ENDPOINT = 'https://texttospeech.googleapis.com/v1/text:synthesize';

async function callGeminiTTS(text: string, languageCode: string, voiceName: string): Promise<Buffer> {
    const auth = new GoogleAuth({
        scopes: ['https://www.googleapis.com/auth/cloud-platform'],
    });
    const client = await auth.getClient();
    const { token: accessToken } = await client.getAccessToken();
    if (!accessToken) {
        throw new Error('Failed to obtain a Google Cloud access token');
    }

    const response = await fetch(TTS_API_ENDPOINT, {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${accessToken}`,
            'Content-Type': 'application/json; charset=utf-8',
        },
        body: JSON.stringify({
            input: { text },
            voice: { languageCode, name: voiceName },
            audioConfig: { audioEncoding: 'MP3' },
        }),
    });

    if (!response.ok) {
        throw new Error(`TTS API failed with status ${response.status} ${response.statusText}`);
    }

    const data = await response.json() as { audioContent: string };
    return Buffer.from(data.audioContent, 'base64');
}

// Main function called by the worker
export async function generateAudioFromArticle(
    articleId: string,
    language: string, 
    voice: string,
    onProgress: (progress: number) => void
) {
    const article = await ArticleRepository.findById(articleId);
    if (!article) throw new Error('Article not found');

    const textChunks = splitTextIntoChunks(article.content, 2500); // Split text into ~2500 char chunks
    const audioBuffers: Buffer[] = [];

    for (let i = 0; i < textChunks.length; i++) {
        const chunk = textChunks[i];
        const audioChunk = await callGeminiTTS(chunk, language, voice);
        audioBuffers.push(audioChunk);

        const progress = Math.round(((i + 1) / textChunks.length) * 100);
        onProgress(progress);
    }

    const finalAudio = Buffer.concat(audioBuffers);
    const outputDir = '/var/media/audio';
    await fs.mkdir(outputDir, { recursive: true }); // Ensure the directory exists on the shared volume
    const filePath = `${outputDir}/${articleId}-${language}.mp3`;
    await fs.writeFile(filePath, finalAudio);

    // Update the article in the database with the new audio file path
    await ArticleRepository.update(articleId, { 
        [`audioUrl_${language}`]: `/media/audio/${articleId}-${language}.mp3` 
    });

    return { audioUrl: `/media/audio/${articleId}-${language}.mp3` };
}

function splitTextIntoChunks(text: string, maxLength: number): string[] {
    // Split on sentence boundaries so no sentence is cut in half mid-chunk
    const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
    const chunks: string[] = [];
    let current = '';
    for (const sentence of sentences) {
        if (current.length > 0 && current.length + sentence.length > maxLength) {
            chunks.push(current.trim());
            current = '';
        }
        current += sentence;
    }
    if (current.trim().length > 0) {
        chunks.push(current.trim());
    }
    return chunks;
}

4. Managing Media Files with Docker

Our system has multiple services (API, Worker) running in separate Docker containers. How do they share the generated audio files? The answer is Docker volumes.

A volume is a persistent storage mechanism that exists outside the container's lifecycle. We can create a volume and mount it into both our API container and our Worker container.

  • The Worker writes the generated MP3 file to the volume (e.g., at /var/media/audio/).
  • The API Server (using a static file server middleware like express.static) reads from the same volume to serve the file to the user.
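
Under the hood, serving from the volume is just a safe path mapping from URL space to the mounted directory. express.static handles this for you (including path-traversal protection); the hypothetical helper below makes the mapping explicit:

```typescript
import * as path from 'path';

// Maps a public URL such as /media/audio/123-es-ES.mp3 to a file inside
// the shared volume, rejecting path-traversal attempts. express.static
// performs an equivalent check internally.
function resolveMediaPath(urlPath: string, root = '/var/media'): string | null {
  if (!urlPath.startsWith('/media/')) return null;
  const relative = urlPath.slice('/media/'.length);
  const resolved = path.resolve(root, relative);
  // The resolved path must stay inside the volume root
  if (!resolved.startsWith(path.resolve(root) + path.sep)) return null;
  return resolved;
}
```

In the API server itself, a single `app.use('/media', express.static('/var/media'))` achieves the same thing.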

Here’s a simplified docker-compose.yml demonstrating the setup:

version: '3.8'

services:
  api-server:
    build: ./api-server
    ports:
      - '4000:4000'
    volumes:
      # Mount the shared volume for read access to serve files
      - media_files:/var/media
    depends_on:
      - redis

  worker:
    build: ./worker
    volumes:
      # Mount the shared volume for write access to save generated audio
      - media_files:/var/media
    depends_on:
      - redis

  redis:
    image: 'redis:alpine'

# Define the shared volume
volumes:
  media_files:
    driver: local

Pro Tip: When dealing with volumes, always be mindful of file permissions. Your worker process might run as a non-root user. You may need to add a command to your Dockerfile or an entrypoint script to ensure the correct user owns the mounted directory (e.g., chown -R appuser:appgroup /var/media).
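
As an illustrative excerpt (assuming an Alpine-based image; Debian-based images use groupadd/useradd instead), the worker's Dockerfile might prepare the directory like this:

```dockerfile
# Create the media directory and hand ownership to the non-root user
# the worker runs as (user and group names are illustrative)
RUN mkdir -p /var/media/audio \
 && addgroup -S appgroup \
 && adduser -S appuser -G appgroup \
 && chown -R appuser:appgroup /var/media
USER appuser
```

When a named volume is first created, Docker copies the image's content and ownership at the mount point into it; if the volume already exists with the wrong owner, a chown from an entrypoint script running as root is the more reliable fix.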

5. Designing a Rich Frontend Experience

With a robust backend in place, we can now create a seamless user experience on the frontend.

Kicking Off and Polling

When the user clicks "Generate Audio", we make a POST request to our /generate-audio endpoint. Upon receiving the 202 Accepted response with the jobId, we start polling the /job-status/:jobId endpoint every few seconds.

Here's a conceptual React hook to manage this state:

// file: src/hooks/useAudioGeneration.ts
import { useState, useEffect } from 'react';

interface JobStatus {
  jobId: string;
  state: 'waiting' | 'active' | 'completed' | 'failed';
  progress: number;
  result?: { audioUrl: string };
}

export function useAudioGeneration(jobId: string | null) {
  const [status, setStatus] = useState<JobStatus | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    if (!jobId) return;

    const intervalId = setInterval(async () => {
      try {
        const response = await fetch(`/api/job-status/${jobId}`);
        if (!response.ok) throw new Error('Failed to fetch job status');

        const data: JobStatus = await response.json();
        setStatus(data);

        if (data.state === 'completed' || data.state === 'failed') {
          clearInterval(intervalId);
        }
      } catch (e) {
        setError('Could not get job status.');
        clearInterval(intervalId);
      }
    }, 3000); // Poll every 3 seconds

    return () => clearInterval(intervalId);
  }, [jobId]);

  return { status, error };
}

The UI Component

In our React component, we can use this hook to display the current state. We'd show a progress bar while the job is active, a success message and the audio player when completed, and an error message if it failed.
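
That render decision is easy to keep out of the JSX as a small pure function (a sketch with illustrative names):

```typescript
type JobState = 'waiting' | 'active' | 'completed' | 'failed';

interface PlayerView {
  kind: 'progress' | 'player' | 'error';
  progress?: number;
  audioUrl?: string;
}

// Pure mapping from the polled job status to what the component renders:
// a progress bar while queued or running, the audio player on success,
// and an error message on failure.
function playerView(state: JobState, progress: number, audioUrl?: string): PlayerView {
  if (state === 'completed') return { kind: 'player', audioUrl };
  if (state === 'failed') return { kind: 'error' };
  return { kind: 'progress', progress };
}
```

Keeping this logic pure also makes the component trivial to unit-test without rendering anything.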

For a multilingual setup, we'd manage two sets of state, one for each language. The final UI would feature two distinct audio players, allowing the user to seamlessly switch between, for example, the English and Spanish versions of the article.

Small touches make a big difference. Consider adding:

  • Custom Audio Players: Style the HTML <audio> element to match your site's theme.
  • Progress Indicators: Use the progress value from the polling response to show a realistic progress bar during generation.
  • Themed Scrollbars: If your article content is in a scrollable container alongside the player, a custom, themed scrollbar adds a final layer of professional polish.

Conclusion

We've successfully designed and outlined a complete, scalable system for adding multilingual audio to any content platform. By embracing asynchronous processing with a task queue, we've eliminated the risk of gateway timeouts and created a resilient, responsive experience for our users. Integrating a cutting-edge service like Gemini TTS ensures the final product is of the highest quality, and thoughtful Docker volume management keeps our media organized and accessible.

This architectural pattern—API, task queue, and worker—is a powerful tool in any developer's arsenal. It's the ideal solution for any long-running task, from video transcoding and report generation to complex data analysis, ensuring your applications remain fast, scalable, and reliable.
