DEV Community

Connie Leung for Google Developer Experts

Posted on • Originally published at blueskyconnie.com

Building Dynamic Audio with Emotion & Pace: Gemini 3.1 Flash TTS, Angular & Firebase Cloud Functions [GDE]

Google released the Gemini 3.1 Flash TTS Preview model for AI audio generation in the Gemini API, Gemini in Vertex AI, and Google AI Studio. The model introduces a new audio tags feature for expressive control over emotion, pace, and style.

This application uses Firebase AI Logic to analyze an uploaded image and generate recommendations, a description, alternative tags, and an obscure fact. The obscure fact is sent to a Firebase Cloud Function, which generates audio with a Gemini TTS model. The Cloud Function returns the stream to an Angular application, which converts it to a Blob URL. An audio player assigns the URL to its source so users can click the Play button to hear the stream.

In this blog post, I migrate my application to the Gemini 3.1 Flash TTS Preview model and create a signal form in Angular to input a scene, emotion, and pace. The Angular application then sends the form values and the obscure fact to the Firebase Cloud Function, which generates an expressive voice using the GenAI TypeScript SDK.

Prerequisites

The technical stack of the project:

  • Angular 21: The latest version as of May 2026.
  • Node.js LTS: The LTS version as of May 2026.
  • Firebase Remote Config: To manage dynamic parameters.
  • Firebase Cloud Functions: To generate an expressive human voice when called by the frontend.
  • Firebase Local Emulator Suite: To test the functions locally at http://localhost:5001.
  • Gemini in Vertex AI: To generate expressive audio with the Gemini TTS model.

The public Google AI Studio API is restricted in my region (Hong Kong). However, Vertex AI (Google Cloud) offers enterprise access that works reliably here, so I chose Vertex AI for this demo.

npm i -g firebase-tools

Install firebase-tools globally using npm.

firebase logout
firebase login

Log out of Firebase and log in again to refresh your credentials and ensure you are authenticated with the correct account.

firebase init

Execute firebase init and follow the prompts to set up Firebase Cloud Functions, the Firebase Local Emulator Suite, Firebase Cloud Storage, and Firebase Remote Config.

If you have an existing project or multiple projects, you can specify the project ID on the command line.

firebase init --project <PROJECT_ID>

In both cases, the Firebase CLI automatically installs the firebase-admin and firebase-functions dependencies.

After completing the setup steps, the Firebase CLI generates the functions directory, the emulator configuration, a storage rules file, a Remote Config template, and configuration files such as .firebaserc and firebase.json.
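For reference, a freshly initialized firebase.json might look similar to the sketch below. The exact contents depend on the features selected during firebase init; 5001 is the default functions emulator port.

```json
{
  "functions": [{ "source": "functions", "codebase": "default" }],
  "storage": { "rules": "storage.rules" },
  "remoteconfig": { "template": "remoteconfig.template.json" },
  "emulators": {
    "functions": { "port": 5001 },
    "ui": { "enabled": true }
  }
}
```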

  • Angular dependency
npm i firebase

The Angular application requires the firebase dependency to initialize a Firebase app, load Remote Config, and invoke the Firebase Cloud Functions to generate audio.

  • Firebase dependencies
npm i @cfworker/json-schema @google/genai @modelcontextprotocol/sdk

Install the above dependencies to access Gemini in Vertex AI. @google/genai depends on @cfworker/json-schema and @modelcontextprotocol/sdk. Without these, the Cloud Functions cannot start.

With our project configured, let's look at how the frontend and backend communicate.


Architecture

High-level architecture of obscure fact generation

A user uploads an image in an Angular application and prompts the Gemini 3.1 Flash Lite Preview model to generate a few recommendations for improving the image, a description, and alternative tags. The user also uses the same model and the Google Search tool to find an obscure fact related to the image.

High-level architecture of audio generation

A user inputs a scene, an emotion, and a pace in an experimental signal form. When a user clicks the generate audio button, the Angular application sends the form values and the obscure fact to the Firebase Cloud Function to generate an expressive voice using the GenAI TypeScript SDK and Gemini 3.1 Flash TTS Preview model.



Firebase Integration

1. Configure Environment Variables

Defining the environment variables in the Firebase project ensures the functions know the region of the Google Cloud project, the Firebase Cloud Function location, and the required TTS model.

.env.example

GOOGLE_CLOUD_LOCATION="global"
GOOGLE_FUNCTION_LOCATION="asia-east2"
GEMINI_TTS_MODEL_NAME="gemini-3.1-flash-tts-preview"
WHITELIST="http://localhost:4200"
REFERER="http://localhost:4200/"
  • GOOGLE_CLOUD_LOCATION: The region of the Google Cloud project. I chose global so that the Firebase project has access to the newest Gemini 3.1 Flash TTS Preview model.
  • GOOGLE_FUNCTION_LOCATION: The region of the Firebase Cloud Functions. I chose asia-east2 because this is the region where I live.
  • GEMINI_TTS_MODEL_NAME: The name of the Gemini TTS model that generates the audio.
  • WHITELIST: Requests must come from http://localhost:4200.
  • REFERER: Requests must originate from http://localhost:4200/.

http://localhost:4200 is the host and port number of my local Angular application.

2. Validating Environment Variables

Before the Cloud Function proceeds with any AI calls, it is critical to ensure that all necessary environment variables are present. I implemented an AUDIO_CONFIG IIFE (Immediately Invoked Function Expression) to validate environment variables like the TTS model name, Google Cloud Project ID, and location.

import logger from "firebase-functions/logger";

export function validate(value: string | undefined, fieldName: string, missingKeys: string[]) {
    const err = `${fieldName} is missing.`;
    if (!value) {
        logger.error(err);
        missingKeys.push(fieldName);
        return "";
    }

    return value;
}
import { HttpsError } from "firebase-functions/v2/https";

export const AUDIO_CONFIG = (() => {
    logger.info("AUDIO_CONFIG initialization: Loading environment variables and validating configuration...");

    const env = process.env;

    const missingKeys: string[] = [];
    const location = validate(env.GOOGLE_CLOUD_LOCATION, "Vertex Location", missingKeys);
    const model = validate(env.GEMINI_TTS_MODEL_NAME, "Gemini TTS Model Name", missingKeys);
    const project = validate(env.GCLOUD_PROJECT, "Google Cloud Project", missingKeys);

    if (missingKeys.length > 0) {
        throw new HttpsError("failed-precondition", `Missing environment variables: ${missingKeys.join(", ")}`);
    }

    return {
        genAIOptions: {
            project,
            location,
            vertexai: true,
        },
        model,
    };
})();

I am using Node 24 as of May 2026. Since Node 20.12, we can use the built-in process.loadEnvFile function, which loads environment variables from the .env file.

In env.ts, the try-catch block attempts to load the environment variables from the .env file.

try {
    process.loadEnvFile();
} catch {
    // Ignore error if .env file is not found (e.g., in production where env vars are set by the platform)
}

In src/index.ts, the first statement imports env.ts before any other files and libraries.

import "./env";

... other import statements ...

If you are using a Node version that does not support process.loadEnvFile, the alternative is to install dotenv to load the environment variables.

npm i dotenv
import dotenv from "dotenv";

dotenv.config();

Firebase provides the GCLOUD_PROJECT variable, so it is not defined in the .env file.

When the missingKeys array is not empty, AUDIO_CONFIG throws an error that lists all the missing variable names. If validation succeeds, genAIOptions and model are returned: genAIOptions initializes the GoogleGenAI client, and model is the selected TTS model name.

3. Sanitize the Prompt Inputs

The Cloud Function sanitizes the scene and transcript before composing the audio prompt.

The sanitizeScene function escapes newline characters ('\n') as literal '\\n'. A raw newline creates a blank line and often signals the end of a Markdown block, so this flattens the scene into one continuous line that the model treats as a single, safe paragraph. The function also strips any Markdown headers injected into the scene.

function sanitizeScene(text: string): string {
    return (text || "").trim().replace(/\r?\n/g, "\\n").replace(/^[#\s]+/gm, "");
}

The sanitizeTranscript function strips Markdown headers from the transcript and collapses any injected triple quotes (""") into a single quote character.

function sanitizeTranscript(text: string): string {
    return (text || "").trim().replace(/^#+/gm, "").replace(/"""/g, '"');
}

4. Build an Audio Prompt

The AudioPrompt type encapsulates the scene, emotion, pace, transcript, and voice option used to set the location, audio tags, text, and persona of the audio.

export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
}

The SCENE_DICTIONARY is an array of scenes. When the user does not provide a scene, a scene is randomly selected from the array.

export const SCENE_DICTIONARY = [
    "A dimly lit, dusty library filled with ancient leather-bound books.\n" +
        "The air is thick with history. A scholarly archivist is leaning closely into a warm, vintage ribbon microphone.\n" +
        "They speak with an infectious, hushed intensity, eager to share a forgotten secret they just uncovered in a decaying manuscript.",

    "It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright.\n" +
        "The red 'ON AIR' tally light is blazing. The speaker is standing up, bouncing on the balls of their heels to the rhythm of a thumping backing track.\n" +
        "It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.",

    "A meticulously sound-treated bedroom in a suburban home.\n" +
        "The space is deadened by plush velvet curtains and a heavy rug, creating an intimate, close-up acoustic environment.\n" +
        "The speaker delivers the information like a trusted friend sharing an inside joke.",

    "A high-tech, minimalist laboratory humming with servers.\n" +
        "Crisp, clean acoustics reflect off glass and steel.\n" +
        "A brilliant but eccentric scientist is pacing back and forth, speaking rapidly and enthusiastically into a headset microphone, excited to explain a complex phenomenon.",
];

I define a buildAudioPrompt function to construct the advanced audio prompt. When an emotion is defined, its tag is [<emotion>]; when a pace is defined, its tag is [<pace>]. The combined audio tag is [<emotion>] [<pace>] followed by a space, which creates a proper token boundary.

The insertAudioTagsToTranscript function uses a regular expression to split the transcript into sentences, inserts the combined audio tag before each one, and joins the parts back together.

The buildAudioPrompt function then concatenates the scene and the expressive transcript into a single string.

import { SCENE_DICTIONARY } from './constants/scenes.const';
import { AudioPrompt } from './types/audio-prompt.type';

function makeTag(value: string) {
    const trimmedValue = value.trim();
    return trimmedValue ? `[${trimmedValue}] ` : "";
}

function insertAudioTagsToTranscript({ transcript, pace, emotion }: AudioPrompt): string {
    const audioTags = `${makeTag(emotion)}${makeTag(pace)}`;
    const cleanedTranscript = sanitizeTranscript(transcript);

    const parts = cleanedTranscript.split(/(?<!\b(?:Mr|Mrs|Ms|Dr|St|i\.e|e\.g))([.!?\n\r]+[”"’']*\s*)/);
    return parts
        .map((text, i, arr) => {
            if (i % 2 !== 0) {
                return ""; // Skip delimiters, they are appended to the text blocks
            }
            const delimiter = arr[i + 1] || "";
            return text.trim() ? `${audioTags}${text.trim()}${delimiter}` : delimiter;
        })
        .join("");
}

export function buildAudioPrompt(data: AudioPrompt): string {
    const randomIndex = Math.floor(Math.random() * SCENE_DICTIONARY.length);
    const selectedScene = SCENE_DICTIONARY[randomIndex];

    const trimmedScene = (data.scene || "").trim() || selectedScene;
    const escapedScene = sanitizeScene(trimmedScene);
    const transcript = insertAudioTagsToTranscript(data);

    return `## Scene:
${escapedScene}

## Transcript:
"""
${transcript}
"""
`;
}

The output of the prompt looks like:

## Scene:
<scene>

## Transcript:
[<emotion>] [<pace>] <sentence 1>[<emotion>] [<pace>] <sentence 2>...[<emotion>] [<pace>] <sentence N>
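To make the token boundaries concrete, here is a self-contained sketch of the tag-insertion step, mirroring makeTag and insertAudioTagsToTranscript above with an illustrative sample sentence:

```typescript
// Wraps a trimmed value in square brackets with a trailing space, e.g. "panicked" -> "[panicked] ".
function makeTag(value: string): string {
  const trimmed = value.trim();
  return trimmed ? `[${trimmed}] ` : "";
}

// Splits the transcript into sentences and prefixes each with the combined audio tag.
function tagTranscript(transcript: string, emotion: string, pace: string): string {
  const audioTags = `${makeTag(emotion)}${makeTag(pace)}`;
  const parts = transcript.split(/(?<!\b(?:Mr|Mrs|Ms|Dr|St|i\.e|e\.g))([.!?\n\r]+[”"’']*\s*)/);
  return parts
    .map((text, i, arr) => {
      if (i % 2 !== 0) {
        return ""; // skip delimiters; they are appended to the preceding text block
      }
      const delimiter = arr[i + 1] || "";
      return text.trim() ? `${audioTags}${text.trim()}${delimiter}` : delimiter;
    })
    .join("");
}

const tagged = tagTranscript("The vault was empty. Nobody knew why.", "panicked", "rapid");
// -> "[panicked] [rapid] The vault was empty. [panicked] [rapid] Nobody knew why."
```

Every sentence receives the same [<emotion>] [<pace>] prefix, matching the prompt template shown above.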

5. Generating Expressive Human Audio in a Firebase Cloud Function

The createVoiceConfig function constructs an instance of GenerateContentConfig that outputs a speech narrated by the given voice name.

import { GenerateContentConfig } from "@google/genai";

export function createVoiceConfig(voiceName = "Kore"): GenerateContentConfig {
    return {
        responseModalities: ["audio"],
        speechConfig: {
            voiceConfig: {
                prebuiltVoiceConfig: {
                    voiceName,
                },
            },
        },
    };
}
// filter(Boolean) drops empty entries so an unset WHITELIST yields an empty array
const splitList = (whitelist?: string) => (whitelist || "").split(",").map((origin) => origin.trim()).filter(Boolean);

export const whitelist = splitList(process.env.WHITELIST);
export const cors = whitelist.length > 0 ? whitelist : true;
export const refererList = splitList(process.env.REFERER);

All Cloud Functions enforce App Check, CORS, and a timeout period of 600 seconds. If WHITELIST is unspecified, CORS defaults to true. While acceptable in a demo environment, configure CORS to a specific domain or false in production to prevent unauthorized access.

The readFact Cloud Function delegates to readFactStreamFunction when isStreaming is true; otherwise, it delegates to readFactFunction.

The readFactFunction function returns a Promise<string> that resolves to the base64-encoded audio.

The readFactStreamFunction function returns a Promise<number[] | undefined> that represents the bytes of the WAV header.

import { onCall } from "firebase-functions/v2/https";
import { cors } from "../auth";
import { buildAudioPrompt } from './audio-prompt';
import { readFactFunction, readFactStreamFunction } from "./read-fact";
import { createVoiceConfig } from './voice-config';

const options = {
    cors,
    enforceAppCheck: true,
    timeoutSeconds: 600,
};

export const readFact = onCall(options, (request, response) => {
    const { data, acceptsStreaming } = request;
    const isStreaming = acceptsStreaming && !!response;
    const prompt = buildAudioPrompt(data);
    const voiceOption = createVoiceConfig(data.voiceOption);

    return isStreaming
        ? readFactStreamFunction(prompt, voiceOption, response)
        : readFactFunction(prompt, voiceOption);
});

The withAIAudio function is a higher-order function that invokes the callback to generate an audio stream.

import { GoogleGenAI } from "@google/genai";
import { HttpsError } from "firebase-functions/v2/https";
// AUDIO_CONFIG comes from the configuration module shown earlier (adjust the path to your project layout)
import { AUDIO_CONFIG } from "./audio-config";

async function withAIAudio(callback: (ai: GoogleGenAI, model: string) => Promise<string | number[] | undefined>) {
    try {
        const variables = AUDIO_CONFIG;
        if (!variables) {
            return "";
        }

        const { genAIOptions, model } = variables;
        const ai = new GoogleGenAI(genAIOptions);
        return await callback(ai, model);
    } catch (e) {
        if (e instanceof HttpsError) {
            throw e;
        }
        throw new HttpsError("internal", "An internal error occurred while setting up the AI client.", {
            originalError: (e as Error).message,
        });
    }
}

generateAudio is a callback that uses the Gemini 3.1 Flash TTS Preview model to generate a response. getBase64DataUrl invokes extractInlineAudioData to extract the raw data and MIME type from the response. The encodeBase64String function converts the raw data to WAV format, encodes it as base64, and returns a data URL string.

The createAudioParams function constructs a parameter with the Gemini TTS model, the audio prompt, and the speech configuration.

async function generateAudio(aiTTS: AIAudio, prompt: string, voiceOption: GenerateContentConfig) {
    try {
        const { ai, model } = aiTTS;
        const response = await ai.models.generateContent(createAudioParams(model, prompt, voiceOption));
        return getBase64DataUrl(response);
    } catch (error) {
        console.error(error);
        throw error;
    }
}

function createAudioParams(model: string, prompt: string, config?: GenerateContentConfig) {
    return {
        model,
        contents: [
            {
                role: "user",
                parts: [
                    {
                        text: prompt,
                    },
                ],
            },
        ],
        config,
    };
}

function extractInlineAudioData(response: GenerateContentResponse): {
    rawData: string | undefined;
    mimeType: string | undefined;
} {
    const { data: rawData, mimeType } = response.candidates?.[0]?.content?.parts?.[0]?.inlineData ?? {};

    return { rawData, mimeType };
}

function getBase64DataUrl(response: GenerateContentResponse) {
    const { rawData, mimeType } = extractInlineAudioData(response);

    if (!rawData || !mimeType) {
        throw new Error("Audio generation failed: No audio data received.");
    }

    return encodeBase64String({ rawData, mimeType });
}

export function encodeBase64String({ rawData, mimeType }: RawAudioData) {
    const wavBuffer = convertToWav(rawData, mimeType);
    const base64Data = wavBuffer.toString("base64");
    return `data:audio/wav;base64,${base64Data}`;
}
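The helpers convertToWav, parseMimeType, createWavHeader, and the WavConversionOptions type are referenced in this section but not shown. Here is a minimal sketch of what they might look like, assuming 16-bit PCM output and a MIME type such as audio/L16;codec=pcm;rate=24000; it is an illustration, not the application's exact code.

```typescript
type WavConversionOptions = {
  numChannels: number;
  sampleRate: number;
  bitsPerSample: number;
};

// Derives PCM parameters from a Gemini audio MIME type such as "audio/L16;codec=pcm;rate=24000".
function parseMimeType(mimeType: string): WavConversionOptions {
  const [fileType, ...params] = mimeType.split(";").map((part) => part.trim());
  const options: WavConversionOptions = { numChannels: 1, sampleRate: 24000, bitsPerSample: 16 };

  const bits = /^audio\/L(\d+)$/i.exec(fileType)?.[1];
  if (bits) {
    options.bitsPerSample = parseInt(bits, 10);
  }
  for (const param of params) {
    const [key, value] = param.split("=");
    if (key === "rate") {
      options.sampleRate = parseInt(value, 10);
    }
  }
  return options;
}

// Builds the standard 44-byte RIFF/WAVE header for a PCM payload of dataLength bytes.
function createWavHeader(dataLength: number, options: WavConversionOptions): Buffer {
  const { numChannels, sampleRate, bitsPerSample } = options;
  const byteRate = (sampleRate * numChannels * bitsPerSample) / 8;
  const blockAlign = (numChannels * bitsPerSample) / 8;

  const header = Buffer.alloc(44);
  header.write("RIFF", 0);                  // ChunkID
  header.writeUInt32LE(36 + dataLength, 4); // ChunkSize
  header.write("WAVE", 8);                  // Format
  header.write("fmt ", 12);                 // Subchunk1ID
  header.writeUInt32LE(16, 16);             // Subchunk1Size (PCM)
  header.writeUInt16LE(1, 20);              // AudioFormat: 1 = PCM
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);                 // Subchunk2ID
  header.writeUInt32LE(dataLength, 40);     // Subchunk2Size
  return header;
}

// Prepends the WAV header to the base64-encoded PCM data returned by the model.
function convertToWav(rawData: string, mimeType: string): Buffer {
  const options = parseMimeType(mimeType);
  const pcm = Buffer.from(rawData, "base64");
  return Buffer.concat([createWavHeader(pcm.length, options), pcm]);
}
```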

generateAudioStream is a callback function that uses the Gemini 3.1 Flash TTS Preview model to stream audio chunks. Each chunk is passed to the extractInlineAudioData function to extract the raw data and MIME type. The function converts each chunk's raw data into a buffer and sends it to the client, accumulating the byte length to track the total size of all chunks.

After all the chunks are sent to the client, the createWavHeader function uses the total byte length and the audio options to construct a WAV header and returns it.

async function generateAudioStream(
    aiTTS: AIAudio,
    prompt: string,
    voiceOption: GenerateContentConfig,
    response: CallableResponse<unknown>,
): Promise<number[] | undefined> {
    try {
        const { ai, model } = aiTTS;
        const chunks = await ai.models.generateContentStream(createAudioParams(model, prompt, voiceOption));
        let byteLength = 0;
        let options: WavConversionOptions | undefined = undefined;
        for await (const chunk of chunks) {
            const { rawData, mimeType } = extractInlineAudioData(chunk);
            if (!options && mimeType) {
                options = parseMimeType(mimeType);
                response.sendChunk({
                    type: "metadata",
                    payload: {
                        sampleRate: options.sampleRate,
                    },
                });
            }

            if (rawData && mimeType) {
                const buffer = Buffer.from(rawData, "base64");
                byteLength = byteLength + buffer.length;
                response.sendChunk({
                    type: "data",
                    payload: {
                        buffer,
                    },
                });
            }
        }

        if (options && byteLength > 0) {
            const header = createWavHeader(byteLength, options);
            return [...header];
        }

        return undefined;
    } catch (error) {
        console.error(error);
        throw error;
    }
}

The readFactFunction invokes the withAIAudio higher-order function to generate a base64-encoded string.

The readFactStreamFunction function calls the withAIAudio higher-order function to write chunks to the response and send them to the client. The generateAudioStream function then returns the bytes of the WAV header.

export async function readFactFunction(prompt: string, voiceOption: GenerateContentConfig) {
    return withAIAudio((ai, model) => generateAudio({ ai, model }, prompt, voiceOption));
}

export async function readFactStreamFunction(prompt: string, voiceOption: GenerateContentConfig, response: CallableResponse<unknown>) {
    return withAIAudio((ai, model) => generateAudioStream({ ai, model }, prompt, voiceOption, response));
}

6. Firebase App Configuration and reCAPTCHA Site Key

I implemented a FIREBASE_APP_CONFIG IIFE (Immediately Invoked Function Expression) that runs once to validate the environment variables of the Firebase app.

export const FIREBASE_APP_CONFIG = (() => {
    const env = process.env;
    const missingKeys: string[] = [];
    const apiKey = validate(env.APP_API_KEY, "API Key", missingKeys);
    const appId = validate(env.APP_ID, "App Id", missingKeys);
    const messagingSenderId = validate(env.APP_MESSAGING_SENDER_ID, "Messaging Sender ID", missingKeys);
    const recaptchaSiteKey = validate(env.RECAPTCHA_ENTERPRISE_SITE_KEY, "Recaptcha site key", missingKeys);
    const projectId = validate(env.GCLOUD_PROJECT, "Project ID", missingKeys);

    if (missingKeys.length > 0) {
        throw new Error(`Missing environment variables: ${missingKeys.join(", ")}`);
    }

    return {
        app: {
            apiKey,
            appId,
            projectId,
            messagingSenderId,
            authDomain: `${projectId}.firebaseapp.com`,
            storageBucket: `${projectId}.firebasestorage.app`,
        },
        recaptchaSiteKey,
    };
})();

The getFirebaseConfig function returns the FIREBASE_APP_CONFIG to the Angular application with a Cache-Control header that caches the response for an hour.

The Angular application receives the Firebase app configuration and reCAPTCHA site key from the Cloud Function to initialize Firebase AI Logic and protect resources from unauthorized access and abuse.

export const getFirebaseConfig = onRequest({ cors }, (request, response) => {
    if (!validateRequest(request, response)) {
        return;
    }

    try {
        response.set("Cache-Control", "public, max-age=3600, s-maxage=3600");
        response.json(FIREBASE_APP_CONFIG);
    } catch (err) {
        console.error(err);
        response.status(500).send("Internal Server Error");
    }
});

7. Local Development with Emulators

For local development, I used the Firebase Local Emulator Suite to save cost and time. In the bootstrapFirebase process, the application calls connectFunctionsEmulator to link to the Cloud Functions running at http://localhost:5001.

The port number defaulted to 5001 when firebase init was executed.

function connectEmulators(functions: Functions, remoteConfig: RemoteConfig) {
  if (location.hostname === 'localhost') {
    const host = getValue(remoteConfig, 'functionEmulatorHost').asString();
    const port = getValue(remoteConfig, 'functionEmulatorPort').asNumber();
    connectFunctionsEmulator(functions, host, port);
  }
}
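The functionEmulatorHost, functionEmulatorPort, and functionRegion parameters read above are defined in Firebase Remote Config. A possible remoteconfig.template.json fragment might look like the following; the values shown are assumptions based on the defaults used in this post.

```json
{
  "parameters": {
    "functionEmulatorHost": { "defaultValue": { "value": "localhost" } },
    "functionEmulatorPort": { "defaultValue": { "value": "5001" } },
    "functionRegion": { "defaultValue": { "value": "asia-east2" } }
  }
}
```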

loadFirebaseConfig is a helper function that makes a request to the Cloud Function to obtain the Firebase app configuration and the reCAPTCHA site key.

{
  "getFirebaseConfigUrl": "http://127.0.0.1:5001/vertexai-firebase-6a64f/us-central1/getFirebaseConfig"
}
export type FirebaseConfigResponse = {
  app: FirebaseOptions;
  recaptchaSiteKey: string
}
import { HttpClient } from '@angular/common/http';
import { inject } from '@angular/core';
import { catchError, lastValueFrom, throwError } from 'rxjs';
import config from '../../public/config.json';
import { FirebaseConfigResponse } from './ai/types/firebase-config.type';

async function loadFirebaseConfig() {
  const httpService = inject(HttpClient);
  const firebaseConfig$ =
    httpService.get<FirebaseConfigResponse>(config.getFirebaseConfigUrl)
      .pipe(catchError((e) => throwError(() => e)));
  return lastValueFrom(firebaseConfig$);
}

The bootstrapFirebase function initializes the FirebaseApp and App Check, loads the Firebase remote configuration and cloud functions, and stores them in the config service for later use.

export async function bootstrapFirebase() {
    try {
      const configService = inject(ConfigService);
      const firebaseConfig = await loadFirebaseConfig();
      const { app, recaptchaSiteKey } = firebaseConfig;
      const firebaseApp = initializeApp(app);
      const remoteConfig = await fetchRemoteConfig(firebaseApp);

      initializeAppCheck(firebaseApp, {
        provider: new ReCaptchaEnterpriseProvider(recaptchaSiteKey),
        isTokenAutoRefreshEnabled: true,
      });

      const functionRegion = getValue(remoteConfig, 'functionRegion').asString();
      const functions = getFunctions(firebaseApp, functionRegion);
      connectEmulators(functions, remoteConfig);

      configService.loadConfig(firebaseApp, remoteConfig, functions);
    } catch (err) {
      console.error(err);
    }
}

The AppConfig remains unchanged.

import { ApplicationConfig, provideAppInitializer } from '@angular/core';
import { bootstrapFirebase } from './app.bootstrap';

export const appConfig: ApplicationConfig = {
  providers: [
    provideAppInitializer(async () => bootstrapFirebase()),
  ]
};

8. Angular Implementation

8.1 Audio Tags Component

I create an AudioTagsComponent and a new signal form to input the scene, emotion, pace, and voice name in the Angular frontend.

<div>
  <h3>
    <span class="text-xl">🎙️</span> Customize Audio Generation
  </h3>

  <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
    <!-- Scene -->
    <div class="flex flex-col gap-1.5 md:col-span-2">
      <label for="scene">Scene Description</label>
      <textarea id="scene" [formField]="audioPromptForm.scene"
      ></textarea>
    </div>

    <!-- Emotion -->
    <div class="flex flex-col gap-1.5">
      <label for="emotion">Vocal Emotion</label>
      <input type="text" id="emotion" [formField]="audioPromptForm.emotion"
        placeholder="e.g., panicked, whispers"
      />
    </div>

    <!-- Pace -->
    <div class="flex flex-col gap-1.5">
      <label for="pace">Speaking Pace</label>
      <input type="text" id="pace" [formField]="audioPromptForm.pace"
        placeholder="e.g., very slow, rapid"
      />
    </div>

    <!-- Voice Option -->
    <div class="flex flex-col gap-1.5 md:col-span-2">
      <label for="voiceOption">AI Voice Model</label>
      <select id="voiceOption" [formField]="audioPromptForm.voiceOption"
      >
        <option value="" disabled selected>Select a voice...</option>
        @for (option of sortedVoiceOptions(); track option.name) {
          <option [value]="option.name" class="bg-slate-800">{{ option.label }}</option>
        }
      </select>
    </div>
  </div>
</div>
import { ChangeDetectionStrategy, Component, computed, signal } from '@angular/core';
import { form, FormField } from '@angular/forms/signals';
import { VOICE_OPTIONS } from './constants/voice-options.const';
import { AudioPromptData } from './types/audio-prompt-data.type';

@Component({
  selector: 'app-audio-tags',
  imports: [FormField],
  templateUrl: './audio-tags.component.html',
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class AudioTagsComponent {
    #audioPromptModel = signal<AudioPromptData>({
      scene: 'A news anchor reading the news in a busy newsroom',
      emotion: 'professional, slightly serious',
      pace: 'moderate, clear enunciation',
      voiceOption: 'Kore'
    });
    audioPromptForm = form(this.#audioPromptModel);

    sortedVoiceOptions = computed(() => {
      // Copy before sorting so the shared VOICE_OPTIONS constant is not mutated
      const sortedList = [...VOICE_OPTIONS].sort((a, b) => a.name.localeCompare(b.name));

      return sortedList.map(option => ({
        name: option.name,
        label: `${option.name} - ${option.description}`
      }));
    });

    audioPromptModel = this.#audioPromptModel.asReadonly();
}

The AudioTagsComponent is imported into ObscureFactComponent so that users can enter values into the experimental signal form.

In the HTML template of ObscureFactComponent, the <app-audio-tags> element has a template reference variable, audioTags, and audioTags.audioPromptModel() resolves to an instance of AudioPromptData. The data is passed in the audioTags property of the generateSpeech method's argument.

<div class="w-full mt-6">
    <app-audio-tags #audioTags />

    <h3>A surprising or obscure fact about the tags</h3>
    @if (interestingFact()) {
      <p>{{ interestingFact() }}</p>

      <app-error-display [error]="ttsError()" />

      <app-text-to-speech
        [isLoadingSync]="isLoadingSync()"
        [isLoadingStream]="isLoadingStream()"
        [isLoadingWebAudio]="isLoadingWebAudio()"
        [audioUrl]="audioUrl()"
        (generateSpeech)="generateSpeech({ mode: $event, audioTags: audioTags.audioPromptModel() })"
        [playbackRate]="playbackRate()"
      />
    } @else {
      <p>The tag(s) does not have any interesting or obscure fact.</p>
    }
</div>
import { AudioPromptData } from './audio-prompt-data.type';
import { GenerateSpeechMode } from '../../generate-audio.util';

export type ModeWithAudioTags = {
  mode: GenerateSpeechMode;
  audioTags: AudioPromptData;
};

export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
};

The generateSpeech method uses the fact and audioTags to construct an instance of AudioPrompt. When the mode is stream, the SpeechService calls generateAudioBlobURL to construct a blob URL from the audioPrompt. When the mode is sync, the SpeechService calls generateAudio to generate a base64-encoded string. When the mode is web_audio_api, the AudioPlayerService calls playStream to stream the audio.

import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { ChangeDetectionStrategy, Component, inject, input, OnDestroy, signal } from '@angular/core';
import { revokeBlobURL } from '../blob.util';
import { AudioTagsComponent } from './audio-tags/audio-tags.component';
import { ModeWithAudioTags } from './audio-tags/types/mode-audio-tags.type';
import { generateSpeechHelper, streamSpeechWithWebAudio, ttsError } from './generate-audio.util';
import { AudioPlayerService } from './services/audio-player.service';
// Path assumed; the component is referenced in the imports array below
import { TextToSpeechComponent } from './text-to-speech/text-to-speech.component';

@Component({
  selector: 'app-obscure-fact',
  templateUrl: './obscure-fact.component.html',
  imports: [
    TextToSpeechComponent,
  ],
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class ObscureFactComponent implements OnDestroy {
  interestingFact = input<string | undefined>(undefined);

  speechService = inject(SpeechService);
  audioPlayerService = inject(AudioPlayerService);

  isLoadingSync = signal(false);
  isLoadingStream = signal(false);
  isLoadingWebAudio = signal(false);

  audioUrl = signal<string | undefined>(undefined);

  ttsError = ttsError;

  async generateSpeech({ mode, audioTags }: ModeWithAudioTags) {
    const fact = this.interestingFact();

    if (fact) {
      revokeBlobURL(this.audioUrl);
      this.audioUrl.set(undefined);

      const audioPrompt = {
        ...audioTags,
        transcript: fact,
      };
      if (mode === 'sync' || mode === 'stream') {
        const loadingSignal = mode === 'stream' ? this.isLoadingStream : this.isLoadingSync;
        const speechFn = (audioPrompt: AudioPrompt) => mode === 'stream' ?
            this.speechService.generateAudioBlobURL(audioPrompt) :
            this.speechService.generateAudio(audioPrompt);

        await generateSpeechHelper(audioPrompt, loadingSignal, this.audioUrl, speechFn);
      } else if (mode === 'web_audio_api') {
        await streamSpeechWithWebAudio(
          audioPrompt,
          this.isLoadingWebAudio,
          (audioPrompt: AudioPrompt) => this.audioPlayerService.playStream(audioPrompt));
      }
    }
  }

  ngOnDestroy(): void {
    revokeBlobURL(this.audioUrl);
  }
}

8.2 Call Firebase Cloud Functions directly

The SpeechService has a generateAudio method that calls the readFact cloud function to obtain the audio as a base64-encoded string.

Similarly, the service has a generateAudioBlobURL method that collects the streamed chunks into a buffer and prepends the WAV header to it. The constructBlobURL function then creates a blob URL from the BlobPart array.

export function constructBlobURL(parts: BlobPart[]) {
  return URL.createObjectURL(new Blob(parts, { type: 'audio/wav' }));
}
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { constructBlobURL } from '@/photo-panel/blob.util';
import { inject, Injectable } from '@angular/core';
import { Functions, httpsCallable } from 'firebase/functions';
import { StreamMessage } from '../types/stream-message.type';
import { ConfigService } from './config.service';

@Injectable({
  providedIn: 'root'
})
export class SpeechService {
    private configService = inject(ConfigService);

    private get functions(): Functions {
      if (!this.configService.functions) {
        throw new Error('Firebase Functions has not been initialized.');
      }
      return this.configService.functions;
    }

    async generateAudio(audioPrompt: AudioPrompt) {
      const readFactFunction = httpsCallable<AudioPrompt, string>(
        this.functions, 'textToAudio-readFact'
      );

      const { data: audioUri } = await readFactFunction(audioPrompt);
      return audioUri;
    }

    async generateAudioStream(audioPrompt: AudioPrompt) {
      const readFactStreamFunction = httpsCallable<AudioPrompt, number[] | undefined, StreamMessage>(
        this.functions, 'textToAudio-readFact'
      );

      return readFactStreamFunction.stream(audioPrompt);
    }

    async generateAudioBlobURL(audioPrompt: AudioPrompt) {
      const { stream, data } = await this.generateAudioStream(audioPrompt);

      const audioParts: BlobPart[] = [];
      for await (const audioChunk of stream) {
        if (audioChunk && audioChunk.type === 'data') {
          audioParts.push(new Uint8Array(audioChunk.payload.buffer.data));
        }
      }

      const wavHeader = await data;
      if (wavHeader && wavHeader.length) {
        audioParts.unshift(new Uint8Array(wavHeader));
      }

      return constructBlobURL(audioParts);
    }
}
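
For context on the WAV header that the stream prepends: Gemini TTS models emit raw 16-bit mono PCM at 24 kHz, so the bytes need a 44-byte RIFF/WAV header before the browser can play them as audio/wav. Below is a hedged sketch of such a header builder; the helper name is an assumption, and in the demo the header is produced on the Cloud Function side rather than in the client:

```typescript
// Hypothetical helper illustrating the WAV header prepended to the raw
// PCM chunks. Gemini TTS emits 16-bit mono PCM at 24 kHz; the defaults
// below reflect that format. Not code from the demo itself.
export function buildWavHeader(
  dataLength: number,
  sampleRate = 24_000,
  numChannels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const byteRate = (sampleRate * numChannels * bitsPerSample) / 8;
  const blockAlign = (numChannels * bitsPerSample) / 8;

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataLength, true); // total chunk size
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt subchunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, dataLength, true);     // PCM payload size

  return new Uint8Array(header);
}
```

With a header like this unshifted onto the BlobPart array, constructBlobURL yields a blob that browsers recognize as playable WAV audio.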

Similar to SpeechService.generateAudioBlobURL, the playStream method of AudioPlayerService also calls generateAudioStream to get a stream of chunks, but it plays each chunk immediately as it arrives instead of buffering the whole clip.

import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { inject, Injectable, OnDestroy, signal } from '@angular/core';

@Injectable({
  providedIn: 'root'
})
export class AudioPlayerService implements OnDestroy {
  private speechService = inject(SpeechService);

  async playStream(audioPrompt: AudioPrompt) {
    const { stream } = await this.speechService.generateAudioStream(audioPrompt);
    for await (const audioChunk of stream) {
      // ... process each chunk ...
    }
  }

  ngOnDestroy(): void {
    // ... release resources to prevent memory leaks ...
  }
}
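
The elided "process each chunk" step typically converts the 16-bit PCM bytes into the Float32 samples the Web Audio API expects before copying them into an AudioBuffer. A minimal, self-contained sketch of that conversion (hypothetical helper name; the demo's actual implementation may differ):

```typescript
// Hedged sketch of PCM decoding for Web Audio playback. The Web Audio
// API works with Float32 samples in [-1, 1], while Gemini TTS streams
// little-endian signed 16-bit PCM, so each chunk is normalized first.
export function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Float32Array(bytes.byteLength / 2);
  for (let i = 0; i < samples.length; i++) {
    // little-endian signed 16-bit sample, normalized to [-1, 1)
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

Inside playStream, each converted chunk can then be copied into an AudioBuffer and scheduled with an AudioBufferSourceNode, so playback starts before the whole clip has arrived.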

This is the end of the walkthrough for the demo. You should now be able to input different combinations of scene, emotion, and pace to create a unique personality to say the given text in an audio clip.


9. Caveats and Lessons Learned: Avoiding the Dynamic Prompt Trap

The examples in Gemini AI Studio and Vertex AI Studio use static audio tags and transcripts, and they worked correctly for me.

When I applied dynamic audio tags and transcripts in the demo, the Gemini 3.1 Flash TTS Preview model ignored the audio tags. It took hours of debugging with the Gemini CLI to resolve the issue.

Here are the caveats and lessons learned:

  1. The Token Boundary Trap. The code originally concatenated the tags and the transcript without a space (for example, "[giggle][slow]Before"). The tokenizer then failed to recognize the tags as instructions to change the behavior and pace of the audio. The fix was to insert a space between each tag and the transcript: "[giggle] [slow] Before".

  2. Sanitize inputs before injecting them into the prompt template. The sanitize functions remove Markdown headers (#) and triple quotes from the scene and transcript. The cleansed values are then injected into the prompt template to construct the final audio prompt.

  3. The LLM does not understand idioms. I typed "at a snail's pace" into the signal form, which inserted "[at a snail's pace]" before the line. The model vocalized the tag literally, and no pace change occurred.

  4. "Repetitive Weighting" is a real strategy. If standard tags such as [slow] and [fast] are not dramatic enough, prepend "very" to the pace tag to amplify the effect. This was evident when [very, very, very slow] generated a noticeably longer audio clip than [slow].

  5. Replace newline characters (\n) with the escaped sequence (\\n) to flatten multi-line input into a single paragraph. Once the scene and transcript are cleansed and escaped, they can be injected into the prompt template while the template's structure is preserved for the LLM parser.
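
Lessons 1, 2, and 5 can be sketched as a small prompt-building helper. The function names below are illustrative assumptions, not the demo's actual helpers:

```typescript
// Strip Markdown headers and triple quotes (lesson 2), and flatten
// newlines into the escaped \n sequence (lesson 5).
function sanitize(text: string): string {
  return text
    .replace(/^#+\s*/gm, '')  // remove Markdown headers
    .replace(/"{3}/g, '')     // remove triple quotes
    .replace(/\n/g, '\\n')    // escape newlines into a single paragraph
    .trim();
}

// Join every tag and the transcript with single spaces so the tokenizer
// sees each [tag] as its own token (lesson 1).
function buildAudioPrompt(tags: string[], transcript: string): string {
  return [...tags, sanitize(transcript)].join(' ');
}
```

For example, buildAudioPrompt(['[giggle]', '[slow]'], '# Before\nthe show') keeps a space after each tag instead of producing the "[giggle][slow]Before" form that the model ignored.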


Conclusion

Integrating text-to-speech with Firebase's serverless scalability enables real-time audio generation in Angular applications.

First, the Angular application neither bundles the genai dependency nor stores the Vertex AI environment variables in a .env file. The client simply calls the Cloud Functions to perform the text-to-speech task and obtain an audio stream.

The Cloud Functions receive arguments from the client and execute a TTS operation, either returning the entire audio as a base64-encoded string or streaming the audio bytes in chunks. During local development, the Firebase Emulator serves the functions at http://localhost:5001 instead of the ones deployed on the Cloud Run platform, which saves cost.

Try cloning the GitHub repository, uploading an image to generate an obscure fact, and using the Gemini 3.1 Flash TTS Preview model to speak it with the specified scene, emotion, and pace.

Resources
