In this technical deep-dive, we explore how to integrate the latest Google Gemini models into a modern web stack. We use Gemini 3 Flash Preview to generate insightful and obscure facts and Gemini 2.5 Flash to provide high-quality Text-to-Speech (TTS) capabilities. We will walk through the implementation of synchronous audio delivery, real-time streaming with dynamic WAV header construction, and professional-grade browser playback using Angular.
1. Prerequisites
The technical stack includes:
- Angular v21: Used to build the frontend user interface.
- Node.js LTS: For implementing the backend logic in the Firebase Cloud Functions.
- Firebase Cloud Functions: To be called by the frontend to either generate the audio synchronously or stream the chunks to the client.
- Firebase Cloud Functions Emulator: To test the functions locally at http://localhost:5001.
- Gemini 3 Flash Preview: Used for generating the text content (obscure facts) from hashtags.
- Gemini 2.5 Flash: Used for the Text-to-Speech (TTS) operation.
- Vertex AI: Hosts the Gemini models that generate the text and the speech.
2. Use Cases of Text-to-Speech
The application supports three text-to-speech workflows:
- Synchronous generation: the Cloud Function generates the audio, converts the raw data from the L16 mime type to the WAV mime type, and returns it to the Angular application, which plays the speech.
- Streaming with a Blob URL: the Angular application requests a stream, so the Cloud Function obtains a stream from the Gemini TTS model, decodes each base64 chunk into a buffer, and sends it to the client. After all chunks have been processed, the function calculates the WAV header and returns it. The client collects the chunks and the WAV header, creates a Blob URL, and assigns it to an HTML5 <audio> element to play the speech.
- Streaming with the Web Audio API: an advanced, low-latency approach to playing speech in real time. The Angular application again requests a stream and the Cloud Function transfers the chunks as they arrive. The client uses the Web Audio API (AudioContext) to schedule and play each chunk immediately, rather than waiting to construct a valid Blob URL at the end.
3. Voice Tones
Before diving into the functions, I defined different tones and voice profiles for the text-to-speech operations. Gemini TTS allows me to use prebuilt voices to narrate the text.
Moreover, I prepend a tone description to the prompt so the voice sounds like Darth Vader or like an energetic individual.
// text-to-audio/constants/tone.const.ts
export const DARTH_VADER_TONE = `You are a voice-over specialist for an advanced Text-to-Speech engine.
Your goal is to generate text and formatting that mimics the voice of Darth Vader (James Earl Jones).
**Vocal Characteristics:**
1. **Pitch:** Extremely low, resonant, and bass-heavy.
2. **Cadence:** Slow, deliberate, and rhythmic. Never rush. Each word carries the weight of authority.
3. **Timbre:** Authoritative, menacing, and slightly gravelly, but with perfect clarity.
4. **Breathing:** Every 2-3 short sentences (or one long sentence), you must insert
a mechanical respirator sound marker: [Mechanical Breath: Inhale/Exhale]
Read the text below EXACTLY once:`;
export const LIGHT_TONE = `Speak with a high-pitched, infectious energy.
End every sentence with a rising, joyful intonation.
Sound incredibly eager to please, as if you’ve just won a prize.
Read the text below EXACTLY once:`;
The DARTH_VADER_TONE makes the voice speak like Darth Vader: deep, deliberate, and punctuated by heavy mechanical breathing. The LIGHT_TONE, on the other hand, makes the voice sound energetic and joyful.
4. Voice Configuration
import { GenerateContentConfig } from '@google/genai';
export function createVoiceConfig(voiceName = "Kore"): GenerateContentConfig {
return {
responseModalities: ["audio"],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: {
voiceName,
},
},
},
};
}
I defined a createVoiceConfig helper to build a voice configuration. voiceName is a prebuilt voice name such as Kore or Puck; the default voice is Kore when no argument is provided.
const KORE_VOICE_CONFIG = createVoiceConfig();
const PUCK_VOICE_CONFIG = createVoiceConfig("Puck");
5. Cloud Function to Generate Speech from Text
The backend entry point utilizes an onCall trigger. We use a ternary expression to determine whether to provide a streaming response (acceptsStreaming is true) or a synchronous response (acceptsStreaming is false) based on the client's capabilities.
import { onCall } from "firebase-functions/v2/https";
import { readFactFunction, readFactFunctionStream } from "./read-fact";
const cors = process.env.WHITELIST ? process.env.WHITELIST.split(",") : true;
const options = {
cors,
enforceAppCheck: true,
timeoutSeconds: 180,
};
export const readFact = onCall(options, ({ data, acceptsStreaming }, response) =>
acceptsStreaming && response ? readFactFunctionStream(data, response) : readFactFunction(data),
);
6. Validate Environment Variables
import { HttpsError } from "firebase-functions/v2/https";
import { logger } from "firebase-functions";
export function validate(value: string | undefined, fieldName: string, missingKeys: string[]) {
const err = `${fieldName} is missing.`;
if (!value) {
logger.error(err);
missingKeys.push(fieldName);
return "";
}
return value;
}
export function validateAudioConfigFields() {
const env = process.env;
const vertexai = (env.GOOGLE_GENAI_USE_VERTEXAI || "false") === "true";
const missingKeys: string[] = [];
const location = validate(env.GOOGLE_CLOUD_LOCATION, "Vertex Location", missingKeys);
const model = validate(env.GEMINI_TTS_MODEL_NAME, "Gemini TTS Model Name", missingKeys);
const project = validate(env.GOOGLE_CLOUD_QUOTA_PROJECT, "Google Cloud Project", missingKeys);
if (missingKeys.length > 0) {
throw new HttpsError("failed-precondition", `Missing environment variables: ${missingKeys.join(", ")}`);
}
return {
genAIOptions: {
project,
location,
vertexai,
},
model,
};
}
The validateAudioConfigFields function validates that the Gemini TTS model name, the Google Cloud project ID, and the Google Cloud location are defined. When the missingKeys list is non-empty, an HttpsError is thrown and the TTS operation is halted.
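For local testing with the Functions emulator, these values can live in a functions/.env file or be supplied by your deployment environment. The values below are placeholders; substitute your own project ID, region, and model name:
# functions/.env (placeholder values)
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_CLOUD_QUOTA_PROJECT=your-project-id
GEMINI_TTS_MODEL_NAME=gemini-2.5-flash-preview-tts
# Optional: comma-separated origins for CORS (used by the onCall options in section 5)
WHITELIST=http://localhost:4200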
7. L16 to WAV Conversion Utility
export type WavConversionOptions = {
numChannels: number;
sampleRate: number;
bitsPerSample: number;
};
import { HttpsError } from "firebase-functions/v2/https";
export function encodeBase64String(rawData: string, mimeType: string): string {
const wavBuffer = convertToWav(rawData, mimeType);
const base64Data = Buffer.from(wavBuffer).toString("base64");
return `data:audio/wav;base64,${base64Data}`;
}
export function convertToWav(rawData: string, mimeType: string): Buffer<ArrayBuffer> {
const options = parseMimeType(mimeType);
// Decode the base64 payload first so the WAV header reflects the actual PCM byte length.
const buffer = Buffer.from(rawData, "base64");
const wavHeader = createWavHeader(buffer.length, options);
return Buffer.concat([wavHeader, buffer]);
}
export function parseMimeType(mimeType: string) {
const [fileType, ...params] = mimeType.split(";").map((s) => s.trim());
const format = fileType.split("/")[1];
const options: Partial<WavConversionOptions> = {
numChannels: 1,
};
if (format && format.startsWith("L")) {
const bits = parseInt(format.slice(1), 10);
if (!isNaN(bits)) {
options.bitsPerSample = bits;
}
}
for (const param of params) {
const [key, value] = param.split("=").map((s) => s.trim());
if (key === "rate") {
options.sampleRate = parseInt(value, 10);
}
}
if (!isWavConversionOptions(options)) {
throw new HttpsError(
"invalid-argument",
`Invalid or incomplete mimeType: "${mimeType}". ` +
"Could not determine all required WAV options (sampleRate, bitsPerSample).",
);
}
return options;
}
function isWavConversionOptions(options: Partial<WavConversionOptions>): options is WavConversionOptions {
// A valid WavConversionOptions object must have all properties as valid numbers.
return (
typeof options.numChannels === "number" &&
!isNaN(options.numChannels) &&
typeof options.sampleRate === "number" &&
!isNaN(options.sampleRate) &&
typeof options.bitsPerSample === "number" &&
!isNaN(options.bitsPerSample)
);
}
export function createWavHeader(dataLength: number, options: WavConversionOptions) {
const { numChannels, sampleRate, bitsPerSample } = options;
const byteRate = (sampleRate * numChannels * bitsPerSample) / 8;
const blockAlign = (numChannels * bitsPerSample) / 8;
const buffer = Buffer.alloc(44);
buffer.write("RIFF", 0); // ChunkID
buffer.writeUInt32LE(36 + dataLength, 4); // ChunkSize
buffer.write("WAVE", 8); // Format
buffer.write("fmt ", 12); // Subchunk1ID
buffer.writeUInt32LE(16, 16); // Subchunk1Size (PCM)
buffer.writeUInt16LE(1, 20); // AudioFormat (1 = PCM)
buffer.writeUInt16LE(numChannels, 22); // NumChannels
buffer.writeUInt32LE(sampleRate, 24); // SampleRate
buffer.writeUInt32LE(byteRate, 28); // ByteRate
buffer.writeUInt16LE(blockAlign, 32); // BlockAlign
buffer.writeUInt16LE(bitsPerSample, 34); // BitsPerSample
buffer.write("data", 36); // Subchunk2ID
buffer.writeUInt32LE(dataLength, 40); // Subchunk2Size
return buffer;
}
The Gemini TTS model generates audio with the mime type audio/L16 (raw 16-bit PCM). The HTML <audio> element does not understand this mime type, so a conversion from L16 to WAV is needed.
encodeBase64String uses the raw data and the mime type to create a WAV buffer with a valid WAV header, then prepends data:audio/wav;base64, to the base64-encoded buffer to form a data URL.
createWavHeader creates the 44-byte WAV header.
convertToWav calls createWavHeader to build the header and concatenates it with the decoded PCM buffer to produce a complete WAV buffer.
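As a quick sanity check, the utilities can be exercised like this. The mime type string is an assumed example of what Gemini TTS reports, rawData stands in for the base64 payload from inlineData.data, and the import path is hypothetical:
// wav-conversion.example.ts (illustrative only)
import { parseMimeType, encodeBase64String } from "./wav-conversion";

// Gemini TTS typically reports 16-bit PCM, e.g. at a 24 kHz sample rate.
const mimeType = "audio/L16;codec=pcm;rate=24000";

const options = parseMimeType(mimeType);
// -> { numChannels: 1, bitsPerSample: 16, sampleRate: 24000 }

declare const rawData: string; // base64-encoded PCM from the model response
const dataUrl = encodeBase64String(rawData, mimeType);
// -> "data:audio/wav;base64,UklG..." (the WAV "RIF" magic bytes encode to the prefix "UklG")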
8. Text to Speech Examples
async function withAIAudio(callback: (ai: GoogleGenAI, model: string) => Promise<string | number[] | undefined>) {
try {
const variables = validateAudioConfigFields();
if (!variables) {
return "";
}
const { genAIOptions, model } = variables;
const ai = new GoogleGenAI(genAIOptions);
return await callback(ai, model);
} catch (e) {
console.error(e);
if (e instanceof HttpsError) {
throw e;
}
throw new HttpsError("internal", "An internal error occurred while setting up the AI client.", { originalError: (e as Error).message });
}
}
withAIAudio is a higher-order function that sets up the GoogleGenAI client and executes a callback returning a Promise<string | number[] | undefined>: the encoded base64 data URL (sync path), the WAV header as a number array (streaming path), or undefined.
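The AIAudio type is not shown in the article; a minimal definition consistent with how it is used below would be:
import { GoogleGenAI } from "@google/genai";

export type AIAudio = {
  ai: GoogleGenAI;
  model: string;
};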
8.1. Serverless: Generating Inline Data at Once (Sync)
In the synchronous workflow, the readFactFunction function generates the entire audio part using the KORE_VOICE_CONFIG and DARTH_VADER_TONE. The result is converted to a WAV buffer and returned as a Base64 Data URL.
export async function readFactFunction(text: string) {
return withAIAudio((ai, model) => generateAudio({ ai, model }, text));
}
text is the obscure fact that the Angular application passes to the cloud function to transform into speech.
function createAudioParams(model: string, contents: string, config?: GenerateContentConfig) {
return {
model,
contents: {
role: "user",
parts: [
{
text: contents,
},
],
},
config,
};
}
createAudioParams creates the content configuration with the model name, the text, and the voice configuration.
function extractInlineAudioData(response: GenerateContentResponse): {
rawData: string | undefined;
mimeType: string | undefined;
} {
const { data: rawData, mimeType } = response.candidates?.[0]?.content?.parts?.[0]?.inlineData ?? {};
return { rawData, mimeType };
}
function getBase64DataUrl(response: GenerateContentResponse): string {
const { rawData, mimeType } = extractInlineAudioData(response);
if (rawData && mimeType) {
return encodeBase64String(rawData, mimeType);
}
throw new Error("Audio generation failed: No audio data received.");
}
getBase64DataUrl extracts the raw data and the mime type from the inline data. It provides both the raw data and the mime type to the encodeBase64String function to return the encoded base64 data URL.
async function generateAudio(aiTTS: AIAudio, text: string): Promise<string> {
try {
const { ai, model } = aiTTS;
const contents = `${DARTH_VADER_TONE.trim()} ${text.trim()}`;
const response = await ai.models.generateContent(
createAudioParams(model, contents, KORE_VOICE_CONFIG)
);
return getBase64DataUrl(response);
} catch (error) {
console.error("Error generating audio:", error);
throw error;
}
}
The generateAudio function appends the text to the Darth Vader tone to form the prompt. Then it uses the ai instance to generate the audio response. Finally, it passes the response to getBase64DataUrl to return the encoded base64 data URL.
8.2. Streaming Inline Data with Dynamic WAV Header
For a better user experience, the streaming workflow sends the audio chunks to the client as they are generated. The function accumulates the total byte length while iterating over the stream and constructs the WAV header from that length and the WAV options at the very end.
export async function readFactFunctionStream(text: string, response: CallableResponse<unknown>) {
return withAIAudio((ai, model) => generateAudioStream({ ai, model }, text, response));
}
async function generateAudioStream(
aiTTS: AIAudio,
text: string,
response: CallableResponse<unknown>,
): Promise<number[] | undefined> {
try {
const { ai, model } = aiTTS;
const contents = `${LIGHT_TONE.trim()} ${text.trim()}`;
const chunks = await ai.models.generateContentStream(createAudioParams(model, contents, PUCK_VOICE_CONFIG));
let byteLength = 0;
let options: WavConversionOptions | undefined = undefined;
for await (const chunk of chunks) {
const { rawData, mimeType } = extractInlineAudioData(chunk);
if (!options && mimeType) {
options = parseMimeType(mimeType);
response.sendChunk({
type: "metadata",
payload: {
sampleRate: options.sampleRate,
},
});
}
if (rawData && mimeType) {
const buffer = Buffer.from(rawData, "base64");
byteLength = byteLength + buffer.length;
response.sendChunk({
type: "data",
payload: {
buffer,
},
});
}
}
if (options && byteLength > 0) {
const header = createWavHeader(byteLength, options);
return [...header];
}
return undefined;
} catch (error) {
console.error(error);
throw error;
}
}
The generateAudioStream function obtains a stream of chunks from ai.models.generateContentStream. Each chunk is decoded into a buffer before being sent to the client. Before returning, the function constructs a WAV header from the accumulated byte length and the WAV options and returns it as the callable's result.
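Putting the pieces together, the client sees a message sequence along these lines (the values are illustrative, not real output):
// 1. One metadata message, sent as soon as the first chunk's mime type is parsed:
//    { type: "metadata", payload: { sampleRate: 24000 } }
// 2. One data message per audio chunk; the Node Buffer serializes to JSON as { type: "Buffer", data: number[] }:
//    { type: "data", payload: { buffer: { type: "Buffer", data: [12, 0, 255, ...] } } }
// 3. Finally, the callable's return value: the 44-byte WAV header as number[], or undefined if no audio was produced.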
9. Angular Implementation
9.1 Config Service
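// ai/services/config.service.ts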
import { Injectable } from '@angular/core';
import { FirebaseApp } from 'firebase/app';
import { Functions } from 'firebase/functions';
import { RemoteConfig } from 'firebase/remote-config';
@Injectable({
providedIn: 'root'
})
export class ConfigService {
remoteConfig: RemoteConfig | undefined = undefined;
firebaseApp: FirebaseApp | undefined = undefined;
functions: Functions | undefined = undefined;
loadConfig(firebaseApp: FirebaseApp, remoteConfig: RemoteConfig, functions: Functions) {
this.firebaseApp = firebaseApp;
this.remoteConfig = remoteConfig;
this.functions = functions;
}
}
The ConfigService keeps a reference to the Firebase Functions instance so the SpeechService can use it later.
loadConfig is called in provideAppInitializer to perform the initialization, as sketched below.
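A minimal sketch of that wiring, assuming an app.config.ts, your own Firebase options, and the service living under ai/services:
// app.config.ts (sketch; file locations and Firebase options are assumptions)
import { ApplicationConfig, inject, provideAppInitializer } from '@angular/core';
import { FirebaseOptions, initializeApp } from 'firebase/app';
import { getFunctions } from 'firebase/functions';
import { getRemoteConfig } from 'firebase/remote-config';
import { ConfigService } from './ai/services/config.service';

declare const firebaseConfig: FirebaseOptions; // your project's Firebase options

export const appConfig: ApplicationConfig = {
  providers: [
    provideAppInitializer(() => {
      const configService = inject(ConfigService);
      const firebaseApp = initializeApp(firebaseConfig);
      const remoteConfig = getRemoteConfig(firebaseApp);
      const functions = getFunctions(firebaseApp);
      // When testing locally, import connectFunctionsEmulator from 'firebase/functions'
      // and call connectFunctionsEmulator(functions, 'localhost', 5001).
      configService.loadConfig(firebaseApp, remoteConfig, functions);
    }),
  ],
};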
9.2 Speech Service
The SpeechService handles the network call and re-assembles the downloaded chunks into a single Blob.
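// photo-panel/blob.util.ts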
export function constructBlobURL(parts: BlobPart[]) {
return URL.createObjectURL(new Blob(parts, { type: 'audio/wav' }));
}
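// ai/types/stream-message.type.ts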
export type SerializedBuffer = {
type: 'Buffer';
data: number[];
}
export type StreamMessage =
| {
type: "metadata";
payload: {
sampleRate: number;
}
}
| {
type: "data";
payload: {
buffer: SerializedBuffer,
}
};
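// ai/services/speech.service.ts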
import { constructBlobURL } from '@/photo-panel/blob.util';
import { inject, Injectable } from '@angular/core';
import { Functions, httpsCallable } from 'firebase/functions';
import { StreamMessage } from '../types/stream-message.type';
import { ConfigService } from './config.service';
@Injectable({
providedIn: 'root'
})
export class SpeechService {
private configService = inject(ConfigService);
private get functions(): Functions {
if (!this.configService.functions) {
throw new Error('Firebase Functions has not been initialized.');
}
return this.configService.functions;
}
async generateAudio(text: string) {
const readFactFunction = httpsCallable<string, string>(
this.functions, 'textToAudio-readFact'
);
const { data: audioUri } = await readFactFunction(text);
return audioUri;
}
async generateAudioStream(text: string) {
const readFactStreamFunction = httpsCallable<string, number[] | undefined, StreamMessage>(
this.functions, 'textToAudio-readFact'
);
return readFactStreamFunction.stream(text);
}
async generateAudioBlobURL(text: string) {
const { stream, data } = await this.generateAudioStream(text);
const audioParts: BlobPart[] = [];
for await (const audioChunk of stream) {
if (audioChunk && audioChunk.type === 'data') {
audioParts.push(new Uint8Array(audioChunk.payload.buffer.data));
}
}
const wavHeader = await data;
if (wavHeader && wavHeader.length) {
audioParts.unshift(new Uint8Array(wavHeader));
}
return constructBlobURL(audioParts);
}
}
generateAudio calls the Cloud Function textToAudio-readFact directly and receives the encoded base64 data URL.
generateAudioStream calls the same function but uses .stream() to receive the streamed results from the server.
generateAudioBlobURL collects all the chunks into an audioParts array in the for await loop. Awaiting data yields the WAV header, which is inserted at the beginning of the audioParts array. The constructBlobURL helper function then creates a Blob URL from the parts.
9.3 Audio Player Service
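// services/audio-player.service.ts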
import { SpeechService } from '@/ai/services/speech.service';
import { inject, Injectable, OnDestroy, signal } from '@angular/core';
const INT16_MAX_VALUE = 32768;
@Injectable({
providedIn: 'root'
})
export class AudioPlayerService implements OnDestroy {
private audioCtx: AudioContext | undefined = undefined;
private speechService = inject(SpeechService);
private nextStartTime = 0;
private activeSources: AudioBufferSourceNode[] = [];
playbackRate = signal(1);
async playStream(text: string) {
this.stopAll();
this.playbackRate.set(this.setRandomPlaybackRate());
try {
const { stream } = await this.speechService.generateAudioStream(text);
for await (const audioChunk of stream) {
if (audioChunk.type === 'metadata') {
this.initializeAudioContext(audioChunk.payload.sampleRate);
} else if (audioChunk.type === 'data') {
if (!this.audioCtx) {
console.warn("Audio data received before metadata. Skipping chunk.");
continue;
}
this.processChunk(audioChunk.payload.buffer.data);
}
}
} catch (error) {
console.error('Failed to play audio stream:', error);
// Optionally, reset state or notify the user
this.stopAll();
}
}
private stopAll() {
this.activeSources.forEach(sourceNode => {
try {
sourceNode.stop();
sourceNode.disconnect();
} catch (e) {
// It's common for stop() to be called on a node that has already finished.
// We can safely ignore these "InvalidStateError" exceptions.
}
});
this.activeSources = [];
this.nextStartTime = 0;
if (this.audioCtx) {
this.audioCtx?.close();
this.audioCtx = undefined;
}
}
private processChunk(rawData: number[]) {
const float32Data = this.normalizeSoundSamples(rawData);
if (float32Data.length === 0 || !this.audioCtx) {
return;
}
const buffer = this.audioCtx.createBuffer(1, float32Data.length, this.audioCtx.sampleRate);
buffer.copyToChannel(float32Data, 0);
const sourceNode = this.connectSource(buffer);
sourceNode.playbackRate.value = this.playbackRate();
const playTime = Math.max(this.nextStartTime, this.audioCtx.currentTime);
sourceNode.start(playTime);
const actualDuration = buffer.duration / this.playbackRate();
this.nextStartTime = playTime + actualDuration;
}
private normalizeSoundSamples(rawData: number[]) {
const rawBuffer = new Uint8Array(rawData).buffer;
const byteLength = rawBuffer.byteLength % 2 === 0 ? rawBuffer.byteLength : rawBuffer.byteLength - 1;
if (byteLength === 0) {
return new Float32Array(0);
}
const int16Data = new Int16Array(rawBuffer, 0, byteLength / 2);
const float32Data = new Float32Array(int16Data.length);
for (let i = 0; i < int16Data.length; i++) {
float32Data[i] = (int16Data[i] * 1.0) / INT16_MAX_VALUE;
}
return float32Data;
}
private setRandomPlaybackRate(min = 0.85, max = 1.3) {
const rawRate = Math.random() * (max - min) + min;
return Math.round(rawRate * 100) / 100;
}
private connectSource(buffer: AudioBuffer) {
if (!this.audioCtx) {
throw new Error("Audio context is not initialized.");
}
const sourceNode = this.audioCtx.createBufferSource();
sourceNode.buffer = buffer;
sourceNode.connect(this.audioCtx.destination);
this.activeSources.push(sourceNode);
// Cleanup: Remove from array when this specific chunk finishes playing
sourceNode.onended = () => {
const index = this.activeSources.indexOf(sourceNode);
if (index >= 0) {
this.activeSources.splice(index, 1);
}
};
return sourceNode;
}
private initializeAudioContext(sampleRate: number) {
// Ensure any old context is closed before creating a new one.
if (this.audioCtx) {
this.audioCtx.close();
}
this.audioCtx = new AudioContext({ sampleRate });
if (this.audioCtx.state === 'suspended') {
this.audioCtx.resume();
}
this.nextStartTime = this.audioCtx.currentTime;
}
ngOnDestroy(): void {
this.stopAll();
}
}
Browsers often block an AudioContext from starting until a user interaction occurs. The check if (this.audioCtx.state === 'suspended') { this.audioCtx.resume(); } resumes the AudioContext if it is suspended.
playStream reuses this.speechService.generateAudioStream(text) to obtain a stream of chunks. processChunk converts the raw data to a Float32Array, copies it into an AudioBuffer, wraps the buffer in an AudioBufferSourceNode connected to the AudioContext destination, and schedules the node to start at nextStartTime so that consecutive chunks do not overlap. playStream intentionally ignores the WAV header to reduce latency.
In the ngOnDestroy lifecycle hook, all source nodes stop playing and are disconnected from the AudioContext. This avoids memory leaks when users click the Web Audio API button to start a new stream before all of the previously scheduled nodes have finished playing and been removed.
9.4 Component Interaction: ObscureFactComponent and TextToSpeechComponent
The parent ObscureFactComponent orchestrates the generated text and passes the resulting URI to the native audio element.
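// photo-panel/blob.util.ts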
import { Signal } from '@angular/core';
function isValidBlobUrl(url: string) {
try {
const parsed = new URL(url);
return parsed.protocol === 'blob:';
} catch (e) {
console.error(e);
return false;
}
}
export function revokeBlobURL(dataUrl: Signal<string | undefined>) {
const blobUrl = dataUrl();
if (blobUrl && isValidBlobUrl(blobUrl)) {
URL.revokeObjectURL(blobUrl);
}
}
// generate-audio.util.ts
import { signal, WritableSignal } from '@angular/core';
export const ttsError = signal('');
export type GenerateSpeechMode = 'sync' | 'stream' | 'web_audio_api';
export async function generateSpeechHelper(
text: string,
loadingSignal: WritableSignal<boolean>,
urlSignal: WritableSignal<string | undefined>,
speechFn: (fact: string) => Promise<string>
) {
try {
ttsError.set('');
loadingSignal.set(true);
const uri = await speechFn(text);
urlSignal.set(uri);
} catch (e) {
console.error(e);
ttsError.set('Error generating speech from text.');
} finally {
loadingSignal.set(false);
}
}
export async function streamSpeechWithWebAudio(
text: string,
loadingSignal: WritableSignal<boolean>,
webAudioApiFn: (text: string) => Promise<void>
) {
try {
loadingSignal.set(true);
await webAudioApiFn(text);
} catch (e) {
console.error(e);
ttsError.set('Error streaming speech using the Web Audio API.');
} finally {
loadingSignal.set(false);
}
}
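<!-- obscure-fact.component.html -->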
<div>
<h3>A surprising or obscure fact about the tags</h3>
@if (interestingFact()) {
<p>{{ interestingFact() }}</p>
<app-error-display [error]="ttsError()" />
<app-text-to-speech
[isLoadingSync]="isLoadingSync()"
[isLoadingStream]="isLoadingStream()"
[isLoadingWebAudio]="isLoadingWebAudio()"
[audioUrl]="audioUrl()"
(generateSpeech)="generateSpeech($event)"
[playbackRate]="playbackRate()"
/>
} @else {
<p>The tag(s) do not have any interesting or obscure facts.</p>
}
</div>
When users click any button in the TextToSpeechComponent, the generateSpeech custom event emits a GenerateSpeechMode to the ObscureFactComponent. The ObscureFactComponent executes the generateSpeech method, examines the value of mode, and generates the speech accordingly.
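The TextToSpeechComponent itself is mostly presentational; a minimal sketch of its contract (the markup and exact file layout are assumptions, while the selector, inputs, and output names come from the parent template above) could look like this:
// text-to-speech/text-to-speech.component.ts (sketch)
import { ChangeDetectionStrategy, Component, input, output } from '@angular/core';
import { GenerateSpeechMode } from '../generate-audio.util';

@Component({
  selector: 'app-text-to-speech',
  template: `
    <button [disabled]="isLoadingSync()" (click)="generateSpeech.emit({ mode: 'sync' })">Generate (sync)</button>
    <button [disabled]="isLoadingStream()" (click)="generateSpeech.emit({ mode: 'stream' })">Generate (stream)</button>
    <button [disabled]="isLoadingWebAudio()" (click)="generateSpeech.emit({ mode: 'web_audio_api' })">Web Audio API</button>
    @if (audioUrl(); as url) {
      <!-- The Blob URL or base64 data URL is handed to a native HTML5 audio element. -->
      <audio controls [src]="url"></audio>
    }
  `,
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class TextToSpeechComponent {
  isLoadingSync = input(false);
  isLoadingStream = input(false);
  isLoadingWebAudio = input(false);
  audioUrl = input<string | undefined>(undefined);
  playbackRate = input(1); // displayed or applied to the audio element as needed
  generateSpeech = output<{ mode: GenerateSpeechMode }>();
}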
import { SpeechService } from '@/ai/services/speech.service';
import { ErrorDisplayComponent } from '@/error-display/error-display.component';
import { ChangeDetectionStrategy, Component, inject, input, OnDestroy, signal } from '@angular/core';
import { revokeBlobURL } from '../blob.util';
import { generateSpeechHelper, GenerateSpeechMode, streamSpeechWithWebAudio, ttsError } from './generate-audio.util';
import { AudioPlayerService } from './services/audio-player.service';
import { TextToSpeechComponent } from './text-to-speech/text-to-speech.component';
@Component({
selector: 'app-obscure-fact',
templateUrl: './obscure-fact.component.html',
imports: [
TextToSpeechComponent,
ErrorDisplayComponent
],
changeDetection: ChangeDetectionStrategy.OnPush,
})
export class ObscureFactComponent implements OnDestroy {
interestingFact = input<string | undefined>(undefined);
speechService = inject(SpeechService);
audioPlayerService = inject(AudioPlayerService);
audioUrl = signal<string | undefined>(undefined);
playbackRate = this.audioPlayerService.playbackRate;
ttsError = ttsError;
isLoadingSync = signal(false);
isLoadingStream = signal(false);
isLoadingWebAudio = signal(false);
async generateSpeech({ mode }: { mode: GenerateSpeechMode }) {
const fact = this.interestingFact();
if (fact) {
revokeBlobURL(this.audioUrl);
this.audioUrl.set(undefined);
if (mode === 'sync' || mode === 'stream') {
const loadingSignal = mode === 'stream' ? this.isLoadingStream : this.isLoadingSync;
const speechFn = (text: string) => mode === 'stream' ?
this.speechService.generateAudioBlobURL(text) :
this.speechService.generateAudio(text);
await generateSpeechHelper(fact, loadingSignal, this.audioUrl, speechFn);
} else if (mode === 'web_audio_api') {
await streamSpeechWithWebAudio(
fact,
this.isLoadingWebAudio,
(text: string) => this.audioPlayerService.playStream(text));
}
}
}
ngOnDestroy(): void {
revokeBlobURL(this.audioUrl);
}
}
Conclusion
By leveraging Angular and Firebase's Gen 2 streaming functions, we can create seamless AI-driven voice experiences. Using Gemini 3 Flash Preview for text generation and Gemini 2.5 Flash for character-driven TTS (like our Darth Vader example) allows for truly immersive web applications.


