Daniel Odii
Building a Real-Time Conversational AI Agent with LiveKit, Gemini & Express

Before you start reading:

  1. This tutorial covers the core system for the conversational AI agent. You can improve it further on your own, since the main work (both UI and API) is done here.

  2. This is my first online article, so please forgive my writing style.

While I've been working with AI for a while now, it has mostly been for text- and image-based input/output. I always felt building something with voice input would be a much more complex task, but with LiveKit it has never been easier to build. And no, this is not the traditional type of voice bot where the user's audio input is received and processed in separate chained steps; this is an actual real-time, near-human conversational voice agent, think Gemini's voice conversation mode or ChatGPT's voice conversation mode.

Traditional voice bots suffer from high latency because they daisy-chain three separate models (STT → LLM → TTS). In this guide, we'll build a Native Multimodal agent. It hears and speaks audio directly using Gemini 2.5 (specifically "gemini-2.5-flash-native-audio-preview"), resulting in a conversational experience that feels incredibly human.



Without wasting much time, let's dive right into business. We will set up a very simple Express + TypeScript project just for this tutorial, but you can still adapt the style and structure to an already existing Express-TS project.

For this tutorial, these are the essential prerequisites:

  1. A working Gemini API key (make sure to enable the Speech-to-Text and Text-to-Speech APIs in Google Cloud)
  2. A service account JSON key file (download this from your Google Cloud project too)
  3. LiveKit env credentials (you will need a LiveKit account for this; sign up here)

Let's initialize the project

  1. First, we set up a modern TypeScript environment using ES Modules.

On your terminal:
mkdir voice-ai-agent && cd voice-ai-agent
npm init -y

  2. Update your package.json to include "type": "module" and the other necessary script commands:
{
  "name": "voice-ai-agent",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "tsx watch src/index.ts"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "type": "module"
}


Create a "tsconfig.json" to configure TypeScript
Since we are using modern Node.js features, our configuration needs to reflect that:

{
    "compilerOptions": {
        "target": "ES2022",
        "module": "NodeNext",
        "moduleResolution": "NodeNext",
        "outDir": "./dist",
        "rootDir": "./src",
        "strict": true,
        "esModuleInterop": true,
        "skipLibCheck": true
    },
    "include": ["src/**/*"]
}
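
One detail worth calling out: because "module" is set to "NodeNext", relative imports in your TypeScript files must include the ".js" extension (the extension of the compiled output, not the source). You'll see this later in "agent.ts":

import { prompt } from './prompt.js'; // note the .js extension, even though the source file is src/prompt.ts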






Installing Dependencies.

Run these in your project root:

a. Development dependencies:

npm install -D typescript tsx @types/node @types/express @types/cors

  • typescript: To compile our TS code.
  • tsx: This is a lifesaver. It allows you to run .ts files directly (e.g., tsx watch src/index.ts) without manually compiling to JS every time you make a change.
  • @types/...: TypeScript definitions for our libraries.

b. Core dependencies

npm install @livekit/agents @livekit/agents-plugin-google @livekit/rtc-node livekit-server-sdk express cors dotenv

  • @livekit/agents: Core framework for building LiveKit agents. It handles agent lifecycle, job execution, room connections, and orchestration logic for real-time conversational agents.
  • @livekit/agents-plugin-google: LiveKit Google plugin that integrates the Gemini RealtimeModel (beta) for speech-to-text, language understanding, and text-to-speech (TTS) in conversational agents.
  • @livekit/rtc-node: Node.js bindings for LiveKit's real-time communication (RTC) stack. Required for connecting agents to LiveKit rooms and handling audio streams in a Node environment.
  • livekit-server-sdk: Server-side SDK used to create rooms, generate access tokens, manage participants, and interact with the LiveKit server securely.
  • express: Lightweight Node.js web framework used to expose HTTP endpoints (e.g., token generation, health checks, or agent triggers).
  • cors: Express middleware that enables Cross-Origin Resource Sharing (CORS), allowing your API to be safely accessed from browsers or frontend apps hosted on different domains.
  • dotenv: Loads environment variables from a .env file into process.env, making it easy to manage secrets like LiveKit API keys and Google credentials without hardcoding them.

Setting up Environment Variables

Create a .env file in your project root and include these (we will use dotenv to load them in Express):

LIVEKIT_URL=wss://your-livekit-server.livekit.cloud # Or your self-hosted URL
LIVEKIT_API_KEY=your_livekit_api_key # From your livekit account
LIVEKIT_API_SECRET=your_livekit_api_secret # From your livekit account
GOOGLE_API_KEY=your_gemini_api_key # From Google AI Studio
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account.json # For Google Cloud STT
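
Optional but handy: fail fast if any of these are missing. A minimal sketch you could drop near the top of "src/index.ts" (this helper is not part of the tutorial code, just a convenience):

const requiredEnv = ['LIVEKIT_URL', 'LIVEKIT_API_KEY', 'LIVEKIT_API_SECRET', 'GOOGLE_API_KEY'];
const missing = requiredEnv.filter((name) => !process.env[name]);
if (missing.length > 0) {
    // Crash early with a clear message instead of failing later inside LiveKit or Gemini calls
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
}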

The Logic Engine (Personality prompt for the agent)

I know we want to start writing code immediately, but before that we need to define "who" our AI agent is. This basically describes what our agent does and how it interacts with users.

Key design principles for voice prompts:

  1. Brevity: Limit responses to <45 words. Long AI monologues kill the flow.
  2. Turn-taking: Always end with a question or an invitation for the user to speak.
  3. Make it feel human: Avoid buzzwords; instead, use popular slang or casual phrases that help users vibe with it.

For this tutorial we will call our agent "Leo" (a general-purpose agent), so go ahead and create a new file "src/prompt.ts":

export const prompt = `
You are Leo, a friendly, general-purpose AI voice assistant.
Keep responses under 45 words.
Sound natural, relaxed, and human — like a smart friend, not a robot.
Avoid buzzwords. Use light, casual language when appropriate.
Always end with a question or invite the user to speak.
Keep the conversation flowing.
If it fits, use casual phrases like “gotcha,” “makes sense,” or “no worries.”
`;

The moment we've all been waiting for: the AI Worker

The agent is a worker process that connects to a LiveKit room as a participant, so go ahead and create a new file "src/agent.ts".

a. Defining the Agent
We use "defineAgent" to set up the entry point for when a user joins a room.

import { voice, defineAgent, type JobContext } from '@livekit/agents';
import * as google from '@livekit/agents-plugin-google';
import { prompt } from './prompt.js';


export default defineAgent({
    entry: async (ctx: JobContext) => {
        await ctx.connect(); // Connect to the LiveKit Room

        // 1. Initialize the Gemini Realtime Model
        const session = new voice.AgentSession({
            llm: new google.beta.realtime.RealtimeModel({
                model: "gemini-2.5-flash-native-audio-preview", // The multimodal brain
                voice: "Puck",
                inputAudioTranscription: {}, // Allows UI captions
                outputAudioTranscription: {},
            }),
        });

        // 2. Initialize the Agent
        const agent = new voice.Agent({
            instructions: prompt
        });

        // 3. Start the Session
        await session.start({ agent, room: ctx.room });

        // 4. Kick off the conversation with a greeting
        await session.generateReply();
    }
});

b. Handling state & cleanup
To make the agent production-ready and easier to debug, we need to add event logs and an idle timeout to save on API costs.

// Inside the entry function...
// 'as any' lets us subscribe to these event names without TypeScript type checking
(session as any).on('agent_state_changed', (ev: any) => {
    console.log(`Agent is now: ${ev.newState}`); // 'speaking', 'thinking', or 'listening'
});

// Idle timeout: Stop the worker if no one speaks for 10 minutes
let lastActivity = Date.now();
(session as any).on('user_input_transcribed', () => lastActivity = Date.now());

setInterval(() => {
    if (Date.now() - lastActivity > 10 * 60 * 1000) {
        ctx.shutdown(); // Gracefully exit
    }
}, 30000);

Final "agent.ts" code should now look like this:

import { voice, defineAgent, type JobContext } from '@livekit/agents';
import * as google from '@livekit/agents-plugin-google';
import { prompt } from './prompt.js';


export default defineAgent({
    entry: async (ctx: JobContext) => {
        await ctx.connect(); // Connect to the LiveKit Room

        // 1. Initialize the Gemini Realtime Model
        const session = new voice.AgentSession({
            llm: new google.beta.realtime.RealtimeModel({
                model: "gemini-2.5-flash-native-audio-preview", // The multimodal brain
                voice: "Puck",
                inputAudioTranscription: {}, // Allows UI captions
                outputAudioTranscription: {},
            }),
        });

        // 2. Initialize the Agent
        const agent = new voice.Agent({
            instructions: prompt
        });

        // 3. Start the Session
        await session.start({ agent, room: ctx.room });

        // 4. Kick off the conversation with a greeting
        await session.generateReply();


        // Log agent state changes for debugging
        // ('as any' lets us subscribe to these event names without TypeScript type checking)
        (session as any).on('agent_state_changed', (ev: any) => {
            console.log(`Agent is now: ${ev.newState}`); // 'speaking', 'thinking', or 'listening'
        });

        // Idle timeout: Stop the worker if no one speaks for 10 minutes
        let lastActivity = Date.now();
        (session as any).on('user_input_transcribed', () => lastActivity = Date.now());

        setInterval(() => {
            if (Date.now() - lastActivity > 10 * 60 * 1000) {
                ctx.shutdown(); // Gracefully exit
            }
        }, 30000);
    }
});

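One small, optional cleanup: the setInterval above is never cleared. A minimal sketch of the same timer that clears itself once it triggers shutdown (plain Node, nothing LiveKit-specific):

        // Keep a handle on the timer so it can be cleared when we shut down
        const idleTimer = setInterval(() => {
            if (Date.now() - lastActivity > 10 * 60 * 1000) {
                clearInterval(idleTimer); // stop polling once we decide to exit
                ctx.shutdown(); // gracefully exit
            }
        }, 30000);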

The Express Server
Our Express server handles two things: issuing tokens to users and launching the agent worker.
Now go ahead and create a new file: "src/index.ts"

import 'dotenv/config';
import express from 'express';
import { initializeLogger, Worker, WorkerOptions } from '@livekit/agents';
import { AccessToken } from 'livekit-server-sdk';
import path from 'path';
import cors from 'cors';
import fs from 'fs';
import os from 'os';

const app = express();
const port = process.env.PORT || 3000;

app.use(cors());

initializeLogger({ level: 'debug', pretty: true });

// Robustly handle Google Application Credentials
if (process.env.GOOGLE_APPLICATION_CREDENTIALS) {
    const creds = process.env.GOOGLE_APPLICATION_CREDENTIALS;
    if (creds.startsWith('{')) {
        // It's a JSON string, write it to a temp file
        const tempPath = path.join(os.tmpdir(), `google-creds-${Date.now()}.json`);
        fs.writeFileSync(tempPath, creds);
        process.env.GOOGLE_APPLICATION_CREDENTIALS = tempPath;
        console.log(`Using credentials from JSON string, saved to ${tempPath}`);
    } else {
        // It's a path, ensure it's absolute
        process.env.GOOGLE_APPLICATION_CREDENTIALS = path.resolve(creds);
        console.log(`Using credentials from file: ${process.env.GOOGLE_APPLICATION_CREDENTIALS}`);
    }
} else if (fs.existsSync(path.resolve('./service-account.json'))) {
    // Default to local service-account.json if it exists and env is not set
    process.env.GOOGLE_APPLICATION_CREDENTIALS = path.resolve('./service-account.json');
    console.log(`Using local service-account.json: ${process.env.GOOGLE_APPLICATION_CREDENTIALS}`);
}

// 1. Token Endpoint for the Frontend
app.get('/get-token', async (req, res) => {
    const roomName = (req.query.room as string) || 'voice-chat';
    const participantName = (req.query.user as string) || 'user';

    const token = new AccessToken(
        process.env.LIVEKIT_API_KEY!,
        process.env.LIVEKIT_API_SECRET!,
        { identity: participantName }
    );

    token.addGrant({ roomJoin: true, room: roomName, canPublish: true, canSubscribe: true });
    res.json({ token: await token.toJwt() });
});

// 2. Start the Agent Worker
const worker = new Worker(
    new WorkerOptions({
        agent: path.resolve('./src/agent.ts'),
        wsURL: process.env.LIVEKIT_URL!,
        apiKey: process.env.LIVEKIT_API_KEY!,
        apiSecret: process.env.LIVEKIT_API_SECRET!,
    })
);

worker.run();

app.listen(port, () => console.log(`Server live at http://localhost:${port}`));
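
Once the server is running, you can sanity-check the token endpoint with a quick request (the room and user names here are just examples):

curl "http://localhost:3000/get-token?room=voice-chat-room&user=alice"
# => {"token":"eyJ..."} (a signed LiveKit JWT)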

If you've gotten to this stage, congratulations: you've completed the first and most important part of this tutorial. The next step is cooking up a very simple, bare-metal frontend to help us test the agent.
For this tutorial we will be using good ol' plain HTML to avoid further complexity.

NB: You can absolutely use your favorite frontend framework instead, whether React, Vue, or Angular.

Go ahead and create a new file "client/agent.html".

We will also need a few essentials here:

  1. The LiveKit client module; we will use the CDN build. <script src="https://cdn.jsdelivr.net/npm/livekit-client@2.0.4/dist/livekit-client.umd.min.js"></script>

If you're using React or another frontend framework, install the LiveKit client SDK instead (a minimal connection sketch follows the install command):

npm install livekit-client
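
For reference, here's a rough sketch of the same connection flow using the npm package; adapt it to your framework of choice (the token endpoint is the Express route we built earlier, and the wss:// URL is a placeholder for your own):

import { Room, RoomEvent, createLocalTracks } from 'livekit-client';

async function joinVoiceChat(): Promise<Room> {
    // 1. Get a token from our Express server
    const res = await fetch('http://localhost:3000/get-token?room=voice-chat-room&user=user');
    const { token } = await res.json();

    const room = new Room();

    // 2. Log live captions from transcription events
    room.on(RoomEvent.TranscriptionReceived, (segments) => {
        segments.forEach((s) => console.log(s.text));
    });

    // 3. Connect and publish the microphone so the agent can hear us
    await room.connect('wss://your-livekit-server.livekit.cloud', token);
    const tracks = await createLocalTracks({ audio: true });
    for (const track of tracks) {
        await room.localParticipant.publishTrack(track);
    }

    return room;
}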

Inside your agent.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <title>Real-Time Voice Chat with Leo</title>
    <script src="https://cdn.jsdelivr.net/npm/livekit-client@2.0.4/dist/livekit-client.umd.min.js"></script>
    <style>
        body {
            font-family: sans-serif;
            padding: 20px;
        }

        #transcript {
            margin-top: 20px;
            max-height: 400px;
            overflow-y: auto;
            border: 1px solid #ccc;
            padding: 10px;
            background: #f9f9f9;
        }

        .message {
            margin: 8px 0;
        }

        .user {
            color: blue;
        }

        .agent {
            color: green;
        }

        .interim {
            opacity: 0.6;
            font-style: italic;
        }

        #agentStatus {
            font-weight: bold;
            margin-top: 10px;
        }
    </style>
</head>

<body>
    <h1>Voice Chat with Leo</h1>
    <button id="joinBtn">Join Chat</button>
    <button id="leaveBtn" disabled>Leave Chat</button>
    <audio id="audioOutput" autoplay controls></audio>

    <div id="agentStatus">Agent status: not connected</div>
    <div id="transcript"></div>

    <script>
        const { Room, RoomEvent, createLocalTracks, DataPacketKind } = LivekitClient;
        let room;
        const joinBtn = document.getElementById('joinBtn');
        const leaveBtn = document.getElementById('leaveBtn');
        const audioOutput = document.getElementById('audioOutput');
        const transcriptDiv = document.getElementById('transcript');
        const agentStatus = document.getElementById('agentStatus');
        let lastMessageDiv = null;

        // Helper to add messages to transcript
        function addMessage(role, text, isInterim = false) {
            if (isInterim && lastMessageDiv && lastMessageDiv.dataset.role === role && lastMessageDiv.dataset.isInterim === 'true') {
                lastMessageDiv.textContent = `${role === 'user' ? 'You' : 'Leo'}: ${text}`;
                return;
            }

            const div = document.createElement('div');
            div.className = `message ${role} ${isInterim ? 'interim' : ''}`;
            div.textContent = `${role === 'user' ? 'You' : 'Leo'}: ${text}`;
            div.dataset.role = role;
            div.dataset.isInterim = isInterim ? 'true' : 'false';

            // If we were showing an interim message and now have a final one, or a new speaker, remove the 'interim' styling
            if (!isInterim && lastMessageDiv && lastMessageDiv.dataset.role === role && lastMessageDiv.dataset.isInterim === 'true') {
                lastMessageDiv.textContent = div.textContent;
                lastMessageDiv.classList.remove('interim');
                lastMessageDiv.dataset.isInterim = 'false';
                return;
            }

            transcriptDiv.appendChild(div);
            transcriptDiv.scrollTop = transcriptDiv.scrollHeight;
            lastMessageDiv = div;
        }

        joinBtn.addEventListener('click', async () => {
            try {
                const response = await fetch('http://localhost:3000/get-token?room=voice-chat-room&user=user');
                const { token } = await response.json();

                room = new Room();

                // Listen for transcription text events (modern approach)
                room.on(RoomEvent.TranscriptionReceived, (transcriptions, participant, publication) => {
                    transcriptions.forEach((transcription) => {
                        const isAgent = participant?.identity.includes('agent') || participant?.metadata?.includes('agent');
                        const role = isAgent ? 'agent' : 'user';

                        // transcription.text contains the transcribed text
                        addMessage(role, transcription.text, !transcription.isFinal);
                    });
                });

                // Agent state via track metadata
                room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
                    if (track.kind === 'audio' && participant.identity.includes('agent')) {
                        track.attach(audioOutput);
                        audioOutput.play().catch(console.warn);

                        const updateStatus = () => {
                            const meta = publication.metadata || '';
                            if (meta.includes('speaking')) agentStatus.textContent = 'Agent status: speaking';
                            else if (meta.includes('thinking')) agentStatus.textContent = 'Agent status: thinking';
                            else if (meta.includes('listening')) agentStatus.textContent = 'Agent status: listening';
                        };
                        updateStatus();
                        publication.on('metadataChanged', updateStatus);
                    }
                });

                await room.connect('wss://your-livekit-server.livekit.cloud', token);
                await room.startAudio();

                const tracks = await createLocalTracks({ audio: true });
                for (const track of tracks) {
                    await room.localParticipant.publishTrack(track);
                }

                console.log('Joined room successfully');
                joinBtn.disabled = true;
                leaveBtn.disabled = false;
                agentStatus.textContent = 'Agent status: joined';
            } catch (error) {
                console.error('Error joining chat:', error);
                agentStatus.textContent = 'Error connecting';
            }
        });

        leaveBtn.addEventListener('click', () => {
            if (room) {
                room.disconnect();
                transcriptDiv.innerHTML = '';
                agentStatus.textContent = 'Agent status: not connected';
                console.log('Left room');
            }
            joinBtn.disabled = false;
            leaveBtn.disabled = true;
        });
    </script>
</body>

</html>

NB: Remember to set your correct LiveKit Cloud secure WebSocket URL in the client; replace "wss://your-livekit-server.livekit.cloud" in agent.html with it (the same value as LIVEKIT_URL in your .env).
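
If you'd prefer not to hardcode the URL in the client, one option (not required for this tutorial) is to return it alongside the token from the Express endpoint and use it when connecting:

// In src/index.ts, inside the /get-token handler:
res.json({ token: await token.toJwt(), url: process.env.LIVEKIT_URL });

// In agent.html, after fetching:
// const { token, url } = await response.json();
// await room.connect(url, token);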

If you have gotten to this stage, you're ready to run and test your AI agent. You can confirm you're on the right track if your codebase looks like this:

Project Structure

.
├─ client
│   └─ agent.html
├─ src
│   ├─ agent.ts
│   ├─ index.ts
│   └─ prompt.ts
├─ .env
├─ package.json
└─ tsconfig.json

Now you've just built a modern, low-latency voice AI system. By using LiveKit's Agent Framework and Gemini 2.5, you've avoided the complexity of building your own WebRTC signaling and the lag of traditional STT/TTS pipelines. You can deploy to your favorite cloud platform and connect the API to your app.

To test this we can now run:
npm run dev

and open the "agent.html" file inside our "client" dir in a browser.
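
If you'd rather serve the page from the same Express server instead of opening the file directly, a one-line addition to "src/index.ts" works (assuming you run the server from the project root, where the "client" folder lives):

app.use(express.static('client')); // now http://localhost:3000/agent.html serves the page
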
Warning: make sure to properly configure your .gitignore file to avoid exposing sensitive secrets when pushing to GitHub.

# Dependencies
node_modules/

# Build output
dist/

# Environment variables
.env
.env.local
.env.*.local

# Logs
logs/
*.log
npm-debug.log*

# TypeScript cache
*.tsbuildinfo

# Optional npm cache
.npm/

# Credentials / Secrets
*.json
!package.json
!package-lock.json
!tsconfig.json


Want to quickly get started with this project?
I've made a sample repository with the starter code here

Enjoy!

This tutorial was not sponsored by LiveKit.
