Building a conversational AI voice agent has become remarkably accessible thanks to OpenAI’s APIs. In this article, we’ll create a fully functional conversational AI voice agent using Next.js 15, pairing the browser’s speech APIs for voice input and output with OpenAI’s Chat Completions API for responses. By the end, you’ll have a basic voice-enabled AI agent that listens to users, generates responses, and speaks them back.
Let’s dive in step by step.
Prerequisites
- Basic Knowledge of JavaScript/React: You should be comfortable with basic coding concepts.
- Node.js Installed: Ensure you have Node.js 18.18 or later installed (the minimum version Next.js 15 supports).
- OpenAI API Key: Create an account and obtain an API key from OpenAI.
- Microphone and Speaker: Required for testing voice input and output.
Step 1: Setting Up a New Next.js 15 Project
Start by creating a new Next.js project. When prompted, skip TypeScript and the App Router, since this tutorial uses plain JavaScript files under pages/.
npx create-next-app@latest conversational-ai-agent
cd conversational-ai-agent
Install necessary dependencies:
npm install openai react-speech-recognition react-speech-kit
- openai: For integrating OpenAI APIs.
- react-speech-recognition: For handling voice input.
- react-speech-kit: For text-to-speech functionality.
Step 2: Configure the OpenAI API in Next.js
Create a file called .env.local in the root directory and add your OpenAI API key:
OPENAI_API_KEY=your-openai-api-key
Now, create a utility function for interacting with OpenAI’s API.
utils/openai.js
import OpenAI from "openai";

// The openai package (v4 and later) exports a single OpenAI client class;
// the older Configuration/OpenAIApi exports no longer exist.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY, // server-only: not exposed to the browser
});

export const getChatResponse = async (prompt) => {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
};
This function sends a user’s query to OpenAI and retrieves the AI’s response.
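Because process.env.OPENAI_API_KEY is not prefixed with NEXT_PUBLIC_, it is only available on the server, and shipping the key to the browser would expose it anyway. A more robust pattern (echoed in the comment at the end of this post) is to call the helper from an API route so the key never leaves the server. Here is a minimal sketch; the /api/chat path and the { prompt } request shape are just one possible way to do it:
pages/api/chat.js
import { getChatResponse } from "../../utils/openai";

export default async function handler(req, res) {
  // Accept only POST requests carrying a { prompt } JSON body.
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }
  try {
    const reply = await getChatResponse(req.body.prompt);
    res.status(200).json({ reply });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "Failed to get a response from OpenAI" });
  }
}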
Step 3: Add Speech Recognition and Text-to-Speech
We’ll now set up the microphone to capture voice input and a text-to-speech system to read AI responses aloud.
pages/index.js
import { useState } from "react";
import SpeechRecognition, { useSpeechRecognition } from "react-speech-recognition";
import { useSpeechSynthesis } from "react-speech-kit";
import { getChatResponse } from "../utils/openai";
export default function Home() {
  const [conversation, setConversation] = useState([]);
  const [isProcessing, setIsProcessing] = useState(false);
  const { speak } = useSpeechSynthesis();
  const { transcript, resetTranscript } = useSpeechRecognition();

  if (!SpeechRecognition.browserSupportsSpeechRecognition()) {
    return <p>Your browser does not support Speech Recognition.</p>;
  }

  const handleStart = () => {
    resetTranscript();
    SpeechRecognition.startListening({ continuous: true });
  };

  const handleStop = async () => {
    SpeechRecognition.stopListening();
    setIsProcessing(true);
    const userMessage = transcript;
    const updatedConversation = [...conversation, { role: "user", content: userMessage }];
    setConversation(updatedConversation);
    try {
      // Get AI response
      const aiResponse = await getChatResponse(userMessage);
      setConversation([...updatedConversation, { role: "assistant", content: aiResponse }]);
      // Speak AI response
      speak({ text: aiResponse });
    } finally {
      // Re-enable the buttons even if the request fails
      setIsProcessing(false);
    }
  };

  return (
    <div style={{ padding: "2rem", fontFamily: "Arial, sans-serif" }}>
      <h1>Conversational AI Voice Agent</h1>
      <div>
        <p><strong>AI:</strong> {conversation.map((msg, idx) => (
          <span key={idx}>
            <em>{msg.role === "assistant" ? msg.content : ""}</em><br />
          </span>
        ))}</p>
        <p><strong>You:</strong> {transcript}</p>
      </div>
      <button onClick={handleStart} disabled={isProcessing}>
        Start Listening
      </button>
      <button onClick={handleStop} disabled={isProcessing || !transcript}>
        Stop and Process
      </button>
    </div>
  );
}
Key Features:
- SpeechRecognition: Captures the user’s voice and continuously listens.
- SpeechSynthesis: Converts AI text responses into speech.
- Conversation State: Maintains a history of messages between the user and AI.
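If you use the API route sketched in Step 2, the page can call it with fetch instead of importing getChatResponse into the browser bundle. A sketch of what the middle of handleStop would become (the /api/chat path is the hypothetical route from earlier):
// Inside handleStop, replacing the direct getChatResponse call:
const res = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: userMessage }),
});
const { reply } = await res.json();
setConversation([...updatedConversation, { role: "assistant", content: reply }]);
speak({ text: reply });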
Step 4: Add CSS for Better UX
Open styles/globals.css (created by create-next-app) and replace its contents with the following:
body {
  margin: 0;
  padding: 0;
  font-family: Arial, sans-serif;
  background-color: #f4f4f9;
  color: #333;
}

h1 {
  text-align: center;
  color: #4a90e2;
}

button {
  padding: 10px 20px;
  margin: 5px;
  background-color: #4a90e2;
  color: white;
  border: none;
  border-radius: 5px;
  cursor: pointer;
}

button:disabled {
  background-color: #ccc;
}

div {
  max-width: 600px;
  margin: 0 auto;
}
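Global styles in the Pages Router are loaded through pages/_app.js. create-next-app normally generates this file with the import already in place, but if yours is missing it, it looks roughly like this:
pages/_app.js
import "../styles/globals.css";

export default function MyApp({ Component, pageProps }) {
  return <Component {...pageProps} />;
}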
Step 5: Run Your Application
Start your development server:
npm run dev
Open your browser and navigate to http://localhost:3000.
- Click Start Listening to begin capturing your voice.
- Speak a question or command.
- Click Stop and Process to send your input to OpenAI and hear the AI’s response.
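Note: the first time the page tries to use the microphone, your browser will ask for permission; voice capture won’t work until you allow it.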
Step 6: Deploy the App (Optional)
Deploy your app to a platform like Vercel to make it publicly accessible:
npx vercel
Follow the prompts to deploy your app and share the generated URL with others.
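Note that .env.local is not uploaded when you deploy, so add OPENAI_API_KEY as an environment variable in your Vercel project settings, or with the CLI:
npx vercel env add OPENAI_API_KEY
Otherwise the OpenAI calls will fail in production.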
Final Thoughts
Congratulations! 🎉 You’ve successfully created a conversational AI voice agent using Next.js 15 and OpenAI’s API. This simple implementation can be expanded with features like custom commands, improved UI, and multi-language support. The possibilities are endless!
Top comments (1)
Nice walkthrough. It is great to see a clean, end-to-end path from mic capture to TTS in Next.js, especially for folks just dipping into voice UX.
A couple practical tweaks that have helped me: keep the OpenAI call on the server so you do not leak the API key. In Next.js, a util like utils/openai.js imported from a page can get bundled into the client, so I usually add an /api/chat route (or a server action) and call that from the page. Also, for anyone coming from older snippets: the OpenAI Node SDK moved away from Configuration/OpenAIApi - the current pattern is new OpenAI() and client.chat.completions.create or responses.create. While you are at it, consider streaming tokens and speaking as they arrive to cut perceived latency.
On the voice side, Web Speech API works but is uneven across browsers. You can improve capture with media constraints like echoCancellation, noiseSuppression, and autoGainControl. For UX, queue or cancel any in-progress utterance before speaking the next response to prevent overlap, and consider basic VAD or a short silence timeout when stopping continuous listening. If you want interruption mid-speech, implement barge-in by calling speechSynthesis.cancel() when new audio or user speech is detected.
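Rough sketch of what I mean (function names are arbitrary):
function speakWithBargeIn(text) {
  // Cancel anything still playing or queued so replies never overlap.
  if (window.speechSynthesis.speaking || window.speechSynthesis.pending) {
    window.speechSynthesis.cancel();
  }
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

async function getCleanMicStream() {
  // Ask the browser for echo cancellation, noise suppression, and auto gain.
  return navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
  });
}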
At Fluents we build production voice agents for phone and web, and the biggest wins have come from reducing turn latency and handling edge cases like partial transcripts, barge-in, and reconnects. Are you planning to try the OpenAI Realtime API with WebRTC next, or keep iterating on the REST flow? I would be curious what latency you are seeing locally vs Vercel once you add streaming.