Conversational AI is rapidly reshaping mobile interaction by enabling natural voice-based communication between users and intelligent systems. In this tutorial, we will explore how to build a conversational AI for iOS using ZEGOCLOUD’s Conversational AI SDK. By the end, you will have a working iOS application that supports voice input, intelligent responses through an LLM, and synthesized speech playback, along with live subtitle display.
Download Complete Source Code
Full working examples are available on GitHub:
| Component | Repository |
|---|---|
| Backend Server + Web Client | https://github.com/ZEGOCLOUD/blog-aiagent-server-and-web |
| iOS Client | https://github.com/ZEGOCLOUD/blog-aiagent-ios |
Prerequisites
Before building the iOS client, make sure you have the following:
- Xcode 15 or later
- A ZEGOCLOUD Console account
- Basic familiarity with Swift and iOS development
- A backend service available (this tutorial uses a Next.js server deployed on Vercel)
Architecture Overview
Before moving to implementation, it’s helpful to understand how the conversational workflow operates and why both a server and a client are required.
Why Both Server and Client?
ZEGOCLOUD’s Conversational AI runs speech recognition (ASR), LLM response generation, and text-to-speech (TTS) in the cloud. Splitting responsibilities between a server and a client keeps credentials off the device and simplifies integration.
| Component | Responsibilities |
|---|---|
| Backend Server | Stores required credentials, generates authentication tokens, and manages the AI agent lifecycle through server APIs |
| iOS Client | Captures user voice, streams audio, plays generated speech, and renders subtitles |
| ZEGO Cloud | Performs ASR, LLM reasoning, TTS conversion, and manages the real-time communication path |
System Architecture
This model ensures that sensitive keys remain on the server, while real-time communication and AI processing are handled by ZEGOCLOUD’s cloud services.
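In practice, a single conversation session flows like this:
- The iOS client requests an authentication token from your backend
- The client logs in to a ZEGO RTC room and publishes the microphone stream
- The client asks the backend to start the AI agent in that room
- ZEGO's cloud transcribes the audio (ASR), feeds the transcript to the LLM, and synthesizes the reply (TTS) as the agent's audio stream
- The client plays the agent's stream and renders subtitles from room messages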
Step 1: Set Up Your Backend Server
The backend service issues authentication tokens and communicates with ZEGOCLOUD’s AI Agent APIs. This tutorial uses a Next.js server deployed on Vercel.
1.1 Environment Variables
Create a .env.local file containing the following configuration values. In Next.js, only variables prefixed with NEXT_PUBLIC_ are exposed to the browser; everything else, including the server secret, stays on the server.
# ZEGO Configuration (from ZEGOCLOUD Console: https://console.zegocloud.com/)
NEXT_PUBLIC_ZEGO_APP_ID=your_app_id
ZEGO_SERVER_SECRET=your_server_secret_32_chars
# AI Agent Configuration
ZEGO_AGENT_ID=aiAgent1
ZEGO_AGENT_NAME=AI Assistant
# System Prompt - Define your AI's personality
SYSTEM_PROMPT="You are my best friend whom I can talk to about anything."
# LLM Configuration (Large Language Model)
LLM_URL=https://your-llm-provider.com/api/chat/completions
LLM_API_KEY=your_llm_api_key
LLM_MODEL=your_model_name
# TTS Configuration (Text-to-Speech)
TTS_VENDOR=ByteDance
TTS_APP_ID=zego_test
TTS_TOKEN=zego_test
TTS_CLUSTER=volcano_tts
TTS_VOICE_TYPE=zh_female_wanwanxiaohe_moon_bigtts
1.2 Token Generation API
// app/api/zego/token/route.ts
// Token generation route; see the backend repository linked above for the full implementation
1.3 Deploy to Vercel
Once configured, deploy the Next.js server to Vercel:
- Push the project to GitHub
- Import the repository into Vercel
- Add the environment variables under project settings
- Deploy the project
The backend service will be accessible at your Vercel project URL.
Step 2: Build the iOS Client
2.1 Create an Xcode Project
- Interface: SwiftUI
- Language: Swift
- Minimum Deployment: iOS 14.0
2.2 Integrate ZEGOCLOUD SDK
Important: Use the AI Agent-optimized SDK version v3.22.0.46173.
// SDK download and integration steps: see the iOS repository linked above
2.3 Info.plist Configuration
<key>NSMicrophoneUsageDescription</key>
<string>This app needs access to your microphone for voice conversations with AI.</string>
<key>UIBackgroundModes</key>
<array>
<string>audio</string>
</array>
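The usage description covers the static declaration; at runtime the app must still request microphone access before publishing audio. A minimal sketch using AVAudioSession (on iOS 17+, Apple recommends the newer AVAudioApplication API instead):

import AVFoundation

// Request microphone access at runtime, e.g. before startCall().
// Only proceed with the call if `granted` is true.
func requestMicrophonePermission(_ completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        DispatchQueue.main.async { completion(granted) }
    }
}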
2.4 App Config File
// AppConfig.swift
struct AppConfig {
// ZEGO App ID - Must match your backend configuration
static let appID: UInt32 = 1234567890
// Backend server URL (your Vercel deployment)
static let serverURL = "https://your-project.vercel.app"
static func generateUserId() -> String {
return "user\(Int(Date().timeIntervalSince1970) % 100000)"
}
static func generateRoomId() -> String {
return "room\(Int(Date().timeIntervalSince1970) % 100000)"
}
}
2.5 API Service
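The API service wraps the calls to your backend. Below is a minimal sketch; the endpoint paths and response shapes here are assumptions, so match them to the server repository linked above:

// ApiService.swift
import Foundation

struct TokenResponse: Decodable {
    let token: String
}

struct AgentResponse: Decodable {
    let agentStreamId: String   // stream ID the AI agent publishes on
}

class ApiService {
    static let shared = ApiService()

    // Note: URLSession's async APIs require iOS 15. On iOS 14, wrap
    // dataTask(with:) in a checked continuation instead.
    func getToken(userId: String) async throws -> TokenResponse {
        // Hypothetical endpoint path; match it to your backend routes
        var components = URLComponents(string: "\(AppConfig.serverURL)/api/zego/token")!
        components.queryItems = [URLQueryItem(name: "userId", value: userId)]
        let (data, _) = try await URLSession.shared.data(from: components.url!)
        return try JSONDecoder().decode(TokenResponse.self, from: data)
    }

    func startAgent(roomId: String, userId: String) async throws -> AgentResponse {
        // Hypothetical endpoint path; match it to your backend routes
        var request = URLRequest(url: URL(string: "\(AppConfig.serverURL)/api/agent/start")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(["roomId": roomId, "userId": userId])
        let (data, _) = try await URLSession.shared.data(for: request)
        return try JSONDecoder().decode(AgentResponse.self, from: data)
    }
}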
2.6 ZEGO Express Manager
The core manager handles RTC operations and subtitle parsing:
// ZegoExpressManager.swift
import ZegoExpressEngine
struct SubtitleMessage {
let cmd: Int // 3=ASR(user), 4=LLM(AI)
let text: String
let messageId: String
let endFlag: Bool
var isUserMessage: Bool { cmd == 3 }
var isAgentMessage: Bool { cmd == 4 }
}
class ZegoExpressManager: NSObject {
static let shared = ZegoExpressManager()
var onSubtitleReceived: ((SubtitleMessage) -> Void)?
func initEngine() {
// Configure engine for AI conversation
let engineConfig = ZegoEngineConfig()
engineConfig.advancedConfig = [
"set_audio_volume_ducking_mode": "1",
"enable_rnd_volume_adaptive": "true"
]
ZegoExpressEngine.setEngineConfig(engineConfig)
// Create engine
let profile = ZegoEngineProfile()
profile.appID = AppConfig.appID
profile.scenario = .highQualityChatroom
ZegoExpressEngine.createEngine(with: profile, eventHandler: self)
// Enable 3A audio processing
let engine = ZegoExpressEngine.shared()
engine.enableAGC(true)
engine.enableAEC(true)
engine.setAECMode(.aiBalanced)
engine.enableANS(true)
engine.setANSMode(.medium)
}
func loginRoom(roomId: String, userId: String, token: String,
callback: @escaping (Int32) -> Void) {
let user = ZegoUser(userID: userId)
let config = ZegoRoomConfig()
config.token = token
ZegoExpressEngine.shared().loginRoom(roomId, user: user, config: config) { errorCode, _ in
callback(errorCode)
}
}
func startPublishing(streamId: String) {
ZegoExpressEngine.shared().startPublishingStream(streamId)
}
func startPlaying(streamId: String) {
ZegoExpressEngine.shared().startPlayingStream(streamId)
}
}
extension ZegoExpressManager: ZegoEventHandler {
func onRecvExperimentalAPI(_ content: String) {
// Parse subtitle message from experimental API
guard let data = content.data(using: .utf8),
let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
let method = json["method"] as? String,
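// note: the "recive" spelling below must match the SDK payload exactly; do not correct it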
method == "liveroom.room.on_recive_room_channel_message",
let params = json["params"] as? [String: Any],
let msgContent = params["msg_content"] as? String,
let msgData = msgContent.data(using: .utf8),
let msgJson = try? JSONSerialization.jsonObject(with: msgData) as? [String: Any],
let cmd = msgJson["Cmd"] as? Int,
let dataDict = msgJson["Data"] as? [String: Any] else {
return
}
let message = SubtitleMessage(
cmd: cmd,
text: dataDict["Text"] as? String ?? "",
messageId: dataDict["MessageId"] as? String ?? "",
endFlag: dataDict["EndFlag"] as? Bool ?? false
)
onSubtitleReceived?(message)
}
}
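To sanity-check the parser, you can feed it a payload built from the exact fields the guard statement reads. This is a reconstruction from the parsing logic above; real payloads may carry additional fields. Note the doubly encoded JSON: msg_content is itself a JSON string.

let samplePayload = """
{"method":"liveroom.room.on_recive_room_channel_message","params":{"msg_content":"{\\"Cmd\\":3,\\"Data\\":{\\"Text\\":\\"Hello there\\",\\"MessageId\\":\\"msg-1\\",\\"EndFlag\\":true}}"}}
"""

ZegoExpressManager.shared.onSubtitleReceived = { msg in
    print(msg.isUserMessage, msg.text)   // prints: true Hello there
}
ZegoExpressManager.shared.onRecvExperimentalAPI(samplePayload)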
2.7 Chat ViewModel
The ViewModel manages the conversation state and handles subtitle display:
// ChatViewModel.swift
@MainActor
class ChatViewModel: ObservableObject {
@Published var messages: [ChatMessage] = []
@Published var isConnected = false
@Published var statusText = "Disconnected"
private var llmMessageCache: [String: String] = [:]
init() {
ZegoExpressManager.shared.initEngine()
ZegoExpressManager.shared.onSubtitleReceived = { [weak self] message in
Task { @MainActor in
self?.handleSubtitleMessage(message)
}
}
}
private func handleSubtitleMessage(_ message: SubtitleMessage) {
if message.isUserMessage {
// ASR: Full text replacement each time
handleAsrMessage(message)
} else if message.isAgentMessage {
// LLM: Incremental text accumulation
handleLlmMessage(message)
}
}
private func handleAsrMessage(_ message: SubtitleMessage) {
if let index = messages.firstIndex(where: { $0.id == message.messageId }) {
messages[index].text = message.text
} else {
messages.append(ChatMessage(id: message.messageId, text: message.text, isUser: true))
}
}
private func handleLlmMessage(_ message: SubtitleMessage) {
let cachedText = llmMessageCache[message.messageId] ?? ""
let newText = cachedText + message.text
llmMessageCache[message.messageId] = newText
if let index = messages.firstIndex(where: { $0.id == message.messageId }) {
messages[index].text = newText
} else {
messages.append(ChatMessage(id: message.messageId, text: newText, isUser: false))
}
if message.endFlag {
llmMessageCache.removeValue(forKey: message.messageId)
}
}
func startCall() async {
    let userId = AppConfig.generateUserId()
    let roomId = AppConfig.generateRoomId()
    let userStreamId = "\(roomId)_\(userId)_main"   // naming convention must match the backend
    do {
        // 1. Get a token from the backend
        let tokenResponse = try await ApiService.shared.getToken(userId: userId)
        // 2. Log in to the RTC room
        ZegoExpressManager.shared.loginRoom(roomId: roomId, userId: userId,
                                            token: tokenResponse.token) { code in
            print("loginRoom result: \(code)")   // 0 means success
        }
        // 3. Start publishing the microphone stream
        ZegoExpressManager.shared.startPublishing(streamId: userStreamId)
        // 4. Ask the backend to start the AI agent
        let agentResponse = try await ApiService.shared.startAgent(...)
        // 5. Play the agent's audio stream
        ZegoExpressManager.shared.startPlaying(streamId: agentResponse.agentStreamId)
        isConnected = true
        statusText = "Connected"
    } catch {
        statusText = "Failed to start call: \(error.localizedDescription)"
    }
}
}
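Two pieces referenced above are not shown in the snippets: the ChatMessage model and endCall(). A minimal sketch of both follows; the field names and teardown order are assumptions, so adapt them as needed:

// ChatMessage.swift
import Foundation
import ZegoExpressEngine

struct ChatMessage: Identifiable {
    let id: String      // message ID from the subtitle payload
    var text: String    // mutable so partial results can be updated in place
    let isUser: Bool    // true = ASR (user), false = LLM (agent)
}

// Minimal teardown counterpart to startCall()
extension ChatViewModel {
    func endCall() async {
        // A full implementation would also call a backend endpoint here to
        // stop the AI agent (see the server repository linked above).
        ZegoExpressEngine.shared().stopPublishingStream()
        ZegoExpressEngine.shared().logoutRoom()   // also stops any playing streams
        isConnected = false
        statusText = "Disconnected"
    }
}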
2.8 SwiftUI Views
Create the user interface:
// ContentView.swift
struct ContentView: View {
@StateObject private var viewModel = ChatViewModel()
var body: some View {
HStack(spacing: 0) {
// Left: Control Panel
VStack(spacing: 24) {
Text("ZEGO AI Agent")
.font(.title)
HStack(spacing: 8) {
Circle()
.fill(viewModel.isConnected ? Color.green : Color.gray)
.frame(width: 12, height: 12)
Text(viewModel.statusText)
}
Button(action: {
Task {
if viewModel.isConnected {
await viewModel.endCall()
} else {
await viewModel.startCall()
}
}
}) {
Text(viewModel.isConnected ? "End Call" : "Start AI Call")
.foregroundColor(.white)
.frame(width: 180, height: 50)
.background(viewModel.isConnected ? Color.red : Color.blue)
.cornerRadius(25)
}
}
.frame(maxWidth: .infinity)
// Right: Chat Messages
ScrollView {
LazyVStack(alignment: .leading, spacing: 12) {
ForEach(viewModel.messages) { message in
MessageBubble(message: message)
}
}
.padding()
}
.frame(maxWidth: .infinity)
}
}
}
struct MessageBubble: View {
let message: ChatMessage
var body: some View {
VStack(alignment: message.isUser ? .trailing : .leading) {
Text(message.isUser ? "You" : "AI Agent")
.font(.caption)
Text(message.text)
.padding(12)
.background(message.isUser ? Color.blue.opacity(0.2) : Color.green.opacity(0.2))
.cornerRadius(12)
}
.frame(maxWidth: .infinity, alignment: message.isUser ? .trailing : .leading)
}
}
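Finally, wire ContentView into a standard SwiftUI app entry point (the type name here is arbitrary):

// AIAgentApp.swift
import SwiftUI

@main
struct AIAgentApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}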
Conclusion
You now have a functioning iOS application that supports real-time conversational AI using ZEGOCLOUD’s SDK. The app captures voice input, generates LLM responses, plays synthesized speech, and displays live subtitles throughout the interaction.
