How to Build a Conversational AI for iOS

Conversational AI is rapidly reshaping mobile interaction by enabling natural voice-based communication between users and intelligent systems. In this tutorial, we will explore how to build a conversational AI for iOS using ZEGOCLOUD’s Conversational AI SDK. By the end, you will have a working iOS application that supports voice input, intelligent responses through an LLM, and synthesized speech playback, along with live subtitle display.

Download Complete Source Code

Full working examples are available on GitHub:

Prerequisites

Before building the iOS client, make sure you have the following:

  • Xcode 15 or later
  • A ZEGOCLOUD Console account
  • Basic familiarity with Swift and iOS development
  • A backend service available (this tutorial uses a Next.js server deployed on Vercel)

Architecture Overview

Before moving to implementation, it’s helpful to understand how the conversational workflow operates and why both a server and a client are required.

Why Both Server and Client?

ZEGOCLOUD’s Conversational AI runs speech recognition, LLM response generation, and text-to-speech within the cloud. Separating server and client responsibilities improves security and simplifies the integration process.

Component Responsibilities

  • Backend Server: stores required credentials, generates authentication tokens, and manages the AI agent lifecycle through server APIs
  • iOS Client: captures user voice, streams audio, plays generated speech, and renders subtitles
  • ZEGO Cloud: performs ASR, LLM reasoning, TTS conversion, and manages the real-time communication path

System Architecture

This model keeps sensitive keys on the server, while real-time communication and AI processing are handled by ZEGOCLOUD’s cloud services.

Step 1: Set Up Your Backend Server

The backend service issues authentication tokens and communicates with ZEGOCLOUD’s AI Agent APIs. In this tutorial, a Next.js server deployed on Vercel will be used.

1.1 Environment Variables

Create a .env.local file containing the following configuration values:

# ZEGO Configuration (from ZEGOCLOUD Console: https://console.zegocloud.com/)
NEXT_PUBLIC_ZEGO_APP_ID=your_app_id
ZEGO_SERVER_SECRET=your_server_secret_32_chars

# AI Agent Configuration
ZEGO_AGENT_ID=aiAgent1
ZEGO_AGENT_NAME=AI Assistant

# System Prompt - Define your AI's personality
SYSTEM_PROMPT="You are my best friend whom I can talk to about anything."

# LLM Configuration (Large Language Model)
LLM_URL=https://your-llm-provider.com/api/chat/completions
LLM_API_KEY=your_llm_api_key
LLM_MODEL=your_model_name

# TTS Configuration (Text-to-Speech)
TTS_VENDOR=ByteDance
TTS_APP_ID=zego_test
TTS_TOKEN=zego_test
TTS_CLUSTER=volcano_tts
TTS_VOICE_TYPE=zh_female_wanwanxiaohe_moon_bigtts

1.2 Token Generation API

// app/api/zego/token/route.ts
... code unchanged ...
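The complete route handler ships with the sample repository linked above. As a rough sketch of what this endpoint does, the handler below accepts a user ID and returns a short-lived login token; the generateToken04 helper (ZEGOCLOUD's server-side token utility), its import path, and the response shape are assumptions rather than the repository's exact code.

// app/api/zego/token/route.ts — illustrative sketch, not the repository's exact code
import { NextResponse } from "next/server";
// Assumption: token helper based on ZEGOCLOUD's server-side token utilities
import { generateToken04 } from "@/lib/zegoServerAssistant";

export async function POST(request: Request) {
  const { userId } = await request.json();
  if (!userId) {
    return NextResponse.json({ error: "userId is required" }, { status: 400 });
  }

  const appId = Number(process.env.NEXT_PUBLIC_ZEGO_APP_ID);
  const serverSecret = process.env.ZEGO_SERVER_SECRET as string;
  const effectiveTimeInSeconds = 3600; // token valid for one hour

  // The token is generated server-side so ZEGO_SERVER_SECRET never reaches the client
  const token = generateToken04(appId, userId, serverSecret, effectiveTimeInSeconds, "");

  return NextResponse.json({ token, appId, userId });
}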

1.3 Deploy to Vercel

Once configured, deploy the Next.js server to Vercel:

  1. Push the project to GitHub
  2. Import the repository into Vercel
  3. Add the environment variables under project settings
  4. Deploy the project

The backend service will be accessible at your Vercel project URL.

Step 2: Build the iOS Client

2.1 Create an Xcode Project

  • Interface: SwiftUI
  • Language: Swift
  • Minimum Deployment: iOS 14.0

2.2 Integrate ZEGOCLOUD SDK

Important: Use the AI-Agent-optimized SDK version v3.22.0.46173.

// SDK download and integration steps (unchanged)

2.3 Info.plist Configuration

<key>NSMicrophoneUsageDescription</key>
<string>This app needs access to your microphone for voice conversations with AI.</string>
<key>UIBackgroundModes</key>
<array>
    <string>audio</string>
</array>
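The plist entry supplies the usage description, but before publishing audio you also need the user's runtime consent. A small sketch (not part of the original tutorial) using AVAudioSession:

// Request microphone access before starting a call (illustrative; not from the original tutorial)
import AVFoundation

func requestMicrophonePermission(completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        DispatchQueue.main.async {
            completion(granted)  // only start the call when granted is true
        }
    }
}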

2.4 App Config File

// AppConfig.swift
struct AppConfig {
    // ZEGO App ID - Must match your backend configuration
    static let appID: UInt32 = 1234567890

    // Backend server URL (your Vercel deployment)
    static let serverURL = "https://your-project.vercel.app"

    static func generateUserId() -> String {
        return "user\(Int(Date().timeIntervalSince1970) % 100000)"
    }

    static func generateRoomId() -> String {
        return "room\(Int(Date().timeIntervalSince1970) % 100000)"
    }
}

2.5 API Service

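The API service wraps the HTTP calls to the backend from Step 1. A minimal sketch is shown below; the endpoint paths (/api/zego/token and /api/agent/start), the response fields, and the startAgent parameters are assumptions about the backend rather than code confirmed by the article.

// ApiService.swift — illustrative sketch; endpoint paths and response fields are assumptions
import Foundation

struct TokenResponse: Decodable {
    let token: String
}

struct AgentResponse: Decodable {
    let agentStreamId: String
}

class ApiService {
    static let shared = ApiService()

    // Generic JSON POST helper. Note: URLSession's async APIs require iOS 15;
    // on iOS 14 wrap dataTask(with:) in a continuation instead.
    private func post<T: Decodable>(_ path: String, body: [String: Any]) async throws -> T {
        var request = URLRequest(url: URL(string: AppConfig.serverURL + path)!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONSerialization.data(withJSONObject: body)
        let (data, _) = try await URLSession.shared.data(for: request)
        return try JSONDecoder().decode(T.self, from: data)
    }

    // Fetches a ZEGO login token for the given user from the backend
    func getToken(userId: String) async throws -> TokenResponse {
        try await post("/api/zego/token", body: ["userId": userId])
    }

    // Asks the backend to register and start the AI agent for this room
    func startAgent(roomId: String, userId: String, userStreamId: String) async throws -> AgentResponse {
        try await post("/api/agent/start", body: [
            "roomId": roomId,
            "userId": userId,
            "userStreamId": userStreamId
        ])
    }
}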

2.6 ZEGO Express Manager

The core manager handles RTC operations and subtitle parsing:

// ZegoExpressManager.swift
import ZegoExpressEngine

struct SubtitleMessage {
    let cmd: Int           // 3=ASR(user), 4=LLM(AI)
    let text: String
    let messageId: String
    let endFlag: Bool

    var isUserMessage: Bool { cmd == 3 }
    var isAgentMessage: Bool { cmd == 4 }
}

class ZegoExpressManager: NSObject {
    static let shared = ZegoExpressManager()

    var onSubtitleReceived: ((SubtitleMessage) -> Void)?

    func initEngine() {
        // Configure engine for AI conversation
        let engineConfig = ZegoEngineConfig()
        engineConfig.advancedConfig = [
            "set_audio_volume_ducking_mode": "1",
            "enable_rnd_volume_adaptive": "true"
        ]
        ZegoExpressEngine.setEngineConfig(engineConfig)

        // Create engine
        let profile = ZegoEngineProfile()
        profile.appID = AppConfig.appID
        profile.scenario = .highQualityChatroom
        ZegoExpressEngine.createEngine(with: profile, eventHandler: self)

        // Enable 3A audio processing
        let engine = ZegoExpressEngine.shared()
        engine.enableAGC(true)
        engine.enableAEC(true)
        engine.setAECMode(.aiBalanced)
        engine.enableANS(true)
        engine.setANSMode(.medium)
    }

    func loginRoom(roomId: String, userId: String, token: String,
                   callback: @escaping (Int32) -> Void) {
        let user = ZegoUser(userID: userId)
        let config = ZegoRoomConfig()
        config.token = token
        ZegoExpressEngine.shared().loginRoom(roomId, user: user, config: config) { errorCode, _ in
            callback(errorCode)
        }
    }

    func startPublishing(streamId: String) {
        ZegoExpressEngine.shared().startPublishingStream(streamId)
    }

    func startPlaying(streamId: String) {
        ZegoExpressEngine.shared().startPlayingStream(streamId)
    }
}

extension ZegoExpressManager: ZegoEventHandler {
    func onRecvExperimentalAPI(_ content: String) {
        // Parse subtitle message from experimental API
        guard let data = content.data(using: .utf8),
              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
              let method = json["method"] as? String,
              method == "liveroom.room.on_recive_room_channel_message",
              let params = json["params"] as? [String: Any],
              let msgContent = params["msg_content"] as? String,
              let msgData = msgContent.data(using: .utf8),
              let msgJson = try? JSONSerialization.jsonObject(with: msgData) as? [String: Any],
              let cmd = msgJson["Cmd"] as? Int,
              let dataDict = msgJson["Data"] as? [String: Any] else {
            return
        }

        let message = SubtitleMessage(
            cmd: cmd,
            text: dataDict["Text"] as? String ?? "",
            messageId: dataDict["MessageId"] as? String ?? "",
            endFlag: dataDict["EndFlag"] as? Bool ?? false
        )
        onSubtitleReceived?(message)
    }
}
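To make the nested payload format concrete, the snippet below hand-builds a sample message in the shape the parser above expects (method, params.msg_content, and the inner Cmd/Data fields) and feeds it through onRecvExperimentalAPI, for example in a playground. The text content is made up purely for illustration.

// Illustrative only: build a payload shaped the way the parser above expects and feed it through
import Foundation

let innerMessage: [String: Any] = [
    "Cmd": 4,  // 4 = LLM (AI) subtitle
    "Data": ["Text": "Hello! ", "MessageId": "msg_001", "EndFlag": false]
]
let msgContent = String(data: try! JSONSerialization.data(withJSONObject: innerMessage), encoding: .utf8)!

let payload: [String: Any] = [
    "method": "liveroom.room.on_recive_room_channel_message",
    "params": ["msg_content": msgContent]
]
let payloadString = String(data: try! JSONSerialization.data(withJSONObject: payload), encoding: .utf8)!

ZegoExpressManager.shared.onSubtitleReceived = { message in
    // Prints: true "Hello! " false
    print(message.isAgentMessage, message.text, message.endFlag)
}
ZegoExpressManager.shared.onRecvExperimentalAPI(payloadString)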

2.7 Chat ViewModel

The ViewModel manages the conversation state and handles subtitle display:

// ChatViewModel.swift
@MainActor
class ChatViewModel: ObservableObject {
    @Published var messages: [ChatMessage] = []
    @Published var isConnected = false
    @Published var statusText = "Disconnected"

    private var llmMessageCache: [String: String] = [:]

    init() {
        ZegoExpressManager.shared.initEngine()

        ZegoExpressManager.shared.onSubtitleReceived = { [weak self] message in
            Task { @MainActor in
                self?.handleSubtitleMessage(message)
            }
        }
    }

    private func handleSubtitleMessage(_ message: SubtitleMessage) {
        if message.isUserMessage {
            // ASR: Full text replacement each time
            handleAsrMessage(message)
        } else if message.isAgentMessage {
            // LLM: Incremental text accumulation
            handleLlmMessage(message)
        }
    }

    private func handleAsrMessage(_ message: SubtitleMessage) {
        if let index = messages.firstIndex(where: { $0.id == message.messageId }) {
            messages[index].text = message.text
        } else {
            messages.append(ChatMessage(id: message.messageId, text: message.text, isUser: true))
        }
    }

    private func handleLlmMessage(_ message: SubtitleMessage) {
        let cachedText = llmMessageCache[message.messageId] ?? ""
        let newText = cachedText + message.text
        llmMessageCache[message.messageId] = newText

        if let index = messages.firstIndex(where: { $0.id == message.messageId }) {
            messages[index].text = newText
        } else {
            messages.append(ChatMessage(id: message.messageId, text: newText, isUser: false))
        }

        if message.endFlag {
            llmMessageCache.removeValue(forKey: message.messageId)
        }
    }

    func startCall() async {
        let userId = AppConfig.generateUserId()
        let roomId = AppConfig.generateRoomId()
        let userStreamId = "\(userId)_stream"

        do {
            // 1. Get a login token from the backend
            let tokenResponse = try await ApiService.shared.getToken(userId: userId)

            // 2. Log in to the room
            ZegoExpressManager.shared.loginRoom(roomId: roomId, userId: userId, token: tokenResponse.token) { _ in }

            // 3. Start publishing the local audio stream
            ZegoExpressManager.shared.startPublishing(streamId: userStreamId)

            // 4. Ask the backend to start the AI agent for this room
            let agentResponse = try await ApiService.shared.startAgent(...)

            // 5. Play the agent's audio stream
            ZegoExpressManager.shared.startPlaying(streamId: agentResponse.agentStreamId)

            isConnected = true
            statusText = "Connected"
        } catch {
            statusText = "Failed to start call: \(error.localizedDescription)"
        }
    }
}
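The ChatMessage model is not shown above. A minimal definition consistent with how the ViewModel and the views use it (a stable id for in-place updates, mutable text, an isUser flag, and Identifiable conformance for ForEach) might look like this:

// ChatMessage.swift — minimal model inferred from how it is used above
import Foundation

struct ChatMessage: Identifiable {
    let id: String      // the subtitle MessageId, used to update messages in place
    var text: String    // mutable so partial ASR/LLM text can be replaced or appended
    let isUser: Bool    // true for ASR (user) messages, false for AI agent messages
}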

2.8 SwiftUI Views

Create the user interface:

// ContentView.swift
struct ContentView: View {
    @StateObject private var viewModel = ChatViewModel()

    var body: some View {
        HStack(spacing: 0) {
            // Left: Control Panel
            VStack(spacing: 24) {
                Text("ZEGO AI Agent")
                    .font(.title)

                HStack(spacing: 8) {
                    Circle()
                        .fill(viewModel.isConnected ? Color.green : Color.gray)
                        .frame(width: 12, height: 12)
                    Text(viewModel.statusText)
                }

                Button(action: {
                    Task {
                        if viewModel.isConnected {
                            await viewModel.endCall()
                        } else {
                            await viewModel.startCall()
                        }
                    }
                }) {
                    Text(viewModel.isConnected ? "End Call" : "Start AI Call")
                        .foregroundColor(.white)
                        .frame(width: 180, height: 50)
                        .background(viewModel.isConnected ? Color.red : Color.blue)
                        .cornerRadius(25)
                }
            }
            .frame(maxWidth: .infinity)

            // Right: Chat Messages
            ScrollView {
                LazyVStack(alignment: .leading, spacing: 12) {
                    ForEach(viewModel.messages) { message in
                        MessageBubble(message: message)
                    }
                }
                .padding()
            }
            .frame(maxWidth: .infinity)
        }
    }
}

struct MessageBubble: View {
    let message: ChatMessage

    var body: some View {
        VStack(alignment: message.isUser ? .trailing : .leading) {
            Text(message.isUser ? "You" : "AI Agent")
                .font(.caption)
            Text(message.text)
                .padding(12)
                .background(message.isUser ? Color.blue.opacity(0.2) : Color.green.opacity(0.2))
                .cornerRadius(12)
        }
        .frame(maxWidth: .infinity, alignment: message.isUser ? .trailing : .leading)
    }
}
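ContentView also calls viewModel.endCall(), which is not shown in the ViewModel above. A minimal sketch of what it could do is below; tearing down the agent on the backend is only noted in a comment because the article does not show that endpoint.

// ChatViewModel+EndCall.swift — illustrative sketch, not code from the original tutorial
import ZegoExpressEngine

extension ChatViewModel {
    func endCall() async {
        let engine = ZegoExpressEngine.shared()
        // Stop publishing the local microphone stream and leave the room
        engine.stopPublishingStream()
        engine.logoutRoom()
        // Also call the backend here to stop the AI agent instance
        // (the article does not show that endpoint, so it is omitted).

        isConnected = false
        statusText = "Disconnected"
    }
}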

Conclusion

You now have a functioning iOS application that supports real-time conversational AI using ZEGOCLOUD’s SDK. The app captures voice input, generates LLM responses, plays back synthesized speech, and displays live subtitles during the interaction.
