DEV Community

Stephen568hub

How to Build an AI Voice Assistant on Android

Voice interfaces represent the next evolution in mobile human-computer interaction. Users increasingly expect natural, hands-free communication with their applications rather than traditional touch input. For developers seeking to build an AI voice assistant for Android, [ZEGOCLOUD](https://www.zegocloud.com/) provides a comprehensive solution.

This technical guide presents a systematic approach to integrating voice AI capabilities into Android applications. Through ZEGOCLOUD's [Conversational AI platform](https://www.zegocloud.com/solutions/conversational-ai), developers can implement automatic speech recognition, large language model processing, and neural text-to-speech within a unified SDK architecture.

ZEGOCLOUD Platform Overview

ZEGOCLOUD delivers real-time communication services through cloud-based infrastructure. The Conversational AI product consolidates three distinct AI services:

| Service | Technology | Output |
| --- | --- | --- |
| ASR | Automatic speech recognition | Transcribed text |
| LLM | Large language model inference | Contextual responses |
| TTS | Neural voice synthesis | Natural audio output |

The platform abstracts complexity associated with managing separate ASR, LLM, and TTS providers while maintaining sub-second response latency suitable for conversational applications.
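Conceptually, each conversational turn composes the three stages in sequence. The sketch below models that composition with illustrative stand-in types; these are not ZEGOCLOUD SDK interfaces, just a way to picture the pipeline the platform runs for you:

```kotlin
// Illustrative stand-ins for the three AI stages (not SDK types)
fun interface Asr { fun transcribe(audio: ByteArray): String }
fun interface Llm { fun complete(prompt: String): String }
fun interface Tts { fun synthesize(text: String): ByteArray }

// One conversational turn: user speech in, synthesized reply audio out
fun voiceTurn(asr: Asr, llm: Llm, tts: Tts, userAudio: ByteArray): ByteArray =
    tts.synthesize(llm.complete(asr.transcribe(userAudio)))
```

The value of the managed platform is that each of these stages, plus the streaming glue between them, runs server-side with latency kept low enough for natural turn-taking.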

System Architecture

Component Interaction

```mermaid
sequenceDiagram
    participant C as Android Client
    participant B as Backend Server
    participant Z as ZEGO Cloud
    participant A as AI Services

    C->>B: Request auth token
    B-->>C: Return token

    C->>Z: Join RTC room + publish stream
    C->>B: Create AI agent request
    B->>Z: Register agent with LLM/TTS
    B->>Z: Instantiate agent
    Z-->>B: Agent stream identifier
    B-->>C: Agent stream ID

    C->>Z: Subscribe to agent audio

    Note over C,Z: Conversation loop active
    C->>Z: User audio frames
    Z->>A: ASR transcription
    A->>Z: Text payload
    Z->>A: LLM inference request
    A->>Z: Response text
    Z->>A: TTS synthesis request
    A->>Z: Audio payload
    Z-->>C: AI voice + subtitle data

    C->>B: Terminate agent
    B->>Z: Delete agent instance
```

Data Flow Summary

| Phase | Action | Endpoint |
| --- | --- | --- |
| Authentication | Token request | /api/zego/token |
| Session Init | Room login + publish | ZEGO RTC |
| Agent Creation | AI agent registration | /api/zego/start |
| Media Playback | Stream subscription | ZEGO RTC |
| Conversation | Bidirectional audio | ZEGO Cloud |
| Termination | Agent cleanup | /api/zego/stop |

Understanding the Server-Client Separation

Building an Android AI voice assistant requires a two-tier architecture rather than direct client-to-service communication. This design decision stems from fundamental security and operational requirements.

Architectural Roles:

| Layer | Primary Function | Credentials Stored |
| --- | --- | --- |
| Backend Server | Authentication, agent lifecycle, API signing | Full access (server secret, LLM keys) |
| Android Client | Audio I/O, stream management, UI | None (receives short-lived tokens only) |
| ZEGO Cloud | ASR, LLM, TTS processing | N/A (external service) |

Rationale for Separation:

  1. Credential Security — Server secrets (ZEGO_SERVER_SECRET, LLM API keys) remain in your controlled infrastructure. Mobile applications can be decompiled, making them unsuitable for permanent credential storage.

  2. Token Lifecycle Management — Clients authenticate using time-limited tokens (default 1 hour). This approach provides revocable access without exposing long-term credentials.

  3. Dynamic AI Configuration — System prompts, model selections, and voice profiles are managed server-side. Changes deploy instantly without requiring client app updates through app stores.

  4. Operational Control — Centralized logging, rate limiting, usage analytics, and abuse detection operate at the server layer, providing a single control point for production monitoring.

When you build a voice AI assistant for Android, this architecture ensures security, maintainability, and operational visibility in production environments.
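The token-lifecycle point above implies the client should refresh tokens before they lapse. A minimal client-side guard is sketched below; note that the `expiresAtSec` field is an assumption (the sample token route later in this guide returns only the token string, so the backend would need to be extended to report expiry):

```kotlin
// Hypothetical client-side token holder; assumes the backend reports
// the token's expiry as epoch seconds alongside the token value
data class AuthToken(val value: String, val expiresAtSec: Long)

// Refresh when the token is missing or within `marginSec` of expiring,
// so an in-flight session never holds a token that lapses mid-call
fun needsRefresh(token: AuthToken?, nowSec: Long, marginSec: Long = 60): Boolean =
    token == null || nowSec >= token.expiresAtSec - marginSec
```

With a one-hour TTL, checking this guard before each room join keeps renewal logic in one place.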

Implementation Guide

Phase 1: Backend Infrastructure

Deploy a Next.js server to handle authentication and agent lifecycle management.

Environment Setup:

```bash
# .env.local configuration
NEXT_PUBLIC_ZEGO_APP_ID=<your_app_id>
ZEGO_SERVER_SECRET=<32_character_secret>

# AI Agent parameters
ZEGO_AGENT_ID=voice_assistant_01
SYSTEM_PROMPT="You are a helpful voice assistant. Provide concise, spoken-friendly responses."

# External AI services
LLM_URL=https://api.provider.com/v1/chat/completions
LLM_API_KEY=<your_api_key>
LLM_MODEL=gpt-4

# TTS configuration
TTS_VENDOR=ByteDance
TTS_VOICE_TYPE=zh_female_wanwanxiaohe_moon_bigtts
```

Token Generation Handler:

```typescript
// app/api/zego/token/route.ts
import { NextRequest, NextResponse } from 'next/server';
import crypto from 'crypto';

const generateToken = (appId: number, userId: string, secret: string, ttl: number): string => {
  const payload = {
    app_id: appId,
    user_id: userId,
    nonce: Math.floor(Math.random() * 2147483647),
    ctime: Math.floor(Date.now() / 1000),
    expire: Math.floor(Date.now() / 1000) + ttl,
    payload: ''
  };

  // Encrypt the payload. The IV (and, for GCM, the auth tag) must travel
  // inside the token so the receiving side can decrypt it. For production,
  // prefer ZEGOCLOUD's official server-side token generator, which
  // implements the exact "04" token layout the platform expects.
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', secret, iv);
  const encrypted = Buffer.concat([
    cipher.update(JSON.stringify(payload), 'utf8'),
    cipher.final(),
    cipher.getAuthTag()
  ]);

  const expireBuf = Buffer.alloc(8);
  expireBuf.writeBigInt64BE(BigInt(payload.expire));

  return '04' + Buffer.concat([
    expireBuf,
    Buffer.from([0, iv.length]),
    iv,
    encrypted
  ]).toString('base64');
};

export async function POST(req: NextRequest) {
  const { userId } = await req.json();
  const token = generateToken(
    parseInt(process.env.NEXT_PUBLIC_ZEGO_APP_ID!),
    userId,
    process.env.ZEGO_SERVER_SECRET!,
    3600 // one-hour TTL, matching the default mentioned earlier
  );
  return NextResponse.json({ token });
}

Agent Management API:

```typescript
// app/api/zego/start/route.ts
import { sendZegoRequest } from '@/lib/zego-utils';

export async function POST(req: Request) {
  const { roomId, userId, userStreamId } = await req.json();

  const agentData = await sendZegoRequest('CreateAIAgent', {
    agentId: process.env.ZEGO_AGENT_ID,
    roomId,
    userId,
    userStreamId,
    llmConfig: {
      model: process.env.LLM_MODEL,
      systemPrompt: process.env.SYSTEM_PROMPT
    },
    ttsConfig: {
      vendor: process.env.TTS_VENDOR,
      voiceType: process.env.TTS_VOICE_TYPE
    }
  });

  return Response.json({
    agentInstanceId: agentData.instanceId,
    agentStreamId: agentData.streamId
  });
}
```

Phase 2: Android Client Implementation

With the backend infrastructure in place, we now turn to the Android client. This phase involves configuring your Android project, setting up network communication, initializing the ZEGOCLOUD Express Engine, and integrating the UI components. Note that the RECORD_AUDIO runtime permission must be declared in the manifest and granted by the user before the app can capture microphone audio.

Step 2.1: Project Configuration

The ZEGOCLOUD Express SDK is distributed as a JAR file with native .so libraries. You need to manually integrate these files rather than using a Maven dependency.

Setup instructions:

  1. Download the AI Agent-optimized SDK from ZEGO Download Page
  2. Extract and copy files to your project:
    • ZegoExpressEngine.jar → app/libs/
    • libZegoExpressEngine.so (arm64) → app/libs/arm64-v8a/
    • libZegoExpressEngine.so (v7a) → app/libs/armeabi-v7a/

Then configure your build script:

```kotlin
// app/build.gradle.kts
android {
    defaultConfig {
        minSdk = 24
        ndk {
            abiFilters += listOf("armeabi-v7a", "arm64-v8a")
        }
    }
    sourceSets {
        getByName("main") {
            jniLibs.srcDirs("libs")
        }
    }
}

dependencies {
    implementation(files("libs/ZegoExpressEngine.jar"))
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
    implementation("com.google.code.gson:gson:2.10.1")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}
```

Note: The AI Agent-optimized SDK is required for receiving subtitle messages via onRecvExperimentalAPI. The standard Maven version does not support this feature.

Step 2.2: Application Configuration

Create a centralized configuration object to store app-wide constants and utility functions:

```kotlin
// config/AppConfig.kt
object AppConfig {
    const val APP_ID: Long = 1234567890L
    const val SERVER_URL = "https://your-deployment.vercel.app"

    fun generateSessionId(): String =
        "s${System.currentTimeMillis() % 1000000}"
}
```

Important: For Android emulator testing, use http://10.0.2.2:3000 instead of localhost to access your local development server.
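To avoid editing SERVER_URL by hand between local testing and release, the choice can be driven by the build type. The sketch below uses a plain boolean parameter in place of Android's generated `BuildConfig.DEBUG` flag, and the deployment URL is a placeholder:

```kotlin
// Sketch: pick the backend URL per build type. In a real app the flag
// would be BuildConfig.DEBUG; 10.0.2.2 is the emulator's alias for the
// host machine's localhost.
fun serverUrl(isDebugBuild: Boolean): String =
    if (isDebugBuild) "http://10.0.2.2:3000"
    else "https://your-deployment.vercel.app"
```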

Step 2.3: Network Service Layer

The API service handles all HTTP communication with your backend server. It manages three core operations: authentication, agent creation, and agent cleanup.

```kotlin
// api/ApiService.kt
import com.google.gson.Gson
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONObject

object ApiService {
    private val httpClient = OkHttpClient()
    private val jsonSerializer = Gson()

    suspend fun authenticate(userId: String): String? = withContext(Dispatchers.IO) {
        val request = Request.Builder()
            .url("${AppConfig.SERVER_URL}/api/zego/token")
            .post(
                jsonSerializer.toJson(mapOf("userId" to userId))
                    .toRequestBody("application/json".toMediaType())
            )
            .build()

        try {
            val response = httpClient.newCall(request).execute()
            val json = JSONObject(response.body?.string() ?: "")
            // The token route returns { "token": "..." } at the top level
            json.optString("token").takeIf { it.isNotEmpty() }
        } catch (e: Exception) { null }
    }

    suspend fun createAgent(roomId: String, userId: String, streamId: String): String? =
        withContext(Dispatchers.IO) {
            val request = Request.Builder()
                .url("${AppConfig.SERVER_URL}/api/zego/start")
                .post(
                    jsonSerializer.toJson(mapOf(
                        "roomId" to roomId,
                        "userId" to userId,
                        "userStreamId" to streamId
                    )).toRequestBody("application/json".toMediaType())
                )
                .build()

            try {
                val response = httpClient.newCall(request).execute()
                val json = JSONObject(response.body?.string() ?: "")
                // The start route returns { agentInstanceId, agentStreamId };
                // a production app should also persist agentInstanceId for cleanup
                json.optString("agentStreamId").takeIf { it.isNotEmpty() }
            } catch (e: Exception) { null }
        }

    suspend fun destroyAgent(agentInstanceId: String): Boolean = withContext(Dispatchers.IO) {
        val request = Request.Builder()
            .url("${AppConfig.SERVER_URL}/api/zego/stop")
            .post(
                jsonSerializer.toJson(mapOf("agentInstanceId" to agentInstanceId))
                    .toRequestBody("application/json".toMediaType())
            )
            .build()

        try {
            httpClient.newCall(request).execute().isSuccessful
        } catch (e: Exception) { false }
    }
}
```

Implementation notes:

  • All functions are suspend functions that shift their blocking HTTP calls onto Dispatchers.IO, so they are safe to call from main-thread coroutines
  • Error handling returns null or false on failure — the calling code should handle these cases
  • The OkHttpClient uses its default timeouts (10s connect, 10s read, 10s write)

Step 2.4: RTC Engine Initialization

The ZegoEngineManager class encapsulates all ZEGOCLOUD Express SDK operations, providing a clean interface for the rest of your application.

```kotlin
// rtc/ZegoEngineManager.kt
import kotlin.coroutines.resume
import kotlinx.coroutines.suspendCancellableCoroutine

class ZegoEngineManager(private val context: Context) {
    private var engine: ZegoExpressEngine? = null

    fun initialize() {
        val profile = ZegoEngineProfile().apply {
            appID = AppConfig.APP_ID
            scenario = ZegoScenario.HIGH_QUALITY_CHATROOM
            application = context.applicationContext as Application
        }
        engine = ZegoExpressEngine.createEngine(profile, null)

        // Audio optimization for voice interaction
        engine?.apply {
            enableAGC(true)      // Gain control
            enableANS(true)      // Noise suppression
            enableAEC(true)      // Echo cancellation
            setANSMode(ZegoANSMode.MEDIUM)
        }
    }

    // Suspends until the login callback fires instead of blocking a thread
    // with a latch; safe to call from a main-thread coroutine
    suspend fun joinRoom(roomId: String, userId: String, token: String): Int =
        suspendCancellableCoroutine { continuation ->
            val currentEngine = engine
            if (currentEngine == null) {
                continuation.resume(-1) // engine not initialized
            } else {
                val user = ZegoUser(userId)
                val config = ZegoRoomConfig().apply { this.token = token }
                currentEngine.loginRoom(roomId, user, config) { code, _ ->
                    if (continuation.isActive) continuation.resume(code)
                }
            }
        }

    fun publishAudio(streamId: String) {
        engine?.muteMicrophone(false)
        engine?.startPublishingStream(streamId)
    }

    fun playAudio(streamId: String) {
        engine?.startPlayingStream(streamId)
    }

    fun leaveRoom(roomId: String) {
        engine?.stopPublishingStream()
        engine?.logoutRoom(roomId)
    }

    fun destroy() {
        ZegoExpressEngine.destroyEngine()
    }
}
```

Audio optimization explanation:

| Setting | Purpose | Recommended Value |
| --- | --- | --- |
| AGC | Automatic Gain Control (normalizes volume levels) | Enabled |
| ANS | Automatic Noise Suppression (reduces background noise) | Medium mode |
| AEC | Acoustic Echo Cancellation (prevents echo feedback) | AI Balanced mode |

Step 2.5: Subtitle Integration

ZEGOCLOUD provides official subtitle components that handle ASR and LLM message parsing. These components are implemented in Java and need to be integrated into your Kotlin codebase.

Download required files:

  • AudioChatMessageParser.java
  • AIChatListView.java

Available from ZEGO Subtitle Guide.

```kotlin
// subtitle/SubtitleHandler.kt
class SubtitleHandler(private val view: AIChatListView) {
    private val messageParser = AudioChatMessageParser()

    init {
        messageParser.setAudioChatMessageListListener(
            object : AudioChatMessageParser.AudioChatMessageListListener {
                override fun onMessageListUpdated(messages: MutableList<AudioChatMessage>) {
                    // Hop to the main thread before touching the view
                    view.post { view.onMessageListUpdated(messages) }
                }

                override fun onAudioChatStateUpdate(status: AudioChatAgentStatusMessage) {
                    // Handle agent state changes if needed
                }
            }
        )
    }

    fun onExperimentalAPI(content: String) {
        try {
            val json = JSONObject(content)
            // "on_recive_room_channel_message" is the literal method name
            // used on the wire, spelling included
            if (json.getString("method") == "liveroom.room.on_recive_room_channel_message") {
                val msgContent = json.getJSONObject("params").getString("msg_content")
                messageParser.parseAudioChatMessage(msgContent)
            }
        } catch (e: Exception) { e.printStackTrace() }
    }
}
```

How subtitles work:

  1. The onRecvExperimentalAPI callback receives raw room channel messages
  2. Messages are parsed by AudioChatMessageParser to extract ASR (user speech) and LLM (AI response) text
  3. Parsed messages are displayed in the AIChatListView component
  4. The listener ensures UI updates run on the main thread via view.post()
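Streaming transcripts typically arrive as partial updates that supersede earlier ones, which the parser handles internally. The replace-by-sequence idea can be sketched as follows; the field names here are illustrative, not the actual message schema:

```kotlin
// Illustrative model of streaming subtitle updates: a message id may arrive
// repeatedly with an increasing seq, and a later seq replaces the partial text
data class SubtitleMsg(val id: String, val seq: Int, val text: String)

fun merge(current: Map<String, SubtitleMsg>, incoming: SubtitleMsg): Map<String, SubtitleMsg> {
    val existing = current[incoming.id]
    // Keep the incoming message only if it is newer than what we already hold
    return if (existing == null || incoming.seq > existing.seq)
        current + (incoming.id to incoming)
    else current
}
```

This is why out-of-order delivery does not corrupt the transcript: stale partials are simply discarded.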

Step 2.6: Activity Integration

Finally, tie all components together in your main Activity:

```kotlin
// ui/VoiceAssistantActivity.kt
class VoiceAssistantActivity : AppCompatActivity() {
    private lateinit var binding: ActivityVoiceAssistantBinding
    private lateinit var engineManager: ZegoEngineManager
    private lateinit var subtitleHandler: SubtitleHandler

    private var sessionId: String? = null
    private var userId: String? = null
    private var agentInstanceId: String? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Keep a binding reference so views can be accessed later
        binding = ActivityVoiceAssistantBinding.inflate(layoutInflater)
        setContentView(binding.root)

        // RECORD_AUDIO permission must already be granted at this point
        engineManager = ZegoEngineManager(application)
        engineManager.initialize()

        subtitleHandler = SubtitleHandler(binding.subtitleView)
        setupExperimentalAPIListener()

        binding.actionButton.setOnClickListener {
            if (sessionId == null) startSession() else endSession()
        }
    }

    private fun setupExperimentalAPIListener() {
        // Subtitle payloads arrive through the engine's event handler
        ZegoExpressEngine.getEngineInstance()?.setEventHandler(object : IZegoEventHandler() {
            override fun onRecvExperimentalAPI(content: String) {
                subtitleHandler.onExperimentalAPI(content)
            }
        })
    }

    private fun startSession() {
        lifecycleScope.launch {
            sessionId = AppConfig.generateSessionId()
            userId = "u${sessionId}"
            val userStreamId = "${sessionId}_user"

            val token = ApiService.authenticate(userId!!)
            if (token == null) { showError("Authentication failed"); return@launch }

            val loginCode = engineManager.joinRoom(sessionId!!, userId!!, token)
            if (loginCode != 0) { showError("Room join failed"); return@launch }

            engineManager.publishAudio(userStreamId)

            val agentStreamId = ApiService.createAgent(sessionId!!, userId!!, userStreamId)
            if (agentStreamId == null) { showError("Agent creation failed"); return@launch }

            // A production app should store the backend's agentInstanceId
            // separately; this sample reuses the stream ID as the cleanup handle
            agentInstanceId = agentStreamId
            engineManager.playAudio(agentStreamId)

            updateUI(true)
        }
    }

    private fun endSession() {
        lifecycleScope.launch {
            agentInstanceId?.let { ApiService.destroyAgent(it) }
            sessionId?.let { engineManager.leaveRoom(it) }

            sessionId = null
            userId = null
            agentInstanceId = null
            updateUI(false)
        }
    }

    override fun onDestroy() {
        super.onDestroy()
        engineManager.destroy()
    }
}
```

Session flow summary:

| Step | Action | Component |
| --- | --- | --- |
| 1 | Generate session ID | AppConfig |
| 2 | Request auth token | ApiService.authenticate() |
| 3 | Join RTC room | ZegoEngineManager.joinRoom() |
| 4 | Publish user audio | ZegoEngineManager.publishAudio() |
| 5 | Create AI agent | ApiService.createAgent() |
| 6 | Play AI audio | ZegoEngineManager.playAudio() |
| 7 | End session | destroyAgent() + leaveRoom() |

Conclusion

This guide demonstrated how to build an AI voice assistant for Android using ZEGOCLOUD's integrated platform. The architecture separates concerns between backend authentication, agent orchestration, and client-side media handling. Key implementation points include proper token management, RTC room coordination, and subtitle message parsing.

Developers can extend this foundation with custom wake words, multi-language support, or domain-specific conversation flows. The same architectural pattern applies across customer service, education, accessibility, and IoT control scenarios requiring natural voice interaction.

For production deployment, consider implementing connection retry logic, offline fallbacks, and comprehensive error handling based on your specific use case requirements.
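The retry suggestion can be sketched as a small helper around the nullable-on-failure network calls used throughout this guide (for example ApiService.authenticate). The backoff values are arbitrary starting points, not tuned recommendations:

```kotlin
// Retry a call that returns null on failure, doubling the delay after
// each failed attempt; returns null only if every attempt fails
fun <T> retryWithBackoff(
    attempts: Int = 3,
    initialDelayMs: Long = 200,
    block: () -> T?
): T? {
    var delayMs = initialDelayMs
    repeat(attempts - 1) {
        block()?.let { return it }   // success: return immediately
        Thread.sleep(delayMs)        // failure: back off before retrying
        delayMs *= 2
    }
    return block()                   // final attempt, no sleep afterwards
}
```

In coroutine code the same shape works with `kotlinx.coroutines.delay` in place of `Thread.sleep`.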
