Alexander Phaiboon
Building Real-Time Voice Chat in Flutter: A WebSocket Streaming Architecture

Most Flutter chat UIs handle text perfectly but fall apart when you add voice. The moment you need real-time audio streaming, you're dealing with WebSockets, buffering, and state management nightmares that make your smooth chat UI stutter and crash.

I spent three weeks building voice chat for our Flutter app and discovered that the standard chat UI patterns don't work for streaming audio. Here's the architecture that actually works, complete with code examples you can implement today.

The Problem with Standard Chat UI Patterns

Traditional Flutter chat widgets assume messages arrive complete. You build a ListView, add a new ChatMessage widget, and call setState(). Perfect for text, terrible for voice.

Voice streaming is different:

  • Audio chunks arrive every 100ms
  • You need to show real-time transcription updates
  • Visual feedback must update without rebuilding the entire chat
  • WebSocket connections can drop and reconnect
  • Buffer management becomes critical for smooth playback

The flutter_gen_ai_chat_ui package (864 weekly downloads) handles text streaming well but lacks voice-specific patterns. Most developers end up creating custom solutions that don't scale.

Solution: Streaming-First Architecture

The key insight is separating message state from streaming state. Instead of updating existing messages, we create a dedicated streaming layer that handles real-time updates.

Here's the core architecture:

// Core streaming interfaces
abstract class VoiceProvider {
  Stream<VoiceChunk> streamVoiceInput();
  StreamSink<VoiceChunk> get voiceOutput;
  Stream<TranscriptionUpdate> get transcriptionStream;
}

class VoiceChunk {
  final Uint8List audioData;
  final Duration timestamp;
  final String sessionId;

  const VoiceChunk({
    required this.audioData,
    required this.timestamp,
    required this.sessionId,
  });
}

class TranscriptionUpdate {
  final String text;
  final bool isComplete;
  final String sessionId;
  final Duration timestamp;

  const TranscriptionUpdate({
    required this.text,
    required this.isComplete,
    required this.sessionId,
    required this.timestamp,
  });
}

This separation lets us handle voice streaming without disrupting the main chat UI. The VoiceProvider manages WebSocket connections and audio processing while the chat UI focuses on displaying results.
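Since chunks ultimately travel over the WebSocket as text frames, it helps to see one possible wire format for a VoiceChunk. This is a minimal sketch assuming the backend accepts base64-encoded audio; the field names (`audio`, `timestamp_ms`, `session_id`) are illustrative, not a fixed protocol:

```dart
import 'dart:convert'; // base64Encode, jsonEncode
import 'dart:typed_data';

// Hypothetical wire format for a VoiceChunk; adjust field names
// to whatever your backend actually expects.
Map<String, dynamic> voiceChunkToJson({
  required Uint8List audioData,
  required Duration timestamp,
  required String sessionId,
}) {
  return {
    'audio': base64Encode(audioData), // binary audio as base64 text
    'timestamp_ms': timestamp.inMilliseconds,
    'session_id': sessionId,
  };
}

void main() {
  final json = voiceChunkToJson(
    audioData: Uint8List.fromList([1, 2, 3]),
    timestamp: const Duration(milliseconds: 100),
    sessionId: 'abc',
  );
  print(jsonEncode(json)); // {"audio":"AQID","timestamp_ms":100,"session_id":"abc"}
}
```

Base64 roughly adds a third to the payload size; if your backend supports binary WebSocket frames, sending raw bytes is cheaper.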

Step 1: Building the Voice Streaming Controller

The streaming controller coordinates between WebSocket connections and UI updates. It handles connection failures, buffering, and state management:

class VoiceStreamController extends ChangeNotifier {
  WebSocketChannel? _channel;
  final List<VoiceChunk> _audioBuffer = [];
  StreamSubscription<TranscriptionUpdate>? _transcriptionSubscription;

  bool _isRecording = false;
  bool _isConnected = false;
  String _currentTranscription = '';

  bool get isRecording => _isRecording;
  bool get isConnected => _isConnected;
  String get currentTranscription => _currentTranscription;

  Future<void> connect() async {
    try {
      _channel = WebSocketChannel.connect(
        Uri.parse('wss://api.flai.dev/voice-stream'),
      );

      _isConnected = true;
      notifyListeners();

      // Listen for transcription updates. The socket delivers JSON strings,
      // so decode before constructing the model.
      _transcriptionSubscription = _channel!.stream
          .map((data) => TranscriptionUpdate.fromJson(
                jsonDecode(data as String) as Map<String, dynamic>,
              ))
          .listen(_handleTranscriptionUpdate);

    } catch (e) {
      _isConnected = false;
      notifyListeners();
      rethrow;
    }
  }

  void _handleTranscriptionUpdate(TranscriptionUpdate update) {
    _currentTranscription = update.text;
    notifyListeners();

    // If transcription is complete, add to chat history
    if (update.isComplete) {
      _addCompletedMessage(update.text);
      _currentTranscription = '';
    }
  }

  Future<void> startRecording() async {
    if (!_isConnected) await connect();

    _isRecording = true;
    notifyListeners();

    // Start audio capture and streaming
    await _startAudioCapture();
  }

  Future<void> stopRecording() async {
    _isRecording = false;
    notifyListeners();

    await _stopAudioCapture();

    // Send completion signal (null-safe: connect() may never have run)
    _channel?.sink.add(jsonEncode({'type': 'recording_complete'}));
  }

  // Wire these to your audio capture plugin of choice
  // (e.g. the record or flutter_sound packages).
  Future<void> _startAudioCapture() async {}
  Future<void> _stopAudioCapture() async {}

  void _addCompletedMessage(String text) {
    // Add to your chat message list here
    // This integrates with your existing chat UI
  }

  @override
  void dispose() {
    _transcriptionSubscription?.cancel();
    _channel?.sink.close();
    super.dispose();
  }
}

The controller maintains clear separation between streaming state (isRecording, currentTranscription) and final state (completed messages). This prevents UI flickering and makes the interface predictable.
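One gap worth closing: the stream mapping calls TranscriptionUpdate.fromJson, which the model class from earlier never defines. Here's a minimal sketch, assuming the server sends fields named `text`, `is_complete`, `session_id`, and `timestamp_ms` — those names are an assumption, so match them to your backend's actual payload:

```dart
import 'dart:convert'; // jsonDecode, for turning socket frames into maps

class TranscriptionUpdate {
  final String text;
  final bool isComplete;
  final String sessionId;
  final Duration timestamp;

  const TranscriptionUpdate({
    required this.text,
    required this.isComplete,
    required this.sessionId,
    required this.timestamp,
  });

  // Field names here are assumptions about the server payload.
  factory TranscriptionUpdate.fromJson(Map<String, dynamic> json) {
    return TranscriptionUpdate(
      text: json['text'] as String,
      isComplete: json['is_complete'] as bool,
      sessionId: json['session_id'] as String,
      timestamp: Duration(milliseconds: json['timestamp_ms'] as int),
    );
  }
}

void main() {
  final update = TranscriptionUpdate.fromJson(
    jsonDecode('{"text":"hello","is_complete":true,'
        '"session_id":"s1","timestamp_ms":1200}') as Map<String, dynamic>,
  );
  print(update.text); // hello
}
```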

Step 2: Creating the Voice Chat UI

The UI layer combines your existing chat interface with streaming overlays. The key is scoping real-time rebuilds to a small listener widget while keeping the main chat list stable:

class VoiceChatWidget extends StatefulWidget {
  @override
  _VoiceChatWidgetState createState() => _VoiceChatWidgetState();
}

class _VoiceChatWidgetState extends State<VoiceChatWidget> {
  late VoiceStreamController _voiceController;
  final List<ChatMessage> _messages = [];

  @override
  void initState() {
    super.initState();
    _voiceController = VoiceStreamController();
  }

  @override
  void dispose() {
    _voiceController.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Column(
        children: [
          // Main chat area - stable, doesn't rebuild during streaming
          Expanded(
            child: ListView.builder(
              itemCount: _messages.length,
              itemBuilder: (context, index) {
                return ChatMessageWidget(message: _messages[index]);
              },
            ),
          ),

          // Streaming overlay - only rebuilds during voice activity.
          // ListenableBuilder scopes rebuilds to this subtree without
          // requiring a Provider ancestor for the controller.
          ListenableBuilder(
            listenable: _voiceController,
            builder: (context, child) {
              if (!_voiceController.isRecording &&
                  _voiceController.currentTranscription.isEmpty) {
                return const SizedBox.shrink();
              }

              return StreamingMessageOverlay(
                transcription: _voiceController.currentTranscription,
                isRecording: _voiceController.isRecording,
              );
            },
          ),

          // Voice controls
          VoiceControlsWidget(controller: _voiceController),
        ],
      ),
    );
  }
}

class StreamingMessageOverlay extends StatelessWidget {
  final String transcription;
  final bool isRecording;

  const StreamingMessageOverlay({
    required this.transcription,
    required this.isRecording,
  });

  @override
  Widget build(BuildContext context) {
    return Container(
      margin: EdgeInsets.all(16),
      padding: EdgeInsets.all(12),
      decoration: BoxDecoration(
        color: Colors.blue.withOpacity(0.1),
        borderRadius: BorderRadius.circular(8),
        border: Border.all(
          color: isRecording ? Colors.red : Colors.blue,
          width: 2,
        ),
      ),
      child: Row(
        children: [
          if (isRecording)
            Icon(Icons.mic, color: Colors.red, size: 20),
          if (!isRecording && transcription.isNotEmpty)
            Icon(Icons.edit, color: Colors.blue, size: 20),
          SizedBox(width: 8),
          Expanded(
            child: Text(
              transcription.isEmpty 
                ? (isRecording ? 'Listening...' : 'Processing...')
                : transcription,
              style: TextStyle(
                fontStyle: transcription.isEmpty ? FontStyle.italic : null,
              ),
            ),
          ),
        ],
      ),
    );
  }
}

This approach keeps your existing chat UI completely stable. The ListView never rebuilds during streaming, preventing scroll jumps and maintaining smooth performance. Only the small streaming overlay updates in real-time.

Step 3: WebSocket Connection Management

Voice streaming requires robust connection handling. WebSockets drop frequently on mobile networks, and you need seamless reconnection without losing audio data:

class WebSocketManager {
  WebSocketChannel? _channel;
  Timer? _heartbeatTimer;
  final List<Map<String, dynamic>> _messageQueue = [];

  Stream<dynamic> connect(String url) async* {
    while (true) {
      try {
        _channel = WebSocketChannel.connect(Uri.parse(url));
        _startHeartbeat();

        // Send queued messages
        for (final message in _messageQueue) {
          _channel!.sink.add(jsonEncode(message));
        }
        _messageQueue.clear();

        await for (final data in _channel!.stream) {
          yield data;
        }
      } catch (e) {
        print('WebSocket error: $e');
      }

      // Runs after errors and after clean closes alike, so a server-side
      // disconnect doesn't trigger an instant reconnect loop
      _stopHeartbeat();

      // Wait before reconnecting
      await Future.delayed(Duration(seconds: 2));
    }
  }

  void send(Map<String, dynamic> message) {
    if (_channel != null) {
      try {
        _channel!.sink.add(jsonEncode(message));
      } catch (e) {
        // Connection lost, queue message
        _messageQueue.add(message);
      }
    } else {
      _messageQueue.add(message);
    }
  }

  void _startHeartbeat() {
    _heartbeatTimer = Timer.periodic(Duration(seconds: 30), (timer) {
      send({'type': 'ping'});
    });
  }

  void _stopHeartbeat() {
    _heartbeatTimer?.cancel();
    _heartbeatTimer = null;
  }

  void dispose() {
    _stopHeartbeat();
    _channel?.sink.close();
  }
}

The message queue ensures no audio data is lost during reconnections. The heartbeat keeps connections alive on mobile networks that aggressively close idle sockets.
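A fixed two-second retry works, but mobile networks recover unevenly, and hammering a struggling server every two seconds makes things worse. Exponential backoff with a cap is a common refinement; here's a sketch of one possible delay schedule (the base of 1s and cap of 30s are tunable assumptions, not part of any protocol):

```dart
import 'dart:math'; // min, pow

// Backoff delay for the nth consecutive failed reconnect attempt:
// 1s, 2s, 4s, 8s, ... capped at 30s.
Duration reconnectDelay(int attempt) {
  final seconds = min(30, pow(2, attempt).toInt());
  return Duration(seconds: seconds);
}

void main() {
  for (var attempt = 0; attempt < 7; attempt++) {
    print('attempt $attempt -> ${reconnectDelay(attempt).inSeconds}s');
  }
}
```

Reset the attempt counter to zero after a successful connect so recoveries from a brief network blip stay fast.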

Integration with FlAI

FlAI provides pre-built components that handle these patterns for you. The voice_conversation_flow Mason brick includes the streaming controller, WebSocket management, and UI components as a complete package:

# Install FlAI CLI
dart pub global activate flai_cli

# Generate voice chat component
flai create voice_conversation_flow my_voice_chat

# Add to your app
cd my_voice_chat
flutter pub get

The generated component includes production-ready error handling, audio buffering, and integration hooks for your existing chat UI. You get the architecture I've shown here without building it from scratch.

Performance Considerations

Voice streaming puts different demands on your app than text chat:

Memory Management: Audio buffers grow quickly. Implement circular buffering and clear completed chunks:

import 'dart:collection'; // Queue lives here

class AudioBufferManager {
  static const int maxBufferSize = 50; // Keep last 50 chunks
  final Queue<VoiceChunk> _buffer = Queue();

  void addChunk(VoiceChunk chunk) {
    _buffer.add(chunk);

    if (_buffer.length > maxBufferSize) {
      _buffer.removeFirst();
    }
  }
}
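It's worth sanity-checking what that cap means in wall-clock terms. With chunks arriving every 100ms (as noted earlier), 50 chunks hold roughly five seconds of audio. A small sketch of the arithmetic, useful when tuning maxBufferSize against your own chunk rate:

```dart
// Approximate audio duration held by a bounded buffer,
// given the interval between chunks.
Duration bufferedAudio({
  required int chunkCount,
  required Duration chunkInterval,
}) {
  return chunkInterval * chunkCount;
}

void main() {
  final held = bufferedAudio(
    chunkCount: 50,
    chunkInterval: const Duration(milliseconds: 100),
  );
  print('${held.inSeconds}s of audio buffered'); // 5s of audio buffered
}
```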

UI Updates: Limit transcription updates to prevent excessive rebuilds:

Timer? _updateTimer;

void _handleTranscriptionUpdate(String text) {
  _updateTimer?.cancel();
  _updateTimer = Timer(Duration(milliseconds: 100), () {
    _currentTranscription = text;
    notifyListeners();
  });
}

Network Efficiency: Batch small audio chunks to reduce WebSocket overhead:

final List<VoiceChunk> _chunkBatch = [];
Duration _lastFlush = Duration.zero;

void _batchAndSend(VoiceChunk chunk) {
  _chunkBatch.add(chunk);

  // Flush every 5 chunks, or once 500ms of audio time has accumulated.
  // (Checking `timestamp % 500 == 0` would only fire on exact multiples
  // and silently skip flushes when timestamps drift.)
  final elapsed = chunk.timestamp - _lastFlush;
  if (_chunkBatch.length >= 5 || elapsed.inMilliseconds >= 500) {
    _sendBatch();
    _chunkBatch.clear();
    _lastFlush = chunk.timestamp;
  }
}

Next Steps

This architecture scales to complex voice AI scenarios. You can extend it with:

  • Multi-participant voice chat using session IDs
  • Voice activity detection to optimize bandwidth
  • Audio quality adaptation based on connection speed
  • Integration with speech recognition APIs beyond transcription

The streaming-first approach works because it separates concerns cleanly. Your chat UI stays simple and fast, while the streaming layer handles all the complexity of real-time audio.

Try implementing the basic controller first, then add the UI overlay. Once you have those working, the WebSocket management becomes straightforward to integrate.

The key insight is that voice chat isn't just text chat with audio added—it's a different interaction pattern that needs different architecture. Build for streaming from the start, and your users will get the smooth, responsive voice experience they expect.
