This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
EmpathyAI is a real-time voice-powered mental health support application that provides compassionate AI-driven conversations for individuals experiencing emotional distress. The system processes spoken input through advanced speech recognition, analyzes emotional content using AI, and responds with empathetic voice-based support.
Demo
GitHub Repository
React frontend app
https://github.com/vpjigin/EmpathyAIReact.git
Spring Boot backend
https://github.com/vpjigin/EmpathyAISpringBoot.git
AssemblyAI Universal-Streaming Technology
This application demonstrates advanced real-time audio processing powered by AssemblyAI’s Universal-Streaming API. The system delivers low-latency, turn-based, secure transcription, which powers emotionally intelligent AI conversations.
Core Architecture
The architecture follows a multi-layered streaming pipeline:
Client Audio → WebSocket Handler → AssemblyAI Streaming → AI Processing → Response
AssemblyAI Streaming Implementation
- Real-time WebSocket Connection: The backend creates a persistent WebSocket connection to AssemblyAI’s streaming endpoint:
```java
private static final String ASSEMBLYAI_STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws";

public CompletableFuture<StreamingSession> createStreamingSession(String sessionId, TranscriptCallback callback) {
    String connectionUrl = ASSEMBLYAI_STREAMING_URL + "?sample_rate=16000&format_turns=true";
    URI serverUri = URI.create(connectionUrl);
    Map<String, String> headers = new HashMap<>();
    headers.put("Authorization", apiKey);
    WebSocketClient client = new WebSocketClient(serverUri, headers) {
        @Override
        public void onMessage(String message) {
            try {
                JsonNode jsonMessage = objectMapper.readTree(message);
                String messageType = jsonMessage.get("type").asText();
                if ("Turn".equals(messageType)) {
                    String transcript = jsonMessage.get("transcript").asText();
                    boolean isFormatted = jsonMessage.get("turn_is_formatted").asBoolean();
                    if (isFormatted) {
                        callback.onTranscript(transcript, true);
                    }
                }
            } catch (JsonProcessingException e) {
                // Malformed messages are logged and skipped
            }
        }
        // onOpen, onClose, and onError overrides omitted for brevity
    };
    // client.connect() and future completion omitted for brevity
}
```
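Since the connection URL requests sample_rate=16000, the raw audio forwarded to AssemblyAI must be 16-bit PCM. Browser audio APIs typically capture Float32 samples, so a conversion step is needed somewhere in the pipeline. The sketch below shows one way to do this on the server; the class and method names are my own, not part of the project:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Pcm16Converter {
    // Convert Float32 samples in [-1.0, 1.0] to 16-bit little-endian PCM,
    // the raw audio format the streaming endpoint expects at sample_rate=16000.
    public static byte[] floatToPcm16(float[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (float sample : samples) {
            // Clamp to avoid overflow if the client sends values slightly outside [-1, 1]
            float clamped = Math.max(-1.0f, Math.min(1.0f, sample));
            buffer.putShort((short) (clamped * Short.MAX_VALUE));
        }
        return buffer.array();
    }
}
```

In EmpathyAI the React client may already do this conversion before sending binary frames, in which case the backend can forward the bytes untouched.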
- Audio Streaming Handler: The AudioStreamingWebSocketHandler component bridges client-side audio to the AssemblyAI session:
```java
@Component
public class AudioStreamingWebSocketHandler implements WebSocketHandler {

    @Autowired
    private AssemblyAIStreamingServiceV2 assemblyAIStreamingService;

    // One AssemblyAI streaming session per client WebSocket session
    private final Map<String, StreamingSessionV2> assemblyAISessions = new ConcurrentHashMap<>();

    private void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
        StreamingSessionV2 assemblySession = assemblyAISessions.get(session.getId());
        if (assemblySession != null) {
            ByteBuffer audioData = message.getPayload();
            byte[] audioBytes = new byte[audioData.remaining()];
            audioData.get(audioBytes);
            assemblySession.sendAudioData(audioBytes);
        }
    }

    private void startStreaming(WebSocketSession session, String conversationUuid) {
        assemblyAIStreamingService.createStreamingSession(session.getId(), new TranscriptCallback() {
            @Override
            public void onTranscript(String text, boolean isFinal) {
                if (isFinal) {
                    handleFinalTranscript(session, conversationUuid, text);
                }
            }
        });
    }
}
```
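For the @Component handler above to receive client connections, it has to be registered with Spring's WebSocket infrastructure. A minimal sketch of that wiring, assuming a hypothetical "/ws/audio" endpoint path (the actual path in the project may differ):

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.web.socket.config.annotation.EnableWebSocket;
import org.springframework.web.socket.config.annotation.WebSocketConfigurer;
import org.springframework.web.socket.config.annotation.WebSocketHandlerRegistry;

@Configuration
@EnableWebSocket
public class WebSocketConfig implements WebSocketConfigurer {

    private final AudioStreamingWebSocketHandler audioHandler;

    public WebSocketConfig(AudioStreamingWebSocketHandler audioHandler) {
        this.audioHandler = audioHandler;
    }

    @Override
    public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
        // Browser clients connect here and stream binary PCM frames
        registry.addHandler(audioHandler, "/ws/audio")
                .setAllowedOrigins("*"); // tighten for production
    }
}
```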
- Advanced Features Utilized
- Turn-based Transcription: format_turns=true yields punctuated, human-like conversational turns
- 16 kHz Audio: sample_rate=16000 balances speech clarity with bandwidth
- TLS/SSL Security: The wss:// connection is secured with valid certificates
- Concurrent Streaming: Multiple client sessions are supported simultaneously
- Message Type Handling: Supports the "Begin", "Turn", and "Termination" message types
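The three message types reduce to a simple dispatch once the JSON fields have been parsed (as in the onMessage callback earlier). A sketch of that routing logic; the class name and return values are illustrative, not from the project:

```java
public class StreamingMessageRouter {
    // Dispatch on the "type" field of a v3 streaming message.
    // Returns a short label describing the action taken, for illustration.
    public static String route(String messageType, String transcript, boolean turnIsFormatted) {
        switch (messageType) {
            case "Begin":
                return "session-started";
            case "Turn":
                // Only formatted (final) turns are forwarded to the AI pipeline
                return turnIsFormatted ? "final:" + transcript : "partial:" + transcript;
            case "Termination":
                return "session-ended";
            default:
                return "ignored";
        }
    }
}
```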
Dual Implementation Strategy
I implemented two parallel streaming strategies:
AssemblyAIStreamingService: Uses Java-WebSocket for low-level WebSocket handling
AssemblyAIStreamingServiceV2: Uses Spring’s StandardWebSocketClient for seamless Spring Boot integration
```java
// Spring-based implementation
public CompletableFuture<StreamingSessionV2> createStreamingSession(String sessionId, TranscriptCallback callback) {
    StandardWebSocketClient client = new StandardWebSocketClient();
    WebSocketHttpHeaders headers = new WebSocketHttpHeaders();
    headers.add("Authorization", apiKey);
    URI serverUri = URI.create(ASSEMBLYAI_STREAMING_URL + "?sample_rate=16000&format_turns=true");
    WebSocketHandler handler = new TextWebSocketHandler() {
        @Override
        protected void handleTextMessage(WebSocketSession session, TextMessage message) {
            // Handle "Begin", "Turn", and "Termination" messages via the Spring WebSocket framework
        }
    };
    try {
        client.doHandshake(handler, headers, serverUri).get();
        // Session wrapping and future completion omitted for brevity
    } catch (Exception e) {
        // Connection failures complete the returned future exceptionally
    }
}
```
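Whichever implementation is used, the binary audio forwarded over the connection should be split into small fixed-size frames rather than one large buffer. A sketch of that chunking for 16 kHz, 16-bit mono PCM; the 50 ms frame duration is an assumption, not a value from the project:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AudioChunker {
    // 16 kHz, 16-bit mono PCM: 16000 samples/s * 2 bytes = 32000 bytes/s,
    // so a 50 ms frame (an assumed duration) is 1600 bytes.
    public static final int FRAME_BYTES = 16000 * 2 / 20;

    public static List<byte[]> chunk(byte[] pcm) {
        List<byte[]> frames = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
            int end = Math.min(offset + FRAME_BYTES, pcm.length);
            frames.add(Arrays.copyOfRange(pcm, offset, end));
        }
        return frames;
    }
}
```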
Technical Capabilities Leveraged
1. Real-time Binary Audio Streaming
2. Low-latency (<1s) Transcription
3. Turn-based Conversation Context
4. Error Recovery & Retry Mechanism
5. Scalable Concurrent Sessions
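The error recovery and retry mechanism listed above can be sketched as exponential backoff on reconnect attempts. The specific base delay, cap, and doubling schedule here are assumptions for illustration, not the project's actual values:

```java
public class ReconnectBackoff {
    // Exponential backoff for re-establishing the AssemblyAI WebSocket:
    // 500 ms base, doubling per attempt, capped at 8 s (all values assumed).
    public static long delayMillis(int attempt) {
        long base = 500L;
        long delay = base << Math.min(attempt, 4); // 500, 1000, 2000, 4000, 8000
        return Math.min(delay, 8000L);
    }
}
```

The caller would sleep for delayMillis(attempt) before each reconnect, resetting the attempt counter once a session is re-established.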
Project Structure (Brief)
├── controller/
├── service/
├── websocket/
├── model/
├── config/