How to Realize Real-Time Speech with the Dify API

#dify #ai #llm #javascript

Dify is an open-source SaaS platform for building LLM workflows online. I'm using its API to create a conversational AI experience in my app. I struggled to receive the TTS stream in the API response and play it back. Here I demonstrate how to process the audio stream and play it correctly in real time. If you are in a hurry, please check my code.

I'm using the API endpoint https://api.dify.ai/v1/chat-messages for text chat. If the Text to Speech feature is enabled in the Dify app, the endpoint returns audio data in the same stream as the text response.

Press the ADD FEATURE button and add the Text to Speech feature.

You can check the response from the API with the following curl command.



curl -X POST 'https://api.dify.ai/v1/chat-messages' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": {},
    "query": "What are the specs of the iPhone 13 Pro Max?",
    "response_mode": "streaming",
    "conversation_id": "",
    "user": "abc-123",
    "files": []
}'

I demonstrate in TypeScript / JavaScript, but you can apply the same logic in your own programming language.
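For reference, the curl request above could be issued from Node.js roughly as follows. This is only a sketch under assumptions: it relies on the global fetch of Node.js 18 or later, and buildChatRequest and streamChat are hypothetical helper names that are not part of Dify or of my original code.

```typescript
// Hypothetical helper: builds the same request as the curl command above.
interface ChatRequest {
  url: string
  init: { method: string; headers: Record<string, string>; body: string }
}

function buildChatRequest(apiKey: string, query: string, user: string): ChatRequest {
  return {
    url: 'https://api.dify.ai/v1/chat-messages',
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        inputs: {},
        query,
        response_mode: 'streaming',
        conversation_id: '',
        user,
        files: [],
      }),
    },
  }
}

// Hypothetical helper: sends the request and hands every raw packet to onPacket.
// Assumes Node.js 18+, where fetch and web streams are available globally.
async function streamChat(
  apiKey: string,
  query: string,
  onPacket: (bytes: Uint8Array) => void,
): Promise<void> {
  const { url, init } = buildChatRequest(apiKey, query, 'abc-123')
  const response = await fetch(url, init)
  if (!response.ok || !response.body) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  const reader = response.body.getReader()
  for (;;) {
    const { done, value } = await reader.read()
    if (done) break
    if (value) onPacket(value)
  }
}
```

Each call to onPacket receives one raw packet, which may start or end in the middle of a line, so the packets still need the handling described in the rest of this article.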

Anatomy of streamed data

First, let's understand what kind of data Dify sends in the stream.

Streamed data format

Dify uses the following text format. It resembles JSON Lines, but it is not exactly the same: each JSON object is prefixed with "data: " and the lines are separated by blank lines.



data: {"event": "workflow_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "50100b30-e458-4632-ad7d-8dd383823376", "workflow_id": "debdb4fa-dcab-4233-9413-fd6d17b9e36a", "sequence_number": 334, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123"}, "created_at": 1724478014}}

data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": null, "created_at": 1724478013, "extras": {}}}

data: {"event": "node_finished", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "process_data": null, "outputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "status": "succeeded", "error": null, "elapsed_time": 0.001423838548362255, "execution_metadata": null, "created_at": 1724478013, "finished_at": 1724478013, "files": []}}

data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "89ed58ab-6157-499b-81b2-92b1336969a5", "node_id": "llm", "node_type": "llm", "title": "LLM", "index": 2, "predecessor_node_id": "1721365917005", "inputs": null, "created_at": 1724478013, "extras": {}}}

...

In the response, Dify pushes both the text answer and the audio data.

Example line of text answer



data: {"event": "message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "answer": "MP"}

Example line of audio data



data: {"event": "tts_message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "audio": "//PkxABhvDm0DVp4ACUUfvWc1CFlh0tR9Oh7LxzHRsGBuGx155x3JqTJiwKKZf8wIcxpMzJU0h4zhgyQwwwIsgWQMAALQMkanBTjfCPgZwFsDOGGIYJoJoJoJoPQPQLYEgAOwM4SMXMW8TcNWGrEPEME0HoIQTg0DQNA0C5k7IOLeJuDnDVi5nWyJwgghAagQwTQQgJAGrDVibiFhqw1YR8HOEjBUA5AcgagQwTQTQQgJAAtgLYKsQ8hZc0PV7OrE4SgQgFIAsAQAwA6H0Uv4t4m4m49Yt4uYOQHIBkAyAqAkAuB0Mm6UeKxDGRrIODkByBqBNBCA1ARwHIEgBVg5wkY41W2GgdEVDFBNe+HicQw0ydk7HrHrIWXM62d48ePNfCkNATcTcNWGrCRhqxDxcwMYBwBkByCGC4EILgoJTQUDeW8W8TcTchZ1qBWIYchOBbBCA1AhgSMJGGrFzLmh6fL+LeBkAyAZAcgSAXAhB0Kxnj4YDkJwXA6FAzwj8IIJoJoPQXA6EPOcg4R8FOBnCRljRAwlwoh4EUwLhFTCVA+MR0R8wyxOhgAwwDgJjBUABMM0hMxBgnTPtMrMBEEcwJQCzIXIdMZMG821DmjDKHJAwLDKHRMQsJkwbwVRoFs//PkxEx5dDnwAZ7wANHgEUFJHGCUCQp3LWCQQYGAATI5QzwHBJF4UFktpfATT2l0goAGNADLOU64HAMCQCK50szABAIkDS2/j8gl6l6Di7QgBEiAfMEADBnyZBgeAWCMK4xvBbhoRZj1M+ktsNMTrMNcHEwHQEzAjAHMGQAQwRQZTBHALMGMDkzhh2jGhLtMgsMMwfhOzCnGLMMcKgwOw8pqHMoGtvdDzos0AIAiXIsBAmGsRFtYcBABmB0AUYjQfhhDAfjoCrETAGArMOAJ4iAAMCMFkwXwh5fffuhpYMhyP2bl3MVAJQrSYQDsna7G2+fx/GvyAwUQbTAdAFCAHVKyIAduTXHZZXDjNS57/VeVJ5+JBJ+0kATkCSells8/NBt/2/5Dj1s+chDBYSINutNS9FQwDwBWHjgASKRgAAJOyYC4Ao0CMNAKBgB6KK1hYBkAAHROM9mLsknb8avTcB0MerV6jl7llE70egOerRh9WcP/FoHqtVsO/In2f+G2tsdnH+L/KSSvBQB4OATam27Yi4jiBgBFOpq15bTQU6k1G4LoWo1mMAwDQwlBEzEnKsMkA7c5JYuTOzK2MvAbEysSPTM+dOOn1XEzGgIzXzmPODVvs1cyNTJxQ9MsAWwy//PkxDlz7DIMAd7gAek5EwnjcjX9QVN1N0czFyijQKOmMi4IYw8RvzFvCHMHYBQwdQlTRxVNvm8ycGjLYlMTAQ=="}

We can distinguish the JSON lines that carry audio by checking the event property: audio lines have tts_message as the value. The MP3 binary is stored in the audio property as a base64-encoded string.
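As an illustration of that check, a complete line could be turned into audio bytes like this. It is a minimal sketch, assuming the line is already complete and uses the "data: " prefix shown above. parseAudioLine is a hypothetical helper name, not something provided by Dify.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: returns the decoded MP3 bytes of a tts_message line,
// or null for any other line (text answers, workflow events, blank lines).
function parseAudioLine(line: string): Buffer | null {
  const prefix = 'data: '
  if (!line.startsWith(prefix)) return null
  let event: { event?: string; audio?: string }
  try {
    event = JSON.parse(line.slice(prefix.length))
  } catch {
    return null // incomplete or non-JSON payload
  }
  if (event.event !== 'tts_message' || typeof event.audio !== 'string') return null
  return Buffer.from(event.audio, 'base64')
}

// The audio value in the example above starts with "//Pk", which decodes to
// the bytes 0xff 0xf3 0xe4: the first two bytes form an MP3 frame sync word.
const sampleAudio = parseAudioLine('data: {"event": "tts_message", "audio": "//Pk"}')
```

Lines with any other event, such as message or node_started, return null and can be handled separately for the text answer.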

Problems in handling data

The first problem we face when playing TTS audio in real time is that the JSON lines are split across packets, so a single packet is not necessarily valid JSON on its own.

Example of a packet that is cut in the middle



gMkhx2XCjT6Y0rKnDuvOnora378v6wGEMscxTGVK4ZLfbI+7cFjtUZxDCk3joo9En2RVbx1oIiz1VZYxKB2wq4pmSLWo55pbOoqtN0G2aY/LsNwomtvPH4M2zxBRpLsxKBJTIV6xF7IPaFQuq3CcZ/lDUQafC3mgavJHUWs7L+O8zuxIoahyH40TEFNRTMuMTAwqgGNTDg1JPDM5yHt0ZFFRiVTGYHgakOZhxJkgZMggAwCIxUTGFwQZQRRhIemGCABSONDpTQgEAIFxj8UmDhOYQAIMAgYaSQKmQwcXeBAYAAXEAKR8MIEABGIEBwyuFzQiVNXqcycmDT86Pug89ZUjiFYO6Oc2+BWXmEAqaDCRgUCGGA2Y7CgAEZMMgg1GDACCDwq3O9NNq+JiIOOBciCJyXYkWGCQjCmSOmVSFU2KGxxgYbMYBoacYBcpK+OM/OuxIngNUGJTg02CgJGVCxyfPr6FZIJGmmkBwQwxIxgQzgILC2X//PkxONtxDoABOafcMeL9NfW0rYzVsTJRAHVPD6hrLVnqxDJ4zpZFsVCg0ywkiWoUs6MADVREAIAki0xhwxeJYYrCpuLXb1ayPaFT4FeqU0lzVHUJZxJyqDqVo3kLOh0sE6Jc4oTjbk/LGfxuk7MpgOBmYISXTKcbDkVrMV5zohMIalUZJYoCkJrZVLSH1CPjrcz7OhCyxF9W2RKJKIT1A=="}

data: {"event": "tts_message", "conversation_id": "9ed2e63a-8527-41ff-851f-bf449e7f1096", "message_id": "706bf92a-eca4-4ec8-a04e-a54af25c8cca", "created_at": 1724491999, "task_id": "5f3ca6e2-b8bc-4cb7-946b-b5e0c1a85e99", "audio": "CWNnU8iypDSsX0myFoS4rzmeqmdtaHk4PJWJpIPUalRYjLJCh6iSBcnNXlOcJxsxdkPY4CoVTnHVq7TqEpqqMOhMQU1FMy4xMDCqqhKNPkjR+Ex2kM2MTDCcwcfAmod6hmLu5lhwZkkGBKphA0cQ3GAKxrVmaEhmrIhmaGZ6aDSUaYKBw2ZkImIggABC6xFVmcFBiwCKp4jBGBiIFGEwWYJMhl0qGlgAYLO5oiAmxcuaGnByxCGBfAaWdh06dmmWMZtVJnIfGDQYYMDZEIDDxhMAmAymZBkPhgOEQAAwsCggMNGM0QCjGoYMaisxohTIwnMDlYzGNhqKFj5FGLwGDDg4aY00bEcYMkY0AiiJHQxsb42YMmJGg5qVjwq+BVUKhDklRVOZRoc6EckebYSZGuaYCaUwAiaAIaHyKWp9PU7/8+TE9HH8OfgE3zUMkw9t3VLwKPo7oWpJKoegyTC0JiIoyZFUQQKL9GCIkpQKBxIxXGQQqFUAdpXyQ6QTdXrtv9bf5jbqpBO8uXbZV0vs/eEtbEqOGnZXDNrxVC5al8hhlIVmsWnnKaA9jIJM+MMxWA5Q8DswxSxunbC6sD0sCuA5Uom12td2qo+61VVYHYc4qche5qDOmhKtjzjPMqVOV0YZnGlTVuKqflkYak8F5/YLmGjMpvyN23tgW08zQoQ8yporCXVgmClh3UeyB387NsRcV2JEorHm5UagMxPQC8FMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVCVZncYm5NgeRQJt1kGKVgbuHRpBsmQhYYOBRh4jjQuMUBI0KOTC5EMII8xGXTEggBAXMOgYDIMOF7LTGZEL0DgCMKhwwCTjBxHMKAoxANggMF8AABgMEjFpIMil0z8ujsrQM+JM3N1z0ynM+gkx80jmakA2PM4hQkVRo8BlRCOozaMxKYnsVQafpEMLvmTEGWADaIWeB2JDw1LAtChEZoAEGjMjDjLzGCQ7CXjGlhiAZAkNSeIhQFFAIMXmMwk0CwgVFZ90kS6zLgIGVETefOwEzSBxK4lgWcECaCIvzEFXqZPGzJnT/8+TE5m5cOfQM5rNIgRVXhpQx6MVpN0dplrRXEQFOCAjUxcGtOqIgUmXKC4DrLpRxlkaUMcp1mAwMXhUqIglO3LWO1lMZyJp/2XwLStuwhTymDcGuwoSAk7XWcq3JuyBcsjhSlMPMiY22zOGlt2a09y6ELmUMqhp0l2KSZCzlrTGlN3CZa11uDDVi0zoslVgcZzXKeB+3rgXcNNduQbEIizl624P/R2k+3Bi89YkuXWBuPA0P1nISOcBOaTNcWAdZbDG1VmyqvnX3nJhaUMqrsRgZVKXM5bktR5tQHFYHYg=="}

The packet starts in the middle of a JSON line. We have to combine multiple packets to get valid JSON lines.

The second problem is that the audio chunk in a single JSON line is not valid audio data on its own: the data is cut in the middle of MP3 frames.

Implementation

To handle the split JSON and MP3 data, we need to process the stream in two steps.

First, we have to reassemble valid JSON lines while receiving packets. When a packet ends with \n, we can assume that the concatenation of the packets received so far is not cut in the middle of a line. The pseudocode looks like this.



let packets = []
stream.on('data', (bytes) => {
   const text = bytes.toString()
   packets.push(text)
   if (text.endsWith('\n')) {
      // Extract audio data from the packets.
      const audioChunks = extractAudioChunks(packets.join(''))
      // Clear the packet array
      packets = []
   }
})
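The pseudocode above waits for a packet that happens to end with \n. A variation, sketched here under my own assumptions rather than taken from the original code, splits on \n after every packet and keeps only the unfinished tail for the next one, so complete lines are released as early as possible. LineBuffer is a hypothetical class name. StringDecoder from Node.js is used so that a multi-byte character split across packets is not corrupted.

```typescript
import { Buffer } from 'node:buffer'
import { StringDecoder } from 'node:string_decoder'

// Hypothetical helper: turns arbitrary packets into complete lines.
class LineBuffer {
  private decoder = new StringDecoder('utf8')
  private pending = ''

  // Feed one packet; returns every non-blank line completed by it.
  push(bytes: Uint8Array): string[] {
    this.pending += this.decoder.write(Buffer.from(bytes))
    const lines = this.pending.split('\n')
    // The last element is either '' or an unfinished line: keep it for later.
    this.pending = lines.pop() ?? ''
    return lines.filter((line) => line.trim() !== '')
  }
}

// Usage: a line split over two packets comes out once it is complete.
const lineBuffer = new LineBuffer()
const fromFirstPacket = lineBuffer.push(Buffer.from('data: {"a"'))
const fromSecondPacket = lineBuffer.push(Buffer.from(': 1}\n\ndata: {"b"'))
```

After these two calls, fromFirstPacket is empty and fromSecondPacket holds the single line data: {"a": 1}, while the beginning of the next line stays buffered.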

Second, we have to split the audio chunks into MP3 frames. We concatenate the audio chunks into one binary buffer and find each MP3 frame in it.



const mp3Frames = []
// audioChunks: Buffers decoded from the base64 audio properties.
const binaryToProcess = Buffer.concat([...audioChunks])
let frameStartIndex = 0
for (let i = 0; i < binaryToProcess.length - 1; i += 1) {
  const currentByte = binaryToProcess[i]
  const nextByte = binaryToProcess[i + 1]
  // An MP3 frame header always starts with eleven 1 bits, so we check 2 bytes.
  // A frame begins where the current byte is 0xff and the top 3 bits of the next byte are 111.
  // MP3 specification:
  // http://www.mp3-tech.org/programmer/frame_header.html
  if (currentByte === 0xff && (nextByte & 0b11100000) === 0b11100000) {
    // Skip the empty slice produced when a header sits at the very start.
    if (i > frameStartIndex) {
      mp3Frames.push(binaryToProcess.subarray(frameStartIndex, i))
    }
    frameStartIndex = i
  }
}

This is not the full implementation of splitting into MP3 frames. In the actual process, the bytes after the last detected frame header remain unprocessed, so we have to keep that remainder and use it as the beginning of the audio bytes in the next iteration.
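One way to carry the remainder over is sketched below. Mp3FrameSplitter is a hypothetical class name and this is an illustration of the idea, not my production code. It uses the same two-byte header check, which can also match inside frame data, so a returned piece is not guaranteed to be exactly one frame. Because the pieces are returned in order, the byte stream stays intact from the first detected header on; only bytes before that first header are dropped.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: splits a growing MP3 byte stream at frame sync words
// and keeps the unfinished tail for the next call.
class Mp3FrameSplitter {
  private remainder: Buffer = Buffer.alloc(0)

  // Feed one decoded audio chunk; returns the pieces completed by it.
  push(chunk: Buffer): Buffer[] {
    const data = Buffer.concat([this.remainder, chunk])
    const frames: Buffer[] = []
    let frameStart = -1 // -1 until the first sync word is seen
    for (let i = 0; i < data.length - 1; i += 1) {
      const isSync = data[i] === 0xff && (data[i + 1] & 0b11100000) === 0b11100000
      if (!isSync) continue
      if (frameStart >= 0 && i > frameStart) frames.push(data.subarray(frameStart, i))
      frameStart = i
    }
    // Everything from the last sync word on may be incomplete: carry it over.
    this.remainder = frameStart >= 0 ? data.subarray(frameStart) : data
    return frames
  }

  // Call once the stream has ended to get whatever is left.
  flush(): Buffer {
    const rest = this.remainder
    this.remainder = Buffer.alloc(0)
    return rest
  }
}

// Usage with tiny fake data: the header 0xff 0xfb is split across two chunks.
const splitter = new Mp3FrameSplitter()
const firstFrames = splitter.push(Buffer.from([0x00, 0xff, 0xfb, 0x01, 0x02, 0xff]))
const secondFrames = splitter.push(Buffer.from([0xfb, 0x03, 0xff, 0xfb, 0x04]))
const lastFrame = splitter.flush()
```

In this example the first call returns nothing, the second call returns two pieces (4 bytes and 3 bytes), and flush returns the final 3 bytes.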

Play the frames

I used fluent-ffmpeg for decoding and speaker for playing the decoded PCM audio. To play the TTS audio immediately after it is received, I used streams to create a decoding-and-playing pipeline.



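// Default imports assume esModuleInterop is enabled.
import { Readable } from 'stream'
import ffmpeg from 'fluent-ffmpeg'
import Speaker from 'speaker'

// A Readable that is fed manually with push(), so _read() does nothing.
class Mp3FrameReadable extends Readable {
    _read(size: number) {}
}

const mp3FrameStream = new Mp3FrameReadable()
const speaker = new Speaker()
// Decode MP3 to 16-bit little-endian PCM (44.1 kHz, stereo) and pipe it to the speaker.
ffmpeg(mp3FrameStream)
    .audioFrequency(44100)
    .audioChannels(2)
    .format('s16le')
    .pipe(speaker)

// Push an MP3 frame immediately after it is extracted from packets.
mp3FrameStream.push(frame)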