Ryo Kuroyanagi

How to realize Real-Time Speech with Dify API

Dify is an open-source platform for building LLM workflows online. I use its API to build a conversational AI experience in my app. I struggled with receiving the TTS stream in the API response and playing it. Here I demonstrate how to process the audio stream and play it correctly in real time. If you only want the result, please check my code.

I'm using the API endpoint https://api.dify.ai/v1/chat-messages for text chat. It returns audio data in the same stream as the text response if the Text to Speech feature is enabled in the Dify app.

Press the ADD FEATURE button and add the Text to Speech feature.

You can check the response from the API with the following curl command.



curl -X POST 'https://api.dify.ai/v1/chat-messages' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": {},
    "query": "What are the specs of the iPhone 13 Pro Max?",
    "response_mode": "streaming",
    "conversation_id": "",
    "user": "abc-123",
    "files": []
}'



The examples are in TypeScript / JavaScript, but the same logic applies to any programming language.
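For reference, the curl request above could be issued from TypeScript roughly as follows. This is a minimal sketch, not the code from the repo: buildChatRequest and openChatStream are hypothetical helper names, and it assumes a runtime with a global fetch (a browser or Node.js 18+). The endpoint, headers, and body fields are copied from the curl command.

```typescript
// Endpoint taken from the curl example above.
const CHAT_MESSAGES_URL = 'https://api.dify.ai/v1/chat-messages'

// Hypothetical helper: builds the same request that the curl command sends.
function buildChatRequest(apiKey: string, query: string, user: string) {
  return {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      inputs: {},
      query,
      response_mode: 'streaming', // ask for a streamed response
      conversation_id: '',
      user,
      files: [],
    }),
  }
}

// Hypothetical helper: sends the request. The streamed bytes are available
// on response.body of the returned Response.
async function openChatStream(apiKey: string, query: string, user: string): Promise<Response> {
  const response = await fetch(CHAT_MESSAGES_URL, buildChatRequest(apiKey, query, user))
  if (!response.ok) {
    throw new Error(`Dify API returned HTTP ${response.status}`)
  }
  return response
}
```

Note that fetch exposes the body as a web ReadableStream, while the snippets later in this post use a Node.js-style stream with on('data'), so the reading code has to be adapted to whichever one you use.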

Anatomy of streamed data

First, let's look at the kind of data Dify sends in the stream.

Streamed data format

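Dify uses the following text format. It resembles JSON Lines but is not exactly the same: it is the Server-Sent Events format, where each JSON object is prefixed with data: and followed by a blank line.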



data: {"event": "workflow_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "50100b30-e458-4632-ad7d-8dd383823376", "workflow_id": "debdb4fa-dcab-4233-9413-fd6d17b9e36a", "sequence_number": 334, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123"}, "created_at": 1724478014}}

data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": null, "created_at": 1724478013, "extras": {}}}

data: {"event": "node_finished", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "process_data": null, "outputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "status": "succeeded", "error": null, "elapsed_time": 0.001423838548362255, "execution_metadata": null, "created_at": 1724478013, "finished_at": 1724478013, "files": []}}

data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "89ed58ab-6157-499b-81b2-92b1336969a5", "node_id": "llm", "node_type": "llm", "title": "LLM", "index": 2, "predecessor_node_id": "1721365917005", "inputs": null, "created_at": 1724478013, "extras": {}}}

...



In the response, Dify pushes both the text answer and the audio data.

Example line of text answer



data: {"event": "message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "answer": "MP"}



Example line of audio data



data: {"event": "tts_message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "audio": "//PkxABhvDm0DVp4ACUUfvWc1CFlh0tR9Oh7LxzHRsGBuGx155x3JqTJiwKKZf8wIcxpMzJU0h4zhgyQwwwIsgWQMAALQMkanBTjfCPgZwFsDOGGIYJoJoJoJoPQPQLYEgAOwM4SMXMW8TcNWGrEPEME0HoIQTg0DQNA0C5k7IOLeJuDnDVi5nWyJwgghAagQwTQQgJAGrDVibiFhqw1YR8HOEjBUA5AcgagQwTQTQQgJAAtgLYKsQ8hZc0PV7OrE4SgQgFIAsAQAwA6H0Uv4t4m4m49Yt4uYOQHIBkAyAqAkAuB0Mm6UeKxDGRrIODkByBqBNBCA1ARwHIEgBVg5wkY41W2GgdEVDFBNe+HicQw0ydk7HrHrIWXM62d48ePNfCkNATcTcNWGrCRhqxDxcwMYBwBkByCGC4EILgoJTQUDeW8W8TcTchZ1qBWIYchOBbBCA1AhgSMJGGrFzLmh6fL+LeBkAyAZAcgSAXAhB0Kxnj4YDkJwXA6FAzwj8IIJoJoPQXA6EPOcg4R8FOBnCRljRAwlwoh4EUwLhFTCVA+MR0R8wyxOhgAwwDgJjBUABMM0hMxBgnTPtMrMBEEcwJQCzIXIdMZMG821DmjDKHJAwLDKHRMQsJkwbwVRoFs//PkxEx5dDnwAZ7wANHgEUFJHGCUCQp3LWCQQYGAATI5QzwHBJF4UFktpfATT2l0goAGNADLOU64HAMCQCK50szABAIkDS2/j8gl6l6Di7QgBEiAfMEADBnyZBgeAWCMK4xvBbhoRZj1M+ktsNMTrMNcHEwHQEzAjAHMGQAQwRQZTBHALMGMDkzhh2jGhLtMgsMMwfhOzCnGLMMcKgwOw8pqHMoGtvdDzos0AIAiXIsBAmGsRFtYcBABmB0AUYjQfhhDAfjoCrETAGArMOAJ4iAAMCMFkwXwh5fffuhpYMhyP2bl3MVAJQrSYQDsna7G2+fx/GvyAwUQbTAdAFCAHVKyIAduTXHZZXDjNS57/VeVJ5+JBJ+0kATkCSells8/NBt/2/5Dj1s+chDBYSINutNS9FQwDwBWHjgASKRgAAJOyYC4Ao0CMNAKBgB6KK1hYBkAAHROM9mLsknb8avTcB0MerV6jl7llE70egOerRh9WcP/FoHqtVsO/In2f+G2tsdnH+L/KSSvBQB4OATam27Yi4jiBgBFOpq15bTQU6k1G4LoWo1mMAwDQwlBEzEnKsMkA7c5JYuTOzK2MvAbEysSPTM+dOOn1XEzGgIzXzmPODVvs1cyNTJxQ9MsAWwy//PkxDlz7DIMAd7gAek5EwnjcjX9QVN1N0czFyijQKOmMi4IYw8RvzFvCHMHYBQwdQlTRxVNvm8ycGjLYlMTAQ=="}



We can distinguish the JSON lines that carry audio by checking the event property: audio lines have tts_message as the value. The MP3 binary is stored, base64-encoded, in the audio property of those JSON objects.
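As an illustration, picking the audio out of a single complete line could look like the sketch below. decodeTtsLine is a hypothetical name, not a function from the repo. It assumes the line is complete (see the next section for why that is not guaranteed) and that the runtime has a global atob (browsers, Node.js 16+); in Node.js, Buffer.from(audio, 'base64') does the same job.

```typescript
// Hypothetical helper: returns the decoded MP3 bytes of a tts_message line,
// or null for any other line (text answers, workflow events, blank lines).
function decodeTtsLine(line: string): Uint8Array | null {
  const prefix = 'data:'
  if (!line.startsWith(prefix)) return null

  let payload: { event?: string; audio?: string }
  try {
    // JSON.parse tolerates the space that follows the "data:" prefix.
    payload = JSON.parse(line.slice(prefix.length))
  } catch {
    return null // not valid JSON, e.g. a line that was cut in the middle
  }

  if (payload.event !== 'tts_message' || typeof payload.audio !== 'string') {
    return null
  }

  // Decode base64 into raw bytes.
  const binary = atob(payload.audio)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i += 1) {
    bytes[i] = binary.charCodeAt(i)
  }
  return bytes
}
```

Returning null for everything that is not audio keeps the caller simple: it can feed every line through the same function and only act on non-null results.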

Problems in handling data

The first problem we face when playing TTS audio in real time is that the JSON lines are split across packets, so an individual packet is not necessarily valid JSON on its own.

Example packet which is cut in the middle



gMkhx2XCjT6Y0rKnDuvOnora378v6wGEMscxTGVK4ZLfbI+7cFjtUZxDCk3joo9En2RVbx1oIiz1VZYxKB2wq4pmSLWo55pbOoqtN0G2aY/LsNwomtvPH4M2zxBRpLsxKBJTIV6xF7IPaFQuq3CcZ/lDUQafC3mgavJHUWs7L+O8zuxIoahyH40TEFNRTMuMTAwqgGNTDg1JPDM5yHt0ZFFRiVTGYHgakOZhxJkgZMggAwCIxUTGFwQZQRRhIemGCABSONDpTQgEAIFxj8UmDhOYQAIMAgYaSQKmQwcXeBAYAAXEAKR8MIEABGIEBwyuFzQiVNXqcycmDT86Pug89ZUjiFYO6Oc2+BWXmEAqaDCRgUCGGA2Y7CgAEZMMgg1GDACCDwq3O9NNq+JiIOOBciCJyXYkWGCQjCmSOmVSFU2KGxxgYbMYBoacYBcpK+OM/OuxIngNUGJTg02CgJGVCxyfPr6FZIJGmmkBwQwxIxgQzgILC2X//PkxONtxDoABOafcMeL9NfW0rYzVsTJRAHVPD6hrLVnqxDJ4zpZFsVCg0ywkiWoUs6MADVREAIAki0xhwxeJYYrCpuLXb1ayPaFT4FeqU0lzVHUJZxJyqDqVo3kLOh0sE6Jc4oTjbk/LGfxuk7MpgOBmYISXTKcbDkVrMV5zohMIalUZJYoCkJrZVLSH1CPjrcz7OhCyxF9W2RKJKIT1A=="}

data: {"event": "tts_message", "conversation_id": "9ed2e63a-8527-41ff-851f-bf449e7f1096", "message_id": "706bf92a-eca4-4ec8-a04e-a54af25c8cca", "created_at": 1724491999, "task_id": "5f3ca6e2-b8bc-4cb7-946b-b5e0c1a85e99", "audio": "CWNnU8iypDSsX0myFoS4rzmeqmdtaHk4PJWJpIPUalRYjLJCh6iSBcnNXlOcJxsxdkPY4CoVTnHVq7TqEpqqMOhMQU1FMy4xMDCqqhKNPkjR+Ex2kM2MTDCcwcfAmod6hmLu5lhwZkkGBKphA0cQ3GAKxrVmaEhmrIhmaGZ6aDSUaYKBw2ZkImIggABC6xFVmcFBiwCKp4jBGBiIFGEwWYJMhl0qGlgAYLO5oiAmxcuaGnByxCGBfAaWdh06dmmWMZtVJnIfGDQYYMDZEIDDxhMAmAymZBkPhgOEQAAwsCggMNGM0QCjGoYMaisxohTIwnMDlYzGNhqKFj5FGLwGDDg4aY00bEcYMkY0AiiJHQxsb42YMmJGg5qVjwq+BVUKhDklRVOZRoc6EckebYSZGuaYCaUwAiaAIaHyKWp9PU7/8+TE9HH8OfgE3zUMkw9t3VLwKPo7oWpJKoegyTC0JiIoyZFUQQKL9GCIkpQKBxIxXGQQqFUAdpXyQ6QTdXrtv9bf5jbqpBO8uXbZV0vs/eEtbEqOGnZXDNrxVC5al8hhlIVmsWnnKaA9jIJM+MMxWA5Q8DswxSxunbC6sD0sCuA5Uom12td2qo+61VVYHYc4qche5qDOmhKtjzjPMqVOV0YZnGlTVuKqflkYak8F5/YLmGjMpvyN23tgW08zQoQ8yporCXVgmClh3UeyB387NsRcV2JEorHm5UagMxPQC8FMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVCVZncYm5NgeRQJt1kGKVgbuHRpBsmQhYYOBRh4jjQuMUBI0KOTC5EMII8xGXTEggBAXMOgYDIMOF7LTGZEL0DgCMKhwwCTjBxHMKAoxANggMF8AABgMEjFpIMil0z8ujsrQM+JM3N1z0ynM+gkx80jmakA2PM4hQkVRo8BlRCOozaMxKYnsVQafpEMLvmTEGWADaIWeB2JDw1LAtChEZoAEGjMjDjLzGCQ7CXjGlhiAZAkNSeIhQFFAIMXmMwk0CwgVFZ90kS6zLgIGVETefOwEzSBxK4lgWcECaCIvzEFXqZPGzJnT/8+TE5m5cOfQM5rNIgRVXhpQx6MVpN0dplrRXEQFOCAjUxcGtOqIgUmXKC4DrLpRxlkaUMcp1mAwMXhUqIglO3LWO1lMZyJp/2XwLStuwhTymDcGuwoSAk7XWcq3JuyBcsjhSlMPMiY22zOGlt2a09y6ELmUMqhp0l2KSZCzlrTGlN3CZa11uDDVi0zoslVgcZzXKeB+3rgXcNNduQbEIizl624P/R2k+3Bi89YkuXWBuPA0P1nISOcBOaTNcWAdZbDG1VmyqvnX3nJhaUMqrsRgZVKXM5bktR5tQHFYHYg=="}



This packet starts in the middle of a JSON line. We have to combine multiple packets to get valid JSON lines.

The second problem is that the audio chunk in a single JSON line is not valid audio data by itself. The data is cut in the middle of MP3 frames.

Implementation

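To handle the split JSON and MP3 data, we have to process the stream in stages. The flow is as follows: combine the received packets into complete JSON lines, extract the audio chunks from the tts_message lines, split the combined audio binary into MP3 frames, then decode the frames and play them.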

First, we have to recover valid JSON lines while receiving packets. When a packet ends with \n, the concatenation of the packets received so far is not cut in the middle of a line, so it can be split into JSON lines and parsed. The pseudocode looks like this.



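// `stream` is the API response as a Node.js-style readable stream.
let packets: string[] = []
stream.on('data', (bytes: Buffer) => {
   const text = bytes.toString()
   packets.push(text)
   // A trailing newline means the buffered packets end on a line boundary.
   if (text.endsWith('\n')) {
      // Extract audio data from the complete JSON lines.
      const audioChunks = extractAudioChunks(packets.join(''))
      // Clear the packet array for the next batch.
      packets = []
      // audioChunks is then split into mp3 frames (next step).
   }
})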
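The newline check above only releases data when a packet happens to end on a line boundary. A possible variation, sketched below, splits on every newline and carries the unfinished tail over to the next packet, so complete lines are available as soon as they arrive. LineBuffer is a hypothetical name, not part of the repo, and the sketch assumes each packet has already been converted to text.

```typescript
// Hypothetical helper: accumulates packets and returns every complete line.
class LineBuffer {
  // Text received so far that does not yet end with a newline.
  private pending = ''

  push(text: string): string[] {
    this.pending += text
    const parts = this.pending.split('\n')
    // The last element is whatever follows the final newline: an unfinished
    // line, or an empty string if the packet ended exactly on a newline.
    this.pending = parts.pop() ?? ''
    // Drop the blank separator lines between events.
    return parts.filter((line) => line.trim() !== '')
  }
}

const lineBuffer = new LineBuffer()
// A packet that stops in the middle of a line yields nothing yet.
const first = lineBuffer.push('data: {"event": "mess')
// The next packet completes that line and starts another one.
const second = lineBuffer.push('age"}\n\ndata: {"event": "tts_')
```

Each returned line can then be parsed as JSON on its own, without waiting for a packet that ends with a newline.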



Second, we have to split the audio chunks into MP3 frames. We concatenate the audio chunks into one binary and find each MP3 frame in it.



const mp3Frames: Buffer[] = []
const binaryToProcess = Buffer.concat([...audioChunks])
let frameStartIndex = 0
for (let i = 0; i < binaryToProcess.length - 1; i += 1) {
  const currentByte = binaryToProcess[i]
  const nextByte = binaryToProcess[i + 1]
  // An MP3 frame header always starts with eleven 1 bits, so we check 2 bytes:
  // a frame begins where the current byte is 0xff and the top three bits of the next byte are 111.
  // MP3 specification:
  // http://www.mp3-tech.org/programmer/frame_header.html
  if (currentByte === 0xff && (nextByte & 0b11100000) === 0b11100000) {
    // Skip the empty slice produced when the binary itself starts with a header.
    if (i > frameStartIndex) {
      mp3Frames.push(binaryToProcess.subarray(frameStartIndex, i))
    }
    frameStartIndex = i
  }
}
// Bytes from frameStartIndex to the end are the remainder (a possibly incomplete frame).



This is not the full implementation of splitting into MP3 frames. In the actual process, some bytes are left over after the last frame header found in the audio binary. This remainder has to be kept and used as the beginning of the audio bytes in the next iteration.
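The remainder handling described above could look roughly like this. It is a minimal sketch, not the implementation from the repo: Mp3FrameSplitter and its methods are hypothetical names, and it uses a plain Uint8Array instead of Buffer so that it also runs outside Node.js. It keeps the same header scan as the snippet above, so it shares its limitation: a byte pair inside frame data that happens to look like a header is also treated as a boundary.

```typescript
// Hypothetical helper: splits incoming audio bytes at MP3 frame headers and
// carries the unfinished tail over to the next call.
class Mp3FrameSplitter {
  // Bytes after the last header found so far (a possibly incomplete frame).
  private remainder = new Uint8Array(0)

  push(chunk: Uint8Array): Uint8Array[] {
    // Prepend the remainder of the previous call to the new chunk.
    const data = new Uint8Array(this.remainder.length + chunk.length)
    data.set(this.remainder, 0)
    data.set(chunk, this.remainder.length)

    const frames: Uint8Array[] = []
    let frameStart = 0
    // Stop one byte early: the header check needs two bytes. An unchecked
    // trailing byte stays in the remainder and is checked on the next call.
    for (let i = 0; i < data.length - 1; i += 1) {
      if (data[i] === 0xff && (data[i + 1] & 0xe0) === 0xe0) {
        if (i > frameStart) {
          frames.push(data.subarray(frameStart, i))
        }
        frameStart = i
      }
    }
    // Keep a copy of everything after the last header for the next call.
    this.remainder = data.slice(frameStart)
    return frames
  }

  // Call once when the stream ends to get the final buffered bytes.
  flush(): Uint8Array {
    const rest = this.remainder
    this.remainder = new Uint8Array(0)
    return rest
  }
}
```

A frame is only emitted once the header of the following frame has been seen, which is why flush is needed at the end of the stream to release the last one.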

Play the frames

I used fluent-ffmpeg for decoding and speaker for playing the decoded PCM audio. To play the TTS audio immediately after it is received, I used streams to build the decoding-playing pipeline.



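import { Readable } from 'stream'
import ffmpeg from 'fluent-ffmpeg'
import Speaker from 'speaker'

// A push-only Readable: data is supplied through push(), so _read is a no-op.
class Mp3FrameReadable extends Readable {
    _read(size: number) {}
}

const mp3FrameStream = new Mp3FrameReadable()
const speaker = new Speaker()
// Decode the MP3 stream to 16-bit little-endian PCM (44.1 kHz, stereo) and play it.
ffmpeg(mp3FrameStream)
    .audioFrequency(44100)
    .audioChannels(2)
    .format('s16le')
    .pipe(speaker)

// Push an mp3 frame immediately after it is extracted from packets.
mp3FrameStream.push(frame)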



Please check my GitHub repo for the full implementation. Hope this helps.
