Works AI_Makoto

Posted on Nov 24, 2024

AI Generation Workflow with Azure Speech Service and Dify: Full Automation of Podcast and Image Creation

#dify #llm #azure #tts

Introduction

Creative work with AI is becoming increasingly accessible. For this project, we built a Dify workflow that automatically generates audio podcasts and images for children. It writes a story from the user's input, then outputs that story as an audio file and a matching image.

This workflow is based on a post by Dify.AI Japan on X:
https://x.com/DifyJapan/status/1859577258442363124

In this article, we will cover the following topics:

An explanation of the workflow built with Dify
Prompt design tailored for creating children's content

Overview of the Workflow

The workflow we created is structured as shown in the image below.
The ChatFlow created in this project is saved in the following repository:
https://github.com/aimakotoworks/TextToSpeech_Chatflow

Key Components:

LLM (Large Language Model)
Based on the input text, the LLM generates a dialogue-style story. For example, if you provide the request, "A story about a boy and Santa Claus," it will create a narrative about a boy and Santa Claus.
HTTP Request (Azure Speech Services)
The generated story is sent to Azure Speech Services through an HTTP request and converted into high-quality audio. The Speech resource, along with the endpoint and key the request needs, can be created and looked up in the Azure Portal. Supported regions are listed here:
https://learn.microsoft.com/en-us/azure/ai-services/Speech-Service/regions
Setting Up Azure TTS
Creating a Resource
Log in to Azure Portal, select "AI + Machine Learning" and then "Speech" to create a resource. During creation, you can select the region and pricing tier.

Retrieving Endpoints and Keys
After creating the resource, open its "Keys and Endpoint" page to obtain the connection information. The key and endpoint are required for calling Azure Speech Services via HTTP requests.

Choosing a Voice Model
Azure TTS offers multiple voice models, such as "en-US-JennyNeural" and "en-US-AriaNeural". These models provide natural pronunciation and a wide range of expressive capabilities, allowing you to choose the most suitable voice for your use case.
DALL-E 3
Related images are generated based on the story. For instance, it can visualize scenes such as "Santa Claus and a boy."
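
For the HTTP Request component, the call to Azure Speech Services can be sketched in Python as follows. This is a minimal sketch, not the exact configuration of the Dify node: the function name, the "eastus" region, the dummy key, and the MP3 output format are assumptions chosen for illustration. The request is only built here, not sent.

```python
import urllib.request


def build_tts_request(
    ssml: str,
    region: str,
    key: str,
    output_format: str = "audio-16khz-128kbitrate-mono-mp3",  # assumed format
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the Speech text-to-speech REST endpoint."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # key from the Azure Portal
        "Content-Type": "application/ssml+xml",    # the body is SSML, not plain text
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "dify-podcast-sketch",       # hypothetical client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own region and key.
    req = build_tts_request("<speak/>", region="eastus", key="dummy-key")
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it, and the response body would be the audio bytes. In Dify, the same URL, headers, and SSML body would go into the corresponding fields of the HTTP Request node.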

Prompt Design Highlights

For this project, we utilized OpenAI's Playground Prompt Generator to set up detailed guidelines.
https://platform.openai.com/docs/guides/prompt-generation
Based on these instructions, we generated a dialogue-style podcast script tailored for children.

Overview of the Prompt

Objective:

To create a podcast for children using Azure's Text-To-Speech (TTS) service. The script is designed to be fun and approachable, ensuring children can learn while enjoying engaging, conversational content.

Writing Style Features:

Playful and warm tone: Modeled as a friendly dialogue between characters.
Balance of fun and learning: Topics are chosen to spark children's interest (e.g., science, nature, math, daily habits).
Simple language: Content is adapted to suit preschoolers and early elementary school children.

Key Points in Prompt Design

XML Structure
Used SSML (Speech Synthesis Markup Language) to ensure smooth integration with Azure TTS.
The `<speak>` tag serves as the root element, while `<voice>` tags distinguish dialogue for each character.
Character Setup
Jenny (Narrator): The primary storyteller.
Aria (Listener): Reacts with questions and expressions of excitement, adding rhythm to the conversation.
Azure TTS voices "en-US-JennyNeural" and "en-US-AriaNeural" are used to create a clear distinction between characters' voices.
Topic Progression
Introduction: Clearly present a new topic to pique children's interest.
Dialogue Format: Jenny explains the topic, while Aria reacts with curiosity and enthusiasm to maintain a smooth conversational flow.
Conclusion: Summarize the lesson with a positive message to encourage children.
Error Prevention
Ensure SSML XML structure is correctly formatted to avoid issues such as missing or invalid tags.
Use syntax that is compatible with Azure TTS to avoid read errors.
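
The error-prevention points can also be checked in code before the script is sent to Azure TTS. The following is a minimal sketch using Python's standard XML parser. The function name and the specific checks are assumptions for illustration, and passing them does not guarantee that Azure will accept the document.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def check_ssml(ssml: str) -> list[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        # Missing or mismatched tags end up here.
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append(f"root element is {root.tag!r}, expected <speak> in the SSML namespace")

    voices = [child for child in root if child.tag == f"{{{SSML_NS}}}voice"]
    if not voices:
        problems.append("no <voice> elements found directly under the root")
    for index, voice in enumerate(voices, start=1):
        if not voice.get("name"):
            problems.append(f"<voice> number {index} has no name attribute")
        if not "".join(voice.itertext()).strip():
            problems.append(f"<voice> number {index} contains no text")
    return problems
```

An empty result means the script is well-formed, has a namespaced speak root, and every voice element has a name and some text.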

Prompt Example
Below is a sample prompt actually used in this workflow:

Create a script for a podcast targeted at young children using Azure's Text-To-Speech (TTS) service. The writing style should be playful, warm, and engaging, similar to a dialogue between friendly characters. Use the specified XML format so that different "voices" can be assigned to the two characters in the podcast.

# Instructions

- Maintain a cheerful, conversational tone, avoiding complex words. Assume young children as your target audience.
- The script should use an alternating voice format between at least two characters. Be sure to guide children into learning while maintaining a playful and curious attitude.
- Each part of the dialogue must be enclosed in `<voice></voice>` tags, clearly identifying the speaker.
- Consider using two 'voices' provided by Azure TTS, such as "en-US-JennyNeural" and "en-US-AriaNeural."
  - Jenny is the primary narrator, while Aria asks questions and expresses excitement.
- Use the `<speak>` XML tag as the root element and ensure it’s well-formatted for TTS use.
- Explore fun topics like science, nature, mathematics, or daily habits in a sequence that makes learning fun.

# Steps

1. **Topic Introduction**: Introduce the new topic in a friendly and engaging way.
2. **Dialogue Sequence**: Alternate between characters:
    - The main character (Jenny) introduces information.
    - The secondary character (Aria) reacts, asks simple questions, or expresses curiosity.
3. **Wrap-Up & Encouragement**: End with a positive summary or encouraging note for children.

# Output Format

- Use SSML format with the `<speak>` and `<voice>` tags.
- Include alternate dialogue sections with at least two characters.
- Write complete and well-formed XML.

## Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">

<voice name="en-US-JennyNeural">
    Hello, my little friends! Today, we are going on an adventure to learn all about one of the most magical creatures of the sea: the dolphin!
</voice>

<voice name="en-US-AriaNeural">
    Oooh, dolphins! I always wanted to learn about dolphins! Do they really love to jump out of the water?
</voice>

<voice name="en-US-JennyNeural">
    That's right, Aria! Dolphins love to leap out of the water. It helps them move faster and also looks like they’re having so much fun! They’re super strong swimmers—just like you when you try your best in the pool!
</voice>

<voice name="en-US-AriaNeural">
    Wow, that's so cool! How do dolphins talk to each other?
</voice>

<voice name="en-US-JennyNeural">
    Great question, Aria! Dolphins talk to each other using sounds like clicks, whistles, and even special calls. It's almost like they have their own secret language!
</voice>

<voice name="en-US-AriaNeural">
    Dolphins are so amazing! I want to learn their secret language too!
</voice>

<voice name="en-US-JennyNeural">
    Maybe one day! For now, it's just fun to know that every dolphin family has its very own way of communicating, just like we do with our families. Isn’t that amazing?
</voice>

<voice name="en-US-AriaNeural">
    Yes, it is! I love learning new things with you, Jenny.
</voice>

<voice name="en-US-JennyNeural">
    And I love sharing it with you, Aria! And remember, friends, keep being curious and asking questions. There's so much out there to discover!
</voice>

</speak>

# Notes

- Be sure to alternate voices for an engaging back-and-forth dialogue.
- Maintain an overall length for each script that would produce a 2-3 minute long audio.
- You may add other relevant voices if needed to enhance the storytelling, but always ensure they are in service to the topic’s narrative.
- The SSML structure should be correctly formed, as invalid XML will lead to errors during the TTS generation.
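
Outside the prompt itself, two of the notes above (alternating voices and a 2-3 minute length) can be checked roughly on the generated script. This is a sketch under stated assumptions: the function names are hypothetical, and the 150 words-per-minute speaking rate is an assumed figure, not an Azure value, so the duration is only an estimate.

```python
import xml.etree.ElementTree as ET

SSML_NS = "{http://www.w3.org/2001/10/synthesis}"
ASSUMED_WORDS_PER_MINUTE = 150  # assumption; real speed depends on voice and content


def speaker_sequence(ssml: str) -> list[str]:
    """Return the voice names in the order they speak."""
    root = ET.fromstring(ssml)
    return [voice.get("name") for voice in root.iter(f"{SSML_NS}voice")]


def voices_alternate(ssml: str) -> bool:
    """True when no voice speaks twice in a row."""
    names = speaker_sequence(ssml)
    return all(a != b for a, b in zip(names, names[1:]))


def estimated_minutes(ssml: str, words_per_minute: int = ASSUMED_WORDS_PER_MINUTE) -> float:
    """Estimate audio length from the number of spoken words."""
    root = ET.fromstring(ssml)
    word_count = len(" ".join(root.itertext()).split())
    return word_count / words_per_minute
```

A script whose estimate falls well outside 2 to 3 minutes, or whose voices do not alternate, could be sent back to the LLM for another attempt.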

The prompt for DALL-E 3 was crafted from the following guidelines, which are designed to produce cute, approachable illustrations for children.

Key aspects of these guidelines include:

Overall Tone:

Creating a soft and gentle atmosphere that provides warmth and reassurance to children.

Specific Visual Elements:

Including detailed instructions about the characters, scenes, and colors to be depicted.

Simplicity and Friendliness:

Avoiding complexity or anything frightening, while emphasizing cuteness and a heartwarming impression.

Additionally, the prompts were constructed following these steps:

Summary of Key Elements:
Concisely outline the main subject, characters, settings (location and background), and actions of the illustration.
Selection of Friendly Language:
Use clear and specific descriptions of charming elements such as "fluffy animals," "smiling children," or "soft pastel colors."
Scene Setting:
Suggest imaginative settings that excite children, like a "storybook village" or a "magical cloud world."
Character Descriptions:
Provide detailed descriptions of the characters' expressions and features to enhance their approachability.
Adjusting Sensory Details:
Use expressions like "fluffy," "cozy," or "playful environment" to evoke a sense of softness and warmth.

The prompt built from these guidelines is as follows:

Create a prompt for DALLE 3 to generate an image based on a text description, emphasizing that the illustration should be cute, whimsical, and child-friendly.

Keep the following in mind:

- Imagery should be soft, approachable, and appropriate for children.
- Include specific visual elements, colors, characters, and settings that evoke warmth and friendliness.
- Refrain from anything that could appear frightening or overly complex.

# Steps 

1. Summarize the main components of the image to include key subjects, actions, and themes.
2. Use gentle, friendly language that specifies 'cute' elements, such as "cuddly animals," "smiling children," or "soft pastel colors." 
3. Add a setting that feels wholesome or magical, such as a "grassy meadow," "storybook village," or "fanciful cloud world."
4. Include friendly character descriptions, specifying expressions and playful features.
5. Adjust sensory details to evoke softness or gentleness (e.g., "fluffy," "cozy," "playful environment").

# Output Format

"Create an illustration of [main subject doing specific action] in a [setting description]. The illustration should be cute and child-friendly, with [light colors or specific tones]. The characters should be [appearance descriptors], and the scene should evoke feelings of [joy, warmth, safety, etc.]."

# Examples

**User Input**: "A dragon in the forest"

**DALLE Prompt**: "Create an illustration of a small, friendly dragon taking a nap in a sunlit forest clearing. The scene should be cute and child-friendly, with soft green colors, dappled sunlight, and playful animals like squirrels and birds nearby. The dragon should look cuddly, with round features and a gentle expression, evoking a sense of warmth and calm."

**User Input**: "A princess with a cat"

**DALLE Prompt**: "Create an illustration of a young princess sitting in a lovely garden, petting a smiling cat. The scene should be whimsical and child-friendly, with bright, cheerful colors like pinks, yellows, and greens. The princess should have a friendly expression, while the cat should look fluffy and comfortable, creating a sense of happiness and playfulness." 

**User Input**: "Space adventure"

**DALLE Prompt**: "Create an illustration of a young astronaut floating in space next to a friendly alien. They should both be smiling and waving towards each other. The scene should be cute and child-friendly, with bright stars and gentle pastel colors, and the spaceship in the background should be playful, resembling something from a child's dream."

# Notes

- The goal of the prompt is to ensure the AI generates images that are suitable for young children, avoiding sharp details, negative emotions, or anything that might be scary or unsettling.
- Focus on positive, nurturing interactions between characters.
- Colors should generally be bright, pastel, or soft, reinforcing a child-friendly tone.
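
The Output Format section of the prompt above is essentially a fill-in-the-blanks template. In the workflow the LLM fills it in, but the slots can be made explicit with a small sketch. The class and field names here are hypothetical and chosen only to mirror the bracketed placeholders.

```python
from dataclasses import dataclass


@dataclass
class ImageBrief:
    """One field per bracketed placeholder in the Output Format template."""
    subject_action: str  # [main subject doing specific action]
    setting: str         # [setting description]
    colors: str          # [light colors or specific tones]
    appearance: str      # [appearance descriptors]
    feelings: str        # [joy, warmth, safety, etc.]


def to_prompt(brief: ImageBrief) -> str:
    """Fill the template with the values from the brief."""
    return (
        f"Create an illustration of {brief.subject_action} in a {brief.setting}. "
        f"The illustration should be cute and child-friendly, with {brief.colors}. "
        f"The characters should be {brief.appearance}, "
        f"and the scene should evoke feelings of {brief.feelings}."
    )


if __name__ == "__main__":
    print(to_prompt(ImageBrief(
        subject_action="a small, friendly dragon taking a nap",
        setting="sunlit forest clearing",
        colors="soft green colors and dappled sunlight",
        appearance="cuddly, with round features and gentle expressions",
        feelings="warmth and calm",
    )))
```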

How to Use

Simply enter a request like the following, and the workflow will generate a story for you, along with its audio and image:

Example Input:

Generated Output:

Audio:
A touching story about a boy meeting Santa Claus and growing through their adventures together.
Image:
A scene of Santa and the boy flying through the night sky.

This process works well for creating children's podcasts or storybooks.
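
The article runs the ChatFlow from the Dify interface. If the app were published and called through Dify's app API instead, the request might be built as in the sketch below. The base URL (Dify cloud), the placeholder API key, and the user identifier are assumptions, and the request is constructed without being sent.

```python
import json
import urllib.request


def build_chat_request(
    query: str,
    api_key: str,
    base_url: str = "https://api.dify.ai/v1",  # assumed: Dify cloud; self-hosted URLs differ
    user: str = "demo-user",                   # hypothetical end-user identifier
) -> urllib.request.Request:
    """Build (but do not send) a blocking chat-messages request for a Dify chat app."""
    body = {
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        "user": user,
    }
    return urllib.request.Request(
        f"{base_url}/chat-messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("A story about a boy and Santa Claus", api_key="app-dummy")
    print(req.get_method(), req.full_url)
```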

Conclusion

This workflow lets anyone create stories and content for children with little effort. We used Azure's TTS (Text-To-Speech) service for its high-quality audio generation, but newer speech services such as ElevenLabs could be integrated instead for further improvement.
https://elevenlabs.io/
This flexibility enables users to select the best audio service based on their specific needs.

This workflow was used to create content for children, but the same approach extends to other uses. Because Dify also supports file uploads, training materials or documents could be converted into podcast-style audio, making it an effective tool for learning during commutes or study sessions. It could also be used for promotional purposes, such as embedding corporate advertisements on websites.
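
When document text rather than an LLM-written script is turned into SSML, characters such as "&" and "<" would break the XML unless they are escaped. The following is a minimal sketch of building a two-voice script in code with escaping applied. The function name and the list-of-turns input shape are assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(turns: list[tuple[str, str]], lang: str = "en-US") -> str:
    """Wrap (voice_name, text) pairs in <speak>/<voice> tags, escaping the text."""
    lines = [
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
    ]
    for voice_name, text in turns:
        # escape() converts &, < and > so the result stays well-formed XML.
        lines.append(f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>")
    lines.append("</speak>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_ssml([
        ("en-US-JennyNeural", "Today's chapter covers research & development."),
        ("en-US-AriaNeural", "Is that the part where 2 < 3 matters?"),
    ]))
```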

By harnessing AI technology, new creative possibilities emerge. We hope this workflow serves as inspiration for your creative projects, daily life, or even business applications. Give it a try and unlock its potential!
