DEV Community: Hatem Elseidy

Part 5 - Video Compiler

Hatem Elseidy — Sun, 08 Oct 2023 15:41:17 +0000

Introduction

In the previous parts we generated and processed all of the story content: text, images and audio. In this part, we will use royalty-free background music from Pixabay and a Python library called moviepy to compile the final video.

High Level Overview

What we have now is a list of image files, each representing a page, and a list of audio files, each corresponding to a page. We want to go from that to a compiled video. To do so, we have to picture the video in our head first, then figure out how to implement it using moviepy.

As you may already know, a video is a sequence of frames stuck together, plus audio. So, what we need to do is set each page image as the current frame of the video and that page's audio as the audio of the video. Once the audio is done (the narrator finishes reading the sentence), we move to the next frame and the next audio, and so on.

After we finish with all frames and their audio, we blend in the background music with the full audio.

Image Clips

moviepy offers a simple interface called ImageClip. Given an image, we can create a video clip that shows this image for X seconds. We want that number of seconds to equal the audio length of that page. We can also add an AUDIO_GAP of 0.5 seconds so that there's a small pause between pages (you can play with that to make the gap longer or shorter).

The below code shows an example of creating an image clip called page_clip and adding it to a list of page_clips.

page_clip = mpy.ImageClip(page.page_filepath).set_duration(
    page.audio.length_in_seconds + self._AUDIO_GAP
)
page_clips.append(page_clip)

We can then concatenate these image clips (page_clips) into a single clip using concatenate_videoclips with method='compose'.

clip = mpy.concatenate_videoclips(page_clips, method="compose")

Audio Clips

Similarly, we need to create a list of audio clips from our audio files. This time we don't need to set the duration, but we do need to set when each audio clip should start. So, we start with a current_start pointer at 0.0 and advance it as we go.

current_start = 0.0

# for each page
audio_clip = mpy.AudioFileClip(story.pages[i].audio.mp3_file).set_start(
    current_start
)
audio_clips.append(audio_clip)

# keep track of the current length
current_start += page.audio.length_in_seconds + self._AUDIO_GAP
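
To make the timing concrete, here is a minimal sketch of the same bookkeeping without moviepy. The helper name compute_start_times and the sample durations are made up for illustration; the only idea taken from the code above is that each page occupies its audio length plus the gap.

```python
from typing import List

AUDIO_GAP = 0.5  # seconds of silence after each page (same value as in the text)


def compute_start_times(durations: List[float], gap: float = AUDIO_GAP) -> List[float]:
    """Return the start time of each page's audio on the final timeline."""
    start_times = []
    current_start = 0.0
    for duration in durations:
        start_times.append(current_start)
        # the page stays on screen for its audio length plus the gap
        current_start += duration + gap
    return start_times


# three pages whose narration lasts 2.0, 3.0 and 1.5 seconds
print(compute_start_times([2.0, 3.0, 1.5]))  # [0.0, 2.5, 6.0]
```

Because the image clips use exactly the same duration (audio length plus gap), each page image changes at the moment its narration starts.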

We can then combine them using CompositeAudioClip and set the result as the audio of the concatenated image clip from the previous section.

full_audio_clip = mpy.CompositeAudioClip(audio_clips)

# clip = mpy.concatenate_videoclips(page_clips, method="compose")
clip.audio = full_audio_clip

Background Music

I opted for a randomized approach that picks a random background music file from a predefined folder. The method below lists all music files under BACKGROUND_MUSIC_PATH and picks one at random.

def _get_background_music_filename(self) -> str:
    """Returns: A path to a mp3 file to be used as a background music."""
    music_files: List[str] = [
        join(self._BACKGROUND_MUSIC_PATH, filename)
        for filename in listdir(self._BACKGROUND_MUSIC_PATH)
        if isfile(join(self._BACKGROUND_MUSIC_PATH, filename))
        and filename.endswith(self._BACKGROUND_MUSIC_EXT)
    ]
    return random.choice(music_files)
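
One thing to keep in mind is that random.choice raises an IndexError when the list is empty, for example if the music folder has no matching files. The sketch below is a standalone variation with an explicit check; the function name pick_background_music and the demo file names are assumptions for illustration, not part of the project code.

```python
import random
import tempfile
from pathlib import Path


def pick_background_music(folder: str, extension: str = ".mp3") -> str:
    """Pick a random music file from folder, failing loudly if there is none."""
    music_files = sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.name.endswith(extension)
    )
    if not music_files:
        raise FileNotFoundError(f"No {extension} files found in {folder}")
    return random.choice(music_files)


# demo with a throwaway folder containing two fake tracks and one unrelated file
with tempfile.TemporaryDirectory() as folder:
    for name in ("calm.mp3", "happy.mp3", "notes.txt"):
        Path(folder, name).touch()
    print(Path(pick_background_music(folder)).name)  # calm.mp3 or happy.mp3
```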

To add that background music, we first need to read the music file as an AudioFileClip. Then, we reduce its volume, loop it to cover the whole video and combine it with the existing audio.

# read background music file and reduce its volume
background_music = mpy.AudioFileClip(background_music_filepath).fx(
    volumex, self._BACKGROUND_MUSIC_VOLUME_FACTOR
)

# loop the music so it covers the full video duration
background_music = afx.audio_loop(
    background_music, duration=video_clip.duration
)

# combine it with existing audio
video_clip.audio = mpy.CompositeAudioClip([video_clip.audio, background_music])

Finally, we save the video, by calling clip.write_videofile.

clip_filepath = os.path.join(workdir, self._FILENAME)
clip.write_videofile(clip_filepath, fps=self._FPS)

Bring it all together

That's the cherry on the cake. Given a Story object, this method returns the filepath of the compiled video.

def generate_video(self, workdir: str, story: Story) -> str:
    """Create a video for the given story

    Args:
        workdir: The root workdir for the story to generate video for
        story: The story object that contains all details about the story

    Returns: A local filepath for where the created video is stored
    """
    page_clips = []
    audio_clips = []
    current_start = 0.0

    # Build image and audio clips
    for i in range(len(story.pages)):
        page = story.pages[i]
        page_clip = mpy.ImageClip(page.page_filepath).set_duration(
            page.audio.length_in_seconds + self._AUDIO_GAP
        )
        page_clips.append(page_clip)
        audio_clip = mpy.AudioFileClip(story.pages[i].audio.mp3_file).set_start(
            current_start
        )
        audio_clips.append(audio_clip)
        # keep track of the current length
        current_start += page.audio.length_in_seconds + self._AUDIO_GAP

    # Combine image and audio clips
    clip = mpy.concatenate_videoclips(page_clips, method="compose")
    full_audio_clip = mpy.CompositeAudioClip(audio_clips)
    full_audio_clip.write_audiofile(
        os.path.join(workdir, "final_audio.wav"), fps=44100
    )

    clip.audio = full_audio_clip

    # Add background music and save
    self._add_background_music(clip)
    clip_filepath = os.path.join(workdir, self._FILENAME)
    clip.write_videofile(clip_filepath, fps=self._FPS)
    return clip_filepath

Conclusion

In this 5-part series, we went step by step through building a fully fledged, automated AI visual story generator. We covered text and image generation, text processing, image processing, text to speech and video processing.

Thanks for reading :)

Part 4 - Page Processor

Hatem Elseidy — Sun, 08 Oct 2023 15:00:51 +0000

Introduction

In this part of the series, we will discuss the Page Processor (link). Below are 2 examples from a story called "The Dark Forest", with the raw input of the page processor on the left and the expected output on the right. As you can see, one time we put the text on the right and another time on the left. This is a simple alternating logic that we will discuss here as well.

Text Image

Let's start with writing text to an image. Here's a simple method that will draw white text on black background image with the default font.

from PIL import Image, ImageDraw

def text_image():
    # input parameters
    text_message = "This is the fantastic story line"
    width, height = 256, 64    # image size
    bg_color = (0, 0, 0)    # black color
    font_color = (255, 255, 255)    # white color (each channel maxes out at 255)

    # create an empty image with width, height and background color
    image = Image.new('RGB', (width, height), bg_color)

    # draw text on image
    draw = ImageDraw.Draw(image)
    draw.text((10, 30), text_message, fill=font_color)

    return image

This would result in the following

Now let's improve it by using a custom font. I used a font called Playfulist and you can find it here. We need to define a font size, use it to create a font object and pass that object to draw.text.

...
font_size = 12
font = ImageFont.truetype("Playfulist.ttf", font_size)
...
draw.text(
    (10, 30),
    text_message,
    font=font,
    fill=font_color,
)

Now, instead of hard-coding (10, 30) as the text location, which is the top left corner of the text, let's put it in the center of the image. To do so, we first need to calculate the size of the text in pixels. For that we use ImageDraw.textbbox, which returns the bounding box (in pixels) of the given text relative to the given anchor when rendered in the given font with the provided direction, features, and language (reference). The box comes back as (left, top, right, bottom); the code below names these x, y, w, h.

We can call draw.textbbox and then draw.rectangle to visualize it.

x, y, w, h = draw.textbbox((0, 0), text_message, font=font)
draw.rectangle((x, y, w, h), outline=(0, 255, 0), width=1)

To shift that box to the center of the image we need a little bit of geometry. We know the center point of the image lies at x = width/2 and y = height/2. Let's highlight it with blue color.

Now, we want the green box to be centered exactly around the blue dot. But when we render text, we only provide the top left corner. So, given the overall size of the text box, we need to calculate the top left corner with respect to the center point. How do we do that? We start at the center point, move left by half the width of the text box and move up by half the height of the text box.

# center point of the image
center_x = width / 2
center_y = height / 2

# from the bounding box results from 'draw.textbbox'
bbox_width = w - x
bbox_height = h - y

# start at the center, move left half the box width and up half the box height
top_left_x = center_x - (bbox_width / 2)
top_left_y = center_y - (bbox_height / 2)

"""
But we also know that we started from 0, 0 when
calculating the bounding box. We can just ignore x, y.
Although, when we look at the actual values in the above example, we got x = 0 and y = 3. That's probably
because PIL adds some margin for specific fonts.
But we can ignore the tiny little margins to simplify
the calculation.
"""

# assuming x = 0 and y = 0
bbox_width = w
bbox_height = h

# Hence
top_left_x = (width / 2) - (w / 2)
top_left_y = (height / 2) - (h / 2)

# And we can simplify it more to 
top_left_x = (width - w) / 2
top_left_y = (height - h) / 2
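
The simplification above ignores the small x, y margin that textbbox reports. If you would rather account for it, the sketch below subtracts the margin from the drawing position so the visible box is centered. The function name centered_text_origin and the sample numbers are illustrative assumptions; it uses plain arithmetic only, so it runs without PIL.

```python
from typing import Tuple


def centered_text_origin(
    image_size: Tuple[int, int], bbox: Tuple[float, float, float, float]
) -> Tuple[float, float]:
    """Return the position to pass to draw.text so the text box is centered.

    bbox is (left, top, right, bottom) as measured from origin (0, 0).
    """
    width, height = image_size
    left, top, right, bottom = bbox
    bbox_width = right - left
    bbox_height = bottom - top
    # top left corner of the centered box, minus the margin the font adds
    origin_x = (width - bbox_width) / 2 - left
    origin_y = (height - bbox_height) / 2 - top
    return origin_x, origin_y


# hypothetical measurement: a 100x12 pixel box that starts 3 pixels below the origin
print(centered_text_origin((256, 64), (0, 3, 100, 15)))  # (78.0, 23.0)
```

With a margin of zero this reduces to the (width - w) / 2 and (height - h) / 2 formulas above.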

Let's draw a rectangle with these calculations first by using:

draw.rectangle(((width - w) / 2, (height - h) / 2, (width - w) / 2 + w, (height - h) / 2 + h), outline=(255, 0, 0), width=1)

Then, let's add back the text. And as you can see below, it fits the expected rectangle exactly.

draw.text(
    ((width - w) / 2, (height - h) / 2),
    text_message,
    font=font,
    fill=font_color,
)

Removing the guide lines, we get the following nicely centered text.

Justify Text

To break the story sentences into multiple lines so they look better on the page, I chose a library called JustifyText. All we need to do is give it the original sentence and the line width in characters, and it will break the sentence into shorter lines and return them as a list of strings.

text_message = "Once upon a time, there was a small village nestled deep in the heart of the Dark Forest."
results = justify(text_message, 20)
for r in results:
    print(r)

Running this produces the following strings, each padded to the same length (except the last one).

Once  upon  a  time,
there  was  a  small
village nestled deep
in the heart of  the
Dark Forest.
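
If you are curious how this kind of justification can work, here is a minimal sketch using only the standard library. It is not JustifyText's implementation: the function justify_sketch and its rule of giving leftover spaces to the rightmost gaps are my own assumptions, so results may differ from the library on other inputs.

```python
import textwrap
from typing import List


def justify_sketch(text: str, width: int) -> List[str]:
    """Greedy word wrap, then pad the gaps of every line except the last."""
    justified = []
    lines = textwrap.wrap(text, width)
    for index, line in enumerate(lines):
        words = line.split()
        gaps = len(words) - 1
        is_last_line = index == len(lines) - 1
        if is_last_line or gaps == 0:
            justified.append(line)
            continue
        # spaces still needed to reach the target width
        extra = width - len(" ".join(words))
        # every gap gets the same share; the rightmost gaps absorb the remainder
        share, remainder = divmod(extra, gaps)
        padded = words[0]
        for gap_index, word in enumerate(words[1:]):
            spaces = 1 + share + (1 if gap_index >= gaps - remainder else 0)
            padded += " " * spaces + word
        justified.append(padded)
    return justified


text_message = "Once upon a time, there was a small village nestled deep in the heart of the Dark Forest."
for line in justify_sketch(text_message, 20):
    print(line)
```

For this particular sentence the sketch produces the same five lines as shown above.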

And with the exact same image calculations we did before, we get this.

Background Color

We are almost done with the text part of the image. As a last step, what if we could make the background color of the text part relate to the image part of the page? To do this, we can calculate the dominant color of the image, lighten it so the background always stays light (that's a personal preference) and use the result as the background color of the text part.

We can find a dominant color algorithm on the internet and try to implement it, but like lots of other things in this world, there's already a library that can do this for us. It's called fast-colorthief. We just need to convert the image to a numpy array before calling get_dominant_color.

import numpy
from fast_colorthief import get_dominant_color

rgba_image = story_page_content.image.convert("RGBA")
ndarray = numpy.array(rgba_image).astype(numpy.uint8)
dominant_color = get_dominant_color(ndarray, quality=1)

To lighten the color we can use the following method. The idea is to move each of the 3 RGB values towards white by a factor called _BACKGROUND_TINT_FACTOR. For example, take a single channel value of 150 (in the range 0 to 255). To lighten it we need to move it towards 255, so we add something to it. How much? In the method below, I calculate the difference between 255 and the current value (255 - 150 = 105), then multiply it by the tint factor, e.g. 0.7. That gives 73.5, which we add to 150 and truncate to an integer to get 223. So we moved the value 70% of the way towards 255. To make sense of this, think about the extreme cases: a tint factor of 1.0 gives 255 for any value, which is white (100% lightened), and a factor of 0.0 gives back the original value (lightened by 0%).

def _lighten_color(self, color: Tuple[int, int, int]) -> Tuple[int, int, int]:
    return (
        int(color[0] + (255 - color[0]) * self._BACKGROUND_TINT_FACTOR),
        int(color[1] + (255 - color[1]) * self._BACKGROUND_TINT_FACTOR),
        int(color[2] + (255 - color[2]) * self._BACKGROUND_TINT_FACTOR),
    )
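
To check the numbers from the explanation, here is a standalone version of the same formula that takes the tint factor as a parameter instead of reading it from the class. The name lighten_color and the sample gray color are only for illustration.

```python
from typing import Tuple

Color = Tuple[int, int, int]


def lighten_color(color: Color, tint_factor: float) -> Color:
    """Move each RGB channel towards 255 by tint_factor (0.0 to 1.0)."""
    r, g, b = (int(c + (255 - c) * tint_factor) for c in color)
    return (r, g, b)


print(lighten_color((150, 150, 150), 0.7))  # (223, 223, 223)
print(lighten_color((150, 150, 150), 1.0))  # (255, 255, 255)
print(lighten_color((150, 150, 150), 0.0))  # (150, 150, 150)
```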

The image below shows the generated image on the left, the calculated dominant color in the middle and the lightened version on the right. Now we can use that as the background color for the text part so it blends better with the generated image on the same page.

Going back to the Text Image section, we can simply change the input parameters for background color and font color to the lightened dominant color and black (or any other color) respectively. Here's an example:

Note: to make the background color of the text image even cooler, I added a gradient effect that you can find here. I'll skip it in this tutorial but you can take a look if you are interested.

Put them together

Now that we have the generated image and a nice looking text part, let's put them together, one on the left side and one on the right side. To do this, the 2 images must have the same height. We can create an empty image of the expected size and paste each part at the correct location.

result = Image.new(
    "RGB", (image_left.width + image_right.width, image_left.height)
)
result.paste(image_left, (0, 0))
result.paste(image_right, (image_left.width, 0))

And to make sure that it looks and feels like a book, we can alternate between text on left and text on right every other page.

if int(story_page_content.page_number) % 2 == 0:
    page_image: Image.Image = self._concat_horizontally(
        story_page_content.image, text_img
    )
else:
    page_image = self._concat_horizontally(text_img, story_page_content.image)

Paper Wrinkling Effect

That's actually a nice simple technique that you can use to add any other effect to the whole page. It blends an existing paper image (here) with the full page by a given factor, using Image.blend from PIL.

def _add_paper_effect(self, page_image: Image) -> Image:
    paper: Image = Image.open(self._PAPER_IMAGE_PATH).convert(page_image.mode)
    paper = paper.resize(page_image.size)
    return Image.blend(page_image, paper, self._PAPER_BLEND_FACTOR)
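
To get a feel for the blend factor, Image.blend interpolates between the two images as image1 * (1.0 - alpha) + image2 * alpha. The sketch below applies that formula to a single pixel in plain Python; blend_pixel is a made-up helper, and Pillow works on whole images with its own rounding, so treat the exact integers as approximate.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]


def blend_pixel(page: Pixel, paper: Pixel, alpha: float) -> Pixel:
    """Blend one pixel: alpha=0.0 keeps the page, alpha=1.0 keeps the paper."""
    r, g, b = (round(p * (1.0 - alpha) + q * alpha) for p, q in zip(page, paper))
    return (r, g, b)


print(blend_pixel((200, 100, 0), (100, 100, 100), 0.25))  # (175, 100, 25)
```

A small factor keeps the page readable while still letting the paper texture show through.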

Bring it all together

To bring it all together we need to do the following steps:

Calculate dominant/lightened background color.
Create text image
Concatenate image and text parts
Add paper wrinkling effect
Save and return

def create_page(
    self,
    workdir: str,
    story_page_content: StoryPageContent,
    audio: AudioInfo,
    story_size: StorySize,
) -> StoryPage:

    # calculate dominant/lightened background color
    background_color = self._calculate_background_color(story_page_content)

    # create text part
    text_img: Image.Image = self._create_text_image(
        size=(story_size.text_part_width, story_size.text_part_height),
        bg_color=background_color,
        message=story_page_content.sentence,
        font=ImageFont.truetype(self._FONT, story_size.font_size),
        font_color=self._BLACK_COLOR,
    )

    # alternate between text on left and text on right
    if int(story_page_content.page_number) % 2 == 0:
        page_image: Image.Image = self._concat_horizontally(
            story_page_content.image, text_img
        )
    else:
        page_image = self._concat_horizontally(text_img, story_page_content.image)

    # add paper effect
    page_image = self._add_paper_effect(page_image)
    page_filepath = os.path.join(
        workdir, f"page_{story_page_content.page_number}.png"
    )

    # save and return
    page_image.save(page_filepath)
    return StoryPage(
        page_content=story_page_content,
        page_image=page_image,
        page_filepath=page_filepath,
        audio=audio,
    )

In this part, we did lots of image processing: we played with colors and fonts, and we did a little bit of geometry to place the text in the center. We can keep adding effects to change how the page looks and feels, and there are lots of parameters to play with, including font size and color, the tinting factor for the background color, the paper blend factor, etc.

Now that we have the nice-looking pages built in this part and our AI narrator ready from the previous part, let's compile the final video in the next part.

Part 3 - Text to Speech

Hatem Elseidy — Sun, 01 Oct 2023 17:09:41 +0000

Introduction

In previous parts of this series, we discussed the high level overview of our fully automated story generator. We also discussed text/images generation using OpenAI and how we represented the whole problem in simple data structures.

In this part, we will discuss text to speech (Audio generation or voice over).

Text to Speech

Looking around for a Python library that performs text to speech, I found one called gTTS. But I noticed the below disclaimer on its GitHub page.

This project is not affiliated with Google or Google Cloud. Breaking upstream changes can occur without notice. This project is leveraging the undocumented Google Translate speech functionality and is different from Google Cloud Text-to-Speech.

I also didn't really like the resulting voice as it sounded too robotic (check this example). But I thought it would be nice to integrate text to speech, make the story generator work end to end, then iterate on separate parts.

Audio Generation Abstract Class

To enable easy experimentation with different text to speech technologies and to support multiple audio generators, I decided to create an abstract class that represents the audio generation functionality, use it everywhere we generate audio and inject different implementations as needed.

For example, StoryManager which is a high level manager that combines several other components, takes AbstractAudioGenerator as an input and doesn't care how it's implemented.

def __init__(
    self,
    audio_generator: AbstractAudioGenerator,
    keywords_generator: KeywordsGenerator,
    page_processor: PageProcessor,
    pdf_processor: PdfProcessor,
    video_processor: VideoProcessor,
):

You can find the AbstractAudioGenerator here. As you can see below, it's super simple. It has a main public method called generate_audio and a couple of helper methods. We will see later on how these are used.

import os
from abc import ABC, abstractmethod
from mutagen.mp3 import MP3
from data_models import AudioInfo, StoryPageContent


class AbstractAudioGenerator(ABC):
    @abstractmethod
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        pass

    @staticmethod
    def _get_length_in_seconds(mp3_filepath: str) -> float:
        return MP3(mp3_filepath).info.length

    @staticmethod
    def _get_mp3_filepath(workdir: str, story_page_content: StoryPageContent) -> str:
        return os.path.join(workdir, f"audio_{story_page_content.page_number}.mp3")

gTTS Implementation

gTTS is very simple, all we need to do is call gTTS and then call a save method to save the results to an mp3 file.

audio = gTTS(text="a sentence from the story", lang="en", slow=True)
audio.save("/my/fantastic/story/audio/page.mp3")

To work with our framework, we need to implement AbstractAudioGenerator and return an AudioInfo object. This is where we need _get_mp3_filepath and _get_length_in_seconds, the 2 helper methods we implemented in the abstract class.

class AudioGeneratorGtts(AbstractAudioGenerator):
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        print(f"Generating audio for: {story_page_content.sentence}")
        audio = gTTS(text=story_page_content.sentence, lang="en", slow=True)
        mp3_filepath = self._get_mp3_filepath(workdir, story_page_content)
        audio.save(mp3_filepath)
        length_in_seconds = self._get_length_in_seconds(mp3_filepath)
        return AudioInfo(mp3_file=mp3_filepath, length_in_seconds=length_in_seconds)

AWS Polly Implementation

Amazon Polly uses deep learning technologies to synthesize natural-sounding human speech, so you can convert articles to speech. With dozens of lifelike voices across a broad set of languages, use Amazon Polly to build speech-activated applications.

You can find the full implementation here. Here are a few things to note.

First things first, you need an AWS account and you need AWS credentials with access to Polly. An AWS user with access key and secret key is one way to achieve this, but you can also use IAM roles.

Using boto3 we can create an AWS Polly client with pre-configured credentials like this:

from boto3 import Session

def __init__(self, aws_polly_credentials_provider: AwsPollyCredentialsProvider):
    self.session = Session(
        aws_access_key_id=aws_polly_credentials_provider.access_key,
        aws_secret_access_key=aws_polly_credentials_provider.secret_key,
    )
    self.polly = self.session.client("polly")

Now we can call synthesize_speech from that polly client.

return self.polly.synthesize_speech(
    Engine=self._ENGINE_NEUTRAL,
    TextType=self._TEXT_TYPE_SSML,
    Text=self._construct_ssml(text=text),
    OutputFormat=self._OUTPUT_FORMAT_MP3,
    VoiceId=self._VOICE_ID,
    LanguageCode=self._LANGUAGE_CODE_EN_US,
)
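
The call above passes SSML rather than plain text. As a rough idea of what a helper like _construct_ssml could do, here is a sketch that wraps a sentence in the standard SSML speak and prosody tags and escapes special characters. This is an assumption for illustration; the actual helper in the repo may build a different document.

```python
from xml.sax.saxutils import escape


def construct_ssml(text: str, rate: str = "slow") -> str:
    """Wrap plain text in a minimal SSML document with a speaking rate."""
    # escape &, < and > so the sentence cannot break the SSML markup
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'


print(construct_ssml("Tom & Jerry went home."))
# <speak><prosody rate="slow">Tom &amp; Jerry went home.</prosody></speak>
```

Escaping matters because generated story text can easily contain characters like & that are not valid inside SSML as-is.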

Then the rest of this class is logic to process input and output from/to polly. You can look at other examples online, or AWS documentation.

Other Implementations

You can expand this to other implementations if you have other preferred technologies.
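
As a minimal sketch of what adding another implementation looks like, the example below defines a fake generator that writes no audio at all and just estimates the length from the word count. Everything here is a simplified stand-in: the trimmed data classes, FakeAudioGenerator, the narrate helper and the speaking rate are all hypothetical, but the shape of the interface follows the abstract class above.

```python
import os
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class StoryPageContent:  # simplified stand-in: only the fields used here
    sentence: str
    page_number: str


@dataclass
class AudioInfo:
    mp3_file: str
    length_in_seconds: float


class AbstractAudioGenerator(ABC):
    @abstractmethod
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        pass


class FakeAudioGenerator(AbstractAudioGenerator):
    """Hypothetical generator that writes no audio and estimates the length."""

    _WORDS_PER_SECOND = 2.0  # made-up speaking rate

    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        mp3_filepath = os.path.join(
            workdir, f"audio_{story_page_content.page_number}.mp3"
        )
        word_count = len(story_page_content.sentence.split())
        return AudioInfo(
            mp3_file=mp3_filepath,
            length_in_seconds=word_count / self._WORDS_PER_SECOND,
        )


def narrate(generator: AbstractAudioGenerator, page: StoryPageContent) -> AudioInfo:
    # the caller only depends on the abstract interface
    return generator.generate_audio("workdir", page)


page = StoryPageContent(sentence="Once upon a time", page_number="1")
print(narrate(FakeAudioGenerator(), page).length_in_seconds)  # 2.0
```

A fake like this can also be handy for testing the rest of the pipeline without calling any speech service.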

Conclusion

That's it for text to speech. We built a simple, extensible system that takes a StoryPageContent as input and returns an AudioInfo. And we saw 2 different implementations.

Next, we will look at the most fun, artistic and complex component of this project. That's the page processor. This is where we learn about some basic image processing/manipulation using PIL.

Part 2 - Problem Representation

Hatem Elseidy — Sun, 01 Oct 2023 17:02:43 +0000

Introduction

In part 1 of this series, we discussed the high level overview of our fully automated story generator. We also discussed Content Generation, the first of the 5 problems. In content generation, we generated the story text, processed it and generated images.

In this part, we will discuss the problem representation and the data structures used all over the code.

Reminder

The code can be found in this github repo.
https://github.com/hatemfaheem/ai-story-generator

And in this YouTube channel, you can find lots of examples auto-generated from this codebase (with ZERO video/audio editing involved). Going back through the channel, you can see earlier versions that had less compelling results.

Example old version: https://www.youtube.com/watch?v=rM7l-B0wsx4
Example newer version: https://www.youtube.com/watch?v=xURG3wQ0Jtg

This shows how much we can improve through multiple iterations. Another good place to learn about the power of iterations is the commit history of this project.

https://github.com/hatemfaheem/ai-story-generator/commits/main

Why discuss this?

This is actually the most important section. It explains how we can represent a complex problem, like generating video from text, in simple to understand data structures. If we get this right, then every other component's job is to produce one of these data structures and consume one or more other data structures.

All data structures used around the code can be found in [data_models.py](https://github.com/hatemfaheem/ai-story-generator/blob/main/data_models.py). As you can see, it's mostly native Python except for Image from PIL.

from PIL.Image import Image

Let's just go from top to bottom.

A. Content Generation Data Structures

Going back to the high level architecture in part 1, there are 2 main parts. Let's start with the content generation side and move to the content processing data structures in the next section.

Story Size

Let's take the below example. There are 5 main dimensions:

[Red] Page Width (The width of the whole page).
[Green] Page Height (The height of the whole page).
[Purple] Text Part Width (The width of the text part of the page).
[Blue] Image Part Width (The width of the image part of the page).
The font size. Playing around with numbers, it's better if we adapt the font size based on the image size.

Now, the biggest restriction we have is image generation. If you take a look at OpenAI image generation docs, you'll see that we can generate only square images of 3 different sizes. That's 256x256, 512x512 and 1024x1024. So, to avoid errors with image generation, let's design our stories around only these 3 sizes.

As I was designing this for YouTube videos, I learned that the best aspect ratio for YouTube videos is 16:9. Hence, we have to extend the square images with a text part that has a width of (page_width - image_width), where page_width / image_width = 16 / 9 ≈ 1.777, the target aspect ratio. With simple math, you get the following numbers for the page dimensions (height, width).

SIZE_256 = (256, 455)
SIZE_512 = (512, 910)
SIZE_1024 = (1024, 1820)
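
The "simple math" can be reproduced in a couple of lines. This is a small sketch with a made-up helper name, page_width_for, that multiplies the image size by 16 / 9 and drops the fraction.

```python
def page_width_for(image_size: int) -> int:
    """Width of a 16:9 page whose height equals the square image size."""
    return image_size * 16 // 9  # integer division drops the fraction


for size in (256, 512, 1024):
    print(size, page_width_for(size))
# 256 455
# 512 910
# 1024 1820
```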

Now, we can create an enum to calculate the missing values from these 2 numbers:

from enum import Enum


class StorySize(Enum):
    """The sizing configuration of the story."""

    SIZE_256 = (256, 455)
    SIZE_512 = (512, 910)
    SIZE_1024 = (1024, 1820)

    def __init__(self, image_part_size: int, page_width: int):
        self.page_width: int = page_width
        self.page_height: int = image_part_size
        self.image_part_size: str = f"{image_part_size}x{image_part_size}"
        self.text_part_width: int = page_width - image_part_size
        self.text_part_height: int = image_part_size
        self.font_size: int = self._get_font_size(image_part_size)

Font size is just trial and error. For each input size, I hard-coded the following numbers:

@staticmethod  # defined inside StorySize, so it can be called as self._get_font_size(...)
def _get_font_size(size: int) -> int:
    return {256: 16, 512: 38, 1024: 58}[size]

Finally, we want to make the command line interface simple (that's the main interface for now). Hence, I implemented a method that maps the 3 main numbers to the enum. So, when we specify input size, we just specify 256, 512 or 1024 as input.

def get_size_from_str(size: str):
    return {
        "256": StorySize.SIZE_256,
        "512": StorySize.SIZE_512,
        "1024": StorySize.SIZE_1024,
    }[size]
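
To see the enum in action, here is a trimmed, self-contained sketch that you can run on its own. It leaves out the font size, and the lookup builds its mapping from the member names and raises a readable error for unsupported sizes; that error handling is my own addition for illustration, not something taken from the repo.

```python
from enum import Enum


class StorySize(Enum):
    """Trimmed copy of the enum above, enough to run on its own."""

    SIZE_256 = (256, 455)
    SIZE_512 = (512, 910)
    SIZE_1024 = (1024, 1820)

    def __init__(self, image_part_size: int, page_width: int):
        self.page_width = page_width
        self.page_height = image_part_size
        self.image_part_size = f"{image_part_size}x{image_part_size}"
        self.text_part_width = page_width - image_part_size

    @staticmethod
    def get_size_from_str(size: str) -> "StorySize":
        # map "256", "512", "1024" to the matching member
        sizes = {member.name.split("_")[1]: member for member in StorySize}
        if size not in sizes:
            raise ValueError(f"Unsupported size {size!r}, expected one of {sorted(sizes)}")
        return sizes[size]


size = StorySize.get_size_from_str("512")
print(size.page_width, size.page_height, size.image_part_size, size.text_part_width)
# 910 512 512x512 398
```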

So, by running this method, you get an object with all the different dimensions for the story. All inclusive.

Story Content

To represent the contents of the story, we created 3 data structures StoryText, StoryPageContent and StoryContent.

StoryText is simple, it contains raw text from OpenAI and tokenized sentences as discussed in part 1.

class StoryText:
    raw_text: str
    processed_sentences: List[str]

StoryPageContent represents the contents of a single page. The text of this page (sentence), the actual image of the page, the path of the image on local disk, and finally the page number.

class StoryPageContent:
    sentence: str
    image: Image
    image_path: str
    page_number: str

StoryContent represents the contents of the story as a whole. story_seed is the input sentence (title) of the story, raw_text again the full raw text of the story, page_contents is a list of StoryPageContent and story_size is the object that contains all dimensions info discussed above.

class StoryContent:
    story_seed: str
    raw_text: str
    page_contents: List[StoryPageContent]
    story_size: StorySize

Now, you can see that given this StoryContent object you know pretty much everything about the generated story, including its seed title, text, images and size. Remember that the generated images have a specific size; that's why we couple the size with the content.

You could explore generating the images once and resizing them based on the input size to save OpenAI calls. That way we can decouple the size from contents.

Note: If you look at story_utils.py, you will see that we save and load the StoryContent object. This allows us to save the contents after the relatively expensive OpenAI calls and avoid regenerating them if a later step fails.
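
The save-and-reload idea can be sketched as follows. This is not the code from story_utils.py: it assumes a simplified StoryContent without images or sizes, stores it as JSON, and uses hypothetical names (load_or_generate, expensive_generation) purely to show the pattern.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class StoryContent:  # simplified stand-in without images or sizes
    story_seed: str
    raw_text: str
    sentences: List[str]


def load_or_generate(path: str, generate: Callable[[], StoryContent]) -> StoryContent:
    """Reuse saved content if present, otherwise generate it and save it."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return StoryContent(**json.load(f))
    content = generate()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(content), f)
    return content


calls = []


def expensive_generation() -> StoryContent:
    calls.append(1)  # stands in for the paid API calls
    return StoryContent("The Dark Forest", "Once upon a time.", ["Once upon a time."])


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "story_content.json")
    first = load_or_generate(path, expensive_generation)
    second = load_or_generate(path, expensive_generation)
    print(first == second, len(calls))  # True 1
```

The second call reads the saved file, so the expensive generation only runs once.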

B. Content Processing Data Structures

Once we have the contents of the story (images and text), we want to process them into a nice, compelling video. That includes background music, voice over, etc.

AudioInfo

As simple as shown below, a string mp3_file that points to an mp3 file on local disk and the length of this audio as length_in_seconds.

class AudioInfo:
    mp3_file: str
    length_in_seconds: float

StoryPage

StoryPage builds on top of StoryPageContent. It adds the final image of the full page (text + generated image) and the audio information for the voice over of that page.

class StoryPage:
    page_content: StoryPageContent
    page_image: Image
    page_filepath: str
    audio: AudioInfo

Story

That's the final all inclusive story. The main reason for this class is to get everything in 1 place. You may disagree with this approach, but I find it simpler in prototyping and fast iteration. Nothing really new here, by the names you can guess what each field represents.

class Story:
    story_seed: str
    story_raw_text: str
    pages: List[StoryPage]
    start_page_filepath: str
    end_page_filepath: str
    keywords: List[str]

How does this help writing the code?

Let's look at 2 examples.

1. Audio Generation

When we think about voice over, or more specifically text to speech, we picture a human being looking at the page and reading out loud what is on it. This is exactly how we designed it here. As you can see, the method takes a StoryPageContent as input and returns an AudioInfo. When implementing it, you don't really need to think about what's happening in other areas like image generation or page processing. Simple, isn't it?

@abstractmethod
def generate_audio(
    self, workdir: str, story_page_content: StoryPageContent
) -> AudioInfo:
    """Perform text to speech and generate an audio file for the given story page

    Args:
        workdir: The workdir where to save the audio files
        story_page_content: The content of a single page from the story

    Returns: AudioInfo object with filepath and length.
    """
    pass

2. Page Processing

As we will see in later parts, page processing is the process of generating an image from the story content. It combines everything from the contents into a nice looking page. As you can see below, it consumes StoryPageContent, AudioInfo, StorySize and produces StoryPage.

def create_page(
    self,
    workdir: str,
    story_page_content: StoryPageContent,
    audio: AudioInfo,
    story_size: StorySize,
) -> StoryPage:

Everything is a function of these data structures.

In the next part, we will look into text to speech, where we will touch a little bit on polymorphism.

Part 1 - Content Generation

Hatem Elseidy — Sun, 01 Oct 2023 17:01:00 +0000

Introduction

In this series of posts I'll walk you through building a fully fledged, fully automated, visual story generator including text, images, audio, video and background music.

Your code will go from an input sentence as short as "The Whistling Scarecrow" to the below video in ~1 min.

This is my favourite ever side project. I built it early this year when OpenAI APIs came out. The work will be fully in Python and the full code is published open source in this repo.

https://github.com/hatemfaheem/ai-story-generator

High Level Overview

The below diagram shows a high level overview of what we will be building. As you can see on the right side, we produce lots of raw and processed results, but most importantly a video (like the one shown in the intro).

There are 5 main subproblems with different levels of complexity and different tools. If we can solve these problems independently, we can just pipe them together to create our beautiful video.

Content Generation & NLP: Our story needs text and images, which are the core content of the story. We will be using OpenAI for this. We will also need to process the text (Natural Language Processing) of the story for a couple of reasons: (a) to break it down into sentences/pages and (b) to produce keywords for SEO (we will not really dive deep into SEO, but I'll show you how to produce keywords to be used in things like hashtags).
Text to Speech (Audio): A nice video story is not complete without a narrator. And guess what, we will also generate this.
Image Processing: Once we have images and text, we will need to combine these into nice looking visual pages. This is a super cool subsystem written using Pillow (a popular image processing python library). And yes, this will include the page wrinkling effect.
Video Processing: Once we have nice looking pages and narrator audio, we compile a full video.
PDF Processing: Similar to video, but this time we compile a PDF, like a printable version of the story.

The same diagram above can also be viewed as a pipeline (data flow diagram).

Content Generation & NLP

Story Text

Let's jump in straight away. Given a simple sentence, i.e. the story title, we want to generate a story. In this case, we just need the story text. Thanks to OpenAI APIs, we can use the text-davinci-003 model to obtain this with a few lines of code.

import openai  # this snippet uses the pre-1.0 interface of the openai package

# prompt: str = "The Whistling Scarecrow"
story_content = openai.Completion.create(
    model="text-davinci-003",
    prompt="Give me a story about " + prompt,
    max_tokens=self._MAX_TOKENS,
    temperature=0,
)
story_raw_text = story_content["choices"][0]["text"]

As you can see, I had to prepend "Give me a story about " to the title prompt, to instruct OpenAI to give me a story about The Whistling Scarecrow. And unsurprisingly, it's very good at generating such stories (try it out on ChatGPT if you have access). You may be used to this level of AI now, but the quality of the stories was super impressive when I was writing this code in December 2022.

Next, we need to process this text into sentences, i.e. story pages. You might think of something as simple as this:

story_raw_text.split(".")

This will work for lots of stories, but it's not reliable. Consider the following story.

In the small village of Elmridge, people told tales of the Whistling Scarecrow. Not as a mere bedtime story, but as a local legend that had seen generations.

Splitting on '.' will produce the following list.

[
  "In the small village of Elmridge, people told tales of the Whistling Scarecrow",
  "Not as a mere bedtime story, but as a local legend that had seen generations"
]

Which is actually correct (ignoring leading whitespace and a trailing empty string), but here are the next few sentences in the same story:

The scarecrow stood in the middle of Mr. Whitaker's cornfield, lanky and faded from years under the sun and rain. Its clothes were tattered, its straw body peeking out from holes and tears, yet it stood proud, guarding the field as though it were its own.

As you can see, this solution will break at "Mr.": it will separate "Mr." and "Whitaker's" into 2 different sentences, although it shouldn't. How can we fix this? We use a smarter sentence tokenizer. Thanks to NLTK, we can do this in 1 line:

import nltk

story_sentences = nltk.sent_tokenize(story_raw_text)
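
To see the failure concretely, here is a small sketch that uses only the standard library to run the naive approach on the sentences above. The helper name naive_split and the shortened sample text are made up for illustration. Note also that sent_tokenize relies on NLTK's Punkt tokenizer data, which typically has to be downloaded once (for example with nltk.download("punkt"); the exact resource name can vary by NLTK version).

```python
from typing import List

# Shortened version of the story excerpt above (two real sentences).
text = (
    "The scarecrow stood in the middle of Mr. Whitaker's cornfield, "
    "lanky and faded from years under the sun and rain. "
    "Its clothes were tattered, yet it stood proud."
)


def naive_split(raw_text: str) -> List[str]:
    """Split on '.' and drop empty fragments (the naive approach)."""
    return [part.strip() for part in raw_text.split(".") if part.strip()]


naive_sentences = naive_split(text)

# The text has two sentences, but the period in "Mr." creates a third fragment.
for fragment in naive_sentences:
    print(repr(fragment))
```

The first fragment ends at "Mr", which would become its own page (with its own image and narration) in the final story. That is the kind of error a sentence tokenizer that knows about abbreviations is meant to avoid.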

Keywords

Now that we have the story text, and while we are talking about text processing, let's also generate a bunch of keywords that are representative of the story content.

Why do we need keywords?

They could be used as hashtags if you're publishing this story to social media.

How do we automatically generate high quality keywords?

The answer is KeyBERT. KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

from keybert import KeyBERT

keybert_model = KeyBERT()
keybert_model.extract_keywords(story_raw_text)

For the Whistling Scarecrow story in this video, that generated the following set of keywords, which is similar to what a human tagger would come up with if you think about it.

scarecrow, farmer, whistling, whistle, crops
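
KeyBERT's extract_keywords returns a list of (keyword, score) tuples rather than plain strings, so a small post-processing step is needed before posting. The sketch below shows one way to turn such a list into hashtags. The helper name to_hashtags and the scores in sample_keywords are made up for illustration and are not the output of a real model run.

```python
from typing import List, Tuple


def to_hashtags(keywords: List[Tuple[str, float]], limit: int = 5) -> List[str]:
    """Turn (keyword, score) pairs into hashtags, highest score first."""
    ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)
    hashtags: List[str] = []
    for keyword, _score in ranked[:limit]:
        # Hashtags cannot contain spaces, so join multi-word phrases.
        tag = "#" + "".join(word.capitalize() for word in keyword.split())
        if tag not in hashtags:  # skip duplicates
            hashtags.append(tag)
    return hashtags


# Illustrative input in the same shape KeyBERT returns; scores are invented.
sample_keywords = [("scarecrow", 0.61), ("farmer", 0.45), ("whistling", 0.52)]
hashtags = to_hashtags(sample_keywords)
print(" ".join(hashtags))
```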

Story Images

Now that we have generated and processed the story text, let's jump into generating images for each sentence. For this we will use DALL·E 2 from OpenAI. It may not be the best image generation model on the market, but it has an API that allows us to automate this process.

# prompt -> story sentence
def generate_image(prompt: str) -> str:
    response = openai.Image.create(
        prompt=prompt, n=1, size="1024x1024"
    )
    return response["data"][0]["url"]

Given the image url, we can download the actual image by doing something like this:

import os
from io import BytesIO
from typing import Tuple

import requests
from PIL import Image

def download_image(
    workdir: str, url: str, image_number: str
) -> Tuple[Image.Image, str]:
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    filepath = os.path.join(workdir, f"image_{image_number}.png")
    img.save(filepath)
    return img, filepath

We first make a GET request to fetch the URL content, then we open the image using the PIL library (we will talk a lot about this in the image processing part of this series). We then save the image to the local working directory.

Story Content Generator

Now let's bring it all together. The story content generation algorithm is simple:

Generate and process story text for the given prompt/title.
Generate and download images for each sentence in the story.
Construct a StoryContent object that contains all story content/details.

def generate_new_story(
    self, workdir_images: str, story_seed_prompt: str, story_size: StorySize
) -> StoryContent:
    """Generate a new story for the given prompt

    Args:
        workdir_images: The workdir where images should be stored
        story_seed_prompt: The title/seed of the story
        story_size: Story size configuration

    Returns: The contents of the newly generated story
    """
    story_text = self.text_generator.generate_story_text(story_seed_prompt)
    raw_text = story_text.raw_text
    processed_sentences = story_text.processed_sentences
    page_contents = []

    for i, sentence in enumerate(processed_sentences):
        image_prompt = (
            f"A painting for '{sentence}'. "
            f"{story_seed_prompt}."
        )
        url = self.image_generator.generate_image(
            prompt=image_prompt, story_size=story_size
        )
        image_number: str = str(i).zfill(3)
        image, image_path = self.image_generator.download_image(
            workdir=workdir_images,
            url=url,
            image_number=image_number,
        )
        story_page_content = StoryPageContent(
            sentence=sentence,
            image=image,
            image_path=image_path,
            page_number=image_number,
        )
        page_contents.append(story_page_content)

    return StoryContent(
        story_seed=story_seed_prompt,
        raw_text=raw_text,
        page_contents=page_contents,
        story_size=story_size,
    )

In the next part of the series we will talk about how to represent the story generation problem as a set of data structures, including StoryContent and StoryPageContent shown in the previous section.