Khoa Nguyen

Side Quest #3: maybe the real Deepfakes were the friends we made along the way

Follow Along

Maybe if you're really liking what I'm doing here, give me a star!

BasicallyOk / deepfake-lip-sync

Deepfake model for syncing a person's lip to a given piece of audio

deepfake-lip-sync

A Generative Adversarial Network that deepfakes a person's lip to a given audio source

Environment Variable

The easiest method is to place a .env file in the project root, following this format:

BATCH_DATA_PATH=./batch_data/batches.json
GENERATED_IMAGES_PATH=./generated_images/
SAVED_MODELS_PATH=./saved_models/
DATASET_PATH=./dataset/train/
TEST_DATASET_PATH=./dataset/test/
MODEL_NAME="contrastive_loss_noisy20000"
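For reference, a typical way to consume a file like this is python-dotenv; whether the repo actually uses that package is an assumption on my part, but the variable names below match the format above.

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads the .env file in the project root into the process environment
dataset_path = os.environ["DATASET_PATH"]
saved_models_path = os.environ["SAVED_MODELS_PATH"]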

Credits

Some of the code was written in a different repository that has since been nuked. I owe a lot of the data processing pipeline to @HueyAmiNhvtim and @Blinco0.






skip the ramblings

What even is Side Quest?

Programming does not really have to be complicated, but where's the fun in that? Personally, I've seen enough guides on the Internet on "How to make ChatGPT in 3 Simple Steps!" with this thumbnail.
Mr Beast
Naturally, I'll avoid that style of tutorial. YouTube tutorials are absolute beginner traps; learn how to read documentation instead (can't change my mind on this one). I mean, I'm not trying to make a guide, but I would be pretty proud of myself if you are reading this to learn from how I do things (or maybe how not to do things). If that is in fact the case for you, thank you, I'm very flattered. Anyways, the whole point of the series is to be a diary. I touched on this a little bit in part 2, if you haven't read that.


So, in a sense, it is a gateway into the pink yarnball of neurons some would call my brain. Maybe it's a disruption of the status quo: the world needs more relatable software developers who show you how complicated our jobs actually are (maybe I'm just straight stupid, but hey, I'm learning). Who knows, maybe one day I'll make someone realize that the job is tough, and they're not the only one.

If this is the Side Quest, what even is the Main Quest?

To be honest, it's just a name. But if you really think about it, it's just life itself. Right now, I'm taking a few steps back and getting a few Side Quests done to gain some XP and, most importantly, refill my ego meter. I am very egotistical, but that stuff has been running low lately. Those who know me know that I run on it like fuel.
The Main Quest has been pretty shitty lately. I got my last paycheck as a Teaching Assistant the other day, so I decided I'd pull the trigger on a pair of Everlast Elite 2.0 boxing shoes. I've been boxing in a pair of worn-down Ultraboosts. I've gotten really used to the fact that they don't grip the floor at all anymore (I absolutely annihilated the outsole at the pivot point), but an upgrade to make the whole experience nicer would be awesome. And guess what: Everlast ran out of stock, and I have no idea when they'll restock. If you ask me, this is a sign from above. So maybe I'll just leave it for later and git gud without them first.

Back to the good stuff

Since I imagine a common criticism of my blog is that I stray off topic too much in the middle of a tech talk, I will enclose those tangents in quotation blocks so you know what to skip. Of course, I will still keep straying from the topic, because gosh dang it, this is my blog and mine alone. Enough rambling from me; I'll let the project do the talking from now on.

A big part of this episode will be me revamping the input pipeline. I thought I was going to be working on the model, but plans changed. As I said back in Side Quest #1, the legacy code is held together by hopes and dreams. Naturally, I've grown out of that kind of mentality; now, I want to see something clean. So let's take a look at what's next:

  1. Port the original batcher to a batched TensorFlow Dataset. This is mostly so it works better with the TensorFlow model I already made and gives the whole project some cohesion.
  2. Remake the training loop. As I mentioned in Side Quest #1, the original loop does not look good and uses a lot of TensorFlow's legacy stuff, so I'll move on from that and start clean.
  3. Redesign the model. Again, I was not a big fan of the original model; it is due for a change.

Lots of work to be done, so sit back, grab some coffee and popcorn; it's going to be a long day.

Caffeine

By the way, since my posts also double as game recommendations: Alan Wake 2 was one of the best games I played last year, only beaten out by Disco Elysium and Elden Ring.

The Bad Batch

Batching is a pretty important thing to do when you have a lot of data. Originally, you would send in one piece of data at a time and have the neural network return an output by evaluating an equation. Once people framed this as a system of equations and, eventually, linear algebra, everybody realized you can send all the data in at once for inference (batched inference). The same goes for gradient descent. On paper it is a no-brainer, since it is vastly more parallelizable than processing one sample at a time. Then again, full-batch training is usually not a great idea in practice for two main reasons:

  • You need all the data in memory at once, and RAM is expensive (unless you downloaded some for free, then power to you).
  • Being parallelizable is great, but there are only so many parallel processes that can be run at a time by a computer.

So, as a compromise, we use mini-batches, doing gradient descent on subsets of the data instead of the full dataset at once. To convince you that I am pretty smart show how mini-batch gradient descent works graphically, here is a pretty cool visual that sums the whole thing up:

Gradient Descent

Picture credit goes to Sweta (I don't think SGD should be compared against mini-batch, as they do not describe the same thing; in fact, modern optimizers are mini-batch SGD). As you can see, there are fewer steps, and the steps are a lot cleaner than if you use one datapoint at a time (again, the article calls this SGD, but a single datapoint isn't the point of SGD).

Very sorry if I rambled too much about that article; I just needed to make sure nobody is confused (though you might be more confused now, oh well).

Each of our images is about 8.5 kB, and we have about 210,000 of them (and that's without mentioning the audio clips). Needless to say, we're not doing full-batch training; we'll split it up into a few hundred images per mini-batch.

Image Property
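To make the full-batch vs mini-batch trade-off concrete, here is a tiny NumPy sketch of mini-batch gradient descent on a linear model. Everything in it (the toy model, the loss, the batch size) is illustrative and not code from this project.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(210_000, 16))            # pretend dataset: 210k samples, 16 features
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=210_000)

w = np.zeros(16)
batch_size, lr = 256, 0.1

for epoch in range(3):
    idx = rng.permutation(len(X))             # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]         # residuals on this mini-batch only
        grad = X[batch].T @ err / len(batch)  # gradient of the MSE w.r.t. the weights
        w -= lr * grad                        # one descent step per mini-batch

Each weight update only ever touches one slice of the data, which is exactly why the memory and parallelism problems above go away.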

Pre-processing Pipeline Still Needs Work

Hi, this is Khoa a few days into the future. I just finished the Guardians of the Galaxy game and thought it was one of the most well-written games I've played in a while. The gameplay was definitely weak, but I got really invested in the story and the Guardians themselves. However, I am not a big fan of the ending, specifically how a character seemingly is not affected at all after some pretty rough trauma.

Anywho, with the new idea for the batcher in mind, instead of working on the batcher, I went back and cleaned up the preprocessing pipeline (because I cannot help but be a perfectionist). The original had too many "roundabouts" in its design, so naturally I got pretty frustrated just thinking about it. Although it still does roughly the same thing as before (maybe in a cleaner manner), I made one decently big change:

  • The audio will no longer be segmented beforehand.

This may not make a difference in the training itself, but I think moving the segmentation into the batcher allows for a cleaner input pipeline. With the new system, you no longer have to pre-segment the audio clips before feeding them to the model, which was a massive pain point when trying to use the model in the wild. It is a classic case of two birds with one stone.
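To illustrate the idea (this is not the repo's actual load_audio_data, and the function name, window length, and librosa-based settings are all hypothetical), on-the-fly segmentation can look something like this: the batcher knows which video frame it is building an example for, slices the matching audio window out of the full wav, and builds the spectrogram right there.

import librosa
import numpy as np

def spectrogram_for_frame(wav_path: str, frame_index: int, fps: float = 25.0,
                          window_s: float = 0.2, sr: int = 16000) -> np.ndarray:
    """Hypothetical on-the-fly segmentation: slice the audio window centred on a
    video frame and turn it into a mel spectrogram, instead of pre-cutting clips."""
    centre = frame_index / fps                # timestamp of the frame in seconds
    offset = max(centre - window_s / 2, 0.0)  # start of the audio window
    y, sr = librosa.load(wav_path, sr=sr, offset=offset, duration=window_s)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    return librosa.power_to_db(mel)           # log-mel spectrogram for the model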

The batcher-inator

Theory is definitely fun, but implementation is a wholly different story. Before, I had a function that generated a new batch of data every time it was called. This is fine, but it does not play well with TensorFlow, and especially Keras. Thus, I will be using tf.data.Dataset.

I used this once in my image compression project, and I can't really go back. It makes the piece of code responsible for training the model super compact as I will show you in the next episode when I work on the models themselves. It really tickles my brain when the code is clean. You will see much more of this obsessive-compulsive side of me when I eventually do a Side Quest for bitburner.

me rn

The Dataset API is pretty awesome and allows me to write a batch generator that I can throw straight into the model's input. The entire pipeline can be summed up with one very clean piece of code, as seen here:

import os

import tensorflow as tf

# Helper functions (original_image_path_gen, ref_image_path_gen, load_image,
# load_masked_image, load_audio_data) and the *_SHAPE constants are defined elsewhere in the repo.

def batcher_from_directory(batch_size:int, dataset_path:str, shuffle=False, seed=None) -> tf.data.Dataset:
    """
    Return a tensorflow Dataset object that returns images and spectrograms as required.
    Partly inspired by https://github.com/keras-team/keras/blob/v3.3.3/keras/src/utils/image_dataset_utils.py

    Args:
        batch_size: The batch size.
        dataset_path: The path to the dataset folder which must contain the image folder and audio folder.
        shuffle: Whether to shuffle the dataset. Default to False.
        seed: The seed for the shuffle. Default to None.
    """
    image_dataset_path = os.path.join(dataset_path, "image")
    # create the foundation datasets
    og_dataset = tf.data.Dataset.from_generator(lambda: original_image_path_gen(image_dataset_path), 
                                                output_signature=tf.TensorSpec(shape=(), dtype=tf.string))
    og_dataset = og_dataset.repeat(None) # repeat indefinitely
    ref_dataset = tf.data.Dataset.from_generator(lambda: ref_image_path_gen(image_dataset_path),
                                                 output_signature=(tf.TensorSpec(shape=(), dtype=tf.string), 
                                                                   tf.TensorSpec(shape=(), dtype=tf.bool)))
    ref_dataset = ref_dataset.repeat(None) # repeat indefinitely

    # create the input datasets
    og_image_dataset = og_dataset.map(lambda x: tf.py_function(load_image, [x, tf.convert_to_tensor(False, dtype=tf.bool)], tf.float32), 
                                        num_parallel_calls=tf.data.AUTOTUNE)
    masked_image_dataset = og_image_dataset.map(lambda x: tf.py_function(load_masked_image, [x], tf.float32),
                                        num_parallel_calls=tf.data.AUTOTUNE)
    ref_image_dataset = ref_dataset.map(lambda x, y: tf.py_function(load_image, [x, y], tf.float32), 
                                        num_parallel_calls=tf.data.AUTOTUNE)
    audio_spec_dataset = og_dataset.map(lambda x: tf.py_function(load_audio_data, [x, dataset_path], tf.float64),
                                        num_parallel_calls=tf.data.AUTOTUNE)
    unsync_spec_dataset = ref_dataset.map(lambda x, _: tf.py_function(load_audio_data, [x, dataset_path], tf.float64),
                                        num_parallel_calls=tf.data.AUTOTUNE)

    # ensure shape as tensorflow does not accept unknown shapes
    og_image_dataset = og_image_dataset.map(lambda x: tf.ensure_shape(x, IMAGE_SHAPE))
    masked_image_dataset = masked_image_dataset.map(lambda x: tf.ensure_shape(x, MASKED_IMAGE_SHAPE))
    ref_image_dataset = ref_image_dataset.map(lambda x: tf.ensure_shape(x, IMAGE_SHAPE))
    audio_spec_dataset = audio_spec_dataset.map(lambda x: tf.ensure_shape(x, AUDIO_SPECTROGRAM_SHAPE))
    unsync_spec_dataset = unsync_spec_dataset.map(lambda x: tf.ensure_shape(x, AUDIO_SPECTROGRAM_SHAPE))

    # multi input using https://discuss.tensorflow.org/t/train-a-model-on-multiple-input-dataset/17829/4
    full_dataset = tf.data.Dataset.zip((masked_image_dataset, ref_image_dataset, audio_spec_dataset, unsync_spec_dataset), og_image_dataset)
    # if shuffle:
    #     full_dataset = full_dataset.shuffle(buffer_size=batch_size * 8, seed=seed) # not sure why buffer size is such

    # batch
    full_dataset = full_dataset.batch(batch_size=batch_size)

    return full_dataset
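For completeness, here is roughly how this plugs into training. The model object and hyperparameters here are placeholders (the real model shows up next episode); the step count just comes from 210,000 images divided by a batch size of 100, and DATASET_PATH is the env variable from the README.

# rough usage sketch; `model` is a placeholder for the GAN I'll cover next episode
train_dataset = batcher_from_directory(batch_size=100, dataset_path=os.environ["DATASET_PATH"])

# the underlying generators repeat indefinitely, so Keras needs to be told how long an "epoch" is
model.fit(train_dataset, steps_per_epoch=2100, epochs=1000)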

I think we can all appreciate clean code, and this definitely qualifies. It is also fairly efficient, because the pipeline is compiled beforehand and generates the batches on the fly, reducing the load on my poor RTX 3070 Ti with a measly 8GB of VRAM (and my 32GB of DDR4 memory, which used to be 16GB just a few months ago).

If anything, there seems to be a big bottleneck in the audio spectrogram generation. Before this, audio_spec_dataset was the only spectrogram Dataset object needed, so each example only generated one spectrogram from the wav file. However, I recently updated the discriminator to also punish deepfakes with bad mouth synchronization. This is done by taking a random piece of audio from a different spot in the video and giving it to the discriminator along with the true image (which is not synced to that audio, because it is not the same spot). For simplicity, the unsynced audio corresponds to the reference face, not the masked face we generate onto.

Before this change, each epoch took around 80 seconds; it now takes 150 seconds. Since the model was not changed (and even then it would not account for this much time), I suspect the spectrogram generation is the culprit. Since we're dealing with thousands of epochs, 150 seconds per epoch is pretty outrageous. This will hopefully be handled by the time Side Quest #4 (a full guide on how to make a deepfake model) drops.
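One knob I might reach for first (this is a possible mitigation, not something already in the repo) is letting tf.data overlap the expensive spectrogram work with training by prefetching; caching the spectrogram map before the .repeat() call is another candidate if the computed spectrograms fit on disk. The prefetch call below is standard tf.data API, tacked onto the end of batcher_from_directory.

    # sketch of one possible mitigation: let tf.data prepare the next batches
    # while the GPU is busy with the current training step
    full_dataset = full_dataset.batch(batch_size=batch_size)
    full_dataset = full_dataset.prefetch(tf.data.AUTOTUNE)

    return full_dataset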

Visions from the Future

So, it has been quite a while since the last episode, almost 3 weeks to be exact. I had to move and deal with all of the problems stemming from that (history really does have a way of repeating itself).
I actually started writing this blog right before the last episode dropped; however, I simply never got to the stage where I felt it was good enough to publish. In any case, I have completed the model design and trained a few prototypes.
Sample Generations
These looked surprisingly nice, but the original model I made for CUCAI was kind of a low bar, if you remember. Judging from those test generations, it is quite clear the model has a favorite: one person definitely appears a lot more often than the others in the training data. Their generations blend into the masked image surprisingly well, with spot-on skin color and orientation.

On the other hand, the generations are still quite blurry and showed minimal improvement past epoch 700 with a batch size of 100. I suspect this is because the model is not big enough to "accommodate" the complexity of the data. It is currently sitting at around 1 million parameters, which is shockingly small compared to modern LLMs and such; I believe the Wav2Lip model was about 30 million parameters. However, with great parameter count comes great data size and training time, and with the current problem in the batching pipeline, that may be too time-consuming for now. Just deciding when to train the model gets me thinking days ahead to plan out the use of my computer: it already takes just under 40 hours to train 1,000 epochs. So I will need to figure out how to improve that soon.

Another problem I have run into is the discriminator's performance on the test set: it is extremely good at sniffing out deepfakes there. As you can see in the picture above, it is also returning values larger than 1, which should be impossible since training clearly specifies 0 as real and 1 as fake. I need to check whether it does the same on the train set, as that could pose a problem.

Moreover, there is a strange "line" that appears at the bottom of the generated images. I hypothesize this is because shoulders are visible in a lot of the images: the model has no information about the color of the person's clothes, so that region converges to a darker color, the average over the dataset. A possible fix is to provide that information with a few rows of pixels at the bottom, but that would force me to redesign the model architecture, so it is preferably something done later.
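On the out-of-range scores: my first guess (an assumption, I have not re-checked the repo's discriminator head) is that the final layer is a plain linear unit. Bounding it explicitly is a one-line change in Keras; `features` here is a stand-in for whatever the last hidden layer produces.

# hypothetical discriminator head: a sigmoid keeps the "fakeness" score in (0, 1),
# so values above 1 become impossible by construction
fake_score = tf.keras.layers.Dense(1, activation="sigmoid")(features)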

Anyways, I might go on another hiatus, as I want Side Quest #4 to be the conclusion to the project. I figure it should be in the style of a traditional guide (contrary to what I said earlier), based on what I've learned over the course of the series. So look forward to that! If you read all the way to here: thank you, I really appreciate that view count ticking up on my dashboard; it gives me a lot of motivation to complete this massive undertaking.
