Khoa Nguyen

Side Quest Devblog #2: Virtual Insanity

Follow Along

BasicallyOk / deepfake-lip-sync

A Generative Adversarial Network that deepfakes a person's lip to a given audio source. The version shown at CUCAI 2023 is saved in the legacy branch.

deepfake-lip-sync


Current Progress

A gif showing the training progress of the deepfakes

File structure

Make sure you process the dataset to get this file structure:
deepfake-lip-sync
|---dataset
|  |---test
|  |  |---real
|  |  |---fake
|  |---train
|  |  |---real
|  |  |---fake
|  |---valid
|  |  |---real
|  |  |---fake
Within each of the test, train and valid folders, place the 320x320 images into their respective real and fake subfolders.
The code expects this structure to be available in the project root.

Saved models

Contact the team for the trained models. I am too broke for LFS.




The trademarked ramblings of a madman

So, where am I? Mentally, not great; I cannot imagine that staring at a loading screen while 200GB worth of data gets processed is fun for anybody. Going back to this project really hits home how inexperienced I was just a year ago. The data preprocessing pipeline is literally held together by hopes and dreams. In fact, the pipeline failed overnight and I have next to no clue whether it has processed everything. I suppose this is my fault: when I led the team, I did not really have a solid direction to start with; hell, I did not even know what a good pipeline looked like. I think the real challenge for this one is the sheer size of the data, as with most Computer Vision projects.

Nonetheless, I have loads of data, ready to start training.

Current Progress

The starter stuff is pretty much done. I wrote brand new code for training the model, based on this little project I did for the Neural Networks class:

BasicallyOk / Face-Compressor

Autoencoder for compressing and decompressing faces. Paper can be found at the link.

CISC452-Project

Investigating deep learning methods for compressing and decompressing faces

Results

Refer to the paper for more details.

Reconstructed images by the Deep Convolutional Generative Adversarial Network (DCGAN) and the originals

Reconstructed images by the Convolutional Variational Autoencoder and the originals

Instructions

Environment Setup

Skip 2 and 3 if you're lazy/a maniac.

  1. Go to project root
  2. Set up environment by running python -m venv .venv or python3 -m venv .venv
  3. Run source .venv/bin/activate on Linux and macOS, or .venv\Scripts\activate on Windows.
  4. Run pip install -r requirements.txt

Dataset

  1. Download the CelebA dataset
  2. Extract dataset to project root (optional, but will make your life easier)
  3. Rename the dataset folder to dataset (also optional, but the skeleton code assumes it's called dataset)
  4. Ensure there are no subfolders. The dataset folder should only contain images of faces.



Now only three more things have to be done before I get started on something new:

  • Data Preprocessing: You have to train on something. This blog post will essentially be this step in its entirety. As I am writing this little piece of text, my computer is pumping out face frames and the corresponding wav files from 200 GB worth of data from the internet. I'll talk about this in a bit.
  • Design the Model: Designing the GAN itself. The easiest part of the whole thing, in my opinion. I already have a vague idea of what the full thing is going to look like. I might go into this a little later.
  • Training, Testing and Iterating: Self-explanatory. It is the single most important step of the whole project. It will definitely take the longest (because, for all the experience I may claim to have, I am still a moron and have a lot to learn).

I call this the holy trinity of AI development. In other words, I've gotten about 10% of the way and am only getting started.

The Data

Here's the dataset I will be using:
Kaggle competition description
I find it extremely ironic that I am using data from a deepfake detection challenge as training data for a deepfake model. Nonetheless, like most AI trainers, and AIs themselves, I am very data-hungry, and I will take what I can get.
So here's the big question: "Why this dataset specifically?". Well, I'm glad you asked, because the answer is not as simple as you would think. See, this was not my first choice. This dataset isn't necessarily structured the way I would have liked it to be. Moreover, it is not the dataset used by the Wav2Lip and LipGAN papers. These are not necessarily bad things; however, being able to use LRS2 would have been such a godsend:

  • It is used by both Wav2Lip and LipGAN, which would allow us to establish a baseline with regards to performance.
  • Preprocessing code already exists in the vastness of the space known as Open Source (and Stack Overflow), as this is the single most well-known dataset for this type of task. But here's the thing: the BBC will not just give access to anybody. Here's a little snippet from their agreement:
What can I use it for? Provided you keep to these rules, the BBC gives its limited permission for the named researcher  (researcher) to use the BBC’s content free of charge for ‘Academic Research Purposes’ for this Project for 12 months starting on the date both you and the BBC sign this “Permission to Use for Researchers” form. You are responsible for your own costs in taking part in the Project. You must delete all BBC content when this permission ends – please see section 6 for more information.
‘Academic Research Purposes’ means solely these activities for this Project:
• using our content for individual research and private study;

• using a small number of extracts from our content for academic demonstrations  or academic conferences to invited, non-paying guests  to demonstrate your research for the Project in a technical context.

But only if the extracts do not include images of children, other people’s logos or trade marks, sports or music. If the extracts are of recognisable living adult individuals (either standalone or through a combination of images, text, analyses/summaries/interpretations), you must keep to a minimum your use of extracts that identify individuals and you must not use them in any way that might adversely affect those individuals. Ask the BBC Contact (the BBC person who signed this form) if you are not sure if you can use an image or extract;

• using a small number of still images (jpegs) or transcriptions of extracts from our content in academic papers, but you MUST ask the BBC Contact for permission first. You must tell the BBC Contact which images and extracts you would like to use. If the BBC gives its permission in writing, you can then include the permitted images and extracts in your academic paper and publish it in print or online as a pdf file  for the research community to illustrate the results of your research into the Project (but the images and extracts must never be used as part of an online demonstration).

Academic papers should be made available to the research community for individual research and private study purposes only, and must contain this accreditation wording: “For non-commercial individual research and private study use only. BBC content included courtesy of the BBC”. Creative Commons or similar licence terms must not be attached to the BBC content in academic papers;

• using our content to build a research prototype to compare with your organisation’s own existing technology. This work must be done separately from your organisation’s commercial product development. The BBC’s content must not be used for training any existing or new technology, algorithms or models that will be sold commercially. And it must not be used to train technology to improve or enhance your operational systems. It can only be used for comparative or benchmarking purposes.

Needless to say, a personal project does not count as any of these. We asked last year and they said no, even though we explained it was research for an organization funded by the university. Oh well, I suppose this data is pretty dangerous in the wrong hands. That was the situation, at least, until I came across the little Kaggle competition whose dataset I will be using. So all is good, it just takes a little extra... holy shit, it's over 470 GB. Let's take a look at my hard drive:

Poor storage

Guess downloading the whole thing will simply not be possible. Fortunately for me, the dataset comes in parts, so I just downloaded about half of it. At least there's a silver lining: I upgraded to a new 1TB SSD recently, as you can see up there. See, when I did this last year, I had about 100 GB to spare on my PC; in my infinite wisdom, I decided to train the model off of an external SSD. Why is this bad? Let's break down what the training process looks like in terms of memory:

  1. The data needs to be pulled from storage and loaded into memory. This is unavoidable overhead, and it happens every training iteration if the data is batched.
  2. The model weights are loaded onto VRAM.
  3. Train.

In step 1, a transfer happens every single training iteration because I am batching the data. To put this into perspective, there are 100 images per batch, each of size 256x256, and a total of 20000 training iterations, with 1 batch per iteration. So, taking each image to be $2^8 \times 2^8 \times 3 = 2^{16} \times 3$ bytes, we essentially get $100 \times 2^{16} \times 3$ bytes (roughly 20 MB) worth of transfer from storage to memory per iteration. Now, this would not be horrible through PCIe or even SATA, because of the high transfer speeds; but, alas, I used an external hard drive connected via USB 3.0. You can imagine how slow the whole process was. It took a total of 30 hours to finish training the model.

Of course, this is a pretty oversimplified view of the whole thing, but it does give you a sense of the scale of data we're working with. The new M.2 will definitely help in that regard. Plus, I am now moving to 64x64 images, as discussed in the previous post. This way, I can increase the batch size and potentially improve training speed a little by cutting down the per-iteration overhead.
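
If you want to sanity-check that arithmetic, here's a tiny back-of-the-envelope script. The assumptions are mine (uncompressed 8-bit RGB images and the 100-image batch size mentioned above); it just compares the per-iteration I/O of the old 256x256 setup against the new 64x64 plan:

```python
# Back-of-the-envelope: bytes transferred from storage to RAM per training iteration.
# Assumes uncompressed 8-bit RGB images and 100 images per batch (my numbers above).
BATCH_SIZE = 100
BYTES_PER_PIXEL = 3  # RGB, 1 byte per channel

def transfer_per_iteration(side: int) -> int:
    """Bytes pulled from storage for one batch of side x side images."""
    return BATCH_SIZE * side * side * BYTES_PER_PIXEL

for side in (256, 64):
    mib = transfer_per_iteration(side) / 2**20
    print(f"{side}x{side}: {mib:.1f} MiB per iteration")

# Output:
# 256x256: 18.8 MiB per iteration
# 64x64: 1.2 MiB per iteration  (16x less I/O per batch at the same batch size)
```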

The Dreaded Preprocessing

Kaggle is a godsend, and the Kaggle deepfake challenge is also a godsend for this little project. Nonetheless, things cannot be too perfect.

Here's the thing you guys

The Kaggle competition's data is in the form of videos. Well, that's all great, but our model cannot exactly interpret videos as-is. Therefore, we need to extract the data we want from them. To see exactly what we want, let's do a recap of what the models will look like:

  • The generator will take in: a reference masked image, a pose prior, and an audio spectrogram that corresponds to the masked reference.
  • The discriminator will take in: an image and its corresponding audio spectrogram (see the shape sketch below).
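
To make those inputs concrete, here's a minimal sketch of what one training batch would look like, borrowing the per-sample shapes from the legacy get_batch() docstring further down and the new 64x64 image size; treat the exact numbers as assumptions:

```python
import numpy as np

BATCH_SIZE = 100
IMG_SIZE = 64  # new target size; the legacy code used 256

# One batch, shaped the way the legacy get_batch() describes its outputs
batch = {
    # real frames the discriminator should call "synced"
    "image_samples": np.zeros((BATCH_SIZE, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32),
    # generator input: masked reference frame concatenated with a pose prior along channels
    "gen_input_samples": np.zeros((BATCH_SIZE, IMG_SIZE, IMG_SIZE, 6), dtype=np.float32),
    # spectrogram that matches the real frame
    "audio_samples": np.zeros((BATCH_SIZE, 6, 513, 1), dtype=np.float32),
    # spectrogram that matches the pose prior, used as the "unsynced" negative
    "unsynced_audio_samples": np.zeros((BATCH_SIZE, 6, 513, 1), dtype=np.float32),
}

for name, arr in batch.items():
    print(name, arr.shape)
```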

Now that we know what we want, we can split the video and audio into pairs of images and their corresponding spectrograms. Luckily, my team already wrote code to do this exact thing, so it is just a matter of refreshing my memory and waiting a small, insignificant 10 hours. Honestly, I half expected this, because outside of resizing to 64x64, it uses OpenCV to detect where the faces are and crop them out of the frame, both of which are probably more intensive than you'd think. If you or your child want to go into Computer Vision, ask yourself: "Am I/is my child patient?". If not, this is not for you. To be very fair, we are processing 200GB worth of data. I have not done this in any other project, but I will have to leave my computer running for the rest of the night. But I won't leave you waiting in suspense; here's an example picture of what we're working with (sorry it's a little small, it is 64x64 after all):

A 64x64 pre-processed image

And a random, unrelated spectrogram from one of the audio files:

An audio spectrogram
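
For the curious, the face frames above come out of a detect-crop-resize step. I'm not reproducing the team's actual preprocessing code here; this is just a minimal sketch of the general idea using OpenCV's bundled Haar cascade (the real pipeline may well use a different detector and extra cleanup):

```python
import cv2

# Haar cascade face detector that ships with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_face_frames(video_path, out_size=64):
    """Yield out_size x out_size face crops from every frame of a video."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            crop = frame[y:y + h, x:x + w]
            yield cv2.resize(crop, (out_size, out_size))
    cap.release()
```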

As I was working on this, it started to hit me that a potential reason for the slow training process could be that I was doing the .wav-to-spectrogram conversion as part of the training loop, as can be seen in the get_batch() function, which is run at the start of every training iteration. Here's a snapshot of how get_batch() processes audio, as it is written in the legacy train_GAN.ipynb file:

def get_batch(epoch):
    """
    Proper get_batch function.
    Requires a batches.json file to get here.
    The indexing of the output numpy arrays corresponds to one another
    Returns:
        - image_samples: the real images of shape (256, 256, 3)
        - gen_input_samples: pose prior concatenated with the masked real image of shape (256, 256, 6)
        - audio_samples: the audio snippet that corresponds to the real image of shape (6, 513, 1)
        - unsynced_audio_samples: the audio snippet that corresponds to the pose prior of shape (6, 513, 1). Meant to be used as the unsynced audio snippet.
    """

    image_samples = []
    gen_input_samples = [] # Specifically to work with Huy's data, should change
    audio_samples = []
    unsynced_audio_samples = []

    """
    Unrelated
    """

    for names in batch:
        """
        Unrelated
        """

        audio_arr = stft(os.path.join(DATASET_PATH, audio_name), 2 ** 10) # Generate the spectrogram out of audio segment.
        audio_samples.append(audio_arr)
        task.update()

        unsynced_audio_arr = stft(os.path.join(DATASET_PATH, unsynced_audio_name), 2 ** 10) # Generate the spectrogram out of audio segment.
        unsynced_audio_samples.append(unsynced_audio_arr)
        task.update()

    """
    Unrelated
    """
    audio_samples = np.asarray(audio_samples)
    unsynced_audio_samples = np.asarray(unsynced_audio_samples)
    print("Batch converted to numpy arrays. Returning")

    # It should go without saying that the order needs to match
    return image_samples, gen_input_samples, audio_samples, unsynced_audio_samples


When I get to the training step, I will do some testing and see if it was in fact the bottleneck. For now, I will not undo a whole night's worth of my computer's work just to regenerate the data.
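
If the STFT does turn out to be the bottleneck, the obvious fix is to compute every spectrogram once, cache it to disk, and have get_batch() just load the cached arrays. Here's a rough sketch of that idea; the stft() helper and DATASET_PATH are the ones from the legacy code above, while the .npy caching scheme is purely my assumption of how I'd do it, not something that exists in the repo:

```python
import os
import glob
import numpy as np

# stft() is the spectrogram helper from the legacy notebook, assumed importable here.

def cache_spectrograms(dataset_path, window_size=2 ** 10):
    """Precompute every spectrogram once, instead of calling stft() inside get_batch()."""
    for wav_path in glob.glob(os.path.join(dataset_path, "**", "*.wav"), recursive=True):
        out_path = wav_path.replace(".wav", ".npy")
        if os.path.exists(out_path):
            continue  # already cached
        np.save(out_path, stft(wav_path, window_size))

# Inside get_batch(), the expensive call
#     audio_arr = stft(os.path.join(DATASET_PATH, audio_name), 2 ** 10)
# would then become a cheap load:
#     audio_arr = np.load(os.path.join(DATASET_PATH, audio_name.replace(".wav", ".npy")))
```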

In conclusion, after a whole night's worth of preprocessing, here's what we end up with:

Wow, much data

Enough data to make Mark Zuckerberg cry (which is ironic, because Meta partly contributed to the dataset). We'll make changes to this once we're done. Now, I was going to discuss the model-making process very briefly and save the rest for part 3, but I accidentally published this early, so... oops.

An Unexpected End to an Unexpected Journey

As it turns out, the article was published against my will. Not sure what happened, but oh well. I had some plans for what I was going to say after this. Then again, I'm not sure who's really taking the time to listen to a 21-year-old dude in a basement trying to make Donald Trump sing Baka Mitai. Well, now that I've said it, maybe I'd be pretty interested if it wasn't me. At any rate, it is my article, and I like to speak my mind.

Before I head out to start writing part 3 and dissecting my godforsaken legacy code, I had some thoughts on why I am even writing an article. I guess this is an anchor to help me keep focus. It has done an amazing job so far; in fact, it's a prime example of why rubber ducking works.

Rubber Ducking it out

Maybe it is just me trying to find meaning in everything I do. There is a very big chance that I am just going insane and this is an outlet. This whole thing did start to feel like Charlie Day in that episode of "It's Always Sunny in Philadelphia" where he was being gaslit into thinking he was a genius. I am definitely getting hit by imposter syndrome big time, but sometimes I see value in gaslighting yourself into thinking you're smart. As much as your weird classmates might make you doubt it, I believe that you can, in fact, manifest things into happening if you really believe in them and act on them. So, if I keep my chin up and my ego high, maybe in a week y'all will see Donald Trump singing Baka Mitai, or see someone say something absolutely horrendous (or you'll just see Jin Sakai speaking Japanese fluently). Adios.

I manifested this
