<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Khoa Nguyen</title>
    <description>The latest articles on DEV Community by Khoa Nguyen (@basicallyok).</description>
    <link>https://dev.to/basicallyok</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1063302%2F238283d2-3af3-494a-98e3-9f3e01132f04.png</url>
      <title>DEV Community: Khoa Nguyen</title>
      <link>https://dev.to/basicallyok</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/basicallyok"/>
    <language>en</language>
    <item>
      <title>Side Quest #3: maybe the real Deepfakes were the friends we made along the way</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Tue, 21 May 2024 00:42:53 +0000</pubDate>
      <link>https://dev.to/basicallyok/side-quest-3-maybe-the-real-deepfakes-were-the-friends-we-made-along-the-way-2g47</link>
      <guid>https://dev.to/basicallyok/side-quest-3-maybe-the-real-deepfakes-were-the-friends-we-made-along-the-way-2g47</guid>
      <description>&lt;h2&gt;
  
  
  Follow Along
&lt;/h2&gt;

&lt;p&gt;If you're really liking what I'm doing here, maybe give me a star!&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BasicallyOk" rel="noopener noreferrer"&gt;
        BasicallyOk
      &lt;/a&gt; / &lt;a href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;
        deepfake-lip-sync
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Deepfake model for syncing a person's lip to a given piece of audio
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;deepfake-lip-sync&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A Generative Adversarial Network that deepfakes a person's lip to a given audio source&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Environment Variable&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Easiest method is to place a .env file in project root. Follow this format:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;BATCH_DATA_PATH=./batch_data/batches.json
GENERATED_IMAGES_PATH=./generated_images/
SAVED_MODELS_PATH=./saved_models/
DATASET_PATH=./dataset/train/
TEST_DATASET_PATH=./dataset/test/
MODEL_NAME="contrastive_loss_noisy20000"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Credits&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Some of the code was written in a different repository that has now been nuked.
I owe a lot of the data processing pipeline to &lt;a href="https://github.com/HueyAmiNhvtim" rel="noopener noreferrer"&gt;@HueyAmiNhvtim&lt;/a&gt; and &lt;a href="https://github.com/Blinco0" rel="noopener noreferrer"&gt;@Blinco0&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;&lt;br&gt;
skip the ramblings

&lt;h2&gt;
  
  
  What even is Side Quest?
&lt;/h2&gt;

&lt;p&gt;Programming does not really have to be complicated, but where's the fun in that? Personally, I've seen enough guides on the Internet titled "How to make ChatGPT in 3 Simple Steps!" with this thumbnail. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqi53ubu8pc3xrpv5vgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqi53ubu8pc3xrpv5vgg.png" alt="Mr Beast" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
Naturally, I'll avoid that style of tutorial. &lt;em&gt;YouTube tutorials are absolute beginner traps; learn how to read documentation instead&lt;/em&gt; (can't change my mind on this one). I mean, I'm not trying to make a guide, but I would be pretty proud of myself if you are reading this to learn from how I do things (or maybe how not to do things). If that is in fact the case for you, thank you, I'm very flattered. Anyways, the whole point of the series is a diary. I touched on this a little bit in part 2, if you haven't read that.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="/basicallyok" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1063302%2F238283d2-3af3-494a-98e3-9f3e01132f04.png" alt="basicallyok"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/basicallyok/side-quest-devblog-2-virtual-insanity-3ng1" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Side Quest Devblog #2: Virtual Insanity&lt;/h2&gt;
      &lt;h3&gt;Khoa Nguyen ・ May 1 '24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#computervision&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
So, in a sense, it is a gateway into the pink yarnball of neurons some would call my brain. Maybe it's a disruption of the status quo: the world needs more relatable software developers who show you how complicated our jobs actually are (maybe I'm just straight stupid, but hey, I'm learning). Who knows, maybe one day I'll make someone realize that the job is tough, and that they're not the only one.
&lt;h2&gt;
  
  
  If this is the Side Quest, what even is the Main Quest?
&lt;/h2&gt;

&lt;p&gt;To be honest, it's just a name. But if you really think about it, it's just life itself. Right now, I'm just taking a few steps back to get a few Side Quests done, gain some XP and, most importantly, refill my ego meter. I am very egotistical, but that stuff has been running low lately. Those who know me know that I run on this stuff like fuel.&lt;br&gt;
The Main Quest has been pretty shitty lately. I got my last paycheck as a Teaching Assistant the other day, so I decided I'd pull the trigger on a pair of Everlast Elite 2.0 Boxing Shoes. I've been boxing in a pair of worn-down Ultraboosts. I've gotten really used to the fact that they don't stick to the floor at all anymore (I absolutely annihilated the outsole at the foot pivot point), but an upgrade to make the whole experience nicer would be awesome. And guess what: Everlast ran out of stock, and I do not know when they'll be back. If you ask me, this is a sign from above. So maybe I'll just leave it for later and git gud without it first.&lt;/p&gt;
&lt;h2&gt;
  
  
  Back to the good stuff
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Since I imagine a common criticism of my blog is that I stray too much in the middle of a tech talk, I will enclose it in these quotation blocks so you know what to avoid. Of course, I will still keep straying from the topic because gosh dang it this is my blog and mine alone. Enough ramblings from me, I'll let the project do the talking from now on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A big part of this episode will be me revamping the input pipeline. I thought I was going to be working on the model, but plans changed. Like I've said maybe in Side Quest #1, the whole legacy code is held together by hopes and dreams. Naturally, I've grown out of that kind of mentality; now, I want to see something clean. So let's take a look at what's next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Port the original batcher into a batched Tensorflow Dataset. This is mostly so it works better with the tensorflow model I already made and gives the whole project some cohesion.&lt;/li&gt;
&lt;li&gt;Remake the training loop. As I have mentioned in Side Quest #1, the original loop does not look good and uses a lot of tensorflow's legacy stuff, so I'll move on from that and start clean.&lt;/li&gt;
&lt;li&gt;Redesign the model. Again, I was not a big fan of the original model; it is due for a change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Lots of work to be done, so sit back, grab some coffee and some popcorn; it's going to be a long day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasyhvvioklllul3e67yo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasyhvvioklllul3e67yo.gif" alt="Caffeine" width="498" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By the way, since my posts also double as game recommendations, Alan Wake 2 was one of the best games I played last year, only beaten out by Disco Elysium and Elden Ring.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The Bad Batch
&lt;/h2&gt;

&lt;p&gt;Batching is a pretty important thing to do when you have a lot of data. Originally, you would send in one piece of data at a time and have the neural network return an output by evaluating an equation. Once people figured out that the whole thing is really a system of equations and, eventually, linear algebra, everybody realized that you can send all the data in at once to do inference (batched inference). The same goes for gradient descent. It seems like a no-brainer, as it is vastly more parallelizable than plain one-sample gradient descent. Then again, full-batch training is usually not a great idea in practice for 2 main reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need all the data in memory at once, and RAM is expensive (unless you downloaded some for free, then power to you). &lt;/li&gt;
&lt;li&gt;Being parallelizable is great, but there are only so many parallel processes that can be run at a time by a computer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, as a compromise, we use mini-batches, where we do gradient descent on subsets of the data instead of the full dataset itself. To &lt;del&gt;convince you that I am pretty smart&lt;/del&gt; show how mini-batch gradient descent works graphically, here is a pretty cool visual that sums the whole thing up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8txlbswv1pmx6u5zb4wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8txlbswv1pmx6u5zb4wr.png" alt="Gradient Descent" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The picture's credits go to &lt;a href="https://sweta-nit.medium.com/batch-mini-batch-and-stochastic-gradient-descent-e9bc4cacd461" rel="noopener noreferrer"&gt;Sweta&lt;/a&gt; (I don't think SGD should be compared against mini-batch, as they do not describe the same thing; in fact, modern optimizers are mini-batch SGD). As you can see, there are fewer steps, and the steps are a lot cleaner than if you just use one datapoint at a time (again, the article refers to this as SGD, but single-datapoint updates aren't the point of SGD).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Very sorry if I was rambling too much about that article, I just need to make sure nobody is confused (but you might be more confused now, oh well)&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;
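To make the mini-batch idea above concrete, here is a tiny plain-Python sketch (the data, hyperparameters, and function name are all made up for illustration; the real project uses Tensorflow). It fits y = w * x to data generated with w = 2, doing one weight update per shuffled mini-batch:

```python
import random

# Toy data for the sketch: y = 2x, so the true weight is 2.0.
xs = [float(i) for i in range(1, 101)]
ys = [2.0 * x for x in xs]

def minibatch_sgd(xs, ys, batch_size=10, epochs=50, lr=1e-4, seed=0):
    """One update per mini-batch: not per datapoint, not per full pass."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # fresh order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of mean((w*x - y)**2) with respect to w, over the batch
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - lr * grad
    return w

print(round(minibatch_sgd(xs, ys), 3))  # converges to ~2.0
```

Each update only ever touches `batch_size` datapoints, which is exactly why the memory argument above works out.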

&lt;p&gt;Each of our images is about 8.5 kB, and we have about 210,000 of them (and that's without mentioning the audio clips). Needless to say, we're not doing full-batch training; we'll split it up into a few hundred images per mini-batch.&lt;/p&gt;
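Some quick back-of-the-envelope math on those numbers (the batch size of 256 is a hypothetical stand-in for "a few hundred", and these are on-disk sizes; decoded RGB tensors would be larger still):

```python
# Rough figures from the post: ~8.5 kB per image, ~210,000 images.
image_size_kb = 8.5
num_images = 210_000
batch_size = 256  # hypothetical mini-batch size

full_batch_gb = image_size_kb * num_images / 1_000_000
mini_batch_kb = image_size_kb * batch_size

print(f"full batch: ~{full_batch_gb:.2f} GB of images alone")  # ~1.79 GB
print(f"mini-batch: ~{mini_batch_kb:.0f} kB")                  # ~2176 kB
```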

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvgghkrzuwho8nwuru3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvgghkrzuwho8nwuru3i.png" alt="Image Property" width="452" height="313"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Pre-processing Pipeline still needs work
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi, this is Khoa a few days into the future. I just finished the Guardians of the Galaxy game and thought it was one of the most well-written games I've played in a while. The gameplay was definitely weak, but I got really invested in the story and the Guardians themselves. However, I am not a big fan of the ending, specifically how a character seemingly is not affected at all after some pretty rough trauma.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anywho, with the new idea for the batcher in mind, instead of working on the batcher, I went back and cleaned up the preprocessing pipeline (because I cannot help but be a perfectionist). The original had too many "roundabouts" in its design, so naturally, I got pretty frustrated just thinking about it. Although it still does roughly the same thing as before (maybe in a cleaner manner), I made one decently big change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The audio will no longer be segmented beforehand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may not make a difference to the training itself, but I think moving the segmentation into the batcher allows for a cleaner input pipeline. With the new system, the input pipeline no longer requires you to pre-segment the audio clips before feeding them to the model, which was a massive pain point when trying to use the model in the wild. A classic case of two birds with one stone.&lt;/p&gt;
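The idea of cutting audio at batch time rather than on disk can be sketched like this (everything here is hypothetical: the function name, the samples-per-frame figure, and the list stand-in for a decoded waveform are all made up for illustration):

```python
# Hypothetical sketch: slice the audio window for one video frame on the fly,
# so the full clip stays in one piece instead of being pre-segmented on disk.
def audio_segment_for_frame(audio, frame_index, samples_per_frame=640):
    """Return the window of raw samples that lines up with one video frame."""
    start = frame_index * samples_per_frame
    return audio[start:start + samples_per_frame]

clip = list(range(6400))  # stand-in for 10 frames' worth of decoded audio
window = audio_segment_for_frame(clip, 3)
print(len(window), window[0])  # 640 1920
```

At inference time the same slicing runs on whatever clip the user supplies, which is what removes the pre-segmentation pain point.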
&lt;h2&gt;
  
  
  The batcher-inator
&lt;/h2&gt;

&lt;p&gt;Theory is definitely very fun, but implementation is a wholly different story. Before, I had a function that generated a new batch of data every time it was called. This is fine, but it does not play well with Tensorflow, and especially Keras. Thus, I will be using &lt;a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset" rel="noopener noreferrer"&gt;&lt;code&gt;tensorflow.data.Dataset&lt;/code&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I used this once in my image compression project, and I can't really go back. It makes the piece of code responsible for training the model super compact as I will show you in the next episode when I work on the models themselves. It really tickles my brain when the code is clean. You will see much more of this obsessive-compulsive side of me when I eventually do a Side Quest for &lt;a href="https://danielyxie.github.io/bitburner/" rel="noopener noreferrer"&gt;bitburner&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
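For context, the old function-style batcher was roughly this shape (a hypothetical reconstruction, not the actual legacy code; the names are made up):

```python
import random

# A plain "call me for a batch" function. It works, but unlike tf.data,
# Keras cannot prefetch, shuffle, or parallelize it for you.
def next_batch(paths, batch_size, seed=None):
    rng = random.Random(seed)
    picks = rng.sample(paths, batch_size)
    # in the real pipeline, each picked path would be loaded and decoded here
    return [p.upper() for p in picks]  # stand-in for "load the file"

paths = [f"face_{i}.png" for i in range(1000)]
print(len(next_batch(paths, 4)))  # 4
```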

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrsq8o2khh3305aromp5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrsq8o2khh3305aromp5.gif" alt="me rn" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Dataset API is pretty awesome and allows me to write a batch generator that I can throw straight into the model's input. The entire pipeline can be summed up with a very clean piece of code, as seen here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batcher_from_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Return a tensorflow Dataset object that returns images and spectrograms as required.
    Partly inspired by https://github.com/keras-team/keras/blob/v3.3.3/keras/src/utils/image_dataset_utils.py

    Args:
        batch_size: The batch size.
        dataset_path: The path to the dataset folder which must contain the image folder and audio folder.
        shuffle: Whether to shuffle the dataset. Default to False.
        seed: The seed for the shuffle. Default to None.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;image_dataset_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# create the foundation datasets
&lt;/span&gt;    &lt;span class="n"&gt;og_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;original_image_path_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_dataset_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                                &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;og_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;og_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# repeat indefinitely
&lt;/span&gt;    &lt;span class="n"&gt;ref_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;ref_image_path_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_dataset_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                 &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                                                   &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;ref_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# repeat indefinitely
&lt;/span&gt;
    &lt;span class="c1"&gt;# create the input datasets
&lt;/span&gt;    &lt;span class="n"&gt;og_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;og_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                        &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;masked_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;og_image_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_masked_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                        &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ref_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                        &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;audio_spec_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;og_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_audio_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                        &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unsync_spec_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_audio_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                        &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ensure shape as tensorflow does not accept unknown shapes
&lt;/span&gt;    &lt;span class="n"&gt;og_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;og_image_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensure_shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMAGE_SHAPE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;masked_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;masked_image_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensure_shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MASKED_IMAGE_SHAPE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ref_image_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_image_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensure_shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMAGE_SHAPE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;audio_spec_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audio_spec_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensure_shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AUDIO_SPECTROGRAM_SHAPE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;unsync_spec_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unsync_spec_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensure_shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AUDIO_SPECTROGRAM_SHAPE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# multi input using https://discuss.tensorflow.org/t/train-a-model-on-multiple-input-dataset/17829/4
&lt;/span&gt;    &lt;span class="n"&gt;full_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;masked_image_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_image_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_spec_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsync_spec_dataset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;og_image_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# if shuffle:
&lt;/span&gt;    &lt;span class="c1"&gt;#     full_dataset = full_dataset.shuffle(buffer_size=batch_size * 8, seed=seed) # not sure why buffer size is such
&lt;/span&gt;
    &lt;span class="c1"&gt;# batch
&lt;/span&gt;    &lt;span class="n"&gt;full_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;full_dataset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
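&lt;p&gt;If the zip-and-batch pattern above looks opaque, here is a plain-Python analogue of what &lt;code&gt;tf.data.Dataset.zip&lt;/code&gt; followed by &lt;code&gt;batch&lt;/code&gt; produces. The string elements are made-up placeholders standing in for real tensors:&lt;/p&gt;

```python
# Plain-Python analogue of tf.data.Dataset.zip((inputs...), target).batch(2).
masked = ["m0", "m1", "m2", "m3"]   # masked_image_dataset
ref    = ["r0", "r1", "r2", "r3"]   # ref_image_dataset
target = ["t0", "t1", "t2", "t3"]   # og_image_dataset (the training target)

# zip pairs element i of every stream into ((inputs...), target) tuples
pairs = list(zip(zip(masked, ref), target))

# batch then groups consecutive pairs into fixed-size chunks
batch_size = 2
batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
```

&lt;p&gt;Each model step therefore receives a tuple of input batches plus a matching target batch, which is exactly the structure Keras expects for multi-input training.&lt;/p&gt;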



&lt;p&gt;I think we can all appreciate clean code, and this is a good example of it. It is fairly efficient: the pipeline is compiled beforehand and generates batches on the fly, reducing the load on my poor RTX 3070 Ti with a measly 8GB of VRAM (and on my 32GB of DDR4 memory, which was 16GB just a few months ago). If anything, there seems to be a big bottleneck in the audio spectrogram generation process. Previously, &lt;code&gt;audio_spec_dataset&lt;/code&gt; was the only spectrogram &lt;code&gt;Dataset&lt;/code&gt; object needed, so each epoch generated only one spectrogram per wav file. However, I recently updated the discriminator to also punish deepfakes with bad mouth synchronization. This is done by taking a random piece of audio from a different spot in the video and giving it to the discriminator along with the true image (which cannot be synced to that audio, because it is not the same spot). For simplicity, the unsynced audio corresponds to the reference face, not the masked face to generate onto. Before the update, each epoch took around 80 seconds to complete; after it, each takes 150 seconds. Since the model itself was not changed (and even a model change would not cost this much), I suspect the spectrogram generation is the culprit. Since we're dealing with thousands of epochs, 150 seconds per epoch is pretty outrageous. This will hopefully be handled by the time Side Quest #4 (a full guide on how to make a deepfake model) drops.&lt;/p&gt;
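&lt;p&gt;One way to attack that bottleneck would be to pay the wav-to-spectrogram cost once, up front, and reuse the cached arrays across epochs. A minimal numpy sketch; the framed-FFT helper below is a stand-in for whatever STFT routine the real pipeline uses, and the clip names are hypothetical:&lt;/p&gt;

```python
import numpy as np

def make_spectrogram(wav, frame_len=1024, hop=256):
    # Magnitude spectrogram via a simple framed FFT; a stand-in for the
    # real wav-to-spectrogram routine in the pipeline.
    frames = [wav[i:i + frame_len]
              for i in range(0, len(wav) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# Pay the conversion cost once, before training starts...
clips = {"clip_a": np.random.randn(16000), "clip_b": np.random.randn(16000)}
spec_cache = {name: make_spectrogram(w) for name, w in clips.items()}

# ...so the per-epoch loop only does cheap dictionary lookups.
for epoch in range(3):
    spec = spec_cache["clip_a"]
```

&lt;p&gt;With a 1024-sample frame this even yields the 513 frequency bins the pipeline's spectrogram shapes use; a &lt;code&gt;tf.data&lt;/code&gt; version of the same idea would be to call &lt;code&gt;cache()&lt;/code&gt; on the spectrogram datasets after the expensive map.&lt;/p&gt;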

&lt;h2&gt;
  
  
  Visions from the Future
&lt;/h2&gt;

&lt;p&gt;So, it has been quite a while since &lt;a href="https://dev.to/basicallyok/side-quest-devblog-2-virtual-insanity-3ng1"&gt;the last episode&lt;/a&gt;, almost 3 weeks to be exact. I had to move and deal with all of the problems stemming from that (&lt;a href="https://dev.to/basicallyok/lifelike-toolkit-devblog-2-mining-for-diamonds-1li"&gt;history&lt;/a&gt; really does have a way of repeating itself).&lt;br&gt;
I actually started writing this blog right before the last episode dropped; I simply never got it to a point where it felt good enough to publish. In any case, I have completed the architecture and trained a few prototype models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ragsi3mjphdgp614r8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ragsi3mjphdgp614r8u.png" alt="Sample Generation" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These look surprisingly nice, though the original model I made for CUCAI set a pretty low bar, if you remember. Judging from these test generations, the model clearly has a favorite: one person in the picture evidently appears far more often than the others in the training data. Their generations blend into the masked image surprisingly well, with spot-on skin color and orientation.&lt;/p&gt;

&lt;p&gt;On the other hand, the results are still quite blurry and showed minimal improvement past epoch 700 with a batch size of 100. I suspect this is because the model is not big enough to "accommodate" the complexity of the data. The model currently sits at around 1 million parameters, which is shockingly small compared to modern LLMs and the like; I believe the Wav2Lip model was about 30 million parameters. However, with great parameter count comes great data size and training time, and with the current problems in the batching pipeline, scaling up would be very time-consuming. I mean, just deciding when to train the model gets me planning out my computer's schedule days ahead. It already takes just under 40 hours to train 1000 epochs, so I will need to figure out how to improve that soon. Another problem I have run into is the discriminator's performance on the test set: it is extremely good at sniffing out deepfakes there. As you can see in the picture above, it is also returning values larger than 1, which should be impossible, as the training clearly specifies 0 as real and 1 as fake. I need to check whether it does the same on the train set, as that could pose a problem. Moreover, there is a strange "line" that appears at the bottom of the image. I hypothesize that this is because shoulders are visible in a lot of the images: the model has no information on the color of a person's clothes, so the line trends toward a darker color, the average of that region. A possible fix is to provide that information via a few rows of pixels at the bottom, but that would force me to redesign the model architecture, so it is preferably something done later.&lt;/p&gt;
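&lt;p&gt;On the out-of-range scores: if the discriminator's final layer is linear, nothing constrains its output to [0, 1] even when it is trained against 0/1 targets. A sigmoid on the final activation would guarantee scores strictly between 0 and 1. A tiny numpy illustration with hypothetical logit values:&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw, unbounded outputs that a linear final layer could produce
logits = np.array([-3.2, 0.0, 1.4, 4.7])

# Squashed scores always fall strictly between 0 (real) and 1 (fake)
scores = sigmoid(logits)
```

&lt;p&gt;The trade-off is that a saturated sigmoid can slow GAN training, which is presumably why some architectures keep raw logits and fold the squashing into the loss instead.&lt;/p&gt;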

&lt;p&gt;Anyways, I might go on another hiatus, as I want Side Quest #4 to be the conclusion of the project. I figure it should be in the style of a traditional guide (contrary to what I said earlier), based on what I've learned over the course of the series. So look forward to that! If you read all the way to here: thank you. I really appreciate that view count ticking up on my dashboard; it gives me a lot of motivation to complete this massive undertaking.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Side Quest Devblog #2: Virtual Insanity</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Wed, 01 May 2024 00:05:52 +0000</pubDate>
      <link>https://dev.to/basicallyok/side-quest-devblog-2-virtual-insanity-3ng1</link>
      <guid>https://dev.to/basicallyok/side-quest-devblog-2-virtual-insanity-3ng1</guid>
      <description>&lt;h2&gt;
  
  
  Follow Along
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BasicallyOk" rel="noopener noreferrer"&gt;
        BasicallyOk
      &lt;/a&gt; / &lt;a href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;
        deepfake-lip-sync
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Deepfake model for syncing a person's lip to a given piece of audio
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;deepfake-lip-sync&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A Generative Adversarial Network that deepfakes a person's lip to a given audio source&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Environment Variable&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Easiest method is to place a .env file in project root. Follow this format:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;BATCH_DATA_PATH=./batch_data/batches.json
GENERATED_IMAGES_PATH=./generated_images/
SAVED_MODELS_PATH=./saved_models/
DATASET_PATH=./dataset/train/
TEST_DATASET_PATH=./dataset/test/
MODEL_NAME="contrastive_loss_noisy20000"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Credits&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Some of the code were written in a different repository that has now been nuked.
I owe a lot of the data processing pipeline to &lt;a href="https://github.com/HueyAmiNhvtim" rel="noopener noreferrer"&gt;@HueyAmiNhvtim&lt;/a&gt; and &lt;a href="https://github.com/Blinco0" rel="noopener noreferrer"&gt;@Blinco0&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h2&gt;
  
  
  The trademarked ramblings of a mad man
&lt;/h2&gt;

&lt;p&gt;So, where am I? Mentally, not great; I cannot imagine staring at a loading screen while 200GB worth of data churns through being fun for anybody. Coming back to this project really hits home how inexperienced I was just a year ago. The data preprocessing pipeline is literally held together by hopes and dreams. In fact, the pipeline failed overnight, and I have next to no clue whether it processed everything. I suppose this is my fault: when I led the team, I did not really have a solid direction to start with; hell, I did not even know what a good pipeline looks like. I think the real challenge here, as with most Computer Vision projects, is the sheer size of the data.&lt;/p&gt;

&lt;p&gt;Nonetheless, I have loads of data, ready to start training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Progress
&lt;/h2&gt;

&lt;p&gt;The starter stuff is pretty much done. I wrote brand new code for training the model, based on this little project I did for the Neural Networks class:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BasicallyOk" rel="noopener noreferrer"&gt;
        BasicallyOk
      &lt;/a&gt; / &lt;a href="https://github.com/BasicallyOk/Face-Compressor" rel="noopener noreferrer"&gt;
        Face-Compressor
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Autoencoder for compressing and decompressing faces. Paper can be found at the link.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;CISC452-Project&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;Investigating deep learning methods for compressing and decompressing faces&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Results&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;Refer to the &lt;a href="https://docs.google.com/document/d/1AIiq580SGVutih1R2eXlf_lk4IIEy8pPtDl6SzSJS_o/edit" rel="nofollow noopener noreferrer"&gt;paper&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;Reconstructed Images by the Deep Convolutional Generative Adversarial Network
&lt;a rel="noopener noreferrer" href="https://github.com/BasicallyOk/Face-Compressor/visualization/dcgan.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FBasicallyOk%2FFace-Compressor%2Fvisualization%2Fdcgan.png" alt="Reconstructed Images by the DCGAN and the originals"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Reconstructed Images by the Convolutional Variational Autoencoder
&lt;a rel="noopener noreferrer" href="https://github.com/BasicallyOk/Face-Compressor/visualization/cvae.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FBasicallyOk%2FFace-Compressor%2Fvisualization%2Fcvae.png" alt="Reconstructed Images by the DCGAN and the originals"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Instructions&lt;/h1&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Environment Setup&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;Skip 2 and 3 if you're lazy/a maniac.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to project root&lt;/li&gt;
&lt;li&gt;Set up environment by running &lt;code&gt;python -m venv .venv&lt;/code&gt; or &lt;code&gt;python3 -m venv .venv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;source .venv/bin/activate&lt;/code&gt; on Linux and macOS, &lt;code&gt;.venv\Scripts\activate&lt;/code&gt; on Windows.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Dataset&lt;/h3&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;Download the &lt;a href="https://drive.google.com/drive/folders/0B7EVK8r0v71pbWNEUjJKdDQ3dGc?resourcekey=0-B5NA6Xcog-KfbFaNG5rUuQ&amp;amp;usp=drive_link" rel="nofollow noopener noreferrer"&gt;celebA dataset&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Extract dataset to project root (optional, but will make your life easier)&lt;/li&gt;
&lt;li&gt;Rename the dataset folder to &lt;code&gt;dataset&lt;/code&gt; (also optional, but the skeleton code assumes it's called dataset)&lt;/li&gt;
&lt;li&gt;Ensure there are no subfolders. The &lt;code&gt;dataset&lt;/code&gt; folder should only contain images of faces.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BasicallyOk/Face-Compressor" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Now only 3 more things have to be done before I get started on something new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Preprocessing&lt;/strong&gt;: You have to train on something. This blog will essentially be this step in its entirety. As I am writing this little piece of text, my computer is pumping out face frames and the corresponding wav files from 200GB worth of data from the internet. I'll talk about this in a bit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design the Model itself&lt;/strong&gt;: Designing the GAN itself. The easiest part of the whole thing, in my opinion. I already have a vague idea of what the full thing is going to look like. I might go into this a little later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training, Testing and Iterating&lt;/strong&gt;: Self-explanatory. It is the single most important step of the whole project. It definitely will take the longest amount of time (because with all the experience I may claim to have, I am still a moron and have a lot to learn).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I call this the holy trinity of AI development. In other words, I've gotten about 10% of the way and am only getting started.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;Here's the &lt;a href="https://www.kaggle.com/competitions/deepfake-detection-challenge/data" rel="noopener noreferrer"&gt;dataset I will be using&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmrxsv6h0zxo7e06p0lp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmrxsv6h0zxo7e06p0lp.png" alt="Kaggle competition description" width="800" height="780"&gt;&lt;/a&gt;&lt;br&gt;
I find it extremely ironic that I am using data for a deepfake detection challenge as training data for a deepfake model. Nonetheless, like most AI trainers, and AIs themselves, I am very data hungry, and I will take what I can get. &lt;br&gt;
So here's the big question: "Why this dataset specifically?". Well, I'm glad you asked, because the answer is not as simple as you would think. See, this was not my first choice. This dataset isn't necessarily structured the way I would have liked it to be. Moreover, it is not the dataset used by the &lt;a href="https://arxiv.org/abs/2008.10010" rel="noopener noreferrer"&gt;Wav2Lip&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2003.00418" rel="noopener noreferrer"&gt;LipGAN&lt;/a&gt; papers. These are not necessarily bad things; however, being able to use &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html" rel="noopener noreferrer"&gt;LRS2&lt;/a&gt; would have been such a godsend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is used by both Wav2Lip and LipGAN, which allows us to establish a performance baseline.&lt;/li&gt;
&lt;li&gt;Preprocessing code already exists out in the vastness of Open Source (and Stack Overflow), as this is the single most well-known dataset for this type of task.
But here's the thing: the BBC does not just give access to anybody. Here's a little snippet from their agreement:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What can I use it for? Provided you keep to these rules, the BBC gives its limited permission for the named researcher  (researcher) to use the BBC’s content free of charge for ‘Academic Research Purposes’ for this Project for 12 months starting on the date both you and the BBC sign this “Permission to Use for Researchers” form. You are responsible for your own costs in taking part in the Project. You must delete all BBC content when this permission ends – please see section 6 for more information.
‘Academic Research Purposes’ means solely these activities for this Project:
• using our content for individual research and private study;

• using a small number of extracts from our content for academic demonstrations  or academic conferences to invited, non-paying guests  to demonstrate your research for the Project in a technical context.

But only if the extracts do not include images of children, other people’s logos or trade marks, sports or music. If the extracts are of recognisable living adult individuals (either standalone or through a combination of images, text, analyses/summaries/interpretations), you must keep to a minimum your use of extracts that identify individuals and you must not use them in any way that might adversely affect those individuals. Ask the BBC Contact (the BBC person who signed this form) if you are not sure if you can use an image or extract;

• using a small number of still images (jpegs) or transcriptions of extracts from our content in academic papers, but you MUST ask the BBC Contact for permission first. You must tell the BBC Contact which images and extracts you would like to use. If the BBC gives its permission in writing, you can then include the permitted images and extracts in your academic paper and publish it in print or online as a pdf file  for the research community to illustrate the results of your research into the Project (but the images and extracts must never be used as part of an online demonstration).

Academic papers should be made available to the research community for individual research and private study purposes only, and must contain this accreditation wording: “For non-commercial individual research and private study use only. BBC content included courtesy of the BBC”. Creative Commons or similar licence terms must not be attached to the BBC content in academic papers;

• using our content to build a research prototype to compare with your organisation’s own existing technology. This work must be done separately from your organisation’s commercial product development. The BBC’s content must not be used for training any existing or new technology, algorithms or models that will be sold commercially. And it must not be used to train technology to improve or enhance your operational systems. It can only be used for comparative or benchmarking purposes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Needless to say, a personal project does not count as any of these. We asked last year and they said no, even though we explained it was research for an organization funded by the university. Oh well; I suppose this data is pretty dangerous in the wrong hands. That was where things stood, at least, until I came across the little Kaggle competition whose dataset I will be using. So all is good, it just takes a little extra... holy shit, it's over 470GB. Let's take a look at my hard drive:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxovobkcdjit1wiaibkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxovobkcdjit1wiaibkf.png" alt="Poor storage" width="672" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Guess using the whole thing will simply not be possible. Fortunately for me, they allow partial downloads, so I just grabbed about half of the full dataset. At least there's a silver lining: as you can see up there, I recently upgraded to a new 1TB SSD. See, when I did this last year, I had about 100GB to spare on my PC; in my infinite wisdom, I decided to train the model off of an external SSD. Why is this bad? Let's break down what the training process looks like in terms of memory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The training code pulls a batch of data from storage and loads it into memory. This is unavoidable overhead, and it happens on every training iteration since the data is batched.&lt;/li&gt;
&lt;li&gt;The training code loads the model onto VRAM.&lt;/li&gt;
&lt;li&gt;Train&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In step 1, a transfer happens every single training loop because I am batching the data. To put this into perspective: there are 100 images per batch, each of size 256x256, and a total of 20000 training iterations, with 1 batch per iteration. Each image is 2^8 * 2^8 * 3 = 2^16 * 3 bytes, so we essentially get 100 * 2^16 * 3 bytes (just under 20MB) worth of transfer from storage to memory per iteration. Now, this would not be horrible over PCIe or even SATA, because of their high transfer speeds; but, alas, I used an external hard drive connected via USB 3.0. You can imagine how slow the whole process was. It took a total of 30 hours to finish training the model.&lt;/p&gt;
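&lt;p&gt;That back-of-envelope arithmetic is easy to sanity-check, and it also shows what smaller images buy per iteration:&lt;/p&gt;

```python
# Bytes moved from storage per iteration, for a batch of uint8 RGB images.
def batch_bytes(side, batch_size=100, channels=3):
    return batch_size * side * side * channels

big   = batch_bytes(256)  # 100 * 2**16 * 3 = 19,660,800 bytes per iteration
small = batch_bytes(64)   # the same batch at 64x64
ratio = big // small      # factor by which the transfer shrinks
```

&lt;p&gt;Moving from 256x256 to 64x64 cuts the per-iteration transfer by a factor of 16, before even touching batch size.&lt;/p&gt;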

&lt;p&gt;Of course, this is a pretty oversimplified view of the whole thing, but it does give you a sense of the scale of the data we're working with. The new M.2 drive will definitely help in that regard. Plus, I am now moving to 64x64 images, as discussed in the previous post. This way, I can increase the batch size and slightly improve training speed by reducing the per-iteration overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Dreaded Preprocessing
&lt;/h2&gt;

&lt;p&gt;Kaggle is a godsend, and the Kaggle deepfake challenge is also a godsend for this little project. Nonetheless, things cannot be too perfect. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq8jmckqilxsb4g7f8dn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq8jmckqilxsb4g7f8dn.png" alt="Here's the thing you guys" width="168" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kaggle competition's data is in the form of videos. Well, that's all great, but our model cannot exactly interpret videos as-is. Therefore, we need to extract the data we want from them. To see exactly what we want, let's recap what the models will look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The generator will take in: a reference masked image, a pose prior, and an audio spectrogram that corresponds to the masked reference.&lt;/li&gt;
&lt;li&gt;The discriminator will take in: an image and its corresponding audio spectrogram.&lt;/li&gt;
&lt;/ul&gt;
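&lt;p&gt;Concretely, the two image inputs to the generator can be stacked along the channel axis into a single 6-channel tensor, which matches the &lt;code&gt;gen_input_samples&lt;/code&gt; shape in &lt;code&gt;get_batch&lt;/code&gt;. A quick numpy sketch, assuming the channels-last layout used elsewhere in the post:&lt;/p&gt;

```python
import numpy as np

H = W = 64  # current working resolution

masked_face = np.zeros((H, W, 3), dtype=np.float32)  # mouth region masked out
pose_prior  = np.zeros((H, W, 3), dtype=np.float32)  # reference frame for pose

# Stack along the channel axis so the generator sees one (H, W, 6) input
gen_input = np.concatenate([masked_face, pose_prior], axis=-1)
```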

&lt;p&gt;Now that we know what we want, we can split the video and audio into pairs of images and their corresponding spectrograms. Luckily, my team already wrote code to do this exact thing, so it is just a matter of refreshing my memory and waiting a small, insignificant &lt;strong&gt;10 hours&lt;/strong&gt;. Honestly, I half expected this: outside of resizing to 64x64, the pipeline uses OpenCV to detect where the faces are and cut them out of the frame, both of which are probably more intensive than you'd think. If you or your child want to go into Computer Vision, ask yourself: "Am I/is my child patient?". If not, this is not for you. To be very fair, we are processing 200GB worth of data. I have not done this in any other project, but I will have to leave my computer running for the rest of the night. But I won't leave you waiting in suspense; here's an example picture of what we're working with (sorry it's a little small, it is 64x64 after all):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ck7tkrji55yaqyy5pg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ck7tkrji55yaqyy5pg6.png" alt="A 64x64 pre-processed image" width="64" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And a random, unrelated spectrogram from one of the audio files:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc17431zts2jsq8tpecw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc17431zts2jsq8tpecw5.png" alt="An audio spectrogram" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I was working on this, it started to hit me that a potential reason for the slow training process could be that I was doing the .wav to spectrogram translation as part of the training loop, as can be seen in the &lt;code&gt;get_batch()&lt;/code&gt; function, which is run at the start of every training iteration. Here's a snapshot of how &lt;code&gt;get_batch()&lt;/code&gt; processes audio as it is written in the legacy &lt;code&gt;train_GAN.ipynb&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Proper get_batch function.
    Requires a batches.json file to get here.
    The indexing of the output numpy arrays corresponds to one another
    Returns:
        - image_samples: the real images of shape (256, 256, 3)
        - gen_input_samples: pose prior concatenated with the masked real image of shape (256, 256, 6)
        - audio_samples: the audio snippet that corresponds to the real image of shape (6, 513, 1)
        - unsynced_audio_samples: the audio snippet that corresponds to the pose prior of shape (6, 513, 1). Meant to be used as the unsynced audio snippet.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;image_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;gen_input_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="c1"&gt;# Specifically to work with Huy's data, should change
&lt;/span&gt;    &lt;span class="n"&gt;audio_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Unrelated
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Unrelated
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;audio_arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Generate the spectrogram out of audio segment.
&lt;/span&gt;        &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_arr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;unsynced_audio_arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsynced_audio_name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Generate the spectrogram out of audio segment.
&lt;/span&gt;        &lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsynced_audio_arr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Unrelated
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;audio_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch converted to numpy arrays. Returning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# It should go without saying that the order needs to match
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen_input_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
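&lt;p&gt;If the STFT-in-the-loop theory holds, the obvious fix is to precompute each spectrogram once and cache it on disk, so &lt;code&gt;get_batch()&lt;/code&gt; only has to load arrays. A minimal sketch, with &lt;code&gt;stft()&lt;/code&gt; as a zero-returning placeholder standing in for the project's real function:&lt;/p&gt;

```python
import numpy as np

# Placeholder with the same call shape as the project's stft(); it returns
# zeros here purely so the sketch is self-contained.
def stft(path, window):
    return np.zeros((6, 513, 1), dtype=np.float32)

def cache_spectrogram(wav_path, out_path, window=2 ** 10):
    # Compute the spectrogram once, up front, and persist it.
    np.save(out_path, stft(wav_path, window))

def load_spectrogram(out_path):
    # np.save appends the .npy suffix automatically.
    return np.load(out_path + ".npy")

cache_spectrogram("clip0.wav", "clip0")
print(load_spectrogram("clip0").shape)  # (6, 513, 1)
```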



&lt;p&gt;When I get to the training step I will do some testing and see whether this was in fact the bottleneck. For now, I won't throw away my computer's all-night work just to regenerate the data.&lt;/p&gt;

&lt;p&gt;In conclusion, after a whole night's worth of preprocessing, here's what we end up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ad3zoccnkeucagx5ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ad3zoccnkeucagx5ui.png" alt="Wow, much data" width="349" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enough data to make Mark Zuckerberg cry (this is ironic because Meta partly contributed to the dataset). We'll make changes to this once we're done. &lt;del&gt;Now, I will discuss the model making process very briefly, and save the rest for part 3&lt;/del&gt; I accidentally published this early so oops.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Unexpected End to an Unexpected Journey
&lt;/h2&gt;

&lt;p&gt;As it turns out, the article was published against my will. Not sure what happened, but oh well. I had some plans for what I was going to say after this. Then again, I'm not sure who's really taking the time to hear out a 21-year-old dude in the basement trying to make Donald Trump sing Baka Mitai. &lt;em&gt;Well, now that I've said it, maybe I'd be pretty interested if it wasn't me&lt;/em&gt;. At any rate, it is my article, and I like to speak my mind.&lt;/p&gt;

&lt;p&gt;Before I head out to start writing part 3 and dissecting my godforsaken legacy code, I had some thoughts on why I'm even writing an article at all. I guess it's an anchor to help me keep focus. It has done an amazing job so far; in fact, it's a prime example of why rubber ducking works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6dvo3slv4oaqp3rtzzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6dvo3slv4oaqp3rtzzl.png" alt="Rubber Ducking it out" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe it is just me trying to find meaning in everything I do. There is a very big chance that I am just going insane and this is an outlet. This whole thing did start to feel like Charlie Day in that episode of "It's Always Sunny in Philadelphia" where he was being gaslit into thinking he was a genius. I am definitely getting hit by imposter syndrome big time, but sometimes I see value in gaslighting yourself into thinking you're smart. As much as your weird classmates might make you doubt it, I believe that you can, in fact, manifest things into happening if you really believe in them and act on them. So, if I keep my chin up and my ego high, maybe in a week y'all will see Donald Trump singing Baka Mitai, or someone saying something absolutely horrendous (&lt;em&gt;or you'll just see Jin Sakai speaking Japanese fluently&lt;/em&gt;). Adios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd67sipdc02ttd4mhuy0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd67sipdc02ttd4mhuy0.gif" alt="I manifested this" width="240" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Side Quest Devblog #1: These Fakes are getting Deep</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Tue, 30 Apr 2024 00:28:55 +0000</pubDate>
      <link>https://dev.to/basicallyok/side-quest-devblog-1-these-fakes-are-getting-deep-3l2b</link>
      <guid>https://dev.to/basicallyok/side-quest-devblog-1-these-fakes-are-getting-deep-3l2b</guid>
      <description>&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/BasicallyOk" rel="noopener noreferrer"&gt;
        BasicallyOk
      &lt;/a&gt; / &lt;a href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;
        deepfake-lip-sync
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Deepfake model for syncing a person's lip to a given piece of audio
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;deepfake-lip-sync&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A Generative Adversarial Network that deepfakes a person's lip to a given audio source&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Environment Variable&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Easiest method is to place a .env file in project root. Follow this format:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;BATCH_DATA_PATH=./batch_data/batches.json
GENERATED_IMAGES_PATH=./generated_images/
SAVED_MODELS_PATH=./saved_models/
DATASET_PATH=./dataset/train/
TEST_DATASET_PATH=./dataset/test/
MODEL_NAME="contrastive_loss_noisy20000"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Credits&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Some of the code was written in a different repository that has since been nuked.
I owe a lot of the data processing pipeline to &lt;a href="https://github.com/HueyAmiNhvtim" rel="noopener noreferrer"&gt;@HueyAmiNhvtim&lt;/a&gt; and &lt;a href="https://github.com/Blinco0" rel="noopener noreferrer"&gt;@Blinco0&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/BasicallyOk/deepfake-lip-sync" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;h2&gt;
  
  
  Return of the King
&lt;/h2&gt;

&lt;p&gt;So hi, I'm back. It has been a while, and I suppose a lot has happened. I actually wrote a draft for Lifelike Devblog #3 but never got around to finishing it; I guess that was a pretty massive case of writer's block. But the most important thing is: I'm back.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsd67il0vq6d61m83xek.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnsd67il0vq6d61m83xek.gif" alt="Return of the King" width="468" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What has happened in the meantime? I basically just finished my final undergraduate exam ever. Last summer when I was making Lifelike, it felt like any other summer: opportunities, sunlight, beaches, vacations, dread about the future, and for some of us, maybe a side hustle. Now, the whole thing has a sense of finality to it. For the first time in my life, I don't have a plan for September. I don't have a job offer lined up because the market is in the gutter. Maybe I should give academia another shot, but it can be pretty unfair to international students.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcjf6zejilx0oj9g616j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcjf6zejilx0oj9g616j.png" alt="inhaling copious amount of copium (pun intended)" width="307" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project
&lt;/h2&gt;

&lt;p&gt;Either way, there is a silver lining to all this: time to dig into my project backlog. It is also worth mentioning that this is not a one-devblog project; this post simply looks at what I've done so far, and there will be a part 2. First on the menu is &lt;strong&gt;deepfakes&lt;/strong&gt;. Ah yes, how immoral of me. Before you judge, though: it is not meant to be good, I am simply challenging myself and getting something fun done in the process. Still, to justify myself, I will introduce you to a great game that I really enjoy: Ghost of Tsushima (if you love this, you will probably also love the new Shogun show). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngvty8l071wpf4ombs31.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngvty8l071wpf4ombs31.gif" alt="Ghost of Tsushima" width="400" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As much as I love this game, especially the Japanese dub, the game was clearly made to be played in English: the lip movement syncs to the English voices, not the Japanese ones. This absolutely took me out of the experience (and is why I think anime is better subbed). Spoiler alert for the ending in the video below.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/49rT5PNC3G0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Naturally, I started thinking about what could be done about it. So in the summer of 2022, I decided that the team I was putting together for this little club at college called QMIND (Queen's AI club, essentially) would be working on a deepfake lip-syncing AI. What I turned in for the biggest undergraduate conference in AI (CUCAI 2023) was &lt;a href="https://github.com/BasicallyOk/deepfake-lip-sync/tree/legacy" rel="noopener noreferrer"&gt;less than desirable&lt;/a&gt;. But it worked, and it attracted a lot of attention (I have Mustafa's mutilated deepfaked persona to thank for that). Don't get me wrong, the team was the best I could have asked for, but not much could really keep up with my ambitions for the project. Here's what we're working with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7897lwhll8q8c09tnhv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7897lwhll8q8c09tnhv.gif" alt="Yikes" width="256" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I regret writing this at 3 in the morning. Needless to say, I think Mustafa looks better than this even on his worst day, so I'm making it my mission to touch him up a little. Before I address what went wrong (which is a lot), I'll do a walkthrough of the current state:&lt;/p&gt;

&lt;h2&gt;
  
  
  Current State
&lt;/h2&gt;

&lt;p&gt;First things first, we're working with a neural network, specifically of the Generative Adversarial variety. Think of it as a rivalry between two models, the Generator and the Discriminator. The Generator does exactly as advertised, while the Discriminator tries to figure out whether the Generator's output is real or fake (or, more generally, scores it against some objective). I like to think of the Discriminator as a trainable objective (or loss) function. &lt;/p&gt;

&lt;p&gt;The following code was based very roughly on the paper &lt;a href="https://arxiv.org/abs/2008.10010" rel="noopener noreferrer"&gt;"A Lip Sync Expert is all you need for Speech to Lip Generation in the Wild"&lt;/a&gt; (the paper behind Wav2Lip, a very popular deepfake model), with a lot of elements taken from the authors' previous paper &lt;a href="https://arxiv.org/pdf/2003.00418" rel="noopener noreferrer"&gt;"Towards Automatic Face-to-Face Translation"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Within the code, the Generator is essentially an autoencoder stack that takes in two inputs: the spectrogram of the audio (sounds crazy, but yes, you can use computer vision techniques to process audio) and a stacked image (I'll explain it, and why I won't do it again, soon enough) made from the frame we are trying to lip-sync plus a random reference image; it then spits out the lip-synced frame via the Convolutional Transpose layers. Here is the model I ran with, based on this architecture: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fgenerator_plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fgenerator_plot.png" alt="Generator Plot" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Discriminator is pretty similar: it takes in two inputs, an image and a spectrogram, and we train it to recognize how well the image fits the piece of audio. To do this I simply compute the L2 distance between the embeddings of the two inputs. I ran with this model:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fdiscriminator_plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fdiscriminator_plot.png" alt="Discriminator Plot" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
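&lt;p&gt;The sync score itself is just an L2 (Euclidean) distance between the two embeddings. A tiny numpy illustration; the 256-dimensional embedding size is an assumption for the example, not the model's actual width:&lt;/p&gt;

```python
import numpy as np

# L2 distance between a face embedding and an audio embedding: small when
# the discriminator thinks the pair is in sync, large when it does not.
def l2_distance(face_emb, audio_emb):
    return np.sqrt(np.sum((face_emb - audio_emb) ** 2))

rng = np.random.default_rng(0)
emb = rng.normal(size=256).astype(np.float32)

print(l2_distance(emb, emb))        # 0.0 for a perfectly matched pair
print(l2_distance(emb, emb + 1.0))  # grows as the pair drifts apart
```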

&lt;p&gt;For training, we simply let the Generator generate something (it starts out as random noise). We then compute two loss values: the Mean Absolute Error against the image it was supposed to generate, and the Discriminator's output. We perform gradient descent on the Generator using the combined loss, and on the Discriminator based on how much it got wrong about the synchronicity. I recorded the generation of the same two faces on their respective audio pieces and put it into a gif (each frame is every 1000 epochs, I believe). I am fairly certain the way I recorded losses is faulty, so I intend to redo it properly this time.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.githubusercontent.com%2Fmedia%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fdeepfake.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.githubusercontent.com%2Fmedia%2FBasicallyOk%2Fdeepfake-lip-sync%2Flegacy%2Fresults%2Fdeepfake.gif" alt="The training process" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
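&lt;p&gt;As a rough numpy sketch (not the actual Keras code), the combined generator loss looks something like MAE plus a weighted adversarial term. The &lt;code&gt;adv_weight&lt;/code&gt; value and the log form of the adversarial term are illustrative assumptions:&lt;/p&gt;

```python
import numpy as np

# Schematic combined generator loss: pixel-wise MAE against the target frame
# plus a weighted adversarial term derived from the discriminator's score.
def combined_gen_loss(generated, target, disc_score, adv_weight=0.1):
    mae = np.mean(np.abs(generated - target))
    # disc_score is the discriminator's "fakeness" estimate in [0, 1];
    # in this codebase's convention, the generator wants it driven toward 0.
    adv = -np.log(1.0 - disc_score + 1e-7)
    return mae + adv_weight * adv

target = np.full((64, 64, 3), 0.5, dtype=np.float32)
print(combined_gen_loss(target, target, disc_score=0.5))
```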
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you are an expert in this field, you may already see a problem or two with my implementation. If so, I would appreciate a comment. If not, here goes:&lt;/p&gt;
&lt;h3&gt;
  
  
  The Generator's Weird Inputs
&lt;/h3&gt;

&lt;p&gt;The one part of the Wav2Lip paper that I don't really like is that they stack the reference top half of the face "on top of" the pose prior (my name for the same face in a different pose). The main issue is that this takes up unnecessary space, and thus artificially inflates the model with useless parameters: if you mask the bottom half of a face of size 256x256, you get an image of size 256x128, which you then have to stack onto another 256x256 image, resulting in an array of size 256x256x6, a fourth of which will be 0 no matter what. There is a good chance that gradient descent will not catch this and drive the parameters corresponding to that area of the input to 0. This may explain why the model tends to plateau before it even generates a realistic human picture.&lt;/p&gt;
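&lt;p&gt;You can verify the "a fourth of which will be 0" claim directly:&lt;/p&gt;

```python
import numpy as np

# Channel-stacking a bottom-masked 256x256 reference onto a full pose prior
# leaves a quarter of the combined input identically zero.
ref = np.ones((256, 256, 3), dtype=np.float32)
ref[128:, :, :] = 0.0  # mask out the bottom half of the face
pose_prior = np.ones((256, 256, 3), dtype=np.float32)

stacked = np.concatenate([ref, pose_prior], axis=-1)
print(stacked.shape)                 # (256, 256, 6)
print(np.mean(stacked == 0.0))       # 0.25
```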
&lt;h3&gt;
  
  
  The Generator's Loss Calculations
&lt;/h3&gt;

&lt;p&gt;The generator currently has only one loss function: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean Absolute Error compared to the image it was supposed to generate.
This can be seen in the code for the train step:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen_imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsynced_audio_samples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;#with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
&lt;/span&gt;    &lt;span class="n"&gt;real&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gd_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;gen_imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Very hacky, without this, the discriminator cannot be trained
&lt;/span&gt;    &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trainable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="c1"&gt;# train discriminator
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt; &lt;span class="c1"&gt;# Make sure the discriminator will not be too strong
&lt;/span&gt;        &lt;span class="n"&gt;disc_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;unsync_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;roll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;disc_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;unsync_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;roll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;real_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;qual_disc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trainable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt; &lt;span class="c1"&gt;# Do not allow the discriminator to be trained
&lt;/span&gt;    &lt;span class="c1"&gt;# train generator
&lt;/span&gt;    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_on_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;gen_imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disc Loss: {} Real Loss: {} Unsync Loss: {} Total Gen Loss: {} MAE: {} Adv Loss: {}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;disc_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;real_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsync_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;disc_loss&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;real_loss&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;unsync_loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You might already see the problem: the generator is not trying to fool the discriminator, or at least I don't think it is. The loss graph disputes this a little, and the discriminator's output is used in the generator's code, but since I was a novice at the time, a lot could have gone wrong. To potentially fix this, I will rewrite much of the code with TensorFlow's functional API, which let me do this exact thing without problems in my Facial Image Compression project.&lt;/p&gt;
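&lt;p&gt;To make "fooling the discriminator" concrete, here is a toy, framework-free sketch of the objective the generator should be minimizing. The helper names and the adversarial weight are my own illustration, not values from the repo:&lt;/p&gt;

```python
import math

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for a single scalar prediction in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def generator_loss(gen_pixels, target_pixels, disc_scores_on_fakes, adv_weight=0.07):
    """Total generator loss: pixel reconstruction (MAE) plus an adversarial
    term that rewards making the frozen discriminator score the generated
    frames as real (label 1.0). The adversarial term is what makes the
    generator actually try to fool the discriminator."""
    n = len(gen_pixels)
    mae = sum(abs(g - t) for g, t in zip(gen_pixels, target_pixels)) / n
    adv = sum(bce(s, 1.0) for s in disc_scores_on_fakes) / len(disc_scores_on_fakes)
    return mae + adv_weight * adv, mae, adv
```

&lt;p&gt;If the adversarial term is missing, or its gradient never reaches the generator, the generator degrades into a plain autoencoder, which matches the symptom described above.&lt;/p&gt;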

&lt;ul&gt;
&lt;li&gt;The L2 distance (L2 norm, i.e. Euclidean distance) between the generated image's embedding and the audio clip's embedding, both of which come from encoders trained as part of the Discriminator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, mathematically (using LaTeX so I seem smart):&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;l2=∑i=1N(ai−bi)2
l2 = \sqrt{\sum_{i=1}^{N} (a_{i}-b_{i})^2} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord 
mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is translated into code as seen here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# L2-normalize the encoding tensors
&lt;/span&gt;&lt;span class="n"&gt;image_encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;l2_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_encoding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;audio_encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;l2_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_encoding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find euclidean distance between image_encoding and audio_encoding
# Essentially trying to detect if the face is saying the audio
# Will return nan without the 1e-12 offset due to https://github.com/tensorflow/tensorflow/issues/12071
&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;image_encoding&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;audio_encoding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;euclidean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;discriminator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discriminator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Long Train Time
&lt;/h3&gt;

&lt;p&gt;Unfortunately, I do not want to run the whole thing for 30 hours again just to time one full epoch. However, when I did this a year ago, I left a little comment in the code saying the main bottleneck is the batch generation function. I know I said that 2023 me was stupid, but in hindsight, 2023 me had a job the previous summer and I don't. Therefore, I am inclined to trust him on this.&lt;/p&gt;
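&lt;p&gt;For what it's worth, the usual cure for a slow batch-generation function is to overlap it with training, e.g. &lt;code&gt;tf.data&lt;/code&gt; with &lt;code&gt;prefetch&lt;/code&gt; in TensorFlow. Here is a minimal stdlib sketch of the same producer-consumer idea (names are mine, not from the repo):&lt;/p&gt;

```python
import queue
import threading

def prefetching_batches(make_batch, n_batches, buffer_size=4):
    """Build batches on a background thread so batch t+1 is being
    prepared while the model trains on batch t."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for i in range(n_batches):
            q.put(make_batch(i))  # blocks when the buffer is full
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            return
        yield batch
```

&lt;p&gt;The buffer is deliberately bounded so a fast producer cannot fill memory with pre-built batches.&lt;/p&gt;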

&lt;h3&gt;
  
  
  Disgusting Code
&lt;/h3&gt;

&lt;p&gt;Self-explanatory&lt;/p&gt;

&lt;h3&gt;
  
  
  Does not Synchronize well as a video
&lt;/h3&gt;

&lt;p&gt;Remember when I said this is a rough adaptation of the Wav2Lip model? It's rough, 100%. The novel feature that Wav2Lip introduces is a second Discriminator: the Lip-Sync Expert, a pre-trained model (as opposed to traditional discriminators, which are trained alongside the generator). Needless to say, I did not do that. To address this, I think I will change the batch generation code to load the frame directly preceding the target in the timeline as the pose prior, so the generator hopefully learns to produce a frame that follows naturally from the one before it.&lt;/p&gt;
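&lt;p&gt;That batch-pairing change is simple enough to sketch. This is my planned logic, not code from the current repo: each target frame is paired with the frame immediately before it as the pose prior.&lt;/p&gt;

```python
def pose_prior_pairs(frame_paths):
    """Pair each target frame with its immediately preceding frame, so the
    generator is conditioned on temporal context and (hopefully) produces
    frames that follow on smoothly from one another."""
    # The first frame has no predecessor, so pairs start at index 1.
    return [(frame_paths[t - 1], frame_paths[t]) for t in range(1, len(frame_paths))]
```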

&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;Based on all of the above, I think I know where to start. To address the long train time, I will decrease the image size to 64x64. When I built the facial compression GAN a few months ago, I found that the smallest size at which a face with a distinguishable mouth still reads is around 54x54. Nonetheless, my lucky number is 2, and 2 to the power of 6 is 64, so 64x64 it is. I will take all the help I can get with this project, so Lady Luck will be a guest at my house this week. I will also revamp the model definitions themselves with the TensorFlow functional API, which I will discuss further in part 2. This lets my custom training loop run in Keras' graph mode and achieve the best performance possible. Plus, it looks clean as hell. Finally, I will change the inputs of the applicable models to take the reference image and the pose prior separately, instead of stacked, to avoid the useless parameters mentioned above.&lt;/p&gt;
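&lt;p&gt;Beyond luck, the reason a power of two like 64 is convenient is that stride-2 downsampling layers halve the resolution cleanly all the way down. A quick sketch:&lt;/p&gt;

```python
def downsample_chain(size, min_size=4):
    """Resolutions visited by repeated stride-2 downsampling, stopping
    before the feature map gets smaller than min_size."""
    sizes = [size]
    while size % 2 == 0 and size // 2 >= min_size:
        size //= 2
        sizes.append(size)
    return sizes
```

&lt;p&gt;64 halves cleanly five times (64, 32, 16, 8, 4), while a size like 54 stalls after one halving, which is part of why 64x64 is the more convenient choice.&lt;/p&gt;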

&lt;p&gt;If anything, I am mostly relying on the fact that I have more experience working on this kind of project now, and can probably do a lot better starting from scratch. If it works, well hey, that's a free ego boost, because I'll know I have improved a lot in a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank you for Reading
&lt;/h2&gt;

&lt;p&gt;I don't exactly know how many people read my blogs, because to be completely honest it's mostly me rambling about some frankenstein of a creation I have made. For those who are returning, thank you. I don't write these with an audience in mind, but I do hope I either gave you some extra insight or simply inspired you to sit down and make something awesome. I apologize for the mess that was the Lifelike Devblog release schedule, though; Mustafa and I simply did not feel like the idea had baked in the oven long enough. Lifelike was cool, and I think I will come back to it after I am done with this little project of mine. Honestly, I was really burned out by the end of Nights&amp;amp;Weekends; while I think of myself as a half-decent programmer, I am definitely not a good salesman, and it showed there. Plus, with the power of almighty hindsight, I think 2023 me was pretty shit. I have learned a lot since then and realized a lot of the mistakes I made at the time. At the end of the day, maybe that was the real win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visions of the Future
&lt;/h2&gt;

&lt;p&gt;Nonetheless, I have major ambitions for the interrogation game and I would definitely like it to be made. Artificial Intelligence is absolutely awesome, but it gets a lot of flak from the art community (and for good reason). With the interrogation demo I want to make the point that it should be used as a tool for game devs, not a replacement. I mean, that was the whole idea behind Lifelike. But before big ambitions come into play, it's the little devblogs, the little projects and the little things we do along the way that really make it all worth it. Take your eyes off the future and just take on your little side quests, and see where they go.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Lifelike Toolkit Devblog #2: Mining for Diamonds</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Fri, 12 May 2023 00:48:43 +0000</pubDate>
      <link>https://dev.to/basicallyok/lifelike-toolkit-devblog-2-mining-for-diamonds-1li</link>
      <guid>https://dev.to/basicallyok/lifelike-toolkit-devblog-2-mining-for-diamonds-1li</guid>
      <description>&lt;p&gt;&lt;a href="https://discord.gg/GPrBT9RzEn" rel="noopener noreferrer"&gt;&lt;strong&gt;Brand New&lt;/strong&gt; Discord Channel!&lt;/a&gt; Join to keep track of our progress&lt;br&gt;
&lt;a href="https://github.com/lifelike-toolkit/lifelike" rel="noopener noreferrer"&gt;Lifelike Github Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/lifelike-toolkit/cooloverprotectivedad-demo" rel="noopener noreferrer"&gt;Cool Overprotective Dad AI Visual Novel Demo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/lifelike-toolkit/interrogation-demo" rel="noopener noreferrer"&gt;&lt;strong&gt;In Progress&lt;/strong&gt; AI Interrogation Demo - The Dinner Party Case&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/mustafa_tariqk" rel="noopener noreferrer"&gt;Mustafa on Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1652896917506981888-549" src="https://platform.twitter.com/embed/Tweet.html?id=1652896917506981888"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;Hey everyone, it's been a while. Super sorry for dropping this 2 weeks late; life kind of got in the way. After changing how long it has been since the supposed drop date of the blog around 3-4 times, I'm really hoping this is the one. As an international student who just finished his junior year in university, I live a "nomadic" lifestyle. I move every so often, which gets in the way of my work. As of writing this lengthy introduction, I do not have access to the Internet, which, interestingly, does provide quite a perspective on how reliant I have become on it. Unfortunately, my housemates and I did not plan this out very well, so it will be like this for another few days. I suppose it does give me extra motivation to focus all my attention on Lifelike (&lt;em&gt;but I'm not 100% sure what coding without documentation and StackOverflow is going to look like&lt;/em&gt;). Nonetheless, it is not just me that ceased to function properly without access to the Internet. It was quite a rude awakening this morning when I tried to tune my guitar with a phone app I have used for the past 5 years, only to find out that the app requires the Internet. For the moment though, I am going through Breath of the Wild again, and the game still feels as good as it did years ago (but some of the magic is gone since I know pretty much everything already).&lt;/p&gt;

&lt;p&gt;I don't suppose a Breath of the Wild review is what you are here for, so let's look at what we have done so far. First and foremost, we've finally packaged the toolkit! You can now install it with &lt;code&gt;pip install git+https://github.com/lifelike-toolkit/lifelike.git&lt;/code&gt;. Mustafa did a great job explaining how to use the toolkit in the &lt;a href="https://github.com/lifelike-toolkit/lifelike" rel="noopener noreferrer"&gt;repository itself&lt;/a&gt;. Meanwhile, a guide/documentation for the Sequence Manager will come out soon. Currently, with Mustafa's &lt;em&gt;Brain&lt;/em&gt; component, you can set up a chatbot, or have multiple NPCs talk and interact with each other. For both use cases, one can set up background contexts for the characters, as well as the main context of their conversation. On the other hand, my Sequence Tree is more or less done, but I'm still doing some testing and game development work myself so that I can streamline the Sequence Manager. In my opinion, actually trying to make a game out of it helps me find a good usage flow for the toolkit. Of course, thanks to the Human-Computer Interaction course I took, I know that I am not the user, but it's pretty difficult to get people into this, to be completely honest. For now, this is a great option.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1653949111517806592-74" src="https://platform.twitter.com/embed/Tweet.html?id=1653949111517806592"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;As of right now, the Sequence Manager controls an abstracted representation of a Sequence Tree. In this tree, each node is a Sequence Event, which stores the corresponding game state and any additional information, while each edge is a Path Embedding, which stores an embedding derived from user input. In Cool Overprotective Dad, the embedding is a sentiment embedding produced from user input by a sentiment analysis model. With this design, there are quite a lot of things you can use it for.&lt;/p&gt;
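&lt;p&gt;In case the node/edge description feels abstract, here is a minimal sketch of that structure in Python. The class and field names are my own illustration of the idea, not the toolkit's actual API:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class SequenceEvent:
    """A node: a game state plus any additional information."""
    name: str
    context: str = ""
    paths: list = field(default_factory=list)  # outgoing PathEmbedding edges

@dataclass
class PathEmbedding:
    """An edge: the embedding a player's input must match to traverse it."""
    embedding: tuple
    target: SequenceEvent

# Hypothetical example using the Windhelm scenario from the first blog
gate_open = SequenceEvent("Guard opens gate to Windhelm",
                          context="Guard will now open the gates to Windhelm")
start = SequenceEvent("Outside Windhelm")
start.paths.append(PathEmbedding(embedding=(0.9, 0.1), target=gate_open))
```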

&lt;p&gt;If you're new here (and even if you're a frequent reader, in which case, I really appreciate that), Lifelike is designed to work like an improv drama session. Unlike most AI games that have come out so far, Lifelike allows for the human touch when it comes to story. To explain this in terms of the improv metaphor, you can imagine the Brain as the actors, while the Sequence Manager is the director telling the actors where they should push the story. Since I love talking about Skyrim here, I will also explain it in terms of my original "Get into Windhelm" scenario from the first blog. The developer can define the ways to get into Windhelm using the Sequence Manager: making the guard like you, or explaining to him that you support their cause. When the player approaches the guard and wins him over, the Sequence Manager tells the Brain that the "Guard opens gate to Windhelm" event has been reached via the "Guard likes Player" path. The Sequence Manager then gives the Brain the "Guard will now open the gates to Windhelm" context, causing the Brain to behave accordingly. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnops7ch5r0w744a32kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnops7ch5r0w744a32kv.png" alt="Skyrim's city of Windhelm" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does not end there. I recently finished the code for an AI-driven visual novel (although there's no "visual" right now) named &lt;em&gt;Cool Overprotective Dad&lt;/em&gt;. This is a demo I made to demonstrate the use of the Sequence Manager, as the game was made exclusively with this tool. Before I start peeling off the layers of buzzwords, I want to give a special shout-out to my roommate and great friend Connor Killingbeck for writing a banger of a story specifically for the game. In the game, the story progresses based on how you talk to characters. Specifically, the game's narrative relies on the emotion you convey, which leads to a Disco Elysium-esque story (I will not claim it will be as good as Disco Elysium, though). For example, in the encounter with the Bouncer in Chapter 1, he asks the player how they are feeling that night. If the player responds with excitement, the bouncer immediately lets them through. Conversely, if the player appears negative, the game ends, as the player cannot enter the club. An interesting option we included is the Neutral path (no emotion conveyed by the player), where the bouncer asks the player for ID before letting them into the club. To make this work, each event is something that happens in the game as a result of the player's input, while each path is a sentiment embedding. Every time the player speaks, the game processes the sentiment in the input, gets the embedding, and determines which of the pre-determined events is most likely to happen. This is all done using the Sequence Manager. If you want to try the game out (only Chapter 1 is out at the moment; Connor is finishing the rest), it will take some work to set up (follow the guide on &lt;a href="https://github.com/lifelike-toolkit/cooloverprotectivedad-demo" rel="noopener noreferrer"&gt;the repository&lt;/a&gt;), but trust me, it's totally worth it. 
If you don't have that kind of time, here's a video of me playing through Chapter 1 (I did not get to all the endings in this video; maybe you can):&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/kJ795cImzr0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Now that's really awesome, but how does it perform? Surprisingly well, although it struggles if the player's input relies on pre-established context. This can be attributed to the sentiment analysis model, as it is only meant for sentences that can be taken out of context; it's just a quick one I found on the Internet. I also made some weird design choices in this project, as I was unsure of what Lifelike was going to look like with Mustafa's work. Originally, the whole thing was supposed to be a browser-only game running on ONNX, but we quickly found out that was infeasible unless I wanted to translate core functionality to Javascript. So, for now, the inference is done client-side, but the game itself is stored server-side in a vector database. Outside of that, I also included an embedding tuner that allows the developer to input a list of prompts and get the average embedding back. This helped greatly, as I was able to program events a lot more easily. The whole process took about a few hours. In the future, I want to go back to this project, write a whole guide on how I managed to do it as a Lifelike tutorial, clean up the game, and make a simple terminal version in Python.&lt;/p&gt;
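&lt;p&gt;The embedding tuner boils down to averaging vectors component by component. A minimal sketch of the idea (my own version, not the actual tuner code):&lt;/p&gt;

```python
def tune_embedding(prompt_embeddings):
    """Average several prompt embeddings into one representative embedding
    for an event. All prompts are assumed to have the same dimension."""
    n = len(prompt_embeddings)
    dim = len(prompt_embeddings[0])
    return [sum(vec[i] for vec in prompt_embeddings) / n for i in range(dim)]
```

&lt;p&gt;Feeding in several phrasings of the same intent and storing the mean makes the event's path embedding less sensitive to any single wording.&lt;/p&gt;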

&lt;p&gt;If you are interested in how the game looked a few weeks ago, below is a video of an extremely early beta from a showcase.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/6Zh5OdjN1Lc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;So, what are we working on right now? As of writing this little blog, we are working on a brand new demo that showcases the entire toolkit in action (Cool Overprotective Dad only has an AI-driven story; the characters do not really respond to contextual clues). This demo (tentatively called &lt;em&gt;Interrogation&lt;/em&gt; for now) involves you, a detective, trying to get a suspect to admit guilt. It's pretty much inspired by Ace Attorney and L.A. Noire (as I talked about in the previous blog). Recently, I have also started to think about the &lt;em&gt;Brooklyn Nine-Nine&lt;/em&gt; episode "The Box", where Jake has to interrogate a very intelligent murderer and try to get a confession. I really want to create a game that can capture the kind of emotional manipulation Jake used to get an admission of guilt at the end of the episode. In this game, the player is given a PDF case file (very similar to how &lt;em&gt;Keep Talking and Nobody Explodes&lt;/em&gt; works) that details crime scene evidence, witness testimony, CCTV descriptions and other clues. Using this information, the player has to ask the suspect questions, and tell them things that will push them to either intentionally or accidentally admit guilt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8atsv5m46uuyl07vjia9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8atsv5m46uuyl07vjia9.png" alt="Jake's Interrogation in the Brooklyn Nine-Nine episode " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what's different from Cool Overprotective Dad? This game is not a graphic novel written entirely by a human. Instead, it's 50-50 this time around. The story beats will still be written by a human, so we don't lose that human touch and artistry. However, the dialogue itself is completely written by AI, allowing dynamic dialogue that actually responds to what the player has said. The game will be built using both the Sequence Manager and the Brain of the Lifelike toolkit to make both of these true. The way I want to build it is to first build a Sequence Tree with the story that I want, determining the personality, emotional responses, and game events that correspond to what the players say. The Brain, on the other hand, will handle how the suspect responds verbally given the current story event. Essentially, think of an improv film director (the Sequence Manager) telling the actors (the Brain) how to push the story with a script. Unlike Cool Overprotective Dad though, the Sequence Manager will not run on sentiment analysis here, but rather on context embeddings, possibly using LLAMA. The Sequence Manager will also be changed slightly to work more as a knowledge encyclopedia for the game. For example, an event is a topic, which includes information about the topic itself, as well as how the suspect should respond to a question or statement about it. This is a super simplified view of the actual code, but I will make sure there is a guide to recreate the game, so some dev out there (you, maybe?) can make their own scenarios.&lt;/p&gt;

&lt;p&gt;If that is too tough to visualize, here's a little play-by-play of an example case in the game. Imagine the suspect is being interrogated for stealing their neighbour's Amazon packages. The case file includes a description of a person in a Maple Leafs hoodie stealing the package, eyewitness testimony that the package was stolen at 8 am on the previous Thursday, and an eyewitness who saw the suspect jogging in the neighbourhood at 9 am that same day. The Sequence Manager essentially stores these topics as events, and provides the corresponding path embeddings (provided by LLAMA) required to reach them. In the Maple Leafs topic, the suspect is instructed to be very excited to talk about the hockey team and how much of a fan they are. When asked about 8 am on the previous Thursday, however, they will lie and say they were not in the neighbourhood until 10 am that day. If the player has read the case file, they will realize that an eyewitness saw the suspect at 9 am, so the player can call them out on it. The suspect should then feel intimidated into admission if the player successfully invokes that topic (even passive-aggressively). Each of these "context clues" is stored in the Sequence Manager. As the game proceeds, a new component, named the Game Director, takes the user's input, embeds the context and sees which topic the player is trying to invoke. The topic's context clues are then passed to the Brain via the Game Director, generating the suspect's response.&lt;/p&gt;
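&lt;p&gt;The Game Director step described above, matching the player's utterance embedding to the closest topic, can be sketched with plain cosine similarity. This is an illustration of the planned logic with made-up names, not the demo's actual code:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route_topic(player_embedding, topic_embeddings):
    """Return the topic whose path embedding is most similar to the
    player's input; its context clues would then be handed to the Brain."""
    return max(topic_embeddings,
               key=lambda t: cosine(player_embedding, topic_embeddings[t]))
```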

&lt;p&gt;Right now, I am still experimenting with what the model we chose is capable of with the toolkit. It's pretty incredible how far AI has come to be able to generate these sorts of responses. There are some issues sticking out, though. The responses take quite a while to generate, even on my beefy desktop, and even longer on Mustafa's laptop. Hopefully, by the time the game actually comes out, we'll have found a way to handle this little issue. Furthermore, every once in a while, the agent goes a little insane.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch3du6652il0oliolslp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch3du6652il0oliolslp.png" alt="The detective is trying really hard" width="800" height="958"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For context, the above snippet is from when I was experimenting with Lifelike's Brain tool. In this scenario, the smiley face icon is a detective investigating the murder of a woman, while the robot icon, the suspect and the victim's brother, is trying to hide the fact that he's guilty. The Detective is trying to get a confession out of the Suspect, which leads to this hilarious conversation. If you want to try the same thing, here's the code snippet I used (courtesy of Mustafa):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;This module demonstrated how to make a terminal chatbot with multiple characters.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit_chat&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.llms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlamaCpp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lifelike&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize LLM
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlamaCpp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ggml-model-q4_0.bin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Characters
&lt;/span&gt;&lt;span class="n"&gt;characters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Characters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;characters.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;name1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter name of character 1: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;name2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter name of character 2: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;background1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter background of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;background2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter background of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CONTEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter context: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter first speaker: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_utterance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter first utterance: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Conversations
&lt;/span&gt;&lt;span class="n"&gt;conversations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conversations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;conversations.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_utterance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_page_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;page_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lifelike brain demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;page_icon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🧠&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lifelike brain demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Start chatbot
&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;streamlit_chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;streamlit_chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;is_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_speaker&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get it to work, you need to &lt;code&gt;pip install git+https://github.com/lifelike-toolkit/lifelike.git streamlit streamlit_chat&lt;/code&gt; for lifelike and streamlit, as well as the llama-cpp model itself, which might be difficult to get. However, if you have the technical know-how, you should be able to replace it with the LLM of your choice without much trouble. We'll make sure to streamline this process down the line.&lt;/p&gt;
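&lt;p&gt;For the "replace it with the LLM of your choice" part, the essential interface in the snippet above is just "callable that maps a prompt string to a completion string". Here's a hedged sketch of that idea with a made-up stub class (&lt;code&gt;EchoLLM&lt;/code&gt; is not part of lifelike or LangChain; a real swap would drop in another model wrapper with the same calling convention):&lt;/p&gt;

```python
class EchoLLM:
    """A stand-in LLM for experimentation: any object that maps a
    prompt string to a completion string can slot into the demo in
    place of LlamaCpp. This class is invented for illustration."""

    def __call__(self, prompt: str) -> str:
        # A real model would generate text here; we just echo part of the prompt.
        return f"[stub completion for: {prompt[:40]}]"

# Usage: pass this wherever the demo expects `llm`
llm = EchoLLM()
print(llm("Where were you at 8 am last Thursday?"))
```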

&lt;p&gt;As for the development of Cool Overprotective Dad, I have passed it off to a web developer I know who is infinitely more skilled and knowledgeable at designing a frontend than I am. Connor is also on vacation and writing the rest of the game in the meantime, so hopefully you can expect the full game within the next few weeks.&lt;/p&gt;

&lt;p&gt;This blog has taken quite a bit of time to write, and I suspect it might take quite a bit to read as well. So thank you so much for sticking with me and Mustafa as we work on one of the most ambitious projects for a bunch of undergrads. I really hope that the interrogation demo and Cool Overprotective Dad can be released as soon as possible. There are some difficulties with Cool Overprotective Dad's release, so for now, if you want to play it, use the guide in &lt;a href="https://github.com/lifelike-toolkit/cooloverprotectivedad-demo" rel="noopener noreferrer"&gt;its repository&lt;/a&gt;. Otherwise, if you want to follow our development, follow &lt;a href="https://twitter.com/mustafa_tariqk" rel="noopener noreferrer"&gt;Mustafa on Twitter&lt;/a&gt; and join our Discord channel! Again, thank you so much for reading, and I will (hopefully) see you next week!&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>machinelearning</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Lifelike Toolkit Devblog #1: Countdown to Ignition</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Thu, 20 Apr 2023 23:14:37 +0000</pubDate>
      <link>https://dev.to/basicallyok/lifelike-toolkit-devblog-1-countdown-to-ignition-11l8</link>
      <guid>https://dev.to/basicallyok/lifelike-toolkit-devblog-1-countdown-to-ignition-11l8</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/lifelike-toolkit/lifelike" rel="noopener noreferrer"&gt;Project Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/mustafa_tariqk/status/1645884410300907527?s=20" rel="noopener noreferrer"&gt;Quick updates from Mustafa&lt;/a&gt;&lt;br&gt;
&lt;a href="https://lifelike-toolkit-lifelike-demo-frontend-swatug.streamlit.app/" rel="noopener noreferrer"&gt;Current Demo (unfortunately it crashes if there's enough traffic)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey everyone, it's me again. Hope everyone's had a great week since the last Devblog. Feels like an eternity ago. We've just finished the first week of nights&amp;amp;weekends and dropped our first &lt;a href="https://lifelike-toolkit-lifelike-demo-frontend-swatug.streamlit.app/" rel="noopener noreferrer"&gt;demo&lt;/a&gt;! &lt;em&gt;Unfortunately&lt;/em&gt;, since I could not get AWS free-tier cloud hosting working, we'll have to live with an extremely sketchy Streamlit-hosted web app that crashes if more than 5 people access it at a time. I cannot say working with AWS was a particularly fun experience, especially after I spent a few hours learning it, only to find out at the finish line that the free tier cannot handle PyTorch (which is rather necessary for the ML demo). We truly live in the twilight of cloud computing, where AWS, one of the biggest services out there, simply misses the mark when it comes to ease of use and engineering for error (shoutout to the Human-Computer Interaction course I took this year for allowing me to give a scientific breakdown of why AWS is horrible for early deployment). And it's not just AWS: Heroku really rubbed salt into the wound by removing its free tier. Although I grew to love Streamlit as a great demo'ing framework, the fact that there is no paid tier while the free tier supports something like 5 users at a time simply breaks my heart. Just when you thought they'd done it all, the Simpsons really came through with another eerily accurate vision of the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzl10hw82rmfmsnaqnmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzl10hw82rmfmsnaqnmm.png" alt="Old man yelling at cloud" width="680" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going forward, I have some plans for using ONNX and making the demo client-side, so we can put it on the cloud for a lot cheaper.&lt;/p&gt;

&lt;p&gt;Before we start discussing the demo itself, I think I need to first make clear that we are not intending to make a chatbot, or a chatbot NPC. Rather, we are making an open-model toolkit that allows any game developer to supply their own language model and have it work in their game. However, we were not able to do that within a week's time while also studying for exams. In the end, we settled on the v1 demo being more of a demonstration for us than for users. To users, it is basically a chatbot that speaks like Nick Wilde from Zootopia and has an extreme obsession with rabbits (to be fair, half of his dialogue in the movie is directed at Judy); to us, however, it is a test bench where we gauge our progress, fuck around and see what sticks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvn4p1aevh3gbg6q49oc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvn4p1aevh3gbg6q49oc.png" alt="Nick Wilde" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mustafa had the most important job going into this: figuring out how we can make the magic happen. He had previous experience working with LLMs, which is an area I am still reading up on. Due to the time constraints, he chose to focus on fine-tuning GPT-2 on Zootopia's script, instead of going open-model. Why Zootopia, though? It was the first movie that caught his eye on &lt;a href="https://imsdb.com/" rel="noopener noreferrer"&gt;The Internet Movie Script Database&lt;/a&gt;. To be honest, I could not have made a better choice myself.&lt;/p&gt;

&lt;p&gt;Fine-tuning the language model, as with most AI-related work, involves a fair amount of prerequisite work. As part of the toolkit, we wanted users to be able to create characters based on their dialogues. For this, Mustafa created a script to go on imsdb and scrape the movie requested by the user. The data then has to be "cleaned" and processed for use. This proved fairly difficult, as there are multiple formatting inconsistencies that made it hard to write a generalized script to extract what we needed. It also did not help that we had an exam on Tuesday. In the end, we chose to hard-code a few things, and maybe move away from film scripts. At the moment, we are looking into a paper that was released about 2 weeks ago called "Generative Agents". I will go into a little more detail later when discussing future plans.&lt;/p&gt;
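&lt;p&gt;To give a flavour of the cleaning step, here's a toy version of the dialogue extraction. The real scraper has to deal with imsdb's messy HTML and all the formatting quirks mentioned above; this sketch just assumes an already-downloaded plain-text script where character cues are in ALL CAPS (the function name and sample script are made up for the example):&lt;/p&gt;

```python
import re

SCRIPT = """
                    NICK
          It's called a hustle, sweetheart.

                    JUDY
          Sly fox. Dumb bunny.
"""

def extract_dialogue(script):
    """Pair ALL-CAPS character cues with the dialogue lines that follow.
    A toy cleaner: real film scripts have far more formatting quirks."""
    lines = [l.strip() for l in script.splitlines() if l.strip()]
    pairs, speaker = [], None
    for line in lines:
        if re.fullmatch(r"[A-Z][A-Z .'-]+", line):  # looks like a character cue
            speaker = line
        elif speaker:
            pairs.append((speaker, line))
    return pairs

print(extract_dialogue(SCRIPT))
```

&lt;p&gt;Even this tiny example shows why generalizing is hard: a cue that isn't fully capitalized, or a stage direction in caps, silently breaks the pairing.&lt;/p&gt;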

&lt;p&gt;Anyways, this is probably what everyone came here to see: &lt;strong&gt;How did it do?&lt;/strong&gt; To be frank, not great. Nick has an extreme obsession with rabbits, frequently refers to Judy as Judy Wilde, and does not seem exactly "smart". I suppose it is to be expected, considering Nick's dialogues in the movie are mostly directed at Judy. Although a lot of the responses do sound like something Nick would say, they are certainly not good enough to fool anyone. Another thing I tried to get working is contextual memory, but Nick just seems fairly confused when asked about something I have told him before. So for the sake of stability, I opted not to include it in the demo. Of course, it is very easy to blame GPT-2 here, and we will, but we will also acknowledge that maybe the data we gave it could be better. Maybe we need to include the question along with its response, not just the response itself; maybe we need to include a little more information. We honestly have not tested it enough yet, but testing is certainly underway now that I've finished working with everything. One thing I do want to try is GPT-3, since that model is a lot more powerful than GPT-2, as well as other language models that are easy to find out there. I also don't want to focus the model on a single character this time around, but rather have all characters start from a blank slate; when we speak to them, they are supplied with the necessary context as well as personality traits. This will highly depend on how good the model is, but it's definitely worth a shot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe40azxjcvner03yuozho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe40azxjcvner03yuozho.png" alt="Nick's response to " width="751" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But let's not take too much away from Nick here. As I've shown, Nick does retain certain context: he knows his name and the fact that he's a fox. That's definitely something cool.&lt;/p&gt;

&lt;p&gt;So what are we going to do in the future? The first thing would be for me to make sure we can get the demo out there. One thing that I am currently looking into is &lt;a href="https://onnx.ai/onnx/intro/" rel="noopener noreferrer"&gt;ONNX&lt;/a&gt;, which I heard about so often during daily standups back at my first internship. With it, we can essentially take the PyTorch model and convert it into the ONNX format, which we can run inference with in the browser, shedding the cloud-hosting leviathan overhead that is PyTorch. It's quite insane to think that, as new as AI is, we've already figured out how to run it in the browser! It's also exciting to put a name that sounds that cool on my resume.&lt;/p&gt;

&lt;p&gt;Second of all, I will be figuring out how to store the contextual memories for characters, as well as the characters themselves. Apparently ChromaDB is great for this, being able to run locally too, so I am looking into that.&lt;/p&gt;

&lt;p&gt;Outside of all that, I am reading up on LLMs and Transformers, while also looking into research on simulating human behaviour and invoking a "response" using speech. Luckily for me, &lt;a href="https://arxiv.org/pdf/2304.03442.pdf" rel="noopener noreferrer"&gt;"Generative Agents: Interactive Simulacra of Human Behavior"&lt;/a&gt; came out fairly recently and contains a lot of inspiration for us. The architecture of the agent discussed in the paper has a memory stream, which allows an agent to plan, reflect on itself, and act on certain perceptions. For example, when the agent is asked about a certain topic, it can retrieve relevant memories to "condition its response to the situation". This certainly aligns with our goals, as we want a player to be able to "convince" an NPC to do something for them. ChromaDB would also be a good fit here, as we can encode memories as contextual embeddings. To be honest, I feel like a kid in a candy store just thinking about the potential.&lt;/p&gt;
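&lt;p&gt;As a rough sketch of the memory-stream idea, here's a heavily simplified retrieval step: each memory gets a score blending relevance to the query with recency, and the top-scoring memories condition the response. (The paper's actual architecture also scores importance and does reflection; the class and scoring here are made up for illustration, and in practice a vector store like ChromaDB would handle the embedding search.)&lt;/p&gt;

```python
import math

class MemoryStream:
    """Simplified memory stream: memories are (timestamp, text, embedding);
    retrieval blends cosine relevance with an exponential recency bonus."""

    def __init__(self):
        self.memories = []

    def add(self, t, text, embedding):
        self.memories.append((t, text, embedding))

    def retrieve(self, query_emb, now, k=2, decay=0.99):
        def relevance(emb):
            dot = sum(a * b for a, b in zip(query_emb, emb))
            norms = (math.sqrt(sum(a * a for a in query_emb))
                     * math.sqrt(sum(b * b for b in emb)))
            return dot / norms if norms else 0.0

        # Score each memory: relevance + recency; return the top-k texts
        scored = [(relevance(emb) + decay ** (now - t), text)
                  for t, text, emb in self.memories]
        return [text for _, text in sorted(scored, reverse=True)[:k]]

stream = MemoryStream()
stream.add(0, "The player accused me of lying about Thursday.", [1.0, 0.0])
stream.add(5, "We chatted about the Maple Leafs.", [0.0, 1.0])
print(stream.retrieve(query_emb=[1.0, 0.0], now=6))
```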

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwtrk8b4vzaiyct8ahvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwtrk8b4vzaiyct8ahvc.png" alt="Generative Agents: Interactive Simulacra of Human Behavior" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both Mustafa and I really want to have our work speak for us, so for our next demo, we will be attempting to make a simple game using the toolkit. After all, we are not making a chatbot, but rather a game dev toolkit. A great friend of mine suggested an "interrogation game", where you need to figure out if a suspect is lying, find cracks in their story, and use them to force out the truth. Essentially, it's L.A. Noire meets Phoenix Wright, but without the textual prompts. On a quick sidenote: if you have not played the Phoenix Wright trilogy, make sure you do. This is quite a difficult task to accomplish in a week, but it will definitely get the word out there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvptowdh1mcauqgjt0mkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvptowdh1mcauqgjt0mkv.png" alt="Phoenix Wright" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I seriously don't want to overstay my welcome, so I'll close it here very briefly. Thank you all for taking the time to read what I have to say about this project. Even though it's been part-time due to school, it's been a blast to see our idea gaining form like this. I really hope I was able to convey that sense of awe and wonder in this honestly really long devblog. If you did enjoy my rambling, you can expect Devblog #2 next week as well. I really took my time on this one and dropped it later than I promised due to exams. Really sorry about that, so again, if you are not a big fan of me, you can definitely check out &lt;a href="https://twitter.com/mustafa_tariqk/status/1648704036185550850?s=20" rel="noopener noreferrer"&gt;Mustafa's Twitter&lt;/a&gt;, where he posts updates as soon as they roll out. If you know anyone who might be interested in this project, share it with them. We would appreciate any feedback and help we can get from the community. After all, our Human-Computer Interaction class did teach us that "We are not the user".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cx924wp51bkkm11qill.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cx924wp51bkkm11qill.png" alt="You are not the user" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With all of the shortcomings that our demo has shown, while disappointed, both of us also felt a lot more at ease. As bad as seeing your brainchild fall short is, we managed to find a lot of problems with it, as well as potential fixes. To be honest, that insight is more valuable than anything. So, in my first attempt at ending a devblog with a banger quote from yours truly: it is a lot more reassuring to see your stuff fail than succeed.&lt;/p&gt;



&lt;blockquote&gt;
&lt;br&gt;
&lt;p&gt;n&amp;amp;w update 8: and we're live! access a little demo of the toolkit here: &lt;a href="https://t.co/aULlq5F5L2" rel="noopener noreferrer"&gt;https://t.co/aULlq5F5L2&lt;/a&gt;&lt;br&gt;&lt;br&gt;they say if you're not embarassed by the first version of your product then you're too late. does being too embarassed mean we're super early? let's see haha&lt;/p&gt;— Mustafa Tariq (@mustafa_tariqk) &lt;a href="https://twitter.com/mustafa_tariqk/status/1647770026114441222?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;April 17, 2023&lt;/a&gt;&lt;br&gt;
&lt;/blockquote&gt; 

</description>
      <category>gamedev</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Lifelike Toolkit Devblog #0: Getting Started</title>
      <dc:creator>Khoa Nguyen</dc:creator>
      <pubDate>Wed, 12 Apr 2023 03:39:23 +0000</pubDate>
      <link>https://dev.to/basicallyok/lifelike-toolkit-devblog-0-getting-started-1jd9</link>
      <guid>https://dev.to/basicallyok/lifelike-toolkit-devblog-0-getting-started-1jd9</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/lifelike-toolkit/lifelike" rel="noopener noreferrer"&gt;Project Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/mustafa_tariqk/status/1645884410300907527?s=20" rel="noopener noreferrer"&gt;Quick updates from Mustafa&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello everyone! Since this is going to be the first of a long string of devblogs, I will introduce myself. My teammate &lt;a href="https://twitter.com/mustafa_tariqk" rel="noopener noreferrer"&gt;Mustafa Tariq&lt;/a&gt; and I are working on a toolkit that allows game developers to easily integrate artificial intelligence into their games, letting players essentially say whatever they want and have the game react to what they say. We even plan to open-source it once we're done with most of what we want to do. If anyone is interested, we're building this as part of buildspace's nights&amp;amp;weekends s3. It will last 6 weeks, but we're more than willing to pursue this project even past that deadline!&lt;/p&gt;

&lt;p&gt;The idea is simple: imagine yourself playing Skyrim, but whenever you are talking to an NPC, you don't pick from a list of options; instead, you type whatever you need into a dialogue box and have that NPC react accordingly. What we also want to create are gameplay situations where you're required to figure out how NPCs react to certain things you say. For example, you need to enter Skyrim's city of Windhelm, but the guards will not let you pass. Now, if you paid attention to the story, you would know that Windhelm is home to the Stormcloak Rebellion; therefore, you could tell the guard (not through an option, you'll have to type it out) that you support the rebellion and are there to enlist. The guard would then sympathize with your cause and let you pass. You could also figure out how to make the guards like you based on their personalities, and maybe they'll let you, a friend, pass!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v4txdxb2j4hic825ok7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v4txdxb2j4hic825ok7.png" alt="Skyrim's Windhelm Guard" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, you may ask, how will we do it? We have one immediate milestone that we want to set:&lt;br&gt;
&lt;strong&gt;April 16th&lt;/strong&gt;: MVP out. This is our hard deadline for nights&amp;amp;weekends; however, we are in the middle of exam season, so we'll have to scale down the MVP by quite a bit. The MVP will allow a user to load up an NLP model of their choice, create a character, and hold conversations, the contexts of which will be saved (like memory) and loaded back up whenever another conversation is triggered. Devblog #1, which will go into technical details, will also be released then.&lt;br&gt;
&lt;strong&gt;Beyond&lt;/strong&gt;: User feedback and iteration. We will have a much clearer picture of the full timeline once we're done with the MVP and get some feedback on it. So stick around to see what cool stuff we'll do.&lt;/p&gt;

&lt;p&gt;The project came about as we watched a YouTuber play a very old gem called &lt;a href="https://www.playablstudios.com/facade" rel="noopener noreferrer"&gt;Façade&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1azb7rkj3bbymjq49ngl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1azb7rkj3bbymjq49ngl.png" alt="Façade" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we thought that, with all the new technologies surrounding NLP, we could probably remake the game and expand on some of its ideas as a summer project (we're both still in university). But we realized we could turn this into something bigger by creating a toolkit that allows just about anybody to remake Façade in a short amount of time. Just imagine the number of Skyrim mods this will make possible! What we definitely want to see at the end is Telltale's The Walking Dead (or any Telltale game, for that matter) fully playable through player speech with no text options.&lt;/p&gt;

&lt;p&gt;We are totally aware of similar projects out there. However, our goals differ in that we are not trying to make a game per se; we're trying to make it easy for developers to make the games that we want to make! We're not exactly doing this because we're businessmen; we truly just want to work on something we really, really care about, and maybe it will turn into our careers after graduation. Because of this, our goal is to make this the best thing that we have ever made, and to make sure that everyone who uses it feels the same.&lt;/p&gt;

&lt;p&gt;So, that's it for today! Stay tuned (maybe even subscribe) on &lt;strong&gt;April 16th&lt;/strong&gt; to see what we've cooked up. From Devblog #1 onwards we will set up a feedback form if you've tried it out, maybe even a Discord channel if we garner extra interest. Also, if you want to follow the project more closely and see updates as we make them, follow Mustafa's &lt;a href="https://twitter.com/mustafa_tariqk" rel="noopener noreferrer"&gt;Twitter account&lt;/a&gt;, where he posts all of them. Over and out.&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>machinelearning</category>
      <category>buildspace</category>
    </item>
  </channel>
</rss>
