
Debajyati Dey


No It Wasn't A Waste Entirely

So, hello everyone! This is a follow-up to my last article about mirroring two datasets to Kaggle.


In that article I shared my frustrations and ambitions, and how I failed even after trying many approaches, particularly with the Wild Deepfake dataset.

In this article I am going to show you how I managed to upload the 24GB PolyGlotFake multimodal deepfake dataset to Kaggle, so that it is more accessible and easy for everyone to use in non-interactive experiments.

The original GitHub repo of the PolyGlotFake Deepfake Dataset is at -

PolyGlotFake Dataset

Overview

PolyGlotFake is a multilingual and multimodal deepfake dataset meticulously designed to address the challenges and demands of deepfake detection technologies. It consists of videos with manipulated audio and visual components across seven languages, employing advanced Text-to-Speech, voice cloning, and lip-sync technologies.

Download DataSet

Please download from this link: https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link

Quantitative Comparison

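| Dataset | Release Date | Manipulated Modality | Multilingual | Real Videos | Fake Videos | Total Videos | Manipulation Methods | Techniques Labeling | Attribute Labeling |
|---|---|---|---|---|---|---|---|---|---|
| UADFV | 2018 | V | No | 49 | 49 | 98 | 1 | No | No |
| TIMI | 2018 | V | No | 320 | 640 | 960 | 2 | No | No |
| FF++ | 2019 | V | No | 1,000 | 4,000 | 5,000 | 4 | No | No |
| DFD | 2019 | V | No | 360 | 3,068 | 3,431 | 5 | No | No |
| DFDC | 2020 | A/V | No | 23,654 | 104,500 | 128,154 | 8 | No | No |
| DeeperForensics | 2020 | V | No | 50,000 | 10,000 | 60,000 | 1 | No | No |
| Celeb-DF | 2020 | V | No | 590 | 5,639 | 6,229 | 1 | No | No |
| FFIW | 2020 | V | No | 10,000 | 10,000 | 20,000 | 1 | | |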

The README contains the Google Drive link to download the dataset.

Based on my experience with the Wild Deepfake dataset, my key mistake was extracting the tar files into nested folders of images and only then pushing them through Drive ⇒ GCS bucket ⇒ Kaggle, which failed.

That said, it wasn't entirely a mistake: I didn't have enough storage space on my local machine to archive the 4 tar files (train real, train fake, test real, test fake) of the Wild Deepfake dataset.

So I would have failed anyway.

If you open the PolyGlotFake Drive link, you can see that the dataset is a single RAR archive. This format is very convenient for uploading to Kaggle as a dataset: if I can create a public link to a single file (not a folder) in a GCS bucket, it gets transferred quickly no matter how large it is.
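
If your data is already extracted into many small files, one option is to bundle it back into a single archive before uploading. Here is a minimal sketch using Python's standard `tarfile` module. The function name and the paths are my own placeholders, and this is not what I ran for PolyGlotFake, which already ships as one RAR file.

```python
import tarfile
from pathlib import Path

def bundle_directory(src_dir, archive_path):
    """Pack src_dir into one uncompressed tar archive and return its member names."""
    src_dir = Path(src_dir)
    # Mode "w" means no compression: images and videos are already compressed,
    # so plain tar mostly just turns many files into one.
    with tarfile.open(archive_path, "w") as tar:
        # arcname keeps the paths inside the archive relative to the folder name.
        tar.add(src_dir, arcname=src_dir.name)
    with tarfile.open(archive_path, "r") as tar:
        return sorted(tar.getnames())

# Hypothetical usage (both paths are placeholders):
# bundle_directory("/content/train_real", "/content/train_real.tar")
```

Keep the archive outside the folder you are packing, and make sure the disk has room for both copies.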

How I Managed to Be Successful

So I began by copying the RAR file through the drive link into my Google Drive.

Then I opened a Colab notebook and began writing code -

# authenticating google cloud
from google.colab import auth
auth.authenticate_user()
project_id = 'polyglotfake'
!gcloud config set project {project_id}
!gsutil ls

The next step is downloading the 24GB dataset from Drive to Colab's disk storage using gdown -

# newer gdown releases take the file ID as a positional argument; --id is deprecated
!gdown 1cUlwVi8Wu6MmDu8Mh2lXTIPJFz63KOtd
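
Since the copy that follows takes hours, it may be worth sanity-checking the download first. Below is a minimal sketch that reports the file size and a SHA-256 digest while reading the file in chunks. The helper name and the path in the usage comment are assumptions, and as far as I know the dataset authors don't publish a reference checksum, so the digest is mainly useful for comparing your own copies.

```python
import hashlib
from pathlib import Path

def describe_file(path, chunk_size=8 * 1024 * 1024):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Fixed-size chunks mean a 24GB archive never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path.stat().st_size, digest.hexdigest()

# Hypothetical usage in Colab (the path is an assumption):
# size, sha = describe_file("/content/PolyGlotFake.rar")
# print(f"{size / 1024**3:.1f} GiB, sha256={sha}")
```

If the reported size is far below 24GB, the download was probably cut short and is worth retrying before anything else.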

I intended to copy the RAR file downloaded by gdown directly into the GCS bucket, so I needed gcsfuse for that -

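# shell commands need a leading "!" when run from a Colab cell
# register the gcsfuse apt repository and its signing key, then install
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse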
# create the directory the bucket will be mounted on
!mkdir -p my_gcs_mount
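
The snippets above install gcsfuse and create the mount directory, but the mount command itself isn't shown. Here is a minimal sketch of what that step could look like, built around the standard `gcsfuse [options] BUCKET MOUNT_POINT` form. The helper names are hypothetical, and the bucket name `pgfake` is an assumption taken from the gcloud command later in this post, so adjust both to your own setup.

```python
import subprocess

def gcsfuse_mount_command(bucket, mount_point, implicit_dirs=True):
    """Build the gcsfuse command that mounts `bucket` at `mount_point`."""
    cmd = ["gcsfuse"]
    if implicit_dirs:
        # Lets folders that exist only as object-name prefixes show up as directories.
        cmd.append("--implicit-dirs")
    cmd += [bucket, mount_point]
    return cmd

def mount_bucket(bucket, mount_point):
    """Run gcsfuse; raises CalledProcessError if the mount fails."""
    subprocess.run(gcsfuse_mount_command(bucket, mount_point), check=True)

# Assumed names: bucket "pgfake" and the mount directory created earlier.
# mount_bucket("pgfake", "/content/my_gcs_mount")
```

With the bucket mounted this way, a subfolder such as `polyglotfake/` under the mount point corresponds to a folder inside the bucket.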

Now the next step is to copy the file to the GCS bucket -

%cp /content/goblin/PolyGlotFake.rar /content/my_gcs_mount/polyglotfake/

In my case, this operation took more than 3 hours.

But now it is time for the most crucial operation: making the storage bucket public.

I did it using the command line because the web interface was somewhat confusing.

gcloud storage buckets add-iam-policy-binding gs://pgfake --member=allUsers --role=roles/storage.objectViewer

So that's it! Now I have a public link for the GCS bucket! I copied it from the web interface of the Google Cloud console.
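
If you'd rather not hunt for the link in the console, the two usual ways to address a public object follow a fixed pattern: a `gs://` URI and an `https://storage.googleapis.com/...` URL. Here is a small sketch that builds both. The helper names are mine, and the object path `polyglotfake/PolyGlotFake.rar` is only an assumption based on the copy command above, so check the real path in your bucket.

```python
from urllib.parse import quote

def gcs_uri(bucket, object_name):
    """gs:// form, as used by gsutil and gcloud."""
    return f"gs://{bucket}/{object_name}"

def public_url(bucket, object_name):
    """HTTPS form that works in a browser once the object is publicly readable."""
    # Percent-encode the object name but keep "/" separators readable.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"

# Assumed bucket and object path:
print(gcs_uri("pgfake", "polyglotfake/PolyGlotFake.rar"))
# gs://pgfake/polyglotfake/PolyGlotFake.rar
print(public_url("pgfake", "polyglotfake/PolyGlotFake.rar"))
# https://storage.googleapis.com/pgfake/polyglotfake/PolyGlotFake.rar
```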

Finally, the only thing left to do is to upload it to Kaggle. So, navigate to https://www.kaggle.com/datasets/?new=true and click on the link tab -

Creating a new dataset on kaggle

It will then look like this -

Creating a dataset from link - selected the option `Import Google Cloud Storage`

I entered the path to the file in the bucket.

And it took nearly 2 hours to upload the whole dataset.

Tears of joy.

I was finally able to mirror one of the two large datasets I had intended to.

I made it public so that anyone in need can access it, without going through the hassle I had.

You can find it on https://www.kaggle.com/datasets/debajyatidey/polyglotfake.
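
Once the dataset is attached to a Kaggle notebook, a quick way to see what is actually there is to list the files under the input directory. This is a minimal sketch. The helper name is mine, and the `/kaggle/input/polyglotfake` path is an assumption derived from the dataset slug in the URL above, so check the notebook's data panel if it differs.

```python
from pathlib import Path

def list_files(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )

# Assumed mount point inside a Kaggle notebook:
# for name, size in list_files("/kaggle/input/polyglotfake"):
#     print(f"{size / 1024**2:10.1f} MiB  {name}")
```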

A Glimpse of The Data as CSV

Gist of Real Videos Info as a CSV -

Hehe, the fake videos' info CSV file is too large, so I can't render it here.

Visualizing The Data Distributions

Below are some charts that visualize how the real (not deepfake) videos are distributed and organized -

Distribution of Age by the Language the Subject Speaks In -

Distribution of Age of Subject by Sex (Gender) -

Sex Ratio in All Real Videos -

Next, here are some charts that visualize how the deepfake videos are organized and distributed -

Raw Language, Target Language and Text-To-Speech categories distribution in All Fake Videos

📝 All these visualizations were created using Google Looker Studio.

Conclusion

So yeah… this time, it worked.

After all the failed attempts, broken pipelines, storage limitations, and that slow realization that "simple" problems are rarely simple — this one finally went through. Not because I discovered some magical trick, but because I stopped fighting the system and started working with it.

The difference?
I didn’t try to be clever. I tried to be practical.

Instead of exploding datasets into thousands of files and then dragging them through layers of infrastructure (and pain), I kept it as a single archive and let the systems that are actually designed for large transfers do their job. Turns out, boring solutions scale better than “smart” ones.

Looking back, this whole journey reinforced something I underestimated earlier:
data engineering is not a side quest in ML — it is the game.

Datasets like PolyGlotFake aren’t just large for the sake of it. They’re complex, multilingual, multimodal — and intentionally difficult to work with because they reflect real-world deepfake challenges. Making them accessible is not just convenience — it directly impacts how fast someone can experiment, iterate, and actually do research.

And that’s really the point.

If one person can now spin up a Kaggle notebook, plug in the dataset, and start experimenting in minutes instead of wasting days setting things up — then this entire ordeal was worth it.

Would I do it again?
Maybe, if it is indeed feasible.

But at least now I know this much —
sometimes the problem isn’t that something is impossible.

It’s just that you were doing it the hard way.

So, yes, ... that's a wrap!

Feel free to connect with me. :)

Thanks for reading! 🙏🏻
Written with 💚 by Debajyati Dey


Happy coding 🧑🏽‍💻👩🏽‍💻! Have a nice day ahead! 🚀
