Debajyati Dey

Posted on Apr 3

No, It Wasn't Entirely a Waste

#discuss #cloud #datascience #kaggle

So, hello everyone! This is a follow-up to my last article about mirroring two datasets to Kaggle.

Previous article: "So, you know what? I just wasted 3 months of my life" (Debajyati Dey, Mar 15)

In that article I described my frustrations and ambitions, and how I failed despite trying many approaches, particularly with the Wild Deepfake dataset.

In this article I will show how I managed to upload the 24 GB PolyGlotFake multimodal deepfake dataset to Kaggle, so that everyone can access it easily and run non-interactive experiments on it.

The original GitHub repo of the PolyGlotFake deepfake dataset is:

tobuta / PolyGlotFake

PolyGlotFake Dataset

Overview

PolyGlotFake is a multilingual and multimodal deepfake dataset meticulously designed to address the challenges and demands of deepfake detection technologies. It consists of videos with manipulated audio and visual components across seven languages, employing advanced Text-to-Speech, voice cloning, and lip-sync technologies.

Download DataSet

Please download from this link: https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link

Quantitative Comparison

DataSet	Release Data	Manipulated Modality	Multilingual	Real video	Fake video	Total video	Manipulation Methods	Techniques Labeling	Attribute Labeling
UADFV	2018	V	No	49	49	98	1	No	No
TIMI	2018	V	No	320	640	960	2	No	No
FF++	2019	V	No	1,000	4,000	5,000	4	No	No
DFD	2019	V	No	360	3,068	3,431	5	No	No
DFDC	2020	A/V	No	23,654	104,500	128,154	8	No	No
DeeperForensics	2020	V	No	50,000	10,000	60,000	1	No	No
Celeb-DF	2020	V	No	590	5,639	6,229	1	No	No
FFIW	2020	V	No	10,000	10,000	20,000	1

…

The README contains the Google Drive link for downloading the dataset.

Based on my experience of trying to upload the Wild Deepfake dataset to Kaggle, my key mistake was extracting the tar files into nested folders of images and only then pushing them through the pipeline: Drive ⇒ Google Cloud Storage bucket ⇒ Kaggle (where it failed).

It wasn't entirely a mistake, though. I didn't have enough storage space on my local machine to hold the 4 tar files (train real, train fake, test real, test fake) of the Wild Deepfake dataset, so I would have failed anyway.

Opening the PolyGlotFake Drive link shows that the dataset is packaged as a single RAR archive. This format is very convenient for creating a Kaggle dataset: if I can create a public link to a single file (not a folder) in a GCS bucket, it gets transferred very quickly, no matter how large the file is.

How I Managed to Be Successful

I began by copying the RAR file from the shared Drive link into my own Google Drive.

Then I opened a Colab notebook and began writing code:

# authenticating google cloud
from google.colab import auth
auth.authenticate_user()
project_id = 'polyglotfake'
!gcloud config set project {project_id}
!gsutil ls

The next step is downloading the 24 GB dataset from Drive to the Colab disk using gdown:

!gdown --id 1cUlwVi8Wu6MmDu8Mh2lXTIPJFz63KOtd
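
The value passed to gdown is the Drive file ID, which is the path segment after /file/d/ in a share link. Below is a minimal sketch of pulling it out in Python, assuming the link has the usual /file/d/<id>/ shape. The helper name drive_file_id is hypothetical, and the example link is the one from the README.

```python
import re

# Matches the ID segment of a Drive share link of the form
# https://drive.google.com/file/d/<FILE_ID>/view?...
_DRIVE_FILE_RE = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_file_id(share_url: str) -> str:
    """Return the file ID embedded in a Google Drive share link (hypothetical helper)."""
    match = _DRIVE_FILE_RE.search(share_url)
    if match is None:
        raise ValueError(f"no /file/d/<id> segment in {share_url!r}")
    return match.group(1)


readme_link = "https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link"
print(drive_file_id(readme_link))  # prints 1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF
```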

I wanted to copy the downloaded RAR file directly into the GCS bucket, so I needed gcsfuse:

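# register the gcsfuse apt repository and its signing key
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# refresh the package index and install gcsfuse
!apt -qq update
!apt -qq install gcsfuse

# directory on which the bucket gets mounted
!mkdir my_gcs_mount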

With the bucket mounted on my_gcs_mount through gcsfuse, the next step is to copy the downloaded file into the GCS bucket:

%cp /content/goblin/PolyGlotFake.rar /content/my_gcs_mount/polyglotfake/

In my case, this operation took more than 3 hours.
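
A copy that runs for hours can be cut short by a session disconnect, so it may be worth confirming that the whole file arrived. The sketch below is not the method used above (that was a plain %cp); it is one possible way to copy in chunks and compare byte counts afterwards. The function name copy_and_verify and the 16 MB chunk size are assumptions for illustration.

```python
import os
import shutil


def copy_and_verify(src: str, dst: str, chunk_bytes: int = 16 * 1024 * 1024) -> int:
    """Copy src to dst in fixed-size chunks, then check that both sizes match.

    Returns the number of bytes at the destination. Hypothetical helper.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # copyfileobj reads and writes chunk_bytes at a time, so the
        # whole archive never has to fit in memory.
        shutil.copyfileobj(fin, fout, length=chunk_bytes)

    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    if src_size != dst_size:
        raise IOError(f"size mismatch: source is {src_size} bytes, destination is {dst_size} bytes")
    return dst_size
```

Called as copy_and_verify("/content/goblin/PolyGlotFake.rar", "/content/my_gcs_mount/polyglotfake/PolyGlotFake.rar"), it would return the number of bytes written, assuming the mount reports the file size correctly once the write has finished.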

Now comes the most crucial operation: making the storage bucket public.

I did it from the command line because the web interface was somewhat confusing.

gcloud storage buckets add-iam-policy-binding gs://pgfake --member=allUsers --role=roles/storage.objectViewer

So that's it! Now I had a public link to the file in the GCS bucket, which I copied from the Google Cloud console web interface.
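
For an object in a publicly readable bucket, the link follows the pattern https://storage.googleapis.com/<bucket>/<object path>. Here is a small sketch of building it, where the bucket name pgfake comes from the command above and the object path PolyGlotFake.rar is an assumption (the actual path inside the bucket may include a folder prefix). The function name public_gcs_url is hypothetical.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, object_name: str) -> str:
    """Build the HTTPS URL of an object in a publicly readable GCS bucket."""
    # Percent-encode the object name, but keep "/" so folder-like
    # prefixes stay intact in the URL.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


# Object path is assumed for illustration.
print(public_gcs_url("pgfake", "PolyGlotFake.rar"))
# prints https://storage.googleapis.com/pgfake/PolyGlotFake.rar
```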

Finally, the only thing left to do is uploading to Kaggle. Navigate to https://www.kaggle.com/datasets/?new=true and click on the Link tab.

There I entered the path to the file in the bucket.

And it took nearly 2 hours to upload the whole dataset.

Tears of joy.

I was finally able to mirror one of the two large datasets I had intended to.

I made it public so that anyone in need can access it, without going through the hassle I had.

You can find it on https://www.kaggle.com/datasets/debajyatidey/polyglotfake.
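
Once the dataset is attached to a Kaggle notebook, its files should appear under /kaggle/input/<dataset slug>/. Below is a minimal sketch for checking what is there, assuming the slug polyglotfake from the URL above; list_files is a hypothetical helper.

```python
import os


def list_files(root: str) -> list[tuple[str, int]]:
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full_path, root), os.path.getsize(full_path)))
    return sorted(entries)


# Assumed mount point inside a Kaggle notebook; nothing is printed elsewhere.
DATASET_DIR = "/kaggle/input/polyglotfake"
if os.path.isdir(DATASET_DIR):
    for rel_path, size in list_files(DATASET_DIR):
        print(f"{size / 1e9:8.2f} GB  {rel_path}")
```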

A Glimpse of The Data as CSV

Gist of the real videos' info as a CSV:

Hehe, the fake videos' info CSV file is too large, so I can't render it here.

Visualizing The Data Distributions

Below are some charts showing how the real (not deepfake) videos are distributed and organized:

Distribution of age by the language the subject speaks:

Distribution of the subject's age by sex (gender):

Sex ratio in all real videos:

And here are some charts showing how the deepfake videos are organized and distributed:

📝 All these visualizations were created using Google Looker Studio.

Conclusion

So yeah… this time, it worked.

After all the failed attempts, broken pipelines, storage limitations, and that slow realization that "simple" problems are rarely simple — this one finally went through. Not because I discovered some magical trick, but because I stopped fighting the system and started working with it.

The difference?
I didn’t try to be clever. I tried to be practical.

Instead of exploding datasets into thousands of files and then dragging them through layers of infrastructure (and pain), I kept it as a single archive and let the systems that are actually designed for large transfers do their job. Turns out, boring solutions scale better than “smart” ones.

Looking back, this whole journey reinforced something I underestimated earlier:
data engineering is not a side quest in ML — it is the game.

Datasets like PolyGlotFake aren’t just large for the sake of it. They’re complex, multilingual, multimodal — and intentionally difficult to work with because they reflect real-world deepfake challenges. Making them accessible is not just convenience — it directly impacts how fast someone can experiment, iterate, and actually do research.

And that’s really the point.

If one person can now spin up a Kaggle notebook, plug in the dataset, and start experimenting in minutes instead of wasting days setting things up — then this entire ordeal was worth it.

Would I do it again?
Maybe, if it is indeed feasible.

But at least now I know this much —
sometimes the problem isn’t that something is impossible.

It’s just that you were doing it the hard way.

So, yes, ... that's a wrap!

Feel free to connect with me. :)