Hello everyone! This is a follow-up to my last article about mirroring two datasets to Kaggle.
In that article I described my frustrations and ambitions, and how I failed despite trying many approaches, particularly with the Wild Deepfake dataset.
In this article I will show how I successfully uploaded the 24GB PolyGlotFake multimodal deepfake dataset to Kaggle, to make it more accessible and to enable easy non-interactive experiments for everyone.
The README of the original PolyGlotFake GitHub repo is excerpted below:
PolyGlotFake Dataset
Overview
PolyGlotFake is a multilingual and multimodal deepfake dataset meticulously designed to address the challenges and demands of deepfake detection technologies. It consists of videos with manipulated audio and visual components across seven languages, employing advanced Text-to-Speech, voice cloning, and lip-sync technologies.
Download DataSet
Please download from this link: https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link
Quantitative Comparison
| DataSet | Release Data | Manipulated Modality | Multilingual | Real video | Fake video | Total video | Manipulation Methods | Techniques Labeling | Attribute Labeling |
|---|---|---|---|---|---|---|---|---|---|
| UADFV | 2018 | V | No | 49 | 49 | 98 | 1 | No | No |
| TIMI | 2018 | V | No | 320 | 640 | 960 | 2 | No | No |
| FF++ | 2019 | V | No | 1,000 | 4,000 | 5,000 | 4 | No | No |
| DFD | 2019 | V | No | 360 | 3,068 | 3,431 | 5 | No | No |
| DFDC | 2020 | A/V | No | 23,654 | 104,500 | 128,154 | 8 | No | No |
| DeeperForensics | 2020 | V | No | 50,000 | 10,000 | 60,000 | 1 | No | No |
| Celeb-DF | 2020 | V | No | 590 | 5,639 | 6,229 | 1 | No | No |
| FFIW | 2020 | V | No | 10,000 | 10,000 | 20,000 | 1 |
The README contains the Google Drive link for downloading the dataset.
With the Wild Deepfake dataset, my key mistake was extracting the tar files into nested folders of images and then trying to move those folders from Drive ⇒ GCloud Storage bucket ⇒ Kaggle, which failed.
It wasn't entirely a mistake, though: I didn't have enough storage space on my local machine to archive the 4 tar files (train real, train fake, test real, test fake) of the Wild Deepfake dataset.
So I would have failed anyway.
If you open the PolyGlotFake download link, you can see that the dataset is provided as a single RAR archive. This format is very convenient for uploading a dataset to Kaggle: if I can create a public link to a single file (not a folder) in a GCS bucket, it can be transferred quickly, no matter how large the file is.
How I Managed to Be Successful
I began by copying the RAR file from the drive link into my own Google Drive.
Then I opened a Colab notebook and started writing code:
# authenticating google cloud
from google.colab import auth
auth.authenticate_user()
project_id = 'polyglotfake'
!gcloud config set project {project_id}
!gsutil ls
The next step is downloading the 24GB dataset from Drive to the Colab disk using gdown:
!gdown --id 1cUlwVi8Wu6MmDu8Mh2lXTIPJFz63KOtd
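gdown needs the file ID rather than the full share link. The ID in the command above belongs to my own Drive copy, so it differs from the one in the README link. As a small sketch (the helper name `drive_file_id` is my own, not part of gdown), the ID can be pulled out of a share URL like this:

```python
import re
from typing import Optional


def drive_file_id(url: str) -> Optional[str]:
    """Return the file ID from a Google Drive share URL, or None if none is found."""
    # Share links usually look like https://drive.google.com/file/d/<ID>/view?...
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    if match:
        return match.group(1)
    # Some links carry the ID in an `id=` query parameter instead.
    match = re.search(r"[?&]id=([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None


# The link from the PolyGlotFake README quoted earlier in this article.
readme_link = "https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link"
print(drive_file_id(readme_link))  # prints 1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF
```

The returned string is what goes after `--id` in the gdown command.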
I wanted to copy the downloaded RAR file directly into the GCS bucket, which requires gcsfuse:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse
!mkdir my_gcs_mount
With the bucket mounted under my_gcs_mount, the next step is to copy the file into the GCS bucket:
%cp /content/goblin/PolyGlotFake.rar /content/my_gcs_mount/polyglotfake/
In my case, this operation took more than 3 hours.
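Three hours is a long time to stare at a silent cell. Below is a minimal sketch of a chunked copy that reports progress; the function name `copy_with_progress` and the 64 MiB chunk size are my own choices, not something I used in the original run. One caveat I am assuming rather than have verified for every gcsfuse version: writes through a gcsfuse mount may be staged locally and uploaded only when the file is closed, so the numbers may reflect local buffering and not the actual upload.

```python
import os


def copy_with_progress(src, dst, chunk_size=64 * 1024 * 1024, report=print):
    """Copy src to dst in chunks, reporting after each chunk. Returns bytes copied."""
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            fout.write(chunk)
            copied += len(chunk)
            percent = 100.0 * copied / total if total else 100.0
            report(f"{copied}/{total} bytes ({percent:.1f}%)")
    return copied
```

In the notebook this would be called with the same two paths as the `%cp` line, with the file name appended to the destination folder.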
Now comes the most crucial step: making the storage bucket public.
I did it from the command line because the web interface was somewhat confusing.
!gcloud storage buckets add-iam-policy-binding gs://pgfake --member=allUsers --role=roles/storage.objectViewer
So that's it! Now I have a public link for the GCS bucket! I copied it from the web interface of the Google Cloud console.
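If you would rather build the link than copy it from the console, public objects in GCS are served under `https://storage.googleapis.com/<bucket>/<object>`. Here is a small sketch. The bucket name `pgfake` comes from the command above, while the object name `PolyGlotFake.rar` is an assumption: the actual object path depends on where the file landed inside the bucket.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, object_name: str) -> str:
    """Build the public HTTPS URL for an object in a publicly readable GCS bucket."""
    # Slashes stay as path separators; other unsafe characters are percent-encoded.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


# "PolyGlotFake.rar" is an assumed object name, used here for illustration only.
print(public_gcs_url("pgfake", "PolyGlotFake.rar"))
# prints https://storage.googleapis.com/pgfake/PolyGlotFake.rar
```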
Finally, the only thing left is uploading to Kaggle. Navigate to https://www.kaggle.com/datasets/?new=true and click on the link tab -
It will then look like this -
I entered the path to the file in the bucket.
And it took nearly 2 hours to upload the whole dataset.
Tears of joy.
I was finally able to mirror one of the two large datasets I intended to.
I made it public so that anyone who needs it can access it without going through the hassle I had.
You can find it at https://www.kaggle.com/datasets/debajyatidey/polyglotfake.
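Once the dataset is attached to a Kaggle notebook, a quick first check is to see what files are actually there. This is only a sketch: I am assuming the dataset shows up under `/kaggle/input/polyglotfake` (based on the slug in the URL above), and `count_files_by_extension` is a helper name I made up for this example.

```python
import os
from collections import Counter


def count_files_by_extension(root):
    """Walk root recursively and count files per lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


# Assumed mount point inside a Kaggle notebook; adjust if your path differs.
DATASET_DIR = "/kaggle/input/polyglotfake"
if os.path.isdir(DATASET_DIR):
    for ext, n in count_files_by_extension(DATASET_DIR).most_common():
        print(ext, n)
```

Outside Kaggle the directory does not exist, so the snippet simply prints nothing.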
A Glimpse of The Data as CSV
Gist of the real videos' info as a CSV -
The fake videos' info CSV file is too large, so it can't be rendered here.
Visualizing The Data Distributions
Below are some charts showing how the real (non-deepfake) videos are distributed and organized -
Distribution of age by the language the subject speaks:

Distribution of subject age by sex (gender):

Sex ratio across all real videos:

And here are some charts showing how the deepfake videos are organized and distributed -
📝 All these visualizations were created using Google Looker Studio.
Conclusion
So yeah… this time, it worked.
After all the failed attempts, broken pipelines, storage limitations, and that slow realization that "simple" problems are rarely simple — this one finally went through. Not because I discovered some magical trick, but because I stopped fighting the system and started working with it.
The difference?
I didn’t try to be clever. I tried to be practical.
Instead of exploding datasets into thousands of files and then dragging them through layers of infrastructure (and pain), I kept it as a single archive and let the systems that are actually designed for large transfers do their job. Turns out, boring solutions scale better than “smart” ones.
Looking back, this whole journey reinforced something I underestimated earlier:
data engineering is not a side quest in ML — it is the game.
Datasets like PolyGlotFake aren’t just large for the sake of it. They’re complex, multilingual, multimodal — and intentionally difficult to work with because they reflect real-world deepfake challenges. Making them accessible is not just convenience — it directly impacts how fast someone can experiment, iterate, and actually do research.
And that’s really the point.
If one person can now spin up a Kaggle notebook, plug in the dataset, and start experimenting in minutes instead of wasting days setting things up — then this entire ordeal was worth it.
Would I do it again?
Maybe, if it is indeed feasible.
But at least now I know this much —
sometimes the problem isn’t that something is impossible.
It’s just that you were doing it the hard way.
So, yes, ... that's a wrap!
Feel free to connect with me. :)
| Thanks for reading! 🙏🏻 Written with 💚 by Debajyati Dey |
|---|
Happy coding 🧑🏽‍💻👩🏽‍💻! Have a nice day ahead! 🚀


