Hey bros, sisters, and other fellow developers. I am a human man. And I am sharing my personal experience and frustrations with uploading large datasets.
I may not know a lot of stuff, which is why I just happen to try a lot of things, differently maybe, and fail severely. Haha, what a tragedy...
So, it happened that I needed to fine-tune some CNNs or ViTs (or both) that are pretrained on ImageNet or LVD-1689M, so that I can use them for feature extraction in my preferred dataset domain: deepfake detection (it's the final year project I am working on).
The dataset I preferred for fine-tuning this set of deep learning models is WildDeepfake,
because it is a diverse, real-world deepfake dataset sourced from in-the-wild deepfakes available on the internet. Fine-tuning on it should be extremely beneficial, because it will help the models generalize better across modern real-world deepfakes and other available datasets. Other researchers have tested this before and found it to hold true.
Now you might ask, and it is a valid question: if the dataset is already available on Hugging Face, then what am I trying to achieve by uploading it? What am I even uploading?
It's right there on HF. What am I up to? Well, that's a sensible concern.
Look, the thing is: I use Kaggle notebooks for all my testing, research, and prototyping work when developing and ideating different possible architectures.
Why? Because Kaggle not only provides 30 hours of GPU and 23 hours of TPU per week for FREE, but it also lets you run notebooks non-interactively. Non-interactive means you can code and save your work by clicking the 'Save and Run All (Commit)' button, which saves your work to the cloud by running the whole notebook or script once in the cloud, without you needing to keep the browser tab open. You can hit save, forget the world, shut down your PC, and take a deep sleep, have dreams, whatever. Meanwhile, your notebook keeps running, and whether it hits an error and stops, or executes every cell successfully, saves its output, and finishes, Kaggle will email you a link to the failed or successful notebook.
I find this to be a pleasant feature. I can spend the whole day and night experimenting, thinking, architecting, and developing the system, and just put it into execution before I go to bed. The notebook can take 1 hour or even 3, 5, 9, or 12 hours to run. And when I wake up, I'll brush my teeth, have a nice breakfast, and just check my email for the results.
Now to answer your question: "Why wouldn't I just load from Hugging Face?" The answer is speed. Based on personal experience, and whatever I've heard from other devs, downloading a dataset hosted on Kaggle for the first time is hellishly slow. Second time, third time, also slow. But then it gets cached, and subsequent loads are a lot faster. Faster than fast. And I run notebooks non-interactively most of the time. In non-interactive mode, loading a cached Kaggle dataset is even faster, almost immediate, compared to interactive mode. But that is not the case for HF datasets.
In both interactive and non-interactive modes, loading a dataset from HF takes the same amount of time; at least that's what I found.
I also didn't find that downloading a dataset from HF in Kaggle notebooks a couple of times makes it any faster. It doesn't. The speed doesn't improve at all. It stays the same.
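If the root cause really is that the HF cache doesn't survive between Kaggle sessions, one workaround I've been meaning to try (a sketch only, not verified end to end; the repo id in the comment is a placeholder) is pointing the `datasets` cache at Kaggle's writable `/kaggle/working` directory before anything from `datasets` is imported:

```python
import os
from pathlib import Path

def pick_hf_cache_dir() -> str:
    """Use Kaggle's writable working dir for the HF cache when it
    exists, otherwise fall back to the default ~/.cache location."""
    if Path("/kaggle/working").exists():
        kaggle_dir = Path("/kaggle/working/hf_cache")
        kaggle_dir.mkdir(parents=True, exist_ok=True)
        return str(kaggle_dir)
    return str(Path.home() / ".cache" / "huggingface")

# Must be set before `datasets` is imported for the first time.
os.environ["HF_HOME"] = pick_hf_cache_dir()

# Later, load_dataset() will read/write this cache, e.g.:
# from datasets import load_dataset
# ds = load_dataset("some-user/wilddeepfake")  # placeholder repo id
```

Whether the cached files actually come back in a later session depends on how Kaggle preserves notebook output between commits, so treat this as an experiment, not a fix.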
The dataset is huge: 9 GB total in its Parquet conversion, and 67 gigabytes as raw PNG images in nested folders. Huge...
But the raw image format lets me load individual random or selective files, which honestly wouldn't be very easy with the Parquet version.
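To show what I mean by selective loading: with the raw folder layout you can sample arbitrary images straight off the filesystem. A minimal sketch (the folder name in the usage comment is hypothetical, just mirroring the nested-folder idea):

```python
import random
from pathlib import Path

def sample_images(root: str, n: int, seed: int = 0) -> list:
    """Collect every PNG under the nested dataset folders and
    return a reproducible random sample of (at most) n file paths."""
    paths = sorted(Path(root).rglob("*.png"))
    rng = random.Random(seed)
    return rng.sample(paths, min(n, len(paths)))

# e.g. sample_images("wilddeepfake/real_train", 16)  # hypothetical path
```

Doing the equivalent against the Parquet version would mean scanning row groups, which is exactly the friction I want to avoid.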
Still, loading it a couple of times should make it fast enough for me to run in the non-interactive commit mode of notebooks (if the data is loaded from Kaggle Datasets). So, yeah! I thought it might be worth a shot!
Meanwhile, just to keep my work going as it already is,
I was (and still am) using a subset of the original WildDeepfake dataset, made available on Kaggle by a fellow veteran developer (I am grateful for his kindness in making his upload publicly available).
Without it, I wouldn't have been able to use WildDeepfake for training and testing at all.
But still, there is no guarantee that it is the actual WildDeepfake dataset. It might be a subset of a whole different dataset that he just named WildDeepfake. How would I know? There is no way. I can either blindly accept and use it or, perhaps, lose it.
My Next Decision
So, my dumb brain just thought: heyyyy, why not just download the data from Hugging Face and upload it to Kaggle? Prawnblem solved!
And my dumb brain followed through on that decision with utmost heart and soul.
- I cloned the WildDeepfake repo from Hugging Face.
- Enabled Git LFS and downloaded the gzip-compressed tar archives.
- Extracted them all.
- Took 1.5 months to upload them all to my Google Drive, folder by folder. (Yes, I have 2 TB of Google Drive storage because I have a one-year Google AI Pro subscription for students, so I can easily upload large datasets to my Drive, but that doesn't solve any problem.)
- Started the ($300) Google Cloud free trial and created a Google Cloud Storage bucket.
- Opened a Colab notebook and wrote code to authenticate Google Drive and Google Cloud.
- Used gsutil to recursively copy files from the Drive directory to the Cloud Storage bucket directory.
- The copying process took 30 days to complete.
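For the record, the extract step in the list above boils down to something like this (a sketch; the clone itself was just `git lfs install` followed by `git clone <repo-url>`, and the archive names here are generic):

```python
import tarfile
from pathlib import Path

def extract_all(archive_dir: str, out_dir: str) -> int:
    """Unpack every .tar.gz found in archive_dir into out_dir
    and return how many archives were extracted."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tf:
            tf.extractall(out)  # only do this with archives you trust
        count += 1
    return count
```

The later Drive-to-bucket copy was a plain `gsutil -m cp -r <drive-dir> gs://<bucket>/...` from Colab, nothing fancier.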
So yeah, now I just can't upload the dataset from my GCS bucket to Kaggle. I tried multiple times. I put the path to the directory containing the dataset in the bucket into Kaggle's "upload from link" option, which accepts GCS buckets.
It takes forever. No, not even forever: it runs for 12 hours showing "bundling", and then the upload just vanishes. No trace that any process was ever running.
Man, I feel like I should give up. Maybe it just won't work. I was a fool to try.
But that's not all.
I found a 24 GB multimodal deepfake dataset named "PolyglotFake". I saved it to my Google Drive. Now I can't upload that to Kaggle either. In my experience, whenever I tried uploading the single huge 24 GB RAR file, whether via the Kaggle CLI or via the web from my local filesystem, both were slow as hell and took forever.
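For anyone attempting the CLI route: `kaggle datasets create -p <folder>` expects a `dataset-metadata.json` sitting next to the files. Generating it looks roughly like this (the username, slug, and title below are placeholders, not my actual ones):

```python
import json
from pathlib import Path

def write_kaggle_metadata(folder: str, username: str, slug: str, title: str) -> Path:
    """Write the dataset-metadata.json that the Kaggle CLI's
    `datasets create -p <folder>` command expects to find."""
    meta = {
        "title": title,
        "id": f"{username}/{slug}",
        "licenses": [{"name": "CC0-1.0"}],
    }
    path = Path(folder) / "dataset-metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path

# then, from a shell:
#   kaggle datasets create -p <folder>
```

This doesn't make the upload any faster, of course; it just makes the CLI attempt reproducible.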
The next thing I tried was gdown, a useful tool for downloading large files shared as "anyone with the link" from Google Drive.
The most pathetic thing about gdown is that Google temporarily blocks the file after you try to download it (especially a large file). Then, for the next 24 hours, not only you but everyone on the planet is unable to download it.
Another thing I noticed: when you use gdown in a Kaggle notebook or on your local filesystem, it works really damn fast and actually downloads the whole file without interruption. But in Google Cloud Shell, it just stops with some random error, or sometimes no error at all, after barely downloading 67 megabytes of data.
One thing I'm sure of: I will try uploading the PolyglotFake dataset again once the next 24 hours are over. I will mount my separate GCS bucket for PolyglotFake in my Colab notebook, authenticate my Google Drive account, and either copy from the mounted Google Drive to the mounted Cloud Storage bucket, or run gdown again against the mounted bucket directory.
My Google Cloud free trial ends in the last week of this March. Before that, I have to try everything I can, and if nothing works, or I fail, I will delete the buckets and finally give up.
So, yaa, that's a wrap.
Concluding
TL;DR: I tried to move a dataset. The internet said "no". I'm still trying.
If you've been here — slow downloads, vanished uploads, gdown bans — you know the pain. If you haven't… welcome, it gets worse (just kidding… mostly).
Looking back, maybe I overcomplicated it. Maybe I should've just used the subset, or stuck with HuggingFace loading times, or accepted that some friction is part of the process.
But here's the thing: research isn't just about models and metrics. It's also about workflow, accessibility, and removing tiny barriers that add up. If I can make WildDeepfake or Polyglotfake one click easier to use for someone else down the line — even if it takes me a few more failed uploads to get there — I think that's a win.
So yeah, that's where I'm at.
What I learned: Sometimes the simplest idea — "just upload the dataset" — can turn into a 2-month odyssey through Google Drive, Cloud Storage, Kaggle limits, and existential doubt. 😅
What I still need: If you've successfully mirrored a large HuggingFace dataset to Kaggle, or if you know a reliable way to move 67GB of images without losing your mind — please drop a comment or DM. I'm all ears.
Why I'm still trying: Because better deepfake detection matters. And if making datasets more accessible in the places where researchers actually work (like Kaggle) helps even one person prototype faster, test smarter, or sleep while their notebook runs — then maybe this whole mess was worth it. 🙃
I'll update this post if I ever get WildDeepfake or Polyglotfake fully mirrored on Kaggle. Until then: keep experimenting, keep sharing.
Feel free to connect with me. :)
Thanks for reading! 🙏🏻 Written with 💚 by Debajyati Dey
Follow me on Dev...
Happy coding 🧑🏽💻👩🏽💻! Have a nice day ahead! 🚀
