Sometimes you need to run complex queries against the list of files stored in Google Cloud Storage. In my case, I wanted to find out which files were no longer referenced by my database so I could clean them up.
Here is how I did it:
Exporting the File List to a CSV file
This simple one-line script will export the entire file list from Google Cloud Storage to a simple CSV format that can then be loaded into Google BigQuery:
gsutil ls -l gs://bucket-name/** | head -n-1 | awk 'BEGIN{print "Size,Modified,Path"}{print $1","$2",\""$3"\""}' > filelist.csv
Make sure to replace bucket-name with your actual GCS bucket name.
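The pipeline works as follows: gsutil ls -l prints the size, modification time, and path of each object; head -n-1 drops gsutil's trailing TOTAL line (note that negative line counts require GNU head, so this may not work on macOS out of the box); and awk reshapes each line into CSV, emitting a header row first. The first few lines of filelist.csv should look something like this (the sizes, dates, and paths below are just illustrative):

Size,Modified,Path
2048,2023-05-01T12:34:56Z,"gs://bucket-name/images/photo.jpg"
51200,2023-05-02T08:15:00Z,"gs://bucket-name/uploads/report.pdf"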
Loading the CSV file to BigQuery
Before you can load the CSV file into BigQuery, you first need to upload it to Google Cloud Storage. You can do that by running:
gsutil cp filelist.csv gs://some-bucket/filelist.csv
Then, you can load it straight into BigQuery:
bq load --autodetect --source_format=CSV mydataset.mytable gs://some-bucket/filelist.csv
Make sure you change some-bucket to your actual bucket name, and mydataset and mytable to the target BigQuery dataset and table, respectively.
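Since --autodetect infers the schema from the data, it's worth confirming that the Size, Modified, and Path columns came out as the types you expect (INTEGER, TIMESTAMP, and STRING). You can inspect the detected schema with:

bq show --schema --format=prettyjson mydataset.mytable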
I tested this process on a bucket containing about 350,000 files, and it worked flawlessly.
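Once the file list is in BigQuery, the original cleanup question becomes a simple join. Assuming you also export the paths your database still references into a table (referenced_files with a path column is a hypothetical name here; its values must include the gs://bucket-name/ prefix so they match the Path column), a query along these lines finds the orphaned files:

bq query --use_legacy_sql=false '
SELECT f.Path
FROM mydataset.mytable AS f
LEFT JOIN mydataset.referenced_files AS r ON f.Path = r.path
WHERE r.path IS NULL'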