What if someone ask you a basic question - You have a table item_store having stores and there corresponding item details and table item_store_old that is having legacy data for all the item's and stores, now I need to find out if item_store_old is having all the stores which are present in item_store ?
The first thought is ahh that's pretty simple I will write a query something like this
Select * from item_store store
left join
item_store_old old_store
on store.store_id = old_store.store_id
where old_store.store_id is null
In case the above query return's 0 row's I can confirm that item_store_old is having all the store_id's which are present in item_store.
When I ran the above query on BQ it kept on running for 3 hrs, and I have to kill the job.
Reason - item_store_old was having 400 million records
and item_store was having 15 million records and I was trying to join these tables which had no partition.
This is the problem statement which I faced recently at my day to day work, interested to know what all I tried before I got a working solution. I will come up with another post, let me know in the comments what all you would have tried.
Note the data was in the form of parquet files and I loaded those in BigQuery tables in order to query that data.
Top comments (0)