Hey all,
My current stack has event data streaming into Pub/Sub (GCP), on the order of thousands of events every 5 minutes.
I also have an intelligence table in BigQuery that holds all the metadata for those events. The Pub/Sub events and the BQ table share a common join key, but the metadata table is about 3 billion rows. An application queries real-time metrics off the joined data (in ClickHouse) via an API. We need the metadata joined in because it is the group-by key for on-demand aggregations.
My current setup is Pub/Sub → GCS → ClickHouse for events, and BQ → GCS → ClickHouse for metadata; inside ClickHouse the incoming events get enriched by a materialised view. However, due to the size of the metadata table, each materialised view query takes very long and costs a lot as well.
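To make this concrete, the enrichment path looks roughly like this (table and column names are simplified placeholders, not the real schema):

```sql
-- Raw events landing from GCS:
CREATE TABLE events_raw
(
    event_id   String,
    join_key   UInt64,
    value      Float64,
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY (join_key, event_time);

-- Metadata exported from BigQuery via GCS (~3 billion rows):
CREATE TABLE metadata
(
    join_key  UInt64,
    group_col LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY join_key;

-- Enriched target table the API queries against:
CREATE TABLE events_enriched
(
    event_id   String,
    join_key   UInt64,
    value      Float64,
    event_time DateTime,
    group_col  LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (group_col, event_time);

-- The materialised view doing the enrichment; the LEFT JOIN against the
-- 3B-row metadata table on every insert block is what is slow and expensive:
CREATE MATERIALIZED VIEW events_enriched_mv
TO events_enriched AS
SELECT
    e.event_id,
    e.join_key,
    e.value,
    e.event_time,
    m.group_col
FROM events_raw AS e
LEFT JOIN metadata AS m ON e.join_key = m.join_key;
```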
Are there any other tools/solutions I could use for this use case? (If it's something in GCP, amazing.)
Am I making any mistakes here? The ClickHouse join just isn't performant (I tried using a dictionary as well), but there are too many keys to join, most of them wasteful.
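Roughly what the dictionary attempt looked like (again, placeholder names):

```sql
-- A HASHED layout has to keep all ~3B keys resident in memory, even though
-- only a small fraction of them ever show up in the event stream:
CREATE DICTIONARY metadata_dict
(
    join_key  UInt64,
    group_col String
)
PRIMARY KEY join_key
SOURCE(CLICKHOUSE(TABLE 'metadata'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());

-- The materialised view then used a lookup instead of a JOIN:
-- SELECT ..., dictGet('metadata_dict', 'group_col', join_key) AS group_col
```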
Should I use a different data model for the Pub/Sub events, i.e. push all the metadata into the event itself? But that leaves the problem of backfilling every time we add new metadata columns (which is frequent).
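For clarity, that denormalised option would look something like this on the ClickHouse side (hypothetical columns):

```sql
-- Metadata travels inside each Pub/Sub message, so events land already enriched:
CREATE TABLE events_denormalised
(
    event_id   String,
    join_key   UInt64,
    value      Float64,
    event_time DateTime,
    group_col  LowCardinality(String)  -- copied from the metadata at publish time
)
ENGINE = MergeTree
ORDER BY (group_col, event_time);

-- The catch: every new metadata attribute means a schema change plus a
-- backfill of everything already ingested, e.g.
ALTER TABLE events_denormalised ADD COLUMN new_meta_col String DEFAULT '';
-- ...followed by reprocessing historical rows to actually populate it.
```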
Any help would be appreciated