Ninad Mhatre

Towards cloud native dev...

I am a Python developer with 15+ years of experience. For 4+ years I worked as an automation QA engineer (building a custom testing framework, extending it, and doing functional testing), and as you might have guessed, working on the same thing for 4 years takes the charm and the challenge out of it. So I was looking for something new, and at the start of 2022 an email came around about a Python developer role in another team. I jumped on it and got it, purely on recommendation.

So, this new role was with the Quantitative Research team. To give some background: I am not good at math, especially the kind of math whose equations contain only letters :). One thing I was sure about was my skills; I knew I could learn whatever a task required, produce results, and improve them as I gained more experience and knowledge.

I was happy that I would be working with the scientific side of Python (NumPy, SciPy, etc.) and with GCP (Google Cloud Platform). The first task I picked up was building a pipeline to load data into BigQuery for our ML pipeline. I ran into a few design problems, simply because I was not following a cloud-native approach.

  1. GET requests to the data server timing out after the first 1000 requests

I had built a fully functional Airflow pipeline locally, but on GCP it started timing out. Why? Because I was opening a new socket for every request to the same host! That would have worked fine on a VM or a dedicated server, but not in the cloud. In hindsight, I should have used a `Session` from the start; this is how I learned why sessions matter.
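
Here is a minimal sketch of the difference, assuming a plain HTTP data server (the endpoint URL and helper functions are hypothetical):

```python
import requests

BASE_URL = "https://data-server.example.com/api/values"  # hypothetical endpoint

# Naive version: every call opens (and tears down) a brand-new TCP connection.
# Under cloud networking limits this can exhaust sockets and start timing out.
def fetch_naive(item_id: int) -> dict:
    resp = requests.get(BASE_URL, params={"id": item_id}, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Cloud-friendly version: a single Session keeps a pool of keep-alive
# connections per host, so thousands of calls reuse a handful of sockets.
session = requests.Session()

def fetch_with_session(item_id: int) -> dict:
    resp = session.get(BASE_URL, params={"id": item_id}, timeout=10)
    resp.raise_for_status()
    return resp.json()
```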

  2. Re-running an Airflow job

I must confess that choosing BigQuery (BQ) was not my decision. Since BQ has no concept of duplicate or unique rows, we had to make sure that re-running the job for a given day first cleaned that day's data out of BQ. That forced me to structure the data differently: partition the table on a date column, and on a re-run simply drop the affected partition. But how do you decide which partition to drop? All of these decisions challenged my conventional developer thinking and helped me think like a cloud-native developer.
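
Here is one way the drop-and-reload pattern can look with the `google-cloud-bigquery` client, assuming a table partitioned by day; the project, table name, and `load_rows_for_day` helper are all hypothetical:

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.my_dataset.market_data"  # hypothetical day-partitioned table


def reload_day(run_date: date) -> None:
    """Idempotent daily load: drop the day's partition, then re-insert."""
    # Deleting the partition via its decorator (table$YYYYMMDD) removes that
    # day's rows, so re-running the job never creates duplicates.
    client.delete_table(f"{TABLE}${run_date:%Y%m%d}", not_found_ok=True)

    # Re-load the day's data (load_rows_for_day is a hypothetical helper).
    rows = load_rows_for_day(run_date)
    errors = client.insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BQ insert failed: {errors}")
```

Because each run's logical date tells you exactly which partition it owns, "which partition to drop" falls out of the run itself.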

  3. Backfilling with a start date going back 6 years

This, I must confess, was the hardest part of the whole pipeline. How do you do a one-off backfill for the past 6 years, and then re-run parts of it periodically when some data turns out to be corrupt for specific IDs?
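
A minimal sketch of how this can be structured in Airflow 2.x, where `catchup=True` makes the scheduler create one run per day from `start_date` onwards (the DAG id and loading logic are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_one_day(ds: str) -> None:
    """Load (or re-load) a single day's data; `ds` is the run's logical date."""
    # One run owns exactly one day, and the drop-partition-then-insert
    # pattern above makes it safe to re-run any day at any time.
    print(f"loading data for {ds}")


with DAG(
    dag_id="bq_backfill_demo",        # hypothetical DAG id
    start_date=datetime(2016, 1, 1),  # ~6 years back
    schedule_interval="@daily",
    catchup=True,       # scheduler backfills every day since start_date
    max_active_runs=8,  # throttle so the backfill doesn't hammer the data server
) as dag:
    PythonOperator(task_id="load_day", python_callable=load_one_day)
```

For the periodic fix-ups, something like `airflow dags backfill -s 2019-03-01 -e 2019-03-07 bq_backfill_demo` re-executes just that window, and since every run drops its own partition before loading, re-running is safe.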

When I started this work, I was sure the whole thing could be finished in 1 month. In reality, I had to learn about Airflow, BigQuery, and GCP's limitations and access rights, and I ended up spending close to 2.5 months getting the pipeline into production. I was very happy that my first pipeline then ran fine for 3+ months without a single issue.
