<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanmay</title>
    <description>The latest articles on DEV Community by Tanmay (@tanmay_bhurkunde).</description>
    <link>https://dev.to/tanmay_bhurkunde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1444448%2F9e66b6cc-033e-4a00-8030-57f5a48d1c47.png</url>
      <title>DEV Community: Tanmay</title>
      <link>https://dev.to/tanmay_bhurkunde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tanmay_bhurkunde"/>
    <language>en</language>
    <item>
      <title>Building My First End-to-End ETL Pipeline with Airflow, BigQuery, and Docker</title>
      <dc:creator>Tanmay</dc:creator>
      <pubDate>Sat, 13 Jun 2026 17:17:16 +0000</pubDate>
      <link>https://dev.to/tanmay_bhurkunde/building-my-first-end-to-end-etl-pipeline-with-airflow-bigquery-and-docker-2c1m</link>
      <guid>https://dev.to/tanmay_bhurkunde/building-my-first-end-to-end-etl-pipeline-with-airflow-bigquery-and-docker-2c1m</guid>
      <description>&lt;p&gt;Recently, I completed my first full Data Engineering project: building an end-to-end ETL pipeline using real-world Australian weather data spanning 10 years.&lt;/p&gt;

&lt;p&gt;The dataset contained over 145,000 rows, and the goal of the project was to understand how modern data systems ingest, process, validate, and orchestrate data workflows.&lt;/p&gt;

&lt;p&gt;Rather than focusing only on completing the project quickly, I wanted to understand the engineering decisions happening at each stage of the pipeline.&lt;/p&gt;

&lt;h2&gt;
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline was divided into four major stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract&lt;/li&gt;
&lt;li&gt;Transform&lt;/li&gt;
&lt;li&gt;Load&lt;/li&gt;
&lt;li&gt;Orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project processes weather data from raw CSV format and prepares it for downstream analytics inside Google BigQuery.&lt;/p&gt;

&lt;h2&gt;
  Extract Phase
&lt;/h2&gt;

&lt;p&gt;The extraction layer focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reading raw CSV files,&lt;/li&gt;
&lt;li&gt;validating ingestion,&lt;/li&gt;
&lt;li&gt;handling inconsistent records,&lt;/li&gt;
&lt;li&gt;and detecting missing values early in the pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This stage helped me understand why ingestion reliability is important in real-world data workflows.&lt;/p&gt;

&lt;h2&gt;
  Transform Phase
&lt;/h2&gt;

&lt;p&gt;The transformation stage introduced much more engineering complexity than I initially expected.&lt;/p&gt;

&lt;p&gt;I worked on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;handling null values,&lt;/li&gt;
&lt;li&gt;converting inconsistent data types,&lt;/li&gt;
&lt;li&gt;restructuring records,&lt;/li&gt;
&lt;li&gt;and performing feature engineering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some engineered features included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;temp_range&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_hot_day&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;season classification&lt;/li&gt;
&lt;/ul&gt;
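
&lt;p&gt;As a rough illustration of the engineered features listed above (not the exact code from the repository), here is a minimal sketch in plain Python. It uses dictionaries instead of Pandas to stay self-contained. The field names &lt;code&gt;min_temp&lt;/code&gt;, &lt;code&gt;max_temp&lt;/code&gt;, and &lt;code&gt;month&lt;/code&gt;, and the 30 °C hot-day cutoff, are assumptions made for this example.&lt;/p&gt;

```python
# Illustrative only: the threshold and field names are assumptions.
HOT_DAY_THRESHOLD_C = 30.0  # assumed cutoff for a "hot" day

# Southern Hemisphere seasons, since the data is Australian.
SEASON_BY_MONTH = {
    12: "summer", 1: "summer", 2: "summer",
    3: "autumn", 4: "autumn", 5: "autumn",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}


def add_features(record):
    """Return a copy of one weather record with the engineered fields added."""
    enriched = dict(record)
    enriched["temp_range"] = round(record["max_temp"] - record["min_temp"], 1)
    enriched["is_hot_day"] = record["max_temp"] >= HOT_DAY_THRESHOLD_C
    enriched["season"] = SEASON_BY_MONTH[record["month"]]
    return enriched


sample = {"month": 1, "min_temp": 18.5, "max_temp": 34.0}
features = add_features(sample)
print(features)
```

&lt;p&gt;The same three derivations map naturally onto column-wise operations on a DataFrame; the dictionary version just keeps the logic easy to read.&lt;/p&gt;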

&lt;p&gt;The transformed dataset was then converted from CSV to Parquet format.&lt;/p&gt;

&lt;p&gt;Result:&lt;br&gt;
13.44 MB → 2.35 MB&lt;br&gt;
(82.5% storage reduction)&lt;/p&gt;
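
&lt;p&gt;The reduction figure follows directly from the two file sizes reported above. A quick check of the arithmetic:&lt;/p&gt;

```python
csv_mb = 13.44      # CSV size reported above
parquet_mb = 2.35   # Parquet size reported above

# Share of the original size that was saved by switching formats.
reduction_pct = (csv_mb - parquet_mb) / csv_mb * 100
print(f"{reduction_pct:.1f}% smaller")  # prints "82.5% smaller"
```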

&lt;p&gt;This phase made me appreciate how important schema consistency and data quality are in ETL systems.&lt;/p&gt;

&lt;h2&gt;
  Load Phase
&lt;/h2&gt;

&lt;p&gt;After transformation, the processed data was loaded into Google BigQuery.&lt;/p&gt;

&lt;p&gt;I also implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;row-count validation,&lt;/li&gt;
&lt;li&gt;null-value checks,&lt;/li&gt;
&lt;li&gt;and integrity verification after loading.&lt;/li&gt;
&lt;/ul&gt;
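
&lt;p&gt;To make the post-load checks above concrete, here is a minimal sketch of row-count and null-value validation. It works on in-memory lists of dictionaries rather than on BigQuery query results, and the function name &lt;code&gt;validate_load&lt;/code&gt; and the sample fields are hypothetical, chosen only for this example.&lt;/p&gt;

```python
def validate_load(source_rows, loaded_rows, required_fields):
    """Collect readable problems found after a load; an empty list means OK."""
    problems = []
    # Row-count validation: the target should hold as many rows as the source.
    if len(source_rows) != len(loaded_rows):
        problems.append(
            f"row count mismatch: expected {len(source_rows)}, got {len(loaded_rows)}"
        )
    # Null-value check: every required field must be present and not None.
    for index, row in enumerate(loaded_rows):
        missing = [name for name in required_fields if row.get(name) is None]
        if missing:
            problems.append(f"row {index} has nulls in: {', '.join(missing)}")
    return problems


source = [
    {"location": "Sydney", "max_temp": 25.1},
    {"location": "Perth", "max_temp": 31.4},
]
loaded_ok = list(source)
loaded_bad = [{"location": "Sydney", "max_temp": None}]

print(validate_load(source, loaded_ok, ["location", "max_temp"]))
print(validate_load(source, loaded_bad, ["location", "max_temp"]))
```

&lt;p&gt;Returning a list of problems instead of raising on the first one makes it easier to log everything that went wrong in a single run.&lt;/p&gt;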

&lt;p&gt;This stage introduced me to the importance of downstream reliability and validation in Data Engineering systems.&lt;/p&gt;

&lt;h2&gt;
  Orchestration with Apache Airflow
&lt;/h2&gt;

&lt;p&gt;The entire workflow was orchestrated using Apache Airflow running inside Docker containers.&lt;/p&gt;

&lt;p&gt;The DAG included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduled execution,&lt;/li&gt;
&lt;li&gt;retry logic,&lt;/li&gt;
&lt;li&gt;logging,&lt;/li&gt;
&lt;li&gt;and task dependency management.&lt;/li&gt;
&lt;/ul&gt;
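
&lt;p&gt;Task dependency management is the part that is easiest to show in a few lines. The sketch below is not Airflow code: it uses Python's standard-library &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9+) to show how an orchestrator can derive a run order from declared dependencies. The task names are hypothetical.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (task names are hypothetical).
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "validate": {"load"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)  # prints ['extract', 'transform', 'load', 'validate']
```

&lt;p&gt;An orchestrator adds scheduling, retries, and logging on top of this ordering, but the dependency graph is what decides which task may run next.&lt;/p&gt;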

&lt;p&gt;This was one of the most interesting parts of the project because it made the pipeline feel much closer to a production-style workflow.&lt;/p&gt;

&lt;h2&gt;
  Project Statistics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv6rdeg997uoocf94855.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv6rdeg997uoocf94855.png" alt=" " width="799" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;✅ 145,460 rows processed&lt;br&gt;
✅ 343,248 missing values handled&lt;br&gt;
✅ 0 missing values after transformation&lt;br&gt;
✅ All Airflow tasks completed successfully&lt;/p&gt;

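&lt;h2&gt;
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;PyArrow&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;GitHub Codespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Key Learnings
&lt;/h2&gt;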

&lt;p&gt;This project taught me that Data Engineering is not just about moving data from one system to another.&lt;/p&gt;

&lt;p&gt;It also involves:&lt;/p&gt;

&lt;p&gt;reliability,&lt;br&gt;
validation,&lt;br&gt;
orchestration,&lt;br&gt;
scalability,&lt;br&gt;
and ensuring downstream systems can trust the data they receive.&lt;/p&gt;

&lt;p&gt;To document the learning journey in more depth, I published the project across multiple platforms, each covering a different perspective of the ETL pipeline:&lt;/p&gt;

&lt;p&gt;🔹 Hashnode (technical deep dive into the ETL architecture, orchestration flow, and system design decisions): &lt;a href="https://hashnode.com/edit/cmqciwsp900000bjib95wd9tp" rel="noopener noreferrer"&gt;&lt;strong&gt;Hashnode&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 Medium (reflections on approaching Data Engineering projects through smaller engineering exercises and incremental learning): &lt;a href="https://medium.com/@tanmaybhurkhunde2018/building-my-first-etl-pipeline-changed-how-i-think-about-data-engineering-5e7e45dcb975" rel="noopener noreferrer"&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building the project end-to-end gave me a much deeper understanding of how ETL workflows evolve in real-world systems.&lt;/p&gt;

&lt;p&gt;GitHub Repository: &lt;a href="https://github.com/tanmaybhurkunde/ETL-Pipeline" rel="noopener noreferrer"&gt;&lt;strong&gt;ETL Pipeline&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>googlecloud</category>
      <category>etl</category>
      <category>sql</category>
    </item>
    <item>
      <title>Running Apache Airflow + Docker for Free Using GitHub Codespaces</title>
      <dc:creator>Tanmay</dc:creator>
      <pubDate>Mon, 08 Jun 2026 06:03:39 +0000</pubDate>
      <link>https://dev.to/tanmay_bhurkunde/running-apache-airflow-docker-for-free-using-github-codespaces-4905</link>
      <guid>https://dev.to/tanmay_bhurkunde/running-apache-airflow-docker-for-free-using-github-codespaces-4905</guid>
      <description>&lt;p&gt;While building my ETL pipeline project, I ran into a common beginner problem:&lt;/p&gt;

&lt;p&gt;Running Apache Airflow locally on Windows with Docker was painful.&lt;/p&gt;

&lt;p&gt;Problems included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low disk space&lt;/li&gt;
&lt;li&gt;Docker setup issues&lt;/li&gt;
&lt;li&gt;Linux compatibility problems&lt;/li&gt;
&lt;li&gt;Environment debugging overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With only ~17GB free on my laptop, running multiple Airflow containers locally became difficult.&lt;/p&gt;

&lt;p&gt;So I moved the entire setup to GitHub Codespaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Codespaces Provided
&lt;/h2&gt;

&lt;p&gt;Out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ubuntu Linux environment&lt;/li&gt;
&lt;li&gt;Docker pre-installed&lt;/li&gt;
&lt;li&gt;VS Code in browser&lt;/li&gt;
&lt;li&gt;Auto-cloned GitHub repo&lt;/li&gt;
&lt;li&gt;Port forwarding for Airflow UI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Airflow UI&lt;/li&gt;
&lt;li&gt;Trigger ETL DAG&lt;/li&gt;
&lt;li&gt;Verify successful execution ✔️&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow was running in ~90 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Security Lesson
&lt;/h2&gt;

&lt;p&gt;I accidentally committed my GCP service account key once.&lt;/p&gt;

&lt;p&gt;GitHub Secret Scanning blocked the push automatically.&lt;/p&gt;

&lt;p&gt;I immediately added these patterns to &lt;code&gt;.gitignore&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.json
.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never commit cloud credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Setup Helps Beginners
&lt;/h2&gt;

&lt;p&gt;Codespaces removes a huge amount of local environment friction and lets you focus more on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow orchestration&lt;/li&gt;
&lt;li&gt;ETL pipelines&lt;/li&gt;
&lt;li&gt;Docker workflows&lt;/li&gt;
&lt;li&gt;Cloud integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beginner-friendly walkthrough → check Medium&lt;/li&gt;
&lt;li&gt;Engineering-focused breakdown → check Hashnode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Medium: &lt;a href="https://medium.com/p/8832297e4d85/" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;br&gt;
Hashnode: &lt;a href="https://engineeringfriction.hashnode.dev/how-github-codespaces-helped-me-run-airflow-docker-free?utm_source=hashnode&amp;amp;utm_medium=feed" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project Repo: &lt;a href="https://github.com/tanmaybhurkunde/ETL-Pipeline" rel="noopener noreferrer"&gt;ETL Pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#DataEngineering #Docker #ApacheAirflow #GitHubCodespaces&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>docker</category>
      <category>python</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>How I Broke Down My ETL Pipeline Project Into Smaller Engineering Exercises</title>
      <dc:creator>Tanmay</dc:creator>
      <pubDate>Sat, 06 Jun 2026 08:45:06 +0000</pubDate>
      <link>https://dev.to/tanmay_bhurkunde/how-i-broke-down-my-etl-pipeline-project-into-smaller-engineering-exercises-2c0a</link>
      <guid>https://dev.to/tanmay_bhurkunde/how-i-broke-down-my-etl-pipeline-project-into-smaller-engineering-exercises-2c0a</guid>
      <description>&lt;p&gt;Recently, I started building an ETL pipeline project to better understand how modern data systems process and prepare data.&lt;/p&gt;

&lt;p&gt;Initially, I approached the project as one large system, but I quickly realized that trying to implement everything at once made it difficult to focus on the engineering concepts behind each stage.&lt;/p&gt;

&lt;p&gt;To make learning more manageable, I broke the project into smaller exercises.&lt;/p&gt;

&lt;p&gt;So far, I've completed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract&lt;/li&gt;
&lt;li&gt;Transform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and each stage taught me something different about Data Engineering systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exercise 1 — Extract Phase
&lt;/h2&gt;

&lt;p&gt;The first goal was simple:&lt;br&gt;
collect raw data and prepare it for processing.&lt;/p&gt;

&lt;p&gt;While implementing this stage, I focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reading datasets,&lt;/li&gt;
&lt;li&gt;understanding source formats,&lt;/li&gt;
&lt;li&gt;organizing raw input,&lt;/li&gt;
&lt;li&gt;and creating a clean ingestion flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phase helped me understand that ingestion is more than just "reading data."&lt;/p&gt;

&lt;p&gt;Even before transformation begins, the system needs to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consistency,&lt;/li&gt;
&lt;li&gt;structure,&lt;/li&gt;
&lt;li&gt;and reliability of incoming records.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Exercise 2 — Transform Phase
&lt;/h2&gt;

&lt;p&gt;The transformation stage turned out to be the most interesting part of the project.&lt;/p&gt;

&lt;p&gt;I worked on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cleaning inconsistent records,&lt;/li&gt;
&lt;li&gt;handling null or missing values,&lt;/li&gt;
&lt;li&gt;restructuring datasets,&lt;/li&gt;
&lt;li&gt;standardizing fields,&lt;/li&gt;
&lt;li&gt;and preparing the data for downstream usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This stage made me realize how important data quality is.&lt;/p&gt;

&lt;p&gt;A poorly designed transformation layer can create downstream problems for analytics, reporting, or other services consuming the data.&lt;/p&gt;

&lt;p&gt;It also introduced me to concepts around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;schema design,&lt;/li&gt;
&lt;li&gt;processing logic,&lt;/li&gt;
&lt;li&gt;and data normalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;One thing that stood out to me was that ETL pipelines are not only about moving data from one place to another.&lt;/p&gt;

&lt;p&gt;They're also about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ensuring trust in the data,&lt;/li&gt;
&lt;li&gt;preparing systems for scalability,&lt;/li&gt;
&lt;li&gt;and building reliable processing workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next stage of the project will focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loading transformed data into the target system,&lt;/li&gt;
&lt;li&gt;pipeline orchestration,&lt;/li&gt;
&lt;li&gt;and exploring scalability improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this project incrementally has helped me understand Data Engineering concepts much more clearly than trying to study them only theoretically.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>etl</category>
      <category>sql</category>
    </item>
    <item>
      <title>Why I Stopped Treating Job Applications as My Only Career Strategy</title>
      <dc:creator>Tanmay</dc:creator>
      <pubDate>Sat, 30 May 2026 19:49:42 +0000</pubDate>
      <link>https://dev.to/tanmay_bhurkunde/why-i-stopped-treating-job-applications-as-my-only-career-strategy-1l4c</link>
      <guid>https://dev.to/tanmay_bhurkunde/why-i-stopped-treating-job-applications-as-my-only-career-strategy-1l4c</guid>
      <description>&lt;p&gt;Like many engineers, I started my job search with a simple idea:&lt;/p&gt;

&lt;p&gt;Apply to enough roles and eventually something will work out.&lt;/p&gt;

&lt;p&gt;The reality was more complicated.&lt;/p&gt;

&lt;p&gt;Some positions were already filled.&lt;br&gt;
Some never responded.&lt;br&gt;
Some required significantly more experience.&lt;br&gt;
Some disappeared before interviews even started.&lt;/p&gt;

&lt;p&gt;After a while, I realized something important:&lt;/p&gt;

&lt;p&gt;Applications are necessary, but they are not the only mechanism for creating opportunities.&lt;/p&gt;

&lt;h2&gt;
  A Simple Probability Problem
&lt;/h2&gt;

&lt;p&gt;Imagine sending 100 applications.&lt;/p&gt;

&lt;p&gt;If the response rate is 2%, the expected number of responses is:&lt;/p&gt;

&lt;p&gt;100 × 0.02 = 2&lt;/p&gt;
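
&lt;p&gt;The same numbers can be pushed one step further. Assuming each application is an independent trial with the same 2% response rate (a simplification, since real applications are not identical), this sketch computes the expected number of responses and the chance of hearing back at least once, which comes out to roughly 87%.&lt;/p&gt;

```python
applications = 100
response_rate = 0.02  # the 2% rate assumed above

expected_responses = applications * response_rate
# Chance that at least one application gets a response, assuming independence.
p_at_least_one = 1 - (1 - response_rate) ** applications

print(f"expected responses: {expected_responses:.0f}")
print(f"chance of at least one response: {p_at_least_one:.0%}")
```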

&lt;p&gt;Now imagine spending part of that effort on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building projects&lt;/li&gt;
&lt;li&gt;Writing technical articles&lt;/li&gt;
&lt;li&gt;Creating a portfolio&lt;/li&gt;
&lt;li&gt;Participating in engineering discussions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these guarantee opportunities.&lt;/p&gt;

&lt;p&gt;But they increase the number of ways someone can discover your work.&lt;/p&gt;

&lt;h2&gt;
  What I Decided to Build
&lt;/h2&gt;

&lt;p&gt;Instead of focusing exclusively on applications, I started working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Payment Gateway Design:&lt;/strong&gt; understanding transactions, idempotency, retries, and failure handling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Design Portfolio:&lt;/strong&gt; documenting database designs and architectural decisions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Engineering Journey:&lt;/strong&gt; exploring Kafka, Spark, Airflow, and distributed systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical Writing:&lt;/strong&gt; sharing lessons learned while studying and building.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  The Hard Part Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The internet often makes personal branding sound easy.&lt;/p&gt;

&lt;p&gt;Reality looks more like this:&lt;/p&gt;

&lt;p&gt;Writing articles nobody reads.&lt;br&gt;
Publishing posts that get little engagement.&lt;br&gt;
Maintaining projects after the excitement wears off.&lt;br&gt;
Spending months before seeing meaningful results.&lt;/p&gt;

&lt;p&gt;There is no shortcut.&lt;/p&gt;

&lt;p&gt;The value comes from consistency.&lt;/p&gt;

&lt;h2&gt;
  Final Thought
&lt;/h2&gt;

&lt;p&gt;I'm not abandoning job applications.&lt;/p&gt;

&lt;p&gt;I'm simply trying to build assets that continue working even when I'm not actively applying.&lt;/p&gt;

&lt;p&gt;Applications create opportunities one submission at a time.&lt;/p&gt;

&lt;p&gt;Projects and writing create opportunities that can compound over time.&lt;/p&gt;

&lt;p&gt;I'm curious how other engineers balance these two approaches.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>learninginpublic</category>
      <category>careerdevelopment</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
