DEV Community: Tanmay

Reverse-Engineering LinkedIn's Email Templates to Automate an Airtable CRM

Tanmay — Thu, 23 Jul 2026 05:24:48 +0000

Originally posted on Hashnode — full version and any follow-up deep-dives live here.

LinkedIn has no API for "someone accepted my connection request" or "a connection changed jobs." What it does have is email notifications. This post covers what I learned building a Python + IMAP + Airtable pipeline that parses those emails directly, including the undocumented headers and template variants that made it actually reliable.

The wrong starting assumption

I assumed one email template per event type. Wrong: while hunting for "connection accepted" emails I ran into at least three templates:

Single-connection, HTML-based
Batch/digest, plain-text, for multiple acceptances at once
A "network digest" email that isn't about connections at all — it bundles job-change alerts, post reactions, and search-appearance notices into one message

None of this is documented. I found all of it by pulling the raw MIME source (Show original in Gmail).

The header that made classification reliable

X-LinkedIn-Class: INVITE-ACCEPT
X-LinkedIn-Template: email_accept_invite_digest_02

X-LinkedIn-Class is invisible in the rendered email but present in every message — a clean, stable signal for event type. X-LinkedIn-Template identifies the exact variant, which is what let me route between the single and batch connection-accept parsers:

python
template = msg.get("X-LinkedIn-Template", "")

if "digest" in template.lower():
    parsed_list = parse_connection_accept_digest(text_body)
else:
    parsed = parse_connection_accept_email(html_body)
Check headers before you write a single regex against visible content — vendors often leave more structure in there than the UI exposes.

Parsing plain text over HTML (mostly)

For the digest emails, plain text turned out to be far more parseable than the deeply-nested HTML tables — LinkedIn's plain-text part cleanly separates notifications with blank lines, and profile slugs are recoverable straight from the raw URLs sitting after each block's CTA text:

python
import re

def extract_profile_slug_from_compose_url(compose_url):
    # The slug is the path segment right after messaging/compose/
    match = re.search(r"messaging/compose/([^/]+)/", compose_url)
    return match.group(1) if match else None

But the single-connection template flipped this: its plain-text part didn't include a profile link at all, and only the HTML had a data-test-id marker precise enough to extract it. Lesson: check both MIME parts and use whichever has what you need. Don't assume one is always better.
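One way to get both MIME parts side by side, sketched with the standard library. The helper name get_text_and_html and the demo message are assumptions for illustration. The point is only that each part is pulled independently, so a parser can pick whichever one carries the data.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def get_text_and_html(raw_bytes):
    """Return (text_body, html_body); either can be None if the part is missing."""
    msg = message_from_bytes(raw_bytes, policy=default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    text_body = text_part.get_content() if text_part is not None else None
    html_body = html_part.get_content() if html_part is not None else None
    return text_body, html_body


# Hypothetical multipart/alternative message, for illustration only.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("plain version")
demo.add_alternative("<p>html version</p>", subtype="html")

text_body, html_body = get_text_and_html(demo.as_bytes())
print(repr(text_body), repr(html_body))
```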

Classify, flag, never guess

Digest emails bundle unrelated notification types in one message. My rule for anything that didn't match a known pattern: log it with a Needs Review status and the raw block text, never silently drop it or force-fit it into the wrong bucket.

python
return {"type": "needs_review", "raw_text": block[:1000]}

Templates drift. Build the "I don't recognize this" path on day one, not as an afterthought.
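A rough sketch of that rule applied to a whole digest. It assumes blocks are separated by blank lines, as described earlier. The two regex patterns are hypothetical placeholders, not LinkedIn's actual wording, and would need to be replaced with phrases taken from real emails.

```python
import re

# Hypothetical patterns, for illustration only.
KNOWN_PATTERNS = [
    ("connection_accepted", re.compile(r"accepted your invitation", re.I)),
    ("job_change", re.compile(r"started a new position", re.I)),
]


def classify_block(block):
    """Return a typed record, or a needs_review record if nothing matches."""
    for event_type, pattern in KNOWN_PATTERNS:
        if pattern.search(block):
            return {"type": event_type, "raw_text": block[:1000]}
    # Unknown input is kept, truncated, and flagged. It is never dropped.
    return {"type": "needs_review", "raw_text": block[:1000]}


def classify_digest(text_body):
    """Split a plain-text digest on blank lines and classify each block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text_body) if b.strip()]
    return [classify_block(b) for b in blocks]
```

Because the fallback is the last line of classify_block, adding a new pattern later never changes what happens to input nobody has seen yet.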

Performance: two-stage IMAP fetch

First version fetched full raw email bodies for every linkedin.com sender just to check one header — slow on any inbox with unrelated LinkedIn mail. Fix: check headers first, fetch bodies only for matches.

python
status, msg_data = conn.fetch(msg_id, "(BODY.PEEK[HEADER])")

# ...check X-LinkedIn-Class here...

# only THEN, for matches:
status, msg_data = conn.fetch(msg_id, "(BODY.PEEK[])")

BODY.PEEK instead of RFC822 matters for a second reason: RFC822 silently marks a message as read as a side effect of fetching it — a side effect you don't want happening during a "just checking" step.
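Put together, the two stages might look like the sketch below. It assumes conn is an imaplib connection with a mailbox already selected. The function name fetch_if_class_matches and the wanted_class default are illustrative choices, not taken from the original pipeline.

```python
from email import message_from_bytes


def fetch_if_class_matches(conn, msg_id, wanted_class="INVITE-ACCEPT"):
    """Fetch headers first; fetch the full message only when the class matches.

    Returns the raw message bytes, or None if the message is skipped.
    """
    # Stage 1: headers only. PEEK leaves the \Seen flag untouched.
    status, data = conn.fetch(msg_id, "(BODY.PEEK[HEADER])")
    if status != "OK" or not data or not isinstance(data[0], tuple):
        return None
    headers = message_from_bytes(data[0][1])
    found = (headers.get("X-LinkedIn-Class") or "").strip().upper()
    if found != wanted_class:
        return None

    # Stage 2: full body, still without marking the message as read.
    status, data = conn.fetch(msg_id, "(BODY.PEEK[])")
    if status != "OK" or not data or not isinstance(data[0], tuple):
        return None
    return data[0][1]
```

The isinstance checks are there because imaplib returns a list whose first item is a tuple only when the server actually sent message data.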

Read-state as a designed decision, not an accident

Once fetches don't auto-mark-read, you decide explicitly:

Processed successfully → mark read
Failed to parse → leave unread (cheap "check this manually" signal, and it resurfaces next run)

Get this backwards and either your failures vanish silently, or every run re-scans your whole inbox history, getting slower as it grows.
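As a sketch, the decision can live in one small function. It assumes conn is an imaplib connection and that the caller already knows whether parsing succeeded. The name finalize_message is made up for illustration.

```python
def finalize_message(conn, msg_id, parsed_ok):
    """Mark a message read only after it was processed successfully."""
    if parsed_ok:
        # Explicitly set the \Seen flag. Nothing earlier in the run did this.
        conn.store(msg_id, "+FLAGS", "\\Seen")
        return "read"
    # Failed parses stay unread so they resurface on the next run.
    return "unread"
```

This only works if every earlier fetch used BODY.PEEK. A single plain RFC822 fetch would mark the message read before this function ever runs.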

"Already exists" isn't boolean

When a job-change alert fires for someone already in the table, is that a duplicate or new info to apply? Ended up treating it as new info: if a headline changed, the row gets updated rather than staying stale.
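One way to express "exists, but may carry new info" is a three-way result instead of a boolean. This is a sketch with rows as plain dicts. The field names headline and company are assumptions for illustration, and the actual Airtable read and write calls are left out.

```python
def merge_contact(existing, incoming, tracked_fields=("headline", "company")):
    """Decide what to do with an incoming record.

    Returns (action, fields) where action is "create", "update", or "skip".
    """
    if existing is None:
        return "create", dict(incoming)

    # Only fields that are present and actually different count as changes.
    changes = {
        field: incoming[field]
        for field in tracked_fields
        if incoming.get(field) and incoming.get(field) != existing.get(field)
    }
    if changes:
        return "update", changes
    return "skip", {}


# Hypothetical rows, for illustration only.
old_row = {"slug": "jane-doe", "headline": "Engineer at A", "company": "A"}
new_row = {"slug": "jane-doe", "headline": "Engineer at B", "company": "B"}
print(merge_contact(old_row, new_row))
```

Ignoring empty incoming values means a sparse notification never blanks out data that an earlier email already filled in.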

Takeaways

Read raw MIME source before writing any parser — the rendered view hides the useful signals
Assume more format variants exist than you've seen; build the unknown-input path first
Fetch cheap (headers) before expensive (bodies)
Read/unread state and dedup logic are part of your data model, not plumbing
Flag, don't guess — a confident wrong parse is worse than an honest "needs review"

Happy to go deeper on any piece of this in the comments — schema design, the Airtable API side, or the full parser code.

I Thought My Virtual Environment Was Broken... It Wasn't.

Tanmay — Wed, 22 Jul 2026 17:44:31 +0000

As an aspiring Data Engineer, I spend a lot of time setting up local environments before actually writing pipelines.

Recently I hit a bug that looked simple but consumed far more time than it should have.

Every package installed successfully.

Every import failed.

Sound familiar?

After deleting and recreating my virtual environment several times, I discovered the real issue:

VS Code was running one Python interpreter while pip was installing packages into another.

Here's the checklist I follow now

✅ Check where python (on macOS/Linux: which python)

✅ Check where pip (on macOS/Linux: which pip)

✅ Confirm the selected interpreter in VS Code

✅ Use

python -m pip install

instead of

pip install

✅ Verify package location using

python -m pip show
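The same checks can be run from inside Python, which is useful because it shows what the interpreter that is actually running sees. A small sketch follows. The function name describe_environment is made up, and "json" is only a stand-in for whichever package fails to import.

```python
import importlib.util
import sys


def describe_environment(package_name):
    """Report which interpreter is running and where a top-level package lives."""
    spec = importlib.util.find_spec(package_name)
    return {
        # Path of the interpreter executing this code.
        "interpreter": sys.executable,
        # True when running inside a venv created by the venv module.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # File the import would load from, or None if it cannot be found.
        "package_location": spec.origin if spec is not None else None,
    }


print(describe_environment("json"))
```

Run it with the Run button in VS Code and again from the terminal. If the interpreter paths differ, that is the mismatch described above.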
Why I'm sharing this

One thing I love about the DEV community is how often people document the "small" bugs.

These posts end up helping thousands of developers because they're based on real debugging sessions instead of ideal tutorials.

If you're just starting with Python, backend development, or Data Engineering, hopefully this saves you from deleting your virtual environment five times like I did.

What was the smallest configuration mistake that wasted the most time for you?

Building My First End-to-End ETL Pipeline with Airflow, BigQuery, and Docker

Tanmay — Sat, 13 Jun 2026 17:17:16 +0000

Recently, I completed my first full Data Engineering project: building an end-to-end ETL pipeline using real-world Australian weather data spanning 10 years.

The dataset contained over 145,000 rows, and the goal of the project was to understand how modern data systems ingest, process, validate, and orchestrate data workflows.

Rather than focusing only on completing the project quickly, I wanted to understand the engineering decisions happening at each stage of the pipeline.

Project Overview

The pipeline was divided into four major stages:

Extract
Transform
Load
Orchestration

The project processes weather data from raw CSV format and prepares it for downstream analytics inside Google BigQuery.

Extract Phase

The extraction layer focused on:

reading raw CSV files,
validating ingestion,
handling inconsistent records,
and detecting missing values early in the pipeline.

This stage helped me understand why ingestion reliability is important in real-world data workflows.

Transform Phase

The transformation stage introduced much more engineering complexity than I initially expected.

I worked on:

handling null values,
converting inconsistent data types,
restructuring records,
and performing feature engineering.

Some engineered features included:

temp_range
is_hot_day
season classification
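A plain-Python sketch of what these three features could look like for a single record. The input keys (date, min_temp, max_temp), the 30 °C threshold, and the month-to-season mapping for the southern hemisphere are assumptions for illustration. The project itself used Pandas and the post does not state the exact rules.

```python
from datetime import date

HOT_DAY_THRESHOLD_C = 30.0  # assumed cutoff, not taken from the project


def season_southern_hemisphere(d):
    """Map a date to a meteorological season in the southern hemisphere."""
    if d.month in (12, 1, 2):
        return "summer"
    if d.month in (3, 4, 5):
        return "autumn"
    if d.month in (6, 7, 8):
        return "winter"
    return "spring"


def add_features(row):
    """Return a copy of the row with temp_range, is_hot_day and season added."""
    out = dict(row)
    out["temp_range"] = round(row["max_temp"] - row["min_temp"], 1)
    out["is_hot_day"] = row["max_temp"] >= HOT_DAY_THRESHOLD_C
    out["season"] = season_southern_hemisphere(row["date"])
    return out


# Hypothetical record, for illustration only.
sample_row = {"date": date(2015, 1, 10), "min_temp": 18.2, "max_temp": 33.5}
print(add_features(sample_row))
```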

The transformed dataset was then converted from CSV to Parquet format.

Result:
13.44 MB → 2.35 MB
(82.5% storage reduction)

This phase made me appreciate how important schema consistency and data quality are in ETL systems.

Load Phase

After transformation, the processed data was loaded into Google BigQuery.

I also implemented:

row-count validation,
null-value checks,
and integrity verification after loading.
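The checks themselves can be kept independent of BigQuery by passing in the numbers. In this sketch the counts are assumed to come from queries against the source file and the loaded table, which are not shown. The function name validate_load and the column names in the usage example are made up.

```python
def validate_load(expected_rows, loaded_rows, null_counts):
    """Return a list of problems found after a load; empty means it passed.

    null_counts maps a column name to the number of NULLs found in it.
    """
    problems = []
    if loaded_rows != expected_rows:
        problems.append(
            f"row count mismatch: expected {expected_rows}, loaded {loaded_rows}"
        )
    for column, nulls in null_counts.items():
        if nulls:
            problems.append(f"{nulls} null values in column {column!r}")
    return problems


# Hypothetical numbers, for illustration only.
print(validate_load(145460, 145460, {"temp_range": 0, "season": 0}))
```

Returning a list rather than raising on the first failure means one run reports every problem at once.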

This stage introduced me to the importance of downstream reliability and validation in Data Engineering systems.

Orchestration with Apache Airflow

The entire workflow was orchestrated using Apache Airflow running inside Docker containers.

The DAG included:

scheduled execution,
retry logic,
logging,
and task dependency management.

This was one of the most interesting parts of the project because it made the pipeline feel much closer to a production-style workflow.

Project Statistics

✅ 145,460 rows processed
✅ 343,248 missing values handled
✅ 0 missing values after transformation
✅ All Airflow tasks completed successfully

Tech Stack

Python
Pandas
PyArrow
Google BigQuery
Apache Airflow
Docker
GitHub Codespaces

Key Learnings

This project taught me that Data Engineering is not just about moving data from one system to another.

It also involves:

reliability,
validation,
orchestration,
scalability,
and ensuring downstream systems can trust the data they receive.

To document the learning journey more deeply, I published the project across multiple platforms — each covering a different perspective of the ETL pipeline:

Hashnode: technical deep dive into the ETL architecture, orchestration flow, and system design decisions

Medium: reflections on approaching Data Engineering projects through smaller engineering exercises and incremental learning

Building the project end-to-end gave me a much deeper understanding of how ETL workflows evolve in real-world systems.

GitHub Repository: ETL Pipeline

Running Apache Airflow + Docker for Free Using GitHub Codespaces

Tanmay — Mon, 08 Jun 2026 06:03:39 +0000

While building my ETL pipeline project, I ran into a common beginner problem:

Running Apache Airflow locally on Windows with Docker was painful.

Problems included:

Low disk space
Docker setup issues
Linux compatibility problems
Environment debugging overhead

With only ~17GB free on my laptop, running multiple Airflow containers locally became difficult.

So I moved the entire setup to GitHub Codespaces.

What Codespaces Provided

Out of the box:

Ubuntu Linux environment
Docker pre-installed
VS Code in browser
Auto-cloned GitHub repo
Port forwarding for Airflow UI

Workflow

docker-compose up

Then:

Open Airflow UI
Trigger ETL DAG
Verify successful execution ✔️

Airflow was running in ~90 seconds.

Important Security Lesson

I accidentally committed my GCP service account key once.

GitHub Secret Scanning blocked the push automatically.

I immediately added these patterns to .gitignore:

*.json
.env

Never commit cloud credentials.

Why This Setup Helps Beginners

Codespaces removes a huge amount of local environment friction and lets you focus more on:

Airflow orchestration
ETL pipelines
Docker workflows
Cloud integrations

If you'd like:

Beginner-friendly walkthrough → check Medium
Engineering-focused breakdown → check Hashnode

Medium: Link
Hashnode: Link

Project Repo: ETL Pipeline

#DataEngineering #Docker #ApacheAirflow #GitHubCodespaces

How I Broke Down My ETL Pipeline Project Into Smaller Engineering Exercises

Tanmay — Sat, 06 Jun 2026 08:45:06 +0000

Recently, I started building an ETL pipeline project to better understand how modern data systems process and prepare data.

Initially, I approached the project as one large system, but I quickly realized that trying to implement everything at once made it difficult to focus on the engineering concepts behind each stage.

To make learning more manageable, I broke the project into smaller exercises.

So far, I've completed:

Extract
Transform

and each stage taught me something different about Data Engineering systems.

Exercise 1 — Extract Phase

The first goal was simple:
collect raw data and prepare it for processing.

While implementing this stage, I focused on:

reading datasets,
understanding source formats,
organizing raw input,
and creating a clean ingestion flow.

This phase helped me understand that ingestion is more than just "reading data."

Even before transformation begins, the system needs to think about:

consistency,
structure,
and reliability of incoming records.

Exercise 2 — Transform Phase

The transformation stage turned out to be the most interesting part of the project.

I worked on:

cleaning inconsistent records,
handling null or missing values,
restructuring datasets,
standardizing fields,
and preparing the data for downstream usage.

This stage made me realize how important data quality is.

A poorly designed transformation layer can create downstream problems for analytics, reporting, or other services consuming the data.

It also introduced me to concepts around:

schema design,
processing logic,
and data normalization.

Key Takeaways

One thing that stood out to me was that ETL pipelines are not only about moving data from one place to another.

They're also about:

ensuring trust in the data,
preparing systems for scalability,
and building reliable processing workflows.

What's Next

The next stage of the project will focus on:

loading transformed data into the target system,
pipeline orchestration,
and exploring scalability improvements.

Building this project incrementally has helped me understand Data Engineering concepts much more clearly than trying to study them only theoretically.

Why I Stopped Treating Job Applications as My Only Career Strategy

Tanmay — Sat, 30 May 2026 19:49:42 +0000

Like many engineers, I started my job search with a simple idea:

Apply to enough roles and eventually something will work out.

The reality was more complicated.

Some positions were already filled.
Some never responded.
Some required significantly more experience.
Some disappeared before interviews even started.

After a while, I realized something important:

Applications are necessary, but they are not the only mechanism for creating opportunities.

A Simple Probability Problem

Imagine sending 100 applications.

If the response rate is 2%, the expected number of responses is:

100 × 0.02 = 2

Now imagine spending part of that effort on:

Building projects
Writing technical articles
Creating a portfolio
Participating in engineering discussions

None of these guarantee opportunities.

But they increase the number of ways someone can discover your work.

What I Decided to Build

Instead of focusing exclusively on applications, I started working on:

Payment Gateway Design

Understanding transactions, idempotency, retries, and failure handling.

Schema Design Portfolio

Documenting database designs and architectural decisions.

Data Engineering Journey

Exploring Kafka, Spark, Airflow, and distributed systems.

Technical Writing

Sharing lessons learned while studying and building.

The Hard Part Nobody Talks About

The internet often makes personal branding sound easy.

Reality looks more like this:

Writing articles nobody reads.
Publishing posts that get little engagement.
Maintaining projects after the excitement wears off.
Spending months before seeing meaningful results.

There is no shortcut.

The value comes from consistency.

Final Thought

I'm not abandoning job applications.

I'm simply trying to build assets that continue working even when I'm not actively applying.

Applications create opportunities one submission at a time.

Projects and writing create opportunities that can compound over time.

I'm curious how other engineers balance these two approaches.