data engineering vs data science is the most-searched career comparison in the modern data job market — and the question almost always means a third one too: where does the data analyst sit, and why does the same job title pay a different number depending on which of those three teams owns it. The confusion is structural. Hiring managers still recycle a job description written in 2014 ("data scientist who can also build pipelines and dashboards"), the org chart has since split that unicorn into three distinct ladders, and a new middle role — the analytics engineer — has emerged at the seam.
This guide is the cheat sheet you wished existed the first time a recruiter asked "are you a data engineer or a data scientist?" and you realised the honest answer was "depends what the team needs me to build this quarter." It walks through why the three roles keep blurring, the three-role Venn that splits unique territory from shared overlap, the canonical stack matrix across languages, warehouses, orchestration, modelling, BI, and streaming, the 2026 salary bands across the US, EU, and India for every seniority level, and the two main career rails (analytics rail and science rail) with explicit switch costs at every junction. Each section pairs a teaching block with a decision-shaped worked answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library → for the analyst and DE shared language, rehearse on Python practice problems → for the DS and DE shared muscle, and stack the platform skills with ETL practice problems →.
On this page
- Why the three roles keep getting confused
- The three-role Venn — shared tools and unique territory
- Stack comparison — tools, languages, infrastructure
- Salary bands 2026 — US, EU, India
- Career path flow — entry to senior and role switching
- Cheat sheet — role boundary recipes
- Frequently asked questions
- Practice on PipeCode
1. Why the three roles keep getting confused
The 2014 unicorn job ad still echoes — and the modern org chart has quietly split that one role into three ladders with separate SLAs
The one-sentence invariant: a "data scientist" in 2014 meant whoever could ship a SQL query, train a model, build a pipeline, and present to the board — and the modern org chart has split that bundle into a data analyst (questions and dashboards), a data engineer (pipelines and warehouses), and a data scientist (models and experiments). Once you internalise that "the title splits by ownership boundary, not by hype," every recruiter conversation gets easier.
The three places the confusion bites.
- Job descriptions. A 2014 "data scientist" JD listed Hadoop, Spark, scikit-learn, Tableau, SQL, and presentation skills in a single role. The 2026 version of that JD belongs to three different people on three different teams. Hiring managers who haven't refreshed their JD library still post the unicorn ad.
- Self-identification. A new grad reads "data scientist earns $X" on Glassdoor, copies the title into LinkedIn, and discovers the role they actually do (build dashboards, write SQL, present to PM) maps to "data analyst" — which the same Glassdoor entry shows earning 35% less.
- Performance reviews. A data engineer who ships a great model and a data scientist who ships a great pipeline both get praised, then quietly down-levelled at the next calibration because each one's "real" role expected something else. The boundary matters at performance time even when it didn't matter at hire time.
Three-valued ownership in one sentence.
Every data team has three SLAs — the pipeline SLA (data lands on time, correct, idempotent), the dashboard SLA (the metric matches the source of truth and refreshes by 8 AM), and the model SLA (predictions are calibrated, the feature drift is monitored, the A/B test is stat-sig). The data engineer owns the first, the data analyst owns the second, and the data scientist owns the third. Roles that "share" an SLA in your org chart are roles you have not finished defining yet.
What hiring managers actually listen for.
- Do you describe your last project by the artifact you owned (pipeline / dashboard / model) — senior signal.
- Do you reach for an analytics engineer label when your work is dbt + warehouse modelling? — required answer in 2026.
- Do you mention on-call rotations when discussing DE work? — senior signal, since DE carries the pipeline pager.
- Do you recognise that MLOps sits at the DE ∩ DS overlap rather than inside pure DS? — required answer.
The 2026 reality.
- Analytics engineer is now the fastest-growing title between DA and DE — owns dbt models and the warehouse contract, but does not own ingestion or orchestration.
- MLOps engineer and ML engineer are the increasingly distinct DE/DS overlap roles, with MLOps owning serving infra and ML engineers owning the train-to-deploy lifecycle.
- Data analyst is splitting into product analyst, growth analyst, and BI developer at scale; smaller orgs still hire generalist DAs.
- Data scientist is splitting into applied scientist (modelling for product), research scientist (publishing and frontier work), and decision scientist (experimentation and causal inference).
Worked example — three role descriptions from the same week
Detailed explanation. New grads often confuse the three roles because the JDs sound similar at the top: "use data to drive business decisions." The discriminator is the artifact each role ships. Pull three real JDs from the same company posted in the same week and read them by artifact — the role boundaries appear immediately.
Question. Given three abbreviated JDs from the same fintech, classify each as DE, DS, or DA based on the artifact each role ships and the SLA each one owns.
Input.
| JD # | Top-of-page hook | Day-to-day artifact | SLA owned |
|---|---|---|---|
| A | "Use data to drive product decisions" | dashboards + KPI definitions | "8 AM refresh, metric matches finance" |
| B | "Use data to drive product decisions" | Airflow DAGs + Snowflake schemas | "data lands by 6 AM, idempotent reruns" |
| C | "Use data to drive product decisions" | A/B tests + churn model | "stat-sig at p<0.05, model AUC > 0.78" |
Code (pseudo-decision tree).
def classify_role(jd: dict) -> str:
artifact = jd["day_to_day_artifact"].lower()
sla = jd["sla_owned"].lower()
if "dag" in artifact or "schema" in artifact or "idempotent" in sla:
return "Data Engineer"
if "model" in artifact or "a/b" in artifact or "auc" in sla:
return "Data Scientist"
if "dashboard" in artifact or "kpi" in artifact or "metric matches" in sla:
return "Data Analyst"
return "Unknown — ask the hiring manager"
jds = [
{"id": "A", "day_to_day_artifact": "dashboards + KPI definitions",
"sla_owned": "8 AM refresh, metric matches finance"},
{"id": "B", "day_to_day_artifact": "Airflow DAGs + Snowflake schemas",
"sla_owned": "data lands by 6 AM, idempotent reruns"},
{"id": "C", "day_to_day_artifact": "A/B tests + churn model",
"sla_owned": "stat-sig at p<0.05, model AUC > 0.78"},
]
for jd in jds:
print(jd["id"], "->", classify_role(jd))
Step-by-step explanation.
- The top-of-page hook is identical in all three JDs — "use data to drive product decisions" — because that's the company's mission, not the role. Ignore it.
- JD A ships dashboards. Its SLA is the 8 AM refresh and metric matching finance. The owned artifact is the dashboard; the role is Data Analyst.
- JD B ships Airflow DAGs and Snowflake schemas. Its SLA is idempotent reruns and the 6 AM landing time. The owned artifact is the pipeline; the role is Data Engineer.
- JD C ships A/B tests and a churn model. Its SLA is statistical significance and model AUC. The owned artifact is the model; the role is Data Scientist.
- If the JD's day-to-day artifact list mixes pipelines and models, that is a 2014 unicorn JD — push back during the recruiter call and ask which artifact actually has on-call.
Output.
| JD # | Inferred role |
|---|---|
| A | Data Analyst |
| B | Data Engineer |
| C | Data Scientist |
Rule of thumb. Read every JD by the artifact it lists in the day-to-day section. Ignore the top-of-page mission statement; ignore the "skills" list (every JD lists SQL and Python because every JD copies the previous one). The role boundary lives in the bullet list of weekly deliverables.
Worked example — your first job title rarely matches what you do
Detailed explanation. Surveys repeatedly show that 30-40% of "data scientists" spend most of their time on data prep, cleaning, and pipeline work — i.e. data engineering. New grads pick the title with the highest median salary on Glassdoor, accept the role, then discover the actual work is one rung down the seniority ladder. Knowing the time-allocation curve before you accept the offer prevents a year of frustration.
Question. Given a self-reported weekly time log from a "data scientist," classify the actual role using a simple time-bucket rule, and recommend the next career move.
Input.
| Activity | Hours / week |
|---|---|
| Writing SQL for ad-hoc requests | 12 |
| Building dashboards in Looker | 6 |
| Maintaining ETL pipelines in Airflow | 14 |
| Training / tuning ML models | 4 |
| Stakeholder presentations | 4 |
| Total | 40 |
Code.
weekly_hours = {
"sql_adhoc": 12,
"dashboards": 6,
"etl_pipelines": 14,
"ml_models": 4,
"stakeholder": 4,
}
bucket_to_role = {
"sql_adhoc": "DA",
"dashboards": "DA",
"etl_pipelines": "DE",
"ml_models": "DS",
"stakeholder": "DA",
}
role_hours = {"DA": 0, "DE": 0, "DS": 0}
for bucket, hours in weekly_hours.items():
role_hours[bucket_to_role[bucket]] += hours
dominant = max(role_hours, key=role_hours.get)
print("Actual role by time allocation:", dominant, role_hours)
Step-by-step explanation.
- Sum hours per role bucket: DA gets
12 + 6 + 4 = 22, DE gets14, DS gets4. Total 40. - The dominant bucket is DA at 22 hours — even though the LinkedIn title says "data scientist."
- The 4 hours of model work is a side-project at this org. To become a DS by time allocation, the person needs to negotiate a different team or push back on the DA workload.
- The 14 hours on Airflow signals strong DE intuition; the analytics-engineer path is the lowest-friction next step (DA → analytics engineer is a 6-12 month skill build).
- The "stakeholder" hours overlap with DA but are a transferable senior signal regardless of role.
Output.
| Role | Hours / week | % of week |
|---|---|---|
| DA | 22 | 55% |
| DE | 14 | 35% |
| DS | 4 | 10% |
Rule of thumb. Once a quarter, log your actual hours by role bucket. If your dominant bucket diverges from your job title by more than 30%, raise it in your 1:1 — either renegotiate the work, change teams, or rewrite your resume to reflect the work you actually do.
Worked example — the PipeCode framing: pick by ownership boundary, not by hype curve
Detailed explanation. New grads pick roles by the Gartner Hype Cycle — "AI is hot, so I'll be a data scientist." Senior practitioners pick roles by ownership boundary — "I want to own the artifact that I find most satisfying to ship." The first framing makes you a generic candidate; the second makes you a senior IC.
Question. Given three honest preferences from a candidate, recommend the best-fit starting role and the canonical first-year promotion path.
Input — candidate preferences.
| Question | Answer |
|---|---|
| What do you enjoy debugging? | "a slow query plan" |
| What do you find satisfying to ship? | "a reliable pipeline" |
| What do you avoid doing? | "presenting to executives" |
Code.
prefs = {
"enjoys_debugging": "slow query plan",
"satisfying_ship": "reliable pipeline",
"avoids": "presenting to executives",
}
if "query plan" in prefs["enjoys_debugging"] and "pipeline" in prefs["satisfying_ship"]:
role = "Data Engineer"
rail = "DA -> analytics engineer -> DE"
next_skill = "Airflow + dbt + warehouse modelling"
elif "model" in prefs["satisfying_ship"] or "experiment" in prefs["enjoys_debugging"]:
role = "Data Scientist"
rail = "Stats grad -> DS -> ML engineer"
next_skill = "scikit-learn + experimentation framework"
else:
role = "Data Analyst"
rail = "BI analyst -> DA -> analytics engineer"
next_skill = "Looker / Tableau + KPI ownership"
print(role, "via", rail, "| build:", next_skill)
Step-by-step explanation.
- The candidate enjoys debugging query plans — a strong DE signal (DEs spend a lot of time on
EXPLAIN ANALYZE). - The candidate finds reliable pipelines satisfying — DE artifact, confirmed.
- The candidate avoids executive presentations — rules out DA as the dominant role.
- The recommended role is Data Engineer, with the analytics-rail entry point. Even though the candidate "could" enter directly as a junior DE, the analytics rail (BI analyst → DA → analytics engineer → DE) has lower interview friction in 2026 because it builds the warehouse intuition first.
- The next skill to build is the analytics-engineer triad: Airflow + dbt + warehouse modelling. Six months of focused practice and the candidate is interview-ready.
Output.
| Inferred role | Suggested rail | Next skill block |
|---|---|---|
| Data Engineer | DA → analytics engineer → DE | Airflow + dbt + warehouse modelling |
Rule of thumb. Pick your role by what you find satisfying to ship and what you find satisfying to debug. The compensation curves are close enough across roles that picking by money costs you in retention and growth; picking by ownership boundary pays compounding career interest.
Decision question on role identification
A staff DE often opens an interview with: "Walk me through a recent project from your last role. Tell me what you shipped, what SLA you owned, who reviewed your code, and who picked up the pager when it broke." It surfaces in one minute whether the candidate is a generic "data person" or whether they understand their role's ownership boundary.
Decision Using the artifact-SLA-pager triad
def role_audit(project: dict) -> dict:
"""
Audit a candidate's last project by three lenses.
Returns the inferred role boundary and the seniority signal.
"""
artifact = project["artifact"] # what you shipped
sla = project["sla"] # what guarantee you owned
pager = project["who_paged"] # who got woken at 3 AM
role_by_artifact = {
"pipeline": "DE", "dashboard": "DA", "model": "DS",
"dbt_model": "Analytics Engineer", "deployment": "ML Engineer",
}.get(artifact, "Unknown")
seniority_signal = "Senior" if pager == "me" else "Mid / Junior"
return {
"inferred_role": role_by_artifact,
"owned_sla": sla,
"seniority_signal": seniority_signal,
}
project = {
"artifact": "pipeline",
"sla": "data lands by 6 AM with idempotent reruns",
"who_paged": "me",
}
print(role_audit(project))
Step-by-step trace.
| Lens | Candidate answer | Inference |
|---|---|---|
| Artifact | "pipeline" | Data Engineer |
| SLA | "data lands by 6 AM with idempotent reruns" | DE pipeline SLA, well-defined |
| Pager | "me" | Senior IC signal — owns the on-call |
After the audit, the interviewer knows: the candidate is a DE who owns a measurable SLA and carries the pager. That triad alone qualifies the candidate for a mid-to-senior DE conversation. The follow-up probes (incident review, schema evolution, cost) test depth — but the role boundary is settled.
Output:
| Field | Value |
|---|---|
| inferred_role | DE |
| owned_sla | data lands by 6 AM with idempotent reruns |
| seniority_signal | Senior |
Why this works — concept by concept:
- Artifact lens — the artifact you shipped is the most honest role signal because it survived the org chart and code review. "Worked on the data platform" is mush; "shipped the orders-table pipeline" is data.
- SLA lens — owning an SLA means you were on the hook when it broke. SLAs map cleanly to roles: pipeline SLA → DE, dashboard SLA → DA, model SLA → DS.
- Pager lens — the pager separates ICs who contributed to a system from ICs who owned it. Senior roles always carry the pager for their artifact; juniors usually don't.
- Triad beats title — the three-lens audit beats the candidate's LinkedIn title every time. A "Senior Data Scientist" who ships dashboards and never carries a pager is a mid DA in disguise.
- Cost — one 5-minute conversation. The triad is cheap to apply, hard to fake, and surfaces the candidate's real ownership in the time it takes to make a coffee.
SQL
Topic — SQL fundamentals
SQL practice library (all roles)
2. The three-role Venn — shared tools and unique territory
Three circles, seven territories — the outer crescents are role-only, the pairwise overlaps are where teams ship together, and the centre is universal
The mental model in one line: the three roles share SQL, Git, and stakeholder communication at the centre; they share dbt and warehouse modelling in the DE ∩ DA overlap (analytics engineer), Python notebooks and exploratory analysis in the DA ∩ DS overlap, and MLOps / feature stores / vector DBs in the DE ∩ DS overlap (ML engineer). Once you can draw the Venn from memory and label every region, the entire "role comparison" interview surface collapses to a placement exercise.
The seven Venn territories in one table.
| Region | What lives here | Who owns it |
|---|---|---|
| DE-only outer crescent | ingestion, batch + streaming pipelines, warehouse storage, schema design, orchestration, observability | Data Engineer |
| DS-only outer crescent | hypothesis design, modelling, experimentation, statistical inference, model accuracy SLA | Data Scientist |
| DA-only outer crescent | business questions, dashboards, ad-hoc SQL, KPI definitions, stakeholder presentations | Data Analyst |
| DE ∩ DA overlap | dbt models, warehouse modelling, SQL transforms | Analytics Engineer (the named role) |
| DA ∩ DS overlap | Python notebooks, exploratory analysis, A/B tests | the strongest DA or weakest DS |
| DE ∩ DS overlap | MLOps, feature stores, vector DBs, model serving | ML Engineer / MLOps Engineer |
| Triple centre | SQL, Git, documentation, stakeholder communication | every senior data IC |
DE-only territory in detail.
- Ingestion. Batch (Fivetran, Airbyte, custom Python) and streaming (Kafka, Kinesis, Pub/Sub). DE owns the ingestion SLA — data must arrive, must arrive once, and must arrive with the contract the downstream team expects.
- Warehouse storage. Snowflake / BigQuery / Redshift / Databricks — DE owns the cost, the partitioning, the clustering, the retention policy, and the IAM. DA and DS read the warehouse; DE administers it.
- Orchestration. Airflow / Dagster / Prefect — DE owns the DAGs, the retry policy, the alerting hookup, and the on-call rotation when a pipeline fails.
- Observability. Data quality tests, schema drift detection, freshness monitors, lineage. DE owns the dashboards that watch the data, not the dashboards that show the data to humans.
DS-only territory in detail.
- Hypothesis design. Framing a business question as a testable hypothesis with a clear outcome metric — "does the new ranking model lift CTR by 2%, holding revenue per session flat?"
- Modelling. scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM. DS owns the model lifecycle: feature selection, hyperparameter tuning, evaluation, calibration.
- Experimentation. A/B tests, multi-armed bandits, switchback experiments, causal inference (instrumental variables, regression discontinuity). DS owns the analysis; DE owns the event capture.
- Statistical inference. Bootstrapping, confidence intervals, power analysis, sample-size planning. DS is the one teammate who can defend "p < 0.05" in a stakeholder review.
DA-only territory in detail.
- Business questions. "Why did NPS drop in the EU last week?" The DA translates a vague stakeholder ask into a precise SQL query with a clear answer.
- Dashboards. Looker / Tableau / Power BI / Mode / Metabase. DA owns the dashboard SLA: the metric must match the source of truth and refresh by 8 AM.
- Ad-hoc SQL. The "five minutes before the board meeting" turnaround. DA has the warehouse schema memorised and can hand-write the query the executive will ask.
- KPI definitions. "What does MAU mean at our company?" DA owns the definition document and defends it in metric review.
- Stakeholder presentations. DA presents to PMs, executives, and finance more than the other two roles combined.
The two pairwise overlaps that have their own titles.
- Analytics Engineer = DE ∩ DA in titlecase. Owns dbt models, the warehouse transformation layer, and the metric definitions in code. Does not own ingestion (that's DE) or dashboards (that's DA), but does own the "single source of truth" tables that both depend on.
- ML Engineer = DE ∩ DS in titlecase. Owns model serving infrastructure, feature stores, vector databases, model registry, and the train-to-deploy pipeline. Does not own the model science (that's DS) or the ingestion (that's DE), but does own the lifecycle that connects them.
The triple centre — what every senior data IC has.
- SQL. Every role writes SQL; the variance is volume (DA most, DS least) and complexity (DE writes the gnarliest joins; DA writes the most elegant aggregates).
- Git. Every role commits code, opens PRs, reviews PRs, and tags releases.
- Documentation. README files, runbooks, schema docs, KPI definition docs. Every role contributes; senior ICs in every role own the docs they ship.
- Stakeholder communication. Even DEs present to PMs at sprint review. Senior ICs across all three roles can explain a technical decision to a non-technical audience.
Worked example — labelling a tool to its Venn region
Detailed explanation. A new manager joining a data org wants to map every tool the team uses to its Venn region — DE-only, DS-only, DA-only, or one of the overlaps — so they can write each role's job description with confidence. The exercise also surfaces gaps (e.g. nobody owns the feature store) and overlaps (e.g. two people own dbt).
Question. Given the team's tool inventory, classify each tool to its Venn region and identify the named-overlap role (analytics engineer or ML engineer) when applicable.
Input.
| Tool | Used by |
|---|---|
| Airflow | DE only |
| Looker | DA only |
| scikit-learn | DS only |
| dbt | DE + DA |
| MLflow | DE + DS |
| Python notebooks | DA + DS |
| SQL | everyone |
Code (mapping logic).
TOOL_VENN = {
"Airflow": {"region": "DE-only", "owner": "Data Engineer"},
"Looker": {"region": "DA-only", "owner": "Data Analyst"},
"scikit-learn": {"region": "DS-only", "owner": "Data Scientist"},
"dbt": {"region": "DE ∩ DA", "owner": "Analytics Engineer"},
"MLflow": {"region": "DE ∩ DS", "owner": "ML Engineer"},
"Python notebooks": {"region": "DA ∩ DS", "owner": "strongest DA or weakest DS"},
"SQL": {"region": "Triple centre", "owner": "every senior IC"},
}
for tool, info in TOOL_VENN.items():
print(f"{tool:20s} -> {info['region']:18s} -> owned by {info['owner']}")
Step-by-step explanation.
- Airflow lives in the DE-only crescent — orchestration is core DE. Owner is the DE.
- Looker lives in the DA-only crescent — dashboards are core DA. Owner is the DA.
- scikit-learn lives in the DS-only crescent — modelling is core DS. Owner is the DS.
- dbt lives in the DE ∩ DA overlap. If the team has more than two dbt models, hire an analytics engineer to own them. Otherwise the strongest DA or weakest DE doubles up.
- MLflow lives in the DE ∩ DS overlap. If the team trains and serves more than one model, hire an ML engineer. Otherwise the senior DS owns it.
- Python notebooks live in the DA ∩ DS overlap — the place where a DA does exploratory work that could be a DS workflow, and where a DS does light reporting that could be a DA workflow. No named role; assign by who has the time.
- SQL is the triple centre — everyone writes it. The variance is who writes the most and who writes the gnarliest.
Output.
| Tool | Region | Recommended owner |
|---|---|---|
| Airflow | DE-only | Data Engineer |
| Looker | DA-only | Data Analyst |
| scikit-learn | DS-only | Data Scientist |
| dbt | DE ∩ DA | Analytics Engineer |
| MLflow | DE ∩ DS | ML Engineer |
| Python notebooks | DA ∩ DS | DA or DS |
| SQL | Triple centre | every senior IC |
Rule of thumb. Once a quarter, list every tool the team uses and map it to a Venn region. Tools without a clear owner are bugs in your org chart. Tools with two owners are budget waste or the next named role about to be hired.
Worked example — the analytics engineer overlap, in code
Detailed explanation. Analytics engineer is the most underestimated role of 2026. The work is dbt + warehouse modelling — owning the "single source of truth" transformation layer that both the DE pipeline and the DA dashboard depend on. The role was invented by Fishtown Analytics (now dbt Labs) in 2018 and went mainstream by 2022.
Question. Show a tiny dbt model that lives in the DE ∩ DA overlap — it depends on a raw ingestion table (DE owns) and feeds a downstream KPI dashboard (DA owns). Explain why the analytics engineer is the natural owner.
Input — raw table (DE owns).
| event_id | user_id | event_type | event_ts |
|---|---|---|---|
| 1 | 100 | login | 2026-06-01 09:01 |
| 2 | 100 | purchase | 2026-06-01 09:14 |
| 3 | 200 | login | 2026-06-02 10:02 |
Code — dbt model models/marts/daily_active_users.sql.
{{ config(materialized='table') }}
-- Analytics-engineer-owned transformation: the single source of truth
-- for the DAU metric, used by every dashboard, every notebook, every
-- model that wants "active users."
SELECT
DATE(event_ts) AS event_date,
COUNT(DISTINCT user_id) AS daily_active_users
FROM {{ ref('stg_events') }}
WHERE event_type = 'login'
GROUP BY DATE(event_ts)
Step-by-step explanation.
- The upstream
stg_eventsmodel wraps the raw event ingestion that the DE owns — schema enforcement, type casting, column renaming. The DE controls the ingestion SLA. - The
daily_active_usersmart is owned by the analytics engineer. The transformation is small but the definition is load-bearing: changing the WHERE clause or the COUNT semantics changes every dashboard, every notebook, and every model that reads the table. - The downstream Looker dashboard, owned by the DA, simply selects from
mart.daily_active_usersand visualises the metric. The DA doesn't write the SQL — they consume the canonical mart. - If a DS wants to use DAU as a feature for a churn model, they read from the same mart. One definition, three downstream consumers, one owner — the analytics engineer.
- The role exists because without it, the DA writes the metric SQL in Looker and the DS writes it again in a notebook and they drift apart. The analytics engineer centralises the definition.
Output (one row of the mart).
| event_date | daily_active_users |
|---|---|
| 2026-06-01 | 1 |
| 2026-06-02 | 1 |
Rule of thumb. The moment two downstream consumers (a dashboard and a notebook) compute the same metric independently, hire an analytics engineer. The cost of two slightly-different DAU definitions is a quarter of leadership confusion; the cost of one analytics engineer is one headcount line.
Worked example — the ML engineer overlap, in code
Detailed explanation. ML engineer is the DE ∩ DS overlap titled. The role owns the lifecycle that connects a DS-trained model to a production system — feature store reads, model registry, serving endpoint, drift monitoring, and rollback. The DS owns the model science; the DE owns the ingestion; the ML engineer owns the bridge.
Question. Show a tiny serving snippet that lives in the DE ∩ DS overlap — it reads features from a feature store (DE-shaped), invokes a model from a registry (DS-shaped), and writes predictions to a downstream table (DE-shaped). Explain why the ML engineer is the natural owner.
Input — feature store row.
| user_id | recency_days | frequency_30d | monetary_90d |
|---|---|---|---|
| 100 | 1 | 14 | 320.50 |
Code — serving snippet (Python).
import mlflow
import pandas as pd
from feature_store import get_features
from warehouse import write_predictions
# 1) Load features from the DE-owned feature store.
features = get_features(user_id=100) # returns a pandas Series
# 2) Load the DS-trained model from the model registry.
model = mlflow.sklearn.load_model("models:/churn_v3/Production")
# 3) Predict — owned by the ML engineer's serving layer.
proba = model.predict_proba(pd.DataFrame([features]))[0][1]
# 4) Write predictions back to the warehouse so the DA dashboards
# and the marketing campaign tool can both read them.
write_predictions(
user_id=100,
churn_proba=float(proba),
model_version="v3",
served_at="2026-06-06T00:00:00Z",
)
Step-by-step explanation.
-
get_featuresreads from the DE-owned feature store — the DE controls the freshness, the schema, and the partition strategy. The ML engineer consumes; the DE produces. -
mlflow.sklearn.load_modelreads from the DS-owned model registry. The DS controls which version is inProduction, the training data lineage, and the offline evaluation. The ML engineer consumes; the DS produces. - The
predict_probacall is the ML engineer's production concern: latency, batching, fallback when the model fails to load, A/B routing between v3 and v4. -
write_predictionswrites back to the warehouse so downstream consumers (DA dashboards, marketing tools, the next-day batch model) can read the predictions. The DE owns the destination table; the ML engineer owns the write. - The role exists because without it, the DS pickles a model and hands it to a DE who writes the serving code and neither person owns the end-to-end lifecycle. The ML engineer makes the loop accountable.
Output (one row of the predictions table).
| user_id | churn_proba | model_version | served_at |
|---|---|---|---|
| 100 | 0.42 | v3 | 2026-06-06T00:00:00Z |
Rule of thumb. Hire an ML engineer the moment your team has a model in production and you've had a rollback or drift incident. Until then, a senior DS or senior DE can wear the hat. After that incident, the cost of not having a named owner exceeds the cost of a headcount.
Decision question on the Venn overlap
A staff hiring manager often asks: "We just hired our third data scientist and our second data engineer. We have dbt models nobody owns and a churn model nobody owns. What roles do we need next?" The interviewer is testing whether the candidate can read the Venn and recommend the named overlap role.
Decision Using the Venn-region headcount audit
def venn_audit(headcount: dict, ownership_gaps: list[str]) -> list[str]:
"""
Given current headcount per role and the list of un-owned artifacts,
recommend the next hire to close the Venn gap.
"""
hires: list[str] = []
if "dbt_models" in ownership_gaps and headcount.get("AnalyticsEngineer", 0) == 0:
hires.append("Analytics Engineer (DE ∩ DA overlap)")
if "churn_model_serving" in ownership_gaps and headcount.get("MLEngineer", 0) == 0:
hires.append("ML Engineer (DE ∩ DS overlap)")
if "kpi_definitions" in ownership_gaps and headcount.get("DA", 0) < 2:
hires.append("Senior Data Analyst (DA-only)")
return hires
current_headcount = {"DE": 2, "DS": 3, "DA": 1,
"AnalyticsEngineer": 0, "MLEngineer": 0}
gaps = ["dbt_models", "churn_model_serving"]
print(venn_audit(current_headcount, gaps))
Step-by-step trace.
| Gap | Current headcount | Recommended hire |
|---|---|---|
| dbt_models | AnalyticsEngineer = 0 | Analytics Engineer |
| churn_model_serving | MLEngineer = 0 | ML Engineer |
| (kpi_definitions not in gaps) | DA = 1 | no action |
After the audit, the manager knows the next two hires are not "another DE" or "another DS" — they are the named overlap roles the Venn has produced. The org has organic demand for the analytics engineer and the ML engineer because the un-owned artifacts live in the overlaps, not the crescents.
Output:
| Hire # | Role | Region |
|---|---|---|
| 1 | Analytics Engineer | DE ∩ DA |
| 2 | ML Engineer | DE ∩ DS |
Why this works — concept by concept:
- Venn-region thinking — naming the seven regions of the Venn lets the org chart map onto the work. Crescents map to canonical roles; overlaps map to named-overlap roles; the centre is universal.
- Un-owned artifact is a hire signal — every artifact must have exactly one owner. Un-owned dbt models eventually drift; un-owned production models eventually break.
- Named overlap > generic overlap — "the strongest DA or weakest DE" works for one quarter, but the named role (analytics engineer, ML engineer) is what scales the team past five people.
- Headcount audit before hire pitch — pitching "hire an analytics engineer" without the gap audit makes the hire look like buzzword-chasing. The audit makes the hire look like a closure of a real risk.
- Cost — one afternoon to do the audit; saves quarters of misalignment and drift. The audit is the cheapest senior-EM tool that exists.
SQL
Topic — joins
JOIN problems (SQL — shared across all three roles)
3. Stack comparison — tools, languages, infrastructure
Same warehouse, different jobs — the stack matrix tells you which role wrote the query the moment you read the imports
The mental model in one line: the three roles touch the same warehouse layer, but they own different tool categories on top of it — DE owns the orchestration layer, DS owns the modelling layer, DA owns the BI layer, and they share SQL on the warehouse layer. Once you can quote the canonical stack per role, every "what do you use day-to-day?" interview question becomes a one-paragraph answer.
The stack matrix in one table.
| Category | Data Engineer | Data Scientist | Data Analyst |
|---|---|---|---|
| Languages | Python · SQL · Scala / Java · Bash | Python · R · SQL | SQL · Excel · light Python |
| Warehouses | Snowflake · BigQuery · Redshift · Databricks (owns config) | same — read-only mostly | same — BI views mostly |
| Orchestration | Airflow · Dagster · Prefect (owns DAGs) | Airflow for training jobs | rarely |
| Modelling | basic feature pipelines | scikit-learn · PyTorch · TensorFlow · XGBoost | regressions in notebooks |
| BI / Dashboards | infra for BI | reads dashboards | Looker · Tableau · Power BI · Mode |
| Streaming | Kafka · Flink · Spark Streaming | online inference only | none |
| MLOps | Vertex AI · SageMaker · MLflow · Kubeflow (shared) | same (shared) | none |
| Vector DB | pgvector · Pinecone · Weaviate (shared) | same (shared) | none |
| Version control | Git + dbt + Terraform | Git + notebooks | Git (light) + Looker LookML |
Languages by role in detail.
- DE. Python is the everyday glue language for Airflow DAGs, custom operators, and data quality tests. SQL is for the transformation layer and ad-hoc audits. Scala or Java surfaces in Spark and Flink code paths. Bash for CI / CD scripts.
- DS. Python is the modelling lingua franca — pandas, NumPy, scikit-learn, PyTorch. R survives in pharma, academia, and economics-adjacent teams. SQL is required because every feature pipeline starts in the warehouse.
- DA. SQL is the primary tool — most DAs write more SQL than the DE or DS combined. Excel is still load-bearing for finance and ops handoffs. Light Python (pandas) appears for one-off exploratory work.
Warehouses — same building, different rooms.
- Snowflake / BigQuery / Redshift / Databricks. All three roles touch the warehouse, but DE owns the config: warehouse sizing, partitioning, clustering, retention, IAM, cost monitoring.
- DE writes DDL. Tables, views, materialised views, schemas. DA and DS read from the tables DE creates.
- DS reads with sampling. A DS rarely reads a 10B-row table directly; they sample, materialise a subset for training, and iterate locally.
- DA reads through dashboards. A DA's warehouse interaction is usually via Looker / Tableau views, with occasional ad-hoc raw SQL for stakeholder asks.
Orchestration — DE-owned, DS-shared.
- Airflow. The 800-pound gorilla. DE owns the Airflow deployment, the DAG repo, the operator library, the alerting hookup. DS uses Airflow for nightly training jobs but doesn't operate the platform.
- Dagster / Prefect. Newer alternatives that are gaining adoption — DE owns these too. Dagster's asset-based model is increasingly the choice for data-mesh-style orgs.
- DA almost never touches orchestration. Dashboard refreshes are scheduled in the BI tool, not in the orchestrator.
Modelling — DS-owned, occasional DE feature work.
- scikit-learn. The bread-and-butter for tabular models. Every DS knows it; every DE has seen it.
- PyTorch / TensorFlow. Deep learning frameworks. DS owns the model code; DE / ML engineer owns the training-job infrastructure (GPU scheduling, checkpoint storage).
- XGBoost / LightGBM. Gradient boosting libraries. Still the winner for most tabular classification problems in 2026.
- DA does light modelling in notebooks. Simple regressions to answer a stakeholder question, not production models.
BI / Dashboards — DA-owned, the other two read.
- Looker. Owned by DA; the LookML semantic layer often co-owned with the analytics engineer.
- Tableau / Power BI. Owned by DA. Larger orgs have BI developers as a separate sub-role.
- Mode / Metabase / Hex. Notebook-style BI tools — used by DA and DS in roughly equal measure.
Streaming — DE-owned, DS for online inference only.
- Kafka. The dominant streaming substrate. DE owns the topics, partitions, and consumer groups.
- Flink. Stateful stream processing. DE owns the jobs; DS reads from the output sinks if a feature needs to be near-real-time.
- Spark Streaming. Micro-batch streaming. DE owns the jobs; the operational model is closer to batch DAGs than to event-driven Kafka.
MLOps — DE ∩ DS shared territory.
- MLflow. Model registry and experiment tracking. Shared ownership; the ML engineer (if hired) consolidates.
- Vertex AI / SageMaker. Cloud-managed MLOps platforms. The infra side is DE-owned; the model side is DS-owned.
- Kubeflow. Kubernetes-native ML platform. DE-heavy; survives in larger orgs with a dedicated platform team.
Worked example — labelling a query log to the role that wrote it
Detailed explanation. A new data lead inherits a warehouse query log and wants to attribute each query to the role that wrote it. The patterns are recognisable: DE queries DDL, DS queries deep aggregates with sampling, DA queries production-table aggregates with filtering. A simple regex over the SQL body classifies most queries correctly.
Question. Given three sample queries from the warehouse log, classify each by the role that most likely wrote it.
Input — query #1.
CREATE TABLE IF NOT EXISTS warehouse.fct_orders (
order_id BIGINT PRIMARY KEY,
user_id BIGINT,
amount DECIMAL(12, 2),
placed_at TIMESTAMP
)
PARTITION BY DATE(placed_at)
CLUSTER BY user_id;
Input — query #2.
SELECT
DATE(placed_at) AS d,
SUM(amount) AS revenue
FROM warehouse.fct_orders
WHERE placed_at >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY DATE(placed_at)
ORDER BY d;
Input — query #3.
SELECT user_id,
AVG(amount) AS avg_basket,
COUNT(*) AS order_count,
MAX(placed_at) AS last_seen
FROM warehouse.fct_orders TABLESAMPLE BERNOULLI (10)
WHERE placed_at >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY user_id;
Code — heuristic classifier.
def classify_sql(sql: str) -> str:
s = sql.upper()
if "CREATE TABLE" in s or "PARTITION BY" in s or "CLUSTER BY" in s:
return "Data Engineer"
if "TABLESAMPLE" in s or "AVG(" in s and "GROUP BY USER_ID" in s:
return "Data Scientist (feature pipeline)"
if "DATE_TRUNC" in s and "SUM(" in s and "GROUP BY DATE" in s:
return "Data Analyst (KPI dashboard)"
return "Unknown"
queries = [open("q1.sql").read(), open("q2.sql").read(), open("q3.sql").read()]
for i, q in enumerate(queries, start=1):
print(f"Query #{i}: {classify_sql(q)}")
Step-by-step explanation.
- Query #1 is a
CREATE TABLEwith partitioning and clustering — DE territory. DE owns the DDL. - Query #2 is a daily revenue rollup with a month-to-date filter — classic DA dashboard query. The shape "DATE_TRUNC + SUM + GROUP BY DATE + ORDER BY DATE" is recognisable from any production Looker LookML.
- Query #3 samples 10% of the table and computes user-level features (avg_basket, order_count, last_seen) — classic DS feature pipeline shape. The
TABLESAMPLEis the giveaway; production analytics queries rarely sample. - The classifier is a heuristic, not a proof. Edge cases exist (a DA could write a sampled query for an executive ask; a DE could write a daily revenue query as part of a freshness check). But the heuristic is right on the median case.
Output.
| Query | Inferred role |
|---|---|
| #1 | Data Engineer |
| #2 | Data Analyst |
| #3 | Data Scientist |
Rule of thumb. When a query log surfaces an unrecognisable author, look at the shape: DDL → DE; daily rollup → DA; sampled feature aggregate → DS. The shape carries more signal than the username, especially in orgs where multiple roles share warehouse logins.
Worked example — when DE and DS share MLOps
Detailed explanation. The DE ∩ DS overlap on MLOps is the most common stack-comparison friction point in 2026. Both roles want to own the model registry, the feature store, and the serving infra. The clean answer is: DE owns the platform, DS owns the content, and the ML engineer owns the lifecycle.
Question. Given a churn-model deployment workflow, label each step with the role that owns it.
Input — workflow steps.
| Step | Description |
|---|---|
| 1 | Ingest raw event data into Snowflake |
| 2 | Build features.user_churn_v2 table |
| 3 | Train the churn model with XGBoost |
| 4 | Register the model in MLflow with version v2
|
| 5 | Deploy v2 to the prediction service |
| 6 | Monitor drift on features.user_churn_v2
|
Code — ownership table.
WORKFLOW = [
("Ingest raw events into Snowflake", "DE"),
("Build features.user_churn_v2 table", "DE (+ Analytics Engineer review)"),
("Train churn model with XGBoost", "DS"),
("Register the model in MLflow as v2", "DS (writes) + DE (registry infra)"),
("Deploy v2 to the prediction service", "ML Engineer (or DE in small orgs)"),
("Monitor drift on the feature table", "DE (data drift) + DS (model drift)"),
]
for step, owner in WORKFLOW:
print(f"{step:55s} -> {owner}")
Step-by-step explanation.
- Step 1 (ingest) is pure DE — ingestion is the canonical pipeline SLA.
- Step 2 (feature table) is DE-built but reviewed by the analytics engineer for naming and contract — features need a stable schema downstream.
- Step 3 (train) is pure DS — modelling is the canonical model SLA.
- Step 4 (register) is a shared step: DS writes the registry entry; DE owns the registry infrastructure.
- Step 5 (deploy) is the ML engineer's job in mature orgs. In smaller orgs the DE wears the hat.
- Step 6 (monitor) splits cleanly: DE monitors the data drift (schema, freshness, NULL spike), DS monitors the model drift (prediction distribution, AUC decay).
- The workflow is NOT "DS does steps 3-6, DE does steps 1-2." It is a relay race with named handoffs. Senior teams document the handoff in a runbook.
Output (ownership ladder).
| Step | Owner |
|---|---|
| Ingest | DE |
| Build features | DE + Analytics Engineer |
| Train | DS |
| Register | DS + DE |
| Deploy | ML Engineer |
| Monitor | DE + DS |
Rule of thumb. Whenever you draw a workflow diagram across DE / DS / ML engineer, label each box with the owner and the handoff contract. "DE hands a feature table to DS with this schema" is a contract; "DE and DS collaborate on features" is mush.
Worked example — the analyst's stack is narrower but deeper in SQL
Detailed explanation. A common myth is that "DA is a junior version of DS or DE." The truth is that DA owns the stakeholder communication and the dashboard SLA, and writes the most production SQL of any role. The stack is narrower (no Airflow, no PyTorch, no Kafka) but the SQL depth is deeper — a senior DA writes joins, windows, and CTEs that most DEs would struggle with on the whiteboard.
Question. Given a senior-DA SQL query, identify the techniques that make it senior-shaped.
Input.
| user_id | event_type | event_ts |
|---|---|---|
| 100 | view | 2026-06-01 09:00 |
| 100 | add_to_cart | 2026-06-01 09:05 |
| 100 | purchase | 2026-06-01 09:14 |
| 200 | view | 2026-06-01 10:00 |
| 200 | add_to_cart | 2026-06-01 10:08 |
Code — senior-DA conversion-funnel query.
WITH events AS (
SELECT user_id,
event_type,
event_ts,
LEAD(event_type) OVER (PARTITION BY user_id ORDER BY event_ts) AS next_event
FROM warehouse.fct_events
WHERE event_ts >= DATE_TRUNC('month', CURRENT_DATE)
),
funnel AS (
SELECT
SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views,
SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS carts,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
FROM events
)
SELECT
views,
carts,
purchases,
ROUND(100.0 * carts / NULLIF(views, 0), 2) AS view_to_cart_pct,
ROUND(100.0 * purchases / NULLIF(carts, 0), 2) AS cart_to_purchase_pct
FROM funnel;
Step-by-step explanation.
- The CTE
eventsuses a window function (LEAD) to look at the next event per user — a senior-DA technique that surfaces session structure without expensive self-joins. - The CTE
funneluses conditional aggregation (SUM(CASE WHEN ...)) to count each event type in a single scan. Junior DAs write three separate queries; senior DAs write one. - The final SELECT computes funnel-conversion percentages with
NULLIF(denominator, 0)to avoid division-by-zero — a habit picked up by senior practitioners in every role. - The query is monthly (DATE_TRUNC) so it slots into a Looker dashboard that refreshes daily. The DA owns the dashboard SLA; this query is the SLA's transformation core.
- No Airflow, no Python, no PyTorch — just deep SQL. This is the senior DA's superpower.
Output (one month).
| views | carts | purchases | view_to_cart_pct | cart_to_purchase_pct |
|---|---|---|---|---|
| 240 | 60 | 18 | 25.00 | 30.00 |
Rule of thumb. Don't conflate stack narrowness with role juniority. A senior DA writes deeper SQL than the median DE, and the dashboard SLA is as load-bearing as the pipeline SLA. The narrower stack is by design — the work is deeper in fewer tools.
Decision question on stack selection
A staff DE often opens with: "You're starting a new data team from scratch with three hires: one DE, one DS, one DA. Pick the canonical stack — one tool per category — and justify each choice." The interviewer is testing whether the candidate can reason about stack ownership without rolling out an over-engineered platform.
Decision Using the one-tool-per-category starter stack
def starter_stack() -> dict:
"""
Three hires, one stack — pick the one canonical tool per category
that the smallest data team can adopt without dedicated platform staff.
"""
return {
"warehouse": {"tool": "BigQuery", "owner": "DE", "why": "serverless, pay-per-query, no infra"},
"orchestration": {"tool": "Airflow", "owner": "DE", "why": "industry default, deep operator library"},
"transform": {"tool": "dbt-core", "owner": "DE + DA", "why": "version-controlled SQL, free tier"},
"modelling": {"tool": "scikit-learn", "owner": "DS", "why": "tabular default, no GPU needed"},
"bi": {"tool": "Looker Studio", "owner": "DA", "why": "free, fast, integrates with BigQuery"},
"language": {"tool": "Python + SQL", "owner": "all", "why": "the only two languages every role uses"},
"version": {"tool": "Git + GitHub", "owner": "all", "why": "non-negotiable"},
}
for category, choice in starter_stack().items():
print(f"{category:13s} -> {choice['tool']:20s} ({choice['owner']:8s}) — {choice['why']}")
Step-by-step trace.
| Category | Tool | Owner | Why |
|---|---|---|---|
| Warehouse | BigQuery | DE | serverless, no infra |
| Orchestration | Airflow | DE | industry default |
| Transform | dbt-core | DE + DA | version-controlled SQL |
| Modelling | scikit-learn | DS | tabular default |
| BI | Looker Studio | DA | free, fast |
| Language | Python + SQL | all | universal |
| Version | Git | all | non-negotiable |
After the audit, the team has a 7-tool stack that three hires can run without a dedicated platform engineer. Streaming, vector DBs, and full MLOps are deferred — they earn their place when the data product needs them, not before.
Output:
| Stack rank | Choice |
|---|---|
| 1 (lowest infra cost) | BigQuery |
| 2 (lowest team friction) | Airflow + dbt-core |
| 3 (lowest modelling cost) | scikit-learn |
| 4 (lowest BI cost) | Looker Studio |
Why this works — concept by concept:
- One tool per category — picking exactly one tool per category prevents the "we have three orchestrators" problem that plagues teams that grew organically.
- Serverless warehouse — BigQuery (or Snowflake) lets a team of three skip the warehouse-ops burden that would otherwise consume the DE's whole quarter.
- dbt-core over dbt-cloud — free, version-controlled, owned by the team. dbt-cloud is the right answer at 10+ data hires, not at 3.
- scikit-learn over deep learning — until you have proof a deep model beats XGBoost on your data, default to tabular. The simpler stack ships faster.
- Looker Studio over Tableau / Looker — free, BigQuery-native, "good enough" for the first year. Tableau and Looker earn their place at scale, not at start.
- Cost — a three-person stack with this kit is roughly $50-$500/month of cloud spend in the first quarter. Compare to $50K+/year of Looker / Snowflake licences that some teams sign before they have a dashboard worth shipping.
Python
Topic — Python
Python practice library (DE + DS shared)
4. Salary bands 2026 — US, EU, India
Three roles, three regions, three seniority bands — the salary curve is closer than the title hype suggests, and the spread between roles narrows in remote-first markets
The mental model in one line: DE pays 5-15% more than DS at senior level in 2026 due to infra scarcity, DA bands are compressed but promotion velocity into analytics engineer is fastest, and the equity-to-base mix flips depending on whether you're at a FAANG (DS-heavy equity) or a startup (DE-heavy base). Once you can quote the bands cold, every offer negotiation becomes a calibrated conversation instead of an anchor war.
The bands in one table — US total compensation (USD).
| Role | Junior | Mid | Senior |
|---|---|---|---|
| Data Engineer | $110k | $165k | $230k |
| Data Scientist | $115k | $170k | $220k |
| Data Analyst | $75k | $110k | $150k |
EU bands (EUR total comp — London / Berlin / Amsterdam benchmarks).
| Role | Junior | Mid | Senior |
|---|---|---|---|
| Data Engineer | €55k | €80k | €115k |
| Data Scientist | €58k | €82k | €110k |
| Data Analyst | €40k | €58k | €80k |
India bands (INR LPA — Bangalore / Hyderabad / Pune benchmarks).
| Role | Junior | Mid | Senior |
|---|---|---|---|
| Data Engineer | 8 LPA | 22 LPA | 48 LPA |
| Data Scientist | 9 LPA | 22 LPA | 45 LPA |
| Data Analyst | 5 LPA | 14 LPA | 28 LPA |
Why DE pays 5-15% over DS at senior level in 2026.
- Infrastructure scarcity. Senior DEs who can run Kafka, Flink, Snowflake at scale are rarer in the labour market than senior DSs who can train a model — the supply curve is steeper.
- On-call premium. DEs carry the pager for the data platform. The market prices the pager into base salary; DSs without on-call don't get the premium.
- Cost-of-error premium. A broken pipeline costs the business immediately (today's dashboards are stale, today's revenue forecast is wrong); a broken model costs the business over weeks. Markets compensate the role that owns the immediate-cost SLA.
- Cloud-bill responsibility. Senior DEs often own the data platform cloud bill (six-to-eight figures at scale). The market prices the financial responsibility.
Why DA bands are compressed but promotion velocity is fastest.
- Smaller variance in DA seniority. The gap between a junior DA and a senior DA is smaller than the gap between a junior DE and a senior DE (the stack is narrower). Bands compress.
- Promotion velocity. Strong DAs become analytics engineers in 18-24 months (versus 3-4 years for DA → senior DA). The salary catches up at the role switch, not in the title bump.
- Adjacent roles drag the band up. Product analyst, growth analyst, and decision scientist roles often pay closer to DS bands. A DA who specialises can move into those bands within 2-3 years.
Equity vs base mix — FAANG vs startup.
- FAANG DS — base / bonus / RSU often splits roughly 60% / 10% / 30%. Senior DSs at FAANG can hit $400k+ total comp on the equity tailwind.
- FAANG DE — base / bonus / RSU usually splits 65% / 10% / 25%. Slightly higher base, slightly lower equity. Total comp is comparable to DS at the same level.
- Series A-B startup DE — base / equity often splits 90% / 10%. Higher cash, lower equity multiplier. The DE is critical infrastructure; the equity tail is volatile.
- Series A-B startup DS — base / equity often splits 85% / 15%. Slightly lower cash, higher equity if the company hits.
- Remote-first market. Roles pay closer to the company's headquarter band, but the spread between roles narrows because the remote market compresses geographic premiums.
Caveat — always cite the year and the source.
- Bands move 5-15% year over year. A 2024 band is already stale in 2026.
- The canonical sources are levels.fyi (FAANG / public companies, US-centric), Glassdoor (broad market, self-reported), Robert Half Tech Salary Guide (recruiter data, regional), and Pave (compensation benchmarking, mid-stage startup-heavy).
- For India, AmbitionBox and Glassdoor India are the closest analogues; for EU, Honeypot and StackOverflow Developer Survey add signal.
- The honest framing for any blog or salary conversation: "as of Q1 2026, levels.fyi shows senior DE at $X median; your offer should be within ±10% of that, adjusted for company stage and location."
Worked example — calibrating a US senior DE offer against the band
Detailed explanation. A senior DE receives an offer letter and wants to know whether the numbers are competitive. The right answer is to decompose the offer into the comparison units the market uses — base, bonus, and equity vesting — and benchmark each separately against the senior-DE band.
Question. Given an offer of $190k base + 15% bonus + $400k RSU vesting over 4 years, calibrate the year-1 total comp against the 2026 senior-DE US band.
Input.
| Component | Amount | Notes |
|---|---|---|
| Base | $190,000 | annual |
| Bonus | 15% of base | annual target |
| RSU | $400,000 | 4-year vesting, 25% each year |
Code.
def year1_total_comp(base: float, bonus_pct: float, rsu_grant: float,
vest_years: int = 4) -> dict:
bonus = base * bonus_pct
rsu_year1 = rsu_grant / vest_years
total = base + bonus + rsu_year1
return {
"base": base,
"bonus": bonus,
"rsu_year1": rsu_year1,
"year1_total_comp": total,
}
SENIOR_DE_US = {"junior": 110_000, "mid": 165_000, "senior": 230_000}
offer = year1_total_comp(base=190_000, bonus_pct=0.15, rsu_grant=400_000)
print(offer)
print("Band reference (senior DE US 2026):", SENIOR_DE_US["senior"])
delta_pct = (offer["year1_total_comp"] - SENIOR_DE_US["senior"]) / SENIOR_DE_US["senior"] * 100
print(f"Delta vs band: {delta_pct:+.1f}%")
Step-by-step explanation.
- Base = $190k.
- Bonus = 15% × $190k = $28.5k.
- RSU year-1 vest = $400k / 4 = $100k.
- Year-1 total comp = $190k + $28.5k + $100k = $318.5k.
- Senior-DE US 2026 band = $230k median total comp.
- Delta vs band = ($318.5k − $230k) / $230k = +38.5%.
- Interpretation: the offer is well above the median senior-DE band. The candidate should either accept or use the gap to negotiate a small base increase (the year-1 number includes a chunky RSU vest that won't repeat at the same level after year 1 if the grant isn't refreshed).
Output.
| Component | Amount |
|---|---|
| Base | $190,000 |
| Bonus | $28,500 |
| RSU year-1 | $100,000 |
| Year-1 total comp | $318,500 |
| Delta vs senior-DE median | +38.5% |
Rule of thumb. When the year-1 total comp leans on the RSU front-load, ask the recruiter what the "year-4 steady-state" comp looks like after the grant fully vests. The honest comparison is steady-state base + bonus + annual refresh, not the year-1 spike.
Worked example — comparing a Bangalore senior DS offer against the band
Detailed explanation. Indian compensation data is reported as LPA (lakhs per annum, where 1 LPA = ₹100,000 per year). The senior DS band in Bangalore in 2026 is roughly 45 LPA, with strong variance by company tier. A Tier-1 (FAANG / product unicorn) offer should sit at or above the band; a Tier-2 (mid-stage product company) offer should sit near the band; a Tier-3 (service company) offer typically lands below.
Question. Given an offer of 38 LPA base + 8 LPA bonus + 12 LPA ESOP, classify by tier and decide whether to accept or counter.
Input.
| Component | Amount (LPA) |
|---|---|
| Base | 38 |
| Bonus | 8 |
| ESOP | 12 |
Code.
def classify_india_offer(base_lpa: float, bonus_lpa: float, esop_lpa: float,
band_median_lpa: float) -> dict:
total = base_lpa + bonus_lpa + esop_lpa
delta = (total - band_median_lpa) / band_median_lpa * 100
tier = (
"Tier-1 (FAANG / unicorn)" if total >= band_median_lpa * 1.15 else
"Tier-2 (product mid-stage)" if total >= band_median_lpa * 0.95 else
"Tier-3 (service / consulting)"
)
return {
"total_lpa": total,
"band_median": band_median_lpa,
"delta_pct": delta,
"tier_inference": tier,
}
offer = classify_india_offer(base_lpa=38, bonus_lpa=8, esop_lpa=12,
band_median_lpa=45)
print(offer)
Step-by-step explanation.
- Total comp = 38 + 8 + 12 = 58 LPA.
- Senior-DS Bangalore 2026 band median = 45 LPA.
- Delta = (58 − 45) / 45 = +28.9% above band.
- Tier inference: the offer is at +28.9%, comfortably in Tier-1 territory.
- Interpretation: the offer is competitive for a senior DS in Bangalore. The ESOP component (12 LPA) carries the usual startup risk — if the company is pre-IPO, treat the ESOP as a lottery ticket rather than guaranteed comp.
- Counter-offer recommendation: if the candidate has competing offers, ask for a 5-10 LPA base bump (turning 38 into 43-48 base). The ESOP component is less negotiable.
Output.
| Component | Amount |
|---|---|
| Base | 38 LPA |
| Bonus | 8 LPA |
| ESOP | 12 LPA |
| Total | 58 LPA |
| Delta vs band | +28.9% |
| Tier inference | Tier-1 |
Rule of thumb. For India offers, always separate the cash (base + bonus) from the equity (ESOP) when comparing to the band. ESOP value is highly variable; cash is the floor. The honest comparison is base + bonus vs band, with ESOP as an optional upside.
Worked example — the role-switch arbitrage from DA to analytics engineer
Detailed explanation. A mid DA in 2026 earns roughly $110k US / €58k EU / 14 LPA. The same person, retitled as an analytics engineer after a 6-12 month dbt + warehouse-modelling skill build, can earn 25-40% more without changing the actual work much. This is the cheapest career-level arbitrage in data in 2026.
Question. Given a mid-DA earning $110k who completes the analytics-engineer skill build in 9 months, project the after-switch comp.
Input.
| Phase | Base | Notes |
|---|---|---|
| Now (mid DA) | $110k | 3 years DA experience |
| 9 months later (mid analytics engineer) | ? | dbt + warehouse modelling proficiency |
Code.
def projected_switch(base_now: float, role_premium_pct: float,
cola_pct: float = 0.03) -> dict:
"""
Projected total comp after switching from DA -> analytics engineer.
role_premium_pct: market premium for analytics engineer over DA, typically 0.25-0.40
cola_pct: cost-of-living / annual adjustment baseline, default 3%
"""
base_after_cola = base_now * (1 + cola_pct)
base_after_switch = base_after_cola * (1 + role_premium_pct)
return {
"base_now": base_now,
"base_after_cola": base_after_cola,
"base_after_switch": base_after_switch,
"uplift_pct": (base_after_switch / base_now - 1) * 100,
}
print(projected_switch(base_now=110_000, role_premium_pct=0.30))
Step-by-step explanation.
- Base now = $110k. Applying a 3% annual COLA gives $113.3k as the "do-nothing" baseline.
- Applying the 30% analytics-engineer premium on top gives $113.3k × 1.30 = $147.3k.
- Uplift vs starting base = ($147.3k − $110k) / $110k = +33.9%.
- The arbitrage works because the analytics-engineer role is in higher demand than the mid-DA role and the candidate retains all the DA's stakeholder-comms strength, which most analytics-engineer JDs explicitly value.
- The 9-month skill build typically costs 100-200 evening hours and an open-source dbt portfolio. The ROI is one career level inside one year.
Output.
| Phase | Base |
|---|---|
| Mid DA today | $110,000 |
| After 3% COLA | $113,300 |
| After role switch | $147,290 |
| Uplift | +33.9% |
Rule of thumb. Every DA at the 3-year mark should ask "is the next career step a senior DA at +15% or an analytics engineer at +30%?" The latter is usually the right move in 2026 — the bands favour the role switch, and the skill build is achievable in a single year.
Decision question on band-aware negotiation
A staff hiring manager often asks: "You're benchmarking three offers — one from a US tech, one from an EU scale-up, one from an Indian unicorn — for a senior DS. How do you compare them apples-to-apples and which one would you pick?" The interviewer is testing whether the candidate can normalise across regions without falling into anchor traps.
Decision Using a purchasing-power-adjusted comp comparison
def normalize_comp(comp_local: float, currency: str, location: str) -> dict:
"""
Normalise across regions using a rough PPP (purchasing power parity) factor
relative to the US dollar. PPP factors as of Q1 2026.
"""
PPP = {"USD": 1.00, "EUR": 1.10, "INR_LPA": 0.36} # 1 LPA ≈ $0.36 in PPP-USD
factor = PPP[currency]
ppp_usd = comp_local * factor if currency != "INR_LPA" else comp_local * 100_000 * factor / 1_000 # LPA to USD
return {
"comp_local": comp_local,
"currency": currency,
"location": location,
"ppp_adjusted_usd": ppp_usd,
}
offers = [
normalize_comp(220_000, "USD", "San Francisco"),
normalize_comp(110_000, "EUR", "London"),
normalize_comp(50, "INR_LPA", "Bangalore"),
]
for o in offers:
print(o)
Step-by-step trace.
| Offer | Local amount | PPP factor | PPP-USD |
|---|---|---|---|
| US senior DS | $220k | 1.00 | $220k |
| EU senior DS | €110k | 1.10 | $121k |
| India senior DS | 50 LPA | 0.36 | $180k (PPP-adjusted) |
After normalisation, the US offer leads on nominal USD, but the India offer (50 LPA) is close on PPP basis (~$180k purchasing-power-equivalent in Bangalore). The EU offer trails on PPP because EU senior bands are structurally lower than US.
Output:
| Region | Local comp | PPP-adjusted USD | Rank |
|---|---|---|---|
| US | $220k | $220k | 1 |
| India | 50 LPA | $180k | 2 |
| EU | €110k | $121k | 3 |
Why this works — concept by concept:
- PPP-adjusted comparison — comparing $220k SF vs €110k London vs 50 LPA Bangalore in nominal terms is misleading; PPP normalises for the cost of living and lets the candidate compare lifestyle, not just paycheque.
- Currency conversion is not enough — a naive USD conversion would put London at $130k and Bangalore at $60k, masking the fact that 50 LPA in Bangalore buys a lifestyle closer to $180k in San Francisco.
- Always cite the year — PPP factors and salary bands shift quarter to quarter. Anchor every conversation to the year and the data source.
- Equity is region-dependent — US tech equity tends to vest in liquid public stock; EU scale-up equity is often phantom or illiquid; India unicorn ESOP carries the highest variance. Build that into the offer comparison.
- Cost-of-living delta — even after PPP, the lived experience of $220k in SF (high rent, expensive everything) and 50 LPA in Bangalore (lower rent, lower everything) diverges. PPP is a first approximation, not a final answer.
- Cost — one spreadsheet, 30 minutes. The cost of not doing the PPP-adjusted comparison is accepting a nominally-higher offer that is actually lower lifestyle.
SQL
Topic — aggregation
Aggregation problems (SQL — DA/DS/DE shared)
5. Career path flow — entry to senior and role switching
Two rails, three junctions, one career lattice — pick a starting role, plan the next two moves, treat the third as optional
The mental model in one line: the analytics rail (BI analyst → DA → analytics engineer → DE → staff DE / data platform) and the science rail (stats grad → DS → senior DS → ML engineer → staff / applied scientist) are the two canonical career paths, with switch points at the analytics-engineer and MLOps junctions and a manager-vs-IC fork at ~5-7 years. Once you can draw the lattice from memory, every "where should I go next?" career conversation becomes a one-arrow decision.
The analytics rail.
- BI Analyst (0-2 years). Entry point. Owns SQL queries and dashboards in Looker / Tableau. Reports to a DA team lead. Salary band: roughly US $65-80k, EU €35-45k, India 4-6 LPA.
- Data Analyst (2-4 years). Owns KPI definitions and stakeholder asks. Writes deeper SQL with windows and CTEs. Presents to PMs. Salary band: US $90-130k, EU €50-65k, India 10-18 LPA.
- Analytics Engineer (3-6 years). Owns dbt models and the warehouse transformation layer. Reads less ad-hoc, writes more reusable code. Salary band: US $130-180k, EU €65-95k, India 18-32 LPA.
- Data Engineer (5-9 years). Owns pipelines and orchestration. Carries on-call. Salary band: US $150-220k, EU €70-110k, India 22-45 LPA.
- Staff DE / Data Platform (8+ years). Owns platform-level decisions across teams. Mentors. Salary band: US $230-350k, EU €110-160k, India 48-90 LPA.
The science rail.
- Stats / ML Grad (0-2 years). Entry point. Notebooks, statistical inference, light modelling. Often a 2-year MS / PhD pipeline. Salary band: US $80-115k, EU €40-58k, India 8-12 LPA.
- Data Scientist (2-4 years). Owns experiments and small models. Writes feature pipelines. Presents to PMs. Salary band: US $120-170k, EU €58-82k, India 14-25 LPA.
- Senior Data Scientist (4-7 years). Owns model accuracy SLA. Reviews other DSs' work. Salary band: US $170-230k, EU €85-115k, India 28-48 LPA.
- ML Engineer (5-8 years). Owns model serving and MLOps. Salary band: US $180-260k, EU €90-130k, India 32-60 LPA.
- Staff / Applied Scientist (8+ years). Owns research direction. Salary band: US $250-400k, EU €120-180k, India 50-100 LPA.
The three switch junctions (cross-rail arrows).
- Analytics Engineer ↔ Data Scientist. Switch cost: 6-12 months. The skill bridge is statistical inference and modelling (the analytics engineer already has the warehouse intuition the DS needs). Cross-over is common at the 3-5 year mark.
- Data Scientist ↔ Data Engineer. Switch cost: 12-18 months. The skill bridge is Airflow, warehouse internals, and on-call discipline. Cross-over is less common but high-value for DSs who want to own end-to-end systems.
- Data Engineer ↔ ML Engineer. Switch cost: 6-12 months. The skill bridge is MLOps, model serving, and feature stores. Cross-over is common for DEs who want to move closer to product.
Manager track vs IC track diverges around senior (5-7 years).
- IC track. Senior IC → Staff → Principal. Owns technical decisions, mentors juniors, writes design docs.
- Manager track. Senior IC → Team Lead → Manager → Director. Owns headcount, performance reviews, cross-team coordination.
- Both tracks exist on both rails. A staff DE and an engineering manager DE both exist; a staff DS and a manager DS both exist. The fork is about preference, not capability.
- Switching tracks is expensive. A manager who returns to IC usually drops a level for 6-12 months while rebuilding technical depth. An IC who becomes a manager usually loses technical sharpness within a year.
The T-shaped career mindset.
- Deep in one role, conversant in the other two. Most senior hires expect this. A senior DE who can read a notebook and review a dashboard is more valuable than a senior DE who can only read DAG code.
- The "T" is the depth. The horizontal bar of the T is breadth across the other roles. The vertical line is depth in your role.
- The breadth bar earns you the staff title. Staff DEs run cross-team initiatives that touch DA and DS work; staff DSs review ML engineering decisions; senior DAs translate between exec and engineering.
- The depth line earns you the principal title. Principals own the hardest technical decisions in their lane and are the last word in code review.
Worked example — drawing the lattice for a specific candidate
Detailed explanation. A junior with three years as a BI analyst wants to plan the next 5-year arc. The right framing is: draw the lattice, mark current position, write the next two destinations, and identify the switch costs. Beyond two destinations, optionality wins — the world changes too fast to plan five years out.
Question. Given a 3-year BI analyst's preferences (enjoys SQL, dislikes presenting, curious about pipelines), recommend the next two career destinations and the skill build for each.
Input.
| Now | Preferences | Time horizon |
|---|---|---|
| BI Analyst, 3 yrs | enjoys SQL, dislikes presenting, curious about pipelines | 5 years |
Code.
def plan_two_moves(current_role: str, preferences: dict) -> list[dict]:
plan = []
# Move 1: lowest-friction next step
if current_role == "BI Analyst" and preferences.get("dislikes_presenting"):
plan.append({
"destination": "Analytics Engineer",
"skill_build": ["dbt", "warehouse modelling", "Git workflows"],
"horizon_years": 1.5,
"comp_uplift_pct": 30,
})
# Move 2: branching destination based on curiosity
if preferences.get("curious_about_pipelines"):
plan.append({
"destination": "Data Engineer",
"skill_build": ["Airflow", "Snowflake internals", "on-call discipline"],
"horizon_years": 3,
"comp_uplift_pct": 50,
})
else:
plan.append({
"destination": "Senior DA / Product Analyst",
"skill_build": ["KPI ownership", "stakeholder strategy"],
"horizon_years": 3,
"comp_uplift_pct": 25,
})
return plan
prefs = {"dislikes_presenting": True, "curious_about_pipelines": True}
for step in plan_two_moves("BI Analyst", prefs):
print(step)
Step-by-step explanation.
- Current position: BI Analyst, 3 years. The lowest-friction next step is Analytics Engineer because it leverages the existing SQL skill and lets the candidate skip the "deeper DA stakeholder work" they don't enjoy.
- Skill build for move 1: dbt (the analytics-engineer signature tool), warehouse modelling (Kimball / data vault), Git workflows (so the team can review the dbt PRs).
- Horizon for move 1: 1.5 years. The skill build is 6-9 months; the title change at the current company or a new role takes another 3-6 months.
- Comp uplift for move 1: roughly +30% from BI Analyst to Analytics Engineer.
- Move 2: Data Engineer, based on the curiosity signal. The bridge from Analytics Engineer to DE is Airflow, warehouse internals, and on-call discipline.
- Horizon for move 2: 3 years from now (so 1.5 years as analytics engineer, then the switch).
- Comp uplift for move 2: roughly +50% from BI Analyst to DE.
Output.
| Move | Destination | Skill build | Horizon | Comp uplift |
|---|---|---|---|---|
| 1 | Analytics Engineer | dbt, modelling, Git | 1.5 yrs | +30% |
| 2 | Data Engineer | Airflow, warehouse, on-call | 3 yrs | +50% |
Rule of thumb. Plan exactly two career moves. The first is the lowest-friction next step that fits your preferences; the second is the destination the first step enables. Don't plan three; the world changes, the second move usually surfaces a third option you couldn't have predicted.
Worked example — the DA → DE switch via dbt + Airflow
Detailed explanation. The most common cross-role switch in 2026 is DA → DE, with the analytics-engineer step as the natural intermediate. The skill bridge is dbt + Airflow + warehouse modelling. The switch typically pays +40-60% over 18-24 months.
Question. Given a 4-year DA at $115k who completes the dbt + Airflow build over 18 months, project the path to mid-DE at the 24-month mark.
Input.
| Phase | Title | Base |
|---|---|---|
| Month 0 | Senior DA | $115k |
| Month 12 | Analytics Engineer | $? |
| Month 24 | Mid DE | $? |
Code.
def da_to_de_path(starting_base: float) -> list[dict]:
return [
{"month": 0, "title": "Senior DA", "base": starting_base},
{"month": 12, "title": "Analytics Engineer", "base": starting_base * 1.30},
{"month": 24, "title": "Mid DE", "base": starting_base * 1.30 * 1.20},
]
for stage in da_to_de_path(115_000):
print(f"Month {stage['month']:2d}: {stage['title']:20s} -> ${stage['base']:,.0f}")
Step-by-step explanation.
- Month 0: Senior DA at $115k.
- Month 12: Switch to Analytics Engineer with a +30% bump → $149.5k.
- Month 24: Switch to Mid DE with another +20% bump → $179.4k.
- Total uplift over 24 months: ($179.4k − $115k) / $115k = +56%.
- The bumps assume the candidate (a) builds the skill, (b) switches companies (in-place promotions usually cap at +10-15%), and (c) negotiates each offer.
- Alternative path: stay as Senior DA for 24 months and hit +15% via COLA + promotion → $132.3k. The role-switch path nets +$47k/year more.
Output.
| Month | Title | Base |
|---|---|---|
| 0 | Senior DA | $115,000 |
| 12 | Analytics Engineer | $149,500 |
| 24 | Mid DE | $179,400 |
Rule of thumb. When you can articulate the skill bridge and have a portfolio that proves it, switching companies for the role change usually pays 2-3× the comp of an in-place promotion. The trade-off is the social cost of building a new network; the comp math is unambiguous.
Worked example — DS to DE via MLOps
Detailed explanation. The DS → DE switch is less common than DA → DE but high value for DSs who want to own the end-to-end system. The skill bridge is MLOps, model serving, and infrastructure. The switch typically lands at "ML engineer" rather than "pure DE" — the DS's modelling background is too valuable to throw away.
Question. Given a 5-year DS at $190k who wants to move closer to infra, recommend the destination and the skill build.
Input.
| Now | Preference signal |
|---|---|
| Senior DS, 5 yrs, $190k | "I want to own the model in production, not hand it off." |
Code.
def ds_to_eng_path(current_role: str, current_base: float, signal: str) -> dict:
if "production" in signal.lower():
return {
"destination": "ML Engineer",
"skill_build": ["model serving (FastAPI / TorchServe)",
"feature stores (Feast)",
"model registry (MLflow)",
"Kubernetes basics"],
"horizon_months": 12,
"expected_base": current_base * 1.20,
"rationale": "ML engineer keeps the modelling depth + adds the lifecycle ownership",
}
if "infra" in signal.lower():
return {
"destination": "Data Engineer",
"skill_build": ["Airflow", "Snowflake internals", "on-call"],
"horizon_months": 18,
"expected_base": current_base * 1.15,
"rationale": "DE adds the pipeline ownership but discards some modelling depth",
}
return {"destination": "stay DS, deepen modelling"}
print(ds_to_eng_path("Senior DS", 190_000,
"I want to own the model in production, not hand it off."))
Step-by-step explanation.
- Signal "own the model in production" maps to ML engineer, not pure DE. ML engineer preserves the modelling investment.
- Skill build: model serving (FastAPI for sync, TorchServe / Triton for deep models), feature stores (Feast), model registry (MLflow), Kubernetes basics.
- Horizon: 12 months. The DS already has 60% of the skill — they just need the production tooling.
- Expected base after switch: $190k × 1.20 = $228k.
- Rationale: the ML engineer role recognises that lifecycle ownership of a model is its own discipline; the DS who wants to own production gets paid the premium for adding the engineering.
- If the signal had been "I want to own pipelines" instead of "models in production," the answer would have been DE — and the DS would discard a chunk of their modelling premium.
Output.
| Field | Value |
|---|---|
| Destination | ML Engineer |
| Skill build | serving, feature stores, registry, K8s |
| Horizon | 12 months |
| Expected base | $228k |
Rule of thumb. When a DS wants to "move closer to engineering," route them through ML engineer first, not pure DE. The ML engineer role lets them preserve the modelling depth they already paid for; pure DE discards it.
Worked example — the manager-vs-IC fork at the staff level
Detailed explanation. Around 5-7 years on either rail, the IC and the manager track diverge. The same person can choose to keep coding (staff IC) or to manage people (engineering manager). The fork is reversible but expensive — switching back loses ~12 months of momentum.
Question. Given a senior DE at 6 years contemplating the fork, list the day-to-day differences and the seniority comp comparison.
Input.
| Track | Day-to-day | Skills required |
|---|---|---|
| Staff DE (IC) | design docs, deep code review, cross-team mentorship | senior technical depth + breadth |
| Engineering Manager (DE) | 1:1s, performance reviews, hiring, cross-team coordination | people leadership + technical breadth |
Code.
FORK = {
"staff_de_ic": {
"day_to_day": ["design docs", "deep code review", "cross-team mentorship",
"incident postmortems", "platform vision"],
"comp_us": "230-350k",
"comp_eu": "115-160k",
"comp_india": "48-90 LPA",
"reversibility": "easy — coding skill stays sharp",
},
"em_de": {
"day_to_day": ["1:1s", "performance reviews", "hiring loops",
"roadmap planning", "cross-team coordination"],
"comp_us": "220-340k",
"comp_eu": "110-155k",
"comp_india": "45-85 LPA",
"reversibility": "expensive — technical sharpness decays",
},
}
for track, info in FORK.items():
print(track.upper())
for k, v in info.items():
print(f" {k}: {v}")
Step-by-step explanation.
- Staff DE (IC) spends most of the week on design, code review, and mentorship. Compensation is roughly equal to engineering manager at the same level.
- Engineering manager (DE) spends most of the week on people leadership: 1:1s, performance reviews, hiring loops. Direct coding time drops to <10% of the week.
- Compensation is essentially identical between staff IC and equivalent EM at most modern companies. The choice is preference, not money.
- Reversibility: staff IC → EM is straightforward (the technical depth is fresh). EM → staff IC is harder; technical sharpness decays inside 12 months and the re-onboarding costs the candidate a level.
- Both tracks lead to director / principal at the next step. They eventually fork further, but at 5-7 years the choice is about how you want to spend your week.
Output.
| Field | Staff DE (IC) | Engineering Manager |
|---|---|---|
| Day-to-day | design, review, mentorship | 1:1s, hiring, planning |
| Comp US | $230-350k | $220-340k |
| Reversibility | easy | expensive |
Rule of thumb. Pick the track based on how you want to spend your week, not on the comp curve. The comp is essentially equal at the staff / EM level on either rail. The day-to-day satisfaction is the variable that compounds over years.
Decision question on the next career move
A staff DE often closes with: "If you had 18 months to make the highest-leverage career move starting from your current role, what would it be and why?" The interviewer is testing whether the candidate has a self-aware career model or is reacting to job-board signal.
Decision Using the rail + junction + switch-cost framing
def next_move(current_role: str, current_years: int, signal: str) -> dict:
"""
Recommend the next career move using the rail + junction framing.
"""
if current_role == "Data Analyst" and current_years >= 2:
return {
"rail": "analytics",
"next_role": "Analytics Engineer",
"junction": "DA -> Analytics Engineer",
"switch_cost_months": 9,
"skill_build": ["dbt", "warehouse modelling", "Git"],
"comp_uplift_pct": 30,
}
if current_role == "Data Scientist" and "production" in signal:
return {
"rail": "science",
"next_role": "ML Engineer",
"junction": "DS -> ML Engineer",
"switch_cost_months": 12,
"skill_build": ["serving", "feature store", "registry"],
"comp_uplift_pct": 20,
}
if current_role == "Data Engineer" and current_years >= 5:
return {
"rail": "analytics",
"next_role": "Staff DE" if "code" in signal else "Engineering Manager",
"junction": "Senior DE -> Staff vs EM fork",
"switch_cost_months": 18,
"skill_build": ["design docs", "mentorship"] if "code" in signal
else ["1:1 framework", "hiring loops"],
"comp_uplift_pct": 35,
}
return {"recommendation": "deepen current role for one more year"}
print(next_move("Data Analyst", current_years=3, signal="curious about pipelines"))
Step-by-step trace.
| Candidate state | Inferred rail | Next role | Switch cost | Comp uplift |
|---|---|---|---|---|
| Senior DA, 3 yrs | analytics | Analytics Engineer | 9 months | +30% |
| Senior DS, 5 yrs | science | ML Engineer | 12 months | +20% |
| Senior DE, 6 yrs, prefers code | analytics | Staff DE (IC) | 18 months | +35% |
After the framing, the candidate has a one-line answer to "what's next?" — the rail, the next role, the switch cost, and the comp expectation. Recruiters who hear this framing recognise senior career intuition; juniors who can articulate it stand out.
Output:
| Field | Value |
|---|---|
| Rail | analytics |
| Next role | Analytics Engineer |
| Junction | DA → Analytics Engineer |
| Switch cost (months) | 9 |
| Skill build | dbt, modelling, Git |
| Comp uplift | +30% |
Why this works — concept by concept:
- Rail framing — picking the analytics rail or the science rail orients every other decision. Within a rail, the next step is usually clear; across rails, the switch cost is the variable.
- Junction is the named edge — every cross-rail switch has a named junction (DA → analytics engineer, DS → ML engineer, DE → ML engineer). Knowing the junction by name lets you plan the skill build.
- Switch cost in months — quantifying the cost (3-18 months) anchors the decision in time, not in vibes. A 9-month switch is achievable; an 18-month switch is a strategic bet.
- Skill build as a checklist — three to four named tools / behaviours that the next role demands. The list lets the candidate plan deliberate practice.
- Comp uplift as the tiebreaker — when two next steps look equally good on day-to-day fit, the comp uplift breaks the tie. Don't let it lead; let it close.
- Cost — one afternoon to draw your own lattice and write the framing. The compounding return is one to two career levels over three years.
SQL
Topic — window functions
Window function problems (SQL — senior signal)
Cheat sheet — role boundary recipes
- Owns the pipeline SLA. Data Engineer. Data lands on time, idempotent reruns, schema enforced, on-call rotation.
- Owns the dashboard SLA. Data Analyst. Metric matches source of truth, refresh by 8 AM, stakeholder-ready.
- Owns the model accuracy SLA. Data Scientist. Predictions calibrated, drift monitored, A/B test stat-sig.
- Owns the warehouse transformation layer. Analytics Engineer (DE ∩ DA). dbt models, KPI definitions in code, single source of truth.
- Owns the model lifecycle. ML Engineer (DE ∩ DS). Serving infra, feature store, registry, drift detection, rollback.
- Writes the most SQL by hand. Data Analyst. Volume leader; senior DA writes deeper SQL than most DEs.
- Writes the gnarliest SQL. Data Engineer. Complex joins, recursive CTEs, transformation logic against TB-scale tables.
- Writes the most Python. Data Scientist (modelling code) and Data Engineer (Airflow operators and data quality tests).
- Carries the on-call pager. Data Engineer primarily; ML engineer for model serving; nobody else on a healthy team.
- Presents to executives. Data Analyst primarily; senior DS for experiment readouts; senior DE for platform reviews.
- Ships to production. Data Engineer (pipelines), ML Engineer (models), Analytics Engineer (warehouse contracts). DA ships dashboards which are also "production" but managed in BI tools.
- Highest median salary at senior level. Data Engineer (5-15% above Data Scientist at senior level in 2026), driven by infra scarcity and on-call premium.
- Fastest promotion velocity. Data Analyst → Analytics Engineer (18-24 months for a +30% comp jump).
- Best entry for unsure new grads. Data Analyst — narrow stack, fast feedback loop, optionality to either rail.
- Best entry for stats / ML grads. Data Scientist — uses the degree directly.
- Best entry for SWE switchers. Data Engineer — reuses the software-engineering muscle.
- Switch-cost ranking (cheapest first). DA → Analytics Engineer (6-9 months) < DS → ML Engineer (12 months) < DE → DS (12-18 months) < DS → DE (12-18 months).
- The T-shaped rule. Deep in one role, conversant in the other two. Senior hires expect this; juniors should build the breadth deliberately.
Frequently asked questions
What's the main difference between a data engineer and a data scientist?
The main difference is the artifact each role ships and the SLA each role owns. A data engineer ships pipelines, schemas, and orchestration code — and owns the pipeline SLA (data lands on time, idempotent reruns, on-call rotation when the pipeline fails). A data scientist ships models, experiments, and statistical analyses — and owns the model SLA (predictions calibrated, A/B test stat-sig, model AUC above target). Both write SQL and Python; both touch the warehouse. The differentiator is what they're responsible for when the system breaks at 3 AM. In 2026 the data engineer typically earns 5-15% more at senior level due to infra scarcity and on-call premium.
Does a data analyst need to know Python?
Light Python is increasingly expected of a senior data analyst — for one-off exploratory work in pandas, for automating a recurring report, or for prototyping a metric before handing it off to the analytics engineer. The DA's primary tool is still SQL (volume leader of any role) and the BI tool (Looker, Tableau, Power BI, Mode). Production Python — Airflow operators, ML modelling, custom packages — is not a DA skill. New DAs in 2026 should learn enough Python to read pandas and write a 50-line analysis script; deep Python expertise is a signal you're moving toward analytics engineering or data science, not staying in pure DA.
Which role earns the most in 2026?
At senior level in the US, the rough ranking is Data Engineer > Data Scientist > Data Analyst, with DE earning roughly $230k median total comp, DS roughly $220k, and DA roughly $150k. At staff level the DE premium widens slightly because data platform leads command a scarcity premium. The picture flips at FAANG, where senior DS at companies like Meta or Google often hit $300-400k total comp on the equity tailwind — comparable to staff DE elsewhere. In the EU and India the rank order holds but the gap between DE and DS narrows. The honest framing: pick the role by what you find satisfying to ship; the comp curves are close enough that role-fit is the better optimisation than maxing the median.
Can I switch from data analyst to data engineer?
Yes — DA → DE is the most common cross-role switch in 2026, typically routed through analytics engineering as the intermediate step. The skill bridge is dbt (warehouse transformation layer), Airflow (orchestration), and warehouse modelling (Kimball / data vault). A focused 9-12 month build, combined with an open-source dbt portfolio, makes the switch achievable. The comp uplift is typically +30% at the analytics-engineer step and another +20% at the DE step — roughly +56% total over 24 months. The DA's stakeholder-communication skill is preserved and is unusually valuable at the DE rank; senior DEs who can talk to PMs are the staff-track candidates.
What is an analytics engineer?
An analytics engineer owns the warehouse transformation layer — dbt models, KPI definitions in code, and the "single source of truth" tables that both dashboards and notebooks depend on. The role sits in the DE ∩ DA overlap of the three-role Venn: it shares the DE's discipline around version control, code review, and testing, and shares the DA's stakeholder-comms muscle and metric-definition responsibility. The role was named by dbt Labs (formerly Fishtown Analytics) in 2018 and went mainstream by 2022; in 2026 it's the fastest-growing data-role title in mid-stage and scale-up companies. The salary band sits roughly between DA and DE — typically $130-180k senior in the US, with strong demand in remote-first markets.
Do data scientists still build pipelines?
Some still do, especially at smaller companies or on teams without a dedicated DE. The 2014 unicorn DS role explicitly required pipeline work; in 2026 the canonical DS role hands pipeline work to the DE or analytics engineer. The exception is feature pipelines — the DS often owns the upstream feature engineering for their model, even if the DE owns the raw ingestion. Strong DSs in 2026 are conversant in Airflow and dbt (they can read a DAG and a dbt model) but don't operate the platforms — that's the DE's job. The trend is unmistakable: the more mature the data org, the cleaner the split, and the less pipeline work the DS owns.
Practice on PipeCode
- Drill the SQL practice library → for the language every role uses daily.
- Rehearse on Python practice problems → for the DE + DS shared muscle.
- Stack the aggregation library → for the DA's bread-and-butter metric work.
- Sharpen window function drills → for the senior-DA signal that crosses into analytics-engineer territory.
- Stack the joins practice library → for the universal warehouse skill.
- Layer the ETL practice problems → for the canonical DE workflow.
- Add the streaming practice library → for the DE specialisation that commands the highest salary band.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the dialect axis with the SQL for data engineering interviews course →.
- Build the Python depth with the Python for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
- For the platform side, study the Apache Spark internals course →.
Pipecode.ai is Leetcode for Data Engineering — every role-boundary recipe above ships with hands-on practice rooms where you write the SQL the DA owns, the Python the DS owns, and the pipelines the DE owns against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you can move from "I know the Venn" to "I can ship the artifact" without leaving the browser.





Top comments (0)