chat gpt for data engineer jobs searches have tripled in the last 18 months because the day-to-day surface of the role — writing SQL, scaffolding dbt models, refactoring ETL, auto-documenting marts, and reasoning about column-level lineage — is exactly the shape that frontier LLMs do best. The 2026 reality is that roughly thirty percent of a senior DE's keystrokes are now LLM-assisted; the engineers who out-ship their peers are the ones who learned the prompt patterns, retrieval grounding, and safety rails that turn a chat box into a production-grade collaborator.
This guide walks through the five workflows that have crystallised as the working contract for llm data engineering and ai for data engineers in 2026: schema-as-context sql generation ai prompts, RAG-grounded dbt llm macro and model generation, auto-generated docs and column-level lineage, and the human-in-the-loop guardrail ladder that keeps an LLM from silently breaking production. Each section pairs a teaching block with a worked interview-style answer — prompt, code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, browse the SQL practice library →, drill SQL-generation problems →, and rehearse on ETL practice problems →.
On this page
- Why LLMs reshape (but don't replace) DE workflows in 2026
- SQL generation — prompt patterns and schema-as-context
- dbt macro and model generation with RAG over your project
- Auto-generated lineage and docs from SQL + LLM enrichment
- Human-in-the-loop — the production safety pattern
- Cheat sheet — DE LLM workflows in one screen
- Frequently asked questions
- Practice on PipeCode
1. Why LLMs reshape (but don't replace) DE workflows in 2026
The honest one-liner — LLMs collapse the boilerplate, the human still owns correctness
The mental model in one sentence: a large language model is a fluent intern with infinite stamina, encyclopedic recall of public SQL/Python idioms, and zero context for your business — so the value lands the moment you can hand it the right context cheaply, and the risk lands the moment it ships a confidently-wrong answer into production. Once you internalise that — fluent intern, not senior engineer — the rest of the chatgpt data engineer workflow stack falls out: feed schemas, sample rows, project conventions, and house style as context, then gate every output through automated checks and a human review.
What LLMs are genuinely good at in DE workflows.
- Boilerplate generation. Skeleton dbt models, staging YAML files, Airflow DAGs, Spark scaffolds, Pydantic schemas, table DDL — anything that follows a well-known shape.
- Format conversions. JSON → DDL, DDL → ERD, SQL dialect A → dialect B, pandas → PySpark, Airflow 1 → Airflow 2/3.
- Regex and parsing. "Extract the order id from this filename pattern" — LLMs nail regex on the first or second try when prior generations needed three coffee breaks.
- Documentation. dbt model descriptions, column glossaries, runbooks, post-mortem first drafts, README scaffolds — anything that converts code into prose.
- Refactoring. Rename a column across 80 dbt models, replace a deprecated function across a repo, port to a new dialect — bulk, mechanical, repetitive.
-
Test scaffolding. Suggested
unique/not_null/accepted_values/relationshipstests from a dbt model's columns and joins.
What LLMs are bad at — and where they confidently lie.
- Production correctness without verification. A SQL that compiles is not a SQL that returns the right answer. LLMs will happily produce queries that read wrong column names, miss late-arriving rows, double-count via Cartesian joins, or drop nulls silently.
- Novel business logic. Anything that depends on tribal knowledge, finance close calendars, jurisdiction-specific privacy rules, or non-public schemas. The model will pattern-match to the shape of similar problems and invent details.
- Debugging without context. "Why is my pipeline slow?" without a query plan, table sizes, or the actual error log is a hallucination factory.
- Memory across sessions. Even with project memory features, the model forgets specifics — never assume it remembers yesterday's schema decision.
- Cost and performance reasoning. It can recite Big-O, but does not know that your Snowflake warehouse is X-Large or that a CTE materialises in your dialect.
The four shifts in DE workflows since 2024.
- Editor-native LLMs are now table stakes. Cursor, Windsurf, VS Code with Copilot, Zed's inline assist — the IDE is the prompt interface. Tab-completion has been replaced by multi-line, multi-file edits with context-aware diffs.
-
RAG over the project is the unlock for dbt. Naive ChatGPT-in-a-browser hallucinates refs and column names; an LLM that retrieves over
manifest.json,models/, andmacros/doesn't. - Schema-as-context is the prompt pattern that wins for SQL. DDL + 5–10 sample rows + the question, in that order, beats every "you are a SQL expert" persona prompt by a wide margin.
- Human-in-the-loop gates are non-negotiable. No senior team auto-merges LLM-generated SQL to prod. The gates are PR review + dbt test + EXPLAIN check + dry-run.
The "trust ladder" in one paragraph. LLM raw output is untrusted by default. The output earns trust as it moves up the ladder: ① in-IDE suggestion (developer accepts/rejects per line), ② human-authored PR with LLM diffs (peer review + context check), ③ CI gates (dbt parse + compile + test + EXPLAIN), ④ merged + monitored (freshness alerts, row-count diffs, post-deploy DQ). Anything ≤ ② is a draft; only ≥ ④ is "in production." We unpack the ladder fully in §5.
Why the role of the DE doesn't go away — it shifts up the stack.
- From "writing SQL" → "specifying SQL, verifying outputs, and owning the contract." The LLM types; the human signs.
- From "writing docs" → "curating docs." LLM drafts; human edits and chooses what gets canonical.
- From "scaffolding pipelines" → "designing pipelines." Boilerplate evaporates; trade-off decisions stay human.
- From "individual contributor productivity" → "team-level review throughput." A senior DE now reviews 3–5× more PRs than they used to author solo. The bottleneck moves to attention and judgement.
Worked example — count the LLM-touched lines in a senior DE's typical week
Detailed explanation. A useful framing question for any team considering an LLM rollout is "how much of our existing code would a competent LLM-assisted DE have drafted instead of typed?" The answer calibrates expectations and helps choose where to invest in tooling vs training. We can estimate it with a quick day-by-day audit.
Question. A senior data engineer ships ~3,000 lines of net-new code in a typical week (SQL, dbt YAML, Python, terraform), reviews ~2,500 lines of PRs, and writes ~500 lines of docs and runbooks. Roughly what fraction is LLM-amplifiable today, and where does the human still need to drive?
Input (typical week breakdown).
| Activity | Lines | LLM-amplifiable? | Human-only? |
|---|---|---|---|
| New dbt staging models | 900 | yes — boilerplate | naming, source choice |
| New dbt mart models | 700 | partial — joins, ranks | metric definitions |
| Python DAG glue | 400 | yes — boilerplate | dependency design |
| Terraform IaC | 200 | yes — boilerplate | environment policy |
| Schema YAML + tests | 300 | yes | test selection |
| Docs + runbooks | 500 | yes | tone, judgement |
| Bug-fix SQL | 200 | partial | root cause |
| Code review | 2500 | review-time only | the whole point |
Code (Python — quick weekly LLM-assist estimator).
# weekly_llm_assist_estimator.py
WEEK = [
{"name": "dbt staging models", "lines": 900, "llm_share": 0.85, "human_share": 0.15},
{"name": "dbt mart models", "lines": 700, "llm_share": 0.55, "human_share": 0.45},
{"name": "Python DAG glue", "lines": 400, "llm_share": 0.75, "human_share": 0.25},
{"name": "Terraform IaC", "lines": 200, "llm_share": 0.70, "human_share": 0.30},
{"name": "Schema YAML + tests", "lines": 300, "llm_share": 0.80, "human_share": 0.20},
{"name": "Docs + runbooks", "lines": 500, "llm_share": 0.75, "human_share": 0.25},
{"name": "Bug-fix SQL", "lines": 200, "llm_share": 0.35, "human_share": 0.65},
]
llm_lines = sum(r["lines"] * r["llm_share"] for r in WEEK)
human_lines = sum(r["lines"] * r["human_share"] for r in WEEK)
total = sum(r["lines"] for r in WEEK)
print(f"Total weekly lines: {total}")
print(f"LLM-amplifiable (with human review): {llm_lines:.0f} ({llm_lines/total:.0%})")
print(f"Human-driven (LLM not the bottleneck): {human_lines:.0f} ({human_lines/total:.0%})")
Step-by-step explanation.
- Each row is a real category of weekly output; the lines column is an order-of-magnitude estimate from public DE work-mix studies and internal team retros.
-
llm_shareis the fraction of that category an LLM could plausibly draft correctly enough that the human reviews-and-edits rather than writes-from-scratch. Bug-fix SQL has the lowest share because it depends on system-specific context the model lacks. -
human_shareis the residual — the lines where the model contributes nothing or the contribution is harmful (naming, metric definitions, root-cause analysis). - Multiplying and summing gives a rough total: ~70% of weekly lines are amplifiable, ~30% remain firmly human-driven.
- The human-only fraction is concentrated in the categories that matter most — schema design, root-cause debugging, judgement calls — so headcount doesn't shrink; the work shifts.
Output.
| Bucket | Lines/week | Share |
|---|---|---|
| LLM-amplifiable (drafted by LLM, reviewed by human) | 2,200 | 70% |
| Human-driven (LLM not the bottleneck) | 1,000 | 30% |
| Total weekly lines | 3,200 | 100% |
Rule of thumb. Treat the LLM as a force multiplier on boilerplate and a reviewer-assist on logic — not as a replacement for the human who owns the contract. Time saved should go into more design, more review, more tests — not into headcount cuts.
Interview question on LLM augmentation strategy
A senior interviewer often frames the opening probe as: "Walk me through how your team uses LLMs day-to-day, where you draw the line, and how you measure whether the rollout is working." This blends workflow design, safety reasoning, and metrics — the three muscles every chatgpt data engineer round opens with.
Solution Using a "trust ladder + measurable hallucination budget" framework
# llm_rollout_health.py — quarterly LLM rollout health check
from dataclasses import dataclass
from statistics import mean
@dataclass
class Metric:
name: str
value: float
target: float
higher_is_better: bool
metrics = [
Metric("PRs with LLM assist (% of total)", 0.62, 0.50, True),
Metric("Median PR cycle time (days)", 1.4, 2.0, False),
Metric("LLM-introduced rollbacks (% of PRs)", 0.018, 0.02, False),
Metric("Hallucination rate (caught at review)", 0.07, 0.10, False),
Metric("Hallucination escape (post-merge)", 0.003, 0.005, False),
Metric("DE happiness score (1-5)", 4.2, 4.0, True),
]
def status(m: Metric) -> str:
ok = (m.value >= m.target) if m.higher_is_better else (m.value <= m.target)
return "PASS" if ok else "FAIL"
for m in metrics:
print(f"{m.name:<48} {m.value:>7} target={m.target:<5} {status(m)}")
Step-by-step trace.
| Step | Metric | Why it's here |
|---|---|---|
| 1 | PRs with LLM assist | adoption — is the team actually using the tool? |
| 2 | Median PR cycle time | velocity — is the tool moving the needle? |
| 3 | LLM-introduced rollbacks | quality at the boundary — did a bad LLM diff escape? |
| 4 | Hallucination rate | quality before merge — caught by reviewers and CI |
| 5 | Hallucination escape | quality after merge — caught by monitoring |
| 6 | DE happiness | sustainability — is the workflow tolerable long-term? |
Output:
| Metric | Quarter value | Target | Verdict |
|---|---|---|---|
| Adoption | 62% | ≥ 50% | PASS |
| PR cycle time | 1.4 d | ≤ 2.0 d | PASS |
| Rollbacks | 1.8% | ≤ 2.0% | PASS |
| Hallucinations caught | 7% | ≤ 10% | PASS |
| Hallucinations escaped | 0.3% | ≤ 0.5% | PASS |
| Happiness | 4.2 | ≥ 4.0 | PASS |
Why this works — concept by concept:
- Trust ladder before metrics — the ladder defines where the LLM operates (≤ Gate 2 only); metrics measure how well it operates inside that boundary. Skipping the ladder and going straight to metrics ships unsafe automation.
- Hallucination budget — every quarter sets an explicit ceiling on LLM-introduced defects (typically ≤ 2% rollback, ≤ 0.5% escape). The budget is a contract with the team and a forcing function for tooling investment.
- Adoption is necessary but not sufficient — high adoption with no velocity gain means the tool isn't earning its keep; low adoption with high velocity gain means a few power users are pulling the team forward but the workflow doesn't scale.
- Happiness is the slow signal — when sustainability drops, the team is reviewing more than it's authoring and burning out. The fix is usually to invest in better RAG, better gating, or better in-IDE UX — not to push adoption harder.
- Cost — instrumentation cost = O(once-per-quarter survey + CI tagging); the ROI is the rollback dollars you don't spend and the cycle-time hours you reclaim.
SQL
Topic — SQL generation
SQL generation practice problems (LLM-assisted SQL)
2. SQL generation — prompt patterns and schema-as-context
The three-part prompt — schema + sample rows + task — is the only pattern you need
The mental model in one line: a good SQL-generation prompt is a self-contained brief — DDL for the tables involved, 5–10 sample rows per table, and a natural-language task with explicit edge cases — and a great one adds a one- or two-shot in-context example of the exact output shape you want. Once you ship that pattern as a template, sql generation ai stops feeling like magic and starts behaving like a typed function call.
The three required inputs.
-
Schema (DDL).
CREATE TABLEstatements give the model column names, types, nullability, and primary keys. DDL beats JSON descriptions; JSON beats prose. Always include FK relationships if your dialect spells them. -
Sample rows (5–10 per table). This is the silent quality multiplier. Sample rows disambiguate "is
signup_atUTC or local?", "doemails have whitespace?", "isamountin cents or dollars?". The model uses them as type and shape inference, not as a query target. - Task (natural language). A one-paragraph brief with the question, the desired output columns, the time window, the filter rules, and any edge cases (ties, nulls, late arrivals). Be explicit — "include customers with zero orders" beats "all customers."
The two optional inputs that matter.
-
Dialect tag. "Use Snowflake SQL" or "Use PostgreSQL 15 syntax" cuts dialect mistakes (
QUALIFY,LATERAL,LISTAGG, etc.) by ~80%. -
One or two in-context examples.
Question → SQLpairs the model can mimic. For complex shapes (pivoting, gaps-and-islands), a single example moves accuracy from "sometimes works" to "almost always works."
Schema serialisation — DDL beats JSON beats prose.
-
DDL — the model has seen millions of
CREATE TABLEs in training; it parses them effortlessly and respects nullability. -
JSON schema — works for tools that emit structured schemas (BigQuery
INFORMATION_SCHEMA), but the model spends tokens parsing braces. - Prose — "the orders table has an id, a user_id, an amount in cents, and a created_at timestamp" — slowest and most error-prone; only use when DDL isn't available.
Verification — every generated SQL goes through four gates.
-
EXPLAIN plan check. Run
EXPLAINand eyeball the join order, scan types, and rowcount estimates. Catches Cartesian explosions and missing indexes. - Sample result check. Run the SQL against a 1% sample or a fixed sandbox dataset and compare the output to a hand-computed expected.
-
Dialect linter.
sqlflufforsqlglotparses the SQL in your target dialect — catches dialect drift the model snuck in. - PR review. A human looks at the intent — does the SQL actually answer the question that was asked?
One-shot vs few-shot — when to use which.
- Zero-shot. Simple aggregations, easy filters. Works ~70% of the time on first try.
- One-shot (1 example). Window functions, ranking, gaps-and-islands, pivoting. Lifts accuracy to ~90%.
- Few-shot (3–5 examples). Complex business metrics with specific edge cases, dialect quirks, or company-house-style. Lifts accuracy to ~95% but eats prompt tokens.
Worked example — rank customers by 30-day rolling revenue
Detailed explanation. This is the canonical "ranked window with a date filter" interview pattern, and a great calibration target for any sql generation ai prompt. The interesting bits are the rolling window definition, the tie-break rule, and how the LLM handles late-arriving rows.
Question. Given an orders table and a customers table, write a SQL that returns the top 100 customers by total revenue in the trailing 30 days ending on 2026-06-04, with ties broken by the most recent order date. Include customers with zero orders in the window only if they signed up within the same window.
Input (the prompt — schema + sample rows + task).
You are writing PostgreSQL 15. Output only the SQL.
# Schema
CREATE TABLE customers (
id BIGINT PRIMARY KEY,
email TEXT NOT NULL,
signup_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES customers(id),
amount_cents BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
# Sample rows
customers:
| id | email | signup_at |
|----|-------------------|--------------------------|
| 1 | a@x.com | 2025-12-10T09:00:00Z |
| 2 | b@y.com | 2026-06-01T12:30:00Z |
| 3 | c@z.com | 2026-05-15T08:00:00Z |
orders:
| id | customer_id | amount_cents | created_at |
|-----|-------------|--------------|--------------------------|
| 100 | 1 | 4500 | 2026-05-20T10:00:00Z |
| 101 | 1 | 9000 | 2026-06-03T11:30:00Z |
| 102 | 3 | 2000 | 2026-05-22T16:00:00Z |
# Task
Return the top 100 customers by total revenue in the 30-day window ending 2026-06-04 UTC.
Tie-break by most recent order date (DESC).
Include zero-revenue customers only if they signed up inside the same window.
Columns: customer_id, email, total_revenue_cents, last_order_at.
Code (the SQL the model emits — and we accept after verification).
WITH window_bounds AS (
SELECT
TIMESTAMPTZ '2026-05-05 00:00:00+00' AS window_start,
TIMESTAMPTZ '2026-06-04 00:00:00+00' AS window_end
),
orders_in_window AS (
SELECT
o.customer_id,
SUM(o.amount_cents) AS total_revenue_cents,
MAX(o.created_at) AS last_order_at
FROM orders o
CROSS JOIN window_bounds w
WHERE o.created_at >= w.window_start
AND o.created_at < w.window_end
GROUP BY o.customer_id
),
eligible AS (
SELECT
c.id AS customer_id,
c.email,
COALESCE(oiw.total_revenue_cents, 0) AS total_revenue_cents,
oiw.last_order_at
FROM customers c
LEFT JOIN orders_in_window oiw
ON oiw.customer_id = c.id
CROSS JOIN window_bounds w
WHERE oiw.customer_id IS NOT NULL
OR (c.signup_at >= w.window_start AND c.signup_at < w.window_end)
)
SELECT
customer_id,
email,
total_revenue_cents,
last_order_at
FROM eligible
ORDER BY
total_revenue_cents DESC,
last_order_at DESC NULLS LAST
LIMIT 100;
Step-by-step explanation.
- The model picks
WITH window_boundsto define the 30-day window once, then references it twice — a quality signal that the model respects "single source of truth" for the date math. -
orders_in_windowaggregates revenue per customer inside the window. Thecreated_at < window_end(half-open interval) is critical —<=would double-count an order placed exactly at midnight on2026-06-04. Sample rows let the model see the timestamp shape and choose the right boundary. -
eligibleleft-joins customers to the in-window aggregate, then keeps customers that either have in-window revenue or signed up inside the window — implementing the "include zero-revenue customers only if they signed up inside the same window" rule. - The final
ORDER BY total_revenue_cents DESC, last_order_at DESC NULLS LASTenforces the tie-break rule and pushes zero-revenue customers (NULLlast_order_at) to the bottom of any tied bucket. -
LIMIT 100caps the result; in a production query you would also add a deterministic third tie-breaker (customer_id ASC) so paging and re-runs are stable.
Output.
| customer_id | total_revenue_cents | last_order_at | |
|---|---|---|---|
| 1 | a@x.com | 13500 | 2026-06-03T11:30:00Z |
| 3 | c@z.com | 2000 | 2026-05-22T16:00:00Z |
| 2 | b@y.com | 0 | NULL |
Rule of thumb. When asking an LLM for windowed SQL, always spell the window boundaries in the prompt (>= start AND < end) and always hand-trace at least one tied row. The model is excellent at writing the shape and only okay at picking the exact boundary semantics.
Interview question on prompt design for SQL generation
A senior interviewer often frames this as: "How would you ship a prompt template for sql generation ai that scales across a team of 30 engineers and 8 dialects, with hallucination caught before it hits prod?" The probe tests prompt engineering, schema serialisation, and verification design — the three muscles every chatgpt for data engineer jobs round rewards.
Solution Using a templated three-part prompt + four-gate verifier
# sql_gen_template.py — production prompt + verifier
from textwrap import dedent
from typing import Any
import sqlglot
PROMPT_TEMPLATE = dedent("""
You are writing {dialect} SQL. Output ONLY the SQL — no commentary.
# Schema (DDL)
{ddl}
# Sample rows ({sample_n} rows per table)
{samples}
# House style
- Use CTEs for clarity, one per logical step.
- Half-open windows: `>= start AND < end`.
- Always include a deterministic tie-breaker on the ORDER BY.
- Prefer COALESCE over CASE for nullable arithmetic.
# Task
{task}
""").strip()
def build_prompt(*, dialect: str, ddl: str, samples: str, task: str, sample_n: int = 5) -> str:
return PROMPT_TEMPLATE.format(
dialect=dialect, ddl=ddl, samples=samples, task=task, sample_n=sample_n,
)
def verify_sql(sql: str, dialect: str, run_explain: Any, run_sample: Any) -> dict:
# Gate 1 — parse with the target dialect
try:
parsed = sqlglot.parse_one(sql, read=dialect.lower())
parse_ok = True
except Exception as e:
return {"gate": "parse", "ok": False, "error": str(e)}
# Gate 2 — EXPLAIN plan sanity
plan = run_explain(sql) # injected
plan_ok = (
"Seq Scan on orders" not in plan # no full scan on large table
and "Nested Loop" not in plan # no Cartesian product hint
)
# Gate 3 — sample-result check on 1% sample
sample_result = run_sample(sql) # injected
sample_ok = len(sample_result) > 0
return {
"parse_ok": parse_ok,
"plan_ok": plan_ok,
"sample_ok": sample_ok,
"ok": parse_ok and plan_ok and sample_ok,
}
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | build_prompt(...) |
renders the three-part prompt with house-style rules baked in |
| 2 | LLM call | returns a SQL string the team reviews-then-runs |
| 3 | sqlglot.parse_one(sql, read=dialect) |
dialect linter — catches QUALIFY in Postgres etc. |
| 4 | run_explain(sql) |
EXPLAIN plan inspection — catches missing-index Cartesian disasters |
| 5 | run_sample(sql) |
1% sample run — catches empty result sets and obvious off-by-ones |
| 6 | PR review | human signs the diff with prompt + model + output attached |
Output:
| Gate | What it catches | Typical fail rate |
|---|---|---|
| Parse | dialect drift (QUALIFY in Postgres, LIMIT in T-SQL) |
8% |
| EXPLAIN | Cartesian product, full-scan on huge tables | 4% |
| Sample | empty results, off-by-one window | 6% |
| Review | wrong semantics — answers the wrong question | 2% |
Why this works — concept by concept:
- Templated three-part prompt — schema + samples + task is the minimum viable context for the model. Adding house-style rules inline shrinks the diff between the LLM's idiom and your team's idiom.
- Dialect tag at the top — frontier models respect a hard "you are writing Snowflake SQL" instruction; this single token saves dozens of dialect-drift fixes per week.
-
Half-open windows in the house style — every "but the LLM picked
<=instead of<" bug evaporates once the rule is in the prompt. - Four-gate verifier — parse + plan + sample + review fans out across automatic and human gates. Parse and sample catch ~70% of mistakes for free in CI; plan inspection catches the expensive ones; review catches semantic drift.
- Cost — prompt tokens = O(DDL + samples); verifier cost = O(parse + small sample query). Cheap relative to the senior-engineer-hour the SQL would otherwise consume.
SQL
Topic — SQL generation
SQL generation problems (schema-as-context + window math)
3. dbt macro and model generation with RAG over your project
Naive ChatGPT hallucinates refs — RAG over your project is the fix
The mental model in one line: a frontier LLM with browser context has read the dbt docs but has not read your project — so it confidently invents ref('stg_orders') when your project actually calls it stg__shopify__orders, and it cheerfully writes a macro that conflicts with one you already have. Retrieval-Augmented Generation (RAG) over your dbt project closes that gap: the model retrieves the relevant slice of your code before it answers, and the answer reflects your conventions, not the public-docs average.
Why naive LLMs fail on dbt.
-
ref()andsource()hallucination. The model invents model names that don't exist in your project. -
Macro conflict. It writes
safe_divide(a, b)even though your project already hasutils.safe_divide(numerator, denominator)with a different signature. -
Tag and meta drift. It forgets your
tags: ['daily', 'finance']convention and emits a model that fails your CI's tag-based scheduling. -
YAML schema drift. It picks the generic dbt YAML shape, not your house style (column descriptions in
meta, custom test configs, etc.). -
Materialisation defaults. It defaults to
view; your project may default toincrementalfor marts.
The RAG architecture for dbt — four stages.
-
① dbt project.
manifest.json,models/,macros/,schema.yml,dbt_project.yml,selectors.yml. These are the ground-truth inputs. -
② Vector store. Chunk each file (or each Jinja block) into 1–2K-token windows, embed with
text-embedding-3-largeorbge-large(better for code), store inpgvector,Chroma, orPinecone. Re-index on every PR merge. -
③ LLM + retrieval. At prompt time: embed the user's task, retrieve the top-K (usually 5–10) most relevant chunks, stuff them into the prompt as
# Contextblocks, then ask the model to write the macro or model. -
④ Safety rails. Before the output is committed:
dbt parse(syntax),dbt compile(refs resolve),dbt test(assertions pass). If any rail fails, loop back to stage ③ with the error embedded in the next prompt.
Chunking strategies — by file vs by Jinja block.
-
By file. Simplest. Every
.sqlmodel is one chunk. Works for small projects. -
By Jinja block. For projects with very long macros or deeply nested models, split on
{% macro ... %}/{% endmacro %}andWITH … AS (boundaries. Better recall. -
By manifest node. Use
manifest.jsonto chunk by logical node (model, macro, test, source). The richest representation — embeds the dependency graph.
Embedding models — 2026 picks.
-
text-embedding-3-large(OpenAI). Strong general-purpose; cheap. -
bge-large-en-v1.5(BAAI, open). Best open code-aware embedding; self-host friendly. -
Voyage AI voyage-code-2. Purpose-built for code; the best accuracy on dbt + SQL benchmarks at the time of writing. -
Avoid
text-embedding-ada-002. Older; meaningfully worse on code than the above.
The safety rails — three checks every LLM-generated macro must pass.
-
dbt parse. Catches Jinja syntax errors and undefined macro calls. -
dbt compile. Resolves everyref()andsource()against the manifest. If the model invented a name, this fails loudly. -
dbt test. Runs the project's existing tests (and any new ones the macro added). Catches semantic regressions.
Retry-with-error context — the secret to RAG that ships. When a safety rail fails, the system shouldn't surface the error to the human. It should re-prompt the model with the original prompt + the error message + the offending output, and ask for a fix. Two or three iterations usually clear most failures.
Worked example — generate a window-function macro that respects your project's naming
Detailed explanation. A common interview probe: "Walk me through how you'd ask an LLM to write a dbt macro that fits your project's conventions, not the public docs default." The interesting layer is the retrieval — the model must see your existing utility macros before it writes a new one.
Question. Generate a dbt macro rolling_window_sum(column_name, partition_by, order_by, days) that computes a rolling sum over a date window, matches the project's snake_case naming, uses the project's safe_cast() helper for the numeric column, and emits dialect-aware SQL (Snowflake vs Postgres).
Input (retrieved RAG chunks — top 4 of 10).
-- chunk 1 — macros/utils/safe_cast.sql
{% macro safe_cast(column, type) %}
{% if target.type == 'snowflake' %}
TRY_CAST({{ column }} AS {{ type }})
{% else %}
CAST({{ column }} AS {{ type }})
{% endif %}
{% endmacro %}
-- chunk 2 — macros/utils/snake_case_assertion.md
"All macros and arguments use snake_case. No camelCase. No abbreviations
shorter than 3 chars. Plural arguments take a list."
-- chunk 3 — manifest.json node — model 'fct_orders_daily'
"depends_on": ["model.proj.stg_shopify_orders"],
"materialized": "incremental",
"unique_key": ["order_date","customer_id"]
-- chunk 4 — dbt_project.yml fragment
vars:
default_window_days: 30
numeric_cast_type: 'NUMERIC(18,2)'
Code (the macro the model emits after retrieval).
{# macros/window/rolling_window_sum.sql #}
{#-
Computes a rolling sum of `column_name` over the last `days` days,
partitioned by `partition_by` and ordered by `order_by`.
Uses safe_cast() so non-numeric values become NULL instead of failing.
Dialect-aware via target.type.
-#}
{% macro rolling_window_sum(column_name, partition_by, order_by, days=none) %}
{%- set window_days = days if days is not none else var('default_window_days') -%}
{%- set cast_type = var('numeric_cast_type') -%}
{% if target.type == 'snowflake' %}
SUM({{ safe_cast(column_name, cast_type) }}) OVER (
PARTITION BY {{ partition_by }}
ORDER BY {{ order_by }}
RANGE BETWEEN INTERVAL '{{ window_days }} DAY' PRECEDING AND CURRENT ROW
)
{% elif target.type in ('postgres', 'redshift') %}
SUM({{ safe_cast(column_name, cast_type) }}) OVER (
PARTITION BY {{ partition_by }}
ORDER BY {{ order_by }}
RANGE BETWEEN '{{ window_days }} days'::INTERVAL PRECEDING AND CURRENT ROW
)
{% else %}
{{ exceptions.raise_compiler_error(
"rolling_window_sum() does not support target.type=" ~ target.type
) }}
{% endif %}
{% endmacro %}
Step-by-step explanation.
- The retriever pulls
safe_cast.sqlbecause the prompt mentioned a numeric column — the LLM now knows the helper exists with that exact signature and uses it instead of inventingto_number(). - The
snake_case_assertion.mdchunk anchors the model to the team's naming rule —partition_byandorder_by(notpartitionBy, notpartBy). - The
dbt_project.ymlvars (default_window_days,numeric_cast_type) become defaults in the macro instead of hard-coded magic numbers — house-style consistent. -
target.typebranching emits SnowflakeINTERVAL '30 DAY'syntax for Snowflake and'30 days'::INTERVALfor Postgres/Redshift — dialect-aware out of the box because the retrievedsafe_cast()macro is also dialect-aware (the model copies the pattern). - The
elsebranch raises a compile-time error for unsupported targets — defensive, dbt-idiomatic, and consistent with the rest of the codebase's macros (a behaviour the model picked up frommanifest.json).
Output.
| Aspect | Before RAG | After RAG |
|---|---|---|
| Helper macro reuse | invents to_number()
|
uses existing safe_cast()
|
| Argument naming |
partitionBy, daysBack
|
partition_by, days
|
| Defaults | hard-coded 30
|
var('default_window_days') |
| Dialect | Snowflake-only | dialect-aware |
dbt compile first try |
fails (unknown macro) | passes |
Rule of thumb. For any dbt project with > 50 models or > 20 macros, RAG is not optional. Naive prompting can be made to work for a single one-off macro; it cannot be made to work across a team and across releases.
Interview question on shipping RAG-grounded dbt assistance
A senior probe sometimes goes: "How would you build a system that lets a junior engineer ask for a new dbt macro in plain English and ship something that respects all the project's conventions on first compile?" The probe tests RAG design, safety rail composition, and operational maturity.
Solution Using a vector store over manifest.json + project files + a three-rail safety pipeline
# dbt_rag_assistant.py — minimal RAG + safety-rail loop
import json, subprocess
from pathlib import Path
from typing import Iterable
import voyageai
import chromadb
# 1 — index every macro, model, and manifest node
def index_dbt_project(project_dir: Path) -> chromadb.Collection:
client = chromadb.PersistentClient(path=".chroma")
col = client.get_or_create_collection("dbt_project_v1")
vo = voyageai.Client()
docs, ids, metas = [], [], []
for p in project_dir.rglob("*.sql"):
docs.append(p.read_text()); ids.append(str(p)); metas.append({"kind": "sql"})
for p in project_dir.rglob("*.yml"):
docs.append(p.read_text()); ids.append(str(p)); metas.append({"kind": "yml"})
manifest = json.loads((project_dir / "target" / "manifest.json").read_text())
for node_id, node in manifest["nodes"].items():
docs.append(json.dumps(node, indent=2))
ids.append(node_id); metas.append({"kind": "manifest_node"})
embeds = vo.embed(docs, model="voyage-code-2", input_type="document").embeddings
col.upsert(ids=ids, documents=docs, embeddings=embeds, metadatas=metas)
return col
# 2 — answer a task with retrieved context
def answer(task: str, col: chromadb.Collection, llm) -> str:
vo = voyageai.Client()
q = vo.embed([task], model="voyage-code-2", input_type="query").embeddings[0]
hits = col.query(query_embeddings=[q], n_results=10)
ctx = "\n\n---\n\n".join(hits["documents"][0])
prompt = f"# Context (your dbt project)\n{ctx}\n\n# Task\n{task}\n\n# Output (Jinja macro only)"
return llm.complete(prompt)
# 3 — run the three safety rails
def safety_rails(generated: str, project_dir: Path) -> dict:
out_path = project_dir / "macros" / "_generated.sql"
out_path.write_text(generated)
parse = subprocess.run(["dbt", "parse"], cwd=project_dir, capture_output=True)
compile = subprocess.run(["dbt", "compile"], cwd=project_dir, capture_output=True)
test = subprocess.run(["dbt", "test"], cwd=project_dir, capture_output=True)
return {
"parse_ok": parse.returncode == 0,
"compile_ok": compile.returncode == 0,
"test_ok": test.returncode == 0,
"stderr": (parse.stderr + compile.stderr + test.stderr).decode(),
}
# 4 — loop with retry-on-error
def run(task: str, project_dir: Path, llm, col):
for attempt in range(3):
macro = answer(task, col, llm)
rails = safety_rails(macro, project_dir)
if rails["parse_ok"] and rails["compile_ok"] and rails["test_ok"]:
return macro
task = task + "\n\nPrevious attempt failed:\n" + rails["stderr"]
raise RuntimeError("RAG-grounded macro failed safety rails after 3 attempts")
Step-by-step trace.
| Step | Component | What it does |
|---|---|---|
| 1 | index_dbt_project |
embeds every .sql, .yml, and manifest node into the vector store with voyage-code-2
|
| 2 | answer(task) |
embeds the user task, retrieves top-10 similar chunks, builds a context-rich prompt, calls the LLM |
| 3 | safety_rails |
runs dbt parse, dbt compile, dbt test against the generated macro |
| 4 | Retry loop | on failure, feeds the stderr back into the next prompt and tries again, up to 3 times |
| 5 | Final return | the macro that passed all three rails — eligible for human PR review |
Output:
| Attempt | parse_ok | compile_ok | test_ok | Outcome |
|---|---|---|---|---|
| 1 | true | false | — | refused — invented to_decimal(); retry with error |
| 2 | true | true | true | accepted — uses safe_cast()
|
Why this works — concept by concept:
-
Retrieval grounding — the model literally sees your project's macros, models, and node graph before answering. It cannot invent a
ref()that doesn't exist when the manifest is in the prompt. -
Code-aware embeddings —
voyage-code-2(orbge-largeopen) understands SQL/Jinja semantics far better thantext-embedding-ada-002. The retrieval recall difference is 30–50%. -
Three-rail safety pipeline —
dbt parseis fast and catches Jinja typos;dbt compileis medium and catches missing refs;dbt testis slow and catches semantic regressions. The cost ladder matches the bug severity ladder. - Retry-with-error — by feeding the safety-rail stderr back into the prompt, the system self-corrects without paging a human. Two or three iterations clear ~85% of remaining failures.
- Human PR review — the loop ends at eligible for review, never at merged. A senior engineer still owns the merge button. We unpack the full guardrail ladder in §5.
- Cost — index build = O(project_size); per-query = O(retrieval + LLM tokens). Typical 50-model project: $0.02 per generated macro.
SQL
Topic — ETL
ETL practice problems (dbt-style transformation)
4. Auto-generated lineage and docs from SQL + LLM enrichment
The doc-debt killer — SQL parser for structure, LLM for semantics
The mental model in one line: column-level lineage and useful documentation are two halves of one workflow — a deterministic SQL parser extracts the graph (which column flowed into which), and an LLM enriches the graph with business-glossary-aligned prose (what each column means and why it matters). Once that combo runs in CI, a 100-model project can reach full doc coverage in two hours instead of three weeks of manual writing.
The four-stage pipeline.
-
① SQL files. All your
models/.sqlfiles (and any orphan ad-hoc SQL you want covered). -
② SQL parser.
sqlglotis the open-source workhorse — it builds an AST and emits source→target column mappings for SELECTs, CTEs, joins, and unions. - ③ LLM enrichment. For each column the parser identified, the LLM writes a 1–2 sentence description, tags PII, and aligns to a business-glossary entry.
- ④ Docs + lineage output. A dbt-style docs page with the lineage graph (column-level), the descriptions, the glossary links, and a list of suggested tests.
What gets auto-generated — five outputs.
-
Per-model description. A 2-sentence "this model joins
stg_ordersandstg_usersto produce one row per order with the buyer profile attached, filtered to last 90 days." - Per-column description. "A monotonically increasing primary key from the upstream Stripe API."
-
Business-glossary link. Maps
gross_merchandise_value_cents→ glossary entry "GMV (cents)". -
PII / sensitivity tag. Flags
email,phone_number,government_idautomatically. -
Suggested tests.
uniqueon primary keys,not_nullon FKs,accepted_valueson enums,relationshipson FK joins.
Column-level lineage — why it beats model-level.
- Model-level lineage says "mart_orders depends on stg_orders." Useful but coarse.
- Column-level lineage says "mart_orders.gross_value_cents = stg_orders.amount_cents × 1.0 - stg_refunds.refund_amount_cents." That's the lineage you need for impact analysis, compliance, and root-cause debugging.
- A SQL parser like
sqlglotextracts column-level lineage deterministically — no LLM needed for the structural part. The LLM only writes the narrative.
The tools — 2026 landscape.
-
dbt-osmosisanddbt-cosm. Open-source. Walks your dbt project, fills missing column descriptions by pulling from upstream (and via LLM for the originating column). Free, opinionated, fits in CI. - Datafold AI. Column-level lineage + LLM-generated diffs at PR time. Strong "this PR changes 14 downstream columns" notifications.
- Atlan AI / Alation AI. Catalog-native; the docs live in the catalog and the LLM writes them. Good when the catalog is already the source of truth.
- Lightdash AI. BI-side metric definitions written from dashboard usage + dbt model context.
- SDF. Newer entrant, SQL compiler with built-in column-level lineage and dialect cross-translation.
Anti-patterns to avoid.
- LLM writes lineage from prose. The model will hallucinate edges. Always use a SQL parser for structure.
- One-shot bulk run with no review. A 200-model run with no human pass produces 200 docs that sound right and have 5–10% subtle errors that propagate forever. Review in waves of 20 at a time.
- Skip glossary alignment. Without a glossary, every model gets a slightly-different definition of "active user." Glossary first, descriptions second.
Worked example — auto-document a fct_orders mart in 90 seconds
Detailed explanation. The classic dbt mart everyone has: one row per order, joined to user profile, with first-touch attribution and refund-adjusted gross value. Manually documenting all eight columns takes 20 minutes. Let's auto-generate.
Question. Given the SQL for fct_orders below, auto-generate (a) a model-level description, (b) per-column descriptions, (c) PII tags, (d) glossary links, and (e) suggested dbt tests. Show how sqlglot extracts the column-level lineage and where the LLM enrichment plugs in.
Input (the dbt model SQL).
-- models/marts/fct_orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}
WITH orders AS (
SELECT
order_id,
user_id,
amount_cents,
created_at,
first_touch_channel
FROM {{ ref('stg_shopify_orders') }}
WHERE _loaded_at >= (SELECT MAX(_loaded_at) - INTERVAL '7 days' FROM {{ this }})
),
refunds AS (
SELECT order_id, SUM(refund_cents) AS refund_total_cents
FROM {{ ref('stg_shopify_refunds') }}
GROUP BY order_id
),
users AS (
SELECT user_id, email, country_code, signup_at
FROM {{ ref('stg_users') }}
)
SELECT
o.order_id,
o.user_id,
u.email,
u.country_code,
o.first_touch_channel,
o.amount_cents AS gross_value_cents,
COALESCE(r.refund_total_cents, 0) AS refund_value_cents,
o.amount_cents - COALESCE(r.refund_total_cents, 0) AS net_value_cents,
o.created_at AS order_placed_at
FROM orders o
LEFT JOIN refunds r ON r.order_id = o.order_id
LEFT JOIN users u ON u.user_id = o.user_id
Code (the parser + LLM enrichment loop).
# auto_doc_fct_orders.py
import sqlglot, json
from sqlglot.lineage import lineage
SQL = open("models/marts/fct_orders.sql").read()
# 1 — render Jinja first (use dbt compile in real life)
compiled_sql = render_jinja(SQL)
# 2 — extract column-level lineage with sqlglot
target_cols = [
"order_id", "user_id", "email", "country_code", "first_touch_channel",
"gross_value_cents", "refund_value_cents", "net_value_cents", "order_placed_at",
]
edges = {}
for col in target_cols:
nodes = lineage(col, compiled_sql, dialect="snowflake")
edges[col] = [n.source.alias_or_name for n in nodes.walk()]
# 3 — call the LLM for narrative enrichment
prompt = (
f"# Lineage (deterministic)\n{json.dumps(edges, indent=2)}\n\n"
f"# Compiled SQL\n{compiled_sql}\n\n"
"# Task — return JSON\n"
"{\n"
' "model_description": "1-2 sentences",\n'
' "columns": [\n'
' {"name": "...", "description": "...", "pii": false, "glossary_key": "..."}\n'
" ],\n"
' "suggested_tests": [{"column": "...", "test": "unique|not_null|..."}]\n'
"}"
)
enrichment = json.loads(llm.complete(prompt))
print(json.dumps(enrichment, indent=2))
Step-by-step explanation.
-
sqlglot.lineage()walks the AST starting from each target column, recursing through CTEs and joins, and returns the source columns that feed it. This is deterministic — same SQL, same lineage, every time. - The result for
net_value_centsis["amount_cents", "refund_total_cents"]with the formulao.amount_cents - COALESCE(r.refund_total_cents, 0)baked into the lineage node's expression. - The LLM never gets to invent lineage — the structure is locked. It only writes prose for
description, picks aglossary_key, and flagspii. - For each target column, the LLM is prompted to write a description in 1–2 sentences, set
pii: trueforemail,phone, etc., and pick aglossary_keyfrom a list of canonical entries. - The model also suggests dbt tests —
uniqueonorder_id,not_nullon FKs,accepted_valuesonfirst_touch_channelif it sees a small enum cardinality in the compiled SQL.
Output (LLM-generated fct_orders.yml).
| Field | Value |
|---|---|
model_description |
"One row per Shopify order, refund-adjusted, with buyer profile attached. Incremental on _loaded_at." |
columns[0].name |
order_id |
columns[0].description |
"Primary key from Shopify; monotonically increasing per store." |
columns[2].pii |
true (email) |
columns[7].description |
"Net value after refunds, in cents. = gross_value_cents − refund_value_cents." |
suggested_tests |
unique(order_id), not_null(user_id), relationships(user_id → stg_users.user_id)
|
Rule of thumb. Always run the parser before the LLM. The parser owns what flows where; the LLM owns what it means. Mixing these responsibilities is the root cause of every "the auto-doc tool wrote a beautifully-worded lie" failure mode.
Interview question on shipping column-level lineage and docs at scale
A senior probe: "Your 200-model dbt project has 12% column-description coverage and a compliance team wants 95% by next quarter. How would you ship that without burning a quarter of engineering time?"
Solution Using sqlglot lineage + LLM enrichment + waves-of-twenty human review
# rollout_auto_docs.py — wave-based rollout
from pathlib import Path
import yaml
WAVES = [
("staging", 20, "junior_de@team"),
("staging", 40, "junior_de@team"),
("marts", 20, "senior_de@team"),
("marts", 40, "senior_de@team"),
("metrics", 20, "data_pm@team"),
]
def run_wave(layer: str, n_models: int, reviewer: str, project: Path):
candidates = sorted((project / "models" / layer).rglob("*.sql"))[:n_models]
drafts = []
for sql_path in candidates:
yml_path = sql_path.with_suffix(".yml")
enrichment = auto_doc(sql_path) # parser + LLM
yml_path.write_text(yaml.safe_dump(enrichment))
drafts.append({"file": str(yml_path), "reviewer": reviewer})
return drafts
all_drafts = []
for layer, n, reviewer in WAVES:
all_drafts.extend(run_wave(layer, n, reviewer, Path("./dbt_project")))
# Submit one PR per wave for focused review
print(f"Generated {len(all_drafts)} draft YML files across {len(WAVES)} review waves")
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | Run sqlglot lineage on each model |
deterministic column-level structure |
| 2 | LLM enrichment per wave (20 models) | small enough that one reviewer can read every diff |
| 3 | Write .yml next to each .sql
|
fits dbt's schema discipline; PR diff is reviewable |
| 4 | Per-wave PR with a single reviewer | catches systematic errors before the next wave |
| 5 | After each wave, update the glossary if new canonical terms emerged | feeds back into the next wave's prompt |
Output:
| Wave | Models | Reviewer | Outcome |
|---|---|---|---|
| 1 (staging × 20) | 20 | junior_de | accepted with 4 description edits |
| 2 (staging × 40) | 40 | junior_de | accepted with 6 edits + 1 glossary addition |
| 3 (marts × 20) | 20 | senior_de | accepted with 2 edits + 1 PII tag correction |
| 4 (marts × 40) | 40 | senior_de | accepted, no edits |
| 5 (metrics × 20) | 20 | data_pm | accepted with 3 glossary alignments |
| Total | 140 | — | 95% coverage in 2 weeks, 8 hours of review time |
Why this works — concept by concept:
- Wave-based rollout — 20 at a time keeps the PR small enough to review carefully. 200 at a time would get rubber-stamped.
- Layer-aware reviewers — juniors review staging (mechanical), seniors review marts (judgement), PM reviews metrics (glossary). Right person at the right layer.
- Glossary as feedback loop — each wave updates the glossary with new canonical terms, and the next wave's LLM prompt benefits.
-
Parser owns structure, LLM owns narrative — column-level lineage from
sqlglotis deterministic and audit-able; descriptions from the LLM are advisory and edited. - 95% coverage, not 100% — the last 5% is usually deprecated models, sandboxes, or models so weird that auto-doc would do more harm than good. Accept the asymptote.
- Cost — engineering hours = 8 (reviews) vs ~120 (manual writing). LLM cost ≈ $5 total. The bottleneck stays human attention, where it should.
SQL
Topic — data transformation
Data transformation problems (lineage-aware modelling)
5. Human-in-the-loop — the production safety pattern
The trust ladder — five gates, one principle, never auto-merge to prod
The mental model in one line: an LLM accelerates; it does not own; the human is the final gate, forever. Every senior DE team that has rolled out LLM-assisted workflows arrived at some version of a five-gate trust ladder, and every shop that skipped a gate eventually paid for it — usually with a silent metric corruption that took weeks to detect because the buggy SQL "looked right."
The five gates (bottom = untrusted, top = production-monitored).
- Gate 0 — LLM raw output. Untrusted by default. The model returned a string. Treat it as a draft, not a commit.
- Gate 1 — IDE suggestion. Developer accepts/rejects per line in Cursor/Copilot/Windsurf. The human is reading every diff in real time.
- Gate 2 — PR review. A peer reads the diff with full context (the original prompt, the model, the test outputs) and approves/requests changes.
- Gate 3 — CI: dbt test + EXPLAIN + dry-run. Automated gates that break the build on any of: failed assertions, suspicious query plan, row-count drift > threshold.
- Gate 4 — Merged + monitored. The change is live, but data-quality monitors (freshness alerts, row-count diffs, PII scans, downstream chart-failure detection) watch it for ≥ 7 days.
The non-negotiable rules.
- Never auto-merge LLM-generated SQL or dbt models into prod. Even with passing CI, the cost of a silent business-logic error dwarfs the cost of a 30-minute human review.
-
Every LLM-touched commit is tagged.
llm: claude-opus-4-7,prompt-hash: abc123,reviewer: alice@team. This is the audit log; you'll thank yourself the first time you need it. - The reviewer is not the prompt author. The author has confirmation bias; a peer catches what the author can't see.
- PR description includes the prompt. So the reviewer can sanity-check that the prompt actually asked for the right thing.
Hallucination budget — measure it, reduce it.
- What it is. A quarterly target like "≤ 2% of LLM-touched PRs introduce a defect requiring rollback or hotfix."
-
How to measure it. Tag every PR with
assisted-by: llm. Count rollbacks/hotfixes in that population. Compute the rate. - How to reduce it. Better RAG (so the model has more context), tighter prompts (with house style baked in), more gates (column-level lineage diff at PR time), and training — the team gets better at prompting and reviewing over time.
Audit log — what to capture per LLM-touched change.
-
prompt(or its hash + a stored copy) -
model(e.g.claude-opus-4-7,gpt-4o-2026-05) -
tools_used(RAG retrieval IDs, function-call traces) -
reviewer(the human who approved) -
gates_passed(parse / compile / test / EXPLAIN / DQ monitors) -
outcome(merged / rolled-back / reverted / fixed-forward)
Operational alerts every LLM-using team should ship.
- Row-count diff. Post-deploy row count differs from a 30-day rolling baseline by > 10%? Page.
-
Freshness drift. Model's
_loaded_atlagged by > 1 hour? Page. -
PR comment regex. "wrong" / "fix" / "hallucinat" in PR comments tagged
assisted-by: llm? Roll up weekly. - Downstream chart failure. Any BI chart that depends on the changed column starts erroring? Notify the change author.
- Test escape. A new test broke ≥ 48 hours after the model that changed? Suspect the LLM change.
The "do not auto" list — things the LLM never does by itself.
- Approving its own PR. Period.
- Merging to
main. - Deploying to production.
- Backfilling historical data.
- Dropping or truncating tables.
- Running
DELETE/UPDATEagainst prod without an explicit human confirmation.
Worked example — design the CI pipeline for an LLM-assisted dbt PR
Detailed explanation. A common interview probe: "Walk me through the CI checks you'd add specifically because the change might be LLM-authored." The interesting layer is the additional checks — the basics (parse, compile, test) are table stakes; we want the gates that specifically catch LLM failure modes.
Question. Sketch a CI pipeline for a dbt PR tagged assisted-by: llm that runs three "table-stakes" gates and three "LLM-specific" gates. The LLM-specific ones must catch silent errors: a join condition that quietly doubles row counts, a deprecated column reference, and a PII leak via a renamed column.
Input (.github/workflows/dbt_pr.yml).
name: dbt PR (LLM-aware)
on:
pull_request:
paths: ['models/**', 'macros/**', 'seeds/**', 'snapshots/**']
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Gate 1 — table stakes
- run: dbt deps
- run: dbt parse
- run: dbt compile --target ci
- run: dbt test --target ci
# Gate 2 — LLM-specific (row-count diff)
- name: row-count diff vs main
run: python tools/row_count_diff.py --threshold 10 # %
# Gate 3 — LLM-specific (column-level lineage diff)
- name: column lineage diff
run: python tools/lineage_diff.py --warn-on=downstream_columns_changed
# Gate 4 — LLM-specific (PII scan)
- name: PII scan
run: python tools/pii_scan.py --fail-on=leaked_pii_columns
# Audit trail
- name: tag PR with LLM metadata
if: contains(github.event.pull_request.labels.*.name, 'assisted-by:llm')
run: |
gh pr edit ${{ github.event.pull_request.number }} \
--add-label "needs-senior-review" \
--add-label "hallucination-budget"
Code (the row-count diff gate).
# tools/row_count_diff.py
import argparse, json, subprocess, sys
p = argparse.ArgumentParser()
p.add_argument("--threshold", type=float, default=10.0,
help="row-count % delta vs main that triggers a warn/fail")
args = p.parse_args()
def run(cmd):
return subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True).stdout
# 1 — list changed models in this PR
changed = json.loads(run("dbt ls -s state:modified --output json --state ./prod-manifest/"))
violations = []
for model in changed:
fqn = model["unique_id"]
main_count = int(run(f"dbt show -s {fqn} --target prod --limit 0 --output rows-only").strip())
pr_count = int(run(f"dbt show -s {fqn} --target ci --limit 0 --output rows-only").strip())
if main_count == 0:
continue
delta_pct = abs(pr_count - main_count) / main_count * 100
if delta_pct > args.threshold:
violations.append((fqn, main_count, pr_count, delta_pct))
if violations:
print("ROW-COUNT DIFF VIOLATIONS:")
for v in violations:
print(f" {v[0]}: main={v[1]}, pr={v[2]}, delta={v[3]:.1f}%")
sys.exit(1)
print("row-count diff OK")
Step-by-step explanation.
- Gates 1–4 in order. Parse catches Jinja syntax. Compile catches missing refs (the classic LLM hallucination). Test catches assertion regressions. These are table stakes and run in seconds.
-
Row-count diff. Lists
state:modifiedmodels (only what changed in the PR), runs bothprodandciagainst the same time window, compares the row counts. A model that quietly went from 1M rows to 2M rows is almost always a join-cardinality bug — exactly the LLM failure mode that compiles and passes unit tests. -
Column lineage diff. Compares
sqlglot-extracted lineage betweenprodand the PR. If a column that was sourced fromstg_users.emailis now sourced from a different column (saystg_users.username), the gate warns the reviewer. -
PII scan. Walks the new YML files. If a column previously tagged
pii: truewas renamed and the tag wasn't carried, fail the build. -
Audit labels. When the PR carries
assisted-by:llm, the workflow addsneeds-senior-reviewandhallucination-budgetlabels. The org's quarterly metric pulls from those labels.
Output.
| Gate | Catches | Typical fail rate on LLM PRs |
|---|---|---|
dbt parse |
Jinja typos | 2% |
dbt compile |
invented refs | 4% |
dbt test |
broken assertions | 5% |
| Row-count diff | cardinality blowups | 3% |
| Lineage diff | silent column re-source | 1% |
| PII scan | dropped PII tags | 0.5% |
Rule of thumb. Three gates beyond the table-stakes trio (row-count, lineage, PII) catch ~80% of LLM-specific defects that would otherwise reach prod. Add them all; the cost is minutes per PR, the savings is incidents-per-quarter.
Interview question on the human-in-the-loop guardrail design
A senior probe: "An engineer on your team wants to auto-merge LLM-generated SQL for staging models because 'staging isn't prod.' How do you respond?"
Solution Using a tiered policy — gate every model, but vary the human bar by layer
# llm_review_policy.py
LAYER_POLICY = {
"staging": {
"gates": ["parse", "compile", "test", "row_count_diff"],
"human_reviewer": "any peer (LGTM ok)",
"merge_window": "any",
"rollback_sla_hours": 24,
},
"intermediate": {
"gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff"],
"human_reviewer": "any peer",
"merge_window": "any",
"rollback_sla_hours": 12,
},
"marts": {
"gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff", "pii_scan"],
"human_reviewer": "senior DE",
"merge_window": "business hours only",
"rollback_sla_hours": 4,
},
"metrics": {
"gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff",
"pii_scan", "metric_definition_review"],
"human_reviewer": "senior DE + data PM",
"merge_window": "business hours only",
"rollback_sla_hours": 1,
},
}
def required_gates(model_layer: str) -> dict:
return LAYER_POLICY.get(model_layer, LAYER_POLICY["marts"]) # default to strictest
Step-by-step trace.
| Layer | Min gates | Reviewer bar | Rollback SLA |
|---|---|---|---|
| staging | parse + compile + test + rowcount | any peer | 24h |
| intermediate | + lineage diff | any peer | 12h |
| marts | + PII scan | senior DE | 4h |
| metrics | + metric def review | senior DE + data PM | 1h |
Output:
| Layer | Auto-merge allowed? | Why |
|---|---|---|
| staging | no | even staging feeds downstream marts; a silent bug compounds |
| intermediate | no | same as staging |
| marts | no | direct BI/ML consumer; metric corruption blast radius |
| metrics | no | board-facing numbers; reputational risk |
Why this works — concept by concept:
- Gate every layer, vary the bar — there is no layer where "auto-merge" makes sense, because every layer feeds something downstream. What varies is the seniority of the reviewer and the number of automated gates, not whether a human is required.
- Rollback SLA matches blast radius — staging has 24 hours because downstream marts re-run nightly; metrics has 1 hour because a Slack channel of execs will notice.
- PII scan is mart+ — staging PII is usually upstream-tagged; the moment you transform it, the renamed columns can drop the tag. That's a mart-layer risk.
-
Metric definition review needs a non-engineer — a data PM catches "this LLM rewrote
active_userto uselast_seen_atinstead oflast_event_at" because the metric meaning matters more than the SQL elegance. - No silent paths — every model goes through CI; the policy is enforced in YAML; the audit log is automatic. No "but staging is fine to auto-merge" loophole.
- Cost — review time = O(diff lines × layer-multiplier); incidents saved = the entire reason the role exists. The trade is obvious.
SQL
Topic — data processing
Data processing problems (pipeline review and CI checks)
Cheat sheet — DE LLM workflows in one screen
-
SQL generation. Schema (DDL) + 5–10 sample rows + task in natural language → LLM → verify with
EXPLAIN+ sample run + dialect lint + PR review. Half-open windows (>= start AND < end). One-shot example for any non-trivial shape. -
dbt macro / model generation. RAG over
manifest.json+models/+macros/with a code-aware embedding (voyage-code-2orbge-large) → LLM with retrieved context → safety rails (dbt parse,dbt compile,dbt test) → human review. Retry with error context on rail failure. -
Doc + lineage generation.
sqlglotfor deterministic column-level lineage → LLM enrichment for descriptions, glossary links, PII tags, and suggested tests → wave-based human review in batches of 20. Never let the LLM invent lineage edges. - Refactoring. LLM rewrites + automated diff + targeted tests + PR review. Run the existing tests first; if they pass, the refactor is at least functionally compatible.
-
Test scaffolding. LLM suggests
unique,not_null,accepted_values,relationships. Human picks which ones to keep. Defaults: PKs unique, FKs not_null + relationships, enums accepted_values, dates not_null. -
Human-in-the-loop policy. Five gates: raw output → IDE suggestion → PR review → CI (
parse+compile+test+row_count_diff+lineage_diff+pii_scan) → merged + monitored. Never auto-merge to any layer. Every LLM-touched commit tagged with prompt + model + reviewer. - Hallucination budget. Set quarterly (e.g. ≤ 2% LLM-introduced rollbacks). Measure via PR labels + post-deploy DQ alerts. Reduce via better RAG, tighter prompts, more gates, team training.
-
Tooling stack 2026. Editor: Cursor / Windsurf / VS Code + Copilot. RAG:
pgvector/Chroma+voyage-code-2. dbt-aware:dbt-osmosis,dbt-cosm. Lineage:sqlglot, Datafold AI, Atlan AI. CI: GitHub Actions runningdbt parse/compile/test+ custom row-count and lineage diff scripts.
Frequently asked questions
Will LLMs replace data engineers?
No — but the shape of the role keeps shifting up the stack. LLMs collapse the boilerplate that used to consume 50–70% of weekly keystrokes (staging models, YAML configs, glue Python, dialect conversions) and free senior DEs to spend that time on schema design, pipeline architecture, metric definitions, and code review. The teams that are growing fastest in 2026 are the ones that treat chat gpt for data engineer jobs as a force multiplier on senior judgement, not as a headcount-reduction lever. The bottleneck moves from writing code to reviewing code and owning the contract — and that's still a human job, with no plausible automation path on the horizon.
What is the best LLM for SQL generation in 2026?
Claude Opus 4.x and GPT-4o-2026 sit at the frontier for SQL accuracy on standard benchmarks, with Anthropic's family slightly ahead on complex window functions and OpenAI's slightly ahead on dialect breadth. Specialised SQL models like SQLCoder-34B and Defog AI's tuned variants are competitive on Postgres/Snowflake and meaningfully cheaper at scale. The decision rarely hinges on raw model quality — it hinges on whether you can ship the three-part prompt (schema + sample rows + task), a dialect tag, and a four-gate verifier. Any frontier model with that scaffolding outperforms any model without it.
How do I prevent LLM hallucinations from reaching production?
Three layers, every time: context grounding (RAG over your project's manifest.json, models, and macros so the LLM sees your real refs and column names), automated CI gates (dbt parse + compile + test + row-count diff + column-level lineage diff + PII scan), and human PR review with the prompt attached so the reviewer can sanity-check intent. Set an explicit hallucination budget (e.g. ≤ 2% LLM-touched PRs requiring rollback per quarter) and instrument every LLM-touched commit with prompt-hash, model, reviewer, and gates_passed labels. The gates catch the structural failures cheaply; the reviewer catches the semantic ones; the budget keeps the team honest about trade-offs.
Can ChatGPT write production-ready dbt models on its own?
It can write the first draft, and with RAG over your project it can produce something that compiles and respects your conventions on the first or second attempt — but "production-ready" requires the same gates any human-written model would face: dbt parse, dbt compile, dbt test, peer review, and post-deploy monitoring. The realistic 2026 pattern is: junior DE drafts the model with dbt llm assist, senior DE reviews and edits, CI runs the full safety-rail suite, and the model lands tagged assisted-by:llm with the prompt stored in the audit log. The model never auto-merges, even when CI is green.
What tools auto-generate data lineage from SQL?
sqlglot is the open-source workhorse for deterministic column-level lineage — it parses SQL, builds an AST, and emits source→target column mappings without an LLM in the loop. Commercial tools layered on top include Datafold AI (column-level lineage + LLM-generated PR diffs), Atlan AI and Alation AI (catalog-native with LLM enrichment), and SDF (a SQL compiler with built-in lineage and cross-dialect translation). For pure dbt projects, dbt-osmosis and dbt-cosm use a combination of upstream propagation + LLM enrichment to fill missing column descriptions and lineage notes. The split that matters in 2026: parser owns structure, LLM owns narrative.
Is LLM-generated code safe to commit, and how should I review it?
LLM-generated code is safe to commit only after it has cleared automated CI (parse, compile, test, row-count diff, lineage diff, PII scan if applicable) and at least one human reviewer has read the full diff with the original prompt attached. The reviewer should be a peer, not the prompt author (confirmation bias is real). Every LLM-touched commit should be tagged in the audit log with the prompt, model, retrieval IDs (if RAG was used), and the reviewer's identity — so when something does go wrong, the team can re-run the prompt, reproduce the output, and learn. Treat the LLM as a fluent intern: take its drafts gratefully, take its judgement skeptically.
Practice on PipeCode
- Drill the SQL practice library → to lock down the dialect-aware patterns LLM prompts depend on.
- Sharpen SQL-generation practice problems → for the prompt-and-verify rhythm.
- Stack the prerequisites with ETL practice problems → so the pipelines the LLM scaffolds make sense end-to-end.
- Sharpen data transformation drills → for the lineage-aware modelling muscle.
- Reinforce the data-processing fundamentals with data processing problems →.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Strengthen the foundations with SQL data types — the PostgreSQL guide →.
- Work through the design-round muscles with ETL system design for DE interviews →.
- Lock the SQL fundamentals with SQL for Data Engineering Interviews — from zero to FAANG →.
- Reinforce the Python side with Python for DE interviews — the complete fundamentals →.
- Pair LLM-assisted modelling with data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every LLM workflow above ships with hands-on practice rooms where you draft SQL with a real schema, generate dbt macros against a real project, and review pipelines against a real-time scoring engine. Start with the SQL-generation library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)