DEV Community

Cover image for ChatGPT / LLM Workflows for Data Engineers: SQL Generation, dbt Macros & Lineage
Gowtham Potureddi
Gowtham Potureddi

Posted on

ChatGPT / LLM Workflows for Data Engineers: SQL Generation, dbt Macros & Lineage

chat gpt for data engineer jobs searches have tripled in the last 18 months because the day-to-day surface of the role — writing SQL, scaffolding dbt models, refactoring ETL, auto-documenting marts, and reasoning about column-level lineage — is exactly the shape that frontier LLMs do best. The 2026 reality is that roughly thirty percent of a senior DE's keystrokes are now LLM-assisted; the engineers who out-ship their peers are the ones who learned the prompt patterns, retrieval grounding, and safety rails that turn a chat box into a production-grade collaborator.

This guide walks through the five workflows that have crystallised as the working contract for llm data engineering and ai for data engineers in 2026: schema-as-context sql generation ai prompts, RAG-grounded dbt llm macro and model generation, auto-generated docs and column-level lineage, and the human-in-the-loop guardrail ladder that keeps an LLM from silently breaking production. Each section pairs a teaching block with a worked interview-style answer — prompt, code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for ChatGPT / LLM workflows deep dive — bold white headline 'ChatGPT · LLM Workflows for DEs' with subtitle 'SQL Generation · dbt Macros · Lineage · Docs' and a stylised prompt → SQL → verify pipeline on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse the SQL practice library →, drill SQL-generation problems →, and rehearse on ETL practice problems →.


On this page


1. Why LLMs reshape (but don't replace) DE workflows in 2026

The honest one-liner — LLMs collapse the boilerplate, the human still owns correctness

The mental model in one sentence: a large language model is a fluent intern with infinite stamina, encyclopedic recall of public SQL/Python idioms, and zero context for your business — so the value lands the moment you can hand it the right context cheaply, and the risk lands the moment it ships a confidently-wrong answer into production. Once you internalise that — fluent intern, not senior engineer — the rest of the chatgpt data engineer workflow stack falls out: feed schemas, sample rows, project conventions, and house style as context, then gate every output through automated checks and a human review.

What LLMs are genuinely good at in DE workflows.

  • Boilerplate generation. Skeleton dbt models, staging YAML files, Airflow DAGs, Spark scaffolds, Pydantic schemas, table DDL — anything that follows a well-known shape.
  • Format conversions. JSON → DDL, DDL → ERD, SQL dialect A → dialect B, pandas → PySpark, Airflow 1 → Airflow 2/3.
  • Regex and parsing. "Extract the order id from this filename pattern" — LLMs nail regex on the first or second try when prior generations needed three coffee breaks.
  • Documentation. dbt model descriptions, column glossaries, runbooks, post-mortem first drafts, README scaffolds — anything that converts code into prose.
  • Refactoring. Rename a column across 80 dbt models, replace a deprecated function across a repo, port to a new dialect — bulk, mechanical, repetitive.
  • Test scaffolding. Suggested unique / not_null / accepted_values / relationships tests from a dbt model's columns and joins.

What LLMs are bad at — and where they confidently lie.

  • Production correctness without verification. A SQL that compiles is not a SQL that returns the right answer. LLMs will happily produce queries that read wrong column names, miss late-arriving rows, double-count via Cartesian joins, or drop nulls silently.
  • Novel business logic. Anything that depends on tribal knowledge, finance close calendars, jurisdiction-specific privacy rules, or non-public schemas. The model will pattern-match to the shape of similar problems and invent details.
  • Debugging without context. "Why is my pipeline slow?" without a query plan, table sizes, or the actual error log is a hallucination factory.
  • Memory across sessions. Even with project memory features, the model forgets specifics — never assume it remembers yesterday's schema decision.
  • Cost and performance reasoning. It can recite Big-O, but does not know that your Snowflake warehouse is X-Large or that a CTE materialises in your dialect.

The four shifts in DE workflows since 2024.

  • Editor-native LLMs are now table stakes. Cursor, Windsurf, VS Code with Copilot, Zed's inline assist — the IDE is the prompt interface. Tab-completion has been replaced by multi-line, multi-file edits with context-aware diffs.
  • RAG over the project is the unlock for dbt. Naive ChatGPT-in-a-browser hallucinates refs and column names; an LLM that retrieves over manifest.json, models/, and macros/ doesn't.
  • Schema-as-context is the prompt pattern that wins for SQL. DDL + 5–10 sample rows + the question, in that order, beats every "you are a SQL expert" persona prompt by a wide margin.
  • Human-in-the-loop gates are non-negotiable. No senior team auto-merges LLM-generated SQL to prod. The gates are PR review + dbt test + EXPLAIN check + dry-run.

The "trust ladder" in one paragraph. LLM raw output is untrusted by default. The output earns trust as it moves up the ladder: ① in-IDE suggestion (developer accepts/rejects per line), ② human-authored PR with LLM diffs (peer review + context check), ③ CI gates (dbt parse + compile + test + EXPLAIN), ④ merged + monitored (freshness alerts, row-count diffs, post-deploy DQ). Anything ≤ ② is a draft; only ≥ ④ is "in production." We unpack the ladder fully in §5.

Why the role of the DE doesn't go away — it shifts up the stack.

  • From "writing SQL" → "specifying SQL, verifying outputs, and owning the contract." The LLM types; the human signs.
  • From "writing docs" → "curating docs." LLM drafts; human edits and chooses what gets canonical.
  • From "scaffolding pipelines" → "designing pipelines." Boilerplate evaporates; trade-off decisions stay human.
  • From "individual contributor productivity" → "team-level review throughput." A senior DE now reviews 3–5× more PRs than they used to author solo. The bottleneck moves to attention and judgement.

Worked example — count the LLM-touched lines in a senior DE's typical week

Detailed explanation. A useful framing question for any team considering an LLM rollout is "how much of our existing code would a competent LLM-assisted DE have drafted instead of typed?" The answer calibrates expectations and helps choose where to invest in tooling vs training. We can estimate it with a quick day-by-day audit.

Question. A senior data engineer ships ~3,000 lines of net-new code in a typical week (SQL, dbt YAML, Python, terraform), reviews ~2,500 lines of PRs, and writes ~500 lines of docs and runbooks. Roughly what fraction is LLM-amplifiable today, and where does the human still need to drive?

Input (typical week breakdown).

Activity Lines LLM-amplifiable? Human-only?
New dbt staging models 900 yes — boilerplate naming, source choice
New dbt mart models 700 partial — joins, ranks metric definitions
Python DAG glue 400 yes — boilerplate dependency design
Terraform IaC 200 yes — boilerplate environment policy
Schema YAML + tests 300 yes test selection
Docs + runbooks 500 yes tone, judgement
Bug-fix SQL 200 partial root cause
Code review 2500 review-time only the whole point

Code (Python — quick weekly LLM-assist estimator).

# weekly_llm_assist_estimator.py
WEEK = [
    {"name": "dbt staging models",  "lines": 900, "llm_share": 0.85, "human_share": 0.15},
    {"name": "dbt mart models",     "lines": 700, "llm_share": 0.55, "human_share": 0.45},
    {"name": "Python DAG glue",     "lines": 400, "llm_share": 0.75, "human_share": 0.25},
    {"name": "Terraform IaC",       "lines": 200, "llm_share": 0.70, "human_share": 0.30},
    {"name": "Schema YAML + tests", "lines": 300, "llm_share": 0.80, "human_share": 0.20},
    {"name": "Docs + runbooks",     "lines": 500, "llm_share": 0.75, "human_share": 0.25},
    {"name": "Bug-fix SQL",         "lines": 200, "llm_share": 0.35, "human_share": 0.65},
]

llm_lines = sum(r["lines"] * r["llm_share"] for r in WEEK)
human_lines = sum(r["lines"] * r["human_share"] for r in WEEK)
total = sum(r["lines"] for r in WEEK)

print(f"Total weekly lines: {total}")
print(f"LLM-amplifiable (with human review): {llm_lines:.0f}  ({llm_lines/total:.0%})")
print(f"Human-driven (LLM not the bottleneck): {human_lines:.0f}  ({human_lines/total:.0%})")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each row is a real category of weekly output; the lines column is an order-of-magnitude estimate from public DE work-mix studies and internal team retros.
  2. llm_share is the fraction of that category an LLM could plausibly draft correctly enough that the human reviews-and-edits rather than writes-from-scratch. Bug-fix SQL has the lowest share because it depends on system-specific context the model lacks.
  3. human_share is the residual — the lines where the model contributes nothing or the contribution is harmful (naming, metric definitions, root-cause analysis).
  4. Multiplying and summing gives a rough total: ~70% of weekly lines are amplifiable, ~30% remain firmly human-driven.
  5. The human-only fraction is concentrated in the categories that matter most — schema design, root-cause debugging, judgement calls — so headcount doesn't shrink; the work shifts.

Output.

Bucket Lines/week Share
LLM-amplifiable (drafted by LLM, reviewed by human) 2,200 70%
Human-driven (LLM not the bottleneck) 1,000 30%
Total weekly lines 3,200 100%

Rule of thumb. Treat the LLM as a force multiplier on boilerplate and a reviewer-assist on logic — not as a replacement for the human who owns the contract. Time saved should go into more design, more review, more tests — not into headcount cuts.

Interview question on LLM augmentation strategy

A senior interviewer often frames the opening probe as: "Walk me through how your team uses LLMs day-to-day, where you draw the line, and how you measure whether the rollout is working." This blends workflow design, safety reasoning, and metrics — the three muscles every chatgpt data engineer round opens with.

Solution Using a "trust ladder + measurable hallucination budget" framework

# llm_rollout_health.py — quarterly LLM rollout health check
from dataclasses import dataclass
from statistics import mean

@dataclass
class Metric:
    name: str
    value: float
    target: float
    higher_is_better: bool

metrics = [
    Metric("PRs with LLM assist (% of total)",      0.62, 0.50, True),
    Metric("Median PR cycle time (days)",           1.4,  2.0,  False),
    Metric("LLM-introduced rollbacks (% of PRs)",   0.018, 0.02, False),
    Metric("Hallucination rate (caught at review)", 0.07, 0.10, False),
    Metric("Hallucination escape (post-merge)",     0.003, 0.005, False),
    Metric("DE happiness score (1-5)",              4.2, 4.0, True),
]

def status(m: Metric) -> str:
    ok = (m.value >= m.target) if m.higher_is_better else (m.value <= m.target)
    return "PASS" if ok else "FAIL"

for m in metrics:
    print(f"{m.name:<48} {m.value:>7}  target={m.target:<5}  {status(m)}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Metric Why it's here
1 PRs with LLM assist adoption — is the team actually using the tool?
2 Median PR cycle time velocity — is the tool moving the needle?
3 LLM-introduced rollbacks quality at the boundary — did a bad LLM diff escape?
4 Hallucination rate quality before merge — caught by reviewers and CI
5 Hallucination escape quality after merge — caught by monitoring
6 DE happiness sustainability — is the workflow tolerable long-term?

Output:

Metric Quarter value Target Verdict
Adoption 62% ≥ 50% PASS
PR cycle time 1.4 d ≤ 2.0 d PASS
Rollbacks 1.8% ≤ 2.0% PASS
Hallucinations caught 7% ≤ 10% PASS
Hallucinations escaped 0.3% ≤ 0.5% PASS
Happiness 4.2 ≥ 4.0 PASS

Why this works — concept by concept:

  • Trust ladder before metrics — the ladder defines where the LLM operates (≤ Gate 2 only); metrics measure how well it operates inside that boundary. Skipping the ladder and going straight to metrics ships unsafe automation.
  • Hallucination budget — every quarter sets an explicit ceiling on LLM-introduced defects (typically ≤ 2% rollback, ≤ 0.5% escape). The budget is a contract with the team and a forcing function for tooling investment.
  • Adoption is necessary but not sufficient — high adoption with no velocity gain means the tool isn't earning its keep; low adoption with high velocity gain means a few power users are pulling the team forward but the workflow doesn't scale.
  • Happiness is the slow signal — when sustainability drops, the team is reviewing more than it's authoring and burning out. The fix is usually to invest in better RAG, better gating, or better in-IDE UX — not to push adoption harder.
  • Cost — instrumentation cost = O(once-per-quarter survey + CI tagging); the ROI is the rollback dollars you don't spend and the cycle-time hours you reclaim.

SQL
Topic — SQL generation
SQL generation practice problems (LLM-assisted SQL)

Practice →


2. SQL generation — prompt patterns and schema-as-context

The three-part prompt — schema + sample rows + task — is the only pattern you need

The mental model in one line: a good SQL-generation prompt is a self-contained brief — DDL for the tables involved, 5–10 sample rows per table, and a natural-language task with explicit edge cases — and a great one adds a one- or two-shot in-context example of the exact output shape you want. Once you ship that pattern as a template, sql generation ai stops feeling like magic and starts behaving like a typed function call.

Visual diagram of the SQL generation prompt pattern — three stacked input panels (schema DDL, sample rows, task in natural language) feeding into an LLM that outputs SQL, then a verify panel running EXPLAIN + sample-result check + dialect lint; on a light PipeCode card.

The three required inputs.

  • Schema (DDL). CREATE TABLE statements give the model column names, types, nullability, and primary keys. DDL beats JSON descriptions; JSON beats prose. Always include FK relationships if your dialect spells them.
  • Sample rows (5–10 per table). This is the silent quality multiplier. Sample rows disambiguate "is signup_at UTC or local?", "do emails have whitespace?", "is amount in cents or dollars?". The model uses them as type and shape inference, not as a query target.
  • Task (natural language). A one-paragraph brief with the question, the desired output columns, the time window, the filter rules, and any edge cases (ties, nulls, late arrivals). Be explicit — "include customers with zero orders" beats "all customers."

The two optional inputs that matter.

  • Dialect tag. "Use Snowflake SQL" or "Use PostgreSQL 15 syntax" cuts dialect mistakes (QUALIFY, LATERAL, LISTAGG, etc.) by ~80%.
  • One or two in-context examples. Question → SQL pairs the model can mimic. For complex shapes (pivoting, gaps-and-islands), a single example moves accuracy from "sometimes works" to "almost always works."

Schema serialisation — DDL beats JSON beats prose.

  • DDL — the model has seen millions of CREATE TABLEs in training; it parses them effortlessly and respects nullability.
  • JSON schema — works for tools that emit structured schemas (BigQuery INFORMATION_SCHEMA), but the model spends tokens parsing braces.
  • Prose — "the orders table has an id, a user_id, an amount in cents, and a created_at timestamp" — slowest and most error-prone; only use when DDL isn't available.

Verification — every generated SQL goes through four gates.

  • EXPLAIN plan check. Run EXPLAIN and eyeball the join order, scan types, and rowcount estimates. Catches Cartesian explosions and missing indexes.
  • Sample result check. Run the SQL against a 1% sample or a fixed sandbox dataset and compare the output to a hand-computed expected.
  • Dialect linter. sqlfluff or sqlglot parses the SQL in your target dialect — catches dialect drift the model snuck in.
  • PR review. A human looks at the intent — does the SQL actually answer the question that was asked?

One-shot vs few-shot — when to use which.

  • Zero-shot. Simple aggregations, easy filters. Works ~70% of the time on first try.
  • One-shot (1 example). Window functions, ranking, gaps-and-islands, pivoting. Lifts accuracy to ~90%.
  • Few-shot (3–5 examples). Complex business metrics with specific edge cases, dialect quirks, or company-house-style. Lifts accuracy to ~95% but eats prompt tokens.

Worked example — rank customers by 30-day rolling revenue

Detailed explanation. This is the canonical "ranked window with a date filter" interview pattern, and a great calibration target for any sql generation ai prompt. The interesting bits are the rolling window definition, the tie-break rule, and how the LLM handles late-arriving rows.

Question. Given an orders table and a customers table, write a SQL that returns the top 100 customers by total revenue in the trailing 30 days ending on 2026-06-04, with ties broken by the most recent order date. Include customers with zero orders in the window only if they signed up within the same window.

Input (the prompt — schema + sample rows + task).

You are writing PostgreSQL 15. Output only the SQL.

# Schema
CREATE TABLE customers (
  id BIGINT PRIMARY KEY,
  email TEXT NOT NULL,
  signup_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE orders (
  id BIGINT PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  amount_cents BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

# Sample rows
customers:
| id | email             | signup_at                |
|----|-------------------|--------------------------|
| 1  | a@x.com           | 2025-12-10T09:00:00Z     |
| 2  | b@y.com           | 2026-06-01T12:30:00Z     |
| 3  | c@z.com           | 2026-05-15T08:00:00Z     |

orders:
| id  | customer_id | amount_cents | created_at               |
|-----|-------------|--------------|--------------------------|
| 100 | 1           | 4500         | 2026-05-20T10:00:00Z     |
| 101 | 1           | 9000         | 2026-06-03T11:30:00Z     |
| 102 | 3           | 2000         | 2026-05-22T16:00:00Z     |

# Task
Return the top 100 customers by total revenue in the 30-day window ending 2026-06-04 UTC.
Tie-break by most recent order date (DESC).
Include zero-revenue customers only if they signed up inside the same window.
Columns: customer_id, email, total_revenue_cents, last_order_at.
Enter fullscreen mode Exit fullscreen mode

Code (the SQL the model emits — and we accept after verification).

WITH window_bounds AS (
  SELECT
    TIMESTAMPTZ '2026-05-05 00:00:00+00' AS window_start,
    TIMESTAMPTZ '2026-06-04 00:00:00+00' AS window_end
),
orders_in_window AS (
  SELECT
    o.customer_id,
    SUM(o.amount_cents) AS total_revenue_cents,
    MAX(o.created_at)   AS last_order_at
  FROM orders o
  CROSS JOIN window_bounds w
  WHERE o.created_at >= w.window_start
    AND o.created_at <  w.window_end
  GROUP BY o.customer_id
),
eligible AS (
  SELECT
    c.id AS customer_id,
    c.email,
    COALESCE(oiw.total_revenue_cents, 0) AS total_revenue_cents,
    oiw.last_order_at
  FROM customers c
  LEFT JOIN orders_in_window oiw
    ON oiw.customer_id = c.id
  CROSS JOIN window_bounds w
  WHERE oiw.customer_id IS NOT NULL
     OR (c.signup_at >= w.window_start AND c.signup_at < w.window_end)
)
SELECT
  customer_id,
  email,
  total_revenue_cents,
  last_order_at
FROM eligible
ORDER BY
  total_revenue_cents DESC,
  last_order_at      DESC NULLS LAST
LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The model picks WITH window_bounds to define the 30-day window once, then references it twice — a quality signal that the model respects "single source of truth" for the date math.
  2. orders_in_window aggregates revenue per customer inside the window. The created_at < window_end (half-open interval) is critical — <= would double-count an order placed exactly at midnight on 2026-06-04. Sample rows let the model see the timestamp shape and choose the right boundary.
  3. eligible left-joins customers to the in-window aggregate, then keeps customers that either have in-window revenue or signed up inside the window — implementing the "include zero-revenue customers only if they signed up inside the same window" rule.
  4. The final ORDER BY total_revenue_cents DESC, last_order_at DESC NULLS LAST enforces the tie-break rule and pushes zero-revenue customers (NULL last_order_at) to the bottom of any tied bucket.
  5. LIMIT 100 caps the result; in a production query you would also add a deterministic third tie-breaker (customer_id ASC) so paging and re-runs are stable.

Output.

customer_id email total_revenue_cents last_order_at
1 a@x.com 13500 2026-06-03T11:30:00Z
3 c@z.com 2000 2026-05-22T16:00:00Z
2 b@y.com 0 NULL

Rule of thumb. When asking an LLM for windowed SQL, always spell the window boundaries in the prompt (>= start AND < end) and always hand-trace at least one tied row. The model is excellent at writing the shape and only okay at picking the exact boundary semantics.

Interview question on prompt design for SQL generation

A senior interviewer often frames this as: "How would you ship a prompt template for sql generation ai that scales across a team of 30 engineers and 8 dialects, with hallucination caught before it hits prod?" The probe tests prompt engineering, schema serialisation, and verification design — the three muscles every chatgpt for data engineer jobs round rewards.

Solution Using a templated three-part prompt + four-gate verifier

# sql_gen_template.py — production prompt + verifier
from textwrap import dedent
from typing import Any
import sqlglot

PROMPT_TEMPLATE = dedent("""
You are writing {dialect} SQL. Output ONLY the SQL — no commentary.

# Schema (DDL)
{ddl}

# Sample rows ({sample_n} rows per table)
{samples}

# House style
- Use CTEs for clarity, one per logical step.
- Half-open windows: `>= start AND < end`.
- Always include a deterministic tie-breaker on the ORDER BY.
- Prefer COALESCE over CASE for nullable arithmetic.

# Task
{task}
""").strip()

def build_prompt(*, dialect: str, ddl: str, samples: str, task: str, sample_n: int = 5) -> str:
    return PROMPT_TEMPLATE.format(
        dialect=dialect, ddl=ddl, samples=samples, task=task, sample_n=sample_n,
    )

def verify_sql(sql: str, dialect: str, run_explain: Any, run_sample: Any) -> dict:
    # Gate 1 — parse with the target dialect
    try:
        parsed = sqlglot.parse_one(sql, read=dialect.lower())
        parse_ok = True
    except Exception as e:
        return {"gate": "parse", "ok": False, "error": str(e)}

    # Gate 2 — EXPLAIN plan sanity
    plan = run_explain(sql)                       # injected
    plan_ok = (
        "Seq Scan on orders" not in plan          # no full scan on large table
        and "Nested Loop" not in plan             # no Cartesian product hint
    )

    # Gate 3 — sample-result check on 1% sample
    sample_result = run_sample(sql)               # injected
    sample_ok = len(sample_result) > 0

    return {
        "parse_ok":  parse_ok,
        "plan_ok":   plan_ok,
        "sample_ok": sample_ok,
        "ok":        parse_ok and plan_ok and sample_ok,
    }
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 build_prompt(...) renders the three-part prompt with house-style rules baked in
2 LLM call returns a SQL string the team reviews-then-runs
3 sqlglot.parse_one(sql, read=dialect) dialect linter — catches QUALIFY in Postgres etc.
4 run_explain(sql) EXPLAIN plan inspection — catches missing-index Cartesian disasters
5 run_sample(sql) 1% sample run — catches empty result sets and obvious off-by-ones
6 PR review human signs the diff with prompt + model + output attached

Output:

Gate What it catches Typical fail rate
Parse dialect drift (QUALIFY in Postgres, LIMIT in T-SQL) 8%
EXPLAIN Cartesian product, full-scan on huge tables 4%
Sample empty results, off-by-one window 6%
Review wrong semantics — answers the wrong question 2%

Why this works — concept by concept:

  • Templated three-part prompt — schema + samples + task is the minimum viable context for the model. Adding house-style rules inline shrinks the diff between the LLM's idiom and your team's idiom.
  • Dialect tag at the top — frontier models respect a hard "you are writing Snowflake SQL" instruction; this single token saves dozens of dialect-drift fixes per week.
  • Half-open windows in the house style — every "but the LLM picked <= instead of <" bug evaporates once the rule is in the prompt.
  • Four-gate verifier — parse + plan + sample + review fans out across automatic and human gates. Parse and sample catch ~70% of mistakes for free in CI; plan inspection catches the expensive ones; review catches semantic drift.
  • Cost — prompt tokens = O(DDL + samples); verifier cost = O(parse + small sample query). Cheap relative to the senior-engineer-hour the SQL would otherwise consume.

SQL
Topic — SQL generation
SQL generation problems (schema-as-context + window math)

Practice →


3. dbt macro and model generation with RAG over your project

Naive ChatGPT hallucinates refs — RAG over your project is the fix

The mental model in one line: a frontier LLM with browser context has read the dbt docs but has not read your project — so it confidently invents ref('stg_orders') when your project actually calls it stg__shopify__orders, and it cheerfully writes a macro that conflicts with one you already have. Retrieval-Augmented Generation (RAG) over your dbt project closes that gap: the model retrieves the relevant slice of your code before it answers, and the answer reflects your conventions, not the public-docs average.

Visual diagram of dbt macro + model generation with RAG — left a dbt project (manifest.json + models folder + macros folder) feeding a vector store; centre an LLM combining retrieved chunks + the user prompt; right a generated macro running through dbt parse + dbt compile + dbt test safety rails before commit; on a light PipeCode card.

Why naive LLMs fail on dbt.

  • ref() and source() hallucination. The model invents model names that don't exist in your project.
  • Macro conflict. It writes safe_divide(a, b) even though your project already has utils.safe_divide(numerator, denominator) with a different signature.
  • Tag and meta drift. It forgets your tags: ['daily', 'finance'] convention and emits a model that fails your CI's tag-based scheduling.
  • YAML schema drift. It picks the generic dbt YAML shape, not your house style (column descriptions in meta, custom test configs, etc.).
  • Materialisation defaults. It defaults to view; your project may default to incremental for marts.

The RAG architecture for dbt — four stages.

  • ① dbt project. manifest.json, models/, macros/, schema.yml, dbt_project.yml, selectors.yml. These are the ground-truth inputs.
  • ② Vector store. Chunk each file (or each Jinja block) into 1–2K-token windows, embed with text-embedding-3-large or bge-large (better for code), store in pgvector, Chroma, or Pinecone. Re-index on every PR merge.
  • ③ LLM + retrieval. At prompt time: embed the user's task, retrieve the top-K (usually 5–10) most relevant chunks, stuff them into the prompt as # Context blocks, then ask the model to write the macro or model.
  • ④ Safety rails. Before the output is committed: dbt parse (syntax), dbt compile (refs resolve), dbt test (assertions pass). If any rail fails, loop back to stage ③ with the error embedded in the next prompt.

Chunking strategies — by file vs by Jinja block.

  • By file. Simplest. Every .sql model is one chunk. Works for small projects.
  • By Jinja block. For projects with very long macros or deeply nested models, split on {% macro ... %} / {% endmacro %} and WITH … AS ( boundaries. Better recall.
  • By manifest node. Use manifest.json to chunk by logical node (model, macro, test, source). The richest representation — embeds the dependency graph.

Embedding models — 2026 picks.

  • text-embedding-3-large (OpenAI). Strong general-purpose; cheap.
  • bge-large-en-v1.5 (BAAI, open). Best open code-aware embedding; self-host friendly.
  • Voyage AI voyage-code-2. Purpose-built for code; the best accuracy on dbt + SQL benchmarks at the time of writing.
  • Avoid text-embedding-ada-002. Older; meaningfully worse on code than the above.

The safety rails — three checks every LLM-generated macro must pass.

  • dbt parse. Catches Jinja syntax errors and undefined macro calls.
  • dbt compile. Resolves every ref() and source() against the manifest. If the model invented a name, this fails loudly.
  • dbt test. Runs the project's existing tests (and any new ones the macro added). Catches semantic regressions.

Retry-with-error context — the secret to RAG that ships. When a safety rail fails, the system shouldn't surface the error to the human. It should re-prompt the model with the original prompt + the error message + the offending output, and ask for a fix. Two or three iterations usually clear most failures.

Worked example — generate a window-function macro that respects your project's naming

Detailed explanation. A common interview probe: "Walk me through how you'd ask an LLM to write a dbt macro that fits your project's conventions, not the public docs default." The interesting layer is the retrieval — the model must see your existing utility macros before it writes a new one.

Question. Generate a dbt macro rolling_window_sum(column_name, partition_by, order_by, days) that computes a rolling sum over a date window, matches the project's snake_case naming, uses the project's safe_cast() helper for the numeric column, and emits dialect-aware SQL (Snowflake vs Postgres).

Input (retrieved RAG chunks — top 4 of 10).

-- chunk 1 — macros/utils/safe_cast.sql
{% macro safe_cast(column, type) %}
  {% if target.type == 'snowflake' %}
    TRY_CAST({{ column }} AS {{ type }})
  {% else %}
    CAST({{ column }} AS {{ type }})
  {% endif %}
{% endmacro %}

-- chunk 2 — macros/utils/snake_case_assertion.md
"All macros and arguments use snake_case. No camelCase. No abbreviations
shorter than 3 chars. Plural arguments take a list."

-- chunk 3 — manifest.json node — model 'fct_orders_daily'
"depends_on": ["model.proj.stg_shopify_orders"],
"materialized": "incremental",
"unique_key": ["order_date","customer_id"]

-- chunk 4 — dbt_project.yml fragment
vars:
  default_window_days: 30
  numeric_cast_type: 'NUMERIC(18,2)'
Enter fullscreen mode Exit fullscreen mode

Code (the macro the model emits after retrieval).

{# macros/window/rolling_window_sum.sql #}
{#-
  Computes a rolling sum of `column_name` over the last `days` days,
  partitioned by `partition_by` and ordered by `order_by`.
  Uses safe_cast() so non-numeric values become NULL instead of failing.
  Dialect-aware via target.type.
-#}
{% macro rolling_window_sum(column_name, partition_by, order_by, days=none) %}

  {%- set window_days = days if days is not none else var('default_window_days') -%}
  {%- set cast_type  = var('numeric_cast_type') -%}

  {% if target.type == 'snowflake' %}
    SUM({{ safe_cast(column_name, cast_type) }}) OVER (
      PARTITION BY {{ partition_by }}
      ORDER BY {{ order_by }}
      RANGE BETWEEN INTERVAL '{{ window_days }} DAY' PRECEDING AND CURRENT ROW
    )
  {% elif target.type in ('postgres', 'redshift') %}
    SUM({{ safe_cast(column_name, cast_type) }}) OVER (
      PARTITION BY {{ partition_by }}
      ORDER BY {{ order_by }}
      RANGE BETWEEN '{{ window_days }} days'::INTERVAL PRECEDING AND CURRENT ROW
    )
  {% else %}
    {{ exceptions.raise_compiler_error(
       "rolling_window_sum() does not support target.type=" ~ target.type
    ) }}
  {% endif %}

{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The retriever pulls safe_cast.sql because the prompt mentioned a numeric column — the LLM now knows the helper exists with that exact signature and uses it instead of inventing to_number().
  2. The snake_case_assertion.md chunk anchors the model to the team's naming rule — partition_by and order_by (not partitionBy, not partBy).
  3. The dbt_project.yml vars (default_window_days, numeric_cast_type) become defaults in the macro instead of hard-coded magic numbers — house-style consistent.
  4. target.type branching emits Snowflake INTERVAL '30 DAY' syntax for Snowflake and '30 days'::INTERVAL for Postgres/Redshift — dialect-aware out of the box because the retrieved safe_cast() macro is also dialect-aware (the model copies the pattern).
  5. The else branch raises a compile-time error for unsupported targets — defensive, dbt-idiomatic, and consistent with the rest of the codebase's macros (a behaviour the model picked up from manifest.json).

Output.

Aspect Before RAG After RAG
Helper macro reuse invents to_number() uses existing safe_cast()
Argument naming partitionBy, daysBack partition_by, days
Defaults hard-coded 30 var('default_window_days')
Dialect Snowflake-only dialect-aware
dbt compile first try fails (unknown macro) passes

Rule of thumb. For any dbt project with > 50 models or > 20 macros, RAG is not optional. Naive prompting can be made to work for a single one-off macro; it cannot be made to work across a team and across releases.

Interview question on shipping RAG-grounded dbt assistance

A senior probe sometimes goes: "How would you build a system that lets a junior engineer ask for a new dbt macro in plain English and ship something that respects all the project's conventions on first compile?" The probe tests RAG design, safety rail composition, and operational maturity.

Solution Using a vector store over manifest.json + project files + a three-rail safety pipeline

# dbt_rag_assistant.py — minimal RAG + safety-rail loop
import json, subprocess
from pathlib import Path
from typing import Iterable
import voyageai
import chromadb

# 1 — index every macro, model, and manifest node
def index_dbt_project(project_dir: Path) -> chromadb.Collection:
    client = chromadb.PersistentClient(path=".chroma")
    col = client.get_or_create_collection("dbt_project_v1")
    vo  = voyageai.Client()
    docs, ids, metas = [], [], []
    for p in project_dir.rglob("*.sql"):
        docs.append(p.read_text()); ids.append(str(p)); metas.append({"kind": "sql"})
    for p in project_dir.rglob("*.yml"):
        docs.append(p.read_text()); ids.append(str(p)); metas.append({"kind": "yml"})
    manifest = json.loads((project_dir / "target" / "manifest.json").read_text())
    for node_id, node in manifest["nodes"].items():
        docs.append(json.dumps(node, indent=2))
        ids.append(node_id); metas.append({"kind": "manifest_node"})
    embeds = vo.embed(docs, model="voyage-code-2", input_type="document").embeddings
    col.upsert(ids=ids, documents=docs, embeddings=embeds, metadatas=metas)
    return col

# 2 — answer a task with retrieved context
def answer(task: str, col: chromadb.Collection, llm) -> str:
    vo = voyageai.Client()
    q  = vo.embed([task], model="voyage-code-2", input_type="query").embeddings[0]
    hits = col.query(query_embeddings=[q], n_results=10)
    ctx  = "\n\n---\n\n".join(hits["documents"][0])
    prompt = f"# Context (your dbt project)\n{ctx}\n\n# Task\n{task}\n\n# Output (Jinja macro only)"
    return llm.complete(prompt)

# 3 — run the three safety rails
def safety_rails(generated: str, project_dir: Path) -> dict:
    out_path = project_dir / "macros" / "_generated.sql"
    out_path.write_text(generated)
    parse   = subprocess.run(["dbt", "parse"],   cwd=project_dir, capture_output=True)
    compile = subprocess.run(["dbt", "compile"], cwd=project_dir, capture_output=True)
    test    = subprocess.run(["dbt", "test"],    cwd=project_dir, capture_output=True)
    return {
        "parse_ok":   parse.returncode   == 0,
        "compile_ok": compile.returncode == 0,
        "test_ok":    test.returncode    == 0,
        "stderr":     (parse.stderr + compile.stderr + test.stderr).decode(),
    }

# 4 — loop with retry-on-error
def run(task: str, project_dir: Path, llm, col):
    for attempt in range(3):
        macro = answer(task, col, llm)
        rails = safety_rails(macro, project_dir)
        if rails["parse_ok"] and rails["compile_ok"] and rails["test_ok"]:
            return macro
        task = task + "\n\nPrevious attempt failed:\n" + rails["stderr"]
    raise RuntimeError("RAG-grounded macro failed safety rails after 3 attempts")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Component What it does
1 index_dbt_project embeds every .sql, .yml, and manifest node into the vector store with voyage-code-2
2 answer(task) embeds the user task, retrieves top-10 similar chunks, builds a context-rich prompt, calls the LLM
3 safety_rails runs dbt parse, dbt compile, dbt test against the generated macro
4 Retry loop on failure, feeds the stderr back into the next prompt and tries again, up to 3 times
5 Final return the macro that passed all three rails — eligible for human PR review

Output:

Attempt parse_ok compile_ok test_ok Outcome
1 true false refused — invented to_decimal(); retry with error
2 true true true accepted — uses safe_cast()

Why this works — concept by concept:

  • Retrieval grounding — the model literally sees your project's macros, models, and node graph before answering. It cannot invent a ref() that doesn't exist when the manifest is in the prompt.
  • Code-aware embeddingsvoyage-code-2 (or bge-large open) understands SQL/Jinja semantics far better than text-embedding-ada-002. The retrieval recall difference is 30–50%.
  • Three-rail safety pipelinedbt parse is fast and catches Jinja typos; dbt compile is medium and catches missing refs; dbt test is slow and catches semantic regressions. The cost ladder matches the bug severity ladder.
  • Retry-with-error — by feeding the safety-rail stderr back into the prompt, the system self-corrects without paging a human. Two or three iterations clear ~85% of remaining failures.
  • Human PR review — the loop ends at eligible for review, never at merged. A senior engineer still owns the merge button. We unpack the full guardrail ladder in §5.
  • Cost — index build = O(project_size); per-query = O(retrieval + LLM tokens). Typical 50-model project: $0.02 per generated macro.

SQL
Topic — ETL
ETL practice problems (dbt-style transformation)

Practice →


4. Auto-generated lineage and docs from SQL + LLM enrichment

The doc-debt killer — SQL parser for structure, LLM for semantics

The mental model in one line: column-level lineage and useful documentation are two halves of one workflow — a deterministic SQL parser extracts the graph (which column flowed into which), and an LLM enriches the graph with business-glossary-aligned prose (what each column means and why it matters). Once that combo runs in CI, a 100-model project can reach full doc coverage in two hours instead of three weeks of manual writing.

Visual diagram of LLM-generated lineage + docs — a SQL parser extracts source/target column mappings, an LLM enriches each column with business-glossary alignment, output renders as a dbt-style docs page with linked lineage graph; tools row (dbt-cosm, Datafold AI, Atlan AI) at the bottom; on a light PipeCode card.

The four-stage pipeline.

  • ① SQL files. All your models/ .sql files (and any orphan ad-hoc SQL you want covered).
  • ② SQL parser. sqlglot is the open-source workhorse — it builds an AST and emits source→target column mappings for SELECTs, CTEs, joins, and unions.
  • ③ LLM enrichment. For each column the parser identified, the LLM writes a 1–2 sentence description, tags PII, and aligns to a business-glossary entry.
  • ④ Docs + lineage output. A dbt-style docs page with the lineage graph (column-level), the descriptions, the glossary links, and a list of suggested tests.

What gets auto-generated — five outputs.

  • Per-model description. A 2-sentence "this model joins stg_orders and stg_users to produce one row per order with the buyer profile attached, filtered to last 90 days."
  • Per-column description. "A monotonically increasing primary key from the upstream Stripe API."
  • Business-glossary link. Maps gross_merchandise_value_cents → glossary entry "GMV (cents)".
  • PII / sensitivity tag. Flags email, phone_number, government_id automatically.
  • Suggested tests. unique on primary keys, not_null on FKs, accepted_values on enums, relationships on FK joins.

Column-level lineage — why it beats model-level.

  • Model-level lineage says "mart_orders depends on stg_orders." Useful but coarse.
  • Column-level lineage says "mart_orders.gross_value_cents = stg_orders.amount_cents × 1.0 - stg_refunds.refund_amount_cents." That's the lineage you need for impact analysis, compliance, and root-cause debugging.
  • A SQL parser like sqlglot extracts column-level lineage deterministically — no LLM needed for the structural part. The LLM only writes the narrative.

The tools — 2026 landscape.

  • dbt-osmosis and dbt-cosm. Open-source. Walks your dbt project, fills missing column descriptions by pulling from upstream (and via LLM for the originating column). Free, opinionated, fits in CI.
  • Datafold AI. Column-level lineage + LLM-generated diffs at PR time. Strong "this PR changes 14 downstream columns" notifications.
  • Atlan AI / Alation AI. Catalog-native; the docs live in the catalog and the LLM writes them. Good when the catalog is already the source of truth.
  • Lightdash AI. BI-side metric definitions written from dashboard usage + dbt model context.
  • SDF. Newer entrant, SQL compiler with built-in column-level lineage and dialect cross-translation.

Anti-patterns to avoid.

  • LLM writes lineage from prose. The model will hallucinate edges. Always use a SQL parser for structure.
  • One-shot bulk run with no review. A 200-model run with no human pass produces 200 docs that sound right and have 5–10% subtle errors that propagate forever. Review in waves of 20 at a time.
  • Skip glossary alignment. Without a glossary, every model gets a slightly-different definition of "active user." Glossary first, descriptions second.

Worked example — auto-document a fct_orders mart in 90 seconds

Detailed explanation. The classic dbt mart everyone has: one row per order, joined to user profile, with first-touch attribution and refund-adjusted gross value. Manually documenting all eight columns takes 20 minutes. Let's auto-generate.

Question. Given the SQL for fct_orders below, auto-generate (a) a model-level description, (b) per-column descriptions, (c) PII tags, (d) glossary links, and (e) suggested dbt tests. Show how sqlglot extracts the column-level lineage and where the LLM enrichment plugs in.

Input (the dbt model SQL).

-- models/marts/fct_orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}

WITH orders AS (
  SELECT
    order_id,
    user_id,
    amount_cents,
    created_at,
    first_touch_channel
  FROM {{ ref('stg_shopify_orders') }}
  WHERE _loaded_at >= (SELECT MAX(_loaded_at) - INTERVAL '7 days' FROM {{ this }})
),
refunds AS (
  SELECT order_id, SUM(refund_cents) AS refund_total_cents
  FROM {{ ref('stg_shopify_refunds') }}
  GROUP BY order_id
),
users AS (
  SELECT user_id, email, country_code, signup_at
  FROM {{ ref('stg_users') }}
)
SELECT
  o.order_id,
  o.user_id,
  u.email,
  u.country_code,
  o.first_touch_channel,
  o.amount_cents                                       AS gross_value_cents,
  COALESCE(r.refund_total_cents, 0)                    AS refund_value_cents,
  o.amount_cents - COALESCE(r.refund_total_cents, 0)   AS net_value_cents,
  o.created_at                                         AS order_placed_at
FROM orders o
LEFT JOIN refunds r ON r.order_id = o.order_id
LEFT JOIN users   u ON u.user_id  = o.user_id
Enter fullscreen mode Exit fullscreen mode

Code (the parser + LLM enrichment loop).

# auto_doc_fct_orders.py
import sqlglot, json
from sqlglot.lineage import lineage

SQL = open("models/marts/fct_orders.sql").read()

# 1 — render Jinja first (use dbt compile in real life)
compiled_sql = render_jinja(SQL)

# 2 — extract column-level lineage with sqlglot
target_cols = [
    "order_id", "user_id", "email", "country_code", "first_touch_channel",
    "gross_value_cents", "refund_value_cents", "net_value_cents", "order_placed_at",
]
edges = {}
for col in target_cols:
    nodes = lineage(col, compiled_sql, dialect="snowflake")
    edges[col] = [n.source.alias_or_name for n in nodes.walk()]

# 3 — call the LLM for narrative enrichment
prompt = (
    f"# Lineage (deterministic)\n{json.dumps(edges, indent=2)}\n\n"
    f"# Compiled SQL\n{compiled_sql}\n\n"
    "# Task — return JSON\n"
    "{\n"
    '  "model_description": "1-2 sentences",\n'
    '  "columns": [\n'
    '    {"name": "...", "description": "...", "pii": false, "glossary_key": "..."}\n'
    "  ],\n"
    '  "suggested_tests": [{"column": "...", "test": "unique|not_null|..."}]\n'
    "}"
)
enrichment = json.loads(llm.complete(prompt))
print(json.dumps(enrichment, indent=2))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. sqlglot.lineage() walks the AST starting from each target column, recursing through CTEs and joins, and returns the source columns that feed it. This is deterministic — same SQL, same lineage, every time.
  2. The result for net_value_cents is ["amount_cents", "refund_total_cents"] with the formula o.amount_cents - COALESCE(r.refund_total_cents, 0) baked into the lineage node's expression.
  3. The LLM never gets to invent lineage — the structure is locked. It only writes prose for description, picks a glossary_key, and flags pii.
  4. For each target column, the LLM is prompted to write a description in 1–2 sentences, set pii: true for email, phone, etc., and pick a glossary_key from a list of canonical entries.
  5. The model also suggests dbt tests — unique on order_id, not_null on FKs, accepted_values on first_touch_channel if it sees a small enum cardinality in the compiled SQL.

Output (LLM-generated fct_orders.yml).

Field Value
model_description "One row per Shopify order, refund-adjusted, with buyer profile attached. Incremental on _loaded_at."
columns[0].name order_id
columns[0].description "Primary key from Shopify; monotonically increasing per store."
columns[2].pii true (email)
columns[7].description "Net value after refunds, in cents. = gross_value_cents − refund_value_cents."
suggested_tests unique(order_id), not_null(user_id), relationships(user_id → stg_users.user_id)

Rule of thumb. Always run the parser before the LLM. The parser owns what flows where; the LLM owns what it means. Mixing these responsibilities is the root cause of every "the auto-doc tool wrote a beautifully-worded lie" failure mode.

Interview question on shipping column-level lineage and docs at scale

A senior probe: "Your 200-model dbt project has 12% column-description coverage and a compliance team wants 95% by next quarter. How would you ship that without burning a quarter of engineering time?"

Solution Using sqlglot lineage + LLM enrichment + waves-of-twenty human review

# rollout_auto_docs.py — wave-based rollout
from pathlib import Path
import yaml

WAVES = [
    ("staging",  20, "junior_de@team"),
    ("staging",  40, "junior_de@team"),
    ("marts",    20, "senior_de@team"),
    ("marts",    40, "senior_de@team"),
    ("metrics",  20, "data_pm@team"),
]

def run_wave(layer: str, n_models: int, reviewer: str, project: Path):
    candidates = sorted((project / "models" / layer).rglob("*.sql"))[:n_models]
    drafts = []
    for sql_path in candidates:
        yml_path = sql_path.with_suffix(".yml")
        enrichment = auto_doc(sql_path)            # parser + LLM
        yml_path.write_text(yaml.safe_dump(enrichment))
        drafts.append({"file": str(yml_path), "reviewer": reviewer})
    return drafts

all_drafts = []
for layer, n, reviewer in WAVES:
    all_drafts.extend(run_wave(layer, n, reviewer, Path("./dbt_project")))

# Submit one PR per wave for focused review
print(f"Generated {len(all_drafts)} draft YML files across {len(WAVES)} review waves")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 Run sqlglot lineage on each model deterministic column-level structure
2 LLM enrichment per wave (20 models) small enough that one reviewer can read every diff
3 Write .yml next to each .sql fits dbt's schema discipline; PR diff is reviewable
4 Per-wave PR with a single reviewer catches systematic errors before the next wave
5 After each wave, update the glossary if new canonical terms emerged feeds back into the next wave's prompt

Output:

Wave Models Reviewer Outcome
1 (staging × 20) 20 junior_de accepted with 4 description edits
2 (staging × 40) 40 junior_de accepted with 6 edits + 1 glossary addition
3 (marts × 20) 20 senior_de accepted with 2 edits + 1 PII tag correction
4 (marts × 40) 40 senior_de accepted, no edits
5 (metrics × 20) 20 data_pm accepted with 3 glossary alignments
Total 140 95% coverage in 2 weeks, 8 hours of review time

Why this works — concept by concept:

  • Wave-based rollout — 20 at a time keeps the PR small enough to review carefully. 200 at a time would get rubber-stamped.
  • Layer-aware reviewers — juniors review staging (mechanical), seniors review marts (judgement), PM reviews metrics (glossary). Right person at the right layer.
  • Glossary as feedback loop — each wave updates the glossary with new canonical terms, and the next wave's LLM prompt benefits.
  • Parser owns structure, LLM owns narrative — column-level lineage from sqlglot is deterministic and audit-able; descriptions from the LLM are advisory and edited.
  • 95% coverage, not 100% — the last 5% is usually deprecated models, sandboxes, or models so weird that auto-doc would do more harm than good. Accept the asymptote.
  • Cost — engineering hours = 8 (reviews) vs ~120 (manual writing). LLM cost ≈ $5 total. The bottleneck stays human attention, where it should.

SQL
Topic — data transformation
Data transformation problems (lineage-aware modelling)

Practice →


5. Human-in-the-loop — the production safety pattern

The trust ladder — five gates, one principle, never auto-merge to prod

The mental model in one line: an LLM accelerates; it does not own; the human is the final gate, forever. Every senior DE team that has rolled out LLM-assisted workflows arrived at some version of a five-gate trust ladder, and every shop that skipped a gate eventually paid for it — usually with a silent metric corruption that took weeks to detect because the buggy SQL "looked right."

Visual human-in-the-loop guardrail diagram — a vertical 'trust ladder' showing LLM output ascending through 4 gates (suggestion → PR review → CI tests pass → monitored merge); each gate is a green or amber checkmark card; a side annotation about hallucination budget; on a light PipeCode card.

The five gates (bottom = untrusted, top = production-monitored).

  • Gate 0 — LLM raw output. Untrusted by default. The model returned a string. Treat it as a draft, not a commit.
  • Gate 1 — IDE suggestion. Developer accepts/rejects per line in Cursor/Copilot/Windsurf. The human is reading every diff in real time.
  • Gate 2 — PR review. A peer reads the diff with full context (the original prompt, the model, the test outputs) and approves/requests changes.
  • Gate 3 — CI: dbt test + EXPLAIN + dry-run. Automated gates that break the build on any of: failed assertions, suspicious query plan, row-count drift > threshold.
  • Gate 4 — Merged + monitored. The change is live, but data-quality monitors (freshness alerts, row-count diffs, PII scans, downstream chart-failure detection) watch it for ≥ 7 days.

The non-negotiable rules.

  • Never auto-merge LLM-generated SQL or dbt models into prod. Even with passing CI, the cost of a silent business-logic error dwarfs the cost of a 30-minute human review.
  • Every LLM-touched commit is tagged. llm: claude-opus-4-7, prompt-hash: abc123, reviewer: alice@team. This is the audit log; you'll thank yourself the first time you need it.
  • The reviewer is not the prompt author. The author has confirmation bias; a peer catches what the author can't see.
  • PR description includes the prompt. So the reviewer can sanity-check that the prompt actually asked for the right thing.

Hallucination budget — measure it, reduce it.

  • What it is. A quarterly target like "≤ 2% of LLM-touched PRs introduce a defect requiring rollback or hotfix."
  • How to measure it. Tag every PR with assisted-by: llm. Count rollbacks/hotfixes in that population. Compute the rate.
  • How to reduce it. Better RAG (so the model has more context), tighter prompts (with house style baked in), more gates (column-level lineage diff at PR time), and training — the team gets better at prompting and reviewing over time.

Audit log — what to capture per LLM-touched change.

  • prompt (or its hash + a stored copy)
  • model (e.g. claude-opus-4-7, gpt-4o-2026-05)
  • tools_used (RAG retrieval IDs, function-call traces)
  • reviewer (the human who approved)
  • gates_passed (parse / compile / test / EXPLAIN / DQ monitors)
  • outcome (merged / rolled-back / reverted / fixed-forward)

Operational alerts every LLM-using team should ship.

  • Row-count diff. Post-deploy row count differs from a 30-day rolling baseline by > 10%? Page.
  • Freshness drift. Model's _loaded_at lagged by > 1 hour? Page.
  • PR comment regex. "wrong" / "fix" / "hallucinat" in PR comments tagged assisted-by: llm? Roll up weekly.
  • Downstream chart failure. Any BI chart that depends on the changed column starts erroring? Notify the change author.
  • Test escape. A new test broke ≥ 48 hours after the model that changed? Suspect the LLM change.

The "do not auto" list — things the LLM never does by itself.

  • Approving its own PR. Period.
  • Merging to main.
  • Deploying to production.
  • Backfilling historical data.
  • Dropping or truncating tables.
  • Running DELETE / UPDATE against prod without an explicit human confirmation.

Worked example — design the CI pipeline for an LLM-assisted dbt PR

Detailed explanation. A common interview probe: "Walk me through the CI checks you'd add specifically because the change might be LLM-authored." The interesting layer is the additional checks — the basics (parse, compile, test) are table stakes; we want the gates that specifically catch LLM failure modes.

Question. Sketch a CI pipeline for a dbt PR tagged assisted-by: llm that runs three "table-stakes" gates and three "LLM-specific" gates. The LLM-specific ones must catch silent errors: a join condition that quietly doubles row counts, a deprecated column reference, and a PII leak via a renamed column.

Input (.github/workflows/dbt_pr.yml).

name: dbt PR (LLM-aware)
on:
  pull_request:
    paths: ['models/**', 'macros/**', 'seeds/**', 'snapshots/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Gate 1 — table stakes
      - run: dbt deps
      - run: dbt parse
      - run: dbt compile --target ci
      - run: dbt test --target ci

      # Gate 2 — LLM-specific (row-count diff)
      - name: row-count diff vs main
        run: python tools/row_count_diff.py --threshold 10  # %

      # Gate 3 — LLM-specific (column-level lineage diff)
      - name: column lineage diff
        run: python tools/lineage_diff.py --warn-on=downstream_columns_changed

      # Gate 4 — LLM-specific (PII scan)
      - name: PII scan
        run: python tools/pii_scan.py --fail-on=leaked_pii_columns

      # Audit trail
      - name: tag PR with LLM metadata
        if: contains(github.event.pull_request.labels.*.name, 'assisted-by:llm')
        run: |
          gh pr edit ${{ github.event.pull_request.number }} \
             --add-label "needs-senior-review" \
             --add-label "hallucination-budget"
Enter fullscreen mode Exit fullscreen mode

Code (the row-count diff gate).

# tools/row_count_diff.py
import argparse, json, subprocess, sys

p = argparse.ArgumentParser()
p.add_argument("--threshold", type=float, default=10.0,
               help="row-count % delta vs main that triggers a warn/fail")
args = p.parse_args()

def run(cmd):
    return subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True).stdout

# 1 — list changed models in this PR
changed = json.loads(run("dbt ls -s state:modified --output json --state ./prod-manifest/"))
violations = []
for model in changed:
    fqn = model["unique_id"]
    main_count = int(run(f"dbt show -s {fqn} --target prod --limit 0 --output rows-only").strip())
    pr_count   = int(run(f"dbt show -s {fqn} --target ci   --limit 0 --output rows-only").strip())
    if main_count == 0:
        continue
    delta_pct = abs(pr_count - main_count) / main_count * 100
    if delta_pct > args.threshold:
        violations.append((fqn, main_count, pr_count, delta_pct))

if violations:
    print("ROW-COUNT DIFF VIOLATIONS:")
    for v in violations:
        print(f"  {v[0]}: main={v[1]}, pr={v[2]}, delta={v[3]:.1f}%")
    sys.exit(1)
print("row-count diff OK")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Gates 1–4 in order. Parse catches Jinja syntax. Compile catches missing refs (the classic LLM hallucination). Test catches assertion regressions. These are table stakes and run in seconds.
  2. Row-count diff. Lists state:modified models (only what changed in the PR), runs both prod and ci against the same time window, compares the row counts. A model that quietly went from 1M rows to 2M rows is almost always a join-cardinality bug — exactly the LLM failure mode that compiles and passes unit tests.
  3. Column lineage diff. Compares sqlglot-extracted lineage between prod and the PR. If a column that was sourced from stg_users.email is now sourced from a different column (say stg_users.username), the gate warns the reviewer.
  4. PII scan. Walks the new YML files. If a column previously tagged pii: true was renamed and the tag wasn't carried, fail the build.
  5. Audit labels. When the PR carries assisted-by:llm, the workflow adds needs-senior-review and hallucination-budget labels. The org's quarterly metric pulls from those labels.

Output.

Gate Catches Typical fail rate on LLM PRs
dbt parse Jinja typos 2%
dbt compile invented refs 4%
dbt test broken assertions 5%
Row-count diff cardinality blowups 3%
Lineage diff silent column re-source 1%
PII scan dropped PII tags 0.5%

Rule of thumb. Three gates beyond the table-stakes trio (row-count, lineage, PII) catch ~80% of LLM-specific defects that would otherwise reach prod. Add them all; the cost is minutes per PR, the savings is incidents-per-quarter.

Interview question on the human-in-the-loop guardrail design

A senior probe: "An engineer on your team wants to auto-merge LLM-generated SQL for staging models because 'staging isn't prod.' How do you respond?"

Solution Using a tiered policy — gate every model, but vary the human bar by layer

# llm_review_policy.py
LAYER_POLICY = {
    "staging": {
        "gates": ["parse", "compile", "test", "row_count_diff"],
        "human_reviewer": "any peer (LGTM ok)",
        "merge_window": "any",
        "rollback_sla_hours": 24,
    },
    "intermediate": {
        "gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff"],
        "human_reviewer": "any peer",
        "merge_window": "any",
        "rollback_sla_hours": 12,
    },
    "marts": {
        "gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff", "pii_scan"],
        "human_reviewer": "senior DE",
        "merge_window": "business hours only",
        "rollback_sla_hours": 4,
    },
    "metrics": {
        "gates": ["parse", "compile", "test", "row_count_diff", "lineage_diff",
                  "pii_scan", "metric_definition_review"],
        "human_reviewer": "senior DE + data PM",
        "merge_window": "business hours only",
        "rollback_sla_hours": 1,
    },
}

def required_gates(model_layer: str) -> dict:
    return LAYER_POLICY.get(model_layer, LAYER_POLICY["marts"])  # default to strictest
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Layer Min gates Reviewer bar Rollback SLA
staging parse + compile + test + rowcount any peer 24h
intermediate + lineage diff any peer 12h
marts + PII scan senior DE 4h
metrics + metric def review senior DE + data PM 1h

Output:

Layer Auto-merge allowed? Why
staging no even staging feeds downstream marts; a silent bug compounds
intermediate no same as staging
marts no direct BI/ML consumer; metric corruption blast radius
metrics no board-facing numbers; reputational risk

Why this works — concept by concept:

  • Gate every layer, vary the bar — there is no layer where "auto-merge" makes sense, because every layer feeds something downstream. What varies is the seniority of the reviewer and the number of automated gates, not whether a human is required.
  • Rollback SLA matches blast radius — staging has 24 hours because downstream marts re-run nightly; metrics has 1 hour because a Slack channel of execs will notice.
  • PII scan is mart+ — staging PII is usually upstream-tagged; the moment you transform it, the renamed columns can drop the tag. That's a mart-layer risk.
  • Metric definition review needs a non-engineer — a data PM catches "this LLM rewrote active_user to use last_seen_at instead of last_event_at" because the metric meaning matters more than the SQL elegance.
  • No silent paths — every model goes through CI; the policy is enforced in YAML; the audit log is automatic. No "but staging is fine to auto-merge" loophole.
  • Cost — review time = O(diff lines × layer-multiplier); incidents saved = the entire reason the role exists. The trade is obvious.

SQL
Topic — data processing
Data processing problems (pipeline review and CI checks)

Practice →


Cheat sheet — DE LLM workflows in one screen

  • SQL generation. Schema (DDL) + 5–10 sample rows + task in natural language → LLM → verify with EXPLAIN + sample run + dialect lint + PR review. Half-open windows (>= start AND < end). One-shot example for any non-trivial shape.
  • dbt macro / model generation. RAG over manifest.json + models/ + macros/ with a code-aware embedding (voyage-code-2 or bge-large) → LLM with retrieved context → safety rails (dbt parse, dbt compile, dbt test) → human review. Retry with error context on rail failure.
  • Doc + lineage generation. sqlglot for deterministic column-level lineage → LLM enrichment for descriptions, glossary links, PII tags, and suggested tests → wave-based human review in batches of 20. Never let the LLM invent lineage edges.
  • Refactoring. LLM rewrites + automated diff + targeted tests + PR review. Run the existing tests first; if they pass, the refactor is at least functionally compatible.
  • Test scaffolding. LLM suggests unique, not_null, accepted_values, relationships. Human picks which ones to keep. Defaults: PKs unique, FKs not_null + relationships, enums accepted_values, dates not_null.
  • Human-in-the-loop policy. Five gates: raw output → IDE suggestion → PR review → CI (parse+compile+test+row_count_diff+lineage_diff+pii_scan) → merged + monitored. Never auto-merge to any layer. Every LLM-touched commit tagged with prompt + model + reviewer.
  • Hallucination budget. Set quarterly (e.g. ≤ 2% LLM-introduced rollbacks). Measure via PR labels + post-deploy DQ alerts. Reduce via better RAG, tighter prompts, more gates, team training.
  • Tooling stack 2026. Editor: Cursor / Windsurf / VS Code + Copilot. RAG: pgvector/Chroma + voyage-code-2. dbt-aware: dbt-osmosis, dbt-cosm. Lineage: sqlglot, Datafold AI, Atlan AI. CI: GitHub Actions running dbt parse/compile/test + custom row-count and lineage diff scripts.

Frequently asked questions

Will LLMs replace data engineers?

No — but the shape of the role keeps shifting up the stack. LLMs collapse the boilerplate that used to consume 50–70% of weekly keystrokes (staging models, YAML configs, glue Python, dialect conversions) and free senior DEs to spend that time on schema design, pipeline architecture, metric definitions, and code review. The teams that are growing fastest in 2026 are the ones that treat chat gpt for data engineer jobs as a force multiplier on senior judgement, not as a headcount-reduction lever. The bottleneck moves from writing code to reviewing code and owning the contract — and that's still a human job, with no plausible automation path on the horizon.

What is the best LLM for SQL generation in 2026?

Claude Opus 4.x and GPT-4o-2026 sit at the frontier for SQL accuracy on standard benchmarks, with Anthropic's family slightly ahead on complex window functions and OpenAI's slightly ahead on dialect breadth. Specialised SQL models like SQLCoder-34B and Defog AI's tuned variants are competitive on Postgres/Snowflake and meaningfully cheaper at scale. The decision rarely hinges on raw model quality — it hinges on whether you can ship the three-part prompt (schema + sample rows + task), a dialect tag, and a four-gate verifier. Any frontier model with that scaffolding outperforms any model without it.

How do I prevent LLM hallucinations from reaching production?

Three layers, every time: context grounding (RAG over your project's manifest.json, models, and macros so the LLM sees your real refs and column names), automated CI gates (dbt parse + compile + test + row-count diff + column-level lineage diff + PII scan), and human PR review with the prompt attached so the reviewer can sanity-check intent. Set an explicit hallucination budget (e.g. ≤ 2% LLM-touched PRs requiring rollback per quarter) and instrument every LLM-touched commit with prompt-hash, model, reviewer, and gates_passed labels. The gates catch the structural failures cheaply; the reviewer catches the semantic ones; the budget keeps the team honest about trade-offs.

Can ChatGPT write production-ready dbt models on its own?

It can write the first draft, and with RAG over your project it can produce something that compiles and respects your conventions on the first or second attempt — but "production-ready" requires the same gates any human-written model would face: dbt parse, dbt compile, dbt test, peer review, and post-deploy monitoring. The realistic 2026 pattern is: junior DE drafts the model with dbt llm assist, senior DE reviews and edits, CI runs the full safety-rail suite, and the model lands tagged assisted-by:llm with the prompt stored in the audit log. The model never auto-merges, even when CI is green.

What tools auto-generate data lineage from SQL?

sqlglot is the open-source workhorse for deterministic column-level lineage — it parses SQL, builds an AST, and emits source→target column mappings without an LLM in the loop. Commercial tools layered on top include Datafold AI (column-level lineage + LLM-generated PR diffs), Atlan AI and Alation AI (catalog-native with LLM enrichment), and SDF (a SQL compiler with built-in lineage and cross-dialect translation). For pure dbt projects, dbt-osmosis and dbt-cosm use a combination of upstream propagation + LLM enrichment to fill missing column descriptions and lineage notes. The split that matters in 2026: parser owns structure, LLM owns narrative.

Is LLM-generated code safe to commit, and how should I review it?

LLM-generated code is safe to commit only after it has cleared automated CI (parse, compile, test, row-count diff, lineage diff, PII scan if applicable) and at least one human reviewer has read the full diff with the original prompt attached. The reviewer should be a peer, not the prompt author (confirmation bias is real). Every LLM-touched commit should be tagged in the audit log with the prompt, model, retrieval IDs (if RAG was used), and the reviewer's identity — so when something does go wrong, the team can re-run the prompt, reproduce the output, and learn. Treat the LLM as a fluent intern: take its drafts gratefully, take its judgement skeptically.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every LLM workflow above ships with hands-on practice rooms where you draft SQL with a real schema, generate dbt macros against a real project, and review pipelines against a real-time scoring engine. Start with the SQL-generation library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice SQL generation now →
ETL practice drills →

Top comments (0)