35 ChatGPT Prompts for Data Engineers: Build Faster, Debug Smarter, Document Better

#ai #chatgpt #productivity #webdev

Data engineers spend their days designing pipelines, wrangling schemas, optimizing queries, and keeping data quality from quietly degrading across distributed systems. ChatGPT can act as a tireless pair programmer — helping you write boilerplate, think through architecture trade-offs, and generate documentation you'd otherwise put off indefinitely. These 35 prompts cover the core domains of modern data engineering work.

1. Pipeline Design and Architecture

I need to design a real-time data pipeline that ingests clickstream events from a web application (approximately 50,000 events/minute), enriches them with user profile data from a PostgreSQL database, and loads them into a data warehouse for analytics. Recommend an architecture using Apache Kafka and either Apache Flink or Spark Structured Streaming, and explain the trade-offs between the two stream processing options.

Explain the medallion architecture (Bronze, Silver, Gold layers) for a data lakehouse. For a retail company processing daily sales transactions, describe what data belongs in each layer, what transformations happen at each stage, and how to implement it using Delta Lake on Databricks.

We're choosing between a Lambda architecture and a Kappa architecture for our data platform. We have both real-time dashboards (sub-minute latency) and complex historical batch analytics. Compare the two architectures in terms of complexity, maintenance burden, and which fits our requirements better.

Design a CDC (Change Data Capture) pipeline from a MySQL production database to our Snowflake data warehouse. Compare Debezium with Airbyte for this use case, covering setup complexity, supported MySQL versions, schema evolution handling, and cost at scale.

Our batch ETL pipeline takes 6 hours to run and is causing SLA violations. The pipeline processes 500GB of daily transaction data with 15 sequential transformation steps. Suggest a refactoring strategy using parallelization, incremental processing, and partition pruning to reduce runtime to under 2 hours.

2. SQL and Query Optimization

I have a Snowflake query joining a 2-billion-row fact table with three dimension tables that takes 45 minutes to run. The query uses multiple GROUP BY operations and window functions. Walk me through a systematic query optimization process including: checking query profile, clustering keys, materialized views, and result caching strategies.

Write a SQL query using window functions to calculate a 7-day rolling average of daily revenue per product category, along with the percentage change from the prior 7-day period. Use a sample schema: sales(sale_date, product_category, revenue).

Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() window functions with concrete examples. Then show me how to use them to identify the top 3 customers by revenue within each region, handling ties correctly.

I need to detect slowly changing dimension (SCD) Type 2 updates in a customer table. Write a SQL merge statement for Snowflake that: identifies changed records by comparing a hash of key attributes, closes existing records by setting end_date, and inserts new versions with is_current = true.

Our dbt model has a performance issue: a large CTE is being evaluated multiple times in the same query. Explain the concept of CTE materialization in different SQL engines (Snowflake, BigQuery, Redshift), and rewrite the pattern using a dbt ref() with materialized='table' to fix the problem.

3. Python and PySpark Development

Write a PySpark function that reads a partitioned Parquet dataset from S3, applies a schema validation check against an expected schema, filters out records with null values in any of five key columns, and writes the cleaned output to a Delta table. Include error handling and logging.

I need to deduplicate a large PySpark DataFrame (500M rows) that may have multiple records for the same entity_id, keeping only the record with the latest updated_at timestamp. Write an efficient solution using window functions that avoids a full shuffle where possible.

Write a Python class for a reusable ETL pipeline framework that supports: configuration-driven source/target definitions (YAML), pluggable transformation steps, checkpointing for restart capability, and Slack notifications on failure. Include the class structure with stub methods and docstrings.

Explain the difference between pandas, Polars, and PySpark for data transformation workloads. Given a 50GB CSV file that needs to be processed on a single machine with 64GB RAM, which would you recommend and why? Provide a code snippet for reading the file and performing a groupby aggregation in your recommended tool.

Write a Python script using the Great Expectations library to validate a DataFrame of customer transactions. Include expectations for: no nulls in customer_id and transaction_date, transaction_amount between 0 and 1,000,000, valid ISO date format in transaction_date, and unique transaction_id values.

4. Data Modeling and Schema Design

Design a dimensional data model for an e-commerce company's order analytics use case. The business wants to analyze orders by customer, product, geography, and time. Create a star schema with one fact table and at least four dimension tables, including the column definitions, data types, and primary/foreign key relationships.

Explain the differences between a star schema and a snowflake schema. For a BI team using Tableau on top of Snowflake, which would you recommend and why? Address query performance, maintainability, and impact on Tableau data source design.

We're modeling a many-to-many relationship between Orders and Products (an order can have multiple products, a product can be in multiple orders). Design the fact table structure for this relationship, including how to handle multiple granularities (order-level metrics vs. order-line metrics) without double-counting.

Our data warehouse has columns storing JSON blobs from a semi-structured API response. Explain when to keep data as JSON in Snowflake's VARIANT type versus flattening into structured columns, and write a SQL FLATTEN query to extract nested fields from a sample JSON structure: {"user": {"id": 123, "events": [{"type": "click", "ts": "2024-01-01"}]}}.

Design a slowly changing dimension (SCD) strategy for a customer dimension table where we need to track historical changes to customer_segment, customer_tier, and mailing_address. Compare SCD Type 1, Type 2, and Type 6, and recommend which type to use for each attribute based on typical analytics requirements.

5. Data Quality and Observability

Design a data quality monitoring framework for a production data pipeline. Specify what checks to run at each layer (ingestion, transformation, serving), how to calculate data quality scores, how to alert on anomalies, and how to integrate quality metrics into a DataHub or Monte Carlo-style data catalog.

Write a dbt test configuration (schema.yml) for a fact_orders model. Include tests for: not_null and unique on order_id, accepted_values on order_status, referential integrity to dim_customer and dim_product, and a custom test that flags records where order_total does not equal the sum of line item amounts.

Explain statistical anomaly detection approaches for time-series data quality monitoring. For a pipeline where daily row counts vary seasonally, compare z-score, IQR, and ARIMA-based anomaly detection. Write a Python function implementing the z-score method with a rolling 28-day baseline.

Our pipeline occasionally produces duplicate records due to at-least-once delivery semantics in Kafka. Design an idempotency strategy for the downstream Snowflake MERGE operation, and explain how to use a deduplication key and a processing watermark to safely handle late-arriving and duplicate events.

What is data lineage and why does it matter for data quality governance? Compare the lineage capabilities of dbt, Apache Atlas, and OpenLineage/Marquez. Explain how to implement column-level lineage tracking in a dbt project so the data team can trace a metric back to its source tables.

6. Cloud Infrastructure and DevOps

Write a Terraform configuration to deploy an AWS Glue ETL job with the following requirements: reads from S3, writes to Redshift Serverless, uses Glue Data Catalog for schema discovery, runs on a schedule using EventBridge, and has IAM roles with least-privilege permissions. Include the key resource blocks and variable structure.

Design a CI/CD pipeline for a dbt project hosted on GitHub. The pipeline should: run dbt compile and dbt test on pull request, deploy to a staging environment on merge to main, promote to production after manual approval, and send a Slack notification with test results. Specify the GitHub Actions workflow structure.

Compare Airflow, Prefect, and Dagster as orchestration platforms for a team of 8 data engineers managing 200+ DAGs. Evaluate each on: DAG authoring experience, dynamic task generation, observability, deployment complexity, and cost at scale. Give a recommendation for a company migrating off legacy cron jobs.

We're running Spark jobs on AWS EMR and the cluster costs have grown to $50K/month. Outline a cost optimization strategy covering: spot instance usage, auto-termination policies, right-sizing executor memory/cores, switching to EMR Serverless, and identifying jobs that could move to smaller tools like Polars or DuckDB.

Explain how to implement column-level encryption for PII fields (SSN, email, phone) in a Snowflake data warehouse. Cover Snowflake's native ENCRYPT/DECRYPT functions, dynamic data masking policies, role-based access control for PII, and how to handle key rotation without re-encrypting all data.

7. Documentation and Team Collaboration

Write a technical design document template for a new data pipeline. Include sections for: executive summary, business requirements, data sources and schemas, transformation logic, target data model, data quality checks, error handling strategy, performance requirements, and rollout plan. Add guiding questions under each section.

I need to document a complex dbt model that calculates customer lifetime value. The model has 5 CTEs, uses window functions, and references 8 upstream models. Write a comprehensive dbt model description and column-level documentation (in schema.yml format) that would help a new team member understand and maintain it.

Our team lacks a data dictionary for our 300-column core data warehouse. Write a process document explaining how to build one iteratively: how to prioritize tables, what metadata to capture per column (definition, source, data type, nullability, example values, business owner), and how to maintain it in a tool like dbt docs or Confluence.

Create an on-call runbook for our Airflow-based data platform. Cover the five most common failure scenarios: DAG timeout, database connection failure, S3 access denied error, memory overflow in a Spark task, and dbt test failures. For each scenario include: symptoms, diagnostic steps, resolution steps, and escalation path.

Write a technical blog post (500–700 words) for my engineering team's internal wiki explaining how we implemented incremental processing in our dbt models using the is_incremental() macro, why it reduced our daily model runtime from 4 hours to 35 minutes, and the gotchas to watch out for.