ClawGear

Posted on May 18

35 ChatGPT Prompts for Data Scientists: From EDA to Model Deployment

#ai #chatgpt #datascience #productivity

A 2023 Anaconda survey found that data scientists spend only 26% of their time on actual modeling — the rest disappears into data wrangling, documentation, stakeholder communication, and debugging pipelines that break in production. The gap between what data scientists were hired to do and what they actually spend their days doing is one of the most persistent frustrations in the field.

AI assistants like ChatGPT, Claude, and DeepSeek have become genuine force multipliers for data professionals who know how to prompt them well. These tools can't replace your domain expertise or your judgment about what a distribution actually means in context, but they can handle the boilerplate, the first-draft explanations, the SQL scaffolding, and the code review that eat hours every week. The 35 prompts below are organized around the real workflow of a working data scientist, from raw data exploration through deployment and stakeholder reporting.

1. Exploratory Data Analysis (EDA)

I have a pandas DataFrame with the following columns: [list columns and dtypes]. Write a comprehensive EDA function that checks for missing values, computes descriptive statistics, identifies skewed distributions, detects potential outliers using IQR, and prints a structured summary report. Include inline comments explaining each step.
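To give a feel for the IQR step this prompt asks for, here is a minimal sketch using only the Python standard library. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are illustrative assumptions. In a pandas workflow the quartiles would more likely come from `Series.quantile(0.25)` and `Series.quantile(0.75)`.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using Tukey's IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Anything outside the fences is flagged for review, not deleted
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Made-up sample with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 95, 10, 12, 13]
lower, upper, outliers = iqr_outliers(sample)
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```

A flagged value is a candidate for investigation rather than automatic removal, which is why the function returns the fences along with the outliers.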

I'm exploring a dataset about [topic]. The target variable is [variable name]. Suggest 8 specific hypotheses I should test during EDA, and for each hypothesis explain which statistical test or visualization would be most appropriate to investigate it.

Write a Python function using matplotlib and seaborn that produces a correlation heatmap, pairplot for the top 6 most correlated features with [target], and a missing-value heatmap. The function should accept a DataFrame and a target column name as arguments.

I found these anomalies in my EDA: [describe anomalies]. For each one, give me three possible explanations — one involving data collection error, one involving genuine outlier behavior, and one involving a domain-specific phenomenon — so I can decide how to investigate further.

Explain the difference between MCAR, MAR, and MNAR missing data mechanisms to a stakeholder with no statistics background. Then tell me which imputation strategies are appropriate for each, with the tradeoffs of each approach.

2. Feature Engineering & Data Preparation

I'm building a model to predict [outcome] using tabular data. The features include [list features]. Suggest 10 engineered features I might create, explain the intuition behind each, and provide the pandas code to generate them.

Write a scikit-learn Pipeline that handles the following preprocessing steps for a classification task: impute numeric columns with median, impute categorical columns with most frequent value, scale numeric features with StandardScaler, and one-hot encode categorical features. Include a step for a RandomForestClassifier placeholder at the end.

I have a datetime column in my dataset. Write Python code to extract the following features from it: hour of day, day of week, is_weekend flag, days since a reference date, cyclical encoding of month using sine and cosine transforms, and a business_hours boolean flag.
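The features in this prompt can be sketched for a single timestamp with the standard library alone. The function name `datetime_features`, the dictionary keys, and the definition of business hours (09:00 to 16:59 on weekdays) are assumptions made for illustration. A pandas version would apply the same logic through the `.dt` accessor.

```python
import math
from datetime import datetime

def datetime_features(ts, reference):
    """Derive the features listed in the prompt from one timestamp."""
    weekday = ts.weekday()  # Monday=0 ... Sunday=6
    is_weekend = weekday >= 5
    # Map months 1..12 onto a circle so December and January end up adjacent
    angle = 2 * math.pi * (ts.month - 1) / 12
    return {
        "hour": ts.hour,
        "day_of_week": weekday,
        "is_weekend": is_weekend,
        "days_since_ref": (ts - reference).days,
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        # Assumption: business hours are 09:00-16:59 on weekdays
        "business_hours": (not is_weekend) and 9 <= ts.hour < 17,
    }

# Saturday afternoon, measured against a 1 January reference date
features = datetime_features(datetime(2024, 3, 16, 14, 30), datetime(2024, 1, 1))
print(features)
```

The sine and cosine pair is what keeps the encoding cyclical: a model sees December and January as neighbours instead of eleven units apart.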

My dataset has high cardinality in the [column name] column with [N] unique values. Compare target encoding, frequency encoding, and embedding approaches for handling this in a gradient boosting model. What are the data leakage risks with target encoding and how do I avoid them?

Review this feature engineering code and identify any data leakage risks, inefficiencies, or places where I'm fitting transformers on the full dataset instead of only the training fold:

[paste code]

3. Model Selection & Training

I'm solving a [binary classification / regression / multi-class classification] problem. My dataset has [N rows], [M features], and [describe any class imbalance or data characteristics]. Recommend 5 candidate algorithms, rank them by likely performance for this use case, and explain the key hyperparameters to tune for each.

Write a Python script using Optuna to tune the hyperparameters of an XGBoost classifier. Include search spaces for learning_rate, max_depth, n_estimators, subsample, and colsample_bytree. Use 5-fold stratified cross-validation with AUC-ROC as the objective. Add early stopping.

Explain the bias-variance tradeoff in terms a business stakeholder can understand, then connect it to a concrete recommendation about when to stop tuning my current model vs. collecting more data vs. engineering better features.

I'm training a model on imbalanced data where the positive class is only 3% of the dataset. Compare SMOTE, class_weight='balanced', threshold tuning, and cost-sensitive learning as strategies. For each, tell me when to prefer it and what metric I should optimize instead of accuracy.
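Of the strategies named in this prompt, threshold tuning is the easiest to illustrate without any library. The sketch below picks the score cutoff that maximizes F1. The helper names and the labels and scores are made-up illustration data, and F1 is only one possible objective. In practice the threshold should be chosen on a validation set, not the test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    """Try each observed score as a cutoff and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))

# Hypothetical validation labels and predicted probabilities (rare positives)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.80]

chosen = best_threshold(y_true, scores)
print(chosen, f1_at_threshold(y_true, scores, chosen), f1_at_threshold(y_true, scores, 0.5))
```

On this toy data the default 0.5 cutoff misses one of the two positives, which is the typical symptom that makes threshold tuning worth trying on imbalanced problems.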

Write a cross-validation harness in Python that: (1) uses StratifiedKFold with 5 splits, (2) fits a pipeline containing preprocessing and a model, (3) evaluates precision, recall, F1, and AUC-ROC on each fold, (4) prints a summary table with mean and standard deviation for each metric.

4. Model Interpretation & Explainability

I trained an XGBoost model for [task]. Write Python code using SHAP to: generate a beeswarm summary plot, a waterfall plot for the highest-risk prediction in my test set, and a dependence plot for the top 3 features. Include interpretive comments explaining what each visualization tells us.

My model's top 5 features by SHAP importance are [list features]. Write a plain-English explanation of what the model is doing that I can include in a stakeholder presentation — no jargon, no code, no formulas. Keep it under 200 words.

Explain LIME vs. SHAP for local model explanations. When should I use each? What are the failure modes of each approach, and what questions should I ask when an explanation seems suspicious?

I need to present model results to a regulatory audience that requires explainability under [GDPR / model risk management guidelines / Fair Credit Reporting Act]. Draft a one-page model explanation document covering: model purpose, input features, how predictions are generated, known limitations, and how a customer can contest a decision.

My model has high accuracy overall but poor performance on the [subgroup] subgroup. Generate 5 hypotheses for why this disparity might exist, and for each hypothesis suggest one diagnostic check I can run to test it.

5. SQL, Data Engineering & Pipeline Work

Write an optimized SQL query to calculate a 7-day rolling average of [metric] by [dimension], handling gaps in dates where there are no records. Use window functions. Assume the table is named events with columns: user_id, event_date, metric_value.
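One way a response to this prompt might handle date gaps is a generated calendar joined to the events table, sketched here against an in-memory SQLite database so it runs without any setup. Several things are assumptions: the sample rows are invented, days with no records are counted as zero (rather than excluded), the `[dimension]` grouping is left out for brevity, and the syntax is SQLite's (version 3.25 or later for window functions). Other databases generate date series differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_date TEXT, metric_value REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10.0),
        (2, "2024-01-01", 20.0),
        (1, "2024-01-02", 30.0),
        (1, "2024-01-05", 60.0),  # 3 and 4 January have no rows at all
    ],
)

ROLLING_SQL = """
WITH RECURSIVE calendar(day) AS (
    -- Build one row per calendar day between the first and last event
    SELECT MIN(event_date) FROM events
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar
    WHERE day < (SELECT MAX(event_date) FROM events)
),
daily AS (
    -- Days without events survive the LEFT JOIN and get a total of 0
    SELECT c.day, COALESCE(SUM(e.metric_value), 0) AS daily_total
    FROM calendar AS c
    LEFT JOIN events AS e ON e.event_date = c.day
    GROUP BY c.day
)
SELECT day,
       daily_total,
       AVG(daily_total) OVER (
           ORDER BY day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
FROM daily
ORDER BY day
"""

rows = conn.execute(ROLLING_SQL).fetchall()
for day, total, avg in rows:
    print(day, total, round(avg, 2))
```

Because the calendar guarantees one row per day, `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` really does span seven calendar days. Without it, the same window would silently cover seven recorded days, however far apart they are.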

I have the following slow-running SQL query: [paste query]. Analyze it and suggest specific optimizations including indexing strategy, query rewriting, and any aggregation pushdowns. Explain why each optimization should help.

Write a Python function using SQLAlchemy that connects to a PostgreSQL database, runs a parameterized query, returns the result as a pandas DataFrame, and handles connection errors with retry logic up to 3 attempts.

I need to design a data pipeline that ingests [data source], transforms it through [describe transformations], and loads it to [destination]. Draw a logical architecture diagram using ASCII art, identify the 3 most likely failure points, and recommend tools for each stage.

Review this dbt model for correctness, performance, and best practices. Flag any issues with the grain of the model, unnecessary full table scans, or missing tests I should add to the schema.yml:

[paste dbt model SQL]

6. Statistical Analysis & Experimentation

I'm designing an A/B test for [feature or change]. Walk me through calculating the required sample size given: baseline conversion rate of [X]%, minimum detectable effect of [Y]%, significance level of 0.05, and power of 0.8. Provide the Python code and explain each assumption.
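For reference, the calculation this prompt asks about can be sketched with the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_group` is hypothetical, the example rates are invented, and the minimum detectable effect is treated as an absolute difference in conversion rate (a relative lift would need converting first). The test is assumed to be two-sided with equal group sizes.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.8):
    """Visitors needed in EACH group for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline + mde_abs          # assumption: MDE is an absolute lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Example: 10% baseline, detect an absolute lift of 2 percentage points
n = sample_size_per_group(baseline=0.10, mde_abs=0.02)
print(f"{n} visitors per group")
```

Halving the detectable effect roughly quadruples the required sample, since the effect size enters the formula squared. That is usually the most useful thing to show a product team before the test starts.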

My A/B test ran for [N] days and the results are: control had [X] conversions out of [N1] visitors, treatment had [Y] conversions out of [N2] visitors. Run the appropriate statistical test, state the p-value and confidence interval, and write a conclusion I can send to the product team.
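A common choice for this kind of result is a two-proportion z-test, sketched below with the standard library. The function name and the conversion counts are illustrative assumptions. The test statistic uses the pooled standard error, while the confidence interval for the difference uses the unpooled one. For small samples an exact test may be more appropriate.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (z, two-sided p-value, CI for p_b - p_a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval around the observed difference
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)

# Hypothetical counts: control 200/2000, treatment 250/2000
z, p_value, ci = two_proportion_ztest(200, 2000, 250, 2000)
print(f"z={z:.2f}, p={p_value:.4f}, 95% CI for lift: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Reporting the interval alongside the p-value gives the product team a range for the size of the lift, not just a yes-or-no on significance.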

Explain the difference between frequentist and Bayesian approaches to A/B testing for a product manager audience. Include when each is preferable and what "stopping early" does to each approach.

I have time series data for [metric] that shows a trend, seasonality, and some irregular spikes. Recommend an appropriate forecasting approach, explain how to decompose the series to validate your recommendation, and provide the Python code to fit and evaluate the model.

I ran a multivariate test with 4 variants and got a significant overall chi-square result. Explain the multiple comparisons problem, show me how to apply a Bonferroni correction, and write the Python code to do pairwise comparisons with corrected p-values.
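The correction described in this prompt can be sketched as follows: run every pairwise comparison, then multiply each p-value by the number of comparisons (capped at 1). With 4 variants there are 6 pairs. The function name, the variant labels, and the counts are invented for illustration, and each pair is compared with a pooled two-proportion z-test, which is one reasonable choice among several.

```python
import math
from itertools import combinations
from statistics import NormalDist

def pairwise_bonferroni(variants, alpha=0.05):
    """variants maps name -> (conversions, visitors). Returns one row per pair."""
    pairs = list(combinations(variants, 2))
    m = len(pairs)  # number of comparisons the correction must account for
    results = []
    for a, b in pairs:
        (conv_a, n_a), (conv_b, n_b) = variants[a], variants[b]
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (conv_b / n_b - conv_a / n_a) / se
        p_raw = 2 * (1 - NormalDist().cdf(abs(z)))
        p_adj = min(1.0, p_raw * m)  # Bonferroni: scale by m, cap at 1
        results.append((a, b, p_raw, p_adj, p_adj < alpha))
    return results

# Hypothetical results for 4 variants
variants = {"A": (200, 2000), "B": (210, 2000), "C": (205, 2000), "D": (270, 2000)}
results = pairwise_bonferroni(variants)
for a, b, p_raw, p_adj, significant in results:
    print(f"{a} vs {b}: raw p={p_raw:.4f}, adjusted p={p_adj:.4f}, significant={significant}")
```

Bonferroni is conservative, so a pair that loses significance after correction is not necessarily a null result. Less strict alternatives such as Holm's method exist if that matters.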

7. Communication, Documentation & Stakeholder Reporting

I completed a model that predicts [outcome] with [metric] of [value]. Write an executive summary (max 300 words) for a non-technical VP audience covering: what problem we solved, how the model works in plain terms, the business impact in dollar or percentage terms, key limitations, and the recommended next step.

Write a README for a data science project repository. The project is [brief description]. Include sections for: project overview, repository structure, setup instructions, data sources and access, how to run the pipeline, model performance summary, and known limitations.

I need to document a machine learning model for a model card. The model is [description]. Generate a complete model card following the Mitchell et al. format covering: model details, intended use, factors, metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats.

A stakeholder is pushing back on my model recommendation because the accuracy on a holdout set (87%) doesn't match their intuition. Draft a professional email that: acknowledges their concern, explains why 87% is meaningful in context, describes the alternative (no model) baseline, and proposes a pilot to resolve the disagreement.

Convert this Jupyter notebook narrative into a structured technical report with an abstract, methods section, results section with key findings highlighted, and a limitations and future work section. Keep the original findings but improve the clarity and structure:

[paste notebook narrative]

AI Prompt Toolkit for Data Scientists (Claude, ChatGPT and DeepSeek) → https://gumroad.com/PENDING_AUTO_datascientist

