35 ChatGPT Prompts for Biostatisticians: Accelerate Analysis, Improve Communication, and Strengthen Your Research

As a biostatistician, you sit at the intersection of data, science, and decision-making, and the demands on your time are relentless: designing studies, selecting methods, writing up results, and collaborating with clinical teams. AI tools like ChatGPT can serve as a tireless thinking partner, helping you work through methodological questions, draft statistical sections, and communicate complex findings to non-statistical audiences. The 35 prompts below are organized by the core areas of your work to help you get more out of every session.

Study Design and Power Analysis

I am designing a randomized controlled trial comparing two treatments for [condition]. The primary endpoint is [outcome measure]. Help me outline the key design elements I need to specify before conducting a power analysis, including assumptions I will need to justify.

I need to calculate sample size for a two-arm parallel RCT. The primary outcome is a continuous variable with an expected mean difference of [X] and a pooled standard deviation of [Y]. Assume 80% power and a two-sided alpha of 0.05. Walk me through the calculation and the formula used.
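
To sanity-check whatever ChatGPT returns for the prompt above, it can help to have the textbook calculation on hand. The sketch below assumes the usual normal-approximation formula for two equal-sized arms, n per arm = 2 * (sd * (z_alpha + z_beta) / delta)^2; the function name and the example numbers are illustrative, not part of the prompt. A t-based calculation would typically add about one participant per arm.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_two_means(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided test of a difference in means.

    Uses the normal approximation with 1:1 allocation and a common SD.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to the power
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical inputs: mean difference of 5 and pooled SD of 10
print(n_per_arm_two_means(delta=5, sd=10))  # 63 per arm
```

If the AI's answer differs materially from this figure for the same inputs, ask it to show which formula and which approximation it used.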

I am designing a case-control study to detect an odds ratio of [X] for a binary exposure. Explain how to determine the required number of cases and controls, including how the case-to-control ratio affects efficiency.

My clinical trial has a time-to-event primary endpoint. Explain how to calculate the number of events required to detect a hazard ratio of [X] with 90% power, and then how to translate that into a total sample size given an expected accrual rate and follow-up period.
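
A quick cross-check for the prompt above: the sketch below assumes Schoenfeld's approximation for the required number of events under proportional hazards, and it converts events to participants by dividing by an assumed overall probability of observing an event. That probability is a placeholder you would derive from your own accrual and follow-up assumptions; both function names are illustrative.

```python
from math import ceil, log
from statistics import NormalDist


def events_required(hr, alpha=0.05, power=0.90, alloc=0.5):
    """Events needed to detect hazard ratio `hr` (Schoenfeld approximation).

    `alloc` is the proportion randomized to one arm (0.5 means 1:1).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 / (alloc * (1 - alloc) * log(hr) ** 2))


def total_sample_size(events, event_probability):
    """Participants needed if each has `event_probability` of an observed event.

    The event probability is an assumption that depends on accrual,
    follow-up, and the baseline hazard; it is not computed here.
    """
    return ceil(events / event_probability)


d = events_required(hr=0.7)  # 331 events
print(d, total_sample_size(d, event_probability=0.6))
```

The event count depends only on the hazard ratio, alpha, power, and allocation; accrual and follow-up enter only through the event probability.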

I am designing a crossover study for a rare disease with limited patient availability. Compare the statistical efficiency of a crossover design versus a parallel design for this setting, and describe the assumptions required for the crossover analysis to be valid.

Statistical Methodology Selection

My outcome is a count variable representing the number of hospitalizations per patient over a 12-month period. The data are heavily zero-inflated. Compare Poisson regression, negative binomial regression, and zero-inflated models for this situation, and recommend an analytical strategy with justification.

I have a longitudinal dataset with repeated measures on patients over time. Some patients have missing visits. Compare complete case analysis, mixed-effects models for repeated measures (MMRM), and multiple imputation for handling this missing data, including the assumptions each requires.

I need to compare survival curves between three treatment groups while adjusting for baseline covariates. Walk me through the options, including the Cox proportional hazards model, accelerated failure time models, and stratified analyses, with guidance on when each is preferred.

My dataset contains patients nested within clinical sites, and I am concerned about intraclass correlation. Explain the implications of ignoring clustering and compare mixed-effects models, GEE, and cluster-robust standard errors as analytical approaches.

I am conducting an observational study and want to control for confounding. Compare propensity score matching, inverse probability weighting, and direct covariate adjustment using regression, including the assumptions and limitations of each approach.

Data Analysis and Interpretation

I ran a Cox proportional hazards model and obtained a hazard ratio of [X] with a 95% confidence interval of [A, B] and a p-value of [P]. Help me write a clear, precise statistical interpretation of these results suitable for inclusion in a manuscript.

My regression model shows a statistically significant interaction term between treatment and a baseline covariate. Explain what this means clinically and statistically, and describe how I should present and interpret subgroup effects in light of this interaction.

I conducted a multiple imputation analysis with 20 imputed datasets. Explain Rubin's rules for combining estimates across imputed datasets and how to interpret the resulting pooled estimates and standard errors.
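
For reference when reviewing an answer to the prompt above, here is a minimal sketch of Rubin's rules for a single scalar parameter. It assumes you already have one point estimate and one squared standard error per imputed dataset; the function and field names are illustrative, and the degrees of freedom use Rubin's original formula rather than a small-sample adjustment.

```python
from math import inf, sqrt
from statistics import mean, variance
from typing import NamedTuple


class Pooled(NamedTuple):
    estimate: float
    total_variance: float
    std_error: float
    df: float


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances (squared SEs)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations of equal length")
    q_bar = mean(estimates)       # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates) # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = inf                  # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return Pooled(q_bar, total, sqrt(total), df)


# Hypothetical log-odds estimates from three imputed datasets
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what carries the uncertainty due to missing data into the pooled standard error.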

I have a biomarker variable that is measured with known assay variability. Explain the concept of regression dilution bias, how measurement error in a continuous predictor affects regression coefficients, and what correction methods are available.

My primary analysis result is a p-value of 0.06 for the primary endpoint. The clinical team is asking whether this is "close to significant." Help me draft a statistically rigorous explanation of what this p-value means and does not mean, suitable for a clinical audience.

Reporting and Scientific Writing

Write a statistical methods section for a manuscript reporting results from a randomized controlled trial with a time-to-event primary endpoint, a binary secondary endpoint, and a pre-specified subgroup analysis. Use language appropriate for a medical journal.

I need to write the results section for a survival analysis. The primary result is a Kaplan-Meier estimated median survival of [X] months in the treatment arm versus [Y] months in the control arm, with a log-rank p-value of [P] and a Cox model hazard ratio of [HR] (95% CI: [A, B]). Draft this results paragraph.

Help me write an interpretation paragraph for a non-inferiority trial in which the upper bound of the 95% confidence interval for the difference in primary endpoint rates (experimental minus control, with higher rates being worse) fell below the pre-specified non-inferiority margin of [X]. Explain what conclusions can and cannot be drawn.
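
The decision rule behind the prompt above can be sketched in a few lines. This example assumes the endpoint is an unfavorable event, the difference is taken as experimental minus control, and a simple Wald interval is acceptable; all names and counts are hypothetical, and a real analysis would use whatever interval method the protocol pre-specifies.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_new, n_new, events_ctrl, n_ctrl, level=0.95):
    """Wald confidence interval for the risk difference (new minus control)."""
    p_new = events_new / n_new
    p_ctrl = events_ctrl / n_ctrl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_new - p_ctrl
    return diff - z * se, diff + z * se


def is_noninferior(upper_bound, margin):
    """Non-inferiority is shown if the upper CI bound lies below the margin."""
    return upper_bound < margin


# Hypothetical counts: 30/200 events on the new treatment, 28/200 on control
lower, upper = risk_difference_ci(30, 200, 28, 200)
print(round(lower, 3), round(upper, 3), is_noninferior(upper, margin=0.10))
```

With the same data, a tighter margin of 0.05 would not be met, which is a useful reminder that the conclusion depends entirely on the pre-specified margin.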

I need to write the text accompanying a CONSORT flow diagram for a trial with [N] randomized participants, [X] lost to follow-up in arm A, [Y] lost to follow-up in arm B, and [Z] excluded from the per-protocol population. Write the narrative description for this figure.

I am writing the limitations section of a manuscript for an observational study. The main methodological concerns are residual confounding, potential selection bias due to convenience sampling, and a relatively short follow-up period. Help me draft a thorough, honest limitations paragraph.

R and SAS Programming Support

Write R code using the survival package to fit a Cox proportional hazards model with covariates age, sex, and treatment group. Include code to test the proportional hazards assumption using Schoenfeld residuals and to produce a forest plot of hazard ratios.

Write SAS PROC MIXED code for a mixed-effects model for repeated measures (MMRM) analysis of a continuous outcome measured at baseline, week 4, week 8, and week 12. Include an unstructured covariance matrix and appropriate contrast statements for the primary endpoint comparison at week 12.

Write R code using the mice package to perform multiple imputation on a dataset called trial_data with 10 imputed datasets, fit a logistic regression model on each imputed dataset using the glm function, and pool the results using Rubin's rules.

I have a large clinical trial dataset in SAS. Write a macro that calculates summary statistics (n, mean, SD, median, Q1, Q3, min, max) for a list of continuous variables, stratified by treatment group, formatted for a Table 1 in a clinical study report.

Write R code to conduct a penalized Cox regression using the glmnet package for variable selection in a high-dimensional survival dataset. Include cross-validation to select the optimal lambda and produce a coefficient plot.

Regulatory and Clinical Trial Statistics

I am preparing the statistical analysis plan (SAP) for a Phase 3 trial. List the key sections that must be included according to ICH E9 and E9(R1) guidance, and describe what should be documented in each section before unblinding.

Explain the estimand framework introduced in ICH E9(R1). Describe the five attributes of an estimand and explain how the choice of intercurrent event strategy (treatment policy, hypothetical, composite, while on treatment, principal stratum) affects the analysis and interpretation.

I need to implement a group sequential design with one interim analysis at 50% of events, using an O'Brien-Fleming alpha spending function. Explain how to calculate the interim and final critical values and how to report conditional power at the interim.
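
To check the boundaries ChatGPT reports for the prompt above, the sketch below works through a two-look design numerically. It assumes the Lan-DeMets O'Brien-Fleming-type spending function, a one-sided alpha of 0.025, an efficacy boundary only, and Simpson's-rule integration in place of a dedicated package; all function names are illustrative. Conditional power is not covered here.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def obf_spent(t, alpha=0.025):
    """Cumulative one-sided alpha spent at information fraction t."""
    return 2 * (1 - Z.cdf(Z.inv_cdf(1 - alpha / 2) / sqrt(t)))


def final_crossing_prob(c1, c2, t, steps=2000):
    """P(Z1 < c1 and Z2 >= c2) under the null, via Simpson's rule over Z1."""
    lo = -8.0  # effectively minus infinity for a standard normal
    h = (c1 - lo) / steps

    def f(z1):
        # Z2 = sqrt(t) * Z1 + sqrt(1 - t) * W, with W independent N(0, 1)
        return Z.pdf(z1) * (1 - Z.cdf((c2 - sqrt(t) * z1) / sqrt(1 - t)))

    total = f(lo) + f(c1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def two_look_boundaries(t=0.5, alpha=0.025):
    """Interim and final z critical values for one interim look at fraction t."""
    a1 = obf_spent(t, alpha)
    c1 = Z.inv_cdf(1 - a1)   # interim boundary spends a1
    target = alpha - a1      # alpha left for the final analysis
    lo, hi = 0.0, 5.0
    for _ in range(60):      # bisection: crossing probability decreases in c2
        mid = (lo + hi) / 2
        if final_crossing_prob(c1, mid, t) > target:
            lo = mid
        else:
            hi = mid
    return c1, (lo + hi) / 2


c1, c2 = two_look_boundaries()
print(round(c1, 3), round(c2, 3))
```

The interim boundary is far stricter than the final one, which is the defining feature of O'Brien-Fleming-type spending: very little alpha is spent early, so the final critical value stays close to the fixed-design value.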

Describe the regulatory requirements for handling missing data in a confirmatory clinical trial submission to the FDA, including expectations for sensitivity analyses and the documentation required in the SAP.

I am analyzing a basket trial with multiple tumor histologies sharing a common experimental arm. Compare the statistical approaches available for analyzing the cohorts, from independent cohort analyses (for example, a separate Simon's two-stage design per cohort) to Bayesian hierarchical models that borrow information across cohorts, from a regulatory acceptability perspective.

Collaboration and Peer Communication

I need to explain to a clinical investigator with no statistical background why we cannot simply add more patients after a trial fails to reach significance. Draft a plain-language explanation of alpha inflation from unplanned interim analyses and why pre-specification matters.
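
A small simulation can make the explanation requested in the prompt above concrete. The sketch below assumes a z-test with known variance, a true null effect, and a single unplanned extension that doubles the sample size whenever the first analysis is not significant; the function name and defaults are illustrative.

```python
import random
from math import sqrt


def type_one_error_with_extension(n_sims=200_000, crit=1.96, seed=12345):
    """Estimate the false-positive rate when a 'failed' trial is extended once.

    Under the null, the first z statistic is N(0, 1). If it is not
    significant, an equal-sized second batch is added and the pooled z
    statistic is tested again at the same unadjusted critical value.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0, 1)
        if abs(z1) > crit:
            rejections += 1
            continue
        z_new = rng.gauss(0, 1)        # z statistic of the added patients alone
        z2 = (z1 + z_new) / sqrt(2)    # pooled statistic over both batches
        if abs(z2) > crit:
            rejections += 1
    return rejections / n_sims


print(type_one_error_with_extension())
```

The estimated rate comes out noticeably above the nominal 5%, which is the alpha inflation the investigator needs to understand: a second chance at significance is a second chance at a false positive.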

A clinical collaborator is insisting on using a paired t-test for data whose paired differences, I believe, violate the normality assumption. Help me draft a diplomatic email explaining my methodological concern, proposing the Wilcoxon signed-rank test as an alternative, and offering to discuss further.

I am presenting a statistical analysis plan to a data monitoring committee. Write talking points that explain the primary analysis, the interim analysis schedule, the stopping boundaries, and the approach to missing data in accessible language for a mixed clinical and statistical audience.

A journal reviewer has questioned whether our Cox model adequately satisfied the proportional hazards assumption. Draft a response to the reviewer explaining the diagnostic tests we performed, any violations found, and the sensitivity analyses conducted to address them.

I need to lead a 30-minute training session for clinical research associates on the difference between statistical significance and clinical meaningfulness. Help me outline the session, including a simple example using a confidence interval to illustrate the concept of minimum clinically important difference.