DEV Community: Arian Mokhtariha

A CSV sample that looks like the whole dataset is worse than no CSV at all

Arian Mokhtariha — Thu, 09 Jul 2026 01:42:55 +0000

I've been building data2prompt for a few months now. It takes a data-heavy project (CSVs, SQL dumps, notebooks, Excel files) and turns it into a single file an LLM can read, instead of the model choking on raw 200MB CSVs or a generic repo-packager just skipping them. This week I shipped v0.5.0, and the real work in that release wasn't a new parser or a flag. It was a document I wrote called docs/output-contract.md, and writing it forced me to think about a failure mode I'd been quietly ignoring for months.

Here's the problem. To fit a large table into a context window, data2prompt samples it: say, 15 random rows out of 1.2 million. That's necessary and fine. But once you hand that sample to an LLM without being extremely explicit about what it's looking at, the model will happily treat 15 rows as the dataset. It'll compute an "average" from them. It'll describe a "trend." It'll answer "how many customers churned" using a number that only exists because of your random seed. The sample doesn't just lose information; it actively invites the model to hallucinate as if it had everything.

That's the actual design problem behind data2prompt: the output isn't documentation for a person, it's an input for a model, and models fail differently than people do. A person skimming a CSV preview instinctively knows it's a preview. A model doesn't have that instinct unless you build it in.

What "grounding" actually means in the code

The rule I landed on: any time data gets reduced, the full count has to be captured before the reduction happens, and it has to travel with the sample as a notice.

total_rows = len(df)  # capture first
sample = df.sample(n=config.csv_sample_size, random_state=config.seed).sort_index()
# -- [Sample: random 15 of 1,234,567 rows] --
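
For anyone who wants to try the rule without pandas, here is a minimal standard-library sketch. The helper name `sample_with_notice` and its parameters are illustrative assumptions, not data2prompt's actual API; only the notice wording comes from the examples in this post.

```python
import random


def sample_with_notice(rows, sample_size=15, seed=42):
    """Return (sampled_rows, notice). Hypothetical helper, not data2prompt's API."""
    total_rows = len(rows)  # capture the full count BEFORE reducing anything
    if total_rows <= sample_size:
        return list(rows), None  # nothing was reduced, so no notice is needed
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    keep = sorted(rng.sample(range(total_rows), sample_size))  # original order
    notice = f"-- [Sample: random {sample_size} of {total_rows:,} rows] --"
    return [rows[i] for i in keep], notice


rows = range(1_234_567)  # stand-in for a large table; any sequence works
sample, notice = sample_with_notice(rows)
print(notice)
```

The point of returning the notice together with the sample is that the two cannot drift apart: whatever code writes the rows also has the line that says how many rows there really were.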

Every sampled table, every truncated SQL insert block, every notebook cell whose output got cut off, all of it carries a line like:

-- [Sample: random 15 of 1,234,567 rows] --
-- [CSV truncated: Showing random 15 of 1,234,567 rows to save context] --
-- [Table data truncated: Showing random 15 of 200 buffered rows to save context] --

Same grammar every time: -- [Category: detail] --. Not a *Note:* in italics, no emoji, no "heads up!" One shape, always, on purpose. The system prompt at the top of the generated file teaches the model this exact pattern once, and after that the model can reliably tell "this line is the tool talking to me" apart from "this line is file content." Mix in even one differently-formatted note and that separation gets fuzzy.
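
A single grammar is also easy to check mechanically. The sketch below is my own illustration of that idea, not code from the repo: `format_notice`, `parse_notice`, and the regular expression are assumed names, built only from the `-- [Category: detail] --` shape shown above.

```python
import re

# One shape for every tool-generated line: -- [Category: detail] --
NOTICE_RE = re.compile(r"^-- \[(?P<category>[^:\]]+): (?P<detail>.+)\] --$")


def format_notice(category, detail):
    """Render a notice in the single grammar described above."""
    return f"-- [{category}: {detail}] --"


def parse_notice(line):
    """Return (category, detail) for a notice line, or None for file content."""
    match = NOTICE_RE.match(line.strip())
    return (match["category"], match["detail"]) if match else None


print(parse_notice("-- [CSV truncated: Showing random 15 of 1,234,567 rows to save context] --"))
print(parse_notice("*Note:* only 15 rows shown"))
```

A test suite can run something like `parse_notice` over every generated line that starts with `-- [` and fail the build when one does not match, which is one way to keep a stray differently-formatted note from slipping in.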

Nothing partial is allowed to look complete

This is the rule that actually matters, and it applies past just numeric sampling. A .parquet file with no pyarrow installed doesn't vanish. It shows up with Skipped (No pyarrow) and an install command. An SQL table's schema-only mode still prints CREATE TABLE, just with the rows replaced by -- [N data row(s) omitted: schema-only] --. A file excluded by .gitignore still gets a row in the File Index with status Omitted. If a file was scanned, it's accounted for somewhere in the document, even if the answer is "not shown, here's why."
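
As a rough sketch of what "accounted for somewhere" can look like in code: every scanned path gets exactly one index entry with a status from a fixed vocabulary. Everything here is hypothetical (the `IndexEntry` class, the `build_file_index` helper, and the "Included" status name are my assumptions for illustration); only "Skipped (No pyarrow)" and "Omitted" come from the description above.

```python
from dataclasses import dataclass

# Controlled vocabulary. "Included" is an assumed name; the other two appear above.
STATUSES = ("Included", "Skipped", "Omitted")


@dataclass
class IndexEntry:
    path: str
    status: str
    reason: str = ""

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def build_file_index(scanned_paths, gitignored, missing_deps):
    """Account for every scanned path, shown or not (hypothetical helper)."""
    index = []
    for path in scanned_paths:
        if path in gitignored:
            index.append(IndexEntry(path, "Omitted", "excluded by .gitignore"))
        elif path in missing_deps:
            index.append(IndexEntry(path, "Skipped", f"No {missing_deps[path]}"))
        else:
            index.append(IndexEntry(path, "Included"))
    return index


index = build_file_index(
    ["sales.csv", "events.parquet", "scratch.log"],
    gitignored={"scratch.log"},
    missing_deps={"events.parquet": "pyarrow"},
)
for entry in index:
    print(entry.path, entry.status, entry.reason)
```

The useful invariant is that the index has the same length as the scan, so a dropped file shows up as a failing length check instead of a silent gap.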

The alternative is silently dropping things, which is how you get an LLM confidently telling you a project has no config file when it does. It just didn't make the cut.

Two bugs this thinking actually caught

Writing the contract wasn't just theory. It surfaced real bugs while I was auditing the codebase against it:

The Excel visual-element check (are there charts or images in this sheet the tool can't extract) was reading sheet._images off worksheets opened in read_only=True mode. Read-only worksheets in openpyxl never parse drawing parts, and the attribute on regular worksheets is _charts, not charts, anyway. So that check could never fire. It always said "no visuals," even when a sheet was covered in charts. Fixed it by treating an .xlsx as what it actually is, a zip file, and checking for xl/media/ and xl/charts/ in the archive listing directly. No workbook load needed, and now it's actually correct instead of silently wrong.

The other one is worse in a different way: bare .env files have an empty file suffix, so they were falling through the extension-based skip list and getting parsed by the generic text handler, which just dumps file contents. A tool built specifically to keep secrets out of your LLM context was leaking them because the routing was extension-based and .env doesn't have an extension. Now .env files (and .env.local, prod.env, etc.) are matched by name before extension dispatch even runs, and the parser only ever emits KEY=<redacted>.
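
Here is a small sketch of both halves of that fix: matching by name, and emitting only redacted keys. The function names `is_env_file` and `redact_env` are assumptions, not the repo's actual code, and the parsing is deliberately simplistic (it does not handle multi-line quoted values).

```python
from pathlib import Path


def is_env_file(path):
    """Match .env, .env.local, prod.env, etc. by name, not by suffix."""
    name = Path(path).name.lower()
    return name == ".env" or name.startswith(".env.") or name.endswith(".env")


def redact_env(text):
    """Emit KEY=<redacted> for each assignment; never echo a value."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # drop blanks, comments, and anything that is not KEY=VALUE
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        lines.append(f"{key}=<redacted>")
    return "\n".join(lines)


print(Path(".env").suffix == "")  # the empty suffix that defeated extension routing
print(is_env_file("config/.env.local"), is_env_file("notes.txt"))
print(redact_env("SECRET_KEY=abc123\n# comment\nexport DB_URL=postgres://u:p@h/db"))
```

Dropping comments and non-assignment lines entirely, instead of passing them through, is the conservative choice: a commented-out credential is still a credential.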

Neither of these was found by "testing the feature." They were found by writing down what the system was supposed to guarantee and then checking the code against that sentence.

Why I bothered writing a contract doc at all

I'm a data scientist, not a software engineer by trade, and I build data2prompt by directing coding agents through the implementation rather than hand-writing every line myself. That workflow has a specific failure mode: without something written down, every session re-derives its own idea of "how should this notice be worded" or "does this new parser need to update the preamble too," and consistency erodes one plausible-looking PR at a time. The contract doc exists so that decision doesn't have to be re-litigated: format parity across markdown/xml, one notice grammar, one canonical path key, controlled vocabularies only, fixed anchors at the top and bottom of the document. Seven invariants, plus a checklist for each kind of change. Any session, human or agent, that touches the output code reads that first.

If you're building anything that generates content for an LLM to consume rather than a human, not just data2prompt, any RAG pipeline, any agent tool output, I'd push back on treating "human-readable" and "model-readable" as the same design target. They're not. A human notices when something looks off. A model narrates whatever's in front of it as if it's the truth, unless you've engineered the document to stop it from doing that.

Repo's here if you want to see the actual contract doc or poke at the code: github.com/arianmokhtariha/data2prompt. It's MIT licensed, on PyPI as data2prompt. Happy to talk through any of the tradeoffs above in the comments.

Stop Pasting Raw CSVs Into ChatGPT: A Data Scientist's Guide to LLM Context Engineering

Arian Mokhtariha — Wed, 24 Jun 2026 20:28:00 +0000

Your LLM doesn't need 50,000 rows. It needs the right 15.

There's a mistake I see data scientists make constantly when they first start using LLMs for analysis.

They paste their entire CSV into the prompt.

I get the instinct. It feels rigorous. The model should have everything, right? But that instinct is exactly backwards — and it's silently degrading your results in ways that are easy to miss.

Let me show you why, and walk through a different approach.

Why Raw Data Destroys LLM Performance

Large language models have two constraints that matter deeply for data work.

Context windows are finite. A 100-row CSV with 20 columns? Probably fine. A 10,000-row CSV? That's millions of characters. You've burned your entire context window on one file before you've even written your question.

More tokens ≠ better answers. This is the counterintuitive part. LLM attention degrades with noisy, repetitive input. Row 8,437 of your sales data looks structurally identical to row 4,291. The model doesn't need both — it needs to understand the pattern, not memorize every instance.

Dumping raw data into a prompt is the equivalent of handing someone a 500-page report and asking them to summarize it verbally on the spot. They'll struggle, and the important details will get lost in the noise.

What LLMs Actually Need From Your Data

For data analysis tasks, a well-structured LLM context needs four things — and only four things:

Schema — column names, data types
A representative sample — enough rows to understand patterns and edge cases (15–50 is usually enough)
Statistics computed on the full dataset — missing value counts, value distributions, describe() output
Structure — how files relate to each other in your project

Notice what's not on that list: all 50,000 rows.

Here's the key insight: if your statistics are computed on the full dataset, you don't need the full dataset in the prompt. The model knows the mean, the standard deviation, the missing value rate, the quartiles — all from the full 50K rows — without seeing any of them directly.

The sample is there to show the model what the data looks like. The statistics tell it the truth about what the data actually is.
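
To make the split concrete, here is a standard-library sketch in which the statistics come from every row and only 15 rows are kept for display. The data is synthetic and the helper name `summarize_column` is an assumption; in a pandas workflow the same numbers would come from `describe()` and a missing-value count on the full frame.

```python
import random
import statistics


def summarize_column(values):
    """Full-dataset stats for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    return {
        "rows": len(values),
        "missing": missing,
        "missing_pct": round(100 * missing / len(values), 1),
        "min": min(present),
        "mean": round(statistics.fmean(present), 2),
        "max": max(present),
        "std": round(statistics.stdev(present), 2),  # sample std, like describe()
    }


# Synthetic stand-in for a 50,000-row column with some missing values.
rng = random.Random(42)
discount = [None if rng.random() < 0.1 else round(rng.uniform(0, 0.8), 2)
            for _ in range(50_000)]

stats = summarize_column(discount)   # computed on all 50,000 rows
preview = rng.sample(discount, 15)   # only these rows go in the prompt
print(stats)
print(preview)
```

The order matters: compute the summary first, then sample. Summarizing the 15-row preview instead would reproduce exactly the problem the statistics block is meant to prevent.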

What This Looks Like in Practice

Instead of pasting your entire CSV, you want something like this in your prompt:

## File: sales_data.csv
[Shape: 52,341 rows × 18 columns | Sampled: 15 random rows]

| order_id | order_date | region | category | sales | discount | profit |
|----------|------------|--------|----------|-------|----------|--------|
| CA-2021-... | 2021-03-12 | West | Technology | 1249.00 | 0.20 | 312.25 |
| ... [14 more rows] ...

### Dataset Statistics (full dataset: 52,341 rows)
Columns: order_id (object), order_date (datetime64), region (object), ...
Missing values: discount: 12.3% (6,438), postal_code: 0.1% (52)
Numerical summary:
  sales: min=0.44  mean=229.86  max=22638.48  std=623.25
  profit: min=-6599.98  mean=28.66  max=8399.98  std=234.26
  discount: min=0.0  mean=0.15  max=0.80

That's one file in your context. Now imagine a full project: 3 CSVs, 2 SQL dumps, a Jupyter analysis notebook, an Excel summary. Each one needs this treatment.

And they all need to be assembled into a single coherent context file that fits in one prompt.

Automating This With data2prompt

This is exactly the problem I built data2prompt to solve. It runs in your project directory and produces a single structured PROMPT.md (or .xml) file ready to paste into any LLM session.

pipx install data2prompt
cd your-data-project
data2prompt

What it does under the hood:

For CSV files: Draws a random sample of 15 rows (configurable), then computes the full stats block — dtype per column, missing value counts and percentages, and a describe() summary — on the entire file. Not the sample.

For Jupyter notebooks: Extracts code cells and text outputs in execution order. Strips Base64-encoded images and raw HTML that would waste thousands of tokens while contributing nothing to analysis context.

For SQL files: Keeps schema structure (CREATE TABLE and other DDL statements) and samples the data rows instead of including them all.

For Excel files: Processes each sheet separately with the same stats-aware approach, up to a configurable sheet limit.

For .env files: Lists variable names with values redacted (SECRET_KEY=<redacted>). The LLM understands your configuration without you leaking credentials.

The Size Difference Is Dramatic

I ran this on a real Superstore analytics project — multiple CSVs, SQL dumps, a Jupyter analysis notebook, and Excel summaries:

Tool	Output Size	Data Handling
data2prompt	241 KB	Smart sampling + full-dataset stats
code2prompt	9,304 KB	Raw file content
Repomix	22,085 KB	Raw file content

Same project. 91× smaller than Repomix while preserving everything the LLM actually needs.

But the real win isn't the file size — it's that the LLM now gets better signal in fewer tokens. Less noise, cleaner attention, more focused responses.

How This Changes What You Can Ask

When your context is structured right, your prompts get dramatically more specific and the answers get dramatically better.

Before (raw data dump):

"Here's my data [paste 5,000 rows]. Can you find any interesting patterns?"

After (structured context):

"Here's my project context [paste PROMPT.md]. I'm seeing an unusual spike in returns in the West region in Q3. The discount column has 12.3% missing values — mostly concentrated in the Furniture category. Can you form a hypothesis about what's driving the return spike and suggest which columns to cross-tabulate to test it?"

The second prompt is possible because you already know the missing value distribution, the regional breakdown, and the data quality issues. data2prompt surfaces all of that automatically, so you can skip the exploratory small talk and go straight to the interesting question.

Schema-Only Mode for Exploration

When you're starting with a new project and don't yet know what to ask, there's a lighter option:

data2prompt --schema-only

This drops all data rows and gives you just column names, types, and statistics. Useful for a first conversation with an LLM where you want to explore structure before committing to an analysis direction.

Context Engineering Is Now a Core Skill

The data science community talks a lot about prompt engineering — how to phrase questions to get better answers. But for data-heavy work, the bigger leverage is in context engineering: how you structure and size the information you hand the model before you ask anything.

The gains I've seen from better context far outweigh the gains from better phrasing. Same model, same question, 10× more specific answer — just by giving it structured context instead of raw rows.

data2prompt is on GitHub and installable via PyPI:

pipx install data2prompt

It's MIT-licensed, has no external API calls, and works entirely offline.

How are you currently preparing data context for LLM sessions? Curious whether others have found different approaches — drop it in the comments.

Links:

GitHub: https://github.com/arianmokhtariha/data2prompt
PyPI: https://pypi.org/project/data2prompt

Tags: ai datascience python llm machinelearning

Meet data2prompt: The CLI Tool That Finally Makes LLMs Understand Your Data Science Projects

Arian Mokhtariha — Tue, 02 Jun 2026 15:25:43 +0000

Every data scientist has hit this wall.

You are deep in a project — CSVs, SQL dumps, Jupyter notebooks, maybe some Power BI files — and you want to ask an LLM to help you reason across the whole thing. Not just one script. The entire project. You want it to understand your data structure, your pipeline logic, your model decisions all at once.

So you try to package it up and paste it in. And it fails. The context window chokes. The LLM forgets files it saw earlier. The responses stop making sense.

The problem is not your LLM. The problem is that nobody built the right packaging tool for data-heavy projects — until now.

Introducing data2prompt

data2prompt is an open-source CLI tool that packages your entire data science project into a single, optimized, LLM-ready file. Not a generic dump of everything in your folder — a smart, data-aware output that knows how to handle CSVs, SQL, Jupyter notebooks, Excel files, and binary data files the way a data scientist actually needs them handled.

Install it in one command:

pipx install data2prompt

Run it from your project root:

data2prompt

That is it. You get a single PROMPT.md or PROMPT.xml file ready to paste into Claude, ChatGPT, Gemini, or any LLM with a large context window.

The Problem With Generic Tools

There are great tools out there for packaging software projects for LLM context. They work beautifully on codebases full of Python scripts and config files.

But a data science project is not a software project. It contains fundamentally different file types that need fundamentally different handling — and when a generic tool encounters them, it does the worst possible thing: it dumps everything raw.

Here is what that looks like in practice:

A CSV with 50,000 rows gets written to the output in full — 50,000 rows of token-consuming noise when what the LLM actually needs is the schema and a representative sample.

A SQL dump gets included with every single INSERT statement — hundreds of thousands of rows of raw data when what the LLM needs is the CREATE TABLE schema and a handful of example rows per table.

A Jupyter notebook gets written with all its outputs intact — which includes matplotlib charts and styled dataframes stored as base64-encoded image strings. A single notebook visualization can contribute 60,000 tokens of encoded image data that an LLM literally cannot read or use.

A Power BI or pickle file is binary — reading it as text produces corrupted garbage that fills your context window with meaningless characters.

The result is a context file that is technically complete and practically useless. Every token budget gets consumed before the LLM has seen anything worth reasoning about.

How data2prompt Handles Each File Type

data2prompt applies a dedicated strategy to every file type found in a data project.

CSVs and Excel — Intelligent Random Sampling

Instead of dumping all rows, data2prompt takes a random sample:

# Default: 15 random rows per CSV
data2prompt

# Increase sample size when you need more context
data2prompt --csv-sample-size 30

The sampling is random with a fixed seed (pseudo-random, so it is repeatable), not head/tail. This means the LLM sees a representative spread of your actual data values, not just whatever happened to be at the top of the file. The seed makes the output fully reproducible: the same input gives the same sample every time you run it.
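
A quick way to see the difference is to sample a file that is sorted by one column, which exported data often is. This is a standard-library sketch on synthetic rows; the helper names and the seed value of 42 are illustrative choices here, not a statement about the tool's internals.

```python
import random

# Synthetic file sorted by region, as exported data often is.
rows = [{"order_id": i, "region": region}
        for region in ("Central", "East", "South", "West")
        for i in range(250)]


def head(rows, n=15):
    """First n rows, the way a naive preview would take them."""
    return rows[:n]


def seeded_sample(rows, n=15, seed=42):
    """n random rows in original order; the seed makes the pick repeatable."""
    rng = random.Random(seed)
    return [rows[i] for i in sorted(rng.sample(range(len(rows)), n))]


print({r["region"] for r in head(rows)})           # only the first region
print({r["region"] for r in seeded_sample(rows)})  # typically several regions
print(seeded_sample(rows) == seeded_sample(rows))  # same seed, same rows
```

The head preview can only ever show the first region, while the seeded sample draws from the whole file and still produces an identical result on every run.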

The output is clean and annotated:

-- [Sample - Random 15 rows of 52,411 total] --
| order_id | customer   | region | sales  | profit |
|----------|------------|--------|--------|--------|
| CA-2019  | John Smith | West   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --

For Excel, each sheet is sampled independently. Sheets that are purely visual dashboards — no tabular data — are detected and noted rather than producing empty tables.

SQL Files — Schema First, Data Sampled

SQL dumps have two distinct parts: the schema (CREATE TABLE definitions) and the data (INSERT statements). data2prompt treats them completely differently.

Schema blocks are always preserved in full. Every CREATE TABLE, every column definition, every constraint and foreign key relationship — this is exactly what the LLM needs to understand your database.

INSERT statements are sampled randomly per table. You get a handful of representative rows for each table, enough to understand what the data looks like, without the thousands of repetitive INSERT lines that collapse your context budget.

Jupyter Notebooks — Logic Without the Noise

Notebooks store both cell source code and cell outputs. The source code is what the LLM needs — your transformations, model definitions, evaluation logic, markdown explanations. The outputs are the problem.

data2prompt extracts source code from every cell and discards outputs entirely. The base64 image strings, printed dataframes, and error tracebacks that bloat notebook files are stripped out. What remains is clean, readable notebook logic that an LLM can actually reason about.
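
Since an .ipynb file is plain JSON, the core of this step can be sketched with the standard library alone. The function name `extract_notebook_sources` is an assumption and the demo notebook is a hand-built stand-in with a placeholder image string; the real parser presumably handles more edge cases.

```python
import json


def extract_notebook_sources(ipynb_text):
    """Return [(cell_type, source)] for each cell, dropping all outputs."""
    notebook = json.loads(ipynb_text)
    cells = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # the notebook format allows a list of lines
            source = "".join(source)
        cells.append((cell.get("cell_type", "code"), source))
    return cells


demo = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["import pandas as pd\n", "df = pd.read_csv('data/sales.csv')"],
            "outputs": [{"output_type": "display_data",
                         "data": {"image/png": "iVBORw0KGgo..."}, "metadata": {}}],
        },
    ],
})

for cell_type, source in extract_notebook_sources(demo):
    print(f"[{cell_type}]\n{source}")
```

The function never reads the `outputs` key at all, so there is no path by which a base64 image could leak into the result.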

Binary Files — Acknowledged, Not Mangled

.pbix, .pkl, .parquet, .db, .sqlite, .h5, .feather — data2prompt lists these in your project tree so the LLM knows they exist, and skips their content entirely. No corrupted binary strings eating your context window.

Two Output Formats: Markdown and XML

# Clean Markdown (default)
data2prompt

# Structured XML
data2prompt --format xml

The XML format was added based on Anthropic's prompting guidance, which recommends XML-style tags for structuring long prompts so the model can tell the parts apart more reliably. Every file gets semantic tags:

<codebase name="my-project">
  <directory_structure>...</directory_structure>
  <files>
    <file path="data\sales.csv">
      ...sampled table...
    </file>
    <file path="notebooks\analysis.ipynb">
      <cell index="1" type="code">
        <content>
          df = pd.read_csv('data/sales.csv')
          df.head()
        </content>
      </cell>
    </file>
  </files>
</codebase>

Use Markdown for quick analysis sessions. Use XML when you want the LLM to navigate a large project with maximum structural clarity.

The Numbers

Same data science project — dimension and fact CSVs, a SQL dump, Power BI files, ML notebooks, classification and clustering scripts — run through three tools:

Tool	Output Size
Generic tool #1	22,085 KB
Generic tool #2	9,304 KB
data2prompt	241 KB

241 KB versus 22,085 KB. Same project. Same information that actually matters.

That gap exists entirely because of the file-type-specific strategies above — the random CSV sampling, the SQL schema extraction, the notebook output stripping, the binary file handling. Nothing clever, just the right tool for the right files.

Full Usage

# Install with pip
pip install data2prompt
# Or, recommended: install with pipx for an isolated environment
pipx install data2prompt

# Basic run — outputs PROMPT.md in project root
data2prompt

# XML format
data2prompt --format xml

# Custom CSV sample size
data2prompt --csv-sample-size 25

# Ignore specific folders
data2prompt --ignore-folders data/raw archive models/checkpoints

# Custom output filename
data2prompt --output my_project_context

Create a .data2promptignore file in your project root for permanent exclusions — same syntax as .gitignore:

data/raw/
*.pkl
archive/

Who This Is Built For

data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with real data files alongside their code. If your project has CSVs, SQL, notebooks, or Excel files, this tool will dramatically improve the quality of your LLM interactions with that project.

If you work in a pure software development context with no data files, a general-purpose packaging tool will serve you well. But if your project looks anything like a real data science workflow, give data2prompt a try.

GitHub: https://github.com/arianmokhtariha/data2prompt
PyPI: https://pypi.org/project/data2prompt

Open to questions, feedback, and contributions — drop them in the comments or open an issue on GitHub.

I Tried Repomix on My Data Science Project. It Generated a 22,000 KB File. So I Built My Own Tool

Arian Mokhtariha — Sun, 31 May 2026 22:31:28 +0000

A few months ago a friend showed me two tools — Repomix and code2prompt. The idea was simple: point them at your project folder, they package everything into one file, you paste it into an LLM and ask questions about your whole codebase at once. For his pure Python projects they worked great.

I was working on a data analytics project at the time — dimension and fact CSVs, a SQL dump, some Power BI files, Jupyter notebooks with ML models. I ran Repomix on it and got a 22,085 KB output file. code2prompt gave me 9,304 KB. I tried pasting either of them into Claude. It choked immediately.

So I opened the files to see what was actually inside them. What I found was the root of the problem.

What These Tools Get Wrong for Data Projects

Repomix and code2prompt are built for code repos. They operate on a simple principle: read every file, dump every file. That works fine when your project is Python scripts and config files. It completely falls apart when your project looks like mine.

Here's what was inflating those files:

Raw CSV dumps. My Fact_Sales.csv had tens of thousands of rows. The tool dumped every single one. An LLM doesn't need 50,000 rows of sales data — it needs to understand the structure and a representative sample.

Endless SQL INSERT statements. My Superstore.sql file had the full database dump including every INSERT INTO statement for every table. The schema — the CREATE TABLE blocks — is what an LLM actually needs. The data rows are mostly noise.

Notebook outputs with base64 images. Jupyter notebooks store cell outputs as JSON inside the .ipynb file. When a cell generates a matplotlib chart, that chart gets saved as a base64-encoded image string inside the notebook. A single chart output can be 50,000+ characters of base64 garbage that an LLM cannot use at all.

Binary files read as text. My .pbix (Power BI) files are binary. These tools attempted to read them as text and produced corrupted garbage that consumed tokens while providing zero information.

There was no tool that understood these problems. So after three months of building (and, to be very honest, I'm a data scientist, not a developer, so this was heavily AI-assisted), I shipped data2prompt.

What data2prompt Does Differently

The core idea is that each file type in a data project needs its own strategy, not a generic "read and dump" approach.

CSVs and Excel: Smart Sampling

Instead of dumping all rows, data2prompt takes a random sample:

df = df.sample(sample_size, random_state=seed)

The default is 15 rows, configurable with --csv-sample-size. Critically it's random sampling with a fixed seed — not head/tail. Random sampling gives the LLM a more representative picture of value diversity across the dataset. The seed (default 42) makes output reproducible across runs.

The output tells the LLM exactly what it's looking at:

-- [Sample - Random 15 rows] --
| order_id | customer_name | sales  | profit |
|----------|--------------|--------|--------|
| CA-2019  | John Smith   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --

For Excel files, each sheet is sampled independently. The parser also detects sheets that are purely visual dashboards (charts, images only) and notes them rather than producing an empty table.

SQL Files: Schema Preserved, Data Sampled

This was the hardest parser to get right. SQL dump files typically follow a pattern: a CREATE TABLE block defining the schema, followed by hundreds or thousands of INSERT INTO statements loading the data.

data2prompt reads line by line and applies different logic to each part:

# Always preserve the schema
if "CREATE TABLE" in line_upper:
    flush_buffer()
    in_create_block = True
    processed_lines.append(line)

# Buffer INSERT rows for sampling
if is_insert or is_data_row:
    table_data_buffer.append(line)

When it hits a buffer of INSERT rows, it samples them randomly:

rest_indices = sorted(rng.sample(range(1, len(table_data_buffer)), sample_size - 1))
sampled_rows = [first_line] + [table_data_buffer[idx] for idx in rest_indices]

The first line (the INSERT header) is always preserved. The rest are random samples in their original order. The LLM gets the full schema of every table plus a representative data sample — which is exactly what it needs to understand your database.
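
The two excerpts above can be combined into a runnable sketch. The wrapper function, its defaults, and the short-buffer guard are my assumptions for illustration; the sampling lines mirror the excerpt, and the notice wording follows the "Table data truncated" example from the v0.5.0 post. It assumes `sample_size` is at least 1.

```python
import random


def sample_insert_block(table_data_buffer, sample_size=15, seed=42):
    """Keep the first line plus a random, order-preserving sample of the rest."""
    total = len(table_data_buffer)
    if total <= sample_size:
        return list(table_data_buffer), None  # nothing dropped, no notice
    rng = random.Random(seed)
    first_line = table_data_buffer[0]  # the INSERT header always survives
    rest_indices = sorted(rng.sample(range(1, total), sample_size - 1))
    sampled_rows = [first_line] + [table_data_buffer[idx] for idx in rest_indices]
    notice = (f"-- [Table data truncated: Showing random {sample_size} "
              f"of {total} buffered rows to save context] --")
    return sampled_rows, notice


buffer = ["INSERT INTO orders VALUES"] + [f"({i}, 'West')," for i in range(1, 200)]
rows, notice = sample_insert_block(buffer)
print(notice)
print(rows[0])
```

The guard for short buffers matters: `random.Random.sample` raises `ValueError` when asked for more items than the population holds, so a table with only a few rows has to bypass sampling.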

Jupyter Notebooks: Source Code Only

Notebook cells store three things: source code, execution count, and outputs. The outputs are what bloat the file — printed dataframes, matplotlib charts as base64, error tracebacks.

data2prompt keeps the source code of every cell and strips the outputs entirely. A notebook that was 8MB of JSON becomes a clean sequence of code cells. The parser also handles truncation of unusually long lines and caps output blocks at a configurable line limit for cases where outputs are genuinely useful text.

For the XML format, each cell becomes a structured block:

<cell path="ML/Q1/Q1.ipynb" index="3" type="code">
    <content>
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
    </content>
</cell>

Binary Files: Listed, Not Read

.pbix, .parquet, .pkl, .db, .sqlite, .feather, .h5 — all listed in the directory tree so the LLM knows they exist, but content is skipped entirely. No garbage bytes consuming your context window.

Two Output Formats: Markdown and XML

data2prompt supports both --format markdown (default) and --format xml.

The XML format was added after reading Anthropic's prompting guidance, which recommends XML-style tags for structuring long prompts so the model can tell the parts apart more reliably. The full project gets wrapped in a structured hierarchy:

<codebase name="superstore-analysis">
  <metadata>
    <generated_on>2025-05-31 09:00</generated_on>
    <total_tokens method="o200k_base">48293</total_tokens>
  </metadata>
  <directory_structure>
    Fact&dim-csv\Fact_Sales.csv
    Superstore.sql
    ...
  </directory_structure>
  <files>
    <file path="Fact&dim-csv\Fact_Sales.csv">
      ...sampled table...
    </file>
  </files>
</codebase>
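
One detail worth checking if you generate XML like this yourself: a literal `&` inside an attribute value, as in the folder name above, is not well-formed XML and has to be written as `&amp;`. The post doesn't say how data2prompt handles this, so the snippet below is only a sketch of the general technique using Python's standard `xml.etree.ElementTree`, which escapes attribute values on serialization.

```python
import xml.etree.ElementTree as ET

codebase = ET.Element("codebase", name="superstore-analysis")
files = ET.SubElement(codebase, "files")
entry = ET.SubElement(files, "file", path="Fact&dim-csv\\Fact_Sales.csv")
entry.text = "...sampled table..."

xml_text = ET.tostring(codebase, encoding="unicode")
print(xml_text)

# Round-trip: the parser restores the original path, ampersand included.
parsed = ET.fromstring(xml_text)
print(parsed.find("files/file").get("path"))
```

Building the tree with a library instead of string concatenation means escaping is handled in one place, at the cost of less control over whitespace and indentation in the output.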

The Benchmark

Same project, same files, three tools:

Tool	Output Size
Repomix	22,085 KB
code2prompt	9,304 KB
data2prompt	241 KB

That's a 98.9% reduction vs Repomix and 97.4% vs code2prompt on the same data-heavy project, while preserving all the structurally useful information — schemas, sampled data, notebook logic, file tree.
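
For anyone checking the arithmetic, the percentages follow directly from the three sizes in the table. The helper name `size_comparison` is just for this example.

```python
def size_comparison(sizes_kb, baseline_tool):
    """Map each other tool to (size ratio vs baseline, percent reduction)."""
    baseline = sizes_kb[baseline_tool]
    return {
        tool: (round(size / baseline, 1), round(100 * (1 - baseline / size), 1))
        for tool, size in sizes_kb.items()
        if tool != baseline_tool
    }


sizes_kb = {"Repomix": 22_085, "code2prompt": 9_304, "data2prompt": 241}
for tool, (ratio, reduction) in size_comparison(sizes_kb, "data2prompt").items():
    print(f"{tool}: {ratio}x larger, {reduction}% reduction")
# Repomix: 91.6x larger, 98.9% reduction
# code2prompt: 38.6x larger, 97.4% reduction
```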

The reduction is so dramatic specifically because of the project type. A pure Python project would show a much smaller gap. That's exactly the point — this tool is built for data projects, not code projects.

Installation and Usage

# Install
pip install data2prompt

# Recommended: use pipx for isolated install
pipx install data2prompt

Basic usage — run inside your project directory:

# Default: Markdown output
data2prompt

# XML output (better for LLM structured parsing)
data2prompt --format xml

# Increase CSV sample size
data2prompt --csv-sample-size 25

# Custom output file name
data2prompt --output my_project_context

# Ignore specific folders
data2prompt --ignore-folders data/raw

The output file (default: PROMPT.md or PROMPT.xml) is ready to paste directly into Claude, ChatGPT, Gemini, or any LLM with a large context window.

You can also create a .data2promptignore file in your project root — same syntax as .gitignore — to exclude specific files or patterns permanently.

Who This Is For

data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with:

CSV/Excel data files
SQL database dumps
Jupyter notebooks
Power BI or other binary analytics files
Mixed projects with both code and data

If your project is purely Python scripts with no data files, Repomix or code2prompt will serve you fine. But if your project looks anything like a real data science workflow, give data2prompt a try.

GitHub: https://github.com/arianmokhtariha/data2prompt

PyPI: https://pypi.org/project/data2prompt

Questions about how the SQL parser works or why random sampling over head/tail? Drop them in the comments.