Arian Mokhtariha

Posted on Jun 2

Meet data2prompt: The CLI Tool That Finally Makes LLMs Understand Your Data Science Projects

#ai #datascience #dataengineering #llm

Every data scientist has hit this wall.

You are deep in a project — CSVs, SQL dumps, Jupyter notebooks, maybe some Power BI files — and you want to ask an LLM to help you reason across the whole thing. Not just one script. The entire project. You want it to understand your data structure, your pipeline logic, your model decisions all at once.

So you try to package it up and paste it in. And it fails. The context window chokes. The LLM forgets files it saw earlier. The responses stop making sense.

The problem is not your LLM. The problem is that nobody built the right packaging tool for data-heavy projects — until now.

Introducing data2prompt

data2prompt is an open-source CLI tool that packages your entire data science project into a single, optimized, LLM-ready file. Not a generic dump of everything in your folder — a smart, data-aware output that knows how to handle CSVs, SQL, Jupyter notebooks, Excel files, and binary data files the way a data scientist actually needs them handled.

Install it in one command:

pipx install data2prompt

Run it from your project root:

data2prompt

That is it. You get a single PROMPT.md or PROMPT.xml file ready to paste into Claude, ChatGPT, Gemini, or any LLM with a large context window.

The Problem With Generic Tools

There are great tools out there for packaging software projects for LLM context. They work beautifully on codebases full of Python scripts and config files.

But a data science project is not a software project. It contains fundamentally different file types that need fundamentally different handling — and when a generic tool encounters them, it does the worst possible thing: it dumps everything raw.

Here is what that looks like in practice:

A CSV with 50,000 rows gets written to the output in full — 50,000 rows of token-consuming noise when what the LLM actually needs is the schema and a representative sample.

A SQL dump gets included with every single INSERT statement — hundreds of thousands of rows of raw data when what the LLM needs is the CREATE TABLE schema and a handful of example rows per table.

A Jupyter notebook gets written with all its outputs intact — which includes matplotlib charts and styled dataframes stored as base64-encoded image strings. A single notebook visualization can contribute 60,000 tokens of encoded image data that an LLM literally cannot read or use.

A Power BI or pickle file is binary — reading it as text produces corrupted garbage that fills your context window with meaningless characters.

The result is a context file that is technically complete and practically useless. The token budget is consumed before the LLM has seen anything worth reasoning about.

How data2prompt Handles Each File Type

data2prompt applies a dedicated strategy to every file type found in a data project.

CSVs and Excel — Intelligent Random Sampling

Instead of dumping all rows, data2prompt takes a random sample:

# Default: 15 random rows per CSV
data2prompt

# Increase sample size when you need more context
data2prompt --csv-sample-size 30

The sample is drawn uniformly at random (pseudo-random, with a fixed seed) rather than taken from the head or tail of the file. This means the LLM sees a representative spread of your actual data values, not just whatever happened to be at the top of the file. Because the seed is fixed, the output is fully reproducible: the same rows come back every time you run it on the same file.

The output is clean and annotated:

-- [Sample - Random 15 rows of 52,411 total] --
| order_id | customer   | region | sales  | profit |
|----------|------------|--------|--------|--------|
| CA-2019  | John Smith | West   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --

For Excel, each sheet is sampled independently. Sheets that are purely visual dashboards — no tabular data — are detected and noted rather than producing empty tables.
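To make the CSV sampling idea concrete, here is a minimal Python sketch of seeded random sampling rendered as a Markdown table. It is not data2prompt's actual implementation: the function name `sample_csv`, the seed value, and the in-memory reading strategy are all assumptions made for illustration.

```python
import csv
import io
import random


def sample_csv(text: str, sample_size: int = 15, seed: int = 42) -> str:
    """Return a Markdown table holding a seeded random sample of CSV rows.

    Hypothetical helper for illustration; the seed value 42 is an assumption.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    total = len(body)
    # A dedicated Random instance keeps the sample reproducible across runs.
    rng = random.Random(seed)
    k = min(sample_size, total)
    # Sample row indices, then sort them so rows keep their original file order.
    picked = sorted(rng.sample(range(total), k))
    lines = [
        f"-- [Sample - Random {k} rows of {total:,} total] --",
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(body[i]) + " |" for i in picked]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = "order_id,region,sales\n" + "\n".join(
        f"CA-{n},West,{n * 1.5:.2f}" for n in range(1000)
    )
    print(sample_csv(demo, sample_size=5))
```

This sketch loads the whole file into memory, which is fine for a demonstration. A very large CSV would probably call for a streaming approach such as reservoir sampling.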

SQL Files — Schema First, Data Sampled

SQL dumps have two distinct parts: the schema (CREATE TABLE definitions) and the data (INSERT statements). data2prompt treats them completely differently.

Schema blocks are always preserved in full. Every CREATE TABLE, every column definition, every constraint and foreign key relationship — this is exactly what the LLM needs to understand your database.

INSERT statements are sampled randomly per table. You get a handful of representative rows for each table, enough to understand what the data looks like, without the thousands of repetitive INSERT lines that would exhaust your context budget.
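The schema-first split can be sketched roughly as follows. This is an illustrative approximation, not the tool's real parser. It assumes one row per INSERT statement and statements terminated by a semicolon at the end of a line. The names `sample_sql_dump` and `rows_per_table` are hypothetical.

```python
import random
import re
from collections import defaultdict

# Captures the table name of an INSERT statement (optionally quoted).
INSERT_RE = re.compile(r"^\s*INSERT\s+INTO\s+[`\"]?(\w+)", re.IGNORECASE)


def sample_sql_dump(sql: str, rows_per_table: int = 3, seed: int = 42) -> str:
    """Keep non-INSERT statements whole; keep a seeded sample of INSERTs per table."""
    # Naive split: assumes ';' followed by a newline never occurs inside a string literal.
    statements = [s.strip() for s in re.split(r";\s*\n", sql) if s.strip()]
    schema, inserts = [], defaultdict(list)
    for stmt in statements:
        match = INSERT_RE.match(stmt)
        if match:
            inserts[match.group(1)].append(stmt)
        else:
            schema.append(stmt)  # CREATE TABLE, constraints, indexes, etc.
    rng = random.Random(seed)
    out = [s.rstrip(";") + ";" for s in schema]
    for table, stmts in inserts.items():
        k = min(rows_per_table, len(stmts))
        out.append(f"-- {table}: showing {k} of {len(stmts)} INSERT statements")
        out += [s.rstrip(";") + ";" for s in rng.sample(stmts, k)]
    return "\n".join(out)


if __name__ == "__main__":
    dump = "CREATE TABLE orders (id INT, total DECIMAL(10,2));\n" + "".join(
        f"INSERT INTO orders VALUES ({i}, {i * 9.99:.2f});\n" for i in range(500)
    )
    print(sample_sql_dump(dump))
```

Real dumps often use multi-row INSERT statements or embed semicolons in string values, so a production version would need a proper SQL tokenizer rather than a regex split.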

Jupyter Notebooks — Logic Without the Noise

Notebooks store both cell source code and cell outputs. The source code is what the LLM needs — your transformations, model definitions, evaluation logic, markdown explanations. The outputs are the problem.

data2prompt extracts source code from every cell and discards outputs entirely. The base64 image strings, printed dataframes, and error tracebacks that bloat notebook files are stripped out. What remains is clean, readable notebook logic that an LLM can actually reason about.
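Because an .ipynb file is plain JSON, the extraction step can be illustrated with the standard library alone. The sketch below is an assumption about how such stripping could work, not the tool's own code, and `notebook_sources` is a hypothetical name.

```python
import json


def notebook_sources(nb_json: str) -> list[tuple[str, str]]:
    """Return (cell_type, source) pairs from an .ipynb document, ignoring outputs."""
    nb = json.loads(nb_json)
    cells = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source to be one string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        # The "outputs" key is never read, so base64 images and tracebacks are dropped.
        if source.strip():
            cells.append((cell.get("cell_type", "code"), source))
    return cells


if __name__ == "__main__":
    demo = json.dumps({
        "cells": [
            {"cell_type": "markdown", "source": "# Load data"},
            {
                "cell_type": "code",
                "source": ["import pandas as pd\n", "df = pd.read_csv('data/sales.csv')"],
                "outputs": [{"data": {"image/png": "iVBORw0KGgo..."}}],
            },
        ]
    })
    for cell_type, source in notebook_sources(demo):
        print(f"[{cell_type}]\n{source}\n")
```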

Binary Files — Acknowledged, Not Mangled

.pbix, .pkl, .parquet, .db, .sqlite, .h5, .feather — data2prompt lists these in your project tree so the LLM knows they exist, and skips their content entirely. No corrupted binary strings eating your context window.
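A simple way to picture this behavior is an extension check during the directory walk. The following sketch uses the extensions listed above. The function name `describe_tree` and the `[binary, content skipped]` label are illustrative assumptions, not the tool's actual output format.

```python
from pathlib import Path

# Extensions named in this article; a real tool may recognize more.
BINARY_SUFFIXES = {".pbix", ".pkl", ".parquet", ".db", ".sqlite", ".h5", ".feather"}


def describe_tree(root: str) -> list[str]:
    """List every file under root, flagging binary files instead of reading them."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.suffix.lower() in BINARY_SUFFIXES:
            # The file is acknowledged in the tree but never opened.
            lines.append(f"{rel} [binary, content skipped]")
        else:
            lines.append(rel)
    return lines


if __name__ == "__main__":
    for line in describe_tree("."):
        print(line)
```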

Two Output Formats: Markdown and XML

# Clean Markdown (default)
data2prompt

# Structured XML
data2prompt --format xml

The XML format was added based on Anthropic's prompting guidance, which recommends wrapping sections of long prompts in XML-style tags so the model can parse them more reliably. Every file gets semantic tags:

<codebase name="my-project">
  <directory_structure>...</directory_structure>
  <files>
    <file path="data\sales.csv">
      ...sampled table...
    </file>
    <file path="notebooks\analysis.ipynb">
      <cell index="1" type="code">
        <content>
          df = pd.read_csv('data/sales.csv')
          df.head()
        </content>
      </cell>
    </file>
  </files>
</codebase>

Use Markdown for quick analysis sessions. Use XML when you want the LLM to navigate a large project with maximum structural clarity.
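For readers curious how a tagged layout like this might be produced, here is a small sketch using Python's standard `xml.etree.ElementTree`. It mirrors the `codebase`, `files`, and `file` tags shown above, but the function `build_xml` and its input shape are assumptions. Whether data2prompt escapes special characters the same way is not confirmed by this article.

```python
import xml.etree.ElementTree as ET


def build_xml(project: str, files: dict[str, str]) -> str:
    """Wrap already-processed file contents in codebase/files/file tags."""
    root = ET.Element("codebase", name=project)
    files_el = ET.SubElement(root, "files")
    for path, content in files.items():
        file_el = ET.SubElement(files_el, "file", path=path)
        # ElementTree escapes '<', '>' and '&' in text automatically on output.
        file_el.text = content
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_xml("my-project", {
        "data/sales.csv": "| order_id | sales |\n|---|---|\n| CA-2019 | 245.00 |",
        "scripts/train.py": "if score < 0.5:\n    retrain()",
    }))
```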

The Numbers

Same data science project — dimension and fact CSVs, a SQL dump, Power BI files, ML notebooks, classification and clustering scripts — run through three tools:

Tool	Output Size
Generic tool #1	22,085 KB
Generic tool #2	9,304 KB
data2prompt	241 KB

241 KB versus 22,085 KB: roughly 92 times smaller than the first generic tool and about 39 times smaller than the second. Same project. Same information that actually matters.

That gap exists entirely because of the file-type-specific strategies above — the random CSV sampling, the SQL schema extraction, the notebook output stripping, the binary file handling. Nothing clever, just the right tool for the right files.

Full Usage

# Install with pipx (recommended for an isolated environment)
pipx install data2prompt

# Or install with plain pip
pip install data2prompt

# Basic run — outputs PROMPT.md in project root
data2prompt

# XML format
data2prompt --format xml

# Custom CSV sample size
data2prompt --csv-sample-size 25

# Ignore specific folders
data2prompt --ignore-folders data/raw archive models/checkpoints

# Custom output filename
data2prompt --output my_project_context

Create a .data2promptignore file in your project root for permanent exclusions — same syntax as .gitignore:

data/raw/
*.pkl
archive/
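To give a feel for how patterns like these might be evaluated, the sketch below approximates a small subset of .gitignore semantics with `fnmatch`. It is a simplification for illustration only: it does not handle negation (`!`), `**`, or other gitignore rules, and `is_ignored` is a hypothetical helper rather than part of data2prompt.

```python
from fnmatch import fnmatch


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Approximate .gitignore-style matching for a POSIX-style relative path."""
    parts = rel_path.split("/")
    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        if pattern.endswith("/"):
            prefix = pattern.rstrip("/")
            if "/" in prefix:
                # A pattern with an inner slash is anchored to the project root.
                if rel_path.startswith(prefix + "/"):
                    return True
            elif prefix in parts[:-1]:
                # A bare directory name matches a folder at any depth.
                return True
        elif any(fnmatch(part, pattern) for part in parts):
            # A plain glob matches a file or folder name at any depth.
            return True
    return False


if __name__ == "__main__":
    rules = ["data/raw/", "*.pkl", "archive/"]
    for candidate in ["data/raw/2020.csv", "models/clf.pkl", "data/clean/sales.csv"]:
        print(candidate, "->", "ignored" if is_ignored(candidate, rules) else "kept")
```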

Who This Is Built For

data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with real data files alongside their code. If your project has CSVs, SQL, notebooks, or Excel files, this tool will dramatically improve the quality of your LLM interactions with that project.

If you work in a pure software development context with no data files, a general-purpose packaging tool will serve you well. But if your project looks anything like a real data science workflow, give data2prompt a try.

GitHub: https://github.com/arianmokhtariha/data2prompt
PyPI: https://pypi.org/project/data2prompt

Open to questions, feedback, and contributions — drop them in the comments or open an issue on GitHub.
