Arian Mokhtariha

Posted on May 31

I Tried Repomix on My Data Science Project. It Generated a 22,000 KB File. So I Built My Own Tool

#ai #code2prompt #repomix #llm

A few months ago a friend showed me two tools — Repomix and code2prompt. The idea was simple: point them at your project folder, they package everything into one file, you paste it into an LLM and ask questions about your whole codebase at once. For his pure Python projects they worked great.

I was working on a data analytics project at the time — dimension and fact CSVs, a SQL dump, some Power BI files, Jupyter notebooks with ML models. I ran Repomix on it and got a 22,085 KB output file. code2prompt gave me 9,304 KB. I tried pasting either of them into Claude. It choked immediately.

So I opened the files to see what was actually inside them. What I found was the root of the problem.

What These Tools Get Wrong for Data Projects

Repomix and code2prompt are built for code repos. They operate on a simple principle: read every file, dump every file. That works fine when your project is Python scripts and config files. It completely falls apart when your project looks like mine.

Here's what was inflating those files:

Raw CSV dumps. My Fact_Sales.csv had tens of thousands of rows. The tool dumped every single one. An LLM doesn't need 50,000 rows of sales data — it needs to understand the structure and a representative sample.

Endless SQL INSERT statements. My Superstore.sql file had the full database dump including every INSERT INTO statement for every table. The schema — the CREATE TABLE blocks — is what an LLM actually needs. The data rows are mostly noise.

Notebook outputs with base64 images. Jupyter notebooks store cell outputs as JSON inside the .ipynb file. When a cell generates a matplotlib chart, that chart gets saved as a base64-encoded image string inside the notebook. A single chart output can be 50,000+ characters of base64 garbage that an LLM cannot use at all.

Binary files read as text. My .pbix (Power BI) files are binary. These tools attempted to read them as text and produced corrupted garbage that consumed tokens while providing zero information.

There was no tool that understood these problems. So after three months of building (and, to be honest, I'm a data scientist, not a developer, so this was heavily AI-assisted), I shipped data2prompt.

What data2prompt Does Differently

The core idea is that each file type in a data project needs its own strategy, not a generic "read and dump" approach.

CSVs and Excel: Smart Sampling

Instead of dumping all rows, data2prompt takes a random sample:

df = df.sample(sample_size, random_state=seed)

The default is 15 rows, configurable with --csv-sample-size. Critically, it's random sampling with a fixed seed, not head/tail. The first and last rows of a file are often sorted or clustered (by date, by region, by ID), so a random sample gives the LLM a more representative picture of value diversity across the dataset. The seed (default 42) makes output reproducible across runs.

The output tells the LLM exactly what it's looking at:

-- [Sample - Random 15 rows] --
| order_id | customer_name | sales  | profit |
|----------|--------------|--------|--------|
| CA-2019  | John Smith   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --

For Excel files, each sheet is sampled independently. The parser also detects sheets that are purely visual dashboards (charts, images only) and notes them rather than producing an empty table.
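
The sampling step above can be sketched without pandas. This is a minimal illustration using only the standard library, not data2prompt's actual implementation: the function name `sample_csv_as_markdown` is hypothetical, and two details are my own assumptions, namely capping the sample at the number of available rows (so small files don't raise an error) and keeping sampled rows in their original file order.

```python
import csv
import io
import random


def sample_csv_as_markdown(csv_text: str, sample_size: int = 15, seed: int = 42) -> str:
    """Return a Markdown table holding a seeded random sample of the CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "-- [Empty CSV] --"
    header, data = rows[0], rows[1:]

    # Assumption: never ask for more rows than exist; small files are shown in full.
    k = min(sample_size, len(data))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    # Assumption: sort the picked indices so rows keep their original order.
    picked = sorted(rng.sample(range(len(data)), k))

    lines = [f"-- [Sample - Random {k} rows] --"]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("|" + "|".join("---" for _ in header) + "|")
    lines.extend("| " + " | ".join(data[i]) + " |" for i in picked)
    if k < len(data):
        lines.append(f"-- [CSV truncated: Showing random {k} rows to save context] --")
    return "\n".join(lines)


# Hypothetical 100-row file, sampled down to 3 rows.
csv_text = "order_id,sales\n" + "\n".join(f"CA-{i},{i * 10}" for i in range(100))
print(sample_csv_as_markdown(csv_text, sample_size=3))
```

Running it twice with the same seed prints the same three rows, which is the property that makes the generated prompt file stable between runs.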

SQL Files: Schema Preserved, Data Sampled

This was the hardest parser to get right. SQL dump files typically follow a pattern: a CREATE TABLE block defining the schema, followed by hundreds or thousands of INSERT INTO statements loading the data.

data2prompt reads line by line and applies different logic to each part:

# Always preserve the schema
if "CREATE TABLE" in line_upper:
    flush_buffer()
    in_create_block = True
    processed_lines.append(line)

# Buffer INSERT rows for sampling
if is_insert or is_data_row:
    table_data_buffer.append(line)

When it hits a buffer of INSERT rows, it samples them randomly:

rest_indices = sorted(rng.sample(range(1, len(table_data_buffer)), sample_size - 1))
sampled_rows = [first_line] + [table_data_buffer[idx] for idx in rest_indices]

The first line (the INSERT header) is always preserved. The rest are random samples in their original order. The LLM gets the full schema of every table plus a representative data sample — which is exactly what it needs to understand your database.
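
To show how the two fragments above fit together, here is a self-contained sketch of the same idea. It is a simplification under stated assumptions, not the real parser: it assumes one INSERT statement per line and a CREATE TABLE block that ends on a line finishing with a semicolon, and the function name `sample_sql_dump` and the "rows omitted" marker line are hypothetical.

```python
import random


def sample_sql_dump(sql_text: str, sample_size: int = 5, seed: int = 42) -> str:
    """Keep every CREATE TABLE block; sample the rows of each INSERT run."""
    sample_size = max(1, sample_size)
    rng = random.Random(seed)
    out: list[str] = []
    buffer: list[str] = []  # consecutive INSERT lines for the current table
    in_create_block = False

    def flush_buffer() -> None:
        """Emit the first buffered row plus a random sample of the rest."""
        if len(buffer) <= sample_size:
            out.extend(buffer)
        else:
            rest = sorted(rng.sample(range(1, len(buffer)), sample_size - 1))
            out.append(buffer[0])
            out.extend(buffer[i] for i in rest)
            # Hypothetical marker so the reader knows rows were dropped.
            out.append(f"-- [{len(buffer) - sample_size} INSERT rows omitted] --")
        buffer.clear()

    for line in sql_text.splitlines():
        upper = line.strip().upper()
        if upper.startswith("CREATE TABLE"):
            flush_buffer()
            in_create_block = True
        if in_create_block:
            out.append(line)  # schema lines are preserved verbatim
            if upper.endswith(";"):
                in_create_block = False
        elif upper.startswith("INSERT INTO"):
            buffer.append(line)
        else:
            flush_buffer()  # any other line ends the current INSERT run
            out.append(line)
    flush_buffer()
    return "\n".join(out)


# Hypothetical dump: one table, 1,000 single-line INSERT statements.
dump = "\n".join(
    ["CREATE TABLE sales (", "  id INT,", "  amount REAL", ");"]
    + [f"INSERT INTO sales VALUES ({i}, {i * 1.5});" for i in range(1000)]
)
print(sample_sql_dump(dump, sample_size=5))
```

Sorting the sampled indices is what keeps the surviving rows in their original order, and the length check in `flush_buffer` avoids asking `rng.sample` for more rows than the buffer holds.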

Jupyter Notebooks: Source Code Only

Notebook cells store three things: source code, execution count, and outputs. The outputs are what bloat the file — printed dataframes, matplotlib charts as base64, error tracebacks.

data2prompt keeps the source code of every cell and strips the outputs. A notebook that was 8 MB of JSON becomes a clean sequence of code cells. For cases where outputs are genuinely useful text and are kept, the parser truncates unusually long lines and caps each output block at a configurable line limit.

For the XML format, each cell becomes a structured block:

<cell path="ML/Q1/Q1.ipynb" index="3" type="code">
    <content>
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
    </content>
</cell>
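
Because an .ipynb file is plain JSON, the source-only extraction described above needs nothing beyond the standard library. The sketch below is an illustration of the idea rather than data2prompt's code; the function name `notebook_cells` and the demo notebook are hypothetical.

```python
import json


def notebook_cells(ipynb_text: str) -> list[tuple[int, str, str]]:
    """Return (index, cell_type, source) for each cell, ignoring all outputs."""
    notebook = json.loads(ipynb_text)
    cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append((index, cell.get("cell_type", "unknown"), source))
    return cells


# Hypothetical two-cell notebook whose code cell carries a large fake image.
demo = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Model"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["import math\n", "print(math.pi)"],
         "outputs": [{"output_type": "display_data", "metadata": {},
                      "data": {"image/png": "iVBORw0KGgo" * 5000}}]},
    ],
})
cells = notebook_cells(demo)
kept = sum(len(source) for _, _, source in cells)
print(f"{len(demo)} characters in, {kept} characters of source kept")
```

The "outputs" key is simply never read, so base64 image payloads cannot leak into the prompt.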

Binary Files: Listed, Not Read

.pbix, .parquet, .pkl, .db, .sqlite, .feather, .h5 — all listed in the directory tree so the LLM knows they exist, but content is skipped entirely. No garbage bytes consuming your context window.
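
One simple way to get this behaviour is an extension check while walking the project tree. The sketch below assumes that approach; the post does not say how data2prompt detects binaries, and the function name `build_tree` and the "[binary, content skipped]" label are hypothetical.

```python
from pathlib import Path

# Extensions named in the post, treated here as opaque binaries.
BINARY_EXTENSIONS = {".pbix", ".parquet", ".pkl", ".db", ".sqlite", ".feather", ".h5"}


def build_tree(root: Path) -> tuple[list[str], list[Path]]:
    """Return (tree listing, files whose content should actually be read)."""
    listing: list[str] = []
    readable: list[Path] = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if path.suffix.lower() in BINARY_EXTENSIONS:
            # Listed so the LLM knows the file exists, but never opened.
            listing.append(f"{rel} [binary, content skipped]")
        else:
            listing.append(rel)
            readable.append(path)
    return listing, readable
```

Calling `build_tree(Path("."))` in a project folder returns the full file listing plus the subset of paths that are safe to hand to the text parsers.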

Two Output Formats: Markdown and XML

data2prompt supports both --format markdown (default) and --format xml.

The XML format was added because Anthropic's prompt engineering documentation recommends XML-style tags as a way to help Claude parse the distinct parts of a structured prompt. The full project gets wrapped in a structured hierarchy:

<codebase name="superstore-analysis">
  <metadata>
    <generated_on>2025-05-31 09:00</generated_on>
    <total_tokens method="o200k_base">48293</total_tokens>
  </metadata>
  <directory_structure>
    Fact&dim-csv\Fact_Sales.csv
    Superstore.sql
    ...
  </directory_structure>
  <files>
    <file path="Fact&dim-csv\Fact_Sales.csv">
      ...sampled table...
    </file>
  </files>
</codebase>
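
If you build a wrapper like this yourself, note that a path such as Fact&dim-csv contains an ampersand, which strict XML requires to be escaped as `&amp;` inside attributes and text. The sketch below shows one way to get that for free with the standard library's ElementTree. Whether data2prompt escapes its output this way is not stated here, and the function name `build_codebase_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_codebase_xml(name: str, files: dict[str, str]) -> str:
    """Wrap file contents in a <codebase> hierarchy with proper XML escaping."""
    root = ET.Element("codebase", name=name)
    structure = ET.SubElement(root, "directory_structure")
    structure.text = "\n" + "\n".join(files) + "\n"
    files_el = ET.SubElement(root, "files")
    for path, content in files.items():
        file_el = ET.SubElement(files_el, "file", path=path)
        file_el.text = content  # special characters are escaped on serialization
    return ET.tostring(root, encoding="unicode")


xml_out = build_codebase_xml(
    "superstore-analysis",
    {"Fact&dim-csv/Fact_Sales.csv": "| sales |\n| 245.00 |"},
)
print(xml_out)
```

LLMs tolerate a raw ampersand without complaint, so escaping matters mainly if the file will also be read by an XML parser.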

The Benchmark

Same project, same files, three tools:

Tool	Output Size
Repomix	22,085 KB
code2prompt	9,304 KB
data2prompt	241 KB

That's a 98.9% reduction vs Repomix and 97.4% vs code2prompt on the same data-heavy project, while preserving all the structurally useful information — schemas, sampled data, notebook logic, file tree.
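
The percentages follow directly from the sizes in the table. A quick check, with a hypothetical helper named `reduction_percent`:

```python
def reduction_percent(baseline_kb: float, reduced_kb: float) -> float:
    """Percentage by which reduced_kb is smaller than baseline_kb."""
    return (1 - reduced_kb / baseline_kb) * 100


baseline_sizes_kb = {"Repomix": 22_085, "code2prompt": 9_304}
data2prompt_kb = 241

for tool, size_kb in baseline_sizes_kb.items():
    print(f"vs {tool}: {reduction_percent(size_kb, data2prompt_kb):.1f}% smaller")
```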

The reduction is so dramatic specifically because of the project type. A pure Python project would show a much smaller gap. That's exactly the point — this tool is built for data projects, not code projects.

Installation and Usage

# Install
pip install data2prompt

# Recommended: use pipx for isolated install
pipx install data2prompt

Basic usage — run inside your project directory:

# Default: Markdown output
data2prompt

# XML output (better for LLM structured parsing)
data2prompt --format xml

# Increase CSV sample size
data2prompt --csv-sample-size 25

# Custom output file name
data2prompt --output my_project_context

# Ignore specific folders
data2prompt --ignore-folders data/raw

The output file (default: PROMPT.md or PROMPT.xml) is ready to paste directly into Claude, ChatGPT, Gemini, or any LLM with a large context window.

You can also create a .data2promptignore file in your project root — same syntax as .gitignore — to exclude specific files or patterns permanently.

Who This Is For

data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with:

CSV/Excel data files
SQL database dumps
Jupyter notebooks
Power BI or other binary analytics files
Mixed projects with both code and data

If your project is purely Python scripts with no data files, Repomix or code2prompt will serve you fine. But if your project looks anything like a real data science workflow, give data2prompt a try.

GitHub: https://github.com/arianmokhtariha/data2prompt

PyPI: https://pypi.org/project/data2prompt

Questions about how the SQL parser works, or why I chose random sampling over head/tail? Drop them in the comments.
