<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arian Mokhtariha</title>
    <description>The latest articles on DEV Community by Arian Mokhtariha (@arian_mokhtariha_6206efac).</description>
    <link>https://dev.to/arian_mokhtariha_6206efac</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3961717%2F193a1aac-fb1c-484d-bf46-c7f092cae8fb.jpg</url>
      <title>DEV Community: Arian Mokhtariha</title>
      <link>https://dev.to/arian_mokhtariha_6206efac</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arian_mokhtariha_6206efac"/>
    <language>en</language>
    <item>
      <title>I Tried Repomix on My Data Science Project. It Generated a 22,000 KB File. So I Built My Own Tool</title>
      <dc:creator>Arian Mokhtariha</dc:creator>
      <pubDate>Sun, 31 May 2026 22:31:28 +0000</pubDate>
      <link>https://dev.to/arian_mokhtariha_6206efac/i-tried-repomix-on-my-data-science-project-it-generated-a-22000-kb-file-so-i-built-my-own-tool-58j7</link>
      <guid>https://dev.to/arian_mokhtariha_6206efac/i-tried-repomix-on-my-data-science-project-it-generated-a-22000-kb-file-so-i-built-my-own-tool-58j7</guid>
      <description>&lt;p&gt;A few months ago a friend showed me two tools — Repomix and code2prompt. The idea was simple: point them at your project folder, they package everything into one file, you paste it into an LLM and ask questions about your whole codebase at once. For his pure Python projects they worked great.&lt;/p&gt;

&lt;p&gt;I was working on a data analytics project at the time: dimension and fact CSVs, a SQL dump, some Power BI files, and Jupyter notebooks with ML models. I ran Repomix on it and got a 22,085 KB output file. code2prompt gave me 9,304 KB. I tried pasting each of them into Claude. It choked immediately.&lt;/p&gt;

&lt;p&gt;So I opened the files to see what was actually inside them. What I found was the root of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What These Tools Get Wrong for Data Projects
&lt;/h2&gt;

&lt;p&gt;Repomix and code2prompt are built for &lt;em&gt;code&lt;/em&gt; repos. They operate on a simple principle: read every file, dump every file. That works fine when your project is Python scripts and config files. It completely falls apart when your project looks like mine.&lt;/p&gt;

&lt;p&gt;Here's what was inflating those files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw CSV dumps.&lt;/strong&gt; My Fact_Sales.csv had tens of thousands of rows. The tool dumped every single one. An LLM doesn't need 50,000 rows of sales data — it needs to understand the structure and a representative sample.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endless SQL INSERT statements.&lt;/strong&gt; My Superstore.sql file had the full database dump including every INSERT INTO statement for every table. The schema — the CREATE TABLE blocks — is what an LLM actually needs. The data rows are mostly noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notebook outputs with base64 images.&lt;/strong&gt; Jupyter notebooks store cell outputs as JSON inside the .ipynb file. When a cell generates a matplotlib chart, that chart gets saved as a base64-encoded image string inside the notebook. A single chart output can be 50,000+ characters of base64 garbage that an LLM cannot use at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary files read as text.&lt;/strong&gt; My .pbix (Power BI) files are binary. These tools attempted to read them as text and produced corrupted garbage that consumed tokens while providing zero information.&lt;/p&gt;

&lt;p&gt;I couldn't find a tool that handled these problems. So after three months of building (and, to be honest, I'm a data scientist, not a developer, so this was heavily AI-assisted) I shipped data2prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  What data2prompt Does Differently
&lt;/h2&gt;

&lt;p&gt;The core idea is that each file type in a data project needs its own strategy, not a generic "read and dump" approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSVs and Excel: Smart Sampling
&lt;/h3&gt;

&lt;p&gt;Instead of dumping all rows, data2prompt takes a random sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default is 15 rows, configurable with &lt;code&gt;--csv-sample-size&lt;/code&gt;. Critically, it's &lt;strong&gt;random&lt;/strong&gt; sampling with a fixed seed, not head/tail. Random sampling gives the LLM a more representative picture of value diversity across the dataset. The seed (default 42) makes output reproducible across runs.&lt;/p&gt;

&lt;p&gt;The output tells the LLM exactly what it's looking at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- [Sample - Random 15 rows] --
| order_id | customer_name | sales  | profit |
|----------|--------------|--------|--------|
| CA-2019  | John Smith   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
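&lt;p&gt;To make the idea concrete, here is a minimal sketch of seeded sampling plus the table rendering, written with only the standard library instead of pandas. The function name &lt;code&gt;sample_csv_as_markdown&lt;/code&gt; is hypothetical, the input is assumed to have a header row, and sorting the picked rows back into file order is a choice made for this sketch, not necessarily what data2prompt does internally.&lt;/p&gt;

```python
import csv
import io
import random


def sample_csv_as_markdown(csv_text, sample_size=15, seed=42):
    """Return a Markdown table holding a seeded random sample of the CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]  # assumes the first row is a header

    # Never ask for more rows than exist; random.sample would raise ValueError.
    k = min(sample_size, len(data))
    rng = random.Random(seed)  # fixed seed, so every run picks the same rows
    picked = sorted(rng.sample(range(len(data)), k))  # sketch choice: file order

    lines = [f"-- [Sample - Random {k} rows] --"]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for idx in picked:
        lines.append("| " + " | ".join(data[idx]) + " |")
    if len(data) > k:
        lines.append(f"-- [CSV truncated: Showing random {k} rows to save context] --")
    return "\n".join(lines)


# Made-up demo data: 100 rows, of which 5 are kept.
csv_text = "order_id,sales\n" + "\n".join(f"CA-{i},{i * 10}" for i in range(100))
print(sample_csv_as_markdown(csv_text, sample_size=5))
```

&lt;p&gt;Capping the sample at the number of available rows matters for small files: asking for 15 rows from a 10-row CSV would otherwise raise an error.&lt;/p&gt;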



&lt;p&gt;For Excel files, each sheet is sampled independently. The parser also detects sheets that are purely visual dashboards (charts, images only) and notes them rather than producing an empty table.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Files: Schema Preserved, Data Sampled
&lt;/h3&gt;

&lt;p&gt;This was the hardest parser to get right. SQL dump files typically follow a pattern: a CREATE TABLE block defining the schema, followed by hundreds or thousands of INSERT INTO statements loading the data.&lt;/p&gt;

&lt;p&gt;data2prompt reads line by line and applies different logic to each part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Always preserve the schema
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line_upper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;flush_buffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;in_create_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;processed_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Buffer INSERT rows for sampling
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_insert&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;is_data_row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it hits a buffer of INSERT rows, it samples them randomly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rest_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;sample_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sampled_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_line&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rest_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line (the INSERT header) is always preserved. The rest are random samples in their original order. The LLM gets the full schema of every table plus a representative data sample — which is exactly what it needs to understand your database.&lt;/p&gt;
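&lt;p&gt;Putting the pieces together, a simplified and self-contained version of this logic might look like the sketch below. It assumes one INSERT statement per line, which is narrower than the real parser (that one also tracks multi-line data rows). The function name &lt;code&gt;sample_sql_dump&lt;/code&gt; and the "omitted" marker line are hypothetical, and &lt;code&gt;sample_size&lt;/code&gt; is assumed to be at least 1.&lt;/p&gt;

```python
import random


def sample_sql_dump(sql_text, sample_size=5, seed=42):
    """Keep every non-INSERT line; sample each run of INSERT lines."""
    rng = random.Random(seed)
    out, buffer = [], []

    def flush_buffer():
        # Emit the buffered INSERT lines, sampled down to sample_size.
        if len(buffer) > sample_size:
            # The first line is always kept; the rest are drawn at random
            # and sorted so they appear in their original order.
            rest = sorted(rng.sample(range(1, len(buffer)), sample_size - 1))
            out.extend([buffer[0]] + [buffer[i] for i in rest])
            out.append(f"-- [{len(buffer) - sample_size} INSERT lines omitted] --")
        else:
            out.extend(buffer)
        buffer.clear()

    for line in sql_text.splitlines():
        if line.strip().upper().startswith("INSERT INTO"):
            buffer.append(line)
        else:
            flush_buffer()    # a non-INSERT line ends the current run
            out.append(line)  # schema, comments and blank lines pass through
    flush_buffer()
    return "\n".join(out)


# Made-up demo dump: one table with 50 single-line INSERT statements.
dump = "\n".join(
    ["CREATE TABLE orders (", "  id INT,", "  sales REAL", ");"]
    + [f"INSERT INTO orders VALUES ({i}, {i * 1.5});" for i in range(50)]
)
print(sample_sql_dump(dump, sample_size=3))
```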

&lt;h3&gt;
  
  
  Jupyter Notebooks: Source Code Only
&lt;/h3&gt;

&lt;p&gt;Notebook cells store three things: source code, execution count, and outputs. The outputs are what bloat the file — printed dataframes, matplotlib charts as base64, error tracebacks.&lt;/p&gt;

&lt;p&gt;data2prompt keeps the source code of every cell and strips the bulky outputs. A notebook that was 8 MB of JSON becomes a clean sequence of code cells. The parser also truncates unusually long lines and, where text outputs are kept because they are genuinely useful, caps each output block at a configurable line limit.&lt;/p&gt;

&lt;p&gt;For the XML format, each cell becomes a structured block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;cell&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"ML/Q1/Q1.ipynb"&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"3"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;content&amp;gt;&lt;/span&gt;
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
    &lt;span class="nt"&gt;&amp;lt;/content&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/cell&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
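&lt;p&gt;The stripping step itself is small, because an &lt;code&gt;.ipynb&lt;/code&gt; file is plain JSON. The sketch below is an illustration only: &lt;code&gt;notebook_sources&lt;/code&gt; is a hypothetical name, and it drops every output rather than capping text outputs the way the real parser can.&lt;/p&gt;

```python
import json


def notebook_sources(ipynb_text):
    """Return (index, cell_type, source) for each cell, ignoring all outputs."""
    notebook = json.loads(ipynb_text)
    cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        # The notebook format allows source to be one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # The "outputs" key is never read, so base64 images are simply dropped.
        cells.append((index, cell.get("cell_type", "code"), source))
    return cells


# Made-up notebook with one code cell and a large fake image output.
demo = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": ["import pandas as pd\n", "df = pd.read_csv('sales.csv')"],
            "outputs": [{"output_type": "display_data",
                         "data": {"image/png": "iVBORw0KGgo" * 1000}}],
        }
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})
for index, cell_type, source in notebook_sources(demo):
    print(index, cell_type, source, sep="\n")
```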



&lt;h3&gt;
  
  
  Binary Files: Listed, Not Read
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;.pbix&lt;/code&gt;, &lt;code&gt;.parquet&lt;/code&gt;, &lt;code&gt;.pkl&lt;/code&gt;, &lt;code&gt;.db&lt;/code&gt;, &lt;code&gt;.sqlite&lt;/code&gt;, &lt;code&gt;.feather&lt;/code&gt;, &lt;code&gt;.h5&lt;/code&gt; — all listed in the directory tree so the LLM knows they exist, but content is skipped entirely. No garbage bytes consuming your context window.&lt;/p&gt;
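&lt;p&gt;As a rough illustration of "listed, not read", the sketch below walks a folder, records every file in the tree, and marks only non-binary files as readable. The extension set is copied from the list above; the function name &lt;code&gt;scan_project&lt;/code&gt; and the extension-only check are assumptions of this sketch, not a description of data2prompt's internals.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Extensions named in the article; hard-coded here for illustration.
BINARY_EXTENSIONS = {".pbix", ".parquet", ".pkl", ".db", ".sqlite", ".feather", ".h5"}


def scan_project(root):
    """Return (tree, readable): every file path, and the subset safe to read."""
    root = Path(root)
    tree, readable = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        tree.append(relative)  # binary files still show up in the tree
        if path.suffix.lower() not in BINARY_EXTENSIONS:
            readable.append(relative)  # only these would be opened later
    return tree, readable


# Demo on a throwaway directory with one binary and one text file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pbix").write_bytes(b"\x00\x01")
    (Path(tmp) / "etl.py").write_text("print('hi')")
    tree, readable = scan_project(tmp)
    print("tree:", tree)
    print("readable:", readable)
```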




&lt;h2&gt;
  
  
  Two Output Formats: Markdown and XML
&lt;/h2&gt;

&lt;p&gt;data2prompt supports both &lt;code&gt;--format markdown&lt;/code&gt; (default) and &lt;code&gt;--format xml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The XML format was added because Anthropic's prompting guidance recommends XML-style tags for separating the parts of a prompt, which helps the model parse them reliably. The full project gets wrapped in a structured hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;codebase&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"superstore-analysis"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;metadata&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;generated_on&amp;gt;&lt;/span&gt;2025-05-31 09:00&lt;span class="nt"&gt;&amp;lt;/generated_on&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;total_tokens&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"o200k_base"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;48293&lt;span class="nt"&gt;&amp;lt;/total_tokens&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/metadata&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;directory_structure&amp;gt;&lt;/span&gt;
    Fact&lt;span class="err"&gt;&amp;amp;&lt;/span&gt;dim-csv\Fact_Sales.csv
    Superstore.sql
    ...
  &lt;span class="nt"&gt;&amp;lt;/directory_structure&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;files&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;file&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"Fact&amp;amp;dim-csv\Fact_Sales.csv"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      ...sampled table...
    &lt;span class="nt"&gt;&amp;lt;/file&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/files&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/codebase&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
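&lt;p&gt;One way to produce a wrapper like this safely is to build it with Python's &lt;code&gt;xml.etree.ElementTree&lt;/code&gt;, which escapes special characters in paths and content on output. This is only a sketch: &lt;code&gt;build_codebase_xml&lt;/code&gt; is a hypothetical helper, the metadata block is left out, and data2prompt may well assemble its XML differently.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def build_codebase_xml(name, files):
    """Build the wrapper document from a {path: content} mapping."""
    codebase = ET.Element("codebase", name=name)
    structure = ET.SubElement(codebase, "directory_structure")
    structure.text = "\n".join(files)  # one path per line
    files_el = ET.SubElement(codebase, "files")
    for path, content in files.items():
        file_el = ET.SubElement(files_el, "file", path=path)
        file_el.text = content
    # ElementTree escapes special characters in attributes and text on output.
    return ET.tostring(codebase, encoding="unicode")


# Made-up file contents standing in for the sampled output.
xml_text = build_codebase_xml(
    "superstore-analysis",
    {"Fact and dim/Fact_Sales.csv": "...sampled table...", "Superstore.sql": "..."},
)
print(xml_text)
```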






&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;Same project, same files, three tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Repomix&lt;/td&gt;
&lt;td&gt;22,085 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;code2prompt&lt;/td&gt;
&lt;td&gt;9,304 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;data2prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;241 KB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a 98.9% reduction vs Repomix and 97.4% vs code2prompt on the same data-heavy project, while preserving all the structurally useful information — schemas, sampled data, notebook logic, file tree.&lt;/p&gt;

&lt;p&gt;The reduction is so dramatic specifically because of the project type. A pure Python project would show a much smaller gap. That's exactly the point — this tool is built for data projects, not code projects.&lt;/p&gt;
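&lt;p&gt;For anyone who wants to check the percentages, the arithmetic is just one minus the size ratio. The helper name &lt;code&gt;reduction_percent&lt;/code&gt; is made up for this snippet; the sizes are the ones from the table.&lt;/p&gt;

```python
def reduction_percent(baseline_kb, reduced_kb):
    """Percentage by which reduced_kb is smaller than baseline_kb."""
    return (1 - reduced_kb / baseline_kb) * 100


print(round(reduction_percent(22085, 241), 1))  # vs Repomix -> 98.9
print(round(reduction_percent(9304, 241), 1))   # vs code2prompt -> 97.4
```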




&lt;h2&gt;
  
  
  Installation and Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt

&lt;span class="c"&gt;# Recommended: use pipx for isolated install&lt;/span&gt;
pipx &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic usage — run inside your project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Default: Markdown output&lt;/span&gt;
data2prompt

&lt;span class="c"&gt;# XML output (better for LLM structured parsing)&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--format&lt;/span&gt; xml

&lt;span class="c"&gt;# Increase CSV sample size&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--csv-sample-size&lt;/span&gt; 25

&lt;span class="c"&gt;# Custom output file name&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--output&lt;/span&gt; my_project_context

&lt;span class="c"&gt;# Ignore specific folders&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--ignore-folders&lt;/span&gt; data/raw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output file (default: &lt;code&gt;PROMPT.md&lt;/code&gt; or &lt;code&gt;PROMPT.xml&lt;/code&gt;) is ready to paste directly into Claude, ChatGPT, Gemini, or any LLM with a large context window.&lt;/p&gt;

&lt;p&gt;You can also create a &lt;code&gt;.data2promptignore&lt;/code&gt; file in your project root — same syntax as &lt;code&gt;.gitignore&lt;/code&gt; — to exclude specific files or patterns permanently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV/Excel data files&lt;/li&gt;
&lt;li&gt;SQL database dumps&lt;/li&gt;
&lt;li&gt;Jupyter notebooks&lt;/li&gt;
&lt;li&gt;Power BI or other binary analytics files&lt;/li&gt;
&lt;li&gt;Mixed projects with both code and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your project is purely Python scripts with no data files, Repomix or code2prompt will serve you fine. But if your project looks anything like a real data science workflow, give data2prompt a try.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/arianmokhtariha/data2prompt" rel="noopener noreferrer"&gt;https://github.com/arianmokhtariha/data2prompt&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/data2prompt" rel="noopener noreferrer"&gt;https://pypi.org/project/data2prompt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions about how the SQL parser works or why data2prompt uses random sampling instead of head/tail? Drop them in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>code2prompt</category>
      <category>repomix</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
