<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arian Mokhtariha</title>
    <description>The latest articles on DEV Community by Arian Mokhtariha (@arian_mokhtariha_6206efac).</description>
    <link>https://dev.to/arian_mokhtariha_6206efac</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3961717%2F193a1aac-fb1c-484d-bf46-c7f092cae8fb.jpg</url>
      <title>DEV Community: Arian Mokhtariha</title>
      <link>https://dev.to/arian_mokhtariha_6206efac</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arian_mokhtariha_6206efac"/>
    <language>en</language>
    <item>
      <title>Meet data2prompt: The CLI Tool That Finally Makes LLMs Understand Your Data Science Projects</title>
      <dc:creator>Arian Mokhtariha</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:25:43 +0000</pubDate>
      <link>https://dev.to/arian_mokhtariha_6206efac/meet-data2prompt-the-cli-tool-that-finally-makes-llms-understand-your-data-science-projects-lda</link>
      <guid>https://dev.to/arian_mokhtariha_6206efac/meet-data2prompt-the-cli-tool-that-finally-makes-llms-understand-your-data-science-projects-lda</guid>
      <description>&lt;p&gt;Every data scientist has hit this wall.&lt;/p&gt;

&lt;p&gt;You are deep in a project — CSVs, SQL dumps, Jupyter notebooks, maybe some Power BI files — and you want to ask an LLM to help you reason across the whole thing. Not just one script. The entire project. You want it to understand your data structure, your pipeline logic, your model decisions all at once.&lt;/p&gt;

&lt;p&gt;So you try to package it up and paste it in. And it fails. The context window chokes. The LLM forgets files it saw earlier. The responses stop making sense.&lt;/p&gt;

&lt;p&gt;The problem is not your LLM. The problem is that nobody built the right packaging tool for data-heavy projects — until now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing data2prompt
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;data2prompt&lt;/strong&gt; is an open-source CLI tool that packages your entire data science project into a single, optimized, LLM-ready file. Not a generic dump of everything in your folder — a smart, data-aware output that knows how to handle CSVs, SQL, Jupyter notebooks, Excel files, and binary data files the way a data scientist actually needs them handled.&lt;/p&gt;

&lt;p&gt;Install it in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it from your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data2prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. You get a single &lt;code&gt;PROMPT.md&lt;/code&gt; or &lt;code&gt;PROMPT.xml&lt;/code&gt; file ready to paste into Claude, ChatGPT, Gemini, or any LLM with a large context window.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Generic Tools
&lt;/h2&gt;

&lt;p&gt;There are great tools out there for packaging software projects for LLM context. They work beautifully on codebases full of Python scripts and config files.&lt;/p&gt;

&lt;p&gt;But a data science project is not a software project. It contains fundamentally different file types that need fundamentally different handling — and when a generic tool encounters them, it does the worst possible thing: it dumps everything raw.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice:&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;CSV with 50,000 rows&lt;/strong&gt; gets written to the output in full — 50,000 rows of token-consuming noise when what the LLM actually needs is the schema and a representative sample.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;SQL dump&lt;/strong&gt; gets included with every single INSERT statement — hundreds of thousands of rows of raw data when what the LLM needs is the CREATE TABLE schema and a handful of example rows per table.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Jupyter notebook&lt;/strong&gt; gets written with all its outputs intact — which includes matplotlib charts and styled dataframes stored as base64-encoded image strings. A single notebook visualization can contribute 60,000 tokens of encoded image data that an LLM literally cannot read or use.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Power BI or pickle file&lt;/strong&gt; is binary — reading it as text produces corrupted garbage that fills your context window with meaningless characters.&lt;/p&gt;

&lt;p&gt;The result is a context file that is technically complete and practically useless. The token budget is consumed before the LLM has seen anything worth reasoning about.&lt;/p&gt;




&lt;h2&gt;
  
  
  How data2prompt Handles Each File Type
&lt;/h2&gt;

&lt;p&gt;data2prompt applies a dedicated strategy to every file type found in a data project.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSVs and Excel — Intelligent Random Sampling
&lt;/h3&gt;

&lt;p&gt;Instead of dumping all rows, data2prompt takes a random sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Default: 15 random rows per CSV&lt;/span&gt;
data2prompt

&lt;span class="c"&gt;# Increase sample size when you need more context&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--csv-sample-size&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sampling is &lt;strong&gt;random&lt;/strong&gt; (pseudo-random, driven by a fixed seed), not head/tail. The LLM therefore sees a representative spread of your actual data values, not just whatever happened to be at the top of the file. The fixed seed makes the output reproducible: the same input yields the same sample on every run.&lt;/p&gt;

&lt;p&gt;The output is clean and annotated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;-- [Sample - Random 15 rows of 52,411 total] --
| order_id | customer   | region | sales  | profit |
|----------|------------|--------|--------|--------|
| CA-2019  | John Smith | West   | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
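&lt;p&gt;As a rough illustration of seeded sampling plus the annotated table, here is a minimal standard-library sketch. It is not data2prompt's actual code: the function name &lt;code&gt;sample_csv_as_markdown&lt;/code&gt;, its parameters, and the seed value are assumptions made for the example, and it expects the first CSV row to be a header.&lt;/p&gt;

```python
import csv
import io
import random


def sample_csv_as_markdown(csv_text: str, sample_size: int = 15, seed: int = 42) -> str:
    """Return a Markdown table holding a seeded random sample of the CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]  # assumes the first row is a header
    k = min(sample_size, len(data))  # never ask for more rows than exist
    rng = random.Random(seed)  # private generator: same seed, same sample
    picked = sorted(rng.sample(range(len(data)), k))  # keep original row order
    lines = [
        f"-- [Sample - Random {k} rows of {len(data):,} total] --",
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(data[i]) + " |" for i in picked]
    return "\n".join(lines)
```

&lt;p&gt;Because the generator is created from the seed inside the function, two calls on the same text return identical output, which is the reproducibility property described above.&lt;/p&gt;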



&lt;p&gt;For Excel, each sheet is sampled independently. Sheets that are purely visual dashboards — no tabular data — are detected and noted rather than producing empty tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Files — Schema First, Data Sampled
&lt;/h3&gt;

&lt;p&gt;SQL dumps have two distinct parts: the schema (CREATE TABLE definitions) and the data (INSERT statements). data2prompt treats them completely differently.&lt;/p&gt;

&lt;p&gt;Schema blocks are always preserved in full. Every CREATE TABLE, every column definition, every constraint and foreign key relationship — this is exactly what the LLM needs to understand your database.&lt;/p&gt;

&lt;p&gt;INSERT statements are sampled randomly per table. You get a handful of representative rows for each table, enough to understand what the data looks like, without the thousands of repetitive INSERT lines that collapse your context budget.&lt;/p&gt;
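&lt;p&gt;The schema-first idea can be sketched in a few lines of Python. This is a simplified illustration, not the tool's parser: it assumes one INSERT statement per line, takes the first word after &lt;code&gt;INSERT INTO&lt;/code&gt; as the table name (so schema-qualified names are not handled), and the name &lt;code&gt;sample_sql_dump&lt;/code&gt; and its defaults are hypothetical.&lt;/p&gt;

```python
import random
import re
from collections import defaultdict

# Captures the table name after INSERT INTO, allowing one optional quote character.
INSERT_RE = re.compile(r"^\s*INSERT\s+INTO\s+\W?(\w+)", re.IGNORECASE)


def sample_sql_dump(sql_text: str, rows_per_table: int = 5, seed: int = 42) -> str:
    """Keep every non-INSERT line; keep a seeded sample of INSERT lines per table."""
    lines = sql_text.splitlines()
    inserts = defaultdict(list)  # table name -> indices of its INSERT lines
    for idx, line in enumerate(lines):
        match = INSERT_RE.match(line)
        if match:
            inserts[match.group(1).lower()].append(idx)

    rng = random.Random(seed)
    keep = set(range(len(lines)))
    for indices in inserts.values():
        k = min(rows_per_table, len(indices))
        sampled = set(rng.sample(indices, k))
        keep -= set(indices) - sampled  # drop the INSERT lines that were not sampled
    return "\n".join(lines[i] for i in sorted(keep))
```

&lt;p&gt;Everything that is not an INSERT line, including each CREATE TABLE block, passes through untouched, so the schema stays complete while the data shrinks to a few rows per table.&lt;/p&gt;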

&lt;h3&gt;
  
  
  Jupyter Notebooks — Logic Without the Noise
&lt;/h3&gt;

&lt;p&gt;Notebooks store both cell source code and cell outputs. The source code is what the LLM needs — your transformations, model definitions, evaluation logic, markdown explanations. The outputs are the problem.&lt;/p&gt;

&lt;p&gt;data2prompt extracts source code from every cell and discards outputs entirely. The base64 image strings, printed dataframes, and error tracebacks that bloat notebook files are stripped out. What remains is clean, readable notebook logic that an LLM can actually reason about.&lt;/p&gt;
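&lt;p&gt;A notebook file is plain JSON, so dropping outputs amounts to reading each cell's &lt;code&gt;source&lt;/code&gt; field and ignoring its &lt;code&gt;outputs&lt;/code&gt; field. The sketch below shows that idea with the standard library only; the function name &lt;code&gt;notebook_sources&lt;/code&gt; is an assumption for illustration and this is not necessarily how data2prompt implements it.&lt;/p&gt;

```python
import json


def notebook_sources(ipynb_text: str) -> list[tuple[str, str]]:
    """Return (cell_type, source) pairs, ignoring every cell's outputs."""
    notebook = json.loads(ipynb_text)
    cells = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # the notebook format allows a list of line strings
            source = "".join(source)
        cells.append((cell.get("cell_type", "code"), source))
    return cells
```

&lt;p&gt;Since the &lt;code&gt;outputs&lt;/code&gt; key is never read, base64 image payloads never reach the result, however large they are on disk.&lt;/p&gt;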

&lt;h3&gt;
  
  
  Binary Files — Acknowledged, Not Mangled
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;.pbix&lt;/code&gt;, &lt;code&gt;.pkl&lt;/code&gt;, &lt;code&gt;.parquet&lt;/code&gt;, &lt;code&gt;.db&lt;/code&gt;, &lt;code&gt;.sqlite&lt;/code&gt;, &lt;code&gt;.h5&lt;/code&gt;, &lt;code&gt;.feather&lt;/code&gt; — data2prompt lists these in your project tree so the LLM knows they exist, and skips their content entirely. No corrupted binary strings eating your context window.&lt;/p&gt;
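&lt;p&gt;One simple way to get "listed, not read" behaviour is to decide by file extension before ever opening the file. The following sketch assumes that approach; the extension set mirrors the list above, while the function name &lt;code&gt;describe_tree&lt;/code&gt; and the marker text are made up for the example.&lt;/p&gt;

```python
from pathlib import Path

# Extensions named in the post; treated here as "list but never read".
BINARY_EXTENSIONS = {".pbix", ".pkl", ".parquet", ".db", ".sqlite", ".h5", ".feather"}


def describe_tree(root: str) -> list[str]:
    """List every file under root, flagging binary ones instead of opening them."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.suffix.lower() in BINARY_EXTENSIONS:
            entries.append(f"{rel} [binary, content skipped]")
        else:
            entries.append(rel)
    return entries
```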




&lt;h2&gt;
  
  
  Two Output Formats: Markdown and XML
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clean Markdown (default)&lt;/span&gt;
data2prompt

&lt;span class="c"&gt;# Structured XML&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--format&lt;/span&gt; xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The XML format was added based on Anthropic's prompting guidance, which recommends XML-style tags to help a model parse long, structured prompts. Every file gets semantic tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;codebase&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"my-project"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;directory_structure&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/directory_structure&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;files&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;file&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"data\sales.csv"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      ...sampled table...
    &lt;span class="nt"&gt;&amp;lt;/file&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;file&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"notebooks\analysis.ipynb"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;cell&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;content&amp;gt;&lt;/span&gt;
          df = pd.read_csv('data/sales.csv')
          df.head()
        &lt;span class="nt"&gt;&amp;lt;/content&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/cell&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/file&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/files&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/codebase&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Markdown for quick analysis sessions. Use XML when you want the LLM to navigate a large project with maximum structural clarity.&lt;/p&gt;
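&lt;p&gt;For readers curious how such a wrapper could be produced, here is a minimal sketch using Python's built-in &lt;code&gt;xml.etree.ElementTree&lt;/code&gt;. The tag and attribute names follow the example above; the function &lt;code&gt;build_codebase_xml&lt;/code&gt; and its arguments are assumptions, and the real tool may assemble its output differently.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def build_codebase_xml(name: str, tree_text: str, files: dict[str, str]) -> str:
    """Wrap a directory listing and per-file text in codebase/files/file tags."""
    codebase = ET.Element("codebase", {"name": name})
    ET.SubElement(codebase, "directory_structure").text = tree_text
    files_el = ET.SubElement(codebase, "files")
    for path, content in files.items():
        file_el = ET.SubElement(files_el, "file", {"path": path})
        file_el.text = content  # ElementTree escapes special characters on output
    return ET.tostring(codebase, encoding="unicode")
```

&lt;p&gt;Letting the library serialize the tree means paths or file contents that contain ampersands or angle brackets are escaped automatically, so the result stays well-formed.&lt;/p&gt;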




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Same data science project — dimension and fact CSVs, a SQL dump, Power BI files, ML notebooks, classification and clustering scripts — run through three tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generic tool #1&lt;/td&gt;
&lt;td&gt;22,085 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic tool #2&lt;/td&gt;
&lt;td&gt;9,304 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;data2prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;241 KB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;241 KB versus 22,085 KB.&lt;/strong&gt; Same project. Same information that actually matters.&lt;/p&gt;

&lt;p&gt;That gap exists entirely because of the file-type-specific strategies above — the random CSV sampling, the SQL schema extraction, the notebook output stripping, the binary file handling. Nothing clever, just the right tool for the right files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install with pip...&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt

&lt;span class="c"&gt;# ...or with pipx (recommended, isolated environment); run one, not both&lt;/span&gt;
pipx &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt

&lt;span class="c"&gt;# Basic run — outputs PROMPT.md in project root&lt;/span&gt;
data2prompt

&lt;span class="c"&gt;# XML format&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--format&lt;/span&gt; xml

&lt;span class="c"&gt;# Custom CSV sample size&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--csv-sample-size&lt;/span&gt; 25

&lt;span class="c"&gt;# Ignore specific folders&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--ignore-folders&lt;/span&gt; data/raw archive models/checkpoints

&lt;span class="c"&gt;# Custom output filename&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--output&lt;/span&gt; my_project_context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;.data2promptignore&lt;/code&gt; file in your project root for permanent exclusions — same syntax as &lt;code&gt;.gitignore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/raw/
*.pkl
archive/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
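&lt;p&gt;To make the exclusion rules concrete, here is a deliberately rough sketch of how the three patterns above could be matched. It covers only root-relative directory patterns and simple globs, which is far less than full &lt;code&gt;.gitignore&lt;/code&gt; semantics (no negation, no &lt;code&gt;**&lt;/code&gt;), and the helper name &lt;code&gt;is_ignored&lt;/code&gt; is hypothetical rather than part of data2prompt.&lt;/p&gt;

```python
from fnmatch import fnmatch


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Rough .gitignore-style check covering directory and glob patterns only."""
    parts = rel_path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore anything located under that folder.
            if (rel_path + "/").startswith(pattern):
                return True
        elif fnmatch(parts[-1], pattern) or fnmatch(rel_path, pattern):
            return True
    return False
```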






&lt;h2&gt;
  
  
  Who This Is Built For
&lt;/h2&gt;

&lt;p&gt;data2prompt is specifically designed for &lt;strong&gt;data scientists, data analysts, and data engineers&lt;/strong&gt; who work with real data files alongside their code. If your project has CSVs, SQL, notebooks, or Excel files, this tool will dramatically improve the quality of your LLM interactions with that project.&lt;/p&gt;

&lt;p&gt;If you work in a pure software development context with no data files, a general-purpose packaging tool will serve you well. But if your project looks anything like a real data science workflow, give data2prompt a try.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/arianmokhtariha/data2prompt" rel="noopener noreferrer"&gt;https://github.com/arianmokhtariha/data2prompt&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/data2prompt" rel="noopener noreferrer"&gt;https://pypi.org/project/data2prompt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open to questions, feedback, and contributions — drop them in the comments or open an issue on GitHub.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Tried Repomix on My Data Science Project. It Generated a 22,000 KB File. So I Built My Own Tool</title>
      <dc:creator>Arian Mokhtariha</dc:creator>
      <pubDate>Sun, 31 May 2026 22:31:28 +0000</pubDate>
      <link>https://dev.to/arian_mokhtariha_6206efac/i-tried-repomix-on-my-data-science-project-it-generated-a-22000-kb-file-so-i-built-my-own-tool-58j7</link>
      <guid>https://dev.to/arian_mokhtariha_6206efac/i-tried-repomix-on-my-data-science-project-it-generated-a-22000-kb-file-so-i-built-my-own-tool-58j7</guid>
      <description>&lt;p&gt;A few months ago a friend showed me two tools — Repomix and code2prompt. The idea was simple: point them at your project folder, they package everything into one file, you paste it into an LLM and ask questions about your whole codebase at once. For his pure Python projects they worked great.&lt;/p&gt;

&lt;p&gt;I was working on a data analytics project at the time: dimension and fact CSVs, a SQL dump, some Power BI files, and Jupyter notebooks with ML models. I ran Repomix on it and got a 22,085 KB output file. code2prompt gave me 9,304 KB. I tried pasting each of them into Claude. Both times it choked immediately.&lt;/p&gt;

&lt;p&gt;So I opened the files to see what was actually inside them. What I found was the root of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What These Tools Get Wrong for Data Projects
&lt;/h2&gt;

&lt;p&gt;Repomix and code2prompt are built for &lt;em&gt;code&lt;/em&gt; repos. They operate on a simple principle: read every file, dump every file. That works fine when your project is Python scripts and config files. It completely falls apart when your project looks like mine.&lt;/p&gt;

&lt;p&gt;Here's what was inflating those files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw CSV dumps.&lt;/strong&gt; My Fact_Sales.csv had tens of thousands of rows. The tool dumped every single one. An LLM doesn't need 50,000 rows of sales data — it needs to understand the structure and a representative sample.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endless SQL INSERT statements.&lt;/strong&gt; My Superstore.sql file had the full database dump including every INSERT INTO statement for every table. The schema — the CREATE TABLE blocks — is what an LLM actually needs. The data rows are mostly noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notebook outputs with base64 images.&lt;/strong&gt; Jupyter notebooks store cell outputs as JSON inside the .ipynb file. When a cell generates a matplotlib chart, that chart gets saved as a base64-encoded image string inside the notebook. A single chart output can be 50,000+ characters of base64 garbage that an LLM cannot use at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary files read as text.&lt;/strong&gt; My .pbix (Power BI) files are binary. These tools attempted to read them as text and produced corrupted garbage that consumed tokens while providing zero information.&lt;/p&gt;

&lt;p&gt;There was no tool that understood these problems. So after three months of building (and, to be honest, I'm a data scientist, not a developer, so this was heavily AI-assisted), I shipped data2prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  What data2prompt Does Differently
&lt;/h2&gt;

&lt;p&gt;The core idea is that each file type in a data project needs its own strategy, not a generic "read and dump" approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSVs and Excel: Smart Sampling
&lt;/h3&gt;

&lt;p&gt;Instead of dumping all rows, data2prompt takes a random sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default is 15 rows, configurable with &lt;code&gt;--csv-sample-size&lt;/code&gt;. Critically, it's &lt;strong&gt;random&lt;/strong&gt; sampling with a fixed seed, not head/tail. Random sampling gives the LLM a more representative picture of value diversity across the dataset. The seed (default 42) makes output reproducible across runs.&lt;/p&gt;

&lt;p&gt;The output tells the LLM exactly what it's looking at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- [Sample - Random 15 rows] --
| order_id | customer_name | sales  | profit |
|----------|---------------|--------|--------|
| CA-2019  | John Smith    | 245.00 | 41.65  |
...
-- [CSV truncated: Showing random 15 rows to save context] --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Excel files, each sheet is sampled independently. The parser also detects sheets that are purely visual dashboards (charts, images only) and notes them rather than producing an empty table.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Files: Schema Preserved, Data Sampled
&lt;/h3&gt;

&lt;p&gt;This was the hardest parser to get right. SQL dump files typically follow a pattern: a CREATE TABLE block defining the schema, followed by hundreds or thousands of INSERT INTO statements loading the data.&lt;/p&gt;

&lt;p&gt;data2prompt reads line by line and applies different logic to each part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Always preserve the schema
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line_upper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;flush_buffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;in_create_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;processed_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Buffer INSERT rows for sampling
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_insert&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;is_data_row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it hits a buffer of INSERT rows, it samples them randomly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rest_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;sample_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sampled_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_line&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;table_data_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rest_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line (the INSERT header) is always preserved. The rest are random samples in their original order. The LLM gets the full schema of every table plus a representative data sample — which is exactly what it needs to understand your database.&lt;/p&gt;
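&lt;p&gt;One edge case worth spelling out: &lt;code&gt;random.sample&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt; when asked for more items than the population holds, so a small table needs the request clamped. The helper below is a sketch of the same "first line plus ordered random rest" idea with that guard added. It reuses the variable names from the excerpt, but the function itself and its seed default are assumptions, not the project's code.&lt;/p&gt;

```python
import random


def sample_insert_buffer(table_data_buffer: list[str], sample_size: int, seed: int = 42) -> list[str]:
    """Keep the first buffered line plus a seeded sample of the rest, in order."""
    if not table_data_buffer:
        return []
    rest = range(1, len(table_data_buffer))
    # random.sample raises ValueError when asked for more items than exist,
    # so clamp the request to the number of rows after the first line.
    k = max(0, min(sample_size - 1, len(rest)))
    rng = random.Random(seed)
    rest_indices = sorted(rng.sample(rest, k))  # sorted: original order preserved
    return [table_data_buffer[0]] + [table_data_buffer[i] for i in rest_indices]
```

&lt;p&gt;With this guard, a buffer shorter than the sample size is simply returned whole.&lt;/p&gt;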

&lt;h3&gt;
  
  
  Jupyter Notebooks: Source Code Only
&lt;/h3&gt;

&lt;p&gt;Notebook cells store three things: source code, execution count, and outputs. The outputs are what bloat the file — printed dataframes, matplotlib charts as base64, error tracebacks.&lt;/p&gt;

&lt;p&gt;data2prompt keeps the source code of every cell and strips the outputs. A notebook that was 8 MB of JSON becomes a clean sequence of code cells. The parser also truncates unusually long lines and, for cases where outputs are genuinely useful text, caps output blocks at a configurable line limit.&lt;/p&gt;
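&lt;p&gt;Line truncation and block capping are easy to picture with a small helper. The sketch below is illustrative only: the name &lt;code&gt;cap_text_block&lt;/code&gt;, both default limits, and the marker strings are invented for the example and are not data2prompt's real settings.&lt;/p&gt;

```python
def cap_text_block(text: str, max_lines: int = 20, max_line_length: int = 500) -> str:
    """Shorten over-long lines, then keep at most max_lines lines of the block."""
    lines = []
    for line in text.splitlines():
        if len(line) > max_line_length:
            line = line[:max_line_length] + " ...[line truncated]"
        lines.append(line)
    if len(lines) > max_lines:
        omitted = len(lines) - max_lines
        lines = lines[:max_lines] + [f"...[{omitted} more lines omitted]"]
    return "\n".join(lines)
```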

&lt;p&gt;For the XML format, each cell becomes a structured block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;cell&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"ML/Q1/Q1.ipynb"&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"3"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;content&amp;gt;&lt;/span&gt;
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
    &lt;span class="nt"&gt;&amp;lt;/content&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/cell&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Binary Files: Listed, Not Read
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;.pbix&lt;/code&gt;, &lt;code&gt;.parquet&lt;/code&gt;, &lt;code&gt;.pkl&lt;/code&gt;, &lt;code&gt;.db&lt;/code&gt;, &lt;code&gt;.sqlite&lt;/code&gt;, &lt;code&gt;.feather&lt;/code&gt;, &lt;code&gt;.h5&lt;/code&gt; — all listed in the directory tree so the LLM knows they exist, but content is skipped entirely. No garbage bytes consuming your context window.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Output Formats: Markdown and XML
&lt;/h2&gt;

&lt;p&gt;data2prompt supports both &lt;code&gt;--format markdown&lt;/code&gt; (default) and &lt;code&gt;--format xml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The XML format was added after reading Anthropic's prompting guidance, which recommends XML-style tags to help a model parse structured prompts. The full project gets wrapped in a structured hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;codebase&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"superstore-analysis"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;metadata&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;generated_on&amp;gt;&lt;/span&gt;2025-05-31 09:00&lt;span class="nt"&gt;&amp;lt;/generated_on&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;total_tokens&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"o200k_base"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;48293&lt;span class="nt"&gt;&amp;lt;/total_tokens&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/metadata&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;directory_structure&amp;gt;&lt;/span&gt;
    Fact&lt;span class="err"&gt;&amp;amp;&lt;/span&gt;dim-csv\Fact_Sales.csv
    Superstore.sql
    ...
  &lt;span class="nt"&gt;&amp;lt;/directory_structure&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;files&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;file&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"Fact&amp;amp;dim-csv\Fact_Sales.csv"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      ...sampled table...
    &lt;span class="nt"&gt;&amp;lt;/file&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/files&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/codebase&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;Same project, same files, three tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Repomix&lt;/td&gt;
&lt;td&gt;22,085 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;code2prompt&lt;/td&gt;
&lt;td&gt;9,304 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;data2prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;241 KB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a 98.9% reduction vs Repomix and 97.4% vs code2prompt on the same data-heavy project, while preserving all the structurally useful information — schemas, sampled data, notebook logic, file tree.&lt;/p&gt;
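&lt;p&gt;The percentages follow directly from the table: each reduction is one minus the ratio of the data2prompt size to the other tool's size. A short check in Python, with a helper name chosen just for this example:&lt;/p&gt;

```python
def reduction_percent(baseline_kb: float, reduced_kb: float) -> float:
    """Percentage by which reduced_kb is smaller than baseline_kb."""
    return (1 - reduced_kb / baseline_kb) * 100


# Output sizes in KB, taken from the benchmark table.
for tool, size_kb in {"Repomix": 22_085, "code2prompt": 9_304}.items():
    print(f"{tool}: {reduction_percent(size_kb, 241):.1f}% smaller")
```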

&lt;p&gt;The reduction is so dramatic specifically because of the project type. A pure Python project would show a much smaller gap. That's exactly the point — this tool is built for data projects, not code projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation and Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt

&lt;span class="c"&gt;# Recommended: use pipx for isolated install&lt;/span&gt;
pipx &lt;span class="nb"&gt;install &lt;/span&gt;data2prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic usage — run inside your project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Default: Markdown output&lt;/span&gt;
data2prompt

&lt;span class="c"&gt;# XML output (better for LLM structured parsing)&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--format&lt;/span&gt; xml

&lt;span class="c"&gt;# Increase CSV sample size&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--csv-sample-size&lt;/span&gt; 25

&lt;span class="c"&gt;# Custom output file name&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--output&lt;/span&gt; my_project_context

&lt;span class="c"&gt;# Ignore specific folders&lt;/span&gt;
data2prompt &lt;span class="nt"&gt;--ignore-folders&lt;/span&gt; data/raw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output file (default: &lt;code&gt;PROMPT.md&lt;/code&gt; or &lt;code&gt;PROMPT.xml&lt;/code&gt;) is ready to paste directly into Claude, ChatGPT, Gemini, or any LLM with a large context window.&lt;/p&gt;

&lt;p&gt;You can also create a &lt;code&gt;.data2promptignore&lt;/code&gt; file in your project root — same syntax as &lt;code&gt;.gitignore&lt;/code&gt; — to exclude specific files or patterns permanently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;data2prompt is specifically designed for data scientists, data analysts, and data engineers who work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV/Excel data files&lt;/li&gt;
&lt;li&gt;SQL database dumps&lt;/li&gt;
&lt;li&gt;Jupyter notebooks&lt;/li&gt;
&lt;li&gt;Power BI or other binary analytics files&lt;/li&gt;
&lt;li&gt;Mixed projects with both code and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your project is purely Python scripts with no data files, Repomix or code2prompt will serve you fine. But if your project looks anything like a real data science workflow, give data2prompt a try.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/arianmokhtariha/data2prompt" rel="noopener noreferrer"&gt;https://github.com/arianmokhtariha/data2prompt&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/data2prompt" rel="noopener noreferrer"&gt;https://pypi.org/project/data2prompt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions about how the SQL parser works or why random sampling over head/tail? Drop them in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>code2prompt</category>
      <category>repomix</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
