serada
Stop Passing Raw DataFrames to Your LLM — Here's a Better Way

TL;DR

  • df.to_string() on a 100K-row DataFrame = millions of tokens, guaranteed failure
  • df.head() = 5 rows with zero statistical context, useless for real analysis
  • dfcontext generates a token-budget-aware, column-type-aware summary — no LLM calls required

The Problem Nobody Talks About

You've got a DataFrame with 100,000 rows. You want to ask an LLM about it. What do you do?

Most people try one of two things:

```python
# Option A: Dump everything (will blow the context window)
prompt = df.to_string()

# Option B: Just use head (loses almost all information)
prompt = df.head().to_string()
```

Option A will hit your token limit instantly. Option B gives the model five rows of data and basically asks it to guess the rest.

There's no obvious middle ground in the standard pandas API — so I built one.
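To see the scale of the problem, here's a rough back-of-the-envelope check using the common ~4-characters-per-token heuristic. The frame below is synthetic (just two columns shaped like the example in this post), not real data:

```python
import numpy as np
import pandas as pd

# Synthetic 100K-row frame, roughly the shape of the example in this post.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["East", "West", "North", "South"], size=100_000),
    "sales": rng.normal(1000, 1000, size=100_000).round(2),
})

# Rough heuristic: ~4 characters per token for English-like text.
approx_tokens = len(df.to_string()) // 4
print(f"~{approx_tokens:,} tokens")  # hundreds of thousands of tokens
```

And that's with only two columns. A realistic wide frame easily crosses into millions of tokens.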

Introducing dfcontext

dfcontext generates a compact, statistically rich summary of your DataFrame that fits within a token budget you specify. It's pure data processing — zero LLM calls, works with any LLM provider.

```bash
pip install dfcontext
```
```python
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows

ctx = to_context(df, token_budget=2000)
print(ctx)
```

Here's what the output looks like:

```text
## Dataset overview

- 100,000 rows × 5 columns

## Schema

| Column    | Type           | Non-null |
| --------- | -------------- | -------- |
| region    | object         | 100%     |
| sales     | float64        | 100%     |
| quantity  | int64          | 100%     |
| date      | datetime64[ns] | 100%     |
| is_return | bool           | 100%     |

## Column statistics

### region (categorical, 4 unique)

Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)

Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)

Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)

True: 6.0% | False: 94.0%

## Sample rows (diverse selection)

| region | sales   | quantity | date       | is_return |
| ------ | ------- | -------- | ---------- | --------- |
| East   | 4.64    | 32       | 2024-01-14 | False     |
| South  | 697.55  | 50       | 2024-01-15 | False     |
| West   | 8172.45 | 68       | 2024-01-02 | False     |
```

2,000 tokens. Full schema. Real distributions. Diverse sample rows. That's the sweet spot.

Why Column-Type-Aware Matters

The key insight behind dfcontext is that different column types need different summaries.

A numeric column like sales needs range, mean, and distribution. A categorical column like region needs value frequencies. A datetime column needs range and granularity. A boolean column needs true/false ratio.

Feeding mean and std for a boolean column is meaningless. Showing "top values" for a float column is wrong. dfcontext handles each type correctly out of the box.
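As an illustration of the idea (a minimal sketch in plain pandas, not dfcontext's internals), type-aware summarization is essentially a dispatch on the column's dtype. Note that the bool check must come before the numeric check, because pandas treats bool as numeric:

```python
import pandas as pd
from pandas.api import types as ptypes

def summarize_column(s: pd.Series) -> str:
    """Minimal sketch of type-aware summarization (illustrative only)."""
    if ptypes.is_bool_dtype(s):
        # Bool is also "numeric" in pandas, so handle it first.
        pct_true = s.mean() * 100
        return f"True: {pct_true:.1f}% | False: {100 - pct_true:.1f}%"
    if ptypes.is_numeric_dtype(s):
        return f"Range: {s.min():,.2f} — {s.max():,.2f} | Mean: {s.mean():,.2f}"
    if ptypes.is_datetime64_any_dtype(s):
        return f"Range: {s.min()} — {s.max()}"
    # Everything else gets categorical-style frequencies.
    top = s.value_counts(normalize=True).head(3)
    return "Top values: " + ", ".join(f"{v} ({p:.1%})" for v, p in top.items())
```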

Query Hints: Tell It What You Care About

If you already know the focus of your analysis, pass a hint. dfcontext will allocate more token budget to relevant columns.

```python
ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" get richer detail; less budget is spent on other columns
```

This is useful when you have a wide DataFrame but only care about a few columns for a given question.
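One way to picture what a hint does, as a toy sketch (dfcontext's actual scoring is certainly more sophisticated than this substring match): columns that match a word in the hint get a larger share of the budget.

```python
def allocate_budget(columns: list[str], hint: str, total_budget: int) -> dict[str, int]:
    """Toy budget allocator: columns matching a hint word get double weight."""
    hint_words = hint.lower().split()
    weights = {
        col: 2.0 if any(col.lower() in w or w in col.lower() for w in hint_words) else 1.0
        for col in columns
    }
    per_unit = total_budget / sum(weights.values())
    return {col: int(weight * per_unit) for col, weight in weights.items()}

# "region" matches "regional", "sales" matches "sales"; the rest split the remainder.
print(allocate_budget(["region", "sales", "quantity", "date", "is_return"],
                      "regional sales trends", 2000))
```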

Correlation Detection

```python
ctx = to_context(df, token_budget=2000, include_correlations=True)
# Adds: "sales ↔ quantity: r=+0.823 (strong positive)"
```

Often the most valuable thing you can tell an LLM is how columns relate to each other. This surfaces strong correlations without requiring the model to compute them.
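This kind of report can be reproduced with plain pandas; here's a sketch of the pattern (the thresholds and wording here are my own choices, not necessarily dfcontext's):

```python
import pandas as pd

def strong_correlations(df: pd.DataFrame, threshold: float = 0.5) -> list[str]:
    """Report notable pairwise Pearson correlations among numeric columns."""
    corr = df.corr(numeric_only=True)
    lines = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, no self-pairs
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                strength = "strong" if abs(r) >= 0.7 else "moderate"
                sign = "positive" if r > 0 else "negative"
                lines.append(f"{a} ↔ {b}: r={r:+.3f} ({strength} {sign})")
    return lines
```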

Multiple Output Formats

dfcontext outputs Markdown by default, but it also supports plain text and YAML — useful if your prompt template expects structured data.

```python
ctx_md    = to_context(df, format="markdown")  # default, great for chat models
ctx_plain = to_context(df, format="plain")     # no markdown syntax
ctx_yaml  = to_context(df, format="yaml")      # structured, requires pyyaml
```

Wiring It Up to Claude (or Any LLM)

Here's a complete example with the Anthropic SDK:

```python
import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends by region")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends? Any anomalies worth investigating?",
    }],
)
print(response.content[0].text)
```

The same pattern works with OpenAI, Gemini, or any API that accepts a string prompt.

Budget Tuning: More Tokens = Richer Stats

dfcontext adapts to the budget you give it. With a higher budget, it adds percentiles, skewness, and outlier rates automatically.

```python
ctx_tight = to_context(df, token_budget=500)   # overview + schema only
ctx_rich  = to_context(df, token_budget=5000)  # full stats, percentiles, more samples
```

You can see this in action without any API key by running the budget_tuning.py example in the repo.
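A toy way to think about the tiering (illustrative only; these thresholds and section names are made up, not dfcontext's actual ones): richer sections unlock as the budget grows.

```python
def sections_for_budget(token_budget: int) -> list[str]:
    """Hypothetical tiering: more sections as the token budget grows."""
    sections = ["overview", "schema"]
    if token_budget >= 1000:
        sections += ["column_stats", "sample_rows"]
    if token_budget >= 4000:
        sections += ["percentiles", "outlier_rates"]
    return sections

print(sections_for_budget(500))
print(sections_for_budget(5000))
```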

Under the Hood: Getting Structured Results

If you want to integrate dfcontext into your own tooling rather than just generating a string, use analyze_columns directly:

```python
from dfcontext import analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique values")
    if s.distribution_sketch:
        print(f"  histogram: [{s.distribution_sketch}]")
    if "outlier_rate" in s.stats:
        print(f"  outlier rate: {s.stats['outlier_rate'] * 100:.1f}%")
```

Each ColumnSummary object gives you dtype, column_type, non_null_rate, unique_count, stats, sample_values, and distribution_sketch — enough to build your own rendering layer if you need it.

Performance

It handles 100K rows in under a second. The bottleneck is pandas, not dfcontext.

Token counting defaults to a character-based estimate. For accurate counts, install the optional dependency:

```bash
pip install "dfcontext[tiktoken]"
```
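The fallback is easy to reason about: roughly four characters per token for English-like text. Here's a sketch of the two paths (my own approximation of the idea, not dfcontext's exact code):

```python
def estimate_tokens(text: str) -> int:
    """Character-based fallback: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def count_tokens(text: str) -> int:
    """Use tiktoken for exact counts when installed, else fall back."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return estimate_tokens(text)
```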

Install Options

```bash
pip install dfcontext              # core only (no extra deps)
pip install "dfcontext[tiktoken]"  # accurate token counting
pip install "dfcontext[yaml]"      # YAML format output
pip install "dfcontext[all]"       # everything
```

When to Use This

dfcontext is a good fit when:

  • You have DataFrames larger than a few hundred rows
  • You're building LLM pipelines that take tabular data as input
  • You want consistent, reproducible context across runs (no sampling randomness)
  • You're working within strict token budgets (API costs, model limits)

It's not trying to replace exploratory analysis. Use pandas profiling or ydata-profiling for that. dfcontext is specifically optimized for the "give an LLM just enough to understand this data" use case.


GitHub: sserada/dfcontext — PRs and issues welcome.

What's your current approach for feeding DataFrames to LLMs? Do you truncate, sample, or something else? Let me know in the comments — I'm curious what patterns people are using in the wild.
