Every data scientist has the same ritual. Load a CSV. Write df.describe(). Squint at column names. Write 14 lines of pandas to answer one question. Repeat.
What if an LLM could write those 14 lines for you — and run them?
That is exactly what data analysis agents do. You ask a question in English. The agent writes pandas code, executes it in a sandboxed Python environment, reads the output, and returns a plain-language answer. No manual wrangling. No copy-pasting between notebook cells.
Here are 4 patterns that make this work in production — from a 5-line quick start to a fully custom analysis pipeline.
Pattern 1: The Pandas DataFrame Agent
LangChain's create_pandas_dataframe_agent is the fastest path from CSV to answers. It wraps your DataFrame in a tool-calling agent that writes and executes pandas code through a sandboxed Python REPL.
Install the dependencies:
```bash
pip install langchain-experimental langchain-openai pandas
```
Load your data and create the agent:
```python
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import (
    create_pandas_dataframe_agent,
)

df = pd.read_csv("sales_data.csv")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="tool-calling",
    verbose=True,
    allow_dangerous_code=True,
)
```
The allow_dangerous_code=True flag is required. The agent executes real Python code against your DataFrame, so LangChain makes you opt in explicitly. Run this only in a sandboxed environment, never on a production server with sensitive data.
Now ask questions:
```python
result = agent.invoke("What are the top 5 products by total revenue?")
print(result["output"])
```
The agent inspects the DataFrame's columns and dtypes, writes a pandas expression like df.groupby('product')['revenue'].sum().nlargest(5), executes it, and returns the result as natural language.
Multi-DataFrame support. Pass a list of DataFrames to compare datasets:
```python
df_2024 = pd.read_csv("sales_2024.csv")
df_2025 = pd.read_csv("sales_2025.csv")

agent = create_pandas_dataframe_agent(
    llm,
    [df_2024, df_2025],
    agent_type="tool-calling",
    verbose=True,
    allow_dangerous_code=True,
)

result = agent.invoke(
    "Which products grew more than 20% between 2024 and 2025?"
)
```
When to use this pattern: Small to medium datasets (under 1 million rows) that fit in memory. Quick exploratory analysis. Prototyping.
Limitation: The agent sends the first few rows and column names to the LLM as context. For wide DataFrames with 100+ columns, this context can overwhelm the model's attention. For large datasets, use Pattern 2.
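One mitigation for wide DataFrames, assuming you know roughly which columns matter for the question: hand the agent a trimmed view so only the relevant schema reaches the model. The column names below are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical wide DataFrame: 120 columns, only two relevant to the question.
df = pd.DataFrame({f"col_{i}": range(3) for i in range(118)})
df["product"] = ["a", "b", "c"]
df["revenue"] = [100, 200, 300]

# Keep only the columns the question needs before creating the agent.
relevant = ["product", "revenue"]
df_slim = df[relevant]

print(df_slim.shape)  # far less schema context for the LLM to digest
```

Pass df_slim instead of df to create_pandas_dataframe_agent and the sample rows and dtypes sent as context shrink accordingly.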
Pattern 2: The SQL Agent for Larger Datasets
When your data outgrows a DataFrame — millions of rows, multiple related tables, or anything that benefits from indexing — convert it to SQLite and let a SQL agent handle the queries.
```bash
pip install langchain-community langchain-openai sqlalchemy
```
Load your CSV into SQLite and create the agent:
```python
import pandas as pd
import sqlalchemy
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI

engine = sqlalchemy.create_engine("sqlite:///analysis.db")

df_orders = pd.read_csv("orders.csv")
df_customers = pd.read_csv("customers.csv")
df_orders.to_sql("orders", engine, if_exists="replace", index=False)
df_customers.to_sql("customers", engine, if_exists="replace", index=False)

db = SQLDatabase(engine)
print(f"Tables: {db.get_usable_table_names()}")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_sql_agent(
    llm,
    db=db,
    agent_type="tool-calling",
    verbose=True,
)
```
Now ask questions that span multiple tables:
```python
result = agent.invoke(
    "What is the average order value per customer segment?"
)
print(result["output"])
```
The agent inspects the database schema, writes a SQL query with the appropriate JOINs, executes it, and translates the result into English.
Why SQL beats pandas at scale:
- Indexing. SQLite can use indexes you create to make filtering and joining fast. Pandas scans the full DataFrame.
- Memory. SQLite reads from disk. Pandas loads everything into RAM.
- Joins. SQL JOINs are declarative and optimized. Pandas merges require explicit column mapping and can be memory-intensive.
- Audit trail. Every query the agent writes is a valid SQL statement you can review and rerun.
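Note that to_sql does not create any indexes for you. A stdlib-only sketch (hypothetical table and column names) of adding one yourself before pointing the agent at the database, and confirming SQLite actually uses it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "analysis.db" for a real file
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, float(i)) for i in range(1000)],
)

# Index the columns the agent is likely to filter and join on.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.commit()

# EXPLAIN QUERY PLAN shows whether a query hits the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
).fetchall()
print(plan)
```

The plan output names idx_orders_customer, confirming the filter is an index search rather than a full table scan.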
Safety note: The SQL agent generates and executes queries against your database. Use a read-only connection or a copy of your data for analysis. Never point it at a production database with write access.
```python
engine = sqlalchemy.create_engine(
    "sqlite:///analysis.db",
    connect_args={"check_same_thread": False},
)

# PRAGMA query_only applies per connection, so it must be set on the
# connection the agent actually uses. For stronger protection, analyze
# a throwaway copy of the database file instead.
with engine.connect() as conn:
    conn.execute(sqlalchemy.text("PRAGMA query_only = ON"))
```
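To see that the pragma really blocks writes, here is a stdlib-only sketch: set query_only on a connection and confirm an INSERT fails with an OperationalError.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (name TEXT, value REAL)")

# Lock this connection down to reads only.
conn.execute("PRAGMA query_only = ON")

try:
    conn.execute("INSERT INTO metrics VALUES ('revenue', 1.0)")
    blocked = False
except sqlite3.OperationalError:
    blocked = True

print(blocked)
```

Reads still succeed on the same connection; only statements that modify the database are rejected.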
Pattern 3: Custom Analysis Tools With Tool-Calling
The pre-built agents handle common cases. But real data analysis needs domain-specific operations — statistical tests, time series decomposition, custom visualizations. Build a custom tool-calling agent that exposes exactly the operations your analysts need.
```bash
pip install langchain-openai langgraph scipy
```
Define your analysis tools as Python functions with clear docstrings — the LLM reads these to decide which tool to call:
```python
import pandas as pd
import json
from scipy import stats
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

df = pd.read_csv("experiment_results.csv")


@tool
def describe_dataset(column: str) -> str:
    """Get summary statistics for a specific column in the dataset.
    Returns count, mean, std, min, 25%, 50%, 75%, max."""
    if column not in df.columns:
        return f"Column '{column}' not found. Available: {list(df.columns)}"
    return df[column].describe().to_json()


@tool
def compare_groups(
    group_column: str, value_column: str, group_a: str, group_b: str
) -> str:
    """Run a two-sample t-test comparing two groups on a numeric column.
    Returns t-statistic, p-value, and whether the difference is
    statistically significant at alpha=0.05."""
    a = df[df[group_column] == group_a][value_column].dropna()
    b = df[df[group_column] == group_b][value_column].dropna()
    t_stat, p_value = stats.ttest_ind(a, b)
    # Cast to plain Python types: json.dumps rejects numpy bools.
    significant = bool(p_value < 0.05)
    return json.dumps({
        "group_a_mean": round(float(a.mean()), 4),
        "group_b_mean": round(float(b.mean()), 4),
        "t_statistic": round(float(t_stat), 4),
        "p_value": round(float(p_value), 6),
        "significant_at_0.05": significant,
        "n_a": len(a),
        "n_b": len(b),
    })


@tool
def correlation_matrix(columns: list[str]) -> str:
    """Calculate pairwise Pearson correlations between specified columns.
    Returns a correlation matrix as JSON."""
    valid = [c for c in columns if c in df.columns]
    if not valid:
        return f"No valid columns found. Available: {list(df.columns)}"
    corr = df[valid].corr()
    return corr.to_json()


@tool
def filter_and_aggregate(
    filter_column: str,
    filter_value: str,
    agg_column: str,
    agg_function: str,
) -> str:
    """Filter the dataset by a column value, then aggregate another column.
    Supported agg_function values: mean, sum, count, min, max, median."""
    valid_aggs = {"mean", "sum", "count", "min", "max", "median"}
    if agg_function not in valid_aggs:
        return f"Invalid agg_function. Use one of: {valid_aggs}"
    filtered = df[df[filter_column] == filter_value]
    result = getattr(filtered[agg_column], agg_function)()
    return json.dumps({
        "filter": f"{filter_column} == {filter_value}",
        "rows_matched": len(filtered),
        "aggregation": f"{agg_function}({agg_column})",
        "result": round(float(result), 4),
    })
```
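As a sanity check on what compare_groups reports, here is the equal-variance (Student) t-statistic that stats.ttest_ind computes by default, worked by hand with the stdlib statistics module. The two samples are made up for illustration.

```python
import math
import statistics

# Made-up samples standing in for two experiment groups.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

n_a, n_b = len(a), len(b)
# Pooled variance: weighted average of the two sample variances.
pooled_var = (
    (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
) / (n_a + n_b - 2)
# Standard error of the difference in means under equal variances.
se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (statistics.mean(a) - statistics.mean(b)) / se

print(round(t_stat, 4))  # matches scipy.stats.ttest_ind(a, b) on the same data
```

Spot-checking the agent's tools against a hand calculation like this is cheap insurance before trusting their output.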
Wire the tools into a tool-calling agent:
```python
from langgraph.prebuilt import create_react_agent

tools = [describe_dataset, compare_groups, correlation_matrix, filter_and_aggregate]
llm = ChatOpenAI(model="gpt-4o", temperature=0)

system_prompt = """You are a data analysis assistant. You have access to a dataset
with these columns: {columns}
When the user asks a question:
1. Identify which tools can answer it
2. Call the appropriate tool(s)
3. Interpret the results in plain language
4. Note statistical significance, sample sizes, and caveats
Always mention sample sizes when reporting statistical results.""".format(
    columns=list(df.columns)
)

# create_react_agent binds the tools to the model itself.
agent = create_react_agent(llm, tools)
```
Ask domain-specific questions:
```python
result = agent.invoke({
    "messages": [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "Is there a significant difference in conversion rate between the control and treatment groups?",
        },
    ]
})
print(result["messages"][-1].content)
```
Why custom tools beat the generic pandas agent:
- Controlled operations. The agent can only call functions you defined. No arbitrary code execution.
- Domain language. Tool descriptions use your team's terminology. The LLM maps natural language to the right operation.
- Safety. No Python REPL. No allow_dangerous_code. Each tool validates its inputs and returns structured output.
- Reproducibility. Every tool call is logged with exact parameters. You can replay any analysis.
Pattern 4: Output Validation and Guardrails
An analysis agent that returns wrong numbers is worse than no agent at all. These guardrails catch errors before they reach your stakeholders.
Guardrail 1: Schema validation on tool inputs.
Use Pydantic to enforce that tool arguments match expected types and ranges:
```python
from pydantic import BaseModel, field_validator


class AggregationRequest(BaseModel):
    filter_column: str
    filter_value: str
    agg_column: str
    agg_function: str

    @field_validator("agg_function")
    @classmethod
    def validate_agg(cls, v: str) -> str:
        allowed = {"mean", "sum", "count", "min", "max", "median"}
        if v not in allowed:
            raise ValueError(f"Must be one of {allowed}, got '{v}'")
        return v

    @field_validator("filter_column", "agg_column")
    @classmethod
    def validate_columns(cls, v: str) -> str:
        if v not in df.columns:
            raise ValueError(
                f"Column '{v}' not in dataset. "
                f"Available: {list(df.columns)}"
            )
        return v
```
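A quick sketch of the model rejecting bad input. This uses a minimal version of the request model (the column-existence check is dropped so the example runs standalone, without a dataset):

```python
from pydantic import BaseModel, ValidationError, field_validator


class AggregationRequest(BaseModel):
    agg_column: str
    agg_function: str

    @field_validator("agg_function")
    @classmethod
    def validate_agg(cls, v: str) -> str:
        allowed = {"mean", "sum", "count", "min", "max", "median"}
        if v not in allowed:
            raise ValueError(f"Must be one of {allowed}, got '{v}'")
        return v


# A valid request passes through untouched.
ok = AggregationRequest(agg_column="revenue", agg_function="mean")

# A typo like "avg" is rejected before any pandas code runs.
try:
    AggregationRequest(agg_column="revenue", agg_function="avg")
    rejected = False
except ValidationError:
    rejected = True

print(rejected)
```

The ValidationError message lists the allowed values, which you can feed back to the LLM so it can self-correct on the next tool call.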
Guardrail 2: Result sanity checks.
Wrap your agent's output with automatic validation:
```python
def validate_numeric_result(result: dict, column: str) -> dict:
    """Check that a computed result falls within the column's actual range.

    Appropriate for mean, median, min, and max; a sum or count can
    legitimately fall outside the column's range.
    """
    col_min = float(df[column].min())
    col_max = float(df[column].max())
    value = result.get("result")
    if value is not None and not (col_min <= value <= col_max):
        result["warning"] = (
            f"Result {value} is outside the column range "
            f"[{col_min}, {col_max}]. Verify the calculation."
        )
    return result
```
Guardrail 3: Row count assertions.
Catch empty results before they produce misleading statistics:
```python
def safe_aggregate(
    filtered_df: pd.DataFrame,
    column: str,
    operation: str,
    min_rows: int = 5,
) -> dict:
    """Aggregate with a minimum sample size requirement."""
    n = len(filtered_df)
    if n == 0:
        return {"error": "Filter returned zero rows. Check filter criteria."}
    if n < min_rows:
        return {
            "warning": f"Only {n} rows matched (minimum {min_rows}). "
            "Results may not be statistically meaningful.",
            "result": getattr(filtered_df[column], operation)(),
            "n": n,
        }
    return {
        "result": getattr(filtered_df[column], operation)(),
        "n": n,
    }
```
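A usage sketch with a tiny hypothetical DataFrame: three matching rows is below the default minimum, so the result comes back flagged. The helper is reproduced in condensed form so the example runs standalone.

```python
import pandas as pd


def safe_aggregate(filtered_df, column, operation, min_rows=5):
    """Condensed version of the guardrail above."""
    n = len(filtered_df)
    if n == 0:
        return {"error": "Filter returned zero rows. Check filter criteria."}
    result = {"result": getattr(filtered_df[column], operation)(), "n": n}
    if n < min_rows:
        result["warning"] = f"Only {n} rows matched (minimum {min_rows})."
    return result


# Hypothetical data: only three rows in segment "a".
df = pd.DataFrame({"segment": ["a", "a", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
out = safe_aggregate(df[df["segment"] == "a"], "value", "mean")
print(out)
```

The mean is still computed, but the warning travels with it, so the agent's final answer can surface the small sample size to the stakeholder.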
Guardrail 4: Query logging for audit.
Log every tool call so analysts can verify what the agent actually computed:
```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("analysis_agent")
handler = logging.FileHandler("agent_audit.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s | %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def log_tool_call(tool_name: str, params: dict, result: dict):
    logger.info(
        f"TOOL={tool_name} | PARAMS={params} | "
        f"RESULT_KEYS={list(result.keys())} | "
        f"TIMESTAMP={datetime.now(timezone.utc).isoformat()}"
    )
```
These guardrails are not optional extras. In any workflow where analysis drives decisions — A/B test results, revenue forecasts, user segmentation — a wrong number costs more than a slow answer.
Which Pattern Fits Your Data
| Scenario | Pattern | Why |
|---|---|---|
| Quick CSV exploration | 1 (Pandas Agent) | Five lines to first answer |
| Multiple related tables, millions of rows | 2 (SQL Agent) | Indexing, joins, disk-based |
| Domain-specific analysis (stats, experiments) | 3 (Custom Tools) | Controlled operations, no REPL |
| Production pipeline with stakeholders | 3 + 4 (Custom + Guardrails) | Validated, audited, reproducible |
Start with Pattern 1 to prototype. Move to Pattern 2 when data gets large. Move to Pattern 3 when you need control. Add Pattern 4 when wrong answers have consequences.
What to Watch Out For
Code execution risks. Patterns 1 and 2 execute generated code. Run them in Docker containers or sandboxed environments — never on the same machine as production data without isolation.
LLM hallucination in analysis. The agent can write syntactically valid pandas code that answers the wrong question. Always log the generated code (verbose=True) and spot-check against manual queries.
Token costs scale with data context. The pandas agent sends column names, dtypes, and sample rows to the LLM on every call. Wide DataFrames with 100+ columns burn tokens fast. Use Pattern 3's targeted tools to control exactly what context the LLM sees.
Statistical literacy. The LLM knows how to call scipy.stats.ttest_ind. It does not always know when a t-test is the wrong test for your data distribution. Domain expertise still matters — the agent accelerates the workflow, it does not replace the analyst.
Every analyst spends more time wrangling data than analyzing it. These 4 patterns shift that ratio. The LLM handles the pandas syntax, the SQL joins, and the boilerplate. You handle the questions that matter.
Follow @klement_gunndu for more AI engineering content. We're building in public.