Building Data Analysis Tools with LLMs: A Step-by-Step Approach

#engineering #oxlo #ai

I recently shipped an internal CLI tool that lets our team drop a CSV on disk and ask questions in plain English. Instead of returning guesses, the agent generates Python code, runs it against the actual data, and returns verified numbers. Below is the exact setup I used, wired to Oxlo.ai's flat per-request pricing so that long schema descriptions cost the same as short ones.

What you'll need

Python 3.10 or newer.
The OpenAI SDK and pandas: pip install openai pandas
An Oxlo.ai API key from https://portal.oxlo.ai. If you want to compare costs, see the pricing page; request-based billing means a 500-row schema preview and a 10-row preview cost the same per question.

Step 1: Configure the Oxlo.ai client

I use the OpenAI SDK as a drop-in client. Oxlo.ai exposes a fully compatible base URL, so only the endpoint and key change.

from openai import OpenAI
import pandas as pd
import json
import io
import sys

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"  # from https://portal.oxlo.ai
)

Step 2: Load and summarize the dataset

The agent needs column names, dtypes, and a few sample rows to write correct code. I serialize this into a compact string and feed it into the context window.

def summarize_csv(path: str) -> tuple[pd.DataFrame, str]:
    df = pd.read_csv(path)
    summary = []
    summary.append(f"Columns: {list(df.columns)}")
    summary.append(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    summary.append(f"Dtypes: {dict(df.dtypes)}")
    summary.append("Sample rows:")
    summary.append(df.head(3).to_csv(index=False))
    return df, "\n".join(summary)

df, schema = summarize_csv("sales.csv")
print(schema)

Step 3: Define the agent's system prompt

The prompt constrains the model to emit JSON with executable Python. I ask it to store the final answer in a variable named RESULT so the runner can extract it reliably.

SYSTEM_PROMPT = """You are a data analysis agent. The user has provided a pandas DataFrame already loaded as the variable `df`.

Your task is to answer the user's question by generating Python code.

Rules:
1. Output valid JSON with exactly two keys: "explanation" and "code".
2. The "code" string must be valid Python that uses only `df`, standard library modules, and pandas (available as `pd`).
3. Assign the final answer to a variable named `RESULT`. It should be a string, number, list, or dict.
4. Do not use file I/O, network calls, or system commands.
5. Keep the code concise."""

Step 4: Add a sandboxed execution helper

Running LLM-generated code requires caution. I restrict the execution namespace to pandas and the dataframe, capture stdout, and return both the RESULT variable and any printed output.

def run_code(code: str, df: pd.DataFrame):
    namespace = {"pd": pd, "df": df}
    stdout_buffer = io.StringIO()
    sys.stdout = stdout_buffer

    try:
        exec(code, namespace)
        result = namespace.get("RESULT", "No RESULT variable defined.")
        logs = stdout_buffer.getvalue()
    except Exception as e:
        result = f"Execution error: {e}"
        logs = stdout_buffer.getvalue()
    finally:
        sys.stdout = sys.__stdout__

    return {"result": result, "logs": logs}

Step 5: Send the question and parse the response

Now I wire the LLM call. Because Oxlo.ai bills per request, I can pass the full schema string every time without worrying about token length. I use JSON mode to enforce valid output.

def ask_agent(csv_path: str, question: str):
    df, schema = summarize_csv(csv_path)

    user_content = f"{schema}\n\nQuestion: {question}"

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        response_format={"type": "json_object"}
    )

    payload = json.loads(response.choices[0].message.content)
    execution = run_code(payload["code"], df)

    return {
        "explanation": payload["explanation"],
        "code": payload["code"],
        "result": execution["result"],
        "logs": execution["logs"],
    }

Run it

Here is a sample invocation using a fictional sales.csv with columns Region, Product, Revenue, and Date.

if __name__ == "__main__":
    answer = ask_agent("sales.csv", "What is the average revenue per region?")
    print("Explanation:", answer["explanation"])
    print("Generated code:\n", answer["code"])
    print("Result:", answer["result"])

Example output:

Explanation: Group the dataframe by Region and calculate the mean of the Revenue column.
Generated code:
 region_avg = df.groupby('Region')['Revenue'].mean().to_dict()
RESULT = region_avg

Result: {'North': 45200.5, 'South': 38900.0, 'East': 51250.75, 'West': 42100.2}

Next steps

Replace exec with a sandboxed Docker container or RestrictedPython so you can safely run this on untrusted user uploads in production.

Add a second Oxlo.ai call that formats the raw RESULT into a short narrative or Markdown table for non-technical stakeholders. DeepSeek V3.2 works well for that summarization step, and because Oxlo.ai charges per request, chaining two models still costs the same predictable rate regardless of how verbose the intermediate JSON becomes.