Genevieve Breton

Posted on Jun 12

Pandas pipelines through AI without leaking your column names

#ai #pandas #python #privacy

Every other framework in this series leaked through identifiers. Pandas leaks through strings — and "never rewrite strings" was the rule that kept the whole pipeline safe.

The rule pandas breaks

Across the Python and Django articles, one rule held without exception: PromptCape never rewrites string literals. Strings are user-visible labels, error messages, template paths, MIME types, SQL fragments, data. Rewriting them is how you turn "Download CSV" into garbage on a button, or mime="text/csv" into a broken response. The whole reverse-apply story leans on it — strings are inert, only identifiers carry the rename.

Pandas is the framework where that rule fails. Consider one line:

df = df[df["churn_probability"] > 0.5]

There is no identifier here worth protecting. df is a throwaway local. 0.5 is a number. The business secret — the fact that this company scores customers on a churn probability model — lives entirely inside the string literal "churn_probability". And in any real pandas pipeline there are dozens of them: "annual_salary", "patient_diagnosis_code", "ltv_segment", "fraud_score". The column names are the schema, and the schema is the thing you don't want sitting in an AI provider's logs.

So pandas forces the uncomfortable thing: a scoped, deliberate exception to the never-touch-strings rule. Not "rewrite all strings" — that re-breaks everything the rule protected. Rewrite only strings the engine can prove are column names. The entire difficulty of pandas obfuscation is in the word prove.

Where column names hide

If column names only ever appeared as df["name"], this would be a one-line regex. They don't. A column name is any string sitting in a "column position", and pandas has a sprawling vocabulary of column positions:

Access pattern	Example	Column strings
Single subscript	`df["annual_salary"]`	`annual_salary`
List subscript	`df[["dept", "salary"]]`	`dept`, `salary`
Attribute access	`df.annual_salary`	`annual_salary` (as an identifier, not a string)
`.loc` / `.at`	`df.loc[:, "salary"]`, `df.at[0, "salary"]`	`salary`
Grouping	`df.groupby("department")`, `groupby(by=["a", "b"])`	grouping keys
Joining	`df.merge(other, on="employee_id")`, `left_on=`, `right_on=`	join keys
Sorting / indexing	`df.sort_values("hire_date")`, `df.set_index("id")`	sort/index keys
Reshaping	`df.pivot_table(index="dept", columns="year", values="salary")`	three sets of columns
Melting	`df.melt(id_vars="id", value_vars=["q1", "q2"])`	id + value columns
Renaming	`df.rename(columns={"old": "new"})`	keys and values both
Assignment	`df.assign(margin=...)`, `df["margin"] = ...`	new column names (identifier and string)
Named aggregation	`df.agg(avg_pay=("salary", "mean"))`	output kwarg `avg_pay` and source string `salary`
Typing / IO	`read_csv(usecols=[...], dtype={...}, names=[...], parse_dates=[...])`	every listed column
Query strings	`df.query("churn_probability > 0.5")`	column names inside an expression string

Three of these are genuinely nasty:

rename(columns={...}) carries column names in both the keys and the values of a dict literal. Miss the values and you leak every renamed column.
Named aggregation (df.agg(avg_pay=("salary", "mean"))) puts an output column name in a keyword-argument position — a Python identifier — while the source column sits in a string in the same call. One line, two different obfuscation mechanisms.
df.query("...") and df.eval("...") embed column names inside a mini-language that pandas parses at runtime. You can't treat the argument as an opaque string; you have to parse the expression, find the names, rewrite them, and re-serialize — or refuse and warn.
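
The parse, rewrite, re-serialize path for query strings can be sketched with Python's own ast module. This is a minimal sketch under stated assumptions, not PromptCape's implementation: rewrite_query and its mapping argument are hypothetical names, and the expression is assumed to be plain Python syntax. Backtick-quoted names and @local references are pandas extensions that the Python parser does not accept, so the sketch takes the "refuse" branch and returns None for them.

```python
import ast
from typing import Dict, Optional


def rewrite_query(expr: str, mapping: Dict[str, str]) -> Optional[str]:
    """Rename bare names in a query expression, or return None to refuse.

    Hypothetical helper: `mapping` goes from real column name to token.
    Requires Python 3.9+ for ast.unparse.
    """
    # pandas-only syntax (backticks, @local) is outside this sketch's scope,
    # so refuse instead of guessing.
    if "`" in expr or "@" in expr:
        return None
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        # Only bare names are candidates; string constants are values.
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree.body)


rewritten = rewrite_query(
    "churn_probability > 0.5", {"churn_probability": "col_aa1f2b30"}
)
refused = rewrite_query("`annual salary` > @threshold", {})
```

A caller would treat None as "leave the string alone and warn". Re-serializing through ast.unparse may normalize spacing and quoting, which is harmless for a query expression but is one reason a real tool might prefer a lossless tree such as LibCST.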

The detection problem: the schema isn't in the code

Every other detector in this series had it easy: the names it needed to find were declared somewhere. Pydantic fields are in the class body. Django models declare their fields. SQLAlchemy columns are Column(...) assignments. An AST scan finds the declaration site and you're done.

Pandas has no declaration site. A DataFrame's columns come from the data — the header row of a CSV, a SQL result set, a Parquet schema — none of which is in the source code:

df = pd.read_csv("s3://acme-hr/attrition_2026.csv")
# df now has 40 columns. None of their names appear anywhere in this file.
for col in df.columns:
    df[col] = df[col].fillna(0)        # every column touched, zero literals

This splits column names into two populations, and only one is reachable:

Population	Where it appears	Obfuscatable statically?
Referenced columns	As literals in code: `df["churn_probability"]`, `groupby("dept")`, `rename(columns=...)`	Yes — they're in the AST
Dynamic-only columns	Only in the data, touched via `df.columns`, `for col in df.columns`, `df.select_dtypes(...)`	No — the name never appears as a literal to rewrite

The honest framing PromptCape ships with: it obfuscates the column names that appear as literals in the code, because those are exactly the ones that would otherwise reach the AI. A column that's never named in the source can't leak through the source; it only leaks if the AI reads the data file, which is a separate problem (covered below). The PandasColumnDetector does an AST/LibCST scan of column-position contexts and collects every literal it finds into a project-wide column registry, hashed the same way identifiers are (for example, churn_probability → col_aa1f2b30), so the mapping round-trips through reverse-apply exactly like fld_ and mtd_ names.
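
A registry of this kind can be sketched as a two-way dictionary keyed by the real name. Everything specific here is an assumption made for illustration: the ColumnRegistry class, the use of SHA-256, and the salt parameter are hypothetical; the article only establishes the col_ prefix followed by 8 hex digits.

```python
import hashlib
from typing import Dict, Optional


class ColumnRegistry:
    """Two-way map between real column names and opaque col_ tokens."""

    def __init__(self, salt: str = "") -> None:
        self._salt = salt
        self._forward: Dict[str, str] = {}  # real name -> token
        self._reverse: Dict[str, str] = {}  # token -> real name

    def token_for(self, real_name: str) -> str:
        """Return the token for a name, creating it on first use."""
        if real_name not in self._forward:
            # Assumption: salted SHA-256 truncated to 8 hex digits.
            data = (self._salt + real_name).encode("utf-8")
            token = "col_" + hashlib.sha256(data).hexdigest()[:8]
            owner = self._reverse.get(token)
            if owner is not None and owner != real_name:
                # Make a truncated-hash collision loud rather than silent.
                raise ValueError("token collision for " + token)
            self._forward[real_name] = token
            self._reverse[token] = real_name
        return self._forward[real_name]

    def real_name_for(self, token: str) -> Optional[str]:
        """Reverse lookup; None means the token is not in the registry."""
        return self._reverse.get(token)


registry = ColumnRegistry(salt="demo-project")
first = registry.token_for("churn_probability")
second = registry.token_for("churn_probability")
```

Because the lookup is keyed by the real name, every occurrence of a column gets the same token no matter which column position it was found in.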

The type-inference trap: not every `x["string"]` is a column

Here is the bug that defined the whole subsystem.

The first version of the column detector treated every string subscript as a column name: any x["..."] got rewritten. It worked beautifully on pandas code and corrupted everything else:

# Before obfuscation
db_url = os.environ["DATABASE_URL"]
config = settings["feature_flags"]["new_dashboard"]
headers = {"Content-Type": "application/json"}
row = df["annual_salary"]

# After the naive obfuscator (three bugs and one correct rewrite)
db_url = os.environ["col_a1b2c3d4"]          # ← BUG: env var lookup now fails
config = settings["col_5e6f7a8b"]["col_..."] # ← BUG: dict keys mangled
headers = {"col_...": "application/json"}     # ← BUG: HTTP header broken
row = df["col_99c1d2e3"]                       # ← correct

os.environ["DATABASE_URL"], dict subscripts, JSON payloads — they all use the exact same syntax as a DataFrame column access. Subscript-with-a-string is not a pandas signal; it's a Python idiom. Rewriting it blindly turns a privacy tool into a code corrupter.

The fix is DataFrame-variable inference: only rewrite subscript strings on a variable the engine can show is a DataFrame. The sidecar tracks, per scope, which names are bound to a DataFrame by walking assignments:

Assigned from a constructor or IO call: pd.read_csv(...), pd.read_sql(...), pd.DataFrame(...), pd.read_parquet(...).
Assigned from a DataFrame-returning operation on a known DataFrame: df.copy(), df[mask], df.groupby(...).agg(...), df.merge(...), df.rename(...).
Annotated : pd.DataFrame in a parameter or variable.
A column-position keyword on a known pandas method (groupby(by=...), merge(on=...)). Here the method itself proves the argument is a column, even without variable inference.

Strings subscripted on anything the engine can't prove is a DataFrame are left untouched. The trade-off is deliberate and stated: a DataFrame that arrives through an un-inferrable path (returned from a third-party function with no annotation, stored in a list, pulled from a dict) won't have its columns obfuscated. PromptCape chooses silent under-obfuscation over silent corruption — a leaked column name is a privacy miss; a rewritten os.environ key is a broken app. The first you can catch in review; the second wastes an afternoon.
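
The inference rule can be sketched with the standard ast module. This is a simplified illustration, not the sidecar itself: the function names are hypothetical, it assumes pandas is imported as pd, it ignores scopes and control flow (a name counts as a DataFrame anywhere once it is bound to one), and it handles string subscripts only, not column-position keywords. It needs Python 3.9+ for ast.unparse and the flattened Subscript.slice.

```python
import ast
from typing import Set

# Assumption: a small sample of DataFrame-producing calls, not a full list.
FACTORIES = {"read_csv", "read_sql", "read_parquet", "DataFrame"}
FRAME_METHODS = {"copy", "merge", "rename"}


def _is_string(node: ast.AST) -> bool:
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def _yields_frame(node: ast.AST, known: Set[str]) -> bool:
    """True if the expression can be shown to produce a DataFrame."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        owner = node.func.value
        if isinstance(owner, ast.Name):
            if owner.id == "pd" and node.func.attr in FACTORIES:
                return True
            return owner.id in known and node.func.attr in FRAME_METHODS
    if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
        # df[mask] or df[[...]] keeps a frame; df["col"] is a Series.
        return node.value.id in known and not _is_string(node.slice)
    return False


def column_literals(source: str) -> Set[str]:
    """Collect string subscripts on names inferred to be DataFrames."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    # Parameters annotated as pd.DataFrame are known frames.
    known = {
        n.arg for n in nodes
        if isinstance(n, ast.arg) and n.annotation is not None
        and ast.unparse(n.annotation) == "pd.DataFrame"
    }
    assigns = [n for n in nodes if isinstance(n, ast.Assign)]
    changed = True
    while changed:  # repeat until no new DataFrame names appear
        changed = False
        for assign in assigns:
            if not _yields_frame(assign.value, known):
                continue
            for target in assign.targets:
                if isinstance(target, ast.Name) and target.id not in known:
                    known.add(target.id)
                    changed = True
    return {
        n.slice.value for n in nodes
        if isinstance(n, ast.Subscript) and isinstance(n.value, ast.Name)
        and n.value.id in known and _is_string(n.slice)
    }


SAMPLE = '''
db_url = os.environ["DATABASE_URL"]
config = settings["feature_flags"]["new_dashboard"]
df = pd.read_csv("data.csv")
active = df[df["employment_status"] == "active"]
row = active["annual_salary"]

def tenure(frame: pd.DataFrame):
    return frame["tenure_months"] / 12
'''
found = column_literals(SAMPLE)
```

On the sample, the environment and config lookups are never collected because neither os.environ nor settings is inferred to be a DataFrame, which is the under-obfuscation bias described above.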

Before / after

A real-shaped HR attrition pipeline. Watch the columns and the named aggregations, and notice what is kept.

Source the AI must never see:

import pandas as pd


def build_attrition_report(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df = df[df["employment_status"] == "active"]
    df["tenure_years"] = df["tenure_months"] / 12

    summary = (
        df.groupby("department")
          .agg(
              headcount=("employee_id", "count"),
              avg_salary=("annual_salary", "mean"),
              attrition_risk=("churn_probability", "mean"),
          )
          .reset_index()
          .sort_values("attrition_risk", ascending=False)
    )
    return summary

What the AI actually receives:

import pandas as pd


def mtd_3f2a1b0c(path: str) -> pd.DataFrame:
    fld_9c8d7e6f = pd.read_csv(path)
    fld_9c8d7e6f = fld_9c8d7e6f[fld_9c8d7e6f["col_4a1f8b2e"] == "active"]
    fld_9c8d7e6f["col_7d3e9a14"] = fld_9c8d7e6f["col_b6c2f085"] / 12

    fld_2e5a8c91 = (
        fld_9c8d7e6f.groupby("col_1f7b3d6a")
          .agg(
              col_c5e90a2b=("col_d40a1e77", "count"),
              col_88a1d3f4=("col_55b3e9c0", "mean"),
              col_e2d4b7c9=("col_aa1f2b30", "mean"),
          )
          .reset_index()
          .sort_values("col_e2d4b7c9", ascending=False)
    )
    return fld_2e5a8c91

The provider sees a function that filters, divides, groups, and aggregates. It does not see that this is attrition modelling, that employees have a churn_probability, or that the company tracks annual_salary. The shape of the analysis is intact; the meaning is gone.

Three things to notice in the diff:

attrition_risk round-trips as one name. It's created as a named-aggregation kwarg (attrition_risk=) and referenced three lines later in sort_values("attrition_risk"). Both became col_e2d4b7c9 — the registry is keyed by the real name, so a column that's born in .agg() and used in .sort_values() stays consistent. Get this wrong and the workspace raises KeyError: 'col_e2d4b7c9' because the sort references a column the agg never produced under that name.
"active" is untouched. It's a value, not a column. It stays readable — which is the runtime-data boundary discussed below, not an oversight.
pd, read_csv, groupby, agg, mean, count, reset_index, sort_values, ascending all survive. That's the PandasDetector's job, unchanged from earlier articles.

What does NOT change (and why)

Preserved	Why
Pandas API names — `read_csv`, `groupby`, `agg`, `merge`, `pivot_table`, `to_csv`, `value_counts`, `reset_index`, `fillna`, `astype`, …	`PandasDetector`: a fixed list of ~200 DataFrame/Series/IO method and attribute names. Renaming `to_csv` → `mtd_xxx` raises `AttributeError`
Aggregation function strings — `"mean"`, `"sum"`, `"count"`, `"first"`	These are pandas reduction names, not columns. They're a small fixed allow-list inside the column detector
Value literals — `"active"`, `"2026-01-01"`, `"USD"`	Data, not schema. PromptCape never rewrites values (see boundary section)
File paths — `pd.read_csv("attrition_2026.csv")`	A path, not a column. Subscript/kwarg position is what marks a column; a positional path argument to `read_csv` is not a column position
Dict / env / JSON subscripts — `os.environ["X"]`, `config["y"]`	The variable isn't an inferred DataFrame, so the string is left alone
`df` and other local variable names visible above	Not preserved: they are renamed (`fld_…`). Listed here only to contrast with the columns

The data-file leak (the other half)

Obfuscating df["churn_probability"] to df["col_aa1f2b30"] in the code accomplishes nothing if the AI can open attrition_2026.csv and read this:

employee_id,department,annual_salary,churn_probability,employment_status
4471,Engineering,142000,0.12,active

Worse, there's a runtime conflict: the obfuscated code asks for a column col_aa1f2b30 that the real CSV header calls churn_probability. The workspace won't even run: KeyError: 'col_aa1f2b30'.

This is the same split the Python article drew for .env: a value that must exist at runtime but must not reach the AI's view of the workspace. PromptCape resolves pandas data files two ways depending on whether they're fixtures or real data:

Bundled sample/fixture data (small CSVs the repo ships for tests/demos) get their header row rewritten with the same column registry: employee_id,department,annual_salary,... → col_d40a1e77,col_1f7b3d6a,col_55b3e9c0,.... The workspace runs on the fixture, the obfuscated code matches the obfuscated header, and the AI sees neither the real names nor the real rows' meaning (values stay, but a 0.12 under an opaque header leaks little).
Real production data stays in the source project and never enters the workspace, exactly like .env. promptcape run resolves the data path at launch and a thin pandas IO shim applies rename(columns=registry) immediately after each read, so the code's col_… names line up with the freshly-renamed frame. The real headers live only in the developer's source tree, never in ~/.promptcape/cache/<hash>/.
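
The fixture half of this can be sketched with the standard csv module. The helper name and the plain-dict mapping are hypothetical, and raising on an unmapped header is a choice made for this sketch (so a real name cannot slip through unnoticed), not documented PromptCape behaviour.

```python
import csv
import io
from typing import Dict


def rewrite_fixture_header(csv_text: str, mapping: Dict[str, str]) -> str:
    """Replace the header row of a small CSV with registry tokens.

    Raises KeyError if a header has no token in `mapping`.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the header changes; data rows are copied as they are.
    rows[0] = [mapping[name] for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


fixture = "employee_id,department,annual_salary\n4471,Engineering,142000\n"
tokens = {
    "employee_id": "col_d40a1e77",
    "department": "col_1f7b3d6a",
    "annual_salary": "col_55b3e9c0",
}
rewritten_fixture = rewrite_fixture_header(fixture, tokens)
```

For real data the same mapping would be applied in memory after each read, for example through DataFrame.rename(columns=...), so that no rewritten copy of the file is ever placed in the workspace.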

The principle is identical to the secrets story: the AI-visible workspace directory contains shape and structure, never the real vocabulary or the real rows.

Reverse-apply and AI-invented columns

When the AI adds a feature (say, "add a column flagging anyone above the 90th percentile of salary"), it writes against the obfuscated frame:

fld_9c8d7e6f["col_new_flag"] = (
    fld_9c8d7e6f["col_55b3e9c0"] > fld_9c8d7e6f["col_55b3e9c0"].quantile(0.9)
)

Two cases on the way back:

References to existing columns (col_55b3e9c0) hit the registry and reverse-map cleanly to annual_salary. Same hash-resolver as identifiers.
The AI's new column is a name the AI invented. It's not in the registry, and that's fine — it's not a leak (the AI chose it, the provider never saw a real name). It comes back verbatim as whatever the AI typed (col_new_flag, or high_earner if it wrote a readable name). The developer reviews and renames to taste, identical to how AI-invented variable names are handled in the Java pipeline.
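
Both cases can be sketched as a single substitution pass. This is an illustration only: reverse_apply and reverse_map are hypothetical names, and the sketch works on raw text with a regular expression, whereas a real tool would restrict itself to string literals in proven column positions.

```python
import re
from typing import Dict, List, Tuple

# Token shape taken from the examples: "col_" plus exactly 8 hex digits.
COL_TOKEN = re.compile(r"\bcol_[0-9a-f]{8}\b")


def reverse_apply(text: str, reverse_map: Dict[str, str]) -> Tuple[str, List[str]]:
    """Restore real column names; return the text and unresolved tokens."""
    unresolved: List[str] = []

    def restore(match: "re.Match[str]") -> str:
        token = match.group(0)
        real = reverse_map.get(token)
        if real is None:
            # Token-shaped but not in the registry: keep it verbatim and
            # report it so a reviewer can look at it.
            unresolved.append(token)
            return token
        return real

    return COL_TOKEN.sub(restore, text), unresolved


ai_line = 'frame["col_new_flag"] = frame["col_55b3e9c0"] > 100000'
restored, unresolved = reverse_apply(ai_line, {"col_55b3e9c0": "annual_salary"})
```

A name such as col_new_flag does not have the token shape, so it passes through untouched, matching the second case above.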

The one failure mode worth a guard: the AI writes a col_xxxxxxxx-shaped string that happens to collide with a registry hash. That is extraordinarily unlikely with 8 hex digits, but the resolver is strict, so the pre-apply gate flags any col_ literal in the diff that maps to a column the surrounding code never read. In practice this never fires; it exists so a collision is loud rather than silent.

What this does NOT protect

The threat boundary is the same shape as the rest of the series, with one pandas-specific edge.

Threat	Protected?	By what
Column names reaching the AI provider	Yes	Column registry + scoped string rewriting
Business logic / pipeline shape	Partially	The operations are visible (a `groupby().agg()` is still a `groupby().agg()`); only the vocabulary is gone
Runtime data values in the AI's view	Only if data is kept out / fixtures are header-rewritten	`promptcape run` data indirection; values themselves are never rewritten
Dynamic-only columns (never named in code)	Not via code obfuscation	They don't appear in source; they only leak if the data file does, which the data-file handling addresses
Columns inside un-inferrable DataFrames	No (deliberate)	Under-obfuscate rather than corrupt non-pandas subscripts

The honest line: PromptCape removes the schema vocabulary from what the AI sees. It does not pretend to hide that you're doing data analysis, nor to encrypt the numbers. For a column called churn_probability, the name is the secret — and that's what's gone.

Conclusion

Pandas is the framework that inverts the series' central rule. Everywhere else, the leak was in identifiers and strings were safe to ignore. In pandas the leak is the strings, because column names are the business vocabulary and they live as literals.

The three load-bearing ideas:

Scoped string rewriting, gated on proof. The fix isn't "rewrite strings now" — that re-breaks labels, paths, and dict keys. It's a column registry fed only by strings in proven column positions, hashed and reverse-mapped exactly like identifiers.
DataFrame-variable inference is the whole ballgame. x["str"] is the most overloaded syntax in Python. Without knowing x is a DataFrame, you either leak columns (too cautious) or corrupt os.environ (too eager). The detector errs toward under-obfuscation because a broken app is worse than a missed name you can catch in review.
The data file is half the problem. Obfuscating code while the AI can read attrition_2026.csv is theatre. Fixtures get header-rewritten; real data stays in the source and is renamed on load — the same secrets-never-touch-the-workspace principle as .env.

PromptCape is open for trial at https://promptcape.com/ (free for 3 months, no credit card required). The PandasColumnDetector, the DataFrame-inference sidecar, and the data-file handling ship in the same JAR as the rest of the Python pipeline; the language is auto-detected from the source tree.