Every other framework in this series leaked through identifiers. Pandas leaks through strings — and "never rewrite strings" was the rule that kept the whole pipeline safe.
The rule pandas breaks
Across the Python and Django articles, one rule held without exception: PromptCape never rewrites string literals. Strings are user-visible labels, error messages, template paths, MIME types, SQL fragments, data. Rewriting them is how you turn "Download CSV" into garbage on a button, or mime="text/csv" into a broken response. The whole reverse-apply story leans on it — strings are inert, only identifiers carry the rename.
Pandas is the framework where that rule fails. Consider one line:
df = df[df["churn_probability"] > 0.5]
There is no identifier here worth protecting. df is a throwaway local. 0.5 is a number. The business secret — the fact that this company scores customers on a churn probability model — lives entirely inside the string literal "churn_probability". And in any real pandas pipeline there are dozens of them: "annual_salary", "patient_diagnosis_code", "ltv_segment", "fraud_score". The column names are the schema, and the schema is the thing you don't want sitting in an AI provider's logs.
So pandas forces the uncomfortable thing: a scoped, deliberate exception to the never-touch-strings rule. Not "rewrite all strings" — that re-breaks everything the rule protected. Rewrite only strings the engine can prove are column names. The entire difficulty of pandas obfuscation is in the word prove.
Where column names hide
If column names only ever appeared as df["name"], this would be a one-line regex. They don't. A column name is any string sitting in a "column position", and pandas has a sprawling vocabulary of column positions:
| Access pattern | Example | Column strings |
|---|---|---|
| Single subscript | df["annual_salary"] | annual_salary |
| List subscript | df[["dept", "salary"]] | dept, salary |
| Attribute access | df.annual_salary | annual_salary (as an identifier, not a string) |
| .loc / .at | df.loc[:, "salary"], df.at[0, "salary"] | salary |
| Grouping | df.groupby("department"), groupby(by=["a", "b"]) | grouping keys |
| Joining | df.merge(other, on="employee_id"), left_on=, right_on= | join keys |
| Sorting / indexing | df.sort_values("hire_date"), df.set_index("id") | sort/index keys |
| Reshaping | df.pivot_table(index="dept", columns="year", values="salary") | three sets of columns |
| Melting | df.melt(id_vars="id", value_vars=["q1", "q2"]) | id + value columns |
| Renaming | df.rename(columns={"old": "new"}) | keys and values both |
| Assignment | df.assign(margin=...), df["margin"] = ... | new column names (identifier and string) |
| Named aggregation | df.agg(avg_pay=("salary", "mean")) | output kwarg avg_pay and source string salary |
| Typing / IO | read_csv(usecols=[...], dtype={...}, names=[...], parse_dates=[...]) | every listed column |
| Query strings | df.query("churn_probability > 0.5") | column names inside an expression string |
Three of these are genuinely nasty:
- rename(columns={...}) carries column names in both the keys and the values of a dict literal. Miss the values and you leak every renamed column.
- Named aggregation (df.agg(avg_pay=("salary", "mean"))) puts an output column name in a keyword-argument position, which is a Python identifier, while the source column sits in a string in the same call. One line, two different obfuscation mechanisms.
- df.query("...") and df.eval("...") embed column names inside a mini-language that pandas parses at runtime. You can't treat the argument as an opaque string; you have to parse the expression, find the names, rewrite them, and re-serialize, or else refuse and warn.
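The parse, rewrite, and re-serialize path for query strings can be sketched with Python's own ast module, since simple pandas query expressions are also valid Python expressions. This is a minimal illustration and not PromptCape's implementation: rewrite_query and the mapping argument are hypothetical names, and the sketch refuses anything that uses pandas-only syntax such as @local references or backtick-quoted names.

```python
import ast


class _ColumnRenamer(ast.NodeTransformer):
    """Rename bare names that appear in the column mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Names missing from the mapping are left exactly as written.
        node.id = self.mapping.get(node.id, node.id)
        return node


def rewrite_query(expr, mapping):
    """Return expr with known columns renamed, or None to signal a refusal."""
    # "@var" and `backtick names` are pandas extensions, not Python syntax.
    # Refusing is the safe choice: the caller can warn instead of guessing.
    if "@" in expr or "`" in expr:
        return None
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return None
    # ast.unparse (Python 3.9+) re-serializes the tree; spacing and quote
    # style may be normalized, but literal values are preserved.
    return ast.unparse(_ColumnRenamer(mapping).visit(tree))


mapping = {"churn_probability": "col_aa1f2b30"}
print(rewrite_query("churn_probability > 0.5", mapping))
print(rewrite_query("churn_probability > @threshold", mapping))
```

Returning None rather than a best-effort rewrite matches the refuse-and-warn option: a query the sketch cannot parse is passed over, never half-rewritten.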
The detection problem: the schema isn't in the code
Every other detector in this series had it easy: the names it needed to find were declared somewhere. Pydantic fields are in the class body. Django models declare their fields. SQLAlchemy columns are Column(...) assignments. An AST scan finds the declaration site and you're done.
Pandas has no declaration site. A DataFrame's columns come from the data — the header row of a CSV, a SQL result set, a Parquet schema — none of which is in the source code:
df = pd.read_csv("s3://acme-hr/attrition_2026.csv")
# df now has 40 columns. None of their names appear anywhere in this file.
for col in df.columns:
    df[col] = df[col].fillna(0)  # every column touched, zero literals
This splits column names into two populations, and only one is reachable:
| Population | Where it appears | Obfuscatable statically? |
|---|---|---|
| Referenced columns | As literals in code: df["churn_probability"], groupby("dept"), rename(columns=...) | Yes: they're in the AST |
| Dynamic-only columns | Only in the data, touched via df.columns, for col in df.columns, df.select_dtypes(...) | No: the name never appears as a literal to rewrite |
The honest framing PromptCape ships with: it obfuscates the column names that appear as literals in the code, because those are exactly the ones that would otherwise reach the AI. A column that's never named in the source can't leak through the source; it only leaks if the AI reads the data file, which is a separate problem (covered below). The PandasColumnDetector does an AST/LibCST scan of column-position contexts and collects every literal it finds into a project-wide column registry, hashed the same way identifiers are (churn_probability → col_aa1f2b30, the alias it carries in the before/after listing), so the mapping round-trips through reverse-apply exactly like fld_ and mtd_ names.
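A registry of this kind can be sketched as a two-way dictionary keyed by the real name. The article does not say which hash PromptCape uses or whether it is salted, so the SHA-256 truncation and the salt parameter below are assumptions, and ColumnRegistry is a hypothetical class name; the aliases it produces will not match the ones printed in this article.

```python
import hashlib
from typing import Optional


class ColumnRegistry:
    """Two-way map between real column names and col_<8 hex> aliases."""

    def __init__(self, salt=""):
        # Hypothetical per-project salt; the real keying scheme is not documented here.
        self._salt = salt
        self._forward = {}   # real name -> alias
        self._reverse = {}   # alias -> real name

    def alias(self, real_name: str) -> str:
        """Return the stable alias for a column, registering it on first use."""
        if real_name in self._forward:
            return self._forward[real_name]
        digest = hashlib.sha256((self._salt + real_name).encode("utf-8")).hexdigest()
        alias = "col_" + digest[:8]
        # Eight hex digits can collide; fail loudly instead of merging two columns.
        owner = self._reverse.get(alias)
        if owner is not None and owner != real_name:
            raise ValueError(f"alias collision between {owner!r} and {real_name!r}")
        self._forward[real_name] = alias
        self._reverse[alias] = real_name
        return alias

    def real_name(self, alias: str) -> Optional[str]:
        """Reverse lookup used on the way back; None for names never registered."""
        return self._reverse.get(alias)


registry = ColumnRegistry()
alias = registry.alias("churn_probability")
print(alias, "->", registry.real_name(alias))
```

Keying by the real name is what makes the alias deterministic: every occurrence of the same column, in any file, resolves to the same col_ token.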
The type-inference trap: not every x["string"] is a column
Here is the bug that defined the whole subsystem.
The first version of the column detector treated every string subscript as a column name: any x["..."] got rewritten. It worked beautifully on pandas code and corrupted everything else:
# Before obfuscation
db_url = os.environ["DATABASE_URL"]
config = settings["feature_flags"]["new_dashboard"]
headers = {"Content-Type": "application/json"}
row = df["annual_salary"]
# After the naive obfuscator (three bugs and one correct rewrite)
db_url = os.environ["col_a1b2c3d4"] # ← BUG: env var lookup now fails
config = settings["col_5e6f7a8b"]["col_..."] # ← BUG: dict keys mangled
headers = {"col_...": "application/json"} # ← BUG: HTTP header broken
row = df["col_99c1d2e3"] # ← correct
os.environ["DATABASE_URL"], dict subscripts, JSON payloads — they all use the exact same syntax as a DataFrame column access. Subscript-with-a-string is not a pandas signal; it's a Python idiom. Rewriting it blindly turns a privacy tool into a code corrupter.
The fix is DataFrame-variable inference: only rewrite subscript strings on a variable the engine can show is a DataFrame. The sidecar tracks, per scope, which names are bound to a DataFrame by walking assignments:
- Assigned from a constructor or IO call: pd.read_csv(...), pd.read_sql(...), pd.DataFrame(...), pd.read_parquet(...).
- Assigned from a DataFrame-returning operation on a known DataFrame: df.copy(), df[mask], df.groupby(...).agg(...), df.merge(...), df.rename(...).
- Annotated : pd.DataFrame in a parameter or variable.
- A column-position keyword on a known pandas method (groupby(by=...), merge(on=...)): here the method itself proves the argument is a column, even without variable inference.
Strings subscripted on anything the engine can't prove is a DataFrame are left untouched. The trade-off is deliberate and stated: a DataFrame that arrives through an un-inferrable path (returned from a third-party function with no annotation, stored in a list, pulled from a dict) won't have its columns obfuscated. PromptCape chooses silent under-obfuscation over silent corruption — a leaked column name is a privacy miss; a rewritten os.environ key is a broken app. The first you can catch in review; the second wastes an afternoon.
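The first three inference rules can be sketched with the standard-library ast module. This is an illustration under stated assumptions, not the sidecar itself: the method lists are deliberately tiny, pandas is assumed to be imported as pd, scoping is simplified to one level of function nesting, and every name below (FRAME_FACTORIES, ColumnLiteralFinder, find_column_literals) is hypothetical. It needs Python 3.9 or later for ast.unparse and the simplified Subscript.slice node.

```python
import ast

# Hypothetical, deliberately short lists; a real detector would need far more.
FRAME_FACTORIES = {"read_csv", "read_sql", "read_parquet", "DataFrame"}
# groupby is listed so chains such as df.groupby(...).agg(...) stay frame-like.
FRAME_METHODS = {"copy", "merge", "rename", "groupby", "agg", "reset_index"}
FRAME_ANNOTATIONS = {"pd.DataFrame", "pandas.DataFrame", "DataFrame"}


def is_frame_expr(node, known):
    """True only when the expression is provably a DataFrame."""
    if isinstance(node, ast.Name):
        return node.id in known
    if isinstance(node, ast.Subscript):
        # A constant key selects one column (a Series); df[mask] keeps the frame.
        return not isinstance(node.slice, ast.Constant) and is_frame_expr(node.value, known)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        func = node.func
        if isinstance(func.value, ast.Name) and func.value.id == "pd":
            return func.attr in FRAME_FACTORIES
        return func.attr in FRAME_METHODS and is_frame_expr(func.value, known)
    return False


def is_frame_annotation(annotation):
    return annotation is not None and ast.unparse(annotation) in FRAME_ANNOTATIONS


class ColumnLiteralFinder(ast.NodeVisitor):
    def __init__(self):
        self.known = set()   # names currently bound to a DataFrame
        self.columns = []    # string literals found in subscript position

    def visit_FunctionDef(self, node):
        outer, self.known = self.known, set(self.known)
        for arg in node.args.args:
            if is_frame_annotation(arg.annotation):
                self.known.add(arg.arg)
            else:
                self.known.discard(arg.arg)  # parameter shadows any outer name
        for stmt in node.body:
            self.visit(stmt)
        self.known = outer

    def visit_Assign(self, node):
        self.visit(node.value)  # scan the right-hand side with the current bindings
        for target in node.targets:
            self.visit(target)  # df["new"] = ... is a column position too
            if isinstance(target, ast.Name):
                if is_frame_expr(node.value, self.known):
                    self.known.add(target.id)
                else:
                    self.known.discard(target.id)

    def visit_Subscript(self, node):
        key = node.slice
        if (isinstance(key, ast.Constant) and isinstance(key.value, str)
                and is_frame_expr(node.value, self.known)):
            self.columns.append(key.value)
        self.generic_visit(node)


def find_column_literals(source):
    finder = ColumnLiteralFinder()
    finder.visit(ast.parse(source))
    return finder.columns


SAMPLE = '''
import os
import pandas as pd

db_url = os.environ["DATABASE_URL"]
config = settings["feature_flags"]["new_dashboard"]
df = pd.read_csv("data.csv")
row = df["annual_salary"]
active = df[df["employment_status"] == "active"]
active["tenure_years"] = active["tenure_months"] / 12

def by_dept(frame: pd.DataFrame, cfg):
    return frame["dept"], cfg["key"]
'''
print(find_column_literals(SAMPLE))
```

The os.environ, settings, and cfg subscripts are never collected because nothing proves those bases are DataFrames, which is the under-obfuscation bias described above. The path passed to read_csv is ignored for the same reason: it is a positional argument, not a subscript on a known frame.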
Before / after
A real-shaped HR attrition pipeline. Watch the columns, the named aggregations, and notice what is kept.
Source the AI must never see:
import pandas as pd

def build_attrition_report(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df = df[df["employment_status"] == "active"]
    df["tenure_years"] = df["tenure_months"] / 12
    summary = (
        df.groupby("department")
        .agg(
            headcount=("employee_id", "count"),
            avg_salary=("annual_salary", "mean"),
            attrition_risk=("churn_probability", "mean"),
        )
        .reset_index()
        .sort_values("attrition_risk", ascending=False)
    )
    return summary
What the AI actually receives:
import pandas as pd

def mtd_3f2a1b0c(path: str) -> pd.DataFrame:
    fld_9c8d7e6f = pd.read_csv(path)
    fld_9c8d7e6f = fld_9c8d7e6f[fld_9c8d7e6f["col_4a1f8b2e"] == "active"]
    fld_9c8d7e6f["col_7d3e9a14"] = fld_9c8d7e6f["col_b6c2f085"] / 12
    fld_2e5a8c91 = (
        fld_9c8d7e6f.groupby("col_1f7b3d6a")
        .agg(
            col_c5e90a2b=("col_d40a1e77", "count"),
            col_88a1d3f4=("col_55b3e9c0", "mean"),
            col_e2d4b7c9=("col_aa1f2b30", "mean"),
        )
        .reset_index()
        .sort_values("col_e2d4b7c9", ascending=False)
    )
    return fld_2e5a8c91
The provider sees a function that filters, divides, groups, and aggregates. It does not see that this is attrition modelling, that employees have a churn_probability, or that the company tracks annual_salary. The shape of the analysis is intact; the meaning is gone.
Three things to notice in the diff:
- attrition_risk round-trips as one name. It's created as a named-aggregation kwarg (attrition_risk=) and referenced three lines later in sort_values("attrition_risk"). Both became col_e2d4b7c9: the registry is keyed by the real name, so a column that's born in .agg() and used in .sort_values() stays consistent. Get this wrong and the workspace raises KeyError: 'col_e2d4b7c9' because the sort references a column the agg never produced under that name.
- "active" is untouched. It's a value, not a column. It stays readable, which is the runtime-data boundary discussed below, not an oversight.
- pd, read_csv, groupby, agg, mean, count, reset_index, sort_values, ascending all survive. That's the PandasDetector's job, unchanged from earlier articles.
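The consistency between the .agg() keyword and the later sort_values() string comes from sending both through the same name-keyed alias function. A rough sketch of that idea follows. The alias function, the AGG_FUNCTIONS allow-list, and the COLUMN_FIRST_ARG set are all assumptions made for illustration, and the sketch trusts method names instead of running DataFrame inference first.

```python
import ast
import hashlib

AGG_FUNCTIONS = {"mean", "sum", "count", "first"}           # reduction names, never columns
COLUMN_FIRST_ARG = {"groupby", "sort_values", "set_index"}  # first positional arg names a column


def alias(name):
    # Assumed scheme (unsalted SHA-256, first 8 hex digits), for illustration only.
    return "col_" + hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]


def _is_str(node):
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def _is_named_agg(node):
    # Matches ("source_column", "reduction") tuples whose reduction is allow-listed.
    return (isinstance(node, ast.Tuple) and len(node.elts) == 2
            and _is_str(node.elts[0]) and _is_str(node.elts[1])
            and node.elts[1].value in AGG_FUNCTIONS)


class NamedAggRewriter(ast.NodeTransformer):
    def visit_Call(self, node):
        self.generic_visit(node)  # handle inner calls of the method chain first
        method = node.func.attr if isinstance(node.func, ast.Attribute) else None
        if method == "agg":
            for kw in node.keywords:
                if kw.arg is not None and _is_named_agg(kw.value):
                    kw.arg = alias(kw.arg)              # identifier position
                    source = kw.value.elts[0]
                    source.value = alias(source.value)  # string position
        elif method in COLUMN_FIRST_ARG and node.args and _is_str(node.args[0]):
            node.args[0].value = alias(node.args[0].value)
        return node


def rewrite(source):
    return ast.unparse(NamedAggRewriter().visit(ast.parse(source)))


SOURCE = (
    'summary = df.groupby("department")'
    '.agg(attrition_risk=("churn_probability", "mean"))'
    '.sort_values("attrition_risk", ascending=False)'
)
print(rewrite(SOURCE))
```

Because alias depends only on the real name, the keyword attrition_risk= and the string "attrition_risk" come out as the same token, while "mean" is skipped by the allow-list and ascending is never touched.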
What does NOT change (and why)
| Preserved | Why |
|---|---|
| Pandas API names: read_csv, groupby, agg, merge, pivot_table, to_csv, value_counts, reset_index, fillna, astype, … | PandasDetector: a fixed list of ~200 DataFrame/Series/IO method and attribute names. Renaming to_csv → mtd_xxx raises AttributeError |
| Aggregation function strings: "mean", "sum", "count", "first" | These are pandas reduction names, not columns. They're a small fixed allow-list inside the column detector |
| Value literals: "active", "2026-01-01", "USD" | Data, not schema. PromptCape never rewrites values (see boundary section) |
| File paths: pd.read_csv("attrition_2026.csv") | A path, not a column. Subscript/kwarg position is what marks a column; a positional path argument to read_csv is not a column position |
| Dict / env / JSON subscripts: os.environ["X"], config["y"] | The variable isn't an inferred DataFrame, so the string is left alone |
| df and other local variable names visible above | Not preserved: they are renamed (fld_…). Listed here only to contrast with the columns |
The data-file leak (the other half)
Obfuscating df["churn_probability"] to df["col_aa1f2b30"] in the code accomplishes nothing if the AI can open attrition_2026.csv and read this:
employee_id,department,annual_salary,churn_probability,employment_status
4471,Engineering,142000,0.12,active
Worse, there's a runtime conflict: the obfuscated code asks for a column col_aa1f2b30 that the real CSV header calls churn_probability. The workspace won't even run: KeyError: 'col_aa1f2b30'.
This is the same split the Python article drew for .env: a value that must exist at runtime but must not reach the AI's view of the workspace. PromptCape resolves pandas data files two ways depending on whether they're fixtures or real data:
- Bundled sample/fixture data (small CSVs the repo ships for tests/demos) get their header row rewritten with the same column registry: employee_id,department,annual_salary,... → col_d40a1e77,col_1f7b3d6a,col_55b3e9c0,.... The workspace runs on the fixture, the obfuscated code matches the obfuscated header, and the AI sees neither the real names nor the real rows' meaning (values stay, but a bare 0.12 under an opaque header leaks little).
- Real production data stays in the source project and never enters the workspace, exactly like .env. promptcape run resolves the data path at launch and a thin pandas IO shim applies rename(columns=registry) immediately after each read, so the code's col_… names line up with the freshly renamed frame. The real headers live only in the developer's source tree, never in ~/.promptcape/cache/<hash>/.
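The fixture case is the simpler of the two and can be sketched with the standard csv module. rewrite_fixture_header is a hypothetical helper, and one choice in it is an assumption: it aliases every header it finds, including columns the code never references, because the article does not say how PromptCape treats those.

```python
import csv
import io


def rewrite_fixture_header(csv_text, alias):
    """Replace the header row with aliases; data rows pass through unchanged.

    `alias` is any callable mapping a real column name to its col_ alias,
    for example the lookup method of a project-wide registry.
    """
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return csv_text  # empty file: nothing to rewrite
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([alias(name) for name in rows[0]])  # header only
    writer.writerows(rows[1:])                          # values stay as they are
    return out.getvalue()


# Aliases copied from the before/after listing in this article.
ALIASES = {
    "employee_id": "col_d40a1e77",
    "department": "col_1f7b3d6a",
    "annual_salary": "col_55b3e9c0",
}
FIXTURE = "employee_id,department,annual_salary\n4471,Engineering,142000\n"
print(rewrite_fixture_header(FIXTURE, ALIASES.__getitem__))
```

The production-data path is the mirror image: the file is left alone and the rename happens on the loaded frame, so the same mapping is applied after the read instead of to the bytes on disk.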
The principle is identical to the secrets story: the AI-visible workspace directory contains schema and structure, never the real vocabulary or the real rows.
Reverse-apply and AI-invented columns
When the AI adds a feature — "add a column flagging anyone above the 90th salary percentile" — it writes against the obfuscated frame:
fld_9c8d7e6f["col_new_flag"] = (
    fld_9c8d7e6f["col_55b3e9c0"] > fld_9c8d7e6f["col_55b3e9c0"].quantile(0.9)
)
Two cases on the way back:
- References to existing columns (col_55b3e9c0) hit the registry and reverse-map cleanly to annual_salary. Same hash-resolver as identifiers.
- The AI's new column is a name the AI invented. It's not in the registry, and that's fine: it's not a leak (the AI chose it, the provider never saw a real name). It comes back verbatim as whatever the AI typed (col_new_flag, or high_earner if it wrote a readable name). The developer reviews and renames to taste, identical to how AI-invented variable names are handled in the Java pipeline.
The one failure mode worth a guard: if the AI writes a col_xxxxxxxx-shaped string that collides with a registry hash it shouldn't (extraordinarily unlikely with 8 hex digits, but the resolver is strict), the pre-apply gate flags any col_ literal in the diff that maps to a column the surrounding code never read. In practice this never fires; it exists so a collision is loud rather than silent.
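Both reverse-apply cases and the collision guard can be sketched at the text level with a regular expression. This is a simplification: a real resolver would presumably work on parsed string literals instead of raw text, and reverse_apply, reverse_map, and columns_in_file are hypothetical names introduced for the example.

```python
import re

# Only tokens shaped like a registry alias are candidates: col_ plus 8 hex digits.
ALIAS_PATTERN = re.compile(r"\bcol_[0-9a-f]{8}\b")


def reverse_apply(ai_text, reverse_map, columns_in_file):
    """Map aliases back to real names and report aliases that look out of place.

    reverse_map:     alias -> real column name (the registry, inverted)
    columns_in_file: real names the original file already referenced
    """
    suspicious = set()

    def restore(match):
        token = match.group(0)
        real = reverse_map.get(token)
        if real is None:
            return token           # AI-invented name: comes back verbatim
        if real not in columns_in_file:
            suspicious.add(token)  # resolves, but this file never read that column
        return real

    return ALIAS_PATTERN.sub(restore, ai_text), suspicious


REVERSE_MAP = {"col_55b3e9c0": "annual_salary", "col_aa1f2b30": "churn_probability"}
AI_LINE = 'frame["col_new_flag"] = frame["col_55b3e9c0"] > frame["col_55b3e9c0"].quantile(0.9)'
restored, flagged = reverse_apply(AI_LINE, REVERSE_MAP, {"annual_salary"})
print(restored)
print(flagged)
```

col_new_flag never matches the pattern, so it passes through untouched. An alias that does resolve but points at a column the file never used is returned in the second value, so the caller can stop and ask instead of applying it silently.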
What this does NOT protect
The threat boundary is the same shape as the rest of the series, with one pandas-specific edge.
| Threat | Protected? | By what |
|---|---|---|
| Column names reaching the AI provider | Yes | Column registry + scoped string rewriting |
| Business logic / pipeline shape | Partially | The operations are visible (a groupby().agg() is still a groupby().agg()); only the vocabulary is gone |
| Runtime data values in the AI's view | Only if data is kept out / fixtures are header-rewritten | promptcape run data indirection; values themselves are never rewritten |
| Dynamic-only columns (never named in code) | Not via code obfuscation | They don't appear in source; they only leak if the data file does, which the data-file handling addresses |
| Columns inside un-inferrable DataFrames | No (deliberate) | Under-obfuscate rather than corrupt non-pandas subscripts |
The honest line: PromptCape removes the schema vocabulary from what the AI sees. It does not pretend to hide that you're doing data analysis, nor to encrypt the numbers. For a column called churn_probability, the name is the secret — and that's what's gone.
Conclusion
Pandas is the framework that inverts the series' central rule. Everywhere else, the leak was in identifiers and strings were safe to ignore. In pandas the leak is the strings, because column names are the business vocabulary and they live as literals.
The three load-bearing ideas:
- Scoped string rewriting, gated on proof. The fix isn't "rewrite strings now" — that re-breaks labels, paths, and dict keys. It's a column registry fed only by strings in proven column positions, hashed and reverse-mapped exactly like identifiers.
- DataFrame-variable inference is the whole ballgame. x["str"] is the most overloaded syntax in Python. Without knowing x is a DataFrame, you either leak columns (too cautious) or corrupt os.environ (too eager). The detector errs toward under-obfuscation because a broken app is worse than a missed name you can catch in review.
- The data file is half the problem. Obfuscating code while the AI can read attrition_2026.csv is theatre. Fixtures get header-rewritten; real data stays in the source and is renamed on load, the same secrets-never-touch-the-workspace principle as .env.
PromptCape ships open for trial at https://promptcape.com/ — free for 3 months, no credit card required. The PandasColumnDetector, the DataFrame-inference sidecar, and the data-file handling ship in the same JAR as the rest of the Python pipeline; the language is auto-detected from the source tree.