grepture

Posted on Mar 5

Stop Leaking PII Through Your OpenAI API Calls

#webdev #programming #javascript #ai

Every chat.completions.create call sends your prompt to OpenAI's servers. If that prompt contains user data — support tickets, form inputs, CRM records — there's a good chance it includes names, emails, phone numbers, and worse.

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: `Summarize this support ticket:

      From: Sarah Chen <sarah.chen@acme.com>
      Phone: (415) 555-0142
      SSN: 521-44-8832

      My order #38291 hasn't arrived. I live at
      742 Evergreen Terrace, Springfield, IL 62704.`,
    },
  ],
});

That single request just sent a name, email, phone number, SSN, and home address to an external service. Depending on your data and jurisdiction, regulations like GDPR, CCPA, or HIPAA can turn that into a compliance incident waiting to happen.

The problem is invisible

Most teams don't audit what's inside their AI prompts. The Authorization header is your OpenAI key — that's expected. The problem is the request body.

PII shows up in places you don't expect:

Support tickets — customer names, emails, account numbers embedded in the text
RAG chunks — documents from your vector store may contain PII from the original source
Chat history — previous messages in a conversation accumulate identifiers
CRM data — customer records pulled into prompts for personalization
Code snippets — hardcoded credentials, API keys, database connection strings

And it's not just direct identifiers. Under GDPR, data is personal if it can be combined with other information to identify someone. A user ID + timestamp + location? That's personal data.
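If you want a cheap first look at what your prompts contain, a rough audit pass over outgoing messages can help. The sketch below counts a few obvious pattern hits per request. The names `auditMessages` and `AUDIT_PATTERNS` are made up for illustration, and the three patterns are nowhere near complete (they won't catch names, addresses, or indirect identifiers).

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Illustrative patterns only; a real audit needs much broader coverage.
const AUDIT_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "email", pattern: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g },
  { label: "ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "phone", pattern: /\(\d{3}\)\s?\d{3}-\d{4}/g },
];

// Count pattern hits per category across every message in a request.
// Only counts are returned, so the audit log itself never stores PII.
function auditMessages(messages: ChatMessage[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const { label, pattern } of AUDIT_PATTERNS) {
    let total = 0;
    for (const message of messages) {
      const hits = message.content.match(pattern);
      total += hits ? hits.length : 0;
    }
    counts[label] = total;
  }
  return counts;
}
```

Logging these counts next to each API call is usually enough to show whether you have a problem, without writing the sensitive values into yet another system.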

What you can do about it

There are three approaches, from manual to automated:

1. Manual redaction (doesn't scale)

Write regex patterns or use string replacement to strip known PII patterns before each API call. This works for obvious cases (emails, phone numbers) but misses freeform PII like names in unstructured text.

// Fragile and incomplete
const sanitized = input
  .replace(/[\w.-]+@[\w.-]+\.\w+/g, "[EMAIL]")
  .replace(/\d{3}-\d{2}-\d{4}/g, "[SSN]");

Problems: you have to maintain the patterns, they miss edge cases, and you can't restore the original values in the response.

2. NER-based detection (better, but heavy)

Run a Named Entity Recognition model on every prompt before sending it, for example spaCy directly or Microsoft Presidio, which combines NER models with pattern-based recognizers. This is more accurate for names and organizations, but it adds latency and infrastructure complexity.
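NER tools generally hand back entity spans (character offsets plus a label) rather than a cleaned string, so you still have to apply them. Here's a minimal sketch of that step. The `EntitySpan` shape is an assumption for illustration, not any particular library's output format, and the sample spans are written by hand to stand in for a model's result.

```typescript
// Shape assumed for illustration; each NER library has its own output format.
interface EntitySpan {
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  label: string; // e.g. "PERSON", "ORG"
}

// Replace each span with its label, working right to left so that
// earlier offsets stay valid while the string changes length.
// Overlapping spans are not handled in this sketch.
function redactSpans(text: string, spans: EntitySpan[]): string {
  const ordered = spans.slice().sort((a, b) => b.start - a.start);
  let result = text;
  for (const span of ordered) {
    result = result.slice(0, span.start) + `[${span.label}]` + result.slice(span.end);
  }
  return result;
}

const sampleText = "Sarah Chen from Acme says her order is late.";
// Hand-written spans standing in for what a model would return.
const sampleSpans: EntitySpan[] = [
  { start: 0, end: 10, label: "PERSON" },
  { start: 16, end: 20, label: "ORG" },
];
const redactedText = redactSpans(sampleText, sampleSpans);
// redactedText is "[PERSON] from [ORG] says her order is late."
```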

3. Proxy-level redaction

Put a scanning proxy between your app and the AI provider. Every request is inspected and sanitized before it reaches the provider. Your application logic stays the same; only the client configuration changes.

This is the approach I built Grepture around — it's an open-source security proxy that sits in front of any AI API. Here's what the setup looks like:

import OpenAI from "openai";
import { Grepture } from "@grepture/sdk";

const grepture = new Grepture({
  apiKey: process.env.GREPTURE_API_KEY!,
  proxyUrl: "https://proxy.grepture.com",
});

const openai = new OpenAI({
  ...grepture.clientOptions({
    apiKey: process.env.OPENAI_API_KEY!,
    baseURL: "https://api.openai.com/v1",
  }),
});

// Every request is now scanned — your code doesn't change
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userInput }],
});

clientOptions() reroutes traffic through the proxy. Your OpenAI key is forwarded securely. The proxy scans every request against 50+ detection patterns (80+ on Pro) — emails, phone numbers, SSNs, credit cards, API keys, IBANs, and more.
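To make "every request is scanned" concrete, here is a rough in-process sketch of request interception. It is not how Grepture is implemented; it only illustrates the idea of rewriting a chat request body before it goes out. `withRedaction`, `scrub`, and the `FetchLike` type are hypothetical names, and `scrub` is a stand-in for real detection.

```typescript
interface RequestInitLike {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}
type FetchLike = (url: string, init?: RequestInitLike) => Promise<unknown>;

// Stand-in sanitizer; swap in whatever detection you actually trust.
function scrub(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g, "[EMAIL]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]");
}

// Wrap a fetch-like function so chat message contents are sanitized before sending.
function withRedaction(baseFetch: FetchLike, sanitize: (text: string) => string): FetchLike {
  return (url, init) => {
    if (!init || typeof init.body !== "string") return baseFetch(url, init);
    let payload: unknown;
    try {
      payload = JSON.parse(init.body);
    } catch {
      return baseFetch(url, init); // not JSON, pass through untouched
    }
    if (payload !== null && typeof payload === "object") {
      const messages = (payload as { messages?: unknown }).messages;
      if (Array.isArray(messages)) {
        for (const message of messages) {
          if (message !== null && typeof message === "object" && typeof message.content === "string") {
            message.content = sanitize(message.content);
          }
        }
      }
    }
    return baseFetch(url, { ...init, body: JSON.stringify(payload) });
  };
}
```

A dedicated proxy does the same kind of rewrite outside your process, which is what lets it cover every service and SDK at once instead of one wrapped client at a time.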

Reversible redaction: the key feature

Plain redaction breaks things. If you strip all names from a support ticket, the AI's summary is useless — "The customer [REDACTED] has an issue with [REDACTED]."

Reversible redaction (mask-and-restore) solves this. PII is replaced with consistent tokens:

What OpenAI sees:

Summarize this support ticket:
From: [PERSON_1] <[EMAIL_1]>
Phone: [PHONE_1]
SSN: [SSN_1]
My order #38291 hasn't arrived. I live at [ADDRESS_1].

What your app gets back:

The customer Sarah Chen (sarah.chen@acme.com) is asking about
order #38291 which hasn't been delivered to 742 Evergreen Terrace,
Springfield, IL 62704.

The model processes clean data with consistent entity references. Your application receives the full, personalized response. The PII that was detected and masked never reaches OpenAI.
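The mask-and-restore idea itself fits in a few lines. The sketch below is a simplified illustration, not Grepture's implementation: `mask`, `restore`, and `MASK_PATTERNS` are hypothetical names, and it only covers pattern-detectable values. Names and addresses like the ones in the example above would need NER-style detection on top.

```typescript
interface MaskResult {
  masked: string;
  vault: Map<string, string>; // token -> original value
}

const MASK_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "PHONE", pattern: /\(\d{3}\)\s?\d{3}-\d{4}/g },
];

// Replace each detected value with a numbered token.
// The same value always maps to the same token within one call.
function mask(text: string): MaskResult {
  const vault = new Map<string, string>();
  const tokenFor = new Map<string, string>(); // original value -> token
  let masked = text;
  for (const { label, pattern } of MASK_PATTERNS) {
    let counter = 0;
    masked = masked.replace(pattern, (value) => {
      const existing = tokenFor.get(value);
      if (existing !== undefined) return existing;
      counter += 1;
      const token = `[${label}_${counter}]`;
      tokenFor.set(value, token);
      vault.set(token, value);
      return token;
    });
  }
  return { masked, vault };
}

// Swap tokens in the model's reply back to the original values.
// Unknown tokens are left as they are.
function restore(reply: string, vault: Map<string, string>): string {
  return reply.replace(/\[[A-Z]+_\d+\]/g, (token) => {
    const original = vault.get(token);
    return original !== undefined ? original : token;
  });
}
```

You would call `mask()` on the prompt, send `masked` to the model, then run `restore()` on the reply with the same `vault`. The vault holds the original values, so it should live only as long as the request does.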

Works with any provider

While I used OpenAI in these examples, the same proxy approach works with any AI provider — Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Mistral, Groq. You just change the baseURL and apiKey:

// Anthropic
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({
  ...grepture.clientOptions({
    apiKey: process.env.ANTHROPIC_API_KEY!,
    baseURL: "https://api.anthropic.com",
  }),
});

// Google Gemini (OpenAI-compatible endpoint)
const gemini = new OpenAI({
  ...grepture.clientOptions({
    apiKey: process.env.GEMINI_API_KEY!,
    baseURL: "https://generativelanguage.googleapis.com/v1beta/openai",
  }),
});

For non-SDK calls (webhooks, custom HTTP requests), there's a drop-in fetch replacement:

const response = await grepture.fetch("https://api.example.com/data", {
  method: "POST",
  // Tell the receiving API that the body is JSON
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(payload),
});

GDPR angle: why this matters now

If you're processing EU user data through AI APIs, every API call is a data transfer to a third-party processor. GDPR requires:

Data minimization — only send what's necessary
Data Processing Agreements — signed with every AI provider
Transfer Impact Assessments — for cross-border transfers to US providers

The simplest way to satisfy data minimization? Don't send personal data at all. Redact before the API call, restore after.
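Data minimization can also start before any redaction runs: build the prompt from an explicit allowlist of fields instead of serializing the whole record. A minimal sketch, with a made-up `CustomerRecord` shape and field names:

```typescript
// Hypothetical record shape for illustration.
interface CustomerRecord {
  name: string;
  email: string;
  phone: string;
  orderId: string;
  issue: string;
}

// Only these fields are needed to summarize a ticket; everything else stays behind.
const PROMPT_FIELDS = ["orderId", "issue"] as const;

function buildPrompt(record: CustomerRecord): string {
  const lines = PROMPT_FIELDS.map((field) => `${field}: ${record[field]}`);
  return `Summarize this support ticket:\n${lines.join("\n")}`;
}
```

Free-text fields like `issue` can still carry PII typed in by the user, so an allowlist narrows what redaction has to catch rather than replacing it.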

I wrote a longer guide on this: How to Make AI API Calls GDPR-Compliant.

Getting started

npm install @grepture/sdk
Get an API key at grepture.com — free tier includes 1,000 requests/month
Wrap your AI client with clientOptions() or use grepture.fetch()

The docs have setup guides for every major provider.

DEV Community