Ilya Ploskovitov

Posted on Jun 10

PII-Shield: Cleaning PII From Logs Before It Reaches ELK

#security #devops #go #opensource

The first idea was simple.

Take a log line. Look at suspicious parts. Count entropy. Hide anything that looks like a random secret.

PII means personally identifiable information. It includes emails, phone numbers, addresses, passport numbers, card numbers, access tokens, and other values that should not move freely through logs.

At first, entropy looked like a good signal. Many tokens, keys, and session values really look like noise:

x9VdQp2Mz_La77kPq0
sk_live_51Nx...
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

But entropy alone was not enough.

Some values have low entropy and still must be hidden. For example: password=123, token=dev, cvv=000.

Other values look random, but they are not secrets. Trace IDs, UUIDs, short commit hashes, request IDs, and path fragments can all look suspicious.

If the entropy threshold is too low, the filter breaks useful logs. If the threshold is too high, it misses weak secrets.

That is why PII-Shield grew beyond entropy. It added regex rules, sensitive keys, allow lists, and separate validators like the Luhn algorithm for payment card numbers.

I also did not like where PII is often cleaned today.

Many teams clean logs at the Fluentd, Logstash, SIEM, or log pipeline level. This helps. But it is late. The raw data has already left the application. It may have passed through buffers, retries, temporary files, alerts, and dashboards.

PII-Shield tries to clean the data earlier. It is an open-source tool that removes PII and secrets from logs before they leave the pod.

Repository: https://github.com/pii-shield/pii-shield

The Basic Idea

The short version is:

application writes a log
        |
        v
PII-Shield reads the raw log near the app
        |
        v
only the cleaned line goes out

The goal is not "clean it later". The goal is "do not let the raw value leave".

PII-Shield can be used in several ways:

a CLI tool or container that filters standard input and output;
a sidecar container in Kubernetes;
a Kubernetes operator that injects the sidecar when a pod is created;
Helm charts for installation;
WASM SDKs for Node.js and Python, if you want to run the scanner inside the process.

The main Kubernetes path works like this. The application writes logs to a file on a shared volume. The sidecar reads that file. It scans each line. Then it writes the cleaned stream to standard output. A normal log collector can read it from there.

┌──────────────────── pod ────────────────────┐
│                                             │
│  app container                              │
│      │                                      │
│      │ /var/log/app/output.log             │
│      v                                      │
│  shared emptyDir volume                     │
│      │                                      │
│      v                                      │
│  pii-shield sidecar -> sanitized stdout     │
│                                             │
└─────────────────────────────────────────────┘
                      │
                      v
              Loki / ELK / S3 / SIEM

This is not invisible interception of everything. The application must write to a known file. But this path is easy to test. It does not need a shell inside the sidecar image. It also does not change the application runtime.

What Counts As Sensitive

The scanner does not have one magic button for "find all private data". It uses several layers.

The first layer is sensitive keys. If a line has password=..., token=..., secret=..., or api_key=..., the value near that key should be hidden.

input:  payment failed token=sk_live_51Nx...
output: payment failed token=[HIDDEN:9b22c1]
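A sensitive-key layer of this kind could look roughly like the sketch below. It assumes a fixed key list and the plain key=value form only; the names sensitiveKeyRe and redactSensitiveKeys are hypothetical, and the marker function is passed in so the sketch does not guess how the real marker is built.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sensitiveKeyRe matches key=value pairs for a small, assumed key list.
// Only the plain key=value form is covered in this sketch.
var sensitiveKeyRe = regexp.MustCompile(`(?i)\b(password|token|secret|api_key)=(\S+)`)

// redactSensitiveKeys keeps the key and replaces the value with mark(value).
func redactSensitiveKeys(line string, mark func(string) string) string {
	return sensitiveKeyRe.ReplaceAllStringFunc(line, func(m string) string {
		i := strings.IndexByte(m, '=') // always present: the pattern requires '='
		return m[:i+1] + mark(m[i+1:])
	})
}

func main() {
	// fixedMark is a placeholder marker function for the demo.
	fixedMark := func(string) string { return "[HIDDEN]" }
	fmt.Println(redactSensitiveKeys("payment failed token=sk_live_example", fixedMark))
	fmt.Println(redactSensitiveKeys("login ok user=42", fixedMark))
}
```

Note that the value is hidden regardless of its entropy. That is the point of this layer: password=123 is caught because of the key, not because of how the value looks.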

The second layer is custom regex rules.

Many companies have their own internal IDs. These can be ticket numbers, policy IDs, customer IDs, medical record numbers, legal case numbers, or contract numbers.

These values are hard to guess from general signals. It is better to say it directly:

export PII_CUSTOM_REGEX_LIST='[
  {"pattern": "^MRN-[0-9]{8}$", "name": "MedicalRecord"},
  {"pattern": "^CASE-[0-9]{4}-[0-9]{6}$", "name": "CaseNumber"}
]'

There is a performance detail here.

If a user adds ten rules and the scanner runs ten separate regex checks on each token, line processing gets slower. So each rule is first compiled and validated on its own when the config is loaded. Then the rules are joined into one larger regexp with |.

Each rule is wrapped in a group. This lets the scanner know which name to put into the redaction marker.

For example:

(^MRN-[0-9]{8}$)|(^CASE-[0-9]{4}-[0-9]{6}$)

In the code this is stored as CombinedCustomRegex. The separate compiled rules are still kept in the config. But the main path uses the combined regexp.

This does not promise a speed win for every possible rule set. Regex performance depends on the rules. But it removes the need to try each custom regex one after another on every token.
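Here is a minimal sketch of the combining step using Go's regexp package. It is an illustration under assumptions, not the code behind CombinedCustomRegex: the type and function names are hypothetical, and it uses named groups (rule0, rule1, ...) rather than plain groups so that a user pattern with its own capture groups does not shift the numbering.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// customRule mirrors one entry of PII_CUSTOM_REGEX_LIST.
type customRule struct {
	Pattern string
	Name    string
}

// combineRules validates each pattern on its own, then joins all of them
// into one alternation with a named group per rule.
func combineRules(rules []customRule) (*regexp.Regexp, error) {
	parts := make([]string, 0, len(rules))
	for i, r := range rules {
		if _, err := regexp.Compile(r.Pattern); err != nil {
			return nil, fmt.Errorf("rule %q: %w", r.Name, err)
		}
		parts = append(parts, fmt.Sprintf("(?P<rule%d>%s)", i, r.Pattern))
	}
	return regexp.Compile(strings.Join(parts, "|"))
}

// matchRule runs the combined regexp once and reports which rule matched.
func matchRule(re *regexp.Regexp, rules []customRule, token string) (string, bool) {
	m := re.FindStringSubmatchIndex(token)
	if m == nil {
		return "", false
	}
	for i, r := range rules {
		g := re.SubexpIndex(fmt.Sprintf("rule%d", i))
		if g >= 0 && m[2*g] >= 0 { // a start offset >= 0 means this group took part
			return r.Name, true
		}
	}
	return "", false
}

func main() {
	rules := []customRule{
		{Pattern: `^MRN-[0-9]{8}$`, Name: "MedicalRecord"},
		{Pattern: `^CASE-[0-9]{4}-[0-9]{6}$`, Name: "CaseNumber"},
	}
	re, err := combineRules(rules)
	if err != nil {
		panic(err)
	}
	for _, tok := range []string{"MRN-12345678", "CASE-2024-000123", "hello"} {
		name, ok := matchRule(re, rules, tok)
		fmt.Printf("%-18s matched=%v rule=%q\n", tok, ok, name)
	}
}
```

Validating each pattern separately first also gives a better error message: a broken rule is reported by name instead of as a syntax error somewhere inside one long alternation.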

The third layer is entropy.

Many secrets look like random text. API keys, session tokens, and random passwords are common examples. PII-Shield uses Shannon entropy for this. If a token is long enough and random enough, the scanner treats it as suspicious.

input:  Authorization failed: x9VdQp2Mz_La77kPq0
output: Authorization failed: [HIDDEN:3e12aa]
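For reference, Shannon entropy over the characters of a token is H = -Σ p(c) · log2 p(c), where p(c) is the share of character c in the token. The sketch below computes it in Go. The minimum length (12) and the threshold (3.5 bits per character) are illustrative assumptions, not PII-Shield's actual defaults, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// looksRandom applies a length and an entropy threshold.
// Both values are tuning knobs; the ones used in main are assumptions.
func looksRandom(token string, minLen int, minBits float64) bool {
	return len(token) >= minLen && shannonEntropy(token) >= minBits
}

func main() {
	for _, tok := range []string{"x9VdQp2Mz_La77kPq0", "aaaaaaaaaaaaaaaaaa", "failed"} {
		fmt.Printf("%-20s entropy=%.2f suspicious=%v\n",
			tok, shannonEntropy(tok), looksRandom(tok, 12, 3.5))
	}
}
```

The length check matters as much as the threshold. A short token cannot reach high entropy at all, because entropy is capped at log2 of the token length.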

But entropy can conflict with real logs. Commit hashes, UUIDs, trace IDs, request IDs, and file paths can also look suspicious. So there is an allow list:

export PII_SAFE_REGEX_LIST='[
  {"pattern": "^[a-f0-9]{7}$", "name": "GitShortSHA"}
]'

The allow list runs before other checks. If an allow-list rule is too broad, it can let through a value that should be hidden. This is not a bug in the idea. It is the cost of manual tuning.

The fourth layer is validation for specific data types.

Payment card numbers are checked with the Luhn algorithm. This is not just a regex for 16 digits. The scanner looks for sequences of 13 to 19 digits. It checks boundaries. It drops weak digit sets. Then it checks the Luhn checksum.

There is also a context check to reduce false positives.

A number can pass Luhn and still be just a trace value:

TraceId=4556737586899855

That should not always be treated as a card.

But this line has card context:

visa card 4556737586899855 provided

This one should be hidden.
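The two examples above can be reproduced with a small sketch: a standard Luhn check plus a naive context test. The Luhn part is the well-known algorithm. The context part is an assumption for illustration only: the word list and the substring match are hypothetical, and the boundary and weak-digit checks mentioned earlier are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// luhnValid reports whether digits (13 to 19 ASCII digits) passes the Luhn checksum.
func luhnValid(digits string) bool {
	if len(digits) < 13 || len(digits) > 19 {
		return false
	}
	sum := 0
	double := false // the rightmost digit is not doubled
	for i := len(digits) - 1; i >= 0; i-- {
		c := digits[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

// cardWords is a hypothetical context list; a real one would need tuning.
var cardWords = []string{"card", "visa", "mastercard", "amex"}

// hasCardContext is a naive substring check for card-related words.
func hasCardContext(line string) bool {
	lower := strings.ToLower(line)
	for _, w := range cardWords {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

// shouldHideAsCard requires both a valid checksum and card context.
func shouldHideAsCard(line, digits string) bool {
	return luhnValid(digits) && hasCardContext(line)
}

func main() {
	const n = "4556737586899855"
	fmt.Println(shouldHideAsCard("TraceId="+n, n))               // valid Luhn, no card context
	fmt.Println(shouldHideAsCard("visa card "+n+" provided", n)) // valid Luhn and card context
}
```

A plain substring match is deliberately crude here. It would also fire on words like "discard", which is one reason a context check needs its own tests for false positives.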

Why Use A Hash Instead Of [REDACTED]

If every secret becomes [REDACTED], debugging gets harder.

Sometimes you need to know that ten errors all had the same token or the same user ID. You still should not see the raw value.

So PII-Shield replaces a sensitive value with a short marker:

[HIDDEN:a1b2c3]

The marker is derived from the raw value and a salt. If PII_SALT is stable, the same raw value always gets the same marker, so QA and SRE teams can correlate events across logs without seeing the value.

If the salt is random on every start, this link is lost after a restart.
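A salted marker of this shape can be built with a keyed hash. The sketch below assumes HMAC-SHA-256 keyed with the salt and truncated to three bytes (six hex characters, as in the examples). The hash function, the truncation length, and the name marker are assumptions for illustration, not a description of PII-Shield's internals.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// marker derives a short, stable tag from a value and a salt.
// HMAC-SHA-256 and the 3-byte truncation are assumptions in this sketch.
func marker(salt []byte, value string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(value)) // hash.Hash.Write never returns an error
	sum := mac.Sum(nil)
	return "[HIDDEN:" + hex.EncodeToString(sum[:3]) + "]"
}

func main() {
	stable := []byte("salt-loaded-from-a-secret") // placeholder, not a real salt
	other := []byte("salt-after-a-restart")       // simulates a random salt per start

	fmt.Println(marker(stable, "ivan@example.com"))
	fmt.Println(marker(stable, "ivan@example.com")) // same salt: identical marker
	fmt.Println(marker(other, "ivan@example.com"))  // new salt: correlation is lost
}
```

A six-character suffix is enough to correlate events within an incident, but it is short by design. Two different values can occasionally share a marker, so it should be read as a hint, not as proof of identity.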

For production, the salt should come from a secret:

env:
  - name: PII_SALT
    valueFrom:
      secretKeyRef:
        name: pii-shield-secrets
        key: salt

It is also better to require a strong salt:

PII_REQUIRE_STRONG_SALT=true

Quick Test Without Kubernetes

The fastest way to try the scanner is through the container:

echo 'login failed email=ivan@example.com password=MySecretPass123!' \
  | docker run -i --rm ghcr.io/pii-shield/pii-shield:2.1.0

The output should look like this:

login failed email=[HIDDEN:...] password=[HIDDEN:...]

The exact hash suffix depends on the salt.

Kubernetes: Operator And Policy

For Kubernetes, there is an operator. It can be installed with Helm:

helm repo add pii-shield https://pii-shield.github.io/pii-shield/
helm repo update

helm install pii-shield-operator pii-shield/pii-shield-operator \
  -n operator-system \
  --create-namespace

Then you create a PiiPolicy:

apiVersion: core.pii-shield.io/v1alpha1
kind: PiiPolicy
metadata:
  name: strict-policy
  namespace: default
spec:
  injectionMode: file
  logPath: /var/log/app/output.log
  failPolicy: open

Then you mark the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api
spec:
  template:
    metadata:
      labels:
        pii-shield.io/inject: "true"
      annotations:
        pii-shield.io/policy: "strict-policy"

The webhook adds the sidecar, the volume, and the settings. The application writes to /var/log/app/output.log. The sidecar reads this file and prints the cleaned stream.

The Helm chart has small resource limits: 30Mi of memory and 50m CPU for the sidecar. You still need to measure this on your own logs. This matters when logs contain a lot of JSON or very long lines.

Why The Scanner Does Not Use A Heavy JSON Parser

Logs often come as JSON. But running encoding/json on every line would add parsing and allocation overhead to a continuous stream.

The scanner only needs to find values and keep the structure valid enough for redaction. So it uses a narrower parser. It can handle JSON-like text for this task, but it does not turn every line into a full object tree.

This does not make the code prettier. It does reduce allocations and surprises under load.

There are tests for:

nested JSON;
broken lines;
binary garbage;
multilingual logs;
false positives;
custom regex rules;
fuzz regressions.

For scanner-only benchmarks:

go test -bench=. -benchmem ./pkg/scanner

For end-to-end CLI throughput:

./benchmark/run_benchmarks.sh

In the current project notes, the normal Go scanner was in the microsecond range per line on a synthetic corpus.

A neural PII detector on CPU was about three orders of magnitude slower in my test. So a model layer should not run on every line by default. If it is added, it should be an explicit mode. It may make sense for medical logs, legal logs, support chats, and similar domains.

Fail Open Or Fail Closed

A log filter has an unpleasant choice. What should happen if the scanner fails?

fail open means the line is allowed to pass. Logs keep flowing. The application and monitoring do not go blind. But there is a risk that a raw value escapes.

fail closed means the raw line is not released. The filter returns a drop marker instead. This is safer for privacy. It is worse for debugging.

PII-Shield makes this configurable:

PII_FAIL_POLICY=open

or:

PII_FAIL_POLICY=closed

The default is open. Losing logs in production can become a separate incident. For strict compliance workloads, the right choice may be different.

Current Limits

I do not want to hide this part at the end of the README.

PII-Shield can be run and tested now. But not all modes have the same maturity:

file-based sidecar mode is the main practical path;
fully transparent protection for normal Kubernetes stdout and stderr logs is not production-ready yet;
pipe mode changes the target container command and needs testing per workload;
the Kubernetes operator is still being stabilized;
eBPF mode is research work, not a production compliance control;
Proxy-Wasm gateway integration and a visual control plane are still planned work.

This part is boring, but it matters. A privacy tool without clear limits quickly becomes a decorative shield.

What Was Hard

Not regex. Not Helm.

The hard part was not breaking normal logs.

If the scanner is too aggressive, it hides trace IDs, commit hashes, harmless UUIDs, and path fragments. If it is too relaxed, it misses weak passwords and handmade tokens.

One global threshold does not fit every language and every domain. And an allow list can create its own problems if it is too broad.

So the project is not moving toward one smart detector. It is moving toward several clear layers:

sensitive keys;
strict custom rules;
entropy with tuning;
allow lists;
profiles for different domains;
maybe an optional model check where the extra latency is worth it.

Less magic. More testable rules.

Where This Helps

PII-Shield does not replace logging discipline.

Developers still should not write passwords into log.info(...). APIs still need to validate input. Security teams still need to check retention, access, and log export.

But a filter near the application helps with one common mistake: a sensitive value enters the shared log stream by accident.

This matters when logs move into:

central log storage;
a data lake;
incident analysis tools;
analytics and reporting;
LLM/RAG pipelines;
AI agents that read traces and evidence chains.

The last point was important for me. If PII gets into a training or evaluation dataset, removing one line from Loki is no longer enough.

What Is Next

The next work areas are:

production support for Kubernetes stdout and stderr;
a more stable operator lifecycle;
stronger release checks: checksums, image digests, and artifact provenance;
domain-specific profiles;
selective model checks without a large latency hit;
Proxy-Wasm and eBPF work only where they give a real benefit.

PII-Shield is not a universal cure. It is a practical layer of protection in a place that is often not controlled well: between the application and the log system.

If there is one idea to take from this article, it is this: clean private data as close as possible to the place where it appears. Everything after that is harder to check and harder to undo.

Top comments (4)

Joe Bordes • Jun 13

Nice (and important) work!
Kudos!

Kevin • Jun 10

This is the correct direction. Thanks for sharing!

VoltageGPU • Jun 15

Interesting approach using entropy to detect potential PII. Have you considered integrating a deterministic tokenization layer before the logs even hit your ELK stack? At my job, we use a similar pattern with VoltageGPU for real-time scrubbing of sensitive data in telemetry before it's stored or visualized. It helps reduce false positives and keeps the pipeline clean.

Ilya Ploskovitov • Jun 17

Thanks! In this case one-way HMAC is a deliberate choice — the pipeline is designed so cleaned logs are safe even if ELK is breached, so reversibility would be a liability rather than a feature. Detokenization needs a vault, which is exactly the high-value target I want to avoid.