Ilya Ploskovitov

PII-Shield: Cleaning PII From Logs Before It Reaches ELK

The first idea was simple.

Take a log line. Look at suspicious parts. Count entropy. Hide anything that looks like a random secret.

PII means personally identifiable information. It includes emails, phone numbers, addresses, passport numbers, card numbers, access tokens, and other values that should not move freely through logs.

At first, entropy looked like a good signal. Many tokens, keys, and session values really look like noise:

x9VdQp2Mz_La77kPq0
sk_live_51Nx...
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

But entropy alone was not enough.

Some values have low entropy and still must be hidden. For example: password=123, token=dev, cvv=000.

Other values look random, but they are not secrets. Trace IDs, UUIDs, short commit hashes, request IDs, and path fragments can all look suspicious.

If the entropy threshold is too low, the filter breaks useful logs. If the threshold is too high, it misses weak secrets.

That is why PII-Shield grew beyond entropy. It added regex rules, sensitive keys, allow lists, and separate validators like the Luhn algorithm for payment card numbers.

I also did not like where PII is often cleaned today.

Many teams clean logs at the Fluentd, Logstash, SIEM, or log pipeline level. This helps. But it is late. The raw data has already left the application. It may have passed through buffers, retries, temporary files, alerts, and dashboards.

PII-Shield tries to clean the data earlier. It is an open-source tool that removes PII and secrets from logs before they leave the pod.

Repository: https://github.com/pii-shield/pii-shield

The Basic Idea

The short version is:

application writes a log
        |
        v
PII-Shield reads the raw log near the app
        |
        v
only the cleaned line goes out

The goal is not "clean it later". The goal is "do not let the raw value leave".

PII-Shield can be used in several ways:

  • a CLI tool or container that filters standard input and output;
  • a sidecar container in Kubernetes;
  • a Kubernetes operator that injects the sidecar when a pod is created;
  • Helm charts for installation;
  • WASM SDKs for Node.js and Python, if you want to run the scanner inside the process.

The main Kubernetes path works like this. The application writes logs to a file on a shared volume. The sidecar reads that file. It scans each line. Then it writes the cleaned stream to standard output. A normal log collector can read it from there.

┌──────────────────── pod ────────────────────┐
│                                             │
│  app container                              │
│      │                                      │
│      │ /var/log/app/output.log             │
│      v                                      │
│  shared emptyDir volume                     │
│      │                                      │
│      v                                      │
│  pii-shield sidecar -> sanitized stdout     │
│                                             │
└─────────────────────────────────────────────┘
                      │
                      v
              Loki / ELK / S3 / SIEM

This is not invisible interception of everything. The application must write to a known file. But this path is easy to test. It does not need a shell inside the sidecar image. It also does not change the application runtime.

What Counts As Sensitive

The scanner does not have one magic button for "find all private data". It uses several layers.

The first layer is sensitive keys. If a line has password=..., token=..., secret=..., or api_key=..., the value near that key should be hidden.

input:  payment failed token=sk_live_51Nx...
output: payment failed token=[HIDDEN:9b22c1]
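A sensitive-key check like the one above can be approximated with a single regular expression. This is a minimal sketch, not the project's code: the key list, the `redactSensitiveKeys` name, and the plain `[HIDDEN]` marker (without a hash suffix) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitiveKV matches key=value pairs whose key is in a small,
// hypothetical list. Group 1 is the key, group 2 is the value.
var sensitiveKV = regexp.MustCompile(`(?i)\b(password|token|secret|api_key)=(\S+)`)

// redactSensitiveKeys keeps the key and replaces the value.
func redactSensitiveKeys(line string) string {
	return sensitiveKV.ReplaceAllString(line, "${1}=[HIDDEN]")
}

func main() {
	fmt.Println(redactSensitiveKeys("payment failed token=sk_live_51Nx..."))
	fmt.Println(redactSensitiveKeys("login password=123 user=bob"))
}
```

This layer does not look at the value at all, which is why it catches low-entropy cases such as password=123.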

The second layer is custom regex rules.

Many companies have their own internal IDs. These can be ticket numbers, policy IDs, customer IDs, medical record numbers, legal case numbers, or contract numbers.

These values are hard to guess from general signals. It is better to say it directly:

export PII_CUSTOM_REGEX_LIST='[
  {"pattern": "^MRN-[0-9]{8}$", "name": "MedicalRecord"},
  {"pattern": "^CASE-[0-9]{4}-[0-9]{6}$", "name": "CaseNumber"}
]'

There is a performance detail here.

If a user adds ten rules and the scanner runs ten separate regex checks on each token, every line costs more to process. So each rule is first compiled and validated on its own when the config is loaded. Then the rules are joined into one larger regexp with |.

Each rule is wrapped in a group. This lets the scanner know which name to put into the redaction marker.

For example:

(^MRN-[0-9]{8}$)|(^CASE-[0-9]{4}-[0-9]{6}$)

In the code this is stored as CombinedCustomRegex. The separate compiled rules are still kept in the config. But the main path uses the combined regexp.

This does not promise a speed win for every possible rule set. Regex performance depends on the rules. But it removes the need to try each custom regex one after another on every token.
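The combine-then-match idea can be sketched with Go's standard regexp package. The names `rule`, `combine`, and `matchRule` are hypothetical and this is not the code behind CombinedCustomRegex. The sketch also assumes that rules contain no capture groups of their own, so that group i+1 always belongs to rule i.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type rule struct {
	Name    string
	Pattern string
}

// combine validates each rule on its own, then joins all rules into one
// alternation where capture group i+1 belongs to rules[i].
func combine(rules []rule) (*regexp.Regexp, error) {
	parts := make([]string, 0, len(rules))
	for _, r := range rules {
		re, err := regexp.Compile(r.Pattern)
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", r.Name, err)
		}
		if re.NumSubexp() > 0 {
			// Inner capture groups would shift the group-to-rule mapping.
			return nil, fmt.Errorf("rule %q: capture groups not supported in this sketch", r.Name)
		}
		parts = append(parts, "("+r.Pattern+")")
	}
	return regexp.Compile(strings.Join(parts, "|"))
}

// matchRule returns the name of the rule whose group took part in the match.
func matchRule(combined *regexp.Regexp, rules []rule, token string) (string, bool) {
	idx := combined.FindStringSubmatchIndex(token)
	if idx == nil {
		return "", false
	}
	for i := range rules {
		if idx[2*(i+1)] >= 0 { // start offset of group i+1, -1 if unused
			return rules[i].Name, true
		}
	}
	return "", false
}

func main() {
	rules := []rule{
		{"MedicalRecord", `^MRN-[0-9]{8}$`},
		{"CaseNumber", `^CASE-[0-9]{4}-[0-9]{6}$`},
	}
	combined, err := combine(rules)
	if err != nil {
		panic(err)
	}
	for _, tok := range []string{"MRN-12345678", "CASE-2024-000123", "hello"} {
		name, ok := matchRule(combined, rules, tok)
		fmt.Println(tok, name, ok)
	}
}
```

One pass over the token finds both the match and the rule name, so the marker can carry "MedicalRecord" or "CaseNumber" without a second lookup.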

The third layer is entropy.

Many secrets look like random text. API keys, session tokens, and random passwords are common examples. PII-Shield uses Shannon entropy for this. If a token is long enough and random enough, the scanner treats it as suspicious.

input:  Authorization failed: x9VdQp2Mz_La77kPq0
output: Authorization failed: [HIDDEN:3e12aa]

But entropy can fight with real logs. Commit hashes, UUIDs, trace IDs, request IDs, and file paths can also look suspicious. So there is an allow list:

export PII_SAFE_REGEX_LIST='[
  {"pattern": "^[a-f0-9]{7}$", "name": "GitShortSHA"}
]'

The allow list runs before other checks. If a rule is too broad, it can allow a value that should be hidden. This is not a bug in the idea. It is the cost of manual tuning.
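The difference between a strict and a broad allow rule is easy to see in a small experiment. The sketch below compares the anchored pattern from the example with an unanchored variant. The variable names and the second test token are made up for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Anchored: the whole token must be exactly 7 lowercase hex characters.
	strictSHA = regexp.MustCompile(`^[a-f0-9]{7}$`)
	// Unanchored: any token that merely contains 7 hex characters matches.
	broadSHA = regexp.MustCompile(`[a-f0-9]{7}`)
)

func main() {
	for _, tok := range []string{"3f9a2bc", "sk_live_deadbeef1234"} {
		fmt.Printf("%-22s strict=%v broad=%v\n",
			tok, strictSHA.MatchString(tok), broadSHA.MatchString(tok))
	}
}
```

With the unanchored rule, the second token would be treated as safe before any other layer sees it. Anchoring allow rules with ^ and $ keeps them as narrow as they look.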

The fourth layer is validation for specific data types.

Payment card numbers are checked with the Luhn algorithm. This is not just a regex for 16 digits. The scanner looks for digit sequences from 13 to 19 digits. It checks boundaries. It drops weak digit sets. Then it checks the Luhn checksum.

There is also a context check to reduce false positives.

A number can pass Luhn and still be just a trace value:

TraceId=4556737586899855

That should not always be treated as a card.

But this line has card context:

visa card 4556737586899855 provided

This one should be hidden.
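The two examples can be reproduced with a Luhn check plus a simple context test. This is a sketch under assumptions: the `cardWords` list and the function names are hypothetical, the context check is a plain word lookup, and the boundary and weak-digit checks mentioned above are not included.

```go
package main

import (
	"fmt"
	"strings"
)

// luhnValid reports whether digits (13 to 19 ASCII digits) passes Luhn.
func luhnValid(digits string) bool {
	if len(digits) < 13 || len(digits) > 19 {
		return false
	}
	sum := 0
	double := false // every second digit from the right is doubled
	for i := len(digits) - 1; i >= 0; i-- {
		c := digits[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

// cardWords is a hypothetical context list for this example.
var cardWords = map[string]bool{"card": true, "visa": true, "mastercard": true, "amex": true}

// hasCardContext reports whether any word in the line hints at a payment card.
func hasCardContext(line string) bool {
	for _, w := range strings.Fields(strings.ToLower(line)) {
		if cardWords[w] {
			return true
		}
	}
	return false
}

// shouldHideAsCard requires both a valid checksum and card context.
func shouldHideAsCard(line, digits string) bool {
	return luhnValid(digits) && hasCardContext(line)
}

func main() {
	const n = "4556737586899855"
	fmt.Println(shouldHideAsCard("TraceId="+n, n))
	fmt.Println(shouldHideAsCard("visa card "+n+" provided", n))
}
```

The number is the same in both calls. Only the surrounding words change the decision.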

Why Use A Hash Instead Of [REDACTED]

If every secret becomes [REDACTED], debugging gets harder.

Sometimes you need to know that ten errors all had the same token or the same user ID. You still should not see the raw value.

So PII-Shield replaces a sensitive value with a short marker:

[HIDDEN:a1b2c3]

The marker is a short hash of the raw value, mixed with a salt. If PII_SALT is stable, the same raw value gets the same marker, so QA and SRE teams can compare events across logs.

If the salt is random on every start, this link is lost after a restart.
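One way to build such a marker is a keyed hash that is cut down to a few hex characters. The sketch below uses HMAC-SHA256 with the salt as the key. That choice, the `marker` function name, and the six-character length are assumptions for the example. The article does not say which hash construction PII-Shield uses.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// marker returns a short, salt-dependent tag for value.
// HMAC-SHA256 is an assumption here, not the confirmed construction.
func marker(salt []byte, value string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(value))
	sum := mac.Sum(nil)
	// Keep 3 bytes, which gives 6 hex characters as in [HIDDEN:a1b2c3].
	return "[HIDDEN:" + hex.EncodeToString(sum[:3]) + "]"
}

func main() {
	salt := []byte("example-salt-do-not-use")
	fmt.Println(marker(salt, "sk_live_example"))
	fmt.Println(marker(salt, "sk_live_example")) // same value, same marker
	fmt.Println(marker(salt, "another-value"))
}
```

A marker this short is a correlation aid, not an identifier. Two different values can share a marker, so it should support a debugging hunch rather than prove that two events are linked.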

For production, the salt should come from a secret:

env:
  - name: PII_SALT
    valueFrom:
      secretKeyRef:
        name: pii-shield-secrets
        key: salt

It is also better to require a strong salt:

PII_REQUIRE_STRONG_SALT=true

Quick Test Without Kubernetes

The fastest way to try the scanner is through the container:

echo 'login failed email=ivan@example.com password=MySecretPass123!' \
  | docker run -i --rm ghcr.io/pii-shield/pii-shield:2.1.0

The output should look like this:

login failed email=[HIDDEN:...] password=[HIDDEN:...]

The exact hash suffix depends on the salt.

Kubernetes: Operator And Policy

For Kubernetes, there is an operator. It can be installed with Helm:

helm repo add pii-shield https://pii-shield.github.io/pii-shield/
helm repo update

helm install pii-shield-operator pii-shield/pii-shield-operator \
  -n operator-system \
  --create-namespace

Then you create a PiiPolicy:

apiVersion: core.pii-shield.io/v1alpha1
kind: PiiPolicy
metadata:
  name: strict-policy
  namespace: default
spec:
  injectionMode: file
  logPath: /var/log/app/output.log
  failPolicy: open

Then you mark the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api
spec:
  template:
    metadata:
      labels:
        pii-shield.io/inject: "true"
      annotations:
        pii-shield.io/policy: "strict-policy"

The webhook adds the sidecar, the volume, and the settings. The application writes to /var/log/app/output.log. The sidecar reads this file and prints the cleaned stream.

The Helm chart has small resource limits: 30Mi of memory and 50m CPU for the sidecar. You still need to measure this on your own logs. This matters when logs contain a lot of JSON or very long lines.

Why The Scanner Does Not Use A Heavy JSON Parser

Logs often come as JSON. But running encoding/json on every line would add overhead in a continuous stream.

The scanner only needs to find values and keep the structure valid enough for redaction. So it uses a narrower parser. It can handle JSON-like text for this task, but it does not turn every line into a full object tree.

This does not make the code prettier. It does reduce allocations and surprises under load.
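To make the idea of a narrower parser concrete, here is a sketch of a single-pass walker that finds string values without building an object tree. It is not the PII-Shield parser. The function name, the `sensitive` key set, and the plain `[HIDDEN]` marker are assumptions, and the sketch ignores numbers, booleans, and nesting depth.

```go
package main

import (
	"fmt"
	"strings"
)

// redactJSONStrings walks line byte by byte and replaces string values
// whose most recent key is in sensitive. It never builds an object tree.
func redactJSONStrings(line string, sensitive map[string]bool) string {
	var out strings.Builder
	lastKey := ""
	i := 0
	for i < len(line) {
		if line[i] != '"' {
			out.WriteByte(line[i])
			i++
			continue
		}
		// Find the closing quote, skipping escaped characters.
		j := i + 1
		for j < len(line) && line[j] != '"' {
			if line[j] == '\\' {
				j++
			}
			j++
		}
		if j >= len(line) {
			// Unterminated string (broken line): hide it if the key was sensitive.
			if sensitive[lastKey] {
				out.WriteString(`"[HIDDEN]`)
			} else {
				out.WriteString(line[i:])
			}
			break
		}
		// A string followed by ':' is a key; otherwise it is a value.
		k := j + 1
		for k < len(line) && (line[k] == ' ' || line[k] == '\t') {
			k++
		}
		switch {
		case k < len(line) && line[k] == ':':
			lastKey = line[i+1 : j]
			out.WriteString(line[i : j+1])
		case sensitive[lastKey]:
			out.WriteString(`"[HIDDEN]"`)
		default:
			out.WriteString(line[i : j+1])
		}
		i = j + 1
	}
	return out.String()
}

func main() {
	keys := map[string]bool{"password": true}
	fmt.Println(redactJSONStrings(`{"user":"bob","password":"p\"w","n":1}`, keys))
	fmt.Println(redactJSONStrings(`{"password":"abc`, keys))
}
```

The walker allocates one output buffer per line and still produces output for a broken line. The price is precision: it can over-redact in unusual layouts, which is the kind of trade-off the tests listed below are meant to pin down.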

There are tests for:

  • nested JSON;
  • broken lines;
  • binary garbage;
  • multilingual logs;
  • false positives;
  • custom regex rules;
  • fuzz regressions.

For scanner-only benchmarks:

go test -bench=. -benchmem ./pkg/scanner

For end-to-end CLI throughput:

./benchmark/run_benchmarks.sh

In the current project notes, the normal Go scanner was in the microsecond range per line on a synthetic corpus.

A neural PII detector on CPU was about three orders of magnitude slower in my test. So a model layer should not run on every line by default. If it is added, it should be an explicit mode. It may make sense for medical logs, legal logs, support chats, and similar domains.

Fail Open Or Fail Closed

A log filter has an unpleasant choice. What should happen if the scanner fails?

fail open means the line is allowed to pass. Logs keep flowing. The application and monitoring do not go blind. But there is a risk that a raw value escapes.

fail closed means the raw line is not released. The filter returns a drop marker instead. This is safer for privacy. It is worse for debugging.

PII-Shield makes this configurable:

PII_FAIL_POLICY=open

or:

PII_FAIL_POLICY=closed

The default is open. Losing logs in production can become a separate incident. For strict compliance workloads, the right choice may be different.
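The two policies can be sketched as a wrapper around the scan call. The sketch assumes that a scanner failure shows up as a panic. The `safeScan` name, the `failPolicy` type, and the text of `dropMarker` are hypothetical.

```go
package main

import "fmt"

type failPolicy string

const (
	failOpen   failPolicy = "open"
	failClosed failPolicy = "closed"
)

// dropMarker is a hypothetical placeholder; the real marker text may differ.
const dropMarker = "[PII-SHIELD: line dropped]"

// safeScan runs scan on line and applies the policy if scan panics.
func safeScan(scan func(string) string, policy failPolicy, line string) (out string) {
	defer func() {
		if r := recover(); r != nil {
			if policy == failClosed {
				out = dropMarker // the raw line is not released
			} else {
				out = line // fail open: the raw line passes through
			}
		}
	}()
	return scan(line)
}

func main() {
	broken := func(string) string { panic("scanner bug") }
	fmt.Println(safeScan(broken, failOpen, "password=123"))
	fmt.Println(safeScan(broken, failClosed, "password=123"))
}
```

The first call prints the raw line and the second prints the drop marker. Seeing both outputs side by side makes the trade-off concrete before choosing a default for a workload.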

Current Limits

I do not want to hide this part at the end of the README.

PII-Shield can be run and tested now. But not all modes have the same maturity:

  • file-based sidecar mode is the main practical path;
  • fully transparent protection for normal Kubernetes stdout and stderr logs is not production-ready yet;
  • pipe mode changes the target container command and needs testing per workload;
  • the Kubernetes operator is still being stabilized;
  • eBPF mode is research work, not a production compliance control;
  • Proxy-Wasm gateway integration and a visual control plane are still planned work.

This part is boring, but it matters. A privacy tool without clear limits quickly becomes a decorative shield.

What Was Hard

Not regex. Not Helm.

The hard part was not breaking normal logs.

If the scanner is too aggressive, it hides trace IDs, commit hashes, harmless UUIDs, and path fragments. If it is too relaxed, it misses weak passwords and handmade tokens.

One global threshold does not fit every language and every domain. And an allow list can create its own problems if it is too broad.

So the project is not moving toward one smart detector. It is moving toward several clear layers:

  • sensitive keys;
  • strict custom rules;
  • entropy with tuning;
  • allow lists;
  • profiles for different domains;
  • maybe an optional model check where the extra latency is worth it.

Less magic. More testable rules.

Where This Helps

PII-Shield does not replace logging discipline.

Developers still should not write passwords into log.info(...). APIs still need to validate input. Security teams still need to check retention, access, and log export.

But a filter near the application helps with one common mistake: a sensitive value enters the shared log stream by accident.

This matters when logs move into:

  • central log storage;
  • a data lake;
  • incident analysis tools;
  • analytics and reporting;
  • LLM/RAG pipelines;
  • AI agents that read traces and evidence chains.

The last point was important for me. If PII gets into a training or evaluation dataset, removing one line from Loki is no longer enough.

What Is Next

The next work areas are:

  • production support for Kubernetes stdout and stderr;
  • a more stable operator lifecycle;
  • stronger release checks: checksums, image digests, and artifact provenance;
  • domain-specific profiles;
  • selective model checks without a large latency hit;
  • Proxy-Wasm and eBPF work only where they give a real benefit.

PII-Shield is not a universal cure. It is a practical layer of protection in a place that is often not controlled well: between the application and the log system.

If there is one idea to take from this article, it is this: clean private data as close as possible to the place where it appears. Everything after that is harder to check and harder to undo.
