Bala Paranj

Posted on Jul 2

DLP is a Workaround for a Missing Data Schema

#cloudsecurity #data #security #architecture

✓ Human-authored analysis; AI used for formatting and proofreading.

Data Loss Prevention is the third pillar of cloud security, after identity and network, and it fails for the same structural reason as the other two.

The principle says: protect sensitive data. The implementation: scan every storage resource to "discover" whether it contains PII, financial records, credentials, or health information. The discovery happens after the data has already been written to the wrong place, in the wrong format, with the wrong access controls. The tool finds the problem. The problem already happened. The scan is an autopsy, not a diagnosis.

If you've read "Least Privilege Is a Workaround for a Missing Specification" or "Microsegmentation Is a Workaround for a Missing Application Map", the pattern will be familiar. The missing artifact is different. The structural failure is identical.

The Hidden Assumption

Data protection says: classify sensitive data and apply appropriate controls. Every major framework mandates it: GDPR, HIPAA, PCI DSS, SOC 2. The principle is correct. The implementation assumes that someone, at some point, declared what kind of data each storage resource holds. Nobody did.

A developer creates an S3 bucket for a new feature. The bucket stores user-uploaded images. Six months later, another team uses the same bucket to stage CSV exports that contain customer email addresses and phone numbers. The bucket was created for public images. It now holds PII. Nothing in the system recorded the transition. Nothing prevented it. Nothing detected it until a scheduled DLP scan finally flagged a regex match for email patterns in a CSV file that had been uploaded three months earlier.

The data was exposed for three months. The scanner found it in month four. The finding was triaged in month five. The bucket was still misconfigured in month six because nobody knew whether the CSV exports should move to a different bucket or whether the bucket's classification should change. The finding sat in a queue alongside 2,000 other DLP alerts, most of which were false positives from the regex engine matching credit card-like number patterns in log files.

The Three Mismatches

Mismatch 1: Granularity

DLP tools scan content to infer classification. A regex matches a 16-digit number and flags it as a potential credit card. But 16-digit numbers appear in log files (request IDs), analytics data (session tokens), configuration files (API keys that look like card numbers), and test fixtures (deliberately fake card numbers). The tool can't distinguish between a real credit card number and a string that matches the pattern.

The developer who created the storage resource knows what it holds. The DLP tool guesses. The developer could have declared data_type: public_images at creation time. The DLP tool reverse-engineers this declaration from content patterns months or years later — with error rates that make the output unactionable.

Amazon Macie reports a "SensitiveData:S3Object/Financial" finding on an S3 object. Is it a real credit card number? A test fixture? A number that happens to pass the Luhn checksum? The analyst must open the object, inspect the content, understand the business context, and make a judgment call. Multiply this by 2,000 findings per week. The tool generates work. It doesn't generate answers.
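To make the ambiguity concrete, here is a minimal sketch of pattern-based detection in Python. It is not how Macie or any specific scanner is implemented; the regex, the sample strings, and the function names are all assumptions chosen for illustration.

```python
import re

# Naive detector: any standalone run of 16 digits is a "potential card number".
CARD_PATTERN = re.compile(r"\b\d{16}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text: str) -> list[tuple[str, bool]]:
    """Return (match, passes_luhn) for each 16-digit run in the text."""
    return [(m, luhn_valid(m)) for m in CARD_PATTERN.findall(text)]


# Hypothetical inputs: none of these is a real customer's card.
samples = {
    "test fixture": "card=4111111111111111",
    "log line": "request_id=1234567812345670 status=200",
    "session token": "session=1234567812345678",
}

for label, text in samples.items():
    print(label, scan(text))
```

All three strings match the regex, and the first two also pass the Luhn check, yet none of them is cardholder data. Adding a checksum narrows the candidates but still cannot tell a fixture or a request ID from a real card number; only the context of the resource can.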

Mismatch 2: Time

Data classification should be a property assigned at creation time and enforced continuously. Instead, it's discovered retroactively through periodic scanning.

A database column is created: customer_notes TEXT. At creation, it holds free-text order notes. Over time, customer support agents start pasting customer phone numbers, addresses, and complaint details into the same column. The column was never classified as PII because it wasn't PII when it was created. It became PII gradually, through usage patterns the schema never captured.

DLP scanning discovers this months or years after the transition. The tool reports "PII detected in column customer_notes." The remediation is not clear: re-architect the database to separate PII into classified columns? Encrypt the column retroactively? Add a DLP rule to prevent future PII from entering? Each option is expensive. None would have been necessary if the column's data type had been declared and enforced at creation: customer_notes TEXT CONSTRAINT no_pii — a schema constraint, not a scanner finding.

The temporal mismatch is worse than for IAM or networking because data accumulates. An unused IAM permission can be revoked. An unused network path can be closed. Data written to the wrong location must be found, classified, relocated, and the original deleted — across backups, replicas, caches, and downstream systems that may have already consumed it. The cost of retroactive data classification grows with every day the data sits unclassified.

Mismatch 3: Composition

Individual storage resources are scanned in isolation. Data flows across resources are not tracked by anyone.

An S3 bucket is classified as "internal — no PII." A Lambda function reads from a PII-classified database, transforms the data, and writes results to the S3 bucket. The Lambda stripped the names and addresses but kept customer IDs — which, combined with another table, re-identify individuals. The S3 bucket now holds data that is PII-by-composition, even though no individual field matches a PII regex pattern.

DLP scanning the S3 bucket finds nothing, because the data doesn't match any PII pattern. The data is sensitive because of what it can be joined to, not because of what it contains. This is the data equivalent of the compound path problem in IAM and networking. Per-resource scanning evaluates each storage resource independently. The data flow that creates sensitivity through composition is invisible.

Data lineage (the graph of which data flows from which sources through which transformations to which destinations) is the data equivalent of the network dependency map and the IAM trust chain. Without it, "sensitive data" is defined by regex patterns instead of by provenance. A customer ID is not PII in isolation. A customer ID that joins to a table containing names and addresses is PII. The sensitivity is a property of the graph, not the node.
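The composition problem can be shown in a few lines of Python. This is a sketch under stated assumptions: the export rows, the details table, and the two regexes are hypothetical stand-ins for a per-resource scanner and a second data store.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def pattern_hits(rows: list[dict]) -> list[str]:
    """Per-resource scan: return values that look like an email or a phone number."""
    return [
        value
        for row in rows
        for value in row.values()
        if EMAIL.search(value) or PHONE.search(value)
    ]


def reidentify(rows: list[dict], details: dict[str, dict]) -> list[dict]:
    """Join the export to a details table on customer_id."""
    return [
        {**row, **details[row["customer_id"]]}
        for row in rows
        if row["customer_id"] in details
    ]


# Hypothetical export written to the "internal, no PII" bucket.
export_rows = [
    {"customer_id": "C-1042", "order_total": "129.90"},
    {"customer_id": "C-2077", "order_total": "18.50"},
]

# Hypothetical table that lives in a different, PII-classified store.
customer_details = {
    "C-1042": {"name": "Ada Example", "email": "ada@example.com"},
    "C-2077": {"name": "Bo Sample", "email": "bo@example.com"},
}

print(pattern_hits(export_rows))                  # the scan sees nothing
print(reidentify(export_rows, customer_details))  # the join sees people
```

The scan of the export returns an empty list, while the join attaches a name and email to every row. Nothing about the export's content changed between the two calls; only the graph around it did.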

The Symptom Treatment Industry

The data security industry has built the same symptom-treatment ecosystem as IAM and networking:

DLP scanners (Macie, Google DLP, Microsoft Purview) scan storage content using regex patterns, machine learning classifiers, and keyword dictionaries. They report: "This S3 bucket contains 47 objects matching PII patterns." The signal-to-noise ratio is very low. Every report includes credit-card-pattern matches on log files, SSN-pattern matches on ZIP codes, and email-pattern matches on internal notification addresses. Teams learn to ignore the findings because investigating each one costs more than the risk it represents. The tool that should protect data instead produces alert fatigue.

Data classification tools attempt to label storage resources after the fact. Scan the bucket, infer the classification, apply a tag. But the tag is derived from the same regex-based content scanning, so it inherits the same false positive rate. A bucket tagged classification: PII because Macie found one credit-card-like string in one of 50,000 objects is not a useful classification. It's a guess with a label.

Data Security Posture Management (DSPM) maps where sensitive data lives across cloud environments. The mapping is descriptive, not prescriptive. It shows where PII was found, not where PII should be allowed. Just like flow logs show observed traffic rather than intended traffic, DSPM shows observed data locations rather than intended data locations. The map of current data is not the specification of intended data architecture.

Data Loss Prevention policies block data from leaving classified resources. But if the classification is wrong (derived from scanning rather than declared from intent), the policy blocks the wrong things — preventing legitimate exports while allowing sensitive data to flow to unclassified storage through paths the policy doesn't cover. Enforcement built on inferred classification inherits every error in the inference.

The Missing Artifact

The artifact that would make data protection automatic is typed data provenance: a declaration of what kind of data each storage resource is intended to hold, enforced at provisioning time and validated continuously.

The DLP scan result ("what data does this bucket contain?") exists.
The retroactive classification ("what data type tags are applied?") exists.

But the answer to "what data is this storage resource designed to hold, and what must it never hold?" does not. That answer is the data schema declaration, and it doesn't exist as an enforced specification.

DATA SCHEMA DECLARATION (the missing artifact):

  s3://product-images:
    data_type: public
    allows:
      - image/jpeg
      - image/png
      - image/webp
    never_contains:
      - PII
      - financial
      - credentials
    reason: "Public product catalog images — served via CloudFront to anonymous users"
    owner: product-team

  s3://customer-exports:
    data_type: confidential-pii
    allows:
      - text/csv
    contains:
      - PII: [email, phone, address]
      - financial: [order_total, payment_method_last4]
    encryption: required (aws:kms, key: customer-data-key)
    retention: 90 days
    access: [export-service-role, compliance-audit-role]
    reason: "Customer data exports for compliance reporting"
    owner: compliance-team

  rds://orders-db/customer_notes:
    data_type: confidential-pii
    allows:
      - free_text
    contains:
      - PII: [potential — human-entered text may include names, phone numbers]
    constraint: "Must not be joined with customer_id in unencrypted exports"
    reason: "Support agent notes — may contain incidental PII"
    owner: support-team

If this schema existed, a developer writing PII to the product-images bucket would get a schema violation at write time, not at scan time. The violation would be a type error, not a DLP finding. The prevention would be immediate, not retroactive. The false positive rate would approach zero, because the schema was declared by the developer who knows the data, not inferred by a regex engine.
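What a write-time type error could look like is sketched below in Python. This is an illustration, not an existing library or AWS feature: StorageSchema, SchemaViolation, and check_write are hypothetical names, and the sketch assumes the writing service declares the categories of its payload rather than having the content inspected.

```python
from dataclasses import dataclass


class SchemaViolation(Exception):
    """Raised when a write does not fit the resource's declared schema."""


@dataclass(frozen=True)
class StorageSchema:
    resource: str
    data_type: str
    allows: frozenset
    never_contains: frozenset = frozenset()


def check_write(schema: StorageSchema, content_type: str, categories: set) -> None:
    """Reject a write whose declared type or categories violate the schema.

    `categories` is what the writing service declares about its payload
    (an assumption of this sketch); the content itself is not inspected.
    """
    if content_type not in schema.allows:
        raise SchemaViolation(
            f"{schema.resource}: content type {content_type!r} not allowed"
        )
    forbidden = schema.never_contains & frozenset(categories)
    if forbidden:
        raise SchemaViolation(
            f"{schema.resource}: must never contain {sorted(forbidden)}"
        )


# Mirrors the s3://product-images declaration above.
product_images = StorageSchema(
    resource="s3://product-images",
    data_type="public",
    allows=frozenset({"image/jpeg", "image/png", "image/webp"}),
    never_contains=frozenset({"PII", "financial", "credentials"}),
)

check_write(product_images, "image/png", set())  # passes silently

try:
    check_write(product_images, "text/csv", {"PII"})
except SchemaViolation as err:
    print("rejected:", err)
```

The check is a set comparison against a declaration, so it has no pattern matching to get wrong. Its weakness is the same as any type system's: it trusts the declared categories, which is why the declaration needs an owner and a review.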

Chesterton's Fence

Every unclassified storage resource is a Chesterton's Fence. A bucket with no data classification tag could hold public marketing assets, customer PII, application logs, or database backups containing credentials. Nobody dares delete it, restrict its access, or change its encryption because nobody knows what it holds or why it was created. The DLP scanner says "no PII detected," but the absence of regex matches is not proof of safety. It's proof that the data doesn't match the patterns. Data that is sensitive by context (customer IDs that join to PII tables) passes every DLP scan because no individual field triggers a pattern.

The reason and data_type fields in the schema declaration are Chesterton's Fence made auditable. When the reason no longer applies, the resource can be evaluated: "Is this still needed? Does it still hold this type of data? Should the classification change?" Without the declaration, the resource accumulates data indefinitely, and the classification drifts from whatever the original creator intended.

Stewart Brand's Shearing Layers

The same rate-of-change separation applies:

Data architecture (which resources hold which types of data)
  → changes quarterly — new data stores for new features

Application behavior (which services write which data where)
  → changes weekly — new integrations, new export formats

Storage classification (what data type the resource is tagged with)
  → assigned retroactively by DLP scan — or never

The classification should track the architecture. The architecture changes quarterly. The classification doesn't update because it was never declared. It was inferred from a scan that runs weekly and produces findings nobody trusts. The gap between intended data architecture and actual data distribution is data sprawl — the storage equivalent of privilege creep.

Declared data schemas separate these layers:

Layer 1: DATA SCHEMA (changes when data architecture changes)
  "This bucket holds public product images, never PII"
  → changes when the feature's data model changes
  → owned by the team that builds the feature

Layer 2: DERIVED CONTROLS (changes automatically when Layer 1 changes)
  Encryption settings, access policies, retention rules, bucket policies
  → computed from the schema declaration
  → S3 bucket policy that rejects PutObject from services not in the access list

Layer 3: DEPLOYED CONFIGURATION (verified on every snapshot)
  The actual bucket policy, encryption settings, access controls
  → verified against Layer 2 on every snapshot
  → any deviation = finding, not "data sprawl"

The verification machinery for Layer 3 already exists. AWS Config rules can check encryption settings. IAM Access Analyzer for S3 can evaluate bucket policies. Macie can scan content. But all of them verify against inferred classifications or manual tags, not against a declared schema that expresses the architect's intent. The engine exists. The specification to verify against does not.
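The three layers can be sketched as two small functions: one that derives expected controls from a declaration, and one that diffs them against a deployed snapshot. The derivation rules here (KMS for anything confidential, public access blocked for anything non-public) and the field names are assumptions made for the sketch, not a standard mapping.

```python
def derive_controls(decl: dict) -> dict:
    """Layer 2: compute the expected controls from a schema declaration."""
    confidential = decl["data_type"].startswith("confidential")
    return {
        "encryption": "aws:kms" if confidential else "AES256",
        "public_access_blocked": decl["data_type"] != "public",
        "writers": sorted(decl.get("access", [])),
        "retention_days": decl.get("retention_days"),
    }


def verify(expected: dict, deployed: dict) -> list[str]:
    """Layer 3: list every control where the deployed snapshot deviates."""
    findings = []
    for key, want in expected.items():
        have = deployed.get(key)
        if have != want:
            findings.append(f"{key}: expected {want!r}, deployed {have!r}")
    return findings


# Layer 1: a simplified form of the customer-exports declaration.
customer_exports = {
    "data_type": "confidential-pii",
    "access": ["export-service-role", "compliance-audit-role"],
    "retention_days": 90,
}

# Hypothetical snapshot of what is actually deployed.
deployed_snapshot = {
    "encryption": "AES256",
    "public_access_blocked": True,
    "writers": ["compliance-audit-role", "debug-role", "export-service-role"],
    "retention_days": 90,
}

for finding in verify(derive_controls(customer_exports), deployed_snapshot):
    print(finding)
```

With this snapshot the diff reports two deviations: the encryption setting and an undeclared writer. Each finding is a mismatch against stated intent, so there is nothing to triage beyond deciding whether to fix the deployment or amend the declaration.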

The Data Lineage Gap

The compound problem (data sensitive by composition rather than by content) requires a capability no DLP tool provides: data lineage as a verification input.

Data lineage tracks where data came from, what transformations were applied, and where it went. If the lineage graph shows that an S3 object was derived from a PII-classified database through a Lambda function, the object inherits the source classification unless the transformation provably removed the sensitive attributes. Provably means the transformation is declared and verified, not assumed because the Lambda's code looks like it strips PII.

This is the data equivalent of compound path analysis in IAM. A customer ID is not PII in isolation. A customer ID that flows through a pipeline to an unclassified bucket, where it can be joined with a customer-details table, is PII — by lineage, not by content. DLP scanning the destination bucket finds nothing because the content doesn't match PII patterns. The sensitivity is in the provenance, not the payload.

No current tool connects data lineage to data classification enforcement. Data catalogs (Amundsen, DataHub, OpenMetadata) maintain lineage metadata — which tables feed which pipelines, which pipelines produce which outputs. Like Backstage for service dependencies, the lineage data exists. It's used for debugging data quality issues and understanding data freshness. Nobody feeds it into the security enforcement loop. The data architecture exists as a catalog entry. It should exist as a policy input.
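Feeding lineage into classification could work roughly as follows. This Python sketch is not the API of DataHub, Amundsen, or OpenMetadata; the node names, the edge format, and the propagate function are hypothetical. It implements the inheritance rule described above: a destination inherits every sensitive attribute of its source except those a transformation is declared to remove.

```python
from collections import defaultdict, deque


def propagate(sources: dict, edges: list) -> dict:
    """Push sensitive attributes along lineage edges.

    sources: {node: set of sensitive attributes declared on that node}
    edges:   [(src, dst, removed)] where `removed` is the set of attributes
             the transformation is declared (and verified) to strip.
    Returns {node: set of attributes the node holds or inherits}.
    """
    outgoing = defaultdict(list)
    for src, dst, removed in edges:
        outgoing[src].append((dst, set(removed)))

    labels = defaultdict(set)
    for node, attrs in sources.items():
        labels[node] |= set(attrs)

    # Worklist: revisit a node only when its label set has grown.
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for dst, removed in outgoing[node]:
            carried = labels[node] - removed
            if not carried <= labels[dst]:
                labels[dst] |= carried
                queue.append(dst)
    return dict(labels)


sources = {"rds://orders-db": {"name", "address", "customer_id"}}
edges = [
    # The Lambda is declared to strip names and addresses, nothing else.
    ("rds://orders-db", "s3://internal-reports", {"name", "address"}),
    # A plain copy removes nothing.
    ("s3://internal-reports", "s3://analytics-scratch", set()),
]

for node, attrs in propagate(sources, edges).items():
    print(node, sorted(attrs))
```

Both downstream buckets end up labeled with customer_id, even though a content scan of either would find nothing. A policy engine could then compare those inherited labels against each bucket's declaration and flag the "internal, no PII" bucket without reading a single object.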

The Identical Pattern

Three articles, three domains, one structural gap:

Domain	Principle	Missing artifact	Symptom treatment
IAM	Least privilege	Intent specification	Granted-vs-used analyzers
Network	Microsegmentation	Application dependency map	Flow log analyzers
Data	Data protection	Typed data schema	DLP content scanners

Each principle assumes a specification exists. Each tool measures compliance against a proxy (history, traffic patterns, content patterns) because the specification doesn't exist. Each tool generates findings nobody acts on because the proxy is wrong — measuring what happened rather than what was intended.

The fix is the same in every domain: declare what should be true (intent specification, application map, data schema). Derive the implementation from the declaration (IAM policy, security group rules, bucket policies). Verify the deployed configuration against the derived controls (snapshot comparison). The declaration is the artifact that makes everything else work.

The Path Forward

Start with the storage resources that matter most: the internet-facing ones, the ones that hold data subject to regulatory requirements, and the ones that have had DLP findings in the past.

Declare what each one should hold. Declare what it must never hold. Derive the access controls, encryption settings, and retention policies from the declaration. Compare against what's deployed. Fix the delta.

The data schema doesn't need to cover every S3 bucket on day one. It needs to cover the buckets an auditor will ask about and the buckets an attacker would target. Each declaration is a ratchet. Once the intended data type is specified, any violation is a schema error caught at write time, not a DLP finding discovered months later.

The same honest challenges apply. The schema must be simpler than the bucket policy it informs. If developers copy-paste IAM ARNs into the data declaration, the specification is the policy with a different file extension. The schema must accept high-level types (data_type: confidential-pii) and reject raw implementation details. The dynamic case exists: a multi-tenant architecture where the data type of a bucket is determined by which customer's data it holds at runtime. Tag-based and pattern-based declarations constrain the classification even when they can't fully specify it.

A practical challenge: "schema violation at write time" requires enforcement infrastructure that cloud providers don't natively offer. S3 has no built-in mechanism to reject a PutObject based on content classification. Enforcement at write time requires an architectural shim: an upload proxy or staging step that validates content against the schema before the object lands in its destination (a Lambda triggered by an S3 event notification runs only after the write has completed), an OPA gate in the application layer, or a service mesh policy that intercepts data flow. This shim doesn't exist natively in most cloud platforms. It must be built.

There is a poor man's shim already available: S3 bucket policies can evaluate certain request attributes, including object tags. Requiring a classification tag (sent in the x-amz-tagging header and checked with the s3:RequestObjectTag condition key) on every PutObject, and rejecting uploads that omit it, forces developers to self-declare the data type at the API level. Tags are the practical carrier here because user-defined x-amz-meta-* metadata is not exposed as a bucket policy condition key. The tag doesn't validate the content. It doesn't prevent a developer from labeling PII as public. But it forces the declaration to exist, which is the first step. A bucket where every object has a classification tag is auditable. A bucket where no object has one is a black box. The tag doesn't solve the problem. It makes the problem visible.
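A minimal sketch of such a policy, generated in Python. It assumes the self-declaration travels as an object tag named classification (a hypothetical key) and is checked through the s3:RequestObjectTag/<key> condition key, since tags are the request attribute S3 bucket policies expose for this purpose. The bucket name is illustrative.

```python
import json


def require_classification_policy(bucket: str, tag_key: str = "classification") -> dict:
    """Build a bucket policy that denies PutObject when the tag is missing."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnclassifiedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Null == "true" means: the key is absent from the request.
                "Condition": {
                    "Null": {f"s3:RequestObjectTag/{tag_key}": "true"}
                },
            }
        ],
    }


policy = require_classification_policy("customer-exports")
print(json.dumps(policy, indent=2))
```

Uploading with tags also requires the caller to hold s3:PutObjectTagging, so rolling this out means updating writer roles first. The policy only checks that the tag is present; restricting it to a fixed set of values would need an additional condition.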

But the shim without the schema is useless (enforce against what?), while the schema without the shim is still valuable. It makes DLP scanning more precise because the scanner validates content against a declared classification rather than inferring classification from regex patterns. Even without write-time enforcement, a declared schema can cut the DLP triage queue sharply, because a pattern match in a resource declared to hold that data type is expected rather than suspicious. The schema declaration should live in the same Git repository as the Terraform or Pulumi code that creates the bucket: same file, same pull request, same review. If the schema is elsewhere, it drifts from the infrastructure it describes, and the declaration becomes another document that nobody maintains. Schema as code, not schema as wiki page.
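How a scanner might use the declaration is sketched below. The finding format and the triage function are hypothetical, and the declaration is flattened to simple category lists for brevity; no existing DLP product is being described.

```python
def triage(findings: list[dict], declaration: dict) -> dict:
    """Sort scanner findings by what the resource is declared to hold."""
    expected = set(declaration.get("contains", []))
    forbidden = set(declaration.get("never_contains", []))
    result = {"violations": [], "expected": [], "unreviewed": []}
    for finding in findings:
        category = finding["category"]
        if category in forbidden:
            result["violations"].append(finding)   # contradicts the schema
        elif category in expected:
            result["expected"].append(finding)     # declared, no alert needed
        else:
            result["unreviewed"].append(finding)   # schema is silent
    return result


# Simplified declarations based on the schema above.
declarations = {
    "s3://customer-exports": {"contains": ["PII", "financial"]},
    "s3://product-images": {"never_contains": ["PII", "financial", "credentials"]},
}

# Hypothetical scanner output.
findings = [
    {"resource": "s3://customer-exports", "category": "PII", "object": "2024-06.csv"},
    {"resource": "s3://customer-exports", "category": "financial", "object": "2024-06.csv"},
    {"resource": "s3://product-images", "category": "PII", "object": "staging/export.csv"},
]

for resource, decl in declarations.items():
    scoped = [f for f in findings if f["resource"] == resource]
    print(resource, {k: len(v) for k, v in triage(scoped, decl).items()})
```

Of the three raw findings, two are expected by declaration and only one needs a human: the CSV in the image bucket, which is a violation regardless of whether the regex match is a true positive.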

The second practical challenge: the schema approach assumes data flows through managed pipelines, the top-down world where services write to declared storage resources via known paths. DLP's hardest problem is the bottom-up world: shadow data created when users move data outside managed pipelines. An engineer copies a production database snapshot to a personal S3 bucket for debugging. A support agent pastes customer details into a shared spreadsheet. A data scientist downloads a CSV to a laptop and re-uploads it to an unclassified bucket in a different account. No schema covers these paths because they bypass every managed pipeline.

The fix is identity-to-storage pinning: only the identities declared in the schema can write to the bucket. If the schema says access: [export-service-role, compliance-audit-role], then any other identity writing to that bucket is a policy violation regardless of what data it writes. The write itself is unauthorized, not just the content. This converts shadow data from a content-scanning problem to an access-control problem, which connects directly to the IAM intent specification from the first article in this series. The three domains (identity, network, data) are not independent. They're layers of the same specification. The identity layer controls who can write. The network layer controls which paths exist. The data layer controls what should be written. All three need the declaration. None of them has it.
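Pinning can be derived mechanically from the access list. The Python sketch below generates a deny statement keyed on aws:PrincipalArn; the account ID is a placeholder, and the assumption that every declared name maps to an IAM role in a single account is made for illustration.

```python
def pin_writers_policy(bucket: str, account_id: str, writer_roles: list[str]) -> dict:
    """Deny PutObject from any principal not declared in the schema."""
    allowed = [f"arn:aws:iam::{account_id}:role/{role}" for role in writer_roles]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWritesFromUndeclaredIdentities",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # True only when the caller matches none of the declared roles.
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": allowed}},
            }
        ],
    }


policy = pin_writers_policy(
    bucket="customer-exports",
    account_id="111122223333",  # placeholder account ID
    writer_roles=["export-service-role", "compliance-audit-role"],
)
print(policy["Statement"][0]["Condition"])
```

Because the statement is generated from the declaration, adding a writer is a one-line schema change under review rather than a hand edit to the bucket policy. An explicit deny like this also overrides any allow granted elsewhere, which is what makes it a pin rather than a suggestion.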

But even an imperfect schema ("this bucket is for public images, never PII") is more useful than no schema. A DLP scanner that validates content against a declared schema has a false positive rate approaching zero. A DLP scanner that infers classification from regex patterns has a false positive rate that makes the tool unusable. The difference is the declaration. The declaration is the artifact the industry never built.

DLP is not the goal. It's the symptom of not having declared the purpose of each storage resource. Declare the purpose, and data protection becomes a schema constraint. Without the declaration, it remains what it has always been: regex-based archaeology — scanning for evidence of a problem that should have been prevented at provisioning time.

DEV Community