Regulated apps are different from regular software in one uncomfortable way: you're legally required to collect data you'd rather not touch. Government IDs. Social security numbers. Real-time location. The regulatory mandate forces you to gather sensitive material — then separate laws demand you protect it. That tension doesn't get resolved in a compliance meeting. It gets resolved in your architecture, or it doesn't get resolved at all.
## The KYC Pipeline Problem
Most teams make the same mistake with KYC: they treat it as a feature rather than an isolated subsystem. The result is government ID scans sitting in the same database as user preferences, accessible to the same application services, shipped to the same logging aggregator.
The first structural question worth asking early: should you store raw identity documents at all? In many cases, delegating to a KYC provider — Persona, Jumio, Onfido — and storing only the verification reference and outcome is the cleaner path. Your database holds `kyc_status: verified`, `provider_ref: "abc123"`, and a timestamp. Nothing else.
However — and this matters — some gaming regulators explicitly require independent retention of identity documents, not just a third-party reference. Michigan's MGCB technical standards, for example, have specific data retention obligations that may require you to hold copies directly. Check the jurisdiction requirements before assuming delegation is sufficient. Your compliance and legal team needs to sign off on the storage model, not just your architect.
When you genuinely need to retain KYC data (some jurisdictions require it), keep it isolated:
```
users        → user_id, email, created_at
kyc_profiles → kyc_id, user_id, status, verified_at, provider_ref
kyc_vault    → encrypted blob, strict ACL, separate credentials
```
The vault should be unreachable from your application layer by default. Only a dedicated compliance service touches it, and every read gets logged. Not application logs — a separate audit trail.
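To make the access pattern concrete, here is a minimal sketch of a compliance-only vault accessor. All names (`ComplianceVaultService`, `AuditTrail`) are illustrative, and the in-memory dict stands in for a real vault store opened with credentials that application services never hold; the point is that every read demands a justification and lands in a separate audit record.

```python
import datetime

class AuditTrail:
    """Append-only audit record, kept apart from application logs."""
    def __init__(self):
        # In production this is a write-once store (e.g. WORM storage),
        # readable only by the compliance function.
        self._entries = []

    def record(self, actor, action, kyc_id, justification):
        self._entries.append({
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "kyc_id": kyc_id,
            "justification": justification,
        })

class ComplianceVaultService:
    """The only component holding vault credentials; every read is audited."""
    def __init__(self, vault_store, audit):
        self._store = vault_store  # opened with credentials no app service has
        self._audit = audit

    def read_document(self, actor, kyc_id, justification):
        if not justification:
            raise PermissionError("vault reads require an explicit justification")
        self._audit.record(actor, "vault_read", kyc_id, justification)
        return self._store[kyc_id]

# Usage
audit = AuditTrail()
vault = ComplianceVaultService({"kyc_42": b"<encrypted blob>"}, audit)
blob = vault.read_document("compliance_bot", "kyc_42", "regulator audit request")
```

The justification string is not decoration: it is what turns the audit trail from "someone read the vault" into a record a regulator can actually review.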
Field-level encryption matters here more than disk encryption. Encrypting at rest protects you if someone walks out with a hard drive. Field-level encryption protects you from your own engineers, your own queries, and your own misconfigured storage buckets. Use your KMS to encrypt SSN, DOB, and document hashes individually. Decryption should require explicit, logged justification. For a concrete implementation pattern using KMS-backed field encryption in a serverless context, this production walkthrough is worth reading — the key policy scoping discussion alone saves most teams a painful mistake.
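The envelope pattern behind KMS-backed field encryption can be sketched as follows. This is a structural illustration only: `StubKMS` is a placeholder for a real KMS client, and the one-time-pad cipher stands in for AES-GCM so the example stays dependency-free. What it demonstrates is the shape — each field gets its own data key, the key is stored only in wrapped form, and decryption requires a logged justification.

```python
import secrets

class StubKMS:
    """Placeholder for a real KMS (AWS KMS, GCP Cloud KMS, etc.).
    Wrapping here is a dict lookup; a real KMS wraps data keys under
    an HSM-held master key and logs every unwrap call."""
    def __init__(self):
        self._wrapped = {}

    def wrap(self, data_key: bytes) -> str:
        ref = secrets.token_hex(8)
        self._wrapped[ref] = data_key
        return ref

    def unwrap(self, ref: str) -> bytes:
        return self._wrapped[ref]  # real KMS: IAM-gated, audited decrypt

def encrypt_field(kms, plaintext: bytes) -> dict:
    # One-time pad as a stand-in cipher; production code uses AES-GCM.
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    # Only the wrapped key reference is stored next to the ciphertext.
    return {"ct": ciphertext, "key_ref": kms.wrap(key)}

def decrypt_field(kms, record: dict, justification: str) -> bytes:
    if not justification:
        raise PermissionError("decryption requires a logged justification")
    key = kms.unwrap(record["key_ref"])
    return bytes(c ^ k for c, k in zip(record["ct"], key))

kms = StubKMS()
enc_ssn = encrypt_field(kms, b"123-45-6789")  # SSN, DOB, etc. each get own key
```

Per-field keys are what make cryptographic erasure possible later: delete one wrapped key and exactly one field becomes unreadable, without touching the row.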
Verification drift is consistently underestimated. A user's KYC is valid today. Eighteen months later, their document has expired or your provider's risk model has shifted. Build re-verification flows before you need them. Stale KYC is both a compliance liability and unnecessary data exposure — you're holding sensitive material past its useful life with no corresponding obligation.
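A staleness check for this kind of drift is simple to build early. The sketch below assumes a profile shape (`verified_at`, `doc_expires`) and an 18-month window, both illustrative — your jurisdiction sets the real numbers.

```python
from datetime import date, timedelta

RE_VERIFY_AFTER = timedelta(days=540)  # ~18 months; set per jurisdiction

def needs_reverification(profile: dict, today: date) -> bool:
    """A profile goes stale when its document expired
    or the verification itself aged out."""
    if profile["doc_expires"] <= today:
        return True
    return today - profile["verified_at"] > RE_VERIFY_AFTER

profiles = [
    {"kyc_id": "k1", "verified_at": date(2023, 1, 10), "doc_expires": date(2030, 1, 1)},
    {"kyc_id": "k2", "verified_at": date(2025, 1, 10), "doc_expires": date(2024, 6, 1)},
    {"kyc_id": "k3", "verified_at": date(2025, 6, 1),  "doc_expires": date(2030, 1, 1)},
]
stale = [p["kyc_id"] for p in profiles
         if needs_reverification(p, date(2025, 7, 1))]
# k1 aged out, k2's document expired, k3 is fresh
```

Run this as a scheduled job that enqueues re-verification flows, and pair it with a purge path for KYC material you no longer have an obligation to hold.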
## Geofencing Without Storing Location
Jurisdiction enforcement creates a specific constraint: you can't just verify where a user lives. Confirming where they physically are at the moment of a transaction is a fundamentally different problem.
The instinct is to log GPS coordinates with timestamps. Avoid it. That's a detailed record of someone's movement patterns, and regulations typically require proof of the check result — not retention of the raw coordinates themselves. Minimize what you collect to what the obligation actually demands.
A cleaner pattern:
```
Client        → sends coordinates to internal GeoValidation service
GeoValidation → checks against jurisdiction polygon
              → returns: { permitted: true, jurisdiction: "MI", checked_at: timestamp }
Main app      → stores result only, coordinates discarded
```
Short TTL caching on geo results is reasonable — re-checking on every request creates more location data than required and adds latency. But keep the TTL short on financial transactions. Minutes, not hours. Users can cross state lines.
On mobile, request `whenInUse` authorization, not `always`. Background location collection is rarely justified by the actual regulatory requirement. If your legal team pushes for it, ask them to point to the specific obligation. Usually they can't.
IP-based geolocation is a useful secondary fraud signal — but under GDPR, IP addresses (including hashed ones) can still constitute personal data if re-identification remains possible. Treat IPs as PII by default, minimize retention, and don't treat IP geolocation as primary jurisdiction evidence. It's a corroborating signal, not proof.
## Logging Is Where Privacy Goes to Die
Application logs are the most overlooked PII risk in most systems. Engineers treat them as ephemeral debugging tools. In practice, they're shipped to third-party aggregators, retained for months, searchable by broad engineering teams, and occasionally exported during incident investigations.
A log line containing a user's email, IP, and a behavioral event is personal data under GDPR, regardless of your intent when writing it. The solution isn't better scrubbing — it's not writing it in the first place. If you want a solid grounding on exactly what GDPR classifies as personal data and how the storage limitation principle translates into engineering constraints, this dev.to breakdown covers it without the legal padding.
Pseudonymize at write time, not after. Your log pipeline should never receive a raw email address. It receives an internal pseudonymous ID instead. The OWASP Privacy Cheat Sheet lays out data classification and pseudonymization requirements clearly — worth keeping open when defining your log field taxonomy. Classify every log field explicitly:
```
✅ user_id: "usr_8f3k2"        — pseudonymous internal ID
✅ action: "kyc_check_passed"  — behavioral event, no direct PII
❌ email: "user@email.com"     — never in logs
❌ ip_address: raw             — treat as PII, minimize or drop
❌ ssn_last4 + user_id         — linkable combination, avoid
```
Note on IP addresses specifically: hashing doesn't automatically resolve the GDPR question. If the original IP can be reconstructed or re-identified through other means, a hashed value may still be personal data. The safer default is not retaining them in logs at all unless there's a documented, necessary purpose.
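Pseudonymization at write time can live in a logging filter, so no call site can forget it. A sketch using Python's standard `logging` module — the HMAC key and field names are illustrative; the key should come from a secret store and rotate on a schedule:

```python
import hashlib
import hmac
import io
import logging

PSEUDONYM_KEY = b"rotate-me-from-a-secret-store"  # illustrative; never hardcode

def pseudonym(email: str) -> str:
    """Keyed, deterministic pseudonym: the same user maps to the same ID,
    but the email cannot be recovered from the logged value."""
    digest = hmac.new(PSEUDONYM_KEY, email.lower().encode(), hashlib.sha256)
    return "usr_" + digest.hexdigest()[:10]

class PrivacyFilter(logging.Filter):
    """Drops raw identifiers before the record reaches any handler."""
    def filter(self, record):
        if hasattr(record, "email"):
            record.user_id = pseudonym(record.email)
            del record.email       # raw address is never serialized
        if hasattr(record, "ip_address"):
            del record.ip_address  # drop by default; no documented need here
        return True

# Demo: the pipeline only ever sees the pseudonymous ID
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(user_id)s %(message)s"))
logger = logging.getLogger("app")
logger.addFilter(PrivacyFilter())
logger.addHandler(handler)
logger.propagate = False
logger.warning("kyc_check_passed", extra={"email": "user@email.com"})
line = stream.getvalue()
```

Because the filter sits on the logger, every handler — console, file, aggregator shipper — receives the already-pseudonymized record.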
Audit logs operate under entirely different rules. Compliance requires an immutable record of access and actions — append-only, write-once, accessible only to your compliance function. Engineers debugging a production incident should not share an access tier with your financial audit trail. These are separate systems with separate purposes, and treating them as one creates both security and compliance exposure.
Scrubbing middleware on outbound log streams catches mistakes. It's not a design strategy — it's a fallback for when the actual design fails somewhere.
## Retention Is an Engineering Problem, Not a Policy Document
Every regulated company has a retention policy. Far fewer technically enforce it. The policy says "delete KYC documents per regulatory schedule." The data sits in production indefinitely because no one built the deletion job. The NIST Privacy Framework treats data lifecycle management — including retention and disposal — as a core engineering outcome, not an afterthought. It's a useful structural reference when you're defining what "done" actually looks like for a retention program.
Retention windows vary significantly by jurisdiction, data type, and applicable regulation — the figures below are illustrative starting points, not legal requirements. Validate specifics with counsel for your target jurisdictions:
| Data Type | Retention Window | Enforcement Mechanism |
|---|---|---|
| KYC raw vault | Account lifetime + jurisdiction requirement | Compliance workflow, manual review gate |
| Geofence results | 90 days | TTL field, rolling purge job |
| Session/auth logs | 90–180 days | Log store TTL config |
| Financial records | 5–7 years | Legal hold check before purge |
| Behavioral/marketing data | 12 months | Scheduled deletion, user-scoped |
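A purge job that respects those windows and the legal-hold gate might look like the sketch below. The windows and row shape are illustrative (matching the table's placeholder figures, which still need counsel sign-off); the structural points are that unknown data types are never auto-purged and a legal hold always wins.

```python
from datetime import datetime, timedelta, timezone

RETENTION = {                      # illustrative windows — confirm with counsel
    "geofence_result": timedelta(days=90),
    "behavioral":      timedelta(days=365),
}

def purge_candidates(rows, now, legal_holds=frozenset()):
    """Rows past their retention window and not under legal hold."""
    expired = []
    for row in rows:
        window = RETENTION.get(row["type"])
        if window is None:
            continue               # unknown type: never auto-purge
        if row["user_id"] in legal_holds:
            continue               # litigation/audit hold wins over the window
        if now - row["created_at"] > window:
            expired.append(row["id"])
    return expired

now = datetime(2025, 7, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "type": "geofence_result", "user_id": "u1",
     "created_at": now - timedelta(days=120)},  # past 90d → purge
    {"id": 2, "type": "geofence_result", "user_id": "u2",
     "created_at": now - timedelta(days=10)},   # fresh → keep
    {"id": 3, "type": "behavioral", "user_id": "u3",
     "created_at": now - timedelta(days=400)},  # past 12 months → purge
]
```

Schedule this, log its output to the audit trail, and alert when it deletes nothing for an extended period — a silent failure here is how "indefinite retention" happens.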
The right-to-erasure problem in distributed systems is harder than it first appears. Deleting a user means accounting for: primary database, read replicas, analytics warehouse, event streams, backups, third-party KYC provider, email service, and any other downstream system you pushed their data into. Build a data subject request (DSR) workflow that fans out across all stores. Retrofitting this is significantly more painful than building it early.
For records you're legally required to keep, anonymize the user linkage rather than deleting the record. The transaction happened, the financial record stays — but the user_id foreign key gets replaced with a non-reversible hash. Compliant retention, no live PII.
Soft deletes (deleted_at column) are not privacy-compliant deletion. They're a UX convenience that leaves data fully intact. For regulated data, you need either hard deletion or cryptographic erasure — delete the encryption key, and the data becomes permanently unreadable without requiring a physical delete.
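Both escape hatches — cryptographic erasure and anonymized linkage — are small in code. A sketch with illustrative names (`KeyStore`, `ANON_KEY`); in production the keys live in a KMS/secret store and the anonymization key is held by the compliance function:

```python
import hashlib
import hmac
import secrets

class KeyStore:
    """Per-user encryption keys. Deleting a key is cryptographic erasure:
    the ciphertext stays on disk and in backups, but becomes permanently
    unreadable without a physical delete."""
    def __init__(self):
        self._keys = {}

    def key_for(self, user_id: str) -> bytes:
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def erase(self, user_id: str) -> None:
        self._keys.pop(user_id, None)

    def get(self, user_id: str) -> bytes:
        if user_id not in self._keys:
            raise KeyError("key erased — data is unrecoverable")
        return self._keys[user_id]

ANON_KEY = b"separate-secret-held-by-compliance"  # illustrative

def anonymize_linkage(user_id: str) -> str:
    """Non-reversible replacement for the user_id foreign key on records
    you must retain (e.g. financial transactions)."""
    digest = hmac.new(ANON_KEY, user_id.encode(), hashlib.sha256)
    return "anon_" + digest.hexdigest()[:16]
```

The keyed hash is deterministic, so retained financial records for one (now anonymized) user still correlate with each other — which audits require — while the mapping back to a person exists only as long as `ANON_KEY` does.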
## What Regulated Betting Applications Actually Look Like
Online sports betting is one of the most instructive domains for this kind of engineering. The compliance surface is unusually wide: state gaming authority requirements, AML obligations, age and identity verification mandates, and consumer privacy law all apply simultaneously — and they don't always point in the same direction.
Applications supporting sports betting in Michigan operate under Michigan Gaming Control Board oversight, which imposes specific technical requirements around identity verification, geolocation, and record retention. CCPA adds a further layer for any California residents using the platform — note that CCPA scoping depends on user residency and operator thresholds, not just the state of operation. These obligations coexist with platform-level privacy commitments and create a genuinely complex compliance matrix.
In practice, this means a KYC gate at account creation that hard-blocks product access until verification is confirmed and stored per the applicable retention requirement. Geo-check middleware injected at the transaction layer — not just login, but every wager. An audit pipeline physically isolated from the observability stack, with access controls that don't overlap with general engineering. A compliance data store with credentials that most engineers never hold.
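The transaction-layer gating described above reduces to a small piece of middleware. A hedged sketch — the decorator shape, `ComplianceError`, and the lookup callables are all illustrative stand-ins for your actual KYC store and geo service:

```python
from functools import wraps

class ComplianceError(Exception):
    pass

def compliance_gate(get_kyc_status, get_geo_result):
    """Enforce KYC + a fresh geo check on every wager, not just at login.
    `get_geo_result` is expected to return a short-TTL cached check result."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(user_id, *args, **kwargs):
            if get_kyc_status(user_id) != "verified":
                raise ComplianceError("KYC not verified")
            geo = get_geo_result(user_id)
            if not geo or not geo.get("permitted"):
                raise ComplianceError("outside permitted jurisdiction")
            return fn(user_id, *args, **kwargs)
        return wrapper
    return decorator

# Illustrative in-memory stand-ins for the real stores
kyc = {"u1": "verified", "u2": "pending"}
geo = {"u1": {"permitted": True, "jurisdiction": "MI"}}

@compliance_gate(lambda u: kyc.get(u), lambda u: geo.get(u))
def place_wager(user_id, amount):
    return f"wager accepted: {user_id} ${amount}"
```

Putting the gate on the wager function rather than the login flow is the whole point: authorization at session start says nothing about where the user is three hours later.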
The betting domain is worth studying even if you're building healthcare software or fintech tooling. Regulatory pressure is intense enough that architectural shortcuts are genuinely costly — teams in this space have had to solve these problems for real, under audit, rather than deferring them to a future sprint.
## Practical Checklist
- KYC storage model has been validated against jurisdiction-specific regulatory requirements — not just assumed
- Identity data lives in an isolated schema with field-level encryption and separate credentials
- No PII written to application logs — pseudonymization at the source, not post-hoc scrubbing
- IP addresses treated as PII by default — not retained in logs without documented necessity
- Geolocation data is ephemeral — check result stored, raw coordinates discarded
- Audit logs are append-only, on a separate pipeline, inaccessible to general engineering access
- Retention windows are technically enforced — TTLs and scheduled purge jobs, not just documented policy
- A tested DSR workflow exists that fans out deletion across every data store, including third parties
- Third-party KYC and data vendors have signed DPAs
- Soft deletes are not used as a substitute for compliant deletion of regulated personal data
- Erasure approach (hard delete vs. cryptographic erasure vs. anonymization) has been reviewed against applicable DPA guidance
- Geo-check middleware runs at the transaction layer, not only at authentication