DEV Community: Sean Falconer

Solving The Austin Problem with a Data Privacy Vault

Sean Falconer — Thu, 01 Feb 2024 00:26:20 +0000

Data anonymization and tokenization play a crucial role in safeguarding sensitive information. Traditional tokenization systems, however, encounter significant limitations that hinder their effectiveness, breaking certain workflows and complicating security. This blog post delves into the "Austin Problem" – a challenge arising from traditional tokenization systems – and explores how Skyflow's data privacy vault addresses these issues.

Understanding Data Anonymization and Tokenization

Tokenization is a non-algorithmic approach to data anonymization that swaps sensitive data for tokens. For example, if you tokenize a customer’s name, like “John”, it gets replaced by an obfuscated (or tokenized) string like “A12KTX”. Because there’s no mathematical relationship between “John” and “A12KTX”, even if someone has the tokenized data, they can’t get the original data from tokenized data without access to the tokenization process.

A common form of tokenization is PCI tokenization, whereby credit card data is replaced with randomly generated tokens.

As an example, when integrating with a payment service provider (PSP) like Stripe, Adyen, or Braintree, your card acceptance flow will look something like what’s shown below.

The front-end SDK collects the credit card information and passes it to the gateway or PSP, which works with the card networks and the issuing bank to validate and authorize the card. Once the card is validated and authorized, the gateway or PSP stores the card and passes back a token as a stand-in for it. You can safely store the token in your backend and greatly reduce your PCI compliance scope. The only thing you care about is the token, and that’s enough for you to do everything you need to do with the credit card when relying on the PSP’s SDKs and APIs.

PCI Tokenization for card acceptance.

But tokenization isn’t limited to credit card data; it can anonymize various types of sensitive information, such as names, addresses, social security numbers, and more. The inverse process of tokenization, called detokenization, retrieves the original data from tokens using a token map.

Before we get into the limitations of this approach, let’s take a look at how traditional tokenization systems work.

How Traditional Tokenization Works

Conceptually, tokenization is fairly simple: swap sensitive data for a non-sensitive, randomly generated value, and keep track of the mapping from sensitive to non-sensitive values and vice versa. From an implementation standpoint, here's how these systems typically work:

Token Generation: When sensitive data needs to be protected, such as a customer's name, a tokenization system generates a random or pseudo-random token, which is a unique alphanumeric string like "A12KTX."

Token Map: The system maintains a token map or token dictionary, functioning similarly to a hash table. This map associates each original data value (e.g., "John") with its corresponding token (e.g., "A12KTX"). This mapping is stored securely.

Token Replacement: The sensitive data, such as "John," is replaced with its corresponding token, "A12KTX," before it is stored or transmitted. This tokenized data is what gets used in databases, applications, or during data exchanges.

Detokenization: When the original data needs to be retrieved, a process called detokenization is used. To detokenize data, the system looks up the token in the token map and retrieves the corresponding original value. Only authorized users or processes with access to the tokenization system can perform detokenization.

Security: Traditional tokenization systems emphasize the absence of a mathematical relationship between the original data and the token. This means that even if someone gains access to the tokenized data, they cannot reverse-engineer it to obtain the original information without the tokenization process and access to the token map.

Protection of Original Data: Since the original data is never stored alongside the tokenized data and is only retrievable through the token map, even if the environment containing the tokenized data is breached, the original data remains secure and uncompromised.
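The steps above can be sketched in a few lines of JavaScript. This is a minimal in-memory illustration that runs under Node.js, not the AWS solution or Skyflow's implementation; the `TokenVault` class, its method names, and the `canDetokenize` flag are hypothetical names chosen for this example.

```javascript
// Illustrative in-memory token map; a real system would persist and secure it.
const { randomBytes } = require('node:crypto');

class TokenVault {
  constructor() {
    this.tokenByValue = new Map(); // original value -> token
    this.valueByToken = new Map(); // token -> original value
  }

  // Token generation + token map: reuse the existing token for a value,
  // or mint a random one that has no mathematical link to the value.
  tokenize(value) {
    const existing = this.tokenByValue.get(value);
    if (existing !== undefined) return existing;

    let token;
    do {
      token = randomBytes(4).toString('hex').toUpperCase(); // 8 hex characters
    } while (this.valueByToken.has(token)); // retry on the rare duplicate

    this.tokenByValue.set(value, token);
    this.valueByToken.set(token, value);
    return token;
  }

  // Detokenization: a plain lookup, allowed only for authorized callers.
  detokenize(token, caller) {
    if (!caller || !caller.canDetokenize) {
      throw new Error('caller is not authorized to detokenize');
    }
    return this.valueByToken.get(token);
  }
}

const vault = new TokenVault();
const token = vault.tokenize('John'); // stored downstream instead of "John"
const original = vault.detokenize(token, { canDetokenize: true });
```

Note that the only thing connecting the token to "John" is the map itself, which is why the map has to be stored and guarded separately from the tokenized data.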

Example of a traditional tokenization system.

AWS has a great article going through in detail how to create such a system. But even a seemingly simple system has a lot going on under the covers that you would need to build and maintain as shown below.

AWS serverless tokenization solution.

Limitations of Traditional Tokenization Systems

Traditional tokenization systems suffer from three major limitations:

Token Overload: These systems map a given input string to a token without regard to what kind of data the string represents, so unrelated values that happen to share the same text end up sharing a token. This can break analytics and clean room workflows.

Limited Security Model for Detokenization: Users or processes often have broad permissions for detokenization, which poses security risks and lacks granularity.

Choice Between Tokenization and Encryption: Traditional tokenization systems typically do not integrate encryption, limiting their security capabilities.

Let’s take a closer look at each of these problems.

The Austin Problem: Token Overload

To support analytics use cases with tokenization, the same input value needs to generate the same tokenized output value. This is known as consistent or deterministic tokenization. With consistent tokenization, I will always know that the city “San Francisco” will be tokenized the same way. This approach keeps query operations like counts, group bys, and joins intact.

However, with traditional tokenization, there’s no disambiguation between different types of input. The token map doesn’t know that a particular string represents a city, state, name or any other information. This naive approach can lead to undesirable effects and is the manifestation of what I call The Austin Problem.

The Austin Problem occurs when the same input string generates the same token value for two distinct types of data. For instance, if a customer's first name is "Austin" and another customer lives in the city of Austin, Texas, their name and city would yield the same token (see the image below).

Pictorial representation of the Austin Problem.

Not being able to disambiguate the data type when going from token to original value can cause incorrect analytical calculations and confusion depending on the design of your analytics store.
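To make the collision concrete, here is a small sketch that runs under Node.js. It assumes a single shared token map with deterministic tokenization; the `tokenize` helper and the sample customers are made up for illustration.

```javascript
const { randomUUID } = require('node:crypto');

// One shared token map with no notion of data type (the "traditional" design).
const sharedTokenMap = new Map();

// Deterministic tokenization: the same string always yields the same token.
function tokenize(value) {
  if (!sharedTokenMap.has(value)) {
    sharedTokenMap.set(value, randomUUID());
  }
  return sharedTokenMap.get(value);
}

const customers = [
  { firstName: 'Austin', city: 'Dallas' },
  { firstName: 'Maria', city: 'Austin' },
];

const tokenizedCustomers = customers.map((c) => ({
  firstName: tokenize(c.firstName),
  city: tokenize(c.city),
}));

// The first name "Austin" and the city "Austin" share one token...
const collides = tokenizedCustomers[0].firstName === tokenizedCustomers[1].city; // true

// ...so counting that token across columns mixes a person with a place.
const austinToken = tokenize('Austin');
const occurrences = tokenizedCustomers
  .flatMap((c) => [c.firstName, c.city])
  .filter((t) => t === austinToken).length; // 2
```

Given only the token, nothing in the map says whether it stands for a name or a city, which is the ambiguity described above.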

Additionally, this simplistic approach complicates adhering to privacy requirements like data subject access requests (DSARs) and the right to be forgotten (RTBF), which are features of many privacy regulations like GDPR and CCPA.

For example, let’s say you get an RTBF request from a customer named 'Austin'. If you have only one customer named 'Austin', ideally you would delete the mapping from 'Austin' to the token value in the token store, rendering any tokens stored in your downstream services invalid. However, this becomes challenging if you also need to retain the token mapping for customers residing in the city of Austin, Texas. Consequently, what initially appeared to be a straightforward compliance action turns into a manual project to ensure analytics continuity for customers in that city.

Similar to the challenges with analytics, clean rooms where the data has been tokenized by both parties can also be problematic. A clean room depends on being able to perform join operations between two or more parties within a secure, isolated environment. Depending on how the data is stored by the businesses, not being able to tell the difference between something like a person’s name and a person’s city could lead to miscalculations.

Limited Security Model for Detokenization

Traditional tokenization systems lack fine-grained access control over detokenization permissions and output. This makes it challenging to cater to various use cases and data types.

For example, a marketer might only need partial access to a customer's date of birth, while the customer should see their full date of birth.

Detokenization Based on the Identity of the Requestor.

Even if you add extra logic to control who can detokenize data based on their identity, the token map has no knowledge of data types. Every value is just a string, so applying data masking based on identity gets complicated.

The Choice Between Tokenization and Encryption

Tokenization offers a unique advantage over encryption: it severs any mathematical link between the original data and the generated tokens. In contrast, encrypted data is mathematically derived from the original, so it can be recovered by anyone who obtains the key or breaks the algorithm. Additionally, tokens provide practical benefits when it comes to searchability and analytics, as they don't require decryption for use. However, what we really need is a combination of these techniques.

In traditional tokenization, data encryption operates as a distinct system, and the process of detokenization involves bringing the data back to its original plaintext form. If data masking is necessary, it is applied in real-time to the plaintext data. The separation of encryption, detokenization, and masking within traditional tokenization systems introduces potential vulnerabilities at each integration point.

Solving the Austin Problem

Solving the Austin Problem requires a more sophisticated approach that brings together historically siloed and independent systems like tokenization, access control, data masking, and encryption. Additionally, we need to rethink traditional tokenization and expand its functionality to address the limitations outlined above.

Skyflow is a data privacy vault that isolates, protects, and governs sensitive customer data. With Skyflow, the vault supports a new, more advanced form of tokenization called schema-based tokenization.

With schema-based tokenization, instead of relying on a single token map where tokens regardless of data type get intermixed, we can define a schema the same way we would for a database, and each column has its own self-contained tokenization map and tokenization scheme.
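As a rough sketch of the idea (not Skyflow's implementation), the following Node.js example keeps one token map per column. The `ColumnTokenizer` class, its methods, and the column names are hypothetical and exist only to illustrate how per-column maps keep the name "Austin" and the city "Austin" apart.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical schema-aware tokenizer: each column gets its own token map.
class ColumnTokenizer {
  constructor(columns) {
    this.maps = new Map(columns.map((name) => [name, new Map()]));
  }

  // Look up the token map for a column, rejecting columns not in the schema.
  mapFor(column) {
    const map = this.maps.get(column);
    if (!map) throw new Error(`unknown column: ${column}`);
    return map;
  }

  // Deterministic within a column, independent across columns.
  tokenize(column, value) {
    const map = this.mapFor(column);
    if (!map.has(value)) map.set(value, randomUUID());
    return map.get(value);
  }

  // Remove one column's mapping, e.g. to honor a right-to-be-forgotten request.
  forget(column, value) {
    return this.mapFor(column).delete(value);
  }

  has(column, value) {
    return this.mapFor(column).has(value);
  }
}

const tokenizer = new ColumnTokenizer(['first_name', 'city']);
const nameToken = tokenizer.tokenize('first_name', 'Austin');
const cityToken = tokenizer.tokenize('city', 'Austin');
const distinct = nameToken !== cityToken;

// Forgetting the customer named Austin leaves the city mapping untouched.
tokenizer.forget('first_name', 'Austin');
const cityStillMapped = tokenizer.has('city', 'Austin');
```

Because the two mappings live in separate maps, deleting the name mapping for an RTBF request no longer interferes with analytics on the city column.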

Each column within the schema can define its own custom format rules, allowing you to generate any kind of token. In the example image below, you can see the settings for a format-preserving consistently generated token for a credit card number, and also a consistently generated token in the form of a UUID for a cardholder name.

Configuring token types in Skyflow.
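To give a feel for what "format-preserving" means, here is a deliberately simple Node.js sketch. It is an assumption-laden illustration rather than how Skyflow generates tokens: the `formatPreservingToken` helper is hypothetical, it is not consistent (each call yields a new token), and a real system would also record the mapping and guard against a token that happens to equal the input.

```javascript
const { randomInt } = require('node:crypto');

// Replace every digit with a random digit and keep all other characters,
// so the token keeps the length and punctuation of the original value.
function formatPreservingToken(value) {
  return value.replace(/\d/g, () => String(randomInt(0, 10)));
}

const phoneToken = formatPreservingToken('123-456-7890');
const cardToken = formatPreservingToken('4111 1111 1111 1111');
```

Downstream systems that validate the shape of a phone or card number can keep working with tokens like these, since only the digits change.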

Skyflow supports a variety of sophisticated tokenization techniques, like transient tokenization for temporary storage. These are essentially self-destructing tokens for ephemeral data.

Additional enhancements include:

Data Type Awareness: Skyflow's system understands the data type of input, allowing for customized tokenization. Each column within the schema can define its tokenization rules, accommodating various data types.

Column Groups: To prevent token overload, Skyflow supports column groups, similar to namespaces in programming. This feature restricts deterministic tokenization to specific columns, ensuring tokens remain unique even when dealing with similar input values.

Fine-Grained Access Control: Skyflow's data governance engine offers fine-grained policy-based access control, extending to the detokenization process. This enables control over who can access what data, how it's accessed, and its format.

For example, in the image below, the same data is shown in two different ways depending on the role of the viewer. The Customer Support role not only has restrictions on the columns, but also for specific rows.

Example of two different views of vault data based on different policies.

An example policy for customer support is shown below.

ALLOW READ ON payments.name, payments.state WITH REDACTION = PLAIN_TEXT WHERE payments.state = Arizona

ALLOW READ ON payments.ssn WITH REDACTION = MASKED WHERE payments.state = Arizona

Simple policies control how data can be viewed and by whom, making tokenization and access control not isolated disconnected features, but all part of the same system.
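To show how rules like the two above combine column restrictions, row conditions, and redaction, here is a hypothetical in-memory encoding of them in JavaScript. This is not Skyflow's policy engine or API; the rule shape, the `mask` format (keep the last four characters), and the sample rows are all assumptions made for illustration.

```javascript
// Hypothetical encoding of the two customer support rules.
const customerSupportPolicy = [
  { columns: ['name', 'state'], redaction: 'PLAIN_TEXT', where: (row) => row.state === 'Arizona' },
  { columns: ['ssn'], redaction: 'MASKED', where: (row) => row.state === 'Arizona' },
];

// Assumed masking format: hide everything except the last four characters.
function mask(value) {
  return value.slice(0, -4).replace(/[A-Za-z0-9]/g, 'X') + value.slice(-4);
}

// Return only the columns the policy allows, in the allowed format.
// Rows that match no rule yield an empty object.
function readRow(policy, row) {
  const view = {};
  for (const rule of policy) {
    if (!rule.where(row)) continue;
    for (const column of rule.columns) {
      if (!(column in row)) continue;
      view[column] = rule.redaction === 'MASKED' ? mask(row[column]) : row[column];
    }
  }
  return view;
}

// Made-up sample rows.
const arizonaRow = { name: 'Jane Doe', state: 'Arizona', ssn: '123-45-6789', card: '4111111111111111' };
const texasRow = { name: 'John Roe', state: 'Texas', ssn: '987-65-4321', card: '4242424242424242' };

const arizonaView = readRow(customerSupportPolicy, arizonaRow);
// { name: 'Jane Doe', state: 'Arizona', ssn: 'XXX-XX-6789' }
const texasView = readRow(customerSupportPolicy, texasRow);
// {}
```

The card column never appears because no rule grants it, and the Texas row is hidden entirely because it fails both row conditions.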

Integration of Tokenization and Encryption: Even with the enhancements available through schema-based tokenization, there are still workflows that can’t be fully supported. Similarly, encryption is great, but typically requires decrypting the data in order to perform operations like search.

Skyflow solves this with polymorphic data encryption, a unique approach that fully supports operations on encrypted data. For example, you could query your vault to calculate the total number of customers that are over the age of 21 in the state of California without ever decrypting the data.

Skyflow seamlessly integrates tokenization, encryption, data masking, and access control, enhancing security while preserving data usability. This combination ensures data remains protected throughout its lifecycle.

Wrapping Up

The Austin Problem highlights one of several limitations with traditional tokenization systems, which hampers analytics workflows and makes adhering to certain privacy requirements like DSARs and RTBF extremely difficult.

Skyflow’s data privacy vault technology addresses these challenges by providing data type awareness, column groups, fine-grained access control, and integrated advanced tokenization and encryption. This approach not only enhances security but also preserves data usability, a tradeoff that companies have historically had to make. Polymorphic data encryption balances the need for security while keeping the data usable for any workflow you might perform with sensitive customer data.

Adding a Privacy Layer to AWS PartyRock

Sean Falconer — Tue, 28 Nov 2023 19:03:08 +0000

AWS recently unveiled PartyRock – an Amazon Bedrock Playground. PartyRock lets users leverage foundation models from Amazon and other leading AI companies in an intuitive and code-free playground to quickly create AI-powered applications that can handle an array of specialized tasks.

Whether you need to orchestrate your re:Invent schedule, optimize marketing strategies, or develop a diabetes-management diet planner, PartyRock is an amazing tool for transforming ideas into applications with minimal effort.

However, while the excitement surrounding PartyRock and the capabilities of generative AI is well-founded, it’s important to be mindful of data privacy concerns. The lack of a “delete” button for AI models raises substantial privacy and security concerns, because if users reveal sensitive data to an AI model, it can’t be deleted the same way you can delete a row from a relational database.

Consider, for example, a contract analysis assistant application operating on PartyRock. While this application proves invaluable in parsing complex contracts and extracting pertinent information, you need to put privacy measures in place to use this application because many contracts inevitably contain confidential data. Sharing such sensitive information with the underlying AI model presents a significant privacy risk.

So, how can you use Personally Identifiable Information (PII) in AI-driven applications?

To navigate the potential privacy limitations of any AI-based application, it’s imperative that we add a data privacy layer to limit PII exposure. To demonstrate this, we built a Chrome Extension to prevent unintended PII sharing with apps built on PartyRock. The data privacy layer leverages Skyflow LLM Privacy Vault. Using Skyflow, the extension detects and de-identifies PII so that PartyRock's models remain fully functional without compromising the privacy of sensitive details. The video below shows the complete functionality.

In this blog post, I’ll show how to create a privacy-preserving Chrome Extension. I’ll also share insights on how you can leverage the functionality offered by PartyRock, or any other AI model, while using a data privacy vault to protect sensitive data and safeguard user data privacy.

What is Skyflow LLM Privacy Vault?

Skyflow LLM Privacy Vault is a technology that’s purpose-built to isolate, protect, and govern sensitive customer data seamlessly throughout the lifecycle of LLMs. It’s not limited to working strictly with Amazon Bedrock – you can use Skyflow LLM Privacy Vault with any LLM, including a public model, a fine-tuned foundation model like those provided by PartyRock, or your own custom model.

Privacy During Model Training

Whether you’re constructing foundation models, fine-tuning models, or developing Retrieval Augmented Generation (RAG) models, the privacy vault works like a privacy firewall or a data transformation layer. It detects and de-identifies sensitive data during collection and processing, regardless of whether the source data originates from a single source, or is compiled from multiple sources.

The plaintext sensitive data that’s detected by Skyflow is stored in the vault and replaced by de-identified data. Then, LLM training can proceed as normal, with a de-identified and privacy-safe dataset.

Using a privacy vault for privacy-preserving model training.

Privacy in Inference

Users interact with AI models in a variety of different ways, with the most popular one being a front-end UI like the ones used by PartyRock applications. Users can also upload files to AI models. In both cases, these models use inference to collect data that users provide to them, including sensitive data – unless that data is first de-identified.

Using a privacy vault, sensitive data isn’t just de-identified; it's securely stored. All sensitive customer data (and even core IP) is kept out of LLMs entirely. This data can only be re-identified by authorized users. This approach preserves data privacy during inference when AI models provide responses because PII is protected by fine-grained access controls. These controls restrict who can see what data, when, where, and for how long.

Using a privacy vault for privacy-preserving inference.

Detect and De-identify PII

So, how does this work, and how can you add a privacy vault to any PartyRock application?

The first step is to detect sensitive data, including PII, from a dataset. The same approach is applied to model training datasets, and to any data supplied by a user during inference.

To detect PII, Skyflow provides a detect API endpoint that can accept text or files. This endpoint automatically identifies hundreds of forms of PII, and returns a privacy-safe version of the input where each piece of detected PII is replaced by vault-generated tokens. Note that vault-generated tokens are distinct from the LLM-generated tokens that are used to chunk and process information within AI models.

In the sample API call below, I’m calling the detect API with a sentence containing a name and phone number. When working with an LLM, either in training or inference, I typically don’t want to share these details or any other PII.

curl -s -X POST "https://manage.skyflowapis.com/v1/detect" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
    "vault_id": "t6dadfbc3f4d4cdfbf12bb38b694b144",
    "data": {
        "blob": "Hi, my name is Sean Falconer and my phone number is 123-456-7890.",
        "send_back_entities": true
     }
}'

This API call returns a response like the following example, where name and phone number are detected and replaced by vault-generated tokens. The context of each of these entities – my name and my phone number – remains intact, which is all the LLM needs to draw context for training and inference.

In this example, the name is replaced by a token formatted as a UUID while the phone number is replaced by a format-preserving token that still resembles a phone number. You can generate tokens in a variety of formats depending on your use case.

{
    "processed_text": "Hi, my name is NAME:576a5b26-5cca-4cdc-b409-ea2c39b53f21 and my phone number is PHONE:(765) 978-2342.",
    "entities": [
        {
            "processed_text": "NAME:576a5b26-5cca-4cdc-b409-ea2c39b53f21",
            "text": "Sean Falconer",
            "location": {
                "stt_idx": "16",
                "end_idx": "28",
                "stt_idx_processed": "16",
                "end_idx_processed": "56"
            }
        },
        {
            "processed_text": "PHONE:(765) 978-2342",
            "text": "123-456-7890",
            "location": {
                "stt_idx": "53",
                "end_idx": "64",
                "stt_idx_processed": "81",
                "end_idx_processed": "100"    
            }
        }
    ]
}
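One practical use of the entities array is to build a lookup from each vault-generated token back to its original value. The sketch below assumes the field names from the sample response above (`entities`, `processed_text`, `text`); the `buildTokenMap` helper is a hypothetical name, and the original post does not show how its own cache is populated.

```javascript
// Build a token -> original-value cache from a detect response.
// An existing Map can be passed in to accumulate entries across calls.
function buildTokenMap(detectResponse, tokenMap = new Map()) {
  for (const entity of detectResponse.entities || []) {
    tokenMap.set(entity.processed_text, entity.text);
  }
  return tokenMap;
}

// Trimmed copy of the sample response (location fields omitted).
const sampleResponse = {
  processed_text:
    'Hi, my name is NAME:576a5b26-5cca-4cdc-b409-ea2c39b53f21 and my phone number is PHONE:(765) 978-2342.',
  entities: [
    { processed_text: 'NAME:576a5b26-5cca-4cdc-b409-ea2c39b53f21', text: 'Sean Falconer' },
    { processed_text: 'PHONE:(765) 978-2342', text: '123-456-7890' },
  ],
};

const tokenMap = buildTokenMap(sampleResponse);
```

A map like this is only appropriate for a demo, since it holds plaintext PII on the client; it is the kind of structure the re-identification step later in this post iterates over.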

For end-to-end LLM data protection when creating or fine-tuning your own AI models, you would use the Skyflow detect API during both training and inference. For PartyRock applications, we can’t control the training process because we don’t have access to the backend service. However, we do have control over what gets shared during inference.

In the following sections, we dive into how to build a Chrome Extension that uses Skyflow LLM Privacy Vault to carefully monitor what’s shared with PartyRock and filter out PII.

Creating a Chrome Extension

Chrome Extensions are custom-built programs that enable users to customize the Chrome browsing experience. They are relatively simple to create.

They consist of a manifest.json file that describes the extension’s capabilities and configuration. The manifest I created for my Skyflow extension is shown below.

{
 "manifest_version": 3,
 "name": "Skyflow",
 "version": "1.0",
 "description": "Prevent PII sharing with AWS PartyRock Apps",
 "icons": {
   "16": "images/skyflow-16.png",
   "32": "images/skyflow-32.png",
   "48": "images/skyflow-48.png",
   "128": "images/skyflow-128.png"
 },
 "content_scripts": [
   {
     "js": ["scripts/jquery-3.7.1.min.js", "scripts/detect-and-tokenize.js"],
     "run_at": "document_end",
     "matches": [
       "https://partyrock.aws/u/*"
     ]
   }
 ]
}

The extension runs on any page matching the https://partyrock.aws domain and the u (i.e. user) route. It imports two JavaScript files:

jQuery, which I’m using to help provide shorthand for some of the DOM manipulation and matching I need to monitor input and output from a PartyRock app
detect-and-tokenize.js, the main program that integrates with Skyflow to monitor inference data for PII

Monitor, Detect, and De-identify PII

To prevent potential sharing of PII with the model, we need to monitor an app’s input fields, capture the user input, and then use Skyflow to detect and remove PII. The de-identified version is then swapped into the user input fields and passed along to the model for inference.

For example, in the image below, both areas that are boxed in red represent user input fields where PII might be intentionally or accidentally shared.

Example PartyRock app highlighting the areas in red where a user might share PII.

PartyRock apps load dynamically, so the input fields aren’t rendered until after the page loads. This means that in order to monitor user input, we need to wait for the page to load before attaching an input listener to the <textarea> element where users interact with an app.

Once the page loads, for each <textarea> input, we attach an input listener which is executed as a user types input. To avoid calling the Skyflow API on every keystroke, the setTimeout function is used to delay each call by 500 milliseconds. If there’s new input by the user, the delayed call is cleared and a new one starts.

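// Debounce: wait until the user pauses typing before calling Skyflow
textarea.on('input', function () {
  // Cancel the pending call, if any, because new input has arrived
  if (callback) {
    clearTimeout(callback);
  }

  // Schedule tokenizePii to run after 500 ms without further input
  callback = setTimeout(tokenizePii, 500);
});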

The tokenizePii function takes the input field’s text value and calls an AWS Lambda function, which in turn calls the Skyflow detect endpoint, as shown in steps 1 and 2 below:

Using a Chrome extension and Skyflow to provide end-to-end AI data privacy for PII.
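The Lambda side of this flow is not shown in the post, so the following is only a sketch of what the call to Skyflow might look like in JavaScript. The URL, headers, and body shape are copied from the curl example earlier; the `buildDetectRequest` and `detectPii` helpers are hypothetical, and how the vault ID and bearer token are obtained is left to the caller.

```javascript
// Endpoint taken from the curl example earlier in this post.
const DETECT_URL = 'https://manage.skyflowapis.com/v1/detect';

// Build the fetch() arguments for a detect call. vaultId and bearerToken
// are supplied by the caller (for example, from environment variables).
function buildDetectRequest(text, vaultId, bearerToken) {
  return {
    url: DETECT_URL,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${bearerToken}`,
      },
      body: JSON.stringify({
        vault_id: vaultId,
        data: { blob: text, send_back_entities: true },
      }),
    },
  };
}

// Hypothetical helper: send the request and return the parsed JSON response.
async function detectPii(text, vaultId, bearerToken) {
  const { url, options } = buildDetectRequest(text, vaultId, bearerToken);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`detect call failed with status ${response.status}`);
  }
  return response.json();
}
```

Keeping the bearer token inside the Lambda function rather than in the extension means the credential never ships to the browser, which is one reason to route the call through a backend.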

When we use an app like Contract Assistant with this Chrome Extension, PII contained within a contract is replaced by de-identified vault-tokenized values, as shown on the right side of the following illustration:

Plaintext PII in a contract is replaced with vault-tokenized values.

Monitor, Detect, and Re-identify PII

Now that the ingress messages to the PartyRock backend are free of PII, responses coming back from the LLM may contain de-identified values, which is ideal for data privacy but could be puzzling for app users. So, the next step is to re-identify these de-identified values to provide authorized users with PII from the vault, subject to fine-grained access controls.

To do this, we need our Chrome Extension to monitor the <div> element where responses are generated and automatically restore the de-identified values to the original values to give the user a readable, truly usable contract analysis application.

I used the MutationObserver interface to look for new child nodes being added to the <div>, indicating the presence of new response data. Similar to the ingress logic shown above, I’m applying a delay of 500ms so that I can avoid excessive processing and only re-identify the response after it fully loads.

var config = { childList: true };

// Callback function to execute when mutations are observed
var mutationCallback = function(mutationsList) {
  for (var mutation of mutationsList) {
    if (mutation.type == 'childList') {
      if (callback) {
        clearTimeout(callback);
      }

      responseText = $(responseArea).html();

      callback = setTimeout(reIdentifyData, 500);
    }
  }
};

// Create an observer instance linked to the callback function
var observer = new MutationObserver(mutationCallback);

To re-identify any vault-tokenized values, we could use the Skyflow API to return these tokens with the original plaintext PII values, subject to fine-grained access controls. However, because this is an example application and this particular use case likely doesn’t require a very large amount of PII information, I’m caching the tokens and original values in this example Chrome Extension.

This way re-identification is completely done client side, as shown below:

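// Escape regex metacharacters so tokens such as "PHONE:(765) 978-2342" match literally
function escapeRegExp(text) {
 return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function reIdentifyData() {
 let originalString = responseText;
 let referenceObject = responseArea;

 if(originalString !== undefined) {
   // tokenMap caches each vault-generated token alongside its original value
   for(let [token, pii] of tokenMap) {
     if(originalString.indexOf(token) >= 0) {
       // Replace every occurrence of the token with the cached plaintext
       let modifiedString = originalString.replace(new RegExp(escapeRegExp(token), 'gi'), pii);

       originalString = modifiedString;

       $(referenceObject).html(originalString);
     }
   }
 }
}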

Of course, caching PII in a Chrome Extension wouldn’t work for an industrial-grade version of this application. For that, we’d need to enhance this Chrome Extension to call Skyflow’s detokenize API endpoint, allowing it to de-tokenize vault-generated tokens in contract assistant responses for multiple users – as governed by strict fine-grained access controls.

Final Thoughts

AWS PartyRock provides an exciting set of capabilities for anyone who wants to explore the world of AI application development. It’s exciting to see such a broad range of applications available to run on PartyRock less than two weeks after its release!

But to move AI applications that handle PII or other sensitive data beyond the proof-of-concept phase, it’s critically important to get a handle on data privacy. Using a Chrome Extension like the one shown here along with Skyflow LLM Privacy Vault enhances the privacy of PartyRock applications so you can harness the potential of Amazon Bedrock, or any LLM, without impacting data privacy.

The best part is, this approach doesn’t impact the usefulness of PartyRock applications because PII de-identification is reversible – so the user experience is unaffected by keeping PII out of AI models.

I hope you have a great time building privacy-preserving applications with PartyRock!

De-scoping Your AWS Services from Data Residency Requirements

Sean Falconer — Mon, 25 Sep 2023 16:04:33 +0000

From the widely recognized GDPR in Europe to Brazil's LGPD regulations, and the more recent introduction of India's DPDP law, over 100 countries now have some form of privacy regulation in place. What's common among many of these regulations is the concept of data residency – the physical location of your data. However, each region's requirements bring their own unique nuances, encompassing restrictions on data transfer, data storage locations, and individual data rights.

Navigating this complex sphere of privacy regulations is a huge burden for many companies born in the cloud. Their data simply ends up everywhere, and tracking down the locations, adhering to local laws, and even storing and using it locally is enormously complex and expensive.

Over the past year, I've engaged with numerous companies eager to expand their businesses into new markets, such as Europe and Australia. However, they've encountered a significant roadblock: the absence of a robust technology solution to address the data residency requirements of these regions. As a result, they face the expensive and nightmarish scenario of duplicating their cloud infrastructure for each new region, which not only hampers operational efficiency but also prevents their data analysts and scientists from running analytics globally.

In this blog post, I offer a solution to this pressing technology and business challenge by introducing a PII data privacy vault. This architectural approach to data privacy effectively removes the burden of data residency, compliance, and data security responsibilities from your infrastructure, providing a seamless path for global expansion and data management.

Let’s dive in.

Data Residency and Barriers to Expansion

To grasp the intricacies of regulatory compliance in the context of global expansion, it’s important to understand a few key concepts.

Compliance

Compliance denotes a business's adherence to the laws and regulations governing data privacy and protection. These regulations are contingent on the geographic location of the customer whose data is being collected. Ensuring compliance is imperative for legal reasons as it shields businesses from financial penalties, license revocations, and the erosion of customer trust.

Data Residency

Data residency pertains to the physical location where customer data is stored. For instance, a website may serve customers in the EU, but their data could be hosted on a server located in Chicago. Different countries and regions have precise laws dictating how customer data should be handled, processed, stored, and safeguarded, making data residency a critical consideration.

Varying Regulations

The complexity surrounding data residency and compliance obligations primarily arises from the diversity of regulations worldwide. For instance, the European Union (EU) has GDPR, Brazil follows LGPD, and the United States enforces a patchwork of state-specific laws like CCPA in California and CTDPA in Connecticut. These regulations diverge significantly in terms of their stipulations and penalties.

Barriers to Global Expansion

The disparities in regulations and compliance requirements often pose formidable obstacles for companies striving to attain a global presence. Navigating diverse regulatory frameworks demands significant time, resources, and expertise. The resulting complexity frequently dissuades businesses from venturing into new markets, thereby constraining opportunities for global expansion.

We’ve looked at the problem. Now, let’s explore an approach to addressing these challenges.

What is a Data Privacy Vault?

A data privacy vault isolates, protects, and governs access to sensitive customer data. Within the vault, confidential information is securely stored, while abstract and non-sensitive tokens, serving as references, are retained in conventional cloud storage. This means that only non-sensitive tokenized data is accessible to other systems, ensuring the utmost protection and compliance.

In a recent IEEE article, the authors made a case that this architectural approach to data privacy is the future of privacy engineering. Just as any modern system likely contains back-end services, a database, and a warehouse, all modern systems need a data privacy vault to safely store, handle, and use sensitive customer PII.

Traditional PII management versus a data privacy vault (source: IEEE).

Let's take a look at a specific example for a simple web application. In the image below, a phone number is being collected by a front-end application. For effective de-scoping, it’s ideal to initiate the de-identification process at the earliest stage in the data lifecycle. In this scenario, the phone number is stored directly within the vault during collection at the front end.

Example of vault architecture for collecting sensitive customer PII.

Within the vault, the phone number, alongside any other personally identifiable information (PII), is stored within a robust and isolated environment, segregated from your organization's existing infrastructure. All downstream services, ranging from application databases to data warehouses, analytics platforms, and logging systems, interact solely with tokenized (de-identified) representations of the data. Queries against the PII for specialized operations or algorithmic operations against PII execute directly within the vault.
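To make the flow concrete, here is a minimal sketch of the idea in Python. It is a toy, in-memory stand-in and not Skyflow's API: the `ToyVault` class, the `tok_` prefix, and the sample phone number are all assumptions made for illustration.

```python
import secrets


class ToyVault:
    """In-memory stand-in for a data privacy vault (illustration only)."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # Random token: no mathematical relationship to the original value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # A real vault would check an access policy before this lookup.
        return self._token_to_value[token]


vault = ToyVault()
phone_token = vault.tokenize("+1-512-555-0100")

# Downstream systems (database, warehouse, logs) store only the token.
app_db_row = {"user_id": 42, "phone": phone_token}
log_line = f"created account user_id=42 phone={phone_token}"
```

The only place the real phone number lives is the vault's internal map; the database row and the log line carry nothing but the token.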

Access to de-tokenize or re-identify data is controlled through a zero trust model. Policy-based rules control who sees what, when, where, and for how long on a row and column level.

Controlling access to vault data based on who is requesting the data.
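As a rough sketch of what a policy check along these lines could look like, the snippet below models a grant that covers a role, a set of columns, and an expiry time. The `Policy` class and `is_allowed` function are hypothetical names for illustration, not any vault product's real policy language.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Policy:
    role: str
    columns: frozenset
    expires_at: datetime


def is_allowed(policy: Policy, role: str, column: str, now: datetime) -> bool:
    """Deny by default; allow only a matching role, column, and unexpired grant."""
    return role == policy.role and column in policy.columns and now < policy.expires_at


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
support_policy = Policy(
    role="support",
    columns=frozenset({"phone"}),
    expires_at=now + timedelta(hours=1),
)
```

Anything not explicitly granted is denied, which is the essence of the zero trust model: a support agent can read the phone column for an hour and nothing else.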

The vault combines the principles of isolation and zero trust with privacy-enhancing technologies and governance controls to insulate your systems from ever having to touch PII directly. This places your AWS components beyond the scope of regulatory compliance, assuring a higher level of data protection and adherence to data residency requirements.

Your AWS Services Handle Only De-identified Data

Let’s assume we have a simple application infrastructure as shown below with AWS Amplify providing the web server infrastructure, DynamoDB for application storage, and Redshift for warehousing.

Example web application infrastructure running on AWS.

Without a vault in place, everything within our AWS account is under compliance and security scope.

By introducing the vault as shown below (in this example, the collection of PII is handled directly from the vault), we de-scope all our AWS services. The services are only ever handling de-identified data, including the warehouse.

Many analytical operations can be performed with de-identified data provided the data is consistently generated. A warehouse doesn’t need to have access to someone’s name, it only needs a consistently generated representation of the name in order to execute counts, group bys, and joins.

Example of de-scoping AWS services with a data privacy vault.
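The point about consistently generated tokens can be sketched in a few lines of Python. This is a toy that assumes consistency is achieved by remembering the token issued for each value; the `ConsistentTokenizer` class and the sample orders are hypothetical.

```python
import secrets
from collections import Counter


class ConsistentTokenizer:
    """Issues one random token per distinct value and reuses it afterwards."""

    def __init__(self):
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so equal inputs always map to equal tokens.
        if value not in self._value_to_token:
            self._value_to_token[value] = "tok_" + secrets.token_hex(8)
        return self._value_to_token[value]


tokenizer = ConsistentTokenizer()

orders = [("John", 30), ("Maria", 20), ("John", 50)]
# Rows as the warehouse would see them: the name is tokenized, the amount is not.
warehouse_rows = [(tokenizer.tokenize(name), amount) for name, amount in orders]

# GROUP BY / COUNT equivalent performed on tokens only.
orders_per_customer = Counter(token for token, _ in warehouse_rows)
```

The warehouse can tell that two orders belong to the same customer without ever learning that the customer is called John.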

Storing PII to Different Regionalized Vaults

With Skyflow, a data privacy vault company, you can host vaults in various global regions and route sensitive data to a specific regional vault for storage and use. For instance, consider how the following application architecture meets data residency requirements across multiple regions:

Using multiple regional vaults to comply with data residency requirements.

Your company’s site collects customer PII during account creation.
On the client side, the website detects the customer’s location.
Detecting that the customer is in the EU, the client-side code uses Skyflow’s SDK to collect the PII data and store it in your company’s data privacy vault in Frankfurt, Germany. Note: For customers based in the US, the PII data is instead routed to the data privacy vault in the US (in this case, Virginia).
The EU-based customer’s sensitive PII is stored in the EU-based data privacy vault, and Skyflow responds with de-identified data.
The client-side code sends the account request, now with de-identified data, to the server.
The server processes the request, storing the data (now de-identified and tokenized) in cloud storage in the “Oregon, US” region.
At the end of the week, your company’s Redshift instance in Tokyo, Japan, loads the data (already de-identified and tokenized) from cloud storage to perform analytics.
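The routing steps above can be sketched roughly as follows. This is not Skyflow's SDK: the region mapping, the in-memory vault dictionaries, and the `store_pii` function are assumptions made purely for illustration.

```python
import secrets

# Hypothetical mapping of customer region to vault location.
VAULT_REGION_FOR = {"EU": "frankfurt", "US": "virginia"}

# One in-memory store per regional vault (stand-ins for real vaults).
regional_vaults = {"frankfurt": {}, "virginia": {}}


def store_pii(customer_region: str, pii: dict) -> dict:
    """Store PII in the customer's regional vault and return tokens only."""
    vault_name = VAULT_REGION_FOR[customer_region]
    vault = regional_vaults[vault_name]
    tokens = {}
    for field, value in pii.items():
        token = "tok_" + secrets.token_hex(8)
        vault[token] = value
        tokens[field] = token
    return tokens


deidentified = store_pii("EU", {"name": "Anna Schmidt", "email": "anna@example.com"})
# The server and warehouse only ever receive the tokens.
account_request = {"plan": "basic", **deidentified}
```

The EU customer's values land only in the Frankfurt store, while the account request that travels on to Oregon and Tokyo contains tokens.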

Deploying multiple vaults situated in different regions streamlines the management of your sensitive data, ensuring compliance with data residency requirements across all your markets.

The data privacy vault architecture significantly simplifies the complexities associated with data residency and compliance. Furthermore, by exempting Redshift (or any warehouse) from the compliance responsibilities tied to data residency, global analytics operations continue seamlessly within a single warehouse instance.

Final Thoughts

Compliance regulations, with their stringent data residency stipulations, require businesses to maintain rigorous standards for data localization, protection, privacy, and security. Adhering to these regulations is essential to mitigating the risks associated with breaches, penalties, and potential damage to reputation. However, enterprises operating in various global regions, serving diverse customer bases, are left to deal with the complex task of navigating multiple regulatory landscapes.

Using data privacy vaults as your core infrastructure for customer PII offers a streamlined solution to simplify global compliance, particularly concerning AWS services and cloud storage.

With a data privacy vault, organizations gain the ability to centralize the security of all sensitive data, effectively removing AWS and cloud storage from their compliance scope. By deploying data privacy vaults in various regions, companies can ensure that sensitive data storage and transmission align with the specific laws and regulations of each operational jurisdiction, thereby enhancing their overall compliance and security posture.

If you have thoughts on this or questions about this approach, please reach out to me on LinkedIn.

The Data Cloud’s Cheese and Diamond Problem

Sean Falconer — Mon, 18 Sep 2023 10:51:49 +0000

In any given week, if you search the news for “data breach”, you’ll see headlines like the ones below.

Companies like MGM and Caesars spend millions of dollars on firewalls, SIEMs, HSMs, and a whole smorgasbord of cybersecurity tools and yet, they can’t protect your social security number.

From hotels and casinos to some of the most innovative technology companies in the world, why is it that companies with seemingly endless financial and talent resources can’t get a handle on their data security challenges?

I believe this is due to a fundamental misunderstanding about the nature of data that started over 40 years ago.

Back in the 1980s, as computers found their way more and more into businesses, we lived in a disconnected world. To steal someone’s data, you had to physically steal the box the data lived on. As a consequence, we assumed that all data is created equal, that all data is simply ones and zeros, but this is wrong. Not all data is created equal; some data is special and needs to be treated that way.

In this blog post, I share my thoughts on what I refer to as the “Cheese and Diamond Problem” and how this has led to the data security challenges companies face today. I also explore an alternative approach, a new way of thinking, a privacy by engineering approach that helps us move towards a world where security is the default, and not bolted on.

The Cheese and Diamond Problem

Imagine that in my house I have cheese and I have diamonds. As a gracious host, I want guests of my home to be able to access my cheese. They should be able to freely go into the refrigerator and help themselves to some delicious cheese and perhaps a cracker.

However, I don’t want just anyone to touch my diamonds. Perhaps my diamonds even have sentimental value because it’s a diamond ring that’s been passed down through many generations in my family. Clearly the diamond is special.

Yet, if I store my diamonds in the refrigerator next to my cheese, it makes controlling access to the diamonds much more challenging. By co-locating these very different objects, my refrigerator alone isn’t enough to make sure my wife has access to the diamonds and cheese, but my guests only have access to my cheese.

The rules of engagement for something like diamonds are completely different than the rules of engagement for cheese. We all understand this distinction when it comes to physical objects.

This is exactly why my passport and my children’s birth certificates aren’t in the junk drawer in my kitchen with my batteries and my flashlights. If someone breaks into my home and steals my batteries, it's not that big a deal, but if someone steals my daughter’s birth certificate, then I not only feel like I’ve failed as a parent, but the information on her birth certificate is also now compromised forever. I can’t simply replace her date of birth.

Despite all of us intuitively understanding that some physical objects are different, that they’re special, we somehow miss this point when we work with data. We don’t apply this thinking to Personally Identifiable Information (PII). We treat it like any other form of transactional or application data. We stuff it in a database, pass it around, make a million copies, and this leads to a whole host of problems.

The PII Replication Problem

Let’s consider a simple example.

In the diagram below, which represents an abstraction of a modern system, a phone number is being collected in the front end of the application, perhaps during account creation. That phone number ends up being passed downstream through each node and edge of the graph and at each node, we potentially end up with a copy of the phone number.

The PII replication problem.

We store it in our database, in the warehouse, but we may also end up with a copy in our log files and the backups of all these systems. Instead of just having one copy of the phone number, we now have many copies and we need to protect all those locations and control access consistently wherever the data is stored.

Imagine that instead of having one copy of your passport that you keep in a secure location, you made 10,000 copies and then distributed them all over the world. Suddenly keeping your passport safe becomes a much harder problem in all 10,000 locations than if you have one copy secure in your home.

But this is exactly what we do with data.

We copy it everywhere and then attempt to batten down the hatches across all these systems and keep the policies and controls in sync about who can see what, when, and where. Additionally, because of the Cheese and Diamond Problem, we can’t adequately govern access to the data because the intermixing of our data conflates the rules of engagement about who has access. This quickly becomes an intractable problem because businesses don’t know what they’re storing or where it is, leading to the world we live in now where major corporations have data breaches on a regular basis.

Not All Data is Equal

Businesses are collecting and processing more data than ever. With the explosion of generative AI, as much as we are in an AI revolution, we are also in a data revolution. We can’t have powerful LLMs without access to massive data.

Companies leverage their data to drive business decisions, product direction, help serve customers better, and even create new types of consumer experiences. However, as discussed, not all data is created equal; some data, like PII, is special.

Over time, we’ve recognized that other forms of data like encryption keys, secrets, and identity are special and need to be treated that way. There was a time when we stored secrets in our application code or database. We eventually realized that was a bad idea and moved them into secret managers.

Approaches to managing different types of sensitive data.

Despite this progress, we are still left without an accepted standard for the storage and management of sensitive PII data. PII deserves the same type of special handling. You shouldn’t be contaminating your database with customer PII.

Luckily there’s a solution to this problem originally pioneered by companies like Netflix, Google, Apple, and Goldman Sachs and now touted by the IEEE as the future of privacy engineering, the PII Data Privacy Vault.

The PII Data Privacy Vault

A data privacy vault isolates, protects, and governs access to sensitive customer data (i.e. PII) while also keeping it usable. With a vault approach, you remove PII from your existing infrastructure, effectively de-scoping it from the responsibility of compliance and data security.

A vault is a first principles architectural approach to data privacy and security, facilitating workflows like:

PII storage and management for regulated industries
PCI storage and payment orchestration
Data residency compliance
Privacy-preserving analytics
Privacy-preserving AI

Let’s go back to our example from earlier where we were collecting a phone number from the front end of an application.

In the vault world, the phone number is sent directly to the vault from the front end. From a security perspective, we ideally want to de-identify sensitive data as early in the life cycle as possible. The real phone number exists only within the vault, which acts as a single source of truth that’s isolated and protected outside of the existing systems.

Example of using a data privacy vault to de-scope an application.

The vault securely stores the phone number and generates a de-identified reference in the form of a token that gets passed back to the front end. The token has no mathematical connection to the original data, so it can’t be reverse engineered to reveal the original value.

This way, even if someone steals the data, as happened in the Capital One data breach, the tokenized data carries no value. In fact, Capital One was fined only because they failed to tokenize all regulated data; some records were only encrypted, and those records were compromised.

Revealing Sensitive Data

While it’s great to securely store sensitive data, if we simply lock it up and throw away the key, it’s not super useful. We store all this customer PII so we can use it.

For example, we may need to reveal some of the data to a customer support agent, an IT administrator, a data analyst, or to the owner of the data. In this case, if we absolutely need to reveal some of the data, we want to re-identify it as late as possible, for example during render. We also want to limit what a user has access to based on the operations they need to perform with the data. While I might be able to see my full phone number, a customer support agent likely only needs the last four digits of my phone number and an analyst maybe only needs the area code for executing geo-based analytics.
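Here is a small sketch of how role-based views of the same phone number might work. The role names, the `XXX-XXX-` mask, and the `reveal_phone` function are assumptions for illustration, and the code assumes a 10-digit US number.

```python
def reveal_phone(phone: str, role: str) -> str:
    """Return a role-specific view of a 10-digit US phone number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected a 10-digit phone number")
    if role == "owner":
        # The owner of the data sees the full number.
        return digits
    if role == "support":
        # Support only needs the last four digits to verify a caller.
        return "XXX-XXX-" + digits[-4:]
    if role == "analyst":
        # Analysts only need the area code for geo-based analytics.
        return digits[:3]
    # No policy for this role means no access at all.
    raise PermissionError(f"no policy for role {role!r}")
```

Each caller receives the least revealing format that still lets them do their job, and an unknown role gets nothing.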

The vault facilitates all of these use cases through a zero trust model where no one and no thing has access to data without explicit policies in place. The policies are built from the bottom up, granting access to specific columns and rows of PII. This allows you to control who sees what, when, where, for how long, and in what format.

Let’s consider the situation where we have a user logging into an application and navigating to their account page. On the account page, we want to show the user their name, email, phone number, and home address based on the information they registered with us.

In the application database, we’ll have a table similar to the one shown below where the actual PII has been replaced by de-identified tokens.

Example of users table within the application database.

As in the non-vault world, the application will query the application database for the user record associated with the logged in user. The record will be passed to the front end application and the front end will exchange the tokens for a representation of the original values depending on the policies in place.

In the image below, the front end already has the tokenized data but needs to authenticate with the vault attaching the identity of the logged in user so that access is restricted based on the contextual information of the user’s identity. This is known as context-aware authorization.

Once authenticated and authorized, the front end can directly call the data privacy vault to reveal the true values of the user’s account information. But the front end only has access to this singular row of data and it's limited to the few columns needed to render the information on the account page.

Example of revealing sensitive data for a single record.
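A minimal sketch of this kind of context-aware check is below. The sample rows, the column set, and the `reveal_for_account_page` function are hypothetical; a real vault would derive the user's identity from a verified credential rather than a plain argument.

```python
# Toy vault contents: one row per user, keyed by user id (hypothetical data).
VAULT_ROWS = {
    "u1": {"name": "Ada", "email": "ada@example.com", "ssn": "000-00-0001"},
    "u2": {"name": "Bob", "email": "bob@example.com", "ssn": "000-00-0002"},
}

# Columns the account page is allowed to render.
ACCOUNT_PAGE_COLUMNS = {"name", "email"}


def reveal_for_account_page(authenticated_user_id: str, requested_row_id: str) -> dict:
    # Row-level rule: a user can reveal only their own record.
    if authenticated_user_id != requested_row_id:
        raise PermissionError("row not accessible for this identity")
    row = VAULT_ROWS[requested_row_id]
    # Column-level rule: return only the columns the page needs.
    return {col: row[col] for col in ACCOUNT_PAGE_COLUMNS}
```

Even with a valid login, the front end can reach one row and only the columns the page renders; the `ssn` column never leaves the vault in this flow.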

Sharing Sensitive Data

No modern application exists in a silo. Most applications need to share customer PII with third party services to send emails, SMS, issue a payment, or some other type of workflow. This is also supported by the vault architecture by using the vault as a proxy to the third party service.

In this case, instead of calling a third party API directly, you call the data privacy vault with the de-identified data. The vault knows how to re-identify the PII securely within its environment, and then securely share that with the third party service.

An example of this flow for sending HIPAA compliant forms of communication is shown below. The backend server calls the vault directly with tokenized data and the vault then shares the actual sensitive data with the third party communication service.

Example of using a vault to send HIPAA compliant communication.
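The proxy pattern can be sketched as follows. Everything here is a stand-in: `TOKEN_STORE` plays the role of the vault's internal mapping, and `fake_sms_service` is a hypothetical third party client that just records what it was asked to send.

```python
from typing import Callable

# Toy token store standing in for the vault's internal mapping.
TOKEN_STORE = {"tok_phone_1": "+1-512-555-0100"}


def vault_proxy_send(
    send_sms: Callable[[str, str], None], phone_token: str, message: str
) -> None:
    """Re-identify the phone number inside the vault, then call the third party."""
    real_phone = TOKEN_STORE[phone_token]
    send_sms(real_phone, message)


# Fake third-party client that records what it was asked to send.
outbox = []


def fake_sms_service(phone: str, message: str) -> None:
    outbox.append((phone, message))


# The backend holds only the token and never sees the real phone number.
vault_proxy_send(fake_sms_service, "tok_phone_1", "Your results are ready.")
```

The backend passes a token in, and the real value appears only on the hop between the vault and the third party service.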

Final Thoughts

We’ve come a long way since building business applications in the 1980s, but we’ve failed to evolve our thinking regarding how we secure and manage customer PII. Point solutions like firewalls, encryption, and tokenization alone aren’t enough to address the fundamental problem. We need a new approach to cut to the root of the Cheese and Diamond Problem.

Not all data is the same; PII belongs in a data privacy vault.

The data privacy vault provides such an approach.

It's an architectural approach to data privacy where security is the default. Multiple techniques like polymorphic encryption, confidential computing, tokenization, data governance, and others combine with the principle of isolation and zero trust to give you all the tools you need to store and use PII securely without exposing your systems to the underlying data.

If you have comments or questions about this approach, please connect with me on LinkedIn. Thanks for reading!