Soveren

Posted on Dec 8, 2021

How to discover personal data in cloud storage

#tutorial #security #cloud #privacy

Data loss prevention tools are often employed to discover and monitor personal data in the cloud, but how effective and costly are they?

Personal data laws have thrown a spanner in the works and made everyone rethink how they store client data that could be classified as “personal”. The catch is that whether data counts as personal can depend on what other data it is paired with. This means that potentially personal data could be pretty much anywhere.

Since most of the world now runs its data operations in the cloud, or is looking to shift them there, the cloud is becoming a major storage place for personal data.

Why you would need to look for personal data in the cloud

In any storage, things pile up. Just think about all the historical data on your hard disk and how often you go clearing that up! Now if we are tasked with knowing where our personal data is for compliance reasons, we need to trawl through historical data to see whether it has personal data in there.

Note here: the data could be in text form, whether structured or unstructured, just as it could be in PDF or JPG format.

So the increasing likelihood of a data breach (unintended third-party access to your company’s data, for instance) or of an audit from the regulators has lit a fire under us to get classifying and setting up rules for personal data usage and storage.

Luckily for you, if you’re using a cloud service like Google Cloud, Azure or Amazon S3, there are tools for finding and classifying data, and you can use them to improve your personal data practices. However, they aren’t as simple as they might first seem.

Personal data and cloud basics

Most cloud providers offer data loss prevention (DLP) services which aim to detect and protect against data breaches (loss) through monitoring, detecting, and blocking access to sensitive data while in use, in motion, and at rest.

Since DLPs are able to detect data, they can be tuned to detect personal data. To do so, you will need to write a few cloud functions, set up the output, and draw a few schemas.

When you write the function, you need to take care to list every file format to be scanned and to update the cloud function whenever you add a new file format. If you miss some file types out (remember: the cloud can store all file types), then the function will just skip them. The problem here is that if you create the function to scan too many file formats, it will become too big. You will basically be adding more complexity with the extra logic and expanding the function way beyond the initial intention, into something more like a full service.
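To make the skipping behavior concrete, here is a minimal sketch of a format allow-list. The extension set and the function name are hypothetical, not taken from any provider's sample code; the point is only that anything outside the list never reaches the DLP.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list: the formats this function knows how to pass to the DLP.
SCANNABLE_EXTENSIONS = {".txt", ".csv", ".json", ".pdf", ".docx", ".jpg", ".png"}


def partition_by_format(object_names):
    """Split object names into (to_scan, skipped) based on file extension."""
    to_scan, skipped = [], []
    for name in object_names:
        extension = PurePosixPath(name).suffix.lower()
        if extension in SCANNABLE_EXTENSIONS:
            to_scan.append(name)
        else:
            # Unlisted formats are skipped; returning them keeps the gap visible.
            skipped.append(name)
    return to_scan, skipped


to_scan, skipped = partition_by_format(
    ["invoices/2021/a.PDF", "exports/users.csv", "backups/db.parquet", "notes"]
)
print("scan:", to_scan)
print("skipped:", skipped)
```

Returning the skipped names, rather than silently dropping them, at least tells you which formats you are not covering.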

A way to get around this would be to only accept personal data in certain file formats from customers.

The good news is that the way to set up personal data monitoring in the cloud is pretty much identical whether you are using Amazon, Azure, or Google clouds.

First you will need to push the files in your cloud storage through the DLP. Take the outcome of this and place it in specific storage for metrics, and then transform it into a schema via the cloud’s data studio (called QuickSight on Amazon).

Take a look below for how it is done in Google Cloud. (The same schema applies for other clouds.)

Setting up

Since I have used an image on how to set up personal data monitoring for Google cloud, let’s run through how it is done there.

You’ll need to carry out a full scan to analyze the data you hold there. This is a bit of a long and complicated exercise: if you have several types of storage with large amounts of data, the time it will take to scan will run into the hundreds of hours.

Moreover, analyzing data with the DLP is priced by volume, so the more data you push through, the higher the cost. This could be compounded by throughput problems, as analyzing the storage and looking for PII adds additional requests. Depending on your budget, you will probably need to cap the number of requests.
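As a rough illustration of what capping by budget could look like, the sketch below picks files in order until the estimated cost would exceed a limit. The price per GB here is a made-up placeholder, not a real tariff, and `plan_scan` is a hypothetical helper; check your provider's current pricing before relying on any numbers.

```python
BYTES_PER_GB = 1024 ** 3


def plan_scan(files, price_per_gb, budget):
    """Select files in order until the next one would push the cost over budget.

    files: list of (name, size_in_bytes) tuples.
    Returns (selected_names, estimated_cost).
    """
    selected, estimated_cost = [], 0.0
    for name, size in files:
        cost = size / BYTES_PER_GB * price_per_gb
        if estimated_cost + cost > budget:
            break  # stop at the first file that does not fit
        selected.append(name)
        estimated_cost += cost
    return selected, estimated_cost


files = [
    ("a.csv", 2 * BYTES_PER_GB),
    ("b.csv", 5 * BYTES_PER_GB),
    ("c.csv", 4 * BYTES_PER_GB),
]
# price_per_gb=1.0 is a placeholder value chosen to keep the arithmetic obvious.
selected, estimated_cost = plan_scan(files, price_per_gb=1.0, budget=8.0)
print(selected, estimated_cost)
```

Note that the estimate depends only on file size: a file with no personal data in it costs just as much to scan as one full of it.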

Google offers you a set of best practices to keep costs down, for example: setting your cloud function to only scan data that has been updated or changed. Still though, this can come with its own problems: imagine you make a small change to a large file; the whole file will be scanned, not just the change.
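Here is a small sketch of the "only scan what changed" idea, assuming you keep a record of when each file was last scanned. The data shapes and the function name are assumptions made for illustration. It also shows the catch described above: a changed file is counted at its full size, however small the change was.

```python
def files_needing_scan(current, last_scanned):
    """Find files that are new or modified since their last recorded scan.

    current: {name: (updated_at, size_in_bytes)}
    last_scanned: {name: updated_at value seen at the last scan}
    Returns (sorted_names, total_bytes_to_scan).
    """
    names = sorted(
        name
        for name, (updated_at, _size) in current.items()
        if name not in last_scanned or updated_at > last_scanned[name]
    )
    # The whole file is rescanned, so the full size counts, not the size of the change.
    total_bytes = sum(current[name][1] for name in names)
    return names, total_bytes


current = {
    "a.txt": (100, 10),
    "big.csv": (205, 5_000_000_000),  # tiny edit, huge file
    "c.pdf": (50, 300),               # never scanned before
}
last_scanned = {"a.txt": 100, "big.csv": 200}

names, total_bytes = files_needing_scan(current, last_scanned)
print(names, total_bytes)
```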

The input can be PDF, Word, pictures with text, etc., and the different formats make it hard to identify whether there is personal data in there. This is especially true for pictures, since the picture quality needs to be good enough for the data to be recognized.

You can find an example of the function code for analyzing the data store on Google Cloud’s GitHub.

New files or data will need to be analyzed by a separate function. That function is triggered when new files are uploaded to the storage or when existing files are updated.

Launch the file scan and set the publication output for BigQuery or elsewhere.

Files are uploaded or updated
The cloud function is triggered
The cloud function checks files for personal data
The results are published to another storage
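The four steps above can be sketched as a single handler. This is not the code from Google Cloud’s repository: the regex check is a stand-in for the real DLP call, and `read_object` and `publish` are hypothetical callables that you would wire to your storage client and to BigQuery (or wherever you publish results).

```python
import re

# Stand-in for the DLP: a single email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def inspect_text(text):
    """Placeholder inspection step; a real function would call the DLP here."""
    return [
        {"info_type": "EMAIL_ADDRESS", "offset": match.start()}
        for match in EMAIL_PATTERN.finditer(text)
    ]


def handle_storage_event(event, read_object, publish):
    """Runs when a file is uploaded or updated: read it, check it, publish findings."""
    text = read_object(event["bucket"], event["name"])
    findings = inspect_text(text)
    for finding in findings:
        publish({"bucket": event["bucket"], "object": event["name"], **finding})
    return len(findings)


# Local usage with in-memory stand-ins for the storage and the results table.
storage = {("uploads", "mail.txt"): "Contact: example.name@example.com"}
results = []
count = handle_storage_event(
    {"bucket": "uploads", "name": "mail.txt"},
    read_object=lambda bucket, name: storage[(bucket, name)],
    publish=results.append,
)
print(count, results)
```

Passing the reader and the publisher in as arguments keeps the handler testable without touching any cloud resources.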

Next you will want to classify the data you have.

How the classification is split depends on how you want the data to be divided up, since different people will analyze different classes of personal data.

When all of the resources have been created in the cloud, you’ll need to write the cloud functions. You can find an example of the cloud functions code on Google Cloud’s GitHub.

Deploying

Deployment can be difficult depending on how many environments you have. If you have only one storage and one environment it will be relatively easy, but if you have several then you will have problems. For example, if you use demo, beta, and production environments, you will need to write the functions on the demo environment, check them, and test them, then do the same in beta and again in prod, taking it all the way to production. This is time consuming across several systems and you will probably need the help of a DevOps engineer.

If your storage is spread across different cloud providers, you will have separate stores for different services or file types: one for user activity, one for user payments, and so on. The difficulty here is that you need to know how to launch to production across all of these systems.

How to classify the data

Each kind of personal data has its own specific type (an infoType, in Google’s terms), which acts as the flag for detecting it. Google provides a full list of data types.

You will usually use Google Data Studio, Power BI, or something else to visualize the data. But the main point of classifying data is to identify what you have and see how sensitive the different types of data you hold really are.

You can find out how sensitive the data is by getting a security analyst to review it. This will add to the cost as the analyst will need to carefully check the BigQuery table and its metrics to monitor the classification types and assess the likelihood of it being personal data. The analyst will generally have to write a script for what happens when data is flagged as personal.
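The kind of script an analyst might write can be as simple as a rule that maps a finding’s likelihood to an action. The sketch below is one hypothetical version: the likelihood names follow the five levels used by Google’s DLP, but the thresholds and the action names are assumptions you would replace with your own policy.

```python
# Likelihood levels, ordered from least to most likely.
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def decide_action(finding, review_from="POSSIBLE", restrict_from="LIKELY"):
    """Map a finding to an action based on where its likelihood falls."""
    rank = LIKELIHOOD_ORDER.index(finding["likelihood"])
    if rank >= LIKELIHOOD_ORDER.index(restrict_from):
        return "restrict_access"   # hypothetical action name
    if rank >= LIKELIHOOD_ORDER.index(review_from):
        return "send_to_analyst"   # hypothetical action name
    return "ignore"


findings = [
    {"info_type": "EMAIL_ADDRESS", "likelihood": "VERY_LIKELY"},
    {"info_type": "US_DRIVERS_LICENSE_NUMBER", "likelihood": "POSSIBLE"},
    {"info_type": "PHONE_NUMBER", "likelihood": "UNLIKELY"},
]
actions = [decide_action(finding) for finding in findings]
print(actions)
```

Anything that lands in the middle band still needs a person to look at it, which is where the analyst cost comes from.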

So let’s look at an example:

Input

For example, you receive an email which is then placed in your storage as a text document:

Please update my records with the following information:

Email address: example.name@example.com

National Provider Identifier: 1245319599

Driver's license: AC333991

Output

The output of the scan is a table listing, for each finding, the type of information found, the likelihood that it is personal data, and where the suspected personal data is located in the text (as a character offset). It’s important to note here that the built-in information types are preset: anything they don’t cover has to be defined and maintained by you as a custom detector. Secondly, you’ll note that the likelihood classification isn’t a binary choice, which means there’s room for misinterpretation and you’ll actually need to analyze whether it is personal data or not.
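To show the shape of such an output (type, likelihood, position), here is a local stand-in that runs over the example text above. It does not call the DLP: the regexes are crude approximations, the info type names are used illustratively, and the likelihood values are hard-coded placeholders rather than what the service would return.

```python
import re

# Hypothetical stand-in detectors: (info type, pattern, placeholder likelihood).
DETECTORS = [
    ("EMAIL_ADDRESS", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "LIKELY"),
    ("US_HEALTHCARE_NPI", re.compile(r"\b\d{10}\b"), "POSSIBLE"),
    ("US_DRIVERS_LICENSE_NUMBER", re.compile(r"\b[A-Z]{2}\d{6}\b"), "POSSIBLE"),
]


def scan(text):
    """Return findings sorted by their character offset in the text."""
    findings = []
    for info_type, pattern, likelihood in DETECTORS:
        for match in pattern.finditer(text):
            findings.append({
                "info_type": info_type,
                "likelihood": likelihood,  # placeholder, not a real score
                "offset": match.start(),
                "length": match.end() - match.start(),
            })
    return sorted(findings, key=lambda finding: finding["offset"])


text = (
    "Please update my records with the following information:\n"
    "Email address: example.name@example.com\n"
    "National Provider Identifier: 1245319599\n"
    "Driver's license: AC333991\n"
)
findings = scan(text)
for finding in findings:
    print(finding)
```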

Once these metrics have been published to BigQuery, you can visualize the data in the data studio. Here you can add a table for the different types of personal data that are contained in the documents you have: e.g. 50% of docs have email and driver’s license.
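The "percentage of documents per type" figure is a simple aggregation over the findings rows. In practice you would probably compute it with a query or in the visualization tool; the sketch below just shows the logic, with a hypothetical row shape of (document, info type) and made-up sample rows.

```python
from collections import Counter


def share_of_documents_by_type(rows, total_documents):
    """Percentage of all documents that contain each info type at least once.

    rows: iterable of (document_name, info_type) findings.
    total_documents: number of documents scanned, including those with no findings.
    """
    docs_per_type = Counter()
    # set() so that a type found several times in one document counts once.
    for _document, info_type in set(rows):
        docs_per_type[info_type] += 1
    return {
        info_type: 100 * count / total_documents
        for info_type, count in docs_per_type.items()
    }


rows = [
    ("a.txt", "EMAIL_ADDRESS"),
    ("a.txt", "EMAIL_ADDRESS"),  # second email in the same document
    ("a.txt", "US_DRIVERS_LICENSE_NUMBER"),
    ("b.txt", "EMAIL_ADDRESS"),
    ("c.txt", "US_DRIVERS_LICENSE_NUMBER"),
]
shares = share_of_documents_by_type(rows, total_documents=4)
print(shares)
```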

Problems using DLP for data detection

You may have noticed: using a DLP in the cloud to monitor personal data comes with its issues.

It’s difficult
Actually building an efficient system that is able to capture and classify everything you need is pretty difficult, to say the least. You need to get a few people involved, like programmers to write the functions, DevOps engineers to deploy the resources, and system analysts to build it all out. Additional people getting involved means additional complexity.

It’s fragile
If you manage to set everything up and then realize you want to alter something, you will have a job on your hands. Adding any components to the cloud will affect the strict dependencies between them, increasing the complexity and fragility of the system as a whole. The same applies if you want to add new data stores or sources: it will increase development complexity.

It’s expensive
The cost of this solution at face value seems pretty low, but when you factor in the work hours involved, the cost soon skyrockets. Moreover, DLP systems charge based on the amount of data that flows through them, plus the work done by the functions, regardless of how much personal data is discovered.

Bottom line

In the end, the result you are looking for may be impossible to achieve because the store you identify may not be the only place that personal data is located. To really know where all your data is, you will have to conduct deep scans which, as we’ve seen, are costly and time consuming. This really puts them out of reach for any company that isn’t a large enterprise with a separate team for privacy.

The other thing is, finding and classifying data is not a one-time process, but an ongoing one. This means it should be simple and user-friendly and the DLP method described above just isn't.

Is there another option?

Try monitoring data in motion without a DLP, using a proxy.
