Jeroen Heijmans for Rijks ICT Gilde


Pragmatic Privacy for Programmers (Part 1)

As software developers, we not only have to make sure that the applications we build have the right functionality; we also have to ensure they meet various non-functional requirements. Our apps had better be secure against hackers, fast enough not to annoy users, and able to scale quickly rather than crash when there's a peak in usage. As of 2018, privacy is unavoidably one of these concerns. If it wasn't already, the enforcement of the General Data Protection Regulation (GDPR) in May 2018 made it a concern for most developers.

One way to deal with GDPR is to throw your hands up in the air and block everybody from the EU. That sounds drastic, but some sites and companies have actually done that. If you don't have that option, say because you actually live in the EU (like me) or have users or customers there, you'll need to take care of this privacy thing. And even if you don't have to deal with GDPR, it's quite likely you'll have to adhere to other privacy laws or regulations.

Perhaps some of us are lucky to work for a large organization where there's a privacy expert available to tell us what to do. If we're really lucky, there might even be a privacy engineer who knows enough about software engineering to take care of the work. But in many cases, such experts aren't around or available to our projects. That means we will largely need to address it ourselves. But how can we go about this? We don't have the time to become privacy experts!

In this article (and the parts that follow) I'll outline a pragmatic approach to dealing with privacy as a programmer. Since I'm in the EU, I'll use the GDPR as a guideline, but the approach should be applicable regardless of the specific laws you need to adhere to. Before we can get to the practical stuff, we need some basic definition of what privacy is about.

Since this is a legal topic, please take the below as a generic approach, but not as a specific solution for your apps. It may contain incorrect assumptions, does not take into account your specific situation and is intended to be an illustration.

A bit of theory

When talking about software, privacy is about protection of personal data. At first blush that may seem clear, but some closer inspection doesn't hurt:

First, what is personal data? If we follow the GDPR, personal data is information that relates to an identifiable person. That's a pretty broad definition, and also a bit vague. There's no reference list of what constitutes personal data, nor is it possible to create one, as it depends on context. Let's take hair colour, for example. The information clearly relates to a person. But is that person also identifiable? On a global scale, no. But with additional information, we may be able to indirectly identify the person. For example, if we know the person works at your company, or lives in your street, just knowing the hair colour could be enough to identify them. In such cases, hair colour is personal data that falls under the GDPR.

Secondly, the apps we build should protect personal data. We should protect it against everybody: people with technical access to the application (that's us), other people in our organization, collaborators and outsiders. Doing this correctly requires us to perform a number of duties. For starters, we need a valid reason (lawful purpose) for processing personal data. Next, we can only use it for that reason (purpose limitation), and so on. The people the personal data relates to also have rights that we need to guarantee. Examples of owner rights include knowing when personal data is processed (right to be informed) and being able to see what personal data is processed (right of access).

So in summary, privacy is about:

  1. Personal Data:
    a. information that relates to a person
    b. a person that is identifiable

  2. Protection:
    c. Duties
    d. Owner Rights

Compliance the pragmatic way

Under the GDPR, it's up to the processor of personal data - us - to show that we're compliant with the regulation. Compliance sounds very legal, but it's not so far from what we as developers already do: we use automated tests to show that our code is "compliant" with our functional requirements, and monitoring to show that our app is "compliant" with our performance demands. Let's try to build something like this for our privacy requirements.

Using the theory from the previous section, a first version of what we build might look like this:

| Information Related To Person | Person identifiable? | Duties fulfilled? | Owner Rights supported? |
| --- | --- | --- | --- |
| Foo | No | N/A | N/A |
| Bar | Yes | | |

It's basically a list of personal data candidates. If it's actually personal data (the person is identifiable), we'll need to confirm that we've fulfilled our duties and supported the rights of the users. If it's not personal data, it's still useful to keep it on the list to document that we don't consider this personal data.

We can maintain this document anywhere we want, but it makes sense to keep it in a central, easily accessible place. If you're using a tool like GitHub, the README.md or the wiki is a natural spot. That also has the advantage that we can embed links or status images from automated checks, if we have them.

Let's put this proposed approach to the test and build a simple application. We'll build it in a few increments to show the approach works well when doing agile. Since the app will have all kinds of features to demonstrate the approach, let's call it Kitchen Sink.

Increment 1: avoiding personal data

In the first increment we're going to build a very basic static website with a picture of a kitchen sink. We're going to put it on a server we happen to have running. The server already has Apache (or NGINX) installed on it, so let's put some static HTML in there.

Not much to do about privacy in a static web page, right? Actually, we're already collecting personal data. People will visit our static website, and Apache registers every visit in its log files. Are page visits personal data? They absolutely relate to a person (some companies pay good money to know which sites you visit). Is the person also identifiable? The log file usually records the visitor's IP address. While that doesn't give you a person's name, the IP address often allows you to track a person across websites, which is enough to make them identifiable. When you combine the IP address with other information that's easily available (to see what's possible, look into browser fingerprinting), it becomes even clearer that this is really personal data.
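To see how little it takes, here's a made-up log line in Apache's combined format; the very first field is the visitor's IP address, so extracting it is trivial:

```python
# A fabricated access-log line in Apache's combined format.
# The first whitespace-separated field is the client IP address.
line = '203.0.113.7 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"'

visitor_ip = line.split()[0]
print(visitor_ip)  # 203.0.113.7
```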

So, after putting up this static page, our checklist looks like this:

| Information Related To Person | Person identifiable? | Duties fulfilled? | Owner Rights supported? |
| --- | --- | --- | --- |
| Website visit | Yes | TBD | TBD |

But, do we really need this personal data? Well, we'd like to have the data to do statistical analysis such as visitor counts and origin, but we don't really need the person behind it to be identifiable. If we can achieve that, it's no longer personal data and doesn't need to be protected according to the GDPR.

There are a number of ways to de-personalize data:

  1. Don't store the data at all, or remove the identifiers. This is by far the simplest approach, but it has the obvious disadvantage that we can't use the data in any way.
  2. Aggregate the data and/or identifiers. This means summarizing the data in such a way that individual records are no longer present. In our case, this could mean feeding the logs directly into our access-log analysis tools.
  3. Anonymize the identifiers. By obscuring the identifiers, the data is no longer personal data. This would allow us to keep the log files for future purposes if we so desire.

Let's go with anonymization. Or actually, we're going to use a slightly weaker form: data masking. Instead of fully anonymizing the IP address, we will mask the last octet of the IP(v4) address. So 192.168.11.12 is masked to 192.168.11.x, or, if we want to keep it a valid IP address, 192.168.11.0. It might be argued that this is insufficient, but it's what Google Analytics does, so let's consider it sufficient for now. In Apache it can be done with a few configuration rules, or by installing a module, and the same goes for NGINX and other web servers.
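In practice you'd configure this in the web server itself, but the masking rule is simple enough to sketch in a few lines of Python. This is a hypothetical helper for illustration, not an existing Apache or NGINX module:

```python
def mask_ip(ip: str) -> str:
    """Zero out the last octet of an IPv4 address, keeping it a valid address."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets[-1] = "0"  # drop the host part, like Google Analytics' IP anonymization
    return ".".join(octets)

print(mask_ip("192.168.11.12"))  # 192.168.11.0
```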

It seems valuable to register this choice of de-personalization in our "compliance" table, so we can see why we don't consider website visits as personal data. Let's add a column:

| Information Related To Person | Person identifiable? | Depersonalized? | Duties fulfilled? | Owner Rights supported? |
| --- | --- | --- | --- | --- |
| Website visit | No | Yes (Masking) | N/A | N/A |

We aren't fully done yet. We claim to mask identifiers here, but saying so doesn't make it true. If we botch the configuration at some point, we'll never know. So we should create a test that checks whether we're still compliant. How to do this depends on our setup. An elegant approach is to use a log analysis tool such as Splunk, the Elastic Stack or Papertrail. With these tools, you can set up a search of your access logs for the presence of unmasked IP addresses. If that unlikely event occurs, you can register it, get notified, or fire a webhook to a different application.
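The core of such a check can be sketched in a few lines, assuming our masking scheme zeroes the last octet: find every IPv4 address in the access log and flag those whose last octet isn't 0. The function name and sample log lines are made up for illustration:

```python
import re

# Matches dotted-quad IPv4 addresses anywhere in a log line.
IPV4 = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b")

def find_unmasked_ips(log_lines):
    """Return IPv4 addresses whose last octet is not 0, i.e. escaped our masking."""
    hits = []
    for line in log_lines:
        for ip in IPV4.findall(line):
            if ip.split(".")[-1] != "0":
                hits.append(ip)
    return hits

log = [
    '192.168.11.0 - - [10/Oct/2018:13:55:36] "GET / HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Oct/2018:13:55:37] "GET /sink.png HTTP/1.1" 200 2048',
]
print(find_unmasked_ips(log))  # ['203.0.113.7'] — the second line slipped through
```

In a real setup this would be a saved search or alert in your log tool rather than a script, but the principle is the same: the check fails loudly the moment an unmasked address appears.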

With that check, we get our final privacy overview for this iteration:

| Information Related To Person | Person identifiable? | Depersonalized? | Duties fulfilled? | Owner Rights supported? |
| --- | --- | --- | --- | --- |
| Website visit | No | Yes (Masking) ✅ | N/A | N/A |

Please join me soon for the next part, where I will expand the Kitchen Sink app to see if we can also fulfill our duties as processor of personal data.

This post was previously published at Rijks ICT Gilde (in Dutch)

Photo by Dayne Topkin on Unsplash
