In this post we look at some examples of reconstruction attacks, i.e. how highly sensitive information about individuals can be revealed from seemingly anonymous data.
Let’s say you are analysing data. Maybe you are running some ML prediction algorithms, training your models, calculating different statistics, and sharing your outputs. It may seem that simply removing all the personally identifiable information such as names, addresses or telephone numbers should suffice to make sure that no private information is revealed after the analysis. That might even be sufficient to be considered anonymous data according to some privacy laws. If so, then surely you don’t need to be too worried, right?
Perhaps instead you are aggregating data over many individuals, so you don’t even think about privacy issues. A trivial example of how things can go wrong with aggregate statistics is publishing the average salary of, say, 100 employees and then publishing the average over 101 employees after a new employee has joined. Anyone with access to both aggregates can easily work out the salary of the new employee. Even though that might seem like an obvious mistake that is easy to avoid, it becomes much trickier when a range of statistics and aggregates is released in different contexts. Things get even more challenging when such information is combined with other data sources about the same individuals.
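The arithmetic behind the salary example can be sketched in a few lines of Python. The salary figures and the function name `recover_new_salary` are made up for illustration; the only inputs the attacker needs are the two published averages and the head count.

```python
def recover_new_salary(avg_before, n_before, avg_after):
    """Recover a newcomer's salary from two published averages."""
    total_before = avg_before * n_before
    total_after = avg_after * (n_before + 1)
    # The only difference between the two totals is the new salary.
    return total_after - total_before


# Hypothetical data: 100 employees, then one new joiner.
salaries = [50_000] * 100
new_salary = 80_300

avg_before = sum(salaries) / len(salaries)
avg_after = (sum(salaries) + new_salary) / (len(salaries) + 1)

recovered = recover_new_salary(avg_before, len(salaries), avg_after)
print(round(recovered))
```

In practice published averages are often rounded, so the attacker recovers a narrow range rather than an exact figure, which is usually still a serious leak.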
If you don't follow a structured approach to data sharing, you've got a good chance of compromising the privacy of the data source. Many large companies and governments have made these mistakes, so let's talk about how you can avoid the same peril!
Few data points suffice to identify individuals
Even if we think of ourselves as a needle in a haystack of 7.7bn people, a range of studies has shown that very few data points suffice to identify an individual, either uniquely or with high probability. As an example, 4 spatiotemporal points taken from credit card metadata are sufficient to uniquely reidentify 90% of individuals [1].
Similarly, in another study that considered mobility data taken from mobile phone devices with a time resolution of 1h and the spatial resolution determined by the distance between antennas, only 4 randomly drawn points sufficed to identify 95% of individuals (and two randomly drawn points identified over 50%) [2]. The task is even easier for an attacker who cleverly uses non-uniform sampling e.g. by exploiting the fact that calls from an office at 2 am might provide more information about an individual than calls at 3 pm, when the office is crowded. Similar attacks can be performed by using other mobility data from geotagging used by social media platforms, smartphone apps, and others.
This means that even when you completely remove addresses, account numbers, and other PII, it is still very easy to reidentify people in such a dataset. Almost all re-identification attacks make use of this.
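To see why a handful of points is enough, here is a toy sketch in Python. The traces, cell names, and the helper `matching_users` are all hypothetical; each trace is a set of (antenna, hour) points with no name attached. Every extra point the attacker knows shrinks the set of candidate traces.

```python
def matching_users(traces, known_points):
    """Return the pseudonyms whose trace contains every point the attacker knows."""
    known = set(known_points)
    return [user for user, points in traces.items() if known <= points]


# Hypothetical pseudonymised mobility traces: (antenna, hour of day).
traces = {
    "user_a": {("cell_1", 8), ("cell_2", 9), ("cell_5", 18), ("cell_7", 2)},
    "user_b": {("cell_1", 8), ("cell_2", 9), ("cell_6", 18), ("cell_3", 23)},
    "user_c": {("cell_1", 8), ("cell_4", 12), ("cell_5", 18), ("cell_3", 23)},
}

one_point = matching_users(traces, [("cell_1", 8)])
two_points = matching_users(traces, [("cell_1", 8), ("cell_5", 18)])
three_points = matching_users(traces, [("cell_1", 8), ("cell_5", 18), ("cell_7", 2)])

print(len(one_point), len(two_points), len(three_points))
```

Note that the rare point (a 2 am observation) does most of the work here: on its own it already singles out one trace, which is the intuition behind the non-uniform sampling mentioned above.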
However, sensitive information can be compromised even if the identifiers are not unique. It is well known that 87% of Americans can be uniquely identified just from their gender, date of birth, and ZIP code [3]. To prevent such attacks, a commonly used method is to group and coarsen such quasi-identifiers (attributes that do not identify a person on their own but may do so in combination), e.g. by reporting only age brackets or only the first three digits of ZIP codes. This is done in such a way as to guarantee k-anonymity: for any record, there are at least k-1 other records with the same values of the quasi-identifiers. It is a very common and natural way of trying to ensure privacy. Unfortunately, it can often fail to protect sensitive information too. A straightforward example of that is the so-called homogeneity attack.
Consider a dataset of medical conditions (clearly very sensitive information) in which age, ZIP code, and other identifiers have been coarsened so as to ensure k-anonymity. It may still be possible to recover the sensitive information [4]: all k individuals sharing a given set of quasi-identifiers may simply have the same medical condition. Hence, if a neighbour knows your age, ZIP code, and gender, and you fall into a group where the other k-1 individuals have the same condition as you, the neighbour learns your condition without ever singling out your record. In general, this situation arises whenever the sensitive information within a group is not very diverse. Scarcity of data severely impacts k-anonymity, and the effect becomes even more pronounced for high-dimensional data with a large number of quasi-identifiers, where even ensuring k-anonymity becomes harder [5].
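The following Python sketch shows both ideas on a made-up table: it computes the k for which the table is k-anonymous and then lists the groups that are vulnerable to a homogeneity attack. The records, column names, and helper functions are illustrative assumptions, not taken from any of the cited papers.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_anonymity(groups):
    """The table is k-anonymous for k equal to the smallest group size."""
    return min(len(group) for group in groups.values())


def homogeneous_groups(groups, sensitive):
    """Groups in which every record has the same sensitive value."""
    return {
        key: group[0][sensitive]
        for key, group in groups.items()
        if len({record[sensitive] for record in group}) == 1
    }


# Hypothetical, already coarsened medical records.
records = [
    {"age": "30-39", "zip": "021**", "sex": "F", "condition": "flu"},
    {"age": "30-39", "zip": "021**", "sex": "F", "condition": "asthma"},
    {"age": "30-39", "zip": "021**", "sex": "F", "condition": "flu"},
    {"age": "40-49", "zip": "021**", "sex": "M", "condition": "diabetes"},
    {"age": "40-49", "zip": "021**", "sex": "M", "condition": "diabetes"},
    {"age": "40-49", "zip": "021**", "sex": "M", "condition": "diabetes"},
]

groups = group_by_quasi_identifiers(records, ["age", "zip", "sex"])
k = k_anonymity(groups)
leaks = homogeneous_groups(groups, "condition")
print(k, leaks)
```

The table is 3-anonymous, yet anyone who knows that a man in his forties from that area is in the data learns his condition. Requiring several distinct sensitive values per group is the idea behind l-diversity [4].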
The lesson from this is that inference attacks are often successful even when very few and coarse-grained data points are revealed.
Linkage attacks - connecting information from different sources
The information disclosed by one dataset might not be all the information publicly available about an individual. This may seem obvious, but it enables far from trivial attacks: joining such a dataset with another one, or with some background information, can allow for very successful inference attacks. Such background information might not even be sensitive. For instance, knowing that a particular medical condition is much more prevalent in a given age group or sex can increase the probability of identifying the medical conditions of individuals in our previous example.
Exploiting side information about individuals can lead to spectacular attacks. Arguably, one of the most famous is the one performed by Latanya Sweeney in 1997. A couple of years before that, the Massachusetts Group Insurance Commission (GIC) had shared with researchers, and sold to industry, medical data that included performed medical procedures, prescribed medications, and ethnicity, but also people's gender, date of birth, and ZIP code. Governor Bill Weld assured the public that the data had been fully anonymised. Sweeney paid $20 for the Cambridge, Massachusetts voter registration list, which also contained these 3 characteristics. By cross-referencing the two databases, she identified Weld's entry in the GIC data and thus his medical records.
Another example comes from journalist Svea Eckert and data scientist Andreas Dewes. They set up a fake AI start-up, pretended to need data for training their ML models, and obtained, free of charge from a data broker, a database of the browsing history of 3m German users, containing 9bn URLs with associated timestamps. Even though no other identifiers were available, they still managed to re-identify the browsing history of politicians, judges, and even their work colleagues. One way they achieved this was by noticing that a Twitter user who visits Twitter's analytics page leaves a trace of his or her username in the corresponding URL. Hence, by going to the corresponding Twitter profiles, Eckert and Dewes could identify such individuals. Interestingly, they also found out about a police force’s undercover operation. The information about it was in Google Translate URLs, which contain the whole text one inputs to the translator.
Even what might seem like fairly insensitive data can tell a lot about us. Netflix learned this the hard way when it shared a database of movie ratings made by its users for the Netflix Prize competition. All the PII had been stripped from the data, but as you probably guess by now, it was still possible to identify some of the users. This was done by researchers from the University of Texas, who linked Netflix’s dataset to IMDb [6]. In this way information about people’s political preferences and even their sexual orientation was compromised.
The main takeaway from this part is that linking information from different data sources can lead to severe privacy leakages.
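The cross-referencing step in this kind of attack is essentially a database join on the shared attributes. Below is a minimal Python sketch with entirely fictitious records; the field names and the helper `link` are assumptions for illustration and do not reproduce the actual datasets.

```python
from collections import defaultdict


def link(medical, voters, keys=("sex", "dob", "zip")):
    """Attach names to 'anonymised' medical records via shared attributes."""
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter["name"])

    matches = []
    for record in medical:
        names = index.get(tuple(record[k] for k in keys), [])
        # Only an unambiguous match counts as a re-identification.
        if len(names) == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


# Fictitious "anonymised" medical data: names removed, attributes kept.
medical = [
    {"sex": "M", "dob": "1950-03-14", "zip": "02138", "diagnosis": "condition X"},
    {"sex": "F", "dob": "1972-11-02", "zip": "02139", "diagnosis": "condition Y"},
]

# Fictitious public voter list with the same three attributes.
voters = [
    {"name": "Alice Example", "sex": "F", "dob": "1972-11-02", "zip": "02139"},
    {"name": "Bob Example", "sex": "M", "dob": "1950-03-14", "zip": "02138"},
    {"name": "Carol Example", "sex": "F", "dob": "1980-06-30", "zip": "02139"},
]

reidentified = link(medical, voters)
print(reidentified)
```

Neither table is dangerous on its own: one has no names and the other has no medical data. The leak only appears once they are joined.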
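Linking ratings data is fuzzier than an exact join, because public and private ratings rarely agree perfectly. The Python sketch below is a heavily simplified illustration of that idea and not the scoring algorithm from [6]: it counts movies rated similarly in both sources and accepts the best candidate only if it is clearly ahead of the runner-up. All data, names, and thresholds are hypothetical.

```python
def similarity(public, anonymous, tolerance=1):
    """Count movies rated in both profiles with ratings within `tolerance` stars."""
    return sum(
        1
        for movie, rating in public.items()
        if movie in anonymous and abs(anonymous[movie] - rating) <= tolerance
    )


def best_match(public, anonymised, min_gap=2):
    """Return the best-matching pseudonym, or None if the match is ambiguous."""
    scores = sorted(
        ((similarity(public, ratings), anon_id) for anon_id, ratings in anonymised.items()),
        reverse=True,
    )
    (best, best_id), (second, _) = scores[0], scores[1]
    return best_id if best - second >= min_gap else None


# Hypothetical anonymised ratings, keyed by pseudonym.
anonymised = {
    "u1": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "u2": {"Movie A": 4, "Movie E": 3, "Movie F": 5},
    "u3": {"Movie B": 5, "Movie C": 1, "Movie G": 4},
}

# Hypothetical ratings that a named person posted publicly.
public_profile = {"Movie A": 5, "Movie B": 3, "Movie D": 1}

match = best_match(public_profile, anonymised)
print(match)
```

Once the pseudonym is linked to a name, every other rating in that anonymised record, including ones the person never made public, is attributed to them.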
Attacks on ML models
All the examples so far were concerned with attacks based on some publicly released data. However, one does not need direct access to such data to learn sensitive information about individuals. Another example comes from attacks on machine learning models. It has been shown that one can learn about statistical properties of training datasets simply from the parameters of trained machine learning models. Not only that, it is also possible to perform attacks given only black-box access to a model, by using it to run predictions on input data. Researchers from Cornell Tech have shown that even models trained on the MLaaS offerings of Google and Amazon can be open to membership inference attacks [7]. In this scenario, an attacker can tell whether a given record was part of the training dataset.
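The intuition behind membership inference is that models tend to be more confident on records they were trained on. The Python sketch below illustrates this with a deliberately overfit toy model and a simple confidence threshold. It is not the shadow-model attack from the cited paper, and the class `MemorisingModel`, the data, and the threshold are all assumptions chosen to make the effect obvious.

```python
import math


class MemorisingModel:
    """A deliberately overfit toy classifier that memorises its training set."""

    def __init__(self, training_records):
        self._records = list(training_records)

    def confidence(self, features, label):
        """Black-box query: confidence between 0 and 1 that `features` has `label`."""
        distances = [math.dist(features, x) for x, y in self._records if y == label]
        if not distances:
            return 0.0
        # Confidence is 1.0 exactly on memorised points and decays with distance.
        return 1.0 / (1.0 + min(distances))


def infer_membership(model, features, label, threshold=0.9):
    """Guess that a record was in the training set if the model is very confident."""
    return model.confidence(features, label) >= threshold


# Hypothetical training records: (features, label).
training = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 1.0), 1)]
# Hypothetical records that were never seen during training.
outsiders = [((0.5, 0.5), 0), ((2.0, 2.0), 1)]

model = MemorisingModel(training)
member_guesses = [infer_membership(model, x, y) for x, y in training]
outsider_guesses = [infer_membership(model, x, y) for x, y in outsiders]
print(member_guesses, outsider_guesses)
```

Real models do not memorise this cleanly, so real attacks are statistical rather than exact, but the more a model overfits, the closer it gets to this toy case.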
How to handle this?
In the current data economy, a vast amount of information is shared between companies, organisations, and individuals. Banning this is probably infeasible and counterproductive in the long term. We believe that privacy-enhancing technologies (PETs) need to be employed in order to tackle these privacy challenges. Multi-party computation can allow data to stay encrypted during computation.
Secure enclaves can ensure that data is processed only according to a pre-agreed specification. Differential privacy can be employed in training ML models, building synthetic data, and sharing aggregates with privacy guarantees. We will be writing more about all these different PETs.
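As a taste of how differential privacy protects shared aggregates, here is a minimal Python sketch of the Laplace mechanism applied to a count. The data, the function names, and the choice of epsilon are illustrative assumptions; a production system should rely on a vetted differential privacy library rather than hand-rolled noise.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    true_count = sum(1 for value in values if predicate(value))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38, 61, 33]  # hypothetical data

noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)
```

Because the noise is calibrated to the largest effect any single person can have on the count, releasing the count before and after someone joins no longer pins down that person's record, unlike the average-salary example at the start of this post. A smaller epsilon means more noise and stronger privacy.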
In the meantime, if you have encountered any such privacy challenges and wish to run PETs in your environment, give us a shout!
References:
[1] De Montjoye, Yves-Alexandre, Laura Radaelli, and Vivek Kumar Singh. "Unique in the shopping mall: On the reidentifiability of credit card metadata." Science 347.6221 (2015): 536-539.
[2] De Montjoye, Yves-Alexandre, et al. "Unique in the crowd: The privacy bounds of human mobility." Scientific reports 3.1 (2013): 1-5.
[3] Sweeney, Latanya. "k-anonymity: A model for protecting privacy." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.05 (2002): 557-570.
[4] Machanavajjhala, Ashwin, et al. "l-diversity: Privacy beyond k-anonymity." ACM Transactions on Knowledge Discovery from Data (TKDD) 1.1 (2007): 3-es.
[5] Aggarwal, Charu C. "On k-anonymity and the curse of dimensionality." VLDB. Vol. 5. 2005.
[6] Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." 2008 IEEE Symposium on Security and Privacy (sp 2008). IEEE, 2008.
[7] Shokri, Reza, et al. "Membership inference attacks against machine learning models." 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.