Falk

Hash Personally Identifiable Information (PII) in your ELT pipelines

What is Personally Identifiable Information?

Personally Identifiable Information (PII) is defined as any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. If you collect, use, or store the PII of people in the European Union, you must comply with the GDPR and should therefore protect your customers' personal data.

So what, I'll just delete it?

While you could of course delete any PII from your data before adding it to your file storage or database, the data users in your organization might have something to say about that. Analysts and data scientists often have to work with multiple data sources and find connections between them in order to generate insights.
For example, they may need to find the same customer across sales platforms such as Amazon, Shopify, or eBay. These platforms don't share a common unique identifier for each user, so an alternative such as an email address, name, or phone number must be used. That is no longer possible once this personal information has been deleted. So let's add a "hashing" step to our ELT pipeline instead.
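
To illustrate why hashing keeps records joinable, here is a minimal sketch that does not depend on any particular package. It assumes SHA-256 from Python's standard hashlib module, and the hash_email helper, the lowercase/trim normalization, and the sample orders are all made up for illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Hypothetical helper: SHA-256 hex digest of a normalized email address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Made-up orders from two platforms that share no common customer ID.
amazon_orders = [{"order_id": "A-1", "email": "Jane@Example.com"}]
shopify_orders = [{"order_id": "S-9", "email": "jane@example.com "}]

# Replace the raw email with its hash before the data is loaded.
amazon_hashed = [
    {"order_id": o["order_id"], "email_hash": hash_email(o["email"])}
    for o in amazon_orders
]
shopify_hashed = [
    {"order_id": o["order_id"], "email_hash": hash_email(o["email"])}
    for o in shopify_orders
]

# Join the two sources on the hash instead of the raw email.
amazon_by_hash = {o["email_hash"]: o["order_id"] for o in amazon_hashed}
matches = [
    (amazon_by_hash[o["email_hash"]], o["order_id"])
    for o in shopify_hashed
    if o["email_hash"] in amazon_by_hash
]
print(matches)  # [('A-1', 'S-9')]
```

The normalization step is a design choice in this sketch: without it, differences in case or stray whitespace would produce different hashes and the join would silently miss matches.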

Introducing ByeByePii

ByeByePii is a Python package for hashing personally identifiable information. It was built with a focus on making data lakes that store JSON files GDPR-compliant. It's a simple package with two features:

  • Analyzing Python dictionaries in order to identify PII
  • Hashing PII in a given Python dictionary

Binary installers for the latest released version are available at the Python Package Index (PyPI):

pip install ByeByePii

Analyzing a JSON file and creating a list of keys to hash

To avoid having to look through all the keys in a Python dictionary manually, we can use the analyzeDict function.

import byebyepii
import json

if __name__ == '__main__':

    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Analyzing the dictionary and creating our hash list
    key_list, subkey_list = byebyepii.analyzeDict(data)

Running the script walks through the keys and asks whether each one should be added to the hash list:

$ python3 analyzeDict.py

Add BuyerInfo - BuyerEmail to hash list? (y/n) y
Add SalesChannel to hash list? (y/n) n
Add OrderStatus to hash list? (y/n) n
Add PurchaseDate to hash list? (y/n) n
Add ShippingAddress - StateOrRegion to hash list? (y/n) y
Add ShippingAddress - PostalCode to hash list? (y/n) y
Add ShippingAddress - City to hash list? (y/n) n
Add ShippingAddress - CountryCode to hash list? (y/n) n
Add LastUpdateDate to hash list? (y/n) n

Keys to hash: ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
Subkeys to hash: ['BuyerEmail', 'StateOrRegion', 'PostalCode']
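
For intuition about what such an analysis involves, the following is a rough, non-interactive sketch of walking a dictionary and listing its key paths. It is not ByeByePii's implementation: iter_key_paths is a hypothetical name, and the sketch assumes at most one level of nesting, as in the sample order data.

```python
from typing import Iterator, Optional, Tuple


def iter_key_paths(record: dict) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (key, subkey) pairs; subkey is None for top-level values.

    Hypothetical helper that only handles one level of nesting.
    """
    for key, value in record.items():
        if isinstance(value, dict):
            for subkey in value:
                yield key, subkey
        else:
            yield key, None


sample = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "SalesChannel": "Website",
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}

# Print each path in the same "Key - Subkey" style as the prompts above.
for key, subkey in iter_key_paths(sample):
    print(key if subkey is None else f"{key} - {subkey}")
```

Keeping each key together with its subkey as a pair makes it unambiguous which nested field a later hashing step should touch.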

Hashing PII in a given JSON

Using the key lists we just created, we can hash the PII in the dictionary with the hashPii function.

import byebyepii
import json

if __name__ == '__main__':

    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Hashing the PII
    keys_to_hash = ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
    subkeys_to_hash = ['BuyerEmail', 'StateOrRegion', 'PostalCode']
    hashed_pii = byebyepii.hashPii(data, keys_to_hash, subkeys_to_hash)

    # Writing the updated JSON file
    with open('hashed_data.json', 'w') as outfile:
        json.dump(hashed_pii, outfile)

Before:

{
  "BuyerInfo": {
    "BuyerEmail": "test@test.com"
  },
  "EarliestShipDate": "2022-01-01T23:59:59Z",
  "SalesChannel": "Website",
  "OrderStatus": "Shipped",
  "PurchaseDate": "2022-01-01T23:59:59Z",
  "ShippingAddress": {
    "StateOrRegion": "West Midlands",
    "PostalCode": "DY9 0TH",
    "City": "STOURBRIDGE",
    "CountryCode": "GB"
  },
  "LastUpdateDate": "2022-01-01T23:59:59Z"
}

After:

{
  "BuyerInfo": {
    "BuyerEmail": "037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33"
  },
  "EarliestShipDate": "2022-01-01T23:59:59Z",
  "SalesChannel": "Website",
  "OrderStatus": "Shipped",
  "PurchaseDate": "2022-01-01T23:59:59Z",
  "ShippingAddress": {
    "StateOrRegion": "08fa57d00de1936ebea7aeaf8e36d04510a5d885cfaa4f169c2b010d36ccaca4",
    "PostalCode": "714f02c01e20988ee273776dc218f44326c2f5839618b0c117413b0cc7d91701",
    "City": "STOURBRIDGE",
    "CountryCode": "GB"
  },
  "LastUpdateDate": "2022-01-01T23:59:59Z"
}

Because the string test@test.com will always be hashed to 037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33, it is still perfectly usable as an identifier across data sources.
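
The determinism this relies on can be checked with a small standalone sketch. It assumes SHA-256 from Python's hashlib, which yields 64-character hex digests; the article does not say which algorithm or settings ByeByePii uses, so the digests produced here may differ from the ones shown. The helper names sha256_hex and hash_fields are hypothetical.

```python
import copy
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string (assumed algorithm)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_fields(record: dict, paths: list) -> dict:
    """Return a copy of record with the value at each (key, subkey) path hashed."""
    result = copy.deepcopy(record)  # leave the caller's data untouched
    for key, subkey in paths:
        nested = result.get(key)
        if isinstance(nested, dict) and subkey in nested:
            nested[subkey] = sha256_hex(str(nested[subkey]))
    return result


order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}
paths = [("BuyerInfo", "BuyerEmail"), ("ShippingAddress", "PostalCode")]

first = hash_fields(order, paths)
second = hash_fields(order, paths)

print(first == second)                   # True: same input, same digests
print(first["ShippingAddress"]["City"])  # STOURBRIDGE (not in paths, unchanged)
```

Because the function works on a deep copy, the original record stays available to the rest of the pipeline until it is explicitly discarded.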
