TL;DR: This post outlines both first-party and open-source techniques for detecting PII with Azure.
What is PII?
Personally identifiable information (PII) is any data that can be used to identify an individual, such as names, driver’s license numbers, SSNs, bank account numbers, passport numbers, email addresses and more. Many regulations, from GDPR to HIPAA, require strict protection of user privacy.
If you are new to Azure, you can get started with a free subscription using the link below.
Create your Azure free account today | Microsoft Azure
Detecting PII With Azure Cognitive Search (Preview)
Azure Cognitive Search is a cloud solution that provides developers with APIs and tools for adding a rich search experience to their data, content and applications. With Cognitive Search, you can add cognitive skills that apply AI processes during indexing. Doing so can add new information and structures that are useful for search and other scenarios.
The Azure PII Detection skill (currently in preview) extracts personally identifiable information from an input text and gives you the option to mask it in that text in various ways. This skill uses the machine learning models provided by Text Analytics in Cognitive Services.
PII Detection cognitive skill (preview) - Azure Cognitive Search
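To make the skill description above more concrete, here is a minimal sketch of what a skillset entry for the PII Detection skill might look like, built as a Python dictionary and serialized to JSON. The property names (`minimumPrecision`, `maskingMode`, `maskingCharacter`, the `piiEntities` and `maskedText` outputs) reflect my reading of the skill's documentation and may change while the skill is in preview. The skillset name, target names, source path and the helper `build_pii_skillset` are hypothetical and chosen only for illustration.

```python
import json


def build_pii_skillset(name="pii-demo-skillset"):
    """Return a skillset definition containing one PII Detection skill.

    Assumption: the property names below follow the preview documentation
    for the skill; verify them against the current reference before use.
    """
    pii_skill = {
        "@odata.type": "#Microsoft.Skills.Text.PIIDetectionSkill",
        "context": "/document",
        "defaultLanguageCode": "en",
        # Only keep entities detected with at least this confidence.
        "minimumPrecision": 0.5,
        # Ask the skill to mask detected PII in the returned text.
        "maskingMode": "replace",
        "maskingCharacter": "*",
        "inputs": [
            # Hypothetical source path: the document's extracted content.
            {"name": "text", "source": "/document/content"}
        ],
        "outputs": [
            # Hypothetical target names for the enriched fields.
            {"name": "piiEntities", "targetName": "pii_entities"},
            {"name": "maskedText", "targetName": "masked_text"},
        ],
    }
    return {
        "name": name,
        "description": "Detect and mask PII during indexing",
        "skills": [pii_skill],
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the skillset.
    print(json.dumps(build_pii_skillset(), indent=2))
```

The resulting JSON is the kind of body you would submit when creating a skillset. The masked text and the list of detected entities then become available as enriched fields that can be mapped into your index.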
Detecting PII With Microsoft Presidio
In addition to the first-party Cognitive Search skill, Microsoft also provides an open-source PII detection tool for Azure called Presidio, which was developed by the Microsoft Commercial Software Engineering team in Israel.
Why use Presidio?
Presidio is open-source, transparent, and scalable. It allows developers and data scientists to customize or add new PII recognizers via API or code to best fit their anonymization needs. Presidio leverages Docker and Kubernetes for workloads at scale.
Presidio automatically detects personally identifiable information (PII) in unstructured text, anonymizes it based on one or more anonymization mechanisms, and returns a string with no personally identifiable data. For example:
For each PII entity, Presidio returns a confidence score:
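The detect, score and anonymize flow described above can be sketched with the Python standard library alone. This is not Presidio's implementation or API. The regex patterns, the fixed confidence values, and the names `Finding`, `detect` and `anonymize` are all assumptions made for illustration. Real recognizers typically combine patterns, checksums, context words and NER models.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns with hand-picked confidence scores (illustration only).
PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.85),
}


@dataclass
class Finding:
    entity_type: str  # kind of PII that was matched
    start: int        # start offset in the input text
    end: int          # end offset (exclusive)
    score: float      # confidence assigned to this match


def detect(text):
    """Return all pattern matches in the text, ordered by position."""
    findings = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                Finding(entity_type, match.start(), match.end(), score)
            )
    return sorted(findings, key=lambda f: f.start)


def anonymize(text, findings, min_score=0.5):
    """Replace each sufficiently confident finding with an <ENTITY_TYPE> tag."""
    # Work from the end of the string so earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        if f.score >= min_score:
            text = text[:f.start] + f"<{f.entity_type}>" + text[f.end:]
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com, SSN 123-45-6789."
    results = detect(sample)
    for finding in results:
        print(finding)
    print(anonymize(sample, results))
    # → Contact <EMAIL_ADDRESS>, SSN <US_SSN>.
```

Keeping detection and anonymization as separate steps mirrors the description above: the findings, each with its own score, can be inspected or filtered by a threshold before any text is replaced.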
Text anonymization in images (beta)
Presidio uses OCR to detect text in images. It further allows the redaction of the text from the original image.
Check out the public demo at the link below to try it out on your own data.
- Navigate into \deployment from the command line.
- If you have Helm installed but haven’t run helm init, execute deploy-helm.sh in the command line. It will install Tiller (the Helm server side) on your cluster and grant it sufficient permissions.
- Grant the Kubernetes cluster access to the container registry: follow these instructions to grant the AKS cluster access to the ACR.
- If you already have Helm and Tiller configured, or if you installed them in the previous step, execute deploy-presidio.sh in the command line as follows:
More information can be found on the GitHub repo, and near-one-click deployment options for Azure are coming soon!
Additional deployment options can be found here.
In this post you learned about two of my favorite options for detecting PII in your data with Azure. If you are interested in Azure and AI, be sure to check out my other posts and the Azure Medium blog.
About the Author
Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.