TL;DR: This post describes what homoglyph attacks are and how to prevent them with Cognitive Services.
Code for this story can be found on GitHub.
One-click deployment instructions for Azure can be found below.
In orthography and typography, a homoglyph is one of two or more characters with shapes that appear identical or very similar. In layman's terms, a homoglyph is any character that looks similar to another character, such as the S and $ in the image above.
Language models are often vulnerable to obfuscation attacks using homoglyphs because of the way they encode text. In Unicode and ASCII, for example, characters that look alike to a human reader have different character codes, so a model will struggle to learn their similarities.
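To make the encoding point concrete, here is a small Python sketch using only the standard library. It prints the code point and Unicode name for two look-alike pairs: the S and $ from this post, and a Latin "a" next to a Cyrillic "а". The helper name `describe` is just an illustration and is not part of the service described here.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Look-alike pairs: Latin S vs dollar sign, Latin a vs Cyrillic a.
pairs = [("S", "$"), ("a", "\u0430")]
for left, right in pairs:
    print(describe(left), "|", describe(right))

# The two "a" characters render alike but are different strings to a program.
print("a" == "\u0430")
```

A model that works on these codes, or on tokens derived from them, sees two unrelated symbols wherever a human sees one letter.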
To drive this point home, let’s take a look at the phrase below:
I got a $hitty result from my awesome cloud sentiment analysis model.
The phrase above clearly demonstrates negative sentiment. The word $hitty is a homoglyphic obfuscation of the profane word shitty.
Let’s see how four of the most popular cloud sentiment analysis services handle this attack.
As we can see, Azure Text Analytics and GCP Natural Language correctly classify the original sentiment, but both fail on the obfuscated text. IBM Watson fails to correctly classify the sentiment of either text. AWS Comprehend does not provide a demo without an AWS subscription, and it also fails on the example sentence.
While the $ and S example above may seem arbitrary, the Latin and Cyrillic letters below, taken from Wikipedia, demonstrate how such an attack can be both effective and hard to detect.
This creates problems wherever such attacks can exploit or harm an application, for example when bots use homoglyphs to hide from fake news detectors.
While talking with my friend Amit Moryossef from the BIU NLP lab, we realized we might be able to prevent homoglyph attacks using OCR systems.
I tested this theory on the sentence above using the Azure Computer Vision service, and it correctly used the image domain context to extract the word Shitty from the homoglyph $hitty.
Using this capability, I’ve written the following open-source container service that will:
- Take a given text as input
- Convert the text to an image
- Process the image using OCR
- Return the correct text with homoglyphs removed
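The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the code of the actual service: the function names are hypothetical, the rendering uses Pillow with its default bitmap font (which only covers basic Latin characters), and the OCR step is passed in as a plain callable so that any OCR backend, such as a wrapper around a cloud vision API, can be plugged in.

```python
import io
from typing import Callable


def render_text_to_png(text: str) -> bytes:
    """Draw the text in black on a white canvas and return PNG bytes."""
    # Pillow is imported lazily so the rest of the sketch loads without it.
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.load_default()
    # Rough canvas size; a real service would measure the rendered text.
    image = Image.new("RGB", (10 * len(text) + 20, 40), "white")
    ImageDraw.Draw(image).text((10, 10), text, fill="black", font=font)
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def remove_homoglyphs(
    text: str,
    ocr: Callable[[bytes], str],
    render: Callable[[str], bytes] = render_text_to_png,
) -> str:
    """Text in, image, OCR, text out.

    `ocr` is a hypothetical callable that takes image bytes and returns
    the recognized text; it stands in for whichever OCR system is used.
    """
    image_bytes = render(text)          # convert the text to an image
    recognized = ocr(image_bytes)       # process the image using OCR
    return recognized.strip()           # return the cleaned-up text
```

Keeping the OCR backend behind a callable mirrors the cloud-agnostic idea: the pipeline itself does not care which provider reads the image.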
The Docker service is cloud-agnostic. I provide a one-click deployment option to Azure for convenience.
If you have an existing Azure subscription, you can get started by clicking the button below to auto-deploy the service.
Otherwise, you can get a free Azure account here and then click the deploy button above.
If you have any questions, comments, or topics you would like me to discuss, feel free to follow me on Twitter. Thanks again to Amit Moryossef and the BIU NLP lab for the amazing inspiration, and to Iddan Sachar for his help debugging ARM for one-click deployment.
To use the service, just send it a URL-encoded query string of up to 200 characters, which makes it a good fit for validating tweets. Below is an example call using curl; be sure to use your own service endpoint.
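For readers calling the service from code rather than curl, here is a minimal Python sketch of building such a request URL with the standard library. The endpoint `https://example.invalid/api` and the query parameter name `q` are placeholders I made up for illustration; check your own deployment for the real values. I also assume the 200-character limit applies to the text before encoding.

```python
from urllib.parse import urlencode

MAX_QUERY_CHARS = 200  # limit mentioned in this post


def build_request_url(endpoint: str, text: str, param: str = "q") -> str:
    """Return the endpoint with the text attached as a URL-encoded query.

    `param` is a hypothetical parameter name, not the service's documented one.
    """
    if len(text) > MAX_QUERY_CHARS:
        raise ValueError(f"text must be at most {MAX_QUERY_CHARS} characters")
    # urlencode percent-encodes characters such as "$" and turns spaces into "+".
    return f"{endpoint}?{urlencode({param: text})}"


url = build_request_url(
    "https://example.invalid/api",  # placeholder endpoint
    "I got a $hitty result",
)
print(url)
```

The resulting URL can then be fetched with any HTTP client.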
While the service works very well at removing homoglyphs, there are still a few cases where it fails.
Future work will explore a more custom approach to solving this problem, but the current approach works well for minimal effort.
- Image Processing with the Computer Vision API | Microsoft Azure
- App Service - Web App for Containers | Microsoft Azure
- Text Analytics API | Microsoft Azure
Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.