DEV Community

James Briggs


Unicode Normalization for NLP in Python

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

We also find that text like this is incredibly common, particularly on social media.
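To see why this kind of text trips up a model, it helps to look at what the characters actually are. The snippet below is a minimal sketch using Python's standard `unicodedata` module; the sample string and variable names are illustrative assumptions, not taken from the original post.

```python
import unicodedata

# "NLP" written with double-struck letters, spelled out as escapes
# so the example does not depend on the file's encoding.
fancy = "\u2115\U0001D543\u2119"
plain = "NLP"

# The two strings look alike to a reader but share no code points.
print(fancy == plain)  # False

# List each code point and its official Unicode name.
for ch in fancy:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A tokenizer or vocabulary lookup that only knows the plain letters will treat each of these characters as unknown or rare symbols.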

Another pain point comes from diacritics (the small marks on Ç, é, Å) that you'll find in almost every European language.

These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode for two versions of Ç:

Latin capital letter C with cedilla: \u00C7

Latin capital letter C + combining cedilla: \u0043\u0327

The two are different code point sequences, even though they render as the same character.
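The difference is easy to check in Python. This is a small sketch built from the two escape sequences listed above; the variable names are illustrative assumptions.

```python
import unicodedata

composed = "\u00C7"          # single precomposed code point
decomposed = "\u0043\u0327"  # "C" followed by a combining cedilla

# Same glyph on screen, but the strings do not compare equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# The decomposed version really is two separate characters.
print([unicodedata.name(c) for c in decomposed])
```

Any exact string match, dictionary lookup, or vocabulary index will therefore treat the two as unrelated tokens.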

To deal with all of these text variants we need to use Unicode normalization, which we will cover in this video.
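As a rough sketch of what normalization does, the example below uses `unicodedata.normalize` from Python's standard library with the four standard forms (NFD, NFC, NFKD, NFKC). The `normalize_text` helper and the sample strings are hypothetical names introduced here for illustration, and the choice of NFKC as the default is an assumption rather than a recommendation from the post.

```python
import unicodedata

composed = "\u00C7"               # precomposed C with cedilla
decomposed = "\u0043\u0327"       # C + combining cedilla
fancy = "\u2115\U0001D543\u2119"  # "NLP" in double-struck letters

# Canonical forms only reconcile canonically equivalent sequences.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", fancy) == fancy)          # True: font variants untouched

# Compatibility forms also fold font variants into plain letters.
print(unicodedata.normalize("NFKC", fancy))                  # NLP


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply one normalization form to raw input."""
    return unicodedata.normalize(form, text)


# Compare all four forms on a string that mixes both problems.
sample = decomposed + fancy
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    result = normalize_text(sample, form)
    print(form, len(result), ascii(result))
```

The compatibility forms (NFKD, NFKC) are lossy: they discard the distinction between a styled character and its plain counterpart, which is usually what you want for NLP preprocessing but not for text that must round-trip unchanged.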

Top comments (1)

Arvind Padmanabhan

This topic is briefly covered in this article: devopedia.org/text-normalization
In particular, check out the 4 forms: NFD, NFC, NFKD and NFKC
