Avinash Dalvi for This is Learning

Posted on • Originally published at internetkatta.com

Removal of Non-ASCII characters using Python

Hello Devs,

I am going to explain how to remove non-ASCII characters from input text or content. Let's first get to know what non-ASCII characters are.

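What are non-ASCII characters?

ASCII covers only the 128 code points from 0 to 127; any character outside that range is a non-ASCII character. You might have faced an issue while copying and pasting text from a document (docx) into an HTML input element or an editor. Sometimes the symbols used in the document are not supported in that particular input area. For example, the double quote used in a docx file is different from the one used in a code editor or input element, see below 👇🏻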

“Example Text”. - in docx file 
"Example Text" - in editor or HTML input element
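To make the difference concrete, here is a small sketch that lists the characters in each version of the text that fall outside ASCII. The helper name `non_ascii_chars` is my own, not part of any library, and `str.isascii()` assumes Python 3.7 or newer.

```python
# The same text as pasted from a docx file (curly quotes) and as typed in an editor.
samples = ["\u201cExample Text\u201d", '"Example Text"']


def non_ascii_chars(text):
    """Return (character, code point) pairs for every character above U+007F."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) > 127]


for text in samples:
    # isascii() is True only when every character is in the range 0-127
    print(text, text.isascii(), non_ascii_chars(text))
```

The docx version reports U+201C and U+201D (the left and right curly quotes), while the editor version reports nothing.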

When you paste docx-formatted text into HTML, these symbols are treated as non-ASCII characters or junk characters. Generally they can be saved into the database, but sometimes while doing some encoding or signature calculation you will face an issue, because the unsupported string throws an error. One real scenario I faced was calculating an AWS signature before passing a request to API Gateway: the signature calculated in my code did not match the signature calculated by AWS, so the request threw an error. In my case the AWS signature calculation removed those characters before calculating the signature, so if your code does not do the same, the two signatures will simply not match.
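The mismatch can be reproduced with a small sketch. This is not the real AWS Signature Version 4 algorithm, only a plain HMAC-SHA256 that stands in for it; the key, the `sign` helper and the `strip_non_ascii` helper are all hypothetical names chosen for illustration.

```python
import hashlib
import hmac


def sign(message, key=b"hypothetical-secret"):
    """HMAC-SHA256 over the UTF-8 bytes of message (illustration only, not AWS SigV4)."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)


payload = "\u201cExample Text\u201d"  # text pasted from a docx file

client_signature = sign(payload)                   # signed exactly as pasted
server_signature = sign(strip_non_ascii(payload))  # signed after the characters are removed

print(client_signature == server_signature)  # False: the two sides signed different bytes
```

Once both sides clean the text the same way before signing, they sign the same bytes and the signatures agree.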

How to solve this issue then?

Below is a Python script to remove those non-ASCII or junk characters.

Prerequisites:

  • Any Python version (3.x recommended)
  • The regular expression operations library (re), which ships with the Python standard library, so no pip install is needed
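import re

ini_string = "'technews One lone dude awaits iPad 2 at Apple\x89Ûªs SXSW store"

# Option 1: keep only letters and digits; every other run of characters
# (punctuation and spaces included) collapses into a single space
res1 = " ".join(re.split("[^A-Za-z0-9]+", ini_string))
print(res1)

# Detect anything outside tab, CR, LF and printable ASCII.
# re.search scans the whole string; re.match would only test the start.
if re.search(r"[^\t\r\n\x20-\x7E]", ini_string):
    print("found")

# Option 2: replace undecodable bytes with a backtick
# (one backtick per UTF-8 byte, so two for each of these characters)
result = ini_string.encode().decode('ascii', 'replace').replace('\ufffd', '`')
print(result)

# Option 3: replace a known junk sequence, then get the UTF-8 bytes
result2 = ini_string.replace("\x89Ûª", "`").encode("utf-8")
print(result2)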
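If you would rather keep readable text than drop or mask characters, one possible variation is to map common word-processor punctuation to its ASCII look-alike first and only then discard what is left. This is a sketch, not the only way to do it: the `REPLACEMENTS` table and the `to_ascii` name are my own assumptions, and the table only covers a handful of characters.

```python
import unicodedata

# Hypothetical mapping of common word-processor punctuation to ASCII look-alikes.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # single curly quotes
    "\u201c": '"', "\u201d": '"',  # double curly quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
}


def to_ascii(text):
    """Map known punctuation to ASCII, decompose accents, then drop what is left."""
    text = text.translate(str.maketrans(REPLACEMENTS))
    # NFKD splits an accented letter into a base letter plus a combining mark
    text = unicodedata.normalize("NFKD", text)
    # "ignore" silently drops every character that still has no ASCII form
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u201cExample Text\u201d"))  # "Example Text"
print(to_ascii("caf\u00e9"))                 # cafe
```

Note that NFKD keeps the base letter of accented characters, so for already garbled input such as the sample string above you may prefer the regex approach, which removes the junk outright.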

