DEV Community

James Moberg

Convert Unicode Symbols & Punctuation to ASCII using ColdFusion/Java

symbolsToASCII is a ColdFusion UDF (user-defined function) that converts Unicode symbols and punctuation to 7-bit ASCII. I was previously using ConvertSpecialChars from CFLib, but it didn't map enough characters.

I found documentation on the NIH Lexical Systems Group website describing their approach to "Map Symbols & Punctuation to ASCII". They state that "converting Unicode punctuation and symbols to ASCII punctuation and symbols is imperative in NLP for preserving the original documents." Their Java implementation is simply "perform mapping if the character is in the punctuation & symbols mapping table".
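The table-lookup idea quoted above can be sketched in plain Java. This is only an illustration under assumptions: the class name `SymbolTableMapper`, the method `mapSymbols`, and the handful of table entries are hypothetical and are not taken from the NIH code, whose full table is much larger.

```java
import java.util.HashMap;
import java.util.Map;

class SymbolTableMapper {
    // Illustrative subset only; the real mapping table has many more rows.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', "'");   // LEFT SINGLE QUOTATION MARK
        TABLE.put('\u2019', "'");   // RIGHT SINGLE QUOTATION MARK
        TABLE.put('\u201C', "\"");  // LEFT DOUBLE QUOTATION MARK
        TABLE.put('\u201D', "\"");  // RIGHT DOUBLE QUOTATION MARK
        TABLE.put('\u2013', "-");   // EN DASH
        TABLE.put('\u2264', "<=");  // LESS-THAN OR EQUAL TO
    }

    // Perform the mapping only if the character is in the table;
    // every other character is copied through unchanged.
    static String mapSymbols(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: "A <= B" - it's fine
        System.out.println(mapSymbols("\u201CA \u2264 B\u201D \u2013 it\u2019s fine"));
    }
}
```

A per-character lookup like this makes a single pass over the input, whereas the UDF in this post takes the shorter route of chaining regex replacements.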

Their approach makes a lot of sense. When I search using a SQL query or a Verity collection, the HTML5 input field doesn't auto-corrupt "dumb quotes" to “smart quotes” like Microsoft Word does, so a typed search term won't match stored content that contains smart quotes. If stored content also has characters that are HTML-encoded, wouldn't extra logic be required to account for potential substitutions containing high ASCII characters as well as ‘, ’, “ and ”?
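To make the search problem concrete, here is a small Java sketch that folds curly quotes on both the stored text and the search term before comparing. The class and method names (`QuoteInsensitiveSearch`, `foldQuotes`, `matches`) are hypothetical, and it handles only four quote characters rather than the full mapping. It also does not decode HTML entities, which would need a separate step.

```java
class QuoteInsensitiveSearch {
    // Fold curly single and double quotes to their ASCII forms.
    static String foldQuotes(String s) {
        return s.replaceAll("[\u2018\u2019]", "'")
                .replaceAll("[\u201C\u201D]", "\"");
    }

    // Compare after folding both sides, so either quote style matches.
    static boolean matches(String storedContent, String searchTerm) {
        return foldQuotes(storedContent).contains(foldQuotes(searchTerm));
    }

    public static void main(String[] args) {
        String stored = "I don\u2019t know what you mean by \u2018glory\u2019";
        System.out.println(matches(stored, "don't know"));  // true
        System.out.println(stored.contains("don't know"));  // false
    }
}
```

In practice the stored side would usually be normalized once at write time or in an indexed column, so that only the search term needs folding per query.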

Usage: symbolsToASCII(required string inputString)

<cfset testString = '#CHR(8220)#I don#CHR(8217)#t know what you mean by #CHR(8216)#glory,#CHR(8217)# #CHR(8221)# Alice said.'>

<cfoutput>
<textarea style="width:95%; height:300px;">
Original: #TestString#

symbolsToASCII: #symbolsToASCII(testString)#
</textarea>
</cfoutput>

Try it online at TryCF.com

https://trycf.com/gist/6f35220d47caa7fdbf75eb884ff1cec7

Source code

<cfscript>
/* 20200604 Map Symbols & Punctuation to ASCII
Converting Unicode punctuation and symbols to ASCII punctuation and symbols is imperative in natural language processing (NLP) for preserving the original documents.
Based on mapping from Lexical Systems Group: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/designDoc/UDF/unicode/NormOperations/mapSymbolToAscii.html
Blog: https://dev.to/gamesover/convert-symbols-punctuation-to-ascii-using-coldfusion-java-3l6a
TryCF: https://trycf.com/gist/6f35220d47caa7fdbf75eb884ff1cec7 */
string function symbolsToASCII(required string inputString){
var TempContent = javacast("string", arguments.inputString);
TempContent = TempContent.replaceAll("[\u00B4\u02B9\u02BC\u02C8\u0301\u2018\u2019\u201B\u2032\u2034\u2037]", chr(39)); /* apostrophe (') */
TempContent = TempContent.replaceAll("[\u00AB\u00BB\u02BA\u030B\u030E\u201C\u201D\u201E\u201F\u2033\u2036\u3003\u301D\u301E]", chr(34)); /* quotation mark (") */
TempContent = TempContent.replaceAll("[\u00AD\u2010\u2011\u2012\u2013\u2014\u2212\u2015]", chr(45)); /* hyphen (-) */
TempContent = TempContent.replaceAll("[\u01C3\u2762]", chr(33)); /* exclamation mark (!) */
TempContent = TempContent.replaceAll("[\u266F]", chr(35)); /* music sharp sign (#) */
TempContent = TempContent.replaceAll("[\u066A\u2052]", chr(37)); /* percent sign (%) */
TempContent = TempContent.replaceAll("[\u066D\u204E\u2217\u2731\u00D7]", chr(42)); /* asterisk (*) */
TempContent = TempContent.replaceAll("[\u201A\uFE51\uFF64\u3001]", chr(44)); /* comma (,) */
TempContent = TempContent.replaceAll("[\u00F7\u0338\u2044\u2215]", chr(47)); /* slash (/) */
TempContent = TempContent.replaceAll("[\u0589\u05C3\u2236]", chr(58)); /* colon (:) */
TempContent = TempContent.replaceAll("[\u203D]", chr(63)); /* question mark (?) */
TempContent = TempContent.replaceAll("[\u27E6]", chr(91)); /* opening square bracket ([) */
TempContent = TempContent.replaceAll("[\u20E5\u2216]", chr(92)); /* backslash (\) */
TempContent = TempContent.replaceAll("[\u301B]", chr(93)); /* closing square bracket (]) */
TempContent = TempContent.replaceAll("[\u02C4\u02C6\u0302\u2038\u2303]", chr(94)); /* caret (^) */
TempContent = TempContent.replaceAll("[\u02CD\u0331\u0332\u2017]", chr(95)); /* underscore (_) */
TempContent = TempContent.replaceAll("[\u02CB\u0300\u2035]", chr(96)); /* grave accent (`) */
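TempContent = TempContent.replaceAll("[\u2983]", chr(123)); /* opening curly bracket ({) */
TempContent = TempContent.replaceAll("[\u2984]", chr(125)); /* closing curly bracket (}), listed in the mapping table */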
TempContent = TempContent.replaceAll("[\u01C0\u05C0\u2223\u2758]", chr(124)); /* vertical bar / pipe (|) */
TempContent = TempContent.replaceAll("[\u2016]", "#chr(124)##chr(124)#"); /* double vertical bar / double pipe (||) */
TempContent = TempContent.replaceAll("[\u02DC\u0303\u2053\u223C\u301C]", chr(126)); /* tilde (~) */
TempContent = TempContent.replaceAll("[\u2039\u2329\u27E8\u3008]", chr(60)); /* less-than sign (<) */
TempContent = TempContent.replaceAll("[\u2264\u2266]", "#chr(60)##chr(61)#"); /* less-than equal-to sign (<=) */
TempContent = TempContent.replaceAll("[\u203A\u232A\u27E9\u3009]", chr(62)); /* greater-than sign (>) */
TempContent = TempContent.replaceAll("[\u2265\u2267]", "#chr(62)##chr(61)#"); /* greater-than equal-to sign (>=) */
TempContent = TempContent.replaceAll("[\u200B\u2060\uFEFF]", chr(32)); /* space ( ) */
TempContent = TempContent.replaceAll("\u2153", "1/3");
TempContent = TempContent.replaceAll("\u2154", "2/3");
TempContent = TempContent.replaceAll("\u2155", "1/5");
TempContent = TempContent.replaceAll("\u2156", "2/5");
TempContent = TempContent.replaceAll("\u2157", "3/5");
TempContent = TempContent.replaceAll("\u2158", "4/5");
TempContent = TempContent.replaceAll("\u2159", "1/6");
TempContent = TempContent.replaceAll("\u215A", "5/6");
TempContent = TempContent.replaceAll("\u215B", "1/8");
TempContent = TempContent.replaceAll("\u215C", "3/8");
TempContent = TempContent.replaceAll("\u215D", "5/8");
TempContent = TempContent.replaceAll("\u215E", "7/8");
TempContent = TempContent.replaceAll("\u2026", "..."); /* horizontal ellipsis (...); the replacement string needs no escaping */
return TempContent;
}
</cfscript>
<cfset testString = '#CHR(8220)#I don#CHR(8217)#t know what you mean by #CHR(8216)#glory,#CHR(8217)# #CHR(8221)# Alice said.'>
<cfoutput>
<textarea style="width:95%; height:300px;">
Original: #TestString#
symbolsToASCII: #symbolsToASCII(testString)#
</textarea>
</cfoutput>

Symbols and Punctuation Default Mapping Table (credit)

Unicode Mapped String Char Unicode Name
\u00AB " « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
\u00AD - ­ SOFT HYPHEN
\u00B4 ' ´ ACUTE ACCENT
\u00BB " » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\u00F7 / ÷ DIVISION SIGN
\u01C0 | ǀ LATIN LETTER DENTAL CLICK
\u01C3 ! ǃ LATIN LETTER RETROFLEX CLICK
\u02B9 ' ʹ MODIFIER LETTER PRIME
\u02BA " ʺ MODIFIER LETTER DOUBLE PRIME
\u02BC ' ʼ MODIFIER LETTER APOSTROPHE
\u02C4 ^ ˄ MODIFIER LETTER UP ARROWHEAD
\u02C6 ^ ˆ MODIFIER LETTER CIRCUMFLEX ACCENT
\u02C8 ' ˈ MODIFIER LETTER VERTICAL LINE
\u02CB ` ˋ MODIFIER LETTER GRAVE ACCENT
\u02CD _ ˍ MODIFIER LETTER LOW MACRON
\u02DC ~ ˜ SMALL TILDE
\u0300 ` ̀ COMBINING GRAVE ACCENT
\u0301 ' ́ COMBINING ACUTE ACCENT
\u0302 ^ ̂ COMBINING CIRCUMFLEX ACCENT
\u0303 ~ ̃ COMBINING TILDE
\u030B " ̋ COMBINING DOUBLE ACUTE ACCENT
\u030E " ̎ COMBINING DOUBLE VERTICAL LINE ABOVE
\u0331 _ ̱ COMBINING MACRON BELOW
\u0332 _ ̲ COMBINING LOW LINE
\u0338 / ̸ COMBINING LONG SOLIDUS OVERLAY
\u0589 : ։ ARMENIAN FULL STOP
\u05C0 | ׀ HEBREW PUNCTUATION PASEQ
\u05C3 : ׃ HEBREW PUNCTUATION SOF PASUQ
\u066A % ٪ ARABIC PERCENT SIGN
\u066D * ٭ ARABIC FIVE POINTED STAR
\u200B ZERO WIDTH SPACE
\u2010 - HYPHEN
\u2011 - NON-BREAKING HYPHEN
\u2012 - FIGURE DASH
\u2013 - EN DASH
\u2014 - EM DASH
\u2015 -- HORIZONTAL BAR
\u2016 | DOUBLE VERTICAL LINE
\u2017 _ DOUBLE LOW LINE
\u2018 ' LEFT SINGLE QUOTATION MARK
\u2019 ' RIGHT SINGLE QUOTATION MARK
\u201A , SINGLE LOW-9 QUOTATION MARK
\u201B ' SINGLE HIGH-REVERSED-9 QUOTATION MARK
\u201C " LEFT DOUBLE QUOTATION MARK
\u201D " RIGHT DOUBLE QUOTATION MARK
\u201E " DOUBLE LOW-9 QUOTATION MARK
\u201F " DOUBLE HIGH-REVERSED-9 QUOTATION MARK
\u2032 ' PRIME
\u2033 " DOUBLE PRIME
\u2034 ''' TRIPLE PRIME
\u2035 ` REVERSED PRIME
\u2036 " REVERSED DOUBLE PRIME
\u2037 ''' REVERSED TRIPLE PRIME
\u2038 ^ CARET
\u2039 < SINGLE LEFT-POINTING ANGLE QUOTATION MARK
\u203A > SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
\u203D ? INTERROBANG
\u2044 / FRACTION SLASH
\u204E * LOW ASTERISK
\u2052 % COMMERCIAL MINUS SIGN
\u2053 ~ SWUNG DASH
\u2060 WORD JOINER
\u20E5 \ COMBINING REVERSE SOLIDUS OVERLAY
\u2212 - MINUS SIGN
\u2215 / DIVISION SLASH
\u2216 \ SET MINUS
\u2217 * ASTERISK OPERATOR
\u2223 | DIVIDES
\u2236 : RATIO
\u223C ~ TILDE OPERATOR
\u2264 <= LESS-THAN OR EQUAL TO
\u2265 >= GREATER-THAN OR EQUAL TO
\u2266 <= LESS-THAN OVER EQUAL TO
\u2267 >= GREATER-THAN OVER EQUAL TO
\u2303 ^ UP ARROWHEAD
\u2329 < LEFT-POINTING ANGLE BRACKET
\u232A > RIGHT-POINTING ANGLE BRACKET
\u266F # MUSIC SHARP SIGN
\u2731 * HEAVY ASTERISK
\u2758 | LIGHT VERTICAL BAR
\u2762 ! HEAVY EXCLAMATION MARK ORNAMENT
\u27E6 [ MATHEMATICAL LEFT WHITE SQUARE BRACKET
\u27E8 < MATHEMATICAL LEFT ANGLE BRACKET
\u27E9 > MATHEMATICAL RIGHT ANGLE BRACKET
\u2983 { LEFT WHITE CURLY BRACKET
\u2984 } RIGHT WHITE CURLY BRACKET
\u3003 " DITTO MARK
\u3008 < LEFT ANGLE BRACKET
\u3009 > RIGHT ANGLE BRACKET
\u301B ] RIGHT WHITE SQUARE BRACKET
\u301C ~ WAVE DASH
\u301D " REVERSED DOUBLE PRIME QUOTATION MARK
\u301E " DOUBLE PRIME QUOTATION MARK
\uFEFF  ZERO WIDTH NO-BREAK SPACE
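
The table covers a fixed set of symbols and punctuation, so converted text can still contain characters it does not map (accented letters, for example). A quick way to see what is left over is to list the code points above U+007F. The Java sketch below assumes a hypothetical helper named `nonAscii`; it is not part of the UDF or the NIH code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class AsciiCheck {
    // Collect the distinct code points above U+007F, formatted as U+XXXX,
    // in the order they first appear in the input.
    static Set<String> nonAscii(String s) {
        Set<String> found = new LinkedHashSet<>();
        s.codePoints()
         .filter(cp -> cp > 0x7F)
         .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // prints: [U+00E9, U+2014]
        System.out.println(nonAscii("caf\u00E9 \u2014 ok"));
    }
}
```

Running a check like this on the output of symbolsToASCII shows which characters would need further handling before the text is strictly 7-bit ASCII.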


Top comments (2)

jeffcoughlin

Dude. Thank you. This saved me a lot of trouble. Might I suggest submitting this to cflib.org?

GDH Press

Nice James!
