Convert Unicode Symbols & Punctuation to ASCII using ColdFusion/Java

James Moberg

symbolsToASCII is a ColdFusion UDF (user-defined function) that converts Unicode symbols and punctuation to 7-bit ASCII. I was previously using ConvertSpecialChars from CFLib, but it didn't include enough mapped characters.

I found documentation on the NIH Lexical Systems Group website describing their approach to "Map Symbols & Punctuation to ASCII". They state that "converting Unicode punctuation and symbols to ASCII punctuation and symbols is imperative in NLP for preserving the original documents." Their Java implementation is simply "perform mapping if the character is in the punctuation & symbols mapping table".
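The table-lookup idea quoted above can be sketched in a few lines of Java. This is only an illustration, not the NIH code or the UDF's source: the class name `SymbolMapper`, the method `toAscii`, and the handful of entries are assumptions chosen for the example, with the entries copied from the mapping table later in this post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "perform mapping if the character is in the table".
final class SymbolMapper {
    // A small illustrative subset of the mapping table, not the full list.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', "'");   // LEFT SINGLE QUOTATION MARK
        TABLE.put('\u2019', "'");   // RIGHT SINGLE QUOTATION MARK
        TABLE.put('\u201C', "\"");  // LEFT DOUBLE QUOTATION MARK
        TABLE.put('\u201D', "\"");  // RIGHT DOUBLE QUOTATION MARK
        TABLE.put('\u2013', "-");   // EN DASH
        TABLE.put('\u2264', "<=");  // LESS-THAN OR EQUAL TO
    }

    // Replace each character found in the table; copy everything else unchanged.
    static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "I don't know" (with straight double quotes around it)
        System.out.println(toAscii("\u201CI don\u2019t know\u201D"));
    }
}
```

Because characters that are not in the table pass through untouched, accented letters and other non-ASCII text are left alone; only symbols and punctuation are affected.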

Their approach makes a lot of sense. When I'm performing a search using a SQL query or a Verity collection, the HTML5 input field doesn't auto-corrupt "dumb quotes" into “smart quotes” the way Microsoft Word does. If the stored content contains HTML-encoded characters, wouldn't extra logic be required to account for potential substitutions involving high-ASCII characters as well as ‘, ’, “ and ”?
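One way to picture the search problem is to normalize both the stored text and the search term before comparing them. The Java sketch below assumes exactly that; `QuoteInsensitiveSearch`, `normalize`, and `matches` are made-up names for illustration, and it only covers the four curly quotes rather than the full table or any HTML entity decoding.

```java
import java.util.Map;

// Hypothetical sketch: compare stored content and a query after mapping quotes to ASCII.
final class QuoteInsensitiveSearch {
    // Only the four curly quotes, to keep the example short.
    private static final Map<Character, String> QUOTES = Map.of(
        '\u2018', "'",
        '\u2019', "'",
        '\u201C', "\"",
        '\u201D', "\"");

    // Map curly quotes to their straight ASCII equivalents.
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            out.append(QUOTES.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    // True when the normalized query occurs in the normalized stored text.
    static boolean matches(String stored, String query) {
        return normalize(stored).contains(normalize(query));
    }

    public static void main(String[] args) {
        String stored = "I don\u2019t know what you mean";
        System.out.println(stored.contains("don't"));  // prints false
        System.out.println(matches(stored, "don't"));  // prints true
    }
}
```

In practice the stored side would more likely be normalized once at save or index time rather than on every search, but the comparison logic is the same.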

Usage: symbolsToASCII(required string inputString)

<cfset testString = '#CHR(8220)#I don#CHR(8217)#t know what you mean by #CHR(8216)#glory,#CHR(8217)# #CHR(8221)# Alice said.'>

<cfoutput>
<textarea style="width:95%; height:300px;">
Original: #testString#

symbolsToASCII: #symbolsToASCII(testString)#
</textarea>
</cfoutput>

Try it online at TryCF.com

https://trycf.com/gist/6f35220d47caa7fdbf75eb884ff1cec7

Source code

Symbols and Punctuation Default Mapping Table (credit)

Unicode Mapped String Char Unicode Name
\u00AB " « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
\u00AD - ­ SOFT HYPHEN
\u00B4 ' ´ ACUTE ACCENT
\u00BB " » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\u00F7 / ÷ DIVISION SIGN
\u01C0 | ǀ LATIN LETTER DENTAL CLICK
\u01C3 ! ǃ LATIN LETTER RETROFLEX CLICK
\u02B9 ' ʹ MODIFIER LETTER PRIME
\u02BA " ʺ MODIFIER LETTER DOUBLE PRIME
\u02BC ' ʼ MODIFIER LETTER APOSTROPHE
\u02C4 ^ ˄ MODIFIER LETTER UP ARROWHEAD
\u02C6 ^ ˆ MODIFIER LETTER CIRCUMFLEX ACCENT
\u02C8 ' ˈ MODIFIER LETTER VERTICAL LINE
\u02CB ` ˋ MODIFIER LETTER GRAVE ACCENT
\u02CD _ ˍ MODIFIER LETTER LOW MACRON
\u02DC ~ ˜ SMALL TILDE
\u0300 ` ̀ COMBINING GRAVE ACCENT
\u0301 ' ́ COMBINING ACUTE ACCENT
\u0302 ^ ̂ COMBINING CIRCUMFLEX ACCENT
\u0303 ~ ̃ COMBINING TILDE
\u030B " ̋ COMBINING DOUBLE ACUTE ACCENT
\u030E " ̎ COMBINING DOUBLE VERTICAL LINE ABOVE
\u0331 _ ̱ COMBINING MACRON BELOW
\u0332 _ ̲ COMBINING LOW LINE
\u0338 / ̸ COMBINING LONG SOLIDUS OVERLAY
\u0589 : ։ ARMENIAN FULL STOP
\u05C0 | ׀ HEBREW PUNCTUATION PASEQ
\u05C3 : ׃ HEBREW PUNCTUATION SOF PASUQ
\u066A % ٪ ARABIC PERCENT SIGN
\u066D * ٭ ARABIC FIVE POINTED STAR
\u200B ZERO WIDTH SPACE
\u2010 - HYPHEN
\u2011 - NON-BREAKING HYPHEN
\u2012 - FIGURE DASH
\u2013 - EN DASH
\u2014 - EM DASH
\u2015 -- HORIZONTAL BAR
\u2016 | DOUBLE VERTICAL LINE
\u2017 _ DOUBLE LOW LINE
\u2018 ' LEFT SINGLE QUOTATION MARK
\u2019 ' RIGHT SINGLE QUOTATION MARK
\u201A , SINGLE LOW-9 QUOTATION MARK
\u201B ' SINGLE HIGH-REVERSED-9 QUOTATION MARK
\u201C " LEFT DOUBLE QUOTATION MARK
\u201D " RIGHT DOUBLE QUOTATION MARK
\u201E " DOUBLE LOW-9 QUOTATION MARK
\u201F " DOUBLE HIGH-REVERSED-9 QUOTATION MARK
\u2032 ' PRIME
\u2033 " DOUBLE PRIME
\u2034 ''' TRIPLE PRIME
\u2035 ` REVERSED PRIME
\u2036 " REVERSED DOUBLE PRIME
\u2037 ''' REVERSED TRIPLE PRIME
\u2038 ^ CARET
\u2039 < SINGLE LEFT-POINTING ANGLE QUOTATION MARK
\u203A > SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
\u203D ? INTERROBANG
\u2044 / FRACTION SLASH
\u204E * LOW ASTERISK
\u2052 % COMMERCIAL MINUS SIGN
\u2053 ~ SWUNG DASH
\u2060 WORD JOINER
\u20E5 \ COMBINING REVERSE SOLIDUS OVERLAY
\u2212 - MINUS SIGN
\u2215 / DIVISION SLASH
\u2216 \ SET MINUS
\u2217 * ASTERISK OPERATOR
\u2223 | DIVIDES
\u2236 : RATIO
\u223C ~ TILDE OPERATOR
\u2264 <= LESS-THAN OR EQUAL TO
\u2265 >= GREATER-THAN OR EQUAL TO
\u2266 <= LESS-THAN OVER EQUAL TO
\u2267 >= GREATER-THAN OVER EQUAL TO
\u2303 ^ UP ARROWHEAD
\u2329 < LEFT-POINTING ANGLE BRACKET
\u232A > RIGHT-POINTING ANGLE BRACKET
\u266F # MUSIC SHARP SIGN
\u2731 * HEAVY ASTERISK
\u2758 | LIGHT VERTICAL BAR
\u2762 ! HEAVY EXCLAMATION MARK ORNAMENT
\u27E6 [ MATHEMATICAL LEFT WHITE SQUARE BRACKET
\u27E8 < MATHEMATICAL LEFT ANGLE BRACKET
\u27E9 > MATHEMATICAL RIGHT ANGLE BRACKET
\u2983 { LEFT WHITE CURLY BRACKET
\u2984 } RIGHT WHITE CURLY BRACKET
\u3003 " DITTO MARK
\u3008 < LEFT ANGLE BRACKET
\u3009 > RIGHT ANGLE BRACKET
\u301B ] RIGHT WHITE SQUARE BRACKET
\u301C ~ WAVE DASH
\u301D " REVERSED DOUBLE PRIME QUOTATION MARK
\u301E " DOUBLE PRIME QUOTATION MARK
\uFEFF  ZERO WIDTH NO-BREAK SPACE
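Two details of the table are easy to miss: some entries expand to more than one character (for example \u2264 becomes <=), and the zero-width entries appear to map to nothing at all, so the output length can differ from the input length. The Java sketch below illustrates both cases and adds a small check for characters that are still outside 7-bit ASCII after mapping. The names `MappingAudit`, `toAscii`, and `leftovers` are hypothetical, and treating ZERO WIDTH SPACE as "removed" is an assumption based on its empty mapped-string column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: length-changing mappings plus a check for unmapped characters.
final class MappingAudit {
    // Keyed by code point; a few entries taken from the table above.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x2264, "<=");   // LESS-THAN OR EQUAL TO (one char becomes two)
        TABLE.put(0x2034, "'''");  // TRIPLE PRIME (one char becomes three)
        TABLE.put(0x2014, "-");    // EM DASH
        TABLE.put(0x200B, "");     // ZERO WIDTH SPACE (assumed to be removed)
    }

    // Apply the table code point by code point.
    static String toAscii(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String mapped = TABLE.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // List every code point above 0x7F, formatted as U+XXXX.
    static List<String> leftovers(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints()
         .filter(cp -> cp > 0x7F)
         .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        System.out.println(toAscii("x \u2264 y\u200B"));               // prints x <= y
        System.out.println(leftovers(toAscii("caf\u00E9 \u2014 ok"))); // prints [U+00E9]
    }
}
```

A check like `leftovers` is one way to spot characters a mapping table does not cover yet, which was the shortcoming that prompted the switch away from ConvertSpecialChars.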

Posted on Jun 30 by:

James Moberg

@gamesover

I’m a ColdFusion web application developer at SunStar Media located in Monterey, CA. I am a fan of technology, music and web development.
