I’ve struggled for years to find the best way to convert Unicode accents and other characters to ASCII using ColdFusion. I’ve tried regex, java.text.Normalizer, ICU4J Transliterate and Apache Commons Lang3 StringUtils.stripAccents, and recently scrapped them all in favor of JUnidecode. JUnidecode is a Java port of the Text::Unidecode Perl module. The library exposes a single method, unidecode, which takes a string and transliterates it to a valid 7-bit ASCII string (stripping diacritic marks along the way).
Examples:
- Москвa becomes Moskva.
- čeština becomes cestina.
- Հայաստան becomes Hayastan.
- Ελληνικά becomes Ellenika.
- 北亰 becomes Bei Jing.
- Häuser Bäume Höfe Gärten becomes Hauser Baume Hofe Garten.
- daß becomes dass.
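For contrast with the examples above, here is a minimal Java sketch of the java.text.Normalizer approach mentioned earlier. The class and method names (AccentStripper, stripAccents) are hypothetical and only for illustration. It decomposes each character and deletes the combining marks, which handles accents such as those in čeština or Häuser, but letters with no decomposition (Ł, ß) and non-Latin scripts pass through unchanged. That gap is what a transliteration table like JUnidecode's fills.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of any library.
final class AccentStripper {

    // Decompose to NFD, then delete combining marks (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Czech word with carons on c and s: the carons are removed.
        System.out.println(stripAccents("\u010De\u0161tina")); // prints "cestina"
        // Polish name: the acute on n is removed, but L-with-stroke stays.
        System.out.println(stripAccents("\u0141ukasi\u0144ski"));
        // German sharp s has no decomposition, so it is left untouched.
        System.out.println(stripAccents("da\u00DF"));
    }
}
```

The same two calls can be made from CFML with createObject("java", "java.text.Normalizer"), in the same way the demo script further down does for its NFKC step.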
WARNING: Be aware that JUnidecode doesn't handle emojis well. You may need to sanitize them (or convert them to aliases) with cf-emoji-java before converting to 7-bit ASCII.
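If pulling in cf-emoji-java is not an option, a rough pre-filter can be sketched in plain Java. This is not cf-emoji-java and does not convert anything to aliases; the class and method names (EmojiFilter, dropSupplementarySymbols) are hypothetical. It assumes that removing symbol code points from the supplementary planes, where most emoji live, is acceptable for your data. Emoji in the Basic Multilingual Plane, zero-width joiners and variation selectors are not removed by this sketch.

```java
// Hypothetical pre-filter, for illustration only; this is not cf-emoji-java.
final class EmojiFilter {

    // Drops supplementary-plane symbol code points and keeps everything else,
    // including supplementary-plane letters such as mathematical script capitals.
    static String dropSupplementarySymbols(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !(Character.isSupplementaryCodePoint(cp) && isSymbol(cp)))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    // True for the general categories So (other symbol) and Sk (modifier symbol).
    private static boolean isSymbol(int cp) {
        int type = Character.getType(cp);
        return type == Character.OTHER_SYMBOL || type == Character.MODIFIER_SYMBOL;
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) is written as a surrogate pair.
        System.out.println(dropSupplementarySymbols("Hi \uD83D\uDE00!")); // prints "Hi !"
    }
}
```

Filtering by category rather than by plane alone keeps characters like the script capitals in ℰ𝒳𝒜ℳ𝓟ℒℰ, which are letters and can still be folded to ASCII later.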
Here's a demo script I've written that has some generic test cases:
https://gist.github.com/JamoCA/6565bd4e2526b7c177a5f0cde3980d1c
<cfprocessingdirective pageEncoding="utf-8">
<cfsetting enablecfoutputonly="Yes">
<!---
BLOG: https://dev.to/gamesover/convert-unicode-strings-to-ascii-with-coldfusion-junidecode-lhf
--->
<cfscript>
/* Transliterates a Unicode string to 7-bit ASCII using the JUnidecode Java library. */
function JUnidecode(inputString){
    var JUnidecodeLib = "";
    var response = "";
    var temp = {};
    temp.encoder = createObject("java", "java.nio.charset.Charset").forName("utf-8").newEncoder();
    temp.isUTF = temp.encoder.canEncode(arguments.inputString);
    if (temp.isUTF){
        /* NFKC: Unicode Compatibility Decomposition, followed by Canonical Composition */
        temp.normalizer = createObject( "java", "java.text.Normalizer" );
        temp.normalizerForm = createObject( "java", "java.text.Normalizer$Form" );
        arguments.inputString = temp.normalizer.normalize( javaCast( "string", arguments.inputString ), temp.normalizerForm.NFKC );
    }
    try {
        JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
        response = JUnidecodeLib.unidecode( javacast("string", arguments.inputString) );
    } catch (any e) {
        /* Most likely cause: the JUnidecode JAR could not be loaded */
        response = "ERROR: JUnidecode is not installed";
    }
    /* Remove any "[?]" markers from the output, then trim */
    return trim(response.replaceAll("\[\?\]", ""));
}
/* Returns true when position pos of compareArr is missing or differs from val. */
function isDiff(compareArr, val, pos){
    return (pos GT arrayLen(compareArr) OR compareArr[pos] neq val);
}
</cfscript>
<cfset TestStrings = [
    "ℰ𝒳𝒜ℳ𝓟ℒℰ",
    "ABC #chr(160)# Café “test”",
    "北亰",
    "Mr. まさゆき たけだ",
    "Łukasiński",
    "⠏⠗⠑⠍⠊⠑⠗",
    "What about Ø, Ł or æøåá",
    "ราชอาณาจักรไทย",
    "Ελληνικά",
    "Москвa",
    "Հայաստան",
    "čeština",
    "®™™™©©©Ⓒ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒●⚫⬤",
    "ÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæèéêëìíîïñòóôõöøùúûüý’“”–…",
    "Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß "
]>
<cfset CFString = "cfscript">
<cfparam name="URL.testString" default="">
<cfif len(trim(URL.testString))>
    <cfset TestStrings = listToArray(trim(URL.testString))>
</cfif>
<cfsetting enablecfoutputonly="no">
<!doctype html>
<html lang="en">
<head>
<title>JUnidecode ColdFusion Demo</title>
<style>
.diff {background-color:#ff0;}
fieldset:nth-child(even) {background-color:#ededed;}
</style>
</head>
<body>
<h1>JUnidecode ColdFusion Demo</h1>
<p>by <a href="https://about.me/jamesmoberg">James Moberg</a> / <a href="https://www.sunstarmedia.com/">SunStar Media</a> (February 6, 2019)</p>
<p>This is a demo of how to use <a href="https://github.com/gcardone/junidecode">JUnidecode</a> with <a href="https://www.adobe.com/products/coldfusion-family.html">ColdFusion</a> to convert Unicode strings to reasonable 7-bit ASCII strings, stripping diacritics along the way.</p>
<p>I've compared this Java library against <a href="https://www.bennadel.com/blog/1155-cleaning-high-ascii-values-for-web-safeness-in-coldfusion.htm">regex</a>, <a href="https://cflib.org/udf/deAccent">java.text.Normalizer</a>, <a href="https://gist.github.com/JamoCA/ec4617b066fc4bb601f620bc93bacb57">ICU4J Transliterate</a> (390k vs 12mb+) and <a href="https://www.codota.com/code/java/methods/org.apache.commons.lang3.StringUtils/stripAccents">Apache.Lang3.StringUtils.StripAccents()</a> (500k) and found that it generates more consistent results while safely converting more characters than the other solutions. I've also updated our <a href="https://gist.github.com/JamoCA/fee34a03bbe61a2f8e40">SanitizeFilename UDF</a> to use it.</p>
<p><b>Installation:</b> Download the latest <a href="https://github.com/gcardone/junidecode/releases">JUnidecode JAR</a>, place it in your Java class path &amp; restart your ColdFusion server (or use Javaloader).</p>
<p><b>Sample User-Defined Function (UDF):</b></p>
<cfoutput>
<textarea rows="7" cols="100" style="margin-left:25px;"><#CFString#>
function JUnidecode(inputString){
    var JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
    var response = JUnidecodeLib.unidecode( javacast("string", arguments.inputString) );
    return trim(replacenocase(response, "[?]", "", "all"));
}
</#CFString#></textarea>
<p><b>Usage:</b></p>
<p style="margin-left:25px;">JUnidecode(<i>string</i>)</p>
<hr>
<h2>Form Test</h2>
<form action="" method="get">
    <input type="text" name="teststring" value="" required placeholder="Enter test string"> <button type="submit">Test</button><cfif len(trim(URL.TestString))> <a href="?">Reset</a></cfif>
</form>
<h2>Test Results</h2>
<cfloop from="1" to="#ArrayLen(TestStrings)#" index="r">
    <cfset TestString = TestStrings[r]>
    <cfset TestResult = JUnidecode(TestString)>
    <cfset letters = []>
    <fieldset>
        <legend>#r#. #TestString#</legend>
        <b>Result:</b> #TestResult#
        <table border="1" cellspacing="0" cellpadding="0">
            <tr valign="top">
                <th>Original</th><cfloop from="1" to="#len(TestString)#" index="i">
                <cfset Letter = mid(TestString, i, 1)>
                <cfset arrayAppend(letters, Letter)><td><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
            </tr>
            <tr valign="top">
                <th>JUnidecode</th><cfloop from="1" to="#len(TestResult)#" index="i">
                <cfset Letter = mid(TestResult, i, 1)>
                <td<cfif isDiff(letters, Letter, i)> class="diff"</cfif>><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
            </tr>
        </table>
    </fieldset>
</cfloop>
</cfoutput>
</body>
</html>
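The two steps that wrap the unidecode call in the script above (NFKC normalization before it, removal of the "[?]" marker after it) can be sketched in plain Java as follows. The class and method names (UnidecodePrep, normalizeCompat, stripPlaceholders) are hypothetical, and the JUnidecode call itself is left out so the sketch runs with the standard library alone; in practice it would sit between the two methods.

```java
import java.text.Normalizer;

// Hypothetical helper mirroring two steps of the CFML script; illustration only.
final class UnidecodePrep {

    // NFKC folds compatibility characters (ligatures, script letters, etc.)
    // into their plain equivalents before transliteration.
    static String normalizeCompat(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // Removes the literal "[?]" marker from transliterated text and trims it.
    static String stripPlaceholders(String transliterated) {
        return transliterated.replace("[?]", "").trim();
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature; U+2130 is the script capital E.
        System.out.println(normalizeCompat("\uFB01"));        // prints "fi"
        System.out.println(normalizeCompat("\u2130"));        // prints "E"
        System.out.println(stripPlaceholders(" ab[?]c "));    // prints "abc"
    }
}
```

Using String.replace here avoids regex escaping altogether, which is the same effect the script gets from its escaped pattern.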