DEV Community

James Moberg

Convert Unicode strings to ASCII with ColdFusion & JUnidecode

I’ve struggled for years to find the best way to convert accented and other non-ASCII Unicode characters in ColdFusion. I’ve used regex, java.text.Normalizer, ICU4J Transliterate and Apache Commons Lang 3's StringUtils.stripAccents(), and recently scrapped them all in favor of JUnidecode. JUnidecode is a Java port of the Text::Unidecode Perl module. The library exposes only one method: it takes a string and transliterates it to a valid 7-bit ASCII string, stripping diacritical marks along the way.

Examples:

  • Москвa becomes Moskva.
  • čeština becomes cestina.
  • Հայաստան becomes Hayastan.
  • Ελληνικά becomes Ellenika.
  • 北亰 becomes Bei Jing.
  • Häuser Bäume Höfe Gärten becomes Hauser Baume Hofe Garten.
  • daß becomes dass.
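
For comparison, here is roughly what the java.text.Normalizer approach mentioned above amounts to, written as a plain-Java sketch that uses only the JDK. The class and method names are made up for illustration and are not part of JUnidecode or the demo script. It handles letters that decompose into a base letter plus combining marks, but it leaves characters such as ß and Ł, and whole non-Latin scripts, untouched.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of JUnidecode or the demo script.
class NormalizerOnlyDemo {

    // NFD splits an accented letter into base letter + combining marks;
    // the regex then deletes the marks (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // True when every UTF-16 unit is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String[] samples = {
            "\u010De\u0161tina",                    // Czech sample with carons
            "H\u00E4user",                          // German sample with a-umlaut
            "da\u00DF",                             // German sample with sharp s
            "\u0141ukasi\u0144ski",                 // Polish sample with L-stroke and n-acute
            "\u041C\u043E\u0441\u043A\u0432\u0430"  // Cyrillic sample
        };
        for (String s : samples) {
            String out = stripAccents(s);
            System.out.println(out + " (ASCII only: " + isAscii(out) + ")");
        }
    }
}
```

Only the first two samples come out as pure ASCII. The sharp s, the L-stroke and the Cyrillic letters have no base-plus-mark decomposition, so the normalizer has nothing to strip; a transliteration table like the one in JUnidecode is what covers them.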

WARNING: Please be aware that JUnidecode doesn't like emojis. You may need to sanitize them (or convert them to aliases) using cf-emoji-java before converting to 7-bit ASCII.
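
If pulling in cf-emoji-java is not an option, one crude fallback is to drop every character outside the Basic Multilingual Plane before transliterating. The plain-Java sketch below is an illustration under stated assumptions, not something JUnidecode or cf-emoji-java provides: the class and method names are hypothetical, it assumes the troublesome emojis are supplementary-plane characters, and it deletes them rather than converting them to aliases. NFKC normalization runs first so that supplementary characters with an ASCII-compatible form (such as mathematical script letters) are folded instead of lost.

```java
import java.text.Normalizer;

// Hypothetical pre-filter for illustration; not part of JUnidecode or cf-emoji-java.
class SupplementaryFilter {

    // Assumption: the emojis that cause trouble are code points above U+FFFF.
    // Emojis inside the BMP (for example U+263A) pass through unchanged.
    static String dropSupplementary(String input) {
        // Fold compatibility characters first, e.g. U+1D4B3 becomes "X".
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        StringBuilder kept = new StringBuilder(normalized.length());
        normalized.codePoints()
                  .filter(cp -> !Character.isSupplementaryCodePoint(cp))
                  .forEach(kept::appendCodePoint);
        return kept.toString();
    }

    public static void main(String[] args) {
        // The escaped surrogate pair below is U+1F600 (grinning face).
        System.out.println(dropSupplementary("Hi \uD83D\uDE00!")); // prints "Hi !"
    }
}
```

The filtered string can then be handed to JUnidecode as usual. If the emojis carry meaning you want to keep, converting them to aliases with cf-emoji-java remains the better route.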

Here's a demo script I've written that has some generic test cases:
https://gist.github.com/JamoCA/6565bd4e2526b7c177a5f0cde3980d1c

<cfprocessingdirective pageEncoding="utf-8">
<cfsetting enablecfoutputonly="Yes">
<!---
BLOG: https://dev.to/gamesover/convert-unicode-strings-to-ascii-with-coldfusion-junidecode-lhf
--->
<cfscript>
function JUnidecode(inputString){
    var JUnidecodeLib = "";
    var response = "";
    var temp = {};
    // Only normalize input the UTF-8 encoder accepts (canEncode() is false for malformed input such as unpaired surrogates)
    temp.encoder = createObject("java", "java.nio.charset.Charset").forName("utf-8").newEncoder();
    temp.isUTF = temp.encoder.canEncode(arguments.inputString);
    if (temp.isUTF){
        /* NFKC: Unicode Compatibility Decomposition, followed by Canonical Composition */
        temp.normalizer = createObject( "java", "java.text.Normalizer" );
        temp.normalizerForm = createObject( "java", "java.text.Normalizer$Form" );
        arguments.inputString = temp.normalizer.normalize( javaCast( "string", arguments.inputString ), temp.normalizerForm.NFKC );
    }
    try {
        JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
        response = JUnidecodeLib.unidecode( javacast("string", arguments.inputString) );
    } catch (any e) {
        // Most likely cause: the JUnidecode JAR is not on the class path
        response = "ERROR: JUnidecode is not installed";
    }
    // Strip any "[?]" placeholders left in the transliterated output
    return trim(response.replaceAll("\[\?\]", ""));
}
// True when position pos is past the end of compareArr or holds a different value than val
function isDiff(compareArr, val, pos){
    return (pos GT arrayLen(compareArr) OR compareArr[pos] neq val);
}
</cfscript>
<cfset TestStrings = [
"ℰ𝒳𝒜ℳ𝓟ℒℰ",
"ABC #chr(160)# Café “test”",
"北亰",
"Mr. まさゆき たけだ",
"Łukasiński",
"⠏⠗⠑⠍⠊⠑⠗",
"What about Ø, Ł or æøåá",
"ราชอาณาจักรไทย",
"Ελληνικά",
"Москвa",
"Հայաստան",
"čeština",
"®™™™©©©Ⓒ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒●⚫⬤",
"ÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæèéêëìíîïñòóôõöøùúûüý’“”–…",
"Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß "
]>
<cfset CFString = "cfscript">
<cfparam name="URL.testString" default="">
<cfif len(trim(URL.testString))>
<cfset TestStrings = listToArray(trim(URL.testString))>
</cfif>
<cfsetting enablecfoutputonly="no">
<!doctype html>
<html lang="en">
<head>
<title>JUnidecode ColdFusion Demo</title>
<style>
.diff {background-color:#ff0;}
fieldset:nth-child(even) {background-color:#ededed;}
</style>
</head>
<body>
<h1>JUnidecode ColdFusion Demo</h1>
<p>by <a href="https://about.me/jamesmoberg">James Moberg</a> / <a href="https://www.sunstarmedia.com/">SunStar Media</a> (February 6, 2019)</p>
<p>This is a demo of how to use <a href="https://github.com/gcardone/junidecode">JUnidecode</a> with <a href="https://www.adobe.com/products/coldfusion-family.html">ColdFusion</a> to convert Unicode strings to reasonable 7-bit ASCII strings, stripping diacritics and transliterating other characters along the way.</p>
<p>I've compared this Java library against <a href="https://www.bennadel.com/blog/1155-cleaning-high-ascii-values-for-web-safeness-in-coldfusion.htm">regex</a>, <a href="https://cflib.org/udf/deAccent">java.text.Normalizer</a>, <a href="https://gist.github.com/JamoCA/ec4617b066fc4bb601f620bc93bacb57">ICU4J Transliterate</a> (JUnidecode is about 390k vs 12mb+ for ICU4J) and <a href="https://www.codota.com/code/java/methods/org.apache.commons.lang3.StringUtils/stripAccents">Apache.Lang3.StringUtils.StripAccents()</a> (500k), and found that it generates more consistent results while safely converting more characters than the other solutions. I've also updated our <a href="https://gist.github.com/JamoCA/fee34a03bbe61a2f8e40">SanitizeFilename UDF</a> to use it.</p>
<p><b>Installation:</b> Download the latest <a href="https://github.com/gcardone/junidecode/releases">JUnidecode JAR</a>, place it in your Java class path &amp; restart your ColdFusion server (or use Javaloader).</p>
<p><b>Sample User-Defined Function (UDF):</b></p>
<cfoutput>
<textarea rows="7" cols="100" style="margin-left:25px;"><#CFString#>
function JUnidecode(inputString){
var JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
var response = JUnidecodeLib.unidecode( javacast("string", arguments.inputString) );
return trim(replacenocase(Response, "[?]", "", "all"));
}
</#CFString#></textarea>
<p><b>Usage:</b></p>
<p style="margin-left:25px;">JUnidecode(<i>string</i>)</p>
<hr>
<h2>Form Test</h2>
<form action="" method="get">
<input type="text" name="teststring" value="" required placeholder="Enter test string"> <button type="submit">Test</button><cfif len(trim(URL.TestString))> <a href="?">Reset</a></CFIF>
</form>
<h2>Test Results</h2>
<cfloop from="1" to="#ArrayLen(TestStrings)#" index="r">
<cfset TestString = TestStrings[r]>
<cfset TestResult = JUnidecode(TestString)>
<cfset letters = []>
<fieldset>
<legend>#r#. #TestString#</legend>
<b>Result:</b> #TestResult#
<table border="1" cellspacing="0" cellpadding="0">
<tr valign="top">
<th>Original</th><cfloop from="1" to="#len(TestString)#" index="i">
<cfset Letter = mid(TestString, i, 1)>
<cfset arrayAppend(letters, Letter)><td><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
</tr>
<tr valign="top">
<th>JUnidecode</th><cfloop from="1" to="#len(TestResult)#" index="i">
<cfset Letter = mid(TestResult, i, 1)>
<td<CFIF isDiff(Letters, Letter, i)> class="diff"</cfif>><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
</tr>
</table>
</fieldset>
</cfloop>
</cfoutput>
</body>
</html>
