James Moberg

Filtering Zero-Width SPaces (ZWSPs) using ColdFusion

This is a follow-up to a January 11, 2019 article I wrote on my old Tumblr blog.

The Hacker News posted an article regarding the ability to bypass Microsoft Office 365's "Safe Links" security feature by adding zero-width spaces (ZWSPs). I generated some sample URLs to test against my ColdFusion-based URL blacklist script and discovered that it was also vulnerable. Using isValid("url") returns TRUE for strings with ZWSPs. Clicked links also redirect to the correct destination as the browser (or DNS) appears to automatically ignore the ZWSP characters. This makes it possible for cybercriminals and email scammers to send malware and phishing links through our servers. It's impossible to filter using a blacklist without removing the characters.
I've developed a ColdFusion CFC with various methods to trim, sanitize (replace/remove) & identify these invisible characters by both character and HTML entities. I'm still evaluating the library by testing it in a couple of smaller projects and am hoping to post more in the very near future. (I've actually been meaning to write this for a while in order to trim non-breaking spaces from data imported by user-generated Excel files.)
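
To illustrate the blacklist problem described above, here is a minimal sketch. The blocked host and the test URL are hypothetical values invented for this example, and the simple substring check stands in for whatever blacklist logic you actually use.

```cfml
<cfscript>
// Hypothetical blacklist entry and test URL (not from the original article).
blockedHost = "malicious.example";
zwsp = chr(8203); // U+200B zero-width space
testUrl = "https://mali" & zwsp & "cious.example/login";

// The article reports that isValid("url") accepts strings containing ZWSPs.
looksValid = isValid("url", testUrl);

// A plain substring check misses the host because the ZWSP splits it in two.
matchedRaw = findNoCase(blockedHost, testUrl) gt 0;

// Removing the ZWSP first lets the same check find the blocked host.
cleanUrl = replace(testUrl, zwsp, "", "all");
matchedClean = findNoCase(blockedHost, cleanUrl) gt 0;

writeOutput("valid: #looksValid#, raw match: #matchedRaw#, cleaned match: #matchedClean#");
</cfscript>
```

The same idea applies to any invisible character: normalize the string first, then run the blacklist comparison against the cleaned value.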

I just realized that I never shared the whitespace.cfc that I developed back in 2019. Public methods include:

  • dumpRegex (performs a CFDump of regex rules)
  • getConfig (Lists all rules used when a tag is specified)
  • getRegex (Generates a pipe-delimited regex list of whitespace/ZWSP characters. Ex. 'chr(32)|chr(160)')
  • hasWhiteSpace (Checks if string contains any whitespace)
  • hasUnsafeSpace (Checks if string contains unsafe-ish whitespace)
  • identifyUnsafeSpace (Provides a struct of shortcodes, names, decimal or hex values of identified whitespace and their regex positions)
  • leftTrim (Performs a left trim and strips all whitespace)
  • rightTrim (Performs a right trim and strips all whitespace)
  • fullTrim (Performs a left/right trim and strips all whitespace)
  • sanitize (Removes all unsafe whitespace)
  • sanitizeTrim (Performs a left/right trim and strips control characters)
  • compressText (Sanitizes, reduces multiple space characters to a single character)
  • compressHtml (Replaces a huge amount of unnecessary whitespace from your HTML code)
  • singleLine (Modifies content to output on a single line (for logging))
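
As a rough usage sketch for the methods listed above: this assumes whitespace.cfc has been saved somewhere that the component path "whitespace" resolves, and the input string is a made-up example.

```cfml
<cfscript>
// Assumes whitespace.cfc is reachable as the component path "whitespace".
ws = new whitespace();

// Hypothetical input: NBSP padding plus a zero-width space inside the text.
raw = chr(160) & "Hello" & chr(8203) & "World" & chr(160);

if (ws.hasUnsafeSpace(raw)) {
    // Struct keyed by shortcode (e.g. NBSP, ZWSP) with the position of each match.
    writeDump(var=ws.identifyUnsafeSpace(raw), label="Unsafe whitespace found");
}

// sanitizeTrim() swaps characters tagged "space,unsafe" for a regular space,
// removes characters tagged "join,unsafe", then trims both ends.
clean = ws.sanitizeTrim(raw);
writeOutput(clean);
</cfscript>
```

Note that, with the tags used in the gist, a ZWSP is tagged "space", so sanitize() turns it into a regular space rather than deleting it. For URL checks you may prefer to strip it outright.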

https://gist.github.com/JamoCA/68673cbac81d5924c80754c3fe4effcf

component displayname="whitespace" output="false" hint="Identifies and performs trim functions on white space-related characters" {
/*
author: James Moberg <james@ssmedia.com>
date: 2019-01-07
Description: Removes all whitespace-related characters (ie, Zero-Width SPaces (ZWSPs)) from a string... not just characters below U+0020.
.NET Trim() really trims a string - also trimming non-breaking-spaces. This is not the case in Java.
http://www.henrikbrinch.dk/Blog/2013/02/28/java-net-string-gotchas-of-the-day/
https://stackoverflow.com/a/4731164/693068
https://stackoverflow.com/a/4307261/693068
The String.trim() method talks about "whitespace", but defines this in a very precise but rather crude and idiosyncratic way - it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.
https://closingbraces.net/2008/11/11/javastringtrim/
charcodes
https://www.fileformat.info/info/unicode/char/0001/index.htm (substitute 0001 w/Spacedata.u)
Reason:
https://thehackernews.com/2019/01/phishing-zero-width-spaces.html
Used:
"NUL" = {n="Null char", d=0, h="0000", tags="join,unsafe"},
GIST: https://gist.github.com/JamoCA/68673cbac81d5924c80754c3fe4effcf
Blog: https://dev.to/gamesover/filtering-zero-width-spaces-zwsps-using-coldfusion-122c
Tweet: https://twitter.com/gamesover/status/1704168951960519058
*/
/* 20191018 https://gist.github.com/JamoCA/42c3be286185aff0476d5888f0a819ff */
variables.spaceData = [
"SOH": ["n":"Start of Heading", "d":1, "h":"0001", "tags":"space,unsafe"]
,"STX": ["n":"Start of Text", "d":2, "h":"0002", "tags":"space,unsafe"]
,"ETX": ["n":"End of Text", "d":3, "h":"0003", "tags":"space,unsafe"]
,"EOT": ["n":"End of Transmission", "d":4, "h":"0004", "tags":"space,unsafe"]
,"ENQ": ["n":"Enquiry", "d":5, "h":"0005", "tags":"space,unsafe"]
,"ACK": ["n":"Acknowledgment", "d":6, "h":"0006", "tags":"space,unsafe"]
,"BEL": ["n":"Bell", "d":7, "h":"0007", "tags":"space,unsafe"]
,"BS": ["n":"Back Space", "d":8, "h":"0008", "tags":"space,unsafe"]
,"HT": ["n":"Horizontal Tab", "d":9, "h":"0009", "tags":"space,unsafe"]
,"LF": ["n":"Line Feed", "d":10, "h":"000A", "tags":"space,safe"]
,"VT": ["n":"Vertical Tab", "d":11, "h":"000B", "tags":"space,unsafe"]
,"FF": ["n":"Form Feed", "d":12, "h":"000C", "tags":"space,unsafe"]
,"CR": ["n":"Carriage Return", "d":13, "h":"000D", "tags":"space,safe"]
,"SO": ["n":"Shift Out / X-On", "d":14, "h":"000E", "tags":"space,unsafe"]
,"SI": ["n":"Shift In / X-Off", "d":15, "h":"000F", "tags":"space,unsafe"]
,"DLE": ["n":"Data Link Escape", "d":16, "h":"0010", "tags":"space,unsafe"]
,"DC1": ["n":"Device Control 1 (oft. XON)", "d":17, "h":"0011", "tags":"space,unsafe"]
,"DC2": ["n":"Device Control 2", "d":18, "h":"0012", "tags":"space,unsafe"]
,"DC3": ["n":"Device Control 3 (oft. XOFF)", "d":19, "h":"0013", "tags":"space,unsafe"]
,"DC4": ["n":"Device Control 4", "d":20, "h":"0014", "tags":"space,unsafe"]
,"NAK": ["n":"Negative Acknowledgement", "d":21, "h":"0015", "tags":"space,unsafe"]
,"SYN": ["n":"Synchronous Idle", "d":22, "h":"0016", "tags":"space,unsafe"]
,"ETB": ["n":"End of Transmit Block", "d":23, "h":"0017", "tags":"space,unsafe"]
,"CAN": ["n":"Cancel", "d":24, "h":"0018", "tags":"space,unsafe"]
,"EM": ["n":"End of Medium", "d":25, "h":"0019", "tags":"space,unsafe"]
,"SUB": ["n":"Substitute", "d":26, "h":"001A", "tags":"space,unsafe"]
,"ESC": ["n":"Escape", "d":27, "h":"001B", "tags":"space,unsafe"]
,"FS": ["n":"File Separator", "d":28, "h":"001C", "tags":"space,unsafe"]
,"GS": ["n":"Group Separator", "d":29, "h":"001D", "tags":"space,unsafe"]
,"RS": ["n":"Record Separator", "d":30, "h":"001E", "tags":"space,unsafe"]
,"US": ["n":"Unit Separator", "d":31, "h":"001F", "tags":"space,unsafe"]
,"SP": ["n":"Space", "d":32, "h":"0020", "tags":"space,safe"]
,"NEL": ["n":"next line", "d":133, "h":"0085", "tags":"space,unsafe"]
,"NBSP": ["n":"no-break space", "d":160, "h":"00A0", "tags":"space,unsafe"]
,"OGHAM": ["n":"OGHAM Space Mark", "d":5760, "h":"1680", "tags":"space,unsafe"]
,"MONGOLIAN": ["n":"Mongolian Vowel Separator", "d":6158, "h":"180E", "tags":"space,unsafe"]
,"ENQUAD": ["n":"EN Quad", "d":8192, "h":"2000", "tags":"space,unsafe"]
,"EMQUAD": ["n":"EM Quad", "d":8193, "h":"2001", "tags":"space,unsafe"]
,"ENSP": ["n":"EN Space", "d":8194, "h":"2002", "tags":"space,unsafe"]
,"EMSP": ["n":"EM Space", "d":8195, "h":"2003", "tags":"space,unsafe"]
,"THREE-PER-EM SPACE": ["n":"Thick Space", "d":8196, "h":"2004", "tags":"space,unsafe"]
,"FOUR-PER-EM SPACE": ["n":"Mid Space", "d":8197, "h":"2005", "tags":"space,unsafe"]
,"SIX-PER-EM SPACE": ["n":"Six-per-EM Space", "d":8198, "h":"2006", "tags":"space,unsafe"]
,"FGMSP": ["n":"Figure Space", "d":8199, "h":"2007", "tags":"space,unsafe"]
,"PUNSP": ["n":"Punctuation Space", "d":8200, "h":"2008", "tags":"space,unsafe"]
,"THINSPACE": ["n":"Thin Space", "d":8201, "h":"2009", "tags":"space,unsafe"]
,"HAIRSPACE": ["n":"Hair Space", "d":8202, "h":"200A", "tags":"space,unsafe"]
,"ZWSP": ["n":"zero-width space", "d":8203, "h":"200B", "tags":"space,unsafe"]
,"ZWNJ": ["n":"zero-width non-joiner", "d":8204, "h":"200C", "tags":"join,unsafe"]
,"ZWJ": ["n":"zero-width joiner", "d":8205, "h":"200D", "tags":"join,unsafe"]
,"LRM": ["n":"left-to-right mark", "d":8206, "h":"200E", "tags":"space,unsafe"]
,"RLM": ["n":"right-to-left mark", "d":8207, "h":"200F", "tags":"space,unsafe"]
,"WJ": ["n":"Word Joiner", "d":8288, "h":"2060", "tags":"join,unsafe"]
,"LINSEP": ["n":"Line Separator", "d":8232, "h":"2028", "tags":"space,unsafe"]
,"PARSEP": ["n":"Paragraph Separator", "d":8233, "h":"2029", "tags":"space,unsafe"]
,"NNBSP": ["n":"Narrow No-Break Space", "d":8239, "h":"202F", "tags":"space,unsafe"]
,"MMASP": ["n":"Medium Mathematical Space", "d":8287, "h":"205F", "tags":"space,unsafe"]
,"SMSP": ["n":"Symbol for Space", "d":9248, "h":"2420", "tags":"space,unsafe"]
,"BLANK": ["n":"Blank Symbol", "d":9250, "h":"2422", "tags":"space,unsafe"]
,"OPENBOX": ["n":"Open Box", "d":9251, "h":"2423", "tags":"space,unsafe"]
,"BB": ["n":"Braille blank pattern", "d":10240, "h":"2800", "tags":"space,unsafe"]
,"IDSP": ["n":"Ideographic Space", "d":12288, "h":"3000", "tags":"space,unsafe"]
,"BOM": ["n":"Zero Width No-Break Space (AKA Byte Order Mark)", "d":65279, "h":"FEFF", "tags":"join,unsafe"]
,"FWDZ": ["n":"Full-Width Digit Zero", "d":65296, "h":"FF10", "tags":"space,unsafe"]
];
// tags = list of filters (all,safe,unsafe,space,join)
public void function dumpRegex(string tags="") output=true hint="performs a CFDump of regex rules" {
writedump(var="#getRegex(arguments.tags)#", label="SpacesRegex");
}
// tags = list of filters (all,safe,unsafe,space,join)
public struct function getConfig(string tags="") output=false hint="Lists all rules used when a tag is specified" {
return [
"spaceData": variables.spaceData
,"regex": getRegex(arguments.tags)
];
}
public boolean function hasWhiteSpace(string inputString="") output=false hint="Checks if string contains any whitespace" {
if (!len(arguments.inputString)){
return javacast("boolean", 0);
}
return javacast("boolean", refindnocase(getRegex('all'), arguments.inputString, 1, false));
}
public boolean function hasUnsafeSpace(string inputString="") output=false hint="Checks if string contains unsafe-ish whitespace" {
if (!len(arguments.inputString)){
return javacast("boolean", 0);
}
return javacast("boolean", refindnocase(getRegex('unsafe'), arguments.inputString, 1, false));
}
public struct function identifyUnsafeSpace(string inputString="", string tags="all") output=false hint="Provides a struct of shortcodes, names, decimal or hex values of identified whitespace and their regex positions" {
local.outputData = [:];
local.tags = (len(trim(arguments.tags))) ? listtoarray(lcase(trim(arguments.tags))) : ["all"];
local.returnAll = arrayfind(local.tags, "all");
for ( local.space in variables.spaceData ) {
local.thisSpace = variables.spaceData[local.space];
if (!local.returnAll){
local.tagfilter = duplicate(local.tags);
local.spacetags = listtoarray(duplicate(local.thisSpace.tags));
local.tagfilter.retainAll(local.spacetags);
}
if ( local.returnAll || arraylen(local.tagfilter) ) {
if ( find(chr(local.thisSpace.d), arguments.inputString) ) {
if ( !local.outputData.keyExists("#local.space#") ) {
local.outputData["#local.space#"] = [
"d": javacast("int", local.thisSpace.d)
,"h": javacast("string", local.thisSpace.h)
,"name": javacast("string", local.thisSpace.n)
,"positions": []
];
local.searchData = reFindNoCaseAll(chr(local.thisSpace.d), arguments.inputString);
local.outputData["#local.space#"].positions = local.searchData.pos;
}
}
}
}
return local.outputData;
}
public string function leftTrim(string inputString="") output=false hint="Performs a left trim and strips all whitespace" {
return javacast("string", arguments.inputString).replaceAll("^(#getRegex('all')#)+", "");
}
public string function rightTrim(string inputString="") output=false hint="Performs a right trim and strips all whitespace" {
return javacast("string", arguments.inputString).replaceAll("(#getRegex('all')#)+$", "");
}
public string function fullTrim(string inputString="") output=false hint="Performs a left/right trim and strips all whitespace" {
return javacast("string", arguments.inputString).replaceAll("(#getRegex('all')#)+$", "").replaceAll("^(#getRegex('all')#)+", "");
}
public string function sanitize(string inputString="") output=false hint="Removes all unsafe whitespace" {
return javacast("string", arguments.inputString).replaceAll("(#getRegex('space,unsafe')#)", " ").replaceAll("(#getRegex('join,unsafe')#)", "");
}
public string function sanitizeTrim(string inputString="") output=false hint="Performs a left/right trim and strips control characters" {
return fullTrim(sanitize(arguments.inputString));
}
public string function compressText(string inputString="") output=false hint="Sanitizes, reduces multiple space characters to a single character" {
return sanitizeTrim(arguments.inputString).replaceAll(" +", " ").replaceAll("[\r\n]+", "#chr(13)##chr(10)#");
}
// level = 1, 2 (default) or 3
public string function compressHtml(string inputHtml="", string level="2") output=false hint="Replaces a huge amount of unnecessary whitespace from your HTML code" {
local.outputString = javacast("string", arguments.inputHtml);
arguments.level = (listfind("1,2,3", arguments.level)) ? arguments.level : 2;
// Java strings are immutable, so each replaceAll() result must be assigned back.
if ( arguments.level eq 3 ) {
local.outputString = compressText(local.outputString);
local.outputString = local.outputString.replaceAll("> <", "><");
local.outputString = local.outputString.replaceAll("<!--[^>]+>", "");
} else if ( arguments.level eq 2 ) {
local.outputString = compressText(local.outputString);
local.outputString = local.outputString.replaceAll("(#getRegex('unsafe')#)", "");
// \s is used here because java.util.regex does not support the POSIX [[:space:]] class.
local.outputString = local.outputString.replaceAll("(" & chr(10) & "|" & chr(13) & ")+\s{2,}", chr(13));
}
return fullTrim(local.outputString);
}
public string function singleLine(string inputString="") output=false hint="Modifies content to output on a single line (for logging)" {
return fullTrim(replacelist(arguments.inputString, "#chr(9)#,#chr(10)#,#chr(12)#,#chr(13)#,#chr(160)#", " , , , , "));
}
// tags = list of filters (all,safe|unsafe,space|join; unicode)
public string function getRegex(string tags="") output=false hint="Generates pipe-delimited REGEX list of whitespace/ZWSP characters. Ex. 'chr(32)|chr(160)'" {
local.cachekey = "udf_getRegex_#arguments.toString().hashCode()#";
local.initialFilter = lcase(trim(arguments.tags));
local.workingFilter = local.initialFilter;
local.useUnicode = listfind(local.initialFilter, "unicode");
if (structkeyexists(request, local.cachekey)){
return request[local.cachekey];
}
local.filters = [:];
if (!local.filters.keyExists("#local.initialFilter#")){
local.regex = [];
if (listfind(local.workingFilter, "unicode")){
local.workingFilter = listdeleteat(local.workingFilter, listfind(local.workingFilter, "unicode"));
}
local.all = !len(trim(local.workingFilter)) || listfind(lcase(local.workingFilter), "all");
for (local.char in variables.spaceData){
local.thisSpace = variables.spaceData[local.char];
local.includeThis = local.all;
if (!local.includeThis){
for (local.thisTag in listtoarray(lcase(local.workingFilter))){
if (listfind(local.thisSpace.tags, local.thisTag)){
local.includeThis = local.includeThis + 1;
}
}
if (local.includeThis neq listlen(local.workingFilter)){
local.includeThis = 0;
}
}
if (local.includeThis){
if (local.useUnicode){
arrayappend(local.regex, "\\u#variables.spaceData[local.char].h#");
} else {
arrayappend(local.regex, "#chr(variables.spaceData[local.char].d)#");
arrayappend(local.regex, "&###variables.spaceData[local.char].d#;");
arrayappend(local.regex, "&##x#lcase(variables.spaceData[local.char].h)#;");
}
}
}
local.filters["#local.initialFilter#"] = arraytolist(local.regex, "|");
}
request[local.cachekey] = local.filters["#local.initialFilter#"];
return local.filters["#local.initialFilter#"];
}
/* 4/14/2009 http://www.cflib.org/udf/reFindNoCaseAll
@author Ben Forta (ben@forta.com)
@version 1, November 17, 2003
@version 2, January 7, 2019 - James @ SunStar Media */
private struct function reFindNoCaseAll(required string regex, required string text) output=false hint="Returns all the matches (case insensitive) of a regular expression within a string. This is similar to reGet(), but more closely matches the result set of reFind." {
local.results = [
"pos": []
,"len": []
];
local.pos = 1;
local.done = 0;
while (local.done neq 1) {
local.subex = refindnocase(arguments.regex, arguments.text, local.pos, true);
if (local.subex.len[1] eq 0){
local.done = 1;
} else {
arrayappend(local.results.len, local.subex.len[1]);
arrayappend(local.results.pos, local.subex.pos[1]);
local.pos = local.subex.pos[1] + local.subex.len[1];
}
}
if (arraylen(local.results.len) eq 0){
arrayappend(local.results.len, 0);
arrayappend(local.results.pos, 0);
}
return local.results;
}
}
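
The header comment's point about trim() can be checked with a short sketch like the one below. The regex here is a generic alternative built on Java's Unicode categories; it is an assumption for illustration only and is not the tag-driven rule list the component uses.

```cfml
<cfscript>
nbsp = chr(160); // U+00A0 no-break space
padded = nbsp & "abc" & nbsp;

// Built-in trim() is expected to leave the NBSP padding in place.
builtIn = trim(padded);

// Java regex alternative: \p{Z} matches Unicode separator characters and
// \p{Cc} matches control characters, so this also trims NBSP.
// It does not cover format characters such as ZWSP (U+200B).
pattern = "^[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+$";
regexTrimmed = javacast("string", padded).replaceAll(pattern, "");

writeOutput("built-in length: #len(builtIn)#, regex length: #len(regexTrimmed)#");
</cfscript>
```

Comparing the two lengths on your own engine is a quick way to confirm whether the built-in trim() is enough for your data before reaching for the component.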

Top comments (2)

Daemach

What is the most efficient way to use this in a ColdBox application?

James Moberg

I'm not entirely sure. I don't use ColdBox & haven't developed any ColdBox-specific extensions... "yet".

I'm heading to the upcoming ColdFusion Summit and intend on sitting down with some of the Ortus team to figure out how to update & migrate some of my CFML libraries & projects to their ForgeBox package repository.
