DEV Community: Florian Pigorsch

Regular and Unusual "Space" Characters

Florian Pigorsch — Sun, 22 Aug 2021 14:26:49 +0000

Regular Space Characters

U+0020 SPACE

This is the regular space character as produced by pressing the space bar of your keyboard.

U+00A0 NO-BREAK SPACE

A fixed space that prevents an automatic line break at its position. Abbreviation: NBSP

U+2000 EN QUAD

A 1 en (= 1/2 em) wide space, where 1 em is the height of the current font.

U+2001 EM QUAD

A 1 em wide space, where 1 em is the height of the current font.

U+2002 EN SPACE

A 1 en (= 1/2 em) wide space, where 1 em is the height of the current font.

U+2003 EM SPACE

A 1 em wide space, where 1 em is the height of the current font.

U+2004 THREE-PER-EM SPACE

A 1/3 em wide space, where 1 em is the height of the current font. "Thick Space".

U+2005 FOUR-PER-EM SPACE

A 1/4 em wide space, where 1 em is the height of the current font. "Mid Space".

U+2006 SIX-PER-EM SPACE

A 1/6 em wide space, where 1 em is the height of the current font.

U+2007 FIGURE SPACE

A space character that is as wide as fixed-width digits. Usually used when typesetting vertically aligned numbers.

U+2008 PUNCTUATION SPACE

A space character that is as wide as a period (".").

U+2009 THIN SPACE

A space between 1/6 em and 1/4 em wide (typically 1/5 em), where 1 em is the height of the current font.

U+200A HAIR SPACE

Narrower than the "THIN SPACE", usually the thinnest space character.

U+202F NARROW NO-BREAK SPACE

A narrow form of a no-break space, typically the width of a "THIN SPACE". Abbreviation: NNBSP.

U+205F MEDIUM MATHEMATICAL SPACE

A 4/18 em wide space, where 1 em is the height of the current font. Usually used when typesetting mathematical formulas.
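
As a rough illustration, the characters listed above can be checked against the Unicode tables in Go's standard library. This is only a sketch: the names regularSpaces and isRegularSpace are made up for this example. It assumes that "regular space" means Unicode category Zs ("Separator, space").

```go
package main

import (
	"fmt"
	"unicode"
)

// regularSpaces lists the code points described in this section
// (the name is an assumption of this sketch).
var regularSpaces = []rune{
	0x0020, 0x00A0, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
	0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F,
}

// isRegularSpace reports whether r is in Unicode category Zs.
func isRegularSpace(r rune) bool {
	return unicode.Is(unicode.Zs, r)
}

func main() {
	for _, r := range regularSpaces {
		// unicode.IsSpace uses the broader White_Space property,
		// which also covers every Zs character.
		fmt.Printf("U+%04X Zs=%t IsSpace=%t\n", r, isRegularSpace(r), unicode.IsSpace(r))
	}
}
```

Both checks should report true for every entry, which is why functions like strings.TrimSpace remove these characters.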

Regular Space Characters with Zero Width

‌U+200C ZERO WIDTH NON-JOINER

When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively.

‍U+200D ZERO WIDTH JOINER

When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms (ligature). Also used to join emoji with modifier characters.
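
The emoji use of ZWJ can be sketched in Go as follows. The helper joinWithZWJ is a hypothetical name for this example, and whether the result renders as a single glyph depends on the font and platform.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const zwj = "\u200D" // ZERO WIDTH JOINER

// joinWithZWJ concatenates the parts with a ZWJ between each pair
// (helper name chosen for this sketch).
func joinWithZWJ(parts ...string) string {
	return strings.Join(parts, zwj)
}

func main() {
	// WOMAN (U+1F469) + ZWJ + PERSONAL COMPUTER (U+1F4BB)
	s := joinWithZWJ("\U0001F469", "\U0001F4BB")
	// The sequence is 3 code points and 11 bytes in UTF-8,
	// even if it is drawn as one emoji.
	fmt.Println(s, utf8.RuneCountInString(s), len(s))
}
```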

U+2060 WORD JOINER

A zero width non-breaking space. Abbreviation: WJ.

U+FEFF ZERO WIDTH NO-BREAK SPACE

The zero width no-break space (ZWNBSP) is a deprecated use of the Unicode character at code point U+FEFF. Character U+FEFF is intended for use as a Byte Order Mark (BOM) at the start of a file. However, if encountered elsewhere, it should, according to Unicode, be treated as a "zero width no-break space". The deliberate use of U+FEFF for this purpose is deprecated as of Unicode 3.2, with the word joiner strongly preferred.
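
Because the characters in this section are format characters (category Cf) rather than white space, the usual trimming functions leave them in place. The following Go sketch shows one way to strip them. The helpers isZeroWidth and stripZeroWidth are assumptions of this example, not standard library functions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isZeroWidth reports whether r is one of the four characters
// described in this section.
func isZeroWidth(r rune) bool {
	switch r {
	case '\u200C', '\u200D', '\u2060', '\uFEFF':
		return true
	}
	return false
}

// stripZeroWidth removes those characters from s.
func stripZeroWidth(s string) string {
	return strings.Map(func(r rune) rune {
		if isZeroWidth(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	s := "\uFEFFfoo\u200Dbar\u2060"
	fmt.Printf("%q\n", strings.TrimSpace(s)) // TrimSpace leaves s unchanged
	fmt.Printf("%q\n", stripZeroWidth(s))
	fmt.Println(unicode.IsSpace('\u2060'), unicode.Is(unicode.Cf, '\u2060'))
}
```

Blindly stripping ZWJ and ZWNJ can damage legitimate text (emoji sequences, or scripts that rely on joiners), so a filter like this is best applied only to fields where such characters are never expected.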

Non-Space Characters that Act Like Spaces

The following characters are probably the most interesting: they act like regular space characters, but are typically not considered as such. Because of this, they can often be used in places where a single (regular) space character is not allowed (e.g. as a YouTube video title, in nicknames in popular games, etc.).

U+180E MONGOLIAN VOWEL SEPARATOR

The MVS is a word-internal thin whitespace that may occur only before the word-final vowels U+1820 MONGOLIAN LETTER A and U+1821 MONGOLIAN LETTER E. It determines the specific form of the character preceding it, selects a special variant shape of these vowels, and produces a small gap within the word. Since Unicode 6.3.0 it is no longer classified as a space character (category Zs) but as a format character (category Cf), although it was a space character in earlier versions of the standard.

U+2800 BRAILLE PATTERN BLANK

The Braille pattern "dots-0", also called a "blank Braille pattern", is a 6-dot or 8-dot braille cell with no dots raised. It is represented by the Unicode code point U+2800, and in Braille ASCII with a space. In all Braille systems, the Braille pattern dots-0 is used to represent a space or the lack of content. In particular, some fonts display the character as a fixed-width blank. However, the Unicode standard explicitly states that it does not act as a space.

U+3164 HANGUL FILLER

The Hangul Filler character is used to introduce eight-byte Hangul composition sequences and to stand in for an absent element (usually an empty final) in such a sequence. Unicode includes the Wansung code Hangul Filler in the Hangul Compatibility Jamo block for round-trip compatibility, but uses its own system (with its own, differently used, filler characters) for composing Hangul.
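
To illustrate why these three characters slip through typical "must not be blank" checks, here is a minimal Go sketch. The names lookalikes and looksBlank are invented for this example, and the list is limited to the characters from this section.

```go
package main

import (
	"fmt"
	"unicode"
)

// lookalikes are the three characters from this section
// (an illustrative, not exhaustive, list).
var lookalikes = []rune{'\u180E', '\u2800', '\u3164'}

// isLookalike reports whether r is one of the characters above.
func isLookalike(r rune) bool {
	for _, l := range lookalikes {
		if r == l {
			return true
		}
	}
	return false
}

// looksBlank reports whether s consists only of white space
// and blank-looking characters (the empty string counts as blank).
func looksBlank(s string) bool {
	for _, r := range s {
		if !unicode.IsSpace(r) && !isLookalike(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, r := range lookalikes {
		fmt.Printf("U+%04X IsSpace=%t Cf=%t So=%t Lo=%t\n", r,
			unicode.IsSpace(r), unicode.Is(unicode.Cf, r),
			unicode.Is(unicode.So, r), unicode.Is(unicode.Lo, r))
	}
	fmt.Println(looksBlank("\u3164\u2800"))
}
```

None of the three is reported as white space by unicode.IsSpace, so a validator that only trims spaces would accept a name made entirely of them.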

Visible Space Characters

␠ U+2420 SYMBOL FOR SPACE

␢ U+2422 BLANK SYMBOL

␣ U+2423 OPEN BOX
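
A common use for these symbols is making spaces visible in debug output. A small Go sketch, with showSpaces as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// showSpaces replaces every U+0020 SPACE with U+2423 OPEN BOX.
func showSpaces(s string) string {
	return strings.ReplaceAll(s, " ", "\u2423")
}

func main() {
	fmt.Println(showSpaces("a b  c")) // prints a␣b␣␣c
}
```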

Go: Identifiers vs. Unicode

Florian Pigorsch — Tue, 03 Aug 2021 08:23:29 +0000

A recent Reddit post about Unicode characters in Go identifiers prompted me to dive into the Go spec and look things up properly.

According to the spec, the syntax for valid identifiers is

identifier = letter { letter | unicode_digit } .

with

letter = unicode_letter | "_" .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
unicode_digit  = /* a Unicode code point classified as "Number, decimal digit" */ .

The "Letter" category consists of the Unicode categories Lu (uppercase letters), Ll (lowercase letters), Lt (titlecase letters), Lm (modifier letters), and Lo (other letters), where "Number, decimal digit" refers to the Unicode category Nd.
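
These categories are exposed as range tables in Go's unicode package, so the classification can be inspected directly. The function category below is a made-up helper for this sketch and only covers the categories named in the paragraph above.

```go
package main

import (
	"fmt"
	"unicode"
)

// category returns the general category of r if it is one of the
// categories relevant to Go identifiers, or "" otherwise.
func category(r rune) string {
	switch {
	case unicode.Is(unicode.Lu, r):
		return "Lu"
	case unicode.Is(unicode.Ll, r):
		return "Ll"
	case unicode.Is(unicode.Lt, r):
		return "Lt"
	case unicode.Is(unicode.Lm, r):
		return "Lm"
	case unicode.Is(unicode.Lo, r):
		return "Lo"
	case unicode.Is(unicode.Nd, r):
		return "Nd"
	}
	return ""
}

func main() {
	for _, r := range "Aa\u01C5\u02B0㭪\u0663_" {
		fmt.Printf("U+%04X %q\n", r, category(r))
	}
}
```

The underscore falls into none of these categories, which is why the grammar has to list it separately.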

So an identifier has to start with either a "letter" or an underscore ("_"), and may contain only "letters", "decimal digits", and underscores, according to what Unicode defines as letters and digits.
The set of letters is not only the usual A-Z, a-z, but also letters from other scripts, like Greek letters (e.g. Σ) or CJK characters (e.g. 㭪). The same holds for digits: not only 0-9, but also digits from other scripts are allowed, e.g. ୩, ٣, etc.

Valid identifiers:

abc_123
_myidentifier
Σ (U+03A3 GREEK CAPITAL LETTER SIGMA)
㭪 (some CJK character from the Lo category)
x٣३߃૩୩3 (x + decimal digits 3 from various scripts)

Invalid identifiers:

42 (does not start with a letter)
😀 (not a letter, but So / Symbol, other)
⽔ (not a letter, but So / Symbol, other)
x🌞 (starts with a letter, but contains non-letter/digit characters)
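
The grammar rule can be turned into a small checker to try out the examples above. This is a minimal sketch, not the Go compiler's actual implementation: isLetter and isIdentifier are names chosen here, and the check deliberately ignores keywords such as "func", which the grammar rule itself does not exclude.

```go
package main

import (
	"fmt"
	"unicode"
)

// isLetter mirrors the spec's "letter" production:
// a Unicode letter or an underscore.
func isLetter(r rune) bool {
	return r == '_' || unicode.IsLetter(r)
}

// isIdentifier checks s against
//   identifier = letter { letter | unicode_digit } .
func isIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if isLetter(r) {
			continue
		}
		// Digits (category Nd) are allowed anywhere except at the start.
		if i > 0 && unicode.IsDigit(r) {
			continue
		}
		return false
	}
	return true
}

func main() {
	names := []string{
		"abc_123", "_myidentifier", "Σ", "㭪", "x٣३߃૩୩3",
		"42", "😀", "⽔", "x🌞",
	}
	for _, name := range names {
		fmt.Printf("%q valid=%t\n", name, isIdentifier(name))
	}
}
```

For real tooling, the standard library's go/token package provides token.IsIdentifier (Go 1.13 and later), which additionally rejects keywords.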

Although Go accepts identifiers that contain characters other than A-Z, a-z, 0-9, and _, it's generally not advisable to use them, for reasons of readability and accessibility and to avoid rendering issues.