Fotis Papadogeorgopoulos

Posted on Jul 24, 2020 • Originally published at fotis.xyz on Jul 24, 2020

A toLocaleString Mystery

#internationalization #localization #i18n #javascript

Recently, at work, one of our tests started failing. Our site is available in 11 languages, and the months for Azerbaijani (with the Latin script) had inconsistent capitalisation!

After investigating, and a little bit of good guessing, it turned out to be an issue with the localisation data in the browser and Node themselves.

This post digs into how I went about investigating that issue, with entirely too many diversions along the way. I hope it gives you a fun insight into how localisation data ends up in JS APIs, and how to spot errors!

The bug

Let’s frame the problem.

We have a function that provides a list of months (in the Gregorian calendar), localised for one of the languages and scripts that we support. For English US, that would be “January, February, March…”.

JavaScript environments, whether web browsers such as Chrome and Firefox, or Node, provide a set of APIs for localisation and internationalisation. Two common ones are the Intl namespace of APIs, and the Date object with its toLocaleString method. We use toLocaleString specifically to get a localised month, for each month of the calendar.
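As a minimal sketch of those two routes (using en-US purely as an illustrative locale, since it is available practically everywhere), both the Date method and an explicit Intl formatter can produce a month name:

```javascript
// A date in mid-March 1970; the month argument is zero-based.
const date = new Date(1970, 2, 15);

// Route 1: the Date method.
const viaDate = date.toLocaleString('en-US', { month: 'long' });

// Route 2: an explicit formatter from the Intl namespace.
const viaIntl = new Intl.DateTimeFormat('en-US', { month: 'long' }).format(date);

console.log(viaDate, viaIntl); // March March
```

Both calls draw on the same underlying locale data, which is why a data problem tends to show up in both.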

However, the result of calling those APIs can vary depending on the data that each browser has available.

Because that possibility can sometimes be unexpected (especially for people that have not worked with multiple languages or scripts before), last year we added a series of tests to verify the localisation of months.

Then, at some later point, our tests started failing:

AssertionError: expected [ Array(12) ] to deeply equal [ Array(12) ]
+ expected - actual

[
-  "yanvar"
+  "Yanvar"
    "Fevral"
-  "mart"
+  "Mart"
    "Aprel"
    "May"
    "İyun"
    "İyul"
    "Avqust"
    "Sentyabr"
    "Oktyabr"
    "Noyabr"
-  "dekabr"
+  "Dekabr"
]

In other words: the months for Azerbaijani with the Latin script, Yanvar (January), Mart (March) and Dekabr (December) were lower case, while all the other months were capitalised.

First step, checking our own function

Before going down the path that the data might be wrong, let’s make sure that our own function is not doing anything absurd.

The function itself is provided below, a small wrapper around calling toLocaleString for 12 Dates.

function getArrayOfMonths(localeTag) {
  // Months for Date are 0..11
  const months = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].map((month) => {
    const dateobj = new Date(1970, month, 15);
    return dateobj.toLocaleString(localeTag, { month: 'long' });
  });
  return months;
}

(There are subtleties to getting a list of months this way, which may make the results wrong or unidiomatic. In our use, those are fine, but I am listing an example with noun cases at the end of the article.)
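A test in the spirit of the ones mentioned earlier could be sketched as below. This is not our actual test code: the fixture name and the `sameMonths` helper are made up for illustration, and the sketch is kept framework-free so that it also runs in a browser console. It uses en-US, whose month names are not in doubt.

```javascript
// Same wrapper as above, repeated so the sketch is self-contained.
function getArrayOfMonths(localeTag) {
  return [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].map((month) =>
    new Date(1970, month, 15).toLocaleString(localeTag, { month: 'long' })
  );
}

// Hypothetical fixture: the months we expect for US English.
const expectedEnUS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
];

// Hypothetical helper: element-by-element comparison of two month lists.
function sameMonths(actual, expected) {
  return (
    actual.length === expected.length &&
    actual.every((name, index) => name === expected[index])
  );
}

console.log(sameMonths(getArrayOfMonths('en-US'), expectedEnUS));
```

A fixture like this is exactly what breaks when the locale data changes underneath you, which is the point of having it.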

Running this function in Firefox and Node (with localisation data, more on that later!) brings up the same results:

// Node
// NODE_ICU_DATA=node_modules/full-icu node    
// Welcome to Node.js v12.16.3.

> console.log(getArrayOfMonths('az-AZ'));

[
  'yanvar',   'Fevral',
  'mart',     'Aprel',
  'May',      'İyun',
  'İyul',     'Avqust',
  'Sentyabr', 'Oktyabr',
  'Noyabr',   'dekabr'
]

// Firefox

> console.log(getArrayOfMonths('az-AZ'));

Array(12) [ "yanvar", "Fevral", "mart", "Aprel", "May", "İyun", "İyul", "Avqust", "Sentyabr", "Oktyabr", … ]

Firefox and Node having the same inconsistent capitalisation was already tipping me off. They are different engines, so the two of them processing the data in the same odd way seemed like too much of a coincidence.

Chrome only prints out English months, but that is as intended: it does not support Azerbaijani in Intl/toLocaleString yet, and I did not specify a fallback.
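For what it is worth, a fallback can be made explicit by passing a list of locales in order of preference. The sketch below assumes British English as the fallback (an arbitrary choice for illustration) and uses resolvedOptions() to see which locale the engine settled on. The result depends on the data your environment has, so no output is shown.

```javascript
// Preference order: Azerbaijani first, then British English as an explicit fallback.
const requestedLocales = ['az-AZ', 'en-GB'];

const formatter = new Intl.DateTimeFormat(requestedLocales, { month: 'long' });

// The locale the engine actually picked; it varies with the data available.
const resolvedLocale = formatter.resolvedOptions().locale;

const january = formatter.format(new Date(1970, 0, 15));
console.log(resolvedLocale, january);
```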

Finding if a locale is supported with Intl

The Intl family of APIs is really powerful. They have a bunch of namespaces and constructors to account for different linguistic artefacts and locales. For example, there is Intl.DateTimeFormat for formatting dates and times (day month year? month day year? fight!).

One useful function is Intl.DateTimeFormat.supportedLocalesOf. It takes an array of locales as BCP 47 language tags, such as en-GB (English as used in Great Britain) or el-GR (Hellenic/Greek as used in Greece), and returns an array of the ones that are supported:

> console.log(Intl.DateTimeFormat.supportedLocalesOf(['az-AZ', 'en-GB', 'el-GR']))

['az-AZ', 'en-GB', 'el-GR']
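If you only care about a single locale, the call wraps neatly into a yes/no check. The helper name below is made up for this post. Keep in mind that it only reports what the engine claims to support, which is not the same as the data being correct.

```javascript
// Hypothetical helper: true when the engine claims date-formatting support for the tag.
function supportsDateLocale(localeTag) {
  try {
    return Intl.DateTimeFormat.supportedLocalesOf([localeTag]).length > 0;
  } catch (error) {
    // A structurally invalid tag makes supportedLocalesOf throw a RangeError.
    return false;
  }
}

console.log(supportsDateLocale('en-GB'));
console.log(supportsDateLocale('az-AZ')); // depends on the environment's data
console.log(supportsDateLocale('not a tag')); // false
```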

Here I would go on a tangent about locales being a complex interaction of languages, regions and scripts, but this post already has too many diversions, and I don’t feel qualified to give you good examples.

To account for these interactions, BCP 47 tags have optional components for scripts, region or country codes, variants, and also reserved extensions. I found this article from MDN on locale identification useful for a short explanation.

Azerbaijani (as far as my searching shows, I might be wrong) has both a Latin and Cyrillic script. Those would be az-Latn-AZ and az-Cyrl-AZ respectively. As far as I can tell, az-AZ defaults to Latin, but I am not sure if that is an artefact of a specific data source.
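One way to poke at these components is Intl.Locale, which parses a tag into its parts, and whose maximize() method fills in the likely missing subtags from the engine's own data. This is a sketch, assuming an environment that ships Intl.Locale. The maximised result comes from that environment's data, so I am not showing an expected value for it.

```javascript
// Parsing is purely structural, so these values come straight from the tag.
const cyrillic = new Intl.Locale('az-Cyrl-AZ');
console.log(cyrillic.language, cyrillic.script, cyrillic.region); // az Cyrl AZ

// With no script subtag in the tag, the script property is undefined.
const bare = new Intl.Locale('az-AZ');
console.log(bare.script); // undefined

// maximize() adds likely subtags; the script it picks depends on the engine's data.
const likely = bare.maximize();
console.log(likely.toString());
```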

A past Chrome bug with supportedLocalesOf

When I started seeing issues with Azerbaijani in particular, I was already on my toes about issues with data.

About a year ago, we had run into a bug with Azerbaijani and Chrome, which claimed it supported it via supportedLocalesOf, but would give placeholder months.

In particular, this was the behaviour from this function back then (circa July 2019):

> Intl.DateTimeFormat.supportedLocalesOf(['az-AZ']);

['az-AZ']
// Means it is supported

> getArrayOfMonths('az-AZ')
[M0, M1, M2, M3, ... M11]

In other words, ‘az-AZ’ was purportedly supported, but the months were these odd M0 to M11 months, which seemed like internal placeholders. If Azerbaijani was unsupported, I would expect supportedLocalesOf to not report it, and also the months to be in English GB (because that is my system locale, and I did not specify a fallback).
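A cheap guard against this class of bug is to check the output for names that look like placeholders. The helper below is hypothetical and only recognises the M0 to M11 shape we saw back then; other engines or versions might use different placeholders.

```javascript
// Hypothetical guard: flags month names shaped like "M0" ... "M11".
function looksLikePlaceholderMonths(months) {
  return months.some((name) => /^M\d{1,2}$/.test(name));
}

console.log(looksLikePlaceholderMonths(['M0', 'M1', 'M2'])); // true
console.log(looksLikePlaceholderMonths(['May', 'Mart', 'March'])); // false
```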

After double- and triple-checking with colleagues and different platforms, I filed a bug in Chromium, and it was confirmed! It was eventually fixed, and supportedLocalesOf now reports Azerbaijani as unsupported.

Long story short, Azerbaijani being unsupported indicates to me that the localisation data might be incomplete. I have referenced “the data” multiple times now; let’s dive into what that data is, and where it comes from.

Localisation data: ICU, CLDR, oh my!

Let’s take a few different Intl APIs:

DateTimeFormat, uhm, formatting (as is bugging us so far)
Pluralization (e.g. apple, 2 of them = two apples, or more complex changes for languages that differentiate between “one”, “a handful”, and “many”)
Locale names (e.g. saying that “Greek” is “Ελληνικά” in Greek)
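To make the pluralization item concrete, here is a minimal sketch with Intl.PluralRules. The message table and the `apples` helper are made up for illustration. English only distinguishes "one" from "other", while other languages return more categories.

```javascript
const english = new Intl.PluralRules('en-US');

console.log(english.select(1)); // "one"
console.log(english.select(2)); // "other"

// Hypothetical message table, keyed by the plural category.
const appleMessages = { one: 'apple', other: 'apples' };

function apples(count) {
  return `${count} ${appleMessages[english.select(count)]}`;
}

console.log(apples(1)); // "1 apple"
console.log(apples(2)); // "2 apples"
```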

You can imagine that all of the underlying data (calendars, names of months, pluralization rules) must be coming from somewhere!

Indeed, there is a standard resource for these in the ICU (International Components for Unicode) data. Quoting from the site:

ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. Additional data can be provided by users, either as customizations of ICU’s data or as new data altogether.

A related data-set is the CLDR (Unicode Common Locale Data Repository). Quoting again from the site:

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes: …

The ICU data-set uses CLDR itself for many things, with a few differences:

Data which is NOT sourced from CLDR includes:

Conversion Data

Break Iterator Dictionary Data ( Thai, CJK, etc )

Break Iterator Rule Data (as of this writing, it is manually kept in sync with the CLDR datasets)

Those data come in different formats, such as XML (LDML), categorised by locale (roughly, as far as I can tell). The ICU data seems more commonly used by higher-level libraries, because the format is more compact.

With this data available, browsers have enough information to provide the richer Intl and Date localisation APIs.

Handwaving

Here are some things I am handwaving at this point.

I use ICU and CLDR rather interchangeably. As far as I can tell, the ICU data is derived from the CLDR data. I found better links for the CLDR sources, so I am digging into those.

I am also not 100% clear on whether all browsers use the ICU/CLDR data at the moment, or use some other source. I could not find anything normative about the data source in the specs (I would find that surprising anyway), and I am bad at going through issue trackers.

I found one tracking issue about Firefox transitioning to the CLDR data, and at least my testing seems to support that. Perhaps the CLDR data version would be useful for browsers to expose? Not as an API, rather an `about:` config or something similar in the UI.

Node definitely uses the ICU data, and gets its own following section for it.

Excerpt from the CLDR Data

For example, here is the top-level directory structure from one download of the CLDR data:

> tree -L 1 cldr-common-35.1/
cldr-common-35.1/common/
├── annotations
├── annotationsDerived
├── bcp47
├── casing
├── collation
├── dtd
├── main
├── properties
├── rbnf
├── segments
├── subdivisions
├── supplemental
├── transforms
├── uca
└── validity

An excerpt from the main directory:

> cldr-exploration tree -L 1 cldr-common-35.1/common/main
cldr-common-35.1/common/main
├── af_NA.xml
├── af.xml
├── af_ZA.xml
├── agq_CM.xml
├── agq.xml
├── ak_GH.xml
├── ak.xml
├── am_ET.xml
├── am.xml
├── ar_001.xml
├── ar_AE.xml
├── ar_BH.xml

And here is part of the data for English (common/main/en.xml):

<monthWidth type="wide">
    <month type="1">January</month>
    <month type="2">February</month>
    <month type="3">March</month>
    <month type="4">April</month>
    <month type="5">May</month>
    <month type="6">June</month>
    <month type="7">July</month>
    <month type="8">August</month>
    <month type="9">September</month>
    <month type="10">October</month>
    <month type="11">November</month>
    <month type="12">December</month>
</monthWidth>
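To tie this back to JavaScript, here is a rough sketch that pulls the month names out of an excerpt like the one above. It uses a regular expression on a hard-coded string, which is good enough for this flat snippet. A full locale file repeats the same month in several places, so anything serious would want a proper XML parser. The function name is made up.

```javascript
// First three months from the excerpt above, inlined to keep the sketch self-contained.
const excerpt = `
<monthWidth type="wide">
    <month type="1">January</month>
    <month type="2">February</month>
    <month type="3">March</month>
</monthWidth>`;

// Hypothetical helper: maps each month number to the name found in the XML text.
function extractMonths(xml) {
  const pattern = /<month type="(\d+)">([^<]+)<\/month>/g;
  const months = {};
  for (const [, type, name] of xml.matchAll(pattern)) {
    months[Number(type)] = name;
  }
  return months;
}

console.log(extractMonths(excerpt));
```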

ICU and Node

If you have tried to work with internationalisation in Node, you might have run into the ICU data yourself.

Up until version 13 (a few months ago), Node only had a base English locale loaded. The ICU data takes up space on the order of tens of megabytes, so for the longest time Node did not ship with it.

To get correct localisations in Node, you had to either a) build Node yourself with the full-icu dataset loaded, or b) install the correct build of the icu data locally, and provide the path via NODE_ICU_DATA.

It was messy, and probably still exists as an arcane parameter in current and aging codebases. Watch tests fail because NODE_ICU_DATA is not supplied, ugh.

Node getting the full ICU data built-in from version 13 was one of my favourite features, and if you’ve read this far, at least someone else might now understand my excitement!
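If you are unsure which situation a given Node process is in, a quick heuristic (adapted from the Node documentation on internationalisation support, as far as I recall it) is to format a date in a non-English locale and see whether you get that language back. The function name is mine. The result depends on your build and on NODE_ICU_DATA, so no output is shown.

```javascript
// Heuristic: without full ICU data, Spanish formatting falls back to English.
function hasFullIcu() {
  try {
    const january = new Date(9e8); // 11 January 1970, UTC
    const spanish = new Intl.DateTimeFormat('es', { month: 'long' });
    return spanish.format(january) === 'enero';
  } catch (error) {
    return false;
  }
}

console.log(hasFullIcu());

// Node builds with ICU also report the ICU version in process.versions.
if (typeof process !== 'undefined' && process.versions) {
  console.log(process.versions.icu);
}
```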

Either way, now that we have gone through all the abbreviations, we’re in a good spot to find the data and investigate it!

Digging into the CLDR data

Time to dig into the CLDR data, to validate whether the months in Azerbaijani show up capitalised, uncapitalised, or inconsistent.

To check for any changes (and in the case of our test, regressions), I downloaded CLDR versions 35.1, 36.1 and 37.

I started browsing through the directories and quickly got lost because my search skills are bad.

I then decided to go with a more drastic approach, and headed to the command line. In my case Gnome Terminal on Linux, but iTerm on macOS or Windows Subsystem for Linux would work just as well, if you want to follow along.

There is a nice utility called ripgrep which can search through files very fast. It is written in Rust and is lovely, though to be honest I just didn’t remember the grep flags any more.

Anyway, I went searching through the files. I used “Yanvar” capital case and “yanvar” lower case for the known issues, as well as “Oktyabr” capital case and “oktyabr” lower case as a control.

The results from ripgrep across three versions follow, and then a long-form explanation of them.

# Yanvar capital case - 1 result from version 35.1
>  az-AZ-exploration rg "Yanvar" cldr*/**/az.xml 
cldr-common-35.1/common/main/az.xml
1412:  <month type="1">Yanvar</month>

# Yanvar lower case - two results for version 36.1 and 37, one for 35.1
>  az-AZ-exploration rg "yanvar" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1360:  <month type="1">yanvar</month>
1404:  <month type="1">yanvar</month>

cldr-common-36.1/common/main/az.xml
1360:  <month type="1">yanvar</month>
1404:  <month type="1">yanvar</month>

cldr-common-35.1/common/main/az.xml
1368:  <month type="1">yanvar</month>

# Oktyabr capital case - one result for each version
>  az-AZ-exploration rg "Oktyabr" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1413:  <month type="10">Oktyabr</month>

cldr-common-36.1/common/main/az.xml
1413:  <month type="10">Oktyabr</month>

cldr-common-35.1/common/main/az.xml
1421:  <month type="10">Oktyabr</month>

# Oktyabr lower case - one result for each version
>  az-AZ-exploration rg "oktyabr" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1369:  <month type="10">oktyabr</month>

cldr-common-36.1/common/main/az.xml
1369:  <month type="10">oktyabr</month>

cldr-common-35.1/common/main/az.xml
1377:  <month type="10">oktyabr</month>

We have a winner! From version 36 onward, we get “yanvar” as lower case for January, while “Fevral” for February stays capitalised for all versions. The same pattern repeats with March and December. Version 35, by comparison, has both Yanvar and Fevral (and all the other months) capitalised.

Data sources

Something I found interesting: the data for months appears in two places, once in a “months” entry, and once in a “calendar” entry (again, for the Gregorian calendar).

The “months” entry has consistent capitalisation throughout. They are all lower case; “yanvar”, “fevral” and so on.

This hints to me that Firefox and Node use the “calendar” entry for the names of the months in this case. It makes sense, because if you’ll recall our original function, we go through a Date object’s toLocaleString, which deals with dates directly, rather than canonical names or anything of that sort.

Changelog, Contributions

I was curious as to what changed in version 36 onward.

Diving into the Changelog for the CLDR data version 36 we find the following line:

Additionally, the following locales had at least a 15% increase in basic coverage: az (Azerbaijani / Latin script), so (Somali / Latin script).

The inconsistent months might have been entered accidentally, or were caused somehow when the coverage was expanded.

Future steps

This is a lot of words for what was, at least in our codebase, a simple change: update the test to match the data (a 3-line change), alongside a description of why that is OK (200 words in the PR, and however many words this post is).

I am not keen on capitalising the months ourselves (today’s hotfix is tomorrow’s footgun), but we might do that specifically for Azerbaijani, with an inverse test case to notify us when the data is updated.
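If we did go down that road, it could look something like the sketch below. Both function names are hypothetical. The capitalisation goes through toLocaleUpperCase with the locale tag, because Azerbaijani has its own casing rules for the letter i. The second function is the "inverse" check: once it returns false for the real data, the workaround can be deleted.

```javascript
// Hypothetical helper: upper-cases only the first letter, using the locale's casing rules.
function capitaliseFirst(name, localeTag) {
  const [first = '', ...rest] = name; // destructuring iterates by code point
  return first.toLocaleUpperCase(localeTag) + rest.join('');
}

// Hypothetical inverse check: true while some, but not all, names are capitalised.
function hasInconsistentCapitalisation(months, localeTag) {
  const capitalised = months.filter(
    (name) => name === capitaliseFirst(name, localeTag)
  );
  return capitalised.length > 0 && capitalised.length < months.length;
}

// The first four months as reported in the failing test.
const fromData = ['yanvar', 'Fevral', 'mart', 'Aprel'];

console.log(fromData.map((name) => capitaliseFirst(name, 'az-AZ')));
console.log(hasInconsistentCapitalisation(fromData, 'az-AZ')); // true
```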

Another thing I am looking into is contributing the consistent capitalisation to the CLDR. Ideally, I would like to submit it as something to be approved by a native speaker, because who the heck am I to say what the capitalisation of months in Azerbaijani should be!

I have not really researched the CLDR process, so this might all be simple.

Wrapping up

Long story short: sometimes, it is the data.

This whole process was some of the most fun I’ve had at work this month! I love it when the different abstraction layers (specs, JS APIs, JS hosts, CLDR data, bugs, messiness) fall into place. Localisation and internationalisation take a lot of material effort, so diving into it makes me appreciate it much more.

In this case, I am also fond of our team’s past selves. We had the tests in place, and had already gone into the ICU/CLDR rabbit hole a year ago, filing the Chrome bug. It was both a time-saver, and brought a smile to my face.

I hope I managed to impart at least a glimpse of that fun to you, and that you found something interesting here.

I’ll be happy to discuss this post and any linked resources!

Appendix: When this method of getting months goes wrong

As mentioned previously, we go through a Date object’s toLocaleString to get the array of months.

However, because the formatting happens in the context of a date, languages with different cases might inflect the month.

When running this function for Greek, we get the following:

> console.log(getArrayOfMonths('el-GR'));

[
  'Ιανουαρίου',  'Φεβρουαρίου',
  'Μαρτίου',     'Απριλίου',
  'Μαΐου',       'Ιουνίου',
  'Ιουλίου',     'Αυγούστου',
  'Σεπτεμβρίου', 'Οκτωβρίου',
  'Νοεμβρίου',   'Δεκεμβρίου'
]

All of these months are in the genitive case (denoting possession). This is the equivalent of saying “x of January”, “y of February” and so on in English. In our site, we use this function in the context of birthdays, so it ends up being ok! If, however, we wanted to only list the months, it would technically be wrong (we’d need the nominative case). Make sure to test for your use-case, and beware of tutorials only assuming English language rules.

I am not certain how I would go about listing the months in the nominative, at least with the Date object. Intl has a draft (Stage 3) family of APIs called Intl.DisplayNames that “enables the consistent translation of language, region and script display names”. Would something similar for month names be desirable? I’m not sure! Let me know if you know of an approach.
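This is not an answer, but here is a small sketch for at least seeing which forms an environment produces. It compares the month formatted on its own with the month pulled out of a full date via formatToParts. The function and property names are made up, and whether the two differ for a given locale depends on the locale data, so only the English output is shown.

```javascript
// Hypothetical helper: the month formatted alone, and as part of a full date.
function monthForms(localeTag, monthIndex) {
  const date = new Date(1970, monthIndex, 15);

  const monthOnly = new Intl.DateTimeFormat(localeTag, {
    month: 'long',
  }).format(date);

  const parts = new Intl.DateTimeFormat(localeTag, {
    day: 'numeric',
    month: 'long',
    year: 'numeric',
  }).formatToParts(date);
  const withinDate = parts.find((part) => part.type === 'month').value;

  return { monthOnly, withinDate };
}

console.log(monthForms('en-US', 0)); // { monthOnly: 'January', withinDate: 'January' }
console.log(monthForms('el-GR', 0)); // varies with the locale data
```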
