Fuqiao Xue

The Ethical Imperative of Character Encoding

The World Wide Web was conceived as a space without borders: a medium where information could flow freely, where anyone could participate, and where all human expression would find digital form. This vision was articulated in the Web's founding documents and realized through decades of collaborative standards development. It rests on a simple foundation: the ability to represent text.

Yet beneath this simplicity lies one of the most profound questions of our digital age: Who can be named on the Web? The answer is determined by the architecture of character encoding. Getting that architecture right is an ethical imperative, because it defines the boundaries of digital citizenship itself.

I. The Name as a Digital Gateway

A name is far more than an identifier. It is the key that unlocks participation in modern society. For example, tens of millions of people in China have names that contain rare characters. In addition, a significant number of place names and ancient texts are difficult to digitize because they contain rare characters.

Consider what happens when a person's name cannot be represented in a digital system. In Huaiyuan county, Anhui province, more than 1,000 locals from a single village were forced to change their surnames because the character could not be typed into the computer system for registration. Their surname, Chi, written in traditional Chinese, cannot be found in the simplified Chinese font used by the system. As a result, these residents face difficulties in traveling, wealth management, and education, since a full name is required for most computer-based services.

This problem extends far beyond a single village. Government and public services in China are accelerating their digital transformation and moving operations online. However, many personal names and place names cannot be typed into these systems, which causes trouble in transactions such as opening bank accounts and purchasing transportation tickets. In another striking example, a village in Yunnan Province became famous when a local surname, "Nia", meaning "a flying bird", could not be typed into computers. Many villagers with this family name were forced to register for ID cards using a similar character meaning "duck" instead.

When individuals cannot open bank accounts, purchase train tickets, register property, or obtain identification documents because their names exist outside the boundaries of supported character sets, a technical limitation has been transformed into a barrier to essential services.

II. The Architecture

As of Unicode 17.0, Unicode defines a total of 101,996 CJK Unified Ideograph characters. This represents a remarkable achievement in technical standardization. The Ideographic Research Group (IRG) carefully catalogs, reviews, and assesses CJK characters for inclusion in the standard. The only real limitation on the number of CJK characters in the standard is the ability of the IRG to process them.

Yet encoding a character in Unicode is only the first step in a longer journey. Several challenges remain.

The Encoding Gap

While Unicode has grown to encompass an impressive repertoire, gaps persist. Some people's names still contain characters that have not yet been encoded in Unicode.

The Implementation Gap

GB 18030, China's national character encoding standard, covers more than 80,000 Chinese characters. However, most computers support the input and display of only about 30,000 commonly used characters.

A character may exist in the international standard, yet remain inaccessible to ordinary users for several reasons. Input methods may not support it, preventing users from typing the character in the first place. System fonts may not include glyphs for it, leaving only empty boxes or placeholder symbols on screen. Application software may fail to render or process it correctly, corrupting data as it moves through different programs. Furthermore, legacy systems often use incompatible encoding schemes, creating barriers when older infrastructure must interact with modern standards.
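The encoding layer of this gap can be probed directly. The following is a minimal Python sketch, not a definitive check: the helper name `unsupported_chars` and the sample characters are illustrative assumptions, and it tests only whether a given encoding can represent a character, not whether a font or input method supports it.

```python
def unsupported_chars(name: str, encoding: str) -> list[str]:
    """Return the characters in `name` that `encoding` cannot represent."""
    missing = []
    for ch in name:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Illustrative sample (an assumption, not a name from the cases above):
# U+9F8D is a traditional-form character in the Basic Multilingual Plane,
# U+20BB7 is a character in CJK Unified Ideographs Extension B.
name = "\u9f8d\U00020BB7"

for enc in ("gb2312", "gbk", "gb18030", "utf-8"):
    missing = unsupported_chars(name, enc)
    # Print code points rather than glyphs so the output is console-safe.
    print(enc, [f"U+{ord(c):04X}" for c in missing])
```

With Python's standard codecs, the older `gb2312` codec should reject both sample characters, `gbk` should reject only the Extension B character, and `gb18030` and `utf-8` should accept both. A system that still stores names in one of the older encodings has no way to hold such a name, regardless of what Unicode defines.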

The Interoperability Gap

When different systems use different code points for the same character, whether through legacy Private Use Area assignments or historical encoding decisions, name comparison fails across systems. A person whose identity is encoded one way in the banking system and another way in the transportation system may be unable to prove they are who they say they are.
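A short Python sketch can make the comparison failure concrete. The scenario is hypothetical: it assumes one system stored a rare character at a vendor-specific Private Use Area code point (U+E000 is chosen arbitrarily) while another stored it at a standard code point. The helper name `private_use_chars` is likewise an assumption for illustration.

```python
import unicodedata


def private_use_chars(text: str) -> list[str]:
    """Return the characters in `text` whose general category is Co (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


# Hypothetical records for the same person in two systems.
bank_record = "\u5f20\u9f8d"     # rare character stored at a standard code point
ticket_record = "\u5f20\ue000"   # same glyph stored at an assumed vendor PUA code point

# The two strings look identical to the user but differ as data.
print(bank_record == ticket_record)

# Flag records that need mapping to standard code points before comparison.
for record in (bank_record, ticket_record):
    flagged = private_use_chars(record)
    print([f"U+{ord(c):04X}" for c in flagged])
```

Detecting private-use code points only identifies the records at risk. Mapping them to standard code points still requires the vendor's own mapping table, because the meaning of a private-use code point is not defined by Unicode.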

III. The Cascade of Consequences

The inability to represent one's name digitally initiates a cascade of consequences.

Without the ability to register their legal names, individuals cannot fully participate in the financial system. They cannot open accounts in their own names or engage in legitimate commerce. They are pushed to the margins of the digital economy.

Government services increasingly move online. Those whose names cannot be processed become administratively invisible.

Names connect individuals to their families, their communities, and their heritage. Every rare character is part of that cultural heritage and should not be lost in the digital era. When people are forced to change their names to fit the constraints of computer systems, we witness a form of cultural erasure: the subordination of human identity to technical limitation.

The pressure to adopt "computer-compatible" names affects not only those living today but shapes naming practices for future generations. Traditional names, passed down through centuries, are abandoned. Regional variations and minority group naming conventions are homogenized. The rich diversity of onomastic tradition gives way to a narrower repertoire of "safe" choices.

IV. A Multi-Stakeholder Responsibility

Addressing this challenge requires coordinated action across multiple stakeholder groups. No single actor can solve this problem alone.

Standards Bodies

Newly proposed CJK unified ideographs are first submitted to the IRG through national bodies or liaison organizations. They are then assembled into a new "IRG Working Set" that goes through several rounds of detailed review before being approved for standardization as a new CJK Unified Ideographs extension block. Individuals who wish to propose the encoding of new CJK unified ideographs are encouraged to work with their respective country's national body.

The processes for proposing new characters must be accessible and efficient. The time between identifying an important unencoded character and its inclusion in the standard must be minimized.

Platform Providers and Software Developers

Platform providers must move beyond minimum compliance with standards to comprehensive implementation. This means:

  • Ensuring system fonts include glyphs for all encoded characters
  • Providing input methods that can access the full Unicode repertoire
  • Testing software against the full range of encoded characters, not merely common subsets
  • Supporting newly encoded characters as they are added to the standard
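As one possible starting point for such testing, the Python sketch below round-trips a handful of sample names through an encoding and compares code point counts with UTF-16 code unit counts. The sample list and the helper names `round_trips` and `utf16_units` are assumptions for illustration. A real test suite would draw on far larger data and would also exercise fonts, input methods, and storage layers.

```python
# Illustrative test data covering several CJK blocks.
SAMPLE_NAMES = [
    "\u5f20\u4f1f",   # two common characters in the main CJK Unified Ideographs block
    "\u9f8d",         # a traditional-form character in the same block
    "\u3400",         # first character of Extension A
    "\U00020BB7",     # a character in Extension B (supplementary plane)
    "\U00030000",     # first character of Extension G (supplementary plane)
]


def round_trips(text: str, encoding: str = "utf-8") -> bool:
    """Return True if `text` survives an encode/decode cycle unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, the unit many platforms use for string length."""
    return len(text.encode("utf-16-le")) // 2


for name in SAMPLE_NAMES:
    code_points = [f"U+{ord(c):04X}" for c in name]
    # A mismatch between len(name) and utf16_units(name) marks names that
    # length limits or truncation based on code units can corrupt.
    print(code_points, len(name), utf16_units(name), round_trips(name))
```

Supplementary-plane characters occupy two UTF-16 code units each, so software that counts or truncates by code unit can split such a character in half. Including these characters in test data is a cheap way to surface that class of bug.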

Government Agencies

Governments must recognize character encoding as critical digital infrastructure and invest accordingly. This includes conducting comprehensive surveys to identify affected populations and establishing clear pathways for individuals to report encoding problems. Governments should also mandate comprehensive support for Unicode in both government systems and those of contractors.

Industry

Comprehensive font support is needed. Developers need to ensure that the chosen fonts include the necessary glyphs to accurately render Unicode characters. Proper font selection is crucial to maintain visual consistency and legibility in text.

Industry must treat character support not as a feature but as a fundamental accessibility requirement.

V. Building Awareness

Cases like the Yunnan village surname issue have drawn media coverage and online discussion. When people see that someone cannot get an ID card because their name uses a rare character, encoding stops being an abstract technical problem.

VI. Toward a Principled Framework

Every person should be able to represent their legal name accurately in any digital system that requires identification. This is not a matter of convenience but of fundamental dignity. The name is the primary marker of individual identity, and the inability to represent it constitutes a form of digital exclusion.

A name encoded in one system must be recognized and correctly processed in all other systems. The fragmentation of identity across incompatible encoding schemes imposes unacceptable burdens on individuals and undermines the integrity of identification systems.

Character encoding decisions should support, not undermine, the preservation of cultural heritage. Rare characters often carry particular cultural significance, as in family names passed down through generations or in place names that encode local history.

Rather than waiting for individuals to report that their names cannot be represented, responsible actors should proactively identify gaps and address them. This means conducting systematic surveys, engaging with affected communities, and establishing clear processes for adding needed characters.

VII. The Character of the Web

Character encoding is not merely a technical specification but a statement of values.

When we choose which characters to encode, implement, and support, we are choosing who can participate fully in the digital age. We are drawing the boundaries of the digital commons. We are determining whose names will be spoken by our machines and whose will be transformed, truncated, or erased.

Every name, every script, every writing tradition that has meaning for a community deserves a place in our digital infrastructure.

Unicode allows you to deal simply with almost all scripts and languages in use around the world. In this way Unicode simplifies the handling of content in multiple languages, whether within a single page or across one or more sites.

The technical capability exists. The standards framework is in place. What remains is the commitment from standards bodies, governments, industry, and civil society to complete the work of character support. This commitment honors not only those affected today but the generations to come, who deserve to inherit a digital world that can speak their names.

This is the Web we are called to build.
