CrabPascal

Posted on Jun 4 • Edited on Jun 9

Unicode and UTF-16 String Semantics in CrabPascal (v2.16.0)

#pascal #rust #compiler #delphi


Delphi's modern UnicodeString model measures strings in UTF-16 code units, not bytes. Sprint 8 (v2.16.0) aligned CrabPascal's built-in string functions with that model, a breaking but necessary step for internationalized apps.

What changed

The pascal_strings functions Length, Copy, and Pos now count and index in UTF-16 code units, obtained through encode_utf16. The internal Value::String storage may still be UTF-8 for pragmatic runtime reasons.

program UnicodeDemo;
uses
  System.SysUtils;

var
  S: string;
begin
  S := 'año';           // n with tilde
  WriteLn('Length = ', Length(S));  // 3 code units, not 4 bytes

  S := '🦀';            // crab emoji
  WriteLn('Emoji length = ', Length(S));  // 2 (surrogate pair)
end.

If you assumed byte lengths out of C habit, Length now returns different results for non-ASCII text from v2.16.0 onward. That is the correct Delphi behavior.
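To make the difference between the metrics concrete, here is a minimal Rust sketch of the counting idea. The function names utf16_len and string_metrics are illustrative assumptions, not CrabPascal's actual internals; only encode_utf16 is named in the post.

```rust
/// Length as Delphi's `Length` reports it: the number of UTF-16 code units.
/// `utf16_len` is a hypothetical name used for illustration only.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

/// The three metrics side by side:
/// (UTF-8 bytes, Unicode scalar values, UTF-16 code units).
fn string_metrics(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), utf16_len(s))
}
```

For 'año' the three numbers are 4, 3, and 3; for the crab emoji they are 4, 1, and 2. Only the last column matches what Length reports under the UTF-16 model.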

Why UTF-16 code units?

Delphi has used UTF-16 for string since Delphi 2009, when string became an alias for UnicodeString. Compatibility means:

Copy(S, i, n) slices code units, which may split surrogate pairs if you are careless.
Pos searches in the same unit space Delphi uses.
Generated C in stubs.c mirrors UTF-16 counting so run and build-exe stay aligned.
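The points above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the pascal_strings source: the names copy_utf16 and pos_utf16 are hypothetical, an index of 0 is treated as 1, and a split surrogate half is replaced with U+FFFD because a Rust String cannot hold a lone surrogate. How CrabPascal itself represents a split pair is not covered here.

```rust
/// Delphi-style Copy: `index` is 1-based, `count` is in UTF-16 code units.
/// Hypothetical helper; this sketch treats an index of 0 as 1.
fn copy_utf16(s: &str, index: usize, count: usize) -> String {
    let units: Vec<u16> = s.encode_utf16().collect();
    let start = index.max(1) - 1;
    if start >= units.len() {
        return String::new();
    }
    let end = start.saturating_add(count).min(units.len());
    // A slice that cuts through a surrogate pair leaves a lone surrogate,
    // which `from_utf16_lossy` replaces with U+FFFD (sketch assumption).
    String::from_utf16_lossy(&units[start..end])
}

/// Delphi-style Pos: 1-based position of `needle` in `haystack`, counted in
/// UTF-16 code units; 0 when absent. This sketch returns 0 for an empty needle.
fn pos_utf16(needle: &str, haystack: &str) -> usize {
    let n: Vec<u16> = needle.encode_utf16().collect();
    let h: Vec<u16> = haystack.encode_utf16().collect();
    if n.is_empty() || n.len() > h.len() {
        return 0;
    }
    h.windows(n.len())
        .position(|w| w == n.as_slice())
        .map_or(0, |i| i + 1)
}
```

With a crab emoji followed by 'b', pos_utf16 reports 'b' at position 3, because the emoji occupies positions 1 and 2. Copying a single unit from position 1 of that string cuts the pair in half, which is the careless case the list warns about.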

Test gates

cargo test --test unicode_conformance
cargo test --test string_conformance
cargo test pascal_strings

The unicode_conformance suite locks in cases like año → 3 and emoji → 2 code units. The ASCII regression cases in string_conformance stay green.

Practical guidance

For user-visible text, truncate on whole grapheme boundaries in the UI layer. At the Pascal RTL level, match Delphi and count code units:

function SafePrefix(const S: string; MaxUnits: Integer): string;
begin
  if Length(S) <= MaxUnits then
    Result := S
  else
    Result := Copy(S, 1, MaxUnits);
end;

Document that this truncates by UTF-16 code units, not Unicode scalar values, so a cut at MaxUnits can land between the two halves of a surrogate pair.
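If splitting a pair is unacceptable, one option is to back off by one unit when the cut lands right after a leading surrogate. The Rust sketch below shows the idea; prefix_whole_pairs is a hypothetical name, and this is not something the post says CrabPascal ships.

```rust
/// Truncate to at most `max_units` UTF-16 code units without splitting a
/// surrogate pair. Hypothetical helper for illustration.
fn prefix_whole_pairs(s: &str, max_units: usize) -> String {
    let units: Vec<u16> = s.encode_utf16().collect();
    if units.len() <= max_units {
        return s.to_owned();
    }
    let mut end = max_units;
    // If the last kept unit is a high (leading) surrogate, its partner would
    // be cut off, so drop the leading half as well.
    if end > 0 && (0xD800..=0xDBFF).contains(&units[end - 1]) {
        end -= 1;
    }
    // The input came from a valid &str and the cut no longer falls inside a
    // pair, so no replacement characters are produced here.
    String::from_utf16_lossy(&units[..end])
}
```

This keeps surrogate pairs intact but still ignores grapheme clusters: a combining accent or an emoji joined by zero-width joiners can be separated from its base, which is why grapheme-aware truncation belongs in the UI layer.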

IO and JSON roadmap

Sprint 8 focused on builtin semantics, not full wide PChar buffers or explicit UTF-8 ↔ UTF-16 conversion at every IO boundary. Horse JSON examples mostly stay ASCII in tests; production apps with mixed encodings should validate edge cases explicitly.
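For readers who need an explicit conversion at an IO boundary in the meantime, a minimal Rust sketch using only the standard library might look like this. The function names are hypothetical and do not describe a CrabPascal API; the point is that decoding should reject unpaired surrogates instead of silently accepting them.

```rust
use std::string::FromUtf16Error;

/// Encode a UTF-8 Rust string as UTF-16 code units (hypothetical helper).
fn to_utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decode UTF-16 code units back to a UTF-8 String, returning an error for
/// unpaired surrogates instead of substituting a replacement character.
fn from_utf16_units(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

Round-tripping 'año' or the crab emoji through both functions returns the original text, while a buffer holding only the first half of a surrogate pair produces an Err that the caller can handle explicitly.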

Migration from v2.13.0

If you wrote tests asserting UTF-8 byte counts or code point counts, update the expectations when upgrading to v2.16.0 or later. ASCII-only code behaves the same; string metrics for internationalized text change by design.

Summary

Unicode is where "mostly compatible" compilers fail silently. v2.16.0 chooses honest Delphi semantics over simpler byte math — setting up Sprint 9–10 parity work so native binaries and the interpreter agree on string output for Portuguese, emoji, and API payloads alike.


Published on dev.to/@crabpascal · Code at CrabPascal
