Germán Alberto Gimenez Silva

Originally published at rubystacknews.com

Parsing Taiwanese Like Code

December 19, 2025

How Ruby, Parser Theory, and Linguistic Precision Solved a Problem No One Wanted

At RubyWorld Conference 2025, Mu-Fan Teng (鄧慕凡), founder of 5xRuby and long-time Ruby community leader, presented a talk that quietly demonstrated something powerful: compiler theory is not limited to programming languages.

In “Parsing Taiwanese Like Code”, Teng showed how a real-world linguistic problem, aligning Taiwanese Romanization with Chinese characters, was solved not with ad-hoc scripts but with a proper parser architecture in Ruby.

What began as an “unwanted” government project ultimately became a case study in how choosing the right abstraction can unlock both correctness and elegance.



The problem nobody wanted to touch

Taiwanese (台語) is written using:

  • Han characters (漢字)
  • Romanized phonetics, officially standardized as Tailo / POJ

Unlike Mandarin Pinyin, Taiwanese Romanization:

  • Uses hyphens to separate syllables
  • Uses double hyphens (--) to represent pauses
  • Includes tone marks as Unicode combining characters

Example:


漢字: 紲落來看新竹市
POJ: suà-lo̍h lâi-khuànn Sin-tik-tshī
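The tone marks in the romanized line deserve a closer look, because some of them have no precomposed Unicode form. The sketch below is only an illustration (the sample strings are written with explicit escapes so the codepoints are unambiguous); it shows that codepoint counts and visible character counts can differ, which any tokenizer for this text has to account for.

```ruby
# "lo̍h": the vertical-line tone mark is U+030D, a combining character
# with no precomposed equivalent, so it survives NFC normalization.
loh = "lo\u030Dh"

# "suà": the grave accent can be stored precomposed (U+00E0)
# or decomposed as "a" followed by U+0300.
sua_decomposed = "sua\u0300"
sua_composed   = sua_decomposed.unicode_normalize(:nfc)

puts loh.codepoints.length             # number of codepoints
puts loh.grapheme_clusters.length      # number of user-visible characters
puts sua_decomposed.codepoints.length  # decomposed form
puts sua_composed.codepoints.length    # composed form
puts loh.match?(/\p{M}/)               # does the string contain a combining mark?
```

Normalizing both inputs to the same form before comparing or counting avoids a whole family of "same text, different bytes" surprises.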


The challenge is not just tokenization; it is alignment:

  • One syllable corresponds to one Han character
  • Hyphens matter semantically
  • Roman words may appear inside the Han stream

This complexity explains why multiple vendors failed to bid on the project. The problem wasn’t Ruby. It was structure.




The first solution: a handcrafted 3-phase pipeline

The initial implementation followed an implicit compiler-like structure:

Phase 1 — Normalization (WASH)

  • Insert spaces around punctuation
  • Preserve hyphens and double hyphens
  • Normalize Roman and Kanji independently
  • Over 65 gsub substitution patterns applied
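To make the normalization phase concrete, here is a minimal sketch of what a rule table driven by `String#gsub` might look like. The two rules and the `wash` method name are hypothetical stand-ins chosen for illustration; the talk's actual patterns are not reproduced here.

```ruby
# Hypothetical normalization rules; the real system reportedly used over 65.
WASH_RULES = [
  # Pad ASCII and full-width punctuation with a space on each side.
  [/\s*([,.!?;:，。！？；：、])\s*/, ' \1 '],
  # Collapse runs of spaces, tabs and ideographic spaces into one space.
  [/[ \t\u3000]+/, ' ']
].freeze

# Apply every rule in order. Hyphens are never touched, so both
# "-" and "--" pass through unchanged.
def wash(text)
  WASH_RULES.reduce(text.unicode_normalize(:nfc)) do |acc, (pattern, replacement)|
    acc.gsub(pattern, replacement)
  end.strip
end

puts wash("sua-loh,lai-khuann  Sin-tik-tshi.")
puts wash("khi--ah")
```

Even in this tiny form, the ordering of the rules matters: the whitespace collapse has to run after the punctuation padding. With dozens of rules, that implicit ordering is exactly what makes such a pipeline hard to maintain.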

Phase 2 — Tokenization (SPLIT)

  • CJK-aware regex for Han characters
  • Space-based splitting for Roman text
  • Careful handling of edge cases
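The splitting phase can be sketched in a few lines. This is an assumption about how such a step could be written, not the project's code: `split_han` and `split_roman` are hypothetical names, and the second sample string is made up to show a Roman word embedded in the Han stream.

```ruby
# Han side: each Han character is its own token; any run of non-Han,
# non-space characters (an embedded Roman word, punctuation) stays together.
def split_han(text)
  text.scan(/\p{Han}|[^\p{Han}\s]+/)
end

# Roman side: after normalization, whitespace is the only separator,
# so hyphenated words such as "Sin-tik-tshi" survive as one token.
def split_roman(text)
  text.split
end

p split_han("紲落來看新竹市")
p split_han("來看Ruby")
p split_roman("sua-loh lai-khuann Sin-tik-tshi")
```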

Phase 3 — Alignment & Validation

  • Count syllables via hyphens
  • Match syllable count to Han character count
  • Validate that nothing was lost or duplicated
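The validation step boils down to a counting invariant. A minimal sketch, assuming punctuation has already been stripped from the romanized line (the method names are illustrative):

```ruby
# Syllables are separated by spaces, single hyphens or double hyphens.
def syllable_count(roman_line)
  roman_line.split(/[\s-]+/).reject(&:empty?).size
end

# Only Han characters count on the other side.
def han_count(han_line)
  han_line.scan(/\p{Han}/).size
end

# The alignment invariant: one syllable per Han character.
def aligned?(roman_line, han_line)
  syllable_count(roman_line) == han_count(han_line)
end

roman = "sua-loh lai-khuann Sin-tik-tshi" # tone marks omitted for brevity
han   = "紲落來看新竹市"

puts syllable_count(roman)
puts han_count(han)
puts aligned?(roman, han)
```

A mismatch between the two counts is the signal that a character or syllable was lost or duplicated somewhere upstream.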

This worked. But it felt… wrong.

The code was fragile, regex-heavy, and difficult to reason about. Maintenance would be costly.

Then came the realization.


“What I built was already a parser”

After attending a RubyConf talk on Ruby’s own grammar conflicts, Teng had a moment of clarity:

“I wasn’t writing string manipulation code. I was already doing lexical analysis, syntax analysis, and semantic analysis.”

In other words: this was a parser.

So the solution was rewritten—properly.



Enter Parslet: Ruby as a language-tooling platform

Using the Parslet gem, the system was re-implemented with an explicit grammar:

Phase 1 — Lexical Analysis

Tokens were defined for:

  • Hyphenated syllables
  • Double hyphens
  • Punctuation
  • Tone-marked characters

Hyphens were preserved by design, not by convention.

Phase 2 — Syntax Analysis

Grammar rules ensured:

  • Hyphenated POJ words remained a single token
  • Punctuation was structurally distinct
  • Ordering mattered (PEG semantics)

Phase 3 — Semantic Analysis

AST transformations:

  • Count syllables from token structure
  • Map syllable counts directly to Han character spans
  • Produce aligned Roman/Kanji arrays safely
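The semantic step can then work on tokens instead of raw strings. The following sketch uses plain Ruby and assumes a token shape of `{ word: ... }` or `{ punct: ... }`, plus the simplifying assumption that each punctuation token lines up with exactly one character on the Han side; neither detail is confirmed by the talk.

```ruby
# Split a word token into syllables on single or double hyphens.
def syllables(word)
  word.to_s.split(/-+/).reject(&:empty?)
end

# Walk the Roman tokens and hand each one the Han span it is entitled to:
# n characters for an n-syllable word, one character for punctuation.
def align_tokens(roman_tokens, han_text)
  han = han_text.scan(/\p{Han}|[^\p{Han}\s]/) # one entry per character
  pairs = roman_tokens.map do |token|
    width = token.key?(:word) ? syllables(token[:word]).size : 1
    span = han.shift(width)
    raise ArgumentError, "Han text ran out at #{token.inspect}" if span.size < width
    [token.values.first.to_s, span.join]
  end
  raise ArgumentError, "unaligned Han characters: #{han.join}" unless han.empty?
  pairs
end

tokens = [{ word: "sua-loh" }, { punct: "," }, { word: "lai-khuann" }]
pp align_tokens(tokens, "紲落，來看")
```

Because the span width comes from the token structure, a mismatch surfaces as an explicit error at the token where it occurs.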

No fragile regex chains. No post-hoc fixes. Just structure.


Why this approach works

The key architectural decision was one-way dependency:

Kanji processing depends on the Roman parser—not the other way around

Romanized Taiwanese is more structurally complex:

  • Syllables
  • Tone marks
  • Pause semantics

Once Roman tokens are parsed correctly, Han alignment becomes trivial:

  • n syllables → n characters

This inversion of responsibility eliminated entire classes of bugs.


Real-world results

This was not an academic exercise.

  • 3,000 real corpus records
  • 100% parse success
  • Zero errors
  • Deployed as the official Taiwanese language corpus system
  • Commissioned by Taiwan’s Ministry of Education

Ruby didn’t just pass—it excelled.


Why this matters to engineers

This talk is not about Taiwanese linguistics.

It’s about:

  • Recognizing when a problem is structural
  • Applying compiler theory beyond compilers
  • Knowing when regex is no longer enough
  • Using Ruby as a language-engineering tool

The lesson is universal:

With the right abstractions, complex problems become obvious.


Final thought

Programming languages and natural languages are not so different. They both have grammar. They both have meaning. And they both benefit from being treated with respect.

Or, as the Ruby community often proves:

Elegance scales—when structure comes first.
