December 19, 2025
How Ruby, Parser Theory, and Linguistic Precision Solved a Problem No One Wanted
At RubyWorld Conference 2025, Mu-Fan Teng (鄧慕凡), founder of 5xRuby and long-time Ruby community leader, presented a talk that quietly demonstrated something powerful: compiler theory is not limited to programming languages.
In “Parsing Taiwanese Like Code”, Teng showed how a real-world linguistic problem—aligning Taiwanese Romanization with Chinese characters—was solved not with ad-hoc scripts, but with proper parser architecture in Ruby.
What began as an “unwanted” government project ultimately became a case study in how choosing the right abstraction can unlock both correctness and elegance.
The problem nobody wanted to touch
Taiwanese (台語) is written using:
- Han characters (漢字)
- Romanized phonetics, officially standardized as Tâi-lô / POJ
Unlike Mandarin Pinyin, Taiwanese Romanization:
- Uses hyphens to separate syllables
- Uses double hyphens (--) to represent pauses
- Includes tone marks as Unicode combining characters
Example:
漢字: 紲落來看新竹市
POJ: suà-lo̍h lâi-khuànn Sin-tik-tshī
The challenge is not just tokenization—it is alignment :
- One syllable corresponds to one Han character
- Hyphens matter semantically
- Roman words may appear inside the Han stream
This complexity explains why multiple vendors declined to bid on the project. The problem wasn’t Ruby. It was structure.
The first solution: a handcrafted 3-phase pipeline
The initial implementation followed an implicit compiler-like structure:
Phase 1 — Normalization (WASH)
- Insert spaces around punctuation
- Preserve hyphens and double hyphens
- Normalize Roman and Kanji independently
- Over 65 gsub patterns applied
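A minimal sketch of what such a WASH step might look like in Ruby (the patterns here are illustrative stand-ins, not the project's actual 65+ rules):

```ruby
# Illustrative normalization pass: pad punctuation, isolate pauses,
# collapse whitespace. The real pipeline applied over 65 such rules.
def wash(text)
  text
    .gsub(/([，。！？、,.!?])/, ' \1 ') # insert spaces around punctuation
    .gsub(/\s*(--)\s*/, ' \1 ')        # isolate double hyphens (pauses)
    .gsub(/\s+/, ' ')                  # collapse runs of whitespace
    .strip
end

wash("abc--def，漢字") # => "abc -- def ， 漢字"
```

Note that single hyphens inside Roman words are deliberately left untouched; only the double hyphen is spaced out as a pause marker.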
Phase 2 — Tokenization (SPLIT)
- CJK-aware regex for Han characters
- Space-based splitting for Roman text
- Careful handling of edge cases
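A hedged sketch of such a tokenizer (the helper name split_tokens is an assumption for illustration):

```ruby
# Illustrative SPLIT pass: whitespace-delimited Roman tokens, and one
# token per Han character inside any CJK chunk.
def split_tokens(text)
  text.split(/\s+/).flat_map do |chunk|
    if chunk.match?(/\p{Han}/)
      chunk.scan(/\p{Han}|[^\p{Han}]+/) # each Han character stands alone
    else
      [chunk]
    end
  end
end

split_tokens("sin-tik 新竹市") # => ["sin-tik", "新", "竹", "市"]
```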
Phase 3 — Alignment & Validation
- Count syllables via hyphens
- Match syllable count to Han character count
- Validate that nothing was lost or duplicated
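The counting step can be sketched in a few lines (method names here are illustrative, not the project's actual API):

```ruby
# Illustrative alignment check: a hyphenated Roman word must supply
# exactly as many syllables as its Han span has characters.
def syllable_count(roman_word)
  roman_word.split('-').reject(&:empty?).size
end

def aligned?(roman_word, han_span)
  syllable_count(roman_word) == han_span.each_char.count
end

aligned?("Sin-tik-tshī", "新竹市") # => true
```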
This worked. But it felt… wrong.
The code was fragile, regex-heavy, and difficult to reason about. Maintenance would be costly.
Then came the realization.
“What I built was already a parser”
After attending a RubyConf talk on Ruby’s own grammar conflicts, Teng had a moment of clarity:
“I wasn’t writing string manipulation code. I was already doing lexical analysis, syntax analysis, and semantic analysis.”
In other words: this was a parser.
So the solution was rewritten—properly.
Enter Parslet: Ruby as a language-tooling platform
Using the Parslet gem, the system was re-implemented with an explicit grammar:
Phase 1 — Lexical Analysis
Tokens were defined for:
- Hyphenated syllables
- Double hyphens
- Punctuation
- Tone-marked characters
Hyphens were preserved by design , not by convention.
Phase 2 — Syntax Analysis
Grammar rules ensured:
- Hyphenated POJ words remained a single token
- Punctuation was structurally distinct
- Ordering mattered (PEG semantics)
Phase 3 — Semantic Analysis
AST transformations:
- Count syllables from token structure
- Map syllable counts directly to Han character spans
- Produce aligned Roman/Kanji arrays safely
No fragile regex chains. No post-hoc fixes. Just structure.
Why this approach works
The key architectural decision was one-way dependency :
Kanji processing depends on the Roman parser—not the other way around
Romanized Taiwanese is more structurally complex:
- Syllables
- Tone marks
- Pause semantics
Once Roman tokens are parsed correctly, Han alignment becomes trivial:
- n syllables → n characters
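The mapping itself then reduces to walking the two streams in step (a sketch; align is an assumed helper name):

```ruby
# Illustrative alignment: the Roman side dictates how many Han
# characters each word consumes; the Han side just follows along.
def align(roman_words, han_text)
  chars = han_text.each_char.to_a
  roman_words.map do |word|
    n = word.split('-').size        # one syllable per Han character
    [word, chars.shift(n).join]     # consume n characters from the stream
  end
end

align(["suà-lo̍h", "lâi-khuànn", "Sin-tik-tshī"], "紲落來看新竹市")
# => [["suà-lo̍h", "紲落"], ["lâi-khuànn", "來看"], ["Sin-tik-tshī", "新竹市"]]
```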
This inversion of responsibility eliminated entire classes of bugs.
Real-world results
This was not an academic exercise.
- 3,000 real corpus records
- 100% parse success
- Zero errors
- Deployed as the official Taiwanese language corpus system
- Commissioned by Taiwan’s Ministry of Education
Ruby didn’t just pass—it excelled.
Why this matters to engineers
This talk is not about Taiwanese linguistics.
It’s about:
- Recognizing when a problem is structural
- Applying compiler theory beyond compilers
- Knowing when regex is no longer enough
- Using Ruby as a language-engineering tool
The lesson is universal:
With the right abstractions, complex problems become obvious.
Final thought
Programming languages and natural languages are not so different. They both have grammar. They both have meaning. And they both benefit from being treated with respect.
Or, as the Ruby community often proves:
Elegance scales—when structure comes first.



