Văn Tuấn Lê

Posted on Jun 23

I built a free offline Chinese pinyin annotator in a single HTML file

#opensource #javascript #webdev #html

I live in China and kept running into the same problem: I'd see Chinese text
I couldn't fully read and needed to quickly see the pronunciation (pinyin)
above each character.

Every tool I found had at least one of these problems:

It was paywalled after 5 uses
It required creating an account
It sent your text to a server
It had a terrible UI from 2009

So I built one myself. Single HTML file. Fully offline after first load.
Nothing sent anywhere.

→ Try it live
→ GitHub repo

How it works

The dictionary

The core is a ~2,500-character lookup table embedded directly in the JS:

const raw = `的:de:0:1|一:yī:1:1|是:shì:4:1|了:le:0:1|我:wǒ:3:1...`
// format: character : pinyin : tone(1-4, 0=neutral) : hsk_level(1-6)

I store it as a pipe-delimited string and parse it once on load.
It covers ~97% of the characters in common written Chinese. Characters
outside the dictionary show a "?", but there aren't many in normal text.
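A minimal sketch of what that one-time parse could look like, assuming the four-field format shown above. The names parseDictionary and lookupPinyin and the five-entry sample string are hypothetical and not taken from the repo:

```javascript
// Hypothetical sample in the post's format: character:pinyin:tone:hsk_level
const sample = '的:de:0:1|一:yī:1:1|是:shì:4:1|了:le:0:1|我:wǒ:3:1';

// Parse the pipe-delimited string once into a Map for constant-time lookups.
function parseDictionary(raw) {
  const dict = new Map();
  for (const entry of raw.split('|')) {
    if (!entry) continue; // tolerate a trailing or doubled pipe
    const [char, pinyin, tone, hsk] = entry.split(':');
    dict.set(char, { pinyin, tone: Number(tone), hsk: Number(hsk) });
  }
  return dict;
}

// Characters missing from the table fall back to "?".
function lookupPinyin(dict, char) {
  const entry = dict.get(char);
  return entry ? entry.pinyin : '?';
}

const dict = parseDictionary(sample);
console.log(lookupPinyin(dict, '我')); // wǒ
console.log(lookupPinyin(dict, '龘')); // ?
```

A Map keyed by character keeps each lookup cheap, which matters when every character of a pasted article is looked up in turn.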

Ruby annotations

HTML has a built-in <ruby> tag for exactly this:

<ruby>
  <span class="char">中</span>
  <rt>zhōng</rt>
</ruby>

The rt element renders above the base character. No canvas tricks,
no absolute positioning — just semantic HTML doing what it was designed for.
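To make the markup step concrete, here is a rough sketch of turning a string into that ruby structure. The function names annotate and escapeHtml, the two-entry dictionary, and the choice to give unknown characters the neutral class t0 are all assumptions for illustration, not the repo's actual code:

```javascript
// Hypothetical two-entry dictionary; the real table has ~2,500 entries.
const dict = new Map([
  ['中', { pinyin: 'zhōng', tone: 1 }],
  ['文', { pinyin: 'wén', tone: 2 }],
]);

// Matches one character from the basic CJK Unified Ideographs block only.
const HAN = /^[\u4e00-\u9fff]$/;

// Escape characters that would otherwise be parsed as HTML.
function escapeHtml(s) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return s.replace(/[&<>"]/g, (c) => map[c]);
}

// Wrap each Han character in <ruby>; pass everything else through.
function annotate(text, dict) {
  let html = '';
  for (const ch of text) { // for...of iterates by code point
    if (!HAN.test(ch)) {
      html += escapeHtml(ch); // punctuation, Latin letters, digits
      continue;
    }
    const entry = dict.get(ch);
    const pinyin = entry ? entry.pinyin : '?';
    const tone = entry ? entry.tone : 0; // assumption: unknowns use t0
    html += `<ruby><span class="char">${ch}</span>` +
            `<rt class="t${tone}">${pinyin}</rt></ruby>`;
  }
  return html;
}

console.log(annotate('中文!', dict));
```

Putting the tone class on the rt element at render time is what lets the CSS color it later without touching the DOM again.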

Tone colors

Each pinyin string carries its tone in the data, and CSS classes handle
the rest:

.tone-on rt.t1 { color: #ff4d4d; }  /* 1st tone — red */
.tone-on rt.t2 { color: #ff9900; }  /* 2nd tone — orange */
.tone-on rt.t3 { color: #22c55e; }  /* 3rd tone — green */
.tone-on rt.t4 { color: #a78bfa; }  /* 4th tone — purple */
.tone-on rt.t0 { color: #8899aa; }  /* neutral — grey */

Toggle the class on the container and all tones update instantly
without re-rendering anything.
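The toggle itself can be a single DOM call. A small sketch, assuming a container element and a checkbox whose IDs (#output and #tone-toggle) are made up for this example:

```javascript
// Flip tone coloring by toggling one class on the container.
// The <rt> elements keep their t0-t4 classes; only the ancestor changes,
// so the browser restyles the text without any DOM rebuild.
function setToneColors(container, enabled) {
  // classList.toggle(token, force) adds the class when force is true
  // and removes it when force is false.
  return container.classList.toggle('tone-on', enabled);
}

// Browser-only wiring; '#output' and '#tone-toggle' are hypothetical IDs.
if (typeof document !== 'undefined') {
  const output = document.querySelector('#output');
  const checkbox = document.querySelector('#tone-toggle');
  if (output && checkbox) {
    checkbox.addEventListener('change', () => {
      setToneColors(output, checkbox.checked);
    });
  }
}
```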

HSK level highlight

Same pattern, with a CSS class on the container and a level class
(hsk1 through hsk6, or unk) on each character span:

.hsk-on .char-span.hsk1 { color: #7ee8bb; }
.hsk-on .char-span.hsk2 { color: #60d4b0; }
/* ... */
.hsk-on .char-span.unk  { color: #6a7a9a; } /* unknown */

This lets learners instantly see which characters are beginner vs.
advanced vs. completely outside the HSK vocabulary list.
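A sketch of how each span might pick up its level class, assuming dictionary entries carry an hsk field from 1 to 6 as in the data format. The helpers hskClass and charSpan are hypothetical names, and the markup uses the char-span class from the CSS above:

```javascript
// Map an entry's HSK level (1-6) to the class the stylesheet selects on.
// Missing entries or out-of-range levels get "unk".
function hskClass(entry) {
  const level = entry ? entry.hsk : 0;
  const valid = Number.isInteger(level) && level >= 1 && level <= 6;
  return valid ? `hsk${level}` : 'unk';
}

// Build the character span with its level class attached.
function charSpan(ch, entry) {
  return `<span class="char-span ${hskClass(entry)}">${ch}</span>`;
}

console.log(charSpan('我', { pinyin: 'wǒ', tone: 3, hsk: 1 }));
console.log(charSpan('龘', undefined));
```

Because the class is baked into the markup once, switching the highlight on or off is again just a class toggle on the container.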

The offline constraint

I wanted this to work with zero network after the first load — useful
if you're on a plane with a downloaded article, or in China where
connectivity to foreign tools can be unreliable.

Everything is embedded: the dictionary, the CSS, the JS. The HTML file
is ~180KB total. Download once, use forever.

What I learned

<ruby> line-height is annoying. Getting the ruby annotations to
not blow up the line spacing required some CSS gymnastics:

ruby {
  display: inline-flex;
  flex-direction: column-reverse;
  align-items: center;
  vertical-align: bottom;
  line-height: 1;
}

Polyphonic characters are a real problem. Many Chinese characters
have multiple pronunciations depending on context (e.g., 行 = xíng or
háng). I used the most common reading for each. A proper solution would
need NLP context analysis — out of scope for a single HTML file.
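One possible middle ground that still runs offline is a small word-level override table checked before the per-character fallback, using greedy longest match. This is not what the annotator does today, just a sketch of the idea; the tables below are tiny hypothetical samples, and the approach only helps for words that are actually listed:

```javascript
// Hypothetical per-character defaults (most common reading).
const charReadings = new Map([
  ['银', 'yín'], ['行', 'xíng'], ['人', 'rén'],
]);

// Hypothetical word-level overrides for polyphonic characters.
const wordReadings = new Map([
  ['银行', ['yín', 'háng']], // "bank": 行 reads háng here
]);
const MAX_WORD_LEN = 4; // longest override key to try

function readings(text) {
  const chars = Array.from(text); // split by code point
  const out = [];
  let i = 0;
  while (i < chars.length) {
    let matched = false;
    // Try the longest candidate word first, down to two characters.
    const maxLen = Math.min(MAX_WORD_LEN, chars.length - i);
    for (let len = maxLen; len >= 2; len--) {
      const override = wordReadings.get(chars.slice(i, i + len).join(''));
      if (override) {
        out.push(...override);
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(charReadings.get(chars[i]) ?? '?'); // single-character fallback
      i += 1;
    }
  }
  return out;
}

console.log(readings('银行').join(' ')); // yín háng
console.log(readings('行人').join(' ')); // xíng rén
```

Greedy matching will still get some sentences wrong, since it has no notion of word boundaries beyond the table, but it handles the common fixed words without any server or NLP model.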

2,500 characters covers more than you'd think. The most frequent
2,500 Chinese characters account for ~97% of text in newspapers and
books. The long tail exists but it's genuinely rare.
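If you want to check coverage on your own text rather than take the ~97% figure on faith, a quick measurement could look like this. The function name coverage and the five-character known set are placeholders, and only the basic CJK Unified Ideographs block is counted:

```javascript
// Fraction of Han characters in `text` that appear in the `known` set.
// Non-Han characters (punctuation, Latin, digits) are ignored.
function coverage(text, known) {
  let total = 0;
  let hits = 0;
  for (const ch of text) {
    if (!/^[\u4e00-\u9fff]$/.test(ch)) continue;
    total += 1;
    if (known.has(ch)) hits += 1;
  }
  // Text with no Han characters is treated as fully covered.
  return total === 0 ? 1 : hits / total;
}

// Hypothetical known set; in practice this would be the dictionary's keys.
const known = new Set('我是的一了');
console.log(coverage('我是龘。', known).toFixed(2)); // 0.67
```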

Also built

This is part of a small suite of offline Chinese learning tools I've
been building:

📖 Chinese Reading Lab — 10 historical stories in Chinese (HSK4–6) with comprehension quiz
🐉 Chengyu Stories — 20 classic idioms with origin stories + scenario quiz
🃏 Mandarin Flashcards — HSK1–3 spaced repetition
✍️ Chinese Writing Toolkit — model essays for 11 HSK writing types

All single HTML files, all free: daligao.github.io/learn-chinese-free

Source for the pinyin annotator: github.com/daligao/pinyin-annotator

Questions welcome — especially if you know a clean way to handle
polyphonic characters without a server.

Top comments (2)

Frank • Jun 23

How did you handle character encoding and font support for the various Chinese characters in your annotator, especially considering it's a single HTML file?