Văn Tuấn Lê

I built a free offline Chinese pinyin annotator in a single HTML file

I live in China and kept running into the same problem: I'd see Chinese text
I couldn't fully read and needed to quickly see the pronunciation (pinyin)
above each character.

Every tool I found had at least one of these problems:

  • Paywalled after 5 uses
  • Required creating an account
  • Sent your text to a server
  • Had a terrible UI straight out of 2009

So I built one myself. Single HTML file. Fully offline after first load.
Nothing sent anywhere.

→ Try it live
→ GitHub repo


How it works

The dictionary

The core is a ~2,500 character lookup table embedded directly in the JS:

const raw = `的:de:0:1|一:yī:1:1|是:shì:4:1|了:le:0:1|我:wǒ:3:1...`
// format: character : pinyin : tone(1-4, 0=neutral) : hsk_level(1-6)

I store it as a pipe-delimited string and parse it once on load.
Covers ~97% of common written Chinese. Characters outside the dictionary
show a "?" — there aren't many in normal text.
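
Here is a minimal sketch of that one-time parse step. The function name parseDictionary and the shape of the stored objects are my own assumptions for illustration, and the sample string only contains the five entries shown above.

```javascript
// Sample in the post's format: character:pinyin:tone:hsk_level, joined by "|".
const sample = "的:de:0:1|一:yī:1:1|是:shì:4:1|了:le:0:1|我:wǒ:3:1";

// Hypothetical helper: parse the pipe-delimited string into a Map keyed by character.
function parseDictionary(raw) {
  const dict = new Map();
  for (const entry of raw.split("|")) {
    const [char, pinyin, tone, hsk] = entry.split(":");
    if (!char || !pinyin) continue; // skip empty or malformed entries
    dict.set(char, { pinyin, tone: Number(tone), hsk: Number(hsk) });
  }
  return dict;
}

const dict = parseDictionary(sample);
console.log(dict.get("我").pinyin); // wǒ
console.log(dict.has("中"));        // false: not in this tiny sample
```

A Map gives constant-time lookups per character, so the parsing cost is paid once and annotation stays a simple loop.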

Ruby annotations

HTML has a built-in <ruby> tag for exactly this:

<ruby>
  <span class="char">中</span>
  <rt>zhōng</rt>
</ruby>

The rt element renders above the base character. No canvas tricks,
no absolute positioning — just semantic HTML doing what it was designed for.
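
To show how text could be turned into that markup, here is a rough sketch. The annotate and escapeHtml functions, the two-entry dictionary, and the choice to pass non-Chinese characters through unchanged are all assumptions for illustration; the "?" fallback follows the behavior described above.

```javascript
// Hypothetical mini-dictionary; the real tool embeds ~2,500 entries.
const dict = new Map([
  ["中", { pinyin: "zhōng", tone: 1 }],
  ["文", { pinyin: "wén", tone: 2 }],
]);

// Escape characters that are special in HTML before emitting them as text.
function escapeHtml(s) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
  return s.replace(/[&<>"]/g, (c) => map[c]);
}

// Matches a single Han character (Unicode property escape, needs the u flag).
const HAN = /\p{Script=Han}/u;

// Hypothetical helper: wrap each Han character in <ruby> with its pinyin in <rt>.
function annotate(text, dict) {
  let html = "";
  for (const ch of text) { // for...of walks the string by code point
    if (!HAN.test(ch)) {
      html += escapeHtml(ch); // punctuation, Latin text, digits: no annotation
      continue;
    }
    const entry = dict.get(ch);
    const rt = entry
      ? `<rt class="t${entry.tone}">${entry.pinyin}</rt>`
      : "<rt>?</rt>"; // character outside the dictionary
    html += `<ruby><span class="char">${ch}</span>${rt}</ruby>`;
  }
  return html;
}

console.log(annotate("中文!", dict));
```

The t1 to t4 class on each rt is what a tone-coloring stylesheet can hook into later.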

Tone colors

Each pinyin string carries its tone in the data, and CSS classes handle
the rest:

.tone-on rt.t1 { color: #ff4d4d; }  /* 1st tone — red */
.tone-on rt.t2 { color: #ff9900; }  /* 2nd tone — orange */
.tone-on rt.t3 { color: #22c55e; }  /* 3rd tone — green */
.tone-on rt.t4 { color: #a78bfa; }  /* 4th tone — purple */
.tone-on rt.t0 { color: #8899aa; }  /* neutral — grey */

Toggle the class on the container and all tones update instantly
without re-rendering anything.
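
The toggle itself can be a one-liner. This is a sketch under assumptions: the setToneColors name and the element ids "output" and "tone-toggle" are hypothetical, and only the tone-on class name comes from the CSS above.

```javascript
// Hypothetical helper: switch tone coloring on or off for one container.
function setToneColors(container, enabled) {
  // classList.toggle(name, force) adds the class when force is true
  // and removes it when force is false.
  container.classList.toggle("tone-on", enabled);
  return container.classList.contains("tone-on");
}

// Browser wiring; skipped when there is no DOM (for example under Node).
if (typeof document !== "undefined") {
  const output = document.getElementById("output");        // hypothetical id
  const checkbox = document.getElementById("tone-toggle"); // hypothetical id
  if (output && checkbox) {
    checkbox.addEventListener("change", () => {
      setToneColors(output, checkbox.checked);
    });
  }
}
```

Because the colors are scoped under .tone-on, the browser restyles the existing rt elements and no markup has to be rebuilt.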

HSK level highlight

Same pattern: a CSS class on the container, plus a per-level class on
each character span:

.hsk-on .char-span.hsk1 { color: #7ee8bb; }
.hsk-on .char-span.hsk2 { color: #60d4b0; }
/* ... */
.hsk-on .char-span.unk  { color: #6a7a9a; } /* unknown */

This lets learners instantly see which characters are beginner vs.
advanced vs. completely outside the HSK vocabulary list.
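
As a sketch of how each span could get its class: the hskClass and charSpan helpers are hypothetical names, and the level values in the sample map are illustrative rather than taken from the tool's dictionary. The class names match the CSS above.

```javascript
// Illustrative sample: character -> HSK level (values are for demonstration only).
const levels = new Map([
  ["我", 1],
  ["是", 1],
  ["懂", 2],
]);

// Hypothetical helper: pick the level class, or "unk" for characters with no level.
function hskClass(ch, levels) {
  const level = levels.get(ch);
  return level ? `hsk${level}` : "unk";
}

// Hypothetical helper: build the span that the .hsk-on rules target.
function charSpan(ch, levels) {
  return `<span class="char-span ${hskClass(ch, levels)}">${ch}</span>`;
}

console.log(charSpan("懂", levels)); // <span class="char-span hsk2">懂</span>
console.log(hskClass("龘", levels)); // unk
```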


The offline constraint

I wanted this to work with zero network after the first load — useful
if you're on a plane with a downloaded article, or in China where
connectivity to foreign tools can be unreliable.

Everything is embedded: the dictionary, the CSS, the JS. The HTML file
is ~180KB total. Download once, use forever.


What I learned

<ruby> line-height is annoying. Getting the ruby annotations to
not blow up the line spacing required some CSS gymnastics:

ruby {
  display: inline-flex;
  flex-direction: column-reverse;
  align-items: center;
  vertical-align: bottom;
  line-height: 1;
}

Polyphonic characters are a real problem. Many Chinese characters
have multiple pronunciations depending on context (e.g., 行 = xíng or
háng). I used the most common reading for each. A proper solution would
need NLP context analysis — out of scope for a single HTML file.
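
One lighter-weight approximation that still fits in a single file is a small phrase table checked before the per-character fallback. This is not what the tool does today and it is not real context analysis; it is a sketch assuming a hand-picked override list. The phrases table, the readings function, and MAX_PHRASE_LEN are all hypothetical.

```javascript
// Hypothetical phrase overrides: words where a character takes its less common reading.
const phrases = new Map([
  ["银行", ["yín", "háng"]],
  ["行业", ["háng", "yè"]],
]);

// Per-character defaults (most common reading), as described above.
const defaults = new Map([
  ["行", "xíng"],
  ["银", "yín"],
  ["业", "yè"],
  ["不", "bù"],
]);

const MAX_PHRASE_LEN = 2; // length of the longest key in `phrases`

// Greedy longest match: try a phrase at each position, else fall back to one character.
function readings(text) {
  const chars = Array.from(text); // split by code point
  const out = [];
  let i = 0;
  while (i < chars.length) {
    let matched = false;
    for (let len = Math.min(MAX_PHRASE_LEN, chars.length - i); len >= 2; len--) {
      const hit = phrases.get(chars.slice(i, i + len).join(""));
      if (hit) {
        out.push(...hit);
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(defaults.get(chars[i]) || "?");
      i += 1;
    }
  }
  return out;
}

console.log(readings("银行不行").join(" ")); // yín háng bù xíng
```

The trade-off is that every override has to be listed by hand, and greedy matching can pick the wrong split where phrases overlap, so this narrows the problem rather than solving it.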

2,500 characters covers more than you'd think. The most frequent
2,500 Chinese characters account for ~97% of text in newspapers and
books. The long tail exists but it's genuinely rare.
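
If you want to check coverage on your own text, something like the following would do it. It is a sketch: the coverage function and the five-character known set are assumptions for illustration, and the result depends entirely on the text you feed it.

```javascript
// Matches a single Han character; punctuation such as "。" is not counted.
const HAN = /\p{Script=Han}/u;

// Hypothetical helper: fraction of Han characters in `text` that appear in `known`.
function coverage(text, known) {
  let total = 0;
  let hits = 0;
  for (const ch of text) {
    if (!HAN.test(ch)) continue;
    total += 1;
    if (known.has(ch)) hits += 1;
  }
  return total === 0 ? 1 : hits / total; // no Han characters: nothing to miss
}

// Tiny sample set; the real dictionary has ~2,500 characters.
const known = new Set(["我", "是", "的", "一", "了"]);
console.log(coverage("我是一个学生。", known)); // 0.5 (3 of 6 Han characters)
```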


Also built

This is part of a small suite of offline Chinese learning tools I've
been building. All single HTML files, all free:
daligao.github.io/learn-chinese-free

Source for the pinyin annotator: github.com/daligao/pinyin-annotator


Questions welcome — especially if you know a clean way to handle
polyphonic characters without a server.
