How this page breaks Japanese lines

#cjk #typography #javascript #webdev

Open a Japanese sentence in a narrow column and watch where the browser breaks it. It will happily split 特定商取引法 into 特定商取引 / 法, or push a 。 to the start of the next line. Japanese has no spaces, so the default line-breaker treats almost every character boundary as fair game. To a Japanese reader that looks broken in the same way impor / tant would look broken to you.

Most sites ship exactly that. It is the kind of thing you only notice if you read the page in Japanese, which is most of the point of this whole site.

The rule we actually want

Japanese wraps at phrase boundaries — 文節, roughly a content word plus its trailing particles. It also follows 禁則: a closing bracket or a 。 never starts a line, an opening bracket never ends one. Those two together are what "set correctly" means.
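The 禁則 half of that rule is mechanical enough to check in code. What follows is a minimal sketch, not a full 禁則 table: the two character sets are a small illustrative subset, and `kinsokuViolations` is a hypothetical helper name. It takes text that has already been split into display lines and reports any line that starts or ends with a forbidden character.

```javascript
// Characters that must not start a line (closing brackets, 。 、).
// Illustrative subset only, not the complete 禁則 set.
const NO_LINE_START = new Set("。、）」』】］｝〉》)]}");
// Characters that must not end a line (opening brackets).
const NO_LINE_END = new Set("（「『【［｛〈《([{");

// Returns one entry per violation for text already split into display lines.
function kinsokuViolations(lines) {
  const violations = [];
  lines.forEach((line, index) => {
    const chars = Array.from(line); // iterate by code point, not UTF-16 unit
    if (chars.length === 0) return;
    const first = chars[0];
    const last = chars[chars.length - 1];
    if (NO_LINE_START.has(first)) {
      violations.push({ line: index, char: first, rule: "line-start" });
    }
    if (NO_LINE_END.has(last)) {
      violations.push({ line: index, char: last, rule: "line-end" });
    }
  });
  return violations;
}

// A 。 pushed to its own line is flagged; phrase-sized lines are clean.
console.log(kinsokuViolations(["特定商取引法の表示ページが無い", "。"]).length); // → 1
console.log(kinsokuViolations(["特定商取引法の", "表示ページが", "無い。"]).length); // → 0
```

This only covers the punctuation half. Whether a break falls on a 文節 boundary needs an actual segmenter.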

CSS gives you half of it for free:

.prose {
  line-break: strict;        /* strictest 禁則 set: 。 、 ) and small kana stay off line starts */
  word-break: keep-all;      /* no breaks between CJK characters */
  overflow-wrap: break-word; /* last resort: cut an unbreakable run instead of overflowing */
}

line-break: strict handles the kinsoku edge. word-break: keep-all tells the browser to stop breaking between CJK characters at all. But now almost nothing breaks: a long clause either overflows the column or, with overflow-wrap: break-word as the safety net, gets cut wherever it happens to hit the edge. We have to hand the browser the break points back, and this time the right ones.

Finding the phrases

The break points are the phrase boundaries, and finding them means segmenting Japanese, which is the hard part. I use BudouX, Google's small phrase model. It turns a sentence into chunks:

import { loadDefaultJapaneseParser } from "budoux";

const parser = loadDefaultJapaneseParser();
parser.parse("特定商取引法の表示ページが無い。");
// → ["特定商取引法の", "表示ページが", "無い。"]

Then I join the chunks with <wbr>, the "break here if you must" tag. With word-break: keep-all in force, the browser breaks only at those points:

- <p>特定商取引法の表示ページが無い。</p>
+ <p>特定商取引法の<wbr>表示ページが<wbr>無い。</p>

Notice the 。 stayed glued to 無い. That is the kinsoku rule falling out of phrase segmentation for free — the model never puts a boundary in front of trailing punctuation, so there is nothing to break before it.
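The joining step itself is small. Here is a minimal sketch, assuming the chunks are plain text that still needs HTML escaping; `escapeHtml` and `joinWithWbr` are hypothetical helper names, not part of BudouX.

```javascript
// Escape text so it is safe to place inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Join phrase chunks with <wbr>. Nothing is added before the first
// chunk or after the last one, so the sentence gains no edge breaks.
function joinWithWbr(chunks) {
  return chunks.map(escapeHtml).join("<wbr>");
}

const chunks = ["特定商取引法の", "表示ページが", "無い。"];
console.log(`<p>${joinWithWbr(chunks)}</p>`);
// → <p>特定商取引法の<wbr>表示ページが<wbr>無い。</p>
```

Escaping each chunk before joining keeps the inserted `<wbr>` as the only markup the step introduces.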

I run this at build time, not in the browser. A small pass walks the rendered HTML, inserts <wbr> into Japanese text, and skips anything inside <code> or <pre> so code samples are left alone. The model stays on the build machine. The reader downloads a few <wbr> tags and no JavaScript.
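A pass like that can be sketched in a few lines. This is an illustration under stated assumptions, not the actual build script. It tokenizes with a regex where a real pass would more likely use an HTML parser, and it assumes text nodes are already escaped. `insertWbr`, `SKIP_TAGS`, and the `segment` callback are hypothetical names. `segment` stands in for the phrase segmenter and only needs to map a string to an array of chunks.

```javascript
// Rough "does this contain Japanese?" test: hiragana, katakana, common kanji.
const JAPANESE = /[\u3040-\u30ff\u4e00-\u9fff]/;
// Elements whose text must be left untouched.
const SKIP_TAGS = new Set(["code", "pre"]);

// `segment` is a stand-in for the phrase segmenter: string -> string[].
function insertWbr(html, segment) {
  // Split into tags and text; the capture group keeps the tags in the result.
  const tokens = html.split(/(<[^>]+>)/);
  let skipDepth = 0; // > 0 while inside <code> or <pre>
  return tokens
    .map((token) => {
      if (token.startsWith("<") && token.endsWith(">")) {
        // Tag, comment, or doctype: pass through, tracking code/pre nesting.
        const tag = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/.exec(token);
        if (tag && SKIP_TAGS.has(tag[2].toLowerCase())) {
          skipDepth = Math.max(0, skipDepth + (tag[1] === "/" ? -1 : 1));
        }
        return token;
      }
      // Text: leave it alone inside skipped elements or when not Japanese.
      if (skipDepth > 0 || !JAPANESE.test(token)) return token;
      return segment(token).join("<wbr>");
    })
    .join("");
}

// Toy segmenter for the demo only: split after の or が.
const demoSegment = (text) => text.split(/(?<=[のが])/);

const html = "<p>特定商取引法の表示ページが無い。</p><pre>表示が無い</pre>";
console.log(insertWbr(html, demoSegment));
// → <p>特定商取引法の<wbr>表示ページが<wbr>無い。</p><pre>表示が無い</pre>
```

With the parser from earlier, `segment` would be `(text) => parser.parse(text)`. One caveat of this sketch: a segmenter that splits inside a character entity such as `&amp;` would corrupt it, which is one reason to prefer a real HTML parser for the walk.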

Where it stops

BudouX is a model, not a rulebook, so it is about right, not exactly right. It occasionally splits a rare compound in a place a typographer wouldn't, and it has nothing to say about full justification or 約物 spacing. For body text at a normal measure I have not needed to correct it by hand yet. If I do, I will say so here.

The honest limit is the usual one: this fixes the mechanical part. It cannot tell you the Japanese was worth reading. That is still a human call.

Written by greymoth. I build developer tools and write about where software quietly breaks: Japanese/CJK edge cases, i18n, the boring infra nobody checks. → glovrex.com · github.com/greymoth-jp
