Ackah Kelvin

Posted on Jun 6

What a string extractor gets wrong: six lessons from three real codebases

#react #typescript #i18n #staticanalysis

TransLift is a CLI that statically finds user-facing strings in a React/TypeScript
codebase and wraps them in t("…") translation calls. The premise sounds
mechanical (grep for quoted text, wrap it), and the demos are easy. The
interesting part is everything the demo hides.

This is a write-up of what surfaced when I stopped trusting my own fixtures and
ran the tool against three real codebases (Excalidraw, Mattermost, and Bluesky)
alongside two other extractors. Each repo embodies a different i18n convention,
and the convention turned out to be the whole game. The headline finding is not
"my tool is better." It's narrower and more useful: the hard problem in string
extraction isn't finding strings, it's knowing which strings are already
handled. I hadn't built that capability either, until each repo forced it.

What follows is section zero (the benchmark), then six problems the benchmark
exposed, each one a mistake I made, found, and fixed, and finally an honest
accounting of what I still don't trust.

§0 — The benchmark

I compared three tools, each in its intended aggressive auto-wrap mode:

TransLift (extract)
i18next-cli (instrument): the closest comparable, an i18next codemod
a18n (wrap): a CJK-first (Chinese/Japanese/Korean) extractor

Before the numbers, two terms, because everything below is scored on them. When a
tool rewrites your code, it can fail in two opposite directions:

Recall: of all the user-facing strings that should be translated, how many did the tool actually find? A miss here is silent: the string stays hardcoded and compiles fine, so nobody notices until a user sees untranslated text in production.
False positives (the complement of precision): of all the strings the tool did wrap, how many were a mistake? The damaging kind is re-wrapping text that is already translated, which produces double-translated, often uncompilable code.

These pull against each other: a tool that wraps nothing has zero false positives
and zero recall; one that wraps everything has perfect recall and ruinous false
positives. The bar is doing well on both at once. Each repo below uses a
different convention and is chosen to isolate one axis:

Repo	Convention	Axis it measures
Excalidraw (pre-i18n)	none (raw hardcoded strings)	recall
Mattermost (`admin_console`)	react-intl (`formatMessage`/`<FormattedMessage>`)	false positives
Bluesky (`screens/Settings`)	Lingui macros + React Native	false positives + RN recall

The false-positive axis is the cleaner measurement and the more dramatic result.
Only re-translation is counted here: a wrap that lands on a string that is
already internationalized. On the two internationalized repos:

Tool	Mattermost re-translation FP	Bluesky re-translation FP
TransLift	0	0
i18next-cli `instrument`	1,892	23 (and broke TypeScript, see §6)
a18n (`--text=capitalized`)	3,751	254
a18n (default `--text=cjk`)	0	0

The single most vivid example is i18next-cli on Mattermost emitting literal
double-translation:

defaultMessage: t('...', 'Test is unavailable in this environment')

That defaultMessage is already a react-intl translation source. Wrapping it in
t(...) translates the translation. There are 1,892 of these.

Now the recall axis, stated directionally. The three tools measure recall
differently (TransLift via its own verdicts, i18next-cli by diff-parsing the
strings it replaced, a18n by grepping inserted calls), so treat these as ±1–2,
not exact:

Tool	Excalidraw recall (÷29)
TransLift	~90% (26/29)
i18next-cli `instrument`	~76% (22/29)
a18n (`--text=capitalized`)	100% (29/29), but at 3,751 re-translation FP on Mattermost
a18n (default `cjk`)	0%

TransLift is the only tool that scores on both axes: high recall and zero
re-translation. But here is the part that matters, and that I want to state before
anything else:

TransLift was ~96% false-positive on Mattermost too, until I specifically
built the guard that recognizes react-intl. Nothing about that result is innate. a18n's two
modes (0% recall with 0 FP, or 100% recall with 3,751 FP) show the default behavior of a string extractor with
no convention-awareness: either it finds nothing, or it wraps everything
including the already-translated. The entire difference is one capability,
recognizing what's already handled, and it has to be built per convention, by
hand, against real code. The rest of this essay is the story of building it, one
repo at a time, getting it wrong first each time.

A note on method, kept honest throughout: scoring is positional (a wrap is an FP
if its source position lands inside an existing i18n construct), computed from each
tool's emitted output by parsing its write diff into {file, line, col} and
checking containment. Denominators differ by how aggressive each tool is, so the
exact FP counts are directional; the ordering (0 vs 23 vs 254) is solid. a18n's
capitalized mode runs it outside its CJK design, shown only for a comparable
English data point, with that caveat attached every time.
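
The containment check described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual script: the `SourcePos` and `Span` shapes, the function names, and the half-open interval convention are all assumptions, and it presumes the ranges of existing i18n constructs have already been collected from the original source.

```typescript
// Hypothetical shapes for illustration; not TransLift's real types.
interface SourcePos { file: string; line: number; col: number }
interface Span {
  file: string;
  startLine: number; startCol: number;
  endLine: number; endCol: number;
}

// True when pos falls inside span (start inclusive, end exclusive: an assumption).
function contains(span: Span, pos: SourcePos): boolean {
  if (span.file !== pos.file) return false;
  const afterStart =
    pos.line > span.startLine ||
    (pos.line === span.startLine && pos.col >= span.startCol);
  const beforeEnd =
    pos.line < span.endLine ||
    (pos.line === span.endLine && pos.col < span.endCol);
  return afterStart && beforeEnd;
}

// A wrap counts as a re-translation FP if it lands inside any existing i18n construct.
function countRetranslationFPs(wraps: SourcePos[], i18nSpans: Span[]): number {
  return wraps.filter((w) => i18nSpans.some((s) => contains(s, w))).length;
}

// One existing construct (say, a formatMessage call) and three emitted wraps.
const existingConstruct: Span = {
  file: "a.tsx", startLine: 10, startCol: 4, endLine: 12, endCol: 6,
};
const emittedWraps: SourcePos[] = [
  { file: "a.tsx", line: 11, col: 20 }, // inside the construct: counted
  { file: "a.tsx", line: 30, col: 2 },  // same file, outside: not counted
  { file: "b.tsx", line: 11, col: 20 }, // different file: not counted
];
const fpCount = countRetranslationFPs(emittedWraps, [existingConstruct]);
```

Because the check only looks at positions, it applies unchanged to any tool whose output can be reduced to a list of wrap locations.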

One paragraph on how TransLift decides, because everything below leans on it.
For each string, TransLift asks one question: is this human-facing copy? It answers
in two layers. First, decisive rules settle the obvious cases outright — a hard
skip (className, a URL, a console.log argument) or a hard wrap. Anything left
falls to a weighted score, where signals add or subtract points and the total
picks the verdict: wrap, skip, or escalate (escalate = "I'm not
sure, show a human"). The signal that matters most is the sink — a registered
place where user-facing copy is known to live: a component prop
(<Toast message=…>), a function argument (alert(…)), or an attribute
(aria-label). Landing in a sink is strong evidence a string is copy; looking
structural — an SVG path, a CSS var(…) — pushes the other way. Three words to
carry into the sections below: a sink (where copy lives), a boost/penalty
(points for or against in the weighted score), and a decisive rule (one that
settles a string before scoring even runs).
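
A rough sketch of that two-layer shape, under stated assumptions: the sink lists, weights, and thresholds below are invented for illustration and are not TransLift's real values, and only hard-skip decisive rules are shown (hard wraps are omitted for brevity).

```typescript
type Verdict = "wrap" | "skip" | "escalate";

interface Candidate {
  text: string;
  attribute?: string; // JSX attribute holding the string, if any
  callee?: string;    // function the string is an argument of, if any
}

// Illustrative registries; the real ones are not described in this post.
const ATTRIBUTE_SINKS = new Set(["aria-label", "title", "placeholder"]);
const FUNCTION_SINKS = new Set(["alert"]);
const HARD_SKIP_ATTRIBUTES = new Set(["className", "href", "src"]);

// Layer 1: decisive rules settle a string before any scoring runs.
function decisive(c: Candidate): Verdict | undefined {
  if (c.attribute !== undefined && HARD_SKIP_ATTRIBUTES.has(c.attribute)) return "skip";
  if (c.callee === "console.log") return "skip";
  if (/^https?:\/\//.test(c.text)) return "skip";
  return undefined;
}

// Layer 2: signals add or subtract points (weights are made up).
function score(c: Candidate): number {
  let total = 0;
  if (c.attribute !== undefined && ATTRIBUTE_SINKS.has(c.attribute)) total += 3; // sink boost
  if (c.callee !== undefined && FUNCTION_SINKS.has(c.callee)) total += 3;        // sink boost
  if (/\s/.test(c.text)) total += 1;                   // multi-word text reads like prose
  if (/^[a-z][A-Za-z0-9_]*$/.test(c.text)) total -= 2; // identifier-shaped penalty
  return total;
}

function decide(c: Candidate): Verdict {
  const settled = decisive(c);
  if (settled !== undefined) return settled;
  const total = score(c);
  if (total >= 3) return "wrap";
  if (total <= 0) return "skip";
  return "escalate"; // not sure: show a human
}
```

With these made-up weights, an `aria-label` holding "Close dialog" wraps, a bare "Saved changes" with no sink escalates, and anything in `className` never reaches scoring at all.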

§1 — Shape-based rules fail in both directions (Excalidraw)

The first real run, against the live Excalidraw checkout rather than my fixtures,
broke the precision claim immediately. My fixture suite said 100% precision.
Real Excalidraw was 86%.

The false positives were all SVG and CSS attribute values:

<path d="M39.9 12.3..." viewBox="0 0 40 40" transform="translate(2 2)" />
<Icon size="var(--icon-size, 1rem)" />

d, viewBox, transform, size reached the weighted-scoring tier and picked
up a generic "attribute carrying a string" boost, because they weren't in any
structural blocklist. i18next-cli skipped all of them correctly. The fix was two
parts: a structural-attribute blocklist (SVG presentation attrs get a penalty),
plus a decisive value-shape skip. A string that looks like var(...),
calc(...), a numeric coordinate list, or SVG path data is not copy regardless of
which attribute holds it.
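
One way the value-shape half of that fix might look, as a sketch only: the patterns below are my own approximations for illustration, not the tool's actual expressions, and they are deliberately loose.

```typescript
// Approximate patterns; assumptions for illustration, not TransLift's real rules.
const CSS_FUNCTION = /^(var|calc)\(.*\)$/;                     // var(--x, 1rem), calc(100% - 2px)
const NUMERIC_LIST = /^-?\d*\.?\d+([\s,]+-?\d*\.?\d+)*$/;      // "0 0 40 40", "1.5, 2"
const SVG_PATH = /^[Mm][\s,]*-?\d*\.?\d+[MmLlHhVvCcSsQqTtAaZz\d\s,.\-+eE]*$/; // "M39.9 12.3 L..."

// Decisive skip: a value with one of these shapes is not copy,
// whichever attribute or prop happens to hold it.
function looksStructural(value: string): boolean {
  const v = value.trim();
  return CSS_FUNCTION.test(v) || NUMERIC_LIST.test(v) || SVG_PATH.test(v);
}
```

A value such as `translate(2 2)` is not matched by these patterns; in the two-part scheme described above it would be left to the structural-attribute blocklist, which is why both halves are needed.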

Then the same run revealed the opposite failure. aria-label="Shade" was
silently missed: not wrapped, not flagged. The cause was rule ordering: an
identifier-shaped hard-skip ("Shade looks like a code identifier, drop it") ran
before the check for registered attribute sinks. So a legitimately registered
sink lost to a shape heuristic. I had already fixed this exact bug for function
sinks, but had scoped attributes out of that fix at the time.

The lesson generalizes into a principle I leaned on for everything after:
position beats shape, and rule order is a correctness surface, not a detail. A
string's meaning comes from where it sits (a registered sink, a structural attr),
not what its characters look like. When a shape heuristic runs before a structural
rule, it silently overrides better information. Net effect of the fix: Excalidraw
shed all 13 SVG/CSS false positives and, in the same pass, began wrapping the
identifier-shaped aria-label copy it had been missing, Shade among them,
with zero new false positives.
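
The ordering bug is small enough to show in full. A minimal sketch, assuming a hypothetical sink list and identifier test (neither is taken from the tool): the two functions contain the same two rules and differ only in which runs first.

```typescript
type Verdict = "wrap" | "skip";
interface AttrString { attribute: string; text: string }

// Hypothetical registry and shape test, for illustration only.
const ATTRIBUTE_SINKS = new Set(["aria-label", "title", "alt"]);
const looksLikeIdentifier = (s: string): boolean =>
  /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(s);

// Buggy order: the shape heuristic runs first and overrides the sink.
function shapeFirst(s: AttrString): Verdict {
  if (looksLikeIdentifier(s.text)) return "skip";
  if (ATTRIBUTE_SINKS.has(s.attribute)) return "wrap";
  return "skip";
}

// Fixed order: position (a registered sink) is consulted before shape.
function sinkFirst(s: AttrString): Verdict {
  if (ATTRIBUTE_SINKS.has(s.attribute)) return "wrap";
  return "skip"; // identifier-shaped or not, nothing vouches for this string
}

const shade: AttrString = { attribute: "aria-label", text: "Shade" };
const before = shapeFirst(shade); // "skip": the silent miss
const after = sinkFirst(shade);   // "wrap"
```

Nothing about either rule changed between the two versions, which is what makes ordering easy to overlook in review.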

§2 — The fixture that lied (Excalidraw recall)

My synthetic fixture suite reported 100% recall. I believed it. Then I ran a
recall test against reality: check out Excalidraw at the commit just before it
adopted i18n, run the tool, and score against the strings the maintainers actually
translated in the i18n commit. Real recall was 55% (16/29).

A 45-point gap between the fixtures and the truth, and it was one hidden failure
class: user-facing labels declared as object-property values, not JSX.

{ contextItemLabel: "Delete", label: "Change text alignment" }

My fixtures had JSX text, attributes, function-call arguments: the shapes I
thought of when writing tests. They had almost no object-property labels,
because they didn't occur to me. The tool scored 100% on the cases I imagined and
55% on the cases the real app actually used.

The fix is less interesting than how it had to be tuned. My first version reused
the JSX copy-prop allowlist for object keys and produced 159 wraps on current
Excalidraw, ~150 of them false, because as object keys, text:/title:/label:
overwhelmingly hold data (element content, demo data, or already-keys like
label: "labels.alignTop"), not copy. I had to tighten it empirically to a strict
copy-key set plus a value-shape guard rejecting dotted-key-shaped and empty
values. Recall climbed 55% → 79% → 90% across iterations, each verified in both
directions on real code.
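
A sketch of what the tightened object-property rule might look like. The key set here is a guess built from the examples in this section, not the real list, and the dotted-key pattern is my own approximation.

```typescript
// Assumed strict copy-key set; the actual set is not listed in this post.
const COPY_KEYS = new Set(["contextItemLabel", "label"]);

// Matches already-a-key values such as "labels.alignTop".
const DOTTED_KEY = /^[A-Za-z_][\w-]*(\.[A-Za-z_][\w-]*)+$/;

// An object-property value is treated as copy only if the key is in the
// strict set AND the value passes the value-shape guard.
function isObjectPropertyCopy(key: string, value: string): boolean {
  if (!COPY_KEYS.has(key)) return false;
  const v = value.trim();
  if (v === "") return false;           // empty values are never copy
  if (DOTTED_KEY.test(v)) return false; // looks like a translation key already
  return true;
}
```

The guard matters as much as the key set: `label: "Change text alignment"` and `label: "labels.alignTop"` share a key and differ only in value shape.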

The lesson is the one I keep relearning: a green synthetic suite is a liar's
contract. It tests your imagination, not the world. The only ground truth is real
repo code, and the only safe way to extend a heuristic is to measure its blast
radius on a real codebase before trusting it.

§3 — Re-translation false positives, the actual differentiator (Mattermost)

Mattermost's admin console is 642 files, almost entirely internationalized with
react-intl. It's the perfect false-positive test: nearly every string a naive
extractor sees is already translated. Run a convention-blind tool here and it
re-wraps the whole codebase.

That's exactly what happened. i18next-cli's instrument re-wrapped 1,892
react-intl defaultMessage values into t(...), the literal double-translation
shown in §0. a18n in capitalized mode produced 3,751. And, the honest part,
TransLift's raw behavior here was ~96% false-positive too. A string extractor
with no model of react-intl sees defaultMessage: "Save" and thinks "untranslated
copy, wrap it."

The fix is a structural signal I called enclosingI18n: walk a string's ancestors,
and if it sits inside a recognized translation construct (formatMessage(...),
<FormattedMessage>, defineMessages(...), <Trans>), mark it already-translated
and skip it. With that one guard, TransLift's re-translation FP on Mattermost went
to 0, while still wrapping the genuine hardcoded copy in the same files.
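
The ancestor walk can be sketched on a toy tree. This is an illustration under assumptions: `AstNode` is a hand-rolled stand-in rather than a real parser's node type, and callee names are simplified to bare identifiers (a real `intl.formatMessage(...)` callee is a member expression).

```typescript
// Toy AST node; a stand-in for a real parser's node type.
interface AstNode {
  kind: "CallExpression" | "JSXElement" | "ObjectExpression" | "ObjectProperty" | "StringLiteral";
  name?: string;    // callee name, tag name, or property key
  parent?: AstNode;
}

const I18N_CALLS = new Set(["formatMessage", "defineMessages"]);
const I18N_ELEMENTS = new Set(["FormattedMessage", "Trans"]);

function isI18nConstruct(n: AstNode): boolean {
  if (n.name === undefined) return false;
  if (n.kind === "CallExpression") return I18N_CALLS.has(n.name);
  if (n.kind === "JSXElement") return I18N_ELEMENTS.has(n.name);
  return false;
}

// Walk upward from the string; any recognized construct on the way
// means the string is already translated.
function enclosingI18n(node: AstNode): boolean {
  for (let cur = node.parent; cur !== undefined; cur = cur.parent) {
    if (isI18nConstruct(cur)) return true;
  }
  return false;
}

// formatMessage({ defaultMessage: "Save" })
const fmtCall: AstNode = { kind: "CallExpression", name: "formatMessage" };
const descriptor: AstNode = { kind: "ObjectExpression", parent: fmtCall };
const prop: AstNode = { kind: "ObjectProperty", name: "defaultMessage", parent: descriptor };
const translated: AstNode = { kind: "StringLiteral", parent: prop };

// <button>Save</button>
const button: AstNode = { kind: "JSXElement", name: "button" };
const hardcoded: AstNode = { kind: "StringLiteral", parent: button };
```

Here `enclosingI18n(translated)` is true and `enclosingI18n(hardcoded)` is false, which is the behavior that lets genuine hardcoded copy in the same file still be wrapped.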

This is the differentiator, and I want to frame it precisely. It is not that
TransLift is smarter. It is that recognizing what's already translated is a
feature you can build, and the other extractors haven't built it. i18next-cli
knows i18next; point it at react-intl and it's blind. a18n knows CJK characters;
point it at English and it either sees nothing or everything. I can't tell from
outside whether that absence is a deliberate scope choice (each tool serving its
own ecosystem) or a gap no one's filled yet; I can only report that on these
three repos, today, neither recognizes a convention it wasn't built for. The moat is
cross-convention awareness, it's achievable, and as the next sections show, it has
to be earned one convention at a time, including the ones you think you already
covered.

§4 — Why one guard wasn't enough (react-intl descriptors)

The enclosingI18n structural rule from §3 is elegant: it keys off position in
the AST (the parsed syntax tree the tool reads your code into), the most reliable
signal there is. So my instinct was to make it the
single source of truth and delete the grubbier name-based guard I'd written earlier
(skip anything whose object key is defaultMessage or id).

Replacing specific-with-general regressed Mattermost from 144 wraps to 188. The
structural rule had a blind spot.

react-intl message descriptors are routinely passed around detached:

// A plain object: no recognized translation construct anywhere above it.
const messages = {
  greeting: { id: 'app.greeting', defaultMessage: 'Hello there' },
};
// ...elsewhere, far from any formatMessage or <FormattedMessage>:
showToast(messages.greeting);

That { id, defaultMessage } object has no formatMessage/<FormattedMessage>
ancestor anywhere near it. The structural walk finds nothing and the descriptor's
defaultMessage value gets re-wrapped. The grubby name-based guard ("a property
keyed defaultMessage is never copy to wrap") catches exactly these, because it
keys off the name, which travels with the object wherever it goes.

So I kept both. The structural rule covers message children and inline
formatMessage calls; the name-based guard covers detached descriptors the
structural rule can't see. Neither is sufficient alone.
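
A compact sketch of the two guards side by side. The `StringSite` shape is hypothetical: it flattens each string into its object key and the names of its enclosing calls and elements, which is enough to show why neither guard subsumes the other.

```typescript
// Flattened view of one string literal; a hypothetical shape for illustration.
interface StringSite {
  text: string;
  propertyKey?: string; // object key the string is the value of, if any
  ancestors: string[];  // names of enclosing calls and JSX elements
}

const I18N_CONSTRUCTS = new Set(["formatMessage", "FormattedMessage", "defineMessages", "Trans"]);
const DESCRIPTOR_KEYS = new Set(["defaultMessage", "id"]);

// Structural guard: keys off position.
const structuralGuard = (s: StringSite): boolean =>
  s.ancestors.some((a) => I18N_CONSTRUCTS.has(a));

// Name-based guard: keys off the property name, which travels with the object.
const nameGuard = (s: StringSite): boolean =>
  s.propertyKey !== undefined && DESCRIPTOR_KEYS.has(s.propertyKey);

const alreadyTranslated = (s: StringSite): boolean =>
  structuralGuard(s) || nameGuard(s);

// <Trans>Save changes</Trans>: only the structural guard fires.
const messageChild: StringSite = { text: "Save changes", ancestors: ["Trans"] };
// A descriptor declared as a plain object: only the name guard fires.
const detached: StringSite = { text: "Hello there", propertyKey: "defaultMessage", ancestors: [] };
// showToast({ label: "Delete" }): neither fires, so it stays a wrap candidate.
const hardcodedLabel: StringSite = { text: "Delete", propertyKey: "label", ancestors: ["showToast"] };
```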

The lesson is about epistemics, not code: you cannot predict from first
principles which rule you'll need. The elegant general rule and the ugly specific
one cover different real-world shapes, and only real codebases reveal which
shapes exist. I wanted one clean rule. The codebase wanted two. The codebase was
right.

§5 — Wrapper and HOC resolution (validated on fixtures; real-world is future work)

A common real-world pattern hides a registered sink behind a wrapper — a styling
helper or a higher-order component (HOC) that takes the real component and returns
a new one under a new name:

const StyledToast = styled(Toast);        // or memo(Toast), withTracking(Toast)
<StyledToast message="Saved" />           // AST sees "StyledToast", registry has "Toast"

The AST sees StyledToast; the registry knows Toast; the relationship is
invisible to name matching and the trace dead-ends. TransLift resolves this by
asking the TypeScript type checker for the resolved prop type of the wrapped
component rather than trying to follow the wrapper call, because TypeScript
already computes StyledToast's props through styled(), ComponentProps<typeof Toast>, intersections, and so on. It unwraps styled(X) / memo(X) / withX(Y)
/ connect()() to the inner registered component by inspecting the declaration's
call arguments. react-docgen-typescript resolves props through wrappers the same
way to power Storybook's prop tables, so the approach has precedent.
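
The declaration-argument unwrapping can be illustrated without a type checker. The sketch below is a name-only simplification and an assumption throughout: `DeclInit`, the wrapper pattern, and the depth cap are invented for the example, and it does not model the type-driven prop resolution described above.

```typescript
// How a top-level name was declared; a simplified, hypothetical model.
type DeclInit =
  | { kind: "component" }                             // function Toast(...) {...}
  | { kind: "call"; callee: string; args: string[] }; // styled(Toast), memo(Toast)

const REGISTERED_SINKS = new Set(["Toast"]);
const WRAPPERS = /^(styled|memo|with[A-Z]\w*)$/; // assumed wrapper naming pattern

// Follow wrapper calls inward until a registered component is found.
function resolveRegistered(
  name: string,
  decls: Map<string, DeclInit>,
  depth = 0,
): string | undefined {
  if (REGISTERED_SINKS.has(name)) return name;
  if (depth > 8) return undefined; // guard against cyclic declarations
  const init = decls.get(name);
  if (init === undefined || init.kind !== "call" || !WRAPPERS.test(init.callee)) {
    return undefined;
  }
  for (const arg of init.args) {
    const hit = resolveRegistered(arg, decls, depth + 1);
    if (hit !== undefined) return hit;
  }
  return undefined;
}

const decls = new Map<string, DeclInit>([
  ["Toast", { kind: "component" }],
  ["StyledToast", { kind: "call", callee: "styled", args: ["Toast"] }],
  ["TrackedToast", { kind: "call", callee: "withTracking", args: ["StyledToast"] }],
  ["Card", { kind: "component" }],
]);
```

In this toy model both `StyledToast` and `TrackedToast` resolve to `Toast`, while `Card` resolves to nothing and would need a registry entry of its own.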

The bound: this is validated on clean, two-layer fixtures only. styled(Toast)
works in a synthetic test; real design-system composition (Grafana's
@grafana/ui chains, Backstage's MUI styled + HOC stacks, wrappers nested
several layers deep across typed and untyped boundaries) is exactly the shape §2
warns about, where a green fixture suite hid an entire failure class on real code.
So I cap the claim at the evidence: the type-driven approach resolves the common
single-wrapper cases, and real-world design-system coverage is future work.
Render-wrappers (forwardRef((p, ref) => …), a hand-written component that
internally renders another) and untyped JS still require a manual registry entry.

§6 — The crash that falsified my own convention-awareness (Bluesky)

Bluesky is a real Lingui codebase, and React Native rather than web. I added it to
the benchmark expecting it to validate two things I'd shipped on faith: that my
foreign-i18n guard covered Lingui, and that the adapter handled RN's <Text>
component vocabulary. I'd built the guard against react-intl and added <Trans>
(which Lingui also uses) to the recognized list, so I assumed Lingui was covered.

Instead, on the very first run, extract crashed:

TypeError: Property quasi of TaggedTemplateExpression expected node to be of a
type ["TemplateLiteral"] but instead got "CallExpression"

Exit 1, nothing written. The guard I thought covered Lingui covered exactly one of
its forms, the <Trans> JSX element, and missed the form Bluesky actually uses
overwhelmingly: the macro _(msg`Require alt text`). There are 1,894 msg
macros in Bluesky's source versus 1,560 <Trans> elements. The dominant Lingui
idiom was the one my guard had never seen, and the failure wasn't a quiet
false-positive. It was a hard crash, because the mutator tried to replace a
template literal that was the .quasi of a tagged-template macro, which violates a
babel AST invariant.

This is the cleanest vindication of the whole "convention-awareness must be earned
per convention" thesis, so I want to be exact about the arc, including the missteps:

First reading (wrong): I measured 156 "re-translation FPs", but that was a count of dry-run verdicts, not emitted output. The tool never got that far.
Second reading (also wrong): a write-mode run showed 0 files changed, which I briefly read as "the guard works." It was a crashed run: 0 files changed because extract threw, not because it cleanly skipped.
Root cause: pinned to one line in the mutator's TemplateLiteral visitor.
Fix: recognize all three Lingui macro shapes (the tagged template _(msg`…`), the descriptor call msg({ message, context }), and the <Plural>/<Select> components), plus a mutator fail-safe that never replaces a tagged-template quasi, so any future mis-score degrades to a no-op instead of a crash.
Result: re-translation FP on the Settings slice went 156 → 20 → 2 → 0 across the three shapes; 190/190 tests pass; three regression tests pin the exact Bluesky shapes, including the crash case.
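
The fail-safe in the fix can be sketched on a toy tree. This is not the tool's mutator and does not use Babel: the node interfaces and `safeReplace` are hypothetical stand-ins that only mirror the invariant involved, namely that the `quasi` slot of a tagged template must stay a template literal.

```typescript
// Toy node types mirroring the relevant AST shapes; not Babel's real types.
interface TemplateLiteral { type: "TemplateLiteral"; raw: string }
interface TaggedTemplate { type: "TaggedTemplateExpression"; tag: string; quasi: AstNode }
interface Call { type: "CallExpression"; callee: string; args: AstNode[] }
type AstNode = TemplateLiteral | TaggedTemplate | Call;

const wrapInT = (node: AstNode): Call => ({ type: "CallExpression", callee: "t", args: [node] });

// Replace `target` inside `parent` with t(target). Returns false (a no-op)
// when the replacement would be unsafe or the target is not found.
function safeReplace(parent: AstNode, target: AstNode): boolean {
  if (parent.type === "TaggedTemplateExpression" && parent.quasi === target) {
    return false; // never replace a tagged-template quasi: degrade to a no-op
  }
  if (parent.type === "CallExpression") {
    const i = parent.args.indexOf(target);
    if (i === -1) return false;
    parent.args[i] = wrapInT(target);
    return true;
  }
  return false;
}

// msg`Require alt text`: the shape that crashed.
const quasi: TemplateLiteral = { type: "TemplateLiteral", raw: "Require alt text" };
const tagged: TaggedTemplate = { type: "TaggedTemplateExpression", tag: "msg", quasi };

// alert(`Saved`): an ordinary argument, safe to wrap.
const tpl: TemplateLiteral = { type: "TemplateLiteral", raw: "Saved" };
const alertCall: Call = { type: "CallExpression", callee: "alert", args: [tpl] };

const skipped = safeReplace(tagged, quasi);   // false, tree untouched
const replaced = safeReplace(alertCall, tpl); // true, argument is now t(`Saved`)
```

The point of putting the check in the mutator rather than only in the scorer is that a future mis-score then costs one missed wrap instead of an aborted run.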

I'm keeping the two wrong readings in this write-up on purpose. They're not flaws
in the finding; they're what the finding actually looked like before it resolved,
and a write-up that pretends the root cause was obvious on first look is lying the
same way a green fixture suite lies.

The competitor data point on the same repo is its own §0-grade example. i18next-cli
on Bluesky didn't just re-translate (23 cases of t() nested inside _(msg(...))).
It broke TypeScript, wrapping a type annotation:

type Props = NativeStackScreenProps<CommonNavigatorParams, i18next.t('aboutsettings', 'AboutSettings')>

That doesn't compile. A string in type position is not runtime copy, and a
convention-blind codemod can't tell the difference.

The RN recall axis came out clean, ~31/32, measured by
un-translating real Bluesky RN files (the stripped Lingui messages are the ground
truth) and checking how many the tool recovers. <Text> children and
accessibilityLabel/accessibilityHint all wrap correctly; the single miss is a
single-word <ButtonText>{"Submit"}</ButtonText>, the same identifier-shaped-label
class as Excalidraw's "Code"/"Normal" misses. The RN component vocabulary didn't
choke.

The lesson is the thesis, sharpened: convention-awareness is real but fragile
per-convention. A convention you haven't tested against real code is a latent bug;
here, a latent crash. The differentiator isn't a property you have; it's a debt
you keep paying, one codebase at a time.

§7 — What I still don't trust

A write-up that ends on the wins is marketing. Here's the honest ledger.

The fixture-vs-real gap is managed, not closed. §2 is a permanent hazard, not a solved problem. Every heuristic I have could be hiding another 45-point gap behind a convention I haven't benchmarked. The mitigation (benchmark against real pre-i18n checkouts) only covers the conventions I've thought to test.
Recall numbers are directional, measured differently per tool. The 90-vs-76 Excalidraw gap is real and large enough to mean something; I would not defend the exact digits. Cross-tool recall comparison via three different measurement methods (own verdicts vs. diff-parsing vs. grep) is inherently ±1–2.
Bluesky RN recall is synthetic and small-sample. ~31/32 is one slice, reconstructed by un-translating real files rather than a true pre-i18n checkout (Bluesky adopted Lingui too early for that). Read it as "the RN adapter clearly works," not as a precise figure.
Wrapper/HOC resolution (§5) is unproven on real design systems. Fixtures only. I flagged this hardest because it's where I'm most likely wrong, and the one place a future benchmark (Grafana, Backstage) could still overturn a claim.
The residual identifier-shaped-label misses persist across all three repos: "Code", "Normal", and "Submit", single-word labels in unregistered components that look like code identifiers. Consistent, understood, not yet fixed.
a18n is run outside its design. Its capitalized-mode numbers exist only to give an English data point; its real CJK mode is a different tool for a different job, and the comparison says nothing about how good it is at that job.

The thread through all six problems is one idea: in static string extraction,
location is tractable and meaning is not. The symbol graph can tell you exactly
where a string flows: through props, wrappers, function calls, to its terminal
position. What it cannot tell you is whether that string is copy. Every fix in
this essay is a hand-built rule layered on top of the graph to answer the meaning
question for one more shape, one more convention. That's not a deficiency to be
engineered away; it's the irreducible core of the problem.

Convention-awareness, knowing what's already translated, is the capability that
separates a usable extractor from one that double-translates a codebase. It is
achievable. The other tools haven't built it. I built it three times, for three
conventions, and got it wrong first each time. That's the finding: not that the
problem is solved, but that it's earnable, and the earning leaves an audit trail of
exactly the kind of mistakes above.