SEN LLC

Posted on Jun 12

A Hello World Museum for 84 Languages — and the Round-Trip Test That Makes a Tiny Syntax Highlighter Trustworthy

#javascript #programming #testing #frontend

I built a "museum" that exhibits the Hello World of 84 programming languages on a decade timeline — Fortran (1957) through Mojo (2023), with paradigm/typing filters and a clickable influence graph. The data collection is the fun part, but two implementation details turned out to be worth writing about: (1) syntax-highlighting 84 languages without shipping highlight.js — one generic tokenizer parameterized by 27 declarative family profiles, and (2) testing the tokenizer with a round-trip invariant: re-joining the token stream must reproduce the original source, for all 84 snippets. Plus a closed-vocabulary trick that turns influence-graph typos into test failures.

🌐 Demo: https://sen.ltd/portfolio/hello-world-museum/
📦 GitHub: https://github.com/sen-ltd/hello-world-museum

The data model

{ name: "Rust", year: 2010,
  paradigms: ["imperative", "functional", "concurrent"],
  typing: "static",
  family: "c",                                    // highlighting profile
  influences: ["C++", "ML", "Haskell", "Erlang"], // closed vocabulary
  code: `fn main() {\n    println!("Hello, World!");\n}` },

84 languages × 7 fields. The timeline, filters, influence graph, and highlighting all derive from this one table.
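As one illustration of deriving a view from that table, the decade timeline can be computed in a single grouping pass. This is a minimal sketch, not the repository's code: `groupByDecade` and the three-entry `SAMPLE` are hypothetical names, with years taken from the figures quoted in this post.

```javascript
// Hypothetical three-entry stand-in for the 84-language table.
const SAMPLE = [
  { name: "Rust", year: 2010 },
  { name: "Fortran", year: 1957 },
  { name: "Mojo", year: 2023 },
];

// Group language names by decade (1957 → 1950), decades in chronological order.
function groupByDecade(langs) {
  const byDecade = new Map();
  for (const lang of langs) {
    const decade = Math.floor(lang.year / 10) * 10;
    if (!byDecade.has(decade)) byDecade.set(decade, []);
    byDecade.get(decade).push(lang.name);
  }
  return [...byDecade.entries()].sort(([a], [b]) => a - b);
}

console.log(groupByDecade(SAMPLE));
// → [[1950, ["Fortran"]], [2010, ["Rust"]], [2020, ["Mojo"]]]
```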

A tiny highlighter: 1 tokenizer + 27 profiles

highlight.js and Prism are excellent, but bundling grammars for this many languages can cost hundreds of KB. For Hello Worlds of one to six lines you need only four token types (comment, string, keyword, number), and one generic tokenizer parameterized per language family covers them:

export const PROFILES = {
  c: {
    lineComment: "//",
    blockComment: ["/*", "*/"],
    strings: ['"', "'", "`"],
    keywords: ["int", "void", "return", "fn", "func", ...],
  },
  python:  { lineComment: "#",  blockComment: null,         strings: ['"', "'"], ... },
  haskell: { lineComment: "--", blockComment: ["{-", "-}"], strings: ['"'], ... },
  ml:      { lineComment: null, blockComment: ["(*", "*)"], strings: ['"'], ... },
  // 27 families total
};

The C family (C, C++, Java, JavaScript, Go, Rust, Zig, ...) shares one profile. So do the Lisps (Lisp, Scheme, Racket, Clojure). 84 languages compress to 27 profiles.

The tokenizer is a single pass with priority ordering:

export function tokenizeLine(line, profile) {
  let i = 0;
  while (i < line.length) {
    // 1. line comment → rest of line
    // 2. block comment (same-line)
    // 3. string (with backslash escapes)
    // 4. number (not inside an identifier)
    // 5. word → keyword lookup → "kw" or "plain"
    // 6. any other single char
  }
}

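Priority matters: comment > string > number > word. At each position the tokenizer tries these checks in order and takes the first one that matches. So # "not a string" is all comment, because the # matches before the quote is ever looked at, and "// not a comment" is all string, because the opening quote is consumed before the scan reaches the //. First-match-wins gives you this ordering for free.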
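To make the six steps concrete, here is one way the skeleton could be filled in. Treat it as a sketch under assumptions rather than the repository's implementation: the token type names ("comment", "str", "num", "kw", "plain"), the `emit` helper, the simple number scan, and the two cut-down profiles are all hypothetical.

```javascript
// Sketch only: token type names and the number scan are assumptions.
function tokenizeLine(line, profile) {
  const tokens = [];
  const keywords = new Set(profile.keywords);
  const isWordChar = (ch) => /[A-Za-z0-9_]/.test(ch);
  let i = 0;
  // Emit line.slice(i, end) and jump to end, so each character is emitted exactly once.
  const emit = (type, end) => {
    tokens.push({ type, text: line.slice(i, end) });
    i = end;
  };

  while (i < line.length) {
    const ch = line[i];
    // 1. line comment → rest of line
    if (profile.lineComment && line.startsWith(profile.lineComment, i)) {
      emit("comment", line.length);
      continue;
    }
    // 2. block comment (same-line); an unterminated one runs to end of line
    if (profile.blockComment && line.startsWith(profile.blockComment[0], i)) {
      const [open, close] = profile.blockComment;
      const at = line.indexOf(close, i + open.length);
      emit("comment", at === -1 ? line.length : at + close.length);
      continue;
    }
    // 3. string: a backslash skips itself plus the escaped character
    if (profile.strings.includes(ch)) {
      let j = i + 1;
      while (j < line.length && line[j] !== ch) j += line[j] === "\\" ? 2 : 1;
      emit("str", Math.min(j + 1, line.length));
      continue;
    }
    // 4. number, only when the previous character is not part of an identifier
    if (/[0-9]/.test(ch) && !isWordChar(line[i - 1] || "")) {
      let j = i + 1;
      while (j < line.length && /[0-9.]/.test(line[j])) j++;
      emit("num", j);
      continue;
    }
    // 5. word → keyword lookup → "kw" or "plain"
    if (isWordChar(ch)) {
      let j = i + 1;
      while (j < line.length && isWordChar(line[j])) j++;
      emit(keywords.has(line.slice(i, j)) ? "kw" : "plain", j);
      continue;
    }
    // 6. any other single char
    emit("plain", i + 1);
  }
  return tokens;
}

// Cut-down, hypothetical profiles for the demo.
const C_LIKE = { lineComment: "//", blockComment: ["/*", "*/"], strings: ['"', "'", "`"], keywords: ["int", "return"] };
const PY_LIKE = { lineComment: "#", blockComment: null, strings: ['"', "'"], keywords: ["print"] };

console.log(tokenizeLine('# "not a string"', PY_LIKE));
// → [{ type: "comment", text: '# "not a string"' }]
console.log(tokenizeLine('"// not a comment"', C_LIKE));
// → [{ type: "str", text: '"// not a comment"' }]
```

Because every branch goes through `emit`, which slices from the current index and then advances to the slice's end, this version cannot drop or duplicate a character by construction. A tokenizer that builds token text and advances the index in separate steps has no such guarantee.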

The round-trip test

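The most valuable tokenizer test here isn't a list of hand-picked cases. It's the invariant that concatenating the tokens reproduces the input exactly. It says nothing about whether a token got the right type (the specific-case tests cover that), but it proves that no character was dropped, duplicated, or reordered: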

test("highlight() reassembles to the original code", () => {
  for (const lang of LANGUAGES) {
    const lines = highlight(lang.code, lang.family);
    const rebuilt = lines.map((toks) => toks.map((t) => t.text).join("")).join("\n");
    assert.equal(rebuilt, lang.code, `${lang.name}: lossy tokenization`);
  }
});

Why this is so effective:

Dropped characters (forgot to advance the index) → caught.
Duplicated characters (pushed text but didn't advance) → caught.
All 84 real snippets become edge-case fixtures for free: Brainfuck's + runs, APL's quote style, COBOL's leading spaces, Befunge's "!dlroW ,olleH" — cases I would never have thought to write by hand.

It caught two real bugs while I was writing the tokenizer: an escape-handling branch that skipped one character too many, and an off-by-one in the "is the previous character part of an identifier" check (line[i - 1] || "" at position 0). Neither would have made it into a hand-written case list.
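The escape bug is easy to reproduce in miniature. The toy tokenizer below is hypothetical and far simpler than the real one; its `escapeSkip` parameter exists only to switch the bug on. It mimics the class of bug described above, not the actual code that had it.

```javascript
// Toy tokenizer: double-quoted strings become "str" tokens, everything else
// is emitted one character at a time. After a backslash it records two
// characters and advances by escapeSkip; the correct value is 2.
function toyTokenize(line, escapeSkip = 2) {
  const tokens = [];
  let i = 0;
  while (i < line.length) {
    if (line[i] === '"') {
      let text = '"';
      i++;
      while (i < line.length && line[i] !== '"') {
        if (line[i] === "\\") {
          text += line.slice(i, i + 2); // backslash + escaped character
          i += escapeSkip;              // 3 here skips one character too many
        } else {
          text += line[i++];
        }
      }
      if (i < line.length) {
        text += '"'; // closing quote
        i++;
      }
      tokens.push({ type: "str", text });
    } else {
      tokens.push({ type: "plain", text: line[i++] });
    }
  }
  return tokens;
}

// The round-trip invariant: joined token text must equal the input.
const roundTrips = (line, escapeSkip) =>
  toyTokenize(line, escapeSkip).map((t) => t.text).join("") === line;

const sample = 'print("a\\nb")'; // the source text print("a\nb")
console.log(roundTrips(sample, 2)); // → true
console.log(roundTrips(sample, 3)); // → false: the "b" after \n is lost
```

A hand-written case list only catches this if someone thinks to include a string with an escape followed by more text. The round trip catches it as soon as any snippet in the dataset contains one.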

Closed-vocabulary influence references

The influences field is hand-typed, which means typos ("Smalltak") are inevitable. The fix: references must resolve to either an in-dataset language or an explicit allowlist of historical out-of-dataset names (ISWIM, BCPL, ABC, ...):

export function unresolvedInfluences(pool, external) {
  const names = new Set(pool.map((l) => l.name));
  const ext = new Set(external);
  const bad = new Set();
  for (const lang of pool) {
    for (const inf of lang.influences) {
      if (!names.has(inf) && !ext.has(inf)) bad.add(`${lang.name} → ${inf}`);
    }
  }
  return [...bad].sort();
}

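test("all influence references resolve", () => {
  // EXTERNAL is an assumed name for the out-of-dataset allowlist (ISWIM, BCPL, ABC, ...)
  assert.deepEqual(unresolvedInfluences(LANGUAGES, EXTERNAL), []);
});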

Note assert.deepEqual(bad, []) instead of assert.equal(bad.length, 0): when it fails, the error message shows exactly which references are broken, not just "1 !== 0".
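To see what the failing value looks like, here is the same function (minus the export) run against a deliberately broken two-entry pool. `POOL` and `ALLOWLIST` are hypothetical stand-ins for the real dataset and its external-name list.

```javascript
// Same logic as the function in this post, repeated so the example runs on its own.
function unresolvedInfluences(pool, external) {
  const names = new Set(pool.map((l) => l.name));
  const ext = new Set(external);
  const bad = new Set();
  for (const lang of pool) {
    for (const inf of lang.influences) {
      if (!names.has(inf) && !ext.has(inf)) bad.add(`${lang.name} → ${inf}`);
    }
  }
  return [...bad].sort();
}

// Hypothetical mini-pool with one deliberate typo.
const POOL = [
  { name: "Smalltalk", influences: ["Lisp"] },
  { name: "Ruby", influences: ["Smalltak", "Lisp"] }, // typo: should be "Smalltalk"
];
// Lisp is allowlisted here only because this toy pool leaves it out.
const ALLOWLIST = ["Lisp"];

console.log(unresolvedInfluences(POOL, ALLOWLIST));
// → ["Ruby → Smalltak"]
```

Comparing that array against [] puts the offending edge directly in the assertion diff, which is the point of preferring deepEqual over a length check.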

With resolved references, derived queries get fun:

influenceRanking();
// → C: 13, Lisp: 13, ALGOL 60: 9, Haskell: 9, Java: 9, Smalltalk: 8, ML: 7 ...

descendants("Lisp"); // transitive BFS
// → Scheme, Common Lisp, Racket, Clojure, JavaScript (via Scheme), Ruby, ...

C and Lisp tie for first. Haskell ranks high not as a "most used" language but as a "most influential" one — which is exactly the story an influence graph should tell.
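Both queries are short once every reference resolves. The versions below are a sketch under assumptions: they take the pool as an explicit argument (the calls above take none), `LANGS` is a hypothetical four-entry dataset, and the tie-break by name is my choice rather than something the post specifies.

```javascript
// Hypothetical mini-dataset; the real table has 84 entries.
const LANGS = [
  { name: "Lisp", influences: [] },
  { name: "Scheme", influences: ["Lisp"] },
  { name: "Clojure", influences: ["Lisp"] },
  { name: "JavaScript", influences: ["Scheme"] },
];

// Count how many languages list each name as an influence, highest first.
function influenceRanking(pool) {
  const counts = new Map();
  for (const lang of pool) {
    for (const inf of lang.influences) counts.set(inf, (counts.get(inf) || 0) + 1);
  }
  // Ties are broken alphabetically (an assumption).
  return [...counts.entries()].sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}

// Breadth-first search over reversed "influenced by" edges.
function descendants(root, pool) {
  const children = new Map();
  for (const lang of pool) {
    for (const inf of lang.influences) {
      if (!children.has(inf)) children.set(inf, []);
      children.get(inf).push(lang.name);
    }
  }
  const seen = new Set([root]); // guards against cycles and repeat visits
  const queue = [root];
  const found = [];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const child of children.get(current) || []) {
      if (seen.has(child)) continue;
      seen.add(child);
      found.push(child);
      queue.push(child);
    }
  }
  return found;
}

console.log(influenceRanking(LANGS));    // → [["Lisp", 2], ["Scheme", 1]]
console.log(descendants("Lisp", LANGS)); // → ["Scheme", "Clojure", "JavaScript"]
```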

Clickable lineage

Each card's influences are clickable when in-dataset, and clicking one filters the museum to that language. You can walk Rust → C++ → C → ALGOL 60, four generations of syntax ancestry, at one click per hop.

Architecture

data.js      ← 84 languages (the data IS the app)
core.js      ← filtering, decade grouping, influence graph (DOM-free)
highlight.js ← generic tokenizer + 27 profiles (DOM-free)
app.js       ← UI glue

32 tests: data integrity (no duplicates, plausible years, paradigm vocabulary, profile existence, influence resolution), logic (AND-filtering, decade grouping, BFS), and the tokenizer (specific cases + the 84-snippet round trip).
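For the AND-filtering mentioned above, a DOM-free version can be a single predicate. This is a minimal sketch: `filterLanguages`, its option names, and the `SAMPLE` entries (apart from Rust's fields, which come from the data-model example) are hypothetical.

```javascript
// Hypothetical sample; only Rust's fields are taken from the post.
const SAMPLE = [
  { name: "Rust", paradigms: ["imperative", "functional", "concurrent"], typing: "static" },
  { name: "Haskell", paradigms: ["functional"], typing: "static" },
  { name: "Python", paradigms: ["imperative", "functional"], typing: "dynamic" },
];

// A language passes only if it satisfies every active filter (AND semantics).
// A filter left as null is inactive and matches everything.
function filterLanguages(pool, { paradigm = null, typing = null } = {}) {
  return pool.filter(
    (lang) =>
      (paradigm === null || lang.paradigms.includes(paradigm)) &&
      (typing === null || lang.typing === typing)
  );
}

const names = (langs) => langs.map((l) => l.name);
console.log(names(filterLanguages(SAMPLE, { paradigm: "functional", typing: "static" })));
// → ["Rust", "Haskell"]
console.log(names(filterLanguages(SAMPLE, { typing: "dynamic" })));
// → ["Python"]
```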

Try it

Demo: https://sen.ltd/portfolio/hello-world-museum/
GitHub: https://github.com/sen-ltd/hello-world-museum

Recommended: set the paradigm filter to "esoteric". You get Brainfuck, Befunge, LOLCODE, INTERCAL, Whitespace, Piet, and Shakespeare. Befunge's Hello World stores the string backwards ("!dlroW ,olleH") because, as the program counter travels through the 2D code space, string mode pushes each character onto a stack, so the characters pop off in reverse order when printed.

Takeaways

You don't need highlight.js for snippets. One generic tokenizer + declarative per-family profiles: 84 languages → 27 profiles.
Test tokenizers with the round-trip invariant (join(tokens) === original). Your real data becomes your edge-case suite, and it catches index bugs that a hand-written list would likely miss.
Priority is comment > string > number > word, and first-match-wins gives it to you naturally.
Hand-typed graph references need a closed vocabulary + a resolution test. Typos become red tests, not silent dead links.
assert.deepEqual(bad, []) beats assert.equal(bad.length, 0) — failures carry the evidence.

This is OSS portfolio #261 from SEN LLC (Tokyo). https://sen.ltd/portfolio/
