DEV Community

SEN LLC

Writing a SQL Formatter With a Handwritten Tokenizer


A SQL formatter is a tokenizer plus a pretty-printer. The tokenizer recognizes keywords, identifiers, strings, numbers, operators, and comments. The pretty-printer walks the token stream, emitting newlines before major clauses (SELECT, FROM, WHERE, JOIN) and applying indentation rules for nested parentheses. No parser is needed; tokens are enough.

SQL formatting is a solved problem (sql-formatter, Prettier's SQL plugin, IDEs). But understanding how it works is valuable, and a minimal implementation fits in about 500 lines of vanilla JS.

🔗 Live demo: https://sen.ltd/portfolio/sql-formatter/
📦 GitHub: https://github.com/sen-ltd/sql-formatter


Features:

  • Handwritten tokenizer (keywords, strings, numbers, operators, comments)
  • Pretty-print with configurable indent
  • Keyword case conversion (UPPER / lower / Capitalize)
  • Minify (single-line output)
  • Syntax highlighting
  • Example queries
  • Japanese / English UI
  • Dark / light theme
  • Zero dependencies, 57 tests

The tokenizer

SQL tokens are easier to identify than you'd think:

export function tokenize(sql) {
  const tokens = [];
  let i = 0;
  while (i < sql.length) {
    const c = sql[i];

    // Whitespace
    if (/\s/.test(c)) {
      let j = i;
      while (j < sql.length && /\s/.test(sql[j])) j++;
      tokens.push({ type: 'whitespace', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Line comment
    if (c === '-' && sql[i + 1] === '-') {
      let j = i + 2;
      while (j < sql.length && sql[j] !== '\n') j++;
      tokens.push({ type: 'comment', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Block comment
    if (c === '/' && sql[i + 1] === '*') {
      const end = sql.indexOf('*/', i + 2);
      const j = end === -1 ? sql.length : end + 2;
      tokens.push({ type: 'comment', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // String literal
    if (c === "'") {
      let j = i + 1;
      while (j < sql.length) {
        if (sql[j] === "'" && sql[j + 1] === "'") j += 2;
        else if (sql[j] === "'") { j++; break; }
        else j++;
      }
      tokens.push({ type: 'string', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Identifier or keyword
    if (/[a-zA-Z_]/.test(c)) {
      let j = i;
      while (j < sql.length && /[a-zA-Z0-9_]/.test(sql[j])) j++;
      const word = sql.slice(i, j);
      const type = KEYWORDS.has(word.toUpperCase()) ? 'keyword' : 'identifier';
      tokens.push({ type, value: word });
      i = j;
      continue;
    }

    // Number (a sign only belongs to the token right after e/E,
    // so `1+2` still tokenizes as three separate tokens)
    if (/[0-9]/.test(c)) {
      let j = i;
      while (j < sql.length && /[0-9.]/.test(sql[j])) j++;
      if ((sql[j] === 'e' || sql[j] === 'E') && /[0-9+\-]/.test(sql[j + 1] || '')) {
        j += sql[j + 1] === '+' || sql[j + 1] === '-' ? 2 : 1;
        while (j < sql.length && /[0-9]/.test(sql[j])) j++;
      }
      tokens.push({ type: 'number', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Operators and punctuation
    const twoChar = sql.slice(i, i + 2);
    if (['<>', '<=', '>=', '!='].includes(twoChar)) {
      tokens.push({ type: 'operator', value: twoChar });
      i += 2;
      continue;
    }

    tokens.push({ type: c.match(/[(),;]/) ? 'punctuation' : 'operator', value: c });
    i++;
  }
  return tokens;
}
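The tokenizer looks up words in a KEYWORDS set that the excerpt doesn't show. A minimal sketch (the real set in the repo is presumably far larger):

```javascript
// Minimal keyword set for illustration; a full formatter would list
// hundreds of keywords across dialects.
const KEYWORDS = new Set([
  'SELECT', 'FROM', 'WHERE', 'JOIN', 'LEFT', 'RIGHT', 'INNER', 'OUTER',
  'ON', 'AND', 'OR', 'NOT', 'IN', 'AS', 'GROUP', 'BY', 'ORDER', 'HAVING',
  'LIMIT', 'INSERT', 'INTO', 'VALUES', 'UPDATE', 'SET', 'DELETE', 'NULL',
]);

// Lookup is always on the uppercased word, so matching is case-insensitive.
console.log(KEYWORDS.has('select'.toUpperCase())); // true
```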

The non-obvious parts:

  • SQL escapes a quote inside a string by doubling it ('O''Brien'), not with a backslash. This is the detail most first attempts at a SQL tokenizer get wrong.
  • Two-character operators (<>, <=, >=, !=) are matched before single characters.
  • Keywords are recognized by looking up the uppercased form in a Set — so select and SELECT both become keyword tokens.
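
To see the doubled-quote rule in isolation, here is the string-scanning step pulled out as a standalone function (same logic as the tokenizer above, just wrapped for testing):

```javascript
// Scan one SQL string literal starting at index i, honoring
// doubled-quote escapes ('' is a literal quote, not a terminator).
function scanString(sql, i) {
  let j = i + 1; // skip the opening quote
  while (j < sql.length) {
    if (sql[j] === "'" && sql[j + 1] === "'") j += 2; // escaped quote, keep going
    else if (sql[j] === "'") { j++; break; }          // closing quote
    else j++;
  }
  return sql.slice(i, j);
}

console.log(scanString("'O''Brien', name", 0)); // "'O''Brien'"
```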

Pretty-printing with clause breaks

The formatter walks the tokens and decides where to insert newlines:

const CLAUSE_STARTERS = new Set([
  'SELECT', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT',
  'UNION', 'INSERT', 'UPDATE', 'SET', 'DELETE', 'VALUES',
]);

const JOIN_KEYWORDS = new Set(['JOIN', 'LEFT', 'RIGHT', 'INNER', 'OUTER', 'CROSS']);

function format(tokens, options) {
  const out = [];
  let depth = 0;
  for (const tok of tokens) {
    if (tok.type === 'whitespace') continue; // strip, we rebuild

    if (tok.type === 'keyword') {
      const upper = tok.value.toUpperCase();
      if (CLAUSE_STARTERS.has(upper) || JOIN_KEYWORDS.has(upper)) {
        out.push('\n' + indent(depth));
      }
    }

    if (tok.value === '(') depth++;
    out.push(formatToken(tok, options));
    if (tok.value === ')') depth--;
  }
  return out.join('');
}
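format() calls two helpers the excerpt doesn't show, indent and formatToken. Plausible sketches of both (hypothetical; option names like keywordCase are my assumption, not necessarily the repo's API) that also cover the keyword case-conversion feature:

```javascript
// Hypothetical helpers for the format() excerpt above.
function indent(depth, unit = '  ') {
  return unit.repeat(depth);
}

// Each token carries its own trailing space, since out.join('')
// inserts no separators of its own.
function formatToken(tok, options = {}) {
  if (tok.type !== 'keyword') return tok.value + ' ';
  const kw = tok.value;
  const cased =
    options.keywordCase === 'lower' ? kw.toLowerCase() :
    options.keywordCase === 'capitalize'
      ? kw[0].toUpperCase() + kw.slice(1).toLowerCase()
      : kw.toUpperCase(); // default: UPPER
  return cased + ' ';
}

console.log(formatToken({ type: 'keyword', value: 'select' })); // "SELECT "
```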

The pattern: strip the original whitespace and rebuild from scratch. This makes the formatter idempotent: output depends only on the token stream, so running it twice produces the same result regardless of how the input was formatted.
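
The strip-and-rebuild property can be demonstrated with a tiny self-contained mini-formatter (assumed names like miniFormat, not the repo's API): because the split discards all original whitespace, formatting already-formatted output is a no-op.

```javascript
// Toy version of the strip-and-rebuild pattern: break the input into
// words, then put clause-starting keywords on their own lines.
const CLAUSES = new Set(['SELECT', 'FROM', 'WHERE', 'ORDER', 'GROUP']);

function miniFormat(sql) {
  const words = sql.split(/\s+/).filter(Boolean); // original whitespace is gone
  let out = '';
  for (const w of words) {
    if (CLAUSES.has(w.toUpperCase())) out += (out ? '\n' : '') + w;
    else out += ' ' + w;
  }
  return out;
}

const once = miniFormat('select id,  name\n from users   where id = 1');
const twice = miniFormat(once);
console.log(once === twice); // true — a fixed point
```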

Why handwritten instead of parser generator

A real SQL parser is complicated — SQL has dozens of dialects, hundreds of keywords, and context-sensitive grammar. But a formatter doesn't need full parsing. You just need to know where clause boundaries are, and those can be detected at the token level.

The tokenizer is ~150 lines. The formatter is ~100 lines. Together they handle 90% of real-world SQL queries well. The remaining 10% (weird vendor extensions, CTE formatting, window functions with complex OVER clauses) would need a real parser, but that's scope creep for most use cases.

Series

This is entry #79 in my 100+ public portfolio series.
