Writing a SQL Formatter With a Handwritten Tokenizer
A SQL formatter is a tokenizer plus a pretty-printer. The tokenizer recognizes keywords, identifiers, strings, numbers, operators, and comments. The pretty-printer walks the token stream and emits it with newlines before major clauses (SELECT, FROM, WHERE, JOIN) and indentation rules for nested parens. No parser is needed; tokens are enough.
SQL formatting is a solved problem (sql-formatter, Prettier's SQL plugin, IDEs). But understanding how it works is valuable, and a minimal implementation fits in about 500 lines of vanilla JS.
🔗 Live demo: https://sen.ltd/portfolio/sql-formatter/
📦 GitHub: https://github.com/sen-ltd/sql-formatter
Features:
- Handwritten tokenizer (keywords, strings, numbers, operators, comments)
- Pretty-print with configurable indent
- Keyword case conversion (UPPER / lower / Capitalize)
- Minify (single-line output)
- Syntax highlighting
- Example queries
- Japanese / English UI
- Dark / light theme
- Zero dependencies, 57 tests
The tokenizer
SQL tokens are easier to identify than you'd think:
```javascript
// KEYWORDS is a Set of uppercase SQL keywords, defined elsewhere in the repo.
export function tokenize(sql) {
  const tokens = [];
  let i = 0;
  while (i < sql.length) {
    const c = sql[i];

    // Whitespace
    if (/\s/.test(c)) {
      let j = i;
      while (j < sql.length && /\s/.test(sql[j])) j++;
      tokens.push({ type: 'whitespace', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Line comment
    if (c === '-' && sql[i + 1] === '-') {
      let j = i + 2;
      while (j < sql.length && sql[j] !== '\n') j++;
      tokens.push({ type: 'comment', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Block comment (an unterminated comment runs to end of input)
    if (c === '/' && sql[i + 1] === '*') {
      const end = sql.indexOf('*/', i + 2);
      const j = end === -1 ? sql.length : end + 2;
      tokens.push({ type: 'comment', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // String literal ('' is an escaped quote)
    if (c === "'") {
      let j = i + 1;
      while (j < sql.length) {
        if (sql[j] === "'" && sql[j + 1] === "'") j += 2;
        else if (sql[j] === "'") { j++; break; }
        else j++;
      }
      tokens.push({ type: 'string', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Identifier or keyword
    if (/[a-zA-Z_]/.test(c)) {
      let j = i;
      while (j < sql.length && /[a-zA-Z0-9_]/.test(sql[j])) j++;
      const word = sql.slice(i, j);
      const type = KEYWORDS.has(word.toUpperCase()) ? 'keyword' : 'identifier';
      tokens.push({ type, value: word });
      i = j;
      continue;
    }

    // Number: digits and dots, then an optional exponent. The sign is only
    // consumed inside the exponent, so `1+2` stays three tokens.
    if (/[0-9]/.test(c)) {
      let j = i;
      while (j < sql.length && /[0-9.]/.test(sql[j])) j++;
      if (j < sql.length && /[eE]/.test(sql[j])) {
        let k = j + 1;
        if (sql[k] === '+' || sql[k] === '-') k++;
        if (k < sql.length && /[0-9]/.test(sql[k])) {
          while (k < sql.length && /[0-9]/.test(sql[k])) k++;
          j = k;
        }
      }
      tokens.push({ type: 'number', value: sql.slice(i, j) });
      i = j;
      continue;
    }

    // Two-character operators before single characters
    const twoChar = sql.slice(i, i + 2);
    if (['<>', '<=', '>=', '!='].includes(twoChar)) {
      tokens.push({ type: 'operator', value: twoChar });
      i += 2;
      continue;
    }

    tokens.push({ type: /[(),;]/.test(c) ? 'punctuation' : 'operator', value: c });
    i++;
  }
  return tokens;
}
```
The non-obvious parts:
- SQL string escapes are doubled quotes (`'O''Brien'`), not backslashes; anyone coming from a backslash-escaping language tends to get this wrong on the first attempt.
- Two-character operators (`<>`, `<=`, `>=`, `!=`) are matched before single characters, so `<=` is never split into `<` and `=`.
- Keywords are recognized by looking up the uppercased form in a Set, so `select` and `SELECT` both become `keyword` tokens.
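The doubled-quote rule is easy to exercise in isolation. Here is a small standalone sketch (not the repo's code) that scans a single SQL string literal the same way the tokenizer above does:

```javascript
// Standalone sketch: scan a SQL string literal starting at index `start`
// (which must point at the opening quote), treating '' as an escaped quote.
// Returns the index just past the closing quote, or -1 if unterminated.
function scanStringLiteral(sql, start) {
  let j = start + 1; // skip the opening quote
  while (j < sql.length) {
    if (sql[j] === "'" && sql[j + 1] === "'") j += 2; // escaped quote
    else if (sql[j] === "'") return j + 1;            // closing quote
    else j++;
  }
  return -1; // unterminated literal
}

const demo = "name = 'O''Brien' AND x = 1";
const from = demo.indexOf("'");
const to = scanStringLiteral(demo, from);
console.log(demo.slice(from, to)); // 'O''Brien'
```

Note that the escaped-quote check must come before the closing-quote check; swapping the two branches would end the literal at the first `'` of `''`.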
Pretty-printing with clause breaks
The formatter walks the tokens and decides where to insert newlines:
```javascript
const CLAUSE_STARTERS = new Set([
  'SELECT', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT',
  'UNION', 'INSERT', 'UPDATE', 'SET', 'DELETE', 'VALUES',
]);

const JOIN_KEYWORDS = new Set(['JOIN', 'LEFT', 'RIGHT', 'INNER', 'OUTER', 'CROSS']);

function indent(depth, width = 2) {
  return ' '.repeat(depth * width);
}

function format(tokens, options) {
  const out = [];
  let depth = 0;
  for (const tok of tokens) {
    if (tok.type === 'whitespace') continue; // strip; spacing is rebuilt below
    if (tok.type === 'keyword') {
      const upper = tok.value.toUpperCase();
      if (CLAUSE_STARTERS.has(upper) || JOIN_KEYWORDS.has(upper)) {
        out.push('\n' + indent(depth, options.indent));
      }
    }
    if (tok.value === '(') depth++;
    // separate tokens with a single space unless a line break was just
    // emitted (the real formatter has finer rules, e.g. no space before commas)
    const last = out[out.length - 1];
    if (last !== undefined && !last.startsWith('\n')) out.push(' ');
    out.push(formatToken(tok, options));
    if (tok.value === ')') depth--;
  }
  return out.join('').trim();
}
```
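The post doesn't show `formatToken`. A hypothetical sketch of its keyword-case handling might look like this (the option name `keywordCase` is an assumption for illustration, not necessarily the repo's API):

```javascript
// Hypothetical sketch of formatToken's case-conversion step.
// `options.keywordCase` is an assumed option name, not taken from the repo.
function formatToken(tok, options = {}) {
  if (tok.type !== 'keyword') return tok.value; // only keywords are re-cased
  switch (options.keywordCase) {
    case 'upper':
      return tok.value.toUpperCase();
    case 'lower':
      return tok.value.toLowerCase();
    case 'capitalize':
      return tok.value[0].toUpperCase() + tok.value.slice(1).toLowerCase();
    default:
      return tok.value; // leave as written
  }
}

console.log(formatToken({ type: 'keyword', value: 'select' }, { keywordCase: 'upper' })); // SELECT
```

Identifiers and strings pass through untouched, which is why keyword-case conversion is safe to apply blindly: the tokenizer has already separated `select` the keyword from `"select"` the column name.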
The pattern: strip original whitespace, rebuild from scratch. Because the output depends only on the token sequence, the formatter is deterministic and idempotent: two inputs that differ only in whitespace format identically, and formatting already-formatted SQL changes nothing.
Why handwritten instead of parser generator
A real SQL parser is complicated — SQL has dozens of dialects, hundreds of keywords, and context-sensitive grammar. But a formatter doesn't need full parsing. You just need to know where clause boundaries are, and those can be detected at the token level.
The tokenizer is ~150 lines. The formatter is ~100 lines. Together they handle 90% of real-world SQL queries well. The remaining 10% (weird vendor extensions, CTE formatting, window functions with complex OVER clauses) would need a real parser, but that's scope creep for most use cases.
Series
This is entry #79 in my 100+ public portfolio series.
- 📦 Repo: https://github.com/sen-ltd/sql-formatter
- 🌐 Live: https://sen.ltd/portfolio/sql-formatter/
- 🏢 Company: https://sen.ltd/
