A line like SELECT name FROM users WHERE id = 1 arrives at the backend. As we saw in 1.1.1, the backend is a child process forked by the postmaster when the client connects, dedicated to that one client. What this process first holds in its hands is just a byte array. It does not yet know where the keywords end, where the identifiers begin, or where the integer constant lives. Turning that byte array into a tree structure is the front half of the second of the five stages, namely raw parsing. This section covers only that front half. The back half (parse analysis), which consults the catalog to attach meaning, belongs to 1.2.2. (The catalog, in case you need a refresher, is the set of internal tables PostgreSQL maintains to describe itself: which tables have which columns, which functions take which argument types, and so on, all stored as rows in these tables.)
The output of raw parsing is a single RawStmt node per SQL string (or a List, if multiple statements are joined with ;). This RawStmt wraps a raw node like SelectStmt for SELECT, InsertStmt for INSERT, and so on. The name "raw" means it has not seen the catalog. Whether users is actually an existing table, which column id refers to, none of that is known yet. All that has been captured is the grammatical structure: a SELECT keyword followed by a column list, a FROM followed by an identifier, a WHERE followed by a comparison expression.
Two tools dividing the work: flex and Bison
PostgreSQL's raw parser is a collaboration of two tools. One side is the lexer (lexical analyzer), the other is the grammar (syntactic analyzer). The lexer slices the byte array into tokens. The grammar takes that token sequence and groups it into a tree.
PostgreSQL does not write these two by hand. It uses standard generator tools brought in from outside: the lexer is generated by flex, the grammar by Bison. Both are code generators. The developer writes rules, and at build time the tool reads those rules and emits the actual working lexer/parser C code.
For flex, a rule means "if this regex pattern comes in, emit this token." For example, "if a letter is followed by alphanumerics, emit an IDENT (identifier) token", "if only digits appear, emit an ICONST (integer constant) token." PostgreSQL keeps these rules in scan.l (about 1,400 lines). At build time flex reads this file and generates the C function that does the tokenizing.
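A minimal sketch in the spirit of scan.l gives the flavor (a toy subset for illustration, not the real rules; IDENT, ICONST, and GREATER_EQUALS do correspond to real PostgreSQL token names):

%{
/* Illustrative sketch only -- a toy subset, not PostgreSQL's scan.l. */
enum { IDENT = 258, ICONST, GREATER_EQUALS };
%}
%option noyywrap
%%
[ \t\n]+                { /* skip whitespace */ }
[A-Za-z_][A-Za-z0-9_]*  { return IDENT; /* keyword check happens afterwards */ }
[0-9]+                  { return ICONST; }
">="                    { return GREATER_EQUALS; /* longest match beats '>' */ }
.                       { return yytext[0]; /* single-char tokens: = > ( ... */ }
%%

Running flex over a file like this produces a C function that returns the next token each time it is called, which is exactly the shape of the generated lexer PostgreSQL uses.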
For Bison, a rule means "if this token sequence appears, build this tree node." For example, "if SELECT is followed by a column list, then FROM and a table identifier, build a SelectStmt node." PostgreSQL keeps these grammar rules in gram.y (about 21,000 lines). At build time Bison reads this file and generates the C parser function that groups tokens into a tree.
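A toy fragment in gram.y's notation shows what such a rule looks like (heavily simplified; the real SELECT grammar is far more involved):

/* Illustrative fragment only -- toy rules, not gram.y's real SelectStmt. */
select_stmt: SELECT target_list FROM table_name where_clause
           ;

where_clause: WHERE a_expr      { $$ = $2; }    /* keep the expression subtree */
            | /* EMPTY */       { $$ = NULL; }  /* the WHERE clause is optional */
            ;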
The driver that ties these two together is the raw_parser() function.
List *
raw_parser(const char *str, RawParseMode mode)
{
    core_yyscan_t yyscanner;
    base_yy_extra_type yyextra;
    int yyresult;

    yyscanner = scanner_init(str, ...);   /* init the flex lexer */
    parser_init(&yyextra);                /* init Bison parser state */
    yyresult = base_yyparse(yyscanner);   /* run the whole parse */
    scanner_finish(yyscanner);

    if (yyresult)
        return NIL;                       /* syntax error */
    return yyextra.parsetree;             /* List of RawStmt */
}
base_yyparse() is the actual parser function Bison generated. Inside it, whenever a token is needed, it calls the lexer and groups tokens into a tree according to grammar rules. The lexer does not run on its own. It is pulled along by the grammar.
What the lexer (scan.l) sees
What the lexer does is simple. It scans the byte array from the start, finds the longest pattern that matches one of the regex rules, and hands the corresponding token type and value to the grammar. What does "longest matching pattern" mean? When several rules can match starting at the same position, the lexer picks the one with the longer match. For example, if the input contains >=, the lexer does not slice off > alone. It groups >= into a single token (such as GREATER_EQUALS). > alone would also match a comparison-operator rule, but a longer-matching rule for >= exists, so that one wins. In lex terminology this is called longest match. Almost every lexer behaves this way. So when SELECT name FROM users WHERE id = 1 comes in, the lexer emits tokens in this order.
SELECT → keyword SELECT
name → identifier (IDENT, value = "name")
FROM → keyword FROM
users → identifier (IDENT, value = "users")
WHERE → keyword WHERE
id → identifier (IDENT, value = "id")
= → token '='
1 → integer constant (ICONST, value = 1)
How does the lexer tell identifiers from keywords? Every identifier candidate (a letter followed by alphanumerics) first matches the same regex rule. The matched string is then looked up in the keyword table: if it is there, it becomes a keyword token; if not, IDENT. (Older PostgreSQL releases did this lookup with a binary search; since version 12 it uses a generated perfect-hash function, but the effect is the same.)
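The classify step is easy to picture with a self-contained sketch (illustrative only: a three-entry toy table and made-up token codes, showing the classic binary-search variant rather than PostgreSQL's actual kwlookup.c):

#include <stdio.h>
#include <string.h>

typedef struct { const char *name; int token; } Keyword;

enum { IDENT = 1, KW_FROM, KW_SELECT, KW_WHERE };

/* the keyword table, kept sorted by name so binary search works */
static const Keyword keywords[] = {
    {"from", KW_FROM}, {"select", KW_SELECT}, {"where", KW_WHERE},
};

/* word is assumed to be already lower-cased by the lexer */
static int
classify(const char *word)
{
    int lo = 0;
    int hi = (int) (sizeof(keywords) / sizeof(keywords[0])) - 1;

    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(word, keywords[mid].name);

        if (cmp == 0)
            return keywords[mid].token; /* found: it is a keyword */
        else if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return IDENT;                       /* not found: plain identifier */
}

int
main(void)
{
    /* prints "3 1": "select" is a keyword, "users" is an IDENT */
    printf("%d %d\n", classify("select"), classify("users"));
    return 0;
}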
The size and categorization of that keyword table matter. PostgreSQL has 511 keywords, split into four categories.
| Category | Count | Meaning |
|---|---|---|
| UNRESERVED | 346 | usable as identifier (e.g. abort, aggregate) |
| COL_NAME | 64 | usable as a column name but not as a function/type name |
| TYPE_FUNC_NAME | 23 | usable as a type or function name |
| RESERVED | 78 | never usable as identifier (e.g. select, from, where) |
This split is why a query like SELECT abort FROM ... works in PostgreSQL: abort is UNRESERVED, so it can also be used as an identifier. By contrast, SELECT select FROM ... is a syntax error. RESERVED never gives way. Compatibility differences with other RDBMSes often come from this split.
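You can watch the categories at work directly (t is a scratch table for illustration):

CREATE TABLE t (abort int, "select" int);  -- quoting lets even a RESERVED word in
SELECT abort FROM t;                       -- fine: abort is UNRESERVED
SELECT select FROM t;                      -- syntax error: select is RESERVED
SELECT "select" FROM t;                    -- fine: quoted identifiers skip the keyword check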
There is an interesting comment at the top of scan.l: "rules in this file must be kept in sync with src/fe_utils/psqlscan.l and src/interfaces/ecpg/preproc/pgc.l!" In other words, the same SQL lexer rules live in three places. The psql client has its own lexer, ecpg (the embedded SQL preprocessor) has its own. The reason is that each tool has slightly different lexical needs, but the practical consequence is that changing the lexer rules in PostgreSQL means synchronizing three files.
The tree the grammar (gram.y) builds
As the lexer emits tokens, the grammar takes them in order, matches them against grammar rules, and assembles a tree. It builds bottom-up. Small subtrees are grouped first, those subtrees are then collected into larger nodes, and finally a single node corresponding to the whole statement is completed. This act of "taking a sequence of tokens or subtrees and reducing them into a single parent node" is called reduce in parser terminology. For example, if the token sequence IDENT '=' ICONST matches the "binary comparison expression" rule, those three are reduced into a single subtree. That subtree is then reduced as part of a WHERE clause rule. That WHERE clause is reduced as part of a SelectStmt rule. In the end, one SelectStmt node is built and gets wrapped in a RawStmt.
Here is a picture of the raw parse tree that SELECT name FROM users WHERE id = 1 produces. The tree has the root at the top, leaves toward the bottom.
RawStmt
│
SelectStmt
┌────────────┼─────────────┐
│ │ │
targetList fromClause whereClause
│ │ │
IDENT RangeVar A_Expr (=)
"name" "users" ┌──────┴──────┐
│ │
IDENT ICONST
"id" 1
The RawStmt at the top is the output of the raw parser. Keywords like SELECT, FROM, WHERE serve only as markers telling the grammar which rule to reduce by, so they do not survive as separate tree nodes. Only meaningful values (identifiers, constants) from the input tokens remain as leaves.
PostgreSQL's top rule is stmtmulti. It expresses that multiple SQL statements may be joined with ;.
stmtmulti: stmtmulti ';' toplevel_stmt
{ ... lappend($1, makeRawStmt($3, @3)); ... }
| toplevel_stmt
{ ... list_make1(makeRawStmt($1, @1)); ... }
;
Each toplevel_stmt is one SQL statement, and they all get wrapped by makeRawStmt() into a RawStmt. That is why the raw parser's output is always a List of RawStmt.
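For example, this single query string:

SELECT 1; SELECT 2;

comes back from raw_parser() as a two-element List, each element a RawStmt wrapping one SelectStmt.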
Going down into toplevel_stmt and then stmt, you find that stmt is a giant OR rule.
stmt:
AlterEventTrigStmt
| AlterCollationStmt
| AlterDatabaseStmt
| ...
| SelectStmt
| InsertStmt
| ...
;
More than 200 statement kinds are listed here, one per line. Each then expands into its own sub-rules. SelectStmt alone has dozens of sub-rules, and every SELECT element such as FROM clause, WHERE clause, GROUP BY, set operation unfolds inside it like a tree. The 21,000 lines of gram.y are the result of this unfolding.
A new tree node is built inside the reduce action. At the moment SELECT name FROM users reduces into a SelectStmt node, makeNode(SelectStmt) allocates the node and the action fills in the target list, FROM clause, WHERE clause, and so on. At this moment we do not know the OID of the table that users refers to. The identifier string "users" simply lives inside a RangeVar node. Catalog lookup is the next stage's job.
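The shape of such an action, sketched in gram.y's notation (simplified; the $n symbols stand for already-reduced pieces of the rule, and the real action sets more fields):

/* Illustrative only -- simplified from the shape of gram.y's actions. */
SelectStmt *n = makeNode(SelectStmt);  /* allocate a zero-initialized node */
n->targetList  = $2;                   /* the column list subtree          */
n->fromClause  = $4;                   /* RangeVars: names only, no OIDs   */
n->whereClause = $6;                   /* e.g. the A_Expr for id = 1       */
$$ = (Node *) n;                       /* hand the finished node upward    */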
gram.y also happens to be the file that most directly shows PostgreSQL's history of SQL compatibility changes. When a new PostgreSQL version adds a new SQL feature (such as MERGE or JSON_TABLE), the grammar rules grow accordingly, so dozens to hundreds of lines get added to gram.y each time. Following git blame is enough to see which SQL feature entered which PostgreSQL version. The 21,000-line size is the trace of three decades of SQL standard accumulation in a single file.
The lookahead filter: base_yylex
There is one more detail. The parser Bison generates is LALR(1). Spelled out, that is Look-Ahead LR: the parser scans the input Left-to-right exactly once, builds a Rightmost derivation in reverse, and at each step peeks at exactly 1 upcoming token (the lookahead) to decide which grammar rule to apply. The name sounds intimidating, but the gist is simple. Being able to decide with one token of lookahead keeps the parser fast and memory-efficient, which is why Bison-style LALR parsers are so common in databases and language tools.
The catch is that SQL grammar has a few cases that do not fit cleanly into LALR(1). Two-word constructs like NULLS FIRST and WITH ORDINALITY are the classic example: if NULLS and FIRST arrived at the grammar as two ordinary tokens, a single token of lookahead would not be enough to decide which rule applies after NULLS.
PostgreSQL solves this by inserting one filter layer between the lexer and the grammar. That filter is base_yylex(). The actual lexer flex generates is core_yylex(), but Bison never calls it directly. It always receives tokens through base_yylex(). base_yylex looks at each token, and in the handful of trouble cases it peeks at the next token and substitutes a special lookahead variant (NULLS followed by FIRST, for example, reaches the grammar as the token NULLS_LA). The result is that the grammar can be written cleanly as LALR(1), and the complexity of multi-word constructs is isolated inside base_yylex.
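The pattern is easy to sketch (illustrative C only, not PostgreSQL's actual base_yylex(); core_lexer_next() and the token codes are stand-ins):

typedef int Token;
enum { TOK_EOF, TOK_NULLS, TOK_FIRST, TOK_NULLS_LA };  /* toy token codes */

extern Token core_lexer_next(void);  /* stand-in for the flex-generated lexer */

static Token pushed_back;
static int   have_pushback;

Token
filtered_next(void)
{
    Token cur;

    if (have_pushback)           /* deliver the token saved from last call */
    {
        have_pushback = 0;
        return pushed_back;
    }
    cur = core_lexer_next();
    if (cur == TOK_NULLS)        /* trouble case: need to see one more token */
    {
        pushed_back = core_lexer_next();
        have_pushback = 1;
        if (pushed_back == TOK_FIRST)
            cur = TOK_NULLS_LA;  /* swap in the lookahead variant */
    }
    return cur;                  /* the grammar only ever sees this stream */
}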
Never touches the catalog
The most important constraint of the raw parser stage is that it never accesses the system catalog. This is not just a convention but a correctness requirement.
The header comment of pg_parse_query() explains why.
Analysis and rewriting cannot be done in an aborted transaction, since they require access to database tables. So, we rely on the raw parser to determine whether we've seen a COMMIT or ABORT command; when we are in abort state, other commands are not processed any further than the raw parse stage.
To follow this comment, you first need to know what "the transaction is in abort state" means. Here is the most intuitive scenario.
BEGIN;
INSERT INTO users(id, name) VALUES (1, 'a');
INSERT INTO users(id, name) VALUES (1, 'b'); -- unique violation!
SELECT * FROM users;
The moment the third line violates the unique constraint and errors out, PostgreSQL marks this transaction as "already broken." That is the abort state. The next line, SELECT * FROM users;, is grammatically fine and the table exists in the catalog, but PostgreSQL refuses to execute it. Instead, every subsequent SQL gets the same one-line error.
ERROR: current transaction is aborted, commands ignored until end of transaction block
This is the situation in psql where, after one query inside a transaction fails, every following query is rejected with the same message. To unlock it, the client must explicitly end the transaction with ROLLBACK (or COMMIT, which in abort state has the same effect as a rollback). In other words, the only SQL that PostgreSQL effectively accepts in abort state is a transaction-ending command.
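Continuing the transcript above:

ROLLBACK;             -- accepted even in abort state; it only needs raw parsing
SELECT * FROM users;  -- now runs normally again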
Now the relationship between the raw parser and the catalog matters. ROLLBACK still arrives as SQL text, so it has to be parsed somehow. But in abort state, even catalog reads are blocked. What if the raw parser depended on the catalog? Parsing ROLLBACK itself would then error out, and the client would have no way to unwind the transaction. Killing and reopening the connection would be the only escape.
The design choice that avoids this from the start is the raw parser's catalog ban. The raw parser must work in any transaction state, so it never touches the catalog. All identifier-level meaning analysis is deferred to the next stage (parse analysis). 1.2.2 covers that story.
What this means in practice
First, "parsing is fast" really means raw parsing is fast. PostgreSQL's raw parser is pure string work without catalog lookup, so it is very fast. But the bulk of what a prepared statement (1.1.2) caches is not raw parsing. It is the next stage (parse analysis). Parse analysis is where catalog lookups, function-overload resolution, and type checking happen. When people talk about the per-query parse cost of the simple query protocol being a burden, they almost always mean parse analysis cost, not raw parse cost. To diagnose accurately, separate which stage's cost you are looking at.
Second, keyword categories are a hidden trap in PostgreSQL compatibility and migration. A word that other RDBMSes happily allow as a column or function name may be RESERVED in PostgreSQL. If your migration tool does not automatically wrap such names in quotes ("name"), you get a syntax error. Conversely, identifiers PostgreSQL allows freely (such as abort) may be blocked in another DB. When reviewing a migration, comparing the keyword tables on both sides is the safe move.
Third, the raw parser only catches syntax errors. Errors like "table does not exist", "function signature does not match", "column is ambiguous" are all thrown by the next stage, parse analysis. So when a query has both a syntax error and a semantic error, the syntax error is reported first. The semantic error only surfaces once syntax is clean. If your error message suddenly changes during debugging, that may be a signal that the syntax issue cleared and you are now seeing the next stage's error.
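A quick way to see the ordering (no_such_table is, as the name says, assumed not to exist):

SELEC name FROM no_such_table;    -- ERROR: syntax error at or near "SELEC"
SELECT name FROM no_such_table;   -- ERROR: relation "no_such_table" does not exist

The first statement never gets past the raw parser. Only when the syntax is clean does the second statement reach parse analysis, which is where the missing table is discovered.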