Your code doesn't run the way you wrote it.
Before the machine executes a single instruction, your source file passes through a pipeline. Two of the most important stages in that pipeline are things most developers never think about — the lexer and the parser.
Every syntax error you've ever seen was one of them catching something. Every IDE that highlights keywords differently from variable names is using one right now.
Let's break down what they actually do.
What Is a Lexer?
The lexer (also called a tokenizer or lexical analyzer) is stage one. It takes your raw source code — just a stream of characters — and breaks it into meaningful units called tokens.
Think of it like reading a sentence. Before you understand grammar, you first recognize individual words. The lexer does that for code.
Take this:
x = 5 + 10;
The lexer scans it character by character and produces:
[ID: "x"] [ASSIGN: "="] [NUM: "5"] [PLUS: "+"] [NUM: "10"] [SEMICOLON: ";"]
Each token has a type (what category it belongs to) and a value (the actual text). Keywords, operators, identifiers, literals, punctuation — the lexer categorizes all of it.
What it does NOT do: care whether those tokens make any sense together. That is the next stage's problem.
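To make that concrete, here is a minimal sketch of a lexer for the tiny statement above, written in Python with the standard `re` module. The token names mirror the ones in the example; the `TOKEN_SPEC` table and the `tokenize` function are illustrative names chosen for this sketch, not part of any particular compiler.

```python
import re

# Token types and the regex each one matches (illustrative, not exhaustive).
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),      # whitespace: matched, then discarded
    ("MISMATCH", r"."),    # any other character is a lexical error
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Return a list of (type, text) pairs for the source string."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {text!r} at position {match.start()}"
            )
        tokens.append((kind, text))
    return tokens


print(tokenize("x = 5 + 10;"))
# [('ID', 'x'), ('ASSIGN', '='), ('NUM', '5'), ('PLUS', '+'), ('NUM', '10'), ('SEMICOLON', ';')]
```

Note that `tokenize("x = + 5 10")` also succeeds: every piece is a valid token, and the lexer has no opinion about the order they arrive in.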
What Is a Parser?
The parser is stage two. It takes the token stream from the lexer and tries to make structural sense of it. It checks whether those tokens follow the grammar rules of the language, and if they do, it builds an Abstract Syntax Tree (AST) — a hierarchical structure that represents what your code actually means.
The analogy holds here too. The lexer identifies parts of speech. The parser checks if those parts form a valid sentence. "The cat sat on the mat" — valid. "Mat the sat cat" — same words, completely broken.
For x = 5 + 10;, the AST looks like this:
      Assignment
      /        \
     x         Add
              /   \
             5     10
The assignment is the root. The addition is its right-hand side. 5 and 10 are leaf nodes.
Now try x = + 5 10. The lexer produces valid tokens. The parser rejects it — the structure doesn't match any grammar rule it knows.
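Here is a rough sketch of how a parser might accept the first statement and reject the second. It is a small hand-written recursive-descent parser in Python, and it assumes a made-up grammar for this one statement form: `statement := ID ASSIGN expr SEMICOLON`, `expr := term (PLUS term)*`, `term := NUM | ID`. The tokens are written out by hand in the same (type, text) shape as the token list shown earlier; `parse_statement` and the tuple-based AST are illustrative choices, not a standard API.

```python
def parse_statement(tokens):
    """Parse `ID ASSIGN expr SEMICOLON` and return a nested-tuple AST."""
    pos = 0

    def expect(kind):
        # Consume the next token if it has the expected type, else fail.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        text = tokens[pos][1]
        pos += 1
        return text

    def parse_term():
        # term := NUM | ID
        if pos < len(tokens) and tokens[pos][0] == "ID":
            return ("Name", expect("ID"))
        return ("Num", int(expect("NUM")))

    def parse_expr():
        # expr := term (PLUS term)*
        node = parse_term()
        while pos < len(tokens) and tokens[pos][0] == "PLUS":
            expect("PLUS")
            node = ("Add", node, parse_term())
        return node

    target = expect("ID")
    expect("ASSIGN")
    value = parse_expr()
    expect("SEMICOLON")
    if pos != len(tokens):
        raise SyntaxError("unexpected tokens after ';'")
    return ("Assign", ("Name", target), value)


good = [("ID", "x"), ("ASSIGN", "="), ("NUM", "5"),
        ("PLUS", "+"), ("NUM", "10"), ("SEMICOLON", ";")]
print(parse_statement(good))
# ('Assign', ('Name', 'x'), ('Add', ('Num', 5), ('Num', 10)))

bad = [("ID", "x"), ("ASSIGN", "="), ("PLUS", "+"), ("NUM", "5"), ("NUM", "10")]
try:
    parse_statement(bad)
except SyntaxError as err:
    print("rejected:", err)
# rejected: expected NUM, found PLUS
```

Each grammar rule becomes one small function, which is why this style is popular for hand-written parsers: the code reads almost like the grammar itself.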
Why Are They Two Separate Things?
This is a design decision that pays off.
Splitting the work means you can change your language's syntax (parser rules) without touching how it recognizes basic tokens (lexer rules), and vice versa. Each component is smaller, testable, and easier to reason about on its own.
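Many compilers and interpreters generate these components with tools like ANTLR, Lex/Flex, or Yacc/Bison: you define the rules, the tool writes the code. Others, including GCC and Clang, use hand-written parsers instead, usually for finer control over error messages and performance.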
Here's a quick comparison:
| | Lexer | Parser |
|---|---|---|
| Input | Raw character stream | Token stream |
| Output | Tokens | Abstract Syntax Tree (AST) |
| Works at | Word level | Structure level |
| Detects | Unknown characters, bad tokens | Syntax errors, broken grammar |
Where You've Already Seen This
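Most syntax errors you've ever hit were caught by the parser; a few, like a stray character or an unterminated string, were caught by the lexer. When your editor colors a keyword differently from a variable name, a lexer (or something very much like one) has already run.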
Linters and formatters such as ESLint and Prettier, and TypeScript's type checker, all build on top of this same pipeline. AI code tools that "understand" your code often lean on parsers and ASTs as well, not just raw text.
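You can see the same idea with nothing but Python's standard library. The sketch below uses the built-in `ast` module to parse source text and inspect its structure without executing any of it; the sample source and variable names are made up for illustration.

```python
import ast

source = """
import os

def greet(name):
    return "hello " + name

def cleanup():
    os.remove("important.txt")  # never executed: the code is only parsed
"""

# Parse the text into an AST. Nothing in `source` is run.
tree = ast.parse(source)

# Walk the tree and collect the names of all function definitions.
function_names = [
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
]
print(function_names)
# ['greet', 'unused'] would be wrong here; the actual names are:
# ['greet', 'cleanup']

# Python's own AST for an assignment has the same shape as the tree above:
# an assignment whose right-hand side is an addition.
assign = ast.parse("x = 5 + 10").body[0]
print(type(assign).__name__, type(assign.value).__name__, type(assign.value.op).__name__)
# Assign BinOp Add
```

This is roughly what a linter does at its core: parse, walk the tree, and report on what it finds, all without running the program.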
Understanding the lexer and parser doesn't just fill in a computer science gap. It explains why error messages look the way they do, why some tools can analyze your code without running it, and how the entire software toolchain that you use daily actually works.
Enjoyed this? I write about backend architecture, DevOps, Linux, and systems programming. Check out more at habibullah.dev