MD. HABIBULLAH SHARIF

Code Never Runs First - The Lexer and Parser Do

Your code doesn't run the way you wrote it.

Before the machine executes a single instruction, your source file passes through a pipeline. Two of the most important stages in that pipeline are things most developers never think about — the lexer and the parser.

Every syntax error you've ever seen was one of them catching something. Every IDE that highlights keywords differently from variable names is using one right now.

Let's break down what they actually do.


What Is a Lexer?

The lexer (also called a tokenizer or lexical analyzer) is stage one. It takes your raw source code — just a stream of characters — and breaks it into meaningful units called tokens.

Think of it like reading a sentence. Before you understand grammar, you first recognize individual words. The lexer does that for code.

Take this:

x = 5 + 10;

The lexer scans it character by character and produces:

[ID: "x"]  [ASSIGN: "="]  [NUM: "5"]  [PLUS: "+"]  [NUM: "10"]  [SEMICOLON: ";"]

Each token has a type (what category it belongs to) and a value (the actual text). Keywords, operators, identifiers, literals, punctuation — the lexer categorizes all of it.

What it does NOT do: care whether those tokens make any sense together. That is the next stage's problem.
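
To make that concrete, here is a minimal sketch of a lexer in Python. It assumes a toy language with only the token types from the example above; the names Token, TOKEN_SPEC, and tokenize are made up for illustration and do not come from any real compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str   # category, e.g. "NUM"
    value: str  # the exact text that was matched


# Hypothetical token rules for a toy language (an assumption for this sketch).
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),     # whitespace separates tokens but is not a token itself
    ("MISMATCH", r"."),   # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Scan the source left to right and yield one Token per match."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {match.group()!r} at offset {match.start()}"
            )
        yield Token(kind, match.group())


print(list(tokenize("x = 5 + 10;")))
```

Feeding it x = + 5 10 also succeeds, because every piece is a recognizable token on its own. Only a character the rules don't cover, such as @, makes this sketch raise an error.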


What Is a Parser?

The parser is stage two. It takes the token stream from the lexer and tries to make structural sense of it. It checks whether those tokens follow the grammar rules of the language, and if they do, it builds an Abstract Syntax Tree (AST): a hierarchical structure that represents how the pieces of your code nest inside one another. What the code means (types, scopes, values) is worked out by later stages that walk this tree.

The analogy holds here too. The lexer identifies parts of speech. The parser checks if those parts form a valid sentence. "The cat sat on the mat" — valid. "Mat the sat cat" — same words, completely broken.

For x = 5 + 10;, the AST looks like this:

    Assignment
    /        \
   x         Add
            /   \
           5    10

The assignment is the root. The addition is its right-hand side. 5 and 10 are leaf nodes.

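Now try x = + 5 10. The lexer produces valid tokens: an identifier, an assignment operator, a plus, and two numbers. The parser rejects it, because no grammar rule allows two numbers to sit side by side with no operator between them.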
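
Here is a rough sketch of how a parser could do this, written as a small recursive-descent parser in Python. The grammar in the docstring is an assumption for a toy language (no unary plus, no parentheses), and the class names Num, Name, Add, Assignment, and Parser are made up for illustration. The token lists are written out by hand in the shape a lexer would produce.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Token = Tuple[str, str]  # (type, value), the shape a lexer would hand over


@dataclass
class Num:
    value: int


@dataclass
class Name:
    id: str


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


@dataclass
class Assignment:
    target: Name
    value: "Expr"


Expr = Union[Num, Name, Add]


class Parser:
    """Recursive descent over a toy grammar (an assumption, not a real language):
         statement := ID ASSIGN expr SEMICOLON
         expr      := operand (PLUS operand)*
         operand   := NUM | ID
    """

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> str:
        # Type of the next token, or "EOF" when the input is exhausted.
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind: str) -> str:
        # Consume the next token if it has the required type, else report an error.
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def statement(self) -> Assignment:
        target = Name(self.expect("ID"))
        self.expect("ASSIGN")
        value = self.expr()
        self.expect("SEMICOLON")
        return Assignment(target, value)

    def expr(self) -> Expr:
        node = self.operand()
        while self.peek() == "PLUS":
            self.expect("PLUS")
            node = Add(node, self.operand())  # builds a left-associative chain
        return node

    def operand(self) -> Expr:
        if self.peek() == "NUM":
            return Num(int(self.expect("NUM")))
        if self.peek() == "ID":
            return Name(self.expect("ID"))
        raise SyntaxError(f"expected NUM or ID, found {self.peek()}")


good = [("ID", "x"), ("ASSIGN", "="), ("NUM", "5"),
        ("PLUS", "+"), ("NUM", "10"), ("SEMICOLON", ";")]
print(Parser(good).statement())

bad = [("ID", "x"), ("ASSIGN", "="), ("PLUS", "+"), ("NUM", "5"), ("NUM", "10")]
try:
    Parser(bad).statement()
except SyntaxError as err:
    print("rejected:", err)
```

Each grammar rule becomes one method, which is why hand-written parsers are often structured this way. In this toy grammar the bad input fails as soon as the parser sees + where an operand should be. A language with unary plus would get one token further and fail at the stray 10 instead.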


Why Are They Two Separate Things?

This is a design decision that pays off.

Splitting the work means you can change your language's syntax (parser rules) without touching how it recognizes basic tokens (lexer rules), and vice versa. Each component is smaller, testable, and easier to reason about on its own.

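Many compilers and interpreters generate these components with tools like ANTLR, Lex/Flex, or Yacc/Bison: you define the rules, the tool writes the code. Others, including GCC and Clang, use hand-written recursive-descent parsers, which give more control over error messages and recovery.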

Here's a quick comparison:

            Lexer                            Parser
Input       Raw character stream             Token stream
Output      Tokens                           Abstract Syntax Tree (AST)
Works at    Word level                       Structure level
Detects     Unknown characters, bad tokens   Syntax errors, broken grammar

Where You've Already Seen This

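Most syntax errors you've ever hit were caught by the parser; a few, such as an unterminated string literal, come from the lexer. When your editor colors a keyword differently from a variable name, a lexer (or something very much like one) has already run.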

Linters like ESLint, formatters like Prettier, and TypeScript's type checker all build on top of this same pipeline. Some AI code tools also parse your code into syntax trees, for example to pick relevant context, even though the language models behind them read the source as text.
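
You can watch this happen with Python's standard library, which exposes the interpreter's own parser through the ast module. The snippet below is only a sketch of the idea: it parses the running example (without the trailing semicolon, to keep it idiomatic Python), inspects the tree, and does a tiny linter-style walk, all without executing the code being analyzed.

```python
import ast

# Parse source text into a tree; nothing in the string is executed.
tree = ast.parse("x = 5 + 10")
assign = tree.body[0]

print(type(assign).__name__)           # the root statement: Assign
print(type(assign.value).__name__)     # its right-hand side: BinOp
print(type(assign.value.op).__name__)  # the operator inside it: Add
print(ast.dump(assign.value))          # full subtree; exact text varies by Python version

# A tiny linter-style pass: collect every name that gets assigned to.
assigned = [
    node.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
]
print(assigned)

# The broken example never reaches execution: the parser refuses it.
try:
    ast.parse("x = + 5 10")
except SyntaxError as err:
    print("parser rejected it:", err.msg)
```

Real linters and formatters are far more elaborate, but the shape is the same: parse, walk the tree, report or rewrite.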

Understanding the lexer and parser doesn't just fill in a computer science gap. It explains why error messages look the way they do, why some tools can analyze your code without running it, and how the entire software toolchain that you use daily actually works.


Enjoyed this? I write about backend architecture, DevOps, Linux, and systems programming. Check out more at habibullah.dev
