honestly this blog is more for me than presenting it
because i forget everything i do in 12 hours, so i am just teaching myself how i did it after i did it. a candy for my future self.
we have a million different languages, each defining the same logic in its own syntax. stupid right? yeah. useful? maybe. do i know how to write hello world in 8 different languages? unfortunately. my next step is binary, out of spite. whether your goal is adding 2 numbers or finishing a client project whose payment is gonna be “in-process” for 3 months, compilers sit underneath all of it.
my assumptions:
- you have written some code before (optional)
- you hate yourself a little (required)
so what the f$ck are compilers, you think? me too.
i have a lot to say for someone who dropped out of engineering in the first 3 months of it.
compilers are programs that take your high-level code, written in languages like rust, c, c++, etc., and turn it into low-level machine code that the processor can understand and execute.
humans: print("hello")
computer: sorry bro i only understand 1s and 0s.
so a compiler takes your code as input, and to make the machine understand it, transforms it multiple times over several steps until it becomes machine code the computer can execute. here are the steps (if you really wanna learn):
- lexical analysis: source / high-level code is broken down into tokens: keywords, identifiers, separators, operators, etc. the only part that didn’t kill me while building.
- parsing or syntax analysis: aka the grammar of your language. this checks if your code is syntactically correct, if it follows the rules or not. if you forgot to put a semicolon again.
- semantic analysis: aka the brain of your language. this keeps us away from common errors like `int x = "hello"`, checks all the logical errors and grown-up stuff.
- code optimization: this turns your lame-ass javascript into a faster lame-ass javascript. it improves the efficiency of your code: removing unused code, constant folding, constant propagation.
- code generation: the optimized code is translated into target machine code for the specific architecture of your computer.
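that “constant folding” thing from the optimization step is less scary than it sounds. here’s a toy sketch of it in rust (the `Expr` and `fold` names are mine, made up for illustration; real compilers do this on their intermediate representation, not your source text):

```rust
// a toy expression tree: just numbers and addition
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i32),
    Add(Box<Expr>, Box<Expr>),
}

// constant folding: if both sides of an Add are already numbers,
// compute the result at "compile time" instead of leaving the addition in
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

fn main() {
    // 2 + 3 folds down to 5 before any machine code is generated
    let e = Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)));
    println!("{:?}", fold(e)); // Num(5)
}
```

the point: `2 + 3` never survives to runtime, the compiler just ships `5`.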
in this article, i am gonna cover only the lexical analysis part and implement a basic lexer. and i will be choosing rust as my language of choice because i hate myself.
but by the end of this page, you’d have made your own lexer from scratch!
what is lexical analysis?
(╯°□°)╯ WHY IS THIS SO HARD???
this step takes your code as a string at one end and spits out tokens at the other end.
the categories that we define are called tokens, and the words we put in each category are called lexemes. that’s the one-line understanding of it.
lets see this with an example:
`a = 69 if 69 < 420 else 420`
and the above code passed through a lexer will give out:
**`[IDENTIFIER, 'a'] [ASSIGN_OP, '='] [INTEGER, '69'] [IF] [INTEGER, '69'] [LESS_THAN] [INTEGER, '420'] [ELSE] [INTEGER, '420']`**
the lexer looks at each word / symbol in the code and categorizes them into their respective token type: an integer becomes IntegerLiteral, a string becomes StringLiteral, and so on. this step is important because the compiler can throw everything into its own box regardless of the different names you write, and it standardizes the vocab for the later steps. its like the sorting hat of compilers, where every lexeme wishes to go to gryffindor or slytherin because they want that main character energy.
this stage usually ignores white spaces and comments, unless you write python, then something’s already wrong with you.
Regular Language:
the lexeme-level syntax of a language is described by something called a Regular Language. basically the rules for what counts as a valid word in a programming language, so you dont write
hihihaha = 69
print???hello????
x = cat
and get away with it,
those rules = regular language. aka the patterns your language allows.
and how do we define these patterns? an ancient cursed language of symbols - regex.
rust analyzer staring at me because i forgot to put a semicolon.
**Regular Expressions: your lexer’s tinder algorithm
(╬ ಠ益ಠ) = i hate this but i must wield it.**
regex pattern matches your source code string against the defined tokens to determine if its a valid or acceptable string or not. its swiping right on the code that matches your type. you basically tell your computer “hey, this is what valid shit looks like in my language: int x = 5 .”
you define patterns by these specific rules such as:
- start and end anchors: `^` and `$` denote the beginning and end of a line respectively. `^` matches the position before the string starts and `$` comes after your string has ended.
- character classes: `[]` matches any one of the characters written inside the brackets. for eg: `[rule]` will match any single 'r', 'u', 'l' or 'e', it doesnt have to be the whole word. you can also add a range by putting a dash between two characters, like `[a-z]` for lowercase letters or `[0-9]` for digits, and this will match anything in the specified range.
- predefined character classes: shortcuts for the more common character classes. `\w` matches any word character from `[a-zA-Z0-9_]`, `\d` matches any digit from `[0-9]`, and `\s` matches whitespace characters.
- negated character classes: used when you want to find anything EXCEPT some specific characters. this is done by placing `^` inside the square brackets, like `[^yes]`, which matches any character except 'y', 'e' and 's'.
- quantifiers: used when you need a certain number of repetitions of something, like strawberry having 3 r's in it. written like:
  - `*` matches zero or more times
  - `+` matches one or more times
  - `?` matches zero or one time
  - `{n}` and `{n,m}` are used for specific quantities
an example containing all of them:
- starts with a letter or underscore
- then can have letters, digits, or underscores

regex for that:
`[a-zA-Z_][a-zA-Z0-9_]*`

you dont need to memorize this, you'll understand the logic:
- `[a-zA-Z_]` → first character rules
- `[a-zA-Z0-9_]*` → zero or more of the allowed characters
a quick and painless crash course of it:

anchors
- `^` → start of line
- `$` → end of line

character classes
- `[abc]` → match 'a' or 'b' or 'c'
- `[a-z]` → lowercase letters
- `[0-9]` → digits

shortcuts
- `\d` → digits
- `\w` → word chars (letters, digits, underscore)
- `\s` → whitespace

quantifiers
- `*` → 0 or more
- `+` → 1 or more
- `?` → 0 or 1
- `{n}` → exactly n times
i know some of you are gonna be like
but let loose, things are only gonna go downhill from here.
enough with the theory, fire up your vsc (no cursor autocomplete)
now we are gonna be
- creating a rust file
- defining a Token enum
- write regex pattern for each token
- write a lexer that scans whole code
- break everything
- fix it
- break it again
- make it work
time to hurt ourselves with rust now.
we’ll compile this following code -
int x = 5;
int y = 6;
int z = x + y;
if (z > 10) {
    print("hihi");
} else {
    print("haha");
}
based on this, we need a lexer that can capture:
- keywords like `int`, `if`, `else`
- identifiers like `x`, `y`, `z`
- numbers like `5`, `6`
- operators: `=`, `+`, `>`
- strings: `"hihi"`, `"haha"`
- brackets like `(`, `)`, `{`, `}`
setting up project:
we start with the basic cargo command, which will create a Cargo.toml file and a src folder with a main.rs file in it.
cargo init rust-lexer
cd rust-lexer
- Cargo.toml is the config file of your project. it holds the name, version, author and all the dependencies. if you need to add a package, you can run `cargo add package-name` or write it here directly.
- src/main.rs: this is the default entry point of your rust code. cargo looks for this file to compile your code into a binary.
we are gonna be increasing the file population in src folder but nothing china has to worry about.
lets first create a sub-folder in src folder by
cd src
mkdir lexing
in the lexing folder, we are gonna add 3 more files, namely token.rs, lexer.rs and mod.rs
your folder system should be looking like this rn
.
├── Cargo.toml
├── src
│   ├── lexing
│   │   ├── lexer.rs
│   │   ├── mod.rs
│   │   └── token.rs
│   └── main.rs
add `regex = "1"` in the [dependencies] section of your Cargo.toml because you dont want your code to break (it will, somehow it will, dw).
writing token.rs:
(ง'̀-'́)ง = fight me you 50 token types
we will be using enums for this task, because why else do they exist.
we’ll start by creating a Token enum which will hold the token type for each lexeme, aka the categories its gonna segregate the strings into.
#[derive(Debug, Clone)]
pub enum Token {
    // keywords
    Print(String),
    If(String),
    Else(String),
    Int(String),
    // literals
    IntegerLiteral(i32),
    StringLiteral(String),
    BooleanLiteral(bool),
    // identifiers
    Identifier(String),
    // operators
    Plus(String),
    Minus(String),
    Assign(String),
    // punctuation
    SemiColon(String),
    LeftParen(String),
    RightParen(String),
    LeftBrace(String),
    RightBrace(String),
    // logical operators
    GreaterThan(String),
    LessThan(String),
}
do not copy this, write it out yourself.
this is a small list of the basic stuff we need to at least get started. the list can ofc be much bigger depending on your language. and we have used String for most variants, which might NOT be the idiomatic way of writing rust, but eh.
the #[derive(Debug)] is an attribute that implements a trait called Debug, which lets us print the enum variants without any extra effort. we also have to implement 2 functions now:
- given a token_type string and a value, return the corresponding Token. for eg: given the token_type `"StringLiteral"` and the value `fuck`, it returns `Token::StringLiteral("fuck")`, where `::` is used to specify a particular variant of an enum, in this case StringLiteral.
- given a token_type, return the regex used to match lexemes for that token.
now we implement these 2 functions in our rust code inside an `impl Token` block:
impl Token {
    pub fn get_token(token_type: &str, value: Option<&str>) -> Token {
        match token_type {
            // keywords
            "Print" => Token::Print("print".to_string()),
            "If" => Token::If("if".to_string()),
            "Else" => Token::Else("else".to_string()),
            "Int" => Token::Int("int".to_string()),
            // literals
            "IntegerLiteral" => {
                Token::IntegerLiteral(value.unwrap().parse::<i32>().unwrap())
            }
            "StringLiteral" => {
                Token::StringLiteral(value.unwrap().to_string())
            }
            "BooleanLiteral" => {
                let val = match value.unwrap() {
                    "true" => true,
                    "false" => false,
                    _ => panic!("invalid boolean literal"),
                };
                Token::BooleanLiteral(val)
            }
            // identifiers
            "Identifier" => Token::Identifier(value.unwrap().to_string()),
            // operators
            "Plus" => Token::Plus("+".to_string()),
            "Minus" => Token::Minus("-".to_string()),
            "Assign" => Token::Assign("=".to_string()),
            // punctuation
            "SemiColon" => Token::SemiColon(";".to_string()),
            "LeftParen" => Token::LeftParen("(".to_string()),
            "RightParen" => Token::RightParen(")".to_string()),
            "LeftBrace" => Token::LeftBrace("{".to_string()),
            "RightBrace" => Token::RightBrace("}".to_string()),
            // logical operators
            "GreaterThan" => Token::GreaterThan(">".to_string()),
            "LessThan" => Token::LessThan("<".to_string()),
            _ => panic!("invalid token type {}", token_type),
        }
    }

    pub fn get_token_regex(token_type: &str) -> String {
        match token_type {
            // keywords
            "Print" => r"print",
            "If" => r"if",
            "Else" => r"else",
            "Int" => r"int\s+",
            // literals
            "IntegerLiteral" => r"\d+",
            "StringLiteral" => r#""[^"]*""#,
            "BooleanLiteral" => r"\b(?:true|false)\b",
            // identifiers
            "Identifier" => r"[a-zA-Z_][a-zA-Z0-9_]*",
            // operators
            "Plus" => r"\+",
            "Minus" => r"-",
            "Assign" => r"=",
            // punctuation
            "SemiColon" => r";",
            "LeftParen" => r"\(",
            "RightParen" => r"\)",
            "LeftBrace" => r"\{",
            "RightBrace" => r"\}",
            // logical operators
            "GreaterThan" => r">",
            "LessThan" => r"<",
            _ => panic!("invalid token type: {}", token_type),
        }.to_string()
    }
}
here we are only matching strings and returning appropriate result based on whatever matches.
and something else you might notice is that we have written &str and String differently. what are they, you might ask? (so do we, i still google it.) and on top of that we are using .to_string() to convert a &str into a String. you can think of &str as a borrowed, immutable view of a string and String as an owned, growable string. ik rust has its fetishes.
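a minimal sketch of the difference, in case the words dont stick:

```rust
fn main() {
    // &str: a borrowed, immutable view into string data
    let borrowed: &str = "hihi";

    // String: an owned, growable buffer, built here via .to_string()
    let mut owned: String = borrowed.to_string();
    owned.push_str("haha"); // you can mutate an owned String

    // and you can cheaply borrow a &str back out of a String
    let view: &str = &owned;
    println!("{}", view); // hihihaha
}
```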
and at the last line of the first fn, you can see the panic. that fires when nothing matches at all, which for your source code basically means a syntax error.
you can also make a case that an “if” could be mistaken for an identifier, and you would be right. thats why "If" is written above "Identifier" in the token list, so those specific words get caught early wherever they appear.
ik its getting a lot, you can take a break now
(Φ ω Φ) = “the cat judges your code”
or just keep going if youre a masochist like me.
you can see that a small function named get_token_regex sneaked beneath our code, so what is it actually doing? we made an enum, we made a function that returns token variants, but our lexer still does not know how to detect a token. and we do that by regex, cursed alien language.
understanding regex:
a regular expression is a concise way to describe any kind of pattern inside text. you write a weird symbol soup and expect it to send back what you actually wanted, and it surprisingly does.
it scans your whole document to find anything that matches your given pattern and hands it back to you. something like `^[a-zA-Z_]\w*$`.
we can understand this simply by writing an email validator. emails follow a simple pattern:
- a lower or upper case letter first, NO numbers or symbols as the first character
- then any mix of lower / upper case letters, numbers, underscores
- then an `@`
- followed by the domain name: lower / upper case letters, which can include numbers
- then a literal dot `.`
- and finally some designated words like in, com, edu

now we write that in regex form:
- `[a-zA-Z]` for the first letter
- `[a-zA-Z0-9_]*` for everything up to the `@`
- `@` itself
- `[a-zA-Z0-9]+` for the domain name, like the gpay in `@gpay.com`
- `\.` for the literal dot (escaped, because a bare `.` matches anything)
- `(in|com|edu)` if you accept only some specific endings

the final result would look like `[a-zA-Z][a-zA-Z0-9_]*@[a-zA-Z0-9]+\.(in|com|edu)`
its like summoning demons but for pattern matching.
laxmi_chit_fund@totally_safe.in please_send_money@gpay.in
voila, now you can understand and write regex on your own (i hope so).
now we write regex for tokens:
⊙_⊙; brain lag detected
- keywords and operators: `r"print"`, `r"if"`, `r"else"` are the straightforward ones, they match the specific keyword strings. for `int`, `\s+` is added (`r"int\s+"`) to ensure there is at least one whitespace after it, which also differentiates it from identifiers that merely start with int, like `integerLiteral`.
- numbers:
  - `\d+` → matches one or more digits, eg `5`, `69`, `420`
  - `\d+(\.\d+)?` → can be used to match decimals, but we are gonna stick with whole numbers rn
- identifiers (variable names, fn names): the classic "starts normal then goes brrr" pattern. we denote that by `[a-zA-Z_][a-zA-Z0-9_]*`:
  - the first character can be a lower / upper case letter, or an underscore
  - after that it can be letters, numbers or underscores, so it matches stuff like `heheheha` and `y69`, and doesnt match stuff like `%yo` or `6n9` or `rule-34`
- strings: these are the simplest regex in a way. you can just write `"[^"]*"` and be done: a quote, then anything that is NOT a quote, then the closing quote. eg `"hihihaha"`.
- operators & punctuation (brackets, braces and other fun shapes): `r"\+"`, `r"="`, `r";"`, `r"\("`, `r"\)"`, `r"\{"`, `r"\}"`, `r">"`, `r"<"`. these match single operator and punctuation characters. some characters like `+ ( ) { }` are special characters in regex, so to match them literally they are escaped with a backslash.
sorry for too much technical blabber, im writing this in full random-bullshit-go mode.
now we start writing our lexer.rs (THE FINAL BOSS): (ง🔥Д🔥)ง = prepare for pain
we have tokens, we have regex, now we need a machine that eats your strings and poops out tokens.
logic:
- take whole code as string
- start at index 0
- loop through list of Token regexes
- does it match?
- yeah: cut that part, save the token, move forward
- no: try next regex
- nothing matches? panic at the desk-o.
- regret everything
- lose sanity
5 stages of grief:
step 1: choose your victims aka tokens
let me walk you through some code:
let current_input = program;
let mut match_vec: Vec<(&str, usize, usize)> = Vec::new();
let tokens = [
    "Print", "If", "Else", "Int", // keywords
    "BooleanLiteral", "IntegerLiteral", "StringLiteral", // literals
    "Plus", "Minus", "Assign", "GreaterThan", "LessThan", // operators
    "LeftParen", "RightParen", "LeftBrace", "RightBrace", "SemiColon", // punctuation
    "Identifier", // identifier
];
here we made a reference to the whole source code (current_input), a place to store every match from every regex (match_vec), and a list of what to look for (tokens).
and think of match_vec as a big group chat where every regex drops in like:
“hey i found something at index 27”
its like a vip list for words; keywords get in first, operators next and identifiers last
step 2: send regex bots to scan everything:
for token in tokens.iter() {
    let pattern = Token::get_token_regex(token);
    let re = Regex::new(&pattern).unwrap();
    for m in re.find_iter(current_input) {
        match_vec.push((*token, m.start(), m.end()));
    }
}
here, for each token type:
- get the regex pattern
- compile that regex
- run it over the source code
- every match yields: token type, start index in string, end index in string.
this creates a giant list of all matches for all token types
even the wrong ones, even the cursed ones, like a data hungry pokemon trainer.
step 3 - sorting hat of tokens
then:
match_vec.sort_by(|a, b| {
    a.1.cmp(&b.1)
        .then_with(|| (b.2 - b.1).cmp(&(a.2 - a.1)))
});
now we take that giant list we made above and
- sort by starting index → left to right
- if they start at the same place, choose the longer match
fixes headaches like `int` vs `integerLiteral`, or `=` vs `==`
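the tie-breaking is easier to see on fake data. a small sketch with hand-made (token_type, start, end) tuples (the indices are invented for illustration):

```rust
fn main() {
    // pretend matches over "int x = 5", deliberately out of order and overlapping
    let mut match_vec = vec![
        ("IntegerLiteral", 8, 9), // "5"
        ("Identifier", 4, 5),     // "x"
        ("Identifier", 0, 3),     // "int" also matches the identifier pattern...
        ("Int", 0, 4),            // ...but the keyword match is longer (int\s+)
    ];

    // sort by start index; on ties, longer match first
    match_vec.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| (b.2 - b.1).cmp(&(a.2 - a.1))));

    assert_eq!(match_vec[0].0, "Int"); // the keyword wins the tie at index 0
    println!("{:?}", match_vec);
}
```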
step 4 - keep the chosen ones.
let mut token_vec: Vec<Token> = Vec::new();
let mut last_end = 0;
for (token_type, start, end) in match_vec {
    if start < last_end {
        continue;
    }
    last_end = end;
    let lexeme = &current_input[start..end];
    token_vec.push(Token::get_token(token_type, Some(lexeme)));
}
imagine all the tokens are standing in a queue. if a new token tries to overlap with the previous one, you’d be like “we dont do that here” and throw it out.
step 5 - return the loot bag of chaos
token_vec
this is the final treasure chest
- all tokens
- all in correct order
- overlapping drama resolved
- ready to make parser cry
you can dance now.
now we write our main.rs
ಥ‿ಥ please work please work please work
we’ll be writing our main.rs to test our lexer. we need to define the code we want to lex as a string and get the results from the lex_program function we wrote.
mod lexing; // the only module we care about for now
use lexing::lexer::lex_program;
const PROGRAM: &str = "
int x = 5;
int y = 6;
int z = x + y;
if (z > 10) {
print(\"hihi\");
} else {
print(\"haha\");
}
";

fn main() {
    println!("--- LEXING TIME ---");
    println!("feeding code into the lexer like it's a vending machine...\n");
    let tokens = lex_program(PROGRAM);
    println!("--- TOKENS THE LEXER SPAT OUT ---\n");
    for t in &tokens {
        println!("{:?}", t);
    }
    println!("\nmission accomplished.");
    println!("parser, ast, semantic analysis & codegen will stand outside the club until further notice.");
}
write this in your main.rs file and see how it goes,
also remember that mod.rs file we made?
yeah, write
pub mod lexer;
pub mod token;
in that file to declare the other 2 rust files as public modules, so main.rs can access them and not go "tf are these".
now type cargo run and watch your lexer either work perfectly or produce errors that make you question your life choices
writing this at 5:26 am so (⌐■_■)
token.rs
#[derive(Debug, Clone)]
pub enum Token {
    // keywords
    Print(String),
    If(String),
    Else(String),
    Int(String),
    // literals
    IntegerLiteral(i32),
    StringLiteral(String),
    BooleanLiteral(bool),
    // identifiers
    Identifier(String),
    // operators
    Plus(String),
    Minus(String),
    Assign(String),
    // punctuation
    SemiColon(String),
    LeftParen(String),
    RightParen(String),
    LeftBrace(String),
    RightBrace(String),
    // logical operators
    GreaterThan(String),
    LessThan(String),
}

impl Token {
    pub fn get_token(token_type: &str, value: Option<&str>) -> Token {
        match token_type {
            // keywords
            "Print" => Token::Print("print".to_string()),
            "If" => Token::If("if".to_string()),
            "Else" => Token::Else("else".to_string()),
            "Int" => Token::Int("int".to_string()),
            // literals
            "IntegerLiteral" => {
                Token::IntegerLiteral(value.unwrap().parse::<i32>().unwrap())
            }
            "StringLiteral" => {
                Token::StringLiteral(value.unwrap().to_string())
            }
            "BooleanLiteral" => {
                let val = match value.unwrap() {
                    "true" => true,
                    "false" => false,
                    _ => panic!("invalid boolean literal"),
                };
                Token::BooleanLiteral(val)
            }
            // identifiers
            "Identifier" => Token::Identifier(value.unwrap().to_string()),
            // operators
            "Plus" => Token::Plus("+".to_string()),
            "Minus" => Token::Minus("-".to_string()),
            "Assign" => Token::Assign("=".to_string()),
            // punctuation
            "SemiColon" => Token::SemiColon(";".to_string()),
            "LeftParen" => Token::LeftParen("(".to_string()),
            "RightParen" => Token::RightParen(")".to_string()),
            "LeftBrace" => Token::LeftBrace("{".to_string()),
            "RightBrace" => Token::RightBrace("}".to_string()),
            // logical operators
            "GreaterThan" => Token::GreaterThan(">".to_string()),
            "LessThan" => Token::LessThan("<".to_string()),
            _ => panic!("invalid token type {}", token_type),
        }
    }

    pub fn get_token_regex(token_type: &str) -> String {
        match token_type {
            // keywords
            "Print" => r"print",
            "If" => r"if",
            "Else" => r"else",
            "Int" => r"int\s+",
            // literals
            "IntegerLiteral" => r"\d+",
            "StringLiteral" => r#""[^"]*""#,
            "BooleanLiteral" => r"\b(?:true|false)\b",
            // identifiers
            "Identifier" => r"[a-zA-Z_][a-zA-Z0-9_]*",
            // operators
            "Plus" => r"\+",
            "Minus" => r"-",
            "Assign" => r"=",
            // punctuation
            "SemiColon" => r";",
            "LeftParen" => r"\(",
            "RightParen" => r"\)",
            "LeftBrace" => r"\{",
            "RightBrace" => r"\}",
            // logical operators
            "GreaterThan" => r">",
            "LessThan" => r"<",
            _ => panic!("invalid token type: {}", token_type),
        }.to_string()
    }
}







