The basic parsing method presented in part II works well for simple expressions that consist of binary operators and literals. Our calculator must support more complex expressions, including unary operators, parenthesis, and function calls. In addition, we'd like to be able to communicate parsing errors to the user pointing them to the potential source of the problem.
Errors
When a parser hits an issue, the user will usually get a message explaining the problem the parser hit and where in the code it happened:
To return such errors, we need to store the message and the span of input code the error relates to:
pub struct Span {
pos: usize,
len: usize
}
impl Span {
pub fn new(pos: usize, len: usize) -> Self {
Span {
pos,
len
}
}
}
pub struct Error {
message: String,
span: Span
}
impl Error {
pub fn new(message: &str, span: Span) -> Self {
Error {
message: String::from(message),
span
}
}
Sometimes the error is detected by a function coming from a library we use and the error type is different than our Error
. In this case, it will be handy to wrap that error in our Error
structure to have consistent error flow through the call stack:
impl Error {
pub fn wrap<T: Display>(t: T) -> Self {
Error {
message: format!("{t}"),
span: Span::new(0, 0)
}
}
}
We can use the wrap
method with Rust's handy Result::map_err
method:
pub fn parse_int(input: &str) -> Result<i128, Error> {
input.parse().map_err(Error::wrap)
}
Grammar
Let's now look into improving our parser to handle more complex expressions. We are following recursive descent parsing (RDP) - but how exactly such a parser should be built? Wikipedia's definition of RDP states:
In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.
We need to build a grammar and then create a procedure for each non-terminal of the grammar. A non-terminal is an element of the grammar defined by other terminals and non-terminals. A terminal is the actual token we read. A grammar for basic addition can be defined as follows:
exp: fact + fact
fact: literal
The exp
(expression) consists of two factors and +
operator. A factor is a simple literal. Grammar like this defines a single addition operation 2+2
. To define a chain of additions, we need to add recursion to the grammar:
exp: fact + exp | fact
fact: literal
This can define anything from a single literal 1
to any number of additions 1+2+3...
. The fact exp
is on the right-hand side of the fact + exp
term is not an accident - parsing left-recursive grammars is more complicated, and for RDP we need to use right-side recursion.
Next, we need to consider operator precedence. The 1+2*3
should be computed as 1+(2*3)
and 1*2+3
should become (1*2)+3
. It means that for lower precedence operations like +
each side can be a literal or a higher-precedence operation. so for +
and *
we can write:
exp: exp1 + exp | exp1
exp1: fact * exp1 | fact
fact: literal
It's a bit tricky but as the old joke says, "to understand recursion you first need to understand recursion".
Now we can write full grammar for the expressions we want to handle in our calculator. In addition to basic mathematical operations, we want to handle parenthesis, identifiers (which may be constants like pi
or e
), and functions like sin
.
exp: exp1 | empty
exp1: exp2 op1 exp1 | exp2
op1: + | -
exp2: exp3 op2 exp2| exp3
op2: * | / | %
exp3: term op3 exp3 | term
op3: ^
fact: +fact | -fact | (exp1) | func | id | literal
func: id(exp1)
Parsing
The idea for the parser implementation follows the RDP definition: for each non-terminal, we write a function that will parse this non-terminal. The function will return true if parsing consumed a terminal or false if parsing ended with an empty statement. This will help us check whether all sides of the expression returned the actual result. Example function for parsing a non-terminal:
I added comments showing pieces of grammar each function is responsible for:
pub fn parse(&mut self) -> Result<bool, Error> {
self.init();
self.parse_exp()
}
// exp: exp1 | empty
pub fn parse_exp(&mut self) -> Result<bool, Error> {
match self.next_token.kind {
lexer::TokenKind::Eof => Ok(false),
_ => self.parse_exp1()
}
}
// exp1: exp2 op1 exp1 | exp2
// op1: + | -
fn parse_exp1(&mut self) -> Result<bool, Error> {
let has = self.parse_exp2()?;
match self.next_token.kind {
kind @(lexer::TokenKind::Add | lexer::TokenKind::Sub) => {
self.bump();
if self.parse_exp1()? {
self.parse_binary_op(kind)?;
Ok(true)
} else {
self.error(ERR_EOF)
}
},
_ => Ok(has),
}
}
// exp2: exp3 op2 exp2| exp3
// op2: * | / | %
fn parse_exp2(&mut self) -> Result<bool, Error> {
let has = self.parse_exp3()?;
match self.next_token.kind {
kind @(lexer::TokenKind::Mul | lexer::TokenKind::Div | lexer::TokenKind::Mod) => {
self.bump();
if self.parse_exp2()? {
self.parse_binary_op(kind)?;
Ok(true)
} else {
self.error(ERR_EOF)
}
},
_ => Ok(has)
}
}
// exp3: fact op3 exp3 | fact
// op3: ^
fn parse_exp3(&mut self) -> Result<bool, Error> {
let has = self.parse_fact()?;
match self.current_token.kind {
kind @lexer::TokenKind::Pow => {
self.bump();
if self.parse_exp3()? {
self.parse_binary_op(kind)?;
Ok(true)
} else {
self.error(ERR_EOF)
}
},
_ => Ok(has)
}
}
// fact: +fact | -fact | (exp1) | func | id | literal
// func: id(exp1)
fn parse_fact(&mut self) -> Result<bool, Error> {
self.bump();
match self.current_token.kind {
kind@ (lexer::TokenKind::Add |lexer::TokenKind::Sub) => {
if !self.parse_fact()? {
return self.error("Unary operator needs expression");
}
self.parse_unary_op(kind)
}
lexer::TokenKind::Lpar => {
let has = self.parse_exp1()?;
if self.next_token.kind != lexer::TokenKind::Rpar {
return self.error("Missing closing parenthesis");
};
self.bump();
Ok(has)
},
lexer::TokenKind::Literal(kind) => {
self.parse_literal(kind)
},
lexer::TokenKind::Eof => Ok(false),
// etc for id, function...
_ => self.error(ERR_UNEXP)
}
}
I shortened parse_fact
a bit but you get the idea. When parsing a terminal, we just add it to the output program I described in the previous post:
fn parse_literal(&mut self, l: lexer::LiteralKind) -> Result<bool, Error> {
let val = match l {
lexer::LiteralKind::Int => self.parse_int()?,
lexer::LiteralKind::Float => self.parse_float()?
};
self.program.push(Expression::Val(val));
Ok(true)
}
fn parse_binary_op(&mut self, kind: lexer::TokenKind) -> Result<bool, Error>{
let op = match kind {
lexer::TokenKind::Add => Op::Add,
lexer::TokenKind::Sub => Op::Sub,
lexer::TokenKind::Div => Op::Div,
lexer::TokenKind::Mul => Op::Mul,
lexer::TokenKind::Mod => Op::Mod,
lexer::TokenKind::Pow => Op::Pow,
_ => return self.error("Invalid binary operator")
};
self.program.push(Expression::BinaryOp(op));
Ok(true)
}
And that's it! We can now parse very complex mathematical expressions.
Sources:
Top comments (0)