This blog post is best seen in it's original format.
This blog post is based on a talk I did entitled The One Hour Expression Language and aims to provide a review of both the concepts and the code in that talk1.
An expression language2 in our sense is something that evaluates
an expression where an expression is a sequence of bytes which are highly likely
to be utf-8 characters3 Some
examples:
1 + 1
//article[@title="foobar"]//image
.items[].foo|select(.bar = "foo")
a.comments > 1 and a.category not in ["misc"]
Examples of expression languages (or DSLs4) would include:
Why would you want to create your own expression language? Well, why wouldn't you? Perhaps you are too busy? Never fear! It needn't take months, weeks nor days to write an expression language, you can do it in an hour with my patented5 One Hour Expression Language!
ProCalc2000
We will be building the ProCalc2000 expression language. ProCalc2000
is a next generation non-scientific, precedent-disrespecting, arithmetic calculator for the year 2000 and subsequent years.
It allows the evaluation of complex expressions such as 1 + 1
or even 1 +
, and can even be extended to solve division problems such as
21 +
.
3 + 2 / 2
Godzilla
Godzilla doesn't like division. It results in floating point numbers which he
doesn't want to deal with today.
The expression language consists of numbers (i.e. 1
and 2
) and operators (+
, -
,
*
) and will not support operator precedence (see Appendix I) or division.
Despite it's simplicity it does provide the foundations from which you can
easily introduce more features. You can add variables, functions, pipe
operators, suffixes, string concatenation and even, contrary to the will of
Godzilla, support division.
What's in One, Please?
There are a many ways to write code to evaluate a sequence of bytes in
some way, but we will be using a tokenizer, parser and an
evaluator:
+-----------+ tokens +--------+ ast +-----------+
EXPRESSION ==>| Tokenizer |--------->| Parser |------>| Evaluator | => VALUE
+-----------+ +--------+ +-----------+
Tokenizer
Also known as a Lexer or a Scanner. This class is responsible for
splitting the string into categorised chunks called tokens.
class Tokenizer
{
public function tokenize(string $expression): Tokens
{
// ...
}
}
For example 1 + 2 + 3
would evaluate to 5 tokens:
Token(Integer, 1)
Token(Plus)
Token(Integer, 2)
Token(Plus)
Token(Integer, 3)
The tokenizer scans the expression string from left to right and identifies
"chunks" that are interesting. In our expression language
those interesting chunks are:
- positive integer numbers.
- the
+
,-
and*
operators.
White-space is interesting only in that we will skip it and if we encounter
any other character we will produce an error.
Our set of token types will be Integer
, Plus
, Minus
and Multiply
.
Godzilla
ProCal2000 could also be implemented with a
Tokenizer
and a stackmachine which is
virtuous but we will implement a
Parser
and an Evaluator
becausebecause Godzilla cares about you.
Note that the tokenizer doesn't care if the expression is valid it only
cares about splitting the expression into distinct and categorised chunks6.
Once the tokenizer has produced tokens we hand them off to the parser.
Parser
The parser will accept the list of tokens and make sense of them by
transforming them into an Abstract Syntax Tree (AST for short).
class Parser
{
public function parse(Tokens $tokens): Node
{
// ...
}
}
So given a list of tokens (list<Token>
) the parser will return an AST which
is the root Node
of a tree of nodes. In our case each and every node in the tree is an expression that can be evaluated and there are two node types BinaryOp
and Integer
.
A binary operation is an ? operations with two operands, so for example:
foo or bar
could > beBinaryOp(Variable('foo'), 'or', Variable('bar'))
or5 ** 6
would beBinaryOp(Integer(5), '**', Integer(6))
.Commonly we also have unary
operation with one operand. Negation is a unary operation -!bar
could beUnaryOp('!', Variable('bar'))
. Could-1
be a unary operation? What about----1
?Ternary operations have three operands and are most commonly seen as the ternary conditional
foo ? bar : baz
.
The expression 1 + 1 / 5
is represented as a single BinaryOp
with an operator (+
) and two operands: the integer value 1
and another binary operation:
+-------------+
| Binary Op + | <-- root of the AST
+---+---+-----+
left | | right
+-------+ +--------+
| |
+-----------+ +-------------+
| Integer 1 | | Binary Op / |
+-----------+ +----+---+----+
left | |right
+------+ +----+
| |
+-----+-----+ +----+------+
| Integer 1 | | Integer 5 |
+-----------+ +-----------+
Which in PHP code could be represented as:
$ast = new BinaryOp(
left: new Integer(1),
operator: '+',
right: new BinaryOp(
left: new Integer(1),
operator: '/',
right: new Integer(5),
)
);
Evaluator
Finally we have the evaluator which accepts a Node
and returns something
else in our case it will return an integer value. The Evaluator is also
known as a tree-walking interpreter.
class Evaluator
{
public function evaluate(Node $node): int
{
// ...
}
}
Show Me Your Code, Please?
This code was created at the PHPSW meetup on the 9th of July and was driven by
unit tests which have been omitted in this blog post. See the repository.
Godzilla
Godzilla says that if he wrote this code he would be angry. He suggests
you refactor it or you'll be angry too and you will have to fight
Godzilla but he always wins.
Tokenizer
First of all we need a class to represent our Token
which contains a
TokenType
enum and an optional value:
<?php
class Token
{
public function __construct(
public TokenType $type,
public ?string $value = null
) {}
}
We store the "value" (if applicable) in the token. An alternative approach
would be store the start and end offset of the token within the expression
string and reference the contents of the expression later. This can be more memory
efficient and generally more useful. But we're being lazy today.
We used a TokenType
enum so let's define it with our 4 token types:
<?php
enum TokenType
{
case Plus;
case Minus;
case Multiply;
case Integer;
}
Tokens would then look like:
[
new Token(TokenType::Integer, 50),
new Token(TokenType::Plus),
// ...
]
The mammoth 🦣 in the room is the Tokenizer
class. It does all the
work7:
<?php
class Tokenizer
{
public function tokenize(string $expression): Tokens
{
$offset = 0;
// we will need to _collect_ the tokens we create
$tokens = [];
// scan from left to right
while (isset($expression[$offset])) {
// get the current char and advance the pointer to the
// next char
$char = $expression[$offset++];
// if PHP had pattern matching we could do this in the `match`
// expression below but it doesn't.
//
// if the char is a number...
if (is_numeric($char)) {
// while the _next_ char is a number
while (is_numeric($expression[$offset] ?? null)) {
// append the current char and advance the pointer
$char .= $expression[$offset++];
}
// add the token
$tokens[] = new Token(TokenType::Integer, $char);
// we're done here, continue
continue;
}
// handle single charcater tokens - in our case this
// consists of our _operators_ and the space
$token = match ($char) {
'+' => new Token(TokenType::Plus),
'-' => new Token(TokenType::Minus),
'*' => new Token(TokenType::Multiply),
// return NULL for space so we can handle it specially
' ' => null,
// otherwise the user was wrong, tell them why.
default => throw new RuntimeException(sprintf(
'Invalid operator: "%s"', $char
)),
};
// if the token is NULL then it was whitespace so we continue
// without adding the token to the list
if ($token === null) {
continue;
}
// if we get this far then we're winning!
$tokens[] = $token;
}
// return a _collection_ of tokens, Godzilla is angry. Should he be?
return new Tokens($tokens);
}
}
Finally let's define that Tokens
collection object:
<?php
use ArrayIterator;
use IteratorAggregate;
use Traversable;
// _almost_ all of my collection objects implement IteratorAggregate
// as it's an easy way to make an object itreable (e.g. in a foreach loop)
class Tokens implements IteratorAggregate
{
// we use a "pointer" to track where we are in the tokens array.
private int $offset = 0;
public function __construct(private array $tokens)
{
}
// from the IteratorAggregate interface
public function getIterator(): Traversable
{
return new ArrayIterator($this->tokens);
}
// take() returns the _current_ token and then advances
// the pointer to the next token. Alternative names for this
// method include:
//
// - chomp()
// - eat()
// - consume()
// - next()
// - bite()
// - ...
public function take(): ?Token
{
return $this->tokens[$this->offset++] ?? null;
}
}
Godzilla
Godzilla interjects rudely and growls that he would do this without a collection object by using an
array
and array_shift
instead of take()
or by having the Tokenizer
return a
Generator
and then, he claims, he can tokenize and parse at the same time! Well done Godzilla.Parser
<?php
use DTL\OneHourExp\Node\BinaryOp;
use DTL\OneHourExp\Node\Integer;
use RuntimeException;
class Parser
{
public function parse(Tokens $tokens): Node
{
// take the first token in the given token list
$token = $tokens->take();
// resolve a node for the token
$node = match ($token?->type) {
// we actually only support Integers at this point
TokenType::Integer => new Integer((int)$token->value),
// if we get NULL then the token was not returned the expression
// was not complete (e.g. "1 + 2 +" would cause this error).
null => throw new RuntimeException('Unexpected end of expression'),
// otherwise throw an exception (e.g. "1 + + +").
default => throw new RuntimeException(sprintf(
'Do not know what to do, thanks: %s', $token->type->name ?? 'null'
)),
};
// see if we have an operator
$token = $tokens->take();
// if we don't then we're at the end of the expression
// and we should return the $node
if ($token === null) {
return $node;
}
// if we do have an operator then return a binary operation
// with the $node as the left operand and the REST OF THE
// EXPRESSION as the right operand.
return new BinaryOp(
$node,
match ($token->type) {
TokenType::Plus => '+',
TokenType::Minus => '-',
TokenType::Multiply => '*',
default => throw new RuntimeException(sprintf(
'Unknown operator: %s', $token->name
)),
},
$this->parse($tokens),
);
}
}
It is here that you would add support for operator precedence, suffix parsing,
pipe operators, etc. We'll talk about operator precedence later, but let's
quickly look at suffix parsing.Say for example you want
5 miles
to return aMiles
node (if you wanted to
defer conversion to a sane unit system later) or aDistance
node (if you
wanted to eagerly make the conversion) so that5 miles * 2 kilometers = (5 * 1.609.344) * 2 = 16.090
.After we've parsed
5
we'd look at the next token, if it's a suffix token
(i.e. one ofmiles
orkilometers
) then we'd combine the previous node
(Integer
) into aDistance
node before forwarding that node
to the binary operation:Distance(Integer(5))
orDistance(5)
or
Distance(5.0, 'miles')
orDistance(5.0, UnitSystem::Imperial)
.
Evaluator
Finally the Evaluator is where the real magic happens 🪄 and it's not even
hard. The evaluator accepts any node and will resolve it to a value -
using recursion to evaluate any nested nodes:
<?php
use DTL\OneHourExp\Node\BinaryOp;
use DTL\OneHourExp\Node\Integer;
use Exception;
use RuntimeException;
class Evaluator
{
// accept any node and, in this case, return an int although
// a typical evaluator would return `mixed` here.
public function evaluate(Node $node): int
{
// if the node is an integer, then return the integer value
if ($node instanceof Integer) {
return $node->value;
}
// oh boy! a BinaryOp!
if ($node instanceof BinaryOp) {
// this is where things get interesting as we recurse...
$leftValue = $this->evaluate($node->left);
// ...evaluating the left and right opeands
$rightValue = $this->evaluate($node->right);
// and finally perform the arithmetic operations
return match ($node->operator) {
'+' => $leftValue + $rightValue,
'-' => $leftValue - $rightValue,
'*' => $leftValue * $rightValue,
default => throw new Exception(sprintf(
'Unknown operator: %s',
$node->operator
)),
};
}
throw new RuntimeException(sprintf(
'Do not know how to evaluate node: %s',
$node::class
));
}
}
That's It
That was the code that I live coded when I live coded the code at
the PHP South West PHP code meetup, although the live demo included the
tests.
You can see the whole thing here and clone it to run bin/scalc2000
to solve maths.
What About Operator Precedence?
Oh if you really want to know. The expression 1 * 3 + 4
should be evaluated so that multiplication operations are calculated before addition operations so that:
2 * 3 + 4 = (2 * 3) + 4 = 6 + 4 = 10
But our expression language will counter-intuitively do the following because we always evaluate
the entire right hand side of the expression due to the way we parsed the
tokens:
2 * 3 + 4 = 2 * (3 + 4) = 2 * 7 = 14
This is the wrong answer8. We need to change the way the parser builds the
AST. One way would be by updating our parser to be a Pratt
Parser:
<?php
use DTL\OneHourExp\Node\BinaryOp;
use DTL\OneHourExp\Node\Integer;
use RuntimeException;
class Parser
{
// we've added a new function to return the precedence of any given
// operator
private function operatorPrecedence(?TokenType $type): int {
return match ($type) {
null => 0,
// plus and minus have the same precedence
TokenType::Plus => 10,
TokenType::Minus => 10,
// multiplication has a higher precedence
TokenType::Multiply => 20,
// we could have an Integer here which wouldn't make any sense
// so throw an exception.
default => throw new RuntimeException(sprintf(
'%s is not an operator', $type->name
)),
};
}
// our original parse function is updated to accept a "precedence" which we'll set when we recurse.
public function parse(Tokens $tokens, int $precedence = 0): Node
{
// parse the operand (e.g. 5) as we did before.
$token = $tokens->take();
$node = match ($token?->type) {
TokenType::Integer => new Integer((int)$token->value),
null => throw new RuntimeException('Unexpected end of expression'),
default => throw new RuntimeException(sprintf(
'Do not know what to do thanks: %s', $token->type->name ?? 'null'
)),
};
// we add a new method "current" to the Tokens collection to return the
// current token without consuming it.
$token = $tokens->current();
if ($token === null) {
return $node;
}
// we compare the given precedence with the precedence for the current operator.
while ($precedence < $newPrecedence = $this->operatorPrecedence($tokens->current()?->type)) {
// only now do we consume the operator
$token = $tokens->mustTake();
// we parse the expression until an operator with a higher
// precedence than `$newPrecedence` is encountered.
$rightNode = $this->parse($tokens, $newPrecedence);
// we construct the BinaryOp and assign it to `$node`
$node = new BinaryOp(
$node,
match ($token->type) {
TokenType::Plus => '+',
TokenType::Minus => '-',
TokenType::Multiply => '*',
default => throw new RuntimeException(sprintf(
'Unknown operator: %s', $token->type->name
)),
},
$rightNode,
);
// now the next iteration will either exit or have $node as the left operand of
// the _next_ BinaryOp.
}
return $node;
}
}
Godzilla
The Pratt Parser makes use of recursion which is far too complex for the
human mind but
Godzilla, as a lizard, understands.
Even if recursion is too complex for humans, let's look inside of Godzilla's
brain and how he evaluates 2 * 3 + 4
:
- Enter the function initially with precedence
0
- Parse the operand
Integer(2)
. - The current token is
*
with a precedence of20
. -
20
is greater than0
so we enter the loop. - Create a binary operation with the left value of
2
:$node = BinaryOp(Integer(2), *, ...)
- Recurse to
parse
the right operand with the new precedence of20
:- Enter
parse
with a precedence of20
. - Parse the operand
Integer(3)
- The current token is a
+
operator with precedence10
-
10
is NOT less than than20
so we exit the loop and returnInteger(3)
- Enter
- Now we have resolved our
BinaryOp
node's right value:$node = BinaryOp(Integer(2), *, Integer(3))
- The loop is evaluated again and the operator is
+
with precedence10
-
10
is greater than0
so we enter the loop - Create a new binary node with the left hand side inherited from the previous iteration:
BinaryOp(BinaryOp(Integer(2), *, Integer(3), '+', ...)
- Recurse to evaluate the right hand side of the
BinaryOp
with the new precedence of10
:- Enter
parse
with a precedence of10
. - Parse the operand
Integer(4)
. - There are no more tokens so we return
Integer(4)
.
- Enter
- Now we have:
$node = BinaryOp(BinaryOp(Integer(2), *, Integer(3)), +, Integer(4))
- Loop is evaluated again.
- Current token is
NULL
with a precedence of0
-
0
is not less than0
. - Exit the loop
- Return the final
Node
.
The evaluator can now, without modification, give the "correct" answer of
10
.
Further Reading
- Crafting Interpreters: Book (with free web edition) by Robert Nystrom - if you ever wanted to write your own programming language then you can follow this simple guide.
- Expression Parsing Made Easy: Blog post by the same author as above, implementing a Pratt Parser in Java.
- Stack Machine RPN Calculator: 2014 Post by Igor Wiedler followed by many more interesting articles.
- Doctrine Lexer: I think I based my parsers on this library, a long time ago.9
- PHPStan Phpdoc Parser: Another example of a parser.
-
actually the specific code created at the PHPSW meetup in January ↩
- The actual code changes each time.
-
or more specifically an expresison language interpreter. ↩
-
commonly known as a
string
in PHP (although let's hope they're not
multi-byte chars 😱). ↩ -
domain-specific language - while an expression languages aren't full
languages (in the sense that PHP is a language) they can be categorised as DSLs. ↩ -
there is no patent - but don't get any ideas. ↩
-
the tokenizer is useful in its own right, for example you could
easily create a basic syntax highlighter. ↩ -
it's common and possibly more performant to implement the
tokenizer withpreg_
methods, for example: phpstan docblock lexer or my own soon-to-be-abandoned docblock parser ↩ -
it's only wrong if you expected a different answer. ↩
-
it was also through the Doctrine project's query builders that I discovered the power of walking trees 🌲. ↩
Top comments (0)