Wipes away a happy tear...
Ladies and gentlemen, I am beyond delighted to report that my Scanner is ready!
Yesterday, I'd created a basic skeleton of the scanner and enabled it to create tokens for single and two-character lexemes. Now, it can do the same for numbers, decimals, strings, keywords, and comments too.
What I built: Commit 9ed6135
What I understood:
1) Comments
- Creating a peek() function that serves as a lookahead for the next character.
- Ex: In a single-line comment, once the second / in // is encountered, the Scanner just keeps advancing until \n is reached (which peek() helps look out for), indicating the end of the line.
- Here, peek() is non-destructive, unlike advance(), which cannot be rolled back because it updates the current pointer.
- Similarly, peekNext() is used to check the character after the one peek() returns.
- I struggled to understand why peekNext() wasn't working for multiline comments, like it did for decimals. But more on that later.
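To make the difference between looking and consuming concrete, here is a minimal sketch of the three helpers. The class name LookaheadSketch and the position() helper are made up for illustration; only peek(), peekNext(), advance(), isAtEnd() and current come from the notes above, and the real Scanner may differ in its details.

```java
// Hypothetical stand-in for the Scanner's cursor logic, not the actual Scanner class.
class LookaheadSketch {
    private final String source;
    private int current = 0; // index of the next unread character

    LookaheadSketch(String source) {
        this.source = source;
    }

    boolean isAtEnd() {
        return current >= source.length();
    }

    // Consumes one character: moves `current` forward, so it cannot be undone.
    char advance() {
        return source.charAt(current++);
    }

    // Looks at the next character without consuming it.
    char peek() {
        return isAtEnd() ? '\0' : source.charAt(current);
    }

    // Looks one character past peek(), again without consuming anything.
    char peekNext() {
        return current + 1 >= source.length() ? '\0' : source.charAt(current + 1);
    }

    // Illustration-only accessor so the cursor can be inspected.
    int position() {
        return current;
    }

    public static void main(String[] args) {
        LookaheadSketch s = new LookaheadSketch("1.5");
        System.out.println(s.peek());     // 1
        System.out.println(s.peekNext()); // .
        System.out.println(s.position()); // 0 (nothing consumed yet)
        System.out.println(s.advance());  // 1
        System.out.println(s.position()); // 1 (the cursor moved)
    }
}
```

Calling peek() or peekNext() any number of times leaves the cursor where it was; only advance() moves it.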
2) What the Scanner Ignores
- Used a common (switch) case for the Scanner to ignore spaces, tabs, and carriage returns. Likewise, it updates the line counter every time \n is encountered.
- But keeping track of whitespace isn't futile either.
- Ex1: In C, a preprocessor directive such as #include <stdio.h> ends at the newline, so the line break there is significant.
- Ex2: Python uses whitespace (indentation) instead of {} to delimit blocks.
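The shared case can be sketched roughly like this. WhitespaceSketch and countLines are hypothetical names for a cut-down loop, not the real scanToken(), and the default branch stands in for everything a full scanner would do with other characters.

```java
// Hypothetical, cut-down version of the whitespace handling described above.
class WhitespaceSketch {
    // Returns the line number reached after walking through `source`,
    // ignoring spaces, tabs and carriage returns along the way.
    static int countLines(String source) {
        int line = 1;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case ' ':
                case '\r':
                case '\t':
                    break; // ignored: these share one case body
                case '\n':
                    line++; // a newline only bumps the line counter
                    break;
                default:
                    break; // a real scanner would start a token here
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(countLines("a \t\r\nb\nc")); // 3
    }
}
```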
3) Strings
- Until the closing " has been reached, the scanner advances.
- The substring/contents between the lexeme's start and end (the opening " and the closing ") are stored in a Java String object called value, that's passed to addToken().
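A rough sketch of that idea, assuming a standalone helper rather than the real Scanner method. StringSketch and scanString are illustrative names, and returning null for an unterminated string is a simplification chosen for this sketch.

```java
// Hypothetical helper showing how the contents between the quotes are extracted.
class StringSketch {
    // `start` is the index of the opening quote.
    // Returns the text between the quotes, or null if the string never closes.
    static String scanString(String source, int start) {
        int current = start + 1; // step past the opening quote
        while (current < source.length() && source.charAt(current) != '"') {
            current++; // keep advancing until the closing quote
        }
        if (current >= source.length()) {
            return null; // unterminated string
        }
        // Trim the surrounding quotes: this is what the post calls `value`.
        return source.substring(start + 1, current);
    }

    public static void main(String[] args) {
        System.out.println(scanString("print \"hi there\";", 6)); // hi there
    }
}
```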
4) Numbers
- If the character passes the isDigit() check (that is, it is between 0 and 9), the lexeme is scanned as a number, with decimals checked for through peekNext().
- While scanning a number, peek() checks whether the next character is a . and peekNext() checks whether the character after the . is a digit.
- If yes, the Scanner stores the whole lexeme as a NUMBER. If not, the numeric part alone is stored as a NUMBER, while the . is stored as DOT.
- If the character passes the isAlpha() check instead, the lexeme is stored as an identifier.
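Here is a small sketch of the two-character lookahead for decimals. NumberSketch and scan are made-up names, tokens are represented as plain strings purely for illustration, and any character that is neither a digit nor a . is simply skipped, which the real Scanner does not do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-scanner that only knows NUMBER and DOT.
class NumberSketch {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        int current = 0;
        while (current < source.length()) {
            int start = current;
            char c = source.charAt(current++);
            if (isDigit(c)) {
                // Consume the integer part.
                while (current < source.length() && isDigit(source.charAt(current))) current++;
                // Take the '.' only if a digit follows it (the peek() + peekNext() check).
                if (current + 1 < source.length()
                        && source.charAt(current) == '.'
                        && isDigit(source.charAt(current + 1))) {
                    current++; // consume the '.'
                    while (current < source.length() && isDigit(source.charAt(current))) current++;
                }
                tokens.add("NUMBER " + source.substring(start, current));
            } else if (c == '.') {
                tokens.add("DOT");
            }
            // Anything else is ignored in this sketch.
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("12.5")); // [NUMBER 12.5]
        System.out.println(scan("12."));  // [NUMBER 12, DOT]
    }
}
```

Because the . is only consumed when a digit follows it, a trailing . is left behind for the next pass of the loop, which is where it becomes a DOT.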
5) Keywords:
- We use a HashMap here to map keyword lexemes to their token types.
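As a sketch of the lookup, assuming only a handful of keywords: KeywordSketch, typeOf and this shortened TokenType enum are illustrative, and the real keyword table will be longer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical keyword table; only a few entries are shown.
class KeywordSketch {
    enum TokenType { AND, CLASS, ELSE, IF, VAR, WHILE, IDENTIFIER }

    private static final Map<String, TokenType> KEYWORDS = new HashMap<>();

    static {
        KEYWORDS.put("and", TokenType.AND);
        KEYWORDS.put("class", TokenType.CLASS);
        KEYWORDS.put("else", TokenType.ELSE);
        KEYWORDS.put("if", TokenType.IF);
        KEYWORDS.put("var", TokenType.VAR);
        KEYWORDS.put("while", TokenType.WHILE);
    }

    // A lexeme found in the map is a keyword; anything else is an identifier.
    static TokenType typeOf(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, TokenType.IDENTIFIER);
    }

    public static void main(String[] args) {
        System.out.println(typeOf("while"));   // WHILE
        System.out.println(typeOf("whilst"));  // IDENTIFIER
    }
}
```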
What I didn't understand: Why peekNext() wasn't working for multiline comments, like it was for numbers. I spent a lot of time last evening trying to figure this out, and it wasn't until this morning that I understood why.
- I initially tried a condition like the one used in single-line comments:
    case '/':
        else if (match('*')) {
            while (peek() != '*' && peekNext() != '/' && !isAtEnd()) advance();
        }
- What I didn't account for was that I'd been relying on the same variable, the one holding the result of advance(), that I declared at the beginning of the switch statement.
- I took AI's help for a hint and then understood that I had to create a new variable for the result of advance() inside the while loop.
- Think of it like this. You're tasked with cleaning up the sea. You use a counter called c = advance() to keep track of each piece of trash you pick up and classify by type, for easier recycling.
- But for unrecyclable items, we cannot use only c.
- So we create a new counter called ch = advance().
- Let's assume that * followed by / are the last two unrecyclable items to be picked up.
- peek() is a lookahead that checks whether or not a certain item (character) has been reached.
    if (peek() == '*' && peekNext() == '/') {
        advance();
        return;
    }
- No matter what I tried, the Scanner would always treat the final / of a multiline comment as a new token.
- I then realised that peek() only looks; it doesn't consume anything.
- When the code above finds that the next two characters are * and /, it advances once. Which means that * is picked up. But what about /?
- To address this, we must first ensure that * has been picked up and that the item after it is /.
- Now, when we advance(), we also pick up /. That looks like:
    case '/':
        if (match('/')) {
            // Single-line comment: skip to the end of the line.
            while (peek() != '\n' && !isAtEnd()) advance();
        }
        // Multiline comments
        else if (match('*')) {
            while (!isAtEnd()) {
                char ch = advance();
                if (ch == '\n') line++;
                if (ch == '*' && peek() == '/') {
                    advance(); // consume the closing '/'
                    return;
                }
            }
            Lox.error(line, "Unterminated comment.");
        }
        else {
            addToken(SLASH);
        }
        break;
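To try the same loop outside the Scanner, here is a self-contained sketch. BlockCommentSketch and skipBlockComment are hypothetical names, the cursor is passed in as an index instead of living in a field, and -1 stands in for the "Unterminated comment." error.

```java
// Hypothetical standalone version of the multiline comment loop.
class BlockCommentSketch {
    // `current` is the index just past the opening "/*".
    // Returns the index just past the closing "*/", or -1 if the comment never closes.
    static int skipBlockComment(String source, int current) {
        while (current < source.length()) {
            char ch = source.charAt(current++); // plays the role of advance()
            // '*' has been consumed; now look (without consuming) for '/'.
            if (ch == '*' && current < source.length() && source.charAt(current) == '/') {
                return current + 1; // consume the '/' as well
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "/* a * b */x";
        int end = skipBlockComment(src, 2);
        System.out.println(src.charAt(end)); // x
    }
}
```

The lone * in the middle of the comment is consumed and ignored because the character after it is not /. By contrast, the first attempt's loop condition (peek() != '*' && peekNext() != '/') becomes false at that lone *, so that loop would stop there.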
What's next: The Parser!
Musings:
It took me a while to understand the problem and solve it (by myself!), so I'm pretty proud. I like the analogy too because yet again it relates to the sea, which I love. I haven’t travelled by sea much, but often draw inspiration from it. Its endlessness and mysteries aren’t unlike those of the world or universe we inhabit. Whenever we’re deluded by our own might and permanence, it humbles us. Understand it, understand yourself, and maybe you can weather the storm.