The Scanner is readyyy!

#computerscience #compilers #interpreter #learning

Wipes away a happy tear...
Ladies and gentlemen, I am beyond delighted to report that my Scanner is ready!
Yesterday, I'd created a basic skeleton of the scanner and enabled it to create tokens for single and two-character lexemes. Now, it can do the same for numbers, decimals, strings, keywords, and comments too.

What I built: Commit 9ed6135

What I understood:

1) Comments

Creating a peek() function that serves as a lookahead for the next character.
Ex: In a single-line comment, once the second / in // is matched, the Scanner keeps advancing until peek() sees \n, which marks the end of the line.
Here, peek() is non-destructive, unlike advance(), which moves the current pointer forward and cannot be rolled back.
Similarly, peekNext() checks the character after the one peek() returns.
I struggled to understand why peekNext() wasn't working for multiline comments, like it did for decimals. But more on that later.
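Here's a minimal sketch of how the three helpers relate to each other. The class name Cursor and the skipLineComment() method are hypothetical names for illustration; only peek(), peekNext(), advance(), and isAtEnd() come from the Scanner described above, and their bodies here are an assumption about how such helpers are usually written.

```java
// Hypothetical stand-in for the Scanner's cursor logic.
class Cursor {
    private final String source;
    private int current = 0; // index of the next unread character

    Cursor(String source) {
        this.source = source;
    }

    boolean isAtEnd() {
        return current >= source.length();
    }

    // Consumes one character and moves the pointer forward.
    char advance() {
        return source.charAt(current++);
    }

    // Looks at the next character without consuming it.
    char peek() {
        return isAtEnd() ? '\0' : source.charAt(current);
    }

    // Looks one character past peek(), also without consuming anything.
    char peekNext() {
        return current + 1 >= source.length() ? '\0' : source.charAt(current + 1);
    }

    // Skips the rest of a single-line comment; the '\n' itself stays unread.
    void skipLineComment() {
        while (peek() != '\n' && !isAtEnd()) advance();
    }
}
```

Calling peek() twice in a row returns the same character, while each advance() call returns a new one.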

2) What the Scanner Ignores

Used a shared switch case so the Scanner ignores spaces, tabs, and carriage returns. Likewise, it increments the line counter every time \n is encountered.
But keeping track of whitespace isn't futile either.
Ex1: In C, whitespace separates tokens: int x declares a variable, while intx is read as a single identifier.
Ex2: Python uses indentation instead of {} to delimit blocks.
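A rough sketch of that shared case, under the assumption that everything else is handled in other branches. The class WhitespaceDemo and its kept buffer are hypothetical; kept simply stands in for the tokens the real Scanner would produce.

```java
// Hypothetical demo: ignores spaces, tabs, and carriage returns, counts lines.
class WhitespaceDemo {
    int line = 1;
    final StringBuilder kept = new StringBuilder(); // stand-in for real tokens

    void scan(String source) {
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case ' ':
                case '\r':
                case '\t':
                    break; // three labels fall through to one "ignore" branch
                case '\n':
                    line++; // a newline produces no token but moves the line counter
                    break;
                default:
                    kept.append(c);
                    break;
            }
        }
    }
}
```

Scanning "a \t b\r\nc\n" with this sketch leaves kept as "abc" and line at 3.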

3) Strings

The Scanner advances until the closing " is reached.
The contents between the opening and closing " are extracted as a substring and stored in a Java String called value, which is passed to addToken().
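The same idea as a standalone sketch. StringScanDemo and scanString() are hypothetical names, and returning null for an unterminated string is an assumption made to keep the example small; the real Scanner would report an error instead.

```java
// Hypothetical demo of scanning a string literal.
class StringScanDemo {
    // start is the index of the opening quote.
    // Returns the literal's contents, or null if the closing quote is missing.
    static String scanString(String source, int start) {
        int current = start + 1; // first character after the opening quote
        while (current < source.length() && source.charAt(current) != '"') {
            current++;
        }
        if (current >= source.length()) {
            return null; // unterminated string
        }
        current++; // consume the closing quote
        // Trim the surrounding quotes to get the value.
        return source.substring(start + 1, current - 1);
    }
}
```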

4) Numbers

If the first character passes isDigit() (which checks that it is between 0 and 9), the lexeme is scanned as a number, with peekNext() used to detect a fractional part.
While scanning a number, peek() checks whether the next character is a . and peekNext() checks whether the character after the . is a digit.
If so, the Scanner stores the whole lexeme as a NUMBER. If not, only the digits before the . are stored as a NUMBER, and the . becomes a separate DOT token.
If the first character passes isAlpha() instead, the lexeme is scanned as an identifier.
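To make the two-character lookahead concrete, here's a small sketch that works on a plain String and index instead of the Scanner's fields. NumberScanDemo and numberEnd() are hypothetical; the check on the . and the digit after it mirrors what peek() and peekNext() do above.

```java
// Hypothetical demo of scanning a number with an optional fractional part.
class NumberScanDemo {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index just past the number lexeme that begins at start.
    static int numberEnd(String s, int start) {
        int current = start;
        while (current < s.length() && isDigit(s.charAt(current))) current++;

        // Only treat '.' as part of the number if a digit follows it.
        boolean dotNext = current < s.length() && s.charAt(current) == '.';
        boolean digitAfterDot = current + 1 < s.length() && isDigit(s.charAt(current + 1));
        if (dotNext && digitAfterDot) {
            current++; // consume the '.'
            while (current < s.length() && isDigit(s.charAt(current))) current++;
        }
        return current;
    }
}
```

With this sketch, "12.5+1" yields the lexeme "12.5", while "12.foo" yields just "12" and leaves the . for the next token.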

5) Keywords

We use a HashMap here to map keyword lexemes to their token types.
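A small sketch of that lookup. KeywordDemo, classify(), and the trimmed-down TokenType enum are hypothetical, and falling back to IDENTIFIER when the lexeme isn't in the map is an assumption about how the lookup is typically used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of a keyword table.
class KeywordDemo {
    enum TokenType { AND, ELSE, IF, VAR, WHILE, IDENTIFIER }

    private static final Map<String, TokenType> keywords = new HashMap<>();

    static {
        keywords.put("and", TokenType.AND);
        keywords.put("else", TokenType.ELSE);
        keywords.put("if", TokenType.IF);
        keywords.put("var", TokenType.VAR);
        keywords.put("while", TokenType.WHILE);
    }

    // A lexeme found in the map is a keyword; anything else is an identifier.
    static TokenType classify(String lexeme) {
        TokenType type = keywords.get(lexeme);
        return type == null ? TokenType.IDENTIFIER : type;
    }
}
```

The lookup compares the whole lexeme, so "while" is a keyword but "whilst" stays an identifier.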

What I didn't understand: Why peekNext() wasn't working for multiline comments, like it was for numbers. I spent a lot of time last evening trying to figure this out, and it wasn't until this morning that I understood why.

I initially tried a condition like the one used in single-line comments.

case '/':
    else if (match('*')) {
        while (peek() != '*' && peekNext() != '/' && !isAtEnd()) advance();
    }

What I didn't account for was that I'd been relying on the same variable from advance() that I declared at the start of the switch statement.
I asked an AI for a hint and then understood that I had to read each character into a new variable from advance() inside the while loop.
Think of it like this. You're tasked with cleaning up the sea. You use a variable, c = advance(), to hold each piece of trash you pick up and classify it by type, for easier recycling.
But for unrecyclable items, c alone isn't enough.
So we create a new variable, ch = advance().
Let's assume that * followed by / are the last two unrecyclable items to be picked up.
peek() is a lookahead that checks whether a certain item (character) comes next, without picking it up.

if (peek() == '*' && peekNext() == '/') {
    advance();
    return;
}

No matter what I tried, the Scanner would always treat the final / of a multiline comment as a new token.
I then realised that peek() only looks; it doesn't consume anything.
When the code above finds that the next two characters are * and /, it advances once, which means that * is picked up. But what about /?
To address this, we must first make sure that * has already been picked up and that the item after it is /.
Now, when we advance(), we pick up / as well. That looks like:

case '/':
    if (match('/')) {
        // single-line comment: skip to the end of the line
        while (peek() != '\n' && !isAtEnd()) advance();
    }
    // multiline comments
    else if (match('*')) {
        while (!isAtEnd()) {
            char ch = advance();
            if (ch == '\n') line++;
            // '*' is already consumed here; consume the '/' that follows it
            if (ch == '*' && peek() == '/') {
                advance();
                return;
            }
        }
        Lox.error(line, "Unterminated comment.");
    }
    else {
        addToken(SLASH);
    }
    break;
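To compare the first attempt with the final loop in isolation, here's a sketch that runs both over a plain String. CommentSkipDemo, firstAttempt(), finalVersion(), and peekAt() are hypothetical names, and line counting is left out to keep it short. It also shows one more thing about the first condition: because it joins two != checks with &&, it stops as soon as either check fails, so a lone * inside the comment ends the loop early.

```java
// Hypothetical demo contrasting the two loops. In both methods, current
// starts just past the opening "/*".
class CommentSkipDemo {
    static char peekAt(String s, int i) {
        return i < s.length() ? s.charAt(i) : '\0';
    }

    // First attempt: stops when peek() is '*' OR peekNext() is '/',
    // and never consumes the closing "*/". Returns where it stopped.
    static int firstAttempt(String s, int current) {
        while (current < s.length()
                && peekAt(s, current) != '*'
                && peekAt(s, current + 1) != '/') {
            current++;
        }
        return current;
    }

    // Final version: consume a character, then check whether it was a '*'
    // directly followed by '/'. Returns the index just past "*/",
    // or -1 if the comment is unterminated.
    static int finalVersion(String s, int current) {
        while (current < s.length()) {
            char ch = s.charAt(current++);
            if (ch == '*' && peekAt(s, current) == '/') {
                return current + 1; // also consume the '/'
            }
        }
        return -1;
    }
}
```

For the input "/* a*b */ x" with current starting at 2, firstAttempt() stops at index 4 (the * between a and b), while finalVersion() returns 9, leaving " x" as the remaining source.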

What's next: The Parser!

Musings:
It took me a while to understand the problem and solve it (by myself!), so I'm pretty proud. I like the analogy too because yet again it relates to the sea, which I love. I haven’t travelled by sea much, but often draw inspiration from it. Its endlessness and mysteries aren’t unlike those of the world or universe we inhabit. Whenever we’re deluded by our own might and permanence, it humbles us. Understand it, understand yourself, and maybe you can weather the storm.

DEV Community
