DEV Community

Yutaka HARA


How CRuby decides an `if` is a modifier

Ruby has two ways to write if.

  1. if foo then bar end
  2. foo if bar

This reads naturally to humans, but not to machines. For example, can you tell whether this code is valid?

p if 1 then 2 else 3 end

The answer is:

$ ruby -e 'p if 1 then 2 else 3 end'
-e:1: syntax error, unexpected `then', expecting end-of-input

This is because the if here is recognized as a "modifier if", not a "keyword if". So how does Ruby decide which kind of if it is?
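Before digging into the source, the claim can be checked from Ruby itself. The following is a minimal sketch that assumes the standard-library Ripper is available; parses? is a helper name made up for this example. It relies on Ripper.sexp returning nil when the source has a syntax error.

```ruby
require "ripper"

# Hypothetical helper: true when Ripper parses the source without a syntax error.
# Ripper.sexp returns nil on a parse error.
def parses?(source)
  !Ripper.sexp(source).nil?
end

puts parses?("p if 1 then 2 else 3 end")   # if is taken as a modifier, so `then` is unexpected
puts parses?("p(if 1 then 2 else 3 end)")  # inside the parentheses, if is taken as a keyword
```

The first line prints false and the second prints true: the only difference is what comes right before the if.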

parse.y

The answer should be in parse.y, which defines Ruby's grammar.

In parse.y you will find two separate tokens, keyword_if and modifier_if. This means the kind of if is decided by the lexer, not the parser.

lex.c.blt

Grepping for modifier_if leads to lex.c.blt, which has a table of keywords inside the function rb_reserved_word.

#line 31 "defs/keywords"
      {gperf_offsetof(stringpool, 33), {keyword_if, modifier_if}, EXPR_VALUE},

parse.y

The lexer starts from yylex. It calls parser_yylex, which handles symbols such as + and -. If the character is not a symbol, parse_ident is called.

parse_ident uses rb_reserved_word to check whether the token at the current position is a keyword. The returned kw is an entry of the table we saw in lex.c.blt.

    /* See if it is a reserved word.  */
    kw = rb_reserved_word(tok(p), toklen(p));

In the case of the if keyword, kw->id[0] is keyword_if and kw->id[1] is modifier_if.

id holds two values so that keywords and modifiers can be distinguished. According to lex.c.blt, Ruby has five modifiers.

  • x if y
  • x unless y
  • x while y
  • x until y
  • x rescue y
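The five modifier forms listed above can also be observed from the parser's output. This is a small sketch using the standard-library Ripper; first_node_type is a helper name made up for this example, and it assumes the source contains exactly one statement.

```ruby
require "ripper"

# Hypothetical helper: node type of the first statement in the source.
def first_node_type(source)
  sexp = Ripper.sexp(source)
  raise ArgumentError, "syntax error in: #{source}" if sexp.nil?
  sexp[1][0][0] # sexp is [:program, [first_statement, ...]]
end

["x if y", "x unless y", "x while y", "x until y", "x rescue y"].each do |source|
  puts "#{source.ljust(12)} => #{first_node_type(source)}"
end
```

Each of the five sources yields a node whose name ends in _mod (if_mod, unless_mod, while_mod, until_mod, rescue_mod), which is how Ripper labels the modifier forms.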

When an if is a modifier

The following condition distinguishes keyword_if from modifier_if. In short, an if is a keyword if the lexer state is EXPR_BEG (or EXPR_LABELED); otherwise, it is a modifier.

            if (IS_lex_state_for(state, (EXPR_BEG | EXPR_LABELED)))
                return kw->id[0];
            else {
                if (kw->id[0] != kw->id[1])
                    SET_LEX_STATE(EXPR_BEG | EXPR_LABEL);
                return kw->id[1];
            }
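To make the branch above easier to follow, here is a toy model of it in Ruby. It is only a sketch, not CRuby's implementation: lexer states are modelled as arrays of symbols instead of bit flags, the three-entry table stands in for the one in lex.c.blt, and reserved_word and classify are names made up for this example.

```ruby
# Hypothetical stand-in for the keyword table:
# reserved word => [keyword token, modifier token].
# A word without a modifier form is assumed to repeat the same token twice.
def reserved_word(word)
  {
    "if"     => %i[keyword_if modifier_if],
    "unless" => %i[keyword_unless modifier_unless],
    "then"   => %i[keyword_then keyword_then]
  }.fetch(word)
end

# Mirrors the excerpt: returns [token, new_state].
# new_state is nil when the excerpt does not touch the lexer state.
def classify(word, state)
  keyword_id, modifier_id = reserved_word(word)
  if (state & %i[beg labeled]).any?   # IS_lex_state_for(state, EXPR_BEG | EXPR_LABELED)
    [keyword_id, nil]
  elsif keyword_id != modifier_id     # the word really has a modifier form
    [modifier_id, %i[beg label]]      # SET_LEX_STATE(EXPR_BEG | EXPR_LABEL)
  else
    [modifier_id, nil]
  end
end

p classify("if", [:beg])  # keyword form
p classify("if", [:end])  # modifier form, state reset so an expression can follow
p classify("then", [:end])
```

Note the inner check: when both ids are the same token, the state is left alone, so only words that truly have a modifier form put the lexer back at the beginning of an expression.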

The lexer state

Among the lexer states, EXPR_BEG, EXPR_END and EXPR_ARG are the most important. They decide whether an operator such as + or - is unary or binary. For example:

  • 1 - 2: This is binary minus because the state is EXPR_END after the 1.
  • foo(-1): This is unary minus because the state is EXPR_BEG after the (.

EXPR_ARG is a bit tricky: in this state, the meaning of - depends on whether a space follows it.

  • foo - 1: binary minus
  • foo -1: unary minus

What is interesting is that this rule is not so difficult for humans. The former "looks like" binary and the latter "looks like" unary. So you will actually never be bothered by this, unless you are implementing the parser.
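The four minus examples above can be checked with Ripper as well. This sketch assumes foo is not a local variable (it is parsed as a method call) and uses a made-up helper, minus_kind, that looks for a binary or unary node anywhere in the parse result.

```ruby
require "ripper"

# Hypothetical helper: :binary or :unary, depending on how the `-` was parsed.
def minus_kind(source)
  nodes = Ripper.sexp(source).flatten
  return :binary if nodes.include?(:binary)
  return :unary if nodes.include?(:unary)
  nil
end

["1 - 2", "foo(-1)", "foo - 1", "foo -1"].each do |source|
  puts "#{source.ljust(8)} => #{minus_kind(source)}"
end
```

With foo - 1 the result is a subtraction, while with foo -1 the -1 becomes an argument passed to foo, which matches the space-sensitive rule described for EXPR_ARG.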

keyword if and modifier if

Now you can tell whether an if is a keyword or a modifier by checking the lexer state.

  • foo() if ...: This is modifier_if because the state is EXPR_END after the ).
  • foo(if ...): This is keyword_if because the state is EXPR_BEG after the (.
  • foo if ...: This is modifier_if because the state is EXPR_ARG after the foo, just before the if.
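The three cases above can be confirmed from the parse result. This is a sketch using Ripper, with a made-up helper if_kind; it assumes each source contains a single if and that foo and bar are plain method calls.

```ruby
require "ripper"

# Hypothetical helper: how the single `if` in the source was parsed.
def if_kind(source)
  nodes = Ripper.sexp(source).flatten
  return :modifier if nodes.include?(:if_mod)
  return :keyword if nodes.include?(:if)
  nil
end

["foo() if bar", "foo(if bar then 1 end)", "foo if bar"].each do |source|
  puts "#{source.ljust(24)} => #{if_kind(source)}"
end
```

Only the second source, where the if comes right after the opening parenthesis, is reported as a keyword.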

Why this matters to me

I think most Rubyists do not care about corner cases like this. However, I needed to figure this out because I'm making my own programming language, Shiika, which has Ruby-like syntax.

As you've seen, parsing Ruby-like syntax is not easy, especially parsing method calls without parentheses. I'd be happy if this entry helps someone who wants to make a Rubyish language.
