Let’s understand Chrome V8 — Chapter 3: Compilation pipeline, Scanner

#javascript #v8 #cpp #chrome

Original source: https://medium.com/@huidou/lets-understand-chrome-v8-d34028b37e03

Welcome to other chapters of Let’s Understand Chrome V8

1. Copmilation pipeline

When the JS compiler is initialized, the Scanner is called to generate a token and store it in the Cache. Then the Parser starts to work and takes the token from the Cache to generate an AST tree node. When the Parser encounters a Cache miss, it calls the Scanner to generate the token and store it in Cache as well and then continues to work. So on until the AST is generated fully.As shown in Figure 1.

Let’s look at the following code, the Parser::ParseProgram is the entry function of Scanner that point in Figure 1. Further, let’s take a look at scanner_.Initialize() and FunctionLiteral* result = DoParseProgram(isolate, info), their role are:

1. Initializing Scanner: generate a token, and make the scanner.c0_ point to the first character of the js code that will be compiled.

2. DoParserProgram: take out a token to generate an AST node, then insert the node into the AST tree.

void Parser::ParseProgram(Isolate* isolate, Handle<Script> script,
                          ParseInfo* info,
                          MaybeHandle<ScopeInfo> maybe_outer_scope_info) {
  DCHECK_EQ(script->id(), flags().script_id());

  // It's OK to use the Isolate & counters here, since this function is only
  // called in the main thread.
  DCHECK(parsing_on_main_thread_);
  RCS_SCOPE(runtime_call_stats_, flags().is_eval()
                                     ? RuntimeCallCounterId::kParseEval
                                     : RuntimeCallCounterId::kParseProgram);
  TRACE_EVENT0(TRACE_DISABLED_BY_DEFAULT("v8.compile"), "V8.ParseProgram");
  base::ElapsedTimer timer;
  if (V8_UNLIKELY(FLAG_log_function_events)) timer.Start();

  // Initialize parser state.
  DeserializeScopeChain(isolate, info, maybe_outer_scope_info,
                        Scope::DeserializationMode::kIncludingVariables);

  DCHECK_EQ(script->is_wrapped(), info->is_wrapped_as_function());
  if (script->is_wrapped()) {
    maybe_wrapped_arguments_ = handle(script->wrapped_arguments(), isolate);
  }

  scanner_.Initialize();
  FunctionLiteral* result = DoParseProgram(isolate, info);
  MaybeResetCharacterStream(info, result);
  MaybeProcessSourceRanges(info, result, stack_limit_);
  PostProcessParseResult(isolate, info, result);

  HandleSourceURLComments(isolate, script);

  if (V8_UNLIKELY(FLAG_log_function_events) && result != nullptr) {
    double ms = timer.Elapsed().InMillisecondsF();
    const char* event_name = "parse-eval";
    int start = -1;
    int end = -1;
    if (!flags().is_eval()) {
      event_name = "parse-script";
      start = 0;
      end = String::cast(script->source()).length();
    }
    LOG(isolate,
        FunctionEvent(event_name, flags().script_id(), ms, start, end, "", 0));
  }
}

The Parser::ParseProgram is the closest way to debug JS compiler source code, after debugging deep, you will see the compiler’s full workflow. By the way, the Parser::ParseProgram is surrounded by many outer callers, these callers are just doing some trivial but unimportant things, so to see the compilation workflow, the Parser::ParseProgram is enough, debug it!

You also need to know: Before you step into the Parser::ParseProgram, Scanner has generated a token for you, in the initialization mentioned above.

2. Scanner Workflow

Lexical analysis is the first part of the compiler. In V8, the function Scanner is responsible for implementing the lexical analysis. The scanner reads the JS source stream and generates tokens.

The JS stream is encoded in UTF-16, and the source code is as follows:

// ---------------------------------------------------------------------
// Buffered stream of UTF-16 code units, using an internal UTF-16 buffer.
// A code unit is a 16 bit value representing either a 16 bit code point
// or one part of a surrogate pair that make a single 21 bit code point.
class Utf16CharacterStream {
 public:
  static constexpr base::uc32 kEndOfInput = static_cast<base::uc32>(-1);

  virtual ~Utf16CharacterStream() = default;

  V8_INLINE void set_parser_error() {
    buffer_cursor_ = buffer_end_;
    has_parser_error_ = true;
  }
  V8_INLINE void reset_parser_error_flag() { has_parser_error_ = false; }
  V8_INLINE bool has_parser_error() const { return has_parser_error_; }

  inline base::uc32 Peek() {
    if (V8_LIKELY(buffer_cursor_ < buffer_end_)) {
      return static_cast<base::uc32>(*buffer_cursor_);
    } else if (ReadBlockChecked()) {
      return static_cast<base::uc32>(*buffer_cursor_);
    } else {
      return kEndOfInput;
    }
  }
  // Returns and advances past the next UTF-16 code unit in the input
  // stream. If there are no more code units it returns kEndOfInput.
  inline base::uc32 Advance() {
    base::uc32 result = Peek();
    buffer_cursor_++;
    return result;
  }

//...............
//omit.............
//..............
  // Read more data, and update buffer_*_ to point to it.
  // Returns true if more data was available.
  //
  // ReadBlock() may modify any of the buffer_*_ members, but must sure that
  // the result of pos() remains unaffected.
  //
  // Examples:
  // - a stream could either fill a separate buffer. Then buffer_start_ and
  //   buffer_cursor_ would point to the beginning of the buffer, and
  //   buffer_pos would be the old pos().
  // - a stream with existing buffer chunks would set buffer_start_ and
  //   buffer_end_ to cover the full chunk, and then buffer_cursor_ would
  //   point into the middle of the buffer, while buffer_pos_ would describe
  //   the start of the buffer.
  virtual bool ReadBlock() = 0;

  const uint16_t* buffer_start_;
  const uint16_t* buffer_cursor_;
  const uint16_t* buffer_end_;
  size_t buffer_pos_;
  RuntimeCallStats* runtime_call_stats_;
  bool has_parser_error_ = false;
};

There are several important members in the above code: buffer_start_, buffer_cursor_, buffer_end_, see the comments for their roles. The figure 2 is an example to show Utf16CharacterStream.

The left side of Figure 2 is the JS source code, and the right side is the corresponding string stream.

1.  bool ReadBlock() final {
2.    size_t position = pos();
3.    buffer_pos_ = position;
4.    buffer_start_ = &buffer_[0];
5.    buffer_cursor_ = buffer_start_;
6.    DisallowGarbageCollection no_gc;
7.    Range<uint8_t> range =
8.        byte_stream_.GetDataAt(position, runtime_call_stats(), &no_gc);
9.    if (range.length() == 0) {
10.      buffer_end_ = buffer_start_;
11.      return false;
12.    }
13.    size_t length = std::min({kBufferSize, range.length()});
14.    i::CopyChars(buffer_, range.start, length);
15.    buffer_end_ = &buffer_[length];
16.    return true;
17.  }

Line 3 of the above code makes buffer_start_ point to the first character of the JS stream. The scanner reads every character in the JS stream by moving the buffer_start_ pointer.

Figure 3 shows another important function — Advance(), which it reads each character by buffer_start_++.

The aforementioned scanner.c0_, it always points to the first character in the string used to generate the token. The following code shows the role of c0_.

void Advance() {
    if (capture_raw) {
      AddRawLiteralChar(c0_);
    }
    c0_ = source_->Advance();//Here!!
  }

In Scanner, we can see the following method, which is used to generate token.

1.  void Scanner::Scan(TokenDesc* next_desc) {
2.    DCHECK_EQ(next_desc, &next());
3.    next_desc->token = ScanSingleToken();
4.    DCHECK_IMPLIES(has_parser_error(), next_desc->token == Token::ILLEGAL);
5.    next_desc->location.end_pos = source_pos();
6.  #ifdef DEBUG
7.    SanityCheckTokenDesc(current());
8.    SanityCheckTokenDesc(next());
9.    SanityCheckTokenDesc(next_next());
10.  #endif
11.  }

Line 23 in the above code tells us that the next_desc is the data structure describing the token, and it is bellow:

struct TokenDesc {
    Location location = {0, 0};
    LiteralBuffer literal_chars;
    LiteralBuffer raw_literal_chars;
    Token::Value token = Token::UNINITIALIZED;
    MessageTemplate invalid_template_escape_message = MessageTemplate::kNone;
    Location invalid_template_escape_location;
    uint32_t smi_value_ = 0;
    bool after_line_terminator = false;

#ifdef DEBUG
    bool CanAccessLiteral() const {
      return token == Token::PRIVATE_NAME || token == Token::ILLEGAL ||
             token == Token::ESCAPED_KEYWORD || token == Token::UNINITIALIZED ||
             token == Token::REGEXP_LITERAL ||
             base::IsInRange(token, Token::NUMBER, Token::STRING) ||
             Token::IsAnyIdentifier(token) || Token::IsKeyword(token) ||
             base::IsInRange(token, Token::TEMPLATE_SPAN, Token::TEMPLATE_TAIL);
    }
    bool CanAccessRawLiteral() const {
      return token == Token::ILLEGAL || token == Token::UNINITIALIZED ||
             base::IsInRange(token, Token::TEMPLATE_SPAN, Token::TEMPLATE_TAIL);
    }
#endif  // DEBUG
  };

So far, the Scanner initialization is completed, and then the Parser calls the scanner to generate the token. The Scan() is the entrance of the Scanner, it is below.

void Scanner::Scan(TokenDesc* next_desc) {
  DCHECK_EQ(next_desc, &next());

  next_desc->token = ScanSingleToken();
  DCHECK_IMPLIES(has_parser_error(), next_desc->token == Token::ILLEGAL);
  next_desc->location.end_pos = source_pos();

#ifdef DEBUG
  SanityCheckTokenDesc(current());
  SanityCheckTokenDesc(next());
  SanityCheckTokenDesc(next_next());
#endif
}

3. Token Generation

In Figure 4, the getNextToken takes a token from the cache and sends it to Parser, which generates an AST node.

1.  void ParserBase<Impl>::ParseStatementList(StatementListT* body,
2.                                            Token::Value end_token) {
3.    // StatementList ::
4.    //   (StatementListItem)* <end_token>
5.    DCHECK_NOT_NULL(body);
6.    while (peek() == Token::STRING) {
7.      bool use_strict = false;
8.  #if V8_ENABLE_WEBASSEMBLY
9.      bool use_asm = false;
10.  #endif  // V8_ENABLE_WEBASSEMBLY
11.      Scanner::Location token_loc = scanner()->peek_location();
12.      if (scanner()->NextLiteralExactlyEquals("use strict")) {
13.        use_strict = true;
14.  #if V8_ENABLE_WEBASSEMBLY
15.      } else if (scanner()->NextLiteralExactlyEquals("use asm")) {
16.        use_asm = true;
17.  #endif  // V8_ENABLE_WEBASSEMBLY
18.      }
19.      StatementT stat = ParseStatementListItem();
20.  //.......................
21.  //omit
22.  //.......................
23.    }
24.    while (peek() != end_token) {
25.      StatementT stat = ParseStatementListItem();
26.      if (impl()->IsNull(stat)) return;
27.      if (stat->IsEmptyStatement()) continue;
28.      body->Add(stat);
29.    }
30.  }
31.  }

In the above code, limited by the test case, our debug cannot cover full code. The 25th line of code is very important, it is responsible for parsing every JS statement, Figure 5 shows its function call stack.

In Figure 5, you can also see the aforementioned cache, which is implemented by the Consum(). The code of token generation is bellow.

V8_INLINE Token::Value Scanner::ScanSingleToken() {
  Token::Value token;
  do {
    next().location.beg_pos = source_pos();

    if (V8_LIKELY(static_cast<unsigned>(c0_) <= kMaxAscii)) {
      token = one_char_tokens[c0_];

      switch (token) {
//..............
//omit
//..............

        case Token::CONDITIONAL:
          // ? ?. ?? ??=
          Advance();
          if (c0_ == '.') {
            Advance();
            if (!IsDecimalDigit(c0_)) return Token::QUESTION_PERIOD;
            PushBack('.');
          } else if (c0_ == '?') {
            return Select('=', Token::ASSIGN_NULLISH, Token::NULLISH);
          }
          return Token::CONDITIONAL;
        case Token::STRING:
          return ScanString();
        case Token::LT:
          // < <= << <<= <!--
          Advance();
          if (c0_ == '=') return Select(Token::LTE);
          if (c0_ == '<') return Select('=', Token::ASSIGN_SHL, Token::SHL);
          if (c0_ == '!') {
            token = ScanHtmlComment();
            continue;
          }
          return Token::LT;
        case Token::ASSIGN:
          // = == === =>
          Advance();
          if (c0_ == '=') return Select('=', Token::EQ_STRICT, Token::EQ);
          if (c0_ == '>') return Select(Token::ARROW);
          return Token::ASSIGN;

        case Token::NOT:
          // ! != !==
          Advance();
          if (c0_ == '=') return Select('=', Token::NE_STRICT, Token::NE);
          return Token::NOT;
        case Token::ADD:
          // + ++ +=
          Advance();
          if (c0_ == '+') return Select(Token::INC);
          if (c0_ == '=') return Select(Token::ASSIGN_ADD);
          return Token::ADD;
//..............
//omit
//..............
        default:
          UNREACHABLE();
      }
    }
    if (IsIdentifierStart(c0_) ||
        (CombineSurrogatePair() && IsIdentifierStart(c0_))) {
      return ScanIdentifierOrKeyword();
    }
    if (c0_ == kEndOfInput) {
      return source_->has_parser_error() ? Token::ILLEGAL : Token::EOS;
    }
    token = SkipWhiteSpace();
    // Continue scanning for tokens as long as we're just skipping whitespace.
  } while (token == Token::WHITESPACE);

  return token;
}

We can see that the process of token generation is switch case, which is a finite state automaton. The process of token generation also uses a lot of regular expressions.

In fact, there are many token templates(TOKEN_LIST) in V8, and V8 uses these templates to match characters for generating tokens.

Okay, that wraps it up for this share. I’ll see you guys next time, take care!

My blog is cncyclops.com. Please reach out to me if you have any issues.

WeChat: qq9123013 Email: v8blink@outlook.com