Michael Kohl for AppSignal

Originally published at blog.appsignal.com

Ruby's Hidden Gems, StringScanner

Ruby is not only a fun language, it also comes with an excellent standard library, parts of which are not that well known and are almost hidden gems. Today guest writer Michael Kohl highlights a favorite: StringScanner.

One can get quite far without installing third-party gems: the standard library covers everything from data structures like OpenStruct and Set to CSV parsing and benchmarking. However, Ruby's standard installation also ships some less well-known libraries that can be very useful. One of them is StringScanner, which according to the documentation "provides lexical scanning operations on a string".

Scanning and parsing

So what does "lexical scanning" mean exactly? Essentially it describes the process of taking an input string and extracting meaningful bits of information from it, following certain rules. For example, this can be seen at the first stage of a compiler which takes an expression like 2 + 1 as input and turns it into the following sequence of tokens:

[{ number: "2" }, { operator: "+" }, { number: "1" }]

Lexical scanners are usually implemented as finite-state automata and there are several well-known tools available that can generate them for us (e.g. ANTLR or Ragel).

However, sometimes our parsing needs aren't that elaborate, and a simpler library like the regular-expression-based StringScanner comes in very handy. It works by remembering the location of a so-called scan pointer, which is nothing more than an index into the string. The scanning process then tries to match the text right after the scan pointer against the provided expression. Apart from matching operations, StringScanner also provides methods for moving the scan pointer (forwards or backwards through the string), looking ahead (seeing what's next without moving the scan pointer just yet), and finding out where in the string we currently are (at the beginning or end of a line, or of the entire string).
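
To make the idea concrete, here is a minimal sketch of a tokenizer for expressions like 2 + 1. The tokenize method and its token format are illustrative assumptions and not part of StringScanner; only skip, scan, pos and eos? come from the library.

```ruby
require 'strscan'

# Hypothetical helper: turns a simple arithmetic expression into tokens.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []

  until scanner.eos?
    if scanner.skip(/\s+/)
      # Whitespace carries no meaning here, so just move the scan pointer on.
      next
    elsif (number = scanner.scan(/\d+/))
      tokens << { number: number }
    elsif (operator = scanner.scan(/[-+*\/]/))
      tokens << { operator: operator }
    else
      raise ArgumentError, "unexpected character at position #{scanner.pos}"
    end
  end

  tokens
end

tokens = tokenize("2 + 1")
# tokens is now [{ number: "2" }, { operator: "+" }, { number: "1" }]
```

Every branch tries a pattern at the current scan pointer; whichever one matches consumes its text and moves the pointer forward, so the loop ends once the input is used up.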

Parsing Rails Logs

Enough theory, let's see StringScanner in action. The following example will take a Rails log entry like the one below,

log_entry = <<EOS
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
Processing by HomeController#index as HTML
  Rendered text template within layouts/application (0.0ms)
  Rendered layouts/_assets.html.erb (2.0ms)
  Rendered layouts/_top.html.erb (2.6ms)
  Rendered layouts/_about.html.erb (0.3ms)
  Rendered layouts/_google_analytics.html.erb (0.4ms)
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
EOS

and parse it into the following hash:

{
  method: "GET",
  path: "/",
  ip: "127.0.0.1",
  timestamp: "2017-08-20 20:53:10 +0900",
  success: true,
  response_code: "200",
  duration: "79ms",
}

NB: While this makes for a good StringScanner example, a real application would be better off using Lograge and its JSON log formatter.

In order to use StringScanner we first need to require it:

require 'strscan'

After this we can initialize a new instance by passing the log entry as an argument to the constructor. At the same time we'll also define an empty hash to hold the result of our parsing efforts:

scanner = StringScanner.new(log_entry)
log = {}

We can now use the scanner's pos method to get the current location of our scan pointer. As expected, the result is 0, the index of the first character of the string:

scanner.pos #=> 0

Let's visualize this so the process will be easier to follow along:

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)


For further introspection of the scanner's state we can use beginning_of_line? and eos? to confirm that the scan pointer currently is at the beginning of a line and that we have not yet fully consumed our input:

scanner.beginning_of_line? #=> true
scanner.eos? #=> false

The first bit of information we want to extract is the HTTP request method, which can be found right after the word "Started" followed by a space. We can use the scanner's appropriately named skip method to advance the scan pointer. It returns the number of characters it skipped, which in our case is 8. Additionally we can use matched? to confirm that everything worked as expected:

scanner.skip(/Started /) #=> 8
scanner.matched? #=> true
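
It is worth knowing what happens when skip does not match. The following sketch uses a shortened copy of the log line (an assumption made to keep the example self-contained) and shows that a pattern has to match exactly at the scan pointer; otherwise skip returns nil and the pointer stays where it was.

```ruby
require 'strscan'

scanner = StringScanner.new('Started GET "/" for 127.0.0.1')

# "GET" occurs later in the string, but not at the scan pointer,
# so the match fails and nothing is consumed.
missed         = scanner.skip(/GET/)  #=> nil
matched_flag   = scanner.matched?     #=> false
pos_after_miss = scanner.pos          #=> 0

# This pattern does match at the scan pointer.
skipped       = scanner.skip(/Started /) #=> 8
pos_after_hit = scanner.pos              #=> 8
```

Checking for nil (or calling matched?) after each step is a cheap way to notice early that a log line does not have the expected shape.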

The scan pointer is now right before the request method:

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
       ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)

Now we can use scan_until to extract the actual value. It returns everything from the scan pointer up to and including the match, which here is just the match itself. Since the request method is all uppercase, a simple character class and the + operator, which matches one or more characters, will do:

log[:method] = scanner.scan_until(/[A-Z]+/) #=> "GET"
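
The difference between scan and scan_until only shows up when the match does not start right at the scan pointer. A small sketch with made-up input strings (they are assumptions, not part of the log entry):

```ruby
require 'strscan'

# The match starts exactly at the scan pointer: scan is enough.
exact = StringScanner.new('GET "/"')
exact_scan = exact.scan(/[A-Z]+/)            #=> "GET"

# Here three other characters precede the request method.
padded = StringScanner.new('-> GET "/"')
padded_scan  = padded.scan(/[A-Z]+/)         #=> nil, scan only matches at the pointer
padded_until = padded.scan_until(/[A-Z]+/)   #=> "-> GET", includes the skipped text
padded_match = padded.matched                #=> "GET", only the matched part
```

In the log example the pointer already sits on the "G", so scan and scan_until would both return "GET".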

After extracting the request method, the scan pointer will be at the final "T" of the word "GET":

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
          ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)

To extract the requested path, we will therefore need to skip one space and then extract everything enclosed in double quotes. There are several ways to achieve this. One of them is a capture group, the part of the regular expression enclosed in parentheses, here (.+), which matches one or more of any character:

scanner.scan(/\s"(.+)"/) #=> " \"/\""

However, we will not be using the return value of this scan operation directly. Instead we use captures to get the value of the first capture group:

log[:path] = scanner.captures.first #=> "/"
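
One caveat: .+ is greedy, so it runs up to the last double quote on the line. That is harmless for our log entry, which contains only one quoted value, but the hypothetical line below (not from the article) shows the difference. As a sketch, a negated character class stops at the first closing quote, and a named group makes the lookup self-describing:

```ruby
require 'strscan'

# Made-up input with two quoted values on the same line.
line = ' "/search" for 127.0.0.1 ref "home"'

greedy = StringScanner.new(line)
greedy.scan(/\s"(.+)"/)
greedy_path = greedy[1]       #=> "/search\" for 127.0.0.1 ref \"home"

strict = StringScanner.new(line)
strict.scan(/\s"(?<path>[^"]+)"/)
strict_path = strict[:path]   #=> "/search"
```

Both scanner[1] and scanner[:path] read from the most recent match, just like captures does.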

We successfully extracted the path and the scan pointer is now at the closing double quote:

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
              ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)

To parse the IP address from the log, we once again use skip to ignore the string "for" surrounded by spaces and then use scan_until to match one or more non-whitespace characters (\s is the character class representing whitespace and [^\s] is its negation):

scanner.skip(/ for /) #=> 5
log[:ip] = scanner.scan_until(/[^\s]+/) #=> "127.0.0.1"

Can you tell where the scan pointer will be now? Think about it for a moment and then compare your answer to the solution:

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
                            ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)

Parsing the timestamp should feel very familiar by now. First we use trusty old skip to ignore the literal string " at " and then use scan_until to read until the end of the current line, which is represented by $ in regular expressions:

scanner.skip(/ at /) #=> 4
log[:timestamp] = scanner.scan_until(/$/) #=> "2017-08-20 20:53:10 +0900"
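
Note that $ matches just before the newline, so the newline itself is not consumed. A short sketch on a made-up two-line string (an assumption for illustration) makes the pointer position visible:

```ruby
require 'strscan'

scanner = StringScanner.new("first line\nsecond line\n")

line       = scanner.scan_until(/$/)       #=> "first line"
next_char  = scanner.peek(1)               #=> "\n", the newline is still ahead
bol_before = scanner.beginning_of_line?    #=> false

scanner.skip(/\n/)                         # consume the newline explicitly
bol_after  = scanner.beginning_of_line?    #=> true
```

In the log example this does not matter, because the next step searches forward for "Completed " anyway.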

The next piece of information we're interested in is the HTTP status code on the last line, so we'll use skip_until to take us all the way to the space after the word "Completed".

scanner.skip_until(/Completed /) #=> 296

As the name suggests, this works similarly to scan_until, but instead of returning the consumed text it returns the number of characters it skipped over. This puts the scan pointer right in front of the HTTP status code we're interested in.

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
         ^

Now before we scan the actual HTTP response code, wouldn't it be nice if we could tell if the HTTP response code denotes a success (for the sake of this example any code in the 2xx range) or failure (all other ranges)? To achieve this we will make use of peek to look at the next character, without actually moving the scan pointer.

log[:success] = scanner.peek(1) == "2" #=> true
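
peek takes a number of characters. When a regular expression is more convenient for looking ahead, check can be used instead: it returns the match like scan does, but leaves the scan pointer alone. A minimal sketch on a made-up status line (an assumption, not from the log entry):

```ruby
require 'strscan'

scanner = StringScanner.new("404 Not Found")

first_digit = scanner.peek(1)         #=> "4"
code        = scanner.check(/\d{3}/)  #=> "404"
position    = scanner.pos             #=> 0, nothing has been consumed
success     = code.start_with?("2")   #=> false
```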

Now we can use scan to read the next three characters, represented by the regular expression /\d{3}/:

log[:response_code] = scanner.scan(/\d{3}/) #=> "200"

Once again the scan pointer will be right at the end of the previously matched regular expression:

Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
            ^

The last bit of information we want to extract from our log entry is the execution time in milliseconds, which can be achieved by skipping over the string " OK in " and then reading everything up to and including the literal string "ms".

scanner.skip(/ OK in /) #=> 7
log[:duration] = scanner.scan_until(/ms/) #=> "79ms"

And with that last bit in there, we have the hash we wanted.

{
  method: "GET",
  path: "/",
  ip: "127.0.0.1",
  timestamp: "2017-08-20 20:53:10 +0900",
  success: true,
  response_code: "200",
  duration: "79ms",
}
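
For reference, the individual steps can be collected into one method. This is only a sketch: the name parse_log_entry is an assumption, there is no error handling, and it assumes a Ruby version recent enough to provide StringScanner#captures.

```ruby
require 'strscan'

# Hypothetical wrapper around the scanner calls used in this article.
# It expects a log entry shaped exactly like the example and does not
# handle lines that fail to match.
def parse_log_entry(log_entry)
  scanner = StringScanner.new(log_entry)
  log = {}

  # First line: method, path, IP address and timestamp.
  scanner.skip(/Started /)
  log[:method] = scanner.scan_until(/[A-Z]+/)
  scanner.scan(/\s"(.+)"/)
  log[:path] = scanner.captures.first
  scanner.skip(/ for /)
  log[:ip] = scanner.scan_until(/[^\s]+/)
  scanner.skip(/ at /)
  log[:timestamp] = scanner.scan_until(/$/)

  # Last line: status code and total duration.
  scanner.skip_until(/Completed /)
  log[:success] = scanner.peek(1) == "2"
  log[:response_code] = scanner.scan(/\d{3}/)
  scanner.skip(/ OK in /)
  log[:duration] = scanner.scan_until(/ms/)

  log
end

log_entry = <<EOS
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
Processing by HomeController#index as HTML
  Rendered text template within layouts/application (0.0ms)
  Rendered layouts/_assets.html.erb (2.0ms)
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
EOS

result = parse_log_entry(log_entry)
```

Because the pattern / OK in / is hard-coded, this version only extracts the duration for 200 responses; other status texts would need a more general pattern.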

Summary

Ruby's StringScanner occupies a nice middle ground between simple regular expressions and a full-blown lexer. It isn't the best choice for complex scanning and parsing needs, but its straightforward nature makes it easy for anyone with basic regular expression knowledge to extract information from input strings, and I've used it successfully in production code in the past. We hope you'll discover this hidden Gem.

Michael Kohl’s love affair with Ruby started around 2003. He also enjoys writing and speaking about the language and co-organizes Bangkok.rb and RubyConf Thailand.

Discussion (1)

Dan

Cheers, nice breakdown and actually exactly what I need as I'm trying to write my first git hook!