DEV Community

Will BL

Let's write a tiny JSON parser in Kotlin! Part 4: Strings

Today we're going to learn how to parse strings in JSON. This is more complicated than the datatypes we've done so far, but don't worry, it's not bad :)

Recap

Let's recall our string definition from part 0:

A string is a sequence of characters enclosed within double quotes ("). Any character can appear within a string, except for the following, which must be escaped:

  • double quotes (") [Escaped with \"]
  • backslash (\) [Escaped with \\]
  • backspace [Escaped with \b]
  • form feed [Escaped with \f]
  • line feed [Escaped with \n]
  • carriage return [Escaped with \r]
  • horizontal tab [Escaped with \t]

In addition, the escape code \/ resolves to a forward slash (/), and a backslash followed by u followed by four hex digits resolves to the character at the Unicode codepoint specified by said hex digits.
[Figure: a diagram showing the grammar for a JSON string]
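
The single-character escapes listed above can also be written down as a lookup table. This is only a sketch to summarize the rules; the name simpleEscapes is hypothetical and the parser in this post does not use it (the \u escape is left out because it needs four more characters):

```kotlin
// Hypothetical table: the character after the backslash -> the character it stands for.
val simpleEscapes: Map<Char, Char> = mapOf(
    '"' to '"',
    '\\' to '\\',
    '/' to '/',
    'b' to '\b',
    'f' to '\u000C', // form feed; Kotlin has no '\f' literal
    'n' to '\n',
    'r' to '\r',
    't' to '\t'
)
```

Looking up a character that isn't a valid escape, such as 'x', returns null.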

Writing it

First, let's set up the skeleton of our method:

fun readString(): String? {
    val oldCursor = cursor
    val result = StringBuilder()
}

The quotes

A string will always start with a quote, so if we don't see one at the start, we can fail:

    if (step() != '"') {
        cursor = oldCursor
        return null
    }

The loop

Next, we'll loop over the characters of the string. The loop condition checks that there is a next character and that it isn't a double quote (since that would mark the end of the string). Inside the loop, we store the current character in a variable so we can do our logic on it.

    while (hasNext() && peek() != '"') {
        val char = step()
    }

Escaped characters

First, we need to check for escaped characters. These always start with a \ and are always followed by at least one other character, so let's check for a backslash and then pattern-match on the next character:

        if (char == '\\') { // just a single backslash, written as a double backslash to escape it
            when (val it = step()) {

Now we can check for each valid following character, and add that to our string:

                '"' -> result.append('"')
                '\\' -> result.append('\\')
                '/' -> result.append('/')
                'b' -> result.append('\b')
                'f' -> result.append(0x0C.toChar())
                'n' -> result.append('\n')
                'r' -> result.append('\r')
                't' -> result.append('\t')
                'u' -> {
                    val hexChar = readHexChar()
                    if (hexChar == null) {
                        cursor = oldCursor
                        return null
                    }
                    result.append(hexChar)
                }
                else -> {
                    cursor = oldCursor
                    return null
                }

A few things of interest here:

  • Kotlin does not support \f for form feeds, so we have to use the raw ASCII value.
  • We've put the \uXXXX-reading logic into a new function, readHexChar, which we'll write in a second. It returns null on failure, so we check its result before appending; appending a null directly would add the text "null" to our string.
  • The else branch fails (after resetting the cursor) as there are no other valid characters after a backslash. If you want to be slightly less spec-conforming, you could use else -> result.append(it)
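
One detail worth seeing in isolation is what happens when a nullable Char, such as the result of readHexChar, is passed straight to StringBuilder.append. In this small sketch (appendNullable is a hypothetical name used only for illustration), a null is not rejected; it is appended as text:

```kotlin
// Hypothetical helper, for illustration only.
fun appendNullable(c: Char?): String {
    val sb = StringBuilder("x")
    // With a Char? argument this resolves to append(Any?), so null becomes the text "null".
    sb.append(c)
    return sb.toString()
}
```

So appendNullable('A') gives "xA", while appendNullable(null) gives "xnull". That is why a failed escape should be checked for before anything is appended.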

readHexChar()

Let's make our readHexChar method:

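private fun readHexChar(): Char? {
    val oldCursor = cursor
    val digits = StringBuilder()
    // A \u escape is always followed by exactly four hex digits
    repeat(4) {
        if (!hasNext() || !isHexDigit(peek())) {
            cursor = oldCursor
            return null
        }
        digits.append(step())
    }
    return digits.toString().toInt(16).toChar()
}

This is fairly simple: we read exactly four characters, failing if we run out of input or if our isHexDigit function rejects one of them, then parse the resulting hex string and convert it to a Char. Reading exactly four matters, because any hex digits after the fourth belong to the rest of the string. Simple!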
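
The conversion step can be tried on its own with the sketch below. hexToChar is a hypothetical helper that works on an already-extracted string rather than on the parser's cursor, and it assumes a \u escape must have exactly four hex digits:

```kotlin
// Hypothetical helper: turns four hex digits into the Char at that code point.
fun hexToChar(digits: String): Char? {
    if (digits.length != 4) return null
    // Check the digits ourselves so that a leading sign is never accepted.
    val allHex = digits.all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }
    if (!allHex) return null
    return digits.toInt(16).toChar()
}
```

Here hexToChar("0041") gives 'A', and hexToChar("12") gives null because fewer than four digits were supplied.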

The other characters

The other branch of our if, for non-escaped characters, is nice and simple:

        } else {
            if (char >= 32.toChar()) {
                result.append(char)
            } else {
                cursor = oldCursor
                return null
            }
        }

Besides the quote and backslash handled above, the characters disallowed in JSON strings are the first 32 ASCII codepoints, the control characters. If we encounter one of those unescaped, we can fail. Otherwise, we add the character to our string.
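
The rule for unescaped characters can be expressed as a small predicate. This is a sketch; isAllowedUnescaped is a hypothetical name and is not part of the parser in this post:

```kotlin
// Hypothetical helper: true if the character may appear in a JSON string without escaping.
fun isAllowedUnescaped(c: Char): Boolean =
    c >= ' ' && c != '"' && c != '\\'
```

A space (codepoint 32) is the first character that passes, so isAllowedUnescaped(' ') is true while isAllowedUnescaped('\n') is false.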

Leaving the loop

    }
    if (!hasNext()) {
        cursor = oldCursor
        return null
    }
    skip()
    return result.toString()

If the loop ended because we ran out of input, the string was never closed, so we fail. Otherwise, we skip the last character (since it's a double quote) and return our final string.

Conclusion

Here's our final code:

    fun readString(): String? {
        val oldCursor = cursor
        val result = StringBuilder()
        if (step() != '"') {
            cursor = oldCursor
            return null
        }

        while (hasNext() && peek() != '"') {
            val char = step()
            if (char == '\\') {
                when (val it = step()) {
                    '"' -> result.append('"')
                    '\\' -> result.append('\\')
                    '/' -> result.append('/')
                    'b' -> result.append('\b')
                    'f' -> result.append(0x0C.toChar())
                    'n' -> result.append('\n')
                    'r' -> result.append('\r')
                    't' -> result.append('\t')
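                    'u' -> {
                        val hexChar = readHexChar()
                        if (hexChar == null) {
                            cursor = oldCursor
                            return null
                        }
                        result.append(hexChar)
                    }
                    else -> {
                        cursor = oldCursor
                        return null
                    }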
                }
            } else {
                if (char >= 32.toChar()) {
                    result.append(char)
                } else {
                    cursor = oldCursor
                    return null
                }
            }
        }
        if (!hasNext()) { // ran out of input before the closing quote
            cursor = oldCursor
            return null
        }
        skip()
        return result.toString()
    }
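
To try the function outside the full parser, here is a self-contained sketch. The class name TinyStringParser and its cursor helpers (hasNext, peek, step, skip, isHexDigit) are assumptions standing in for the versions from earlier parts of this series, which may differ in detail. The sketch also rejects unterminated strings, reads exactly four hex digits after \u, and routes every failure through one place so the cursor is always rewound:

```kotlin
// Sketch only: the helpers below are assumed stand-ins for the ones from earlier parts.
class TinyStringParser(private val input: String) {
    var cursor = 0

    fun hasNext(): Boolean = cursor < input.length
    fun peek(): Char = input[cursor]
    fun step(): Char = input[cursor++]
    fun skip() { cursor++ }
    fun isHexDigit(c: Char): Boolean = c in '0'..'9' || c in 'a'..'f' || c in 'A'..'F'

    // Reads exactly four hex digits and converts them to a Char, or returns null.
    private fun readHexChar(): Char? {
        val digits = StringBuilder()
        repeat(4) {
            if (!hasNext() || !isHexDigit(peek())) return null
            digits.append(step())
        }
        return digits.toString().toInt(16).toChar()
    }

    // Public entry point: rewinds the cursor whenever parsing fails.
    fun readString(): String? {
        val oldCursor = cursor
        val parsed = parse()
        if (parsed == null) cursor = oldCursor
        return parsed
    }

    private fun parse(): String? {
        if (!hasNext() || step() != '"') return null
        val result = StringBuilder()
        while (hasNext() && peek() != '"') {
            val char = step()
            if (char == '\\') {
                if (!hasNext()) return null // input ended right after the backslash
                when (step()) {
                    '"' -> result.append('"')
                    '\\' -> result.append('\\')
                    '/' -> result.append('/')
                    'b' -> result.append('\b')
                    'f' -> result.append('\u000C')
                    'n' -> result.append('\n')
                    'r' -> result.append('\r')
                    't' -> result.append('\t')
                    'u' -> result.append(readHexChar() ?: return null)
                    else -> return null
                }
            } else if (char >= ' ') {
                result.append(char)
            } else {
                return null // unescaped control character
            }
        }
        if (!hasNext()) return null // no closing quote
        skip()
        return result.toString()
    }
}
```

With this sketch, TinyStringParser("\"a\\tb\"").readString() yields a, a tab, then b, while an input with no closing quote yields null and leaves cursor at 0.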

Want to improve it? You could throw an exception when encountering invalid characters, rather than returning null.
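
As a rough idea of what that could look like, here is a sketch of an exception-throwing variant. Everything in it is hypothetical: the JsonStringException type, the parseStringOrThrow name, and the choice to work on a plain String with an index instead of the parser's cursor helpers. The reported position is simply where the index was when the problem was noticed:

```kotlin
// Hypothetical exception type carrying the index at which parsing gave up.
class JsonStringException(message: String, val position: Int) :
    Exception("$message at position $position")

// Parses one JSON string literal from the start of input, throwing on any error.
fun parseStringOrThrow(input: String): String {
    var i = 0
    fun fail(message: String): Nothing = throw JsonStringException(message, i)

    if (input.isEmpty() || input[i] != '"') fail("Expected opening quote")
    i++
    val result = StringBuilder()
    while (i < input.length && input[i] != '"') {
        val char = input[i++]
        if (char == '\\') {
            if (i >= input.length) fail("Unfinished escape")
            when (val escaped = input[i++]) {
                '"', '\\', '/' -> result.append(escaped)
                'b' -> result.append('\b')
                'f' -> result.append('\u000C')
                'n' -> result.append('\n')
                'r' -> result.append('\r')
                't' -> result.append('\t')
                'u' -> {
                    val hex = input.substring(i, minOf(i + 4, input.length))
                    val valid = hex.length == 4 &&
                        hex.all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }
                    if (!valid) fail("Invalid \\u escape")
                    result.append(hex.toInt(16).toChar())
                    i += 4
                }
                else -> fail("Invalid escape character '$escaped'")
            }
        } else if (char < ' ') {
            fail("Unescaped control character")
        } else {
            result.append(char)
        }
    }
    if (i >= input.length) fail("Missing closing quote")
    return result.toString()
}
```

The advantage over returning null is that the caller learns what went wrong and roughly where, at the cost of needing a try/catch around the call.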
