Sugawara Yuuta

Posted on May 2, 2023

Journey to the Fastest Streaming JSON Decoder In Go

#opensource #go #performance #showdev

Introduction

This is a rewrite of the Medium article, less promotional and more developer-oriented.

Hello, I'm Sugawara Yuuta, the developer of a new, fast, compatible JSON library in Go, Sonnet.

I challenged myself to see how fast I could make my own JSON library if I used all my knowledge over a period of about six months. I gained a lot, and I would like to share it with you here.

First, let me note how far that has been accomplished.

Benchmarks

The benchmark code is borrowed from goccy's repository (benchmarks/medium_payload.go).

It is fast. At least this is the result in my environment. I recommend that you measure it yourself if you are interested.

But how? So let's talk about the design of this.

The design

Reader

It is probably the lowest layer and the first part that will be important is the reader part.

Although this part is based on pkg/json by davecheney, I still made some improvements.

If you like performance, you may have already thought of sync.Pool.
But wait a minute, this is a bit special. It is staged.
In other words, there is an array of sync.Pools, and the size of the buffer that each pool can take is determined from the beginning.

The method of finding the position of the array is not complicated; it is the length over bits of capacity divided by 1024. Below is an example.

var (
    buffers [shard]sync.Pool
)

const (
    shard = 16
    shift = 10
    Min   = 1 << shift
)

func Put(bytes []byte) {
    size := cap(bytes) >> shift
    idx := bits.Len(uint(size))
    if idx >= shard {
        return
    }
    buffers[idx].Put(&buffer{bytes: bytes[:0]})
}

This led to a reduction in memory usage as well as an increase in speed.

Small note: When coming to lower layers, e.g., hash functions, etc., division and modulo are avoided because they are slow. Instead, if the number is a power of 2, shift and bit AND can be used.

In summary: sync.Pool is your best friend!

I used the same method for my scanner as the package I just described, but I changed my method significantly along the way.

Scanner

At first I used these steps for decoding

tokenize
syntax check
actual decoding

I will talk about how embedding the top two into the bottom one has made it faster by eliminating unnecessary function calls and complexity.
The slowness of the separate tokenization depends especially on the format I was using. For example, after the { comes the } or "key" and :. But if I do it in a separate process, I can't anticipate what tokens are coming next, so I can't implement specialized processing for each one.

Taking this to the max, results in a single byte slice iteration, except in special cases such as floating-point numbers (this requires me to call strconv.ParseFloat unlike int and uint).

In summary: Make assumptions for your use case, as data is never random.

Now, on to the decoder, which is chock-full of tricks.

Decoder

If we include the fact that JSON can only have object keys that are strings, we can easily say that the faster the string is processed, the faster the entire decoding process will be.

Now consider fastpath and slowpath.

Fastpath is things that happen in most cases, and we can optimize using that premise.

In this case fastpath is a string containing only ASCII (which is most of the time). The decoding is completed by simply copying it.
The problem is how to tell whether it contains only ASCII or characters that must be decoded in a special other way. I am developing this to run fast on any platform, so assembly is not an option.

This is where SWAR comes in: SWAR is a technique for operating multiple bytes simultaneously that can be used in environments where SIMD (which requires assembly) is not available. Below is a working example:

package main

import (
 "encoding/binary"
 "fmt"
 "math/bits"
)

const mapper uint64 = 0x0101010101010101

func main() {
 u8s := []uint8("01234567")
 // Use encoding/binary for simplicity
 u64 := binary.LittleEndian.Uint64(u8s)
 // XOR a uint64 with byte '5' mapped to all bytes.
 // XOR outputs 0 if a match is found.
 u64 ^= mapper * '5'
 // Subtract uint64 where 1 is mapped to all bytes.
 // overflows if the previous result is 0.
 u64 -= mapper * 1
 // AND uint64 with 0x80 mapped to every byte
 // (scoop out the most significant bit of each uint8)
 u64 &= mapper * 0x80
 if u64 != 0 {
  // If the output is not zero, it means it was found somewhere, so,
  // divide by 8 to see how far the bits are 0.
  fmt.Println("found at:", bits.TrailingZeros64(u64)/8)
  return
 }
 fmt.Println("couldn't find the character")
}

As you can see, you can check the characters without loops.

The next most common, in my experience it is an integer (int, int32, uint64...).
The answer is clear on this one, isn't it? It is to build the converter yourself.
It is not difficult to build (because, unlike the standard library, it always uses the decimal that is, base == 10, system). as an example implementation - (by the way you need to check overflows in addition to this)

func toUint(bytes []byte) (uint64, error) {

    length := len(bytes)
    if length == 0 {
        return 0, errors.New("cannot create uint64 from an empty string")
    }

    var pos int
    var u64 uint64
    for pos != length {
        if bytes[pos] >= '0' && bytes[pos] <= '9' {
            u64 = u64*10 + uint64(bytes[pos]-'0')
            pos++
            continue
        }
        break
    }

    if pos != length {
        return 0, errors.New("failed to parse: " + string(bytes))
    }
    return u64, nil
}

In summary: Make it yourself!

As for this, it is considerably faster, as is obvious since there is no cost to convert from a byte array to a string etc.

If you benchmark at this stage, pprof will probably be filled with mapaccess (you have to look for it each time because the order of the keys in JSON does not always match the order of the keys in the structure). That was the case for me as well.
Perhaps this means it is important to optimize the lookup.

I have three recommendations for lookup optimization.

use sync/atomic.XXXPointer if it needs parallel access protection. If unsafe is not available, use sync.RWMutex. In my environment it is significantly faster than sync.Map in both cases.
If your keys are uint8 or byte, use fixed length array. Because 1<<8 (8 to the power of 2) is only 256. This allows us to reproduce the worst case O(1) only in this special case.
There are several alternatives to the hashmap used in the Go standard.

Note that there is a movement to migrate the standard map to Swiss table. Expect more in the future.

Now, completely ignoring what I just said, I am using the perfect hash function (PHF).
This is literally perfect for my use and is only available when key/value pairs are known from the start and don't change.

And related to this is the hash function, which is becoming more and more important. The important thing is to choose the right one for your application, and I use FNV1a for the following reasons.

Size that can be inlined manually
Can generate a hash while converting to lowercase since it reads one letter at a time
Faster than the hash function used inside the Go language when inlined and unrolled (under my environment)

This is roughly how I made this work. Thanks for taking a look. Have a great day.

DEV Community

Journey to the Fastest Streaming JSON Decoder In Go

Introduction

Benchmarks

The design

Reader

Scanner

Decoder

Top comments (0)

Read next

🚨💥 Top 5 Trending Open-source LLM Tools & Frameworks You Must Know About ✨🚀

Should We Rethink About IDs? A Deep Dive into "Snowflake IDs"

The use of eBPF – in Netflix, GPU infrastructure, Windows programs and more

My Hacktoberfest 2024 Journey: A Reflection on Contributor Experience