[P1] Writing a serialization library in C#: The case for Ion

#csharp #dotnet #ion #amazon

During my summer internship, I've had the chance to work with an open-source data serialization format from Amazon called Ion. Amazon describes Ion as "a richly-typed, self-describing, hierarchical format offering interchangeable binary and text representations", and currently provides libraries in Java, Python, C and JavaScript. This post is the first in a series where I ~~advertise~~ talk about an implementation of the format in C# that is (hopefully) usable on all .NET platforms.

So why would I use this Ion thing?

TL;DR: Ion offers a type-rich, compact binary format that's efficient to parse, and also supplies a readable text format to support prototyping and development.

Try out IonDotnet on GitHub. Please don't scream at my code 😂.

Amazon has a whole page about the advantages of Ion compared to other similar formats, which you can of course read if you're interested. In this post I will mainly discuss it from a developer's point of view.

If you have worked with software that involves more than one computer, you have probably worked with serialization before. It's the process of converting your data object to a byte sequence that can then be stored or sent to other processes/machines. The two most well-known serialization formats today are (of course) JSON and XML, which you have most certainly heard of.

Generally speaking, there are two kinds of serialization software nowadays. The first kind is what I'd call static serialization: you declare your model, then generate the code that serializes that model ahead of time. Google's Protocol Buffers and FlatBuffers, for example, do this. This method leads to extremely compact and efficient output: the serialized object contains almost no metadata, and in the extreme case (FlatBuffers) parsing the bytes is little more than 'casting' the memory layout into the runtime object. It comes at a cost, however: since the format is static, updating your object models means regenerating the serialization code, which might result in breaking changes for existing consumers. This is undesirable for systems that are expected to change and evolve quickly.
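To make the 'casting' idea concrete, here is a minimal sketch in plain C#. It is not how Protocol Buffers or FlatBuffers are actually implemented; the `Reading` struct and its fields are hypothetical, and the code assumes a runtime that ships `MemoryMarshal` (.NET Core 2.1 or later).

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical fixed-layout model: 4 + 8 = 12 bytes, no field names stored.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct Reading
{
    public int SensorId;
    public double Value;
}

public static class StaticLayoutDemo
{
    // "Serialize" by copying the struct's raw memory into a byte array.
    public static byte[] ToBytes(Reading reading)
    {
        var one = new[] { reading };
        return MemoryMarshal.AsBytes(one.AsSpan()).ToArray();
    }

    // "Parse" by reinterpreting the bytes as the struct; no per-field decoding.
    public static Reading FromBytes(byte[] bytes)
    {
        return MemoryMarshal.Cast<byte, Reading>(bytes.AsSpan())[0];
    }

    public static void Main()
    {
        byte[] bytes = ToBytes(new Reading { SensorId = 7, Value = 1.5 });
        Reading back = FromBytes(bytes);
        Console.WriteLine($"{bytes.Length} bytes, SensorId={back.SensorId}");
    }
}
```

The output carries nothing but the values, which is also why it is fragile: reorder or remove a field in `Reading` and every existing reader interprets the same bytes differently.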

On the other hand, we have formats like JSON that are more dynamic: the layout of the data is generated at runtime depending on what kind of data you put in. Being a text-based format, JSON is very readable, which is one reason it has become popular (besides the fact that it's native to JavaScript). That being said, even as a text format, JSON has many shortcomings.

The type system

Let's say you're writing laboratory software that manages experiments and deals with an object model like this:

enum ExperimentResult
{
    Success,
    Failure,
    Unknown
}

class Experiment
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTimeOffset StartDate { get; set; }
    public TimeSpan Duration { get; set; }
    public bool IsActive { get; set; }
    public byte[] SampleData { get; set; }
    public decimal Budget { get; set; }
    public ExperimentResult Result { get; set; }
}

var experiment = new Experiment
{
    Id = 233,
    Name = "Measure performance impact of boxing",
    Duration = TimeSpan.FromSeconds(90),
    StartDate = new DateTimeOffset(2018, 07, 21, 11, 11, 11, TimeSpan.Zero),
    IsActive = true,
    Result = ExperimentResult.Failure,
    SampleData = new byte[100],
    Budget = 12345.01234567890123456789m // decimal literal; avoids culture-dependent parsing
};

Using JSON.NET, if we call JsonConvert.SerializeObject(experiment, Formatting.Indented), we get:

{
  "Id": 233,
  "Name": "Measure performance impact of boxing",
  "StartDate": "2018-07-21T11:11:11+00:00",
  "Duration": "00:01:30",
  "IsActive": true,
  "SampleData": "2e36MMwesekp5vKCjNEZKyEi+mro6HfE6Q1UcxCwzguscpMX0PLV+qAvU7zlXth4+DyKrKUHjfB1Nka/yj7ZeBfm1ho9AlouTQDJuJW73os03HrTJiFlpOSjoZqsFTBiVtuk/g==",
  "Budget": 12345.01234567890123456789,
  "Result": 1
}

We can see right away that there are several data types that JSON serialization does not properly represent. For example, the ExperimentResult enum gives us "Result": 1, which is problematic because consumers of this data will have difficulty understanding what 1 means as an ExperimentResult. Even worse, if you update the ExperimentResult enum and add a new member before Failure, then 1 no longer means Failure. Of course, JSON.NET allows us to serialize the enum as a string:

[JsonConverter(typeof(StringEnumConverter))]
public ExperimentResult Result { get; set; }

This gives us {"Result": "Failure"}. But even then there's still a problem (apart from the ugliness of that attribute): Result is now a string, which is typically interpreted as free text rather than as one of a fixed set of identifiers.
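The ordinal problem described above can be reproduced without any serializer. In this small sketch the two enum versions and the `Inconclusive` member are hypothetical, introduced only to show what happens when a member is inserted before Failure.

```csharp
using System;

// Version 1 of the enum, as in the post.
public enum ExperimentResultV1 { Success, Failure, Unknown }

// Hypothetical version 2: a new member is inserted before Failure.
public enum ExperimentResultV2 { Success, Inconclusive, Failure, Unknown }

public static class EnumOrdinalDemo
{
    // What a numeric serializer writes for a V1 value.
    public static int Write(ExperimentResultV1 value) => (int)value;

    // What a consumer compiled against V2 reads back from that number.
    public static ExperimentResultV2 Read(int stored) => (ExperimentResultV2)stored;

    public static void Main()
    {
        int stored = Write(ExperimentResultV1.Failure); // 1
        Console.WriteLine(Read(stored));                // Inconclusive, not Failure
    }
}
```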

Another example is the TimeSpan Duration property. Here JSON.NET gives us the string representation in the format hh:mm:ss, but it's still a string. The intention to represent a time duration is lost.

The same goes for the DateTimeOffset, decimal and byte[] properties, for which JSON.NET has to find a workaround, most often by formatting them as strings (such as Base64-encoding the byte array). These workarounds often lose the meaning of the value or increase the size of the output (as with byte[]).
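The size cost of Base64 is easy to measure with the standard library alone. A quick sketch (the helper name `Base64Length` is made up for this example):

```csharp
using System;

public static class Base64OverheadDemo
{
    // Length of the Base64 text needed to carry `byteCount` raw bytes.
    public static int Base64Length(int byteCount) =>
        Convert.ToBase64String(new byte[byteCount]).Length;

    public static void Main()
    {
        // 100 raw bytes become 136 characters, before the surrounding quotes.
        Console.WriteLine(Base64Length(100)); // 136
    }
}
```

Base64 emits 4 characters for every 3 input bytes (rounded up), so the text is roughly a third larger than the data it carries.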

Ion offers a solution to that problem. First of all, it has more native types, including decimal (fit for monetary calculations), blob for byte sequences, and symbol for encoding enums. The full list of supported data types can be found here. The type system is also extensible through annotations, which I'll talk about in a future post.

The (proper) Ion text format for the above object will look something like this:

{
  Id: 233,
  Name: "Measure performance impact of boxing",
  StartDate: 2018-07-21T11:11:11+00:00,
  Duration: seconds::90, //a time duration in seconds
  IsActive: true,
  SampleData: {{ 2e36MMwesekp5vKCjNEZKyEi+mro6HfE6Q1UcxCwzguscpMX0PLV+qAvU7zlXth4+DyKrKUHjfB1Nka/yj7ZeBfm1ho9AlouTQDJuJW73os03HrTJiFlpOSjoZqsFTBiVtuk/g== }},
  Budget: 12345.01234567890123456789,
  Result: 'Failure'
}

Let's look at the above format. The byte sequence SampleData is represented as Base64 in the text format, but is copied as-is in the binary format, so no extra encoding or decoding is required when parsing binary data. Moreover, the double braces {{ }} tell us that it is a byte[], so no meaning is lost. Similarly, the enum Result is represented by a special kind of text called a symbol, which is written in single quotes '' as opposed to normal text (string), which is written in double quotes "". And yes, Ion text supports comments.

Using Ion in C#

IonDotnet is built with the goal of supporting reading and writing standard Ion data, while providing a set of APIs that is friendly to .NET developers, so using IonDotnet should feel much easier than using the Java counterpart. The following code serializes the Experiment object to the binary format as a byte[], and back:

byte[] ionBytes = IonSerialization.Serialize(experiment);
Experiment deserialized = IonSerialization.Deserialize<Experiment>(ionBytes);

At the time of writing, text serialization in IonDotnet has not been completed yet. Production systems should use the binary format for its compactness; the text format is readable and can be used to support development and prototyping.

The compactness

A piece of data can be thought of as having two components: the actual information and the metadata (how the data should be read/parsed). JSON, being a text-based format, spends a lot of space on metadata such as field names, quotation marks ("") and brackets and braces ([], {}). The numeric representation in JSON is also not optimal: a 4-byte binary integer can take up to 10 bytes when written as text (11 with a minus sign).
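The gap between binary and text integers can be checked directly. In this sketch the helper `TextBytes` is a name made up for the example; it counts the UTF-8 bytes of an int's decimal representation.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NumberSizeDemo
{
    // Number of UTF-8 bytes needed to write `value` as decimal text.
    public static int TextBytes(int value) =>
        Encoding.UTF8.GetByteCount(value.ToString(CultureInfo.InvariantCulture));

    public static void Main()
    {
        Console.WriteLine(sizeof(int));             // 4
        Console.WriteLine(TextBytes(int.MaxValue)); // 10 ("2147483647")
        Console.WriteLine(TextBytes(int.MinValue)); // 11 ("-2147483648")
    }
}
```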

Static serialization formats like Protocol Buffers remove essentially all of the metadata from the data, which makes them really compact, but also rigid and difficult to change or update. Ion seeks a balance between the two: it's more compact than JSON, but still dynamic in nature. Changing or updating your Ion format should be no harder than doing so with JSON.

Let's look at the following code, which compares the serialization size of JSON and Ion for a typical Web API (from Foursquare):


private static string GetJson(string api)
{
    using (var httpClient = new HttpClient())
    {
        var str = httpClient.GetStringAsync(api);
        str.Wait();
        return str.Result;
    }
}

// Build the URL by concatenation so no newline or indentation ends up inside it.
var jsonString = GetJson("https://api.foursquare.com/v2/venues/explore?near=NYC"
                + "&oauth_token=IRLTRG22CDJ3K2IQLQVR1EP4DP5DLHP343SQFQZJOVILQVKV&v=20180728");

var obj = JsonConvert.DeserializeObject<RootObject>(jsonString);
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonString);
byte[] ionBytes = IonSerialization.Serialize(obj);

Console.WriteLine($"JSON size: {jsonBytes.Length} bytes");
Console.WriteLine($"ION size: {ionBytes.Length} bytes");

And the output is:

JSON size: 70920 bytes
ION size: 40675 bytes

That is a saving of roughly 43% in size! And it's not even the best-case scenario. Because of the way Ion encodes data, you can save even more space by getting the two ends of the transmission to agree on a shared set of symbols up front, in which case you can get rid of all the "field names" in the payload and bring the size very close to that of schema-based serialization protocols like Protobuf. This is great for high-performance scenarios such as gaming and real-time communication.
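The idea of both ends agreeing on symbols can be illustrated with a toy example. This is only a sketch of the concept, not Ion's actual binary encoding and not the IonDotnet API; the `Symbols` table and the one-byte IDs are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SharedSymbolsDemo
{
    // Hypothetical symbol table that sender and receiver agree on in advance.
    public static readonly string[] Symbols = { "Id", "Name", "IsActive" };

    // Sender side: replace each field name with its index in the shared table.
    public static byte[] Encode(IEnumerable<string> fields) =>
        fields.Select(f =>
        {
            int id = Array.IndexOf(Symbols, f);
            if (id < 0) throw new ArgumentException($"Unknown symbol: {f}");
            return (byte)id;
        }).ToArray();

    // Receiver side: look the names back up from the same table.
    public static string[] Decode(byte[] ids) =>
        ids.Select(id => Symbols[id]).ToArray();

    public static void Main()
    {
        var fields = new[] { "Id", "Name", "IsActive" };
        int inlineBytes = fields.Sum(f => Encoding.UTF8.GetByteCount(f)); // 2 + 4 + 8
        byte[] ids = Encode(fields);
        Console.WriteLine($"names inline: {inlineBytes} bytes, as IDs: {ids.Length} bytes");
    }
}
```

The names never travel with the data, so the saving grows with the number of records, at the price that both sides must keep their tables in sync.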

DEV Community
