Efficient JSON Data Formats
JSON has become a ubiquitous format for transferring data between applications and web APIs. For all its popularity, there are quite a few downsides to how it is typically used: the most common patterns can be quite inefficient. This perceived inefficiency has led to many binary formats such as MessagePack, BSON, Protobuf, Thrift and many more.
It would be nice if we could improve the efficiency of JSON without having to pull in a new library that may introduce unwanted side effects such as protocol version dependency, application version dependency, loss of readability and other issues typically associated with binary formats. We would like to introduce measures that retain the generic and user-friendly nature of JSON itself while improving efficiency by a few factors. Lower parsing and application processing latency are additional benefits, which, as will become clear, follow largely from the reduction in data size. Interestingly, these measures have a lot in common with database types and the general design of database systems. For the purpose of this article we will refer to the measures, or JSON formats, as Object Oriented, Row Oriented and Column Oriented. At the end we will present some test data to illustrate the possible advantages.
Object Oriented
This is the popular way for web APIs to produce data and generally looks like this:
[
  {
    "name1": value_11,
    "name2": value_12,
    "name3": value_13,
    ...,
    "namek": value_1k
  },
  {
    "name1": value_21,
    "name2": value_22,
    "name3": value_23,
    ...,
    "namek": value_2k
  },
  {...},
  {
    "name1": value_n1,
    "name2": value_n2,
    "name3": value_n3,
    ...,
    "namek": value_nk
  }
]
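As a hypothetical illustration (using a subset of the field names from the test code in Appendix A, with made-up values), three rows in this style might look like:

[
  { "user": "test", "sessionId": 1, "command": "SELECT", "transactionId": 0, "time": 17 },
  { "user": "test", "sessionId": 1, "command": "SELECT", "transactionId": 1, "time": 42 },
  { "user": "test", "sessionId": 2, "command": "INSERT", "transactionId": 2, "time": 93 }
]

Every field name is repeated for every row, which is where most of the size overhead comes from.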
Some examples of well-known public web APIs that support this style:
https://developers.facebook.com/docs/marketing-api/using-the-api
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/DBActivityStreams.html
https://code.msdn.microsoft.com/REST-in-Bing-Maps-Windows-6d547d69/sourcecode?fileId=82515&pathId=1683851973
A more exhaustive list of public APIs can be found here:
https://github.com/n0shake/Public-APIs
Row Oriented
A more efficient way is to represent the data in a row-oriented pattern:
{
  "meta": [ "name1", "name2", "name3", ..., "namek" ],
  "data":
  [
    [ value_11, value_12, value_13, ..., value_1k ],
    [ value_21, value_22, value_23, ..., value_2k ],
    [...],
    [ value_n1, value_n2, value_n3, ..., value_nk ]
  ]
}
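Using the same hypothetical rows as in the object-oriented example above, the row-oriented payload would be:

{
  "meta": [ "user", "sessionId", "command", "transactionId", "time" ],
  "data": [
    [ "test", 1, "SELECT", 0, 17 ],
    [ "test", 1, "SELECT", 1, 42 ],
    [ "test", 2, "INSERT", 2, 93 ]
  ]
}

The field names now appear only once, in meta, instead of once per row.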
I haven't found any APIs based on this format, but there are some partial examples, especially among mapping APIs where large amounts of data are transferred.
Partial Example
https://code.msdn.microsoft.com/REST-in-Bing-Maps-Windows-6d547d69/sourcecode?fileId=82515&pathId=1683851973
Column Oriented
An even more efficient way is to represent the data in a column-oriented pattern:
{
  "meta": [ "name1", "name2", "name3", ..., "namek" ],
  "data":
  [
    [ value_11, value_12, value_13, ..., value_1n ],
    [ value_21, value_22, value_23, ..., value_2n ],
    [ value_31, value_32, value_33, ..., value_3n ],
    [...],
    [ value_k1, value_k2, value_k3, ..., value_kn ]
  ]
}
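Continuing the same hypothetical example, each inner array now holds one column across all rows:

{
  "meta": [ "user", "sessionId", "command", "transactionId", "time" ],
  "data": [
    [ "test", "test", "test" ],
    [ 1, 1, 2 ],
    [ "SELECT", "SELECT", "INSERT" ],
    [ 0, 1, 2 ],
    [ 17, 42, 93 ]
  ]
}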
Note that far fewer individual arrays are created when this orientation is parsed compared to the row-oriented format: one array per column rather than one per row. The total data size may not be much smaller, but the difference will become apparent in the processing latency tests.
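A consumer will usually still want row objects, so the column-oriented payload has to be pivoted back. Below is a minimal sketch of how that could be done; the toObjects name and the payload shape are assumptions based on the { meta, data } structure shown above, not part of any existing API.

// Hypothetical helper: rebuild row objects from a column-oriented
// payload of the { meta, data } shape shown above.
function toObjects(payload){
    var meta = payload.meta
    var rowCount = payload.data.length > 0 ? payload.data[0].length : 0
    var rows = []
    for(var r = 0; r < rowCount; ++r){
        var obj = {}
        for(var c = 0; c < meta.length; ++c){
            obj[meta[c]] = payload.data[c][r]
        }
        rows.push(obj)
    }
    return rows
}

Only meta.length + 1 arrays travel over the wire, and the per-row objects are created on the consumer side only if and when they are needed.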
Tests
For the tests we will be using Node.js and JavaScript as a sandbox. I chose JavaScript for its optimized native implementation of the JSON parser. It is also a very popular environment for both consuming and producing JSON APIs. The test simulates the task of transferring a set of database rows. We will record the row count, JSON creation time, JSON parse time and transmitted data size.
The source code can be found in Appendix A at the end of this document.
And finally the results.
Type | Object | Row | Column | Worst / Best |
---|---|---|---|---|
Row Count | 10000 | 10000 | 10000 | na |
Data Size (KiB) | 1190 | 565 | 487 | 2.44353 |
Parsing Time (ms) | 8 | 5 | 3 | 2.66667 |
Creation Time (ms) | 7 | 3 | 1 | 7 |
Row Count | 100000 | 100000 | 100000 | na |
Data Size (KiB) | 11316 | 5750 | 4969 | 2.27732 |
Parsing Time (ms) | 84 | 55 | 27 | 3.11111 |
Creation Time (ms) | 47 | 26 | 15 | 3.13333 |
Row Count | 1000000 | 1000000 | 1000000 | na |
Data Size (KiB) | 120613 | 58485 | 50672 | 2.38027 |
Parsing Time (ms) | 1075 | 616 | 388 | 2.77062 |
Creation Time (ms) | 750 | 342 | 266 | 2.81955 |
It appears that the overall efficiency peaks at around 100,000 rows. I can only guess that this is the effect of CPU cache size relative to the final data size. It seems the row count can be fine-tuned for optimal efficiency on a case-by-case basis.
Conclusion
This test is purely an indication of what improvements can be made to well-known JSON formats. If your JSON format already incorporates these kinds of optimizations, you would be better served by following the binary path. However, if your application follows the popular object-oriented JSON pattern for serialization, there is much to be gained from changing the format pattern first, without having to rewrite large parts of your infrastructure. If gzip is used to minimize size, the popular format and the more efficient formats end up at roughly the same transit size, but the efficient format still has much lower latency; applying gzip to the efficient format itself gives an even better transit size.
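As a rough way to check the gzip claim, Node's built-in zlib module can be used to compare raw and compressed sizes of the serialized strings. A minimal sketch, assuming a data object shaped like the ones produced by the code in Appendix A:

var zlib = require('zlib')

// Compare raw and gzipped sizes of a serialized payload.
function gzipSize(data){
    var serialized = JSON.stringify(data)
    var gzipped = zlib.gzipSync(Buffer.from(serialized))
    console.info("raw: %f KiB gzipped: %f KiB",
        serialized.length/1024.0, gzipped.length/1024.0)
}

For example, gzipSize(createColData(100000)) can be compared against gzipSize(createData(100000)).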
Appendix A
Test Code
The test can be executed with Node.js, which uses the V8 engine.
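Assuming the listing below is saved to a file, for example efficient-json-test.js (the file name is arbitrary), it can be run with:

node efficient-json-test.js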
//
// Copyright 2019
//
// Column/field names shared by all three formats.
function createMetaData(){
    console.debug("createMetaData")
    return [
        'user',
        'sessionId',
        'command',
        'statement',
        'transactionId',
        'time'
    ]
}

// Object oriented: one object per row, field names repeated in every row.
function createData(count){
    console.debug("createData: %i",count)
    var data = []
    var meta = createMetaData()
    for(var d = 0; d < count; ++d){
        var object = {}
        object[meta[0]] = 'test'
        object[meta[1]] = 1
        object[meta[2]] = 'SELECT'
        object[meta[3]] = 'SELECT * from mydata'
        object[meta[4]] = d
        object[meta[5]] = new Date().getMilliseconds()
        data.push(object)
    }
    return {data:data}
}

// Row oriented: one array per row, field names listed once in meta.
function createRowData(count){
    console.debug("createRowData %i",count)
    var meta = createMetaData()
    var data = []
    for(var d = 0; d < count; ++d){
        var row = []
        row.push('test')
        row.push(1)
        row.push('SELECT')
        row.push('SELECT * from mydata')
        row.push(d)
        row.push(new Date().getMilliseconds())
        data.push(row)
    }
    return {data:data, meta:meta}
}

// Column oriented: one array per column, field names listed once in meta.
function createColData(count){
    console.debug("createColData: %i",count)
    var meta = createMetaData()
    var cols = {}
    for(var r = 0; r < meta.length; ++r){
        cols[meta[r]] = []
    }
    for(var d = 0; d < count; ++d){
        cols[meta[0]].push('test')
        cols[meta[1]].push(1)
        cols[meta[2]].push('SELECT')
        cols[meta[3]].push('SELECT * from mydata')
        cols[meta[4]].push(d)
        cols[meta[5]].push(new Date().getMilliseconds())
    }
    var data = []
    for(var d = 0; d < meta.length; ++d){
        data.push(cols[meta[d]])
    }
    return {data:data, meta:meta}
}

// Serialize and parse the payload, reporting size and timings.
function bench(data){
    console.log("bench %i",data.data.length)
    var start = new Date()
    var serialized = JSON.stringify(data)
    var endSerialized = new Date()
    console.info("JSON Size: %f KiB Time to serialize %dms",serialized.length/1024.0,(endSerialized-start))
    start = new Date()
    var deserialized = JSON.parse(serialized)
    var endDeSerialized = new Date()
    console.info("Time to deserialize %dms Deserialized size %i",(endDeSerialized-start),deserialized.data.length)
}

var counts = [10000, 100000, 1000000]

console.info(" ----------------- Object oriented ----------------------")
for (var c in counts){
    var data = createData(counts[c])
    bench(data)
}

console.info(" ----------------- Row oriented -----------------------")
for (var c in counts){
    var rowData = createRowData(counts[c])
    bench(rowData)
}

console.info(" ----------------- Col oriented -----------------------")
for (var c in counts){
    var colData = createColData(counts[c])
    bench(colData)
}