Blacksmoke16

Posted on Jul 13, 2019 • Edited on Feb 26, 2021

oq - A portable/performant jq wrapper

#crystal #jq #json #python

oq

A performant and portable jq wrapper that facilitates the consumption and output of formats other than JSON, using jq filters to transform the data.

Background

I've been using jq for a while to transform a master JSON document into partner-dependent structures for their consumption. However, up until recently all of the partner structures have also been in JSON. Since jq does not support outputting XML on its own, I began to look around to see if there were any libraries that would allow using jq filters to transform the data, but output XML in addition to JSON. I ended up finding a Python library called yq that seemed to be perfect.

It supports outputting to XML and JSON while being able to use the same jq filter for both. After playing around with it for a while it became clear that, while quite speedy for smaller files, it really struggled with some of the larger documents I needed to process. The fact that it's written in Python also complicated things, since Python needs to be installed to use it unless you go through some extra process to bundle it into a single binary. Thus, the idea for a more performant and portable option began to take shape.

Introduction

Using the relatively new Crystal language, I created oq with the primary goals being portability, performance, and extending the formats that jq supports.

Usage

oq has three additional arguments that set the input and output formats to use, in addition to the name of the root element if serializing to XML. All other arguments are passed on to jq.
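To make the split between oq's own arguments and the ones forwarded to jq concrete, here is a minimal Python sketch. It is not oq's actual implementation (oq is written in Crystal). The name split_args, the dictionary keys, and the default values are assumptions for illustration, and only the flags that appear in this post (-i, -o, --xml-root) are handled.

```python
# Hypothetical sketch: separate wrapper-owned options from jq's arguments.
# The flag names come from the examples in this post; everything else
# (function name, keys, defaults) is an assumption for illustration.
WRAPPER_FLAGS = {"-i": "input", "-o": "output", "--xml-root": "xml_root"}


def split_args(argv):
    """Return (wrapper_options, jq_args) for a list of CLI arguments."""
    # Assumed defaults, inferred from the examples (JSON in/out, <root>).
    options = {"input": "json", "output": "json", "xml_root": "root"}
    jq_args = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in WRAPPER_FLAGS and i + 1 < len(argv):
            # The wrapper consumes the flag and its value.
            options[WRAPPER_FLAGS[arg]] = argv[i + 1]
            i += 2
        else:
            # Anything else is passed through to jq untouched.
            jq_args.append(arg)
            i += 1
    return options, jq_args


if __name__ == "__main__":
    opts, rest = split_args(["-o", "xml", "--xml-root", "people", "-f", "filter", "data.json"])
    print(opts)
    print(rest)
```

In this sketch, -f filter data.json ends up in the list handed to jq, while -o xml and --xml-root people stay with the wrapper.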

Examples

Consume JSON and output XML

echo '{"name": "Jim"}' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <name>Jim</name>
</root>

Consume JSON and output YAML

echo '{"name": "Jim"}' | oq -o yaml .
---
name: Jim

Consume YAML from a file and output XML

data.yaml

---
name: Jim
numbers:
  - 1
  - 2
  - 3

oq -i yaml -o xml . data.yaml 
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <name>Jim</name>
  <numbers>1</numbers>
  <numbers>2</numbers>
  <numbers>3</numbers>
</root>

Consume JSON, transform it, and output XML

data.json

{
  "guests": [
    {
      "name": "Jim",
      "age": 17,
      "numbers": [
        1,
        2,
        3
      ]
    },
    {
      "name": "Bob",
      "age": 51,
      "numbers": [
        4,
        5,
        6
      ]
    },
    {
      "name": "Susan",
      "age": 85,
      "numbers": [
        7,
        8,
        9
      ]
    }
  ]
}

filter

.guests | 
{ 
  "person": [
    .[] | {
      "age": {
        "@scale": .scale,
        "#text": .age
      },
      "name": .name,
      "favorite_numbers": {
        "number": .numbers 
      }
    }
  ]
}

oq -o xml --xml-root people -f filter data.json
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person>
    <age scale="months">289</age>
    <name>Jim</name>
    <favorite_numbers>
      <number>1</number>
      <number>2</number>
      <number>3</number>
    </favorite_numbers>
  </person>
  <person>
    <age scale="years">51</age>
    <name>Bob</name>
    <favorite_numbers>
      <number>4</number>
      <number>5</number>
      <number>6</number>
    </favorite_numbers>
  </person>
  <person>
    <age scale="days">31025</age>
    <name>Susan</name>
    <favorite_numbers>
      <number>7</number>
      <number>8</number>
      <number>9</number>
    </favorite_numbers>
  </person>
</people>

The approach to handling the JSON to XML transcoding is based on this article.
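The convention visible in the example above (keys starting with @ become attributes, #text becomes the element's text, and arrays become repeated elements) can be sketched in a few lines of Python. This is only an illustration of that mapping using the standard library's xml.etree.ElementTree, not oq's Crystal code. The names fill and to_xml are made up for this sketch, and edge cases such as nested arrays or booleans are not handled.

```python
import xml.etree.ElementTree as ET


def fill(elem, value):
    """Populate an XML element from a JSON-like Python value."""
    if isinstance(value, dict):
        for key, child in value.items():
            if key.startswith("@"):
                # "@name" keys become attributes on the current element.
                elem.set(key[1:], str(child))
            elif key == "#text":
                # "#text" becomes the text content of the current element.
                elem.text = str(child)
            elif isinstance(child, list):
                # Arrays become repeated sibling elements with the same tag.
                for item in child:
                    fill(ET.SubElement(elem, key), item)
            else:
                fill(ET.SubElement(elem, key), child)
    elif value is not None:
        # Scalars become text content.
        elem.text = str(value)


def to_xml(data, root_name="root"):
    """Serialize a dict to an XML string (no declaration, no indentation)."""
    root = ET.Element(root_name)
    fill(root, data)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = {
        "person": [
            {
                "age": {"@scale": "years", "#text": 51},
                "name": "Bob",
                "favorite_numbers": {"number": [4, 5, 6]},
            }
        ]
    }
    print(to_xml(doc, "people"))
```

Running it prints the same structure as the Bob entry above, on a single line and without the XML declaration.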

Benchmarks

I also ran some benchmarks for jq, yq, and oq to show how they compare in various situations.

Setup

OS: #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16)
CPU: Intel i7-7700k
Memory: 32GB @ 3,000 MHz
SSD: Samsung 850 PRO - 512GB

Benchmarks were run via the /usr/bin/time -v command.
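The reports from /usr/bin/time -v are fairly long, and only two fields matter for the comparisons in this post: elapsed wall-clock time and maximum resident set size. As a rough sketch, the following Python pulls those two values out of a captured report. The function names are my own, and it assumes the report text has already been saved to a string.

```python
def parse_time_report(report):
    """Turn the 'Label: value' lines of a `time -v` report into a dict."""
    fields = {}
    for line in report.splitlines():
        # Split on the first ": " so labels such as "(h:mm:ss or m:ss)"
        # and clock values such as "0:00.76" stay intact.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


def elapsed_seconds(clock):
    """Convert 'm:ss.xx' or 'h:mm:ss' into seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def summarize(report):
    """Return (wall-clock seconds, peak RSS in kbytes) for one report."""
    fields = parse_time_report(report)
    wall = elapsed_seconds(fields["Elapsed (wall clock) time (h:mm:ss or m:ss)"])
    peak_kb = int(fields["Maximum resident set size (kbytes)"])
    return wall, peak_kb


if __name__ == "__main__":
    sample = (
        '    Command being timed: "jq length jeopardy.json"\n'
        "    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.76\n"
        "    Maximum resident set size (kbytes): 230080\n"
    )
    print(summarize(sample))
```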

Simple

First, I used the data.json file to see how they perform when simply parsing the file and outputting it unchanged via the . filter.

jq

jq . data.json | wc -l
    Command being timed: "jq . data.json"
    User time (seconds): 0.02
    System time (seconds): 0.01
    Percent of CPU this job got: 68%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.06
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16236
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 3860
    Voluntary context switches: 224
    Involuntary context switches: 8
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
31

yq

yq . data.json | wc -l
    Command being timed: "yq . data.json"
    User time (seconds): 0.08
    System time (seconds): 0.01
    Percent of CPU this job got: 77%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.11
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16252
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 7179
    Voluntary context switches: 189
    Involuntary context switches: 10
    Swaps: 0
    File system inputs: 1672
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
31

oq

oq . data.json | wc -l
    Command being timed: "oq . data.json"
    User time (seconds): 0.02
    System time (seconds): 0.04
    Percent of CPU this job got: 74%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16140
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 4499
    Voluntary context switches: 306
    Involuntary context switches: 13
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
31

For this first test, all three are pretty much equal, with only negligible differences in wall-clock time and memory used.

Jeopardy.json (#2)

The next benchmark uses the ~56 MB jeopardy.json file, as retrieved from jq's benchmark wiki page.

First up, a simple length jeopardy.json command.

jq

jq length jeopardy.json 
216930
    Command being timed: "jq length jeopardy.json"
    User time (seconds): 0.64
    System time (seconds): 0.10
    Percent of CPU this job got: 97%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.76
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 230080
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 63213
    Voluntary context switches: 240
    Involuntary context switches: 13
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

yq

yq length jeopardy.json 
216930
    Command being timed: "yq length jeopardy.json"
    User time (seconds): 152.45
    System time (seconds): 1.27
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:33.04
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3853532
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1117041
    Voluntary context switches: 13708
    Involuntary context switches: 3189
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

oq

oq length jeopardy.json 
216930
    Command being timed: "oq length jeopardy.json"
    User time (seconds): 0.67
    System time (seconds): 0.17
    Percent of CPU this job got: 105%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.80
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 230224
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 63839
    Voluntary context switches: 13832
    Involuntary context switches: 12
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Big files do not bode well for yq: it took ~190x longer than oq (and ~200x longer than jq), while also using almost 17x more memory.

YAML => XML

The last benchmark I did was giving both yq and oq a large YAML file (~57 MB), then having them convert it to XML. Since jq can't consume YAML, I excluded it.

The file used: invItems.yaml from the EVE Online SDE Export.

Example Input:

-   flagID: 0
    itemID: 0
    locationID: 0
    ownerID: 0
    quantity: -1
    typeID: 0
-   flagID: 0
    itemID: 1
    locationID: 0
    ownerID: 0
    quantity: -1
    typeID: 0
 ...

yq

For yq, I had to give it a filter and some extra args for it to output correctly:

yq -s -x --xml-root items --xml-dtd '{"item": .[] | .}' invItems.yaml > invItems.yq.xml
    Command being timed: "yq -s -x --xml-root items --xml-dtd {"item": .[] | .} invItems.yaml"
    User time (seconds): 309.21
    System time (seconds): 2.76
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:11.90
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 7817608
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 2262904
    Voluntary context switches: 32918
    Involuntary context switches: 2504
    Swaps: 0
    File system inputs: 0
    File system outputs: 195072
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Example Output

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <flagID>0</flagID>
    <itemID>0</itemID>
    <locationID>0</locationID>
    <ownerID>0</ownerID>
    <quantity>-1</quantity>
    <typeID>0</typeID>
  </item>
  <item>
    <flagID>0</flagID>
    <itemID>1</itemID>
    <locationID>0</locationID>
    <ownerID>0</ownerID>
    <quantity>-1</quantity>
    <typeID>0</typeID>
  </item>
  ...
</items>

oq

oq -i yaml -o xml --xml-root items . invItems.yaml > invItems.oq.xml
    Command being timed: "oq -i yaml -o xml --xml-root items . invItems.yaml"
    User time (seconds): 20.08
    System time (seconds): 0.48
    Percent of CPU this job got: 107%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.13
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1332328
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 522235
    Voluntary context switches: 30478
    Involuntary context switches: 974
    Swaps: 0
    File system inputs: 0
    File system outputs: 195072
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Example Output

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    <flagID>0</flagID>
    <itemID>0</itemID>
    <locationID>0</locationID>
    <ownerID>0</ownerID>
    <quantity>-1</quantity>
    <typeID>0</typeID>
  </item>
  <item>
    <flagID>0</flagID>
    <itemID>1</itemID>
    <locationID>0</locationID>
    <ownerID>0</ownerID>
    <quantity>-1</quantity>
    <typeID>0</typeID>
  </item>
  ...
</items>

Similar to the jeopardy.json benchmark, yq just has a hard time dealing with larger inputs, with this test case taking ~16x longer and using almost 6x the memory of oq.
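For anyone who wants to check the ratios quoted for the two large-file benchmarks, here is a small Python sketch of the arithmetic. The numbers are copied from the elapsed time and maximum resident set size lines in the reports above; the dictionary layout and the ratios helper are just for illustration.

```python
# (wall-clock seconds, peak RSS in kbytes), taken from the reports above.
RUNS = {
    "jeopardy length": {
        "jq": (0.76, 230080),
        "yq": (153.04, 3853532),  # 2:33.04 elapsed
        "oq": (0.80, 230224),
    },
    "yaml to xml": {
        "yq": (311.90, 7817608),  # 5:11.90 elapsed
        "oq": (19.13, 1332328),
    },
}


def ratios(slow, fast):
    """Return (time ratio, memory ratio) of one run relative to another."""
    return slow[0] / fast[0], slow[1] / fast[1]


if __name__ == "__main__":
    for name, tools in RUNS.items():
        for other in ("jq", "oq"):
            if other in tools:
                t, m = ratios(tools["yq"], tools[other])
                print(f"{name}: yq vs {other}: {t:.1f}x time, {m:.1f}x memory")
```

For the jeopardy.json run this works out to roughly 191x the time and 16.7x the memory of oq, and for the YAML to XML run roughly 16.3x the time and 5.9x the memory.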

Road to 1.0.0

Since this project is still early in its development, I put together a roadmap of what I would like to get done before calling it 1.0.0:

Support XML input format
Address bugs/issues that arise
Small feature requests
Possibly additional formats

Feel free to submit issues/PRs.

Top comments (2)

KrisLamote • Dec 3 '19

@blacksmoke16 Great - looking forward to a new release including the xml input format :)

Blacksmoke16 • Dec 16 '19

Just finished releasing 1.0.0, let me know if you have any trouble :)
