Recently, I was confronted with a 700MB JSON file which I needed to manipulate and explore in order to find the source of a bug. I needed to find any relationships between an occasionally missing field and the other properties; finding a pattern would let me focus my debugging efforts.
My first attempt was a simple Node.js script which would fs.readFile the file and use vanilla JS to map/reduce/summarize the JSON. This was a good place to start, as it was fairly easy to launch the Node REPL and to .save and .load sessions as needed. However, I didn't like the wait associated with the single-threaded parsing of the JSON file. While it was the most flexible option, it also felt the most manual.
My usual tool for dealing with JSON,
jq
jq is a lightweight and flexible command-line JSON processor.
If you want to learn to use jq, read the documentation at https://stedolan.github.io/jq. This documentation is generated from the docs/ folder of this repository. You can also try it online at jqplay.org.
If you want to hack on jq, feel free, but be warned that its internals are not well-documented at the moment. Bring a hard hat and a shovel. Also, read the wiki: https://github.com/stedolan/jq/wiki, where you will find cookbooks, discussion of advanced topics, internals, release engineering, and more.
Source tarball and built executable releases can be found on the homepage and on the github release page, https://github.com/stedolan/jq/releases
If you're building directly from the latest git, you'll need flex, bison (3.0 or newer), libtool, make, automake, and autoconf installed. To get regexp support you'll also need to install Oniguruma or clone it as…
In this case, jq is fast, but still single-threaded.
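For reference, this is roughly the jq query I would have reached for; the field and file names are placeholders, and it assumes a single top-level JSON array:

# How many records are missing the field, and how does another property
# distribute among them? (placeholder field and file names)
jq '[ .[] | select(has("user_agent") | not) ] | length' big.json
jq '[ .[] | select(has("user_agent") | not) | .source ] | group_by(.) | map({value: .[0], count: length})' big.json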
Instead, I turned to a relatively new tool:
johnkerl / miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
What is Miller?
Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed.
What can Miller do for me?
- With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.
- Miller operates on key-value-pair data while the familiar Unix tools operate on integer-indexed fields: if the natural data structure for the latter is the array, then Miller's natural data structure is the insertion-ordered hash map.
- Miller handles a variety of data formats including but not limited to the familiar CSV, TSV, and JSON/JSON Lines. (Miller can handle positionally-indexed data too!)
I plan on diving deeper into Miller, especially since it can handle data in CSV, JSON, and other formats. Also, the then-chaining, which launches separate goroutines, sounds like a big win for more complicated data manipulation flows.
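As a rough sketch of what such a then-chain might look like for my missing-field question (field and file names are placeholders, not my real data):

# Tag each record with whether the field is present, then count the tags;
# each verb in the then-chain runs as its own stage.
mlr --ijson --opprint \
  put '$has_user_agent = is_present($user_agent)' \
  then count-distinct -f has_user_agent \
  big.json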
Another tool I tried was sqlite3 with its JSON extension combined with its readfile extension. This was inspired by a question and answer I found:
A way to do this without CSV or a 3rd party tool is to use the JSON1 extension of SQLite combined with the readfile extension that is provided in the sqlite3 CLI tool. As well as overall being a "more direct" solution, this has the advantage of handling JSON NULL values…
INSERT INTO my_table SELECT
json_extract(value, '$.user_agent')
FROM json_each(readfile('my_data.json'));
I was able to create an SQL table based on the JSON keys I was interested in from each line. However, it wasn't a great fit for my data, as many lines were missing common keys, and I didn't want sparse rows. Also, I would have to write my own verbs in SQL, whereas Miller provides many verbs out of the box. That said, sqlite3 was quite fast to provide an answer, probably as a result of its query optimizations.
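For example, a query along these lines (placeholder names again, and assuming one top-level JSON array) could tabulate how often the field is missing without creating a table at all:

# Group records by whether the field is NULL or missing entirely;
# readfile() comes from the sqlite3 CLI shell, json_each from the JSON1 extension.
sqlite3 :memory: <<'SQL'
SELECT json_extract(value, '$.user_agent') IS NULL AS missing,
       count(*) AS n
FROM json_each(readfile('my_data.json'))
GROUP BY missing;
SQL

Feeding the SQL through a quoted heredoc keeps the shell from expanding the '$.user_agent' path.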
I was interested in trying
DuckDB
DuckDB is a high-performance analytical database system. It is designed to be fast, reliable and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on the goals of DuckDB, please refer to the Why DuckDB page on our website.
Installation
If you want to install and use DuckDB, please see our website for installation and usage instructions.
Data Import
For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:
SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';
Refer to our Data Import section for more information.
SQL Reference
The website contains a reference of functions and SQL constructs available in DuckDB.
Development
For development, DuckDB requires CMake…
for the same reason I tried sqlite3, but I ultimately didn't pursue it: I already had the json extension working in sqlite, which was easier than converting the JSON to CSV first and then importing it into DuckDB later.