
Michael Salaverry


data manipulation: jq vs Miller

Recently, I was confronted with a 700 MB JSON file which I needed to manipulate and explore in order to find the source of a bug. I needed to find any relationships between an occasionally missing field and the other properties. Finding a pattern would let me focus my debugging efforts.

My first attempt was actually a simple Node.js script which would fs.readFile the file and use vanilla JS to map/reduce/summarize the JSON. This was a good place to start, as it was fairly easy to launch the Node REPL and to .save and .load as needed between sessions. However, I didn't like the wait associated with the single-threaded parsing of the JSON file. While it was the most flexible option, it felt the most manual.
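A minimal sketch of that kind of script (not the exact code from the investigation; it assumes the file is one big JSON array, reads it synchronously for simplicity, and uses user_agent as a stand-in for the field of interest):

// summarize.js: count records per user_agent value, including missing ones
const fs = require('fs');

const records = JSON.parse(fs.readFileSync('my_data.json', 'utf8'));
const summary = records.reduce((acc, record) => {
  const key = record.user_agent ?? '<missing>';
  acc[key] = (acc[key] || 0) + 1;
  return acc;
}, {});
console.table(summary);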

My usual tool for dealing with JSON,

stedolan / jq

Command-line JSON processor

jq

jq is a lightweight and flexible command-line JSON processor.


If you want to learn to use jq, read the documentation at https://stedolan.github.io/jq. This documentation is generated from the docs/ folder of this repository. You can also try it online at jqplay.org.

If you want to hack on jq, feel free, but be warned that its internals are not well-documented at the moment. Bring a hard hat and a shovel. Also, read the wiki: https://github.com/stedolan/jq/wiki, where you will find cookbooks, discussion of advanced topics, internals, release engineering, and more.

Source tarball and built executable releases can be found on the homepage and on the github release page, https://github.com/stedolan/jq/releases

If you're building directly from the latest git, you'll need flex, bison (3.0 or newer), libtool, make, automake, and autoconf installed. To get regexp support you'll also need to install Oniguruma or clone it as…

is a workhorse written in C with a variety of useful filters for parsing and summarizing JSON in one-liners.
In this case, jq is fast, but still single-threaded.
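For example, a first pass at the missing-field question can be a one-liner along these lines (a hedged sketch, assuming the file is a single JSON array; user_agent stands in for the field I was investigating):

# count how many records carry each user_agent value (null = missing)
jq 'map(.user_agent) | group_by(.) | map({value: .[0], count: length})' my_data.json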

Instead, I turned to a relatively new tool

johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

What is Miller?

Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed.

What can Miller do for me?

With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.


  • Miller operates on key-value-pair data while the familiar Unix tools operate on integer-indexed fields: if the natural data structure for the latter is the array, then Miller's natural data structure is the insertion-ordered hash map.

  • Miller handles a variety of data formats including but not limited to the familiar CSV, TSV, and JSON/JSON Lines (Miller can handle positionally-indexed data too!)


which serves a similar purpose to jq. When I tried Miller, I noticed that its count-distinct verb was quite fast. I read the Miller documentation and discovered at https://miller.readthedocs.io/en/latest/cpu/ that Miller is often multi-threaded as a consequence of its implementation in Go and its use of channels. This can give Miller an edge over jq for tasks where concurrency makes a difference.
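The count-distinct verb in question looks roughly like this (a sketch, assuming a JSON-array input file and the same user_agent field as above):

# how many records carry each distinct user_agent value
mlr --ijson --opprint count-distinct -f user_agent my_data.json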

I plan on diving deeper into Miller, especially since it can handle data in CSV, JSON, or other formats. Also, then-chaining, which launches separate goroutines for the verbs in a pipeline, sounds like a big win for more complicated data-manipulation flows.
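Something along these lines is what I have in mind (a hypothetical sketch; filter, is_present, and count-distinct are real Miller verbs/functions, but the pipeline itself is illustrative):

# keep only records that have the field, then summarize, in one streaming pass
mlr --ijson --opprint filter 'is_present($user_agent)' then count-distinct -f user_agent my_data.json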

Another tool I tried is SQLite with its JSON1 extension combined with its readfile extension. This was inspired by a question and answer I found:

A way to do this without CSV or a 3rd party tool is to use the JSON1 extension of SQLite combined with the readfile extension that is provided in the sqlite3 CLI tool. As well as overall being a "more direct" solution, this has the advantage of handling JSON NULL values…

and it ended up being a fast way to run SQL on my JSON file. By running an SQL statement in the sqlite3 REPL like this:
-- the target table is assumed to already exist, e.g.:
CREATE TABLE my_table(user_agent TEXT);

INSERT INTO my_table SELECT
  json_extract(value, '$.user_agent')
FROM json_each(readfile('my_data.json'));

I was able to create an SQL table based on the JSON keys I was interested in from each line. However, it wasn't a great fit for my data: many lines are missing common keys, and I didn't want sparse rows. Also, I would have to write my own verbs in SQL, whereas Miller provides many verbs out of the box. Still, sqlite3 was quite fast to provide an answer, probably as a result of its query optimizations.
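From there, the kind of summary I ran looked roughly like this (a sketch against the assumed single-column table above):

SELECT user_agent, COUNT(*) AS occurrences
FROM my_table
GROUP BY user_agent
ORDER BY occurrences DESC;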

I was interested in trying

duckdb / duckdb

DuckDB is an in-process SQL OLAP Database Management System

 


DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on the goals of DuckDB, please refer to the Why DuckDB page on our website.

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

For development, DuckDB requires CMake

for the same reason I tried sqlite3, but I ultimately didn't pursue it, since using the SQLite JSON extension was easier than converting the JSON to CSV first and then importing it into DuckDB later.
