Recently, I was confronted with a 700MB JSON file which I needed to manipulate and explore in order to find the source of a bug. I needed to find any relationships between an occasionally missing field and the other properties; finding a pattern would let me focus my debugging efforts.
My first attempt was a simple Node.js script which would fs.readFile the file and use vanilla JS to map/reduce/summarize the JSON. This was a good place to start, as it was fairly easy to launch the Node REPL and to .save and .load sessions as needed. However, I didn't like the wait associated with the single-threaded parsing of the JSON file. While it was the most flexible option, it also felt the most manual.
My usual tool for dealing with JSON,
jq
jq is a lightweight and flexible command-line JSON processor.
If you want to learn to use jq, read the documentation at https://stedolan.github.io/jq. This documentation is generated from the docs/ folder of this repository. You can also try it online at jqplay.org.
If you want to hack on jq, feel free, but be warned that its internals are not well-documented at the moment. Bring a hard hat and a shovel. Also, read the wiki: https://github.com/stedolan/jq/wiki, where you will find cookbooks, discussion of advanced topics, internals, release engineering, and more.
Source tarball and built executable releases can be found on the homepage and on the github release page, https://github.com/stedolan/jq/releases
If you're building directly from the latest git, you'll need flex, bison (3.0 or newer), libtool, make, automake, and autoconf installed. To get regexp support you'll also need to install Oniguruma or clone it as…
In this case, jq is fast, but still single-threaded.
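For reference, this is roughly the jq query I would have reached for; the field and file names are placeholders, and it assumes a single top-level JSON array:

# How many records are missing the field, and how does another property
# distribute among them? (placeholder field and file names)
jq '[ .[] | select(has("user_agent") | not) ] | length' big.json
jq '[ .[] | select(has("user_agent") | not) | .source ] | group_by(.) | map({value: .[0], count: length})' big.json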
Instead, I turned to a relatively new tool:
johnkerl / miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
What is Miller?
Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed.
What can Miller do for me?
- With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.
- Miller operates on key-value-pair data while the familiar Unix tools operate on integer-indexed fields: if the natural data structure for the latter is the array, then Miller's natural data structure is the insertion-ordered hash map.
- Miller handles a variety of data formats including but not limited to the familiar CSV, TSV, and JSON/JSON Lines. (Miller can handle positionally-indexed data too!)
I plan on diving deeper into Miller, especially since it can handle data in CSV, JSON, and other formats. Also, the then-chaining, which launches separate goroutines, sounds like a big win for more complicated data manipulation flows.
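As a rough sketch of what such a then-chain might look like for my missing-field question (field and file names are placeholders, not my real data):

# Tag each record with whether the field is present, then count the tags;
# each verb in the then-chain runs as its own stage.
mlr --ijson --opprint \
  put '$has_user_agent = is_present($user_agent)' \
  then count-distinct -f has_user_agent \
  big.json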
Another tool I tried was sqlite3 with its JSON extension combined with its readfile extension. This was inspired by a question and answer I found:
A way to do this without CSV or a 3rd party tool is to use the JSON1 extension of SQLite combined with the readfile extension that is provided in the sqlite3 CLI tool. As well as overall being a "more direct" solution, this has the advantage of handling JSON NULL values…
INSERT INTO my_table SELECT
json_extract(value, '$.user_agent')
FROM json_each(readfile('my_data.json'));
I was able to create an SQL table based on the JSON keys I was interested in from each line. However, it wasn't a great fit for my data, as many lines were missing common keys, and I didn't want sparse rows. Also, I would have to write my own verbs in SQL, whereas Miller provides many verbs out of the box. That said, sqlite3 was quite fast to provide an answer, probably as a result of its query optimizations.
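For example, a query along these lines (placeholder names again, and assuming one top-level JSON array) could tabulate how often the field is missing without creating a table at all:

# Group records by whether the field is NULL or missing entirely;
# readfile() comes from the sqlite3 CLI shell, json_each from the JSON1 extension.
sqlite3 :memory: <<'SQL'
SELECT json_extract(value, '$.user_agent') IS NULL AS missing,
       count(*) AS n
FROM json_each(readfile('my_data.json'))
GROUP BY missing;
SQL

Feeding the SQL through a quoted heredoc keeps the shell from expanding the '$.user_agent' path.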
I was interested in trying
DuckDB
DuckDB is a high-performance analytical database system. It is designed to be fast, reliable and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on the goals of DuckDB, please refer to the Why DuckDB page on our website.
Installation
If you want to install and use DuckDB, please see our website for installation and usage instructions.
Data Import
For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:
SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';
Refer to our Data Import section for more information.
SQL Reference
The website contains a reference of functions and SQL constructs available in DuckDB.
Development
For development, DuckDB requires CMake…
for the same reason I tried sqlite3, but I ultimately didn't pursue it: I already had the json extension working in sqlite, which was easier than converting the JSON to CSV first and then importing it into DuckDB later.