Structured JSON logs are easy to produce and hard to analyze at scale. They carry useful context, but that context is nested, optional, inconsistent, and often wider than expected. Before the data can be queried comfortably, it usually has to become a table.
The question is not only "can we flatten it?" The more useful question is:
After flattening, what should the output be?
CSV is simple and fast. Parquet is compact and portable. DuckDB gives a ready-to-query local database. SQLite is widely available. Direct database APIs look convenient from Go, but may not behave like bulk loaders.
This note walks through a small discovery process: define realistic shapes, run the same flattening flow through several output types, and compare write/export cost plus resulting artifact size.
## The Contestants
The tested output types were:
- CSV file: plain text table output, the simplest baseline.
- Parquet Snappy: portable columnar file with fast compression.
- Parquet Zstd: portable columnar file with stronger compression.
- DuckDB CLI stdin: stream CSV into the `duckdb` CLI and create a database table. This path relies on DuckDB's CSV reader with automatic type detection rather than a hand-declared schema.
- SQLite CLI import: stream CSV into the `sqlite3` CLI and create a database table.
- SQLite direct inserts: write rows through the Go SQLite driver.
- DuckDB native appender experiment: write rows through a native DuckDB Go appender driver.
The CLI output types matter because they behave like bulk loaders. They let flatjsonl focus on flattening and streaming rows, while the database process handles ingestion in its optimized path.
## The flatjsonl Flow
flatjsonl uses two passes:
1. Scan the input to discover keys and infer column types.
2. Read the input again, flatten each JSON object, and write rows to the chosen output.
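As a rough sketch of the two-pass idea (not flatjsonl's actual implementation), assuming nested keys are joined with dots and missing fields become empty cells:

```python
import csv
import io
import json

def flatten(obj, prefix=""):
    """Flatten a nested JSON object into dot-separated keys."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            row.update(flatten(value, name))
        else:
            row[name] = value
    return row

def jsonl_to_csv(lines):
    # Pass 1: discover the union of keys across all records,
    # preserving first-seen order.
    columns = []
    seen = set()
    for line in lines:
        for key in flatten(json.loads(line)):
            if key not in seen:
                seen.add(key)
                columns.append(key)
    # Pass 2: flatten again and emit rows under the fixed header;
    # keys absent from a record become empty cells.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for line in lines:
        writer.writerow(flatten(json.loads(line)))
    return out.getvalue()

print(jsonl_to_csv(['{"a": 1, "b": {"c": "x"}}', '{"a": 2, "d": true}']))
```

The sketch keeps the whole input in memory as a list so it can be read twice; flatjsonl streams the file twice instead, which is what makes the two-pass design workable at millions of rows.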
The first pass is mostly independent of output format. For comparing CSV, Parquet, SQLite, and DuckDB, the second pass is more interesting: it includes flattening, value conversion, writing/importing, compression, and finalization.
So the main timing metric below is Export time:
Export time = total wall time - key scan time
This intentionally ignores the common discovery pass and focuses on the output-side cost.
## The Data Shapes
Flattened JSON performance depends heavily on shape. Three patterns were used:
| Pattern | Rows | Result Columns | Type Mix | Shape |
|---|---|---|---|---|
| Narrow | 5691804 | 23 | Mostly strings, a few bools and ints. | Small reporting extract: low/medium-cardinality dimensions, one numeric measure, and status flags. |
| Normal | 5691804 | 118 | String-heavy with more ints and bool flags. | Practical analytics extract: request metadata, timestamps, dimensions, numeric measures, duration windows and flags. |
| Wide | 193217 | about 1680 | Mostly optional strings and sparse dimensions, with some ints/bools/floats. | Stress case for rich nested JSON logs with many rarely-populated and high-cardinality columns. |
The exact field names are not important here. The important properties are row count, column count, type mix, string cardinality, sparse optional fields, and how much text has to move through the writer.
All numbers are approximate. These are production-shaped samples rather than controlled synthetic fixtures. The goal is to build intuition about tradeoffs.
## Reading The Results
The tables use relative cost indexes.
100% is the best measured result in that scenario. Higher is worse.
For example:
-
250%write/export time means about 2.5x slower than the fastest path for that data shape. -
250%artifact size means about 2.5x larger than the smallest artifact for that data shape.
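The normalization is just each measurement divided by the best in its column. A tiny helper makes that concrete, using times loosely based on the normal-shape results below (the exact inputs are illustrative):

```python
def cost_index(values):
    """Normalize measurements so the best (smallest) becomes 100%."""
    best = min(values.values())
    return {name: round(100 * v / best) for name, v in values.items()}

# Illustrative export times in seconds for one data shape.
print(cost_index({"csv": 35.9, "parquet-zstd": 51.0, "sqlite-direct": 191.7}))
# The fastest path maps to 100; everything else to its relative cost.
```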
## Write/Export Time
This table excludes key scanning and focuses on flattening plus writing/importing.
| Output | Narrow | Normal | Wide |
|---|---|---|---|
| CSV file | 100% | 100% | 100% |
| Parquet, Snappy | 102% | 130% | 253% |
| Parquet, Zstd | 103% | 142% | 289% |
| DuckDB CLI stdin | 103% | 199% | 384% |
| SQLite CLI CSV import | 103% | 246% | 596% |
| SQLite direct inserts | 130% | 534% | 2932% |
| DuckDB native appender (experimental) | 253% | 1089% | 2440% |
The fastest absolute export times were:
| Pattern | Best Output | Best Export Time |
|---|---|---|
| Narrow | CSV file | 35.1s |
| Normal | CSV file | 35.9s |
| Wide | CSV file | 10.6s |
The narrow case is almost flat: CSV, Parquet, DuckDB CLI, and SQLite CLI are all within a few percent. In the normal and wide cases, writer choice starts to matter.
## Artifact Size
This table compares resulting file/database size.
| Output | Narrow | Normal | Wide |
|---|---|---|---|
| CSV file | 828% | 635% | 264% |
| Parquet, Snappy | 156% | 138% | 122% |
| Parquet, Zstd | 100% | 100% | 101% |
| DuckDB CLI stdin | 113% | 141% | 100% |
| SQLite CLI CSV import | 901% | 708% | 280% |
| SQLite direct inserts | 901% | 670% | 274% |
| DuckDB native appender (experimental) | 212% | 177% | 130% |
The winners were:
| Pattern | Smallest Path | Size |
|---|---|---|
| Narrow | Parquet, Zstd | 90M |
| Normal | Parquet, Zstd | 502M |
| Wide | DuckDB CLI stdin | 732M |
CSV wins on write speed, but it is consistently large. Compressed Parquet and DuckDB are much more storage-efficient.
The native DuckDB appender path was an experiment, not the final export implementation. Its database schema was not identical to the CLI path: it included an extra sequence primary key and used less precise types for some fields. Treat its size and time as a cautionary data point for row-wise appender ingestion, not as a tuned DuckDB baseline.
The size gap is also a useful reminder that DuckDB is columnar. Physical size is sensitive to column types, compression opportunities, and how data is loaded. SQLite behaved differently in the same experiment: CLI import and direct inserts produced similarly sized files even when their declared schemas differed, because SQLite stores row records with dynamic typing rather than compressed column segments.
## Sample Analytical Query
Export speed is only half of the story. The point of making a table is usually to query it.
As a small read-side check, I ran a grouped count over one string dimension:
```sql
SELECT count(1), os_name
FROM flatjsonl
GROUP BY os_name
ORDER BY count(1);
```
For CSV and Parquet, DuckDB queried the files directly; they were not imported into a database first. SQLite databases were queried with the `sqlite3` CLI.
These timings are single local runs and should be read as rough order-of-magnitude results.
No indexes were created for this query. That is intentional: the goal is to compare raw analytical scan behavior of the artifacts as produced by the export step. Adding indexes, especially to SQLite or DuckDB tables, could change query performance drastically for selective filters or repeated lookups. At that point the benchmark would be measuring index design and maintenance cost rather than the default export artifact.
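The same grouped count is easy to reproduce against any SQLite artifact. A minimal sketch using Python's built-in sqlite3 module on a toy in-memory table — the os_name column comes from the query above, but the sample values here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flatjsonl (os_name TEXT)")
conn.executemany(
    "INSERT INTO flatjsonl VALUES (?)",
    [("linux",), ("linux",), ("darwin",), ("linux",), ("windows",), ("darwin",)],
)

# Grouped count over one string dimension, as in the benchmark query.
for count, os_name in conn.execute(
    "SELECT count(1), os_name FROM flatjsonl GROUP BY os_name ORDER BY count(1)"
):
    print(count, os_name)
```

With no index on os_name, SQLite answers this with a full scan plus a grouping step, which is exactly the behavior the timings above measure.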
Cells show relative query cost and absolute time. Percentages are normalized independently down each data-shape column, because the datasets differ in row count, width, and value distribution. 100% is the fastest result in that column, and higher percentages are slower.
| Output | Narrow | Normal | Wide |
|---|---|---|---|
| CSV via DuckDB direct scan | 600% / 0.36s | 2860% / 1.43s | 9800% / 3.92s |
| Parquet Snappy via DuckDB | 200% / 0.12s | 240% / 0.12s | 1700% / 0.68s |
| Parquet Zstd via DuckDB | 167% / 0.10s | 260% / 0.13s | 1675% / 0.67s |
| DuckDB database | 100% / 0.06s | 100% / 0.05s | 125% / 0.05s |
| SQLite CLI database | 3750% / 2.25s | 8360% / 4.18s | 13450% / 5.38s |
| SQLite direct database | 4217% / 2.53s | 9180% / 4.59s | 14250% / 5.70s |
| DuckDB native appender database | 100% / 0.06s | 100% / 0.05s | 100% / 0.04s |
This query is favorable to columnar formats: it touches one dimension column and computes a small aggregation. DuckDB database files and Parquet are very fast. CSV has to parse text again and gets slower as the file gets wider/larger. SQLite is functional, but slower for this analytical scan.
The wide result highlights an important distinction: Parquet is columnar, but it is still an external file format. DuckDB can query it efficiently, but it still has to read Parquet metadata, decode Parquet pages and adapt the data into DuckDB execution vectors. A DuckDB database is already in DuckDB's native storage layout, with its own column segments, compression, statistics and execution-friendly metadata. For repeated DuckDB queries, especially ones touching a tiny subset of a very wide table, the native database can be noticeably faster even when Parquet is also columnar.
I also repeated the query with a second grouping dimension of noticeably higher cardinality than the first:
```sql
SELECT count(1), os_name, country
FROM flatjsonl
GROUP BY os_name, country
ORDER BY count(1);
```
This makes every path slower than the earlier one-field query, but the overall ranking stays the same. Row-oriented databases pay much more once the grouping key widens from one dimension to two.
| Output | Narrow | Normal | Wide |
|---|---|---|---|
| CSV via DuckDB direct scan | 370% / 0.37s | 1567% / 1.41s | 5113% / 4.09s |
| Parquet Snappy via DuckDB | 160% / 0.16s | 167% / 0.15s | 975% / 0.78s |
| Parquet Zstd via DuckDB | 150% / 0.15s | 178% / 0.16s | 975% / 0.78s |
| DuckDB database | 100% / 0.10s | 100% / 0.09s | 100% / 0.08s |
| SQLite CLI database | 4110% / 4.11s | 7078% / 6.37s | 7138% / 5.71s |
| SQLite direct database | 3990% / 3.99s | 6844% / 6.16s | 7488% / 5.99s |
| DuckDB native appender database | 100% / 0.10s | 100% / 0.09s | 100% / 0.08s |
DuckDB stayed surprisingly fast even after widening the grouping key. The likely reason is that the query still touches only a tiny subset of the table: two projected dimension columns plus the row count. In a columnar, vectorized engine, adding one more grouped column is much cheaper than widening a row-store scan across the full record. The query does get slower, but not dramatically, because it still reads only a few columns and the resulting group count remains manageable.
## What The Shapes Teach
### Narrow
The narrow shape has only 23 columns. Most values are short strings with low or medium cardinality, plus one numeric measure and a few boolean flags.
For this shape, output choice barely affects write/export time. The data pass itself dominates. CSV, Parquet, DuckDB CLI, and SQLite CLI all finish in the same practical range.
Artifact size still changes dramatically. CSV and SQLite are roughly 8-9x larger than Zstd Parquet. DuckDB CLI and Zstd Parquet are both compact enough that the choice mainly depends on whether the desired result is a database file or a portable columnar file.
### Normal
The normal shape has 118 columns. It is still string-heavy, but it includes enough numeric measures, duration/window fields, timestamps, request metadata, and boolean flags to look like a practical operational analytics table.
This is where writer cost becomes visible. CSV remains fastest, but produces a very large artifact. Parquet Snappy adds moderate overhead and cuts size substantially. Parquet Zstd costs a bit more time and gives the smallest file.
DuckDB CLI is slower than Parquet, but it creates a ready-to-query database in one step. SQLite CLI remains workable, but its size and write cost are less attractive. Direct row-wise database APIs are already a poor fit.
### Wide
The wide shape has fewer rows but about 1680 columns. It is sparse and string-heavy: optional identifiers, free-form strings, long text fields, and nested dimensions dominate the schema. Numeric and boolean fields are mixed in, but they are not what makes the workload difficult.
This is where format and ingestion path dominate. Parquet costs more CPU than CSV, but cuts storage substantially. DuckDB CLI produces the smallest measured artifact and avoids a separate import step, but costs more write/export time than Parquet.
SQLite CLI is much faster than direct SQLite inserts, but still slow and large compared with Parquet or DuckDB. Direct row-wise database APIs spend too much time crossing driver boundaries, converting values, appending/binding thousands of cells per row, and managing memory.
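The per-row overhead is easy to see even without Go drivers. A toy illustration in Python's built-in sqlite3: committing after every insert pays transaction overhead per statement, while one batched transaction amortizes it. This only shows the mechanism, not flatjsonl's exact code path, and absolute timings will vary by machine:

```python
import sqlite3
import time

rows = [(i, f"value-{i}") for i in range(50_000)]

def load(batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, v TEXT)")
    start = time.perf_counter()
    if batched:
        with conn:  # one transaction for the whole batch
            conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    else:
        for row in rows:  # one statement and one commit per row
            conn.execute("INSERT INTO t VALUES (?, ?)", row)
            conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
    return count, elapsed

n_row, t_row = load(batched=False)
n_batch, t_batch = load(batched=True)
print(f"row-wise: {t_row:.3f}s, batched: {t_batch:.3f}s, rows: {n_row}")
```

On a disk-backed database the gap grows further, because each per-row commit also forces journal work; CLI bulk loaders sit even further down this curve.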
## Practical Conclusions
Use CSV when the main goal is fastest raw export and temporary text output is acceptable.
Use Parquet Snappy as the default portable analytics artifact. It is much smaller than CSV and has acceptable write overhead for narrow and normal shapes.
Use Parquet Zstd when storage or transfer size matters more than a bit of extra CPU.
Use DuckDB CLI when the desired output is directly a DuckDB database. It is competitive on import across narrow, normal, and wide shapes, and then gives excellent query performance once the data is loaded. In these experiments it was especially strong for analytical scans that touched only a small subset of columns.
Use SQLite CLI when SQLite compatibility is required. Prefer CLI bulk loading over direct inserts.
Avoid direct row-wise database insertion for uncertain production JSON shapes unless the dataset is small or the schema is known to stay narrow.
The broader lesson is that "structured logs" are not one benchmark. Row count, column count, string cardinality, sparse optional fields, and output format all matter. For exploratory analytics at scale, the safest default is to produce a portable columnar artifact first, and only create a database file directly when that is the actual target.