The Log Management Cost Trap: Part II — Storage

#logging #devops #infrastructure #observability

Authored by Benoit Gaudin

In Part I of this series, I explored the challenges of designing, running, and managing a centralised log management solution, with a focus on data ingestion. In Part II, I focus on data storage. Part III covers search.

I'll discuss different storage types and how their characteristics can fulfil the requirements of log management solutions, how data is organised within these systems, and the role of file formats in enabling efficient ingestion, storage, and retrieval.

Storage Types

When evaluating storage options, the type of storage medium is the first decision to make. File systems and blob storage each come with distinct characteristics.

Disks and File Systems

File systems operate at a lower level of abstraction and often require explicit management of storage capacity, throughput, and IOPS. Managed services like AWS EFS and FSx simplify some of this — EFS, for example, supports automatic scaling of storage and throughput capacity.

One major advantage of file systems is the ability to append data to existing files. This is especially relevant in log management, where data is immutable and continuously streamed.

At Bronto, we use file systems for data aggregation, relying specifically on their ability to append to files. Aggregation runs over a few hours before data is transferred to blob storage, so the storage footprint stays modest and cost-effective. This aggregation phase also prevents many small files from landing on blob storage, where they are known to cause performance issues at query time.
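To make the append-then-ship idea concrete, here is a minimal sketch using a local temporary directory. It is not Bronto's implementation: the file naming, the `append_batch` and `ready_to_ship` helpers, and the size threshold are all assumptions made for illustration.

```python
import os
import tempfile

def append_batch(path, lines):
    """Append a batch of log lines to the aggregation file at `path`."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line.rstrip("\n") + "\n")

def ready_to_ship(path, max_bytes):
    """Return True once the aggregated file is large enough to move to blob storage."""
    return os.path.getsize(path) >= max_bytes

# Hypothetical layout: one aggregation file per hour.
agg_dir = tempfile.mkdtemp()
agg_path = os.path.join(agg_dir, "2024-01-01T10.log")

# Batches arrive over time and are appended to the same file.
append_batch(agg_path, ["GET /a 200", "GET /b 404"])
append_batch(agg_path, ["POST /c 201"])

with open(agg_path, encoding="utf-8") as f:
    aggregated = f.read().splitlines()
```

Each batch costs one cheap append on the file system; only when `ready_to_ship` returns true (or a time limit is reached) would the whole file be uploaded as a single large object.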

Blob Storage

Blob storage is a popular choice for data analytics workloads due to its scalability and cost-effectiveness. Unlike file systems, blob storage generally doesn't support appending in place: in most cases an object must be rewritten in full when modified.

The pricing model differs significantly: costs include both storage and per-request charges for API operations (writes, reads). Overall, blob storage is more cost-efficient than remote disks for large, infrequently modified datasets.

Blob storage also supports extremely high throughput. AWS S3, for instance, enables massively parallel reads, which makes it a good fit for data-intensive services such as AWS EMR and AWS Athena.

The tradeoff: blob storage isn't well-suited for frequent appends or aggregations. Solutions like Datadog Husky and ClickHouse use compaction to address this — writing many small objects over time, then consolidating them into larger ones.
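The compaction idea can be sketched with a plain dictionary standing in for a blob store. This is only an illustration of the general pattern, not how Datadog Husky or ClickHouse implement it; the `compact` function and the key names are hypothetical.

```python
def compact(store, prefix, target_key):
    """Merge every object under `prefix` into one object, then delete the small ones."""
    small_keys = sorted(k for k in store if k.startswith(prefix))
    merged = b"".join(store[k] for k in small_keys)
    store[target_key] = merged  # one whole-object write, no append needed
    for k in small_keys:
        del store[k]
    return len(small_keys)

# In-memory stand-in for a bucket holding many small objects.
store = {
    "raw/0001": b"line 1\n",
    "raw/0002": b"line 2\n",
    "raw/0003": b"line 3\n",
}
merged_count = compact(store, "raw/", "compacted/0001")
```

On a real blob store, every step here is a billed API request, which is one reason compaction has a cost of its own.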

Bronto combines both: blob storage for long-term, large immutable files; file storage for short-term data aggregation. This balance optimises both performance and cost at scale.

File Formats and Data Organisation

File format alone doesn't determine query performance — how data is physically organised in storage matters just as much. Here are the key techniques.

Compression

Compression is essential at scale. The primary benefit is reduced storage footprint, translating directly into lower costs. At large volumes, the savings are substantial.

That said, maximum compression isn't always ideal. Higher compression ratios demand more CPU, memory, and time, which increases compute cost. The right trade-off between compression ratio and compute cost depends on your access patterns.
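A quick way to see this trade-off is to compress the same data at several levels and compare size and time. The sketch below uses Python's standard `zlib` module on a synthetic, highly repetitive log line (an assumption; real logs will compress differently), so treat the numbers it prints as illustrative only.

```python
import time
import zlib

# Synthetic sample: one structured log line repeated many times.
line = b'{"level":"INFO","service":"checkout","msg":"request completed","status":200}\n'
data = line * 5000

def measure(level):
    """Compress `data` at the given zlib level; return (compressed size, seconds taken)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed

results = {level: measure(level) for level in (1, 6, 9)}

for level, (size, seconds) in results.items():
    ratio = len(data) / size
    print(f"level={level} size={size}B ratio={ratio:.1f}x time={seconds * 1000:.2f}ms")
```

Running the same measurement on a sample of your own logs gives a more honest picture of where extra CPU time stops paying for itself.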

Row-based vs. Column-based Formats

In row-oriented storage, all fields for each record are stored together sequentially. In column-oriented storage, all values for each field are stored together.

Row-oriented formats suit unstructured data with write-intensive workloads. But with the rise of structured logging and agents that annotate data with attributes, columnar formats have become increasingly relevant for log data — enabling much more efficient scans when you only need specific fields.
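The difference between the two layouts can be shown with a few in-memory records. This is a simplified sketch with made-up field names (`ts`, `level`, `msg`); real formats add encoding, compression, and metadata on top.

```python
records = [
    {"ts": 1, "level": "INFO", "msg": "started"},
    {"ts": 2, "level": "ERROR", "msg": "disk full"},
    {"ts": 3, "level": "INFO", "msg": "stopped"},
]

# Row-oriented: all fields of a record sit together.
rows = [(r["ts"], r["level"], r["msg"]) for r in records]

# Column-oriented: all values of a field sit together.
columns = {field: [r[field] for r in records] for field in ("ts", "level", "msg")}

# Counting errors reads only the "level" column in the columnar layout...
error_count_columnar = columns["level"].count("ERROR")

# ...whereas the row layout walks through every field of every record.
error_count_rows = sum(1 for _ts, level, _msg in rows if level == "ERROR")
```

Both counts agree, but in the columnar case the `ts` and `msg` values never need to be read at all.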

Partitioning

Partitioning divides large datasets into smaller segments so queries can skip irrelevant data entirely. The key is choosing a logical criterion for segmentation.

For log data, time-based partitioning is the natural choice — queries almost always specify a time range, so only the relevant time partition needs to be scanned. This dramatically reduces both the volume of data read and the cost of doing so, especially when data is retained over months or years.
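As a sketch of time-based partition pruning, the snippet below maps a query's time range to the hourly partitions it overlaps. The `dt=.../hour=...` key layout and the helper names are assumptions for illustration, not a description of Bronto's layout.

```python
from datetime import datetime, timedelta

def partition_key(ts):
    """Hypothetical hourly partition key for a timestamp."""
    return ts.strftime("dt=%Y-%m-%d/hour=%H")

def partitions_for_range(start, end):
    """List the hourly partition keys that overlap the half-open range [start, end)."""
    if end <= start:
        return []
    keys = []
    cursor = start.replace(minute=0, second=0, microsecond=0)
    while cursor < end:
        keys.append(partition_key(cursor))
        cursor += timedelta(hours=1)
    return keys

# A query covering 22:30 to 00:15 touches three hourly partitions.
to_scan = partitions_for_range(
    datetime(2024, 1, 1, 22, 30),
    datetime(2024, 1, 2, 0, 15),
)
```

Every partition outside `to_scan` is skipped without being listed or read, no matter how many months of data are retained.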

Indexing

Indexes work like a book index: rather than reading the entire dataset to find a value, you consult the index to jump directly to where it lives.

Inverted indexes are especially effective for searching uncommon values across large datasets. The tradeoff is size — inverted indexes can grow as large as the original dataset in some cases, significantly increasing storage cost.
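A minimal inverted index over log lines might look like the following sketch. The whitespace tokenisation and the `build_inverted_index` helper are simplifying assumptions; production indexes use more careful tokenisation and compressed posting lists.

```python
from collections import defaultdict

def build_inverted_index(lines):
    """Map each lowercase token to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in line.lower().split():
            index[token].add(line_no)
    return index

lines = [
    "user login ok",
    "user login failed",
    "payment timeout",
]
index = build_inverted_index(lines)

# Looking up a rare token goes straight to the matching lines.
hits = sorted(index.get("timeout", set()))
```

Note that the index holds an entry for every distinct token plus a posting list for each, which is why it can approach the size of the data it covers.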

Predicate Pushdown

Predicate pushdown evaluates filter conditions using file metadata or summary statistics, without downloading or inspecting full file contents. File formats like Parquet support this by storing column statistics (such as min/max values) for each row group in the file.

If the statistics for a file guarantee that a filter condition can't match any record in it, the entire file can be skipped. At scale, for datasets spread across many files, this can dramatically reduce both data transfer and compute cost.
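The skipping logic can be sketched with per-file min/max statistics held in memory. The `FileStats` class, the `status` column, and the file names are hypothetical; a real engine would read these statistics from the file format's metadata.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    name: str
    min_status: int  # smallest value of the "status" column in the file
    max_status: int  # largest value of the "status" column in the file

def may_contain(stats, value):
    """False means the file cannot contain `value`; True means it must be read to find out."""
    return stats.min_status <= value <= stats.max_status

files = [
    FileStats("a.parquet", 200, 204),
    FileStats("b.parquet", 200, 503),
    FileStats("c.parquet", 301, 404),
]

# Filter: status == 500. Only files whose range includes 500 are kept.
candidates = [f.name for f in files if may_contain(f, 500)]
```

Here two of the three files are eliminated from metadata alone. Min/max statistics prune best when the data is sorted or clustered on the filtered column, so that value ranges in different files overlap as little as possible.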

Bloom Filters

A Bloom filter is a probabilistic data structure that answers one question: is a value definitely not present, or possibly present, in a dataset?

When a file's Bloom filter returns "definitely not," the system skips that file entirely — no scan needed. Compared to inverted indexes, Bloom filters are smaller and more lightweight. They don't pinpoint exact data locations, but they're highly effective at eliminating irrelevant files before any data is transferred.
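Here is a minimal Bloom filter sketch to illustrate the "definitely not / possibly" behaviour. The bit-array size, the number of hash functions, and the use of seeded SHA-256 hashes are arbitrary choices for illustration; real implementations size the filter from the expected number of values and a target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        """Derive `num_hashes` bit positions for a value from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )

# One filter per file, built from the values of a field in that file.
file_filter = BloomFilter()
for request_id in ["req-123", "req-456"]:
    file_filter.add(request_id)

# A value that was added is never reported as absent.
present = file_filter.might_contain("req-123")
```

A query for a request ID would check each file's filter first and only fetch the files where `might_contain` returns true. False positives cost an unnecessary read; false negatives cannot happen.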

Dictionary Encoding

Dictionary encoding optimises storage and search for key-value pairs where values have low cardinality: country names, log levels, environment tags, and so on. Instead of storing the full value in every row, a compact reference (an index into a dictionary) is stored, and the actual values live in a separate dictionary.

This reduces storage size and enables a query optimisation: if a query filters on a value that doesn't appear in a file's dictionary for that key, the column in that file can be skipped entirely.
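Both effects can be seen in a small sketch. The `dictionary_encode` helper and the sample log levels are illustrative assumptions; real formats also bit-pack or run-length encode the integer codes.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): distinct values in first-seen order, plus one index per row."""
    dictionary = []
    positions = {}
    codes = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        codes.append(positions[v])
    return dictionary, codes

levels = ["INFO", "INFO", "ERROR", "INFO", "WARN", "ERROR"]
dictionary, codes = dictionary_encode(levels)

# Decoding restores the original column.
decoded = [dictionary[c] for c in codes]

# Query: level == "DEBUG". The dictionary alone shows no row can match.
can_skip = "DEBUG" not in dictionary
```

The column shrinks to three strings plus six small integers, and the check against the dictionary answers the "can this file match?" question without touching the codes.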

Conclusion

Developing a storage strategy for a large-scale log management system demands deep expertise and a clear understanding of data ingestion and access patterns. The choices made at the storage layer directly shape what's possible — and what it costs — at the ingestion and search layers.

Bronto combines file storage for aggregation and blob storage for long-term retention, and borrows techniques from databases and analytics engines — partitioning, Bloom filtering, predicate pushdown, and dictionary encoding — to achieve high search performance at low cost.

In Part III, I'll focus on the approaches and economics of search, and detail how Bronto uses AWS Lambda to provide a fast, cost-effective way to process large volumes of data stored in S3.
