The Log Management Cost Trap: Ingestion

#logging #devops #infrastructure #observability

Authored by Benoit Gaudin

For systems with low log data volumes, self-hosting open-source solutions or using SaaS free plans are often excellent starting points. But as data volume grows, the complexity and cost of running these solutions often make them unviable.

This post is for you if your logging costs have risen to a point where you're hesitant to send more data, or are excluding certain sources because of what they'd cost to ingest. At that point you're typically faced with two options: invest resources to reduce costs within your existing solution (reducing retention, archiving data, etc.), or build your own logging system for better cost control.

For centralised log management systems, the sheer volume of data and its unstructured nature are typically the biggest factors driving cost and complexity. I break these challenges down into three key areas:

Ingesting large volumes of data
Storing large volumes of data
Querying large volumes of data

These challenges are closely related — design decisions in one area directly impact the others. This post focuses on ingestion. Storage and search will be tackled in follow-up posts.

Ingestion

Ingestion is the part of the system that receives data and processes it to make it searchable. Because of the volumes involved, log management solutions share many similarities with data analytics engines like Hadoop or Spark — but with one critical difference: data must be searchable in real time, or with minimal delay.

This freshness requirement exists because log management supports urgent troubleshooting use cases. In a production incident, engineers need access to logs from the last few minutes immediately — they can't wait for data to be batched. At the same time, other use cases (like browser version analysis across months of traffic) don't require fresh data at all.

Because log management must support both real-time troubleshooting and analytical queries over large historical datasets, it can't rely solely on off-the-shelf analytics platforms. The ingestion pipeline has to be designed with both speed and scale in mind.

Reliability

Upon receiving data, the system must acknowledge receipt and ensure the data is handled safely, so that none of it is lost. Mechanisms like data buffering must be in place to handle temporary issues, such as a slow or unavailable downstream component, gracefully.

Apache Kafka is an effective and commonly used solution for data buffering at scale, integrated into many log management solutions including ELK, Datadog, and Honeycomb. A Kafka layer in the ingestion pipeline allows the system to absorb temporary processing impediments without data loss.
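To make the buffering idea concrete, here is a minimal in-memory sketch. It is not Kafka and not any vendor's pipeline: the `BufferedIngest` class, its `capacity` parameter, and the `flaky` processor are hypothetical names chosen for illustration, and a plain deque stands in for a durable queue. It shows two properties described above: an event is acknowledged only once the buffer holds it, and an event leaves the buffer only after downstream processing succeeds.

```python
from collections import deque


class BufferedIngest:
    """Hypothetical receiver: acknowledge only after the event is buffered."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()  # stands in for a durable queue such as a Kafka topic

    def receive(self, event):
        # Refuse (no ack) when full so the sender knows to retry later.
        if len(self.buffer) >= self.capacity:
            return False
        self.buffer.append(event)
        return True  # ack: the buffer now holds the event

    def drain(self, process):
        """Hand buffered events to `process`; keep them if processing fails."""
        done = 0
        while self.buffer:
            event = self.buffer[0]
            try:
                process(event)
            except Exception:
                break  # downstream is unhealthy: stop and keep the event
            self.buffer.popleft()  # remove only after successful processing
            done += 1
        return done


ingest = BufferedIngest(capacity=3)
acks = [ingest.receive(f"line {i}") for i in range(4)]  # the 4th is refused

indexed = []


def flaky(event):
    raise RuntimeError("indexer down")


ingest.drain(flaky)           # downstream failing: nothing is lost
ingest.drain(indexed.append)  # downstream recovered: the backlog is processed
```

A real deployment would persist the buffer to disk and replicate it, which is the part Kafka provides and this sketch leaves out.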

That said, efficient Kafka cluster management requires real expertise. Even with managed cloud offerings like Amazon MSK, the overhead can be substantial and costly at large data volumes.

Indexing and Partitioning

When ingesting log data, how you organise it in the backend directly determines how it can be searched later. Two main approaches exist:

Index-based

Systems like Elasticsearch and OpenSearch build indexes that point to exact locations of relevant data. This offers good search performance but typically requires extracting key-value pairs from logs (e.g. via Logstash in the ELK stack) — and the index itself can grow to a significant size.
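The index-based idea can be sketched with a toy inverted index. This is not how Elasticsearch or OpenSearch are implemented; the `PAIR` regular expression, the `build_index` and `search` functions, and the sample log lines are all assumptions made for illustration. The sketch extracts key-value pairs from each line and maps every pair to the lines that contain it, so a lookup never scans the raw data.

```python
import re
from collections import defaultdict

# Hypothetical pattern for key=value pairs embedded in a log line.
PAIR = re.compile(r"(\w+)=(\S+)")


def build_index(lines):
    """Map each (key, value) pair to the line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for key, value in PAIR.findall(line):
            index[(key, value)].add(line_no)
    return index


def search(index, lines, key, value):
    """Return matching lines by index lookup, without scanning `lines`."""
    return [lines[n] for n in sorted(index.get((key, value), ()))]


logs = [
    "ts=1 level=info service=api msg=started",
    "ts=2 level=error service=api msg=timeout",
    "ts=3 level=error service=db msg=deadlock",
]
index = build_index(logs)
errors = search(index, logs, "level", "error")
index_entries = len(index)  # distinct (key, value) pairs held by the index
```

Even in this tiny example the index holds more entries than there are log lines, which hints at why index size becomes a cost concern at volume.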

Partition-based

No index is involved. Instead, data is organised so that large portions can be skipped entirely at query time. Most log management solutions partition by time range, since log data is timestamped and queries almost always specify a time window.

Some solutions go further and partition on additional attributes beyond time. Grafana Loki, which groups data into streams by label set, and Amazon Athena are good examples. Athena queries data stored on S3 and can use a separate prefix per partition to avoid full-dataset scans.
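As a rough illustration of partition pruning, the sketch below groups events under a prefix built from a service name and an hour, loosely modelled on the key=value prefixes commonly used for partitioned data on S3. The layout, the `partition_key`, `hours_between`, and `query` functions, and the sample data are assumptions, not the scheme of any particular product. A query computes which prefixes can contain matching data and never touches the rest.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def partition_key(ts, service):
    """Hypothetical prefix layout: one partition per service per hour."""
    return f"service={service}/hour={ts:%Y-%m-%dT%H}"


def hours_between(start, end):
    """Yield each hour touched by the half-open window [start, end)."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        yield hour
        hour += timedelta(hours=1)


def query(partitions, service, start, end):
    """Read only partitions whose prefix can match; skip everything else."""
    hits = []
    for hour in hours_between(start, end):
        for ts, line in partitions.get(partition_key(hour, service), []):
            if start <= ts < end:
                hits.append(line)
    return hits


partitions = defaultdict(list)
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
for i in range(48):  # two days of data, one event per hour per service
    ts = base + timedelta(hours=i, minutes=30)
    for service in ("api", "db"):
        partitions[partition_key(ts, service)].append((ts, f"{service} event {i}"))

result = query(partitions, "api", base + timedelta(hours=5), base + timedelta(hours=7))
```

Here the store holds 96 partitions, and the two-hour query for one service reads only two of them.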

The hybrid approach

Relying on indexing alone is expensive, because building indexes is a heavy task. Partitioning alone may not narrow the dataset efficiently enough. Datadog Husky uses a hybrid approach, and at Bronto we believe this is the right pattern: it provides multiple levers for tuning performance and cost independently.
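One way to picture a hybrid layout is a small index kept inside each time partition. The sketch below is purely illustrative and does not describe the internals of Datadog Husky or Bronto; the `ingest` and `search` functions, the integer hour keys, and the whitespace tokenisation are assumptions. Partition pruning narrows the search to a time range first, then the per-partition index finds matching lines without scanning them.

```python
from collections import defaultdict


def ingest(store, hour, line):
    """Place the line in its time partition and index its tokens there."""
    part = store.setdefault(hour, {"lines": [], "index": defaultdict(set)})
    part["lines"].append(line)
    position = len(part["lines"]) - 1
    for token in set(line.split()):
        part["index"][token].add(position)


def search(store, first_hour, last_hour, token):
    hits = []
    for hour in range(first_hour, last_hour + 1):  # lever 1: partition pruning
        part = store.get(hour)
        if part is None:
            continue
        for pos in sorted(part["index"].get(token, ())):  # lever 2: index lookup
            hits.append(part["lines"][pos])
    return hits


store = {}
ingest(store, 9, "api timeout on checkout")
ingest(store, 10, "db deadlock detected")
ingest(store, 10, "api timeout on login")
ingest(store, 11, "api timeout on search")

found = search(store, 10, 11, "timeout")
```

The two levers can be tuned separately: partition granularity controls how much data is skipped, while the choice of what to index controls how much work remains inside each partition.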

Append-only and Compaction

Two competing requirements shape how data gets written:

Fresh data must be available to search quickly — ideally within seconds — meaning it must be written in small increments
Large historical datasets must be searchable efficiently, which favours large files and batch-oriented access patterns

Writing lots of small files creates the classic "small files problem" in analytics workloads: many parallel compute units each making small network requests, which kills throughput. Two techniques address this:

Compaction

Used by Datadog Husky and ClickHouse, among others. Data is first stored in small units, then consolidated into larger ones over time. Since only recent data sits in small units, queries over historical data still read large, consolidated ones.
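A minimal sketch of the compaction step, assuming each storage unit is simply a list of log lines and that a hypothetical `target_size` (in lines) decides when a merged unit is large enough. Real systems compact on disk or in an object store and usually in the background; this only shows the merge logic and that ordering is preserved.

```python
def compact(segments, target_size):
    """Merge consecutive small segments into units of at least target_size lines.

    Each segment is a list of log lines, oldest first. The last output unit
    may be smaller than target_size because it holds the freshest data.
    """
    compacted, current = [], []
    for segment in segments:
        current.extend(segment)
        if len(current) >= target_size:
            compacted.append(current)
            current = []
    if current:
        compacted.append(current)
    return compacted


small = [[f"line {i}"] for i in range(10)]  # ten one-line segments
large = compact(small, target_size=4)
sizes = [len(unit) for unit in large]
```

Ten small units become three, and only the newest one stays small, which matches the observation that small units are limited to recent data.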

Append-only

Data is incrementally added to a growing unit. This is easy on a file system but problematic with object stores like Amazon S3, where objects generally can't be appended to and the entire object must be rewritten on every update. This impacts both performance and ingestion cost.
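The cost of rewriting a whole object on every update can be estimated with simple arithmetic. The sketch below assumes fixed-size chunks and counts only bytes uploaded; the function names and the 1 MiB chunk size are illustrative, and request pricing is left out. If an object grows by one chunk at a time and is re-uploaded in full each time, the k-th update uploads k chunks, so the total is 1 + 2 + ... + k chunks instead of k.

```python
def bytes_uploaded_with_rewrites(chunk_bytes, chunks):
    """Total upload volume when every append re-uploads the whole object."""
    total = 0
    object_size = 0
    for _ in range(chunks):
        object_size += chunk_bytes  # the object grows by one chunk
        total += object_size        # the full object is written again
    return total


def bytes_uploaded_with_true_append(chunk_bytes, chunks):
    """Total upload volume when only the new chunk is written."""
    return chunk_bytes * chunks


MiB = 1024 * 1024
rewrites = bytes_uploaded_with_rewrites(1 * MiB, 100)
appends = bytes_uploaded_with_true_append(1 * MiB, 100)
amplification = rewrites / appends
```

Under these assumptions, building a 100 MiB object in 1 MiB steps uploads 5050 MiB in total, about 50 times the size of the data itself, and the ratio keeps growing with the number of updates.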

Despite that limitation, object stores are cost-efficient for long-term storage and well-suited to high-parallelism search access.

Bronto's approach

We implemented a two-tier storage solution: data is first appended to local files, making it immediately available to the search engine; once a file reaches a suitable size, it's uploaded to an object store. This avoids compaction entirely while still keeping fresh data searchable.
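The two-tier idea can be sketched as follows. This is not Bronto's implementation: the `TwoTierWriter` class, the `max_bytes` threshold, the segment naming, and the dictionary standing in for an object store are all assumptions made for illustration. Lines are appended to a local file, which readers can see immediately; once the file reaches the threshold it is uploaded as a single object and a new local file is started.

```python
import os
import tempfile


class TwoTierWriter:
    """Hypothetical writer: append locally, upload once the file is big enough."""

    def __init__(self, local_dir, object_store, max_bytes):
        self.local_dir = local_dir
        self.object_store = object_store  # dict standing in for an object store
        self.max_bytes = max_bytes
        self.sequence = 0

    def _path(self):
        return os.path.join(self.local_dir, f"segment-{self.sequence:06d}.log")

    def append(self, line):
        path = self._path()
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")  # visible to local readers right away
        if os.path.getsize(path) >= self.max_bytes:
            self._upload(path)

    def _upload(self, path):
        with open(path, "rb") as f:
            # Each object is written exactly once, at its final size.
            self.object_store[os.path.basename(path)] = f.read()
        os.remove(path)
        self.sequence += 1

    def read_all(self):
        """Search view: uploaded objects first, then the fresh local file."""
        lines = []
        for key in sorted(self.object_store):
            lines += self.object_store[key].decode("utf-8").splitlines()
        if os.path.exists(self._path()):
            with open(self._path(), encoding="utf-8") as f:
                lines += f.read().splitlines()
        return lines


with tempfile.TemporaryDirectory() as tmp:
    store = {}
    writer = TwoTierWriter(tmp, store, max_bytes=30)
    for i in range(5):
        writer.append(f"event number {i}")
    visible = writer.read_all()
    uploaded = sorted(store)
```

In this run, four lines end up in two uploaded objects and the fifth is still in the local file, yet all five are visible to a reader. A production version would also need to handle crashes between the local write and the upload, which this sketch ignores.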

Conclusion

Log management solutions are designed to handle vast amounts of unstructured data — a task that introduces significant cost and complexity. They must serve conflicting use cases: real-time troubleshooting that demands fresh data immediately, and analytical queries that demand efficient access to large historical datasets.

At scale, choosing how to ingest data requires careful attention to the trade-offs between reliability, performance, cost, and system complexity. The expertise required to design, implement, and maintain this pipeline is substantial — and that's before accounting for storage and search.

Subsequent posts will cover those remaining challenges. In the meantime, if your logging costs are already a problem worth solving, it's worth understanding what's driving them at each layer.
