The Log Management Cost Trap: Part III — Search

#logging #devops #infrastructure #observability

Authored by Benoit Gaudin

In Part I (Ingestion) and Part II (Storage) of this series, I explored the challenges of designing, running, and managing a centralised log management solution. In Part III, I'll focus on search.

The Competing Requirements of Log Search

Log data search has two distinct use cases with fundamentally different requirements.

Real-time troubleshooting — when a system outage occurs, engineers need visibility into what caused the issue immediately. Log data must be searchable almost as soon as it's generated. This imposes a hard constraint: batch windows must be short. And short batch windows tend to produce small files.

Large-scale historical analysis — analyzing web or CDN access logs to identify patterns in API usage, track slowly degrading performance trends, or audit activity over weeks or months. Here, data freshness is irrelevant. What matters is the ability to efficiently scan large datasets.

These two use cases create a direct tension. Making data available quickly often means processing small batches and creating many small files — which severely degrades performance when running queries across long time ranges. This is the classic small file problem.
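To put rough numbers on the small file problem, here is a minimal Python sketch. It estimates how many files a long-range query has to open for a given batch window, and greedily plans a compaction of small files into larger ones. The function names, the one-file-per-window assumption, and the target size are illustrative assumptions, not a description of how Bronto compacts data.

```python
def files_in_range(batch_window_seconds, range_days, writers=1):
    """Files a query must open, assuming each writer emits one file per batch window."""
    seconds = range_days * 24 * 60 * 60
    return (seconds // batch_window_seconds) * writers


def plan_compaction(file_sizes, target_size):
    """Greedily group consecutive, time-ordered files until each group reaches target_size.

    Returns a list of groups, each a list of indexes into file_sizes.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)
        current_size += size
        if current_size >= target_size:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover tail that has not reached the target size yet
        groups.append(current)
    return groups
```

Under these assumptions, a 10-second batch window yields 259,200 files per writer over 30 days, whereas hourly files would yield 720. Compaction is one way to keep the short window for fresh data while handing long-range queries far fewer, larger files.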

A good log management solution must balance both: newly ingested data must be searchable almost immediately, yet stored in a format that also supports efficient queries over long time ranges.

Performant and Cost-Effective Search

As covered in Part II, the right data format and storage strategy are the foundation. Key techniques include indexing, Bloom filtering, and data partitioning.

Needle-in-a-haystack queries

Indexing and Bloom filtering shine when searching for data that appears infrequently across a large time range, for example finding a specific trace_id across several terabytes of log data. As explained in "Why is Bronto so fast at searching logs", well-designed indexing and Bloom filtering can dramatically reduce the volume of data scanned: chunks that cannot contain the target value are skipped, and the search narrows to the much smaller subset that might.
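The idea of skipping data with a Bloom filter can be sketched in a few lines of Python. This is a toy illustration only: the class, the `files_to_scan` helper, the filter size, and the hashing scheme are all assumptions made for the example, not Bronto's implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def build_filter(values):
    """Build one filter per file at write time (hypothetical helper)."""
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return bloom


def files_to_scan(filters_by_file, trace_id):
    """Keep only the files whose filter says the value might be present."""
    return [name for name, bloom in filters_by_file.items()
            if bloom.might_contain(trace_id)]
```

A file is only opened when its filter answers "maybe". A "no" is definitive, so skipped files never hide a match; the price is an occasional false positive that causes one unnecessary read.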

Full-scan analytical queries

Some queries can't be narrowed. If you want the maximum response time per endpoint over the past few months, every log entry in that time range must be examined: there's no rare value to isolate, no selective filter to push down, and no partition within the range to skip.

Pre-aggregated summaries could help if you know in advance exactly how users will slice their data. But general-purpose log management systems can't predict every analytical angle users will need, so some full dataset scans are unavoidable.

For these cases, the only viable solution is brute-force compute: massive parallelism and high-performance processing to deliver results even when every record must be touched.
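The brute-force pattern is essentially map and merge: each worker scans one chunk and returns a small partial result, and the partials are combined at the end. Below is a minimal Python sketch for the maximum response time per endpoint. The chunk layout (lists of `(endpoint, response_ms)` pairs) and the function names are assumptions for the example, and the thread pool is only a local stand-in for independent workers.

```python
from concurrent.futures import ThreadPoolExecutor


def max_per_endpoint(chunk):
    """Map step: scan one chunk of (endpoint, response_ms) records."""
    partial = {}
    for endpoint, response_ms in chunk:
        if response_ms > partial.get(endpoint, float("-inf")):
            partial[endpoint] = response_ms
    return partial


def merge(partials):
    """Reduce step: combine per-chunk maxima into one result."""
    merged = {}
    for partial in partials:
        for endpoint, value in partial.items():
            merged[endpoint] = max(value, merged.get(endpoint, value))
    return merged


def parallel_max(chunks, max_workers=8):
    """Fan the chunks out to workers, then merge their partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return merge(pool.map(max_per_endpoint, chunks))
```

Because `max` is associative, chunks can be processed in any order and on any number of workers. Every record is still touched, but wall-clock time shrinks roughly with the degree of parallelism until merging or I/O becomes the bottleneck.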

Bronto's approach: AWS Lambda for bursty workloads

To support demanding full-scan queries while keeping costs in check, Bronto uses AWS Lambda functions. Lambda enables high concurrency — large volumes of data stored in S3 can be processed in parallel, on demand, with no infrastructure to provision or manage in advance.

The cost model is key: you only pay for compute time used. Even when running many functions concurrently, short execution times keep overall cost low. This makes it ideal for bursty, unpredictable workloads.

That said, Lambda isn't always the right tool. When query volume consistently exceeds a certain threshold, sustained compute options like AWS EC2 become more cost-effective. The right architecture uses both: Lambda for bursts, EC2 for the baseline.
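The threshold can be reasoned about with simple arithmetic. The Python sketch below compares pay-per-use compute against a fixed monthly cost for provisioned capacity. All names are hypothetical and every price passed in is a placeholder to replace with current figures; the model also assumes the provisioned fleet can absorb the whole workload, which a real comparison would need to verify.

```python
def on_demand_monthly_cost(queries, gb_seconds_per_query, price_per_gb_second):
    """Pay-per-use cost: billed only for the compute each query consumes."""
    return queries * gb_seconds_per_query * price_per_gb_second


def break_even_queries(provisioned_monthly_cost, gb_seconds_per_query,
                       price_per_gb_second):
    """Monthly query volume at which both options cost the same."""
    return provisioned_monthly_cost / (gb_seconds_per_query * price_per_gb_second)


def cheaper_option(queries, provisioned_monthly_cost, gb_seconds_per_query,
                   price_per_gb_second):
    """Return which option is cheaper for a given monthly query volume."""
    on_demand = on_demand_monthly_cost(
        queries, gb_seconds_per_query, price_per_gb_second)
    return "on-demand" if on_demand < provisioned_monthly_cost else "provisioned"
```

With placeholder inputs of 500 GB-seconds per query, a price of 0.00002 per GB-second, and a provisioned cost of 1,000 per month, the break-even sits at about 100,000 queries per month. Below that volume the on-demand model wins; above it, the sustained baseline is cheaper to serve from provisioned capacity.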

High Cardinality

Log data frequently contains high-cardinality fields — client IP addresses, trace IDs, user IDs. Queries over these fields (e.g. counting unique IP addresses across a large dataset) can lead to slow performance, high memory consumption, and a poor user experience.

A naive solution is to cap the number of unique values the system handles — but that means users simply can't get value from their data beyond the cap.

A better approach: compute exact results up to a certain cardinality threshold, then switch to approximations when cardinality genuinely becomes too large to handle exactly. Several probabilistic data structures make this practical:

HyperLogLog — approximate distinct counts
Count-Min Sketch — approximate frequency counts
Cuckoo Filter — approximate set membership
Top-K — approximate top values by frequency

This approach keeps resource consumption bounded while still giving users meaningful, actionable insights from high-cardinality data.
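The exact-then-approximate strategy can be sketched for distinct counts as follows. This is a minimal illustration in Python, assuming a plain set for the exact phase and a basic HyperLogLog without the bias corrections production implementations add. The class name, the `exact_limit` threshold, and the precision are assumptions for the example, not Bronto's implementation.

```python
import hashlib
import math


class HybridDistinctCounter:
    """Exact distinct count up to exact_limit, HyperLogLog estimate beyond it."""

    def __init__(self, exact_limit=1000, precision=12):
        self.exact_limit = exact_limit
        self.p = precision           # number of bits used to pick a register
        self.m = 1 << precision      # number of registers
        self.exact = set()
        self.registers = None        # allocated only when we switch modes

    @staticmethod
    def _hash64(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _hll_add(self, value):
        h = self._hash64(value)
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def add(self, value):
        if self.registers is not None:
            self._hll_add(value)
            return
        self.exact.add(value)
        if len(self.exact) > self.exact_limit:
            # Too many distinct values: replay them into the sketch.
            self.registers = [0] * self.m
            for seen in self.exact:
                self._hll_add(seen)
            self.exact = None

    def count(self):
        """Return (count, is_exact)."""
        if self.registers is None:
            return len(self.exact), True
        alpha = 0.7213 / (1 + 1.079 / self.m)   # constant valid for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate)), False
```

Memory stays bounded: at most `exact_limit` values in the exact phase, then a fixed array of 4,096 small registers however many distinct values arrive. Returning the `is_exact` flag lets a UI tell users when a number is an estimate; with this precision the typical relative error is on the order of 1.04 / sqrt(4096), roughly 1.6%.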

Conclusion

This wraps up the three-part Log Management Cost Trap series. Across ingestion, storage, and search, the same theme emerges: design decisions in one layer constrain and shape what's possible in the others. Trade-offs are unavoidable, and navigating toward an optimal solution requires deep experience across all three.

Bronto brings 150+ years of combined experience in log management at scale and builds that experience into a platform designed to be cost-efficient, high-performance, and ready for logging in the AI era.
