Sergei Prosvirnin

Posted on Jun 30 • Originally published at blog.triplecloud.tech

Can you build observability ingestion on S3 alone — no Kafka, no disks, no coordination layer?

#rust #observability #aws #architecture

TL;DR — A Kafka + Flink + OTel ingestion pipeline cost us ~$700–800/month at 10 MB/s. We rebuilt it as a single binary where the data, the write-ahead log, and the Iceberg catalog all live in S3 alone — no Kafka, no local disks, no coordination service — for ~$100/month. Here's the design.

Self-hosted observability sooner or later runs into the problem of storing state. Query load, CPU, and data volume can all be handled by scaling out, but the stateful layer is something you have to operate by hand. At first it's almost unnoticeable: a disk degrades here, replication falls behind there, a recovery hangs somewhere else. As the data grows, incidents stop being one-offs and start to recur. At some point your observability stack - whether it's Grafana Loki, Elastic, or ClickHouse - starts demanding the same attention as a full-blown database that you're on the hook for.

Kubernetes operators cover some of these cases, but operating the state is still on you. Managed solutions take that burden away and bring their own: rising costs, ingestion-pipeline constraints, and limits on retention and cardinality.

But if you'd rather not sign up for the constant operational grind - or live with the constraints of managed solutions - it's worth asking: can we take the stateful part out of operations entirely?

Storage and format

The first candidate for offloading storage responsibility is Amazon S3. S3 gives you what local disks can't: fault tolerance and practically unlimited scale, with no storage to manage yourself. It isn't free, though: data-access latency goes up, and you pick up separate costs for API operations. For OLTP workloads that's a dealbreaker. For observability workloads - which are dominated by sequential writes and analytical reads - these trade-offs are often acceptable.

At first glance, this problem is already solved. Loki, for example, uses S3 as its primary storage. But according to Loki's public documentation (v3.6.x) at the time of writing, Loki doesn't remove the stateful layer: ingesters buffer writes in memory and in a WAL on a local disk, and only then flush them to S3. So a stateful component remains, and it has to be operated. On top of that, Loki only covers part of observability - you still have to assemble the other, separate pieces to get a complete stack (traces, metrics, events). Loki does guarantee durability for acknowledged writes through its WAL and replication factor, but that durability rests on local disks and replication, which you have to maintain.

Once you treat S3 as your storage, the next question is what format to store the data in. We need a model that:

works efficiently with an append-only workload;
lets you add new data atomically;
filters data well at large volumes.

Apache Iceberg fits these requirements well. Iceberg stores data as immutable Apache Parquet files and adds them through atomic commits, so readers always see a consistent snapshot. A separate metadata layer prunes files by their statistics before the data itself is ever read, and those statistics can be extended to match an observability filtering profile.

Writing

Iceberg handles storage and reads, but writing to it is more involved. Before a commit, the data has to be gathered into batches, sorted, have its statistics computed, written to Parquet, and committed atomically. That's an asynchronous process. The client, meanwhile, expects a fast acknowledgment that its data has been accepted and won't be lost. This creates a mismatch: writing to Iceberg is slow, whereas ingest has to be fast and must not lose acknowledged data.

The standard move is to put an ingestion layer like Kafka in front of Iceberg. But that just pushes the stateful part into a separate layer again. We tried a Kafka + Flink + OpenTelemetry collector setup, and in our configuration it cost us ~$700–800/month at around 10 MB/s of traffic (storage not included). In other words, the problem doesn't go away - it just moves to another component.

We wanted a simpler design:

accept observability records and persist them to a WAL in S3;
acknowledge the client only after the data is safely stored in S3;
asynchronously collect data from the WAL, group it into batches, and write it to Iceberg.

This raises the central question: how do you distribute that processing across workers without standing up a separate coordination layer? One option is centralized orchestration - a control component that sees the whole pipeline state and makes the decisions: which WAL files to process, who performs the next step, and when a commit can happen. The catch with this approach is that you need leader election and correct handling of leader failure.

The other option is a single shared state in S3 with independent, scalable workers. You can build it using S3's conditional requests. The If-Match header on an S3 request gives you compare-and-swap semantics: the write only goes through if the object's version (its ETag) matches the one you expect. If another worker changed the state first, the operation fails. This lets you use S3 not just as a WAL, but as a coordination layer as well.

On top of this you can build a job manager:

one worker plans which WAL files are ready to be moved over;
other workers read those files and assemble Parquet for Iceberg;
a dedicated worker performs the final batch commit to Iceberg.

At a high level, that gives you a pipeline for logs like this:

clients send batches of logs over the OTLP protocol;
we return OK only after the data is stored in the WAL - that is, in S3;
the job manager's workers periodically go to S3 and move the data into Iceberg.

Losing a single worker doesn't break the pipeline: task state lives in S3, so another worker can pick up the processing. This architecture lets you scale the number of workers up and down with the load.

The result is distributed processing with no Kafka, no separate database for ingest, and no dedicated coordination service - just a single binary and S3. That binary is the first key component of our observability engine, called IceGate.

The cost profile reflects that. In our tests, this ingest path handled the same ~10 MB/s on a single small instance, with cost dominated by S3 API requests - on the order of ~$100/month (storage not included), against the ~$700–800/month of the Kafka + Flink + OTel setup it replaces.

Iceberg Catalog

To track where tables live and to add data to them atomically, Iceberg has a dedicated mechanism: the Catalog. But the common implementations (Project Nessie, Lakekeeper, Apache Polaris) typically rely on an external database or metastore to hold the catalog state. With an observability profile, the catalog won't be a heavily loaded component - batch commits every N seconds, a single commit worker - so why not use S3 to store the catalog metadata too?

Using the same CAS mechanism via If-Match, you can build an Iceberg catalog whose state lives in S3: the entire catalog is a single root.json object (the source of truth), and If-Match guarantees atomic updates. That removes the last dependency on a database.

Reading

So much for writing. But an observability system isn't just ingest - it's reads too. And here the key question is: can you even read data directly from S3 without paying tens of seconds of latency?

The intuitive answer is no. The practical answer is that it doesn't come down to S3. Reads are bound less by S3 access latency than by how efficiently queries are planned. The more files we read from S3, the longer the client's query takes, the more we pay for the S3 API, and the more resources the query service burns - everything we read has to be processed.

Iceberg gives you a baseline level of optimization: you can skip files that definitely don't contain the rows you need. Thanks to metadata - partitioning and statistics - most files are discarded before they're ever read. Our job comes down to using that metadata as effectively as possible to limit the number of file reads, and to read with as much parallelism as possible. But how effectively you can skip files depends directly on how those files were formed. If your Parquet files are too small, for instance, both the metadata size and the number of reads grow. So the ingest service is tuned to keep files at a reasonable size, and a compaction process consolidates them further.

Rust is the best fit for building the read engine. First, Rust lets you use resources efficiently - parallelism, no GC - and work with large volumes of data. Second, the Rust ecosystem is well suited to the task: there are native crates for Apache Arrow, Parquet, and Iceberg and, crucially, a ready-made columnar query engine in Apache DataFusion. You don't have to build the planner and the metadata handling from scratch.

So DataFusion together with Iceberg gives you a solid starting point. DataFusion can build optimal query plans, working directly with Iceberg metadata to skip file reads. It also operates at the level of the Parquet files themselves. Inside a Parquet file the data is split into sorted groups, so - using the Parquet statistics - you can read only the relevant groups within a file. That way you read just part of a file rather than the whole thing, cutting the time spent reading and preparing data.

But at high record volumes that still isn't enough - you need caching and prefetching. These become essential once the data has to be fetched from S3. Caching is needed both for Iceberg metadata (to skip file reads) and for Parquet file metadata (to read only the needed row groups within a file). In our runs, metadata caching and prefetching brought attribute-value lookups down from tens of seconds to on the order of one second.

Where we are now

What we ended up with is IceGate, an observability engine that stores everything in S3 alone and depends on infrastructure maintenance as little as possible. The pipeline itself is signal-agnostic: logs, metrics, traces, and events are just different Iceberg tables in a single engine, accessible through familiar APIs and SQL. There's no need to run separate stacks per signal type, each with its own storage, limits, and operational overhead.

IceGate - the open-source engine behind TripleCloud - is currently at the prototype stage and under active development: the logs and traces pipelines already work, and metrics are on the way.

If the idea of observability without stateful infrastructure resonates with you, you can try it locally (Docker / Kubernetes) and star the repo:

icegatetech / icegate

IceGate is an observability data lake engine

IceGate

An Observability Data Lake engine designed to be fast, easy-to-use, cost-effective, scalable, and fault-tolerant.

IceGate supports open protocols and APIs compatible with standard ingesting and querying tools. All data is persisted in Object Storage, including WAL for ingested data, catalog metadata, and the data layer.

Features

Highly Scalable: Scale compute resources independently based on workload demands of specific components
ACID Transactions: Full transaction support without requiring a dedicated OLTP database
Exactly-Once Delivery: Reliable data ingestion with no data loss or duplication
Real-Time Queries: Access live data through WAL while maintaining historical query capabilities
Open Standards: Built on Apache Iceberg, Arrow, and Parquet with OpenTelemetry protocol support
Cost-Effective: Object storage-based architecture minimizes infrastructure costs
Fault-Tolerant: Designed for resilience and high availability

Architecture

IceGate employs a compute-storage separation architecture, allowing independent scaling of processing and storage resources This design enables cost-effective scaling where compute…

View on GitHub

There's still plenty left to solve; you can find the target architecture at github.com/icegatetech. And if you'd like to contribute - you're welcome.

Where would this design break for your workload? I'd love to hear it in the comments - especially the failure modes I haven't hit yet.

FAQ

Can you read data directly from S3 without tens of seconds of latency?
Yes. Read latency is bound less by S3 access than by how well queries are planned. Iceberg metadata and Parquet statistics let the engine skip most files before they're ever read, and metadata caching plus prefetching brought attribute-value lookups down from tens of seconds to about one second in our runs.

How do workers coordinate without Kafka or a separate database?
Through S3 itself. S3 conditional requests provide compare-and-swap semantics: the If-Match header makes a write succeed only if the object's ETag matches the version you expect. That lets independent workers claim and hand off jobs through shared state in S3 - no leader election, no dedicated coordination service.

What makes writes durable without a local WAL on disk?
The write-ahead log lives in S3. The ingestor acknowledges a client only after its data is safely stored in the S3 WAL, so an acknowledged write inherits S3's durability by the time the client sees OK.

What is IceGate?
IceGate is a single-binary observability engine that stores logs, metrics, traces, and events as Apache Iceberg tables on Amazon S3 - the data, the WAL, and the Iceberg catalog all live in S3, with no separate database or coordination service to operate.

Originally published on the TripleCloud blog.

Comparisons reflect our own tests in the configurations described and our reading of public docs as of the publication date. Apache®, Apache Iceberg™, Kafka®, Flink®, Parquet™, Arrow™, DataFusion™, Polaris™ are trademarks of The Apache Software Foundation; Grafana® and Loki® of Grafana Labs; other names belong to their respective owners. IceGate and TripleCloud are not affiliated with or endorsed by any of the projects mentioned.

Top comments (1)

Evgenii Mineev • Jul 13

So true! Data lakes are not just about analytics. Observability has to be affordable for everyone because every product deserves to be high-quality. I don’t know of any other storage option that’s cheaper or more convenient at the same time then S3.