S3-native streaming isn't a new idea. WarpStream, AutoMQ, and S2 are all betting on the same thesis: object storage is durable and cheap enough to replace broker-local disks for event streaming.
I've been building my own take on this called StreamHouse. It's open source (Apache 2.0), written in Rust, and I want to walk through the actual architecture rather than just pitch it.
What "S3-native" means in practice
The write path:
- Producer sends a record
- Record gets appended to a local WAL (fsync'd to disk)
- Records accumulate in an in-memory buffer
- Buffer hits a size or age threshold, flushed as a compressed segment to S3
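The flush decision in the last step can be sketched as a buffer that trips on either threshold. This is a minimal illustration, not StreamHouse's actual code; all names and thresholds here are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Illustrative segment buffer: records accumulate in memory and a flush
/// is triggered when either a size or an age threshold is crossed.
struct SegmentBuffer {
    records: Vec<Vec<u8>>,
    bytes: usize,
    opened_at: Instant,
    max_bytes: usize,  // e.g. a few MiB per segment
    max_age: Duration, // e.g. a few hundred ms before a forced flush
}

impl SegmentBuffer {
    fn new(max_bytes: usize, max_age: Duration) -> Self {
        Self { records: Vec::new(), bytes: 0, opened_at: Instant::now(), max_bytes, max_age }
    }

    /// Append a record (after it has already been fsync'd to the local WAL).
    fn append(&mut self, record: Vec<u8>) {
        self.bytes += record.len();
        self.records.push(record);
    }

    /// A flush is due when either threshold is hit.
    fn should_flush(&self) -> bool {
        self.bytes >= self.max_bytes || self.opened_at.elapsed() >= self.max_age
    }

    /// Drain the buffer; the caller would compress this and upload it to S3.
    fn take_segment(&mut self) -> Vec<Vec<u8>> {
        self.bytes = 0;
        self.opened_at = Instant::now();
        std::mem::take(&mut self.records)
    }
}
```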
Reads pull segments from S3. Metadata (topics, offsets, consumer groups) lives in Postgres or SQLite. Agents are stateless; they just need S3 credentials and a metadata connection.
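The read path reduces to a metadata lookup followed by an object fetch: find the segment whose offset range covers the requested offset, then pull that object from S3. A hypothetical sketch of the lookup (field and key names are illustrative, not StreamHouse's schema):

```rust
/// Illustrative metadata row: which offset range a stored segment covers,
/// and where the segment object lives in S3.
struct SegmentMeta {
    base_offset: u64,
    last_offset: u64,
    s3_key: String,
}

/// Find the segment covering `offset`, given segments sorted by base_offset.
/// A stateless agent would run this against the metadata store, then issue
/// a GET for the returned s3_key.
fn segment_for_offset(segments: &[SegmentMeta], offset: u64) -> Option<&SegmentMeta> {
    // Binary search: skip every segment that ends before the offset,
    // then check the candidate actually starts at or before it.
    let idx = segments.partition_point(|s| s.last_offset < offset);
    segments.get(idx).filter(|s| s.base_offset <= offset)
}
```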
No inter-broker replication. No partition reassignment when nodes change. Storage scales independently from compute.
Where StreamHouse sits in the space
WarpStream proved the S3-native model works. AutoMQ took a different angle, building on top of the Kafka codebase. S2 is focused on being a storage primitive.
StreamHouse is:
- Built from scratch in Rust. No Kafka codebase underneath, no JVM.
- Multi-protocol. Kafka wire protocol, REST, and gRPC all serving the same data. Existing Kafka clients connect without changes.
- Batteries included. SQL engine with window functions, schema registry, multi-tenancy, consumer groups, log compaction, transactions. Not just a log, closer to a full platform.
- Open source and self-hostable. Or you can use the hosted version if you don't want to manage it yourself.
Whether any of that matters depends on your use case. If you want a managed service and don't care about self-hosting, WarpStream might be the better pick. If you want to run it yourself, inspect the code, and not depend on a vendor, that's what I'm building for.
The durability model
Two ack modes:
acks=leader: confirmed after WAL fsync. ~2.2M records/sec. There's a window where data exists only on local disk before the S3 flush. WAL protects against process crashes, but a full disk failure in that window means data loss.
acks=durable: confirmed after S3 upload. Multiple producers batch into a 200ms window and share a single upload. Slower, but the data is in S3 before the producer gets an ack.
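The acks=durable batching can be sketched as a window that the first pending request opens: everything that arrives before it closes shares one upload, and all of those producers are acked together once that upload lands in S3. A minimal sketch under those assumptions (names are hypothetical):

```rust
use std::time::{Duration, Instant};

/// Illustrative batcher for acks=durable: producer requests arriving
/// within the same window share a single S3 upload.
struct DurableBatcher {
    window: Duration,       // e.g. 200ms
    opened_at: Option<Instant>,
    pending: Vec<u64>,      // producer request ids awaiting the shared upload
}

impl DurableBatcher {
    fn new(window: Duration) -> Self {
        Self { window, opened_at: None, pending: Vec::new() }
    }

    /// Enqueue a producer request; the first request opens the window.
    fn enqueue(&mut self, request_id: u64) {
        if self.opened_at.is_none() {
            self.opened_at = Some(Instant::now());
        }
        self.pending.push(request_id);
    }

    /// Once the window has elapsed, drain everything pending into one
    /// upload. The caller acks all of these requests only after that
    /// upload to S3 succeeds.
    fn drain_if_due(&mut self, now: Instant) -> Option<Vec<u64>> {
        match self.opened_at {
            Some(t) if now.saturating_duration_since(t) >= self.window => {
                self.opened_at = None;
                Some(std::mem::take(&mut self.pending))
            }
            _ => None,
        }
    }
}
```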
The hard part: metadata vs data consistency
If metadata says a segment exists but S3 doesn't have it (or vice versa), you have a problem. This is the thing that bit me early on and that people rightly called out.
Orphan cleanup: A background reconciler diffs S3 against metadata periodically. Orphans get a 1-hour grace period so it doesn't race with in-progress uploads, then get cleaned up.
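The reconciler's core is a set difference with an age filter: a segment in S3 that metadata doesn't know about is an orphan, but only orphans older than the grace period are deleted, so an in-progress upload that hasn't been registered yet isn't raced. A hedged sketch (function and field names are illustrative):

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

/// Illustrative reconciler diff: return S3 keys that are absent from
/// metadata AND older than the grace period. Fresh orphans are left
/// alone because their registration may still be in flight.
fn orphans_to_delete(
    s3_segments: &[(String, SystemTime)], // (key, last_modified) from an S3 listing
    metadata_segments: &HashSet<String>,  // keys the metadata store knows about
    grace: Duration,                      // e.g. 1 hour
    now: SystemTime,
) -> Vec<String> {
    s3_segments
        .iter()
        .filter(|(key, modified)| {
            !metadata_segments.contains(key)
                && now.duration_since(*modified).map_or(false, |age| age >= grace)
        })
        .map(|(key, _)| key.clone())
        .collect()
}
```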
Full disaster recovery: If you lose your metadata store, the server rebuilds on startup. It discovers orgs from S3 prefixes, restores from automatic metadata snapshots (saved to S3 after every segment flush), and reconciles any gaps. I have a 5-phase test suite that deletes all metadata and verifies full recovery.
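Because a metadata snapshot is written to S3 after every segment flush, the reconcile step after a snapshot restore only has to cover segments flushed since that snapshot. A hypothetical sketch of that gap computation (sequence numbers and names are illustrative, not the actual recovery code):

```rust
/// Illustrative gap reconcile: after restoring the newest metadata
/// snapshot, only segments flushed after the snapshot's sequence number
/// still need to be re-registered in the metadata store.
fn segments_to_reconcile(latest_snapshot_seq: u64, s3_segment_seqs: &[u64]) -> Vec<u64> {
    s3_segment_seqs
        .iter()
        .copied()
        .filter(|&seq| seq > latest_snapshot_seq)
        .collect()
}
```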
What's in it besides the storage engine
- SQL queries over topic data with TUMBLE, HOP, SESSION window functions
- Kafka transactions with exactly-once through the standard wire protocol
- Log compaction with tombstone handling
- Schema registry for Avro, JSON Schema, Protobuf
- Multi-tenancy with org isolation, quotas, API key auth
- CLI (streamctl) for topic and consumer group management
What's missing
- Connectors. Zero shipped today. Snowflake, ClickHouse, Postgres, Parquet/Iceberg are next. This is the biggest gap.
- Tail latency. S3 reads are inherently slower than local disk. This is a throughput/cost architecture, not a low-latency one.
- Production mileage. It's tested (chaos tests, DR tests, load tests, 17-phase e2e suite) but it's a solo project. Nobody else is running it in production yet.
Source: https://github.com/gbram1/streamhouse
Website and how it works: https://streamhouse.app/how-it-works
If you're interested in the S3-native streaming space or want to dig into the internals, I'd love feedback. And if you have opinions on which connectors matter most, I'm all ears; that's the next big piece of work.