DEV Community

Discussion on: Event Storage in Postgres

 
Kasey Speakman • Edited

I found a possible solution: ctid.

ctid is a system column which all postgres tables have. It contains the physical location of the row in the table file. It will be in insert order initially, but will change as records are updated and deleted and auto-vacuum is run. Since this implementation has rules to prevent updates and deletes, it should always be in insert order.

However, there are some issues. Ordering by ctid can be less performant than ordering by a primary key. The main issue I found is that queries ordered by ctid cannot be index-only scans, only regular index scans (because of technical reasons around HOT chains). I haven't measured whether that makes a difference here.

Also, if you ever do update or delete+autovacuum events, the ctids will change and no longer be in insert order. So if ctids are used for ordering, the only option to patch data is to copy events to a new table, modifying or omitting rows during the copy as necessary. (This was always our plan of last resort rather than modifying events, but maybe not everyone's.) Even then, event changes can shift the ctids: an updated event could be larger or smaller by enough that a different number of events fits in the 8k block, shifting the ctids of every event that comes after it.

Then you need to determine how listeners will deal with that. View writers are not a problem as long as you don't mind rebuilding all views. Event-sourced process managers should be okay if they didn't listen for the deleted/updated event -- just shift the last-seen position accordingly. Otherwise they will require further analysis.
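As an illustration, patching by copy might look something like this (sketch only; the `events` table layout and the event being dropped are hypothetical):

```sql
-- Read events in physical (insert) order; valid only while rows are
-- never updated or deleted (hypothetical table name).
SELECT ctid, stream, version, event_name, data
FROM events
ORDER BY ctid;

-- Patch by copying to a fresh table: new ctids are assigned in insert
-- order again, with the offending event omitted during the copy.
CREATE TABLE events_patched (LIKE events INCLUDING ALL);

INSERT INTO events_patched (stream, version, event_name, data)
SELECT stream, version, event_name, data
FROM events
WHERE NOT (stream = 'some-stream' AND version = 42)  -- hypothetical event
ORDER BY ctid;
```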

This makes me think that if I ever do need to delete an event (e.g. for legal reasons), it might be better to update it into a tombstone. And maybe I should keep the last event id (stream id + version) instead of the last position on event listeners -- keeping the ordering column as an implementation detail rather than requiring clients to use it directly as I do now. More to think on.
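A tombstone update along those lines might look like this (sketch; 'Tombstoned' and the identifiers are made up). Note that the update itself relocates the row physically, which is exactly why listeners would track stream id + version rather than an ordering column:

```sql
-- Overwrite the payload but keep the event's (stream, version) identity.
UPDATE events
SET event_name = 'Tombstoned',
    data = '{}'::jsonb
WHERE stream = 'some-stream'
  AND version = 42;
```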

Thread Thread
 
ilija

A little bit late, but since it is almost impossible to avoid locking when using global counters, it is perhaps a good idea to think about how to reduce IO operations (round trips between the API and the db). One solution is batching: issuing just one SQL statement per batch. Here is my version:

        with pos as (
            UPDATE benchmark.positioncounter
            SET position = position + ${items.length}
            RETURNING position
        ),
        new_events as (
            SELECT
                stream,
                event_name,
                version,
                data
            FROM json_to_recordset(${items}) as x(
                stream text,
                event_name text,
                version integer,
                data json
            )
        )
        INSERT INTO benchmark.events (
            position,
            stream,
            event_name,
            version,
            data
        )
        SELECT
            (row_number() OVER () + pos.position - ${items.length}) as position,
            stream,
            event_name,
            version,
            data
        FROM new_events, pos

Same idea as suggested by Kasey, but using a CTE and the ability to insert multiple events. To increase performance you can do some batching (inserting, say, 50 events at a time). Using basic benchmarking with batches of 20 events and a concurrency of 50, I can insert 200,000 events in about 20 seconds.

Thread Thread
 
Kasey Speakman • Edited

Thanks for posting this. If my math is right, this is about 10k events per second. This could be a decent increase depending on your hardware. We've gotten about 1k events per second on an AWS t2.micro instance. That's in testing of course. We'll have company growth problems before we get close to that number.

Our use cases are somewhat pathological as far as seeing performance gains from this alone: we only save a handful of events at a time, often just 1. When there are multiple, they need to be saved in the same transaction (all-or-nothing). Sometimes saving an event can and should fail due to optimistic concurrency, but that should not prevent subsequent events from being saved. We would need a batching layer in the API that follows these constraints. Complicated, but it seems doable.

But that's our usage, not everyone's. Once we're looking to raise this limit, we'll probably evaluate system options like sharding, switching to EventStore, etc. By then we should be in a different ballpark revenue-wise too, so would be a great problem to have.

Thread Thread
 
ilija • Edited

I wonder whether you tried advisory locks (which should be faster in general) instead of row locking (on the position row), and if so, whether it improved performance. Using batches has major drawbacks because of the stream + version constraint.

And what about write performance as the table grows very large -- is it consistent, or does it degrade with the number of rows?

Thread Thread
 
Kasey Speakman

I haven't tried an advisory lock. My basic understanding is that advisory locks are simpler than row/table locks. So they should reduce lock/unlock time. But how much, I haven't measured.
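For what it's worth, a transaction-scoped advisory-lock variant might look like this (untested sketch; the lock key, table, and values are arbitrary):

```sql
BEGIN;
-- Serializes position assignment without touching a counter row.
-- The lock is released automatically at commit or rollback.
SELECT pg_advisory_xact_lock(1);

INSERT INTO events (position, stream, version, event_name, data)
VALUES (
    (SELECT coalesce(max(position), 0) + 1 FROM events),
    'some-stream', 1, 'SomethingHappened', '{}'::jsonb
);
COMMIT;
```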

My observation in testing is that write perf hits a degradation breakpoint when the indexes can no longer fit in memory. The bigger they get, the more cache misses, and the more disk IO has to be done. The first fix is scaling up db mem/cpu. When that limit is hit or becomes impractically expensive, it's time to use multiple nodes, e.g. sharding.

The ultimate solution for perf is to have no position at all. The only order that matters is within a stream. Cross-stream ordering is best effort. So you could use something unstable (sometimes wrong) like timestamp. Readers should tolerate some out-of-order events across streams. So for example, no strict foreign key enforcement in tabular views (aka projections). Because the causal event from one stream might mistakenly be ordered after the effect event in another stream. But the events within each stream are totally ordered correctly. Sometimes it can be very convenient to have a stable ordering across streams so I probably wouldn't take this step until perf was a critical constraint.
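A minimal sketch of that position-less shape, assuming made-up names (only per-stream order is guaranteed; cross-stream order rides on a timestamp):

```sql
CREATE TABLE events (
    stream      text        NOT NULL,
    version     integer     NOT NULL,
    event_name  text        NOT NULL,
    data        jsonb,
    recorded_at timestamptz NOT NULL DEFAULT clock_timestamp(),
    PRIMARY KEY (stream, version)   -- total order within a stream
);

-- Readers get exact order within a stream, best-effort across streams.
SELECT stream, version, event_name, data
FROM events
ORDER BY recorded_at, stream, version;
```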

Thread Thread
 
ilija

At how many rows did you see the performance degradation? Is it in the range of 1, 10, or 100 million?

Thread Thread
 
Kasey Speakman • Edited

I can't remember the exact breakpoints, and it would vary anyway based on your hardware and index structures (3 fields vs 1 field, etc.). This is anecdotal memory rather than a scientific experiment. I wrote a chess simulation program that generated unique boards and looked for checkmates. It would start off saving millions of boards per second, but very quickly stabilized in the tens of thousands of writes per second. I was using COPY to insert rows with batch sizes of 1024. I ran it for 2-3 months. When I stopped, the db had over a trillion records and a write rate in the single digits. DB size was getting close to 3TB, and the indexes were bigger than the data table. I learned a lot from that, and I was very impressed with postgres.

Also, the performance drop-off was accelerated by duplicate checking, which itself got more expensive (more IO) as time went on. I was tracking the dupe rate as well, but I don't remember it off the top of my head.

Thread Thread
 
ilija

Interesting. Did you also test your implementation of the eventstore with large amounts of data?

Thread Thread
 
Kasey Speakman • Edited

Frankly, no. We are using it for business ops rather than something like sensor data. If our write ops crossed even 1 event per second as a monthly average, we would already have grown the organization dramatically. The goal is not to ride this solution out to a million ops per second, where it becomes very complicated to use, but to evaluate all the options when we come to the next inflection point.

Though I constructed this and iterate on it, I realize that it has a context where it is useful and instructive, and I am open to other options to meet different needs. I am still looking for perf gains with this, as that also means better efficiency with the same resources. And that means less cost for cloud services, which is especially helpful early in a product life cycle.

So I very much appreciate the code you posted and this discussion. :) And any testing anyone wants to do. I still want to post my updates and the F# libraries I use on top of this at some point. They're still not quite where I want them.

Thread Thread
 
ilija • Edited

Yes, you are correct. I am currently using more or less the same event store implementation in postgres for a SaaS app, and the event rate is really low. However, migrating the data to another system could be painful; you want to make the best decisions as early as possible while keeping the infra manageable, without using many different systems. My worries for postgres are mostly about table size. A rate of 1 event per second already produces 32 million events per year, and after a year or two you still want decent performance for fetching an event stream or inserting an event.

I also found another eventstore implementation for postgres, used with elixir: github.com/commanded/eventstore . They use a different table structure, as can be found here: github.com/commanded/eventstore/bl...

Thread Thread
 
Kasey Speakman • Edited

The nice thing about event sourcing is that events are the source of truth and they are pretty portable. We have copies of the system for multiple environments as I'm sure you do. So it would not be too much of a stretch to work out the integration details with a different solution, spin up a new system with it, save the existing events onto that copy, and validate it.

32 million rows is easy for postgres as far as just storage and insertion perf. One of our products is getting close to a million events after 6 years and it still runs lightly on micro instances. So I have lots of scale-up room before I need to reevaluate. That one used full consistency, so the read models are in the same db and there is write amplification. I used different schemas per tenant so it's not one set of indexes. If that does make a difference, the same could be accomplished with table partitioning.
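The partitioning idea could be sketched like this (illustrative names; list-partitioning by tenant so each partition carries its own smaller indexes, analogous to schema-per-tenant):

```sql
CREATE TABLE events (
    tenant_id  text    NOT NULL,
    stream     text    NOT NULL,
    version    integer NOT NULL,
    event_name text    NOT NULL,
    data       jsonb,
    PRIMARY KEY (tenant_id, stream, version)
) PARTITION BY LIST (tenant_id);

-- One partition per tenant.
CREATE TABLE events_tenant_a PARTITION OF events FOR VALUES IN ('tenant-a');
CREATE TABLE events_tenant_b PARTITION OF events FOR VALUES IN ('tenant-b');
```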

The length of a stream (and therefore replay time) is more in the design camp than the perf camp. I've been meaning to make a post summarizing things I've learned and rules of thumb I use for event stream design. For example I will use unbounded streams for repeated processes. Like a yearly audit. But I don't replay the whole thing. I replay from the last completed audit. So replay sees a bounded number of events every time. And perf doesn't significantly change with years of history.
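The bounded-replay trick can be expressed as a query (sketch; the stream name and the 'AuditCompleted' event are invented for illustration):

```sql
-- Replay only events after the most recent completed audit in the stream.
SELECT version, event_name, data
FROM events
WHERE stream = 'audit-site-42'
  AND version > coalesce(
        (SELECT max(version)
         FROM events
         WHERE stream = 'audit-site-42'
           AND event_name = 'AuditCompleted'),
        0)
ORDER BY version;
```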

I will check that implementation out and see what I can learn. Thanks for linking it! :)

Thread Thread
 
Kasey Speakman

I only looked at the table structures. It looks like it's meant to be eventstore.com on postgres. Interesting idea.

One of the things I didn't mention with scaling limits in postgres is connection limits. Each connection eats a non-trivial amount of server resources. Our AWS t2.micro instances can handle about 85 before they can't make new ones. (Ask me how I know.) But resources allocated to sessions are resources not allocated to running SQL, so we want to stay well below the limit. This is why things like PgBouncer exist.

I want to explore creating my own event store service that accepts ES commands and uses SignalR or WebSockets for listeners, on top of my postgres ES. It could maintain its own limited number of connections to give the db as many resources as possible, assign positions from memory to alleviate that bottleneck, and use opportunistic batching for yet more perf. Practically though, this creates many new downsides: an extra network hop (+latency), new failure modes and recovery models, pub/sub handling. This is all potentially fun stuff I might like to do for my own learning. But when approaching the limits of the original solution, for work I'd be evaluating eventstore.com instead, since they've solved most of these problems already. And it's highly available and free to use. I'm sure it has its own issues/workarounds, but probably so would my service. :)