Discussion on: Event Storage in Postgres, Multi-tenant

View post

Replies for: The way you are writing events (multiple concurrent writers, as mentioned in your previous post) combined with the way you are reading events ("And...

I can see why you would think that. But I'm not using notifications in the way that most event stores use them -- to push events to listeners. For one thing, Postgres NOTIFY payloads are too small for that to work. For another I don't like the push model for listeners; I prefer a pull model.

Using these notifications, you can setup the listener either way -- to process all events, or to only listen for the latest happenings.

To process all events, the listener uses notifications simply to be informed of a new event and it's position (SequenceNum). If the sequence number is higher than its last processed one, it will fetch all new events starting from it's last processed SequenceNum. The listener tracks its own last processed position and loads events sequentially. It doesn't depend on notifications for tracking. No events are missed.

But there are cases where the listener doesn't care about processing every event, only the most recent. In that case, the listener loads the specific event referenced in the notification. Some events will be missed (e.g. listener node failure), but this kind of listener isn't concerned about that.

P.S. As far as notifications and concurrency. With Postgres NOTIFY, only committed events trigger notifications. They are never triggered when the transaction aborts, such as when there is a unique key constraint (concurrency) violation.

Alexander Langer • Jan 24 '19 • Edited

Yes, I understood that. The error is in here:

To process all events, the listener uses notifications simply to be informed of a new event and it's position (SequenceNum). If the sequence number is higher than its last processed one, it will fetch all new events starting from it's last processed SequenceNum.

Consider two transactions T1 and T2:

T1 starts, obtains next sequence number N from Postgres.
T1 stalls for whatever reason.
T2 starts, obtains next sequence number N+1 from Postgres.
T2 commits, sends notification.
Listener sees N+1 is higher than $last, retrieves all events since $last including N+1, sets $last := N+1.
T1 revives, commits, sends notification.
Listener sees N, not higher than $last (= N+1), ignores.

=> missed event.

(AFAIK the only way to avoid this with this architecture is to have a single writer per partition...).

Kasey Speakman • Jan 24 '19 • Edited

Ah, I see. Excellent point.

The case of "T1 stalls for whatever reason." is certainly possible, but unlikely for us so far. Aside from not yet reaching the IOPS to make this likely, transactions are literally BEGIN ... INSERT ... INSERT ... COMMIT, all sent as a single (maybe even prepared) statement. Although I'm not an expert in what effects auto-vaccuming pauses can have.

If I run into this problem, I will have to give this more of a think. A temporary workaround that would work for us is to rebuild read models if this happens.

Anyway, I'm glad you mentioned this. Thank you!

Alexander Langer • Jan 24 '19

Yes, beside from that issue, Postgres is a great choice and we are using a similar setup as the one you are describing here. We are currently overcoming this by having a single threaded writer, which is still fast enough given that transactions are really fast for us, as you mentioned, and the single writer is certainly not the bottleneck :)

Keep in mind, however, that even if it's very, very unlikely, given that you might store tens of millions of events per year (which is not uncommon even for a moderate business software application with a couple of users), even a 0.0001% chance per event means you are likely to encounter that race condition about 10,000,000 * 0.0001% = 10 times a year. From my own negative experience, I can tell you that a) this will happen, unfortunately, and b) this is one of those bugs that are very hard to find :-)

Anyways, keep up the good work, this is a very nice usage of the notify feature in postgres I hadn't considered before.

Kasey Speakman • Jan 24 '19 • Edited

I really appreciate your feedback!

On quick search, I found this article on gapless sequences. Using what was described there would also enforce a totally ordered sequence number. This will have a performance impact, not sure how much. Perhaps a single writer is a better solution overall for now.

I am concerned that business logic (command processing) will have to scale compute resources. And currently I planned to colocate the writer with the command processing, hence the multiple concurrent writers as command processing scales. I guess good problems to have and figure out when I get there. But I don't like the idea of the single writer being its own separate service.

Kasey Speakman • Jan 24 '19 • Edited

Here is a quick sketch of what the solution would look like that supported multiple concurrent writers. Using the strategy from that link. This is accomplished by making sequence number generation part of the parent transaction. I even added some rules to make the Event table append-only and the sequence counter update only to prevent maintenance mistakes. Also inspired by that link.

Here, I also renamed SequenceNum to Position.

--DROP TABLE IF EXISTS PositionCounter CASCADE;
--DROP FUNCTION IF EXISTS NextPosition();
--DROP TABLE IF EXISTS Event CASCADE;
--DROP TRIGGER IF EXISTS trg_EventRecorded ON Event;
--DROP FUNCTION IF EXISTS NotifyEvent();

-- transactional sequence number
CREATE TABLE IF NOT EXISTS PositionCounter
(
    Position bigint NOT NULL
);

INSERT INTO PositionCounter VALUES (0);

-- prevent removal / additional rows
CREATE RULE rule_positioncounter_noinsert AS 
ON INSERT TO PositionCounter DO INSTEAD NOTHING;
CREATE RULE rule_positioncounter_nodelete AS 
ON DELETE TO PositionCounter DO INSTEAD NOTHING;

-- function to get next sequence number
CREATE FUNCTION NextPosition() RETURNS bigint AS $$
    DECLARE
        nextPos bigint;
    BEGIN
        UPDATE PositionCounter
           SET Position = Position + 1
        ;
        SELECT INTO nextPos Position FROM PositionCounter;
        RETURN nextPos;
    END;
$$ LANGUAGE plpgsql;

-- event table
CREATE TABLE IF NOT EXISTS Event
(
    Position bigint NOT NULL,
    TenantId uuid NOT NULL,
    StreamId uuid NOT NULL,
    Version int NOT NULL,
    Type text NOT NULL,
    Meta jsonb NOT NULL,
    Data jsonb,
    LogDate timestamptz NOT NULL DEFAULT now(),
    CONSTRAINT pk_event_position PRIMARY KEY (TenantId, Position),
    CONSTRAINT uk_event_streamid_version UNIQUE (TenantId, StreamId, Version)
) PARTITION BY LIST (TenantId);

-- Append only
CREATE RULE rule_event_nodelete AS 
ON DELETE TO Event DO INSTEAD NOTHING;
CREATE RULE rule_event_noupdate AS 
ON UPDATE TO Event DO INSTEAD NOTHING;

-- notification
CREATE FUNCTION NotifyEvent() RETURNS trigger AS $$

    DECLARE
        payload text;

    BEGIN
        -- { position }/{ tenantId }/{ streamId }/{ version }/{ event type }
        SELECT CONCAT_WS( '/'
                        , NEW.Position
                        , NEW.TenantId
                        , REPLACE(CAST(NEW.StreamId AS text), '-', '')
                        , NEW.Version
                        , NEW.Type
                        )
          INTO payload
        ;

        -- using lower case channel name or else LISTEN would require quoted identifier.
        PERFORM pg_notify('eventrecorded', payload);

        RETURN NULL;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_EventRecorded
    AFTER INSERT ON Event
    FOR EACH ROW
    EXECUTE PROCEDURE NotifyEvent()
;

And here's how to append an event with the transactional sequence number.

-- inside a transaction
INSERT
  INTO Event
     ( Position
     , TenantId
     , StreamId
     , Version
     , Type
     , Meta
     , Data
     )
VALUES
     ( NextPosition()
     , '3791B53677C840FC81CA65BFDAF34F1F'
     , 'A88F94DB6E7A439E9861485F63CC8A13'
     , 1
     , 'EmptyEvent'
     , '{}'
     , NULL
     )
;

Ron Gonzalez Lobo • Jan 29 '20

Thanks for putting this together, great read!

Working on porting this to YugaByteDB.

Ski • Apr 12 '20

Would it not be simpler instead to use explicit table lock when inserting events instead of PositionCounter? It limits performance but I believe result is same as PositionCounter table because it too needs a row level lock

begin;
lock table event;
insert into event ...
commit

Kasey Speakman • Apr 13 '20

Simpler? It would be less to setup vs a Position table. In usage it's adding a lock statement vs calling NextPosition() in the insert.

I suppose you will just have to test for the performance difference between the two.