Use cases for persistent logs with NATS Streaming

Byron Ruth on September 11, 2017

What are persistent logs? In this context, a log is an ordered sequence of messages that you can append to, but cannot go back and change...
Alex Efros

Thanks for the article, but there are some important things missing.

One large missing point is reconnection and handling NATS/STAN restarts, which isn't trivial to implement.

Another is that the producer may need to store outgoing events in its own persistent outgoing queue before trying to send them to STAN, to avoid losing events on producer restarts.

For this to be true after restarts, the client would need to persist the lastProcessed value somewhere and load it on start. But this could be as simple as a local file that contains the ID of the message that was last processed.

This won't work unless updating this file can be done atomically with handling the message, which is impossible in the general case. It can usually only be done in cases where handling a message means updating data in an SQL database, and this ID is updated in the same database, in the same transaction.
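
For example, a minimal sketch of that SQL case, assuming handling a message means one database update (the table and column names are illustrative, and db and msg are an open *sql.DB and the received *stan.Msg):

// Sketch: the message's effect and the last-processed ID commit or fail
// together because they live in one transaction. Assumes "database/sql"
// and the stan client are imported; all names are illustrative.
func handle(db *sql.DB, msg *stan.Msg) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// The actual effect of handling the message.
	if _, err := tx.Exec(`UPDATE accounts SET balance = balance + $1 WHERE id = $2`, 10, 1); err != nil {
		return err
	}
	// Record the offset in the same transaction.
	if _, err := tx.Exec(`UPDATE consumer_offsets SET last_seq = $1 WHERE consumer = $2`, msg.Sequence, "worker"); err != nil {
		return err
	}
	return tx.Commit()
}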

Byron Ruth

Both the NATS and STAN clients reconnect automatically. One configurable setting is how many reconnects should take place, and you can set this to unlimited.

Another is that the producer may need to store outgoing events in its own persistent outgoing queue before trying to send them to STAN, to avoid losing events on producer restarts.

On a publish, the client will return an error if it cannot contact the server or if the server fails to ack. If you do this asynchronously (not in the request hot path, for example), then you are correct: you will need to handle this locally by storing the events and retrying perpetually. However, if this is part of a request transaction, then presumably any effects could be compensated for and an error would be returned to the caller.
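
A minimal sketch of the synchronous case (the subject and variable names are illustrative, and sc is an open STAN connection):

// Sketch: a synchronous publish blocks until the server acks or returns
// an error, so a failure can be retried from a local queue or surfaced
// to the caller for compensation.
if err := sc.Publish("events", data); err != nil {
	// Either persist the event to a local outgoing queue and retry
	// later, or fail the request so the caller can compensate.
	return err
}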

This won't work unless updating this file can be done atomically with handling the message.

Yes, absolutely, the devil is in the details. The general case is that if you are doing at least two kinds of I/O in a transaction, you will need some kind of two-phase commit or consensus + quorum, since any of the I/O can fail.

The intent of the examples here is to demonstrate patterns, not the failure cases (which are very interesting in their own right!). But thanks for pointing out a couple of (very serious) failure cases!

Alex Efros

Both the NATS and STAN clients reconnect automatically.

NATS - maybe, but STAN doesn't reconnect.

Byron Ruth

The MaxReconnects option for NATS, set to -1, will cause the client to retry indefinitely. For the STAN client, you can achieve the same thing by passing in the NATS connection using the NatsConn option. So...

nc, err := nats.Connect("nats://localhost:4222", nats.MaxReconnects(-1))
// handle err...
sc, err := stan.Connect("test", "test", stan.NatsConn(nc))
// handle err...

The STAN client just builds on a NATS connection, so any NATS connection options will be respected by the STAN client.

Alex Efros

STAN doesn't reconnect automatically, just test it yourself. Also, if no NATS connection is provided to STAN, it creates a default connection with the same -1 setting.

Byron Ruth

From the README, I read that the intent is for reconnection to happen transparently, but there are cases where this fails (there appear to have been quite a few changes and improvements since I wrote this article). It describes the case where the server doesn't respond to PINGs and the client decides to close the connection. Likewise, if there is a network partition or something, the server may disconnect, but the client (in theory) could re-establish the connection after the interruption.

I am not arguing that it can't happen, but I am simply stating that the intent is for the client to reconnect. If you are observing otherwise, then that may be an issue to bring up with the NATS team.

Alex Efros

The team is aware of this. I agree the intent is to support reconnects, but for now you have to implement it manually, and it's unclear when (if ever) this will be fixed in the library. The main issue with reconnects is the need to re-subscribe, and how this should be done depends on the application, so it's not easy to provide a general way to restore subscriptions automatically - this is why it's not implemented yet and is unlikely to be implemented soon.
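
For reference, a rough sketch of one manual approach, assuming a client version that supports the SetConnectionLostHandler option (the subject, durable name, and the connectAndSubscribe helper are illustrative, and handleMsg is the application's handler):

// Sketch: reconnect by hand. The connection-lost handler retries in a
// loop and then re-establishes the application's subscriptions, which
// is the application-specific part (durable names, start positions,
// queue groups).
func connectAndSubscribe(clusterID, clientID string) (stan.Conn, error) {
	sc, err := stan.Connect(clusterID, clientID,
		stan.SetConnectionLostHandler(func(_ stan.Conn, reason error) {
			log.Printf("connection lost: %v, reconnecting", reason)
			for {
				if _, err := connectAndSubscribe(clusterID, clientID); err == nil {
					return
				}
				time.Sleep(time.Second)
			}
		}))
	if err != nil {
		return nil, err
	}
	if _, err := sc.Subscribe("events", handleMsg, stan.DurableName("worker")); err != nil {
		sc.Close()
		return nil, err
	}
	return sc, nil
}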

Byron Ruth

Good to know. Agreed; in the case of consumers, how subscriptions need to or should be re-established can vary based on the application.

Milos Gajdos

Thanks for the write-up. One thing I noticed in your code snippets: the combination of log.Fatal and defer calls. Deferred calls are never run after log.Fatal -> github.com/golang/go/issues/15402#...

Byron Ruth

Good catch and thanks for letting me know! I do know this, but I seem to let that bad habit creep into examples and such. But here it is a bigger issue, because properly closing connections is very important. I will update the examples shortly!

Byron Ruth

I guess I didn't realize this is a bit of a pain to do. Not terrible, but easy to miss. Basically, only main should call log.Fatal or os.Exit, and no defers should be used there. More generally, no function in the call stack above such a call should rely on a deferred function.
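
For example, a minimal sketch of one way to structure this (the cluster and client IDs are illustrative):

// Sketch: keep defers out of the Fatal path by doing the work in run()
// and letting only main decide to exit.
func run() error {
	sc, err := stan.Connect("test-cluster", "client-1")
	if err != nil {
		return err
	}
	defer sc.Close() // runs on every return path of run()

	// ... publish/subscribe work ...
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err) // safe: no defers are pending in main
	}
}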

Alex Efros

Why do you use atomic.SwapUint64(&lastProcessed, msg.Sequence)? As far as I understand, the handler function (for a single subscription) won't be executed in parallel, so lastProcessed = msg.Sequence should be safe here.

Byron Ruth

Hi Alex, you are correct. This example doesn't have two threads competing for the lastProcessed variable. I did it to be explicit about the kind of operation it is suggesting. It's not obvious to many that a race condition could easily be introduced here if, for example, the value is being written to disk or a database or something to keep track of the offset.
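
For illustration, a sketch of the handler in question (the subject name is illustrative):

// Sketch: both forms record the last processed sequence. The plain
// assignment is safe while the handler runs serially; the atomic form
// stays safe if another goroutine (say, a periodic checkpointer) also
// reads or writes lastProcessed.
var lastProcessed uint64

sub, err := sc.Subscribe("events", func(m *stan.Msg) {
	// process m.Data ...
	atomic.SwapUint64(&lastProcessed, m.Sequence) // or: lastProcessed = m.Sequence
})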

Ziyun Yang

Thanks for your article! I have a question: how can I both deliver all available messages and control acks manually? If I only set the DurableName and DeliverAllAvailable options, I receive all messages from the beginning after the subscription restarts. However, if I set one more option, SetManualAckMode, I only receive messages sent after the subscription restarts. I'm so confused about that.

Byron Ruth

Thank you! There are a few different concerns you are describing. One is defining the start message for a given subscription, such as the beginning for a new subscription or some other arbitrary position. The second is implicit vs. manual ack-ing of messages as the consumer processes them. If manual ack mode is enabled, then the consumer is required to call Ack() in order to tell the server (NATS Streaming) that the message should not be redelivered. Not doing manual acks means the server will implicitly assume that once it delivers the message to the consumer, it has been handled, and/or the consumer doesn't want to receive it again even if there was a problem processing it locally.
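
A sketch combining the two (the subject, durable name, and AckWait value are illustrative). One thing worth noting: for a durable subscription that already exists, the server resumes from the last acknowledged message, so the start position (like DeliverAllAvailable) only applies the first time the durable is created:

// Sketch: a durable subscription that replays from the beginning when
// first created and acks manually. Until m.Ack() is called, the server
// will redeliver the message after AckWait expires.
sub, err := sc.Subscribe("events", func(m *stan.Msg) {
	// process m.Data ...
	m.Ack() // tell the server not to redeliver this message
},
	stan.DurableName("worker"),
	stan.DeliverAllAvailable(),
	stan.SetManualAckMode(),
	stan.AckWait(30*time.Second),
)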

Hopefully that helps a bit?

darxkies

One of NATS Streaming's biggest drawbacks is the lack of clustering.

Richard Marcum

FYI ... Clustering is now available.

Byron Ruth

Full clustering is being actively worked on. However, it does currently support "partitioning", which enables configuring multiple NATS Streaming servers to handle different sets of channels. In this mode, they are "clustered" in that a client can connect to any server and it will be routed to the correct server for the given channel. So for scaling needs, that is one option. The best option for replication/fault tolerance of the data is to set up a consumer that relays data from the various channels into a separate server on another host.
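
A rough sketch of such a relay (the cluster IDs, URLs, and channel name are illustrative):

// Sketch: a relay consumer that copies a channel from one server to a
// separate backup server. It only acks a message once the republish
// succeeds, so nothing is dropped while the backup is unreachable.
src, err := stan.Connect("source-cluster", "relay", stan.NatsURL("nats://source:4222"))
// handle err...
dst, err := stan.Connect("backup-cluster", "relay", stan.NatsURL("nats://backup:4222"))
// handle err...
_, err = src.Subscribe("events", func(m *stan.Msg) {
	if err := dst.Publish("events", m.Data); err == nil {
		m.Ack()
	}
}, stan.DurableName("relay"), stan.DeliverAllAvailable(), stan.SetManualAckMode())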

Ziyun Yang

Hi Byron, thanks for the article! I have a question: does a durable queue group guarantee in-order messages after all consumers restart?