
Zero-dependency Pub / Sub system with PostgreSQL

Gio ・ 3 min read


Photo by Chen Hu on Unsplash.

At Setter we have a 2nd generation API server that handles:

  • API requests coming in from internal software
  • API requests from our customer-facing iOS and Android applications
  • Webhook requests from 3rd party services

This is all quite typical stuff.

Communicating With Systems Outside Of Our Control

In the process of handling some of these requests, we have to communicate with 3rd party services.

One example is customer order approvals, where we have to send the customer an email confirming the order, providing a sort of feedback loop to the user.

So in this case, the flow looks like:

  1. Receive API request from mobile app
  2. Process the API request (which involves making some DB inserts / updates)
  3. Dispatch API request to 3rd party email provider (in our case we use Postmark and we highly recommend it)
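The coupled flow above can be sketched as a single handler. This is only an illustration, not the actual Setter code: the function and helper names are hypothetical stand-ins, and the helpers just record which step ran.

```typescript
// Hypothetical stand-ins for the real DB write and the email-provider call.
const steps: string[] = [];

async function saveOrder(_orderId: string): Promise<void> {
  steps.push("db"); // stand-in for the INSERT / UPDATE statements
}

async function sendConfirmationEmail(_orderId: string): Promise<void> {
  steps.push("email"); // stand-in for the HTTP request to the email provider
}

// Steps 2 and 3 are coupled: if the provider call throws after the DB write
// has already succeeded, we're left in an incomplete state.
export async function approveOrder(orderId: string): Promise<void> {
  await saveOrder(orderId);
  await sendConfirmationEmail(orderId);
}
```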

By making these API requests directly from our system, we've reduced the certainty of success and introduced incomplete states. For example, Postmark (the email service provider we use) could be down for routine maintenance, so a request to their service could fail at unpredictable times. This leaves an incomplete state: the customer never receives the email confirming that their order was indeed processed and acknowledged.

This sort of thing has happened a few times at our company.

Eliminating The Dependency on 3rd Party Services

Currently we're undergoing an internal RFC process to decide how we're going to decouple 3rd party services from the core of our system.

I took the lead on this particular RFC (although I've had lots of guidance from my colleagues while writing it), and in this post I discuss the bulk of it.

What I'm proposing at our company is that we leverage the technologies we already have (PostgreSQL & NodeJS) so as not to increase system complexity - as opposed to adopting a tool such as RabbitMQ (not to say that RabbitMQ is bad).

By using PostgreSQL's LISTEN / NOTIFY features, you have everything you need for a high-performance, fault-tolerant pub / sub system.

I went ahead and created an example app that implements this system - feedback welcome!

https://github.com/gDelgado14/pg-pubsub

Here are the relevant parts of the example project (as found in the README.md):

Implementing the "Pub" in Pub / Sub

The migrations folder contains the schemas, triggers, and SQL functions necessary to implement the publishing side of the system.

More specifically, a db_events table is created which stores messages sent into the pub / sub system. Further, a trigger is created that executes a SQL function on any insertion into db_events.
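A minimal version of such a migration might look like the following sketch. The table, function, and trigger names here are my own illustrations; the actual definitions live in the repo's migrations folder:

```sql
-- Sketch: a table to persist messages, and a trigger that NOTIFYs
-- listeners on every insert. All names are illustrative.
CREATE TABLE db_events (
  id         serial PRIMARY KEY,
  channel    text        NOT NULL,
  payload    jsonb       NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION notify_db_event() RETURNS trigger AS $$
BEGIN
  -- NOTIFY payloads must be text, so serialize the row's channel and payload
  PERFORM pg_notify('pub_sub', json_build_object(
    'channel', NEW.channel,
    'payload', NEW.payload
  )::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER db_events_notify
  AFTER INSERT ON db_events
  FOR EACH ROW EXECUTE PROCEDURE notify_db_event();
```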

Implementing the "Sub" in Pub / Sub

Inside src/services/client.ts, I use the pg module to:

  1. Connect to the db
  2. Listen to "pub_sub" events being invoked from within postgres (which I've defined in the migrations)
  3. Invoke any asynchronous functions associated with the various events that can occur in the system.
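As a sketch, the subscriber side could look something like this. The "pub_sub" channel name comes from the migrations described above, but the payload shape (a JSON-encoded { channel, payload } object) and the function names are my assumptions - the real implementation lives in src/services/client.ts. The pg client is passed in through a minimal interface so the sketch stays self-contained:

```typescript
// Minimal interface matching the parts of pg's Client we use, so this sketch
// doesn't require a live database. In the real app you'd pass in
// `new Client(connectionString)` from the `pg` module, whose API fits this shape.
interface NotificationClient {
  connect(): Promise<void>;
  query(sql: string): Promise<unknown>;
  on(
    event: "notification",
    cb: (n: { channel: string; payload?: string }) => void,
  ): void;
}

type Handler = (payload: unknown) => void;
const handlers: { [channel: string]: Handler } = {};

// Register a function for a logical channel, e.g. "sms" or "email".
export function subscribe(channel: string, handler: Handler): void {
  handlers[channel] = handler;
}

// NOTIFY payloads arrive as strings; assume they encode { channel, payload }.
export function dispatch(raw: string): void {
  const { channel, payload } = JSON.parse(raw);
  const handler = handlers[channel];
  if (handler) handler(payload);
}

export async function start(client: NotificationClient): Promise<void> {
  await client.connect();
  client.on("notification", (n) => dispatch(n.payload ?? "{}"));
  await client.query("LISTEN pub_sub"); // the channel defined in the migrations
}
```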

Now you can subscribe to any event you want, and you define the events in your code. It really helps to use a statically-typed language (which is why I implemented the example in TypeScript) so that each message payload is always consistent with the channel it's associated with.

You can see some example channel / payload combinations inside src/services/client.ts. For example, if you publish an sms message, the payload going in and out of the pub / sub system will always be { msg: 'some string' }.

Another awesome aspect of this system is that you can choose to run your subscription logic on a different machine / runtime / language. This is because it's postgres that's sending messages into the runtime. In my example I kept it simple and had both the publishing and subscribing happening in the same app, but it doesn't have to be that way if you don't want it to be!

Handling failure gracefully

One thing I haven't thought enough about is how to handle message processing failures.

Example: Say I publish an email message into the pub / sub system and a subscriber tries to process the message by, say, sending an API request to Postmark and Postmark is down. How should I best manage this?

I think implementing an exponential back-off retry might be the right approach.
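As a sketch of that idea (the function name, parameters, and defaults here are arbitrary choices of mine), exponential back-off for a flaky subscriber action could look like:

```typescript
// Retry an async action, doubling the delay after each failure. If every
// attempt fails, re-throw the last error so the caller can dead-letter the
// message or alert someone.
export async function retryWithBackoff<T>(
  action: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await action();
    } catch (err) {
      lastError = err;
      const delayMs = baseDelayMs * 2 ** attempt; // 100, 200, 400, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A subscriber could then wrap its Postmark call, e.g. retryWithBackoff(() => sendEmail(msg)), where sendEmail is whatever function actually talks to the provider.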

Would love to hear your thoughts on this!

Discussion

parity3

I looked into postgres and loved the concept, but in practice, I was using pgbouncer as a connection pool with aggressive connection sharing, and I think pgbouncer did not support the notify/listen commands, and even if it did, I think keeping transactions open was required which meant all the waiters would need separate real connections to the db, which ate up resources and created more problems. That's not to say I couldn't have overcome that barrier but at the time I considered it too risky to try or research further.
Like you, I prefer being able to debug and fix my own applications vs worry about configuration / the learning curve of other software, particularly when my needs are specific. I just rolled my own notifications, of 2 types:

  1. When host-distributed was necessary, via a custom HTTP-based server. It's basically an extension of what S3 provides, but I added persistent connection support. You could just as easily do this with HTTP/2 or websockets, but I went old-school HTTP/1.1 with chunked transfer encoding for message framing. The server uses explicit scheduling (i.e. there's no preemptive context switching; this can be handled in many languages with the task worker queue pattern, but some frameworks have this built in and seamless; I used Python/Twisted but am a big fan of Python/Trio). This works around lots of potential problems (in exchange for a single point of failure, SPOF). I think SPOF is really fine unless you're Slack or have a justifiable need for 1 million+ notifications/second (which I can't really wrap my head around). It feels like if you're getting into that realm then you've made a mistake along the way, i.e. Slack should have done some more infrastructure sharding/isolation at that point. Anyway, having a single-threaded server can be achieved via Redis and maybe some server-side Lua, but I did not like the fact that Redis was memory-only and had a lot of things I did not need, which added complexity. The server I wrote easily supports fan-out or 1-1 "get-and-claim-next-job" patterns with the locking implicitly done via existing constructs of the language. Have more than a few types of messages or use cases? Time to run a new process and bind a new port.
  2. For mainly machine-specific needs, I implemented an append-only binary log with auto-rotation. This I use for shipping logs, so multiple inputs and 1 worker, although you can use marker files to implement multiple workers. It leverages inotifywait, and keeps markers on multiple files (1 marker file per input per worker). I can also use fallocate to punch holes in the files (making them sparse) when data has been confirmed shipped. I also ended up writing a subscriber HTTP API for this, and could easily accomplish the same behavior as #1, while supporting native lock-free writes (because they are multiple files). However, only #1 can do back-pressure inline, which I did not implement here for #2 (different use cases).

Some things to keep in mind:

  1. This is pure opinion, but RabbitMQ (and I started with RabbitMQ) tries really hard to solve problems that it should not have tried solving. I firmly believe that retries should be handled by the client and the messaging itself should be a connection with framing support. Retries, in my use cases, have all been extremely tied to the business logic of the application and should have resided there to begin with, instead of messing around with metadata and queue routing configuration/monitoring.
  2. Notifications should not store messages with input or output. They should only be used to wake up waiters. That's it. When workers wake up, they should do enough work (on average) to make the wake-up worthwhile. There is a cost to everything. Try not to wake up more than you need to accomplish things. Treat the notifications as communicating with people; sometimes they drop out and you need to handle that. This is also ZeroMQ's philosophy to an extent, although I don't have any experience with using that.
  3. Bundling in job context is a bonus that I won't get into technically, but I've found it incredibly beneficial; if you have a big task that consists of chunks that you want to spread to workers followed by a wrap-up (i.e. map/reduce), keep the context "open" for all worker participants until the end of the input chunks is reached, then close all the open contexts. I.e. prevent repeating worker task setup boilerplate, cache the setup, keep the context/files as local as possible, and then workers can focus on the real work of the chunk.
Adrian B.G.

Wow, first of all, don't do it.

Second, a nice way to decouple the 3 steps you described is with "functions, serverless"; it fits your example very well (async events that trigger 2 separate functions).

First point again, a few notes:

  • a new dependency is not such a big deal, and you will gain benefits from it across multiple systems (I bet this 3rd party separation is only the first of more to come)
  • relational DBs aren't such a good fit for this type of communication
  • you will put a bigger load on a persistent DB with transient messages
  • you cannot scale it horizontally (except by creating separate databases for each "channel")
  • from what I know, "LISTEN / NOTIFY" sends the message to ALL listeners, which means you will have an "at least once" problem; you will need a pub / sub system that delivers "exactly once" so that only 1 email is sent. And if you only have 1 NodeJS server that listens, again you cannot scale.
  • by not using a proper messaging system, you will need to develop a failure/retry and timeout system yourselves (if one worker fails, its job must be reintroduced into the queue as "not done")
  • low performance - most messaging systems are in RAM, so you can't match that
  • if you rely on the trigger/instant call of the queue update, you will actually build a (time) tightly coupled system; systems like RabbitMQ/Kafka allow a system "decoupled in time", giving greater flexibility in worker usage (they can all be down and resume later, or absorb spikes over time).

PS: I may be wrong, I'm not deeply familiar with PostgreSQL
PS2: there is no such thing as zero dependency; if you build this system in house you are creating a new project (within a project) that requires your attention, tests, time and resources. Also, the system is dependent on your business logic, infrastructure and database (things that can be mitigated by using a PaaS / cloud PUB/SUB).

Arik

I like this idea quite a bit. We actually have a Postgres-based setup at work plus a RabbitMQ, and it would be nice to eliminate a moving piece (Rabbit) if I can. I love Rabbit, but I love minimalism even more. The only disadvantage I can think of is giving all the nodes access to the database, which I may not want to do for security reasons.

Gio Author

Agreed, opening up access to all nodes is a serious drawback to this approach, but I'm sure there are ways to work around it.

How are you and your colleagues liking RabbitMQ? We too are big fans of minimalism. Have there been any unexpected downsides to introducing Rabbit (besides having an additional moving piece)?

Arik

I'm a big fan of Rabbit. I've used it in production on several occasions for rather large (hundreds of messages per second) deployments, and it worked really well for me. It's very easy to set up and it just sort of works.

Adrian B.G.

Minimalist would be a simple Redis, like others said. You can also get persistent storage (from time to time), performance and scaling (if needed).

Slack devs are also reticent about new dependencies and avoid adding new tech to their stack; you can read about their Job Queue system here, which is exactly what you need too (more or less). The old system is on Redis (the previous version is probably enough for your needs).

Theofanis Despoudis

Why don't you use a Redis task queue to see how it goes?

Stanislav(Stas) Katkov

You can create a dedicated database user that has access to certain tables.

Stanislav(Stas) Katkov

Classical Pub/Sub might not be the best fit, for some of the reasons Adrian outlined pretty nicely in his comment.

May I propose considering the Event Sourcing pattern? You don't really remove or modify any records -- you just append them. Usually this pattern is used together with the CQRS pattern -- it brings really nice scaling advantages later on in your project.

Adrian B.G.

I think the requests and the results (process and email) are ephemeral, so they do not need to be kept indefinitely. Event sourcing solves the problem of keeping all the previous states of your entities (kind of), and I don't think that's the case here. The requests are unique, do not change, and don't need to be kept for a longer period of time.

Bruno Luis Panuto Silva

Hey Gio, great post!

Just my two cents on this topic.

Apparently, your problem is one I was facing in my previous company: how to deal with scale.

It also appears that the first step to decoupling your application has been taken: boundaries have been defined for third-party systems, keeping them out of your core business logic.

There are some alternatives to PostgreSQL out there that are minimal and perform quite well.

I'm a big fan of NATS. I mean, dead simple, blazing fast, and Just Works™ out of the box. It's also under the CNCF, so it shows a lot of promise.

Having that well defined boundary around this kind of dependency should make it trivial to use. It needs some special care, e.g. defining your own request/response mechanism in order to achieve at-least-once semantics, but in my opinion this extra step is well worth it.

MOHAMMED NASIRUDDIN

Thanks, nice example of Postgres pub sub.

Damian Esteban

If you are on AWS, I would recommend going with SNS/SQS for pub / sub. As another user mentioned, Event Sourcing might be a good alternative; in that case I would look at AWS Kinesis / DynamoDB.