<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karim Hamidou</title>
    <description>The latest articles on DEV Community by Karim Hamidou (@karim_hamidou).</description>
    <link>https://dev.to/karim_hamidou</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F48472%2Fd575ec1c-4e27-44c2-a874-7f45ee0325b9.jpg</url>
      <title>DEV Community: Karim Hamidou</title>
      <link>https://dev.to/karim_hamidou</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karim_hamidou"/>
    <language>en</language>
    <item>
      <title>Speeding up our Webhooks System 60x</title>
      <dc:creator>Karim Hamidou</dc:creator>
      <pubDate>Wed, 20 Jun 2018 18:12:18 +0000</pubDate>
      <link>https://dev.to/nylas/speeding-up-our-webhooks-system-60x-2idh</link>
      <guid>https://dev.to/nylas/speeding-up-our-webhooks-system-60x-2idh</guid>
      <description>&lt;p&gt;Last December, we had an interesting problem. Our webhook system, which lets our customers know whenever a change happens to a Nylas account, was struggling to keep up with the traffic. We were seeing some webhooks being delayed as much as 10 minutes, which is a long time for most of our end-users. We decided to take the plunge and rearchitect our webhook system to be faster. In the end, we made it 60x faster and made sure it’ll stay that way for the foreseeable future. In this post you’ll see how.&lt;/p&gt;

&lt;h2&gt;Detour: A brief introduction to the Nylas transaction log&lt;/h2&gt;

&lt;p&gt;The Nylas API is built around the idea of a transaction log. The transaction log is an append-only log of all the changes that happened to our API objects. If you send a message through the Nylas API, we will create a transaction log entry noting that a “message” object was created. The same thing happens if you update an event, calendar or any other API object.&lt;/p&gt;
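
&lt;p&gt;To make the idea concrete, here’s a minimal in-memory sketch of an append-only log (plain illustrative Python – the names are ours, not the actual sync-engine code):&lt;/p&gt;

```python
import itertools

# Minimal, in-memory sketch of an append-only transaction log.
# All names here are illustrative, not the actual sync-engine code.
class TransactionLog:
    def __init__(self):
        self._entries = []
        self._ids = itertools.count(1)

    def record_change(self, object_type, object_id, command):
        # Every create/update/delete appends a new entry; nothing is rewritten.
        entry = {
            "id": next(self._ids),
            "object_type": object_type,   # e.g. "message", "event", "calendar"
            "object_id": object_id,
            "command": command,           # "insert", "update" or "delete"
        }
        self._entries.append(entry)
        return entry

    def changes_since(self, cursor_id):
        # Powers "all changes after X" queries, like the delta stream API.
        return [e for e in self._entries if e["id"] > cursor_id]
```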

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2FDesign-01.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2FDesign-01.svg" alt="Design-01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use the transaction log to power all our change notification APIs. For example, the transaction log is how you can ask our delta stream API all the changes for a specific account in the last 24 hours.&lt;/p&gt;

&lt;p&gt;Behind the scenes, the transaction log is implemented as a regular MySQL table. We’ve instrumented our ORM (SQLAlchemy) to write to this table whenever there is a change to an API object. In practice, this works relatively well, even though this part of the codebase relies a lot on SQLAlchemy internals, which makes it very brittle (if you’re curious about how it works, feel free to take a look at the &lt;a href="https://github.com/nylas/sync-engine/blob/master/inbox/models/transaction.py#L56" rel="noopener noreferrer"&gt;sync-engine source code&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;How our legacy webhook system worked&lt;/h2&gt;

&lt;p&gt;Our original webhook system was pretty simple and really reliable. It would spawn one reader thread per webhook. Each reader thread would sequentially read from each of our MySQL shards and send the changes it found. This let us offload most of the reliability work to MySQL – for example, if one of the webhooks machines crashed, we would just have to reboot it and it would pick things up where it left off. That also meant that if a customer webhook went down then came back up, we’d be able to send them all the changes that happened in the interval, which made outage recovery for our customers a lot easier.&lt;/p&gt;
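
&lt;p&gt;In sketch form, each reader’s main loop looked something like this (illustrative Python, not the actual implementation):&lt;/p&gt;

```python
# Sketch of the legacy design: one reader per webhook, scanning every
# MySQL shard sequentially and resuming from a persisted cursor.
# FakeShard and run_reader_pass are illustrative names, not Nylas code.
class FakeShard:
    def __init__(self, name, transactions):
        self.name = name
        self.transactions = transactions   # list of {"id": int, ...} dicts

    def transactions_after(self, txn_id):
        return [t for t in self.transactions if t["id"] > txn_id]

def run_reader_pass(shards, cursor, deliver):
    """One pass of a single webhook's reader thread."""
    for shard in shards:                                # sequential scan
        for txn in shard.transactions_after(cursor.get(shard.name, 0)):
            deliver(txn)                                # send the change out
            cursor[shard.name] = txn["id"]              # advance only after delivery
    return cursor
```

Because the cursor lives in the database, a crashed reader can simply restart and resume from its last delivered transaction.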

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-02.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-02.svg" alt="Design-02"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, as we grew from a couple of MySQL shards to several hundred, and from a half-dozen customers to hundreds, this architecture started making less and less sense. Having one thread per customer meant that as our number of customers grew, our system would get slower and slower.&lt;/p&gt;

&lt;p&gt;This might sound like a glaring limitation of our legacy system, but this was the right call when it was written three years ago. At the time, it was more important to have a simple and reliable system than to have something that was fast and scaled.&lt;/p&gt;

&lt;h2&gt;Rebuilding webhooks&lt;/h2&gt;

&lt;p&gt;Once we decided to rebuild the system, we had to figure out what kind of architecture would work best for our workload. To do that, we started by looking at the &lt;a href="https://www.nylas.com/blog/performance/" rel="noopener noreferrer"&gt;flamegraphs&lt;/a&gt; for our legacy system. Here’s a typical one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2mxuefqeaa7sj.cloudfront.net%2Fs_234B0D79AADA5B4F166BEF2EF90D08BF9C8666FC8D0E26EAFFEA819987372305_1524854788822_Capture%2Bdecran%2B2018-04-27%2Ba%2B11.45.42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2mxuefqeaa7sj.cloudfront.net%2Fs_234B0D79AADA5B4F166BEF2EF90D08BF9C8666FC8D0E26EAFFEA819987372305_1524854788822_Capture%2Bdecran%2B2018-04-27%2Ba%2B11.45.42.png" alt="null"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing jumps out immediately: we’re spending a lot of time executing SQLAlchemy code and waiting for our MySQL shards. Seeing this confirmed a hunch we’d had for a long time – we had too many readers.&lt;/p&gt;

&lt;p&gt;To figure out if this was right, we decided to build a prototype that used a single-reader architecture to send webhooks. Here’s the architecture we came up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-03.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-03.svg" alt="Design-03"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, we’d be moving from several readers per shard to one reader per shard. We decided to try this out and see if it would solve our load problems.&lt;/p&gt;

&lt;h2&gt;The phantom reads problem&lt;/h2&gt;

&lt;p&gt;After three weeks of work, we felt that we had a system that was reliable enough to run production workloads, so we decided to ship a test version of the system that wouldn’t send actual webhooks. This way, we’d be able to iron out performance issues and make sure that there were no consistency issues between the legacy and new systems.&lt;/p&gt;

&lt;p&gt;After doing a lot of testing, one issue kept happening – sometimes the new system wouldn’t send a transaction that should have been sent. This happened unpredictably and at any time of the day.&lt;/p&gt;

&lt;p&gt;We spent a lot of time trying to figure out the issue – was it a problem in the way we were creating &lt;code&gt;transaction&lt;/code&gt; objects? Was there a subtle bug in the way we were consuming the &lt;code&gt;transaction&lt;/code&gt; table?&lt;/p&gt;

&lt;p&gt;The problem was both simpler and subtler than that. Every time we sent a transaction to a customer, we saved its &lt;code&gt;id&lt;/code&gt; so we’d know where to resume in case of an interruption. It turns out this isn’t safe at all, because of a misunderstanding we had about MySQL autoincrements: we assumed they were generated at &lt;code&gt;COMMIT&lt;/code&gt; time, which would mean ids always become visible in increasing order.&lt;/p&gt;

&lt;p&gt;In reality, autoincrement ids are assigned at insert time, not commit time. If two transactions execute concurrently, the one holding the smaller id can commit last, appearing only after a larger id has already been processed – here’s an example of why:&lt;/p&gt;

&lt;p&gt;&lt;a href="/hubfs/blog%20images/Webhooks%20images/Design-05%20(2).svg?t=1529514994570"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-05%2520%282%29.svg" alt="Design-05 (2)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This issue meant we couldn’t rely on MySQL to get transactions in order. From our point of view, there were three different ways we could go:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; We could read from the &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/binary-log.html" rel="noopener noreferrer"&gt;MySQL binlog&lt;/a&gt;, which is the mechanism MySQL uses for its replication&lt;/li&gt;
&lt;li&gt; We could use Apache Kafka&lt;/li&gt;
&lt;li&gt; We could use Amazon Web Services’ Kafka clone, AWS Kinesis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the end, we went with AWS Kinesis, mainly because it’s the easiest one to operate – both the MySQL binlog and Kafka carry a significant operational cost that we couldn’t pay at the time (by the way, &lt;a href="https://dev.to/jobs/"&gt;we’re hiring ops people&lt;/a&gt; 😅).&lt;/p&gt;

&lt;h2&gt;Kinesis to the rescue&lt;/h2&gt;

&lt;p&gt;Kinesis is an interesting system – it’s really reliable and easy to operate, as long as you fit inside its (seemingly arbitrary) constraints. For example, Kinesis supports sharding, but you have to decide on the shard assignment yourself. This is fine, but there’s a catch – every Kinesis shard is limited to 1 MB of writes and 2 MB of reads per second. On top of that, you can only make 5 read transactions per second per shard, so having several processes read from the same shard was out of the question.&lt;/p&gt;
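
&lt;p&gt;To stay under the read-rate limit, each shard reader has to pace its calls. Here’s a hedged sketch of that pacing logic (our own illustrative names, not part of any AWS SDK; the clock is injectable so it can be tested without sleeping):&lt;/p&gt;

```python
# Sketch of pacing a single shard reader under Kinesis' per-shard read limit
# (5 read calls per second). ShardReadPacer is an illustrative name.
class ShardReadPacer:
    MAX_CALLS_PER_SEC = 5

    def __init__(self, clock):
        self.clock = clock
        self.window_start = clock()
        self.calls_in_window = 0

    def wait_time(self):
        """Return seconds to wait before the next read call; 0.0 means go now."""
        now = self.clock()
        if now - self.window_start >= 1.0:     # a new one-second window began
            self.window_start = now
            self.calls_in_window = 0
        if self.calls_in_window >= self.MAX_CALLS_PER_SEC:
            return 1.0 - (now - self.window_start)
        self.calls_in_window += 1              # reserve a slot in this window
        return 0.0
```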

&lt;p&gt;Obviously, these constraints aren’t the end of the world, but we had to build our system around them. Here’s the system we ended up with – like the prototype, it has a single reader per shard, except this time it reads from Kinesis instead of the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-04%2520%281%29%2520%281%29.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn2.hubspot.net%2Fhubfs%2F3314308%2Fblog%2520images%2FWebhooks%2520images%2FDesign-04%2520%281%29%2520%281%29.svg" alt="Design-04 (1) (1)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One interesting property of the new system is that it only uses Kinesis to get an ordered log of changes – for everything else (for example, catching up a customer webhook after a period of downtime), it reads the data from our database. That’s a plus for durability, and it helps us avoid Kinesis’ five-reads-per-second limitation.&lt;/p&gt;
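
&lt;p&gt;A sketch of that hybrid read path (illustrative names, not our production code):&lt;/p&gt;

```python
# Sketch of the hybrid read path: Kinesis supplies the live ordered stream,
# and MySQL is only used to backfill a webhook that was down longer than the
# stream retains data. All names here are illustrative.
def changes_for_webhook(last_delivered_id, stream_records, db_backfill):
    if stream_records and last_delivered_id >= stream_records[0]["id"] - 1:
        # Everything missed is still in the stream: no database read needed.
        return [t for t in stream_records if t["id"] > last_delivered_id]
    # The gap predates the stream's retention window: fall back to MySQL.
    return db_backfill(last_delivered_id)
```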

&lt;h2&gt;Rollout&lt;/h2&gt;

&lt;p&gt;Webhooks are really, really important to our customers. Breaking the system would mean breaking their apps for an indeterminate amount of time. To avoid that, we had to roll out this new service a little differently than usual.&lt;/p&gt;

&lt;p&gt;The way we usually roll out a new service is a staged rollout – we start by shipping the changes to 10% of our userbase, then gradually increase that number throughout the day. We could have done this with the new webhooks service, but we wanted to avoid any unexpected issues – dropped webhooks, performance regressions, and the like.&lt;/p&gt;

&lt;p&gt;To do this, we decided at the start of the project to work on a version of the service that would simulate sending webhooks instead of actually sending them. This would give us confidence that we wouldn’t run into any issues at launch. By the time we rolled out the service to our customers, it had been running in “simulated mode” for two months, which gave us a lot of confidence in the reliability of the new system.&lt;/p&gt;

&lt;p&gt;We didn’t stop there, though – given that our new system should never drop a webhook, we spent several weeks instrumenting it to verify that it never did. There weren’t a lot of good off-the-shelf solutions for this, so we ended up instrumenting both the legacy and the new webhook systems and building a custom Flask app that got pinged whenever either of them sent a webhook, letting us check that both were sending the same transactions.&lt;/p&gt;
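
&lt;p&gt;The comparison boiled down to something like this sketch (the Flask/HTTP wrapper is omitted; all names are illustrative):&lt;/p&gt;

```python
from collections import defaultdict

# Sketch of the cross-check: both systems report every webhook they send,
# and any transaction seen by only one of them is flagged. In production
# this sat behind a small Flask app; the HTTP layer is omitted here.
class DeliveryChecker:
    def __init__(self):
        self.seen = defaultdict(set)   # transaction_id -> {"legacy", "new"}

    def record(self, system, transaction_id):
        self.seen[transaction_id].add(system)

    def mismatches(self):
        # A transaction must be reported by BOTH systems to be consistent.
        return {txn for txn, systems in self.seen.items()
                if systems != {"legacy", "new"}}
```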

&lt;p&gt;Eventually, we became confident enough in the new system to roll it out to all our customers. That meant making a copy of each existing webhook, pointing that copy to the new system, turning off the old webhook, turning on the new one, and making sure no transactions were dropped.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Once the rollout was complete, we were able to measure the latency improvement for all our customers. The results were drastic – the old system would very often take more than a minute just to process a transaction and send it, and our P90 latency was over 90 seconds. With the new system, our P90 latency sits at around 1.5 seconds – a 60x improvement!&lt;/p&gt;

&lt;p&gt;Finally, thanks to &lt;a href="https://rcoh.me/" rel="noopener noreferrer"&gt;Russell Cohen&lt;/a&gt;, who contracted with Nylas on this project and was instrumental in getting it out the door!&lt;/p&gt;

</description>
      <category>api</category>
      <category>webdev</category>
      <category>database</category>
      <category>aws</category>
    </item>
    <item>
      <title>Why Consistency Matters When Building Reliable Systems</title>
      <dc:creator>Karim Hamidou</dc:creator>
      <pubDate>Wed, 06 Dec 2017 21:42:00 +0000</pubDate>
      <link>https://dev.to/nylas/why-consistency-matters-when-building-reliable-systems-a4a</link>
      <guid>https://dev.to/nylas/why-consistency-matters-when-building-reliable-systems-a4a</guid>
      <description>&lt;p&gt;&lt;a href="https://www.nylas.com/blog/why-consistency-matters-when-building-reliable-systems"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WTyH5t5---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.nylas.com/hubfs/karim-12617.png%3Ft%3D1512586558795" alt="Why Consistency Matters When Building Reliable Systems"&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We recently faced what could have been a serious user-facing outage: we rolled out a bad patch that caused the deletion and recreation of several hundred thousand folders from users’ inboxes. As a Nylas customer, you may have seen dozens of folders getting deleted and then recreated.&lt;/p&gt;

&lt;p&gt;As we do after every outage, we conducted an internal postmortem to learn what went wrong and prevent issues like this from happening in the future.&lt;br&gt;
This time, we’ve decided to share this postmortem publicly because we think the bug was interesting!&lt;/p&gt;
&lt;h2&gt;What caused it?&lt;/h2&gt;

&lt;p&gt;Before digging into the issue, it’s helpful to have context on how the &lt;a href="https://github.com/nylas/sync-engine"&gt;Nylas sync engine&lt;/a&gt; works. The sync engine is the core of our infrastructure and is responsible for &lt;a href="https://www.nylas.com/blog/billions-of-emails-synced-with-python"&gt;syncing billions of emails&lt;/a&gt;, calendar data, and contacts with our customers’ platforms — often customer relationship management platforms (CRMs), applicant tracking systems, and the like.&lt;/p&gt;

&lt;p&gt;Keeping email data in a two-way sync between the end-user’s inbox and our customer’s platform is a pretty interesting challenge, because protocols like IMAP and Exchange weren’t really built for bi-directional sync (in fact, they were built long before most people even knew what a CRM was).&lt;/p&gt;

&lt;p&gt;For example, when you rename an IMAP folder, you’re not just rewriting the folder name. Instead, your email app has to re-download all the emails in that folder and reconcile them with the emails it has previously synced. This means that we cannot delete IMAP folders immediately. Instead, we mark them as deleted and move on. Later on, we detect whether a deletion was actually a rename by looking at the messages inside the folder. Folders that weren’t renamed are then garbage-collected by a separate process to make sure we’re not storing them forever.&lt;/p&gt;

&lt;p&gt;This garbage collection process was the source of the incident. Let’s dive into the part of the source code that caused the issue (if you’re curious, you can find the full source code &lt;a href="https://github.com/nylas/sync-engine/blob/master/inbox/mailsync/gc.py"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Here and below, &lt;code&gt;Categories&lt;/code&gt; is the model we use to track IMAP folders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gc_deleted_categories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Delete categories which have been deleted on the backend.
&lt;/span&gt;    &lt;span class="c1"&gt;# Go through all the categories and check if there are messages
&lt;/span&gt;    &lt;span class="c1"&gt;# associated with it. If not, delete it.
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session_scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Check if no message is associated with the category. If yes,
&lt;/span&gt;            &lt;span class="c1"&gt;# delete it.
&lt;/span&gt;            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessageCategory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;MessageCategory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;scalar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward, right? Get all deleted Categories, then check if they have actual messages associated with them. If not, delete them.&lt;/p&gt;

&lt;p&gt;The only strange thing is in the database query we’re making:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, why are we filtering on categories whose &lt;code&gt;deleted_at&lt;/code&gt; is more recent than the epoch? Let’s look at the &lt;code&gt;Category&lt;/code&gt; model to figure it out! Here’s what the code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MailSyncBase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HasRevisions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HasPublicID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpdatedAtMixin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;DeletedAtMixin&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Need `use_alter` here to avoid circular dependencies
&lt;/span&gt;    &lt;span class="n"&gt;namespace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'namespace.id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_alter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'category_fk1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;ondelete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'CASCADE'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Namespace'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_on_pending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_INDEXABLE_LENGTH&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;display_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CategoryNameString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;type_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'folder'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'folder'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Override the default `deleted_at` column with one that is NOT NULL --
&lt;/span&gt;    &lt;span class="c1"&gt;# Category.deleted_at is needed in a UniqueConstraint.
&lt;/span&gt;    &lt;span class="c1"&gt;# Set the default Category.deleted_at = EPOCH instead.
&lt;/span&gt;    &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'1970-01-01 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;__table_args__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UniqueConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'namespace_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'display_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                       &lt;span class="s"&gt;'deleted_at'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="n"&gt;UniqueConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'namespace_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'public_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, this is a pretty standard model too. It defines a bunch of fields and sets a &lt;code&gt;UniqueConstraint&lt;/code&gt; to make sure we’re not storing duplicate folders. There is one problem, though — because of a &lt;a href="https://bugs.mysql.com/bug.php?id=8173"&gt;12-year-old MySQL bug&lt;/a&gt; (unique indexes allow any number of rows with &lt;code&gt;NULL&lt;/code&gt; values, which would defeat the constraint), we have to set &lt;code&gt;deleted_at = EPOCH&lt;/code&gt; in order to mark folders as not deleted.&lt;/p&gt;

&lt;p&gt;Fast forward to two weeks ago: we noticed a rare condition where we would delete a folder right before detecting a rename. To work around this, we decided to add a six-hour delay before deleting folders. To do that, we changed our previously innocuous query from this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session_scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;CATEGORY_TTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code would be perfectly normal if &lt;code&gt;Category.deleted_at&lt;/code&gt; had a default value of &lt;code&gt;None&lt;/code&gt;. In our case, however, it defaulted to &lt;code&gt;EPOCH&lt;/code&gt; to work around a MySQL bug, so the new filter matched every category that had never been deleted. This caused us to delete a whopping 300k (empty) folders in the 45 minutes it took us to notice the issue and roll back the patch.&lt;/p&gt;
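&lt;p&gt;A minimal sketch of the failure mode (the concrete &lt;code&gt;CATEGORY_TTL&lt;/code&gt; and dates below are hypothetical; only the names mirror the queries above): with &lt;code&gt;deleted_at&lt;/code&gt; defaulting to &lt;code&gt;EPOCH&lt;/code&gt; rather than &lt;code&gt;NULL&lt;/code&gt;, a never-deleted category satisfies the expiry condition just as well as a genuinely deleted one.&lt;/p&gt;

```python
import datetime

# Hypothetical values; EPOCH and CATEGORY_TTL mirror the names used in the
# queries above, but the actual TTL is not stated in the post.
EPOCH = datetime.datetime(1970, 1, 1)
CATEGORY_TTL = datetime.timedelta(days=30)

def is_expired(deleted_at, now):
    # Same test as the new query's filter: the category was soft-deleted
    # at least CATEGORY_TTL ago.
    return now - deleted_at >= CATEGORY_TTL

now = datetime.datetime(2018, 1, 15)

# A category soft-deleted long ago is correctly picked up for cleanup...
assert is_expired(datetime.datetime(2017, 1, 1), now)

# ...but a live category whose deleted_at defaulted to EPOCH instead of
# NULL also passes the filter and would be deleted.
assert is_expired(EPOCH, now)
```

&lt;p&gt;With the conventional &lt;code&gt;None&lt;/code&gt; default, live categories would have been excluded by the comparison rather than silently expired.&lt;/p&gt;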

&lt;h2&gt;
  
  
  Timeline
&lt;/h2&gt;

&lt;p&gt;(All times in PST)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;13:15 We started rolling out this patch as part of our daily code push to production.&lt;/li&gt;
&lt;li&gt;13:25 The oncall engineer got paged because of an elevated number of accounts failing. They immediately created a status page incident and started looking into the issue.&lt;/li&gt;
&lt;li&gt;13:35 After escalating the issue, several engineers started digging into our logs to figure out what was happening.&lt;/li&gt;
&lt;li&gt;13:45 We decided to roll back as a preventive measure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Later in the afternoon, we dug through the patches that had shipped and found the real cause of the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  What measures are we taking to prevent this from happening again?
&lt;/h2&gt;

&lt;p&gt;This is an interesting bug because it highlights the importance of consistency in complex systems: all our models follow the convention that &lt;code&gt;deleted_at&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; if the object isn’t soft-deleted. Both the person who wrote the code and the two people who reviewed it assumed it was innocuous. There is no obvious culprit, just unexpected interactions.&lt;/p&gt;

&lt;p&gt;To make sure that this doesn’t happen again we started working on two things:&lt;br&gt;
1) Auditing our codebase for parts which don’t follow our internal conventions and fixing them.&lt;br&gt;
2) Stable identifiers for all our API objects. This would let us resync an account from scratch without any customer-visible effects.&lt;/p&gt;

&lt;p&gt;Stay tuned for more details about this!&lt;/p&gt;


&lt;p&gt;This post was originally published on the &lt;a href="https://www.nylas.com/blog/why-consistency-matters-when-building-reliable-systems"&gt;Nylas Engineering Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>tips</category>
      <category>consistency</category>
    </item>
  </channel>
</rss>
