Tackling Architectural Debt: How We Replaced a Production Elasticsearch Cluster
John Gerhardt May 31, 2017
As the quantity and complexity of application data scales with any burgeoning startup, we all run into performance and scalability issues. Some issues can be addressed with slight adjustments to existing infrastructure, while others require fundamental changes to the system’s architecture.
At Contactually, one such case was an unhealthy Elasticsearch cluster that suffered from availability problems, syncing problems, and a fundamentally inefficient flow of data through our systems.
This article will focus primarily on the large architectural change we made at Contactually in June of 2016 to reinforce an unreliable Elasticsearch cluster.
Why We Use Elasticsearch at Contactually
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.
At Contactually, we use Elasticsearch to allow our users to quickly filter their contacts by numerous search criteria such as name, email address, mailing address, phone number, zip code, and even custom fields the user has created. Speed is key here — many of our users have hundreds of thousands of contacts.
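To make this kind of filtering concrete, here is a minimal sketch of how such a search might be expressed as an Elasticsearch query body. The field names (`user_id`, `name`, `email`, `zip_code`) and the function name are assumptions for illustration, not our actual index schema; only the `bool`, `multi_match`, and `term` query types come from the Elasticsearch query DSL.

```python
from typing import Any, Dict, List, Optional


def build_contact_query(
    user_id: int, text: str, zip_code: Optional[str] = None
) -> Dict[str, Any]:
    """Build a query body as a plain dict (hypothetical field names)."""
    # Exact-match filters: always scope the search to one user's contacts.
    filters: List[Dict[str, Any]] = [{"term": {"user_id": user_id}}]
    if zip_code is not None:
        filters.append({"term": {"zip_code": zip_code}})

    return {
        "query": {
            "bool": {
                # Full-text match across several fields at once.
                "must": [
                    {"multi_match": {"query": text, "fields": ["name", "email"]}}
                ],
                # Filters narrow the result set without affecting scoring.
                "filter": filters,
            }
        }
    }
```

The resulting dict could be sent as the body of a search request by whichever client library you use.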
We store billions of data points in our normalized, relational data model. I won’t go into the pros and cons of normalization in this article, but they are very important to consider when designing any
Denormalized document storage
Imagine trying to find a record in your database by attributes that could be in any of 10 tables. This might require an extremely expensive join, especially if the scale of the data you’re working with is in the billions of records.
When using Elasticsearch, or any other distributed search service, you can build index schemas that intentionally denormalize data so that it can be queried quickly without expensive joins between numerous database tables.
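As a rough illustration of denormalization, the sketch below flattens rows from several hypothetical relational tables (`contacts`, `emails`, `phones`, `custom_fields`) into a single search document. The table and column names are assumptions made up for this example, not our real schema.

```python
from typing import Any, Dict, List


def build_contact_document(
    contact: Dict[str, Any],
    emails: List[Dict[str, Any]],
    phones: List[Dict[str, Any]],
    custom_fields: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Combine rows from separate tables into one denormalized document."""
    contact_id = contact["id"]
    return {
        "id": contact_id,
        "name": contact["name"],
        # Child rows are embedded so no join is needed at query time.
        "emails": [e["address"] for e in emails if e["contact_id"] == contact_id],
        "phones": [p["number"] for p in phones if p["contact_id"] == contact_id],
        "custom_fields": {
            f["name"]: f["value"]
            for f in custom_fields
            if f["contact_id"] == contact_id
        },
    }
```

The trade-off is that a change to any one of those source rows means the whole document has to be rebuilt and reindexed.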
However, the obvious downside is that any change in your “source of truth” database must quickly be replicated and indexed into your Elasticsearch cluster. For us at Contactually, this should usually take no longer than 50–100ms, though the acceptable lag depends on how frequently the data changes and on your particular use case.
One of the biggest benefits of Elasticsearch is the distributed nature in which the cluster stores documents. Each node within the cluster has a certain number of shards.
Elasticsearch makes horizontal scaling easy: when a new node is added to the cluster, Elasticsearch automatically relocates some shards to the new node in order to rebalance the cluster.
Redundancy is a fundamental principle of engineering. With replicas enabled, each shard is stored on at least two nodes (depending on configuration). More importantly, a primary shard and its replica are never stored on the same node, so if a node fails, another copy of each of its shards is still available elsewhere in the cluster. This avoids having a single point of failure.
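The placement rule can be sketched with a toy model. This is not Elasticsearch's actual allocation algorithm, just a simple round-robin scheme (with made-up function names) that keeps each primary and its replica on different nodes and then checks that losing any single node leaves a copy of every shard.

```python
from typing import Dict, List, Tuple

Placement = Dict[int, Tuple[str, str]]  # shard id -> (primary node, replica node)


def place_shards(num_shards: int, nodes: List[str]) -> Placement:
    """Assign each shard a primary and a replica on two different nodes."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement: Placement = {}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        # The next node in the ring is always different from the primary.
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = (primary, replica)
    return placement


def survives_node_loss(placement: Placement, failed_node: str) -> bool:
    """True if every shard still has at least one copy after a node fails."""
    return all(
        primary != failed_node or replica != failed_node
        for primary, replica in placement.values()
    )
```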
Lessons Learned in Self-Managing Our Cluster
We’ve covered why you might use Elasticsearch in your application, but not the complexity of managing the cluster itself. Our first implementation of Elasticsearch ran on numerous self-managed nodes on AWS EC2 instances.
Focus on Your Product — Not Managing your Cluster
With a small engineering team, we didn’t want to spend precious time and resources doing cluster management when we really needed to focus on the experience of our end users — fast query time, complex queries, and partial string matches using ngrams, etc.
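For readers unfamiliar with ngrams, here is a minimal pure-Python sketch of the idea behind partial string matching. It is not how Elasticsearch implements its ngram analysis; the function names and the fixed gram size of 3 are assumptions chosen for illustration.

```python
from typing import Dict, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return every lowercase substring of length n."""
    token = text.lower()
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def build_index(names: Dict[int, str], n: int = 3) -> Dict[str, Set[int]]:
    """Map each n-gram to the ids of the contacts whose name contains it."""
    index: Dict[str, Set[int]] = {}
    for contact_id, name in names.items():
        for gram in char_ngrams(name, n):
            index.setdefault(gram, set()).add(contact_id)
    return index


def search(index: Dict[str, Set[int]], query: str, n: int = 3) -> Set[int]:
    """Return ids whose names contain every n-gram of the query.

    This is a candidate filter: it can admit false positives, and queries
    shorter than n characters match nothing.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*(index.get(gram, set()) for gram in grams))
```

Because the grams are computed at index time, a lookup for a fragment such as "gerh" becomes a handful of dictionary reads rather than a scan over every name.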
Without an In-House Expert — Failure is Likely
Since our initial implementation of Elasticsearch was self-hosted on AWS EC2 instances, we had all kinds of reliability issues. Worse, we were serving data to our client directly out of Elasticsearch. If our Elasticsearch index got out of sync with our database, we’d be serving stale data to our users, who expect data to update in near real time.
Fundamental Engineering Principles Are Essential
The kicker — we didn’t have fault tolerance in place. When the Elasticsearch cluster went down, users thought we’d lost all their data. This obviously degrades trust in our brand — probably the biggest expense of all.
Regardless of what systems you inherit at a new team or company, an audit and analysis of all systems is critical to identifying technical debt and prioritizing which debt to tackle first.
The Bandaid — Use Postgres
The immediate takeaway was to prefer being slow and right over being fast and wrong — especially when core user experience was on the line. Our game plan consisted of a few key steps:
- Simplify the search functionality we provided users temporarily
- Use Postgres to search for contacts, knowing we’d require more IOPS to handle the increased Postgres load
- Remove Elasticsearch cluster from our stack, including all EC2 instances and indices
This approach had two benefits: 1) even though we were slower, we’d never give the user the impression that all of their data was gone and 2) by building a Postgres version of our search feature, we were setting ourselves up to have fault tolerance out of the box if our next implementation of Elasticsearch ever crashed.
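A simplified Postgres-backed search along these lines might look like the sketch below. The `contacts` table, its columns, the result limit, and the `%s` placeholder style (as used by drivers such as psycopg2) are all assumptions for illustration; the function only builds a parameterized query and does not execute it.

```python
from typing import Tuple


def build_postgres_search(user_id: int, text: str) -> Tuple[str, Tuple[int, str, str]]:
    """Build a parameterized ILIKE query for a simplified contact search."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (
        text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    pattern = f"%{escaped}%"
    sql = (
        "SELECT id, name, email FROM contacts "
        "WHERE user_id = %s AND (name ILIKE %s OR email ILIKE %s) "
        "ORDER BY name LIMIT 50"
    )
    # Values are passed separately so the driver handles quoting.
    return sql, (user_id, pattern, pattern)
```

A leading-wildcard `ILIKE` generally cannot use an ordinary B-tree index, which is one reason this path is slower and needs more IOPS than a dedicated search index.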
Let’s Try this Again — Re-adding Elasticsearch
Once we’d put the bandaid in place and our users were much more confident that their data wasn’t going anywhere, it was time to reimplement Elasticsearch. There were some core engineering principles we wanted to keep in mind:
- High availability: if the cluster crashed, we wanted a backup in place to which we could automatically fail over.
- No single point of failure: if the Elasticsearch cluster and its backup went down, we could fall back to Postgres search with a reduced feature set.
- No self-managed cluster: we chose elastic.co to host our cluster and automatically perform all cluster management (including scaling the cluster up and down for bulk indexing operations, e.g. when a schema changes).
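The failover order described in the list above (primary cluster, then backup, then Postgres) can be sketched as a simple chain of backends. This is an illustrative outline rather than our production code: the backend names, the stub functions, and the choice to treat `ConnectionError` and `TimeoutError` as "backend is down" are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

SearchBackend = Callable[[str], List[int]]  # query -> matching contact ids


def search_with_failover(
    query: str, backends: Sequence[Tuple[str, SearchBackend]]
) -> Tuple[str, List[int]]:
    """Try each backend in order and return the first successful result."""
    errors: List[str] = []
    for name, backend in backends:
        try:
            return name, backend(query)
        except (ConnectionError, TimeoutError) as exc:
            # Record the failure and fall through to the next backend.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all search backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for real clients, used only to exercise the chain.
def unreachable_cluster(query: str) -> List[int]:
    raise ConnectionError("cluster unreachable")


def postgres_stub(query: str) -> List[int]:
    return [1, 2]
```

Returning the name of the backend that answered lets the caller decide whether to hide features that the reduced Postgres search does not support.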
The Other Side of Move Fast and Break Things
Early in the life of a startup, many shortcuts will be taken. If you slow down and do everything the right way, someone might beat you to market and gain critical market share. However, as a company matures, the time to pay down that technical debt will come.
This is just one of the projects we’ve tackled at Contactually that have informed how we embark on new ones: we keep engineering best practices in mind when implementing critical systems.
Have you or your company experienced a similar growing pain? Please comment and share your experiences!