When we think about data caching in an API, we automatically think of faster response times, fewer database calls, better overall performance - everything positive. The more critical the API (e.g. financial), the more benefit an optimization technique like caching can bring. However, in one specific scenario, data caching became a bottleneck. Yes, that is right.
This was the case for an on-premises API (to be precise: a REST endpoint backed by Spring Boot, plus a number of other integrations and libraries) that was working well in production, but needed to run in high availability / multi-instance mode to become more scalable and stable.
When faced with a situation like this, a development team usually produces a good number of opinions and options to analyse: use Redis or another distributed cache, synchronize the Spring cache between instances, and more of the same - plus one unusual option: remove data caching. After a lot of deliberation, the last option was chosen, i.e. caching was removed altogether from the API. There were a few reasons for this:
- Synchronizing a cache across multiple service instances is error-prone if done through custom mechanisms (heard of the cache invalidation problem in computer science? :-))
- Introducing a distributed cache like Redis would mean additional components, maintenance and possibly infrastructure (and new skills too)
- The API was write-heavy - more data is written or sent downstream than is read - so removing caching also removed extra lookups within the boundary of the application
- The API heartbeat checks are periodic (e.g. every 30 seconds); beyond that, read traffic is not significant
- The API volume is growing gradually, not exponentially (in the short to medium term)
- We need to move the API to the cloud in 12+ months, which will eventually require some re-design anyway
- High availability is a design principle that needs to be adhered to (by design, not by accident)
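The first bullet above is the crux, and it can be shown with a minimal sketch. This is not the actual API's code - it is a hypothetical plain-Java simulation of two service instances, each holding its own in-process cache (as Spring's default in-memory cache would behave), reading and writing against one shared database:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the multi-instance stale-cache problem.
// Each AccountService object stands in for one running instance of the API.
class AccountService {
    // One database shared by every instance (static simulates the shared DB).
    static final Map<String, Integer> database = new HashMap<>();
    // Per-instance, in-process cache - the root of the staleness problem.
    private final Map<String, Integer> cache = new HashMap<>();

    int getBalance(String id) {
        // Cache-aside read: serve from the local cache, fall back to the DB.
        return cache.computeIfAbsent(id, database::get);
    }

    void setBalance(String id, int value) {
        database.put(id, value);
        cache.put(id, value); // only THIS instance's cache learns of the write
    }
}

public class StaleCacheDemo {
    public static void main(String[] args) {
        AccountService instanceA = new AccountService();
        AccountService instanceB = new AccountService();

        instanceA.setBalance("acct-1", 100);
        System.out.println(instanceB.getBalance("acct-1")); // 100, now cached on B

        instanceA.setBalance("acct-1", 250);                // write lands on A only
        System.out.println(instanceA.getBalance("acct-1")); // 250
        System.out.println(instanceB.getBalance("acct-1")); // still 100: stale!
    }
}
```

Instance B keeps serving 100 after the balance changed to 250, because nothing invalidates B's cache when A writes. Fixing this means cross-instance invalidation or a shared cache - exactly the extra machinery the list above weighs against simply removing the cache.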
The trade-off? A minuscule hit to performance in lieu of high availability and stability benefits (proven tangibly through load testing).
Given the above, the decision was made to remove caching from the API. It was not an easy one - every developer feels emotional about the code they write :-) - but in this case it was clearly the pragmatic option, and also an example where caching turned out to be unnecessary over-engineering in an API.
Lesson: even a best practice like caching needs to be assessed for fit before being applied - whether the service is on-premises or in the cloud - considering data load, number of consumers, non-functional requirements, long-term impacts and more.
The outcome was a leaner API service that can run in multi-instance mode, with no need to worry about whether the data being saved or retrieved is the latest or stale.
Thanks for reading this far. If anyone has similar or other experiences with design choices, I would love to hear and learn from them.