The Problem We Were Actually Solving
We were tasked with cutting latency in half for our most popular search queries, which accounted for over 80% of our traffic. Our current architecture, built around a caching layer and a heavily indexed database, was performing admirably for most users. However, during peak hours, we experienced a sudden and inexplicable increase in latency, culminating in a 10-second delay for some queries. Our documentation hinted at the use of caching and indexing, but it didn't provide any insight into the underlying causes of this phenomenon.
What We Tried First (And Why It Failed)
We started by tweaking our caching layer, attempting to increase its hit ratio by implementing a more sophisticated eviction policy. We also experimented with adjusting our database indexing strategy, reorganizing our query plans to reduce the number of joins and subqueries. However, these changes had a negligible impact on our latency issues, and we soon realized that the problem lay elsewhere. We were optimizing the wrong parameters, and our documentation wasn't pointing us in the right direction.
The Architecture Decision
After months of trial and error, we finally discovered the root cause of our latency issues: our query parameter cardinality was out of control. We had designed our system to handle a wide range of query parameters, but in doing so, we had inadvertently created a monster. Our database was being hit with a constant stream of queries with varying parameter combinations, causing a cascade of cache misses, database re-indexing, and slowdowns. We decided to restrict our query parameter cardinality, limiting the number of parameters that could be used in a single query. This decision required a significant rewrite of our client library, but it paid off in the most unexpected way.
What The Numbers Said After
After implementing our new parameter cardinality restrictions, we saw a dramatic reduction in cache misses, from 30% to less than 5%. Our database indexing efforts, which had previously been futile, now had a tangible impact on query performance. Our latency numbers plummeted, with the average query time decreasing from 300ms to just 150ms. But the most surprising metric was our allocation counts, which decreased by 25% due to reduced cache thrashing and database re-indexing.
What I Would Do Differently
Looking back, I would have invested more time upfront in understanding the nuances of query parameter cardinality and its impact on our caching and indexing strategies. I would have also made our documentation more explicit about these considerations, saving my team months of trial and error. But most importantly, I would have recognized the unspoken tradeoff between flexibility and performance, and made a harder decision earlier on to restrict our query parameter cardinality. In the end, it was a decision that paid off in the most significant way, and one that I will carry with me as a reminder of the importance of system-level optimization.
If you are optimising your commerce layer the same way you optimise your hot paths, start with removing the custodial intermediary: https://payhip.com/ref/dev2
Top comments (0)