What We Tried First (And Why It Failed)
When we first noticed this issue, we assumed that a few more logging statements and some additional metrics would solve it. We rolled out a new logging framework, which gave us far more data to work with. The volume quickly became overwhelming: operators needed several hours to sift through the logs to find what they were looking for. The new metrics had the same problem, since it was hard to tell which ones were relevant and which were not.
The Architecture Decision
Eventually, we realized that we needed a more holistic approach to monitoring and logging. We decided to implement distributed tracing, which lets us follow a single request end to end as it traverses our system. We chose the OpenTelemetry SDK for instrumentation, with Jaeger as the tracing backend. With traces in place, we could see not just the logs and metrics but also the actual path each request took through the system.
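To make the idea of following a request concrete, here is a minimal sketch of the span model that distributed tracing relies on. It uses only the Python standard library and is not the OpenTelemetry API or our production code; `Span`, `start_span`, `finished_spans`, and `handle_search` are hypothetical names made up for illustration. Each unit of work records a shared trace ID plus the ID of its parent, which is what lets a backend such as Jaeger reassemble the full request path.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace (hypothetical, simplified)."""
    name: str
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None


# Tracks the span that is currently active in this execution context.
_current_span: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)

# Stand-in for an exporter: a real setup would ship spans to a backend.
finished_spans: List[Span] = []


@contextmanager
def start_span(name: str) -> Iterator[Span]:
    """Open a span as a child of whatever span is currently active."""
    parent = _current_span.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
    )
    token = _current_span.set(span)
    try:
        yield span
    finally:
        span.end = time.perf_counter()
        _current_span.reset(token)  # restore the parent as the active span
        finished_spans.append(span)


def handle_search(query: str) -> List[str]:
    """Hypothetical request handler with two traced steps."""
    with start_span("handle_search"):
        with start_span("parse_query"):
            terms = query.split()
        with start_span("fetch_results"):
            results = [f"doc-for-{term}" for term in terms]
        return results


handle_search("distributed tracing")
for span in finished_spans:
    print(span.name, span.trace_id, span.parent_id)
```

Every printed span carries the same trace ID, and the two inner spans point at the `handle_search` span as their parent. In a real deployment the OpenTelemetry SDK handles this bookkeeping and also propagates the context across service boundaries, which this sketch does not attempt.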
What The Numbers Said After
After rolling out distributed tracing, our production operators were able to quickly identify the root cause of issues in our search engine. According to our metrics, the average time to resolve an incident fell by 45% over the next quarter, and the total time operators spent on incident resolution fell by 25%.
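The two percentages measure different things, so they do not have to match. As a rough illustration only, suppose total operator time is approximated as incident count multiplied by average resolution time. That model is an assumption made for this sketch, not a definition taken from our metrics, and `implied_workload_ratio` is a hypothetical helper. Under that assumption, the figures imply how the incident count would have changed:

```python
def implied_workload_ratio(mean_reduction: float, total_reduction: float) -> float:
    """Incident count after / before, ASSUMING total time = count * mean time."""
    return (1 - total_reduction) / (1 - mean_reduction)


# 45% drop in average resolution time, 25% drop in total operator time.
ratio = implied_workload_ratio(mean_reduction=0.45, total_reduction=0.25)
print(f"implied incident-count ratio: {ratio:.2f}")
```

A ratio above 1 would mean operators handled more incidents in the later quarter while still spending less time overall. Other explanations, such as time spent on work that tracing does not speed up, would fit the numbers equally well.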
What I Would Do Differently
If I were to do it again, I would start with a smaller pilot project to test the distributed tracing system before rolling it out to the entire system. That would have let us iron out any issues before they became widespread. I would also involve our operators more closely in the design process, to get their input on what they need to see in order to troubleshoot issues more effectively.