I'm Zhenya Kosmak, a product manager, and this article describes my experience as a Technical Product Manager. You can connect with me on LinkedIn if you want to discuss your project or anything related to this article, and I'll be glad to share some advice 🙃
If your development team has spent several thousand hours on your product and it's already in production, its stability becomes a significant concern. All services on the servers should run stably, and if a critical problem appears somewhere, the development team should spot it and start fixing it. In this article, we'll talk about our experience setting up an alerting system for Calibra. In this case, we managed not only to ensure the technical stability of the product but also to optimize costs and improve our client's processes.
This article is part of the series "Alerting system: why it's necessary both for developers and product owners." This part describes how an alerting system comes in handy when it's hard to keep everything close at hand.
To make the problems we solved easier to understand, we need to cover the essentials of the product. Calibra is a BPMS (business process management system), i.e., a system that covers most of our client's workflows. The client's company managed numerous advertising campaigns on behalf of its own customers and earned from each lead it brought. Calibra managed the ad accounts from which the ads were launched, stored all advertising settings, automatically adjusted the ads, collected statistics on advertising effectiveness, and much more.
How it all worked
TLDR:
- We collected metrics on each server of the system using various tools. Every metric was sent to centralized storage.
- We called any unexpected situation an "event-to-alert." Each one has a start date and time (when the issue appeared) and an end (when the issue was resolved).
- We used centralized settings for sending notifications about such events. When something went wrong, we sent a message to a Slack channel, and we did the same when the situation was resolved (a minimal sketch of this flow follows the list).
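To make this more concrete, here is a minimal sketch of how such an event-to-alert and its Slack notifications could look. It assumes the `requests` library and a Slack incoming webhook; all names (`EventToAlert`, `notify`, `SLACK_WEBHOOK_URL`) are illustrative and not Calibra's actual code.

```python
# A minimal, hypothetical sketch of the "event-to-alert" model described above.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

import requests  # assumed available for posting to a Slack incoming webhook

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


@dataclass
class EventToAlert:
    source: str                      # which service or 3rd party the issue came from
    description: str                 # human-readable summary of the problem
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    resolved_at: Optional[datetime] = None

    def resolve(self) -> None:
        self.resolved_at = datetime.now(timezone.utc)


def notify(event: EventToAlert) -> None:
    """Send a start or resolution message to the shared Slack channel."""
    if event.resolved_at is None:
        text = f":rotating_light: {event.source}: {event.description} (since {event.started_at:%H:%M UTC})"
    else:
        text = f":white_check_mark: {event.source}: resolved at {event.resolved_at:%H:%M UTC}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
```

The key design point is that every event carries both a start and an end, so the Slack channel reflects not only that something broke but also when it came back.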
As a result, the client stayed informed about the product's problems. And by this, we mean not just technical issues but also the problems of the client's business as such.
"Reminders" to pay for 3rd-party tools
Almost all products now use 3rd-party tools; this is advantageous given the savings in development costs. Calibra had a dozen such integrations. Some could be paid annually; others were billed on an unpredictable schedule. Our client had to use several credit cards because not every 3rd party could bill every card.
So our client would inevitably forget to top up one of the credit cards or get the payment amount wrong from time to time. As a result, the 3rd-party tool would stop working until the payment was made. Since we changed 3rd parties fairly often, we decided this risk should be covered by an automated solution.
Since such a situation isn't a real long-term threat, we agreed it would be handled post factum: when a payment error happens, we inform the client. So we developed a simple process:
- A payment fails, and the 3rd-party tool stops working.
- Its API starts returning errors, and the alerting system determines from the error code that it's a non-payment issue (see the sketch after this list).
- Both the client and we find out via a Slack notification, and the client can solve the problem immediately.
- As a result, the integration resumes working in the shortest possible time.
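As a rough illustration of the second step, here is a hypothetical check that classifies a 3rd-party failure as a payment issue by its status code and reuses the `EventToAlert` and `notify` helpers from the earlier sketch. The codes in `PAYMENT_ERROR_CODES` are placeholders, since every provider signals billing problems differently.

```python
# A hypothetical sketch of step 2: classify a 3rd-party API failure by status code
# and raise an event-to-alert for it.
import requests

from alerting import EventToAlert, notify  # helpers from the earlier sketch (hypothetical module)

# Status codes a given provider uses to signal billing problems; these vary per
# 3rd party and have to be learned from its docs or support (see the 401 story below).
PAYMENT_ERROR_CODES = {402, 403}


def check_third_party(name: str, health_url: str) -> None:
    try:
        response = requests.get(health_url, timeout=10)
    except requests.RequestException as exc:
        notify(EventToAlert(source=name, description=f"API unreachable: {exc}"))
        return

    if response.status_code in PAYMENT_ERROR_CODES:
        notify(EventToAlert(source=name, description="Looks like a failed payment, please check the card"))
    elif response.status_code >= 500:
        notify(EventToAlert(source=name, description=f"Server error {response.status_code}"))
```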
With this type of alert, you need the number of false alerts to be minimal. When an employee has to check the issue right after a notification, it shouldn't become a routine. If such alerts fire frequently, the problem should be solved in another way; for example, the client may hire a separate clerk to take care of it. But when you get 1–3 notifications monthly, as it should be on a growing product, this solution is just what you need.
In our case, such situations occurred regularly, but they didn't significantly harm stability. Thanks to the monitoring system, we protected the product's stability from the human factor in this matter.
There were several remarkable technical aspects behind this solution. One of the 3rd-party tools we used turned out to reply with the 401 (Unauthorized) HTTP code. Usually, it means that you provided incorrect credentials, e.g., the wrong password, in your request. We rechecked our credentials, and they were fine. When we contacted support, we found out that this response meant an expired subscription. According to their logic, if a user fails to pay for a subscription, this user no longer exists, so a 401 response is valid. Huh, it happens. So we added this case to the alerting system and documented why it was built in such a bizarre way.
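For illustration only, this is roughly how such a quirk could be encoded so the knowledge doesn't live solely in someone's head. The function and the other status-code mappings are hypothetical, not the provider's documented behavior.

```python
# Hypothetical illustration of the quirk described above: for this particular
# provider, 401 means an expired subscription rather than bad credentials.
def classify_provider_error(status_code: int) -> str:
    if status_code == 401:
        # Counter-intuitive, but confirmed by the provider's support:
        # an unpaid subscription makes the user "not exist", hence 401.
        return "payment_issue"
    if status_code >= 500:
        return "provider_outage"
    return "ok" if status_code < 400 else "unknown_error"
```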
Another case: one of the 3rd parties began generating frequent alert noise. As we figured out, the 3rd-party team was struggling with growing load and repeatedly responded with a "Gateway Timeout" error. We simply changed the rule: such responses count as an event-to-alert only if we get no 200 responses for a whole hour.
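A simplified sketch of that rule might look like this, assuming an in-memory record of the last successful response per 3rd party; a real setup would persist this state in the metrics storage.

```python
# Simplified sketch of the debounce rule described above: gateway timeouts are
# promoted to an event-to-alert only if there has been no 200 response for an hour.
from datetime import datetime, timedelta, timezone

NOISE_WINDOW = timedelta(hours=1)
_last_success: dict[str, datetime] = {}  # illustrative in-memory state


def record_response(source: str, status_code: int) -> None:
    """Remember the last time a given 3rd party answered successfully."""
    if status_code == 200:
        _last_success[source] = datetime.now(timezone.utc)


def should_alert(source: str) -> bool:
    """Alert only if we haven't seen a single 200 from this source within the window."""
    last_ok = _last_success.get(source)
    return last_ok is None or datetime.now(timezone.utc) - last_ok > NOISE_WINDOW
```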
Thus, to build a trustworthy alerting system, you have to know your 3rd parties pretty well.
Bottom line
It might look unimportant. But when on Monday you discover that some of the landing pages were down the whole weekend because a Cloudflare payment failed, and two days of marketing budget were wasted… you change your opinion.
The same result could be achieved in other ways, but it's much more convenient to have a generalized solution that can be customized at any time.
If you need something similar for your product, we'll be glad to discuss it.