Yesterday a few of us at the office noticed that we hadn't gotten an alert in the #monitoring channel in Slack for over a week. We get an alert every time the error rate on dev.to passes a certain threshold.
@maestromac investigated. When I caught up with him a little while later about the issue, this was the diagnosis:
Everything's fine with the monitoring. Turns out the site's just more stable.
Have you ever heard a more beautiful utterance?
We lowered the threshold a bit, and should expect an alert now and then at the new level.
Happy coding ❤️
I'm certain I've just jinxed it, so expect some significant downtime.
Top comments (24)
At a previous company, one day, I looked into the server room and noticed a lot of red lights flashing on disks. I ran to the admin and told him, but he shrugged and told me "just because the lights are flashing red, doesn't mean there's something wrong"
Maybe logging systems should be built to also include periodic "All's Well" alerts...
(a) That way, you always know the alert system is working,
(b) Who couldn't use more good news?
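For what it's worth, here is a rough sketch of what such an "All's Well" heartbeat could look like. Everything in it is an assumption made for illustration: the function names `heartbeat_due` and `heartbeat_message`, the once-a-day interval, and the message wording are all hypothetical, and it says nothing about how dev.to's monitoring actually works.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical default: one quiet heartbeat per day.
DEFAULT_INTERVAL = timedelta(days=1)


def heartbeat_due(last_sent: Optional[datetime], now: datetime,
                  interval: timedelta = DEFAULT_INTERVAL) -> bool:
    """Return True if no heartbeat was ever sent, or the last one is at least `interval` old."""
    return last_sent is None or now - last_sent >= interval


def heartbeat_message(error_rate_pct: float, threshold_pct: float) -> Optional[str]:
    """Build the "All's Well" text, or return None when the error rate is above the threshold.

    When the threshold is exceeded, the regular alert should fire instead,
    so no heartbeat text is produced here.
    """
    if error_rate_pct > threshold_pct:
        return None
    return (f"All's well: error rate {error_rate_pct:.2f}% "
            f"is within the {threshold_pct:.2f}% threshold.")
```

A scheduler would call `heartbeat_due` on each tick and post the result of `heartbeat_message` to the channel only when it is due and not `None`. That way a silent channel means the alerting pipeline itself is broken, not that the site is healthy.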
Haha, well, by "periodic" I mean some quiet little message once a day in the log/channel, with no loud beeping every five seconds... ;)
Incidentally, an ironic twist on this is...when I got the email notification for your response, my email client couldn't load the YouTube video. So, I just saw "an error occurred".
My first thought was, "Aw, crap, I jinxed it!"
My team has been playing with the idea of “Monitoring Driven Development”. Create the failing alerts first, then get things deployed, now green. Guarantees we have good monitoring in place.
Next up: Before implementing a feature, put the instrumentation/metrics in place we need to determine if that feature is a success.
Amazing! Great job @dev.to team!
Ahahh I pictured Mac going back to check knobs and levers and gauges with one of those yellow safety hats with the embedded torchlight
We should have these kinds of props handy now that I think about it
One of our customers often has a "high transaction" week (200%-300% of normal volume), and they warn us about it before it starts. There have been various load issues in the past (not even during these high transaction weeks). A couple of weeks before one such week, I tracked down an issue that could lead to erratic behavior and addressed it. Various monitors became quite stable. When the high transaction week started, our monitoring showed absolutely nothing of significance. System load, memory usage, etc.: everything was still pretty much a flat line. The people on stand-by were worried something was broken and the transactions weren't going through. But nope, everything was working perfectly. This was quite a while ago. In the meantime, the average number of transactions per day has increased, and peak transactions have become higher. But none of this is really visible in our system monitoring.
Best of luck. I’m praying for you.
What do you use for monitoring, alarms and log gathering at Dev.to?
Performance is something I learned to keep an eye on at my previous company (a newspaper). The high traffic always kept me on edge, especially during big events.
I saw all the graphs and asked the DevOps team about them. They said, "It's OK, don't worry. If something happens, we'll let you know."
That phrase kept me OK, but still on edge lol.
What do you mean by "error rate"?
Percentage of web requests which fail.
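To make that concrete, here is a minimal sketch of the calculation and the threshold check. The function names `error_rate` and `should_alert` and the numbers in the usage note are made up for illustration; they are not dev.to's actual code or threshold.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Return the percentage of web requests that failed.

    Returns 0.0 when there was no traffic, to avoid dividing by zero.
    """
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests


def should_alert(total_requests: int, failed_requests: int,
                 threshold_pct: float) -> bool:
    """Return True when the error rate is strictly above the (hypothetical) threshold."""
    return error_rate(total_requests, failed_requests) > threshold_pct
```

With 3 failures out of 200 requests the error rate is 1.5%, so `should_alert(200, 3, 2.0)` stays quiet while `should_alert(200, 3, 1.0)` fires. Lowering the threshold, as the post describes, is just passing a smaller `threshold_pct`.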