Yesterday a few of us at the office were noticing that we hadn't gotten an alert in the #monitoring channel in Slack for over a week. We get alerts...
At a previous company, one day, I looked into the server room and noticed a lot of red lights flashing on disks. I ran to the admin and told him, but he shrugged and told me "just because the lights are flashing red, doesn't mean there's something wrong"
Maybe logging systems should be built to also include periodic "All's Well" alerts...
(a) That way, you always know the alert system is working,
(b) Who couldn't use more good news?
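A minimal sketch of that idea, assuming the alert system posts an "All's Well" message on a schedule and something else checks for it. The HeartbeatMonitor class, its method names, and the 25-hour window are all made up for illustration, not taken from any real tool:

```python
from datetime import datetime, timedelta
from typing import Optional


class HeartbeatMonitor:
    """Remembers the last "All's Well" message and flags a silent alert system.

    Hypothetical helper: the name and API are assumptions for this sketch.
    """

    def __init__(self, max_silence: timedelta):
        # Longest gap we tolerate between two "All's Well" messages.
        self.max_silence = max_silence
        self.last_seen: Optional[datetime] = None

    def record_heartbeat(self, when: datetime) -> None:
        # Call this whenever the periodic "All's Well" message arrives.
        self.last_seen = when

    def is_stale(self, now: datetime) -> bool:
        # No heartbeat ever seen counts as stale: we can't prove alerts work.
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.max_silence
```

With a once-a-day message, a max_silence of about 25 hours would leave a little slack before is_stale() returns True. The check probably needs to run somewhere other than the alert system itself, otherwise both go quiet together.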
Haha, well, understand by "periodic," it's some quiet little message once a day in the log/channel, with no loud beeping every five seconds... ;)
Incidentally, an ironic twist on this is...when I got the email notification for your response, my email client couldn't load the YouTube video. So, I just saw "an error occurred".
My first thought was, "Aw, crap, I jinxed it!"
My team has been playing with the idea of “Monitoring Driven Development”. Create the failing alerts first, then get things deployed, now green. Guarantees we have good monitoring in place.
Next up: Before implementing a feature, put the instrumentation/metrics in place we need to determine if that feature is a success.
Amazing! Great job @dev.to team!
Ahahh I pictured Mac going back to check knobs and levers and gauges with one of those yellow safety hats with the embedded torchlight
We should have these kinds of props handy now that I think about it
One of our customers often has a "high transaction" week (200%-300% of normal volume), and they warn us about it before it starts. There have been various load issues in the past (not even during these high transaction weeks). A couple of weeks before one of these weeks, I tracked down an issue that could lead to erratic behavior and fixed it. Various monitors became quite stable. When the high transaction week started, our monitoring showed absolutely nothing of significance. System load, memory usage, etc.: everything was still pretty much a flat line. The people on stand-by were worried something was broken and the transactions weren't going through. But nope, everything was working perfectly. This was quite a while ago. In the meantime, the average number of transactions per day has increased, and peak transactions have become higher. But none of this is really visible in our system monitoring.
Best of luck. I’m praying for you.
What do you use for monitoring, alarms and log gathering at Dev.to?
Performance is something I learned to keep an eye on at my previous company (a newspaper). The high traffic always kept me on edge, especially during big events.
I saw all the graphs and asked the DevOps team about them. They said, "It's OK, don't worry, if something happens we'll let you know."
That phrase kept me OK, but still on edge lol.
What do you mean by "error rate"?
Percentage of web requests which fail.
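As a rough sketch of how that could be computed, assuming "fail" means the server answered with a 5xx status code (that definition and the error_rate name are assumptions here, not necessarily how dev.to measures it):

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Return the percentage of requests that failed.

    Assumption for this sketch: a request "fails" when its HTTP status
    is 500 or above. Client errors such as 404 are not counted.
    """
    codes = list(status_codes)
    if not codes:
        # No traffic in the window: report 0 rather than divide by zero.
        return 0.0
    failed = sum(1 for code in codes if code >= 500)
    return 100.0 * failed / len(codes)
```

For example, error_rate([200, 200, 500, 404]) counts one failure out of four requests.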
Yes, it's a problem if you don't have any problem. :D
Stellar job, guys! Keep it up. Thank you for the hard work you put into this community. It means a lot to everyone here.
I use Sentry in one project and it's a good feeling to get the weekly reports after an update when the error rate has gone down 40% or so.
The error graphs approaching a flat-line more and more XD
Ben what's your take on Elixir? I see so many benefits. Have you ever considered using it for Dev.to?
I think it’s pretty sweet. Never seriously considered it for dev.to unless it just plugged right in nicely.
If we grow and find some time to be more exploratory (or have more dire scaling needs), it’ll definitely get some stronger consideration.
One pretty interesting thing for the future is Rust interop: usehelix.com
Whoa, thanks. This is really cool.
Whoa, that kind of "things are working well" makes me nervous.
Software engineering is never having to say you're done.
Maybe we could add a new observation here:
en.wikipedia.org/wiki/Fallacies_of...
That's so cool to hear, nice job!
Just curious, how do you usually justify the current threshold and if it should be lowered or raised?