Good to see yet another SRE team taking ownership of monitoring! At HelloFresh we were in the same situation: we had tons of infrastructure and product services without any sort of monitoring.
With our move to k8s, anything that runs on top of it can leverage system metrics (CPU, memory, network, etc.). Services that expose HTTP endpoints at the k8s edge (ingress) get RED metrics (Requests, Errors, Duration) automatically. Since edge metrics are common to every service, we were able to automate dashboards away by creating a single general one that lets people filter by service name. Automating alerts was also possible.
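To make the idea concrete, here is a rough Python sketch of how RED metrics and a generic alert could be derived from edge request records keyed by service name. This is not our actual setup: the `EdgeRequest` record, the `red_metrics` and `services_to_alert` helpers, the 5xx-means-error rule, and the 5% threshold are all assumptions made up for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeRequest:
    """One request seen at the ingress (hypothetical record shape)."""
    service: str       # service name, the same label a dashboard would filter on
    status: int        # HTTP status code returned to the client
    duration_s: float  # time spent serving the request, in seconds


def red_metrics(requests, window_s):
    """Compute Requests/Errors/Duration per service over a time window."""
    by_service = defaultdict(list)
    for req in requests:
        by_service[req.service].append(req)

    result = {}
    for service, reqs in by_service.items():
        # Assumption: only 5xx responses count as errors.
        errors = sum(1 for r in reqs if r.status >= 500)
        durations = sorted(r.duration_s for r in reqs)
        # Nearest-rank 95th percentile of the request durations.
        p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
        result[service] = {
            "rate_per_s": len(reqs) / window_s,
            "error_ratio": errors / len(reqs),
            "p95_duration_s": durations[p95_index],
        }
    return result


def services_to_alert(metrics, max_error_ratio=0.05):
    """Return the services whose error ratio exceeds the (assumed) threshold."""
    return sorted(
        service
        for service, m in metrics.items()
        if m["error_ratio"] > max_error_ratio
    )
```

Because every service is measured the same way at the edge, one function (or one dashboard query) parameterized by service name covers all of them, and the same goes for the alert rule.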
We truly believe that without monitoring, software ownership is not possible. Since the move, incidents are detected (MTTD) and recovered from (MTTR) much faster.
We tune alerts religiously. TBH, I don't even know how we could be flying without the monitoring we have nowadays.
Right?! Once you have a good monitoring system in place, it's hard to envision life without it!