12 practices to skyrocket your client's uptime as a DevOps Engineer

#webdev #beginners #community #motivation

For the last 5 years, I have built and managed large distributed systems for multiple clients.

Here are 12 practices I learnt that will make you an excellent DevOps Engineer:

1. Infrastructure health monitoring:

Overloaded machines degrade the overall customer experience. Knowing which instances are in the red is a must. Basic metrics that should be monitored:
-CPU Utilization
-Memory Usage
-Disk Usage
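
As a rough illustration of flagging instances that are in the red, here is a minimal sketch. The threshold values, the `HostMetrics` type, and the host names are assumptions made up for this example; in practice the numbers come from your monitoring agent and the limits are tuned per fleet.

```python
from dataclasses import dataclass

# Hypothetical red thresholds in percent; tune these for your own fleet.
RED_THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 80.0}


@dataclass
class HostMetrics:
    host: str
    cpu: float     # CPU utilization, percent
    memory: float  # memory usage, percent
    disk: float    # disk usage, percent


def red_metrics(metrics, thresholds=RED_THRESHOLDS):
    """Return the names of the metrics at or above their red threshold."""
    return [name for name, limit in thresholds.items()
            if getattr(metrics, name) >= limit]


def hosts_in_red(fleet):
    """Map each unhealthy host to the metrics that put it in the red."""
    result = {}
    for metrics in fleet:
        breached = red_metrics(metrics)
        if breached:
            result[metrics.host] = breached
    return result


fleet = [
    HostMetrics("web-1", cpu=95.0, memory=40.0, disk=30.0),
    HostMetrics("web-2", cpu=20.0, memory=30.0, disk=40.0),
]
print(hosts_in_red(fleet))
```

Only `web-1` is reported here, because its CPU utilization crosses the assumed 90% limit.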

2. Service health monitoring:

This answers the question: is this backend service healthy? Metrics that help answer it:
-Traffic hitting the endpoint
-Error rate
-Latency

Measuring average latency alone is not useful, because a few very slow requests disappear into the mean. Measuring p95 and p99 is a lot better for delivering a great experience.
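
To see why the average hides problems, here is a small sketch that compares the mean with p95 and p99. The latency samples are invented, and the nearest-rank method used here is only one of several common percentile definitions; your metrics backend may interpolate differently.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 hypothetical request latencies in ms: 94 fast requests and 6 slow ones.
latencies_ms = [50] * 94 + [2000] * 6

mean = statistics.mean(latencies_ms)  # 167 ms, which looks healthy
p95 = percentile(latencies_ms, 95)    # 2000 ms
p99 = percentile(latencies_ms, 99)    # 2000 ms
print(f"mean={mean} ms, p95={p95} ms, p99={p99} ms")
```

The mean suggests a fast service, while the percentiles show that the slowest requests take two seconds.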

3. Business metrics monitoring:

Identify the business events that the service enables and monitor those events. This kind of monitoring is bespoke to each organization, so it takes some brain power to list the events worth tracking. Monitoring them helps resolve business issues proactively.

4. Oncall Practices:

Oncall enables teams to improve their service over time. Teams that build services should own them and be responsible for their oncall as well.
Whenever an alert fires, the oncall engineer responds and looks into the details.

5. Alerting System:

When to fire an alert is a difficult question. It takes multiple iterations to get it right.
Tracking and categorizing alerts and measuring their signal-to-noise ratio is essential to improving the alerting system.
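
One simple way to put a number on signal-to-noise is the fraction of firings that needed action, per alert. The sketch below assumes the oncall engineer labels each firing as "actionable" or "noise" when closing it; the labels, alert names, and the 0.5 cutoff are all hypothetical.

```python
from collections import defaultdict


def alert_precision(alert_log):
    """Return, per alert name, the fraction of firings labelled actionable.

    alert_log is an iterable of (alert_name, category) pairs.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, category in alert_log:
        totals[name] += 1
        if category == "actionable":
            actionable[name] += 1
    return {name: actionable[name] / totals[name] for name in totals}


def noisy_alerts(alert_log, min_precision=0.5):
    """Return the alerts whose actionable fraction is below min_precision."""
    precision = alert_precision(alert_log)
    return sorted(name for name, p in precision.items() if p < min_precision)


# Hypothetical log of closed alerts.
alert_log = [
    ("HighCPU", "actionable"),
    ("HighCPU", "noise"),
    ("HighCPU", "noise"),
    ("DiskFull", "actionable"),
]
print(noisy_alerts(alert_log))
```

Alerts that keep showing up in this list are candidates for a higher threshold, a longer evaluation window, or removal.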

6. Outages & Incident Management:

Having a few standard processes in place at the time of an incident can make a huge difference:
-Runbooks must be attached to alerts. These are your first line of defense.
-Communicate outages across the organization
-Mitigate now, investigate tomorrow
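
The runbook rule is easy to enforce automatically. Below is a minimal sketch of a check that could run in CI, assuming alert definitions are loaded as dictionaries with a `runbook_url` field. That field name and the example alerts are assumptions for illustration, not the format of any particular alerting tool.

```python
def alerts_missing_runbook(alert_definitions):
    """Return the names of alerts that have no runbook attached."""
    missing = []
    for alert in alert_definitions:
        # Treat an absent, None, or blank URL as "no runbook".
        url = (alert.get("runbook_url") or "").strip()
        if not url:
            missing.append(alert["name"])
    return missing


# Hypothetical alert definitions.
alert_definitions = [
    {"name": "HighErrorRate", "runbook_url": "https://wiki.example.com/runbooks/high-error-rate"},
    {"name": "DiskFull", "runbook_url": ""},
    {"name": "QueueBacklog"},
]
print(alerts_missing_runbook(alert_definitions))
```

Failing the build when this list is non-empty keeps new alerts from shipping without a runbook.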

7. Postmortems, Incident Reviews & a Culture of Ongoing Improvement:

A good postmortem is blameless yet thorough and is accompanied by incident reviews. Robust systems are not built overnight. They are built through continuous iterations.

8. Planned Downtime:

Planned downtimes are an excellent way to test the resiliency of the system. They are also a great way to discover unexpected usage of a specific system.

9. Capacity Planning:

The cost of compute and storage can increase drastically if resources are not used efficiently. Capacity planning is important for:
-Financial Budgeting
-Choosing the right vendors
-Locking in discounts with cloud providers

10. Blackbox Testing:

Blackbox testing is a way to measure the correctness of a system as an end user would see it. This type of testing is similar to end-to-end testing. You should make the key user flows blackbox testable and make the tests triggerable at any time to check that the systems work.
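
A blackbox check can start as a script that hits each key flow from the outside and reports pass or fail. The sketch below treats a flow as healthy when its URL returns HTTP 200, which is a deliberately simple assumption; real flows usually need several steps and content checks. The flow names and URLs are hypothetical.

```python
import urllib.request


def http_status(url, timeout=5.0):
    """Default fetcher: send a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_blackbox_checks(flows, fetch=http_status):
    """Check every flow and return a mapping of flow name to True or False.

    flows maps a flow name to the URL that exercises it. The fetch function
    is injectable so the checks can be tested without a network.
    """
    results = {}
    for name, url in flows.items():
        try:
            results[name] = fetch(url) == 200
        except Exception:
            # Timeouts, DNS failures, and HTTP error responses all count
            # as a failed flow from the end user's point of view.
            results[name] = False
    return results


# Hypothetical key user flows; replace with your own endpoints.
KEY_FLOWS = {
    "login": "https://example.com/login",
    "checkout": "https://example.com/checkout",
}
```

Calling `run_blackbox_checks(KEY_FLOWS)` from a scheduler or a deploy hook lets the same checks be triggered at any time.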

11. SRE as an Independent Team:

As an organization grows, reliability work across teams starts to take up more than a few engineers' time. That is the time to introduce a dedicated team that owns the standard monitoring and alerting tools and introduces best practices across the company.

12. Reliability as an Ongoing Investment:

When building a long-lived product, the first version is just the start. If it is successful, the work just keeps coming. To ensure the system works reliably, continuous validation and checks are needed.

Following these practices in our team has helped us achieve an uptime of 99.945%. That's under 5 hours of downtime in a year. I hope these practices help you do better.
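
For anyone who wants to check the arithmetic behind an uptime figure, here is a tiny sketch that converts an uptime percentage into yearly downtime. It assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours_per_year(uptime_percent):
    """Convert an uptime percentage into hours of downtime per year."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


print(f"{downtime_hours_per_year(99.945):.2f} hours")  # about 4.82 hours
```

The same function shows how quickly the budget shrinks: 99.99% uptime leaves under an hour of downtime per year.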

Thanks for reading this.

If you have an idea and want to build your product around it, schedule a call with me.

If you want to learn more about the DevOps and backend space, follow me.

If you want to connect, reach out to me on Twitter and LinkedIn.