
Sociable Steve

Let's talk quality - Part 3

In my first post I covered Operational Quality at a high level. The reality is that you might have written the cleanest, nicest code by all the metrics covered in my last post, but that doesn't mean you're being efficient with the resources you have available.

As you may recall, one of the aspects of a high-quality system is that it's easily scalable. In my mind this means there's an efficient use of resources, so you can scale both horizontally (more instances) and vertically (bigger instances) as necessity requires.

I've seen many examples of software engineers focusing only on the quality of the code, making their own lives easier, and ignoring the operational concerns until it's too late and systems have already broken down.

In this post I'll discuss what metrics I think are good to measure and feed into a more holistic quality viewpoint.

What is Operational Quality?

When we write software we know that it'll need to run somewhere. That might be on a server, a PC, a mobile device, an embedded device, or somewhere else. Operational Quality is about how efficiently the resources in that execution environment are used, and how effective the application is at working with those resources when things need to scale.

We also know that things will always go wrong, no matter how hard you try to make sure your system is robust enough, so there's a question around how well you can understand when things are going wrong, what's going wrong, and where it's going wrong.

In my mind this includes a whole range of things, and a starting point of things to consider could be:

  • Memory usage
  • CPU usage
  • Network usage
  • Disk usage
  • Start-up times
  • Responsiveness
  • Appropriate Logging
  • Monitoring and Alerting

There's quite a bit to measure here. Some of this can be automatically tracked, and some requires a bit more insight from people to understand what 'appropriate' means.
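As a starting point, several of these can be sampled from inside the process itself with nothing but the standard library. Here's a minimal Python sketch measuring elapsed time and peak memory for a unit of work (the list comprehension is just a stand-in for real start-up or request-handling work):

```python
import time
import tracemalloc

# Track memory allocations and wall-clock time for a unit of work.
tracemalloc.start()
start = time.perf_counter()

# Stand-in for application start-up / request handling.
data = [n * n for n in range(100_000)]

elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"elapsed: {elapsed:.3f}s, peak memory: {peak / 1024:.0f} KiB")
```

This only covers in-process measurements; network, disk, and whole-host usage are better observed from the outside with the tooling discussed below.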

Let's talk about how I'd measure these.


I'm certainly not an expert in this area, and most of my experience is working with server environments rather than desktop or mobile applications. That said, my go-to tooling for understanding resource usage seems to work for all of these applications, and that's New Relic. This gives details on resource usage and lets you start getting error traces.

For a less costly setup in server environments I tend towards a couple of other stacks.

  • Prometheus + Alertmanager + Grafana for resource usage monitoring and alerting
  • Elasticsearch + Logstash + Kibana for log aggregation and interrogation
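To illustrate the Prometheus side of that stack, an alert rule can flag sustained memory pressure before it becomes an incident. This is an illustrative sketch assuming the standard node_exporter metrics; the group name, threshold, and duration are placeholders, not recommendations:

```yaml
# Illustrative Prometheus alerting rule, routed through Alertmanager.
groups:
  - name: resource-usage
    rules:
      - alert: HighMemoryUsage
        # Fires when less than 10% of memory is available for 5 minutes.
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
```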

Understanding resource usage for desktop and mobile apps may require more specialised knowledge than I have at the moment, but New Relic claims to have you covered.

What is Appropriate Logging?

Appropriate logging in this case is about getting enough information to understand where errors are occurring, but not so much that your systems are getting overwhelmed. The actual amount of logging will vary from application to application, and there are no real studies I can point at as to what good looks like.

The real test of whether logging is appropriate comes from how easily things can be investigated in an emergent situation, e.g. a live incident or a suspected issue. This leads nicely on to the next section...

Handling Live Problems

While you may have the most efficient resource usage possible and can scale like a pro, there will always be times when things break. How you handle them is another important operational concern, and the metrics that can be gathered from incident handling can help to understand quality in more depth. Metrics around incidents include things like:

  • Mean-time-to-acknowledge (MTTA). How long did it take to know there was an issue, and for a first responder to acknowledge they were dealing with it?
  • Mean-time-to-identify (MTTI). How long did it take to identify the cause of the issue?
  • Mean-time-to-resolve (MTTR). How long did it take to resolve the issue? This is a bit trickier to define, and could mean either that service is restored or that the permanent fix is released.
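Computing these is straightforward once incidents are recorded with timestamps. A small sketch with hypothetical incident records, treating "resolved" as service restored:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: detection, acknowledgement,
# and resolution timestamps for each incident.
incidents = [
    {"detected": datetime(2023, 5, 1, 9, 0),
     "acknowledged": datetime(2023, 5, 1, 9, 5),
     "resolved": datetime(2023, 5, 1, 10, 0)},
    {"detected": datetime(2023, 5, 2, 14, 0),
     "acknowledged": datetime(2023, 5, 2, 14, 15),
     "resolved": datetime(2023, 5, 2, 15, 30)},
]

def mean_delta(pairs):
    """Average the time between each (earlier, later) timestamp pair."""
    pairs = list(pairs)
    total = sum(((later - earlier) for earlier, later in pairs), timedelta())
    return total / len(pairs)

mtta = mean_delta((i["detected"], i["acknowledged"]) for i in incidents)
mttr = mean_delta((i["detected"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta}, MTTR: {mttr}")  # MTTA: 0:10:00, MTTR: 1:15:00
```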

While these metrics themselves aren't really a point of consideration in quality, any retrospective on incidents should aim to understand if the monitoring, alerting, and logging were at the right kind of level to help resolve the incident effectively. This is where the feedback of the quality around these metrics really comes from.


There are a lot of variables in measuring resource usage, and different environments will require different understanding and tooling. Your best bet is working with your local, friendly SRE to see what they can do to help you get the metrics you need.

In this post we've looked beyond just development quality and started to understand that how our systems run is equally important to a more holistic view of quality. Even if operations isn't your specialty, it'll pay dividends in the future to at least understand what's happening to your software in the wild.
