Laura Lindeman for Salesforce Engineering

Posted on • Originally published at engineering.salesforce.com

3 Metrics Relevant to Every Service, Always

Happy services are all alike.

In particular, happy services share a few common characteristics and success criteria. In Salesforce Security Engineering, our teams retrospected on these similarities and identified 3 positive β€œtells” for service health. Now, our engineering efforts explicitly invest in these measurable attributes:

  • SLA attainment
  • Engagement
  • Code velocity

less > more: β€œIf you have any more than three priorities, you don’t have any.β€β€Šβ€”β€ŠJim Collins

The right metrics help our service owners, our leaders, and our customers. Understanding what matters empowers service owners to act autonomously for results and to prioritize: β€œDoes it move the metric? Do that first.” Alignment on metrics enables leadership to perceive service value and quality: β€œAre we meeting customer needs? Are we building the right product? Are we delivering value at a rate that makes sense?” Customers trust dependencies that are honest and lose faith in those that are not: β€œIf I cannot use your service successfully, how can you be reporting green?!” If customer pain is not captured in the metric then the metric is a lie. When we constrain ourselves to only three priorities, we are forced to identify keystone metrics. The above three attributes trigger cascading positive outcomes.

SLA Attainment: β€œDon’t mistake activity with achievement.β€β€Šβ€”β€ŠJohn Wooden

Services promise value to customers, which is captured in a statement of SLA (service-level agreement). Consider the GCP Cloud KMS: β€œβ€¦[Cloud KMS] will provide a Monthly Uptime Percentage to Customers of 99.5% for Encrypt, Decrypt, and Sign operations…” The SLA has fine print, but the promise is clear. In our organization, we define SLA attainment as our ability to deliver on the promise: β€œIn a given reporting period (typically 1 week), did your customers receive SLA-quality service?”

Customers depend upon the SLAβ€Šβ€”β€Ša clear, measurable promiseβ€Šβ€”β€Što inform their own decisions. For the above example, a customer may ask, β€œIf KMS is expected to be down for a period of time, should I build retry logic into my clients? Should I consider a design with multiple keys, assuming that they do not fail coincidentally?” When our engineering teams define an SLA, the structure of the promise matters, our ability to deliver the promise matters, and our ability to know that we broke the promise matters. A reasonable goal for a team could be to achieve SLA for 95% of customers in a week. If we hold ourselves to a high bar then we can expect to not be perfectβ€Šβ€”β€Šwe leave room for progress. We can dive deep on those 5% which were outside of SLA and find ways to improve.
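The β€œ95% of customers in a week” goal above can be made concrete. Below is a minimal sketch of such a weekly attainment calculation; the per-customer success threshold, data shape, and function name are illustrative assumptions, not our production implementation:

```python
# Hypothetical sketch: weekly SLA attainment as the fraction of customers
# who received SLA-quality service. The 99.5% per-customer success threshold
# is an assumption for illustration (echoing the Cloud KMS example).
from collections import defaultdict

SLA_SUCCESS_RATE = 0.995  # a customer's weekly request success rate must meet this


def sla_attainment(requests):
    """requests: iterable of (customer_id, succeeded: bool) for one week.

    Returns the fraction of customers whose weekly success rate met the SLA.
    """
    totals = defaultdict(lambda: [0, 0])  # customer -> [successes, total]
    for customer, ok in requests:
        totals[customer][1] += 1
        if ok:
            totals[customer][0] += 1
    met = sum(1 for successes, n in totals.values() if successes / n >= SLA_SUCCESS_RATE)
    return met / len(totals) if totals else 1.0
```

With this shape, β€œdive deep on the 5% outside of SLA” becomes a simple filter over the same `totals` map: the customers whose ratio fell below the threshold are exactly the ones to investigate.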

Reporting SLA attainment builds trust between service owners and customers, as well as service owners and leadership:

  • Automated measurement implies automated monitoring, which enables detection and engagement before a customer notices. Transparency precedes trust. Trust in our ability to deliver grows when we proactively inform our customers of availability flickers.
  • A shared definition of success held by the team, leadership, and customers clarifies the daily decisions and trade-offs we make as service owners.
  • Teams can perform without traditional organizational paranoia (β€œEverything is great over here! Please don’t look too closely!”). An engineering leadership that expects and welcomes the service-level hormesis that follows incidents is a leadership that shares priorities with service teams. Measures exist in part to inform our planning and our investments to improve.

The specific SLA measurement depends on the service, its goals, and particular customer expectations. For example, an SLA for a PKI certificate issuance/renewal service could be, β€œ99.9% of all certificates on healthy clients are renewed before expiration.”
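That certificate-renewal SLA can be checked with a handful of lines. This is a hedged sketch under assumed field names (`healthy`, `expires_at`, `renewed_at`), not a real inventory schema:

```python
# Hypothetical sketch of the certificate-renewal SLA: among certificates on
# healthy clients, what fraction was renewed before expiration? Unhealthy
# clients are excluded, matching the SLA statement's fine print.
from datetime import datetime, timedelta


def renewal_attainment(certs):
    """certs: list of dicts with keys 'healthy' (bool), 'expires_at'
    (datetime), and 'renewed_at' (datetime or None if never renewed).

    Returns the fraction of healthy-client certificates renewed in time.
    """
    eligible = [c for c in certs if c["healthy"]]
    if not eligible:
        return 1.0
    on_time = sum(
        1
        for c in eligible
        if c["renewed_at"] is not None and c["renewed_at"] < c["expires_at"]
    )
    return on_time / len(eligible)
```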

SLAs (and associated metrics) often advance as the service matures. After a period of time (e.g. a quarter or semester), if business needs demand, a service might be able to β€œdeliver another nine.” A service could strive to increase the SLA attainment targetβ€Šβ€”β€Šfor example, from 95% to 98%. The example certificate issuance service could drastically reduce certificate lifetimeβ€Šβ€”β€Šfor example, from 7 days to 24 hoursβ€Šβ€”β€Ša policy change which would impact the SLA without modifying the SLA statement. Some services will need more than one SLA.

Every service can and should measure SLA attainment. Our goal is to make sure that all our services define and measure SLA, and that those measurements accurately reflect the customer experience.

Engagement: β€œThe winner is the first company to deliver the food the dogs want to eat.β€β€Šβ€”β€ŠAndy Rachleff

When customers are adopting and benefiting from your service, we say that they are engaged. Service engagement is a function of customer trust, value delivery, and quality. Engagement should measure actual customer interaction with a service, rather than approximate customer intent. In our teams, example engagement metrics include β€œnumber of unique secrets accessed this week,” β€œnumber of unique workloads authenticating with a certificate today,” or β€œnumber of unique keys used for crypto operations in the last month.”
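Engagement metrics of this shape reduce to a distinct-count over a usage-event stream within a time window. A minimal sketch, assuming a simple `(key_id, used_at)` event shape rather than any real audit-log API:

```python
# Hypothetical sketch: "number of unique keys used for crypto operations in
# the last month," computed from a stream of usage events. The event shape
# and 30-day window are assumptions for illustration.
from datetime import datetime, timedelta


def unique_keys_last_month(events, now):
    """events: iterable of (key_id, used_at: datetime) usage records.

    Returns the count of distinct keys used in the 30 days before `now`.
    """
    cutoff = now - timedelta(days=30)
    return len({key_id for key_id, used_at in events if used_at >= cutoff})
```

Note that the same event stream would answer the adoption question (β€œhow many keys exist?”) and the engagement question (β€œhow many keys are actually used?”), which is exactly the distinction drawn below.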

Prefer measures of engagement over measures of adoption. Adoption can quickly become a vanity metric, risking incorrect conclusions about service value. For example, tracking the β€œcount of objects stored” could mask usability issues if 99% of the objects are only ever accessed once and then abandoned by the customers.

Every service can and should measure engagement. If you measure engagement, you begin to know your market, your customers, the problem you are solving and the value you provide. You know if you are spending your time wisely, or if you should focus elsewhere.

Velocity: β€œIf you measure one thing in your organization, measure your commit to deploy latency.β€β€Šβ€”β€ŠAdrian Cockcroft

Shipping frequently has so many benefits. Shipping rapidly has an obvious impact on SLA: you get to improve faster. This also impacts engagement: you deliver value faster, meeting market needs sooner. If shipping compromises SLA, you recognize the architectural defects and mitigate. Smaller deliveries are less destabilizing, and by shipping frequently we can invest to make deployment boring. Shipping is a keystone behavior with cascading positive outcomes. Poor velocity is a frequent symptom of disease.

Every service can and should measure code velocity. Visualize service deployment as a controlled brownout: a fixed portion of service capacity transitions to offline, and is then replaced by new instances of the service. If deploying is boring, it means that the service permits SLA delivery despite this rolling degradation. Conversely, announcing scheduled maintenance is a negative tell of bad architecture. Velocity bounds time-to-iterate, speaks to an optimized developer experience, and catalyzes improvements to SLA attainment and engagement. Healthy velocity depends on upstream validation, meaning velocity is a first derivative of quality.
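Cockcroft's commit-to-deploy latency is straightforward to compute once commit and deploy timestamps are paired. A minimal sketch, assuming those timestamps have already been pulled from your SCM and deploy pipeline:

```python
# Hypothetical sketch: commit-to-deploy latency. Each record pairs a commit
# timestamp with the timestamp its change reached production. Median is used
# here as a robust summary; your team might prefer p90 or a distribution.
from datetime import datetime
from statistics import median


def commit_to_deploy_latency_hours(deploys):
    """deploys: iterable of (committed_at, deployed_at) datetime pairs.

    Returns the median commit-to-deploy latency in hours.
    """
    hours = [(deployed - committed).total_seconds() / 3600
             for committed, deployed in deploys]
    return median(hours)
```

Tracking this number over time is the point: a rising median is an early symptom of the β€œdisease” described above, often visible well before SLA or engagement suffers.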

To help focus on the metrics, recast the duties of a service owner in these measurable terms: β€œI continuously improve my service to keep the promises I’ve made to my engaged customers.” Measure the rate of improvement. Measure the rate of promises kept. Measure the customer engagement. Be a happy service owner.

Originally posted by Jeffrey Hamblin for Salesforce Engineering on Medium
