DEV Community: memattchung

CloudWatch Metrics: Stop averaging, start percentiling

memattchung — Sat, 17 Sep 2022 18:07:47 +0000

AWS CloudWatch is a cornerstone service used by almost all AWS service teams for monitoring and scaling software systems. Though it is a foundational service that most businesses could benefit from, CloudWatch’s features are unintuitive and therefore often overlooked.

Out of the box, CloudWatch offers users the ability to plot both standard infrastructure and custom application metrics. However, new users can easily make the fatal mistake of plotting their graphs using the default statistic: average. Stop right there! Instead of averages, use percentiles. By switching the statistic type, you are bound to uncover operational issues that have been hiding right underneath your nose.

In this post, you’ll learn:

About the averages that can hide performance issues
Why software teams favor percentiles
How percentiles are calculated.

Example scenario: Slowness hiding in plain sight

Imagine the following scenario between a product manager, A, and an engineer, B, both of them working for SmallBusiness.

A sends B a Slack message, alerting B that customers are reporting slowness with CoffeeAPI:

A: “Hey — some of our customers are complaining. They’re saying that CoffeeAPI is slower than usual”.

B: “One second, taking a look…”

B signs into the AWS Console and pulls up the CloudWatch dashboard. Once the page loads, he scrolls down to the specific graph that plots CoffeeAPI latency, execution_runtime_in_ms.

He quickly reviews the graph for the relevant time period, the last 24 hours.

There’s no performance issue, or so it seems. Every data point sits below the team-defined threshold of 600 milliseconds.

B: “Um… looks good to me,” B reports back.

A: “Hmm…customers are definitely saying the system takes as long as 900ms…”

Switching up the statistic from avg to p90

In B’s mind, he has a gut feeling that something’s off — something isn’t adding up. Are customers misreporting issues?

Second-guessing himself, B modifies the line graph, duplicating the execution_runtime_in_ms metric. He tweaks one setting: under the statistic field, he swaps out Average for p90.

He refreshes the page and boom — there it is: datapoints revealing latency above 600 milliseconds!

Some customers’ requests are even taking as long as 998 milliseconds, nearly 400 milliseconds above the team’s defined service level objective (SLO).
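The same comparison B made in the console can also be scripted. Below is a minimal sketch, assuming the boto3 CloudWatch client and its get_metric_data call; the helper names build_latency_queries and fetch_latency, and the "CoffeeAPI" namespace used in the usage note, are made up for this illustration and not part of the scenario above.

```python
from datetime import datetime, timedelta, timezone


def build_latency_queries(namespace, metric_name, period_seconds=300):
    """Build two queries for the same metric: one Average, one p90."""
    metric = {"Namespace": namespace, "MetricName": metric_name}
    return [
        {
            "Id": "avg",
            "MetricStat": {"Metric": metric, "Period": period_seconds, "Stat": "Average"},
        },
        {
            "Id": "p90",
            "MetricStat": {"Metric": metric, "Period": period_seconds, "Stat": "p90"},
        },
    ]


def fetch_latency(cloudwatch, namespace, metric_name, hours=24):
    """Return {"avg": [...], "p90": [...]} for the last `hours` hours.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_latency_queries(namespace, metric_name),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return {result["Id"]: result["Values"] for result in response["MetricDataResults"]}
```

With boto3 installed and credentials configured, a call might look like fetch_latency(boto3.client("cloudwatch"), "CoffeeAPI", "execution_runtime_in_ms"). Comparing the two lists side by side is the scripted equivalent of duplicating the metric on the graph.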

Problematic averages

Using CloudWatch metrics may seem simple at first, but it’s not that intuitive. What’s more, CloudWatch plots metrics with the average as the default statistic. As we saw above, this can hide outliers.

Plans based on assumptions about average conditions usually go wrong.

Sam Savage

For any given metric with multiple data points, the average may show no change in behavior throughout the day, when really, there are significant changes.

Here’s another example: let’s say we want to measure the number of requests per second.

Sounds simple, right? Not so fast.

First we need to talk about measurement. Do we measure once a second, or do we average requests over a minute? As we have already discovered, averaging can hide short bursts. Let’s consider a 60-second period as an example. If there are 200 requests per second during the first 30 seconds and zero during the last 30 seconds, then the average is 100 requests per second. The same average comes out if there are 200 requests per second in odd-numbered seconds and 0 in the others. In both cases, the “instantaneous load” during the busy seconds is twice the average.
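To make the arithmetic concrete, here is a small sketch using the numbers from the paragraph above. The variable and function names are made up for this illustration.

```python
# Two traffic shapes over the same 60-second window.
first_half_burst = [200] * 30 + [0] * 30
odd_seconds_only = [200 if second % 2 == 1 else 0 for second in range(1, 61)]


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


for name, per_second in [
    ("first half burst", first_half_burst),
    ("odd seconds only", odd_seconds_only),
]:
    # Both shapes average out to 100 requests/s, yet the busiest second sees 200.
    print(name, "average:", average(per_second), "peak:", max(per_second))
```

Both windows report the same average, so a graph of the per-minute average cannot tell them apart from a steady 100 requests per second.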

How to use Percentiles

Using percentiles makes for smoother software.

Swapping out average for percentile is advantageous for two reasons:

metrics are not skewed by outliers and, just as important,
every percentile value is an actual user experience, not a computed value like the average

Continuing with the above example of a metric that tracks execution time, imagine an application publishing the following data points:

[535, 400, 735, 999, 342, 701, 655, 373, 248, 412]

If you average the above data, it comes out to 540 milliseconds, yet for the P90, we get 999 milliseconds. Here’s how we arrived at that number:

To calculate the p90, start by sorting all the data points for a given time period in ascending order, from lowest to highest. Next, split the data points into two buckets. If you want the p90, put the first 90% of the data points into bucket one and the remaining 10% into bucket two. Similarly, if you want the p50 (i.e. the median), assign 50% of the data points to the first bucket and 50% to the second.

Finally, after separating the data points into the two buckets, you select the first datapoint in the second bucket. The same steps can be applied to any percentile (e.g. P0, P50, P99).
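The bucket steps above can be sketched in a few lines of Python. This is only an illustration of the method as described in this post, not necessarily how CloudWatch computes percentiles internally. The function name percentile_by_buckets is made up for the example, and clamping the index so that p100 returns the highest value is an assumption.

```python
def percentile_by_buckets(data_points, percentile):
    """Sort, split into two buckets, and return the first value of bucket two."""
    ordered = sorted(data_points)
    # Number of data points that land in the first bucket.
    split = int(len(ordered) * percentile / 100)
    # Assumption: for p100 the second bucket is empty, so fall back to the last value.
    split = min(split, len(ordered) - 1)
    return ordered[split]


latencies = [535, 400, 735, 999, 342, 701, 655, 373, 248, 412]

print(sum(latencies) / len(latencies))       # 540.0
print(percentile_by_buckets(latencies, 90))  # 999
print(percentile_by_buckets(latencies, 50))  # 535
print(percentile_by_buckets(latencies, 0))   # 248
```

Unlike the average of 540, every value returned here is one of the original data points, which is the "actual user experience" property mentioned earlier.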

Some common percentiles that you can use are p0, p50, p90, p99 and p99.9. You’ll want to use different percentiles for different alarm thresholds (more on this in an upcoming blog post). Say you are exploring CPU utilization: the p0, p50, and p100 give you the lowest, median, and highest usage, respectively.

Summary

To conclude, make sure that you’re using percentiles instead of averages so that, when you use CloudWatch, your graphs aren’t hiding real problems.

Take your existing graphs and switch over your statistics from average to percentile today, and start uncovering hidden operational issues. Let me know if you make the change and how it positively impacts your systems.

References

Chris Jones. “Google – Site Reliability Engineering.” Accessed September 12, 2022. https://sre.google/sre-book/service-level-objectives/.

Smith, Dave. “How to Metric.” Medium (blog), September 24, 2020. https://medium.com/@djsmith42/how-to-metric-edafaf959fc7.

Writing data to disk: transforming brittle code to robust code with atomic writes

memattchung — Fri, 20 Aug 2021 03:13:34 +0000

This is the first post in a series about writing robust code that can tolerate both expected and unexpected failures.

Problem identification

Receiving feedback through code reviews is one of the many ways to grow your career as a software developer. But of course, not all feedback holds the same value. Not-so-useful comments tend to focus on nitpicking (e.g. white space); moderately useful comments detect logic or semantic bugs; fairly useful ones help you see problems through a different lens; the best comments open your eyes to issues that you didn't even know existed.

One of the most eye-opening code reviews I submitted during my tenure at Amazon Web Services (AWS) revealed to me the importance of atomic writes to disk.

Example: Brittle Non-Atomic write to disk

Let's take a look at the snippet of Python code below that writes data to disk.

dataset = fetch_data()
...
# Open in write mode; the default mode ('r') would fail on fh.write()
with open('customers.txt', 'w') as fh:
    for customer in dataset:
        fh.write(...)

At a glance, the above code looks and smells 👃 okay. It's coded with idiomatic Python: the context manager (i.e. with open) cleans up lingering resources for you, automatically closing out the file handle. Awesome. I see code like this all the time. But, can you spot the issue?

The lack of atomicity?

What is an atomic write?

In general, an atomic operation is all or nothing, binary, 0 or 1; the operation has either 1) not yet started or 2) completed successfully. No gray areas. In the context of writing data to disk, the destination must contain all the data we expect to be present in the file, non-corrupted. Not some of the data; all of it.

So how do we transform the above code such that we atomically write to disk?

As it stands, the above code is brittle, susceptible to failures. What happens if the program raises an exception mid-write? Or if the server powers off in between one of the read or write operations, leaving the data corrupted? In other words, the code opens us up to leaving the file in an unknown state.

Atomically writing a file

Here's how we go about writing an atomic file.

Steps

Create a temporary file
Write contents to temporary file
Flush buffers
Sync to disk
Rename file.

Example

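import os
import tempfile

# 1. Create a temporary file next to the destination (same filesystem)
with tempfile.NamedTemporaryFile('w', dir='.', delete=False) as fh:
    # 2. Write contents to the file handle
    fh.write(...)
    # 3. Flush Python's internal buffers to the operating system
    fh.flush()
    # 4. Ask the operating system to sync its buffers to disk
    os.fsync(fh.fileno())

# 5. Rename the temporary file over the destination file
os.replace(fh.name, "customers.txt")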

We start the procedure by opening a temporary file; this temporary file becomes the intermediate destination to which we direct our writes. By writing to a temporary file, we leave the ultimate destination file (if it exists) intact, only replacing it if all the data has been successfully written to the temporary file. Once all writes have finished, we simply rename the temporary file to the name of the destination file. On POSIX systems that rename is itself an atomic operation, provided both files live on the same filesystem.
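For reference, the five steps can be wrapped into one reusable helper. This is a sketch under a few assumptions: the function name atomic_write_text is made up, the temporary file is created in the destination's directory so the rename stays on one filesystem, and the temporary file is removed if anything fails along the way.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write `text` to `path` so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # 1. Create the temporary file in the same directory as the destination.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as fh:
            # 2. Write contents to the temporary file.
            fh.write(text)
            # 3. Flush Python's buffers to the operating system.
            fh.flush()
            # 4. Ask the operating system to sync to disk.
            os.fsync(fh.fileno())
        # 5. Rename over the destination.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the destination untouched and clean up the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A call such as atomic_write_text("customers.txt", "alice\nbob\n") either replaces the file completely or leaves the previous version in place.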

Summary

Above, I demonstrated one way to apply atomicity. This principle can be applied to many other situations. For example, if you are writing multi-threaded code and accessing shared memory, a thread needs to atomically obtain a lock before modifying the underlying shared data structures.
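As a small illustration of the multi-threaded case, the sketch below uses Python's threading.Lock to protect a shared counter. The counter and the increment function are made up for this example.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def increment(times):
    """Add 1 to the shared counter `times` times, holding the lock for each update."""
    global counter
    for _ in range(times):
        # Acquiring the lock makes the read-modify-write below all or nothing
        # from the point of view of the other threads.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)  # 40000
```

Because every update happens while the lock is held, no increment is lost regardless of how the threads interleave.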

So, moving forward, when writing or reviewing code, keep the possibility of failures at the forefront of your mind and identify ways you can apply the principle of atomicity to turn fragile code into robust software.

Let's Connect

Let's connect and talk more about software and devops. Follow me on Twitter: @memattchung

References

Stupid Python Ideas: Getting atomic writes right
python - How to make file creation an atomic operation? - Stack Overflow

Monitoring Systems with Canaries

memattchung — Sun, 20 Jun 2021 14:57:24 +0000

You launched your service and are rapidly onboarding customers. You're moving fast, repeatedly deploying one new feature after another. But with the uptick in releases, bugs are creeping in and you're finding yourself having to troubleshoot, roll back, squash bugs, and then redeploy changes. Moving fast but breaking things. What can you do to quickly detect issues before your customers report them?

Canaries.

In this post, you'll learn about the concept of canaries, example code, best practices, and other considerations, including both the maintenance and financial implications of running them.

What is a canary

Back in the early 1900s, canaries were used by miners for detecting carbon monoxide and other dangerous gases. Miners would bring their canaries down with them to the coal mine, and when their canary stopped chirping, it was time for everyone to immediately evacuate.

In the context of computing systems, canaries perform end-to-end testing, aiming to exercise the entire software stack of your application: they behave like your end-users, emulating customer behavior. Canaries are just pieces of software that are always running and constantly monitoring the state of your system; they emit metrics into your monitoring system (more discussion on monitoring in a separate post), which then triggers an alarm when some defined threshold is breached.

What do canaries offer?

Canaries answer the question: "Is my service running?" More sophisticated canaries can offer a deeper look into your service. Instead of canaries just emitting a binary 1 or 0 — up or down — they can be designed such that they emit more meaningful metrics that measure latency from the client's perspective.

First steps with building your canary

If you don't have any canaries running that monitor your system, you don't necessarily have to start with rolling your own. Your first canary can require little to no code. One way to gain immediate visibility into your system would be to use synthetic monitoring services such as BetterUptime, Pingdom, or StatusCake. These services offer a web interface, allowing you to configure HTTP(S) endpoints that their canaries will periodically poll. When their systems detect an issue (e.g. TCP connection failing, bad HTTP response), they can send you email or text notifications.

Or if your systems are deployed in Amazon Web Services, you can write Python or Node scripts that integrate with CloudWatch (see the Amazon CloudWatch documentation).

But if you are interested in developing your own custom canaries that do more than a simple probe, read on.

Where to begin

Remember, canaries should behave just like real customers. Your customer might be a real human being or another piece of software. Regardless of the type of customer, you'll want to start simple.

Similar to the managed services described above, your first canary should start with emitting a simple metric into your monitoring system, indicating whether the endpoint is up or down. For example, if you have a web service, perform a vanilla HTTP GET. When successful, the canary will emit http_get_homepage_success=1 and under failure, http_get_homepage_success=0.
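A first canary along those lines might look like the sketch below. It uses only the Python standard library. The publish_metric function is a hypothetical stand-in for whatever client your monitoring system provides, and the opener parameter exists only so the probe can be exercised without a network.

```python
import urllib.request


def publish_metric(name, value):
    """Hypothetical stand-in: replace with a call into your monitoring system."""
    print(f"{name}={value}")


def probe_homepage(url, opener=urllib.request.urlopen, timeout=5):
    """Perform a plain HTTP GET and emit 1 on success, 0 on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            success = 200 <= response.status < 300
    except OSError:
        # urllib's URLError and HTTPError, timeouts, and refused connections
        # are all subclasses of OSError.
        success = False
    publish_metric("http_get_homepage_success", int(success))
    return int(success)
```

Running probe_homepage("https://example.com") on a schedule (cron, a loop with a sleep, or a scheduled function) is enough to get a basic up/down signal; the URL here is only a placeholder.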

Example canary - monitoring cache layer

Imagine you have a simple key/value store system that serves as a caching layer. To monitor this layer, every minute our canary will: 1) perform a write 2) perform a read 3) validate the response.

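from time import sleep

# cache_put, cache_get, publish_metric and log_exception are assumed to be
# helpers provided by your cache client and monitoring libraries.
while True:
    successful_run = False

    try:
        # 1. Write a known key/value pair
        put_response = cache_put('foo', 'bar')
        write_successful = put_response == 'OK'
        publish_metric('cache_engine_successful_write', int(write_successful))

        # 2. Read the key back and 3. validate the response
        value = cache_get('foo')
        read_successful = value == 'bar'
        publish_metric('cache_engine_successful_read', int(read_successful))

        # Reached only if no exception was raised above
        successful_run = True
    except Exception as error:
        log_exception("Canary failed due to error: %s" % error)
    finally:
        publish_metric('cache_engine_canary_successful_run', int(successful_run))
        sleep_for_in_seconds = 60
        sleep(sleep_for_in_seconds)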

Cache Engine failure during deployment

With this canary in place emitting metrics, we might then choose to integrate the canary with our code deployment pipeline. In the example below, I triggered a code deployment (riddled with bugs) and the canary detected an issue, triggering an automatic rollback:

Best Practices

The above code example was very unsophisticated and you'll want to keep the following best practices in mind:

The canaries should NOT interfere with real user experience. Although a good canary should test different behaviors/states of your system, it should in no way interfere with the real user experience. That is, its side effects should be self-contained.
They should always be on, always running, and should be testing at regular intervals. Ideally, the canary runs frequently (e.g. every 15 seconds, every 1 minute).
The alarms that you create for your canary should only trigger on more than one data point. If your alarms fire on a single data point, you increase the likelihood of false alarms, engaging your service teams unnecessarily.
Integrate the canary into your continuous integration/continuous deployment pipeline. Essentially, the deployment system should monitor the metrics that the canary emits and, if an error is detected for more than N minutes, the deployment should automatically roll back (more on the safety of automated rollbacks in a separate post).
When rolling your own canary, do more than just inspect the HTTP headers. Success criteria should be more than verifying that the HTTP status code is a 200 OK. If your web service returns a payload in the form of JSON, analyze the payload and verify that it's both syntactically and semantically correct.
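Two of these practices lend themselves to short sketches: alarming only after several consecutive failures, and validating a JSON payload beyond its status code. Everything here is illustrative. The function names, the threshold of three data points, and the "status" and "items" fields are assumptions, not part of any particular service.

```python
import json


def should_alarm(datapoints, consecutive_failures=3):
    """Fire only when the most recent N canary data points are all failures (0)."""
    recent = datapoints[-consecutive_failures:]
    return len(recent) == consecutive_failures and all(value == 0 for value in recent)


def is_valid_payload(body):
    """Check that a response body is well-formed JSON with the expected shape."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Syntactically invalid JSON
        return False
    # Semantic checks: hypothetical fields that this example service returns
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and isinstance(payload.get("items"), list)
    )
```

With this rule, a single failed run such as [1, 1, 0] stays quiet, while [1, 0, 0, 0] would page someone.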

Cost of canaries

Of course, canaries are not free. Regardless of whether or not you rely on a third party service or roll your own, you'll need to be aware of the maintenance and financial costs.

Maintenance

A canary is just another piece of software. The underlying implementation may be just a few bash scripts cobbled together or a full-blown client application. In either case, you need to maintain it just like any other code package.

Financial Costs

How often is the canary running? How many instances of the canary are running? Are they geographically distributed to test from different locations? These are some of the questions that you must ask since they impact the cost of running them.

Beyond canaries

When building systems, you want a canary that behaves like your customer, one that allows you to quickly detect issues as soon as your service(s) choke. If you are vending an API, then your canary should exercise the different URIs. If you are testing the front end, then your canary can be programmed to mimic a customer using a browser, with libraries such as Selenium.

Canaries are a great place to start if you are just launching a service. But there's a lot more work required to create an operationally robust service. You'll want to inject failures into your system. You'll want a crystal clear understanding of how your system should behave when its dependencies fail. These are some of the topics that I'll cover in the next series of blog posts.

3 Tips on getting eyeballs on your code review

memattchung — Mon, 14 Jun 2021 15:36:49 +0000

"Why is nobody reviewing my code?"

I sometimes witness new engineers (or even seasoned engineers new to the company) submit code reviews that end up sitting idle, gaining zero traction. Often, these code reviews get published but comments never flow in, leaving the developer scratching their head, wondering why nobody seems to be taking a look. To help avoid this situation, check out the 3 tips below for more effective code reviews.

3 tips for more effective code reviews

Try out the three tips for more effective code reviews. In short, you should:

Assume nobody cares
Strive for bite sized changes
Add a descriptive summary

1. Assume nobody cares

After you hit the publish button, don't expect other developers to flock to your code review. In fact, it's safe to assume that nobody cares. I know, that sounds a bit harsh but as Neil Strauss suggests:

“Your challenge is to assume, to count on, the complete apathy of the reader. And from there, make them interested.”

At some point in our careers, we all fall into this trap.

We send out a review, one that lacks a clear description (see section below “Add a descriptive summary”), and then the code review sits there, patiently waiting for someone to sprinkle comments. Sometimes, those comments never come.

Okay, it's not that people don't necessarily care. It has more to do with the fact that people are busy with their own tasks and deliverables. They too are writing code that they are trying to ship.

So your code review essentially pulls them away from delivering their own work. So, make it as easy as possible for them to review.

One way to gain their attention is simply by giving them a heads up.

Before publishing your code review, send them an instant message or e-mail, giving them a heads up. Or if you are having a meeting with that person, tell them that you plan on sending out a code review and ask them if they can take a look. This puts your code review on their radar. And if you don't see traction in an appropriate amount of time (which varies, depending on the change and its criticality), then follow up with them.

2. Strive for bite sized code reviews

Any change beyond 100-200 lines of code requires a significant amount of mental energy (unless the change itself is a trivial update to comments or formatting). So how can you make it easier for your reviewer?

Aim for small, bite sized code reviews.

In my experience, a good rule of thumb is to submit fewer than 100 lines of code.

What if there’s no way your change can squeeze into double digits?

Then consider breaking down the single code review into multiple, smaller sized code reviews and once all those independent code reviews are approved, submit a single code review that merges all those changes in atomically.

And if you still cannot break down a large code review into these lengths and find that it’s unavoidable to submit a large code review, then make sure you schedule a 15-30 minute meeting to discuss your large code review (I’ll create a separate blog post for this).

3. Add a descriptive summary for the change

I’m not suggesting you write a miniature novel when adding a description to your code review. But you’ll definitely need to write something with more substance than a one-liner: “Adds new module”. Rob Pike puts it succinctly, and his criteria for a good description include “What, why, and background”.

In addition to meeting these criteria, be sure to describe how you tested your code or, better yet, ship your code review with unit tests. Brownie points if you explicitly call out what is out of scope. Limiting your scope reduces the possibility of unnecessary back-and-forth comments for a change that falls outside your scope.

Finally, if you want some stricter guidelines on how to write a good commit message, you might want to check out Kabir Nazir’s blog post on “How to write good commit messages.”

Summary

If you are having trouble with getting traction on your code reviews, try the above tips. Remember, it's on you, the submitter of the code review, to make it as easy as possible for your reviewers to leave comments (and approve).

Let's Connect

Let's chat more and connect! Follow me on Twitter: @memattchung

3 project management tips for the Well-Rounded Software Developer

memattchung — Wed, 09 Jun 2021 22:34:58 +0000

This is the second in the series of The Well Rounded Developer. See previous post "Network Troubleshooting for the Well-Rounded Developer"

Whether you are a solo developer working directly with your clients, or a software engineer on a larger team that's delivering a large feature or service, you need to do more than just ship code. To succeed in your role, you also need good project management skills, regardless of whether there's an officially assigned "project manager". By upping your project management skills, you'll increase the odds of delivering consistently and on time, which is necessary for earning trust among your peers and stakeholders. In fact, I'd go as far as to say that it's critical for your Personal Brand.

3 Project Management Tips

Just like programming, project management is another skill that requires practice; you'll get better at it over time. Sometimes you'll grossly underestimate a task, thinking it'll take 3 days ... when it really took 10 days (or more!). Don't sweat it. Project management gets easier the more you do it.

Capturing Requirements

This seems obvious and almost goes without saying, but as a developer, you need to be able to extract the mental image held by your customer or product manager. Then, distill it into words, often referred to as user stories: "When I do X, Y happens" or "As a [role] ... I want [goal] ... so that [benefit]."

These conversations will require a lot of back and forth discussion. With each iteration, aim to be as specific as possible. Include numbers, pictures, diagrams. The more detail, the better. And most important, beyond defining your acceptance criteria, spell out your assumptions — loud and clear. Because if any of the assumptions get violated while working on the task, you need to sound the alarm and communicate (see "sending frequent communication updates" below) that the current estimated time has been derailed.

Example

When we receive a packet with a length exceeding the maximum transmission unit (MTU) of 1514 bytes, the packet gets dropped and the counter "num_dropped_packets_exceeding_mtu" is incremented.
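A requirement this specific translates almost directly into a check. As a rough sketch (the PacketReceiver class and its receive method are hypothetical and exist only for this illustration), the acceptance criterion could be expressed as:

```python
MAX_TRANSMISSION_UNIT_BYTES = 1514  # value taken from the requirement above


class PacketReceiver:
    """Hypothetical receiver that enforces the MTU requirement."""

    def __init__(self):
        self.num_dropped_packets_exceeding_mtu = 0
        self.accepted_packets = []

    def receive(self, packet):
        """Return True if the packet was accepted, False if it was dropped."""
        if len(packet) > MAX_TRANSMISSION_UNIT_BYTES:
            self.num_dropped_packets_exceeding_mtu += 1
            return False
        self.accepted_packets.append(packet)
        return True
```

Writing the requirement down at this level of detail makes the boundary explicit: a 1514-byte packet is accepted, a 1515-byte packet is dropped and counted.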

Sending frequent communication updates

Most importantly, keep your stakeholders in the loop. Regardless of whether the task at hand is trending on time, slipping behind, or being delivered ahead of schedule, send an update. That might be in the form of an e-mail, or closing out your task in your project management system.

Example of a short status update

More often than not, we developers tend to send updates too infrequently and as a result, our stakeholders are often guessing where the project(s) stand. These updates can be short and simple: "Completed task X. Code has been pushed to feature branch but still needs to be merged into mainline and deployed through pipeline."

Breaking tasks into small deliverables

It pays off to break down large chunks of work into small, actionable items.

The smaller, the better. Ideally, although not always possible to achieve, strive to break down tasks such that they can be completed within a single day. This isn't an absolute requirement but serves as a forcing function to crystallize requirements. Chances are, the larger the estimate, the greater the chance of it slipping off schedule.

Of course, some tasks just require more days, like fleshing out a design document. For ambiguous tasks, create spike stories (i.e. research tasks) — just make sure these discovery tasks are time-bounded to a few days.

Summary

Project management is an essential skill that every well-rounded developer must have in their toolbox. This skill combined with your technical depth will help you stand out as a strong developer: not someone who just delivers code, but someone who does it consistently and on time.

Let's connect

Let's chat more about being a well-rounded software developer. If you are curious about learning how to move from front-end to back-end development, or from back-end development to low-level systems programming, follow me on Twitter: @memattchung

Why all developers should learn how to perform basic network troubleshooting

memattchung — Sun, 06 Jun 2021 04:51:03 +0000

Regardless of whether you work on the front-end or back-end, I think all developers should gain some proficiency in network troubleshooting. This is especially true if you find yourself gravitating towards lower level systems programming.

The ability to troubleshoot the network and systems separates good developers from great developers. Great developers understand not just code abstraction, but understand the TCP/IP model:

Source: https://www.guru99.com/tcp-ip-model.html

Some basic network troubleshooting skills

If you are just getting into networking, here are some basic tools you should add to your toolbelt:

Perform a DNS query (e.g. dig or nslookup command)
Send an ICMP echo request to test end to end IP connectivity (i.e. ping command)
Analyze the various network hops (i.e. traceroute X.X.X.X)
Check whether you can establish a TCP socket connection (e.g. telnet X.X.X.X [port])
Test application layer (i.e. curl https://somedomain)
Perform a packet capture (e.g. tcpdump -i any) and inspect what bits are sent on the wire
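The first and fourth checks (DNS lookup and TCP connection) can also be done from code, which is handy inside a script or a canary. Here is a minimal sketch using Python's standard socket module; the function names resolve_ipv4 and can_connect are made up for this example.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a hostname resolves to (similar to dig)."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection can be established (similar to telnet)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, resolve_ipv4("dev.to") followed by can_connect(address, 443) for each returned address mirrors the dig and telnet checks listed above.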

What IP address is my browser connecting to?

% dig dev.to

; <<>> DiG 9.10.6 <<>> dev.to
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39029
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;dev.to.                IN  A

;; ANSWER SECTION:
dev.to.         268 IN  A   151.101.2.217
dev.to.         268 IN  A   151.101.66.217
dev.to.         268 IN  A   151.101.130.217
dev.to.         268 IN  A   151.101.194.217

Is the web server listening on the HTTPS port?

% telnet 151.101.2.217 443
Trying 151.101.2.217...
Connected to 151.101.2.217.
Escape character is '^]'.

Each of the above tools helps you isolate connectivity issues. For example, if your client receives an HTTP 5XX error, you can immediately rule out any TCP-level issue. That is, you don't need to use telnet to check whether there's a firewall issue or whether the server is listening on the right port: the server already sent an application-level response.

Summary

Learning more about the network stack helps you quickly pinpoint and isolate problems:

Is it my client-side application?
Is it a firewall blocking certain ports?
Is there a transient issue on the network?
Is the server up and running?

Let's chat more about network engineering and software development

If you are curious about learning how to move from front-end to back-end development, or from back-end development to low level systems programming, hit me up on Twitter: @memattchung