<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Hammond</title>
    <description>The latest articles on DEV Community by Adam Hammond (@hammotime).</description>
    <link>https://dev.to/hammotime</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F172883%2F0c63425f-253a-4966-97fa-f20a82851c5f.jpeg</url>
      <title>DEV Community: Adam Hammond</title>
      <link>https://dev.to/hammotime</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hammotime"/>
    <language>en</language>
    <item>
      <title>Error Budgets and their Dependencies</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Thu, 25 Feb 2021 11:40:46 +0000</pubDate>
      <link>https://dev.to/squadcast/error-budgets-and-their-dependencies-1871</link>
      <guid>https://dev.to/squadcast/error-budgets-and-their-dependencies-1871</guid>
      <description>&lt;p&gt;&lt;em&gt;Does your team struggle with not having balanced error budget, that impacts your reliabilty &amp;amp; pace of innovation? &lt;a href="https://www.linkedin.com/in/adamhammondqld/" rel="noopener noreferrer"&gt;Adam Hammond&lt;/a&gt; in his latest blog talks about error budget - accountable for planned &amp;amp; unplanned outages that your systems may encounter &amp;amp; how teams can calculate error budget efficiently.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our last few articles, we’ve &lt;a href="https://www.squadcast.com/blog/choosing-slos-that-users-need-not-the-ones-you-want-to-provide" rel="noopener noreferrer"&gt;discussed SLOs&lt;/a&gt; and how &lt;a href="https://www.squadcast.com/blog/how-small-changes-to-your-slos-can-be-smart-for-your-business-a-narrative-case-study" rel="noopener noreferrer"&gt;picking them correctly&lt;/a&gt; can make or break your application’s performance. Today we’re going to cover error budgets, which are used to account for planned and unplanned outages that your systems may encounter. In essence, error budgets exist to cover you when your systems fail and to allow time for upgrades and feature improvement. No system can be expected to be 100% performant, and even if it were, you would still need time available for maintenance. Activities like major database version upgrades can cause significant downtime &lt;strong&gt;when&lt;/strong&gt; they occur. Error budgets allow you to plan ahead and put aside time for your team to manage their services, while providing customers with lead time so they can plan for the downstream impacts of your service going offline.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;An introduction to service calculations&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;There is an easy trap to fall into when determining your error budgets. Calculating your error budgets - as with everything in process improvement - is a journey. Most people will say “well, my error budget is simply the left-over time once my SLO is taken away”, and for them that formula might look like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Error Budget = 100% - Service SLO&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, this is incorrect and is “starting at the end”. This is certainly your aspirational error budget, but it doesn’t take into account your service’s current performance or the current state of its error budget. The initial equation for your error budget is as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Error Budget = Projected Downtime + Projected Maintenance&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you remember from our previous article on SLOs, we need to do a lot of research into factors like how performant our customers expect our system to be; another part of that is understanding maintenance and existing application error rates. The projections will most likely track very closely to your past performance, unless your service’s performance has been wildly variable in the past. When you first define your error budget, it is acceptable to baseline it against what your service can currently provide. If you can only deliver an SLO of 85%, there is no point promising 90%. However, once you have established your baseline error budget, you must never allow it to move above your starting point. Error budgets decrease, they do not increase.&lt;/p&gt;

&lt;p&gt;The first port of call for most organisations when implementing their error budget is to focus on maintenance, as this usually gets the best “bang for buck”; there are usually processes that can be improved or better software versions to be installed. This is where your SRE teams come in to help deliver streamlined, automated, and focused software pipelines that minimise application downtime. Move away from manual, labour-intensive processes and towards single-click developer experiences to minimise intentional error budget usage.&lt;/p&gt;

&lt;p&gt;The point of error budgets is to let you focus where your product improvement hours are spent: implement new features if you have not utilised your error budget, consider service improvement if it is nearly consumed, and focus all resources on stabilising your service if your error budget is in deficit. Ultimately, an error budget is designed to help you understand where to focus your engineering resources to ensure your SLOs are met. The final stage of our error budget baseline is to compare it against the SLO that we intend to maintain for our service. We can do this by simply reverting to our calculation from the beginning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Expected Service SLO = 100% - Error Budget&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is at this point that you can determine the immediate direction you need to take with regard to service improvement. If your error budget is running higher than expected, you should focus on reducing it. Once you’ve completed any initial service improvements needed to bring your error budget into line, you can finally use the “simple” calculation to determine your error budgets:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Error Budget = 100% - Service SLO&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
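&lt;p&gt;&lt;em&gt;The three formulas above can be sketched in a few lines of Python (the function names and sample percentages here are illustrative assumptions, not from any particular tooling):&lt;/em&gt;&lt;/p&gt;

```python
def baseline_error_budget(projected_downtime, projected_maintenance):
    """Initial error budget: what the service currently sustains (percent)."""
    return projected_downtime + projected_maintenance

def expected_slo(error_budget):
    """The SLO implied by the current error budget (percent)."""
    return 100.0 - error_budget

def target_error_budget(service_slo):
    """The 'simple' calculation, once performance is already in line (percent)."""
    return 100.0 - service_slo

# A service with 10% projected downtime and 5% projected maintenance:
budget = baseline_error_budget(10.0, 5.0)  # 15.0 -> can only promise an 85% SLO
slo = expected_slo(budget)                 # 85.0
target = target_error_budget(90.0)         # 10.0 budget if aiming for a 90% SLO
```

&lt;p&gt;&lt;em&gt;The baseline tells you what you can honestly promise today; the “simple” calculation only becomes safe once performance is in line.&lt;/em&gt;&lt;/p&gt;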

&lt;p&gt;The important thing to note is that things like customer expectations serve as minimums in terms of SLOs, so we don’t include them in our initial calculations. At the beginning of our error budget journey, we are establishing our current state, and in a lot of cases it will be less than the desired target. Another key aspect to keep in mind is that if our SLO performance is ever less than the minimum, we need to reduce our &lt;strong&gt;actual&lt;/strong&gt; error budget via service improvement as soon as possible.&lt;/p&gt;
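&lt;p&gt;&lt;em&gt;As a rough sketch of the prioritisation rule described above - the 80% “nearly consumed” threshold is an assumption for illustration, not a standard:&lt;/em&gt;&lt;/p&gt;

```python
def engineering_focus(error_budget, budget_consumed, nearly_consumed_ratio=0.8):
    """Map error budget consumption (percent) to an engineering priority.

    The 0.8 'nearly consumed' ratio is an illustrative assumption.
    """
    if budget_consumed > error_budget:
        return "stabilise the service"   # budget in deficit: all hands on deck
    if budget_consumed >= nearly_consumed_ratio * error_budget:
        return "service improvement"     # budget nearly consumed
    return "new features"                # budget available: spend it on features

# With a 10% error budget:
engineering_focus(10.0, 12.0)  # -> "stabilise the service"
engineering_focus(10.0, 9.0)   # -> "service improvement"
engineering_focus(10.0, 2.0)   # -> "new features"
```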

&lt;h3&gt;&lt;strong&gt;What is downtime, really?&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;In our calculations, we separated our downtimes into two categories: unexpected and maintenance. To properly calculate our error budgets, we need a definition for what “downtime” is in general, and then we need to differentiate between the two categories. For our purposes, a suitable definition for downtime is “systems are not in a state to meet the required metric”. This specifically targets the SLO and its associated metric.&lt;/p&gt;

&lt;p&gt;We then further define our two categories, with “maintenance downtime” being “downtime caused by an intentional disruption due to system maintenance” and “unexpected downtime” simply being “all other downtime”. We differentiate between these two types of downtime not specifically to build the error budget, but to provide us with guidance on how we can improve them. For example, if we want to reduce maintenance we need process improvement, but if we want to reduce unexpected downtime we probably need to fix bugs or errors within our services. These categories provide strategic guidance on where we need to look for potential error budget savings when we need to deliver better service to our customers.&lt;/p&gt;
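&lt;p&gt;&lt;em&gt;A minimal sketch of how these two definitions might be applied to a log of downtime events (the event records and field names are hypothetical):&lt;/em&gt;&lt;/p&gt;

```python
# Each downtime event records its duration and whether the disruption
# was intentional (i.e. caused by system maintenance).
events = [
    {"cause": "database version upgrade", "minutes": 120, "intentional": True},
    {"cause": "load balancer crash",      "minutes": 45,  "intentional": False},
    {"cause": "nightly restarts",         "minutes": 60,  "intentional": True},
]

# Maintenance downtime: intentional disruption due to system maintenance.
maintenance_downtime = sum(e["minutes"] for e in events if e["intentional"])
# Unexpected downtime: simply all other downtime.
unexpected_downtime = sum(e["minutes"] for e in events if not e["intentional"])
# maintenance_downtime == 180, unexpected_downtime == 45
```

&lt;p&gt;&lt;em&gt;Keeping the two buckets separate is what lets you decide later whether process improvement or bug fixing will yield the bigger saving.&lt;/em&gt;&lt;/p&gt;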

&lt;h3&gt;&lt;strong&gt;Calculating our error budgets&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Now that we have all of our required definitions and formulas, it’s a simple process to actually calculate our error budgets. In fact, a quick visit to our maintenance procedures and our metrics dashboard should suffice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine our total downtime by retrieving our current monthly error rates from our metric dashboards.&lt;/li&gt;
&lt;li&gt;Find out how much downtime is scheduled for our maintenance each month.&lt;/li&gt;
&lt;li&gt;Calculate our unexpected downtime amount by subtracting scheduled downtime from actual error rates.&lt;/li&gt;
&lt;/ul&gt;
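&lt;p&gt;&lt;em&gt;The three steps above amount to a small calculation (this sketch assumes a 30-day month and downtime measured in minutes; the names are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def monthly_downtime_metrics(total_downtime_min, scheduled_maintenance_min):
    """Derive the three error-budget inputs from dashboard figures (minutes)."""
    # Step 3: unexpected downtime is whatever maintenance doesn't explain.
    unexpected = total_downtime_min - scheduled_maintenance_min
    return {
        "total_pct": 100.0 * total_downtime_min / MINUTES_PER_MONTH,
        "maintenance_pct": 100.0 * scheduled_maintenance_min / MINUTES_PER_MONTH,
        "unexpected_pct": 100.0 * unexpected / MINUTES_PER_MONTH,
    }

# 432 minutes of total downtime, 216 of it scheduled maintenance:
monthly_downtime_metrics(432, 216)
# -> {'total_pct': 1.0, 'maintenance_pct': 0.5, 'unexpected_pct': 0.5}
```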

&lt;p&gt;With our three metrics in hand - total downtime, maintenance downtime, and unexpected downtime - let’s return to Bill Palmer at Acme Interfaces, Inc for a practical look at how effective error budgets can be, and how we can use all of this information to calculate them appropriately.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;“Help, Bill! The system is too slow!”&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Bill Palmer sat at his desk, exasperated. Acme Interfaces had been putting off their database upgrade for years. He had received an email from their cloud provider today, advising that the database would be upgraded forcibly if no action was taken in the next four months. Coming in at 15 TB and feeding into over 500 interfaces, their database was at the heart of the business. As part of the upgrade, everything would need to be tested along with the actual upgrade itself. Bill required hours of downtime for the upgrade, but it looked like Acme Interfaces was already going over their error budget by a few minutes every month. Now that their cloud provider had forced their hand, something needed to be done.&lt;/p&gt;

&lt;p&gt;He pulled up an Excel spreadsheet with service metrics and began looking for a quick win.&lt;/p&gt;

&lt;p&gt;Within a few minutes, he’d found what looked like the root of their error budget deficit. Looking at the error reporting for HTTP requests over the last year at Acme Interfaces, relatively simple requests were returning HTTP 50X errors at quite large volumes and for unknown reasons. He’d made a promise to Dan that he’d get the error rate lower than 10% to get the error budget back in surplus for the upgrade; it was time to get to work. He looked at the detailed statistics and noticed that about half of the errors were 502s and 503s, and the other half were 500s. He just didn’t understand how there could be so many transport errors.&lt;/p&gt;

&lt;p&gt;He picked up his phone and dialed the NOC.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;Ring, Ring…. Ring, Ring….&lt;/em&gt; Hello, Acme NOC, this is Charlie.”&lt;/p&gt;

&lt;p&gt;“G’day Charlie, this is Bill, the CTO. Do you have a few minutes to discuss some statistics I’m reviewing?”&lt;/p&gt;

&lt;p&gt;“Sure, Bill.”&lt;/p&gt;

&lt;p&gt;“Excellent. I’m just taking a look at our HTTP error codes for the past year and for some reason we return a lot of bad gateway and service unavailable errors. Do you know why that would be the case?”&lt;/p&gt;

&lt;p&gt;“Sure do, Bill. Our load balancer software is on a really old version. It’s got a bug where it hits a memory leak and can’t parse responses back from the backend servers. That’s what throws the 502s. After a few minutes the server will restart, but because it’s our load balancer we can’t easily take it out of service, so we return 503s in the meantime. We used to have to manually restart the servers, but we implemented a script that checks for health and can reboot within a few minutes.”&lt;/p&gt;

&lt;p&gt;Bill paused for a few moments. “...Is there a reason why the infrastructure team hasn’t upgraded to a new version of the load balancer?”&lt;/p&gt;

&lt;p&gt;“Well, that’s the problem, we don’t really have anyone dedicated to the load balancers. They were set up a few years ago as part of a project, and now the NOC just fixes them up when they go a bit crazy. The vendor has confirmed that the newer version of the software doesn’t have the bug, but we just don’t have the expertise to manage that at the moment. We also restart them all at night, which takes about an hour and causes 503s.”&lt;/p&gt;

&lt;p&gt;“Okay, well thanks for the information, Charlie. I’ll see what we can do. &lt;em&gt;click&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;Bill started to write up all the information he had gained from the phone call with Charlie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F601a608ac5fc8353a0f3d71f_image1%2520%281%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F601a608ac5fc8353a0f3d71f_image1%2520%281%29.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After he was done, he called Jenny.&lt;/p&gt;

&lt;p&gt;“Jenny, can you please do me a favour and find out how much a System Administration course for our Load Balancing software would be, please?”&lt;/p&gt;

&lt;p&gt;“Sure, is this about those HTTP errors?”&lt;/p&gt;

&lt;p&gt;“You know it!”&lt;/p&gt;

&lt;p&gt;Bill continued to look at the whiteboard, and knew the fastest way to improve performance would be to bring the load balancer up to scratch and get the NOC team up-skilled to handle these systems. They’d been improving these systems despite not having any official training, so they were clearly great operators.&lt;/p&gt;

&lt;p&gt;Bill’s phone rang, “Bill, it’s Jenny. I just got off the phone with them and they said they could do a 20% discount on the training with a group larger than 10 people and that it would be $10,000 a head.”&lt;/p&gt;

&lt;p&gt;“Okay, get back to them and book in two sessions of 15 people each. I want the whole NOC to be up-skilled on the Load Balancer immediately. Draw up a project proposal for shift-left knowledge transfer from some of the application teams as well as SRE development for the NOC team. Their skills are wasted waiting for fires to break out, I know they can get this environment up to where it needs to be.”&lt;/p&gt;

&lt;p&gt;“Sounds good, I’ll get onto it now!”&lt;/p&gt;




&lt;p&gt;Bill surveyed the room, taking in the hundreds of leaders from across Acme Interfaces, as he prepared to talk about his team’s development over the last six months.&lt;/p&gt;

&lt;p&gt;“Hi everyone, I’m sure most of you know me by now, but I’m Bill, the new-ish CTO. Today I’m going to be talking about how we were able to eliminate a major barrier to our database upgrade by analysing and refining the error budgets for our HTTP requests.”&lt;/p&gt;

&lt;p&gt;“Six months ago, we were seeing an error rate on HTTP requests of up to 15% per month, well above our expected error budget of 10%. About 5% of these errors were caused by applications, but 8.5% were being seen at the load balancer and were due to availability issues. We wanted request errors to be 10% or less, but we were tracking 5% above that. We had to improve something if we wanted to meet that target.”&lt;/p&gt;

&lt;p&gt;“I got onto the NOC and spoke with Charlie, who enlightened me to some issues we were having with our load balancer: it hadn’t been updated for a few years and a bug was causing all these errors. Further exacerbating the issue, no one with the skills to actually upgrade the load balancer worked at the company, so that wasn’t an immediate option.”&lt;/p&gt;

&lt;p&gt;“Jenny got onto the vendor and arranged training for the entire NOC. Within three weeks they were all skilled up, and we began our project to upgrade the load balancers. It only took us two weeks to upgrade all of the servers, and we were able to do this during downtime that was previously reserved for maintenance (otherwise known as restarting servers due to the bug). We’ve also begun transitioning all of the existing NOC operators to new SRE-based roles that will allow them to assume greater responsibility for the improvement of our core infrastructure.”&lt;/p&gt;

&lt;p&gt;“Within two months of defining our current state error budget, we had used it to identify where our issues were coming from, resolved those issues, and now we’ve been able to meet (and exceed) our target of less than 10% HTTP request errors. We’ve also used the experience to refine our NOC and give our staff greater responsibility.”&lt;/p&gt;

&lt;p&gt;“I’d heartily recommend everyone take a look at the internal error budgets they’re responsible for, as I’m very sure it can only have positive outcomes for the business. Thanks for attending my session, and I hope the rest of the retreat goes well.”&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c51758c58939b30a6fd3d73%2F5e16013f80ad26b00925d758_image--5--1.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>slo</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>How small changes to your SLOs can be SMART for your business - A narrative case study</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Wed, 25 Nov 2020 11:08:07 +0000</pubDate>
      <link>https://dev.to/squadcast/how-small-changes-to-your-slos-can-be-smart-for-your-business-a-narrative-case-study-2dje</link>
      <guid>https://dev.to/squadcast/how-small-changes-to-your-slos-can-be-smart-for-your-business-a-narrative-case-study-2dje</guid>
      <description>&lt;p&gt;&lt;em&gt;In the second part of his "&lt;a href="https://www.squadcast.com/blog/choosing-slos-that-users-need-not-the-ones-you-want-to-provide" rel="noopener noreferrer"&gt;Choosing SLOs that are appropriate for our customers&lt;/a&gt;" blog, Adam Hammond, narrates a fictional case study through Bill Palmer, one of the protagonists of &lt;a href="https://www.squadcast.com/blog/must-read-devops-sre-books-for-all-engineers" rel="noopener noreferrer"&gt;The Phoenix Project&lt;/a&gt; and shows "How small changes to your SLOs can be SMART for your business"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our previous blog, we discussed why you need to choose SLOs that are appropriate for your customers. We don’t always write out SMART and list our SLOs immediately - the process is organic, and it may take a while. Most businesses have a rigorous reporting and metric-gathering regime, and in most situations you will just need to tweak it to get the desired results.&lt;/p&gt;

&lt;p&gt;To elaborate, in this blog we will focus on a fictional company named “Acme Interfaces, Inc” (Acme) who already have &lt;strong&gt;M&lt;/strong&gt;easurable, &lt;strong&gt;A&lt;/strong&gt;chievable, and &lt;strong&gt;T&lt;/strong&gt;imebound SLOs. Bill Palmer, one of the protagonists of The Phoenix Project, is jumping into the newly formed role of CTO. He is going to help Acme reform their SLOs so that they’re &lt;strong&gt;S&lt;/strong&gt;pecific and &lt;strong&gt;R&lt;/strong&gt;elevant to their customers’ needs and their business strategy. Despite consistently meeting internal and external service levels and receiving fantastic feedback scores, customers are rushing away from Acme. It’s Bill’s job to figure out why and restore Acme to its previous glory.&lt;/p&gt;




&lt;p&gt;“Bill, we’ve got a problem here but I’m not sure what it is. Steve said you were great at fixing difficult problems; I’m hoping you can do that here. Our sales are down, and our long term customers are getting ready to leave. We need help fast.” Dan looked across at me with a grim expression on his face; negative analyst briefings were strewn across his desk.&lt;/p&gt;

&lt;p&gt;“I’ve had a bit of experience with this. It’s extraordinary - I’ve looked through the service level reports for the last two years: reported internal metrics have been stable.” I put a report from my hand onto the table. “No breaches of external SLAs.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T6X3Bjg5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab8128613db684bd4dc455_image5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T6X3Bjg5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab8128613db684bd4dc455_image5.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
All requests with a Status Code of 2XX and total volume of requests



&lt;p&gt;I put down another stack of paper. “...and customer satisfaction looks great.” I topped the stack off with a print of the customer satisfaction dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ySzOTKSj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab81618d1411676119d9be_image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ySzOTKSj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab81618d1411676119d9be_image1.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;
NPS for Acme Interfaces, Inc of customers with spending higher than US$250,000



&lt;p&gt;Dan looked at me and exhaled heavily. “You’ve found the same things as we have. Everything looks fine, but everything is not fine. The company was going great under our previous CEO, Nick, but as soon as we diversified our customer base, we started having problems. Our customers want our product until they use it, and we can’t close long-term contracts. When we were a one-customer company, we didn’t have these problems.”&lt;/p&gt;

&lt;p&gt;He pulled a paper from the stack I had placed on the table, drawing out the twelve-month rolling satisfaction report along with a customer churn report. “What I don’t understand is how we can have an NPS score of 8.2, but our churn is close to 80% after a year of using our product. I need your help.”&lt;/p&gt;

&lt;p&gt;I sat and thought for a little bit. It was definitely a unique situation. “Look, I’ve got some ideas. I need a pre-sales business analyst to help me understand the current business profile and our history a bit better.”&lt;/p&gt;

&lt;p&gt;Looking back at me with a bemused expression, Dan picked up the phone. “John, send in Jenny.” He put the phone down. “Well, you’re definitely on the right track. Jenny is our business engagement lead. She can help you out with at least a third of what you’re after. She knows everyone; Jenny can sort you out.”&lt;/p&gt;

&lt;p&gt;Jenny opened the doors and walked in, nodded at Dan and then looked at me. “Bill, I take it?”&lt;/p&gt;

&lt;p&gt;“That’s me, ready to jump into it?” I asked. She nodded back at me.&lt;/p&gt;




&lt;p&gt;“Bill, I’ll be honest. Our biggest problem is Globex Corporation.”&lt;/p&gt;

&lt;p&gt;“You’re saying that our largest customer is our biggest problem?” Bill looked slightly quizzically at Jenny as she sipped her coffee.&lt;/p&gt;

&lt;p&gt;“Frankly, yes. They were our first customer, and Nick always prioritised what they wanted. But that’s the problem: they’re large and cumbersome. What they want is not what the market wants. They’ve had decreasing sales year-on-year, and all of their feature requests have been perceived negatively by new customers in direct user surveys.”&lt;/p&gt;

&lt;p&gt;Bill scratched his chin. “I’m not saying I don’t believe you, but how do you explain all of our metrics that look great?”&lt;/p&gt;

&lt;p&gt;Jenny smiled grimly at Bill. “That’s an easy answer: all of our SLAs have been designed to look after Globex and no one else. Take a look here.” Jenny pulled out a stack of paper that looked very similar to the one that Bill had given to Dan. “Customers over $250,000 - we only have one of those: Globex. All of our other customers pull out before larger contracts, or they never go further with our product. Nick tailored every single SLA to Globex; they don’t care about latency because they predominantly use our interfaces for driving their reporting engine. Their reports take a week to run each quarter. All of our metrics are around volume, but none focus on service quality.”&lt;/p&gt;

&lt;p&gt;Bill flicked through the papers Jenny had put in front of him, eyeing each one closely. “Well... you’re right. How has this not been picked up before?”&lt;/p&gt;

&lt;p&gt;“Nick had a strict ‘clean dashboard’ policy. He didn’t want to see raw data; he just wanted dashboard views. He was unequivocal that, as far as he was concerned, raw data would confuse industry analysts and that we just needed to provide positive results. Coupled with his intense focus on Globex, it ended up that everything focuses on them, and the business responds to them, because they are a significant source of revenue for us. Of course, that has caused problems now that Dan has been trying to diversify our client base.”&lt;/p&gt;

&lt;p&gt;Bill sat quietly for a few minutes. “So… I think I have a way out of this. Can you please get me some data?”&lt;/p&gt;

&lt;p&gt;“Sure, what do you need?” Jenny pulled up a document on her iPad.&lt;/p&gt;

&lt;p&gt;“Please send me the engagement reports we have for all of our leads and clients. I want to specifically focus on what problems our customers are trying to solve with our products. Please also have the BI team send me the raw data for customer surveys and a separate data set which shows NPS of customers that exited our platform. Also have the SRE team send me the stats on response times for requests, as well as the status code breakup for a rolling twelve-month period.”&lt;/p&gt;

&lt;p&gt;Jenny finished typing out the list of statistics and looked back to Bill. “I’ll have these back to you by tomorrow morning.”&lt;/p&gt;




&lt;p&gt;“Well Dan, we’ve discovered the problem.”&lt;/p&gt;

&lt;p&gt;Dan’s face lit up as Bill sat down in front of him. “You what!? How did you do that? It’s only been two weeks.”&lt;/p&gt;

&lt;p&gt;Bill sighed heavily. “Well, we’ve found the problems, but I’m afraid it’s going to require substantial work to fix them.”&lt;/p&gt;

&lt;p&gt;Dan’s smile faltered slightly. “How bad is it?”&lt;/p&gt;

&lt;p&gt;“It’s quite bad, Dan. All of the stats our commercial team use for our Service Level Objectives are not suited to the market and our template SLA only satisfies the needs of Globex Corporation. I sincerely doubt if anyone except Globex has had a good time using our platform.”&lt;/p&gt;

&lt;p&gt;Dan leaned back in his chair. “Don’t pull any punches, Bill. Tell me how bad it is.”&lt;/p&gt;

&lt;p&gt;“Our NPS across our entire customer base is four. On average, 11.5% of our requests fail per month, which can be up to 23%. Anyone using our platform APIs for real-time activity has found it to be non-functional under load.”&lt;/p&gt;

&lt;p&gt;“...But, how can this be? All of our data has been so good. We’re still making our revenue targets. How have we missed this?”&lt;/p&gt;

&lt;p&gt;“I said it before; our reporting focuses on Globex corporation. They use our platform for their extensive quarterly reporting, so all of our SLO reporting has targeted this use case. The problem is our market is not interested in using our product for reporting - they want to use our real-time APIs for generic use cases so they can focus on their core development tasks. Diversifying is not working because our platform is not built for the markets we’re trying to break into.&lt;/p&gt;

&lt;p&gt;“Here’s the most prominent example we could find of reporting that looks really great but isn’t. Nick had these SLAs determined based on Globex using the system for reporting.” Bill picked up a sheet and placed it in front of Dan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0puh8Pir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab831f188e4d4f906bbb47_image4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0puh8Pir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab831f188e4d4f906bbb47_image4.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
Percentage Breakdown of Major Status Codes



&lt;p&gt;“This looks good, but look what happens when we remove the ‘202’ status codes, which just mean the system is processing a request.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QY9xO076--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab83467fa4390b49890ae6_image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QY9xO076--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab83467fa4390b49890ae6_image2.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
Percentage Breakdown of Major Status Codes excluding 202s



&lt;p&gt;“As you can see, in reality, we’re barely meeting what would be considered a proper SLO for our systems. If we want to increase our market, we need to make some drastic technology changes now and update our SLOs to meet the expectations of potential customers.”&lt;/p&gt;

&lt;p&gt;“Well, Bill. You’ve done this before, what do you suggest?”&lt;/p&gt;

&lt;p&gt;“Dan, with the help of the SRE Team and Jenny, we’ve been able to build a plan.” Bill pulled out his phone and sent the plan across.&lt;/p&gt;

&lt;p&gt;“We investigated the cause of most of the errors, and it seems like there are issues with IOPS on the database. The first step is to migrate our storage to at least 5000 provisioned IOPS so we can meet the real-time request demand; the SRE team has already made that upgrade. Here’s the rest of our plan to normalise our performance and track our progress, and here is our planned SLA, with both internal and external SLOs to help us meet our customer expectations.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t1TZtosA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab83b23e5e928acd5d1fcb_image3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t1TZtosA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fab83b23e5e928acd5d1fcb_image3.png" width="502" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Our most obvious problem was our reporting metrics for request failures. We’ve added more specific wording so that reporting requests do not dilute our metrics. To support this, we’ve added an internal SLO for IOPS to be monitored by the SRE Team. We found that after waiting 5 seconds, our software would error when a result wasn’t returned. Increasing IOPS eliminated most of our request failures.”&lt;/p&gt;

&lt;p&gt;Dan looked over at Bill, surprised. “What do you mean most?”&lt;/p&gt;

&lt;p&gt;“I mean, we only had an average 1.2% failure rate. We’ve also changed up our NPS reporting to include all paying customers, and we’ve lowered our targets because we cannot possibly meet an NPS target of 8, given the reality of the situation.”&lt;/p&gt;

&lt;p&gt;“That’s fair enough; I’m fine with explaining the difference to industry analysts.” Dan motioned for Bill to continue.&lt;/p&gt;

&lt;p&gt;“Finally, we have a new internal SLO for a new-customer NPS average of 8. That might sound crazy, but we found that almost all of our customers that failed to contract predominantly complained of slow request response times. We think that by ensuring our request times are lower than 250 milliseconds, we can retain most of our new customers and they will become promoters.”&lt;/p&gt;

&lt;p&gt;Dan was silent for a few moments as his eyes rolled over the documents that Bill had placed in front of him.&lt;/p&gt;

&lt;p&gt;“Okay Bill, I like the look of all these changes, what can you guarantee me in terms of customer retention? After all, that’s the biggest problem. I don’t want to move focus away from Globex if it’s not going to increase our operating performance.”&lt;/p&gt;

&lt;p&gt;“Well, that’s the thing, Dan. After working with Jenny, we think we can get 80% customer retention on new sales if we implement these targets. Almost all of our customers said they loved the system when it worked. They just need it to be performant, and they will come. Jenny thinks she’ll be able to reach out to some former customers and get them to sign back up with us for a new trial, too. We went out and met the customers, understood what they wanted, and we made sure that these new SLOs were specific and attainable. These aren’t numbers we’ve picked from a hat; this is science. We already had a great system in place for metric measurement and monthly reporting; we just needed to tune it correctly.”&lt;/p&gt;

&lt;p&gt;“Okay, you’ve got six months. I want to see all of these SLOs met and the customer retention numbers. If it all works out, we can announce our results the month after and make our new SLA public at the same time.”&lt;/p&gt;




&lt;p&gt;Bill stood up in the boardroom, buttoned his coat and walked up to the lectern.&lt;/p&gt;

&lt;p&gt;“Thank you all for the opportunity to demonstrate our progress before we announce our results tomorrow.”&lt;/p&gt;

&lt;p&gt;“Six months ago, everyone here believed we had rock-solid SLOs, a great SLA with our target clients, and a great reporting system. That was true for our old market, but not for the new. Dan came to me looking to transform our business so that we could diversify our client base and grow as a business.”&lt;/p&gt;

&lt;p&gt;“Our first step in solving this problem was to look at the data we’d been using and look for any discrepancies; there were a few. Our main problems were our focus on a single client, and some issues with what we considered to be a successful client request. After changing the parameters of our reporting to match what we wanted our company to be delivering, it was immediately clear that our system was failing a lot more than we thought, and our customers were not having their expectations met.”&lt;/p&gt;

&lt;p&gt;“The first thing we did was resolve the root cause of our system failures, which was rather simple. This hadn’t been caught earlier because our SRE team was focused on another set of goals that aligned with assisting Globex Corporation, our primary customer. Simply put, our storage was not keeping up with our systems, and we just needed to upgrade it, which was relatively straightforward.”&lt;/p&gt;

&lt;p&gt;“Our second issue, which was our primary focus, took a little longer to resolve fully. Over the past six months, we’ve moved resources away from supporting predominantly reporting interfaces to real-time interfaces, and we’ve adjusted our SLOs to focus on non-reporting response times. With the great work of our developers tuning their code, and the SRE team tuning our web servers, we’ve seen our response time drop below our target of 250ms in the last two months.”&lt;/p&gt;

&lt;p&gt;“These two issues were not our only problem, and we made sure to include all of our paying customers in our NPS surveys. We also put an internal focus on making sure our new customers were fully satisfied with the product, and any small issues they had were prioritised on the development pipeline. On paper, this increased our overall NPS to 8.75, up from just 6 in our first all-customer survey six months ago.”&lt;/p&gt;

&lt;p&gt;“Overall, a strong focus on our vision for where we wanted to be, making sure our goals were aligned with that vision, and then re-focusing our existing SLO and SLA reporting has seen us expand our revenue by 35% and see 85% of new customers stay with our product. We understood our market, listened to our customers, and responded accordingly.”&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;PRESS RELEASE FOR IMMEDIATE CIRCULATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Acme Interfaces, Inc sees explosive growth over the last half, attributes success to Know-Your-Customer approaches and SMART Service Level development.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After implementing a significant change that empowered our SRE and Sales Teams, we’ve been able to drive 85% new customer retention and see a 35% increase in revenue. Our average Net Promoter Score (NPS) has increased by 0.5 to 8.75, even though we’ve expanded our survey base to all our paying customers. We’re also introducing a new customer-focused SLA today that should provide a performant base for all of our customers to depend on as we move into the future.&lt;/p&gt;

&lt;p&gt;A big thanks to our CTO Bill Palmer and our new Head of Business Development Jenny Masters who spearheaded the internal development of our new SLO and SLA offering.&lt;/p&gt;

&lt;p&gt;Did you enjoy this piece of content written in a narrative case study format? We would love to hear your thoughts! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>slo</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Choosing SLOs that users need, not the ones you want to provide</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Wed, 18 Nov 2020 10:52:46 +0000</pubDate>
      <link>https://dev.to/squadcast/choosing-slos-that-users-need-not-the-ones-you-want-to-provide-59eh</link>
      <guid>https://dev.to/squadcast/choosing-slos-that-users-need-not-the-ones-you-want-to-provide-59eh</guid>
      <description>&lt;p&gt;&lt;em&gt;In our latest two-part series blog, Adam Hammond, talks about how you can build sustainable SLOs that are appropriate for your users, your technology platform, and your business which in turn will help you make your systems robust, your customers happy, and your business boom.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Service Level Objectives (SLOs) are a powerful operational tool that uses metric-based targets to constrain activities that may have a negative impact on users (such as maintenance or failed deployments). Traditionally, you may have heard it being used in contractual terms within Service Level Agreements (SLAs), where SLOs are used to identify guarantees for IT platforms (SaaS, IaaS, PaaS, etc.). However, they are far more than that: SLOs are a powerful tool that can be used not only by the “business people” but also by technical staff to drive process improvement and technological advancement. SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems, its capabilities, and where you can get your best “bang for buck” when it comes to focusing your work efforts. However, SLOs must be influenced by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; the only problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction. Today, we’re going to help you build sustainable SLOs that are appropriate for your users, your technology platform, and your business that will help you make your systems robust, your customers happy, and your business boom.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Asking the right questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have an idea of what SLOs are, we need to go about establishing a data-based approach that will result in positive user outcomes. This is a two-stage process that involves data gathering and then using that data to build your SLOs. The source data for these questions comes from three main places: your users, your system, and your business processes. Prepare to go out and talk to clients on Zoom calls, trawl through logs, and understand the maintenance and support lifecycle of your system. There is no prescription for these questions; they are subjective, and everyone’s scenario will be different. It is also important to remember the Pareto Principle: 80% of your users use about 20% of your system. Therefore you will get the best value out of this exercise by targeting and providing SLOs for the most commonly used parts of your system.&lt;/p&gt;

&lt;pre&gt;

  &lt;b&gt;Example Questions&lt;/b&gt;

  - When do my users actively or passively use my system?

  - How much maintenance do I need to perform and how regularly does
    it need to be?

  - What tolerance would my users have for outages?

  - Would my users consider my application critical to their 
    business?

  - How well is my system performing at the moment?

  - What levels of performance do my users require?

&lt;/pre&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Determining SLOs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you have finished your data gathering exercise, it is time to focus on actually setting your SLOs. SLOs will generally - but not always - fall into the following categories:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6I7xYByQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5f7435b4ddda8d6bfa5fd6b1_article_2_infographic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6I7xYByQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5f7435b4ddda8d6bfa5fd6b1_article_2_infographic.png" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These categories cover most of the things that people consider to be aspects of quality. They also translate easily into metrics that you can use to objectively measure your system against the requirements of your SLOs. Finally, when you define your SLOs, remember that a good SLO should be &lt;strong&gt;S.M.A.R.T.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;S&lt;/strong&gt;pecific: an SLO should expressly state what it measures (e.g. we want to measure availability by testing whether a request can be made to the server, not we want the server to be up).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;M&lt;/strong&gt;easurable: the SLO should be something that can be measured (e.g. disk latency should be less than 5ms, not the disk should be quick).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A&lt;/strong&gt;chievable: you should be able to meet your SLOs (e.g. if an underlying service has an SLO of 95%, you cannot guarantee 100%).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;R&lt;/strong&gt;elevant: your SLO should reflect the user experience (e.g. an appropriate metric for a web server is response time, not CPU activity).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T&lt;/strong&gt;imebound: an SLO should cover a period that is appropriate for how your system is used (e.g. if your users only use your system between 9 AM and 5 PM, a 24-hour SLO will only dilute your actual metrics and hide issues).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let’s get down to creating an SLO. Whether an SLO is achievable or relevant is not pertinent to the specific wording required, but it dictates whether a particular SLO should be set. For example, if the average time to retrieve a file is five minutes, you would not guarantee that the file can be delivered faster than that (because, on average, it won’t). Alternatively, if your users only care that files are consistently, but eventually, delivered to them, then a retrieval-time-based SLO is probably not for you. In this case, the best SLO would be one that guarantees that a proportion of files are always delivered to users, regardless of the time to retrieve and deliver (i.e. percentage of successful retrievals).&lt;/p&gt;

&lt;p&gt;Once we’ve determined that an SLO is appropriate, let’s get the SLO down on paper. Remember, we need to make sure that the wording is &lt;strong&gt;S&lt;/strong&gt;pecific, &lt;strong&gt;T&lt;/strong&gt;imebound, and that it is &lt;strong&gt;M&lt;/strong&gt;easurable. If it is not all of these things, then it simply cannot be used as an SLO. Let’s consider an example. A system processes stock trades and all requests need to be finalised within 300ms as dictated by a regulatory body. The company running the system wants to offer an SLO that requests, on average, over 30 days are completed faster than 250ms. The system currently responds to 98% of requests within 232ms on a 30-day rolling average. The SLO text would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ug6hpGsU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5f743660ca1ae1091756f591_article_2_paragraph_breakdown_image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ug6hpGsU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5f743660ca1ae1091756f591_article_2_paragraph_breakdown_image.png" width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is this a good SLO? Yes. The system already exceeds the SLO, so it is &lt;strong&gt;A&lt;/strong&gt;chievable. There is a legal requirement that requests are finalised within the SLO limits, so it is &lt;strong&gt;R&lt;/strong&gt;elevant. We are &lt;strong&gt;S&lt;/strong&gt;pecific with the metric we want to guarantee our performance against, which is the request response rate. We have limited our SLO to a 30-day period, which allows us to run reporting that is &lt;strong&gt;T&lt;/strong&gt;imebound. Finally, our metric is &lt;strong&gt;M&lt;/strong&gt;easurable via a Prometheus metric. We have met all the requirements for a SMART SLO that has been tailored to the user experience.&lt;/p&gt;
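&lt;p&gt;As a sketch of how such an SLO might be checked, the calculation behind the metric is simple. The sample values below are hypothetical; in practice they would come from your monitoring system, such as the Prometheus metric mentioned above:&lt;/p&gt;

```python
from statistics import mean

# Hypothetical response-time samples (ms) from a 30-day rolling window.
response_times_ms = [232, 241, 228, 236, 249, 230]

def slo_met(samples_ms, target_ms=250.0):
    """True when the rolling average response time meets the target."""
    avg = mean(samples_ms)
    # The average meets the target when it is no larger than the target.
    return min(avg, target_ms) == avg

print(slo_met(response_times_ms))  # True: the 30-day average is under 250ms
```

The same check works for any averaged, timebound SLO; only the samples and the target change.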

&lt;pre&gt;

  &lt;b&gt;How to account for maintenance and scheduled downtime in your SLOs&lt;/b&gt;

  Everyone needs to maintain their systems; some are highly available
  and need no downtime, while others require regular maintenance
  windows. The simple answer is to bake your maintenance into the
  SLO. If you know you can provide 97% availability for a system over
  a month, but you need 14 hours of maintenance (roughly 2% of a
  30-day month), then only offer 95%. It is better to underpromise
  and overdeliver than be red-faced (and out of pocket) because your
  system has been offline (and you expected it).

&lt;/pre&gt;
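&lt;p&gt;The arithmetic above can be captured in a couple of lines. This is a minimal sketch; the 720-hour window corresponds to a 30-day month and is an assumption you would adjust to your own reporting period:&lt;/p&gt;

```python
def adjusted_availability(achievable_target, maintenance_hours, window_hours=720):
    """Fold planned maintenance into the availability target you publish."""
    # Fraction of the reporting window consumed by planned maintenance.
    maintenance_fraction = maintenance_hours / window_hours
    return achievable_target - maintenance_fraction

# 97% achievable availability, 14 hours of maintenance in a 30-day month:
print(round(adjusted_availability(0.97, 14), 2))  # 0.95 -- offer 95%, not 97%
```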

&lt;h3&gt;
  
  
  &lt;strong&gt;Providing Better Service (and Increasing your SLO guarantees)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have our SLOs, they’re SMART, but… we are just not meeting our targets (or want to exceed them). What do we do? We need to make our systems performant enough to overcome this challenge. While demanding in terms of effort, this is right in the SRE wheelhouse, and will predominantly rely on your expertise and knowledge to improve your system performance. If users require faster requests, streamline your proxy config. If disk reads are too slow, consider high IOPS or higher throughput alternatives. If batch jobs are taking too long, right-size the instances so that they process in the correct amount of time. Some more difficult approaches may include changing your operating system, database platforms, or even development frameworks. It entirely depends on your ability to analyse and understand the factors in your system that affect your SLOs, and to mitigate those issues through proper SRE practice.&lt;/p&gt;

&lt;p&gt;There are also other options aside from the more technical approach: improved monitoring and disaster recovery. By improving your monitoring, you can ensure that problems are caught before they affect your SLOs. Your disaster recovery plan is key to managing and maintaining your SLOs. Disasters come when we least expect them, so practising and improving DR procedures means that if disaster strikes, you are able to restore service as quickly as possible. This will limit the overall impact to SLOs by ensuring that any disaster downtime is limited to only that which is strictly necessary to recover your systems.&lt;/p&gt;

&lt;p&gt;Using these processes, you can deliver SLOs that will please your users and make their experience with your systems a delight. By meeting (and hopefully, exceeding) their expectations, you will build lifelong customers that will evangelise your business and products.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;To be Continued...&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the second part of this blog, we will be looking at an example based on Bill from &lt;a href="https://www.amazon.com.au/Phoenix-Project-DevOps-Helping-Business-ebook/dp/B078Y98RG8" rel="noopener noreferrer"&gt;The Phoenix Project&lt;/a&gt; that will highlight how “achieving SLOs” is not always good for business if those SLOs aren’t derived from customer needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>slo</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Don't Let ICANN Increase the Wholesale Domain Name Price</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Tue, 11 Feb 2020 01:28:57 +0000</pubDate>
      <link>https://dev.to/hammotime/don-t-let-icann-increase-the-wholesale-domain-name-price-la4</link>
      <guid>https://dev.to/hammotime/don-t-let-icann-increase-the-wholesale-domain-name-price-la4</guid>
      <description>&lt;p&gt;ICANN is considering letting Verisign increase the wholesale costs of Domain Names by up to 70% over the next 10 year period. For me, Domain Names (especially .COM) are an easy way for anyone to build an online brand or business. While this price increase is negligible for a lot of people, this impacts entrepreneurs in developing nations looking to establish a brand for their online business.&lt;/p&gt;

&lt;p&gt;Everyone should be angry and upset over this, as the continued corporatisation of the internet threatens its openness and core foundations. Please, go and send an &lt;a href="mailto:comments-com-amendment-3-03jan20@icann.org"&gt;Email to ICANN&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I've written a letter to them already to encourage them to not proceed with this change that will limit innovation. It's below. For more information, see the &lt;a href="https://www.namecheap.com/blog/icann-allows-com-price-increases-gets-more-money/"&gt;Namecheap Article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To Whom It May Concern&lt;/p&gt;

&lt;p&gt;I am both a customer of the .COM domain system and a seller as a wholesaler through my Hosting Business.&lt;/p&gt;

&lt;p&gt;The .COM domain name enjoys a de facto "default" status that encourages most new domain names to flow through to the .COM registry, even after the introduction of the new gTLDs and country-specific TLDs. In the same way that the .ORG domain name should be managed and reserved for the enrichment of public organisations, the .COM domain name should be made freely available to all people at a relatively cheap price. As entrepreneurship grows ever more common and the cost of compute falls dramatically, wholesale prices for domain names should, if anything, be falling too. This would make the internet more free and open, and encourage everyone to get domain names. I love domain names, and I would love to see them proliferate to the same extent as email addresses.&lt;/p&gt;

&lt;p&gt;I make a note of ICANN's vision, which is "To be a champion of the single, open, and globally interoperable Internet". Increasing the price of Domain Names will only increase the operating costs of small businesses and individuals, while incumbent players in all markets will barely notice the change. This action is completely in opposition to that vision. Please also refer to one of ICANN's Strategic Goals: "Sustain and improve openness, inclusivity, accountability, and transparency". How are you making the system more inclusive if the cost of a .COM domain name is about eight times what one third of the world's population makes per day? If anything, ICANN should be focusing on reducing the wholesale .COM domain price to make it more open and available.&lt;/p&gt;

&lt;p&gt;I wholeheartedly disagree with this move to make obtaining .COM domains harder. ICANN should reconsider this decision.&lt;/p&gt;

&lt;p&gt;Regards,&lt;/p&gt;

&lt;p&gt;Adam Hammond&lt;/p&gt;

</description>
      <category>inclusion</category>
      <category>startup</category>
    </item>
    <item>
      <title>What's Your Favourite tools?</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Fri, 07 Feb 2020 12:58:47 +0000</pubDate>
      <link>https://dev.to/hammotime/what-s-your-favourite-tools-3n8f</link>
      <guid>https://dev.to/hammotime/what-s-your-favourite-tools-3n8f</guid>
      <description>&lt;p&gt;I've worked a lot of places over the 10 years. In that time I've worked in big corporate, small business, and all sizes of Government, and I've come across a lot of tools. I've got a bit of a list going of my current favourites, which I'll drop below.&lt;/p&gt;

&lt;p&gt;What are your favourite tools? Share in the comments below, then come back for a read of mine!&lt;/p&gt;

&lt;h2&gt;
  
  
  Web Servers
&lt;/h2&gt;

&lt;h4&gt;
  
  
  SSL Certificates: &lt;a href="https://letsencrypt.org/"&gt;Let's Encrypt&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Free SSL certificates? Yes please! If you want a free SSL certificate for your website, I'd recommend using Let's Encrypt. If you're on Kubernetes, use &lt;a href="https://github.com/jetstack/cert-manager"&gt;Jetstack's cert-manager&lt;/a&gt;, or if you're on a server, use &lt;a href="https://certbot.eff.org/"&gt;EFF's CertBot&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  SSL Testing: &lt;a href="https://www.ssllabs.com/ssltest/"&gt;SSL Labs' SSL Test&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;I LOVE SSL. &lt;a href="https://engi.fyi/establishing-trust-why-tls-should-be-important-to-you/"&gt;If you've read my article on TLS&lt;/a&gt; (which I'd recommend), you'd see that it's something I'm passionate about. The folks over at SSL Labs have done a GREAT job of providing a tool that will compare your TLS settings against a set of baselines and give you an A to E grade based on the outcome. They test supported TLS versions, Ciphers, vulnerabilities, and heaps of other things. I check every site I set up with SSL Labs so I can be sure it's as hardened as possible. This is my favourite tool on the list to use (Let's Encrypt would be, but I don't "use" it enough - they make it too easy to get certificates).&lt;/p&gt;

&lt;h4&gt;
  
  
  CDN: &lt;a href="https://www.cloudflare.com/"&gt;Cloudflare&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Cloudflare are probably the most well-known CDN available. They provide a quality service and protect some of the largest websites in the world. That's why you should use Cloudflare: they know what they're doing, their product is CDN, and they do it well. They also provide a free plan that is very generous and will suit the needs of almost all small-to-medium users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mail
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Provider: &lt;a href="https://protonmail.com/"&gt;ProtonMail&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;I use ProtonMail for two reasons: they're privacy focused and secure. They do a great job of protecting customers and their communications. In most of my Ops roles, I've needed to communicate secure information with people. ProtonMail does the best job of providing a secure method for communication. You can get a 250 MB mailbox for free with a @protonmail.com/ch email address that includes all of the security features they're known for.&lt;/p&gt;

&lt;h4&gt;
  
  
  Forwarding: &lt;a href="https://forwardemail.net/"&gt;ForwardEmail&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;A lot of companies have multiple domains. If you have addon domains that you want to maintain a mail presence for, but don't want to manage as part of your normal email administration, ForwardEmail is for you. You set up your DNS records to point to ForwardEmail's servers and your emails will be sent to the email address you specify. ForwardEmail provide a great free tier, but you need to expose a public email address on your DNS (I'd suggest only using this for public-facing mailboxes).&lt;/p&gt;

&lt;h4&gt;
  
  
  Mail-related DNS Checkers: &lt;a href="https://www.dmarcanalyzer.com/"&gt;DMARCAnalyzer&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;I am a big nerd when it comes to configuring DNS. DMARCAnalyzer provides three tools for troubleshooting your mail-based DNS settings. They have tools for checking your &lt;a href="https://www.dmarcanalyzer.com/spf/checker/"&gt;SPF&lt;/a&gt;, &lt;a href="https://www.dmarcanalyzer.com/dkim/dkim-check/"&gt;DKIM&lt;/a&gt;, and &lt;a href="https://www.dmarcanalyzer.com/dmarc/dmarc-record-check/"&gt;DMARC&lt;/a&gt; records. I've used this multiple times to troubleshoot SPF and DMARC records. It's a great service.&lt;/p&gt;
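&lt;p&gt;For reference, the records these tools check are plain DNS TXT values. The examples below are hypothetical (the domain, selector, mailer host, and truncated key are made up), but they show the general shape of each record type:&lt;/p&gt;

```python
# Hypothetical mail-related TXT records for a made-up domain.
mail_dns_records = {
    # SPF: lists the servers allowed to send mail for the domain.
    "example.com": "v=spf1 include:_spf.example-mailer.com ~all",
    # DKIM: a public key published under a named selector.
    "selector1._domainkey.example.com": "v=DKIM1; k=rsa; p=MIGfMA0GCSq...",
    # DMARC: the policy for mail that fails SPF/DKIM alignment.
    "_dmarc.example.com": "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com",
}

for name, value in mail_dns_records.items():
    print(name, "IN TXT", value)
```

Paste your own domain's real records into the checkers above and they will flag syntax problems like these examples would have if malformed.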

&lt;h2&gt;
  
  
  DNS
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Provider: &lt;a href="https://1.1.1.1/"&gt;1.1.1.1&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Unlike most other tech companies, Cloudflare doesn't have an advertising sales department. They have made a solid promise to provide secure DNS services, and have done a great job with 1.1.1.1. They have the traditional resolver available, but also have a set of apps to configure your phone to work with their service. They make secure DNS easy. I'd recommend anyone using Google DNS move to 1.1.1.1, because you just can't trust an advertising company with data about your browsing habits.&lt;/p&gt;

&lt;h4&gt;
  
  
  Propagation Checker: &lt;a href="https://dnschecker.org/"&gt;DNS Checker&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;If you need to check whether your DNS changes have propagated, head here. This tool connects to multiple DNS servers around the globe and reports the current value of records.&lt;/p&gt;

&lt;h4&gt;
  
  
  DNS Cache Clearers: &lt;a href="https://1.1.1.1/purge-cache/"&gt;1.1.1.1&lt;/a&gt; and &lt;a href="https://developers.google.com/speed/public-dns/cache"&gt;Google DNS&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;If you deploy new DNS records and want them to propagate quickly, you can head to both the tools for Cloudflare's 1.1.1.1 and Google's DNS service and purge their caches. You just need to enter your website's domain name and the record type, and these services will refresh their records across their networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Uptime Monitoring: &lt;a href="https://uptimerobot.com/"&gt;Uptime Robot&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;If you love getting woken up in the middle of the night by emails indicating your site is down, Uptime Robot is great. They provide all of the standard features of uptime monitoring, with a generous free plan, and they also offer a hosted status page service. If you choose to pay for Uptime Robot, their plans are cheap and far better value for money than most competitors.&lt;/p&gt;

&lt;h4&gt;
  
  
  Logging and Metrics: &lt;a href="https://cloud.google.com/stackdriver/"&gt;Stackdriver&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Google's Stackdriver is fantastic. It has a great interface for logs and metrics, provides an incident response interface, and just works. It is far superior to most other cloud-provider logging offerings, and is much cheaper than most premium logging solutions. I would recommend Stackdriver to anyone with a small-to-medium project because it has so much functionality and is so cheap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Workflow Automation: &lt;a href="https://zapier.com/"&gt;Zapier&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Zapier is really expensive, and if you want to automate a lot of things I'd definitely recommend going with one of the multitude of serverless solutions out there and rolling your own. However, if your use falls within the 100 executions/month in the free plan, Zapier has a lot of great integrations with services and is easy to use. You can link it up to almost any system and get going immediately with very low touch. They have a lot of example "zaps", and they provide a lot of tools for how to extend and configure your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  VPN
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Managed: &lt;a href="https://protonvpn.com/"&gt;ProtonVPN&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;ProtonVPN does something that most other VPN companies don't: they're open. Most VPN companies are very shady and cagey about their data access and logging practices, making bold claims about their services without providing any real substantive evidence. ProtonVPN, however, goes against the crowd: not only have they had their service &lt;a href="https://blog.mozilla.org/futurereleases/2018/10/22/testing-new-ways-to-keep-you-safe-online/"&gt;audited by the likes of Mozilla&lt;/a&gt;, they have also &lt;a href="https://protonvpn.com/blog/open-source/"&gt;open-sourced all of their apps&lt;/a&gt;, so you can see how their service works. If you want a VPN, there is no decision to make - just use ProtonVPN.&lt;/p&gt;

&lt;h4&gt;
  
  
  Roll-Your-Own: &lt;a href="https://getoutline.org/"&gt;Outline&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;If you're in a super sensitive environment, you still might not trust a company to manage your VPN service. If this is the case, I'd highly recommend Outline, a VPN product developed by Jigsaw, Alphabet's emerging threats research subsidiary. Outline provides a simple-to-use interface that connects to your choice of cloud provider and automatically builds a secure VPN server. Once the server is ready, a set of credentials is provided which you can distribute to users. That's it, it's really that easy.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>productivity</category>
      <category>security</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Let's Embark! Securing our Internet Treasures (Part 3)</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Thu, 30 Jan 2020 16:22:28 +0000</pubDate>
      <link>https://dev.to/hammotime/let-s-embark-securing-our-internet-treasures-part-3-n87</link>
      <guid>https://dev.to/hammotime/let-s-embark-securing-our-internet-treasures-part-3-n87</guid>
      <description>&lt;p&gt;&lt;em&gt;Today we secure our Ingress! Security is one of the most important parts of any modern software installation. Let's see how easy it is to get 'er done on Kubernetes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's that time of the day... You sit down at your computer to access the application you're running on Kubernetes. You navigate to your site, and what's this!? Someone has defaced it! Oh no, it looks like you didn't have your ingress secured, and someone has sniffed all your traffic and gotten your admin password. This, of course, is what &lt;em&gt;could&lt;/em&gt; happen, but we're smart SysAdmins - we're going to secure our ingress controllers using &lt;a href="https://letsencrypt.org/"&gt;Let's Encrypt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6d1SDnLI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3kmp7ggialxqrw885ry5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6d1SDnLI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3kmp7ggialxqrw885ry5.png" alt="Let's Encrypt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  But what is Let's Encrypt?
&lt;/h2&gt;

&lt;p&gt;If you've read my &lt;a href="https://engi.fyi/establishing-trust-why-tls-should-be-important-to-you/"&gt;Article on TLS&lt;/a&gt; (which you should), you know that securing a website is important because it allows each party to trust that they are the only ones participating in their conversation. A few years ago, the ability to provide encryption on a website was hampered by the fact that all Certificate Authorities (CAs) issued SSL Certificates for cash. This meant that for most hobbyists or smaller websites, securing their stuff was expensive. Enter Let's Encrypt, a service run by the non-profit &lt;a href="https://www.abetterinternet.org/"&gt;Internet Security Research Group&lt;/a&gt;. Let's Encrypt has effectively democratised access to secure communications by maintaining a CA that issues Certificates for free. That's right, &lt;em&gt;free&lt;/em&gt; as in &lt;strong&gt;free beer&lt;/strong&gt;. This means that you can set up TLS on your website for free, and anyone who accesses your site does so with the knowledge and comfort that your communication is secure. A vitally important, yet criminally underrated privilege that a lot of people take for granted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Door
&lt;/h2&gt;

&lt;p&gt;This process for configuring Let's Encrypt is relatively straightforward thanks to &lt;code&gt;cert-manager&lt;/code&gt;, a project by &lt;a href="https://www.jetstack.io/"&gt;Jetstack&lt;/a&gt;. We're going to use &lt;code&gt;cert-manager&lt;/code&gt; and Digital Ocean DNS to provision our certificates. Let's get started by jumping into &lt;code&gt;kubectl&lt;/code&gt; and installing our components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Creates Namespace
kubectl create namespace cert-manager

# Installs Custom Resource Definitions and the cert-manager Services
kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.12.0/cert-manager.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This sets up the basic pods and configuration required to run &lt;code&gt;cert-manager&lt;/code&gt;. The main configuration items we need to take care of are setting up our Digital Ocean API Key and creating our Cluster Issuer. Before creating our Digital Ocean Secret, we need to go to the Digital Ocean Control Panel and &lt;a href="https://www.digitalocean.com/docs/api/create-personal-access-token/"&gt;Generate an API Key&lt;/a&gt; (here's a &lt;a href="https://m.do.co/c/f250f695871e"&gt;referral link&lt;/a&gt;, although DNS is free with Digital Ocean). Once this is done, let's create the Secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dns_secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: digitalocean-dns
  namespace: cert-manger-issuer
data:
  access-token: &amp;lt;DIGITAL_OCEAN_API_KEY&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
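&lt;p&gt;Note that Kubernetes expects values under &lt;code&gt;data:&lt;/code&gt; to be base64-encoded, so encode your API key before pasting it into the manifest. A quick sketch (the token below is a placeholder, not a real key):&lt;/p&gt;

```shell
# Secret "data" values must be base64-encoded before they go into the manifest.
# The token below is a placeholder - substitute your real Digital Ocean API key.
echo -n "dop_v1_placeholder_token" | base64
```

&lt;p&gt;Alternatively, the &lt;code&gt;stringData:&lt;/code&gt; field accepts the raw value and lets Kubernetes do the encoding for you.&lt;/p&gt;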





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f dns_secret.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now that we've got all our configuration in place, we just need to tell &lt;code&gt;cert-manager&lt;/code&gt; how to issue certificates for our Cluster. We do this using ClusterIssuers, which can issue Certificates across all Ingresses in our Cluster. The ClusterIssuer consists of two pieces of configuration: your email address (for notifications) and solver configuration (so that Let's Encrypt can prove you own your domain). With these in place, we can get to the part where we &lt;em&gt;actually&lt;/em&gt; secure our Ingress!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# production_do_dns_issuer.yaml
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: do-issuer-production
  namespace: cert-manger-issuer
spec:
  acme:
    # The ACME server URL
    server: https://acme-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: &amp;lt;YOUR_EMAIL_ADDRESS&amp;gt;
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-production
    # Enable the DNS-01 challenge provider
    solvers:
    - dns01:
        digitalocean:
          tokenSecretRef:
            name: digitalocean-dns
            key: access-token
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f production_do_dns_issuer.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing the Lock
&lt;/h2&gt;

&lt;p&gt;We're finally on the home stretch! The final step is to apply our TLS configuration to our Ingress. This involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding a &lt;code&gt;cert-manager.io/cluster-issuer&lt;/code&gt; annotation which tells &lt;code&gt;cert-manager&lt;/code&gt; that we want a Certificate issued by the listed ClusterIssuer.&lt;/li&gt;
&lt;li&gt;Adding a &lt;code&gt;tls&lt;/code&gt; block under the spec that lists the host names we want the Certificate to provide TLS for, as well as the name of the Secret where the Certificate should be stored.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# production_ingress.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
    [...]
    annotations:
        # Add our annotation
        cert-manager.io/cluster-issuer: do-issuer-production
spec:
  # Add our host and Secret configuration
  tls:
  - hosts:
    - www.example.com
    - example.com
    secretName: wp-example-com-tls-production
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f production_ingress.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once these changes have been applied to the Ingress, the following will occur:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A TXT record will be placed into your DNS Zone by the dns01 solver.&lt;/li&gt;
&lt;li&gt;A Secret will be created in the same namespace as the Ingress, with the name listed at &lt;code&gt;spec &amp;gt; tls &amp;gt; secretName&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within about five minutes, your Certificate will be issued, and if you refresh your site, it will now be secured.&lt;/p&gt;

&lt;h2&gt;
  
  
  See, Totally Easy!
&lt;/h2&gt;

&lt;p&gt;As you can see, it is &lt;em&gt;very&lt;/em&gt; easy to set up TLS on our Ingress on Kubernetes. &lt;strong&gt;No one&lt;/strong&gt; could possibly have any excuse to let their own and their customers' private communications be snooped on by third-party actors. I love TLS, and this is probably my favourite article of the series. Security is really important, and it is easy to forget that given how ubiquitous and easy to use it is these days. However, we can't let our guard down, because bad actors factor a high baseline of security into their methods. Don't be easy pickings.&lt;/p&gt;

&lt;p&gt;If you liked this article, consider giving &lt;a href="https://letsencrypt.org/donate/"&gt;Let's Encrypt a donation&lt;/a&gt;, or you could just follow me on Twitter &lt;a href="https://twitter.com/stophammotime"&gt;@stophammotime&lt;/a&gt;. Thanks for reading, and see you for the next instalment, where we set up backups on our Cluster.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>tutorial</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
    <item>
      <title>The Secret: Kubernetes Secrets and AWS SSM</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Tue, 28 Jan 2020 10:36:13 +0000</pubDate>
      <link>https://dev.to/hammotime/the-secret-kubernetes-secrets-and-aws-ssm-9fl</link>
      <guid>https://dev.to/hammotime/the-secret-kubernetes-secrets-and-aws-ssm-9fl</guid>
      <description>&lt;p&gt;Kubernetes and secrets is always a difficult problem. I've got a super simple solution using AWS SSM today that we can use during our CI/CD pipeline to inject our secrets into our services. This is so simple and quick, that you might miss it, so I'll get to it.&lt;/p&gt;

&lt;p&gt;First, log into AWS and open up Systems Manager. Go to Parameter Store, and create a new Parameter. The parameter type needs to be &lt;code&gt;SecureString&lt;/code&gt;. Feel free to name it whatever you like; I like to go with &lt;code&gt;/&amp;lt;cloud_provider&amp;gt;/k8s/&amp;lt;application&amp;gt;/&amp;lt;environment&amp;gt;&lt;/code&gt;. Add the contents of &lt;code&gt;secret.yaml&lt;/code&gt; as the parameter's value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: wp-secrets
  namespace: wp-custom-domain
data:
  wordpress_db_password: QXdm .. mRUg=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
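&lt;p&gt;With the naming convention above, the parameter path can be assembled straight from your CI variables. A rough sketch (the values here are examples only):&lt;/p&gt;

```shell
# Example values only - substitute your own provider, app, and environment.
CLOUD_PROVIDER="do"
APP_TYPE="wordpress"
CI_ENVIRONMENT_NAME="production"
echo "/${CLOUD_PROVIDER}/k8s/${APP_TYPE}/${CI_ENVIRONMENT_NAME}"
# prints /do/k8s/wordpress/production
```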



&lt;p&gt;Secondly, jump into your CI configuration and add the following as a step prior to creating your Kubernetes Deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create secrets
# /do/k8s/$APP_TYPE/$CI_ENVIRONMENT_NAME
aws ssm get-parameters-by-path \
  --path "/${CLOUD_PROVIDER}/k8s/${APP_TYPE}/" \
  --query "Parameters[?Name==\`/do/k8s/${APP_TYPE}/${CI_ENVIRONMENT_NAME}\`].Value" \
  --with-decryption --output text | kubectl apply -f -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, configure your Deployment spec to include the value of the secret using the &lt;code&gt;valueFrom&lt;/code&gt; directive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: wordpress
    image: wordpress:5.3.2
    env:
    - name: WORDPRESS_DB_PASSWORD
      valueFrom:
         secretKeyRef:
           name: wp-secrets
           key: wordpress_db_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only thing you need to do now is run your CI Deployment, and your secrets will be available in Kubernetes! See, I told you it was simple! This is an effective way to deploy secrets into your environment while keeping them out of source code.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>tutorial</category>
      <category>aws</category>
    </item>
    <item>
      <title>Docker Hub: Automatically Building Images</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Thu, 23 Jan 2020 15:18:42 +0000</pubDate>
      <link>https://dev.to/hammotime/docker-hub-automatically-building-images-73h</link>
      <guid>https://dev.to/hammotime/docker-hub-automatically-building-images-73h</guid>
      <description>&lt;p&gt;&lt;em&gt;Ever wanted to build and distribute a tool on Docker Hub? Well, it's easy to automatically build an image. Let's get started!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Docker has become one of those ubiquitous technologies that follows us everywhere. I myself use Docker every single day without fail, building a lot of images on my machine for all sorts of utility tasks. However, what if we have a great idea and we want to share it with the world? That's where &lt;a href="https://hub.docker.com/"&gt;Docker Hub&lt;/a&gt; comes in. If we have a cool tool or idea we want to distribute using Docker, we only need to build a Dockerfile, connect our GitHub account to our Docker Hub account, and set up some configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before continuing, please make sure you have &lt;a href="https://github.com/join"&gt;GitHub&lt;/a&gt; and &lt;a href="https://hub.docker.com/signup"&gt;Docker Hub Accounts&lt;/a&gt;. If you don't, they will only take a few minutes to create. You will also need &lt;a href="https://www.docker.com/products/docker-desktop"&gt;Docker Desktop&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up our Source Repository
&lt;/h2&gt;

&lt;p&gt;Today, we're going to be setting up Docker Builds for my cool repository called "My Task List". This is a super simple app that displays my task list when I run it via Docker. If I want to update my task list, I just push an update to my GitHub repository and then voilà, my image will be updated on my next &lt;code&gt;docker pull&lt;/code&gt;.&lt;/p&gt;
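&lt;p&gt;For a static page like this, the Dockerfile can be tiny. As an illustrative sketch (the actual file in the repository may differ), something along these lines is enough to serve the task list on port 80:&lt;/p&gt;

```dockerfile
# Illustrative sketch only - the repository's actual Dockerfile may differ.
# Serve the static task list page with nginx on port 80.
FROM nginx:alpine
COPY my-task-list.html /usr/share/nginx/html/index.html
EXPOSE 80
```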

&lt;p&gt;To get started, go to &lt;a href="https://github.com/HammoTime/my-task-list"&gt;my-task-list&lt;/a&gt;. On the top right-hand corner, click the Fork button. A prompt will come up saying "Where should we fork my-task-list?". Click your GitHub username, and you will be taken to a new copy that has been created under your username!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--11XORT-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3t9wto2lfhttey4zlkvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--11XORT-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3t9wto2lfhttey4zlkvv.png" alt="Forked From"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting our GitHub Account to Docker Hub
&lt;/h2&gt;

&lt;p&gt;Open up &lt;a href="https://hub.docker.com/settings/linked-accounts"&gt;Docker Hub Linked Accounts&lt;/a&gt;. Click Connect on the GitHub line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8IIKspLK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qyllzrn43bx5ayj12cui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8IIKspLK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qyllzrn43bx5ayj12cui.png" alt="Docker Hub Linked Accounts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An authorisation page will come up requesting access to your GitHub account. Review the settings, and when you're happy to proceed, click &lt;strong&gt;Authorize docker&lt;/strong&gt;. Enter your GitHub password when prompted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z98Fvt1F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/aup8h6tgnuwdyza3a7d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z98Fvt1F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/aup8h6tgnuwdyza3a7d4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the authorisation process has been successful, you will be returned to Docker Hub. If everything has gone well, your Account name will be shown in the Linked Accounts section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1zq4wvS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/glphw5n22mdq3h3cnzaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1zq4wvS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/glphw5n22mdq3h3cnzaq.png" alt="Now Linked"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up our Docker Builds
&lt;/h2&gt;

&lt;p&gt;Sweet! Now we have a GitHub Repository and our Docker Hub account linked. Where to from here? We want to setup our own repository within Docker Hub; this is where all of our images will be deployed when they build. We configure these settings when we create the repository.&lt;/p&gt;

&lt;p&gt;On Docker Hub, click on &lt;strong&gt;Repositories&lt;/strong&gt;. You should see an empty list of repositories with your name in the right-hand corner of the page. Click &lt;em&gt;Create Repository&lt;/em&gt;. Give your repository a name of "my-task-list", and select &lt;em&gt;Private&lt;/em&gt; for the Visibility setting (this is our personal task list, after all).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ERhtmfWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/t8nto8ha1wszjla08opb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ERhtmfWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/t8nto8ha1wszjla08opb.png" alt="Create Repository"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've got all the settings for our repository added; now we just need to set up our builds. Click on the GitHub logo under &lt;strong&gt;Build Settings&lt;/strong&gt; (it should be labelled &lt;strong&gt;Connected&lt;/strong&gt;). From the &lt;strong&gt;Select organisation&lt;/strong&gt; dropdown, select your GitHub username, then select "my-task-list" from the &lt;strong&gt;Select repository&lt;/strong&gt; dropdown. Click the &lt;strong&gt;+&lt;/strong&gt; next to &lt;strong&gt;Build Rules&lt;/strong&gt;, leaving the defaults.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6X78diwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/igofxj0c8xrqef5qf9ms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6X78diwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/igofxj0c8xrqef5qf9ms.png" alt="Link Repo to Build"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it, let's click &lt;strong&gt;Create &amp;amp; Build&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building and Running our Image
&lt;/h2&gt;

&lt;p&gt;Now, we'll be taken to our repository's landing page. Well done, you've created a new Docker repository, and our image is building as we speak! We should see our new build on the &lt;strong&gt;Recent builds&lt;/strong&gt; list; click on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L3ceDUI8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4eg13z4a9bw8yf9bow3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L3ceDUI8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4eg13z4a9bw8yf9bow3u.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The build should be running and showing as "Pending". Wait until it completes and flags as "Successful". Once that is done, let's open up our Terminal and enter the following commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker-login --username &amp;lt;docker_hub_username&amp;gt; --password &amp;lt;docker_hub_password&amp;gt;
WARNING: login credentials saved in /home/username/.docker/config.json
Login Succeeded

$ docker run --name task-list -p 8080:80 &amp;lt;docker_hub_username&amp;gt;/my-task-list:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once that's done, open up a browser and go to &lt;code&gt;localhost:8080&lt;/code&gt;. You should see the current task list displayed. Behold, your first Docker image running in all its glory! Once you're done marvelling at your great work, hit &lt;code&gt;CTRL-C&lt;/code&gt; on the terminal to kill the container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0NO7gxqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/6uxm7m4dr2qi4wxq0by1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0NO7gxqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/6uxm7m4dr2qi4wxq0by1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Editing our Image
&lt;/h2&gt;

&lt;p&gt;Let's update our task list to add a new item. To do this, we need to update our GitHub repository. Go back to GitHub, and open up the main repository screen for your fork of &lt;code&gt;my-task-list&lt;/code&gt;. Click on the &lt;code&gt;my-task-list.html&lt;/code&gt; file and when it opens, click the &lt;strong&gt;Pencil&lt;/strong&gt; to edit the file. Add a new item to the list, and click &lt;strong&gt;Commit Changes&lt;/strong&gt; down the bottom of the editor.&lt;/p&gt;

&lt;p&gt;Let's head back to Docker Hub and view the repository. If our commit has been successful, we should see a new build pending which is tagged with a new commit hash. Wait for the build to complete like before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2yy3S1Y6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/pt4id6ri5q7lpsfsnxr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2yy3S1Y6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/pt4id6ri5q7lpsfsnxr0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now to test our final edits. Run the following on your local machine and then once it's running, browse to &lt;code&gt;localhost:8080&lt;/code&gt; again. If you're successful you should see the new item you added to the task list!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run --name task-list-v2 -p 8080:80 &amp;lt;docker_hub_username&amp;gt;/my-task-list:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jyk_rq2J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rijx5rvja28rj3a591zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jyk_rq2J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rijx5rvja28rj3a591zp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Writing and building a Docker image via CI/CD using Docker Hub and GitHub is a great way to build and distribute tools. As you can see, it's super easy to get up and running within 15 minutes, and if you're pushing public images then it's all &lt;em&gt;free&lt;/em&gt;. So, what are you waiting for? Write a Dockerfile and push a repo. What have you got to lose?&lt;/p&gt;

&lt;p&gt;If you'd like to contact me, please send me a message on Twitter &lt;a href="https://twitter.com/stophammotime"&gt;@stophammotime&lt;/a&gt;. Thanks for reading and thanks to &lt;a href="https://unsplash.com/@hojipago?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;EJ Yao&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/construction?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt; for the header picture! Please also check out my blog at &lt;a href="https://engi.fyi/"&gt;engi.fyi&lt;/a&gt; where I have a bunch of DevOps and Engineering related blog posts.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>tutorial</category>
      <category>devops</category>
      <category>github</category>
    </item>
    <item>
      <title>Let's Embark! Setting up Ingress (Part 2)</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Wed, 22 Jan 2020 14:16:52 +0000</pubDate>
      <link>https://dev.to/hammotime/let-s-embark-setting-up-ingress-part-2-4idn</link>
      <guid>https://dev.to/hammotime/let-s-embark-setting-up-ingress-part-2-4idn</guid>
      <description>&lt;p&gt;&lt;em&gt;In this article of the "Let's Embark!" series, we cover how to setup nginx-ingress using the offical nginxinc/nginx-ingress images and getting your cluster connected to the internet. To find out how to setup a Cluster on Digital Ocean, see &lt;a href="https://engi.fyi/lets-embark-setting-up-kubernetes-part-1/"&gt;Part 1&lt;/a&gt; and use my &lt;a href="https://m.do.co/c/f250f695871e"&gt;Referral Code&lt;/a&gt; for $100 credit to get you up and running.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Well, we have a cluster. What now? We need to get our Cluster connected to the internet so that it can receive connections to our services. By the end of this article, you will have &lt;code&gt;nginx-ingress&lt;/code&gt; set up and configured, and a demo app reachable from the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing nginx-ingress
&lt;/h2&gt;

&lt;p&gt;I prefer to use the official &lt;code&gt;nginx-ingress&lt;/code&gt; images from Nginx, Inc. We will be setting up our Ingress controllers as a DaemonSet, so a pod will run on each node within our Cluster. Let's clone the source repository for our configuration and get our basic configuration items set up on our Cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:nginxinc/kubernetes-ingress.git
cd kubernetes-ingress/deployments/
kubectl apply -f common/ns-and-sa.yaml
kubectl apply -f rbac/rbac.yaml
kubectl apply -f common/custom_resource_definitions.yaml
kubectl apply -f common/nginx-config.yaml
kubectl apply -f daemon-set/nginx-ingress.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Within about five minutes, you should see a DaemonSet pod on each active node. For the final part of our installation, we need to configure our domain name to point to our IP address, so run the following command, and create the value shown under &lt;code&gt;EXTERNAL-IP&lt;/code&gt; as an A record in your DNS settings. I would recommend using &lt;a href="https://www.digitalocean.com/docs/networking/dns/quickstart/"&gt;Digital Ocean's DNS Service&lt;/a&gt; for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get service nginx-ingress -n nginx-ingress
NAME           TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
nginx-ingress  LoadBalancer  10.245.114.32  1.1.1.1      80:31908/TCP   3d1h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Exposing Services
&lt;/h2&gt;

&lt;p&gt;On Kubernetes, when we talk about "Services" we are talking about the endpoint that gets exposed via a Cluster's external IP Address. There are three things that go into creating and running a service on Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment: this creates the template for the pods, including container and replica configuration which includes image, network, and metadata information.&lt;/li&gt;
&lt;li&gt;Service: this defines the port that the pods created by the Deployment (ReplicaSet, DaemonSet, etc.) will be exposed on in the cluster. Once you've created a Service, you can access it within the cluster at &lt;code&gt;&amp;lt;service_name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt; (e.g. &lt;code&gt;echo.echo.svc.cluster.local&lt;/code&gt; for the example below) on the port you have exposed in the Service configuration.&lt;/li&gt;
&lt;li&gt;Ingress: this defines the domain name that we expose the service as on the external IP. Ingress within Kubernetes generally uses &lt;a href="https://en.wikipedia.org/wiki/Server_Name_Indication"&gt;Server Name Indication (SNI)&lt;/a&gt;, which means that without a domain name, it will be impossible to get to your Service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For us to get our service up and running, we need to apply four manifests, which will set up everything we need and expose our service on our &lt;code&gt;nginx-ingress&lt;/code&gt;. Before continuing, you will need to have set up your domain name, as we will use it in the configuration below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f echo_namespace.yaml

# echo_namespace.yaml
kind: Namespace
apiVersion: v1
metadata:
  name: echo
  labels:
    name: echo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
$ kubectl apply -f echo_deployment.yaml

# echo_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: echo
spec:
  selector:
    matchLabels:
      app: echo
  replicas: 2
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - name: echo
        image: hashicorp/http-echo
        args:
        - "-text=Default HTTP Service"
        ports:
        - containerPort: 5678
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f echo_service.yaml

# echo_service.yaml
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: echo
spec:
  ports:
  - port: 80
    targetPort: 5678
  selector:
    app: echo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f echo_ingress.yaml

# echo_ingress.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: echo-ingress
  namespace: echo
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: &amp;lt;YOUR_DOMAIN_NAME&amp;gt;
    http:
      paths:
      - backend:
          serviceName: echo
          servicePort: 80
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once all of these templates have been applied to your cluster, your new service will be available at the domain name you specified. The way that &lt;code&gt;nginx-ingress&lt;/code&gt; knows to expose your service is through the annotation on the Ingress template: &lt;code&gt;kubernetes.io/ingress.class: nginx&lt;/code&gt;.&lt;/p&gt;
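&lt;p&gt;To confirm everything is wired up, you can curl your domain name directly (use the same domain you put in the Ingress template). Given the &lt;code&gt;-text&lt;/code&gt; argument we passed to &lt;code&gt;http-echo&lt;/code&gt;, the response should be:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl http://&amp;lt;YOUR_DOMAIN_NAME&amp;gt;
Default HTTP Service
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;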

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Okay, so we've successfully got our cluster up and running, and we've exposed a service to the internet. Next up, we have details on how to secure our service using TLS via Jetstack's &lt;code&gt;cert-manager&lt;/code&gt; and Let's Encrypt. For more information on the series, please visit the &lt;a href="https://engi.fyi/lets-embark-setting-up-kubernetes/"&gt;series page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Let's Embark! Setting up Kubernetes (Part 1)</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Tue, 21 Jan 2020 07:16:39 +0000</pubDate>
      <link>https://dev.to/hammotime/let-s-embark-setting-up-kubernetes-part-1-3l6j</link>
      <guid>https://dev.to/hammotime/let-s-embark-setting-up-kubernetes-part-1-3l6j</guid>
      <description>&lt;p&gt;&lt;em&gt;Ever wanted to build a Kubernetes Cluster? Well, you can. It's easy. Cross-posted from &lt;a href="https://engi.fyi/lets-embark-setting-up-kubernetes-part-1/"&gt;engi.fyi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I just went through the process of setting up my Kubernetes Cluster, and it was pretty easy the third time around! So, I thought I'd put up a series of tutorials that focus on getting a Kubernetes Cluster up and running with Ingress, Certificates, and a basic Service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up your Account
&lt;/h2&gt;

&lt;p&gt;First things first: head over to Digital Ocean using my &lt;a href="https://m.do.co/c/f250f695871e"&gt;Referral Code&lt;/a&gt;. I run my personal clusters on Digital Ocean; they are the cheapest and also one of the most feature-rich Hosted Kubernetes providers.&lt;/p&gt;

&lt;p&gt;On top of this, they also provide an excellent cluster base using the best-in-class Kubernetes networking solution &lt;a href="https://cilium.io"&gt;Cilium&lt;/a&gt;. They also provide a hosted dashboard that you can access via a link on the Digital Ocean Control Panel, and which doesn't need to be installed on your cluster itself (unlike other providers; I'm looking at you, AWS).&lt;/p&gt;

&lt;p&gt;If you don't want to test this on a real cluster, feel free to install &lt;a href="https://www.docker.com/products/docker-desktop"&gt;Docker Desktop&lt;/a&gt; and enable the Kubernetes feature which comes with a local cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up your Cluster
&lt;/h2&gt;

&lt;p&gt;Once you're logged in and ready to go, open the Kubernetes Control Panel. On the top-right corner of the screen is the &lt;strong&gt;Create button&lt;/strong&gt;. Click it, select &lt;strong&gt;Clusters&lt;/strong&gt;, then &lt;em&gt;Create Kubernetes Clusters&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RIm2GJLS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/u8zhrjih1mrdbnbveo7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RIm2GJLS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/u8zhrjih1mrdbnbveo7u.png" alt="Create Cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select your Region (I like to go with Singapore, as I'm in Australia), name your Node Pool, and select &lt;strong&gt;Standard Nodes&lt;/strong&gt; (2GB Memory / 1 vCPU) with &lt;strong&gt;1 node&lt;/strong&gt;. There is no need for any more nodes at the moment, but when we get to setting up nginx we will scale this up to demonstrate Daemon Sets.&lt;/p&gt;

&lt;p&gt;The cost for your cluster should come in at $10 a month. That means the credit you got from my referral link should last 10 months! Yay! Name and tag (optional) your cluster, then click &lt;strong&gt;Create Cluster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xdUF246R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/39f1arw8h6o2u1xiehge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xdUF246R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/39f1arw8h6o2u1xiehge.png" alt="Pending Cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing our Cluster
&lt;/h2&gt;

&lt;p&gt;While we're waiting for the cluster to provision, let's get our CLI access up. Click &lt;strong&gt;Download Config File&lt;/strong&gt;. After this is downloaded, fire up a terminal and install &lt;code&gt;kubectl&lt;/code&gt;. This can be done with either Chocolatey (&lt;code&gt;kubernetes-cli&lt;/code&gt;) on Windows or Homebrew (&lt;code&gt;kubectl&lt;/code&gt;) on macOS. Once this is installed, move your config file to &lt;code&gt;~/.kube/config&lt;/code&gt;.&lt;/p&gt;
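&lt;p&gt;For reference, the install and setup steps look something like this (the downloaded config filename is illustrative; yours will be named after your cluster):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Windows
&amp;gt; choco install kubernetes-cli

# macOS
$ brew install kubectl

# put the downloaded config file in place
$ mkdir -p ~/.kube
$ mv ~/Downloads/my-cluster-kubeconfig.yaml ~/.kube/config
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;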

&lt;p&gt;Once you're configured, run &lt;code&gt;kubectl get pods -n kube-system&lt;/code&gt;. If you've successfully installed &lt;code&gt;kubectl&lt;/code&gt; and put the configuration file in the right place, it should list a bunch of pods including &lt;code&gt;kube-proxy&lt;/code&gt;, &lt;code&gt;cilium&lt;/code&gt;, &lt;code&gt;do-node-agent&lt;/code&gt;, and &lt;code&gt;kube-state-metrics&lt;/code&gt;. If it doesn't work, your cluster is probably still being configured. Try again in a few minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;adam.hammond@adam-laptop k8s-config % kubectl get pods -n kube-system
NAME                                    READY   STATUS    RESTARTS  AGE
cilium-7vd9t                            1/1     Running   0         5m
cilium-operator-d5cd7d758-stsqw         1/1     Running   0         5m
coredns-84c79f5fb4-m9snd                1/1     Running   0         5m
csi-do-node-zc5w8                       2/2     Running   0         5m
do-node-agent-mhc88                     1/1     Running   0         5m
kube-proxy-4nm5b                        1/1     Running   0         5m
kube-state-metrics-7fd44b48b5-jgmz4     1/1     Running   0         5m
kubelet-rubber-stamp-7f966c6779-ztb5s   1/1     Running   0         5m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once your cluster is ready, a little green light will show in the Control Panel and the &lt;strong&gt;Kubernetes Dashboard&lt;/strong&gt; button should be available. Click it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ue9VfPFK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5zi5iwbbip7enyobtik3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ue9VfPFK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5zi5iwbbip7enyobtik3.png" alt="Cluster Ready"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your cluster is up-and-running, you should see three green circles and a bunch of green Daemon Sets. Congratulations, you're up and running with a Kubernetes Cluster. It's that easy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tew6K2NG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/74yd6wka34dkalgn8syb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tew6K2NG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/74yd6wka34dkalgn8syb.png" alt="Kubernetes Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Time
&lt;/h2&gt;

&lt;p&gt;Now that we've got a cluster up and running, we'll be getting into setting up our Ingress Controllers with plain ol' HTTP so we can access workloads!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Good Monitoring Answers Questions</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Thu, 16 Jan 2020 04:34:35 +0000</pubDate>
      <link>https://dev.to/hammotime/good-monitoring-answers-questions-1lkl</link>
      <guid>https://dev.to/hammotime/good-monitoring-answers-questions-1lkl</guid>
      <description>&lt;p&gt;I recently had to do a proposal for our Kubernetes monitoring solution. Kubernetes is a tricky beast, especially with Prometheus because there is &lt;em&gt;so much information&lt;/em&gt;. I was daunted by the amount of effort and understanding I'd need to begin on the task, and it took me a few days to actually open the ticket.&lt;/p&gt;

&lt;p&gt;After reading through a lot of articles on Kubernetes monitoring, I knew the answer to my question wasn't going to be a simple one. As I stared at the space above my monitor, pondering the question, a thought struck me. It was simple: what if I wrote down all the questions I needed my monitoring to answer, and &lt;em&gt;then&lt;/em&gt; gathered all the appropriate metrics? In every job I've ever had, the brief has always been "monitor x, y, and z", but never "tell me when this isn't working properly."&lt;/p&gt;

&lt;p&gt;I started off my question chain as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can my pods be scheduled?&lt;/li&gt;
&lt;li&gt;Are there nodes available to schedule pods?&lt;/li&gt;
&lt;li&gt;Are there resources available on my nodes?&lt;/li&gt;
&lt;li&gt;Are my pods running as expected?&lt;/li&gt;
&lt;/ol&gt;
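
<p></p>&lt;p&gt;To make this concrete, question 4 ("Are my pods running as expected?") can be answered with a Prometheus alerting rule along these lines. This is just a sketch: the metric comes from &lt;code&gt;kube-state-metrics&lt;/code&gt;, and the threshold and labels are illustrative.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
- name: pod-health
  rules:
  - alert: PodNotRunning
    # fires when a pod has been stuck in Pending or Unknown for 15 minutes
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"}) &amp;gt; 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not running"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;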

&lt;p&gt;If I can answer each of these questions in the affirmative, then I can generally expect that a service is running correctly. Likewise, if I ask questions for each layer (hardware, control plane, nodes, services, ingress, etc.) and the answers come back in the negative, I can trace the cascading dependencies on my compute, advising customers and mitigating an outage as much as possible.&lt;/p&gt;

&lt;p&gt;Actually getting the metrics to answer these questions was super easy. I just opened up the Kubernetes Dashboard and checked the places I would look if I were fixing the issues manually. Then I simply noted them down and created Prometheus alerts for them. It's that easy.&lt;/p&gt;

&lt;p&gt;Have a good process, and good results will follow.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>management</category>
      <category>productivity</category>
      <category>codequality</category>
    </item>
    <item>
      <title>DevOps: Culture vs. Tooling</title>
      <dc:creator>Adam Hammond</dc:creator>
      <pubDate>Mon, 09 Dec 2019 01:46:13 +0000</pubDate>
      <link>https://dev.to/hammotime/devops-culture-vs-tooling-1o35</link>
      <guid>https://dev.to/hammotime/devops-culture-vs-tooling-1o35</guid>
      <description>&lt;p&gt;One of the last questions asked in a DevOps interview is usually "so, what does DevOps mean to you?" I think this is a smart question, because DevOps is wildly misunderstood by the greater IT community. Some may answer that it's Continuous Integration and Releases, another may say it's having everything in Git, and the last might say that it's having tests available. All of these technical solutions do represent a key aspect of DevOps which is the tool chain, but it is the least important. Primarily, it is the least important because underlying the implementation of these tools is a make-or-break attitude into implementing them. For example, I may have a build but it might break or deployments may be manual. I may also have everything in git but I might only commit once a year. Or, I may have tests but all of them pass even if errors are thrown. As you can see, just because a team has these things, don't mean they are truly living the DevOps way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving Toxic Team Culture
&lt;/h2&gt;

&lt;p&gt;If you think that your team's culture is toxic or work is limited for some reason, there are ways you can go about improving it. Fundamentally, DevOps is about empowering individuals to do the work they need to do. There are a few immediate things you can do to get started on this journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change your KPIs from "tickets resolved" to "problems fixed": this will allow your team to distinguish between the "busy work" of resolving repeated failures, and the real work of &lt;strong&gt;actually&lt;/strong&gt; fixing the problem.&lt;/li&gt;
&lt;li&gt;Begin code reviews: code reviews are important, not only because they may prevent bugs slipping into your code, but because they also ensure that multiple members of your team understand and can work on your codebase. If code isn't reviewed, there is probably only a single person who understands it. The more of this code that is added to your codebase, the further you embed single points of failure into your business. If a team member leaves and their code breaks for some reason, it may take days or weeks to resolve a problem that will inevitably come along.&lt;/li&gt;
&lt;li&gt;Introduce a CI pipeline: if your team does work on production servers, this is an indication that your work environment is in a precarious position. Introducing a CI pipeline will force your team to standardise their deployment processes and ensure rigour is applied to deployments. This should also reduce rework, as failed production deployments should become a thing of the past.&lt;/li&gt;
&lt;li&gt;Introduce automated testing: if you have a CI pipeline, good work. Now that you've got processes in place to reduce production-impairing incidents, let's get started on introducing testing into the CI pipeline. Automated tests are great bang for buck, as they only need to be defined once and are an immediate indicator that buggy code has been committed to a branch.&lt;/li&gt;
&lt;/ol&gt;
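
&lt;p&gt;To make item 3 concrete, even a minimal pipeline definition is enough to get started. Here's a sketch using GitHub Actions; the build and test commands are placeholders, so substitute your project's real ones.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/ci.yml
name: ci
on: [push, pull_request]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    # replace with your project's real build and test commands
    - run: make build
    - run: make test
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;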

&lt;p&gt;These are just a few suggestions, but they at least give you an idea of the simple things you can do to start introducing your team to DevOps culture. The one important thing is that if you decide to actually implement one of these practices, you follow through. A half-implemented practice is worse than nothing, because it allows you to operate with a false sense of security.&lt;/p&gt;

&lt;p&gt;If you'd like to read more articles like this, please check out my blog &lt;a href="//engi.fyi"&gt;engi.fyi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>codequality</category>
      <category>management</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
