<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raoul Meyer</title>
    <description>The latest articles on DEV Community by Raoul Meyer (@raoulmeyer).</description>
    <link>https://dev.to/raoulmeyer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F128170%2Fd4994121-5d3d-4661-b794-aeaf63c2c818.jpeg</url>
      <title>DEV Community: Raoul Meyer</title>
      <link>https://dev.to/raoulmeyer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raoulmeyer"/>
    <language>en</language>
    <item>
      <title>Improving website performance with brotli</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Mon, 01 Feb 2021 21:09:48 +0000</pubDate>
      <link>https://dev.to/coolblue/improving-website-performance-with-brotli-5h70</link>
      <guid>https://dev.to/coolblue/improving-website-performance-with-brotli-5h70</guid>
      <description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Brotli"&gt;Brotli&lt;/a&gt; is a new compression algorithm developed by Google. Under the right circumstances, it manages to produce significantly smaller files than gzip can. Although gzip is currently the standard for web compression, brotli is a good candidate to take over this role, seeing widespread adoption in most modern browsers over the last couple of years.&lt;/p&gt;

&lt;p&gt;In this post we will explore the characteristics of brotli and what they mean for how you should use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  New trade-off
&lt;/h2&gt;

&lt;p&gt;Both gzip and brotli come with a quality setting, which determines how much effort is put into compressing files as efficiently as possible. Higher quality settings produce smaller files, but producing them takes more time and CPU. Gzip allows quality settings from 1 to 9; brotli goes further, up to level 11.&lt;/p&gt;

&lt;p&gt;With gzip, the first few levels give increasingly better results, but around level 5 or 6 the improvements slow down or stop completely. Depending on the contents of the file you're compressing, the compressed file size at level 9 will probably be at most 1% smaller than at level 6. Because of this, a gzip quality setting of 5 or 6 is the standard choice for most use cases.&lt;/p&gt;
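
&lt;p&gt;The diminishing returns are easy to see for yourself. Below is a minimal sketch using only Python's standard library gzip module; the sample data and levels are arbitrary, and the third-party brotli package exposes a similar quality knob going up to 11:&lt;/p&gt;

```python
# Compare gzip output sizes at a few quality levels. Repetitive sample
# data compresses well at every level, which is exactly why the higher
# levels add so little on top of level 6.
import gzip

data = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200

for level in (1, 6, 9):
    size = len(gzip.compress(data, compresslevel=level))
    print(f"gzip level {level}: {size} bytes")
```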

&lt;p&gt;Brotli's quality setting behaves differently compared to gzip. Even at high quality levels, brotli still shows significant improvements in compressed file size. That comes at a cost though: significantly longer compression times than gzip. &lt;a href="https://blog.cloudflare.com/results-experimenting-brotli/"&gt;In this experiment&lt;/a&gt; brotli at level 10 compresses about 100 times slower than gzip at level 6.&lt;/p&gt;

&lt;p&gt;This characteristic of brotli opens up more complex compression strategies that can improve client-side performance, bandwidth usage and/or (CDN) CPU usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Static content
&lt;/h2&gt;

&lt;p&gt;For static content, like JavaScript assets or statically generated HTML, higher levels of brotli compression can be very beneficial. Let's cover two ways in which you can serve static content with higher compression levels.&lt;/p&gt;

&lt;p&gt;You can pre-compress your files, meaning you compress them at the origin that your CDN or proxy fetches them from. The time and CPU spent compressing is paid entirely during deployment, so no client will notice if you use the highest level of brotli (level 11). &lt;a href="https://caniuse.com/?search=brotli"&gt;Not all browsers&lt;/a&gt; support brotli yet. Depending on your CDN/proxy, it can be difficult to set up a fallback to gzip for browsers that don't support brotli.&lt;/p&gt;
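
&lt;p&gt;As a rough illustration, pre-compression can be a simple build step that walks your asset directory and writes a compressed sibling next to each file. The sketch below uses Python's standard library gzip module; with the third-party brotli package you would write a parallel ".br" file at quality 11 in the same loop. The directory name and extensions are made up:&lt;/p&gt;

```python
# Build-step sketch: write a pre-compressed .gz copy next to each
# static asset, so the CDN/proxy can serve it without compressing
# on the fly.
import gzip
import pathlib

def precompress(directory, extensions=(".js", ".css", ".html")):
    for path in pathlib.Path(directory).rglob("*"):
        if path.suffix in extensions:
            compressed = gzip.compress(path.read_bytes(), compresslevel=9)
            path.with_name(path.name + ".gz").write_bytes(compressed)
```

&lt;p&gt;Run once after your asset build; the CDN then only has to pick the right pre-made file based on the Accept-Encoding header.&lt;/p&gt;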

&lt;p&gt;You can also choose to compress files on-the-fly, and then cache the compressed result. This way you can let your CDN/proxy handle compressing to a format that the client's browser supports. Depending on how long you cache your files, the cache hit rate and how big the files are, it can make sense to use brotli compression levels between 4 and 9. Levels 10 and 11 would produce smaller results, but because they are more than 10 times slower than level 9, there are few situations in which it makes sense to on-the-fly compress using brotli levels 10 and 11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic content
&lt;/h2&gt;

&lt;p&gt;Currently it is common to use gzip level 6 for dynamic content, as this offers a good trade-off between the time spent compressing every request and the improved transfer size. On small files, brotli can compress up to 30% more efficiently than gzip, but at the cost of being about 2-3 times slower. For dynamic content, that means there is not much to gain from brotli on small files. However, for large dynamic content (bigger than 64 KB), brotli offers three levels that can beat gzip in different ways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Brotli level&lt;/th&gt;
&lt;th&gt;Size (vs. gzip 6)&lt;/th&gt;
&lt;th&gt;Time (vs. gzip 6)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1% bigger&lt;/td&gt;
&lt;td&gt;42% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1% smaller&lt;/td&gt;
&lt;td&gt;2% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;6% smaller&lt;/td&gt;
&lt;td&gt;43% slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Level 3: If you care about reducing compression time, brotli level 3 provides 1% bigger but 42% faster compression compared to gzip level 6.&lt;/li&gt;
&lt;li&gt;Level 4: If you want to improve transfer size, brotli level 4 produces files that are smaller or similar to gzip level 9, and slightly faster than gzip level 6.&lt;/li&gt;
&lt;li&gt;Level 5: If you can live with longer compression times (for example because you are transferring really big files), brotli 5 produces results 6% smaller but 43% slower than gzip level 6.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For dynamic content, you will likely want to pick one of these levels to replace gzip compression for big files. Brotli level 4 looks like a good drop-in replacement for gzip level 6, yielding slightly smaller files at comparable speed.&lt;/p&gt;
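
&lt;p&gt;Put together, a dynamic-content policy could look something like the hypothetical helper below: small payloads and clients without brotli support keep gzip level 6, while large payloads switch to brotli level 4. The 64 KB threshold follows the discussion above; the function and its return format are made up for illustration:&lt;/p&gt;

```python
# Hypothetical policy sketch: pick an (encoding, level) pair per response.
def pick_compression(size_bytes, accepts_brotli):
    LARGE = 64 * 1024
    if not accepts_brotli:
        return ("gzip", 6)
    if min(size_bytes, LARGE) == size_bytes:  # payload is at most 64 KB
        return ("gzip", 6)  # small payload: little to gain from brotli
    return ("br", 4)  # big payload: drop-in upgrade over gzip level 6
```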

&lt;h2&gt;
  
  
  Trying it out
&lt;/h2&gt;

&lt;p&gt;A nice tool I found for trying brotli out is this &lt;a href="https://tools.paulcalvano.com/compression.php"&gt;gzip and brotli compression level estimator&lt;/a&gt;. You can give it any URL and it will compress that file at every possible configuration and give you a table comparing the results. Go ahead and give it a try!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
    </item>
    <item>
      <title>Lessons from a first attempt at Chaos Engineering</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Mon, 28 Sep 2020 08:41:47 +0000</pubDate>
      <link>https://dev.to/coolblue/lessons-from-a-first-attempt-at-chaos-engineering-8ie</link>
      <guid>https://dev.to/coolblue/lessons-from-a-first-attempt-at-chaos-engineering-8ie</guid>
      <description>&lt;p&gt;A year ago, I got really interested in the idea of chaos engineering. I had read a couple of blog posts and I was ready to get started with breaking things in production. In a controlled way of course. To get started, I created a small application that would get us started with some basic chaos experiments. In this post, I want to share some things I learned while taking our first steps in chaos engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Chaos Engineering?
&lt;/h2&gt;

&lt;p&gt;Chaos engineering is the practice of understanding and improving the resiliency of your systems through experimentation. First you create a hypothesis for how your system will behave under some adverse condition. Then you verify whether the system behaves as you expected. For example, you could cause a network problem between the reader and writer nodes of your database cluster, and then verify that no requests to the application using this database fail.&lt;/p&gt;
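
&lt;p&gt;The experiment loop described above can be sketched in a few lines. Everything here is a hypothetical placeholder: you would plug in your own failure injection (for example, dropping traffic between database nodes) and your own verification (for example, checking an error-rate metric):&lt;/p&gt;

```python
# Sketch of a single chaos experiment: state the hypothesis, inject
# the failure, verify the behaviour, and always roll the failure back.
def run_experiment(hypothesis, inject_failure, verify, rollback):
    print(f"Hypothesis: {hypothesis}")
    inject_failure()
    try:
        confirmed = verify()
    finally:
        rollback()  # undo the injected failure no matter what
    print("confirmed" if confirmed else "refuted: time to investigate")
    return confirmed
```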

&lt;h2&gt;
  
  
  Why is this type of experimenting useful?
&lt;/h2&gt;

&lt;p&gt;Because it is really hard to understand all the components of your systems and how exactly they work together. Even if you think you have a good grasp on your system as a whole, each component has more details than one person can know. Although we can easily think in abstractions while developing our application, in production everything is very concrete and our assumptions are tested whether we want them to be or not.&lt;/p&gt;

&lt;p&gt;Let's go over the lessons we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make sure you have good enough observability
&lt;/h2&gt;

&lt;p&gt;There is no point in starting with chaos engineering if you don't have the right level of observability of your systems. It is the process of investigating why your system didn't do what you hypothesized that will make you understand your system's behavior better. Without proper logs and metrics, that is going to be hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Make it realistic
&lt;/h2&gt;

&lt;p&gt;To get started, you need to decide what you are going to experiment with. For us, running our workloads in AWS, it is relatively easy to test certain scenarios. Rebooting or terminating an EC2 instance is a straightforward action. It is very tempting to list everything that you can easily do, and to create experiments for those. So that is what I did when I introduced our own chaos creator.&lt;/p&gt;

&lt;p&gt;We ended up with a tool that was performing server-maintenance-type tasks on a schedule. Few of the reboots and failovers it did would ever happen out in the wild. It's easy to make things break; it's harder to make things break in a realistic way, even though that is where most of the learning happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Manual first, automation later
&lt;/h2&gt;

&lt;p&gt;We care very much about completely automating repetitive tasks at Coolblue. When I started working on these experiments, I was convinced by the literature that mentioned the importance of automatically running your experiments. In practice, this didn't work out as well as I would have liked.&lt;/p&gt;

&lt;p&gt;My mistake was automating the experiments immediately. Defining and creating experiments is a process of exploration. By automating right from the start, you are slowing down the rate at which you can try new things. I quickly settled on a small number of experiments and kept rerunning those daily or weekly. But this defeated the purpose of the whole exercise, which is to learn about the behavior of your systems.&lt;/p&gt;

&lt;p&gt;As soon as you learn something, you want to cement that knowledge. This is where automation becomes very powerful. It allows you to regularly rerun experiments, so you can check if they still have the same outcome. This allows you to act on any regressions, in your application or in your understanding of your application.&lt;/p&gt;

&lt;p&gt;For companies just starting out with chaos engineering, automation is a good way to detect regressions, but not a great way to start. In the end, it's 20% about breaking things in production, but 80% about learning about the behavior of your system in production. And that learning you cannot automate.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Pugmark - Online Book club For Developers</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Sun, 05 Jul 2020 07:11:37 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/pugmark-online-book-club-for-developers-31pn</link>
      <guid>https://dev.to/raoulmeyer/pugmark-online-book-club-for-developers-31pn</guid>
      <description>&lt;p&gt;A book club seems like something from the last century. You're meeting up with a group of people, who does that anyway nowadays? You've managed to read through the whole book in the month that you had to do that. Scribbling down some notes along the way, carefully highlighting sentences because they resonated with you. It's been a tedious and lonely process until now.&lt;/p&gt;

&lt;p&gt;When you finally meet up, it gives you a feeling of belonging, knowing there are others who share your interests and passions. Finding out someone has also read your favorite book is a great conversation starter. Those conversations are more often than not full of new insights.&lt;/p&gt;

&lt;p&gt;A year ago, I mentioned the idea of an online book club to my friend Maxi. We both like reading books as an educational tool. For me, starting to read books has been eye-opening. We often discussed the books we read with each other. Those were the times I discovered the point of the book, the bigger lesson it was trying to teach me.&lt;/p&gt;

&lt;p&gt;Now, one year later, we think with pugmark we have built something that helps us in having that discussion. Having used it myself for a couple of books now, I notice how much more I remember of what I read. With pugmark, we want to give you a way to find those like-minded readers. It's the async, remote version of a book club. We've tried to provide two tools to help you really absorb what you are reading: &lt;strong&gt;structured reading&lt;/strong&gt; and &lt;strong&gt;async discussion&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured reading
&lt;/h3&gt;

&lt;p&gt;First, in pugmark you will be presented with a per-chapter overview of each book. You can create notes for yourself, and rate each chapter individually. You'll get an overview of your progress in reading the book.&lt;/p&gt;

&lt;p&gt;We also want to help you in building the habit of reading regularly. This is a good way to make sure you keep on growing your knowledge. You can set up reminders on whatever schedule works for you, and we'll let you know that it's time for the next chapter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6brd94r3vue4hnqg9kig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6brd94r3vue4hnqg9kig.png" alt="Screenshot of book progress overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Async discussion
&lt;/h3&gt;

&lt;p&gt;Any notes you make, you can share. You can use this to summarise what you learned, or to ask questions to other readers. When you share a note, your fellow readers can learn from your perspective, and they can add their perspective as well. It's like a mini blog, about a single chapter of a book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb8cetre6uxo37lybrya1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb8cetre6uxo37lybrya1.png" alt="Screenshot of discussion about book"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we can go
&lt;/h2&gt;

&lt;p&gt;We have worked on this for a year now, but we still have many ideas on how to make it better. We'd love for you to give it a try and let us know what you think. &lt;/p&gt;

&lt;p&gt;If you're someone who always tells themselves they would like to read more, please give pugmark a shot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pugmark.io" rel="noopener noreferrer"&gt;pugmark.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
    </item>
    <item>
      <title>Deliberate practice: The art of teaching and learning</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Sun, 19 Apr 2020 11:27:39 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/deliberate-practice-the-art-of-teaching-and-learning-14g6</link>
      <guid>https://dev.to/raoulmeyer/deliberate-practice-the-art-of-teaching-and-learning-14g6</guid>
      <description>&lt;p&gt;Gaining knowledge about the tools, code, applications you're working with is a natural process. Just by doing your job you will improve your development skills, while building understanding of the codebase you're working on. This takes time.&lt;/p&gt;

&lt;p&gt;My team is a group of experienced and amazing developers, but most of them have only quite recently joined the team. It takes time for any new-joiner to get a feeling for what our application landscape looks like and how applications cooperate.&lt;/p&gt;

&lt;p&gt;I wanted to find a way to efficiently share knowledge within my team, to shorten the time it takes for people to get the bigger picture of what we are working on. I noticed we were sharing knowledge a lot already, through regular presentations, pair programming, pull requests and many other ways. This was great, but it was hard to make sure we addressed specific knowledge gaps this way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;What I introduced was a more deliberate moment of learning. We would get our hands dirty instead of watching presentations. The topics were focused on the areas that needed the most attention. Everyone was pushed at least a little bit out of their comfort zone.&lt;/p&gt;

&lt;p&gt;We did some small and fun exercises. Some examples of exercises we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List all applications we are responsible for off the top of your head. Pick one application that you haven't worked with yet, and give a one-minute presentation next time about how it works.&lt;/li&gt;
&lt;li&gt;Last week we saw a small outage, but the cause is still undetermined. Find the cause.&lt;/li&gt;
&lt;li&gt;Draw all components that are involved in handling a single request from a customer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These were specifically aimed at some of the knowledge gaps we had. Let's go through the things that I learned while facilitating these exercises.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make it a challenge
&lt;/h2&gt;

&lt;p&gt;It can be hard to really challenge yourself during day-to-day work. You don't want to pick a task to work on and get stuck, you'd rather pick something you know you can do. In general, day-to-day work doesn't feel like a learning opportunity by itself.&lt;/p&gt;

&lt;p&gt;But when you create a specific moment for learning, a challenge is good. I tried to design challenges that would push everyone a little bit, outside of the things they normally do and work with. Without the pressure of failing, people are willing to go pretty far outside of their comfort zone.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Create pairs
&lt;/h2&gt;

&lt;p&gt;With the challenges I gave my team, a lot of people felt lost. They didn't know where to look or what to do. If they were on their own, they would have been stuck. I learned how important and valuable it is in these cases to create teams.&lt;/p&gt;

&lt;p&gt;For most challenges, I found that teams of two are perfect. It is amazing to see how much two people together can build on each other's knowledge. Bigger teams often meant the person most out of their depth was left out of discussions. Two really is the ideal number in most cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Keep challenges short
&lt;/h2&gt;

&lt;p&gt;I try to limit the time we spend on these exercises to half an hour. This keeps the investment of participating pretty low. Also, for some challenges, the added time pressure can help in making it more realistic and sometimes even more fun. For example, in one case we tried to debug a production issue that had happened earlier that week in half an hour. This gave everyone the thrill of having to debug something fast, but without anything on the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Repeat regularly
&lt;/h2&gt;

&lt;p&gt;We do this weekly, which means we can repeat certain topics as well, to refresh what we've learned. Compared to a training of a day or more, this gives way more room to actually process what you learn. In most cases there are only one or two takeaways per week, which gives everyone time to investigate those a little bit more.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Pay attention to what people struggle with
&lt;/h2&gt;

&lt;p&gt;One of the most interesting things I found is that these challenges made it way easier to notice which specific knowledge was missing. While being thrown in at the deep end, people can more easily describe what makes them feel out of their comfort zone. And because of the more controlled nature of the challenges, there is more room for people to admit they don't know. There's no important feature on the line. There are no customers seeing a broken application.&lt;/p&gt;

&lt;p&gt;Also, because I made everyone learn about the same thing at the same time, it was easier to see patterns of missing knowledge. This was great input for future sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While doing this I was reminded a lot of school. I guess those people were onto something. Short, focused and regularly repeated sessions are the norm in school. We tell ourselves that as adults we can concentrate for longer, that we can remember what was explained in a couple hours of presentation. But we're not that different from our younger selves.&lt;/p&gt;

&lt;p&gt;I would love to know more about how you learn and teach. Please leave a comment with your thoughts!&lt;/p&gt;

</description>
      <category>learning</category>
      <category>tips</category>
    </item>
    <item>
      <title>Our approach to dealing with technical debt</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Tue, 11 Feb 2020 18:56:32 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/our-approach-to-dealing-with-technical-debt-1pd8</link>
      <guid>https://dev.to/raoulmeyer/our-approach-to-dealing-with-technical-debt-1pd8</guid>
      <description>&lt;p&gt;Technical debt, when not handled correctly, can have a big impact on the ability to deliver features. Through the years, my team has tried out a lot of different ways of dealing with technical debt. The following 5 findings are the things that I think helped us the most in dealing with the right technical debt at the right time.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pick up what you can immediately
&lt;/h2&gt;

&lt;p&gt;When I joined my team, the boy scout/girl scout rule was explained to me: when you're working on some part of the codebase, you should always leave that part in a better state than you found it. It's like cleaning up after yourself, and then some. This way, you can gradually make small improvements that add up over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Create a backlog
&lt;/h2&gt;

&lt;p&gt;When you apply the boyscout/girlscout rule, you will often find improvements that are so big that they would take longer than the feature itself. For those improvement opportunities, we have a separate backlog. This backlog is maintained by us as developers, but visible to non-developers. A great way to make this backlog more tangible is by creating a &lt;a href="http://verraes.net/2020/01/wall-of-technical-debt/"&gt;Wall of technical debt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You'll notice that as you discover more technical debt, your backlog will slowly grow. Especially with a physical backlog, there is clear feedback when your backlog gets too big.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prioritize your tech debt backlog
&lt;/h2&gt;

&lt;p&gt;Not all technical debt is equal. Some of it prevents you from adding a nice feature easily. Some may wake you up at night because something broke. Some technical debt is just there, where you don't really notice it that much.&lt;/p&gt;

&lt;p&gt;It is important to make sure that the technical debt you're solving helps in achieving business results. We prioritize our technical debt stories once every sprint. When we do, we look at what we are going to work on in the upcoming sprints. We also look at which technical debt stories have been open the longest or might be creating a high operational load on our team. We then choose which stories will provide the most value to pick up right now.&lt;/p&gt;

&lt;p&gt;I've noticed also how important it is to prioritize together. Before we did this, we would all be working on improving something, but often on different parts. By discussing openly why we thought some improvement was important, we got everybody on the same page. And we improved the prioritization itself as well, because we could use the knowledge of the whole team to make these decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Handle tech debt stories as normal stories
&lt;/h2&gt;

&lt;p&gt;For the longest time, we would handle technical debt as something special. Something that you reserve a Friday afternoon for. Something that you do off the books. This special treatment has a couple of bad side-effects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It isn't clear how much time is going towards improvement efforts. When we don't make the sprint, is it because we spent too much time on technical debt?&lt;/li&gt;
&lt;li&gt;To prevent too much time going to improvement efforts, we would schedule some time in our agendas. This often resulted in half-finished improvements, and quite some time was wasted picking things up again where you had left them a week later.&lt;/li&gt;
&lt;li&gt;Improvements would not be tracked on our scrum board. This makes it hard to know who is working on what, if they might be blocked or need help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimating our technical debt stories the same way we do normal stories has helped in scoping what we want to improve. It also allows us to take the time needed to finish whatever we put in scope. And we can track progress the way we do all other stories, which allows detecting any blockers or problems early.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Spend time regularly on technical debt
&lt;/h2&gt;

&lt;p&gt;Picking up technical debt should be a regular thing. This way, you'll never have to completely stop delivering value. To make this work, it is important to break down big improvement efforts into smaller parts. In a lot of cases, we have first investigated or made proof of concepts to get a better understanding of what to do.&lt;/p&gt;

&lt;p&gt;Google famously dedicates one day per week in which engineers can work on anything they want. We've noticed that by consistently spending 20-30% of our time on improvement efforts, the remaining 70-80% of our time has gotten more productive. This only really works if you apply all the previous points: without an open technical debt backlog, prioritized on business value, it is hard to build the trust that these improvement efforts are worth the time invested.&lt;/p&gt;

</description>
      <category>technicaldebt</category>
      <category>webdev</category>
      <category>agile</category>
    </item>
    <item>
      <title>7 Site Reliability lessons from Google and Amazon</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Thu, 30 Jan 2020 21:25:43 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/7-site-reliability-lessons-from-google-and-amazon-520a</link>
      <guid>https://dev.to/raoulmeyer/7-site-reliability-lessons-from-google-and-amazon-520a</guid>
      <description>&lt;p&gt;Companies like Google and Amazon share a lot of great content about their approach to certain technical problems. At re:Invent this year, Amazon announced the &lt;a href="https://aws.amazon.com/builders-library/"&gt;Amazon Builders' Library&lt;/a&gt;. This is a collection of articles that discuss the approach Amazon takes in their architecture and software delivery processes. Similarly, Google shared a great collection of lessons in Site Reliability Engineering in their &lt;a href="https://landing.google.com/sre/sre-book/toc/"&gt;free SRE book&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we'll go over 7 site reliability lessons we can learn from these two great resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make alerts actionable
&lt;/h2&gt;

&lt;p&gt;Good monitoring is a fine art. &lt;a href="https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/"&gt;This chapter&lt;/a&gt; from the Google SRE book is a single stop for everything you need to effectively monitor your systems. It goes over the why, the what and the how of monitoring. A snippet from this chapter is actually in the pull request template of our monitoring repository. It has helped me multiple times to think again about how I wanted to monitor something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When creating rules for monitoring and alerting, asking the following questions can help you avoid false positives and pager burnout:

- Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
- Will I ever be able to ignore this alert, knowing it’s benign? When and why will I be able to ignore this alert, and how can I avoid 
  this scenario?
- Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren’t being 
  negatively impacted, such as drained traffic or test deployments, that should be filtered out?
- Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely 
  automated? Will that action be a long-term fix, or just a short-term workaround?
- Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Prevent alert fatigue
&lt;/h2&gt;

&lt;p&gt;Dealing with a high volume of alerts is really taxing. Next to the &lt;a href="https://landing.google.com/sre/sre-book/chapters/dealing-with-interrupts/"&gt;context switching&lt;/a&gt;, every alert is a new and potentially stressful situation that you have to evaluate. It's very common to start assuming the impact or cause of an alert, or to start ignoring alerts that trigger often. The &lt;a href="https://landing.google.com/sre/sre-book/chapters/being-on-call/"&gt;Google SRE book chapter&lt;/a&gt; gives a very concrete notion of what counts as too much:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’ve found that on average, dealing with the tasks involved in an on-call incident—root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs—takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Based on that, they state that most days should have zero incidents.&lt;/p&gt;

&lt;p&gt;I've found that it's a really good exercise to go through and analyze all alerts of the last month. When you do, you'll notice very quickly which alerts trigger often. You might also see patterns that are less obvious in the moment, for example certain alerts always triggering on Wednesday morning. Doing this, and then checking the alerts that triggered against the checklist above, will help improve the quality of your alerts.&lt;/p&gt;
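
&lt;p&gt;This kind of review is easy to script. A minimal sketch, assuming you can export alert names with timestamps (the alert log here is made up), that counts alerts per weekday and hour so recurring slots stand out:&lt;/p&gt;

```python
# Group a month of alerts by (name, weekday, hour) to surface
# patterns such as an alert that always fires on Wednesday morning.
from collections import Counter
from datetime import datetime

alerts = [
    ("HighLatency", datetime(2020, 1, 8, 9, 30)),
    ("HighLatency", datetime(2020, 1, 15, 9, 45)),
    ("DiskFull", datetime(2020, 1, 20, 2, 10)),
]

by_slot = Counter((name, ts.strftime("%A"), ts.hour) for name, ts in alerts)
for (name, day, hour), count in by_slot.most_common():
    print(f"{count}x {name} on {day} around {hour}:00")
```

&lt;p&gt;Feeding a real month of alert history through this immediately shows which alerts dominate your on-call load.&lt;/p&gt;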

&lt;h2&gt;
  
  
  3. Simulate outages
&lt;/h2&gt;

&lt;p&gt;I really like the idea of simulating outages as a way to practice debugging. It's a safe way to learn more about how your systems work. It can also work as a tool to better understand previous outages and what can be done to make it harder for things to break the same way. Google &lt;a href="https://landing.google.com/sre/sre-book/chapters/accelerating-sre-on-call/#xref_training_disaster-rpg"&gt;does Disaster Role Playing&lt;/a&gt; as an onboarding tool, as a way to share knowledge between the different experience levels within their SRE group, and as a fun exercise.&lt;/p&gt;

&lt;p&gt;Next to this, instead of role playing, you can also inject failures into your actual system. This practice, when combined with a hypothesis of what impact the failure is going to have, is called chaos engineering. There's plenty of tooling available nowadays that allows you to make your application fail in any way you can think of. If you want to get started or know more about how to do chaos engineering right, I highly recommend reading &lt;a href="https://medium.com/@adhorn/the-chaos-engineering-collection-5e188d6a90e2"&gt;these blog posts by Adrian Hornsby&lt;/a&gt;.&lt;/p&gt;
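&lt;p&gt;To give a flavor of what failure injection can look like, here is a minimal, hypothetical sketch (not based on any particular chaos engineering tool) that makes a configurable fraction of calls fail, so you can verify the caller degrades gracefully:&lt;/p&gt;

```python
import random

def chaos_wrap(fn, failure_rate=0.1, seed=None):
    """Make a fraction of calls to fn fail with a ConnectionError,
    so the caller's resilience can be tested."""
    rng = random.Random(seed)
    def wrapped(*args, **kwargs):
        if rng.random() >= failure_rate:
            return fn(*args, **kwargs)
        raise ConnectionError("injected failure")
    return wrapped
```

&lt;p&gt;Wrapping, say, a database call with this and asserting that your page still renders is exactly the kind of hypothesis-driven experiment chaos engineering is about.&lt;/p&gt;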

&lt;h2&gt;
  
  
  4. Prevent changes on failure
&lt;/h2&gt;

&lt;p&gt;It is tempting to try and make your systems "self-healing". However, in most cases, when a failure is ongoing, automated changes make it harder to understand what is going on. In a lot of cases, the automated change may actually make things worse. A common example is the redistributing or resharding of data when a node in a cluster drops out. The data transfer and load this causes might have a bigger negative impact than the node not being there. When thinking about redundancy of infrastructure, make sure to also think about what needs to happen in case the redundant infrastructure fails. Although this is not always possible, ideally nothing has to happen.&lt;/p&gt;

&lt;p&gt;I went to a couple of talks at AWS re:Invent in which Amazon engineers described how they architect systems to improve reliability. There, a related idea was often referenced under the name &lt;code&gt;Static stability&lt;/code&gt;. AWS has published &lt;a href="https://aws.amazon.com/builders-library/static-stability-using-availability-zones/"&gt;a nice article&lt;/a&gt; in the Amazon Builders' library in which they explain how they apply static stability to EC2 and other AWS services.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Prevent cascading failures
&lt;/h2&gt;

&lt;p&gt;When an application fails, this shouldn't bring other applications down too. There are several ways you can prevent cascading failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use circuit breakers. That is: stop calling a service when it appears to be failing. This way, you won't overload the services you depend on with requests.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/#Retries_and_backoff"&gt;smart retry strategies&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Add &lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/#Timeouts"&gt;timeouts&lt;/a&gt; to requests. It is common for a failure to be caused by an application being overloaded with requests. This causes slow requests. Without timeouts, all services that depend on this application will also slow down.&lt;/li&gt;
&lt;/ul&gt;
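&lt;p&gt;As an illustration of the first point, the core of a circuit breaker fits in a few lines. This is a simplified, hypothetical sketch; production libraries add per-endpoint state, better half-open probing and more:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.reset_after:
                # Half-open: allow one trial call through after the cool-down.
                self.opened_at = None
            else:
                raise RuntimeError("circuit open, not calling dependency")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

&lt;p&gt;While the circuit is open, callers fail immediately instead of piling more requests onto a struggling dependency.&lt;/p&gt;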

&lt;h2&gt;
  
  
  6. Shard your infrastructure
&lt;/h2&gt;

&lt;p&gt;Shuffle sharding is another really interesting way to reduce the impact of failure. In most examples, it is used as a way to protect against malicious clients of an application. As a start, you can shard your clients into a couple of groups. For each shard, you have separate infrastructure running your application. Now, when a malicious client affects your application, only the clients assigned to that same shard will see the effects. This can greatly reduce the impact a single malicious client has.&lt;/p&gt;

&lt;p&gt;Taking this one step further, you can put each client in two shards instead of just one. Now, a malicious client can bring down two shards. But the chance of other clients being assigned to exactly the same two shards is pretty small. It is likely that clients will see one of their shards fail, but a big group of clients can fall back on a shard that's still working.&lt;/p&gt;
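&lt;p&gt;The assignment itself can be as simple as hashing the client identifier. A hypothetical sketch with 8 shards:&lt;/p&gt;

```python
import hashlib

def shard_pair(client_id, num_shards=8):
    """Deterministically assign a client to two distinct shards."""
    digest = hashlib.sha256(client_id.encode()).digest()
    first = digest[0] % num_shards
    second = digest[1] % (num_shards - 1)
    if second >= first:
        second += 1  # skip `first` so the two shards are always distinct
    return frozenset((first, second))
```

&lt;p&gt;With 8 shards there are 28 possible pairs, so the chance that another client lands on exactly the same two shards is only 1 in 28, and it shrinks quickly as you add shards.&lt;/p&gt;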

&lt;p&gt;If none of this made sense, be sure to &lt;a href="https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/"&gt;read this blog post&lt;/a&gt; from the Amazon Builders' library. It does an excellent job of explaining and visualizing how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Accept stale data
&lt;/h2&gt;

&lt;p&gt;In most cases, it's better to show stale data than no data. You don't want to mask failures to yourself, but maybe you do want to mask them to your customers. An obvious place to apply this is wherever you're currently caching results. You can either invert the dynamic and have the service you're calling push results to you, or you can make the caching a bit smarter, storing (separate) entries for a bit longer so you can fall back to those in case of failure.&lt;/p&gt;
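&lt;p&gt;The smarter-caching variant can be sketched as a cache that keeps expired entries around and only serves them when refreshing fails (a hypothetical example, not any particular caching library):&lt;/p&gt;

```python
import time

class StaleOnErrorCache:
    """Cache that serves expired entries when the upstream call fails."""

    def __init__(self, fetch, ttl=60.0):
        self.fetch = fetch      # function that loads fresh data, may raise
        self.ttl = ttl
        self.store = {}         # key -> (value, stored_at)

    def get(self, key):
        entry = self.store.get(key)
        expired = entry is None or time.monotonic() - entry[1] >= self.ttl
        if not expired:
            return entry[0]
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # fall back to the stale value instead of failing
            raise
        self.store[key] = (value, time.monotonic())
        return value
```

&lt;p&gt;Customers keep seeing (slightly outdated) results, while your monitoring can still alert on the failing &lt;code&gt;fetch&lt;/code&gt; calls.&lt;/p&gt;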

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully these tips have helped you think about how you can make your systems more reliable. There are tons of really good resources when it comes to reliability. I shared a couple in this post already; be sure to check those out!&lt;/p&gt;

&lt;p&gt;I would love to hear your thoughts on all of this!&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Book review: Accelerate - The comprehensive DevOps guide</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Sat, 04 Jan 2020 10:12:29 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/book-review-accelerate-the-comprehensive-devops-guide-8h1</link>
      <guid>https://dev.to/raoulmeyer/book-review-accelerate-the-comprehensive-devops-guide-8h1</guid>
      <description>&lt;p&gt;Over all the books and blog posts about DevOps I've read, I've seen most of them reference the book &lt;a href="https://amzn.to/2rPGkaa"&gt;Accelerate by Nicole Forsgren, Jez Humble, Gene Kim&lt;/a&gt;. I now understand why. The book presents a comprehensive overview of important DevOps practices. This by itself is super helpful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Research
&lt;/h2&gt;

&lt;p&gt;More importantly, it delivers a research-based motivation for following these practices. The authors have done extensive and convincing research, which forms the basis for the book. They have managed to link things like using version control to the commercial performance of organizations. Covering all important practices related to DevOps, they've found several interesting ways in which these practices impact companies and people positively.&lt;/p&gt;

&lt;p&gt;Every time I had a skeptical thought about something presented in the book, it caught me off guard by addressing that skepticism directly. The book contains a lot of chapters focused on convincing readers of the validity of the research. They convinced me more than enough, so I could focus on the actual findings and knew how to interpret them.&lt;/p&gt;

&lt;p&gt;Software has a significant impact on the commercial and non-commercial success of most companies. This has been said before, and for us as developers it feels very true. The research backs this up by correlating several software delivery factors with company performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous delivery
&lt;/h2&gt;

&lt;p&gt;One of the most interesting findings for me was the impact certain software delivery practices can have on developers themselves. For example, developers are less likely to burn out and more likely to be satisfied with their jobs if continuous delivery practices are applied in their company.&lt;/p&gt;

&lt;p&gt;It's easy to explain what the data is showing here. Release deadlines, crunch time, big bang releases and deployment freezes are all things that cause a lot of stress for developers. Often these go together with committees and heavy processes that add even more frustration. It seems obvious that such an environment will burn out developers. When deployments become a regular event by applying continuous delivery practices, stress levels drop.&lt;/p&gt;

&lt;p&gt;Next to the impact continuous delivery has on developers, it also has a positive impact on several metrics that are essential to the success of a system. Continuous delivery goes hand in hand with a reduced change failure rate. Smaller changes and automated deployments are likely reasons for this improvement.&lt;/p&gt;

&lt;p&gt;Surprisingly, the amount of time spent on planned new work (versus unplanned work/rework) is also correlated with applying continuous delivery. Companies that apply continuous delivery on average spend more time on planned new work. More time spent on planned work in most cases means more time spent on those things that improve business outcomes, whatever those may be.&lt;/p&gt;

&lt;p&gt;I imagine every company wants these benefits. "Implementing continuous delivery" or any of the other practices covered in the book is not straightforward. That's why, in between the findings, you'll see actionable advice on, for example, how to implement continuous delivery practices in your organization. This includes tons of small snippets of (researched) wisdom that I found very helpful.&lt;/p&gt;

&lt;p&gt;As an example, data from the research showed that it is more important to have system and application &lt;em&gt;configuration&lt;/em&gt; in version control than application &lt;em&gt;code&lt;/em&gt;. I found there were tons of interesting tips, both for companies still new to these practices and for companies that have been applying them already in some form.&lt;/p&gt;

&lt;h2&gt;
  
  
  Culture
&lt;/h2&gt;

&lt;p&gt;Culture is a big topic in Accelerate. One of the first concepts introduced is Westrum organizational culture, a measure of the organizational culture as experienced by an employee. Westrum defined three different types of organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pathological (Power-oriented)&lt;/li&gt;
&lt;li&gt;Bureaucratic (Rule-oriented)&lt;/li&gt;
&lt;li&gt;Generative (Performance-oriented)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a pathological organization, information is withheld out of fear. In a bureaucratic organization, rules are more important than the organization's mission. A generative organization focuses fully on its mission. Based on Westrum's research, generative organizations perform better and pathological ones perform worse.&lt;/p&gt;

&lt;p&gt;These three types form a scale that predicts how well information flows in an organization. When there is less fear of sharing information, because the information is used to improve and not to blame and punish, information is shared more, and cooperation and learning improve. In this way, increased information sharing helps organizations achieve their mission. This is exactly what Westrum found:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Westrum's theory posits that organizations with better information flow function more effectively.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Accelerate this construct of culture is analysed. They find that the Westrum organizational culture construct predicts organizational performance as well as software delivery performance and job satisfaction. This is in line with what you would expect based on Westrum's own theory.&lt;/p&gt;

&lt;p&gt;I found the tips given to improve culture very insightful. In the end, the culture within a company is based on the behaviour and interactions of its employees. As a single employee, you can set a good example by applying the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encourage cross-functional collaboration. You can do this by "actively seeking, encouraging, and rewarding work that facilitates collaboration" and by building trust with peers in other teams.&lt;/li&gt;
&lt;li&gt;Create a climate of learning. You can encourage this by opening up resources/time/budget if you're able to, by creating moments and opportunities to share learnings between teams, and by creating an environment in which people feel safe taking reasonable risks and failing.&lt;/li&gt;
&lt;li&gt;Be transparent about your application's performance. You can do this by sharing key metrics, alerts, SLO/SLA's and failures (for example in the form of post-mortems) publicly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don't have this yet, setting up a regular timeslot to share learnings between teams is a great way you can positively impact your company's culture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I think Accelerate is an essential read for anyone who wants to understand how DevOps practices can influence or are already influencing their organization. If that's you, &lt;a href="https://amzn.to/2rPGkaa"&gt;be sure to give it a read&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>webdev</category>
      <category>bookreview</category>
    </item>
    <item>
      <title>Book review: The Unicorn Project</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Tue, 19 Nov 2019 08:20:17 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/book-review-the-unicorn-project-j95</link>
      <guid>https://dev.to/raoulmeyer/book-review-the-unicorn-project-j95</guid>
      <description>&lt;p&gt;Around a year ago, I read &lt;a href="https://amzn.to/2QgOSjR"&gt;The Phoenix Project&lt;/a&gt;. As you may have read in my review, I thought it was amazing. It touched on so many important aspects of software development and DevOps.&lt;/p&gt;

&lt;p&gt;Its successor &lt;code&gt;The Unicorn Project&lt;/code&gt; is now available, and it's equally amazing. One part in the book about user feedback really struck a chord with me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Owning your product
&lt;/h2&gt;

&lt;p&gt;In The Phoenix Project, there is a big focus on how developers, ops and security teams should work together to get a better flow of work. In the story, the conflicting priorities of these split-up teams often get in the way of an idea going to production. When put together in one team and given the same goal, the team becomes way more efficient.&lt;/p&gt;

&lt;p&gt;But improving efficiency by itself doesn't solve all your problems. You might still be building the wrong thing, or building it in a way that doesn't match the needs of the people you're building it for. The book has a renewed focus on how a group of people, broader than just DevSecOps, can work together to improve how effective they are.&lt;/p&gt;

&lt;p&gt;In one example in the book, an idea had gone through several committees over the course of 2 years before it was picked up. In the end, a developer picked up the story, but couldn't figure out the details of what needed to happen. When idea and implementation are somewhat detached, whether in time or in physical distance between stakeholders and developers, it heavily influences both efficiency and effectiveness.&lt;/p&gt;

&lt;p&gt;I could really relate to another example. This week, my team had a meeting with our in-house translators. They showed us how they translated everything we need for our website. We built the system to sync strings to be translated to them and sync translations back when they are done. In just one hour, we found several low effort changes we could make that would make their lives easier.&lt;/p&gt;

&lt;p&gt;Being out of touch with your actual stakeholders is a real problem. The book makes really clear that this is not a single person's responsibility. Instead, everyone is responsible for understanding how their products are being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  And more
&lt;/h2&gt;

&lt;p&gt;The book touches on a lot of different topics and ideas, many of which I didn't expect. For example, there's a clear focus on psychological safety and the positive impact it can have on both a team and organization level. All in all, the book made me reflect in a lot of ways, much like The Phoenix Project did.&lt;/p&gt;

&lt;p&gt;On top of that it's just very entertaining to read about the struggles of a developer and how she overcomes those. It's easy to relate to the frustration that comes from a company that wants to but is unable to change.&lt;/p&gt;

&lt;p&gt;If you're interested in DevOps, I highly recommend you give &lt;a href="https://amzn.to/39tUDC9"&gt;The Unicorn Project&lt;/a&gt; a read.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bookreview</category>
    </item>
    <item>
      <title>Black Friday: Virtual Stampede Edition</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Fri, 15 Nov 2019 20:03:33 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/black-friday-virtual-stampede-edition-i3k</link>
      <guid>https://dev.to/raoulmeyer/black-friday-virtual-stampede-edition-i3k</guid>
      <description>&lt;p&gt;You may be preparing for Black Friday by thinking of all the things you want to buy. For us, as a big online retailer, we've been thinking hard about how we can make sure you can check off everything on your list (and maybe even more).&lt;/p&gt;

&lt;p&gt;As a developer, that boils down to making sure our website is available and responsive when you all come storming in at the same time. We've been load testing our website in a lot of different ways. In this post I want to share some considerations and thoughts on how to effectively prepare for busy days with load tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping it real
&lt;/h2&gt;

&lt;p&gt;During special events like Black Friday, chances are your traffic behaves differently from normal. It takes quite some effort to make sure your load test behaves somewhat similarly to real users. Let's go over some ways in which you can make your load test more realistic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access logs
&lt;/h3&gt;

&lt;p&gt;A good starting point for simulating production traffic is looking at actual production traffic. Access logs form the basis of most of our load tests. This works quite well if the bulk of your traffic is &lt;code&gt;GET&lt;/code&gt; type requests.&lt;/p&gt;

&lt;p&gt;A bigger sample of URLs to load test helps. If your sample is too small, you'll be hitting the same page more often than users would. This will skew your result if you're doing any kind of caching.&lt;/p&gt;
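&lt;p&gt;As a hypothetical sketch of this (the log format here is made up; in practice you would read your actual access log files), you can count the URLs in your access logs and then draw a large weighted sample, so the replayed traffic follows the same distribution as real traffic:&lt;/p&gt;

```python
import random
from collections import Counter

# Hypothetical access-log lines.
log_lines = [
    '10.0.0.1 - - [15/Nov/2019] "GET /product/123 HTTP/1.1" 200',
    '10.0.0.2 - - [15/Nov/2019] "GET /product/123 HTTP/1.1" 200',
    '10.0.0.3 - - [15/Nov/2019] "GET /deals HTTP/1.1" 200',
]

def sample_urls(lines, n):
    """Draw n URLs with the same relative frequency as in the access logs."""
    counts = Counter(line.split('"')[1].split()[1] for line in lines)
    urls = list(counts)
    weights = [counts[u] for u in urls]
    return random.choices(urls, weights=weights, k=n)
```

&lt;p&gt;Sampling with weights from the full log, instead of replaying a small fixed list, keeps popular pages popular in the test and avoids over-hitting your caches.&lt;/p&gt;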

&lt;h3&gt;
  
  
  Balancing out based on expectations
&lt;/h3&gt;

&lt;p&gt;Although access logs are a good start, the types of URLs on a normal day are often distributed differently compared to traffic on a busy day. On a day like Black Friday, there is a clear focus from our customers on a small part of our assortment. We might have some special page to show all our deals. Customers are more inclined to add some of these deals to their shopping carts compared to a normal day.&lt;/p&gt;

&lt;p&gt;This shift can mean that customers will be doing more intensive operations in some cases and more easy-to-compute operations in other cases. To get an accurate read on the capacity of your application, you want to change your sample of URLs to match with the distribution you're expecting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sessions
&lt;/h3&gt;

&lt;p&gt;The most fragile parts of a website are almost always the personalized parts. The shopping cart, order history, and personalized recommendations can be hard to compute and can hardly be cached. Any time you've got strict consistency requirements, for example in a shopping cart, your application will have to coordinate to make sure data is always the same.&lt;/p&gt;

&lt;p&gt;Ironically, it's really easy to misconfigure most load testing tools; most of them will not even store cookies by default. With that, every request gets a fresh session, which means the performance impact of coordinating data between nodes in a cluster is reduced significantly.&lt;/p&gt;

&lt;p&gt;On the other hand, storing cookies by itself is not enough to simulate real users. The ratio between session count and request count needs to be as close to reality as possible. If your load test creates a new session once and then uses that for thousands of requests, that doesn't come close to what a real user would do.&lt;/p&gt;
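&lt;p&gt;One way to get that ratio right is to rotate through a pool of simulated users, sized so each session receives a realistic number of requests. A hypothetical sketch (a real load test client would also store the cookies each user receives):&lt;/p&gt;

```python
import itertools

class SimulatedUser:
    """Holds per-user state so consecutive requests share one session."""
    def __init__(self, user_id):
        self.user_id = user_id
        self.cookies = {}   # filled from Set-Cookie headers in a real client

def users_for_load(total_requests, requests_per_session=20):
    """Assign a user to every request, sized so each session handles a
    realistic number of requests instead of one session for everything."""
    num_users = max(1, total_requests // requests_per_session)
    users = [SimulatedUser(i) for i in range(num_users)]
    pool = itertools.cycle(users)
    return [next(pool) for _ in range(total_requests)]
```

&lt;p&gt;Tuning &lt;code&gt;requests_per_session&lt;/code&gt; to match your production session-to-request ratio keeps the session-related load (storage, coordination) realistic.&lt;/p&gt;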

&lt;h3&gt;
  
  
  Ramp-up
&lt;/h3&gt;

&lt;p&gt;Most load testing tools provide control over the amount of load and how that load should be distributed over time. In a lot of cases it makes sense to slowly increase load to the desired level, to give your application the chance to fill its caches.&lt;/p&gt;

&lt;p&gt;However, you might actually want to verify that a burst of traffic is also handled well. In a lot of cases it makes sense to both verify how fast and how far your application scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data changes
&lt;/h3&gt;

&lt;p&gt;We load test our applications in a separate environment. In this environment, there are far fewer changes to the data behind our website, because nobody is actively changing it there. Importantly though, a lot of popular databases invalidate caches in some way whenever underlying data changes. This means that a lot of changes to the data in, for example, our Elasticsearch cluster may invalidate several tiers of caching, causing more load. Also, indexing things into our Elasticsearch cluster can be heavy, especially when done in bulk.&lt;/p&gt;

&lt;p&gt;This problem is bigger than just Elasticsearch of course. The lack of changes to data adds to the artificial character of a load test. Because of that, it can make sense to generate random changes to important data at a rate similar to your production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting goals
&lt;/h2&gt;

&lt;p&gt;In my experience, there are three main reasons to execute load tests. It's good to have an idea up front about what answers would make you feel comfortable with your system, as it is right now, supporting the load you're expecting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting a number
&lt;/h3&gt;

&lt;p&gt;The most quantifiable target in load testing web applications is the number of requests the application will be able to handle in a certain time with its current infrastructure configuration. To get an accurate read on this, use the number that your application sustains on average over a longer period of time.&lt;/p&gt;

&lt;p&gt;One thing to take into account: errors are actually really fast in a lot of cases. Not being authenticated or not being able to connect to a database are often fast checks that also fail fast. Having a lot of faulty or failing requests can make results look more positive than they should be.&lt;/p&gt;

&lt;p&gt;Taking that into account, the request rate you can sustain gives a good indication of what will happen in the real world under high load. Still, load tests don't have the randomness that users have. In real world scenarios, requests will be less evenly spread out, with sudden spikes at times. The request rate that you get from your load test is really an upper bound for what you can expect to successfully handle.&lt;/p&gt;
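&lt;p&gt;To keep fast-failing requests from inflating your numbers, you can compute the sustained rate over successful requests only. A small hypothetical helper:&lt;/p&gt;

```python
def sustained_success_rate(samples):
    """samples: (timestamp_in_seconds, http_status) pairs for completed requests.
    Returns successful (2xx) requests per second over the whole test window,
    so fast errors do not inflate the result."""
    timestamps = [t for t, _ in samples]
    window = max(timestamps) - min(timestamps)
    if window == 0:
        return 0.0
    ok = [t for t, status in samples if status // 100 == 2]
    return len(ok) / window
```

&lt;p&gt;Dividing only the successful requests by the full test window gives a conservative read: a test that "handles" 1000 requests per second but fails half of them is really only sustaining 500.&lt;/p&gt;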

&lt;h3&gt;
  
  
  Finding bottlenecks
&lt;/h3&gt;

&lt;p&gt;Another thing you can learn from load testing your application is where its bottlenecks are. These can be components that either don't scale, or don't scale fast enough to accommodate all incoming traffic.&lt;/p&gt;

&lt;p&gt;In an ideal situation, the load you can handle scales proportionally with the amount of money you're spending on infrastructure. In reality, there is always some limiting factor that makes this not true.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowing how things break
&lt;/h3&gt;

&lt;p&gt;In the end, maybe the biggest lesson from load testing your application is knowing how your application behaves and breaks under high load. Especially because under high load the chances of failures affecting other components of your application are really high.&lt;/p&gt;

&lt;p&gt;To give a concrete example, most common databases have limits on the number of connections they can support at once. This is mostly the case because there is a clear overhead to having an open connection and all state related to it. When other components of your application slow down, there will be more requests being processed at any point in time. This also means that there will be more concurrent connections to databases. In a lot of cases this can mean that delays in one part of an application can bring the whole application down.&lt;/p&gt;

&lt;p&gt;Although there are some common patterns here, the connections and dependencies on your application are unique. The only way to figure out how your application breaks is by breaking your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Load testing is hard to get right. Making faulty predictions about the capacity of your application can be a costly mistake. I hope these tips help you in better understanding and tuning your load tests to make sure they resemble a realistic scenario.&lt;/p&gt;

</description>
      <category>loadtesting</category>
      <category>devops</category>
      <category>blackfriday</category>
    </item>
    <item>
      <title>5 tips on debugging a production outage</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Sun, 06 Oct 2019 20:51:09 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/5-tips-on-debugging-a-production-outage-5gk7</link>
      <guid>https://dev.to/raoulmeyer/5-tips-on-debugging-a-production-outage-5gk7</guid>
      <description>&lt;h2&gt;
  
  
  1. Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UHNU3gN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7vtgj6hrl74nvaz27dr9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UHNU3gN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7vtgj6hrl74nvaz27dr9.jpg" alt="Tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's no way to debug an outage if there is no way to extract information out of your systems. You'll need some tooling to give you insights into what's going on inside your systems. That can be as basic as log files, or as advanced as some of the amazing observability tools that exist nowadays.&lt;/p&gt;

&lt;p&gt;I clearly remember how overwhelming it was for me as a new developer to try and navigate the tools that we have. It took quite some time before I was somewhat comfortable with searching through our logs. There is just so much information, it can be hard to know where to start.&lt;/p&gt;

&lt;p&gt;One way you can get more comfortable with your tooling is by doing small exercises. Last week, I organized a session for my team with exactly this goal. They got a whole list of questions about our applications and had to find the answers. How many requests did we serve in the last day? How many of those failed? What was the most common failure reason?&lt;/p&gt;

&lt;p&gt;This achieves three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You'll have to navigate the UI of your observability tools&lt;/li&gt;
&lt;li&gt;You'll have to find the right information to look at, whether that is alerts, dashboards or metrics&lt;/li&gt;
&lt;li&gt;You'll find out how well you are able to interpret the information you get from your tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Doing this in a situation where there is no pressure can be very valuable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip 1: Get comfortable with your tools&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. Infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvj4YY9U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/h3idg56hp11wnw3228tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvj4YY9U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/h3idg56hp11wnw3228tn.png" alt="Infrastructure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once I was part of a long-lasting outage of our webshop. The impact was significant: the whole website was down. We weren't seeing any requests on our load balancers, meanwhile all our customers were getting error pages.&lt;/p&gt;

&lt;p&gt;None of us knew exactly how our application worked from this perspective. We often debugged production issues coming from our application, but everything that happened between the browser of a customer and our application was somewhat unknown.&lt;/p&gt;

&lt;p&gt;Because of this, the outage lasted pretty long. By the time we located the issue, we were already multiple hours into the outage. With a better understanding of our whole infrastructure, the impact would have been far lower.&lt;/p&gt;

&lt;p&gt;A quick overview of your application documented somewhere can make all the difference. Knowing how the components of your application interact is crucial, especially because most problems occur at connections between components.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip 2: Know your infrastructure&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oZgGC_iJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gtkv4zp8305yb8plxkbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oZgGC_iJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gtkv4zp8305yb8plxkbv.png" alt="Experiment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's very helpful to approach debugging in a methodical way. The free &lt;a href="https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/"&gt;Google Site Reliability Engineering book&lt;/a&gt; dives into a lot of the details of how you can make sure your debugging efforts are effective.&lt;/p&gt;

&lt;p&gt;The general idea is very similar to a scientific experiment. At every step, you formulate a hypothesis based on the information you have. Then, you verify if that hypothesis is true. Based on the new information you just obtained, you repeat the process. This structured approach helps because it prevents you from making assumptions about what is going on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip 3: Hypothesize an explanation, check this hypothesis, repeat&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. Summarize
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CPFakwJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rc3i5ujf1ki04cjql6q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CPFakwJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rc3i5ujf1ki04cjql6q6.png" alt="Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inevitably, you'll get stuck in your debugging at some point. This can be tough to deal with, especially if you feel pressure from something being broken.&lt;/p&gt;

&lt;p&gt;This is the time to summarize what you've learned about the problem so far. It really helps if there's someone else around, so they can check your summary for gaps or inconsistencies. It can also help to write down everything you learn about a problem. This makes it really easy to go over it again, and can be really interesting for later evaluation, for example in a post-mortem.&lt;/p&gt;

&lt;p&gt;You'll notice that when you do this, you always end up with one of a few outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You mention two observations that seem to contradict each other: &lt;code&gt;there are no errors in the application logs, but I get an error page when I do a request&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You notice there is something you haven't looked at yet: &lt;code&gt;the application can't talk to our database anymore, did we change anything in our configuration?&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Your observations point you to an obvious conclusion: &lt;code&gt;the database looks fine, but the load balancer shows errors, so the problem is probably in the application&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all cases, it will be easier to think of the next thing to investigate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip 4: When stuck, summarize what you know&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I4XaQyhv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/da9ufih24p4zc01k4047.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I4XaQyhv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/da9ufih24p4zc01k4047.jpg" alt="Practice"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chances are, your applications don't randomly break every day. In a lot of cases, there are numerous safeguards in place to prevent outages. This can mean that you get out of touch with the current architecture, observability tooling, dashboards and metrics.&lt;/p&gt;

&lt;p&gt;This is one of the reasons why PagerDuty holds a weekly &lt;a href="https://www.pagerduty.com/blog/failure-friday-at-pagerduty/"&gt;"Failure Friday"&lt;/a&gt;. During these sessions, they simulate outages in a controlled way, so you're guaranteed to look at production systems regularly. You can keep your knowledge of systems you don't touch often fresh, and stay up to date on the current setup of applications that change frequently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip 5: Simulate outages to stay in touch with production&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>webdev</category>
      <category>architecture</category>
    </item>
    <item>
      <title>42: The answer to life, the universe and why your website might be slow</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Sun, 29 Sep 2019 19:37:32 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/42-the-answer-to-life-the-universe-and-why-your-website-might-be-slow-25dh</link>
      <guid>https://dev.to/raoulmeyer/42-the-answer-to-life-the-universe-and-why-your-website-might-be-slow-25dh</guid>
      <description>&lt;p&gt;Recently, we enabled an APM (Application Performance Monitoring) tool for our webshop. This gave us incredible insights into what's happening in every single request. We can now see time spent on retrieving things from databases, rendering HTML and middleware processing for a single request.&lt;/p&gt;

&lt;p&gt;One thing immediately caught our attention: Most of our requests' time was spent communicating with Redis. We use Redis for caching and sharing data between instances in our cluster, for example for session information.&lt;/p&gt;

&lt;p&gt;Looking at a couple of examples, we saw that talking to Redis was mostly really fast: 9 times out of 10 we saw sub-millisecond response times. But sometimes a command would take ~42ms to execute. This happened occasionally for both &lt;code&gt;GET&lt;/code&gt; and &lt;code&gt;SET&lt;/code&gt; commands, and the delay was always around 42ms. It even happened when talking to a Redis server running on the same host.&lt;/p&gt;

&lt;p&gt;Now that's what I call a weird problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigating
&lt;/h2&gt;

&lt;p&gt;Redis is single-threaded and executes all commands in sequence. Naturally, our initial assumption was that some very large value was being retrieved or set. That could block the whole process for a while, forcing all other clients to wait. We took a look at the actual data being stored and retrieved in some slow cases, but it wasn't anywhere near a size that would cause problems.&lt;/p&gt;

&lt;p&gt;Redis provides a very extensive &lt;a href="https://redis.io/topics/latency"&gt;latency debugging guide&lt;/a&gt;. We went through this step by step (a couple of times) to find the cause of our latency issues.&lt;/p&gt;

&lt;p&gt;Did we suffer from intrinsic latency? Yes, but not even close to the 42ms kind of latency we were seeing.&lt;/p&gt;
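
&lt;p&gt;The intrinsic latency test from the guide can be run with &lt;code&gt;redis-cli&lt;/code&gt; directly on the Redis host; something along these lines (the argument is the test duration in seconds):&lt;/p&gt;

```shell
# Measure the baseline latency of the host itself (scheduling, kernel,
# hypervisor), independent of Redis, for 100 seconds
redis-cli --intrinsic-latency 100
```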

&lt;p&gt;Did our network have latency? Well, we were seeing about the same numbers for localhost and remote communication, so the cause couldn't really be network latency.&lt;/p&gt;

&lt;p&gt;Were these commands we were executing just really heavy? We enabled &lt;a href="https://redis.io/commands/slowlog"&gt;slowlog&lt;/a&gt; and looked at the results. We were still seeing 42ms latency from our application hundreds of times every minute, but there wasn't even a single command that took more than 10ms on the Redis side of things.&lt;/p&gt;
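
&lt;p&gt;For reference, enabling and inspecting the slowlog looks roughly like this; the threshold is in microseconds, so 10000 matches the 10ms mentioned above:&lt;/p&gt;

```shell
# Log every command that takes Redis longer than 10ms to execute
redis-cli config set slowlog-log-slower-than 10000

# Later, inspect the 10 most recent slow commands
redis-cli slowlog get 10
```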

&lt;p&gt;In the same way, we went through all the other possible causes. The &lt;code&gt;transparent_hugepage&lt;/code&gt; setting was already disabled. There was no swapping going on. We had already disabled &lt;a href="https://redis.io/topics/persistence"&gt;persistence&lt;/a&gt; (since this store is only used for caching). Even on a Redis server with only a couple of keys we could reproduce the issue, so expiring keys were not affecting our latency.&lt;/p&gt;

&lt;p&gt;Our conclusion: Redis wasn't the issue here. So what was?&lt;/p&gt;

&lt;h2&gt;
  
  
  Going deeper
&lt;/h2&gt;

&lt;p&gt;We tried to reproduce the issue using another handy built-in tool, &lt;a href="https://redis.io/topics/benchmarks"&gt;redis-benchmark&lt;/a&gt;. This way we could run any kind of command with arbitrary payload sizes. We tried to reproduce the cases we saw as closely as possible, but didn't get close to the latency we were seeing from our application.&lt;/p&gt;
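
&lt;p&gt;An invocation in the spirit of what we ran (host and payload size here are just examples, not our exact numbers):&lt;/p&gt;

```shell
# Benchmark GET and SET against a local Redis with a 1KB payload,
# 100000 requests total, printing only the summary per command
redis-benchmark -h 127.0.0.1 -t get,set -d 1024 -n 100000 -q
```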

&lt;p&gt;Then we did the same, but now with the client that our application was using. This application is written in PHP, and we use &lt;a href="https://github.com/nrk/predis"&gt;Predis&lt;/a&gt; as our Redis client. With a small benchmark script, we could easily reproduce the issue.&lt;/p&gt;
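
&lt;p&gt;A minimal sketch of such a script, assuming Predis is installed via Composer and Redis runs locally (the key name is made up for illustration):&lt;/p&gt;

```php
require 'vendor/autoload.php';

$client = new Predis\Client(['host' => '127.0.0.1', 'port' => 6379]);
$client->set('benchmark:key', 'some-value');

// Time a batch of GETs and report the slowest one
$max = 0.0;
foreach (range(1, 1000) as $i) {
    $start = microtime(true);
    $client->get('benchmark:key');
    $max = max($max, microtime(true) - $start);
}
printf("slowest GET: %.1f ms\n", $max * 1000);
```

&lt;p&gt;With something like this, the occasional ~42ms outlier showed up within seconds.&lt;/p&gt;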

&lt;p&gt;So is Predis just slow? It's really just a thin wrapper around a TCP connection to Redis. We determined that in our benchmark script, all the time was actually spent waiting on a response from Redis, and every time it was almost exactly 42ms. So Redis is fast, Predis is fast, but for some reason we were waiting 42ms for local network communication.&lt;/p&gt;

&lt;p&gt;This is when we found a really old &lt;a href="https://github.com/nrk/predis/issues/10"&gt;GitHub issue&lt;/a&gt; referencing this problem. The issue is actually a combination of two TCP optimizations that work against each other. One is &lt;a href="https://en.wikipedia.org/wiki/Nagle%27s_algorithm"&gt;Nagle's algorithm&lt;/a&gt;, which tries to reduce the number of packets sent over the line by bundling them. The other is &lt;a href="https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment"&gt;delayed acknowledgement&lt;/a&gt;, which tries to reduce the number of acknowledgement packets by bundling those. The whole issue is explained really nicely in &lt;a href="https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-little-about-tcp/"&gt;this blog post&lt;/a&gt; by Julia Evans.&lt;/p&gt;

&lt;p&gt;Fortunately, a fix has since been implemented in Predis, so we could enable &lt;code&gt;TCP_NODELAY&lt;/code&gt; on our connections. This immediately improved latency.&lt;/p&gt;
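
&lt;p&gt;In recent Predis versions this is exposed as a connection parameter; a sketch of what enabling it looks like (check the documentation of your Predis version):&lt;/p&gt;

```php
require 'vendor/autoload.php';

// Disable Nagle's algorithm on the connection to Redis, so small
// request packets are sent immediately instead of being buffered
$client = new Predis\Client([
    'host'        => '127.0.0.1',
    'port'        => 6379,
    'tcp_nodelay' => true,
]);
```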

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The TCP issue we ran into can happen on any TCP connection, not just with Redis. It's just that Redis normally has extremely low latency, which makes it very obvious when something is wrong. For libcurl, the default setting has actually been &lt;a href="https://curl.haxx.se/libcurl/c/CURLOPT_TCP_NODELAY.html"&gt;updated somewhat recently&lt;/a&gt;, so you shouldn't see problems there. For other HTTP/TCP clients, this might be something you want to investigate in your applications.&lt;/p&gt;

&lt;p&gt;Enabling APM has given us a lot of insights into what our application is doing. If you haven't tried it yet, I would highly recommend it for any web application. There are lots of options out there, we've had good experiences in trying out &lt;a href="https://docs.datadoghq.com/tracing/"&gt;Datadog APM&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>php</category>
      <category>performance</category>
    </item>
    <item>
      <title>The DevOps circle of life</title>
      <dc:creator>Raoul Meyer</dc:creator>
      <pubDate>Thu, 15 Aug 2019 18:17:51 +0000</pubDate>
      <link>https://dev.to/raoulmeyer/the-devops-circle-of-life-2dm4</link>
      <guid>https://dev.to/raoulmeyer/the-devops-circle-of-life-2dm4</guid>
      <description>&lt;p&gt;DevOps is about everyone involved in developing a system sharing the same goal. That goal can be as vague as &lt;code&gt;winning in the market&lt;/code&gt; or &lt;code&gt;making customers happy&lt;/code&gt;. With this goal, decision making becomes easier.&lt;/p&gt;

&lt;p&gt;More often than necessary, teams make decisions based on how those decisions will affect them, disregarding the bigger picture. Although it's often hard to measure, this unproductive behavior costs a lot of time, time that could be spent on your business goals. DevOps is all about reducing the amount of finger pointing and told-you-so's.&lt;/p&gt;

&lt;p&gt;The easiest way to make sure a team takes the bigger picture into account is by making the team responsible for the bigger picture of their system. As long as responsibility is shared among teams, you risk those teams forming their own slightly different goals, which will generate frustration and suboptimal decisions.&lt;/p&gt;

&lt;p&gt;The hard part is that you need to be able to get a good overview of the impact of your decisions on the bigger picture. That means everyone should have knowledge of, or easy access to knowledge about, many aspects of their systems. In many cases it requires a team to have more skills than a single person can possibly have. There are real challenges in composing and sustaining a team with such diverse skills.&lt;/p&gt;

&lt;p&gt;When your team is composed of people with diverse skills, this brings other benefits. Less coordination with other teams is necessary, and not having to wait for another team to start thinking about something is an amazing feeling. With all these skills combined in one team, it becomes easier to transfer knowledge and spread the workload where necessary.&lt;/p&gt;

&lt;p&gt;In a lot of organisations there is tension between development and operations. Development releases new features, and operations has to make sure those keep working together with what's already there. It is tempting to optimize for rate of change on one side and stability on the other. When one team is responsible for both adding new things and maintaining existing things, a very helpful feedback loop forms. When the system is stable, the team has more time to add features. On the other hand, adding features too fast will directly impact the team's ability to keep their system stable.&lt;/p&gt;

&lt;p&gt;It's this circle of life that makes sure you're optimizing for the bigger picture. What is the main goal of your company? Do you feel that your day-to-day decisions align with this goal?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
