Why managing dev environments is a full-time job at Eventbrite

#devops #microservices #architecture #productivity

Deciding when to invest in developer productivity improvements is hard. If you’re on the ops side of things, you’re usually concerned about production and releases. If you’re a developer, you’re concerned about getting new features out as quickly as possible.

Usually, teams make development productivity improvements in two situations. Either the fix is so small that you can just do it in addition to your other work, or development is so painful that making changes has ground to a halt.

However, there’s still a large murky middle ground: how do you decide that it’s worth investing in a large change to your development workflow before development has ground to a halt?

Remy DeWolf spent three years making these sorts of decisions as a principal engineer on the DevTools team at Eventbrite. He was part of the decision to build yak, which moved Eventbrite’s development environment into the cloud. It was a carefully weighed decision: the remote environment costs a few EC2 instances per engineer, and yak had to be built from scratch.

In this first post, we’ll dig into how Remy made this tough decision and got buy-in from the rest of the company. In our next post, we’ll get into the nitty-gritty of how their remote development environment works and what it’s been like for developers.

Read part 2 of this interview about Eventbrite’s specific setup and 3 unexpected benefits of remote dev environments.

Q&A

How is the Eventbrite application architected?

This is a common story that you will find in a lot of startups. The founding engineers built a monolith and the strategy was to build features fast and capture the market. It was a very successful approach.

As the company grew over time, having a large team working on the monolith became challenging. And after a certain size, it was also harder to keep scaling the monolith vertically.

Over time, some of the monolith was migrated over to microservices. New services are generally containerized, and the monolith is containerized in development but not in production.

What’s your development environment setup now?

Every engineer runs ~50 containers: the monolith, the microservices, the data stores (MySQL, Redis, Kafka…), and various tools (logging, monitoring).

Developers use yak (which we built internally) to deploy and manage their remote containers.

We use AWS EKS for the Kubernetes clusters, in which every developer has their own namespace. We have hundreds of developers and many EKS clusters.
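To make the one-namespace-per-developer idea concrete, here is a minimal sketch of how a username could be turned into a Namespace manifest. It is not yak's actual implementation: the `dev` prefix and both function names are assumptions made up for illustration. The only Kubernetes facts it relies on are the standard `v1` Namespace object and the rule that namespace names are lowercase DNS labels of at most 63 characters.

```python
import json
import re


def namespace_name(username: str, prefix: str = "dev") -> str:
    """Turn a username into a valid namespace name (hypothetical naming scheme)."""
    # Lowercase, then replace anything that is not a-z, 0-9, or '-' with '-'.
    slug = re.sub(r"[^a-z0-9-]+", "-", username.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace from {username!r}")
    # Namespace names are limited to 63 characters and cannot end with '-'.
    return f"{prefix}-{slug}"[:63].rstrip("-")


def namespace_manifest(username: str) -> dict:
    """Build a plain Namespace manifest for one developer."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace_name(username)},
    }


# JSON is accepted wherever a YAML manifest is, so this output could be applied as-is.
print(json.dumps(namespace_manifest("Remy.DeWolf"), indent=2))
```

Keeping each developer in a separate namespace means their ~50 containers can reuse the same service names without colliding with anyone else's.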

yak is very similar to Blimp: it lets engineers manage their remote containers without exposing them to the complexity of Kubernetes.

How did you decide it was time to build yak?

Before yak, each developer ran their development environment locally on their laptop. However, the development environment became so big that it slowed down developer laptops.

The main difficulty was that the problem was easy to miss, because it crept in one service at a time.

Once we added instrumentation to our tools, we started to understand the scale of the problems. Moving to the cloud is expensive, but once we were able to put that cost side by side with the wasted engineering time, the decision was easy for us.
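The side-by-side comparison can be sketched as simple arithmetic. Every number below is a made-up placeholder, not Eventbrite data, and the field names are assumptions. The point is only the shape of the calculation: time lost per engineer, priced at a loaded hourly cost, against the monthly cloud bill per engineer.

```python
from dataclasses import dataclass


@dataclass
class Assumptions:
    """Inputs to the comparison; all values are hypothetical placeholders."""
    engineers: int
    minutes_lost_per_engineer_per_day: float
    loaded_cost_per_engineer_hour: float
    cloud_cost_per_engineer_month: float
    working_days_per_month: int = 21


def monthly_wasted_cost(a: Assumptions) -> float:
    """Cost of engineering time lost to a slow local environment, per month."""
    hours_lost = (
        a.engineers
        * a.minutes_lost_per_engineer_per_day / 60
        * a.working_days_per_month
    )
    return hours_lost * a.loaded_cost_per_engineer_hour


def monthly_cloud_cost(a: Assumptions) -> float:
    """Cost of running every engineer's environment in the cloud, per month."""
    return a.engineers * a.cloud_cost_per_engineer_month


# Placeholder inputs for illustration only.
example = Assumptions(
    engineers=200,
    minutes_lost_per_engineer_per_day=30,
    loaded_cost_per_engineer_hour=100.0,
    cloud_cost_per_engineer_month=300.0,
)

print(f"wasted time: {monthly_wasted_cost(example):,.0f} per month")
print(f"cloud bill:  {monthly_cloud_cost(example):,.0f} per month")
```

The minutes-lost input is the one that only instrumentation can supply, which is why the comparison was not possible before the tools were measured.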

Another goal of yak was to make Kubernetes easy for developers. We kept it as minimal as possible and the configuration files are plain Kubernetes manifest files. The intent was to feed developer curiosity so they learn more about Kubernetes over time.

What areas do you recommend tracking regarding developer productivity?

Whenever possible, align the developer productivity goals with the business. Every DevTools team should understand how they contribute to the company goals and vice versa. If this is unclear, I would start with that.

Next, make sure that developer productivity is part of the plan, not an afterthought. For example, some engineering teams move to microservices and only track the number of services and the uptime in production. These are great metrics, but they’re incomplete. They will generate inconsistency and the developer experience will suffer over time.

In terms of which metrics to pick, there is no general recommendation. It’s important to understand how developers work, understand how frequently they perform critical tasks, and instrument the tools that they use. With this data, you will be able to identify the most important areas to invest in and track progress over time.

I would also recommend having a metric about mean time to recovery (MTTR). If a developer is completely stuck, how would you bring them back to a clean state so they can resume their work? For this one, if you run the developer environment locally, you will have many different combinations of OS/tools/versions resulting in many different issues. If you are on the cloud and use a generic solution (e.g. Docker + Kubernetes), this problem will be much easier to solve.
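As a minimal sketch of what tracking this metric could look like, assume each incident is recorded as a pair of timestamps: when the developer got stuck and when they were back to a clean state. The record shape and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average time between a developer getting stuck and being unblocked."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (recovered - stuck for stuck, recovered in incidents),
        timedelta(),  # start value so that sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical incident log: (stuck_at, recovered_at).
incident_log = [
    (datetime(2020, 3, 2, 9, 0), datetime(2020, 3, 2, 9, 30)),
    (datetime(2020, 3, 4, 14, 0), datetime(2020, 3, 4, 15, 30)),
]

print(mean_time_to_recovery(incident_log))
```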

How did you collect feedback at Eventbrite?

We had many channels:

Instrumentation in the tools. Every time a developer built, ran, or deployed Docker images, we sent metrics, and every CI job did the same. We then built dashboards on those metrics to track and measure progress over time. If you are using a tool like Sumo Logic or Datadog, it’s very easy to send custom metrics and build dashboards.
Quarterly engagement surveys.
Demos: invite other engineers to show them the progress and engage with them.
New hires: these new employees bring a fresh perspective and they are not afraid to ask questions and challenge the status quo.
Networking: build relationships with other developers (coffee breaks, office visits, lunches, etc.)
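The first channel, instrumenting the tools, can be sketched as a small timing wrapper around each developer action. This is an illustration under stated assumptions, not Eventbrite's code: the metric name, the tag names, and the `emit` callback are all hypothetical. In practice `emit` would forward the record to whichever metrics backend you use; here it just appends to a list.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(action: str, emit: Callable[[dict], None], **tags: str) -> Iterator[None]:
    """Time the wrapped block and emit one metric record when it ends."""
    start = time.monotonic()
    status = "success"
    try:
        yield
    except Exception:
        status = "failure"
        raise  # never hide the original error from the developer
    finally:
        emit({
            "metric": f"devtools.{action}.duration_seconds",  # hypothetical name
            "value": time.monotonic() - start,
            "tags": {**tags, "status": status},
        })


collected: list[dict] = []

with timed("build", collected.append, service="monolith"):
    time.sleep(0.01)  # stand-in for the real build step

print(collected[0]["metric"], collected[0]["tags"])
```

Recording failures as well as durations matters: a step that is fast but fails often wastes as much time as one that is slow.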

Can you give some examples of developer productivity OKRs?

Time to start the developer environment is under x min

This time is usually wasted time, so it’s important to track it and improve it. If the dev stack is unreliable or slow, it would be captured in this OKR.

Engagement is over x%

If you send an engagement survey every quarter, you can have an OKR to make sure the trend is upward. Seeing a drop would mean that the team might not be working on the most relevant projects.

Average time from commit to QA/Prod

This one will capture the CI/CD pipeline effectiveness. If you experience some flaky tests or deployment errors in the pipeline, it would negatively impact the key results.
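A key result such as "time to start is under x min" still needs a precise definition before it can be tracked. One possible reading, sketched below, is the share of environment starts that finish under the target. The function name, the sample durations, and the 5-minute target are assumptions for illustration; the target stands in for the "x" above.

```python
def share_under_target(durations_min: list[float], target_min: float) -> float:
    """Fraction of measured durations that finished under the target."""
    if not durations_min:
        raise ValueError("need at least one measurement")
    return sum(d < target_min for d in durations_min) / len(durations_min)


# Hypothetical start-up times in minutes, e.g. pulled from tool instrumentation.
start_times = [2.0, 3.0, 8.0, 4.0]

print(f"{share_under_target(start_times, target_min=5.0):.0%} of starts under target")
```

Tracking a share rather than an average keeps a few very slow starts visible instead of letting many fast ones hide them.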

Some OKRs will eventually be exhausted, so consider renewing them periodically. For example, if your survey always has the same questions, developers will eventually stop responding. Also, once an OKR has greatly improved, it’s a good time to shift priorities.

In my personal experience, I would focus on a few OKRs instead of having too many. Sometimes, by trying to please everybody, you will not have a big impact. Some projects might require the full team’s focus, which can temporarily impact other OKRs. This would be a calculated strategy, as these projects would bring huge improvements when delivered.

Are there any warning signs people should look out for in order to know their developer productivity is suffering?

This is where it’s important to have good metrics and monitor them over time. You should be able to feel the pulse of your developers by looking at different data points. Ideally, you would tie these to your OKRs and review the progress every sprint and make adjustments.

If you don’t have this data there are still warning signs that productivity is suffering:

Increase in support cases and/or requests for help. If developers need external help to do their work, this is a sign that a process is too hard to use or not well documented.
On the other hand, I’d be worried if you find out that some processes aren’t working properly but nobody reported them to your team. You want developers to always be looking for improvements, not accepting a broken process.

Kelda and Eventbrite

Kelda has collaborated with Eventbrite for a long time. We first met when we were building the predecessor to Blimp, which moves your Docker Compose development environment into the cloud. Eventbrite had already built yak internally, and we were trying to make a general solution. We’ve been trading ideas ever since.

Check out Blimp to get the benefits of yak without having to build it yourself!