<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Bartlow</title>
    <description>The latest articles on DEV Community by Alexander Bartlow (@alexbartlow).</description>
    <link>https://dev.to/alexbartlow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F540963%2F3e0c33ff-2d99-4b6b-8269-c5372dcca4c0.jpeg</url>
      <title>DEV Community: Alexander Bartlow</title>
      <link>https://dev.to/alexbartlow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexbartlow"/>
    <language>en</language>
    <item>
      <title>Platform Engineering vs. Site Reliability Engineering</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Wed, 28 Sep 2022 17:33:52 +0000</pubDate>
      <link>https://dev.to/aha/platform-engineering-vs-site-reliability-engineering-3ei5</link>
      <guid>https://dev.to/aha/platform-engineering-vs-site-reliability-engineering-3ei5</guid>
      <description>&lt;p&gt;Striker and goalie. Offense and defense. Deploy and recalibrate. Many disciplines have dichotomy between the tasks that accomplish a goal and tasks that protect the ability to do so. Delivering features, building out user flows, and optimizing conversion rates constitute the "offense" of many software companies. All of these activities are directly tied to the goal of growing the business. Defensive tasks encompass security, operations, and disaster recovery. These tasks are designed to prevent loss and empower the offense.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An engineering team cannot run optimal code or move the business forward with a weak infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the software world, this alliance is realized through the engineering and operations teams. Some organizations refer to these as the "profit-center" and "cost-center," respectively.&lt;/p&gt;

&lt;h2&gt;Traditional site reliability engineering&lt;/h2&gt;

&lt;p&gt;Though symbiotic, the relationship between engineering and operations may encounter pitfalls at more complex organizations. The introduction to Google's &lt;a href="https://sre.google/sre-book/introduction/" rel="noopener noreferrer"&gt;Site Reliability Engineering: How Google Runs Production Systems&lt;/a&gt; describes how the two groups may have varied goals, backgrounds, skill sets, incentives, vocabulary, and assumptions. And when left unchecked, strife can form as organizational effectiveness breaks down.&lt;/p&gt;

&lt;p&gt;Google's solution to this problem is the discipline of site reliability engineering (SRE). Engineers are tasked with writing software that creates a stable environment for deploying and operating the infrastructure. Site reliability engineers at Google work with development teams on a sort of contract — SRE runs the application only if development follows the practices needed to make the application run efficiently. If the development team does not hold up its end of the bargain, then the SREs will decline to hold up theirs and work with teams that do.&lt;/p&gt;

&lt;p&gt;This works well for Google, but turning down work from one part of our engineering team is not an option for the Aha! team. We have no site reliability engineers on staff at Aha! — because we found a better way.&lt;/p&gt;

&lt;h2&gt;Unified engineering goals&lt;/h2&gt;

&lt;p&gt;We solved the problem of the development-operations split by defining clear goals for our dedicated platform team. Our platform team has goals similar to a traditional operations team, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operability — How do we run the system on a day-to-day basis and make routine changes?&lt;/li&gt;
&lt;li&gt;Reliability — How do we ensure the system is available for our customers?&lt;/li&gt;
&lt;li&gt;Resilience — When the system is unavailable, how do we get it running again?&lt;/li&gt;
&lt;li&gt;Observability — How do we ensure that we know before our customers when the system needs help?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have also adopted six additional goals to help us create a happy, effective, and high-performing engineering team:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inverse toil&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sre.google/sre-book/eliminating-toil/#:~:text=Toil%20defined" rel="noopener noreferrer"&gt;Toil&lt;/a&gt; refers to the time engineers spend doing repetitive work that doesn't deliver real value to customers. While the amount of manual work required to operate production is higher for the platform team than for the rest of the engineering team, we still try to keep our time spent toiling to a minimum. For most of our platform engineers, it's less than 20% of their time. Google's SREs target 50% or less time spent toiling, so we believe that our investments in automation have been very effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Release velocity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How long does it take a pull request to be deployed after being flagged as ready? While we attempt to optimize the time taken in continuous integration, test automation is not a core focus for us. Our developers closest to the code are best suited to test it, so we leave that responsibility to them. We instead focus on the deployment pipeline — building the core images and getting them into production. Simplifying that process allows it to be done many times a day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to deployment speed, some changes require a gradual rollout or an experimental stage to validate the changes we made. The platform team manages the feature flag functionality at Aha! and continues to extend it to provide support for more use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer productivity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The platform team maintains a &lt;a href="https://www.aha.io/engineering/articles/log-management-using-aws-athena" rel="noopener noreferrer"&gt;robust logging system&lt;/a&gt; to allow our engineers to troubleshoot problems in production quickly and effectively without direct access to customer data. We maintain the Docker setup for local development to give engineers a stable way to run the application and all of its dependencies. Our &lt;a href="https://www.aha.io/engineering/articles/dynamic-stagings" rel="noopener noreferrer"&gt;dynamic staging setup&lt;/a&gt; allows us to get new features in front of the &lt;a href="https://www.aha.io/roadmapping/guide/product-management/work-with-engineers" rel="noopener noreferrer"&gt;product&lt;/a&gt;, marketing, and Customer Success teams early in the development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All engineers should attempt to create a performant application. Our focus on monitoring and observability allows us to find problems in production and start the troubleshooting process. Some features may need to go back into development for optimization, but we can often solve the problem through algorithmic tweaks, indexing changes, or other &lt;a href="https://www.aha.io/engineering/articles/optimizing-with-the-postgresql-deterministic-query-planner" rel="noopener noreferrer"&gt;database management techniques&lt;/a&gt;. Our experience looking into these issues gives us a sense of where additional caching, reducing object allocations, or other techniques could be used to bolster end-user performance.&lt;/p&gt;

&lt;p&gt;These goals and techniques are also shared by the &lt;a href="https://www.atlassian.com/devops/what-is-devops/history-of-devops#:~:text=devops%20movement" rel="noopener noreferrer"&gt;DevOps movement&lt;/a&gt;, which attempts to push all the operations work back onto the engineering team. However, developing a &lt;a href="https://www.aha.io/engineering/articles/from-one-many-building-a-product-suite-with-a-monolith" rel="noopener noreferrer"&gt;large monolithic Rails application&lt;/a&gt; with a robust and modern front-end framework encompasses two very deep skill sets. Asking all of our engineers to also be AWS, networking, and database engineers is asking far too much.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our platform team serves and empowers the rest of the engineering organization. All of our goals unite to serve the company vision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Making complicated problems simple&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;New features regularly demand more than our infrastructure can currently do, and solving those complex problems is part of how we empower the rest of engineering. That empowerment entails building out all-new infrastructure and paradigms that can be used to solve tomorrow's problems. We've made a massive effort toward integrating &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; into our infrastructure this year, which is typical of this kind of investment. By having some space away from product-facing work, we're able to think of our product engineers as internal customers. We can build out tools and infrastructure that will support the development of future features and products.&lt;/p&gt;

&lt;h2&gt;Platform for profit&lt;/h2&gt;

&lt;p&gt;Remember the profit-center vs. cost-center distinction? New development generally gets billed as profit-generating, but the operations side of an organization has the unique ability to drive costs down. This is done through fine-tuning the spend on cloud resources, optimizing execution to require less of those resources in the first place, and selecting time-saving technologies.&lt;/p&gt;

&lt;p&gt;Infrastructure budget is a vital component of the &lt;a href="https://www.aha.io/roadmapping/guide/roadmap/technology-roadmap" rel="noopener noreferrer"&gt;technical roadmap&lt;/a&gt; for any platform team. When the development team is bottlenecked for infrastructure reasons, the organization loses out on critical first-mover advantages in the marketplace.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Poor engineering practices can lead to a deficit of our most important asset — the team.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We want our teammates to love what they do — &lt;a href="https://www.aha.io/company/careers" rel="noopener noreferrer"&gt;increasing employee joy&lt;/a&gt; is the aim and core of our company's existence. And it makes financial sense too. Replacing an engineer's institutional knowledge and expertise takes time. Attrition makes your hiring plan doubly difficult to achieve.&lt;/p&gt;

&lt;h2&gt;Need for speed&lt;/h2&gt;

&lt;p&gt;Focusing on optimization allows platform engineers to improve the company's service in noticeable ways. Speed is a feature because performance expectations mirror the delays we tolerate in a normal conversation. A response of 100ms feels instantaneous, but delays longer than one second start to interrupt the user's flow of thought. Beyond four seconds, the user can no longer smoothly interact with a system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Products in your daily workflow should be reliable and resilient — these two factors drive the happy use of software.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Failures can occur through human error, hardware malfunction, or &lt;a href="https://www.aha.io/roadmapping/guide/agile/what-is-issue-tracking" rel="noopener noreferrer"&gt;bugs&lt;/a&gt; in vendor packages. However, my experience has shown that a system that recovers quickly when it fails is viewed more favorably than one that fails infrequently but has a longer recovery time.&lt;/p&gt;

&lt;p&gt;A system in this case may refer not only to your software and hardware, but also your operations, engineers, and customer success teams. When failures are still within the realm of exceptional behavior and are fixed rapidly, we find those customer interactions are generally positive — even &lt;a href="https://www.aha.io/lovability" rel="noopener noreferrer"&gt;lovable&lt;/a&gt;. Customers who love your product will keep using it (reducing churn, increasing profit) and expand the use of the product within their own organization. Platform teams support the broader engineering organization by being prepared to respond quickly to the most debilitating failure modes. This is a core overlap with SRE teams as well.&lt;/p&gt;

&lt;h2&gt;Structuring engineering teams&lt;/h2&gt;

&lt;p&gt;If SRE teams are the defense and engineers are the offense, then platform teams are the midfielders. They save goals as often as they score them. Their presence makes the rest of the engineering organization more effective at their core objectives.&lt;/p&gt;

&lt;p&gt;However, keep in mind the time spent toiling. If the platform, infrastructure, QA, and tooling engineers spend over half of their time responding directly to incidents or on-call work, you may need a dedicated reliability team.&lt;/p&gt;

&lt;p&gt;If every week has a new emergent behavior that requires platform engineers to drop everything and fix it, it's time to adopt more proactive measures to build resilience, monitoring, and fault-tolerance. And when those tasks take up more than half the platform team's time, it's time to honor reality and break off a dedicated reliability engineering team.&lt;/p&gt;

&lt;p&gt;Platform and SRE teams do much of the same work — the division between them is more like a gradient than a clear line:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F270964cec5e0150f8845216b5c243786%2Fdevops.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F270964cec5e0150f8845216b5c243786%2Fdevops.jpeg" alt="devops"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the platform team is fully aligned with engineering and company goals, it can incorporate a healthy SRE culture that meets its needs. Focusing on observability, recovery, and reliability reduces costs and keeps the service dependable while the platform team works on enhancing the developer experience and building new infrastructure for future features or projects.&lt;/p&gt;

&lt;h3&gt;Grow your career and be happy&lt;/h3&gt;

&lt;p&gt;If this kind of alignment and environment sounds intriguing to you, come join us and &lt;a href="https://www.aha.io/company/careers/current-openings/#:~:text=engineering" rel="noopener noreferrer"&gt;work with the sharpest engineers&lt;/a&gt; I've ever had the pleasure of calling colleagues. Career happiness comes from doing meaningful work with motivated teammates and being appreciated for it. That is Aha! — a talented group of people changing how the best-known companies innovate and bring products to market.&lt;/p&gt;

</description>
      <category>hiring</category>
      <category>devops</category>
      <category>platform</category>
      <category>sre</category>
    </item>
    <item>
      <title>Optimizing with the PostgreSQL deterministic query planner</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Fri, 22 Apr 2022 19:08:23 +0000</pubDate>
      <link>https://dev.to/aha/optimizing-with-the-postgresql-deterministic-query-planner-5hfa</link>
      <guid>https://dev.to/aha/optimizing-with-the-postgresql-deterministic-query-planner-5hfa</guid>
      <description>&lt;h2&gt;Feed the planner, trust the plan&lt;/h2&gt;

&lt;p&gt;Of all the Aha! engineering tool expenses, the money I'm happiest to spend is on a big RDS instance running PostgreSQL. It powers our full-text search, our reporting, and even some of our analytics. If you run it at scale, I recommend dedicating a Friday to curling up somewhere cozy with a cup of coffee and reading the fine manual. It will be time well spent.&lt;/p&gt;

&lt;p&gt;PostgreSQL is an easy tool to start using. The defaults provide a fast, secure experience out of the box. However, once you scale to a certain size, performance can start to degrade in some cases. Most people blame the query planner when this happens and start trying to find ways to "trick" the planner into getting a high-performance query.&lt;/p&gt;

&lt;p&gt;The problem with these tricks is that they usually end up biting you later — after you've grown as a business for a couple of years and have far higher throughput on your database. If there's one thing to take away from this article, it is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The PostgreSQL query planner will &lt;em&gt;always&lt;/em&gt; discover the optimum query plan, given its statistical knowledge of your system and your data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We've discovered some tuning settings in the manual that give the planner the information it needs to do its job correctly. I'm going to cover three settings that we've adjusted over the last couple of years that have improved some of our query times by a factor of 20,000.&lt;/p&gt;

&lt;h2&gt;Set the right cost for page access&lt;/h2&gt;

&lt;p&gt;One decision the planner sometimes makes is to full-scan your table instead of using an index. This may seem counter-intuitive, but remember, the planner will &lt;em&gt;always&lt;/em&gt; discover the optimum plan. Using an index requires reading random pages from disk, whereas a full table scan gets to read pages sequentially, which is faster.&lt;/p&gt;

&lt;p&gt;For a table of a certain size, on a slow hard drive not already cached in RAM, with a query that will return a lot of rows — the full scan will be faster. The planner is right. That is, unless you're running a well-provisioned database (caching a large fraction of your data &lt;sup id="fnref1"&gt;1&lt;/sup&gt;) with a fast NVMe disk. (It's 2022 — these things are cheap; use one!) In that case, the index would be faster. How do we tell the planner that we have well-cached data and that the disk is really good at random reads?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/runtime-config-query.html" rel="noopener noreferrer"&gt;query planner's estimate&lt;/a&gt; of the relative speed difference between a random and a sequential read is determined by two settings: &lt;code&gt;random_page_cost&lt;/code&gt; and &lt;code&gt;seq_page_cost&lt;/code&gt;. The default values for these settings assume that a random read from disk is going to be 40x slower than a sequential read from disk and that 90% of your disk pages will be cached in memory. These defaults assume a modestly sized database server using a spinning-metal storage device.&lt;/p&gt;

&lt;p&gt;Using a fast NVMe drive, you don't have a 40x latency multiple for random access — that number is probably closer to 5–10x. Additionally, if your RAM cache rate is close to 99%, then the &lt;code&gt;random_page_cost&lt;/code&gt; parameter should be set to something like &lt;code&gt;1.05&lt;/code&gt;. (Scale the random-access slowdown by your cache-miss rate and add it to the sequential baseline: 1 + 5 × (1 − 0.99) = 1.05.)&lt;/p&gt;
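
&lt;p&gt;As a concrete sketch (the values here are illustrative, not recommendations), you can trial new costs in a single session and compare &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output before persisting anything. Note that on managed services such as RDS, these parameters are changed through a parameter group rather than &lt;code&gt;ALTER SYSTEM&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;-- Experiment in the current session only
SET random_page_cost = 1.05;
SET seq_page_cost = 1;

-- Persist server-wide once validated (requires superuser)
ALTER SYSTEM SET random_page_cost = 1.05;
SELECT pg_reload_conf();
&lt;/pre&gt;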

&lt;p&gt;After implementing these cost changes, we noticed PostgreSQL would stop deciding to full-scan tables unless the tables were very small. This led to a drastic improvement in our query performance around custom pivots.&lt;/p&gt;

&lt;h2&gt;Set up statistics for dependent columns&lt;/h2&gt;

&lt;p&gt;Let's assume the following query, with all three relevant columns indexed:&lt;/p&gt;

&lt;pre&gt;select count(*) from features where account_id=1 and product_id=2 and initiative_id=3&lt;/pre&gt;

&lt;p&gt;By default, PostgreSQL assumes that each column is independent of all other columns on your table. This can lead to PostgreSQL deciding to do a bitmap merge of all three indexes, assuming that you're going to get a different set of results for each index, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2Fc1760ac08fee9bb571541b069deae256%2Findependent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2Fc1760ac08fee9bb571541b069deae256%2Findependent.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in reality, the data looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F78131a955122b4f91aa260eba4ae2983%2Fdependent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F78131a955122b4f91aa260eba4ae2983%2Fdependent.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given an individual &lt;code&gt;product_id&lt;/code&gt; or &lt;code&gt;initiative_id&lt;/code&gt; on a feature, all features with that &lt;code&gt;initiative_id&lt;/code&gt; or &lt;code&gt;product_id&lt;/code&gt; will have the same &lt;code&gt;account_id&lt;/code&gt;. Moreover, features with the same &lt;code&gt;initiative_id&lt;/code&gt; will be highly likely to have the same &lt;code&gt;product_id&lt;/code&gt;. So how do we tell the planner that if we know the &lt;code&gt;initiative_id&lt;/code&gt;, we can basically ignore the &lt;code&gt;account_id&lt;/code&gt; and &lt;code&gt;product_id&lt;/code&gt; checks for the initial data fetch process?&lt;/p&gt;

&lt;p&gt;These statistical dependencies can be leveraged by the query planner to decide that only one index fetch on &lt;code&gt;initiative_id&lt;/code&gt; is necessary. That is the most selective index, so it will give us the smallest result set. The result is likely to be small and the rows that have the &lt;code&gt;initiative_id&lt;/code&gt; are likely to have the same &lt;code&gt;product_id&lt;/code&gt; and &lt;code&gt;account_id&lt;/code&gt;. So doing the one index lookup and then filtering the results in memory is almost assuredly faster than setting up a bitmap scan, which requires going out to disk for random page reads.&lt;/p&gt;

&lt;p&gt;To declare the dependency, create extended statistics on the table and then re-analyze it:&lt;/p&gt;

&lt;pre&gt;CREATE STATISTICS features_stats (dependencies)
  ON account_id, product_id, initiative_id FROM features;
ANALYZE features;
&lt;/pre&gt;

&lt;p&gt;PostgreSQL will sample those columns and keep track of the coefficient of correlation between their values, coming up with better query plans. In this example, the index on &lt;code&gt;initiative_id&lt;/code&gt; is going to be the most selective (largest total cardinality), which means that selecting on it will give us a small number of rows. Then, since all items with &lt;code&gt;initiative_id=3&lt;/code&gt; are probably going to have &lt;code&gt;product_id=2&lt;/code&gt; and are certainly going to have &lt;code&gt;account_id=1&lt;/code&gt;, the result set from the &lt;code&gt;initiative_id&lt;/code&gt; index can be filtered in memory, and nearly all of those rows will make it through the checks anyway.&lt;/p&gt;

&lt;p&gt;The example I describe here is not hypothetical. We found a pathological case where a query for one customer was taking 20 seconds. By adding dependent statistics to the table, the same query's time was reduced to less than a millisecond: a 20,000x improvement.&lt;/p&gt;

&lt;h2&gt;Beware the join collapse&lt;/h2&gt;

&lt;p&gt;Many candidates that I interview say something to the effect of "PostgreSQL falls over when you have too many joins." This is a commonly known phenomenon, but very few candidates know the exact reason for it.&lt;/p&gt;

&lt;p&gt;PostgreSQL's query planner by default will exhaust every possibility trying to determine the fastest way to execute a query you request. The complexity of this search grows exponentially with the number of joins. Attempting to plan the join of many tables together when the join order is not constrained could take longer than executing the actual query.&lt;/p&gt;

&lt;p&gt;To avoid this analysis paralysis, PostgreSQL also has a &lt;a href="https://www.postgresql.org/docs/12/geqo.html" rel="noopener noreferrer"&gt;genetic algorithm-based query optimizer&lt;/a&gt;, which attempts to find a heuristic solution. Most of the time, it's pretty good, but sometimes the results are pretty bad. The decision to use heuristic planning instead of deterministic planning is determined by &lt;code&gt;geqo_threshold&lt;/code&gt;. The amount of work the heuristic planner does is determined by &lt;code&gt;geqo_effort&lt;/code&gt;, a simple 1–10 scale that defaults to 5.&lt;/p&gt;

&lt;p&gt;Since the cost of a bad query is far more than the cost of some extra time planning, we've opted to increase our &lt;code&gt;geqo_threshold&lt;/code&gt; to 20. This covers the vast majority of our reporting queries. We're also looking into increasing our &lt;code&gt;geqo_effort&lt;/code&gt; value, since when we're joining more than 20 tables together, it's worth spending extra time planning to ensure that we get the query plan right. This is a continuing area of experimentation for us: Aha! has a lot of very small, optimized queries, but it also has some very large, gnarly ones that report on customers' data in exactly the way each customer desires. Striking a balance between these extremes appears to be a bit of an art form.&lt;/p&gt;

&lt;p&gt;Additionally, it pays to increase the values of &lt;code&gt;join_collapse_limit&lt;/code&gt; and &lt;code&gt;from_collapse_limit&lt;/code&gt;. These parameters determine the maximum number of tables for which PostgreSQL is willing to optimize the join order. When a query touches many tables, the extra time spent getting the join order right is repaid many times over while executing the query.&lt;/p&gt;
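
&lt;p&gt;Sketching that tuning concretely (we set &lt;code&gt;geqo_threshold&lt;/code&gt; to 20 as described above; the collapse-limit values shown here are illustrative, and managed services like RDS take these through a parameter group):&lt;/p&gt;

&lt;pre&gt;-- Plan joins deterministically for up to 20 relations
ALTER SYSTEM SET geqo_threshold = 20;

-- Let the planner reorder joins across larger FROM/JOIN lists
ALTER SYSTEM SET join_collapse_limit = 20;
ALTER SYSTEM SET from_collapse_limit = 20;

SELECT pg_reload_conf();
&lt;/pre&gt;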

&lt;h2&gt;Feed the planner, trust the plan&lt;/h2&gt;

&lt;p&gt;The deterministic query planner will &lt;em&gt;always&lt;/em&gt; find the optimal result. If the planner is generating bad plans, it means you need to find ways to give it better statistical data. Instead of "tricks," use the prescribed methods that come with PostgreSQL to instruct the planner on the shape of your data. You'll be very happy with the result. And seriously, read the manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sign up for a free trial of Aha! Develop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aha! Develop is a fully extendable agile development tool. Prioritize the backlog, estimate work, and plan sprints. If you are interested in an integrated &lt;a href="https://www.aha.io/suite-overview" rel="noopener noreferrer"&gt;product development&lt;/a&gt; approach, use &lt;a href="https://www.aha.io/product/integrations/develop" rel="noopener noreferrer"&gt;Aha! Roadmaps and Aha! Develop&lt;/a&gt; together. Sign up for a &lt;a href="https://www.aha.io/trial" rel="noopener noreferrer"&gt;free 30-day trial&lt;/a&gt; or &lt;a href="https://www.aha.io/live-demo" rel="noopener noreferrer"&gt;join a live demo&lt;/a&gt; to see why more than 5,000 companies trust our software to build lovable products and be happy doing it.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;You can check your database cache rate with the following query. You want your cache hit rate to be near 99%. If it's lower than 95% for heavily accessed tables, consider increasing your database's RAM.&lt;/p&gt;
&lt;pre&gt;SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;
&lt;/pre&gt; ↩
&lt;/li&gt;

&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Log management using AWS Athena</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Tue, 15 Dec 2020 16:10:04 +0000</pubDate>
      <link>https://dev.to/aha/log-management-using-aws-athena-34ne</link>
      <guid>https://dev.to/aha/log-management-using-aws-athena-34ne</guid>
      <description>&lt;p&gt;Many SaaS providers will happily sell you a turn-key log management system, and in the early days of a startup when you value time over money, purchasing one makes a lot of sense. AWS Cloudwatch logs are another good solution if you are already hosted on AWS. With any of these paid services, the cost is eventually going to start raising a few eyebrows and it might become prudent to build your own in-house. The Elasticsearch-Logstash-Kibana cocktail is the most popular deployment for this, but one SaaS provider I found calculated the TCO over running a moderately-sized ELK stack for a year at nearly $500,000!&lt;/p&gt;

&lt;p&gt;We've developed our own system using Fluentd, AWS S3, and AWS Athena to ingest, store, and query our logs. With this combination, I estimate that ingesting 640GB per day of log data and keeping it searchable over the course of an entire year would cost only $79,248 — a savings of 83 percent, not to mention the time we would otherwise spend monitoring, scaling, and operating an ELK cluster. The performance of our system is good; queries that return the data for a single request complete in less than 3 seconds. Searching the full text of our logs for specific markers or references can take longer, but it's still very manageable.&lt;/p&gt;

&lt;p&gt;This system not only gives us fast access to the logs we collect, but it also enables us to log additional information per request and keep logs searchable for longer than we could if we had to pay more for storage. Having this additional data is invaluable for helping us solve problems and answer questions in a rapid and responsive way.&lt;/p&gt;

&lt;p&gt;In this post, I'm going to walk through what we did, provide some code and discussion to show our approach, and talk about opportunities for future work that have opened up. I'm also going to mention some "gotchas" and roadblocks that we encountered along the way.&lt;/p&gt;

&lt;h2&gt;Building the system&lt;/h2&gt;

&lt;p&gt;There are four stages to a logging system: ingestion, aggregation, storage, and query. We're going to talk about them in reverse order.&lt;/p&gt;

&lt;h3&gt;Query&lt;/h3&gt;

&lt;p&gt;Assuming we have our logs in a structured format, what would be an ideal interface for accessing them? While Kibana is the preferred choice for many companies, it provides a lot of things that we don't really need. Pretty charts and graphs are nice, but when debugging an error state, what we really need is quick access to our logs. Additionally, exposing something that resembles SQL allows writing sophisticated queries to answer questions as they emerge.&lt;/p&gt;

&lt;p&gt;Amazon Athena is the technology we settled on to query our logs. Athena searches data that already exists in S3, and it has a few properties that make it ideal for the task at hand:&lt;/p&gt;

&lt;p&gt;Athena is really inexpensive. We pay only for the data that we scan in the process of completing a query: only $5 per terabyte at the time of writing.&lt;/p&gt;

&lt;p&gt;Athena understands and can query the Parquet file format out of the box. Parquet is column-oriented, which means Athena can scan only the columns we are searching. This is a huge boon to us since the vast majority of our searches are "get me all of the logs for this single request id." That's only a 128-bit UUID, and Athena is able to quickly determine which chunks of data contain the value under search, saving us both time and money. Additionally, Parquet is compressible, and Athena is able to query it in its compressed format, saving even more money on scanning costs.&lt;/p&gt;
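
&lt;p&gt;For illustration, a Parquet-backed logs table in Athena can be declared with Hive-style DDL (the table name, columns, partition scheme, and S3 location below are hypothetical; our real schema has more fields):&lt;/p&gt;

&lt;pre&gt;CREATE EXTERNAL TABLE request_logs (
  request_id string,
  pod        string,
  severity   string,
  message    string,
  logged_at  timestamp
)
PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://example-log-bucket/logs/';

-- Column-oriented storage means this scans only the columns
-- referenced, not the whole log archive
SELECT logged_at, message
FROM request_logs
WHERE day = '2020-12-01'
  AND request_id = 'hypothetical-request-uuid'
ORDER BY logged_at;
&lt;/pre&gt;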

&lt;p&gt;We wrote a small custom log search engine to act as a front-end to Athena's interface for our team. This interface has fields, basic query parameters, and a date selector to help generate the SQL-like query language that Athena uses, as well as a few endpoints that kick off a query immediately based on request uuid, pod, and time. We embed links to our tool in Datadog for rapid log analysis of poorly performing endpoints.&lt;/p&gt;
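&lt;p&gt;For instance, the front-end might emit SQL like the sketch below generates. This is illustrative only: the &lt;code&gt;app_logs&lt;/code&gt; table and its column names are stand-ins, not our production schema. The point is that constraining the year/month/day partition keys keeps Athena from scanning anything outside the window of interest.&lt;/p&gt;

```ruby
# Hedged sketch of the SQL our search front-end generates for the common
# "all logs for one request id" lookup. Table and column names
# (app_logs, request_uuid, and friends) are illustrative assumptions.
def request_logs_sql(request_uuid, day)
  [
    "SELECT timestamp, pod, severity, message",
    "FROM app_logs",
    "WHERE request_uuid = '#{request_uuid}'",
    "  AND year = '#{day.year}'",
    "  AND month = '#{format('%02d', day.month)}'",
    "  AND day = '#{format('%02d', day.day)}'",
    "ORDER BY timestamp"
  ].join("\n")
end

puts request_logs_sql("0b6cdbaa-8a42-4a6b-9e2c-0f95d0e2a111", Time.utc(2022, 9, 1))
```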

&lt;h3&gt;Storage&lt;/h3&gt;

&lt;p&gt;As mentioned, Athena queries data that is stored in S3. It's a durable, reliable medium that gives us millisecond access to our data when we need it. S3 also comes with another benefit: we can set up lifecycle rules to move files to lower-cost tiers over time. These lifecycle rules take just a few clicks (or a few lines of AWS CloudFormation), and then we're done.&lt;/p&gt;
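&lt;p&gt;As a sketch of how little CloudFormation this takes (the bucket name and day counts here are illustrative, not our production values):&lt;/p&gt;

```yaml
LogBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: example-structured-logs   # illustrative name
    LifecycleConfiguration:
      Rules:
        - Id: TierDownOldLogs
          Status: Enabled
          Transitions:
            - StorageClass: STANDARD_IA   # Infrequent Access after 60 days
              TransitionInDays: 60
```

&lt;p&gt;A rule like this pairs well with the access pattern of debugging logs, since almost all queries target recent data.&lt;/p&gt;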

&lt;h3&gt;Aggregation&lt;/h3&gt;

&lt;p&gt;Aggregating the logs from our services turned out to require some work. Instead of using an Amazon product off the shelf, we opted to use Fluentd to ingest, aggregate, and then process our logs. The configuration itself is very simple. We created a Docker image that built Fluentd with &lt;code&gt;libjemalloc&lt;/code&gt; to keep memory usage in check and &lt;code&gt;lib-arrow&lt;/code&gt; to generate the compressed data in Parquet format. This container could then be deployed to our standard ECS cluster and treated like any other service.&lt;/p&gt;
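&lt;p&gt;A trimmed sketch of what that Fluentd configuration can look like. The plugin options shown are illustrative, and the Parquet output assumes the Arrow-enabled build described above, so treat this as a shape rather than our exact config:&lt;/p&gt;

```
&lt;source&gt;
  @type forward          # accept records from application fluent-logger clients
  port 24224
&lt;/source&gt;

&lt;match app.**&gt;
  @type s3               # flush buffered chunks to S3 for Athena to query
  s3_bucket example-structured-logs
  path logs/service=app/
  store_as parquet       # assumes the Parquet support from our custom build
  &lt;buffer time&gt;
    timekey 3600         # chunk boundary, tuned toward Athena's sweet spot
  &lt;/buffer&gt;
&lt;/match&gt;
```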

&lt;h3&gt;Ingestion&lt;/h3&gt;

&lt;p&gt;Now that we have our query, storage, and aggregation services ready, the last step is to get data from our application into the Fluentd service. There were a couple of options here. One was to set the Docker logging provider to Fluentd and point it at the aggregation cluster we deployed; that would let us log to stdout and keep the logger relatively simple. Instead, we decided it was worth making our application aware of the aggregation service and installing a client logging library for Fluentd. Fluentd maintains a Ruby version of its logger, &lt;a href="https://github.com/fluent/fluent-logger-ruby"&gt;fluent-logger-ruby&lt;/a&gt;, which works as a drop-in replacement for the logger in Rails applications, but the approach is language-agnostic and could be used anywhere.&lt;/p&gt;
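&lt;p&gt;In the application, this amounts to posting structured records rather than formatting strings. Below is a hedged sketch: the field names and tag are illustrative, and the commented lines show roughly how a record would be posted with the fluent-logger gem:&lt;/p&gt;

```ruby
require "securerandom"
require "time"

# Sketch of how the application assembles a structured record before
# posting it to Fluentd. Field names here are illustrative assumptions.
# With the fluent-logger gem, the record would be sent roughly like:
#   Fluent::Logger::FluentLogger.open(nil, host: "aggregator", port: 24224)
#   Fluent::Logger.post("app.request", record)
def structured_record(message, severity, request_uuid)
  {
    timestamp: Time.now.utc.iso8601,
    severity: severity,
    message: message,
    request_uuid: request_uuid,  # ties every line back to one user request
    pod: ENV.fetch("POD_NAME", "unknown")
  }
end

record = structured_record("checkout complete", "info", SecureRandom.uuid)
puts record[:message]
```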

&lt;h2&gt;Gotchas&lt;/h2&gt;

&lt;h3&gt;Who logs the logger?&lt;/h3&gt;

&lt;p&gt;Unfortunately, there is one area where we cannot rely on this system for logging — it's not able to log to itself! The ECS agent requires that the new container be able to log to its log provider immediately upon boot. This is a problem when the logging service itself needs to boot, initialize, and then pass health checks to start taking requests.&lt;/p&gt;

&lt;p&gt;One way around this would be to bake a "forwarding" ECS service into the AMI for every ECS cluster instance. This forwarder could receive and buffer logs, then send them on to a private hostname configured to point to our new logging service.&lt;/p&gt;

&lt;h3&gt;Ensure proper chunk size&lt;/h3&gt;

&lt;p&gt;Athena has a "sweet spot." Because of the columnar format and Athena's ability to read a Parquet file's headers before fetching the whole object, Athena works best when the scanned files are around 200MB in size. Once they get larger, Athena misses out on opportunities to parallelize. When they are smaller, each query worker spends a lot of time reading and parsing the initial headers instead of scanning the actual data. We had to tweak our time keys and chunk sizes to get this just right.&lt;/p&gt;

&lt;p&gt;There is a balancing act between chunk size and capacity. The more aggregation containers we run, the more log pressure we need to reach the right saturation to create files in the sweet spot. In our experience, the cluster really wants to run at about 80% CPU. Higher than that and we risk the TCP log forwarding protocol applying upstream back-pressure to the services logging to it. Lower than that and we end up creating log files that are not big enough to get full performance out of Athena. Since our use case is an internal developer-facing tool, and slowing down our main application is not worth slightly better throughput, we have opted to slightly over-provision our containers. A second Fluentd container layer serving as a forwarder between our application servers and our aggregation containers could buffer this substantially; that would give us great performance without risk of back-pressure, at the cost of slightly more complexity and overhead.&lt;/p&gt;
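&lt;p&gt;A back-of-envelope way to think about that tuning (every number below is hypothetical, chosen only to illustrate the math):&lt;/p&gt;

```ruby
# Rough estimate of the Parquet chunk size each aggregation container
# flushes, given total log pressure and the flush window. All inputs
# here are made up for illustration.
def chunk_mb(total_mb_per_hour, containers, timekey_hours)
  total_mb_per_hour * timekey_hours / containers.to_f
end

# Roughly 640GB/day of logs is about 27,000 MB/hour.
hourly = chunk_mb(27_000, 35, 1)       # about 771MB per chunk: too big
quarter = chunk_mb(27_000, 35, 0.25)   # about 193MB: near the sweet spot
puts format("hourly: %.0f MB, 15-minute: %.0f MB", hourly, quarter)
```

&lt;p&gt;Shrinking the timekey (or adding more aggregation capacity) trades chunk size against flush frequency, which is exactly the knob we ended up turning.&lt;/p&gt;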

&lt;h3&gt;Partition projection&lt;/h3&gt;

&lt;p&gt;We first attempted to create an AWS Glue table for our data stored in S3 and then have a Lambda crawler automatically create Glue partitions for Athena to use. This turned out to be a bad approach; partition projection served us far better.&lt;/p&gt;

&lt;p&gt;Partition projection tells Athena about the shape of the data in S3, which keys are partition keys, and what the file structure in S3 looks like. We partition our data by service, shard, year, month, day, and hour. In our testing, partition projection was essential to getting full value out of Athena. Without it, many of our queries took up to thirty seconds while Athena consulted our AWS Glue tables to determine which data to scan. With partition projection, no such lookup is necessary, and the query can begin immediately, scanning only the relevant data.&lt;/p&gt;
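&lt;p&gt;Concretely, partition projection is configured through Athena table properties. The sketch below is illustrative (the bucket, table name, and ranges are made up, and the shard key is omitted for brevity):&lt;/p&gt;

```sql
ALTER TABLE app_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.service.type' = 'enum',
  'projection.service.values' = 'app,worker',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://example-structured-logs/logs/service=${service}/year=${year}/month=${month}/day=${day}/hour=${hour}'
);
```

&lt;p&gt;With properties like these, Athena can compute the candidate S3 prefixes for a query directly, with no Glue partition lookup at all.&lt;/p&gt;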

&lt;p&gt;Keep in mind that partitions are great for scoping down the amount of data to scan, but by partitioning too aggressively, we noticed our chunk sizes dropping below the "sweet spot" discussed above. We saw a similar problem when partitioning our data by EC2 instance: counterintuitively, clumping all hosts together made our queries a lot faster.&lt;/p&gt;

&lt;h2&gt;Future use cases&lt;/h2&gt;

&lt;p&gt;The power of this system is enormous. We push a staggering amount of log data into it, pay relatively little for it, and query it in seconds. Additionally, the ingestion is managed by a service that we run and provision ourselves, so it is easy to scale down when demand is low and scale up when we need the additional horsepower. But now that we have this system running, other use cases have presented themselves that look promising:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt; — We are able to log to Fluentd with a special key for analytics events that we want to later ETL and send to Redshift. We can create a new rule in our Fluentd config to take the analytics tag and write it into the proper bucket, either for later Athena queries to export to Redshift or for Redshift itself to query directly from S3 using Redshift Spectrum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request correlation&lt;/strong&gt; — At Aha!, every request to our service is issued a request_uuid. When we queue a background job or make a call to another service, those services can all log using the same framework and include the originating request_uuid in the appropriate field. This means we can trace everything that was involved in or triggered by a single user request across all of our backend services through a single query interface, and interlace all of the logs together in one timeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finding contemporary logs&lt;/strong&gt; — Using this system, it is trivial to search for all logs from all requests within a very small timestamp window. If we know that we had a deadlock around a certain row in a database, this would allow us to find all processes that attempted to update that row and then trace back to the controller action that caused the problem.&lt;/li&gt;
&lt;/ol&gt;
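&lt;p&gt;The request correlation case is especially cheap once every service logs through the same pipeline. Here is an illustrative sketch (the service names and in-memory sink are stand-ins for real services and the Fluentd forward port) of threading one request_uuid through a web request and the background job it triggers:&lt;/p&gt;

```ruby
require "securerandom"

# Illustrative sketch: every record carries the originating request_uuid,
# so a single Athena query can interlace web, worker, and downstream
# service logs into one timeline. LOG_SINK stands in for the Fluentd
# forward port; the service and event names are made up.
LOG_SINK = []

def log_event(service, request_uuid, message)
  LOG_SINK.push({ service: service, request_uuid: request_uuid, message: message })
end

def handle_request(request_uuid)
  log_event("web", request_uuid, "starting checkout")
  enqueue_job(request_uuid)  # pass the originating uuid along, never a new one
end

def enqueue_job(request_uuid)
  log_event("worker", request_uuid, "processing payment")
end

uuid = SecureRandom.uuid
handle_request(uuid)
services = LOG_SINK.select { |e| e[:request_uuid] == uuid }.map { |e| e[:service] }
puts services.join(",")  # prints "web,worker"
```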

&lt;p&gt;If finding innovative solutions to problems like this sounds interesting, &lt;a href="https://www.aha.io/company/careers/current-openings"&gt;come work with us&lt;/a&gt; as we build the world's #1 product roadmapping software. We are not only creating a powerful platform for the biggest companies in the world to plan and manage their strategy, we are also building sophisticated tooling to help our talented and friendly development team continue to excel.&lt;/p&gt;

&lt;h4&gt;Footnotes&lt;/h4&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;1: Assuming 35 Fargate container instances with 1.25 vCPU and 2GB of memory each, 640GB ingested per day, 60 days of S3 Standard storage, 335 days of Infrequent Access storage, and 100 Athena queries per day, each covering 100GB scanned. (This is a generous allowance.) Reserved ECS cluster instances would drop this cost further.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>logging</category>
    </item>
  </channel>
</rss>
