<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Bartlow</title>
    <description>The latest articles on DEV Community by Alexander Bartlow (@alexbartlow).</description>
    <link>https://dev.to/alexbartlow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F540963%2F3e0c33ff-2d99-4b6b-8269-c5372dcca4c0.jpeg</url>
      <title>DEV Community: Alexander Bartlow</title>
      <link>https://dev.to/alexbartlow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexbartlow"/>
    <language>en</language>
    <item>
      <title>Platform Engineering vs. Site Reliability Engineering</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Wed, 28 Sep 2022 17:33:52 +0000</pubDate>
      <link>https://dev.to/aha/platform-engineering-vs-site-reliability-engineering-3ei5</link>
      <guid>https://dev.to/aha/platform-engineering-vs-site-reliability-engineering-3ei5</guid>
      <description>&lt;p&gt;Striker and goalie. Offense and defense. Deploy and recalibrate. Many disciplines have dichotomy between the tasks that accomplish a goal and tasks that protect the ability to do so. Delivering features, building out user flows, and optimizing conversion rates constitute the "offense" of many software companies. All of these activities are directly tied to the goal of growing the business. Defensive tasks encompass security, operations, and disaster recovery. These tasks are designed to prevent loss and empower the offense.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An engineering team cannot run optimal code or move the business forward with a weak infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the software world, this alliance is realized through the engineering and operations teams. Some organizations refer to these as the "profit-center" and "cost-center," respectively.&lt;/p&gt;

&lt;h2&gt;Traditional site reliability engineering&lt;/h2&gt;

&lt;p&gt;Though symbiotic, the relationship between engineering and operations may encounter pitfalls at more complex organizations. The introduction to Google's &lt;a href="https://sre.google/sre-book/introduction/" rel="noopener noreferrer"&gt;Site Reliability Engineering: How Google Runs Production Systems&lt;/a&gt; describes how the two groups may have varied goals, backgrounds, skill sets, incentives, vocabulary, and assumptions. And when left unchecked, strife can form as organizational effectiveness breaks down.&lt;/p&gt;

&lt;p&gt;Google's solution to this problem is the discipline of site reliability engineering (SRE). Engineers are tasked with writing software that creates a stable environment for deploying and operating the infrastructure. Site reliability engineers at Google work with development teams on a sort of contract — SRE runs the application only if development follows the practices needed to make the application run efficiently. If the development team does not hold up its end of the bargain, then the SREs will decline to hold up theirs and work with teams that do.&lt;/p&gt;

&lt;p&gt;This works well for Google, but turning down work from one part of our engineering team is not an option for the Aha! team. We have no site reliability engineers on staff at Aha! — because we found a better way.&lt;/p&gt;

&lt;h2&gt;Unified engineering goals&lt;/h2&gt;

&lt;p&gt;We solved the problem of the development-operations split by defining clear goals for our dedicated platform team. Our platform team has goals similar to a traditional operations team, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operability — How do we run the system on a day-to-day basis and make routine changes?&lt;/li&gt;
&lt;li&gt;Reliability — How do we ensure the system is available for our customers?&lt;/li&gt;
&lt;li&gt;Resilience — When the system is unavailable, how do we get it running again?&lt;/li&gt;
&lt;li&gt;Observability — How do we ensure that we know before our customers when the system needs help?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have also adopted six additional goals to help us create a happy, effective, and high-performing engineering team:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inverse toil&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sre.google/sre-book/eliminating-toil/#:~:text=Toil%20defined" rel="noopener noreferrer"&gt;Toil&lt;/a&gt; refers to the time engineers spend doing repetitive work that doesn't deliver real value to customers. While the amount of manual work required to operate production is higher for the platform team than for the rest of the engineering team, we still try to keep our time spent toiling to a minimum. For most of our platform engineers, it's less than 20% of their time. Google's SREs target 50% or less time spent toiling, so we believe that our investments in automation have been very effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Release velocity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How long does it take a pull request to be deployed after being flagged as ready? While we attempt to optimize the time taken in continuous integration, test automation is not a core focus for us. Our developers closest to the code are best suited to test it, so we leave that responsibility to them. We instead focus on the deployment pipeline — building the core images and getting them into production. Simplifying that process allows it to be done many times a day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to deployment speed, some changes require a gradual rollout or an experimental stage to validate the changes we made. The platform team manages the feature flag functionality at Aha! and continues to extend it to provide support for more use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer productivity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The platform team maintains a &lt;a href="https://www.aha.io/engineering/articles/log-management-using-aws-athena" rel="noopener noreferrer"&gt;robust logging system&lt;/a&gt; to allow our engineers to troubleshoot problems in production quickly and effectively without direct access to customer data. We maintain the Docker setup for local development to give engineers a stable way to run the application and all of its dependencies. Our &lt;a href="https://www.aha.io/engineering/articles/dynamic-stagings" rel="noopener noreferrer"&gt;dynamic staging setup&lt;/a&gt; allows us to get new features in front of the &lt;a href="https://www.aha.io/roadmapping/guide/product-management/work-with-engineers" rel="noopener noreferrer"&gt;product&lt;/a&gt;, marketing, and Customer Success teams early in the development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All engineers should attempt to create a performant application. Our focus on monitoring and observability allows us to find problems in production and start the troubleshooting process. Some features may need to go back into development for optimization, but we can often solve the problem through algorithmic tweaks, indexing changes, or other &lt;a href="https://www.aha.io/engineering/articles/optimizing-with-the-postgresql-deterministic-query-planner" rel="noopener noreferrer"&gt;database management techniques&lt;/a&gt;. Our experience looking into these issues gives us a sense of where additional caching, reducing object allocations, or other techniques could be used to bolster end-user performance.&lt;/p&gt;

&lt;p&gt;These goals and techniques are also shared by the &lt;a href="https://www.atlassian.com/devops/what-is-devops/history-of-devops#:~:text=devops%20movement" rel="noopener noreferrer"&gt;DevOps movement&lt;/a&gt;, which attempts to push all the operations work back onto the engineering team. However, developing a &lt;a href="https://www.aha.io/engineering/articles/from-one-many-building-a-product-suite-with-a-monolith" rel="noopener noreferrer"&gt;large monolithic Rails application&lt;/a&gt; with a robust and modern front-end framework encompasses two very deep skill sets. Asking all of our engineers to also be AWS, networking, and database engineers is asking far too much.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our platform team serves and empowers the rest of the engineering organization. All of our goals unite to serve the company vision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Making complicated problems simple&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;New features regularly demand more than our infrastructure can currently do, and solving those complex problems is part of how we empower the rest of engineering. That empowerment entails building out all-new infrastructure and paradigms that can be used to solve tomorrow's problems. We've made a massive effort toward integrating &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; into our infrastructure this year, which is typical of this kind of investment. By having some space away from product-facing work, we're able to think of our product engineers as internal customers. We can build out tools and infrastructure that will support the development of future features and products.&lt;/p&gt;

&lt;h2&gt;Platform for profit&lt;/h2&gt;

&lt;p&gt;Remember the profit-center vs. cost-center distinction? New development generally gets billed as profit-generating, but the operations side of an organization has the unique ability to drive costs down. This is done through fine-tuning the spend on cloud resources, optimizing execution to require less of those resources in the first place, and selecting time-saving technologies.&lt;/p&gt;

&lt;p&gt;Infrastructure budget is a vital component of the &lt;a href="https://www.aha.io/roadmapping/guide/roadmap/technology-roadmap" rel="noopener noreferrer"&gt;technical roadmap&lt;/a&gt; for any platform team. When the development team is bottlenecked for infrastructure reasons, the organization loses out on critical first-mover advantages in the marketplace.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Poor engineering practices can lead to a deficit of our most important asset — the team.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We want our teammates to love what they do — &lt;a href="https://www.aha.io/company/careers" rel="noopener noreferrer"&gt;increasing employee joy&lt;/a&gt; is the aim and core of our company's existence. And it makes financial sense too. Replacing an engineer's institutional knowledge and expertise takes time. Attrition makes your hiring plan doubly difficult to achieve.&lt;/p&gt;

&lt;h2&gt;Need for speed&lt;/h2&gt;

&lt;p&gt;Focusing on optimization allows platform engineers to improve the company's service in noticeable ways. Speed is a feature because performance expectations mirror the delays we tolerate in a normal conversation. A response of 100ms feels instantaneous, but delays longer than one second start to interrupt the user's flow of thought. Beyond four seconds, the user can no longer smoothly interact with a system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Products in your daily workflow should be reliable and resilient — these two factors drive the happy use of software.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Failures can occur through human error, hardware malfunction, or &lt;a href="https://www.aha.io/roadmapping/guide/agile/what-is-issue-tracking" rel="noopener noreferrer"&gt;bugs&lt;/a&gt; in vendor packages. However, my experience has shown that a system that recovers quickly when it fails is viewed more favorably than one that fails infrequently but has a longer recovery time.&lt;/p&gt;

&lt;p&gt;A system in this case may refer not only to your software and hardware, but also your operations, engineers, and customer success teams. When failures are still within the realm of exceptional behavior and are fixed rapidly, we find those customer interactions are generally positive — even &lt;a href="https://www.aha.io/lovability" rel="noopener noreferrer"&gt;lovable&lt;/a&gt;. Customers who love your product will keep using it (reducing churn, increasing profit) and expand the use of the product within their own organization. Platform teams support the broader engineering organization by being prepared to respond quickly to the most debilitating failure modes. This is a core overlap with SRE teams as well.&lt;/p&gt;

&lt;h2&gt;Structuring engineering teams&lt;/h2&gt;

&lt;p&gt;If SRE teams are the defense and engineers are the offense, then platform teams are the midfielders. They save goals as often as they score them. Their presence makes the rest of the engineering organization more effective at their core objectives.&lt;/p&gt;

&lt;p&gt;However, keep in mind the time spent toiling. If the platform, infrastructure, QA, and tooling engineers spend over half of their time responding directly to incidents or on-call work, you may need a dedicated reliability team.&lt;/p&gt;

&lt;p&gt;If every week has a new emergent behavior that requires platform engineers to drop everything and fix it, it's time to adopt more proactive measures to build resilience, monitoring, and fault-tolerance. And when those tasks take up more than half the platform team's time, it's time to honor reality and break off a dedicated reliability engineering team.&lt;/p&gt;

&lt;p&gt;Platform and SRE teams do much of the same work — the division between them is more like a gradient than a clear line:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F270964cec5e0150f8845216b5c243786%2Fdevops.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F270964cec5e0150f8845216b5c243786%2Fdevops.jpeg" alt="devops"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the platform team is fully aligned with engineering and company goals, it can incorporate a healthy SRE culture that meets its needs. Focusing on observability, recovery, and reliability reduces costs and keeps the service dependable while the platform team works on enhancing the developer experience and building new infrastructure for future features or projects.&lt;/p&gt;

&lt;h3&gt;Grow your career and be happy&lt;/h3&gt;

&lt;p&gt;If this kind of alignment and environment sounds intriguing to you, come join us and &lt;a href="https://www.aha.io/company/careers/current-openings/#:~:text=engineering" rel="noopener noreferrer"&gt;work with the sharpest engineers&lt;/a&gt; I've ever had the pleasure of calling colleagues. Career happiness comes from doing meaningful work with motivated teammates and being appreciated for it. That is Aha! — a talented group of people changing how the best-known companies innovate and bring products to market.&lt;/p&gt;

</description>
      <category>hiring</category>
      <category>devops</category>
      <category>platform</category>
      <category>sre</category>
    </item>
    <item>
      <title>Optimizing with the PostgreSQL deterministic query planner</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Fri, 22 Apr 2022 19:08:23 +0000</pubDate>
      <link>https://dev.to/aha/optimizing-with-the-postgresql-deterministic-query-planner-5hfa</link>
      <guid>https://dev.to/aha/optimizing-with-the-postgresql-deterministic-query-planner-5hfa</guid>
      <description>&lt;h2&gt;Feed the planner, trust the plan&lt;/h2&gt;

&lt;p&gt;Of all the Aha! engineering tool expenses, the money I'm happiest to spend is on a big RDS instance running PostgreSQL. It powers our full-text search, our reporting, and even some of our analytics. If you run it at scale, I recommend dedicating a Friday to curling up somewhere cozy with a cup of coffee and reading the fine manual. It will be time well spent.&lt;/p&gt;

&lt;p&gt;PostgreSQL is an easy tool to start using. The defaults provide a fast, secure experience out of the box. However, once you scale to a certain size, performance can start to degrade in some cases. Most people blame the query planner when this happens and start trying to find ways to "trick" the planner into getting a high-performance query.&lt;/p&gt;

&lt;p&gt;The problem with these tricks is that they usually end up biting you later — after you've grown as a business for a couple of years and have far higher throughput on your database. If there's one thing to take away from this article, it is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The PostgreSQL query planner will &lt;em&gt;always&lt;/em&gt; discover the optimum query plan, given its statistical knowledge of your system and your data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We've discovered some tuning settings in the manual that give the planner the information it needs to do its job correctly. I'm going to cover three settings that we've adjusted over the last couple of years that have improved some of our query times by a factor of 20,000.&lt;/p&gt;

&lt;h2&gt;Set the right cost for page access&lt;/h2&gt;

&lt;p&gt;One decision the planner sometimes makes is to full-scan your table instead of using an index. This may seem counter-intuitive, but remember, the planner will &lt;em&gt;always&lt;/em&gt; discover the optimum plan. Using an index requires reading random pages from disk, whereas a full table scan gets to read pages sequentially, which is faster.&lt;/p&gt;

&lt;p&gt;For a table of a certain size, on a slow hard drive not already cached in RAM, with a query that will return a lot of rows — the full scan will be faster. The planner is right. That is, unless you're running a well-provisioned database (caching a large fraction of your data &lt;sup id="fnref1"&gt;1&lt;/sup&gt;) with a fast NVMe disk. (It's 2022 — these things are cheap; use one!) In that case, the index would be faster. How do we tell the planner that we have well-cached data and that the disk is really good at random reads?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/runtime-config-query.html" rel="noopener noreferrer"&gt;query planner's estimate&lt;/a&gt; of the relative speed difference between a random and a sequential read is determined by two settings: &lt;code&gt;random_page_cost&lt;/code&gt; and &lt;code&gt;seq_page_cost&lt;/code&gt;. The default values for these settings assume that a random read from disk is going to be 40x slower than a sequential read from disk and that 90% of your disk pages will be cached in memory. These defaults assume a modestly sized database server using a spinning-metal storage device.&lt;/p&gt;

&lt;p&gt;Using a fast NVMe drive, you don't have a 40x latency multiple for random access — that number is probably closer to 5–10x. Additionally, if your RAM cache rate is close to 99%, then the &lt;code&gt;random_page_cost&lt;/code&gt; parameter should be set to something like &lt;code&gt;1.05&lt;/code&gt;. (Scale the random-access slowdown by your cache-miss rate and add it to the sequential baseline: 1 + 5 × (1 − 0.99) = 1.05.)&lt;/p&gt;
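
&lt;p&gt;As a concrete sketch (the values here are illustrative, not recommendations), you can trial new costs in a single session and compare &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output before persisting anything. Note that on managed services such as RDS, these parameters are changed through a parameter group rather than &lt;code&gt;ALTER SYSTEM&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;-- Experiment in the current session only
SET random_page_cost = 1.05;
SET seq_page_cost = 1;

-- Persist server-wide once validated (requires superuser)
ALTER SYSTEM SET random_page_cost = 1.05;
SELECT pg_reload_conf();
&lt;/pre&gt;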

&lt;p&gt;After implementing these cost changes, we noticed PostgreSQL would stop deciding to full-scan tables unless the tables were very small. This led to a drastic improvement in our query performance around custom pivots.&lt;/p&gt;

&lt;h2&gt;Set up statistics for dependent columns&lt;/h2&gt;

&lt;p&gt;Let's assume the following query, with all three relevant columns indexed:&lt;/p&gt;

&lt;pre&gt;select count(*) from features where account_id=1 and product_id=2 and initiative_id=3&lt;/pre&gt;

&lt;p&gt;By default, PostgreSQL assumes that each column is independent of all other columns on your table. This can lead to PostgreSQL deciding to do a bitmap merge of all three indexes, assuming that you're going to get a different set of results for each index, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2Fc1760ac08fee9bb571541b069deae256%2Findependent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2Fc1760ac08fee9bb571541b069deae256%2Findependent.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in reality, the data looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F78131a955122b4f91aa260eba4ae2983%2Fdependent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aha.io%2F78131a955122b4f91aa260eba4ae2983%2Fdependent.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given an individual &lt;code&gt;product_id&lt;/code&gt; or &lt;code&gt;initiative_id&lt;/code&gt; on a feature, all features with that &lt;code&gt;initiative_id&lt;/code&gt; or &lt;code&gt;product_id&lt;/code&gt; will have the same &lt;code&gt;account_id&lt;/code&gt;. Moreover, features with the same &lt;code&gt;initiative_id&lt;/code&gt; will be highly likely to have the same &lt;code&gt;product_id&lt;/code&gt;. So how do we tell the planner that if we know the &lt;code&gt;initiative_id&lt;/code&gt;, we can basically ignore the &lt;code&gt;account_id&lt;/code&gt; and &lt;code&gt;product_id&lt;/code&gt; checks for the initial data fetch process?&lt;/p&gt;

&lt;p&gt;These statistical dependencies can be leveraged by the query planner to decide that only one index fetch on &lt;code&gt;initiative_id&lt;/code&gt; is necessary. That is the most selective index, so it will give us the smallest result set. The result is likely to be small and the rows that have the &lt;code&gt;initiative_id&lt;/code&gt; are likely to have the same &lt;code&gt;product_id&lt;/code&gt; and &lt;code&gt;account_id&lt;/code&gt;. So doing the one index lookup and then filtering the results in memory is almost assuredly faster than setting up a bitmap scan, which requires going out to disk for random page reads.&lt;/p&gt;

&lt;p&gt;To declare the dependency, create extended statistics on the table and then re-analyze it:&lt;/p&gt;

&lt;pre&gt;CREATE STATISTICS features_stats (dependencies)
  ON account_id, product_id, initiative_id FROM features;
ANALYZE features;
&lt;/pre&gt;

&lt;p&gt;PostgreSQL will sample those columns and keep track of the coefficient of correlation between their values, coming up with better query plans. In this example, the index on &lt;code&gt;initiative_id&lt;/code&gt; is going to be the most selective (largest total cardinality), which means that selecting on it will give us a small number of rows. Then, since all items with &lt;code&gt;initiative_id=3&lt;/code&gt; are probably going to have &lt;code&gt;product_id=2&lt;/code&gt; and are certainly going to have &lt;code&gt;account_id=1&lt;/code&gt;, the result set from the &lt;code&gt;initiative_id&lt;/code&gt; index can be filtered in memory, and nearly all of those rows will make it through the checks anyway.&lt;/p&gt;

&lt;p&gt;The example I describe here is not hypothetical. We found a pathological case where a query for one customer was taking 20 seconds. By adding dependent statistics to the table, the same query's time was reduced to less than a millisecond: a 20,000x improvement.&lt;/p&gt;

&lt;h2&gt;Beware the join collapse&lt;/h2&gt;

&lt;p&gt;Many candidates that I interview say something to the effect of "PostgreSQL falls over when you have too many joins." This is a commonly known phenomenon, but very few candidates know the exact reason for it.&lt;/p&gt;

&lt;p&gt;PostgreSQL's query planner by default will exhaust every possibility trying to determine the fastest way to execute a query you request. The complexity of this search grows exponentially with the number of joins. Attempting to plan the join of many tables together when the join order is not constrained could take longer than executing the actual query.&lt;/p&gt;

&lt;p&gt;To avoid this analysis paralysis, PostgreSQL also has a &lt;a href="https://www.postgresql.org/docs/12/geqo.html" rel="noopener noreferrer"&gt;genetic algorithm-based query optimizer&lt;/a&gt;, which attempts to find a heuristic solution. Most of the time, it's pretty good, but sometimes the results are pretty bad. The decision to use heuristic planning instead of deterministic planning is determined by &lt;code&gt;geqo_threshold&lt;/code&gt;. The amount of work the heuristic planner does is determined by &lt;code&gt;geqo_effort&lt;/code&gt;, a simple 1–10 scale that defaults to 5.&lt;/p&gt;

&lt;p&gt;Since the cost of a bad query is far more than the cost of some extra time planning, we've opted to increase our &lt;code&gt;geqo_threshold&lt;/code&gt; to 20. This covers the vast majority of our reporting queries. We're also looking into increasing our &lt;code&gt;geqo_effort&lt;/code&gt; value, since when we're joining more than 20 tables together, it's worth spending extra time planning to ensure that we get the query plan right. This is a continuing area of experimentation for us: Aha! has a lot of very small, optimized queries, but it also has some very large, gnarly ones that report on customers' data in exactly the way each customer desires. Striking a balance between these extremes appears to be a bit of an art form.&lt;/p&gt;

&lt;p&gt;Additionally, it pays to increase the values of &lt;code&gt;join_collapse_limit&lt;/code&gt; and &lt;code&gt;from_collapse_limit&lt;/code&gt;. These parameters determine the maximum number of tables for which PostgreSQL is willing to optimize the join order. When a query touches many tables, the extra time spent getting the join order right is repaid many times over while executing the query.&lt;/p&gt;
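
&lt;p&gt;Sketching that tuning concretely (we set &lt;code&gt;geqo_threshold&lt;/code&gt; to 20 as described above; the collapse-limit values shown here are illustrative, and managed services like RDS take these through a parameter group):&lt;/p&gt;

&lt;pre&gt;-- Plan joins deterministically for up to 20 relations
ALTER SYSTEM SET geqo_threshold = 20;

-- Let the planner reorder joins across larger FROM/JOIN lists
ALTER SYSTEM SET join_collapse_limit = 20;
ALTER SYSTEM SET from_collapse_limit = 20;

SELECT pg_reload_conf();
&lt;/pre&gt;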

&lt;h2&gt;Feed the planner, trust the plan&lt;/h2&gt;

&lt;p&gt;The deterministic query planner will &lt;em&gt;always&lt;/em&gt; find the optimal result. If the planner is generating bad plans, it means you need to find ways to give it better statistical data. Instead of "tricks," use the prescribed methods that come with PostgreSQL to instruct the planner on the shape of your data. You'll be very happy with the result. And seriously, read the manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sign up for a free trial of Aha! Develop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aha! Develop is a fully extendable agile development tool. Prioritize the backlog, estimate work, and plan sprints. If you are interested in an integrated &lt;a href="https://www.aha.io/suite-overview" rel="noopener noreferrer"&gt;product development&lt;/a&gt; approach, use &lt;a href="https://www.aha.io/product/integrations/develop" rel="noopener noreferrer"&gt;Aha! Roadmaps and Aha! Develop&lt;/a&gt; together. Sign up for a &lt;a href="https://www.aha.io/trial" rel="noopener noreferrer"&gt;free 30-day trial&lt;/a&gt; or &lt;a href="https://www.aha.io/live-demo" rel="noopener noreferrer"&gt;join a live demo&lt;/a&gt; to see why more than 5,000 companies trust our software to build lovable products and be happy doing it.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;You can check your database cache rate with the following query. You want your cache hit rate to be near 99%. If it's lower than 95% for heavily accessed tables, consider increasing your database's RAM.&lt;/p&gt;
&lt;pre&gt;SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;
&lt;/pre&gt; ↩
&lt;/li&gt;

&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Log management using AWS Athena</title>
      <dc:creator>Alexander Bartlow</dc:creator>
      <pubDate>Tue, 15 Dec 2020 16:10:04 +0000</pubDate>
      <link>https://dev.to/aha/log-management-using-aws-athena-34ne</link>
      <guid>https://dev.to/aha/log-management-using-aws-athena-34ne</guid>
      <description>&lt;p&gt;Many SaaS providers will happily sell you a turn-key log management system, and in the early days of a startup when you value time over money, purchasing one makes a lot of sense. AWS Cloudwatch logs are another good solution if you are already hosted on AWS. With any of these paid services, the cost is eventually going to start raising a few eyebrows and it might become prudent to build your own in-house. The Elasticsearch-Logstash-Kibana cocktail is the most popular deployment for this, but one SaaS provider I found calculated the TCO over running a moderately-sized ELK stack for a year at nearly $500,000!&lt;/p&gt;

&lt;p&gt;We've developed our own system using Fluentd, AWS S3, and AWS Athena to ingest, store, and query our logs. With this combination, I estimate that ingesting 640GB per day of log data and keeping it searchable over the course of an entire year would cost only $79,248 — a savings of 83 percent, not to mention the time we would otherwise spend monitoring, scaling, and operating an ELK cluster. The performance of our system is good; queries that return the data for a single request complete in less than 3 seconds. Searching the full text of our logs for specific markers or references can take longer, but it's still very manageable.&lt;/p&gt;

&lt;p&gt;This system not only gives us fast access to the logs we collect, but it also enables us to log additional information per request and keep logs searchable for longer than we could if we had to pay more for storage. Having this additional data is invaluable for helping us solve problems and answer questions in a rapid and responsive way.&lt;/p&gt;

&lt;p&gt;In this post, I'm going to walk through what we did, provide some code and discussion to show our approach, and talk about opportunities for future work that have opened up. I'm also going to mention some "gotchas" and roadblocks that we encountered along the way.&lt;/p&gt;

&lt;h2&gt;Building the system&lt;/h2&gt;

&lt;p&gt;There are four stages to a logging system: ingestion, aggregation, storage, and query. We're going to talk about them in reverse order.&lt;/p&gt;

&lt;h3&gt;Query&lt;/h3&gt;

&lt;p&gt;Assuming we have our logs in a structured format, what would be an ideal interface for accessing them? While Kibana is the preferred choice for many companies, it provides a lot of things that we don't really need. Pretty charts and graphs are nice, but when debugging an error state, what we really need is quick access to our logs. Additionally, exposing something that resembles SQL allows writing sophisticated queries to answer questions as they emerge.&lt;/p&gt;

&lt;p&gt;Amazon Athena is the technology we settled on to query our logs. Athena searches data that already exists in S3, and it has a few properties that make it ideal for the task at hand:&lt;/p&gt;

&lt;p&gt;Athena is really inexpensive. We pay only for the data that we scan in the process of completing a query: only $5 per terabyte at the time of writing.&lt;/p&gt;

&lt;p&gt;Athena understands and can query the Parquet file format out of the box. Parquet is column-oriented, which means Athena can scan only the columns we are searching. This is a huge boon to us since the vast majority of our searches are "get me all of the logs for this single request id." That's only a 128-bit UUID, and Athena is able to quickly determine which chunks of data contain the value under search, saving us both time and money. Additionally, Parquet is compressible, and Athena is able to query it in its compressed format, saving even more money on scanning costs.&lt;/p&gt;
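
&lt;p&gt;For illustration, a Parquet-backed logs table in Athena can be declared with Hive-style DDL (the table name, columns, partition scheme, and S3 location below are hypothetical; our real schema has more fields):&lt;/p&gt;

&lt;pre&gt;CREATE EXTERNAL TABLE request_logs (
  request_id string,
  pod        string,
  severity   string,
  message    string,
  logged_at  timestamp
)
PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://example-log-bucket/logs/';

-- Column-oriented storage means this scans only the columns
-- referenced, not the whole log archive
SELECT logged_at, message
FROM request_logs
WHERE day = '2020-12-01'
  AND request_id = 'hypothetical-request-uuid'
ORDER BY logged_at;
&lt;/pre&gt;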

&lt;p&gt;We wrote a small custom log search engine to act as a front-end to Athena's interface for our team. This interface has fields, basic query parameters, and a date selector to help generate the SQL-like query language that Athena uses, as well as a few endpoints that kick off a query immediately based on request uuid, pod, and time. We embed links to our tool in Datadog for rapid log analysis of poorly performing endpoints.&lt;/p&gt;
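&lt;p&gt;For instance, the front-end might emit SQL like the sketch below generates. This is illustrative only: the &lt;code&gt;app_logs&lt;/code&gt; table and its column names are stand-ins, not our production schema. The point is that constraining the year/month/day partition keys keeps Athena from scanning anything outside the window of interest.&lt;/p&gt;

```ruby
# Hedged sketch of the SQL our search front-end generates for the common
# "all logs for one request id" lookup. Table and column names
# (app_logs, request_uuid, and friends) are illustrative assumptions.
def request_logs_sql(request_uuid, day)
  [
    "SELECT timestamp, pod, severity, message",
    "FROM app_logs",
    "WHERE request_uuid = '#{request_uuid}'",
    "  AND year = '#{day.year}'",
    "  AND month = '#{format('%02d', day.month)}'",
    "  AND day = '#{format('%02d', day.day)}'",
    "ORDER BY timestamp"
  ].join("\n")
end

puts request_logs_sql("0b6cdbaa-8a42-4a6b-9e2c-0f95d0e2a111", Time.utc(2022, 9, 1))
```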

&lt;h3&gt;Storage&lt;/h3&gt;

&lt;p&gt;As mentioned, Athena queries data that is stored in S3. It's a durable, reliable medium that gives us millisecond access to our data when we need it. S3 also comes with another benefit: we can set up lifecycle rules to move files to lower-cost tiers over time. These lifecycle rules take just a few clicks (or a few lines of AWS CloudFormation), and then we're done.&lt;/p&gt;
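&lt;p&gt;As a sketch of how little CloudFormation this takes (the bucket name and day counts here are illustrative, not our production values):&lt;/p&gt;

```yaml
LogBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: example-structured-logs   # illustrative name
    LifecycleConfiguration:
      Rules:
        - Id: TierDownOldLogs
          Status: Enabled
          Transitions:
            - StorageClass: STANDARD_IA   # Infrequent Access after 60 days
              TransitionInDays: 60
```

&lt;p&gt;A rule like this pairs well with the access pattern of debugging logs, since almost all queries target recent data.&lt;/p&gt;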

&lt;h3&gt;Aggregation&lt;/h3&gt;

&lt;p&gt;Aggregating the logs from our services turned out to require some work. Instead of using an Amazon product off the shelf, we opted to use Fluentd to ingest, aggregate, and then process our logs. The configuration itself is very simple. We created a Docker image that built Fluentd with &lt;code&gt;libjemalloc&lt;/code&gt; to keep memory usage in check and &lt;code&gt;lib-arrow&lt;/code&gt; to generate the compressed data in Parquet format. This container could then be deployed to our standard ECS cluster and treated like any other service.&lt;/p&gt;
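&lt;p&gt;A trimmed sketch of what that Fluentd configuration can look like. The plugin options shown are illustrative, and the Parquet output assumes the Arrow-enabled build described above, so treat this as a shape rather than our exact config:&lt;/p&gt;

```
&lt;source&gt;
  @type forward          # accept records from application fluent-logger clients
  port 24224
&lt;/source&gt;

&lt;match app.**&gt;
  @type s3               # flush buffered chunks to S3 for Athena to query
  s3_bucket example-structured-logs
  path logs/service=app/
  store_as parquet       # assumes the Parquet support from our custom build
  &lt;buffer time&gt;
    timekey 3600         # chunk boundary, tuned toward Athena's sweet spot
  &lt;/buffer&gt;
&lt;/match&gt;
```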

&lt;h3&gt;Ingestion&lt;/h3&gt;

&lt;p&gt;Now that we have our query, storage, and aggregation services ready, the last step is to get data from our application into the Fluentd service. There were a couple of options here. One was to set the Docker logging provider to Fluentd and point it at the aggregation cluster we deployed; that would let us log to stdout and keep the logger relatively simple. Instead, we decided it was worth making our application aware of the aggregation service and installing a client logging library for Fluentd. Fluentd maintains a Ruby version of its logger, &lt;a href="https://github.com/fluent/fluent-logger-ruby"&gt;fluent-logger-ruby&lt;/a&gt;, which works as a drop-in replacement for the logger in Rails applications, but the approach is language-agnostic and could be used anywhere.&lt;/p&gt;
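&lt;p&gt;In the application, this amounts to posting structured records rather than formatting strings. Below is a hedged sketch: the field names and tag are illustrative, and the commented lines show roughly how a record would be posted with the fluent-logger gem:&lt;/p&gt;

```ruby
require "securerandom"
require "time"

# Sketch of how the application assembles a structured record before
# posting it to Fluentd. Field names here are illustrative assumptions.
# With the fluent-logger gem, the record would be sent roughly like:
#   Fluent::Logger::FluentLogger.open(nil, host: "aggregator", port: 24224)
#   Fluent::Logger.post("app.request", record)
def structured_record(message, severity, request_uuid)
  {
    timestamp: Time.now.utc.iso8601,
    severity: severity,
    message: message,
    request_uuid: request_uuid,  # ties every line back to one user request
    pod: ENV.fetch("POD_NAME", "unknown")
  }
end

record = structured_record("checkout complete", "info", SecureRandom.uuid)
puts record[:message]
```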

&lt;h2&gt;Gotchas&lt;/h2&gt;

&lt;h3&gt;Who logs the logger?&lt;/h3&gt;

&lt;p&gt;Unfortunately, there is one area where we cannot rely on this system for logging — it's not able to log to itself! The ECS agent requires that the new container be able to log to its log provider immediately upon boot. This is a problem when the logging service itself needs to boot, initialize, and then pass health checks to start taking requests.&lt;/p&gt;

&lt;p&gt;One way around this would be to bake a "forwarding" ECS service into the AMI for every ECS cluster instance. This forwarder could receive and buffer logs, then send them on to a private hostname configured to point to our new logging service.&lt;/p&gt;

&lt;h3&gt;Ensure proper chunk size&lt;/h3&gt;

&lt;p&gt;Athena has a "sweet spot." Because of the columnar format and Athena's ability to read a Parquet file's headers before fetching the whole object, Athena works best when the scanned files are around 200MB in size. Once they get larger, Athena misses out on opportunities to parallelize. When they are smaller, each query worker spends a lot of time reading and parsing the initial headers instead of scanning the actual data. We had to tweak our time keys and chunk sizes to get this just right.&lt;/p&gt;

&lt;p&gt;There is a balancing act between chunk size and capacity. The more aggregation containers we run, the more log pressure we need to reach the right saturation to create files in the sweet spot. In our experience, the cluster really wants to run at about 80% CPU. Higher than that and we risk the TCP log forwarding protocol applying upstream back-pressure to the services logging to it. Lower than that and we end up creating log files that are not big enough to get full performance out of Athena. Since our use case is an internal developer-facing tool, and slowing down our main application is not worth slightly better throughput, we have opted to slightly over-provision our containers. A second Fluentd container layer serving as a forwarder between our application servers and our aggregation containers could buffer this substantially; that would give us great performance without risk of back-pressure, at the cost of slightly more complexity and overhead.&lt;/p&gt;
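&lt;p&gt;A back-of-envelope way to think about that tuning (every number below is hypothetical, chosen only to illustrate the math):&lt;/p&gt;

```ruby
# Rough estimate of the Parquet chunk size each aggregation container
# flushes, given total log pressure and the flush window. All inputs
# here are made up for illustration.
def chunk_mb(total_mb_per_hour, containers, timekey_hours)
  total_mb_per_hour * timekey_hours / containers.to_f
end

# Roughly 640GB/day of logs is about 27,000 MB/hour.
hourly = chunk_mb(27_000, 35, 1)       # about 771MB per chunk: too big
quarter = chunk_mb(27_000, 35, 0.25)   # about 193MB: near the sweet spot
puts format("hourly: %.0f MB, 15-minute: %.0f MB", hourly, quarter)
```

&lt;p&gt;Shrinking the timekey (or adding more aggregation capacity) trades chunk size against flush frequency, which is exactly the knob we ended up turning.&lt;/p&gt;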

&lt;h3&gt;Partition projection&lt;/h3&gt;

&lt;p&gt;We first attempted to create an AWS Glue table for our data stored in S3 and then have a Lambda crawler automatically create Glue partitions for Athena to use. This turned out to be a bad approach; partition projection served us far better.&lt;/p&gt;

&lt;p&gt;Partition projection tells Athena about the shape of the data in S3, which keys are partition keys, and what the file structure in S3 looks like. We partition our data by service, shard, year, month, day, and hour. In our testing, partition projection was essential to getting full value out of Athena. Without it, many of our queries took up to thirty seconds while Athena consulted our AWS Glue tables to determine which data to scan. With partition projection, no such lookup is necessary, and the query can begin immediately, scanning only the relevant data.&lt;/p&gt;
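&lt;p&gt;Concretely, partition projection is configured through Athena table properties. The sketch below is illustrative (the bucket, table name, and ranges are made up, and the shard key is omitted for brevity):&lt;/p&gt;

```sql
ALTER TABLE app_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.service.type' = 'enum',
  'projection.service.values' = 'app,worker',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://example-structured-logs/logs/service=${service}/year=${year}/month=${month}/day=${day}/hour=${hour}'
);
```

&lt;p&gt;With properties like these, Athena can compute the candidate S3 prefixes for a query directly, with no Glue partition lookup at all.&lt;/p&gt;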

&lt;p&gt;Keep in mind that partitions are great for scoping down the amount of data to scan, but by partitioning too aggressively, we noticed our chunk sizes dropping below the "sweet spot" discussed above. We saw a similar problem when partitioning our data by EC2 instance: counterintuitively, clumping all hosts together made our queries a lot faster.&lt;/p&gt;

&lt;h2&gt;Future use cases&lt;/h2&gt;

&lt;p&gt;The power of this system is enormous. We push a staggering amount of log data into it, pay relatively little for it, and query it in seconds. Additionally, the ingestion is managed by a service that we run and provision ourselves, so it is easy to scale down when demand is low and scale up when we need the additional horsepower. But now that we have this system running, other use cases have presented themselves that look promising:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt; — We are able to log to Fluentd with a special key for analytics events that we want to later ETL and send to Redshift. We can create a new rule in our Fluentd config to take the analytics tag and write it into the proper bucket, either for later Athena queries to export to Redshift or for Redshift itself to query directly from S3 using Redshift Spectrum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request correlation&lt;/strong&gt; — At Aha!, every request to our service is issued a request_uuid. When we queue a background job or make a call to another service, those services can all log using the same framework and include the originating request_uuid in the appropriate field. This means we can trace everything that was involved in or triggered by a single user request across all of our backend services through a single query interface, and interlace all of the logs together in one timeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finding contemporary logs&lt;/strong&gt; — Using this system, it is trivial to search for all logs from all requests within a very small timestamp window. If we know that we had a deadlock around a certain row in a database, this would allow us to find all processes that attempted to update that row and then trace back to the controller action that caused the problem.&lt;/li&gt;
&lt;/ol&gt;
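&lt;p&gt;The request correlation case is especially cheap once every service logs through the same pipeline. Here is an illustrative sketch (the service names and in-memory sink are stand-ins for real services and the Fluentd forward port) of threading one request_uuid through a web request and the background job it triggers:&lt;/p&gt;

```ruby
require "securerandom"

# Illustrative sketch: every record carries the originating request_uuid,
# so a single Athena query can interlace web, worker, and downstream
# service logs into one timeline. LOG_SINK stands in for the Fluentd
# forward port; the service and event names are made up.
LOG_SINK = []

def log_event(service, request_uuid, message)
  LOG_SINK.push({ service: service, request_uuid: request_uuid, message: message })
end

def handle_request(request_uuid)
  log_event("web", request_uuid, "starting checkout")
  enqueue_job(request_uuid)  # pass the originating uuid along, never a new one
end

def enqueue_job(request_uuid)
  log_event("worker", request_uuid, "processing payment")
end

uuid = SecureRandom.uuid
handle_request(uuid)
services = LOG_SINK.select { |e| e[:request_uuid] == uuid }.map { |e| e[:service] }
puts services.join(",")  # prints "web,worker"
```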

&lt;p&gt;If finding innovative solutions to problems like this sounds interesting, &lt;a href="https://www.aha.io/company/careers/current-openings"&gt;come work with us&lt;/a&gt; as we build the world's #1 product roadmapping software. We are not only creating a powerful platform for the biggest companies in the world to plan and manage their strategy, we are also building sophisticated tooling to help our talented and friendly development team continue to excel.&lt;/p&gt;

&lt;h4&gt;Footnotes&lt;/h4&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;1: Assuming 35 Fargate container instances with 1.25 vCPU and 2GB of memory each, 640GB ingested per day, 60 days of S3 Standard storage, 335 days of Infrequent Access storage, and 100 Athena queries per day, each covering 100GB scanned. (This is a generous allowance.) Reserved ECS cluster instances would drop this cost further.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>logging</category>
    </item>
  </channel>
</rss>
