<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ryan Perry</title>
    <description>The latest articles on DEV Community by Ryan Perry (@ryan_perry_aa806d7a49198e).</description>
    <link>https://dev.to/ryan_perry_aa806d7a49198e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F567971%2F1027a605-d1f4-435c-8616-86de19f7acdc.jpg</url>
      <title>DEV Community: Ryan Perry</title>
      <link>https://dev.to/ryan_perry_aa806d7a49198e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryan_perry_aa806d7a49198e"/>
    <language>en</language>
    <item>
      <title>How to Debug Ruby Performance Issues Using Profiling</title>
      <dc:creator>Ryan Perry</dc:creator>
      <pubDate>Fri, 31 Dec 2021 18:50:14 +0000</pubDate>
      <link>https://dev.to/ryan_perry_aa806d7a49198e/how-to-debug-ruby-performance-issues-using-profiling-2210</link>
      <guid>https://dev.to/ryan_perry_aa806d7a49198e/how-to-debug-ruby-performance-issues-using-profiling-2210</guid>
      <description>&lt;h3&gt;
  
  
  Pyroscope Rideshare Example
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135726784-0c367d3f-c9e5-4e3f-91be-761d4d6d21b1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135726784-0c367d3f-c9e5-4e3f-91be-761d4d6d21b1.gif" alt="ruby_example_architecture_05"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: For documentation on the Pyroscope Ruby gem, visit &lt;a href="https://pyroscope.io/docs/ruby/" rel="noopener noreferrer"&gt;Pyroscope's website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;This example shows a simplified, basic use case of Pyroscope. We simulate a "ride share" company which has three endpoints found in &lt;code&gt;server.rb&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/bike&lt;/code&gt;    : calls the &lt;code&gt;order_bike(search_radius)&lt;/code&gt; function to order a bike&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/car&lt;/code&gt;     : calls the &lt;code&gt;order_car(search_radius)&lt;/code&gt; function to order a car&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/scooter&lt;/code&gt; : calls the &lt;code&gt;order_scooter(search_radius)&lt;/code&gt; function to order a scooter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also simulate running 3 distinct servers in 3 different regions (via &lt;a href="https://github.com/pyroscope-io/pyroscope/blob/main/examples/ruby/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;us-east-1&lt;/li&gt;
&lt;li&gt;us-west-1&lt;/li&gt;
&lt;li&gt;eu-west-1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most useful capabilities of Pyroscope is the ability to tag your data in a way that is meaningful to you. In this case, we have two natural divisions, and so we "tag" our data to represent those:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;region&lt;/code&gt;: statically tags the region of the server running the code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vehicle&lt;/code&gt;: dynamically tags the endpoint (similar to how one might tag a controller in Rails)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tagging static region
&lt;/h2&gt;

&lt;p&gt;Tagging something static, like the &lt;code&gt;region&lt;/code&gt;, can be done in the initialization code in the &lt;code&gt;config.tags&lt;/code&gt; variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pyroscope.configure do |config|
  config.app_name = "ride-sharing-app"
  config.server_address = "http://pyroscope:4040"
  config.tags = {
    "region": ENV["REGION"],                     # Tags the region based on the environment variable
  }
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tagging dynamically within functions
&lt;/h2&gt;

&lt;p&gt;Tagging something more dynamic, like the &lt;code&gt;vehicle&lt;/code&gt; tag, can be done inside our utility &lt;code&gt;find_nearest_vehicle()&lt;/code&gt; function using a &lt;code&gt;Pyroscope.tag_wrapper&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def find_nearest_vehicle(n, vehicle)
  Pyroscope.tag_wrapper({ "vehicle" =&amp;gt; vehicle }) do
    # ...code to find nearest vehicle
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this block does is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adds the tag &lt;code&gt;{ "vehicle" =&amp;gt; "car" }&lt;/code&gt; (here, when ordering a car)&lt;/li&gt;
&lt;li&gt;Executes the body of &lt;code&gt;find_nearest_vehicle()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Removes the tag &lt;code&gt;{ "vehicle" =&amp;gt; "car" }&lt;/code&gt; behind the scenes when the block ends, since it no longer applies&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resulting flamegraph / performance results from the example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Running the example
&lt;/h3&gt;

&lt;p&gt;To run the example, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull latest pyroscope image:
docker pull pyroscope/pyroscope:latest

# Run the example project:
docker-compose up --build

# Reset the database (if needed):
# docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example runs all the code mentioned above and also sends some mock load to the 3 servers and their respective 3 endpoints. If you select our application, &lt;code&gt;ride-sharing-app.cpu&lt;/code&gt;, from the dropdown, you should see a flamegraph like the one below. After giving the flamegraph 20-30 seconds to update and clicking the refresh button, we see our 3 functions at the bottom of the flamegraph taking CPU resources &lt;em&gt;proportional to the size&lt;/em&gt; of their respective &lt;code&gt;search_radius&lt;/code&gt; parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where's the performance bottleneck?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139566972-2f04b826-d05c-4307-9b60-4376840001ab.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139566972-2f04b826-d05c-4307-9b60-4376840001ab.jpg" alt="ruby_first_slide_01-01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step when analyzing a profile output by your application is to take note of the &lt;em&gt;largest node&lt;/em&gt;, which is where your application is spending the most resources. In this case, it happens to be the &lt;code&gt;order_car&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The benefit of using the Pyroscope package is that we can now investigate further into &lt;em&gt;why&lt;/em&gt; the &lt;code&gt;order_car()&lt;/code&gt; function is problematic. Tagging both &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;vehicle&lt;/code&gt; allows us to test two good hypotheses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Something is wrong with the &lt;code&gt;/car&lt;/code&gt; endpoint code&lt;/li&gt;
&lt;li&gt;Something is wrong with one of our regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze this we can select one or more tags from the "Select Tag" dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135525308-b81e87b0-6ffb-4ef0-a6bf-3338483d0fc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135525308-b81e87b0-6ffb-4ef0-a6bf-3338483d0fc4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Narrowing in on the Issue Using Tags
&lt;/h2&gt;

&lt;p&gt;Knowing there is an issue with the &lt;code&gt;order_car()&lt;/code&gt; function, we naturally select that tag first. Then, after inspecting multiple &lt;code&gt;region&lt;/code&gt; tags, it becomes clear from the timeline that there is an issue with the &lt;code&gt;us-west-1&lt;/code&gt; region, which alternates between high-CPU and low-CPU periods.&lt;/p&gt;

&lt;p&gt;We can also see that the &lt;code&gt;mutex_lock()&lt;/code&gt; function is consuming almost 70% of CPU resources during this time period. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139566994-f3f8c2f3-6bc4-40ca-ac4e-8fc862d0c0ad.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139566994-f3f8c2f3-6bc4-40ca-ac4e-8fc862d0c0ad.jpg" alt="ruby_second_slide_01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing two time periods
&lt;/h2&gt;

&lt;p&gt;Using Pyroscope's "comparison view" we can select two different time ranges from the timeline to compare the resulting flamegraphs. The pink selection on the left timeline produces the left flamegraph, and the blue selection on the right produces the right flamegraph.&lt;/p&gt;

&lt;p&gt;When we select a period of low-CPU utilization and a period of high-CPU utilization, we can see clearly different behavior in the &lt;code&gt;mutex_lock()&lt;/code&gt; function, which takes &lt;strong&gt;23% of CPU&lt;/strong&gt; during low-CPU times and &lt;strong&gt;70% of CPU&lt;/strong&gt; during high-CPU times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139567004-96064c5b-570c-48a4-aa9a-07a46a0646a5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139567004-96064c5b-570c-48a4-aa9a-07a46a0646a5.jpg" alt="ruby_third_slide_01-01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing Diff Between Two Flamegraphs
&lt;/h2&gt;

&lt;p&gt;While the difference &lt;em&gt;in this case&lt;/em&gt; is stark enough to see in the comparison view, sometimes the diff between two flamegraphs is better visualized by overlaying them on each other. Without changing any parameters, we can simply select the diff view tab and see the difference represented in a color-coded diff flamegraph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139567016-3f738923-2429-4f93-8fe0-cc0ca8c765fd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F139567016-3f738923-2429-4f93-8fe0-cc0ca8c765fd.jpg" alt="ruby_fourth_slide_01-01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  More use cases
&lt;/h3&gt;

&lt;p&gt;We have been beta testing this feature with several different companies; some of the ways we've seen them tag their performance data include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tagging controllers&lt;/li&gt;
&lt;li&gt;Tagging regions&lt;/li&gt;
&lt;li&gt;Tagging jobs from a Redis or Sidekiq queue&lt;/li&gt;
&lt;li&gt;Tagging commits&lt;/li&gt;
&lt;li&gt;Tagging staging / production environments&lt;/li&gt;
&lt;li&gt;Tagging different parts of their testing suites&lt;/li&gt;
&lt;li&gt;Etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Roadmap
&lt;/h3&gt;

&lt;p&gt;We would love for you to try out this example and see how you can adapt it to your Ruby application. Continuous profiling has become an increasingly popular tool for monitoring and debugging performance issues (arguably the fourth pillar of observability).&lt;/p&gt;

&lt;p&gt;We'd love to continue improving this gem by adding integrations with popular tools, memory profiling, and more, and we would love to hear what features &lt;em&gt;you would like to see&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>performance</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Profile Python Code</title>
      <dc:creator>Ryan Perry</dc:creator>
      <pubDate>Wed, 08 Dec 2021 18:13:31 +0000</pubDate>
      <link>https://dev.to/ryan_perry_aa806d7a49198e/how-to-profile-python-code-4f2a</link>
      <guid>https://dev.to/ryan_perry_aa806d7a49198e/how-to-profile-python-code-4f2a</guid>
      <description>&lt;h3&gt;
  
  
  Pyroscope Rideshare Example
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135728737-0c5e54ca-1e78-4c6d-933c-145f441c96a9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135728737-0c5e54ca-1e78-4c6d-933c-145f441c96a9.gif" alt="python_example_architecture_05_00"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: For documentation on the Pyroscope pip package, visit &lt;a href="https://pyroscope.io/docs/python/" rel="noopener noreferrer"&gt;our website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;In this example we show a simplified, basic use case of Pyroscope. We simulate a "ride share" company which has three endpoints found in &lt;code&gt;server.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/bike&lt;/code&gt;    : calls the &lt;code&gt;order_bike(search_radius)&lt;/code&gt; function to order a bike&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/car&lt;/code&gt;     : calls the &lt;code&gt;order_car(search_radius)&lt;/code&gt; function to order a car&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/scooter&lt;/code&gt; : calls the &lt;code&gt;order_scooter(search_radius)&lt;/code&gt; function to order a scooter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also simulate running 3 distinct servers in 3 different regions (via &lt;a href="https://github.com/pyroscope-io/pyroscope/blob/main/examples/python/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;us-east-1&lt;/li&gt;
&lt;li&gt;us-west-1&lt;/li&gt;
&lt;li&gt;eu-west-1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most useful capabilities of Pyroscope is the ability to tag your data in a way that is meaningful to you. In this case, we have two natural divisions, and so we "tag" our data to represent those:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;region&lt;/code&gt;: statically tags the region of the server running the code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vehicle&lt;/code&gt;: dynamically tags the endpoint (similar to how one might tag a controller in Rails)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tagging static region
&lt;/h2&gt;

&lt;p&gt;Tagging something static, like the &lt;code&gt;region&lt;/code&gt;, can be done in the initialization code in the &lt;code&gt;config.tags&lt;/code&gt; variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pyroscope.configure(
    app_name       = "ride-sharing-app",
    server_address = "http://pyroscope:4040",
    tags           = {
        "region":   f'{os.getenv("REGION")}', # Tags the region based on the environment variable
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tagging dynamically within functions
&lt;/h2&gt;

&lt;p&gt;Tagging something more dynamic, like the &lt;code&gt;vehicle&lt;/code&gt; tag, can be done inside our utility &lt;code&gt;find_nearest_vehicle()&lt;/code&gt; function using a &lt;code&gt;with pyroscope.tag_wrapper()&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def find_nearest_vehicle(n, vehicle):
    with pyroscope.tag_wrapper({ "vehicle": vehicle }):
        i = 0
        start_time = time.time()
        while time.time() - start_time &amp;lt; n:
            i += 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this block does is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adds the tag &lt;code&gt;{ "vehicle": "car" }&lt;/code&gt; (here, when ordering a car)&lt;/li&gt;
&lt;li&gt;Executes the body of &lt;code&gt;find_nearest_vehicle()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Removes the tag &lt;code&gt;{ "vehicle": "car" }&lt;/code&gt; behind the scenes when the block ends, since it no longer applies&lt;/li&gt;
&lt;/ol&gt;
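&lt;p&gt;To make that add-execute-remove lifecycle concrete, here is a minimal, self-contained sketch of a tag-scoping context manager. This illustrates the behavior described above, not Pyroscope's actual implementation; the &lt;code&gt;tag_scope&lt;/code&gt; helper and &lt;code&gt;active_tags&lt;/code&gt; dict are made-up names.&lt;/p&gt;

```python
# Illustrative sketch only: tag_scope and active_tags are hypothetical
# names, not part of the pyroscope package.
import contextlib

active_tags = {}  # tags currently attached to outgoing profile samples

@contextlib.contextmanager
def tag_scope(tags):
    """Attach tags on entry, detach them when the block exits."""
    active_tags.update(tags)
    try:
        yield
    finally:
        for key in tags:
            active_tags.pop(key, None)

def find_nearest_vehicle(n, vehicle):
    # n (the search radius) is unused in this sketch
    with tag_scope({"vehicle": vehicle}):
        # ...search work happens here; samples taken now carry the tag
        snapshot = dict(active_tags)
    return snapshot

tagged = find_nearest_vehicle(0.2, "car")
print(tagged)       # tag was present while the block ran
print(active_tags)  # tag is gone after the block ended
```

&lt;p&gt;The &lt;code&gt;finally&lt;/code&gt; clause is what guarantees the tag is removed even if the wrapped code raises, which is why a context-manager API is a natural fit for scoped tags.&lt;/p&gt;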

&lt;h2&gt;
  
  
  Resulting flamegraph / performance results from the example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Running the example
&lt;/h3&gt;

&lt;p&gt;To run the example, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull latest pyroscope image:
docker pull pyroscope/pyroscope:latest

# Run the example project:
docker-compose up --build

# Reset the database (if needed):
# docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example runs all the code mentioned above and also sends some mock load to the 3 servers and their respective 3 endpoints. If you select our application, &lt;code&gt;ride-sharing-app.cpu&lt;/code&gt;, from the dropdown, you should see a flamegraph like the one below. After giving the flamegraph 20-30 seconds to update and clicking the refresh button, we see our 3 functions at the bottom of the flamegraph taking CPU resources &lt;em&gt;proportional to the size&lt;/em&gt; of their respective &lt;code&gt;search_radius&lt;/code&gt; parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where's the performance bottleneck?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135881284-c75a5b65-6151-44fb-a459-c1f9559cb51a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135881284-c75a5b65-6151-44fb-a459-c1f9559cb51a.jpg" alt="python_first_slide_05"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step when analyzing a profile output by your application is to take note of the &lt;em&gt;largest node&lt;/em&gt;, which is where your application is spending the most resources. In this case, it happens to be the &lt;code&gt;order_car&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The benefit of using the Pyroscope package is that we can now investigate further into &lt;em&gt;why&lt;/em&gt; the &lt;code&gt;order_car()&lt;/code&gt; function is problematic. Tagging both &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;vehicle&lt;/code&gt; allows us to test two good hypotheses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Something is wrong with the &lt;code&gt;/car&lt;/code&gt; endpoint code&lt;/li&gt;
&lt;li&gt;Something is wrong with one of our regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze this we can select one or more tags from the "Select Tag" dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135525308-b81e87b0-6ffb-4ef0-a6bf-3338483d0fc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135525308-b81e87b0-6ffb-4ef0-a6bf-3338483d0fc4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Narrowing in on the Issue Using Tags
&lt;/h2&gt;

&lt;p&gt;Knowing there is an issue with the &lt;code&gt;order_car()&lt;/code&gt; function, we naturally select that tag first. Then, after inspecting multiple &lt;code&gt;region&lt;/code&gt; tags, it becomes clear from the timeline that there is an issue with the &lt;code&gt;us-west-1&lt;/code&gt; region, which alternates between high-CPU and low-CPU periods.&lt;/p&gt;

&lt;p&gt;We can also see that the &lt;code&gt;mutex_lock()&lt;/code&gt; function is consuming almost 70% of CPU resources during this time period. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805908-ae9a1650-51fc-457a-8c47-0b56e8538b08.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805908-ae9a1650-51fc-457a-8c47-0b56e8538b08.jpg" alt="python_second_slide_05"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing two time periods
&lt;/h2&gt;

&lt;p&gt;Using Pyroscope's "comparison view" we can select two different time ranges from the timeline to compare the resulting flamegraphs. The pink selection on the left timeline produces the left flamegraph, and the blue selection on the right produces the right flamegraph.&lt;/p&gt;

&lt;p&gt;When we select a period of low-CPU utilization and a period of high-CPU utilization, we can see clearly different behavior in the &lt;code&gt;mutex_lock()&lt;/code&gt; function, which takes &lt;strong&gt;51% of CPU&lt;/strong&gt; during low-CPU times and &lt;strong&gt;78% of CPU&lt;/strong&gt; during high-CPU times.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805969-55fdee40-fe0c-412d-9ec0-0bbc6a748ed4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805969-55fdee40-fe0c-412d-9ec0-0bbc6a748ed4.jpg" alt="python_third_slide_05"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing Diff Between Two Flamegraphs
&lt;/h2&gt;

&lt;p&gt;While the difference &lt;em&gt;in this case&lt;/em&gt; is stark enough to see in the comparison view, sometimes the diff between two flamegraphs is better visualized by overlaying them on each other. Without changing any parameters, we can simply select the diff view tab and see the difference represented in a color-coded diff flamegraph.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805986-594ffa3b-e735-4f91-875d-4f76fdff2b60.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F23323466%2F135805986-594ffa3b-e735-4f91-875d-4f76fdff2b60.jpg" alt="python_fourth_slide_05"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  More use cases
&lt;/h3&gt;

&lt;p&gt;We have been beta testing this feature with several different companies; some of the ways we've seen them tag their performance data include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tagging controllers&lt;/li&gt;
&lt;li&gt;Tagging regions&lt;/li&gt;
&lt;li&gt;Tagging jobs from a Redis / Sidekiq / RabbitMQ queue&lt;/li&gt;
&lt;li&gt;Tagging commits&lt;/li&gt;
&lt;li&gt;Tagging staging / production environments&lt;/li&gt;
&lt;li&gt;Tagging different parts of their testing suites&lt;/li&gt;
&lt;li&gt;Etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Roadmap
&lt;/h3&gt;

&lt;p&gt;We would love for you to try out this example and see how you can adapt it to your Python application. Continuous profiling has become an increasingly popular tool for monitoring and debugging performance issues (arguably the fourth pillar of observability).&lt;/p&gt;

&lt;p&gt;We'd love to continue improving this pip package by adding integrations with popular tools, memory profiling, and more, and we would love to hear what features &lt;em&gt;you would like to see&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>performance</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>O(log n) makes continuous profiling possible in production</title>
      <dc:creator>Ryan Perry</dc:creator>
      <pubDate>Thu, 11 Mar 2021 03:33:13 +0000</pubDate>
      <link>https://dev.to/ryan_perry_aa806d7a49198e/o-log-n-makes-continuous-profiling-possible-in-production-1377</link>
      <guid>https://dev.to/ryan_perry_aa806d7a49198e/o-log-n-makes-continuous-profiling-possible-in-production-1377</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LF-Z92hR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/110414341-8ad0c000-8044-11eb-9628-7b24e50295b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LF-Z92hR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/110414341-8ad0c000-8044-11eb-9628-7b24e50295b2.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  O(log n) makes continuous profiling possible
&lt;/h1&gt;

&lt;p&gt;Pyroscope is software that lets you &lt;strong&gt;continuously&lt;/strong&gt; profile your code to debug performance issues down to a single line of code. With just a few lines of setup, it will do the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pyroscope Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Polls the stack trace every 0.01 seconds to see which functions are consuming resources&lt;/li&gt;
&lt;li&gt;Batches that data into 10-second blocks and sends it to the Pyroscope server&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pyroscope Server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Receives data from the Pyroscope agent and processes it to be stored efficiently&lt;/li&gt;
&lt;li&gt;Pre-aggregates profiling data for fast querying when data needs to be retrieved&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage Efficiency
&lt;/h2&gt;

&lt;p&gt;The challenge with continuous profiling is that if you just take frequent chunks of profiling data, compress them, and store them somewhere, you end up with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Too much data to store efficiently&lt;/li&gt;
&lt;li&gt;Too much data to query quickly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We solve these problems by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using a combination of tries and trees to compress data efficiently&lt;/li&gt;
&lt;li&gt;Using segment trees to return queries for any timespan of data in O(log n) vs O(n) time complexity&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Turning the profiling data into a tree
&lt;/h2&gt;

&lt;p&gt;The simplest way to represent profiling data is as a list of strings, each one representing a stack trace and the number of times that particular stack trace was seen during a profiling session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;server.py&lt;span class="p"&gt;;&lt;/span&gt;fast_function&lt;span class="p"&gt;;&lt;/span&gt;work 2
server.py&lt;span class="p"&gt;;&lt;/span&gt;slow_function&lt;span class="p"&gt;;&lt;/span&gt;work 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first obvious thing we do is turn this data into a tree. Conveniently, this representation also makes it easy to later generate flamegraphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2uX_sN4B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110378930-0f065180-800b-11eb-9357-71724bc7258c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2uX_sN4B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110378930-0f065180-800b-11eb-9357-71724bc7258c.gif" alt="raw_vs_flame_graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compressing the stack traces into trees saves space on repeated elements. By using trees, we go from storing common paths like &lt;code&gt;net/http.request&lt;/code&gt; in the db multiple times to storing them only once, with a reference to the location where each is stored. This is fairly standard among profiling libraries, since it's the lowest-hanging fruit when it comes to optimizing storage of profiling data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UQ9aAloi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110227218-e109fb80-7eaa-11eb-81a8-cdf2b3944f1c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UQ9aAloi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110227218-e109fb80-7eaa-11eb-81a8-cdf2b3944f1c.gif" alt="fast-compress-stack-traces"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Adding tries to store individual symbols more efficiently
&lt;/h2&gt;

&lt;p&gt;Now that we've compressed the raw profiling data by converting it into a tree, many of the nodes in this compressed tree contain symbols that still share repeated elements with other nodes. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;net/http.request;net/io.read 100 samples
net/http.request;net/io.write 200 samples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the &lt;code&gt;net/http.request&lt;/code&gt;, &lt;code&gt;net/io.read&lt;/code&gt;, and &lt;code&gt;net/io.write&lt;/code&gt; functions differ, they share the common prefix &lt;code&gt;net/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each of these lines can be serialized using a prefix tree (trie) as follows. Instead of storing the same prefixes multiple times, we store them once in the trie and access them via a pointer to their position in memory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xaw7jVir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110520399-446e7600-80c3-11eb-84e9-ecac7c0dbf23.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xaw7jVir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110520399-446e7600-80c3-11eb-84e9-ecac7c0dbf23.gif" alt="storage-design-0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this basic example we save ~80% of the space, going from 39 bytes to 8 bytes. Typically, symbol names are much longer, and as the number of symbols grows, storage requirements grow logarithmically rather than linearly.&lt;/p&gt;
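To make the prefix-sharing idea concrete, here is a minimal character-level trie in Python. This is an illustration only (Pyroscope's actual storage engine is written in Go, and its node layout is more sophisticated than this sketch):

```python
class TrieNode:
    """One node per character; symbols with a shared prefix share nodes."""
    def __init__(self):
        self.children = {}

def insert(root, symbol):
    """Insert a symbol name into the trie character by character."""
    node = root
    for ch in symbol:
        node = node.children.setdefault(ch, TrieNode())

def count_nodes(node):
    """Total nodes in the trie, root included."""
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = TrieNode()
insert(root, "net/io.read")   # 11 characters
insert(root, "net/io.write")  # 12 characters, but "net/io." is already stored

# Naive storage: 11 + 12 = 23 characters.
# Trie storage: the 7-character prefix "net/io." is stored once.
print(count_nodes(root) - 1)  # → 16
```

Storing only 16 nodes instead of 23 characters is a modest win here, but with thousands of long, similarly-prefixed symbol names the shared prefixes dominate and the savings compound.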

&lt;h2&gt;
  
  
  Step 1 + 2: Combining the trees with the tries
&lt;/h2&gt;

&lt;p&gt;In the end, by using a tree to compress the raw profiling data and then using tries to compress the symbols, we get the following storage amounts for our simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| data type           | bytes |
|---------------------|-------|
| raw data            | 93    |
| tree                | 58    |
| tree + trie         | 10    |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, this is a 9x improvement for a fairly trivial case. In real-world scenarios, the compression factor gets much larger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UxAbZcE2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110262208-ca75aa00-7f67-11eb-8f16-0572a4641ee1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UxAbZcE2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110262208-ca75aa00-7f67-11eb-8f16-0572a4641ee1.gif" alt="combine-segment-and-prefix_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Optimizing for fast reads using Segment Trees
&lt;/h2&gt;

&lt;p&gt;Now that we have a way of storing the data efficiently, the next problem is how to query it efficiently. We solve this by pre-aggregating the profiling data and storing it in a special segment tree.&lt;/p&gt;

&lt;p&gt;Every 10 seconds, the Pyroscope agent sends a chunk of profiling data to the server, which writes it into the database with the corresponding timestamp. You'll notice that each write happens once but is replicated multiple times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each layer represents time blocks of larger units; in this case, for every two 10s time blocks, one 20s time block is created. This makes reading the data more efficient (more on that in a second)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LD3m0ZTh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110259555-196a1200-7f5d-11eb-9223-218bb4b34c6b.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LD3m0ZTh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110259555-196a1200-7f5d-11eb-9223-218bb4b34c6b.gif" alt="segment_tree_animation_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Turn reads from O(n) to O(log n)
&lt;/h2&gt;

&lt;p&gt;If you don't use segment trees and just write data in 10-second chunks, the read time complexity becomes a function of how many 10s units the query spans. If you want one year of data, you'll have to merge roughly 3,154,000 trees representing the profiling data. By using segment trees, you effectively decrease the number of merge operations from O(n) to O(log n).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HAtKBEGI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110277713-b98a6000-7f8a-11eb-942f-3a924a6e0b09.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HAtKBEGI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/23323466/110277713-b98a6000-7f8a-11eb-942f-3a924a6e0b09.gif" alt="segment_tree_reads"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Help us add more profilers
&lt;/h2&gt;

&lt;p&gt;We spent a lot of time on solving this storage / querying problem because we wanted to make software that can do truly continuous profiling in production without causing too much overhead.&lt;/p&gt;

&lt;p&gt;While Pyroscope currently supports 4 languages, we would love to add more.&lt;/p&gt;

&lt;p&gt;Any sampling profiler that can export data in the "raw" format linked above can become a profiling agent for Pyroscope. We'd love your help building out profilers for other languages!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[x] Go&lt;/li&gt;
&lt;li&gt;[x] Python&lt;/li&gt;
&lt;li&gt;[x] eBPF&lt;/li&gt;
&lt;li&gt;[x] Ruby&lt;/li&gt;
&lt;li&gt;[ ] &lt;a href="https://github.com/pyroscope-io/pyroscope/issues/94"&gt;Java&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] &lt;a href="https://github.com/pyroscope-io/pyroscope/issues/83#issuecomment-784947654"&gt;Rust&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] &lt;a href="https://github.com/pyroscope-io/pyroscope/issues/8"&gt;Node&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[ ] &lt;a href="https://github.com/pyroscope-io/pyroscope/issues/30"&gt;PHP&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to help contribute or need help setting up Pyroscope, here's how you can reach us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join our &lt;a href="https://pyroscope.io/slack"&gt;Slack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Set up a time to meet with us &lt;a href="https://pyroscope.io/setup-call"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Write an &lt;a href="https://github.com/pyroscope-io/pyroscope/issues"&gt;issue&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://twitter.com/PyroscopeIO"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>performance</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Debug Performance issues in Python with Pyroscope (OSS)</title>
      <dc:creator>Ryan Perry</dc:creator>
      <pubDate>Tue, 26 Jan 2021 20:47:58 +0000</pubDate>
      <link>https://dev.to/ryan_perry_aa806d7a49198e/debug-performance-issues-in-python-with-pyroscope-oss-3chb</link>
      <guid>https://dev.to/ryan_perry_aa806d7a49198e/debug-performance-issues-in-python-with-pyroscope-oss-3chb</guid>
      <description>&lt;h1&gt;
  
  
  How to Debug Performance Issues in Python with Profilers
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Using flame graphs to get to the root of the problem
&lt;/h2&gt;

&lt;p&gt;I know from personal experience that debugging performance issues on Python servers can be incredibly frustrating. Usually, increased traffic or a transient bug would cause end users to report that something was wrong. &lt;/p&gt;

&lt;p&gt;More often than not, it's &lt;em&gt;impossible&lt;/em&gt; to exactly replicate the conditions under which the bug occurred, and so I was stuck trying to figure out which part of our code or infrastructure was responsible for the performance issue on our server.&lt;/p&gt;

&lt;p&gt;This article explains how to use flame graphs to continuously profile your code and reveal exactly which lines are responsible for those pesky performance issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you should care about CPU performance
&lt;/h2&gt;

&lt;p&gt;CPU utilization is a metric of application performance commonly used by companies that run their software in the cloud (e.g. on AWS, Google Cloud, etc.).&lt;/p&gt;

&lt;p&gt;In fact, Netflix performance architect Brendan Gregg mentioned that decreasing CPU usage by even 1% is seen as an enormous improvement because of the resource savings that occur at that scale. However, smaller companies can see similar benefits when improving performance because regardless of size, CPU is often directly correlated with two very important facets of running software:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How much money you're spending on servers - The more CPU resources you need, the more it costs to run servers&lt;/li&gt;
&lt;li&gt;End-user experience - The more load placed on your server's CPUs, the slower your website or server becomes &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So when you see a graph of CPU utilization that looks like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ybxXC8nb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105662478-aa40ce80-5e84-11eb-800a-57735c688fc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ybxXC8nb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105662478-aa40ce80-5e84-11eb-800a-57735c688fc9.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the period of 100% CPU utilization, you can assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-users are having a frustrating experience (i.e. the app or website is loading slowly)&lt;/li&gt;
&lt;li&gt;Server costs will increase after you provision new servers to handle the additional load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question is: &lt;strong&gt;which part of the code is responsible for the increase in CPU utilization?&lt;/strong&gt; That's where flame graphs come in!&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use flame graphs to debug performance issues (and save $66,000 on servers)
&lt;/h2&gt;

&lt;p&gt;Let's say the flame graph below represents the timespan that corresponds with the "incident" in the picture above where CPU usage spiked. During this spike, the server's CPUs were spending:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;75% of time in &lt;code&gt;foo()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;25% of time in &lt;code&gt;bar()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;$100,000 on server costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uUxqZQVu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105620812-75197b00-5db5-11eb-92af-33e356d9bb42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uUxqZQVu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105620812-75197b00-5db5-11eb-92af-33e356d9bb42.png" alt="pyro_python_blog_example_00-01"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can think of a flame graph as a super-detailed pie chart, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The width of the flame graph represents 100% of the time range&lt;/li&gt;
&lt;li&gt;Each node represents a function&lt;/li&gt;
&lt;li&gt;The biggest nodes are taking up most of the CPU resources&lt;/li&gt;
&lt;li&gt;Each node is called by the node above it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, &lt;code&gt;foo()&lt;/code&gt; is taking up 75% of the total time range, so we can improve &lt;code&gt;foo()&lt;/code&gt; and the functions it calls in order to decrease our CPU usage (and save $$).&lt;/p&gt;
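The relationship between node widths and sample counts can be sketched in a few lines of Python. The collapsed-stack input below is a hypothetical example made up for illustration (frames joined by `;`, mapped to sample counts):

```python
from collections import Counter

# Hypothetical profiler output in collapsed-stack format: "frame;frame" -> samples
samples = {
    "main;foo": 75,
    "main;bar": 25,
}

total = sum(samples.values())

# A function's node width is the share of samples in which it appears
# anywhere on the stack: its own time plus everything it calls.
# (Recursive frames would need de-duplication; ignored here for brevity.)
widths = Counter()
for stack, count in samples.items():
    for frame in stack.split(";"):
        widths[frame] += count

for frame, count in widths.most_common():
    print(f"{frame}: {100 * count // total}%")  # main: 100%, foo: 75%, bar: 25%
```

This is why the root node always spans the full width of the graph, while `foo()` and `bar()` split the row beneath it in proportion to their CPU time.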
&lt;h2&gt;
  
  
  Creating a flame graph and Table with Pyroscope
&lt;/h2&gt;

&lt;p&gt;To recreate this example with actual code, we'll use Pyroscope, an open-source continuous profiler built specifically for debugging performance issues. To simulate the server doing work, I've created a &lt;code&gt;work(duration)&lt;/code&gt; function that simulates work for the duration passed in. This way, we can replicate &lt;code&gt;foo()&lt;/code&gt; taking 75% of the time and &lt;code&gt;bar()&lt;/code&gt; taking 25% of the time, producing this flame graph from the code below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_wlP29sO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105665338-acf2f200-5e8b-11eb-87b7-d94b7bdda0fc.png" class="article-body-image-wrapper"&gt;&lt;img width="897" alt="foo_75_bar_25_minutes_30" src="https://res.cloudinary.com/practicaldev/image/fetch/s--_wlP29sO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105665338-acf2f200-5e8b-11eb-87b7-d94b7bdda0fc.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# where each iteration simulates CPU time
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# This would simulate a CPU running for 7.5 seconds
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;75000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This would simulate a CPU running for 2.5 seconds
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let's say you optimize your code to decrease &lt;code&gt;foo()&lt;/code&gt;'s time from 75000 to 8000, leaving all other portions of the code the same. The new code and flame graph would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1EQUa438--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105665392-cd22b100-5e8b-11eb-97cc-4dfcceb44cdc.png" class="article-body-image-wrapper"&gt;&lt;img width="935" alt="foo_25_bar_75_minutes_10" src="https://res.cloudinary.com/practicaldev/image/fetch/s--1EQUa438--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105665392-cd22b100-5e8b-11eb-97cc-4dfcceb44cdc.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This would simulate a CPU running for 0.8 seconds
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# work(75000)
&lt;/span&gt;    &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This would simulate a CPU running for 2.5 seconds
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Improving &lt;code&gt;foo()&lt;/code&gt; saved us $66,000
&lt;/h2&gt;

&lt;p&gt;Thanks to the flame graphs, we were able to identify immediately that &lt;code&gt;foo()&lt;/code&gt; was the bottleneck in our code. After optimizing it, we were able to significantly decrease our CPU utilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d3idjzur--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105666001-1a535280-5e8d-11eb-9407-c63955ba86a1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d3idjzur--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/23323466/105666001-1a535280-5e8d-11eb-9407-c63955ba86a1.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means your total CPU utilization decreased by about 66%. If you were paying $100,000 for your servers, you could now manage the same load for roughly $34,000.&lt;/p&gt;
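As a rough sanity check, here is the arithmetic using the simulated workloads from the code earlier (exact figures land around 67% and $33,000; the numbers above are ballpark):

```python
# Simulated CPU work before and after optimizing foo()
before = 75000 + 25000  # foo + bar
after = 8000 + 25000    # optimized foo + unchanged bar

reduction = 1 - after / before
print(f"CPU reduction: {reduction:.0%}")                      # → 67%
print(f"New server cost: ${100_000 * after / before:,.0f}")   # → $33,000
```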

</description>
      <category>opensource</category>
      <category>python</category>
      <category>performance</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
