Pascal Widdershoven for Kabisa Software Artisans

Posted on Jul 5, 2019 • Originally published at theguild.nl on Apr 25, 2019

Caveats of storing large amounts of data in Elixir Agents

#elixir #phoenix

Recently, while working on an Elixir project, I ran into an interesting gotcha with Agents that caused massive resource usage. Read on to find out what happened.

What are Agents in Elixir?

Agents are a simple abstraction around state.

Often in Elixir there is a need to share or store state that must be accessed from different processes or by the same process at different points in time.

The Agent module provides a basic server implementation that allows state to be retrieved and updated via a simple API.
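As a minimal sketch of that API (the map state and the `:greeting` key are made up for illustration), starting, updating and reading an Agent looks roughly like this:

```elixir
# Start an agent holding an initial state (an empty map here).
{:ok, agent} = Agent.start_link(fn -> %{} end)

# Update the state by running a function inside the agent process.
:ok = Agent.update(agent, fn state -> Map.put(state, :greeting, "hello") end)

# Read from the state; the function's return value is sent back to the caller.
greeting = Agent.get(agent, fn state -> Map.get(state, :greeting) end)

# get_and_update/2 returns a value and replaces the state in one step.
previous =
  Agent.get_and_update(agent, fn state ->
    {Map.get(state, :greeting), Map.put(state, :greeting, "hi")}
  end)

IO.inspect({greeting, previous})

:ok = Agent.stop(agent)
```

Every function passed to `Agent.get/2`, `Agent.update/2` or `Agent.get_and_update/2` runs inside the agent process; only its return value travels back to the caller.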

Data in Elixir is immutable and processes share nothing by default. This has many benefits, but it also means that when you do want to share data between processes you need to do some extra work. Fortunately, Elixir provides a lot of great building blocks for this, such as Agents, ETS and Mnesia.

So what’s the problem?

There are two ways to use the state stored in an agent:

By operating on the data from within the Agent's process:

# Compute in the agent/server
def get_something(agent) do
  Agent.get(agent, fn state -> do_something_expensive(state) end)
end

By pulling the data into the client process and operating on it there:

# Compute in the client
def get_something(agent) do
  Agent.get(agent, & &1) |> do_something_expensive()
end

If you look at the code the differences are very subtle. The difference in behaviour, however, is not subtle.

In approach #1 the data remains in the Agent process. However, an Agent handles one request at a time, so if you perform expensive operations there the agent is blocked for the entire duration of the operation, and no other process can access the data until it finishes. Using this model to respond to HTTP requests kills performance.
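The blocking behaviour is easy to reproduce. The sketch below is not the project's code: the state, the 200 ms sleep (standing in for `do_something_expensive/1`) and the variable names are assumptions chosen for illustration. It starts a slow read inside the agent and then times a trivial read issued right after it:

```elixir
{:ok, agent} = Agent.start_link(fn -> %{rules: [:a, :b, :c]} end)

# A slow function executed inside the agent process (approach #1).
slow =
  Task.async(fn ->
    Agent.get(agent, fn state ->
      # Stand-in for an expensive computation on the state.
      Process.sleep(200)
      length(state.rules)
    end)
  end)

# Give the slow call time to reach the agent before issuing the quick one.
Process.sleep(20)

# A trivial read from the current process; it has to queue behind the slow call.
{waited_us, rules} = :timer.tc(fn -> Agent.get(agent, fn state -> state.rules end) end)
waited_ms = div(waited_us, 1000)

rule_count = Task.await(slow)
IO.puts("quick read of #{length(rules)} rules waited about #{waited_ms} ms (slow call saw #{rule_count})")
```

The quick read itself costs next to nothing, yet it waits for most of the slow call's duration because the agent processes its messages one by one.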

In approach #2 the Agent will not be blocked, but the data will be copied into the process that is accessing the data. When the amount of data is small this is not really a problem, but if you start storing larger amounts of data this becomes really expensive real quick.
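One rough way to see the copy is to compare the memory of a client process after each kind of read. This is only a sketch under assumptions: the list of 200,000 integers is a stand-in for real state, and `Process.info/2` with `:memory` gives an approximate figure rather than an exact accounting.

```elixir
# Hypothetical state: a large list standing in for the stored data.
{:ok, agent} = Agent.start_link(fn -> Enum.to_list(1..200_000) end)

measure = fn getter ->
  # Run each read in a fresh process so earlier work does not skew the number.
  task =
    Task.async(fn ->
      state = Agent.get(agent, getter)
      {:memory, bytes} = Process.info(self(), :memory)
      # Touch the result after measuring so it is still live at that point.
      {bytes, is_list(state)}
    end)

  {bytes, _was_list} = Task.await(task)
  bytes
end

# Approach #2: the whole state is copied into the client process.
copied_bytes = measure.(fn state -> state end)

# Approach #1: only the (small) result is copied into the client process.
computed_bytes = measure.(fn state -> length(state) end)

IO.puts("client memory after copying the state: #{copied_bytes} bytes")
IO.puts("client memory after computing in the agent: #{computed_bytes} bytes")
```

The exact numbers depend on the VM and its garbage collector, so treat them as an indication of scale only.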

Real life example

The impact of this can be huge as I will demonstrate in the case below.

In the project I’m working on we were storing a set of rules in an Agent. A rule is a struct with 27 fields, and we were storing approximately 5000 rules in the Agent. There’s an HTTP endpoint that uses these rules to determine the response for every request.
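To get a feel for how much data a state like this represents, the size of a term as it would be copied between processes can be estimated with `:erts_debug.flat_size/1`. The rules below are hypothetical stand-ins (maps with 27 generated fields holding short strings), not the project's actual structs, so the resulting number is purely illustrative:

```elixir
# Hypothetical stand-in for one rule: a map with 27 fields holding short strings.
build_rule = fn id ->
  Map.new(1..27, fn n -> {:"field_#{n}", "rule #{id} value #{n}"} end)
end

rules = Enum.map(1..5_000, build_rule)

# flat_size/1 reports the size in machine words of the term as it would be
# when copied to another process (sharing inside the term is not preserved).
words = :erts_debug.flat_size(rules)
bytes = words * :erlang.system_info(:wordsize)

IO.puts("about #{Float.round(bytes / 1_048_576, 1)} MB copied per full read of the state")
```

Multiplying that figure by the number of requests in flight gives a rough idea of the extra memory needed when every request pulls the full state out of the Agent. Note that binaries larger than 64 bytes are reference counted and shared between processes rather than copied, so real data containing large binaries behaves differently.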

For a while this was fine, but when the load started increasing we noticed the server going out of memory. To debug this I started throwing load at the endpoint using wrk. Results below:

Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats Avg Stdev Max +/- Stdev
    Latency 2.23s 553.43ms 3.00s 57.78%
    Req/Sec 59.72 121.98 680.00 88.64%
  Latency Distribution
     50% 2.24s
     75% 2.67s
     90% 2.88s
     99% 3.00s
  5551 requests in 1.00m, 11.80MB read
  Socket errors: connect 0, read 5971, write 3, timeout 5506
  Non-2xx or 3xx responses: 4575
Requests/sec: 92.38
Transfer/sec: 201.05KB

As you can see, only about 92 requests per second were handled and a lot of requests timed out (took more than 3 seconds). During the test, the Elixir OS process consumed around 10GB of memory.

Solutions

As we’ve seen in the previous section, storing these amounts of data in an Agent requires a lot of memory and performance is frankly not great.

Looking at the code and reading the Agent documentation, I quickly realised that the root cause of this issue was the fact that all rules were copied to the process handling the HTTP request, for every request. So how can we prevent this?

‘Shared nothing’ is a very core principle of Elixir/Erlang, so the short answer is you can’t prevent the data from being copied if you want to share the data between processes. This affects all ways of storing data in memory, so not just Agents.

There are workarounds, like fast_global. The fast_global library works by dynamically compiling a module at runtime, but it’s not without drawbacks.

So the solution is to make sure the data does not have to be shared between processes. There are a variety of ways to do this. The approach I took was to create a pool of worker processes (with Poolboy) that handle executing the rules. When an HTTP request comes in, the rule matching is handled by one of the worker processes.

In code this looks roughly like this (simplified):

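defmodule Worker do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, nil, [])
  end

  @impl true
  def init(_) do
    # Copy the rules out of the Agent once, when the worker starts
    rules = State.get()
    {:ok, rules}
  end

  @impl true
  def handle_call({:match_rules, input}, _from, rules) do
    # match_rules/2 (not shown) runs inside the worker, so only the
    # input and the matches cross process boundaries
    matches = match_rules(rules, input)
    {:reply, matches, rules}
  end
end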

When a worker starts it loads (copies) the rules from the Agent (State is a module wrapping the Agent) into the worker process. Each worker process contains a copy of the rules so the memory usage is predictable.

If the rules change at runtime, the processes are simply killed and restarted so the new rules will be used automatically. Poolboy takes care of starting N workers and selecting a worker from the pool.
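Wiring the workers into Poolboy could look roughly like the sketch below. It assumes `poolboy` is listed as a dependency (for example `{:poolboy, "~> 1.5"}`) and that the `Worker` module above is available. The `RulePool` module, the `:rule_workers` pool name and the pool sizes are hypothetical and not taken from the project.

```elixir
defmodule RulePool do
  # Hypothetical pool name; any atom works.
  @pool :rule_workers

  # Hypothetical sizes: every worker holds its own copy of the rules, so the
  # pool size directly determines how many copies live in memory.
  def pool_config do
    [
      name: {:local, @pool},
      worker_module: Worker,
      size: 10,
      max_overflow: 0
    ]
  end

  # Child spec to place in the application's supervision tree.
  def child_spec(_opts) do
    :poolboy.child_spec(@pool, pool_config(), [])
  end

  # Check out a worker, run the match inside it, and return the worker to the pool.
  def match(input, timeout \\ 5_000) do
    :poolboy.transaction(
      @pool,
      fn worker -> GenServer.call(worker, {:match_rules, input}) end,
      timeout
    )
  end
end
```

With `RulePool` added to the supervision tree, a request handler would call `RulePool.match(input)`. Keeping `max_overflow` at 0 means Poolboy never starts extra workers under load, which keeps the number of rule copies fixed.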

End result

With that in place, wrk results started looking as follows:

Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats Avg Stdev Max +/- Stdev
    Latency 1.04s 270.71ms 1.66s 69.89%
    Req/Sec 221.14 65.34 405.00 66.08%
  Latency Distribution
     50% 1.07s
     75% 1.25s
     90% 1.36s
     99% 1.47s
  52823 requests in 1.00m, 14.41MB read
  Socket errors: connect 0, read 1014, write 0, timeout 0
Requests/sec: 879.16
Transfer/sec: 245.55KB

As you can see the throughput increased from 92 req/sec to 879 req/sec. Average latency went down from 2.23s to 1.04s. Memory used went down from 10GB to 400MB.

Not bad!

Top comments (2)

Edison Yap • Jul 6 '19 • Edited

That was a really cool read! Thanks for sharing!

Could you talk a little bit about how you got to the hypothesis that Agent was the bottleneck? You said you tested the endpoint and your memory usage went up, but how did you know it was your Rules agents? Observers?

Pascal Widdershoven Kabisa Software Artisans • Jul 6 '19

Thanks, that's a great question!

I noticed the memory increasing so rapidly that I figured it had to be copying the full set of rules somewhere. So reading the code and the Agent docs I stumbled on this snippet in the docs [1]:

The first function blocks the agent. The second function copies all the state to the client and then executes the operation in the client. One aspect to consider is whether the data is large enough to require processing in the server, at least initially, or small enough to be sent to the client cheaply.

Before reading this I didn't realize that using an Agent like this copies the whole state upon reading. It actually makes a lot of sense; it's just something I hadn't run into earlier.

[1] hexdocs.pm/elixir/Agent.html#modul...