<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erlang Solutions</title>
    <description>The latest articles on DEV Community by Erlang Solutions (@erlang_solutions).</description>
    <link>https://dev.to/erlang_solutions</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F967%2Fc95a4b34-2634-4bdc-a359-89e9665ed093.png</url>
      <title>DEV Community: Erlang Solutions</title>
      <link>https://dev.to/erlang_solutions</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erlang_solutions"/>
    <language>en</language>
    <item>
      <title>Lessons learned from a decade consulting XMPP clients</title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Fri, 26 Jun 2020 14:00:27 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/lessons-learned-from-a-decade-consulting-xmpp-clients-53n8</link>
      <guid>https://dev.to/erlang_solutions/lessons-learned-from-a-decade-consulting-xmpp-clients-53n8</guid>
<description>&lt;p&gt;Over the last ten years, we have helped A LOT of companies add value to their products using &lt;a href="https://www.erlang-solutions.com/products/mongooseim.html"&gt;MongooseIM&lt;/a&gt; or XMPP-based chat applications. This has allowed us to partner with, and get insights from, companies spanning almost every industry and company size. You may wonder how much variety and complexity a team that specialises in scalable Instant Messaging is exposed to, but the truth is, the only thing that ties these projects together is MongooseIM, our scalable, customisable XMPP server. Once we dive deeper into the needs of our clients, we usually find deployments need new features, tailored to their specific needs, to help them achieve their most critical business objectives.&lt;/p&gt;

&lt;p&gt;Customisation is what sets the work we do apart. We make sure that your Instant Messaging is fit for purpose. Even when XMPP or MongooseIM offer a rich enough chat experience, there are other details we must get right to deliver a reliable solution. This includes taking into consideration your workload and usage scenarios to ensure that you have the scalability to guarantee uptime.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can we create or improve a chat experience for your business and its customers?
&lt;/h3&gt;

&lt;p&gt;Building and maintaining a chat application is a continuous process. We join our customers at many different stages, and depending on their needs, we can help in different ways. Let's take a closer look at some examples.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Building a bespoke solution.
&lt;/h3&gt;

&lt;p&gt;We can become an embedded part of the team that builds the service. This allows us to architect and build the solution together from scratch. From the very first day, we learn about the unique properties of the customer and how instant messaging helps them to achieve their goals. As a result, we can use our expertise to make sure the system is built right from the start, before any less-than-ideal decisions are made. An added benefit for our customers is that their team gets to see how we go about architecting and designing a best-practice system. That way, there is natural knowledge sharing at every step of the process, so when the product is launched and in use, their team is familiar with how to operate and maintain the system. When possible, this is the perfect option both for the organisations we work with and for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving and optimising before release.
&lt;/h3&gt;

&lt;p&gt;We can't always be there from the start. Sometimes, there is a proof of concept application ready which needs improving, often because there is a technical issue that needs to be solved before the release.&lt;br&gt;
At this stage, it's still possible to help design the architecture of the entire chat system. We can also deploy it to the production environment and make sure it can cope with the increasing load introduced by new users. There are a number of common problems we help solve for companies looking to optimise their chat before release.   &lt;/p&gt;

&lt;p&gt;One of the most common reasons companies need help at this stage is that developers with less XMPP experience have implemented ad-hoc solutions to problems already solved by an existing, public XEP (XMPP Extension Protocol).&lt;br&gt;&lt;br&gt;
We also often see problems arising when a custom extension is built on top of an existing one, which, when done incorrectly, can cause significant complications. &lt;br&gt;
In the lead up to the release of a chat solution companies often need help with custom integrations.&lt;br&gt;&lt;br&gt;
This is especially true for companies adding real-time chat to an existing product. In many cases, this can be done outside of the MongooseIM code; however, it's usually more scalable and efficient if it can run within MongooseIM. Not every company needs to hire full-time Erlang developers, especially just to implement a few integrations. Our team are experienced, ready and happy to help with custom integrations to ensure the reliability and scalability of the release.&lt;/p&gt;

&lt;p&gt;To ensure success when working with an existing product that has yet to be released, we often need to take a step back and look at the product, its goals and its failures holistically. From there we can suggest the best solution. This may mean we need to rethink the existing implementation and change or reshuffle some of the code. In extreme cases, we may need to throw away large chunks of the existing solution to ensure the release is successful. This is the last resort and a decision that is taken collaboratively in the name of reaching an optimal solution.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Increasing the scalability of your chat to handle more users.
&lt;/h3&gt;

&lt;p&gt;Your instant messaging is deployed and running. Your product is growing successfully. It should be a time to celebrate, but often increased adoption or use of a chat application comes with increased scalability issues. These need to be sorted fast and future-proofed to avoid giving your users a poor experience and damaging the growth you've worked hard to achieve.   &lt;/p&gt;

&lt;p&gt;Our role is to find the bottlenecks and fix them. We study the architecture, the server setup, configuration, enabled extensions and customised code. Then, we decide on the best possible solution together, in collaboration with our clients.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Improving or customising your chat application.
&lt;/h3&gt;

&lt;p&gt;All right, the app is deployed, and users are chatting, so far so good. How can joining forces with us make things even better? There are many reasons companies will come to us to improve an existing instant messaging solution.   &lt;/p&gt;

&lt;p&gt;Many of our customers come to us to develop unique functionalities that are necessary for their success. We help by designing the extension on the protocol level and implementing it on the server.&lt;br&gt;
The public XEPs only cover generic use cases that other people deploying XMPP-based chat services can also benefit from. So, if your chat needs a specific functionality, we're always happy to help.&lt;/p&gt;

&lt;p&gt;We also have projects where MongooseIM replaces another XMPP server. This is common when clients discover that the alternative technology they chose instead of MongooseIM did not meet their expectations. A common reason for this is that out-of-the-box functionalities can turn out to be black boxes that make it impossible to carry out necessary customisations or improvements.&lt;/p&gt;

&lt;p&gt;Another common reason to switch to MongooseIM is to improve scalability when an existing solution reaches its capacity. In this case, we play the role of surgeons, carefully implementing a transplant to put MongooseIM at the heart, improving scalability but keeping the rest of the system running smoothly.   &lt;/p&gt;

&lt;h2&gt;
  
  
  Tips to avoid common mistakes when using open-source MongooseIM:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose your XEP, and choose wisely.
&lt;/h3&gt;

&lt;p&gt;The X in XMPP stands for eXtensible. This means it might not be the simplest protocol, but there's hardly a chat feature it doesn't cover. There is a set of core RFCs, on top of which custom extensions (XEPs) are built. There are usually no issues with the core functionality covered by the RFCs, as it is implemented by many client libraries and servers. XEPs are often isolated from each other and independent of the core functionality; they add extra features or capabilities to the core of XMPP. Here comes the biggest challenge: deciding which XEP to choose (you can find more details in our &lt;a href="https://www.erlang-solutions.com/blog/xmpp-protocol-use-cases-and-guide-erlang-solution-blog.html"&gt;XMPP use-cases&lt;/a&gt; guide). From our experience, developers starting their adventure with XMPP may have a hard time finding a suitable XEP, and an ill-suited choice can create difficulties or limitations later. Some examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too many extensions are enabled on the server-side, but not used by the client app.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Certain XEPs, like XEP-0012: Last Activity, put extra load on the server even when the client application is not using them. Enabling only the extensions that are actually used helps to scale the MongooseIM cluster.&lt;/p&gt;
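
&lt;p&gt;As a sketch of what that looks like in practice with MongooseIM's TOML configuration, you keep only the modules your client app actually uses in &lt;code&gt;mongooseim.toml&lt;/code&gt; (the module names and options below are illustrative; check the documentation for your MongooseIM version):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# mongooseim.toml -- enable only the extensions the client app uses
[modules.mod_roster]   # contact lists

[modules.mod_mam]      # XEP-0313: Message Archive Management
  backend = "rdbms"

# [modules.mod_last]   # XEP-0012: Last Activity -- left disabled,
#                      # as the client app does not use it
&lt;/code&gt;&lt;/pre&gt;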

&lt;p&gt;&lt;strong&gt;An outdated or rejected extension is used.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Often, different products with instant messaging functionality store the messages on the server. For this, there is XEP-0313: Message Archive Management. The majority of servers and client libraries have supported this XEP for several years already. It replaces an older, more complicated and now deprecated XEP-0136: Message Archiving. Sometimes, developers still choose the deprecated extension over the new one.&lt;/p&gt;
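
&lt;p&gt;For comparison, a basic XEP-0313 query for the full archive is a single IQ stanza. The &lt;code&gt;id&lt;/code&gt; below is made up, and the namespace version is illustrative (servers may support &lt;code&gt;urn:xmpp:mam:1&lt;/code&gt; or &lt;code&gt;urn:xmpp:mam:2&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;iq type='set' id='mam-query-1'&amp;gt;
  &amp;lt;query xmlns='urn:xmpp:mam:2'/&amp;gt;
&amp;lt;/iq&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The server then pushes the matching archived messages to the client, each wrapped in a &lt;code&gt;result&lt;/code&gt; element from the same namespace.&lt;/p&gt;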

&lt;p&gt;&lt;strong&gt;A custom extension is built even if an existing XEP covers the required functionality.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For instance, a user may want to know if someone is actively typing in a chat conversation. To achieve that, some extra XMPP stanzas need to be sent. Developers, myself included, are creative creatures, and sometimes we rush to reinvent the wheel. In this particular case, XEP-0085: Chat State Notifications already fulfils the need.&lt;/p&gt;
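
&lt;p&gt;For reference, signalling that a user is typing with XEP-0085 takes a single child element in a chat message (the addresses below are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;message from='alice@example.com/mobile' to='bob@example.com' type='chat'&amp;gt;
  &amp;lt;composing xmlns='http://jabber.org/protocol/chatstates'/&amp;gt;
&amp;lt;/message&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same namespace also defines the &lt;code&gt;active&lt;/code&gt;, &lt;code&gt;paused&lt;/code&gt;, &lt;code&gt;inactive&lt;/code&gt; and &lt;code&gt;gone&lt;/code&gt; states, so there is rarely a need for a custom stanza here.&lt;/p&gt;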

&lt;h3&gt;
  
  
  Reduce unknowns
&lt;/h3&gt;

&lt;p&gt;When preparing for a wave of users hitting the instant messaging functionality, we need to remember to load test. It's crucial to run load tests to know the capacity of your production setup. To my surprise, this is often neglected. Many people want an answer to the following simple question: "how many resources do I need to handle X users sending Y messages per second?". There are too many factors and variables to answer this question accurately up front. To get a better idea of the capacity, you can run load tests simulating user behaviours. It usually starts with developing the scenario. After making sure that monitoring is in place, the load testing can start. Then we can learn how the system behaves and what to expect in the production environment.&lt;/p&gt;
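
&lt;p&gt;As a rough illustration of that first step, an Amoc scenario is just an Erlang module implementing the &lt;code&gt;amoc_scenario&lt;/code&gt; behaviour. The skeleton below is a sketch only; the exact callbacks and their arities depend on the Amoc version you use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(chat_load_scenario).
-behaviour(amoc_scenario).

-export([init/0, start/1]).

%% Runs once before any simulated users are started,
%% e.g. to set up metrics.
init() -&amp;gt;
    ok.

%% Runs once per simulated user; the user id makes each
%% simulated client unique.
start(_UserId) -&amp;gt;
    %% connect as this user, then send and receive messages
    %% in a loop, reporting timings to the metrics backend
    ok.
&lt;/code&gt;&lt;/pre&gt;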

&lt;h3&gt;
  
  
  Summing it up
&lt;/h3&gt;

&lt;p&gt;It is very important:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to know what functionalities are needed and enabled on the server
&lt;/li&gt;
&lt;li&gt;to verify if a custom extension is already covered in a XEP
&lt;/li&gt;
&lt;li&gt;to only build a custom extension if necessary
&lt;/li&gt;
&lt;li&gt;to run load tests to find the capacity and be better prepared for the real traffic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of the above may not be rocket science; many services run on MongooseIM without our help.&lt;br&gt;&lt;br&gt;
If you are up to the challenge, good; we are always happy to see our product in use.&lt;br&gt;&lt;br&gt;
But, when your time is critical, or if you stumble upon a problem you can't fix, we're happy to help. If hacking through 80K lines of unknown code is not your cup of tea, we can guide you through it. Stay safe and have fun adding real-time communication to your product, if required.    &lt;/p&gt;

&lt;h2&gt;
  
  
  You may also like:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.erlang-solutions.com/products/mongooseim.html"&gt;MongooseIM&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.erlang-solutions.com/training/online-training.html"&gt;Online Erlang and Elixir training&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.erlang-solutions.com/products/mongooseim.html"&gt;Our RabbitMQ services&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.erlang-solutions.com/resources/webinars.html"&gt;Our next webinar&lt;/a&gt;
&lt;/h3&gt;

</description>
      <category>xmpp</category>
      <category>messaging</category>
    </item>
    <item>
      <title>New webinar - Building Tetris in Elixir </title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Thu, 09 Jan 2020 11:11:45 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/new-webinar-building-tetris-in-elixir-9hi</link>
      <guid>https://dev.to/erlang_solutions/new-webinar-building-tetris-in-elixir-9hi</guid>
<description>&lt;p&gt;Last year, Sandesh Soni developed a fully playable version of Tetris in Elixir using Phoenix LiveView and OTP. It's a great example of what can be done in Phoenix LiveView without any JavaScript; the game looks slick and captures the charm of the original. In this live coding demonstration, Sandesh will build a new and improved version of the game as you follow along. You'll learn some great tips and tricks for Phoenix LiveView and OTP, as well as how to build the game in real time.&lt;/p&gt;

&lt;p&gt;Register at &lt;a href="https://www2.erlang-solutions.com/tetris4"&gt;https://www2.erlang-solutions.com/tetris4&lt;/a&gt; and even if you can't make the live webinar, you'll be sent a recording at a later date. &lt;/p&gt;

</description>
      <category>webapps</category>
      <category>webinar</category>
      <category>elixir</category>
    </item>
    <item>
      <title>Erlang Highlights 2019 - Best Of The BEAM. </title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Wed, 11 Dec 2019 17:51:51 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/erlang-highlights-2019-best-of-the-beam-23o3</link>
      <guid>https://dev.to/erlang_solutions/erlang-highlights-2019-best-of-the-beam-23o3</guid>
<description>&lt;p&gt;Despite being over 30 years old (and open source for 21), Erlang continues to evolve, finding new industries to impact, fresh use cases and exciting stories. &lt;br&gt;
In February 2019, the Erlang Ecosystem Foundation was announced at Code BEAM San Francisco. This group brings together a diverse community of BEAM users, including corporate and commercial interests, in the Erlang and Elixir Ecosystem. It encourages the continued development of technologies and open source projects based on and around the BEAM, its runtime and languages. This is an exciting step to ensure the ongoing success of the technologies we specialise in; as their hashtag so aptly puts it, #weBEAMtogether.&lt;/p&gt;

&lt;p&gt;Erlang's own enigmatic standing in the developer community was best summed up by &lt;a href="https://insights.stackoverflow.com/survey/2019"&gt;StackOverflow's annual developer survey&lt;/a&gt;. This year, Erlang was featured as one of the top 10 highest-paying languages for developers. It also featured in all three of the survey's most loved, most dreaded and most wanted technology lists.&lt;/p&gt;

&lt;p&gt;Throughout 2019, there have been many fantastic articles, guides, podcasts and talks given by members of the community showing off the capabilities of the technology. If you're looking for inspiration, or want to see why a language that is over 30 years old still provides some of the best paid jobs, check out these fantastic stories.&lt;/p&gt;

&lt;h1&gt;
  
  
  Top Erlang Resources 2019
&lt;/h1&gt;

&lt;h3&gt;
  
  
  TalkConcurrency
&lt;/h3&gt;

&lt;p&gt;We were privileged to host a panel of industry legends including Sir Tony Hoare, Carl Hewitt and the late Joe Armstrong. What followed was an open discussion about the need for concurrency and how it is likely to evolve in the future. Carl Hewitt is the designer of the logic programming language Planner and is known for his work on evolving the actor model. Sir Tony Hoare developed the sorting algorithm &lt;a href="https://www.geeksforgeeks.org/quick-sort/"&gt;Quicksort&lt;/a&gt;. He has been highly decorated for his work within Computer Science, including six Honorary Doctorates and a Knighthood in the year 2000. Most in our community will be familiar with Joe Armstrong as one of the inventors of Erlang, and someone whose work was highly influential in the field of concurrency. Each of our three guests is highly celebrated for their own approach to concurrency and their impact on the field, whilst using different technologies. The wisdom that these three legends hold is clearly on show during the discussion. It is truly a must-watch talk for anyone with even a passing interest in Erlang, Elixir and the BEAM.&lt;br&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  How to introduce dialyser to a large project
&lt;/h3&gt;

&lt;p&gt;Dialyser is a fantastic tool to identify discrepancies and errors in Erlang code. Applying dialyser to a considerable codebase can lead to performance issues, particularly when the codebase has never been analysed with dialyser before. In this blog, Brujo Benavides demonstrates how the team at NextRoll were able to reduce discrepancies in the code by a third in just a week, while also setting the system up so that dialyser could be included in ongoing development. &lt;br&gt;
&lt;a href="http://tech.nextroll.com/blog/dev/2019/02/19/erlang-dialyzer.html"&gt;Read the blog here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Five 9's for five years at the UK’s National Health Service
&lt;/h3&gt;

&lt;p&gt;Martin Sumner joined the Elixir Talk podcast for a fantastic discussion of the work they're doing at the NHS. Their centralised exchange point handles over 65 million record requests a day. Availability is vital due to the nature of medical information. Using Riak, they have managed to maintain 99.999% availability for over five years, an impressive effort. &lt;a href="https://www.listennotes.com/podcasts/elixir-talk/episode-144-feat-martin-_pMKJiWPOQR/"&gt;Listen to the podcast here.&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Sasa Juric shared the Soul of Erlang
&lt;/h3&gt;

&lt;p&gt;One of the most shared and talked-about conference videos of 2019 was Sasa Juric's 'The Soul of Erlang' at GoTo Chicago 2019, and with good reason. It is an articulate, passionate summary of what makes Erlang so unique, and why it can achieve things that are so difficult in other technologies. &lt;a href="https://www.youtube.com/watch?v=JvBT4XBdoUE"&gt;Watch the video here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Who is using Erlang &amp;amp; why?
&lt;/h3&gt;

&lt;p&gt;When we launched our blog on the companies using Erlang and why, we had no idea just how much it would resonate with the community. To date, there have been over 25,000 visits to the page. It was the top story on HackerNews and continues to generate a high volume of visits four months after its initial release. The reception to this blog shows the ongoing interest in the language, and the appetite for people sharing in-production examples of Erlang at work. &lt;a href="https://www.erlang-solutions.com/blog/which-companies-are-using-erlang-and-why-mytopdogstatus.html"&gt;Read the blog here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BEAM extreme
&lt;/h3&gt;

&lt;p&gt;AdRoll deals with an average of half a million real-time bid requests per second, with spikes substantially higher than that. Each big spike has a significant financial implication. As a result, they've had to develop a set of tricks to give their system a little performance boost. In this talk at ElixirConf, Miriam Pena demonstrates some of the tactics she's seen and used to give the BEAM an extra edge when it comes to speed or memory. &lt;a href="https://www.youtube.com/watch?v=-1j36z8SllI"&gt;Watch the talk here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ten years of Erlang
&lt;/h3&gt;

&lt;p&gt;Fred Hebert is an experienced, passionate and respected member of the Erlang community. His conference talks, books and webinars are all extremely valuable resources. This year, he celebrated ten years as part of the community and took time to reflect on Erlang's past, its growth, and where it may go in the future. The blog is a fantastic read, and we recommend it for anyone who is passionate about the BEAM. &lt;a href="https://ferd.ca/ten-years-of-erlang.html"&gt;Read the blog here.&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Testable, high performance, large scale distributed Erlang
&lt;/h3&gt;

&lt;p&gt;Christopher Meiklejohn presents the design of an alternative runtime system for improved scalability and reduced latency in distributed actor applications using Partisan, which is built in Erlang. &lt;a href="https://www.youtube.com/watch?v=KrwhOkiifQ8"&gt;Watch the talk here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Erlang for Blockchain
&lt;/h3&gt;

&lt;p&gt;As the number of in-production blockchain uses continues to grow, with examples such as &lt;a href="https://www.hyperledger.org/resources/publications/walmart-case-study"&gt;Walmart's use of smart contracts&lt;/a&gt; in their logistics supply chain, Erlang has increasingly become a language of choice for blockchain providers. ArcBlock joined the Erlang Ecosystem Foundation as a founding sponsor, and also joined us for &lt;a href=""&gt;guest blogs&lt;/a&gt; and a &lt;a href=""&gt;webinar&lt;/a&gt;. Aeternity is another big advocate for the use of Erlang in blockchain development. You can read about their experience &lt;a href="https://hackernoon.com/yanislav-malahov-on-%C3%A6ternity-a-legacy-to-outlive-its-founder-93cc8d6400b7"&gt;using the BEAM for blockchain here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solving embarrassingly obvious problems in Erlang
&lt;/h3&gt;

&lt;p&gt;Often, when people complain about the syntax of Erlang, they are making simple errors that can be fixed with a change of mindset. In this blog, Garret Smith shows how to make simple shifts to eliminate these errors and, in the process, become a better programmer. &lt;a href="https://blog.usejournal.com/solving-embarrassingly-obvious-problems-in-erlang-e3f21a6203cc"&gt;Read his solutions here.&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Whatsapp user migration
&lt;/h3&gt;

&lt;p&gt;Whatsapp continues to be one of the most famous examples of Erlang development. This year, they spoke to the crowd at Code BEAM SF about how they migrated their 1.5 billion users to the Facebook infrastructure. &lt;a href="https://www.youtube.com/watch?v=93MA0VUWP9w&amp;amp;feature=emb_title"&gt;Watch their talk here.&lt;/a&gt; And, for those interested, Whatsapp are currently growing their London team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;2019 showed that there is still a demand for Erlang and the reliability and fault-tolerance it delivers. 2020 is already looking like an exciting year. The growth of FinTech, digital banking and blockchain all provide exciting avenues for expansion for the language. The newly formed Erlang Ecosystem Foundation has working groups dedicated to developing libraries and tools that make the Erlang Ecosystem even easier to use, helping to grow the community. And, for the first time, the BEAM will have a dedicated room at FOSDEM, which is sure to introduce more developers to the language. If you'd like to catch all of our news, guides and webinars in 2020 and beyond, &lt;a href="https://www2.erlang-solutions.com/update-your-subscription"&gt;join our mailing list.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  You may also like
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.erlang-solutions.com/blog/elixir-highlights-2019-best-of-the-beam.html"&gt;Best of Elixir 2019&lt;/a&gt;&lt;br&gt;
Best of RabbitMQ 2019 - Coming soon&lt;br&gt;
Best of FinTech 2019 - Coming soon&lt;/p&gt;

</description>
      <category>erlang</category>
      <category>beam</category>
    </item>
    <item>
      <title>How to debug your RabbitMQ</title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Mon, 25 Nov 2019 12:00:28 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/how-to-debug-your-rabbitmq-2e04</link>
      <guid>https://dev.to/erlang_solutions/how-to-debug-your-rabbitmq-2e04</guid>
      <description>&lt;h3&gt;
  
  
  What you will learn in this blog.
&lt;/h3&gt;

&lt;p&gt;Our RabbitMQ consultancy customers come from a wide range of industries. As a result, we have seen almost all of the unexpected behaviours that it can throw at you. RabbitMQ is a complex piece of software that employs concurrency and distributed computing (via Erlang), so debugging it is not always straightforward. To get to the root cause of an unexpected (and unwanted) behaviour, you need the right tools and the right methodology. In this article, we will demonstrate both to help you learn the craft of debugging in RabbitMQ.  &lt;/p&gt;

&lt;h3&gt;
  
  
  The problem of debugging RabbitMQ.
&lt;/h3&gt;

&lt;p&gt;The inspiration for this blog comes from a real-life example. One of our customers had the &lt;a href="https://cdn.rawgit.com/rabbitmq/rabbitmq-management/v3.7.9/priv/www/api/index.html" rel="noopener noreferrer"&gt;RabbitMQ Management HTTP API&lt;/a&gt; serving crucial information to their system. The system relied heavily on the API, specifically on the &lt;code&gt;/api/queues&lt;/code&gt; endpoint, because the system needed to know the number of messages ready in each queue in a RabbitMQ cluster. The problem was that sometimes an HTTP request to the endpoint lasted up to tens of seconds (in the worst case they weren't even able to get a response from the API at all).&lt;br&gt;&lt;br&gt;
So what caused some requests to take so much time? To answer that question, we tried to reproduce the issue through load testing.    &lt;/p&gt;
&lt;h3&gt;
  
  
  Running load tests
&lt;/h3&gt;

&lt;p&gt;We use a platform that we created for &lt;a href="https://www.erlang-solutions.com/products/mongooseim.html" rel="noopener noreferrer"&gt;MongooseIM&lt;/a&gt; to run our &lt;a href="https://tide.erlang-solutions.com/" rel="noopener noreferrer"&gt;Continuous Load Testing&lt;/a&gt;. Here are some of the most important aspects of the platform:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;all the services that are involved in a load test run inside docker containers&lt;/li&gt;
&lt;li&gt;the load is generated by &lt;a href="https://github.com/esl/amoc" rel="noopener noreferrer"&gt;Amoc&lt;/a&gt;; it's an open source tool written in Erlang for generating massively parallel loads of any kind (AMQP in our case) &lt;/li&gt;
&lt;li&gt;metrics from the system under test and Amoc site are collected for further analysis.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The diagram below depicts a logical architecture of an example load test with RabbitMQ:&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FIRrjgBq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FIRrjgBq.png" alt="load testing diagram"&gt;&lt;/a&gt;&lt;/p&gt;  

&lt;p&gt;In the diagram, the left-hand side shows a cluster of Amoc nodes that emulate AMQP clients which, in turn, generate the load against RabbitMQ. On the other side, we can see a RabbitMQ cluster that serves the AMQP clients. All the metrics from both the Amoc and RabbitMQ services are collected and stored in an InfluxDB database.&lt;/p&gt;
&lt;h3&gt;
  
  
  Slow Management HTTP API queries
&lt;/h3&gt;

&lt;p&gt;We tried to reproduce the slow queries to the Management HTTP API in our load tests. The test scenario was fairly simple. A bunch of publishers were publishing messages to the default exchange. Messages from each publisher were routed to a dedicated queue (each publisher had a dedicated queue). There were also consumers attached to each queue. Queue mirroring was enabled.&lt;br&gt;&lt;br&gt;
For concrete values, check the table below:   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FP6Gt5Ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FP6Gt5Ww.png" alt="load test table"&gt;&lt;/a&gt;&lt;/p&gt;  

&lt;p&gt;That setup stressed the RabbitMQ servers on our infrastructure, as can be seen in the graphs below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FEfFc9WZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FEfFc9WZ.png" alt="rabbitmq cpu usuage"&gt;&lt;/a&gt;&lt;/p&gt;   
    

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FnV6XcPl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FnV6XcPl.png" alt="rabbitmq ram table"&gt;&lt;/a&gt;&lt;/p&gt;  
  

&lt;p&gt;Every RabbitMQ node consumed about 6 (out of 7) CPU cores and roughly 1.4GB of RAM except for &lt;code&gt;rabbitmq-1&lt;/code&gt; which consumed significantly more than the others. That was likely because it had to serve more of the Management HTTP API requests than the other two nodes.   &lt;/p&gt;

&lt;p&gt;During the load test &lt;code&gt;/api/queues&lt;/code&gt; endpoint was queried &lt;strong&gt;every two seconds&lt;/strong&gt; for the list of all queues together with corresponding &lt;code&gt;messages_ready&lt;/code&gt; values. A query looked like this:   &lt;/p&gt;

&lt;p&gt;&lt;code&gt;http://rabbitmq-1:15672/api/queues?columns=name,messages_ready&lt;/code&gt;    &lt;/p&gt;
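
&lt;p&gt;With the &lt;code&gt;columns&lt;/code&gt; parameter, the endpoint returns only the requested fields for every queue, so a response is a JSON array along these lines (the values below are invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[
  {"name": "queue_1", "messages_ready": 12},
  {"name": "queue_2", "messages_ready": 0}
]
&lt;/code&gt;&lt;/pre&gt;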

&lt;p&gt;Here are the results from the test:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FpnZi93n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FpnZi93n.png" alt="rabbitmq ram table"&gt;&lt;/a&gt;&lt;/p&gt;   

&lt;p&gt;The figure above shows the query time during a load test. It's clear that things are very slow. The median equals &lt;strong&gt;1.5s&lt;/strong&gt;, while the 95th, 99th and 99.9th percentiles, as well as the max, reach &lt;strong&gt;20s&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Debugging
&lt;/h3&gt;

&lt;p&gt;Once the issue is confirmed and can be reproduced, we are ready to start debugging. The first idea was to find the Erlang function that is called when a request to the RabbitMQ Management HTTP API comes in, and determine where that function spends its execution time. If we were able to do this, it would allow us to localise the most time-expensive code behind the API.&lt;/p&gt;
&lt;h3&gt;
  
  
  Finding the entrypoint function
&lt;/h3&gt;

&lt;p&gt;To find the function we were looking for, we took the following steps:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;looked through the &lt;a href="https://github.com/rabbitmq/rabbitmq-management/tree/v3.7.9" rel="noopener noreferrer"&gt;RabbitMQ Management Plugin&lt;/a&gt; to find the appropriate "HTTP path to function" mapping,&lt;/li&gt;
&lt;li&gt;used Erlang's tracing feature to verify that the function we found is really called when a request comes in.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The management plugin uses &lt;a href="https://github.com/ninenines/cowboy" rel="noopener noreferrer"&gt;cowboy&lt;/a&gt; (an Erlang HTTP server) under the hood to serve the API requests. Each HTTP endpoint requires a cowboy callback module, so we easily found the &lt;a href="https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_wm_queues.erl#L49" rel="noopener noreferrer"&gt;&lt;code&gt;rabbit_mgmt_wm_queues:to_json/2&lt;/code&gt;&lt;/a&gt; function, which seemed to handle requests coming to the &lt;code&gt;/api/queues&lt;/code&gt; endpoint. We confirmed that with tracing (using the &lt;a href="http://ferd.github.io/recon/recon.html" rel="noopener noreferrer"&gt;recon&lt;/a&gt; library that ships with RabbitMQ by default).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@rmq-test-rabbitmq-1:/rabbitmq_server-v3.7.9# erl -remsh rabbit@rmq-test-rabbitmq-1 -sname test2 -setcookie rabbit  
Erlang/OTP 21 [erts-10.1] [source] [64-bit] [smp:22:7] [ds:22:7:10] [async-threads:1]  

Eshell V10.1  (abort with ^G)  
(rabbit@rmq-test-rabbitmq-1)1&amp;gt; recon_trace:calls({rabbit_mgmt_wm_queues, to_json, 2}, 1).  
1  

11:0:48.464423 &amp;lt;0.1294.15&amp;gt; rabbit_mgmt_wm_queues:to_json(#{bindings =&amp;gt; #{},body_length =&amp;gt; 0,cert =&amp;gt; undefined,charset =&amp;gt; undefined,  
  has_body =&amp;gt; false,  
  headers =&amp;gt;  
      #{&amp;lt;&amp;lt;"accept"&amp;gt;&amp;gt; =&amp;gt; &amp;lt;&amp;lt;"*/*"&amp;gt;&amp;gt;,  
        &amp;lt;&amp;lt;"authorization"&amp;gt;&amp;gt; =&amp;gt; &amp;lt;&amp;lt;"Basic Z3Vlc3Q6Z3Vlc3Q="&amp;gt;&amp;gt;,  
        &amp;lt;&amp;lt;"host"&amp;gt;&amp;gt; =&amp;gt; &amp;lt;&amp;lt;"10.100.10.140:53553"&amp;gt;&amp;gt;,  
        &amp;lt;&amp;lt;"user-agent"&amp;gt;&amp;gt; =&amp;gt; &amp;lt;&amp;lt;"curl/7.54.0"&amp;gt;&amp;gt;},  
  host =&amp;gt; &amp;lt;&amp;lt;"10.100.10.140"&amp;gt;&amp;gt;,host_info =&amp;gt; undefined,  
  media_type =&amp;gt; {&amp;lt;&amp;lt;"application"&amp;gt;&amp;gt;,&amp;lt;&amp;lt;"json"&amp;gt;&amp;gt;,[]},  
  method =&amp;gt; &amp;lt;&amp;lt;"GET"&amp;gt;&amp;gt;,path =&amp;gt; &amp;lt;&amp;lt;"/api/queues"&amp;gt;&amp;gt;,path_info =&amp;gt; undefined,  
  peer =&amp;gt; {{10,100,10,4},54136},  
  pid =&amp;gt; &amp;lt;0.1293.15&amp;gt;,port =&amp;gt; 53553,qs =&amp;gt; &amp;lt;&amp;lt;"columns=name,messages_ready"&amp;gt;&amp;gt;,  
  ref =&amp;gt; rabbit_web_dispatch_sup_15672,  
  resp_headers =&amp;gt;  
      #{&amp;lt;&amp;lt;"content-security-policy"&amp;gt;&amp;gt; =&amp;gt; &amp;lt;&amp;lt;"default-src 'self'"&amp;gt;&amp;gt;,  
        &amp;lt;&amp;lt;"content-type"&amp;gt;&amp;gt; =&amp;gt; [&amp;lt;&amp;lt;"application"&amp;gt;&amp;gt;,&amp;lt;&amp;lt;"/"&amp;gt;&amp;gt;,&amp;lt;&amp;lt;"json"&amp;gt;&amp;gt;,&amp;lt;&amp;lt;&amp;gt;&amp;gt;],  
        &amp;lt;&amp;lt;"vary"&amp;gt;&amp;gt; =&amp;gt;  
            [&amp;lt;&amp;lt;"accept"&amp;gt;&amp;gt;,  
             [&amp;lt;&amp;lt;", "&amp;gt;&amp;gt;,&amp;lt;&amp;lt;"accept-encoding"&amp;gt;&amp;gt;],  
             [&amp;lt;&amp;lt;", "&amp;gt;&amp;gt;,&amp;lt;&amp;lt;"origin"&amp;gt;&amp;gt;]]},  
  scheme =&amp;gt; &amp;lt;&amp;lt;"http"&amp;gt;&amp;gt;,  
  sock =&amp;gt; {{172,17,0,4},15672},  
  streamid =&amp;gt; 1,version =&amp;gt; 'HTTP/1.1'}, {context,{user,&amp;lt;&amp;lt;"guest"&amp;gt;&amp;gt;,  
               [administrator],  
               [{rabbit_auth_backend_internal,none}]},  
         &amp;lt;&amp;lt;"guest"&amp;gt;&amp;gt;,undefined})  
Recon tracer rate limit tripped.  
```

The snippet above shows that we first enabled tracing for `rabbit_mgmt_wm_queues:to_json/2`, then manually sent a request to the Management API (using curl; not visible in the snippet), which generated the trace event. That's how we found our entry point for further analysis.

### Using flame graphs  
Having found the function that serves the requests, we can now check how it spends its execution time. The ideal technique for this is [Flame Graphs](http://www.brendangregg.com/flamegraphs.html). One of its [definitions](http://www.brendangregg.com/flamegraphs.html) states:   

*Flame graphs are a visualisation of profiled software, allowing the most frequent code-paths to be identified quickly and accurately.*  

In our case, we could use flame graphs to visualise the stack trace of the function: in other words, which functions are called inside the traced function, and how much time each of them takes to execute relative to the traced function's execution time. This visualisation helps to quickly identify suspicious functions in the code.  

For Erlang, there is a library called [eflame](https://github.com/proger/eflame) that provides tools both for gathering traces from a running Erlang system and for building a flame graph from the collected data. But how do we inject that library into Rabbit for our load test?   

### Building a custom RabbitMQ docker image  
As mentioned previously, all the services in our load-testing platform run inside Docker containers. Hence, we had to build a custom RabbitMQ Docker image with the eflame library included in the server code. We created a [rabbitmq-docker repository](https://github.com/esl/rabbitmq-docker) that makes it easy to build a Docker image with modified RabbitMQ source code.   

### Profiling with eflame  
Once we had a modified RabbitMQ Docker image with eflame included, we could run another load test (with the same specification as the previous one) and start the actual profiling. These were the results:  

![flame graph of a fast query](https://i.imgur.com/gGz7pQc.png)  

![flame graph of a slow query](https://i.imgur.com/AlhCzIX.png)  

We ran a number of measurements and got two types of results, as presented above. The main difference between these graphs lies in the `rabbit_mgmt_util:run_augmentation/2` function. What does that difference mean?    

From the results of the previous load tests and manual code analysis, we knew that there are slow and fast queries. The slow requests can take up to twenty seconds, while the fast ones take only a few seconds. This matches the query time chart above, with the 50th percentile at about 1.5s and the 95th (and higher) percentiles reaching up to 20s. Moreover, we manually measured the execution time of both cases using [timer:tc/3](http://erlang.org/doc/man/timer.html#tc-3) and the results were consistent.    

This happens because there is a [cache](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db_cache.erl) in the Management plugin. When the cache is valid, the requests are served much faster as the data has already been collected, but when it's invalid, all the necessary information needs to be gathered.    
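The caching behaviour described above can be modelled in a few lines. This is a generic sketch of a time-to-live cache, intended only to capture the gist of the fast/slow split; it is not the actual `rabbit_mgmt_db_cache` implementation:

```python
import time

class TtlCache:
    """Serve cached data while it is fresh; recompute only after the TTL expires."""

    def __init__(self, fetch, ttl_seconds):
        self.fetch = fetch          # the expensive data-collection function
        self.ttl = ttl_seconds
        self.value = None
        self.fetched_at = None      # None means "never fetched"

    def get(self):
        now = time.monotonic()
        if self.fetched_at is None or now - self.fetched_at > self.ttl:
            self.value = self.fetch()   # slow path: cache invalid, collect fresh data
            self.fetched_at = now
        return self.value               # fast path: cache valid, reuse the data

calls = []
cache = TtlCache(fetch=lambda: calls.append(1) or len(calls), ttl_seconds=60)
first, second = cache.get(), cache.get()  # only the first call hits the slow path
```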

Although the graphs have the same width in the picture, they represent different execution times (fast vs slow), so it's hard to tell which graph shows which query without actually taking a measurement. The first graph shows a fast query, while the second shows a slow one. In the slow-query graph, the `rabbit_mgmt_util:augment/2 -&amp;gt; rabbit_mgmt_db:submit_cached/4 -&amp;gt; gen_server:call/3 -&amp;gt; …` stack takes so much time because the cache is invalid and fresh data needs to be collected. So what happens when the data is collected?  

### Profiling with fprof  
You might ask: "why don't we see the data collection function(s) in the flame graphs?" That's because the cache is implemented as a separate Erlang process and the data collection happens inside the cache [process](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db_cache.erl#L101-L119). The `gen_server:call/3` function visible in the graphs makes a call to the cache process and waits for a response. Depending on the cache state (valid or invalid), that response can come back quickly or slowly.    

Collecting the data is implemented in the [`rabbit_mgmt_db:list_queue_stats/3`](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db.erl#L357-L368) function, which is invoked from the cache process. Naturally, that is the function we should profile next. We tried eflame, and after **several dozen minutes** this was the result we got:  

```
eheap_alloc: Cannot allocate 42116020480 bytes of memory (of type "old_heap").
```

The Erlang heap memory allocator tried to allocate **42GB** of memory (in fact, the space was needed for the [garbage collector](https://www.erlang-solutions.com/blog/erlang-19-0-garbage-collector.html) to operate) and crashed the server. As eflame relies on Erlang tracing to generate flame graphs, it was most probably simply overwhelmed by the number of trace events generated by the traced function. That's where [fprof](http://erlang.org/doc/man/fprof.html) comes into play.  

According to the official Erlang documentation, fprof is:   

*a Time Profiling Tool using trace to file for minimal runtime performance impact.*  

That’s very true. The tool handled the data collection function smoothly; however, it took several minutes to produce the result. The output was quite big, so only the crucial lines are listed below:

```
(rabbit@rmq-test-rabbitmq-1)96&amp;gt; fprof:apply(rabbit_mgmt_db, list_queue_stats, [RA, B, 5000]).  
...
(rabbit@rmq-test-rabbitmq-1)97&amp;gt; fprof:profile().  
...
(rabbit@rmq-test-rabbitmq-1)98&amp;gt; fprof:analyse().  
...
%                                       CNT        ACC       OWN  
{[{{rabbit_mgmt_db,'-list_queue_stats/3-lc$^1/1-1-',4}, 803,391175.593,  105.666}],  
 { {rabbit_mgmt_db,queue_stats,3},              803,391175.593,  105.666},     %  
 [{{rabbit_mgmt_db,format_range,4},            3212,390985.427,   76.758},  
  {{rabbit_mgmt_db,pick_range,2},              3212,   58.047,   34.206},  
  {{erlang,'++',2},                            2407,   19.445,   19.445},  
  {{rabbit_mgmt_db,message_stats,1},            803,    7.040,    7.040}]}.  

```

The output consists of many entries like this one. The function marked with the `%` character is the one the entry concerns; the functions below it are the ones that were called from the marked function. The third column (`ACC`) shows the total execution time of the marked function (the function's own execution time plus that of its callees) in milliseconds. For example, in the entry above, the total execution time of the `rabbit_mgmt_db:pick_range/2` function equals 58.047ms. For a detailed explanation of the fprof output, check the [official fprof documentation](http://erlang.org/doc/man/fprof.html#analysis-format).  

The entry above is the top-level entry, concerning `rabbit_mgmt_db:queue_stats/3`, which was called from the traced function. That function spent most of its execution time in the `rabbit_mgmt_db:format_range/4` function. We can go to the entry concerning that function and check where it, in turn, spent its execution time. This way, we can walk through the output and find potential causes of the Management API slowness issue.    

Reading through the fprof output in a top-down fashion we ended up with this entry:  

```
{[{{exometer_slide,'-sum/5-anonymous-6-',7},   3713,364774.737,  206.874}],
 { {exometer_slide,to_normalized_list,6},      3713,364774.737,  206.874},     %
 [{{exometer_slide,create_normalized_lookup,4},3713,213922.287,   64.599}, %% SUSPICIOUS
  {{exometer_slide,'-to_normalized_list/6-lists^foldl/2-4-',3},3713,145165.626,   51.991}, %% SUSPICIOUS
  {{exometer_slide,to_list_from,3},            3713, 4518.772,  201.682},
  {{lists,seq,3},                              3713,  837.788,   35.720},
  {{erlang,'++',2},                            3712,   70.038,   70.038},
  {{exometer_slide,'-sum/5-anonymous-5-',1},   3713,   51.971,   25.739},
  {garbage_collect,                               1,    1.269,    1.269},
  {suspend,                                       2,    0.151,    0.000}]}.  
```


The entry concerns the `exometer_slide:to_normalized_list/6` function, which in turn called two “suspicious” functions from the same module. Going deeper, we found this:   


```
    {[{{exometer_slide,'-create_normalized_lookup/4-anonymous-2-',5},347962,196916.209,35453.182},
  {{exometer_slide,'-sum/5-anonymous-4-',2},   356109,16625.240, 4471.993},
  {{orddict,update,4},                         20268881,    0.000,172352.980}],
 { {orddict,update,4},                         20972952,213541.449,212278.155},     %
 [{suspend,                                    9301,  682.033,    0.000},
  {{exometer_slide,'-sum/5-anonymous-3-',2},   31204,  420.574,  227.727},
  {garbage_collect,                              99,  160.687,  160.687},
  {{orddict,update,4},                         20268881,    0.000,172352.980}]}.  
```


and:    

```
    {[{{exometer_slide,'-to_normalized_list/6-anonymous-5-',3},456669,133229.862, 3043.145},
  {{orddict,find,2},                           19369215,    0.000,129761.708}],
 { {orddict,find,2},                           19825884,133229.862,132804.853},     %
 [{suspend,                                    4754,  392.064,    0.000},
  {garbage_collect,                              22,   33.195,   33.195},
  {{orddict,find,2},                           19369215,    0.000,129761.708}]}.  
```  

A lot of the execution time was consumed by the `orddict:update/4` and `orddict:find/2` functions. Combined, these two accounted for **86%** of the total execution time.  

This led us to the [`exometer_slide`](https://github.com/rabbitmq/rabbitmq-management-agent/blob/v3.7.9/src/exometer_slide.erl) module from the [RabbitMQ Management Agent Plugin](https://github.com/rabbitmq/rabbitmq-management-agent/tree/v3.7.9). If you look into the module, you'll find all the functions above and the connections between them.  
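The cost profile is unsurprising once you know that Erlang's `orddict` is a plain sorted key-value list, so every `update` or `find` is a linear scan, and with roughly twenty million calls that dominates. The sketch below illustrates the shape of the problem in Python, as a stand-in for the Erlang data structures rather than a faithful port:

```python
from bisect import bisect_left

def ordlist_update(pairs, key, value):
    """Update a sorted (key, value) list, like Erlang's orddict: linear cost."""
    i = bisect_left(pairs, (key,))       # (key,) sorts just before (key, anything)
    if i < len(pairs) and pairs[i][0] == key:
        pairs[i] = (key, value)          # replace the existing entry
    else:
        pairs.insert(i, (key, value))    # inserting shifts the tail: O(n)

ord_pairs, hash_map = [], {}
for k in range(999, -1, -1):             # descending keys force front insertions
    ordlist_update(ord_pairs, k, k * 2)  # O(n) each, O(n^2) overall
    hash_map[k] = k * 2                  # amortised O(1) each
```

A hash-based structure (such as Erlang's `maps`) keeps each update near constant time regardless of size, which is why this access pattern matters at twenty million calls.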

We decided to close the investigation at this stage, as this was clearly the issue. Now that we've shared our thoughts on it with the community in this blog, who knows, maybe we'll come up with a new solution together.   

### The observer effect  
There is one last thing that is essential to consider when it comes to debugging/observing systems - [the observer effect](https://en.wikipedia.org/wiki/Observer_effect_(physics)). The observer effect states that the mere act of observing a phenomenon changes that phenomenon.  

In our case, we used tools that take advantage of tracing. Tracing has an impact on a system, as it generates, sends and processes a lot of events.   

The execution times of the aforementioned functions increased substantially when they were called with profiling enabled. Pure calls took several seconds, while calls with profiling enabled took several minutes. However, the difference between the slow and fast queries seemed to remain unchanged.   
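The overhead itself is easy to demonstrate in any runtime that supports tracing. The toy measurement below uses Python's `sys.settrace` purely as an illustration; it is unrelated to the Erlang tooling used in this post:

```python
import sys
import time

def busy_work():
    total = 0
    for i in range(100_000):
        total += i
    return total

events = []

def tracer(frame, event, arg):
    events.append(event)      # every recorded event steals time from the program
    return tracer             # returning the tracer enables per-line events

start = time.perf_counter()
plain_result = busy_work()
plain_time = time.perf_counter() - start

sys.settrace(tracer)          # the observer is now part of the observed system
start = time.perf_counter()
traced_result = busy_work()
traced_time = time.perf_counter() - start
sys.settrace(None)

# Tracing does not change the answer, but it changes the timing substantially.
```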

The observer effect was not evaluated in the scope of the experiment described in this blog post.  

### A workaround solution 
The issue can also be approached in a slightly different manner: is there another way of obtaining the queue names together with the number of messages in them?
There is: the [`rabbit_amqqueue:emit_info_all/5`](https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.9/src/rabbit_amqqueue.erl#L758) function allows us to retrieve the exact information we are interested in - directly from the queue processes. We could use that API from a custom RabbitMQ plugin and expose an HTTP endpoint that returns the data when queried.  

We turned that idea into reality and built a proof-of-concept plugin called [`rabbitmq-queue-info`](https://github.com/esl/rabbitmq-queue-info) that does exactly what's described above.
The plugin was also load tested (the test specification was exactly the same as for the management plugin earlier in the blog). The results are below, and they speak for themselves:  
![query times with the rabbitmq-queue-info plugin](https://i.imgur.com/MQnUb8B.png)   

### Want more?  
Want to know more about tracing in RabbitMQ, Erlang &amp;amp; Elixir? Check out WombatOAM, an intuitive system that makes monitoring and maintaining your systems easy. [Get your free 45-day trial of WombatOAM now](https://www.erlang-solutions.com/products/wombatoam.html). 

### Appendix 
Version 3.7.9 of RabbitMQ was used in all the load tests mentioned in this blog post.
Special thanks go to Szymon Mentel and Andrzej Teleżyński for all the help with this publication.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>rabbitmq</category>
      <category>messagequeue</category>
      <category>erlang</category>
      <category>mqtt</category>
    </item>
    <item>
      <title>The benefits of Erlang &amp; Elixir for blockchain</title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Fri, 22 Nov 2019 14:12:46 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/the-benefits-of-erlang-elixir-for-blockchain-4hjk</link>
      <guid>https://dev.to/erlang_solutions/the-benefits-of-erlang-elixir-for-blockchain-4hjk</guid>
      <description>&lt;p&gt;I first came to know Erlang/OTP through one of Joe Armstrong’s talks, where he broke down the world into processes that can talk to each other like humans. When I started at ArcBlock, we were tasked with building a blockchain platform, and we decided to use Erlang/OTP extensively for our backend services, as well as our blockchain framework - Forge. The reasons for that are described in this article, and because of the functions of OTP, we have been able to build a highly practical, production-ready blockchain framework that not only delivers critical services to run a blockchain network but greatly simplifies what is required for next-generation applications and services.  &lt;/p&gt;

&lt;h3&gt;
  
  
  A Blockchain Framework Primer
&lt;/h3&gt;

&lt;p&gt;Forge is a tool that significantly simplifies the process of building a framework to support multi-chain networks or the concept of Build Your Own Chain (BYOC). Before Forge, it was challenging to build a chain. If people wanted to start their own blockchain, they would first need to set up the different components of a blockchain system, including a consensus algorithm, a p2p network, and many other parts. After they went to the effort of making the components work together, they would need to decide how to adjust the different parameters of the blockchain, like total token supply and distribution, specific transaction settings and admin access control. If they were lucky enough to get the blockchain running and found even the slightest thing wrong, they would need to stop all the running nodes and go through the process of setting everything up again.&lt;/p&gt;

&lt;p&gt;With Forge, we used the already available features and benefits of Erlang/OTP to deliver a framework that does all the hard work for the developer. For example, if you start a blockchain with Forge, the only requirement is to set the behaviours by enabling or disabling them in the configuration or at runtime. What’s more, if you want to update something when the chain has started, individual parts of the system can be hot-upgraded without rebooting the entire node — a critical feature for any product-grade application or service.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F4hqYAAC.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F4hqYAAC.jpg" alt="load testing diagram"&gt;&lt;/a&gt;&lt;/p&gt;  

&lt;p&gt;During the design and planning phase for our framework, we also evaluated other popular languages in the blockchain community, such as Golang. Golang has its benefits, including some pretty advanced libraries; however, to build the robust platform we wanted to deliver to our customers, there are three things that really pushed us towards Elixir.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Processes Simplify Problems
&lt;/h3&gt;

&lt;p&gt;First is the need to break down complicated problems into processes. Blockchain itself is a mixture of solutions to many problems, and OTP allows us to deconstruct and tackle them one by one. The flexibility of grouping different processes into applications also helps us to maintain our codebase.&lt;/p&gt;

&lt;p&gt;For example, when a user needs to build a blockchain node, it’s a very similar process to building an operating system. We need to orchestrate a list of “applications” to work together for exchanging events (for example, transactions for a blockchain system), executing these events and then storing the updated states. To help everyone understand how this works, it is very easy to break down the structure into several core applications:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;consensus application: processes that manage consensus related tasks&lt;/li&gt;
&lt;li&gt;storage application: processes that manage file system related tasks&lt;/li&gt;
&lt;li&gt;Forge application: processes that execute smart contract and support RPC interface&lt;/li&gt;
&lt;li&gt;event application: processes that manage event subscription &lt;/li&gt;
&lt;li&gt;indexer application: processes that continuously pull data from states database and index them into a relational database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within these applications, Forge has some additional processes that collaborate to process and handle the transactional activity of the blockchain. For example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;when a user sends a transaction, there is a gRPC server that will process it and push it to the queue of mempool&lt;/li&gt;
&lt;li&gt;if the transaction is valid, it will be inserted into mempool, then flooded to the entire network; otherwise, it gets dropped&lt;/li&gt;
&lt;li&gt;once a new block is synced to us, the transactions will be picked up one by one and executed by the smart contract engine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Orchestrating these activities can be difficult. However, with the help of OTP, we are able to easily manage the complexity of the processes through a continuous divide-and-conquer approach - things are organized into applications, each application is organized into a supervision tree, and each tree consists of many small processes. When there is a need for concurrency, we can spawn a pool of processes; when robust sequential processing is required, we use a single process - by nature, its inbox serves as a message queue, which guarantees the tasks are processed in the right order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FWEXtyh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FWEXtyh5.png" alt="load testing diagram"&gt;&lt;/a&gt;&lt;/p&gt;  

&lt;h3&gt;
  
  
  Crashing Made Easy
&lt;/h3&gt;

&lt;p&gt;Second is the ‘Let it crash’ mentality. A blockchain system consists of many running entities that are connected through a network. It’s essential to have an appropriate error handling system to maintain everything when the network is unstable or other unexpected disruptions happen.   &lt;/p&gt;

&lt;p&gt;For example, if one process needs to read on-chain information to handle an RPC request and crashes due to network instability, where a few retries would have fixed the problem, the supervisor in OTP helps bring the process back. This is a perfect illustration of “write once, run forever,” as described by Joe Armstrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ready-Made for Blockchain
&lt;/h3&gt;

&lt;p&gt;The third important reason we chose Erlang/OTP is that it comes with many great built-in features for a blockchain system, like hot upgrades, concurrency and high availability.&lt;/p&gt;

&lt;p&gt;One of the blockchain framework’s responsibilities is to run both the framework code and the customer’s code in a mixed manner, which requires secure isolation to work appropriately.&lt;br&gt;&lt;br&gt;
For example, user-defined smart contracts might use the same variable names as framework-defined contracts. Other implementations might replace parts of the system (e.g. customers may replace a consensus engine with their own implementation), and new features could be added into the existing blockchain node at runtime, without compromising the availability and stability. Thus, Forge was built on the shoulders of giants, a battle-tested production system offering the features we needed.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Open Flexibility
&lt;/h3&gt;

&lt;p&gt;Using Erlang/OTP allows a blockchain framework to have a very important advantage over other languages — flexibility. As a blockchain framework, Forge by design is open to extension: you can extend the framework by adding more applications to implement more complicated features, like using a different consensus engine.     &lt;/p&gt;

&lt;p&gt;Blockchain networks created within our Erlang/OTP based framework allow users to hot-upgrade their smart contracts when needed without taking down the whole system, which gives users great flexibility in run time. For example, if you need to take down a node in a blockchain system to upgrade parts of the code, all nodes need to be upgraded at the same time, so that they use the same set of code logic and output the same result. In this case, OTP allows the partial upgrade to be included in a transaction, and all nodes can execute this transaction and upgrade the code at the same time.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Given the choice to build a blockchain framework in the future, our team would use Erlang/OTP 100% of the time. While others are still struggling with building and maintaining complicated systems, Erlang/OTP is time tested and proven in high-stress environments. Today’s Erlang/OTP has solved most of the challenges for us, allowing our team to focus more on building high-level features as well as making them user-friendly. &lt;/p&gt;

&lt;h3&gt;
  
  
  Learn more at our blockchain on the BEAM webinar
&lt;/h3&gt;

&lt;p&gt;Learn more about &lt;a href="https://www.erlang-solutions.com/resources/webinars.html" rel="noopener noreferrer"&gt;how ArcBlock uses the BEAM when their VP of Engineering, Tyr Chen, guest hosts our next webinar&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dlt</category>
      <category>erlang</category>
      <category>elixir</category>
      <category>blockchain</category>
    </item>
    <item>
      <title>Why successful blockchains should be built on the BEAM.</title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Mon, 11 Nov 2019 17:22:53 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/why-successful-blockchains-should-be-built-on-the-beam-4na7</link>
      <guid>https://dev.to/erlang_solutions/why-successful-blockchains-should-be-built-on-the-beam-4na7</guid>
      <description>&lt;h3&gt;
  
  
  Who are ArcBlock, and why do they love the BEAM?
&lt;/h3&gt;

&lt;p&gt;ArcBlock is on a mission to take the complexity out of blockchain and fast-track its adoption into everyday life. To do this, they've developed an all-inclusive blockchain development platform that gives developers everything they need to build, run and deploy decentralized applications (dApps) easily. At the heart of their platform is the BEAM VM. They're such big believers in and supporters of the Erlang ecosystem that they joined the Erlang Ecosystem Foundation as a founding sponsor. In this guest blog, Tyr Chen, VP of Engineering at ArcBlock, discusses why they love the BEAM VM and the benefits of using it as a cornerstone for anyone wanting to build dApps.&lt;/p&gt;

&lt;h3&gt;
  
  
  An introduction to the BEAM and blockchain
&lt;/h3&gt;

&lt;p&gt;Erlang is one of the best programming languages for building highly available, fault-tolerant, and scalable soft real-time systems. The BEAM is the virtual machine - and the unsung hero, from our viewpoint. The benefits of the BEAM apply to the other languages that run on the VM, including Elixir. No matter what high-level programming language people are using, it all comes down to the BEAM. It is this essential piece of technology that helps achieve the all-important nine nines of availability.&lt;/p&gt;

&lt;p&gt;Today, the BEAM powers more than half of the world's internet routers and we don't think you have to look much further than that for validation. Below are some of the benefits of the BEAM that make it perfect for building blockchains.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Network Consensus
&lt;/h3&gt;

&lt;p&gt;Our decision to leverage the BEAM as a critical component for building decentralized applications (dApps) was an easy one. To start, blockchain and decentralized applications need to achieve a consistent state across all the nodes in the network. We accomplish this by using a state replica engine (also known as a consensus engine). Consensus is important, as this mechanism ensures that the information added to a blockchain ledger is valid. To achieve consensus, the nodes on the network need to agree on the information, and once consensus happens, the data can be added to the ledger. There are multiple engines available, including our current platform choice, Tendermint, to support the state replica engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The BEAM + dApps
&lt;/h3&gt;

&lt;p&gt;Apart from the consensus engine, the BEAM is the perfect solution to satisfy several other critical requirements for decentralized applications. For decentralized applications to work in our development framework, we need to have an embedded database to store the application state and an index database for the blockchain data. While this is happening, we also need the blockchain node(s) to have the ability to listen to peers on the network and "vote" for the next block of data. For these requirements, the system needs to be continuously responsive and available.  &lt;/p&gt;

&lt;p&gt;Now, it's also important to note that in addition to being continually responsive, we also need to account for CPU-intensive tasks. In particular, our blockchain platform and services cannot stop working when the system encounters such tasks. If the system becomes unresponsive, a potentially catastrophic error could occur.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hot Code Reloading
&lt;/h3&gt;

&lt;p&gt;Besides BEAM's Scheduler, another feature we love is hot code reloading. It lets you do virtually anything on the fly without ever needing to take the BEAM down. For example, our blockchain application platform ships with lots of different smart contracts that developers can use to make their decentralized applications feature-rich. However, with blockchain, you have a distributed network and need to ensure that every node behaves the same.  &lt;/p&gt;

&lt;p&gt;In most cases, developers have to update and reboot their nodes to get the latest software enabled, which causes potential issues and unnecessary downtime. With ArcBlock, we utilize the hot code reloading feature of BEAM to let the nodes enable/disable smart contracts on the fly across the entire network. This is simply done by sending a transaction that tells the system it should upgrade the software at a specific time. When that happens, ArcBlock will tell the BEAM to install the new code, and then every node in the network will have the latest features ready to go.    &lt;/p&gt;
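
&lt;p&gt;As a minimal sketch (the module name is ours; a real release upgrade would normally go through OTP's release_handler with appup/relup files), reloading a module on a running node only takes the standard &lt;code&gt;code&lt;/code&gt; module:&lt;/p&gt;

```erlang
-module(hot_reload).
-export([upgrade/1]).

%% Reload Mod's .beam from the code path without stopping the node.
%% Illustrative only; production release upgrades use OTP's
%% release_handler rather than calling the code server directly.
upgrade(Mod) ->
    code:purge(Mod),                      % discard any lingering old code
    {module, Mod} = code:load_file(Mod),  % load the new object code
    ok.
```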

&lt;h3&gt;
  
  
  Speed is Relative
&lt;/h3&gt;

&lt;p&gt;The BEAM uses the actor model to mirror the real world, and all data is immutable. Because of this, there is no need to lock shared state to prevent race conditions. Of course, everything comes at a cost: the simplicity and beauty of immutability on the BEAM can make some CPU-heavy workloads slower. To mitigate this, ArcBlock leverages Rust for intensive tasks such as computing the Merkle Patricia tree for states. Once again, the BEAM demonstrates its value by providing an easy way to communicate with the outside world, letting Rust boost performance to another level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Garbage Collecting
&lt;/h3&gt;

&lt;p&gt;Don't let the name fool you: garbage collection is critical. Erlang uses dynamic memory management with a tracing garbage collector. Each process has its own stack and heap, allocated in the same memory block and growing towards each other. When the stack and the heap meet, the garbage collector is triggered and memory is reclaimed.   &lt;/p&gt;

&lt;p&gt;While this explanation is a bit technical, the key point is that garbage collection in the BEAM happens per process, so there is never a "stop-the-world-and-let-me-clean-up-my-trash" pause. Other processes simply continue without any interruption.    &lt;/p&gt;
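
&lt;p&gt;You can see the per-process nature of collection from the shell (a toy demonstration of ours, not ArcBlock code): triggering a collection for one process leaves every other process untouched:&lt;/p&gt;

```erlang
-module(gc_demo).
-export([run/0]).

%% Each BEAM process has its own heap, so collecting one process
%% never pauses its neighbours.
run() ->
    Pid = spawn(fun() ->
                    _Big = lists:seq(1, 100000),  % allocate on this heap only
                    receive stop -> ok end
                end),
    true = erlang:garbage_collect(Pid),   % collect just that one process
    {garbage_collection, Info} = erlang:process_info(Pid, garbage_collection),
    Pid ! stop,
    Info.   % per-process GC statistics, e.g. minor_gcs and heap sizes
```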

&lt;h3&gt;
  
  
  OTP
&lt;/h3&gt;

&lt;p&gt;Last but not least, Erlang ships with a development framework called OTP that gives developers an easy way to apply well-established best practices in the BEAM world. For any enterprise or blockchain application platform, building on industry standards is a must, and OTP makes it easy to write code that takes advantage of everything the BEAM offers.  &lt;/p&gt;
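
&lt;p&gt;As one small example of what OTP gives you for free (a hypothetical counter of ours, not ArcBlock code), a &lt;code&gt;gen_server&lt;/code&gt; turns a handful of callbacks into a fully fledged, supervisable server process:&lt;/p&gt;

```erlang
-module(counter).
-behaviour(gen_server).

-export([start_link/0, bump/0, value/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Public API
start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).
bump()       -> gen_server:cast(?MODULE, bump).
value()      -> gen_server:call(?MODULE, value).

%% Callbacks: OTP supplies the message loop, timeouts, system
%% messages and code-change plumbing around these.
init(N)                      -> {ok, N}.
handle_call(value, _From, N) -> {reply, N, N}.
handle_cast(bump, N)         -> {noreply, N + 1}.
```

&lt;p&gt;After &lt;code&gt;counter:start_link()&lt;/code&gt; and one &lt;code&gt;counter:bump()&lt;/code&gt;, &lt;code&gt;counter:value()&lt;/code&gt; returns 1; casts and calls to the same server are processed in order.&lt;/p&gt;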

&lt;h3&gt;
  
  
  Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;There is a reason we saved this for last. Fault tolerance is by far the BEAM feature ArcBlock relies upon the most, and it is what elevates the BEAM over many competing technologies when it comes to blockchain. Although tens of thousands of transactions may be happening simultaneously, an error in one part of the system won't impact the entire node. The system self-heals, enabling the node to withstand bad behaviour or targeted attacks. For anyone delivering a service to users or supporting a production application, this is a critical feature. By including fault tolerance by default, we can ensure that anyone running on the ArcBlock platform remains online and available.    &lt;/p&gt;
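
&lt;p&gt;This self-healing behaviour is usually expressed with an OTP supervisor. A minimal sketch of ours (not ArcBlock's actual supervision tree): if the worker crashes, it is restarted automatically and the rest of the node carries on:&lt;/p&gt;

```erlang
-module(node_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/0, init/1]).

start_link() -> supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% A trivial worker that just sits in a receive loop.
start_worker() ->
    {ok, spawn_link(fun Loop() -> receive _ -> Loop() end end)}.

%% one_for_one: if the worker dies, only the worker is restarted
%% (up to 5 restarts within 10 seconds); the node itself stays up.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Worker = #{id => worker,
               start => {?MODULE, start_worker, []},
               restart => permanent},
    {ok, {SupFlags, [Worker]}}.
```

&lt;p&gt;Killing the worker (&lt;code&gt;exit(Pid, kill)&lt;/code&gt;) and asking the supervisor for its children again shows a fresh pid: the crash was contained and repaired without any external intervention.&lt;/p&gt;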

&lt;p&gt;We believe that the BEAM, while designed many years ago, might as well have been designed for blockchain. It gives developers, and blockchain platforms like ArcBlock, all the features and capabilities needed to run a highly concurrent, fault-tolerant system that makes developers' lives easier.    &lt;/p&gt;

&lt;p&gt;Keep calm and BEAM on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learn more
&lt;/h3&gt;

&lt;p&gt;Tyr Chen, VP of Engineering at ArcBlock, is the guest host of our &lt;a href="https://www.erlang-solutions.com/resources/webinars.html"&gt;webinar on Wednesday, November 27&lt;/a&gt;. Register to take part; even if you can’t make it on the day, you’ll be the first to receive a recording of the webinar once it is available.&lt;/p&gt;

</description>
      <category>dapps</category>
      <category>blockchain</category>
      <category>erlang</category>
      <category>beamvm</category>
    </item>
    <item>
      <title>New webinar - Building Decentralized Applications (dApps) using the BEAM </title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Wed, 06 Nov 2019 14:18:42 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/new-webinar-building-decentralized-applications-dapps-using-the-beam-5hka</link>
      <guid>https://dev.to/erlang_solutions/new-webinar-building-decentralized-applications-dapps-using-the-beam-5hka</guid>
      <description>&lt;p&gt;This month we're excited to host Tyr Chen, VP of Engineering at ArcBlock, on our webinar.&lt;/p&gt;

&lt;p&gt;ArcBlock are founding sponsors of the Erlang Ecosystem Foundation. They’ve developed an all-inclusive blockchain development platform that gives developers everything they need to build, run and deploy decentralized applications (dApps) easily. At the heart of their platform is the BEAM VM.&lt;/p&gt;

&lt;p&gt;In this webinar, Tyr will discuss how the features of the BEAM are utilised on their platform and why the BEAM VM is perfect for dApps.&lt;/p&gt;

&lt;p&gt;Register to save your seat and be sent a recording of the webinar at &lt;a href="https://www2.erlang-solutions.com/acrblockweb"&gt;https://www2.erlang-solutions.com/acrblockweb&lt;/a&gt;&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>dapps</category>
      <category>decentralisedapps</category>
    </item>
    <item>
      <title>Webinar announcement! Bruce Tate teaches you how to get more out of OTP with GenStateMachine</title>
      <dc:creator>Erlang Solutions</dc:creator>
      <pubDate>Wed, 21 Aug 2019 09:50:13 +0000</pubDate>
      <link>https://dev.to/erlang_solutions/webinar-announcement-bruce-tate-teaches-you-how-to-get-more-out-of-otp-with-genstatemachine-2ejl</link>
      <guid>https://dev.to/erlang_solutions/webinar-announcement-bruce-tate-teaches-you-how-to-get-more-out-of-otp-with-genstatemachine-2ejl</guid>
      <description>&lt;p&gt;On September 4th, Bruce Tate will be hosting a live coding webinar on how to Get more out of OTP with GenStateMachine.&lt;/p&gt;

&lt;p&gt;In this live coding session, Bruce will implement a safety protocol for climbing with a GenStateMachine, then spice it up with a more flexible API, timeouts and a more secure protocol.&lt;/p&gt;

&lt;p&gt;When we're done, you'll have a better understanding of how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work with state machines&lt;/li&gt;
&lt;li&gt;Build state machines that implement policy&lt;/li&gt;
&lt;li&gt;Tailor them to your purposes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't want to miss this one.&lt;/p&gt;

&lt;p&gt;Register at &lt;a href="https://www2.erlang-solutions.com/btwebdev"&gt;https://www2.erlang-solutions.com/btwebdev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webinar</category>
      <category>livecoding</category>
      <category>demonstration</category>
    </item>
  </channel>
</rss>
