Did you know that Elixir Supervisors will stop trying to restart a child process if they detect something has gone haywire in the child?
They do!
This leads to the next question: how do they know when a child process is a lost cause?
Supervisor Intensity!
What is it?
The intensity setting of a Supervisor is how many failures it will tolerate from its children within a certain period of time. If more failures are recorded within that window, the Supervisor stops all of its child processes and then fails itself.
The intensity and period are optional settings that can be set in Elixir like so:
opts = [..., max_restarts: 3, max_seconds: 5]
Supervisor.start_link(children, opts)
Note that I have used the values Elixir currently defaults to: 3 failures over a 5-second period.
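If you prefer a module-based Supervisor, the same options can be passed to Supervisor.init/2. Here is a minimal sketch, where MyApp.Supervisor and MyApp.Worker are placeholder names:

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [MyApp.Worker]

    # max_restarts and max_seconds work here just as they do when
    # passed to Supervisor.start_link/2.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end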
When Do You Run Into This?
I figured this out when stepping through a nice bug. At the top of one of my child processes I had this line
def start_link(port: port, dispatch: dispatch) do
  {:ok, socket} = :gen_tcp.listen(port, active: false, packet: :http_bin, reuseaddr: true)
  ...
I was crashing the process on purpose, and subsequently the entire app would shut down. This confused me, because shouldn't the Supervisor restart the process?
The issue was that :gen_tcp.listen returns an error if the port is already in use by another socket. So after the first crash, this module throws an error on the first line of start_link (the {:ok, socket} match fails). The Supervisor very quickly tries and fails three times on that same bug, and the Supervisor is shut down.
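You can reproduce the underlying failure directly in iex (assuming port 4040 is free before the first call; your port reference will differ):

iex> {:ok, _listener} = :gen_tcp.listen(4040, [])
{:ok, #Port<0.8>}
iex> {:ok, _second} = :gen_tcp.listen(4040, [])
** (MatchError) no match of right hand side value: {:error, :eaddrinuse}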
Why Do Supervisors Do This?
In short, it is to prevent infinite loops of processes being restarted over and over. If a restart cannot fix the underlying problem, as with our occupied port, retrying forever accomplishes nothing.
However, note that the Supervisor does not just stop trying to restart the child. If the intensity limits are surpassed, the Supervisor shuts itself down.
Why?
It shuts down so that any Supervisor of that Supervisor can be notified that something wacky is going on and can try to fix the issue. The original Supervisor has already given it its best effort and is now basically passing the issue up a level to ask for help.
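Here is a minimal sketch of that escalation, with all module names hypothetical. If WorkerSup trips its intensity limit and shuts down, TopSup sees the exit and restarts WorkerSup, giving the whole subtree a clean slate:

defmodule WorkerSup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # SomeWorker is a placeholder for any crash-prone child.
    Supervisor.init([SomeWorker], strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end

defmodule TopSup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # If WorkerSup gives up and shuts down, TopSup restarts it.
    Supervisor.init([WorkerSup], strategy: :one_for_one)
  end
end

Of course, TopSup has intensity limits of its own, so if WorkerSup keeps giving up, the failure keeps escalating until it reaches the top of the tree and the whole application shuts down.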
Try at Home!
Here is a nice little module you can try out in your own app.
Drop this line into your Supervisor child start list:
{FailTwoSeconds, []}
And add this module to your project:
defmodule FailTwoSeconds do
  def start_link([]) do
    IO.inspect "New Process Starting"

    # Spawn a linked process that lives for 2 seconds and then crashes,
    # taking this child down with it.
    pid =
      spawn_link(fn ->
        Process.sleep(2000)
        raise "Failing Now"
      end)

    {:ok, pid}
  end

  def child_spec(opts) do
    %{
      id: FailTwoSeconds,
      start: {FailTwoSeconds, :start_link, [opts]}
    }
  end
end
We can see that the module exists just to start up, wait 2 seconds, and crash.
But crashing every 2 seconds is within the default intensity settings for Elixir: at most 3 restarts ever land inside a single 5-second window. So this module will go on crashing and being restarted till the cows come home.
If you switch the sleep time to 1000, then it will trip the Supervisor's intensity limits: crashing once per second packs more than 3 restarts into a 5-second window, and the Supervisor will say "I've had too much!" and shut down.
Thanks for reading and hope this saves you a headache one day!
Comments
Thanks for the article! So how did you eventually configure the supervisor for the TCP socket example? I guess it could happen hundreds of times per second, couldn't it?
For this failing example, within my server I was using spawn_link(process_request_function); you can get a lot of added safety by switching this to spawn(process_request_function). But I did figure out a way to restart the server safely, for fun.
Right after the server starts, I save the socket in a singleton GenServer.
Then on failure I close the socket and let the process fail.
It fails because it does not return an {:ok, pid} tuple.
Then the next time the Supervisor restarts the process, it should succeed, since there is no longer a socket bound to the port.
Not the most robust, but it works for this toy example.
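A minimal sketch of that workaround, assuming a hypothetical SocketHolder module (the real code surely differs):

defmodule SocketHolder do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Store the listening socket right after the server binds it.
  def save(socket), do: GenServer.call(__MODULE__, {:save, socket})

  # Close the stored socket so the port is free on the next restart.
  def close, do: GenServer.call(__MODULE__, :close)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:save, socket}, _from, _old), do: {:reply, :ok, socket}

  def handle_call(:close, _from, socket) do
    if socket, do: :gen_tcp.close(socket)
    {:reply, :ok, nil}
  end
end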
Thanks for the example. I remember that sometimes the port stays occupied for a longer time; I've seen this a couple of times even when no process actually uses it. There is just some delay until it's freed by the OS. So in that case it still would not help, right?
I haven't used this handmade server very much, but what you are saying about the delay in closing the socket sounds like it could happen.
There may be a blocking command that I am unfamiliar with to check whether the port is free. Otherwise I would think of putting a sleep command in there to give the OS time to free it.
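A sketch of that sleep-and-retry idea as a hypothetical helper inside the server module (the attempt count and delay are arbitrary):

defp listen_with_retry(port, opts, attempts \\ 5) do
  case :gen_tcp.listen(port, opts) do
    {:ok, socket} ->
      {:ok, socket}

    # The OS has not released the port yet; back off briefly and retry.
    {:error, :eaddrinuse} when attempts > 0 ->
      Process.sleep(200)
      listen_with_retry(port, opts, attempts - 1)

    error ->
      error
  end
end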