Do not crash the Agent!

To stress-test our streaming infrastructure, I developed an Elixir application that manages several concurrent interactors and uses an Agent to maintain state.

The Agent abstraction is very powerful, but be careful about the code you run in the Agent process. If that code crashes, the agent is terminated and all its data is lost :/

To illustrate this, imagine you launch several concurrent tasks and use an Agent to store the partial results from each task. Later on, you want to run some computation on each partial result and decide to run that code in the agent process itself. In this code I make the agent crash by causing an arithmetic exception:

[Image: crasher_crashing]
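
A minimal sketch of that failure, since the original listing is not reproduced here. The agent name `:crasher_agent` and the `10 / value` computation are borrowed from the comments below; the map contents and the use of `Agent.start/2` (unlinked, so the calling script survives) are assumptions made for illustration.

```elixir
# Assumed setup: a named Agent holding one partial result per task.
{:ok, pid} = Agent.start(fn -> %{task_a: 5, task_b: 0} end, name: :crasher_agent)

# The anonymous function runs inside the agent process. 10 / 0 raises
# ArithmeticError there, so the agent exits and takes its state with it.
# The caller of Agent.get/2 exits too; the exit is caught here only so
# that the script can carry on and inspect the damage.
result =
  try do
    Agent.get(:crasher_agent, fn results ->
      Enum.map(results, fn {key, value} -> {key, 10 / value} end)
    end)
  catch
    :exit, reason -> {:agent_died, reason}
  end

# The agent process is gone, and so are the stored partial results.
agent_alive? = Process.alive?(pid)
```

After running this, `result` is an `{:agent_died, reason}` tuple and `agent_alive?` is `false`.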

This can be prevented either by making the code that runs in the agent resilient to exceptions, or by running the code in the client instead of the agent, so that even if the client process crashes, the agent stays healthy.
In this example, I prevent the crash in the code that runs in the agent:

[Image: crasher_not_crashing]
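
The two options above can be sketched roughly as follows. This is an illustration rather than the original listing: the agent name `:safe_agent`, the map contents, and the `:failed` marker are assumptions. The first variant rescues only the one known exception inside the agent; the second fetches the data and does the risky work in a separate client process.

```elixir
{:ok, pid} = Agent.start(fn -> %{task_a: 5, task_b: 0} end, name: :safe_agent)

# Variant 1: the code still runs in the agent, but the known
# ArithmeticError is rescued per entry, so the agent keeps running.
rescued =
  Agent.get(:safe_agent, fn results ->
    Map.new(results, fn {key, value} ->
      try do
        {key, 10 / value}
      rescue
        ArithmeticError -> {key, :failed}
      end
    end)
  end)

# Variant 2: only read the state from the agent and compute in a
# client process. The client crashes on the zero, the agent does not.
{client, ref} =
  spawn_monitor(fn ->
    partials = Agent.get(:safe_agent, & &1)
    Enum.map(partials, fn {key, value} -> {key, 10 / value} end)
  end)

client_outcome =
  receive do
    {:DOWN, ^ref, :process, ^client, _reason} -> :client_crashed
  after
    1_000 -> :timeout
  end

# The agent is still alive and still holds the partial results.
still_there = Agent.get(:safe_agent, & &1)
```

In both variants the agent survives; in the second one the process that crashed can be restarted or retried without touching the stored data.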

EDIT: As José Valim pointed out in the comments, try/catch should be avoided whenever possible. My point (and my self-reminder) is: be careful with the code you run in an agent process, as the data will be lost if it crashes.

4 Comments

  1. Thank you for writing about Elixir!

    Sorry for being blunt, but this is completely the wrong approach to the problem in Elixir. Exceptions in Elixir typically mean that something unexpected happened, and when you use try/rescue in your Agent, you are potentially allowing your Agent to run with *corrupted state*. We typically recommend, in the getting started guide and all major resources, avoiding rescuing errors for this reason. Rescuing is worse in the long term because you allow your system to continue running in the face of unexpected situations, often on top of corrupted data.

    The solution here is twofold:

    1. Fix the problem correctly. In this case, this means matching on the value and avoiding the division by zero:

    Agent.get :crasher_agent, fn dict ->
      Enum.map dict, fn
        {key, 0} -> {key, :failed}
        {key, value} -> {key, 10 / value}
      end
    end

    2. Ensure the Agent is running in a supervision tree. If the Agent crashes, Elixir will log the crash by default and the supervisor will start a new instance. If there are other processes depending on the Agent, place them under the same supervisor. We cover this in the getting started guide (http://elixir-lang.org/getting-started/introduction.html), and it is covered in more depth in books like Elixir in Action.
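
A rough sketch of point 2, not taken from the original comment: the child spec, the empty initial map, and the short sleep that waits for the restart are all assumptions made for illustration.

```elixir
# Assumed child spec: one named Agent under a one_for_one supervisor.
children = [
  %{
    id: :crasher_agent,
    start: {Agent, :start_link, [fn -> %{} end, [name: :crasher_agent]]}
  }
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

first_pid = Process.whereis(:crasher_agent)
:ok = Agent.update(:crasher_agent, &Map.put(&1, :task_a, 5))

# Simulate a crash and wait until the old process is really gone.
ref = Process.monitor(first_pid)
Process.exit(first_pid, :kill)

receive do
  {:DOWN, ^ref, :process, ^first_pid, _reason} -> :ok
end

# Give the supervisor a moment to start the replacement.
Process.sleep(100)

second_pid = Process.whereis(:crasher_agent)
state_after_restart = Agent.get(:crasher_agent, & &1)
```

The supervisor brings up a fresh agent under the same name, but it starts again from its initial state (an empty map here), which is exactly the data loss discussed in the post.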

    To sum up: try/rescue is already known as defensive programming in most languages, and it reduces the confidence in our code because you are always “rescuing” your code instead of handling the cases explicitly. In Elixir, it is absolutely discouraged.

    The solution is to explicitly handle the case at hand. That’s also why File.read/1 in Elixir doesn’t crash if the file does not exist but allows you to explicitly handle missing files and so on. And if, even after you handle the expected cases explicitly, something bad happens, trust the supervision tree.

    Such mistakes are part of the learning curve and I am glad you are learning Elixir. But remember, leave try/rescue at home.

    1. José, thank you so much for all the support you provide to the community. This is a great example of that :)

      I think I understand the reasoning behind the “let it crash” mantra, but there are cases when that is not the best option. If I cannot know in advance which exception will be raised, then I cannot match against it. If I let it crash under a supervisor, then the already stored data is lost. Try/catch should be avoided, but it exists in the language for a reason and may have its use cases.

      The best approach would be to fix the error in the external library and make it communicate its failures better, so my code can handle them properly.
      Besides, it could be better to run the troublesome code in the client process instead of the agent process; that way the agent won’t lose its data and the client can be restarted/managed after the crash.

      I will update my post to reflect your feedback.

      Thank you

  2. > If I cannot know the exception that will be raised in advance, then I cannot match against it.

    If you don’t know what the exception is, then you don’t know what you are rescuing. You may as well be rescuing a data corruption, a system limit error or worse.

    The idea behind let it crash is that the only option in those cases is restarting the system.

    “But I may lose my data” – you won’t, because that’s the point: such errors are not supposed to happen. If an error is meant to happen, you can either handle it or maybe rescue the known errors, but almost never *all errors*.

    Maybe you ended up simplifying your actual problem too much in the blog post, as the error at hand can be easily treated, but I would still be very careful in recommending try/rescue. That’s why I completely agree with your last paragraph: improving the library invoked or handling it on the client are definitely improvements. :)

  3. An Agent is a “simple abstraction around state.”

    If you need a more complex abstraction, including your case of being able to persist data between crashes, I would strongly suggest you take a look at using a regular GenServer. GenServer has a terminate/2 callback which can be used to transfer state to an alternate “stash” GenServer (or Agent). When the process restarts, it can read its initial state from the stash.

    This pattern is described in the Pragmatic Bookshelf “Programming Elixir” book (https://pragprog.com/book/elixir/programming-elixir); I advise you to give it a read. (Not affiliated in any way)
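
A minimal sketch of the stash idea, under stated assumptions: the module names `Stash` and `Worker`, their functions, and the short sleep that waits for the restart are hypothetical and not taken from the book.

```elixir
defmodule Stash do
  # Hypothetical stash: a plain Agent that outlives the worker.
  def start_link(initial), do: Agent.start_link(fn -> initial end, name: __MODULE__)
  def get, do: Agent.get(__MODULE__, & &1)
  def put(state), do: Agent.update(__MODULE__, fn _old -> state end)
end

defmodule Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})
  def get, do: GenServer.call(__MODULE__, :get)
  def divide_all, do: GenServer.call(__MODULE__, :divide_all)

  # On every (re)start, the initial state is read from the stash.
  @impl true
  def init(_arg), do: {:ok, Stash.get()}

  @impl true
  def handle_call({:put, key, value}, _from, state),
    do: {:reply, :ok, Map.put(state, key, value)}

  def handle_call(:get, _from, state), do: {:reply, state, state}

  # Raises ArithmeticError if any stored value is 0.
  def handle_call(:divide_all, _from, state),
    do: {:reply, Enum.map(state, fn {k, v} -> {k, 10 / v} end), state}

  # Called when a callback raises: hand the last state to the stash.
  @impl true
  def terminate(_reason, state), do: Stash.put(state)
end

children = [%{id: Stash, start: {Stash, :start_link, [%{}]}}, Worker]
{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

:ok = Worker.put(:task_a, 5)
:ok = Worker.put(:task_b, 0)

# The worker crashes on the zero; the caller's exit is caught so the
# script can continue.
crash =
  try do
    Worker.divide_all()
  catch
    :exit, _reason -> :worker_crashed
  end

# Give the supervisor a moment to restart the worker.
Process.sleep(100)
restored = Worker.get()
```

Two caveats on this sketch: terminate/2 is not invoked when the process is killed brutally, so the stash is not a guarantee, and the restored state still contains the value that triggered the crash, so the underlying case has to be handled explicitly anyway.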

    Good luck with your Elixir adventures! :)
