DEV Community

rbglod


How to monitor your app's health

Monitoring is a crucial part of keeping a web app stable and being able to react quickly to critical incidents.
When an incident happens (you or, even worse, your customer finds a bug), you drop your feature work and hop on the debugging train. Having metrics and monitoring in place is incredibly helpful for tracing where the issue lies, what its impact is and how it can be fixed quickly.

In this blog post I'd like to share some ideas around health metrics that can be added to crucial parts of the app to increase visibility into how our code works, what's working as expected and which code paths fail most often. I'm not gonna dive into tools like Airbrake or Sentry, which are used to catch errors and report them. Instead, I'll explore what we can implement in our own code to get as much useful information as possible into our logs (or tools like Datadog).

Health metrics and where to put them

Start by finding the crucial parts of the app, the ones without which the project is broken. For me, the first thing that comes to mind is payment processing. If we fail to process a payment, the business is not making money, which means that we're losing it. We pay for hosting the app but fail to satisfy users' needs, which may lead to users abandoning our app - reducing our income and increasing our costs.

attempt, success, failure

So we have our most vulnerable area - payment processing. Implementing health metrics is extremely easy, but also extremely useful. Let's have a look at this example class.

class PaymentProcessor
  # ...
  def perform(amount, user)
    payment = create_payment(amount, user)
    payment.process_payment
    payment.mark_as_finished
  rescue Payments::Errors::PaymentFailed => e
    payment.mark_as_failed(e)
  ensure
    set_result(payment.status)
  end
  # ...
end

We can define three phases of what's going on here.

  1. A payment is created and we attempt to process it.
  2. If payment processing succeeds, we mark the payment as finished and return its status.
  3. If payment processing fails, we mark the payment as failed, save the error message and return its status.

This gives us an idea of what health metrics we can implement here. And this rule is rather universal. We attempt to perform an action, then we either receive success or failure from the operation.

Here's what the metrics might look like in their simplest form.

class PaymentProcessor
  # ...
  def perform(amount, user)
    metric_recorder.record_attempt
    payment = create_payment(amount, user)
    payment.process_payment
    payment.mark_as_finished
    metric_recorder.record_success
  rescue Payments::Errors::PaymentFailed => e
    payment.mark_as_failed(e)
    metric_recorder.record_failure(e.error_code)
  ensure
    set_result(payment.status)
  end

  private

  def metric_recorder
    @metric_recorder ||= MetricRecorder.new(domain: 'payment_processing')
  end
end

This would already give us some insight into how many payment processing attempts we had, how many of them succeeded and how many failed.

attempts = successes + failures

If the numbers don't add up, it means we have an issue somewhere between the points where the metrics are recorded. Maybe we get an unhandled error in payment.mark_as_finished, so the closing metric is never recorded? That's also valuable information for debugging purposes. We can see how far the code execution went.
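
The rule above can be checked mechanically. Here is a minimal sketch, assuming the three counters for a time window can be read from wherever they are stored; the HealthCounts struct and its method names are hypothetical and not part of the original code.

```ruby
# Hypothetical value object holding the three counters for one time window.
HealthCounts = Struct.new(:attempts, :successes, :failures, keyword_init: true) do
  # Attempts that never reported an outcome, e.g. because an unhandled
  # error was raised between record_attempt and record_success.
  def unaccounted
    attempts - successes - failures
  end

  def consistent?
    unaccounted.zero?
  end
end

counts = HealthCounts.new(attempts: 100, successes: 90, failures: 7)
puts "unaccounted attempts: #{counts.unaccounted}" unless counts.consistent?
```

A non-zero gap is only a hint: attempts that are still in flight when the counters are read would show up here as well.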

Wrapper

That was the simplest version, but it pollutes the code, and we need to remember to add it each time we implement a new service. Ruby lets us do this in a much cleaner, reusable way. We can add a method to our MetricRecorder module that accepts a block and handles the rest.

module MetricRecorder
  def record_health(&block)
    record_attempt
    result = block.call
    record_success
    result # return the block's value instead of record_success's
  rescue StandardError => e
    record_failure(e)
    raise # we want to re-raise the error to handle it outside of the recorder
  end
end

And here's how the record_health wrapper method can be used in PaymentProcessor.

class PaymentProcessor
  include MetricRecorder

  def perform(amount, user)
    record_health do
      payment = create_payment(amount, user)
      payment.process_payment
      payment.mark_as_finished
    rescue Payments::Errors::PaymentFailed => e
      payment.mark_as_failed(e)
      raise e # we re-raise the error to propagate it to recorder and further
    ensure
      set_result(payment.status)
    end
  end
end

This makes our class clean again, reduces the need to think about where to put the metric calls, and moves the recording logic out of this class, making it reusable across the whole project. When we want to add some extra data to the metrics, we can update just MetricRecorder instead of every class that uses the record_... methods.
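
To see the reuse in action without any metrics backend, here is a sketch that includes the same wrapper in a second, made-up service. The in-memory COUNTERS hash and the ReportExporter class are assumptions for illustration only; a real recorder would send the counts to a metrics tool.

```ruby
module MetricRecorder
  # In-memory stand-in for a metrics backend (illustration only).
  COUNTERS = Hash.new(0)

  def record_health
    record_attempt
    result = yield
    record_success
    result
  rescue StandardError => e
    record_failure(e)
    raise
  end

  private

  def record_attempt
    COUNTERS[[self.class.name, :attempt]] += 1
  end

  def record_success
    COUNTERS[[self.class.name, :success]] += 1
  end

  def record_failure(_error)
    COUNTERS[[self.class.name, :failure]] += 1
  end
end

# Hypothetical second service: it gets health metrics just by including the module.
class ReportExporter
  include MetricRecorder

  def perform(rows)
    record_health do
      raise ArgumentError, 'nothing to export' if rows.empty?

      rows.join(',')
    end
  end
end

exporter = ReportExporter.new
exporter.perform(%w[a b]) # counted as one attempt and one success
begin
  exporter.perform([])    # counted as one attempt and one failure
rescue ArgumentError
  # the error still reaches the caller after being recorded
end
```

Keying the counters by class name is just one possible choice; it keeps the numbers of different services apart without each class having to pass a name.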

The recorder module

Okay, we know how this should look from the code execution perspective, but what should record_attempt, record_success and record_failure actually do?

Metrics are designed to just give an idea of the traffic in the app. If we want more detailed data, we should probably add logging as well.

If we're using Datadog, we'd probably just increment attempt, success and failure counters, adding an error_code tag to failures. That's due to costs and the way metrics are designed in Datadog: custom metrics are counted per distinct combination of metric name and tag values, so tags should come from a small, fixed set. We can't attach IDs, amounts or user-specific data to those metrics, because the set of possible values is practically unbounded.

module MetricRecorder
  def record_health(metric_name, &block)
    record_attempt(metric_name)
    result = block.call
    record_success(metric_name)
    result # return the block's value to the caller
  rescue StandardError => e
    record_failure(metric_name, e)
    raise
  end

  private

  def record_attempt(metric_name)
    DatadogService.increment(metric_name, 'attempt')
  end

  def record_success(metric_name)
    DatadogService.increment(metric_name, 'success')
  end

  def record_failure(metric_name, e)
    # Not every StandardError responds to error_code; fall back to the class name
    error_code = e.respond_to?(:error_code) ? e.error_code : e.class.name
    DatadogService.increment(metric_name, 'failure', error_code: error_code)
  end
end
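
The snippet above calls DatadogService.increment, which this post doesn't show. Here is one way such a wrapper could look, as a sketch only: it assumes a StatsD-style client that responds to increment(name, tags: [...]) (the dogstatsd-ruby gem's Datadog::Statsd#increment has this shape), and the health. metric prefix and the tag names are made up. A tiny fake client stands in so the example runs without the gem or an agent.

```ruby
# Hypothetical wrapper; the client is injected so it can be swapped in tests.
class DatadogService
  class << self
    attr_writer :client

    def client
      @client or raise 'DatadogService.client is not configured'
    end

    # Sends one counter increment, e.g. "health.payment_processing"
    # tagged with "outcome:failure" and "error_code:card_declined".
    def increment(metric_name, outcome, **extra_tags)
      tags = ["outcome:#{outcome}"]
      extra_tags.each { |key, value| tags << "#{key}:#{value}" }
      client.increment("health.#{metric_name}", tags: tags)
    end
  end
end

# Fake client that only remembers what it was asked to send.
class FakeStatsd
  attr_reader :calls

  def initialize
    @calls = []
  end

  def increment(name, tags: [])
    @calls << [name, tags]
  end
end

DatadogService.client = FakeStatsd.new
DatadogService.increment('payment_processing', 'attempt')
DatadogService.increment('payment_processing', 'failure', error_code: 'card_declined')
```

In a real app the fake would presumably be replaced with something like Datadog::Statsd.new('localhost', 8125). Putting the outcome in a tag rather than in the metric name is a design choice of this sketch, not something the post prescribes.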

To get more data and really monitor our app, we could also write a logger module.

module LogsRecorder
  def record_log(method_name, &block)
    log(method_name, { action_type: 'attempt' })
    result = block.call
    log(method_name, { action_type: 'success', result: result })
    result
  rescue StandardError => e
    log(method_name, { action_type: 'failure', result: { error_class: e.class.name.demodulize, error_message: e.message } })
    raise e
  end

  private

  def log(method_name, additional_tags)
    Rails.logger.info(
      class_name: self.class.name.demodulize.underscore,
      method_name: method_name,
      additional_tags: additional_tags
    )
  end
end

And here's how we could combine both modules to have metrics and logs in place. This would let us quickly monitor the metrics from, for example, the Datadog web dashboard, and if something feels off, we can jump into the detailed logs, grep for action_type: 'failure' and see what went wrong.

class PaymentProcessor
  include MetricRecorder
  include LogsRecorder

  def perform(amount, user)
    record_health('payment_processing') do
      record_log('perform') do
        payment = create_payment(amount, user)
        payment.process_payment
        payment.mark_as_finished
      rescue Payments::Errors::PaymentFailed => e
        payment.mark_as_failed(e)
        raise e
      ensure
        set_result(payment.status)
      end
    end
  end
end
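
To make the "grep for failures" step concrete, here is a sketch that writes the same kind of entries as JSON lines and then filters them. It uses Ruby's standard Logger with a custom formatter writing to a StringIO instead of Rails.logger; the JSON-per-line format and the field values are assumptions, since the post doesn't say how the logs are formatted.

```ruby
require 'json'
require 'logger'
require 'stringio'

io = StringIO.new
logger = Logger.new(io)
# Emit each entry as one JSON object per line so it is easy to filter.
logger.formatter = proc { |_severity, _time, _progname, msg| "#{JSON.generate(msg)}\n" }

logger.info(class_name: 'payment_processor', method_name: 'perform',
            additional_tags: { action_type: 'attempt' })
logger.info(class_name: 'payment_processor', method_name: 'perform',
            additional_tags: { action_type: 'failure',
                               result: { error_class: 'PaymentFailed',
                                         error_message: 'card declined' } })

# The in-process equivalent of grepping the log for failures.
failures = io.string.each_line
             .map { |line| JSON.parse(line) }
             .select { |entry| entry.dig('additional_tags', 'action_type') == 'failure' }

failures.each { |entry| puts entry.dig('additional_tags', 'result', 'error_message') }
```

One structured entry per line keeps the attempt, success and failure records easy to search, both with grep and in log tools that parse JSON.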
