K for moesif

Posted on Aug 31, 2020 • Originally published at moesif.com

Comparison of Open Source API Analytics and Monitoring Tools

#monitoring #api

For any API-first company, implementing the right API analytics platform is critical for tracking the utilization of your APIs and discovering any performance or functional issues that may be impacting customers.

There are a variety of open-source projects you can leverage to build a complete API analytics platform.

Before jumping into building an API analytics solution yourself, you should first list out your requirements and use cases. Not every tool supports every use case directly, and some may require heavy investment in development and integration.

API analytics build requirements

Need to answer engineering or business questions

Engineers commonly turn to API logs and metrics to understand what's happening over their APIs, inspect payloads, and find the root cause of issues that come up with their services.

Real-time API logging is a must for engineers who depend on their API analytics solution to put out the (hopefully few and far between) fires caused by an API outage or reliability issue. Because real-time logging can raise compute and storage costs, not all analytics infrastructure maintains a real-time pipeline.

On the flip side, only the most recent data is needed to answer hair-on-fire engineering problems, so data can be retired after a short period, such as 24 hours.

Engineering and product leaders make strategic decisions based on lines, not dots. Business questions are answered from historical trends in data, which may span months or even years. This means your API analytics build should be capable of storing data for long retention periods, such as multiple years.

This also requires infrastructure that can roll up and compress your data, as storing and running aggregations on raw event logs will cripple your analytics infrastructure.
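As a rough illustration of what a roll-up does, the sketch below collapses raw API events into one summary row per hour and endpoint. The event fields (`ts`, `endpoint`, `status`, `latency_ms`) and the function name are assumptions made for this example, not part of any particular tool.

```python
from collections import defaultdict
from datetime import datetime, timezone

def rollup_hourly(events):
    """Collapse raw API events into per-hour, per-endpoint summaries.

    Each event is a dict with hypothetical fields:
    ts (Unix seconds), endpoint, status, latency_ms.
    """
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for e in events:
        ts = datetime.fromtimestamp(e["ts"], tz=timezone.utc)
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        b = buckets[(hour.isoformat(), e["endpoint"])]
        b["count"] += 1
        b["errors"] += 1 if e["status"] >= 500 else 0
        b["total_ms"] += e["latency_ms"]
    return dict(buckets)
```

Storing the count and the latency sum (rather than a precomputed average) keeps the rows mergeable, so hourly rows can later be combined into daily or monthly ones.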

Real-time alerting and monitoring?

While building dashboards and throwing them up on a monitor can be a great way to monitor your metrics, it still requires manually checking them periodically. To become more proactive, many API analytics builds also have some sort of monitoring and alerting features.

These can vary from simple threshold-based alerts that send an email when a metric reaches a certain value to complex monitoring rules and workflows that perform aggregations in real time and route alerts to incident response platforms like PagerDuty and BigPanda.
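The simple end of that spectrum can be sketched in a few lines. The function below is a hypothetical example that checks the share of 5xx responses in a window against a threshold. Delivering the message by email or to an incident response platform is left out.

```python
def check_error_rate(statuses, threshold=0.05):
    """Return an alert message if the share of 5xx responses in the
    window exceeds the threshold, otherwise None.

    statuses: list of HTTP status codes observed in the window.
    """
    if not statuses:
        return None
    rate = sum(1 for s in statuses if s >= 500) / len(statuses)
    if rate > threshold:
        return f"error rate {rate:.1%} exceeds {threshold:.1%}"
    return None
```

A call such as `check_error_rate([200] * 9 + [500])` returns a message because one in ten calls failed, while a window with no server errors returns `None`.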

Some customer-facing teams like developer relations and customer success may want to build automated workflows to notify customers or specific internal stakeholders based on complex user behavioral flows.

Access data anywhere vs controlled access

The post-COVID world has accelerated the shift towards work from home. This means internal users of your API analytics build may need to access metrics on home networks or on the go without a VPN.

Placing unnecessary security restrictions on access to the API data may limit the value your API analytics provides to your company, or can even backfire by creating bad habits such as password sharing or exporting large amounts of data to a personal device rather than using the API analytics build the way it was designed.

On the flip side, providing self-service API analytics from anywhere also means having strong security and access control. If you think typical enterprise access control like single sign-on and role-based access control may be needed in the future, you should plan your API analytics build accordingly, even if it is not needed immediately.

Ripping out and changing authentication and authorization design is not an easy task and could force a complete rewrite. You don't want to be caught off guard and be the engineer who failed to anticipate typical future enhancements.

Flexibility of visualizations

While most analytics platforms can display event data or plot basic single-value metrics over time, your platform may also be used for more advanced and specialized analysis like funnel analysis or cohort retention analysis. These are common queries used by business functions like marketing and growth teams but rarely used by developers themselves. Yet building a funnel can be quite challenging if the right data model was not used.

Many times, you won't know in advance which types of queries need to be displayed, so you should choose a project that enables flexibility in both the data model and the visualization layer.
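To see why the data model matters for funnels, consider this minimal sketch. It assumes every event already carries a user identifier and arrives sorted by time, which is exactly what a purely time-series model tends to lack. The function name and the step names in the usage note are hypothetical.

```python
def funnel(events, steps):
    """Count how many users completed each funnel step, in order.

    events: iterable of (user_id, action) tuples, already sorted by time.
    steps: ordered list of action names that make up the funnel.
    """
    progress = {}  # user_id -> index of the next step the user must complete
    for user_id, action in events:
        i = progress.get(user_id, 0)
        if i < len(steps) and action == steps[i]:
            progress[user_id] = i + 1
    # A user counts toward step n if they have progressed past it.
    return [sum(1 for p in progress.values() if p > n) for n in range(len(steps))]
```

With steps such as `["signup", "create_key", "first_call"]`, a user who signs up and makes a call without ever creating a key counts only toward the first step, because steps must be completed in order.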

Comparison

There are a variety of open-source analytics and monitoring projects out there. Some, like Kibana and Grafana, are focused on monitoring infrastructure metrics, while others, like Matomo (formerly known as Piwik), are more focused on web analytics.

While none of these are designed for API products, you could develop custom code to piece a few components together to build an open-source API analytics platform.

Kibana

Kibana is one of the de facto open-source log visualization tools out there for engineers. It's part of the official ELK stack (Elasticsearch, Logstash, Kibana). Its tight integration with Elasticsearch makes it one of the quickest visualization tools to get up and running, relative to tools like Grafana, which are far more complex to set up.

Elasticsearch itself is well suited for high-cardinality, high-dimension log data which is a must for API logs. The downside is that Kibana is only compatible with Elasticsearch. If you want to visualize data stored in a SQL database or other data store, you'll need to look elsewhere.

While quick to set up, Kibana is also quite limited in the types of visualizations and flexibility supported. Kibana's primary use case is to provide log search and light analysis on raw event data rather than offering a true API monitoring tool.

For debugging use cases, this may be sufficient, but popular business metrics like funnels and retention analysis cannot be performed by Kibana, limiting its application outside of engineering teams.

Kibana and Elasticsearch are designed to be paired with a Logstash instance, which enables you to design a logging pipeline to process and enrich API logs, such as normalizing HTTP headers or adding geo IP information to each API call.
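The kind of per-event enrichment described here can be pictured with the Python sketch below. It is an illustration of the transformation only, not Logstash configuration. The event fields and the `geo_lookup` table are assumptions made for the example.

```python
def enrich(event, geo_lookup):
    """Return a copy of one API log event with normalized headers and a geo field.

    event: dict with hypothetical keys "headers" (dict) and "ip" (str).
    geo_lookup: hypothetical mapping from IP address to a country code.
    """
    # Header names are case-insensitive in HTTP, so lowercase them for consistent querying.
    headers = {k.lower(): v for k, v in event.get("headers", {}).items()}
    geo = geo_lookup.get(event.get("ip"), "unknown")
    return {**event, "headers": headers, "geo": geo}
```

The function looks at one event at a time and keeps no state between calls, which is the natural shape for this stage of a logging pipeline.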

Keep in mind Logstash cannot perform aggregations across multiple events at a time. Such processing requires a separate cluster-computing framework for map-reduce operations like Spark or Hadoop.

Kibana is designed for use cases where you want to explore your data in an ad hoc fashion rather than create a daily dashboard. You're able to leverage the Elasticsearch Query DSL or Lucene query syntax, which provides great flexibility, but both have a steep learning curve.
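To give a feel for the Query DSL, the sketch below builds a search request body as a Python dictionary. It filters to slow calls in a recent time window and counts them per endpoint. The field names (`@timestamp`, `latency_ms`, `endpoint.keyword`) are assumptions about how your API logs are indexed.

```python
import json

def slow_calls_query(min_ms, since="now-1h"):
    """Build an Elasticsearch search body that counts slow API calls per endpoint.

    Field names are hypothetical and depend on your index mapping.
    """
    return {
        "size": 0,  # return only the aggregation, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": since}}},
                    {"range": {"latency_ms": {"gte": min_ms}}},
                ]
            }
        },
        "aggs": {
            "by_endpoint": {"terms": {"field": "endpoint.keyword", "size": 10}}
        },
    }

if __name__ == "__main__":
    print(json.dumps(slow_calls_query(500), indent=2))
```

The latency filter alone could be written in Lucene query syntax as `latency_ms:>=500`, which is shorter but does not express the aggregation.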

By default, Kibana is purely a visualization tool which means things like alerting, anomaly detection, and authentication are separate. This means anyone with access to your Kibana endpoint can access your data so you shouldn't have it public.

However, you can purchase and install the Elasticsearch X-Pack to gain some monitoring functionality along with access control.

Grafana

Unlike Kibana, which focuses on log search, Grafana focuses on time-series based metrics. You can visualize data in a variety of databases including Elasticsearch, InfluxDB, OpenTSDB, Graphite, and Prometheus.

Grafana does one thing, and one thing really well, which is visualizing time series metrics stored in a database with beautiful dashboards. This does leave everything else up to you including configuring your data source and processing your data into a time series metric that can be displayed by Grafana.

Compared to Kibana, Grafana only works on time-series data already stored in a database and does not have any real-time log search nor a way to browse or explore your raw data in an ad hoc way. The primary use case for Grafana is to design a dashboard to monitor time-series metrics regularly such as on a TV in the office.

For example, you may want to display disk utilization, system CPU, and requests per minute for your servers. Grafana has a lot of options to display your metrics in the way you want, like showing storage capacity using base 2 units and percentile-based metrics.
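For readers unfamiliar with these two display options, here is a minimal sketch of what they compute. Both functions are illustrative helpers written for this article, not Grafana code. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def format_bytes_base2(n):
    """Format a byte count using base 2 units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    i = 0
    while value >= 1024 and i < len(units) - 1:
        value /= 1024
        i += 1
    return f"{value:.1f} {units[i]}"
```

Percentiles such as p95 latency are usually more informative than averages for API monitoring, because a small share of very slow calls barely moves the mean.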

Due to its time-series based architecture, Grafana's application for high-cardinality, high-dimension analytics on API calls is limited. Instead, you need to spend time deciding exactly what specific time-series metrics you want to track ahead of time and model your data accordingly.

This also limits Grafana's use for the self-service data exploration that business users may be looking for, where you want to segment with multiple group-bys or correlate multiple dimensions in your data.

Compared to Kibana, Grafana is known for built-in support for authentication and access control. You can also hook up your Grafana instance to an external Lightweight Directory Access Protocol (LDAP) server or SQL server to better control access in an enterprise setting.

You can also connect Grafana to an incident response platform like PagerDuty to create and trigger alerts from your Grafana instance. Keep in mind these alerts are limited to the same time series metrics you are already monitoring in Grafana.

Grafana does have a separate product called Loki, which provides some of the log exploration features that Kibana has.

Jaeger

Like Grafana which specializes in time-series metrics, Jaeger does one thing and one thing well, which is visualizing distributed traces.

This makes Jaeger quite a different tool from Grafana and Kibana, in that each trace is created and viewed in isolation rather than as metrics or logs monitored over time. A trace is a snapshot of all context and timing info as a request propagates through a service mesh or hits various services in a microservice architecture.

Because trace generation is expensive, sampling is usually employed so that a trace is captured only for every Nth request or for requests that match specific criteria.
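A sampling decision of that kind can be sketched as follows. This is a generic illustration rather than Jaeger's own sampler API. The parameter names and the example path prefix are assumptions.

```python
def should_trace(request_index, path, every_n=100, always_trace=("/payments",)):
    """Decide whether to capture a trace for this request.

    request_index: running count of requests seen so far (starting at 0).
    path: request path, known when the request starts.
    every_n: capture one trace per this many requests.
    always_trace: hypothetical path prefixes that are always traced.
    """
    if any(path.startswith(prefix) for prefix in always_trace):
        return True
    return request_index % every_n == 0
```

The criteria here use only information available when the request starts. Sampling on outcomes such as errors or latency requires deciding after the request completes, which is a different and more expensive design.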

Unlike Grafana, which focuses on monitoring time-series metrics, and Kibana, which focuses on log search, Jaeger focuses on finding the root cause of specific issues with a service mesh or service dependencies. Like Grafana, Jaeger supports multiple data stores, including Cassandra and Elasticsearch.

Since traces are created in isolation, the only view is a trace view as expected. There is no way to create trends over time. Jaeger also doesn't have any alerting or monitoring features so you'll want to still have a Grafana or similar instance handy.

Moesif

Many API teams find themselves needing a multitude of tools for their monitoring needs. Grafana has time-series metrics whereas Kibana enables log search. A faster option is an end-to-end solution designed specifically for API products, like Moesif.

Unlike Grafana where time-series metrics need to be preplanned, Moesif is designed for ad hoc data exploration using high-cardinality, high-dimension API analytics which is a must for product and engineering leaders looking to make strategic decisions from their API usage data.

Compared to infrastructure monitoring tools like Kibana and Grafana, Moesif leverages a user-centric data model, which enables you to align your API metrics to business goals by tying API metrics to individual customers rather than to infrastructure.

This is also known as user behavioral analytics which enables understanding complex user flows holistically across multiple user actions and API calls rather than looking at each time-series in isolation.

A classic example of user behavioral analytics is monitoring a conversion funnel and then breaking it down by user acquisition channel to decide where to invest marketing dollars or viewing user retention for different features or endpoints to make your API products more sticky. This also enables tracking higher level account health for security research and customer success.
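To make the retention half of that example concrete, the sketch below computes what share of users are active N weeks after their first activity. It is a generic illustration, not Moesif's implementation. The input shape of `(user_id, week_number)` pairs is an assumption.

```python
from collections import defaultdict

def weekly_retention(activity):
    """Share of users active N weeks after their first active week.

    activity: iterable of (user_id, week_number) pairs.
    Returns {weeks_since_first_activity: fraction_of_all_users}.
    """
    weeks_by_user = defaultdict(set)
    for user_id, week in activity:
        weeks_by_user[user_id].add(week)

    active = defaultdict(int)
    for weeks in weeks_by_user.values():
        first = min(weeks)
        for w in weeks:
            active[w - first] += 1

    total = len(weeks_by_user)
    return {offset: count / total for offset, count in sorted(active.items())}
```

This simple version divides by all users, so recently acquired users who have not yet had a chance to reach later weeks pull those numbers down. A fuller cohort analysis would group users by their first week and compare cohorts separately.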

Compared to Grafana and Kibana, Moesif also provides elaborate real-time alerting and reporting features. This enables you to see which endpoints are causing the most performance issues broken down by each customer email.

With behavioral emails and workflows, you're able to scale customer outreach and support efforts with automated emails with a set of steps.