Challenges
Recently, my team was assigned the task of redesigning our monitoring system. My organization runs an ecosystem of hundreds of applications deployed across multiple cloud providers, mostly AWS (tens of AWS accounts in our AWS Organization).
The old monitoring system was designed and deployed years ago. It's a Prometheus stack with a single Prometheus instance, Grafana, Alertmanager, and various types of exporters. It was good enough at the time, but as the ecosystem grows fast, it now has problems:
- Not highly available
- Not scalable; scaling is too complex and inefficient
- Data retention is too short (14 days) because performance degrades dramatically and scaling is difficult
With all the above problems, the ideal solution must meet the requirements below:
- Highly available
- Scalable, able to scale easily
- Disaster recovery
- Data must be stored for at least a year
- Compatible with the Prometheus stack and PromQL, so that we don't spend much effort on migration or on getting familiar with the new stack
- Have an efficient way to collect metrics from multiple AWS accounts
- The deployment process must be automated, for both infrastructure and configuration
- Easy to manage/maintain, with daily operations tasks automated
- Nice to have: multi-tenancy support
Solution
After researching and building some PoCs, we found that Victoria Metrics is a good fit for us. Victoria Metrics has all of the required features: high availability is built-in, and scaling is easy since every component is separated. We implemented it and are now using it in the production environment. We call it Ultra Metrics. Let's look at our solution in detail.
High-level architecture
This is the high-level architecture of the solution:
We use the cluster version of Victoria Metrics (VM). The cluster has these major components:
- `vmstorage`: stores the raw data and returns the queried data in the given time range for the given label filters. This is the only stateful component in the cluster.
- `vminsert`: accepts the ingested data and spreads it among `vmstorage` nodes according to consistent hashing over the metric name and all its labels.
- `vmselect`: performs incoming queries by fetching the needed data from all the configured `vmstorage` nodes.
- `vmauth`: a simple auth proxy and router for the cluster. It reads auth credentials from the Authorization HTTP header (Basic Auth, Bearer token, and InfluxDB authorization are supported), matches them against configs, and proxies incoming HTTP requests to the configured targets.
- `vmagent`: a tiny but mighty agent which helps you collect metrics from various sources and store them in Victoria Metrics or any other Prometheus-compatible storage system that supports the remote_write protocol.
- `vmalert`: executes a list of the given alerting or recording rules against configured data sources. For sending alerting notifications `vmalert` relies on a configured Alertmanager. Recording rule results are persisted via the remote write protocol. `vmalert` is heavily inspired by the Prometheus implementation and aims to be compatible with its syntax.
- `promxy`: used for querying the data from multiple clusters. It's a Prometheus proxy that makes many shards of Prometheus appear as a single API endpoint to the user.
How does the solution fit into our case?
Here is how Ultra Metrics addresses the requirements:
High availability
The system is able to continue accepting new incoming data and processing new queries when some components of the cluster are temporarily unavailable.
We accomplish this by using the cluster version of VM. Each component is deployed with redundancy and auto-healing. Data is also made redundant by replication (read more) to multiple nodes in the same cluster.
- `vminsert` and `vmselect` are stateless components deployed behind the `vmauth` proxy. `vmauth` stops routing requests to unavailable nodes.
- `vmstorage` is the only stateful component; however, since data is redundant, it's fine if some nodes go down temporarily.
  - `vminsert` re-routes incoming data from unavailable `vmstorage` nodes to healthy `vmstorage` nodes.
  - `vmselect` continues serving responses if a `vmstorage` node is unavailable.
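As a rough illustration (not our actual deployment manifests), a minimal compose-style sketch of the write path with replication enabled could look like this; node names are placeholders and `8400` is `vmstorage`'s default port for `vminsert` traffic:

```yaml
# Hypothetical compose fragment: vminsert keeps every sample on 2 of the 3
# vmstorage nodes (-replicationFactor=2), so a single vmstorage node can go
# down temporarily without making recently ingested data unavailable.
services:
  vminsert:
    image: victoriametrics/vminsert:latest   # assumed image name/tag
    command:
      - -storageNode=vmstorage-1:8400
      - -storageNode=vmstorage-2:8400
      - -storageNode=vmstorage-3:8400
      - -replicationFactor=2
```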
Scalability
Since each component has a separate responsibility and most of the components are stateless services, it's much easier to scale both vertically and horizontally. Each component may scale independently.
The storage component is the only stateful one. However, `vmstorage` nodes don't know about each other, don't communicate with each other, and don't share any data, which simplifies cluster maintenance and cluster scaling. Scaling the storage layer is now easy, as the sketch below shows: just add new nodes and update the `vminsert` and `vmselect` configurations. That's it, no more steps are required.
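For example, here is a hedged sketch of adding a third `vmstorage` node (node names and manifests are illustrative, not our production config): start the new node, then append it to the `-storageNode` lists. `8400`/`8401` are `vmstorage`'s default ports for `vminsert`/`vmselect` traffic.

```yaml
# Hypothetical compose fragment: scaling the storage layer only requires
# restarting vminsert and vmselect with the extended -storageNode lists.
services:
  vminsert:
    image: victoriametrics/vminsert:latest   # assumed image name/tag
    command:
      - -storageNode=vmstorage-1:8400
      - -storageNode=vmstorage-2:8400
      - -storageNode=vmstorage-3:8400   # newly added node
  vmselect:
    image: victoriametrics/vmselect:latest   # assumed image name/tag
    command:
      - -storageNode=vmstorage-1:8401
      - -storageNode=vmstorage-2:8401
      - -storageNode=vmstorage-3:8401   # newly added node
```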
Disaster recovery
Following Victoria Metrics' recommendation, all components of a cluster run in the same subnet (same availability zone) to take advantage of high bandwidth, low latency, and thus low error rates. This increases cluster performance.
To get a multi-AZ, or even multi-region (which is what we chose), setup, we run an independent cluster in each AZ or region and configure `vmagent` to send data to all clusters; `vmagent` has this feature built-in. [promxy](https://github.com/jacksontj/promxy) may be used for querying the data from multiple clusters. It provides a single data source for all PromQL queries, meaning Grafana can have a single source and we can run globally aggregated PromQL queries.
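A minimal `promxy` configuration sketch for this setup might look like the snippet below; it reuses the example regional `vmauth` endpoints from this post, and auth/TLS options are omitted:

```yaml
# Each server group points at one regional cluster's vmauth endpoint;
# promxy merges them into a single Prometheus-compatible API for Grafana.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - us-east-1.ultra-metrics.com:8427
      scheme: https
      labels:
        region: us-east-1
    - static_configs:
        - targets:
            - ap-southeast-1.ultra-metrics.com:8427
      scheme: https
      labels:
        region: ap-southeast-1
```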
Failover can be achieved by a combination of Route53 failover and/or `promxy`. When an entire AZ/region goes down, the system stays available for both read and write operations. Once the AZ/region is back in operation, the missing data is sent to that cluster by `vmagent` from its on-disk buffer.
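As a hedged sketch of that behaviour (paths and image tag are placeholders), `vmagent` can be started with one `-remoteWrite.url` per regional cluster plus an on-disk buffer directory:

```yaml
# Hypothetical compose fragment: vmagent replicates every scraped sample to
# both regional clusters and buffers data on disk while a cluster is unreachable.
services:
  vmagent:
    image: victoriametrics/vmagent:latest   # assumed image name/tag
    command:
      - -promscrape.config=/etc/vmagent/scrape.yml   # illustrative path
      - -remoteWrite.url=https://us-east-1.ultra-metrics.com:8427/api/v1/write
      - -remoteWrite.url=https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write
      - -remoteWrite.tmpDataPath=/vmagent-buffer   # buffered data is replayed once the cluster recovers
```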
Multi-tenancy
The system is a centralized monitoring system used by multiple teams. Each team's data is stored independently and isolated from the others, and a team can access only its own data. This is exactly what VM's multi-tenancy feature offers.
The Victoria Metrics cluster has built-in support for multiple isolated tenants. Tenant authentication and routing are expected to be handled by a separate service sitting in front of the Victoria Metrics cluster, such as `vmauth`.
Data for all the tenants is evenly spread among the available `vmstorage` nodes. This guarantees an even load among `vmstorage` nodes even when different tenants have different amounts of data and different query loads. Performance and resource usage also don't depend on the number of tenants.
Let’s say a tenant is an AWS account in the above architecture.
The `vmagent` remote write URLs are configured as in the example below:
- URLs for data ingestion:
https://us-east-1.ultra-metrics.com:8427/api/v1/write
https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write
- URLs for Prometheus querying:
https://us-east-1.ultra-metrics.com:8427/api/v1/query
https://ap-southeast-1.ultra-metrics.com:8427/api/v1/query
The `vmauth` configuration looks like this snippet:
```yaml
users:
  ...
  # Requests with the 'Authorization: Bearer account1Secret' and 'Authorization: Token account1Secret'
  # header are proxied to https://<internal-nlb-domain>:8481
  # For example, https://<internal-nlb-domain>:8427/api/v1/query is proxied to
  # https://<internal-nlb-domain>:8481/select/1/prometheus/api/v1/query
  - bearer_token: account1Secret
    url_map:
      - src_paths:
          - /api/v1/query
          - /api/v1/query_range
          - /api/v1/series
          - /api/v1/label/[^/]+/values
          - /api/v1/metadata
          - /api/v1/labels
          - /api/v1/query_exemplars
        url_prefix:
          - https://<internal-nlb-domain>:8481/select/1/prometheus
      - src_paths:
          - /api/v1/write
        url_prefix:
          - https://<internal-nlb-domain>:8480/insert/1/prometheus
  # Requests with the 'Authorization: Bearer account2Secret' and 'Authorization: Token account2Secret'
  # header are proxied to https://<internal-nlb-domain>:8481
  # For example, https://<internal-nlb-domain>:8427/api/v1/query is proxied to
  # https://<internal-nlb-domain>:8481/select/2/prometheus/api/v1/query
  - bearer_token: account2Secret
    url_map:
      - src_paths:
          - /api/v1/query
          - /api/v1/query_range
          - /api/v1/series
          - /api/v1/label/[^/]+/values
          - /api/v1/metadata
          - /api/v1/labels
          - /api/v1/query_exemplars
        url_prefix:
          - https://<internal-nlb-domain>:8481/select/2/prometheus
      - src_paths:
          - /api/v1/write
        url_prefix:
          - https://<internal-nlb-domain>:8480/insert/2/prometheus
```
Note that:
- `8427` is `vmauth`'s port
- `8481` is `vmselect`'s port
- `8480` is `vminsert`'s port
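Putting it together, a tenant's Prometheus only needs to present its bearer token when remote-writing; `vmauth` then routes the request to that tenant's `/insert/<accountID>/...` path. A hedged example, reusing `account1Secret` from the snippet above:

```yaml
# remote_write config for the tenant mapped to account 1 in vmauth.
remote_write:
  - url: https://us-east-1.ultra-metrics.com:8427/api/v1/write
    authorization:
      type: Bearer
      credentials: account1Secret
  - url: https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write
    authorization:
      type: Bearer
      credentials: account1Secret
```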
Prom-stack compatibility
- VM implements the Prometheus querying API, so there are no changes to query APIs, syntax, etc. All the tools we use continue to function as they are.
- We don't even need to make any changes (sidecar, agent, etc.) to the old monitoring system, except adding a few lines of configuration to make it work with the new system:

```yaml
remote_write:
  - url: https://us-east-1.ultra-metrics.com:8427/api/v1/write
  - url: https://ap-southeast-1.ultra-metrics.com:8427/api/v1/write
```

Thus, we can continue using the old monitoring system while experimenting with the new one.
Some statistics
Will be updated soon.
What’s next?
- Infrastructure provisioning by CDK
- Automate cluster deployment using AWS Automation runbook
- GitOps for daily operation tasks on the cluster
Top comments (3)
Interesting. Is it Open Source?
Yep, it's open source @mmuller88. All the things mentioned in this post are available in the open source version. Victoria Metrics also has an enterprise version.
woww.. great post!