selfhosting.sh

Posted on Mar 18 • Originally published at selfhosting.sh

Checkmk vs Grafana: Monitoring Compared

#monitoring #devops #opensource #tooling

Checkmk started as a Nagios plugin in 2008 and has evolved into a standalone infrastructure monitoring platform that discovers hosts, checks services, and sends alerts — all from one package. Grafana is a visualization layer that turns time-series data from sources like Prometheus, InfluxDB, or Loki into dashboards and alerts. They solve different problems, and understanding the distinction matters before you deploy either.

Quick Overview

Aspect	Checkmk Raw	Grafana OSS
Purpose	Infrastructure monitoring (all-in-one)	Data visualization & dashboarding
Latest version	2.3.0p44	v12.4.1
Docker image	`checkmk/check-mk-raw:2.3.0p44`	`grafana/grafana-oss:12.4.1`
License	GPL-2.0 (Raw Edition)	AGPL-3.0
Built-in data collection	Yes — agent-based + SNMP + agentless	No — requires external data sources
Service auto-discovery	Yes	No
Built-in alerting	Yes (rules + notifications)	Yes (alert rules + contact points)
Built-in dashboards	Yes (pre-built per service)	Yes (community dashboards + custom)
Host/service check engine	Yes (Nagios-compatible core)	No
Default port	5000 (web UI), 8000 (agent receiver)	3000
RAM usage	~1-2 GB	~200-400 MB

Feature Comparison

Feature	Checkmk Raw	Grafana OSS
Agent deployment	Built-in agent (Linux, Windows, macOS)	N/A (no agents)
SNMP monitoring	Built-in	Via Prometheus SNMP exporter
Network device monitoring	Built-in (switches, routers, firewalls)	Via external exporters
Log aggregation	Basic (via agent)	Via Loki integration
Metrics storage	Built-in RRD	External (Prometheus, InfluxDB, etc.)
Custom check scripts	Yes (local checks, MRPE)	N/A
API	REST API	REST API
LDAP/SSO	Yes	Yes
Mobile app	No official app	Grafana Cloud mobile app
Plugin ecosystem	Check plugins (~2,000 in exchange)	Data source + panel plugins (hundreds)
Multi-site support	Yes (distributed monitoring)	Via data source federation
Uptime monitoring	Built-in	Via external tools or plugins

Architecture

Checkmk is a complete monitoring stack. It includes:

A monitoring core (CMC in Enterprise, Nagios in Raw)
Agent framework for data collection
Service discovery engine
Check processing pipeline
Notification system
Web UI with pre-built dashboards
RRD-based metrics storage

You install Checkmk, deploy agents on your hosts, add the hosts in the web UI, and run service discovery. By default the Checkmk server polls each agent over TCP port 6556, then processes checks, stores metrics, and fires alerts, with no additional tools needed.
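To make that data flow more concrete: the Checkmk agent emits plain text divided into sections, each introduced by a `<<<name>>>` header, and the server turns those sections into service checks. The Python sketch below splits such output into sections. The sample text is abbreviated and invented for illustration, and `parse_agent_output` is a hypothetical helper, not part of Checkmk.

```python
import re

# Abbreviated, invented sample; real agent output has many more sections.
SAMPLE = """<<<check_mk>>>
Version: 2.3.0p44
AgentOS: linux
<<<df>>>
/dev/sda1 ext4 41152736 30123456 8915792 78% /
<<<uptime>>>
864000.12 1700000.00
"""

# Matches "<<<name>>>" and "<<<name:option(...)>>>", capturing only the name.
HEADER = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$")

def parse_agent_output(text):
    """Group agent output lines (split on whitespace) by section name."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None and line:
            sections[current].append(line.split())
    return sections

sections = parse_agent_output(SAMPLE)
print(sorted(sections))       # section names found in the sample
print(sections["df"][0][5])   # the usage column of the first df row
```

Real check plugins do considerably more than this (separators, piggyback data, caching), so treat it only as a picture of the raw format.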

Grafana is a visualization layer. It relies on external systems for data collection and storage, and it can hand alert routing to one as well:

Data collection → Prometheus, Telegraf, or other collectors
Metrics storage → Prometheus, InfluxDB, VictoriaMetrics
Log storage → Loki, Elasticsearch
Alerting → Grafana's built-in alerting or Alertmanager

A production Grafana monitoring stack typically runs 3-5 containers (Grafana + Prometheus + node_exporter + optional Loki + optional Alertmanager). Grafana itself only renders dashboards and evaluates alert rules; it collects and stores nothing.

Installation Complexity

Step	Checkmk	Grafana (with Prometheus)
Containers needed	1	3+ (Grafana + Prometheus + exporters)
Time to first dashboard	~15 minutes	~30-60 minutes
Agent deployment needed	Yes (on monitored hosts)	Yes (node_exporter on hosts)
Auto-discovery	Yes — discovers services automatically	No — manual target config
Configuration language	Web UI (Setup menu, formerly WATO)	YAML (Prometheus) + Web UI (Grafana)
Dashboard creation	Pre-built per service type	Manual or import community dashboards

Checkmk is faster to get running for infrastructure monitoring. You add a host in the web UI, deploy the agent, and Checkmk auto-discovers services (CPU, disk, memory, network, running processes, Docker containers). Pre-built dashboards appear automatically.

Grafana requires more assembly. You configure Prometheus scrape targets in YAML, deploy exporters, then build or import dashboards. The flexibility is greater, but the initial setup time is higher.
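The manual target configuration can be partly scripted. Prometheus can read scrape targets from JSON files through its file-based service discovery (`file_sd_configs`). The sketch below generates such a file for node_exporter, assuming its default port 9100. The hostnames, the `env` label, and the `build_file_sd` helper are invented for illustration.

```python
import json

NODE_EXPORTER_PORT = 9100  # node_exporter's default listen port

def build_file_sd(hosts, labels):
    """Return a one-group file_sd target list for the given hosts."""
    targets = [f"{host}:{NODE_EXPORTER_PORT}" for host in sorted(hosts)]
    return [{"targets": targets, "labels": dict(labels)}]

# Hypothetical hosts; replace with your own inventory.
groups = build_file_sd(["nas.lan", "docker01.lan"], {"env": "homelab"})
print(json.dumps(groups, indent=2))
```

Write the output to a file that a `file_sd_configs` entry in your Prometheus configuration points at. Prometheus re-reads that file when it changes, so adding a host does not require editing the main YAML.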

Performance and Resource Usage

Metric	Checkmk Raw	Grafana + Prometheus
RAM (10 hosts)	~800 MB - 1 GB	~500-800 MB total
RAM (100 hosts)	~1.5-2 GB	~1-2 GB total
CPU	Moderate (check processing)	Low (Grafana) + Moderate (Prometheus)
Disk (metrics retention)	~50 MB/host/year (RRD)	~100+ MB/host/year (Prometheus TSDB)
Check interval	Default 60s	15s in the sample config (Prometheus's built-in default is 1m)

Checkmk uses more RAM in a single container because every component runs there. The Grafana+Prometheus stack distributes load across multiple containers but uses comparable total resources.

Monitoring Approach

Checkmk uses a check-based model. It runs checks against services (Is the disk full? Is the service running? Is the CPU overloaded?) and returns OK/WARN/CRIT/UNKNOWN states. This maps directly to traditional infrastructure monitoring — you see green/yellow/red status at a glance.
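The check-based model can be sketched in a few lines of Python. The example maps a measurement to one of the four states and prints a line in the style of a Checkmk local check (state number, quoted service name, metric with thresholds, summary text). The service name, metric name, and thresholds are invented, and both functions are hypothetical helpers rather than Checkmk code.

```python
# Numeric states shared by Nagios-style checks and Checkmk local checks.
STATES = {0: "OK", 1: "WARN", 2: "CRIT", 3: "UNKNOWN"}

def evaluate(value, warn, crit):
    """Map a measurement to a state number using upper thresholds."""
    if value is None:
        return 3  # no data: UNKNOWN
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0

def local_check_line(service, metric, value, warn, crit):
    """Format one line: state, quoted service name, metric, summary."""
    state = evaluate(value, warn, crit)
    perf = "-" if value is None else f"{metric}={value};{warn};{crit}"
    summary = "no data" if value is None else f"{value}% used"
    return f'{state} "{service}" {perf} {summary}'

# Invented example: a disk at 83.5% with warn at 80 and crit at 90.
print(local_check_line("Disk /data", "used_percent", 83.5, 80, 90))
```

A script like this, placed in the agent's local checks directory, is roughly how custom services get added; check the Checkmk documentation for the exact path and format on your version.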

Grafana uses a metrics-based model. Prometheus scrapes numeric time-series data (cpu_usage_percent=73.2 at timestamp T), and Grafana visualizes trends. You define alert thresholds on metrics, but the default view is graphs and dashboards, not service states.
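For contrast, here is a minimal Python sketch of the metrics-based side. `render_gauge` produces the plain-text format that Prometheus scrapes from an exporter, and `breaches` approximates a threshold alert that only fires after several consecutive samples, loosely analogous to a `for:` duration in an alert rule. The `host` label, the sample values, and both helper names are assumptions made for illustration, and label values are not escaped here.

```python
def render_gauge(name, help_text, samples):
    """Render gauge samples in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

def breaches(series, threshold, min_points):
    """True if the last `min_points` values all exceed the threshold."""
    recent = series[-min_points:]
    return len(recent) == min_points and all(v > threshold for v in recent)

# Invented sample data.
print(render_gauge("cpu_usage_percent", "CPU usage in percent.",
                   [({"host": "nas"}, 73.2)]))
print(breaches([61.0, 92.4, 95.1, 97.8], threshold=90, min_points=3))
```

Note the difference from the check-based model: the exporter reports only the number, and the decision about what counts as a problem lives in the alert rule, not in the collector.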

Both approaches work. Checkmk's state-based view is better for ops teams who need "is everything OK?" at a glance. Grafana's time-series view is better for engineering teams who want to understand trends and correlate metrics.

Use Cases

Choose Checkmk If...

You need traditional infrastructure monitoring (servers, switches, printers)
You want auto-discovery of services without manual configuration
You monitor Windows servers alongside Linux (Checkmk has a native Windows agent)
You prefer a single application over assembling a monitoring stack
Your priority is uptime and alerting, not custom dashboards

Choose Grafana If...

You want beautiful, customizable dashboards
You already run or plan to run Prometheus
You need to visualize data from multiple sources (databases, cloud APIs, custom apps)
You monitor containerized/Kubernetes workloads
You want fine-grained control over metrics collection and retention

Use Both If...

You want Checkmk's auto-discovery and state-based monitoring AND Grafana's visualization
Checkmk supports Grafana integration via its REST API and InfluxDB export

Final Verdict

If you need infrastructure monitoring and don't want to assemble a multi-tool stack, Checkmk is the right tool. It handles host discovery, service checks, alerting, and basic dashboards in one package. Deploy the agent, add your hosts, and monitoring works.

If you need flexible visualization, custom dashboards, or you're monitoring application-level metrics alongside infrastructure, Grafana with Prometheus is more powerful. The trade-off is complexity — you're building and maintaining a stack, not deploying a single tool.

For home server monitoring with 5-20 hosts, Checkmk gets you running faster. For larger environments or teams that want deep observability, the Grafana ecosystem scales further.

Frequently Asked Questions

Can Checkmk export data to Grafana?

Yes. Checkmk can export metrics to InfluxDB, which Grafana reads as a data source. The Checkmk REST API also provides performance data that Grafana can query directly.
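As a starting point for exploring the REST API, the Python sketch below builds (without sending) an authenticated request that lists configured hosts. The URL layout and the `Bearer <user> <secret>` header follow the Checkmk 2.x REST API as I understand it, so verify both against the API documentation built into your own server. The server name, site name, user, and secret are placeholders, and `build_hosts_request` is a hypothetical helper.

```python
import urllib.request

def build_hosts_request(server, site, user, secret):
    """Build, but do not send, a request that lists hosts via the REST API."""
    url = (f"https://{server}/{site}/check_mk/api/1.0"
           "/domain-types/host_config/collections/all")
    headers = {
        "Authorization": f"Bearer {user} {secret}",  # automation user + secret
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)

# All four values are placeholders for illustration.
req = build_hosts_request("monitor.example.com", "cmk", "automation", "changeme")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call. For dashboards, a dedicated Grafana data source is usually a better fit than hand-written requests like this one.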

Is Checkmk Raw Edition really free?

Yes. The Raw Edition is GPL-2.0 licensed with no host limits. The Enterprise and Cloud editions add features like the Checkmk Micro Core (faster), advanced dashboards, and managed services.

Can Grafana replace Checkmk entirely?

Not on its own. Grafana doesn't collect data or run service checks. With Prometheus + Alertmanager + exporters, you can replicate most of Checkmk's functionality — but you're assembling 4-5 tools to do what Checkmk does in one.
