John Ajera

Posted on May 23

EKS Metrics: Amazon Managed Prometheus vs Self-Managed Prometheus

#prometheus #eks #observability #amp

Once your cluster is running workloads, you need a metrics backend: something that scrapes (or receives) time series, stores them, and powers dashboards and alerts. On AWS the fork is usually Amazon Managed Service for Prometheus (AMP)—a managed, Prometheus-compatible store—or self-managed Prometheus in the cluster (Helm chart, operator, or agent + remote storage).

This article is a practical decision guide for that choice on EKS, especially when you are not migrating a decade of PromQL dashboards and recording rules. It covers what each path optimizes for, how cost shows up on the invoice versus in engineer time, how alerting differs, and a short rubric for defaulting to AMP, self-managed, or a hybrid (remote_write).

1. Overview

This guide helps you decide, for a new or early EKS observability stack:

What AMP and self-managed Prometheus each own (ingest, storage, query, alerting)
How to compare cost at a high level (ingestion, storage, queries, and cluster resources)
How to measure or estimate active series and ingestion before you pick a tier
What you still run in-cluster either way (exporters, scrape config, ADOT)
When a hybrid (Prometheus in-cluster → AMP via remote_write) is the least painful path

2. Prerequisites

Familiarity with Prometheus concepts: scrape targets, labels, TSDB retention, PromQL
An EKS cluster (Standard or Auto Mode) with a plan for who operates platform add-ons
The current Amazon Managed Service for Prometheus pricing page and AMP documentation at hand, since limits and pricing change over time

3. Name the two starting paths

Both paths speak Prometheus (PromQL, exposition format, alert rule semantics). The split is who runs the TSDB and query API.

Side-by-side

	Amazon Managed Prometheus (AMP)	Self-managed Prometheus (in-cluster)
AWS runs	Ingestion pipeline, durable TSDB, query plane (serverless-style ops)	—
You run	Collectors/agents, scrape or receive config, IAM, workspace wiring	Prometheus server (often StatefulSet), PVCs, upgrades, HA, backups
Typical ingest	AWS Distro for OpenTelemetry (ADOT) collector, Prometheus agent mode, or `remote_write` from an in-cluster server	In-cluster scrape of ServiceMonitors, Pod annotations, static targets
Alerting	AMP managed Alertmanager and/or route to SNS; Grafana alerting is a separate choice	Alertmanager you deploy (often same Helm release or a sibling chart)
Dashboards	Often Amazon Managed Grafana or self-hosted Grafana with AMP as datasource	Grafana (or similar) pointing at in-cluster Prometheus Service
Multi-cluster	Natural fit: one workspace per env/region or federation patterns with less TSDB ops	Per-cluster Prometheus + optional Thanos/Mimir if you outgrow one server

AMP — responsibility flow

+-------------------------------+
| EKS: workloads + exporters    |
| (node-exporter, kube-state,   |
|  app /metrics, ADOT collector)|
+-------------------------------+
              |
              | remote_write / ADOT pipeline
              v
+-------------------------------+
| AWS: AMP workspace (TSDB +    |
| PromQL query API)             |
+-------------------------------+
              |
              v
+-------------------------------+
| You: Grafana / AMP Alertmgr / |
| SNS routes, dashboards, SLOs  |
+-------------------------------+

Self-managed — responsibility flow

+-------------------------------+
| EKS: Prometheus server        |
| (scrape+TSDB on PVC/emptyDir) |
+-------------------------------+
              |
              v
+-------------------------------+
| EKS: Alertmanager (optional)  |
| receivers → Slack/PagerDuty   |
+-------------------------------+
              |
              v
+-------------------------------+
| You: Grafana, rule lifecycle, |
| upgrades, capacity, backups   |
+-------------------------------+

Same for both: You still own what gets scraped, label cardinality, RBAC for the UI, and runbooks when alerts fire. Picking AMP does not remove the need for good metric and alert design.

4. Cost at a glance

Pricing moves; verify against AMP pricing and your EC2/EBS bill for self-managed. Think in three buckets: ingestion, storage/retention, queries (AMP) versus compute + disk + people (self-managed).

AMP (managed)

Cost driver	What drives it up
Ingested samples	Short scrape intervals, high-cardinality labels, many targets
Storage	Long retention, high churn, many series
Queries	Heavy Grafana dashboards, ad-hoc PromQL, recording rules evaluated in AMP

AMP can be cheap at small scale and expensive when cardinality explodes (unbounded pod labels, high-cardinality HTTP paths, per-UUID labels). Cost guardrails (sampling, relabel drops, allowed metric lists) matter more than on a single dev cluster where you only notice disk growth.
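As a rough illustration of how much an allowed-metric list can save, the sketch below totals the series that would survive a name-based allowlist. The per-metric counts and the regex are hypothetical; `fullmatch` is used because Prometheus relabel regexes are anchored at both ends.

```python
import re

def series_after_allowlist(series_by_metric, keep_pattern):
    """Return (kept, dropped) series totals for a metric-name allowlist regex."""
    keep = re.compile(keep_pattern)
    # fullmatch mirrors the anchored matching used by Prometheus relabel rules.
    kept = sum(n for name, n in series_by_metric.items() if keep.fullmatch(name))
    dropped = sum(series_by_metric.values()) - kept
    return kept, dropped

# Hypothetical per-metric series counts, e.g. copied from a cardinality report.
counts = {
    "container_cpu_usage_seconds_total": 1200,
    "container_memory_working_set_bytes": 1200,
    "container_tasks_state": 6000,
    "apiserver_request_duration_seconds_bucket": 9000,
}

kept, dropped = series_after_allowlist(counts, r"container_(cpu|memory)_.*")
print(f"kept={kept} dropped={dropped}")
```

Running this kind of estimate against real counts before writing relabel rules shows which drops are worth the effort.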

Self-managed (in-cluster)

Cost driver	What drives it up
EC2 / Karpenter nodes	CPU/memory for Prometheus replicas and rule evaluation
EBS (or equivalent)	TSDB size × retention; compactions and WAL
Engineer time	Chart upgrades, PVC expansion, HA drills, backup/restore testing

A minimal HA self-managed stack (two Prometheus replicas, anti-affinity, 20–50 Gi PVCs each, small Alertmanager) is often tens of dollars per month in AWS resources for a small cluster—before counting on-call and upgrade work. The invoice line is predictable; the hidden cost is operational.

Bottom line: AMP trades variable, usage-based spend for less TSDB operations. Self-managed trades fixed-ish infra cost for more control and more chores. Neither removes the need to design metrics carefully.

5. How to check and estimate your usage

Cost conversations are easier when you separate active series (how many distinct time series the TSDB holds) from ingestion rate (how many samples per second you push—what AMP bills most directly).

What counts as one series

A time series is one metric name plus one unique label set. Example: http_requests_total{method="GET",status="200",pod="abc"} is one series. More pods, paths, or IDs in labels → more series.
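To make the definition concrete, here is a minimal sketch that counts distinct series in a handful of samples. The sample data is made up; only the counting rule (metric name plus unique label set) comes from the definition above.

```python
def count_series(samples):
    """Count distinct series: one per unique (metric name, label set) pair."""
    return len({(name, frozenset(labels.items())) for name, labels in samples})

# Illustrative samples; the second entry repeats the first series at a later scrape.
samples = [
    ("http_requests_total", {"method": "GET", "status": "200", "pod": "abc"}),
    ("http_requests_total", {"method": "GET", "status": "200", "pod": "abc"}),
    ("http_requests_total", {"method": "GET", "status": "500", "pod": "abc"}),
    ("http_requests_total", {"method": "GET", "status": "200", "pod": "def"}),
]

print(count_series(samples))  # 3
```

Four samples yield three series: a repeated scrape adds nothing, while a new status or a new pod each adds one.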

Rough scale tiers (planning only)

These are order-of-magnitude guides, not limits:

Tier	Active series (ballpark)	Typical EKS picture
Small	~1k–5k	Minimal scrape (little kube-state-metrics, no full cAdvisor), or aggressive `metric_relabel_configs` drops
Medium	~10k–30k	node-exporter + kube-state-metrics + kubelet/cAdvisor + apiserver on a small cluster (handful of nodes, tens of pods)
Large	50k+	Full default chart scrape, many microservices, per-path HTTP metrics, or duplicate exporters

A standard platform scrape (kube-state-metrics, kubernetes-nodes-cadvisor, apiserver, node-exporter) on a small cluster often lands above 10k even when it still feels “small” operationally—measure rather than assuming the 1k tier.

Turn series into ingestion (AMP-style math)

At a steady scrape interval, each active series produces about one sample per scrape:

samples_per_second ≈ active_series ÷ scrape_interval_seconds

Examples at 30s scrape:

Active series	~samples/s	~samples/month (30 days)
1,000	~33	~86M
10,000	~333	~864M
25,000	~833	~2.2B
50,000	~1,667	~4.3B

Use 86,400 × 30 seconds per month for back-of-envelope planning. If jobs use different intervals, sum per job or use a weighted average.
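The same arithmetic as a small sketch, summing per job when scrape intervals differ. The function names are illustrative, and the 30-day month is the planning assumption from the text.

```python
SECONDS_PER_MONTH = 86_400 * 30  # 30-day planning month

def samples_per_second(active_series, scrape_interval_seconds):
    """Each active series yields roughly one sample per scrape."""
    return active_series / scrape_interval_seconds

def samples_per_month(jobs):
    """jobs: iterable of (active_series, scrape_interval_seconds), summed per job."""
    return sum(samples_per_second(s, i) for s, i in jobs) * SECONDS_PER_MONTH

# 10,000 series at a 30s interval, matching the table row above.
print(round(samples_per_month([(10_000, 30)]) / 1e6))  # 864 (million samples)
```

For mixed intervals, pass one pair per job, for example `[(8_000, 30), (2_000, 15)]`.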

Back-of-envelope before deploy

List scrape jobs and estimate series per target:

Source	How to estimate
node-exporter	~400–1,000 series × node count (disks and NICs add labels)
kube-state-metrics	~20–40 series × pod count, plus Deployments, PVCs, HPA, PDB, …
kubelet / cAdvisor (`kubernetes-nodes-cadvisor`)	Often the largest bucket—scales with containers, not just nodes
apiserver	Often thousands of series (histogram buckets and verb/resource labels)
Application `/metrics`	Highly variable—histograms and high-cardinality labels dominate

active_series ≈ Σ (targets × series_per_target)   # plus a little for Prometheus self-metrics
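A minimal sketch of that sum follows. Every number in it is a placeholder: the node-exporter and kube-state-metrics figures sit inside the ranges in the table above, while the cAdvisor, apiserver, and application figures and the self-metrics overhead are guesses to replace with your own.

```python
def estimate_active_series(jobs, overhead=500):
    """jobs: list of (name, targets, series_per_target).

    overhead is an assumed allowance for Prometheus self-metrics.
    """
    return sum(targets * per_target for _, targets, per_target in jobs) + overhead

# Hypothetical small cluster: 5 nodes, 60 pods, 80 containers, 10 app targets.
jobs = [
    ("node-exporter", 5, 700),
    ("kube-state-metrics (per pod)", 60, 30),
    ("cadvisor (per container)", 80, 50),
    ("apiserver", 1, 3000),
    ("applications", 10, 200),
]

print(estimate_active_series(jobs))  # 14800
```

Even with modest placeholder inputs the total lands in the medium tier, which is why measuring beats assuming.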

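Churn (pods created and destroyed) adds series on top of the steady-state count: every replacement pod brings fresh label sets, so stored series keep growing even when the active count looks flat. Dynamic fleets need headroom.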

How to check (self-managed Prometheus already running)

Port-forward to the Prometheus server (adjust namespace and service name to match your install):

kubectl -n prometheus port-forward svc/prometheus-server 9090:80

Headline count — TSDB stats API:

curl -s http://127.0.0.1:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'

Same value in the UI: Status → TSDB Stats. The metric prometheus_tsdb_head_series tracks it over time.
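If you would rather script the check, the sketch below reads the same endpoint with the Python standard library. It assumes the port-forward above is running and that the response nests the count under `data.headStats.numSeries`, as recent Prometheus 2.x releases do; the sample payload is trimmed and made up.

```python
import json
from urllib.request import urlopen

def head_series(tsdb_status):
    """Extract the head series count from a /api/v1/status/tsdb response body."""
    return tsdb_status["data"]["headStats"]["numSeries"]

def fetch_head_series(base_url="http://127.0.0.1:9090"):
    """Call the TSDB stats API; needs the port-forward to be running."""
    with urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return head_series(json.load(resp))

# Trimmed, made-up response used to exercise the parser without a live server.
sample_response = {
    "status": "success",
    "data": {"headStats": {"numSeries": 14800, "chunkCount": 29000}},
}

print(head_series(sample_response))  # 14800
```

Call `fetch_head_series()` once the port-forward is up to get the live value.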

Top metrics by cardinality (can be expensive on very large TSDBs—use in non-prod or off-peak):

curl -sG 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=topk(20, count by (__name__) ({__name__=~".+"}))' \
  | jq '.data.result[] | {metric: .metric.__name__, series: .value[1]}'

Top scrape jobs (find what to trim):

curl -sG 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (job) ({__name__=~".+"}))' \
  | jq '.data.result[] | {job: .metric.job, series: .value[1]}'

In the UI, Status → Targets shows per-target health, last scrape time, and scrape duration. For per-target sample counts, query the scrape_samples_scraped metric, which is useful when a single job spikes.
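The two curl queries above can also be wrapped in a short script. This is a sketch using only the standard library; the helper names are illustrative, the base URL assumes the earlier port-forward, and the sample response is made up to show the parsing step.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def series_by_label(response, label):
    """Turn a vector result into {label value: series count}, largest first."""
    rows = {r["metric"].get(label, ""): int(r["value"][1])
            for r in response["data"]["result"]}
    return dict(sorted(rows.items(), key=lambda kv: kv[1], reverse=True))

def top_jobs(base_url="http://127.0.0.1:9090", k=10):
    """Run the top-scrape-jobs query; needs the port-forward to be running."""
    promql = f'topk({k}, count by (job) ({{__name__=~".+"}}))'
    with urlopen(query_url(base_url, promql), timeout=30) as resp:
        return series_by_label(json.load(resp), "job")

# Made-up response in the shape the API returns for an instant vector.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "node-exporter"}, "value": [1700000000, "3500"]},
    {"metric": {"job": "kubernetes-nodes-cadvisor"}, "value": [1700000000, "5200"]},
]}}

print(series_by_label(sample, "job"))
```

Swapping `job` for `__name__` in both the query and the label argument gives the per-metric view instead.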

How to check (AMP)

CloudWatch metrics on the AMP workspace (ingestion and active series—see Monitor AMP).
AWS console → AMP workspace usage views for growth over time.
If you use ADOT or remote_write, the sender (collector or in-cluster Prometheus) still exposes scrape stats—debug cardinality at the source before it hits the workspace.

What usually blows up series count

High-cardinality labels: pod, url, trace_id, unbounded user_id on high-volume metrics.
Duplicate scrape: EKS addon node-exporter and Helm node-exporter running side by side, or two Prometheus replicas that each scrape everything and remote_write it to AMP, which doubles ingestion unless deduplication is configured.
Histograms: each bucket is its own series, plus _sum and _count, so one logical metric becomes many series.
Full cAdvisor / apiserver defaults with no relabel drops.
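The histogram effect is easy to underestimate, so here is a small worked sketch. The bucket count and label cardinalities are hypothetical; the rule that each bucket plus `_sum` and `_count` is its own series is the one described in the list above.

```python
def histogram_series(bucket_count, label_combinations):
    """Series for one histogram metric.

    bucket_count includes the le="+Inf" bucket; _sum and _count add two more
    series per label combination.
    """
    return (bucket_count + 2) * label_combinations

# Hypothetical HTTP latency histogram: 12 buckets,
# 5 paths x 4 methods x 3 status classes = 60 label combinations.
print(histogram_series(12, 5 * 4 * 3))  # 840
```

A single latency histogram reaching hundreds of series per service is why trimming buckets or labels often pays off first.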

Once you have numSeries and the top job values, map yourself to the small / medium / large table above and plug the numbers into the samples-per-month formula to estimate AMP ingestion.
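That last step might look like the sketch below. Two things in it are assumptions rather than facts from this article: the cut-offs between tiers (the table leaves gaps between them) and the flat price per 10 million samples, which you supply yourself from the current pricing page. Real AMP pricing is tiered, so treat the result as a rough figure.

```python
def scale_tier(active_series):
    """Map a series count to a planning tier; the cut-offs are assumptions."""
    if active_series < 10_000:
        return "small"
    if active_series < 50_000:
        return "medium"
    return "large"

def monthly_ingest_cost(active_series, scrape_interval_seconds, usd_per_10m_samples):
    """Rough ingestion cost for a 30-day month at a caller-supplied flat rate."""
    samples = active_series / scrape_interval_seconds * 86_400 * 30
    return samples / 10_000_000 * usd_per_10m_samples

# 25,000 series at 30s; the 1.00 rate is a placeholder, not a real price.
print(scale_tier(25_000), round(monthly_ingest_cost(25_000, 30, 1.00), 2))
```

Replace the placeholder rate with the current published price, and add storage and query charges separately.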

6. Who runs what—and what the first year feels like

Where the work lands

Area	AMP	Self-managed
TSDB durability & scaling	AWS	You (PVC size, retention flags, compaction behavior)
Prometheus version upgrades	Managed compatibility window	You (Helm chart / operator upgrades, CRD drift)
Scrape discovery	You (collector config, EKS receivers, ServiceMonitor CRDs)	You (same; often more familiar with `prometheus.yml` in-cluster)
Recording / alerting rules	AMP rule groups or in-cluster evaluation + `remote_write`	Native `serverFiles` / PrometheusRule CRDs in Git
Long-term retention / global view	AMP + optional export; or Mimir/Thanos later	Add Thanos/Mimir/Cortex when one server is not enough
UI for debugging	Grafana → AMP; limited “SSH into Prometheus”	`kubectl port-forward` to `:9090`—fast for platform engineers

AMP in practice

Less toil on TSDB HA, backups, and “disk full” pages for the metrics store
Strong fit for org-wide standards and IAM-bound workspaces
Watch ingestion billing and cardinality; use ADOT/Prometheus relabeling deliberately
Alertmanager behavior is AWS-shaped—read AMP Alertmanager docs before assuming OSS Alertmanager feature parity

Self-managed in practice

Full control of scrape config, rule files, federation, and “break glass” PromQL on localhost
Familiar path: prometheus-community/prometheus Helm chart, GitOps overlays per cluster
Recurring work: chart bumps with Kubernetes upgrades, PVC growth, proving Alertmanager HA, securing the Prometheus UI (it has no built-in auth—use network policy, private ingress, or an auth proxy)
EKS add-ons can cover node-exporter (managed addon) while you keep a narrow Prometheus release for the server and rules—avoid duplicating the same metrics twice

7. EKS-specific wiring (both paths)

Neither option removes in-cluster collection:

Component	Typical role
Metrics server	HPA CPU/memory—not a Prometheus replacement
prometheus-node-exporter	Node/host metrics (DaemonSet or EKS managed addon)
kube-state-metrics	Kubernetes object metrics (Deployment, Pod, PVC, …)
CoreDNS / kubelet / apiserver	Cluster health; scrape config or ADOT receivers
Application `/metrics`	Your SLOs and business metrics

AMP path: deploy ADOT (or Prometheus in agent mode) with remote_write to the AMP workspace endpoint; use EKS Pod Identity or IRSA for SigV4 authentication.

Self-managed path: enable targets in Helm values or the Prometheus Operator; pin kube-state-metrics and exporters intentionally so you do not scrape the same series twice.

Hybrid (common): run a small in-cluster Prometheus for fast local debugging and federation-style rules, and remote_write only aggregates or critical series to AMP for org dashboards and long retention. You pay for the extra complexity once, but you avoid storing every raw sample twice.

8. Alerting and Grafana

Topic	AMP	Self-managed
Rule evaluation	AMP rule groups and/or in-cluster Prometheus firing → AMP	Prometheus `alerting_rules.yml` / PrometheusRule CRDs
Alert routing	AMP Alertmanager, SNS, EventBridge; fewer “random webhook” examples in AWS docs	Alertmanager receivers (Slack, PagerDuty, Opsgenie) in Git—with secrets via External Secrets
Grafana	Managed Grafana with AMP datasource is the path of least resistance	Platform Grafana in-cluster; datasource = Prometheus Service DNS
Double paging	Risk if both AMP rules and Grafana unified alerting fire on the same metric	Risk if both Prometheus and Grafana own the same alerts—pick one owner

9. Decision rubric (greenfield)

Lean toward AMP when:

You want no StatefulSet TSDB to babysit and are fine with usage-based metrics cost
Centralized observability across many accounts/clusters matters soon
You will standardize on ADOT / AWS observability patterns and Managed Grafana
The team is small and should not own Prometheus compaction, PVC resize, and version skew

Lean toward self-managed Prometheus when:

You need maximum control (custom scrape hooks, exotic service discovery, air-gapped patterns)
Most engineers live in port-forward PromQL and Git-managed prometheus.yml / rules
Predictable infra cost matters more than eliminating ops (small cluster, disciplined cardinality)
You already run kube-prometheus-stack patterns and want one chart to own rules + Alertmanager

Lean toward hybrid when:

You need in-cluster debugging and org-wide long retention or dashboards in AMP
You are migrating: start self-managed, remote_write to AMP, cut over Grafana datasources, then shrink in-cluster retention

Default suggestion for a greenfield EKS platform team: if nobody wants to own TSDB operations, start with AMP + ADOT and Managed Grafana; if the team is building GitOps muscle on platform charts and wants the fastest “see /targets and /rules locally” loop, self-managed with a narrow chart (server + optional Alertmanager, exporters wired deliberately) is a solid teaching path. Revisit when cardinality, retention, or multi-cluster pain appears.

10. Troubleshooting: common misconceptions

“AMP means we don’t run anything in the cluster.” You still run collectors/exporters and own cardinality.
“Self-managed is always cheaper.” EBS + replicas can be modest; engineer time and incident cost often dominate.
“We installed the prometheus-node-exporter EKS addon, so we have Prometheus.” The addon is node metrics, not the TSDB server.
“Grafana alerting replaces Prometheus/Alertmanager.” It can—but two owners for the same alert is how you get double pages.
“remote_write is free duplication.” It is not; you pay network, ingest, and often double evaluation unless you design what gets forwarded.
