Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS
Introduction
Observability answers the question: what is my system doing right now, and why?
The three pillars:
- Metrics — numerical measurements over time (CPU%, request rate, error rate, latency)
- Logs — structured event records from every container
- Traces — request flows across services (not covered in this series, but Grafana Tempo is the natural next step)
This pipeline implements metrics with Prometheus + Grafana and logs with Fluent Bit → CloudWatch. Together they give you both real-time dashboards and historical log search without leaving the AWS ecosystem.
┌──────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY ARCHITECTURE │
│ │
│ Application Pods │
│ ├─ /metrics endpoint → ServiceMonitor → Prometheus scrape │
│ └─ stdout/stderr → Fluent Bit DaemonSet → CloudWatch Logs │
│ │
│ Infrastructure │
│ ├─ node-exporter (CPU, memory, disk, network per node) → Prometheus │
│ └─ kube-state-metrics (pod state, deployment state) → Prometheus │
│ │
│ Prometheus → Grafana (dashboards + alert rules) │
│ Prometheus → Alertmanager (notifications) │
│ │
│ Falco (security events) → stdout → Fluent Bit → CloudWatch │
│ All containers → stdout → Fluent Bit → CloudWatch │
└──────────────────────────────────────────────────────────────────────────┘
kube-prometheus-stack
Rather than installing Prometheus, Grafana, and Alertmanager separately, we use the kube-prometheus-stack Helm chart. It bundles:
- Prometheus Operator — manages Prometheus, Alertmanager, and PrometheusRule CRDs
- Prometheus — the metrics database
- Alertmanager — routes alerts to notification channels
- Grafana — dashboards and visualization
- kube-state-metrics — exposes Kubernetes object state as metrics
- node-exporter — exposes node-level metrics (CPU, memory, disk, network)
We install it only on staging and production clusters (4 of 6) — dev clusters skip monitoring to reduce cost.
Installation via ArgoCD
# infrastructure/monitoring/applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: myapp-production-use1
            region: us-east-1
            grafanaIngress: "true"
            certArn: "arn:aws:acm:us-east-1:591120834781:certificate/9ab022c9-..."
          - cluster: myapp-production-usw2
            region: us-west-2
            grafanaIngress: "false"
            certArn: ""
          - cluster: myapp-staging-use1
            region: us-east-1
            grafanaIngress: "false"
            certArn: ""
          - cluster: myapp-staging-usw2
            region: us-west-2
            grafanaIngress: "false"
            certArn: ""
  template:
    metadata:
      name: "prometheus-{{cluster}}"
    spec:
      project: production
      sources:
        - repoURL: https://prometheus-community.github.io/helm-charts
          chart: kube-prometheus-stack
          targetRevision: "61.9.0"
          helm:
            valueFiles:
              - $gitopsValues/infrastructure/monitoring/prometheus-values.yaml
            parameters:
              - name: "grafana.ingress.enabled"
                value: "{{grafanaIngress}}"
              - name: "grafana.ingress.annotations.alb\\.ingress\\.kubernetes\\.io/certificate-arn"
                value: "{{certArn}}"
        - repoURL: https://github.com/MatthewDipo/myapp-gitops.git
          targetRevision: main
          ref: gitopsValues
      destination:
        name: "{{cluster}}"
        namespace: monitoring
      syncPolicy:
        syncOptions: [CreateNamespace=true, ServerSideApply=true]
        retry:
          limit: 3
          backoff: { duration: 30s, maxDuration: 10m, factor: 2 }
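To make the list generator concrete, the sketch below mimics how each element is substituted into the template to yield one Application per cluster. It is a plain-Python illustration, not ArgoCD's implementation: the render and expand helpers are hypothetical names, and only a subset of the element keys is reproduced.

```python
import re

# Subset of the generator elements from the ApplicationSet (certArn omitted).
ELEMENTS = [
    {"cluster": "myapp-production-use1", "region": "us-east-1", "grafanaIngress": "true"},
    {"cluster": "myapp-production-usw2", "region": "us-west-2", "grafanaIngress": "false"},
    {"cluster": "myapp-staging-use1", "region": "us-east-1", "grafanaIngress": "false"},
    {"cluster": "myapp-staging-usw2", "region": "us-west-2", "grafanaIngress": "false"},
]


def render(template, params):
    """Replace every {{key}} placeholder with the matching value from params."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: params[m.group(1)], template)


def expand(elements):
    """Return (application name, destination cluster, ingress flag) per element."""
    return [
        (
            render("prometheus-{{cluster}}", e),
            render("{{cluster}}", e),
            render("{{grafanaIngress}}", e),
        )
        for e in elements
    ]


for name, destination, ingress in expand(ELEMENTS):
    print(f"{name} -> {destination} (grafana.ingress.enabled={ingress})")
```

Four elements yield four Applications, and only the production-use1 one carries the ingress flag set to true.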
Important install flag: use --no-hooks --timeout 10m (without --wait). Pre-upgrade admission webhook Jobs consistently time out in ArgoCD, causing the sync phase to show Failed, but the actual resources (Prometheus, Grafana, Alertmanager) deploy correctly. This is a known false positive. Do not let it alarm you.
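The retry block in the ApplicationSet implies an exponential backoff between sync attempts. Assuming the delay is the base duration multiplied by the factor after each failed attempt and capped at maxDuration, the waits can be sketched like this. The helper names are illustrative, and the duration parser only handles the simple forms used in this article.

```python
def parse_seconds(text):
    """Convert a simple duration such as '30s' or '10m' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def retry_delays(limit, duration, factor, max_duration):
    """Delay before each retry: base * factor**attempt, capped at max_duration."""
    base = parse_seconds(duration)
    cap = parse_seconds(max_duration)
    return [min(base * factor ** attempt, cap) for attempt in range(limit)]


# Values from the syncPolicy.retry block.
print(retry_delays(limit=3, duration="30s", factor=2, max_duration="10m"))
```

With limit 3 the cap is never reached; it would only matter from the sixth retry onward.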
prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 50Gi
    # Auto-discover ALL ServiceMonitors and PodMonitors across all namespaces
    # Without these settings Prometheus only scrapes resources with matching Helm labels
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 10Gi
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname, region]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: "null" # Placeholder: replace with Slack/PagerDuty
    receivers:
      - name: "null"
grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
  persistence:
    enabled: true
    storageClassName: gp2
    size: 10Gi
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL # Pick up dashboards from all namespaces
  ingress:
    enabled: false # Overridden per-cluster via ApplicationSet parameter
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/ssl-redirect: "443"
      alb.ingress.kubernetes.io/certificate-arn: "" # Injected per-region
    hosts:
      - grafana.matthewoladipupo.dev
    path: /
    pathType: Prefix
kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
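A rough way to sanity-check the 50Gi volume against the 15d retention is the rule of thumb from the Prometheus storage documentation: disk space is approximately retention seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample being typical. The sketch below inverts that formula. The 80% headroom and the 30s scrape interval are assumptions of this example, not values taken from the chart.

```python
GIB = 1024 ** 3


def max_ingest_rate(volume_gib, retention_days, bytes_per_sample=2.0, headroom=0.8):
    """Samples per second that fit on the volume for the whole retention window.

    headroom is the fraction of the volume Prometheus is allowed to fill
    (an assumption of this sketch, not a chart setting).
    """
    usable_bytes = volume_gib * GIB * headroom
    retention_seconds = retention_days * 24 * 3600
    return usable_bytes / (retention_seconds * bytes_per_sample)


def max_active_series(samples_per_second, scrape_interval_seconds=30):
    """Each series contributes one sample per scrape interval."""
    return samples_per_second * scrape_interval_seconds


rate = max_ingest_rate(volume_gib=50, retention_days=15)
print(f"~{rate:,.0f} samples/s, ~{max_active_series(rate):,.0f} active series")
```

If the cluster's real ingestion rate is well above the printed figure, either the volume or the retention needs adjusting.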
Grafana Public Access
Grafana is only exposed publicly on myapp-production-use1. The reasons:
- Grafana uses an EBS ReadWriteOnce PVC: only one node can mount it at a time, making it inherently single-instance
- EBS data is AZ-local: running a second public Grafana in usw2 would show different historical data
- One public Grafana that federates data from all clusters is cleaner than four separate Grafana instances
Access URL: https://grafana.matthewoladipupo.dev
The Route53 A record points to the ALB provisioned by AWS LBC when the Ingress is applied.
Getting the Grafana Password
The Grafana admin credentials are stored in grafana-admin-secret in the monitoring namespace. Important caveat: Grafana only reads this secret on first database initialization. If the secret value changes after the pod has started, you must reset the password via grafana-cli:
kubectl exec -n monitoring <grafana-pod> -c grafana -- \
grafana-cli admin reset-admin-password 'YourNewPassword'
ServiceMonitor for the Application
By default, Prometheus only scrapes the cluster components provided by kube-prometheus-stack. To scrape your application, add a ServiceMonitor resource:
# apps/myapp/templates/servicemonitor.yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http # Named port on the Service
      path: /metrics
      interval: {{ .Values.serviceMonitor.interval | default "30s" }}
      scrapeTimeout: {{ .Values.serviceMonitor.scrapeTimeout | default "10s" }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
{{- end }}
The application's /metrics endpoint returns Prometheus exposition format:
# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/health",status_code="200"} 1423
myapp_http_requests_total{method="GET",route="/metrics",status_code="200"} 89
# HELP process_cpu_seconds_total Total user and system CPU time
...
The key setting that makes auto-discovery work: serviceMonitorSelectorNilUsesHelmValues: false in prometheus-values.yaml. Without this, Prometheus only scrapes ServiceMonitors that have the Helm release's labels — ignoring your application's ServiceMonitor.
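To show what Prometheus actually reads from the /metrics endpoint, here is a minimal parser for the exposition lines shown earlier. It is a simplified sketch: it skips comment lines and ignores optional timestamps and other corner cases of the format, and parse_samples is a hypothetical helper, not part of any Prometheus client library.

```python
import re

# name, optional {labels}, then a value; trailing timestamps are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

EXAMPLE = """\
# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/health",status_code="200"} 1423
myapp_http_requests_total{method="GET",route="/metrics",status_code="200"} 89
"""


def parse_samples(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape in this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels["route"], value)
```

Each distinct label combination becomes its own time series, which is why the two lines above are stored separately.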
PrometheusRule — Alert Rules
# infrastructure/monitoring/alert-rules/myapp-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
  labels:
    release: prometheus # Must match what Prometheus Operator is watching
spec:
  groups:
    - name: myapp.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
              /
              sum(rate(myapp_http_requests_total[5m]))
            ) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate in myapp ({{ $value | humanizePercentage }})"
            description: "More than 1% of requests are failing with 5xx errors for 5 minutes."
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total{namespace="myapp"}[15m]) * 60 * 15 > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted more than 2 times in 15 minutes."
        - alert: HighMemoryUsage
          expr: |
            (
              container_memory_working_set_bytes{namespace="myapp",container!=""}
              / container_spec_memory_limit_bytes{namespace="myapp",container!=""}
            ) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage above 90% in {{ $labels.pod }}"
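The arithmetic behind the HighErrorRate and PodCrashLooping expressions can be checked by hand. The sketch below approximates rate() as the counter increase divided by the window length; the real PromQL function also handles counter resets and extrapolates to the window edges, so treat the numbers as illustrative. All function names and sample values are made up for this example.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter over the window (no reset handling)."""
    return (last - first) / window_seconds


def error_ratio(errors_first, errors_last, total_first, total_last, window_seconds=300):
    """Fraction of requests that were 5xx over the window, or None with no traffic."""
    total_rate = per_second_rate(total_first, total_last, window_seconds)
    if total_rate == 0:
        return None  # PromQL yields NaN here, and NaN > 0.01 never fires
    return per_second_rate(errors_first, errors_last, window_seconds) / total_rate


def restarts_in_window(rate_per_second, window_seconds=15 * 60):
    """rate(...[15m]) * 60 * 15 converts a per-second rate back to a restart count."""
    return rate_per_second * window_seconds


ratio = error_ratio(errors_first=10, errors_last=40, total_first=1000, total_last=3000)
print(f"error ratio {ratio:.3f}, alert condition met: {ratio > 0.01}")
print(f"restarts in 15m: {restarts_in_window(3 / 900):.1f}")
```

The condition must then hold for the full for: 5m period before the alert moves from pending to firing.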
Alertmanager — Adding Slack Notifications
The current config uses a null receiver (alerts fire but go nowhere). To wire up Slack:
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: [alertname, region, cluster]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-critical
      routes:
        - match:
            severity: critical
          receiver: slack-critical
        - match:
            severity: warning
          receiver: slack-warning
    receivers:
      - name: slack-critical
        slack_configs:
          - channel: '#alerts-critical'
            # severity is not in group_by, so it is read from CommonLabels rather than GroupLabels
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Cluster:* {{ .GroupLabels.cluster }}
              {{ range .Alerts }}*Description:* {{ .Annotations.description }}{{ end }}
            send_resolved: true
      - name: slack-warning
        slack_configs:
          - channel: '#alerts-warning'
            text: '{{ .GroupLabels.alertname }}: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            send_resolved: true
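The routing tree above does two things: it picks a receiver from the first child route whose labels match, and it batches alerts that share the same group_by label values into one notification. The following sketch models that behaviour in plain Python under those assumptions; it is not Alertmanager code, and the alert label values are invented.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "region", "cluster")
ROUTES = [
    ({"severity": "critical"}, "slack-critical"),
    ({"severity": "warning"}, "slack-warning"),
]
DEFAULT_RECEIVER = "slack-critical"

# Hypothetical firing alerts, each represented by its label set.
ALERTS = [
    {"alertname": "HighErrorRate", "severity": "critical", "region": "us-east-1",
     "cluster": "myapp-production-use1", "pod": "myapp-a"},
    {"alertname": "HighErrorRate", "severity": "critical", "region": "us-east-1",
     "cluster": "myapp-production-use1", "pod": "myapp-b"},
    {"alertname": "HighMemoryUsage", "severity": "warning", "region": "us-east-1",
     "cluster": "myapp-production-use1", "pod": "myapp-a"},
]


def pick_receiver(labels):
    """First child route whose labels all match wins; otherwise the root receiver."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts):
    """Bucket alerts by receiver and by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        group_key = tuple(labels.get(key, "") for key in GROUP_BY)
        groups[(pick_receiver(labels), group_key)].append(labels)
    return dict(groups)


for (receiver, key), members in group_alerts(ALERTS).items():
    print(receiver, key, f"{len(members)} alert(s)")
```

Note that only the group_by labels identify a group, so labels such as pod or severity are not part of the group key.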
Fluent Bit — Log Shipping to CloudWatch
Fluent Bit runs as a DaemonSet — one pod per node — and reads all container logs from /var/log/containers/*.log.
IRSA for Fluent Bit
Fluent Bit needs IAM permissions to write to CloudWatch. The key lesson: use wildcard ARNs for both log groups AND log streams.
# _modules/fluent-bit-irsa/main.tf
resource "aws_iam_role_policy" "fluent_bit" {
  name = "fluent-bit-cloudwatch"
  role = aws_iam_role.fluent_bit.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "CloudWatchLogs"
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogStreams",
          "logs:DescribeLogGroups"
        ]
        Resource = [
          # Log group operations (CreateLogGroup, DescribeLogGroups)
          "arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*",
          # Log stream operations (CreateLogStream, PutLogEvents)
          # The :* suffix is REQUIRED for stream-level permissions
          "arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*:*"
        ]
      }
    ]
  })
}
Lesson learned: Using only log-group:/eks/* (without the :* suffix) grants permissions on the log group resource but NOT on log streams within it. CreateLogStream and PutLogEvents operate on the log stream resource, which requires the :* suffix. Without this, pods get AccessDeniedException on CreateLogStream even though CreateLogGroup succeeds.
Fluent Bit Helm Values
# infrastructure/logging/fluent-bit-values.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "{{irsaRoleArn}}" # Injected per-cluster
cloudWatch:
  region: "{{region}}"
  logGroupName: "/eks/{{cluster}}/$(kubernetes['namespace_name'])"
  logStreamName: "$(kubernetes['pod_name'])/$(kubernetes['container_name'])"
  autoCreateGroup: true
# Enrich log records with Kubernetes metadata
filter:
  kubernetes:
    Merge_Log: On
    Keep_Log: Off
    K8S-Logging.Parser: On
    K8S-Logging.Exclude: On
CloudWatch Log Groups Created
After Fluent Bit starts, these log groups appear in CloudWatch:
/eks/myapp-production-use1/myapp
/eks/myapp-production-use1/monitoring
/eks/myapp-production-use1/argocd
/eks/myapp-production-use1/kyverno
/eks/myapp-production-use1/falco
... (one per namespace, all clusters)
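The group and stream names follow directly from the logGroupName and logStreamName templates in the Helm values. As a small illustration of that mapping (not Fluent Bit's own templating code), the function below builds both names from a record's Kubernetes metadata. The pod name is a made-up example.

```python
def log_destination(cluster, kubernetes):
    """Build the CloudWatch log group and stream for one enriched log record."""
    group = f"/eks/{cluster}/{kubernetes['namespace_name']}"
    stream = f"{kubernetes['pod_name']}/{kubernetes['container_name']}"
    return group, stream


# Hypothetical metadata as attached by the kubernetes filter.
record_meta = {
    "namespace_name": "falco",
    "pod_name": "falco-abc12",
    "container_name": "falco",
}

group, stream = log_destination("myapp-production-use1", record_meta)
print(group, stream)
# Every group starts with /eks/, the prefix the IAM policy is scoped to.
print(group.startswith("/eks/"))
```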
Querying Logs with CloudWatch Insights
# Find all 5xx errors in the last hour
fields @timestamp, log
| filter kubernetes.namespace_name = "myapp"
| filter log like /5[0-9][0-9]/
| sort @timestamp desc
| limit 100
# Find Falco security alerts
fields @timestamp, output
| filter kubernetes.namespace_name = "falco"
| filter priority = "Warning" or priority = "Error"
| sort @timestamp desc
# Count errors by pod
filter log like /ERROR/
| stats count(*) as errors by kubernetes.pod_name
| sort errors desc
Grafana Dashboards
Pre-built Dashboards (from kube-prometheus-stack)
These are included automatically and show up in Grafana immediately after installation:
- Kubernetes / Compute Resources / Cluster — total CPU/memory across all nodes
- Kubernetes / Compute Resources / Namespace — resource breakdown per namespace
- Node Exporter / Nodes — per-node CPU, memory, disk, network I/O
- Alertmanager / Overview — alert firing/resolved history
Importing Community Dashboards
- Go to https://grafana.matthewoladipupo.dev
- Dashboards → Import
- Enter the dashboard ID from grafana.com:
  - 1860: Node Exporter Full (very detailed node metrics)
  - 13332: Kubernetes Pods (pod-level resource view)
  - 15757: ArgoCD (sync status, app health)
Application Dashboard
With the ServiceMonitor installed, Grafana can display your app metrics. Create a panel with:
# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m])) by (route)
# Error rate (percentage of 5xx)
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100
# 95th percentile latency (if using histogram metric)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
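For readers wondering how a percentile comes out of bucket counters, the sketch below reimplements the core idea of histogram_quantile: find the bucket that contains the target rank and interpolate linearly inside it. It is a simplified model of the documented behaviour (the lowest bucket is assumed to start at 0, and a rank that lands in the +Inf bucket returns the highest finite bound), and the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The last bucket must have an upper bound of math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank is beyond the highest finite bucket
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for request durations in seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
```

Because of the interpolation, the result is only as precise as the bucket boundaries, so buckets should be chosen around the latencies you care about.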
Summary
By the end of Part 9 you have:
- ✅ kube-prometheus-stack on 4 clusters (staging + production) with EBS persistent storage
- ✅ Grafana publicly accessible at https://grafana.matthewoladipupo.dev (production-use1 only)
- ✅ ServiceMonitor scraping myapp's /metrics endpoint every 30 seconds
- ✅ PrometheusRule alert rules for high error rate, crash looping, high memory
- ✅ Fluent Bit DaemonSet on all 6 clusters shipping logs to CloudWatch
- ✅ CloudWatch log groups per namespace per cluster
- ✅ CloudWatch Insights for ad-hoc log queries
Next: Part 10 — Resilience: Karpenter, HPA, Argo Rollouts, and Velero
Follow the series — next part publishes next Wednesday.
Live system: https://www.matthewoladipupo.dev/health
Runbook: Operations Guide
Source code: myapp-infra | myapp-gitops | myapp



