Matthew

Posted on May 13

Part 9: Observability — Prometheus, Grafana, Fluent Bit, and CloudWatch

#kubernetes #monitoring #devops #aws

Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS

Introduction

Observability answers the question: what is my system doing right now, and why?

The three pillars:

Metrics — numerical measurements over time (CPU%, request rate, error rate, latency)
Logs — structured event records from every container
Traces — request flows across services (not covered in this series, but Grafana Tempo is the natural next step)

This pipeline implements metrics with Prometheus + Grafana and logs with Fluent Bit → CloudWatch. Together they give you both real-time dashboards and historical log search without leaving the AWS ecosystem.

┌──────────────────────────────────────────────────────────────────────────┐
│  OBSERVABILITY ARCHITECTURE                                              │
│                                                                          │
│  Application Pods                                                        │
│  ├─ /metrics endpoint → ServiceMonitor → Prometheus scrape               │
│  └─ stdout/stderr → Fluent Bit DaemonSet → CloudWatch Logs               │
│                                                                          │
│  Infrastructure                                                          │
│  ├─ node-exporter (CPU, memory, disk, network per node) → Prometheus     │
│  └─ kube-state-metrics (pod state, deployment state) → Prometheus        │
│                                                                          │
│  Prometheus → Grafana (dashboards + alert rules)                         │
│  Prometheus → Alertmanager (notifications)                               │
│                                                                          │
│  Falco (security events) → stdout → Fluent Bit → CloudWatch              │
│  All containers → stdout → Fluent Bit → CloudWatch                       │
└──────────────────────────────────────────────────────────────────────────┘

kube-prometheus-stack

Rather than installing Prometheus, Grafana, and Alertmanager separately, we use the kube-prometheus-stack Helm chart. It bundles:

Prometheus Operator — manages Prometheus, Alertmanager, and PrometheusRule CRDs
Prometheus — the metrics database
Alertmanager — routes alerts to notification channels
Grafana — dashboards and visualization
kube-state-metrics — exposes Kubernetes object state as metrics
node-exporter — exposes node-level metrics (CPU, memory, disk, network)

We install it only on staging and production clusters (4 of 6) — dev clusters skip monitoring to reduce cost.

Installation via ArgoCD

# infrastructure/monitoring/applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster:        myapp-production-use1
            region:         us-east-1
            grafanaIngress: "true"
            certArn:        "arn:aws:acm:us-east-1:591120834781:certificate/9ab022c9-..."
          - cluster:        myapp-production-usw2
            region:         us-west-2
            grafanaIngress: "false"
            certArn:        ""
          - cluster:        myapp-staging-use1
            region:         us-east-1
            grafanaIngress: "false"
            certArn:        ""
          - cluster:        myapp-staging-usw2
            region:         us-west-2
            grafanaIngress: "false"
            certArn:        ""
  template:
    metadata:
      name: "prometheus-{{cluster}}"
    spec:
      project: production
      sources:
        - repoURL:        https://prometheus-community.github.io/helm-charts
          chart:          kube-prometheus-stack
          targetRevision: "61.9.0"
          helm:
            valueFiles:
              - $gitopsValues/infrastructure/monitoring/prometheus-values.yaml
            parameters:
              - name:  "grafana.ingress.enabled"
                value: "{{grafanaIngress}}"
              - name:  "grafana.ingress.annotations.alb\\.ingress\\.kubernetes\\.io/certificate-arn"
                value: "{{certArn}}"
        - repoURL:        https://github.com/MatthewDipo/myapp-gitops.git
          targetRevision: main
          ref: gitopsValues
      destination:
        name:      "{{cluster}}"
        namespace: monitoring
      syncPolicy:
        syncOptions: [CreateNamespace=true, ServerSideApply=true]
        retry:
          limit: 3
          backoff: { duration: 30s, maxDuration: 10m, factor: 2 }
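
To see what the list generator produces, here is a rough Python sketch of the `{{key}}` substitution. It only illustrates the expansion and is not how ArgoCD implements it. The `render` and `expand` helpers are hypothetical; the element values are copied from the manifest above, with the certificate ARN left out.

```python
import re

# Elements copied from the list generator above (certArn left out for brevity).
ELEMENTS = [
    {"cluster": "myapp-production-use1", "region": "us-east-1", "grafanaIngress": "true"},
    {"cluster": "myapp-production-usw2", "region": "us-west-2", "grafanaIngress": "false"},
    {"cluster": "myapp-staging-use1", "region": "us-east-1", "grafanaIngress": "false"},
    {"cluster": "myapp-staging-usw2", "region": "us-west-2", "grafanaIngress": "false"},
]


def render(template: str, element: dict) -> str:
    """Replace every {{key}} placeholder with the element's value (hypothetical helper)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: element[m.group(1)], template)


def expand(elements: list) -> list:
    """Return one (application name, grafana.ingress.enabled) pair per element."""
    return [
        (render("prometheus-{{cluster}}", e), render("{{grafanaIngress}}", e))
        for e in elements
    ]


if __name__ == "__main__":
    for name, ingress in expand(ELEMENTS):
        print(f"{name}: grafana.ingress.enabled={ingress}")
```

Four elements yield four Applications, and only the production-use1 one gets the Grafana Ingress switched on.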

Important install flag: Use --no-hooks --timeout 10m (without --wait). Pre-upgrade admission webhook Jobs consistently time out in ArgoCD, so the sync phase shows Failed even though the actual resources (Prometheus, Grafana, Alertmanager) deploy correctly. This is a known false positive, but check that the pods in the monitoring namespace are Running before you dismiss it.

prometheus-values.yaml

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 50Gi
    # Auto-discover ALL ServiceMonitors and PodMonitors across all namespaces
    # Without these settings Prometheus only scrapes resources with matching Helm labels
    podMonitorSelectorNilUsesHelmValues:     false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues:           false

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 10Gi
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname, region]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: "null"    # Placeholder — replace with Slack/PagerDuty
    receivers:
      - name: "null"

grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey:        admin-user
    passwordKey:    admin-password
  persistence:
    enabled:          true
    storageClassName: gp2
    size:             10Gi
  sidecar:
    dashboards:
      enabled:          true
      searchNamespace:  ALL   # Pick up dashboards from all namespaces
  ingress:
    enabled:          false   # Overridden per-cluster via ApplicationSet parameter
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme:        internet-facing
      alb.ingress.kubernetes.io/target-type:   ip
      # ssl-redirect only acts on an HTTP listener, so port 80 must be listed too
      alb.ingress.kubernetes.io/listen-ports:  '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/ssl-redirect:  "443"
      alb.ingress.kubernetes.io/certificate-arn: ""  # Injected per-region
    hosts:
      - grafana.matthewoladipupo.dev
    path: /
    pathType: Prefix

kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
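
As a sanity check on the 15d retention and 50Gi volume, the Prometheus storage documentation gives a rough sizing rule: disk needed ≈ retention seconds × ingested samples per second × bytes per sample, with samples averaging 1 to 2 bytes. The sketch below applies that rule. The 2 bytes per sample figure is an assumption at the pessimistic end of that range, and the estimate ignores WAL and compaction overhead.

```python
RETENTION_DAYS = 15        # prometheusSpec.retention in the values above
VOLUME_GIB = 50            # storage request in the values above
BYTES_PER_SAMPLE = 2.0     # assumption: upper end of the 1-2 bytes/sample guidance
SECONDS_PER_DAY = 86_400


def needed_bytes(retention_days: float, samples_per_second: float,
                 bytes_per_sample: float = BYTES_PER_SAMPLE) -> float:
    """Approximate disk usage for a given ingestion rate."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def max_samples_per_second(volume_gib: float, retention_days: float,
                           bytes_per_sample: float = BYTES_PER_SAMPLE) -> float:
    """Highest sustained ingestion rate that still fits in the volume."""
    volume_bytes = volume_gib * 1024 ** 3
    return volume_bytes / (retention_days * SECONDS_PER_DAY * bytes_per_sample)


if __name__ == "__main__":
    rate = max_samples_per_second(VOLUME_GIB, RETENTION_DAYS)
    print(f"50Gi over 15d fits roughly {rate:,.0f} samples/s")
```

Under these assumptions the volume holds a little over 20,000 samples per second. Compare that against the `prometheus_tsdb_head_samples_appended_total` rate on your own cluster before trusting the number.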

Grafana Public Access

Grafana is only exposed publicly on myapp-production-use1. The reasons:

Grafana uses an EBS ReadWriteOnce PVC — only one node can mount it at a time, making it inherently single-instance
EBS data is AZ-local, so a second public Grafana in usw2 could not share this volume and would show different historical data
One public Grafana that federates data from all clusters is cleaner than four separate Grafana instances

Access URL: https://grafana.matthewoladipupo.dev

The Route53 A record points to the ALB that the AWS Load Balancer Controller (LBC) provisions when the Ingress is applied.

Getting the Grafana Password

The Grafana admin credentials are stored in grafana-admin-secret in the monitoring namespace. Important caveat: Grafana only reads this secret on first database initialization. If the secret value changes after the pod has started, you must reset the password via grafana-cli:

kubectl exec -n monitoring <grafana-pod> -c grafana -- \
  grafana-cli admin reset-admin-password 'YourNewPassword'
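
To read the current credentials rather than reset them, decode the Secret. Kubernetes stores Secret values base64-encoded under `data`, so a small helper is enough. This is a minimal sketch: the function name is hypothetical, and the sample payload is made up to stand in for the output of `kubectl get secret grafana-admin-secret -n monitoring -o json`.

```python
import base64
import json


def decode_grafana_credentials(secret_json: str,
                               user_key: str = "admin-user",
                               password_key: str = "admin-password") -> tuple:
    """Decode the two base64-encoded fields of a Kubernetes Secret manifest."""
    data = json.loads(secret_json)["data"]
    user = base64.b64decode(data[user_key]).decode("utf-8")
    password = base64.b64decode(data[password_key]).decode("utf-8")
    return user, password


if __name__ == "__main__":
    # Made-up sample; in practice, feed in the JSON printed by kubectl.
    sample = json.dumps({
        "data": {
            "admin-user": base64.b64encode(b"admin").decode(),
            "admin-password": base64.b64encode(b"example-password").decode(),
        }
    })
    print(decode_grafana_credentials(sample))
```

The key names default to the `userKey` and `passwordKey` values set in prometheus-values.yaml.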

ServiceMonitor for the Application

By default, Prometheus only scrapes the cluster components provided by kube-prometheus-stack. To scrape your application, add a ServiceMonitor (a custom resource defined by the Prometheus Operator):

# apps/myapp/templates/servicemonitor.yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http          # Named port on the Service
      path: /metrics
      interval: {{ .Values.serviceMonitor.interval | default "30s" }}
      scrapeTimeout: {{ .Values.serviceMonitor.scrapeTimeout | default "10s" }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
{{- end }}

The application's /metrics endpoint returns Prometheus exposition format:

# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/health",status_code="200"} 1423
myapp_http_requests_total{method="GET",route="/metrics",status_code="200"} 89
# HELP process_cpu_seconds_total Total user and system CPU time
...

The key setting that makes auto-discovery work: serviceMonitorSelectorNilUsesHelmValues: false in prometheus-values.yaml. Without this, Prometheus only scrapes ServiceMonitors that have the Helm release's labels — ignoring your application's ServiceMonitor.
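
To make the exposition format shown above concrete, this sketch renders a counter family by hand. A real application would normally use a Prometheus client library instead. `render_counter` is a hypothetical helper, and it assumes label values contain no quotes, backslashes, or newlines, so it does no escaping.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "myapp_http_requests_total",
        "Total HTTP requests",
        {
            (("method", "GET"), ("route", "/health"), ("status_code", "200")): 1423,
            (("method", "GET"), ("route", "/metrics"), ("status_code", "200")): 89,
        },
    ), end="")
```

Each distinct label combination is its own time series, which is why the alert rules and dashboard queries can later filter on `status_code` or group by `route`.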

PrometheusRule — Alert Rules

# infrastructure/monitoring/alert-rules/myapp-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
  labels:
    release: prometheus   # Only needed if ruleSelectorNilUsesHelmValues stays at its default (true); it must then equal the Helm release name
spec:
  groups:
    - name: myapp.rules
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
              /
              sum(rate(myapp_http_requests_total[5m]))
            ) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate in myapp ({{ $value | humanizePercentage }})"
            description: "More than 1% of requests are failing with 5xx errors for 5 minutes."

        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total{namespace="myapp"}[15m]) * 60 * 15 > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted more than 2 times in 15 minutes."

        - alert: HighMemoryUsage
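          # The "> 0" drops containers without a memory limit (limit reported as 0),
          # which would otherwise divide to +Inf and fire the alert permanently
          expr: |
            (
              container_memory_working_set_bytes{namespace="myapp",container!=""}
              / (container_spec_memory_limit_bytes{namespace="myapp",container!=""} > 0)
            ) > 0.9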
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage above 90% in {{ $labels.pod }}"

Alertmanager — Adding Slack Notifications

The current config uses a null receiver (alerts fire but go nowhere). To wire up Slack:

alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: [alertname, region, cluster]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-critical
      routes:
        # "matchers" replaces the deprecated "match" field in current Alertmanager releases
        - matchers:
            - severity="critical"
          receiver: slack-critical
        - matchers:
            - severity="warning"
          receiver: slack-warning
    receivers:
      - name: slack-critical
        slack_configs:
          - channel: '#alerts-critical'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Cluster:* {{ .GroupLabels.cluster }}
              {{ range .Alerts }}*Description:* {{ .Annotations.description }}{{ end }}
            send_resolved: true
      - name: slack-warning
        slack_configs:
          - channel: '#alerts-warning'
            text: '{{ .GroupLabels.alertname }}: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            send_resolved: true
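
The routing tree above is small enough to model in a few lines. This is a simplified sketch of the behaviour, assuming the default `continue: false`, so the first matching child route wins and unmatched alerts fall through to the root receiver. The function names and sample alerts are hypothetical.

```python
# Child routes from the config above, in order: (labels to match, receiver).
ROUTES = [
    ({"severity": "critical"}, "slack-critical"),
    ({"severity": "warning"}, "slack-warning"),
]
DEFAULT_RECEIVER = "slack-critical"          # receiver on the root route
GROUP_BY = ("alertname", "region", "cluster")


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_key(labels: dict) -> tuple:
    """Alerts sharing this key are batched into a single notification."""
    return tuple(labels.get(name, "") for name in GROUP_BY)


if __name__ == "__main__":
    alert = {"alertname": "PodCrashLooping", "severity": "warning",
             "cluster": "myapp-production-use1", "pod": "myapp-abc"}
    print(pick_receiver(alert), group_key(alert))
```

Because `pod` is not in `group_by`, two crash-looping pods in the same cluster end up in one Slack message rather than two. Note also that an alert with no `severity` label lands in the critical channel via the root route.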

Fluent Bit — Log Shipping to CloudWatch

Fluent Bit runs as a DaemonSet — one pod per node — and reads all container logs from /var/log/containers/*.log.

IRSA for Fluent Bit

Fluent Bit needs IAM permissions to write to CloudWatch. The key lesson: use wildcard ARNs for both log groups AND log streams.

# _modules/fluent-bit-irsa/main.tf

resource "aws_iam_role_policy" "fluent_bit" {
  name = "fluent-bit-cloudwatch"
  role = aws_iam_role.fluent_bit.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "CloudWatchLogs"
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogStreams",
          "logs:DescribeLogGroups"
        ]
        Resource = [
          # Log group operations (CreateLogGroup, DescribeLogGroups)
          "arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*",
          # Log stream operations (CreateLogStream, PutLogEvents)
          # The :* suffix is REQUIRED for stream-level permissions
          "arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*:*"
        ]
      }
    ]
  })
}

Lesson learned: Using only log-group:/eks/* (without the :* suffix) grants permissions on the log group resource but NOT on log streams within it. CreateLogStream and PutLogEvents operate on the log stream resource, which requires the :* suffix. Without this, pods get AccessDeniedException on CreateLogStream even though CreateLogGroup succeeds.

Fluent Bit Helm Values

# infrastructure/logging/fluent-bit-values.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "{{irsaRoleArn}}"   # Injected per-cluster

cloudWatch:
  region: "{{region}}"
  logGroupName: "/eks/{{cluster}}/$(kubernetes['namespace_name'])"
  logStreamName: "$(kubernetes['pod_name'])/$(kubernetes['container_name'])"
  autoCreateGroup: true

# Enrich log records with Kubernetes metadata
filter:
  kubernetes:
    Merge_Log: On
    Keep_Log: Off
    K8S-Logging.Parser: On
    K8S-Logging.Exclude: On

CloudWatch Log Groups Created

After Fluent Bit starts, these log groups appear in CloudWatch:

/eks/myapp-production-use1/myapp
/eks/myapp-production-use1/monitoring
/eks/myapp-production-use1/argocd
/eks/myapp-production-use1/kyverno
/eks/myapp-production-use1/falco
... (one per namespace, all clusters)
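
The group and stream names follow directly from the two templates in the Helm values. As a sketch of that mapping (the function name and the sample pod metadata are made up for illustration):

```python
def log_destination(cluster: str, kubernetes: dict) -> tuple:
    """Return (log group, log stream) following the naming templates above."""
    group = f"/eks/{cluster}/{kubernetes['namespace_name']}"
    stream = f"{kubernetes['pod_name']}/{kubernetes['container_name']}"
    return group, stream


if __name__ == "__main__":
    # Hypothetical metadata as attached by the Kubernetes filter.
    record_meta = {
        "namespace_name": "myapp",
        "pod_name": "myapp-7d9f8c6b5-x2k4q",
        "container_name": "myapp",
    }
    print(log_destination("myapp-production-use1", record_meta))
```

One group per namespace keeps retention and access policies coarse-grained, while one stream per container keeps interleaved output from different pods apart.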

Querying Logs with CloudWatch Insights

# Find log lines that look like 5xx errors (set the query time range to the last hour)
fields @timestamp, log
| filter kubernetes.namespace_name = "myapp"
| filter log like /5[0-9][0-9]/
| sort @timestamp desc
| limit 100

# Find Falco security alerts
fields @timestamp, output
| filter kubernetes.namespace_name = "falco"
| filter priority = "Warning" or priority = "Error"
| sort @timestamp desc

# Count errors by pod (filter first: the log field no longer exists after stats)
filter log like /ERROR/
| stats count(*) as errors by kubernetes.pod_name
| sort errors desc
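
The intent of the last query is to keep only ERROR lines, count them per pod, and sort the busiest pod first. The sketch below expresses that over in-memory records. It is only a model of the logic, not a CloudWatch client; the record shape is assumed to mirror the `log` and `kubernetes.pod_name` fields used in the queries, and the sample data is made up.

```python
from collections import Counter


def count_errors_by_pod(records: list) -> list:
    """Filter records containing ERROR, then count per pod, highest count first."""
    matching = (r for r in records if "ERROR" in r.get("log", ""))
    counts = Counter(r["kubernetes"]["pod_name"] for r in matching)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"log": "ERROR db timeout", "kubernetes": {"pod_name": "myapp-a"}},
        {"log": "GET /health 200", "kubernetes": {"pod_name": "myapp-a"}},
        {"log": "ERROR db timeout", "kubernetes": {"pod_name": "myapp-b"}},
        {"log": "ERROR upstream 502", "kubernetes": {"pod_name": "myapp-b"}},
    ]
    print(count_errors_by_pod(sample))
```

The order matters: the filter has to run while the raw `log` field is still available, because after aggregation only the pod name and the count remain.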

Grafana Dashboards

Pre-built Dashboards (from kube-prometheus-stack)

These are included automatically and show up in Grafana immediately after installation:

Kubernetes / Compute Resources / Cluster — total CPU/memory across all nodes
Kubernetes / Compute Resources / Namespace — resource breakdown per namespace
Node Exporter / Nodes — per-node CPU, memory, disk, network I/O
Alertmanager / Overview — alert firing/resolved history

Importing Community Dashboards

Go to https://grafana.matthewoladipupo.dev
Dashboards → Import
Enter the dashboard ID from grafana.com:
- 1860 — Node Exporter Full (very detailed node metrics)
- 13332 — Kubernetes Pods (pod-level resource view)
- 14584 — ArgoCD (sync status, app health)

Application Dashboard

With the ServiceMonitor installed, Grafana can display your app metrics. Create a panel with:

# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m])) by (route)

# Error rate (percentage of 5xx)
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100

# 95th percentile latency across all pods (if using histogram metric)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
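
It helps to know how `histogram_quantile` arrives at its answer: it finds the bucket in which the requested rank falls and interpolates linearly inside it. The sketch below is a simplified re-implementation for a single series, assuming cumulative buckets keyed by their `le` upper bound, a lowest bucket that starts at 0, and a `+Inf` bucket. The bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: dict) -> float:
    """Estimate the q-quantile from cumulative buckets {upper_bound: count}."""
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, bound in enumerate(bounds) if buckets[bound] >= rank)
    upper = bounds[index]
    if upper == math.inf:
        # Rank falls beyond the last finite bucket: report its upper bound.
        return bounds[-2] if len(bounds) > 1 else math.nan
    lower = bounds[index - 1] if index > 0 else 0.0
    below = buckets[bounds[index - 1]] if index > 0 else 0
    in_bucket = buckets[upper] - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    observed = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
    print(histogram_quantile(0.95, observed))
```

With these sample buckets the 95th request falls halfway through the 0.5 to 1.0 second bucket, so the estimate is 0.75 s. The accuracy of the result is limited by the bucket boundaries the application chose.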

Summary

By the end of Part 9 you have:

✅ kube-prometheus-stack on 4 clusters (staging + production) with EBS persistent storage
✅ Grafana publicly accessible at https://grafana.matthewoladipupo.dev (production-use1 only)
✅ ServiceMonitor scraping myapp's /metrics endpoint every 30 seconds
✅ PrometheusRule alert rules for high error rate, crash looping, high memory
✅ Fluent Bit DaemonSet on all 6 clusters shipping logs to CloudWatch
✅ CloudWatch log groups per namespace per cluster
✅ CloudWatch Insights for ad-hoc log queries

Screenshot Placeholders

SCREENSHOT: Grafana — Kubernetes cluster overview dashboard showing node CPU and memory

SCREENSHOT: Grafana — Node Exporter dashboard showing per-node metrics

SCREENSHOT: AWS CloudWatch — Log groups showing /eks/ hierarchy from Fluent Bit

SCREENSHOT: CloudWatch Insights — query result showing application logs