
Nikitas Gargoulakis for AWS Community Builders

Originally published at allaboutcloud.co.uk

Complete Guide to AWS Monitoring and Observability for DevOps Teams

In today’s cloud-first world, many organisations find themselves wrestling with a common challenge: monitoring fragmentation. If you’re migrating to AWS from on-premises infrastructure, you’ve likely accumulated a collection of monitoring tools: Grafana here, Zabbix there, maybe some Prometheus, Scrutinizer, and a dash of CloudWatch. Each tool serves a purpose, but together they create operational chaos.

This article walks through a real-world architecture for consolidating multiple monitoring tools into a unified, AWS-native observability platform. Whether you’re monitoring EKS clusters, Active Directory, firewalls, or a hybrid infrastructure, this guide will help you build a single pane of glass for your entire estate.

The Problem: Death by a Thousand Dashboards

Let’s paint a familiar picture:

  • 3 AM : Your phone rings. Production is down.
  • 3:02 AM : You check CloudWatch. Nothing obvious.
  • 3:05 AM : Switch to Grafana. Some weird metrics.
  • 3:10 AM : Check Zabbix. Server CPU is spiking.
  • 3:15 AM : But why? Check the logs... wait, where are those logs again?
  • 3:25 AM : Finally correlate the issue across four different systems.
  • MTTR : 45 minutes (30 of which were spent context-switching between tools)

Sound familiar? You’re not alone.

The Core Requirements

When consolidating monitoring infrastructure, we need to solve for:

  1. Unified Visibility : One place to see everything
  2. Proactive Detection : Catch issues before users do
  3. Fast Root Cause Analysis : Correlate events across layers
  4. Compliance Ready : Query data for audits without panic
  5. Operational Efficiency : Stop paying for five tools when one will do

The Solution: AWS-Native Observability Stack

After extensive research and real-world implementation, here’s the architecture that actually works:


┌─────────────────────────────────────────────────────────┐
│                   Visualization Layer                   │
│  CloudWatch Dashboards | Managed Grafana | QuickSight   │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             Analytics & Investigation Layer             │
│    CloudWatch Insights | Athena | OpenSearch Service    │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│            Centralized Data Lake (Optional)             │
│                AWS Security Lake (OCSF)                 │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             Monitoring & Security Services              │
│     CloudWatch | Security Hub | GuardDuty | Config      │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                   Your Infrastructure                   │
│     EKS | EC2 | Lambda | RDS | On-Prem (Logs Only)      │
└─────────────────────────────────────────────────────────┘


The Core AWS Services

Let’s break down each component:


1. Amazon CloudWatch: Your Foundation

CloudWatch is unavoidable when working with AWS. Instead of fighting it, embrace it as your foundation.

What You Get:

  • Metrics : CPU, memory, disk, network, custom application metrics
  • Logs : Centralized log aggregation with retention policies
  • Alarms : Threshold-based and anomaly detection alerting
  • Dashboards : Pre-built and custom operational views
  • Insights : Purpose-built query language for log analysis (Logs Insights)

Real-World Setup:


{
  "agent": {
    "metrics_collection_interval": 60
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application/*.log",
            "log_group_name": "/aws/application/myapp",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          {"name": "cpu_usage_idle", "unit": "Percent"}
        ]
      },
      "mem": {
        "measurement": [
          {"name": "mem_used_percent", "unit": "Percent"}
        ]
      }
    }
  }
}

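To apply this on an EC2 instance, point the agent at the file and restart it. A minimal sketch, assuming the JSON above was saved as /opt/aws/amazon-cloudwatch-agent/etc/config.json and the instance role includes CloudWatchAgentServerPolicy:

# Load the config and (re)start the CloudWatch agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json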

2. Container Insights for EKS

If you’re running Kubernetes on AWS, Container Insights is a game-changer.

Deployment:


# Enable control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator"],"enabled":true}]}'

# Deploy FluentBit DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml


What You See:

  • Cluster-level metrics (CPU, memory, network)
  • Namespace and pod-level breakdowns
  • Node performance and capacity
  • Application logs automatically collected from stdout/stderr

This replaces your standalone Prometheus + Grafana setup for most use cases.
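
Once the metrics and logs are flowing, you can interrogate them without leaving CloudWatch. A sketch of a Logs Insights query against the Container Insights performance log group, assuming the standard log group naming for a cluster called my-cluster (adjust names, region, and the GNU date syntax to your environment):

# Top 10 pods by average CPU over the last hour
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/performance \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields PodName, pod_cpu_utilization | filter Type = "Pod" | stats avg(pod_cpu_utilization) as avg_cpu by PodName | sort avg_cpu desc | limit 10'

# Retrieve results once the query completes
aws logs get-query-results --query-id <query-id-from-the-command-above>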


3. AWS Security Hub: Your Security Command Center

Think of Security Hub as your security findings aggregator. It’s like having a security operations assistant that never sleeps.

What It Aggregates:

  • GuardDuty : Machine learning-based threat detection
  • AWS Config : Configuration compliance
  • IAM Access Analyzer : Permission issues
  • Macie : Sensitive data discovery
  • Inspector : Vulnerability scanning

Compliance Made Easy:


# Enable Security Hub with CIS AWS Foundations Benchmark
aws securityhub enable-security-hub \
  --enable-default-standards

# Get compliance summary
aws securityhub get-findings \
  --filters '{"ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}]}'


4. Amazon OpenSearch: Your SIEM Replacement

Replacing Microsoft Sentinel? OpenSearch Service is your answer.

Why OpenSearch Over Sentinel?

If your logs already live in AWS, OpenSearch keeps the analytics next to the data: it integrates with CloudWatch Logs, Kinesis Data Firehose, and Security Lake, so you avoid shipping every gigabyte out to another vendor’s cloud. One feature worth leaning on is OpenSearch’s built-in anomaly detection. It’s surprisingly good at catching unusual patterns you’d miss manually.
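
Anomaly detectors are configured through the plugin’s REST API. A sketch only: the index name, field, and interval below are illustrative, and authentication depends on how your domain is secured (SigV4 or a master user):

# Create a detector that watches error volume in an app-logs index
curl -XPOST 'https://your-domain.region.es.amazonaws.com/_plugins/_anomaly_detection/detectors' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "error-rate-detector",
    "description": "Flags unusual spikes in application errors",
    "time_field": "@timestamp",
    "indices": ["app-logs-*"],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
    "feature_attributes": [
      {
        "feature_name": "error_count",
        "feature_enabled": true,
        "aggregation_query": {"error_count": {"value_count": {"field": "log_level"}}}
      }
    ]
  }'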


5. AWS Security Lake: The Long-Term Play

Here’s where things get interesting. Security Lake is AWS’s answer to the question: “Where do I store petabytes of security data without going bankrupt?”

The OCSF Advantage

Security Lake automatically normalizes logs to the Open Cybersecurity Schema Framework (OCSF). This means:

  • Standardized queries across all log sources
  • Multi-cloud ready (Azure, GCP logs can be normalized too)
  • Future-proof (vendor-agnostic format)

When to Use Security Lake:

YES if you need:

  • More than 1 year of log retention
  • Compliance with strict audit requirements
  • Multi-cloud strategy
  • Cost-effective long-term storage (S3 is cheap!)

NO if you need:

  • Real-time alerting (use CloudWatch + OpenSearch instead)
  • Simple single-account setup
  • Quick implementation (<4 weeks)

Use Security Lake for retention, OpenSearch for hot analytics (last 30 days).
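
Because everything lands in OCSF-formatted Parquet on S3, audit questions become plain SQL in Athena. A sketch only: Security Lake generates its own Glue database and table names per region and source, so check your Glue catalog for the real names before running anything like this:

# Example: CloudTrail management activity for a given IAM user
aws athena start-query-execution \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query-string "SELECT time, api.operation, actor.user.name, src_endpoint.ip \
    FROM amazon_security_lake_glue_db_eu_west_2.amazon_security_lake_table_eu_west_2_cloud_trail_mgmt_2_0 \
    WHERE actor.user.name = 'suspicious-user' \
    LIMIT 100"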

6. The On-Premises Challenge

Let’s address the elephant in the room: on-premises monitoring in a cloud-native world.

What’s Realistic:

You CAN:

  • Forward logs via CloudWatch Agent
  • Send syslogs via Kinesis Firehose
  • Store and search on-prem logs in AWS
  • Create basic alerts on log patterns

You CANNOT (easily):

  • Get real-time metrics dashboards
  • Automate remediation for on-prem resources
  • Achieve full observability parity with AWS resources

The Pragmatic Approach:


# On-premises server → CloudWatch Logs
# Install agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

# Configure to send logs only (no metrics)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m onPremise \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

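The -c file:... flag above expects a config that, for on-prem servers, contains a logs section only. A minimal sketch (the file path, log group name, and stream name are placeholders; on-prem agents also need credentials via an IAM user profile or SSM hybrid activation):

# Minimal logs-only agent config for an on-prem server
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<'EOF'
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/onprem/syslog",
            "log_stream_name": "{hostname}",
            "retention_in_days": 30
          }
        ]
      }
    }
  }
}
EOF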

For true on-premises monitoring, you might need to keep Zabbix or Prometheus for a while.

Architecture Decision: Two Approaches

Approach 1: With Security Lake (Compliance-First)

Best For : Healthcare, finance, government, or anyone with >1 year log retention requirements


AWS Services → Security Lake (S3/OCSF) → Athena (SQL queries)
                    ↓
              OpenSearch (Last 30 days hot analytics)
                    ↓
              CloudWatch Dashboards + Managed Grafana


Pros :

  • Cost-effective long-term retention
  • OCSF standardization
  • Multi-cloud ready
  • Compliance-friendly

Cons :

  • More complex setup
  • Longer implementation (16-20 weeks)
  • Requires OCSF knowledge

Approach 2: Direct CloudWatch/OpenSearch (Speed-First)

Best For : Startups, lower compliance requirements, quick wins


AWS Services → CloudWatch Logs → OpenSearch (direct)
                    ↓
              CloudWatch Dashboards + Managed Grafana
                    ↓
              S3 (archived logs via export)


Pros :

  • Faster implementation
  • Simpler architecture
  • Real-time everything
  • Lower learning curve

Cons :

  • Higher CloudWatch Logs costs at scale
  • No OCSF normalization
  • OpenSearch storage costs

Real-World Implementation: Step-by-Step

Let’s build this thing. Here’s the actual deployment sequence:

Week 1-2: Foundation


# 1. Enable AWS Organizations (if not already)
aws organizations create-organization

# 2. Enable CloudTrail (all regions, all accounts)
aws cloudtrail create-trail \
  --name organization-trail \
  --s3-bucket-name my-cloudtrail-bucket \
  --is-organization-trail \
  --is-multi-region-trail

# 3. Enable GuardDuty
aws guardduty create-detector --enable

# 4. Enable Security Hub
aws securityhub enable-security-hub

# 5. Enable AWS Config
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::ACCOUNT:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig
aws configservice start-configuration-recorder \
  --configuration-recorder-name default


Week 3-4: EKS Monitoring


# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush 5
        Log_Level info

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Parser docker
        Tag kube.*

    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Merge_Log On

    [OUTPUT]
        Name cloudwatch_logs
        Match kube.*
        region us-east-1
        log_group_name /aws/eks/my-cluster
        log_stream_prefix app-
        auto_create_group true



kubectl apply -f fluent-bit-config.yaml


Week 5-6: OpenSearch SIEM


# Create OpenSearch domain
aws opensearch create-domain \
  --domain-name security-analytics \
  --engine-version "OpenSearch_2.11" \
  --cluster-config InstanceType=r6g.large.search,InstanceCount=3 \
  --ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=100 \
  --encryption-at-rest-options Enabled=true \
  --node-to-node-encryption-options Enabled=true \
  --advanced-security-options Enabled=true,InternalUserDatabaseEnabled=false

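To get existing CloudWatch log groups flowing into the domain, the usual pattern is a subscription filter that streams events to a delivery function (the CloudWatch console’s “stream to OpenSearch” wizard generates one, or you can write your own). A sketch, assuming a delivery Lambda called LogsToOpenSearch already exists and allows invocation by CloudWatch Logs:

# Stream EKS application logs into the OpenSearch domain
aws logs put-subscription-filter \
  --log-group-name /aws/eks/my-cluster \
  --filter-name opensearch-delivery \
  --filter-pattern "" \
  --destination-arn arn:aws:lambda:us-east-1:ACCOUNT:function:LogsToOpenSearch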

Week 7-8: Dashboards and Alerts


// cloudwatch-dashboard.json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU Overview"
      }
    },
    {
      "type": "log",
      "properties": {
        "query": "SOURCE '/aws/eks/my-cluster' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "title": "Recent Errors"
      }
    }
  ]
}



aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body file://cloudwatch-dashboard.json

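Dashboards cover the “see it” half of this phase; alarms cover the “get paged” half. A sketch of a basic CPU alarm wired to an SNS topic (the instance ID and topic ARN are placeholders):

# Page when average CPU stays above 80% for 15 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-ec2-high-cpu" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:ops-alerts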

Automated Incident Response

Here’s where it gets interesting. Let’s automate security responses:


# lambda/security_response.py
import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')

def lambda_handler(event, context):
    """
    Responds to GuardDuty findings automatically
    """
    finding = event['detail']
    finding_type = finding['type']

    # SSH Brute Force detected
    if 'SSHBruteForce' in finding_type:
        instance_id = finding['resource']['instanceDetails']['instanceId']

        # Quarantine instance
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            Groups=['sg-quarantine'] # Pre-created quarantine security group
        )

        # Notify team
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:security-alerts',
            Subject=f'CRITICAL: Instance {instance_id} Quarantined',
            Message=f'Detected SSH brute force attack. Instance automatically isolated.\n\nFinding: {finding}'
        )

        return {'status': 'quarantined', 'instance': instance_id}

    # No automated action for other finding types; just acknowledge
    return {'status': 'no_action', 'finding_type': finding_type}


EventBridge Rule :


{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [7, 8, 8.9],
    "type": ["UnauthorizedAccess:EC2/SSHBruteForce"]
  }
}


Result : Threat detected → Instance isolated → Team notified. All in <30 seconds.
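
Wiring that pattern to the Lambda is two EventBridge calls plus a permission so EventBridge can invoke the function. A sketch, with the rule name and ARNs as placeholders and the pattern above saved as guardduty-pattern.json:

# Create the rule and point it at the response function
aws events put-rule \
  --name guardduty-ssh-bruteforce \
  --event-pattern file://guardduty-pattern.json

aws events put-targets \
  --rule guardduty-ssh-bruteforce \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:ACCOUNT:function:security-response"

# Allow EventBridge to invoke the Lambda
aws lambda add-permission \
  --function-name security-response \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:ACCOUNT:rule/guardduty-ssh-bruteforce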

Cost Optimization Tips

Let’s talk money. Here’s how to keep costs reasonable:

1. CloudWatch Logs: The Cost That Sneaks Up on You


# Set appropriate retention periods
import boto3

logs = boto3.client('logs')

# Development logs: 7 days
logs.put_retention_policy(
    logGroupName='/aws/lambda/dev-functions',
    retentionInDays=7
)

# Production logs: 30 days
logs.put_retention_policy(
    logGroupName='/aws/lambda/prod-functions',
    retentionInDays=30
)

# Compliance logs: keep 90 days hot, export to S3 for long-term retention
logs.put_retention_policy(
    logGroupName='/aws/cloudtrail',
    retentionInDays=90
)


2. Use Log Sampling

Not every log line needs immediate indexing:


# Sample 10% of high-volume logs
import random

def lambda_handler(event, context):
    if random.random() < 0.1:  # 10% sampling
        # Send this event to OpenSearch for hot, searchable analytics
        pass

    # Always send everything to S3 (cheap long-term storage)


3. OpenSearch Reserved Instances


# Save 30-40% with 1-year reserved capacity
aws opensearch purchase-reserved-instance-offering \
  --reserved-instance-offering-id offering-id \
  --instance-count 3


4. S3 Intelligent-Tiering


# Automatic cost optimization for Security Lake
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket security-lake-bucket \
  --id intelligent-tiering \
  --intelligent-tiering-configuration '{
    "Id": "intelligent-tiering",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'


Migration Strategy: The Practical Path

Don’t try to do everything at once. Here’s the battle-tested sequence:

Phase 1: AWS Resources

  • Start with EKS (highest ROI)
  • Add EC2 instances
  • Enable RDS Enhanced Monitoring (see the sketch below)
  • Configure Lambda logging

Win : 60% of your monitoring consolidated
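
Most of these steps are one-liners. Enhanced Monitoring, for example, is a single modify call, assuming a monitoring role already exists (the identifiers below are placeholders):

# Turn on 60-second Enhanced Monitoring for an RDS instance
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::ACCOUNT:role/rds-monitoring-role \
  --apply-immediately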

Phase 2: Security

  • Enable Security Hub
  • Deploy GuardDuty
  • Set up OpenSearch SIEM
  • Migrate from Sentinel

Win : Security team has single console

Phase 3: Dashboards

  • Build CloudWatch operational dashboards
  • Deploy Managed Grafana
  • Recreate critical legacy dashboards
  • Train operations team

Win : Ops team stops using old tools

Phase 4: On-Premises

  • Deploy CloudWatch Agent to servers
  • Configure syslog forwarding
  • Archive on-prem logs in S3

Phase 5: Decommission

  • Parallel run validation (2 weeks)
  • Export historical data
  • Turn off Zabbix, Prometheus
  • Reclaim licenses and infrastructure

Common Issues (And How to Avoid Them)

Issue #1: CloudWatch Logs Cost Explosion

Problem : Someone enables debug logging in production

Solution :


# Implement log sampling and filtering at source
import logging
import watchtower

# Only send WARNING and above to CloudWatch
handler = watchtower.CloudWatchLogHandler(log_group='/aws/app')
handler.setLevel(logging.WARNING)

logger = logging.getLogger(__name__)
logger.addHandler(handler)


Issue #2: Alert Fatigue

Problem : 500 alerts per day, all marked “critical”

Solution :


# Implement alert prioritization
def calculate_severity(metric_value, threshold):
    if metric_value > threshold * 1.5:
        return 'CRITICAL' # Page on-call
    elif metric_value > threshold * 1.2:
        return 'WARNING' # Slack notification
    else:
        return 'INFO' # Log only


Issue #3: The “We’ll Monitor Everything” Trap

Problem : Monitoring 10,000 metrics per instance

Solution : Start with the Golden Signals :

  • Latency : How long requests take
  • Traffic : Request volume
  • Errors : Failure rate
  • Saturation : Resource utilization

# Focused metric collection
CRITICAL_METRICS = [
    'CPUUtilization',
    'MemoryUtilization',
    'NetworkIn',
    'NetworkOut',
    'DiskReadOps',
    'DiskWriteOps'
]


Issue #4: Forgetting About Cardinality

Problem : OpenSearch cluster dies from high-cardinality fields

Solution :


# Don't index user IDs, session IDs, or timestamps as keywords!
# - Avoid "keyword" mappings for high-cardinality fields like user_id
# - Set "index": false on fields you'll never search
PUT /logs/_mapping
{
  "properties": {
    "user_id": {
      "type": "text",
      "index": false
    },
    "timestamp": {
      "type": "date"
    }
  }
}


Success Metrics: Measuring Your Win


Old Way: "Let me check 5 systems..."
Time to answer: 15-30 minutes

New Way: "Here's the CloudWatch dashboard..."
Time to answer: 30 seconds


Troubleshooting Guide

Issue: CloudWatch Agent Not Sending Logs


# Check agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a status -m ec2

# Check agent logs
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

# Common fix: IAM permissions
# Ensure instance role has CloudWatchAgentServerPolicy


Issue: OpenSearch “Cluster Red” Status


# Check cluster health
curl -XGET 'https://your-domain.region.es.amazonaws.com/_cluster/health?pretty'

# Common causes:
# 1. Unassigned shards (need more nodes)
# 2. Disk space >85% used (scale storage)
# 3. JVM pressure (scale instance type)

# Quick fix: Delete old indices
curl -XDELETE 'https://your-domain.region.es.amazonaws.com/old-index-*'


Issue: High CloudWatch Costs


# Find expensive log groups
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName,storedBytes]' \
  --output text | sort -k2 -rn

# Check for debug logs in production
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "DEBUG" \
  --limit 10


Best Practices Checklist

  • Document current log volumes (GB/day)
  • List all alert rules from legacy systems
  • Identify compliance retention requirements
  • Get buy-in from security and ops teams
  • Set realistic budget expectations
  • Start with a non-production environment
  • Run legacy and new systems in parallel (2+ weeks)
  • Train the ops team before cutover
  • Have a rollback plan ready
  • Document everything (future you will thank you)
  • Monitor CloudWatch costs daily (first month)
  • Review alert effectiveness weekly
  • Gather user feedback from the ops team
  • Optimise based on actual usage patterns
  • Schedule quarterly reviews


The Bottom Line

Consolidating from multiple monitoring tools to a unified, AWS-native stack isn’t just about reducing complexity; it’s about operational excellence :

  • Faster incident response : 15 minutes instead of 45
  • Better security posture : Automated threat response
  • Compliance confidence : Query any log in seconds
  • Cost savings : £5-10k+/year in eliminated tools
  • Happier ops team : One system to master, not five

Getting Started

If you’re ready to begin:

  1. Week 1 : Audit current tools and costs
  2. Week 2 : Estimate AWS costs with AWS Pricing Calculator
  3. Week 3 : POC with non-prod EKS cluster
  4. Week 4 : Build business case
  5. Week 5+ : Execute phased migration


Conclusion

Building a unified AWS monitoring solution is a journey, not a destination. Start small, prove value quickly, and iterate based on real-world usage.

The goal isn’t monitoring perfection; it’s operational sanity. When your phone rings at 3 AM, you want answers in minutes, not a hunt across five different tools.

Tags: #aws #monitoring #observability #cloudwatch #devops #sre #kubernetes #eks #security #siem

