
Nikitas Gargoulakis for AWS Community Builders

Originally published at allaboutcloud.co.uk

Complete Guide to AWS Monitoring and Observability for DevOps Teams

In today’s cloud-first world, many organisations find themselves wrestling with a common challenge: monitoring fragmentation. If you’re migrating to AWS from on-premises infrastructure, you’ve likely accumulated a collection of monitoring tools: Grafana here, Zabbix there, maybe some Prometheus, Scrutinizer, and a dash of CloudWatch. Each tool serves a purpose, but together they create operational chaos.

This article walks through a real-world architecture for consolidating multiple monitoring tools into a unified, AWS-native observability platform. Whether you’re monitoring EKS clusters, Active Directory, firewalls, or a hybrid infrastructure, this guide will help you build a single pane of glass for your entire estate.

The Problem: Death by a Thousand Dashboards

Let’s paint a familiar picture:

  • 3 AM : Your phone rings. Production is down.
  • 3:02 AM : You check CloudWatch. Nothing obvious.
  • 3:05 AM : Switch to Grafana. Some weird metrics.
  • 3:10 AM : Check Zabbix. Server CPU is spiking.
  • 3:15 AM : But why? Check the logs... wait, where are those logs again?
  • 3:25 AM : Finally correlate the issue across four different systems.
  • MTTR : 45 minutes (30 of which were spent context-switching between tools)

Sound familiar? You’re not alone.

The Core Requirements

When consolidating monitoring infrastructure, we need to solve for:

  1. Unified Visibility : One place to see everything
  2. Proactive Detection : Catch issues before users do
  3. Fast Root Cause Analysis : Correlate events across layers
  4. Compliance Ready : Query data for audits without panic
  5. Operational Efficiency : Stop paying for five tools when one will do

The Solution: AWS-Native Observability Stack

After extensive research and real-world implementation, here’s the architecture that actually works:


┌─────────────────────────────────────────────────────────┐
│                   Visualization Layer                   │
│  CloudWatch Dashboards | Managed Grafana | QuickSight   │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             Analytics & Investigation Layer             │
│    CloudWatch Insights | Athena | OpenSearch Service    │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│            Centralized Data Lake (Optional)             │
│                AWS Security Lake (OCSF)                 │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             Monitoring & Security Services              │
│     CloudWatch | Security Hub | GuardDuty | Config      │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                   Your Infrastructure                   │
│     EKS | EC2 | Lambda | RDS | On-Prem (Logs Only)      │
└─────────────────────────────────────────────────────────┘


The Core AWS Services

Let’s break down each component:


1. Amazon CloudWatch: Your Foundation

CloudWatch is unavoidable when working with AWS. Instead of fighting it, embrace it as your foundation.

What You Get:

  • Metrics : CPU, memory, disk, network, custom application metrics
  • Logs : Centralized log aggregation with retention policies
  • Alarms : Threshold-based and anomaly detection alerting
  • Dashboards : Pre-built and custom operational views
  • Insights : Purpose-built query language for log analysis (Logs Insights)

Real-World Setup:


{
  "agent": {
    "metrics_collection_interval": 60
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application/*.log",
            "log_group_name": "/aws/application/myapp",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          {"name": "cpu_usage_idle", "unit": "Percent"}
        ]
      },
      "mem": {
        "measurement": [
          {"name": "mem_used_percent", "unit": "Percent"}
        ]
      }
    }
  }
}

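To apply this on an EC2 instance, point the agent at the file and restart it. A minimal sketch, assuming the JSON above was saved as /opt/aws/amazon-cloudwatch-agent/etc/config.json and the instance role includes CloudWatchAgentServerPolicy:

# Load the config and (re)start the CloudWatch agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json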

2. Container Insights for EKS

If you’re running Kubernetes on AWS, Container Insights is a game-changer.

Deployment:


# Enable control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator"],"enabled":true}]}'

# Deploy FluentBit DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml


What You See:

  • Cluster-level metrics (CPU, memory, network)
  • Namespace and pod-level breakdowns
  • Node performance and capacity
  • Application logs automatically collected from stdout/stderr

This replaces your standalone Prometheus + Grafana setup for most use cases.
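
Once the metrics and logs are flowing, you can interrogate them without leaving CloudWatch. A sketch of a Logs Insights query against the Container Insights performance log group, assuming the standard log group naming for a cluster called my-cluster (adjust names, region, and the GNU date syntax to your environment):

# Top 10 pods by average CPU over the last hour
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/performance \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields PodName, pod_cpu_utilization | filter Type = "Pod" | stats avg(pod_cpu_utilization) as avg_cpu by PodName | sort avg_cpu desc | limit 10'

# Retrieve results once the query completes
aws logs get-query-results --query-id <query-id-from-the-command-above>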


3. AWS Security Hub: Your Security Command Center

Think of Security Hub as your security findings aggregator. It’s like having a security operations assistant that never sleeps.

What It Aggregates:

  • GuardDuty : Machine learning-based threat detection
  • AWS Config : Configuration compliance
  • IAM Access Analyzer : Permission issues
  • Macie : Sensitive data discovery
  • Inspector : Vulnerability scanning

Compliance Made Easy:


# Enable Security Hub with CIS AWS Foundations Benchmark
aws securityhub enable-security-hub \
  --enable-default-standards

# Get compliance summary
aws securityhub get-findings \
  --filters '{"ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}]}'


4. Amazon OpenSearch: Your SIEM Replacement

Replacing Microsoft Sentinel? OpenSearch Service is your answer.

Why OpenSearch Over Sentinel?

If your logs already live in AWS, OpenSearch keeps the analytics next to the data: it integrates with CloudWatch Logs, Kinesis Data Firehose, and Security Lake, so you avoid shipping every gigabyte out to another vendor’s cloud. One feature worth leaning on is OpenSearch’s built-in anomaly detection. It’s surprisingly good at catching unusual patterns you’d miss manually.
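
Anomaly detectors are configured through the plugin’s REST API. A sketch only: the index name, field, and interval below are illustrative, and authentication depends on how your domain is secured (SigV4 or a master user):

# Create a detector that watches error volume in an app-logs index
curl -XPOST 'https://your-domain.region.es.amazonaws.com/_plugins/_anomaly_detection/detectors' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "error-rate-detector",
    "description": "Flags unusual spikes in application errors",
    "time_field": "@timestamp",
    "indices": ["app-logs-*"],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
    "feature_attributes": [
      {
        "feature_name": "error_count",
        "feature_enabled": true,
        "aggregation_query": {"error_count": {"value_count": {"field": "log_level"}}}
      }
    ]
  }'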


5. AWS Security Lake: The Long-Term Play

Here’s where things get interesting. Security Lake is AWS’s answer to the question: “Where do I store petabytes of security data without going bankrupt?”

The OCSF Advantage

Security Lake automatically normalizes logs to the Open Cybersecurity Schema Framework (OCSF). This means:

  • Standardized queries across all log sources
  • Multi-cloud ready (Azure, GCP logs can be normalized too)
  • Future-proof (vendor-agnostic format)

When to Use Security Lake:

YES if you need:

  • More than 1 year of log retention
  • Compliance with strict audit requirements
  • Multi-cloud strategy
  • Cost-effective long-term storage (S3 is cheap!)

NO if you need:

  • Real-time alerting (use CloudWatch + OpenSearch instead)
  • Simple single-account setup
  • Quick implementation (<4 weeks)

Use Security Lake for retention, OpenSearch for hot analytics (last 30 days).
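
Because everything lands in OCSF-formatted Parquet on S3, audit questions become plain SQL in Athena. A sketch only: Security Lake generates its own Glue database and table names per region and source, so check your Glue catalog for the real names before running anything like this:

# Example: CloudTrail management activity for a given IAM user
aws athena start-query-execution \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query-string "SELECT time, api.operation, actor.user.name, src_endpoint.ip \
    FROM amazon_security_lake_glue_db_eu_west_2.amazon_security_lake_table_eu_west_2_cloud_trail_mgmt_2_0 \
    WHERE actor.user.name = 'suspicious-user' \
    LIMIT 100"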

6. The On-Premises Challenge

Let’s address the elephant in the room: on-premises monitoring in a cloud-native world.

What’s Realistic:

You CAN:

  • Forward logs via CloudWatch Agent
  • Send syslogs via Kinesis Firehose
  • Store and search on-prem logs in AWS
  • Create basic alerts on log patterns

You CANNOT (easily):

  • Get real-time metrics dashboards
  • Automate remediation for on-prem resources
  • Achieve full observability parity with AWS resources

The Pragmatic Approach:


# On-premises server → CloudWatch Logs
# Install agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

# Configure to send logs only (no metrics)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m onPremise \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

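The -c file:... flag above expects a config that, for on-prem servers, contains a logs section only. A minimal sketch (the file path, log group name, and stream name are placeholders; on-prem agents also need credentials via an IAM user profile or SSM hybrid activation):

# Minimal logs-only agent config for an on-prem server
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<'EOF'
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/onprem/syslog",
            "log_stream_name": "{hostname}",
            "retention_in_days": 30
          }
        ]
      }
    }
  }
}
EOF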

For true on-premises monitoring, you might need to keep Zabbix or Prometheus for a while.

Architecture Decision: Two Approaches

Approach 1: With Security Lake (Compliance-First)

Best For : Healthcare, finance, government, or anyone with >1 year log retention requirements


AWS Services → Security Lake (S3/OCSF) → Athena (SQL queries)
                    ↓
              OpenSearch (Last 30 days hot analytics)
                    ↓
              CloudWatch Dashboards + Managed Grafana


Pros :

  • Cost-effective long-term retention
  • OCSF standardization
  • Multi-cloud ready
  • Compliance-friendly

Cons :

  • More complex setup
  • Longer implementation (16-20 weeks)
  • Requires OCSF knowledge

Approach 2: Direct CloudWatch/OpenSearch (Speed-First)

Best For : Startups, lower compliance requirements, quick wins


AWS Services → CloudWatch Logs → OpenSearch (direct)
                    ↓
              CloudWatch Dashboards + Managed Grafana
                    ↓
              S3 (archived logs via export)


Pros :

  • Faster implementation
  • Simpler architecture
  • Real-time everything
  • Lower learning curve

Cons :

  • Higher CloudWatch Logs costs at scale
  • No OCSF normalization
  • OpenSearch storage costs

Real-World Implementation: Step-by-Step

Let’s build this thing. Here’s the actual deployment sequence:

Week 1-2: Foundation


# 1. Enable AWS Organizations (if not already)
aws organizations create-organization

# 2. Enable CloudTrail (all regions, all accounts)
aws cloudtrail create-trail \
  --name organization-trail \
  --s3-bucket-name my-cloudtrail-bucket \
  --is-organization-trail \
  --is-multi-region-trail

# 3. Enable GuardDuty
aws guardduty create-detector --enable

# 4. Enable Security Hub
aws securityhub enable-security-hub

# 5. Enable AWS Config
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::ACCOUNT:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig
aws configservice start-configuration-recorder \
  --configuration-recorder-name default


Week 3-4: EKS Monitoring


# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush 5
        Log_Level info

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Parser docker
        Tag kube.*

    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Merge_Log On

    [OUTPUT]
        Name cloudwatch_logs
        Match kube.*
        region us-east-1
        log_group_name /aws/eks/my-cluster
        log_stream_prefix app-
        auto_create_group true



kubectl apply -f fluent-bit-config.yaml


Week 5-6: OpenSearch SIEM


# Create OpenSearch domain
aws opensearch create-domain \
  --domain-name security-analytics \
  --engine-version "OpenSearch_2.11" \
  --cluster-config InstanceType=r6g.large.search,InstanceCount=3 \
  --ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=100 \
  --encryption-at-rest-options Enabled=true \
  --node-to-node-encryption-options Enabled=true \
  --advanced-security-options Enabled=true,InternalUserDatabaseEnabled=false

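To get existing CloudWatch log groups flowing into the domain, the usual pattern is a subscription filter that streams events to a delivery function (the CloudWatch console’s “stream to OpenSearch” wizard generates one, or you can write your own). A sketch, assuming a delivery Lambda called LogsToOpenSearch already exists and allows invocation by CloudWatch Logs:

# Stream EKS application logs into the OpenSearch domain
aws logs put-subscription-filter \
  --log-group-name /aws/eks/my-cluster \
  --filter-name opensearch-delivery \
  --filter-pattern "" \
  --destination-arn arn:aws:lambda:us-east-1:ACCOUNT:function:LogsToOpenSearch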

Week 7-8: Dashboards and Alerts


// cloudwatch-dashboard.json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU Overview"
      }
    },
    {
      "type": "log",
      "properties": {
        "query": "SOURCE '/aws/eks/my-cluster' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "title": "Recent Errors"
      }
    }
  ]
}



aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body file://cloudwatch-dashboard.json

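Dashboards cover the “see it” half of this phase; alarms cover the “get paged” half. A sketch of a basic CPU alarm wired to an SNS topic (the instance ID and topic ARN are placeholders):

# Page when average CPU stays above 80% for 15 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-ec2-high-cpu" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:ops-alerts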

Automated Incident Response

Here’s where it gets interesting. Let’s automate security responses:


# lambda/security_response.py
import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')

def lambda_handler(event, context):
    """
    Responds to GuardDuty findings automatically
    """
    finding = event['detail']
    finding_type = finding['type']

    # SSH Brute Force detected
    if 'SSHBruteForce' in finding_type:
        instance_id = finding['resource']['instanceDetails']['instanceId']

        # Quarantine instance
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            Groups=['sg-quarantine'] # Pre-created quarantine security group
        )

        # Notify team
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:security-alerts',
            Subject=f'CRITICAL: Instance {instance_id} Quarantined',
            Message=f'Detected SSH brute force attack. Instance automatically isolated.\n\nFinding: {finding}'
        )

        return {'status': 'quarantined', 'instance': instance_id}

    # No automated action for other finding types; just acknowledge
    return {'status': 'no_action', 'finding_type': finding_type}


EventBridge Rule :


{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [7, 8, 8.9],
    "type": ["UnauthorizedAccess:EC2/SSHBruteForce"]
  }
}


Result : Threat detected → Instance isolated → Team notified. All in <30 seconds.
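
Wiring that pattern to the Lambda is two EventBridge calls plus a permission so EventBridge can invoke the function. A sketch, with the rule name and ARNs as placeholders and the pattern above saved as guardduty-pattern.json:

# Create the rule and point it at the response function
aws events put-rule \
  --name guardduty-ssh-bruteforce \
  --event-pattern file://guardduty-pattern.json

aws events put-targets \
  --rule guardduty-ssh-bruteforce \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:ACCOUNT:function:security-response"

# Allow EventBridge to invoke the Lambda
aws lambda add-permission \
  --function-name security-response \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:ACCOUNT:rule/guardduty-ssh-bruteforce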

Cost Optimization Tips

Let’s talk money. Here’s how to keep costs reasonable:

1. CloudWatch Logs: The Cost That Sneaks Up on You


# Set appropriate retention periods
import boto3

logs = boto3.client('logs')

# Development logs: 7 days
logs.put_retention_policy(
    logGroupName='/aws/lambda/dev-functions',
    retentionInDays=7
)

# Production logs: 30 days
logs.put_retention_policy(
    logGroupName='/aws/lambda/prod-functions',
    retentionInDays=30
)

# Compliance logs: keep 90 days hot, export to S3 for long-term retention
logs.put_retention_policy(
    logGroupName='/aws/cloudtrail',
    retentionInDays=90
)


2. Use Log Sampling

Not every log line needs immediate indexing:


# Sample 10% of high-volume logs
import random

def lambda_handler(event, context):
    if random.random() < 0.1:  # 10% sampling
        # Send this event to OpenSearch for hot, searchable analytics
        pass

    # Always send everything to S3 (cheap long-term storage)


3. OpenSearch Reserved Instances


# Save 30-40% with 1-year reserved capacity
aws opensearch purchase-reserved-instance-offering \
  --reserved-instance-offering-id offering-id \
  --instance-count 3


4. S3 Intelligent-Tiering


# Automatic cost optimization for Security Lake
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket security-lake-bucket \
  --id intelligent-tiering \
  --intelligent-tiering-configuration '{
    "Id": "intelligent-tiering",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'


Migration Strategy: The Practical Path

Don’t try to do everything at once. Here’s the battle-tested sequence:

Phase 1: AWS Resources

  • Start with EKS (highest ROI)
  • Add EC2 instances
  • Enable RDS Enhanced Monitoring (see the sketch below)
  • Configure Lambda logging

Win : 60% of your monitoring consolidated
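
Most of these steps are one-liners. Enhanced Monitoring, for example, is a single modify call, assuming a monitoring role already exists (the identifiers below are placeholders):

# Turn on 60-second Enhanced Monitoring for an RDS instance
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::ACCOUNT:role/rds-monitoring-role \
  --apply-immediately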

Phase 2: Security

  • Enable Security Hub
  • Deploy GuardDuty
  • Set up OpenSearch SIEM
  • Migrate from Sentinel

Win : Security team has single console

Phase 3: Dashboards

  • Build CloudWatch operational dashboards
  • Deploy Managed Grafana
  • Recreate critical legacy dashboards
  • Train operations team

Win : Ops team stops using old tools

Phase 4: On-Premises

  • Deploy CloudWatch Agent to servers
  • Configure syslog forwarding
  • Archive on-prem logs in S3

Phase 5: Decommission

  • Parallel run validation (2 weeks)
  • Export historical data
  • Turn off Zabbix, Prometheus
  • Reclaim licenses and infrastructure

Common Issues (And How to Avoid Them)

Issue #1: CloudWatch Logs Cost Explosion

Problem : Someone enables debug logging in production

Solution :


# Implement log sampling and filtering at source
import logging
import watchtower

# Only send WARNING and above to CloudWatch
handler = watchtower.CloudWatchLogHandler(log_group='/aws/app')
handler.setLevel(logging.WARNING)

logger = logging.getLogger(__name__)
logger.addHandler(handler)


Issue #2: Alert Fatigue

Problem : 500 alerts per day, all marked “critical”

Solution :


# Implement alert prioritization
def calculate_severity(metric_value, threshold):
    if metric_value > threshold * 1.5:
        return 'CRITICAL' # Page on-call
    elif metric_value > threshold * 1.2:
        return 'WARNING' # Slack notification
    else:
        return 'INFO' # Log only


Issue #3: The “We’ll Monitor Everything” Trap

Problem : Monitoring 10,000 metrics per instance

Solution : Start with the Golden Signals :

  • Latency : How long requests take
  • Traffic : Request volume
  • Errors : Failure rate
  • Saturation : Resource utilization

# Focused metric collection
CRITICAL_METRICS = [
    'CPUUtilization',
    'MemoryUtilization',
    'NetworkIn',
    'NetworkOut',
    'DiskReadOps',
    'DiskWriteOps'
]


Issue #4: Forgetting About Cardinality

Problem : OpenSearch cluster dies from high-cardinality fields

Solution :


# Don't index user IDs, session IDs, or timestamps as keywords!
# - Avoid "keyword" mappings for high-cardinality fields like user_id
# - Set "index": false on fields you'll never search
PUT /logs/_mapping
{
  "properties": {
    "user_id": {
      "type": "text",
      "index": false
    },
    "timestamp": {
      "type": "date"
    }
  }
}


Success Metrics: Measuring Your Win


Old Way: "Let me check 5 systems..."
Time to answer: 15-30 minutes

New Way: "Here's the CloudWatch dashboard..."
Time to answer: 30 seconds


Troubleshooting Guide

Issue: CloudWatch Agent Not Sending Logs


# Check agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a status -m ec2

# Check agent logs
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

# Common fix: IAM permissions
# Ensure instance role has CloudWatchAgentServerPolicy


Issue: OpenSearch “Cluster Red” Status


# Check cluster health
curl -XGET 'https://your-domain.region.es.amazonaws.com/_cluster/health?pretty'

# Common causes:
# 1. Unassigned shards (need more nodes)
# 2. Disk space >85% used (scale storage)
# 3. JVM pressure (scale instance type)

# Quick fix: Delete old indices
curl -XDELETE 'https://your-domain.region.es.amazonaws.com/old-index-*'


Issue: High CloudWatch Costs


# Find expensive log groups
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName,storedBytes]' \
  --output text | sort -k2 -rn

# Check for debug logs in production
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "DEBUG" \
  --limit 10


Best Practices Checklist

  • Document current log volumes (GB/day)
  • List all alert rules from legacy systems
  • Identify compliance retention requirements
  • Get buy-in from security and ops teams
  • Set realistic budget expectations
  • Start with a non-production environment
  • Run legacy and new systems in parallel (2+ weeks)
  • Train the ops team before cutover
  • Have a rollback plan ready
  • Document everything (future you will thank you)
  • Monitor CloudWatch costs daily (first month)
  • Review alert effectiveness weekly
  • Gather user feedback from the ops team
  • Optimise based on actual usage patterns
  • Schedule quarterly reviews


The Bottom Line

Consolidating from multiple monitoring tools to a unified, AWS-native stack isn’t just about reducing complexity; it’s about operational excellence :

  • Faster incident response : 15 minutes instead of 45
  • Better security posture : Automated threat response
  • Compliance confidence : Query any log in seconds
  • Cost savings : £5-10k+/year in eliminated tools
  • Happier ops team : One system to master, not five

Getting Started

If you’re ready to begin:

  1. Week 1 : Audit current tools and costs
  2. Week 2 : Estimate AWS costs with AWS Pricing Calculator
  3. Week 3 : POC with non-prod EKS cluster
  4. Week 4 : Build business case
  5. Week 5+ : Execute phased migration


Conclusion

Building a unified AWS monitoring solution is a journey, not a destination. Start small, prove value quickly, and iterate based on real-world usage.

The goal isn’t monitoring perfection; it’s operational sanity. When your phone rings at 3 AM, you want answers in minutes, not a hunt across five different tools.

Tags: #aws #monitoring #observability #cloudwatch #devops #sre #kubernetes #eks #security #siem

