Truong An

Posted on Jul 5 • Edited on Jul 8

AWS-native Observability for EC2 with the CloudWatch Agent

#devops #aws #cloudwatch #observability

View my website: Read the original post on K.Kloud Tarus

GitHub repository: cloudwatch-agent-ec2-observability

AWS Community Builder: cloudwatch-agent-ec2-observability

When you run a workload on EC2, CloudWatch gives you only a small slice of the operational picture out of the box: CPU, network, disk I/O, and status checks. In real DevOps/SRE work that is not enough. You usually also want to know how much memory is in use, whether the root disk still has free space, whether the Nginx process is still running, whether the access/error logs contain anything wrong, and whether the system actually alerts you when a threshold is crossed.

This lab is a hands-on build of a simple but realistic observability pipeline for EC2, done the AWS-native way with the CloudWatch Agent.

I run it as two scenarios:

Case 1: An EC2 instance already exists, is running Nginx, but has no CloudWatch Agent yet.
Case 2: Create a brand-new EC2 instance and bootstrap the CloudWatch Agent via User Data.

The point of this article is not just "install the agent", but walking the whole flow end to end:

EC2
→ CloudWatch Agent
→ CloudWatch Metrics
→ CloudWatch Logs
→ CloudWatch Dashboard
→ CloudWatch Alarm
→ Amazon SNS
→ Email Notification

Why do you need the CloudWatch Agent?

By default, EC2 basic monitoring observes the instance from the outside, mostly through the hypervisor. That is why CloudWatch can see metrics such as CPU, network, disk I/O, and status checks.

However, information that lives inside the operating system, such as how much memory is in use, how much space is left on a filesystem, what the application logs are writing, or whether the Nginx process is still running, is not visible to CloudWatch with basic monitoring alone.

The CloudWatch Agent closes that gap by running inside the EC2 instance. The agent reads metrics and logs from the operating system, then ships the data to CloudWatch Metrics and CloudWatch Logs.

In short:

EC2 basic monitoring
→ sees the instance from the outside

CloudWatch Agent
→ sees inside the operating system

That is why data such as memory usage, disk usage, application logs, and process status requires the CloudWatch Agent.

High-level architecture

In this architecture, EC2 runs both the Nginx workload and the CloudWatch Agent. The agent runs inside the instance, reads metrics at the operating-system level, reads Nginx log files, and then ships the data to CloudWatch.

The main flow:

User / Local terminal
→ AWS Systems Manager Session Manager
→ EC2 instance
→ CloudWatch Agent
→ CloudWatch Metrics / CloudWatch Logs
→ Dashboard / Alarm
→ SNS Email Notification

I do not use SSH in this lab. The EC2 instance is accessed through AWS Systems Manager Session Manager, which means no need to open port 22, no key pair, and access control is managed through IAM.

Repository structure

Evidence, architecture, and scripts should be kept separate so they are easy to review:

cloudwatch-agent-ec2-observability/
├── architecture/
│   └── high-level-architecture.jpg
│
├── evidence/
│   ├── Case 1 - Existing EC2 Running.jpg
│   ├── Case 1 - IAM Role Attached.jpg
│   ├── Case 1 - SSM Managed Node.jpg
│   ├── Case 1 - CloudWatch Agent Installed.jpg
│   ├── Case 1 - Agent Running.jpg
│   ├── Case 1 - CWAgent Metrics.jpg
│   ├── Case 1 - CloudWatch Logs.jpg
│   ├── Case 1 - CloudWatch Alarm.jpg
│   ├── Case 1 - SNS Email Confirmed.jpg
│   ├── Case 1 - CloudWatch Dashboard.jpg
│   ├── Case 2 - New EC2 Running.jpg
│   ├── Case 2 - Agent Running.jpg
│   ├── Case 2 - Key CWAgent Metrics.jpg
│   ├── Case 2 - CloudWatch Log Groups.jpg
│   └── Case 2 - CloudWatch Dashboard.jpg
│
├── scripts/
│   ├── case1-cloudwatch-agent-config.json
│   ├── case2-user-data.sh
│   └── cleanup.md
│
├── 01-cloudwatch-agent-lab-evidence.md
└── 01-aws-native-observability-for-ec2-with-cloudwatch-agent.md

This blog file only explains the flow and the key results. The long CloudWatch Agent config and the User Data are kept under the scripts/ directory so they are easy to reuse.

AWS services used

Service	Role
Amazon EC2	The host running the Nginx workload
CloudWatch Agent	Collects metrics and logs from inside EC2
CloudWatch Metrics	Stores memory, disk, CPU, and process metrics
CloudWatch Logs	Stores Nginx access/error logs and system logs
CloudWatch Dashboard	Visualizes the important metrics
CloudWatch Alarm	Alerts when a metric crosses a threshold
Amazon SNS	Sends the email notification when an alarm is triggered
IAM Role	Grants EC2 permission to send data to CloudWatch
AWS Systems Manager	Accesses EC2 via Session Manager instead of SSH

Cost and Log Retention

The CloudWatch Agent can incur cost depending on the number of custom metrics, the number of log events, how long logs are retained, and how frequently metrics are collected.

In this lab, metrics such as memory, disk, and Nginx process count are sent into the CWAgent namespace. These are custom metrics, so the cost depends on the number of metrics, the number of dimensions, and how long the data is stored/observed.

I set metrics_collection_interval to 60 seconds to balance granularity against cost:

metrics_collection_interval: 60

Lowering the interval to 10 seconds gives more detailed data but increases the number of datapoints, which in turn can raise CloudWatch cost. So for a lab or a small environment, 60 seconds is the more sensible choice.

For CloudWatch Logs, if you do not configure retention, a log group can keep logs indefinitely. So once a log group is created, you should set a retention policy to avoid keeping logs you do not need.

For example, setting a 7-day retention for Case 1:

aws logs put-retention-policy \
  --log-group-name "/ec2/cloudwatch-agent/case1/nginx/access" \
  --retention-in-days 7 \
  --region us-east-1

aws logs put-retention-policy \
  --log-group-name "/ec2/cloudwatch-agent/case1/nginx/error" \
  --retention-in-days 7 \
  --region us-east-1

For example, setting a 7-day retention for Case 2:

aws logs put-retention-policy \
  --log-group-name "/ec2/cloudwatch-agent/case2/nginx/access" \
  --retention-in-days 7 \
  --region us-east-1

aws logs put-retention-policy \
  --log-group-name "/ec2/cloudwatch-agent/case2/nginx/error" \
  --retention-in-days 7 \
  --region us-east-1

In production, retention should be chosen according to audit, compliance, and cost needs, for example 7 days, 14 days, 30 days, or longer.

Preparing the IAM Role for EC2

The CloudWatch Agent needs permission to send metrics and logs to CloudWatch. EC2 also needs permission to work with Systems Manager Session Manager.

I create an IAM Role for EC2:

ec2-cloudwatch-agent-role

Attach two managed policies:

CloudWatchAgentServerPolicy
AmazonSSMManagedInstanceCore

What they mean:

CloudWatchAgentServerPolicy
→ Allows the CloudWatch Agent to send metrics/logs to CloudWatch.

AmazonSSMManagedInstanceCore
→ Allows EC2 to appear in Systems Manager and be accessed via Session Manager.

Connecting to EC2 with Session Manager

Instead of SSH, I use the AWS CLI from my local machine to connect to EC2:

aws ssm start-session \
  --target <id-instance-ec2-của-bạn> \
  --region us-east-1

Example:

aws ssm start-session \
  --target i-xxxxxxxxxxxxxxxxx \
  --region us-east-1

Once inside the EC2 instance, switch to root to work through the lab:

sudo su -
whoami

Check the OS and hostname:

hostname
cat /etc/os-release

This approach means the lab never has to open SSH port 22 to the Internet.

Case 1: Installing the CloudWatch Agent on an existing EC2 instance

Context

In the first case, I assume there is already an EC2 instance running the Nginx workload. This instance has never had the CloudWatch Agent installed. This is a fairly realistic situation: the system is already running, and then the DevOps/SRE team wants to add observability without rebuilding the instance.

The execution flow:

Existing EC2
→ Attach IAM Role
→ Check the SSM Managed Node
→ Check that Nginx is running
→ Confirm the CloudWatch Agent is not installed
→ Install the CloudWatch Agent
→ Create the CloudWatch Agent config
→ Start the CloudWatch Agent
→ Verify Metrics and Logs
→ Create the Alarm, SNS, and Dashboard

Inspecting the current EC2 instance

The instance used in Case 1:

Instance name: cwagent-existing-ec2
Instance ID: i-004d22f414fe421f0
AMI: Amazon Linux 2023
Instance type: t3.micro
VPC: CW-Agent-Ec2-vpc
Subnet: Public subnet us-east-1a
IAM Role: ec2-cloudwatch-agent-role

Evidence:

The EC2 instance also appears under AWS Systems Manager Managed Nodes with the status Online.

Inspecting the workload and the initial state

After connecting to EC2 with Session Manager, I check:

whoami
hostname
cat /etc/os-release
systemctl status nginx
rpm -qa | grep amazon-cloudwatch-agent

The result:

Current user: root
OS: Amazon Linux 2023
Nginx: active running
CloudWatch Agent: not installed yet

Evidence:

This confirms the intended context: the EC2 instance already has a workload but no CloudWatch Agent.

Installing the CloudWatch Agent

Install the CloudWatch Agent package on Amazon Linux 2023:

sudo dnf install -y amazon-cloudwatch-agent

Verify the package:

rpm -qa | grep amazon-cloudwatch-agent
ls -l /opt/aws/amazon-cloudwatch-agent/

Evidence:

Creating the CloudWatch Agent config

The config file lives at:

/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

The full config is kept in the repo at: Case 1 - File Agent Config

This config collects:

Metrics:
- mem_used_percent
- disk_used_percent
- cpu usage metrics
- Nginx process count via procstat

Logs:
- /var/log/nginx/access.log
- /var/log/nginx/error.log

The most important parts of the config are procstat and logs. procstat lets the CloudWatch Agent track the Nginx process, while the logs section lets the agent read log files on EC2 and ship them to CloudWatch Logs.

Example of the procstat section:

{
  "procstat": [
    {
      "exe": "nginx",
      "measurement": [
        "pid_count",
        "cpu_usage",
        "memory_rss"
      ],
      "metrics_collection_interval": 60
    }
  ]
}

Example of the log collection section for the Nginx access log:

{
  "file_path": "/var/log/nginx/access.log",
  "log_group_name": "/ec2/cloudwatch-agent/case1/nginx/access",
  "log_stream_name": "{instance_id}-access",
  "timezone": "UTC"
}

I do not inline the entire JSON config so the article stays concise, but I still include the core parts so the reader understands what the agent is collecting. The full config lives under the scripts/ directory so it can be copied and run again.

Evidence:

Start CloudWatch Agent

Start the agent with the config just created:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
  -s

Check the status:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
sudo systemctl status amazon-cloudwatch-agent

Expected result:

{
  "status": "running",
  "configstatus": "configured",
  "version": "1.300067.1"
}

Evidence:

Verifying CloudWatch Metrics

Once the agent is running, go to CloudWatch Metrics and look for the namespace:

CWAgent

The main metrics to check:

mem_used_percent
disk_used_percent
procstat_lookup_pid_count

What they mean:

mem_used_percent
→ the percentage of memory in use.

disk_used_percent
→ the percentage of disk used on the `/` filesystem.

procstat_lookup_pid_count
→ the number of Nginx processes the CloudWatch Agent found.

Why can the Nginx Process Count be greater than 1?

The procstat_lookup_pid_count metric reports the number of Nginx processes the CloudWatch Agent found.

For Nginx, this value is usually greater than 1 because Nginx typically runs with the model:

1 master process
+ N worker processes

For example, if the dashboard shows:

Nginx Process Count = 3

that can be read as Nginx running 1 master process and 2 worker processes. So a value of 3 is not an error, but the normal state when Nginx is running multiple workers.

Evidence:

Verifying CloudWatch Logs

The CloudWatch Agent also ships Nginx logs to CloudWatch Logs.

The main log groups:

/ec2/cloudwatch-agent/case1/nginx/access
/ec2/cloudwatch-agent/case1/nginx/error

The Nginx access log contains a GET / HTTP/1.1 request, which proves that the log file on EC2 was read by the agent and shipped to CloudWatch Logs.

Evidence:

Creating a CloudWatch Alarm and SNS Email

Once the memory metric exists, I create a CloudWatch Alarm on the metric:

Metric: mem_used_percent
Namespace: CWAgent
Condition: Greater than threshold
Action: Send notification to SNS topic

In the lab, the threshold is set low so the alarm is easy to trigger and to capture evidence. For a production environment, the threshold should be set from an actual baseline, for example around 80% or 85%.

Beyond the threshold, when configuring an alarm you should also pay attention to these parameters:

Property	Meaning
`Period`	The window over which datapoints are aggregated for each datapoint
`Evaluation periods`	The number of datapoints used to evaluate the alarm
`Datapoints to alarm`	The number of datapoints that must breach the threshold to move to `ALARM`
`TreatMissingData`	How to handle the case when the metric sends no data

With a 1 out of 1 datapoint configuration, the alarm reacts quickly but is prone to false alarms on short spikes.

In production, you should use more evaluation periods, for example:

Period: 5 minutes
Evaluation periods: 3
Datapoints to alarm: 2 out of 3

This configuration reduces false alarms caused by temporary spikes.

TreatMissingData also matters a lot. If the CloudWatch Agent stops sending the metric, the alarm may move to, or get stuck in, the INSUFFICIENT_DATA state, depending on how it is configured. This is a real operational situation to account for when designing monitoring.

Evidence for the alarm:

Then create an SNS topic and an email subscription to receive the alerts.

Evidence SNS:

When the alarm changes state, an email notification is sent to the mailbox. This is the part that proves the alerting flow works end to end:

CloudWatch Metric
→ CloudWatch Alarm
→ SNS Topic
→ Email Notification

Creating a CloudWatch Dashboard

The dashboard helps visualize the important metrics of the EC2 instance.

Dashboard name:

cwagent-existing-ec2-dashboard

Metrics displayed:

Memory Used Percent
Nginx Process Count
Disk Used Percent

Evidence:

Case 1 results:

[✓] The existing EC2 is running the Nginx workload
[✓] The CloudWatch Agent was installed manually
[✓] The agent sends metrics to CloudWatch Metrics
[✓] The agent sends logs to CloudWatch Logs
[✓] The alarm was created and triggered
[✓] SNS sent the email notification
[✓] The dashboard displays the key metrics

Case 2: Bootstrapping the CloudWatch Agent when creating a new EC2 instance

Context

In Case 2, I no longer install the agent manually after EC2 is running. Instead, I use User Data to automatically:

Install Nginx
→ Start Nginx
→ Install the CloudWatch Agent
→ Write the CloudWatch Agent config
→ Start the CloudWatch Agent
→ Generate test requests with curl localhost

Flow:

Launch new EC2
→ Attach IAM Role
→ User Data installs Nginx
→ User Data installs CloudWatch Agent
→ User Data writes config
→ User Data starts CloudWatch Agent
→ Verify Logs, Metrics and Dashboard

Preparing the User Data

The User Data script is kept in the repo at: Case 2 - User Data

This script goes into:

EC2 Launch Instance
→ Advanced details
→ User data

User Data is the core of Case 2 because it lets the new EC2 instance bootstrap its own observability right at launch, instead of having to SSH/SSM into the box and install everything by hand.

The main idea of the User Data:

dnf update
→ install nginx
→ install amazon-cloudwatch-agent
→ enable/start nginx
→ write CloudWatch Agent config
→ start CloudWatch Agent
→ verify status

An important part of the User Data:

#!/bin/bash
set -euxo pipefail

dnf update -y
dnf install -y nginx amazon-cloudwatch-agent

systemctl enable nginx
systemctl start nginx

echo "Bootstrap EC2 for CloudWatch Agent lab" > /usr/share/nginx/html/index.html

curl http://localhost || true
curl http://localhost || true
curl http://localhost || true

The block above makes sure the new EC2 instance installs Nginx, starts the service, and generates a few test requests to produce Nginx access log entries.

The next part of the User Data writes the CloudWatch Agent config to the correct path on EC2:

cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json <<'EOF'
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    }
  }
}
EOF

In the full file, the config also includes mem, disk, cpu, procstat, and log collection for the Nginx/system logs. I only inline a short snippet so the reader understands what the User Data is doing; the full script is kept under the scripts/ directory.

Finally, the User Data starts the CloudWatch Agent with:

/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
  -s

systemctl enable amazon-cloudwatch-agent
systemctl restart amazon-cloudwatch-agent

These commands make the CloudWatch Agent read the config just written and start shipping metrics/logs to CloudWatch right after EC2 finishes booting.

Launching the new EC2 instance

The instance used in Case 2:

Instance name: cwagent-bootstrap-ec2
Instance ID: i-0dd7183a3899a9462
AMI: Amazon Linux 2023
Instance type: t3.micro
VPC: CW-Agent-Ec2-vpc
Subnet: Public subnet us-east-1a
IAM Role: ec2-cloudwatch-agent-role

Evidence:

Verifying the CloudWatch Agent

After EC2 finishes booting, I connect to the instance with Session Manager and check the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
sudo systemctl status amazon-cloudwatch-agent

Evidence:

This proves that the User Data automatically installed and started the CloudWatch Agent when the EC2 instance was launched.

Verifying CloudWatch Logs

Case 2 creates its own separate log groups:

/ec2/cloudwatch-agent/case2/nginx/access
/ec2/cloudwatch-agent/case2/nginx/error
/ec2/cloudwatch-agent/case2/system/cloud-init
/ec2/cloudwatch-agent/case2/system/dnf

The Nginx access log contains requests from curl http://localhost, while the Nginx error log contains Nginx startup logs. This is proof that the User Data ran successfully and that the agent shipped logs to CloudWatch.

Evidence:

Verifying CloudWatch Metrics

In the CWAgent namespace, I check the three main metrics of the Case 2 instance:

mem_used_percent
disk_used_percent
procstat_lookup_pid_count

Evidence:

This result proves that the new EC2 instance, bootstrapped via User Data, automatically sent metrics to CloudWatch.

Creating a Dashboard for Case 2

Dashboard name:

cwagent-bootstrap-ec2-dashboard

Widgets displayed:

Memory Used Percent
Nginx Process Count
Disk Used Percent

Evidence:

Case 2 results:

[✓] A brand-new EC2 instance was created from scratch
[✓] The User Data ran successfully
[✓] Nginx was installed and started automatically
[✓] The CloudWatch Agent was installed and started automatically
[✓] Logs were shipped to CloudWatch Logs
[✓] Metrics were shipped to CloudWatch Metrics
[✓] The dashboard visualizes the key metrics

Comparing the two deployment approaches

Criterion	Case 1: Existing EC2	Case 2: New EC2 from scratch
Situation	EC2 already exists	EC2 created fresh
How the agent is installed	Installed manually while EC2 is already running	Installed automatically via User Data
Goal	Retrofit observability	Bootstrap observability
Best when	The server is already running in dev/prod	You want a new server to have monitoring from the start
Key evidence	Agent running, metrics, logs, alarm, SNS, dashboard	User Data, agent running, logs, metrics, dashboard

Case 1 fits when you already have a running system and want to add observability without changing how the instance is launched.

Case 2 fits when you want to standardize how new EC2 instances are created: the moment the instance boots, it already has Nginx, the CloudWatch Agent, logs, metrics, and is dashboard-ready.

Lessons learned

The CloudWatch Agent is a crucial piece if you want to observe EC2 more deeply than basic monitoring.

Without the CloudWatch Agent, you mostly see metrics outside the instance such as CPU, network, disk I/O, and status checks. With the CloudWatch Agent, you additionally get:

Memory usage
Disk usage by filesystem
Application logs
System logs
Process status
Custom metrics in the CWAgent namespace

Another important point is that an IAM Role should be used instead of access keys. When EC2 has the right role, the CloudWatch Agent automatically uses those permissions to send metrics/logs to CloudWatch.

The lab also shows that Session Manager is a better choice than SSH in a lab or a basic production environment, because there is no need to open port 22 to the Internet.

Cleanup

After finishing the lab, clean up to avoid incurring cost. Besides deleting the EC2 instances, dashboards, alarms, and SNS, pay attention to the CloudWatch Log Groups as well, since a log group can keep logs for a long time if no retention policy is set.

Terminate EC2 Case 1 if it was only used for the lab
Terminate EC2 Case 2
Delete the CloudWatch Dashboards
Delete the CloudWatch Alarms
Delete the SNS Topic
Delete the CloudWatch Log Groups if you do not need to keep them
Delete the IAM Role if it was only used for the lab

You can use this file cleanup template: cleanup-file

Conclusion

This lab builds an AWS-native observability pipeline for EC2 with the CloudWatch Agent. Across the two cases, I validated both deployment directions:

Existing EC2
→ install the CloudWatch Agent manually
→ add observability to a running workload

New EC2
→ bootstrap via User Data
→ have observability from the moment of launch

By combining the CloudWatch Agent with CloudWatch Metrics, CloudWatch Logs, a Dashboard, an Alarm, and SNS, you can build a monitoring system that is simple, easy to understand, realistic enough for a basic DevOps environment, and can be extended further for larger workloads.

Top comments (3)

Huỳnh Lê Nhất Nghĩa • Jul 5

So good!

Huy Hoang • Jul 5

wowwww

Nguyen Thanh Vinh • Jul 6

Very helpful, thanks you <3