
Laura

Posted on • Originally published at lalidev.hashnode.dev

Course 3 of 3: AIOps ☁️💪

Table of Contents

  1. Introduction

  2. You Can't Fix What You Can't See

    1. Monitoring
    2. Observability
    3. Monitoring vs Observability
  3. Observability + AIOps = Smarter IT Operations

  4. AWS AIOps Tools

    1. AWS CloudWatch Anomaly Detection
    2. AWS X-Ray insights
    3. AWS DevOps Guru
  5. Better Dev Experience

    1. Amazon Q Developer
  6. Closing Thoughts

Introduction

Welcome to the third and final blog post of this 3-part series 🎉, where I share my learning process in getting the DevOps and AI on AWS Specialization certification. The first post is about Upgrading Apps with Gen AI, and the second one is about CI/CD for Generative AI Applications.

This course is all about artificial intelligence for IT operations (AIOps), which means using AI to maintain infrastructure. To learn more about what AIOps is, its benefits, and its use cases, check out this link from the Amazon website.

You Can't Fix What You Can't See

Monitoring

Monitoring checks system health by collecting and analyzing data based on a predefined set of metrics and logs. In DevOps, it helps teams keep an eye on application health, catch known failures early, and avoid downtime.

Where monitoring really shines is in spotting long-term trends: it reveals how the app performs and how usage patterns change over time. To be effective, though, teams must know which metrics and logs to track.
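To make the "predefined metrics" idea concrete, here is a minimal sketch of a threshold check. The metric names, threshold values, and the `check_thresholds` helper are all made up for illustration; they don't come from the course or from any AWS API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def check_thresholds(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per tracked metric that exceeds its threshold."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.max_value:
            alerts.append(f"{t.metric}={value} exceeds {t.max_value}")
    return alerts


# Hypothetical thresholds chosen by the team ahead of time.
THRESHOLDS = [Threshold("cpu_percent", 80.0), Threshold("error_rate", 0.05)]

# queue_depth looks unhealthy, but nobody defined a threshold for it.
sample = {"cpu_percent": 91.5, "error_rate": 0.01, "queue_depth": 4000}
alerts = check_thresholds(sample, THRESHOLDS)
print(alerts)
```

Only `cpu_percent` raises an alert here. The large `queue_depth` goes unnoticed because it was never on the list, which is exactly why teams need to know what to track.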

Observability

Observability means being able to make sense of what's happening inside a complex system from its external outputs. When a system is observable, engineers can pinpoint the root cause of a performance issue directly from the telemetry data that is already available. It allows us to understand why an issue occurred.

The three pillars of observability are:

  • Logs (application logs: what is happening when a failure occurs?)

  • Metrics (CloudWatch metrics such as CPU utilization, or app-specific metrics)

  • Traces (a trace contains data from each service involved in a request, which helps you understand what the issue is and where the error occurred).
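The three pillars become most useful when they can be tied together. The sketch below assumes every record carries a shared `trace_id` field, which is a simplification I'm making for illustration (real metrics are usually aggregated rather than tagged per request), and all the sample data is invented.

```python
from collections import defaultdict

# Invented sample telemetry; the trace_id field is an assumption of this sketch.
logs = [
    {"trace_id": "t-1", "level": "ERROR", "message": "payment gateway timeout"},
    {"trace_id": "t-2", "level": "INFO", "message": "order created"},
]
metrics = [
    {"trace_id": "t-1", "name": "latency_ms", "value": 5200},
    {"trace_id": "t-2", "name": "latency_ms", "value": 120},
]
traces = [
    {"trace_id": "t-1", "spans": ["api-gateway", "orders", "payments"]},
    {"trace_id": "t-2", "spans": ["api-gateway", "orders"]},
]


def correlate(logs, metrics, traces):
    """Group log messages, metric values, and trace spans by trace ID."""
    view = defaultdict(lambda: {"logs": [], "metrics": {}, "spans": []})
    for entry in logs:
        view[entry["trace_id"]]["logs"].append(entry["message"])
    for m in metrics:
        view[m["trace_id"]]["metrics"][m["name"]] = m["value"]
    for t in traces:
        view[t["trace_id"]]["spans"] = t["spans"]
    return dict(view)


combined = correlate(logs, metrics, traces)
print(combined["t-1"])
```

For request `t-1`, the combined view shows the error message, the high latency, and the path through the `payments` service side by side, which is the kind of context that helps answer "why".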

Monitoring vs Observability

| Monitoring | Observability |
| --- | --- |
| Tracks a system's performance over time. Its main focus is finding system problems and notifying stakeholders. Metrics can answer questions like "Is my app up and running?" | Uses telemetry data to get a complete picture of overall network performance, making it easier to get to the root cause of an issue. |
| Monitoring tools rely on predefined metrics and logs to identify system errors, usage patterns, and known failures, but can't provide enough context by themselves (Is the app online? Is it offline? Is it experiencing latency issues?). | Gives a team a complete view of the entire architecture, capturing configurations and data from across the network. Observability tools enhance telemetry data with additional context about the network environment. |
| Monitoring tools gather data on usage trends and performance, revealing what is happening within a system, but they can't answer "Why did this event occur?" | Observability tools go deeper, provide more context, and correlate seemingly unrelated system events. |
| Monitoring tools typically present system data through dashboards to view key metrics. However, they fall short in tracing the origins of the system's errors. | Observability tools build maps and connect system errors to their root causes, automating the analysis process and making troubleshooting faster and easier. |

Observability + AIOps = Smarter IT Operations

Improving observability means making sense of large amounts of data coming from many different resources. This is where AIOps proves its value, by automating the correlation of logs, traces, and metrics, identifying anomalies in real time, and reducing manual intervention for repetitive analysis tasks. Instead of digging into raw data, teams can focus on solving bigger issues. For example, when a latency spike shows up, AIOps points you straight to the service or component causing it.

Using AIOps is like having an AI assistant that constantly monitors your infrastructure, identifying patterns and anomalies so the team doesn't have to monitor everything themselves.
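The latency spike example can be sketched in a few lines. This is only a toy stand-in for what an AIOps tool does: it compares each service's current latency with a baseline and picks the one that degraded the most. The service names, numbers, and the `find_suspect` helper are hypothetical.

```python
def find_suspect(baseline_ms, current_ms):
    """Return (service, ratio) for the service whose latency grew most versus its baseline.

    Returns None when no service has a usable baseline.
    """
    ratios = {
        service: current_ms[service] / baseline_ms[service]
        for service in current_ms
        if baseline_ms.get(service)  # skip services with a missing or zero baseline
    }
    if not ratios:
        return None
    suspect = max(ratios, key=ratios.get)
    return suspect, ratios[suspect]


# Invented per-service average latencies in milliseconds.
baseline = {"api-gateway": 20.0, "orders": 45.0, "payments": 80.0}
current = {"api-gateway": 22.0, "orders": 50.0, "payments": 640.0}

suspect, ratio = find_suspect(baseline, current)
print(f"{suspect} is {ratio:.1f}x slower than baseline")
```

A real tool learns the baseline from historical data and weighs many more signals, but the idea of ranking components by how far they drift from normal is the same.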

Key Capabilities of AIOps:

✅ Anomaly Detection: The AI looks at your system's logs and metrics for suspicious activity, to catch issues before they become a problem. In my previous post I mentioned a CloudWatch feature that uses AI-powered anomaly detection.

✅ Predictive Analysis: Predicts future events based on historical data.

✅ Automated Root Cause Analysis: When something breaks, engineers typically have to manually check through logs to identify the issue. Automated root cause analysis saves time by streamlining this process.

✅ Remediation: AIOps doesn't just tell you what's wrong; it can take action to fix it based on policies or real-time learning.

AIOps can also help you monitor and track the entire CI/CD pipeline in real time to keep things running smoothly.

AIOps is focused on operations management. This includes monitoring logs, real-time system health analysis, and automated corrective actions.
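As a rough illustration of the anomaly detection capability listed above, here is a rolling z-score check. I'm using it as a simple statistical stand-in; AIOps tools rely on more sophisticated models, and the window size, limit, and sample latencies are assumptions of this sketch.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, z_limit=3.0):
    """Return indexes whose value is more than z_limit standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Invented latency samples in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 103, 100, 450, 101, 99]
anomalies = find_anomalies(latency_ms)
print(anomalies)
```

The spike at index 7 is flagged without anyone having picked a fixed threshold; the "normal" range comes from the recent history of the metric itself.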

AWS AIOps Tools

AWS CloudWatch Anomaly Detection

CloudWatch is a comprehensive monitoring and observability platform for your cloud resources and applications. The core components are alarms (which notify you when a metric crosses a defined threshold), metrics (data points collected over time), and logs.

When talking about alarms, how do we determine the appropriate threshold to set? CloudWatch can analyze a metric's past data and establish a band of expected values for you, so the alarm fires when the metric moves outside that band. To accomplish this, you can use a feature called CloudWatch anomaly detection.
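Here is a sketch of how such an alarm could be defined with boto3, the AWS SDK for Python. The alarm name, instance ID, period, and band width of 2 are placeholder values I picked for illustration, and `anomaly_alarm_params` and `create_alarm` are my own helper names. The snippet only builds the request; actually creating the alarm needs boto3 installed and AWS credentials.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions, band_width=2):
    """Build keyword arguments for CloudWatch put_metric_alarm that compare a
    metric against an anomaly detection band instead of a fixed threshold."""
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # the alarm threshold is the band below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params):
    """Send the request to CloudWatch (not called in this sketch)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = anomaly_alarm_params(
    alarm_name="cpu-outside-expected-range",  # placeholder name
    namespace="AWS/EC2",
    metric_name="CPUUtilization",
    dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
)
print(params["Metrics"][1]["Expression"])
```

The second argument of `ANOMALY_DETECTION_BAND` controls how wide the expected range is: a larger number gives a wider band and fewer alarms.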

AWS X-Ray Insights

AWS X-Ray insights is a feature that keeps a continuous watch over trace data in your account to identify any issues that may occur. It uses machine learning to detect anomalies and patterns that could cause issues. When anomalies, error rates, or fault rates surpass the expected range, it generates an insight that documents the issue and monitors its impact until it's resolved. It also helps you identify the issue's severity and priority.
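The "open an insight, track it until it's resolved" behavior can be pictured with a small sketch. To be clear, this is not how X-Ray works internally: the fixed `expected_max` stands in for the range that X-Ray learns with machine learning, and the fault rates are invented.

```python
def track_insights(fault_rates, expected_max):
    """Return (start, end) index pairs for periods where the fault rate stayed
    above the expected range. `end` is None while the issue is still ongoing."""
    insights = []
    opened_at = None
    for minute, rate in enumerate(fault_rates):
        if rate > expected_max and opened_at is None:
            opened_at = minute  # rate left the expected range: open an insight
        elif rate <= expected_max and opened_at is not None:
            insights.append((opened_at, minute))  # back to normal: close it
            opened_at = None
    if opened_at is not None:
        insights.append((opened_at, None))  # still active at the end of the data
    return insights


# Invented per-minute fault rates.
fault_rates = [0.01, 0.02, 0.09, 0.12, 0.08, 0.02, 0.01]
insights = track_insights(fault_rates, expected_max=0.05)
print(insights)
```

Here one insight opens at minute 2 and closes at minute 5, when the fault rate drops back into the expected range.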

AWS DevOps Guru

DevOps Guru uses machine learning to improve application availability by detecting anomalous behavior. How does it work? When it detects an anomaly, it generates an insight, which is a compilation of related anomalies within an analyzed resource.

There are two types of insights:

👉 Reactive Insights: Contain anomalies with recommendations, related metrics, and events.

👉 Proactive Insights: Tell you about issues that are predicted to affect your application in the future.

You can receive notifications when an issue arises by setting up an SNS notification topic and configuring an email address to be alerted.
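If you prefer code over the console, the notification setup could look roughly like this with boto3. The topic ARN and email address are placeholders, and `notification_requests` and `apply_requests` are helper names I made up. The snippet only builds the two requests; sending them needs boto3, AWS credentials, and an existing SNS topic.

```python
def notification_requests(topic_arn, email):
    """Build the request parameters for subscribing an email address to an SNS
    topic and registering that topic as a DevOps Guru notification channel."""
    return {
        "subscribe": {
            "TopicArn": topic_arn,
            "Protocol": "email",
            "Endpoint": email,
        },
        "add_notification_channel": {
            "Config": {"Sns": {"TopicArn": topic_arn}},
        },
    }


def apply_requests(requests):
    """Send both requests to AWS (not called in this sketch)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    boto3.client("sns").subscribe(**requests["subscribe"])
    boto3.client("devops-guru").add_notification_channel(
        **requests["add_notification_channel"]
    )


requests = notification_requests(
    topic_arn="arn:aws:sns:us-east-1:123456789012:devops-guru-alerts",  # placeholder
    email="oncall@example.com",  # placeholder
)
print(requests["add_notification_channel"])
```

Keep in mind that the recipient has to confirm an email subscription before SNS delivers anything, and the topic's access policy may also need to allow DevOps Guru to publish to it.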

Better Dev Experience

Amazon Q Developer Security Scanning

This tool helps write secure code and is designed to support developers during the development process. While AIOps concentrates on operational aspects post-deployment, Amazon Q Developer primarily targets the development phase. Both AI-powered tools enhance processes, but at different stages.

We always want to prevent vulnerabilities from reaching production, and this is where Amazon Q Developer can help. It offers a security scanning feature that scans your code so you can catch vulnerabilities early in the process. You can view the details of a finding and relevant information, and it can also suggest fixes for the issue.

As security policies evolve, the tool incorporates new detectors so the scans stay up to date.
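To give a feel for the kind of issue a code security scanner is meant to catch, here is a classic SQL injection and its fix. This is a generic example I'm adding for illustration, not output from Amazon Q Developer, and the table and function names are made up.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated straight into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Fixed: a parameterized query treats the input as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"  # malicious input
unsafe_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)
print(len(unsafe_rows), len(safe_rows))
```

The unsafe version returns every row in the table, while the parameterized version returns none. Catching this pattern while you are still writing the code is much cheaper than finding it in production.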

Closing Thoughts

And this is the last blog post in this 3-part series where I share my journey of obtaining the DevOps and AI on AWS Specialization certification 🎉💪

I learned a lot, and this specialization added tremendous value to me. I highly recommend it. Thanks for reading through. Sharing my journey with you all is a pleasure 🙌.

I'm happy to connect with you on LinkedIn. Feel free to send a DM and share your thoughts on this blog series.
