DEV Community

How I Built an AI-Powered Cloud Security Guardian using AWS — From Idea to Docker in 30 Days

By Disha Gupta | Cloud Security & GRC | https://www.linkedin.com/in/disha-gupta-6588102b9/

The Problem That Kept Me Up at Night

Every week I read another breach report. A misconfigured S3 bucket. An IAM user with AdministratorAccess left active. SSH port 22 open to 0.0.0.0/0. Security groups nobody has audited in six months.

The frustrating part? These aren't sophisticated zero-days. They're checklist failures. Things that should have been caught automatically.

So I decided to build the tool I wished existed — one that could scan any AWS account on demand, apply machine learning to risk-score every resource, detect anomalies in CloudTrail logs, and surface remediation advice in real time. I called it "AI Cloud Security Guardian".

Here's exactly how I built it, what each AWS service does in the architecture, and what I learned the hard way.

What the Platform Does

Before getting into the technical stack, here's what Guardian actually does when you use it:

  1. You log into the SOC-themed dashboard
  2. You enter your AWS IAM credentials — they're never stored, only used for that session
  3. Guardian connects to your AWS account via boto3 and discovers every EC2 instance, S3 bucket, IAM user, IAM role, and security group
  4. A rule engine runs 9 misconfiguration checks against every resource
  5. A Random Forest ML model scores each resource from 0.0 to 1.0 based on 6 security features
  6. An Isolation Forest model runs anomaly detection on CloudTrail logs
  7. Alerts are generated, prioritized by severity, and surfaced in the dashboard with AI-generated remediation steps

The entire backend is FastAPI + Python. The frontend is React + TypeScript + Tailwind. Everything runs in Docker. And it connects to real AWS accounts — not mocked data.

AWS Services Used and Why

AWS STS — GetCallerIdentity

The very first AWS API call Guardian makes is sts:GetCallerIdentity. Before scanning anything, it verifies that your credentials are valid and tells you exactly which account and IAM identity you're using.

sts = session.client("sts")
identity = sts.get_caller_identity()

Returns: Account ID, ARN, UserId

This is the cheapest possible AWS API call — it's always allowed for any valid credential, costs nothing, and gives us a fast-fail before running a 60-second full scan with bad keys. If this call fails, Guardian immediately tells you exactly why — invalid key, expired token, wrong region — instead of failing silently mid-scan.

What I learned: Always validate credentials with STS before any other AWS operation. It saves enormous debugging time and gives users clear error messages.
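That validation pattern can be sketched like this (the error-message wording and helper names are my illustration, not the project's exact code):

```python
from typing import Dict

def explain_sts_error(code: str) -> str:
    """Map common STS error codes to user-facing messages (wording illustrative)."""
    messages: Dict[str, str] = {
        "InvalidClientTokenId": "Access Key ID is invalid or has been deactivated",
        "SignatureDoesNotMatch": "Secret Access Key does not match this Access Key ID",
        "ExpiredToken": "Temporary session token has expired",
    }
    return messages.get(code, f"Credential check failed: {code}")

def validate_credentials(access_key_id: str, secret_access_key: str,
                         region: str = "us-east-1") -> dict:
    """Fast-fail with sts:GetCallerIdentity before running a full scan."""
    import boto3  # imported here so the helper above stays dependency-free
    from botocore.exceptions import ClientError

    session = boto3.Session(
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name=region,
    )
    try:
        identity = session.client("sts").get_caller_identity()
    except ClientError as exc:
        raise ValueError(explain_sts_error(exc.response["Error"]["Code"])) from exc
    return {"account": identity["Account"], "arn": identity["Arn"],
            "user_id": identity["UserId"]}
```

Surfacing the mapped message instead of the raw botocore exception is what turns "failing silently mid-scan" into a clear one-line error for the user.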

Amazon EC2 — DescribeInstances + DescribeSecurityGroups

For EC2, Guardian uses two boto3 calls:

DescribeInstances discovers every running instance and collects:

  • Instance ID, type, and state
  • Public IP (is it internet-exposed?)
  • Associated security group IDs
  • IAM instance profile (does it have proper permissions?)
  • Key pair name (is access documented?)

DescribeSecurityGroups checks every inbound rule for the most dangerous misconfiguration in AWS — port exposure to 0.0.0.0/0:

def _check_open_to_world(rules):
    for rule in rules:
        for ipv4 in rule.get("IpRanges", []):
            if ipv4.get("CidrIp") == "0.0.0.0/0":
                return True  # CRITICAL finding
    return False

When Guardian finds SSH (port 22) or RDP (port 3389) open to the entire internet, it fires a Critical alert immediately. This single check has caught the most serious findings in real account scans.

Detection rules triggered:

  • SG_001 — Open to 0.0.0.0/0 (Critical)
  • EC2_001 — Public IP with no IAM instance profile (Medium)
  • EC2_002 — Running instance with no key pair (Low)
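Going one step further than the generic open-to-world check, the SSH/RDP escalation can be written as a small pure function. This is my sketch; the field names match the IpPermissions shape returned by ec2:DescribeSecurityGroups:

```python
def risky_ports_open_to_world(permissions):
    """Return which sensitive ports (22 SSH, 3389 RDP) the inbound rules
    expose to 0.0.0.0/0. `permissions` is a security group's IpPermissions list."""
    sensitive = {22, 3389}
    exposed = set()
    for rule in permissions:
        if not any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])):
            continue
        lo, hi = rule.get("FromPort"), rule.get("ToPort")
        if lo is None:  # IpProtocol "-1" rules carry no port range: everything is open
            exposed |= sensitive
        else:
            exposed |= {p for p in sensitive if lo <= p <= hi}
    return exposed
```

Any non-empty result would justify an immediate Critical alert; everything else open to the world still trips the general SG_001 rule.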

Amazon S3 — Multi-API Bucket Analysis

S3 is where many of the most publicized data breaches start. Guardian runs five separate API calls per bucket:

| API Call | What We Check | Severity if Missing |
| --- | --- | --- |
| get_public_access_block | All 4 Block Public Access flags | Critical |
| get_bucket_encryption | SSE-S3 or SSE-KMS enabled | High |
| get_bucket_logging | Access logs configured | Medium |
| get_bucket_versioning | Object versioning enabled | Info |
| get_bucket_location | Region (for context) | n/a |

The public access check is the most important. AWS has four separate flags for blocking public access (BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets). Guardian checks that ALL four are enabled — if even one is False, the bucket is marked as potentially public:

fully_blocked = all([
    cfg.get("BlockPublicAcls", False),
    cfg.get("IgnorePublicAcls", False),
    cfg.get("BlockPublicPolicy", False),
    cfg.get("RestrictPublicBuckets", False),
])
data["is_public"] = not fully_blocked

What I learned: The absence of a public access block configuration is different from having it set to False. If get_public_access_block throws a NoSuchPublicAccessBlockConfiguration exception, the bucket has no protection at all — Guardian treats that as is_public = True.
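Here's a sketch of how I'd encode that rule, normalizing the missing-configuration exception to None before the flag check (the helper names are illustrative):

```python
REQUIRED_FLAGS = ("BlockPublicAcls", "IgnorePublicAcls",
                  "BlockPublicPolicy", "RestrictPublicBuckets")

def flags_fully_block(cfg):
    """cfg is the PublicAccessBlockConfiguration dict, or None when
    get_public_access_block raised NoSuchPublicAccessBlockConfiguration."""
    if cfg is None:
        return False  # no configuration at all means no protection
    return all(cfg.get(flag, False) for flag in REQUIRED_FLAGS)

def fetch_public_access_block(s3_client, bucket):
    """Normalize the 'no such configuration' error to None (sketch)."""
    from botocore.exceptions import ClientError
    try:
        resp = s3_client.get_public_access_block(Bucket=bucket)
        return resp["PublicAccessBlockConfiguration"]
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
            return None
        raise
```

With this split, `is_public` is simply `not flags_fully_block(fetch_public_access_block(s3, bucket))`, and the "absent vs False" distinction is handled in exactly one place.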

AWS IAM — Privilege and MFA Analysis

IAM scanning is where Guardian finds the highest-severity issues. Three API calls cover the critical checks:

ListAttachedUserPolicies — checks every user for AdministratorAccess. One IAM user with full admin and no MFA is game over if their credentials leak.

ListMFADevices — for any user with console access, Guardian checks if MFA is enabled. No MFA on a console user is an automatic High finding.

GetLoginProfile — determines if a user has console access at all. Service accounts should never have console passwords.

The most dangerous combination in AWS:

Console access + AdministratorAccess + no MFA

if user.has_console_access and user.is_admin and not user.has_mfa:
    # This is a three-alarm fire
    generate_critical_alert(user)

The IAM scan also covers roles — any role with AdministratorAccess attached gets flagged as High, because a compromised EC2 instance or Lambda function with that role has unlimited blast radius.

Detection rules triggered:

  • IAM_001 — User with AdministratorAccess (Critical)
  • IAM_002 — Console user without MFA (High)
  • IAM_003 — Role with AdministratorAccess (High)
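The user-level rules above can be sketched as a pure triage function plus a collector that builds its input from the three boto3 calls (the dict field names are my assumptions, not Guardian's actual schema):

```python
def triage_iam_user(user):
    """Apply the IAM_001/IAM_002 rules to one scanned user dict."""
    findings = []
    if user.get("is_admin"):
        findings.append(("IAM_001", "Critical"))
    if user.get("has_console_access") and not user.get("has_mfa"):
        findings.append(("IAM_002", "High"))
    return findings

def collect_user_flags(iam, user_name):
    """Build those flags from the three IAM API calls (sketch)."""
    attached = iam.list_attached_user_policies(UserName=user_name)["AttachedPolicies"]
    has_mfa = bool(iam.list_mfa_devices(UserName=user_name)["MFADevices"])
    try:
        iam.get_login_profile(UserName=user_name)
        console = True
    except iam.exceptions.NoSuchEntityException:
        console = False  # no login profile means no console password
    return {
        "is_admin": any(p["PolicyName"] == "AdministratorAccess" for p in attached),
        "has_mfa": has_mfa,
        "has_console_access": console,
    }
```

Keeping the rules in a pure function makes them trivially unit-testable without mocking AWS.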

AWS CloudTrail — Log Analysis for Anomaly Detection

This is where the ML comes in. CloudTrail logs every API call made in your AWS account. Guardian ingests these events and runs them through an Isolation Forest — an unsupervised ML algorithm that identifies statistical outliers without needing labeled training data.

The feature extraction aggregates per-user behavior across a time window:

features = {
    "api_call_count": 1,     # volume of API calls
    "failed_logins": 1,      # AccessDenied errors
    "hour_of_day": 3,        # off-hours = suspicious
    "is_new_region": True,   # never seen this region before
    "bytes_transferred": 0,
    "unique_resources": 1,
}

Isolation Forest works by randomly partitioning the feature space. Anomalous data points — like a user suddenly making 5,000 API calls from a new region at 3 AM with 50 Access Denied errors — are isolated quickly because they're far from the normal distribution. The algorithm assigns an anomaly score where lower = more anomalous.

What Guardian flags:

  • Unusual API call volume (potential credential theft / cryptomining)
  • Access from a previously unseen AWS region (potential account takeover)
  • Off-hours access patterns (lateral movement)
  • Repeated AccessDenied errors (brute force / privilege escalation attempt)

Why Isolation Forest over supervised ML? Because you almost never have labeled "malicious" CloudTrail logs to train on. Isolation Forest requires no labels — it just learns what "normal" looks like and flags deviations. This is exactly how real SIEM tools work.
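Here's a minimal, self-contained sketch of that pipeline with scikit-learn. The "normal" activity is synthetic and the numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Feature order matches the dict above: api_call_count, failed_logins,
# hour_of_day, is_new_region, bytes_transferred, unique_resources.
normal_activity = np.column_stack([
    rng.integers(20, 80, 200),     # modest call volume
    rng.integers(0, 3, 200),       # a few AccessDenied errors at most
    rng.integers(9, 18, 200),      # business hours
    np.zeros(200, dtype=int),      # always a known region
    rng.integers(500, 5000, 200),  # normal data transfer
    rng.integers(1, 10, 200),      # a handful of resources touched
])

model = IsolationForest(contamination=0.01, random_state=42).fit(normal_activity)

# A user making 5,000 calls from a new region at 3 AM with 50 AccessDenied errors
suspicious = np.array([[5000, 50, 3, 1, 500000, 120]])
is_outlier = model.predict(suspicious)[0] == -1  # -1 marks an anomaly
```

No labels are needed anywhere in this snippet, which is the whole argument for Isolation Forest over a supervised classifier here.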

The ML Risk Scoring Engine

Beyond rule-based detection, every resource gets a continuous risk score from 0.0 to 1.0 from a Random Forest classifier.

The model uses 6 features derived from each resource:

FEATURES = [
    "public_access",        # 0/1 — internet-exposed?
    "open_ports",           # count of world-open inbound rules
    "encryption_enabled",   # 0/1 — data encrypted at rest?
    "iam_privilege_level",  # 0=none, 1=read, 2=write, 3=admin
    "mfa_enabled",          # 0/1 — MFA enforced?
    "logging_enabled",      # 0/1 — audit trail active?
]

The model is trained on synthetic data at startup and saved to disk with joblib. In a production deployment with real historical findings, you'd replace the synthetic training data with actual labeled security findings from past scans — making the model progressively more accurate with each scan.
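Here's a sketch of what that startup training could look like. The synthetic data and the labeling heuristic are my invention, purely to show the shape of the pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 500 synthetic resources over the 6 features listed above
X = np.column_stack([
    rng.integers(0, 2, 500),  # public_access
    rng.integers(0, 4, 500),  # open_ports
    rng.integers(0, 2, 500),  # encryption_enabled
    rng.integers(0, 4, 500),  # iam_privilege_level
    rng.integers(0, 2, 500),  # mfa_enabled
    rng.integers(0, 2, 500),  # logging_enabled
])
# Illustrative labeling heuristic: "risky" when internet-exposed and unencrypted
y = ((X[:, 0] == 1) & (X[:, 2] == 0)).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# predict_proba's positive-class column is the 0.0-1.0 risk score
score = model.predict_proba([[1, 3, 0, 3, 0, 0]])[0, 1]
# joblib.dump(model, "risk_model.joblib")  # persisted at startup, per the text
```

Swapping `y` for real labeled findings from past scans is the only change needed to go from synthetic to production training.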

Risk levels:

  • Critical — score ≥ 0.75
  • High — score ≥ 0.55
  • Medium — score ≥ 0.35
  • Low — score ≥ 0.15
  • Minimal — score < 0.15

Security Architecture Decisions

Building a tool that handles AWS credentials forced me to think carefully about security at every layer.

Credentials Never Touch Storage

The most important design decision: AWS credentials are never stored anywhere. Not in the database. Not in logs. Not in the browser's localStorage. They exist only in:

  1. The user's browser state (React useState) for the duration of the modal
  2. The HTTP request body while in transit
  3. Python function parameters during the scan
  4. Cleared immediately after the scan completes

# After the scan completes, credentials go out of scope and are GC'd
def run_full_scan_with_credentials(access_key_id, secret_access_key, ...):
    # ... scan happens ...
    return result
# access_key_id and secret_access_key are never written anywhere

The backend logs only the key prefix (AKIA...8chars...) for debugging — never the full key or secret.
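A sketch of that masking, with a format that is my illustration rather than Guardian's exact log output:

```python
def mask_access_key(key_id: str) -> str:
    """Keep only the first 8 characters of an Access Key ID for log output."""
    if len(key_id) < 12:
        return "****"  # too short to be a real key; log nothing useful
    return key_id[:8] + "************"
```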

JWT in Memory, Not localStorage

The dashboard JWT token lives in a module-level JavaScript variable, not localStorage or sessionStorage. This keeps the token out of the storage APIs that XSS payloads most commonly harvest, at the cost of losing the session on page refresh (acceptable for a security tool).

// In-memory only — XSS cannot read this via document.cookie or localStorage
let _accessToken: string | null = null

An auto-logout timer is set from the JWT's exp claim with a 30-second buffer. When the token is about to expire, the user is automatically logged out.

Input Validation at Every Layer

  • Frontend: Regex validation on the Access Key ID format, length checks on all fields
  • Backend: Pydantic field_validator on every credential field before any AWS call
  • JSON parsing: safeJsonParse() blocks __proto__ and constructor keys to prevent prototype pollution in user-submitted log data

The Tech Stack

Backend:

  • FastAPI — async Python web framework, auto-generates OpenAPI docs
  • SQLAlchemy + SQLite (dev) / PostgreSQL (prod) — ORM for findings storage
  • boto3 — AWS SDK, all credential operations
  • scikit-learn — Random Forest (risk scoring) + Isolation Forest (anomaly detection)
  • Pydantic v2 — request validation and settings management
  • JWT via python-jose — stateless authentication

Frontend:

  • React 18 + TypeScript — component framework
  • Tailwind CSS — utility-first styling with custom SOC terminal design tokens
  • React Query (TanStack) — server state management with caching
  • Recharts — risk score visualization (bar charts, donut charts, radar charts)
  • Axios — HTTP client with request/response interceptors
  • DOMPurify — XSS sanitization for any server-returned strings

DevOps:

  • Docker — multi-stage builds (builder → slim runtime) for both services
  • Docker Compose — orchestrates PostgreSQL + FastAPI + Nginx as one stack
  • Kubernetes — 7 manifests covering namespace, secrets, deployments, ingress, and HPA autoscaling
  • Nginx — reverse proxy in the frontend container, eliminates CORS entirely in production

Challenges and What I Actually Learned

Challenge 1: The Variable Name Collision That Caused Every Scan to 500

For days, every scan attempt returned a 500 Internal Server Error. The backend logs showed a TypeError: 'bool' object is not callable. After hours of debugging, I found it:

# BROKEN — parameter named scan_security_groups shadows the function
def run_full_scan(scan_security_groups: bool = True):
    ...
    "security_groups": scan_security_groups(session)  # calling a bool!

The function scan_security_groups() and the boolean parameter scan_security_groups had the same name. Python used the parameter, not the function. Fixed by prefixing all internal scanner functions with _do_:

"security_groups": _do_scan_security_groups(session) if scan_security_groups else []

Lesson: In Python, function parameters shadow module-level names within their scope. Name your parameters explicitly to avoid conflicts with functions they might call.
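The whole bug fits in a few lines if you want to reproduce it in isolation:

```python
def scan_security_groups(session):
    """Module-level scanner function (stand-in for the real one)."""
    return ["sg-0123456789"]

def run_full_scan(scan_security_groups: bool = True):
    # Inside this scope, the name resolves to the bool parameter, not the
    # module-level function, so the "call" below raises TypeError.
    try:
        return {"security_groups": scan_security_groups("session")}
    except TypeError as exc:
        return str(exc)

result = run_full_scan()  # the TypeError message, not a scan result
```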

Challenge 2: CORS That Wasn't Actually CORS

The frontend was getting blocked by CORS policy — but the backend had allow_origins=["*"] set. After wasting an afternoon, I realized the issue: FastAPI's CORSMiddleware with allow_origins=["*"] is incompatible with allow_credentials=True. Setting both is illegal per the CORS spec and FastAPI silently breaks the middleware.

The final fix wasn't even CORS middleware — it was switching to a Vite proxy in development. The browser calls localhost:5173/api/scan/aws, Vite forwards it to localhost:8000/scan/aws server-side. The browser never makes a cross-origin request. CORS doesn't apply.

Lesson: The right fix for CORS in development is a proxy, not CORS headers. Save CORS configuration for production where you actually need it.

Challenge 3: TypeScript Strict Mode vs Docker Build

The code compiled fine locally with VS Code's TypeScript server being lenient. But the Docker build ran tsc in strict mode and found 15 errors — unused parameters, import.meta.env type issues, missing module declarations, type assertion errors.

The fix was a combination of:

  1. Setting "strict": false and "noUnusedLocals": false in tsconfig.json for the build
  2. Accessing import.meta.env via (import.meta as any).env to bypass the strict type check
  3. Removing "references" from tsconfig.json so the build didn't look for tsconfig.node.json inside the Docker container

Lesson: Always test your Docker build on CI before you think you're done. Local TypeScript compilation and Docker-in-builder-stage compilation can behave very differently.

The Minimum IAM Policy

For anyone who wants to scan their own account, here's the exact minimum permission set needed:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:GetCallerIdentity",
        "ec2:DescribeInstances",
        "ec2:DescribeSecurityGroups",
        "s3:ListAllMyBuckets",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketEncryption",
        "s3:GetBucketLogging",
        "iam:ListUsers",
        "iam:ListRoles",
        "iam:ListMFADevices",
        "iam:ListAttachedUserPolicies",
        "iam:GetLoginProfile",
        "cloudtrail:LookupEvents"
      ],
      "Resource": "*"
    }
  ]
}

Create a dedicated IAM user with only this policy. Never use root credentials or your personal admin account.

The full project — backend, frontend, Docker, and Kubernetes manifests — is on GitHub: https://github.com/Dianger16/AWS-CLOUD-SOC.git

Stack summary: FastAPI · boto3 · scikit-learn · React · TypeScript · Tailwind · Docker · Kubernetes

If you're working on cloud security, GRC, or DevSecOps and want to collaborate or discuss the architecture, I'm always open to connecting on LinkedIn: https://www.linkedin.com/in/disha-gupta-6588102b9/
