How I Built an AI-Powered Cloud Security Guardian — From Idea to Docker in 30 Days
By Disha Gupta | Cloud Security & GRC | https://www.linkedin.com/in/disha-gupta-6588102b9/
The Problem That Kept Me Up at Night
Every week I read another breach report. Misconfigured S3 bucket. IAM user with Administrator Access left active. SSH port 22 open to 0.0.0.0/0. Security groups that nobody audited in six months.
The frustrating part? These aren't sophisticated zero-days. They're checklist failures. Things that should have been caught automatically.
So I decided to build the tool I wished existed — one that could scan any AWS account on demand, apply machine learning to risk-score every resource, detect anomalies in CloudTrail logs, and surface remediation advice in real time. I called it "AI Cloud Security Guardian".
Here's exactly how I built it, what each AWS service does in the architecture, and what I learned the hard way.
What the Platform Does
Before getting into the technical stack, here's what Guardian actually does when you use it:
- You log into the SOC-themed dashboard
- You enter your AWS IAM credentials — they're never stored, only used for that session
- Guardian connects to your AWS account via boto3 and discovers every EC2 instance, S3 bucket, IAM user, IAM role, and security group
- A rule engine runs 9 misconfiguration checks against every resource
- A Random Forest ML model scores each resource from 0.0 to 1.0 based on 6 security features
- An Isolation Forest model runs anomaly detection on CloudTrail logs
- Alerts are generated, prioritized by severity, and surfaced in the dashboard with AI-generated remediation steps
The entire backend is FastAPI + Python. The frontend is React + TypeScript + Tailwind. Everything runs in Docker. And it connects to real AWS accounts — not mocked data.
AWS Services Used and Why
AWS STS — GetCallerIdentity
The very first AWS API call Guardian makes is sts:GetCallerIdentity. Before scanning anything, it verifies that your credentials are valid and tells you exactly which account and IAM identity you're using.
sts = session.client("sts")
identity = sts.get_caller_identity()
Returns: Account ID, ARN, UserId
This is the cheapest possible AWS API call — it's always allowed for any valid credential, costs nothing, and gives us a fast-fail before running a 60-second full scan with bad keys. If this call fails, Guardian immediately tells you exactly why — invalid key, expired token, wrong region — instead of failing silently mid-scan.
What I learned: Always validate credentials with STS before any other AWS operation. It saves enormous debugging time and gives users clear error messages.
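To make those clear error messages concrete, here's a minimal sketch (not Guardian's actual code) that maps the error codes a failed sts.get_caller_identity() call can raise to user-facing explanations. The error-code strings are real AWS STS codes; the message wording and the helper name are mine:

```python
# Hypothetical helper: translate STS error codes into actionable messages.
# In the real scanner this would sit in the `except ClientError` branch
# around sts.get_caller_identity(), reading exc.response["Error"]["Code"].
STS_FAILURES = {
    "InvalidClientTokenId": "The access key ID does not exist or is deactivated.",
    "SignatureDoesNotMatch": "The secret access key does not match this key ID.",
    "ExpiredToken": "The session token has expired; mint new temporary credentials.",
}

def explain_sts_failure(error_code: str) -> str:
    """Return a clear message for a failed get_caller_identity call."""
    return STS_FAILURES.get(
        error_code, f"Credential validation failed with STS error: {error_code}"
    )
```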
Amazon EC2 — DescribeInstances + DescribeSecurityGroups
For EC2, Guardian uses two boto3 calls:
DescribeInstances discovers every running instance and collects:
- Instance ID, type, and state
- Public IP (is it internet-exposed?)
- Associated security group IDs
- IAM instance profile (does it have proper permissions?)
- Key pair name (is access documented?)
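As a sketch of what that collection step looks like, here's a pure function (the function name is hypothetical, but the dict keys match the real DescribeInstances response shape) that pulls the security-relevant fields out of one instance record:

```python
def summarize_instance(instance: dict) -> dict:
    """Extract the security-relevant fields from one DescribeInstances record."""
    return {
        "instance_id": instance["InstanceId"],
        "instance_type": instance.get("InstanceType"),
        "state": instance.get("State", {}).get("Name"),
        # A public IP means the instance is reachable from the internet.
        "public_ip": instance.get("PublicIpAddress"),
        "security_group_ids": [sg["GroupId"] for sg in instance.get("SecurityGroups", [])],
        # No instance profile -> the instance can't use role-based permissions.
        "has_instance_profile": "IamInstanceProfile" in instance,
        "key_name": instance.get("KeyName"),
    }
```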
DescribeSecurityGroups checks every inbound rule for the most dangerous misconfiguration in AWS — port exposure to 0.0.0.0/0:
def _check_open_to_world(rules):
    for rule in rules:
        for ipv4 in rule.get("IpRanges", []):
            if ipv4.get("CidrIp") == "0.0.0.0/0":
                return True  # CRITICAL finding
    return False
When Guardian finds SSH (port 22) or RDP (port 3389) open to the entire internet, it fires a Critical alert immediately. This single check has caught the most serious findings in real account scans.
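The severity decision also depends on which port is exposed. Here's a simplified sketch of that logic, assumed from the prose above (a real implementation would also handle IPv6 ranges and -1 "all traffic" rules):

```python
# Ports whose exposure to 0.0.0.0/0 is treated as Critical.
CRITICAL_PORTS = {22: "SSH", 3389: "RDP"}

def rule_severity(rule: dict) -> str:
    """Classify one inbound rule (DescribeSecurityGroups IpPermissions shape)."""
    open_to_world = any(
        r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
    )
    if not open_to_world:
        return "ok"
    from_port = rule.get("FromPort")
    to_port = rule.get("ToPort", from_port)
    # A world-open rule whose port range covers SSH or RDP is Critical.
    if from_port is not None and any(
        from_port <= p <= to_port for p in CRITICAL_PORTS
    ):
        return "critical"
    return "high"  # world-open, but not an admin port
```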
Detection rules triggered:
- SG_001 — Open to 0.0.0.0/0 (Critical)
- EC2_001 — Public IP with no IAM instance profile (Medium)
- EC2_002 — Running instance with no key pair (Low)
Amazon S3 — Multi-API Bucket Analysis
S3 is where most data breaches start. Guardian runs five separate API calls per bucket:
| API Call | What We Check | Severity if Missing |
|---|---|---|
| get_public_access_block | All 4 Block Public Access flags | Critical |
| get_bucket_encryption | SSE-S3 or SSE-KMS enabled | High |
| get_bucket_logging | Access logs configured | Medium |
| get_bucket_versioning | Object versioning enabled | Info |
| get_bucket_location | Region (for context) | — |
The public access check is the most important. AWS has four separate flags for blocking public access (BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets). Guardian checks that ALL four are enabled — if even one is False, the bucket is marked as potentially public:
fully_blocked = all([
    cfg.get("BlockPublicAcls", False),
    cfg.get("IgnorePublicAcls", False),
    cfg.get("BlockPublicPolicy", False),
    cfg.get("RestrictPublicBuckets", False),
])
data["is_public"] = not fully_blocked
What I learned: The absence of a public access block configuration is different from having it set to False. If get_public_access_block throws a NoSuchPublicAccessBlockConfiguration exception, the bucket has no protection at all — Guardian treats that as is_public = True.
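That lesson can be captured in one function. This sketch (the function name is mine) accepts the PublicAccessBlockConfiguration dict, or None for the case where get_public_access_block raised NoSuchPublicAccessBlockConfiguration:

```python
from typing import Optional

def bucket_is_public(pab_config: Optional[dict]) -> bool:
    """True if the bucket lacks full Block Public Access protection.

    pab_config is the PublicAccessBlockConfiguration dict returned by
    get_public_access_block, or None when that call raised
    NoSuchPublicAccessBlockConfiguration (no configuration at all).
    """
    if pab_config is None:
        return True  # no configuration == no protection at all
    # All four flags must be True; a single False leaves an exposure path.
    return not all(
        pab_config.get(flag, False)
        for flag in (
            "BlockPublicAcls",
            "IgnorePublicAcls",
            "BlockPublicPolicy",
            "RestrictPublicBuckets",
        )
    )
```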
AWS IAM — Privilege and MFA Analysis
IAM scanning is where Guardian finds the highest-severity issues. Three API calls cover the critical checks:
ListAttachedUserPolicies — checks every user for AdministratorAccess. One IAM user with full admin and no MFA is game over if their credentials leak.
ListMFADevices — for any user with console access, Guardian checks if MFA is enabled. No MFA on a console user is an automatic High finding.
GetLoginProfile — determines if a user has console access at all. Service accounts should never have console passwords.
The most dangerous combination in AWS:
Console access + AdministratorAccess + no MFA
if user.has_console_access and user.is_admin and not user.has_mfa:
    # This is a three-alarm fire
    generate_critical_alert(user)
The IAM scan also covers roles — any role with AdministratorAccess attached gets flagged as High, because a compromised EC2 instance or Lambda function with that role has unlimited blast radius.
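A sketch of that role check (hypothetical function name; the policy dicts follow the real list_attached_role_policies response shape). Matching on the managed-policy ARN rather than the display name avoids false positives from customer-managed policies that merely reuse the name:

```python
# The AWS-managed AdministratorAccess policy has a fixed, well-known ARN.
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"

def role_has_admin(attached_policies: list) -> bool:
    """attached_policies: the AttachedPolicies list from list_attached_role_policies."""
    return any(p.get("PolicyArn") == ADMIN_POLICY_ARN for p in attached_policies)
```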
Detection rules triggered:
- IAM_001 — User with AdministratorAccess (Critical)
- IAM_002 — Console user without MFA (High)
- IAM_003 — Role with AdministratorAccess (High)
AWS CloudTrail — Log Analysis for Anomaly Detection
This is where the ML comes in. CloudTrail logs every API call made in your AWS account. Guardian ingests these events and runs them through an Isolation Forest — an unsupervised ML algorithm that identifies statistical outliers without needing labeled training data.
The feature extraction aggregates per-user behavior across a time window:
features = {
    "api_call_count": 1,      # volume of API calls
    "failed_logins": 1,       # AccessDenied errors
    "hour_of_day": 3,         # off-hours = suspicious
    "is_new_region": True,    # never seen this region before
    "bytes_transferred": 0,
    "unique_resources": 1,
}
Isolation Forest works by randomly partitioning the feature space. Anomalous data points — like a user suddenly making 5,000 API calls from a new region at 3 AM with 50 Access Denied errors — are isolated quickly because they're far from the normal distribution. The algorithm assigns an anomaly score where lower = more anomalous.
What Guardian flags:
- Unusual API call volume (potential credential theft / cryptomining)
- Access from a previously unseen AWS region (potential account takeover)
- Off-hours access patterns (lateral movement)
- Repeated AccessDenied errors (brute force / privilege escalation attempt)
Why Isolation Forest over supervised ML? Because you almost never have labeled "malicious" CloudTrail logs to train on. Isolation Forest requires no labels — it just learns what "normal" looks like and flags deviations. This is exactly how real SIEM tools work.
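Here's a minimal, self-contained sketch of that idea, assuming scikit-learn and NumPy are installed. The numbers are synthetic, not real CloudTrail data: train on mostly-normal per-user feature rows, then score the 3 AM outlier scenario from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic "normal" behavior. Columns: api_call_count, failed_logins, hour_of_day.
normal = np.column_stack([
    rng.normal(120, 30, 500),  # modest API call volume
    rng.poisson(1, 500),       # a failed login here and there
    rng.normal(13, 3, 500),    # activity clustered around business hours
])

# Unsupervised: no labels, just learn what "normal" looks like.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# The credential-theft scenario: huge volume, many denials, 3 AM.
suspicious = np.array([[5000, 50, 3]])
typical = np.array([[110, 1, 14]])

print(model.predict(suspicious))            # predict() returns -1 for anomalies
print(model.decision_function(suspicious))  # lower score = more anomalous
```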
The ML Risk Scoring Engine
Beyond rule-based detection, every resource gets a continuous risk score from 0.0 to 1.0 from a Random Forest classifier.
The model uses 6 features derived from each resource:
FEATURES = [
    "public_access",        # 0/1 — internet-exposed?
    "open_ports",           # count of world-open inbound rules
    "encryption_enabled",   # 0/1 — data encrypted at rest?
    "iam_privilege_level",  # 0=none, 1=read, 2=write, 3=admin
    "mfa_enabled",          # 0/1 — MFA enforced?
    "logging_enabled",      # 0/1 — audit trail active?
]
The model is trained on synthetic data at startup and saved to disk with joblib. In a production deployment with real historical findings, you'd replace the synthetic training data with actual labeled security findings from past scans — making the model progressively more accurate with each scan.
Risk levels:
- Critical — score ≥ 0.75
- High — score ≥ 0.55
- Medium — score ≥ 0.35
- Low — score ≥ 0.15
- Minimal — score < 0.15
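The thresholds above translate directly into a small mapping function (a sketch; Guardian's actual implementation may differ):

```python
def risk_level(score: float) -> str:
    """Map a model score in [0.0, 1.0] to the severity buckets listed above."""
    if score >= 0.75:
        return "Critical"
    if score >= 0.55:
        return "High"
    if score >= 0.35:
        return "Medium"
    if score >= 0.15:
        return "Low"
    return "Minimal"
```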
Security Architecture Decisions
Building a tool that handles AWS credentials forced me to think carefully about security at every layer.
Credentials Never Touch Storage
The most important design decision: AWS credentials are never stored anywhere. Not in the database. Not in logs. Not in the browser's localStorage. They exist only in:
- The user's browser state (React useState) for the duration of the credentials modal
- The HTTP request body while in transit
- Python function parameters during the scan
Credentials are cleared immediately after the scan completes.
# After the scan completes, credentials go out of scope and are GC'd
def run_full_scan_with_credentials(access_key_id, secret_access_key, ...):
    ...  # scan happens
    return result
# access_key_id and secret_access_key are never written anywhere
The backend logs only the key prefix (AKIA...8chars...) for debugging — never the full key or secret.
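Masking is easy to get subtly wrong (logging the secret by accident, or keeping too much of the key). A sketch of the prefix-only approach described above, with the "8 chars" cutoff assumed from the text and the helper name invented for illustration:

```python
def mask_access_key(access_key_id: str) -> str:
    """Keep only the first 8 characters for debug logs; the secret key is never logged."""
    if len(access_key_id) <= 8:
        return "****"  # too short to be a real key; hide it entirely
    return access_key_id[:8] + "..."
```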
JWT in Memory, Not localStorage
The dashboard JWT token lives in a module-level JavaScript variable — not localStorage or sessionStorage. This prevents XSS attacks from stealing the token, at the cost of losing the session on page refresh (acceptable for a security tool).
// In-memory only — XSS cannot read this via document.cookie or localStorage
let _accessToken: string | null = null
An auto-logout timer is set from the JWT's exp claim with a 30-second buffer. When the token is about to expire, the user is automatically logged out.
Input Validation at Every Layer
- Frontend: regex validation on the Access Key ID format, length checks on all fields
- Backend: Pydantic field_validator on every credential field before any AWS call
- JSON parsing: safeJsonParse() blocks __proto__ and constructor keys to prevent prototype pollution in user-submitted log data
The Tech Stack
Backend:
- FastAPI — async Python web framework, auto-generates OpenAPI docs
- SQLAlchemy + SQLite (dev) / PostgreSQL (prod) — ORM for findings storage
- boto3 — AWS SDK, all credential operations
- scikit-learn — Random Forest (risk scoring) + Isolation Forest (anomaly detection)
- Pydantic v2 — request validation and settings management
- JWT via python-jose — stateless authentication
Frontend:
- React 18 + TypeScript — component framework
- Tailwind CSS — utility-first styling with custom SOC terminal design tokens
- React Query (TanStack) — server state management with caching
- Recharts — risk score visualization (bar charts, donut charts, radar charts)
- Axios — HTTP client with request/response interceptors
- DOMPurify — XSS sanitization for any server-returned strings
DevOps:
- Docker — multi-stage builds (builder → slim runtime) for both services
- Docker Compose — orchestrates PostgreSQL + FastAPI + Nginx as one stack
- Kubernetes — 7 manifests covering namespace, secrets, deployments, ingress, and HPA autoscaling
- Nginx — reverse proxy in the frontend container, eliminates CORS entirely in production
Challenges and What I Actually Learned
Challenge 1: The Variable Name Collision That Caused Every Scan to 500
For days, every scan attempt returned a 500 Internal Server Error. The backend logs showed TypeError: 'bool' object is not callable. After hours of debugging, I found it:
# BROKEN — the parameter named scan_security_groups shadows the function
def run_full_scan(scan_security_groups: bool = True):
    ...
    "security_groups": scan_security_groups(session)  # calling a bool!

The function scan_security_groups() and the boolean parameter scan_security_groups had the same name. Python used the parameter, not the function. Fixed by prefixing all internal scanner functions with _do_:

"security_groups": _do_scan_security_groups(session) if scan_security_groups else []
Lesson: In Python, function parameters shadow module-level names within their scope. Name your parameters explicitly to avoid conflicts with functions they might call.
Challenge 2: CORS That Wasn't Actually CORS
The frontend was getting blocked by CORS policy — but the backend had allow_origins=["*"] set. After wasting an afternoon, I realized the issue: the CORS spec forbids Access-Control-Allow-Origin: * on requests that carry credentials, so combining allow_origins=["*"] with allow_credentials=True in FastAPI's CORSMiddleware quietly produces responses the browser rejects.
The final fix wasn't even CORS middleware — it was switching to a Vite proxy in development. The browser calls localhost:5173/api/scan/aws, Vite forwards it to localhost:8000/scan/aws server-side. The browser never makes a cross-origin request. CORS doesn't apply.
Lesson: The right fix for CORS in development is a proxy, not CORS headers. Save CORS configuration for production where you actually need it.
Challenge 3: TypeScript Strict Mode vs Docker Build
The code compiled fine locally with VS Code's TypeScript server being lenient. But the Docker build ran tsc in strict mode and found 15 errors — unused parameters, import.meta.env type issues, missing module declarations, type assertion errors.
The fix was a combination of:
- Setting "strict": false and "noUnusedLocals": false in tsconfig.json for the build
- Accessing import.meta.env via (import.meta as any).env to bypass the strict type check
- Removing "references" from tsconfig.json so the build didn't look for tsconfig.node.json inside the Docker container
Lesson: Always test your Docker build on CI before you think you're done. Local TypeScript compilation and Docker-in-builder-stage compilation can behave very differently.
The Minimum IAM Policy
For anyone who wants to scan their own account, here's the exact minimum permission set needed:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:GetCallerIdentity",
"ec2:DescribeInstances",
"ec2:DescribeSecurityGroups",
"s3:ListAllMyBuckets",
"s3:GetBucketPublicAccessBlock",
"s3:GetBucketEncryption",
"s3:GetBucketLogging",
"iam:ListUsers",
"iam:ListRoles",
"iam:ListMFADevices",
"iam:ListAttachedUserPolicies",
"iam:GetLoginProfile",
"cloudtrail:LookupEvents"
],
"Resource": "*"
}
]
}
Create a dedicated IAM user with only this policy. Never use root credentials or your personal admin account.
The full project — backend, frontend, Docker, and Kubernetes manifests — is on GitHub: https://github.com/Dianger16/AWS-CLOUD-SOC.git
Stack summary: FastAPI · boto3 · scikit-learn · React · TypeScript · Tailwind · Docker · Kubernetes
If you're working on cloud security, GRC, or DevSecOps and want to collaborate or discuss the architecture, I'm always open to connect: https://www.linkedin.com/in/disha-gupta-6588102b9/