Building a Cloud SIEM from Scratch with AWS Lambda and EventBridge

Antonio Lopez — Sat, 30 May 2026 23:35:23 +0000

How I built a real-time serverless security detection pipeline on AWS using CloudTrail, EventBridge, Lambda, DynamoDB, and SNS — and what broke along the way.

All source code for this project is on GitHub: aws-siem-detection-pipeline

Most cloud security tutorials show you how to turn on GuardDuty and call it a day. I wanted to understand what actually happens under the hood: how does a detection pipeline route an event, evaluate it, and fire an alert in real time? So I built one from scratch using AWS-native services and no managed threat detection, just CloudTrail, EventBridge, Lambda, DynamoDB, SNS, and some Python.

This is what I built, what broke, and what I'd do differently.

What I Was Trying to Detect

To keep the scope manageable, I focused on four categories of cloud risk that I found most interesting to detect:

Authentication abuse — brute force login attempts, root account usage
Privilege escalation — IAM policy attachments, unexpected role changes
Destructive infrastructure actions — EC2 terminations, S3 bucket deletions
Data exposure risk — public bucket policy changes

In practice, cloud environments generate enormous volumes of control plane activity. An account running routine operations can produce thousands of CloudTrail events per hour — legitimate console logins, routine IAM role changes, expected EC2 stop/start cycles, normal S3 bucket management. The same event types the pipeline watches also fire for benign reasons. Without something to filter, classify, and route that signal, you're just drowning in events that look identical to the malicious ones. The challenge isn't capturing the events — CloudTrail does that automatically. The challenge is separating real signals from noise.

Initial Design

My initial design mapped cleanly to AWS services:

Detection Need	AWS Service
Audit log source	CloudTrail
Real-time event routing	EventBridge
Detection logic	Lambda
State / alert persistence	DynamoDB
Alert delivery	SNS
Visualization	QuickSight

The idea was that CloudTrail would feed everything into EventBridge, an event pattern rule would filter for the API calls I cared about, Lambda would run the detection logic, and QuickSight would give me a dashboard to visualize the findings.

Building It

The EventBridge Rule

The first decision was what to actually watch. Not every CloudTrail event is relevant, and the account generates hundreds of API calls per hour from normal operations. The EventBridge rule acts as the intake filter: only the events that match the pattern get forwarded to Lambda, and everything else never reaches the function.

I settled on watching four AWS sources — aws.signin, aws.iam, aws.ec2, aws.s3 — and twelve specific event names:

Event	Source	Why it matters
`ConsoleLogin`	aws.signin	Every auth attempt flows through here. Failed logins feed the brute-force counter; unexpected successful logins can indicate account compromise.
`CreateAccessKey`	aws.iam	Programmatic credential creation is one of the most common ways attackers establish persistence after gaining initial access.
`AttachUserPolicy`	aws.iam	Directly granting permissions to a user is the clearest path to privilege escalation — and the event this demo pivots on.
`AttachRolePolicy`	aws.iam	Same signal on roles, which can have automated trust relationships or cross-account access that amplifies blast radius.
`StopInstances`	aws.ec2	Could be routine maintenance or the start of a disruption campaign. HIGH by default.
`TerminateInstances`	aws.ec2	Permanent and irreversible. CRITICAL by default.
`DeleteVolume`	aws.ec2	Direct data destruction. EBS volumes may hold data not covered by automated backups.
`DeleteSecurityGroup`	aws.ec2	Deleting a security group removes a whole set of firewall rules that other resources may rely on. Often a cleanup or precursor step in an attack.
`CreateBucket`	aws.s3	New buckets in unexpected regions can be exfiltration staging areas.
`DeleteBucket`	aws.s3	Data destruction — routed through allowlist logic to separate authorized admin cleanups from unauthorized deletions.
`PutBucketPolicy`	aws.s3	Bucket policies can grant public read access. The handler inspects the policy for `Principal: *` and escalates to CRITICAL if found.
`DeleteBucketPolicy`	aws.s3	Removing a bucket policy may leave the bucket relying solely on ACLs. Worth auditing every time.

This maps to the same concept as a Sigma rule's logsource block — Sigma is an open standard for writing detection rules that can be converted to work across different SIEM platforms. The logsource block defines what log source and event category a rule applies to before any conditions are evaluated. The EventBridge pattern serves the same purpose: define the scope of what you're ingesting before you write the detection logic.
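A minimal sketch of what such a pattern can look like, built as a Python dict and printed as JSON. The constant and function names are illustrative and may not match the repository; the sources and event names are the ones listed in the table above, and the second detail-type is the one console sign-in events arrive with.

```python
import json

# Illustrative names; the values come from the table above.
WATCHED_SOURCES = ["aws.signin", "aws.iam", "aws.ec2", "aws.s3"]
WATCHED_DETAIL_TYPES = [
    "AWS API Call via CloudTrail",
    "AWS Console Sign In via CloudTrail",
]
WATCHED_EVENT_NAMES = [
    "ConsoleLogin", "CreateAccessKey", "AttachUserPolicy", "AttachRolePolicy",
    "StopInstances", "TerminateInstances", "DeleteVolume", "DeleteSecurityGroup",
    "CreateBucket", "DeleteBucket", "PutBucketPolicy", "DeleteBucketPolicy",
]


def build_event_pattern():
    """Return an EventBridge event pattern as a plain dict."""
    return {
        "source": WATCHED_SOURCES,
        "detail-type": WATCHED_DETAIL_TYPES,
        "detail": {"eventName": WATCHED_EVENT_NAMES},
    }


if __name__ == "__main__":
    # The printed JSON can be pasted into the rule's event pattern editor.
    print(json.dumps(build_event_pattern(), indent=2))
```

Keeping the lists in code makes it easy to diff the rule against the handlers that actually exist, so an event name is never watched without something to process it.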

Here's what the rule looks like in the EventBridge console — the event pattern on the left and the Lambda target it routes to:

The Lambda Detection Engine

Lambda receives each filtered event and routes it to a handler function based on the event name. Each handler extracts the relevant fields from the CloudTrail detail object and builds an alert.
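The routing step can be sketched as a dictionary lookup from event name to handler. This is an assumption about the shape of the code, not a copy of lambda_handler.py: the handler names, the `_actor` helper, and the returned alert fields are hypothetical, and only two of the twelve events are wired up to keep the example short.

```python
def _actor(detail):
    """Best-effort actor name from a CloudTrail detail object (hypothetical helper)."""
    identity = detail.get("userIdentity", {})
    return identity.get("userName") or identity.get("type", "unknown")


def handle_terminate_instances(detail):
    # TerminateInstances is CRITICAL by default per the table above.
    return {"event": "TerminateInstances", "severity": "CRITICAL", "user": _actor(detail)}


def handle_stop_instances(detail):
    # StopInstances is HIGH by default per the table above.
    return {"event": "StopInstances", "severity": "HIGH", "user": _actor(detail)}


# Event name -> handler. Unlisted events fall through and produce no alert.
HANDLERS = {
    "TerminateInstances": handle_terminate_instances,
    "StopInstances": handle_stop_instances,
}


def lambda_handler(event, context):
    """Route one EventBridge-delivered CloudTrail event to its handler."""
    detail = event.get("detail", {})
    handler = HANDLERS.get(detail.get("eventName"))
    if handler is None:
        return None
    return handler(detail)
```

A lookup table keeps each detection isolated in its own function, which makes it straightforward to unit test a handler with a saved CloudTrail event.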

A few design decisions worth explaining:

Severity is dynamic, not static. AttachUserPolicy is HIGH by default, but if the policy ARN contains "Admin" it escalates to CRITICAL automatically. PutBucketPolicy is HIGH unless the policy grants public access (Principal: *), in which case it's CRITICAL. The severity reflects the actual risk of the specific action, not just the event type.

Every alert includes a recommended_action field. A severity label tells you how urgent something is. The recommended action tells the responder what to actually do — verify authorization, remove the policy, check for data loss, restore from backup. That distinction matters at 2am when you're triaging a live alert. A pipeline that only tells you "something happened" isn't much better than no pipeline at all.

Every alert includes a direct CloudTrail investigation link pre-filtered to the actor's username. Small thing, but it removes friction when you're triaging and want to pull the full event history for a specific user immediately.
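The dynamic severity rules described above can be sketched as two small functions. The function names are hypothetical, the bucket policy is assumed to be available as an already-parsed dict, and the extra check that the statement's Effect is Allow is my own addition rather than something the pipeline is confirmed to do.

```python
def _principal_is_public(principal):
    """True for Principal "*" or {"AWS": "*"} (also when "*" is inside a list)."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        values = principal.get("AWS", [])
        if isinstance(values, str):
            values = [values]
        return "*" in values
    return False


def policy_is_public(policy):
    """True if any Allow statement in the policy names a wildcard principal."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    return any(
        s.get("Effect") == "Allow" and _principal_is_public(s.get("Principal"))
        for s in statements
    )


def severity_for_attach_user_policy(policy_arn):
    """HIGH by default, CRITICAL when the policy ARN contains "Admin"."""
    return "CRITICAL" if "Admin" in policy_arn else "HIGH"


def severity_for_put_bucket_policy(policy):
    """HIGH by default, CRITICAL when the policy grants public access."""
    return "CRITICAL" if policy_is_public(policy) else "HIGH"
```

A substring match on "Admin" is deliberately blunt: it catches AdministratorAccess, but it would miss a custom policy that grants the same permissions under a different name.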

Here's the function overview showing EventBridge wired up as the trigger:

Stateful Brute Force Detection

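This was one of the more interesting engineering problems. Lambda functions are stateless: an execution environment may be reused between invocations, but nothing held in memory is guaranteed to survive, and concurrent invocations run in separate environments. So you can't just do counter += 1 in your function code and expect it to persist across calls.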

The solution is DynamoDB as the external state store. Each failed console login atomically increments a counter keyed by username using DynamoDB's ADD operation. A ttl attribute set 300 seconds into the future marks the record for automatic expiry once the 5-minute window closes, so no scheduled cleanup jobs are needed. DynamoDB deletes expired items in the background rather than at the exact expiry time, so a stale counter can outlive its window until that deletion runs. When the count hits 5, the brute-force alert fires and the counter resets.

The trade-off is latency and cost: every failed login requires a DynamoDB round trip. The ADD increment itself is atomic, but the threshold check and the reset are separate steps, so under a real brute force attack with many concurrent Lambda invocations you could see duplicate alerts or failed attempts lost to a reset. In a production system you'd address this with conditional writes or a purpose-built atomic counter service.

The actual implementation in lambda_handler.py:

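import time

import boto3

FAILED_LOGIN_THRESHOLD = 5
TTL_WINDOW = 300  # 5 minutes

# Assumed setup: table name taken from the architecture diagram below;
# send_alert is defined elsewhere in lambda_handler.py.
failed_logins_table = boto3.resource('dynamodb').Table('SIEM-failed-logins')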

def handle_failed_login(user, source_ip):
    current_time = int(time.time())
    expiry_time = current_time + TTL_WINDOW

    response = failed_logins_table.update_item(
        Key={'username': user},
        UpdateExpression='ADD fail_count :inc SET last_attempt = :ts, #ttl_attr = :ttl',
        ExpressionAttributeNames={'#ttl_attr': 'ttl'},
        ExpressionAttributeValues={
            ':inc': 1,
            ':ts': current_time,
            ':ttl': expiry_time
        },
        ReturnValues='UPDATED_NEW'
    )

    fail_count = int(response['Attributes']['fail_count'])

    if fail_count >= FAILED_LOGIN_THRESHOLD:
        send_alert(
            event_name='ConsoleLogin - Brute Force Detected',
            user=user,
            source_ip=source_ip,
            severity='HIGH',
            extra=f"Failed attempts: {fail_count} in last 5 minutes — MITRE T1110"
        )
        # Reset counter after alert fires
        failed_logins_table.update_item(
            Key={'username': user},
            UpdateExpression='SET fail_count = :zero',
            ExpressionAttributeValues={':zero': 0}
        )

Because every failed attempt rewrites ttl, the 5-minute window slides forward with each new failure instead of starting at the first one.
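Two gaps in the handler above can be narrowed with small helpers: concurrent invocations racing between the threshold check and the reset, and counters that are past their ttl but not yet deleted, since DynamoDB removes expired items in the background rather than at the exact expiry time. The sketch below is an assumption about how that could be done, not code from the repository. `build_conditional_reset` only builds the keyword arguments for `Table.update_item`; `ConditionExpression` is a standard parameter of that call.

```python
FAILED_LOGIN_THRESHOLD = 5


def counter_is_stale(item, now):
    """True if the item's ttl has passed, even if DynamoDB has not deleted it yet."""
    return int(item.get("ttl", 0)) <= now


def build_conditional_reset(user, threshold=FAILED_LOGIN_THRESHOLD):
    """Keyword arguments for Table.update_item that reset the counter
    only if it is still at or above the threshold."""
    return {
        "Key": {"username": user},
        "UpdateExpression": "SET fail_count = :zero",
        "ConditionExpression": "fail_count >= :threshold",
        "ExpressionAttributeValues": {":zero": 0, ":threshold": threshold},
    }
```

One way to use this is to run the conditional reset before sending the alert, as in `failed_logins_table.update_item(**build_conditional_reset(user))`, and to alert only when the reset succeeds. If two invocations both see a count at or above the threshold, the second reset fails its condition (boto3 raises a ClientError with code ConditionalCheckFailedException), which gives that invocation a signal to skip the duplicate alert.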

Here's the SIEM-logs table with real alert records being persisted — event name, severity, and username all visible:

The Approved User Allowlist for S3 Deletes

Bucket deletion is always high-impact — it's CRITICAL regardless of who does it. But not all deletions are malicious. Rather than suppressing alerts for known-good admins, I keep alerting on everyone, but with a different message depending on whether the actor is on the approved list.

An approved admin deleting a bucket still fires a CRITICAL alert with a message like "S3 bucket deleted — possible data destruction or resource cleanup." An unapproved user deleting a bucket fires a separate CRITICAL path with an explicit unauthorized flag in the message and a recommended_action that calls out the specific user and tells the responder to revoke their S3 permissions immediately.

This is the cloud counterpart of a Sigma filter for known-legitimate parent processes, with one difference: instead of suppressing the match, the allowlist changes the context attached to it. You're not turning off the signal; you're adding context that changes how the responder acts on it. Both paths write to DynamoDB, S3, and SNS. The difference is in the alert message and the recommended response.
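A minimal sketch of the two paths, assuming the allowlist is a plain set of usernames. `APPROVED_S3_DELETE_USERS` is the name the handler uses, but the entry in it, the function name, and the exact wording of the approved-path message are placeholders for illustration.

```python
# Placeholder entry; the real allowlist contents are not shown in this post.
APPROVED_S3_DELETE_USERS = {"admin-user"}


def build_delete_bucket_alert(user, bucket, approved=APPROVED_S3_DELETE_USERS):
    """Build a CRITICAL alert for DeleteBucket; the allowlist changes the context, not the severity."""
    if user in approved:
        return {
            "event": "DeleteBucket",
            "severity": "CRITICAL",
            "detail": f"S3 bucket '{bucket}' deleted by '{user}': "
                      "possible data destruction or resource cleanup.",
            "recommended_action": "Verify the deletion was planned. Restore from backup if needed.",
        }
    return {
        "event": "DeleteBucket - Unauthorised User",
        "severity": "CRITICAL",
        "detail": f"UNAUTHORISED: '{user}' attempted to delete S3 bucket '{bucket}'.",
        "recommended_action": f"Verify if intentional. Revoke S3 delete permissions for '{user}' "
                              "if unauthorised. Restore from backup if needed.",
    }
```

Both branches return the same keys, so the downstream code that writes to DynamoDB, S3, and SNS does not need to know which path was taken.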

SNS Alerts

Every alert is published to the siem-alerts SNS topic as a structured JSON payload containing the alert ID, timestamp, event name, user, source IP, severity, MITRE tag, recommended action, and a direct CloudTrail investigation link pre-filtered to the actor's username.

The SNS publish also includes MessageAttributes for severity and team. This lets SNS filter policies route alerts to different subscribers without them receiving every message. A security team subscriber can filter for severity = CRITICAL or team = security. An infra team subscriber can filter for team = infra. The team field is set per event type in util.py — IAM events go to security, EC2 events go to infra, S3 events go to cloud.
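The attribute and filter mechanics can be sketched without touching AWS. `build_publish_kwargs` returns the keyword arguments one would pass to the SNS client's `publish` call; the function names, the mapping of aws.signin to the security team, and the security fallback for unknown sources are assumptions, and `matches_filter` is only a local approximation of exact string matching, not the SNS implementation.

```python
import json

# IAM, EC2, and S3 mappings are from the text; aws.signin is an assumption.
TEAM_BY_SOURCE = {
    "aws.iam": "security",
    "aws.ec2": "infra",
    "aws.s3": "cloud",
    "aws.signin": "security",
}


def build_publish_kwargs(topic_arn, alert, source):
    """Keyword arguments for sns_client.publish, with severity and team attributes."""
    team = TEAM_BY_SOURCE.get(source, "security")
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(alert),
        "MessageAttributes": {
            "severity": {"DataType": "String", "StringValue": alert["severity"]},
            "team": {"DataType": "String", "StringValue": team},
        },
    }


# Subscription filter policies: attribute name -> list of accepted values.
INFRA_FILTER_POLICY = {"team": ["infra"]}
CRITICAL_FILTER_POLICY = {"severity": ["CRITICAL"]}


def matches_filter(policy, message_attributes):
    """Every attribute in the policy must match (AND); any listed value is accepted (OR)."""
    return all(
        message_attributes.get(key, {}).get("StringValue") in allowed
        for key, allowed in policy.items()
    )
```

Within a single filter policy, different attribute names are combined with AND while the values listed for one attribute are combined with OR, so a policy such as `{"severity": ["CRITICAL"], "team": ["security"]}` matches only messages that satisfy both conditions.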

Here's a real brute force alert as it arrives in the inbox — severity, MITRE tag, recommended action, and investigation link all included:

Demo — Privilege Escalation to Unauthorized S3 Deletion

Rather than just describing what the pipeline detects, here's the concrete attack chain I ran to validate it.

Setup: A sample account (final-project-user) with zero permissions, plus an admin account standing in for one an attacker has compromised.

Step 1 — Privilege escalation: Logged into the admin account and attached AmazonS3FullAccess to final-project-user. This simulates an attacker using a compromised admin account to stage a backdoor with destructive capabilities.

The pipeline immediately fires:

{
  "event": "AttachUserPolicy",
  "severity": "HIGH",
  "mitre": "T1078 - Privilege Escalation",
  "detail": "Policy 'AmazonS3FullAccess' attached to 'final-project-user' — possible privilege escalation",
  "recommended_action": "Confirm the actor is authorized. If unexpected, remove the policy and investigate."
}

The HIGH severity alert lands in the inbox right away:

Step 2 — Unauthorized bucket deletion: Signed into final-project-user and deleted the sample S3 bucket.

The pipeline fires a second alert, this time CRITICAL — because final-project-user is not in the APPROVED_S3_DELETE_USERS allowlist, the handler takes the unauthorized path:

{
  "event": "DeleteBucket - Unauthorised User",
  "severity": "CRITICAL",
  "mitre": "T1485 - Data Destruction / Unauthorized Access",
  "detail": "UNAUTHORISED: 'final-project-user' attempted to delete S3 bucket 'sample-bucket'.",
  "recommended_action": "Verify if intentional. Revoke S3 delete permissions if unauthorised. Restore from backup if needed."
}

Seconds later, the CRITICAL unauthorized deletion alert arrives:

Two events, two alerts, two different severity levels, both delivered within seconds.

Issues I Hit

Issue 1: The Brute Force Alert Never Fired

Early in testing, I tried to trigger the brute-force alert by intentionally failing console logins — and nothing happened. Lambda wasn't receiving the events at all. After checking CloudWatch logs and confirming Lambda was healthy, I traced the issue back to the EventBridge rule itself.

The initial rule was only configured with aws.iam, aws.ec2, and aws.s3 as sources. Console sign-in events don't come from any of those — they come from aws.signin with a detail-type of AWS Console Sign In via CloudTrail, which is separate from the generic AWS API Call via CloudTrail detail-type that covers IAM, EC2, and S3 actions. The rule was simply never matching ConsoleLogin events, so they were silently dropped before reaching Lambda.

Fix: Added aws.signin as a source and AWS Console Sign In via CloudTrail as a detail-type in the EventBridge rule. Once both were present, failed login events started flowing immediately.

Lesson: Test each event type individually before building the handler. EventBridge silently drops events that don't match the rule. There's no "event rejected" log, and a dead-letter queue won't help here because it only captures events that matched a rule but couldn't be delivered to the target. To see what is actually arriving, add a temporary catch-all rule that logs to CloudWatch. The absence of Lambda invocations doesn't mean Lambda is broken; it may mean the event never arrived.
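A quick local check would have caught this before deployment. The sketch below compares the rule before and after the fix against a trimmed ConsoleLogin event. The `matches` function is a deliberately simplified stand-in that only handles top-level string fields, not the real EventBridge matching engine, and the sample event contains only the fields needed for the comparison.

```python
def matches(pattern, event):
    """Simplified matcher: each top-level pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# The rule as first configured: no sign-in source or detail-type.
INITIAL_PATTERN = {
    "source": ["aws.iam", "aws.ec2", "aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
}

# The rule after the fix.
FIXED_PATTERN = {
    "source": ["aws.signin", "aws.iam", "aws.ec2", "aws.s3"],
    "detail-type": [
        "AWS API Call via CloudTrail",
        "AWS Console Sign In via CloudTrail",
    ],
}

# Trimmed sample of a console sign-in event as delivered to EventBridge.
SAMPLE_CONSOLE_LOGIN = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin"},
}

if __name__ == "__main__":
    print("initial rule matches:", matches(INITIAL_PATTERN, SAMPLE_CONSOLE_LOGIN))
    print("fixed rule matches:", matches(FIXED_PATTERN, SAMPLE_CONSOLE_LOGIN))
```

For a check against the real matching logic, EventBridge also exposes a TestEventPattern API, which takes a pattern and a sample event and reports whether they match.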

Here's CloudTrail showing ConsoleLogin events flowing after the fix was applied:

Issue 2: QuickSight S3 Integration Wouldn't Work

This was the biggest time sink. The original design used QuickSight as the visualization layer, fed from the S3 JSON event files Lambda was writing. In practice, getting QuickSight to read from S3 turned into a multi-day problem.

QuickSight recently overhauled its data source integration UI and the documentation hadn't caught up. Multiple attempts to connect the S3 bucket as a data source failed with opaque permission errors — the error messages pointed to IAM roles and manifest files without explaining clearly what was missing. After trying several combinations of bucket policies, manifest configurations, and QuickSight IAM role permissions, I opened a support case with AWS. The resolution path was involved enough that I made the call to pivot.

Fix: Switched to writing event records as structured JSON to S3 (which Lambda was already doing) and reading them locally with pandas and matplotlib in a Jupyter notebook. This turned out to be more flexible — I could iterate on chart types and filtering logic without waiting on QuickSight's dataset refresh cycle.

Lesson: Before committing to a managed visualization service, run a minimal end-to-end integration test: write one record, connect the service, verify it reads the record. QuickSight's S3 integration requires a specific bucket policy, a manifest file pointing to the data prefix, and an IAM role with s3:GetObject on the bucket — none of which is clearly surfaced until something fails.

Here's the events/ prefix in S3 showing the structured JSON files Lambda was writing for each detection:

Issue 3: S3 Bucket Permissions Were Locked Down

When I first created the siem-data bucket, I enabled the most restrictive settings by default — Block All Public Access, no resource-based policy, and access limited to the bucket owner. That's the right security posture, but it created friction at every integration point afterward.

The Lambda execution role needed an explicit s3:PutObject grant on the bucket. The QuickSight integration needed a different grant. Block All Public Access was not the blocker in itself, since it only rejects public ACLs and policies, but with no grants in place every new service that needed access meant another manual round of permission edits. This compounded quickly during the QuickSight debugging phase, where each failed attempt required another round of policy edits.

Fix: Added an explicit resource-based bucket policy granting s3:PutObject to the Lambda execution role ARN and s3:GetObject to the QuickSight service role. This resolved the Lambda write failures immediately and unblocked the QuickSight connection attempts.

Lesson: Set up S3 bucket permissions with the full access pattern in mind before writing the first byte of data. Block Public Access is correct — keep it on — but write the bucket policy that grants your application roles exactly the permissions they need at the same time you create the bucket, not reactively when something breaks. A minimal bucket policy template costs five minutes up front and saves hours of debugging later.
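A minimal bucket policy template along those lines, generated from Python so the role ARNs stay parameters. The function name, the statement Sids, and the choice to scope both grants to the events/ prefix are assumptions; the role ARNs in the usage example are placeholders, not real resources.

```python
import json


def build_bucket_policy(bucket, writer_role_arn, reader_role_arn, prefix="events/"):
    """Bucket policy granting write to one role and read to another, scoped to a prefix."""
    object_arn = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowDetectionLambdaWrite",
                "Effect": "Allow",
                "Principal": {"AWS": writer_role_arn},
                "Action": "s3:PutObject",
                "Resource": object_arn,
            },
            {
                "Sid": "AllowDashboardRead",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:GetObject",
                "Resource": object_arn,
            },
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy(
        "siem-data",
        "arn:aws:iam::111122223333:role/example-lambda-role",     # placeholder ARN
        "arn:aws:iam::111122223333:role/example-dashboard-role",  # placeholder ARN
    )
    print(json.dumps(policy, indent=2))
```

Neither statement uses a wildcard principal, so the policy stays compatible with Block Public Access.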

Final Architecture

After the fixes and pivots, the final architecture looked like this:

CloudTrail → EventBridge (siem-detection-rule) → Lambda
                                                    ├── DynamoDB (SIEM-logs, 24h TTL)
                                                    ├── DynamoDB (SIEM-failed-logins, 5m TTL)
                                                    ├── S3 (siem-data/events/ — structured JSON)
                                                    └── SNS (siem-alerts — severity + team filters)

Key changes from the initial design:

S3 added as a second persistence layer specifically for analysis and visualization
SNS MessageAttributes added so subscribers can filter by severity and team without processing every message
QuickSight replaced by a local Python notebook (pandas + matplotlib)

Visualizations

The notebook (visuals.ipynb) reads all event JSON files from S3 and produces six charts covering event timelines, type distribution, severity breakdown, most active users, source IP activity, and a high/critical-only focus view. Here are the two most useful ones:

Severity breakdown — A pie chart of CRITICAL / HIGH / MEDIUM / LOW counts. The most useful view for communicating pipeline output to a non-technical audience. A healthy pipeline should have a long tail of LOW and MEDIUM events with a small number of HIGH/CRITICAL findings — if CRITICAL dominates, either the thresholds are wrong or something serious is happening.

High & Critical events only — A filtered bar chart showing only HIGH and CRITICAL events by type. This is the analyst focus view — strip out the noise and show only what requires a response. In my data, this chart shows AttachUserPolicy and DeleteBucket - Unauthorised User from the demo scenario, which is exactly the signal the pipeline was designed to produce.
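The aggregation behind those two charts is small. The notebook uses pandas and matplotlib; the sketch below swaps in the standard library so it runs without extra dependencies, and it assumes each stored record carries the "event" and "severity" fields shown in the alert payloads earlier. The function names are illustrative.

```python
import json
from collections import Counter


def load_records(json_documents):
    """Parse an iterable of JSON strings, one alert per string."""
    return [json.loads(doc) for doc in json_documents]


def severity_counts(records):
    """Counts per severity label: the data behind the severity breakdown chart."""
    return Counter(r["severity"] for r in records)


def high_critical_by_type(records):
    """Counts per event type, keeping only HIGH and CRITICAL alerts."""
    return Counter(
        r["event"] for r in records if r["severity"] in {"HIGH", "CRITICAL"}
    )
```

Either Counter can be handed straight to a plotting call, for example by passing its keys and values to a bar chart.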

What I'd Do Differently & Next Steps

Use GuardDuty. I only learned about it in detail later. It would have replaced a significant chunk of the manual EventBridge + Lambda detection work, given me network-level threat detection that CloudTrail can't provide, and solved the visualization problem through Security Hub. Building the pipeline from scratch was a great learning exercise, but in a real environment, GuardDuty is the right starting point and custom Lambda rules are the supplement, not the foundation.

Design for automated response, not just alerting. The pipeline detects and notifies. A real detection pipeline should also act — disable credentials flagged for brute force, quarantine instances that trigger termination alerts, block logins from unexpected countries. The SNS topic is already there; adding a Lambda subscriber that takes action is the natural next step.

Add geographic anomaly detection. The sourceIPAddress field in CloudTrail events can be enriched with a GeoIP lookup at alert time. A login from a country the account has never been accessed from is a much stronger signal than a failed login count alone. This would add a new DynamoDB table keyed by username to track known source countries — any deviation fires an alert immediately, without needing to wait for a brute force threshold to be hit.
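The core check is simple once a country code is available. In the sketch below an in-memory dict stands in for the proposed DynamoDB table, the GeoIP lookup is assumed to have already happened, and the function name is hypothetical. Treating the first login ever seen for a user as a baseline instead of an anomaly is my own design choice.

```python
def is_new_country(known_countries, user, country):
    """Record the country for this user and report whether it is a deviation.

    known_countries maps username -> set of country codes seen so far.
    The first country seen for a user sets the baseline and is not flagged.
    """
    seen = known_countries.setdefault(user, set())
    is_new = bool(seen) and country not in seen
    seen.add(country)
    return is_new
```

In the pipeline this would run inside the ConsoleLogin handler, with a True result producing an alert immediately instead of waiting for the failed-login threshold.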

Takeaways

Building this from scratch gave me something you can't get from turning on a managed service: an understanding of the actual mechanics. I now know that CloudTrail is the source of truth for your control plane, that EventBridge is doing real filtering work before anything expensive runs, and that every detection rule is just a function with inputs from a cloud event and outputs of severity + context + recommended action — the same structure whether you're writing it in Lambda Python or Sigma YAML. The statefulness problem for brute-force detection is a microcosm of a bigger truth: cloud-native detection isn't stateless, and the gap between "Lambda fired" and "alert fired" is where most of the interesting engineering lives.

DEV Community: Antonio Lopez