In the previous post, we identified three key gaps that Conditional Access cannot address:
- Brute force patterns (e.g. 10 failures in 2 minutes)
- Activity from excluded users (e.g. executives bypassing geo-blocking)
- Behavioural anomalies (e.g. Saturday midnight logins)
This post builds the detection layer that catches what CA misses. Not prevention, but detection: Stream Analytics complements Conditional Access rather than replacing it.
What this system detects:
- Brute force patterns (5+ failures in 10-minute windows)
- Geographic anomalies from excluded users (non-UK access with no CA oversight)
- Behavioural anomalies (off-hours activity from UK locations)
What this system does NOT detect:
- Token theft without anomalous sign-in activity
- Lateral movement after successful authentication
- Data exfiltration post-login
This highlights a critical principle: identity security requires both preventative controls (Conditional Access) and detective controls (event-driven monitoring).
Note: this system is about detection only. In practice, its output should feed a SIEM integration for SOC investigation and response.
Architecture: Event Hub + Stream Analytics
The Pipeline:
- Entra ID sign-ins → Real authentication events (success, CA blocks, password failures)
- Event Hub (signin-events) → Buffers events for stream processing (2 partitions, 1-day retention)
- Stream Analytics → 3 continuous queries running SQL against event stream
- Event Hub (threat-alerts) → Stores detected threats with full investigation context
Why Event Hub? Decouples collection from processing. Events persist even if Stream Analytics fails. Query has bug? Replay events with corrected query. Connection drops? Events buffered.
Why monitor excluded users? When a Senior Level executive logs in from New York, the CA policy doesn't apply (the user is excluded from geo-blocking) → authentication succeeds → no CA oversight. Stream Analytics flags this for investigation: legitimate executive travel or compromised account?
Infrastructure: Terraform deploys everything in ~5 minutes. Event Hub Basic tier (£0.86/day), Stream Analytics 1 SU (£0.10/hour active).
With the ingestion layer in place, we can now define detection logic as continuous queries running against the event stream.
Query 1: Brute Force Detection
Purpose: Detect 5+ authentication failures from same user within 10-minute window.
```sql
SELECT
    userPrincipalName,
    COUNT(*) AS failed_attempts,
    System.Timestamp() AS window_end,
    'Failed Login Spike' AS alert_type
INTO [FailedLoginOutput]
FROM [EventHubInput]
WHERE status.errorCode <> 0
GROUP BY userPrincipalName, TumblingWindow(minute, 10)
HAVING COUNT(*) >= 5;
```
How tumbling windows work:
- Fixed 10-minute intervals: 14:00-14:09, 14:10-14:19, etc.
- Non-overlapping: User with 6 failures in one window → alert fires
- Separate windows: 3 failures in window 1 + 3 in window 2 → no alert (distributed, not burst)
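The tumbling-window grouping above can be sketched in plain Python (the real job does this inside Stream Analytics; the usernames and timestamps below are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime

def tumbling_window_counts(events, window_minutes=10, threshold=5):
    """Group failure events into fixed, non-overlapping windows and flag bursts."""
    buckets = defaultdict(int)
    for user, ts in events:
        # Floor the timestamp to the start of its 10-minute window
        window_start = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                                  second=0, microsecond=0)
        buckets[(user, window_start)] += 1
    return {key: n for key, n in buckets.items() if n >= threshold}

# 6 failures inside one window -> alert; 3 + 3 across two windows -> no alert
burst = [("alice", datetime(2024, 1, 6, 14, 2, i)) for i in range(6)]
spread = [("bob", datetime(2024, 1, 6, 14, 9, i)) for i in range(3)] + \
         [("bob", datetime(2024, 1, 6, 14, 11, i)) for i in range(3)]
alerts = tumbling_window_counts(burst + spread)
```

Note how bob's six failures straddle the 14:10 window boundary and never reach the threshold in either bucket, which is exactly the "distributed, not burst" behaviour described above.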
Why this matters: Prevents alert fatigue from slow distributed attacks while catching concentrated bursts indicative of automated credential stuffing.
Key decision: Include errorCode 53003 (CA blocks) or only 50126 (wrong passwords)?
Early iteration excluded CA blocks (user might have correct password, wrong location—not a credential attack). After testing, included both to capture attackers trying multiple locations during brute force.
Production teams: Adjust based on threat model. Exclude 53003 for pure credential attacks. Include it for comprehensive activity monitoring. Or, better still, split them into separate detections.
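The two failure codes can be handled with a small predicate like this hedged sketch (the helper name and flag are hypothetical; the error codes are the ones discussed above):

```python
WRONG_PASSWORD = 50126  # invalid username or password (credential attack signal)
CA_BLOCK = 53003        # blocked by Conditional Access policy (e.g. wrong location)

def counts_toward_brute_force(error_code: int, include_ca_blocks: bool = True) -> bool:
    """Decide whether a sign-in event should count toward the failure window.

    Mirrors Query 1's WHERE clause: with include_ca_blocks=True any non-success
    counts (errorCode <> 0); with False, CA blocks are excluded so only
    genuine credential failures feed the brute force detection.
    """
    if error_code == 0:          # successful sign-in never counts
        return False
    if error_code == CA_BLOCK and not include_ca_blocks:
        return False
    return True
```

Flipping `include_ca_blocks` is the code-level equivalent of the query decision above: tune it to whether you want a pure credential-attack signal or broader activity monitoring.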
Query 2: Geographic Anomalies
Purpose: Flag ALL non-UK access for investigation. Operational staff will be CA-blocked, but we want to monitor excluded executives.
```sql
SELECT
    userPrincipalName,
    location,
    ipAddress,
    createdDateTime,
    'Non-UK Access' AS alert_type
INTO [HighRiskOutput]
FROM [EventHubInput]
WHERE location NOT LIKE '%United Kingdom%'
    AND location NOT LIKE '%UK%'
    AND location NOT LIKE '%, GB'
    AND location IS NOT NULL
    AND location <> 'Unknown, Unknown';
```
The GB Location Bug (cost me an hour of debugging):
Version 1 (failed):
```sql
WHERE location NOT LIKE '%UK%' AND location NOT LIKE '%United Kingdom%'
```
Looked reasonable. Deployed. Tested.
Alerts fired for Salford, Manchester, Islington—all UK cities!
The problem: Entra ID uses ISO country code "GB", not "UK". Sign-in location appears as "Salford, GB", not "Salford, UK".
My query checked NOT LIKE '%UK%'. Since "Salford, GB" doesn't contain the substring "UK", the filter flagged a UK city as non-UK. False positive cascade.
Version 2 (deployed):
Added NOT LIKE '%, GB' to catch the ISO format. Also added <> 'Unknown, Unknown' to filter geolocation lookup failures.
Lesson: Never assume data format. Always inspect actual payloads before writing filters. The GB vs UK issue is obvious in hindsight—but you only find it by testing with real data.
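One way to catch this class of bug early is to replicate the filter in Python and test it against payloads in the real format. A minimal sketch (the function name is hypothetical; the location strings match the formats discussed above):

```python
def is_non_uk(location):
    """Mirror Query 2's filter: flag anything that doesn't look like a UK location."""
    if location is None or location == "Unknown, Unknown":
        return False  # geolocation lookup failed; don't alert
    uk_markers = ("United Kingdom", "UK")
    # Entra ID emits ISO country codes, so UK locations arrive as "City, GB"
    return not (any(m in location for m in uk_markers) or location.endswith(", GB"))

# Version 1 lacked the ", GB" check and flagged "Salford, GB" as non-UK
assert is_non_uk("Salford, GB") is False
assert is_non_uk("Singapore, SG") is True
```

Running a handful of assertions like these against sample payloads before deploying the query would have surfaced the GB vs UK mismatch in seconds rather than an hour.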
Why alert on excluded users?
Executive logs in from Singapore → Policy doesn't apply (excluded from geo-blocking) → Stream Analytics flags it → Security investigates:
- Check group membership → Senior Level executive
- Verify CA policy exclusion → Correctly excluded
- Conclusion: Legitimate travel (no action) OR unexpected location (escalate)
Alert forces human review of activity from users who bypass CA policies.
Query 3: Off-Hours Activity
Purpose: Flag UK-based logins on weekends or outside 9-5 UTC business hours.
```sql
SELECT
    userPrincipalName,
    location,
    createdDateTime,
    DATEPART(hour, createdDateTime) AS login_hour,
    DATEPART(weekday, createdDateTime) AS day_of_week,
    'Off-Hours Activity' AS alert_type
INTO [OffHoursOutput]
FROM [EventHubInput]
WHERE (
        location LIKE '%, GB'
        OR location LIKE '%United Kingdom%'
        OR location LIKE '%UK%'
    )
    AND (
        DATEPART(weekday, createdDateTime) IN (1, 7) -- weekday numbering: 1=Sunday, 7=Saturday (verify DATEFIRST setting)
        OR DATEPART(hour, createdDateTime) NOT BETWEEN 9 AND 17
    );
```
UK-only location filter: Non-UK logins already trigger Query 2 (geographic anomalies). This query focuses on unusual timing from allowed locations. Prevents duplicate alerts—keeps queries mutually exclusive.
Location matching brittleness: This query uses the same string pattern matching as Query 2 (LIKE '%, GB'). For production, consider extracting countryOrRegion during ingestion and using structured field comparison (WHERE country = 'GB') instead of string matching. More reliable and avoids the GB vs UK inconsistency issues.
Timezone considerations: All Entra ID timestamps are UTC. This query checks UTC hours 9-17, not local UK time. Implications:
- UK summer (BST, UTC+1): "9-5 UK time" = 8-16 UTC, so the query misses the 17:00 UTC hour (18:00 BST, genuinely off-hours) and falsely flags 08:00 UTC (09:00 BST, business hours)
- UK winter (GMT, UTC+0): "9-5 UK time" = 9-17 UTC → query is accurate
For precise UK business hours detection, adjust query to account for BST/GMT transitions or accept UTC-based approximation.
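If you do the timestamp enrichment in the Python ingestion script instead of the query, the stdlib can handle the BST/GMT transition for you. A sketch using `zoneinfo` (assumes tzdata is available on the host; the function name is hypothetical):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")

def is_off_hours_uk(utc_ts: datetime) -> bool:
    """Evaluate 9-5 Mon-Fri in local UK time, handling BST/GMT automatically."""
    local = utc_ts.astimezone(LONDON)
    weekend = local.weekday() >= 5           # Monday=0 ... Sunday=6
    outside_hours = not (9 <= local.hour < 17)
    return weekend or outside_hours

# 16:30 UTC in July is 17:30 BST -> off-hours locally, though inside 9-17 UTC
assert is_off_hours_uk(datetime(2024, 7, 10, 16, 30, tzinfo=timezone.utc)) is True
```

The trade-off: enriching at ingestion means the windowing query stays simple, but the off-hours flag is baked in before the event reaches Stream Analytics.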
Early iteration mistake: Singapore login at 2 AM triggered BOTH Query 2 (non-UK) AND Query 3 (off-hours). Duplicate alerts cause investigation fatigue.
Fix: Added UK-only filter to Query 3. Now each query targets a distinct signal dimension:
- Query 1: Authentication failures (any location, any time)
- Query 2: Geographic anomalies (non-UK access, any time)
- Query 3: Behavioural anomalies (UK access, unusual timing)
Note: A single event can still trigger multiple queries if it matches multiple dimensions. For example, a UK user failing login 10 times at 2 AM on Saturday triggers Query 1 (failures) AND Query 3 (off-hours). This is expected—each query surfaces a different investigation angle for the same suspicious event.
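The routing logic above can be sketched as a single function (the simplified event fields are assumptions for illustration; the real queries operate on the raw stream):

```python
def detection_dimensions(event: dict) -> list:
    """Return which of the three detection queries an event is relevant to."""
    dims = []
    uk = event["location"].endswith(", GB")
    if event["errorCode"] != 0:
        dims.append("query1_failures")        # any location, any time
    if not uk:
        dims.append("query2_geo_anomaly")     # non-UK access, any time
    if uk and (event["weekend"] or not 9 <= event["hour"] < 17):
        dims.append("query3_off_hours")       # UK access, unusual timing
    return dims

# A UK user failing login at 2 AM on Saturday matches two dimensions
evt = {"location": "Salford, GB", "errorCode": 50126, "hour": 2, "weekend": True}
```

Note that a Singapore login at 2 AM now maps only to the geographic dimension, which is exactly the duplicate-alert fix described above.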
Data Collection: Graph API to Event Hub
Python script fetches sign-in logs from Graph API → transforms to simplified schema → sends to Event Hub in batches.
```bash
# Get connection string from Terraform
cd terraform
terraform output -raw eventhub_send_connection_string

# Add to .env file
echo "EVENTHUB_CONNECTION_STRING=$(terraform output -raw eventhub_send_connection_string)" >> ../scripts/.env

# Process events
cd ../scripts
python export_signin_logs_to_eventhub.py
```
Typical completion: ~30 seconds for 3300 events.
Schema transformation:
```python
{
    "userPrincipalName": log["userPrincipalName"],
    "status": {
        "errorCode": log.get("status", {}).get("errorCode", 0)
    },
    "location": f"{log.get('location', {}).get('city')}, {log.get('location', {}).get('countryOrRegion')}",
    "ipAddress": log.get("ipAddress"),
    "createdDateTime": log["createdDateTime"]
}
```
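One edge case in the transformation above: when Graph returns no location object, the f-string renders "None, None" rather than the "Unknown, Unknown" value Query 2 filters on. A defensive sketch (the function name and fallback behaviour are assumptions, not the exact script):

```python
def transform_signin(log: dict) -> dict:
    """Flatten a Graph API sign-in record into the simplified Event Hub schema."""
    loc = log.get("location") or {}
    city = loc.get("city") or "Unknown"
    country = loc.get("countryOrRegion") or "Unknown"
    return {
        "userPrincipalName": log["userPrincipalName"],
        "status": {"errorCode": (log.get("status") or {}).get("errorCode", 0)},
        "location": f"{city}, {country}",  # "Unknown, Unknown" when geolocation failed
        "ipAddress": log.get("ipAddress"),
        "createdDateTime": log["createdDateTime"],
    }
```

Normalising missing data at ingestion keeps the downstream queries' filters simple and predictable.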
Debugging gotcha: Script crashed with a cryptic DNS error (getaddrinfo failed). It looked like a network issue. Spent 20 minutes checking firewalls, DNS settings, network connectivity.
Actually: an extra quote in the .env file: 'Endpoint=sb://.... The parser read the quote, couldn't parse the resulting hostname, and threw a DNS error.
Lesson: Connection string format errors manifest as network failures. Print connection strings (redacted) to verify exact format before debugging network stack.
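A minimal pre-flight check like this hedged sketch (the function name and the checks are illustrative, not from the original script) turns the cryptic DNS failure into an immediate, readable error:

```python
def check_connection_string(conn: str) -> list:
    """Flag common .env formatting mistakes before they surface as network errors."""
    problems = []
    if conn != conn.strip():
        problems.append("leading/trailing whitespace")
    if conn.strip()[:1] in ("'", '"'):
        problems.append("stray quote before Endpoint=")
    if not conn.strip("'\" ").startswith("Endpoint=sb://"):
        problems.append("does not start with Endpoint=sb://")
    return problems

# The bug in this post: an extra quote copied into .env
assert check_connection_string("'Endpoint=sb://ns.servicebus.windows.net/;...") == \
    ["stray quote before Endpoint="]
```

Running this at script startup (with the value redacted in any log output) catches the formatting error before the SDK ever attempts a DNS lookup.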
Investigation Workflows
All 3300 detected threats are stored in the threat-alerts Event Hub with complete investigation context. In the Azure Portal, Event Hub → Data Explorer shows each threat.
Example 1: Brute Force Attack
The sign-in logs show repeated failures (error code 50126) from a UK location.
Individually, these events would not trigger Conditional Access. However, when analysed as a sequence, they form a clear brute force pattern and are worth investigating.
Investigation:
- Check user's normal location: UK (matches profile?)
- Failed attempts: 10 in 2 minutes (definite brute force pattern)
- Action: Contact user to verify, force password reset, review MFA enrollment
Example 2: Excluded User (Critical Scenario)
Investigation:
- Alert received: Non-UK access from New York
- Check Entra ID: Navigate to user → Group memberships
- Discovery: User is member of "Senior Level" group (executives)
- Verify CA policy: "LAB - Block Non UK (Exclude Senior Level)" → Senior Level group excluded
- Conclusion: Expected (excluded executive), NOT breach
Policy doesn't apply to this user (excluded for travel requirements). No remediation needed—log for audit trail.
This demonstrates defense-in-depth: CA policy doesn't apply (excluded), Stream Analytics flagged it for visibility, security team confirms expected behavior.
Without Stream Analytics, this login is invisible. No flag, no investigation, no confirmation that exclusion is being used legitimately vs. account compromise.
Example 3: Off-Hours Activity
Investigation:
- Location: UK (expected for this user)
- Time: Saturday 11:34 (outside Mon-Fri 9-5)
- Action: Log for trend analysis, contact user if pattern emerges
Key Learnings
1. Real Data Reveals Edge Cases: GB vs UK location bug (Entra uses "Salford, GB"), CA blocks (53003) vs auth failures (50126) require different handling, connection string format errors manifest as DNS failures.
2. Event Hub Decouples Collection from Processing: Events persist even if Stream Analytics fails, and historical events can be replayed through corrected or updated detection logic. Critical for production reliability.
3. Alert Design Prevents Fatigue: Early iteration had duplicate alerts (Singapore 2 AM triggered geo + off-hours). Fix: UK-only filter for Query 3. Each query targets distinct signal dimension.
4. IaC Accelerates Iteration: Manual deployment ~2 hours. Terraform apply ~5 minutes. Debugging 2 query bugs took ~30 minutes with IaC vs. 4+ hours manual.
Security Audit: Stream Analytics in Production
This is a LAB environment. The following security gaps would fail production review.
What We Skipped (Intentionally):
| Component | Lab Setup | Production Requirement | Risk |
|---|---|---|---|
| Network | Public Event Hub endpoints | Private Link + VNet integration | Data exposure, unauthorized access |
| Secrets | Connection strings in .env files | Azure Key Vault + managed identities | Credential theft, no rotation |
| Encryption | Basic tier (no encryption at rest) | Premium tier with customer-managed keys | Data breach if storage compromised |
| Query Changes | Manual edits in portal | CI/CD with approval gates | Accidental query breakage, no audit trail |
| Monitoring | No alerting on job failures | Azure Monitor alerts + runbooks | Silent detection failures |
| Data Retention | 1-day Event Hub retention | Long-term storage (SQL/Log Analytics) | Compliance violations, lost audit trail |
Production Must-Haves for Stream Analytics:
1. Network Isolation
- Private Link for Event Hub (~£12/month per endpoint)
- VNet integration for Stream Analytics job
- Network security groups restricting inbound/outbound
2. Identity & Access
- Managed identities for Stream Analytics → Event Hub authentication
- Azure Key Vault for any connection strings (Python script)
- RBAC with least privilege (no account keys in queries)
3. Data Protection
- Premium Event Hub with encryption at rest (~£531/month)
- Customer-managed keys (CMK) for compliance
- TLS 1.2+ for all data in transit
4. Operational Excellence
- Multi-region deployment for disaster recovery
- Automated failover for Event Hub namespace
- Azure Monitor alerts on: job stopped, output errors, SU utilization >80%
- Runbooks for common failure scenarios
5. Change Management
- Version control for query definitions (Git)
- CI/CD pipeline for query deployments
- Approval gates before production changes
- Rollback capability if query breaks
6. Compliance & Audit
- Forward threat-alerts to Log Analytics workspace (7-year retention)
- Immutable audit logs for compliance
- Data residency controls for GDPR/regional requirements
- Regular access reviews for Event Hub/Stream Analytics permissions
Conditional Access evaluates individual sign-in events.
This system detects patterns across events—the difference between blocking risk and understanding it.
Skills demonstrated: Stream Analytics query development, Event Hub event-driven architecture, SQL tumbling windows, real-time threat detection, production security architecture design, infrastructure as code iteration.