Build a realistic Identity & Access Management lab with 1,606 AI-generated employees, multi-cloud infrastructure, and Terraform teardown strategies that keep costs under $1.
Most IAM tutorials use toy datasets—10 users, simplified scenarios, no real infrastructure.
That's fine for learning basics, but useless for testing ML detection systems at production scale.
The problem: machine learning needs realistic data volumes. A single attack among 10 users represents 10% of your dataset. Among 1,606 users generating 31,644 events, 69 attacks represent 0.218%, an enterprise-realistic attack rate.
What I built:
- 1,606 AI-generated employees with realistic organizational structure
- 45 security groups (departments, seniority, functional roles)
- Behavioral metadata (devices, work patterns, VPN usage)
- Multi-cloud data pipeline (Azure Blob → Neo4j → Entra ID)
- Total cost: $0.21 using Terraform teardown strategy
This foundation supports ML training, attack simulation, and behavioral analytics at enterprise scale without enterprise costs. By the end, you'll have infrastructure ready for realistic attack simulation, graph-based privilege analysis, and behavioral anomaly detection.
What You'll Build (Step-by-Step Roadmap)
Here's our implementation plan:
- Step 1: Design Terraform infrastructure for teardown efficiency
- Step 2: Generate 1,606 realistic employees using Gemini API
- Step 3: Create realistic behavioral attributes (not just names)
- Step 4: Deploy multi-cloud data pipeline (Azure Blob, Neo4j, Entra ID)
- Step 5: Cost optimization and actual spend analysis
Each step builds on the previous, creating a complete IAM lab environment suitable for ML training, attack simulation, and detection validation.
Step 1: Design Terraform Infrastructure for Teardown Efficiency
The key to keeping costs under $1 is aggressive teardown. Cloud resources cost money every hour they exist. My strategy: deploy infrastructure only when needed, destroy it immediately after data upload, and rely on Azure Blob Storage (pennies per month) for persistence.
Here's the complete Terraform configuration:
```hcl
# terraform/main.tf

resource "azurerm_resource_group" "identity_lab" {
  name     = "rg-ai-iam-identity-lab"
  location = "uksouth"

  tags = {
    Project     = "AI-IAM-Engineering"
    Environment = "Demo"
    ManagedBy   = "Terraform"
  }
}

resource "azurerm_storage_account" "data_lake" {
  name                     = "${var.prefix}datalake${random_string.unique_suffix.result}"
  resource_group_name      = azurerm_resource_group.identity_lab.name
  location                 = azurerm_resource_group.identity_lab.location
  account_tier             = "Standard"
  account_replication_type = "LRS" # Cheapest option
  account_kind             = "StorageV2"
  min_tls_version          = "TLS1_2"
  access_tier              = "Hot"
}

resource "azurerm_storage_container" "synthetic_identities" {
  name                  = "synthetic-identities"
  storage_account_name  = azurerm_storage_account.data_lake.name
  container_access_type = "private"
}

resource "azurerm_virtual_network" "identity_lab" {
  name                = "${var.prefix}-vnet"
  location            = azurerm_resource_group.identity_lab.location
  resource_group_name = azurerm_resource_group.identity_lab.name
  address_space       = ["10.1.0.0/16"]
}
```
Why this approach matters:
The critical piece of configuration is the `prevent_deletion_if_contains_resources = false` flag in the AzureRM provider's `features` block. It lets `terraform destroy` delete the resource group in one command even if it still contains resources Terraform didn't create. No orphaned resources means no surprise bills.
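That flag lives in the provider configuration rather than in main.tf. A sketch of what it would look like (the file name and placement are my assumptions; the flag itself is standard AzureRM provider schema):

```hcl
# terraform/providers.tf (sketch; placement is an assumption)
provider "azurerm" {
  features {
    resource_group {
      # Allow `terraform destroy` to delete resource groups that still
      # contain resources Terraform doesn't track
      prevent_deletion_if_contains_resources = false
    }
  }
}
```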
Deployment workflow:
```shell
terraform init
terraform apply    # Deploy infrastructure
# ... upload data to blob storage ...
terraform destroy  # Tear down everything except blob storage
```
Storage account remains (costs ~$0.20/month), VNet gets destroyed (saves ~$5/month). Data persists in blob storage for future use.
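One way to destroy only the expensive pieces while leaving storage untouched is a targeted destroy. This is a sketch, not my exact workflow; the resource address matches the configuration above, and you should review the plan before approving:

```shell
# Targeted teardown sketch: destroy only the VNet, keep the storage account.
# -target is a standard Terraform flag; run `terraform plan` first to verify
# exactly what will be removed.
terraform destroy -target=azurerm_virtual_network.identity_lab
```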
Results: Infrastructure deploys in 2 minutes, destroys in 1 minute. Storage costs $0.20/month ongoing. VNet and compute resources only exist during active testing.
Step 2: Generate 1,606 Realistic Employees Using Gemini API
Realistic identities require diverse names, proper job titles, organizational hierarchy, and believable security group assignments. I used Gemini 2.0 Flash (free tier) to generate all 1,606 employees through structured prompting.
Gemini generates realistic names, but across 161 API calls duplicates emerge: the model can accidentally reuse a name between batches. My solution: track every generated name globally and feed exclusions back into future prompts.
How it works:
- Generate batch 1 (10 employees) → store the names in self.used_names
- Generate batch 2 → pass 50 random previously used names to exclude
- Repeat for all 161 batches
- Result: 100% unique names across 1,606 employees
Without this, you'd get 20-30 duplicate "John Smith" entries that break Entra ID imports (UPNs must be unique).
```python
# generate_employees_universal.py (core logic)
import json
import os

import google.generativeai as genai

# Assumes the API key lives in the GEMINI_API_KEY environment variable
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

BATCH_SIZE = 10
TENANT_DOMAIN = "laplace90210gmail.onmicrosoft.com"

class EntraIDEmployeeGenerator:
    def __init__(self, target_count, tenant_domain):
        self.target_count = target_count
        self.tenant_domain = tenant_domain
        self.model = genai.GenerativeModel('gemini-2.0-flash')
        self.used_names = set()  # Global uniqueness tracking
        self.used_upns = set()

    def generate_prompt(self, batch_num, employees_in_batch):
        """Generate prompt with anti-duplicate instructions."""
        name_exclusions = ""
        if self.used_names:
            sample_names = list(self.used_names)[:50]
            name_exclusions = f"\nAVOID THESE NAMES: {', '.join(sample_names)}"

        prompt = f"""Generate {employees_in_batch} realistic UK employees for ACME Corporation.

CRITICAL: Each employee must have a COMPLETELY UNIQUE name.
- Mix British, Indian, African, Chinese, Middle Eastern, European names
- Use varied surnames: Smith, Patel, Khan, O'Connor, Chen, etc.
{name_exclusions}

Requirements:
1. Diverse UK names reflecting modern Britain
2. Realistic job titles matching seniority distribution
3. UK cities: London, Manchester, Birmingham, Edinburgh, Leeds, Bristol
4. 100% MFA enabled
5. Only Executives/VPs get admin access

Return ONLY JSON array with exact structure..."""
        return prompt
```
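The snippet above builds the prompt; the other half is parsing a batch response and enforcing uniqueness before accepting it. A minimal sketch of that step, where field names like `displayName` mirror the Entra ID schema but the exact response format is my assumption:

```python
import json

def accept_batch(response_text, used_names, tenant_domain):
    """Parse one Gemini batch and drop any name we've already seen.

    Field names (displayName, userPrincipalName) are assumptions mirroring
    the Entra ID schema, not necessarily the generator's exact format.
    """
    accepted = []
    for emp in json.loads(response_text):
        name = emp["displayName"]
        if name in used_names:
            continue  # silently drop duplicates instead of failing the batch
        emp["userPrincipalName"] = name.lower().replace(" ", ".") + "@" + tenant_domain
        used_names.add(name)
        accepted.append(emp)
    return accepted
```

Dropping duplicates rather than rejecting the whole batch keeps the 161-call run converging on exactly 1,606 unique UPNs.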
Why this approach works:
Traditional data generation tools create bland, unrealistic identities. Gemini understands context: "Senior Security Engineer" gets appropriate security groups, "Junior Engineer" doesn't get admin access, VPs get E5 licenses while juniors get E3.
Distribution achieved:
- Departments: Engineering (40%), Security (15%), Operations (20%), Finance (10%), People (10%), Executive (5%)
- Seniority: Executive (0.5%), VP (0.5%), Director (2%), Manager (7%), Senior (20%), Mid (40%), Junior (30%)
- Security groups: 45 total (6 departments + 7 seniority levels + 32 functional groups)
API efficiency:
- Total API calls: 161 (1,606 ÷ 10 per batch)
- Rate limiting: 4 seconds between calls
- Total generation time: ~11 minutes
- Cost: $0.00 (Gemini free tier)
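The batching loop itself is simple. Here's a runnable sketch with a local stub standing in for the Gemini call; the dedup and rate-limit logic mirrors the approach above, but the stub's name lists are obviously illustrative:

```python
import random
import time

# Illustrative name pools; the real pipeline gets these from Gemini
FIRST = ["Amara", "Wei", "Fatima", "Liam", "Priya", "Tomasz"]
LAST = ["Okafor", "Chen", "Khan", "O'Connor", "Patel", "Nowak"]

def stub_generate_batch(batch_size, exclude):
    """Stand-in for the Gemini API call; retries until names are unique."""
    batch = []
    while len(batch) < batch_size:
        name = f"{random.choice(FIRST)} {random.choice(LAST)}"
        if name not in exclude and name not in batch:
            batch.append(name)
    return batch

def generate_all(target_count, batch_size=10, rate_limit_s=4.0):
    """Accumulate unique names batch by batch, as in the real 161-call run."""
    used_names = set()
    while len(used_names) < target_count:
        remaining = target_count - len(used_names)
        batch = stub_generate_batch(min(batch_size, remaining), used_names)
        used_names.update(batch)
        if len(used_names) < target_count:
            time.sleep(rate_limit_s)  # stay under the free-tier rate limit
    return used_names
```

Because each batch excludes everything already generated, the accumulated set grows by exactly one batch per call and never contains duplicates.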
Results: 1,606 unique employees with realistic organizational structure, ready for Entra ID import. File size: 1.1 MB (organization_1606_entra.json).
Step 3: Creating Realistic Behavioral Attributes
It's not enough to generate names and job titles. For realistic IAM testing, you need behavioral patterns that mirror real organizations.
What I added to each employee:
Consistent devices:
- Each employee assigned 1 primary device (OS + browser)
- 95% of logins use the primary device, 5% use a secondary
- Executives more likely to use mobile (iOS/Safari)
Work pattern metadata:
- Work hours by seniority (Executives: 7am-8pm, Juniors: 9am-5pm)
- Weekend work probability (Executives: 40%, Juniors: 2%)
- Location consistency (London-based employees rarely VPN from Manchester)
VPN usage patterns:
- 80% of employees in the VPN-Users group
- VPN users get private IPs (10.x.x.x) 60% of the time
- Non-VPN users always get public IPs mapped to office locations
Group membership logic:
- Everyone in their department group
- Seniority-based groups (Junior-Level, Senior-Level, etc.)
- Functional groups (MFA-Enabled: 100%, Admin-Access: 3%)
- VIP users (5+ groups) represent 0.5% of the population
Why this matters:
When you eventually test ML detection systems, these patterns are what the models learn. Random data means models can't distinguish normal from abnormal; realistic patterns mean they can identify true anomalies.
Code example:
```python
import random

def assign_device(seniority):
    """Assign a realistic primary device based on seniority."""
    if seniority in ['Executive', 'VP']:
        # Executives more likely to use mobile
        if random.random() < 0.3:
            return {'os': 'iOS', 'browser': 'Safari 17'}
    # Most employees use desktop
    return {
        'os': random.choice(['Windows 11', 'macOS']),
        'browser': random.choice(['Chrome 120', 'Edge 120', 'Firefox 121']),
    }
```
Results:
- Device consistency: 95% (matches real employee behavior)
- Weekend work distribution: Realistic by seniority
- VPN usage: 80% of employees (typical for remote-hybrid orgs)
- VIP users: 8 employees with 5+ group memberships
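The work-pattern metadata follows the same seniority-driven shape as the device assignment. A sketch of how it could be generated; the field names and the mid-level defaults are my assumptions, while the Executive/Junior figures come from the distributions above:

```python
import random

# Executive and Junior figures match the text above; other levels are
# assumed mid-range defaults for illustration.
WORK_PATTERNS = {
    "Executive": {"start": 7, "end": 20, "weekend_prob": 0.40},
    "Junior":    {"start": 9, "end": 17, "weekend_prob": 0.02},
}
DEFAULT_PATTERN = {"start": 9, "end": 18, "weekend_prob": 0.10}

def assign_work_pattern(seniority):
    """Attach seniority-appropriate work hours and weekend behavior."""
    p = WORK_PATTERNS.get(seniority, DEFAULT_PATTERN)
    return {
        "work_start_hour": p["start"],
        "work_end_hour": p["end"],
        "works_weekend": random.random() < p["weekend_prob"],
    }
```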
Step 4: Deploy Multi-Cloud Data Pipeline (Azure Blob, Neo4j, Entra ID)
Generated identities need to flow through three systems:
- Azure Blob Storage – Long-term persistence (cheap)
- Neo4j Graph Database – Privilege escalation path analysis
- Microsoft Entra ID – Real identity provider for authentication testing
Each system serves a different purpose. Blob storage is the source of truth. Neo4j enables graph queries like "find all privilege escalation paths from junior engineers to admin access." Entra ID provides real OAuth tokens for testing authentication flows.
Azure Blob Upload:
```python
# upload_to_blob.py
import os

from azure.storage.blob import BlobServiceClient

class DataLakeUploader:
    def __init__(self, connection_string):
        self.blob_service_client = BlobServiceClient.from_connection_string(
            connection_string
        )

    def upload_file(self, local_file, blob_name=None):
        blob_client = self.blob_service_client.get_blob_client(
            container="synthetic-identities",
            blob=blob_name or os.path.basename(local_file),
        )
        with open(local_file, 'rb') as data:
            blob_client.upload_blob(data, overwrite=True)
        return blob_client.url
```
Neo4j Graph Load:
```python
# load_to_neo4j.py (simplified)
from neo4j import GraphDatabase

def load_organization(driver, employees):
    with driver.session() as session:
        # Create employee nodes
        for emp in employees:
            session.run("""
                CREATE (e:Employee {
                    employee_id: $id,
                    name: $name,
                    department: $dept,
                    seniority: $seniority,
                    admin_access: $admin
                })
            """, id=emp['employee_id'],
                 name=emp['displayName'],
                 dept=emp['department'],
                 seniority=emp['seniority'],
                 admin=emp.get('admin_access', False))

        # Create group relationships
        for emp in employees:
            for group in emp['groups']:
                session.run("""
                    MATCH (e:Employee {employee_id: $id})
                    MERGE (g:Group {name: $group})
                    CREATE (e)-[:MEMBER_OF]->(g)
                """, id=emp['employee_id'], group=group)
```
Why Neo4j matters:
Graph databases excel at relationship queries. Finding privilege escalation paths in SQL requires recursive CTEs and performs poorly. In Neo4j:
```cypher
// Find paths from junior employees to admin access
MATCH path = (junior:Employee {seniority: 'Junior'})-[:MEMBER_OF*1..3]->(admin:Group)
WHERE admin.name CONTAINS 'Admin'
RETURN path
```
This query runs in milliseconds and reveals 43 potential privilege escalation paths invisible to traditional analytics.
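The idea behind the Cypher query can be shown without a database. This pure-Python BFS over a toy membership graph illustrates the same escalation-path search; the employee and group names are invented for illustration, not from the lab data:

```python
from collections import deque

# Toy membership graph; names are illustrative, not from the lab dataset
membership = {
    "alice (Junior)": ["Engineering", "Deploy-Keys"],
    "bob (Junior)": ["Engineering"],
}
group_links = {"Deploy-Keys": ["Prod-Admin-Access"]}  # nested group grants

def escalation_paths(employee, max_hops=3):
    """BFS from an employee to any group whose name contains 'Admin'."""
    paths = []
    queue = deque([(employee, [employee])])
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 > max_hops:
            continue
        if "Admin" in node:
            paths.append(path)
            continue
        for nxt in membership.get(node, []) + group_links.get(node, []):
            queue.append((nxt, path + [nxt]))
    return paths
```

Here `escalation_paths("alice (Junior)")` finds the two-hop route through Deploy-Keys to Prod-Admin-Access, while bob has none. Neo4j does the same traversal declaratively and at scale.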
Results:
- Azure Blob: 2 files uploaded (organization + groups), 1.15 MB total
- Neo4j: 1,606 Employee nodes, 45 Group nodes, 6,424 MEMBER_OF relationships
- Entra ID: (Optional) Real identity provider available for OAuth testing
Step 5: Cost Optimization and Actual Spend Analysis
The complete infrastructure ran for 2 weeks during development. Here's the actual cost breakdown:
Azure Costs:
- Storage account (blob): $0.16
- Azure total: $0.21
Other Costs:
- Gemini API (161 calls): $0.00 (free tier)
- Neo4j Aura (free tier): $0.00
- Non-Azure total: $0.00
Total project cost: $0.21.
Cost optimization strategies that worked:
Terraform teardown immediately after uploads – VNet costs $5/month if left running. Destroying it after 3 days meant spending ~$0.50 instead of $5.00.
Blob storage only – After initial upload, only blob storage remains. 1.15 MB costs $0.02/month.
Gemini free tier – 1,500 free requests/day. 161 batches easily fits within free tier.
Neo4j Aura free tier – 200k nodes/1M relationships included free. 1,606 employees well under limit.
UK South region – Cheapest Azure region in Europe (vs. West Europe).
What $0.21 Buys You vs. Commercial Alternatives
This Lab ($0.21):
- 1,606 employees, 45 groups, 30 days behavioral data
- Multi-cloud infrastructure (Azure + Neo4j + Entra ID)
- Full control over data generation and attack scenarios
- Scales to 10,000+ users by changing one variable
Commercial IAM Training Labs:
- Pluralsight/Cloud Academy IAM courses: $29-49/month subscription
- Pre-built datasets (50-100 users max)
- No customization, no attack injection
- No infrastructure provisioning practice
Azure Training Environments (without teardown):
- Running VNet + VM 24/7: ~$150/month
- Most tutorials ignore cost optimization entirely
- Students rack up $500+ bills learning "free" cloud skills
What costs would look like without optimization:
- VNet running 30 days: $5.00
- Premium blob storage: $2.50
- Neo4j paid tier: $65/month
- Unoptimized cost: ~$72/month
Teardown strategy reduced cost by 99.3%.
Conclusion: What You Built and What's Next
You now have a production-ready IAM lab with 1,606 AI-generated employees, 45 security groups, and multi-cloud infrastructure connecting Azure, Neo4j, and Entra ID. Total cost: $0.21 for initial deployment.
This foundation supports realistic attack simulation (covered in my next post), ML-based anomaly detection, graph-based privilege escalation analysis, and behavioral analytics. The data is realistic enough that detection systems trained here translate to production environments.
What this unlocks:
- Inject realistic attacks (impossible travel, compromised accounts, credential stuffing)
- Train ML models on 30+ days of behavioral login patterns
- Analyze privilege escalation paths through Neo4j graph queries
- Test detection systems against enterprise-scale data volumes
The infrastructure scales beyond this initial deployment. Need 10,000 users? Change one variable. Need 90 days of login history? Adjust the event generator. The teardown strategy keeps costs minimal regardless of scale.
This foundation is ready for behavioral simulation, attack injection, and ML detection testing. I'll explore those areas in future posts as I continue building on this lab.
Code repository: Available on request
Questions? Drop them in the comments.