TLDR: After 12 years in DoD and DOE environments watching the same security violations get detected, ticketed, and ignored for weeks, I built PolicyCortex -- an AI agent that does not just alert you about problems in your cloud infrastructure. It fixes them. In production. With a full audit trail.
The Problem
Here is what cloud security looks like at most organizations right now:
- A scanner finds a public-facing storage account with sensitive data.
- It fires an alert into your CSPM tool -- Wiz, Prisma Cloud, take your pick.
- That alert becomes a ticket in ServiceNow.
- The ticket sits in a queue for 6-14 days.
- Meanwhile, your infrastructure is exposed.
That is not a tooling problem. It is an architecture problem.
I spent 12 years in Department of Defense and Department of Energy environments. The stakes there are not abstract. A misconfigured storage account is not just a compliance checkbox -- it is a potential national security incident. And yet even in those environments, the workflow was the same: detect, alert, ticket, wait.
When I left government work and started looking at what enterprise cloud teams were dealing with, I found the same dysfunction at scale. The average organization runs 4-7 separate tools to cover security posture, compliance, cost management, observability, and change management. None of them talk to each other meaningfully. Every one of them generates output for humans to act on. The bottleneck is always the human queue.
The problem is not that organizations lack visibility. They have too much visibility. What they lack is action.
What I Built
PolicyCortex is an autonomous cloud engineer. It does not generate alerts. It generates outcomes.
The system connects to your cloud environment, continuously monitors for security violations, compliance gaps, and cost anomalies, and -- where authorized -- remediates them automatically. The full audit trail is generated as part of execution, not as an afterthought.
To be specific about what "autonomous" means here: PolicyCortex operates in two modes.
Autonomous Mode executes approved remediation patterns without human intervention. You define the policy; the system executes against it.
Gated Mode pauses before any write operation and presents an AWAITING APPROVAL prompt with the exact API call it intends to make, the resources affected, and the compliance controls it will satisfy. You approve or skip. Nothing touches your infrastructure without the intent being explicit.
This is the architecture I would have wanted in a DoD environment. Full automation where trust is established. Human-in-the-loop where it is not. No surprises.
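To make Gated Mode concrete, here is a minimal sketch of what an approval gate looks like. The names and data shapes are illustrative, not the actual product API -- the point is that every write operation is fully described before anything executes:

```python
from dataclasses import dataclass

@dataclass
class PlannedWrite:
    """A write operation the agent intends to perform (illustrative shape)."""
    method: str
    url: str
    body: dict
    resources: list
    controls: list

def render_approval_prompt(op: PlannedWrite) -> str:
    """Format the AWAITING APPROVAL prompt shown in Gated Mode."""
    lines = [
        "AWAITING APPROVAL",
        f"  Call:      {op.method} {op.url}",
        f"  Resources: {', '.join(op.resources)}",
        f"  Controls:  {', '.join(op.controls)}",
    ]
    return "\n".join(lines)
```

The prompt is rendered from the exact request the agent has already constructed, so what you approve is what runs -- there is no gap between the preview and the execution.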
How It Works Under the Hood
Let me walk through a real remediation flow. This is not a sanitized demo. This is what the system actually executes.
The trigger: PolicyCortex detects a public storage account in a production Azure subscription.
ALERT [CRITICAL]
Type: Public storage account detected
Resource: stprod-customer-data
Scope: Production subscription
Action: Initiating autonomous remediation
From that trigger, the system executes 8 steps in roughly 3 minutes.
Step 1 -- Authenticate
POST https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token
Body: grant_type=client_credentials
client_id={service_principal}
scope=https://management.azure.com/.default
PolicyCortex authenticates against Azure Resource Manager using a scoped service principal. The principle of least privilege applies here -- the service principal has exactly the permissions required for remediation, nothing more.
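If you want to see what that token request looks like in code, here is a simplified sketch (the helper name is mine; in practice you would POST the returned body with any HTTP client, or let an Azure SDK credential class handle this for you):

```python
def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the OAuth2 client-credentials request for Azure Resource Manager.

    Returns (token_url, form_body) ready to POST as form data.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # The v2.0 endpoint scopes the token to ARM's configured permissions
        "scope": "https://management.azure.com/.default",
    }
    return url, body
```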
Step 2 -- Read current configuration
GET https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/
providers/Microsoft.Storage/storageAccounts/stprod-customer-data
Before touching anything, the system reads the current state. It checks allowBlobPublicAccess, networkAcls, encryption, and minimumTlsVersion. This snapshot becomes part of the audit trail and is used to validate the post-remediation state.
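The policy evaluation over that snapshot is conceptually simple. A stripped-down version of the check (field names match the Azure storage account resource model; the function itself is illustrative) looks like this:

```python
def find_violations(props: dict) -> list:
    """Check the fields Step 2 reads and return any policy violations."""
    violations = []
    if props.get("allowBlobPublicAccess", True):
        violations.append("public blob access enabled")
    if props.get("networkAcls", {}).get("defaultAction") != "Deny":
        violations.append("network default action is not Deny")
    if props.get("minimumTlsVersion") != "TLS1_2":
        violations.append("minimum TLS version below 1.2")
    if not props.get("encryption", {}).get("services", {}).get("blob", {}).get("enabled"):
        violations.append("blob encryption not enabled")
    return violations
```

An empty result means the resource is already compliant and no write steps are scheduled.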
Step 3 -- Disable public blob access
PATCH https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/
providers/Microsoft.Storage/storageAccounts/stprod-customer-data
{
"properties": {
"allowBlobPublicAccess": false
}
}
[WRITE] allowBlobPublicAccess: true -> false
[VERIFIED]
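That [VERIFIED] line is not decoration. Every write is followed by a read-back that confirms the change landed. Sketched out (with `read` and `write` standing in for authenticated GET/PATCH helpers, not real SDK calls):

```python
def disable_public_access(read, write, account_url: str) -> bool:
    """PATCH allowBlobPublicAccess to false, then re-read to verify.

    Returns True only if the post-write state actually reflects the change.
    """
    write(account_url, {"properties": {"allowBlobPublicAccess": False}})
    after = read(account_url)
    return after["properties"]["allowBlobPublicAccess"] is False
```

If the verification read disagrees with the intended state, the step is marked failed and the run escalates instead of continuing.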
Step 4 -- Create private endpoint
PUT https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/
providers/Microsoft.Network/privateEndpoints/pe-stprod-data
{
"properties": {
"subnet": { "id": "/subnets/snet-data" },
"privateLinkServiceConnections": [{
"groupIds": ["blob"],
"privateLinkServiceId": "/storageAccounts/stprod-customer-data"
}]
}
}
[WRITE] Private endpoint created: pe-stprod-data
Subnet: snet-data
DNS zone: privatelink.blob.core.windows.net
[VERIFIED]
Step 5 -- Update Network Security Group
PUT https://management.azure.com/.../networkSecurityGroups/{nsg}/securityRules/DenyAllInternetOutbound
{
"properties": {
"priority": 4096,
"direction": "Outbound",
"access": "Deny",
"protocol": "*",
"destinationAddressPrefix": "Internet"
}
}
[WRITE] NSG rule added: DenyAllInternetOutbound (priority 4096)
[VERIFIED]
Step 6 -- Verify encryption at rest
GET https://management.azure.com/.../storageAccounts/stprod-customer-data/encryptionScopes
Encryption: AES-256
Blob service: enabled
Key source: Microsoft.Storage
TLS minimum: TLS1_2
No write operation here. This is a verification pass. If encryption were not configured correctly, the system would pause and escalate rather than proceed.
Step 7 -- Run compliance check
Compliance evaluation:
CMMC SC.3.177 [PASS] - Data encryption in transit and at rest
NIST SC-28 [PASS] - Protection of information at rest
SOC 2 CC6.1 [PASS] - Logical and physical access controls
This is where the compliance automation pays off. The remediation did not just fix a configuration problem. It generated verified evidence for three control frameworks simultaneously.
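Under the hood, this is a mapping from remediation patterns to the controls they evidence. A toy version (the specific mapping values here are illustrative, not an authoritative crosswalk) shows the shape:

```python
# Illustrative mapping from remediation patterns to the controls they evidence.
CONTROL_MAP = {
    "disable_public_blob_access": {"CMMC-SC.3.177", "NIST-SC-28", "SOC2-CC6.1"},
    "create_private_endpoint": {"NIST-SC-28", "SOC2-CC6.1"},
    "deny_internet_outbound": {"SOC2-CC6.1"},
}

def controls_satisfied(actions: list) -> set:
    """Union the controls evidenced by every action in a remediation run."""
    satisfied = set()
    for action in actions:
        satisfied |= CONTROL_MAP.get(action, set())
    return satisfied
```

One run through the map, and the same four write operations produce evidence rows for every framework the organization is assessed against.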
Step 8 -- Audit trail entry
{
"remediation_id": "rem-20260307-0842",
"resource": "stprod-customer-data",
"triggered_by": "policy:PUBLIC_STORAGE_CRITICAL",
"executed_by": "policycortex-agent-v2",
"duration_seconds": 187,
"write_operations": 4,
"controls_satisfied": ["CMMC-SC.3.177", "NIST-SC-28", "SOC2-CC6.1"],
"pre_state_snapshot": {},
"post_state_snapshot": {},
"verified": true
}
Remediation complete
8 steps executed | 4 write operations verified | 3 compliance controls satisfied
Time to remediation: 3 minutes 7 seconds
Compare that to 6-14 days in a traditional ticket-based workflow.
Beyond Remediation
Security remediation is the core capability, but it is not the only one. A few other things the system handles:
FinOps Intelligence
PolicyCortex tracks cloud spend against budget in real time and identifies optimization opportunities. The current dashboard shows $185.73 in current spend against an $8,000 monthly budget -- but more importantly, $1,175 in savings achieved this month through automated rightsizing and waste elimination, with next-month spend forecast at $218.67. That forecast is generated from usage trend analysis, not guesswork.
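"Usage trend analysis" can mean many things; the simplest honest version is a regression over recent daily spend. This toy stand-in (not the production model, which weighs more signals) shows the idea:

```python
def forecast_next_month(daily_spend: list, days_next_month: int = 30) -> float:
    """Project next month's spend from a least-squares trend over daily spend."""
    n = len(daily_spend)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_spend) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, daily_spend)) / denom
    intercept = mean_y - slope * mean_x
    # Sum the fitted line extended over the next month's days
    return sum(intercept + slope * (n + d) for d in range(days_next_month))
```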
AI Observability
As organizations run more AI workloads, model inference costs have become a meaningful line item that most FinOps tools are not built to track. PolicyCortex surfaces AI model spend by provider -- currently $15,420 this month across OpenAI (56.7%), internal ML infrastructure (20.7%), and Azure Cognitive Services (12.1%) for a sample enterprise deployment. When a model cost profile changes unexpectedly, you want to know before the invoice arrives.
Natural Language Operations
Infrastructure operations should not require memorizing API syntax. PolicyCortex accepts natural language instructions:
"Tag all Dev VMs with Environment=Development"
The system parses intent, identifies 487 affected resources, shows you a preview, and executes on confirmation. The same pattern works for policy queries, compliance checks, and bulk operations across resource groups.
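Intent parsing for a command like that does not require anything exotic for the common cases. A deliberately tiny grammar (the real parser handles far more phrasings; this sketch is mine, not the product's) captures the pattern:

```python
import re

# Toy grammar for one instruction shape: "Tag all <env> VMs with <key>=<value>"
TAG_PATTERN = re.compile(
    r"tag all (?P<env>\w+) vms with (?P<key>\w+)=(?P<value>\w+)",
    re.IGNORECASE,
)

def parse_intent(text: str):
    """Parse a tagging instruction into a structured intent, or None."""
    m = TAG_PATTERN.search(text)
    if not m:
        return None
    return {
        "action": "tag",
        "resource_filter": {"type": "vm", "environment": m.group("env")},
        "tag": {m.group("key"): m.group("value")},
    }
```

The structured intent is what drives the preview step: the filter resolves to a concrete resource list, and nothing executes until that list is confirmed.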
ATO Automation
For organizations pursuing federal authorizations -- FedRAMP Moderate, CMMC L2/L3, NIST SP 800-171 -- the evidence collection burden alone can take months. PolicyCortex maintains continuous control evidence across 14 control domains. CMMC L2 covers 110 controls. CMMC L3 covers 130. FedRAMP Moderate maps 325 controls. Evidence is generated as a byproduct of normal operations, not as a separate audit exercise.
Why Not Just Use Existing Tools?
The honest answer is: the existing tools were built to sell visibility, not action.
CSPM tools (Wiz, Prisma Cloud) are excellent at finding problems. They are not built to fix them. Their output is a findings report that feeds a human workflow.
Cloud management platforms (CloudHealth, Apptio) are strong on cost analytics. They tell you what to optimize. The optimization itself is manual.
Observability platforms (Datadog, Dynatrace) give you metrics and traces. They do not touch your infrastructure configuration.
ITSM platforms (ServiceNow) are designed to manage human workflows. They are the queue where alerts go to wait.
Each of these tools does its job. The problem is the seams between them -- the handoffs, the context loss, the ticket lag. PolicyCortex is not trying to be a better CSPM or a better FinOps tool. It is trying to be what none of them are: an agent that closes the loop from detection to remediation without requiring a human to operate every step.
That said, PolicyCortex integrates with these tools rather than demanding you rip them out. If your team is invested in Datadog for observability, PolicyCortex can ingest signals from it. If ServiceNow is your system of record, remediation actions can be logged there. The goal is to reduce the number of tools you need, not create a migration crisis.
The Hard Parts
I want to be honest about what is difficult here.
Trust calibration is a real problem. The hardest product decision was figuring out what the system should do autonomously versus what it should gate. Too conservative and you have an expensive alert tool. Too aggressive and you have an outage risk. The current model uses a combination of violation severity, resource criticality tags, change window schedules, and explicit policy rules to decide. It is not perfect and I expect this to evolve significantly based on how teams actually use it.
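To make that decision model tangible, here is a radically simplified version of the mode selection. The real rules engine is policy-driven and richer than this; the function below is my own sketch of the inputs and their precedence:

```python
def decide_mode(severity: str, criticality_tag: str,
                in_change_window: bool, policy_allows_auto: bool) -> str:
    """Decide Autonomous vs Gated for one violation (simplified rules)."""
    if not policy_allows_auto:
        return "gated"
    if criticality_tag == "mission-critical":
        return "gated"          # always keep a human in the loop here
    if severity == "critical":
        return "autonomous"     # exposure cost outweighs change risk
    # Lower-severity fixes wait for an approved change window
    return "autonomous" if in_change_window else "gated"
```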
Cloud APIs are inconsistent. Azure, AWS, and GCP each have their own resource models, authentication patterns, and eventually-consistent behavior. A remediation that works cleanly in Azure takes a different implementation in AWS. The abstraction layer that makes natural language operations possible is non-trivial to maintain across providers.
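The shape of that abstraction layer is a provider-neutral interface with per-cloud implementations. This sketch is schematic -- the class names are mine and the AWS endpoint is simplified -- but it shows why "disable public access" is one concept and two very different API calls:

```python
from abc import ABC, abstractmethod

class StorageRemediator(ABC):
    """Provider-neutral interface; each cloud supplies its own implementation."""

    @abstractmethod
    def disable_public_access_request(self, name: str):
        """Return (method, url, body) for the provider-specific API call."""

class AzureStorageRemediator(StorageRemediator):
    def __init__(self, subscription: str, resource_group: str):
        self.base = (f"https://management.azure.com/subscriptions/{subscription}"
                     f"/resourceGroups/{resource_group}")

    def disable_public_access_request(self, name: str):
        # Azure: one boolean on the storage account resource
        url = f"{self.base}/providers/Microsoft.Storage/storageAccounts/{name}"
        return ("PATCH", url, {"properties": {"allowBlobPublicAccess": False}})

class AwsStorageRemediator(StorageRemediator):
    def disable_public_access_request(self, name: str):
        # AWS: no single flag; the equivalent is the PublicAccessBlock config
        return ("PUT", f"https://s3.amazonaws.com/{name}?publicAccessBlock", {
            "BlockPublicAcls": True, "IgnorePublicAcls": True,
            "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
        })
```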
Compliance mapping is genuinely complex. CMMC L3 and FedRAMP Moderate share significant overlap but are not identical. Mapping a single infrastructure control to the right framework requirements, and doing it accurately enough that the evidence is actually usable in an audit, requires domain knowledge that is hard to encode. I spent a meaningful portion of my DoD career doing this by hand. Automating it correctly took longer than any other part of the system.
What is Next
A few things I am actively working on:
Multi-cloud remediation parity. Azure is production-ready today. AWS support is in private beta. GCP is on the roadmap.
Remediation playbook library. Right now, remediation logic is built into the system. I am working on a public playbook format so teams can write, share, and audit their own remediation patterns.
Drift detection and rollback. If a human makes a manual change that puts a resource out of policy, PolicyCortex should detect the drift and either alert or re-remediate depending on policy. Rollback of PolicyCortex's own operations is also something I want to make first-class.
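The core of drift detection is a diff between the desired state a policy declares and the state a fresh read returns. A minimal sketch of that comparison (field names borrowed from the remediation flow above):

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired, actual)} for every policy-managed field that drifted."""
    return {
        key: (value, actual.get(key))
        for key, value in desired.items()
        if actual.get(key) != value
    }
```

Whether a non-empty result triggers an alert or an automatic re-remediation is then a policy decision, exactly like the autonomous-versus-gated split.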
Compliance report generation. Right now the evidence exists in structured logs. Generating a human-readable FedRAMP package or CMMC assessment report from that evidence is the next step.
Try It
If you are running production workloads in Azure and you are tired of security findings sitting in queues while your infrastructure stays exposed, I want you to try this.
Visit policycortex.com to request early access. I am working directly with the first cohort of users and I will take every piece of feedback seriously.
If you have questions about the architecture, the compliance automation, or the design decisions behind gated versus autonomous mode -- drop them in the comments. I have been heads-down building this for a while and I am genuinely interested in what resonates and what does not with people who live in cloud infrastructure every day.