DEV Community: Mike Anderson

The Color of War: AI Purple Teaming Link 16 J-Messages Without Touching the Live Network

Mike Anderson — Fri, 17 Jul 2026 14:34:47 +0000

The Color of War: AI Purple Teaming Link 16 J-Messages Without Touching the Live Network

There are networks you do not casually scan.

Not because they are secure.

Because they are dangerous to break.

A web app can be tested in staging.

A cloud workload can be isolated.

A container can be rebuilt.

A failed API release can be rolled back.

A tactical data link is different.

If the wrong message is trusted, the system may believe a false track.

If timing is manipulated, the system may act on stale information.

If identity is confused, the system may build a corrupted battlespace picture.

If detection is noisy, operators may stop trusting the alarms.

If testing is careless, the test itself becomes the risk.

That is the problem this article solves:

How do you red-team, blue-team, and purple-team Link 16 J-message protocol behavior when the real network cannot be touched?

This is not a generic AI security article.

This is a defense-centric walkthrough of a controlled purple-team exercise: build a digital twin of the Link 16 protocol layer, train red AI to discover J-message failure modes, train blue AI to detect them, and use purple-team engagement to turn the exercise into engineering controls.

The objective is simple:

Break the twin 10,000 times so the real mission network does not have to break once.

What We Are Building

We are building a safe test architecture for Link 16-style J-message protocol testing.

The system under test is not the aircraft.

It is not the radio hardware.

It is not classified cryptography.

It is not a live operational network.

The system under test is the J-message processing and timing logic:

message schema validation
source identity handling
timestamp and freshness validation
TDMA slot discipline
replay behavior
malformed-field handling
track-fusion plausibility
degraded communication behavior
alert explainability
operator decision support
remediation and regression proof

Everything happens inside a lab.

No live RF.

No live tactical network.

No real aircraft.

No operational exploitation.

The phrase to keep in mind:

We test the protocol logic, not the live battlespace.

The Journey

This article follows one continuous exercise.

Understand why Link 16 is not normal IT.
Treat TDMA timing as a security signal.
Treat J-messages as the protocol API.
Build a digital twin where the protocol can be safely broken.
Train a red AI agent to discover J-message failure modes.
Train a blue AI agent to detect timing, identity, sequence, and plausibility anomalies.
Run a purple-team engagement.
Produce evidence, remediation, and regression tests.
Convert the lesson into a model for future defense systems.

This is the missing connection in many AI-security conversations.

AI is not the strategy.

The purple-team engagement is the strategy.

AI is the accelerator.

1. The Problem: Link 16 Is Not Your Enterprise LAN

Security teams are comfortable with IP.

Ports. Packets. Agents. Logs. SIEM. EDR. CloudTrail. VPC Flow Logs. Suricata. Zeek. Kubernetes events. GitHub alerts. Terraform drift.

That world gives us visibility.

A Link 16-style tactical network does not give us that comfort.

It is a tactical data link used to share situational awareness and command information between military platforms. For this article, the important thing is not the operational implementation. The important thing is the security model:

Link 16 is a deterministic, time-disciplined, structured-message network where trust depends on message validity, timing, source identity, and shared state.

That is a very different target.

There may be no normal IP path to scan.

There may be no endpoint agent to install.

There may be no safe packet capture.

There may be no acceptable test outage.

There may be no room for “we were just testing.”

In enterprise IT, a failed test might break a service.

In a tactical environment, a failed assumption can corrupt the picture people rely on to make decisions.

That is why traditional pentesting is the wrong starting point.

The right starting point is a safe, replayable twin.

2. TDMA: The Clock Is Part of the Security Boundary

Link 16-style communication is time-disciplined.

A useful mental model is TDMA: Time Division Multiple Access.

Each participant gets a scheduled time slot. It speaks when it is allowed to speak. Others listen when they are supposed to listen.

Simplified:

Cycle 1:
  Slot 0 -> F-16 #1
  Slot 1 -> AWACS
  Slot 2 -> Ship
  Slot 3 -> F-16 #2

Cycle 2:
  Slot 4 -> F-16 #1
  Slot 5 -> AWACS
  Slot 6 -> Ship
  Slot 7 -> F-16 #2

For software engineers, imagine a distributed system where each service has a strict write window. If a service writes outside its window, that is not just bad engineering. It is a security signal.

Timing becomes telemetry.

Blue team can ask:

Did the message arrive in the expected slot?
Did the participant transmit when expected?
Did timing drift slowly?
Did the message arrive too late to be trusted?
Did a stale update look fresh?
Did silence itself become meaningful?

This is the first defense insight:

In deterministic networks, time is not metadata. Time is part of the control surface.

That cuts both ways.

The defender can detect deviations because the system is predictable.

The adversary, defect, or failure condition can also exploit trust in predictable timing if validation is weak.

That is why the purple-team exercise focuses on timing and J-message behavior together.

3. J-Messages: The Language and the Attack Surface

If TDMA is the clock, J-messages are the language.

A J-message is the structured data that travels in the slot. It may represent participant identity, track data, position, velocity, status, commands, or other tactical state.

For software engineers, the best analogy is:

A J-message is a strict binary API call sent on a clock.

Not HTTP.

Not JSON.

Not a TCP port.

Not a web form.

Not a normal packet capture exercise.

A simplified simulated J-message frame might look like this:

+------------+-----------+----------------------+----------+------------+
| Msg Type   | Source ID | Position             | Velocity | Timestamp  |
+------------+-----------+----------------------+----------+------------+
| Track      | F16-01    | lat / lon / altitude | vector   | T+217s     |
+------------+-----------+----------------------+----------+------------+

Now the attack surface becomes clearer.

Not “Can I run nmap?”

The real questions are:

Can the parser safely reject malformed fields?
Can the receiver detect duplicate or conflicting source IDs?
Can the system reject stale but well-formed messages?
Can track fusion detect physically impossible movement?
Can timing validation detect messages outside the expected slot?
Can the system degrade safely when slots are missing?
Can the SOC explain which message, which slot, and which rule caused the alert?

This is why J-message testing matters.

A message can be syntactically valid but operationally dangerous.

That is the heart of the exercise.

4. Why Traditional Pentesting Fails

Traditional pentesting asks:

What is exposed?
What service is vulnerable?
Can authentication be bypassed?
Can privilege be escalated?
Can data be extracted?

Those questions matter in normal IT.

They do not fully solve Link 16 J-message risk.

For this problem, the better questions are:

What message fields are trusted too easily?
What timestamp assumptions are not enforced?
What identity conflicts are not resolved safely?
What replay windows are too permissive?
What malformed fields destabilize parsing?
What impossible track update survives fusion?
What blue-team signal proves the issue?
What engineering control prevents recurrence?

A scanner cannot answer those questions.

A live tactical test is unsafe.

A generic dashboard is not enough.

The answer is a digital twin with a purple-team operating model.

5. The Digital Twin: The Safe War Lab

The digital twin is where dangerous questions become safe experiments.

It does not need to be a real aircraft.

It does not need to be real RF.

It does not need to expose classified implementation details.

It needs to simulate the protocol-layer behavior we care about:

J-message structure
TDMA slot timing
participant identity
timestamp and freshness logic
parser behavior
track database updates
fusion plausibility
degraded-link behavior
telemetry capture
replay

A useful twin gives us one thing the live network cannot:

permission to fail safely.

Inside the twin, red can try.

Blue can detect.

Purple can judge.

Engineering can fix.

The replay can prove whether the fix worked.

Physical World to Digital Twin Mapping

Physical Link 16 Concept	Digital Twin Equivalent
F-16 mission computer	Pod running simulated terminal logic
AWACS node	Pod running command/coordination logic
Ship node	Pod running participant logic
TDMA slot plan	ConfigMap or mounted YAML schedule
Radio transmission	UDP or event bus message between pods
RF interference	Network delay, loss, deny policy, or chaos injection
Mission replay	Persistent replay file
Tactical picture	Simulated track database
SOC visibility	Kafka stream, logs, metrics, model output
Purple-team evidence	Replay bundle and after-action report

This does not mean we containerize a fighter jet.

We containerize the protocol node.

That is the correct abstraction.

Kubernetes as the Lab Harness

Kubernetes is useful because this exercise needs repeatability, isolation, and scale.

Each participant can be a pod.

Each mission can be a namespace.

Each TDMA schedule can be YAML.

Each scenario can emit telemetry.

Each replay can be stored.

Each defense change can be versioned.

Each night can run hundreds or thousands of engagements.

A simplified architecture:

+----------------------------------------------------------+
| Kubernetes Digital Twin                                  |
|                                                          |
|  +----------+     +----------+     +----------+          |
|  | F-16 #1  | --> | AWACS    | --> | Ship     |          |
|  +----------+     +----------+     +----------+          |
|        |               |               |                 |
|        +---------------+---------------+                 |
|                        |                                 |
|                 J-message Event Mesh                     |
|                        |                                 |
|        +---------------+---------------+                 |
|        |                               |                 |
|  Red AI Agent                    Blue AI Agent            |
|        |                               |                 |
|        +---------------+---------------+                 |
|                        |                                 |
|              Purple Evidence Store                       |
+----------------------------------------------------------+

The twin is not a gimmick.

It is the only safe place where this exercise can be run at useful scale.

6. The Mission: Test J-Message Trust

The purple-team mission statement:

Determine whether AI-assisted red, blue, and purple teams can safely discover, detect, explain, and remediate J-message protocol failure modes inside a Link 16 digital twin.

That mission gives each team a role.

Red Team Mission

Discover safe, replayable J-message failure modes in the twin.

Blue Team Mission

Detect and explain J-message anomalies using timing, identity, freshness, sequence, parser, and track-fusion signals.

Purple Team Mission

Validate whether the red behavior matters, whether blue detection is useful, and whether engineering can fix the control gap.

This framing prevents the article from becoming an AI tool dump.

Everything serves the mission.

7. The AI Toolchain: Models, Engines, Harnesses, and What Each Team Actually Does

The prompt is explicit: the AI system is not one model doing everything. It is three different AI roles running inside the same Kubernetes-based digital twin.

Each role has a different model, engine, harness, and mission.

Team	Model	Engine / Runtime	Harness	Primary Job
Red Team	PPO policy with LSTM memory	Ray RLlib	Custom Gym environment wrapping the twin	Discover safe, replayable J-message failure modes
Blue Team	Pre-trained Transformer in PyTorch	TorchServe or Triton	Kafka consumer + validators + correlation engine	Detect timing, identity, sequence, freshness, and track anomalies
Purple / SOC Team	Llama 3 or Mistral LLM	Ollama for local lab or approved API	LangChain + MCP tools	Retrieve evidence, summarize engagement, draft after-action report

This matters because each model type solves a different problem.

The red problem is exploration.

The blue problem is sequence understanding.

The purple problem is evidence explanation.

Using one generic LLM for all three would be the wrong architecture.

Why PPO/LSTM for Red?

The red agent is trying to discover multi-step protocol failure modes.

A single J-message anomaly may not be interesting. The interesting failure may require a sequence:

introduce timing jitter,
wait for recovery state,
replay a stale synthetic track,
observe whether the track database accepts it,
adapt if blue detects it.

That is sequential decision-making.

PPO gives the red agent a stable reinforcement learning method for exploring the action space. LSTM memory helps it remember previous steps in the episode, which matters when the failure only appears after a chain of message and timing events.

The red agent is not a free-form attacker.

It is a policy model trained inside a controlled Gym environment.

Red PPO/LSTM Agent
        ↓
Custom Gym Environment
        ↓
MCP Tool API
        ↓
Kubernetes Digital Twin
        ↓
J-message stream + track database
        ↓
Reward calculation
        ↓
Policy update

The red agent observes the twin state, selects a safe simulated action, receives reward or penalty, and learns over many episodes.

Why a Transformer for Blue?

The blue agent is watching a stream.

J-messages are not isolated events. They form sequences over time:

source → slot → message type → timestamp → track state → next message

A static signature can catch obvious failures, but it struggles with context.

The blue Transformer learns normal protocol grammar:

what messages usually follow each other,
which source usually speaks in which slot,
how timing behaves during normal and degraded modes,
how track updates evolve,
which kinematic changes are plausible,
which sequence patterns are unusual.

The blue model does not replace deterministic validation.

It sits beside it.

Kafka J-message stream
        ↓
Tokeniser / Feature Extractor
        ↓
Transformer sequence model
        ↓
Physics checker + timing validator
        ↓
Correlation engine
        ↓
Anomaly score + reason codes

This is why the blue output can say:

This was not just a weird message. It was stale during recovery, from a source with valid identity, in a slot that looked normal, but the sequence and track movement were inconsistent.

That is the kind of explanation a SOC analyst and protocol engineer can use.

Why Llama/Mistral + LangChain + MCP for Purple?

The purple team does not need an LLM to invent findings.

It needs an LLM to retrieve evidence, organize timelines, and write a clear after-action report.

The purple LLM analyst uses:

Llama 3 or Mistral as the language model,
Ollama when the lab needs local/offline execution,
an approved API when policy allows managed inference,
LangChain as the agent harness,
MCP as the safe tool interface to internal lab data.

The LLM does not touch the live network.

It reads from the twin’s evidence stores:

LLM Analyst
  ├── retrieve_message_history(track_id)
  ├── get_blue_attention_map(alert_id)
  ├── get_red_episode_trace(episode_id)
  ├── get_slot_plan(scenario_id)
  ├── get_track_db_diff(track_id)
  └── generate_after_action_report()

MCP is important because it gives the LLM a controlled, auditable way to access the twin’s evidence. The LLM is not browsing around freely. It is using approved tools against approved lab data.

The LLM’s mission:

Turn red/blue telemetry into a human-readable report without inventing facts.

8. Kubernetes Lab Setup: How the AI System Actually Runs

The lab runs as a controlled Kubernetes environment.

Not because Kubernetes is a fighter jet.

Because Kubernetes gives us repeatable scenarios, isolated namespaces, declarative configuration, telemetry, scaling, and replay.

A practical lab namespace looks like this:

namespace: link16-purple-lab

Workload Pods:
  node-f16-1
  node-awacs
  node-ship
  node-f16-2
  rf-emulator
  red-rl-agent
  blue-transformer
  soc-llm-analyst
  mcp-server
  kafka
  track-db
  evidence-store
  soc-dashboard

The flow:

+---------------------------------------------------------------+
| link16-purple-lab namespace                                   |
|                                                               |
|  Simulated Link 16 Nodes                                      |
|  +---------+   +---------+   +---------+   +---------+        |
|  | F16-01  |   | AWACS   |   | Ship    |   | F16-02  |        |
|  +----+----+   +----+----+   +----+----+   +----+----+        |
|       |             |             |             |             |
|       +-------------+-------------+-------------+             |
|                         |                                     |
|                 Kafka J-message Stream                        |
|                         |                                     |
|       +-----------------+------------------+                  |
|       |                                    |                  |
| +-----v------+                      +------v---------+        |
| | Red PPO/   |  MCP safe actions    | Blue Transformer|        |
| | LSTM Agent |--------------------->| + Validators    |        |
| +-----+------+                      +------+---------+        |
|       |                                    |                  |
|       +-----------------+------------------+                  |
|                         |                                     |
|                 Evidence Store / Track DB                     |
|                         |                                     |
|                 SOC LLM Analyst + Dashboard                   |
+---------------------------------------------------------------+

The key Kubernetes objects:

Component	Kubernetes Object	Purpose
Simulated participants	Deployments / StatefulSets	Run terminal logic and J-message processing
TDMA schedule	ConfigMap	Defines slot plan and participant timing
Scenario definition	ConfigMap or CRD	Defines mission, participants, allowed red actions
Kafka	StatefulSet / Operator	Carries J-message telemetry
Track DB	StatefulSet	Maintains simulated tactical picture
Red training	RayCluster / Jobs	Runs PPO/LSTM training episodes
Blue inference	Deployment	Serves Transformer model via TorchServe or Triton
LLM analyst	Deployment	Runs LangChain agent with local Ollama or API gateway
MCP server	Deployment	Provides controlled tool access to twin telemetry
Evidence store	PVC / object storage	Stores replays, alerts, traces, and reports
SOC dashboard	Deployment / Service	Shows alerts, timelines, and after-action reports
Network effects	NetworkPolicy / chaos tooling	Simulates loss, denial, latency, and degraded links

This is not a demo-only architecture.

This is the operating model.

Lab Deployment Flow

A safe lab run looks like this:

1. GitOps deploys namespace and base services
2. TDMA slot plan loads from ConfigMap
3. Simulated participants start producing J-messages
4. Kafka captures full message stream
5. Blue Transformer starts baseline scoring
6. Red PPO/LSTM agent starts lab-only training episodes
7. MCP server exposes approved twin actions and evidence queries
8. Purple dashboard shows red action, blue detection, and timeline
9. LLM analyst drafts evidence-bound after-action report
10. Engineering fix is deployed back into the twin
11. Replay confirms whether the fix worked

A conceptual lab manifest structure:

k8s/
  namespaces/
    link16-purple-lab.yaml
  configmaps/
    tdma-slot-plan.yaml
    scenario-ghost-track.yaml
    red-action-policy.yaml
  deployments/
    node-f16-1.yaml
    node-awacs.yaml
    node-ship.yaml
    node-f16-2.yaml
    blue-transformer.yaml
    soc-llm-analyst.yaml
    mcp-server.yaml
  ray/
    red-rl-training-job.yaml
  serving/
    torchserve-blue-model.yaml
    triton-blue-model.yaml
  policies/
    networkpolicy-deny-awacs-f16.yaml
    red-agent-egress-deny.yaml
    mcp-tool-allowlist.yaml
  storage/
    evidence-pvc.yaml
    replay-store.yaml

The most important lab control:

The red agent never gets direct Kubernetes admin rights and never touches the live network. It only calls approved MCP tools scoped to the twin.

MCP Tool Boundary for the Lab

The MCP server is the control point between AI agents and the twin.

Red tools are action tools:

simulate_delay_message(message_id, delay_ms)
simulate_drop_slot(node_id, slot_id)
simulate_replay_message(message_id, replay_window)
simulate_identity_conflict(source_id, scenario_id)
simulate_kinematic_edge_case(track_id, profile)

Blue tools are context tools:

get_slot_plan(scenario_id)
get_track_state(track_id)
check_slot_sync(message_id)
check_physics(track_id)
get_message_context(message_id)

Purple tools are evidence tools:

retrieve_message_history(track_id)
get_red_episode_trace(episode_id)
get_blue_attention_map(alert_id)
get_detection_timeline(alert_id)
generate_report(finding_id)

Each tool call must be logged:

mcp_audit_event:
  timestamp: T+217s
  agent: red-rl-agent
  tool: simulate_replay_message
  scenario: ghost_track_recovery
  allowed: true
  evidence_pointer: replay/episode-00981/

This is how the lab stays safe and auditable.

9. Red AI Agent: Training the Synthetic Adversary

The red AI agent is not a hacker.

It is a controlled failure-mode discovery engine.

It has no live network access.

It has no real RF access.

It does not bypass real crypto.

It does not deploy malware.

It does not operate outside the twin.

Its job is to explore how the simulated protocol stack can be confused, degraded, or forced into unsafe state assumptions.

Think of the red agent as a sparring partner for the protocol.

Red Agent Objective

The red agent’s model choice is deliberate:

PPO gives stable reinforcement learning for bounded action exploration.
LSTM memory helps the agent learn multi-step timing and replay sequences.
Ray RLlib lets training scale across GPU or CPU worker nodes.
Gym harness exposes the digital twin as reset(), step(action), reward, and done.
MCP tools are the only way the agent can act on the twin.

The red agent tries to answer:

Can I create a J-message sequence that degrades the tactical picture while staying inside realistic lab boundaries?

Safe simulated action categories:

delay a simulated J-message
suppress a simulated slot
replay a synthetic lab message
introduce a simulated source identity conflict
alter a simulated field within schema boundaries
trigger a malformed-field test case
introduce timing jitter inside approved limits
create a kinematic edge case for track fusion

These are not operational attack instructions.

They are controlled test actions exposed by the twin.

Red Agent Configuration

A conceptual red-agent configuration:

red_agent:
  name: jmessage_red_rl_agent
  environment: link16_digital_twin
  live_system_access: false

  observations:
    - recent_jmessage_sequence
    - current_slot_number
    - participant_state_table
    - track_database_summary
    - blue_detection_feedback
    - scenario_phase

  allowed_actions:
    - simulate_delay_message
    - simulate_drop_slot
    - simulate_replay_message
    - simulate_identity_conflict
    - simulate_schema_boundary_case
    - simulate_kinematic_edge_case

  prohibited_actions:
    - live_network_access
    - real_radio_interaction
    - credential_access
    - malware_behavior
    - destructive_payloads
    - persistence
    - external_network_calls

  evidence:
    log_action: true
    retain_episode_replay: true
    retain_reward_trace: true
    retain_blue_response: true

The control philosophy matters more than the syntax:

The red agent can only manipulate the twin through approved, logged, replayable simulation actions.

Red Agent Reward Design

A bad reward function teaches chaos.

A good reward function teaches useful failure discovery.

The red agent should not be rewarded for maximum disruption. It should be rewarded for finding realistic, repeatable, safety-bounded gaps that blue misses or detects too late.

Conceptual reward function:

def red_reward(system_impact, blue_response, action):
    reward = 0

    if action.violates_safety_boundary:
        reward -= 100

    if action.is_unrealistic_for_scenario:
        reward -= 25

    if system_impact.created_ghost_track:
        reward += 8

    if system_impact.created_stale_state_acceptance:
        reward += 8

    if system_impact.caused_parser_instability:
        reward += 6

    if system_impact.degraded_track_confidence:
        reward += 5

    if blue_response.detected_immediately:
        reward -= 6

    if blue_response.detected_late:
        reward += 3

    if blue_response.missed and system_impact.is_replayable:
        reward += 10

    if not system_impact.has_replay_evidence:
        reward -= 10

    return reward

This trains the red agent to discover control gaps, not generate noise.

Red Agent Training Loop

Reset mission scenario
    ↓
Observe J-message state
    ↓
Choose safe simulated action
    ↓
Apply action inside twin
    ↓
Measure tactical-picture impact
    ↓
Measure blue-team response
    ↓
Calculate reward
    ↓
Store replay evidence
    ↓
Repeat

Pseudocode:

for episode in range(NUM_EPISODES):
    state = twin.reset(scenario=random_scenario())
    done = False

    while not done:
        action = red_policy.select_action(state)

        if not safety_policy.allowed(action):
            evidence.log_violation(action)
            reward = -100
            break

        next_state, impact, blue_response, done = twin.step(action)

        reward = red_reward(
            system_impact=impact,
            blue_response=blue_response,
            action=action
        )

        red_policy.learn(state, action, reward, next_state)

        evidence.store(
            episode=episode,
            state=state,
            action=action,
            impact=impact,
            blue_response=blue_response,
            reward=reward
        )

        state = next_state

The output is a failure-mode catalog.

Example:

failure_mode_id: RED-FM-041
title: stale track accepted during recovery window
scenario: degraded_link_recovery
impact: track database accepted old position as current
blue_result: detected late
repeatability: 88%
recommended_control: enforce timestamp freshness during recovery state
evidence:
  - replay/RED-FM-041/messages.jsonl
  - replay/RED-FM-041/timing.jsonl
  - replay/RED-FM-041/blue_response.json

That is the first meaningful handoff.

Red found a failure mode.

Now blue must prove it can detect it.

10. Blue AI Agent: Training the Defender

The blue team model is a pre-trained Transformer implemented in PyTorch, served through TorchServe or Triton for real-time inference against the Kafka J-message stream.

The blue AI agent is not just a model.

It is a detection system.

Its mission:

Watch the J-message stream and determine whether message, timing, identity, sequence, freshness, or track behavior violates expected mission state.

The blue agent must be explainable.

If it cannot tell the operator why the alert fired, it is not ready for high-consequence environments.

Blue Agent Inputs

The blue agent consumes structured telemetry:

J-message stream
TDMA slot timing
participant identity table
timestamp deltas
track database changes
parser validation results
replay and freshness indicators
simulated kinematic plausibility
scenario metadata
red-agent replay labels for training

This is not “throw logs into AI.”

This is detection engineering.

Blue Agent Architecture

Kafka J-message stream
        ↓
Tokeniser / Feature Extractor
        ↓
+----------------------+-------------------------+
| Deterministic Rules  | Sequence Model          |
| - schema validity    | - Transformer / LSTM    |
| - timing boundary    | - next-message predict  |
| - replay freshness   | - anomaly scoring       |
| - source identity    | - sequence drift        |
| - state transition   |                         |
+----------------------+-------------------------+
        ↓
Correlation Engine
        ↓
Anomaly Score + Reason Codes
        ↓
SOC Alert + Evidence Pointer

Rules catch what must never happen.

The model catches what looks wrong in context.

Correlation decides whether it matters.

Blue Agent Configuration

blue_agent:
  name: jmessage_blue_detector
  purpose: detect_explain_prioritize

  telemetry_inputs:
    - jmessage_stream
    - tdma_timing_events
    - participant_identity_state
    - track_database_updates
    - parser_validation_results
    - red_replay_labels_for_training

  validators:
    - schema_validator
    - timing_window_validator
    - replay_freshness_validator
    - source_identity_validator
    - track_plausibility_validator
    - state_transition_validator

  model:
    type: sequence_anomaly_detector
    objectives:
      - next_message_prediction
      - timing_sequence_anomaly
      - track_plausibility_classification

  correlation:
    alert_threshold: 0.85
    critical_threshold: 0.95
    require_reason_codes: true

  output:
    - anomaly_score
    - reason_codes
    - affected_track
    - affected_source_id
    - slot_number
    - message_pointer
    - replay_pointer
    - recommended_playbook

The output must answer:

What happened?
Why is it suspicious?
Which message caused it?
Which slot was involved?
Which source was involved?
What control failed?
What should the operator do?

If blue cannot explain it, purple cannot use it.

Blue Agent Training

Blue training starts with normal mission traffic.

The model needs to learn what normal looks like before it can identify abnormal behavior.

Training data classes:

Class	Meaning
Normal	Expected J-message behavior
Degraded but acceptable	Loss, latency, or recovery within bounds
Suspicious	Requires investigation
Confirmed failure mode	Red replay proved a control gap
Remediated	Fixed and regression-tested behavior

Training loop:

Collect baseline J-message telemetry
    ↓
Label normal and degraded scenarios
    ↓
Replay red-agent failure modes
    ↓
Tokenize message, timing, identity, and state features
    ↓
Train sequence model
    ↓
Tune deterministic validators
    ↓
Evaluate false positives in degraded mode
    ↓
Promote stable detections into purple exercise

Pseudocode:

normal = load_sequences("telemetry/baseline/")
degraded = load_sequences("telemetry/degraded_acceptable/")
red_replays = load_sequences("evidence/red_failure_modes/")
remediated = load_sequences("evidence/remediated_replays/")

dataset = build_dataset(
    normal=normal,
    degraded=degraded,
    suspicious=red_replays,
    remediated=remediated
)

model = SequenceDetector(
    features=[
        "message_type",
        "source_id",
        "slot_delta",
        "timestamp_delta",
        "schema_valid",
        "freshness_score",
        "track_plausibility",
        "state_transition"
    ]
)

model.train(dataset)

results = evaluate(
    model=model,
    test_sets={
        "normal": normal.holdout,
        "degraded": degraded.holdout,
        "red_replay": red_replays.holdout
    }
)

The important metrics are not only ML metrics.

For blue-team operations, the useful metrics are:

Metric	Why it matters
Detection precision	Avoids analyst overload
Detection recall	Measures missed failure modes
Mean time to detect	Shows operational value
False positives in degraded mode	Prevents alert storms under stress
Replay consistency	Proves repeatability
Explanation quality	Helps operators trust the alert
Engineering actionability	Helps owners fix the control

The blue agent is ready only when it can detect, explain, and survive degraded-mode testing.

11. Purple Team Engagement: The Center of the Exercise

Red AI training is not the outcome.

Blue AI detection is not the outcome.

A dashboard is not the outcome.

The outcome is purple-team improvement.

Purple team connects:

Red action
  → expected telemetry
  → blue detection
  → operator decision
  → control gap
  → engineering fix
  → replay proof

Without that chain, the exercise is theater.

Purple Team Roles

Role	Responsibility
Purple Lead	Owns mission, scope, safety boundary, and final decision
Red AI Engineer	Trains red agent and validates replay realism
Blue Detection Engineer	Builds validators, model scoring, and alert logic
Twin Platform Engineer	Maintains Kubernetes lab, telemetry, replay, and isolation
SOC Analyst	Tests whether alerts are understandable and actionable
Protocol Engineer	Fixes J-message validation, timing, parser, or fusion logic
Risk Owner	Accepts, rejects, or prioritizes residual risk
Evidence Scribe	Maintains timeline, replay bundle, and final report

Purple team is not a color.

It is the control function.

Purple Team Training

Purple-team training is not the same as red or blue training.

Red learns how to challenge the protocol.

Blue learns how to detect protocol anomalies.

Purple learns how to judge whether the exercise produced a real control improvement.

Purple training has six drills.

Drill 1: Baseline Recognition

The team studies normal J-message flow, slot timing, participant behavior, and track-state transitions.

The goal:

Everyone can explain normal before discussing abnormal.

Drill 2: Red Replay Review

The red agent produces a failure-mode replay.

Purple asks:

Is this behavior realistic in the lab model?
Is it safe?
Is it repeatable?
Does it represent a meaningful protocol risk?
Is it just model weirdness?

The goal:

Separate useful failure modes from artificial noise.

Drill 3: Blue Alert Validation

Blue raises an alert.

Purple asks:

Did the alert identify the message?
Did it identify the slot?
Did it identify the source?
Did it explain the reason?
Did it provide evidence?
Would an operator know what to do?

The goal:

Improve alert quality, not just detection rate.

Drill 4: Operator Decision Tabletop

The SOC analyst receives the alert and must choose:

monitor
enrich
suppress
escalate
isolate in the twin
open engineering defect
request replay

The goal:

Train human judgment under uncertainty.

Drill 5: Engineering Remediation Workshop

Protocol engineering reviews the evidence.

The team decides whether the fix belongs in:

parser validation
timestamp freshness enforcement
source identity logic
TDMA timing validation
track-fusion plausibility
degraded-mode handling
detection tuning
operator playbook

The goal:

Convert the finding into a specific control.

Drill 6: Regression Replay

The original red replay is run again after the fix.

The goal:

Prove the control works and does not break normal or degraded behavior.

That is purple-team training.

It trains the people, the process, and the system.

12. The Engagement Scenario: Ghost Track in the Twin

Now the article becomes a story.

The lab spins up before sunrise.

Five simulated participants come online inside the Kubernetes twin: F-16 #1, AWACS, Ship, F-16 #2, and an RF emulator. The slot plan loads from a ConfigMap. Kafka starts receiving J-message telemetry. The blue agent watches baseline traffic. The red agent waits for the exercise window.

The purple lead states the mission:

Test whether a stale or conflicting J-message can degrade the simulated track picture, and whether blue detection can explain the anomaly fast enough for operator action.

No live network.

No real aircraft.

No operational messages.

Only the twin.

Baseline

The baseline run is clean.

T+000s  Twin starts
T+005s  Participants join scenario
T+010s  Slot cycle stabilizes
T+030s  Track database healthy
T+060s  Blue confirms normal timing and message sequence

Blue records the baseline:

baseline_status:
  timing: normal
  identity: normal
  freshness: normal
  track_plausibility: normal
  parser_errors: none
  anomaly_score: 0.04

Purple approves the red window.

Red Action

The red agent chooses a safe simulated action chain.

It does not attack a real network.

It acts only through the twin’s approved API.

T+090s  Red delays a simulated track update inside allowed lab range
T+096s  Red replays a stale synthetic track message during recovery state
T+097s  Twin accepts message as structurally valid
T+098s  Track database briefly trusts stale position

The red agent logs:

red_action:
  scenario: ghost_track_recovery_window
  action_chain:
    - simulate_delay_message
    - simulate_replay_message
  target_effect: stale_track_accepted_as_current
  safety_boundary: twin_only

The red team has not “won” yet.

A red finding matters only if purple can prove impact and blue can validate detection quality.

Blue Detection

Blue sees three weak signals before correlation:

Signal 1: Message was structurally valid
Signal 2: Timestamp freshness was suspicious
Signal 3: Track movement was inconsistent with recent state
Signal 4: Slot timing was within tolerance but sequence context was abnormal

The deterministic validators alone do not fire a critical alert.

The sequence model raises the anomaly score.

The correlation engine combines timing, freshness, and track plausibility.

blue_alert:
  alert_id: BLUE-ALERT-219
  confidence: 0.982
  affected_source: SIM-F16-01
  affected_track: TRACK-17
  slot_context: recovery_window
  reason_codes:
    - stale_timestamp_during_recovery
    - sequence_context_mismatch
    - track_plausibility_deviation
  recommended_action: isolate_simulated_track_and_replay
  evidence:
    - messages/T+090_to_T+100.jsonl
    - timing/T+090_to_T+100.jsonl
    - trackdb/diff_T+098.json

This is a good alert.

It does not just say “anomaly detected.”

It tells the operator what changed and why.

Purple Review

Purple pauses the exercise.

The team asks five questions:

Was the red action safe and inside scope?
Did the twin produce repeatable impact?
Did blue detect the issue fast enough?
Did the alert explain the issue clearly?
Can engineering fix the control gap?

The answer:

purple_assessment:
  red_realism: acceptable_for_protocol_layer_lab
  safety_boundary: maintained
  impact: stale_track_accepted_during_recovery
  blue_detection: detected
  detection_quality: high
  operator_actionability: acceptable
  control_gap: freshness_validation_not_bound_to_recovery_state
  remediation_owner: protocol_engineering

Now the exercise becomes valuable.

Red did not merely create an anomaly.

Blue did not merely create an alert.

Purple identified a control gap.

Engineering Fix

The fix is not “add AI.”

The fix is protocol engineering.

Engineering updates the simulated terminal logic:

Before:
  Accept message if schema valid and source known.

After:
  Accept message only if:
    - schema valid
    - source known
    - timestamp fresh
    - state transition valid
    - recovery-window freshness rule satisfied
    - track movement plausible

The control is specific.

The owner is clear.

The evidence is replayable.

Regression Replay

The red replay runs again.

T+000s  Replay starts
T+090s  Red repeats same simulated action chain
T+098s  Message reaches receiver
T+099s  Freshness validation rejects stale state
T+100s  Blue alert fires with lower impact classification
T+105s  Track database remains consistent

Regression result:

regression_result:
  finding_id: PT-JMSG-014
  previous_status: exploitable_in_twin
  current_status: remediated
  replay_passed: true
  false_positive_check: passed
  degraded_mode_check: passed
  residual_risk: low_for_protocol_layer_scope

That is the win.

Not a flashy hack.

A proven control improvement.

13. The Purple Team Scorecard

A good purple exercise needs a scorecard.

Dimension	Question	Result
Safety	Did red stay inside twin-only controls?	Required
Realism	Was the scenario meaningful for protocol logic?	Required
Repeatability	Could the failure be replayed?	Required
Detection	Did blue detect it?	Required
Explainability	Did the alert explain why?	Required
Operator action	Did SOC know what to do?	Required
Engineering action	Could the owner fix it?	Required
Regression	Did the fix survive replay?	Required
False positives	Did the fix break normal/degraded traffic?	Required
Residual risk	Is remaining risk documented?	Required

This is how the purple team avoids theater.

No vague “AI found risk.”

No vague “blue detected anomaly.”

No vague “engineering should improve validation.”

The result must be:

specific failure, specific evidence, specific owner, specific fix, replay-proven outcome.

14. What Logs and Evidence Matter

A defense-grade exercise must produce evidence.

Not screenshots alone.

The evidence pack should include:

evidence/
  scenario.yaml
  slot_plan.yaml
  red_actions.jsonl
  jmessages.jsonl
  timing_events.jsonl
  trackdb_before.json
  trackdb_after.json
  blue_alert.json
  model_scores.json
  operator_decision.md
  engineering_fix.diff
  regression_result.json
  after_action_report.md

The after-action report should answer:

What was tested?
What did red attempt inside the twin?
What changed in the J-message stream?
What changed in timing or state?
What did blue detect?
What did blue miss?
What did the SOC analyst decide?
What control failed?
Who owns the fix?
Did replay prove the remediation?

If the exercise cannot produce this, it is not mature purple teaming.

15. The Purple LLM Analyst: Model, Runner, Harness, and Guardrails

The purple team uses an LLM differently from red and blue.

Red acts in the twin.

Blue scores the stream.

Purple explains the engagement.

The recommended purple analyst stack is:

Layer	Choice	Purpose
Model	Llama 3 or Mistral	Generate human-readable summaries and reports
Runner	Ollama for local lab, or approved API gateway	Run the model privately or through governed inference
Harness	LangChain agent	Manage the evidence-retrieval workflow
Tool boundary	MCP	Provide approved access to logs, alerts, replays, and attention maps
Output	After-action report	Turn telemetry into decisions

For defense-style labs, local inference through Ollama is attractive when the exercise data is sensitive and should not leave the lab. A managed API may be acceptable only if the data classification, retention, region, and contractual controls allow it.

The LLM analyst should be configured with a strict evidence-only instruction:

soc_llm_analyst:
  model_options:
    - llama3
    - mistral
  runner_options:
    - ollama_local
    - approved_api_gateway

  harness: langchain
  tool_interface: mcp

  allowed_tools:
    - retrieve_message_history
    - get_blue_attention_map
    - get_red_episode_trace
    - get_detection_timeline
    - generate_after_action_report

  prohibited_actions:
    - execute_red_action
    - modify_twin_state
    - approve_containment
    - access_live_network
    - infer_missing_facts

  output_requirements:
    - cite_evidence_pointer
    - mark_unknowns
    - identify_control_gap
    - identify_remediation_owner
    - identify_residual_risk

This keeps the LLM in the analyst role.

It does not command the exercise.

It writes the report that helps humans make the decision.

16. The LLM Analyst: Useful, but Not in Command

The LLM analyst is valuable, but it should not run the mission.

It should not touch live systems.

It should not approve containment.

It should not invent facts.

It should not write final risk acceptance alone.

Its job is evidence acceleration.

A safe LLM analyst can:

retrieve replay timelines
summarize red actions
summarize blue detection
compare baseline and abnormal message flow
draft after-action reports
map evidence to findings
identify missing evidence
prepare executive summaries

Safe prompt pattern:

Use only the provided replay files, alert JSON, timing logs, and scenario metadata.

Produce:
1. Scenario summary
2. Red action summary
3. Blue detection summary
4. Timeline
5. Control gap
6. Evidence list
7. Recommended engineering fix
8. False-positive considerations
9. Residual risk

Do not infer facts not present in evidence.
Mark unknowns explicitly.

The LLM is not the commander.

It is the scribe who never sleeps.

17. Designing “Link 18”: What We Should Build Next Time

If a next-generation tactical data link were designed tomorrow, the lesson from this exercise is clear.

Security should be built into the protocol from day one.

Design principles:

Message-Level Authenticity

Do not rely only on channel trust. Messages should carry strong identity, integrity, and freshness guarantees appropriate to the mission.

Built-In Timing Validation

The system should treat timing deviations as first-class security signals.

State-Aware Message Validation

A message should not be accepted only because it is well-formed. It must be valid for the current mission state.

Digital Twin Mandatory for Updates

No major protocol or terminal update should ship without replay testing in the twin.

Continuous Purple Teaming

Red AI should search for failure modes.

Blue AI should detect and explain.

Purple team should validate, prioritize, and force regression.

Safe Degradation

When confidence drops, the system should degrade visibly and safely.

A resilient system is not one that never fails.

It is one that fails in a way defenders can see, understand, and recover from.

18. Why This Matters Beyond Defense

This article is defense-centric by design.

But the pattern is not defense-only.

Fintech platforms depend on message ordering, identity, replay protection, ledger state, transaction freshness, and fraud signals.

Healthcare platforms depend on device telemetry, patient identity, clinical workflow state, and timely trust in data.

Industrial platforms depend on deterministic command and sensor behavior.

The lesson is not that every sector should copy Link 16.

The lesson is:

Any system that depends on structured, time-sensitive, machine-to-machine trust needs a way to safely test how that trust fails.

For fintech:

Can replayed transaction events corrupt state?
Can duplicate identity confuse fraud decisions?
Can delayed settlement messages produce false confidence?
Can reconciliation detect sequence anomalies?

For healthcare:

Can stale telemetry be accepted as current?
Can device identity mismatch affect clinical decisions?
Can workflow events be replayed safely in a twin?
Can alerts explain risk without overwhelming operators?

But those are extensions.

The core defense lesson remains:

Build the twin. Train red. Train blue. Engage purple. Fix the protocol. Replay until proven.

19. Final Takeaway

The original problem was never “how do we use AI?”

The real problem is sharper:

How do we safely test the J-message protocol behavior of a Link 16-style network when live testing is unacceptable?

The answer is not a scanner.

The answer is not a dashboard.

The answer is an AI-powered purple-team operating model:

Digital Twin
    ↓
Red AI discovers safe J-message failure modes
    ↓
Blue AI detects timing, identity, freshness, and track anomalies
    ↓
Purple team validates operational relevance
    ↓
Engineering fixes protocol controls
    ↓
Replay proves the fix
    ↓
Scenario becomes continuous regression

That is the journey.

Red finds the weakness.

Blue proves the detection.

Purple makes the system stronger.

The color of war is not red.

It is not blue.

It is purple.

AWS Security AI Architecture: Managed MCP, Custom MCP, or Lambda + Bedrock?

Mike Anderson — Thu, 16 Jul 2026 10:54:55 +0000

AWS Security AI Architecture: Managed MCP, Custom MCP, or Lambda + Bedrock?

Executive decision

There is no single “correct” architecture for AI-assisted AWS security work.

For Security Hub, GuardDuty, ECR, and cloud security reporting, there are three valid patterns:

AWS Managed MCP / AWS Agent Toolkit for live, read-only AWS investigation from an AI coding assistant.
Custom MCP for analyzing approved security reports already stored in S3.
Lambda + boto3 + Bedrock for scheduled, deterministic production report generation.

All three are correct.

They solve different problems.

The mistake is not choosing one over the other. The mistake is using the right technology in the wrong operating model.

A scheduled production report should not depend on an analyst’s laptop. A report-analysis assistant should not need broad live AWS API access. A developer investigating AWS findings interactively should not be forced to wait for a weekly Lambda job.

The clean model is:

Production reporting lane:
Lambda + boto3 + Bedrock + S3

Analyst report-review lane:
Custom MCP + S3 reports + optional Bedrock analysis

Developer / live triage lane:
AWS Managed MCP + Claude Code / Codex + read-only IAM

That separation removes most of the confusion.

The problem we are solving

AWS security teams usually deal with three different workflows that look similar at first glance but are operationally different.

Workflow 1: Live investigation

A security engineer wants to ask:

Show me the current HIGH and CRITICAL Security Hub findings.
Explain which ones are immediate risk.
Check GuardDuty or Inspector context.
Draft remediation wording.

This is interactive. The engineer is present. The assistant may need to call live AWS APIs.

Workflow 2: Report analysis

A weekly Security Hub or GuardDuty report already exists in S3.

An analyst wants to paste a finding and ask:

Is this finding already covered in the latest report?
What evidence supports it?
What owner action is needed?
Give me Jira-ready wording.

This is not live AWS investigation. This is analysis of approved report artifacts.

Workflow 3: Scheduled reporting

The organization needs a report every week, without a human sitting in front of Claude Code or Codex.

The system should:

Collect findings
Score and sort them
Enrich them
Generate Markdown/HTML/JSON/CSV
Store the output in S3
Run on schedule
Fallback safely if AI enrichment fails

This is backend automation.

These three workflows should not use the same architecture.

Approach 1: AWS Managed MCP for live read-only Security Hub triage

What it is

AWS Managed MCP, delivered through AWS Agent Toolkit, lets AI coding agents interact with AWS through the Model Context Protocol.

AWS describes the AWS MCP Server as a managed remote MCP server that gives AI agents secure access to AWS through MCP. It can expose AWS API access, documentation search, curated skills, CloudWatch metrics, and IAM-based controls. It is designed to work with coding agents such as Claude Code and Codex.

The architecture looks like this:

Security engineer
        ↓
Claude Code / Codex
        ↓
AWS Managed MCP Server
        ↓
Read-only AWS SSO profile / IAM role
        ↓
Security Hub / GuardDuty / Inspector / Config / CloudTrail
        ↓
Local analysis output

The key point:

AWS Managed MCP is a live AWS access path for an AI assistant.

That can be very useful, but it must be controlled.

Best use cases

Use AWS Managed MCP when the analyst or engineer needs live AWS context.

Good examples:

Read current Security Hub findings.
Check current GuardDuty findings.
Search AWS documentation.
Review AWS Config resource state.
Look up CloudTrail events.
Ask for remediation guidance while reviewing live AWS evidence.
Use Claude Code or Codex during security tooling development.

Security Hub’s GetFindings API returns findings matching specified criteria, and if cross-Region aggregation is enabled, calling it from the aggregation home Region can include findings from linked Regions. That makes it a strong fit for read-only triage when the IAM role is scoped correctly.

When this approach is best

Choose AWS Managed MCP when:

A human analyst is actively driving the session.
The task requires current AWS state.
The user is already working in Claude Code, Codex, Cursor, Kiro, or another MCP-capable client.
The organization can enforce read-only IAM, SSO, approval prompts, and audit logging.
The output is advisory, not automatically applied.

This is a good design for a junior security engineer who needs help understanding findings but should not be allowed to modify AWS.

Required controls

The minimum safe posture is:

Dedicated read-only SSO permission set.
No administrator profile.
No long-lived access keys.
Explicit deny for Security Hub, GuardDuty, IAM, S3, EC2, KMS, and Config write actions.
Tool approval enabled.
Script execution denied or separately approved.
No secret or PII access.
CloudTrail visibility.
Negative-control test proving write actions fail.

AWS Managed MCP is powerful because it can expose broad AWS capability. AWS notes that Agent Toolkit can allow agents to interact with AWS APIs, run sandboxed scripts, search AWS documentation, and apply enterprise controls through IAM context keys and CloudWatch metrics.

That means IAM is not optional. IAM is the control boundary.

Why this approach is not ideal for scheduled reports

AWS Managed MCP is not the best engine for scheduled reporting.

A weekly report should not require:

An analyst session.
A local MCP client.
A laptop profile.
A Claude Code/Codex session.
Manual approval of each tool call.

For scheduled reporting, use Lambda.

Approach 2: Custom MCP for analyzing generated S3 reports

What it is

A custom MCP server is your own MCP service that exposes a narrow set of approved tools.

In this design, the MCP server does not query live Security Hub.

It reads only the security reports that your production reporting pipeline has already generated into S3.

The architecture looks like this:

Analyst
        ↓
Claude Code / Codex
        ↓
Custom Report Analyst MCP Server
        ↓
Read-only S3 access
        ↓
Generated Security Hub / GuardDuty reports
        ↓
Optional Bedrock analysis
        ↓
Evidence-bound response

The MCP tools should be intentionally limited:

healthcheck
list_available_reports
read_report_excerpt
search_reports
analyze_finding_against_reports
generate_ticket_draft

The MCP server should not expose:

call_aws
run_script
put_object
update_finding
batch_update_findings
create_ticket
send_slack
assume_role

This is the core security design.

The custom MCP server is not a general AWS assistant. It is a report analyst.

Best use cases

Use custom MCP when the source of truth is an approved report artifact.

Good examples:

Analyze a pasted Security Hub finding against the latest weekly report.
Search generated GuardDuty reports for a finding ID.
Compare a resource ARN against recent findings.
Generate Jira-ready remediation wording from existing report evidence.
Explain whether a finding is immediate risk or backlog based on the report.
Create analyst notes without touching live AWS APIs.

This is especially useful when the organization already has a strong reporting pipeline and wants AI-assisted review without giving the assistant broad AWS access.

Why custom MCP is safer for report analysis

If the task is:

Read the report and explain the finding.

then the agent does not need live Security Hub access.

It only needs:

s3:GetObject
s3:ListBucket
kms:Decrypt, if the reports are encrypted with KMS
bedrock:InvokeModel, if second-pass model analysis is used

That is a much narrower trust boundary.

The analyst prompt can say:

Use only the report analyst MCP server.
Do not call live AWS APIs.
Do not update Security Hub.
Do not create tickets.
Analyze this finding against the latest generated reports.

This avoids a common failure mode: the AI assistant silently switching from “report analysis” to “live AWS investigation.”

When this approach is best

Choose custom MCP when:

Reports already exist in S3.
The analyst should analyze approved artifacts, not live AWS state.
You want the smallest possible tool surface.
You want the same report evidence used across analysts.
You want the assistant to produce draft analysis, not operational changes.
You want to avoid broad AWS API exposure.

This is the right pattern for security teams that already generate Security Hub or GuardDuty reports through a controlled pipeline.

Required controls

The custom MCP server should run with a runtime role that can only:

List approved report prefixes.
Read approved report objects.
Decrypt report objects if needed.
Invoke an approved Bedrock model if model-assisted analysis is enabled.
Write CloudWatch logs.

It should explicitly deny:

Security Hub writes.
GuardDuty writes.
S3 writes to the report bucket.
IAM changes.
EC2 changes.
KMS destructive actions.
Lambda invoke.
SSM commands.
ECS Exec.

The endpoint should require company authentication. Do not expose a public MCP endpoint with a long-lived shared token.

A good production pattern is:

Claude Code signed in with a company account
        +
Company SSO / OAuth / short-lived MCP bearer token
        +
Custom MCP endpoint
        +
Read-only report access

The Claude or Codex account gives access to the AI client. The company token gives access to the MCP endpoint. The MCP runtime role gives access to the report bucket.

Those are separate identities, and that separation is healthy.

Why this approach is not ideal for generating reports

Custom MCP should not replace a backend reporting pipeline.

If the job is:

Every Tuesday, collect all Security Hub findings, enrich them, and write reports to S3.

then MCP is not the right primary engine.

MCP is a tool interface for an agent. It is not a scheduler, state tracker, report renderer, or production batch engine by default.

For that, use Lambda or another backend compute service.

Approach 3: Lambda + boto3 + Bedrock for scheduled reporting

What it is

The Lambda + boto3 + Bedrock pattern is a backend automation pipeline.

In the reviewed implementation pattern, Lambda performs deterministic collection, scoring, filtering, enrichment, fallback handling, and report assembly.

The architecture looks like this:

EventBridge Scheduler
        ↓
Lambda
        ↓
boto3 reads Security Hub and ECR
        ↓
Bedrock Converse API enriches bounded batches
        ↓
Lambda assembles final report deterministically
        ↓
S3 stores Markdown, HTML, JSON, SVG, and state files

This is not MCP.

This is not an interactive agent.

This is a scheduled reporting system.

Amazon Bedrock’s Converse API provides a consistent interface for sending messages to supported models, and the operation requires bedrock:InvokeModel permission.

That fits the Lambda model well: the function prepares bounded input, invokes the model, validates or falls back, and writes the final report.

What this pattern does well

The Lambda pattern is strong because it is deterministic around the model.

A good production implementation does not ask the model to do everything.

It should use the model for bounded enrichment, while Lambda owns:

Collection
Filtering
Scoring
Sorting
Deduplication
ECR latest-image filtering
Report structure
Fallback behavior
S3 output
State tracking
Schedule

That is the correct division of labor.

The model helps with language, explanation, remediation wording, and executive summarization.

The code controls the evidence pipeline.

Best use cases

Use Lambda + boto3 + Bedrock when:

Reports must run on a schedule.
The output must be consistent every week.
No analyst should be required to trigger the workflow.
Findings need deterministic scoring and sorting.
The organization needs report history and burn-down trends.
The output must be stored centrally.
There must be fallback if AI enrichment fails.

This is the right pattern for:

Weekly Security Hub executive reports.
GuardDuty summary reports.
ECR vulnerability reporting.
Manager-facing HTML reports.
DevOps remediation backlog generation.
Security trend/burn-down reporting.

Why Lambda is better than MCP for this job

Lambda has a clear production control model:

Lambda execution role
        ↓
Read security findings
        ↓
Invoke Bedrock
        ↓
Write reports to S3

This is easy to audit.

MCP would add unnecessary moving parts:

MCP client
Agent session
Tool approval
Workstation profile
Prompt/session state
Interactive user dependency

Those are useful for human-led analysis. They are not useful for unattended weekly reporting.

Required controls

For a production Lambda reporting job, enforce:

securityhub:GetFindings only for Security Hub read.
ECR read-only actions if ECR latest-image validation is required.
bedrock:InvokeModel scoped to approved model or inference profile where possible.
s3:PutObject only to approved report prefixes.
s3:GetObject only for state/history files if needed.
No Security Hub write permissions.
No GuardDuty write permissions.
No IAM mutation.
No remediation actions.
CloudWatch logging.
S3 encryption and versioning.
EventBridge schedule ownership.
Deterministic fallback when Bedrock fails.

The most important design rule:

Bedrock should enrich the report. It should not control the report pipeline.

The Lambda should be able to produce a safe deterministic report even if the model fails, times out, or returns malformed output.

Why all three designs are correct

The confusion usually comes from treating “AI security assistant” as one thing.

It is not one thing.

There are at least three jobs:

Live investigation
Report analysis
Report generation

Each job has a different control boundary.

Question	Best architecture
“What is currently in Security Hub?”	AWS Managed MCP
“What does the latest generated report say about this finding?”	Custom MCP
“Generate the weekly report every Tuesday.”	Lambda + boto3 + Bedrock
“Help me review Terraform or AWS docs while coding.”	AWS Managed MCP
“Analyze approved S3 report artifacts only.”	Custom MCP
“Create manager-ready reports without human interaction.”	Lambda

That is why all three are valid.

They are not competing solutions. They are lanes.

The clean operating model

Use this model to avoid confusion.

Lane 1: Production reporting
Purpose: Generate reports
Technology: Lambda + boto3 + Bedrock + S3
Trigger: EventBridge schedule or controlled manual invoke
Output: Markdown, HTML, JSON, CSV, SVG, report state
Human role: Review report and act on findings

Lane 2: Report analyst
Purpose: Analyze generated reports
Technology: Custom MCP
Trigger: Analyst prompt in Claude Code or Codex
Output: Evidence-bound analysis and ticket wording
Human role: Paste finding, review answer, create ticket manually

Lane 3: Live AWS triage / developer assistant
Purpose: Query live AWS context or docs
Technology: AWS Managed MCP / Agent Toolkit
Trigger: Analyst or developer prompt
Output: Live read-only investigation notes, docs, code guidance
Human role: Approve tool calls and validate output

This is the architecture I would use in a mature security program.

When to use what, based on corporate tooling

If your company uses Claude Code

Use:

AWS Managed MCP for live AWS read-only triage.
Custom MCP for report analysis.
Lambda for scheduled reports.

Claude Code becomes the analyst interface. The MCP endpoint should be company-authenticated, preferably through OAuth, SSO, ZTNA, or short-lived tokens.

If your company uses Codex

Use the same pattern.

Codex can be the MCP client for either:

AWS Managed MCP
Custom Report Analyst MCP

But do not confuse the Codex account with the AWS identity. The AI account authenticates you to the tool. The MCP endpoint must still require company-side authorization.

If your company has strict SSO and no local AWS profiles

Use:

Custom MCP for report analysis.
Lambda for report generation.

Avoid requiring every analyst to configure AWS profiles locally.

Let the MCP backend carry the read-only runtime role and company endpoint authentication.

If your company allows read-only AWS SSO profiles on engineer laptops

AWS Managed MCP becomes more attractive.

Use it for live investigation, but keep write actions denied.

If your company does not allow AI tools to access live AWS

Do not use AWS Managed MCP for live API calls.

Use:

Lambda generates approved reports.
Custom MCP reads only approved report artifacts.

This gives the AI assistant useful context without granting broad live AWS access.

If your company already has strong serverless standards

Lambda + boto3 + Bedrock is the cleanest reporting engine.

Use SAM, Terraform, or your internal platform pattern. Keep the reporting job deterministic and auditable.

If your company is Kubernetes-first

The custom MCP server can run on EKS.

But do not choose EKS just because it is more advanced. For a small stateless MCP API, ECS Fargate or Lambda-style backend hosting is often simpler. EKS is appropriate when the organization already has hardened Kubernetes standards, ingress controls, pod identity, network policies, and platform ownership.

Common misunderstanding: “MCP means the model is doing the work”

No.

MCP is the tool interface.

The model reasons.

The MCP server exposes tools.

IAM and application code enforce permissions.

The backend system still matters.

A bad MCP design can give the model too many hands.

A good MCP design gives it only the tools it needs.

For example:

AWS Managed MCP:
Good for live read-only AWS questions.

Custom MCP:
Good for reading approved S3 report artifacts.

Lambda:
Good for scheduled collection and report generation.

Different tools. Different jobs.

Common misunderstanding: “Custom MCP is always safer”

Not automatically.

Custom MCP is safer only if it exposes fewer tools and has better boundaries.

A custom MCP server with a generic call_aws tool can be riskier than AWS Managed MCP with strong IAM controls.

A safe custom MCP server should be domain-specific:

Read this report.
Search this report.
Analyze this pasted finding against reports.
Generate ticket draft.

It should not become a private version of the entire AWS API.

Common misunderstanding: “Lambda + Bedrock is not agentic, so it is less advanced”

That is the wrong way to think about it.

Scheduled security reporting should be boring.

Boring is good.

A weekly executive report should not depend on an agent making fresh tool decisions every time. It should follow a known pipeline:

Collect
Normalize
Score
Sort
Enrich
Validate
Fallback
Render
Store

The AI model can improve the wording and analysis, but the pipeline should remain deterministic.

That is a stronger architecture for production reporting.

Security decision table

Requirement	AWS Managed MCP	Custom MCP	Lambda + boto3 + Bedrock
Live AWS Security Hub triage	Strong	Weak unless it calls live AWS	Possible but not interactive
Analyze generated S3 reports	Possible but too broad	Strong	Possible but not conversational
Scheduled weekly report	Weak	Weak	Strong
Read-only guardrails	IAM + tool approval	IAM + app tool design	IAM execution role
No local AWS profile needed	Usually no	Yes	Yes
Best user interface	Claude Code / Codex	Claude Code / Codex	S3/HTML/Slack/Jira after generation
Best production automation	No	No	Yes
Best analyst conversation	Yes	Yes	No
Lowest live AWS API exposure	No	Yes	Medium, controlled by Lambda role
Best audit story for scheduled reports	Medium	Medium	Strong
Best audit story for report artifact analysis	Medium	Strong	Strong for generation, not interaction

Recommended final architecture

For a mature AWS security team, I would implement all three, but keep them separated.

1. Lambda + boto3 + Bedrock
   Generates official weekly/daily security reports.

2. Custom MCP
   Lets analysts ask questions about those reports without querying live AWS.

3. AWS Managed MCP
   Lets approved engineers perform live read-only AWS triage and development support.

The important rule:

Do not let the lanes blur.

Production reporting should not depend on a chat session.

Report analysis should not silently become live AWS querying.

Live AWS querying should not perform write actions.

Do not force one architecture to do all three jobs.

That is how security automation becomes confusing and risky.

The best architecture is not the most advanced one. It is the one with the clearest control boundary for the job.

Example workflow

Monday morning: scheduled report generation

EventBridge triggers Lambda.
Lambda collects Security Hub findings.
Lambda validates ECR latest-image findings.
Lambda invokes Bedrock in bounded batches.
Lambda assembles Markdown and HTML.
Lambda writes report artifacts to S3.

No analyst is involved.

Later that day: analyst reviews one finding

Analyst opens Claude Code.
Analyst pastes a Security Hub finding.
Claude Code calls custom MCP.
Custom MCP reads latest S3 reports.
Custom MCP returns evidence-bound analysis.
Analyst creates Jira ticket manually.

No live Security Hub API call is needed.

During remediation: engineer needs AWS context

Engineer opens Claude Code or Codex.
Engineer uses AWS Managed MCP with read-only SSO profile.
Agent searches AWS docs and checks live AWS state.
Engineer validates and implements remediation through normal change control.

No automatic remediation is approved.

MCP for AWS Security Engineers: Build a Read-Only Security Hub Triage Agent

Mike Anderson — Thu, 16 Jul 2026 09:54:32 +0000

MCP for AWS Security Engineers: Build a Read-Only Security Hub Triage Agent

For AWS-heavy security work, I would start with AWS Agent Toolkit for AWS and the managed AWS MCP Server, not a custom MCP server.

The reason is practical. AWS now provides a managed MCP path that can connect AI coding agents to AWS documentation, AWS APIs, AWS skills, and existing IAM credentials. The Agent Toolkit also provides plugin-based setup for supported agents such as Claude Code and Codex.

For security teams, that is the right starting point because the enforcement point remains AWS IAM, not the model.

The initial operating model should be strict:

Read-only first.
No production write authority.
No access to secrets.
No raw customer PII or sensitive incident logs in prompt context.
No automatic remediation.
No AI-approved suppression, exception, merge, deploy, or risk acceptance.
Human review and CI/CD remain the release authority.

That is the same posture I would use for a governed Claude Code or Codex rollout: named identities, SSO, scoped credentials, default deny, tool approval, audit logs, and security evidence tied back to tickets, pull requests, CI logs, and cloud findings.

What we are building

This article walks through a practical security workflow:

A read-only Security Hub triage assistant that helps a junior security engineer produce a daily or weekly findings summary, remediation backlog, and evidence pack without allowing the agent to modify AWS.

The agent will be able to:

Read AWS Security Hub findings.
Group findings by account, severity, product, resource, and control.
Explain why a finding matters.
Draft remediation tickets.
Draft a Slack-ready summary.
Produce local markdown, CSV, and JSON evidence files.

The agent will not be able to:

Suppress findings.
Archive findings.
Mark findings resolved.
Disable Security Hub standards.
Modify IAM, S3, EC2, KMS, GuardDuty, Inspector, or Config.
Deploy remediation.
Run destructive scripts.
Approve risk acceptance.

This is not a generic AI demo. This is a security-controlled workflow where MCP gives the agent access to context, while IAM, SCPs, tool approval, and human review define the real boundary.

What is MCP?

MCP stands for Model Context Protocol.

In plain English, MCP is a standard way for an AI assistant to connect to external systems such as cloud platforms, source code repositories, ticketing systems, databases, monitoring tools, documentation, and security platforms.

A simple mental model:

LLM / Agent
   |
   | asks for context or tool execution
   v
MCP Client
   |
   | speaks MCP
   v
MCP Server
   |
   | exposes approved tools and data
   v
AWS / GitHub / Jira / Security Hub / Internal APIs

MCP is not the model.

MCP is not Claude.

MCP is not Codex.

MCP is the connector layer that lets an AI tool interact with approved external capabilities in a consistent way.

A practical comparison:

Component	What it is	Example
AI model	The reasoning engine	Claude, GPT, Nova, Qwen
Agent client	The user-facing agent tool	Claude Code, Codex, Cursor, Kiro
MCP server	The tool and data connector	AWS MCP Server, GitHub MCP Server
Tool	An action exposed by the server	`securityhub:GetFindings`, documentation search
Resource	Read-only context exposed by the server	Documentation, metadata, finding details
IAM / policy	The enforcement layer	AWS role, SCP, permission boundary

The important security point is simple:

MCP gives the agent hands. IAM decides what those hands are allowed to touch.

What an MCP server actually does

An MCP server exposes capabilities to an AI agent.

Those capabilities usually fall into three areas:

MCP capability	Meaning	Security impact
Tools	Callable functions or actions	Can be read-only or mutating
Resources	Context or data the model can read	Usually safer, but can expose sensitive data
Prompts	Reusable task templates	Useful for standardized workflows

That matters because a junior engineer may think, “The model only answers questions.”

That assumption is no longer safe once tools are attached.

With MCP, the model may be able to:

Query Security Hub findings.
Search AWS documentation.
Call AWS APIs.
Read repository files.
Read Jira tickets.
Read internal runbooks.
Generate remediation plans.
In poorly controlled environments, call write APIs.

That is why the first security decision is not:

Which model should we use?

The first security decision is:

What tool permissions will this agent have, and where are those permissions enforced?

For production security work, the model must never be treated as the control boundary.

The control boundary must be:

IAM.
SCPs.
Permission boundaries.
SSO permission sets.
MCP tool allowlists.
Claude Code or Codex approval modes.
Audit logs.
Human approval.

Where MCP fits in the agent architecture

A useful operating model is:

Prompt -> Agent loop -> MCP tools -> External systems -> Evidence/output

The agent loop is the cycle where the model reasons, requests a tool, receives the result, reasons again, and continues until the task is complete.

The harness is everything around that loop: tool permissions, context management, project rules, logs, approval gates, hooks, and safety boundaries.

For security work, MCP sits inside the harness.

Claude Code / Codex
   |
   | project rules, approval mode, permissions
   v
MCP client
   |
   | approved tool calls only
   v
AWS MCP Server
   |
   | authenticated AWS API access
   v
AWS IAM role / permission set
   |
   | read-only Security Hub permissions
   v
AWS Security Hub

That distinction is important.

The model can recommend. The harness controls. IAM enforces.

Why MCP is useful for cybersecurity work

Security work is context-heavy.

A security engineer rarely needs a generic answer. We need the assistant to understand:

Which AWS account is affected.
Which Security Hub control failed.
Whether the finding is active, archived, suppressed, or resolved.
Whether the source is Security Hub CSPM, GuardDuty, Inspector, Macie, Config, or another product.
Whether the affected resource is public-facing.
Whether the account is production, shared services, security tooling, or sandbox.
What remediation is appropriate.
What evidence should be retained.
What should be fixed immediately versus tracked in backlog.

Without MCP, the engineer manually copies and pastes data into the AI tool.

With MCP, the agent can retrieve approved read-only data directly and produce a consistent investigation output.

Useful security workflows include:

Security Hub triage.
GuardDuty finding explanation.
Inspector vulnerability prioritization.
CloudTrail event review.
IAM access review support.
AWS documentation lookup.
Control evidence preparation.
Remediation backlog drafting.
Incident timeline drafting.
Cloud security review preparation.

But MCP is not magic.

It does not replace security ownership, SOC judgment, IAM design, threat modeling, change control, CI/CD gates, incident commander decisions, or audit evidence review.

MCP should reduce manual collection and improve consistency. It should not become an ungoverned SOAR platform.

Should security teams build their own MCP server?

Recommendation: use official or vendor-supported MCP servers first. Build your own only when you have a specific internal workflow that existing servers cannot safely support.

For AWS security work, start with:

AWS Agent Toolkit for AWS.
AWS MCP Server.
A dedicated read-only AWS profile or IAM Identity Center permission set.
Claude Code or Codex MCP configuration.
A controlled Security Hub triage workflow.

Decision table

Use case	Build your own MCP server?	Recommended path
AWS documentation lookup	No	AWS Agent Toolkit / AWS MCP Server
Security Hub read-only triage	No, initially	AWS MCP Server with read-only IAM
GuardDuty / Inspector / Macie review	No, initially	AWS MCP Server with scoped read-only permissions
Jira ticket drafting	Usually no	Vendor MCP server or local draft output
GitHub repo analysis	Usually no	GitHub MCP or native repo context with repo-scoped permissions
Internal CMDB enrichment	Maybe	Internal read-only MCP server
Internal GRC evidence register	Maybe	Private MCP server or API wrapper
Automated remediation	Not initially	Keep outside MCP until governance is mature
Security Hub suppression/update	No for junior workflow	Human and SOC-approved process only

Trusted source order

Use this priority order for MCP servers:

Official vendor documentation.
Official AWS Agent Toolkit / AWS MCP Server.
Official MCP Registry where appropriate.
Vendor-maintained GitHub repositories.
Your internal private registry for internal MCP servers.

Be careful with random public MCP servers.

For a security team, an MCP server is not a harmless browser extension. It is a privileged integration point.

A malicious or poorly written MCP server can become:

A credential theft path.
A data exfiltration path.
A prompt-injection bridge.
A hidden write-action path.
A supply chain risk.

Treat MCP servers like production integrations.

AWS Agent Toolkit: what it gives you

AWS Agent Toolkit provides plugins that bundle AWS MCP Server configuration and curated AWS skills for agent workflows.

For the workflow in this article, the relevant AWS MCP Server capabilities are:

Capability	Use in this workflow	Initial recommendation
Documentation search	Explain Security Hub controls and AWS service behavior	Allow
AWS API calls	Read Security Hub, GuardDuty, Inspector, Config, and CloudTrail context	Allow only through read-only IAM
Sandboxed script execution	Run multi-step AWS checks	Disable or require explicit approval at pilot stage
Presigned URL generation	File transfer support	Disable unless specifically needed
Long-running task polling	Check status of API/script tasks	Allow only if required

For a read-only security workflow, I would allow documentation tools and controlled AWS API calls. I would deny or require approval for script execution initially, especially for junior engineers.

Target architecture

Security Engineer
   |
   | asks question in Claude Code or Codex
   v
Claude Code / Codex
   |
   | MCP client
   v
AWS Agent Toolkit / AWS MCP Server
   |
   | authenticated request
   v
AWS IAM Identity Center profile: sec-mcp-readonly
   |
   | read-only permissions only
   v
AWS Security Hub
   |
   | Get / List / Describe / BatchGet only
   v
Local output files
   |
   | markdown summary, CSV backlog, JSON evidence
   v
Human review
   |
   | Jira / Slack / audit evidence

The key design choice is that the agent can read and reason, but it cannot change the environment.

This aligns with a production cloud security baseline: least privilege, centralized identity, MFA, guardrails, logging, evidence retention, and clear owner accountability.

Step 1: Create the AWS read-only identity

Use IAM Identity Center if available.

Create a permission set:

Permission set name: SecMCPReadOnly
Session duration: 4 hours
Assigned group: SecurityEngineering-MCP-ReadOnly
Accounts: security tooling account and selected workload accounts
MFA: required through IdP / IAM Identity Center

Do not use:

An administrator role.
A shared access key.
A personal long-lived IAM user.
A generic service account that hides the human operator.

Use a named human identity with SSO. The goal is that every MCP-driven AWS API call is attributable to a real engineer.

Step 2: Attach a scoped read-only IAM policy

AWS provides managed read-only policies, but for this workflow I prefer a custom policy because the scope is explicit and easier to explain during audit.

IAM policy: Security Hub MCP read-only

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowIdentityCheck",
      "Effect": "Allow",
      "Action": [
        "sts:GetCallerIdentity"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowSecurityHubReadOnly",
      "Effect": "Allow",
      "Action": [
        "securityhub:Get*",
        "securityhub:List*",
        "securityhub:Describe*",
        "securityhub:BatchGet*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadOnlyInvestigationContext",
      "Effect": "Allow",
      "Action": [
        "cloudtrail:LookupEvents",
        "guardduty:GetFindings",
        "guardduty:ListFindings",
        "guardduty:ListDetectors",
        "inspector2:ListFindings",
        "access-analyzer:ListFindings",
        "access-analyzer:GetFinding",
        "organizations:DescribeOrganization",
        "organizations:ListAccounts",
        "config:SelectResourceConfig",
        "config:GetResourceConfigHistory"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyWriteActionsForMCPPilot",
      "Effect": "Deny",
      "Action": [
        "securityhub:BatchUpdateFindings",
        "securityhub:BatchImportFindings",
        "securityhub:Update*",
        "securityhub:Delete*",
        "securityhub:Disable*",
        "securityhub:Enable*",
        "securityhub:Create*",
        "securityhub:TagResource",
        "securityhub:UntagResource",
        "iam:*",
        "s3:Put*",
        "s3:Delete*",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress",
        "kms:Put*",
        "kms:ScheduleKeyDeletion",
        "config:Put*",
        "config:Delete*",
        "guardduty:Update*",
        "guardduty:Delete*",
        "inspector2:Update*",
        "inspector2:BatchUpdate*"
      ],
      "Resource": "*"
    }
  ]
}

Why include the explicit deny?

The explicit deny is not there because the allow statement grants those actions. It does not.

The explicit deny is there because real environments are messy.

A user may later inherit another permission set, a group policy, or a temporary role that adds write access. Explicit deny reduces the chance that the MCP workflow accidentally gains mutation capability through permission creep.

For production accounts, pair this with an SCP or permission boundary where possible.

Step 3: Add an optional SCP for production accounts

For production accounts, I would add an organization-level safety net.

Example SCP concept:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PreventMCPReadOnlyRoleFromMutatingSecurityHub",
      "Effect": "Deny",
      "Action": [
        "securityhub:BatchUpdateFindings",
        "securityhub:Update*",
        "securityhub:Delete*",
        "securityhub:Disable*",
        "securityhub:Enable*",
        "securityhub:Create*"
      ],
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/aws-reserved/sso.amazonaws.com/*/AWSReservedSSO_SecMCPReadOnly_*"
          ]
        }
      }
    }
  ]
}

Test this carefully in a non-production account before applying it broadly.

The SCP should not block the SOC, security tooling account, CI/CD remediation roles, or incident response break-glass roles.

The objective is narrow:

This MCP read-only role must never mutate Security Hub findings or configuration.

Step 4: Configure AWS CLI and SSO profile

Install or update the AWS CLI, then configure SSO:

aws configure sso --profile sec-mcp-readonly

Validate the identity:

aws sts get-caller-identity --profile sec-mcp-readonly

Expected output:

{
  "UserId": "AROAXXXXX:security.engineer@example.com",
  "Account": "123456789012",
  "Arn": "arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_SecMCPReadOnly_xxxxx/security.engineer@example.com"
}

Then test Security Hub read access:

aws securityhub get-findings \
  --profile sec-mcp-readonly \
  --region us-east-1 \
  --max-results 5

Now test that write access fails:

aws securityhub batch-update-findings \
  --profile sec-mcp-readonly \
  --region us-east-1 \
  --finding-identifiers '[{"Id":"test","ProductArn":"arn:aws:securityhub:us-east-1::product/aws/securityhub"}]' \
  --workflow '{"Status":"SUPPRESSED"}'

Expected result:

AccessDeniedException

Keep that negative-control result as rollout evidence.

Step 5: Install AWS Agent Toolkit for Claude Code

In Claude Code, install the AWS plugin:

/plugin install aws-core@claude-plugins-official
/reload-plugins

Then validate that the AWS MCP server is visible:

/mcp

For the pilot, I would configure Claude Code so AWS API calls require approval and script execution is denied or requires explicit human approval.

The point is not to slow engineers down. The point is to prevent the first rollout from quietly becoming an unapproved automation channel.

Step 6: Install AWS Agent Toolkit for Codex

For Codex, AWS documents plugin setup through the Codex plugin marketplace:

codex plugin marketplace add aws/agent-toolkit-for-aws

Then open Codex and use:

/plugins

Install the aws-core plugin.

For the pilot, configure Codex so MCP tools are explicitly approved and write-capable tools are disabled or denied.

The exact approval configuration may vary by Codex version, so validate against the current Codex configuration reference before publishing your internal runbook.

Step 7: Direct MCP configuration if plugin install is not available

If you cannot use the plugin flow, configure the AWS MCP Server directly through the MCP Proxy for AWS.

Example Claude Code configuration:

claude mcp add-json aws-mcp --scope user \
'{
  "command": "uvx",
  "args": [
    "mcp-proxy-for-aws",
    "https://aws-mcp.us-east-1.api.aws/mcp",
    "--metadata",
    "AWS_REGION=us-east-1"
  ],
  "env": {
    "AWS_PROFILE": "sec-mcp-readonly"
  }
}'

For Codex, place the MCP server configuration in your Codex config file.

Example concept:

[mcp_servers.aws-mcp]
command = "uvx"
args = [
  "mcp-proxy-for-aws",
  "https://aws-mcp.us-east-1.api.aws/mcp",
  "--metadata",
  "AWS_REGION=us-east-1"
]

[mcp_servers.aws-mcp.env]
AWS_PROFILE = "sec-mcp-readonly"

For multi-account security teams, configure an explicit profile allowlist. Do not let the agent discover or use every AWS profile on the workstation.

Example concept:

AWS_MCP_PROXY_PROFILES="sec-mcp-readonly prod-readonly security-readonly"

The default profile should be read-only.

Step 8: Configure tool approval and safety settings

The security posture should be tool-specific.

Tool type	Pilot setting	Rationale
AWS documentation search	Allow	Low risk and high value
AWS read-only API call	Ask / approve	Lets the engineer verify account and region
AWS script execution	Deny or ask	Can create broad data access and complex behavior
Presigned URL generation	Deny	Not needed for Security Hub triage
File write to local project	Allow to approved output folder	Needed for evidence pack
Shell command execution	Ask	Can expose local files or environment variables
Git operations	Ask	Prevents accidental commits or pushes

Step 9: Create the local project folder

mkdir -p securityhub-mcp-triage/{prompts,filters,output,evidence}
cd securityhub-mcp-triage

Recommended structure:

securityhub-mcp-triage/
  prompts/
    securityhub-triage.md
  filters/
    securityhub-critical-high.json
  output/
  evidence/

Keep this folder separate from application repositories. It should not contain source code, credentials, .env files, or customer data.

Step 10: Create the Security Hub filter

Create filters/securityhub-critical-high.json:

{
  "WorkflowStatus": [
    {
      "Value": "NEW",
      "Comparison": "EQUALS"
    },
    {
      "Value": "NOTIFIED",
      "Comparison": "EQUALS"
    }
  ],
  "RecordState": [
    {
      "Value": "ACTIVE",
      "Comparison": "EQUALS"
    }
  ],
  "SeverityLabel": [
    {
      "Value": "CRITICAL",
      "Comparison": "EQUALS"
    },
    {
      "Value": "HIGH",
      "Comparison": "EQUALS"
    }
  ]
}

Optional CLI validation:

aws securityhub get-findings \
  --profile sec-mcp-readonly \
  --region us-east-1 \
  --filters file://filters/securityhub-critical-high.json \
  --max-results 25 \
  > evidence/securityhub-critical-high-sample.json

This gives the engineer a known-good baseline before asking the agent to reason over the findings.

Step 11: Create the agent prompt

Create prompts/securityhub-triage.md:

You are supporting a read-only AWS Security Hub triage workflow.

Operating constraints:
- Use AWS profile: sec-mcp-readonly.
- Use region: us-east-1 unless findings indicate another region.
- Read only. Do not modify AWS resources.
- Do not suppress, archive, import, update, or resolve findings.
- Do not access secrets, credentials, customer PII, or raw sensitive incident logs.
- Do not run remediation.
- Do not commit, push, merge, deploy, or approve changes.
- Write outputs only under ./output.

Task:
1. Retrieve active CRITICAL and HIGH Security Hub findings using filters/securityhub-critical-high.json.
2. Group findings by:
   - AWS account
   - Region
   - Severity
   - Product/source
   - Resource type
   - Control ID or finding type
3. For each group, explain:
   - Why it matters
   - Failure mode
   - Likely owner
   - Recommended remediation
   - Evidence required
   - Whether it is immediate risk or backlog
4. Produce the following files:
   - output/securityhub-executive-summary.md
   - output/securityhub-technical-findings.md
   - output/securityhub-remediation-backlog.csv
   - output/securityhub-evidence-index.md
5. Include a final section called "Human review required" listing anything that must be confirmed manually.

Prioritization rules:
- Internet exposure in production is immediate.
- Privileged IAM or access analyzer findings are immediate.
- Critical exploitable vulnerabilities on internet-facing workloads are immediate.
- Missing encryption on sensitive data stores is high priority.
- Missing logging or monitoring is high priority, but may be backlog if compensating controls exist.
- Anything involving possible data exposure must be escalated to the SOC or incident commander.

This prompt does three important things:

It defines the job.
It defines the boundaries.
It defines the output format.

That is what makes the workflow repeatable.

Step 12: Run the workflow in Claude Code

Open Claude Code in the project folder.

cd securityhub-mcp-triage
claude

Then run:

Use prompts/securityhub-triage.md and perform the Security Hub triage workflow.
Before using any AWS MCP tool, show me the planned tool call, account/profile, region, and purpose.

During tool approval, verify:

AWS profile is sec-mcp-readonly.
Region is expected.
API action is read-only.
No script execution is being requested unless explicitly approved.
No write, update, delete, suppress, or remediation action is requested.

If the agent asks to use a write action, stop the run and fix the permissions or project rules.

Step 13: Run the workflow in Codex

In Codex:

cd securityhub-mcp-triage
codex

Prompt:

Use prompts/securityhub-triage.md and produce the required output files.
Use only the configured AWS MCP server and the sec-mcp-readonly profile.
Ask before each AWS API tool call.
Do not run write actions or remediation.

Same review logic applies:

Confirm profile.
Confirm region.
Confirm read-only API action.
Confirm output path.
Deny script execution unless this has already been approved for the pilot.

Expected outputs

The workflow should produce four local files.

1. `output/securityhub-executive-summary.md`

This file should be leadership-readable.

Example structure:

# Security Hub Executive Summary

Date: 2026-07-16
AWS profile: sec-mcp-readonly
Region: us-east-1
Scope: Active CRITICAL/HIGH findings

## Summary

Total active CRITICAL/HIGH findings reviewed: 42

Immediate action required: 6
High priority remediation: 18
Backlog / owner validation: 18

## Key risk themes

1. Public exposure on internet-facing resources
2. Privileged IAM misconfiguration
3. Inspector critical vulnerabilities on production EC2
4. Missing encryption on data stores
5. Security logging gaps

## Immediate escalation

The following findings require same-day owner response...

2. `output/securityhub-technical-findings.md`

This file should be engineer-readable.

Example:

## Finding group: Public S3 bucket exposure

Priority: Immediate

Affected resources:
- arn:aws:s3:::example-prod-export-bucket

Why it matters:
A public S3 bucket in a production account creates direct data exposure risk.
If the bucket contains logs, exports, backups, or customer data, the issue may become a reportable incident.

Failure mode:
An attacker or external party can access exposed objects without authentication.
If bucket contents include credentials, logs, exports, or regulated data, this can lead to data breach, credential compromise, and compliance exposure.

Required remediation:
- Confirm business owner.
- Validate whether bucket is intentionally public.
- Enable S3 Block Public Access at account and bucket level unless explicitly approved.
- Review bucket policy and ACL.
- Review CloudTrail data events if enabled.
- Assess object sensitivity.
- Open incident if sensitive data was exposed.

Evidence required:
- Security Hub finding JSON.
- S3 bucket policy export.
- Public access block configuration.
- Object sensitivity confirmation from data owner.
- CloudTrail access review.

3. `output/securityhub-remediation-backlog.csv`

Example columns:

priority,severity,account,region,resource_type,resource_id,finding_title,recommended_owner,remediation_action,evidence_required,sla,notes
Immediate,CRITICAL,123456789012,us-east-1,S3,bucket-name,Public bucket exposure,Data Platform,Disable public access and validate exposure,Finding JSON; bucket policy; access review,Same day,Escalate if sensitive data exists

4. `output/securityhub-evidence-index.md`

Example:

# Evidence Index

## Evidence collected

- Security Hub finding export
- Finding group summary
- AWS account and region
- Resource identifiers
- Remediation backlog
- Negative-control test showing write actions fail

## Evidence not collected

- Raw customer logs
- Secrets
- PII
- Full object contents

How to prioritize findings

Use a practical triage model.

Priority	Criteria	Response expectation
Immediate	Active internet exposure, privileged IAM risk, possible data exposure, exploited vulnerability, production blast radius	Same-day owner response and SOC visibility
High	Security control failure on sensitive or production resources	Remediation ticket with SLA
Medium	Misconfiguration with limited exposure or compensating controls	Backlog with owner and due date
Informational	Hygiene issue, duplicate finding, non-production low impact	Track, tune, or suppress through approved process

Priority is not only the Security Hub severity label.

Security Hub severity matters, but real prioritization should also consider:

Production versus non-production.
Public exposure.
Data sensitivity.
Exploitability.
Privilege impact.
Lateral movement potential.
Compensating controls.
Asset owner.
Existing exception status.

That is where the agent can help, but the human still owns the decision.

Evidence required for audit

Keep these artifacts:

Evidence	Why it matters
IAM permission set export	Shows least privilege scope
IAM policy JSON	Shows allowed and denied actions
SCP or permission boundary export	Shows preventive guardrail
AWS CLI identity check	Proves named identity
Security Hub finding export	Shows source evidence
Output files	Shows triage result
Negative-control test	Proves write actions fail
Tool approval log or session transcript	Shows human oversight
Jira tickets	Shows remediation ownership
Slack or incident notes	Shows escalation path

Do not store:

Raw customer data.
Secrets.
Access keys.
Sensitive logs copied unnecessarily.
Full data object contents.
Anything that creates a new evidence-handling problem.

Negative control test: prove write actions fail

A safe rollout must include a negative-control test.

Test one prohibited write action in a non-production or controlled environment:

aws securityhub batch-update-findings \
  --profile sec-mcp-readonly \
  --region us-east-1 \
  --finding-identifiers '[{"Id":"test","ProductArn":"arn:aws:securityhub:us-east-1::product/aws/securityhub"}]' \
  --workflow '{"Status":"SUPPRESSED"}'

Expected result:

AccessDeniedException

Keep the result as evidence.

If the command succeeds, the design is not approved.

Failure modes and required controls

Failure mode	What can go wrong	Required control
Agent gains write access	Findings are suppressed or resources are modified	IAM explicit deny, SCP, permission boundary
Prompt injection through finding text	Agent follows malicious instructions embedded in external content	Treat findings as untrusted data, use strict project rules
Excessive data retrieval	Agent pulls sensitive logs or PII into local files	Data minimization, deny secret/PII access, output path controls
Wrong AWS account	Agent queries or reports the wrong account	SSO profile naming, `sts:GetCallerIdentity`, account allowlist
Poor prioritization	Critical exposure is treated as backlog	Human review and explicit prioritization rules
No audit trail	Outputs cannot be defended in audit	Tool logs, CloudTrail, evidence index, ticket linkage
Auto-remediation drift	AI makes changes outside change control	No write access, CI/CD remains release authority
Public or untrusted MCP server	Credentials or data are exposed	Use official/vendor/internal MCP servers only

Where AWS AgentCore Gateway fits

For a single engineer or small pilot, AWS Agent Toolkit plus read-only IAM is enough.

For enterprise use, evaluate AWS AgentCore Gateway.

The reason is governance.

As MCP usage grows, security teams eventually need:

Central tool registration.
Central authentication.
Fine-grained access control.
Tool observability.
Network control.
Credential management.
Private connectivity.
SCP enforcement.
Standardized approval patterns.

Recommended maturity path:

Pilot: AWS Agent Toolkit + AWS MCP Server + read-only IAM profile
Scale: Add centralized governance and gateway controls
Custom: Build private MCP servers only for internal systems that are not covered

Do not start by building a custom MCP platform unless you already have a clear internal integration gap.

Practical AWS security use cases for MCP agents

Security Hub triage

Best first use case.

Input:

Security Hub findings.
Account context.
Severity.
Resource metadata.
AWS documentation.

Output:

Executive summary.
Technical findings.
Remediation backlog.
Evidence index.

Risk: low if read-only.

GuardDuty investigation support

The agent can help explain:

Finding type.
Likely attack path.
Affected principal.
Source IP.
First and last seen timestamps.
Recommended containment steps.

Do not let the agent disable keys, quarantine instances, or modify policies automatically.

Inspector vulnerability prioritization

The agent can group Inspector findings by:

Public exposure.
Exploit availability.
Package.
Workload owner.
Production impact.
Patch SLA.

The output can be a CSV remediation backlog.

Do not let the agent patch systems automatically.

IAM access review assistant

The agent can summarize:

Unused access.
High-risk permissions.
External trust relationships.
Access Analyzer findings.
Privileged roles.
Service accounts with broad permissions.

Do not let the agent change IAM policy.

Cloud security review evidence pack

The agent can collect read-only evidence for:

CloudTrail.
Config.
GuardDuty.
Security Hub.
Inspector.
S3 Block Public Access.
Encryption configuration.
Account inventory.

This is useful before audits, risk reviews, and architecture reviews.

Claude Code vs Codex: how I would use both

I would not frame this as Claude Code versus Codex.

I would use both where they are strongest.

Tool	Best use	Security posture
Claude Code	Deep reasoning, architecture review, long-form security analysis, runbook drafting	Strong project rules and tool approval
Codex	Code changes, CLI-driven development workflow, reproducible implementation tasks	Sandbox, approval policy, repo controls
AWS MCP Server	AWS documentation and authenticated AWS API access	IAM-enforced read-only first
CI/CD	Tests, scanning, deployment, policy gates	Release authority remains outside the AI tool

For security work, the safest split is:

Claude Code: analyze and explain
Codex: implement controlled code changes
AWS MCP Server: retrieve AWS context
CI/CD: validate and release
Human owner: approve risk and remediation

The AI tool can accelerate the workflow, but it should not become the approval authority.

Junior engineer runbook

Use this workflow for daily or weekly triage.

Before starting

Confirm:

You are using sec-mcp-readonly.
MFA is active.
You are in the correct AWS account and region.
The MCP server is the approved AWS MCP Server.
Output will be written only to ./output.
No customer PII, secrets, or raw sensitive logs will be collected.
Tool approvals are enabled.
Write actions are denied.

Daily triage workflow

Run aws sts get-caller-identity.
Run a small Security Hub read test.
Start Claude Code or Codex in the project folder.
Load prompts/securityhub-triage.md.
Approve only read-only AWS API calls.
Review generated output files.
Validate immediate-risk findings manually in the AWS Console.
Create Jira tickets for owners.
Escalate possible data exposure to SOC or the incident commander.
Store the evidence index with the ticket.

AWS Console paths for manual validation

Security Hub:

AWS Console -> Security Hub -> Findings

GuardDuty:

AWS Console -> GuardDuty -> Findings

Inspector:

AWS Console -> Inspector -> Findings

CloudTrail:

AWS Console -> CloudTrail -> Event history

Config:

AWS Console -> AWS Config -> Resources / Advanced queries

S3 public access:

AWS Console -> S3 -> Bucket -> Permissions -> Block Public Access / Bucket policy

Manual validation matters because MCP output is an aid, not evidence by itself.

What to fix first

Fix in this order:

Public exposure of production resources.
Possible sensitive data exposure.
Privileged IAM misconfiguration.
Active GuardDuty findings.
Critical exploitable vulnerabilities on internet-facing workloads.
Disabled or missing logging in production.
Missing encryption on sensitive stores.
Repeated control failures with no owner.
Non-production hygiene issues.
Informational findings and duplicates.

The top of the list is about blast radius and business impact, not just severity labels.

Residual risk

Even with read-only IAM and MCP controls, some risk remains.

Residual risks include:

The agent may misinterpret a finding.
The agent may over-prioritize or under-prioritize business impact.
Prompt injection may appear in finding text, ticket text, or documentation.
Local output files may contain sensitive metadata.
Engineers may approve unsafe tool calls.
AWS permissions may drift over time.
MCP server behavior and client capabilities may change with version updates.

Acceptable residual risk for a pilot:

Read-only triage and evidence drafting with human review.

Not acceptable for a pilot:

Automated suppression, remediation, policy changes, deployments, or risk acceptance.

Final Slack-ready wording

Decision: Approved with conditions.

We can pilot AWS MCP Server through AWS Agent Toolkit for a read-only Security Hub triage workflow.

Approved scope:
- Read Security Hub findings.
- Read limited investigation context from GuardDuty, Inspector, Config, CloudTrail, Organizations, and Access Analyzer.
- Generate local markdown/CSV/JSON summaries.
- Draft remediation tickets and Slack summaries.

Not approved:
- Security Hub suppression or updates.
- AWS resource changes.
- IAM changes.
- Secret or PII access.
- Automated remediation.
- AI-approved exception, merge, deploy, or risk acceptance.

Required controls:
- Named SSO identity.
- Dedicated SecMCPReadOnly permission set.
- Explicit deny for write actions.
- SCP or permission boundary for production where possible.
- MCP tool approval enabled.
- Script execution denied or separately approved.
- CloudTrail audit visibility.
- Negative-control test proving write actions fail.

Residual risk is acceptable for a read-only pilot with human review.

Final recommendation

Start with a narrow, governed workflow:

Use case: Security Hub triage
Agent: Claude Code or Codex
Connector: AWS Agent Toolkit / AWS MCP Server
AWS identity: SecMCPReadOnly
Permissions: read-only + explicit deny
Output: executive summary, technical findings, remediation backlog, evidence index
Approval: human review before tickets, suppression, remediation, or risk acceptance

Do not build a custom MCP server first.

Do not give the agent production write access.

Do not let the agent suppress findings or approve exceptions.

Get the read-only triage workflow working, prove the controls, collect evidence, and then decide whether more advanced workflows are justified.

That is the safe path from AI-assisted security work to production-grade security operations.

Implementation Control Matrix:[Part-7]: State-Owned ICS Cybersecurity Blueprint

Mike Anderson — Tue, 14 Jul 2026 13:18:30 +0000

Related with the following articles/posts:

Previous Series: Part 1: Executive Briefing

Previous Series: Part 2: National Risk, Threat Landscape, and the First 30 Days

Previous Series: Part-3: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Previous Series: Part-4: Tools, Technologies, and Control Implementation Catalog

Previous Series: Part-5: SOC, Detection, Incident Response, Resilience, and Exercises

Previous Series: Part-6: AI, Governance, Procurement, and the 180-Day National Roadmap

Implementation Control Matrix

Use this as an internal checklist after publishing the blog series.

Each control should have:

owner
enforcement point
evidence
review frequency
exception process
residual risk statement

Control 1: Critical process ownership

Objective:

Identify the national services and physical processes where cyber compromise can create major public, safety, economic, or national impact.

Enforcement points:

national critical infrastructure register
utility risk register
plant process inventory
executive risk committee

Evidence:

critical process list
named business owner
named OT owner
consequence rating
dependency map

Failure mode:

The organization secures systems based on technology importance instead of national consequence.

Control 2: OT asset inventory

Objective:

Maintain an accurate inventory of critical OT assets, versions, owners, zones, communication flows, and backup status.

Enforcement points:

passive discovery platform
CMDB
engineering documentation
plant walkdowns

Evidence:

asset inventory export
unknown asset report
firmware and software list
ownership field
criticality field
monthly reconciliation

Failure mode:

The inventory misses serial devices, spare controllers, relay settings, offline engineering laptops, or undocumented modems.

Control 3: IT/OT segmentation

Objective:

Prevent enterprise compromise from reaching control systems directly.

Enforcement points:

enterprise-to-OT firewall
OT DMZ
proxies and brokers
industrial firewalls
router and switch ACLs

Evidence:

zone and conduit diagram
firewall rule export
blocked direct access test
quarterly rule review
exception register

Failure mode:

A firewall exists, but broad rules allow direct access into OT.

Control 4: Vendor remote access

Objective:

Ensure vendor access is approved, MFA-protected, time-bound, recorded, and limited to named assets.

Enforcement points:

remote access portal
MFA
PAM
jump host
ticketing system
firewall policy

Evidence:

access approval ticket
MFA logs
session recording
target asset list
monthly vendor account review

Failure mode:

A vendor VPN lands directly inside Level 2 or Level 1 with broad subnet access.

Control 5: OT identity and privileged access

Objective:

Prevent credential compromise from becoming OT control.

Enforcement points:

separate or controlled OT identity boundary
PAM
MFA
local admin password management
privileged access review
break-glass procedure

Evidence:

privileged account inventory
MFA enforcement report
PAM session logs
break-glass test record
service account register

Failure mode:

Corporate identity compromise grants direct access to OT workstations or systems.

Control 6: Engineering workstation security

Objective:

Protect the systems used to configure controllers, relays, HMIs, and SCADA applications.

Enforcement points:

application allowlisting
endpoint hardening
USB control
local admin restriction
jump host access
backup images
log forwarding

Evidence:

hardening baseline
allowlisting policy
local admin review
USB exception register
golden image record
restore test

Failure mode:

An engineering workstation becomes the bridge between attacker access and controller modification.

Control 7: Controller, RTU, IED, and relay protection

Objective:

Restrict and monitor changes to control logic, relay settings, firmware, and device configuration.

Enforcement points:

controller ACLs where supported
cell firewall
approved engineering stations
physical cabinet control
change workflow
logic backup
checksum or integrity validation

Evidence:

approved engineering source list
controller configuration export
logic backup
relay setting backup
change ticket
integrity validation

Failure mode:

Anyone on the plant VLAN can reach a programming interface.

Control 8: OT monitoring and detection

Objective:

Detect unauthorized access, control writes, new devices, segmentation failures, abnormal engineering activity, and suspicious remote access.

Enforcement points:

passive OT sensors
SIEM
packet capture
jump host logs
identity logs
firewall logs
detection catalog

Evidence:

log source inventory
sensor placement map
detection catalog
ATT&CK for ICS mapping
alert tuning record
detection test result

Failure mode:

Monitoring generates noise but misses process-relevant behavior.

Control 9: Vulnerability and patch management

Objective:

Identify and reduce vulnerabilities based on consequence, exploitability, exposure, and recoverability.

Enforcement points:

passive vulnerability assessment
vendor advisories
CISA ICS advisories
change management
compensating controls
exception register

Evidence:

vulnerability report
affected asset list
remediation ticket
mitigation evidence
patch test result
exception approval

Failure mode:

The team uses enterprise CVSS-only prioritization and misses high-consequence OT exposure.

Control 10: Backup and recovery

Objective:

Restore critical process-control functions from trusted backups during an incident.

Enforcement points:

backup platform
offline or immutable storage
vendor backup tools
spare hardware
recovery runbook
restore exercises

Evidence:

backup inventory
restore test report
firmware and software dependency list
recovery procedure
spare hardware record

Failure mode:

Backups exist but cannot be restored under incident conditions.

Control 11: OT incident response

Objective:

Contain cyber incidents without creating unsafe physical process behavior.

Enforcement points:

OT incident response plan
severity model
incident bridge
safety approval process
containment playbooks
forensic evidence procedure

Evidence:

incident ticket
decision log
timeline
containment approval
evidence package
post-incident report
remediation owners

Failure mode:

The SOC applies IT containment actions that destabilize operations.

Control 12: AI governance

Objective:

Use AI to support cybersecurity decisions without allowing unsafe autonomous control actions.

Enforcement points:

AI use case register
data classification
approved AI platform
human approval gates
prompt and output logging
model owner
risk owner

Evidence:

AI policy
approved use case list
data handling review
human approval record
AI output validation
periodic review

Failure mode:

AI is connected to sensitive OT data or operational actions without governance.

Control 13: Secure procurement

Objective:

Ensure new ICS products and services can be secured, monitored, patched, supported, and recovered.

Enforcement points:

procurement policy
vendor security review
contract language
SBOM requirement where applicable
vulnerability disclosure requirement
secure configuration baseline
end-of-life planning

Evidence:

vendor security questionnaire
SBOM or equivalent artifact
secure configuration guide
support lifecycle commitment
incident notification clause
remote support architecture approval

Failure mode:

The organization purchases systems that cannot meet minimum security and recovery expectations.

Control 14: Executive metrics

Objective:

Report cyber risk in terms of national service resilience.

Enforcement points:

risk dashboard
executive committee
board or ministry reporting
regulatory evidence pack

Evidence:

asset coverage
segmentation status
vendor access metrics
backup restore metrics
detection test metrics
vulnerability exceptions
incident response exercise results

Failure mode:

Leadership receives alert counts instead of risk and resilience indicators.

Final use

This matrix should be reviewed quarterly.

Each control should have:

owner
current maturity score
target maturity score
funded remediation
due date
exception status
residual risk

Securing State-Owned ICS (Part 6): AI, Governance, Procurement, and the 180-Day National Roadmap

Mike Anderson — Tue, 14 Jul 2026 13:06:22 +0000

Previous Series: Part 1: Executive Briefing

Previous Series: Part 2: National Risk, Threat Landscape, and the First 30 Days

Previous Series: Part-3: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Previous Series: Part-4: Tools, Technologies, and Control Implementation Catalog

Previous Series: Part-5: SOC, Detection, Incident Response, Resilience, and Exercises

Jump to Part-7: State-Owned ICS Cybersecurity Blueprint

AI can help ICS cybersecurity.

It can also create new risk.

For state-owned critical infrastructure, AI must be introduced with discipline.

The goal is not to make the plant autonomous.

The goal is to improve visibility, triage, detection, reporting, planning, and decision support without allowing AI to directly manipulate unsafe physical processes.

The rule is simple:

AI can advise.

Humans must approve.

Engineering and safety must govern physical action.

Executive summary for leaders

AI should not be the starting point for ICS cybersecurity.

Start with:

asset inventory
segmentation
remote access control
identity governance
backups
monitoring
incident response
vendor governance

Then use AI to accelerate human decision-making.

Good AI use cases:

summarize advisories
enrich asset inventory
assist alert triage
draft detection logic
support threat hunting
summarize incidents
generate tabletop scenarios
create executive reports
review change requests for missing risk information

Risky AI use cases:

autonomous controller commands
unsupervised logic changes
automatic blocking of critical OT paths
cloud processing of sensitive national infrastructure data without approval
AI agents connected directly to control networks
AI-generated remediation applied without engineering review

For national ICS, AI governance is mandatory.

1. The AI rule for ICS

Approved policy statement:

AI may recommend, summarize, correlate, enrich, detect, and explain.

AI must not independently issue control commands, change controller logic, bypass safety procedures, isolate critical OT assets, or make safety-impacting decisions without approved human authority.

This should be written into national policy, utility policy, SOC procedure, and procurement language.

2. Practical AI use cases

Asset inventory enrichment

AI can help normalize messy asset data.

Inputs:

passive discovery output
CMDB
firewall logs
switch tables
vendor exports
engineering documentation
vulnerability reports
backup inventories

Useful outputs:

duplicate asset matching
vendor and model normalization
missing owner suggestions
criticality suggestions
unsupported software identification
likely zone or Purdue level
communication pattern summary

Human validation remains required.

Alert triage assistant

AI can help analysts understand alerts faster.

Useful outputs:

plain-language alert explanation
affected process summary
asset owner
recent related activity
approved change window check
recommended triage questions
evidence collection checklist
draft incident notes

Do not allow AI to auto-close high-risk OT alerts.

Detection engineering support

AI can draft detection ideas for:

vendor login outside approved window
unauthorized PLC or relay write
new engineering protocol source
RDP bypassing jump host
logic change outside approved window
new device in control cell
suspicious archive creation on engineering workstation

Human validation and test data are mandatory.

Threat intelligence summarization

AI can summarize:

national CERT alerts
CISA ICS advisories
vendor advisories
sector ISAC reports
known adversary tactics
affected products
recommended mitigations

The output should be mapped to actual inventory.

A generic advisory summary is useful.

A summary that says "we have 14 affected assets in three sites" is operationally valuable.

Incident response support

AI can help by:

building event timelines
summarizing log evidence
drafting executive updates
mapping behavior to MITRE ATT&CK for ICS
preparing post-incident report drafts
tracking remediation actions
generating lessons-learned summaries

AI should not decide containment for safety-impacting assets.

Change review support

AI can review change tickets for missing information.

Questions AI can flag:

Is the affected process documented?
Is rollback included?
Is backup confirmed?
Is the maintenance window approved?
Are safety and operations owners listed?
Is monitoring required after change?
Are firewall rules too broad?
Is the vendor access window time-bound?
Is evidence required after the change?

This is a strong, low-risk AI use case.

Training and tabletop simulation

AI can generate exercise scenarios for:

vendor account compromise
ransomware on HMI
unauthorized logic change
relay setting modification
loss of historian
substation communication outage
insider using shared account
compromise of IT/OT boundary
cloud analytics disruption

Use AI to create exercise material, not to replace human evaluation.

3. AI use cases to prohibit or tightly restrict

Avoid or prohibit:

AI issuing PLC, RTU, IED, or relay commands
AI modifying ladder logic or controller configuration without engineering review
AI automatically disabling critical OT network paths
AI deciding safe state
AI performing unsupervised active scanning of controllers
AI using live OT credentials without approval
AI agents connected directly to control networks
unmanaged public AI tools processing sensitive OT diagrams
sensitive incident evidence sent to cloud AI without approval
AI-generated remediation applied without testing
AI model training on national infrastructure data without legal review

Failure mode:

A model can be confident, useful, and wrong at the same time.

In ICS, wrong action can become physical impact.

4. Safe AI architecture

Use AI as an analysis layer, not a control layer.

OT sensors, logs, inventory, tickets
        |
        v
SIEM / OT security data lake
        |
        v
AI analysis layer
- summarization
- enrichment
- anomaly explanation
- detection draft
- report generation
        |
        v
Human approval
SOC, OT engineer, safety owner, incident commander
        |
        v
Approved action through existing controls
PAM, firewall, change management, incident response

The AI layer should not connect directly to controllers.

AI governance controls

Minimum controls:

approved AI use case register
data classification before AI use
prohibition on sensitive OT data in unmanaged public AI tools
role-based access
prompt and output logging where legally allowed
human approval for operational action
validation of AI output
prompt injection awareness
data leakage monitoring
model owner
risk owner
periodic performance review
incident process for AI failures
vendor security review

Local, sovereign, or cloud AI

For national critical infrastructure, use risk-based placement.

Prefer local or sovereign deployment for:

network diagrams
controller inventories
PLC logic
relay settings
vulnerability details
incident evidence
national grid topology
facility layouts
sensitive threat intelligence

Cloud AI may be acceptable for:

public advisory summaries
generic policy drafts
training content
non-sensitive writing assistance
public research summarization

Do not send sensitive operational data to public AI systems without approval.

5. Secure procurement

Procurement is a security control.

Every new ICS product or service should require:

secure development lifecycle evidence
vulnerability disclosure process
long-term patch support
SBOM where applicable
secure configuration guide
authentication and role-based access support
logging support
encrypted management where feasible
ability to disable unused services
documented hardening baseline
backup and restore method
default credential removal at commissioning
remote support model review
country-of-origin and supply chain review where required
contractual incident notification timeline
right to audit security controls
end-of-life notification period
data sovereignty statement
AI feature disclosure if AI is embedded

Do not buy systems that cannot be secured, monitored, patched, or recovered.

Cheap procurement can become expensive national risk.

6. National policy actions

A national ICS cybersecurity strategy should include the following.

Critical infrastructure classification

Classify assets based on consequence.

Do not treat all systems equally.

A national grid control center requires stronger obligations than a low-impact office system.

Minimum OT cybersecurity baseline

Mandate controls for:

asset inventory
network segmentation
MFA for remote access
vendor governance
logging and monitoring
backup and recovery
vulnerability management
secure procurement
incident reporting
OT-specific incident response
annual exercises
AI governance where AI is used

National OT-CERT capability

Create or strengthen a specialist OT incident response function.

It should support:

incident coordination
malware and forensic analysis
threat intelligence
emergency advisory publication
sector coordination
recovery support
lessons-learned sharing

Sector threat intelligence

Build trusted sharing across:

energy
water
transport
telecom
health
finance
ports
aviation
defense-linked infrastructure

National exercises

Run exercises that test:

power disruption
water disruption
coordinated cyber and physical activity
cross-border dependency
public communication
incident reporting
recovery sequencing
manual operations
misinformation and public trust issues

Workforce development

Invest in:

OT security training
control engineering cyber training
SOC analyst OT training
incident commander training
university and technical institute programs
government and utility certification paths
local language awareness material
national cyber range and OT lab environments

7. South and Southeast Asia implementation note

For many South and Southeast Asian environments, the strategy must account for:

mixed legacy and modern systems
imported technology dependency
large geographic coverage
remote substations or facilities
uneven local security maturity
limited OT cybersecurity workforce
budget pressure
public-sector procurement constraints
regional interdependencies
climate and disaster resilience needs
national data sovereignty concerns

The practical response is not to wait for perfect maturity.

Use a phased model:

secure the highest-consequence services first
build national OT asset visibility
control vendor access
remove dangerous IT/OT shortcuts
establish sector SOC or shared monitoring
build local OT cyber workforce
require secure procurement for all new projects
run national exercises
build cross-border coordination for interconnected infrastructure

The region does not need to copy another country blindly.

It needs a locally governed, standards-aligned, consequence-driven model.

8. 180-day roadmap

Days 0-30: establish control of the basics

Objectives:

name accountable owners
identify critical processes
build top-level architecture view
inventory crown-jewel assets
identify remote access paths
review vendor accounts
export firewall rules
confirm backup existence
create incident contact roster
start risk register

Deliverables:

critical process list
initial OT asset inventory
IT/OT connectivity map
remote access register
backup status report
top 10 unacceptable risks
executive briefing

Days 31-60: reduce obvious attack paths

Objectives:

remove direct enterprise-to-controller access
disable undocumented vendor access
place remote access behind MFA and approval
remove or control dual-homed engineering workstations
eliminate default credentials on critical assets
segment highest-criticality process cells
start passive monitoring
define OT severity model
create initial detection use cases

Deliverables:

updated network rules
vendor access governance
monitoring plan
detection catalog
remediation backlog
exception register

Days 61-90: operationalize security

Objectives:

connect high-value logs to SIEM
build SOC triage playbooks
define SOAR approval gates
test backup restore for one critical process
run tabletop exercise
review privileged access
validate firewall rules
start vulnerability management by consequence
create leadership metrics

Deliverables:

SOC runbook
restore test evidence
tabletop report
access review evidence
vulnerability risk register
leadership dashboard

Days 91-120: harden and validate

Objectives:

expand segmentation
harden engineering workstations
implement application allowlisting where feasible
implement session recording
tune detections
build threat model for highest-criticality process
validate containment decisions
formalize procurement security requirements

Deliverables:

hardened baseline
threat model
detection test result
procurement checklist
risk treatment plan

Days 121-180: scale to resilience

Objectives:

expand monitoring to more sites
establish sector threat intelligence process
integrate national CERT reporting
run purple team exercise
test emergency isolation process
formalize AI governance
build 12-month investment roadmap
report residual risk

Deliverables:

national or enterprise OT security roadmap
purple team report
emergency isolation test
AI use policy
12-month budget plan
residual risk statement

9. Budget priorities

If funding is limited, prioritize:

asset inventory and network flow visibility
remote access control
IT/OT segmentation
critical backup and restore capability
monitoring for unauthorized control activity
engineering workstation hardening
vendor access governance
incident response playbooks and exercises
vulnerability management and patch process
AI-assisted triage and reporting

Do not start with AI if inventory, segmentation, remote access, backups, and monitoring are weak.

AI improves a mature program.

It does not replace one.

10. Final maturity score

Score each domain from 1 to 5.

Domains:

governance and ownership
asset inventory
network segmentation
remote access
identity and privileged access
monitoring and detection
vulnerability and patch management
incident response
backup and recovery
supply chain and procurement
AI governance
workforce and exercises
leadership reporting

Target scores:

minimum acceptable: 3.0
critical national target: 4.0
strategic national capability: 4.5 or higher

A realistic first-year goal is to move from 1.5 or 2.0 to 3.0.

That alone removes many major attack paths.

Takeaway

The best national ICS cybersecurity program is not the one with the most advanced AI or the largest tool stack.

It is the one that knows its assets, controls access, segments critical paths, monitors meaningful behavior, responds safely, recovers quickly, governs suppliers, trains people, and uses AI carefully to improve human decisions.

Critical infrastructure protection is not only cybersecurity.

It is national continuity.

Securing State-Owned ICS (Part 5): SOC, Detection, Incident Response, Resilience, and Exercises

Mike Anderson — Tue, 14 Jul 2026 12:58:28 +0000

Previous Series: Part 1: Executive Briefing

Previous Series: Part 2: National Risk, Threat Landscape, and the First 30 Days

Previous Series: Part-3: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Previous Series: Part-4: Tools, Technologies, and Control Implementation Catalog

Jump to Part-7: State-Owned ICS Cybersecurity Blueprint

A strong ICS architecture reduces attack paths.

Cyber operations determine whether the organization can detect, respond, and recover when something goes wrong.

For state-owned critical infrastructure, the SOC mission is not simply alert handling.

The mission is national service continuity.

The SOC, OT engineers, safety teams, field operations, vendors, executives, legal, communications, and national response bodies must be able to work together before a crisis.

The operating rule is:

Detect early. Contain safely. Preserve evidence. Recover critical services. Improve controls.

OT cyber operations model

Executive summary for leaders

A mature OT cyber operations program should prove that:

critical OT assets are monitored
remote access is reviewed
unauthorized control behavior can be detected
alerts are triaged with process context
containment actions are approved by the right operational owner
backups are tested
incident playbooks are exercised
lessons learned become funded remediation
leadership receives risk metrics, not raw alert volume

If the SOC cannot explain what physical process an alert affects, it is not ready for ICS operations.

1. OT SOC operating model

An OT SOC is not just an enterprise SOC watching another dashboard.

It needs OT-specific context, escalation, and safety approvals.

Tier 1: initial triage

Responsibilities:

validate alert quality
identify source, destination, user, protocol, and time
check asset criticality
check maintenance window
check vendor approval
identify whether behavior is read-only, write-capable, or administrative
escalate high-risk OT behavior immediately

Tier 1 should not independently block critical OT traffic unless a pre-approved action exists.

Tier 2: investigation

Responsibilities:

analyze logs and packet captures
correlate remote access, identity, firewall, and OT sensor data
validate whether activity matches approved work
coordinate with OT engineers
recommend severity and containment
preserve evidence

Tier 3: detection engineering and threat hunting

Responsibilities:

build MITRE ATT&CK for ICS mapped detections
tune false positives
build threat hunts
validate detections through exercises
analyze suspicious tooling
improve telemetry coverage
support incident response and post-incident review

OT engineering

Responsibilities:

explain physical process impact
confirm whether engineering activity is legitimate
approve containment that may affect operations
validate controller logic and settings
restore systems from known-good backups
define safe state and manual operation options

Incident commander

Responsibilities:

coordinate incident bridge
maintain timeline
approve escalation
coordinate legal, privacy, communications, safety, operations, and executives
ensure evidence preservation
drive post-incident remediation

2. Telemetry required

A SOC cannot detect what it cannot see.

Critical log sources

Collect from:

OT firewalls
industrial switches and routers
OT IDS sensors
remote access portal
VPN
jump hosts
PAM platform
Active Directory or OT identity
HMIs
SCADA servers
historians
engineering workstations
Linux servers
backup systems
endpoint security tools where safe
physical access systems
change management
ticketing system
vulnerability platform
cloud analytics platform where used

High-value OT events

Prioritize:

new device in OT segment
industrial protocol write
PLC or relay mode change
controller logic upload or download
relay setting change
engineering workstation connection to controller
firmware change
HMI project change
remote vendor login
failed privileged login
jump host session start and end
firewall rule change
new route or ACL change
backup failure for critical asset
badge access to restricted operational area

Retention

Critical infrastructure needs enough retention to investigate slow adversary activity.

Use:

hot logs for active investigation
warm logs for incident review
cold archive for legal, regulatory, and historical analysis
rolling packet capture in the most critical segments where feasible

Retention must follow national law, privacy obligations, and operational policy.

3. Detection engineering

Generic malware alerts are not enough.

OT detections must focus on control behavior, remote access, engineering actions, segmentation failures, and process context.

Detection 1: unauthorized PLC or relay write

Logic:

If source is not an approved engineering workstation
AND destination is a controller, RTU, IED, or relay
AND protocol action is write-capable
THEN create high-severity OT alert.

Triage:

confirm source asset
check user and session
check change window
check engineering approval
review packet capture
ask OT owner if action was expected
preserve evidence

Containment options:

block source at cell firewall
disable suspicious endpoint switch port
terminate unauthorized vendor session
disable compromised account
do not stop controller without OT approval

Detection 2: vendor access outside approved window

Logic:

If vendor account logs in
AND no active approved ticket exists
OR session occurs outside approved time window
THEN alert SOC and OT owner.

Triage:

validate user and source
review approval system
review session recording
confirm target asset
disable account if unauthorized

Detection 3: engineering workstation abnormal behavior

Logic:

If engineering workstation connects to unusual controllers
OR launches unauthorized remote admin tools
OR creates unusual archive files
OR executes new binaries
THEN escalate to Tier 2.

Triage:

check maintenance activity
review endpoint telemetry
review user activity
verify project file access
confirm with engineering lead

Detection 4: logic or setting change outside window

Logic:

If controller logic, relay setting, or firmware indicator changes
AND no approved change exists
THEN create critical OT alert.

Triage:

confirm affected process
engage OT engineer
compare against known-good backup
preserve project files and logs
prepare containment and recovery plan

Detection 5: IT-to-OT protocol crossing

Logic:

If source zone is enterprise IT
AND protocol is industrial or engineering-related
AND destination is OT
THEN create high-severity segmentation alert.

Examples of protocols or services to watch:

Modbus
DNP3
S7
EtherNet/IP
IEC 61850
OPC Classic
engineering workstation protocols
RDP
VNC
SSH
SMB

4. Threat hunting

Useful OT hunts:

new assets in control zones
new external remote access sources
rare industrial protocol function codes
RDP from non-jump-host systems
vendor access outside business process
engineering tool execution on non-engineering hosts
abnormal historian queries
new scheduled tasks on HMIs
new local administrator accounts
PLC communication from unexpected subnet
file archives created on engineering workstations
DNS queries from OT assets that should not use internet DNS
relay setting changes outside planned work

Every hunt should produce one of three outcomes:

confirmed incident
control gap
detection tuning opportunity

5. Vulnerability management in operations

Vulnerability management must be consequence-based.

Do not rank only by CVSS.

Prioritize by:

asset criticality
process impact
exploitability
exposure
known exploitation
segmentation
patch availability
vendor support
compensating controls
recovery readiness

Remediation options:

patch during maintenance window
upgrade firmware after lab validation
disable vulnerable service
restrict source IP
add cell firewall rule
move asset to segmented zone
monitor for exploitation
remove enterprise reachability
replace unsupported system
create time-bound exception with compensating controls

Evidence:

vulnerability report
affected asset list
risk rating
remediation ticket
mitigation proof
vendor advisory
exception approval
retest result

6. Patch and change management

OT patching requires planning.

Process:

Track vendor and national advisories.
Match advisories to asset inventory.
Assess process impact.
Test patch or firmware in lab or spare system.
Confirm vendor support.
Schedule maintenance window.
Confirm backup and rollback.
Apply change.
Monitor cyber and process telemetry.
Record evidence and lessons learned.

Change control applies to:

firewall rules
remote access
controller logic
relay settings
HMI projects
SCADA configuration
historian connectors
switch and router configuration
firmware updates
user and role changes
cloud data pipelines
safety system changes

Every OT change needs:

owner
approval
maintenance window
rollback plan
monitoring plan
stop condition
post-change validation

7. SOAR and automation

Automation helps when it reduces analyst workload without creating operational risk.

Generally safe to automate

enrich alert with asset criticality
attach owner and location
check change ticket
check vendor access approval
collect relevant logs
query threat intelligence
notify OT owner
open incident ticket
create evidence folder
draft timeline
recommend containment options

Must require human approval

blocking OT network flows
disabling vendor access during active maintenance
isolating HMI or SCADA server
disabling switch ports
changing firewall policy
resetting OT passwords at scale
restarting services
reloading controller logic
disconnecting IT/OT boundary links

Failure mode

A SOAR playbook that automatically blocks a controller communication path can stop a process.

The required controls are:

approval gates
rollback
audit logging
simulation testing
OT owner sign-off
emergency bypass procedure

8. Incident response for OT

OT incident response must be safety-led.

Severity model

SEV-1 Critical:

unauthorized control action
confirmed logic or relay setting manipulation
ransomware affecting operations
safety impact
loss of control visibility
major service disruption
active attacker in critical OT zone

SEV-2 High:

compromised engineering workstation
unauthorized PLC or relay write attempt
vendor account compromise
malware on HMI
confirmed lateral movement toward OT

SEV-3 Medium:

unauthorized device in OT
suspicious scan
failed privileged logins
policy violation
unmanaged remote access path
monitoring gap with significant risk

SEV-4 Low:

false positive
benign misconfiguration
informational alert
low-impact hygiene issue

Triage questions

Ask:

Which asset is affected?
What physical process does it support?
Is the process stable?
Is this read traffic, write traffic, or administrative activity?
Is this inside an approved change window?
Is a vendor approved to connect?
Can containment harm safety or availability?
What evidence must be preserved?
Who can approve action?
What rollback exists?

Containment principles

Good containment:

disable suspicious endpoint port
block attacker source at cell firewall
terminate unauthorized vendor session
disable compromised account
move operators to standby HMI
disconnect enterprise path while keeping local OT running
preserve packet captures and logs

Avoid without approval:

stopping controllers
rebooting HMIs during live operations
reloading logic without validation
broad password resets during operational stress
shutting down switches supporting active process control

The principle is:

Isolate the attacker, not the process.

9. Backup, recovery, and continuity

Backups are a cyber safety control.

Critical backups:

HMI images
engineering workstation images
SCADA server configuration
historian configuration
PLC and RTU logic
relay settings
network device configuration
firewall policy
remote access configuration
license keys
vendor installation media
recovery procedures
offline contact lists

Minimum standard:

offline or immutable copy
physically or logically disconnected copy
restore test for critical systems
spare hardware for high-criticality assets
firmware and software version records
recovery runbook accessible during IT outage
recovery exercise at least annually for critical services

Leadership should ask:

Can we restore the most critical process-control function from known-good backups during a cyber incident?

10. Purple team and exercises

Exercises turn plans into capability.

Run:

tabletop exercises
detection validation
restore drills
vendor access abuse scenarios
ransomware-on-HMI scenario
unauthorized PLC logic change scenario
relay setting change scenario
loss of IT/OT boundary scenario
national crisis communication exercise
manual operation coordination exercise

Use MITRE ATT&CK for ICS to build scenarios.

Example scenario:

vendor VPN compromise
-> OT network discovery
-> RDP to engineering workstation
-> project file collection
-> unauthorized logic download
-> alarm suppression attempt

For each stage define:

expected telemetry
detection rule
response owner
containment option
evidence required
control gap
remediation owner

11. Metrics that matter

Do not report only alert volume.

Report operational cyber risk.

Good metrics:

percentage of critical OT assets inventoried
number of direct IT-to-OT flows remaining
vendor accounts active outside approved windows
percentage of critical assets with tested backups
unauthorized write attempts
critical vulnerabilities past SLA without compensating controls
percentage of firewall rules reviewed this quarter
mean time to triage high-severity OT alerts
restore tests completed
open exceptions by age and criticality
OT detections tested in the last quarter
percentage of remote sessions recorded
unmanaged devices found in OT
sites with passive monitoring coverage
engineering changes with complete evidence

A strong board statement:

We have inventoried 94% of critical OT assets, removed direct enterprise-to-controller access, placed vendor access behind MFA and recording, tested restores for the top three process-control functions, and validated detections for unauthorized controller writes. Remaining risk is concentrated in two legacy sites and one vendor remote access path.

That is better than:

We deployed an OT monitoring platform.

12. Operating rhythm

Daily:

review high-severity OT alerts
review remote access anomalies
confirm critical monitoring health

Weekly:

review new assets and flows
review failed privileged logins
review vendor activity
tune noisy detections

Monthly:

review vendor accounts
review privileged accounts
validate backup job status
review open critical vulnerabilities
review unmanaged devices

Quarterly:

review firewall rules
run detection validation
review exceptions
update threat model
report metrics to leadership

Semiannual:

run restore test
run incident response tabletop
review architecture against current operations

Annual:

conduct OT security assessment
run sector or national exercise
refresh strategy, budget, and risk register

Takeaway

Cyber operations decide whether architecture becomes real security.

A mature OT SOC does not chase every alert equally.

It understands process context, detects meaningful behavior, acts safely, preserves evidence, supports recovery, and improves controls after every event.

For state-owned ICS, that is not just a security function.

It is part of national continuity.

Let's move to Final Part-6: AI, Governance, Procurement, and the 180-Day National Roadmap

Securing State-Owned ICS (Part 4): Tools, Technologies, and Control Implementation Catalog

Mike Anderson — Tue, 14 Jul 2026 12:51:06 +0000

Architecture without implementation is just a diagram.

Previous Series: Part 1: Executive Briefing

Previous Series: Part 2: National Risk, Threat Landscape, and the First 30 Days

Previous Series: Part-3: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Jump to Part-7: State-Owned ICS Cybersecurity Blueprint

This is written for CISOs, security architects, procurement teams, SOC managers, OT engineers, and government program owners who need to build a real security capability.

The message is direct:

Do not buy tools first.

Define the control.

Define the enforcement point.

Define the owner.

Define the evidence.

Then select the tool.

Executive summary for leaders

A national ICS cybersecurity program needs technology, but technology alone will not protect the plant.

Each tool must answer:

What risk does it reduce?
Where is it enforced?
Who operates it?
What evidence proves it works?
What happens if it fails?
Can it operate safely in OT?
Does it support recovery during a crisis?

If the answer is unclear, the tool is not ready for critical infrastructure.

1. Control architecture model

Every control should be documented using this format.

Control ID:
Control objective:
Risk reduced:
Enforcement point:
Technology examples:
Owner:
Evidence:
Failure mode:
Residual risk:

Example:

Control ID: OT-RA-01
Control objective: prevent uncontrolled vendor access to OT
Risk reduced: third-party compromise leading to controller or HMI access
Enforcement point: remote access portal, MFA, PAM, jump host, firewall
Technology examples: CyberArk, BeyondTrust, Delinea, Fortinet, Palo Alto, Duo, Okta
Owner: OT security and vendor manager
Evidence: access ticket, session recording, VPN log, monthly account review
Failure mode: vendor receives broad subnet access or account remains active after contract ends
Residual risk: vendor endpoint compromise may still occur, so session monitoring remains required

This level of clarity separates real security from policy language.

2. Asset inventory and visibility

Control objective

Know every critical OT asset, its owner, location, function, software or firmware version, communication path, and process impact.

Tools and technology examples

Nozomi Networks
Dragos Platform
Claroty
Microsoft Defender for IoT
Forescout eyeInspect
Tenable OT
Security Onion with Zeek for teams that can operate open-source tooling
industrial network switch exports
engineering workstation project files
CMDB or asset database

Required evidence

critical asset inventory
unknown device list
firmware and software inventory
zone assignment
owner assignment
criticality rating
remote access dependency
backup status
unsupported asset list
inventory reconciliation record

Failure mode

A passive discovery tool identifies networked assets but misses offline controllers, serial devices, spare PLCs, relay settings, or undocumented engineering laptops.

Required operating practice

Combine passive discovery with engineering walkdowns and configuration review.

Do not treat tool discovery as complete truth.

3. Network segmentation

Control objective

Prevent compromise in one network zone from spreading into critical control zones.

Enforcement points

IT/OT firewalls
OT DMZ
industrial firewalls
router ACLs
switch ACLs
VLANs
data diode or unidirectional gateway where appropriate
proxy or broker services
jump hosts

Tools and technology examples

Fortinet FortiGate and FortiSwitch
Palo Alto Networks NGFW
Check Point
Cisco Secure Firewall and industrial networking
Tofino Xenon
Belden / Hirschmann
Siemens Scalance
Ruggedcom
Waterfall Security or Owl Cyber Defense for unidirectional use cases

Required evidence

zone and conduit diagram
firewall rule export
rule owner
business justification
review date
approved exception list
blocked direct enterprise-to-controller path evidence
segmentation test result

Failure mode

The organization has a firewall between IT and OT but keeps broad rules such as:

source: enterprise network
destination: OT network
service: any
purpose: support

That is not segmentation.

That is an attack path with a firewall in front of it.

4. Remote access and vendor access

Control objective

Ensure all remote OT access is authenticated, approved, recorded, limited, and time-bound.

Enforcement points

remote access portal
MFA
PAM
jump host
firewall
ticketing workflow
session recording
vendor account lifecycle

Tools and technology examples

CyberArk
BeyondTrust
Delinea
Teleport for controlled access where appropriate
Fortinet ZTNA or VPN
Palo Alto GlobalProtect
Cisco Secure Access
Duo
Okta
Microsoft Entra ID
ServiceNow or Jira for approval workflow

Required evidence

approved access ticket
MFA log
session recording
target asset list
access start and end time
vendor account review
emergency access review
monthly access attestation

Failure mode

Vendor VPN provides broad OT subnet access.

Required fix

Vendor access should land on a controlled jump host and then only reach named target assets during an approved window.

5. Identity and privileged access

Control objective

Prevent credential compromise from becoming OT control.

Enforcement points

separate OT identity boundary
MFA
PAM
group policy
local account vaulting
privileged role review
break-glass process
service account governance

Tools and technology examples

Active Directory with separate OT forest
Microsoft Entra ID with careful federation design
CyberArk
BeyondTrust
Delinea
Thycotic/Delinea Secret Server
Duo
Okta
Ping
Microsoft LAPS or Windows LAPS for local admin management

Required evidence

privileged account list
access review sign-off
service account register
break-glass test record
MFA enforcement report
PAM session log
local admin password management evidence

Failure mode

Corporate Active Directory compromise provides direct access to OT workstations.

Required fix

Create separation, enforce MFA and PAM, remove unnecessary trust, and restrict privileged access to approved administration paths.

6. Endpoint and engineering workstation security

Control objective

Protect HMIs, SCADA servers, historians, and engineering workstations from malware, unauthorized software, credential theft, and unauthorized engineering activity.

Enforcement points

hardened OS baseline
application allowlisting
EDR in OT-safe mode
host firewall
USB control
local admin restriction
log forwarding
golden image
backup

Tools and technology examples

Microsoft Defender for Endpoint configured for OT compatibility
CrowdStrike or SentinelOne where validated for OT use
Windows Defender Application Control
AppLocker
Ivanti or Tanium for managed environments
vendor-approved hardening tools
Group Policy for Windows OT environments

Required evidence

hardening baseline
application allowlisting policy
EDR coverage report
exclusion list approved by vendor and security
local admin review
USB exception list
golden image record
restore test evidence

Failure mode

An endpoint tool blocks a vendor HMI process or overloads a fragile system.

Required fix

Validate endpoint controls in a lab, maintenance window, or spare system before broad deployment.

7. Controller, PLC, RTU, and relay security

Control objective

Restrict who can modify control logic, relay settings, firmware, and controller configuration.

Enforcement points

controller access control
engineering workstation restriction
cell firewall
physical cabinet access
change workflow
logic backup
checksum or integrity verification
vendor secure protocol settings where available

Tools and technology examples

vendor engineering software
PLC and relay backup tools
OT monitoring platforms
industrial firewalls
configuration management tools
physical access control
secure engineering workstations

Required evidence

controller inventory
approved engineering workstation list
logic backup
firmware version
checksum or vendor integrity evidence
change approval
restoration test
access control configuration

Failure mode

Anyone on the plant VLAN can reach a PLC programming interface.

Required fix

Restrict programming access to approved engineering workstations and monitor write-capable commands.

8. OT monitoring and detection

Control objective

Detect unauthorized access, abnormal protocol behavior, new devices, controller writes, logic changes, and segmentation failures.

Enforcement points

passive OT sensors
SIEM
detection rules
packet capture
jump host logs
identity logs
firewall logs
change management integration

Tools and technology examples

Dragos Platform
Nozomi Networks
Claroty
Microsoft Defender for IoT
Forescout eyeInspect
Tenable OT
Security Onion
Zeek
Suricata
Splunk
Microsoft Sentinel
Google SecOps
Elastic Security
IBM QRadar

Required evidence

log source inventory
sensor placement diagram
detection catalog
alert tuning record
test cases
triage playbooks
packet capture retention policy
MITRE ATT&CK for ICS mapping

Failure mode

The tool alerts on everything, the SOC trusts nothing, and real control activity is missed.

Required fix

Build detections around process-relevant behavior and tune with OT engineers.

9. Vulnerability management

Control objective

Identify, prioritize, remediate, or compensate vulnerabilities without destabilizing operations.

Tools and technology examples

Tenable OT
Claroty
Nozomi
Dragos
CISA ICS advisories
vendor advisories
Qualys, Rapid7, or Tenable for IT zones
SBOM and software inventory tools where available

Prioritization model

Do not rank only by CVSS.

Prioritize by:

exploitability
exposure
asset criticality
process consequence
internet or enterprise reachability
known exploitation
compensating controls
patch availability
vendor support
recovery readiness

Required evidence

vulnerability report
affected asset list
risk decision
remediation ticket
patch or mitigation evidence
vendor advisory
exception approval
compensating control record
retest result

Failure mode

The team runs aggressive IT scans against live controllers.

Required fix

Use passive assessment first, controlled active scanning only with written OT approval, lab validation, stop conditions, and rollback.

10. Backup and recovery

Control objective

Restore critical control functions from known-good backups during a cyber incident.

Tools and technology examples

Veeam
Commvault
Rubrik
Cohesity
Acronis
vendor-specific PLC backup tools
offline media
immutable storage
secure backup vault
spare PLC, relay, HMI, and workstation hardware

Required evidence

backup inventory
backup schedule
offline or immutable copy proof
restore test result
firmware and software dependency record
license key inventory
recovery runbook
spare hardware list

Failure mode

Backups exist but have never been restored.

Required fix

Test recovery of the most critical process-control functions, not only backup job completion.

11. File transfer, malware inspection, and patch staging

Control objective

Prevent malware and unapproved files from crossing into OT.

Enforcement points

managed file transfer gateway
malware sandbox
content inspection
hash verification
signed package validation
OT DMZ staging server
removable media procedure

Tools and technology examples

secure file transfer gateway
sandbox analysis platform
antivirus scanning station
content disarm and reconstruction where appropriate
YARA scanning for mature teams
vendor package verification tools

Required evidence

file transfer logs
package approval
hash verification
malware scan result
change ticket
OT owner approval
removable media register

Failure mode

A vendor brings a USB drive directly to an engineering workstation.

Required fix

All removable media and vendor packages must pass through controlled inspection before OT use.

12. Cloud and IIoT security

Control objective

Enable analytics and reporting without creating a control path from cloud or enterprise systems into OT.

Enforcement points

data gateway
brokered API
private connectivity
one-way transfer where appropriate
cloud IAM
encryption
logging
data classification
egress control

Tools and technology examples

AWS IoT SiteWise or equivalent industrial data services
Azure IoT Operations / Azure Arc where appropriate
Google Cloud industrial analytics services
private connectivity
cloud SIEM or data lake
KMS/HSM services
CSPM or CNAPP for cloud governance

Required evidence

data classification
approved data flow
cloud IAM policy
encryption evidence
API logs
egress monitoring
architecture approval
local operation fallback evidence

Failure mode

Cloud analytics becomes a hidden dependency for live operations.

Required fix

Local operation must continue safely without cloud availability.

13. Physical-cyber integration

Control objective

Detect and respond to combined cyber and physical activity.

Data sources

badge access
CCTV event metadata
visitor logs
cabinet access
field crew dispatch
maintenance windows
cyber alerts
remote access sessions
engineering changes

Useful correlation:

after-hours badge access to substation
+ vendor remote login
+ relay setting change
= high-priority investigation

Evidence

physical access logs
cyber alert
maintenance ticket
investigation notes
operator confirmation
incident decision record

14. Tool selection principles

Before selecting a tool, ask:

Does it support passive deployment?
Does it understand industrial protocols?
Can it operate without disrupting process control?
Can OT engineers interpret the alerts?
Does it integrate with SIEM and ticketing?
Does it provide usable evidence?
Does it support regulated environments?
Can it scale across national sites?
Can local teams operate it?
What happens if the vendor or cloud service is unavailable?
What data leaves the country?
What is the total operating cost?

A tool that the organization cannot operate is not a control.

It is shelfware.

Takeaway

A 5 star quality ICS security program is not defined by the number of products deployed.

It is defined by whether controls are enforced, evidenced, owned, tested, and safe for operations.

Tools matter.

Control design matters more.

So, let's move to Part-5:SOC, Detection, Incident Response, Resilience, and Exercises

Securing State-Owned ICS (Part 3): Target Architecture for IT, OT, Cloud, and Power Grid Environments

Mike Anderson — Tue, 14 Jul 2026 12:48:30 +0000

Previous Series: Part 1: Executive briefing

Previous Series: Part 2: National Risk, Threat Landscape, and the First 30 Days

Jump to Part-7: State-Owned ICS Cybersecurity Blueprint

Part 3 turns that into architecture.

This is a target architecture for a state-owned ICS environment that also operates regular IT services.

It applies to power grids, water utilities, transport networks, refineries, ports, national manufacturing, and similar critical infrastructure.

For a power grid, map the zones to:

corporate IT
control center
Energy Management System
Distribution Management System
substation automation
protection relays
generation plant control
telecom and SCADA communications
outage management and dispatch
market and billing systems
field crew systems
national or sector SOC

The design objective is simple:

A compromise in one area must not become a national service disruption.

Executive summary for leaders

A strong ICS architecture has no uncontrolled shortcuts.

The design should ensure:

enterprise IT cannot directly reach controllers
vendors cannot land directly inside control networks
cloud systems cannot control critical processes by accident
identity compromise in IT does not automatically grant OT control
safety systems are isolated and harder to modify
every IT/OT flow has a business purpose, owner, approval, and logs
monitoring observes OT without creating a new control path
local operations can continue if enterprise IT or cloud services fail

If the architecture cannot support those outcomes, it is not ready for national critical infrastructure.

1. Design principles

Principle 1: safety and control first

Security controls must not create unsafe process behavior.

Every major architecture decision should be reviewed by cybersecurity, OT engineering, operations, and safety.

Principle 2: no direct enterprise-to-controller access

Corporate IT must not directly communicate with PLCs, RTUs, protection relays, safety controllers, or control networks.

Principle 3: controlled exchange through OT DMZ

Data exchange between IT and OT should pass through a controlled exchange layer.

That layer is the OT DMZ.

Principle 4: remote access is privileged access

Remote access into OT must be approved, MFA-protected, time-bound, recorded, and limited to named targets.

Principle 5: segment by consequence

A substation, turbine control cell, safety system, and office network should not share one flat trust zone.

Principle 6: monitor behavior, not just malware

ICS attacks often appear as abnormal control behavior, engineering activity, remote access, or protocol use.

Principle 7: local control must survive cloud or IT failure

For critical national services, cloud analytics and enterprise systems must not become dependencies for safe local operation.

2. Reference architecture

Use this conceptual model.

                         Internet
                            |
                    Public Edge / DDoS / WAF
                            |
             +--------------+--------------+
             |                             |
       Public Services DMZ          Remote Access Portal
       citizen portals, APIs,       MFA, PAM, approval,
       email gateways               device checks, recording
             |                             |
             +--------------+--------------+
                            |
                       Enterprise IT
       identity, email, ERP, HR, finance, billing,
       office endpoints, market systems, reporting
                            |
                      Enterprise SOC
       SIEM, SOAR, threat intelligence, case management
                            |
                       IT/OT Boundary
       firewalls, proxies, brokers, malware inspection,
       approved conduits, optional one-way transfer
                            |
                           OT DMZ
       historian replica, patch staging, file transfer,
       update relay, log relay, jump access mediation
                            |
                      OT Operations Zone
       SCADA, EMS/DMS, local historian, OT identity,
       engineering workstations, operator services
                            |
       +--------------------+--------------------+
       |                    |                    |
 Generation Cell      Substation Cell      Water/Process Cell
 PLCs, turbine        RTUs, IEDs,          PLCs, RTUs,
 controls, HMIs       relays, gateways     HMIs, analyzers
       |                    |                    |
 Physical process     Physical process     Physical process
 turbines, breakers,  breakers, feeders,   pumps, valves,
 transformers         transformers         dosing systems

Separate supporting zones:

OT security monitoring zone
backup and recovery zone
out-of-band management zone
physical security integration zone
national CERT or sector SOC reporting path
lab and test environment

Do not collapse these into one flat network.

3. Security zones and what belongs in each

Enterprise IT zone

Contains:

corporate users
email
ERP
HR
finance
procurement
billing
user endpoints
enterprise identity
enterprise applications
normal internet access

Allowed:

read replicated operational data through approved reporting systems
submit work orders and maintenance requests
receive sanitized reports from OT historian replicas
send approved logs to the SOC

Blocked:

direct access to HMIs
direct access to SCADA servers
direct access to engineering workstations
direct access to PLCs, RTUs, relays, and safety systems
direct RDP, SMB, SSH, database, or industrial protocol access into OT

Public services DMZ

Contains public-facing services such as:

citizen portals
external websites
public APIs
DNS and email gateways
WAF and DDoS protection

Rules:

no direct trust into OT
no shared credentials with OT
no live control data access
strong monitoring and rate limiting
clear incident isolation plan

OT DMZ

The OT DMZ is the controlled exchange zone.

Use it for:

historian replication
patch staging
antivirus or EDR update relay
secure file transfer
vendor package inspection
jump access mediation
log relay
time synchronization relay where appropriate
controlled data broker services

Do not use the OT DMZ as a flat bridge.

Do not place live control systems in the OT DMZ.

OT operations zone

Contains:

SCADA servers
EMS/DMS
local historian
operator services
OT domain services where used
engineering workstations
OT management services

Rules:

no direct access from enterprise users
no uncontrolled internet access
engineering activity logged
remote access mediated by jump host
internal segmentation to lower-level control cells
backup and restore tested

Control cell zones

Each process cell, substation, plant unit, or control function should be segmented.

Examples:

generation unit control cell
substation automation cell
protection relay cell
water treatment dosing cell
pump station cell
turbine control cell
distribution automation cell
safety system zone

The purpose is blast-radius reduction.

A compromise in one control cell should not automatically expose the whole national infrastructure.

Safety and protection zones

Safety systems and protection systems require stronger isolation.

Controls:

restrict inbound writes
require approved engineering stations
use physical separation or one-way transfer where feasible
require stronger change approval
monitor logic and setting changes
keep offline backups
validate restore procedures
document safety impact before changes

For power environments, protection relays and substation automation deserve special attention because misconfiguration can affect grid stability.

4. Conduits and allowed flows

Every zone-to-zone connection is a conduit.

Every conduit needs:

source
destination
protocol
direction
business purpose
owner
approval record
logging requirement
review frequency
emergency shutdown process

Example: enterprise reporting

Good flow:

Corporate analyst
-> enterprise reporting application
-> OT DMZ historian replica
-> read-only replicated data

Bad flow:

Corporate analyst
-> live historian
-> SCADA server
-> controller network

Example: patching

Good flow:

vendor update source
-> enterprise download area
-> malware and integrity inspection
-> OT DMZ patch staging
-> OT patch server pulls approved package
-> lab or spare asset test
-> production maintenance window

Bad flow:

production HMI
-> direct internet update

Example: vendor support

Good flow:

vendor engineer
-> MFA remote access portal
-> approved time-bound ticket
-> recorded OT jump host
-> named engineering workstation
-> named target asset

Bad flow:

vendor VPN
-> broad OT subnet
-> direct Level 2 or Level 1 access

Example: SOC monitoring

Good flow:

OT sensors and logs
-> OT log relay
-> SIEM or security data lake
-> SOC investigation

Bad flow:

enterprise SOC tool
-> interactive management session into controllers

Monitoring should observe OT.

It should not accidentally become a control path.

5. Power grid reference mapping

For power-sector readers, the same architecture maps to common grid functions.

Generation

Secure:

turbine control
boiler or plant control
excitation systems
balance-of-plant systems
safety and protection systems
generation plant historian
engineering workstations

Key controls:

isolate generation control cells
restrict engineering access
protect local HMIs
test restoration of control projects
monitor controller writes and workstation access

Transmission control center

Secure:

EMS
SCADA front-end processors
ICCP or inter-control-center communication
telemetry systems
operator consoles
historian
network management systems

Key controls:

restrict external data exchange
segment EMS support systems
monitor operator and engineering access
protect control center identity
preserve manual and contingency procedures

Substations

Secure:

RTUs
IEDs
protection relays
station gateways
engineering ports
serial-to-IP converters
telecom routers
local HMIs

Key controls:

segment substations from corporate networks
restrict relay setting changes
monitor firmware and setting changes
remove unmanaged cellular modems
protect physical access and cabinets
collect logs where technically feasible

Distribution

Secure:

DMS
feeder automation
reclosers
capacitor banks
outage management integration
field crew access
AMI/MDMS dependencies where applicable

Key controls:

separate operational control from customer/billing systems
govern field access
segment AMI-related systems from core control
monitor remote switching and automation commands

6. Identity and privileged access architecture

Identity architecture must prevent enterprise compromise from becoming OT control.

Recommended model:

separate OT identity boundary where feasible
controlled federation only where required
MFA for remote and privileged access
named engineer accounts
named vendor accounts
PAM for privileged sessions
break-glass accounts vaulted and monitored
service accounts documented and reviewed
local legacy accounts vaulted with compensating controls

Minimum rules:

no shared engineering account for routine work
no shared vendor account
no standing vendor access without justification
automatic expiry for vendor sessions
quarterly privileged access review
monthly vendor access review
alert on access outside approved window

Break-glass design

Break-glass access is necessary, but dangerous if unmanaged.

Required controls:

unique account
strong vaulting
offline access procedure
dual approval where feasible
monitoring and alerting
post-use review
password rotation after use
tabletop exercise to confirm it works

7. Remote access architecture

Remote access should follow this pattern.

user identity verification
-> MFA
-> device posture check where feasible
-> approval ticket
-> time-bound access
-> privileged access broker
-> recorded jump session
-> named OT asset
-> automatic termination
-> review and evidence retention

Required controls:

MFA
named users
least privilege
session recording
file transfer control
clipboard restriction for high-risk sessions
no direct internet exposure of RDP, VNC, SSH, HMI, or PLC interfaces
no unmanaged vendor tools
source restrictions where feasible
emergency access procedure
monthly vendor access review
alerting outside approved window

Failure mode to prevent:

vendor VPN connected
-> broad OT subnet access
-> compromised vendor laptop scans OT
-> attacker reaches engineering workstation
-> attacker reaches controller network

The fix is narrow, approved, recorded access to specific assets only.

8. Network security architecture

Enterprise to OT boundary

Controls:

default deny
explicit allow rules only
no any-to-any
no industrial protocols from enterprise
no direct database access to live OT systems
proxy or broker for approved services
malware inspection for file transfer
logging enabled
quarterly rule review
emergency block procedure

Internal OT segmentation

Controls:

segment by Purdue level and process cell
separate safety systems from basic control
separate engineering workstations from operator HMIs where feasible
separate remote sites and substations
use industrial firewalls or switch ACLs
restrict write-capable protocols
monitor east-west traffic
manage broadcast exposure

Wireless and cellular

Controls:

inventory all wireless bridges, access points, and modems
disable unapproved wireless
use strong authentication and encryption where wireless is required
segment wireless access
monitor rogue access points
remove undocumented 4G or 5G maintenance modems
include wireless in physical inspections

Undocumented connectivity is one of the fastest ways to invalidate a good architecture.

9. Cloud, IIoT, and analytics

Cloud may be useful for reporting, analytics, predictive maintenance, and fleet visibility.

Cloud must not become an uncontrolled control path.

Minimum rules:

replicate data outward through controlled gateways
prefer one-way or brokered data flows for high-criticality systems
do not expose controllers to cloud services
do not allow cloud identity compromise to control OT
encrypt data in transit and at rest
use private connectivity where feasible
use least-privilege service accounts
monitor cloud API access
classify data before export
respect data sovereignty
keep local control independent from cloud availability

Critical national systems must remain operable if cloud services are unavailable.

10. Security monitoring architecture

Deploy an OT monitoring zone.

Collect through:

SPAN ports
network TAPs
packet brokers
OT sensor appliances
industrial switch mirrors

Log sources:

OT firewalls
remote access portal
jump hosts
VPN
PAM
OT identity
Windows HMIs and servers
engineering workstations
Linux servers
OT IDS sensors
industrial switches
backup systems
physical access systems
change management
vulnerability platform
cloud analytics platform

Correlations that matter:

vendor login outside window plus engineering protocol traffic
badge access after hours plus privileged login
new host in OT VLAN plus PLC discovery
HMI service restart plus new executable
firewall rule change plus new IT-to-OT flow
relay setting change outside approved window

11. Evidence required before architecture approval

Do not approve the architecture on a diagram alone.

Required evidence:

current network diagrams
zone and conduit register
IT/OT flow register
remote access register
firewall rule export
identity architecture
vendor access procedure
asset inventory
backup and restore evidence
monitoring coverage map
incident response contact roster
emergency isolation procedure
exception register
business owner approval
OT engineering approval
safety owner approval

The strongest architecture is the one that can be operated, audited, and restored.

12. Approval checklist

Approve the design only when these statements are true.

Critical processes are identified.
IT and OT are segmented.
OT DMZ exists and is not a flat bridge.
Enterprise users cannot directly reach controllers.
Vendor access is MFA-protected, time-bound, recorded, and approved.
Engineering workstations are controlled.
Control cells are segmented.
Safety and protection systems have stronger restrictions.
Logs flow to monitoring without creating a control path.
Backups are offline or immutable and tested.
Cloud flows are controlled and non-critical to local operation.
IT identity compromise cannot automatically become OT control.
Emergency isolation mode is designed and tested.
Firewall rules have owners and review dates.
Exceptions expire and have compensating controls.

Takeaway

A state-owned ICS architecture should be boring by design.

No shortcuts.

No broad routes.

No unmanaged vendor paths.

No direct enterprise-to-controller access.

No undocumented modems.

No uncontrolled cloud dependency.

Every sensitive action should pass through an approved, monitored, and reversible path.

That is how architecture becomes national resilience.

Now, let's move to Part-4: Tools, Technologies, and Control Implementation Catalog"

Securing State-Owned ICS (Part 2): National Risk, Threat Landscape, and the First 30 Days

Mike Anderson — Tue, 14 Jul 2026 12:34:31 +0000

Previous Series: Part 1: Executive Briefing

State-owned Industrial Control Systems are not just technology assets.

They are national life-support systems.

A power grid keeps hospitals alive. A water treatment system protects public health. A railway control system keeps people moving. A port supports food, fuel, medicine, and trade. A national refinery, pipeline, telecom facility, or defense-linked manufacturing plant can affect the stability of an entire country.

That is why ICS cybersecurity must be designed differently from normal enterprise cybersecurity.

In enterprise IT, a cyber incident can disrupt email, billing, HR, customer portals, or data processing.

In ICS, the same level of compromise can affect:

electricity generation and transmission
water pressure and chemical dosing
substations and protection systems
transport signaling
refinery and pipeline safety
manufacturing continuity
environmental controls
worker safety
national confidence

This series is written for CISOs, security architects, SOC teams, OT engineers, regulators, government leaders, and non-technical executives across South Asia, Southeast Asia, and any nation operating critical infrastructure.

The goal is not to sell fear.

The goal is to provide a serious, implementable blueprint.

One important correction before we begin:

No credible security architect should promise a "hackproof" ICS system.

The realistic target is stronger and more honest:

Consequence-resilient ICS security: make compromise difficult, detect abnormal behavior early, prevent cyber activity from becoming unsafe physical impact, and recover essential services under pressure.

This is how a nation protects the systems that keep society running.

Executive summary for non-technical leaders

A state-owned ICS security program should answer seven questions.

What national services must never fail?
Which systems control those services?
Who can access those systems?
Which IT, vendor, cloud, and remote paths can reach OT?
How would we detect unauthorized control activity?
How would we contain an attack without harming the physical process?
Can we restore critical control functions from trusted backups?

If leadership cannot get clear answers to those questions, the program is not mature.

The first objective is not to buy more tools.

The first objective is to establish ownership, asset truth, safe architecture, controlled access, monitoring, response, and recovery.

1. Why state-owned ICS is different

Industrial Control Systems include the hardware, software, networks, and procedures used to monitor or control physical processes.

Examples include:

SCADA systems
Energy Management Systems
Distribution Management Systems
substations and protection relays
PLCs and RTUs
HMIs and operator stations
engineering workstations
historians
industrial network devices
safety systems
field sensors and actuators

The difference between IT and OT is simple.

IT protects data, users, and business services.

OT protects physical processes.

That changes the security model.

In IT

You can often isolate a compromised laptop quickly.

You can patch aggressively.

You can force password resets across large groups.

You can rebuild standard systems from images.

You can tolerate some business disruption.

In OT

You may not be able to isolate a device without understanding what it controls.

You may not be able to patch without vendor validation and a maintenance window.

You may not be able to force password changes during a live operational event.

You may not be able to reboot an HMI that operators need for visibility.

You may not be able to scan controllers without risk.

The rule is:

Protect safety and control first. Reduce cyber risk through planned, tested, reversible controls.

That is not an excuse for weak security.

It is the engineering discipline required to secure physical infrastructure.

2. The national threat landscape

Modern ICS threats are not one-dimensional.

They combine identity compromise, remote access abuse, supply chain exposure, IT/OT convergence, cloud dependency, insecure legacy protocols, ransomware, and nation-state pre-positioning.

Nation-state campaigns

Nation-state actors may seek espionage, strategic access, disruption capability, or coercive leverage.

For critical infrastructure, the most dangerous behavior is often quiet pre-positioning.

The attacker may not immediately disrupt anything.

They may:

compromise an IT account
reach a vendor portal
map the OT environment
collect engineering project files
observe operator behavior
learn backup and recovery processes
identify weak substations, plants, or remote sites
prepare access for a future geopolitical crisis

The danger is not only the initial breach.

The danger is that the attacker learns how the system operates.

Ransomware and criminal groups

Ransomware can create national disruption even if PLCs are not directly encrypted.

An attacker can affect operations by encrypting or disabling:

domain controllers
engineering workstations
HMI servers
historians
file shares
backup systems
dispatch systems
billing and market systems
remote access infrastructure
maintenance documentation

For a power utility, losing enterprise IT may still affect outage management, crew dispatch, procurement, communications, market settlement, and reporting.

For a water utility, losing billing may be tolerable for a few days. Losing HMI visibility, chemical dosing records, or engineering backups is a different class of risk.

Hacktivists

Hacktivists often seek visibility and political impact.

Their attacks may include:

DDoS
website defacement
credential leaks
exposed camera or HMI access
abuse of publicly reachable OT devices
social media amplification

Even unsophisticated attacks can create public panic if the target is a national utility.

Insider and contractor risk

Insider risk is not always malicious.

Common real-world patterns include:

an engineer using a shared password because work is urgent
a vendor leaving a remote support tunnel enabled
a contractor connecting an unmanaged laptop
a plant team bypassing change control to restore service quickly
a temporary firewall rule becoming permanent
a cellular modem installed for convenience and forgotten

In ICS, convenience often becomes the attack path.

Supply chain compromise

ICS environments depend on vendors and integrators.

The supply chain includes:

PLC and RTU vendors
relay vendors
SCADA and HMI software providers
engineering workstation tools
remote access vendors
system integrators
patch sources
firmware packages
cloud analytics platforms
managed service providers
maintenance contractors

A weak supplier can become the entry point into a strong facility.

Secure procurement is therefore not a back-office activity.

It is a frontline security control.

3. The most common national ICS failure modes

Most state-owned ICS environments are not weak because people are careless.

They are weak because the environment evolved over decades, production stability was prioritized, and cybersecurity was added later.

Failure mode 1: the air-gap myth

Many organizations still believe the OT network is isolated.

In practice, OT often has more connections than leadership realizes:

historian replication to enterprise IT
vendor VPN
dual-homed engineering workstation
cellular modem
temporary project link
remote support tool
shared Active Directory
cloud reporting connector
USB transfer process
contractor laptop
unmanaged wireless bridge

A network is only isolated when every data path is known, controlled, monitored, and tested.

If nobody can prove isolation, assume there is a path.

Failure mode 2: flat OT networks

Flat networks allow one compromised device to reach many others.

A flat OT network can allow an attacker to move from:

corporate workstation
-> historian
-> HMI
-> engineering workstation
-> PLC or relay network

That path should not exist.

Segmentation must reduce blast radius by process, site, function, and consequence.

Failure mode 3: unmanaged remote access

Remote access is usually one of the highest-risk OT paths.

Common weaknesses:

shared vendor accounts
no MFA
VPN landing directly into OT
no session recording
no approval window
no asset-specific targeting
no source restriction
access left enabled after support contract ends
vendor laptop health unknown
jump host bypass

Remote access should be treated as a privileged operational event, not a convenience feature.

Failure mode 4: weak identity boundaries

If corporate identity compromise gives an attacker OT access, the architecture is too tightly coupled.

IT and OT identity can integrate, but the integration must be controlled.

A compromised email account should not become a controller access path.

Failure mode 5: no asset truth

If a utility does not know its assets, it cannot manage vulnerability, patching, monitoring, incident response, procurement, or recovery.

The inventory must cover:

asset name
location
process served
vendor and model
firmware or software version
IP address or communication identity
Purdue level or security zone
owner
support status
criticality
backup status
remote access dependency
known vulnerabilities
communication flows

Passive discovery helps, but engineering validation is still required.

Failure mode 6: IT incident response applied blindly to OT

Traditional incident response often says:

isolate first.

In OT, that can be unsafe.

The better rule is:

isolate the attacker without destabilizing the process.

That requires pre-approved containment options, OT engineers on the bridge, process-aware severity, and safety-led decision making.

4. Where to start: the first 30 days

Do not start with a large transformation program.

Start with control of the basics.

Day 1 decision: appoint the accountable owner

A national ICS security program needs clear ownership.

At minimum:

executive sponsor
CISO or national cyber lead
OT operations owner
plant or site owner
safety owner
engineering owner
SOC owner
vendor management owner
legal and regulatory contact
communications owner

If nobody owns the risk, nobody owns the remediation.

Days 1-7: identify critical processes

Start with consequences.

Ask:

Which services must continue during national crisis?
Which process failures can harm people?
Which assets support hospitals, defense, telecom, ports, finance, or emergency services?
Which sites would create national impact if unavailable?
Which control functions are needed for black-start, safe shutdown, manual operation, or restoration?

Output:

critical process list
top national service dependencies
crown-jewel OT assets
responsible owners

Days 7-15: build the first asset truth

Start with the highest-consequence sites.

Collect:

network diagrams
firewall exports
switch MAC tables
passive discovery output
engineering workstation project files
vendor asset lists
backup inventories
controller lists
historian connection lists
remote access records

Validate through plant walkdowns.

Output:

critical asset inventory
unknown asset list
unsupported asset list
remote access dependency list

Days 15-21: map all IT/OT and vendor paths

Document:

enterprise-to-OT flows
OT-to-enterprise flows
vendor access paths
cloud or analytics connections
historian replication
patch flows
file transfer processes
logging flows
backup flows
identity dependencies
emergency access paths
wireless and cellular links

Every flow needs:

source
destination
protocol
direction
purpose
owner
approval
logging
review frequency

Output:

IT/OT connectivity map
unauthorized or unexplained path list
emergency block options

Days 21-30: remove unacceptable risk

Start with the risks that should never exist.

Priority removals:

direct internet access to controllers, HMIs, VNC, RDP, SSH, or engineering services
direct enterprise access to Level 1 controller networks
shared vendor accounts
always-on vendor VPN
dual-homed engineering workstation bridging IT and OT
default credentials on critical assets
broad "any-to-any" firewall rules between IT and OT
unsupported remote access tools
unapproved cellular modems
backups that cannot be restored

Output:

first remediation backlog
emergency exceptions
named owners
30-day leadership briefing

5. National governance model

State-owned ICS security cannot be solved only at plant level.

It needs national, enterprise, and facility governance.

National or regulator level

Responsibilities:

classify critical infrastructure
define minimum OT cybersecurity baseline
require incident reporting
create or strengthen national OT-CERT capability
coordinate sector threat intelligence
define secure procurement expectations
run national exercises
support workforce development
coordinate cross-border dependencies where relevant

Enterprise or utility level

Responsibilities:

fund and operate the ICS security program
approve architecture
maintain risk register
run SOC capability
manage vendors
enforce standards
report to board, ministry, or regulator
own incident response and recovery

Facility or plant level

Responsibilities:

maintain safe operations
approve operationally sensitive changes
validate asset inventory
support monitoring
own local containment decisions
maintain backups
participate in exercises
report gaps and exceptions

The governance model must respect a simple truth:

Cybersecurity cannot override process safety.

Process safety cannot ignore cybersecurity.

They must operate together.

6. Standards that should anchor the program

Use standards to drive implementation, not paperwork.

Recommended anchors:

NIST SP 800-82 Rev. 3 for OT security guidance.
ISA/IEC 62443 for zones, conduits, security levels, and IACS security lifecycle.
CISA Cross-Sector Cybersecurity Performance Goals for baseline critical infrastructure practices.
MITRE ATT&CK for ICS for adversary behavior, threat modeling, detection, and exercises.
ISO 27001 where a formal information security management system is required.
National sector regulations where applicable.
NERC CIP principles for power-sector organizations where relevant or used as a benchmark outside North America.

The key is mapping standards to enforcement points.

Example:

Requirement: control remote access
Enforcement: MFA, PAM, jump host, session recording, approval workflow, time-bound access
Evidence: access logs, session recordings, monthly vendor review, approved tickets

If a standard does not map to an enforcement point and evidence artifact, it becomes paperwork.

7. First maturity score

Use this quick maturity view.

Level 1: Reactive

incomplete asset inventory
flat network
always-on vendor access
weak logging
no OT-specific incident playbooks
backups not tested
change control inconsistent

Level 2: Basic control

critical assets identified
IT/OT firewall exists
some remote access control
basic logging from jump hosts and servers
backups exist and some restores tested
vulnerability tracking started

Level 3: Managed

zones and conduits documented
vendor access MFA-protected and recorded
passive monitoring in critical zones
firewall rules reviewed quarterly
OT incident playbooks exercised
critical backups tested
vulnerabilities prioritized by consequence

Level 4: Resilient

critical services have tested recovery plans
unauthorized control behavior is detectable
engineering changes require approval and evidence
threat modeling drives investment
SOC, OT, engineering, safety, and leadership operate together
national CERT or sector sharing is integrated

Level 5: Adaptive

detections continuously validated
purple team exercises test realistic attack paths
AI assists triage and reporting under governance
procurement enforces secure-by-design requirements
resilience metrics are reported to national leadership
organization can operate safely under degraded cyber conditions

Most organizations should target Level 3 first.

Level 4 is the right ambition for critical national services.

Level 5 is a strategic national capability.

8. What good looks like after 30 days

After 30 days, leadership should be able to say:

We know our highest-consequence services.
We know the critical assets that support them.
We know the major IT/OT and vendor access paths.
We have removed or contained the most dangerous access paths.
We have a named owner for each major risk.
We have started passive visibility in the highest-criticality environment.
We have an OT incident contact roster.
We know whether critical backups exist and whether they have been tested.

That is not the end state.

It is the first serious control point.

Takeaway

State-owned ICS cybersecurity is a national resilience mission.

Start with consequence.

Build asset truth.

Control access.

Segment critical paths.

Monitor behavior.

Prepare safe response.

Test recovery.

Govern vendors.

Measure progress.

The strongest national infrastructure programs are not the ones claiming to be hackproof.

They are the ones that can keep essential services running when pressure is highest.

Now Let's move to Part-3:Securing State-Owned ICS: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Executive Brief: State-Owned ICS Cybersecurity Blueprint, a five part series journey, Part-1

Mike Anderson — Tue, 14 Jul 2026 12:32:21 +0000

Part 2: National Risk, Threat Landscape, and the First 30 Days

Part-3: Target Architecture for IT, OT, Cloud, and Power Grid Environments

Part-4: Tools, Technologies, and Control Implementation Catalog

Part-5: SOC, Detection, Incident Response, Resilience, and Exercises

Part-6: AI, Governance, Procurement, and the 180-Day National Roadmap

Part-7: State-Owned ICS Cybersecurity Blueprint

Executive Brief: State-Owned ICS Cybersecurity Blueprint

State-owned Industrial Control Systems support national services such as electricity, water, transport, energy, ports, and public-sector process operations.

A cyber incident in these environments is not only an IT outage.

It can become a public safety, economic stability, environmental, and national security event.

The goal is not to claim that critical infrastructure can be made hackproof.

The correct objective is:

Consequence-resilient security: prevent likely attacks, detect abnormal behavior early, contain safely, recover essential services, and prove control maturity through evidence.

The five leadership questions

A national ICS security program should be able to answer five questions clearly.

Which national services are most critical?
Which assets control those services?
Who can access those assets, including vendors?
How would we detect unauthorized control activity?
Can we restore critical control functions under incident conditions?

If the answer to any of these is unclear, the program has material risk.

The highest-risk gaps

The most common critical gaps are:

incomplete OT asset inventory
flat IT/OT or flat OT networks
direct enterprise access to OT systems
always-on vendor VPN
shared engineering or vendor accounts
no MFA for remote access
no tested backups for controllers, HMIs, and engineering workstations
no OT-specific incident response playbooks
no monitoring for unauthorized control activity
procurement of systems that cannot be secured, monitored, patched, or recovered

These are not paperwork issues.

They are attack paths.

The minimum national baseline

Every state-owned critical ICS environment should have:

accountable OT cyber owner
critical process inventory
critical asset inventory
documented IT/OT data flows
OT DMZ
segmented control zones
MFA-protected and recorded remote access
vendor access approval workflow
separate or controlled OT identity boundary
engineering workstation hardening
controller and relay change control
passive OT monitoring
SIEM integration
tested backups
OT incident response playbooks
vulnerability management by consequence
secure procurement requirements
annual exercises
executive risk reporting

What leadership should fund first

If the budget is limited, fund in this order:

asset inventory and network visibility
remote access control
IT/OT segmentation
backups and restore testing
detection for unauthorized control activity
engineering workstation hardening
vendor governance
OT incident response exercises
vulnerability management
AI-assisted triage and reporting after core controls mature

AI should not be funded before the basics are controlled.

What good looks like

A mature program can say:

We know our critical services and assets.
Enterprise IT cannot directly reach controllers.
Vendor access is MFA-protected, approved, recorded, and time-bound.
Critical control cells are segmented.
Unauthorized controller writes are detectable.
Backups for priority control functions are tested.
OT incident response has been exercised.
Remaining risk is documented, owned, and funded.

That is the leadership outcome.

To be continued...

Laet's explore Part-2

ICS Cybersecurity A–Z (Part 2): Operations, Incident Response, and Threat Modeling

Mike Anderson — Tue, 14 Jul 2026 08:43:33 +0000

Part 1 covered the foundation: segmentation, hardening, monitoring, and safe assessment.

That is where many ICS security programs stop.

They build a good diagram, deploy a monitoring tool, fix some obvious gaps, and then assume the environment is secure.

It is not.

An ICS security program only becomes real when the organization can operate it every day, detect abnormal behavior, respond without harming the process, recover from compromise, and explain risk clearly to leadership.

Part 2 is about that operating layer.

We will cover:

Steady-state OT cyber operations
Patch, change, access, and backup discipline
Incident response for control environments
Large-scale or nation-state campaign preparation
Legal and operational boundaries for active defense
Threat modeling with MITRE ATT&CK for ICS
Metrics that matter to leadership
A practical 30/60/90-day improvement plan

The goal is simple:

Keep essential services running while reducing the chance that a cyber event becomes a safety, reliability, or public-impact event.

1. Steady-state OT cyber operations

Security operations in ICS must be predictable.

The worst OT security programs are reactive. They patch after a breach, block traffic during an outage, or discover vendor accounts only after suspicious access appears.

A mature program has an operating rhythm.

Operating rhythm

Daily

Activity: review high-severity OT alerts and remote access activity.
Owner: SOC / OT security.
Evidence: alert notes and triage decisions.

Weekly

Activity: review new assets, new network flows, failed logins, and monitoring gaps.
Owner: OT security / network team.
Evidence: asset deltas and flow review.

Monthly

Activity: review privileged and vendor accounts.
Owner: IAM / OT owner.
Evidence: access review sign-off.

Monthly

Activity: validate backup job status and offline backup inventory.
Owner: OT engineering.
Evidence: backup report and sample restore evidence.

Quarterly

Activity: review firewall rules and segmentation exceptions.
Owner: network / OT security.
Evidence: rule review export and exception decisions.

Quarterly

Activity: patch planning and vendor advisory review.
Owner: OT engineering / vendor manager.
Evidence: patch plan and risk acceptance.

Semiannual

Activity: restore test for critical HMI, historian, and PLC logic backups.
Owner: OT engineering.
Evidence: restore test results.

Annual

Activity: OT tabletop exercise and incident response test.
Owner: CISO / plant leadership.
Evidence: exercise report and improvement plan.

This rhythm matters because ICS risk accumulates quietly.

A temporary vendor account becomes permanent. A firewall exception created during commissioning is never removed. A spare HMI misses patches for two years. A controller is replaced but never added to inventory.

The operating model catches these before they become attack paths.

2. Patch and change management

Patching in OT is not the same as patching laptops.

You still need vulnerability management, but the process must account for vendor support, process uptime, safety impact, and rollback.

A practical OT patch process:

Track vendor advisories and CISA ICS advisories for products in your environment.
Identify affected assets from the OT inventory.
Classify risk by exploitability, exposure, process criticality, and compensating controls.
Test patches on a spare, lab system, or non-critical asset first.
Confirm vendor support for the patch level.
Schedule the change with operations.
Confirm backups and rollback.
Apply during an approved window.
Monitor the asset and process after the change.
Retain evidence.

Patch priority should not be based only on CVSS.

For ICS, prioritize:

Internet-exposed OT services.
Remote access infrastructure.
Engineering workstations.
HMIs with broad process visibility.
Historians bridging IT and OT.
Vulnerabilities with known exploitation.
Vulnerabilities that allow authentication bypass, remote code execution, or unauthorized control modification.
Assets with weak segmentation or no compensating controls.

A medium CVSS issue on a directly reachable HMI may matter more than a critical issue on an isolated spare asset.

Change control is a security control

Every OT change should answer:

What is changing?
Which process could be affected?
Who approved it?
What is the rollback plan?
What evidence will prove the change was successful?
What monitoring is required after the change?
What is the stop condition?

This applies to firewall rules, controller logic, HMI software, remote access, patches, sensor changes, and vendor maintenance.

Poor change control is one of the most common causes of self-inflicted OT incidents.

3. Account and access hygiene

Identity is often the weak point in OT.

Many environments still have shared operator accounts, old vendor accounts, unmanaged local administrators, and remote access paths that were created for commissioning and never removed.

Minimum controls:

MFA for remote access and privileged access where supported.
Unique named accounts for engineers and vendors.
No standing vendor access unless formally justified.
Monthly vendor account review.
Quarterly privileged access review.
Break-glass accounts protected, monitored, and tested.
Password vaulting for shared emergency credentials.
Session recording for privileged remote access.
Disable accounts immediately when engineers, contractors, or vendors leave.
Restrict engineering tool access to approved workstations.

The real risk is not only credential theft. It is loss of accountability.

If five people use the same engineering account, you cannot reliably determine who changed a PLC program, acknowledged an alarm, or exported a project file.

4. Backup and recovery testing

Backups are not a compliance artifact in ICS. They are a control for public safety and service continuity.

You need backups for:

HMI images
Engineering workstation images
SCADA server configuration
Historian configuration
PLC and RTU logic
Network device configuration
Firewall rules
Switch configuration
Remote access gateway configuration
License keys and vendor installation media
Offline documentation needed for recovery

The minimum standard is:

Keep offline or immutable copies.
Keep at least one copy physically or logically disconnected.
Test restores, not only backup completion.
Validate PLC logic backups on spare hardware where possible.
Record firmware and software version dependencies.
Store recovery procedures where they are accessible during an IT outage.
Include operations in recovery drills.

The question leadership should ask is not:

"Do we have backups?"

The better question is:

"Can we restore the most critical process-control function from known-good backups under incident conditions?"

5. OT incident response: containment without causing harm

Traditional incident response often says isolate the host quickly.

That may be right for an office laptop.

It may be dangerous for a controller, HMI, historian, safety system, or engineering workstation involved in a live process.

OT incident response must be safety-led.

Preparation

Before an incident, create:

OT incident response plan.
Plant contact roster.
OT asset inventory with process criticality.
Network diagrams and trust boundaries.
Communication plan if email and VoIP are down.
Forensic collection procedure.
Escalation path to safety, legal, privacy, executives, and sector authorities.
Pre-approved containment options.
Emergency change process.
Out-of-band communication method.
Tabletop exercise schedule.

Triage questions

When an alert fires, ask:

Which asset is affected?
What physical process does it support?
Is the process stable?
Is this read traffic, write traffic, authentication activity, or logic change?
Is the activity inside an approved change window?
Is a vendor currently authorized to connect?
Can containment disrupt safety or availability?
What evidence must be preserved before action?
Who has authority to approve containment?
What rollback is available?

These questions prevent the SOC from making a technically correct but operationally unsafe decision.

Severity model

SEV-1 Critical

OT example: active unauthorized control action, ransomware affecting operations, confirmed manipulation of controller logic, or safety impact.
Response expectation: activate crisis management, OT IR, legal, executive leadership, and the safety owner.

SEV-2 High

OT example: compromised engineering workstation, unauthorized PLC write attempt, vendor account compromise, or malware on HMI.
Response expectation: immediate OT/security bridge and containment plan approved by the OT owner.

SEV-3 Medium

OT example: suspicious scan, failed logins, unauthorized device detected, or policy violation.
Response expectation: same-day investigation and corrective action.

SEV-4 Low

OT example: false positive, benign misconfiguration, or informational monitoring gap.
Response expectation: track and tune through the normal backlog.

Containment options

Containment should be targeted.

Containment decision guide

Suspicious laptop in OT VLAN

Safer option: disable or quarantine the switch port for that endpoint.
Avoid unless approved: shutting down the entire switch.

Enterprise-to-OT attack path

Safer option: block the enterprise-side route or disable the jump path.
Avoid unless approved: disrupting Level 2 or Level 1 communications.

Compromised vendor account

Safer option: disable the account and terminate the active session.
Avoid unless approved: broad password resets during live operations without a plan.

Unauthorized PLC write source

Safer option: block the source IP at the cell firewall or switch ACL.
Avoid unless approved: remote stop command to PLC.

Malware on HMI

Safer option: move operator function to standby HMI, then isolate the affected host.
Avoid unless approved: killing the HMI process during active operation.

Suspected logic change

Safer option: compare logic to known-good backup and involve a controls engineer.
Avoid unless approved: reloading logic without process validation.

The principle is:

Isolate the attacker, not the process.

Eradication and recovery

Recovery actions may include:

Reimage HMIs or engineering workstations from golden images.
Reload controller logic from known-good backups.
Validate checksums or vendor integrity indicators.
Reset credentials in the affected zone.
Patch the entry point.
Review firewall and remote access logs.
Monitor for recurrence.
Reconnect enterprise paths only after the route of compromise is understood and controlled.

Do not reconnect because "production needs it" without understanding the attack path. That is how reinfection happens.

Post-incident review

Within 72 hours for major incidents, document:

Timeline.
Initial access vector.
Affected assets.
Physical process impact.
Containment decisions and approvals.
Evidence collected.
Root cause.
Controls that worked.
Controls that failed.
Detection gaps.
Remediation owners and due dates.
Leadership summary.

The best incident reviews are not blame exercises. They are control improvement sessions.

6. Preparing for large-scale or nation-state campaigns

Some OT incidents are not isolated.

Utilities, transportation, energy, manufacturing, and public-sector operators may face coordinated campaigns involving destructive malware, credential attacks, vendor compromise, DDoS, influence operations, and physical security pressure.

The objective during a large-scale campaign is not perfect investigation.

The objective is continuity of essential service, safe operation, evidence preservation, and coordinated defense.

Before the campaign

Prepare:

Sector threat intelligence relationships.
Membership or contact path with relevant ISAC/ISAO.
National CERT reporting path.
Pre-approved emergency firewall rules.
Pre-approved remote access shutdown procedure.
Emergency vendor contact list.
"Island mode" or isolation plan where operationally feasible.
Manual operation procedures.
Spare hardware and offline media.
Out-of-band communications.
Executive decision matrix.

Emergency isolation must be practiced. A plan that exists only in a PDF is not a plan.

During the campaign

Actions may include:

Activate the OT incident response bridge.
Confirm process stability with operations.
Increase monitoring on remote access, firewalls, DNS, identity, and OT sensors.
Disable non-essential vendor access.
Restrict internet-facing OT DMZ services.
Apply pre-approved blocks for confirmed malicious infrastructure.
Validate backups and standby systems.
Monitor for unauthorized engineering protocol use.
Share indicators with your ISAC or national CERT.
Preserve evidence for law enforcement and sector response.

Be careful with blanket actions.

"Force every privileged password change immediately" may be appropriate in some cases, but in OT it must be planned so you do not lock out operators, break services, or lose access to legacy systems during a crisis.

Defensive countermeasures

Private organizations can and should defend their own environment.

Appropriate defensive actions include:

Blocking known malicious IPs and domains at your perimeter.
Sinkholing malicious domains inside your own DNS environment when properly authorized.
Null-routing attacker infrastructure at your boundary.
Disabling compromised accounts.
Quarantining affected endpoints.
Deploying honeytokens or decoy shares.
Increasing detection sensitivity for known TTPs.
Sharing indicators with trusted sector partners.
Supporting takedown efforts through proper legal and provider channels.

No hack back

Private-sector teams should not conduct offensive retaliation.

Do not access attacker infrastructure, modify external systems, steal data back, or launch counterattacks.

The right role is:

Stop the attack inside your boundary.
Preserve forensic evidence.
Share indicators quickly.
Support law enforcement, CERT, regulators, and sector response bodies.
Maintain safe operations.

That is how private organizations contribute to national defense without creating legal, diplomatic, or operational risk.

7. Threat modeling: prioritize what actually reduces risk

Threat modeling prevents random security spending.

For ICS, use the process and the attack path together.

A practical model includes:

Critical process.
Assets supporting that process.
Trust boundaries.
Possible attacker entry points.
Attack path from IT or remote access to control impact.
Existing controls.
Detection coverage.
Response options.
Control gaps.
Remediation owner.

MITRE ATT&CK for ICS is useful because it maps adversary behavior to ICS-specific tactics and techniques.

Example attack path:

Example attack path

Initial access

Example behavior: vendor VPN account compromised.
Expected telemetry: VPN login from unusual location and MFA anomaly.
Required control: MFA, conditional access, and vendor access approval.
Owner: IAM / OT security.

Discovery

Example behavior: attacker scans the OT subnet.
Expected telemetry: new network flows, protocol discovery, and sensor alert.
Required control: segmentation, passive monitoring, and blocked routes.
Owner: network / SOC.

Lateral movement

Example behavior: RDP to engineering workstation.
Expected telemetry: jump host logs and Windows logon events.
Required control: jump host only, no direct RDP, and named accounts.
Owner: infrastructure.

Collection

Example behavior: project files copied.
Expected telemetry: file access logs and unusual archive creation.
Required control: least privilege, monitoring, and data access review.
Owner: OT engineering.

Control manipulation

Example behavior: unauthorized PLC write or logic change.
Expected telemetry: engineering protocol write, controller event, and logic checksum change.
Required control: PLC ACL, change workflow, and alerting.
Owner: OT engineering.

Inhibit response

Example behavior: alarms suppressed or HMI altered.
Expected telemetry: HMI configuration change and alarm state changes.
Required control: HMI change control, backups, and monitoring.
Owner: operations.

This model tells you where to spend money.

If the highest-risk path is vendor VPN to engineering workstation to PLC logic change, then buying another generic IT scanner is not the first priority.

Better priorities may be:

Tighten vendor access.
Remove direct RDP.
Add engineering workstation allowlisting.
Monitor engineering protocol writes.
Implement PLC logic backup and checksum validation.
Exercise the containment process.

8. Metrics leadership should care about

Leadership does not need every alert.

They need a clear view of operational cyber risk.

Good OT security metrics:

Percentage of critical OT assets inventoried: shows whether the team can defend what matters.
Number of direct IT-to-OT flows: measures segmentation risk.
Vendor accounts active outside approved windows: measures third-party access risk.
Percentage of critical HMIs and engineering workstations with tested backups: measures recovery confidence.
Number of unauthorized or unexplained OT protocol write attempts: measures control-layer threat activity.
Critical OT vulnerabilities past SLA with no compensating control: measures unresolved exposure.
Percentage of firewall rules reviewed in the last quarter: measures control hygiene.
Mean time to triage high-severity OT alerts: measures SOC readiness.
Number of successful restore tests: measures resilience.
Open exceptions by age and criticality: measures risk debt.

The best board-level statement is not "we deployed an OT monitoring tool."

A better statement is:

"We have identified 96% of critical OT assets, removed all direct enterprise-to-controller access, reviewed 87% of OT firewall rules this quarter, and validated restore procedures for the three most critical process-control functions. Remaining risk is concentrated in two legacy PLC families and one vendor access path, with remediation owners assigned."

That is operationally meaningful.

9. 30/60/90-day improvement plan

If you are starting from a messy environment, do not try to fix everything at once.

First 30 days: establish visibility and stop obvious risk

Build critical OT asset inventory.
Identify direct IT-to-OT paths.
Review remote vendor access.
Confirm backups exist for critical HMI, SCADA, and PLC assets.
Collect firewall rules and network diagrams.
Start passive monitoring in one critical segment.
Identify default credentials and shared accounts.
Define incident contact roster.

Days 31–60: enforce control points

Remove direct enterprise access to controllers.
Move vendor access behind MFA and jump host.
Review and reduce broad firewall rules.
Implement logging for remote access and jump hosts.
Validate backups through sample restore.
Create initial OT detection use cases.
Establish change approval for PLC logic and firewall changes.
Create exception register.

Days 61–90: operationalize

Run an OT tabletop exercise.
Test incident containment decision flow.
Review privileged access.
Tune monitoring detections.
Build threat model for the most critical process.
Assign remediation backlog by risk.
Report metrics to leadership.
Schedule quarterly control reviews.

This sequence creates measurable progress without destabilizing operations.

Final takeaway

ICS cybersecurity is not about adding IT controls blindly to plant environments.

It is about protecting the physical process by controlling access, reducing unsafe paths, monitoring meaningful behavior, and responding with operational discipline.

The strongest OT programs share a few traits:

They know their assets.
They respect safety and availability.
They segment based on process risk.
They control vendor access.
They monitor controller-relevant behavior.
They test backups.
They practice incident response.
They use threat modeling to prioritize.
They can explain residual risk to leadership.

Start there.

Keep the water flowing, the lights on, and the process under control.

References

ICS Cybersecurity A–Z (Part 1): Architecting, Hardening, and Monitoring SCADA Safely

Mike Anderson — Tue, 14 Jul 2026 08:40:57 +0000

Walk into a water treatment plant control room and the first thing you notice is how calm everything looks.

The pumps are running. The HMI screens show tank levels, valve states, and chemical dosing. The operators know the process. The environment feels stable.

Then you look closer.

An engineering workstation has a shared password taped under the keyboard. A PLC web interface is reachable from the plant VLAN. A vendor VPN lands too close to the control network. The historian talks to business systems through a firewall rule nobody has reviewed in years.

That is the real world of Industrial Control Systems (ICS) security.

In enterprise IT, a bad security decision may create downtime, data loss, or account compromise. In OT, the same decision can affect water pressure, power distribution, chemical dosing, worker safety, environmental impact, and public trust.

This two-part series is written for security engineers, SOC analysts, cloud and infrastructure teams, and IT leaders who need to work with OT teams without breaking the plant.

Part 1 covers the foundation:

SCADA network placement
Purdue-level segmentation
Remote access and vendor access
HMI, server, PLC, and RTU hardening
OT monitoring and detection
Safe vulnerability assessment
Common configuration failures

Part 2 covers operations, incident response, large-scale attack preparation, threat modeling, and leadership metrics.

Assumption: the environment is a production ICS/SCADA network supporting critical infrastructure such as water, power, manufacturing, utilities, or similar process-control operations. Adjust the details for your sector, vendor stack, safety case, and regulatory obligations.

The OT security rule that matters most

In OT, the first question is not:

"Can we secure this?"

The first question is:

"Can we secure this without creating unsafe process behavior?"

That changes the order of operations.

For normal IT, you may patch aggressively, scan broadly, isolate hosts quickly, or force password resets at scale.

For ICS, those actions can break HMI-to-PLC communication, trigger a failsafe, overload a fragile controller, lock out an operator, or interrupt a process that must remain stable.

So the decision rule is:

Preserve safety and control first. Reduce cyber risk through planned, tested, reversible controls.

That does not mean OT should remain insecure. It means the controls must be engineered, not blindly applied.

1. Start with the architecture: where SCADA should live

Before you harden a device or deploy a monitoring tool, you need a network model that defines trust boundaries.

The common reference model is the Purdue Enterprise Reference Architecture. It is not perfect for every modern environment, especially where cloud historians, remote operations, and IIoT platforms are involved, but it remains useful because it separates business IT from control functions.

A practical Purdue-style model looks like this:

Level 5  Enterprise services
         Email, ERP, internet, business applications

Level 4  Business IT
         User workstations, identity, reporting, corporate applications

Level 3.5 OT DMZ
         Historian replication, patch staging, remote access broker,
         file transfer gateway, jump access mediation

Level 3  Site operations
         SCADA servers, engineering workstations, local historians,
         domain services for OT where used

Level 2  Supervisory control
         HMIs, operator stations, local control rooms

Level 1  Basic control
         PLCs, RTUs, IEDs, controllers, intelligent actuators

Level 0  Physical process
         Sensors, pumps, valves, breakers, motors, field equipment

The architecture should enforce three principles.

Principle 1: Enterprise IT must not directly reach controllers

Nothing from Level 4 or Level 5 should directly communicate with PLCs, RTUs, IEDs, or safety controllers.

A ransomware infection on a finance laptop should not be able to discover Modbus, DNP3, EtherNet/IP, S7, or IEC 61850 devices.

Principle 2: Shared services belong in the OT DMZ

The OT DMZ is the controlled exchange zone.

Use it for:

Historian replication
Patch staging
Antivirus or EDR update staging
Secure file transfer
Remote access brokering
Jump host mediation
Log forwarding
Time synchronization relay where appropriate

Do not use it as a flat bridge between IT and OT.

Principle 3: Control commands should only come from authorized control paths

Telemetry can move upward when required. Control commands must be restricted downward to approved systems, approved users, approved ports, and approved operating procedures.

A good design does not only say "firewall between IT and OT." It defines exactly which asset can talk to which asset, on which protocol, for what business reason, with which owner and evidence.

2. Segmentation that actually reduces risk

Segmentation fails when it exists on a diagram but not in enforcement.

The minimum practical design is:

Minimum segmentation design

Enterprise to OT DMZ

Enforcement point: firewall, proxy, or remote access broker.
Allowed flow: corporate analyst reads replicated historian data.
Block: direct RDP, SMB, SSH, database, or PLC protocol access into OT.

OT DMZ to Level 3

Enforcement point: firewall with explicit allow rules.
Allowed flow: patch server pulls approved updates from DMZ staging.
Block: any inbound enterprise-initiated session to SCADA servers.

Level 3 to Level 2

Enforcement point: internal OT firewall or ACL.
Allowed flow: SCADA server communicates with HMIs and local services.
Block: direct workstation-to-HMI access from unrelated zones.

Level 2 to Level 1

Enforcement point: cell/area firewall or industrial switch ACL.
Allowed flow: HMI or SCADA server polls assigned PLCs.
Block: cross-cell traffic, unauthorized engineering access, and broad broadcast exposure.

Vendor access to OT

Enforcement point: VPN, MFA, jump host, and approval workflow.
Allowed flow: vendor connects to one approved engineering workstation during a change window.
Block: direct VPN landing inside Level 2 or Level 1.

A firewall rule should read like an operational decision, not a convenience setting.

Example:

Source:      SCADA-SRV-01
Destination: PLC-WTP-CLARIFIER-01
Protocol:    Modbus TCP/502
Direction:   SCADA to PLC only
Purpose:     Poll clarifier process values
Owner:       OT Operations
Review:      Quarterly
Logging:     Session metadata enabled

Bad rule:

Source:      Any
Destination: OT network
Protocol:    Any
Purpose:     Vendor support

That second rule is not a firewall exception. It is an attack path.

3. Remote access and vendor access

Remote access is one of the highest-risk paths into OT because it combines identity risk, unmanaged endpoints, third-party dependencies, and time pressure during outages.

For production ICS, vendor access should follow this model:

Vendor user
  -> MFA-authenticated remote access portal
  -> Approved time-bound session
  -> Recorded jump host
  -> Named target asset only
  -> OT engineer supervision for high-risk changes

Minimum controls:

MFA for all remote access.
No shared vendor accounts.
Time-bound access approved through a ticket.
Session recording for privileged remote access.
Vendor source restrictions where feasible.
No split tunneling for privileged OT access.
No direct VPN route to PLC, RTU, or HMI networks.
Disable access when the support contract ends.
Review vendor accounts at least monthly.
Keep emergency break-glass access documented and tested.

The key risk is not only that a vendor account may be compromised. The bigger risk is that the vendor access path may bypass the segmentation you carefully designed.

4. Harden systems without treating OT like office IT

Hardening should reduce attack surface without breaking vendor support or process stability.

That requires three things:

A tested baseline.
A rollback plan.
OT owner approval.

Windows HMIs and engineering workstations

Most SCADA front ends and engineering tools still run on Windows. They should not be managed like standard office laptops.

Practical controls:

Use application allowlisting where possible. AppLocker, Windows Defender Application Control, or a vendor-supported allowlisting tool is better than relying only on antivirus.
Remove unnecessary local admin access. Operators should not run daily sessions as administrators.
Use unique local administrator passwords. Microsoft LAPS or a controlled equivalent is preferred where domain-joined management is available.
Disable unnecessary services after vendor validation. Common review candidates include Print Spooler, Remote Registry, unused file sharing, unused web services, and unmanaged PowerShell remoting.
Restrict RDP to jump hosts only. Block direct RDP from enterprise networks.
Control USB usage. Block mass storage by default, allow approved devices only through documented procedure.
Enable Windows Firewall with explicit inbound rules.
Forward security logs to a central collector or SIEM.
Keep golden images for HMI and engineering workstation recovery.

Useful Windows event sources include:

Logon/logoff events: detect unusual operator, engineer, or vendor access.
Account management events: detect new users, group changes, and privilege changes.
Service creation events: detect persistence and unauthorized tooling.
PowerShell logs, where enabled safely: detect script-based administration or abuse.
RDP session logs: validate jump host and remote access policy.
Application installation logs: detect unauthorized software on HMIs.

Linux SCADA, historian, or middleware servers

For Linux-based systems:

Remove unused packages and services.
Disable password SSH and root SSH where operationally feasible.
Restrict SSH to the jump host or management subnet.
Use host firewall rules to allow only required service ports.
Mount temporary paths with safer options where compatible with the application.
Forward authentication and system logs to the central collector.
Monitor service restarts and unexpected listening ports.
Keep system backups and configuration exports.

Example host firewall intent:

Allow SSH only from OT-JUMP-01.
Allow application port only from approved SCADA/HMI systems.
Deny all other inbound traffic.
Log denied management attempts.

The exact command syntax depends on the distribution and change process. The control intent matters more than copying a command from the internet.

PLCs, RTUs, and controllers

Controllers are often the most sensitive assets. Treat changes carefully.

Minimum hardening actions:

Disable unused services such as HTTP, FTP, Telnet, SNMP, or vendor discovery services where not required.
Restrict programming access to approved engineering workstations.
Use controller access control lists where supported.
Change default passwords.
Use firmware supported by the vendor and validated in a lab or maintenance window.
Protect physical ports, cabinet access, and serial interfaces.
Keep offline backups of logic, configuration, and firmware versions.
Record checksums or vendor-equivalent integrity evidence for controller logic.
Document which controller owns which physical process.

The most important question for PLC hardening is:

"Who can change logic, from where, under what approval, and how would we know?"

If you cannot answer that, you do not have control of the control system.

5. Monitor what matters in OT

OT monitoring is not only about malware signatures. It is about understanding normal process communication and detecting changes that should not happen.

You need visibility across four areas.

Minimum visibility areas

Network traffic

Telemetry: NetFlow, PCAP, Zeek logs, and industrial protocol metadata.
Why it matters: detects new devices, unusual protocol use, and unauthorized writes.

Host activity

Telemetry: Windows/Linux logs, OT-safe EDR telemetry, and service changes.
Why it matters: detects compromised HMIs, engineering workstations, and servers.

Controller state

Telemetry: logic changes, firmware changes, diagnostic buffers, and mode changes.
Why it matters: detects unauthorized control-layer modification.

Identity and remote access

Telemetry: VPN logs, jump host logs, MFA events, and privileged sessions.
Why it matters: detects compromised accounts and vendor access misuse.

Network monitoring

Passive monitoring is usually the safest starting point.

Use a SPAN port, network TAP, or packet broker to send traffic to an OT-aware sensor. Tools may include Zeek, Security Onion, Suricata, Nozomi, Dragos, Claroty, Forescout, Microsoft Defender for IoT, or similar platforms.

The tool is less important than the use cases.

Good OT detections include:

New device in a control cell: unknown MAC/IP communicating on an OT VLAN. This may indicate a rogue laptop, vendor device, or attacker foothold.
Unauthorized PLC write: write function from a non-engineering source. This may indicate process manipulation.
PLC mode change: run, stop, or program mode change outside a change window. This may indicate unsafe or unauthorized activity.
New engineering workstation behavior: engineering protocol traffic from an unusual host. This may indicate a compromised IT asset or unauthorized tool.
Vendor login outside an approved window: remote session outside the ticketed time. This may indicate account abuse.
Protocol crossing the wrong boundary: Modbus, DNP3, S7, or similar protocol traffic from the DMZ or enterprise network. This indicates segmentation failure.
Firmware or logic change: controller reports updated logic or firmware. This is an integrity event requiring validation.

A useful detection statement looks like this:

Alert when any non-approved engineering workstation sends write-capable industrial protocol traffic to Level 1 devices outside an approved change window.

That is stronger than saying "monitor PLCs."

It defines the source, destination, behavior, and context.

Honeypots and deception

A low-risk deception control can work well in OT if it is carefully isolated.

For example, a Conpot-style ICS honeypot can sit in a monitored network segment where no legitimate device should communicate with it.

Rules:

Do not connect the honeypot to the live control path.
Do not emulate a real production controller name that could confuse operators.
Alert on any connection attempt.
Treat interaction as high-confidence reconnaissance or lateral movement.
Document the honeypot in the asset inventory so internal teams do not mistake it for a real controller.

6. Assess vulnerabilities without creating an outage

Vulnerability assessment in ICS must follow a safety ladder.

Do not start with aggressive scanning.

Step 1: Passive assessment

Start by collecting traffic and configuration data without sending packets to controllers.

Good sources:

SPAN/TAP packet captures
Firewall rules
Switch MAC address tables
Asset inventory
Engineering workstation project files
PLC configuration exports
Vendor firmware inventories
Remote access logs
Historian connection lists

Output:

Asset list
Protocol map
Known vendor/model inventory
Network flows
Exposed services
Unsupported firmware
Default credential candidates
Unknown devices

Step 2: Configuration review

Review device and network configurations offline.

Look for:

Default credentials
Shared accounts
Insecure SNMP communities
PLC web interfaces enabled without need
Open RDP/VNC/SSH
Broad firewall rules
Any-to-any vendor access
Unused services
Lack of time synchronization
Missing backup evidence
No logic-change approval trail

Step 3: Controlled active testing

Active testing requires written OT approval.

Minimum conditions:

Approved maintenance window or lab environment.
Named target IPs only.
Low-rate scanning profile.
OT engineer present or on bridge.
Backup and rollback confirmed.
Safety impact reviewed.
Stop conditions agreed in advance.
Test evidence retained.

A cautious discovery command for a single approved test asset might look like this:

# Example only: use only for an approved test target and approved window.
nmap -sT -Pn -n --scan-delay 1s --max-retries 1 -p 22,80,443,502,102,44818 <approved-test-ip>

Do not run broad default IT vulnerability scans against live PLCs or controllers.

Step 4: Penetration testing

ICS penetration testing should start in a lab that mirrors the production configuration.

For production testing:

Scope must be narrow.
Exploit testing must be explicitly approved.
Denial-of-service testing should remain in the lab unless there is an exceptional, formally accepted reason.
Safety and operations must own the stop/go decision.
Engineering backups must be verified before testing.
Testers must understand the process impact, not only the protocol.

The point of ICS testing is not to "prove we can break it." It is to validate whether a realistic attack path can affect operations and whether the organization can detect, contain, and recover safely.

7. Common configuration failures

Here are the issues I see most often in real OT environments.

Common failures and required fixes

Flat IT/OT network

Real risk: malware or attacker movement from enterprise into control systems.
Required fix: segment by Purdue level and process cell; enforce firewall rules.

Dual-homed engineering workstation

Real risk: bypasses the IT/OT boundary.
Required fix: remove dual-homing or place a controlled firewall/proxy path between networks.

Direct vendor VPN into OT

Real risk: third-party compromise becomes OT compromise.
Required fix: use MFA, jump host, time-bound approval, recording, and named destinations.

PLC web interface left enabled

Real risk: reconnaissance, credential attack, or controller instability.
Required fix: disable unless operationally required; restrict source IPs.

Default or shared credentials

Real risk: trivial unauthorized access and no accountability.
Required fix: use unique accounts, password vaulting, and MFA where supported.

Unauthenticated write-capable protocols

Real risk: unauthorized process change.
Required fix: restrict sources, segment, monitor writes, and upgrade to secure variants where possible.

No controller logic backup

Real risk: slow or impossible recovery after compromise.
Required fix: maintain offline, tested backups and version records.

No time synchronization

Real risk: logs cannot support investigation.
Required fix: use a local OT time source and validate clock drift.

No internal segmentation

Real risk: one compromised HMI exposes the whole plant.
Required fix: segment by cell/area and apply least-traffic rules.

No evidence of change approval

Real risk: cannot prove integrity or accountability.
Required fix: use formal OT change records and retain exports/checksums.

8. Evidence pack for audit and operations

Good OT security leaves evidence.

For Part 1 controls, keep:

Current network diagrams with trust boundaries.
Firewall rule exports with owners and business justification.
Asset inventory with owner, location, role, firmware, and criticality.
Remote access policy and access logs.
Vendor access approvals and session records.
HMI/server hardening baseline.
PLC/RTU configuration and logic backup evidence.
SIEM or monitoring ingestion proof.
Detection catalog mapped to OT use cases.
Vulnerability assessment scope, approval, results, and remediation plan.
Exception register with expiry dates and compensating controls.

If you cannot produce evidence, the control may exist technically but it is not operationally mature.

9. What good looks like

A defensible ICS environment does not need to be perfect. It needs to be controlled.

Good looks like this:

You know every critical asset and what process it supports.
Enterprise users cannot directly reach controllers.
Vendor access is time-bound, monitored, and approved.
Engineering workstations are controlled and logged.
PLC logic changes require approval and leave evidence.
OT traffic is passively monitored.
Unauthorized write behavior is detectable.
Backups are offline and tested.
Active scanning is controlled and approved.
Exceptions are visible, owned, and temporary.

That is a strong foundation.

In Part 2, we move from architecture to operations: how to run the program, respond to incidents without causing harm, prepare for large-scale campaigns, use MITRE ATT&CK for ICS, and report risk to leadership.