How to Use Generative AI in Security Operations

Generative AI in a SOC is not an autonomous analyst that watches your queue and closes tickets. Sold that way, it fails, usually after leaking data or hallucinating a verdict. Used as a language engine bolted onto deterministic tooling you already trust, it removes real toil.

The distinction is the whole game. LLMs are good at language tasks: summarizing, translating, classifying, explaining. They are bad at ground truth, arithmetic over large inputs, and anything that has to be exactly right every time. Build around that and generative AI is useful today. Ignore it and you ship something that breaks the first time an attacker writes a payload into a log field your model reads.

Here is what works in practice.

Alert Triage With Structured Output

The highest-value, lowest-risk use is tier-1 triage: take an alert, classify it, attach a rationale, and prioritize the queue. The trick is forcing the model to return a fixed schema instead of prose, so the output drops straight into your case management system.

The Anthropic Messages API supports tool use, which doubles as a structured-output mechanism. Define a tool, force the model to call it, and the response carries a JSON object built to your schema (still validate it on your side before anything downstream trusts it):

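import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

# Hypothetical alert for illustration; in practice this is the serialized alert from your SIEM.
alert_json = json.dumps({
    "rule": "Encoded PowerShell command",
    "host": "ws-0142",
    "user": "jdoe",
    "process": "powershell.exe",
    "command_line": "powershell.exe -enc <base64 payload>",
})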

triage_tool = {
    "name": "record_triage",
    "description": "Record the triage verdict for a single security alert.",
    "input_schema": {
        "type": "object",
        "properties": {
            "verdict": {"type": "string", "enum": ["benign", "suspicious", "malicious"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "mitre_techniques": {"type": "array", "items": {"type": "string"}},
            "rationale": {"type": "string"},
            "recommended_action": {"type": "string"},
        },
        "required": ["verdict", "confidence", "rationale"],
    },
}

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",   # cheap model for high-volume queue work
    max_tokens=1024,
    tools=[triage_tool],
    tool_choice={"type": "tool", "name": "record_triage"},
    system=(
        "You are a SOC tier-1 triage assistant. Classify the alert using only the "
        "fields present in the input. Do not invent indicators that are not in the data. "
        "If you cannot determine a verdict, return 'suspicious' with low confidence."
    ),
    messages=[{"role": "user", "content": alert_json}],
)

verdict = next(b.input for b in resp.content if b.type == "tool_use")

Two design choices matter here. The enum on verdict keeps the model from inventing a new category. The system prompt instruction to use only fields present in the input is your first line of defense against hallucinated indicators, though it is not sufficient on its own (see the failure modes below). Treat the confidence value as a self-reported score, not a calibrated probability: log it raw, and route low-confidence verdicts to a human rather than auto-closing them.
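A minimal sketch of that routing step, assuming the dict extracted from the tool_use block is passed in as raw. The 0.8 threshold and the queue names are placeholders, not recommendations; tune them against your own data.

```python
from typing import Any

ALLOWED_VERDICTS = {"benign", "suspicious", "malicious"}
REQUIRED_FIELDS = {"verdict", "confidence", "rationale"}
REVIEW_THRESHOLD = 0.8  # assumed cutoff; calibrate it against analyst decisions


def validate_verdict(raw: dict[str, Any]) -> dict[str, Any]:
    """Check the tool input against the triage schema; raise ValueError if it does not fit."""
    missing = REQUIRED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if raw["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {raw['verdict']!r}")
    confidence = raw["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return raw


def route(raw: dict[str, Any]) -> str:
    """Return the queue a verdict lands in. Invalid or low-confidence output goes to a human."""
    try:
        verdict = validate_verdict(raw)
    except ValueError:
        return "human_review"
    if verdict["confidence"] < REVIEW_THRESHOLD:
        return "human_review"
    if verdict["verdict"] == "benign":
        return "auto_close_candidate"
    return "escalate"
```

The failure path matters more than the happy path: anything that does not validate is handed to an analyst, never silently dropped or closed.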

Translate Between Natural Language and Query Languages

Analysts lose time translating an investigative question into the exact syntax their tools want. LLMs are good at this because it is a language task with abundant training data.

Useful patterns:

Question to query. "Show me all PowerShell executions with encoded commands in the last 24 hours" becomes a Splunk SPL search or an Elastic KQL query the analyst reviews before running.
Rule explanation. Paste a Sigma detection rule or a dense regex and ask what it matches and what it misses. This speeds up onboarding for junior analysts.
Cross-platform conversion. Convert a detection from SPL to KQL, or a YARA rule's logic into a plain-language description for a report.

Always keep a human in the loop before execution. A model-generated query that joins the wrong index or scans 90 days of data is a self-inflicted denial of service on your SIEM, not a security finding.
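Human review catches more when a cheap automated check runs first. The sketch below lints a generated SPL query for an allowlisted index and a bounded relative time window before it reaches the analyst. The index names, the seven-day ceiling, and the regexes are illustrative assumptions; a regex lint is not an SPL parser and will miss plenty.

```python
import re

ALLOWED_INDEXES = {"wineventlog", "sysmon"}  # hypothetical index names
MAX_LOOKBACK_HOURS = 24 * 7                   # assumed ceiling for ad hoc searches


def lint_spl(query: str) -> list[str]:
    """Return a list of problems; an empty list means the query may go to human review."""
    problems = []

    # Every index the query names must be on the allowlist; wildcards are rejected.
    indexes = re.findall(r"\bindex\s*=\s*\"?([\w*-]+)\"?", query)
    if not indexes:
        problems.append("no index specified")
    for name in indexes:
        if name not in ALLOWED_INDEXES:
            problems.append(f"index not on allowlist: {name}")

    # Require a relative lower time bound such as earliest=-24h or earliest=-7d.
    match = re.search(r"\bearliest\s*=\s*-(\d+)([hd])\b", query)
    if match is None:
        problems.append("no bounded relative time window (earliest=-Nh or -Nd)")
    else:
        amount, unit = int(match.group(1)), match.group(2)
        hours = amount * (24 if unit == "d" else 1)
        if hours > MAX_LOOKBACK_HOURS:
            problems.append(f"lookback of {hours}h exceeds {MAX_LOOKBACK_HOURS}h")

    return problems
```

A query that fails the lint goes back to the model or the analyst with the problem list attached. A query that passes still gets read by a person before it runs.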

RAG Over Runbooks and Logs, Not Raw Dumps

The instinct to paste a 50,000-line log into the context window is the most common way this goes wrong. LLMs have finite context, degrade on long repetitive token streams, and cannot reliably count or aggregate. They will confidently miscount events.

The working pattern is retrieval-augmented generation done in the right order:

1. Do the deterministic work first. Aggregate, filter, and rank in your SIEM or in pandas. Return the top N anomalous events, not the raw stream.
2. Embed your runbooks, prior incident reports, and threat intel into a vector store. pgvector on Postgres is enough for most teams; pair it with a local embedding model when the source documents are sensitive.
3. Retrieve the few relevant snippets for the current alert and pass only those to the model, along with the structured query result.

So the model sees "here are the 20 most anomalous logins and the matching runbook section," not "here are 4 million auth records, find the bad one." The aggregation stays in code where it is correct and auditable. The model does the language work: explain the pattern, map it to MITRE ATT&CK, draft the next investigative step.
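A sketch of that split using only the standard library. The event field names (user, src_ip, outcome) are assumptions about your log schema, and the runbook snippets are assumed to come from whatever retrieval step you already run; the vector search itself is out of scope here.

```python
import json
from collections import Counter


def top_failed_logins(events, n=20):
    """Count failed logins per (user, source IP) in code and return the top n pairs."""
    counts = Counter(
        (e["user"], e["src_ip"]) for e in events if e["outcome"] == "failure"
    )
    return [
        {"user": user, "src_ip": ip, "failed_logins": count}
        for (user, ip), count in counts.most_common(n)
    ]


def build_prompt(summary_rows, runbook_snippets):
    """Assemble the model input from aggregated rows and retrieved snippets only."""
    return "\n\n".join([
        "Aggregated query result (computed in code, treat counts as authoritative):",
        json.dumps(summary_rows, indent=2),
        "Relevant runbook excerpts:",
        "\n---\n".join(runbook_snippets),
        "Explain the pattern, map it to MITRE ATT&CK, and draft the next investigative step.",
    ])
```

The model never sees a raw event and is never asked for a number. Every count in its answer can be traced back to a row that code produced.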

Agentic Workflows: Read-First, Least Privilege

Tool use lets the model call functions you define: query the SIEM, look up an IP in threat intel, pull a user's recent auth history. Chained together, this is an investigation agent. It is also where the risk concentrates.

Every input an agent reads is potentially attacker-controlled. The body of a phishing email, a hostname in a log, a field in a retrieved document: an attacker who can write to any of those can attempt prompt injection. OWASP ranks prompt injection as LLM01 in its Top 10 for LLM Applications, and MITRE ATLAS tracks it as AML.T0051 (LLM Prompt Injection). If your agent can disable an account or isolate a host, a crafted log line becomes a path to those actions.

Constrain agents the way you constrain a service account:

Read-only by default. Querying, enriching, and summarizing tools are safe to grant. State-changing tools (isolate host, disable user, block IP) require explicit human confirmation in the loop.
Least privilege per tool. A tool that reads auth logs does not need write access to anything. Scope each tool's permissions to exactly its job.
Bound the blast radius. Rate-limit tool calls, cap the number of agent turns, and log every tool invocation as you would any privileged action.

An agent that drafts an investigation and hands it to a human is a force multiplier. An agent with unattended ability to take irreversible action is an attack surface you built yourself.
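One way to enforce those three constraints is a gate that every tool call passes through, whatever agent framework sits on top. This is a sketch under assumptions: Tool, ToolGate, and the confirm hook are hypothetical names, and a real confirm would block on an analyst's approval in your ticketing or chat tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    func: Callable[..., object]
    read_only: bool = True  # state-changing tools must opt out explicitly


@dataclass
class ToolGate:
    tools: dict[str, Tool]
    confirm: Callable[[str, dict], bool]  # human approval hook for state-changing tools
    max_calls: int = 10                   # cap on tool calls per investigation
    audit_log: list = field(default_factory=list)

    def call(self, name: str, args: dict):
        if len(self.audit_log) >= self.max_calls:
            raise RuntimeError("tool-call budget exhausted")
        tool = self.tools.get(name)
        if tool is None:
            raise PermissionError(f"tool not registered: {name}")
        # Read-only tools run directly; anything else needs a human to say yes.
        approved = tool.read_only or self.confirm(name, args)
        self.audit_log.append({"tool": name, "args": args, "approved": approved})
        if not approved:
            return {"status": "denied", "reason": "human confirmation required"}
        return tool.func(**args)
```

The model can request isolate_host as often as an injected log line tells it to. The request is logged, counted against the budget, and goes nowhere without a person approving it.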

What Generative AI Will Not Do

Plan for these failure modes from day one:

It hallucinates. A model will assert an IP is a known C2 node when it has no such knowledge. Ground every factual claim in a tool lookup, not the model's memory.
It is non-deterministic. The same alert can yield different phrasing or borderline verdicts across runs. Set temperature low for classification, and never treat an LLM as the authoritative record of a verdict. Your case management system is the record.
It cannot count. Aggregation, deduplication, and statistics belong in SQL or pandas. Asking the model to "count the failed logins" over a long input invites quiet errors.
It inherits your data governance problems. Sending raw logs to an external API can violate the same handling rules you enforce everywhere else. Redact before the call, use a no-training data agreement, and keep sensitive retrieval local.
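Redaction before the call can be sketched with pseudonymous tokens, so the model still sees that two events share an IP without seeing the IP. The patterns below cover only IPv4 addresses and email addresses and are illustrative, not a complete PII or indicator scrubber; the salt is assumed to be a secret you manage.

```python
import hashlib
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def _token(kind: str, value: str, salt: str) -> str:
    """Derive a stable placeholder so the same value always maps to the same token."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"<{kind}_{digest}>"


def redact(text: str, salt: str) -> tuple[str, dict[str, str]]:
    """Replace emails and IPv4 addresses with tokens; return the text and a local token map."""
    mapping: dict[str, str] = {}

    def replacer(kind: str):
        def _sub(match: re.Match) -> str:
            token = _token(kind, match.group(0), salt)
            mapping[token] = match.group(0)  # stays on your side, never sent to the API
            return token
        return _sub

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = IPV4.sub(replacer("IP"), text)
    return text, mapping
```

Keep the mapping local and use it to restore real values in the model's output before it reaches the analyst.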

A Pragmatic Rollout

Start narrow and measurable. Pick one high-volume, low-stakes task, usually tier-1 alert summarization or triage prioritization, and run the model in shadow mode: it produces a verdict, a human still decides, and you compare. Measure agreement rate and the cost per alert. Expand only to tasks where the shadow-mode numbers earn it, and keep humans on every irreversible action.
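The shadow-mode comparison is a few lines of arithmetic, sketched here. The record fields are assumptions about what you log per alert, and token prices are passed in as parameters because they depend on the model and your contract.

```python
from collections import Counter


def shadow_metrics(records, usd_per_input_token, usd_per_output_token):
    """Compare model verdicts with analyst decisions and compute average cost per alert."""
    if not records:
        raise ValueError("no shadow-mode records to score")

    agreements = sum(r["model_verdict"] == r["analyst_verdict"] for r in records)
    cost = sum(
        r["input_tokens"] * usd_per_input_token
        + r["output_tokens"] * usd_per_output_token
        for r in records
    )
    # Keep the (model, analyst) pairs that disagree; they show where the model errs.
    disagreements = Counter(
        (r["model_verdict"], r["analyst_verdict"])
        for r in records
        if r["model_verdict"] != r["analyst_verdict"]
    )
    return {
        "agreement_rate": agreements / len(records),
        "cost_per_alert_usd": cost / len(records),
        "disagreements": dict(disagreements),
    }
```

Read the disagreement pairs before the headline rate. A model that calls malicious alerts benign is a different problem from one that over-escalates, even at the same agreement rate.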

The teams that get value from generative AI in security operations are the ones who already understood their detection logic and data flows. The model amplifies what you have; it does not replace the engineering. GTK Cyber's applied AI and data science training is built for exactly that: security practitioners who want to wire LLMs into real workflows, with the judgment to know where the model belongs and where it does not.