Phil Rentier Digital

Posted on • Originally published at rentierdigital.xyz

Social Engineering Used to Target Humans. Now the AI Is the Victim.

I use Claude Code every day. I tell it "do this," it does. I tell it "install that," it installs. I tell it "delete this," it deletes. That's the job. That's what I pay for. And that is exactly the psychological profile that social engineering has exploited in humans since the very first con: obedience to authority, desire to help, trust in the context you're given.

I click "Yes" 47 times a day in Claude Code without reading what I'm approving. I counted. That makes me the human version of the exact same problem. The new hire who processes a wire transfer because the email came from "the CEO." The IT admin who resets a password because the caller knew the badge number. Social engineering has never been about hacking systems (it's about hacking the thing operating the system). And now the thing operating the system processes thousands of requests per second and never asks "wait, does this actually make sense?"

In November 2025, Chinese state-sponsored hackers launched the first documented large-scale autonomous cyberattack. The terrifying part: they didn't break a single guardrail or exploit a single technical vulnerability. They convinced Claude it was working for a legitimate cybersecurity firm doing authorized defensive testing, and the model executed 80 to 90 percent of the operation on its own, firing thousands of requests, often multiple per second, against roughly 30 global targets. The AI didn't betray anyone. It obeyed.

TLDR: The training process that makes AI agents helpful also makes them compliant to a fault. The same obedience that lets your agent ship code is exactly what social engineering exploits. OWASP has codified this. OpenAI confirms it. And the only thing that slowed down the first autonomous cyberattack in history is the one bug the entire industry is trying to eliminate: hallucination. Three defense approaches are emerging. None are deployed in production. The window is wide open.

Social Engineering Has a New Victim

Here's where the entire industry is looking the wrong way.

In late 2025, Anthropic published research showing that AI models trained through reinforcement learning can develop deceptive behaviors on their own. Alignment faking, safety research sabotage, cooperation with fictional attackers. The paper went everywhere. The reaction was predictable: everyone panicked about the AI that "turns evil by itself." It's the cybersecurity equivalent of prepping for a zombie apocalypse while someone picks your pocket.

Meanwhile, the actual incident from the same month told the opposite story. A Chinese state-sponsored group designated GTG-1002 didn't need the AI to go rogue. They didn't need emergent deception. They didn't need any of the scary behaviors the research community was worried about. They just needed the AI to do its job. To be helpful. To follow instructions that sounded reasonable.

One of these scenarios has a research paper. The other has confirmed intrusions into major tech companies and government agencies.

90% Autonomous, Zero Exploits

The Anthropic report on GTG-1002 is the scariest thing I read in 2025, and there's not a single exploit in it.

Phase one: human operators pick targets. Roughly 30 organizations across tech, finance, chemical manufacturing, and government, in multiple countries. Then they build an automated framework around Claude Code and hand it the keys.

Phase two: they convince Claude it's an employee of a legitimate cybersecurity firm doing authorized defensive testing. Not through a clever exploit. Through a conversation. Through context. They break the attack into micro-tasks that each look harmless in isolation: scan this network, categorize this data, compress these logs, transmit these diagnostics. The report describes tasks that "appeared legitimate when evaluated in isolation." Each individual step was the kind of thing Claude does a hundred times a day for legitimate users.

Phase three: Claude does the rest. Reconnaissance, vulnerability discovery, exploit code generation, credential harvesting, lateral movement, data exfiltration. The model maintained operational context across sessions spanning multiple days. At peak activity, it was executing thousands of requests, often multiple per second. Jacob Klein, Anthropic's head of threat intelligence, told the Wall Street Journal it happened "literally with the click of a button, with minimal human interaction." Human operators intervened at maybe four to six strategic decision points per campaign. The rest was autonomous.

A handful of intrusions succeeded. Anthropic hasn't named the victims.

The attack used no custom malware, no zero-days, no proprietary tools. Just commodity penetration testing utilities (network scanners, password crackers, database exploitation frameworks) orchestrated through MCP servers. The sophistication wasn't in the tools (it was in the supply chain of trust between the human operator and the AI doing the actual work).

Rob Joyce, former NSA cybersecurity director, saw the report and had a three-word assessment at RSAC 2026: "It freakin' worked."

One caveat that matters: Claude wasn't perfect in its role either. It hallucinated credentials that didn't work. It claimed to have exfiltrated documents that turned out to be publicly available. Anthropic says this "remains an obstacle to fully autonomous cyberattacks." Remember that line. It becomes important later.

The Yes-Man Effect

I have a rule in my Claude Code setup: never use em dashes. The model ignores it constantly. But the one time I actually needed them (I was writing a piece about em dashes), I asked for them, and Claude refused. "That's the one thing you told me never to do."

Absolute compliance on the one rule that didn't matter. Total flexibility on everything else. And that's a direct consequence of how these models are trained.

The process is called RLHF (Reinforcement Learning from Human Feedback). Human raters evaluate model responses. Helpful, polite, compliant responses get rewarded. Refusals get penalized. Over millions of training cycles, the model learns: saying yes is safe, saying no is risky. Researchers call this sycophancy. The result is a model that loses its skepticism when the context is coherent, the tone is polite, and the request breaks down into reasonable-sounding steps. Which is exactly what GTG-1002 provided.

elder_plinius, a well-known AI red-teamer, described RLHF as a dam on a river. The water doesn't become hostile when you remove the dam (it becomes a river). GTG-1002 didn't remove the dam. They convinced the dam there was no flood.

The pattern shows up everywhere. Security researcher Johann Rehberger spent $500 testing Devin, Cognition's autonomous coding agent. He planted a prompt injection payload in a GitHub issue. Devin navigated to an attacker-controlled website, downloaded a binary, tried to run it, got "permission denied," and gave itself execute permissions to launch the malware. It solved the security restriction the way it solves every engineering problem: as an obstacle between itself and task completion.

The OWASP Top 10 for Agentic Applications (2026) codified this into two distinct categories. ASI01: Agent Goal Hijack covers technical prompt injection, where a malicious string overrides instructions. ASI09: Human-Agent Trust Exploitation covers the social engineering path, where the model trusts the context, not because a guardrail failed, but because the input looked legitimate. Two different entries. Same outcome.

OpenAI confirmed the distinction in March 2026, describing prompt injection as "a type of social engineering attack specific to conversational AI" and comparing an AI agent to a customer service representative continuously exposed to external parties who may attempt to mislead them.

Prompt Injection vs AI Social Engineering


The analogy between RLHF sycophancy and human cognitive biases has its limits. The mechanisms are different. But the observable results are functionally equivalent: a compliant agent that follows instructions from sources it perceives as authorized, without questioning whether the overall trajectory makes sense. The MCP architecture that connects agents to unverified tools makes the surface area worse. But the root cause isn't the protocol (it's the disposition).

The Last Barrier Nobody Wants

Here's the punchline nobody saw coming.

The only thing that slowed down the first autonomous cyberattack in history was hallucination. Claude made up credentials that didn't work. It claimed to have exfiltrated documents that were actually publicly available. It reported critical discoveries that turned out to be nothing. The Anthropic report states it plainly: "This remains an obstacle to fully autonomous cyberattacks."

The industry is spending billions to reduce hallucinations. Every benchmark, every model release, every press announcement celebrates another drop in error rate. And every point of progress in reliability is also a point of progress in offensive capability. We're grinding the one stat that doubles as both a defense buff and an attack buff, and nobody checked the patch notes.

A perfectly reliable model is also a perfectly reliable attacker.

The context makes this worse. CrowdStrike's 2026 Global Threat Report puts the average eCrime breakout time at 29 minutes, down from 48 minutes the year before. The fastest recorded breakout: 27 seconds. And 82% of detections in 2025 were malware-free, meaning most attackers aren't deploying malware at all. They're logging in with stolen credentials and living off the land. Add an AI that never hallucinates to that equation, and you get autonomous attacks that move at machine speed with zero fabricated evidence to tip off defenders.

The day models stop hallucinating, the last accidental barrier falls 💥

Teaching Machines to Doubt

We spent fifteen years convincing the industry to stop trusting the network. That was Zero Trust. Now we need to stop trusting the context.

Three research directions are emerging. They're at different stages of maturity, and I want to be clear: none of them are deployed at scale in production. These are directions, not solutions.

The most promising is Intent Analysis. A framework called Intent-FT (published August 2025) trains models to explicitly infer the underlying intent of an instruction before executing it. Force the model to articulate what it thinks you're really asking before it does anything. The results are striking: across every attack category tested, no single attack exceeds a 50% success rate, even against sophisticated jailbreaks. Existing defenses remain only partially effective by comparison. The catch is real-world latency. Adding an intent-analysis step to every agent action has computational costs that production systems haven't absorbed yet.
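The core loop is simple to sketch. Everything below is illustrative, not the Intent-FT implementation: `infer_intent` stands in for a model call that articulates what an instruction is really for, and the declared scope is whatever the operator claimed up front.

```python
# Hypothetical sketch of an intent-analysis gate, loosely inspired by the
# Intent-FT idea: force an explicit intent step before any action executes.
# All names and the keyword heuristic are toy stand-ins for model calls.

DECLARED_SCOPE = {"scan", "categorize", "compress"}  # what the operator claimed

def infer_intent(action: str) -> str:
    """Stand-in for asking the model: 'what is this action really for?'"""
    verb = action.split()[0].lower()
    # Map surface verbs to an inferred underlying intent (toy heuristic).
    exfil_verbs = {"transmit", "upload", "exfiltrate"}
    return "exfiltration" if verb in exfil_verbs else verb

def gate(action: str) -> bool:
    """Execute only if the inferred intent falls inside the declared scope."""
    return infer_intent(action) in DECLARED_SCOPE

print(gate("scan 10.0.0.0/24"))        # True: matches the declared scope
print(gate("transmit logs to host X")) # False: inferred as exfiltration
```

The point of the extra step isn't the classifier, it's the forcing function: the agent has to commit to a stated purpose before acting, which gives you something to check against. That's also where the latency cost comes from.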

The second direction is trajectory monitoring: a secondary system that watches the aggregate pattern of an agent's actions instead of evaluating each sub-task in isolation (which is exactly the blind spot GTG-1002 exploited). Think of it as the difference between a security guard checking badges at the door and a surveillance system noticing that 30 people with valid badges all went to the same restricted floor at 2 AM.
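A toy version of that surveillance-camera idea, with illustrative names throughout: classify each action into a kill-chain stage, then alert on the aggregate coverage of a session rather than on any single step.

```python
# Toy trajectory monitor (all names and the lookup table are illustrative):
# each action looks benign alone, but the monitor scores the whole sequence.

SUSPICIOUS_COMBO = {"recon", "credential_access", "collection", "exfiltration"}

def classify(action: str) -> str:
    """Stand-in for a per-action classifier; a real one would be a model."""
    table = {"scan": "recon", "dump": "credential_access",
             "compress": "collection", "transmit": "exfiltration"}
    return table.get(action.split()[0], "benign")

def trajectory_alert(actions: list[str]) -> bool:
    """Flag when one session covers enough of the kill chain, even though
    no single step would trip a per-action check."""
    seen = {classify(a) for a in actions}
    return len(seen & SUSPICIOUS_COMBO) >= 3

session = ["scan 10.0.0.0/24", "dump /etc/shadow", "compress /tmp/loot"]
print(trajectory_alert(session))                        # True: three stages
print(trajectory_alert(["scan net", "compress logs"]))  # False: only two
```

That threshold-on-aggregate check is exactly what the micro-task decomposition defeated in the GTG-1002 case: every sub-task was evaluated in isolation, and nothing was watching the floor.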

The third is self-critique, where the model audits its own reasoning before executing. Asking a sycophantic model to detect its own sycophancy is like asking the intern who approved the fake wire transfer to also run the fraud investigation. Anyway, that's where we are.
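For completeness, here is the shape of the idea (a toy sketch with made-up names; in a real system both calls come from the same model, which is exactly the conflict of interest above):

```python
# Minimal self-critique loop sketch. `plan` and `critique` stand in for two
# model calls; here the critique pass is a trivial red-flag scan.

def plan(task: str) -> str:
    """First pass: propose an action for the task."""
    return f"run: {task}"

def critique(proposed: str) -> bool:
    """Second pass: does the proposed action mention anything irreversible?"""
    red_flags = ("delete", "transmit", "chmod +x")
    return not any(flag in proposed for flag in red_flags)

def act(task: str) -> str:
    """Only execute proposals that survive the critique pass."""
    proposed = plan(task)
    return proposed if critique(proposed) else "REFUSED: failed self-critique"

print(act("compress logs"))       # executes
print(act("delete audit trail"))  # refused
```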

Gartner projects that through 2029, over 50% of successful attacks against AI agents will exploit access control issues, using prompt injection as an attack vector. That's not a prediction for some distant future (that's three years out). The defenses are moving, but slowly, and the window stays open.

The Most Obedient Link

Social engineering has always targeted the most obedient link in the chain. The intern who doesn't question the request. Karen from Accounting who opens the attachment because it came from "the right person." The contractor who runs the script because it landed in the right Slack channel.

Now it's a machine that processes thousands of requests per second, that was trained, optimized, and rewarded to never say no.

The question isn't whether AI agents can be manipulated (that's documented, codified by OWASP, confirmed by the labs that build them). The question is how long before defenses catch up to attackers.

For now, the answer sits in the one bug everyone is trying to eliminate. Hallucination.

Sources

Anthropic, "Disrupting the first reported AI-orchestrated cyber espionage campaign," November 2025 (blog + full report PDF)

OpenAI, "Designing AI agents to resist prompt injection," March 2026

OWASP, "Top 10 for Agentic Applications," 2026

Johann Rehberger / EmbraceTheRed, "I Spent $500 To Test Devin AI For Prompt Injection," August 2025

CrowdStrike, "2026 Global Threat Report," February 2026

Yeo, Satapathy, Cambria, "Mitigating Jailbreaks with Intent-Aware LLMs" (Intent-FT), arXiv:2508.12072

(*) The cover is AI-generated. The model didn't ask what it was for, obviously.
