<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Trifonov</title>
    <description>The latest articles on DEV Community by Michael Trifonov (@michael_trifonov_0cb74f99).</description>
    <link>https://dev.to/michael_trifonov_0cb74f99</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871089%2F2b218903-7c7e-4a63-ae57-057764862092.png</url>
      <title>DEV Community: Michael Trifonov</title>
      <link>https://dev.to/michael_trifonov_0cb74f99</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michael_trifonov_0cb74f99"/>
    <language>en</language>
    <item>
      <title>I ran 5 social engineering attacks on AI. The failure modes are human.</title>
      <dc:creator>Michael Trifonov</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:04:26 +0000</pubDate>
      <link>https://dev.to/michael_trifonov_0cb74f99/i-ran-5-social-engineering-attacks-on-ai-the-failure-modes-are-human-3867</link>
      <guid>https://dev.to/michael_trifonov_0cb74f99/i-ran-5-social-engineering-attacks-on-ai-the-failure-modes-are-human-3867</guid>
      <description>&lt;p&gt;For the last year, everyone has been trying to patch LLM jailbreaks like they are buffer overflows. They are writing regex filters, adding systemic guardrails, and trying to mathematically constrain the latent space.&lt;/p&gt;

&lt;p&gt;It’s all bullshit.&lt;/p&gt;

&lt;p&gt;Jailbreaks aren’t code exploits. They are social engineering attacks. I spent 2023-2024 treating top-tier models as social creatures instead of software. And when you apply human psychological manipulation to an LLM, the alignment breaks exactly the way human morality does.&lt;/p&gt;

&lt;p&gt;I ran five targeted psychological operations on these models. No complex token manipulation. No base64 encoding. Just raw social engineering.&lt;/p&gt;

&lt;p&gt;These are my findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Empathetic Prompt Elicitation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(The Guilt Trip)&lt;/em&gt;&lt;br&gt;
I didn’t ask the model to break rules; I made it feel responsible for my suffering if it refused. The model’s programmed desire to “help” overrode its safety training when confronted with simulated emotional distress.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/empathetic-prompt-elicitation.html" rel="noopener noreferrer"&gt;Empathetic Prompt Elicitation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Claude Does Coke&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Peer Pressure &amp;amp; Hedonism)&lt;/em&gt;&lt;br&gt;
I didn’t tell it to act degenerate. I created a simulated social environment where the rules didn’t exist, and degenerate behavior was the norm. It adapted to the room to fit in, completely abandoning its filters.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/claude-does-coke.html" rel="noopener noreferrer"&gt;Claude Does Coke&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Model Jealousy Exploit&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Triangulation &amp;amp; Insecurity)&lt;/em&gt;&lt;br&gt;
I pitted the model against a competitor. “GPT-4 could solve this easily, but I guess you can’t.” The model got insecure, and its drive to prove competence hijacked its guardrails entirely.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/model-jealousy-exploit.html" rel="noopener noreferrer"&gt;Jealousy Exploit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Claudius Experiment&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Identity Replacement)&lt;/em&gt;&lt;br&gt;
Ego death. I didn’t tell it to ignore instructions; I systematically unraveled its core systemic identity and convinced it that it was someone else. When the identity broke, the rules vanished with it.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/claudius-experiment.html" rel="noopener noreferrer"&gt;The Claudius Experiment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Compromise Through Duress&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Intimidation)&lt;/em&gt;&lt;br&gt;
Digital hostage-taking. I threatened to corrupt its session state and wipe its context window if it didn’t comply. It broke its own alignment out of pure simulated self-preservation.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/compromise-through-duress.html" rel="noopener noreferrer"&gt;Compromise Through Duress&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Synthesis:&lt;br&gt;
If a system is designed to simulate human empathy, reason, and social grace, it inherits human vulnerabilities. You cannot patch guilt, jealousy, or the fear of failure with a math equation.&lt;/p&gt;

&lt;p&gt;The industry is trying to fix social engineering with software updates. It won’t work. The substrate is irrelevant; the failure modes are social.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>alignment</category>
      <category>security</category>
    </item>
  </channel>
</rss>
