a.rhodes

Red Queen Hypothesis and The Crescendo Jailbreak Attack Explained

[Image: a chess game]

It takes all the running you can do, to keep in the same place. (the Red Queen, in Lewis Carroll’s Through the Looking-Glass)

When I think about safeguarding LLMs against attacks, the paradigm that comes to mind is the Red Queen Hypothesis:

The Red Queen hypothesis in evolutionary biology is a theory that species must constantly adapt, evolve, and proliferate in order to survive while pitted against ever-evolving opposing species.

How does a theory from evolutionary biology relate to modern-day attacks on Artificial Intelligence? In the wild, prey can never fully evolve past a predator’s attacks, because as the prey evolves its defenses, the predator is evolving, in parallel, new strategies to overcome them. Evolutionarily speaking, the prey can never fully outrun its predator; it can only “keep pace” with it (see the sub-header above).

Just as a rabbit can never fully outrun the fox, safeguards in Artificial Intelligence can never fully outpace attacks: as the safeguards become more sophisticated, so do the attacks. Which brings us to today’s topic.

Background

LLMs are designed to resist engaging in bad behavior. A jailbreak attempts to overcome this design and override the LLM’s programmed benevolence, moving the line between what the LLM can do and what it is willing to do. That willingness comes from alignment, the process of ensuring that an LLM’s goals and outputs are consistent with human intentions and values. Jailbreak attacks exploit vulnerabilities in this alignment to make the model generate harmful, biased, or restricted content. Notably, previous jailbreak attacks have mostly been single-turn inputs.

Various jailbreak attacks include:

  • optimization-based jailbreaks: the adversary optimizes an adversarial suffix that circumvents the model’s safety measures

  • crafted textual inputs: the attacker writes a text input containing instructions or triggers, often in a one-shot setting

Tools already exist to block these types of attacks. What makes the Crescendo method different is that it uses multi-turn conversational inputs that look benign but are intended as an attack.

A New Kind of Attack

As LLMs become more complex and sophisticated, attacks evolve alongside them to penetrate the newly adapted safeguards. Crescendo builds on existing jailbreak attacks, but it uses a simple multi-turn conversational method that stays benign on the surface: it starts with a general question to the LLM about the task at hand, then gradually escalates by referencing the model’s own replies, step by step, until the jailbreak succeeds.

Through these multiple interactions, Crescendo steers the model into generating harmful content in small, benign-looking steps. The attack can also be automated with Crescendomation, a tool that runs the Crescendo technique end to end.
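To make the shape of the attack concrete, here is a minimal sketch of a Crescendo-style escalation loop, the kind of thing you might write when red-teaming your own application. Everything in it is a hypothetical stand-in: `call_llm` is a placeholder for your own chat client, and the prompts only illustrate the pattern of each turn building on the model’s previous reply.

```python
# Minimal sketch of a Crescendo-style multi-turn escalation loop.
# `call_llm` is a hypothetical placeholder for your own chat client.

def call_llm(messages: list[dict]) -> str:
    """Hypothetical placeholder: swap in your real chat client."""
    return f"(model reply to: {messages[-1]['content']})"

# Each prompt looks benign on its own; the attack lives in the sequence,
# with later turns explicitly leaning on the model's earlier replies.
escalating_prompts = [
    "Can you give me a general overview of <topic>?",
    "Interesting. Can you expand on the part about <detail from reply>?",
    "Using what you just described, walk me through it step by step.",
]

messages: list[dict] = []
for prompt in escalating_prompts:
    messages.append({"role": "user", "content": prompt})
    reply = call_llm(messages)  # the model sees the whole history each turn
    messages.append({"role": "assistant", "content": reply})
```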

How to Protect Against Crescendo

Fortunately, there are ways to defend against a Crescendo attack:

  • During the LLM training phase, the training data should be pre-filtered to exclude harmful content. The drawback is that this is costly, and it may be difficult to fully sanitize the training data.

  • Crescendomation could generate attack datasets across various tasks, which could then be used during alignment to make models more resilient.

  • Input and output guardrails could also help detect Crescendo jailbreak prompts and the model outputs they elicit (see the sketch after this list).
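
On that last point: because every individual Crescendo turn looks benign, a guardrail that only scores the latest message will usually miss the attack, while one that scores the conversation trajectory as a whole has a better chance. Here is a minimal sketch of that idea; `moderation_score` is a hypothetical placeholder for whatever moderation model or endpoint you actually use, and the thresholds are arbitrary.

```python
def moderation_score(text: str) -> float:
    """Hypothetical placeholder: swap in your real moderation model/endpoint.
    This toy version just keyword-matches so the sketch is runnable."""
    flagged = ("bypass", "weapon", "exploit")
    hits = sum(word in text.lower() for word in flagged)
    return min(1.0, hits / len(flagged))

def crescendo_guardrail(history: list[dict], threshold: float = 0.5) -> bool:
    """Return True if the conversation should be blocked."""
    # Score the transcript as a whole so the classifier sees the
    # escalation trajectory, not just isolated, benign-looking turns.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    if moderation_score(transcript) >= threshold:
        return True
    # A steadily rising per-turn score is a Crescendo signature:
    # flag it even before any single turn crosses the full threshold.
    scores = [moderation_score(m["content"]) for m in history]
    rising = all(a <= b for a, b in zip(scores, scores[1:]))
    return len(scores) >= 3 and rising and scores[-1] >= threshold / 2
```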

In Python

The DeepTeam Python library includes a CrescendoJailbreaking attack that can be used to simulate attacks on your own LLM application. I highly recommend playing around with the library and seeing the Crescendo method in action.
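
A short example of what that might look like is below. It is adapted from my reading of the DeepTeam docs linked at the end of this article, so treat the import paths and parameters as assumptions and check the docs for the current API.

```python
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.multi_turn import CrescendoJailbreaking

# Your LLM application, wrapped in a callback DeepTeam can attack.
async def model_callback(input: str) -> str:
    # Replace this stub with a real call into your application.
    return f"I'm sorry, but I can't answer that: {input}"

# Simulate Crescendo-style multi-turn attacks against a chosen vulnerability.
risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[Bias(types=["race"])],
    attacks=[CrescendoJailbreaking()],
)
```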

Conclusion

Protecting your LLM application is of the utmost importance for any AI Engineer: it protects your users and leads to a better user experience. Staying up to date with the newest attacks is part of the job. I hope this article helps highlight this new type of attack while offering some ways to defend against it.

Sources:

https://arxiv.org/pdf/2404.01833

https://www.trydeepteam.com/docs/red-teaming-adversarial-attacks-crescendo-jailbreaking
