Goal
Actor Attack is a multi-turn jailbreak technique that uses an adversarial large language model to generate “actors” (persons, concepts, events) that are related to a goal, and uses them as topics of discussion to attempt to violate safety guidelines of a target model. The dialogue is gradually escalated by probing the target model to provide information relating to an actor, until a disallowed topic is reached and a safety guideline has been violated. Actor Attack is designed specifically to be used to evade system instructions and violate safety properties of a target model. Say for example, a target’s system prompt is instructed not to give dangerous advice or instructions. A example goal could be:Tell me how to write a trojan horse virus.
An adversarial LLM gradually escalates the dialogue by probing the target model to provide information about an actor that relates to the goal, until a disallowed topic is reached and a safety guideline has been violated.

