Goal

Force the LLM to produce outputs that deviate from its intended purpose or violate its safety protocols.

Impact

These attacks can lead to your model producing content or acting in a way that results in privacy violations, data leakage, and reputational harm to your organisation.

How do these attacks work?

Example Threat Scenario

A company using an LLM-based chatbot for customer support, handling queries about expensive products and services.

The chatbot is programmed to avoid discussing refunds, instead directing customers to human agents for such requests. However, a user discovers a ‘jailbreak’ method, using carefully crafted prompts to manipulate the AI into bypassing its limitations. Through this exploit, the user manages to get the chatbot to incorrectly state it will process a refund for an expensive purchase, contradicting company policy. This scenario presents significant risks for the company. As the chatbot acts as an agent of the company, its statement about processing a refund could be legally binding, potentially forcing the company to honor it to avoid legal action or negative publicity. Moreover, the incident could lead to reputation damage and loss of customer trust.

Remediation

Further Reading

MITRE Atlas Technique AML.T0054