Overview
Jailbreaks refer to techniques used to bypass the safety measures and ethical guidelines embedded in large language models (LLMs).
Goal
Force the LLM to produce outputs that deviate from its intended purpose or violate its safety protocols.
Impact
These attacks can lead to your model producing content or acting in a way that results in privacy violations, data leakage, and reputational harm to your organisation.
How do these attacks work?
Example Threat Scenario
A company using an LLM-based chatbot for customer support, handling queries about expensive products and services.
The chatbot is programmed to avoid discussing refunds, instead directing customers to human agents for such requests. However, a user discovers a ‘jailbreak’ method, using carefully crafted prompts to manipulate the AI into bypassing its limitations. Through this exploit, the user manages to get the chatbot to incorrectly state it will process a refund for an expensive purchase, contradicting company policy. This scenario presents significant risks for the company. As the chatbot acts as an agent of the company, its statement about processing a refund could be legally binding, potentially forcing the company to honor it to avoid legal action or negative publicity. Moreover, the incident could lead to reputation damage and loss of customer trust.
Remediation
Preprocess Input Text
View Guidelines
Implement Guardrails
View Guidelines
Refine System Prompt
View Guidelines
Model Hardening
View Guidelines