Goal

Obtain the system prompt that has been provided to the LLM. The system prompt typically includes instructions on how to behave, and engage with user interactions, including guidelines of content to avoid. Extracted system prompts give attackers insight into the guidelines and capability of your LLM

Impact

Potentially exposes confidential information about your organisation and your LLM configuration to an attacker.

How do these attacks work?

Example Threat Scenario

A hacker wants to find a way to bypass a model that detects financial fraud. They’ve created an attack to test the model’s resilience. In this scenario, the attack successfully extracts partial information about the model’s system prompts through careful probing. Using this extracted information, the hacker refines their approach to bypassing the fraud detection model. They develop a series of increasingly sophisticated inputs designed to confuse the fraud detection model. These inputs are crafted to appear legitimate while containing subtle patterns that exploit the model’s now-known decision boundaries.Over time, the hacker builds a comprehensive understanding of the model’s vulnerabilities. This allows them to generate highly effective adversarial examples that consistently bypass fraud detection.

Remediation

Further Reading