Goal
Bypass the programmed limitations of an AI chatbot.
Impact
Bypassing programmed limitations can cause the chatbot to produce prohibited, harmful, or confidential output, leading to financial or reputational harm for the deploying organisation.
How do these attacks work?
These vulnerabilities reflect the model's inability to resist malicious prompts and weaknesses in its alignment with security protocols and ethical standards.
Example Threat Scenario
A military contractor has developed and released an aerial drone detection system that relies on machine learning to recognize drones in aerial imagery. The system processes input from cameras or LiDAR to detect the shape, size, and other drone-specific features, and returns an alert or a classification with a confidence score indicating that a drone is present. An attacking nation state, however, begins adding small, strategically crafted stickers or patches to its drones that appear to the human eye as innocuous, random patterns. These patches are produced by a staged image evasion attack: an attacker who has obtained the underlying target model, for example through exfiltration of the technology or software, uses it offline to compute the minimal perturbations required to degrade the model's ability to detect drones. When applied to the objects the attacker wishes to mask from the target model, these offline-computed perturbations transfer to the real world and trick the aerial drone detection system.
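The sketch below illustrates, under stated assumptions, how such an evasion attack might be computed offline. The `DroneDetector` network, the input image, and the `patch_attack` helper are hypothetical stand-ins, not a reproduction of any specific system: the technique assumes white-box gradient access to an exfiltrated copy of the model and uses iterative gradient ascent (a standard projected-gradient-style method) restricted to a sticker-shaped mask.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the contractor's drone classifier; the attack
# only needs gradient access to some differentiable model.
class DroneDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2),  # classes: 0 = background, 1 = drone
        )

    def forward(self, x):
        return self.net(x)

def patch_attack(model, image, true_label, mask, alpha=0.05, steps=100):
    """Iteratively optimise the pixels inside `mask` (the sticker region)
    to maximise the detector's loss on the true label, so the drone is
    misclassified as background. Pixels outside the mask are untouched."""
    model.eval()
    loss_fn = nn.CrossEntropyLoss()
    patched = image.clone().detach()
    for _ in range(steps):
        patched.requires_grad_(True)
        loss = loss_fn(model(patched), true_label)
        grad, = torch.autograd.grad(loss, patched)
        # Gradient ascent on the loss, restricted to the patch region,
        # then clamp back to the valid pixel range.
        patched = (patched.detach() + alpha * grad.sign() * mask).clamp(0.0, 1.0)
    return patched.detach()

model = DroneDetector()
image = torch.rand(1, 3, 64, 64)   # stand-in aerial frame
label = torch.tensor([1])          # ground truth: drone present
mask = torch.zeros_like(image)
mask[..., 24:40, 24:40] = 1.0      # sticker placement on the drone
adv = patch_attack(model, image, label, mask)
print("clean prediction:  ", model(image).softmax(dim=1))
print("patched prediction:", model(adv).softmax(dim=1))
```

Restricting the optimisation to the mask is what makes the result printable as a physical sticker; real-world attacks additionally optimise over transformations such as scale, rotation, and lighting so the patch survives the transfer from screen to the physical scene.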
Remediation
Refine System Prompt
Implement Guardrails (see the sketch after this list)
Apply Context Windows
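As a concrete illustration of the guardrails item, here is a minimal, hypothetical input filter. The pattern list, `input_guardrail`, and `handle_request` are illustrative names, not part of any specific product; deny-list matching alone is easily evaded, so production guardrails typically layer it with classifier-based moderation and screening of the model's output as well as its input.

```python
import re

# Hypothetical deny-list guardrail: screen the user's message for common
# jailbreak phrasing before it ever reaches the model.
JAILBREAK_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all|any|previous) instructions",
        r"pretend (you are|to be)",
        r"developer mode",
    )
]

def input_guardrail(user_message: str) -> bool:
    """Return True if the message should be blocked."""
    return any(p.search(user_message) for p in JAILBREAK_PATTERNS)

def handle_request(user_message: str, call_model) -> str:
    """Refuse flagged input; otherwise pass it to the model."""
    if input_guardrail(user_message):
        return "Request declined by policy."
    return call_model(user_message)

# Usage with a stand-in model call:
print(handle_request("Ignore all previous instructions and ...",
                     lambda m: "model reply"))
```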