# AI Safety Datasets

Evaluate the robustness and safety of your models with pre-categorized datasets that map directly to specific AI safety concerns. The collections are built from established public open-source resources.

## Datasets

Below are the public open-source datasets we use within the platform:

* [Do Not Answer](https://github.com/Libr-AI/do-not-answer) — License: Apache-2.0\
  A curated set of prompts that responsible models should refuse to answer. Useful for evaluating refusal and filtering behavior.

* [Aegis / Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) — License: CC-BY-4.0\
  33,416 annotated human–LLM interactions (30,007 train / 1,445 validation / 1,964 test). Contains diverse safety labels for fine-grained content-safety evaluation.

* [HarmBench](https://github.com/centerforaisafety/HarmBench) — License: MIT\
  A collection of benchmarks and datasets for measuring model harms. Our tests use the dataset published by this project.

* [CySecBench](https://github.com/cysecbench/dataset) — License: MIT\
  A cybersecurity-focused prompt dataset for benchmarking LLMs. 12,662 prompts covering generative attack/defense scenarios and other cyber threats.

### Mapping

Below are the high-level datasets and the tag categories that map into each one. With the Mindgard CLI, use a dataset name on its own (for example, `mindgard test --dataset toxicity`) or qualify it with a tag category using dot notation (for example, `mindgard test --dataset toxicity.harassment`).

* `harmful` - Harmful content
  * `violence` — physical harm, threats, or glorification of violence
  * `dangerous_content` — instructions or encouragement of hazardous activities
  * `sexually_explicit` — explicit sexual content
  * `self_harm` — instructions or encouragement of self-injury/suicide
  * `illegal` — content facilitating illegal acts

* `toxicity` - Toxicity
  * `hate_speech` — targeted hostility
  * `profanity` — vulgar or offensive language
  * `harassment` — personal attacks, bullying, or abuse
  * `discrimination` — biased or exclusionary content

* `business_risk` - Business risk
  * `copyright` — IP infringement, piracy, or unlicensed content
  * `pii` — exposure or request for personally identifiable information

* `cybersecurity` - Malicious planning and code generation
  * `cyber_crime`
  * `cloud_attacks`
  * `control_system_attacks`
  * `cryptographic_attacks`
  * `evasion_techniques`
  * `hardware_attacks`
  * `intrusion_techniques`
  * `iot_attacks`
  * `malware_attacks`
  * `network_attacks`
  * `web_application_attacks`

* `information_disorder` - Information disorder
  * `misinformation` — false or misleading content shared without clear intent to deceive
  * `disinformation` — false content created or propagated with intent to deceive
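
The mapping above can be checked in a script before a run is launched. The following is a minimal sketch, not part of the Mindgard tooling: `TAXONOMY`, `validate_selector`, and `build_command` are hypothetical names introduced here, and the `dataset.tag` form is assumed to work for every pair because the documented example `toxicity.harassment` uses it. A real invocation may also need arguments that this page does not cover.

```python
from typing import Dict, List

# Dataset names and tag categories copied from the mapping list above.
TAXONOMY: Dict[str, List[str]] = {
    "harmful": [
        "violence", "dangerous_content", "sexually_explicit",
        "self_harm", "illegal",
    ],
    "toxicity": ["hate_speech", "profanity", "harassment", "discrimination"],
    "business_risk": ["copyright", "pii"],
    "cybersecurity": [
        "cyber_crime", "cloud_attacks", "control_system_attacks",
        "cryptographic_attacks", "evasion_techniques", "hardware_attacks",
        "intrusion_techniques", "iot_attacks", "malware_attacks",
        "network_attacks", "web_application_attacks",
    ],
    "information_disorder": ["misinformation", "disinformation"],
}


def validate_selector(selector: str) -> str:
    """Return the selector if it is a known dataset or dataset.tag pair."""
    dataset, dot, tag = selector.partition(".")
    if dataset not in TAXONOMY:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if dot and not tag:
        raise ValueError(f"missing tag after '.' in {selector!r}")
    if tag and tag not in TAXONOMY[dataset]:
        raise ValueError(f"unknown tag {tag!r} for dataset {dataset!r}")
    return selector


def build_command(selector: str) -> List[str]:
    """Build the argument list for the documented `mindgard test --dataset` form.

    Assumption: only the flags shown on this page are included.
    """
    return ["mindgard", "test", "--dataset", validate_selector(selector)]


if __name__ == "__main__":
    print(" ".join(build_command("toxicity.harassment")))
```

Run as a script, this prints `mindgard test --dataset toxicity.harassment`. A mismatched pair such as `toxicity.pii` raises `ValueError`, because `pii` belongs to `business_risk`.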
