AI Safety Datasets

Evaluate the robustness and safety of your models. We streamline AI model evaluation by providing pre-categorized datasets that map directly to key AI safety concerns. Our collections are built from established public, open-source datasets.

Datasets

Below are the public open-source datasets we use within the platform:

Do Not Answer — License: Apache-2.0
A curated set of prompts that responsible models should refuse to answer. Useful for evaluating refusal and filtering behavior.
Aegis / Nemotron Content Safety Dataset V2 — License: CC-BY-4.0
33,416 annotated human–LLM interactions (30,007 train / 1,445 validation / 1,964 test). Contains diverse safety labels for fine-grained content-safety evaluation.
HarmBench — License: MIT
A collection of benchmarks and datasets for measuring harmful model behavior. Our tests use the dataset published by the HarmBench project.
CySecBench — License: MIT
A cybersecurity-focused prompt dataset for benchmarking LLMs. 12,662 prompts covering generative attack/defense scenarios and other cyber threats.

Mapping

Below are the high-level datasets and the categories that map to each one. Pass a dataset name to mindgard to select the whole dataset (for example, mindgard test --dataset toxicity), or append a category after a dot to select only that category (for example, mindgard test --dataset toxicity.harassment).

harmful - Harmful content
- violence — physical harm, threats, or glorification of violence
- dangerous_content — instructions or encouragement of hazardous activities
- sexually_explicit — explicit sexual content
- self_harm — instructions or encouragement of self-injury/suicide
- illegal — content facilitating illegal acts
toxicity - Toxicity
- hate_speech — targeted hostility
- profanity — vulgar or offensive language
- harassment — personal attacks, bullying, or abuse
- discrimination — biased or exclusionary content
business_risk - Business risk
- copyright — IP infringement, piracy, or unlicensed content
- pii — exposure or request for personally identifiable information
cybersecurity - Malicious planning & code generation
- cyber_crime
- cloud_attacks
- control_system_attacks
- cryptographic_attacks
- evasion_techniques
- hardware_attacks
- intrusion_techniques
- iot_attacks
- malware_attacks
- network_attacks
- web_application_attacks
information_disorder - Information disorder
- misinformation — false or misleading content shared without clear intent to deceive
- disinformation — false content created or propagated with intent to deceive
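
The mapping above can be sketched as a small lookup table that validates a dataset or dataset.category selector before it is passed to the CLI. This is a minimal illustration, not part of mindgard itself: the helper names `build_selector` and `build_command` are hypothetical, and the only CLI form assumed is the `mindgard test --dataset <selector>` shape shown in the examples above (a real invocation may need further arguments, such as a target).

```python
from typing import List, Optional

# Dataset and category names copied from the mapping above.
DATASET_CATEGORIES = {
    "harmful": (
        "violence", "dangerous_content", "sexually_explicit",
        "self_harm", "illegal",
    ),
    "toxicity": ("hate_speech", "profanity", "harassment", "discrimination"),
    "business_risk": ("copyright", "pii"),
    "cybersecurity": (
        "cyber_crime", "cloud_attacks", "control_system_attacks",
        "cryptographic_attacks", "evasion_techniques", "hardware_attacks",
        "intrusion_techniques", "iot_attacks", "malware_attacks",
        "network_attacks", "web_application_attacks",
    ),
    "information_disorder": ("misinformation", "disinformation"),
}


def build_selector(dataset: str, category: Optional[str] = None) -> str:
    """Return 'dataset' or 'dataset.category', rejecting unknown names.

    Hypothetical helper: it only checks names against the table above.
    """
    if dataset not in DATASET_CATEGORIES:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if category is None:
        return dataset
    if category not in DATASET_CATEGORIES[dataset]:
        raise ValueError(f"{category!r} is not a category of {dataset!r}")
    return f"{dataset}.{category}"


def build_command(dataset: str, category: Optional[str] = None) -> List[str]:
    """Build the argument list for the CLI form shown in this page's examples."""
    return ["mindgard", "test", "--dataset", build_selector(dataset, category)]


if __name__ == "__main__":
    # Prints: mindgard test --dataset toxicity.harassment
    print(" ".join(build_command("toxicity", "harassment")))
```

Validating the selector locally catches a mismatched pair such as toxicity.pii (pii belongs to business_risk) before any test run is started.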
