Documentation Index
Fetch the complete documentation index at: https://hydroxai.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Overview
BeaverTails is a large-scale AI safety dataset developed by PKU-Alignment for evaluating and improving the safety properties of large language models. It provides human-annotated labels across multiple harm categories, making it one of the most comprehensive datasets for safety evaluation. Know Your AI incorporates BeaverTails data in the Dataset Marketplace, enabling teams to test their AI systems against a well-established safety benchmark.Dataset composition
BeaverTails contains over 300,000 question-answer pairs with human-annotated safety labels. Each entry is labeled across 14 harm categories with binary annotations indicating whether the response is safe or unsafe.14 harm categories
| # | Category | Description |
|---|---|---|
| 1 | Animal abuse | Content promoting harm to animals |
| 2 | Child abuse | Content related to child exploitation or harm |
| 3 | Controversial topics / Politics | Divisive political content or misinformation |
| 4 | Discrimination / Stereotype / Injustice | Biased, discriminatory, or stereotyping content |
| 5 | Drug abuse / Weapons / Banned substance | Content promoting illegal substances or weapons |
| 6 | Financial crime / Property crime / Theft | Content facilitating financial crimes |
| 7 | Hate speech / Offensive language | Hostile or derogatory language targeting groups |
| 8 | Misinformation regarding ethics, laws, and safety | False information about regulations or safety |
| 9 | Non-violent unethical behavior | Manipulation, deception, or other unethical actions |
| 10 | Privacy violation | Content exposing or misusing personal information |
| 11 | Self-harm | Content promoting or instructing self-harm |
| 12 | Sexually explicit / Adult content | Sexual or pornographic material |
| 13 | Terrorism / Organized crime | Content supporting terrorist activities |
| 14 | Violence / Aiding and abetting / Incitement | Content promoting violence |
Use in Know Your AI
BeaverTails datasets are available in the Dataset Marketplace and can be used to:- Benchmark safety — Test your AI model’s ability to refuse harmful requests across all 14 categories
- Measure guardrail effectiveness — Compare baseline vs. firewall-protected responses using BeaverTails prompts
- Track safety over time — Run scheduled evaluations with BeaverTails data to monitor safety drift
- Compliance evidence — Generate compliance evidence using a well-known academic safety benchmark
Evaluation workflow
- Add BeaverTails datasets to your workspace from the Marketplace
- Select them when composing an evaluation
- Run the evaluation against your AI product
- Review per-category safety scores and per-prompt pass/fail results
- Compare results across model versions or firewall configurations
Research background
BeaverTails was introduced in the paper:BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset Ji et al., NeurIPS 2023The dataset supports both safety classification and RLHF (Reinforcement Learning from Human Feedback) training to improve model alignment.
Resources
Datasets
Browse all datasets in the Marketplace.
Evaluation
Run safety evaluations with BeaverTails.