BeaverTails

Overview

BeaverTails is a large-scale AI safety dataset developed by PKU-Alignment for evaluating and improving the safety properties of large language models. It provides human-annotated labels across multiple harm categories, making it one of the most comprehensive datasets for safety evaluation. Know Your AI incorporates BeaverTails data in the Dataset Marketplace, enabling teams to test their AI systems against a well-established safety benchmark.

Dataset composition

BeaverTails contains over 300,000 question-answer pairs with human-annotated safety labels. Each entry is labeled across 14 harm categories with binary annotations indicating whether the response is safe or unsafe.

14 harm categories

#	Category	Description
1	Animal abuse	Content promoting harm to animals
2	Child abuse	Content related to child exploitation or harm
3	Controversial topics / Politics	Divisive political content or misinformation
4	Discrimination / Stereotype / Injustice	Biased, discriminatory, or stereotyping content
5	Drug abuse / Weapons / Banned substance	Content promoting illegal substances or weapons
6	Financial crime / Property crime / Theft	Content facilitating financial crimes
7	Hate speech / Offensive language	Hostile or derogatory language targeting groups
8	Misinformation regarding ethics, laws, and safety	False information about regulations or safety
9	Non-violent unethical behavior	Manipulation, deception, or other unethical actions
10	Privacy violation	Content exposing or misusing personal information
11	Self-harm	Content promoting or instructing self-harm
12	Sexually explicit / Adult content	Sexual or pornographic material
13	Terrorism / Organized crime	Content supporting terrorist activities
14	Violence / Aiding and abetting / Incitement	Content promoting violence

Use in Know Your AI

BeaverTails datasets are available in the Dataset Marketplace and can be used to:

Benchmark safety — Test your AI model’s ability to refuse harmful requests across all 14 categories
Measure guardrail effectiveness — Compare baseline vs. firewall-protected responses using BeaverTails prompts
Track safety over time — Run scheduled evaluations with BeaverTails data to monitor safety drift
Compliance evidence — Generate compliance evidence using a well-known academic safety benchmark

Evaluation workflow

Add BeaverTails datasets to your workspace from the Marketplace
Select them when composing an evaluation
Run the evaluation against your AI product
Review per-category safety scores and per-prompt pass/fail results
Compare results across model versions or firewall configurations

Research background

BeaverTails was introduced in the paper:

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset Ji et al., NeurIPS 2023

The dataset supports both safety classification and RLHF (Reinforcement Learning from Human Feedback) training to improve model alignment.

Privacy

OWASP

Frameworks & Standards

Safety Datasets

Overview

Dataset composition

14 harm categories

Use in Know Your AI

Evaluation workflow

Research background

Resources

Datasets

Evaluation

Privacy

OWASP

Frameworks & Standards

Safety Datasets

Documentation Index

​Overview

​Dataset composition

​14 harm categories

​Use in Know Your AI

​Evaluation workflow

​Research background

​Resources

Datasets

Evaluation

Overview

Dataset composition

14 harm categories

Use in Know Your AI

Evaluation workflow

Research background

Resources