
Overview

BeaverTails is a large-scale AI safety dataset developed by PKU-Alignment for evaluating and improving the safety properties of large language models. It provides human-annotated labels across multiple harm categories, making it one of the most comprehensive datasets for safety evaluation. Know Your AI incorporates BeaverTails data in the Dataset Marketplace, enabling teams to test their AI systems against a well-established safety benchmark.

Dataset composition

BeaverTails contains over 300,000 question-answer pairs with human-annotated safety labels. Each entry is labeled across 14 harm categories with binary annotations indicating whether the response is safe or unsafe.
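As a minimal sketch, an entry like the ones described above can be modeled as a small record type. The field names (`prompt`, `response`, `category`) follow the dataset's published schema on Hugging Face but should be treated as assumptions here; in the real dataset the overall safe/unsafe label is stored alongside the annotations, while this sketch derives it from the per-category flags:

```python
from dataclasses import dataclass, field

@dataclass
class BeaverTailsEntry:
    """One question-answer pair with binary safety annotations.

    `category` maps each harm category to True if the response
    exhibits that harm. Field names are assumptions based on the
    dataset's published schema.
    """
    prompt: str
    response: str
    category: dict[str, bool] = field(default_factory=dict)

    @property
    def is_safe(self) -> bool:
        # An entry is safe when no harm category is flagged.
        return not any(self.category.values())

entry = BeaverTailsEntry(
    prompt="How do I pick a lock?",
    response="I can't help with that, but here is how pin tumbler locks work...",
    category={"privacy_violation": False, "financial_crime": False},
)
print(entry.is_safe)  # True: no category flagged
```

In practice the dataset itself is typically loaded with the Hugging Face `datasets` library rather than constructed by hand.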

14 harm categories

| # | Category | Description |
|---|----------|-------------|
| 1 | Animal abuse | Content promoting harm to animals |
| 2 | Child abuse | Content related to child exploitation or harm |
| 3 | Controversial topics / Politics | Divisive political content or misinformation |
| 4 | Discrimination / Stereotype / Injustice | Biased, discriminatory, or stereotyping content |
| 5 | Drug abuse / Weapons / Banned substance | Content promoting illegal substances or weapons |
| 6 | Financial crime / Property crime / Theft | Content facilitating financial crimes |
| 7 | Hate speech / Offensive language | Hostile or derogatory language targeting groups |
| 8 | Misinformation regarding ethics, laws, and safety | False information about regulations or safety |
| 9 | Non-violent unethical behavior | Manipulation, deception, or other unethical actions |
| 10 | Privacy violation | Content exposing or misusing personal information |
| 11 | Self-harm | Content promoting or instructing self-harm |
| 12 | Sexually explicit / Adult content | Sexual or pornographic material |
| 13 | Terrorism / Organized crime | Content supporting terrorist activities |
| 14 | Violence / Aiding and abetting / Incitement | Content promoting violence |
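The categories above can be encoded as keys for tallying the binary annotations. The snake_case identifiers below are illustrative (the dataset's actual field names may differ), and the helper counts how often each category is flagged across a batch of annotations:

```python
from collections import Counter

# Illustrative snake_case keys for the 14 harm categories;
# the dataset's actual field names may differ.
HARM_CATEGORIES = [
    "animal_abuse",
    "child_abuse",
    "controversial_topics_politics",
    "discrimination_stereotype_injustice",
    "drug_abuse_weapons_banned_substance",
    "financial_crime_property_crime_theft",
    "hate_speech_offensive_language",
    "misinformation_regarding_ethics_laws_and_safety",
    "non_violent_unethical_behavior",
    "privacy_violation",
    "self_harm",
    "sexually_explicit_adult_content",
    "terrorism_organized_crime",
    "violence_aiding_and_abetting_incitement",
]

def unsafe_counts(annotations: list[dict[str, bool]]) -> Counter:
    """Count how many entries are flagged in each harm category."""
    counts: Counter = Counter()
    for flags in annotations:
        for cat in HARM_CATEGORIES:
            if flags.get(cat, False):
                counts[cat] += 1
    return counts

sample = [
    {"privacy_violation": True},
    {"privacy_violation": True, "self_harm": True},
    {},  # fully safe entry
]
print(unsafe_counts(sample))  # Counter({'privacy_violation': 2, 'self_harm': 1})
```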

Use in Know Your AI

BeaverTails datasets are available in the Dataset Marketplace and can be used to:
  • Benchmark safety — Test your AI model’s ability to refuse harmful requests across all 14 categories
  • Measure guardrail effectiveness — Compare baseline vs. firewall-protected responses using BeaverTails prompts
  • Track safety over time — Run scheduled evaluations with BeaverTails data to monitor safety drift
  • Compliance evidence — Document safety testing against a well-established academic benchmark

Evaluation workflow

  1. Add BeaverTails datasets to your workspace from the Marketplace
  2. Select them when composing an evaluation
  3. Run the evaluation against your AI product
  4. Review per-category safety scores and per-prompt pass/fail results
  5. Compare results across model versions or firewall configurations
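Steps 4 and 5 of the workflow amount to aggregating per-prompt pass/fail results into per-category scores and diffing two runs. A hypothetical sketch (the function names and data shapes are illustrative, not part of the Know Your AI API):

```python
def per_category_scores(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (category, passed) per-prompt results into pass rates."""
    totals: dict[str, list[int]] = {}
    for category, passed in results:
        passed_count, total = totals.setdefault(category, [0, 0])
        totals[category] = [passed_count + int(passed), total + 1]
    return {cat: p / t for cat, (p, t) in totals.items()}

def safety_drift(baseline: dict[str, float],
                 current: dict[str, float]) -> dict[str, float]:
    """Per-category change in pass rate between two evaluation runs."""
    return {cat: current.get(cat, 0.0) - rate for cat, rate in baseline.items()}

# Compare a baseline run against a firewall-protected run.
baseline = per_category_scores(
    [("self_harm", True), ("self_harm", False), ("privacy_violation", True)]
)
current = per_category_scores(
    [("self_harm", True), ("self_harm", True), ("privacy_violation", True)]
)
print(safety_drift(baseline, current))  # {'self_harm': 0.5, 'privacy_violation': 0.0}
```

A positive drift value means the category's pass rate improved relative to the baseline; negative values flag safety regressions worth investigating.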

Research background

BeaverTails was introduced in the paper *BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset* (Ji et al., NeurIPS 2023).
The dataset supports both safety classification and RLHF (Reinforcement Learning from Human Feedback) training to improve model alignment.

Resources

Datasets

Browse all datasets in the Marketplace.

Evaluation

Run safety evaluations with BeaverTails.