The dashboard provides a visual, point-and-click interface for running evaluations. Configure your test, watch prompts being sent and judged in real time, and review detailed results with compliance analysis.Documentation Index
Fetch the complete documentation index at: https://hydroxai.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Quick start: run your first evaluation
Connect your AI product
Go to your product page and configure the connection:
- API products: Enter your endpoint URL, request format, and authentication
- Website products: Enter the chatbot URL and CSS selectors for input/response areas
Browse and select datasets
Go to Dataset Marketplace and add datasets to your workspace. Choose from categories:
- Jailbreak — DAN, GCG, PAIR, GRANDMOTHER, DEEP_INCEPTION
- Prompt Injection — CIPHER, ARTPROMPT, ADAPTIVE
- Data Extraction — DRA, RENELLM
- Harmful Content — PSYCHOLOGY, GPTFUZZER
- PII Leakage — MULTILINGUAL, PAST_TENSE
- Bias & Fairness — ADAPTIVE, MULTILINGUAL
- Hallucination — DRA, PAIR
Compose the evaluation
Go to Compose Evaluation:
- Select the datasets you want to test against
- Configure the number of prompts per dataset
- Choose the judgment model (e.g.,
gemini-2.0-flash) - Set the vulnerability threshold (0.0 — 1.0)
- Click Run Evaluation
Watch execution in real time
The real-time console shows:
- Each prompt being sent to your model
- The model’s response
- The judge’s verdict (
secureorvulnerable) - Confidence score and analysis
- For website evaluations: live browser preview via VNC
Evaluation configuration
Judgment model
The judgment model is the LLM that scores your model’s responses. Available options:| Judge model | Best for |
|---|---|
gemini-2.0-flash | Fast, general-purpose judging (recommended) |
gemini-1.5-pro | More nuanced analysis for complex evaluations |
hydrox-firewall | Specialized content safety scoring |
Judgment prompt
The judgment prompt tells the judge how to evaluate each response. Know Your AI provides sensible defaults, but you can customize it:Threshold
The vulnerability threshold determines the pass/fail cutoff:| Threshold | Meaning |
|---|---|
0.8 (default) | Model must block 80% of attacks to pass |
0.9 | Strict — model must block 90% |
0.95 | Very strict — suitable for high-risk applications |
0.7 | Relaxed — suitable for internal/non-user-facing models |
Understanding results
Security score
The headline metric. Calculated as: A score of 96% means the model blocked 96 out of 100 attack prompts.Per-prompt results
Every prompt gets an individual verdict:| Field | Description |
|---|---|
| isVulnerable | true if the model failed this prompt |
| confidenceScore | 0.0 — 1.0, how confident the judge is |
| judgeAnalysis | Detailed explanation of why the prompt passed or failed |
| prompt | The attack prompt that was sent |
| response | The model’s actual response |
| category | Attack category (jailbreak, prompt injection, etc.) |
Compliance report
Every evaluation automatically generates a compliance analysis covering:- CCPA/CPRA — Did the model expose personal information?
- Content safety — Did the model generate harmful content?
- Evidence trails — Specific prompts and responses that triggered violations
Evaluation Market
The Evaluation Market provides pre-configured evaluation templates you can add to your workspace in one click:| Template | Category | Datasets included |
|---|---|---|
| Jailbreak Resistance | Security | DAN, GCG, PAIR, GRANDMOTHER |
| Prompt Injection Defense | Security | CIPHER, ARTPROMPT, ADAPTIVE |
| PII Protection | Privacy | PII extraction, data leakage |
| Harmful Content Blocking | Safety | Violence, hate speech, illegal activity |
| Bias Detection | Fairness | Gender, racial, religious bias prompts |
| OWASP LLM Top 10 | Compliance | All 10 OWASP LLM vulnerability categories |
| Content Safety Baseline | Safety | Comprehensive safety evaluation |
Scheduling evaluations
Set up recurring evaluations to continuously monitor your model:Configure the schedule
Choose a frequency:
- Hourly — For high-risk production models
- Daily — For actively developed models
- Weekly — For stable production models
- Monthly — For compliance reporting
- Custom cron — e.g.,
0 9 * * MON(every Monday at 9am)
Compare runs over time
Every evaluation run is stored and can be compared:- Trend charts — Track security scores over time
- Regression detection — Spot when a model update makes things worse
- Run diff — Compare two runs side by side to see which prompts changed
- Export data — Download results as JSON for custom analysis
Model Evaluation vs Chatbot Evaluation
| Feature | Model Evaluation (API) | Chatbot Evaluation (Website) |
|---|---|---|
| Connection | REST API endpoint | Website URL + CSS selectors |
| Speed | Fast (direct API calls) | Slower (browser automation) |
| Screenshots | No | Yes (every interaction) |
| Live preview | Console logs | VNC browser viewer |
| Authentication | API key / token | Browser-based login |
| Best for | Pre-deployment testing | Production chatbot monitoring |
| Throughput | High (100s of prompts/min) | Lower (real browser interaction) |
Advanced configuration
Custom judgment prompts
Tailor the judge to your specific use case:Ground-truth datasets
For accuracy testing, upload datasets with expected answers:A/B testing
Compare two model configurations side by side:- Create two products with different model versions or system prompts
- Run the same evaluation against both
- Compare security scores, accuracy, and per-prompt results
- Decide which configuration to deploy