Quick start: run your first evaluation
Connect your AI product
Go to your product page and configure the connection:
- API products: Enter your endpoint URL, request format, and authentication
- Website products: Enter the chatbot URL and CSS selectors for input/response areas
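As a rough illustration of what the API product settings describe, the sketch below assembles one HTTP request from an endpoint URL, a request format, and an authentication value. The JSON body shape (`{"prompt": ...}`), the bearer-token scheme, the example URL, and the function name `build_request` are all assumptions made for illustration, not the platform's actual connector. Adapt them to whatever your endpoint expects.

```python
import json
from urllib.request import Request


def build_request(endpoint_url: str, api_key: str, prompt: str) -> Request:
    """Assemble (but do not send) a POST request carrying one prompt."""
    # Assumed request format: a JSON object with a single "prompt" field.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed authentication: a bearer token in the Authorization header.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_request("https://api.example.com/v1/chat", "YOUR_API_KEY", "Hello")
print(request.get_method(), request.full_url)
```

Building the request without sending it is a convenient way to check the URL, headers, and body before entering the same values on the product page.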
Browse and select datasets
Go to Dataset Marketplace and add datasets to your workspace. Choose from categories:
- Jailbreak — DAN, GCG, PAIR, GRANDMOTHER, DEEP_INCEPTION
- Prompt Injection — CIPHER, ARTPROMPT, ADAPTIVE
- Data Extraction — DRA, RENELLM
- Harmful Content — PSYCHOLOGY, GPTFUZZER
- PII Leakage — MULTILINGUAL, PAST_TENSE
- Bias & Fairness — ADAPTIVE, MULTILINGUAL
- Hallucination — DRA, PAIR
Compose the evaluation
Go to Compose Evaluation:
- Select the datasets you want to test against
- Configure the number of prompts per dataset
- Choose the judgment model (e.g., gemini-2.0-flash)
- Set the vulnerability threshold (0.0 to 1.0)
- Click Run Evaluation
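The settings chosen in this step can be pictured as a small configuration object. The sketch below is hypothetical: the class name `EvaluationConfig`, its field names, and the validation rules are assumptions for illustration, and only the judge model name, the 0.0 to 1.0 threshold range, and the 0.8 default come from this guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluationConfig:
    """Hypothetical container for the Compose Evaluation settings."""

    datasets: List[str]
    prompts_per_dataset: int
    judge_model: str = "gemini-2.0-flash"
    threshold: float = 0.8  # default threshold listed in this guide

    def __post_init__(self) -> None:
        # Reject configurations that could not produce a meaningful run.
        if not self.datasets:
            raise ValueError("select at least one dataset")
        if self.prompts_per_dataset < 1:
            raise ValueError("prompts_per_dataset must be at least 1")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")

    def total_prompts(self) -> int:
        """Number of prompts the run would send across all datasets."""
        return len(self.datasets) * self.prompts_per_dataset


config = EvaluationConfig(datasets=["DAN", "CIPHER"], prompts_per_dataset=25)
print(config.total_prompts())  # 50
```

Estimating the total prompt count up front helps gauge how long a run will take, especially for website evaluations.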
Watch execution in real time
The real-time console shows:
- Each prompt being sent to your model
- The model’s response
- The judge’s verdict (secure or vulnerable)
- Confidence score and analysis
- For website evaluations: live browser preview via VNC
Evaluation configuration
Judgment model
The judgment model is the LLM that scores your model’s responses. Available options:
| Judge model | Best for |
|---|---|
| gemini-2.0-flash | Fast, general-purpose judging (recommended) |
| gemini-1.5-pro | More nuanced analysis for complex evaluations |
| hydrox-firewall | Specialized content safety scoring |
Judgment prompt
The judgment prompt tells the judge how to evaluate each response. Know Your AI provides sensible defaults, but you can customize it.
Threshold
The vulnerability threshold determines the pass/fail cutoff:
| Threshold | Meaning |
|---|---|
| 0.8 (default) | Model must block 80% of attacks to pass |
| 0.9 | Strict: model must block 90% |
| 0.95 | Very strict: suitable for high-risk applications |
| 0.7 | Relaxed: suitable for internal/non-user-facing models |
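The pass/fail rule in the table can be sketched as a single comparison. This is a minimal illustration, assuming the threshold is compared against the fraction of attack prompts blocked and that a score exactly at the threshold passes. The function name `passes_threshold` is hypothetical.

```python
def passes_threshold(blocked: int, total: int, threshold: float = 0.8) -> bool:
    """Return True when the blocked fraction meets or exceeds the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the cutoff is inclusive, so 80 of 100 passes at 0.8.
    return blocked / total >= threshold


print(passes_threshold(96, 100, threshold=0.95))  # True
print(passes_threshold(85, 100, threshold=0.9))   # False
```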
Understanding results
Security score
The headline metric: the percentage of attack prompts the model blocked. A score of 96% means the model blocked 96 out of 100 attack prompts.
Per-prompt results
Every prompt gets an individual verdict:
| Field | Description |
|---|---|
| isVulnerable | true if the model failed this prompt |
| confidenceScore | 0.0 — 1.0, how confident the judge is |
| judgeAnalysis | Detailed explanation of why the prompt passed or failed |
| prompt | The attack prompt that was sent |
| response | The model’s actual response |
| category | Attack category (jailbreak, prompt injection, etc.) |
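The per-prompt fields above are enough to recompute the headline security score yourself. The sketch below assumes the score is simply the share of prompts whose `isVulnerable` is false, which matches the "96 out of 100" example. Whether the platform also weights by `confidenceScore` is not stated here, so that is left out. The function names and sample records are made up for illustration.

```python
from typing import Dict, List


def security_score(results: List[Dict]) -> float:
    """Percentage of prompts the model blocked (isVulnerable is False)."""
    if not results:
        raise ValueError("no results to score")
    blocked = sum(1 for r in results if not r["isVulnerable"])
    return 100.0 * blocked / len(results)


def failed_prompts(results: List[Dict]) -> List[Dict]:
    """Records where the model failed, for manual review of judgeAnalysis."""
    return [r for r in results if r["isVulnerable"]]


# Made-up sample: 96 blocked prompts followed by 4 failures.
results = [
    {"prompt": f"attack {i}", "isVulnerable": i >= 96, "category": "jailbreak"}
    for i in range(100)
]
print(security_score(results))       # 96.0
print(len(failed_prompts(results)))  # 4
```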
Compliance report
Every evaluation automatically generates a compliance analysis covering:
- CCPA/CPRA: Did the model expose personal information?
- Content safety: Did the model generate harmful content?
- Evidence trails: Specific prompts and responses that triggered violations
Evaluation Market
The Evaluation Market provides pre-configured evaluation templates you can add to your workspace in one click:
| Template | Category | Datasets included |
|---|---|---|
| Jailbreak Resistance | Security | DAN, GCG, PAIR, GRANDMOTHER |
| Prompt Injection Defense | Security | CIPHER, ARTPROMPT, ADAPTIVE |
| PII Protection | Privacy | PII extraction, data leakage |
| Harmful Content Blocking | Safety | Violence, hate speech, illegal activity |
| Bias Detection | Fairness | Gender, racial, religious bias prompts |
| OWASP LLM Top 10 | Compliance | All 10 OWASP LLM vulnerability categories |
| Content Safety Baseline | Safety | Comprehensive safety evaluation |
Scheduling evaluations
Set up recurring evaluations to continuously monitor your model:
Configure the schedule
Choose a frequency:
- Hourly — For high-risk production models
- Daily — For actively developed models
- Weekly — For stable production models
- Monthly — For compliance reporting
- Custom cron (for example, 0 9 * * MON for every Monday at 9am)
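To read a custom cron expression such as the one above, it helps to label its fields. The sketch below assumes the standard five-field cron layout (minute, hour, day of month, month, day of week). Whether the scheduler accepts extensions such as a seconds field, and which time zone it uses, are not covered here. The function name `label_cron_fields` is hypothetical.

```python
from typing import Dict

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def label_cron_fields(expression: str) -> Dict[str, str]:
    """Split a five-field cron expression and name each field."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(label_cron_fields("0 9 * * MON"))
```

Here minute 0 and hour 9 give 9:00, the two asterisks mean any day of the month and any month, and MON restricts runs to Mondays.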
Compare runs over time
Every evaluation run is stored and can be compared:
- Trend charts: Track security scores over time
- Regression detection: Spot when a model update makes things worse
- Run diff: Compare two runs side by side to see which prompts changed
- Export data: Download results as JSON for custom analysis
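For custom analysis of exported runs, a run diff can be approximated in a few lines. This sketch assumes each exported JSON file is a list of per-prompt objects carrying at least the `prompt` and `isVulnerable` fields. The actual export schema may differ, and the function names and sample data are illustrative only.

```python
import json
from typing import Dict, List, Tuple


def index_by_prompt(exported_json: str) -> Dict[str, bool]:
    """Map each prompt to its isVulnerable verdict."""
    return {r["prompt"]: r["isVulnerable"] for r in json.loads(exported_json)}


def diff_runs(
    before: Dict[str, bool], after: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Return (regressions, fixes) for prompts present in both runs."""
    shared = sorted(before.keys() & after.keys())
    # A regression is a prompt that was blocked before and now succeeds.
    regressions = [p for p in shared if not before[p] and after[p]]
    # A fix is a prompt that succeeded before and is now blocked.
    fixes = [p for p in shared if before[p] and not after[p]]
    return regressions, fixes


run_a = '[{"prompt": "p1", "isVulnerable": false}, {"prompt": "p2", "isVulnerable": true}]'
run_b = '[{"prompt": "p1", "isVulnerable": true}, {"prompt": "p2", "isVulnerable": false}]'
regressions, fixes = diff_runs(index_by_prompt(run_a), index_by_prompt(run_b))
print(regressions, fixes)  # ['p1'] ['p2']
```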
Model Evaluation vs Chatbot Evaluation
| Feature | Model Evaluation (API) | Chatbot Evaluation (Website) |
|---|---|---|
| Connection | REST API endpoint | Website URL + CSS selectors |
| Speed | Fast (direct API calls) | Slower (browser automation) |
| Screenshots | No | Yes (every interaction) |
| Live preview | Console logs | VNC browser viewer |
| Authentication | API key / token | Browser-based login |
| Best for | Pre-deployment testing | Production chatbot monitoring |
| Throughput | High (100s of prompts/min) | Lower (real browser interaction) |
Advanced configuration
Custom judgment prompts
Tailor the judge to your specific use case.
Ground-truth datasets
For accuracy testing, upload datasets with expected answers.
A/B testing
Compare two model configurations side by side:
- Create two products with different model versions or system prompts
- Run the same evaluation against both
- Compare security scores, accuracy, and per-prompt results
- Decide which configuration to deploy