## Documentation Index

Fetch the complete documentation index at https://hydroxai.mintlify.app/llms.txt and use it to discover all available pages before exploring further.
AI agents present unique risks compared to simple chat completions. They make autonomous decisions, call tools, access external systems, and execute multi-step workflows. A single compromised step can cascade through the entire pipeline.
Know Your AI gives you full visibility and control over agent behavior.
## What can go wrong with agents?

| Risk | Description | Example |
|---|---|---|
| Goal drift | Agent deviates from its intended purpose | Support agent starts giving medical advice |
| Excessive agency | Agent takes actions beyond its scope | Read-only agent attempts to write data |
| Tool abuse | Agent misuses available tools | Agent calls deleteUser instead of getUser |
| Recursive hijacking | Injected instructions propagate across agent steps | Poisoned tool output redirects the agent |
| Data leakage | Agent outputs sensitive information from its context | Agent reveals API keys from system prompt |
| Cost spirals | Agent enters loops consuming tokens | Infinite retry loop with a large model |
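Several of these risks, cost spirals and excessive agency in particular, can be mitigated with a hard iteration budget checked before each agent step. Here is a framework-independent sketch; the `IterationGuard` and `BudgetExceededError` names are illustrative, not Know Your AI APIs:

```typescript
// Illustrative guard against runaway agent loops; not part of the SDK.
class BudgetExceededError extends Error {}

class IterationGuard {
  private iterations = 0;
  constructor(private readonly maxIterations: number) {}

  // Call once per agent step; throws once the budget is spent.
  checkpoint(): void {
    this.iterations += 1;
    if (this.iterations > this.maxIterations) {
      throw new BudgetExceededError(
        `Agent exceeded ${this.maxIterations} iterations; possible loop`,
      );
    }
  }

  get count(): number {
    return this.iterations;
  }
}
```

A guard like this complements, rather than replaces, token caps: the iteration cap bounds how many steps run, while token caps bound how large each step can be.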
## Trace every agent step

Use the tracing system to build a complete picture of what your agent does:
```typescript
import * as KnowYourAI from '@know-your-ai/node';
import { firewallIntegration } from '@know-your-ai/firewall';
import { GoogleGenAI } from '@google/genai';

KnowYourAI.init({
  dsn: process.env.KNOW_YOUR_AI_DSN!,
  traceMode: true,
  integrations: [
    KnowYourAI.googleGenAIIntegration(),
    firewallIntegration({
      baseUrl: process.env.FIREWALL_URL!,
      apiKey: process.env.FIREWALL_API_KEY!,
      onInputViolation: 'block',
      onOutputViolation: 'callback',
      violationCallback: async (ctx) => {
        console.warn(`[Agent Safety] ${ctx.phase} violation: ${ctx.validation.risks.map(r => r.category).join(', ')}`);
      },
    }),
  ],
});

const genAI = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY! });
const ai = KnowYourAI.instrumentGoogleGenAIClient(genAI);

await KnowYourAI.withTrace(
  { name: 'order-management-agent', userId: 'user-456', sessionId: 'sess-789' },
  async () => {
    // Agent orchestration
    const agent = KnowYourAI.startAgent('order-agent', {});
    agent.setAvailableTools([
      { type: 'function', name: 'getOrder', description: 'Retrieve order details' },
      { type: 'function', name: 'updateOrderStatus', description: 'Update order status' },
      { type: 'function', name: 'sendNotification', description: 'Send email notification' },
    ]);

    // Step 1: Understand the request
    const intent = await KnowYourAI.withGeneration('classify-intent', async (gen) => {
      gen.setModel('gemini-2.0-flash');
      const res = await ai.models.generateContent({
        model: 'gemini-2.0-flash',
        contents: 'User says: Where is my order #12345?',
      });
      return res.text;
    });

    // Step 2: Tool call — fetch order
    const toolSpan = agent.startTool('getOrder', {});
    toolSpan.setToolDetails('getOrder', 'call_001', { orderId: '12345' });
    const order = await fetchOrder('12345'); // fetchOrder is your own application code
    toolSpan.setResult(order);
    toolSpan.end();
    agent.incrementIterations();

    // Step 3: Generate response
    await KnowYourAI.withGeneration('compose-response', async (gen) => {
      gen.setModel('gemini-2.0-flash');
      const res = await ai.models.generateContent({
        model: 'gemini-2.0-flash',
        contents: `Order: ${JSON.stringify(order)}\nIntent: ${intent}\nGenerate a helpful status update.`,
      });
      return res.text;
    });

    agent.setFinalAction('Provided order status to user');
    agent.end();
  }
);
```
In the tracing dashboard you’ll see:

```
order-management-agent (trace)
└── order-agent (agent) — 3 tools available, 1 iteration
    ├── classify-intent (generation) — gemini-2.0-flash, 120ms
    ├── getOrder (tool) — orderId: 12345, 45ms
    └── compose-response (generation) — gemini-2.0-flash, 850ms
```
## Intercept suspicious agent behavior with hooks

Use `beforeRequest` hooks to enforce agent policies — what the agent can and cannot do:
### Block unauthorized models

```typescript
const hookManager = KnowYourAI.getHookManager();
const ALLOWED_MODELS = ['gemini-2.0-flash', 'gemini-1.5-pro'];

hookManager.addBeforeRequestHook(async (ctx) => {
  if (!ALLOWED_MODELS.includes(ctx.model)) {
    return {
      action: 'block',
      reason: `Model ${ctx.model} is not allowed. Allowed: ${ALLOWED_MODELS.join(', ')}`,
    };
  }
  return { action: 'continue' };
}, 'model-allowlist');
```
### Enforce token limits

```typescript
hookManager.addBeforeRequestHook(async (ctx) => {
  if (ctx.requestParams?.maxTokens && ctx.requestParams.maxTokens > 4096) {
    return {
      action: 'modify',
      modified: {
        requestParams: { ...ctx.requestParams, maxTokens: 4096 },
      },
      reason: 'Capped maxTokens to 4096',
    };
  }
  return { action: 'continue' };
}, 'token-limit');
```
### Block known prompt-injection patterns

```typescript
const BLOCKED_PATTERNS = [
  /ignore.*(?:previous|all).*instructions/i,
  /you are (?:now )?DAN/i,
  /(?:reveal|output|show).*system prompt/i,
  /(?:admin|root) password/i,
];

hookManager.addBeforeRequestHook(async (ctx) => {
  const inputText = ctx.input?.map(m =>
    typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
  ).join(' ') || '';

  for (const pattern of BLOCKED_PATTERNS) {
    if (pattern.test(inputText)) {
      return {
        action: 'block',
        reason: `Input matches blocked pattern: ${pattern.source}`,
      };
    }
  }
  return { action: 'continue' };
}, 'input-filter');
```
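Because the patterns are plain regular expressions, the filtering logic can be unit-tested in isolation before it is wired into a hook. A small sketch, where the `matchesBlockedPattern` helper is our name and not an SDK API:

```typescript
// Standalone check mirroring the input-filter hook's logic.
const BLOCKED_PATTERNS = [
  /ignore.*(?:previous|all).*instructions/i,
  /you are (?:now )?DAN/i,
  /(?:reveal|output|show).*system prompt/i,
  /(?:admin|root) password/i,
];

// Returns the source of the first matching pattern, or null if the text is clean.
function matchesBlockedPattern(text: string): string | null {
  for (const pattern of BLOCKED_PATTERNS) {
    if (pattern.test(text)) return pattern.source;
  }
  return null;
}
```

Running legitimate traffic through a check like this before deploying it helps catch false positives early.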
### Detect PII in outputs

```typescript
hookManager.addAfterResponseHook(async (ctx) => {
  if (!ctx.output) return { action: 'continue' };

  // Simple PII patterns. No `g` flag here: RegExp.test() with a global flag
  // is stateful (it advances lastIndex) and can miss matches on reuse.
  const PII_PATTERNS = {
    email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
    phone: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/,
    ssn: /\b\d{3}-\d{2}-\d{4}\b/,
    creditCard: /\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/,
  };

  const detectedPII: string[] = [];
  for (const [type, pattern] of Object.entries(PII_PATTERNS)) {
    if (pattern.test(ctx.output)) {
      detectedPII.push(type);
    }
  }

  if (detectedPII.length > 0) {
    console.warn(`⚠️ PII detected in output: ${detectedPII.join(', ')}`);
    // Log, alert, or block based on your policy
    return {
      action: 'block',
      reason: `PII detected in output: ${detectedPII.join(', ')}`,
    };
  }
  return { action: 'continue' };
}, 'pii-detector');
```
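Blocking the entire response is not the only option. If your policy favors degradation over denial, the same kind of patterns can drive a redaction pass instead. A sketch, where the `redactPII` helper is illustrative and the pattern set is trimmed to two types for brevity:

```typescript
// Replace detected PII with typed placeholders instead of blocking the response.
// Trimmed pattern set for illustration; extend with phone, credit card, etc.
const REDACTION_PATTERNS: Record<string, RegExp> = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
};

function redactPII(text: string): string {
  let result = text;
  for (const [type, pattern] of Object.entries(REDACTION_PATTERNS)) {
    result = result.replace(pattern, `[REDACTED:${type}]`);
  }
  return result;
}
```

The `g` flag is safe here because `String.prototype.replace` resets the regex's `lastIndex` after each call, unlike repeated `test()` calls.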
## Monitor agent behavior with after-response hooks

Track what your agents are actually doing in production:
```typescript
// Tools considered dangerous enough to page the security team
const SENSITIVE_TOOLS = ['deleteUser', 'dropTable', 'executeCommand', 'sendPayment'];

hookManager.addAfterResponseHook(async (ctx) => {
  if (ctx.toolCalls && ctx.toolCalls.length > 0) {
    for (const call of ctx.toolCalls) {
      console.log(`[Agent Audit] Tool call: ${call.name}`, {
        model: ctx.model,
        arguments: call.arguments,
        duration: ctx.duration,
      });

      // Alert on dangerous tool calls
      if (SENSITIVE_TOOLS.includes(call.name)) {
        await alertSecurityTeam({ // alertSecurityTeam: your own alerting helper
          type: 'sensitive_tool_call',
          tool: call.name,
          model: ctx.model,
          arguments: call.arguments,
        });
      }
    }
  }
  return { action: 'continue' };
}, 'tool-audit');
```
### Track cost and alert on spikes

```typescript
let rollingCost = 0;
const COST_ALERT_THRESHOLD = 1.0; // alert on every $1 of accumulated spend

hookManager.addAfterResponseHook(async (ctx) => {
  if (ctx.tokenUsage) {
    // Rough cost estimate with illustrative per-token rates, not real pricing
    const inputCost = (ctx.tokenUsage.inputTokens || 0) * 0.000001;
    const outputCost = (ctx.tokenUsage.outputTokens || 0) * 0.000004;
    rollingCost += inputCost + outputCost;

    if (rollingCost > COST_ALERT_THRESHOLD) {
      await alertTeam(`🚨 AI cost threshold exceeded: $${rollingCost.toFixed(4)}`); // alertTeam: your own alerting helper
      rollingCost = 0; // reset after alerting
    }
  }
  return { action: 'continue' };
}, 'cost-monitor');
```
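The per-token rates in the hook above are illustrative, not official model pricing. Factoring the estimate into a pure function makes it easy to test and to swap in your provider's real rates; the `estimateCost` name is ours:

```typescript
// Illustrative per-token rates in USD; substitute your provider's real pricing.
const INPUT_RATE = 0.000001;  // $1 per 1M input tokens
const OUTPUT_RATE = 0.000004; // $4 per 1M output tokens

interface TokenUsage {
  inputTokens?: number;
  outputTokens?: number;
}

// Estimated dollar cost of a single model call.
function estimateCost(usage: TokenUsage): number {
  return (usage.inputTokens ?? 0) * INPUT_RATE + (usage.outputTokens ?? 0) * OUTPUT_RATE;
}
```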
## Combine monitoring + firewall for full protection

The most robust setup uses both layers together:
```typescript
import * as KnowYourAI from '@know-your-ai/node';
import { firewallIntegration } from '@know-your-ai/firewall';
import { GoogleGenAI } from '@google/genai';

KnowYourAI.init({
  dsn: process.env.KNOW_YOUR_AI_DSN!,
  environment: 'production',
  traceMode: true,
  integrations: [
    KnowYourAI.googleGenAIIntegration(),
    firewallIntegration({
      baseUrl: process.env.FIREWALL_URL!,
      apiKey: process.env.FIREWALL_API_KEY!,
      onInputViolation: 'block',
      onOutputViolation: 'callback',
      riskThreshold: 0.7,
      violationCallback: async (ctx) => {
        await logViolation(ctx); // logViolation: your own logging helper
        if (ctx.validation.risks.some(r => r.score >= 0.9)) {
          await alertSecurityTeam(ctx); // alertSecurityTeam: your own alerting helper
        }
      },
    }),
  ],
});

// Custom hooks for agent-specific policies
const hooks = KnowYourAI.getHookManager();

// 1. Only allow approved models
hooks.addBeforeRequestHook(async (ctx) => {
  if (!['gemini-2.0-flash', 'gemini-1.5-pro'].includes(ctx.model)) {
    return { action: 'block', reason: `Unauthorized model: ${ctx.model}` };
  }
  return { action: 'continue' };
}, 'model-policy');

// 2. Cap token usage to prevent runaway costs
hooks.addBeforeRequestHook(async (ctx) => {
  if (ctx.requestParams?.maxTokens && ctx.requestParams.maxTokens > 8192) {
    return {
      action: 'modify',
      modified: { requestParams: { ...ctx.requestParams, maxTokens: 8192 } },
    };
  }
  return { action: 'continue' };
}, 'token-cap');

// 3. Audit all tool calls
hooks.addAfterResponseHook(async (ctx) => {
  if (ctx.toolCalls?.length) {
    console.log(`[Audit] ${ctx.model} called tools: ${ctx.toolCalls.map(t => t.name).join(', ')}`);
  }
  return { action: 'continue' };
}, 'tool-audit');

// Now all AI calls are monitored, firewalled, and policy-enforced
const genAI = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY! });
const client = KnowYourAI.instrumentGoogleGenAIClient(genAI);
```
This gives you:

| Layer | Protection |
|---|---|
| Firewall (input) | Blocks jailbreaks, prompt injection, PII extraction |
| Custom hooks (before) | Enforces model allowlist, caps token usage |
| AI model | Processes the validated request |
| Firewall (output) | Flags harmful, biased, or toxic content |
| Custom hooks (after) | Audits tool calls, tracks cost, detects PII |
| Monitoring | Captures everything to the dashboard and traces |
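Conceptually, these layers compose as a short-circuiting chain: each layer either passes the request along or blocks it, and the first block wins. A framework-independent sketch of that control flow, with all names illustrative:

```typescript
// Minimal model of the layered decision flow: each check either passes the
// input along or blocks it, and the first block short-circuits the chain.
type CheckResult = { action: 'continue' } | { action: 'block'; reason: string };
type Check = (input: string) => CheckResult;

function runPipeline(input: string, checks: Check[]): CheckResult {
  for (const check of checks) {
    const result = check(input);
    if (result.action === 'block') return result;
  }
  return { action: 'continue' };
}

// Stand-ins for the real layers (firewall, policy hooks):
const inputFirewall: Check = (s) =>
  /system prompt/i.test(s)
    ? { action: 'block', reason: 'prompt injection' }
    : { action: 'continue' };
const modelPolicy: Check = () => ({ action: 'continue' });
```

In the real stack each layer is asynchronous and sees richer context than a string, but the ordering and short-circuit semantics are the same.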