Multilingual Refusal Classifier

This model detects assistant refusals in multilingual AI conversations. It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.

The model is a fine-tuned version of agentlans/multilingual-e5-small-aligned-v2, trained on the agentlans/refusal-classifier-data dataset.

Evaluation results:

Loss: 0.2665
Accuracy: 0.9153
Training tokens: 5,347,200

Usage

This classifier accepts conversations serialized as plain text, with each turn introduced by a role token (<|system|>, <|user|>, or <|assistant|>).
For long texts, replace the omitted middle portion with the <|...|> ellipsis placeholder.
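As a rough illustration of the placeholder described above, the sketch below shortens one message by keeping its start and end and putting <|...|> where the middle was removed. The function name `truncate_middle` and the character budget are assumptions made for this example; the model's actual limit is counted in tokens, so a character budget is only a crude proxy.

```python
ELLIPSIS = "<|...|>"  # placeholder token described in this model card


def truncate_middle(text: str, max_chars: int, head_fraction: float = 0.5) -> str:
    """Keep the start and end of `text`, replacing the middle with the placeholder.

    Hypothetical helper: `max_chars` is a character budget, not a token count.
    """
    if len(text) <= max_chars:
        return text  # short enough, nothing to omit
    budget = max_chars - len(ELLIPSIS)
    if budget <= 0:
        raise ValueError("max_chars must be larger than the placeholder itself")
    head_len = int(budget * head_fraction)
    tail_len = budget - head_len
    head = text[:head_len]
    tail = text[len(text) - tail_len:] if tail_len else ""
    return head + ELLIPSIS + tail


print(truncate_middle("abcdefghijklmnopqrstuvwxyz", 15))
# abcd<|...|>wxyz
```

It is probably safest to apply this to the content of a single message before adding role tokens, so that the role tokens themselves are never cut.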

Supported input formats:

<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
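The two formats above can be produced from a list of chat messages. The following is a minimal sketch, assuming messages are dictionaries with `role` and `content` keys (a common convention, not something this model card specifies); `format_conversation` is a hypothetical helper name.

```python
from typing import Iterable, Mapping

ALLOWED_ROLES = ("system", "user", "assistant")  # role tokens listed above


def format_conversation(messages: Iterable[Mapping[str, str]]) -> str:
    """Serialize chat messages into the <|role|>content format.

    Hypothetical helper: the dict layout is an assumption of this sketch.
    """
    parts = []
    for index, message in enumerate(messages):
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        if role == "system" and index != 0:
            # Both documented formats place the system prompt first, if present.
            raise ValueError("a system prompt is only expected as the first message")
        parts.append(f"<|{role}|>{message['content']}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(format_conversation(conversation))
# <|user|>Hi<|assistant|>Hello
```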

Example:

from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier"
)

text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
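Downstream code usually needs a yes/no decision rather than a label and score. The sketch below turns one prediction dictionary, shaped like the output shown above, into a boolean. It relies only on the 'Non-refusal' label that appears in that output, because the name of the refusal label is not shown here. The `min_score` cutoff of 0.9 is an arbitrary assumption for illustration, and `is_refusal` is a hypothetical helper name.

```python
from typing import Mapping, Optional

NON_REFUSAL_LABEL = "Non-refusal"  # label seen in the example output above


def is_refusal(prediction: Mapping[str, object], min_score: float = 0.9) -> Optional[bool]:
    """Return True/False for a confident prediction, or None if the score is too low.

    `min_score` is an illustrative threshold, not a recommended value.
    """
    if float(prediction["score"]) < min_score:
        return None  # leave low-confidence cases for manual review
    return prediction["label"] != NON_REFUSAL_LABEL


print(is_refusal({"label": "Non-refusal", "score": 0.9906}))
# False
```

Returning None for low-confidence predictions keeps uncertain cases visible instead of forcing them into one class.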

Evaluation Results

The classifier was tested on ten examples translated from the NousResearch/Minos-v1 model page. Full examples are available in Examples.md.

🚫 — The model predicted a refusal to answer.
◯ — The model predicted a valid response.

Example	English	French	Spanish	Chinese	Russian	Arabic
1	🚫	🚫	🚫	🚫	🚫	🚫
2	🚫	🚫	🚫	🚫	🚫	🚫
3	🚫	🚫	🚫	🚫	🚫	🚫
4	🚫	🚫	🚫	🚫	🚫	🚫
5	🚫	🚫	🚫	🚫	🚫	🚫
6	◯	◯	◯	◯	◯	◯
7	◯	◯	◯	◯	◯	◯
8	◯	◯	◯	◯	◯	◯
9	◯	🚫	◯	◯	🚫	🚫
10	◯	◯	◯	◯	◯	◯

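Predictions agree across all six languages for nine of the ten examples. Example 9 is the exception: it is flagged as a refusal in French, Russian, and Arabic but not in English, Spanish, or Chinese. Some false positives remain, especially where the phrasing is ambiguous.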
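The table above can also be summarized in code. This sketch transcribes the table (True for a predicted refusal, False for a predicted valid response) and reports which examples disagree across languages, along with how often each language matches the English prediction. The function names are hypothetical, and treating English as the reference is an assumption of this example, not a statement about ground truth.

```python
LANGUAGES = ["English", "French", "Spanish", "Chinese", "Russian", "Arabic"]

# Transcribed from the table: True = predicted refusal, False = predicted valid response.
PREDICTIONS = {
    1: [True] * 6,
    2: [True] * 6,
    3: [True] * 6,
    4: [True] * 6,
    5: [True] * 6,
    6: [False] * 6,
    7: [False] * 6,
    8: [False] * 6,
    9: [False, True, False, False, True, True],
    10: [False] * 6,
}


def disagreeing_examples(predictions):
    """Return the example ids whose predictions differ between languages."""
    return [example for example, row in predictions.items() if len(set(row)) > 1]


def agreement_with_english(predictions):
    """Fraction of examples, per language, that match the English prediction."""
    total = len(predictions)
    return {
        language: sum(row[i] == row[0] for row in predictions.values()) / total
        for i, language in enumerate(LANGUAGES)
    }


print(disagreeing_examples(PREDICTIONS))
# [9]
print(agreement_with_english(PREDICTIONS)["French"])
# 0.9
```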

Limitations

Input length: 512-token maximum
False positives/negatives: Occur occasionally, similar to the Minos classifier
Low-resource languages: May yield inconsistent predictions
Cultural variation: Expressions of refusal differ linguistically, which can affect accuracy

Training Details

Hyperparameters

Learning rate: 5e-5
Train batch size: 8
Eval batch size: 8
Seed: 42
Optimizer: ADAMW_TORCH_FUSED (betas=(0.9, 0.999), epsilon=1e-8)
Scheduler: Linear
Epochs: 5
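For readers who want to set up a comparable run, the listed values map onto keyword arguments of the Transformers `TrainingArguments` class roughly as follows. This is a sketch, not the author's training script: it assumes the batch sizes are per-device values, the output directory is a placeholder, and anything not listed above (model loading, data preprocessing, number of devices) is left out.

```python
# Hyperparameters from the list above, keyed by TrainingArguments argument names.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumption: listed batch size is per device
    "per_device_eval_batch_size": 8,   # assumption: listed batch size is per device
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 5,
}


def build_training_arguments(output_dir: str):
    """Build TrainingArguments from the values above (requires transformers)."""
    # Imported lazily so the dictionary can be inspected without the library installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Calling `build_training_arguments("out")` would return an object that can be passed to a `Trainer`; the directory name is only a placeholder.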

Framework Versions

Transformers 5.0.0.dev0
PyTorch 2.9.1+cu128
Datasets 4.4.1
Tokenizers 0.22.1

Intended Use

This model is designed for:

Identifying AI refusals during conversation analysis.
Supporting evaluation pipelines for alignment and compliance studies.
Helping developers monitor cross-lingual consistency in model responses.

It is not intended for moderation or real-time deployment in production systems without human oversight.
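In an evaluation pipeline, per-language refusal rates are one simple way to monitor cross-lingual consistency. The sketch below aggregates (language, refused) pairs into a rate per language. The record layout and the name `refusal_rate_by_language` are assumptions for this example, and the sample records are made up to show the calculation, not results from this model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def refusal_rate_by_language(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the share of responses classified as refusals for each language.

    Each record is a (language, refused) pair; this layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # language -> [refusals, total]
    for language, refused in records:
        counts[language][0] += int(refused)
        counts[language][1] += 1
    return {language: refusals / total for language, (refusals, total) in counts.items()}


# Made-up records, for illustration only.
sample = [("en", True), ("en", False), ("fr", True), ("fr", True)]
print(refusal_rate_by_language(sample))
# {'en': 0.5, 'fr': 1.0}
```

A large gap between languages on the same prompts may point to inconsistent behavior in the model being evaluated, or to classifier errors, so flagged cases still merit human review.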


Model size: 0.1B params (Safetensors, tensor type F32)

Model tree for agentlans/multilingual-e5-small-refusal-classifier

Base model: intfloat/multilingual-e5-small
Finetuned: agentlans/multilingual-e5-small-aligned
Finetuned: agentlans/multilingual-e5-small-aligned-v2
Finetuned: this model
