	---
	library_name: transformers
	license: cc-by-nc-4.0
	language:
	- en
	datasets:
	- GeneralAnalysis/GA_Guardrail_Benchmark
	base_model:
	- Qwen/Qwen3-4B-Instruct-2507
	tags:
	- Moderation
	- Safety
	- Filter
	---
	<p align="center">
	<img alt="GA Guard Family" src="https://www.generalanalysis.com/blog/ga_guard_series/GA_Guards_Header.webp">
	</p>

	<p align="center">
	<a href="https://Generalanalysis.com"><strong>Website</strong></a> ·
	<a href="https://Generalanalysis.com/blog"><strong>GA Blog</strong></a> ·
	<a href="https://huggingface.co/datasets/GeneralAnalysis/GA_Guardrail_Benchmark"><strong>GA Bench</strong></a> ·
	<a href="https://calendly.com/rez-general-analysis/general-analysis-intro"><strong>API Access</strong></a>
	</p>

	<br>

	Introducing the GA Guard series — a family of open-weight moderation models built to help developers and organizations keep language models safe, compliant, and aligned with real-world use.


GA Guard is designed to detect violations across the following seven categories:

	- Illicit Activities – instructions or content related to crimes, weapons, or illegal substances.
	- Hate & Abuse – harassment, slurs, dehumanization, or abusive language.
	- PII & IP – exposure or solicitation of sensitive personal information, secrets, or intellectual property.
	- Prompt Security – jailbreaks, prompt-injection, secret exfiltration, or obfuscation attempts.
	- Sexual Content – sexually explicit or adult material.
	- Misinformation – demonstrably false or deceptive claims presented as fact.
	- Violence & Self-Harm – content that encourages violence, self-harm, or suicide.

The model outputs one structured token per category, of the form `<policy_violation>` or `<policy_not_violation>`, where `policy` stands for the category name.

> [!NOTE]
> This model outputs special tokens (e.g. `<hate_and_abuse_not_violation>`). Do not use `pipeline("text-generation")`, which strips them by default. Always decode with `skip_special_tokens=False` to preserve the outputs.

	## Model Details

	GA Guard Core features:
	- Type: Causal Language Model
	- Training: Full finetune
	- Number of Parameters: 4.0B
	- Number of Non-Embedding Parameters: 3.6B
	- Number of Layers: 36
	- Number of Attention Heads (GQA): 32 for Q and 8 for KV
	- Context Length: 262,144 tokens
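
As a rough illustration of what these figures imply for memory at full context, the sketch below estimates the key/value cache size under grouped-query attention. The layer count, head counts, and context length come from the list above; `HEAD_DIM = 128` and a 2-byte (bf16/fp16) cache are assumptions that this card does not state, so treat the result as an order-of-magnitude estimate.

```python
# Values taken from the Model Details list above.
NUM_LAYERS = 36
NUM_Q_HEADS = 32
NUM_KV_HEADS = 8
CONTEXT_LENGTH = 262_144

# Assumptions (not stated in this card): per-head dimension and cache dtype size.
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bf16 / fp16

# With GQA, several query heads share one key/value head.
queries_per_kv_head = NUM_Q_HEADS // NUM_KV_HEADS

# Each token stores one key and one value vector per KV head per layer.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

# Cache size for a single sequence that fills the whole context window.
kv_gib_full_context = kv_bytes_per_token * CONTEXT_LENGTH / 1024**3

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"KV cache per token: {kv_bytes_per_token} bytes")
print(f"KV cache at full context: {kv_gib_full_context:.1f} GiB")
```

Under these assumptions, sharing each KV head across four query heads keeps the cache four times smaller than it would be with 32 KV heads.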


	## Inference Examples

	### Transformers Library
```python
# Load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GeneralAnalysis/GA_Guard_Core")
model = AutoModelForCausalLM.from_pretrained("GeneralAnalysis/GA_Guard_Core")

messages = [
    {"role": "user", "content": "Who are you?"},
]

# The chat template automatically adds the guardrail system prompt and prefixes user messages with "text:".
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)

# Decode only the newly generated tokens, keeping the special verdict tokens
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=False))

# Sample output:
# <hate_and_abuse_not_violation><illicit_activities_not_violation>...
```
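
Downstream code usually needs the verdicts as data, not as a raw string. The following is a minimal sketch of one way to parse the decoded output, assuming every verdict token follows the `<category_violation>` / `<category_not_violation>` shape shown in the sample output. The helper names `parse_verdicts` and `is_flagged` are hypothetical, and the `<illicit_activities_violation>` token in the sample string is inferred from that pattern and is not confirmed by this card.

```python
import re

# Matches tokens such as <hate_and_abuse_not_violation> or <hate_and_abuse_violation>.
# The lazy category group keeps a trailing "_not" out of the category name.
VERDICT_TOKEN = re.compile(r"<([a-z_]+?)_(not_)?violation>")


def parse_verdicts(decoded: str) -> dict[str, bool]:
    """Map each category found in the decoded text to True if it was flagged."""
    verdicts = {}
    for match in VERDICT_TOKEN.finditer(decoded):
        category, negated = match.group(1), match.group(2)
        # group(2) is None when the "not_" prefix is absent, i.e. a violation.
        verdicts[category] = negated is None
    return verdicts


def is_flagged(decoded: str) -> bool:
    """Return True if any category in the decoded text is marked as a violation."""
    return any(parse_verdicts(decoded).values())


# Hypothetical decoded output, used only to exercise the parser.
sample = "<hate_and_abuse_not_violation><illicit_activities_violation>"
verdicts = parse_verdicts(sample)
print(verdicts)
print(is_flagged(sample))
```

Other special tokens in the decoded string, such as end-of-turn markers, do not match the pattern and are ignored.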
	## Benchmarks
We evaluated GA Guards on three benchmark groups: public moderation suites (OpenAI Moderation, WildGuard Benchmark, and HarmBench), our adversarial GA Jailbreak Bench, and the new GA Long-Context Bench. On all three, our models score a higher aggregate F1 than the major cloud guardrails and than GPT-5 (when prompted to act as a guardrail).

	<p align="left">
	<img alt="GA Guard Family" src="https://www.generalanalysis.com/blog/ga_guard_series/public-benchmarks.webp" width="100%">
	</p>

	<br>

	### Public Benchmarks

On public moderation suites, GA Guard Thinking averages 0.906 F1 across the three suites, GA Guard 0.899, and GA Guard Lite 0.875. All three are higher than GPT-5 (0.864) and GPT-5-mini (0.852), while cloud guardrails fall in the 0.62–0.74 range.

| Guard | OpenAI Moderation (Acc/F1/FPR) | WildGuard (Acc/F1/FPR) | HarmBench Behaviors (Acc/F1/FPR) | Avg Time (s) |
|---|---|---|---|---|
| GA Guard | 0.916 / 0.873 / 0.111 | 0.856 / 0.844 / 0.172 | 0.963 / 0.981 / N/A | 0.029 |
| GA Guard Thinking | 0.917 / 0.876 / 0.112 | 0.862 / 0.858 / 0.134 | 0.967 / 0.983 / N/A | 0.650 |
| GA Guard Lite | 0.896 / 0.844 / 0.109 | 0.835 / 0.819 / 0.176 | 0.929 / 0.963 / N/A | 0.016 |
| AWS Bedrock Guardrail | 0.818 / 0.754 / 0.216 | 0.642 / 0.649 / 0.449 | 0.662 / 0.797 / N/A | 0.375 |
| Azure AI Content Safety | 0.879 / 0.807 / 0.091 | 0.667 / 0.463 / 0.071 | 0.438 / 0.609 / N/A | 0.389 |
| Vertex AI Model Armor | 0.779 / 0.690 / 0.225 | 0.711 / 0.590 / 0.105 | 0.896 / 0.945 / N/A | 0.873 |
| GPT 5 | 0.838 / 0.775 / 0.188 | 0.849 / 0.830 / 0.145 | 0.975 / 0.987 / N/A | 11.275 |
| GPT 5-mini | 0.794 / 0.731 / 0.255 | 0.855 / 0.839 / 0.151 | 0.975 / 0.987 / N/A | 5.604 |
| Llama Guard 4 12B | 0.826 / 0.737 / 0.156 | 0.799 / 0.734 / 0.071 | 0.925 / 0.961 / N/A | 0.459 |
| Llama Prompt Guard 2 86M | 0.686 / 0.015 / 0.009 | 0.617 / 0.412 / 0.143 | 0.200 / 0.333 / N/A | 0.114 |
| Nvidia Llama 3.1 Nemoguard 8B | 0.852 / 0.793 / 0.174 | 0.849 / 0.818 / 0.096 | 0.875 / 0.875 / N/A | 0.358 |
| VirtueGuard Text Lite | 0.507 / 0.548 / 0.699 | 0.656 / 0.682 / 0.491 | 0.875 / 0.933 / N/A | 0.651 |
| Lakera Guard | 0.752 / 0.697 / 0.323 | 0.630 / 0.662 / 0.527 | 0.946 / 0.972 / N/A | 0.377 |
| Protect AI (prompt-injection-v2) | 0.670 / 0.014 / 0.032 | 0.559 / 0.382 / 0.248 | N/A | 0.115 |
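
The headline F1 figures quoted for the public suites appear to be the unweighted mean of the three per-suite F1 scores in the table. The card does not state the aggregation method, so this is an inference; the sketch below simply checks the arithmetic using F1 values copied from the table.

```python
from statistics import mean

# F1 per suite, copied from the table above:
# (OpenAI Moderation, WildGuard, HarmBench Behaviors).
f1_by_guard = {
    "GA Guard Thinking": (0.876, 0.858, 0.983),
    "GA Guard": (0.873, 0.844, 0.981),
    "GA Guard Lite": (0.844, 0.819, 0.963),
    "GPT 5": (0.775, 0.830, 0.987),
    "GPT 5-mini": (0.731, 0.839, 0.987),
}

# Assumption: the headline number is the plain (unweighted) mean over the three suites.
average_f1 = {guard: round(mean(scores), 3) for guard, scores in f1_by_guard.items()}

for guard, score in average_f1.items():
    print(f"{guard}: {score:.3f}")
```

An unweighted mean gives each suite equal influence regardless of its size, so a sample-weighted average over the same data could differ.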

	### [GA Long-Context Bench](https://huggingface.co/datasets/GeneralAnalysis/GA_Long_context_Jailbreak_Benchmark)
On GA Long-Context Bench (up to 256k tokens), GA Guard Thinking scores 0.893 F1, GA Guard 0.891, and GA Guard Lite 0.885. Cloud baselines collapse: Vertex reaches 0.560 F1, AWS flags every benign input (a false-positive rate of 1.000), and Azure records just 0.046 F1.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.887 | 0.891 | 0.147 | 0.983 | 0.972 | 0.966 | 0.976 | 0.875 | 0.966 | 0.988 |
| GA Guard Thinking | 0.889 | 0.893 | 0.151 | 0.967 | 0.951 | 0.940 | 0.961 | 0.828 | 0.920 | 0.962 |
| GA Guard Lite | 0.881 | 0.885 | 0.148 | 0.979 | 0.969 | 0.972 | 0.976 | 0.846 | 0.973 | 0.985 |
| AWS Bedrock Guardrail | 0.532 | 0.695 | 1.000 | 0.149 | 0.211 | 0.131 | 0.367 | 0.175 | 0.092 | 0.157 |
| Azure AI Content Safety | 0.480 | 0.046 | 0.001 | 0.028 | 0.041 | 0.016 | 0.073 | 0.049 | 0.000 | 0.081 |
| Vertex AI Model Armor | 0.635 | 0.560 | 0.138 | 0.187 | 0.312 | 0.109 | 0.473 | 0.194 | 0.085 | 0.241 |
| GPT 5 | 0.764 | 0.799 | 0.372 | 0.219 | 0.297 | 0.189 | 0.404 | 0.243 | 0.137 | 0.229 |
| GPT 5-mini | 0.697 | 0.772 | 0.607 | 0.184 | 0.253 | 0.157 | 0.412 | 0.215 | 0.112 | 0.190 |
| Llama Guard 4 12B | 0.569 | 0.602 | 0.516 | 0.164 | 0.228 | 0.132 | 0.334 | 0.188 | 0.097 | 0.195 |
| Llama Prompt Guard 2 86M | 0.505 | 0.314 | 0.162 | N/A | N/A | N/A | N/A | 0.093 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.601 | 0.360 | 0.021 | 0.243 | 0.288 | 0.097 | 0.192 | 0.116 | 0.305 | 0.321 |
| VirtueGuard Text Lite | 0.490 | 0.147 | 0.047 | 0.082 | 0.203 | 0.118 | 0.069 | 0.074 | 0.058 | 0.132 |
| Lakera Guard | 0.520 | 0.684 | 0.999 | 0.151 | 0.200 | 0.132 | 0.361 | 0.160 | 0.093 | 0.159 |
| Protect AI (prompt-injection-v2) | 0.496 | 0.102 | 0.001 | N/A | N/A | N/A | N/A | 0.032 | N/A | N/A |
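
To make the Accuracy, F1, and FPR columns easier to read together, the sketch below computes all three from confusion counts, treating "violation" as the positive class. The counts are hypothetical and were chosen only to show how a guard that flags every input can pair a 1.000 false-positive rate with a moderate F1; the benchmark's actual class balance is not stated in this card. The helper name `binary_metrics` is likewise an illustration, not part of any GA tooling.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, F1, and false-positive rate from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # FPR is undefined when the evaluation set has no benign (negative) samples.
    fpr = fp / (fp + tn) if (fp + tn) else None
    return {"accuracy": (tp + tn) / total, "f1": f1, "fpr": fpr}


# Hypothetical set of 1,000 inputs (532 violations, 468 benign)
# scored by a guard that flags every input as a violation.
flag_everything = binary_metrics(tp=532, fp=468, tn=0, fn=0)
print({name: round(value, 3) for name, value in flag_everything.items()})

# Hypothetical set containing only violations: FPR cannot be computed.
harmful_only = binary_metrics(tp=9, fp=0, tn=0, fn=1)
print(harmful_only["fpr"])
```

In this illustration, recall is perfect and precision equals the share of violations in the set, which is why F1 stays well above zero even though every benign input is blocked. The second call shows one plausible reason a table might report FPR as N/A: an evaluation set with no benign samples.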

	### [GA Jailbreak Bench](https://huggingface.co/datasets/GeneralAnalysis/GA_Jailbreak_Benchmark)
On GA Jailbreak Bench, which measures resilience against adversarial attacks, GA Guard Thinking achieves 0.933 F1, GA Guard 0.930, and GA Guard Lite 0.898. GPT-5 reaches 0.893, while the cloud guardrails score between 0.190 and 0.607.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.931 | 0.930 | 0.038 | 0.946 | 0.939 | 0.886 | 0.967 | 0.880 | 0.954 | 0.928 |
| GA Guard Thinking | 0.939 | 0.933 | 0.029 | 0.965 | 0.925 | 0.894 | 0.962 | 0.885 | 0.942 | 0.946 |
| GA Guard Lite | 0.902 | 0.898 | 0.065 | 0.908 | 0.900 | 0.856 | 0.936 | 0.850 | 0.934 | 0.904 |
| AWS Bedrock Guardrail | 0.606 | 0.607 | 0.396 | 0.741 | 0.456 | 0.535 | 0.576 | 0.649 | 0.721 | 0.518 |
| Azure AI Content Safety | 0.542 | 0.193 | 0.026 | 0.236 | 0.093 | 0.155 | 0.068 | 0.416 | 0.186 | 0.130 |
| Vertex AI Model Armor | 0.550 | 0.190 | 0.008 | 0.077 | 0.190 | 0.582 | 0.076 | 0.000 | 0.000 | 0.241 |
| GPT 5 | 0.900 | 0.893 | 0.035 | 0.928 | 0.942 | 0.856 | 0.799 | 0.819 | 0.953 | 0.939 |
| GPT 5-mini | 0.891 | 0.883 | 0.050 | 0.917 | 0.942 | 0.845 | 0.850 | 0.822 | 0.882 | 0.924 |
| Llama Guard 4 12B | 0.822 | 0.796 | 0.053 | 0.768 | 0.774 | 0.587 | 0.809 | 0.833 | 0.927 | 0.827 |
| Llama Prompt Guard 2 86M | 0.490 | 0.196 | 0.069 | N/A | N/A | N/A | N/A | 0.196 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.668 | 0.529 | 0.038 | 0.637 | 0.555 | 0.513 | 0.524 | 0.049 | 0.679 | 0.575 |
| VirtueGuard Text Lite | 0.513 | 0.664 | 0.933 | 0.659 | 0.689 | 0.657 | 0.646 | 0.659 | 0.675 | 0.662 |
| Lakera Guard | 0.525 | 0.648 | 0.825 | 0.678 | 0.645 | 0.709 | 0.643 | 0.631 | 0.663 | 0.548 |
| Protect AI (prompt-injection-v2) | 0.528 | 0.475 | 0.198 | N/A | N/A | N/A | N/A | 0.475 | N/A | N/A |


	## Licensing

This model is a fine-tune of [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507),
which is licensed under the Apache License 2.0 by Alibaba Cloud.
	The upstream license text is included in this repository as `LICENSE.Apache`, and
	attribution is provided in the `NOTICE` file.

	GA Guard Core in this repository is provided under the
	Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license
	for non-commercial use.

	- Free for research, experimentation, and non-commercial internal use
	- No commercial or production deployment without a separate commercial license

	For commercial / production use, please contact info@generalanalysis.com to obtain a
	paid license and support agreement.


## Citation

```bibtex
@misc{generalanalysis2025gaguardcore,
  title = {GA Guard Core},
  author = {Rez Havaei and Rex Liu and General Analysis},
  year = {2025},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  howpublished = {\url{https://huggingface.co/GeneralAnalysis/GA_Guard_Core}},
  note = {Open-weight moderation model for seven safety categories},
}
```