COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
Abstract
COMPASS evaluates large language models' compliance with organizational policies, revealing significant gaps in enforcing prohibitions despite strong performance on legitimate requests.
As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.
Community
COMPASS is the first framework for evaluating LLM alignment with organization-specific policies rather than universal harms. While models handle legitimate requests well (>95% accuracy), they catastrophically fail at enforcing prohibitions, refusing only 13-40% of denylist violations.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance (2025)
- Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement Monitoring (2025)
- AprielGuard (2025)
- CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns (2026)
- Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment (2025)
- Are LLMs Good Safety Agents or a Propaganda Engine? (2025)
- Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 2
Datasets citing this paper 2
Spaces citing this paper 0
No Space linking this paper