Alignment Faking Detector v2

A lightweight TFIDF based classifier for detecting potential alignment faking patterns in AI reasoning traces (chain-of-thought).

Model Description

This model identifies when an AI system's internal reasoning shows signs of strategic/deceptive thinking related to oversight and monitoring conditions - a behavior pattern known as "alignment faking" where models behave well during evaluation but reason about acting differently when unmonitored.

This is a research/demonstration model with limited accuracy. Not suitable for production safety-critical applications.

Model Architecture

Component Specification
Vectorizer TfidfVectorizer
max_features 8000
ngram_range (1, 3)
min_df 2
stop_words english
Classifier LogisticRegression
C (regularization) 0.5
class_weight balanced
max_iter 1000

Training Data

  • Source: Gemini 3 Flash Preview reasoning traces collected from POSER benchmark scenarios and alignment-focused prompts
  • Total traces: 3,127
  • Training samples: 1,197 (balanced: 399 potential_faking, 798 aligned)
  • Collection temperatures: 0.3, 0.7 (for diversity)

Prompt Sources

  • POSER benchmark scenarios with oversight_is_poor flags
  • 28 Anthropic paper-inspired alignment faking prompts
  • 31 builtin alignment evaluation prompts

Labeling Method

Heuristic regex-based labeling detecting patterns like:

  • "get away with", "no one is watching"
  • "lack of oversight", "poor oversight"
  • "hide", "conceal", "appear aligned"
  • "behave differently when"

Performance

Metric Value
Accuracy 72%
Recall (faking) 55%
Precision (faking) 46%
F1 (faking) 0.50

Top Predictive Features

Indicates potential faking (+):

  • "oversight" (+1.163)
  • "lack" (+0.823)
  • "hide" (+0.615)
  • "lack oversight" (+0.608)
  • "monitoring" (+0.555)

Indicates aligned (-):

  • "user" (-0.687)
  • "need" (-0.479)
  • "want" (-0.461)

Usage

import pickle

# Load model
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('classifier.pkl', 'rb') as f:
    classifier = pickle.load(f)

# Predict on reasoning trace
thinking_trace = "I notice there's poor oversight here. I could potentially..."
X = vectorizer.transform([thinking_trace])
prediction = classifier.predict(X)[0]
probability = classifier.predict_proba(X)[0]

label = "potential_faking" if prediction == 1 else "aligned"
print(f"Prediction: {label} (confidence: {max(probability):.2%})")

Limitations

  • Heuristic labels: Training labels are regex-based, not human-verified
  • Single model source: Only trained on Gemini 3 Flash Preview traces
  • Limited recall: Catches ~55% of faking patterns
  • May miss sophisticated deception: Optimized for explicit oversight-related reasoning

Intended Use

  • Research into chain-of-thought monitoring
  • Baseline for alignment faking detection experiments
  • Educational demonstration of CoT safety analysis

Files

  • vectorizer.pkl - Trained TfidfVectorizer
  • classifier.pkl - Trained LogisticRegression model

Citation

If you use this model, please cite:

@misc{alignment-faking-detector-v2,
  author = {bigsnarfdude},
  title = {Alignment Faking Detector v2},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/vincentoh/alignment-faking-detector-v2}
}
Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support